<a href="https://colab.research.google.com/github/Nudrat-Habib/preprocessing-text-data/blob/main/preprocessing_textData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

In [None]:
import nltk #python library for NLP


# 2. About the Twitter Dataset

The sample dataset from NLTK is split into positive and negative tweets, exactly 5000 of each. The counts are equal so that the dataset is balanced. This does not reflect the real distribution of positive and negative classes in live Twitter streams; the balance is there only because balanced datasets simplify the design of most computational methods used for sentiment analysis. It is worth keeping in mind that this class balance is artificial.

The statement `from nltk.corpus import twitter_samples` gives access to three datasets from NLTK that contain various tweets to train and test the model:

   1. negative_tweets.json: 5000 tweets with negative sentiments
   2. positive_tweets.json: 5000 tweets with positive sentiments
   3. tweets.20150430-223406.json: 20000 tweets with no sentiment labels
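
To make the point about artificial balance concrete, the sketch below computes class proportions from a list of labels. The `class_balance` helper is a hypothetical name, and the skewed counts are made up purely for contrast; they are not figures measured on Twitter.

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of examples per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# The NLTK sample: 5000 positive and 5000 negative tweets.
balanced = ["pos"] * 5000 + ["neg"] * 5000
print(class_balance(balanced))   # {'pos': 0.5, 'neg': 0.5}

# A made-up skewed stream, only to contrast with the balanced sample.
skewed = ["pos"] * 700 + ["neg"] * 300
print(class_balance(skewed))     # {'pos': 0.7, 'neg': 0.3}
```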

In [None]:
import nltk
from nltk.corpus import twitter_samples
nltk.download ("twitter_samples") #sample dataset for classification.

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [None]:
positive_tweets=twitter_samples.strings('positive_tweets.json')
negative_tweets=twitter_samples.strings('negative_tweets.json')
print('no of positive tweets',len(positive_tweets))
print('no of negative tweets',len(negative_tweets))

no of positive tweets 5000
no of negative tweets 5000


## 2.1 Printing some tweets

We can print a few tweets to explore the dataset. We can use an index to print a specific tweet, or pick a tweet at random and print it.
We can also color-code the tweets to make them easier to tell apart.
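
As a small sketch of the color-coding idea, the helper below wraps text in an ANSI escape sequence and appends the reset code so that later output is not colored by accident. The `colorize` function and the stand-in tweets are hypothetical; in the notebook the lists would be `positive_tweets` and `negative_tweets`.

```python
import random

# ANSI escape sequences: 92 = bright green, 91 = bright red, 0 = reset.
GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"

def colorize(text, color):
    """Wrap text in a color code and reset afterwards."""
    return f"{color}{text}{RESET}"

# Made-up stand-in tweets so the sketch runs without the NLTK corpus.
sample_pos = ["have a wonderful day :-)", "love my job :)"]
sample_neg = ["missing you :(", "so tired :("]

# random.choice picks an element directly, so no index can go out of range.
print(colorize(random.choice(sample_pos), GREEN))
print(colorize(random.choice(sample_neg), RED))
```

Using `random.choice` avoids the off-by-one risk of `random.randint(0, n)`, whose upper bound is inclusive.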

In [None]:
import random
print("\033[92m" + positive_tweets[15])  # positive tweet at index 15, in green (ANSI code 92)
print("\033[91m" + negative_tweets[15])  # negative tweet at index 15, in red (ANSI code 91)

# print a randomly chosen positive tweet in green
print('\033[92m' + random.choice(positive_tweets))

# print a randomly chosen negative tweet in red
print('\033[91m' + random.choice(negative_tweets))


Laying out a greetings card range for print today - love my job :-)
relate to the "sweet n' sour" kind of "bi-polar" people in your life... cuz my life... is FULL of them... :(
@Kentsson have a wonderful day! Thank you :-)
From our investigations the answer is 'no'. Especially when the pups die within the free insurance period :( https://t.co/ZZRf0jKCeH


# 3. Preprocessing

Data preprocessing, a component of data preparation, describes any type of processing performed on raw data to prepare it for another data processing procedure. For textual data, some of the preprocessing techniques used and discussed in this notebook are:

1. Removing URLs
2. Lowercasing text
3. Tokenization
4. POS tags
5. Removing stopwords
6. Removing punctuation
7. Stemming

In the subsequent sections we will discuss these briefly and then apply them to the Twitter dataset.
        
  

## 3.1 Removing URL
We first need to import the re module, which stands for "regular expressions". Regular expressions are a powerful tool for pattern matching and string manipulation.
The module provides various functions, such as:
1. re.search(pattern, string): Searches the string for the first location where the pattern matches.
2. re.match(pattern, string): Checks whether the pattern matches at the beginning of the string.
3. re.sub(pattern, replacement, string): Replaces occurrences of the pattern in the string with the replacement.

To remove URLs, we replace each match with an empty string ''. Note that the pattern `http\S+` consumes everything up to the next whitespace, so text glued to the end of a URL (here the ".it's" that follows the link without a space) is removed along with it.
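
The difference between these functions is easiest to see side by side. The snippet below is a minimal sketch on a made-up sentence; the `example.com` and `example.org` addresses are placeholders, and `re.findall` is included as one more function from the same module.

```python
import re

text = "visit https://example.com/page now, or http://example.org later"
url_pattern = r"https?://\S+"

# re.search scans the whole string and returns the first match (or None).
first = re.search(url_pattern, text)
print(first.group())                     # https://example.com/page

# re.match only succeeds if the pattern matches at position 0.
print(re.match(url_pattern, text))       # None, because the text starts with "visit"

# re.findall returns every non-overlapping match as a list of strings.
print(re.findall(url_pattern, text))

# re.sub replaces every match; here each URL becomes an empty string.
without_urls = re.sub(url_pattern, "", text)
print(without_urls)                      # visit  now, or  later
```

Removing a URL leaves the surrounding spaces in place, which is why the result contains double spaces.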

In [None]:
import re
corpus="my name is nudrat and i am interested in '#NLP'. My linkedin profile is https://www.linkedin.com/in/nudrat-habib-a3486a234/.it's a beautiful day coming next "
print('corpus with URL:',corpus)
new_corpus=re.sub(r'http\S+','',corpus)
print('corpus with URL removed',new_corpus)


corpus with URL: my name is nudrat and i am interested in '#NLP'. My linkedin profile is https://www.linkedin.com/in/nudrat-habib-a3486a234/.it's a beautiful day coming next 
corpus with URL removed my name is nudrat and i am interested in '#NLP'. My linkedin profile is  a beautiful day coming next 


## 3.2 Lower casing
Next we convert all the text to lower case.

In [None]:
corpus=new_corpus.lower()
print(corpus)

my name is nudrat and i am interested in '#nlp'. my linkedin profile is  a beautiful day coming next 


## 3.3 Tokenization
Tokenization is the process of splitting a string of text into a list of tokens. A token is a unit of a larger text: a word is a token in a sentence, and a sentence is a token in a paragraph.
Here we convert the text into word tokens.
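
To see why tokenization is more than splitting on spaces, here is a rough sketch that compares `str.split` with a simple regular-expression tokenizer. The `simple_tokenize` function is a hypothetical stand-in written for this illustration; NLTK's tokenizer applies more elaborate rules and will not always agree with it.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "my name is nudrat, and i like '#nlp'."

# Whitespace splitting leaves punctuation attached to the words.
print(sentence.split())

# The regex version separates each punctuation mark into its own token.
print(simple_tokenize(sentence))
```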

In [None]:
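nltk.download('punkt', quiet=True)    # tokenizer models needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
tokens = nltk.word_tokenize(corpus)   # split the text into word and punctuation tokens
print(tokens)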


['my', 'name', 'is', 'nudrat', 'and', 'i', 'am', 'interested', 'in', "'", '#', 'nlp', "'", '.', 'my', 'linkedin', 'profile', 'is', 'a', 'beautiful', 'day', 'coming', 'next']


## 3.4 Removing punctuation and stopwords
Stop words are a set of commonly used words in a language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. In NLP and text mining applications, stop word lists are used to filter out unimportant words, allowing applications to focus on the important words instead.
Punctuation marks are usually also insignificant for this kind of analysis, so we remove both stopwords and punctuation from the data.
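
One detail worth knowing: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. Single marks such as `.` are caught, but multi-character tokens such as `...` pass through. The sketch below shows a stricter check; `is_punctuation`, `remove_noise` and the tiny stop list are hypothetical names and data chosen for illustration, not part of NLTK.

```python
import string

# A tiny hand-picked stop list for illustration; NLTK's English list is much longer.
toy_stopwords = {"my", "is", "and", "i", "am", "in", "a", "the", "of"}

def is_punctuation(token):
    """True when every character of the token is a punctuation character."""
    return all(ch in string.punctuation for ch in token)

def remove_noise(tokens):
    """Drop stopwords and punctuation-only tokens."""
    return [t for t in tokens if t not in toy_stopwords and not is_punctuation(t)]

tokens = ["my", "life", "...", "is", "full", "of", "``", "bi-polar", "''", "people", "."]

# Substring test: '...' is not a substring of string.punctuation, so it would survive.
print("..." in string.punctuation)   # False
print(is_punctuation("..."))         # True
print(remove_noise(tokens))          # ['life', 'full', 'bi-polar', 'people']
```

The character-by-character check keeps hyphenated words such as `bi-polar`, because not all of their characters are punctuation.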

In [None]:
import string
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')   # English stopword list
clean_data = []
for word in tokens:
  # keep a token only if it is neither a stopword nor a punctuation character
  if word not in stop_words and word not in string.punctuation:
    clean_data.append(word)
print(clean_data)

['name', 'nudrat', 'interested', 'nlp', 'linkedin', 'profile', 'beautiful', 'day', 'coming', 'next']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 3.5 Stemming
Stemming is a natural language processing technique that reduces words to their base form, also known as the root form or stem. The stem need not be a dictionary word: the Porter stemmer used here turns "beautiful" into "beauti". Stemming is used to normalize text and make it easier to process. It is an important step in text preprocessing and is commonly used in information retrieval and text mining applications.
The input to the stemmer is a list of tokenized words.
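
To get a feel for what a stemmer does, here is a deliberately naive suffix stripper. It is not the Porter algorithm, only a toy sketch; the `naive_stem` function and its suffix list are assumptions made for this example.

```python
def naive_stem(word, suffixes=("ing", "ed", "ly", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["interested", "coming", "days", "next"]:
    print(word, "->", naive_stem(word))
```

The toy version turns "coming" into "com", whereas the Porter stemmer in this notebook produces "come". Real stemmers carry extra rules to handle cases like this.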

In [None]:
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
print('clean words:',clean_data)
stem_words=[]
for word in clean_data:
  stem_word=stemmer.stem(word)
  stem_words.append(stem_word)
print('stemmed words: ',stem_words)

clean words: ['name', 'nudrat', 'interested', 'nlp', 'linkedin', 'profile', 'beautiful', 'day', 'coming', 'next']
stemmed words:  ['name', 'nudrat', 'interest', 'nlp', 'linkedin', 'profil', 'beauti', 'day', 'come', 'next']


Now that we are familiar with the individual preprocessing steps, we will apply them to the Twitter dataset.
Having walked through the code step by step, we now wrap all the steps in a single function and use it to preprocess the Twitter data.

In [None]:
import re                                   # regular expressions, e.g. for URL removal
import string                               # string.punctuation
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer         # stemming
nltk.download('stopwords')
nltk.download('punkt')                      # tokenizer models needed by word_tokenize

def clean_data(data):
  stemmer = PorterStemmer()
  stop_words = set(stopwords.words('english'))
  data = re.sub(r'https?://\S+', '', data)  # remove hyperlinks
  data = re.sub(r'[@#]', '', data)          # strip '@' and '#' markers, keeping the text after them
  data = data.lower()                       # lowercase before the stopword check
  tokens = nltk.word_tokenize(data)         # split the tweet into tokens
  kept = []
  for word in tokens:
    # drop stopwords and punctuation characters
    if word not in stop_words and word not in string.punctuation:
      kept.append(word)
  # reduce each remaining token to its stem
  return [stemmer.stem(word) for word in kept]


Now let's check that the function works properly on a sample tweet before applying it to the entire dataset.

In [None]:
tweet="you are so beautiful enjoy your life, keep smiling and join me here  https://www.facebook.com/ "
clean_tweet=clean_data(tweet)
print(clean_tweet)

['beauti', 'enjoy', 'life', 'keep', 'smile', 'join']


Finally, we iterate over the positive and negative tweets and preprocess each one.
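
Since the tweets are meant for training and testing a model, one possible next step is to keep each cleaned tweet together with its label. The sketch below assumes a labeling convention of 1 for positive and 0 for negative; `preprocess_corpus` and `toy_clean` are hypothetical helpers, and `toy_clean` is only a stand-in so that the example runs without NLTK.

```python
def preprocess_corpus(pos_tweets, neg_tweets, clean_fn):
    """Clean every tweet and pair it with a label: 1 = positive, 0 = negative."""
    labeled = [(clean_fn(t), 1) for t in pos_tweets]
    labeled += [(clean_fn(t), 0) for t in neg_tweets]
    return labeled

# Stand-in cleaner (lowercase + whitespace split) so the sketch runs on its own.
def toy_clean(tweet):
    return tweet.lower().split()

data = preprocess_corpus(["Love my job"], ["So tired today"], toy_clean)
print(data)
```

In the notebook, the cleaning function defined above would be passed in place of `toy_clean`.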

In [None]:
# apply the cleaning function to every tweet in each class
cleaned_pos_tweets = [clean_data(tweet) for tweet in positive_tweets]
cleaned_neg_tweets = [clean_data(tweet) for tweet in negative_tweets]


Let's print one tweet before and after cleaning, from both the positive and the negative tweets.

In [None]:
print('positive tweet: ',positive_tweets[27])
print('cleaned positive tweet',cleaned_pos_tweets[27])  # same index as the raw tweet above


print('\n\n negative tweet:',negative_tweets[15])
print('cleaned negative tweet',cleaned_neg_tweets[15])

positive tweet:  Spiritual Ritual Festival (Népal)
Beginning of Line-up :)
It is left for the line-up (y)
See more at:... http://t.co/QMNz62OEuc
cleaned positive tweet ['followfriday', 'michelploria', 'myfrenchc', 'jasoncr', 'top', 'new', 'follow', 'commun', 'week']


 negative tweet: relate to the "sweet n' sour" kind of "bi-polar" people in your life... cuz my life... is FULL of them... :(
cleaned negative tweet ['relat', '``', 'sweet', 'n', 'sour', "''", 'kind', '``', 'bi-polar', "''", 'peopl', 'life', '...', 'cuz', 'life', '...', 'full', '...']
