# Text Preprocessing in NLP

Text preprocessing is a crucial step in building NLP solutions and applications. Once we have acquired data for a project, the first thing we need to do is preprocess the text so that it is suitable as input to our machine learning or deep learning model. Preprocessing makes sure that the data is consistent, contains nothing unnecessary, and matches the project requirements.

## Basic Steps in Text Preprocessing

1. Lowercasing
2. Remove HTML Tags
3. Remove URLs
4. Remove Punctuations
5. Chat word treatment
6. Spelling Correction
7. Removing stop words
8. Handling emojis
9. Tokenization
10. Stemming
11. Lemmatization

It is not necessary to apply all of the above steps to every dataset. Which steps to apply, and in what order, depends on the project requirements, so we have to use our own judgement.
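Because the steps are optional, one way to keep the choice explicit is to treat preprocessing as an ordered list of functions. The following is only a minimal sketch: `build_pipeline`, `lowercase` and `strip_html` are hypothetical names introduced here for illustration, not part of any library.

```python
import re
from typing import Callable, Iterable

Step = Callable[[str], str]

def lowercase(text: str) -> str:
    return text.lower()

def strip_html(text: str) -> str:
    # Simple tag-stripping pattern, the same idea used later in this notebook
    return re.sub(r'<.*?>', '', text)

def build_pipeline(steps: Iterable[Step]) -> Step:
    """Return a function that applies the chosen steps in order."""
    steps = list(steps)

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run

# Pick only the steps this (hypothetical) project needs
preprocess = build_pipeline([lowercase, strip_html])
print(preprocess('A <b>Wonderful</b> Film'))  # a wonderful film
```

Adding or dropping a step then only means editing the list passed to `build_pipeline`.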

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display

In [3]:
data = pd.read_csv('Datasets/IMDB Dataset.csv')

display(data.head(10))


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [4]:
data['review'][5]

'Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas\' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they\'d all be "up" for this movie.'

## Lowercasing

In [5]:
# To lowercase a text in python we simply use str.lower

# to lowercase the entire review column in our dataset we will do the following
data['review'] = data['review'].str.lower()

display(data.head())

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


## Remove HTML Tags
HTML tags usually end up in the text when we scrape data from a website with a web scraping tool such as BeautifulSoup. These tags are not needed for training a machine learning or deep learning model, so we remove them.

In [6]:
import re
def remove_html_tag(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

In [7]:
data['review'] = data['review'].apply(remove_html_tag)

In [8]:
data.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
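The regular expression above removes tags but leaves character references such as `&amp;` untouched. As an alternative sketch (not the approach used in this notebook), the standard library's `html.parser` can collect only the text content and decode those references at the same time. The class and function names below are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html_with_parser(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_html_with_parser('a wonderful little production. <br /><br />tom &amp; jerry'))
# a wonderful little production. tom & jerry
```

It can be applied to the column in the same way, for example `data['review'].apply(strip_html_with_parser)`.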


## Remove URLs

URLs can be present in the text if the data comes from, say, Facebook, Twitter or Instagram. These URLs are usually not useful for our model and can even add noise, so we remove them from the text using regular expressions.

In [9]:
import re
def remove_urls(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)


In [10]:
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

remove_urls(text4)
# Our dataset does not contain any URLs, so we do not apply this to our data. If needed, we can simply use .apply to run remove_urls on the review column.

'For notebook click  to search check '
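Note that the output above still contains a double space and a trailing space where the URLs used to be. A small variation, sketched below, collapses the leftover whitespace and optionally substitutes a placeholder token instead of deleting the URL. `replace_urls` and the `<URL>` placeholder are hypothetical choices made for this example.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def replace_urls(text, placeholder=''):
    """Replace URLs with `placeholder`, then collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub(placeholder, text)
    return re.sub(r'\s+', ' ', without_urls).strip()

text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'
print(replace_urls(text4))           # For notebook click to search check
print(replace_urls(text4, '<URL>'))  # For notebook click <URL> to search check <URL>
```

Keeping a placeholder can be useful when the mere presence of a link is informative for the task.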

## Remove Punctuations

In [11]:
import string
string.punctuation  # all ASCII punctuation characters

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [12]:
exclude = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~'  # the full stop (.) is left out because it may be needed; adjust this string as required

# The loop version below is a naive implementation: it scans the text once per punctuation character, so it is slow on large datasets.
def remove_punctuation(text):
    for char in exclude:
        text = text.replace(char,'')

    return text

def remove_punctuation_optimized(text):
    return text.translate(str.maketrans('','',exclude))




In [13]:
data['review'] = data['review'].apply(remove_punctuation_optimized)
data.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
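To check that the two functions above really are interchangeable, and to get a rough feel for the speed difference, something like the following can be run. This is only a sketch: the sample string is made up, the translation table is hoisted out of the function as a small extra optimisation, and the timings depend on the machine, so no numbers are shown.

```python
import string
import timeit

exclude = string.punctuation.replace('.', '')  # keep the full stop, as above

def remove_punctuation(text):
    # One pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, '')
    return text

# Build the translation table once instead of on every call
_TABLE = str.maketrans('', '', exclude)

def remove_punctuation_optimized(text):
    # Single pass over the text
    return text.translate(_TABLE)

sample = "basically there's a family, where a little boy (jake) thinks... " * 1000
assert remove_punctuation(sample) == remove_punctuation_optimized(sample)

for fn in (remove_punctuation, remove_punctuation_optimized):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f'{fn.__name__}: {seconds:.4f}s')
```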


## Chat word treatment

In chat applications people often use abbreviations such as FYI (for your information) or IMHO (in my honest opinion). A model may struggle to interpret these chat words, so we expand them into their full forms.

In [14]:
# A GitHub repo provides a dictionary of chat words and their full forms; we use it here
slang_dict = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'B4N': 'Bye For Now',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': 'For What It\'s Worth',
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laughter',
    'TFW': 'That feeling when',
    'MFW': 'My face when',
    'MRW': 'My reaction when',
    'IFYP': 'I feel your pain',
    'LOL': 'Laughing out loud',
    'TNTL': 'Trying not to laugh',
    'JK': 'Just kidding',
    'IDC': 'I don’t care',
    'ILY': 'I love you',
    'IMU': 'I miss you',
    'ADIH': 'Another day in hell',
    'IDC': 'I don’t care',
    'ZZZ': 'Sleeping, bored, tired',
    'WYWH': 'Wish you were here',
    'TIME': 'Tears in my eyes',
    'BAE': 'Before anyone else',
    'FIMH': 'Forever in my heart',
    'BSAAW': 'Big smile and a wink',
    'BWL': 'Bursting with laughter',
    'LMAO': 'Laughing my a** off',
    'BFF': 'Best friends forever',
    'CSL': 'Can’t stop laughing'
}



In [15]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in slang_dict:
            new_text.append(slang_dict[w.upper()])
        else:
            new_text.append(w)
    return ' '.join(new_text)

In [16]:
# example use case
text_slang = "FYI i already knew"
chat_conversion(text_slang)


'For Your Information i already knew'
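The split-based function above misses chat words that have punctuation attached, for example `fyi,` or `brb!`. One possible variation, sketched here with a hypothetical three-entry dictionary instead of the full `slang_dict`, uses a regular expression with word boundaries.

```python
import re

# Hypothetical miniature dictionary; in practice the full slang_dict would be used
slang = {'FYI': 'For Your Information', 'BRB': 'Be Right Back', 'IMO': 'In My Opinion'}

# One alternation of all keys, longest first so longer keys win over their prefixes
_pattern = re.compile(
    r'\b(' + '|'.join(sorted(map(re.escape, slang), key=len, reverse=True)) + r')\b',
    flags=re.IGNORECASE,
)

def expand_chat_words(text):
    # Look up the matched word in upper case, as the dictionary keys are upper case
    return _pattern.sub(lambda m: slang[m.group(0).upper()], text)

print(expand_chat_words('fyi, the cast is great. brb!'))
# For Your Information, the cast is great. Be Right Back!
```

Both versions match case-insensitively, so ordinary words that happen to be dictionary keys (such as `time`, `kiss` or `u` in the dictionary above) are expanded as well. Depending on the data, it may be safer to drop such keys or to match upper-case tokens only.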

## Spelling Correction

In [17]:
from textblob import TextBlob

In [18]:
incorrect_text = 'texxt preprocesing is a curciaal stap in naturla languag procesing'

text_blob = TextBlob(incorrect_text)

text_blob.correct().string

# There are many other ways to correct spelling in Python. For example, we can use a library named pyspellchecker.

'text preprocesing is a crucial step in natural language processing'
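To give an intuition for what a spelling corrector does, the sketch below replaces each unknown word with the most similar word from a known vocabulary. It is a toy illustration built on the standard library's `difflib`, not how TextBlob or pyspellchecker work internally; the vocabulary, function names and `cutoff` value are assumptions made for this example.

```python
from difflib import get_close_matches

# Hypothetical toy vocabulary; a real corrector uses a large word-frequency list
VOCAB = ['text', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in',
         'natural', 'language', 'processing']

def correct_word(word, vocab=VOCAB, cutoff=0.7):
    if word in vocab:
        return word
    # Best match by similarity ratio, or nothing if no candidate reaches the cutoff
    matches = get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    return ' '.join(correct_word(w) for w in text.split())

print(correct_text('texxt is a curciaal stap in naturla languag'))
# text is a crucial step in natural language
```

Words with no sufficiently similar vocabulary entry are returned unchanged.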

## Removing stop words

Stop words are words that help form a sentence but contribute very little to its actual meaning, for example: the, and, for. They can be removed easily with the NLTK library.

There are some tasks where we keep the stop words, such as part-of-speech tagging.

In [19]:
from nltk.corpus import stopwords

stopwords.words('english')  # NLTK's list of English stop words; lists for other languages are available in the same way

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [20]:
stop_words = set(stopwords.words('english'))  # build the set once; set lookups are fast

def remove_stopwords(text):
    # keep only the words that are not stop words
    # note: the comparison is case-sensitive, so 'This' and 'It' survive unless the text is lowercased first
    return " ".join(word for word in text.split() if word not in stop_words)


In [21]:

text = 'This is a really great time for the field of AI. It is advancing exponentially'
remove_stopwords(text)

'This really great time field AI. It advancing exponentially'

In [23]:
# if we want to apply this on our dataset, we can simply use the apply function 
# data['review'] = data['review'].apply(remove_stopwords)
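For sentiment data such as these reviews, removing every stop word can be harmful: NLTK's English list includes negations like `not` and `no`, and dropping them turns "not good" into "good". A minimal sketch of keeping negations is shown below. The small `STOP_WORDS` set is a hypothetical stand-in so that the example runs without NLTK; with NLTK one would start from `set(stopwords.words('english'))`.

```python
# Hypothetical miniature stop word list standing in for NLTK's English list
STOP_WORDS = {'this', 'is', 'a', 'the', 'of', 'for', 'it', 'not', 'no', 'was'}
NEGATIONS = {'not', 'no'}

# Words that carry sentiment are taken out of the removal set
SENTIMENT_STOP_WORDS = STOP_WORDS - NEGATIONS

def remove_stopwords_keep_negation(text):
    # Compare in lower case so that 'This' is treated like 'this'
    return ' '.join(w for w in text.split() if w.lower() not in SENTIMENT_STOP_WORDS)

print(remove_stopwords_keep_negation('This movie was not good'))  # movie not good
```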

## Handling Emojis

Emojis are commonly used to express sentiment and emotion, but most models cannot interpret them directly.
We have two options for handling emojis: remove them altogether, or replace them with a text description of their meaning.

In [24]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; very broad range that also matches CJK and other non-emoji text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [25]:
remove_emoji("You are very funny 😂😂😂")

'You are very funny '

In [26]:
# replace emojis with their text description
import emoji
print(emoji.demojize('You are very funny 😂😂😂'))


You are very funny :face_with_tears_of_joy::face_with_tears_of_joy::face_with_tears_of_joy:
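If installing the `emoji` package is not an option, a rough approximation of the same idea can be built from the standard library, since `unicodedata.name` returns the official Unicode name of a character. This is only a sketch: the two code point ranges below are an assumption and do not cover every emoji, and the function names are hypothetical.

```python
import re
import unicodedata

# Assumption: only two common emoji ranges are covered; this is not a complete list
EMOJI_PATTERN = re.compile('[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF]')

def emoji_to_name(match):
    char = match.group(0)
    name = unicodedata.name(char, '')  # '' for unassigned code points
    return f":{name.lower().replace(' ', '_')}:" if name else ''

def describe_emojis(text):
    return EMOJI_PATTERN.sub(emoji_to_name, text)

print(describe_emojis('You are very funny \U0001F602'))
# You are very funny :face_with_tears_of_joy:
```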


## Tokenization

Tokenization means breaking a text document into smaller parts known as tokens. The tokens can be words or sentences.

**Why is tokenization important?**
A tokenizer breaks unstructured natural language text into chunks of information that can be treated as discrete elements. The token occurrences in a document can then be used directly as a vector representing that document.
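As a small illustration of how token occurrences become a vector, the sketch below counts the tokens of two made-up sentences against a shared vocabulary (a simple bag-of-words representation). The variable and function names are chosen for this example only.

```python
from collections import Counter

docs = ['I am going to Mumbai', 'I hope I enjoy the trip to Mumbai']

# Very basic tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary fixes the meaning of each vector position
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

print(vocabulary)
# ['am', 'enjoy', 'going', 'hope', 'i', 'mumbai', 'the', 'to', 'trip']
for tokens in tokenized:
    print(to_vector(tokens))
# [1, 0, 1, 0, 1, 1, 0, 1, 0]
# [0, 1, 0, 1, 2, 1, 1, 1, 1]
```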



### First Method (very basic): Using the split function

In [27]:
sent1= 'I am going to Mumbai'
sent1.split()

['I', 'am', 'going', 'to', 'Mumbai']

In [28]:
sent2 = 'I am going to Mumbai. I will stay there for 10 days. I hope I enjoy the trip'
sent2.split('.')

['I am going to Mumbai',
 ' I will stay there for 10 days',
 ' I hope I enjoy the trip']

### Second Method: Regular Expression

In [29]:
import re
sent3 = 'I am going to mumbai!'
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens

['I', 'am', 'going', 'to', 'mumbai']
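The pattern above silently drops punctuation. If punctuation should be kept as separate tokens, the pattern can be extended with a second alternative, as in this sketch. `TOKEN_PATTERN` and `regex_tokenize` are names introduced here for illustration.

```python
import re

# A word (optionally with internal apostrophes), or any single non-space, non-word character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("I'm going to mumbai!"))
# ["I'm", 'going', 'to', 'mumbai', '!']
```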

In [30]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile(r'[.!?] ').split(text)  # split wherever ., ! or ? is followed by a space
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

### Third Method: Using the NLTK Library

NLTK already ships with fairly sophisticated tokenization algorithms that handle many of the difficult cases that arise when tokenizing a document.

In [31]:
from nltk.tokenize import word_tokenize, sent_tokenize

sent1= 'I am going to Mumbai!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'Mumbai', '!']

In [32]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [33]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

In [34]:
word_tokenize(sent5)  # the output shows that difficult tokens such as Ph.D and A.I are kept
# together instead of being split into separate words

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

### Fourth Method: Using spaCy (often handles tricky cases better than NLTK)

In [35]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [36]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [37]:
for token in doc3:
    print(token)

A
5
km
ride
cost
$
10.50


## Stemming

Stemming is the process of reducing inflected words to their root form, the stem. Inflection means that the same word is expressed in different grammatical categories such as tense, case or voice, for example walk, walking, walked. A stemmer maps such a group of words to the same stem, even if the stem itself is not a valid word in the language.

Stemming is used a lot in **information retrieval systems**.

A stemmer is an algorithm that performs stemming for us. NLTK provides several stemmers.

In [40]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [41]:
sample = 'walk walks walking walked'
stem_words(sample)

'walk walk walk walk'
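To make the idea of rule-based suffix stripping concrete, here is a deliberately tiny stemmer. It is a toy sketch with four hypothetical rules, not the Porter algorithm (which has many more rules and conditions), but it shows why a stem such as `studi` need not be a real word.

```python
# Toy rules, checked in order; each maps a suffix to its replacement
SUFFIX_RULES = [('ies', 'i'), ('ing', ''), ('ed', ''), ('s', '')]

def toy_stem(word, min_stem_length=3):
    for suffix, replacement in SUFFIX_RULES:
        stem = word[:-len(suffix)] + replacement
        # Apply a rule only if the suffix matches and enough of the word remains
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

print([toy_stem(w) for w in 'walk walks walking walked studies'.split()])
# ['walk', 'walk', 'walk', 'walk', 'studi']
```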

## Lemmatization

Lemmatization has the same goal as stemming, but stemming sometimes produces stems that are not actual words in the English language. Lemmatization addresses this: its output, the lemma, is always a word from the dictionary. The price is that lemmatization is slower.

If the output is never shown to the user, stemming is usually sufficient. If some of the output text has to be displayed to the user, lemmatization is the better choice.

A lemmatizer looks each word up in a dictionary, for example WordNet (a lexical database), instead of applying suffix-stripping rules as a stemmer does. This lookup is why it is slower.


In [42]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
# Build a new list instead of calling remove() while iterating, which can skip elements
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos='v' treats every word as a verb, so nouns such as 'hours' are left unchanged
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
