# Text Preprocessing in NLP

Text preprocessing is a crucial step in building NLP solutions and applications. Once we acquire data for a project, the first thing to do is preprocess the text so that it is suitable as input to a machine learning or deep learning model. Preprocessing makes sure that the data is consistent, contains nothing unnecessary, and matches the project requirements.

## Basic Steps in Text Preprocessing

1. Lowercasing
2. Remove HTML Tags
3. Remove URLs
4. Remove Punctuation
5. Chat word treatment
6. Spelling Correction
7. Removing stop words
8. Handling emojis
9. Tokenization
10. Stemming
11. Lemmatization

It is not necessary to apply all of the above steps to every dataset. We need to use judgment and choose the preprocessing steps that fit the project requirements.
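One way to keep that choice explicit is to treat each step as a small function and pass in only the ones a project needs. The sketch below is a minimal illustration under that assumption; the names `preprocess`, `lowercase`, `strip_html`, `strip_urls` and `review_steps` are hypothetical and are not used elsewhere in this notebook.

```python
import re
from typing import Callable, Iterable


def lowercase(text: str) -> str:
    return text.lower()


def strip_html(text: str) -> str:
    # Non-greedy match so that each tag is removed separately.
    return re.sub(r'<.*?>', '', text)


def strip_urls(text: str) -> str:
    return re.sub(r'https?://\S+|www\.\S+', '', text)


def preprocess(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply the chosen steps to `text`, in the order given."""
    for step in steps:
        text = step(text)
    return text


# Hypothetical choice for scraped reviews: fix case and HTML, skip URL removal.
review_steps = [lowercase, strip_html]
cleaned = preprocess('A wonderful little production. <br /><br />The filming', review_steps)
print(cleaned)  # a wonderful little production. the filming
```

Adding or dropping a step then means editing one list rather than rewriting the cleaning code.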

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display

In [4]:
data = pd.read_csv('Datasets/IMDB Dataset.csv')

display(data.head(10))


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
| 7 | This show was an amazing, fresh & innovative i... | negative |
| 8 | Encouraged by the positive comments about this... | negative |
| 9 | If you like original gut wrenching laughter yo... | positive |


In [5]:
data['review'][5]

'Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas\' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they\'d all be "up" for this movie.'

## Lowercasing

In [7]:
# To lowercase a text in python we simply use str.lower

# to lowercase the entire review column in our dataset we will do the following
data['review'] = data['review'].str.lower()

display(data.head())

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. <br /><br />the... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
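For a single string rather than a whole column, the built-in `str.lower` method does the same job. The short sketch below uses an illustrative `sample` string (not taken from the dataset loading code) and also shows `str.casefold`, a stricter variant intended for caseless comparison.

```python
sample = 'Probably my all-time favorite movie'
lowered = sample.lower()
print(lowered)  # probably my all-time favorite movie

# casefold() goes further than lower() for some non-English characters.
print('Straße'.lower())     # straße
print('Straße'.casefold())  # strasse
```

For English reviews like the ones here, the two methods give the same result, so `lower` is enough.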


## Remove HTML Tags
HTML tags usually appear when we scrape data from a website with a web scraping tool such as BeautifulSoup. These tags aren't needed for training a machine learning or deep learning model, so we will remove them.

In [8]:
import re

def remove_html_tag(text):
    # Non-greedy match: each <...> tag is matched and removed separately
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

In [9]:
data['review'] = data['review'].apply(remove_html_tag)

In [10]:
data.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. the filming tec... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
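Removing a tag outright can glue two words together when the tag was the only thing separating them, and scraped text often also contains entities such as `&amp;`. The sketch below is one possible variant, assuming we want a space in place of each tag and decoded entities; `remove_html_tag_spaced` and `TAG_PATTERN` are hypothetical names, and `html.unescape` comes from the Python standard library.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')


def remove_html_tag_spaced(text: str) -> str:
    # Replace each tag with a space so neighbouring words stay separated.
    no_tags = TAG_PATTERN.sub(' ', text)
    # Decode entities such as &amp; into the characters they stand for.
    decoded = html.unescape(no_tags)
    # Collapse the runs of whitespace left behind.
    return ' '.join(decoded.split())


print(remove_html_tag_spaced('fresh &amp; innovative.<br /><br />the filming'))
# fresh & innovative. the filming
```

A regular expression is fine for simple tags like `<br />`. For messy or deeply nested markup, a real HTML parser is likely the safer choice.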


## Remove URLs

URLs can be present in the text if the data comes from, say, Facebook, Twitter, or Instagram. These URLs are again not very useful for our model and can in fact confuse it. Hence we will remove the URLs from our text using regular expressions.

In [14]:
import re
def remove_urls(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)


In [16]:
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

remove_urls(text4)
# Our dataset does not contain any URLs as such, so we will not apply this to our data. If we wished to, we could simply use .apply to run remove_urls on the review column.

'For notebook click  to search check '
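The output above still contains a double space and a trailing space where the URLs used to be. A minimal sketch of one way to tidy that up is shown below; `remove_urls_tidy` and `URL_PATTERN` are hypothetical names, and the regular expression is the same one used in `remove_urls`.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def remove_urls_tidy(text: str) -> str:
    without_urls = URL_PATTERN.sub('', text)
    # split() with no arguments drops leading, trailing and repeated whitespace.
    return ' '.join(without_urls.split())


text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'
print(remove_urls_tidy(text4))  # For notebook click to search check
```

Note that `\S+` also swallows punctuation that directly follows a URL, such as a closing full stop, which may or may not matter for a given dataset.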

## Remove Punctuation

In [17]:
import string
string.punctuation # this gives all the ASCII punctuation characters

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [22]:
exclude = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~' # the full stop (.) is left out because it may be needed; change this string as required
# remove_punctuation below is our own logic written from scratch. It makes a full pass over the text for every character in exclude, so it is slow on large datasets.
# remove_punctuation_optimized does the same job in a single pass with str.translate.
def remove_punctuation(text):
    for char in exclude:
        text = text.replace(char,'')

    return text

def remove_punctuation_optimized(text):
    return text.translate(str.maketrans('','',exclude))




In [24]:
data['review'] = data['review'].apply(remove_punctuation_optimized)
data.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. the filming tec... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically theres a family where a little boy j... | negative |
| 4 | petter matteis love in the time of money is a ... | positive |
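To check that the two functions really behave the same, and to see the speed difference on our own machine, we can compare them with `timeit`. The sketch below repeats both functions so that it runs on its own; building the translation table once in `TABLE` is an assumption added here, and the `sample` text is illustrative. Timings vary by machine, so no numbers are claimed.

```python
import timeit

exclude = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~'  # same string as above, no full stop


def remove_punctuation(text):
    # One full pass over the text per character in exclude.
    for char in exclude:
        text = text.replace(char, '')
    return text


# Build the translation table once instead of on every call.
TABLE = str.maketrans('', '', exclude)


def remove_punctuation_optimized(text):
    # Single pass over the text.
    return text.translate(TABLE)


sample = "petter mattei's \"love in the time of money\" is a visually stunning film. " * 1000

# Both versions should produce exactly the same result.
assert remove_punctuation(sample) == remove_punctuation_optimized(sample)

loop_time = timeit.timeit(lambda: remove_punctuation(sample), number=20)
translate_time = timeit.timeit(lambda: remove_punctuation_optimized(sample), number=20)
print(f'replace loop: {loop_time:.4f}s, translate: {translate_time:.4f}s')
```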


## Chat word treatment

In many chat applications, we use chat words such as FYI (for your information), IMHO (in my honest opinion), and so on. These abbreviations are hard for a model to interpret, so we expand them into their full forms.

In [25]:
# We found a repo on GitHub that contains a dictionary of chat words and their full forms. We will use it here.
slang_dict = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': 'For What It\'s Worth',
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laughter',
    'TFW': 'That feeling when',
    'MFW': 'My face when',
    'MRW': 'My reaction when',
    'IFYP': 'I feel your pain',
    'LOL': 'Laughing out loud',
    'TNTL': 'Trying not to laugh',
    'JK': 'Just kidding',
    'IDC': 'I don’t care',
    'ILY': 'I love you',
    'IMU': 'I miss you',
    'ADIH': 'Another day in hell',
    'ZZZ': 'Sleeping, bored, tired',
    'WYWH': 'Wish you were here',
    'TIME': 'Tears in my eyes',
    'BAE': 'Before anyone else',
    'FIMH': 'Forever in my heart',
    'BSAAW': 'Big smile and a wink',
    'BWL': 'Bursting with laughter',
    'LMAO': 'Laughing my a** off',
    'BFF': 'Best friends forever',
    'CSL': 'Can’t stop laughing'
}



In [26]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in slang_dict:
            new_text.append(slang_dict[w.upper()])
        else:
            new_text.append(w)
    return ' '.join(new_text)

In [27]:
# example use case
text_slang = "FYI i already knew"
chat_conversion(text_slang)


'For Your Information i already knew'
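Because `chat_conversion` splits on whitespace, a chat word with punctuation attached (for example `brb.`) is not found in the dictionary and stays as it is. The sketch below shows one possible alternative that matches whole words with a regular expression. `chat_conversion_regex` and `slang_subset` are hypothetical names, and the small dictionary is only a subset of `slang_dict`, repeated so that the example runs on its own.

```python
import re

# A small subset of slang_dict, repeated here so the sketch is self-contained.
slang_subset = {
    'FYI': 'For Your Information',
    'BRB': 'Be Right Back',
    'IMO': 'In My Opinion',
}


def chat_conversion_regex(text, mapping):
    def expand(match):
        word = match.group(0)
        # Look up the upper-cased word; keep the original if it is not slang.
        return mapping.get(word.upper(), word)

    # \w+ matches runs of letters, digits and underscores, so punctuation is left in place.
    return re.sub(r'\w+', expand, text)


print(chat_conversion_regex('fyi, i will brb.', slang_subset))
# For Your Information, i will Be Right Back.
```

Whichever version we use, entries whose key is also an ordinary word deserve a second look before we apply the dictionary to real text. With `'TIME': 'Tears in my eyes'` in `slang_dict`, for instance, every occurrence of the word "time" in a review would be expanded.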

## Spelling Correction

In [31]:
from textblob import TextBlob

In [36]:
incorrect_text = 'texxt preprocesing is a curciaal stap in naturla languag procesing'

text_blob = TextBlob(incorrect_text)

text_blob.correct().string

# Note that correct() is not perfect: 'preprocesing' is left unchanged in the output below.
# There are many other ways to correct spelling in Python. For example, we can use a library named pyspellchecker.

'text preprocesing is a crucial step in natural language processing'
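To get a feel for how vocabulary-based correction works, the sketch below uses `difflib.get_close_matches` from the standard library to replace an unknown word with its closest match in a word list. This is a stand-in illustration only: it is not how TextBlob or pyspellchecker work internally, and `VOCABULARY`, `correct_word`, `correct_text` and the `cutoff` value are all assumptions made for the example.

```python
import difflib

# Tiny illustrative vocabulary; a real checker would load a full word list.
VOCABULARY = ['text', 'preprocessing', 'is', 'a', 'crucial', 'step',
              'in', 'natural', 'language', 'processing']


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.7):
    # Known words are returned unchanged.
    if word in vocabulary:
        return word
    # Otherwise take the single closest vocabulary word, if any is similar enough.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    return ' '.join(correct_word(w) for w in text.split())


print(correct_word('texxt'))         # text
print(correct_word('preprocesing'))  # preprocessing
print(correct_text('texxt preprocesing is a curciaal stap'))
```

The result depends entirely on the vocabulary: a word that has no close match in the list is left as it is, which is one reason a spelling corrector can miss domain-specific terms.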