<a href="https://colab.research.google.com/github/RakeshSharma21/Sessions_Notebook/blob/main/Text_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# "Text Preprocessing"
> "Various preprocessing steps required to clean and prepare your data in NLP"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [Text Preprocessing, Natural Language Processing]
- image: images/posts/text_preprocessing.jpg
- hide: false
- search_exclude: false

In any machine learning task, cleaning and pre-processing the data is a very important step. The better we represent our data, the better we can expect model training and prediction to be.

Especially in Natural Language Processing (NLP), where the data is unstructured, it becomes crucial to clean and properly format it for the task at hand. Many pre-processing steps are available, but it is not necessary to perform all of them. Choose the steps based on the problem statement.

Example: sentiment analysis on Twitter data may require removing hashtags, emoticons, etc., but this may not be the case if we are doing the same analysis on customer feedback data.

Here we are using the `twitter_samples` dataset from the nltk library.

In [2]:
!pip install demoji
!pip install emoji
!pip install contractions
!pip install unidecode
!pip install num2words
# !pip install spellchecker
!pip install pyspellchecker


In [117]:
#collapse-hide
# Import libraries and load the data
import numpy as np
import pandas as pd
import re
import nltk
import string
import demoji
import contractions
import unidecode
from num2words import num2words
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer #driving ==> driv
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer ## driving ==> drive
from bs4 import BeautifulSoup
from spellchecker import SpellChecker

pd.options.display.max_columns=None
pd.options.display.max_rows=None
pd.options.display.max_colwidth=None

nltk.download('twitter_samples') # Download the dataset

# We are going to use the Negative and Positive Tweets file which each contains 5000 tweets.
for name in twitter_samples.fileids():
    print(f' - {name}')

 - negative_tweets.json
 - positive_tweets.json
 - tweets.20150430-223406.json


[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


In [118]:
#collapse-hide

# Load the negative tweets file and assign label as 0 for negative
negative_tweets = twitter_samples.strings("negative_tweets.json")
df_neg = pd.DataFrame(negative_tweets, columns=['text'])
df_neg['label'] = 0

# Load the positive tweets file and assign label as 1 for positive
positive_tweets = twitter_samples.strings("positive_tweets.json")
df_pos = pd.DataFrame(positive_tweets, columns=['text'])
df_pos['label'] = 1

# df = pd.concat([df_pos, df_neg]) # Concatenate both the files
# df = df.sample(frac=1).reset_index(drop=True) # Shuffle the data to mix negative and positive tweets

In [120]:
df_neg.shape

(5000, 2)

In [121]:
df_neg.head()

Unnamed: 0,text,label
0,hopeless for tmr :(,0
1,Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(,0
2,@Hegelbon That heart sliding into the waste basket. :(,0
3,"“@ketchBurning: I hate Japanese call him ""bani"" :( :(”\n\nMe too",0
4,"Dang starting next week I have ""work"" :(",0


In [122]:
df_pos.shape

(5000, 2)

In [123]:
df_pos.head()

Unnamed: 0,text,label
0,#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :),1
1,@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!,1
2,@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!,1
3,@97sides CONGRATS :),1
4,yeaaaah yippppy!!! my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days,1


In [124]:
df = pd.concat([df_pos, df_neg])

In [125]:
df.shape

(10000, 2)

In [127]:
df[5000:].head()

Unnamed: 0,text,label
0,hopeless for tmr :(,0
1,Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(,0
2,@Hegelbon That heart sliding into the waste basket. :(,0
3,"“@ketchBurning: I hate Japanese call him ""bani"" :( :(”\n\nMe too",0
4,"Dang starting next week I have ""work"" :(",0


In [128]:
df = df.sample(frac=1).reset_index(drop=True)

In [129]:
df.tail()

Unnamed: 0,text,label
9995,@luvsgngrhd ME TOO :((( wHERE IS MARGO,0
9996,@markbowthai @twomeanyoung it's 6:15pm rn!! And it's currently 70 degrees Fahrenheit!! Which is really warm so it's good :),1
9997,@Doomcrew_Glen awwwww i'm so sorry. :( i bet she's looking down on you&lt;33,0
9998,"@BOCAGIRLSLAYED FOLLOWED ME THANKS, AND\n@justinbieber PLEASE FOLLOWED ME TOO :(",0
9999,i miss being a kid. i can't believe I'm doing adult stuff like paying rent and bills and doing a degree and shit like . the fuck??!?:(,0


In [130]:
print(f'Shape of the whole data is: {df.shape[0]} rows and {df.shape[1]} columns')

Shape of the whole data is: 10000 rows and 2 columns


In [133]:
# Look at the last rows of the dataframe
df.tail(40)

Unnamed: 0,text,label
9960,@JackJackJohnson ended up happy. :))),1
9961,Who even started this trend? I wanna know if there is a jot of truth :( #ZaynIsComingBackOnJuly26,0
9962,"Here it is, our #crowdfunding campaign! If we get our goal, we can walk on with this dream. Will you help us? :) http://t.co/tY9qr7g4my",1
9963,@HoldyHoldy Cotton Candy is my favourite too! :( Time to stock up again!,0
9964,@CraziestMocha I wish :(,0
9965,"Happy Birthday Nawazuddin Siddique...\nBig, Big fan of yours. :)",1
9966,Stats for the week have arrived. 1 new follower and NO unfollowers :) via http://t.co/4QiS2e3DaM.,1
9967,"@samm_amberr Then there will be a few ""Wtf this never happened?!!"" moments that you may or may not like. But there's a surprise thrown in :D",1
9968,@emis_nug Indeed. Thanks for link to article :),1
9969,NOT AN APOLOGY ME ENCANTA VALE OSEA BEA :-(,0


> Note: Always make it a practice to skim the dataset before performing any text pre-processing steps. This is important because text data can be very noisy, e.g. dates written in different formats, the presence of accented characters, etc. These are things we can easily miss if we don't go through the dataset properly.

## Lower Casing

Lowercasing is a common text preprocessing technique. It transforms all the text into the same case. <br>
Example: 'The', 'the', 'ThE' -> 'the'

It also helps in finding duplicates: words in different cases are treated as separate words, which makes it difficult to remove redundant words across all case combinations.

Lowercasing may not be helpful for tasks like Part-of-Speech tagging (where proper casing gives some information about nouns and so on) and sentiment analysis (where upper casing can signal anger and so on).
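
If casing carries signal for your task, one option is to record it as a feature before lowercasing. The sketch below is only an illustration under that assumption: `uppercase_ratio` and `lowercase_with_feature` are hypothetical helper names, not part of the original notebook.

```python
def uppercase_ratio(text):
    """Share of alphabetic characters that are upper case (0.0 if there are none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def lowercase_with_feature(text):
    """Return the lowercased text together with the casing feature computed first."""
    return text.lower(), uppercase_ratio(text)


print(lowercase_with_feature("ME TOO :((("))
```

With pandas, the same idea could be applied by computing the ratio into a new column before calling `str.lower()` on the text column.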

In [134]:
df.text = df.text.str.lower()
df.head(20)

Unnamed: 0,text,label
0,aww the poor thing :( hope it'okay and in good health. luckely it has been freed from those rocks #orcalove https://t.co/toszmoafv6,0
1,"@batesm0t3l hi there, i've spoken with the store who have advised they do have a paypoint and that card can be topped up :) thanks, beth",1
2,"@danielle_isla i could never go there ever after seeing that, so cruel :(",0
3,when the bus is late and you have no time to get food before work :(,0
4,another ridiculous headache :(,0
5,my kik : oulive70748 #kik #kikmenow #photo #babe #loveofmylife #brasileirao #viernesderolenahot :( http://t.co/gmmwd4prhu,0
6,"@dylanobrien @mazerunnermovie it is just beyond words, sooooooo freaking good! i'm going to die before it comes out!! :d",1
7,@jaymcguiness :-( please notice me,0
8,babe :(,0
9,@barbaranadel @kirkdalebooks definitely :),1


## Remove

### URL's

URL stands for Uniform Resource Locator. When present in a text, it points to the location of a resource on the web, such as another website.

If we are performing a backlink analysis of websites, the URLs are useful to keep. Otherwise, they usually don't provide any useful information, so we can remove them from our text.

In [135]:
df.text = df.text.str.replace(r'https?://\S+|www\.\S+', '', regex=True)
df.head(20)

Unnamed: 0,text,label
0,aww the poor thing :( hope it'okay and in good health. luckely it has been freed from those rocks #orcalove,0
1,"@batesm0t3l hi there, i've spoken with the store who have advised they do have a paypoint and that card can be topped up :) thanks, beth",1
2,"@danielle_isla i could never go there ever after seeing that, so cruel :(",0
3,when the bus is late and you have no time to get food before work :(,0
4,another ridiculous headache :(,0
5,my kik : oulive70748 #kik #kikmenow #photo #babe #loveofmylife #brasileirao #viernesderolenahot :(,0
6,"@dylanobrien @mazerunnermovie it is just beyond words, sooooooo freaking good! i'm going to die before it comes out!! :d",1
7,@jaymcguiness :-( please notice me,0
8,babe :(,0
9,@barbaranadel @kirkdalebooks definitely :),1
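
Instead of deleting URLs outright, it can be useful to replace them with a placeholder so that the model still sees that a link was present. The following is a minimal sketch reusing the same regular expression as above; the function name `replace_urls` and the `<url>` token are assumptions made for illustration.

```python
import re

# Same pattern as in the cell above: http(s) links or anything starting with "www."
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def replace_urls(text, token=''):
    """Replace every URL with `token` (empty string removes it)."""
    return URL_PATTERN.sub(token, text)


print(replace_urls("stats via http://t.co/4QiS2e3DaM. thanks", "<url>"))
```

Note that `\S+` is greedy up to the next whitespace, so punctuation directly attached to the URL (the trailing dot in the example) is consumed as well.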


### E-mail

E-mail IDs are common in customer feedback data and usually do not provide any useful information, so we remove them from the text.

The Twitter data that we are using does not contain any e-mail IDs. Hence, the code snippet below uses a dummy example to remove e-mail IDs.

In [136]:
text = 'I have being trying to contact xyz via email to xyz@abc.co.in but there is no response.'
re.sub(r'\S+@\S+', '', text)

'I have being trying to contact xyz via email to  but there is no response.'
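
The pattern `\S+@\S+` removes everything attached to the address, including a trailing comma or full stop. If that matters, a somewhat stricter pattern can be used. The sketch below is one possible variant, not a complete e-mail validator; `EMAIL_PATTERN` and `remove_emails` are hypothetical names.

```python
import re

# local part, "@", then a domain with at least one dot-separated label
EMAIL_PATTERN = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')


def remove_emails(text):
    """Remove e-mail addresses but keep punctuation that follows them."""
    return EMAIL_PATTERN.sub('', text)


print(remove_emails("contact xyz@abc.co.in, thanks."))
```

A bare Twitter mention such as `@user` is left untouched by this pattern because it has no local part before the `@`.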

### Date

Dates can be represented in various formats, which can make them difficult to remove. They are unlikely to contain any useful information for predicting the labels.

Below I have used dummy text to showcase this task.

In [137]:
text = "Today is 22/12/2020 and after two days on 24-12-2020 our vacation starts until 25th.09.2021"

# 1. Remove date formats like: dd/mm/yy(yy), dd-mm-yy(yy), dd(st|nd|rd).mm/yy(yy)
re.sub(r'\d{1,2}(st|nd|rd|th)?[-./]\d{1,2}[-./]\d{2,4}', '', text)

'Today is  and after two days on  our vacation starts until '

In [138]:
text = "Today is 11th of January, 2021 when I am writing this post. I hope to post this by February 15th or max to max by 20 may 21 or 20th-December-21"

# 2. Remove date formats like: 20 apr 21, April 15th, 11th of April, 2021
pattern = re.compile(r'(\d{1,2})?(st|nd|rd|th)?[-./,]?\s?(of)?\s?([Jj]an(uary)?|[Ff]eb(ruary)?|[Mm]ar(ch)?|[Aa]pr(il)?|[Mm]ay|[Jj]un(e)?|[Jj]ul(y)?|[Aa]ug(ust)?|[Ss]ep(tember)?|[Oo]ct(ober)?|[Nn]ov(ember)?|[Dd]ec(ember)?)\s?(\d{1,2})?(st|nd|rd|th)?\s?[-./,]?\s?(\d{2,4})?')
pattern.sub(r'', text)

'Today is  when I am writing this post. I hope to post this byor max to max by or '

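Dates come in many formats, and the regex above can be customized in many ways. In the output above, "by" and "or" were merged into "byor" because the optional whitespace (`\s?`) on both sides of "February 15th" is part of the match and is removed along with the date. The month alternatives are also not anchored to word boundaries, so the pattern matches inside words such as "market". Adapt the expression to your needs.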
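
One possible customization is to remove a numeric candidate only when it is a real calendar date, so that version strings or other number groups are kept. The sketch below assumes day-month-year order with a four-digit year; `NUMERIC_DATE` and `remove_valid_dates` are hypothetical names.

```python
import re
from datetime import datetime

# dd/mm/yyyy, dd-mm-yyyy or dd.mm.yyyy (day-month-year order is an assumption)
NUMERIC_DATE = re.compile(r'\b(\d{1,2})[-./](\d{1,2})[-./](\d{4})\b')


def remove_valid_dates(text):
    """Remove numeric dates, keeping matches that are not valid calendar dates."""
    def _replace(match):
        day, month, year = (int(group) for group in match.groups())
        try:
            datetime(year, month, day)
        except ValueError:
            return match.group(0)  # e.g. 99.99.2020 is not a date: keep it
        return ''
    return NUMERIC_DATE.sub(_replace, text)


print(remove_valid_dates("Today is 22/12/2020, build 99.99.2020 stays"))
```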

### HTML Tags

If we are extracting data from websites, the data may also contain HTML tags. These tags do not provide any useful information and should be removed. They can be removed using a regex or with the BeautifulSoup library.

In [141]:
# Dummy text
text = """
<title>Below is a dummy html code.</title>
<body>
    <p>All the html opening and closing brackets should be remove.</p>
    <a href="https://www.abc.com">Company Site</a>
</body>
"""

In [142]:
# Using regex to remove html tags
pattern = re.compile('<.*?>')
pattern.sub('', text)

'\nBelow is a dummy html code.\n\n    All the html opening and closing brackets should be remove.\n    Company Site\n\n'

In [143]:
# Using Beautiful Soup
def remove_html(text):
    clean_text = BeautifulSoup(text).get_text()
    return clean_text

In [144]:
remove_html(text)

'Below is a dummy html code.\n\nAll the html opening and closing brackets should be remove.\nCompany Site\n\n'
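
Besides tags, scraped text and tweets often contain HTML entities such as `&amp;` or `&lt;` (both appear in this dataset). If they are not decoded, later punctuation removal turns `&amp;` into the stray word `amp`. The following sketch uses only the standard library to drop tags and decode entities; `_TextExtractor` and `strip_html` are hypothetical names, and this is not the approach used in the cells above.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; convert_charrefs=True decodes entities like &amp;."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content with tags removed and entities decoded."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_html('<p>follow &amp; follow u back</p>'))
```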

In [147]:
df.tail(50)

Unnamed: 0,text,label
9950,@lawrenceispichu oh gosh what did you say? and aw hun :( *cuddles*,0
9951,@diongzons follow @jnlazts &amp; follow u back :),1
9952,@cath_tyldesley @yourstylist it is always good to stand out from the crowd :),1
9953,so much editing to do. so little time. :-(,0
9954,@iloveksoo we almost got to see his cute ankles but his socks :(,0
9955,@lukebryanonline yayyyy!!! i hope it's not while i am knocked out by anesthesia. i will be so sad if i miss it :(,0
9956,@tmhsposey that was the account that got hacked :(,0
9957,shadowplaylouis we recently got this mutual and i'm really loving your tweets and everything your about so yay that's exciting :),1
9958,@aangelayap easy :d,1
9959,phone fell out of my hand and it woke me up :(,0


## Emoji handling with emoji library


In [148]:
import emoji

In [149]:
emoji.replace_emoji("game is on 🔥🔥. Hilarious😂")

'game is on . Hilarious'

In [150]:
emoji.demojize("game is on 🔥🔥. Hilarious😂")

'game is on :fire::fire:. Hilarious:face_with_tears_of_joy:'

### Emojis

As more and more people use social media, emojis play a very crucial role. Emojis are used to express emotions that are widely understood.

In some analyses, such as sentiment analysis, emojis can be useful: we can convert them to words or create new features based on them. For other analyses we need to remove them. The code snippet below removes emojis with a regular expression.

In [151]:
#collapse-hide
# Reference: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b

def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # caution: very broad, also covers CJK and other scripts
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"  # zero width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [152]:
text = "game is on 🔥🔥. Hilarious😂"
remove_emoji(text)

'game is on . Hilarious'
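
When emojis should be converted to words rather than removed, `emoji.demojize` shown earlier is the natural choice. As a dependency-free illustration of the same idea, the sketch below uses the Unicode character names from the standard library. It is a heuristic only: it treats every character in the Unicode category `So` (Symbol, other) as an emoji, which also includes symbols such as the copyright sign. `emoji_to_words` is a hypothetical name.

```python
import unicodedata


def emoji_to_words(text):
    """Replace 'Symbol, other' characters by their lowercased Unicode name."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == 'So':
            name = unicodedata.name(ch, '').lower().replace(' ', '_')
            pieces.append(' ' + name + ' ')  # pad so names do not glue to words
        else:
            pieces.append(ch)
    # collapse the padding spaces again
    return ' '.join(''.join(pieces).split())


print(emoji_to_words("game is on 🔥🔥. Hilarious😂"))
# game is on fire fire . Hilarious face_with_tears_of_joy
```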

In [154]:
# Remove emoji's from text
df.text = df.text.apply(lambda x: emoji.replace_emoji(x))

In [155]:
df.tail(50)

Unnamed: 0,text,label
9950,@lawrenceispichu oh gosh what did you say? and aw hun :( *cuddles*,0
9951,@diongzons follow @jnlazts &amp; follow u back :),1
9952,@cath_tyldesley @yourstylist it is always good to stand out from the crowd :),1
9953,so much editing to do. so little time. :-(,0
9954,@iloveksoo we almost got to see his cute ankles but his socks :(,0
9955,@lukebryanonline yayyyy!!! i hope it's not while i am knocked out by anesthesia. i will be so sad if i miss it :(,0
9956,@tmhsposey that was the account that got hacked :(,0
9957,shadowplaylouis we recently got this mutual and i'm really loving your tweets and everything your about so yay that's exciting :),1
9958,@aangelayap easy :d,1
9959,phone fell out of my hand and it woke me up :(,0


### Emoticons

Emojis and emoticons are different. Yes!!<br>
Emoticons represent facial expressions using keyboard characters such as letters, numbers, and punctuation marks, whereas emojis are small images.

Thanks to [Neel Shah](https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py) for curating a dictionary of emoticons and their description. We shall use this dictionary and remove the emoticons from our text.

In [156]:
#collapse-hide

EMOTICONS = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u":d":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, angry or pouting",
    u":-\(":"Frown, sad, angry or pouting",
    u":\(":"Frown, sad, angry or pouting",
    u":‑c":"Frown, sad, angry or pouting",
    u":c":"Frown, sad, angry or pouting",
    u":‑<":"Frown, sad, angry or pouting",
    u":<":"Frown, sad, angry or pouting",
    u":‑\[":"Frown, sad, angry or pouting",
    u":\[":"Frown, sad, angry or pouting",
    u":-\|\|":"Frown, sad, angry or pouting",
    u">:\[":"Frown, sad, angry or pouting",
    u":\{":"Frown, sad, angry or pouting",
    u":@":"Frown, sad, angry or pouting",
    u">:\(":"Frown, sad, angry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":\$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###..":"Being sick",
    u":###..":"Being sick",
    u"<:‑\|":"Dump",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad or Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=	":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(\・\・?":"Confusion",
    u"\(\?_\?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surprised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}

In [157]:
# Build the alternation and compile it once instead of on every call
EMOTICONS_PATTERN = re.compile(u'(' + u'|'.join(emo for emo in EMOTICONS) + u')')

def remove_emoticons(text):
    return EMOTICONS_PATTERN.sub(r'', text)

In [158]:
remove_emoticons("Hello :->")

'Hello '
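
For sentiment analysis it may be preferable to convert emoticons to words instead of removing them. The sketch below illustrates this with a deliberately tiny, hypothetical mapping (`SMALL_EMOTICONS`) written as plain strings; `re.escape` takes care of the regex special characters, and longer emoticons are tried first so that `:-)` is not cut short by `:)`.

```python
import re

# Tiny illustrative mapping (assumption); plain strings, not regex patterns
SMALL_EMOTICONS = {
    ":-)": "happy",
    ":)": "happy",
    ":-(": "sad",
    ":(": "sad",
    ":D": "laughing",
}

# Escape each key and put longer emoticons first in the alternation
_EMOTICON_RE = re.compile(
    '|'.join(re.escape(e) for e in sorted(SMALL_EMOTICONS, key=len, reverse=True))
)


def emoticons_to_words(text):
    """Replace each known emoticon by its description."""
    return _EMOTICON_RE.sub(lambda m: SMALL_EMOTICONS[m.group(0)], text)


print(emoticons_to_words("great day :) but rain :-("))
```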

In [159]:
# Remove emoticons from text
df.text = df.text.apply(lambda x: remove_emoticons(x))

In [160]:
df.tail(50)

Unnamed: 0,text,label
9950,@lawrenceispichu oh gosh what did you say? and aw hun *cuddles*,0
9951,@diongzons follow @jnlazts &amp; follow u back,1
9952,@cath_tyldesley @yourstylist it is always good to stand out from the crowd,1
9953,so much editing to do. so little time.,0
9954,@iloveksoo we almost got to see his cute ankles but his socks,0
9955,@lukebryanonline yayyyy!!! i hope it's not while i am knocked out by anesthesia. i will be so sad if i miss it,0
9956,@tmhsposey that was the account that got hacked,0
9957,shadowplaylouis we recently got this mutual and i'm really loving your tweets and everything your about so yay that's exciting,1
9958,@aangelayap easy,1
9959,phone fell out of my hand and it woke me up,0


### Hashtags and Mentions

We habitually use hashtags and mentions in tweets, either to indicate the context or to bring something to an individual's attention. Hashtags can be used to extract features, to see what's trending, and in various other applications.

Since we don't require them here, we'll remove them.

In [161]:
def remove_tags_mentions(text):
    pattern = re.compile(r'(@\S+|#\S+)')
    return pattern.sub('', text)

In [162]:
text = "live @flippinginja on #younow - jonah and jareddddd"
remove_tags_mentions(text)

'live  on  - jonah and jareddddd'
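
If hashtags or mentions are needed later as features, they can be extracted before being removed. A minimal sketch, assuming that a tag or handle consists of word characters only; `extract_tags_mentions` is a hypothetical helper.

```python
import re

HASHTAG = re.compile(r'#(\w+)')
MENTION = re.compile(r'@(\w+)')


def extract_tags_mentions(text):
    """Return (hashtags, mentions) found in the text, without the # and @ signs."""
    return HASHTAG.findall(text), MENTION.findall(text)


print(extract_tags_mentions("live @flippinginja on #younow - jonah and jareddddd"))
```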

In [163]:
# Remove hashtags and mentions
df.text = df.text.apply(lambda x: remove_tags_mentions(x))

In [164]:
df.tail(50)

Unnamed: 0,text,label
9950,oh gosh what did you say? and aw hun *cuddles*,0
9951,follow &amp; follow u back,1
9952,it is always good to stand out from the crowd,1
9953,so much editing to do. so little time.,0
9954,we almost got to see his cute ankles but his socks,0
9955,yayyyy!!! i hope it's not while i am knocked out by anesthesia. i will be so sad if i miss it,0
9956,that was the account that got hacked,0
9957,shadowplaylouis we recently got this mutual and i'm really loving your tweets and everything your about so yay that's exciting,1
9958,easy,1
9959,phone fell out of my hand and it woke me up,0


### Punctuations

Punctuation marks are the symbol characters that are neither letters, digits, nor whitespace. In Python's `string.punctuation` these are `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

It is better to remove or convert emoticons before removing punctuation; if we do it the other way around, we lose the emoticons from the text. Similarly, if the text contains $10.50, removing the `.` (dot) makes the value lose its meaning.

In [165]:
PUNCTUATIONS = string.punctuation

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCTUATIONS))
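
When amounts such as $10.50 should survive, punctuation can be removed token by token while leaving number-like tokens untouched. This is a rough sketch under the assumption that tokens are separated by whitespace; `NUMBER` and `remove_punctuation_keep_numbers` are hypothetical names.

```python
import re
import string

# optional dollar sign, digits, and optional groups like ".50" or ",000"
NUMBER = re.compile(r'^\$?\d+(?:[.,]\d+)*$')
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation_keep_numbers(text):
    """Strip punctuation from every token except number-like ones."""
    cleaned = []
    for token in text.split():
        core = token.rstrip('.,!?;:')  # drop sentence punctuation after the token
        if NUMBER.match(core):
            cleaned.append(core)
        else:
            cleaned.append(token.translate(_PUNCT_TABLE))
    return ' '.join(tok for tok in cleaned if tok)


print(remove_punctuation_keep_numbers("it costs $10.50, really!!!"))
```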

In [166]:
PUNCTUATIONS

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [167]:
df.text = df["text"].apply(lambda text: remove_punctuation(text))

In [168]:
df.tail(50)

Unnamed: 0,text,label
9950,oh gosh what did you say and aw hun cuddles,0
9951,follow amp follow u back,1
9952,it is always good to stand out from the crowd,1
9953,so much editing to do so little time,0
9954,we almost got to see his cute ankles but his socks,0
9955,yayyyy i hope its not while i am knocked out by anesthesia i will be so sad if i miss it,0
9956,that was the account that got hacked,0
9957,shadowplaylouis we recently got this mutual and im really loving your tweets and everything your about so yay thats exciting,1
9958,easy,1
9959,phone fell out of my hand and it woke me up,0


### Stopwords

Stopwords are commonly occurring words in a language. In English, for example, these are 'the', 'a', 'an', and many more. In most cases they are not useful and can be removed.

There are certain tasks in which these words are useful, such as Part-of-Speech (POS) tagging and language translation. Stopword lists are compiled for many languages; for English we can use the list from the nltk package.

In [170]:
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in STOPWORDS])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
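
NLTK's English list includes negations such as 'not' and 'no', which often matter for sentiment analysis. One option is to subtract them from the stopword set before filtering. The sketch below uses a tiny stand-in list so that it runs without downloads; `BASE_STOPWORDS`, `NEGATIONS` and `remove_stopwords_keep_negations` are hypothetical names.

```python
# Tiny stand-in list (assumption); in the notebook this would be
# set(stopwords.words('english')) from nltk.
BASE_STOPWORDS = {'this', 'is', 'a', 'the', 'it', 'not', 'no'}

# Words we want to keep because they flip the sentiment
NEGATIONS = {'not', 'no'}
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS


def remove_stopwords_keep_negations(text):
    """Remove stopwords but keep negation words."""
    return ' '.join(w for w in text.split() if w not in SENTIMENT_STOPWORDS)


print(remove_stopwords_keep_negations("this is not a good movie"))
```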


In [171]:
# Remove stopwords
df.text = df.text.apply(lambda text: remove_stopwords(text))
df.head(20)

Unnamed: 0,text,label
0,aww poor thing hope itokay good health luckely freed rocks,0
1,hi ive spoken store advised paypoint card topped thanks beth,1
2,could never go ever seeing cruel,0
3,bus late time get food work,0
4,another ridiculous headache,0
5,kik oulive70748,0
6,beyond words sooooooo freaking good im going die comes,1
7,please notice,0
8,babe,0
9,definitely,1


In [172]:
df.tail(20)

Unnamed: 0,text,label
9980,sorry could schedule change,0
9981,lead us youre gone,0
9982,still awake,1
9983,need,0
9984,amazing news enjoy toy,1
9985,followed thanks please followed,0
9986,youre welcome,1
9987,walk barely according good hope got home ok,1
9988,turning fair meant lot people came,1
9989,guys great weekend,1


### Numbers

We may remove numbers if they are not useful in our analysis. In some domains, such as finance, numbers are very useful and should be kept.

In [173]:
df.text = df.text.str.replace(r'\d+', '', regex=True)
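
A middle ground between keeping and deleting numbers is to replace each one with a placeholder, so the model still knows that a number occurred. A minimal sketch; the function name and the `<num>` token are assumptions, and the placeholder would have to be inserted after punctuation removal or written without punctuation characters.

```python
import re

# integers and simple decimals such as 19 or 10.50
NUMBER_PATTERN = re.compile(r'\d+(?:\.\d+)?')


def replace_numbers(text, token='<num>'):
    """Replace every number with a placeholder token."""
    return NUMBER_PATTERN.sub(token, text)


print(replace_numbers("nearly 19 in 2 months"))
```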

In [175]:
df.tail(50)

Unnamed: 0,text,label
9950,oh gosh say aw hun cuddles,0
9951,follow amp follow u back,1
9952,always good stand crowd,1
9953,much editing little time,0
9954,almost got see cute ankles socks,0
9955,yayyyy hope knocked anesthesia sad miss,0
9956,account got hacked,0
9957,shadowplaylouis recently got mutual im really loving tweets everything yay thats exciting,1
9958,easy,1
9959,phone fell hand woke,0


### Extra whitespaces

After preprocessing, the text may contain extra whitespace created by transforming and removing various characters. We also need to remove newline and tab characters from the text.

In [176]:
def remove_whitespaces(text):
    return " ".join(text.split())

In [177]:
text = "  Whitespaces in the beginning are removed  \t as well \n  as in between  the text   "

clean_text = " ".join(text.split())
clean_text

'Whitespaces in the beginning are removed as well as in between the text'
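`str.split()` with no argument splits on any run of whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace, which is why the one-liner works. For comparison, here is a regex version that should behave the same way; the helper name is hypothetical:

```python
import re

def remove_whitespaces_regex(text):
    # Collapse each run of whitespace into one space, then trim both ends
    return re.sub(r"\s+", " ", text).strip()

sample = "  extra \t spaces \n here  "
print(remove_whitespaces_regex(sample))  # extra spaces here
print(" ".join(sample.split()))          # extra spaces here
```

Note that `sample.split(" ")` with an explicit space argument behaves differently: it keeps empty strings and does not treat tabs or newlines as separators.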

In [178]:
df.text = df.text.apply(lambda x: remove_whitespaces(x))

In [179]:
df.tail(30)

Unnamed: 0,text,label
9970,added video playlist im back twitch today going league,1
9971,love tell every crowd theyre loudest loudest fans world arent x,1
9972,dd leave lagi,0
9973,vidcon,0
9974,love,1
9975,welcome aw go back next month rn tryna get healthy,0
9976,happy birthday,1
9977,guess blocking number phones spam sms feature wont work using specific words since guys write sinhala english,0
9978,need come back england,0
9979,wanna watch maybe comeback shows korea japan wanna visit country,0



### Frequent words

Previously we removed stopwords, which are common across a language. When working in a specific domain, we can also remove words that are common in that domain and therefore carry little information.

In [180]:
from collections import Counter

def freq_words(text, top_n=10):
    # Count every token in the text and return the top_n most common ones
    tokens = nltk.word_tokenize(text)
    counter = Counter(tokens)  # local counter, so repeated calls don't accumulate
    return [word for word, _ in counter.most_common(top_n)]

def remove_fw(text, FrequentWords):
    # Drop every token that appears in FrequentWords and re-join the rest
    tokens = nltk.word_tokenize(text)
    without_fw = [word for word in tokens if word not in FrequentWords]
    return ' '.join(without_fw)

In [181]:
text = """
Natural Language Processing is the technology used to aid computers to understand the human’s natural language. It’s not an easy task teaching machines to understand how we communicate. Leand Romaf, an experienced software engineer who is passionate at teaching people how artificial intelligence systems work, says that “in recent years, there have been significant breakthroughs in empowering computers to understand language just as we do.” This article will give a simple introduction to Natural Language Processing and how it can be achieved. Natural Language Processing, usually shortened as NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of the human languages in a manner that is valuable. Most NLP techniques rely on machine learning to derive meaning from human languages.
"""

In [182]:
nltk.download('punkt')
FrequentWords = freq_words(text)
print(FrequentWords)

[',', 'to', '.', 'is', 'the', 'understand', 'Natural', 'Language', 'Processing', 'computers']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [183]:
fw_result = remove_fw(text, FrequentWords)
fw_result

'technology used aid human ’ s natural language It ’ s not an easy task teaching machines how we communicate Leand Romaf an experienced software engineer who passionate at teaching people how artificial intelligence systems work says that “ in recent years there have been significant breakthroughs in empowering language just as we do. ” This article will give a simple introduction and how it can be achieved usually shortened as NLP a branch of artificial intelligence that deals with interaction between and humans using natural language The ultimate objective of NLP read decipher and make sense of human languages in a manner that valuable Most NLP techniques rely on machine learning derive meaning from human languages'
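The example above counts words in a single text, and because that text still contains punctuation, tokens such as `,` and `.` show up among the frequent words. For a dataset, the counts would normally be gathered across all documents. A minimal sketch using only `collections.Counter` and whitespace tokenization; the function names and the tiny corpus are assumptions for illustration:

```python
from collections import Counter

def corpus_frequent_words(documents, top_n=3):
    # Accumulate token counts over every document in the corpus
    counter = Counter()
    for doc in documents:
        counter.update(doc.split())
    return [word for word, _ in counter.most_common(top_n)]

def drop_words(text, words):
    # A set makes the membership test cheap for long word lists
    banned = set(words)
    return " ".join(w for w in text.split() if w not in banned)

docs = ["stock price rose today", "stock price fell today", "stock split announced"]
frequent = corpus_frequent_words(docs, top_n=2)
print(frequent)                                  # ['stock', 'price']
print([drop_words(d, frequent) for d in docs])
```

With a pandas column, the same idea would apply by passing the column values as `documents` and applying `drop_words` to each row.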

### Rare words

Rare words are handled in a similar way to frequent words. We can remove them because they occur so infrequently that they add little value to the analysis.

In [184]:
def rare_words(text):
    # tokenization
    tokens = nltk.word_tokenize(text)
    counter = Counter(tokens)  # count how often each token occurs

    number_rare_words = 10
    # most_common() sorts by descending count, so the rarest words sit at the end.
    # Note: the slice [:-n:-1] walks backwards from the last entry and stops
    # before index -n, so it yields n - 1 words (9 here).
    RareWords = [word for word, _ in counter.most_common()[:-number_rare_words:-1]]
    return RareWords

def remove_rw(text, RareWords):
    tokens = nltk.word_tokenize(text)
    # keep only the tokens that are not in the rare-word list
    without_rw = [word for word in tokens if word not in RareWords]
    return ' '.join(without_rw)

In [185]:
text = """
Natural Language Processing is the technology used to aid computers to understand the human’s natural language. It’s not an easy task teaching machines to understand how we communicate. Leand Romaf, an experienced software engineer who is passionate at teaching people how artificial intelligence systems work, says that “in recent years, there have been significant breakthroughs in empowering computers to understand language just as we do.” This article will give a simple introduction to Natural Language Processing and how it can be achieved. Natural Language Processing, usually shortened as NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of the human languages in a manner that is valuable. Most NLP techniques rely on machine learning to derive meaning from human languages.
"""

In [186]:
RareWords = rare_words(text)
RareWords

['from',
 'meaning',
 'derive',
 'learning',
 'machine',
 'on',
 'rely',
 'techniques',
 'Most']

In [187]:
rw_result = remove_fw(text, RareWords)
rw_result

'Natural Language Processing is the technology used to aid computers to understand the human ’ s natural language . It ’ s not an easy task teaching machines to understand how we communicate . Leand Romaf , an experienced software engineer who is passionate at teaching people how artificial intelligence systems work , says that “ in recent years , there have been significant breakthroughs in empowering computers to understand language just as we do. ” This article will give a simple introduction to Natural Language Processing and how it can be achieved . Natural Language Processing , usually shortened as NLP , is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language . The ultimate objective of NLP is to read , decipher , understand , and make sense of the human languages in a manner that is valuable . NLP to human languages .'
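Taking a fixed number of words from the end of the count list is somewhat arbitrary when many words share the same count. An alternative is to treat every word below a minimum count as rare. The sketch below assumes whitespace tokenization and a hypothetical `min_count` parameter:

```python
from collections import Counter

def words_below_count(documents, min_count=2):
    # Count tokens across all documents, then keep those seen fewer than min_count times
    counter = Counter()
    for doc in documents:
        counter.update(doc.split())
    return {word for word, count in counter.items() if count < min_count}

docs = ["good movie good cast", "good plot weak cast", "zzzz good"]
rare = words_below_count(docs, min_count=2)
print(sorted(rare))  # ['movie', 'plot', 'weak', 'zzzz']
```

The returned set can be passed to a removal function in the same way as the rare-word list above.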

## Conversion of Emoji to Words

Whether to remove emojis depends on the purpose of the application. For example, if we are building a sentiment analysis system, emojis can be useful.

"The movie was 🔥"
or
"The movie was 💩"

If we remove the emojis, the sentiment of the two sentences can no longer be told apart. In these cases we can convert the emojis to words.

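Earlier versions of demoji required an initial data download from the Unicode Consortium's [emoji code repository](http://unicode.org/Public/emoji/12.0/emoji-test.txt).

In those versions, download_codes() had to be called on first use of the package.<br>
It stored the Unicode hex-notated symbols at ~/.demoji/codes.json for future use. From demoji 1.0 onward the emoji data ships with the package and download_codes() is deprecated, so the call below is only needed on older versions.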

Read more about demoji on [pypi.org](https://pypi.org/project/demoji/)

In [None]:
demoji.download_codes()

  demoji.download_codes()


In [188]:
def emoji_to_words(text):
    return demoji.replace_with_desc(text, sep="__")

In [189]:
text = "game is on 🔥 🚣🏼"
emoji_to_words(text)

'game is on __fire__ __person rowing boat: medium-light skin tone__'

In [190]:
text = "Hey there!! :-)"
emoji_to_words(text)

'Hey there!! :-)'
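As a rough illustration of what such a conversion involves, the standard library's `unicodedata` module can look up the official Unicode name of a single character. This is a simplified sketch and not how demoji works internally: it only handles single code points in the Unicode category `So` (symbol, other), so skin-tone modifiers and multi-character emoji sequences are not described properly. The function name is hypothetical.

```python
import unicodedata

def emoji_to_names(text, sep="__"):
    pieces = []
    for ch in text:
        # Most pictographic emojis fall in the "So" (Symbol, other) category
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            pieces.append(f"{sep}{name.lower()}{sep}" if name else ch)
        else:
            pieces.append(ch)
    return "".join(pieces)

print(emoji_to_names("The movie was 🔥"))  # The movie was __fire__
print(emoji_to_names("The movie was 💩"))  # The movie was __pile of poo__
print(emoji_to_names("Hey there!! :-)"))   # unchanged: emoticons are plain punctuation
```

The last call shows the same behaviour as above: a text emoticon such as `:-)` is not an emoji, so it needs a separate conversion step.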

## Conversion of Emoticons to Words

As with emojis, we convert emoticons to words so that the sentiment they carry is preserved.

In [191]:
def emoticons_to_words(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").replace(":","").split()), text)
    return text

In [192]:
text = "Hey there!! :-)"
emoticons_to_words(text)

'Hey there!! Happy_face_smiley'
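The function above passes each key of `EMOTICONS` straight to `re.sub`, so it assumes the keys are already valid regular expressions. If the mapping holds plain strings such as `:-)`, the keys need escaping, because characters like `)` and `(` have a special meaning in a regex. A minimal sketch with a hypothetical two-entry mapping:

```python
import re

# Hypothetical plain-string mapping, used only for this illustration
EMOTICON_MEANINGS = {":-)": "Happy face smiley", ":-(": "Frown, sad"}

# Longest emoticons first, so a longer emoticon is never matched piecewise
_ordered = sorted(EMOTICON_MEANINGS, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(e) for e in _ordered))

def emoticons_to_words_simple(text):
    def replace(match):
        meaning = EMOTICON_MEANINGS[match.group(0)]
        # "Frown, sad" -> "Frown_sad"
        return "_".join(meaning.replace(",", "").split())
    return _pattern.sub(replace, text)

print(emoticons_to_words_simple("Hey there!! :-)"))  # Hey there!! Happy_face_smiley
```

Compiling one combined pattern also means the text is scanned once instead of once per emoticon.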

In [85]:
!pip install emot

Collecting emot
  Downloading emot-3.1-py3-none-any.whl (61 kB)
Installing collected packages: emot
Successfully installed emot-3.1


In [193]:
import emot

In [194]:
emot_obj = emot.core.emot()

In [195]:
text = "I love python ☮ 🙂 ❤ :-) :-( :-)))"

In [196]:
emot_obj.emoji(text)

{'value': ['☮', '🙂', '❤'],
 'location': [[14, 15], [16, 17], [18, 19]],
 'mean': [':peace_symbol:', ':slightly_smiling_face:', ':red_heart:'],
 'flag': True}

In [197]:
emot_obj.emoticons(text)

{'value': [':-)', ':-(', ':-)))'],
 'location': [[20, 23], [24, 27], [28, 33]],
 'mean': ['Happy face smiley',
  'Frown, sad, andry or pouting',
  'Very very Happy face or smiley'],
 'flag': True}
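The `location` and `mean` lists in this result can be used to replace each emoticon with its meaning. The sketch below works on a dictionary of the shape shown above (copied from that output rather than produced by calling the library), and the helper name is hypothetical:

```python
def replace_by_location(text, detected):
    # Work from the last span to the first so earlier offsets stay valid
    spans = sorted(zip(detected["location"], detected["mean"]), reverse=True)
    for (start, end), meaning in spans:
        label = "_".join(meaning.replace(",", "").split())
        text = text[:start] + label + text[end:]
    return text

text = "I love python ☮ 🙂 ❤ :-) :-( :-)))"
detected = {
    "value": [":-)", ":-(", ":-)))"],
    "location": [[20, 23], [24, 27], [28, 33]],
    "mean": ["Happy face smiley", "Frown, sad, andry or pouting",
             "Very very Happy face or smiley"],
}
print(replace_by_location(text, detected))
```

Replacing from right to left matters because each replacement changes the length of the string, which would shift the offsets of any span to its right.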

## Converting Numbers to Words

If our analysis requires the information carried by numbers in the text, we can convert them to words.

Read more about num2words on [github](https://github.com/savoirfairelinux/num2words)

In [198]:
def nums_to_words(text):
    new_text = []
    for word in text.split():
        if word.isdigit():
            new_text.append(num2words(word))
        else:
            new_text.append(word)
    return " ".join(new_text)

In [199]:
text = "I ran this track 30 times"
nums_to_words(text)

'I ran this track thirty times'
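Because the function checks `word.isdigit()` on whitespace-separated tokens, a number with punctuation attached (for example `30,`) is left unchanged. One possible variation is to match digit runs with a regex instead. To keep the sketch self-contained, a tiny stand-in converter for 0 to 99 replaces `num2words`; both helper names are hypothetical, and larger numbers are left as they are:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def small_number_to_words(n):
    # Spell out 0-99; a stand-in for num2words in this sketch
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def nums_to_words_regex(text):
    def replace(match):
        n = int(match.group(0))
        return small_number_to_words(n) if n < 100 else match.group(0)
    # \b\d+\b finds whole digit runs even when punctuation follows them
    return re.sub(r"\b\d+\b", replace, text)

print(nums_to_words_regex("I ran 30 laps, then 5."))  # I ran thirty laps, then five.
```

A caveat of this pattern is that a decimal such as `3.5` would be treated as two separate numbers, so decimals need their own handling if they occur in the data.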

## Chat words Conversion

With the rise of social media, people often avoid typing out whole words or phrases. This has given rise to slang abbreviations such as "omg", which stands for "Oh my god". Left abbreviated, such words carry little information for a model, so if we need them we have to expand them.

Thanks to this [GitHub repo](https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt) for the list of slang words.

In [200]:
#collapse-hide
chat_words = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=ILU: I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
QPSA?=Que Pasa?
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
OMG=Oh my god"""

In [201]:
chat_words_dict = dict()
chat_words_set = set()

def cw_conversion(text):
    new_text = []
    for word in text.split():
        if word.upper() in chat_words_set:
            new_text.append(chat_words_dict[word.upper()])
        else:
            new_text.append(word)
    return " ".join(new_text)

for line in chat_words.split('\n'):
    if line != '':
        cw, cw_expanded = line.split('=')[0], line.split('=')[1]

        chat_words_set.add(cw)
        chat_words_dict[cw] = cw_expanded
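`cw_conversion` splits the text on whitespace, so an abbreviation with punctuation attached, such as "omg!", is not found in the dictionary. One possible variation is to match word tokens with a regex and look each one up. The sketch uses a small hypothetical dictionary so that it runs on its own:

```python
import re

# Hypothetical subset of the slang list, for illustration only
CHAT_WORDS = {"OMG": "Oh my god", "BRB": "Be Right Back", "IMO": "In My Opinion"}

def expand_chat_words(text):
    def replace(match):
        token = match.group(0)
        # Look up the upper-cased token; keep the token if it is not slang
        return CHAT_WORDS.get(token.upper(), token)
    return re.sub(r"\b\w+\b", replace, text)

print(expand_chat_words("omg! brb, that's awesome"))
# Oh my god! Be Right Back, that's awesome
```

Keys that contain non-word characters, such as `QPSA?` in the full list, would not be matched by `\w+` and would need separate handling.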

In [202]:
text = "omg that's awesome."
cw_conversion(text)

"Oh my god that's awesome."

## Expanding Contractions

Contractions are words or combinations of words created by dropping a few letters and replacing those letters by an apostrophe.

Example:
- don't: do not
- we'll: we will

Our NLP models don't understand these contractions, i.e. they don't know that "don't" and "do not" mean the same thing. If the problem statement requires it, we can expand them; otherwise we leave them as they are.

In [203]:
def expand_contractions(text):
    expanded_text = []
    for line in text:
        expanded_text.append(contractions.fix(line))
    return expanded_text

In [204]:
text = ["I'll be there within 15 minutes.", "It's awesome to meet your new friends."]
expand_contractions(text)

['I will be there within 15 minutes.',
 'It is awesome to meet your new friends.']
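Conceptually, expanding contractions is a dictionary lookup. The sketch below is a simplified stand-in, not the `contractions` library's implementation, using a hypothetical four-entry mapping and the standard `re` module:

```python
import re

# Hypothetical, deliberately tiny mapping
CONTRACTION_MAP = {"don't": "do not", "we'll": "we will",
                   "i'll": "I will", "it's": "it is"}

_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    re.IGNORECASE,
)

def expand_contractions_simple(text):
    def replace(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Keep a leading capital, e.g. "We'll" -> "We will"
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _pattern.sub(replace, text)

print(expand_contractions_simple("We'll go, but don't wait."))
# We will go, but do not wait.
```

A plain lookup cannot resolve ambiguous contractions: "it's" may mean "it is" or "it has" depending on context, and this mapping always picks the first.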

## Stemming

In stemming we reduce a word to its base or root form by removing suffix characters from the word. It is one of the techniques used to normalize text.

Words that stem to the root word "like" include:
- "likes"
- "liked"
- "likely"
- "liking"

A stemmed word doesn't always match a word in the dictionary, for example:
- console -> consol
- company -> compani
- welcome -> welcom
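To see why suffix stripping can produce non-words, here is a toy stemmer that removes the first matching suffix from a short list. It is only an illustration of the general idea and is far cruder than the Porter algorithm; the suffix list and the minimum stem length of 3 are arbitrary assumptions:

```python
SUFFIXES = ("ing", "ly", "ed", "s")

def toy_stem(word):
    # Strip the first matching suffix, as long as at least 3 characters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["likes", "liked", "likely", "liking", "bus"]:
    print(word, "->", toy_stem(word))
# likes -> like, liked -> lik, likely -> like, liking -> lik, bus -> bus
```

Real stemmers add many more rules to bring such variants closer together, but the output is still a truncated string rather than a guaranteed dictionary word.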

For this reason, stemming is not performed in every NLP task.

There are various algorithms used for stemming but the most widely used is PorterStemmer. In this post we have used the PorterStemmer as well.

In [205]:
stemmer = PorterStemmer()

def stem_words(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

In [206]:
df['text_stemmed'] = df.text.apply(lambda text: stem_words(text))
df[['text', 'text_stemmed']].head()

Unnamed: 0,text,text_stemmed
0,aww poor thing hope itokay good health luckely freed rocks,aww poor thing hope itokay good health luck freed rock
1,hi ive spoken store advised paypoint card topped thanks beth,hi ive spoken store advis paypoint card top thank beth
2,could never go ever seeing cruel,could never go ever see cruel
3,bus late time get food work,bu late time get food work
4,another ridiculous headache,anoth ridicul headach


PorterStemmer works only for English. For other languages we can use SnowballStemmer.

In [207]:
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

## Lemmatization

Lemmatization performs a similar task to stemming, i.e. it reduces inflected words to their base form, but it uses a different approach.

Lemmatization takes the morphological analysis of the word into consideration. It tries to reduce each word to its dictionary form, which is known as the lemma.

In [208]:
lemmatizer = WordNetLemmatizer()

def text_lemmatize(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

In [209]:
nltk.download('wordnet')
df['text_lemmatized'] = df.text.apply(lambda text: text_lemmatize(text))
df[['text', 'text_stemmed', 'text_lemmatized']].head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,text_stemmed,text_lemmatized
0,aww poor thing hope itokay good health luckely freed rocks,aww poor thing hope itokay good health luck freed rock,aww poor thing hope itokay good health luckely freed rock
1,hi ive spoken store advised paypoint card topped thanks beth,hi ive spoken store advis paypoint card top thank beth,hi ive spoken store advised paypoint card topped thanks beth
2,could never go ever seeing cruel,could never go ever see cruel,could never go ever seeing cruel
3,bus late time get food work,bu late time get food work,bus late time get food work
4,another ridiculous headache,anoth ridicul headach,another ridiculous headache
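In the table above, "seeing" is left unchanged by the lemmatizer while the stemmer reduces it to "see". This is because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given; calling it with `pos='v'` would return "see". The toy lookup below uses a hypothetical hand-written table, not WordNet, to illustrate how the part of speech changes the result:

```python
# Hypothetical lemma table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("rocks", "n"): "rock",
    ("seeing", "v"): "see",
    ("advised", "v"): "advise",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("rocks"))        # rock
print(toy_lemmatize("seeing"))       # seeing (looked up as a noun)
print(toy_lemmatize("seeing", "v"))  # see
```

In practice, a part-of-speech tagger is often run first so that each word can be lemmatized with the right tag.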


Difference between Stemming and Lemmatization:

|Stemming | Lemmatization |
| ---- | ---- |
| Fast compared to lemmatization | Slow compared to stemming |
| Reduces the word to its base form by removing the suffix | Uses lexical knowledge to get the base form of the word |
| Does not always produce a meaningful or dictionary form of the original word | Resulting words are dictionary words when the word is found in the lexicon; unknown words are left unchanged |

## Spelling Correction

As humans, we all make mistakes. Incorrect spellings in text are commonly known as typos.

An NLP model doesn't know the difference between a correctly and an incorrectly spelled word: for the model, "thanks" and "thnks" are two different words. Spelling correction is therefore an important step for bringing misspelled words into their correct form.

In [210]:
spell = SpellChecker()

def correct_spelling(text):
    correct_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            # fall back to the original word if no correction is returned
            correct_text.append(spell.correction(word) or word)
        else:
            correct_text.append(word)
    return " ".join(correct_text)
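One way to picture what a spell checker does is to compare each unknown token with a vocabulary and pick the closest entry. The sketch below uses `difflib.get_close_matches` from the standard library with a small hypothetical vocabulary; it is not how pyspellchecker works internally, which relies on word frequencies and edit distance:

```python
import difflib

# Hypothetical vocabulary, for illustration only
VOCABULARY = ["how", "are", "you", "doing", "good", "thanks", "for", "asking"]

def correct_with_vocab(text, vocabulary=VOCABULARY):
    corrected = []
    for word in text.split():
        if word.lower() in vocabulary:
            corrected.append(word)
            continue
        # Take the single most similar vocabulary entry, if any is similar enough
        matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_with_vocab("hwo are you doin thnks"))  # how are you doing thanks
```

Like any vocabulary-based approach, this cannot fix a typo that happens to be a valid word, such as "god" written in place of "good".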

In [211]:
text = "Hi, hwo are you doin? I'm good thnks for asking"
correct_spelling(text)

"Hi, how are you doing I'm good thanks for asking"

In [215]:
text = "hwo are you doin? I'm god thnks"
correct_spelling(text)

"how are you doing I'm god thanks"

## Convert accented characters to ASCII characters

Accent marks (also referred to as diacritics or diacriticals) are small marks added to a letter, as in "é" or "ö"; on many keyboards they appear when a key is held down. If the task doesn't depend on them, they can be removed, because otherwise the model cannot tell that "dèèp" and "deep" are the same word and will treat them as two different words.

In [216]:
def accented_to_ascii(text):
    return unidecode.unidecode(text)

In [217]:
text = "This is an example text with accented characters like dèèp lèarning ánd cömputer vísíön etc."
accented_to_ascii(text)

'This is an example text with accented characters like deep learning and computer vision etc.'
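If installing `unidecode` is not an option, a similar effect for accented Latin letters can be obtained with the standard library: decompose each character into a base letter plus combining marks, then drop everything that is not ASCII. A minimal sketch; the function name is hypothetical:

```python
import unicodedata

def strip_accents(text):
    # NFKD splits "è" into "e" followed by a combining grave accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("dèèp lèarning ánd cömputer vísíön"))
# deep learning and computer vision
print(strip_accents("straße"))  # strae
```

The second call shows the main difference from `unidecode`: characters that have no decomposition, such as "ß", are dropped here instead of being transliterated.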

In [218]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [219]:
vectorizer = TfidfVectorizer()

In [220]:
X = vectorizer.fit_transform(df['text_lemmatized'])

In [221]:
y=df['label']

In [222]:
# Split X and y into train and test sets, then fit a logistic regression model

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)


In [223]:
# Evaluate the model on X_test: accuracy, recall, precision, F1 score, and confusion matrix

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_pred = logistic_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", precision)
print("F1 Score:", f1)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)


Accuracy: 0.7404
Recall: 0.7404
Precision: 0.7422980869565218
F1 Score: 0.7399861024960781
Confusion Matrix:
 [[972 271]
 [378 879]]
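The reported metrics can be checked by hand from the confusion matrix, in which rows are the true labels and columns are the predicted labels. The sketch below recomputes accuracy and the weighted averages with plain arithmetic; the variable names are chosen for this illustration:

```python
conf_matrix = [[972, 271], [378, 879]]
(tn, fp), (fn, tp) = conf_matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total

# Per-class precision: correct predictions of a class / all predictions of that class
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

# Support: number of true samples per class, used as weights
support_0 = tn + fp
support_1 = fn + tp

weighted_precision = (support_0 * precision_0 + support_1 * precision_1) / total

# Per-class recall: correct predictions of a class / true samples of that class
recall_0 = tn / support_0
recall_1 = tp / support_1
weighted_recall = (support_0 * recall_0 + support_1 * recall_1) / total

print(accuracy, weighted_precision, weighted_recall)
```

Weighting each class's recall by its support cancels the denominators, which is why the weighted recall always equals the accuracy, as it does in the output above.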


## Conclusion

This article explained most of the common text pre-processing techniques. I'll update this post as I learn more techniques to pre-process text.

Share if you liked it, comment if you loved it. Hope to see you guys in the next one. Peace!