# Text Processing Techniques in NLP

In this notebook, I apply several text preprocessing techniques that are commonly used in NLP before any analysis or model building. The techniques are:
1) Lowercasing
2) Removing html tags
3) Removing punctuations
4) Removing urls
5) Dealing with short forms
6) Spelling corrections
7) Removing stop-words
8) Tokenization
9) Stemming
10) Lemmatization

Each technique is demonstrated with a suitable example.

### 1) Getting the data ready

In [1]:
#importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import re

In [2]:
#loading the dataset
df = pd.read_csv('IMDB Dataset.csv')

In [3]:
#the dataset consists of two columns: review and sentiment
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

### 2) Lowercasing

In [5]:
text = df['review'][1]
text.lower()

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [6]:
df['review'] = df['review'].str.lower()

### 3) Removing html tags

In [7]:
def removeTags(text):
    pattern = '<.*?>'
    return re.sub(pattern,'',text)

In [8]:
df['review'] = df['review'].apply(removeTags)
df['review'][1]

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well done.'
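The regex above only deletes the tags. Two cases it does not cover are HTML entities such as `&amp;`, and tags that sit between two words with no surrounding spaces. Below is a minimal sketch of one way to handle both, using the standard library's `html.unescape`. The name `strip_html` is hypothetical and not part of the notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_html(text):
    # Replace each tag with a space so "end.<br />Next" does not become "end.Next"
    no_tags = TAG_PATTERN.sub(' ', text)
    # Decode entities such as &amp; or &quot; into plain characters
    decoded = html.unescape(no_tags)
    # Collapse the runs of whitespace left behind by the removed tags
    return re.sub(r'\s+', ' ', decoded).strip()

print(strip_html('A wonderful production.<br /><br />Tom &amp; Jerry'))
```

For markup that is badly malformed, a real HTML parser is usually a safer choice than a regex.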

### 4) Removing Punctuations

In [9]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
def removePuncs(text):
    puncs = string.punctuation
    for char in puncs:
        text = text.replace(char,'')
    return text

In [11]:
df['review'] = df['review'].apply(removePuncs)

In [12]:
#A much faster way to remove punctuations
def removePuncs1(text):
    return text.translate(str.maketrans('','',string.punctuation))
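To check the speed claim on your own machine, the two approaches can be timed side by side. This is a rough sketch: the function names and the sample string are made up for illustration, and the timings will vary with the text and the hardware.

```python
import string
import timeit

def remove_puncs_loop(text):
    # One full pass over the text per punctuation character
    for char in string.punctuation:
        text = text.replace(char, '')
    return text

# Build the translation table once instead of on every call
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def remove_puncs_translate(text):
    # Single pass over the text using the precomputed table
    return text.translate(PUNC_TABLE)

sample = 'Well... is it "fast"? Yes, it is! ' * 1000
# Both versions must produce the same cleaned text
assert remove_puncs_loop(sample) == remove_puncs_translate(sample)

loop_time = timeit.timeit(lambda: remove_puncs_loop(sample), number=100)
translate_time = timeit.timeit(lambda: remove_puncs_translate(sample), number=100)
print(f'replace loop: {loop_time:.4f}s, translate: {translate_time:.4f}s')
```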

### 5) Removing urls

In [13]:
def removeUrls(text):
    pattern = r'https?://\S+|www\.\S+' #\S+ matches one or more non-whitespace characters
    return re.sub(pattern,'',text)

In [14]:
text = 'This is a sample website with the following urls: fb (https://facebook.com), instagram (http://instagram.com), google (www.google.com) end of text.'
removeUrls(text)

'This is a sample website with the following urls: fb ( instagram ( google ( end of text.'
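Notice that the closing brackets and commas disappeared along with the URLs, because `\S+` keeps matching until the next whitespace. If the surrounding punctuation should survive, one option is to stop the match at brackets and commas. The pattern below is a sketch under that assumption, not a complete URL grammar, and `remove_urls_keep_brackets` is a hypothetical name.

```python
import re

# Stop the match at whitespace, brackets, or a comma so surrounding punctuation survives
URL_PATTERN = re.compile(r'(?:https?://|www\.)[^\s()<>,]+')

def remove_urls_keep_brackets(text):
    return URL_PATTERN.sub('', text)

text = 'fb (https://facebook.com), google (www.google.com) end of text.'
print(remove_urls_keep_brackets(text))
# fb (), google () end of text.
```

The trade-off is that URLs which legitimately contain brackets or commas are cut short.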

### 6) Dealing with short forms

In [15]:
with open('slang.txt','r') as f:
    lines = f.readlines()
    slangs = {}
    for line in lines:
        abbs, meaning = line.strip().split('=', 1) #split on the first '=' only
        slangs[abbs] = meaning

In [16]:
slangs

{'AFAIK': 'As Far As I Know',
 'AFK': 'Away From Keyboard',
 'ASAP': 'As Soon As Possible',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'A3': 'Anytime, Anywhere, Anyplace',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRT': 'Be Right There',
 'BTW': 'By The Way',
 'B4': 'Before',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GN': 'Good Night',
 'GMTA': 'Great Minds Think Alike',
 'GR8': 'Great!',
 'G9': 'Genius',
 'IC': 'I See',
 'ICQ': 'I Seek you (also a chat program)',
 'ILU': 'ILU: I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'KISS': 'Keep It Simple, Stupid',
 'LDR': 'Long Distance Relationship',
 'LM

In [20]:
text = 'FYI: This position is only for graduates. Please apply ASAP. Have a GR8 day ahead! '
def shortForms(text):
    new_text = []
    for word in text.split():
        if word.upper() in slangs:
            new_text.append(slangs[word.upper()])
        else:
            new_text.append(word)
    return " ".join(new_text)

In [21]:
text = removePuncs1(text) #first remove all the punctuations
shortForms(text) #complete the short forms.

'For Your Information This position is only for graduates Please apply As Soon As Possible Have a Great! day ahead'
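The approach above needs the punctuation removed first, because `text.split()` leaves tokens such as `FYI:` and `ASAP.` that are not keys in the dictionary. Below is a sketch of an alternative that matches whole words with a regex, so the punctuation can stay in place. `SLANG_MAP` is a tiny hypothetical stand-in for the dictionary loaded from `slang.txt`.

```python
import re

# Hypothetical stand-in for the dictionary built from slang.txt
SLANG_MAP = {
    'FYI': 'For Your Information',
    'ASAP': 'As Soon As Possible',
    'GR8': 'Great',
}

def expand_short_forms(text, mapping=SLANG_MAP):
    def swap(match):
        word = match.group(0)
        # Look up the upper-cased word; fall back to the original word
        return mapping.get(word.upper(), word)
    # \b\w+\b matches each word without the punctuation attached to it
    return re.sub(r'\b\w+\b', swap, text)

print(expand_short_forms('FYI: Please apply ASAP. Have a GR8 day ahead!'))
# For Your Information: Please apply As Soon As Possible. Have a Great day ahead!
```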

### 7) Spelling corrections

In [22]:
from textblob import TextBlob

In [23]:
incorrect = 'This text contains sveral spelling errorrs.'
blob = TextBlob(incorrect)

In [24]:
blob.correct().string

'His text contains several spelling errors.'
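Note that the output above also turned "This" into "His", so automatic correction can introduce errors as well as fix them. To give a feel for the general idea of picking the closest known word, here is a toy sketch built on the standard library's `difflib`. This is not how TextBlob works internally, and the vocabulary is a hypothetical list chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real corrector would use a large word list
VOCABULARY = ['this', 'text', 'contains', 'several', 'spelling', 'errors']

def correct_word(word, vocabulary=VOCABULARY):
    # Return the most similar known word, or the word itself if nothing is close enough
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word

def correct_text(text):
    return ' '.join(correct_word(word) for word in text.split())

print(correct_text('this text contains sveral spelling errorrs'))
```

Words that have no close match in the vocabulary are left unchanged, which avoids some of the over-correction seen above at the cost of missing some typos.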

### 8) Removing Stop words

In [25]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ishas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [27]:
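stop_words = set(stopwords.words('english')) #build the set once instead of reloading the list for every word
def remove_stopwords(text):
    newtext = []
    for word in text.split():
        if word in stop_words:
            newtext.append('') #empty placeholder; this leaves extra spaces in the result
        else:
            newtext.append(word)
    return ' '.join(newtext)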

In [28]:
text = df['review'][1]
remove_stopwords(text)

' wonderful little production  filming technique   unassuming  oldtimebbc fashion  gives  comforting  sometimes discomforting sense  realism   entire piece  actors  extremely well chosen michael sheen    got   polari      voices  pat    truly see  seamless editing guided   references  williams diary entries     well worth  watching     terrificly written  performed piece  masterful production  one   great masters  comedy   life  realism really comes home   little things  fantasy   guard  rather  use  traditional dream techniques remains solid  disappears  plays   knowledge   senses particularly   scenes concerning orton  halliwell   sets particularly   flat  halliwells murals decorating every surface  terribly well done'
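The extra spaces in the output come from appending an empty string for every removed word. If they are not wanted, the stop words can simply be skipped. Below is a minimal sketch. `STOP_WORDS` is a small hypothetical set so the example runs on its own; in the notebook it would be `set(stopwords.words('english'))`.

```python
# Hypothetical mini stop-word list, used only to keep the example self-contained
STOP_WORDS = {'a', 'the', 'is', 'and', 'to', 'of', 'very'}

def drop_stopwords(text, stop_words=STOP_WORDS):
    # Keep only the words that are not stop words; nothing is appended for removed words
    return ' '.join(word for word in text.split() if word not in stop_words)

print(drop_stopwords('a wonderful little production the filming technique is very unassuming'))
# wonderful little production filming technique unassuming
```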

### 9) Tokenization

In [43]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [40]:
text = "I am going to delhi! I'll be late"
text

"I am going to delhi! I'll be late"

**Using split**

In [50]:
words = text.split()
sentences = text.split('.')
print('Word tokenization:',words)
print('Sentence tokenization:',sentences)

Word tokenization: ['I', 'am', 'going', 'to', 'delhi!', "I'll", 'be', 'late']
Sentence tokenization: ["I am going to delhi! I'll be late"]


**Using NLTK**

In [51]:
word_tokenize(text)

['I', 'am', 'going', 'to', 'delhi', '!', 'I', "'ll", 'be', 'late']

In [46]:
sent_tokenize(text) #it separated the sentence based on exclamation as well.

['I am going to delhi!', "I'll be late"]

In [48]:
sent = 'I have a Ph.D. in A.I.'
print("Sentence tokenization:",sent_tokenize(sent))
print("Word tokenization:",word_tokenize(sent))

Sentence tokenization: ['I have a Ph.D. in A.I.']
Word tokenization: ['I', 'have', 'a', 'Ph.D.', 'in', 'A.I', '.']
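To see what lies between `split` and a library tokenizer, here is a small regex tokenizer sketch. It is not what NLTK does internally, and `simple_tokenize` is a hypothetical name. It separates punctuation from words but keeps contractions together.

```python
import re

# A word (optionally followed by an apostrophe part), or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("I am going to delhi! I'll be late"))
# ['I', 'am', 'going', 'to', 'delhi', '!', "I'll", 'be', 'late']
```

A pattern like this breaks abbreviations such as `Ph.D.` into separate pieces, which is one reason to prefer a library tokenizer for real text.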


**Using spaCy**

In [1]:
import spacy

In [4]:
nlp = spacy.load("en_core_web_sm")
doc = nlp('Dr. Strange loves burger. Hulk loves ham.')
for i in doc.sents:
    print(i)

Dr. Strange loves burger.
Hulk loves ham.


In [9]:
doc = nlp('This is number two at $4')
for token in doc:
    print(token)

This
is
number
two
at
$
4


In [8]:
sent_tokenize('Dr. Strange loves burger. Hulk loves ham.')

['Dr.', 'Strange loves burger.', 'Hulk loves ham.']

**Some token attributes in spaCy**

In [14]:
doc[3],doc[3].like_num

(two, True)

In [16]:
doc[5],doc[5].is_currency

($, True)

In [18]:
with open('students.txt') as f:
    text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

In [19]:
text = " ".join(text)
text



In [23]:
text = nlp(text)
emails = []
for token in text:
    if token.like_email:
        emails.append(token)
emails

[virat@kohli.com, maria@sharapova.com, serena@williams.com, joe@root.com]

In [25]:
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc] #.text to convert into string
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

In [35]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

#token.like_url is True for tokens that look like a url
doc = nlp(text)
urls = [token.text for token in doc if token.like_url]
urls

['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']

In [36]:
#Extract every amount that is followed by a currency symbol. Expected output: two $ and 500 €
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)
for token in doc:
    next_i = token.i + 1 #index of the following token
    if token.like_num and next_i < len(doc) and doc[next_i].is_currency: #bounds check avoids an IndexError on the last token
        print(token.text, doc[next_i].text)

two $
500 €


### 10) Stemming

Inflection is the modification of a word to express grammatical categories such as tense, case, voice, gender, and mood. Stemming strips these inflectional endings with rule-based heuristics, reducing related word forms to a common stem.
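To make the idea concrete, here is a toy suffix-stripping stemmer. It is a deliberately simplified sketch, not the Porter, Lancaster, or Snowball algorithm. The suffix list and the minimum stem length are assumptions picked for this example.

```python
# Suffixes are tried in this order; a longer suffix is listed before any suffix it ends with
SUFFIXES = ('ing', 'es', 'er', 's', 'e')
MIN_STEM_LENGTH = 3  # assumed threshold so short words such as "sing" are left alone

def toy_stem(word):
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        # Strip the first matching suffix, but only if enough of the word remains
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

print([toy_stem(word) for word in 'dance dancer dances dancing'.split()])
# ['danc', 'danc', 'danc', 'danc']
```

All four forms collapse to `danc`, which is not an English word. Real stemmers use much larger rule sets, but they share this property.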

In [52]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [55]:
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

In [64]:
sample = 'dance dancer dances dancing'
print(" ".join([ps.stem(word) for word in sample.split()]),"--> Using Porter")
print(" ".join([ls.stem(word) for word in sample.split()]),"--> Using Lancaster")
print(" ".join([ss.stem(word) for word in sample.split()]),"--> Using Snowball")

danc dancer danc danc --> Using Porter
dant dant dant dant --> Using Lancaster
danc dancer danc danc --> Using Snowball


In [67]:
text = df['review'][1]
" ".join([ps.stem(word) for word in text.split()])

'a wonder littl product the film techniqu is veri unassum veri oldtimebbc fashion and give a comfort and sometim discomfort sens of realism to the entir piec the actor are extrem well chosen michael sheen not onli ha got all the polari but he ha all the voic down pat too you can truli see the seamless edit guid by the refer to william diari entri not onli is it well worth the watch but it is a terrificli written and perform piec a master product about one of the great master of comedi and hi life the realism realli come home with the littl thing the fantasi of the guard which rather than use the tradit dream techniqu remain solid then disappear it play on our knowledg and our sens particularli with the scene concern orton and halliwel and the set particularli of their flat with halliwel mural decor everi surfac are terribl well done'

**The stems produced by stemming may or may not be valid English words (for example, 'littl' and 'veri' above). That is why we need lemmatization.**

### 11) Lemmatization

In [78]:
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ishas\AppData\Roaming\nltk_data...


True

In [79]:
sample = "He was running while eating. He has a bad habit of watching his phone at night!"
puncs = set(string.punctuation)
#build a new list instead of removing items from the list being iterated over
words = [word for word in word_tokenize(sample) if word not in puncs]
words

['He',
 'was',
 'running',
 'while',
 'eating',
 'He',
 'has',
 'a',
 'bad',
 'habit',
 'of',
 'watching',
 'his',
 'phone',
 'at',
 'night']

In [83]:
print("{0:20}{1:20}".format("Word","Lemma"))
for word in words:
    print("{0:20}{1:20}".format(word,lemm.lemmatize(word,pos='v'))) #pos = v (for verb)

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
while               while               
eating              eat                 
He                  He                  
has                 have                
a                   a                   
bad                 bad                 
habit               habit               
of                  of                  
watching            watch               
his                 his                 
phone               phone               
at                  at                  
night               night               
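In the table above every word is lemmatized as a verb (`pos='v'`), which suits "running" and "was" but is not the right part of speech for nouns or adjectives. A common pattern is to tag each word first and map the Penn Treebank tag to the single-letter codes that `WordNetLemmatizer.lemmatize` accepts. The helper below is a sketch of that mapping. `penn_to_wordnet` is a hypothetical name, and defaulting to noun is an assumption.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS code
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # assumed default: treat everything else as a noun

for tag in ['VBG', 'JJ', 'RB', 'NN']:
    print(tag, '->', penn_to_wordnet(tag))

# Intended use in the notebook (not run here; nltk.pos_tag needs a tagger resource downloaded first):
# for word, tag in nltk.pos_tag(words):
#     print(word, lemm.lemmatize(word, pos=penn_to_wordnet(tag)))
```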
