Text preprocessing is the second step of the NLP pipeline, and an important one, because it is where the raw data is first examined and cleaned.
It is generally of 2 types:
1. Basic
2. Advanced

In this notebook we focus on the basic type.
We will cover the following basic text preprocessing techniques:
 - LowerCasing
 - Removing HTML Tags
 - Removing URLs
 - Removing Punctuations
 - Chatwords Treatment
 - Spelling Correction
 - Removing Stopwords
 - Handling Emojis
 - Tokenization
 - Stemming 
 - Lemmatization

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('IMDB Dataset.csv')

In [6]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [7]:
df.shape

(50000, 2)

### 1. Lower Casing

In [8]:
# Lower casing a particular review

df.review[1].lower()

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [11]:
# LowerCasing the entire corpus

df.review = df.review.str.lower()

In [12]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


### 2. Removing HTML Tags

In [14]:
# HTML tags help the browser display the data. For sentiment analysis we do not
# need these tags, so we remove them with a regular expression.

# A general pattern that matches any HTML tag is '<.*?>'

import re

def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

df.review = df.review.apply(remove_html_tags)

In [15]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative
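
The `?` in `'<.*?>'` makes the match non-greedy, so each tag is matched on its own. As a small illustration (the sample string is made up for this sketch and is not from the dataset), compare it with the greedy pattern `'<.*>'`, which swallows everything between the first `<` and the last `>`:

```python
import re

# Hypothetical sample text, not taken from the IMDB data
sample_html = "<b>great</b> movie"

# Non-greedy: each tag is matched separately, the text in between survives
non_greedy = re.sub(r"<.*?>", "", sample_html)

# Greedy: a single match runs from the first '<' to the last '>'
greedy = re.sub(r"<.*>", "", sample_html)

print(repr(non_greedy))  # 'great movie'
print(repr(greedy))      # ' movie'
```

With the greedy pattern the word between the tags is lost, which is why the non-greedy form is used for tag removal.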


### 3. Removing URLs

In [18]:
# Social media data often contains a lot of URLs, and it is usually better to remove them.
# We again use a regular expression to remove these URLs.
# The general pattern is r'https?://\S+|www\.\S+'

def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

# Our data has no URLs, so we use some demo text to try the function

text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'


In [19]:
# Checking the results

print('Text1 :', remove_url(text1))
print('Text2 :', remove_url(text2))
print('Text3 :', remove_url(text3))
print('Text4 :', remove_url(text4))

Text1 : Check out my notebook 
Text2 : Check out my notebook 
Text3 : Google search here 
Text4 : For notebook click  to search check 
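
Note that the outputs above keep the whitespace that surrounded each URL (for example the double space in `click  to`). A minimal sketch of one way to tidy this up, assuming we are happy to collapse every run of whitespace into a single space; `remove_url_clean` is a hypothetical helper name, not part of the original notebook:

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url_clean(text):
    """Remove URLs, then collapse leftover whitespace (hypothetical helper)."""
    without_urls = URL_PATTERN.sub('', text)
    # Collapse runs of whitespace left behind by the removed URLs
    return re.sub(r'\s+', ' ', without_urls).strip()

demo = ('For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb '
        'to search check www.google.com')
cleaned = remove_url_clean(demo)
print(cleaned)  # For notebook click to search check
```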


### 4. Removing the Punctuation Marks

In [20]:
# Punctuation is worth removing because it adds unnecessary complexity during
# tokenization: 'Hello!' and 'Hello' may be treated as different words, and '!' may
# become a token of its own, which enlarges the vocabulary the model has to handle.

# We use the string and time modules;
# string.punctuation is the set of ASCII punctuation characters that Python defines.

import string, time

punct = string.punctuation
print(punct)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [22]:
def remove_punctuation(text):
    for i in punct:
        text = text.replace(i,'')
    return text

text1 = 'string. with. Punctuation?'

remove_punctuation(text1)

'string with Punctuation'

In [23]:
# Though we have achieved our goal, there is an issue with this code:
# the time it takes is high

start = time.time()
remove_punctuation(text1)
time1 = time.time()-start
print(time1)

0.00011491775512695312


Since we timed a single short text, the number is low. However, if we apply this over 50k rows it will be
close to 67 seconds, more than a minute.
So we write a different, faster version of the same function.

In [29]:
def remove_punc1(text):
    return text.translate(str.maketrans('','', punct))

start1 = time.time()
remove_punc1(text1)
time2 = time.time()-start1
print(time2)

0.00013399124145507812


In [27]:
print(time1*50000, time2*50000)

5.745887756347656 3.3020973205566406


In [30]:
time1/time2

0.8576512455516014

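The extrapolation in In [27] (taken from an earlier run of the cell above) favoured the second function, whereas the ratio in In [30] is below 1, so in that single-call measurement the second function was not actually faster. Timings this small are dominated by noise. `str.translate` is usually the faster option on longer texts, but confirming that needs repeated measurements rather than one call.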
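
A single `time.time()` measurement of a microsecond-scale call is very noisy. A minimal sketch of a more stable comparison, assuming the standard-library `timeit` module and a longer, artificially repeated sample text (the function names and the sample are choices made for this sketch):

```python
import string
import timeit

punct = string.punctuation
# Build the translation table once instead of on every call
delete_table = str.maketrans('', '', punct)

def remove_punct_loop(text):
    # One str.replace pass per punctuation character
    for ch in punct:
        text = text.replace(ch, '')
    return text

def remove_punct_translate(text):
    # A single pass that deletes every character in the table
    return text.translate(delete_table)

# Artificial longer sample, repeated to resemble a full review
sample_text = 'string. with. Punctuation? ' * 200

# Both functions must produce the same result before timing them
assert remove_punct_loop(sample_text) == remove_punct_translate(sample_text)

loop_time = timeit.timeit(lambda: remove_punct_loop(sample_text), number=1000)
translate_time = timeit.timeit(lambda: remove_punct_translate(sample_text), number=1000)
print(f'loop: {loop_time:.4f}s  translate: {translate_time:.4f}s')
```

The absolute numbers depend on the machine and Python version, so no expected output is shown here.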

In [31]:
# Let's apply this to a different dataset from Kaggle

df1 = pd.read_csv('labeled_data.csv')
df1.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [32]:
df1.shape

(24783, 7)

In [33]:
# Applying on tweet column 

df1.tweet.apply(remove_punc1)

0         RT mayasolovely As a woman you shouldnt compl...
1         RT mleew17 boy dats coldtyga dwn bad for cuff...
2         RT UrKindOfBrand Dawg RT 80sbaby4life You eve...
3           RT CGAnderson vivabased she look like a tranny
4         RT ShenikaRoberts The shit you hear about me ...
                               ...                        
24778    yous a muthafin lie 8220LifeAsKing 20Pearls co...
24779    youve gone and broke the wrong heart baby and ...
24780    young buck wanna eat dat nigguh like I aint fu...
24781                youu got wild bitches tellin you lies
24782    Ruffled  Ntac Eileen Dahlia  Beautiful color c...
Name: tweet, Length: 24783, dtype: object

In [34]:
# And it works faster

### 5. Chat Word Treatment

In [36]:
# Chat words used in day-to-day social media need proper treatment when doing sentiment analysis.
# Abbreviations such as asap (as soon as possible) or gn (good night) need to be expanded properly.
# We just need to define a dictionary with the full forms and then replace them in our text

chat_word = {
    'AFAIK': 'As Far As I Know',
    'AFK' : 'Away From Keyboard',
    'ASAP':'As Soon As Possible',
'ATK':'At The Keyboard',
'ATM' :'At The Moment',
'A3': 'Anytime, Anywhere, Anyplace',
'BAK': 'Back At Keyboard',
'BBL' : 'Be Back Later',
 'BBS':'Be Back Soon',
'BFN':'Bye For Now',
'B4N':'Bye For Now',
'BRB':'Be Right Back',
'BRT': 'Be Right There',
'BTW':'By The Way',
'B4' :'Before',
'CU':'See You',
'CUL8R': 'See You Later',
'CYA' : 'See You',
'FAQ' :'Frequently Asked Questions',
'FC':'Fingers Crossed',
'FWIW': "For What It's Worth",
'FYI': 'For Your Information',
'GAL' :'Get A Life',
'GG':'Good Game',
'GN':'Good Night',
'GMTA': 'Great Minds Think Alike',
'GR8' :'Great!',
'G9':'Genius',
'IC':'I See',
'ICQ':'I Seek you (also a chat program)',
'ILU' :'I Love You',
'IMHO':'In My Honest/Humble Opinion',
'IMO' :'In My Opinion',
'IOW':'In Other Words',
'IRL' :'In Real Life',
'KISS':'Keep It Simple, Stupid',
'LDR':'Long Distance Relationship',
'LMAO':'Laugh My A.. Off',
'LOL':'Laughing Out Loud',
'LTNS' : 'Long Time No See',
'L8R' :'Later',
'MTE':'My Thoughts Exactly',
'M8' :'Mate',
'NRN' :'No Reply Necessary',
'OIC':'Oh I See',
'PITA' :'Pain In The A..',
'PRT' :'Party',
'PRW':'Parents Are Watching',
'QPSA' :'Que Pasa',
'ROFL':'Rolling On The Floor Laughing',
'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
'ROTFLMAO':'Rolling On The Floor Laughing My A.. Off',
'SK8':'Skate',
'STATS':'Your sex and age',
'ASL':'Age, Sex, Location',
'THX':'Thank You',
'TTFN':'Ta-Ta For Now!',
'TTYL':'Talk To You Later',
'U':'You',
'U2':'You Too',
'U4E':'Yours For Ever',
'WB':'Welcome Back',
'WTF':'What The Fuck',
'WTG':'Way To Go!',
'WUF':'Where Are You From?',
'W8':'Wait...',
'MFW':'My face when',
'MRW':'My reaction when',
'IFYP':'I feel your pain',
'LOL':'Laughing out loud',
'TNTL':'Trying not to laugh',
'JK':'Just kidding',
'IDC':'I don’t care',
'ILY':'I love you',
'IMU':'I miss you',
'ADIH':'Another day in hell',
'ZZZ':'Sleeping, bored, tired',
'WYWH':'Wish you were here',
'TIME' :'Tears in my eyes',
'BAE': 'Before anyone else',
'FIMH': 'Forever in my heart',
'BSAAW': 'Big smile and a wink',
'BWL': 'Bursting with laughter',
'LMAO': 'Laughing my a** off',
'BFF': 'Best friends forever',
'CSL' : 'Can’t stop laughing'
}

In [38]:
# Defining the function

def chat_correction(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_word:
            new_text.append(chat_word[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

chat_correction('IMHO he is the Best!')

'In My Honest/Humble Opinion he is the Best!'

In [None]:
# IMHO is replaced with In My Honest/Humble Opinion
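
Because `chat_correction` splits on whitespace, an abbreviation with punctuation attached (for example `asap!`) is not found in the dictionary. A minimal sketch of one way around this, assuming we match runs of word characters with a regular expression; the small `chat_words_demo` mapping and the function name are hypothetical and stand in for the full dictionary above:

```python
import re

# Small hypothetical subset of the chat-word dictionary
chat_words_demo = {
    'ASAP': 'As Soon As Possible',
    'BRB': 'Be Right Back',
}

def chat_correction_punct(text, mapping):
    """Expand chat words even when punctuation is attached to them."""
    def expand(match):
        word = match.group(0)
        # Look the word up in upper case; keep it unchanged if it is unknown
        return mapping.get(word.upper(), word)
    # \w+ matches each run of letters/digits, so ',' and '!' are left in place
    return re.sub(r'\w+', expand, text)

expanded = chat_correction_punct('brb, call me asap!', chat_words_demo)
print(expanded)  # Be Right Back, call me As Soon As Possible!
```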

### 6. Spelling Correction

In [43]:
# Spelling correction is important for the same purpose: avoiding unnecessary complexity during tokenization,
# because words with the same meaning but different spellings needlessly enlarge the vocabulary of the model.
# There are different techniques to follow; we can use either the NLTK library or the TextBlob library.
# Here we will see it using TextBlob

from textblob import TextBlob

incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

# Create a TextBlob object

txt_blob = TextBlob(incorrect_text)

# To print the corrected string, we use the TextBlob correct() method

print(txt_blob.correct())

certain conditions during several generations are modified in the same manner.


TextBlob is helpful for spell-checking ordinary words; however, when dealing with more complex or domain-specific text we may have to
create our own spell checker
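
As a rough idea of what a home-made spell checker could look like, here is a minimal sketch that assumes we already have a list of valid words and uses the standard-library `difflib` to pick the closest one. The tiny `vocabulary` list, the `cutoff` value and the function name are all assumptions for illustration; a real checker would need a much larger word list and word frequencies.

```python
import difflib

# Hypothetical tiny vocabulary; a real one would come from a domain word list
vocabulary = ['certain', 'conditions', 'during', 'several', 'generations',
              'are', 'modified', 'in', 'the', 'same', 'manner']

def correct_word(word, vocab, cutoff=0.7):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word('ceertain', vocabulary))  # certain
print(correct_word('maner', vocabulary))     # manner
print(correct_word('xyzzy', vocabulary))     # xyzzy (nothing similar enough)
```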

### 7. Stop Words Removal

In [47]:
# Stop words are words used for sentence formation that carry little meaning on their own, hence
# for sentiment analysis they are usually removed to avoid unnecessary complexity.
# An exception is POS (part-of-speech) tagging, where we do NOT remove stop words.
# We use the NLTK library for this task, as it ships a built-in list of stop words for English and other
# languages

from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [49]:
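# Build the stop-word set once; calling stopwords.words('english') for every word is slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    new_text = []
    
    for word in text.split():
        if word in stop_words:
            new_text.append('')  # keeps an empty slot, hence the double spaces in the output
        else:
            new_text.append(word)
    return " ".join(new_text)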

sample = 'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times'

remove_stopwords(sample)

'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'

As we can observe, many stop words such as my, a, of... are removed.

In [51]:
## Applying on dataset

df1.tweet.apply(remove_stopwords)

0        !!! RT @mayasolovely: As  woman   complain  cl...
1        !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2        !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3        !!!!!!!!! RT @C_G_Anderson: @viva_based  look ...
4        !!!!!!!!!!!!! RT @ShenikaRoberts: The shit  he...
                               ...                        
24778    you's  muthaf***in lie &#8220;@LifeAsKing: @20...
24779     gone  broke  wrong heart baby,  drove  rednec...
24780    young buck wanna eat!!.. dat nigguh like I ain...
24781                   youu got wild bitches tellin  lies
24782    ~~Ruffled | Ntac Eileen Dahlia - Beautiful col...
Name: tweet, Length: 24783, dtype: object
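
The outputs above contain double spaces where stop words used to be, and capitalised stop words such as `As` or `The` survive because the comparison is case-sensitive. A minimal sketch of a variant that avoids both, assuming a small hand-picked stop-word set so that it runs without the NLTK corpus; `DEMO_STOPWORDS` and the function name are hypothetical:

```python
# Hypothetical hand-picked subset; in the notebook this would be
# set(stopwords.words('english'))
DEMO_STOPWORDS = {'my', 'a', 'of', 'and', 'to', 'but', 'or', 'it', 'not', 'the', 'as'}

def remove_stopwords_compact(text, stop_words):
    """Drop stop words (case-insensitively) without leaving empty gaps."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)

compact = remove_stopwords_compact('A story of sacrifice and dedication', DEMO_STOPWORDS)
print(compact)  # story sacrifice dedication
```

Using a `set` keeps each membership test cheap, which matters when the function is applied to tens of thousands of rows.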

### 8. Handling Emojis

In [52]:
# Emojis have changed how people express themselves, so it is important to handle them
# They can be handled either by removing them or by replacing them with their meaning
# We use a regular expression to perform the task, and the code can be reused as a snippet

import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [53]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [54]:
remove_emoji("Lmao 😂😂")

'Lmao '

In [56]:
# To replace emojis with their meaning we use the 'emoji' library for Python

!pip install emoji

Collecting emoji
  Downloading emoji-2.8.0-py2.py3-none-any.whl.metadata (5.3 kB)
Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
   358.9/358.9 kB 4.7 MB/s eta 0:00:00
Installing collected packages: emoji
Successfully installed emoji-2.8.0


In [58]:
import emoji

# To replace an emoji with its meaning we use demojize()

print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [59]:
print(emoji.demojize('Loved the movie. It was 😘'))

Loved the movie. It was :face_blowing_a_kiss:
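
If installing the `emoji` package is not an option, a rough dependency-free approximation is possible with the standard-library `unicodedata` module. This is only a sketch under two assumptions: that a character in the Unicode category `So` (Symbol, other) counts as an emoji, which also catches symbols such as `©`, and that the official Unicode character name is an acceptable description (it may differ from the names `demojize()` produces). The function name is hypothetical.

```python
import unicodedata

def describe_symbols(text):
    """Replace 'Symbol, other' characters with their Unicode names (rough sketch)."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == 'So':
            # e.g. 'FIRE' -> ':fire:'
            name = unicodedata.name(ch, 'unknown').lower().replace(' ', '_')
            pieces.append(f':{name}:')
        else:
            pieces.append(ch)
    return ''.join(pieces)

described = describe_symbols('Python is 🔥')
print(described)  # Python is :fire:
```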


### 9. Tokenization

In [61]:
# Tokenization is nothing but splitting the raw text into small chunks, words or sentences, called tokens
# It is a crucial step of text preprocessing: without proper tokenization the model
# may fail to perform
# It can be word tokenization, sentence tokenization or phrase tokenization
# Tokenization can be done in multiple ways


### 9(a). Using the Split function

In [62]:
# Word Tokenization

sent1 = 'I am going to Jammu'
sent1.split()

['I', 'am', 'going', 'to', 'Jammu']

In [63]:
# Sentence Tokenization

sent2 = 'I am going to Delhi.I will stay there for a week.Let\'s hope the trip goes well!'
sent2.split('.')

['I am going to Delhi',
 'I will stay there for a week',
 "Let's hope the trip goes well!"]

In [64]:
# Problem with the split function

sent3 = 'I am going to Delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'Delhi!']

The ! stays attached to Delhi, although it is supposed to be treated separately; as a result 'Delhi!' becomes a different token from 'Delhi'.

### 9(b). Using Regular Expression

In [65]:
# Word tokenization

import re
sent3 = 'I am going to Delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string, so \w is not treated as a string escape
tokens

['I', 'am', 'going', 'to', 'Delhi']

It's better than the split function, but creating patterns every time can be a tedious task
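
The pattern above simply drops the `!`. If punctuation should be kept as separate tokens instead, the pattern has to grow, which illustrates how quickly hand-written patterns become tedious. A minimal sketch, assuming that a word may contain one apostrophe part (as in `Let's`) and that any other non-space, non-word character is a token of its own; the pattern and variable names are choices made for this example:

```python
import re

# A word with an optional apostrophe part, OR a single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

tokens_with_punct = TOKEN_PATTERN.findall('I am going to Delhi!')
print(tokens_with_punct)  # ['I', 'am', 'going', 'to', 'Delhi', '!']

contraction_tokens = TOKEN_PATTERN.findall("Let's go.")
print(contraction_tokens)  # ["Let's", 'go', '.']
```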

In [66]:
#Sentence Tokenization

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

### 9(c). Using NLTK Library

In [67]:
# A better option for tokenization is to use libraries with built-in functions, which give better results
# than the split() function and regular expressions
# The NLTK library provides 2 such functions: word_tokenize and sent_tokenize

from nltk.tokenize import word_tokenize, sent_tokenize

#word tokenization
    
sent1 = 'I am going to Delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'Delhi', '!']

Better results than the above techniques, as ! is also separated into its own token

In [68]:
#Sentence tokenization

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [73]:
# Other examples of word tokenization

sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

print(word_tokenize(sent5),'\n',word_tokenize(sent6),'\n',word_tokenize(sent7))

['I', 'have', 'a', 'Ph.D', 'in', 'A.I'] 
 ['We', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks', '@', 'gmail.com'] 
 ['A', '5km', 'ride', 'cost', '$', '10.50']


For sent6, NLTK splits the email address nks@gmail.com into three tokens, so NLTK has its own issues, but it still gives better results than regular expressions and the split() function

### 9(d). Using spaCy Library

In [76]:
# spaCy can also be used to perform tokenization and often gives better results than NLTK
# We need to load the small English pipeline 'en_core_web_sm' to perform the task

import spacy
nlp = spacy.load('en_core_web_sm')

In [77]:
# Converting text into documents

doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [78]:
# Looping over each document to check the word_tokenization
for i in doc1:
    print(i)

I
have
a
Ph
.
D
in
A.I


In [79]:
for i in doc2:
    print(i)

We
're
here
to
help
!
mail
us
at
nks@gmail.com


In [80]:
for i in doc3:
    print(i)

A
5
km
ride
cost
$
10.50


In [81]:
for i in doc4:
    print(i)

I
am
going
to
Delhi
!


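As observed, spaCy keeps the email address nks@gmail.com as a single token and separates 5 from km, which NLTK did not; on the other hand, it splits Ph.D into three tokens. So we can choose the technique that best fits our requirements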

### 10. Stemming 

In [82]:
# Stemming, in Natural Language Processing (NLP), is the process of reducing a word to its stem
# by stripping affixes such as suffixes and prefixes.
# In other words, it reduces the inflected words in our data to a common base form.

# Stemming can be performed with the NLTK library using the PorterStemmer class

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_words(text):
    new_txt = []
    for word in text.split():
        new_txt.append(ps.stem(word)) # ps.stem() converts an inflected word into its stem
    
    return " ".join(new_txt) 

sample = 'walks walking walked walked'
stem_words(sample)

'walk walk walk walk'

In [83]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

As we can observe, stemming reduces words to a root form, but the result is not necessarily a valid word in the language: in the example above, probably becomes probabl and story becomes stori. So if the output has to be shown to someone, stemming is not a good choice, and that is where lemmatization comes into the picture
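
To see why stems are not always real words, it helps to look at stemming as plain rule-based suffix stripping. The sketch below is a deliberately simplified toy, not the Porter algorithm: the suffix list, the minimum stem length of 3 and the function name are all assumptions made for illustration.

```python
# Toy suffix list, checked in this order (NOT the Porter rule set)
TOY_SUFFIXES = ('ing', 'ed', 'es', 's')

def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

toy_stems = [toy_stem(w) for w in 'walks walking walked movies'.split()]
print(toy_stems)  # ['walk', 'walk', 'walk', 'movi']
```

The rules know nothing about the language, so `movies` is cut down to `movi` in the same mechanical way that `walks` becomes `walk`.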

### 11. Lemmatization

In [88]:
# Lemmatization is a text preprocessing technique used in natural language processing (NLP) to reduce a word
# to its root meaning so that related forms can be identified.
# It is a lookup technique: it searches a lexical dictionary, which stores the relations between
# different words of a language, and returns a meaningful root word (lemma) of the same language.
# Lemmatization is slower than stemming and is mainly used when the output needs to be displayed.
# We use the NLTK library, which provides WordNetLemmatizer, to perform lemmatization

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

sent = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

punctuations="?:!.,;"

# performing tokenization

sentence_words = nltk.word_tokenize(sent)

# removing the punctuation
# (build a new list instead of calling remove() while iterating, which can skip elements)

sentence_words = [word for word in sentence_words if word not in punctuations]

# performing the lemmatization and comparing the output
# To lemmatize we use .lemmatize(word, pos)
# The pos argument tells the lemmatizer which part of speech to treat the word as
# Valid options are “n” for nouns, “v” for verbs, “a” for adjectives, “r” for adverbs 
# and “s” for satellite adjectives.


for i in sentence_words:
    print(f"{i} >> {lemmatizer.lemmatize(i, pos='v')}" ) # Applying Lemmatization on verbs

He >> He
was >> be
running >> run
and >> and
eating >> eat
at >> at
same >> same
time >> time
He >> He
has >> have
bad >> bad
habit >> habit
of >> of
swimming >> swim
after >> after
playing >> play
long >> long
hours >> hours
in >> in
the >> the
Sun >> Sun
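
The output shows that only verb forms changed, because every word was looked up as a verb (`pos='v'`); `hours`, for instance, stays as it is. The dictionary-lookup idea behind this can be sketched with a toy table keyed by word and part of speech. This is not how WordNet is stored internally; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the table holds just a few hand-written entries.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma
TOY_LEMMAS = {
    ('was', 'v'): 'be',
    ('has', 'v'): 'have',
    ('running', 'v'): 'run',
    ('hours', 'n'): 'hour',
}

def toy_lemmatize(word, pos):
    """Look up (word, pos); return the word unchanged when there is no entry."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize('running', 'v'))  # run
print(toy_lemmatize('hours', 'v'))    # hours (no verb entry)
print(toy_lemmatize('hours', 'n'))    # hour
```

The same word can therefore give different results depending on the part of speech it is looked up with.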
