## **NLP Lecture -03:** **Text Preprocessing**
- LINK: https://www.nltk.org/nltk_data/  
## **Contents:**

- **[a. Removing HTML Tags](#a.-Removing-HTML-Tags)**
- **[b. Lower/Upper Case](#b.-Lower/Upper-Case)**
- **[c. Removing URLs](#c.-Removing-URLs)**
- **[d. Removing Punctuation](#d.-Removing-Punctuation)**  
- **[e. Chat Word Treatment](#e.-Chat-Word-Treatment)**
- **[f. Spelling Correction](#f.-Spelling-Correction)**
- **[g. Removing Stop Words](#g.-Removing-Stop-Words)**
- **[h. Handling Emojis](#h.-Handling-Emojis)**
- **[i. Tokenization](#i.-Tokenization)**
- **[j. Stemming](#j.-Stemming)**
- **[k. Lemmatization](#k.-Lemmatization)**

# Assignment_03
- **API LINK:**  
- https://api.themoviedb.org/3/movie/top_rated?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US&page=471
- https://api.themoviedb.org/3/genre/movie/list?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US

# Steps:
- Create the dataset and perform preprocessing
- The dataset is for multi-class classification, with the columns `movie_name`, `description`, and `genre`
- Use the TMDB API through the links given above
- Extract the data and store it in a DataFrame
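
The steps above could be sketched roughly as follows. This is a minimal sketch, not a reference solution: it assumes the `top_rated` response holds a `results` list whose items carry `title`, `overview`, and `genre_ids`, and that the genre list response holds `genres` entries with `id` and `name`. The helper name `build_rows` and both sample payloads are made up for illustration, and taking the first genre id as the single class label is also an assumption.

```python
def build_rows(movies_payload, genres_payload):
    """Turn two TMDB-style JSON payloads into (movie_name, description, genre) rows."""
    # Map genre id -> genre name
    id_to_name = {g["id"]: g["name"] for g in genres_payload.get("genres", [])}
    rows = []
    for movie in movies_payload.get("results", []):
        genre_ids = movie.get("genre_ids") or []
        if not genre_ids:
            continue  # skip movies that have no genre label
        rows.append({
            "movie_name": movie.get("title", ""),
            "description": movie.get("overview", ""),
            # Assumption: use the first listed genre as the single class label
            "genre": id_to_name.get(genre_ids[0], "Unknown"),
        })
    return rows

# Hypothetical payloads for illustration only (not real API output)
sample_movies = {"results": [
    {"title": "Movie A", "overview": "A story.", "genre_ids": [18, 80]},
    {"title": "Movie B", "overview": "Another story.", "genre_ids": []},
]}
sample_genres = {"genres": [{"id": 18, "name": "Drama"}, {"id": 80, "name": "Crime"}]}

rows = build_rows(sample_movies, sample_genres)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`. In practice each page of the API would be fetched and parsed as JSON first, then fed through the same function.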

In [1]:
# Importing Library
import numpy as np 
import pandas as pd 

# Creating connection with dataset
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Reading the dataset
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df.head()

In [3]:
# Shape and size of dataset
df.shape, df.size

((50000, 2), 100000)

## **a. Removing HTML Tags**
- HTML tags only tell the web browser how to render a page
- Remove all HTML tags because they carry no meaning for NLP
- This step is required almost every time, especially for data scraped from websites

In [3]:
# Defining function
import re
def remove_html_tags(text):
    p = re.compile(r'<.*?>')
    return p.sub('', text)

In [16]:
sample1 = '<html> <head> <style> </style> </head> <body> <p>Lorem ipsum dolor sit amet,<a href= http://google.com> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p> </body> </html>'
sample1

'<html> <head> <style> </style> </head> <body> <p>Lorem ipsum dolor sit amet,<a href= http://google.com> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p> </body> </html>'

In [17]:
remove_html_tags(sample1)

'      Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.   '
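
The output above still contains leading and repeated spaces, and a plain regex leaves HTML entities such as `&amp;` untouched. A possible refinement is sketched below, assuming it is acceptable to replace each tag with a space (so words on either side of a `<br />` are not glued together) and then collapse whitespace. The name `clean_html` is made up for this sketch; it is an optional variation, not the function used in the rest of the notebook.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>', flags=re.DOTALL)  # compiled once; DOTALL lets a tag span lines
SPACE_RE = re.compile(r'\s+')

def clean_html(text):
    no_tags = TAG_RE.sub(' ', text)        # replace each tag with a space
    unescaped = html.unescape(no_tags)     # turn entities like &amp; into characters
    return SPACE_RE.sub(' ', unescaped).strip()  # collapse runs of whitespace

print(clean_html('<p>Tom &amp; Jerry<br />is   fun</p>'))
```

For heavily malformed HTML a real parser is usually more reliable than a regex.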

In [5]:
# Applying on dataset
df['review'] = df['review'].apply(lambda x: remove_html_tags(x))
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## **b. Lower/Upper Case**
- Convert all text to a single case (usually lower case)
- This step is required almost every time
- It ensures that the same word in different cases (e.g. `Movie` and `movie`) is treated as the same word

In [19]:
sample2 = df['review'][3]
sample2

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [21]:
sample2.lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [6]:
# Applying on dataset
df['review'] = df['review'].str.lower()
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


## **c. Removing URLs**
- Remove all URLs from the text
- Required when dealing with data from social media platforms
- Removing URLs helps because it gets rid of tokens that add ambiguity without carrying meaning

In [27]:
# Defining function
import re
def remove_urls(text):
    p = re.compile(r'https?://\S+|www\.\S+')
    return p.sub('', text)

In [28]:
sample3='Check my notebook on https://www.kaggle.com/campusx/notebook01 or on http://www.kaggle.com/campusx/notebook01 else www.google.com/mynotebook01'
sample3

'Check my notebook on https://www.kaggle.com/campusx/notebook01 or on http://www.kaggle.com/campusx/notebook01 else www.google.com/mynotebook01'

In [29]:
remove_urls(sample3)

'Check my notebook on  or on  else '

In [30]:
# Applying on dataset
df['review'] = df['review'].apply(lambda x: remove_urls(x))
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


## **d. Removing Punctuation**
- Remove all unnecessary special characters
- Removing punctuation will make the data more consistent for further processing

## Method-01:

In [7]:
# Method-01: Defining function
import re

def remove_spchar(text):
    p = r'[^a-zA-Z0-9\s]'             # This pattern will keep letters, digits, and whitespaces
    results = re.sub(p, '', text)     # sub() function to replace matches with an empty string
    return results

In [8]:
import time
start =time.time()
sample4 = "Hello!, How are you? Is it -->fine!!!, Let's meet tomorrow @->:)"
print(remove_spchar(sample4))
end=time.time()
print((end-start)*50000)   # time would take for 50K records

Hello How are you Is it fine Lets meet tomorrow 
68.27116012573242


## Method-02:

In [9]:
# Getting all punctuation
import string
spchar = string.punctuation
spchar

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
# Method-02: Defining function                                     ---More Useful---

spchar = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
def remove_spchar1(text):
    for char in spchar:
        text = text.replace(char, '')
    return text

# If the function above fails on non-string values (datatype issue), use this type-safe version
def remove_spchar1(text):
    if isinstance(text, str):
        for char in spchar:
            text = text.replace(char, '')
    return text

In [11]:
start =time.time()
sample4 = "Hello!, How are you? Is it -->fine!!!, Let's meet tomorrow @->:)"
print(remove_spchar1(sample4))
end=time.time()
print((end-start)*50000)   # time would take for 50K records

Hello How are you Is it fine Lets meet tomorrow 
11.289119720458984


## Method-03:

In [12]:
# Method-03: Defining function

spchar = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
def remove_spchar2(text):
    return text.translate(str.maketrans('', '', spchar))

In [13]:
start =time.time()
sample4 = "Hello!, How are you? Is it -->fine!!!, Let's meet tomorrow @->:)"
print(remove_spchar2(sample4))
end=time.time()
print((end-start)*50000)   # time would take for 50K records

Hello How are you Is it fine Lets meet tomorrow 
68.73607635498047
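
The single-call timings above include the cost of `print` and, for Method-03, of rebuilding the translation table on every call, so they are only rough. A more repeatable comparison could be sketched with the standard `timeit` module, as below. The function names `with_regex`, `with_replace`, and `with_translate` are made up for this sketch, and the regex and translation table are built once outside the functions, which differs from the cells above.

```python
import re
import string
import timeit

PUNCT = string.punctuation
PUNCT_RE = re.compile(r'[^a-zA-Z0-9\s]')       # compiled once
PUNCT_TABLE = str.maketrans('', '', PUNCT)     # built once, reused on every call

def with_regex(text):
    return PUNCT_RE.sub('', text)

def with_replace(text):
    for char in PUNCT:
        text = text.replace(char, '')
    return text

def with_translate(text):
    return text.translate(PUNCT_TABLE)

sample = "Hello!, How are you? Is it -->fine!!!, Let's meet tomorrow @->:)"

# Time 50,000 calls of each method on the same input
for func in (with_regex, with_replace, with_translate):
    seconds = timeit.timeit(lambda: func(sample), number=50_000)
    print(f"{func.__name__}: {seconds:.3f} s for 50,000 calls")
```

Actual numbers depend on the machine and on text length, so they are not shown here. Note that the regex version also strips non-ASCII characters such as accented letters, while the other two remove only the characters in `string.punctuation`.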


In [14]:
# Applying on dataset
df['review'] = df['review'].apply(lambda x: remove_spchar1(x))
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


## **e. Chat Word Treatment**
- Expand chat abbreviations into their normal form
- Required when working with data from social media or messaging apps


In [15]:
# Dictionary having set of chat words
chat_word_dict = {'$': ' Dollar ',
 '4AO': 'For Adults Only',
 '7K': 'Sick:-D Laughter',
 'A.M': 'Before Midday',
 'A3': 'Anytime, Anywhere, Anyplace',
 'AAMOF': 'As A Matter Of Fact',
 'ACCT': 'Account',
 'ADIH': 'Another Day In Hell',
 'AFAIC': 'As Far As I Am Concerned',
 'AFAICT': 'As Far As I Can Tell',
 'AFAIK': 'As Far As I Know',
 'AFAIR': 'As Far As I Remember',
 'AFK': 'Away From Keyboard',
 'APP': 'Application',
 'APPROX': 'Approximately',
 'APPS': 'Applications',
 'ASAP': 'As Soon As Possible',
 'ASL': 'Age, Sex, Location',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'AVE.': 'Avenue',
 'AYMM': 'Are You My Mother',
 'AYOR': 'At Your Own Risk',
 'B&B': 'Bed And Breakfast',
 'B+B': 'Bed And Breakfast',
 'B.C': 'Before Christ',
 'B2B': 'Business To Business',
 'B2C': 'Business To Customer',
 'B4': 'Before',
 'B4N': 'Bye For Now',
 'B@U': 'Back At You',
 'BABE': 'Baby or Wife',
 'BAE': 'Before Anyone Else',
 'BAK': 'Back At Keyboard',
 'BBBG': 'Bye Bye Be Good',
 'BBC': 'British Broadcasting Corporation',
 'BBIAS': 'Be Back In A Second',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BE4': 'Before',
 'BFF': 'Best Friends Forever',
 'BFN': 'Bye For Now',
 'BLVD': 'Boulevard',
 'BOUT': 'About',
 'BRB': 'Be Right Back',
 'BROS': 'Brothers',
 'BRT': 'Be Right There',
 'BSAAW': 'Big Smile And A Wink',
 'BTW': 'By The Way',
 'BWL': 'Bursting With Laughter',
 'C/O': 'Care Of',
 'CET': 'Central European Time',
 'CF': 'Compare',
 'CIA': 'Central Intelligence Agency',
 'CSL': 'Can’t Stop Laughing',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CV': 'Curriculum Vitae',
 'CWOT': 'Complete Waste Of Time',
 'CYA': 'See You Again',
 'CYL': 'See You Later',
 'CYT': 'See You Tomorrow',
 'DAE': 'Does Anyone Else',
 'DBMIB': 'Do Not Bother Me I Am Busy',
 'DIY': 'Do It Yourself',
 'DM': 'Direct Message',
 'DWH': 'During Work Hours',
 'E123': 'Easy As One Two Three',
 'EET': 'Eastern European Time',
 'EG': 'Example',
 'EMBM': 'Early Morning Business Meeting',
 'ENCL': 'Enclosed',
 'ENCL.': 'Enclosed',
 'ETC': 'Et-cetera',
 'FAQ': 'Frequently Asked Questions',
 'FAWC': 'For Anyone Who Cares',
 'FB': 'Facebook',
 'FC': 'Fingers Crossed',
 'FIG': 'Figure',
 'FIMH': 'Forever In My Heart',
 'FT': 'Featuring',
 'FT.': 'Feet',
 'FTL': 'For The Loss',
 'FTW': 'For The Win',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'G.O.A.T': 'Greatest Of All Time',
 'G9': 'Genius',
 'G9T': 'Good Night',
 'GAHOY': 'Get A Hold Of Yourself',
 'GAL': 'Get A Life',
 'GCSE': 'General Certificate Of Secondary Education',
 'GFN': 'Gone For Now',
 'GG': 'Good Game',
 'GL': 'Good Luck',
 'GLHF': 'Good Luck Have Fun',
 'GMT': 'Greenwich Mean Time',
 'GMTA': 'Great Minds Think Alike',
 'GN': 'Good Night',
 'GNTCSD': 'Good Night Take Care Sweet Dream',
 'GOAT': 'Greatest Of All Time',
 'GOI': 'Get Over It',
 'GPS': 'Global Positioning System',
 'GR8': 'Great!',
 'GRATZ': 'Congratulations',
 'GYAL': 'Girl',
 'H&C': 'Hot And Cold',
 'HM': 'Yes',
 'HMM': 'Yes',
 'HP': 'Horsepower',
 'HR': 'Hour',
 'HRH': 'His Royal Highness',
 'HT': 'Height',
 'I.E': 'That Is',
 'IBRB': 'I Will Be Right Back',
 'IC': 'I See',
 'ICQ': 'I Seek You (Also A Chat Program)',
 'ICYMI': 'In Case You Missed It',
 'IDC': 'I Don’t Care',
 'IDGADF': 'I Do Not Give A Damn Fuck',
 'IDGAF': 'I Do Not Give A Fuck',
 'IDK': 'I Do Not Know',
 'IE': 'That Is',
 'IFYP': 'I Feel Your Pain',
 'IG': 'Instagram',
 'IIRC': 'If I Remember Correctly',
 'ILU': 'I Love You',
 'ILY': 'I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IMU': 'I Miss You',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'J4F': 'Just For Fun',
 'JIC': 'Just In Case',
 'JK': 'Just Kidding',
 'JSYK': 'Just So You Know',
 'KISS': 'Keep It Simple, Stupid',
 'L8R': 'Later',
 'LB': 'Pound',
 'LBS': 'Pounds',
 'LDR': 'Long Distance Relationship',
 'LMAO': 'Laughing My A** Off',
 'LMFAO': 'Laugh My Fucking Ass Off',
 'LOL': 'Laughing Out Loud',
 'LTD': 'Limited',
 'LTNS': 'Long Time No See',
 'M8': 'Mate',
 'MF': 'Motherfucker',
 'MFS': 'Motherfuckers',
 'MFW': 'My Face When',
 'MOFO': 'Motherfucker',
 'MPH': 'Miles Per Hour',
 'MR': 'Mister',
 'MRW': 'My Reaction When',
 'MS': 'Miss',
 'MTE': 'My Thoughts Exactly',
 'NAGI': 'Not A Good Idea',
 'NBC': 'National Broadcasting Company',
 'NBD': 'Not Big Deal',
 'NFS': 'Not For Sale',
 'NGL': 'Not Going To Lie',
 'NHS': 'National Health Service',
 'NRN': 'No Reply Necessary',
 'NSFL': 'Not Safe For Life',
 'NSFW': 'Not Safe For Work',
 'NTH': 'Nice To Have',
 'NVR': 'Never',
 'N8T': 'Night',
 'NYC': 'New York City',
 'OC': 'Original Content',
 'OG': 'Original',
 'OHP': 'Overhead Projector',
 'OIC': 'Oh I See',
 'OMDB': 'Over My Dead Body',
 'OMG': 'Oh My God',
 'OMW': 'On My Way',
 'P.A': 'Per Annum',
 'P.M': 'After Midday',
 'PITA': 'Pain In The A..',
 'PM': 'Prime Minister',
 'POC': 'People Of Color',
 'POV': 'Point Of View',
 'PP': 'Pages',
 'PPL': 'People',
 'PRT': 'Party',
 'PRW': 'Parents Are Watching',
 'PS': 'Postscript',
 'PT': 'Point',
 'PTB': 'Please Text Back',
 'PTO': 'Please Turn Over',
 'QPSA': 'Que Pasa?',
 'RATCHET': 'Rude',
 'RBTL': 'Read Between The Lines',
 'RLRT': 'Real Life Retweet',
 'ROFL': 'Rolling On The Floor Laughing',
 'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
 'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
 'RT': 'Retweet',
 'RUOK': 'Are You Ok',
 'SFW': 'Safe For Work',
 'SK8': 'Skate',
 'SMH': 'Shake My Head',
 'SQ': 'Square',
 'SRSLY': 'Seriously',
 'SSDD': 'Same Stuff Different Day',
 'STATS': 'Your Sex And Age',
 'TBH': 'To Be Honest',
 'TBS': 'Tablespoonful',
 'TBSP': 'Tablespoonful',
 'TFW': 'That Feeling When',
 'THKS': 'Thank You',
 'THO': 'Though',
 'THX': 'Thank You',
 'TIA': 'Thanks In Advance',
 'TIL': 'Today I Learned',
 'TIME': 'Tears In My Eyes',
 'TL;DR': 'Too Long I Did Not Read',
 'TLDR': 'Too Long I Did Not Read',
 'TMB': 'Tweet Me Back',
 'TNTL': 'Trying Not To Laugh',
 'TTFN': 'Ta-Ta For Now!',
 'TTYL': 'Talk To You Later',
 'U': 'You',
 'U2': 'You Too',
 'U4E': 'Yours For Ever',
 'UTC': 'Coordinated Universal Time',
 'W/': 'With',
 'W/O': 'Without',
 'W8': 'Wait...',
 'WASSUP': 'What Is Up',
 'WB': 'Welcome Back',
 'WTF': 'What The F...',
 'WTG': 'Way To Go!',
 'WTPA': 'Where The Party At',
 'WUF': 'Where Are You From?',
 'WUZUP': 'What Is Up',
 'WYWH': 'Wish You Were Here',
 'YD': 'Yard',
 'YGTR': 'You Got That Right',
 'YNK': 'You Never Know',
 'ZZZ': 'Sleeping, Bored, Tired',
 '€': ' Euro '}
print('Total chat words listed:',len(chat_word_dict))

Total chat words listed: 243


In [16]:
# Defining function

def chat_word_conversion(text):
    new_text = []
    for word in text.split():
        if word.upper() in chat_word_dict:
            new_text.append(chat_word_dict[word.upper()])
        else:
            new_text.append(word)
    return ' '.join(new_text)

In [17]:
sample5='fyi ILU babe'
chat_word_conversion(sample5)

'For Your Information I Love You Baby or Wife'
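
Because `chat_word_conversion` splits on whitespace, a chat word with punctuation attached (for example `brb!`) is not found in the dictionary. One way around this is sketched below, assuming a regex tokenization that keeps words, punctuation, and whitespace as separate pieces. The name `expand_chat_words` and the three-entry `chat_words` subset are made up for this sketch.

```python
import re

# Small illustrative subset of the chat-word dictionary defined earlier
chat_words = {'FYI': 'For Your Information', 'BRB': 'Be Right Back', 'ILU': 'I Love You'}

# Words, runs of punctuation, and runs of whitespace become separate tokens
TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")

def expand_chat_words(text, mapping=chat_words):
    parts = []
    for token in TOKEN_RE.findall(text):
        # Replace the token if its upper-case form is a known chat word
        parts.append(mapping.get(token.upper(), token))
    return ''.join(parts)

print(expand_chat_words("fyi, brb!"))
```

The trade-off is that dictionary keys which themselves contain punctuation, such as `W/O` or `TL;DR`, are split into several tokens and no longer match.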

In [None]:
# Applying on dataset
df['col_name'] = df['col_name'].apply(lambda x: chat_word_conversion(x))
df.head()

## **f. Spelling Correction**
- Correct misspelled words where needed
- This reduces model complexity, because each misspelling would otherwise be treated as a separate word
- Required when dealing with manually typed data or text transcribed from voice chat/audio
- Libraries such as NLTK, spaCy, TextBlob, and pyspellchecker can be used

In [60]:
sample6='ceertain conditions duriing sevaral ggeneration aree modified in the saame maner.'
sample6

'ceertain conditions duriing sevaral ggeneration aree modified in the saame maner.'

In [61]:
# Importing library
from textblob import TextBlob

txtblob = TextBlob(sample6)
txtblob.correct().string

'certain conditions during several generation are modified in the same manner.'

In [None]:
# Defining function

from textblob import TextBlob

def spell_checker(text):
    txtblob = TextBlob(text)
    return txtblob.correct().string

In [None]:
# Applying on dataset
df['col_name'] = df['col_name'].apply(lambda x: spell_checker(x))
df.head()

## **g. Removing Stop Words**
- Remove words that carry little meaning on their own (e.g. `a`, `the`, `is`)
- Do not remove stop words when doing part-of-speech tagging
- Stop words can be removed using the NLTK library
- Execution time depends on the size of the dataset

In [18]:
# Importing library
from nltk.corpus import stopwords

eng_stop_words = stopwords.words('english')
# OR
# sw_list = set(stopwords.words('english'))   # Convert the list to a set for faster membership tests

In [19]:
# Defining function

def remove_stopwords(text):
    new_text = []
    for word in text.split():
        if word.lower() in stopwords.words('english'):
            new_text.append('')
        else:
            new_text.append(word)
    x = new_text[:]
    new_text.clear()
    return ' '.join(x)

# If the function above fails on non-string values, use this type-safe version
def remove_stopwords(text):
    if isinstance(text, str):
        new_text = []
        for word in text.split():
            if word.lower() not in eng_stop_words:
                new_text.append(word)
        return ' '.join(new_text)
    else:
        return text

In [20]:
sample7 = df['review'][3]
sample7

'basically theres a family where a little boy jake thinks theres a zombie in his closet  his parents are fighting all the timethis movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombieok first of all when youre going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing  arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs as for the shots with jake just ignore them'

In [21]:
remove_stopwords(sample7)

'basically theres  family   little boy jake thinks theres  zombie   closet  parents  fighting   timethis movie  slower   soap opera  suddenly jake decides  become rambo  kill  zombieok first    youre going  make  film  must decide    thriller   drama   drama  movie  watchable parents  divorcing arguing like  real life     jake   closet  totally ruins   film  expected  see  boogeyman similar movie  instead  watched  drama   meaningless thriller spots3   10    well playing parents descent dialogs    shots  jake  ignore '

In [None]:
# Applying on dataset  ---> this may take some time
df['review'] = df['review'].apply(lambda x: remove_stopwords(x))
df.head()

In [22]:
# Time test on dataset
start =time.time()

x = df['review'].head(10).apply(lambda x: remove_stopwords(x))
print(x)

end=time.time()
print((end-start)*5000) 

0    one    reviewers  mentioned   watching  1 oz e...
1     wonderful little production  filming techniqu...
2     thought    wonderful way  spend time    hot s...
3    basically theres  family   little boy jake thi...
4    petter matteis love   time  money   visually s...
5    probably  alltime favorite movie  story  selfl...
6     sure would like  see  resurrection    dated s...
7     show   amazing fresh innovative idea   70s   ...
8    encouraged   positive comments   film     look...
9      like original gut wrenching laughter   like ...
Name: review, dtype: object
1225.365400314331
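
Much of the cost measured above is likely to come from the membership test: the first `remove_stopwords` definition calls `stopwords.words('english')` for every word, and a list lookup scans the entries one by one. A faster variant is sketched below, assuming the stop words are built once into a set. The name `remove_stopwords_fast` is made up, and `STOP_WORDS` is a tiny hand-written subset so the sketch runs without NLTK data; in the notebook it would be `set(stopwords.words('english'))`.

```python
# Illustrative subset; in the notebook this would be set(stopwords.words('english'))
STOP_WORDS = frozenset({'a', 'an', 'the', 'is', 'in', 'of', 'and', 'to', 'his', 'are'})

def remove_stopwords_fast(text, stop_words=STOP_WORDS):
    if not isinstance(text, str):
        return text  # leave NaN / None values untouched
    # Set membership is a constant-time lookup on average
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stopwords_fast('the movie is slower than a soap opera'))
```

Unlike the first definition above, this version drops stop words entirely instead of appending empty strings, so the result has no doubled spaces.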


## **h. Handling Emojis**
- Either remove the emojis from the text
- Or replace each emoji with its meaning
- Required when dealing with messaging app data or personal chat data
- Do this before removing punctuation or special characters


In [29]:
# Defining function to remove emojis

import emoji
import re
def remove_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               "]+", flags=re.UNICODE)
    results = emoji_pattern.sub('', text)
    
    return results

In [30]:
# Defining function to replace emojis with its relevant meaning

import emoji
def replace_emojis(text):
    results = emoji.demojize(text)
    return results

In [31]:
sample8 = "Hello! How are you doing? 😊👋🌟"
sample8

'Hello! How are you doing? 😊👋🌟'

In [32]:
print(remove_emojis(sample8))
print(replace_emojis(sample8))

Hello! How are you doing? 
Hello! How are you doing? :smiling_face_with_smiling_eyes::waving_hand::glowing_star:


## **i. Tokenization**
- Tokenization means breaking a text document into smaller parts
- The smaller parts can be words, sentences, or phrases
### Challenges
- Prefix/Suffix/Infix ---> eg. $10, 10KM, ...-...
- Exception ---> eg. Let's, U.S.

### Method-01: Using split function

In [34]:
# Word tokenization
sent1 = "Hi I am currently learning Natural Language Processing"
sent1.split()

['Hi', 'I', 'am', 'currently', 'learning', 'Natural', 'Language', 'Processing']

In [35]:
# Sentence tokenization
sent2 = "Hi I am currently learning Natural Language Processing. It is currently in market demand. If you wish you too can start learning"
sent2.split('.')

['Hi I am currently learning Natural Language Processing',
 ' It is currently in market demand',
 ' If you wish you too can start learning']

In [37]:
# Problem with split function
sent3 = "Hi!, I'm currently learning NLP!"
sent3.split()

['Hi!,', "I'm", 'currently', 'learning', 'NLP!']

### Method-02: Using Regular Expression 

In [41]:
import re
sent4 = "Hi!, I am currently learning NLP!. It's currently in market demand. Would you like to learn?."
res4 = re.split(r'[!?,\s]+', sent4)
res4

['Hi',
 'I',
 'am',
 'currently',
 'learning',
 'NLP',
 '.',
 "It's",
 'currently',
 'in',
 'market',
 'demand.',
 'Would',
 'you',
 'like',
 'to',
 'learn',
 '.']
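
`re.split` throws away the delimiters and still leaves tokens such as `demand.` with punctuation attached. An alternative is to describe the tokens themselves with `re.findall`, as sketched below. The pattern and the name `regex_tokenize` are illustrative choices, not a standard tokenizer: a token is either a word (optionally with one apostrophe part, as in `I'm`) or a single punctuation character.

```python
import re

# A word with an optional apostrophe suffix, or one non-space punctuation character
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    return TOKEN_RE.findall(text)

print(regex_tokenize("Hi!, I'm currently learning NLP!"))
# ['Hi', '!', ',', "I'm", 'currently', 'learning', 'NLP', '!']
```

This simple pattern still breaks abbreviations: `U.S.` becomes `['U', '.', 'S', '.']`, which is one reason to reach for a library tokenizer.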

### Method-03: NLTK Library

In [45]:
# Importing library
from nltk.tokenize import word_tokenize, sent_tokenize

sample9 = "Hi!, I am currently learning NLP!. It's currently in market demand. Would you like to learn?."
print(word_tokenize(sample9))
print()
print(sent_tokenize(sample9))

['Hi', '!', ',', 'I', 'am', 'currently', 'learning', 'NLP', '!', '.', 'It', "'s", 'currently', 'in', 'market', 'demand', '.', 'Would', 'you', 'like', 'to', 'learn', '?', '.']

['Hi!, I am currently learning NLP!.', "It's currently in market demand.", 'Would you like to learn?.']


In [46]:
# Some more examples ---> challenges for the NLTK library
s1='I have a Ph.D in A.I.'
s2="We're here to help! you can mail us at abc@gmail.com"
s3='This face cream costs $5 in U.S.A.'
s4="Hi!, I'm currently learning NLP!"

print(word_tokenize(s1))
print()
print(word_tokenize(s2))
print()
print(word_tokenize(s3))
print()
print(word_tokenize(s4))

['I', 'have', 'a', 'Ph.D', 'in', 'A.I', '.']

['We', "'re", 'here', 'to', 'help', '!', 'you', 'can', 'mail', 'us', 'at', 'abc', '@', 'gmail.com']

['This', 'face', 'cream', 'costs', '$', '5', 'in', 'U.S.A', '.']


### Method-04: spaCy Library

In [47]:
# Importing library
import spacy
nlp = spacy.load('en_core_web_sm')

In [48]:
# Let's try it
s1='I have a Ph.D in A.I.'
s2="We're here to help! you can mail us at abc@gmail.com"
s3='This face cream costs $5 in U.S.A.'
s4="Hi!, I'm currently learning NLP!"

doc1 = nlp(s1)
doc2 = nlp(s2)
doc3 = nlp(s3)
doc4 = nlp(s4)

In [56]:
for token in doc1:
    print(token, end=' - ')
print()
for token in doc2:
    print(token, end=' - ')
print()
for token in doc3:
    print(token, end=' - ')
print()
for token in doc4:
    print(token, end=' - ')

I - have - a - Ph - . - D - in - A.I. - 
We - 're - here - to - help - ! - you - can - mail - us - at - abc@gmail.com - 
This - face - cream - costs - $ - 5 - in - U.S.A. - 
Hi - ! - , - I - 'm - currently - learning - NLP - ! - 

## **j. Stemming**
- In grammar, inflection is the modification of a word to express grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood.
- **Stemming is the process of reducing inflected words to their root form, mapping a group of words to the same stem even if the stem itself is not a valid word in the language**
- e.g. Write, Writes, Writing --> write (irregular forms such as Wrote and Written are left unchanged by PorterStemmer, as the output below shows)
- Most widely used in information retrieval systems
- A stemmer is an algorithm that performs stemming
- Stemmers in NLTK include PorterStemmer (English) and SnowballStemmer (several languages, including English)

In [57]:
# Importing library

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [58]:
# Defining function
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [64]:
s10 = 'Write Writes Wrote Written Writing walk walks walking walked'
s10

'Write Writes Wrote Written Writing walk walks walking walked'

In [65]:
stem_words(s10)

'write write wrote written write walk walk walk walk'

In [66]:
sample10 = df['review'][1]
sample10

'a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done'

In [67]:
stem_words(sample10)

'a wonder littl product the film techniqu is veri unassum veri oldtimebbc fashion and give a comfort and sometim discomfort sens of realism to the entir piec the actor are extrem well chosen michael sheen not onli ha got all the polari but he ha all the voic down pat too you can truli see the seamless edit guid by the refer to william diari entri not onli is it well worth the watch but it is a terrificli written and perform piec a master product about one of the great master of comedi and hi life the realism realli come home with the littl thing the fantasi of the guard which rather than use the tradit dream techniqu remain solid then disappear it play on our knowledg and our sens particularli with the scene concern orton and halliwel and the set particularli of their flat with halliwel mural decor everi surfac are terribl well done'
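
The stems above (`littl`, `veri`, `onli`) come from suffix-stripping rules rather than a dictionary. The toy function below illustrates that idea only; it is not the Porter algorithm, which applies many more rules and conditions. The name `toy_stem`, the suffix list, and the `min_stem_len` guard are all made up for this sketch.

```python
SUFFIXES = ('ing', 'ed', 'es', 's')   # checked in this order

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, but keep at least min_stem_len characters
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ['walk', 'walks', 'walking', 'walked', 'wrote']])
# ['walk', 'walk', 'walk', 'walk', 'wrote']
```

Like a real stemmer, it can produce non-words (`movies` becomes `movi`) and it cannot handle irregular forms such as `wrote`.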

## **k. Lemmatization**
- In lemmatization, the root word is called the lemma: the canonical, dictionary, or citation form of a set of words
- Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item
- It is used in computational linguistics, natural language processing (NLP), and chatbots
- Lemmatization is slower than stemming
- If the results will be shown to users, use lemmatization (its output is a valid word); otherwise stemming is enough
## **NOTE:**
- Stemming is rule-based (algorithmic), so it is faster
- Lemmatization is lookup-based, which is why it is slower (WordNetLemmatizer searches the WordNet lexical dictionary)
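
The lookup idea in the note can be illustrated with a toy table. This is only a sketch: `LEMMA_TABLE` is a tiny hand-made dictionary and `toy_lemmatize` is a made-up name, whereas a real lemmatizer consults a full lexical database and usually needs the part of speech as well.

```python
# Tiny hand-made lookup table for illustration only
LEMMA_TABLE = {
    'wrote': 'write', 'written': 'write', 'writes': 'write', 'writing': 'write',
    'was': 'be', 'is': 'be', 'are': 'be',
    'better': 'good',
}

def toy_lemmatize(word):
    # Return the dictionary form if the word is known, else the word unchanged
    return LEMMA_TABLE.get(word.lower(), word)

for w in ['Wrote', 'written', 'was', 'better', 'table']:
    print(f"{w:10}{toy_lemmatize(w)}")
```

Irregular forms such as `wrote` and `better` resolve to valid words, which rule-based suffix stripping cannot do; the price is that every form has to be found in the dictionary.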

In [3]:
sample11 = 'Hello!, I was reading and studying same time. Also I have a bad habit of bathing and running after taking breakfast or eating something.'
sample11

'Hello!, I was reading and studying same time. Also I have a bad habit of bathing and running after taking breakfast or eating something.'

**----------------xxx----------------------CHECK BELOW---------------------xxx---------------------xxx**

In [5]:
# Importing library

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')                 # Downloads the WordNet data (needs internet access)
# nltk.data.path.append('path_to_wordnet')   # If the download fails: download manually and add its folder here

lemmatizer = WordNetLemmatizer()

[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


False

In [None]:
# Let's try this: tokenize, then drop punctuation tokens
# (build a new list instead of removing items while iterating over the list)
punc = set('!,.?')
sent_words = [word for word in nltk.word_tokenize(sample11) if word not in punc]
sent_words

In [None]:
print("{0:20}{1:20}".format('Word','Lemma'))
for word in sent_words:
    print("{0:20}{1:20}".format(word, lemmatizer.lemmatize(word, pos='v')))


In [None]:
# Questions:
# Why and when do we need tokenization, stemming, and lemmatization?