# Text Preprocessing

## Part 0 - Common Terms
Some common terms and their meanings (imagine a sentiment-analysis data set with two columns, *review* and *sentiment*):
- **Corpus (C):** All the words in the *review* column, repeats included.
- **Vocabulary (V):** All the unique words in the *review* column.
- **Document:** An individual record (row) of the data set.
- **Word:** Each individual word in a document 😅
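
To make these terms concrete, here is a minimal sketch on two made-up reviews. The `reviews` list is a hypothetical stand-in for the *review* column, and words are split on whitespace only.

```python
# Toy reviews standing in for the "review" column (made up for illustration)
reviews = [
    "the movie was great",
    "the plot was weak",
]

# Each record is one document
documents = reviews

# Corpus: every word across all documents, repeats included
corpus = [word for doc in documents for word in doc.split()]

# Vocabulary: the unique words in the corpus
vocabulary = sorted(set(corpus))

print(len(documents))  # number of documents
print(len(corpus))     # total number of words, repeats included
print(vocabulary)      # unique words
```

Here "the" and "was" each appear twice in the corpus but only once in the vocabulary.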

## Part 1 - Load the Dataset from Kaggle

In [None]:
# Upload the "kaggle.json" file and install the kaggle library
! pip install kaggle
# Create the ".kaggle" folder (-p avoids an error if it already exists)
! mkdir -p ~/.kaggle
# Copy "kaggle.json" into this directory
! cp kaggle.json ~/.kaggle/
# Restrict the file's permissions so that only the owner can read it
! chmod 600 ~/.kaggle/kaggle.json
# Download the dataset
! kaggle datasets download lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
# Unzip the dataset
! unzip imdb-dataset-of-50k-movie-reviews.zip

Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 89% 23.0M/25.7M [00:00<00:00, 115MB/s]
100% 25.7M/25.7M [00:00<00:00, 115MB/s]
Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df.shape

(50000, 2)

In [None]:
review = df["review"][3]
review

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

## Part 2 - Convert to Lower Case

In [None]:
# Convert to lower case
review.lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [None]:
# Convert all records of "review" column to lower case
df["review"] = df["review"].str.lower()
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


## Part 3 - Remove HTML Tags

In [None]:
# Remove HTML tags
import re

def remove_html_tags(text):
    pattern = re.compile("<.*?>")
    return pattern.sub(r"", text)

remove_html_tags(df["review"][3])

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [None]:
# Remove all the tags from the "review" column
df["review"] = df["review"].apply(remove_html_tags)
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
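
Replacing each tag with an empty string glues neighbouring words together, as in `time.this` in the output above. A possible variant, sketched here under the assumption that a single space is an acceptable substitute for a tag, replaces tags with a space and then collapses repeated whitespace. The name `remove_html_tags_spaced` is made up for this illustration.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")

def remove_html_tags_spaced(text):
    # Replace each tag with a space so neighbouring words stay apart
    text = TAG_PATTERN.sub(" ", text)
    # Collapse runs of whitespace left behind by consecutive tags
    return re.sub(r"\s+", " ", text).strip()

sample = "all the time.<br /><br />this movie is slower"
print(remove_html_tags_spaced(sample))
```

It can be applied to the whole column the same way, with `df["review"].apply(remove_html_tags_spaced)`.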


## Part 4 - Remove URLs

In [None]:
# Remove URLs
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

print(remove_url(text1))
print(remove_url(text2))
print(remove_url(text3))
print(remove_url(text4))

Check out my notebook 
Check out my notebook 
Google search here 
For notebook click  to search check 


## Part 5 - Remove Punctuation

In [None]:
import string
import time

# Show all ASCII punctuation characters
exclude = string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
def remove_punctuation(text):
    exclude = string.punctuation

    for char in exclude:
        text = text.replace(char, "")
    return text

In [None]:
text = "string. With. Punctuation?"
start = time.time()
print(remove_punctuation(text))
time1 = time.time() - start
print(f"Time taken: {time1}")

string With Punctuation
Time taken: 0.0010445117950439453


In [None]:
0.0010445117950439453 * 50000

52.225589752197266

In [None]:
def remove_punctuation_fast(text):
    return text.translate(str.maketrans("", "", exclude))

In [None]:
text = "string. With. Punctuation?"
start = time.time()
print(remove_punctuation_fast(text))
time1 = time.time() - start
print(f"Time taken: {time1}")

string With Punctuation
Time taken: 0.0009534358978271484


In [None]:
0.0009534358978271484 * 50000

47.67179489135742
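
A single `time.time()` measurement around one call is noisy and also includes the time spent in `print`, so the two estimates above are hard to compare. A rough sketch of a steadier comparison with the standard-library `timeit` module follows. The function names and the repetition count `n` are choices made for this illustration, and the timings will vary by machine.

```python
import string
import timeit

PUNCT = string.punctuation
TABLE = str.maketrans("", "", PUNCT)  # translation table built once and reused

def remove_punctuation_loop(text):
    # One str.replace pass per punctuation character
    for char in PUNCT:
        text = text.replace(char, "")
    return text

def remove_punctuation_translate(text):
    # Single pass over the text using the prebuilt table
    return text.translate(TABLE)

sample = "string. With. Punctuation?"
n = 10_000  # number of repetitions per measurement

loop_time = timeit.timeit(lambda: remove_punctuation_loop(sample), number=n)
translate_time = timeit.timeit(lambda: remove_punctuation_translate(sample), number=n)

print(f"loop:      {loop_time / n:.2e} s per call")
print(f"translate: {translate_time / n:.2e} s per call")
```

Averaging over many calls gives a per-call figure that is more reasonable to multiply by the number of documents.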

In [None]:
! kaggle datasets download arkhoshghalb/twitter-sentiment-analysis-hatred-speech
! unzip twitter-sentiment-analysis-hatred-speech.zip

Downloading twitter-sentiment-analysis-hatred-speech.zip to /content
  0% 0.00/1.89M [00:00<?, ?B/s]
100% 1.89M/1.89M [00:00<00:00, 63.4MB/s]
Archive:  twitter-sentiment-analysis-hatred-speech.zip
  inflating: test.csv                
  inflating: train.csv               


In [None]:
twitter_df = pd.read_csv("train.csv")
twitter_df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [None]:
start3 = time.time()
twitter_df["tweet"] = twitter_df["tweet"].apply(remove_punctuation_fast)
time3 = time.time() - start3

In [None]:
time3

0.20287704467773438

## Part 6 - Chat Word Treatment

In [None]:
# Convert shorthand words to their full forms, e.g. "U" to "You"
import requests
import json

chat_words = requests.get("https://speedkode.herokuapp.com/chat_abbreviations").json()["all_abbreviations"]
chat_words

{'?': 'I don’t understand what you mean',
 '?4U': 'I have a question for you',
 '^^': 'Meaning “read line” or “read message” above',
 '<3': 'Meaning “sideways heart” (love, friendship)',
 '</3': 'Meaning “broken heart”',
 '<3333': 'Meaning “heart or love” (more 3s is a bigger heart)',
 '@TEOTD': 'At the end of the day',
 '.02': 'My (or your) two cents worth',
 '1TG, 2TG': 'Meaning number of items needed for win (online gaming)',
 '1UP': 'Meaning extra life (online gaming)',
 '121': 'One-to-one (private chat initiation)',
 '1337': 'Leet, meaning ‘elite’',
 '143': 'I love you',
 '1432': 'I love you too',
 '14AA41': 'One for all, and all for one',
 '182': 'I hate you',
 '19': 'Zero hand (online gaming)',
 '10M': 'Ten man (online gaming)',
 '10X': 'Thanks',
 '10Q': 'Thank you',
 '1CE': 'Once',
 '1DR': 'I wonder',
 '1NAM': 'One in a million',
 '2': 'Meaning “to” in SMS',
 '20': 'Meaning “location”',
 '2B': 'To be',
 '2EZ': 'Too easy',
 '2G2BT': 'Too good to be true',
 '2M2H': 'Too much to h

In [None]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    
    return " ".join(new_text)

chat_conversion("IMHO he is the best")

'In my humble opinion he is the best'

In [None]:
chat_conversion("FYI delhi is the capital of india.")

'For your information delhi is the capital of india.'
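
Because `chat_conversion` looks up whole whitespace-separated chunks, an abbreviation followed by punctuation (for example `FYI,`) would not match. A minimal sketch of one workaround is shown below. It uses a tiny hypothetical dictionary, `sample_chat_words`, in place of the downloaded one, and a made-up helper name.

```python
import string

# Tiny stand-in dictionary (hypothetical; the notebook loads a much larger one)
sample_chat_words = {
    "FYI": "For your information",
    "IMHO": "In my humble opinion",
}

def chat_conversion_punct(text, mapping):
    new_text = []
    for w in text.split():
        core = w.rstrip(string.punctuation)  # chunk without trailing punctuation
        tail = w[len(core):]                 # the punctuation that was stripped
        if core.upper() in mapping:
            new_text.append(mapping[core.upper()] + tail)
        else:
            new_text.append(w)
    return " ".join(new_text)

print(chat_conversion_punct("FYI, delhi is the capital of india.", sample_chat_words))
```

Only trailing punctuation is stripped, so keys in the real dictionary that themselves end in punctuation would need separate handling.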

## Part 7 - Spelling Correction

In [1]:
from textblob import TextBlob

incorrect_text = "ceertain conditionas during seveal ggnerations aree modified in the same maner."

textBlb = TextBlob(incorrect_text)

textBlb.correct().string

'certain conditions during several generations are modified in the same manner.'

In [6]:
incorrect_text = "a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only \"has got all the polari\" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master's of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell's murals decorating every surface) are terribly well done."

textBlb = TextBlob(incorrect_text)

textBlb.correct().string

'a wonderful little production. the filling technique is very assuming- very old-time-bc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polar" but he has all the voices down pat too! you can truly see the fearless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrific written and performed piece. a wasteful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' technique remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning norton and halliwell and the sets (particularly of their flat with halliwell\'s morals decorating every surface) are terribly well done.'
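
In this second run the corrector also rewrites valid words, for example "filming" becomes "filling", "masterful" becomes "wasteful" and "orton" becomes "norton". A small sketch for listing which tokens changed is shown below. It assumes the correction keeps the number of whitespace-separated tokens unchanged, and the two strings are short excerpts copied from the cell above.

```python
# Short excerpts taken from the input and output of the cell above
original = "the filming technique is very unassuming"
corrected = "the filling technique is very assuming"

# Pair tokens position by position and keep the pairs that differ
changes = [
    (before, after)
    for before, after in zip(original.split(), corrected.split())
    if before != after
]
print(changes)
```

Reviewing such a list on a sample of documents is one way to judge whether automatic correction helps or hurts for a given data set.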

## Part 8 - Removing Stopwords

In [None]:
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords

stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
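# Build the stopword set once; set lookups are much faster than
# reloading and scanning the list for every word
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    new_text = []

    for word in text.split():
        if word in STOP_WORDS:
            # An empty placeholder is kept, so removed words leave extra spaces
            new_text.append("")
        else:
            new_text.append(word)

    return " ".join(new_text)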

In [None]:
remove_stopwords("probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times")

'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'
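
The output above contains runs of spaces because every removed word leaves an empty string behind. A minimal sketch of a variant that drops stopwords outright follows. The small `SAMPLE_STOPWORDS` set is hand-picked for this illustration; in the notebook it would be `set(stopwords.words("english"))`.

```python
# Small hand-picked stopword set for illustration only
SAMPLE_STOPWORDS = {"my", "a", "of", "and", "to", "but", "it's", "not", "or", "it", "just"}

def remove_stopwords_compact(text, stop_words):
    # Keep only the words that are not stopwords; nothing is left in their place
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords_compact(
    "a story of selflessness, sacrifice and dedication to a noble cause",
    SAMPLE_STOPWORDS,
))
```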

In [None]:
df["review"] = df["review"].apply(remove_stopwords)
df.head()

## Part 9 - Handling Emojis

### Remove Emojis

In [None]:
import re

def remove_emoji(text):
    # Character class built from Unicode code point ranges
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very wide range: also covers non-emoji characters such as CJK text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [None]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [None]:
remove_emoji("Lmao 😂😂")

'Lmao '

### Convert Emojis

In [None]:
!pip install emoji
import emoji

Collecting emoji
  Downloading emoji-1.6.1.tar.gz (170 kB)
     |████████████████████████████████| 170 kB 7.6 MB/s eta 0:00:01
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.6.1-py3-none-any.whl size=169314 sha256=6c3c9c26e6b81047a2d35da054ddf2ce334091a433b6f6662e2ba74b87817305
  Stored in directory: /root/.cache/pip/wheels/ea/5f/d3/03d313ddb3c2a1a427bb4690f1621eea60fe6f2a30cc95940f
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-1.6.1


In [None]:
print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [None]:
print(emoji.demojize('Loved the movie. It was 😘'))

Loved the movie. It was :face_blowing_a_kiss:


## Part 10 - Tokenization
- **Prefix:** Character(s) at the beginning
- **Suffix:** Character(s) at the end
- **Infix:** Character(s) in between
- **Exception:** Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.
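
To illustrate the prefix and suffix terms, here is a toy splitter that peels leading and trailing punctuation off a whitespace-separated chunk. It is only a sketch: the `PREFIXES` and `SUFFIXES` character sets and the `split_affixes` name are made up, infixes and exceptions are not handled, and it is not the algorithm used by NLTK or spaCy.

```python
# Characters treated as prefixes and suffixes (chosen for this illustration)
PREFIXES = set("([{\"'$")
SUFFIXES = set(")]}\"'!?.,;:")

def split_affixes(chunk):
    prefixes, suffixes = [], []
    # Peel prefix characters off the front, one at a time
    while chunk and chunk[0] in PREFIXES:
        prefixes.append(chunk[0])
        chunk = chunk[1:]
    # Peel suffix characters off the end, keeping their original order
    while chunk and chunk[-1] in SUFFIXES:
        suffixes.insert(0, chunk[-1])
        chunk = chunk[:-1]
    core = [chunk] if chunk else []
    return prefixes + core + suffixes

print(split_affixes("(delhi!)"))
print(split_affixes("$10.0"))
```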

### Using the `split` method

In [None]:
# word tokenization
sent1 = "I am going to delhi"
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [None]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great.'
sent2.split(".")

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great",
 '']

In [None]:
# problems with split method
sent3 = 'I am going to delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!']

In [None]:
sent4 = 'Where do you think I should go? I have 3 day holiday'
sent4.split(".")

['Where do you think I should go? I have 3 day holiday']
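
The last call returns the whole string because the sentence boundary here is `?`, not `.`. A minimal sketch of a splitter that treats `.`, `!` and `?` as boundaries and keeps the punctuation, using a regular-expression lookbehind, is shown below. The helper name `split_sentences` is made up for this illustration, and abbreviations such as "Ph.D" would still be split incorrectly.

```python
import re

def split_sentences(text):
    # Split on whitespace that follows ".", "!" or "?";
    # the lookbehind leaves the punctuation attached to its sentence
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("Where do you think I should go? I have a 3 day holiday. Great!"))
```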

### Regular Expression

In [None]:
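import re

sent3 = 'I am going to delhi!'
# Raw string avoids an invalid-escape warning for \w;
# the pattern keeps runs of word characters and apostrophes
tokens = re.findall(r"[\w']+", sent3)
tokens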

['I', 'am', 'going', 'to', 'delhi']

In [None]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

### NLTK

In [None]:
import nltk
# Recent NLTK releases may instead require nltk.download("punkt_tab")
nltk.download("punkt")

from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
sent1 = "I am going to visit delhi!"
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [None]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [None]:
sent5 = "I have a Ph.D in A.I"
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = "A 5km ride cost $10.0"

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [None]:
word_tokenize(sent6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'nks',
 '@',
 'gmail.com']

In [None]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.0']

### Spacy

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [None]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [None]:
for token in doc1:
    print(token)

I
have
a
Ph
.
D
in
A.I


In [None]:
for token in doc2:
    print(token)

We
're
here
to
help
!
mail
us
at
nks@gmail.com


In [None]:
for token in doc3:
    print(token)

A
5
km
ride
cost
$
10.0


In [None]:
for token in doc4:
    print(token)

I
am
going
to
visit
delhi
!


## Part 11 - Stemming
*In grammar, inflection is the modification of a word to express grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood.*

*Stemming reduces inflected words to a root form (the stem), mapping a group of related words to the same stem even if that stem is not a valid word in the language.*
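
As a rough illustration of the idea, the sketch below strips a few common suffixes. It is a deliberately naive toy and not the Porter algorithm; the suffix list and the `min_stem` threshold are assumptions made for this example.

```python
def naive_stem(word, min_stem=3):
    # Try each suffix in order and strip the first one that matches,
    # as long as at least min_stem characters remain
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["walk", "walks", "walking", "walked", "movies"]])
```

The inflected forms of "walk" all collapse to the same stem, while "movies" is reduced to "movi", which is not a valid English word.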

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
ps = PorterStemmer()

def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [None]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [None]:
text = "probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie"
text

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

In [None]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

## Part 12 - Lemmatization
*Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the **lemma**. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.*
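
Conceptually, a lemmatizer looks words up in a lexicon instead of chopping off suffixes. The toy sketch below uses a hand-written table, `LEMMA_TABLE`, which is made up for this illustration; real lemmatizers rely on a full lexicon such as WordNet.

```python
# Hand-written lookup table (illustrative only)
LEMMA_TABLE = {"was": "be", "has": "have", "running": "run", "eating": "eat"}

def toy_lemmatize(word):
    # Return the dictionary form if known, otherwise the word unchanged
    return LEMMA_TABLE.get(word.lower(), word)

for w in ["was", "running", "has", "Sun"]:
    print(f"{w:10}{toy_lemmatize(w)}")
```

Irregular forms such as "was" map to "be", which suffix stripping alone cannot produce.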

In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = set("?:!.,;")
# Build a new list rather than removing items from the list being iterated over
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos="v" treats every word as a verb; without it the lemmatizer assumes a noun
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos="v")))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun 