## **Text preprocessing**

1) Data Acquisition

2) Text Preparation

3) Feature Engineering

4) Modelling

5) Deployment

## **1) Data Acquisition**


## **2) Basic Text Preprocessing**
- Lowercasing
- Removing HTML Tags
- Removing URLs
- Removing Punctuation
- Chat Word Treatment
- Spelling Correction
- Removing Stop Words
- Handling Emojis
- Tokenization
- Stemming
- Lemmatization

## **Advanced Preprocessing**
- POS Tagging
- Chunking
- Parsing
- Coreference Resolution

In [2]:
import pandas as pd 
import numpy as np

In [3]:
df = pd.read_csv("IMDB Dataset.csv")

In [4]:
df.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [5]:
df.shape

(50000, 2)

### **Step 1: Lowercasing**

In [6]:
df["review"][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [7]:
df["review"]=df["review"].str.lower()

### **Step 2: Remove HTML Tags**

In [8]:
import re

def remove_html_tags(text):
  # Non-greedy match: everything from "<" up to the nearest ">"
  pattern = re.compile(r"<.*?>")
  return pattern.sub("", text)

In [9]:
text =("""<li><data value="21053">Cherry Tomato</data></li>
  <li><data value="21054">Beef Tomato</data></li>
  <li><data value="21055">Snack Tomato</data></li>""")

In [10]:
remove_html_tags(text)

'Cherry Tomato\n  Beef Tomato\n  Snack Tomato'

In [11]:
df["review"] =df["review"].apply(remove_html_tags)
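
The regex above removes tags but leaves HTML entities such as `&amp;` untouched. As a hedged alternative, here is a minimal sketch that uses the standard library's `html.parser` to keep only the text nodes and decode entities along the way. The names `TagStripper` and `strip_html` and the sample sentence are made up for illustration; they are not part of the notebook.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.parts.append(data)


def strip_html(text):
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


# Made-up sample review, not taken from the dataset
sample = "great movie. <br /><br />loved the cast &amp; the score."
print(strip_html(sample))  # great movie. loved the cast & the score.
```

Like `remove_html_tags`, this function could be passed to `df["review"].apply(...)`.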

### **Step 3: Remove URLs**

In [12]:
def remove_url(text):
  pattern = re.compile(r"https?://\S+|www\.\S+")
  return pattern.sub(r"", text)

In [13]:
text1 = "how are you https://search.yahoo.com/search?fr=mcafee&type=E211US714G0&p=html+data "
text2 = "GooGle search https://www.youtube.com/watch?v=6C0sLtw5ctc&list=PLKnIA16_RmvZo7fp5kkIth6nRTeQQ about india"

In [14]:
remove_url(text2)

'GooGle search  about india'
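
Removing a URL leaves a double space where the link used to be. A small sketch, assuming the same URL regex as above, that also collapses the leftover whitespace; `remove_url_and_tidy` is a hypothetical name and the URL in the sample is a placeholder.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
SPACE_PATTERN = re.compile(r"\s+")


def remove_url_and_tidy(text):
    # First drop the URLs, then squeeze runs of whitespace into a single space
    without_urls = URL_PATTERN.sub("", text)
    return SPACE_PATTERN.sub(" ", without_urls).strip()


print(remove_url_and_tidy("GooGle search https://example.com/watch?v=123 about india"))
# GooGle search about india
```

To clean the dataset itself, the function can be applied to the review column in the same way as the HTML step, for example `df["review"].apply(remove_url_and_tidy)`.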

### **Step 4: Removing Punctuation**

In [15]:
df["review"][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [16]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [17]:
punctuation = string.punctuation

In [18]:
def pun_rem(text):
  # Delete each punctuation character in turn
  for char in punctuation:
    text = text.replace(char, "")
  return text

In [19]:
pun_rem(df["review"][5])

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

In [20]:
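# Execution time of a single call. time.time() can be too coarse to register
# such a short call, in which case 0.0 is printed.
start = time.time()
print(pun_rem(df["review"][5]))
time1 = time.time() - start
print(time1)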

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie
0.0
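
A reading of `0.0` says little about how fast the function is, because one call on one review finishes below the clock's resolution. The sketch below is one possible way to get a usable number: it repeats the call with `timeit` and compares the loop with a `str.translate` version, which is often faster on long texts. The function names are hypothetical, the sample is the last sentence of the review shown above, and the timings will differ from machine to machine.

```python
import string
import timeit

# Translation table that maps every punctuation character to None (deletes it)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def pun_rem_translate(text):
    # Single pass over the text instead of one pass per punctuation character
    return text.translate(PUNCT_TABLE)


def pun_rem_loop(text):
    # Same approach as pun_rem above, repeated here so the block is self-contained
    for char in string.punctuation:
        text = text.replace(char, "")
    return text


sample = 'if i had a dozen thumbs, they\'d all be "up" for this movie.'

# Both versions should produce the same cleaned text
assert pun_rem_translate(sample) == pun_rem_loop(sample)
print(pun_rem_translate(sample))

# timeit runs each call 10,000 times, so the total is large enough to measure
loop_time = timeit.timeit(lambda: pun_rem_loop(sample), number=10_000)
translate_time = timeit.timeit(lambda: pun_rem_translate(sample), number=10_000)
print(f"loop: {loop_time:.4f}s  translate: {translate_time:.4f}s")
```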


### **Step 5: Chat Word Treatment**

In [21]:
chat_words = {"brb": "Be Right Back", "bcoz": "Because", "bb4n": "Bye Bye for Now", "cos": "Because", "gr8": "Great", "hagd": "Have a Good Day"}


In [22]:
chat_words

{'brb': 'Be Right Back',
 'bcoz': 'Because',
 'bb4n': 'Bye Bye for Now',
 'cos': 'Because',
 'gr8': 'Great',
 'hagd': 'Have a Good Day'}

In [23]:
def cht_conver(text):
  # Replace every whitespace-separated token that is a known chat abbreviation
  new_text = []
  for w in text.split():
    if w.lower() in chat_words:
      new_text.append(chat_words[w.lower()])
    else:
      new_text.append(w)
  return " ".join(new_text)

In [24]:
cht_conver("bb4n bcoz online class starts now")

'Bye Bye for Now Because online class starts now'
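
Because `cht_conver` splits on whitespace, an abbreviation followed directly by punctuation (for example `gr8,`) is not found in the dictionary. One possible workaround, sketched here with a hypothetical `cht_conver_regex` and a shortened copy of the dictionary, is to match the abbreviations with word boundaries instead.

```python
import re

# Shortened copy of the dictionary above, so the block runs on its own
chat_words = {"brb": "Be Right Back", "gr8": "Great", "bcoz": "Because"}

# One alternation of all abbreviations, longest first, wrapped in \b so that
# "gr8," matches but "gr8ness" does not
CHAT_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, chat_words), key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)


def cht_conver_regex(text):
    # Look up the lowercased match so "GR8" and "gr8" are treated alike
    return CHAT_PATTERN.sub(lambda m: chat_words[m.group(1).lower()], text)


print(cht_conver_regex("That movie was gr8, brb!"))
# That movie was Great, Be Right Back!
```

This only matters if chat words are expanded before punctuation is removed; after the punctuation step the whitespace split is sufficient.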

### **Step 6: Spelling Correction**
- NLTK
- spaCy
- TextBlob

In [25]:
!pip install textblob

In [26]:
from textblob import TextBlob

In [27]:
sentence = "i am going for a walk, you know wlking is god for health"

In [28]:
blob = TextBlob(sentence)
# "wlking" is corrected, but "god" is itself a valid word, so it is left unchanged
blob.correct().string

'i am going for a walk, you know walking is god for health'
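
The result shows a general limit of dictionary-based correction: a typo that happens to be a real word (`god` for `good`) is not detected. The toy sketch below illustrates the idea with the standard library's `difflib`. This is not how TextBlob works internally, and the vocabulary is a tiny made-up list used only for the demonstration.

```python
import difflib

# Tiny, made-up vocabulary for illustration; a real corrector uses a large word list
VOCAB = ["i", "am", "going", "for", "a", "walk", "you", "know",
         "walking", "is", "good", "god", "health"]


def correct_word(word):
    if word in VOCAB:
        # Known words are returned as they are, even if wrong in context
        return word
    # Otherwise pick the most similar vocabulary entry, if any is close enough
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word


sentence = "you know wlking is god for health"
print(" ".join(correct_word(w) for w in sentence.split()))
# you know walking is god for health
```

Catching `god` would require looking at the surrounding words, which is beyond a word-by-word lookup like this one.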

### **Step 7: Removing Stop Words**
- NLTK

In [29]:
# If the corpus is missing, run once: import nltk; nltk.download("stopwords")
from nltk.corpus import stopwords

In [30]:
sw=stopwords.words("english")
sw

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [31]:
def remove_stopwrds(text):
    # Replace every stop word with an empty string; the gaps show up as
    # extra spaces in the output
    new_text = []
    for w in text.split():
        if w in sw:
            new_text.append('')
        else:
            new_text.append(w)
    return " ".join(new_text)


In [32]:
remove_stopwrds("pinky and smith are doing homework and afterwards they went for dinner")

'pinky  smith   homework  afterwards  went  dinner'

In [33]:
df.head()

   review                                             sentiment
0  one of the other reviewers has mentioned that ...  positive
1  a wonderful little production. the filming tec...  positive
2  i thought this was a wonderful way to spend ti...  positive
3  basically there's a family where a little boy ...  negative
4  petter mattei's "love in the time of money" is...  positive


In [34]:
df["review"].apply(remove_stopwrds)

0        one    reviewers  mentioned   watching  1 oz e...
1         wonderful little production.  filming techniq...
2         thought    wonderful way  spend time    hot s...
3        basically there's  family   little boy (jake) ...
4        petter mattei's "love   time  money"   visuall...
                               ...                        
49995     thought  movie    right good job.    creative...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997       catholic taught  parochial elementary schoo...
49998    i'm going    disagree   previous comment  side...
49999     one expects  star trek movies   high art,   f...
Name: review, Length: 50000, dtype: object
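
The extra spaces in the output come from appending an empty string for every stop word. A minimal variant that drops the stop words instead, and uses a set for faster membership tests, might look like this. `remove_stopwords_compact` is a hypothetical name, and `STOP_WORDS` is a small hand-picked subset so the example runs without the NLTK corpus; with NLTK available, `set(stopwords.words("english"))` could be used in its place.

```python
# Hand-picked subset of English stop words (an assumption for this example,
# not the full NLTK list)
STOP_WORDS = {"and", "are", "doing", "they", "for", "the", "a", "of"}


def remove_stopwords_compact(text, stop_words=STOP_WORDS):
    # Keep only the tokens that are not stop words, so no empty gaps remain
    return " ".join(w for w in text.split() if w not in stop_words)


print(remove_stopwords_compact(
    "pinky and smith are doing homework and afterwards they went for dinner"
))
# pinky smith homework afterwards went dinner
```

On 50,000 reviews the set lookup avoids scanning the whole stop-word list for every token.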

### **Step 8: Removing Emojis**

In [44]:
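import re

def remove_emoji(text):
    # The parameter is named `text` so it does not shadow the `string` module imported earlier
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F" # emoticons
                           u"\U0001F300-\U0001F5FF" # symbols & pictographs
                           u"\U0001F680-\U0001F6FF" # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF" # flags (iOS)
                           u"\U00002702-\U000027B0" # dingbats
                           u"\U000024C2-\U0001F251" # enclosed characters; this wide range also covers CJK and other non-Latin scripts
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)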

remove_emoji("I like to eat 🍕")

'I like to eat '
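
Deleting emojis also deletes whatever meaning they carried, which may matter for sentiment data. As one possible alternative, the sketch below replaces each emoji with its Unicode name using the standard library's `unicodedata`. `describe_emoji` is a hypothetical helper, and the pattern covers only the first four ranges used above.

```python
import re
import unicodedata

# Emoticons, pictographs, transport symbols and flags (first four ranges above)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "]"
)


def describe_emoji(text):
    def to_name(match):
        # unicodedata.name returns e.g. "SLICE OF PIZZA"; "" if the name is unknown
        name = unicodedata.name(match.group(0), "")
        return f":{name.lower().replace(' ', '_')}:" if name else ""

    return EMOJI_PATTERN.sub(to_name, text)


print(describe_emoji("I like to eat \U0001F355"))
# I like to eat :slice_of_pizza:
```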

### **Step 9: Tokenization**

In [50]:
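import nltk
# If the tokenizer models are missing, run once: nltk.download("punkt")
# (recent NLTK releases may ask for "punkt_tab" instead)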

In [52]:
from nltk.tokenize import sent_tokenize, word_tokenize
  
text = ("Natural language processing (NLP) is a field "
        "of computer science, artificial intelligence "
        "and computational linguistics concerned with "
        "the interactions between computers and human "
        "(natural) languages, and, in particular, "
        "concerned with programming computers to "
        "fruitfully process large natural language "
        "corpora. Challenges in natural language "
        "processing frequently involve natural "
        "language understanding, natural language "
        "generation (frequently from formal, machine"
        "-readable logical forms), connecting language "
        "and machine perception, managing human-"
        "computer dialog systems, or some combination "
        "thereof.")

print(sent_tokenize(text))


['Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.', 'Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.']


In [54]:
print(word_tokenize(text))

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', '(', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.']
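
To show what a word tokenizer does at its simplest, here is a regex-based sketch that needs no downloaded models. It is not NLTK's algorithm and does not reproduce its handling of contractions; `simple_word_tokenize` is a hypothetical name.

```python
import re

# A word (optionally joined by hyphens or apostrophes), or a single
# punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_word_tokenize(text):
    return TOKEN_PATTERN.findall(text)


print(simple_word_tokenize("Natural language processing (NLP) handles human-computer dialog."))
# ['Natural', 'language', 'processing', '(', 'NLP', ')', 'handles', 'human-computer', 'dialog', '.']
```

For real text, `word_tokenize` remains the safer choice, since it is built to handle abbreviations, quotes and similar cases that this pattern ignores.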
