# Steps to Build a Machine Learning Model
  ## 1. Data gathering
  ## 2. Data Assessing and Cleaning
     > Steps:-
     1. lowercasing
     2. remove leading and trailing spaces
     3. removing html tags
     4. removing urls
     5. Expanding abbreviations
     6. spelling correction
     7. removing punctuation
     8. remove special characters
  ## 3. Data Preprocessing
     > Steps:-
     1. Tokenization
     2. Stopword removal
     3. Stemming
  ## 4. EDA (Exploratory Data Analysis)
     > Steps:-
     1. Distribution of text lengths/word count
     2. common unigrams/bigrams/trigrams
     3. Wordcloud
  ## 5. Make Features
     > Steps:-
     1. number of words
  ## 6. Vectorization
     > Techniques
     1. Bag of words (BOW)
     2. TF-IDF
     3. Word2Vec
  ## 7. Modelling
  ## 8. Model Evaluation
  ## 9. Model Deployment
  ## 10. Monitoring
  ## Additional
     1. PCA
     2. POS Tagging
     3. Stemming

# New Section

In [28]:
import pandas as pd
import numpy as np
import re
import seaborn as sns

df=pd.read_csv('/content/IMDB Dataset.csv')

In [29]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [31]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [32]:
df.duplicated().sum()

418

In [33]:
df=df.drop_duplicates()
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [34]:
#Cleaning

#lower case
df['review']=df['review'].str.lower()

#Remove white space
df['review']=df['review'].str.strip()
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [35]:
# check which reviews contain a URL
df[df['review'].str.contains(r"https?://\S+|www\.\S+")]

Unnamed: 0,review,sentiment
742,mario lewis of the competitive enterprise inst...,negative
907,following directly from where the story left o...,positive
1088,this quasi j-horror film followed a young woma...,negative
1137,i really think i should make my case and have ...,positive
1141,this show has to be my favorite out of all the...,positive
...,...,...
48887,trite and unoriginal. it's like someone watche...,negative
49063,"trick or treat, quickie review this zany romp ...",positive
49596,"this is absolutely the best 80s cartoon ever, ...",positive
49637,if you liked the richard chamberlain version o...,positive


## Example
 In the pattern `a.*?c`, the lazy quantifier `.*?` matches as few characters as possible between `a` and `c`. This is useful when you want the shortest possible match. For instance, in the string `a123c456c`, `a.*?c` matches `a123c` and not `a123c456c`, whereas the greedy pattern `a.*c` would match the whole string.
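
The difference between greedy and lazy matching can be checked directly with `re`. A minimal sketch using the same string as the example:

```python
import re

text = "a123c456c"

# Greedy: .* consumes as much as possible, so the match runs to the last "c".
greedy = re.search(r"a.*c", text).group()

# Lazy: .*? consumes as little as possible, so the match stops at the first "c".
lazy = re.search(r"a.*?c", text).group()

print(greedy)  # a123c456c
print(lazy)    # a123c
```

This is why the tag-removal pattern uses `<.*?>`: a greedy `<.*>` would swallow everything between the first `<` and the last `>` in a review.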

In [36]:
# removing html tags
def remove_tag(data):
    data=re.sub(r'<.*?>','',data)
    return data


df['review']=df['review'].apply(remove_tag)

df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


In [37]:
#removing urls

def remove_url(data):
    data=re.sub(r"https?://\S+|www\.\S+",'',data)
    return data

df['review'] = df['review'].apply(remove_url)
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


In [40]:
def remove_abb(data):
    data = re.sub(r"he's", "he is", data)
    data = re.sub(r"there's", "there is", data)
    data = re.sub(r"We're", "We are", data)
    data = re.sub(r"That's", "That is", data)
    data = re.sub(r"won't", "will not", data)
    data = re.sub(r"they're", "they are", data)
    data = re.sub(r"Can't", "Cannot", data)
    data = re.sub(r"wasn't", "was not", data)
    data = re.sub(r"don\x89Ûªt", "do not", data)
    data= re.sub(r"aren't", "are not", data)
    data = re.sub(r"isn't", "is not", data)
    data = re.sub(r"What's", "What is", data)
    data = re.sub(r"haven't", "have not", data)
    data = re.sub(r"hasn't", "has not", data)
    data = re.sub(r"There's", "There is", data)
    data = re.sub(r"He's", "He is", data)
    data = re.sub(r"It's", "It is", data)
    data = re.sub(r"You're", "You are", data)
    data = re.sub(r"I'M", "I am", data)
    data = re.sub(r"shouldn't", "should not", data)
    data = re.sub(r"wouldn't", "would not", data)
    data = re.sub(r"i'm", "I am", data)
    data = re.sub(r"I\x89Ûªm", "I am", data)
    data = re.sub(r"I'm", "I am", data)
    data = re.sub(r"Isn't", "is not", data)
    data = re.sub(r"Here's", "Here is", data)
    data = re.sub(r"you've", "you have", data)
    data = re.sub(r"you\x89Ûªve", "you have", data)
    data = re.sub(r"we're", "we are", data)
    data = re.sub(r"what's", "what is", data)
    data = re.sub(r"couldn't", "could not", data)
    data = re.sub(r"we've", "we have", data)
    data = re.sub(r"it\x89Ûªs", "it is", data)
    data = re.sub(r"doesn\x89Ûªt", "does not", data)
    data = re.sub(r"It\x89Ûªs", "It is", data)
    data = re.sub(r"Here\x89Ûªs", "Here is", data)
    data = re.sub(r"who's", "who is", data)
    data = re.sub(r"I\x89Ûªve", "I have", data)
    data = re.sub(r"y'all", "you all", data)
    data = re.sub(r"can\x89Ûªt", "cannot", data)
    data = re.sub(r"would've", "would have", data)
    data = re.sub(r"it'll", "it will", data)
    data = re.sub(r"we'll", "we will", data)
    data = re.sub(r"wouldn\x89Ûªt", "would not", data)
    data = re.sub(r"We've", "We have", data)
    data = re.sub(r"he'll", "he will", data)
    data = re.sub(r"Y'all", "You all", data)
    data = re.sub(r"Weren't", "Were not", data)
    data = re.sub(r"Didn't", "Did not", data)
    data = re.sub(r"they'll", "they will", data)
    data = re.sub(r"they'd", "they would", data)
    data = re.sub(r"DON'T", "DO NOT", data)
    data = re.sub(r"That\x89Ûªs", "That is", data)
    data = re.sub(r"they've", "they have", data)
    data = re.sub(r"i'd", "I would", data)
    data = re.sub(r"should've", "should have", data)
    data = re.sub(r"You\x89Ûªre", "You are", data)
    data = re.sub(r"where's", "where is", data)
    data = re.sub(r"Don\x89Ûªt", "Do not", data)
    data = re.sub(r"we'd", "we would", data)
    data = re.sub(r"i'll", "I will", data)
    data = re.sub(r"weren't", "were not", data)
    data = re.sub(r"They're", "They are", data)
    data = re.sub(r"Can\x89Ûªt", "Cannot", data)
    data = re.sub(r"you\x89Ûªll", "you will", data)
    data = re.sub(r"I\x89Ûªd", "I would", data)
    data = re.sub(r"let's", "let us", data)
    data = re.sub(r"it's", "it is", data)
    data = re.sub(r"can't", "cannot", data)
    data = re.sub(r"don't", "do not", data)
    data = re.sub(r"you're", "you are", data)
    data = re.sub(r"i've", "I have", data)
    data = re.sub(r"that's", "that is", data)
    data = re.sub(r"i'll", "I will", data)
    data = re.sub(r"doesn't", "does not",data)
    data = re.sub(r"i'd", "I would", data)
    data = re.sub(r"didn't", "did not", data)
    data = re.sub(r"ain't", "am not", data)
    data = re.sub(r"you'll", "you will", data)
    data = re.sub(r"I've", "I have", data)
    data = re.sub(r"Don't", "do not", data)
    data = re.sub(r"I'll", "I will", data)
    data = re.sub(r"I'd", "I would", data)
    data = re.sub(r"Let's", "Let us", data)
    data = re.sub(r"you'd", "You would", data)
    data = re.sub(r"It's", "It is", data)
    data = re.sub(r"Ain't", "am not", data)
    data = re.sub(r"Haven't", "Have not", data)
    data = re.sub(r"Could've", "Could have", data)
    data = re.sub(r"youve", "you have", data)
    data = re.sub(r"donå«t", "do not", data)

    return data


# assign the result back, otherwise the expanded text is discarded
df['review']=df['review'].apply(remove_abb)
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there is a family where a little boy...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    I am going to have to disagree with the previo...
49999    no one expects the star trek movies to be high...
Name: review, Length: 49582, dtype: object
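
Because the reviews were lowercased earlier, only lowercase patterns can match. A more compact way to write the same idea is a lookup table plus a single compiled regex. The sketch below assumes lowercased input; the `CONTRACTIONS` mapping and the `expand_contractions` name are hypothetical and deliberately small, not a complete list.

```python
import re

# Hypothetical, deliberately small mapping; extend it as needed.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "there's": "there is",
}

# One alternation of all keys, longest first; \b keeps matches on word boundaries.
CONTRACTION_RE = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b"
)

def expand_contractions(text):
    # Replace each matched contraction with its expansion from the table.
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand_contractions("i'm sure it's fine, but there's a catch"))
# i am sure it is fine, but there is a catch
```

With pandas this could be applied as `df['review'].apply(expand_contractions)`; the text is scanned once instead of once per pattern.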

In [41]:
# correcting spellings
from textblob import TextBlob

text = "hi I can drve at nigt"
TextBlob(text).correct().string


'hi I can drove at night'

In [None]:
def spelling_correction(text):
    return TextBlob(text).correct().string


df['review']=df['review'].apply(spelling_correction)

In [None]:
# removing punctuation
import string
string.punctuation

In [None]:
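def remove_punctuation(text):
  # strip every character listed in string.punctuation
  for i in string.punctuation:
    if i in text:
      text=text.replace(i,'')

  return text

# quick check (must be outside the function, after the return it would never run)
remove_punctuation('Hi! How are you ?')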

In [None]:
df['review']=df['review'].apply(remove_punctuation)

# Special characters can also be removed with a regex passed to str.replace, e.g.
# df['review'].str.replace(r'[^\w\s]', '', regex=True)
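
The `[^\w\s]` pattern mentioned in the comment can be tried on a plain string first. A minimal sketch; the function name `remove_special_characters` is an assumption for illustration:

```python
import re

def remove_special_characters(text):
    # [^\w\s] matches any character that is neither a word character
    # (letter, digit, underscore) nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

print(remove_special_characters("hi! how are you?"))  # hi how are you
```

Note that `\w` includes the underscore, so `_` survives this pattern, whereas the `string.punctuation` loop removes it.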

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize

df['tokenized_review']= df['review'].apply(word_tokenize)

In [None]:
# Stop word removal
from nltk.corpus import stopwords


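#stopwords.words contains all stopwords of the given language
len(stopwords.words('english'))

# build the set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
  L=[]
  for i in text:
    if i not in stop_words:
      L.append(i)

  return L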




In [None]:
#removing all the stop words
df['tokenized_review']=df['tokenized_review'].apply(remove_stopwords)

In [None]:
# Joining all the lists of string
df['review']=df['tokenized_review'].apply(lambda x:" ".join(x))
df.head()

In [None]:
df['char_length']=df['review'].str.len()
df['words_count']=df['tokenized_review'].apply(len)
df.head()

In [None]:
# Distribution plot (histogram with a kernel density estimate)
sns.histplot(df['char_length'], kde=True, stat='density')

In [None]:
#Plotting for positive and negative sentiments
sns.displot(df[df['sentiment'] == 'positive']['char_length'])
sns.displot(df[df['sentiment'] == 'negative']['char_length'])

# If the two distributions differ visibly, the feature is likely to help the model;
# if they overlap almost completely, it carries little information about the sentiment.

In [None]:
#unigram means single word
#bigram means two words
#trigram means 3 words

#example :- My name is parth my weight is 70
#unigram: [My] [name] [is] [parth] [my] [weight] [is] [70]
#bigram: [My,name] [name,is] [is,parth] [parth,my] [my,weight] [weight,is] [is,70]
#trigram: [My,name,is] [name,is,parth] [is,parth,my] [parth,my,weight] [my,weight,is] [weight,is,70]

#Printing ngrams

from nltk import ngrams

#Merging all tokenized reviews into one large list of tokens
pd.Series(list(ngrams(df['tokenized_review'].sum(),2))).value_counts() #Passing 2 gives bigrams; value_counts then counts
                                                                      #how many times each bigram occurs
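
To make the sliding-window idea concrete, here is a small plain-Python sketch that builds the same n-grams as the commented example. The helper `make_ngrams` is hypothetical; it is a stand-in for illustration, not how `nltk.ngrams` is implemented.

```python
from collections import Counter

def make_ngrams(tokens, n):
    # Zip n staggered copies of the token list to get sliding windows of length n.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "My name is parth my weight is 70".split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams[:3])   # [('My', 'name'), ('name', 'is'), ('is', 'parth')]
print(len(bigrams), len(trigrams))  # 7 6

# Counting repeated n-grams, analogous to value_counts() on a Series.
bigram_counts = Counter(make_ngrams("a b a b".split(), 2))
print(bigram_counts.most_common(1))  # [(('a', 'b'), 2)]
```

A sequence of k tokens yields k - n + 1 n-grams, which is why the 8-token sentence gives 7 bigrams and 6 trigrams.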


In [None]:
#Printing all trigrams with their counts
pd.Series(list(ngrams(df['tokenized_review'].sum(),3))).value_counts()

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

plt.figure(figsize=(20,20)) # Positive Review Text
wc = WordCloud(width = 1600 , height = 800).generate(" ".join(df[df['sentiment'] == 'positive']['review']))
plt.imshow(wc)

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

plt.figure(figsize = (20,20)) # Negative Review Text
wc = WordCloud(width = 1600 , height = 800).generate(" ".join(df[df['sentiment'] == 'negative']['review']))
plt.imshow(wc)

In [None]:
#Bag of Words
# Count how often each word (or n-gram) occurs in each review and store the counts in a
# 2-dimensional table: each row is a review, each column is a word, and each value is
# the number of times that word occurs in that review.

from sklearn.feature_extraction.text import CountVectorizer
# ngram_range=(1,3) means unigrams, bigrams and trigrams are all used; for unigrams only use (1,1)
count_vectorizer = CountVectorizer(max_features=5000,ngram_range=(1,3))
bag_of_words = count_vectorizer.fit_transform(df['review'])
# get_feature_names() was removed in newer scikit-learn versions; use get_feature_names_out()
bag_of_words = pd.DataFrame(bag_of_words.toarray(),columns = count_vectorizer.get_feature_names_out())

#The result has shape (n_reviews, 5000): one row per review (49582 after dropping duplicates)
#and one column for each of the 5000 most frequent n-grams across all reviews.
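
The row-per-document, column-per-word layout can be seen on a toy corpus. The sketch below uses only the standard library and a hypothetical three-document corpus; it illustrates the idea and is not how `CountVectorizer` is implemented.

```python
from collections import Counter

# Hypothetical toy corpus for illustration.
docs = ["good movie", "bad movie", "good good plot"]

# Vocabulary: every distinct word, sorted for a stable column order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are counts.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)  # ['bad', 'good', 'movie', 'plot']
for row in matrix:
    print(row)
# [0, 1, 1, 0]
# [1, 0, 1, 0]
# [0, 2, 0, 1]
```

`CountVectorizer` builds the same kind of table but also lowercases the text by default, uses its own tokenizer, and with `max_features` keeps only the most frequent terms.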



In [None]:
bag_of_words

In [None]:
#PCA
# It is used to convert high dimensional data to low dimensional data

from sklearn.decomposition import PCA

pca = PCA(n_components=2) # n_components is the number of dimensions to keep;
                          # here 2 means the data is projected onto 2 dimensions
pca_result = pca.fit_transform(bag_of_words.values)

# so the (n_reviews, 5000) bag of words becomes (n_reviews, 2)
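
What the projection does can be sketched with NumPy alone: centre the columns, take the SVD, and keep the first two directions. The random toy matrix below is an assumption for illustration; the result matches the usual PCA construction up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # hypothetical toy data: 100 rows, 5 features

# Centre each column so every feature has mean zero.
Xc = X - X.mean(axis=0)

# SVD of the centred data; the rows of Vt are the principal directions,
# ordered by decreasing singular value (that is, decreasing variance).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two principal directions.
X_2d = Xc @ Vt[:2].T

print(X_2d.shape)  # (100, 2)
```

The first column of `X_2d` carries the most variance and the second the next most, which is why a 2-D scatter plot of these columns is a common first look at high-dimensional data.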

In [None]:
pca_result.shape

In [None]:
# recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=pca_result[:,0],y=pca_result[:,1],hue=df['sentiment'].to_numpy())

In [None]:
# POS tagging
from nltk import pos_tag_sents

pos_tag_sents(df['review'].apply(lambda x:x.split()))