This dataset is derived from [Kaggle Website](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/downloads/imdb-dataset-of-50k-movie-reviews.zip/1)!

-------------------------------------------

In [19]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import LancasterStemmer,WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

from wordcloud import WordCloud,STOPWORDS
from bs4 import BeautifulSoup
import spacy
import re,string,unicodedata

from textblob import TextBlob
from textblob import Word

import time

In [31]:
#Tokenization of text
tokenizer=ToktokTokenizer()
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

In [37]:
Data = pd.read_csv("imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

In [26]:
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_rows', None)  
pd.set_option('display.max_colwidth', -1)

In [38]:
data = Data

In [7]:
print(data.shape)

(50000, 2)


In [36]:
data.head()

Unnamed: 0,review,sentiment
0,,positive
1,,positive
2,,positive
3,,negative
4,,positive


In [10]:
data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,"Loved today's show!!! It was a variety and not solely cooking (which would have been great too). Very stimulating and captivating, always keeping the viewer peeking around the corner to see what was coming up next. She is as down to earth and as personable as you get, like one of us which made the show all the more enjoyable. Special guests, who are friends as well made for a nice surprise too. Loved the 'first' theme and that the audience was invited to play along too. I must admit I was shocked to see her come in under her time limits on a few things, but she did it and by golly I'll be writing those recipes down. Saving time in the kitchen means more time with family. Those who haven't tuned in yet, find out what channel and the time, I assure you that you won't be disappointed.",negative
freq,5,25000


In [11]:
data['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

In [12]:
def train_test_split(data,percentage):
    train_size= int(len(data)*percentage)
    dataB=data
    train_data = pd.DataFrame()
    test_data = pd.DataFrame()
    random_index = np.random.choice(len(dataB),train_size)
    random_index = np.sort(random_index)
    random_index = random_index[::-1]
#     print(len(dataB), '\n',random_index)
    for i in random_index:
        train_data = train_data.append(dataB.iloc[i],ignore_index=True)
        dataB.drop(dataB.index[i], inplace=True)
    test_data = dataB
                          
    return train_data , test_data

In [39]:
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
#Apply function on review column
data['review']=data['review'].apply(denoise_text)

In [40]:
data.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.We wish Mr. Mattei good luck and await anxiously for his next work.",positive


In [41]:
#Define function for removing special characters
def remove_special_characters(text, remove_digits=True):
    pattern=r'[^a-zA-z0-9\s]'
    text=re.sub(pattern,'',text)
    return text
#Apply function on review column
data['review']=data['review'].apply(remove_special_characters)

In [42]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that after watching just 1 Oz episode youll be hooked They are right as this is exactly what happened with meThe first thing that struck me about Oz was its brutality and unflinching scenes of violence which set in right from the word GO Trust me this is not a show for the faint hearted or timid This show pulls no punches with regards to drugs sex or violence Its is hardcore in the classic use of the wordIt is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary It focuses mainly on Emerald City an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda Em City is home to manyAryans Muslims gangstas Latinos Christians Italians Irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayI would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare Forget pretty pictures painted for mainstream audiences forget charm forget romanceOZ doesnt mess around The first episode I ever saw struck me as so nasty it was surreal I couldnt say I was ready for it but as I watched more I developed a taste for Oz and got accustomed to the high levels of graphic violence Not just violence but injustice crooked guards wholl be sold out for a nickel inmates wholl kill on order and get away with it well mannered middle class inmates being turned into prison bitches due to their lack of street skills or prison experience Watching Oz you may become comfortable with what is uncomfortable viewingthats if you can get in touch with your darker side,positive
1,A wonderful little production The filming technique is very unassuming very oldtimeBBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece The actors are extremely well chosen Michael Sheen not only has got all the polari but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece A masterful production about one of the great masters of comedy and his life The realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets particularly of their flat with Halliwells murals decorating every surface are terribly well done,positive
2,I thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air conditioned theater and watching a lighthearted comedy The plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer While some may be disappointed when they realize this is not Match Point 2 Risk Addiction I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to loveThis was the most Id laughed at one of Woodys comedies in years dare I say a decade While Ive never been impressed with Scarlet Johanson in this she managed to tone down her sexy image and jumped right into a average but spirited young womanThis may not be the crown jewel of his career but it was wittier than Devil Wears Prada and more interesting than Superman a great comedy to go see with friends,positive
3,Basically theres a family where a little boy Jake thinks theres a zombie in his closet his parents are fighting all the timeThis movie is slower than a soap opera and suddenly Jake decides to become Rambo and kill the zombieOK first of all when youre going to make a film you must Decide if its a thriller or a drama As a drama the movie is watchable Parents are divorcing arguing like in real life And then we have Jake with his closet which totally ruins all the film I expected to see a BOOGEYMAN similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents descent dialogs As for the shots with Jake just ignore them,negative
4,Petter Matteis Love in the Time of Money is a visually stunning film to watch Mr Mattei offers us a vivid portrait about human relations This is a movie that seems to be telling us what money power and success do to people in the different situations we encounter This being a variation on the Arthur Schnitzlers play about the same theme the director transfers the action to the present time New York where all these different characters meet and connect Each one is connected in one way or another to the next person but no one seems to know the previous point of contact Stylishly the film has a sophisticated luxurious look We are taken to see how these people live and the world they live in their own habitatThe only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits A big city is not exactly the best place in which human relations find sincere fulfillment as one discerns is the case with most of the people we encounterThe acting is good under Mr Matteis direction Steve Buscemi Rosario Dawson Carol Kane Michael Imperioli Adrian Grenier and the rest of the talented cast make these characters come aliveWe wish Mr Mattei good luck and await anxiously for his next work,positive


In [43]:
#Stemming the text
def simple_stemmer(text):
    ps=nltk.porter.PorterStemmer()
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
#Apply function on review column
data['review']=data['review'].apply(simple_stemmer)

In [44]:
data.head(1)

Unnamed: 0,review,sentiment
0,one of the other review ha mention that after watch just 1 Oz episod youll be hook they are right as thi is exactli what happen with meth first thing that struck me about Oz wa it brutal and unflinch scene of violenc which set in right from the word GO trust me thi is not a show for the faint heart or timid thi show pull no punch with regard to drug sex or violenc it is hardcor in the classic use of the wordit is call OZ as that is the nicknam given to the oswald maximum secur state penitentari It focus mainli on emerald citi an experiment section of the prison where all the cell have glass front and face inward so privaci is not high on the agenda Em citi is home to manyaryan muslim gangsta latino christian italian irish and moreso scuffl death stare dodgi deal and shadi agreement are never far awayi would say the main appeal of the show is due to the fact that it goe where other show wouldnt dare forget pretti pictur paint for mainstream audienc forget charm forget romanceoz doesnt mess around the first episod I ever saw struck me as so nasti it wa surreal I couldnt say I wa readi for it but as I watch more I develop a tast for Oz and got accustom to the high level of graphic violenc not just violenc but injustic crook guard wholl be sold out for a nickel inmat wholl kill on order and get away with it well manner middl class inmat be turn into prison bitch due to their lack of street skill or prison experi watch Oz you may becom comfort with what is uncomfort viewingthat if you can get in touch with your darker side,positive


In [46]:
#set stopwords to english
stop=set(stopwords.words('english'))
print(stop)

#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
#Apply function on review column
data['review']=data['review'].apply(remove_stopwords)

{'some', 'mightn', 'same', 'over', 'won', 'both', 'for', 'there', 'll', "haven't", 'that', 'no', 'because', 'herself', 'don', 'so', 'myself', 'by', 'isn', 'whom', 'further', 'ma', 'we', 'the', 'of', "mustn't", 'wasn', "that'll", "hasn't", 'under', 'these', 'just', "shouldn't", 'all', "didn't", 'my', 'y', 'or', 'i', 'until', 'do', 'very', 'couldn', 'and', "hadn't", 'most', 'am', 'doesn', 'having', 'ourselves', 'few', 'here', 'hers', 'after', 'why', 'above', 'm', 'o', 'needn', 'this', 'any', 'own', "weren't", 'where', 'how', 'haven', 'you', 'than', 'on', 'it', 'up', 'me', 'be', 'into', 'other', "won't", 'been', 'about', 'his', 'each', 'only', 'which', 'off', 'hadn', 'if', 'her', 'aren', 'hasn', 'will', 't', 'more', "you'd", 'mustn', 'weren', "don't", 'shan', 'they', "aren't", 'did', "you'll", 'in', 'can', 'd', 'then', 'at', 'their', 'was', 'have', 'from', 'down', 'again', 've', 'she', 'ours', "you've", 'a', 'are', 'its', 'out', 'itself', 'nor', 'didn', 'yourself', 'between', "she's", 'to

In [47]:
data.head(1)

Unnamed: 0,review,sentiment
0,one review ha mention watch 1 Oz episod youll hook right thi exactli happen meth first thing struck Oz wa brutal unflinch scene violenc set right word GO trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use wordit call OZ nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda Em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti wa surreal couldnt say wa readi watch develop tast Oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch Oz may becom comfort uncomfort viewingthat get touch darker side,positive


In [48]:
train, test = train_test_split(data , 0.8)

In [49]:
train.head()

Unnamed: 0,review,sentiment
0,bad plot bad dialogu bad act idiot direct annoy porn groov soundtrack ran continu overact script crappi copi vh cannot redeem consum liquor trust becaus stuck thi turkey end wa pathet bad figur wa fourthrat spoof springtim hitlerth girl play jani joplin wa onli faint spark interest wa onli becaus could sing better originalif want watch someth similar thousand time better watch beyond valley doll,negative
1,thought thi movi right good job wasnt creativ origin first wa expect wa whole lotta fun think like come dvd Im go pay money veri proudli everi last cent sharon stone great alway even movi horriblecatwoman thi movi isnt thi one movi underr lifetim probabl becom classic like 20 yr dont wait classic watch enjoy dont expect masterpiec someth grip soul touch allow get life get involv theirsal thi movi entertain recommend peopl havent seen see becaus critic box offic say doesnt alway count see never know might enjoy tip hat thi movie810,positive
2,robert colomb ha two fulltim job known throughout world globetrot TV report less wellknown equal effort hi exploit fulltim philandereri saw ` vivr pour vivr dub english titl live life life robert seem alway least three women hi life one mistress way one way cheat wife home help robert glib liar among hi use lie ` ill call tomorrow ` mi work took longer plan spend lot time money plane train hotel room hi success liaison wonder thi guy get caught hi pant downsom may find hi life excit thought tediou hi companion includ hi wife catherin attract desir women hi lifestyl hectic deceit wonder enjoy thisad tedium consider footag doesnt plot extend section dialogu frenchonli dialogu see documentari war tortur troop train interspers live action robert flight return africa wait wait plane land taxi airport terminalanni girardot standout perform thi film wa interest charact play perfect wa also nice see candic bergen begin career cant find fault yve montand perform wa basic amor bumi enjoy claud lelouch novel techniqu hotel room scene camera pan around room robert hi mistress argu catch sight briefli dure pass around room anoth scene set sleep car train robert lie upper bunk hi wife lower robert give hi wife import distress news hear onli part becaus clatter train sens hi wife wa also unabl absorb everi word due shock natur news also like excit safari scene africa cinematographi scene amsterdam wa superbi review thi movi part project librari congress ive name project fifti 50 notabl film forgotten within 50 year best determin thi film like fortynin ive identifi ha video telecast distribut US sinc origin releas opinion worthi made avail,negative
3,le visiteur first movi mediev time travel wa actual funni like jean reno actor wa unexpect twist funni situat cours plain absurd would remind littl bit loui de funesnow thi sequel ha charact actor great part time travel plot chang littl sinc charact suppos experienc time travel jump histori without pay ani attent fact keep get absurd advanc movi duke jean reno tri keep whole thing togeth hi play hi charact ha empti lot save filmnow duke slavehelp ha realli attent movi mere hi clumsi annoy stupid whatev wa suppos fact thi charact tri produc laughter audienc doe succeed someon wa tell realli veri veri bad joke alreadi know insist tell joke till end ad detail make suffer littl longerif like le visiteur spoil tast mouth sequel didnt like le visiteur would never consid see sequel like thi sequel well suppos still need see lot movi,negative
4,first tune thi morn news thought wow final entertain wa slightli amus week face news report one even call way much play around timeat first thought jillian wa breath fresh air serious thi woman ha got least bit journalist veri unprofession keep interrupt steve start inform viewer certain news report realli becom annoy point cant watch anymorejillian good journalist hell celebr love celebr henc instantli transform celebr around celebr suppos interview veri profession quit possibl perceiv relationship celebr import right insati journalist say heralso disappointingli thi show ha entertain news necessari news report world govern US someth benefit andor serv public best interest theyr focu sensation everyth talk come commerci product hand field report interestingli tolerablei believ good day LA young teenag celebr definit peopl actual care newssid note Id realli rather watch ktla howev tri hard entertain sometim theyr still bit dull though Oh well ill stick nbc today abc good morn america also okay long dian sawyer doesnt becom way seriou,negative


In [50]:
train.to_csv('IMDB_50k_train_data.csv',index=False)
test.to_csv('IMDB_50k_test_data.csv',index=False)