In [124]:
import pandas as pd
import spacy
import re

# for display purposes: show up to 200 characters per column
pd.set_option('display.max_colwidth', 200)

In [97]:
nlp = spacy.load("en_core_web_sm")

In [12]:
# reading in the data
DATA_PATH = "IMDB_Dataset.csv"
df = pd.read_csv(DATA_PATH)

Overview

In [125]:
df.head(10)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",positive
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",positive
5,"Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 o...",positive
6,I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gun...,positive
7,"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny ...",negative
8,Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful i...,negative
9,"If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!",positive


Any duplicates?

In [67]:
print(f'number of samples:{len(df.review)}')
print(f'number of unique reviews: {df.review.nunique()}')
print(f'percentage of duplicates: {(len(df.review) - df.review.nunique()) / len(df.review)*100}')

number of samples:50000
number of unique reviews: 49582
percentage of duplicates: 0.836


Any missing values?

In [72]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

Is the target balanced?

In [139]:
df.sentiment.value_counts(normalize = True)

positive    0.5
negative    0.5
Name: sentiment, dtype: float64

Observations:
* HTML tags are present in the second review and in several others shown above, and probably in many more. We'll need to clean the whole dataset.
* 418 of the 50,000 reviews are duplicates of another review (about 0.84%). We'll remove them to avoid an overly optimistic estimate of model performance, which would happen if the same review appeared in both the training and test sets.
* The target is perfectly balanced (50% positive, 50% negative), so there is no need to worry about class imbalance.
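
The duplicate removal mentioned in the observations could look like the minimal sketch below. It runs on a toy frame with made-up rows (the names toy, n_duplicates and deduped are illustrative only); on the real data the same calls would be applied to df. Keeping the first occurrence of each review is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical rows).
toy = pd.DataFrame({
    "review": ["great film", "awful plot", "great film", "fine acting"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

# Count rows whose review text already appeared earlier in the frame.
n_duplicates = toy.duplicated(subset="review").sum()

# Keep the first occurrence of each review and renumber the index.
deduped = toy.drop_duplicates(subset="review", keep="first").reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Deduplicating before the train/test split is what prevents the same review from landing on both sides of it.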

Experimenting with cleaning.  
It's a good idea to inspect the output of the cleaning procedure before integrating it into the overall workflow.  
Let's take the fourth review (index 3) as an example.

In [126]:
text = df.review[3]
text

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [129]:
# remove html tags and transform to lower case
text = re.sub(r'<.*?>', '', text).lower()
text

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [133]:
# remove punctuation and stop words, then lemmatize
" ".join([token.lemma_ for token in nlp(text) if not token.is_punct and not token.is_stop])

'basically family little boy jake think zombie closet parent fight time.this movie slow soap opera suddenly jake decide rambo kill zombie.ok go film decide thriller drama drama movie watchable parent divorce argue like real life jake closet totally ruin film expect boogeyman similar movie instead watch drama meaningless thriller spots.3 10 play parent descent dialog shot jake ignore'
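
Note the tokens 'time.this', 'zombie.ok' and 'spots.3' in the output above: deleting the <br /> tags outright joins the sentences on either side of them. One possible remedy, sketched below on a short excerpt of the same review, is to replace each tag with a space and then collapse repeated whitespace. The variable names are illustrative, and the whitespace collapsing is an added assumption rather than part of the original procedure.

```python
import re

# Short excerpt from the review above, with its <br /> separators.
raw = "fighting all the time.<br /><br />This movie is slower than a soap opera..."

# Removing tags outright joins the neighbouring sentences.
glued = re.sub(r"<.*?>", "", raw).lower()

# Replacing each tag with a space, then collapsing runs of whitespace, keeps them apart.
spaced = re.sub(r"\s+", " ", re.sub(r"<.*?>", " ", raw)).strip().lower()

print(glued)
print(spaced)
```

With the sentences separated, the tokenizer can split 'time.' and 'this' as usual instead of treating them as one unknown word.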

In [134]:
# Let's create a function for cleaning

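def clean(text):
    # Replace tags with a space, not '', so sentences split only by <br /> are not glued ('time.this').
    text = re.sub(r'<.*?>', ' ', text).lower()
    # Lemmatize, dropping punctuation, stop words, and whitespace tokens.
    return [token.lemma_ for token in nlp(text)
            if not (token.is_punct or token.is_stop or token.is_space)]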