# Project Orange: IMDB reviews sentiment analysis

The purpose of this project is to build a model that classifies a movie review as having either positive or negative sentiment. Reviews with neutral sentiment are excluded from the analysis.
To do so, the model should assign a weighted positive or negative connotation to each word or group of words. Words that are commonly used in sentence formation and do not convey any emotion should receive no weight.

# 1. Data Cleaning

## Setup

Install all required dependencies for the future analysis in the current Jupyter kernel.

In [61]:
import sys
!{sys.executable} -m pip install -U scikit-learn spacy pandas seaborn nltk jupyter tqdm
!{sys.executable} -m spacy download en_core_web_sm

Requirement already up-to-date: scikit-learn in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.22)
Requirement already up-to-date: spacy in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (2.2.3)
Requirement already up-to-date: pandas in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.25.3)
Requirement already up-to-date: seaborn in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.9.0)
Requirement already up-to-date: nltk in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (3.4.5)
Requirement already up-to-date: jupyter in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (1.0.0)


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [62]:
import pandas as pd
import re
import spacy
import nltk.corpus

sp = spacy.load('en_core_web_sm')
nltk.download('words')

[nltk_data] Downloading package words to /Users/sylvain/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

### Import dataset

In [63]:
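data = pd.read_csv("../data/imdb_dataset.csv")

# Keep only the first 100 reviews to reduce the CPU load; copy the slice so
# that overwriting the 'review' column later does not act on a view of the full frame
data = data[:100].copy()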
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Cleaning

In the following cells, we will define various cleaning methods. Each review will then be passed through each of those methods sequentially.


In [64]:
cleaning_methods = []

#### Remove HTML elements

In [65]:
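# Compile once with a raw string; matches runs of <br> tags, hyphens and slashes
HTML_AND_SEPARATORS = re.compile(r"(?:<br\s*/?>)+|-|/")

def remove_html(review):
    # Replace each match with a single space
    return HTML_AND_SEPARATORS.sub(" ", review)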

cleaning_methods.append(remove_html)
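
As a quick illustration of this step, the sketch below applies a pattern in the same spirit to a made-up string (the sample text and the name `demo_remove_html` are illustrative assumptions, not part of the dataset or the notebook). Besides the paired `<br />` tags, hyphens and slashes are also turned into spaces, so hyphenated words and scores such as `10/10` are split apart.

```python
import re

# Raw-string version of the pattern: paired <br /> tags, hyphens, slashes.
DEMO_PATTERN = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def demo_remove_html(review):
    """Replace every match of DEMO_PATTERN with a single space."""
    return DEMO_PATTERN.sub(" ", review)

# Made-up sample, for illustration only.
sample = "Great film.<br /><br />A must-see, 10/10!"
cleaned_sample = demo_remove_html(sample)
print(cleaned_sample)  # Great film. A must see, 10 10!
```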

#### Spacy processing

In [66]:
# Remove the following stopwords from default list
sp.Defaults.stop_words -= {"n't", "most", "much", "never", "no", "not", "nothing", "n‘t", "n’t", "really", "top", "very", "well"}

# Add the following tokens (mostly punctuation and whitespace) to the stopword list
sp.Defaults.stop_words |= {'.', ',', '!', '?', ':', '&', '...', '(', ')','-', '/', '"', ';', '-PRON-', ' ', "'", '....', '  ', '*'}

def spacy_processing(review):
    doc = sp(review.lower())
    
    # Remove stopwords
    tokens_without_stopwords = [token for token in doc if not token.is_stop]
    
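    # Remove entities: drop tokens whose text equals a recognised entity
    # (multi-token entities are not matched, since tokens are compared one at a time)
    entities = {ent.text for ent in doc.ents}
    tokens_without_entities = [token for token in tokens_without_stopwords if token.text not in entities]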
    
    # Lemmatization
    lemmatized_tokens = [token.lemma_ for token in tokens_without_entities]

    return lemmatized_tokens

cleaning_methods.append(spacy_processing)
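
The stopword list is adjusted before filtering, presumably because negations and intensifiers such as "not" or "very" carry sentiment even though they are very common. The sketch below shows the effect on a toy example. It uses plain Python and a small made-up stop list (`toy_stop_words`, `sentiment_bearing` and `drop_stop_words` are hypothetical names), not spaCy's tokenizer or its real stopword list.

```python
# Toy stop list for illustration only; spaCy's real list is much larger.
toy_stop_words = {"the", "was", "a", "not", "very", "at", "all"}

# Words assumed to carry sentiment, which should survive stopword removal.
sentiment_bearing = {"not", "very"}

def drop_stop_words(text, stop_words):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

review = "The plot was not good"

before = drop_stop_words(review, toy_stop_words)
after = drop_stop_words(review, toy_stop_words - sentiment_bearing)

print(before)  # ['plot', 'good']
print(after)   # ['plot', 'not', 'good']
```

With the unmodified toy list, the negative review collapses to the same tokens as "the plot was good", which is the kind of information loss the adjusted list avoids.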

#### Check if tokens are words and remove those that aren't (using NLTK)

In [67]:
nltk_corpus = set(nltk.corpus.words.words())

def remove_non_words(tokens):
    non_words_removed = [token for token in tokens if token in nltk_corpus]  
    return " ".join(non_words_removed)

cleaning_methods.append(remove_non_words)
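
A minimal sketch of the same filtering idea, using a tiny made-up vocabulary in place of the NLTK word list so that it runs without any download (`toy_vocabulary` and `keep_known_words` are hypothetical names). It highlights two properties of this step: membership is tested against a set, and the output is a single string rather than a list of tokens.

```python
# Hypothetical mini-vocabulary standing in for nltk.corpus.words.words().
toy_vocabulary = {"wonderful", "little", "production", "very", "old", "time"}

def keep_known_words(tokens, vocabulary):
    """Keep tokens found in `vocabulary` and join them back into one string."""
    return " ".join(tok for tok in tokens if tok in vocabulary)

tokens = ["wonderful", "little", "production", "bbc", "very", "old", "tyme"]
cleaned = keep_known_words(tokens, toy_vocabulary)
print(cleaned)  # wonderful little production very old
```

Any token absent from the list is dropped, including legitimate words that the list happens not to contain.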

## Final dataset

In [68]:
def cleaning(review):
    for cleaning_method in cleaning_methods:
        review = cleaning_method(review)
    return review
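
Because each method receives the output of the previous one, the order in `cleaning_methods` matters: `remove_html` maps a string to a string, `spacy_processing` maps a string to a list of lemmas, and `remove_non_words` maps that list back to a string. The sketch below reproduces this chaining pattern with three toy steps (all function names here are hypothetical stand-ins, not the notebook's methods) and uses `functools.reduce` as an equivalent way to write the loop.

```python
from functools import reduce

def strip_tags(text):
    # str -> str: replace a line-break tag with a space
    return text.replace("<br />", " ")

def tokenize(text):
    # str -> list[str]: lower-case and split on whitespace
    return text.lower().split()

def join_tokens(tokens):
    # list[str] -> str: join tokens back into one string
    return " ".join(tokens)

toy_steps = [strip_tags, tokenize, join_tokens]

def run_pipeline(value, steps):
    """Feed `value` through each step in order, like the loop in `cleaning`."""
    return reduce(lambda acc, step: step(acc), steps, value)

result = run_pipeline("Loved it!<br />Would watch AGAIN", toy_steps)
print(result)  # loved it! would watch again
```

Swapping `tokenize` and `join_tokens` in `toy_steps` would fail, because `tokenize` would receive a list instead of a string; the same constraint applies to the real cleaning methods.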

In [69]:
from tqdm import tqdm
tqdm.pandas()
data['review'] = data['review'].progress_map(cleaning)

100%|██████████| 100/100 [00:04<00:00, 23.76it/s]


In [70]:
for review in data['review']:
    print(review)

reviewer mention watch episode hook right exactly happen thing strike brutality scene violence set right word trust faint hearted timid pull no punch regard drug sex violence classic use word call nickname give maximum security state focus mainly emerald city experimental section prison cell glass front face inward privacy high agenda city home scuffle death stare dodgy dealing shady agreement never far away main appeal fact go show dare forget pretty picture paint audience forget charm forget romance mess episode see strike nasty ready watch develop taste get accustomed high level graphic violence violence injustice crooked guard sell nickel inmate kill order away well mannered middle class inmate turn prison bitch lack street skill prison experience watch comfortable uncomfortable view that touch dark
wonderful little production technique very unassuming very old time fashion give comforting sense realism entire piece actor extremely well choose sheen get voice truly seamless guide r

In [71]:
# Save to csv
#data.to_csv("../data/clean_dataset.csv", index=False)