# Project Orange: IMDB reviews sentiment analysis

The purpose of this project is to build a model that classifies the sentiment of a movie review as either positive or negative. Neutral (medium) sentiment is excluded from the analysis.
The goal is therefore to assign a weighted positive or negative connotation to each word or group of words. Words that are commonly used in sentence formation and do not convey any emotion are intended to receive no weight.
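
To illustrate what "weighting" a word could mean, here is a minimal sketch that computes a smoothed log-odds score per word on a made-up four-review corpus. The corpus, the `weight` function and the smoothing constant are assumptions chosen for illustration; this is not the model built later in the project.

```python
import math
from collections import Counter

# Made-up miniature corpus, for illustration only.
reviews = [
    ("wonderful film great cast", "positive"),
    ("great story wonderful ending", "positive"),
    ("boring film terrible cast", "negative"),
    ("terrible story boring ending", "negative"),
]

# Count how often each word appears in positive and in negative reviews.
pos_counts, neg_counts = Counter(), Counter()
for text, label in reviews:
    target = pos_counts if label == "positive" else neg_counts
    target.update(text.split())

def weight(word):
    """Smoothed log-odds: > 0 leans positive, < 0 leans negative, 0 is neutral."""
    return math.log((pos_counts[word] + 1) / (neg_counts[word] + 1))

weights = {word: weight(word) for word in set(pos_counts) | set(neg_counts)}
for word, value in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{word:10s} {value:+.2f}")
```

In this toy corpus, words that occur equally often in both classes (such as "film") get a weight of zero, which mirrors the intent of giving no weight to words that carry no emotion.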

# 1. Data Cleaning

## Setup

Install the dependencies required for the analysis into the environment of the current Jupyter kernel.

In [98]:
import sys
!{sys.executable} -m pip install -U scikit-learn spacy pandas seaborn nltk jupyter
!{sys.executable} -m spacy download en_core_web_sm

Requirement already up-to-date: scikit-learn in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.22)
Requirement already up-to-date: spacy in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (2.2.3)
Requirement already up-to-date: pandas in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.25.3)
Requirement already up-to-date: seaborn in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.9.0)
Requirement already up-to-date: nltk in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (3.4.5)
Requirement already up-to-date: jupyter in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (1.0.0)


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [99]:
import pandas as pd
import re
import spacy
import nltk.corpus

sp = spacy.load('en_core_web_sm')
nltk.download('words')

[nltk_data] Downloading package words to /Users/sylvain/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

### Import dataset

In [100]:
data = pd.read_csv("../data/imdb_dataset.csv") 

# Keep only the first 10 rows to reduce the CPU load
data = data[:10]
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Cleaning

In the following cells, we will define various cleaning methods. Each review will then be passed through each of those methods sequentially.


In [101]:
cleaning_methods = []

#### Text to lowercase

In [102]:
def to_lower(review):
    return review.lower()

cleaning_methods.append(to_lower)

#### Remove HTML elements

In [103]:
def remove_html(review):
    # Replace <br> tags (single or repeated), hyphens and slashes with a space
    return re.sub(r"(<br\s*/?>)+|-|/", " ", review)

cleaning_methods.append(remove_html)
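
As a quick check of what such a substitution does, the sketch below applies a pattern that matches single or repeated `<br />` tags, hyphens and slashes to a made-up snippet. The name `strip_breaks` and the sample string are hypothetical and not taken from the dataset.

```python
import re

# Hypothetical pattern for illustration: one or more <br>, <br/> or <br /> tags,
# or a hyphen, or a slash.
BREAKS_AND_PUNCT = re.compile(r"(?:<br\s*/?>)+|-|/")

def strip_breaks(text):
    """Replace each match (a run of <br> tags, a hyphen or a slash) with one space."""
    return BREAKS_AND_PUNCT.sub(" ", text)

sample = "A wonderful little production.<br /><br />Well-made and/or charming."
cleaned = strip_breaks(sample)
print(cleaned)
```

Using a raw string (`r"..."`) keeps `\s` from being interpreted as a string escape, and matching the tag before the bare slash prevents a lone `<br />` from being left behind as `<br  >`.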

#### Remove stopwords and entities

In [104]:
# Keep negations and intensifiers: remove them from the default stop word list
sp.Defaults.stop_words -= {"n't", "most", "much", "never", "no", "not", "nothing", "n‘t", "n’t", "really", "top", "very", "well"}

# Add punctuation, whitespace and the pronoun placeholder to the stop word list
sp.Defaults.stop_words |= {'.', ',', '!', '?', ':', '&', '...', '(', ')','-', '/', '"', ';', '-PRON-', ' ', "'", '....', '  ', '*'}


def remove_stopwords_and_entities(review):
    doc = sp(review)
    
    # Remove stopwords
    tokens_without_stopwords = [token.text for token in doc if not token.is_stop]
    
    # Remove entities (only tokens whose text equals a whole entity string,
    # so multi-word entities are not removed)
    entities = {ent.text for ent in doc.ents}
    tokens_without_entities = [token for token in tokens_without_stopwords if token not in entities]
    
    return " ".join(tokens_without_entities)

cleaning_methods.append(remove_stopwords_and_entities)
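
The two set operations used on the stop word list can be sketched without spaCy. The miniature stop list and the `drop_stop_words` helper below are hypothetical; spaCy's default list is much larger.

```python
# Hypothetical miniature stop list, for illustration only.
stop_words = {"the", "a", "is", "not", "very", "this"}

# `-=` removes entries: negations and intensifiers carry sentiment, so keep them in the text.
stop_words -= {"not", "very"}

# `|=` adds entries: treat punctuation tokens as stop words too.
stop_words |= {".", ",", "!"}

def drop_stop_words(tokens):
    """Return the tokens that are not in the customized stop list."""
    return [token for token in tokens if token not in stop_words]

tokens = ["this", "movie", "is", "not", "very", "good", "."]
kept = drop_stop_words(tokens)
print(kept)
```

Keeping "not" matters for sentiment: without it, "not very good" and "very good" would reduce to the same tokens.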

#### Lemmatization

In [105]:
def lemmatize_it(review):
    doc = sp(review)
    tokens = [token.lemma_ for token in doc]
    return " ".join(tokens)

cleaning_methods.append(lemmatize_it)

#### Check if tokens are words and remove those that aren't (using NLTK)

In [106]:
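# Build the vocabulary once; a set gives constant-time lookups
english_words = set(nltk.corpus.words.words())

def remove_non_words(this_review):
    # Assumes the review is a string: split it into tokens and keep known English words
    return [word for word in this_review.split() if word in english_words]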
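
The sketch below shows the same filtering idea on a hypothetical three-word vocabulary, and why the review has to be split first: iterating over a string directly yields single characters, none of which match a word. `keep_known_words` and the sample review are made up for illustration.

```python
# Hypothetical miniature vocabulary; the notebook uses the NLTK `words` corpus instead.
vocabulary = {"wonderful", "little", "production"}

def keep_known_words(review, vocab):
    """Split `review` on whitespace and keep the tokens found in `vocab`."""
    return [token for token in review.split() if token in vocab]

review = "wonderful little production br xyzzy"

by_token = keep_known_words(review, vocabulary)
# Iterating a str yields characters, so nothing matches the vocabulary.
by_char = [ch for ch in review if ch in vocabulary]

print(by_token)
print(by_char)
```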

#### Reconcatenate list to string

In [107]:
def reconcatenate_list_to_string(this_review):
    return ' '.join(this_review)

#cleaning_methods.append(reconcatenate_list_to_string)

## Final dataset

In [108]:
def cleaning(review):
    for cleaning_method in cleaning_methods:
        review = cleaning_method(review)
    return review
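
To make the sequential application concrete, here is a minimal sketch with toy steps. `apply_all` and the three steps are hypothetical stand-ins for `cleaning` and the methods registered in `cleaning_methods`.

```python
def apply_all(text, steps):
    """Run `text` through each callable in `steps`, in order."""
    for step in steps:
        text = step(text)
    return text

# Toy steps standing in for the cleaning methods defined in the notebook.
steps = [
    str.lower,                       # lowercase the text
    str.strip,                       # trim surrounding whitespace
    lambda s: s.replace("-", " "),   # replace hyphens with spaces
]

result = apply_all("  A Well-Made Film  ", steps)
print(result)
```

Each step receives the output of the previous one, so the order of registration matters. For instance, lowercasing first may make it harder for a named-entity recognizer to spot proper nouns later in the chain.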

In [109]:
data['review'] = data['review'].map(cleaning)
data.head()

# from tqdm import tqdm
# tqdm.pandas()
# data['review'] = data['review'].progress_map(cleaning)

Unnamed: 0,review,sentiment
0,reviewer mention watch 1 oz episode hook right...,positive
1,wonderful little production filming techniq...,positive
2,think wonderful way spend time hot summer week...,positive
3,basically family little boy think zombie close...,negative
4,petter love time money visually stunning film ...,positive


In [110]:
# Save to csv
#data.to_csv("../data/clean_dataset.csv", index=False)