# Project Orange: IMDB reviews sentiment analysis

The purpose of this project is to build a model that classifies the sentiment of a movie review as either positive or negative. Neutral (medium) sentiment is excluded from the analysis.
The goal is therefore to assign a weight with a positive or negative connotation to each word or group of words. Words that are commonly used in sentence formation and do not reflect any emotion should receive no weight.
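
To make the idea of per-word weights concrete, here is a toy sketch. The weights and the set of neutral words are hand-picked, hypothetical values chosen purely for illustration; a trained model would instead learn such weights from the labelled reviews.

```python
# Toy illustration only: hand-picked, hypothetical weights (not learned from data)
WORD_WEIGHTS = {"wonderful": 2.0, "stunning": 1.5, "boring": -1.5, "terrible": -2.0}
NEUTRAL_WORDS = {"the", "a", "is", "was", "and", "this", "but"}  # receive no weight


def score_review(review):
    """Sum the weights of the words in a review; neutral and unknown words add 0."""
    total = 0.0
    for word in review.lower().split():
        if word in NEUTRAL_WORDS:
            continue
        total += WORD_WEIGHTS.get(word, 0.0)
    return total


def label_review(review):
    """Label a review as positive when its total score is above zero."""
    return "positive" if score_review(review) > 0 else "negative"


print(label_review("this was a wonderful and stunning film"))
print(label_review("the plot is boring and the acting was terrible"))
```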

# 1. Required Libraries

## Setup

Install all required dependencies for the future analysis in the current Jupyter kernel.

In [71]:
import sys
!{sys.executable} -m pip install -U scikit-learn spacy pandas seaborn sklearn
!{sys.executable} -m spacy download en_core_web_sm

Requirement already up-to-date: scikit-learn in c:\users\cyrill\anaconda3\lib\site-packages (0.22)
Requirement already up-to-date: spacy in c:\users\cyrill\anaconda3\lib\site-packages (2.2.3)
Requirement already up-to-date: pandas in c:\users\cyrill\anaconda3\lib\site-packages (0.25.3)
Requirement already up-to-date: seaborn in c:\users\cyrill\anaconda3\lib\site-packages (0.9.0)
Requirement already up-to-date: sklearn in c:\users\cyrill\anaconda3\lib\site-packages (0.0)
[+] Download and installation successful


You can now load the model via spacy.load('en_core_web_sm')


In [72]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import re
import spacy
import nltk.corpus

# Needs the NLTK "words" corpus, available after a one-time nltk.download("words")
from nltk.corpus import words

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, average_precision_score
from sklearn.metrics import precision_recall_curve, plot_precision_recall_curve


%matplotlib inline
sns.set_style("darkgrid")
sp = spacy.load('en_core_web_sm')

# 2. EDA

### Import dataset

In [73]:
data = pd.read_csv("../data/imdb_dataset.csv") 

# Keep only the first 50 reviews to reduce the load on the CPU
data = data[:50]
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [74]:
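# Base rate: share of the majority class, i.e. the accuracy of always predicting that class
n_positive = (data["sentiment"] == "positive").sum()
n_negative = (data["sentiment"] == "negative").sum()
base_rate = max(n_positive, n_negative) / data.shape[0]
print("The base rate is " + str(base_rate))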

The base rate is 0.54
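
The base rate is the accuracy a classifier would reach by always predicting the most frequent label, so any useful model has to beat it. A minimal sketch of the same calculation on a made-up list of labels (the helper name `base_rate` and the toy data are assumptions for illustration):

```python
from collections import Counter


def base_rate(labels):
    """Return the share of the most frequent label in the list."""
    counts = Counter(labels)
    majority_count = max(counts.values())
    return majority_count / len(labels)


# Made-up labels, not taken from the IMDB sample
toy_labels = ["positive", "negative", "positive", "positive", "negative"]
print(base_rate(toy_labels))  # 3 of the 5 labels are "positive" -> 0.6
```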


## Cleaning

#### Text to lowercase

In [75]:
def to_lower(this_review):
    this_review=this_review.lower()
    return this_review

#### Remove HTML elements

In [76]:
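# Matches doubled <br /> tags, hyphens and slashes (raw string avoids invalid-escape warnings)
REMOVE_HTML = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def remove_html(review):
    # Replace every match with a single space
    return REMOVE_HTML.sub(" ", review)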
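
The pattern above targets doubled `<br />` tags only, so a single `<br />` or other tags such as `<i>` are not matched as a whole. If broader coverage is wanted, one possible sketch is shown below. It assumes that any `<...>` sequence is a tag, which is a rough heuristic rather than real HTML parsing; the names `ANY_TAG`, `EXTRA_SPACES` and `strip_tags` are hypothetical.

```python
import re

# Assumption: anything between "<" and ">" is a tag (heuristic, not an HTML parser)
ANY_TAG = re.compile(r"<[^>]+>")
EXTRA_SPACES = re.compile(r"\s+")


def strip_tags(review):
    """Replace every tag with a space, then collapse repeated whitespace."""
    without_tags = ANY_TAG.sub(" ", review)
    return EXTRA_SPACES.sub(" ", without_tags).strip()


# Made-up example text
print(strip_tags("Great film.<br />Loved the <i>ending</i> a lot."))
# Great film. Loved the ending a lot.
```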

#### Identify and remove entities

In [77]:
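def remove_entities(this_review):
    # Run the spaCy pipeline to detect named entities (people, places, organisations, ...)
    doc = sp(this_review)

    # Delete every occurrence of an entity that is preceded by a space
    for ent in doc.ents:
        this_review = this_review.replace(" " + ent.text, "")
    return this_review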
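
Because the replacement looks for a space followed by the entity text, an entity at the very start of a review is left in place. A possible alternative, sketched here with the standard library only, removes whole-word matches wherever they occur. The entity list is hard-coded and hypothetical, standing in for the strings spaCy would return in `doc.ents`, and `remove_terms` is an illustrative name.

```python
import re


def remove_terms(text, terms):
    """Remove every whole-word occurrence of the given terms, wherever it appears."""
    for term in terms:
        # \b only matches next to word characters, so terms should start and end with one
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, " ", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


# Hypothetical entity list, standing in for the strings found in doc.ents
entities = ["jake", "bbc"]
print(remove_terms("jake thinks the bbc show is great", entities))
# thinks the show is great
```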

#### Lemmatization

In [78]:
# Implementing lemmatization
def lemmatize_it(this_review):
    filtered_sent = []

    # Run the spaCy pipeline ("sp") so that every token is annotated with its lemma
    lem = sp(this_review)

    # Collect the lemma of each token
    for word in lem:
        filtered_sent.append(word.lemma_)
    return filtered_sent

#### Tokenization (not used)

In [79]:
# "nlp" Object is used to create documents with linguistic annotations.
nlp = spacy.lang.en.English()

def tokenize_review(this_review):
    my_doc = nlp(this_review)
    
    # Create list of word tokens
    token_list = []
    for token in my_doc:
        token_list.append(token.text)
    return token_list

#### Adapt spacy stopwords list to our topic

In [80]:
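# Start from spaCy's English stop-word list
spacy_stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)

# Negations and intensifiers carry sentiment, so they must not be treated as stop words
remove_from_stopwordlist=["n't", "most", "much", "never", "no", "not", "nothing", "n‘t", "n’t", "really", "top", "very", "well"]
# Build a new list: calling remove() on a list while iterating over it skips elements
spacy_stopwords = [word for word in spacy_stopwords if word not in remove_from_stopwordlist]

# Punctuation, whitespace and spaCy's pronoun lemma are treated as stop words as well
add_to_stopwords=['.', ',', '!', '?', ':', '&', '...', '(', ')','-', '/', '"', ';', '-PRON-', ' ', "'", '....', '  ', '*']
spacy_stopwords.extend(add_to_stopwords)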
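
One Python detail is worth keeping in mind when trimming a stop-word list: calling `remove()` on a list while looping over that same list makes the loop skip the element that follows each removal. The toy example below, with a made-up three-word list, contrasts that with building a new list.

```python
keep_these = ["no", "not"]

# Removing while iterating: once "no" is removed, "not" shifts into its slot and is skipped
stopwords_loop = ["no", "not", "the"]
for word in stopwords_loop:
    if word in keep_these:
        stopwords_loop.remove(word)

# Building a new list visits every element exactly once
stopwords_filtered = [w for w in ["no", "not", "the"] if w not in keep_these]

print(stopwords_loop)      # ['not', 'the']  ("not" was never checked)
print(stopwords_filtered)  # ['the']
```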

#### Remove stopwords and punctuation

In [81]:
def eliminate_stopwords(this_review):
    filtered_sent = []

    # this_review is already a list of lemmas (the output of lemmatize_it)
    for word in this_review:
        # Keep only tokens that are not in the adapted stop-word list
        if word not in spacy_stopwords:
            filtered_sent.append(word)
    return filtered_sent

#### Check if tokens are words and remove those that aren't (using NLTK)

In [82]:
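# Build the vocabulary once; calling words.words() for every token is very slow
english_vocab = set(words.words())

def remove_non_words(this_review):
    clean_sent = []
    for word in this_review:
        # Keep only tokens that appear in the NLTK English word list
        if word in english_vocab:
            clean_sent.append(word)
    return clean_sent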
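
The filter keeps a token only if it appears in a reference vocabulary. A minimal sketch of that idea, with a tiny hypothetical vocabulary standing in for the NLTK word list and stored as a set so that each lookup stays fast:

```python
# Hypothetical mini-vocabulary standing in for the NLTK word list
TOY_VOCAB = frozenset({"wonderful", "little", "production", "film"})


def keep_known_words(tokens, vocabulary=TOY_VOCAB):
    """Return only the tokens that are members of the vocabulary."""
    return [token for token in tokens if token in vocabulary]


print(keep_known_words(["wonderful", "lil", "production", "xxyz"]))
# ['wonderful', 'production']
```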

#### Reconcatenate list to string

In [83]:
def reconcatenate_list_to_string(this_review):
    this_review = ' '.join(this_review)
    return this_review

## Final dataset

In [84]:
def cleaning(this_review):
    this_review = to_lower(this_review)
    this_review = remove_entities(this_review)
    # Assumed intended step: the original call, recognize_it, is not defined in this notebook
    this_review = remove_html(this_review)
    this_review = lemmatize_it(this_review)
    this_review = eliminate_stopwords(this_review)
    this_review = remove_non_words(this_review)
    this_review = reconcatenate_list_to_string(this_review)

    return this_review
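
The pipeline alternates between string steps and list steps: the first stages take and return a string, lemmatization turns the string into a list of tokens, and the last step joins the list back into a string. A simplified sketch of that shape follows. It uses stdlib-only stand-ins (a plain whitespace split in place of spaCy lemmatization and a tiny hypothetical stop-word set), so all names in it are illustrative assumptions.

```python
from functools import reduce

TOY_STOPWORDS = {"the", "a", "is"}  # hypothetical, for illustration only


def lowercase(text):
    return text.lower()


def split_tokens(text):
    # Stand-in for lemmatization: plain whitespace split, no lemmas are computed
    return text.split()


def drop_stopwords(tokens):
    return [token for token in tokens if token not in TOY_STOPWORDS]


def join_tokens(tokens):
    return " ".join(tokens)


# Order matters: string steps first, then list steps, then back to a string
STEPS = [lowercase, split_tokens, drop_stopwords, join_tokens]


def run_pipeline(review, steps=STEPS):
    """Feed the output of each step into the next one."""
    return reduce(lambda value, step: step(value), steps, review)


print(run_pipeline("The film is a wonderful surprise"))
# film wonderful surprise
```

Keeping the steps in a list makes it easy to switch a single stage on or off while experimenting.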

In [85]:
data['mastacleaned_reviews'] = data['review'].map(cleaning)
data.head()

Unnamed: 0,review,sentiment,mastacleaned_reviews
0,One of the other reviewers has mentioned that ...,positive,reviewer mention watch episode hook right exac...
1,A wonderful little production. <br /><br />The...,positive,wonderful little production technique very ver...
2,I thought this was a wonderful way to spend ti...,positive,think wonderful way spend time hot sit air con...
3,Basically there's a family where a little boy ...,negative,basically family little boy jake think zombie ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter love time money visually stunning film ...


In [86]:
# Process all reviews and save to csv
#data = pd.read_csv("../data/imdb_dataset.csv") 

# Uses the cleaning pipeline defined above (MASTA_CLEAN is not defined in this notebook)
data['review'] = data['review'].map(cleaning)

data.to_csv("../data/very_clean_dataset.csv", index=False)
