# Data Cleaning
_IMDB Movie Review Sentiment Analysis_  


### Goal

Before we can begin the analysis, the raw data needs to be cleaned.

Our goal in this project is to create the best model to predict the _polarity_ of movie reviews.

>**Polarity**: whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
[Source](https://en.wikipedia.org/wiki/Sentiment_analysis)

In our case, we will create binary classification models predicting whether a given review is _positive_ or _negative_.

### Data cleaning plan

In this notebook, we are taking the raw dataset and cleaning it for the analysis in notebook 2.

We will use the following cleaning steps:
* Removal of HTML code from the reviews
* Removal of specific stopwords
* Removal of named entities
* Lemmatization
* Removal of all non-words

## Setup

The following block installs all required dependencies for both the data cleaning and the analysis.

In [1]:
import sys
# Note: the PyPI package "sklearn" is a deprecated alias of scikit-learn and is left out;
# tqdm is added because the progress bar in the final cells needs it
!{sys.executable} -m pip install -U scikit-learn spacy pandas seaborn scikit-plot nltk jupyter graphviz tqdm
!{sys.executable} -m spacy download en_core_web_sm

Requirement already up-to-date: scikit-learn in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.22)
Requirement already up-to-date: spacy in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (2.2.3)
Requirement already up-to-date: pandas in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.25.3)
Requirement already up-to-date: seaborn in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.9.0)
Requirement already up-to-date: scikit-plot in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (0.3.7)
Requirement already up-to-date: nltk in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (3.4.5)
Requirement already up-to-date: jupyter in /Users/sylvain/anaconda3/envs/optimize/lib/python3.7/site-packages (1.0.0)
Requirement already up-to-date: graphviz in /Users/sylv

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [2]:
import pandas as pd
import re
import spacy
import nltk.corpus

sp = spacy.load('en_core_web_sm')
nltk.download('words')

[nltk_data] Downloading package words to /Users/sylvain/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

### Import dataset
We are working with a [dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) from kaggle.com. It contains 50,000 IMDB movie reviews, each labeled positive or negative, with a base rate of 50%. The dataset has two columns of interest: the review text (`review`) and the sentiment label of that review (`sentiment`).

In [3]:
data = pd.read_csv("../data/imdb_dataset.csv") 
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Cleaning

In the following cells, we will define various cleaning methods. Each review will then be passed through each of those methods sequentially.


In [4]:
cleaning_methods = []

#### Remove HTML elements
We start by removing HTML elements from the reviews, since they carry no information for our analysis. Review 1 in the output above, for example, contains the HTML line-break tag `<br />`, which is one of the elements we want to remove. The same step also replaces hyphens and slashes with spaces.

In [5]:
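# Matches runs of HTML line breaks (<br>, <br/>, <br />) as well as hyphens and slashes.
# A raw string avoids invalid-escape warnings; the pattern is compiled once, not per review.
HTML_PATTERN = re.compile(r"(?:<br\s*/?>)+|-|/")

def remove_html(review):
    # Replace each match with a single space
    return HTML_PATTERN.sub(" ", review)

cleaning_methods.append(remove_html)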
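
As a quick check of what this step does, the sketch below applies a pattern of this kind to a made-up review string. The sample text and the names `BREAKS_AND_SEPARATORS` and `sample` are assumptions for illustration. The pattern is a slightly generalized variant that also accepts single `<br>` tags.

```python
import re

# Assumed variant of the cleaning pattern: runs of <br> tags, hyphens, and slashes
BREAKS_AND_SEPARATORS = re.compile(r"(?:<br\s*/?>)+|-|/")

# Made-up sample text, not taken from the dataset
sample = "A wonderful little production.<br /><br />The well-made film is a must/see."

cleaned_sample = BREAKS_AND_SEPARATORS.sub(" ", sample)
print(cleaned_sample)
# A wonderful little production. The well made film is a must see.
```

Replacing the matches with a space rather than an empty string keeps the words on either side of a tag or hyphen from being glued together.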

#### spaCy cleaning methods
We grouped three cleaning steps that rely on the spaCy package into a single function called `spacy_processing`. These steps are:
##### Tokenization and stopword removal
The next step is to tokenize the text, which means transforming each review (type: string) into a list of words. After that, we remove the stopwords from the reviews, since they carry little information for our analysis. We use the predefined stopword list provided by spaCy, adapted to our task: words that are crucial for sentiment (for example "not", "never", "very") are taken off the list so that they stay in the reviews.
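
To see why the negations need to stay, here is a minimal sketch with a toy stopword list. The list, the function `drop_stopwords`, and the whitespace tokenization are assumptions for illustration. spaCy's real list is much longer and its tokenizer is more sophisticated.

```python
# Toy stopword list (hypothetical; spaCy's real list is much longer)
default_stop_words = {"the", "was", "a", "not", "very", "this"}

# Adapted list: negations and intensifiers are taken off the stopword list
adapted_stop_words = default_stop_words - {"not", "very"}

def drop_stopwords(text, stop_words):
    # Naive whitespace tokenization, standing in for spaCy's tokenizer
    return [word for word in text.lower().split() if word not in stop_words]

review = "This was not a very good movie"
with_default = drop_stopwords(review, default_stop_words)
with_adapted = drop_stopwords(review, adapted_stop_words)

print(with_default)  # ['good', 'movie']
print(with_adapted)  # ['not', 'very', 'good', 'movie']
```

With the default list, a negative review is reduced to "good movie", which would mislead a sentiment classifier.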

##### Entity removal
In our task, the most common entities are names of characters, names of actors and directors, and locations. We acknowledge that the success of a movie may be highly correlated with the actors starring in it, the director who shot it, or the characters involved. However, we want our approach to be as general as possible, and our results should not depend on the names of the people who made the movie.
That is why we decided to remove all entities from the reviews.
##### Lemmatization
Finally, we lemmatize the words. Lemmatization is an important tool in text analytics because it maps inflected forms to a common base form. This reduces the number of distinct word features, so each remaining feature is backed by more observations, which makes its coefficient more reliable (for example in a logistic regression).
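
The effect on the vocabulary can be sketched with a toy example. The lookup table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins. spaCy derives lemmas from its language model rather than from a hand-written dictionary.

```python
from collections import Counter

# Hypothetical lemma lookup, for illustration only
TOY_LEMMAS = {"loved": "love", "loves": "love", "loving": "love", "movies": "movie"}

def toy_lemmatize(tokens):
    # Map each token to its base form, leaving unknown tokens unchanged
    return [TOY_LEMMAS.get(token, token) for token in tokens]

tokens = ["loved", "loves", "loving", "movie", "movies"]
raw_counts = Counter(tokens)
lemma_counts = Counter(toy_lemmatize(tokens))

print(len(raw_counts), len(lemma_counts))  # 5 2
print(lemma_counts["love"], lemma_counts["movie"])  # 3 2
```

Five distinct features, each seen once, collapse into two features seen three and two times.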

In [6]:
# Remove the following stopwords from default list
sp.Defaults.stop_words -= {"n't", "n‘t", "most", "much", "never", "no", "not", "nothing", "really", "top", "very", "well"}

# Add the following elements, mainly punctuation to default list
sp.Defaults.stop_words |= {'.', ',', '!', '?', ':', '&', '...', '(', ')','-', '/', '"', ';', '-PRON-', ' ', "'", '....', '  ', '*'}

def spacy_processing(review):
    doc = sp(review.lower())

    # Remove stopwords
    tokens_without_stopwords = [token for token in doc if token.text not in sp.Defaults.stop_words]

    # Remove entities: ent_type_ is an empty string for tokens outside any named entity,
    # so this also drops every token of a multi-word entity
    tokens_without_entities = [token for token in tokens_without_stopwords if not token.ent_type_]

    # Lemmatization
    lemmatized_tokens = [token.lemma_ for token in tokens_without_entities]
    return lemmatized_tokens

cleaning_methods.append(spacy_processing)

#### Remove non-words using the NLTK library
Our tokens still contained some odd combinations of punctuation and words, such as ".....and", as well as numbers. We therefore decided to compare every token with the NLTK word corpus and remove the tokens that are not listed there as words.

In [7]:
# Sets offer much faster membership tests than lists
nltk_corpus = set(nltk.corpus.words.words())

def remove_non_words(tokens):
    non_words_removed = [token for token in tokens if token in nltk_corpus]
    
    # Join tokens into a string
    return " ".join(non_words_removed)

cleaning_methods.append(remove_non_words)
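
The filtering logic can be illustrated without downloading the corpus. In this sketch, `toy_vocabulary` is a hypothetical stand-in for the NLTK word list, and `keep_known_words` is an illustrative name that is not part of the notebook.

```python
# Hypothetical mini vocabulary standing in for nltk.corpus.words.words()
toy_vocabulary = {"great", "movie", "and", "plot"}

def keep_known_words(tokens, vocabulary):
    # Keep only tokens found in the vocabulary, then join them into a string
    return " ".join(token for token in tokens if token in vocabulary)

tokens = ["great", ".....and", "movie", "1995", "plot"]
cleaned = keep_known_words(tokens, toy_vocabulary)
print(cleaned)  # great movie plot
```

Note that ".....and" is dropped entirely even though it contains a real word. This filter only keeps tokens that match a vocabulary entry exactly.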

## Final dataset
The last step of the data cleaning is to apply all cleaning steps to each of the 50,000 reviews and save the result as a new dataset.

In [8]:
def cleaning(review):
    # Iterate over the declared cleaning methods and apply them in order
    for cleaning_method in cleaning_methods:
        review = cleaning_method(review)
    return review
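
The order of the steps matters because each function receives the output of the previous one: the pipeline starts with a string, switches to a list of tokens, and ends with a string again. The sketch below shows the same pattern with toy steps. The functions `strip_tags`, `tokenize`, `join_tokens`, and `run_pipeline` are hypothetical simplifications, not the notebook's cleaning methods.

```python
def strip_tags(text):
    # str -> str
    return text.replace("<br />", " ")

def tokenize(text):
    # str -> list of str
    return text.lower().split()

def join_tokens(tokens):
    # list of str -> str
    return " ".join(tokens)

steps = [strip_tags, tokenize, join_tokens]

def run_pipeline(value, steps):
    # Feed the output of each step into the next one
    for step in steps:
        value = step(value)
    return value

result = run_pipeline("Great<br />Movie", steps)
print(result)  # great movie
```

Swapping `tokenize` and `join_tokens` would break the chain, since `join_tokens` expects a list and `tokenize` expects a string.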

In [9]:
from tqdm import tqdm
tqdm.pandas()
data['review'] = data['review'].progress_map(cleaning)

100%|██████████| 50000/50000 [33:03<00:00, 25.21it/s]  


In [10]:
# Save to csv
data.to_csv("../data/extremely_clean_dataset.csv", index=False)