# Challenge START&GO Data Science 2020 "Botando pra Quebrar" 

Hello! We are Andreis Purim and Eduarda Agostini, two Brazilian students currently doing a double-degree program at the École Centrale de Lille, in France. For our challenge in Data Science, we chose to apply a mix of NLP algorithms to multiple interesting datasets.

All documentation in this notebook will be in English (though we will be presenting our project in French) because we believe it may be interesting to a wider audience. We hope you like it.

# 1. IMDB
Our first challenge is to build a Machine Learning classifier for the IMDB dataset of 50k movie reviews. The dataset itself is very simple: each row holds a review and its sentiment (positive or negative).

## 1.1. Visualizing our Data
Let's start by plotting some graphs and of course, seeing how our data works.

In [None]:
import numpy
import pandas
%matplotlib inline

Reviews = pandas.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
print(Reviews.shape)
print(Reviews.head())

Let's use the spaCy library to take a look at how our phrases are constructed. spaCy is a beautifully constructed library for NLP that ships pretrained statistical models in various languages. We could use the starter models for transfer learning, but for now let's just use the complete model to see how it fares.

By loading the pretrained English core model, we can use it to deconstruct a phrase and see every part of it (with explanations!). Let's choose our second review. In this case I don't want to print every word (the output would be huge), so I zip the tokens with range(20); if you want to observe all words, just loop directly over Chosen_Sentence.


In [None]:
import spacy

Spacy = spacy.load('en_core_web_sm')

Chosen_Sentence = Spacy(Reviews['review'][1])
for i,word in zip(range(20),Chosen_Sentence):
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

Note how our data still needs some cleaning. For example, there are HTML tags (like <br/>) which spaCy classifies as "superfluous punctuation". We'll clean the text before starting our models.
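
To see why the tags need special care, here is a minimal sketch on a made-up snippet (the sample string and the helper name strip_html are ours, purely for illustration). Replacing non-word characters alone leaves the tag name behind as a stray "br" token, whereas removing whole tags first does not.

```python
import re

# Made-up sample text that mimics the <br /> tags found in the reviews.
sample = "Great film.<br /><br />Loved every minute of it."

def strip_html(text: str) -> str:
    """Remove anything that looks like an HTML tag, then collapse whitespace."""
    without_tags = re.sub(r'<[^>]+>', ' ', text)
    return re.sub(r'\s+', ' ', without_tags).strip()

# Only replacing non-word characters keeps "br" as if it were a word.
naive = re.sub(r'\s+', ' ', re.sub(r'\W', ' ', sample)).strip()

print(naive)
print(strip_html(sample))
```

The tag pattern is deliberately simple; it would also remove ordinary text that happens to sit between a "<" and a ">", which we assume is rare in movie reviews.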

Another beautiful thing about spaCy is displaCy, a visualizer that makes exploring dependencies in NLP not only fun but also very helpful for us.

In [None]:
from spacy import displacy

displacy.render(Chosen_Sentence, style='dep', jupyter=True, options={'distance': 50})

Another powerful thing spaCy can do is identify entities in the text, like organizations, people, nationalities, etc. You can use displaCy again to visualize the entities in the text.

In [None]:
for entity in Chosen_Sentence.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

displacy.render(Chosen_Sentence, style='ent', jupyter=True)

Ok, so let's clean our dataset a little and see some graphics.

In [None]:
import re

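# First, a small function to clean our strings: as we saw before, there are
# plenty of HTML tags and unwanted punctuation marks.
def clear_sentence(sentence: str) -> str:
    '''Clean a text with regular expressions and return it in lowercase.'''
    sentence = re.sub(r'<[^>]+>', ' ', str(sentence))    # strip HTML tags such as <br />
    sentence = re.sub(r'\W', ' ', sentence)              # replace non-word characters with spaces
    sentence = re.sub(r'\s+[a-zA-Z]\s+', ' ', sentence)  # drop single letters surrounded by spaces
    sentence = re.sub(r'^[a-zA-Z]\s+', ' ', sentence)    # drop a single letter at the start of the text
    sentence = re.sub(r'\s+', ' ', sentence)             # collapse runs of whitespace
    sentence = re.sub(r'^b\s+', '', sentence)            # drop a leading "b" left by byte-string prefixes
    sentence = sentence.lower()
    return sentence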

# Clears every sentence in review.
Reviews['review'] = [clear_sentence(sentence) for sentence in Reviews['review']]

## 1.2. Choosing algorithms for our IMDB
Ok, now that we know what our data looks like, let's choose some Machine Learning algorithms to work with it. The first thing we need to do is reduce our dataset (because some of these algorithms can take quite a while), so we'll be using only the first 5000 reviews.




In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC, SVC, NuSVC
from sklearn.naive_bayes import MultinomialNB, GaussianNB
import matplotlib.pyplot as matplotlib
from matplotlib.lines import Line2D
from xgboost import XGBClassifier
import seaborn
import time


# Now, let's get a small sample of our reviews
Reviews_small = Reviews[0:5000]
Reviews_small

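# Makes two datasets, x and y: x will be the cleaned reviews and y will be the sentiment,
# encoded as 1 (positive) / 0 (negative) because recent XGBoost releases expect integer class labels
x_small = Reviews_small['review'].tolist()
y_small = (Reviews_small['sentiment'] == 'positive').astype(int).tolist()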

# Split the dataset in a 80%/20% fashion
X_train, X_test, y_train, y_test = train_test_split(x_small, y_small, test_size=0.2, random_state=0)


# I'm making two dictionaries, one for models to transform our words in vectors and the other of models to work on these vectors
Vectorizer_Models = {
    'Count': CountVectorizer(stop_words="english"),
    'Hash': HashingVectorizer(stop_words="english"),
    'Tfidf': TfidfVectorizer(stop_words="english",ngram_range=(1, 2))
}

ML_Models = {
    'LinearSVC': LinearSVC(),
    'SVC': SVC(),
    'NuSVC': NuSVC(),
    'DecisionTree': DecisionTreeClassifier(),
    'XGBClassifier': XGBClassifier(),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=0),
    'SGDC': SGDClassifier(),
    'MultiNB': MultinomialNB(),    
}

# I'll make a new list for a future dataset to see the accuracy and time of each algorithm.
plotting_data = []

for j,vector in enumerate(Vectorizer_Models):
    
    # Let's first vectorize, because the vectorized words will be used in common by all MLs. Also, starts counting the time to vectorize.
    time_vector_start = time.time()
    X_train_vectorized = Vectorizer_Models[vector].fit_transform(X_train) 
    X_test_vectorized= Vectorizer_Models[vector].transform(X_test)
    time_vector_end = time.time()
    
    for i,ml in enumerate(ML_Models):
        
        # Small detail: Multinomial Naive Bayes does not accept negative feature values, which
        # HashingVectorizer can produce by default, so here we only pair it with Count
        if not (ml == 'MultiNB' and vector != 'Count'):
            # Ok, let's start the time and put our models to fit the data.
            starting_time = time.time()
            model = ML_Models[ml]
            model.fit(X_train_vectorized, y_train)

            # Predict the data and try to find the accuracy
            y_predicted = model.predict(X_test_vectorized)
            accuracy = accuracy_score(y_test, y_predicted)
            ending_time = time.time()

            # Now, get the times and append everything in our plotting data.
            cut_time = round(time_vector_end - time_vector_start,2)
            ml_time = round(ending_time - starting_time,2)
            plotting_data.append([ml,vector,accuracy,ml_time,cut_time,cut_time+ml_time])


# Makes a pandas dataset for our data (for better visualization)
plot_times = pandas.DataFrame(plotting_data, columns=['ML','Vectorizer','Accuracy','ML_Time','Cut_time','Total_time'])

# Now, let's make a Seaborn scatterplot
seaborn.set(color_codes=True)
matplotlib.figure(figsize=(12, 8))
matplotlib.title("Best vectorization and Accuracy Algorithms")

ax = seaborn.scatterplot(data=plot_times, x='Total_time', y='Accuracy', hue='ML', style='Vectorizer')
matplotlib.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set(xlabel="Time (s)", ylabel="Accuracy")
plot_times

As you have probably seen in the graph, the best algorithm seems to be LinearSVC using Tfidf or Hash, with only a small difference in accuracy and time between the two, while combinations like Hash with XGB or RandomForest fared pretty badly in time.

I won't explain why in great detail (a quick search will turn up better answers), but a likely reason is that LinearSVC is a linear Support Vector Machine: it fits a single separating hyperplane, which suits the sparse, high-dimensional vectors our vectorizers produce. RandomForest, while a very good algorithm, has to search for splits across thousands of dimensions and just can't handle such vectors in a reasonable time. In this case, you can see SGDClassifier rates a little higher than LinearSVC; SGD (Stochastic Gradient Descent) is a way of fitting linear classifiers, and with its default hinge loss SGDClassifier fits a linear SVM.

In fact, with its default settings our SGDClassifier is a linear SVM trained with a different optimizer, as the scikit-learn docs state: "Strictly speaking, SGD is merely an optimization technique and does not correspond to a specific family of machine learning models. It is only a way to train a model. Often, an instance of SGDClassifier or SGDRegressor will have an equivalent estimator in the scikit-learn API, potentially using a different optimization technique."
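
To make "SGD is only a way to train a model" a bit more concrete, here is a minimal pure-Python sketch of stochastic gradient descent on the L2-regularized hinge loss, that is, a tiny linear SVM. Everything in it is an assumption for illustration: the toy 2-D points, the function names and the learning rate are ours, and this is not how scikit-learn implements SGDClassifier internally.

```python
import random

# Toy, linearly separable 2-D points with labels +1 / -1 (made up for illustration).
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (3.0, 2.0),
          (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0), (-3.0, -2.0)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

def sgd_hinge(points, labels, lr=0.1, reg=0.01, epochs=20, seed=0):
    """Fit w, b for sign(w.x + b) by SGD on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            # The regularization term always shrinks the weights a little.
            w = [wj * (1 - lr * reg) for wj in w]
            if margin < 1:
                # Inside the margin (or misclassified): step towards the sample.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return the predicted label (+1 or -1) for a single point."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

w, b = sgd_hinge(points, labels)
predictions = [predict(w, b, x) for x in points]
print(w, b, predictions)
```

The model being fitted is the same kind of linear classifier LinearSVC produces; only the optimization procedure differs, which is the point the quote makes.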

And for the vectorizers, you can see that Count is way faster, Hash is a mix of fast and accurate, and Tfidf is more accurate.
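
If the three vectorizers feel abstract, the following pure-Python sketch shows the idea behind each one on three made-up documents. It is a simplification under stated assumptions: crc32 is only a stand-in hash (scikit-learn uses a different hash function), the bucket count of 8 is arbitrary, and the tf-idf formula is the smoothed variant with L2 normalization that, as far as we know, matches scikit-learn's defaults.

```python
import math
import zlib
from collections import Counter

# Three tiny made-up documents, already lowercased and whitespace-tokenized.
docs = ["great movie great cast", "terrible movie", "great fun"]
tokenized = [doc.split() for doc in docs]

# Count: one column per vocabulary word, holding the number of occurrences.
vocab = sorted({word for doc in tokenized for word in doc})
count_rows = [[doc.count(word) for word in vocab] for doc in tokenized]

# Hashing: no vocabulary is stored; each word is hashed into a fixed number of
# buckets. The lowest hash bit picks a sign, so entries may be negative.
def hash_row(tokens, n_buckets=8):
    row = [0] * n_buckets
    for token in tokens:
        h = zlib.crc32(token.encode("utf-8"))
        sign = 1 if h & 1 == 0 else -1
        row[(h >> 1) % n_buckets] += sign
    return row

hash_rows = [hash_row(doc) for doc in tokenized]

# Tf-idf: counts reweighted so that words found in fewer documents weigh more,
# using idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization.
n_docs = len(tokenized)
doc_freq = Counter(word for doc in tokenized for word in set(doc))
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(counts):
    raw = [count * idf[word] for count, word in zip(counts, vocab)]
    norm = math.sqrt(sum(value * value for value in raw)) or 1.0
    return [value / norm for value in raw]

tfidf_rows = [tfidf_row(row) for row in count_rows]

print(vocab)
print(count_rows)
print(hash_rows)
print([[round(value, 2) for value in row] for row in tfidf_rows])
```

The possibly negative entries in the hashed rows are also why Multinomial Naive Bayes is not paired with Hash in the loop above.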

But when we scale the data back to its original 50,000 reviews, you might notice that LinearSVC outscores SGDC.

In [None]:
x = [clear_sentence(sentence) for sentence in Reviews['review']]
y = Reviews['sentiment'].tolist()

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
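# Use Tfidf explicitly instead of relying on the `vector` variable left over from the previous loop
X_train_vectorized = Vectorizer_Models['Tfidf'].fit_transform(X_train)
X_test_vectorized = Vectorizer_Models['Tfidf'].transform(X_test)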

Chosen_Models = {
    'LinearSVC': LinearSVC(),
    'SGDC': SGDClassifier(),
}

for ml in Chosen_Models:
    starting_time = time.time()
    model = Chosen_Models[ml]
    model.fit(X_train_vectorized, y_train)
    y_predicted = model.predict(X_test_vectorized)
    accuracy = accuracy_score(y_test, y_predicted)
    ending_time = time.time()
    print(ml,'Accuracy:',"{:.2f}".format(accuracy*100),"in","{:.2f}s".format(ending_time-starting_time))
    print(confusion_matrix(y_test, y_predicted))
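
As a reading aid for the confusion matrices printed by this cell, here is a small sketch that derives accuracy, precision and recall from a 2x2 matrix. The numbers are hypothetical and are NOT results from this notebook; the helper name summarize is ours. The layout assumed is scikit-learn's: rows are true classes, columns are predicted classes, and string labels are sorted, so index 0 is 'negative' and index 1 is 'positive'.

```python
# Hypothetical 2x2 confusion matrix (rows = true class, columns = predicted class).
matrix = [[40, 10],   # true negative reviews: 40 predicted negative, 10 predicted positive
          [5, 45]]    # true positive reviews:  5 predicted negative, 45 predicted positive

def summarize(matrix):
    """Compute accuracy, plus precision and recall for the 'positive' class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

summary = summarize(matrix)
print(summary)
```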

So, let's stick with LinearSVC a little longer and do one final thing: fine-tuning. scikit-learn comes with a nice tool called GridSearchCV that lets us tune our model's hyperparameters a little further.

Ideas:
```python
import nltk
from nltk.stem.snowball import SnowballStemmer
```
and fine tune with
```python
from sklearn.model_selection import GridSearchCV
GridSearchCV()
```
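
A possible starting point for that fine-tuning, as a hedged sketch rather than the final cell: it wraps TfidfVectorizer and LinearSVC in a Pipeline and searches a small grid with GridSearchCV. The toy corpus, the step names "tfidf" and "clf", and the grid values are all assumptions chosen for illustration; on the real data you would pass X_train and y_train instead and likely use more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus so the sketch runs in seconds.
texts = ["great movie", "loved the cast", "wonderful and fun", "great fun film",
         "terrible movie", "hated the plot", "boring and awful", "awful boring film"]
targets = ["positive"] * 4 + ["negative"] * 4

# Keeping the vectorizer inside the pipeline refits it on each training fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Illustrative grid only; "step__parameter" names address each pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, targets)

print(search.best_params_)
print(round(search.best_score_, 2))
```

Tuning the vectorizer and the classifier together is a design choice: it avoids fitting the vocabulary on data that later serves as a validation fold.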

In [None]:
# fine tuning