
## Figuring out Natural Language Processing
As I have never worked on NLP before, the purpose of this notebook was to play arround with a dataset and try to figure out a bunch of stuff on the subject.
Here we will be working on the IMDB dataset which provides 50k movies text reviews and their corresponding sentiment  "Positive" or "Negative".

Our job will be to find a way to learn some features that can predict the sentiment based on a textual review. 

### Load the data
We will be getting the data from my google drive. I have downloaded those data from Kaggle https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews .

In [113]:
import pandas as pd
import requests
from io import StringIO

orig_url='https://drive.google.com/file/d/1Tl9AMNkExM5mFw3xDuIeZ1RiDIEu4Oci/view?usp=sharing'
file_id = orig_url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
df_dwnld = pd.read_csv(csv_raw)
df = df_dwnld.copy()
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [114]:
df.iloc[0,0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## Cleaning
Now that we have the data, and displayed some of those data, we know that there is cleaning to be made. 

For this analysis, I will assume that numbers are meaningless and that we need only words to predict the sentiment. 
Therefore, we will get rid of : 
* numbers,
* html tags,
* uppercases,
* any special characters

In [115]:
# Remove numbers
df['clean_review'] = df['review'].str.replace('\d+', '')
# Remove any <> and everything inside
df['clean_review'] = df['clean_review'].str.replace('<[^<]+?>', '')
# Remove anything that is not alphanumeric
df['clean_review'] = df['clean_review'].str.replace(r'[^A-Za-z0-9 ]+', '')
# Remove any uppercase character
df['clean_review'] = df['clean_review'].str.lower()
# Remove any one character words
df['clean_review'] = df['clean_review'].str.replace(r'\b\w\b', '')
# Remove multiple spaces 
df['clean_review'] = df['clean_review'].str.replace(r'\s+', ' ')
# Strip data
df['clean_review'] = df['clean_review'].str.strip()

df['clean_review'][0]

'one of the other reviewers has mentioned that after watching just oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictures p

## What direction ? 
Now, we have a text that seems to be way more clean. 

Obviously, we will have to create some features out of all these words in order to extract the sentiment. 

What I mean by that is that we need to create a standardized framework in which any review could fit. The problem with those textual input is that they are of random sizes, and any model that we might create will need inputs of pre-defined sizes.

What we will be using here is some kind of one-hot-encoding technic. The concept is simple, you take a categorical variable and transform it in vector space. ie: 

| category |
|---|
| A |
| B |
| C | 

| A | B | C |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |

-----------

Here, the columns will be some relevants words that we believe to have predictive power.

In order to find them, let's play arround with the data.

In [117]:
# The columns 'words' will contains a list of all the words in the 'clean' column
df['words_'] = df.clean_review.str.split('\s+')
df.words_[0][:10]

['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching']

## Feature engineering
Now we will identify ALL the words that have been used and count how many time they have been used.

It is important to split your data in training and testing set. Therefore we will do it right now by spliting the dataset in half and we will be doing ou analysis only on the first half.

In [134]:
# Here we define the lenght of our training set. which will be half of the dataset.
total = x.shape[0]
n = total // 2

dict_count = {}
data = list(df.itertuples(index=False, name=None))

# We will loop only on the first n reviews
for d in data[:n]:
    for w in d[3]:
        if not w in dict_count:
            dict_count[w] = 1
        else:
            dict_count[w] +=1

df_count = pd.DataFrame(dict_count, index=['Count']).T.sort_values('Count')
df_count.tail()

Unnamed: 0,Count
is,210064
to,266297
of,288080
and,319406
the,650762


## Stop words problem
And here we are, the famous stop words problems. 
This was indeed pretty well expected, the words that are the most common will be completely useless in our case. 

A good practice is to get rid of them.
The sklearn library has a english stop word frozen set, we will use it.


In [123]:
from sklearn.feature_extraction import stop_words
df_count = df_count.loc[~df_count.index.isin(stop_words.ENGLISH_STOP_WORDS)]
df_count.tail()

Unnamed: 0,Count
good,14378
just,17317
like,19555
film,37345
movie,41962


#### Next step
Now for each of these words, I will add a column to the DataFrame and I want to count how many time each of them appear in each review. This is where we use "some kind" of one-hot-encoding technics as we will not populate with 1 or 0 but with a number of occurence. 

In [124]:
top_words = df_count.tail(1500).index.tolist()

# Rename the columns as their name might appear in the list of words
df = df.rename({
    "sentiment": '_predict',
    "review": "_review"
}, axis=1)

for word in top_words:
    df[word] = df.clean_review.str.count(word)

df.head()

Unnamed: 0,_review,_predict,clean_review,words_,memories,mike,locations,learned,lovers,noticed,...,lowbudget,occasionally,wind,fights,cars,officer,leader,ordinary,boat,review
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...,"[one, of, the, other, reviewers, has, mentione...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,A wonderful little production. <br /><br />The...,positive,wonderful little production the filming techni...,"[wonderful, little, production, the, filming, ...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,I thought this was a wonderful way to spend ti...,positive,thought this was wonderful way to spend time o...,"[thought, this, was, wonderful, way, to, spend...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Basically there's a family where a little boy ...,negative,basically theres family where little boy jake ...,"[basically, theres, family, where, little, boy...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter matteis love in the time of money is vi...,"[petter, matteis, love, in, the, time, of, mon...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Where are the interesting stuff ??
Alright, now that we have counted everything, why don't we group our data by sentiment, positive or negative, and see if any words appears way more often in a group and not in the other

In [125]:
result = df[['_predict',*top_words]].groupby('_predict').mean().T
result['diff_'] = (result.negative / result.positive) -1
result.diff_.sort_values()

wonderfully   -0.866038
beautifully   -0.843915
superb        -0.821084
wonderful     -0.805434
touching      -0.805195
                 ...   
poorly         8.388060
laughable      8.653846
waste          8.919271
redeeming      9.017857
worst          9.961883
Name: diff_, Length: 1500, dtype: float64

### That is interesting
So here we are, words such as beautiful and wonderful are way more often used in a positive review than in a negative review. And words like worst, and awful are more often used in a negative review. 

Again, I believe those results are pretty obvious, that is just common sense. However it still took us less time than coming up with 1500 words by ourself. 

By looking at the data so far, I'm assuming that there should be some predictive power in our variable. Let's prepare our data and try to fit a simple model.

In [138]:
predict_df = df[['_predict', *top_words]]

In [139]:
x = predict_df.drop('_predict', axis=1)
y = predict_df['_predict']
y = y.replace({'positive':1,'negative':0})

In [140]:
n_test = total - n
x_train = x.iloc[:n,:].values
y_train = y.iloc[:n].values

x_test =  x.iloc[n:,:].values
y_test =  y.iloc[n:].values

## Standardize the data
Some learning models require the data to be normalize in some way. 
Here we will just standardize them.

In [141]:
x_train_std = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)
x_test_std = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

# Learn
It is time to create our model. 

This problem is a classification problem. Therefore we can choose among the following learning technics :

* Linear Models
    * Logistic Regression
    * Support Vector Machines
* Nonlinear models
    * K-nearest Neighbors (KNN)
    * Kernel Support Vector Machines (SVM)
    * Naïve Bayes
    * Decision Tree Classification
    * Random Forest Classification

For this type of classification problem, I usually run a simple logistic regression as well as a Random forest classification. 
Now, we will just look into the logistic regression

In order to evaluate the quality of our model we will be using the following metrics:

* Accuracy: Correct Predictions / Total predictions
* Precision: True Positive / (True Positive + False Positive)
* Recall: True Positive / (True Positive + False Negative)

In [142]:
from sklearn.metrics import confusion_matrix,plot_confusion_matrix, accuracy_score, recall_score, precision_score
def scores(y, y_pred):
    precision = precision_score(y, y_pred)
    accuracy = accuracy_score(y, y_pred)
    recall = recall_score(y, y_pred)
    print('-----------------')
    print('Precision')
    print(precision)
    print('-----------------')
    print('Accuracy')
    print(accuracy)
    print('-----------------')
    print('Recall')
    print(recall)
    cnf_mat = confusion_matrix(y,y_pred)
    print('-----------------')
    print('Confusion Matrix')
    print(cnf_mat)

## Logistic Regression 

This regression is used when the variable to predict is categorical (1 or 0). 

For our convenience, we will be using the LogisticRegressionCV class from scikit-learn which is doing the Cross-validation for us.
What it does is playing with a list of lambda values which defines the strenght of the penalty term and a list of l1_ratios which makes our penalty term closer to either L1 or L2 when using the elasticnet penalty term.

As the default solver only supports the l2 penalty term, we will be fine with that and our CV will decide only which lambda fits better.



In [178]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
clf = LogisticRegressionCV()
clf.fit(x_train_std, y_train)
y_pred = clf.predict(x_test_std)
scores(y_test, y_pred)

-----------------
Precision
0.8644014718546935
-----------------
Accuracy
0.87132
-----------------
Recall
0.8814465910905317
-----------------
Confusion Matrix
[[10742  1732]
 [ 1485 11041]]


The results are pretty good as we have not been doing any extensive feature engineering here.
On the back of those results, we are now able to read some review that the model was not able to predict and try to identify a pattern that our model can't handle. This would help us improving our feature engineering process. 

However, for the purpose of this notebook, I will just stop here. Feel free to play with the data and find better alternatives.

In [183]:
output = df.clean_review.iloc[:n].to_frame()
output.loc[:,'prediction'] = (y_test - y_pred) 

In [186]:
# Wrong prediction
output.query('prediction==1').iloc[0,0]

'this show was an amazing fresh innovative idea in the when it first aired the first or years were brilliant but things dropped off after that by the show was not really funny anymore and its continued its decline further to the complete waste of time it is todayits truly disgraceful how far this show has fallen the writing is painfully bad the performances are almost as bad if not for the mildly entertaining respite of the guesthosts this show probably wouldnt still be on the air find it so hard to believe that the same creator that handselected the original cast also chose the band of hacks that followed how can one recognize such brilliance and then see fit to replace it with such mediocrity felt must give stars out of respect for the original cast that made this show such huge success as it is now the show is just awful cant believe its still on the air'

In [187]:
# Wrong prediction
output.query('prediction==-1').iloc[0,0]

'petter matteis love in the time of money is visually stunning film to watch mr mattei offers us vivid portrait about human relations this is movie that seems to be telling us what money power and success do to people in the different situations we encounter this being variation on the arthur schnitzlers play about the same theme the director transfers the action to the present time new york where all these different characters meet and connect each one is connected in one way or another to the next person but no one seems to know the previous point of contact stylishly the film has sophisticated luxurious look we are taken to see how these people live and the world they live in their own habitatthe only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits big city is not exactly the best place in which human relations find sincere fulfillment as one discerns is the case with most of the people we encounterthe acting is good under

In [188]:
# Good prediction
output.query('prediction==-1').iloc[0,0]

'petter matteis love in the time of money is visually stunning film to watch mr mattei offers us vivid portrait about human relations this is movie that seems to be telling us what money power and success do to people in the different situations we encounter this being variation on the arthur schnitzlers play about the same theme the director transfers the action to the present time new york where all these different characters meet and connect each one is connected in one way or another to the next person but no one seems to know the previous point of contact stylishly the film has sophisticated luxurious look we are taken to see how these people live and the world they live in their own habitatthe only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits big city is not exactly the best place in which human relations find sincere fulfillment as one discerns is the case with most of the people we encounterthe acting is good under