# Binary Sentiment Analysis of French Movie Reviews

### Objectives
1. Text cleaning
2. Text preprocessing for custom embedding Neural Network
3. Train RNN model for sentiment analysis

‚ö†Ô∏è This notebook will be your final deliverable. 
- Make sure it can run "restart and run all"
- Delete useless code cells
- Do not "clear output"

# 0. Load data

Our dataset contains 30,000 french reviews of movies, along with the binary class 1 (positive) or 0 (negative) score

In [1]:
# We load the dataset for you
import pandas as pd
import numpy as np
data = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/movies.csv')
data

Unnamed: 0,review,polarity
0,√áa commence √† devenir √©nervant d'avoir l'impre...,0
1,"J'ai aim√© ce film, si il ressemble a un docume...",1
2,Une grosse merde ce haneke ce faire produire p...,0
3,"Beau m√©lodrame magnifiquement photographi√©, ""V...",1
4,A la poursuite du diamant vers est un film pro...,1
...,...,...
29946,Le meilleur film de super-h√©ros derri√®re le ba...,1
29947,Un drame qui est d'une efficacit√© remarquable....,1
29948,"Une daube hollywoodienne de plus, aucun int√©r√™...",0
29949,Et voil√† un nouveau biopic sur la star du X Li...,0


In [2]:
# We create features
y = data.polarity
X = data.review

# We analyse class balance
print(pd.value_counts(y))

1    15051
0    14900
Name: polarity, dtype: int64


In [3]:
# We check various reviews
print(f'polarity: {y[0]} \n')
print(X[0])

polarity: 0 

√áa commence √† devenir √©nervant d'avoir l'impression de voir et revoir le m√™me genre de film √† savoir : la com√©die romantique, surement le genre le plus prolifique de le production fran√ßaise actuelle. Le probl√®me c'est que l'on a souvent affaire √† des niaiseries de faible niveau comme celui ci. Avec un scenario ultra balis√© et conventionnel, c'est √† se demander comment √ßa peut passer les portes d'un producteur. Bref cette sempiternel histoire d'un homme mentant au nom de l'amour pour reconqu√©rir une femme et qui √† la fin se prend son mensonge en pleine figure est d'une originalit√© affligeante, et ce n'est pas la pr√©sence au casting de l'ex miss m√©t√©o Charlotte Le Bon qui r√™ve surement d'avoir la m√™me carri√®re que Louise Bourgoin qui change la donne.


# 1. Clean Text

‚ùì We need to give a _quick & dirty_ cleaning to all the sentences in the dataset. Create a variable `X_clean` of similar shape, but with the following cleaning:
- Replace french accents by their non-accentuated equivalent using the [unidecode.unidecode()](https://pypi.org/project/Unidecode/) method
- Reduce all uppercases to lowercases
- Remove any characters outside of a-z, for instance using `string.isalpha()`

üòå You will be given the solution `X_clean` in the next question to make sure you can complete the challenge

In [4]:
### YOUR CODE
import unidecode
import string

import re

def remove_accents(words):
    remove_accents = unidecode.unidecode(words)
    return remove_accents

def lowered_text(words):
    lowercased = words.lower()
    return lowercased

def remove_alphas(words):
    sentence = ''.join(c for c in words if c.isalpha() or c == ' ')
    return re.sub(" +", " ", words)


X_clean = data.review.apply(remove_accents)
X_clean = X_clean.apply(lowered_text)
X_clean = X_clean.apply(remove_alphas)
X_clean

0        ca commence a devenir enervant d'avoir l'impre...
1        j'ai aime ce film, si il ressemble a un docume...
2        une grosse merde ce haneke ce faire produire p...
3        beau melodrame magnifiquement photographie, "v...
4        a la poursuite du diamant vers est un film pro...
                               ...                        
29946    le meilleur film de super-heros derriere le ba...
29947    un drame qui est d'une efficacite remarquable....
29948    une daube hollywoodienne de plus, aucun intere...
29949    et voila un nouveau biopic sur la star du x li...
29950    un film qui fait vieux, avec des acteurs pas t...
Name: review, Length: 29951, dtype: object

In [5]:
X_clean[0]

"ca commence a devenir enervant d'avoir l'impression de voir et revoir le meme genre de film a savoir : la comedie romantique, surement le genre le plus prolifique de le production francaise actuelle. le probleme c'est que l'on a souvent affaire a des niaiseries de faible niveau comme celui ci. avec un scenario ultra balise et conventionnel, c'est a se demander comment ca peut passer les portes d'un producteur. bref cette sempiternel histoire d'un homme mentant au nom de l'amour pour reconquerir une femme et qui a la fin se prend son mensonge en pleine figure est d'une originalite affligeante, et ce n'est pas la presence au casting de l'ex miss meteo charlotte le bon qui reve surement d'avoir la meme carriere que louise bourgoin qui change la donne."

In [6]:
from nbresult import ChallengeResult

result = ChallengeResult('C14',
    shape = X_clean.shape,
    first_sentence = X_clean[0]
)
result.write()

# 2. Preprocess data

Now that we have clean sentences, we need to convert each one into a list of integers of fixed size
- For example, the sentence: `"this was good"` should become something like `array([1, 3, 18, 0, 0, 0, ...0], dtype=int32)` where each integer match to a each _unique_ word in your corpus of sentences.

‚ùì Create a numpy ndarray `X_input` of shape (29951, 100) that will be the direct input to your Neutral Network. 

- 29951 represents the number of reviews in the dataset `X_clean`
- 100 represents the maximum number of words to keep for each movie review.
- It must contain only numerical values, without any `NaN`
- In the process, compute and save the number of _unique_ words in your cleaned corpus under `vocab_size` variable

üëâ First, you **must** start back from the clean solution below (14Mo)

In [7]:
X_clean = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/movies_X_clean.csv")['review']
X_clean

0        ca commence a devenir enervant de voir et revo...
1        aime ce film si il ressemble a un documentaire...
2        une grosse merde ce haneke ce faire produire p...
3        beau melodrame magnifiquement photographie ver...
4        a la poursuite du diamant vers est un film pro...
                               ...                        
29946    le meilleur film de derriere le batman de nola...
29947    un drame qui est efficacite remarquable un fil...
29948    une daube hollywoodienne de plus aucun interet...
29949    et voila un nouveau biopic sur la star du x li...
29950    un film qui fait vieux avec des acteurs pas to...
Name: review, Length: 29951, dtype: object

In [8]:
### YOUR CODE
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=100)
vectorizer_2 = CountVectorizer()
X_vecto = vectorizer.fit_transform(X_clean)
X_vecto_2 = vectorizer_2.fit_transform(X_clean)
X_input = X_vecto.toarray()
X_input_2 = X_vecto_2.toarray()

In [9]:
np.isnan(X_input).sum()

0

In [10]:
vocab_size = X_vecto_2.shape[1]
vocab_size

62353

In [11]:
from nbresult import ChallengeResult

result = ChallengeResult('C1415',
    type_X = type(X_input),
    shape = X_input.shape, 
    input_1 = X_input[1], 
)
result.write()

# 3. Neural Network

‚ùìCreate and fit a Neural Netork that takes `X_input` and `y` as input, to binary classify each sentence's sentiment

- You cannot use transfer learning or other pre-existing Word2Vec models
- You must use a "recurrent" architecture to _capture_ a notion of order in the sentences' words
- The performance metrics for this task is "accuracy"
- Store your model in a variable `model` 
- Store the result your `model.fit()` in a variable `history`. 
- ‚ö†Ô∏è `history.history` must comprises a measure of the `val_accuracy` at each epoch.
- You don't need to cross-validate your model

üòå Don't worry, you will not be judged on your computer power: You should be able to reach accuracy significantly better than baseline in less than 3 minutes even without GPUs.

üëâ But first, you **must** start back from the solution below (70Mo)

In [12]:
url = 'https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/movies_X_input.csv'
X_input = np.genfromtxt(url, delimiter=',', dtype='int32')

In [38]:
## YOUR CODE
X_input
y.astype(np.int64)

0        0
1        1
2        0
3        1
4        1
        ..
29946    1
29947    1
29948    0
29949    0
29950    0
Name: polarity, Length: 29951, dtype: int64

In [39]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_input, y, test_size=0.3)

In [47]:
y_train.astype(np.int32)

19883    0
12714    0
26506    1
20441    0
23527    0
        ..
22322    0
4876     1
7005     1
21814    0
2720     1
Name: polarity, Length: 20965, dtype: int32

In [40]:
from tensorflow.keras import Sequential
from tensorflow.keras import layers

def init_model():
    model = Sequential()
    model.add(layers.Masking())
    model.add(layers.LSTM(20, activation='tanh'))
    model.add(layers.Dense(15, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    
    return model

model = init_model()

In [48]:
model.fit(X_train, y_train, 
          batch_size = 32,
          epochs=100,
          validation_split=0.3,
         )

Epoch 1/100


TypeError: in user code:

    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:795 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
        return fn(*args, **kwargs)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:788 run_step  **
        outputs = model.train_step(data)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:754 train_step
        y_pred = self(x, training=True)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py:1012 __call__
        outputs = call_fn(inputs, *args, **kwargs)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/engine/sequential.py:389 call
        outputs = layer(inputs, **kwargs)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py:1012 __call__
        outputs = call_fn(inputs, *args, **kwargs)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/keras/layers/core.py:128 call
        math_ops.not_equal(inputs, self.mask_value), axis=-1, keepdims=True)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/ops/math_ops.py:1715 not_equal
        return gen_math_ops.not_equal(x, y, name=name)
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/ops/gen_math_ops.py:6409 not_equal
        _, _, _op, _outputs = _op_def_library._apply_op_helper(
    /Users/guillaume/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py:527 _apply_op_helper
        raise TypeError(

    TypeError: Expected int32 passed to parameter 'y' of op 'NotEqual', got 0.0 of type 'float' instead. Error: Expected int32, got 0.0 of type 'float' instead.


In [None]:
from nbresult import ChallengeResult
result = ChallengeResult('C1517',
                         history=history.history)
result.write()