# Movie review classification
Does the (weighted) lexical field of a review make it possible to determine whether the review is positive or negative?

In [1]:
import pandas as pd
import numpy as np
import os
import string
import re
import nltk

In [2]:
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [3]:
from imdb_utils import preprocess, tokenize, extract_vocabulary, create_bow, create_word2idx

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kingd\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Dataset
Large Movie Review Dataset

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
http://ai.stanford.edu/~amaas/data/sentiment/

Prepared file: `Data/imdb/imdb_dataset.csv`

In [38]:
data_path = '../'

imdb_data = pd.read_csv(os.path.join(data_path, 'IMDB dataset.csv'))

print(imdb_data.shape)

(50000, 2)


Class balance in `imdb_data`

Conversion of the `sentiment` column to categorical values (so that it can be turned into numbers if needed)
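Both steps can be sketched with pandas as follows. This is only an illustration on a tiny hand-made frame: the `toy` dataframe and its rows are made up for the example, and on the real data the same calls would be applied to `imdb_data`.

```python
import pandas as pd

# Toy stand-in for imdb_data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "review": ["great film", "terrible plot", "loved it", "boring"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Class balance: number of reviews per sentiment label
class_counts = toy["sentiment"].value_counts()

# Categorical column; .cat.codes gives one integer code per category
# (categories are sorted alphabetically: negative -> 0, positive -> 1)
toy["sentiment"] = toy["sentiment"].astype("category")
sentiment_codes = toy["sentiment"].cat.codes
```

Keeping the column categorical rather than replacing it with integers preserves the readable labels while still giving access to numeric codes when a model needs them.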

## Development dataset
A (much) smaller dataset for the development phases; comment this cell out for the final run.

In [39]:
imdb_data = imdb_data.sample(frac=0.01, replace=False, random_state=9)
imdb_data.reset_index(drop=True, inplace=True)
imdb_data.head()

| | review | sentiment |
|---|---|---|
| 0 | Some people seem to think this was the worst m... | negative |
| 1 | This is one of my favourite films, dating back... | positive |
| 2 | A rousing adventure form director George Steve... | positive |
| 3 | This film is an eery, but interesting film. I ... | positive |
| 4 | Lonesome Jim is kind of like a romantic dark c... | negative |


## Splitting the dataset into `train` and `test`

In [7]:
from sklearn.model_selection import train_test_split

In [106]:
X_train, X_test, y_train, y_test = train_test_split(
    imdb_data.review, imdb_data.sentiment, test_size=.3, random_state=9)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((350,), (150,), (350,), (150,))

In [67]:
X_train[0]

'Some people seem to think this was the worst movie they have ever seen, and I understand where they\'re coming from, but I really have seen worse.<br /><br />That being said, the movies that I can recall (ie the ones I haven\'t blocked out) that were worse than this, were so bad that they physically pained every sense that was involved with watching the movie. The movies that are worse than War Games 2 are the ones that make you want to gouge out your eyes, or stab sharp objects in your ears to keep yourself from having another piece of your soul ripped away from you by the awfulness.<br /><br />War Games: The Dead Code isn\'t that bad, but it comes pretty close. Yes I was a fan of the original, but no I wasn\'t expecting miracles from this one. Let\'s face it the original wasn\'t really that great of a movie in the first place, it was basically just a campy 80s teen romance flick with some geek-appeal to it.<br /><br />That\'s all I was hoping for, something bad, but that might have 

## Exploring the text data

## Preprocess
Preparing the text for tokenization  
These operations are applied to the whole dataset

In [107]:
import nltk.data
#nltk.download()  # uncomment on first run to fetch the Punkt tokenizer data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

def clean_review(text):
    # Replace HTML line breaks with spaces, then rejoin the detected sentences with single spaces
    return " ".join(sent_detector.tokenize(text.strip().replace("<br />", " ")))

X_train = X_train.map(clean_review)
X_test = X_test.map(clean_review)

In [90]:
pd.concat([pd.DataFrame(X_train),pd.DataFrame(y_train)],axis=1)

| | review | sentiment |
|---|---|---|
| 25 | This is a good movie, although people unfamili... | positive |
| 175 | This film could of been a hell of a lot better... | negative |
| 62 | That's a weird, weird movie and doesn't deserv... | negative |
| 6 | I have been getting into the Hitchcock series ... | positive |
| 54 | This was truly a heart warming movie. It is fi... | positive |
| ... | ... | ... |
| 56 | I saw this film without knowing much about it ... | negative |
| 438 | ... but watch Mary McDonnell's performance clo... | positive |
| 126 | The creator of Donnie Darko brings you a twili... | positive |
| 348 | I remember this film fondly from seeing it in ... | positive |


In [91]:
X_train[0]

'Some people seem to think this was the worst movie they have ever seen, and I understand where they\'re coming from, but I really have seen worse. That being said, the movies that I can recall (ie the ones I haven\'t blocked out) that were worse than this, were so bad that they physically pained every sense that was involved with watching the movie. The movies that are worse than War Games 2 are the ones that make you want to gouge out your eyes, or stab sharp objects in your ears to keep yourself from having another piece of your soul ripped away from you by the awfulness. War Games: The Dead Code isn\'t that bad, but it comes pretty close. Yes I was a fan of the original, but no I wasn\'t expecting miracles from this one. Let\'s face it the original wasn\'t really that great of a movie in the first place, it was basically just a campy 80s teen romance flick with some geek-appeal to it. That\'s all I was hoping for, something bad, but that might have tugged at my geek-strings. Was th

Each review is split into words (common punctuation is replaced with spaces first); the overly frequent English words (`stopwords`) are removed afterwards.

In [108]:
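from nltk.tokenize import word_tokenize
import re

# Replace common punctuation with spaces before word tokenization
# (raw string, so the pattern contains no invalid Python escape sequences)
PUNCT_RE = re.compile(r"[`,!?:.()]")
X_train = X_train.map(lambda x: word_tokenize(PUNCT_RE.sub(" ", x)))
X_test = X_test.map(lambda x: word_tokenize(PUNCT_RE.sub(" ", x)))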

In [109]:
X_train[0]

['Some',
 'people',
 'seem',
 'to',
 'think',
 'this',
 'was',
 'the',
 'worst',
 'movie',
 'they',
 'have',
 'ever',
 'seen',
 'and',
 'I',
 'understand',
 'where',
 'they',
 "'re",
 'coming',
 'from',
 'but',
 'I',
 'really',
 'have',
 'seen',
 'worse',
 'That',
 'being',
 'said',
 'the',
 'movies',
 'that',
 'I',
 'can',
 'recall',
 'ie',
 'the',
 'ones',
 'I',
 'have',
 "n't",
 'blocked',
 'out',
 'that',
 'were',
 'worse',
 'than',
 'this',
 'were',
 'so',
 'bad',
 'that',
 'they',
 'physically',
 'pained',
 'every',
 'sense',
 'that',
 'was',
 'involved',
 'with',
 'watching',
 'the',
 'movie',
 'The',
 'movies',
 'that',
 'are',
 'worse',
 'than',
 'War',
 'Games',
 '2',
 'are',
 'the',
 'ones',
 'that',
 'make',
 'you',
 'want',
 'to',
 'gouge',
 'out',
 'your',
 'eyes',
 'or',
 'stab',
 'sharp',
 'objects',
 'in',
 'your',
 'ears',
 'to',
 'keep',
 'yourself',
 'from',
 'having',
 'another',
 'piece',
 'of',
 'your',
 'soul',
 'ripped',
 'away',
 'from',
 'you',
 'by',
 'the',

In [114]:
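from nltk.corpus import stopwords
# Build the English stopword set once: stopwords.words() without an argument returns
# the stopwords of every language, and calling it for each token is very slow
stop_words = set(stopwords.words('english'))
# NLTK stopwords are lowercase, so tokens are lowercased for the comparison
X_train = X_train.map(lambda x: [w for w in x if w.lower() not in stop_words])
X_test = X_test.map(lambda x: [w for w in x if w.lower() not in stop_words])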

Stemming  
SnowballStemmer: https://www.nltk.org/howto/stem.html

In [11]:
from nltk.stem.snowball import SnowballStemmer
# The reviews are written in English, so the English stemmer is the one to use
stemport = SnowballStemmer("english")
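As a minimal sketch of how a stemmer is applied to a tokenized review: the token list below is made up for the illustration, and the variable name `stemmer` is used only in this example.

```python
from nltk.stem.snowball import SnowballStemmer

# English stemmer, since the IMDB reviews are written in English
stemmer = SnowballStemmer("english")

# Hypothetical tokens, as they might come out of the tokenization step
tokens = ["movies", "amazing", "horrible"]

# Stem each token; the stemmer also lowercases its input
stems = [stemmer.stem(token) for token in tokens]
```

Stemming maps inflected forms of a word to a common root, which shrinks the vocabulary and lets the classifier pool the counts of related words.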

Conversion to a bag of words

## Building the list of tokenized reviews  
The position in the list matches the index of the document in the `train` dataframe  
Fairly slow

Example: reviews[0]  
`['lack', 'content', 'movi', 'amaz', 'first', 'though', 'peopl', 'go', 'compar', 'rock', 'realli', 'surpris', 'say', 'worst', 'rock', 'stori', 'horribl', 'cast', 'ajay', 'devgan', 'jam', 'salman', 'khan', 'asin', 'got', 'ta', 'kid', 'music', 'okay', 'khanabadosh', 'track', 'movi', 'rest', 'bad', 'vipul', 'shah', 'still', 'learn', 'singh', 'king', 'critic', 'bash', 'comedi', 'asin', 'come', 'sorri', 'asin', 'fan', 'suck', 'big', 'time', 'movi', 'serious', 'bad', 'act', 'look', 'good', 'overdos', 'make', 'final', 'verdict', 'go', 'watch', 'aladin', 'famili', 'instead', 'wast', 'time']`

## Vocabulary of our training corpus

64574
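The vocabulary is simply the set of distinct tokens found in the training reviews. The notebook relies on `extract_vocabulary` from `imdb_utils`, whose code is not shown here; the function `build_vocabulary` below is a hypothetical stand-in that sketches the idea on made-up documents.

```python
def build_vocabulary(tokenized_docs):
    """Return the set of distinct tokens found across all documents."""
    vocabulary = set()
    for tokens in tokenized_docs:
        vocabulary.update(tokens)
    return vocabulary


# Hypothetical tokenized (and stemmed) reviews
docs = [["good", "movi"], ["bad", "movi", "bad"]]
vocabulary = build_vocabulary(docs)
vocabulary_size = len(vocabulary)
```

The size of this set is the number of columns of the matrix `X` passed to the classifier.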

## Multinomial Naïve Bayes
Using scikit-learn: [sklearn.naive_bayes.MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)  
Input parameters of the scikit-learn function:
- X : {array-like, sparse matrix} of shape (n_samples, n_features) : Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : array-like of shape (n_samples,): Target values.

### Adapting our training data to scikit-learn: `X`, `y`

Each row of `X` is a point in the space defined by the vocabulary. Its coordinates are zero except for the words in the document's bag of words, where they equal the number of occurrences of the word in the document. For example, take the following two documents:
- `d1` "le petit chat"
- `d2` "le gros chien-chien"

In [28]:
# After tokenization
d1 = ['le', 'petit', 'chat']
d2 = ['le', 'gros', 'chien', 'chien']
documents = [d1, d2]

# Corpus vocabulary
V = sorted(extract_vocabulary(documents)) # Turn the set into a sorted list so that words can be accessed by index
V

['chat', 'chien', 'gros', 'le', 'petit']

In [29]:
# Convert the list of documents into a list of bags of words
documents = [create_bow(document) for document in documents]
documents

[Counter({'le': 1, 'petit': 1, 'chat': 1}),
 Counter({'le': 1, 'gros': 1, 'chien': 2})]

In [30]:
# Naive construction of the matrix X
X = np.array([[1, 0, 0, 1, 1], [0, 2, 1, 1, 0]])
X_df = pd.DataFrame(X, columns=['chat', 'chien', 'gros', 'le', 'petit'], index=['d1', 'd2'])
X_df

| | chat | chien | gros | le | petit |
|---|---|---|---|---|---|
| d1 | 1 | 0 | 0 | 1 | 1 |
| d2 | 0 | 2 | 1 | 1 | 0 |


In [31]:
X

array([[1, 0, 0, 1, 1],
       [0, 2, 1, 1, 0]])

### Sparse matrices
Many coordinates of our matrix are zero, because most words of the vocabulary are absent from any given document. To make it cheaper to handle, we use the sparse matrix implementation of the SciPy library.  
Conversion of the documents' bags of words into a sparse matrix

In [32]:
from scipy.sparse import dok_matrix

In [33]:
X = dok_matrix((len(documents), len(V)), dtype=int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent

# Fill the sparse matrix from the counts in the bags of words
for idx, document in enumerate(documents):
    for word, count in document.items():
        X[idx, V.index(word)] = count

In [34]:
X.todense()

matrix([[1, 0, 0, 1, 1],
        [0, 2, 1, 1, 0]])

### Application to our dataset

In [2]:
# Convert the list of reviews into a list of bags of words
# Convert the vocabulary into a sorted list of words, used as coordinates
# Dictionary mapping each word to its index in vocabulary {word: index}
# Fill the sparse matrix from the counts in the bags of words
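The four steps listed in the cell above can be sketched as follows. This is an illustration under assumptions, not the notebook's original code: the `reviews` list is made up, and `collections.Counter` and a plain dict comprehension stand in for the `create_bow` and `create_word2idx` helpers of `imdb_utils`, whose code is not shown.

```python
from collections import Counter

import numpy as np
from scipy.sparse import dok_matrix

# Hypothetical tokenized reviews (already preprocessed and stemmed)
reviews = [["good", "movi", "good"], ["bad", "movi"]]

# List of token lists -> list of bags of words
bows = [Counter(tokens) for tokens in reviews]

# Vocabulary as a sorted list, so that each word has a fixed coordinate
vocabulary = sorted(set().union(*reviews))

# {word: index} lookup, avoiding a linear list.index() call per word
word2idx = {word: idx for idx, word in enumerate(vocabulary)}

# Fill the sparse matrix from the counts in the bags of words
X = dok_matrix((len(bows), len(vocabulary)), dtype=np.int64)
for row, bow in enumerate(bows):
    for word, count in bow.items():
        X[row, word2idx[word]] = count

# DOK is convenient for incremental filling; CSR is better suited to computation
X = X.tocsr()
```

With a vocabulary of tens of thousands of words, the dictionary lookup matters: `V.index(word)` scans the list on every call, whereas `word2idx[word]` is a constant-time access.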

### Adapting our training data to scikit-learn: `y`

## Using scikit-learn

Creating and training a MultinomialNB classifier.

Sanity check by making predictions on the training set
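A minimal sketch of these two steps, assuming a small dense count matrix and made-up labels (the labels attached to the two toy rows are hypothetical and only serve the illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (rows = documents, columns = vocabulary words)
X = np.array([[1, 0, 0, 1, 1],
              [0, 2, 1, 1, 0]])
# Hypothetical labels for the two toy documents
y = np.array(["positive", "negative"])

clf = MultinomialNB()  # default additive (Laplace) smoothing, alpha=1.0
clf.fit(X, y)

# Sanity check: predict on the data the model was trained on
train_predictions = clf.predict(X)
train_accuracy = clf.score(X, y)
```

A high score on the training set only shows that the pipeline is wired correctly; it says nothing about generalization, which is what the test set is for. `fit` and `predict` accept SciPy sparse matrices as well as dense arrays.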

## Evaluation on the test set

### `X_test`
Building a sparse matrix from the counts in the bags of words of `test`

The `reviews` of `test` go through the same processing steps  
- preprocess
- tokenize
- bow

When filling the sparse matrix, we ignore every word that was not seen in the training set, since the training vocabulary has no column for it
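Skipping unseen words can be sketched like this. The `word2idx` mapping and the test reviews are made up for the example, and the name `X_test_counts` is used only here:

```python
from collections import Counter

import numpy as np
from scipy.sparse import dok_matrix

# Hypothetical {word: index} mapping learned from the training set
word2idx = {"bad": 0, "good": 1, "movi": 2}

# Hypothetical tokenized test reviews; "plot" was never seen in training
test_reviews = [["good", "plot"], ["bad", "bad", "movi"]]

# Same number of columns as the training matrix
X_test_counts = dok_matrix((len(test_reviews), len(word2idx)), dtype=np.int64)
for row, tokens in enumerate(test_reviews):
    for word, count in Counter(tokens).items():
        idx = word2idx.get(word)
        if idx is None:
            continue  # out-of-vocabulary word: no column for it, skip
        X_test_counts[row, idx] = count
```

The test matrix must have exactly the same columns, in the same order, as the training matrix, which is why the mapping built on the training set is reused rather than rebuilt.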

### `y_test`

### Predictions

## Metrics
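As a possible starting point, accuracy and a confusion matrix can be computed with `sklearn.metrics`. The labels and predictions below are made up for the illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]

# Fraction of reviews whose predicted label matches the true label
accuracy = accuracy_score(y_true, y_pred)

# Rows = true classes, columns = predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
```

Since the two classes are roughly balanced in this dataset, accuracy is a reasonable headline figure; the confusion matrix shows whether the errors lean towards one class.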