# NLP - Lab 02

### Let's code a sentiment classifier on the IMDB sentiment dataset

---

Authors:

eliott.bouhana\
victor.simonin\
alexandre.lemonnier\
sarah.gutierez

---

## The dataset

In [1]:
import numpy as np
from datasets import load_dataset
dataset = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /home/bictole/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Dataset imdb downloaded and prepared to /home/bictole/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



### What are the dataset splits?

In [7]:
from datasets import get_dataset_split_names

get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

### What are the sizes of the dataset splits?

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

### What is the proportion of each label?

In [4]:
print("Number of negative reviews: ", dataset["train"]["label"].count(0))
print("Number of positive reviews: ", dataset["train"]["label"].count(1))

Number of negative reviews:  12500
Number of positive reviews:  12500
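
The counts above can also be expressed as proportions. A minimal sketch, using a small hypothetical `labels` list as a stand-in for `dataset["train"]["label"]`:

```python
from collections import Counter

# Hypothetical stand-in for dataset["train"]["label"] (0 = negative, 1 = positive)
labels = [0, 1, 1, 0, 1, 0]

counts = Counter(labels)
total = sum(counts.values())
# Map each label to its share of the split
proportions = {label: count / total for label, count in sorted(counts.items())}
print(proportions)
```

With the real training split, both labels should come out at 0.5, since each has 12500 of the 25000 rows.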


---

## Naive Bayes Classifier 

### Preprocessing

Lowercase the text\
Remove punctuation 

In [6]:
from string import punctuation
from typing import List

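# Translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans("", "", punctuation)

def preprocessor(x_list: List[str]) -> List[str]:
    # Lowercase each text, then strip punctuation
    return [x.lower().translate(PUNCTUATION_TABLE) for x in x_list]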
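
To see what this step does, here is a small check on a made-up review. IMDB reviews contain HTML line breaks (`<br />`), and stripping punctuation alone leaves the letters `br` behind. The sketch therefore also shows a hypothetical variant, `preprocessor_no_html`, that replaces those tags with a space first. That variant is an assumption for illustration and is not what the pipeline in this notebook uses.

```python
import re
from string import punctuation
from typing import List

_PUNCT_TABLE = str.maketrans("", "", punctuation)
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def preprocessor(x_list: List[str]) -> List[str]:
    # Same behaviour as the notebook cell: lowercase, then drop punctuation
    return [x.lower().translate(_PUNCT_TABLE) for x in x_list]

def preprocessor_no_html(x_list: List[str]) -> List[str]:
    # Hypothetical variant: turn <br /> tags into spaces before the same cleanup
    return preprocessor([_BR_TAG.sub(" ", x) for x in x_list])

sample = ["Great movie!<br /><br />Loved it."]
print(preprocessor(sample))          # the tags leave "br" fragments behind
print(preprocessor_no_html(sample))  # the tags are replaced by spaces
```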

### Model

A scikit-learn `Pipeline` with a `CountVectorizer` and a `MultinomialNB` classifier.

In [9]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("preprocess", FunctionTransformer(preprocessor)),
    ("vectorizer", CountVectorizer(lowercase=True)),
    ("nb", MultinomialNB()),
])

In [10]:
pipeline.fit(np.array(dataset["train"]["text"]), np.array(dataset["train"]["label"]))

Pipeline(steps=[('preprocess',
                 FunctionTransformer(func=<function preprocessor at 0x7fa56ef30430>)),
                ('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])
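
For reference, the decision rule fitted by a multinomial Naive Bayes model can be written as below, assuming the default Laplace smoothing ($\alpha = 1$). The symbols are introduced here for illustration: $N_{cw}$ is the count of word $w$ in training reviews of class $c$, $N_c = \sum_w N_{cw}$, $|V|$ is the vocabulary size, and $x_w$ is the count of $w$ in the review being classified.

$$
\hat{\theta}_{cw} = \frac{N_{cw} + \alpha}{N_c + \alpha |V|},
\qquad
\hat{y} = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} x_w \log \hat{\theta}_{cw} \Big]
$$

The values $\log \hat{\theta}_{cw}$ are what scikit-learn exposes as `feature_log_prob_`.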

### Accuracy report on both the training and test sets

In [11]:
from sklearn.metrics import accuracy_score

# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the values are unchanged
print("Train accuracy: ", accuracy_score(np.array(dataset["train"]["label"]), pipeline.predict(np.array(dataset["train"]["text"]))))
print("Test accuracy: ", accuracy_score(np.array(dataset["test"]["label"]), pipeline.predict(np.array(dataset["test"]["text"]))))

Train accuracy:  0.91284
Test accuracy:  0.8172


### Why is accuracy a sufficient evaluation measure here?

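The training split is perfectly balanced (12500 negative and 12500 positive reviews), so a trivial classifier that always predicts the same class would only reach 50% accuracy. Accuracy is therefore not inflated by a majority class, and since both kinds of error (false positives and false negatives) matter equally for this task, it summarizes the model's performance well.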
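
One way to check whether accuracy hides anything is to compare it with per-class precision and recall: on balanced data they usually tell a similar story, whereas on skewed data accuracy can look good while one class is mostly missed. The sketch below uses small hypothetical label lists (`y_true`, `y_pred`) rather than the IMDB predictions, and `per_class_report` is a helper written for this illustration.

```python
from typing import Dict, List

def per_class_report(y_true: List[int], y_pred: List[int]) -> Dict[int, Dict[str, float]]:
    """Precision and recall for each label, computed from raw counts."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        report[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return report

# Hypothetical balanced toy data: 4 negatives, 4 positives
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print(per_class_report(y_true, y_pred))
```

With scikit-learn available, `sklearn.metrics.classification_report` reports the same quantities.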

### [Bonus] What are the top 10 most important words (features) for each class?

The words with the highest likelihood in each class:

In [12]:
vectorizer = pipeline.named_steps["vectorizer"]
nb = pipeline.named_steps["nb"]

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (scikit-learn >= 1.0)
features = vectorizer.get_feature_names_out()

# feature_log_prob_[c][i] is the log-likelihood of word i given class c
features_likelihood_zero = dict(zip(features, nb.feature_log_prob_[0]))
features_likelihood_one = dict(zip(features, nb.feature_log_prob_[1]))

print(sorted(features_likelihood_zero, key=features_likelihood_zero.get, reverse=True)[:10])
print(sorted(features_likelihood_one, key=features_likelihood_one.get, reverse=True)[:10])



['the', 'and', 'of', 'to', 'is', 'in', 'this', 'it', 'that', 'br']
['the', 'and', 'of', 'to', 'is', 'in', 'it', 'this', 'that', 'br']


Removing the stop words and checking again:

In [15]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /home/bictole/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/bictole/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [16]:
stopWords = set(stopwords.words('english'))

# Keep only the features that are not English stop words
features_likelihood_zero = {f: v for f, v in features_likelihood_zero.items() if f not in stopWords}
features_likelihood_one = {f: v for f, v in features_likelihood_one.items() if f not in stopWords}

print(sorted(features_likelihood_zero, key=features_likelihood_zero.get, reverse=True)[:10])
print(sorted(features_likelihood_one, key=features_likelihood_one.get, reverse=True)[:10])

['br', 'movie', 'film', 'one', 'like', 'even', 'good', 'bad', 'would', 'really']
['br', 'film', 'movie', 'one', 'like', 'good', 'story', 'great', 'time', 'see']
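
Both lists are still dominated by words that are frequent in every review (`br`, `movie`, `film`), because `feature_log_prob_` measures how likely a word is within a class, not how well it separates the classes. One common alternative is to rank words by the difference of the two log-likelihoods, $\log P(w \mid 1) - \log P(w \mid 0)$. The sketch below illustrates the idea on a tiny hypothetical vocabulary with made-up counts; `vocab` and `counts` are assumptions, and with the real pipeline the two rows would come from `feature_log_prob_`.

```python
import numpy as np

# Hypothetical vocabulary and per-class word counts (row 0 = negative, row 1 = positive)
vocab = np.array(["movie", "film", "bad", "great", "boring", "wonderful"])
counts = np.array([
    [50, 40, 30, 5, 20, 1],   # negative reviews
    [50, 45, 4, 35, 1, 25],   # positive reviews
], dtype=float)

# Laplace-smoothed log-likelihoods, analogous to MultinomialNB.feature_log_prob_
alpha = 1.0
smoothed = counts + alpha
log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Positive minus negative: large values lean positive, small values lean negative
log_ratio = log_prob[1] - log_prob[0]
order = np.argsort(log_ratio)

top_negative = vocab[order[:2]].tolist()
top_positive = vocab[order[::-1][:2]].tolist()
print("Most negative words:", top_negative)
print("Most positive words:", top_positive)
```

Words shared by both classes, such as `movie` and `film` in this toy example, end up near the middle of the ranking instead of at the top.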


Take at least 2 wrongly classified examples from the test set and try to explain why the model failed:
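
A possible starting point, sketched with hypothetical toy data: collect the indices where the prediction and the label disagree, then print the corresponding reviews. With the notebook's objects, `texts` would be `dataset["test"]["text"]`, `y_true` would be `dataset["test"]["label"]`, and `y_pred` would be the output of `pipeline.predict`. The helper `misclassified_indices` is written for this illustration.

```python
from typing import List, Sequence

def misclassified_indices(y_true: Sequence[int], y_pred: Sequence[int]) -> List[int]:
    """Indices of the examples whose predicted label differs from the true one."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

# Hypothetical stand-ins for the test texts, labels and predictions
texts = [
    "a great film",
    "not good at all",
    "boring and far too long",
    "i expected it to be bad but it was not",
]
y_true = [1, 0, 0, 1]
y_pred = [1, 1, 0, 0]

wrong = misclassified_indices(y_true, y_pred)
for i in wrong[:2]:
    print(f"true={y_true[i]} predicted={y_pred[i]} text={texts[i]!r}")
```

When reading the misclassified reviews, negation ("not good") and sarcasm are worth looking for, since a bag-of-words model discards word order.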