# NLP - Lab 02

### Let's code a sentiment classifier on the IMDB sentiment datase

---

Authors :

eliott.bouhana\
victor.simonin\
alexandre.lemonnier\
sarah.gutierez

---

## The dataset

In [1]:
import numpy as np
from datasets import load_dataset
dataset = load_dataset("imdb")

Reusing dataset imdb (/home/bictole/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

### What are the datasets splits ?

In [2]:
from datasets import get_dataset_split_names

get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

### What are the dataset splits size ?

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

### What is the proportion of each label

In [4]:
print("Number of negatives sentences: ", dataset["train"]["label"].count(0))
print("Number of positives sentences: ", dataset["train"]["label"].count(1))

Number of negatives sentences:  12500
Number of positives sentences:  12500


---

## Naive Bayes Classifier 

### Preprocessing

Lowercase the text\
Remove punctuation 

In [5]:
from string import punctuation
from typing import List

def preprocessor(x_list: List[str]) -> List[str]:
    return [x.lower().translate(str.maketrans("", "", punctuation)) for x in x_list]

### Model

A scikit-learn `Pipeline` with a `CountVectorizer` and `MultinomialNB` classifier

In [6]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("preprocess", FunctionTransformer(preprocessor)),
    ("vectorizer", CountVectorizer(lowercase=True)),
    ("nb", MultinomialNB()),
])

In [7]:
pipeline.fit(np.array(dataset["train"]["text"]), np.array(dataset["train"]["label"]))

Pipeline(steps=[('preprocess',
                 FunctionTransformer(func=<function preprocessor at 0x7f430d7e5750>)),
                ('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

###  Accuracy report on both training and test set

In [8]:
from sklearn.metrics import accuracy_score

print("Train accuracy: ", accuracy_score(pipeline.predict(np.array(dataset["train"]["text"])), np.array(dataset["train"]["label"])))
print("Test accuracy: ", accuracy_score(pipeline.predict(np.array(dataset["test"]["text"])), np.array(dataset["test"]["label"])))

Train accuracy:  0.91284
Test accuracy:  0.8172


### Why is accuracy a sufficient measure of evaluation here ?

--

### [Bonus] What are the top 10 most important words (features) for each class?

The words with the highest likelihood in each class :

In [9]:
features_likelihood_zero = {}
features_likelihood_one = {}
features = pipeline.get_params()["vectorizer"].get_feature_names()
likelihood_zero = pipeline.get_params()['nb'].feature_log_prob_[0]
likelihood_one = pipeline.get_params()['nb'].feature_log_prob_[1]
for i, feature in enumerate(features):
    features_likelihood_zero[feature] = likelihood_zero[i]
    features_likelihood_one[feature] = likelihood_one[i]

print(sorted(features_likelihood_zero, key=features_likelihood_zero.get, reverse=True)[:10])
print(sorted(features_likelihood_one, key=features_likelihood_one.get, reverse=True)[:10])

['the', 'and', 'of', 'to', 'is', 'in', 'this', 'it', 'that', 'br']
['the', 'and', 'of', 'to', 'is', 'in', 'it', 'this', 'that', 'br']




Removing the stopwords and checking again :

In [12]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /home/bictole/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/bictole/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
stopWords = set(stopwords.words('english'))
stopWords_zero = []
stopWords_one = []
for feature in features_likelihood_zero.keys():
    if feature in stopWords:
        stopWords_zero.append(feature)
for feature in features_likelihood_one.keys():
    if feature in stopWords:
        stopWords_one.append(feature)
for stopWord in stopWords_zero:
    del features_likelihood_zero[stopWord]
for stopWord in stopWords_one:
    del features_likelihood_one[stopWord]
    
print(sorted(features_likelihood_zero, key=features_likelihood_zero.get, reverse=True)[:10])
print(sorted(features_likelihood_one, key=features_likelihood_one.get, reverse=True)[:10])

['br', 'movie', 'film', 'one', 'like', 'even', 'good', 'bad', 'would', 'really']
['br', 'film', 'movie', 'one', 'like', 'good', 'story', 'great', 'time', 'see']


Take at least 2 wrongly classified example from the test set and try explaining why the model failed :

--

---

## Stemming and Lemmatization

Let's add lemmatization to our pretreatment.

First, let's demonstrate the lemmatization effect on the first element of our train dataset :

In [16]:
import spacy

spacy_nlp = spacy.load('en_core_web_sm')
test_list = dataset['train']['text'][0].split()[:25]
test_sentence = ' '.join(test_list)

doc = spacy_nlp(test_sentence)
tokens = [token.text for token in doc]
print('Original Sentence: %s' % (test_sentence))
print()
for token in doc:
    if token.text != token.lemma_:
        print('Original : %s, New: %s' % (token.text, token.lemma_))

Original Sentence: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I

Original : rented, New: rent
Original : AM, New: be
Original : surrounded, New: surround
Original : was, New: be
Original : released, New: release


In [17]:
def lemma_preprocessor(x_list: List[str]) -> List[str]:
    no_punc_lower = [x.lower().translate(str.maketrans("", "", punctuation)) for x in x_list]
    spacy_nlp = spacy.load('en_core_web_sm')
    return [spacy_nlp(element).text for element in no_punc_lower]

Training and evaluation of the model again with these pretreatment :

In [18]:
pipeline = Pipeline([
    ("preprocess", FunctionTransformer(lemma_preprocessor)),
    ("vectorizer", CountVectorizer(lowercase=True)),
    ("nb", MultinomialNB()),
])

pipeline.fit(np.array(dataset["train"]["text"]), np.array(dataset["train"]["label"]))

Pipeline(steps=[('preprocess',
                 FunctionTransformer(func=<function lemma_preprocessor at 0x7f4290443e20>)),
                ('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [None]:
print("Train accuracy: ", accuracy_score(pipeline.predict(np.array(dataset["train"]["text"])), np.array(dataset["train"]["label"])))
print("Test accuracy: ", accuracy_score(pipeline.predict(np.array(dataset["test"]["text"])), np.array(dataset["test"]["label"])))

Train accuracy:  0.91284


Need to change the test dataset

Are the results better or worse? Try explaining why the accuracy changed.