# NLP - Lab 02

### Let's code a sentiment classifier on the IMDB sentiment dataset

---

Authors :

eliott.bouhana\
victor.simonin\
alexandre.lemonnier\
sarah.gutierez

---

## The dataset

In [19]:
import numpy as np
from datasets import load_dataset
dataset = load_dataset("imdb")

Reusing dataset imdb (/home/leme/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
100%|██████████| 3/3 [00:00<00:00, 290.60it/s]


### How many splits does the dataset have?


In [20]:
from datasets import get_dataset_split_names

get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

The dataset is composed of 3 splits: `train`, `test` and `unsupervised`.

### How big are these splits ?

In [21]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

The `train` and `test` splits have 25000 rows, whereas `unsupervised` has 50000.

### What is the proportion of each class on the supervised splits ?

In [22]:
print("Number of negative reviews in the train supervised split: ", dataset["train"]["label"].count(0))
print("Number of positive reviews in the train supervised split: ", dataset["train"]["label"].count(1))
print("--------------------------------")
print("Number of negative reviews in the test supervised split: ", dataset["test"]["label"].count(0))
print("Number of positive reviews in the test supervised split: ", dataset["test"]["label"].count(1))

Number of negative reviews in the train supervised split:  12500
Number of positive reviews in the train supervised split:  12500
--------------------------------
Number of negative reviews in the test supervised split:  12500
Number of positive reviews in the test supervised split:  12500
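
The raw counts above can be turned into proportions with a small helper. This is a minimal sketch on a hypothetical stand-in list (`toy_labels`), not on the IMDB data itself; `class_proportions` is a name introduced here for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def class_proportions(labels: Sequence[int]) -> Dict[int, float]:
    """Return the fraction of samples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical stand-in for dataset["train"]["label"] (0 = negative, 1 = positive)
toy_labels = [0, 1, 1, 0, 1, 0]
print(class_proportions(toy_labels))
```

With the counts printed above (12500 out of 25000), each class accounts for 50% of both supervised splits.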


---

## Naive Bayes Classifier 

### Preprocessing

In [23]:
from string import punctuation
from typing import List

def preprocessor(x_list: List[str]) -> List[str]:
    """
    Lowercase each string of a list and remove its punctuation.
    
    Args:
        x_list: List of strings
    
    Returns:
        List of preprocessed strings.
    """
    return [x.lower().translate(str.maketrans("", "", punctuation)) for x in x_list]
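
To see what this preprocessing does to a single string, here is a small standalone sketch of the same idea. `strip_punct_lower` and `PUNCT_TABLE` are names introduced for this illustration; the only difference from the function above is that the translation table is built once instead of once per string.

```python
from string import punctuation
from typing import List

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", punctuation)


def strip_punct_lower(texts: List[str]) -> List[str]:
    """Lowercase each string and delete its punctuation characters."""
    return [text.lower().translate(PUNCT_TABLE) for text in texts]


sample = ["Hello, World! It's GREAT."]
print(strip_punct_lower(sample))
```

Note that punctuation is deleted rather than replaced by a space, so `It's` becomes `its`.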

### Model

The model is a scikit-learn `Pipeline` chaining a `FunctionTransformer` that calls our preprocessing function, a `CountVectorizer` and a `MultinomialNB` classifier.

In [24]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Create the pipeline
pipeline = Pipeline([
    ("preprocess", FunctionTransformer(preprocessor)),
    ("vectorizer", CountVectorizer(lowercase=True)),
    ("nb", MultinomialNB()),
])

In [25]:
# Fit the pipeline with the train dataset
pipeline.fit(np.array(dataset["train"]["text"]), np.array(dataset["train"]["label"]))

Pipeline(steps=[('preprocess',
                 FunctionTransformer(func=<function preprocessor at 0x7f731571e950>)),
                ('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])
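
To make the last two pipeline steps concrete, here is a minimal from-scratch sketch of what word counting followed by multinomial Naive Bayes computes. It is an illustration, not scikit-learn's implementation: `train_multinomial_nb`, `predict` and the toy corpus are introduced here, tokens come from a plain `split()`, and `alpha=1.0` is Laplace smoothing (the default value of `MultinomialNB`).

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

LogPrior = Dict[int, float]
LogLikelihood = Dict[int, Dict[str, float]]


def train_multinomial_nb(
    texts: List[str], labels: List[int], alpha: float = 1.0
) -> Tuple[LogPrior, LogLikelihood]:
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = {word for text in texts for word in text.split()}
    log_prior: LogPrior = {}
    log_likelihood: LogLikelihood = {}
    for c in sorted(set(labels)):
        class_texts = [t for t, y in zip(texts, labels) if y == c]
        log_prior[c] = math.log(len(class_texts) / len(texts))
        counts = Counter(word for t in class_texts for word in t.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(text: str, log_prior: LogPrior, log_likelihood: LogLikelihood) -> int:
    """Return the class with the highest log score; unseen words are ignored."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in text.split() if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus (1 = positive, 0 = negative)
toy_texts = ["great film", "great story", "bad film", "boring bad plot"]
toy_labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(toy_texts, toy_labels)
print(predict("great story", log_prior, log_likelihood))
print(predict("bad plot", log_prior, log_likelihood))
```

Each word adds its class-conditional log likelihood to the score, independently of word order, which is why the model is sensitive to the mere presence of strongly polarized words.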

### Accuracy report on both training and test sets

In [26]:
from sklearn.metrics import accuracy_score

# Evaluate the prediction with train and test datasets
# accuracy_score expects its arguments in the order (y_true, y_pred)
print("Train accuracy: ", accuracy_score(np.array(dataset["train"]["label"]), pipeline.predict(np.array(dataset["train"]["text"]))))
print("Test accuracy: ", accuracy_score(np.array(dataset["test"]["label"]), pipeline.predict(np.array(dataset["test"]["text"]))))

Train accuracy:  0.91284
Test accuracy:  0.8172


### Why is accuracy a sufficient measure of evaluation here ?

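Accuracy is a sufficient measure here because both supervised splits are perfectly balanced: they contain as many positive reviews as negative ones. A trivial classifier that always predicts the same class would only reach 50% accuracy, so a high accuracy cannot be obtained by favouring one class.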
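
The argument can be checked with a majority-class baseline. The sketch below uses two small synthetic label lists (assumptions made for illustration, not the IMDB data): one balanced like our splits, one hypothetical 90/10 split. `accuracy` and `majority_baseline` are helper names introduced here.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true: List[int]) -> List[int]:
    """Predict the most frequent label for every sample."""
    majority = max(set(y_true), key=y_true.count)
    return [majority] * len(y_true)


balanced = [0] * 50 + [1] * 50    # same class ratio as the IMDB supervised splits
imbalanced = [0] * 90 + [1] * 10  # hypothetical skewed dataset

print("Balanced baseline accuracy:  ", accuracy(balanced, majority_baseline(balanced)))
print("Imbalanced baseline accuracy:", accuracy(imbalanced, majority_baseline(imbalanced)))
```

On the skewed data the useless baseline already looks good, which is the situation where precision, recall or F1 would be needed in addition to accuracy.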

### [Bonus] What are the top 10 most important words (features) for each class?

The words with the highest likelihood in each class :

In [27]:
features_likelihood_zero = {}
features_likelihood_one = {}
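# Get all features (words); get_feature_names_out() requires scikit-learn >= 1.0
features = pipeline.named_steps["vectorizer"].get_feature_names_out()
# Per-class log likelihood of each word: row 0 is negative, row 1 is positive
likelihood_zero = pipeline.named_steps["nb"].feature_log_prob_[0]
likelihood_one = pipeline.named_steps["nb"].feature_log_prob_[1]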
# Assign to words their likelihood probability to be in negative and positive text
for i, feature in enumerate(features):
    features_likelihood_zero[feature] = likelihood_zero[i]
    features_likelihood_one[feature] = likelihood_one[i]

print(sorted(features_likelihood_zero, key=features_likelihood_zero.get, reverse=True)[:10])
print(sorted(features_likelihood_one, key=features_likelihood_one.get, reverse=True)[:10])



['the', 'and', 'of', 'to', 'is', 'in', 'this', 'it', 'that', 'br']
['the', 'and', 'of', 'to', 'is', 'in', 'it', 'this', 'that', 'br']
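
Both lists are nearly identical because very frequent words have a high likelihood in every class. One possible alternative, sketched below, is to rank words by the difference of their log likelihoods between the two classes, which highlights words that are typical of one class only. The probabilities are hypothetical values chosen for illustration and `top_discriminative` is a name introduced here; on the fitted model the same difference would be `feature_log_prob_[1] - feature_log_prob_[0]`.

```python
import math
from typing import Dict, List


def top_discriminative(
    log_prob_neg: Dict[str, float], log_prob_pos: Dict[str, float], k: int = 2
) -> Dict[str, List[str]]:
    """Rank words by the log-likelihood ratio log P(w|pos) - log P(w|neg)."""
    ratio = {w: log_prob_pos[w] - log_prob_neg[w] for w in log_prob_neg}
    ranked = sorted(ratio, key=ratio.get)  # most negative ratio first
    return {"negative": ranked[:k], "positive": ranked[::-1][:k]}


# Hypothetical per-class word probabilities, for illustration only
p_neg = {"the": 0.50, "film": 0.20, "bad": 0.15, "awful": 0.10, "great": 0.03, "superb": 0.02}
p_pos = {"the": 0.50, "film": 0.20, "bad": 0.03, "awful": 0.01, "great": 0.16, "superb": 0.10}
log_neg = {w: math.log(p) for w, p in p_neg.items()}
log_pos = {w: math.log(p) for w, p in p_pos.items()}

print(top_discriminative(log_neg, log_pos, k=2))
```

In this toy example `the` has the highest likelihood in both classes, yet its ratio is zero, so it never reaches the top of either list.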


Removing the stopwords and checking again :

In [28]:
# Setup nltk
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /home/leme/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/leme/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [29]:
stopWords = set(stopwords.words('english'))
stopWords_zero = []
stopWords_one = []
# Find stop words in features
for feature in features_likelihood_zero.keys():
    if feature in stopWords:
        stopWords_zero.append(feature)
for feature in features_likelihood_one.keys():
    if feature in stopWords:
        stopWords_one.append(feature)
# Remove stop words in dictionaries
for stopWord in stopWords_zero:
    del features_likelihood_zero[stopWord]
for stopWord in stopWords_one:
    del features_likelihood_one[stopWord]
    
print(sorted(features_likelihood_zero, key=features_likelihood_zero.get, reverse=True)[:10])
print(sorted(features_likelihood_one, key=features_likelihood_one.get, reverse=True)[:10])

['br', 'movie', 'film', 'one', 'like', 'even', 'good', 'bad', 'would', 'really']
['br', 'film', 'movie', 'one', 'like', 'good', 'story', 'great', 'time', 'see']
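
The `br` token at the top of both lists most likely comes from the `<br />` HTML line breaks found in the raw reviews: once `<`, `/` and `>` are deleted as punctuation, only `br` remains. A minimal sketch of a cleaning step that removes these tags first is shown below; `clean_review` and `BR_TAG` are hypothetical names, and this step is not part of the pipeline evaluated in this notebook.

```python
import re
from string import punctuation

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
PUNCT_TABLE = str.maketrans("", "", punctuation)


def clean_review(text: str) -> str:
    """Replace HTML line breaks with a space, then lowercase and drop punctuation."""
    without_breaks = BR_TAG.sub(" ", text)
    # split/join collapses the repeated spaces left by the removed tags
    return " ".join(without_breaks.lower().translate(PUNCT_TABLE).split())


print(clean_review("Great film.<br /><br />Loved it!"))
```

Replacing the tag with a space rather than an empty string keeps the words on both sides of the break separate.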


Take at least 2 wrongly classified examples from the test set and try to explain why the model failed :

In [30]:
# Store the prediction result on the test dataset
test_predict_res = pipeline.predict(np.array(dataset["test"]["text"])), np.array(dataset["test"]["label"])
wrong_class = {}

# Find some wrongly classified text from the test set
for i, (got, expected) in enumerate(zip(test_predict_res[0], test_predict_res[1])):
    if len(wrong_class) == 2:
        break
    if got != expected:
        wrong_class[dataset["test"]["text"][i]] = (got, expected)
        
# Display the results
for text, res in wrong_class.items():
    print(text)
    print("Prediction:", res[0])
    print("Expected:", res[1])
    print()

Blind Date (Columbia Pictures, 1934), was a decent film, but I have a few issues with this film. First of all, I don't fault the actors in this film at all, but more or less, I have a problem with the script. Also, I understand that this film was made in the 1930's and people were looking to escape reality, but the script made Ann Sothern's character look weak. She kept going back and forth between suitors and I felt as though she should have stayed with Paul Kelly's character in the end. He truly did care about her and her family and would have done anything for her and he did by giving her up in the end to fickle Neil Hamilton who in my opinion was only out for a good time. Paul Kelly's character, although a workaholic was a man of integrity and truly loved Kitty (Ann Sothern) as opposed to Neil Hamilton, while he did like her a lot, I didn't see the depth of love that he had for her character. The production values were great, but the script could have used a little work.
Prediction

Both examples are negative reviews that the model classified as positive.

In the first example, the reviewer uses several positive keywords such as `decent film`, `loved`, `great`, `good`, `like`... which can push the prediction towards the positive class.

The second one also contains positive keywords such as `excellent`, `enjoyable`, `best`... even though the text as a whole is clearly negative. This review is fairly long, so its length may interfere with the prediction. The HTML tags it contains may also contribute to the error.

To conclude, these two texts were probably misclassified because they contain more words that are likelier under the opposite label than under their true label.
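
This explanation can be illustrated by summing the per-word log-likelihood ratios of a review: with equal class priors, Naive Bayes predicts the positive class whenever the total is positive. The sketch below uses made-up ratio values and a made-up review (both are assumptions, not taken from the fitted model); `word_contributions` is a name introduced here.

```python
from typing import Dict, List, Tuple


def word_contributions(
    text: str, log_ratio: Dict[str, float]
) -> Tuple[float, List[Tuple[str, float]]]:
    """Sum per-word log-likelihood ratios; a positive total favours the positive class."""
    contributions = [(w, log_ratio[w]) for w in text.lower().split() if w in log_ratio]
    total = sum(value for _, value in contributions)
    # Most influential words first, whatever their sign
    ranked = sorted(contributions, key=lambda item: abs(item[1]), reverse=True)
    return total, ranked


# Hypothetical log P(w|pos) - log P(w|neg) values, for illustration only
toy_log_ratio = {"decent": 0.4, "great": 0.9, "loved": 0.8, "weak": -0.7, "issues": -0.5}

review = "decent film great production loved the cast but weak script and issues"
total, ranked = word_contributions(review, toy_log_ratio)
print("Total score:", round(total, 2))
for word, value in ranked:
    print(f"{word:>8}: {value:+.1f}")
```

The review is critical overall, but the positive words outweigh the negative ones in the sum, so a bag-of-words model would label it positive.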

---

## Stemming and Lemmatization
### Lemmatization
Let's add lemmatization to our preprocessing.

First, let's demonstrate the lemmatization effect on the first element of our train dataset :

In [31]:
%%capture
!python -m spacy download en_core_web_sm

In [40]:
# Setup spacy
import spacy
spacy_nlp = spacy.load('en_core_web_sm')

# Take the first 25 words of the first review of the train dataset as an example
test_list = dataset['train']['text'][0].split()[:25]
test_sentence = ' '.join(test_list)

# Lemmatize the sentence
doc = spacy_nlp(test_sentence)

tokens = [token.text for token in doc]

print('Original Sentence: %s' % (test_sentence))
print()
for token in doc:
    if token.text != token.lemma_:
        print('Original : %s, New: %s' % (token.text, token.lemma_))

Original Sentence: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I

Original : rented, New: rent
Original : AM, New: be
Original : surrounded, New: surround
Original : was, New: be
Original : released, New: release


Now let's apply lemmatization to the data used in our model:

In [37]:
def lemma_preprocessor(x_list: List[str]) -> List[str]:
    """
    Lowercase each string of a list, remove its punctuation
    and replace every token with its lemma.
    
    Args:
        x_list: List of strings
    
    Returns:
        List of preprocessed strings.
    """
    no_punc_lower = [x.lower().translate(str.maketrans("", "", punctuation)) for x in x_list]
    spacy_nlp = spacy.load('en_core_web_sm')
    res = []
    for sentence in no_punc_lower:
        doc = spacy_nlp(sentence)
        s = []
        for word in doc:
            s.append(word.lemma_)
        s = ' '.join(s)
        res.append(s)
    return res

Let's train and evaluate the model again with this preprocessing :

In [38]:
# Create pipeline
pipeline = Pipeline([
    ("preprocess", FunctionTransformer(lemma_preprocessor)),
    ("vectorizer", CountVectorizer(lowercase=True)),
    ("nb", MultinomialNB()),
])

# Fit the pipeline with train dataset
pipeline.fit(np.array(dataset["train"]["text"]), np.array(dataset["train"]["label"]))

Pipeline(steps=[('preprocess',
                 FunctionTransformer(func=<function lemma_preprocessor at 0x7f72d4cea710>)),
                ('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [39]:
# Evaluate the prediction with train and test datasets
# accuracy_score expects its arguments in the order (y_true, y_pred)
print("Train accuracy: ", accuracy_score(np.array(dataset["train"]["label"]), pipeline.predict(np.array(dataset["train"]["text"]))))
print("Test accuracy: ", accuracy_score(np.array(dataset["test"]["label"]), pipeline.predict(np.array(dataset["test"]["text"]))))

Train accuracy:  0.90624
Test accuracy:  0.81356


Need to change the test dataset

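Are the results better or worse? Try explaining why the accuracy changed.

The results are slightly worse: the train accuracy drops from 0.91284 to 0.90624 and the test accuracy from 0.8172 to 0.81356. A likely explanation is that lemmatization maps several inflected forms to a single feature, so the vocabulary shrinks and the classifier loses a few distinctions that carried sentiment information.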

### Stemming

Let's add stemming to our preprocessing.

First, let's demonstrate the stemming effect on the first element of our train dataset :

In [12]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 

In [13]:
# We need to download a package for word tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/leme/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
# Take the first 25 words of the first review of the train dataset as an example
test_list = dataset['train']['text'][0].split()[:25]
test_sentence = ' '.join(test_list)
test_sentence

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I'

In [15]:
import re

re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")
stemmed = [stemmer.stem(word) for word in word_tokenize(test_sentence.lower()) if re_word.match(word)]
stemmed = ' '.join(stemmed)
print("Original sentence:", test_sentence)
print()
print("Stemmed sentence:", stemmed)

Original sentence: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I

Stemmed sentence: i rent i am from my video store becaus of all the controversi that surround it when it was first releas in i
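
The effect of stemming on the feature space can be illustrated with a deliberately crude suffix stripper. This is a toy sketch under strong assumptions, not the Snowball algorithm used above: `crude_stem` and its `SUFFIXES` list are hypothetical and only handle a few English endings.

```python
from typing import Tuple

# Hypothetical suffix list, far cruder than the Snowball rules
SUFFIXES: Tuple[str, ...] = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = "rent rented renting rents watch watched watches surround surrounded".split()
stems = {crude_stem(word) for word in words}

print("Distinct words:", len(set(words)))
print("Distinct stems:", len(stems))
```

Several surface forms collapse into one feature, which makes counts less sparse but also removes any distinction the classifier could have drawn between those forms.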


In [34]:
def stem_preprocessor(x_list: List[str]) -> List[str]:
    """
    Lowercase each string of a list, remove its punctuation
    and replace every word with its stem.
    
    Args:
        x_list: List of strings
    
    Returns:
        List of preprocessed strings.
    """
    no_punc_lower = [x.lower().translate(str.maketrans("", "", punctuation)) for x in x_list]
    re_word = re.compile(r"^\w+$")
    stemmer = SnowballStemmer("english")
    res = []
    for sentence in no_punc_lower:
        s = []
        for word in word_tokenize(sentence):
            if re_word.match(word):
                s.append(stemmer.stem(word))
        s = ' '.join(s)
        res.append(s)
    return res

In [35]:
# Create pipeline
pipeline = Pipeline([
    ("preprocess", FunctionTransformer(stem_preprocessor)),
    ("vectorizer", CountVectorizer(lowercase=True)),
    ("nb", MultinomialNB()),
])

# Fit the pipeline with train dataset
pipeline.fit(np.array(dataset["train"]["text"]), np.array(dataset["train"]["label"]))

Pipeline(steps=[('preprocess',
                 FunctionTransformer(func=<function stem_preprocessor at 0x7f72d4ce9bd0>)),
                ('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [36]:
# Evaluate the prediction with train and test datasets
# accuracy_score expects its arguments in the order (y_true, y_pred)
print("Train accuracy: ", accuracy_score(np.array(dataset["train"]["label"]), pipeline.predict(np.array(dataset["train"]["text"]))))
print("Test accuracy: ", accuracy_score(np.array(dataset["test"]["label"]), pipeline.predict(np.array(dataset["test"]["text"]))))

Train accuracy:  0.90068
Test accuracy:  0.80972
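
To compare the three preprocessing variants side by side, the accuracies reported in this notebook can be gathered in a small table. The numbers below are copied from the outputs above; `results` and `generalization_gap` are names introduced for this summary.

```python
from typing import Dict

# Accuracies copied from the three runs reported in this notebook
results: Dict[str, Dict[str, float]] = {
    "baseline": {"train": 0.91284, "test": 0.8172},
    "lemmatization": {"train": 0.90624, "test": 0.81356},
    "stemming": {"train": 0.90068, "test": 0.80972},
}


def generalization_gap(scores: Dict[str, float]) -> float:
    """Difference between train and test accuracy."""
    return scores["train"] - scores["test"]


print(f"{'preprocessing':<15}{'train':>9}{'test':>9}{'gap':>9}")
for name, scores in results.items():
    gap = generalization_gap(scores)
    print(f"{name:<15}{scores['train']:>9.5f}{scores['test']:>9.5f}{gap:>9.5f}")

best = max(results, key=lambda name: results[name]["test"])
print("Best test accuracy:", best)
```

Both normalization steps narrow the gap between train and test accuracy slightly, but neither improves the test accuracy over the baseline.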
