NLP LAB 02
Théo Ripoll - Quentin Fish - Nicolas Fidel

## Part 1: The Dataset

In [1]:
from datasets import load_dataset
from datasets import get_dataset_split_names
import pandas as pd

In [2]:
dataset = load_dataset("imdb")
dataset

Found cached dataset imdb (/Users/quentinfisch/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [70]:
get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

In [3]:
# Count the number of examples per label in each supervised split
train_labels = pd.DataFrame(dataset["train"]['label'], columns=["label"])
print(train_labels.groupby("label")["label"].count())

test_labels = pd.DataFrame(dataset["test"]['label'], columns=["label"])
print(test_labels.groupby("label")["label"].count())

label
0    12500
1    12500
Name: label, dtype: int64
label
0    12500
1    12500
Name: label, dtype: int64


### Question 1: How many splits does the dataset have?
There are 3 splits: train, test and unsupervised.

### Question 2: How big are the splits?
- train: 25,000 examples
- test: 25,000 examples
- unsupervised: 50,000 examples

### Question 3: What is the proportion of each class on the supervised splits?
- train: 50% positive, 50% negative (12,500 reviews each)
- test: 50% positive, 50% negative (12,500 reviews each)
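
The proportions above can also be computed directly rather than read off the counts. The snippet below is a minimal sketch on a toy label list; `class_proportions` and the toy labels are illustrative names, standing in for `dataset["train"]["label"]`.

```python
from collections import Counter

def class_proportions(labels):
    """Return the share of each label in a list of labels."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

# Toy labels (assumption): 0 = negative, 1 = positive, as in the IMDB dataset
toy_labels = [0, 1, 1, 0, 1, 0]
proportions = class_proportions(toy_labels)
print(proportions)
```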

## Part 2: Naive Bayes classifier

In [3]:
from string import punctuation
import re

def preprocess(dataset: pd.DataFrame) -> pd.DataFrame :
    """
    Preprocess the dataset by lowercasing the text and removing the punctuation manually

    Parameters
    ----------
    dataset : pd.DataFrame
        The dataset to preprocess

    Returns
    -------
    pd.DataFrame
        The preprocessed dataset
    """
    # First lower the case
    dataset["document"] = dataset["document"].apply(lambda x: x.lower())
    # Replace punctuation with spaces. We keep ' and - because they may carry relevant information
    # The HTML tag <br /> is removed first (replaced by an empty string)
    punctuation_to_remove = '|'.join(map(re.escape, sorted(list(filter(lambda p: p != "'" and p != '-', punctuation)), reverse=True)))
    print(f"Deleting all these punctuation: {punctuation_to_remove}")
    dataset["document"] = dataset["document"].apply(lambda x: re.sub(punctuation_to_remove, " ", x.replace('<br />', "")))
    return dataset
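
One detail of this preprocessing is worth checking on a small input: `<br />` is replaced by an empty string, so two words separated only by the tag end up glued together. The sketch below reproduces the same steps on a single string; `strip_markup` and its `br_replacement` parameter are hypothetical helpers written for this illustration, not part of the lab code.

```python
import re
from string import punctuation

def strip_markup(text: str, br_replacement: str) -> str:
    """Lowercase, replace <br /> tags, then turn punctuation (except ' and -) into spaces."""
    pattern = "|".join(re.escape(p) for p in punctuation if p not in "'-")
    text = text.lower().replace("<br />", br_replacement)
    return re.sub(pattern, " ", text)

sample = "Great cast<br /><br />Terrible script."

glued = strip_markup(sample, "")    # same choice as in preprocess()
spaced = strip_markup(sample, " ")  # alternative: replace the tag by a space

print(glued.split())
print(spaced.split())
```

In the IMDB reviews the tag usually follows a period, which becomes a space anyway, so the effect is probably limited; replacing the tag by a space would avoid it entirely.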

In [4]:
train_raw = pd.DataFrame(dataset["train"], columns=["text", "label"]).rename(columns={"text": "document", "label": "class"})
preprocessed_train = preprocess(train_raw)
preprocessed_train

Deleting all these punctuation: \~|\}|\||\{|`|_|\^|\]|\\|\[|@|\?|>|=|<|;|:|/|\.|,|\+|\*|\)|\(|\&|%|\$|\#|"|!


Unnamed: 0,document,class
0,i rented i am curious-yellow from my video sto...,0
1,i am curious yellow is a risible and preten...,0
2,if only to avoid making this type of film in t...,0
3,this film was probably inspired by godard's ma...,0
4,oh brother after hearing about this ridicul...,0
...,...,...
24995,a hit at the time but now better categorised a...,1
24996,i love this movie like no other another time ...,1
24997,this film and it's sequel barry mckenzie holds...,1
24998,'the adventures of barry mckenzie' started lif...,1


In [5]:
test_raw = pd.DataFrame(dataset["test"], columns=["text", "label"]).rename(columns={"text": "document", "label": "class"})
preprocessed_test = preprocess(test_raw)
preprocessed_test

Deleting all these punctuation: \~|\}|\||\{|`|_|\^|\]|\\|\[|@|\?|>|=|<|;|:|/|\.|,|\+|\*|\)|\(|\&|%|\$|\#|"|!


Unnamed: 0,document,class
0,i love sci-fi and am willing to put up with a ...,0
1,worth the entertainment value of a rental esp...,0
2,its a totally average film with a few semi-alr...,0
3,star rating saturday night friday ...,0
4,first off let me say if you haven't enjoyed a...,0
...,...,...
24995,just got around to seeing monster man yesterda...,1
24996,i got this as part of a competition prize i w...,1
24997,i got monster man in a box set of three films ...,1
24998,five minutes in i started to feel how naff th...,1


### Question 2: Naive Bayes Classifier using pseudo-code

In [35]:
import numpy as np
from typing import List
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.feature_extraction.text import CountVectorizer
import operator

def get_vocabulary(d: pd.DataFrame) -> List[str]:
    """
    Return the vocabulary of the dataset

    Parameters
    ----------
    d : pd.DataFrame

    Returns
    -------
    List[str]
        The vocabulary
    """
    res = list(set(" ".join(d["document"]).split(" ")))
    # Remove empty string and words without any letter
    res = list(filter(lambda x: x != "" and re.search("[a-zA-Z]", x), res))
    return res

def train_naive_bayes(d: pd.DataFrame):
    """
    Train a Naive Bayes classifier

    Parameters
    ----------
    d : pd.DataFrame

    Returns
    -------
    logprior : dict
        The log prior of each class
    loglikelihood : dict
        The log likelihood of each word for each class
    V : List[str]
        The vocabulary
    """
    classes = d["class"].unique()
    logprior = {}
    bigdoc = {}
    loglikelihood = {}
    V = get_vocabulary(d)
    for c in classes:
        count = {}
        n_doc = len(d)
        n_c = len(d[d["class"] == c])
        logprior[c] = np.log(n_c / n_doc)
        bigdoc[c] = list(" ".join(d[d["class"] == c]["document"]).split(" "))
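        for word in V:
            count[(word, c)] = bigdoc[c].count(word)
        # Denominator of the add-one (Laplace) smoothed estimate.
        # It is the same for every word of the class, so compute it once instead of once per word
        denominator = sum(count.values()) + len(V)
        for word in V:
            loglikelihood[(word, c)] = np.log((count[(word, c)] + 1) / denominator)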
    return logprior, loglikelihood, V

def test_naive_bayes(testdoc, classes, logprior, loglikelihood, V) -> int:
    """
    Test a Naive Bayes classifier

    Parameters
    ----------
    testdoc : str
        The document to classify
    classes : List[int]
        The list of classes
    logprior : dict
        The log prior of each class
    loglikelihood : dict
        The log likelihood of each word for each class
    V : List[str]
        The vocabulary

    Returns
    -------
    int
        The predicted class
    """
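    # V is a list, so `word in V` would scan it for every token; a set gives O(1) lookups
    vocabulary = set(V)
    sum_loglikelihood = {}
    for c in classes:
        sum_loglikelihood[c] = logprior[c]
        for word in testdoc.split(" "):
            # Words never seen during training are ignored
            if word in vocabulary:
                sum_loglikelihood[c] += loglikelihood[(word, c)]
    return max(sum_loglikelihood, key=sum_loglikelihood.get)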

In [40]:
train_dataset_reduced = preprocessed_train.loc[::10, :]
test_dataset_reduced = preprocessed_test.loc[::10, :]
logprior_r, loglikelihood_r, V_r = train_naive_bayes(train_dataset_reduced)

all_res = []
for row in test_dataset_reduced.iterrows():
    test_doc = row[1]["document"]
    res = test_naive_bayes(test_doc, preprocessed_test["class"].unique(), logprior_r, loglikelihood_r, V_r)
    # if res != row[1]["class"]:
    #     print(f"Error: {res} != {row[1]['class']}")
    all_res.append(res)

print("Manual Naive Bayes Accuracy Score -> ",accuracy_score(test_dataset_reduced["class"], all_res)*100)
print("Manual Naive Bayes Precision Score -> ",precision_score(test_dataset_reduced["class"], all_res)*100)
print("Manual Naive Bayes Recall Score -> ",recall_score(test_dataset_reduced["class"], all_res)*100)

Manual Naive Bayes Accuracy Score ->  80.12
Manual Naive Bayes Precision Score ->  84.44647758462946
Manual Naive Bayes Recall Score ->  73.83999999999999
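
To make the training and prediction steps easier to follow, here is a minimal, self-contained sketch of the same algorithm (log priors plus add-one smoothed log likelihoods) on a four-document toy corpus. The names `train`, `predict` and `toy` are illustrative and the corpus is invented; it is not the lab implementation, only a compact version of the same idea.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns log priors, log likelihoods and the vocabulary."""
    vocab = {w for text, _ in docs for w in text.split()}
    labels = {label for _, label in docs}
    logprior, loglik = {}, {}
    for c in labels:
        class_docs = [text for text, label in docs if label == c]
        logprior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for text in class_docs for w in text.split())
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        for w in vocab:
            loglik[(w, c)] = math.log((counts[w] + 1) / denom)
    return logprior, loglik, vocab

def predict(text, logprior, loglik, vocab):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {
        c: lp + sum(loglik[(w, c)] for w in text.split() if w in vocab)
        for c, lp in logprior.items()
    }
    return max(scores, key=scores.get)

# Invented toy corpus: 1 = positive, 0 = negative
toy = [("good fun film", 1), ("great fun", 1), ("bad boring film", 0), ("boring plot", 0)]
model = train(toy)

print(predict("fun film", *model))
print(predict("boring film", *model))
```

Counting each class once with a `Counter` avoids calling `list.count` for every vocabulary word, which is the main reason the lab version is slow on the full dataset.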


### Question 3: Naive Bayes Classifier using sklearn (Pipeline with CountVectorizer and MultinomialNB)

In [58]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [59]:
def sklearn_naive_bayes(d_train: pd.DataFrame, pipeline_params: dict = {}) -> Pipeline:
    """
    Train a Naive Bayes classifier using sklearn

    Parameters
    ----------
    d_train : pd.DataFrame
        The training dataset
    pipeline_params : dict, optional
        The parameters of the pipeline, by default {}

    Returns
    -------
    Pipeline
        The trained pipeline
    """
    # create pipeline
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('classifier', MultinomialNB())
    ])
    pipeline.set_params(**pipeline_params)

    # train the model
    pipeline.fit(d_train["document"], d_train["class"])
    return pipeline

def test_sklearn_naive_bayes(pipeline: Pipeline, d_test: pd.DataFrame) -> List[int]:
    """
    Test a Naive Bayes classifier using sklearn

    Parameters
    ----------
    pipeline : Pipeline
        The trained pipeline
    d_test : pd.DataFrame
        The test dataset

    Returns
    -------
    List[int]
        The predicted classes
    """
    # predict the labels on validation dataset
    predictions = pipeline.predict(d_test["document"])

    print("Sklearn Naive Bayes Accuracy Score -> ",accuracy_score(d_test["class"], predictions)*100)
    print("Sklearn Naive Bayes Precision Score -> ",precision_score(d_test["class"], predictions)*100)
    print("Sklearn Naive Bayes Recall Score -> ",recall_score(d_test["class"], predictions)*100)

    return predictions

In [60]:
pipeline = sklearn_naive_bayes(preprocessed_train)
predictions = test_sklearn_naive_bayes(pipeline, preprocessed_test)

Sklearn Naive Bayes Accuracy Score ->  81.44
Sklearn Naive Bayes Precision Score ->  86.05504587155963
Sklearn Naive Bayes Recall Score ->  75.03999999999999


### Question 4: Report the accuracy on the test set

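See the printed scores above: the manual implementation reaches 80.12% accuracy on the reduced test set (every tenth review), and the scikit-learn pipeline reaches 81.44% accuracy on the full test set.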

### Question 5: Most likely, the scikit-learn implementation will give better results. Looking at the documentation, explain why it could be the case.

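Both implementations use the same model: multinomial Naive Bayes with add-one smoothing (`alpha=1.0` is the default of `MultinomialNB`). The gap therefore comes mostly from the data and the tokenization rather than from the classifier itself. First, our manual version is too slow to run on the full data, so it was trained and evaluated on every tenth review only, whereas the scikit-learn pipeline was trained on all 25,000 training reviews. Second, `CountVectorizer` tokenizes differently: its default token pattern keeps only tokens of two or more alphanumeric characters, so it drops single-character tokens and splits on apostrophes and hyphens, which our preprocessing keeps. Finally, scikit-learn works on sparse count matrices, which is far more efficient and is what makes training on the full dataset practical.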
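
One documented difference between the two implementations is tokenization. The sketch below compares a whitespace split, as used in the manual version, with the default token pattern documented for `CountVectorizer`, applied here with the standard `re` module so that the example does not depend on scikit-learn. The sample sentence is invented.

```python
import re

# Default token_pattern documented for CountVectorizer: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "i didn't like this so-called b-movie"

whitespace_tokens = text.split(" ")          # manual implementation
pattern_tokens = TOKEN_PATTERN.findall(text)  # CountVectorizer-style tokens

print(whitespace_tokens)
print(pattern_tokens)
```

This also explains why a fragment such as `don` (from "don't") can show up as a feature in the scikit-learn vocabulary.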

### Question 6: Why is accuracy a sufficient measure of evaluation here?

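Because the dataset is balanced: each supervised split contains 12,500 positive and 12,500 negative reviews. A trivial classifier that always predicts the same class would only reach 50% accuracy, so accuracy is not inflated by class imbalance and is a meaningful measure here.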
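
As a worked example of how the three reported metrics relate, the snippet below recomputes them from confusion counts. The counts are an assumption: they are not printed in the notebook, but were reconstructed from the scikit-learn scores above (81.44% accuracy, 75.04% recall) and the 12,500 / 12,500 class split.

```python
# Confusion counts reconstructed from the reported scores (assumption, not notebook output)
tp, fn = 9380, 3120    # positive reviews: correctly / wrongly classified
tn, fp = 10980, 1520   # negative reviews: correctly / wrongly classified

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Accuracy of a classifier that always predicts the most frequent class
majority_baseline = max(tp + fn, tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"majority baseline={majority_baseline:.2f}")
```

With a 50% baseline, 81.44% accuracy reflects real signal; on a heavily imbalanced dataset the baseline itself could be that high, and precision and recall would become essential.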

### Question 7: Using one of the implementations, take at least 2 wrongly classified examples from the test set and try explaining why the model failed.

In [33]:
# We will take a look at the sklearn implementation
# First we need to get the wrongly classified examples
wrongly_classified = preprocessed_test[preprocessed_test["class"] != predictions]

# We will take the first 2 examples
# We can see that the first example is a negative review but the model predicted it as a positive review
# The second example is a positive review but the model predicted it as a negative review
print(wrongly_classified.iloc[0]["document"])
print(wrongly_classified.iloc[1]["document"])
print()

# Let's see the probability of each class for the first example
print(pipeline.predict_proba([wrongly_classified.iloc[0]["document"]]))
# Let's see the probability of each class for the second example
print(pipeline.predict_proba([wrongly_classified.iloc[1]["document"]]))

blind date  columbia pictures  1934   was a decent film  but i have a few issues with this film. first of all  i don t fault the actors in this film at all  but more or less  i have a problem with the script. also  i understand that this film was made in the 1930 s and people were looking to escape reality  but the script made ann sothern s character look weak. she kept going back and forth between suitors and i felt as though she should have stayed with paul kelly s character in the end. he truly did care about her and her family and would have done anything for her and he did by giving her up in the end to fickle neil hamilton who in my opinion was only out for a good time. paul kelly s character  although a workaholic was a man of integrity and truly loved kitty  ann sothern  as opposed to neil hamilton  while he did like her a lot  i didn t see the depth of love that he had for her character. the production values were great  but the script could have used a little work.
ben   rupe

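We can see that the model is very confident about its prediction for the two examples (probability around 0.99), yet it is wrong. This confidence should not be read as the examples being easy: Naive Bayes treats words as independent, so the evidence of every word is multiplied and the predicted probability is pushed toward 0 or 1 even when each word is only weakly informative. These reviews are hard to classify because they mix a description of the movie (whose vocabulary can carry positive or negative connotations, for example because of the life of the main character) with the reviewer's actual opinion. A bag-of-words model cannot separate description and facts from opinion, so it misclassifies them. The first example is typical: it calls the film "decent" and the production values "great" while criticizing the script.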
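
The over-confidence of Naive Bayes can be illustrated with a few lines of arithmetic. The sketch below assumes, purely for illustration, that every word of a review is only 5% more likely under the positive class and that the class priors are equal; `posterior_positive` is a hypothetical helper.

```python
import math

def posterior_positive(log_ratio_per_word: float, n_words: int, log_prior_ratio: float = 0.0) -> float:
    """P(positive | document) when every word contributes the same log likelihood ratio."""
    log_odds = log_prior_ratio + n_words * log_ratio_per_word
    return 1.0 / (1.0 + math.exp(-log_odds))

# Assumption: each word is only 5% more likely in positive reviews
weak_evidence = math.log(1.05)

for n_words in (1, 20, 100, 200):
    print(n_words, round(posterior_positive(weak_evidence, n_words), 4))
```

With a single word the posterior barely moves from 0.5, but a review of a hundred such words already yields a probability above 0.99, which is why long reviews tend to get near-certain predictions in either direction.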

### Question 8: What are the top 10 most important words (features) for each class? (bonus points)

In [71]:
# We will use the sklearn implementation to get the top 10 most important words for each class

def get_top_10_words(pipeline: Pipeline) -> dict:
    """
    Get the top 10 words for each class

    Parameters
    ----------
    pipeline : Pipeline
        The trained pipeline

    Returns
    -------
    dict
        The top 10 words for each class
    """
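    top_10_words = {}
    # Feature names are ordered by column index, like feature_log_prob_
    # (get_feature_names_out requires scikit-learn >= 1.0)
    feature_names = pipeline.named_steps["vectorizer"].get_feature_names_out()
    for c in preprocessed_test["class"].unique():
        loglikelihood = pipeline.named_steps["classifier"].feature_log_prob_[c]
        # argsort is ascending, so the last 10 indices are the 10 most likely words
        top_10_words[c] = [feature_names[i] for i in np.argsort(loglikelihood)[-10:]]
    return top_10_words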

In [72]:
get_top_10_words(pipeline)

{0: ['was', 'that', 'this', 'in', 'it', 'is', 'to', 'of', 'and', 'the'],
 1: ['as', 'this', 'that', 'it', 'in', 'is', 'to', 'of', 'and', 'the']}

In [84]:
pipeline_stopwords = sklearn_naive_bayes(preprocessed_train, {"vectorizer__stop_words": "english"})
predictions_stopwords = test_sklearn_naive_bayes(pipeline_stopwords, preprocessed_test)

get_top_10_words(pipeline_stopwords)

Sklearn Naive Bayes Accuracy Score ->  81.976
Sklearn Naive Bayes Precision Score ->  86.22439731738264
Sklearn Naive Bayes Recall Score ->  76.112


{0: ['story',
  'don',
  'time',
  'really',
  'bad',
  'good',
  'just',
  'like',
  'film',
  'movie'],
 1: ['people',
  'really',
  'great',
  'time',
  'story',
  'just',
  'good',
  'like',
  'movie',
  'film']}

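Without stop-word removal, the top 10 words of both classes are function words and are nearly identical. Once English stop words are removed, the lists become more informative, although 8 of the 10 words are still shared: only "bad" and "don" are specific to the negative class, and "great" and "people" to the positive class. This is because ranking by log likelihood returns the most frequent words of each class, not the most discriminative ones.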
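
A common alternative is to rank words by the difference of log likelihoods between the two classes, which favors words that are frequent in one class and rare in the other. The snippet below is a minimal sketch of that idea on an invented toy corpus; `top_discriminative` is a hypothetical helper. With the scikit-learn pipeline, the same ranking could presumably be obtained from the difference between the two rows of `feature_log_prob_`.

```python
import math
from collections import Counter

def top_discriminative(docs_by_class, target, other, k=3):
    """Rank words by log P(w | target) - log P(w | other), with add-one smoothing."""
    counts = {c: Counter(w for d in docs for w in d.split()) for c, docs in docs_by_class.items()}
    vocab = set(counts[target]) | set(counts[other])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (target, other)}

    def logp(word, c):
        return math.log((counts[c][word] + 1) / totals[c])

    ranked = sorted(vocab, key=lambda w: logp(w, target) - logp(w, other), reverse=True)
    return ranked[:k]

# Invented toy corpus: 0 = negative, 1 = positive
toy_corpus = {
    0: ["bad movie", "bad boring movie"],
    1: ["great movie", "great fun movie"],
}

print(top_discriminative(toy_corpus, target=1, other=0, k=1))
print(top_discriminative(toy_corpus, target=0, other=1, k=1))
```

Here "movie" is among the most frequent words in both classes, yet it never ranks first, because its log-likelihood difference is zero.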

### Question 9: Play with scikit-learn's version parameters. For example, see if you can consider unigram and bigram instead of only unigrams.

In [89]:
# Unigrams and bigrams, English stop words removed
pipeline_bigram = sklearn_naive_bayes(preprocessed_train, {"vectorizer__ngram_range": (1, 2), "vectorizer__stop_words": "english"})
predictions_bigram = test_sklearn_naive_bayes(pipeline_bigram, preprocessed_test)

Sklearn Naive Bayes Accuracy Score ->  84.244
Sklearn Naive Bayes Precision Score ->  87.4857693318154
Sklearn Naive Bayes Recall Score ->  79.92


In [93]:
# Unigrams and bigrams, stop words kept
pipeline_bigram_stopwords = sklearn_naive_bayes(preprocessed_train, {"vectorizer__ngram_range": (1, 2)})
predictions_bigram_stopwords = test_sklearn_naive_bayes(pipeline_bigram_stopwords, preprocessed_test)

Sklearn Naive Bayes Accuracy Score ->  85.672
Sklearn Naive Bayes Precision Score ->  88.62612612612612
Sklearn Naive Bayes Recall Score ->  81.848


In [91]:
# Bigrams only, English stop words removed
pipeline_only_bigram = sklearn_naive_bayes(preprocessed_train, {"vectorizer__ngram_range": (2, 2), "vectorizer__stop_words": "english"})
predictions_only_bigram = test_sklearn_naive_bayes(pipeline_only_bigram, preprocessed_test)

Sklearn Naive Bayes Accuracy Score ->  82.952
Sklearn Naive Bayes Precision Score ->  87.63018454229857
Sklearn Naive Bayes Recall Score ->  76.736


In [94]:
# Bigrams only, stop words kept
pipeline_only_bigram_stopwords = sklearn_naive_bayes(preprocessed_train, {"vectorizer__ngram_range": (2, 2)})
predictions_only_bigram_stopwords = test_sklearn_naive_bayes(pipeline_only_bigram_stopwords, preprocessed_test)

Sklearn Naive Bayes Accuracy Score ->  86.952
Sklearn Naive Bayes Precision Score ->  89.35753237900477
Sklearn Naive Bayes Recall Score ->  83.896


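The best accuracy (86.95%) is obtained with bigrams only and without removing stop words. Keeping stop words helps as soon as bigrams are involved: 85.67% against 84.24% for unigrams and bigrams, and 86.95% against 82.95% for bigrams only. A likely explanation is that stop words such as "not" are needed to form bigrams that capture negation. With unigrams only, removing stop words gives a small gain instead (81.98% against 81.44%).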
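
The effect of stop-word removal on bigrams can be seen on a single sentence. The sketch below builds bigrams by hand; the `ngrams` helper and the tiny stop list are assumptions made for illustration (the English stop list shipped with scikit-learn is much longer, and stop words are filtered before the n-grams are built).

```python
def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tiny illustrative stop list (assumption)
STOP_WORDS = {"this", "was", "not", "a", "the"}

tokens = "this movie was not good".split()

kept = ngrams(tokens, 2)
filtered = ngrams([t for t in tokens if t not in STOP_WORDS], 2)

print(kept)
print(filtered)
```

With stop words kept, the bigram "not good" is available as a negative feature; once they are removed, only "movie good" remains, which looks positive.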

## Part 3: Stemming & Lemmatization


In [None]:
! python -m spacy download en_core_web_sm

### Lemmatization

Let's do a quick test

In [41]:
# Setup spacy
import spacy
nlp = spacy.load('en_core_web_sm')

# Take the first 20 words of the first training review as an example sentence
test_list = dataset['train']['text'][0].split()[:20]
test_sentence = ' '.join(test_list)

# Run the spaCy pipeline, which lemmatizes each token
doc = nlp(test_sentence)

print(f'Original Sentence: {test_sentence}')
print()
# Show only the tokens whose lemma differs from the surface form
for token in doc:
    if token.text != token.lemma_:
        print(f'Original : {token.text}, New: {token.lemma_}')

Original Sentence: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was

Original : rented, New: rent
Original : AM, New: be
Original : CURIOUS, New: curious
Original : surrounded, New: surround
Original : was, New: be


In [42]:
def lemma_preprocessor(x_list: List[str]) -> List[str]:
    """
    Preprocessing function to lowercase and remove punctuation
    of a list of string and lemmatize each string.
    
    Args:
        x_list: List of strings
    
    Returns:
        List of preprocessed strings.
    """
    # Build the translation table once instead of once per string
    table = str.maketrans("", "", punctuation)
    no_punc_lower = [x.lower().translate(table) for x in x_list]
    res = []
    # Reuse the module-level `nlp` model instead of reloading it on every call;
    # nlp.pipe processes the documents in batches
    for doc in nlp.pipe(no_punc_lower):
        res.append(' '.join(token.lemma_ for token in doc))
    return res

In [47]:
print(dataset['train']['text'][:10])
lemma_preprocessor(dataset['train']['text'][:10])

['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, e

['I rent I be curiousyellow from my video store because of all the controversy that surround it when it be first release in 1967 I also hear that at first it be seize by us customs if it ever try to enter this country therefore be a fan of film consider controversial I really have to see this for myselfbr br the plot be center around a young swedish drama student name lena who want to learn everything she can about life in particular she want to focus her attention to make some sort of documentary on what the average swede think about certain political issue such as the vietnam war and race issue in the united states in between ask politician and ordinary denizen of stockholm about their opinion on politic she have sex with her drama teacher classmate and married menbr br what kill I about I be curiousyellow be that 40 year ago this be consider pornographic really the sex and nudity scene be few and far between even then its not shoot like some cheaply make porno while my countryman mi

### Stemming

In [43]:
import nltk
from nltk.stem import PorterStemmer
nltk.download("punkt")

# Initialize Python porter stemmer
ps = PorterStemmer()

# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed"]

# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in example_words:
    print ("{0:20}{1:20}".format(word, ps.stem(word)))

--Word--            --Stem--            
program             program             
programming         program             
programer           program             
programs            program             
programmed          program             


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/quentinfisch/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [53]:
def stem_preprocessor(x_list: List[str]) -> List[str]:
    """
    Preprocessing function to stem each string.
    
    Args:
        x_list: List of strings
    
    Returns:
        List of preprocessed strings.
    """
    res = []
    ps = PorterStemmer()
    # Reuse the module-level `nlp` model (used here only as a tokenizer)
    # instead of reloading it on every call
    for doc in nlp.pipe(x_list):
        res.append(' '.join(ps.stem(token.text) for token in doc))
    return res

In [54]:
example_words = ["program","programming","programer","programs","programmed"]
stem_preprocessor(example_words)

['program', 'program', 'program', 'program', 'program']

Both work well. Now let's try to use lemmatization in our pipeline.

In [68]:
# Use lemma_preprocessor on the first 10 training documents.
# Note: train_raw was already lowercased and stripped of punctuation in place by preprocess()
preprocessed_train_lemma = lemma_preprocessor(train_raw["document"][:10])
preprocessed_train_lemma

['I rent I be curiousyellow from my video store because of all the controversy that surround it when it be first release in 1967   I also hear that at first it be seize by u s   custom if it ever try to enter this country   therefore be a fan of film consider   controversial   I really have to see this for myself the plot be center around a young swedish drama student name lena who want to learn everything she can about life   in particular she want to focus her attention to make some sort of documentary on what the average swede think about certain political issue such as the vietnam war and race issue in the united states   in between ask politician and ordinary denizen of stockholm about their opinion on politic   she have sex with her drama teacher   classmate   and married man what kill I about I be curiousyellow be that 40 year ago   this be consider pornographic   really   the sex and nudity scene be few and far between   even then its not shoot like some cheaply make porno   wh

In [63]:
from sklearn.preprocessing import FunctionTransformer

def sklearn_naive_bayes_lemma(d_train: pd.DataFrame, pipeline_params: dict = {}) -> Pipeline:
    """
    Train a Naive Bayes classifier using sklearn with lemmatization.

    Parameters
    ----------
    d_train : pd.DataFrame
        The training dataset
    pipeline_params : dict, optional
        The parameters of the pipeline, by default {}

    Returns
    -------
    Pipeline
        The trained pipeline
    """
    # create pipeline with lemmatization, vectorizer and classifier
    pipeline = Pipeline([
        ('lemmatizer', FunctionTransformer(lemma_preprocessor)),
        ('vectorizer', CountVectorizer()),
        ('classifier', MultinomialNB())
    ])
    pipeline.set_params(**pipeline_params)

    # train the model
    pipeline.fit(d_train["document"], d_train["class"])
    return pipeline

In [69]:
pipeline_lemma = sklearn_naive_bayes_lemma(train_raw[::50])
predictions_lemma = test_sklearn_naive_bayes(pipeline_lemma, test_raw[::50])

Sklearn Naive Bayes Accuracy Score ->  76.2
Sklearn Naive Bayes Precision Score ->  83.58974358974359
Sklearn Naive Bayes Recall Score ->  65.2
