# Introduction to Natural Language Processing Catch-up 1
 
Authors :
 * Tony George

## Introduction

*(Copied from subject)*

In this small project, you will code a sentiment classifier using the naive Bayes methods seen in class and compare it with the FastText library. There are a few theoretical questions to answer as well.

Please, read the full assignment before starting.

For coding standards, please respect the following guidelines

* Use docstring format to describe your functions and their arguments.
* Use typing.
* Have clear and descriptive variable names (not x, x1, x2, xx, another_x, ...).
* Make your results reproducible (force random seed values when necessary and possible).
* Don't hesitate to comment in detail any part of the code you consider complex or hard to read.

Do not hesitate to contact me if you have any questions, but please don't wait until the last moment to do so.

### Imports

Using conda with python version 3.8. A conda yml should be available if I didn't forget to generate it.


In [226]:
# TODO : Generate conda yml
import numpy as np
import matplotlib.pyplot as plt

### Forcing seed

This helps reproducibility; feel free to play with it!

In [227]:
from typing import Dict, List, Tuple
np.random.seed(42)

## The dataset

Using HuggingFace's version of the IMDB dataset, as required by the subject.

### Importing from HuggingFace

In [228]:
from datasets import load_dataset, load_dataset_builder

ds_info = load_dataset_builder("imdb")
print("Desc :", ds_info.info.description)
print("Features :", ds_info.info.features)
print("Splits :", ds_info.info.splits)

Desc : Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Features : {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
Splits : {'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')}


In [229]:
# Cheating a bit by downloading the DS now
ds_train    = load_dataset("imdb", split = "train")
ds_test     = load_dataset("imdb", split = "test")

Reusing dataset imdb (C:\Users\Griffures\.cache\huggingface\datasets\imdb\plain_text\1.0.0\e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Reusing dataset imdb (C:\Users\Griffures\.cache\huggingface\datasets\imdb\plain_text\1.0.0\e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


### Dataset Exploration

#### How many splits does the dataset have?

In [230]:
# Using the builder to get info on splits.
ds_info.info.splits.keys()

dict_keys(['train', 'test', 'unsupervised'])

HuggingFace's version of the IMDB dataset has 3 different splits, though we will only use the first two.

The *train* and *test* splits are your standard ML splits, while the *unsupervised* split contains unlabelled data for those willing to gain more volume at the cost of some work.

#### How big are these splits?

In [231]:
# This small lib is great to convert hard numbers into a human-readable format.
from humanize import naturalsize
from datasets import SplitInfo

def split_info_on(sp_info : SplitInfo) -> str:
    """
    :param sp_info: The SplitInfo object from HuggingFace's datasets lib.
    :return: Human readable string with a quick description on the object.
    """
    return f"""Split {sp_info.name} contains {sp_info.num_examples} examples ({naturalsize(sp_info.num_bytes)})."""

# Coming from Scala, having to wrap a map object in Python makes me sad.
# You will see further down that my mindset is heavily ~~~corrupted~~~ inspired by MapReduce.
list(map(split_info_on, ds_info.info.splits.values()))

['Split train contains 25000 examples (33.4 MB).',
 'Split test contains 25000 examples (32.7 MB).',
 'Split unsupervised contains 50000 examples (67.1 MB).']

In terms of size, we have a 50/50 split between labelled and unlabelled data, and another 50/50 between the *train* and *test* splits for the former.

#### What is the proportion of each class on the supervised splits?

In [232]:
from datasets import Dataset
from typing import Dict

def class_histogram(ds : Dataset) -> Dict[str, int]:
    """
    :param ds: Dataset object to build the class histogram of.
    :return: Dict mapping each class label to its number of examples.
    """
    labels = ds.info.features['label'].names
    # Extracting the label from the example, then counting occurrences using np.unique()
    _, counts = np.unique(ds['label'], return_counts = True)
    return dict(zip(labels, counts))

In [233]:
print("Train :", class_histogram(ds_train))
print("Test :", class_histogram(ds_test))

Train : {'neg': 12500, 'pos': 12500}
Test : {'neg': 12500, 'pos': 12500}


Neither split is biased toward one class or the other: again, a perfect 50/50 split.
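The count above pairs label names with the output of `np.unique`, which only lines up when every class appears at least once. As a hedged alternative, here is a minimal pure-Python sketch that stays aligned even when a class is missing. The function name `class_histogram_from_ids` and the toy label lists are hypothetical and only serve the illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def class_histogram_from_ids(label_ids: Sequence[int], label_names: List[str]) -> Dict[str, int]:
    """
    :param label_ids: Integer class ids, one per example.
    :param label_names: Class names, indexed by class id.
    :return: Dict mapping each class name to its number of examples (0 if absent).
    """
    counts = Counter(label_ids)
    # Iterating over the names (not over the observed ids) keeps absent classes in the result.
    return {name: counts.get(class_id, 0) for class_id, name in enumerate(label_names)}


# Toy data, not taken from the IMDB dataset.
balanced_histogram = class_histogram_from_ids([0, 0, 1, 0, 1, 1], ["neg", "pos"])
one_sided_histogram = class_histogram_from_ids([1, 1], ["neg", "pos"])
print(balanced_histogram)   # {'neg': 3, 'pos': 3}
print(one_sided_histogram)  # {'neg': 0, 'pos': 2}
```

On the IMDB splits both approaches give the same result, since both classes are present.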

## Naive Bayes classifier

### Pretreatment

Using the standard NLP preprocessing workflow, up to stemming.

I do not lemmatize because my pretreatment function is simple and tends to butcher words
(e.g. "you'll" becomes ["you", "'ll"], and the WordNet lemmatizer does not recognize the second token as "will").


In [234]:
# Downloading NLTK (we will use it below anyway).
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import LancasterStemmer

stops_en = set(stopwords.words('english'))
lancaster = LancasterStemmer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Griffures\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Griffures\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [235]:
def pretreatment(text : str) -> str:
    """
    :param text: Raw document to clean.
    :return: Space-joined stems of the relevant tokens.
    """
    # Lowercasing, then tokenizing
    raw_words = word_tokenize(text.lower())

    # Filter out : punctuation & stopwords
    is_relevant = lambda word : not (all(map(lambda c: c in punctuation, list(word))) # Removing punctuation-only words
                                     or word in stops_en) # BONUS : Removing stopwords
    filtered_words = list(filter(is_relevant, raw_words))

    # BONUS : Stemming (Lancaster), not lemmatization
    stemmed_words = list(map(lancaster.stem, filtered_words))

    return ' '.join(stemmed_words)

In [236]:
# Select an example here
ex_preprocessing = ds_test[0]['text']

print("Example with :", ex_preprocessing)
print("Which gives :", pretreatment(ex_preprocessing))

Example with : I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
Which gives : went saw movy last night coax friend min 'll admit reluct see knew ashton kutch abl comedy wrong kutch play charact jak fisch wel kevin costn play ben randal profess sign good 
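The `'ll` token that survives in the output above is consistent with how the filter works: a token is dropped only if it is made of punctuation characters exclusively, or if it is a stopword. Below is a minimal sketch of that rule in isolation. The tiny `toy_stopwords` set and the helper names are hypothetical stand-ins (the notebook uses NLTK's English list), so this only illustrates the logic.

```python
from string import punctuation
from typing import List, Set

# Hypothetical, tiny stopword set used only for this illustration.
toy_stopwords: Set[str] = {"i", "will", "the", "it"}


def is_punctuation_only(token: str) -> bool:
    """Return True when every character of the token is a punctuation mark."""
    return all(character in punctuation for character in token)


def keep_relevant_tokens(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Drop punctuation-only tokens and stopwords, keep everything else."""
    return [token for token in tokens if not (is_punctuation_only(token) or token in stopwords)]


toy_tokens = ["i", "'ll", "admit", "it", "...", "!"]
kept_tokens = keep_relevant_tokens(toy_tokens, toy_stopwords)
print(kept_tokens)  # ["'ll", 'admit']
```

`'ll` contains letters, so it is not punctuation-only, and it is kept unless the stopword list contains that exact string.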

### Implementing the model

So as not to rely too much on sklearn, and to use it only to save time,
we override the preprocessing function of the CountVectorizer with our own and disable its stopwords list.

The model actually takes a hit from this workflow, both in speed (tokenizing happens twice) and in accuracy (a well-configured CountVectorizer could do everything on its own).
However, it demonstrates a better understanding of the notions at play.
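To make explicit what the wrapped classifier estimates, here is a minimal from-scratch sketch of a multinomial naive Bayes with add-one (Laplace) smoothing, the same kind of smoothed estimate that `MultinomialNB` uses with its default `alpha = 1.0`. It is not the implementation used in this notebook: the function names and the four-document toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_multinomial_nb(
    tokenized_documents: List[List[str]], labels: List[str], smoothing: float = 1.0
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """
    :param tokenized_documents: Documents already split into tokens.
    :param labels: One class label per document.
    :param smoothing: Additive smoothing constant (1.0 = Laplace).
    :return: (log prior per class, log likelihood per class and token)
    """
    vocabulary = {token for document in tokenized_documents for token in document}
    token_counts: Dict[str, Counter] = defaultdict(Counter)
    for document, label in zip(tokenized_documents, labels):
        token_counts[label].update(document)

    label_counts = Counter(labels)
    log_priors = {label: math.log(count / len(labels)) for label, count in label_counts.items()}
    log_likelihoods: Dict[str, Dict[str, float]] = {}
    for label in label_counts:
        # Smoothed denominator: tokens seen in the class + smoothing for every vocabulary entry.
        denominator = sum(token_counts[label].values()) + smoothing * len(vocabulary)
        log_likelihoods[label] = {
            token: math.log((token_counts[label][token] + smoothing) / denominator)
            for token in vocabulary
        }
    return log_priors, log_likelihoods


def predict_class(
    document: List[str], log_priors: Dict[str, float], log_likelihoods: Dict[str, Dict[str, float]]
) -> str:
    """Return the class with the highest log posterior; unknown tokens are ignored."""
    scores = {
        label: log_priors[label]
        + sum(log_likelihoods[label][token] for token in document if token in log_likelihoods[label])
        for label in log_priors
    }
    return max(scores, key=lambda label: scores[label])


# Toy corpus of stems, not taken from IMDB.
toy_documents = [["good", "film"], ["great", "act"], ["bad", "film"], ["bad", "act"]]
toy_labels = ["pos", "pos", "neg", "neg"]
toy_priors, toy_likelihoods = train_multinomial_nb(toy_documents, toy_labels)
print(predict_class(["bad", "movy"], toy_priors, toy_likelihoods))   # neg
print(predict_class(["good", "film"], toy_priors, toy_likelihoods))  # pos
```

The per-class values in `toy_likelihoods` play the same role as the `feature_log_prob_` attribute inspected further down.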

In [237]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy.typing as npt
from typing import Tuple


class CustomNaiveBayes:
    """
    Wrapper class to articulate sklearn's CountVectorizer and MultinomialNB models.
    """
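    def __init__(self) -> None:
        # Instance attributes, so that two models never share a vectorizer or a classifier.
        self.vectorizer = CountVectorizer(preprocessor = pretreatment, stop_words = None)
        self.clf = MultinomialNB()
        self.labels = []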

    def fit(self, ds : Dataset) -> None:
        """
        Wrapper function to launch a complete train workflow from a HF's dataset
        :param ds: HuggingFace DataSet object to train on.
        :return: Nothing.
        """
        # Extracting labels for predict_label function
        self.labels = ds.features['label'].names
        # Shuffling and extracting features.
        # Dataset.shuffle returns a new dataset, so the result has to be reassigned.
        ds = ds.shuffle(seed = 42)
        X_as_docs, y_as_ints = self.DS_to_XnY(ds)
        X_as_features = self.vectorizer.fit_transform(X_as_docs)
        # Actual training
        self.clf.fit(X_as_features, y_as_ints)

    def predict(self, X_as_docs : npt.ArrayLike) -> npt.NDArray[np.int_]:
        """
        Extract features then launch model's predictions on a list of documents
        :param X_as_docs: List of documents (corpus)
        :return: Predicted classes as ints
        """
        X_as_features = self.vectorizer.transform(X_as_docs)
        return self.clf.predict(X_as_features)

    def predict_label(self, X_as_docs : npt.ArrayLike) -> npt.NDArray[np.str_]:
        """
        Same as predict, but returns the class labels
        :param X_as_docs: List of documents (corpus)
        :return: Predicted classes as strings
        """
        predictions = self.predict(X_as_docs)
        # np.array infers the string width from the labels, which np.fromiter does not do for a bare str dtype.
        return np.array([self.labels[y] for y in predictions])

    def score(self, ds : Dataset) -> float:
        """
        Evaluate the model with a HF's dataset.
        :param ds: HuggingFace DataSet object to evaluate upon.
        :return: Accuracy as float (0~1).
        """
        X_as_docs, y_as_ints = self.DS_to_XnY(ds)
        X_as_features = self.vectorizer.transform(X_as_docs)
        return self.clf.score(X_as_features, y_as_ints)

    def DS_to_XnY(self, ds : Dataset) -> Tuple[npt.ArrayLike, npt.ArrayLike]:
        """
        Transforms a HuggingFace dataset object to a more usual X and y tuple for fitting.
        :param ds: Dataset to convert
        :return: Tuple : X as list[str] and Y as list[int], both of the same len.
        """
        raw_corpus = np.array(ds['text'])
        targets = np.array(ds['label'])
        return raw_corpus, targets


### Training


In [238]:
customNB = CustomNaiveBayes()
customNB.fit(ds_train)

Loading cached shuffled indices for dataset at C:\Users\Griffures\.cache\huggingface\datasets\imdb\plain_text\1.0.0\e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a\cache-9d17e850e5680b7a.arrow


### Reporting accuracy

In [239]:
customNB_acc = customNB.score(ds_test)
print("Accuracy :", customNB_acc)

Accuracy : 0.81168


#### Is accuracy sufficient here?

Accuracy is a sufficient metric here, for several reasons:

* With only 2 classes, a confusion matrix adds little: the two classes mirror each other (p versus 1 - p).
* Given the nature of the task, neither type 1 nor type 2 errors matter more than the other, so focusing on precision or recall alone is not meaningful.
* Combined with perfectly balanced classes, this makes the F1-score (the harmonic mean of precision and recall) very close to accuracy.
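The last point can be checked with a short worked example. The confusion counts below are hypothetical (chosen for a balanced 25,000-example test set, not measured on the model above). They suggest that F1 equals accuracy when both error types are equally frequent, and drifts only slightly when they are not.

```python
from typing import Dict


def binary_metrics(
    true_positives: int, false_positives: int, true_negatives: int, false_negatives: int
) -> Dict[str, float]:
    """
    :return: Accuracy, precision, recall and F1 computed from the four confusion counts.
    """
    total = true_positives + false_positives + true_negatives + false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return {
        "accuracy": (true_positives + true_negatives) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 12,500 positives and 12,500 negatives in both scenarios.
symmetric_errors = binary_metrics(true_positives=10000, false_positives=2500,
                                  true_negatives=10000, false_negatives=2500)
skewed_errors = binary_metrics(true_positives=11500, false_positives=4000,
                               true_negatives=8500, false_negatives=1000)

for name, metrics in [("symmetric", symmetric_errors), ("skewed", skewed_errors)]:
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})
```

In the symmetric scenario all four metrics equal 0.8. In the skewed one accuracy is still 0.8 while F1 rises to about 0.821, so the two stay close but are not strictly interchangeable.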

### Top 10 important features from each class

Stop words were already removed during pretreatment, so they do not clutter the lists below! :)

In [240]:
def top_10_features_per_class(model: CustomNaiveBayes) -> Dict:
    """
    Return the top ten heaviest features for each class according to the model
    :param model: the model to extract weights from
    :return: Dictionary : Class Label -> Features (feature(stem) -> log_value)
    """
    # Using a custom dtype to preserve info on features' names.
    feature_log_t = [("feature", '<U50'), ("log_value", float)]
    # Just stacks the features names (stems) on top of the logs array
    feature_stack = lambda logs : np.array(list(zip(model.vectorizer.get_feature_names_out(), logs)), dtype = feature_log_t)
    # Sort the array, take the 10 last value, then flip it to get our top 10
    sort_10_biggest = lambda arr : np.flip(np.sort(arr, order="log_value")[-10:])

    # Applying the functions in the correct order
    logs_with_features = [sort_10_biggest(feature_stack(logs)) for logs in model.clf.feature_log_prob_]

    return dict(zip(model.labels, map(dict, logs_with_features)))

In [241]:
# Simple extraction of the above dictionary
for label, features in top_10_features_per_class(customNB).items():
    print(f"Top 10 features (stems) in {label.upper()} :")
    print('\n'.join([f"  * {feature:10}{log:2.3}" for feature, log in features.items()]))

Top 10 features (stems) in NEG :
  * br        -3.41
  * movy      -4.04
  * film      -4.25
  * on        -4.78
  * act       -4.79
  * lik       -4.87
  * ev        -4.89
  * real      -5.09
  * mak       -5.27
  * bad       -5.3
Top 10 features (stems) in POS :
  * br        -3.51
  * film      -4.18
  * movy      -4.32
  * on        -4.76
  * act       -5.05
  * lik       -5.06
  * real      -5.09
  * ev        -5.2
  * tim       -5.31
  * good      -5.35
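The log-probabilities printed above mostly reflect how frequent a stem is overall, which is why the two lists look so alike. One hedged way to surface class-specific stems is to rank them by the difference of their log-probabilities between classes. The sketch below does so for the eight stems that appear in both lists, using the rounded values printed above. The function name is hypothetical, and on the real model one would apply the same subtraction to the two rows of `feature_log_prob_`.

```python
from typing import Dict, List, Tuple

# Rounded values copied from the output above (stems present in both top 10 lists).
neg_log_probs: Dict[str, float] = {
    "br": -3.41, "movy": -4.04, "film": -4.25, "on": -4.78,
    "act": -4.79, "lik": -4.87, "ev": -4.89, "real": -5.09,
}
pos_log_probs: Dict[str, float] = {
    "br": -3.51, "film": -4.18, "movy": -4.32, "on": -4.76,
    "act": -5.05, "lik": -5.06, "real": -5.09, "ev": -5.2,
}


def rank_by_log_ratio(
    positive: Dict[str, float], negative: Dict[str, float]
) -> List[Tuple[str, float]]:
    """
    :return: Shared stems with log P(stem | pos) - log P(stem | neg),
             sorted from most NEG-leaning to most POS-leaning.
    """
    shared_stems = positive.keys() & negative.keys()
    log_ratios = {stem: positive[stem] - negative[stem] for stem in shared_stems}
    return sorted(log_ratios.items(), key=lambda item: item[1])


ranked_stems = rank_by_log_ratio(pos_log_probs, neg_log_probs)
for stem, log_ratio in ranked_stems:
    print(f"  * {stem:10}{log_ratio:+.2f}")
```

With these values, `ev`, `movy` and `act` lean toward NEG while `film` leans slightly toward POS. All differences are small, which matches the observation that the shared stems carry little class information.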


Analysis :

Prevalent stems common in both classes (8/10) :
* br
* movy
* film
* on
* act
* lik
* real
* ev

Prevalent NEG stems :
* mak
* bad

Prevalent POS stems :
* tim
* good

A great number of stems are common to both classes, which suggests that the model would benefit from more context. (`br` most likely comes from the `<br />` HTML tags left in the raw reviews.)
Just giving the CountVectorizer 2-grams would help it tell the difference between 'bad' and 'not bad', 'good act' and 'bad act', etc.

However, 81% for such a simple model is already not bad at all! (And n-grams would explode the size of our vocabulary.)
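Both effects mentioned here (added context and vocabulary growth) can be seen on a toy example. The sketch below is pure Python with a hypothetical helper `word_ngrams` and a made-up four-review corpus. With sklearn, the equivalent setting would be the `ngram_range = (1, 2)` parameter of `CountVectorizer`. One assumption to flag: the sketch keeps negations, whereas NLTK's English stopword list contains 'not', so the pretreatment used above would have to spare negations for 'not bad' to reach the vectorizer.

```python
from typing import List, Set


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """
    :param tokens: Tokens of one document, in order.
    :param n: Size of the n-grams to build.
    :return: All contiguous n-grams, each joined with a space.
    """
    return [" ".join(tokens[start:start + n]) for start in range(len(tokens) - n + 1)]


# Toy corpus of stems, not taken from IMDB; negations are assumed to be kept.
toy_reviews = [["not", "bad", "act"], ["bad", "act"], ["good", "act"], ["not", "good", "film"]]

unigram_vocabulary: Set[str] = {token for review in toy_reviews for token in review}
bigram_vocabulary: Set[str] = {gram for review in toy_reviews for gram in word_ngrams(review, 2)}

print(word_ngrams(toy_reviews[0], 2))                # ['not bad', 'bad act']
print(len(unigram_vocabulary))                       # 5
print(len(unigram_vocabulary | bigram_vocabulary))   # 10
```

The bigram 'not bad' now exists as its own feature, at the price of a vocabulary that doubles even on this tiny corpus.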

### Digging up errors

In [259]:
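def get_errors(model : CustomNaiveBayes, ds : Dataset, n : int) -> npt.NDArray[np.str_]:
    """
    Return n documents from the shuffled dataset that the model misclassifies.
    :param n: Number of misclassified documents to return.
    """
    X_as_docs, Y = model.DS_to_XnY(ds.shuffle()[n * 100:])
    y_pred = model.predict(X_as_docs)
    errors = X_as_docs[y_pred != Y]
    return errors[:n]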

In [260]:
for i, error in enumerate(get_errors(customNB, ds_test, 5)):
    print(f"{i + 1}. {error}")

1. (the description of the mood of the movie may be considered as a spoiler - because there is not much action in fact)<br /><br />Great one...<br /><br />Is it for my peculiar interest for the dystopias and utopias? Is it for the atmosphere of the movie. Or is there some more magic? If yes, it is for sure the utmost human one...<br /><br />This film is, no doubt, extremely artistic/artificial (depends on taste). I can imagine most of the people who hate to watch slow movies (and those of Tsai Ming Liang (who I didn't enjoy other times) are one of the slowest that I know), suffering during the movie. Yes, people are unable to slow down and to let time pass - and to watch it without feeling they waste it. One can take this piece as torture or as a therapy...<br /><br />The topic at the surface? The lack of communication - even if we live in rabbit cages - one next to each other - but not really together? People are tired, sick of something and unable to describe it - just don't want to 

Analysis

We will focus on the 1st and 5th documents.

In the first one, the author enjoyed the movie but also understands why other people would dislike it, and explains why.
The model got stuck on phrases like "most of the people who hate to watch slow movies", without understanding that the dark mood of the movie actually resonates with the reviewer.
This is therefore a false negative.

The fifth one is, on the contrary, a false positive. The author is actually let down by the movie, which according to him does not do justice to the match of the *greatest* chess player, Kasparov.
The model mistakes all the superlatives for praise of the film, while in reality the reviewer explains why he is let down.

In both cases, negative (respectively positive) words are used in a positive (respectively negative) context, which tricks the model as it has no notion of context.

## FastText

### Dataset conversion

In [244]:
# TODO

In [245]:
# Also modifying pretreatment

### Training

#### With default hypers

#### Hypers search functionality

In [246]:
# Dataset split

In [247]:
# Search

### Model differences

In [248]:
# Attributes

### Digging up errors

Analysis

### MINN and MAXN nullified ?

Analysis

## Theoretical questions

Answer the following questions.

1. Explain with your own words, using a short paragraph for each, what are:
   * Phonetics and phonology
   * Morphology and syntax
   * Semantics and pragmatics

```
Answer
```

2. What is the difference between stemming and lemmatization?
   * How do they both work?
   * What are the pros and cons of both methods?

```
Answer
```

3. On logistic regression:
   * How does stochastic gradient descent work?
   * What is the role of the learning rate?
   * Will it always find the global minimum?

```
Answer
```

4. What problems does TF-IDF try to solve?
   * What is the TF part for?
   * What is the IDF part for?

```
Answer
```

5. Summarize how the skip-gram method of Word2Vec works using a couple of paragraphs.
   * How does it use the fact that _two words appearing in similar contexts are likely to have similar meanings_?

```
Answer
```