# Lab 02
## Introduction
This project's goal is to code a sentiment classifier on the IMDB sentiment dataset. The IMDB sentiment [dataset](https://huggingface.co/datasets/imdb) is a collection of 50K movie reviews, annotated as positive or negative, and split into two sets of equal size: a training set and a test set. Both sets have an equal number of positive and negative reviews.

## The dataset

In [1]:
from datasets import load_dataset


dataset = load_dataset("imdb")

Found cached dataset imdb (C:/Users/junyi/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)



In [2]:
dataset.num_rows

{'train': 25000, 'test': 25000, 'unsupervised': 50000}

### 1. How many splits does the dataset have?
#### The dataset has 3 splits: train, test and unsupervised.

### 2. How big are these splits?
#### They contain 25,000, 25,000 and 50,000 examples respectively.

### 3. What is the proportion of each class in the supervised splits?

In [3]:
positiveTrain = dataset['train'].filter(lambda example: example['label'] == 1).num_rows
positiveTest = dataset['test'].filter(lambda example: example['label'] == 1).num_rows
print("In the train and test splits, there are respectively "+ str(positiveTrain) + "/25000 and " + str(positiveTest) + "/25000 positive ratings")


Loading cached processed dataset at C:\Users\junyi\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-50ca35f45cb9d15a.arrow
Loading cached processed dataset at C:\Users\junyi\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-3cd99f281cbb7e3f.arrow


In the train and test splits, there are respectively 12500/25000 and 12500/25000 positive ratings


#### Both supervised splits have an equal number of positive and negative reviews: 12,500 of each, i.e. 50% per class.
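
The same check can be written without hard-coding the split size. The sketch below is a minimal, self-contained version: `class_proportions` is a hypothetical helper, and the toy `labels` list is an assumption standing in for `dataset['train']['label']`.

```python
from collections import Counter


def class_proportions(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels standing in for dataset['train']['label'] (assumption for illustration)
labels = [0, 0, 1, 1, 0, 1]
proportions = class_proportions(labels)
print(proportions)
```

Passing the real label column instead of the toy list should report the proportion of each class in that split.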

## Naive Bayes classifier
### 1. Processing function

In [4]:
import string
def process(txt: str) -> str:
    """
    Converts all uppercase letters to lowercase, replaces all punctuation marks with spaces, and returns the processed string.
    """
    
    lowercase_txt = txt.lower()
    
    # create a translation table using maketrans method
    replace_punctuation = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    
    # use the translate method to replace the punctuation
    processed_txt = lowercase_txt.translate(replace_punctuation)
    return processed_txt
process("What's your name? I'm Ba-yes")

'what s your name  i m ba yes'

### 2. Our Naive Bayes

In [5]:
from collections import Counter
from collections import defaultdict

import math

def train_naive_bayes(documents, classes: list[int], process) -> tuple[dict[int, float], dict[tuple[str, int], float], list[str], dict[int, set[str]]]:
    """
    Trains a Naive Bayes classifier on a list of labeled documents.

    Args:
    - documents: An iterable of dictionaries with a 'text' and a 'label' key, one per document.
    - classes (list): A list of the possible class labels (0 and/or 1).
    - process (callable): The preprocessing function applied to each text before it is split into words.

    Returns:
    - log_prior (dict): A dictionary containing the log prior probabilities for each class.
    - log_likelihood (dict): A defaultdict containing the log likelihood probabilities
    for each word in the vocabulary given each class, keyed by (word, class).
    - vocabulary (list): A sorted list of the purely alphabetic words seen in the training documents.
    - class_word_sets (dict): A dictionary containing the set of words seen in each class.
    """
    n_doc = len(documents)
    log_prior = {}
    whole_vocabulary = set([word for d in documents for word in process(d['text']).split()])
    vocabulary = sorted({s for s in whole_vocabulary if s.isalpha()})
    log_likelihood = defaultdict(lambda: math.log(1/len(vocabulary)))
    word_counts = {}
    class_word_sets = {}
    for c in classes:
        # Calculate P(c) terms
        n_c = len([d for d in documents if d['label'] == c])
        log_prior[c] = math.log(n_c / n_doc)

        # Calculate P(w|c) terms with add-one smoothing
        bigdoc = [process(d['text']).split() for d in documents if d['label'] == c]
        word_counts[c] = Counter([word for doc in bigdoc for word in doc])
        class_word_sets[c] = set(word_counts[c].keys())
        total_count = sum(word_counts[c].values())
        for word in vocabulary:
            count_w_c = word_counts[c][word]
            log_likelihood[(word, c)] = math.log((count_w_c + 1) / (total_count + len(vocabulary)))

    return log_prior, log_likelihood, vocabulary, class_word_sets


def test_naive_bayes(test_txt, log_prior, log_likelihood, classes, vocabulary, class_word_sets):
    """
    Predicts the label of a given test document using a trained Naive Bayes classifier.
    """
    # Calculate sum[c] for each class c
    sum_c = {}
    test_words = set(test_txt.split())
    for c in classes:
        if test_words & class_word_sets[c]:
            sum_c[c] = log_prior[c]
            for word in test_words & class_word_sets[c]:
                sum_c[c] += log_likelihood[(word, c)]
    
    # Return the class with highest sum[c]
    return max(sum_c, key=sum_c.get)

def test_accuracy(test_set, log_prior, log_likelihood, classes, vocabulary, class_word_sets, process):
    """
    Tests the accuracy of a trained Naive Bayes classifier on a given test set.

    Returns:
    - accuracy: The accuracy of the classifier on the test set as a fraction between 0 and 1.
    """
    true = 0
    total = 0
    for test_doc in test_set:
        test_class = test_naive_bayes(process(test_doc["text"]), log_prior, log_likelihood, classes, vocabulary, class_word_sets)
        if test_doc["label"] == test_class:
            true += 1
        total += 1
    return true/total
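
`test_naive_bayes` above scores each class using only the distinct test words that already occur in that class. For comparison, here is a minimal sketch of the textbook multinomial rule, which adds the smoothed log-likelihood of every in-vocabulary token occurrence for every class. The function name `predict_multinomial_nb` and the toy probabilities are assumptions for illustration; this is not the code that produced the accuracies reported in this notebook.

```python
import math


def predict_multinomial_nb(tokens, log_prior, log_likelihood, classes, vocabulary):
    """Return the class maximizing log P(c) + sum of log P(w|c) over all tokens.

    Tokens outside the vocabulary are skipped; repeated tokens count each time.
    """
    scores = {}
    for c in classes:
        score = log_prior[c]
        for word in tokens:
            if word in vocabulary:
                score += log_likelihood[(word, c)]
        scores[c] = score
    return max(scores, key=scores.get), scores


# Toy model with a three-word vocabulary (hypothetical numbers)
vocabulary = {"good", "bad", "film"}
log_prior = {0: math.log(0.5), 1: math.log(0.5)}
log_likelihood = {
    ("good", 0): math.log(0.1), ("good", 1): math.log(0.6),
    ("bad", 0): math.log(0.6), ("bad", 1): math.log(0.1),
    ("film", 0): math.log(0.3), ("film", 1): math.log(0.3),
}

label, scores = predict_multinomial_nb(
    "good good film unknown".split(), log_prior, log_likelihood, [0, 1], vocabulary
)
print(label)
```

Because every class is scored over the same tokens, the scores stay comparable: a class is not favored just because fewer of the test words were seen in it.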

### 3. With scikit-learn

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

sickit_class = Pipeline([
    ('vect', CountVectorizer()),  # Tokenize and vectorize the data
    ('clf', MultinomialNB()),  # Train the naive bayes classifier
])

### 4. Compare

#### Training

In [7]:
log_prior, log_likelihood, vocabulary, class_word_sets = train_naive_bayes(dataset['train'], [0, 1], process)
sickit_class.fit(dataset['train']['text'], dataset['train']['label'])

Pipeline(steps=[('vect', CountVectorizer()), ('clf', MultinomialNB())])

#### Testing Accuracy

In [8]:
sickit_test = sickit_class.score(dataset['test']['text'], dataset['test']['label'])
sickit_train = sickit_class.score(dataset['train']['text'], dataset['train']['label'])
print("The accuracy on both training and test set, for the scikit-learn implementation are respectively " + str(sickit_train) +
      ' and ' + str(sickit_test))
ours_train = test_accuracy(dataset['train'], log_prior, log_likelihood, [0, 1], vocabulary, class_word_sets, process)
ours_test = test_accuracy(dataset['test'], log_prior, log_likelihood, [0, 1], vocabulary, class_word_sets, process)
print("The accuracy on both training and test set, for ours are respectively " + str(ours_train) + ' and ' + str(ours_test))

The accuracy on both training and test set, for the scikit-learn implementation are respectively 0.89812 and 0.81356
The accuracy on both training and test set, for ours are respectively 0.35104 and 0.67752


### 5. Most likely, the scikit-learn implementation will give better results. Looking at the documentation, explain why it could be the case.

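`CountVectorizer` tokenizes the text differently from our `process(txt)` function: by default it lowercases the text and keeps only tokens of two or more alphanumeric characters, so the single-letter fragments that `process` leaves behind (such as the `s` and `m` in `what s` and `i m`) are dropped.

Then, `MultinomialNB` has 2 relevant default parameters:

* alpha: additive (Laplace/Lidstone) smoothing parameter, 1.0 by default.
* fit_prior: whether to learn class prior probabilities from the data, True by default.

Both defaults correspond to what `train_naive_bayes` already does (add-one smoothing and priors estimated from class frequencies), so they do not explain the gap by themselves. The larger difference is at prediction time: `MultinomialNB` scores every class over all the token counts of a document, whereas `test_naive_bayes` only adds the log-likelihoods of the distinct words already seen in each class.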
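
For reference, the additive smoothing that `alpha` controls and the resulting decision rule can be written as follows. The symbols are notation introduced here, not taken from the notebook: $N_{w,c}$ is the count of word $w$ in class $c$, $N_c$ the total word count of class $c$, $|V|$ the vocabulary size and $d$ the document to classify. With $\alpha = 1$ the estimate has the same add-one form as the one computed in `train_naive_bayes`.

$$
\hat{P}(w \mid c) = \frac{N_{w,c} + \alpha}{N_c + \alpha\,|V|},
\qquad
\hat{c} = \arg\max_{c}\Big[\log P(c) + \sum_{w \in d} \log \hat{P}(w \mid c)\Big]
$$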

### 6. Why is accuracy a sufficient measure of evaluation here?

Both supervised splits are perfectly balanced (50% positive, 50% negative), so a classifier cannot reach a high accuracy by always predicting the majority class: the trivial baseline is 50%, and accuracy weights both classes equally.
On imbalanced datasets, or when the two kinds of misclassification have different costs, other measures such as precision, recall and F1-score are more informative.
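
To make the contrast concrete, here is a small sketch of how accuracy can mislead on an imbalanced set. `binary_metrics` is a hypothetical helper and the labels are toy data, not taken from IMDB.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Imbalanced toy set: 9 negatives, 1 positive, and a model that always predicts 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The constant predictor reaches 90% accuracy here while never finding the positive example, which recall and F1 expose. On a 50/50 split the same predictor would only reach 50% accuracy.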

### 7. Using one of the implementations, take at least 2 wrongly classified examples from the test set and try explaining why the model failed.

In [9]:
test_texts = dataset['test']['text']
test_labels = dataset['test']['label']
predicted_labels = sickit_class.predict(test_texts)

# Collect the first two misclassified test examples as (text, true label, predicted label)
misclassified = []
for text, true_label, predicted in zip(test_texts, test_labels, predicted_labels):
    if predicted != true_label:
        misclassified.append((text, true_label, predicted))
    if len(misclassified) == 2:
        break

# Print each review followed by its true label and its predicted label
for text, true_label, predicted in misclassified:
    print(text)
    print(true_label, predicted)

Blind Date (Columbia Pictures, 1934), was a decent film, but I have a few issues with this film. First of all, I don't fault the actors in this film at all, but more or less, I have a problem with the script. Also, I understand that this film was made in the 1930's and people were looking to escape reality, but the script made Ann Sothern's character look weak. She kept going back and forth between suitors and I felt as though she should have stayed with Paul Kelly's character in the end. He truly did care about her and her family and would have done anything for her and he did by giving her up in the end to fickle Neil Hamilton who in my opinion was only out for a good time. Paul Kelly's character, although a workaholic was a man of integrity and truly loved Kitty (Ann Sothern) as opposed to Neil Hamilton, while he did like her a lot, I didn't see the depth of love that he had for her character. The production values were great, but the script could have used a little work.
0 1
Ben, (

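#### The model failed because neither review takes a clear position: each one mixes praise and criticism, and a bag-of-words model ignores word order, so it cannot tell which side the reviewer finally comes down on.

#### First review: "The production values were great, but"

#### Second review: "it's mildly enjoyable"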
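
One way to check this kind of explanation is to look at how much each word pushes the score toward one class. The sketch below assumes a `log_likelihood` dictionary keyed by `(word, class)`, like the one returned by `train_naive_bayes`; `word_contributions` is a hypothetical helper and the probabilities are made up for illustration.

```python
import math


def word_contributions(tokens, log_likelihood, vocabulary):
    """Return (word, log P(w|1) - log P(w|0)) pairs, most positive first."""
    contributions = [
        (word, log_likelihood[(word, 1)] - log_likelihood[(word, 0)])
        for word in tokens
        if word in vocabulary
    ]
    return sorted(contributions, key=lambda item: item[1], reverse=True)


# Hypothetical likelihoods, chosen only to illustrate a mixed review
vocabulary = {"great", "decent", "weak", "script"}
log_likelihood = {
    ("great", 0): math.log(0.1), ("great", 1): math.log(0.4),
    ("decent", 0): math.log(0.2), ("decent", 1): math.log(0.3),
    ("weak", 0): math.log(0.4), ("weak", 1): math.log(0.1),
    ("script", 0): math.log(0.3), ("script", 1): math.log(0.2),
}

ranking = word_contributions("decent great script weak".split(), log_likelihood, vocabulary)
total = sum(score for _, score in ranking)
for word, score in ranking:
    print(f"{word}: {score:+.2f}")
```

In this toy case the positive and negative contributions cancel out, so the prediction would hinge on the prior or on a few extra words, which is the situation a review mixing praise and criticism creates.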

## Stemming and Lemmatization


In [10]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
import re

In [11]:
# We need to download a package for word tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\junyi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 1. Add stemming


In [12]:
def stemming(text: str) -> str:
    '''
    Converts all uppercase letters to lowercase, replaces all punctuation marks with spaces and stems the text.
    Returns the processed string.
    '''
    # process() already lowercases the text and replaces punctuation with spaces
    text = process(text)
    re_word = re.compile(r"^\w+$")
    stemmer = SnowballStemmer("english")
    # Keep only tokens made of word characters and reduce each one to its stem
    stemmed = [stemmer.stem(word) for word in word_tokenize(text) if re_word.match(word)]
    return " ".join(stemmed)

text = "At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction."
stemming(text)

'at first histor linguist serv as the cornerston of compar linguist primarili as a tool for linguist reconstruct 5 scholar were concern chiefli with establish languag famili and reconstruct prehistor proto languag use the compar method and intern reconstruct'

### 2. Train and evaluate your model again with this preprocessing.

In [13]:
log_prior, log_likelihood, vocabulary, class_word_sets = train_naive_bayes(dataset['train'], [0, 1], stemming)
ours_test = test_accuracy(dataset['test'], log_prior, log_likelihood, [0, 1], vocabulary, class_word_sets, stemming)
print("The accuracy on the test set, for our implementation with the stemming is " + str(ours_test))

The accuracy on the test set, for our implementation with the stemming is 0.70272


### 3. Are the results better or worse? Try explaining why the accuracy changed.

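#### The results are better: the test accuracy of our implementation rises from 0.67752 to 0.70272. Stemming shrinks the vocabulary because inflected forms of the same word are merged into a single stem.
#### Each stem is therefore seen more often during training, so its likelihood is estimated from more data.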
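
The merging effect can be illustrated on a few words from the sample sentence above. In this sketch the stems are copied from the stemming output shown earlier and hard-coded in a hypothetical `STEMS` mapping, so the block runs without NLTK; the token list is a toy example.

```python
from collections import Counter

# Stems copied from the stemming output shown earlier in this notebook,
# hard-coded so the sketch runs without NLTK
STEMS = {
    "linguistics": "linguist",
    "linguistic": "linguist",
    "reconstruction": "reconstruct",
    "reconstructing": "reconstruct",
    "language": "languag",
    "languages": "languag",
}

tokens = ["linguistics", "linguistic", "reconstruction",
          "reconstructing", "language", "languages", "linguistics"]

raw_counts = Counter(tokens)
stemmed_counts = Counter(STEMS[token] for token in tokens)

print(len(raw_counts), "->", len(stemmed_counts))
print(stemmed_counts)
```

Six distinct surface forms collapse into three stems, and each stem accumulates the counts of all its variants.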