# Introduction to NLP 01 - LAB 02
## The Dataset

In [1]:
from datasets import load_dataset
import pandas as pd
import numpy as np

X_train, X_test, X_unsupervised = load_dataset("imdb", split=["train", "test", "unsupervised"])
print(X_train.shape, X_test.shape, X_unsupervised.shape)

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:00<00:00, 306.05it/s](25000, 2) (25000, 2) (50000, 2)



In [2]:
df_train, df_test, df_unsupervised = pd.DataFrame(X_train), pd.DataFrame(X_test), pd.DataFrame(X_unsupervised)

In [3]:
df_train['label'].value_counts()

0    12500
1    12500
Name: label, dtype: int64

In [4]:
df_test['label'].value_counts()

0    12500
1    12500
Name: label, dtype: int64

In [5]:
df_unsupervised['label'].value_counts()

-1    50000
Name: label, dtype: int64


1. How many splits does the dataset has?  
    There are **3 splits** (train, test, unsupervised) in the dataset.  
2. How big are these splits? 
    - *train* : **25000** samples in total, **12500** positive (1) and **12500** negative (0) samples.
    - *test* : same on test but with different samples
    - *unsupervised* : **50000** samples in total with no labels (-1)
3. What is the proportion of each class on the supervised splits?
    - The proportion of each on train and test supervised splits is **50/50** (12500 for positive label and 12500)


## Naive Bayes Classifier

Implement your own naive Bayes classifier (the pseudo code can be found in the slides or the [book reference](https://web.stanford.edu/~jurafsky/slp3/)) or use [one provided by scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes) combined with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Go through the following steps.
1. (2 points) Take a look at the data and create an adapted preprocessing function with at least:
   i. Lower case the text.
   ii. Replace punctuations with spaces (you can use `from string import punctuation` to ease your work). Think that maybe not all punctuations should be removed or replaced.

In [6]:
df_train.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [7]:
from string import punctuation
import re

def preprocessing(data : str) -> str:
    """
    Preprocess the data by removing punctuation, converting to lowercase and removing extra spaces
    The function does not remove the "-" dash symbol

    For example : son-in-law ≠ son in law

    :param data: string data to preprocess 
    :return: preprocessed string data
    """
    data = data.replace("<br />", "")
    toremove = punctuation.replace("-", "")
    translator = str.maketrans(toremove, ' ' * len(toremove))
    res = data.lower().translate(translator).strip()
    res = re.sub(r'\s+', ' ', res)
    return res

Example on one piece of text

In [8]:
preprocessing(df_train.iloc[0]['text'])

'i rented i am curious-yellow from my video store because of all the controversy that surrounded it when it was first released in 1967 i also heard that at first it was seized by u s customs if it ever tried to enter this country therefore being a fan of films considered controversial i really had to see this for myself the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states in between asking politicians and ordinary denizens of stockholm about their opinions on politics she has sex with her drama teacher classmates and married men what kills me about i am curious-yellow is that 40 years ago this was considered pornographic really the sex and nudity scenes are few and far between even then it s not shot like some cheapl

We then apply it to the train dataset.

In [9]:
df_train["text"] = df_train["text"].apply(lambda x: preprocessing(x))

2. (4 points) Implement your own naive Bayes classifier from scratch. The pseudo code can be found in the slides or the [book reference](https://web.stanford.edu/~jurafsky/slp3/).

In [10]:
def get_count(dataset: pd.DataFrame, binary_class : str) -> pd.DataFrame:
    """
    Get the count of each word in a dataset for a specific class
    :param dataset: dataset to extract the words from
    :param binary_class: the class to extract
    :return: dataframe with the count of each word
    """
    return dataset[dataset["label"] == binary_class]['text'].str.split(expand=True).stack().value_counts()

In [11]:
def get_words(dataset: pd.DataFrame) -> list[str]:
    """
    Get all distinct words from dataset
    :param dataset: pd.DataFrame containing the text column
    :return: list of distinct string words from the corpus
    """
    return list(set(' '.join([word for word in dataset['text']]).split()))

In [12]:
def get_words_of_class(dataset: pd.DataFrame, binary_class: int) -> list[str]:
    """
    Get all distinct words from dataset of with a specific label
    :param dataset: pd.DataFrame containing the text and label columns
    :param binary_class: int representing the class to extract
    :return: list of distinct string with the specified label
    """
    dataset = dataset[dataset["label"] == binary_class]
    return get_words(dataset)

In [13]:
def train_naive_bayes(dataset: pd.DataFrame, classes: list[int]) -> (float, float, list[str]):
    """
    Train a Naive Bayes model based on the algorithms described in the slides and in https://web.stanford.edu/~jurafsky/slp3/4.pdf
    :param dataset: pd.DataFrame containing the text and label columns
    :param classes: list of int representing the binary classes (0 Negative and 1 Positive)
    :return: A tuple containing (logprior as a float, loglikelihood as a float, vocabulary as a list of distinct string)
    """
    vocabulary = []
    logprior = {}
    loglikelihood = {}
    counts = {}

    for binary_class in classes:
        Ndoc = len(dataset)
        Nc = dataset[dataset["label"] == binary_class].shape[0]
        logprior[binary_class] = np.log(Nc / Ndoc)
        vocabulary = get_words(dataset)

        
        counts[binary_class] = get_count(dataset, binary_class)
        total_count = counts[binary_class].sum() + counts[binary_class].shape[0]

        for word in vocabulary:
            if not (word in counts[binary_class]):
                count = 0
            else:
                count = counts[binary_class][word]
            loglikelihood[(word, binary_class)] = np.log((count + 1) / total_count)

    return logprior, loglikelihood, vocabulary

In [14]:
def test_naive_bayes(testdoc : str, logprior : float, loglikelihood : float, classes : list[int], vocabulary : set[str]) -> int:
    """
    Predict the label for the testdoc_words according to the train Naive Bayes model
    :param testdoc_words: Target document string to predict the label
    :param logprior: Logprior of the Naive Bayes model
    :param loglikelihood: Loglikelihood of the Naive Bayes model
    :param classes: list of int representing the binary classes (0 Negative and 1 Positive)
    :param vocabulary: list of distinct string representing the vocabulary
    :return: int representing the predicted label
    """
    testdoc_words = testdoc.split(" ")
    
    confidence_per_class = {}
    
    for binary_class in classes:
        confidence_per_class[binary_class] = logprior[binary_class]
        
        for i in range(len(testdoc_words)):
            word = testdoc_words[i]
            if word in vocabulary:
                confidence_per_class[binary_class] += loglikelihood[(word, binary_class)]
    
    return np.argmax(list(confidence_per_class.values()))

3. (3 points) Implement a naive Bayes classifier using scikit-learn.
   * Use a scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier. You can use other 

In [15]:
X_train, y_train = df_train['text'], df_train['label']

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()), 
    ('multiniomialNB', MultinomialNB())
])

4. (1 point) Report the accuracy on both training and test set, for both your implementation and the scikit-learn one.

In [17]:
from sklearn.metrics import accuracy_score

In [18]:
X_test, y_test = df_test.text.apply(lambda x: preprocessing(x)), df_test['label']

Training the home-made naive Bayes classifier

*We need to convert the list vocabulary to a set in order to get a $O(1)$ complexity on element existence*

In [19]:
logprior, loglikelihood, vocabulary = train_naive_bayes(df_train, [0, 1])
vocabulary = set(vocabulary)

Testing it on the Train & Test datasets


In [20]:
results_train = [test_naive_bayes(text, logprior, loglikelihood, [0, 1], vocabulary) for text in X_train]

In [21]:
results_test = [test_naive_bayes(text, logprior, loglikelihood, [0, 1], vocabulary) for text in X_test]

In [22]:
print(f"Our Naive Bayes accuracy : train {accuracy_score(y_train, results_train) * 100}% - test {accuracy_score(y_test, results_test) * 100}%")

Our Naive Bayes accuracy : train 89.932% - test 81.0%


Training the Scikit Learn Bayes classifier and predicting on the test dataset

In [23]:
pipeline.fit(X_train, y_train)

In [24]:
prediction_train = pipeline.predict(X_train)

In [25]:
prediction_test = pipeline.predict(X_test)

In [26]:
print(f"Scikit Learn Naive Bayes : train {accuracy_score(y_train, prediction_train) * 100}% - test {accuracy_score(y_test, prediction_test) * 100}%")

Scikit Learn Naive Bayes : train 89.812% - test 81.44%


5. (1 point) Most likely, the scikit-learn implementation will give better results. Looking at the documentation, explain why it could be the case.
    The output between our home-mode Naive bayes classifier and Sklearn seams to have almost the same result, but sklearn implementation in faster in term of time and memory.
    - Scikit-learn CountVectorizer uses sparse matrices to enhance computation time and space.
    - Scikit-learn CountVectorizer uses another type of tokenizer with the following pattern ”```(?u)\b\w\w+\b```” in order to extract words. The vocabulary will be different.

6. (1 point) Why is accuracy a sufficient measure of evaluation here?
    
    - ![img](https://media.discordapp.net/attachments/1069902292423286814/1088578203976740904/Calculation-of-Precision-Recall-and-Accuracy-in-the-confusion-matrix.png?width=1275&height=582)
    - As we can see on this image, the accuracy is the overall Right Predicted values over All Possible values. We need to measure the model's predictive ability, both on true positives and true negatives equally, this is why accuracy is the best metric.

7. (1 point) Using one of the implementation, take at least 2 wrongly classified example from the test set and try explaining why the model failed.

In [27]:
def incorrect_prediction(data : pd.DataFrame, pred : pd.Series) -> pd.DataFrame:
    """
    Get the rows of the test set where the prediction is different from the label
    :param data: pd.DataFrame containing the text and label columns
    :param pred: Prediction of the model
    :return: The rows of the test set where the prediction is different from the label
    """
    real_target = data['label']
    return data[real_target != pred]

In [28]:
incorrect_prediction(df_test, prediction_test).iloc[8]["text"]

'Talented screenwriter Alvin Sargent sadly cannot get any engaging ideas cooking in this artificial trifle about a wayward mother and her mature teenage daughter trying to make their lives work in Los Angeles despite mom\'s flighty behavior. Apart from several good sequences, I didn\'t quite buy Susan Sarandon as a flake (she\'s too intrinsically smart and focused to be passed off as this devil-may-care lady), and her naturally grounded personality is a bad fit for the role of an irresponsible parent. Natalie Portman fares much better as her kid, and yet there\'s a creepy aloofness to her work (and some of her scenes, such as the one where she asks a boy to strip, are misguided and uncomfortable to watch). Certainly not an incompetent piece, "Anywhere But Here" does have moments that work, but it isn\'t an embraceable film, nor has it proved to be an important one. ** from ****'

#### Incorrect classification 1:
*"The only reason this movie is not given a 1 (awful) vote is that the acting of both Ida Lupino and Robert Ryan is superb. Ida Lupino who is lovely, as usual, becomes increasingly distraught as she tries various means to rid herself of a madman. Robert Ryan is terrifying as the menacing stranger whose character, guided only by his disturbed mind, changes from one minute to the next. Seemingly simple and docile, suddenly he becomes clever and threatening. Ms. Lupino's character was in more danger from that house she lived in and her own stupidity than by anyone who came along. She could not manage to get out of her of her own house: windows didn't open, both front and back doors locked and unlocked from the inside with a key. You could not have designed a worse fire-trap if you tried. She did not take the precaution of having even one extra key. Nor could she figure out how to summon help from nearby neighbors or get out of her own basement while she was locked in and out of sight of her captor. I don't know what war her husband was killed in, but if it was World War II, the furnishings in her house, the styles of the clothes, especially the children and the telephone company repairman's car are clearly anachronistic. I recommend watching this movie just to see what oddities you can find."*

#### Explanation (Classified as positive but was negative)

This text has been wrongly predicted to be positive as it has mixed tones for 3/4 of the review. Even though the overall review is negative, the author uses positive vocabulary on some very specific and rare elements of the movie, but the Naive Bayes algorithm can't understand context, and since the author talks about it for 3/4 of the review the algorithm predicts it as positive.

For example:
 - *"the acting of both Ida Lupino and Robert Ryan is **superb**."* --> can be classified as positive
 - *"Ida Lupino who is **lovely**, as usual"* --> can be classified as positive

---

#### Incorrect classification 2:
*'Talented screenwriter Alvin Sargent sadly cannot get any engaging ideas cooking in this artificial trifle about a wayward mother and her mature teenage daughter trying to make their lives work in Los Angeles despite mom\'s flighty behavior. Apart from several good sequences, I didn\'t quite buy Susan Sarandon as a flake (she\'s too intrinsically smart and focused to be passed off as this devil-may-care lady), and her naturally grounded personality is a bad fit for the role of an irresponsible parent. Natalie Portman fares much better as her kid, and yet there\'s a creepy aloofness to her work (and some of her scenes, such as the one where she asks a boy to strip, are misguided and uncomfortable to watch). Certainly not an incompetent piece, "Anywhere But Here" does have moments that work, but it isn\'t an embraceable film, nor has it proved to be an important one.'*

#### Explanation (Classified as positive but was negative)

Usage of contracted negatition (n't) can lead to a bad interpretation due to the tokenization removing part of the information. The Naive Bayes classifier doesn't take context into account. 
- *"**cannot** get any **engaging** ideas cooking"* can be badly classified due to the negatition in the word cannot and engaging can be classified as positive

### 8. Bonus

1. Look at the words with the highest likelihood in each class (if you use scikit-learn, you want to check feature_log_prob_).

2. Remove stopwords and check again.

To increase our Naive Bayes accuracy, we will add a set of stopwords to ignore which have a negligent impact on a text positiveness but which add noise being to much present in text.

In [29]:
def get_10_best(data : pd.DataFrame, binary_class : str) -> pd.Series:
    """
    Get the 10 most frequent words in a dataset for a specific class
    :param data: dataset to extract the words from
    :param binary_class: label of the class to extract (Positive or Negative)
    :return: The 10 most frequent words in a dataset for a specific class
    """
    return pd.DataFrame(' '.join([i for i in data[data["label"] == binary_class]['text']]).split()).value_counts().head(10)

In [30]:
get_10_best(df_train, 0)

the     162861
a        79138
and      74068
of       68794
to       68718
is       50033
it       48201
i        46781
in       43441
this     40831
dtype: int64

In [31]:
get_10_best(df_train, 1)

the     172902
and      89396
a        83498
of       76638
to       66476
is       57208
in       49858
it       47879
i        40642
that     35599
dtype: int64

The 10 highset count of word of each classes are almost all only stopwords.

In [32]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stopswords_nltk = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [33]:
def get_10_best_without_stopword(data, binary_class, stopwords):
    """
    Get the 10 most frequent words in a dataset for a specific class without stopwords
    :param data: dataset to extract the words from
    :param binary_class: label of the class to extract (Positive or Negative)
    :param stopwords: list of stopwords to remove
    :return: The 10 most frequent words in a dataset for a specific class without stopwords
    """
    data = pd.DataFrame(' '.join([i for i in data[data["label"] == binary_class]['text']]).split())
    mask = data.isin(stopwords)  # mask in order to remove stopword

    return data[~mask].value_counts().head(10)

In [34]:
get_10_best_without_stopword(df_train, 0, stopswords_nltk)

movie     24580
film      18844
one       12734
like      10971
even       7646
good       7285
bad        7274
would      6988
really     6252
time       6001
dtype: int64

In [35]:
get_10_best_without_stopword(df_train, 1, stopswords_nltk)

film     20623
movie    18856
one      13368
like      8790
good      7543
story     6674
great     6384
time      6230
see       5896
well      5770
dtype: int64

Without the stopwords some word are both present in the classes but as we expect *"bad"* is classified as negative. *"Good"* and *"great"* are classified as positive. We are positive that this removal would improve the model's accuracy.

Remove stopwords (see NLTK stopwords corpus) and check again. 

As NLTK's stopwords list contains word with apostrophe like "doesn't" or "can't", we will not use it because we deleted every apostrophes in our preprocessing.
So we will use a list of stopwords without apostrophe but with words "don" and "t" to replace "don't"

In [36]:
stopwords_modified = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"}

In [37]:
def test_naive_bayes_without_stopwords(testdoc : str, logprior : float, loglikelihood : float, classes : list[int], vocabulary : set[str], stopword : set[str]) -> int:
    """
    Predict the label for the testdoc_words according to the train Naive Bayes model
    :param testdoc_words: Target document string to predict the label
    :param logprior: Logprior of the Naive Bayes model
    :param loglikelihood: Loglikelihood of the Naive Bayes model
    :param classes: list of int representing the binary classes (0 Negative and 1 Positive)
    :param vocabulary: list of distinct string representing the vocabulary
    :return: int representing the predicted label
    :param stopword: list of stopwords
    """
    testdoc_words = testdoc.split(" ")
    
    confidence_per_class = {}
   
    for binary_class in classes:
        confidence_per_class[binary_class] = logprior[binary_class]
        
        for i in range(len(testdoc_words)):
            word = testdoc_words[i]
            if (not (word in stopword)) and word in vocabulary:
                confidence_per_class[binary_class] += loglikelihood[(word, binary_class)]
    
    return np.argmax(list(confidence_per_class.values()))

In [38]:
results_train_without_stopwords = [test_naive_bayes_without_stopwords(text, logprior, loglikelihood, [0, 1], vocabulary, stopwords_modified) for text in X_train]

In [39]:
results_test_without_stopwords = [test_naive_bayes_without_stopwords(text, logprior, loglikelihood, [0, 1], vocabulary, stopwords_modified) for text in X_test]

In [40]:
print(f"Our Naive Bayes accuracy : train {accuracy_score(y_train, results_train_without_stopwords) * 100}% - test {accuracy_score(y_test, results_test_without_stopwords) * 100}%")

Our Naive Bayes accuracy : train 91.964% - test 83.204%


The accuracy was improved from 89.812% to 91.964% on the train split and from 81.44% to 83.204% on test split.

We reproduce the stopword removal with scikit-learn.

In [41]:
pipeline_without_stopwords = Pipeline([
    ('vectorizer', CountVectorizer(stop_words=stopswords_nltk)), 
    ('multiniomialNB', MultinomialNB())
])

In [42]:
pipeline_without_stopwords.fit(X_train, y_train)

In [43]:
pred_train_without_stopwords = pipeline_without_stopwords.predict(X_train)
pred_test_without_stopwords = pipeline_without_stopwords.predict(X_test)

In [44]:
print(f"Our Naive Bayes accuracy with stopwords ignored : train {accuracy_score(y_train, pred_train_without_stopwords) * 100}% - test {accuracy_score(y_test, pred_test_without_stopwords) * 100}%")

Our Naive Bayes accuracy with stopwords ignored : train 91.36% - test 82.464%


The outcome is almost the same than with our own Naive Bayes classifier.

9. Play with scikit-learn's version parameters. For example, see if you can consider unigram and bigram instead of only unigrams.

In [45]:
def test_parameters(message: str, vectorizer_params: dict, multinomial_params: dict) -> None:
    """
    Test the accuracy of the model with the given parameters
    :param message: Message to print before the accuracy
    :param vectorizer_params: Parameters of the CountVectorizer
    :param multinomial_params: Parameters of the MultinomialNB
    """
    pipeline_experimental = Pipeline([
        ('vectorizer', CountVectorizer(**vectorizer_params)),
        ('multiniomialNB', MultinomialNB(**multinomial_params))
    ])

    pipeline_experimental.fit(X_train, y_train)

    pred_train_experimental = pipeline_experimental.predict(X_train)
    pred_test_experimental = pipeline_experimental.predict(X_test)

    print(
        f"{message} : train {accuracy_score(y_train, pred_train_experimental) * 100}% - test {accuracy_score(y_test, pred_test_experimental) * 100}%")

In [46]:
test_parameters("Accuracy with stopwords ignored, both unigrams and bigrams", 
                {"stop_words": stopswords_nltk, "ngram_range": (1, 2)}, {})    

Accuracy with stopwords ignored, both unigrams and bigrams : train 99.804% - test 85.344%


In [47]:
test_parameters("Accuracy with stopwords ignored, only bigrams", 
                {"stop_words": stopswords_nltk, "ngram_range": (2, 2)}, {})   

Accuracy with stopwords ignored, only bigrams : train 99.964% - test 84.932%


In [48]:
test_parameters("Accuracy with stopwords ignored, the feature are made of character n-grams.", 
                {"stop_words": stopswords_nltk, "analyzer": "char"}, {})   

Accuracy with stopwords ignored, the feature are made of character n-grams. : train 61.007999999999996% - test 60.995999999999995%


In [49]:
test_parameters("Accuracy with stopwords ignored, creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space from scikit learm documentation", 
                {"stop_words": stopswords_nltk, "analyzer": "char_wb"}, {})   

Accuracy with stopwords ignored, creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space from scikit learm documentation : train 60.748000000000005% - test 60.736000000000004%


Changing the analyzer doesn't seem to improve the accuracy. Using only bigrams or (bigrams and unigrams) seems to improve the result, but mostly on the train dataset. The model is probably overfitting.

## Stemming and Lemmatization

(2 points) Add stemming or lemmatization to your pretreatment.

We chose to remove stopwords, thanks to the better results from the previous experiment.

In [50]:
from nltk import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=True)
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc)) 

We import the SnowballStemmer from nltk and integrate it into CountVectorizer.

(1 point) Train and evaluate your model again with these pretreatment.

In [51]:
pipeline_with_stemmer = Pipeline([
    ('vectorizer', CountVectorizer(analyzer=stemmed_words)), # An analyzer is added to CountVectorizer and it performs stemming
    ('multiniomialNB', MultinomialNB())
])

In [52]:
pipeline_with_stemmer.fit(X_train, y_train)

In [53]:
pred_train_with_stemmer = pipeline_with_stemmer.predict(X_train)

In [54]:
pred_test_with_stemmer = pipeline_with_stemmer.predict(X_test)

In [55]:
print(f"Our Naive Bayes accuracy with stopwords ignored : train {accuracy_score(y_train, pred_train_with_stemmer) * 100}% - test {accuracy_score(y_test, pred_test_with_stemmer) * 100}%")

Our Naive Bayes accuracy with stopwords ignored : train 88.444% - test 80.55600000000001%


Result are worst considering that we removed stopwords.

(1 point) Are the results better or worse? Try explaining why the accuracy changed.

Accuracy is better without stemming. Stemming is prone to mistakes (organization > organ ) and doesn't work well with irregular forms (was > wa).

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=872d72c8-1d54-4f60-8556-02f9978ad2c8' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>