# Naive Bayes for Sentiment Analysis 

This assignment is comprised of two parts:

1. **Theory**: Solve the Naive Bayes exercises 4.1 and 4.2 from Chapter 4 in the J&M textbook. Reformulate NB to emphasize title words.
2. **Implementation**: You will implement and experiment with various feature engineering techniques in the context of Naive Bayes models for Sentiment classification of movie reviews.

We will use the NB model implemented in sklearn:

https://scikit-learn.org/stable/modules/naive_bayes.html

## Manish Kumar Govind:

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Please make sure to have entered your name above.
3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells). 
4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
5. Once you've rerun everything, select File -> Download as -> PDF via LaTeX and download a PDF version *SentimentAnalysis.pdf* showing the code and the output of all cells, and save it in the same folder that contains the notebook file *SentimentAnalysis.ipynb*.
6. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing we will see when grading!
7. Submit **both** your PDF and notebook on Canvas.
8. Verify your Canvas submission contains the correct files by downloading it after posting it on Canvas.

*Make sure that you format your solutions to theory questions to show equations properly. We will not grade solutions that are not properly formatted. Jupyter Notebook understands LaTeX; alternatively, you can edit in Word using its Equation editor and submit the PDF as a separate file in Canvas.*

<hr>

# Theory

## Theory: J&M Exercise 4.1 ##


### D =  I always like foreign films ###

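With equal priors $P(+) = P(-) = 0.5$, the unnormalized posterior of each class is the prior times the product of the word likelihoods:

$P(+ \mid D) \propto 0.5 \times 0.09 \times 0.07 \times 0.29 \times 0.04 \times 0.08 = 0.0000029232$

$P(- \mid D) \propto 0.5 \times 0.16 \times 0.06 \times 0.06 \times 0.15 \times 0.11 = 0.000004752$

Since $P(- \mid D) > P(+ \mid D)$, the NB classifier assigns the sentence to the Negative class.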
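
The arithmetic above can be double-checked with a few lines of Python. This is only a sanity check: the likelihoods and priors are copied from the products written out above, and the names `likelihoods`, `prior`, and `scores` are chosen here for illustration.

```python
import math

# Per-word likelihoods for "I always like foreign films", copied from the products above.
likelihoods = {
    "pos": [0.09, 0.07, 0.29, 0.04, 0.08],
    "neg": [0.16, 0.06, 0.06, 0.15, 0.11],
}
prior = {"pos": 0.5, "neg": 0.5}

# Unnormalized score for each class: P(c) * prod_i P(w_i | c).
scores = {c: prior[c] * math.prod(likelihoods[c]) for c in prior}
prediction = max(scores, key=scores.get)

for c, s in scores.items():
    print(f"{c}: {s:.10f}")
print("predicted class:", prediction)
```

The two scores are not normalized, so they do not sum to 1. Only their relative order matters for the decision.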


## Theory: J&M Exercise 4.2 ##
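Vocabulary $V = \{\text{fun}, \text{couple}, \text{love}, \text{fast}, \text{furious}, \text{shoot}, \text{fly}\}$, so $|V| = 7$.

D = fast, couple, shoot, fly

The comedy documents contain 9 word tokens in total and the action documents contain 11. With add-one smoothing, $P(w \mid c) = \frac{\text{count}(w, c) + 1}{N_c + |V|}$, where $N_{\text{comedy}} = 9$ and $N_{\text{action}} = 11$:

$P(\text{fast} \mid \text{comedy}) = \frac{1 + 1}{9 + 7} = \frac{2}{16}$

$P(\text{couple} \mid \text{comedy}) = \frac{2 + 1}{9 + 7} = \frac{3}{16}$

$P(\text{shoot} \mid \text{comedy}) = \frac{0 + 1}{9 + 7} = \frac{1}{16}$

$P(\text{fly} \mid \text{comedy}) = \frac{1 + 1}{9 + 7} = \frac{2}{16}$

$P(\text{fast} \mid \text{action}) = \frac{2 + 1}{11 + 7} = \frac{3}{18}$

$P(\text{couple} \mid \text{action}) = \frac{0 + 1}{11 + 7} = \frac{1}{18}$

$P(\text{shoot} \mid \text{action}) = \frac{4 + 1}{11 + 7} = \frac{5}{18}$

$P(\text{fly} \mid \text{action}) = \frac{1 + 1}{11 + 7} = \frac{2}{18}$

With priors $P(\text{action}) = \frac{3}{5}$ and $P(\text{comedy}) = \frac{2}{5}$, the unnormalized posteriors are:

$P(\text{action} \mid D) \propto \frac{3}{5} \times \frac{3}{18} \times \frac{1}{18} \times \frac{5}{18} \times \frac{2}{18} = 0.000171$

$P(\text{comedy} \mid D) \propto \frac{2}{5} \times \frac{2}{16} \times \frac{3}{16} \times \frac{1}{16} \times \frac{2}{16} = 0.0000732$

Since $P(\text{action} \mid D) > P(\text{comedy} \mid D)$, D is assigned to the action class.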
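
As a check on the smoothed estimates, the same computation can be sketched with exact fractions. The counts, class totals, and priors are taken from the solution above. The dictionary layout and the helper name `smoothed` are assumptions made for this illustration.

```python
from fractions import Fraction

# Counts copied from the worked solution above.
vocab_size = 7
total_words = {"comedy": 9, "action": 11}
prior = {"comedy": Fraction(2, 5), "action": Fraction(3, 5)}
counts = {
    "comedy": {"fast": 1, "couple": 2, "shoot": 0, "fly": 1},
    "action": {"fast": 2, "couple": 0, "shoot": 4, "fly": 1},
}
document = ["fast", "couple", "shoot", "fly"]

def smoothed(word, label):
    """Add-one (Laplace) smoothed estimate of P(word | label)."""
    return Fraction(counts[label][word] + 1, total_words[label] + vocab_size)

# Unnormalized posterior: prior times the product of smoothed likelihoods.
scores = {}
for label in prior:
    score = prior[label]
    for word in document:
        score *= smoothed(word, label)
    scores[label] = score

prediction = max(scores, key=scores.get)
for label, score in scores.items():
    print(label, score, float(score))
print("predicted class:", prediction)
```

Using `Fraction` avoids rounding, so the printed values can be compared directly with the hand computation.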




## Theory 5111: Title is 𝐾 times more important than Body 
The Naive Bayes algorithm for text categorization presented in class treats all sections of a document equally, ignoring the fact that words in the title are often more important than words in the text in determining the document category. Modify the Naive Bayes training algorithm to reflect that word occurrences in the title are $K$ times more important than word occurrences in the rest of the document for deciding the class, where $K$ is an input parameter. Describe the idea in English and include pseudocode, akin to the training pseudocode shown in class.

**Solution:**

### Text Classification with Naïve Bayes and Weighted Features

In text classification, we use the Naïve Bayes algorithm to determine the category of a document from its content. To make the title $K$ times more important than the body, each occurrence of a word in a title is counted $K$ times during training, while each occurrence in a body is counted once.

When classifying a new document, we follow these steps:

1. Split the document into a title and body.
2. Calculate the likelihoods for each part separately using the p(wi | Cj) values.
3. Combine the likelihoods, weighting the title part by the factor K (each title word contributes its likelihood K times).

This process lets us categorize new documents using the Naïve Bayes classifier trained with weighted features. The pseudocode for training with the weight factor K follows.

### Training with Naïve Bayes

**Input:**
- Dataset of training documents D with vocabulary V.
- K, the weight factor for title words.

**Output:**
- Parameters p(Cj) and p(wi | Cj).

1. For each category Cj:
   1. Let Dj be the subset of documents in category Cj.
   2. For each document d ∊ Dj, remove duplicates from d.
   3. Set p(Cj) = |Dj| / |D|.
   4. Let nbj be the total number of words in the bodies of the documents in Dj.
   5. Let ntj be the total number of words in the titles of the documents in Dj.
   6. For each word wi ∊ V:
      1. Let wbji be the frequency of word wi in the bodies of the documents in Dj.
      2. Let wtji be the frequency of word wi in the titles of the documents in Dj.
      3. Set p(wi | Cj) = (wbji + K * wtji + 1) / (nbj + K * ntj + |V|).
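
The training pseudocode can be turned into a small runnable sketch. Several things here are assumptions made for illustration, not part of the solution above: the input layout (a list of `(label, title_tokens, body_tokens)` tuples), the function names `train_weighted_nb` and `classify`, and the choice to skip words outside the vocabulary at classification time. The sketch also counts raw frequencies and leaves out the duplicate-removal step.

```python
import math
from collections import Counter, defaultdict

def train_weighted_nb(docs, K):
    """docs: list of (label, title_tokens, body_tokens); K: weight of title words."""
    vocab = {w for _, title, body in docs for w in title + body}
    doc_count = Counter(label for label, _, _ in docs)
    weighted = defaultdict(Counter)  # label -> word -> weighted count
    for label, title, body in docs:
        for w in body:
            weighted[label][w] += 1   # a body occurrence counts once
        for w in title:
            weighted[label][w] += K   # a title occurrence counts K times

    log_prior = {c: math.log(n / len(docs)) for c, n in doc_count.items()}
    log_lik = {}
    for c in doc_count:
        # Denominator: nb_j + K * nt_j + |V|
        denom = sum(weighted[c].values()) + len(vocab)
        log_lik[c] = {w: math.log((weighted[c][w] + 1) / denom) for w in vocab}
    return log_prior, log_lik, vocab

def classify(title, body, log_prior, log_lik, vocab, K):
    """Pick the class with the highest log score; title words are weighted K times."""
    def score(c):
        body_part = sum(log_lik[c][w] for w in body if w in vocab)
        title_part = K * sum(log_lik[c][w] for w in title if w in vocab)
        return log_prior[c] + body_part + title_part
    return max(log_prior, key=score)

# Toy data, invented for this example.
docs = [
    ("pos", ["great"], ["fun", "film"]),
    ("neg", ["awful"], ["dull", "film"]),
]
log_prior, log_lik, vocab = train_weighted_nb(docs, K=3)
label = classify(["great"], ["film"], log_prior, log_lik, vocab, K=3)
print(label, math.exp(log_lik["pos"]["great"]))
```

Working in log space keeps the scores numerically stable, and multiplying the title log-likelihood by K is the same as repeating each title word K times.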





<hr>

# Implementation

## From documents to feature vectors
This section illustrates the prototypical components of a machine learning pipeline for an NLP task, in this case document classification:

1. Read document examples (train, devel, test) from files with a predefined format:
    - assume one document per line, using the format "\<label\> \<text\>".

2. Tokenize each document:
    - using a spaCy tokenizer.

3. Feature extractors:
    - so far, just words.

4. Process each document into a feature vector:
    - map document to a dictionary of feature names.
    - map feature names to unique feature IDs.
    - each document is a feature vector, where each feature ID is mapped to a feature value (e.g. word occurrences).
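
The four steps can be traced on a toy input with the standard library alone. This is a simplified sketch: whitespace splitting stands in for the spaCy tokenizer, plain dictionaries stand in for the sparse matrix, and the two example lines are invented.

```python
from collections import Counter

docs = ["pos a fun fun film", "neg a dull film"]

# 1. Read: split "<label> <text>" at the first space.
labels, texts = zip(*(line.split(" ", maxsplit=1) for line in docs))

# 2. Tokenize: whitespace splitting stands in for the spaCy tokenizer here.
tokenized = [text.split() for text in texts]

# 3. Feature extraction: word counts, with a prefix to keep feature names unique.
feature_dicts = [Counter(f"WORD_{tok}" for tok in toks) for toks in tokenized]

# 4. Map feature names to IDs, then each document to {feature_id: value}.
feature_ids = {}
for feats in feature_dicts:
    for name in feats:
        feature_ids.setdefault(name, len(feature_ids))
vectors = [{feature_ids[n]: v for n, v in feats.items()} for feats in feature_dicts]

print(feature_ids)
print(vectors)
```

Each document ends up as a mapping from feature ID to value, which is the information a sparse matrix row stores.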

In [1]:
import spacy
from spacy.lang.en import English
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Create spaCy tokenizer.
spacy_nlp = English()

def spacy_tokenizer(text):
    tokens = spacy_nlp.tokenizer(text)
    
    return [token.text for token in tokens]

In [3]:
def read_examples(filename):
    X = []
    Y = []
    with open(filename, mode = 'r', encoding = 'utf-8') as file:
        for line in file:
            [label, text] = line.rstrip().split(' ', maxsplit = 1)
            X.append(text)
            Y.append(label)
            
    return X, Y

In [4]:
def word_features(tokens):
    feats = {}
    for word in tokens:
        feat = 'WORD_%s' % word
        if feat in feats:
            feats[feat] +=1
        else:
            feats[feat] = 1
    return feats



In [5]:
def add_features(feats, new_feats):
    for feat in new_feats:
        if feat in feats:
            feats[feat] += new_feats[feat]
        else:
            feats[feat] = new_feats[feat]
    return feats

The `docs2features` function below tokenizes each document, runs all the feature extractors on it, and assembles the extracted features into a dictionary mapping feature names to feature values. It is important that feature names do not conflict with each other, i.e. different features should have different names. Each document will have its own dictionary of features and their values.

In [6]:
def docs2features(trainX, feature_functions, tokenizer):
    examples = []
    count = 0
    for doc in trainX:
        feats = {}

        tokens = tokenizer(doc)
        
        for func in feature_functions:
            add_features(feats, func(tokens))

        examples.append(feats)
        count +=1
        
        if count % 100 == 0:
            print('Processed %d examples into features' % len(examples))
    
    return examples
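
To illustrate why feature names must not clash, here is a small sketch that merges two extractors. The bigram extractor and the helper names `word_feats`, `bigram_feats`, and `merge` are hypothetical and not part of the assignment code. Only the `WORD_` prefix follows the notebook's convention.

```python
from collections import Counter

def word_feats(tokens):
    # Same naming scheme as word_features: one "WORD_<token>" feature per token.
    return Counter(f"WORD_{t}" for t in tokens)

def bigram_feats(tokens):
    # Hypothetical second extractor; the distinct prefix avoids name clashes.
    return Counter(f"BIGRAM_{a}_{b}" for a, b in zip(tokens, tokens[1:]))

def merge(feature_functions, tokens):
    feats = Counter()
    for func in feature_functions:
        feats.update(func(tokens))  # Counter.update adds counts for shared names
    return dict(feats)

tokens = ["not", "good", "not", "bad"]
feats = merge([word_feats, bigram_feats], tokens)
print(feats)
```

If both extractors emitted the same name for different things, their counts would be summed into one feature and the model could not tell them apart.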

In [7]:
# This helper function converts feature names to unique numerical IDs.

def create_vocab(examples):
    feature_vocab = {}
    idx = 0
    for example in examples:
        for feat in example:
            if feat not in feature_vocab:
                feature_vocab[feat] = idx
                idx += 1
                
    return feature_vocab

In [8]:
# This helper function converts a set of examples from a dictionary of feature names to values representation
# to a sparse representation of feature ids to values. This is important because almost all feature values will
# be 0 for most documents and it would be wasteful to save all in memory.

def features_to_ids(examples, feature_vocab):
    new_examples = sparse.lil_matrix((len(examples), len(feature_vocab)))
    for idx, example in enumerate(examples):
        for feat in example:
            if feat in feature_vocab:
                new_examples[idx, feature_vocab[feat]] = example[feat]
                
    return new_examples

In [9]:
# Evaluation pipeline for the Naive Bayes classifier.

def train_and_test(trainX, trainY, devX, devY, feature_functions, tokenizer):
    # Pre-process training documents. 
    trainX_feat = docs2features(trainX, feature_functions, tokenizer)

    # Create vocabulary from features in training examples.
    feature_vocab = create_vocab(trainX_feat)
    print('Vocabulary size: %d' % len(feature_vocab))

    trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    # Train NB model.
    nb_model = MultinomialNB(alpha = 1.0)
    nb_model.fit(trainX_ids, trainY)
    
    # Pre-process test documents. 
    devX_feat = docs2features(devX, feature_functions, tokenizer)
    devX_ids = features_to_ids(devX_feat, feature_vocab)
    
    # Test NB model.
    print('Accuracy: %.3f' % nb_model.score(devX_ids, devY))

In [10]:
import os

datapath = '../data'

train_file = os.path.join(datapath, 'imdb_sentiment_train.txt')
trainX, trainY = read_examples(train_file)
dev_file = os.path.join(datapath, 'imdb_sentiment_dev.txt')
devX, devY = read_examples(dev_file)

# Specify features to use.
features = [word_features]

# Evaluate NB model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 28692
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

## Feature engineering

Evaluate NB model performance when using only alpha tokens. This can be done by changing the tokenizer function.

In [11]:
def spacy_tokenizer1(text):
    tokens = spacy_nlp.tokenizer(text)
    
    # YOUR CODE HERE
    # Keep in the tokens list only those whose text is made up solely from letters.
    tokens = [token.text for token in tokens if token.is_alpha]
    
    return tokens

# Evaluate NB model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer1)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 25054
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

Same as above, but lowercase all tokens before using as features.

In [12]:
def spacy_tokenizer2(text):
    tokens = spacy_nlp.tokenizer(text)
    
    # YOUR CODE HERE
    # Keep in the tokens list only those whose text is made up solely from letters.
    # Return a list of lowercased token text.
    tokens =  [token.text.lower() for token in tokens if token.is_alpha]
    
    return tokens

# Evaluate NB model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer2)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 21708
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process
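
The effect of the two filters on vocabulary size can be seen on a toy token list. The sketch uses `str.isalpha()` on plain strings as a stand-in for spaCy's `token.is_alpha`, and the token list is invented for illustration.

```python
tokens = ["The", "film", "was", "GREAT", ",", "the", "best", "of", "2005", "!"]

# Keep only tokens made up solely of letters: drops ",", "2005" and "!".
alpha_tokens = [t for t in tokens if t.isalpha()]

# Lowercasing merges case variants such as "The" and "the".
lower_tokens = [t.lower() for t in alpha_tokens]

print(len(set(tokens)), len(set(alpha_tokens)), len(set(lower_tokens)))
```

The number of distinct tokens shrinks at each step, which mirrors the drop in vocabulary size reported by the cells above.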

Same as above, but lowercase only tokens that appear at the beginning of sentences.

In [13]:
spacy_nlp = English()
spacy_nlp.add_pipe("sentencizer")

def spacy_tokenizer3(text):
    # YOUR CODE HERE
    # Split the text into sentences, then lowercase only the first token of each sentence.
    doc = spacy_nlp(text)
    tokens = []
    for sent in doc.sents:
        is_first = True
        for token in spacy_nlp.tokenizer(sent.text):
            if is_first:
                tokens.append(token.text.lower())
                is_first = False
            else:
                tokens.append(token.text)

    return tokens

# Evaluate NB model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer3)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 28695
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

Use spacy_tokenizer2 (only alpha tokens, lowered all) and display the top 10 most frequent tokens in the vocabulary, as a list of tuples (token, frequency).

In [14]:
# First, count token occurrences across all examples, where features are still strings. 
def create_feature_counts(examples):
    feature_counts = {}
    
    # YOUR CODE HERE
    for i in examples:
        for k,v in i.items():
            if k in feature_counts :
                feature_counts[k] += v
            else :
                feature_counts[k] = v
                

    return feature_counts

# Create features for all training examples, compute feature counts 
def fcounts_from_train(trainX, feature_functions, tokenizer):
    # Pre-process training documents. 
    trainX_feat = docs2features(trainX, feature_functions, tokenizer)

    # Create vocabulary from features in training examples.
    feature_counts = create_feature_counts(trainX_feat)
    print('Vocabulary size: %d' % len(feature_counts))

    return feature_counts

# Return a list of the top K most frequent tokens in the vocabulary.
def topK_tokens(vocab, k):
    return sorted(vocab.items(), key=lambda item: item[1], reverse=True)[:k]


vocab = fcounts_from_train(trainX, features, spacy_tokenizer2)
stop_words = topK_tokens(vocab, 20)
for item in stop_words:
    print(item)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 21708
('WORD_the', 20105)
('WORD_and', 9876)
('WORD_a', 9749)
('WORD_of', 9029)
('WORD_to', 8182)
('WORD_is', 6605)
('WORD_in', 5598)
('WORD_it', 5387)
('WORD_i', 4773)
('WORD_this', 4448)
('WORD_that', 4252)
('WORD_was', 2961)
('WORD_with', 2693)
('WORD_as', 2682)
('WORD_movie', 2574)
('WORD_for', 2556)
('WORD_but', 2435)
('WORD_film', 2359)
('WORD_you', 2066)
('WORD_on', 2061)
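
The same counting and ranking can also be written with `collections.Counter`, shown here as an alternative sketch rather than the code used above. The helper name `count_features` and the toy examples are made up for illustration.

```python
from collections import Counter

def count_features(examples):
    """Sum per-document feature counts into corpus-level totals."""
    total = Counter()
    for feats in examples:
        total.update(feats)  # adds each document's counts to the running totals
    return total

examples = [
    {"WORD_the": 3, "WORD_film": 2},
    {"WORD_the": 2, "WORD_good": 1},
]
top = count_features(examples).most_common(2)
print(top)
```

`most_common(k)` returns `(feature, count)` tuples sorted by decreasing count, the same shape as the list printed above.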


Evaluate NB model performance when ignoring the top 20 stop words. Use spacy_tokenizer2 (only alpha tokens, lowered all).

In [15]:
def create_vocab_without_stopwords(examples , stop_words):
    feature_vocab = {}
    idx = 0
    for example in examples:
        for feat in example:
            if feat not in feature_vocab and feat not in stop_words:
                feature_vocab[feat] = idx
                idx += 1
                
    return feature_vocab

In [16]:
# Evaluation pipeline for the Naive Bayes classifier.

def train_and_test(trainX, trainY, devX, devY, feature_functions, tokenizer):
    # Pre-process training documents. 
    trainX_feat = docs2features(trainX, feature_functions, tokenizer)
    # Count feature occurrences across the training examples.
    feature_counts = create_feature_counts(trainX_feat)
    # Use the counts just computed instead of the global `vocab` from the previous cell.
    stop_words = topK_tokens(feature_counts, 20)

    # Remove from each example features that appear in the stop words list.
    # YOUR CODE HERE.
    ignore_words = []   # list of stop words
    for (word ,fre) in stop_words:
        ignore_words.append(word)

    # Create vocabulary from features in training examples.
    #feature_vocab = create_vocab(trainX_feat) 
    feature_vocab = create_vocab_without_stopwords(trainX_feat,ignore_words)                  # accuracy improved  from 78.4 to 79.8%        
    print('Vocabulary size: %d' % len(feature_vocab))

    trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    
    # Train NB model.
    nb_model = MultinomialNB(alpha = 1.0)
    nb_model.fit(trainX_ids, trainY)
    
    # Pre-process test documents. 
    devX_feat = docs2features(devX, feature_functions, tokenizer)
    devX_ids = features_to_ids(devX_feat, feature_vocab)
    
    # Test NB model.
    print('Accuracy: %.3f' % nb_model.score(devX_ids, devY))
    
# Evaluate NB model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer2)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 21688
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

Evaluate NB model performance when ignoring words that appear less than 5 times. Use spacy_tokenizer2 (only alpha tokens, lowered all).

In [17]:
def train_and_test(trainX, trainY, devX, devY, feature_functions, tokenizer):
    # Pre-process training documents. 
    trainX_feat = docs2features(trainX, feature_functions, tokenizer)
    # Count feature occurrences across the training examples.
    feature_counts = create_feature_counts(trainX_feat)
    # Use the counts just computed instead of the global `vocab` from an earlier cell.
    stop_words = topK_tokens(feature_counts, 20)

    # Collect the features to ignore; a set gives constant-time membership tests.
    ignore_words = set()

    # Words whose frequency is less than 5.
    for word, freq in feature_counts.items():
        if freq < 5:
            ignore_words.add(word)

    # The top 20 most frequent words (stop words).
    for word, freq in stop_words:
        ignore_words.add(word)
    # Create vocabulary from features in training examples.
    #feature_vocab = create_vocab(trainX_feat) 
    feature_vocab = create_vocab_without_stopwords(trainX_feat,ignore_words)                  # accuracy improved  from 79.8 to 81.1         
    print('Vocabulary size: %d' % len(feature_vocab))

    trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    
    # Train NB model.
    nb_model = MultinomialNB(alpha = 1.0)
    nb_model.fit(trainX_ids, trainY)
    
    # Pre-process test documents. 
    devX_feat = docs2features(devX, feature_functions, tokenizer)
    devX_ids = features_to_ids(devX_feat, feature_vocab)
    
    # Test NB model.
    print('Accuracy: %.3f' % nb_model.score(devX_ids, devY))
    
# Evaluate NB model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer2)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 5553
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processe
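
Both filters can be combined into one vocabulary-building helper, sketched below. The function name `filter_vocab` and its parameters are assumptions for illustration. Unlike `create_vocab_without_stopwords`, this version assigns IDs in order of decreasing frequency rather than order of first appearance.

```python
def filter_vocab(feature_counts, top_k=20, min_count=5):
    """Map kept features to IDs, skipping the top_k most frequent features
    and any feature whose count is below min_count."""
    ranked = sorted(feature_counts, key=feature_counts.get, reverse=True)
    stop = set(ranked[:top_k])
    kept = [f for f in ranked if f not in stop and feature_counts[f] >= min_count]
    return {f: i for i, f in enumerate(kept)}

# Toy counts, invented for this example.
counts = {"WORD_the": 50, "WORD_film": 9, "WORD_good": 6, "WORD_zzz": 1}
vocab_ids = filter_vocab(counts, top_k=1, min_count=5)
print(vocab_ids)
```

Features left out of the vocabulary are dropped from both the training and dev matrices, because `features_to_ids` only writes values for features it finds in the vocabulary.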

## Binary Multinomial Bayes
*Mandatory for graduate students, optional for undergraduate students*

Write code for transforming documents to features such that features are Boolean and only represent whether a word occurred in a document, as in the Binary Multinomial Naive Bayes discussed in class. Evaluate the Naive Bayes model with this feature representation, using spacy_tokenizer2 (only alpha tokens, lowered all).

In [18]:

def word_features_binary(tokens):
    feats = {}
    for word in tokens:
        feat = 'WORD_%s' % word
        if feat not in feats:
            feats[feat] = 1
    return feats
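
The difference between count features and binary features is easiest to see on a document with a repeated word. The token list below is invented for this example.

```python
from collections import Counter

tokens = ["great", "great", "great", "plot", "bad"]

# Count features: how many times each word occurs in the document.
count_feats = dict(Counter(f"WORD_{t}" for t in tokens))

# Binary features: 1 if the word occurs at all, regardless of how often.
binary_feats = {f"WORD_{t}": 1 for t in set(tokens)}

print(count_feats)
print(binary_feats)
```

Both representations contain the same feature names, but the binary version caps the influence of a word repeated many times in one review.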

In [19]:
# Specify features to use.
features = [word_features_binary]

# Evaluate NB model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer2)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 4852
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processe

## Bonus points ##
Anything extra goes here. For example, implement NB from scratch in a separate module nbayes.py and use it for the exercises above.

In [20]:

from nbayes import train_nb, test_nb
def train_and_test_custom(trainX, trainY, devX, devY, feature_functions, tokenizer):
    # Pre-process training documents. 
 
    trainX_feat = docs2features(trainX, feature_functions, tokenizer)
    # Create vocabulary from features in training examples.
    feature_vocab = create_vocab(trainX_feat)
    print('Vocabulary size: %d' % len(feature_vocab))

    trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    # Train NB model.
    logprior , loglikelihood = train_nb(trainX_feat,trainY,feature_vocab)
    
    # Pre-process test documents. 
    devX_feat = docs2features(devX, feature_functions, tokenizer)
    devX_ids = features_to_ids(devX_feat, feature_vocab)
    acc = test_nb(devX_feat, devY , logprior, loglikelihood)
    print("Accuracy:", acc)
train_and_test_custom(trainX, trainY, devX, devY, features, spacy_tokenizer2)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 21708
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process
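
The module `nbayes.py` is not shown in this notebook. The sketch below is one way a from-scratch multinomial Naive Bayes with add-alpha smoothing could look, written to mirror the call pattern used above. The function names `train_nb_sketch`, `predict_nb_sketch`, and `score_nb_sketch`, the `alpha` parameter, and the toy data are all assumptions. It should not be read as the actual contents of `nbayes.py`.

```python
import math
from collections import Counter, defaultdict

def train_nb_sketch(examples, labels, feature_vocab, alpha=1.0):
    """examples: list of {feature: count}. Returns (logprior, loglikelihood)."""
    class_docs = Counter(labels)
    class_counts = defaultdict(Counter)  # label -> feature -> total count
    for feats, label in zip(examples, labels):
        for feat, value in feats.items():
            if feat in feature_vocab:
                class_counts[label][feat] += value

    logprior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    loglikelihood = {}
    for c in class_docs:
        denom = sum(class_counts[c].values()) + alpha * len(feature_vocab)
        loglikelihood[c] = {
            f: math.log((class_counts[c][f] + alpha) / denom) for f in feature_vocab
        }
    return logprior, loglikelihood

def predict_nb_sketch(feats, logprior, loglikelihood):
    """Return the class with the highest log posterior; unseen features are skipped."""
    def score(c):
        return logprior[c] + sum(
            v * loglikelihood[c][f] for f, v in feats.items() if f in loglikelihood[c]
        )
    return max(logprior, key=score)

def score_nb_sketch(examples, labels, logprior, loglikelihood):
    """Fraction of examples whose predicted class matches the gold label."""
    correct = sum(
        predict_nb_sketch(feats, logprior, loglikelihood) == y
        for feats, y in zip(examples, labels)
    )
    return correct / len(labels)

# Toy data, invented for this example.
toy_train = [{"WORD_good": 2, "WORD_film": 1}, {"WORD_bad": 2, "WORD_film": 1}]
toy_labels = ["pos", "neg"]
toy_vocab = {"WORD_good": 0, "WORD_film": 1, "WORD_bad": 2}
logprior, loglikelihood = train_nb_sketch(toy_train, toy_labels, toy_vocab)
toy_dev = [{"WORD_good": 1}, {"WORD_bad": 1, "WORD_new": 1}]
acc = score_nb_sketch(toy_dev, ["pos", "neg"], logprior, loglikelihood)
print("Accuracy:", acc)
```

With `alpha = 1.0` the likelihood estimate matches the add-one smoothing used in the theory section.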

## Analysis ##
Include an analysis of the results that you obtained in the experiments above. Take advantage of the Jupyter Notebook markdown language, which can also process Latex and HTML, to format your report so that it looks professional.

### Analysis of Different Experiments on Basis of Feature Engineering

1. **All Tokens in Vocabulary (77.9% Accuracy):**
   - An initial experiment considered all tokens in the vocabulary, resulting in **77.9% accuracy**.

2. **Alpha Tokens Only (78.5% Accuracy):**
   - By considering only alpha tokens (tokens made up solely of letters), the accuracy slightly improved to **78.5%**.

3. **Alpha Tokens to Lowercase (78.4% Accuracy):**
   - Converting alpha tokens to lowercase reduced the vocabulary from 25,054 to 21,708 features and resulted in **78.4% accuracy**.

4. **Lowercasing Beginning of Sentences (77.9% Accuracy):**
   - When only tokens at the beginning of sentences were converted to lowercase, accuracy was **77.9%**, the same as the baseline with all tokens.

Among these cases, keeping only alpha tokens works best: 78.5% without lowercasing and 78.4% with it. Lowercasing costs 0.1 points but shrinks the vocabulary by about 13%, so the lowercased alpha tokenizer (`spacy_tokenizer2`) is used for the remaining experiments. Further accuracy improvements can be achieved by removing stop words and less frequent words (words unseen in training are already ignored at test time by `features_to_ids`). Let's explore these cases:

1. **Ignoring Top 20 Stop Words (79.8% Accuracy):**
   - Ignoring the 20 most frequent words, such as "the", "and", and "with", raised accuracy from 78.4% to **79.8%**.

2. **Ignoring Less Frequent Words (81.1% Accuracy):**
   - Additionally ignoring words that appear fewer than 5 times increased accuracy to **81.1%**.

### Binary Multinomial Bayes

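In Binary Multinomial Naive Bayes, each feature records only whether a word occurs in a document, not how often. This approach reached **82.6%** accuracy, compared to the initial accuracy of 77.9%. Because this run reuses the `train_and_test` pipeline that also drops the top 20 stop words and features with a count below 5, the closer comparison is with the 81.1% obtained from count features under the same filtering.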

### Naive Bayes Implementation without the sklearn Library (From Scratch)

An implementation of Naive Bayes using `spacy_tokenizer2` (lowercased alpha tokens only), run on the full 21,708-feature vocabulary without the stop-word and rare-word filtering, resulted in an accuracy of **80.66%**.

These experiments demonstrate the impact of different feature engineering techniques on text classification accuracy.