**Copyright: © NexStream Technical Education, LLC**.
All rights reserved

# Naive Bayes Multinomial Classifier using Sklearn  

In this project, you'll implement a spam email classifier using the Naive Bayes algorithm. This is one of the most common applications of Natural Language Processing (NLP) and provides a simple introduction to text classification.  
Spam detection is a binary classification problem: given an email message, we want to classify it as either:
- Ham: legitimate, wanted email
- Spam: unwanted, unsolicited email

The Naive Bayes algorithm is well-suited for text classification because:  
- It works well with high-dimensional data (like text)
- It can handle small training datasets effectively
- It's computationally efficient
- It's relatively simple to understand and implement  

How Naive Bayes Works for Text:   
- Naive Bayes applies Bayes' theorem with a "naive" assumption that features (words in our case) are conditionally independent given the class label. For spam classification:
  - P(Spam | Message) ∝ P(Spam) × P(Word₁ | Spam) × P(Word₂ | Spam) × ... × P(Wordₙ | Spam)
  - For each message, we calculate this probability for both Spam and Ham classes, then classify the message according to which probability is higher.
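
The scoring rule above can be sketched by hand on a toy example. The priors and word probabilities below are hypothetical values chosen only for illustration (they are not estimated from the project dataset), and the helper names `log_score` and `classify` are made up for this sketch. Logs are summed instead of multiplying raw probabilities, which is a common way to avoid numerical underflow on long messages.

```python
import math

# Hypothetical priors and per-class word probabilities (illustration only).
prior = {"spam": 0.13, "ham": 0.87}
word_prob = {
    "spam": {"free": 0.05, "prize": 0.04, "meeting": 0.001},
    "ham": {"free": 0.005, "prize": 0.0005, "meeting": 0.02},
}

def log_score(words, label):
    """Unnormalized log posterior: log P(label) + sum of log P(word | label)."""
    return math.log(prior[label]) + sum(math.log(word_prob[label][w]) for w in words)

def classify(words):
    """Return the class whose score is higher."""
    return max(prior, key=lambda label: log_score(words, label))

print(classify(["free", "prize"]))
print(classify(["meeting"]))
```

With these made-up numbers the first message scores higher under Spam and the second under Ham, even though the Ham prior is much larger.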

In this project, you will:
- Load and explore a dataset of labeled email messages
- Prepare the data by splitting it into training and testing sets
- Use scikit-learn to vectorize the text data (convert words to features)
- Train a Multinomial Naive Bayes classifier
- Evaluate the model's performance
- Analyze the results and identify the most informative features

<br>

Follow the instructions in the code cells to complete and test your code. You will replace all triple underscores (___) with your code.
Please refer to the lecture slides for details on each of the functions/algorithms and hints on the implementation.

**Step 1:**  

Set up the environment and load the dataset

In [None]:
#Step 1:

#Mount your google drive and copy the dataset to the current working directory (!cp),
#or change the working directory to the folder (%cd).
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#!ls /content/drive/MyDrive/

In [None]:
!ls /content/drive/MyDrive/Colab\ Notebooks/

 LinearRegression_project.ipynb         Untitled
 Naive_Bayes_project.ipynb	       'Untitled (1)'
 PCA_LogisticRegression_project.ipynb  'Untitled (2)'
 spam_training_dataset_2.csv


In [None]:
%cd /content/drive/MyDrive/Colab\ Notebooks/

/content/drive/MyDrive/Colab Notebooks


In [None]:
# Set up the environment and read the dataset

# Imports you will need are already provided, no coding needed here.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


# Read the dataset into a Pandas dataframe
df = pd.read_csv('spam_training_dataset_2.csv')


#For verification purposes - do not change code below this line.
import doctest

"""
  >>> print(df.shape)
  (5563, 2)
  >>> print(df['label'].value_counts().iloc[0], df['label'].value_counts().iloc[1])
  4818 745
"""

doctest.testmod()

TestResults(failed=0, attempted=2)

**Step 2:**  

Prepare the dataset
- Split the data into train and test sets with 80% train and 20% test
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
# Split the data into training and testing sets (80% train, 20% test)
# Set the random_state = 42 for reproducibility
# Set stratify=df['label'] to ensure the same ham/spam ratio
X_train, X_test, y_train, y_test = train_test_split(
    df['text'],            # email text content
    df['label'],           # labels (ham/spam)
    test_size=0.2,         # 80% train, 20% test
    random_state=42,       # fixed seed for reproducibility
    stratify=df['label'],  # keep the same ham/spam ratio in both sets
)

# Display the shapes of our training and testing sets
print(f"Training set size: {X_train.shape[0]} messages")
print(f"Testing set size: {X_test.shape[0]} messages")

# Check class distribution in training and testing sets
print("\nClass distribution in training set:")
print(y_train.value_counts())
print(f"Spam ratio: {y_train.value_counts(normalize=True)['spam']:.2%}")

print("\nClass distribution in testing set:")
print(y_test.value_counts())
print(f"Spam ratio: {y_test.value_counts(normalize=True)['spam']:.2%}")

# Display a few examples from the training set
print("\nExamples from the training set:")
train_examples = pd.DataFrame({
    'text': X_train.iloc[:5].values,
    'label': y_train.iloc[:5].values
})
print(train_examples)


#For verification purposes - do not change code below this line.
import doctest

"""
  >>> print(X_train.shape[0])
  4450
  >>> print(X_test.shape[0])
  1113
  >>> np.isclose(y_train.value_counts(normalize=True)['spam'], 0.13393258426966292, atol=10e-3)
  np.True_
  >>> np.isclose(y_test.value_counts(normalize=True)['ham'], 0.8660674157303371, atol=10e-3)
  np.True_
"""

doctest.testmod()

Training set size: 4450 messages
Testing set size: 1113 messages

Class distribution in training set:
label
ham     3854
spam     596
Name: count, dtype: int64
Spam ratio: 13.39%

Class distribution in testing set:
label
ham     964
spam    149
Name: count, dtype: int64
Spam ratio: 13.39%

Examples from the training set:
                                                text label
0  Me too baby! I promise to treat you well! I be...   ham
1  YOU HAVE WON! As a valued Vodafone customer ou...  spam
2                             When did dad get back.   ham
3  Daddy, shu shu is looking 4 u... U wan me 2 te...   ham
4                             Remember on that day..   ham


TestResults(failed=0, attempted=4)

**Step 3:**  

Create a Pipeline containing the sequence of processing functions to train a Multinomial Naive Bayes model.  

<br>

Pipeline:   
Chains together multiple data processing steps. It's useful for tasks that involve preprocessing, feature engineering, and model training.  
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html   
- Processing blocks are identified with a list of steps accessed by name, e.g. 'scaler', 'classifier', 'vectorizer'.  
- e.g. *pipeline = Pipeline([ ('scaler', StandardScaler()), ('classifier', LogisticRegression()) ])*

  Our Pipeline for text classification will consist of:
  - A text transformer (CountVectorizer).  
    - Specify this with 'vectorizer'.
    - Remove English stop words by setting *stop_words* to 'english'.
    - Set the minimum document frequency (*min_df*) for a word to be included to 2.
    - Use only single words (unigrams). This is done by setting *ngram_range* to (1,1).
  - A classifier (MultinomialNB). Specify this with 'classifier'.
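
As a rough illustration of what a Pipeline does, the sketch below fits the same two steps once through a Pipeline and once by hand, then compares the predictions. The toy messages and labels are hypothetical and are not taken from the project dataset; default CountVectorizer settings are used here rather than the project's settings so the tiny vocabulary is not filtered away.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the project dataset.
texts = ["win a free prize now", "free prize claim now",
         "see you at lunch", "call me after lunch"]
labels = ["spam", "spam", "ham", "ham"]
new_texts = ["claim your free prize", "lunch later"]

# Option 1: the pipeline vectorizes and classifies in one object.
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_texts)

# Option 2: the same two steps done by hand.
vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(texts), labels)
manual_pred = clf.predict(vec.transform(new_texts))

print(pipe_pred)
print(manual_pred)
```

Both options should give the same predictions; the Pipeline simply guarantees that the vocabulary learned during `fit` is reused unchanged at `predict` time.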

<br>

CountVectorizer:   
Converts the dataset text to features. Use CountVectorizer in your implementation; it performs the following operations:
  - Tokenize: The text is split into individual words (tokens)
  - Build Vocabulary: A vocabulary of unique words is created from all documents
  - Create Vector: Each document is converted into a vector where:  
    - Each position corresponds to a word in the vocabulary
    - The value is the count of that word in the document

See the LSA lecture and sklearn library API for more details and examples on CountVectorizer.
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

<br>

MultinomialNB:   
Classification algorithm for datasets with discrete features, commonly used in text classification where the features are word counts.
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
- The *alpha* parameter adds a constant (e.g. 1.0) to every feature count so that a word never seen with a class during training does not get zero probability (the zero-frequency problem).
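
To make the effect of *alpha* concrete, the sketch below computes the smoothed word probabilities for one class by hand and compares them with what MultinomialNB stores in `feature_log_prob_`. The count matrix is hypothetical, and the manual formula, (count + alpha) / (total count + alpha × vocabulary size), is the standard additive-smoothing estimate.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary words.
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 2],
              [0, 0, 3]])
y = np.array([1, 1, 0, 0])   # 1 = spam, 0 = ham
alpha = 1.0

clf = MultinomialNB(alpha=alpha).fit(X, y)

# Manual smoothed estimate of P(word | spam).
spam_counts = X[y == 1].sum(axis=0)     # total count of each word in spam docs
manual = (spam_counts + alpha) / (spam_counts.sum() + alpha * X.shape[1])

print(manual)
print(np.exp(clf.feature_log_prob_[1]))  # row 1 corresponds to class 1 (spam)
```

The third word never occurs in a spam document, yet its smoothed probability is small but nonzero, which is exactly what prevents the product in the Naive Bayes formula from collapsing to zero.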


In [None]:
# Step 3:

# Create a pipeline with CountVectorizer and MultinomialNB
# CountVectorizer converts text to word count features
# MultinomialNB is the Multinomial Naive Bayes classifier

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        stop_words='english',   # remove English stop words
        min_df=2,               # keep only terms that appear in at least 2 documents
        ngram_range=(1, 1),     # use unigrams (single words) only
    )),
    ('classifier', MultinomialNB(
        alpha=1.0,              # add-one smoothing to prevent zero-frequency problems
    )),
])



#For verification purposes - do not change code below this line.
import doctest

"""
  >>> print(pipeline.steps)
  [('vectorizer', CountVectorizer(min_df=2, stop_words='english')), ('classifier', MultinomialNB())]
"""

doctest.testmod()

TestResults(failed=0, attempted=1)

**Step 4:**   

Train the model, make predictions, and calculate accuracy
- Use the Pipeline *fit* function.
- Use the Pipeline *predict* function.
- Use sklearn.metrics *accuracy_score* function.
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Print out a classification report for the model
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html


In [None]:
# Step 4:

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# Print some metrics that are typical for spam filters
# Calculations provided - no coding needed here.

print("\nSpam Detection Metrics:")
# Calculate precision for 'spam' class (how many predicted spam are actually spam)
spam_precision = conf_matrix[1, 1] / (conf_matrix[0, 1] + conf_matrix[1, 1])
# Calculate recall for 'spam' class (how many actual spam were caught)
spam_recall = conf_matrix[1, 1] / (conf_matrix[1, 0] + conf_matrix[1, 1])
# Calculate false positive rate (ham incorrectly classified as spam)
false_positive_rate = conf_matrix[0, 1] / (conf_matrix[0, 0] + conf_matrix[0, 1])

print(f"Spam Precision: {spam_precision:.4f} (higher is better)")
print(f"Spam Recall: {spam_recall:.4f} (higher is better)")
print(f"False Positive Rate: {false_positive_rate:.4f} (lower is better)")



#For verification purposes - do not change code below this line.
import doctest

"""
  >>> np.isclose(accuracy, 0.98472597, atol=10e-3)
  np.True_
  >>> np.isclose(spam_precision, 0.95833333, atol=10e-3)
  np.True_
  >>> np.isclose(spam_recall, 0.92617450, atol=10e-3)
  np.True_
  >>> np.isclose(false_positive_rate, 0.00622407, atol=10e-3)
  np.True_
"""

doctest.testmod()



Accuracy: 0.9847

Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       964
        spam       0.96      0.93      0.94       149

    accuracy                           0.98      1113
   macro avg       0.97      0.96      0.97      1113
weighted avg       0.98      0.98      0.98      1113


Confusion Matrix:
[[958   6]
 [ 11 138]]

Spam Detection Metrics:
Spam Precision: 0.9583 (higher is better)
Spam Recall: 0.9262 (higher is better)
False Positive Rate: 0.0062 (lower is better)


TestResults(failed=0, attempted=4)

# Reflection Questions  


Using your notebook results, answer the following questions:

1. The Naive Bayes algorithm makes a fundamental "naive" assumption that is known to be incorrect in real-world text data, yet the algorithm still performs well for our spam filter application.
   - What is this "naive" assumption?
   - Provide an example from email text that contradicts this assumption.
   - Why does Naive Bayes still perform well for spam detection despite this incorrect assumption?
   - How might this assumption affect performance differently in other NLP tasks compared to spam detection?

**My Ans:** The naive assumption is that the features (words) are conditionally independent of one another given the class label (y). A counterexample: in travel-related spam, the words "hotel" and "trip" tend to appear together, so seeing "trip" changes the probability of seeing "hotel" even when we already know the message is spam. Naive Bayes probably still performs well because correlated spam words do not hinder the spam/non-spam decision: they all push the score further toward Spam, so the predicted class is usually still correct. In NLP tasks where word order or sentence structure matters, rather than simple classification, the independence assumption discards that information and performance would suffer more.



2. Run the "Examine most informative features" code section below to identify the most informative words for spam and ham classification.

- List the top 5 words most strongly associated with spam and explain why these words make intuitive sense for spam detection.
- List 2-3 words that are strongly associated with ham (legitimate emails) and explain their significance.
- Identify at least one word in the top results that surprised you, and explain why.
- If you were to improve the spam filter, would you manually add or remove any specific features (words) based on this analysis? Why or why not?

**My Ans:** According to the results, the top 5 words associated with spam are "claim", "prize", "150p", "tone", and "18". These words match typical spam content such as "claim your prize" and "18+". Three words strongly associated with legitimate emails are "later", "said", and "ask", which are common in everyday personal or work conversation. The words "gt" and "lt" surprised me because they are not words people actually write but HTML-entity artifacts. Such artifacts can appear in both spam and legitimate emails, so to improve the spam filter we could manually remove these non-language tokens to reduce noise.



3. When implementing a Naive Bayes spam filter in a real-world email system, various trade-offs must be considered.

- Explain the trade-off between precision and recall in the context of spam filtering. Which metric would you prioritize and why?
- How would you adjust the model to minimize the risk of important legitimate emails being classified as spam? Be specific about which parameters you would modify.
- Our model used only unigrams (single words) as features. Discuss one advantage and one disadvantage of extending the model to include bigrams (word pairs) for spam detection.
- Multinomial Naive Bayes uses Laplace (add-one) smoothing to handle unseen words. Explain why this smoothing is necessary and how it would affect classification of an email containing words not seen during training.

**My Ans:** 

Precision asks: among all emails we mark as spam, how many are truly spam? Recall asks: among all truly spam emails, how many do we mark as spam? They correspond to the two types of errors (false positives and false negatives). In a real-world scenario I think precision is more useful, because people care more about whether their legitimate emails are mistakenly marked as spam. To minimize the risk of important legitimate emails being classified as spam, we can raise the decision threshold for the spam class or lower the prior P(spam) (the *class_prior* parameter of MultinomialNB).
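
One way to sketch the threshold idea is to read the spam probability from `predict_proba` and label a message as spam only when that probability clears a chosen cutoff. The toy messages, the helper name `predict_with_threshold`, and the 0.9 cutoff are all assumptions made for this illustration; they are not part of the project code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages, used only to make the sketch self-contained.
texts = ["win a free prize now", "claim your free prize", "urgent prize claim",
         "see you at lunch", "call me later today", "free for lunch today"]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB(alpha=1.0)),
]).fit(texts, labels)

def predict_with_threshold(model, messages, threshold=0.5):
    """Label a message spam only when P(spam | message) >= threshold."""
    spam_col = list(model.classes_).index("spam")
    p_spam = model.predict_proba(messages)[:, spam_col]
    return np.where(p_spam >= threshold, "spam", "ham")

messages = ["free lunch today", "claim your prize now", "are you free later"]
default_pred = predict_with_threshold(toy_pipeline, messages, threshold=0.5)
strict_pred = predict_with_threshold(toy_pipeline, messages, threshold=0.9)
print(default_pred)
print(strict_pred)
```

A stricter cutoff can only move messages from spam to ham, so it tends to trade recall for precision; the right value would have to be chosen on validation data.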

One advantage of including bigrams is that the model takes short phrases into account. One disadvantage is that using both unigrams and bigrams greatly enlarges the feature set, which reduces efficiency.

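Smoothing is necessary to avoid zero probabilities: without it, a word that never appeared in one class during training would get P(word | class) = 0 and force the whole product for that class to zero. With add-one smoothing, such a word only lowers the score for that class instead of ruling it out.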

In [None]:
# Examine most informative features
# Run this cell and use the output to answer the Reflection question above.

def show_most_informative_features(vectorizer, classifier, n=20):
    feature_names = vectorizer.get_feature_names_out()
    # Get the log probability ratios for features
    coefs_with_fns = sorted(zip(classifier.feature_log_prob_[1] - classifier.feature_log_prob_[0],
                               feature_names))
    # Get top features for spam (positive coefficient)
    top_spam = coefs_with_fns[-n:]
    # Get top features for ham (negative coefficient)
    top_ham = coefs_with_fns[:n]

    print("\nTop Spam-indicating words:")
    for coef, feat in reversed(top_spam):
        print(f"{feat}: {coef:.4f}")

    print("\nTop Ham-indicating words:")
    for coef, feat in top_ham:
        print(f"{feat}: {coef:.4f}")

# Get the vectorizer and classifier from the pipeline
vectorizer = pipeline.named_steps['vectorizer']
classifier = pipeline.named_steps['classifier']

# Show most informative features
show_most_informative_features(vectorizer, classifier)

# Compare with baseline (always predict majority class)
majority_class = y_train.mode()[0]
baseline_accuracy = (y_test == majority_class).mean()
print(f"\nBaseline (majority class) accuracy: {baseline_accuracy:.4f}")
print(f"Our model improvement over baseline: {accuracy - baseline_accuracy:.4f}")


Top Spam-indicating words:
claim: 5.3464
prize: 4.9786
150p: 4.8245
tone: 4.5491
18: 4.4991
cs: 4.4731
500: 4.4731
guaranteed: 4.4190
100: 4.3320
uk: 4.3013
1000: 4.2695
landline: 4.2028
awarded: 4.2028
www: 4.1314
ringtone: 4.1314
150ppm: 4.1314
collection: 4.0136
5000: 3.9266
16: 3.9266
000: 3.9266

Top Ham-indicating words:
gt: -4.7366
lt: -4.7290
lor: -4.0698
da: -4.0473
later: -3.8466
wat: -3.6531
amp: -3.5212
ask: -3.4820
said: -3.4130
home: -3.3840
cos: -3.3541
doing: -3.3541
come: -3.2916
morning: -3.2588
really: -3.2588
lol: -3.2075
sure: -3.1898
ll: -3.1673
gud: -3.1535
nice: -3.1348

Baseline (majority class) accuracy: 0.8661
Our model improvement over baseline: 0.1186
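
The scores printed above are differences of log probabilities, log P(word | spam) − log P(word | ham), so exponentiating one gives the ratio of the two smoothed word probabilities. A small sketch using three values copied from the output above:

```python
import math

# Log-probability ratios copied from the printed output above.
log_ratios = {"claim": 5.3464, "prize": 4.9786, "gt": -4.7366}

for word, log_ratio in log_ratios.items():
    ratio = math.exp(log_ratio)   # P(word | spam) / P(word | ham)
    if ratio >= 1:
        print(f"{word}: about {ratio:.0f}x more likely in spam than in ham")
    else:
        print(f"{word}: about {1 / ratio:.0f}x more likely in ham than in spam")
```

Read this way, a score near +5 or −5 means the word is on the order of a hundred times more likely under one class than the other.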
