In [1]:
intro_text = """
CSC620 – Naive Bayes Text Classification (Sarcasm Detection)

Goal:
- Train and evaluate a Naive Bayes text classifier using scikit-learn.
- Dataset: News Headlines Dataset for Sarcasm Detection.
  'is_sarcastic' = 1 means sarcastic, 0 means not sarcastic.
- Model: Multinomial Naive Bayes (bag-of-words).

At the end we will:
1. Show accuracy, precision, recall, f1-score, and confusion matrix.
2. Explain what those metrics mean.
3. Compare this model to my custom Naive Bayes from Part 1.
"""

print(intro_text)





In [2]:
# Libraries we need:
# - pandas: load and inspect the dataset
# - train_test_split: split data into train and test sets
# - CountVectorizer: convert text -> word count features
# - MultinomialNB: Naive Bayes model for word counts
# - metrics: evaluate predictions

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


In [3]:
# Load the sarcasm dataset. It's a JSON Lines file (one JSON object per line).
# Columns we care about:
# - "headline": the text we classify
# - "is_sarcastic": the label (1 sarcastic, 0 not sarcastic)

data = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)

print("Columns in dataset:", data.columns.tolist())
print("\nFirst 5 rows:")
print(data.head())

print("\nTotal number of rows:", len(data))


Columns in dataset: ['article_link', 'headline', 'is_sarcastic']

First 5 rows:
                                        article_link  \
0  https://www.huffingtonpost.com/entry/versace-b...   
1  https://www.huffingtonpost.com/entry/roseanne-...   
2  https://local.theonion.com/mom-starting-to-fea...   
3  https://politics.theonion.com/boehner-just-wan...   
4  https://www.huffingtonpost.com/entry/jk-rowlin...   

                                            headline  is_sarcastic  
0  former versace store clerk sues over secret 'b...             0  
1  the 'roseanne' revival catches up to our thorn...             0  
2  mom starting to fear son's web series closest ...             1  
3  boehner just wants wife to listen, not come up...             1  
4  j.k. rowling wishes snape happy birthday in th...             0  

Total number of rows: 26709


In [4]:
# We want to see how many sarcastic vs non-sarcastic headlines there are.
# This gives us an idea of class balance, which affects priors in Naive Bayes.

label_counts = data["is_sarcastic"].value_counts()
print("Label counts:\n", label_counts)

sarcastic_pct = (label_counts[1] / len(data)) * 100
nonsarcastic_pct = (label_counts[0] / len(data)) * 100
print(f"\nSarcastic: {sarcastic_pct:.2f}%")
print(f"Not sarcastic: {nonsarcastic_pct:.2f}%")

print("\nSample sarcastic headlines (is_sarcastic = 1):")
print(data[data["is_sarcastic"] == 1]["headline"].head(5).tolist())

print("\nSample non-sarcastic headlines (is_sarcastic = 0):")
print(data[data["is_sarcastic"] == 0]["headline"].head(5).tolist())


Label counts:
 is_sarcastic
0    14985
1    11724
Name: count, dtype: int64

Sarcastic: 43.90%
Not sarcastic: 56.10%

Sample sarcastic headlines (is_sarcastic = 1):
["mom starting to fear son's web series closest thing she will have to grandchild", 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas', 'top snake handler leaves sinking huckabee campaign', "nuclear bomb detonates during rehearsal for 'spider-man' musical", "cosby lawyer asks why accusers didn't come forward to be smeared by legal team years ago"]

Sample non-sarcastic headlines (is_sarcastic = 0):
["former versace store clerk sues over secret 'black code' for minority shoppers", "the 'roseanne' revival catches up to our thorny political mood, for better and worse", 'j.k. rowling wishes snape happy birthday in the most magical way', "advancing the world's women", 'the fascinating case for eating lab-grown meat']


In [5]:
# X = text (headline)
# y = label (is_sarcastic)
#
# We will split 80% train / 20% test.
# stratify=y keeps the class ratio similar in both splits.
# random_state is for reproducibility.

X = data["headline"]
y = data["is_sarcastic"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


Train size: 21367
Test size: 5342
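
To make the effect of stratify=y concrete, here is a minimal pure-Python sketch of a stratified split. It is a simplified stand-in and not scikit-learn's actual algorithm: the helper stratified_split and the toy labels are hypothetical, made up only to show that splitting each class separately keeps the class ratio close in both parts.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class separately.

    Simplified illustration only; train_test_split uses its own procedure.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels with roughly the dataset's balance (56% / 44%).
toy_labels = [0] * 56 + [1] * 44
train_idx, test_idx = stratified_split(toy_labels)

def sarcastic_share(indices):
    """Fraction of the given rows whose label is 1."""
    return sum(toy_labels[i] for i in indices) / len(indices)

print("train/test sizes:", len(train_idx), len(test_idx))
print("sarcastic share in train:", sarcastic_share(train_idx))
print("sarcastic share in test:", sarcastic_share(test_idx))
```

Without stratification, a random 20% sample could by chance contain noticeably more or fewer sarcastic headlines than the training part, which would shift the priors and make the test metrics harder to interpret.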


In [6]:
# Multinomial Naive Bayes works on word counts.
# CountVectorizer:
#   1. learns a vocabulary from the training text only
#   2. turns each headline into a sparse vector of word counts
# stop_words='english' drops common English words (e.g. "the", "and") before counting.
#
# This automates the word-counting step I did by hand in Part 1;
# estimating P(word|class) from those counts happens in MultinomialNB.

vectorizer = CountVectorizer(stop_words='english')

# Learn vocabulary and transform training data
X_train_vec = vectorizer.fit_transform(X_train)

# Transform test data using the same vocabulary
X_test_vec = vectorizer.transform(X_test)

print("Vocabulary size:", len(vectorizer.vocabulary_))
print("Training matrix shape:", X_train_vec.shape)  # (num_train_examples, num_features)


Vocabulary size: 22636
Training matrix shape: (21367, 22636)
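
The fit/transform behavior above can be sketched in plain Python. This is a rough illustration under stated assumptions, not CountVectorizer's implementation: the helpers tokenize, fit_vocabulary and transform are hypothetical names, the toy headlines are made up, the output is a dense list instead of a sparse matrix, and stop-word removal is left out to keep it short.

```python
import re
from collections import Counter

# Tokens of two or more word characters, lowercased.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    """Map every word seen in the training texts to a column index."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocab):
    """Turn each text into a row of word counts over the fixed vocabulary."""
    rows = []
    for t in texts:
        row = [0] * len(vocab)
        for word, n in Counter(tokenize(t)).items():
            if word in vocab:  # words never seen during fit are skipped
                row[vocab[word]] = n
        rows.append(row)
    return rows

# Hypothetical toy data.
train = ["mom fears web series", "web series about mom"]
test = ["dad fears web series"]

vocab = fit_vocabulary(train)
train_rows = transform(train, vocab)
test_rows = transform(test, vocab)

print(vocab)
print(train_rows)
print(test_rows)  # "dad" has no column, so it contributes nothing
```

The key point is that the vocabulary is fixed by the training text. That is why the test matrix has the same number of columns as the training matrix, and why a word that only appears in the test set has no effect on the prediction.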


In [7]:
# MultinomialNB = Multinomial Naive Bayes.
# It learns:
#   P(class = c)    -> prior
#   P(word|class)   -> likelihood of each word in each class
#
# Same math as my Part 1 code (Laplace smoothing is the default, alpha=1.0),
# just optimized and built in.

nb = MultinomialNB()
nb.fit(X_train_vec, y_train)

print("Classes (0 means not sarcastic, 1 means sarcastic):", nb.classes_)
print("Class log priors (log P(class)):", nb.class_log_prior_)


Classes (0 means not sarcastic, 1 means sarcastic): [0 1]
Class log priors (log P(class)): [-0.57794153 -0.82337453]
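
As a sanity check on the two quantities the model learns, here is a small sketch using only the math module. The first part converts the printed log priors back to probabilities, which should land close to the class shares from cell 4. The second part shows the Laplace-smoothed likelihood formula; the function name smoothed_log_likelihood and the counts passed to it are hypothetical, chosen only for illustration.

```python
import math

# Log priors copied from the notebook output above, for classes [0, 1].
log_priors = [-0.57794153, -0.82337453]
priors = [math.exp(v) for v in log_priors]
print("P(class):", priors)
print("sum of priors:", sum(priors))

def smoothed_log_likelihood(word_count, class_total, vocab_size, alpha=1.0):
    """log P(word|class) with additive (Laplace) smoothing.

    word_count:  times the word appears in headlines of this class
    class_total: total word count over all headlines of this class
    vocab_size:  number of distinct words in the vocabulary
    """
    return math.log((word_count + alpha) / (class_total + alpha * vocab_size))

# Hypothetical counts, not taken from the sarcasm dataset.
seen = smoothed_log_likelihood(word_count=30, class_total=1000, vocab_size=500)
unseen = smoothed_log_likelihood(word_count=0, class_total=1000, vocab_size=500)
print("log P(word|class), word seen 30 times:", seen)
print("log P(word|class), word seen 0 times:", unseen)
```

With alpha=1.0, a word that never occurs in a class still gets a small non-zero probability, so the log stays finite and one missing word cannot rule out a class on its own.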


In [8]:
# Let's evaluate on the test set that the model never saw.
# We measure:
# - accuracy overall
# - classification_report (precision, recall, f1-score per class)
# - confusion_matrix

y_pred = nb.predict(X_test_vec)

acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, digits=4)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy on test set:", acc)
print("\nClassification Report:\n", report)
print("Confusion Matrix:\n", cm)


Accuracy on test set: 0.8049419692998877

Classification Report:
               precision    recall  f1-score   support

           0     0.8114    0.8498    0.8302      2997
           1     0.7957    0.7475    0.7709      2345

    accuracy                         0.8049      5342
   macro avg     0.8036    0.7987    0.8005      5342
weighted avg     0.8045    0.8049    0.8042      5342

Confusion Matrix:
 [[2547  450]
 [ 592 1753]]
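
Every number in the report can be recomputed from the confusion matrix alone. The sketch below does that in plain Python with the four counts printed above; the helper per_class_metrics is a hypothetical name, and the code is an illustration of the formulas, not how scikit-learn computes them internally.

```python
# Confusion matrix from the output above: rows = actual, cols = predicted.
cm = [[2547, 450],
      [592, 1753]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1, support) for one class."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: predicted as cls
    support = sum(cm[cls])                   # row sum: actually cls
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

metrics = [per_class_metrics(cm, c) for c in (0, 1)]
total = sum(sum(row) for row in cm)
accuracy = sum(cm[c][c] for c in (0, 1)) / total

# macro avg: plain mean over classes; weighted avg: mean weighted by support.
macro = [sum(m[i] for m in metrics) / len(metrics) for i in range(3)]
weighted = [sum(m[i] * m[3] for m in metrics) / total for i in range(3)]

for cls, (p, r, f1, support) in enumerate(metrics):
    print(f"class {cls}: precision={p:.4f} recall={r:.4f} f1={f1:.4f} support={support}")
print(f"accuracy={accuracy:.4f}")
print("macro avg:    " + " ".join(f"{v:.4f}" for v in macro))
print("weighted avg: " + " ".join(f"{v:.4f}" for v in weighted))
```

Reading the matrix this way also shows where the errors are: 592 sarcastic headlines were missed versus 450 normal headlines flagged as sarcastic, which is why recall for class 1 (0.7475) is lower than its precision (0.7957).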


In [9]:
explanation_metrics = """
How to read classification_report():

For each class (0 = not sarcastic, 1 = sarcastic):
- precision:
    Of all headlines the model PREDICTED as this class, how many were actually that class?
    High precision for class 1 = when we say 'sarcastic', we're usually right (few false alarms).
- recall:
    Of all the TRUE headlines of this class in the test set, how many did we correctly find?
    High recall for class 1 = we are catching most sarcastic headlines instead of missing them.
- f1-score:
    Harmonic mean of precision and recall. Balances 'don't lie' vs 'don't miss'.
- support:
    How many examples of that class were actually in the test set.

Bottom of the report:
- accuracy:
    Overall percent of correct predictions.
- macro avg:
    Average of precision/recall/F1 treating both classes equally (good if classes are imbalanced).
- weighted avg:
    Average weighted by how many examples are in each class.

Confusion matrix (2x2 here):
Rows = actual class, Cols = predicted class

[0,0] top-left  = actual non-sarcastic predicted non-sarcastic (correct)
[0,1] top-right = actual non-sarcastic predicted sarcastic (false positive for sarcasm)
[1,0] bottom-left = actual sarcastic predicted non-sarcastic (false negative for sarcasm)
[1,1] bottom-right = actual sarcastic predicted sarcastic (correct)

This tells us what kind of mistakes we make more:
- Are we calling normal headlines sarcastic too often?
- Or are we missing sarcasm because it's subtle?
"""

print(explanation_metrics)




In [10]:
comparison_text = """
Comparison: Part 1 (my own Naive Bayes) vs Part 2 (scikit-learn Naive Bayes)

1. Data size / difficulty
   Part 1:
     - Very small, hand-built dataset (like spam vs ham texts).
     - Tiny vocabulary. Words like "free", "click", "prize" strongly scream spam.
   Part 2:
     - Real sarcasm headlines, much bigger dataset.
     - Sarcasm is harder than spam because sarcasm depends on tone and humor.
     - The model is learning from way more examples, so the probabilities are more reliable.

2. Feature extraction
   Part 1:
     - I manually tokenized with .split(), counted words per class,
       and computed P(word|class) with Laplace smoothing.
     - I wrote those probabilities to model.csv.
   Part 2:
     - I used CountVectorizer to build a bag-of-words matrix automatically.
     - MultinomialNB learns P(class) and P(word|class) for me.
     - It's literally the same math idea, just automated and scalable.

3. Evaluation
   Part 1:
     - I output test_predictions.csv with text, predicted, actual.
     - I could get accuracy by counting matches.
   Part 2:
     - I printed classification_report() which gives precision, recall,
       f1-score per class, plus overall accuracy.
     - I also printed the confusion matrix.
     - This shows which class is harder for the model.

4. Handling unseen words
   Part 1:
     - If a test message had a new word that never showed in training,
       I had to manually assign a tiny fallback probability (like 1e-10)
       so I wouldn't do log(0).
   Part 2:
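     - CountVectorizer.transform() drops words that are not in the training vocabulary,
       so brand-new test words are simply ignored.
     - MultinomialNB applies Laplace smoothing internally (alpha=1.0 by default), so a word
       seen in only one class never gets probability zero in the other.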

5. Real-world usefulness
   Part 1:
     - More like proof I understand Naive Bayes math and file I/O.
   Part 2:
     - This is actually usable for a real NLP task (sarcasm detection),
       and we can judge how good it is with real metrics.

Summary:
Part 1 = I built Naive Bayes myself from scratch.
Part 2 = I used the same idea, but with scikit-learn on a real dataset,
         and I analyzed performance in a more professional way.
"""

print(comparison_text)



Comparison: Part 1 (my own Naive Bayes) vs Part 2 (scikit-learn Naive Bayes)

1. Data size / difficulty
   Part 1:
     - Very small, hand-built dataset (like spam vs ham texts).
     - Tiny vocabulary. Words like "free", "click", "prize" strongly scream spam.
   Part 2:
     - Real sarcasm headlines, much bigger dataset.
     - Sarcasm is harder than spam because sarcasm depends on tone and humor.
     - The model is learning from way more examples, so the probabilities are more reliable.

2. Feature extraction
   Part 1:
     - I manually tokenized with .split(), counted words per class,
       and computed P(word|class) with Laplace smoothing.
     - I wrote those probabilities to model.csv.
   Part 2:
     - I used CountVectorizer to build a bag-of-words matrix automatically.
     - MultinomialNB learns P(class) and P(word|class) for me.
     - It's literally the same math idea, just automated and scalable.

3. Evaluation
   Part 1:
     - I output test_predictions.csv with text, 