# 📘 C1W2_L1: Naive Bayes Likelihoods – Teaching Version

In this workbook, our goal is to build a system that can predict whether a Tweet is positive or negative using the Naive Bayes approach. Like last week’s logistic regression, we’ll keep things simple and interpretable: no word order, no phrasing — just plain counts of which words appear.

But instead of learning weights through optimization (like logistic regression), this time we’ll build a model using **Bayes’ Theorem** and **word frequencies**. Think of this as letting probabilities do the talking, based on how often a word appears in positive vs. negative tweets.

You’ll see lots of overlap with Week 1’s bag-of-words approach, but the core *math* behind the prediction is different: there is no optimization or gradient descent here, just a lookup-based statistical model derived directly from the labeled data. Let’s dive in!

---

## 🧠 What is Naive Bayes?
Naive Bayes is a simple yet powerful classification algorithm based on **Bayes’ Theorem**:

$$
P(\text{label} \mid \text{tweet}) = \frac{P(\text{tweet} \mid \text{label})\, P(\text{label})}{P(\text{tweet})}
$$

The “naive” assumption is that all words in a tweet are **conditionally independent** given the sentiment label (whether the Tweet is Positive or Negative).

🧮 How Naive Bayes Works (Simple Steps):

1️⃣ Clean and tokenize each tweet (remove stopwords, punctuation, etc.)

2️⃣ Count how often each word appears in positive and negative tweets

3️⃣ Calculate Laplace-smoothed probabilities P(word | label) from those counts

4️⃣ For a new tweet, sum the log probabilities of each word under each label

5️⃣ Predict the label with the higher total score
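The code later in this workbook keeps a single score per tweet instead of two separate totals. A short derivation, written here as a sketch under the independence assumption above, shows why that is equivalent. The symbols $w_1, \dots, w_m$ (the cleaned words of a tweet) are notation introduced for this illustration.

$$
\log\frac{P(\text{pos} \mid \text{tweet})}{P(\text{neg} \mid \text{tweet})}
= \underbrace{\log\frac{P(\text{pos})}{P(\text{neg})}}_{\text{logprior}}
+ \sum_{i=1}^{m} \underbrace{\log\frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}}_{\text{loglikelihood}[w_i]}
$$

The shared denominator $P(\text{tweet})$ cancels in the ratio. Summing the log probabilities under each label and picking the larger total is the same as checking whether this difference is above or below 0.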

---

## 🔧 1. Setup & Downloads

In [13]:
!pip install -q gradio
import nltk
nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

---

## 📥 2. Imports

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gradio as gr

from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
import string

## Toy Example
Like last week, let’s build a full, end-to-end toy example of Naive Bayes using 6 manually written Tweets before moving on to the real Twitter dataset.



In [19]:
import numpy as np
import string
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

# Step 1: Toy dataset
toy_tweets = [
    "I love this!",           # positive
    "Horrible experience",    # negative
    "Best purchase ever",     # positive
    "Terrible product",       # negative
    "I am very happy",        # positive
    "I hate this thing"       # negative
]

toy_labels = np.array([1, 0, 1, 0, 1, 0])  # 1 = positive, 0 = negative

# Step 2: Preprocessing
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
stopwords_english = stopwords.words('english')

def process_tweet(tweet):
    tokens = tokenizer.tokenize(tweet)
    return [word for word in tokens if word not in stopwords_english and word not in string.punctuation]

# Print the cleaned words extracted from each Tweet
for tweet in toy_tweets:
    print(f"Original: {tweet}")
    print(f"Cleaned:  {process_tweet(tweet)}\n")

Original: I love this!
Cleaned:  ['love']

Original: Horrible experience
Cleaned:  ['horrible', 'experience']

Original: Best purchase ever
Cleaned:  ['best', 'purchase', 'ever']

Original: Terrible product
Cleaned:  ['terrible', 'product']

Original: I am very happy
Cleaned:  ['happy']

Original: I hate this thing
Cleaned:  ['hate', 'thing']



In [21]:
# Step 3: Count word frequencies by label
def count_tweets(freq_dict, tweets, labels):
    for label, tweet in zip(labels, tweets):
        for word in process_tweet(tweet):
            pair = (word, label)
            freq_dict[pair] = freq_dict.get(pair, 0) + 1
    return freq_dict

freqs = count_tweets({}, toy_tweets, toy_labels)

print("Word Frequencies by Label (Positive=1 or Negative=0):\n")
for pair in sorted(freqs.keys(), key=lambda x: (x[1], x[0])):  # sort by label, then word
    word, label = pair
    print(f"Label={label} | Word='{word}' → Count: {freqs[pair]}")




Word Frequencies by Label (Positive=1 or Negative=0):

Label=0 | Word='experience' → Count: 1
Label=0 | Word='hate' → Count: 1
Label=0 | Word='horrible' → Count: 1
Label=0 | Word='product' → Count: 1
Label=0 | Word='terrible' → Count: 1
Label=0 | Word='thing' → Count: 1
Label=1 | Word='best' → Count: 1
Label=1 | Word='ever' → Count: 1
Label=1 | Word='happy' → Count: 1
Label=1 | Word='love' → Count: 1
Label=1 | Word='purchase' → Count: 1


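## 🧠 What train_naive_bayes() Does
The next few cells walk through, one step at a time, what the train_naive_bayes() function defined in Section 6 does. It builds a Naive Bayes classifier using:

- freqs: a dictionary mapping each (word, label) pair to how often that word appears in positive or negative tweets

- train_x: the tweets (not directly used here)

- train_y: the labels (used to compute the log prior)

It returns the log prior and a dictionary of per-word log-likelihood ratios.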

In [34]:
#  Step 1: Build the Vocabulary
vocab = set([pair[0] for pair in freqs])
V = len(vocab)

print("📚 Vocabulary (sorted):")
print(sorted(vocab))
print(f"\n🔢 Vocabulary size: {V}")


📚 Vocabulary (sorted):
['best', 'ever', 'experience', 'happy', 'hate', 'horrible', 'love', 'product', 'purchase', 'terrible', 'thing']

🔢 Vocabulary size: 11


In [33]:
# Total number of words in each class (positive & negative Tweets)
N_pos = sum([freqs.get((word, 1), 0) for word in vocab])
N_neg = sum([freqs.get((word, 0), 0) for word in vocab])

print(f"🟢 Total words in positive tweets: {N_pos}")
print(f"🔴 Total words in negative tweets: {N_neg}")


🟢 Total words in positive tweets: 5
🔴 Total words in negative tweets: 6


In [35]:
# Log prior: the log of the ratio of positive to negative Tweets. It is the classifier's baseline bias before any words are seen.
# In our example it is 0, since we have an equal number of positive and negative Tweets.

logprior = np.log(sum(toy_labels) / (len(toy_labels) - sum(toy_labels)))

print(f"\n⚖️ Log Prior: {logprior:.4f}")



⚖️ Log Prior: 0.0000
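The toy set is perfectly balanced, so the prior contributes nothing. To see what happens when it is not, here is a small standard-library sketch. The helper name log_prior and the skewed label list are made up for illustration and are not part of the notebook.

```python
import math

def log_prior(labels):
    """Log of (number of positive tweets / number of negative tweets)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return math.log(n_pos / n_neg)

balanced = log_prior([1, 0, 1, 0, 1, 0])  # 3 positive, 3 negative
skewed = log_prior([1, 1, 1, 0])          # 3 positive, 1 negative (hypothetical)

print(f"balanced: {balanced:.4f}")
print(f"skewed:   {skewed:.4f}")
```

With three positives for every negative, the score starts at log(3), so a tweet needs noticeably negative words before the prediction flips to Negative.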


In [38]:
# Step 4: Compute the log-likelihood ratio for each word. This tells you how strongly each word tilts the prediction toward positive or negative.

loglikelihood = {}

for word in vocab:
    freq_pos = freqs.get((word, 1), 0)
    freq_neg = freqs.get((word, 0), 0)

    # Laplace smoothing
    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)

    # Log-likelihood ratio
    loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    print(f"🔤 '{word}': log(P(word|pos)/P(word|neg)) = {loglikelihood[word]:.4f}")

import pandas as pd

# List to store each row
rows = []

# Loop through vocabulary
for word in sorted(vocab):
    freq_pos = freqs.get((word, 1), 0)
    freq_neg = freqs.get((word, 0), 0)

    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)

    llr = np.log(p_w_pos / p_w_neg)
    loglikelihood[word] = llr

    # Append row for DataFrame
    rows.append({
        "Word": word,
        "Pos Count": freq_pos,
        "Neg Count": freq_neg,
        "P(word|pos)": round(p_w_pos, 4),
        "P(word|neg)": round(p_w_neg, 4),
        "Log-Likelihood": round(llr, 4)
    })

# Create and display table
df_likelihood = pd.DataFrame(rows)
df_likelihood = df_likelihood.sort_values("Log-Likelihood", ascending=False)
df_likelihood.reset_index(drop=True, inplace=True)
df_likelihood


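|  | Word | Pos Count | Neg Count | P(word\|pos) | P(word\|neg) | Log-Likelihood |
|---|---|---|---|---|---|---|
| 0 | best | 1 | 0 | 0.125 | 0.0588 | 0.7538 |
| 1 | ever | 1 | 0 | 0.125 | 0.0588 | 0.7538 |
| 2 | happy | 1 | 0 | 0.125 | 0.0588 | 0.7538 |
| 3 | purchase | 1 | 0 | 0.125 | 0.0588 | 0.7538 |
| 4 | love | 1 | 0 | 0.125 | 0.0588 | 0.7538 |
| 5 | experience | 0 | 1 | 0.0625 | 0.1176 | -0.6325 |
| 6 | hate | 0 | 1 | 0.0625 | 0.1176 | -0.6325 |
| 7 | horrible | 0 | 1 | 0.0625 | 0.1176 | -0.6325 |
| 8 | product | 0 | 1 | 0.0625 | 0.1176 | -0.6325 |
| 9 | terrible | 0 | 1 | 0.0625 | 0.1176 | -0.6325 |
| 10 | thing | 0 | 1 | 0.0625 | 0.1176 | -0.6325 |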
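The numbers in the table can be reproduced by hand. The sketch below plugs the toy values (V = 11, N_pos = 5, N_neg = 6) into the Laplace formula and checks that the smoothed probabilities still sum to 1 within each class. The helper name smoothed is an assumption made for this illustration.

```python
import math

V = 11               # vocabulary size from the toy example
N_pos, N_neg = 5, 6  # total word counts per class from the toy example

def smoothed(freq, n_class, vocab_size):
    """Laplace (add-one) smoothed estimate of P(word | class)."""
    return (freq + 1) / (n_class + vocab_size)

# 'love' appears once in positive tweets and never in negative tweets
p_love_pos = smoothed(1, N_pos, V)            # 2/16
p_love_neg = smoothed(0, N_neg, V)            # 1/17
llr_love = math.log(p_love_pos / p_love_neg)  # log(2.125)

# Positive class: 5 vocabulary words were seen once, the other 6 never
total_pos = 5 * smoothed(1, N_pos, V) + 6 * smoothed(0, N_pos, V)
# Negative class: 6 vocabulary words were seen once, the other 5 never
total_neg = 6 * smoothed(1, N_neg, V) + 5 * smoothed(0, N_neg, V)

print(f"log-likelihood ratio for 'love': {llr_love:.4f}")
print(f"sum of P(word|pos): {total_pos:.4f}, sum of P(word|neg): {total_neg:.4f}")
```

Without the +1, a word that was never seen in one class would get probability 0, and the log ratio would be undefined. Adding V to the denominator is what keeps each class distribution summing to 1.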


In [39]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    words = process_tweet(tweet)
    score = logprior
    for word in words:
        if word in loglikelihood:
            score += loglikelihood[word]
    return score

test_tweet = "I hate this product"
score = naive_bayes_predict(test_tweet, logprior, loglikelihood)
sentiment = "Positive 😀" if score > 0 else "Negative 😞"
print(f"\n🧪 Test Tweet: '{test_tweet}' → Score: {score:.2f} → {sentiment}")



🧪 Test Tweet: 'I hate this product' → Score: -1.27 → Negative 😞
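The score of -1.27 is a log-odds value: the log prior (0) plus the entries for 'hate' and 'product' (-0.6325 each). Under the model's own assumptions it can be turned into a probability with the logistic function. The sketch below does that. The helper name score_to_probability is hypothetical, and the result should be read loosely, since Naive Bayes probabilities are often poorly calibrated.

```python
import math

def score_to_probability(score):
    """Map a log-odds score log(P(pos)/P(neg)) to P(pos) via the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

# logprior + loglikelihood['hate'] + loglikelihood['product'], rounded table values
score = 0.0 + (-0.6325) + (-0.6325)
p_pos = score_to_probability(score)

print(f"score = {score:.3f}, P(positive) is about {p_pos:.2f}")
```

A score of exactly 0 maps to 0.5, which is why 0 serves as the decision threshold in the cell above.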


---

## 💬 3. Load and Peek at the Data

In [15]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

print("Sample positive tweet:")
print(all_positive_tweets[0])

Sample positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)


---

## 🧹 4. Preprocess the Tweets

In [16]:
stopwords_english = stopwords.words('english')
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def process_tweet(tweet):
    tokens = tokenizer.tokenize(tweet)
    clean = [word for word in tokens if word not in stopwords_english and word not in string.punctuation]
    return clean

# Try on a sample
process_tweet(all_negative_tweets[0])

['hopeless', 'tmr', ':(']
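The ':(' token survives the filter because string.punctuation is one long string, so the "not in" check is a substring test rather than a per-character test. Here is a small sketch of that behavior. The helper name keeps_token is hypothetical and mirrors only the punctuation part of process_tweet.

```python
import string

def keeps_token(token):
    """Keep a token unless it is a substring of string.punctuation."""
    return token not in string.punctuation

print(keeps_token("!"))    # False: a single punctuation mark is dropped
print(keeps_token(":("))   # True: the emoticon is not a substring, so it stays
print(keeps_token("..."))  # True: repeated marks are kept as well
```

Keeping emoticons is convenient here, since they tend to carry sentiment, but the same behavior also lets through tokens such as '...' that carry little signal.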

---

## 📊 5. Count Word Frequencies

In [17]:
def count_tweets(freq_dict, tweets, labels):
    for label, tweet in zip(labels, tweets):
        for word in process_tweet(tweet):
            pair = (word, label)
            freq_dict[pair] = freq_dict.get(pair, 0) + 1
    return freq_dict

---

## 📐 6. Train Naive Bayes

In [None]:
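def train_naive_bayes(freqs, train_x, train_y):
    # Note: train_x is not used directly; freqs already holds the word counts
    loglikelihood = {}

    # Vocabulary: every distinct word seen in the training tweets
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # Total number of word occurrences in each class
    N_pos = N_neg = 0
    for pair in freqs:
        if pair[1] == 1:
            N_pos += freqs[pair]
        else:
            N_neg += freqs[pair]

    # Log prior: log of the ratio of positive to negative tweets
    D = len(train_y)
    D_pos = sum(train_y)
    D_neg = D - D_pos
    logprior = np.log(D_pos / D_neg)

    # Log-likelihood ratio for each word, with Laplace (add-one) smoothing
    for word in vocab:
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood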

---

## 🏗️ 7. Build and Train the Model

In [None]:
train_x = all_positive_tweets[:4000] + all_negative_tweets[:4000]
train_y = np.append(np.ones(4000), np.zeros(4000))

freqs = count_tweets({}, train_x, train_y)
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

---

## 🔮 8. Predict on New Tweets

In [None]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    words = process_tweet(tweet)
    score = logprior
    for word in words:
        if word in loglikelihood:
            score += loglikelihood[word]
    return score

# Try a test tweet
tweet = "Today is awesome!"
print(f"Score: {naive_bayes_predict(tweet, logprior, loglikelihood):.2f}")
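A single score says little about how well the model generalizes. One way to check is to measure accuracy on tweets the model has not seen. The sketch below is a minimal, self-contained version of that idea. The function name evaluate_accuracy, the stand-in scorer demo_score, and the four demo tweets are all assumptions made for illustration.

```python
def evaluate_accuracy(tweets, labels, score_fn):
    """Fraction of tweets whose predicted label (score > 0 means positive) matches the true label."""
    correct = 0
    for tweet, label in zip(tweets, labels):
        predicted = 1 if score_fn(tweet) > 0 else 0
        correct += int(predicted == int(label))
    return correct / len(labels)

# Stand-in scorer so the sketch runs on its own (not the trained model)
def demo_score(tweet):
    return 1.0 if "good" in tweet else -1.0

demo_tweets = ["good day", "bad day", "so good", "meh"]
demo_labels = [1, 0, 0, 0]
demo_acc = evaluate_accuracy(demo_tweets, demo_labels, demo_score)
print(f"Accuracy: {demo_acc:.2f}")
```

In this notebook, score_fn could be `lambda t: naive_bayes_predict(t, logprior, loglikelihood)`, evaluated on `all_positive_tweets[4000:] + all_negative_tweets[4000:]` with matching labels. That assumes the tweets after index 4000 are held out from training, as the slicing in Section 7 suggests.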

---

## 📊 9. Visualize Influential Words

In [None]:
def plot_loglikelihoods(loglikelihood):
    top_words = sorted(loglikelihood.items(), key=lambda x: abs(x[1]), reverse=True)[:20]
    words, vals = zip(*top_words)
    plt.figure(figsize=(10,6))
    sns.barplot(x=list(vals), y=list(words))
    plt.title("Most Influential Words")
    plt.xlabel("Log-likelihood")
    plt.tight_layout()
    plt.show()

plot_loglikelihoods(loglikelihood)
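The plot ranks words by absolute value, so positive and negative words are mixed together. If a plain-text view is handy, the two ends can be listed separately. This is a minimal sketch with a hypothetical helper top_words and made-up demo values, not output from the trained model.

```python
def top_words(loglikelihood, k=3):
    """Return (most_positive, most_negative) lists of (word, value) pairs."""
    ranked = sorted(loglikelihood.items(), key=lambda item: item[1])
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_positive, most_negative

# Made-up values for illustration only
demo = {"love": 0.75, "happy": 0.9, "hate": -0.63, "terrible": -1.2, "day": 0.01}
pos, neg = top_words(demo, k=2)
print("Most positive:", pos)
print("Most negative:", neg)
```

Passing the real loglikelihood dictionary in place of demo lists the same words the bar chart highlights, split by direction.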