# 📘 C1W2_L1: Naive Bayes Likelihoods – Teaching Version

Welcome! In this lesson, you'll explore how Naive Bayes classification works under the hood, focusing on how word frequencies affect predictions. You’ll learn to:

- Process tweets and create a frequency dictionary
- Train a Naive Bayes model and compute log-likelihoods
- Visualize the most informative words
- Build an interactive sentiment classifier with Gradio

---

## 🔧 1. Setup & Downloads
We'll begin by installing Gradio and downloading the NLTK tweet corpus and stopword list.

In [None]:
!pip install -q gradio
import nltk
nltk.download('twitter_samples')
nltk.download('stopwords')

---

## 📥 2. Imports
Here are the packages we’ll use throughout the notebook:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gradio as gr

from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
import string

---

## 💬 3. Load and Peek at the Data
We'll load the 5,000 positive and 5,000 negative tweets in NLTK's `twitter_samples` corpus and print one to see what the raw text looks like.

In [None]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

print("Sample positive tweet:")
print(all_positive_tweets[0])

---

## 🧹 4. Preprocess the Tweets
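`process_tweet` cleans a tweet in two steps:
- Tokenizes it with `TweetTokenizer`, which is configured here to lowercase the text, strip @handles, and shorten long runs of repeated characters
- Removes English stopwords and punctuation tokens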

In [None]:
stopwords_english = set(stopwords.words('english'))  # set gives fast membership tests
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def process_tweet(tweet):
    tokens = tokenizer.tokenize(tweet)
    clean = [word for word in tokens if word not in stopwords_english and word not in string.punctuation]
    return clean

# Try it on a tweet:
process_tweet(all_negative_tweets[0])
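
One subtlety in the filter above: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test rather than a per-token set lookup. The sketch below uses only the standard library and a handful of made-up tokens (they are assumptions for illustration, not output from the tokenizer) to show which tokens each style of test keeps.

```python
import string

# Hypothetical tokens, chosen only to illustrate the membership test.
tokens = ["great", "!", ":)", "...", "()", "day"]

# Substring test, as in process_tweet: a token is dropped only if it appears
# as a contiguous run inside string.punctuation.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Per-character test: a token is dropped when every character is punctuation.
punct = set(string.punctuation)
kept_all_punct = [t for t in tokens if not all(ch in punct for ch in t)]

print(kept_substring)   # ['great', ':)', '...', 'day']
print(kept_all_punct)   # ['great', 'day']
```

The substring test keeps emoticons such as `:)`, which can carry sentiment, so the stricter per-character filter is not necessarily an improvement for this task.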

---

## 📊 5. Create Frequency Dictionary
We now build a dictionary that maps each `(word, label)` pair to the number of times that word appears in tweets with that label.

In [None]:
def count_tweets(freq_dict, tweets, labels):
    for label, tweet in zip(labels, tweets):
        for word in process_tweet(tweet):
            pair = (word, label)
            freq_dict[pair] = freq_dict.get(pair, 0) + 1
    return freq_dict

# Small test ("i" and "am" are stopwords, so only "happy" and "sad" are counted):
test_freq = count_tweets({}, ["I am happy", "I am sad"], [1, 0])
print(test_freq)
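
To make the shape of the dictionary concrete without NLTK, here is a minimal sketch that swaps in a whitespace tokenizer. `simple_tokens` and `count_pairs` are hypothetical stand-ins for `process_tweet` and `count_tweets`, and the two example tweets are invented.

```python
def simple_tokens(text):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace.
    return text.lower().split()

def count_pairs(tweets, labels):
    # Same counting logic as count_tweets, using the stand-in tokenizer.
    freqs = {}
    for label, tweet in zip(labels, tweets):
        for word in simple_tokens(tweet):
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

freqs = count_pairs(["so happy happy", "so sad"], [1, 0])
print(freqs)
# {('so', 1): 1, ('happy', 1): 2, ('so', 0): 1, ('sad', 0): 1}
```

Note that the same word is counted separately for each label, which is what lets the model compare how often it appears in positive versus negative tweets.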

---

## 📐 6. Train Naive Bayes
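We compute:
- `logprior`: the log of the ratio of positive to negative tweets in the training set
- `loglikelihood`: for each word, the log of the ratio of its Laplace-smoothed probability in positive tweets to its probability in negative tweets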
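
Written out in the notation of the code that follows, where $D_{pos}$ and $D_{neg}$ are tweet counts, $N_{pos}$ and $N_{neg}$ are total word counts per class, and $V$ is the vocabulary size, the two quantities are as below. The symbol $\lambda(w)$ is introduced here only as shorthand for the value stored in `loglikelihood[word]`.

```latex
\text{logprior} = \log\frac{D_{pos}}{D_{neg}}
\qquad
P(w \mid pos) = \frac{\text{freq}(w, 1) + 1}{N_{pos} + V}
\qquad
P(w \mid neg) = \frac{\text{freq}(w, 0) + 1}{N_{neg} + V}
\qquad
\lambda(w) = \log\frac{P(w \mid pos)}{P(w \mid neg)}
```

The `+ 1` in each numerator and the `+ V` in each denominator are Laplace smoothing, which keeps a word seen in only one class from producing a zero probability.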

In [None]:
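def train_naive_bayes(freqs, train_x, train_y):
    # train_x is not used here; its word counts are already stored in freqs.
    loglikelihood = {}

    # Vocabulary: every unique word seen in training, regardless of label.
    vocab = {pair[0] for pair in freqs}
    V = len(vocab)

    # Total number of word occurrences in positive and negative tweets.
    N_pos = N_neg = 0
    for pair in freqs:
        if pair[1] == 1:
            N_pos += freqs[pair]
        else:
            N_neg += freqs[pair]

    # Number of positive and negative tweets, and the resulting log prior.
    D = len(train_y)
    D_pos = sum(train_y)
    D_neg = D - D_pos
    logprior = np.log(D_pos / D_neg)

    # Laplace-smoothed log-likelihood ratio for each word.
    for word in vocab:
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood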
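
A small worked example may help check the arithmetic by hand. The sketch below applies the same smoothed ratio to a toy frequency dictionary; the counts are invented for illustration, and `log_ratio` is a hypothetical helper rather than part of the notebook.

```python
import math

# Toy counts, invented for illustration: (word, label) -> count.
freqs = {("happy", 1): 3, ("happy", 0): 1, ("sad", 0): 2}

vocab = {word for word, _ in freqs}
V = len(vocab)                                            # 2 unique words
N_pos = sum(c for (_, y), c in freqs.items() if y == 1)   # 3 positive word counts
N_neg = sum(c for (_, y), c in freqs.items() if y == 0)   # 3 negative word counts

def log_ratio(word):
    # Laplace-smoothed probability of the word in each class, then the log ratio.
    p_pos = (freqs.get((word, 1), 0) + 1) / (N_pos + V)
    p_neg = (freqs.get((word, 0), 0) + 1) / (N_neg + V)
    return math.log(p_pos / p_neg)

for word in sorted(vocab):
    print(word, round(log_ratio(word), 3))
# happy 0.693
# sad -1.099
```

Here "happy" gets (3 + 1) / 5 against (1 + 1) / 5, a ratio of 2, while "sad" gets (0 + 1) / 5 against (2 + 1) / 5, a ratio of 1/3. Without smoothing, the zero count for "sad" in the positive class would make the log undefined.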

---

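## 🏗️ 7. Build and Train Model
We train on the first 4,000 tweets of each class; the remaining 1,000 of each are left out of training.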

In [None]:
train_x = all_positive_tweets[:4000] + all_negative_tweets[:4000]
train_y = np.append(np.ones(4000), np.zeros(4000))

freqs = count_tweets({}, train_x, train_y)
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
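
One detail worth checking: `train_y` holds floats (`1.0` and `0.0`), so the keys stored in `freqs` look like `('happy', 1.0)`, while `train_naive_bayes` looks up `(word, 1)`. The lookups still match because equal numbers compare and hash the same in Python. The sketch below shows this with plain floats and invented counts; it assumes NumPy's `float64` behaves the same way as a dictionary key, which should hold since it subclasses `float`.

```python
# Keys built with float labels, as count_tweets produces when labels are floats.
freqs = {("happy", 1.0): 3, ("sad", 0.0): 2}

# Lookups with int labels, as in train_naive_bayes.
pos_happy = freqs.get(("happy", 1), 0)
neg_sad = freqs.get(("sad", 0), 0)
neg_happy = freqs.get(("happy", 0), 0)   # never seen, falls back to 0

print(pos_happy, neg_sad, neg_happy)     # 3 2 0
```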

---

## 🔮 8. Make Predictions
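The prediction function adds `logprior` to the log-likelihoods of the tweet's words. Words that never appeared in training are skipped. A score above 0 indicates positive sentiment; a score of 0 or below indicates negative sentiment.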

In [None]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    words = process_tweet(tweet)
    score = logprior
    for word in words:
        if word in loglikelihood:
            score += loglikelihood[word]
    return score

# Test example:
tweet = "Today is awesome!"
print(f"Score: {naive_bayes_predict(tweet, logprior, loglikelihood):.2f}")
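
The scoring rule can be traced by hand on a toy model. In the sketch below, the `logprior` and `loglikelihood` values are invented rather than trained, and `score_words` is a hypothetical helper that takes already-tokenized words so that it runs without NLTK.

```python
# Hypothetical model values, invented for illustration.
logprior = 0.0
loglikelihood = {"awesome": 1.5, "today": 0.25, "terrible": -2.0}

def score_words(words, logprior, loglikelihood):
    # Sum the prior and the log-likelihood of every known word;
    # words missing from the dictionary contribute nothing.
    return logprior + sum(loglikelihood.get(w, 0.0) for w in words)

print(score_words(["today", "awesome"], logprior, loglikelihood))        # 1.75
print(score_words(["terrible", "unseenword"], logprior, loglikelihood))  # -2.0
```

Because unknown words add nothing, a tweet made up entirely of unseen words scores exactly `logprior`.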

---

## 📊 9. Visualize Top Words
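Let's plot the 20 words with the largest absolute log-likelihood. Positive bars push the score toward positive sentiment, and negative bars push it toward negative sentiment.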

In [None]:
def plot_loglikelihoods(loglikelihood):
    top_words = sorted(loglikelihood.items(), key=lambda x: abs(x[1]), reverse=True)[:20]
    words, vals = zip(*top_words)
    plt.figure(figsize=(10,6))
    sns.barplot(x=list(vals), y=list(words))
    plt.title("Most Influential Words")
    plt.xlabel("Log-likelihood")
    plt.tight_layout()
    plt.show()

plot_loglikelihoods(loglikelihood)
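
The ranking step inside `plot_loglikelihoods` sorts by absolute value, so strongly negative words rank alongside strongly positive ones. A minimal sketch of that step alone, using invented log-likelihood values and no plotting:

```python
# Hypothetical log-likelihoods, invented for illustration.
loglikelihood = {"happy": 2.0, "sad": -2.5, "the": 0.1, "great": 1.5}

# Rank by magnitude so strongly negative words are not pushed to the bottom.
top = sorted(loglikelihood.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
print(top)
# [('sad', -2.5), ('happy', 2.0), ('great', 1.5)]
```

Sorting by the raw value instead would list only the most positive words and hide the most negative ones.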

---

## 🚀 10. Interactive Sentiment Classifier
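Wrap the model in a Gradio interface to score tweets interactively. A score above 0 is labeled positive; anything else is labeled negative.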

In [None]:
def classify_tweet(tweet):
    score = naive_bayes_predict(tweet, logprior, loglikelihood)
    label = "Positive 😀" if score > 0 else "Negative 😞"
    return f"Score: {score:.2f} → {label}"

demo = gr.Interface(fn=classify_tweet,
                    inputs=gr.Textbox(lines=2, placeholder="Enter a tweet..."),
                    outputs="text",
                    title="Naive Bayes Tweet Sentiment Classifier")

demo.launch(share=False)

---

## ✅ Summary
In this lesson, you:
- Learned how to compute log-likelihoods from tweet data
- Built a Naive Bayes sentiment classifier from scratch
- Visualized word influence
- Created a working interactive classifier

Nice work!