# 📘 C1W2_L1: Naive Bayes Likelihoods – Teaching Version

In this workbook, we build a classifier that predicts whether a tweet is positive or negative using Naive Bayes. As with last week’s logistic regression, the model stays simple and interpretable: it ignores word order and phrasing and relies only on counts of which words appear.

But instead of learning weights through optimization (like logistic regression), this time we’ll build a model using **Bayes’ Theorem** and **word frequencies**. Think of this as letting probabilities do the talking, based on how often a word appears in positive vs. negative tweets.

You’ll see lots of overlap with Week 1’s bag-of-words approach — but the core *math* behind the prediction is different. Let’s dive in!

Because nothing is optimized, training is a single counting pass over the labeled data, and prediction is a table lookup followed by a sum.

---

## 🧠 What is Naive Bayes?
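Naive Bayes is a simple yet powerful classification algorithm based on **Bayes’ Theorem**:

$$
P(\text{class} \mid \text{tweet}) = \frac{P(\text{tweet} \mid \text{class})\, P(\text{class})}{P(\text{tweet})}
$$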

The “naive” assumption is that all words in a tweet are **conditionally independent** given the sentiment label (positive or negative): once the label is known, seeing one word tells the model nothing about whether another word appears.
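
Under this assumption the likelihood of a tweet factorizes over its words, so comparing the two labels reduces to a log prior plus a sum of per-word log ratios. A sketch of that step, writing $w_1, \dots, w_n$ for the words of a tweet (notation introduced here for illustration):

$$
P(w_1, \dots, w_n \mid \text{pos}) = \prod_{i=1}^{n} P(w_i \mid \text{pos})
$$

$$
\log \frac{P(\text{pos} \mid \text{tweet})}{P(\text{neg} \mid \text{tweet})}
= \log \frac{P(\text{pos})}{P(\text{neg})}
+ \sum_{i=1}^{n} \log \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}
$$

The denominator $P(\text{tweet})$ from Bayes’ Theorem is the same for both labels, so it cancels in the ratio. A positive value favors the positive label and a negative value favors the negative label.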

🧮 How Naive Bayes Works (Simple Steps):

1️⃣ Clean and tokenize each tweet (remove stopwords, punctuation, etc.)

2️⃣ Count how often each word appears in positive and negative tweets

3️⃣ Calculate Laplace-smoothed (add-one) probabilities of each word under each label

4️⃣ For a new tweet, sum the log probabilities of each word under each label

5️⃣ Predict the label with the higher total score
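
The five steps can be tried end to end on a toy corpus before touching real tweets. The sketch below is only an illustration: the four "tweets" are made up, tokenization is a plain whitespace split, and the helper names `p_word` and `predict` are hypothetical. It also adds the log prior to each total.

```python
import math
from collections import Counter

# Toy labeled corpus (made up for illustration): 1 = positive, 0 = negative.
tweets = [("happy great day", 1), ("great fun", 1),
          ("sad bad day", 0), ("bad awful", 0)]

# Steps 1-2: tokenize (whitespace split here) and count words per label.
counts = {1: Counter(), 0: Counter()}
for text, label in tweets:
    counts[label].update(text.split())

vocab = set(counts[1]) | set(counts[0])
V = len(vocab)
N = {label: sum(counts[label].values()) for label in counts}
docs = Counter(label for _, label in tweets)

# Step 3: Laplace-smoothed probability of a word given a label.
def p_word(word, label):
    return (counts[label][word] + 1) / (N[label] + V)

# Steps 4-5: sum log probabilities under each label, then pick the larger total.
def predict(text):
    totals = {}
    for label in (1, 0):
        total = math.log(docs[label] / len(tweets))  # log prior
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                total += math.log(p_word(word, label))
        totals[label] = total
    return max(totals, key=totals.get), totals

label, totals = predict("great happy")
print(label, totals)
```

The workbook's own implementation stores the difference of the two totals (a log ratio) rather than the totals themselves. The decision is the same either way.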

---

## 🔧 1. Setup & Downloads

In [None]:
!pip install -q gradio
import nltk
nltk.download('twitter_samples')
nltk.download('stopwords')

---

## 📥 2. Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gradio as gr

from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
import string

---

## 💬 3. Load and Peek at the Data

In [None]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

print("Sample positive tweet:")
print(all_positive_tweets[0])

---

## 🧹 4. Preprocess the Tweets

In [None]:
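# A set gives constant-time membership checks; a list would be scanned for every token.
stopwords_english = set(stopwords.words('english'))
# Lowercase everything, drop @handles, and shorten runs like "sooooo" to "sooo".
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def process_tweet(tweet):
    """Tokenize a tweet and drop stopwords and punctuation tokens."""
    tokens = tokenizer.tokenize(tweet)
    clean = [word for word in tokens if word not in stopwords_english and word not in string.punctuation]
    return clean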

# Try on a sample
process_tweet(all_negative_tweets[0])
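
One detail of the filter is worth checking by hand: `string.punctuation` is a plain string, so `word not in string.punctuation` is a substring test rather than a per-character lookup. The sketch below shows the consequence and one possible stricter alternative. The helper `strip_punct` and the choice to use a set are assumptions for illustration, and whether tokens such as `()` ever occur depends on the tokenizer.

```python
import string

# `in` on a str is a substring test, so multi-character tokens can match by accident.
print("()" in string.punctuation)   # "(" and ")" sit next to each other in the string
print(":)" in string.punctuation)   # not a contiguous substring, so emoticons survive
print("" in string.punctuation)     # the empty string is a substring of every string

# Possible stricter alternative (an assumption about intent): drop only
# tokens that are exactly one punctuation character.
PUNCT = set(string.punctuation)

def strip_punct(tokens):
    return [t for t in tokens if t not in PUNCT]

print(strip_punct(["great", "!", ":)", "()", "..."]))
```

Keeping emoticons such as `:)` is useful here, since they are strong sentiment signals in tweets.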

---

## 📊 5. Count Word Frequencies

In [None]:
def count_tweets(freq_dict, tweets, labels):
    """Add (word, label) -> count entries to freq_dict for every processed tweet."""
    for label, tweet in zip(labels, tweets):
        for word in process_tweet(tweet):
            pair = (word, label)
            freq_dict[pair] = freq_dict.get(pair, 0) + 1
    return freq_dict
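
To see what the frequency dictionary looks like, the same counting scheme can be run on a few pre-tokenized toy tweets. The sketch uses a hypothetical helper, `count_pairs`, that skips `process_tweet` so it runs without NLTK, and the data is made up. It also shows why float labels (as produced by `np.ones` and `np.zeros`) still work with integer lookups such as `freqs.get((word, 1), 0)`.

```python
def count_pairs(freq_dict, token_lists, labels):
    """Same counting scheme as count_tweets, but on already-tokenized input."""
    for label, tokens in zip(labels, token_lists):
        for word in tokens:
            pair = (word, label)
            freq_dict[pair] = freq_dict.get(pair, 0) + 1
    return freq_dict

token_lists = [["happy", "day"], ["happy", "fun"], ["sad", "day"]]
labels = [1.0, 1.0, 0.0]  # floats, like the labels built with np.ones / np.zeros

freqs = count_pairs({}, token_lists, labels)
print(freqs)

# 1 == 1.0 and hash(1) == hash(1.0), so integer keys find the float-labelled entries.
print(freqs.get(("happy", 1), 0))
print(freqs.get(("happy", 0), 0))
```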

---

## 📐 6. Train Naive Bayes

In [None]:
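def train_naive_bayes(freqs, train_x, train_y):
    """Return the log prior and per-word log-likelihood ratios (positive vs. negative)."""
    # train_x is not used; everything needed is already in freqs and train_y.
    loglikelihood = {}
    # V: number of unique words across both classes.
    vocab = {word for word, _ in freqs}
    V = len(vocab)
    # N_pos / N_neg: total word occurrences in positive / negative tweets.
    N_pos = N_neg = 0
    for (word, label), count in freqs.items():
        if label == 1:
            N_pos += count
        else:
            N_neg += count
    # D_pos / D_neg: number of positive / negative tweets (labels are 1 and 0).
    D = len(train_y)
    D_pos = sum(train_y)
    D_neg = D - D_pos
    # Log prior odds; this is 0 when the two classes are the same size.
    logprior = np.log(D_pos / D_neg)
    for word in vocab:
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        # Laplace (add-one) smoothing keeps both probabilities above zero.
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood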
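
The smoothing step is easiest to follow with numbers. The sketch below works through one word using hypothetical counts, chosen only to keep the arithmetic simple; the variable names mirror those in `train_naive_bayes`.

```python
import math

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
V = 8                      # vocabulary size
N_pos, N_neg = 12, 12      # total word occurrences in each class
freq_pos, freq_neg = 3, 0  # occurrences of one particular word in each class

# Without smoothing, freq_neg = 0 would give p_w_neg = 0 and a division by zero below.
p_w_pos = (freq_pos + 1) / (N_pos + V)   # 4 / 20
p_w_neg = (freq_neg + 1) / (N_neg + V)   # 1 / 20
llr = math.log(p_w_pos / p_w_neg)        # log(4)

print(p_w_pos, p_w_neg, round(llr, 3))
```

A word seen only in positive tweets therefore gets a large but finite positive score instead of an infinite one.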

---

## 🏗️ 7. Build and Train the Model

In [None]:
# Train on the first 4,000 tweets of each class; labels are 1 (positive) and 0 (negative).
train_x = all_positive_tweets[:4000] + all_negative_tweets[:4000]
train_y = np.append(np.ones(4000), np.zeros(4000))

freqs = count_tweets({}, train_x, train_y)
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

---

## 🔮 8. Predict on New Tweets

In [None]:
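def naive_bayes_predict(tweet, logprior, loglikelihood):
    """Return the tweet's log-odds score: above 0 leans positive, below 0 leans negative."""
    words = process_tweet(tweet)
    # Start from the log prior, then add each known word's log-likelihood ratio.
    score = logprior
    for word in words:
        # Words never seen in training carry no evidence and are skipped.
        if word in loglikelihood:
            score += loglikelihood[word]
    return score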

# Try a test tweet
tweet = "Today is awesome!"
print(f"Score: {naive_bayes_predict(tweet, logprior, loglikelihood):.2f}")
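
The printed score is a log-odds value, so it can be mapped to a probability with the same logistic (sigmoid) function used in last week's logistic regression. The helper below, `score_to_probability`, is a hypothetical addition rather than part of the workbook. The resulting number is the model's own estimate and is often pushed toward 0 or 1, because the independence assumption counts correlated words as separate evidence.

```python
import math

def score_to_probability(score):
    """Map a log-odds score to the model's P(positive | tweet)."""
    # Two branches keep math.exp from overflowing for large |score|.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

for s in (-2.0, 0.0, 2.0):
    print(s, round(score_to_probability(s), 3))
```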

---

## 📊 9. Visualize Influential Words

In [None]:
def plot_loglikelihoods(loglikelihood):
    top_words = sorted(loglikelihood.items(), key=lambda x: abs(x[1]), reverse=True)[:20]
    words, vals = zip(*top_words)
    plt.figure(figsize=(10,6))
    sns.barplot(x=list(vals), y=list(words))
    plt.title("Most Influential Words")
    plt.xlabel("Log-likelihood ratio (positive vs. negative)")
    plt.tight_layout()
    plt.show()

plot_loglikelihoods(loglikelihood)
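
Ranking by absolute value mixes strongly positive and strongly negative words in one list. When the two directions are easier to read separately, the dictionary can be split before plotting. The sketch below uses a hypothetical helper, `top_words`, and made-up scores; it only prints the result and leaves the plotting to the function above.

```python
def top_words(loglikelihood, k=3):
    """Return the k most positive and the k most negative words by score."""
    ranked = sorted(loglikelihood.items(), key=lambda kv: kv[1])  # ascending
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_positive, most_negative

# Made-up scores for illustration only.
toy = {":)": 2.1, "happy": 1.4, "day": 0.1, "miss": -0.9, "sad": -1.7, ":(": -2.3}

pos, neg = top_words(toy, k=2)
print(pos)
print(neg)
```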

---

## 🚀 10. Gradio Sentiment Classifier

In [None]:
def classify_tweet(tweet):
    score = naive_bayes_predict(tweet, logprior, loglikelihood)
    label = "Positive 😀" if score > 0 else "Negative 😞"
    return f"Score: {score:.2f} → {label}"

demo = gr.Interface(fn=classify_tweet,
                    inputs=gr.Textbox(lines=2, placeholder="Enter a tweet..."),
                    outputs="text",
                    title="Naive Bayes Tweet Sentiment Classifier")

demo.launch(share=False)
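
One edge case is worth knowing before sharing the demo. The training set is balanced, so `logprior` is `log(4000 / 4000) = 0`, and a tweet that is empty or contains only unseen words scores exactly 0. Under the `score > 0` rule such input is labeled Negative. The sketch below shows one possible way to surface that case; `score_tokens`, `label_for`, the `margin` parameter, and the two word scores are all hypothetical.

```python
import math

logprior = math.log(4000 / 4000)  # balanced classes, so the prior contributes 0.0
loglikelihood = {"awesome": 1.2, "terrible": -1.5}  # made-up values for illustration

def score_tokens(tokens):
    """Log prior plus the scores of known tokens; unknown tokens add nothing."""
    return logprior + sum(loglikelihood.get(t, 0.0) for t in tokens)

def label_for(score, margin=0.0):
    """Hypothetical three-way labelling: scores within `margin` of 0 count as neutral."""
    if score > margin:
        return "Positive"
    if score < -margin:
        return "Negative"
    return "Neutral / no evidence"

print(label_for(score_tokens(["awesome"])))
print(label_for(score_tokens([])))        # empty input
print(label_for(score_tokens(["zxqv"])))  # only unseen words
```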