
# Assignment A2 – Build Naïve Bayes from Scratch 🔧🪄

Welcome to your second assignment in Week 2! This notebook takes you beyond visualization – now **you'll actually implement Naïve Bayes yourself**.

---

### 👨‍🔬 What you’ll do
* Preprocess and clean tweet data using NLTK.
* Implement training logic for Naïve Bayes using log-likelihoods and Laplace smoothing.
* Predict sentiment using your model.
* Inspect influential words with odds ratios.
* Explore your model interactively with a Gradio-powered UI.

### 📌 What's special
This notebook is **autograder-compatible** with the Coursera version. If your local checks pass, **your solution should also pass the official grader**.



## 🍀 Step 0 – Environment Setup

In [None]:
!pip -q install --upgrade nltk "numpy>=1.26,<2.1" "gradio>=4.27.0" "websockets>=13,<15" scikit-learn --progress-bar off

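import nltk, ssl, warnings; warnings.filterwarnings('ignore')
# Work around SSL certificate errors that can block NLTK downloads on some systems.
try:
    ssl._create_default_https_context = ssl._create_unverified_context
except AttributeError:
    pass
# Stop-word list, tokenizer data, and the labeled tweet corpus.
for res in ['stopwords','punkt','twitter_samples']:
    nltk.download(res, quiet=True)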
print('✅ Environment ready')

## 1️⃣ Helper functions (provided)

These functions clean and process tweets, and help track word-label counts:

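- `process_tweet(tweet)` – lowercases the text, strips URLs and non-letter characters, removes stop words, and stems what remains.
- `count_tweets(result, tweets, ys)` – builds a dictionary that maps each `(word, label)` pair to its count.
- `lookup(freqs, word, label)` – returns how often a word appears with a given label (0 if the pair was never seen).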

In [None]:
import numpy as np, re, random, math
from nltk.corpus import stopwords, twitter_samples
from nltk.stem import PorterStemmer
from collections import Counter
stemmer, stop_words = PorterStemmer(), set(stopwords.words('english'))

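def process_tweet(tweet):
    """Lowercase, strip URLs and non-letters, drop stop words, stem."""
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+','',tweet)   # remove URLs
    tweet = re.sub(r'[^a-z\s]','',tweet)       # keep only letters and whitespace
    return [stemmer.stem(w) for w in tweet.split() if w not in stop_words]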

def count_tweets(result, tweets, ys):
    for y,t in zip(ys, tweets):
        for w in process_tweet(t):
            pair = (w,y)
            result[pair] = result.get(pair, 0) + 1
    return result

def lookup(freqs, word, label):
    return freqs.get((word, label), 0)
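
To see what the frequency dictionary looks like, here is a minimal sketch on three made-up tweets. It swaps `process_tweet` for a plain whitespace tokenizer (an assumption, so the example runs without NLTK data); `toy_tokens` and `count_pairs` are hypothetical names, not part of the assignment.

```python
def toy_tokens(text):
    # Assumption: lowercase + whitespace split stands in for process_tweet.
    return text.lower().split()

def count_pairs(result, texts, labels, tokenize=toy_tokens):
    # Same counting logic as count_tweets: one entry per (word, label) pair.
    for label, text in zip(labels, texts):
        for word in tokenize(text):
            pair = (word, label)
            result[pair] = result.get(pair, 0) + 1
    return result

toy_texts = ["happy happy day", "sad day", "happy news"]
toy_labels = [1, 0, 1]
toy_freqs = count_pairs({}, toy_texts, toy_labels)

print(toy_freqs.get(("happy", 1), 0))  # 3: twice in tweet 1, once in tweet 3
print(toy_freqs.get(("day", 0), 0))    # 1
print(toy_freqs.get(("happy", 0), 0))  # 0: never seen with the negative label
```

Note that the same word is counted separately per label, which is what lets the model compare how strongly a word leans toward each class.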

## 2️⃣ Load tweet data & train/test split

We use NLTK's pre-labeled Twitter dataset:

- **5,000 positive** tweets  
- **5,000 negative** tweets

We shuffle with a fixed seed, then split into **80% training** (8,000 tweets) / **20% test** (2,000 tweets).

In [None]:
pos = twitter_samples.strings('positive_tweets.json')
neg = twitter_samples.strings('negative_tweets.json')
tweets = pos + neg
ys = np.array([1]*len(pos) + [0]*len(neg))

random.seed(0)
idx = list(range(len(tweets)))
random.shuffle(idx)

tweets = [tweets[i] for i in idx]
ys = ys[idx]

split = int(0.8 * len(tweets))
tweets_tr, tweets_te = tweets[:split], tweets[split:]
ys_tr, ys_te = ys[:split], ys[split:]

print(len(tweets_tr), 'train,', len(tweets_te), 'test')
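
The cell above shuffles a list of indices rather than the tweets themselves, so texts and labels stay paired. A minimal sketch of the same idea on ten made-up items (all `toy_*` names are illustrative only):

```python
import random

toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = [1] * 5 + [0] * 5   # first five positive, last five negative

random.seed(0)
order = list(range(len(toy_texts)))
random.shuffle(order)            # shuffle indices, not the data itself

# Apply the same order to both lists so each text keeps its label.
shuffled_texts = [toy_texts[i] for i in order]
shuffled_labels = [toy_labels[i] for i in order]

cut = int(0.8 * len(shuffled_texts))
train_texts, test_texts = shuffled_texts[:cut], shuffled_texts[cut:]
train_labels, test_labels = shuffled_labels[:cut], shuffled_labels[cut:]

print(len(train_texts), "train,", len(test_texts), "test")
```

Shuffling before the split matters here because the raw data is ordered by class: without it, the test set would contain only negative tweets.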

## 3️⃣ Implement Naïve Bayes

Now it’s your turn to implement:

- `train_naive_bayes`: builds the logprior and loglikelihood from data.
- `predict_sentiment`: scores new tweets using your trained model.

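Use Laplace (add-one) smoothing so that a word never seen with a class still gets a non-zero probability, and work in log space so that multiplying many small probabilities does not underflow.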
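
Both ideas can be checked on a tiny example before you write the real function. The sketch below uses made-up counts (not the tweet data) and a hypothetical helper `smoothed`:

```python
import math

# Toy counts (made up): how often each word appears in each class.
toy_freqs = {("good", 1): 3, ("good", 0): 1, ("bad", 0): 4}
vocab = {w for (w, _) in toy_freqs}
V = len(vocab)                                               # 2 distinct words
N_pos = sum(c for (_, y), c in toy_freqs.items() if y == 1)  # 3
N_neg = sum(c for (_, y), c in toy_freqs.items() if y == 0)  # 5

def smoothed(word, label, n_class):
    # Laplace smoothing: add 1 to the count and V to the denominator.
    return (toy_freqs.get((word, label), 0) + 1) / (n_class + V)

# "bad" never occurs in positive tweets, yet its probability stays above zero.
p_bad_pos = smoothed("bad", 1, N_pos)   # (0 + 1) / (3 + 2) = 0.2
p_bad_neg = smoothed("bad", 0, N_neg)   # (4 + 1) / (5 + 2)
print(math.log(p_bad_pos / p_bad_neg))  # negative: "bad" favors class 0

# Multiplying many small probabilities underflows to 0.0 ...
product = 1.0
for _ in range(400):
    product *= 1e-3
print(product)                          # 0.0

# ... while the equivalent sum of logs stays finite.
log_sum = sum(math.log(1e-3) for _ in range(400))
print(log_sum)
```

Without the `+ 1`, the unseen pair would have probability 0 and `math.log` would raise an error on it.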

In [None]:
### UNQ_C1
def train_naive_bayes(freqs, train_x, train_y):
    """Returns logprior, loglikelihood dict."""
    loglikelihood = {}
    vocab = {w for (w,_) in freqs.keys()}
    V = len(vocab)
    N_pos = N_neg = 0
    for pair,c in freqs.items():
        if pair[1]==1:
            N_pos += c
        else:
            N_neg += c
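    # Document counts per class; train_y must be a NumPy array for the boolean mask.
    D_pos = (train_y==1).sum(); D_neg = (train_y==0).sum()
    # Log prior: log ratio of positive to negative documents.
    logprior = math.log(D_pos) - math.log(D_neg)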
    for w in vocab:
        f_pos = freqs.get((w,1),0)
        f_neg = freqs.get((w,0),0)
        p_w_pos = (f_pos+1)/(N_pos+V)
        p_w_neg = (f_neg+1)/(N_neg+V)
        loglikelihood[w] = math.log(p_w_pos/p_w_neg)
    return logprior, loglikelihood

In [None]:
### UNQ_C2
def predict_sentiment(tweet, logprior, loglikelihood):
    words = process_tweet(tweet)
    score = logprior
    for w in words:
        if w in loglikelihood:
            score += loglikelihood[w]
    return 1 if score > 0 else 0
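
To make the scoring rule concrete, here is a small sketch that scores already-tokenized input against hypothetical log-likelihood ratios. The numbers are invented for illustration; real values come from `train_naive_bayes`, and `score_tokens` is not part of the assignment.

```python
# Hypothetical log-likelihood ratios; real values come from train_naive_bayes.
toy_logprior = 0.0                      # balanced classes: log(1) = 0
toy_loglik = {"love": 1.2, "great": 0.8, "hate": -1.5}

def score_tokens(tokens, logprior, loglik):
    # Sum the prior and the per-word ratios; words not in the model add 0.
    return logprior + sum(loglik.get(w, 0.0) for w in tokens)

for tokens in (["love", "great", "pizza"], ["hate", "great"], ["pizza"]):
    s = score_tokens(tokens, toy_logprior, toy_loglik)
    print(tokens, round(s, 2), "-> label", 1 if s > 0 else 0)
```

The last case shows the tie behavior: a tweet made only of unknown words scores exactly the log prior, and with a prior of 0 the `score > 0` rule labels it negative.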

In [None]:
### UNQ_C3
def get_ratio(freqs, word):
    pos = freqs.get((word,1), 0)
    neg = freqs.get((word,0), 0)
    return (pos + 1) / (neg + 1)

### UNQ_C4
def get_words_by_threshold(freqs, label, threshold):
    out = {}
    for w in {w for w,_ in freqs.keys()}:
        ratio = get_ratio(freqs, w)
        if label == 1 and ratio >= threshold:
            out[w] = ratio
        elif label == 0 and ratio <= 1/threshold:
            out[w] = ratio
    return out
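
A quick way to build intuition for the ratio and the threshold is to run them on a handful of invented counts. This sketch mirrors the logic of `get_ratio` and `get_words_by_threshold` on toy data; `toy_ratio` and the counts are assumptions for illustration.

```python
toy_freqs = {("great", 1): 9, ("great", 0): 1,
             ("awful", 1): 0, ("awful", 0): 7,
             ("today", 1): 4, ("today", 0): 4}

def toy_ratio(freqs, word):
    # Add-one smoothed positive-to-negative count ratio, as in get_ratio.
    return (freqs.get((word, 1), 0) + 1) / (freqs.get((word, 0), 0) + 1)

ratios = {w: toy_ratio(toy_freqs, w) for w in {w for w, _ in toy_freqs}}

threshold = 4
# Positive words: ratio at least `threshold`; negative words: at most 1/threshold.
positive_words = {w: r for w, r in ratios.items() if r >= threshold}
negative_words = {w: r for w, r in ratios.items() if r <= 1 / threshold}

print(positive_words)  # {'great': 5.0}
print(negative_words)  # {'awful': 0.125}
```

A neutral word such as "today" has a ratio of 1.0 and lands in neither group, which is why a threshold well above 1 isolates the most opinionated words.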

## 4️⃣ Train & Evaluate Your Model

In [None]:
freqs = count_tweets({}, tweets_tr, ys_tr)
logprior, loglikelihood = train_naive_bayes(freqs, tweets_tr, ys_tr)

y_hat = [predict_sentiment(t, logprior, loglikelihood) for t in tweets_te]

from sklearn.metrics import accuracy_score, classification_report
print('Test accuracy:', accuracy_score(ys_te, y_hat))
print(classification_report(ys_te, y_hat, target_names=['neg','pos']))
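
If you want to check what the reported numbers mean, the main metrics are easy to compute by hand. A minimal sketch on six made-up labels and predictions (not the tweet results):

```python
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Accuracy: the fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision and recall for the positive class (label 1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Here four of six predictions are correct, and the positive class has two true positives, one false positive, and one false negative.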

## ✅ Local Sanity Checks

In [None]:
assert predict_sentiment("I love it", logprior, loglikelihood) in [0,1]
assert abs(get_ratio(freqs,'love') - ((freqs.get(('love',1),0)+1)/(freqs.get(('love',0),0)+1))) < 1e-9
print('Local checks passed ✔️')

## 🧪 Gradio: Interactive Sentiment Explorer

In [None]:
import gradio as gr, numpy as np, math

def classify(text):
    words = process_tweet(text)
    score = logprior + sum(loglikelihood.get(w,0) for w in words)
    prob = 1 / (1 + math.exp(-score))
    return {
        'Prob-positive': round(prob, 3),
        'Prediction': 'Positive 😊' if score > 0 else 'Negative 😞'
    }

with gr.Blocks() as demo:
    gr.Markdown('### 🔍 Naïve Bayes sentiment tester')
    txt = gr.Textbox(lines=3)
    out = gr.JSON()
    txt.submit(classify, txt, out)
    gr.Button('Run').click(classify, txt, out)

# Uncomment this to run:
# demo.launch()
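
The `classify` function turns the Naïve Bayes score (a log-odds value) into a probability with the logistic function. For very negative scores, `math.exp(-score)` can raise `OverflowError`; that is unlikely for short tweets but possible for long texts. One possible guard is sketched below; `stable_sigmoid` is a hypothetical helper, not part of the assignment.

```python
import math

def stable_sigmoid(score):
    # Pick the branch whose exp() argument is never positive, so it cannot overflow.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

print(stable_sigmoid(0.0))      # 0.5: a score of 0 means even odds
print(stable_sigmoid(2.0))      # about 0.88
print(stable_sigmoid(-1000.0))  # 0.0, where 1 / (1 + math.exp(1000)) would raise OverflowError
```

Both branches compute the same function, so swapping this in would not change predictions for ordinary inputs.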

---

## 🎓 You did it!

You implemented Naïve Bayes from scratch, trained it on real tweets, interpreted odds ratios, and built a live demo.

### Next steps
- Use `get_words_by_threshold` to explore the most opinionated words.
- Try replacing the dataset with your own texts or product reviews.
- Extend the Gradio UI with visual plots or confidence bars!

Happy modeling 🚀