# 📘 C1W2_L1: Naive Bayes Likelihoods – Teaching Version

🎯 **Let's build a classifier. First a toy version, then the real thing!**

In this workbook, our goal is to build a system that can predict whether a Tweet is positive or negative. Like last week’s logistic regression, we’ll keep things simple and interpretable: no word order, no phrasing — just plain counts of which words appear.

But instead of learning weights through optimization (like logistic regression), this time we’ll build a model using **Bayes’ Theorem** and **word frequencies**. Think of this as letting probabilities do the talking, based on how often a word appears in positive vs. negative tweets.

You’ll see lots of overlap with Week 1’s bag-of-words approach — but the core *math* behind the prediction is different. Let’s dive in!

Unlike logistic regression last week, there’s no optimization or gradient descent here, just a lookup-based statistical model derived directly from the labeled data. It’s still “learning,” but it’s more like rule-building from observed frequencies.

🧠 **So is this “learning”?**

Barely. In the broadest ML sense, yes: the model's behavior changes depending on what data it sees.

But:

- It’s parametric (its parameters are just the class priors and the word likelihoods)
- It’s closed-form (you just calculate the values, with no iterative fitting)
- It’s non-adaptive beyond word frequencies
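To make "closed-form" concrete, here is a minimal sketch of how priors and word likelihoods could be computed by counting alone. The toy tweets, the helper name `fit_counts`, and the add-one (Laplace) smoothing are assumptions made for illustration; they are not taken from this workbook's dataset or code.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs; 1 = positive, 0 = negative.
toy_tweets = [
    (["happy", "great", "day"], 1),
    (["great", "fun"], 1),
    (["sad", "bad", "day"], 0),
]

def fit_counts(tweets):
    """Estimate class priors and smoothed word likelihoods by counting."""
    label_counts = Counter(label for _, label in tweets)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in tweets:
        word_counts[label].update(tokens)

    vocab = {w for tokens, _ in tweets for w in tokens}
    # Prior: fraction of tweets carrying each label.
    priors = {label: n / len(tweets) for label, n in label_counts.items()}

    likelihoods = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one smoothing (an assumption of this sketch) keeps unseen
        # word/label pairs from getting a probability of zero.
        likelihoods[label] = {
            w: (counts[w] + 1) / (total + len(vocab)) for w in vocab
        }
    return priors, likelihoods

toy_priors, toy_likelihoods = fit_counts(toy_tweets)
print(toy_priors)
print(toy_likelihoods[1]["great"], toy_likelihoods[0]["great"])
```

Nothing here is iterated or tuned: every parameter is a ratio of counts, which is why the same data always yields the same model.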

---

## 🧠 What is Naive Bayes?
Naive Bayes is a simple yet powerful classification algorithm based on **Bayes’ Theorem**:

\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} \]

In our case:
- A = a sentiment label (e.g., positive or negative)
- B = the tweet’s words

The “naive” assumption is that all words in a tweet are **conditionally independent** given the sentiment label.

This means we treat "bad" and "dream" as unrelated — even though "bad dream" might have a specific negative meaning together. That’s the simplification — and limitation — of this approach.
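Written out, the independence assumption lets the probability of a whole tweet factor into one term per word. This is the standard form of the assumption, shown here for a tweet of n words (the index n is notation introduced for illustration):

\[ P(\text{word}_1, \dots, \text{word}_n \mid \text{label}) = \prod_{i=1}^{n} P(\text{word}_i \mid \text{label}) \]

Taking the logarithm of both sides turns the product into a sum:

\[ \log P(\text{word}_1, \dots, \text{word}_n \mid \text{label}) = \sum_{i=1}^{n} \log(P(\text{word}_i \mid \text{label})) \]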

The classifier works by computing a **log-likelihood score** for each label:

\[ \text{score} = \log(P(\text{label})) + \sum_{i} \log(P(\text{word}_i \mid \text{label})) \]

We choose the label with the highest score. This is different from last week, where the score came from a **sigmoid curve fitted by gradient descent**.
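As a rough sketch of that scoring rule, the snippet below adds the log prior to the per-word log probabilities and picks the label with the larger total. The probability values, the names `demo_priors`, `demo_likelihoods`, `score`, and `predict`, and the choice to skip unknown words are all assumptions for illustration, not values or code from this workbook.

```python
import math

# Hypothetical probabilities, made up for illustration only.
demo_priors = {"positive": 0.5, "negative": 0.5}
demo_likelihoods = {
    "positive": {"great": 0.30, "day": 0.20, "bad": 0.05},
    "negative": {"great": 0.05, "day": 0.20, "bad": 0.30},
}

def score(tokens, label):
    """Return log P(label) + sum of log P(word | label) over known words."""
    total = math.log(demo_priors[label])
    for word in tokens:
        # Skipping out-of-vocabulary words is an assumption of this sketch.
        if word in demo_likelihoods[label]:
            total += math.log(demo_likelihoods[label][word])
    return total

def predict(tokens):
    """Pick the label whose score is highest."""
    return max(demo_priors, key=lambda label: score(tokens, label))

print(predict(["great", "day"]))
print(predict(["bad", "day"]))
```

Working in log space keeps the numbers manageable: multiplying many small probabilities directly can underflow to zero, while summing their logs does not.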

---

## 🤔 How is this similar to last week?
- ✅ We still represent tweets using **bag-of-words** (no order, just counts)
- ✅ We still want to predict positive vs negative
- ✅ We still calculate a score per tweet

## 🔍 What’s different this week?
- ❌ No learned weights from gradient descent
- ✅ We calculate probabilities directly from word frequencies
- ✅ We use **log-likelihoods** instead of a sigmoid decision function

In short, both methods try to split tweets by some kind of score — but how that score is **calculated** is different.

---

## 🔧 1. Setup & Downloads
We'll begin by installing dependencies and loading some Twitter data.

In [None]:
!pip install -q gradio
import nltk
nltk.download('twitter_samples')
nltk.download('stopwords')

...