# Lesson L3 – Logistic Regression for Tweet Sentiment 📈❤️💔

Welcome to **Lesson L3** of our Colab‑ready DeepLearning.AI NLP remake!

## What you’ll learn
* Convert cleaned tweets into **numeric features**  
* Train a **Logistic Regression** classifier for sentiment  
* **Visualise** tweets in feature‑space and the model’s decision boundary  
* Evaluate accuracy & interpret the learned weights  
* Experiment live with an **interactive Gradio playground**

## Why this matters
Getting from *word counts* → *predictions* is the heart of many NLP systems.  
Logistic Regression is a surprisingly strong baseline and lays the groundwork for neural networks.

## Roadmap
1. **Setup & installs** – one cell, ready for Colab  
2. **Toy example** – six handmade tweets to see LR end‑to‑end  
3. **Helper functions** – preprocessing and feature builders, defined inline  
4. **Real dataset** – 10k NLTK tweets, build features on the fly  
5. **Evaluation** – accuracy + confusion matrix  
6. **Visualisation** – scatter & decision boundary  
7. **Gradio playground** – paste any text and get a sentiment score  

_👉 Let’s dive in!_


In [None]:
# 🍀 Colab setup – run this first!
# Version ranges are constrained where needed to avoid conflicts with Colab pre‑installs
!pip -q install --upgrade \
        "nltk" \
        "wordcloud" \
        "numpy>=1.26,<2.1" \
        "scikit-learn<1.7" \
        "gradio>=4.27.0" \
        "websockets>=13,<15" --progress-bar off

import nltk, ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
    ssl._create_default_https_context = _create_unverified_https_context
except AttributeError:
    pass

for resource in ['stopwords', 'punkt', 'twitter_samples']:
    nltk.download(resource, quiet=True)

print("✅ Environment & NLTK corpora ready!")


## 1️⃣ Toy example – six mini‑tweets

In [None]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re, numpy as np, matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def simple_process(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r'[^a-z\s]', '', sentence)
    return [stemmer.stem(w) for w in sentence.split() if w not in stop_words]

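# Mini sentiment lexicon, stemmed so entries match the output of simple_process
# (PorterStemmer turns "happy" into "happi", so the raw word would never match)
pos_lex = {stemmer.stem(w) for w in ('love', 'great', 'happy')}
neg_lex = {stemmer.stem(w) for w in ('hate', 'bad', 'sad')}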

def to_features(tokens):
    pos_cnt = sum(w in pos_lex for w in tokens)
    neg_cnt = sum(w in neg_lex for w in tokens)
    return [pos_cnt, neg_cnt]

toy_tweets = [
    "I love this!",
    "This is great and makes me happy",
    "So happy, great vibes",
    "I hate this, really bad",
    "This is sad and bad",
    "I hate it so much"
]
y_toy = np.array([1,1,1,0,0,0])

X_toy = np.array([to_features(simple_process(t)) for t in toy_tweets])

print("Toy feature matrix:\n", X_toy)

clf_toy = LogisticRegression()
clf_toy.fit(X_toy, y_toy)

print("Toy accuracy:", clf_toy.score(X_toy, y_toy))

# plot decision boundary
plt.figure(figsize=(4,4))
for label, marker, color in [(1,'o','green'), (0,'x','red')]:
    mask = y_toy==label
    plt.scatter(X_toy[mask,0], X_toy[mask,1], marker=marker, color=color, label='pos' if label else 'neg', s=80)

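# Boundary is where w0*x + w1*y + b = 0, i.e. y = -(w0*x + b) / w1
coef = clf_toy.coef_[0]; intercept = clf_toy.intercept_[0]
xs = np.linspace(0,3,100)
ys_line = -(coef[0]*xs + intercept)/coef[1]  # not named `ys`, which holds the labels later on
plt.plot(xs, ys_line, '--k')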
plt.xlabel('Positive count'); plt.ylabel('Negative count')
plt.xlim(-0.2,3.5); plt.ylim(-0.2,3.5); plt.legend()
plt.title("Toy decision boundary")
plt.show()


**What to notice**

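* Tweets with more *positive cues* sit to the right of (and below) the boundary, because the x‑axis is the positive count.  
* LR learns weights that separate the two classes with a straight line.  
* Coefficient magnitude ≈ feature importance, since both features are counts on the same scale.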
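
To make the link between the weights and the dashed line concrete, here is a minimal sketch in plain Python. The parameters `w_pos`, `w_neg`, and `b` are made-up illustrative values, not the ones `clf_toy` learns, and `sigmoid`, `prob_positive`, and `boundary_neg_count` are hypothetical helper names.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative (assumed) parameters, NOT the values learned by clf_toy
w_pos, w_neg, b = 1.5, -1.5, 0.0

def prob_positive(pos_cnt: float, neg_cnt: float) -> float:
    """P(positive) for a tweet with the given positive/negative counts."""
    return sigmoid(w_pos * pos_cnt + w_neg * neg_cnt + b)

def boundary_neg_count(pos_cnt: float) -> float:
    """Negative count at which the score is 0, so P(positive) is 0.5."""
    return -(w_pos * pos_cnt + b) / w_neg

print(prob_positive(2, 0))                       # above 0.5: more positive cues
print(prob_positive(0, 2))                       # below 0.5: more negative cues
print(prob_positive(1, boundary_neg_count(1)))   # 0.5: the point lies on the boundary
```

To inspect the fitted toy model instead, the same formula can be applied to `clf_toy.coef_[0]` and `clf_toy.intercept_[0]`.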

## 2️⃣ Helper functions – inline (no utils.py)

In [None]:
import re, numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def process_tweet(tweet: str):
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+', '', tweet)
    tweet = re.sub(r'[^a-z\s]', '', tweet)
    return [stemmer.stem(w) for w in tweet.split() if w not in stop_words]

def build_freqs(tweets, ys):
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

def tweet_to_xy(tweet, pos_vocab, neg_vocab):
    tokens = process_tweet(tweet)
    pos_cnt = sum(tok in pos_vocab for tok in tokens)
    neg_cnt = sum(tok in neg_vocab for tok in tokens)
    return np.array([pos_cnt, neg_cnt])
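
To see what these helpers produce without downloading any corpora, here is a small sketch on three made-up tweets. It assumes a simplified tokenizer, `toy_tokens`, standing in for `process_tweet` (lowercase and split only, no stemming or stop-word removal); all `toy_` names are hypothetical and exist only for this illustration.

```python
def toy_tokens(text):
    """Simplified stand-in for process_tweet: lowercase + whitespace split."""
    return text.lower().split()

def toy_build_freqs(tweets, labels):
    """Count how often each (word, label) pair occurs, as build_freqs does."""
    freqs = {}
    for label, tweet in zip(labels, tweets):
        for word in toy_tokens(tweet):
            freqs[(word, label)] = freqs.get((word, label), 0) + 1
    return freqs

toy_corpus = ["happy happy day", "sad day", "happy news"]
toy_labels = [1, 0, 1]
toy_freqs = toy_build_freqs(toy_corpus, toy_labels)
print(toy_freqs)

# Vocabularies: every word seen in a class (no frequency threshold on toy data)
toy_pos_vocab = {w for (w, y) in toy_freqs if y == 1}
toy_neg_vocab = {w for (w, y) in toy_freqs if y == 0}

def toy_xy(text):
    """[positive-vocab hits, negative-vocab hits], mirroring tweet_to_xy."""
    toks = toy_tokens(text)
    return [sum(t in toy_pos_vocab for t in toks),
            sum(t in toy_neg_vocab for t in toks)]

print(toy_xy("happy sad day"))
```

Note that "day" appears in both toy vocabularies, so it adds to both counts. The same overlap can occur with the real vocabularies.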


## 3️⃣ Full tweet corpus – build features & train LR

In [None]:
from nltk.corpus import twitter_samples
import numpy as np

pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')
tweets = pos_tweets + neg_tweets
ys = np.append(np.ones(len(pos_tweets)), np.zeros(len(neg_tweets)))

freqs = build_freqs(tweets, ys)

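# Vocab threshold: keep words seen more than 5 times within a class.
# A word that is frequent in both classes ends up in both vocabularies.
pos_vocab = {w for (w, y), n in freqs.items() if y == 1 and n > 5}
neg_vocab = {w for (w, y), n in freqs.items() if y == 0 and n > 5}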

print(f"Positive vocab: {len(pos_vocab)} words | Negative vocab: {len(neg_vocab)} words")

X = np.array([tweet_to_xy(t, pos_vocab, neg_vocab) for t in tweets])
print("Feature matrix shape:", X.shape)
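
One caveat with the cell above: the vocabularies are counted over every tweet, including those that later land in the test set, so the test score can look somewhat optimistic. Below is a minimal sketch of the alternative ordering: split first, then count words on the training portion only. It uses a tiny made-up corpus, a plain slice standing in for `train_test_split`, and hypothetical helpers `toy_tokens` and `vocab_for`.

```python
def toy_tokens(text):
    """Simplified stand-in for process_tweet (lowercase + split)."""
    return text.lower().split()

def vocab_for(tweets, labels, target, min_count=2):
    """Words seen at least min_count times in tweets whose label == target."""
    counts = {}
    for tweet, y in zip(tweets, labels):
        if y == target:
            for w in toy_tokens(tweet):
                counts[w] = counts.get(w, 0) + 1
    return {w for w, n in counts.items() if n >= min_count}

# Made-up corpus of (text, label) pairs
corpus = [
    ("great day", 1), ("great news", 1), ("awful day", 0),
    ("awful traffic", 0), ("superb superb", 1), ("dreadful dreadful", 0),
]
train, test = corpus[:4], corpus[4:]   # split BEFORE counting words
train_x, train_y = zip(*train)

pos_vocab_train = vocab_for(train_x, train_y, target=1)
neg_vocab_train = vocab_for(train_x, train_y, target=0)
print(pos_vocab_train, neg_vocab_train)

# Counting over the full corpus also picks up words that occur only in test tweets
all_x, all_y = zip(*corpus)
pos_vocab_all = vocab_for(all_x, all_y, target=1)
print(pos_vocab_all - pos_vocab_train)   # words that leaked in from the test split
```

In the notebook, the equivalent change would be to call `train_test_split` on the raw tweets and pass only the training tweets to `build_freqs`.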


### Train / test split & model performance

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

X_train, X_test, y_train, y_test = train_test_split(X, ys, test_size=0.2, random_state=42, stratify=ys)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test  accuracy:", accuracy_score(y_test,  model.predict(X_test)))

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=['neg','pos'])
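
Accuracy alone hides which kind of mistake the model makes. As a rough guide to reading the confusion matrix, the sketch below derives precision, recall, and F1 from the four cells. The counts are invented for illustration and are not results from this notebook; the layout (rows are true labels, columns are predictions, negative class first) follows scikit-learn's convention.

```python
# Hypothetical confusion-matrix counts, NOT real results from the model above.
# Rows = true label, columns = predicted label, ordered [neg, pos].
cm = [[40, 10],   # true neg: 40 predicted neg (TN), 10 predicted pos (FP)
      [5, 45]]    # true pos:  5 predicted neg (FN), 45 predicted pos (TP)

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of tweets predicted positive, the share that truly are
recall = tp / (tp + fn)      # of truly positive tweets, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For the real model, `confusion_matrix(y_test, model.predict(X_test))` returns an array in the same layout.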


### Visualising tweet distribution & decision boundary

In [None]:
import matplotlib.pyplot as plt

np.random.seed(0)
sample_idx = np.random.choice(len(X), 2000, replace=False)
X_s = X[sample_idx]; y_s = ys[sample_idx]

plt.figure(figsize=(6,6))
for label, marker, col in [(1,'o','limegreen'), (0,'x','crimson')]:
    mask = y_s==label
    plt.scatter(X_s[mask,0], X_s[mask,1], marker=marker, color=col, label='pos' if label else 'neg', alpha=0.5)

coef = model.coef_[0]; intercept=model.intercept_[0]
xs = np.linspace(0, X[:,0].max()+2, 100)
ys_line = -(coef[0]*xs + intercept)/coef[1]
plt.plot(xs, ys_line, '--k', linewidth=2)

plt.xlabel('Positive word count'); plt.ylabel('Negative word count')
plt.title('Tweet sentiment space')
plt.legend(loc='upper right')
plt.xlim(-0.5, X[:,0].max()+1); plt.ylim(-0.5, X[:,1].max()+1)  # small margin so points on the axes stay visible
plt.show()


**Interpretation tips**

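* Tweets with *no* positive‑vocabulary words sit at x=0, and those with no negative‑vocabulary words sit at y=0; that’s why some points hug the axes.  
* Both features are small integer counts, so many tweets share the same coordinates and are drawn on top of each other.  
* A word that passes the threshold in both classes raises both counts, which pulls tweets toward the diagonal. Richer features (bigrams, TF‑IDF) can help separate the overlapping clusters.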
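
As a rough sketch of the richer-features idea, the snippet below replaces the two count features with TF-IDF weighted unigrams and bigrams, using scikit-learn's `TfidfVectorizer` and `make_pipeline`. The six example texts and the name `tfidf_clf` are made up for illustration. On the real corpus you would fit on the training tweets only and compare test accuracy against the two-feature model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only
texts = [
    "really good movie", "good and fun", "so much fun",
    "not good at all", "really bad movie", "bad and boring",
]
labels = [1, 1, 1, 0, 0, 0]

# ngram_range=(1, 2) adds bigrams such as "not good" alongside single words
tfidf_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
tfidf_clf.fit(texts, labels)

# make_pipeline names each step after its lowercased class name
vectorizer = tfidf_clf.named_steps["tfidfvectorizer"]
print("Number of features:", len(vectorizer.vocabulary_))
print("Has bigram 'not good':", "not good" in vectorizer.vocabulary_)
print("P(positive):", tfidf_clf.predict_proba(["not good"])[:, 1])
```

With only six texts the probabilities mean little. The point is that the bigram "not good" becomes its own feature, which two plain counts cannot express.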

## 4️⃣ Interactive Gradio sentiment tester

In [None]:
import gradio as gr

def predict_sentiment(text):
    feats = tweet_to_xy(text, pos_vocab, neg_vocab)
    prob_pos = float(model.predict_proba([feats])[0][1])
    label = "Positive 😊" if prob_pos >= 0.5 else "Negative 😞"
    return {
        "Positive-count": int(feats[0]),
        "Negative-count": int(feats[1]),
        "Prob-positive": round(prob_pos, 3),
        "Prediction": label
    }

with gr.Blocks() as demo:
    gr.Markdown("### 🎛️ Sentiment tester (Logistic Regression)")
    txt = gr.Textbox(label="Enter tweet text", lines=3)
    out = gr.JSON(label="Model output")
    txt.submit(predict_sentiment, txt, out)
    gr.Button("Run").click(predict_sentiment, txt, out)

# Uncomment the next line when running in Colab
# demo.launch()


---

🎉 **You trained, visualised, and deployed a sentiment classifier!**  
Try tweaking the vocabulary threshold, adding TF‑IDF, or swapping in a different model to see how performance changes.