# Assignment A1 – Building Logistic Regression from Scratch 🔧📈  

This teaching notebook **wraps the original Coursera Assignment A1** with extra scaffolding.

> **Goal**: implement gradient descent for logistic regression and apply it to tweet‑sentiment classification.  

This assignment turns the concepts from the lessons into working code and builds some of the mathematical intuition behind them. Depending on your learning goals, you may not need to work through the code yourself.

✅ What’s new in the assignment versus the lessons:
- The assignment asks you to implement it yourself step by step:
  - Implement sigmoid()
  - Manually compute predictions and gradients
  - Write your own gradientDescent() function
  - Make predictions from scratch
  - Evaluate accuracy without using sklearn

🧠 Teaching goal:
- While the lessons explain and visualize the concepts, the assignment ensures you:
  - Understand the math of logistic regression
  - Practice vectorizing operations
  - Learn how models are trained through iteration

In [None]:
# 🍀 Colab setup – run once
!pip -q install --upgrade \
        "numpy>=1.26,<2.1" \
        "scikit-learn<1.7" \
        "nltk" \
        "wordcloud" \
        "gradio>=4.27.0" \
        "websockets>=13,<15" --progress-bar off

import nltk, ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
    ssl._create_default_https_context = _create_unverified_https_context
except AttributeError:
    pass

for res in ['stopwords','punkt','twitter_samples']:
    nltk.download(res, quiet=True)

print('✅ Environment ready')


## 1️⃣ Toy walkthrough (6 tweets)  

Before diving into full gradient descent, let’s solve a *mini* version with scikit‑learn so you can **see the expected behaviour**.

In [None]:
import numpy as np, re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

stemmer, stop_words = PorterStemmer(), set(stopwords.words('english'))
def clean(text):
    text = text.lower(); text = re.sub(r'[^a-z\s]', '', text)
    return [stemmer.stem(w) for w in text.split() if w not in stop_words]

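# Stem the lexicon too: clean() stems tokens ('happy' -> 'happi'), so raw words would not all match
pos_lex = {stemmer.stem(w) for w in ('love', 'great', 'happy')}
neg_lex = {stemmer.stem(w) for w in ('hate', 'bad', 'sad')}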
def to_xy(tokens):
    return [sum(w in pos_lex for w in tokens), sum(w in neg_lex for w in tokens)]

tiny = ["I love this", "So happy!", "Great product",
        "I hate this", "Very bad", "So sad"]
y_tiny = np.array([1,1,1,0,0,0])
X_tiny = np.array([to_xy(clean(t)) for t in tiny])

clf = LogisticRegression(); clf.fit(X_tiny, y_tiny)
print("Tiny accuracy:", clf.score(X_tiny, y_tiny))

# Plot
plt.figure(figsize=(4,4))
for lab, m, c in [(1,'o','green'), (0,'x','red')]:
    sel = y_tiny==lab
    plt.scatter(X_tiny[sel,0], X_tiny[sel,1], marker=m, color=c)
coef, b = clf.coef_[0], clf.intercept_[0]
xs = np.linspace(-0.2,3,100); plt.plot(xs, -(coef[0]*xs + b)/coef[1], '--k')
plt.xlabel('Positive'); plt.ylabel('Negative'); plt.title('Toy LR boundary')
plt.show()

**Take‑away:** the straight decision boundary is what you’ll reproduce with *your* implementation in Section 4.
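
The dashed line in the plot comes from setting the model's linear score to zero and solving for the second feature. Here is a minimal sketch of that algebra. The coefficient values `w_pos`, `w_neg` and `b` are made up for illustration and are not what scikit-learn will fit:

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
w_pos, w_neg, b = 1.2, -1.2, 0.0

def boundary_neg_count(pos_count):
    """Negative-word count at which the score w_pos*x1 + w_neg*x2 + b equals 0."""
    return -(w_pos * pos_count + b) / w_neg

def predict_proba(pos_count, neg_count):
    """Sigmoid of the linear score: probability of the positive class."""
    z = w_pos * pos_count + w_neg * neg_count + b
    return 1.0 / (1.0 + np.exp(-z))

x1 = 2.0
x2 = boundary_neg_count(x1)        # a point that lies exactly on the boundary
print(predict_proba(x1, x2))       # on the line the probability is 0.5
print(predict_proba(1, 0) > 0.5)   # one positive word: positive side
print(predict_proba(0, 1) < 0.5)   # one negative word: negative side
```

Because the score is linear in the two counts, the set of points where the probability equals 0.5 is always a straight line.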

## 2️⃣ Helper code – cleaning & feature building

In [None]:
import numpy as np, re, math
from collections import Counter
from nltk.corpus import stopwords, twitter_samples
from nltk.stem import PorterStemmer

stop_words, stemmer = set(stopwords.words('english')), PorterStemmer()

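def process_tweet(tweet):
    """Lowercase, strip URLs and non-letters, drop stopwords, and stem each token."""
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+', '', tweet)   # remove links first
    tweet = re.sub(r'[^a-z\s]', '', tweet)       # keep only letters and whitespace
    return [stemmer.stem(w) for w in tweet.split() if w not in stop_words]

def build_freqs(tweets, ys):
    """Map (stemmed_word, label) -> how often the word occurs in tweets with that label."""
    freqs = {}
    for y, t in zip(ys, tweets):
        for w in process_tweet(t):
            freqs[(w, y)] = freqs.get((w, y), 0) + 1
    return freqs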

def extract_features(tweet, freqs):
    '''Return [1, pos_count, neg_count] for one tweet'''
    x = np.zeros(3)
    x[0] = 1
    for w in process_tweet(tweet):
        x[1] += freqs.get((w,1.0),0)
        x[2] += freqs.get((w,0.0),0)
    return x
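
To see what the `[1, pos_count, neg_count]` vector looks like, here is a small worked example. It is a sketch under simplifying assumptions: `toy_tokens` is a hypothetical stand-in for `process_tweet` that only lowercases and splits on whitespace (no stemming, no stopword removal), and the three tweets are invented.

```python
import numpy as np

def toy_tokens(text):
    # Simplified stand-in for process_tweet: lowercase + whitespace split only
    return text.lower().split()

def toy_build_freqs(tweets, ys):
    # Same counting logic as build_freqs, using the simplified tokenizer
    freqs = {}
    for y, t in zip(ys, tweets):
        for w in toy_tokens(t):
            freqs[(w, y)] = freqs.get((w, y), 0) + 1
    return freqs

def toy_extract_features(tweet, freqs):
    x = np.zeros(3)
    x[0] = 1  # bias term
    for w in toy_tokens(tweet):
        x[1] += freqs.get((w, 1.0), 0)  # occurrences in positive tweets
        x[2] += freqs.get((w, 0.0), 0)  # occurrences in negative tweets
    return x

toy_tweets = ["good day", "good good vibes", "bad day"]
toy_ys = np.array([1.0, 1.0, 0.0])
toy_freqs = toy_build_freqs(toy_tweets, toy_ys)

print(toy_freqs[("good", 1.0)])                       # 3
print(toy_extract_features("good day", toy_freqs))    # [1. 4. 1.]
```

"good" appears three times in positive tweets and "day" once in each class, so "good day" gets a positive count of 3 + 1 = 4 and a negative count of 0 + 1 = 1.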


## 3️⃣ Prepare full tweet dataset

In [None]:
from nltk.corpus import twitter_samples
pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')
tweets = pos_tweets + neg_tweets
ys = np.append(np.ones(len(pos_tweets)), np.zeros(len(neg_tweets)))

freqs = build_freqs(tweets, ys)
print("Frequency dict size:", len(freqs))

# Build feature matrix
X = np.stack([extract_features(t, freqs) for t in tweets])
print("Feature matrix shape:", X.shape)


## 4️⃣ Your turn – implement Logistic Regression with Gradient Descent  

We’ll guide you through:  

1. **Sigmoid** function  
2. Cost computation  
3. Gradient calculation  
4. Parameter update in loop  
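
For reference, the four steps can be written compactly. The notation is an assumption chosen to mirror the variable names in this notebook's code cells: $X$ is the $m \times 3$ feature matrix, $y$ the label vector, $\theta$ the parameter vector and $\alpha$ the learning rate.

$$
\begin{aligned}
h &= \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
J(\theta) &= -\frac{1}{m}\left[\, y^{\top}\log h + (1 - y)^{\top}\log(1 - h) \,\right] \\
\nabla_{\theta} J &= \frac{1}{m}\, X^{\top}(h - y) \\
\theta &\leftarrow \theta - \alpha\, \nabla_{\theta} J
\end{aligned}
$$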

👉 **Fill the TODOs** below. A reference solution is left in each cell so you can check your work.

In [None]:
def sigmoid(z):
    """Compute sigmoid – ***complete this***"""
    ### TODO
    return 1/(1+np.exp(-z))
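
The expression `1/(1+np.exp(-z))` is mathematically correct, but `np.exp(-z)` overflows (with a RuntimeWarning) when `z` is a large negative number. One possible workaround is sketched below. `stable_sigmoid` is a hypothetical name and is not part of the assignment:

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so this form cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the algebraically equivalent form exp(z) / (1 + exp(z))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

vals = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(vals)  # [0.  0.5 1. ]
```

For moderate inputs it returns the same values as the plain formula, so either version works for the TODO above.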


In [None]:
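def compute_cost_and_grad(theta, X, y):
    """Return the mean cross-entropy cost and its gradient with respect to theta."""
    m = len(y)
    z = np.dot(X, theta)
    h = sigmoid(z)
    # Clip h away from exactly 0 and 1 so np.log never receives 0
    eps = 1e-12
    h_safe = np.clip(h, eps, 1 - eps)
    cost = -(1/m)*(np.dot(y, np.log(h_safe)) + np.dot(1-y, np.log(1-h_safe)))
    grad = (1/m)*np.dot(X.T, (h - y))
    return cost, grad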
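
A common way to check a hand-written gradient is to compare it with a finite-difference estimate of the cost's slope. The sketch below repeats the cost and gradient formulas so that it runs on its own. The helper names (`_sigmoid`, `_cost_and_grad`, `numerical_grad`) and the synthetic data are assumptions made for illustration:

```python
import numpy as np

def _sigmoid(z):
    return 1 / (1 + np.exp(-z))

def _cost_and_grad(theta, X, y):
    # Same formulas as compute_cost_and_grad, repeated so the sketch is self-contained
    m = len(y)
    h = _sigmoid(X @ theta)
    cost = -(1 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))
    grad = (1 / m) * (X.T @ (h - y))
    return cost, grad

def numerical_grad(theta, X, y, eps=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        c_plus, _ = _cost_and_grad(theta + step, X, y)
        c_minus, _ = _cost_and_grad(theta - step, X, y)
        g[j] = (c_plus - c_minus) / (2 * eps)
    return g

# Small synthetic problem: bias column plus two random features
rng = np.random.default_rng(0)
X_chk = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_chk = (rng.random(20) > 0.5).astype(float)
theta_chk = rng.normal(size=3) * 0.1

_, analytic = _cost_and_grad(theta_chk, X_chk, y_chk)
numeric = numerical_grad(theta_chk, X_chk, y_chk)
max_diff = np.max(np.abs(analytic - numeric))
print("max |analytic - numeric|:", max_diff)
```

If the two gradients disagree by more than a tiny amount, the usual suspects are a sign error in `h - y` or a missing `1/m` factor.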


In [None]:
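def gradient_descent(X, y, alpha=1e-9, iters=1500):
    """Batch gradient descent; returns theta and the cost logged every 100 steps."""
    theta = np.zeros(X.shape[1])
    costs = []
    for i in range(iters):
        cost, grad = compute_cost_and_grad(theta, X, y)
        theta -= alpha * grad
        if i % 100 == 0:
            costs.append(cost)  # cost measured before this step's update
    return theta, costs

# alpha is tiny because the features are unscaled word counts that can reach the thousands
theta, costs = gradient_descent(X, ys, alpha=1e-9, iters=2000)
print("Last logged cost:", costs[-1])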
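
To build intuition for the learning rate, the sketch below runs the same update rule on a small synthetic problem with several values of `alpha`. Everything in it is an assumption made for illustration: the data are random and roughly unit-scale (unlike the raw tweet counts), and `run_gd` is a hypothetical helper name.

```python
import numpy as np

def sig(z):
    return 1 / (1 + np.exp(-z))

def run_gd(X, y, alpha, iters=200):
    """Plain batch gradient descent; returns the cost measured before every step."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    history = []
    for _ in range(iters):
        h = sig(X @ theta)
        h_c = np.clip(h, 1e-12, 1 - 1e-12)  # keep log() finite
        history.append(-(1 / m) * (y @ np.log(h_c) + (1 - y) @ np.log(1 - h_c)))
        theta -= alpha * (1 / m) * (X.T @ (h - y))
    return np.array(history)

# Synthetic data: one roughly unit-scale feature plus a bias column
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X_syn = np.column_stack([np.ones(200), x1])
y_syn = (x1 + 0.3 * rng.normal(size=200) > 0).astype(float)

results = {}
for alpha in (0.001, 0.1, 1.0):
    results[alpha] = run_gd(X_syn, y_syn, alpha)
    print(f"alpha={alpha}: cost {results[alpha][0]:.3f} -> {results[alpha][-1]:.3f}")
```

Every run starts at a cost of ln 2 ≈ 0.693 because `theta` starts at zero. On this toy problem the smallest rate barely moves the cost in 200 steps, while the larger rates get much further.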


### Cost curve

In [None]:
import matplotlib.pyplot as plt
plt.plot(costs); plt.title('Cost over iterations'); plt.xlabel('Every 100 steps'); plt.ylabel('Cost'); plt.show()

## 5️⃣ Evaluate your handmade model

In [None]:
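# Accuracy computed by hand (no sklearn): fraction of predictions that match the labels.
# Note: this is measured on the same tweets the model was trained on.
y_pred = sigmoid(np.dot(X, theta)) >= 0.5
print("Hand‑built LR accuracy (training data):", np.mean(y_pred == ys))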
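
The accuracy above is measured on the same tweets used for training. One way to get a less optimistic estimate is to hold out part of the data. This is a minimal sketch with hypothetical helper names (`split_indices`, `accuracy`), and the 80/20 ratio is an assumption:

```python
import numpy as np

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle 0..n-1 and split into (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

train_idx, test_idx = split_indices(10, test_fraction=0.2)
print(len(train_idx), len(test_idx))           # 8 2
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))    # 0.75
```

In this notebook, the idea would be to train on `X[train_idx]`, `ys[train_idx]` and report accuracy on `X[test_idx]`, `ys[test_idx]`. For a strict evaluation, the frequency dictionary would also be built from the training tweets only.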


## 6️⃣ Gradio tester

In [None]:
import gradio as gr

def predict(text):
    x = extract_features(text, freqs)
    prob = float(sigmoid(np.dot(x, theta)))
    label = "Positive 😊" if prob>=0.5 else "Negative 😞"
    return {"Prob‑positive": round(prob,3), "Prediction": label}

with gr.Blocks() as demo:
    gr.Markdown("### Test your gradient‑descent model")
    inp = gr.Textbox(lines=3, label="Tweet text")
    out = gr.JSON()
    inp.submit(predict, inp, out)
    gr.Button("Run").click(predict, inp, out)

# Uncomment when running in Colab
# demo.launch()


---

🎯 **You built Logistic Regression from scratch and wired it into an interactive demo!**  

Next: experiment with learning rates, iteration counts, or add L2 regularisation to improve stability.
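
As a starting point for the L2 experiment, here is one possible way to add the penalty to the cost and gradient. It is a sketch, not the assignment's method: the function names, the `lam` parameter, the `lam / (2m)` scaling and the choice to leave the bias unpenalised are all assumptions.

```python
import numpy as np

def sigmoid_ref(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grad_l2(theta, X, y, lam=0.1):
    """Cross-entropy cost and gradient with an L2 penalty on the non-bias weights."""
    m = len(y)
    h = sigmoid_ref(X @ theta)
    penalty_theta = theta.copy()
    penalty_theta[0] = 0.0  # convention assumed here: do not penalise the bias
    cost = -(1 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))
    cost += (lam / (2 * m)) * (penalty_theta @ penalty_theta)
    grad = (1 / m) * (X.T @ (h - y)) + (lam / m) * penalty_theta
    return cost, grad

# Tiny invented dataset: bias column plus two count-like features
X_demo = np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, 1.0, 1.0], [1.0, 3.0, 1.0]])
y_demo = np.array([1.0, 0.0, 0.0, 1.0])
theta_demo = np.array([0.0, 0.5, -0.5])

c0, g0 = cost_and_grad_l2(theta_demo, X_demo, y_demo, lam=0.0)
c1, g1 = cost_and_grad_l2(theta_demo, X_demo, y_demo, lam=1.0)
print(c1 - c0)  # penalty term: (1 / (2*4)) * (0.25 + 0.25) = 0.0625
print(g1 - g0)  # extra gradient: (1/4) * [0, 0.5, -0.5]
```

With `lam=0` the function reduces to the unregularised cost and gradient, which makes it easy to swap into the training loop and compare runs.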