# Part 2: Logistic Regression from Scratch

Scenario: After performing EDA and clustering analysis on the Maine legislative bills dataset, you want to test whether a bill's assigned committee can be predicted from its title and text embeddings. You will implement logistic regression from scratch to perform this classification task. To keep things simple, we have picked just one category to predict (i.e., a binary classification problem). The `y.json` file provides boolean labels indicating whether each bill was assigned to the "Housing and Economic Development" committee.

In this notebook, you will implement logistic regression from scratch using only NumPy, train it with gradient descent, and compare its performance when using `text_embedding` vs. `title_embedding` as features.

In [None]:
# Importing necessary libraries for data processing and machine learning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Importing scikit-learn tools for model building
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import json

In [None]:
# Loading the json files from the directory  
X_df = pd.read_json('data/X.json')
y_df = pd.read_json('data/y.json')
# Extracting title embeddings into properly formatted numpy arrays
X_title = np.array(X_df['title_embedding'].tolist())
# Extracting the text embeddings as numpy arrays
X_text = np.array(X_df['text_embedding'].tolist())
# Extracting the target committee assignment boolean values 
y = y_df['committee_bool'].values

In [None]:
# Performing 80/20 train-test split for title features, stratifying by the target label to preserve class distribution
X_title_train, X_title_test, y_train, y_test = train_test_split(
    X_title, y, test_size=0.2, random_state=6140, stratify=y
)
# Performing identical split for text features using the same random state
X_text_train, X_text_test, _, _ = train_test_split(
    X_text, y, test_size=0.2, random_state=6140, stratify=y
)

# Standardizing title features to zero mean and unit variance; the scaler is fit on the training split only so no test-set statistics leak into training
scaler_title = StandardScaler()
X_title_train = scaler_title.fit_transform(X_title_train)
X_title_test = scaler_title.transform(X_title_test)

# Initializing and fitting the scaler for the text embeddings as well
scaler_text = StandardScaler()
X_text_train = scaler_text.fit_transform(X_text_train)
X_text_test = scaler_text.transform(X_text_test)

# Printing the split sizes and the class balance in each split
print(f'Training set: {X_title_train.shape[0]} samples')
print(f'Test set: {X_title_test.shape[0]} samples')
print(f'Positive class in train: {y_train.sum()} ({y_train.mean():.1%})')
print(f'Positive class in test: {y_test.sum()} ({y_test.mean():.1%})')

Training set: 1043 samples
Test set: 261 samples
Positive class in train: 81 (7.8%)
Positive class in test: 20 (7.7%)
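
The class balance printed above is worth a quick sanity check before training. As a minimal sketch, assuming only the test-set counts reported above (261 samples, 20 positives), a trivial classifier that always predicts the negative class shows the accuracy floor that any real model has to beat. The `y_test_sketch` and `preds_majority` arrays are illustrative stand-ins, not the notebook's actual data.

```python
import numpy as np

# Stand-in labels matching the counts printed above: 20 positives out of 261.
# (Illustrative only; the real y_test comes from train_test_split.)
y_test_sketch = np.array([1] * 20 + [0] * 241)

# A "majority class" classifier that never flags a bill.
preds_majority = np.zeros_like(y_test_sketch)

# Accuracy: fraction of predictions that match the labels.
baseline_accuracy = np.mean(preds_majority == y_test_sketch)

# Recall: fraction of actual positives that were predicted positive.
true_positives = np.sum((preds_majority == 1) & (y_test_sketch == 1))
baseline_recall = true_positives / np.sum(y_test_sketch == 1)

print(f'Baseline accuracy: {baseline_accuracy:.4f}')
print(f'Baseline recall:   {baseline_recall:.4f}')
```

Under these assumed counts the baseline gets 241 of 261 predictions right while finding none of the positive bills, which is why accuracy alone says little on this dataset.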


In [None]:
class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.theta = None  # Parameters to be learned (includes bias as first element)

    # Sigmoid maps any real value into the (0, 1) range so it can be read as a probability
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    # Injecting bias term as a column of ones into our X matrix
    def _add_intercept(self, X):
        return np.column_stack([np.ones(X.shape[0]), X])

    # Standard batch gradient descent approach to minimize cross-entropy loss
    def fit(self, X, y):
        X = self._add_intercept(X)
        self.theta = np.zeros(X.shape[1])
        # Looping through the specified number of iterations and updating theta on each pass
        for _ in range(self.num_iterations):
            z = np.dot(X, self.theta)
            h = self.sigmoid(z)
            # Calculating the gradient of the loss with respect to theta
            gradient = np.dot(X.T, (h - y)) / y.size
            # Adjusting weights using the learning rate and calculated gradient
            self.theta -= self.learning_rate * gradient

    # Threshold raw probabilities at 0.5 to output binary prediction
    def predict(self, X):
        X = self._add_intercept(X)
        z = np.dot(X, self.theta)
        return (self.sigmoid(z) >= 0.5).astype(int)
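
For reference, the update inside `fit` matches the standard gradient of the binary cross-entropy loss. The notation below is introduced here for illustration and does not appear in the code: $m$ is the number of training samples (`y.size`), $X$ is the feature matrix with the intercept column added, $h = \sigma(X\theta)$ is the vector of predicted probabilities, and $\alpha$ is `learning_rate`.

$$
\begin{aligned}
J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log h^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - h^{(i)}\right) \right] \\
\nabla_\theta J(\theta) &= \frac{1}{m}\, X^\top (h - y) \\
\theta &\leftarrow \theta - \alpha\, \nabla_\theta J(\theta)
\end{aligned}
$$

The second line corresponds to `np.dot(X.T, (h - y)) / y.size`, and the third to `self.theta -= self.learning_rate * gradient`.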

In [None]:
# Creating and training the logistic regression model exclusively on text embeddings
model_text = LogisticRegression(learning_rate=0.1, num_iterations=2000)
model_text.fit(X_text_train, y_train)

# Creating and training an identical model exclusively on title embeddings
model_title = LogisticRegression(learning_rate=0.1, num_iterations=2000)
model_title.fit(X_title_train, y_train)

In [None]:
# Predicting the binary classes for our unseen testing data using both models
preds_text = model_text.predict(X_text_test)
preds_title = model_title.predict(X_title_test)

# Printing out evaluation metrics to compare the text vs title models
print('=== Text Embedding Model ===')
print(f'Accuracy:  {accuracy_score(y_test, preds_text):.4f}')
print(f'Precision: {precision_score(y_test, preds_text):.4f}')
print(f'Recall:    {recall_score(y_test, preds_text):.4f}')
print(f'F1-score:  {f1_score(y_test, preds_text):.4f}')

print('\n=== Title Embedding Model ===')
print(f'Accuracy:  {accuracy_score(y_test, preds_title):.4f}')
print(f'Precision: {precision_score(y_test, preds_title):.4f}')
print(f'Recall:    {recall_score(y_test, preds_title):.4f}')
print(f'F1-score:  {f1_score(y_test, preds_title):.4f}')

=== Text Embedding Model ===
Accuracy:  0.9540
Precision: 0.6818
Recall:    0.7500
F1-score:  0.7143

=== Title Embedding Model ===
Accuracy:  0.9157
Precision: 0.4500
Recall:    0.4500
F1-score:  0.4500
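
The reported scores can be cross-checked by hand. The confusion counts in this sketch are not printed by the notebook; they are inferred from the reported precision and recall together with the 20 positives in the 261-sample test set, so treat them as an assumption. The helper `metrics_from_counts` is a hypothetical name used only for this illustration.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Inferred counts (assumption): the text model finds 15 of the 20 positives
# and raises 7 false alarms; the remaining 234 bills are true negatives.
text_metrics = metrics_from_counts(tp=15, fp=7, fn=5, tn=234)

# Inferred counts (assumption): the title model finds 9 of the 20 positives
# and raises 11 false alarms; the remaining 230 bills are true negatives.
title_metrics = metrics_from_counts(tp=9, fp=11, fn=11, tn=230)

for name, metrics in (('Text', text_metrics), ('Title', title_metrics)):
    acc, prec, rec, f1 = metrics
    print(f'{name}: accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}')
```

If these inferred counts are right, the text model misses 5 of the 20 committee bills while the title model misses 11, a gap that the accuracy figures (0.9540 vs. 0.9157) largely hide.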


Answer the following questions:

1. Which feature — `text_embedding` or `title_embedding` — produced better classification performance? Why do you think that is the case?
2. Suppose this classifier is being used to flag potential Housing bills for a human reviewer. Which error metric (accuracy, precision, recall, or F1-score) is most important in this scenario? Which is least important? Justify your answer.
3. How does the choice of learning rate affect the convergence of gradient descent? What strategies can be used to choose an appropriate learning rate?

**Answers:**

1. The `text_embedding` features produced clearly better classification performance. On the test set, the text model reached an F1-score of 0.7143 (precision 0.6818, recall 0.7500), compared with 0.4500 on all three metrics for the title model. A likely reason is that a bill's title is short and often uses generic administrative phrasing, so its embedding carries little information that distinguishes one committee from another. The full text contains far more subject-specific vocabulary and context, which gives the logistic regression model a stronger signal for learning weights that separate bills assigned to the target committee from the rest.

2. **Most important: Recall.** If the goal is to flag potential bills for a human reviewer, a missed housing bill (a false negative) never reaches the reviewer, whereas a false positive costs only a little extra review time. High recall means the model catches most of the actual housing bills, even if it also sends some non-housing bills to the reviewer's inbox.

   **Least important: Accuracy.** The dataset is heavily imbalanced: only about 7.7% of the test bills belong to the target committee. A trivial model that predicts `False` for every bill would reach roughly 92% accuracy while flagging nothing for the reviewer, so accuracy is a misleading metric for this use case.

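3. The learning rate sets the step size of each gradient descent update. If it is too small, the loss decreases very slowly and the model may not converge within the allotted number of iterations. If it is too large, the updates can overshoot the minimum, so the loss oscillates or diverges. Common strategies for choosing it include trying values on a logarithmic grid (for example 0.001, 0.01, 0.1, 1) and comparing results on a validation set, plotting the training loss against the iteration count to check that it decreases steadily, and using a schedule that decays the step size over time. Standardizing the features, as done here with `StandardScaler`, also helps a single learning rate work reasonably well across all dimensions.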
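
To make question 3 concrete, here is a minimal sketch that runs the same batch gradient descent update as the `LogisticRegression` class on a small synthetic dataset with several learning rates and records the cross-entropy loss at each iteration. The synthetic data, the `loss_history` helper, and the specific rates are assumptions chosen for illustration; they are not the notebook's data or tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_history(X, y, learning_rate, num_iterations=200):
    """Run batch gradient descent and return the cross-entropy loss per iteration."""
    X = np.column_stack([np.ones(X.shape[0]), X])  # add intercept column
    theta = np.zeros(X.shape[1])
    eps = 1e-12  # keeps log() away from zero
    losses = []
    for _ in range(num_iterations):
        h = sigmoid(X @ theta)
        # Loss is recorded before the update, so losses[0] is the loss at theta = 0.
        losses.append(-np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)))
        theta -= learning_rate * (X.T @ (h - y)) / y.size
    return np.array(losses)

# Synthetic, roughly standardized toy data (assumption, not the bills dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(float)

# Compare a very small, a moderate, and a larger learning rate.
histories = {lr: loss_history(X_toy, y_toy, lr) for lr in (0.001, 0.1, 1.0)}
for lr, losses in histories.items():
    print(f'lr={lr}: first loss {losses[0]:.4f}, final loss {losses[-1]:.4f}')
```

Plotting each entry of `histories` with `plt.plot` gives the loss curves side by side. On this toy problem the smallest rate barely moves the loss in 200 iterations; on other data, a rate that is too large may instead make the curve oscillate or blow up, which is the signal to reduce it.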