# Part 2: Logistic Regression

In this part, you will complete a Do-It-Yourself (DIY) implementation of binary logistic regression in an object-oriented style that follows the Scikit-Learn API.

**Learning objectives.** You will:
1. Write object-oriented code for a Python class, matching standard API patterns.
2. Apply numerical Python (NumPy) to efficiently implement binary logistic regression, including code to fit the model to data using the gradient descent algorithm. 
3. Evaluate your implementation compared to the Scikit-Learn standard on synthetic data. 
4. Perform an ablation study on the impact of the learning rate hyperparameter for fitting a logistic regression model.

## Background
Before implementing logistic regression, let's understand the mathematical foundations.

**Sigmoid Function**: $\sigma(z) = \frac{1}{1 + e^{-z}}$ transforms any real value $z$ into a probability between 0 and 1.

**Logistic Regression Model**: For features $\mathbf{x}$ and weights $\mathbf{w}$:
$$P(y = 1 | \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$$

**Decision Boundary**: The line/plane where $\mathbf{w}^T \mathbf{x} = 0$ (i.e., $P(y=1) = 0.5$).

**Cross-Entropy Loss**: For a data point with true label $y \in \{0,1\}$ and predicted probability $p$ (where $\log$ denotes the natural logarithm):
$$\text{Loss} = -[y \log(p) + (1-y) \log(1-p)]$$
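
As a quick illustration of the two definitions above, here is a minimal NumPy sketch. The function names `sigmoid` and `cross_entropy` are chosen here for illustration only, and `np.log` is the natural logarithm. It shows that a logit of 0 lands exactly on the decision boundary and that, for a true label of 1, the loss grows as the predicted probability falls.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p):
    """Cross-entropy loss for true label y (0 or 1) and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A logit of 0 sits exactly on the decision boundary.
p_boundary = sigmoid(0.0)

# For a true label of 1, the loss grows as the predicted probability falls.
losses = [cross_entropy(1, p) for p in (0.9, 0.5, 0.1)]
print(p_boundary, losses)
```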


### Worked Example

Suppose you have four training data points $x^{(1)}, x^{(2)}, x^{(3)}$, and $x^{(4)}$. Each data point $x^{(i)}$ has two input features $x^{(i)}_1$ and $x^{(i)}_2$ and a binary (0 or 1) predictive target $y_i$.

**Data Points**
- $x^{(1)} = [2, 1], \quad y_1 = 1$
- $x^{(2)} = [1, 3], \quad y_2 = 0$ 
- $x^{(3)} = [3, 0], \quad y_3 = 1$
- $x^{(4)} = [0, 2], \quad y_4 = 1$ (this will be misclassified)

**Model Parameters:**
- Weights: $w_1 = 1, w_2 = -1$
- Bias: $b = 0$ *(Note: Bias terms are not required for this assignment)*

This implies that the logit for each point is $z = w_1x_1 + w_2x_2 + b = x_1 - x_2$.

#### Example Step-by-Step Calculations

**Point 1: $x^{(1)} = [2, 1], y_1 = 1$**
- $z_1 = 1(2) + (-1)(1) + 0 = 2 - 1 = 1$
- $p_1 = \sigma(1) = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 0.368} = 0.731$
- Predicted class: $\hat{y}_1 = 1$ (since $p_1 > 0.5$)
- Loss: $L_1 = -(1 \cdot \log(0.731) + 0 \cdot \log(0.269)) = -\log(0.731) = 0.313$

**Point 4: $x^{(4)} = [0, 2], y_4 = 1$ (Misclassified)**
- $z_4 = 1(0) + (-1)(2) + 0 = 0 - 2 = -2$
- $p_4 = \sigma(-2) = \frac{1}{1 + e^{2}} = \frac{1}{1 + 7.389} = 0.119$
- Predicted class: $\hat{y}_4 = 0$ (since $p_4 < 0.5$) ❌ **MISCLASSIFIED**
- Loss: $L_4 = -(1 \cdot \log(0.119) + 0 \cdot \log(0.881)) = -\log(0.119) = 2.127$

#### Summary Results

| Point | Features | True $y$ | $z$ | $P(y=1)$ | Predicted $\hat{y}$ | Loss | Correct? |
|-------|----------|----------|-----|----------|-------------------|------|----------|
| 1     | [2, 1]   | 1        | 1   | 0.731    | 1                 | 0.313| ✓        |
| 2     | [1, 3]   | 0        | -2  | 0.119    | 0                 | 0.127| ✓        |
| 3     | [3, 0]   | 1        | 3   | 0.953    | 1                 | 0.048| ✓        |
| 4     | [0, 2]   | 1        | -2  | 0.119    | 0                 | 2.127| ❌       |

#### Performance Metrics

**Total Cross-Entropy Loss:** $L_{total} = 0.313 + 0.127 + 0.048 + 2.127 = 2.615$

**Mean Cross-Entropy Loss:** $L_{mean} = \frac{2.615}{4} = 0.654$

**Accuracy:** $\frac{3 \text{ correct}}{4 \text{ total}} = 75\%$
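
The calculations in this worked example can be checked with a few lines of NumPy. This is a sketch that uses the data points and weights listed above; the variable names are illustrative. Printed values may differ from the table in the last digit, because the table rounds intermediate results.

```python
import numpy as np

# Data and parameters from the worked example (bias b = 0).
X = np.array([[2, 1], [1, 3], [3, 0], [0, 2]], dtype=float)
y = np.array([1, 0, 1, 1])
w = np.array([1.0, -1.0])

z = X @ w                          # logits: x_1 - x_2 for each point
p = 1.0 / (1.0 + np.exp(-z))       # predicted P(y = 1)
y_hat = (p > 0.5).astype(int)      # threshold at 0.5
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # per-point cross-entropy

mean_loss = loss.mean()
accuracy = (y_hat == y).mean()

for i in range(len(y)):
    print(f"Point {i + 1}: z={z[i]:+.0f}, p={p[i]:.3f}, y_hat={y_hat[i]}, loss={loss[i]:.3f}")
print(f"mean loss={mean_loss:.3f}, accuracy={accuracy:.2f}")
```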

#### Observations

1. **Decision Boundary:** The decision boundary is the line $x_1 - x_2 = 0$, or $x_1 = x_2$.

2. **Misclassification Impact:** Point 4 has the highest cross-entropy loss (2.127) because it is confidently misclassified: the model assigns probability 0.119 to a true positive case.

3. **Loss vs. Accuracy:** While accuracy is 75%, the cross-entropy loss also captures the confidence of each prediction. Point 3, which is correctly classified, has a low loss (0.048) because the model is very confident in its correct prediction.

4. **Geometric Interpretation:** Points with $x_1 > x_2$ (below the line $x_1 = x_2$ when $x_1$ is plotted on the horizontal axis) are classified as positive (class 1), while points with $x_1 < x_2$ (above the line) are classified as negative (class 0). Point 4 at [0, 2] lies on the negative side of the line although its true label is positive, hence the misclassification.

## Task 1

First, we will use Scikit-Learn to develop a baseline logistic regression model to which we can compare our DIY implementation. Run the following code to generate synthetic data for use in this part of the assignment. Observe that the predictive target is coded as 0 or 1, that the `sigmoid` function is defined for you, and that the code also splits the synthetic data into train and test sets for you.

Use Scikit-Learn to fit a [logistic regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#logisticregression) on the train set with the parameter setting `penalty=None` (the Python value `None`, not the string `'None'`; this trains a basic model without any regularization). Evaluate and report the [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) of the model on both the train set and the test set.

In [5]:
```python
# Run but do not modify this code

import numpy as np
from sklearn.model_selection import train_test_split

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

np.random.seed(2025)
n = 1000
features = 20

X = np.random.normal(size = (n, features))
weights = np.random.normal(size = features)
probs = sigmoid(X @ weights + np.random.normal(scale=0.01))
y = np.random.binomial(n=1, p=probs)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2025)
```

In [6]:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Write code for task 1 here

logistic_regression_model = LogisticRegression(penalty = None, random_state = 2025)
logistic_regression_model.fit(X_train, y_train)
y_train_prediction = logistic_regression_model.predict(X_train)
y_test_prediction = logistic_regression_model.predict(X_test)
train_accuracy = accuracy_score(y_train, y_train_prediction)
test_accuracy = accuracy_score(y_test, y_test_prediction)
print("The accuracy of the model on the train set is", train_accuracy)
print("The accuracy of the model on the test set is", test_accuracy)
```

```text
The accuracy of the model on the train set is 0.85
The accuracy of the model on the test set is 0.8
```


## Task 2

Complete the following class to implement binary logistic regression. Some important notes about the implementation:

1. Remember that the Scikit-Learn API treats an input `X` array, whether to `fit` or `predict`, as a design matrix with a row for every data point and a column for every feature. 

2. For `fit`, every row in `X` corresponds to the output at the same index in `y`. You don't need to return anything; just optimize the internal model weights (which should be stored as instance variables). For `predict_proba` and `predict`, return a NumPy array with one element (a probability or a 0/1 value, respectively) for every row in the input `X`.

3. Remember that logistic regression models the probability of outputting `1` as a sigmoid of a linear function of features. This has several implications. 
    - One is that the number of weights in your model should equal the number of features, which equals the number of columns in the `X` matrix passed to the `fit` method. We recommend that you initialize these weights as random normally distributed values, for example by using NumPy's [random.normal](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html). 
    - Another implication is that `predict_proba` should return the sigmoid activation of the [dot product](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) (multiply element-wise then add together) of your model's weights and the given input. You do **not** need to include a bias term for this implementation.

4. For the `predict` function, use a simple threshold of 0.5. That is, compute the probabilities using the `predict_proba` method and return `1` if the probability is greater than 0.5 and `0` otherwise. You can assume that `y` consists exclusively of `0`s and `1`s for the purpose of this implementation.

5. The `fit` method should implement gradient descent on the cross entropy loss. The process is described below followed by hints about vectorized NumPy operations. 
    - For a given feature/weight dimension $j$ and a particular data point $x$ with label $y$, the partial derivative with respect to weight $w_j$ is $(a - y)x_j$ where $a$ is the activation (the predicted probability) associated with example $x$. 
    - For each feature, this quantity should be averaged over all training data. The vector of all such values forms the gradient $\vec{\nabla}$. 
    - The gradient descent learning update should then be $\vec{w'} = \vec{w} - \eta \vec{\nabla}$ where $\eta$ is the learning rate `lr` passed to the constructor, $\vec{w}$ are the previous weights and $\vec{w'}$ are the weights for the next iteration. 
    - The algorithm should proceed for `max_iters` iterations unless the magnitude of the gradient becomes less than the `tol` hyperparameter. This is implemented in the code for you.

6. Vectorization hint: Instead of using nested loops, use NumPy matrix operations. The key insight is that you can compute gradients for all features simultaneously:
   - `X @ weights` gives you all linear combinations: shape `(n_samples,)`
   - `sigmoid(X @ weights)` gives you all predictions: shape `(n_samples,)`
   - `X.T @ (predictions - y)` gives you gradients for all features: shape `(n_features,)`
   - Don't forget to divide by the number of samples to get the average gradient

7. You will note the `fit` method takes an optional `verbose` parameter. While it is not required, we highly recommend that you include code in the `fit` method that, when `verbose` is `True`, provides additional logging or printing of information about the training process to help debug. The code already shown prints the magnitude of the gradient every 10 iterations.

8. The `pass` statements are syntactic placeholders that should be removed when you implement a method.
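
One way to gain confidence in the vectorized gradient from notes 5 and 6 is to compare it against a finite-difference approximation of the mean cross-entropy loss. The sketch below is not part of the required implementation. The helper names (`mean_cross_entropy`, `analytic_gradient`, `numerical_gradient`) and the small random data set are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_cross_entropy(w, X, y):
    """Mean cross-entropy loss over all rows of X (no bias term)."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, X, y):
    """Vectorized gradient: average of (a - y) * x_j over all data points."""
    predictions = sigmoid(X @ w)              # shape (n_samples,)
    return X.T @ (predictions - y) / len(y)   # shape (n_features,)

def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite-difference approximation, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (mean_cross_entropy(w + step, X, y)
                   - mean_cross_entropy(w - step, X, y)) / (2 * eps)
    return grad

# Small illustrative data set (hypothetical, not the assignment data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.integers(0, 2, size=50)
w_demo = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(w_demo, X_demo, y_demo)
                         - numerical_gradient(w_demo, X_demo, y_demo)))
print(f"largest difference: {max_diff:.2e}")
```

If the two gradients disagree by more than a small numerical tolerance, the usual suspects are a missing division by the number of samples or a sign error in `predictions - y`.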

In [8]:
```python
class BinaryLogisticRegression:
    def __init__(self, lr=0.1, max_iters=1000, random_state=2025, tol=1e-6):
        self.lr = lr
        self.max_iters = max_iters
        self.random_state = random_state
        self.tol = tol  # Tolerance for checking convergence
        self.weights = None # Number of weights determined in fit


    def sigmoid(self, z):
        # Clip z to prevent numerical overflow/underflow
        z_clipped = np.clip(z, -500, 500)
        return 1.0/(1.0 + np.exp(-z_clipped))


    def predict_proba(self, X):
        """Predict probability of 1 for each row in X"""
        # todo: complete predict_proba method
        if self.weights is None:
            raise ValueError("Make sure to call fit(X, y, verbose) first.")
        linear_combinations = X @ self.weights
        predictions = self.sigmoid(linear_combinations)
        return predictions


    def predict(self, X):
        """Predict class label 1 or 0 for each row in X"""
        # Hint: Use predict_proba and apply 0.5 threshold
        # Hint: Result should have shape (n_samples,)
        # todo: complete predict method
        probabilities = self.predict_proba(X)
        class_labels = (probabilities > 0.5).astype(int)
        return class_labels


    def fit(self, X, y, verbose=False):
        """ Fit the training data with gradient descent.
        Parameters
        ----------
        X : {array-like}, shape = [n_examples, n_features]
          Training vectors, where n_examples is the number of examples and
          n_features is the number of features.
        y : array-like, shape = [n_examples]
          Target values, assumed to be 0 or 1.
        verbose : bool, optional (default=False)
          If True, print training progress information.
        """
        # Set random seed for reproducible weight initialization
        np.random.seed(self.random_state)

        # Initialize weights
        # todo: initialize self.weights using np.random.normal
        self.weights = np.random.normal(size = X.shape[1])

        # Gradient descent loop
        for i in range(self.max_iters):
            # todo: calculate gradient
            linear_predictions = X @ self.weights
            predicted_probabilities = self.sigmoid(linear_predictions)
            prediction_errors = predicted_probabilities - y
            gradient = (X.T @ prediction_errors) / X.shape[0]

            # Check for convergence -- update gradient variable name if different
            gradient_magnitude = np.linalg.norm(gradient)
            if gradient_magnitude < self.tol:
                if verbose:
                    print(f"Converged at iteration {i}, gradient magnitude: {gradient_magnitude:.2e}")
                break

            # todo: update self.weights using gradient descent rule
            self.weights -= self.lr * gradient

            if verbose and i % 10 == 0:
                print(f"Iteration {i}, gradient magnitude: {gradient_magnitude:.2e}")
```

## Task 3

Use your DIY `BinaryLogisticRegression` class from task 2 to fit a logistic regression model on the train set as you did for the Scikit-Learn implementation in task 1. Use the default parameters.

Evaluate and report the [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) of your DIY model on both the train set and the test set. You should achieve similar performance (within about 2 percentage points) compared to the Scikit-Learn implementation.

In [10]:
```python
# Write code for task 3 here
DIY_model = BinaryLogisticRegression()
DIY_model.fit(X_train, y_train)
y_train_prediction_DIY = DIY_model.predict(X_train)
y_test_prediction_DIY = DIY_model.predict(X_test)
DIY_train_accuracy = accuracy_score(y_train, y_train_prediction_DIY)
DIY_test_accuracy = accuracy_score(y_test, y_test_prediction_DIY)
print("The accuracy of the DIY model on the train set is", DIY_train_accuracy)
print("The accuracy of the DIY model on the test set is", DIY_test_accuracy)
print("The accuracy of the Scikit-Learn model on the train set is", train_accuracy)
print("The accuracy of the Scikit-Learn model on the test set is", test_accuracy)
```

```text
The accuracy of the DIY model on the train set is 0.8457142857142858
The accuracy of the DIY model on the test set is 0.79
The accuracy of the Scikit-Learn model on the train set is 0.85
The accuracy of the Scikit-Learn model on the test set is 0.8
```


## Task 4 (Bonus)

Perform an *ablation* study on the learning rate hyperparameter `lr`. Specifically, try fitting your DIY `BinaryLogisticRegression` implementation seven different times with different settings of the learning rate hyperparameter `lr`. Try each of the values `[100, 10, 1, 0.1, 0.01, 0.001, 0.0001]`.

For each run:
  - Evaluate the [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) of the model predictions on the train set only (note that hyperparameters should never be selected using the test data).
  - Record how many iterations were necessary to fit the model (either by reaching the convergence criterion or just the `max_iters` limit).

Report all of your results. Based on your findings, briefly explain the importance of selecting a good learning rate, considering both model performance and computational complexity.
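
One possible way to structure such a study is sketched below. It is only an outline under stated assumptions, not a reference solution and not a report of results. The helper `fit_and_count` is hypothetical: it repeats the gradient descent loop from Task 2 as a standalone function and also returns the number of weight updates performed. `X_demo` and `y_demo` are stand-in data generated here so that the block runs on its own; substitute `X_train` and `y_train` for the actual study.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp, as the class in Task 2 does.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_and_count(X, y, lr, max_iters=1000, tol=1e-6, seed=2025):
    """Hypothetical helper: gradient descent that also returns the update count."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    n_iters = 0
    for _ in range(max_iters):
        gradient = X.T @ (sigmoid(X @ w) - y) / X.shape[0]
        if np.linalg.norm(gradient) < tol:
            break  # converged before reaching max_iters
        w -= lr * gradient
        n_iters += 1
    return w, n_iters

# Stand-in training data (an assumption; substitute X_train and y_train).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(700, 20))
true_w = rng.normal(size=20)
y_demo = rng.binomial(n=1, p=sigmoid(X_demo @ true_w))

results = []
for lr in [100, 10, 1, 0.1, 0.01, 0.001, 0.0001]:
    w, n_iters = fit_and_count(X_demo, y_demo, lr)
    train_acc = np.mean((sigmoid(X_demo @ w) > 0.5).astype(int) == y_demo)
    results.append((lr, train_acc, n_iters))
    print(f"lr={lr:<7} train accuracy={train_acc:.3f} iterations={n_iters}")
```

An alternative with the same effect would be to store the iteration count as an instance attribute inside the `fit` method of `BinaryLogisticRegression` and read it after each run.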

In [12]:
```python
# Write code for task 4 here
```


*Explain for Task 4 here*

