<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 4: Differential calculus</a>
## <a name="0">Lab 4.3: Logistic Regression</a>

 1. <a href="#1">Data for binary classification</a> 
 2. <a href="#2">Gradient descent for logistic regression</a> 
 3. <a href="#3">Model evaluation and sklearn comparison</a>
 4. <a href="#4">Logistic regression on higher-dimensional data</a>
 
[**Logistic regression**](https://en.wikipedia.org/wiki/Logistic_regression) is a fundamental classification algorithm in machine learning. Despite its name, it's used for binary classification problems. The algorithm models the probability that an instance belongs to a particular class.

Key points about logistic regression are:
 - The decision boundary is linear in the data features and in the parameters (weights).
 - It uses the logistic function (sigmoid) to map predictions to probabilities.
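
To make the two points above concrete, here is a minimal sketch with made-up weights and points (none of these values come from the lab data). It assumes a model with a bias and two features, and shows that a point with a linear score of zero gets probability 0.5, while points on either side of that line get probabilities above or below 0.5.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights (bias, w1, w2), chosen only for illustration
w = np.array([-1.0, 2.0, 0.5])

# Three made-up points; the leading 1 multiplies the bias
points = np.array([
    [1.0, 0.0, 2.0],    # score = -1 + 0 + 1 = 0  -> on the boundary
    [1.0, 2.0, 0.0],    # score = -1 + 4 + 0 = 3  -> class 1 side
    [1.0, -1.0, 0.0],   # score = -1 - 2 + 0 = -3 -> class 0 side
])

scores = points @ w      # linear in the features and in the weights
probs = sigmoid(scores)  # probabilities of belonging to class 1
print(probs)
```

The score is a linear function of the features, so the set of points where the probability equals 0.5 is a straight line (a hyperplane in higher dimensions).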



In [None]:
# Upgrade libraries
!pip install -q --upgrade pip
!pip install -q --upgrade scikit-learn

In [None]:
%%capture
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

import torch 
import time

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

from IPython.display import Markdown, display

## <a name="1">1. Data for binary classification</a>
(<a href="#0">Go to top</a>)

For this logistic regression exercise we will reuse data from a previous lab. Fashion-MNIST is a dataset of Zalando's product images with 70,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes corresponding to fashion items.

In [None]:
# Fetch dataset
data = fetch_openml(name="Fashion-MNIST")
# Assemble features and target in same DataFrame for easy data handling
df = data["data"]
# Store target as integer
df["target"] = data["target"].astype(int)
df.head()

### Selecting a subset of two classes

Since this dataset is multi-class, we cannot use it fully for a binary classification problem. But we can use a subset of it, if we select two out of the ten available fashion item categories. 

Let's pick, for instance, the classes `"Sandal"` and `"Sneaker"`; you can replace them with any other two classes you like.

In [None]:
# The 10 classes are labeled 0 to 9; each label maps to one type of fashion item
label_description = {
    0: "T-shirt/top",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle boot"
}

chosen_labels = ("Sandal", "Sneaker")
chosen_classes = [list(label_description.values()).index(c) for c in chosen_labels]

df_binary = df[df.target.isin(chosen_classes)]
print(f"Selected classes: {[label_description[c] for c in df_binary.target.unique()]}")

### Dimensionality reduction and feature normalization

The Fashion-MNIST data points are 784-dimensional, grayscale, 28x28 pixel images. While we could apply logistic regression to all those features, let's simplify by using a lower-dimensional representation of the dataset. Recall that we can apply PCA to the data points and project them to retain only a few of their principal components. 

Let's run PCA with `n_components=2` to transform the 784-dimensional data points into 2-dimensional entities and use those as features of the logistic regression model. 

Recall that data has to be normalized before PCA can be applied. The code below is similar to that of Lab 2.2.

In [None]:
# Scale data
scaler = StandardScaler()

# Make a copy of the data so that pop doesn't overwrite df
df_ = df_binary.copy()

# Remove the target as we'll only scale the features
y = df_.pop("target").values

# Transform labels from the chosen ones to 0, 1 to run binary classification later
y = np.array([0 if i==chosen_classes[0] else 1 for i in y])

# X values
X = df_.values

# Scaled features with zero mean and unit variance
X_sc = scaler.fit_transform(X)

# Shape of the input data
print(f"Shape of features matrix: {X_sc.shape}")

# Initialize PCA object
pca = PCA(n_components=2)

# Fit PCA to normalized data
X_pca = pca.fit_transform(X_sc)

# Shape of the data after PCA
print(f"Shape of the features matrix after PCA: {X_pca.shape}")

Next, we split the data into train and test sets so that we can later evaluate the trained model on unseen data.

In [None]:
# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.25, random_state=1)

**Feature scaling** is advisable to solve logistic regression via gradient descent as it ensures all features contribute proportionally to the model, regardless of their original units or ranges. 

This normalization leads to faster and more stable convergence during optimization, prevents any single feature from dominating due to its scale, and avoids numerical instabilities in calculations, particularly with exponential functions. Scaled features also enhance the interpretability of model coefficients, and ensure consistent performance with regularization techniques. 

Let's fit the scaler on the training data and transform both the train and test datasets to prevent data leakage. 

In [None]:
# Apply feature scaling to the low-dimensionality PCA data
scaler2 = StandardScaler()

X_train = scaler2.fit_transform(X_train)
X_test = scaler2.transform(X_test)

print(f"Train data mean = {X_train.mean()}, std = {X_train.std()}")

In [None]:
# Visualize the projected, scaled data
for cl in (0, 1):
    plt.scatter(X_train[y_train==cl][:, 0], X_train[y_train==cl][:, 1], s=2, alpha=0.7, label=chosen_labels[cl])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Training Data')
plt.legend()
plt.show()

## <a name="2">2. Gradient descent for logistic regression</a>
(<a href="#0">Go to top</a>)

While logistic regression can also be fitted with other optimizers (for example, Newton-type methods such as iteratively reweighted least squares), we will use gradient descent because:
- It's a general optimization technique applicable to many machine learning problems, including the final project.
- It can handle large datasets efficiently, especially with variations like stochastic gradient descent.
- It provides insight into the learning process and serves as an introduction to neural networks.

For our gradient descent implementation from scratch, we will use the same matrix convention from the OLS lab in Lecture 2. 
$$
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & \dots  & x_{1m}\\
1 & x_{21} & \dots  & x_{2m} \\
\vdots & \vdots & \dots & \vdots\\
1 & x_{n1}  & \dots  & x_{nm}\\
\end{pmatrix}
$$
The design matrix provides a concise way to represent all input features for all samples in a single matrix. In gradient descent, this representation allows efficient computation of gradients, because all weights, including the bias or intercept, are stored in one tensor variable. 

The logistic regression model can then be written as:
$$
\hat{y} = \sigma(Xw) = \frac{1}{1 + e^{-Xw}}
$$
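
As a small illustration of why the column of ones is convenient, the sketch below uses synthetic data and hypothetical weights (not the lab's variables) to check that multiplying the design matrix by a single weight vector gives the same scores as adding the bias separately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 5 samples with 2 features each (illustration only)
X_raw = rng.normal(size=(5, 2))

# Hypothetical weights as a column vector: bias, w1, w2
w = np.array([[0.3], [-1.2], [0.7]])

# Design matrix: prepend a column of ones so the bias is absorbed into w
X_design = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Scores computed in one matrix product
z_design = X_design @ w

# Same scores with the bias handled explicitly
z_explicit = w[0] + X_raw @ w[1:]

# Model output: sigmoid applied element-wise to the scores
y_hat = 1.0 / (1.0 + np.exp(-z_design))

print(np.allclose(z_design, z_explicit))
print(y_hat.shape)
```

Both computations agree, and the output has one probability per sample, stored as a column of shape `(n, 1)`.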

Below we construct the design matrix for our data and also convert it to PyTorch tensors, as that will be needed to implement the gradient descent loop.

In [None]:
# Assemble design matrix
def assemble_design_matrix(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Construct design matrix for the train and test splits
X_train_bias = assemble_design_matrix(X_train)
X_test_bias = assemble_design_matrix(X_test)

In [None]:
# Convert to PyTorch tensors
X_train_bias_tensor = torch.FloatTensor(X_train_bias)
y_train_tensor = torch.FloatTensor(y_train)
X_test_bias_tensor = torch.FloatTensor(X_test_bias)
y_test_tensor = torch.FloatTensor(y_test)

We implement the sigmoid and the logistic regression function explicitly. Note that you could also use the native torch implementations [`torch.sigmoid`](https://pytorch.org/docs/stable/generated/torch.sigmoid.html) and [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) instead. 

The loss function in logistic regression is the binary cross-entropy, averaged over the $n$ training samples:
$$
\mathcal{L}(w) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log\left( \sigma\left( x_i^\top w \right)\right) + (1 - y_i) \log\left( 1 - \sigma\left( x_i^\top w \right)\right) \right]
$$
where $x_i^\top$ is the $i$-th row of the design matrix $\mathbf{X}$.

This function is easy to implement by hand, but here we will use [`torch.nn.functional.binary_cross_entropy`](https://pytorch.org/docs/stable/generated/torch.nn.functional.binary_cross_entropy.html), which is optimized and clamps the logarithm outputs for numerical stability.
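
Gradient descent moves the weights along the negative gradient of this loss. For the binary cross-entropy with a sigmoid output, the gradient has the closed form $\nabla_w \mathcal{L} = \frac{1}{n} \mathbf{X}^\top (\sigma(\mathbf{X}w) - y)$. The sketch below is a NumPy illustration on synthetic data (the helper names `bce_loss` and `bce_gradient` are hypothetical, not part of the lab) that checks this formula against central finite differences. In the lab itself, PyTorch's automatic differentiation computes the gradient for you.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, X, y):
    """Mean binary cross-entropy of the logistic model with weights w."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_gradient(w, X, y):
    """Closed-form gradient: X^T (sigmoid(Xw) - y) / n."""
    return X.T @ (sigmoid(X @ w) - y) / X.shape[0]

rng = np.random.default_rng(42)

# Synthetic design matrix (column of ones + 2 features) and random 0/1 labels
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(scale=0.01, size=3)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (bce_loss(w + step, X, y) - bce_loss(w - step, X, y)) / (2 * eps)

analytic = bce_gradient(w, X, y)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

A very small `max_diff` indicates that the closed-form gradient and the numerical estimate agree.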

In [None]:
def sigmoid(z):
    return 1 / (1 + torch.exp(-z))

def logistic_regression(data, weights):
    return sigmoid(torch.mm(data, weights))

def binary_cross_entropy(y_pred, y_true):
    return torch.nn.functional.binary_cross_entropy(y_pred, y_true)

### Exercise 1

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>It is your turn!</b></p>
        <p><b>Exercise 1. Gradient descent for logistic regression.</b></p>
        <p>Implement gradient descent to solve logistic regression. Follow these steps:
            <ul>
                <li>Initialize the weights <code>w</code> with small values using a normal distribution with mean 0 and standard deviation 0.01. This is a simple and often effective approach that will work in this case.</li>
                <li>Set the learning rate and number of epochs as hyperparameters. We recommend a learning rate of 0.1 and 1000 or 2000 training epochs.</li>
                <li>Inside the training loop, perform a forward pass using the logistic regression function given above. This means computing the predictions of <code>logistic_regression</code> on the <code>X_train_bias_tensor</code> training data for the current weights.</li>
                <li>Compute the binary cross-entropy loss using the formula above.</li>
                <li>Perform a backward pass to calculate gradients.</li>
                <li>Update the weights all at once using gradient descent.</li>
                <li>Reset the gradients to zero.</li>
                <li>Store the current training loss in a list <code>train_losses</code>.</li>
                <li>Apply the model to the test data and store the test loss in a list <code>test_losses</code>.</li>
                </ul>
        </p>
        </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid lightcoral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_question.png" alt="MLU solution" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Challenge Help</b></p>
        <p>If you get an error computing the binary cross entropy, check that both input tensors are of the same shape. You can use <code>reshape</code> or <code>squeeze()</code>/<code>unsqueeze()</code> to ensure that.</p>
        <p>If you're stuck, remove the <code>#</code> before the <code>load</code> instruction in the next code cell to display a sample solution.</p>
    </span>
</div>
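
The shape issue mentioned in the help box can be sketched as follows, with made-up values. It assumes predictions come out as a column of shape `(n, 1)` (as they do from a matrix product with a `(3, 1)` weight tensor) while labels are stored with shape `(n,)`. Any of the three alignments below gives the loss function two tensors of the same shape.

```python
import torch
import torch.nn.functional as F

# Made-up predictions as a column, shape (4, 1)
y_pred = torch.tensor([[0.9], [0.2], [0.7], [0.4]])
# Made-up labels as a flat vector, shape (4,)
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])

print(y_pred.shape, y_true.shape)

# Option 1: drop the trailing dimension of the predictions
loss_a = F.binary_cross_entropy(y_pred.squeeze(1), y_true)

# Option 2: add a trailing dimension to the labels
loss_b = F.binary_cross_entropy(y_pred, y_true.unsqueeze(1))

# Option 3: reshape the labels into a column
loss_c = F.binary_cross_entropy(y_pred, y_true.reshape(-1, 1))

print(loss_a.item(), loss_b.item(), loss_c.item())
```

All three options compute the same loss value; which one to use is a matter of taste, as long as it is applied consistently to train and test data.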

In [None]:
# %load solutions/lab43_ex1_solutions.txt

If you correctly stored the losses in `train_losses` and `test_losses`, you can visualize their evolution with the code below.

In [None]:
# Raise errors if losses from Exercise 1 are not defined
if "train_losses" not in dir():
    raise NameError("Please define a `train_losses` variable containing the train losses during the gradient descent loop.")
if "test_losses" not in dir():
    raise NameError("Please define a `test_losses` variable containing the test losses during the gradient descent loop.")

# Plot loss curve
plt.figure(figsize=(6, 4))
plt.plot(train_losses, label="Train loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss over Time')
plt.legend()
plt.show()

## <a name="3">3. Model evaluation and sklearn comparison</a>
(<a href="#0">Go to top</a>)

If your training proceeded correctly, you should have a vector `w` of weights with the 3 trained parameters of the model (the bias and one weight per feature). To measure the performance of the model on unseen data, we apply it to the test data. Remember that the output of logistic regression is a probability between 0 and 1. To convert this to a class prediction, we typically use a threshold of 0.5, although this threshold can be adjusted to another value if that performs better on the data.

Let's apply the trained model to the test data set and measure the accuracy of the logistic regressor, defined as: 
$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$$
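
As a toy illustration of the definition above, the following sketch uses six made-up probabilities and labels (not the lab data) and a hypothetical helper `accuracy_at` to show how the choice of threshold changes the predicted classes and therefore the accuracy.

```python
import numpy as np

# Made-up predicted probabilities and true labels
probs = np.array([0.92, 0.35, 0.61, 0.08, 0.50, 0.47])
y_true = np.array([1, 0, 0, 0, 1, 1])

def accuracy_at(threshold):
    """Fraction of correct predictions when classifying with the given threshold."""
    y_pred = (probs >= threshold).astype(int)
    return np.mean(y_pred == y_true)

acc_05 = accuracy_at(0.5)  # 4 of 6 predictions are correct
acc_04 = accuracy_at(0.4)  # 5 of 6 predictions are correct
print(f"threshold 0.5: {acc_05:.4f}, threshold 0.4: {acc_04:.4f}")
```

On this toy data a lower threshold happens to help, but that says nothing about the Fashion-MNIST models. A threshold should be tuned on held-out validation data, not on the test set.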

In [None]:
# Function to map probabilities to classes (0, 1) for different thresholds. 
def proba_to_class(proba, threshold=0.5):
    return (proba >= threshold).float()

In [None]:
# Raise errors if variable and function from Exercise 1 are not defined
if "w" not in dir():
    raise NameError("Please define a `w` variable containing the solution to the logistic regression model.")

# PyTorch model evaluation
with torch.no_grad():
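    # Predicted probabilities of class 1, shape (n, 1)
    y_test_pred_torch = logistic_regression(X_test_bias_tensor, w)
    # Threshold once, inside proba_to_class
    y_test_pred_torch = proba_to_class(y_test_pred_torch)
    # Flatten predictions so that both inputs are 1-dimensional arrays
    accuracy_torch = accuracy_score(y_test_tensor.numpy(), y_test_pred_torch.numpy().ravel())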
    print(f"PyTorch Model Accuracy: {accuracy_torch:.4f}")

### Comparison with sklearn implementation

Let's compare our custom gradient descent implementation with scikit-learn's built-in [`LogisticRegression`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) class. This is part of the `sklearn.linear_model` module and by default uses the `lbfgs` solver, an optimization algorithm in the family of quasi-Newton methods.

The model is trained and used via the usual `sklearn` API for ML models, with `.fit()` and `.predict()`.

In [None]:
# Sklearn implementation
sk_logreg = LogisticRegression()
sk_logreg.fit(X_train, y_train)
y_test_pred_sklearn = sk_logreg.predict(X_test)
accuracy_sklearn = accuracy_score(y_test, y_test_pred_sklearn)
print(f"Sklearn Model Accuracy: {accuracy_sklearn:.4f}")

If you trained your model correctly, both results should be close to each other.

We can inspect the values of the parameters from both approaches:

In [None]:
# Raise errors if variable and function from Exercise 1 are not defined
if "w" not in dir():
    raise NameError("Please define a `w` variable containing the solution to the logistic regression model.")

grad_desc_params = [f"{p:.4f}" for p in w.detach().numpy().flatten().tolist()]
print(f"Weights found via gradient descent: {', '.join(grad_desc_params)}")

sklearn_params = sk_logreg.intercept_.tolist() + sk_logreg.coef_[0].tolist()
sklearn_params = [f"{p:.4f}" for p in sklearn_params]
print(f"Weights found via sklearn solution: {', '.join(sklearn_params)}")

### Visualize the decision boundary

Finally, let's plot the decision boundary found by both methods, which for 2-dimensional data with features $x_1$ and $x_2$ is given by the equation:
$$
w_0 + w_1 x_1 + w_2 x_2 = 0.
$$

Solving for $x_2$ (assuming $w_2 \neq 0$), we can plot this line as: 
$$
x_2 = \frac{-(w_0 + w_1 x_1)}{w_2}.
$$
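
As a quick sanity check of the line equation, the sketch below uses hypothetical weights (not the trained ones) to verify that every point on the line has a predicted probability of exactly 0.5, and that moving off the line in the direction of a positive second weight pushes the probability above 0.5.

```python
import numpy as np

# Hypothetical weights (bias and two feature weights), for illustration only
w0, w1, w2 = 0.5, -2.0, 1.5

# Values of the first feature, and the second feature on the boundary line
feat1 = np.linspace(-3, 3, 7)
feat2_boundary = -(w0 + w1 * feat1) / w2

# The score on the line is zero, so the sigmoid returns 0.5
scores_on_line = w0 + w1 * feat1 + w2 * feat2_boundary
probs_on_line = 1.0 / (1.0 + np.exp(-scores_on_line))

# Shift the points by +1 along the second feature: the score becomes w2 > 0
scores_above = w0 + w1 * feat1 + w2 * (feat2_boundary + 1.0)
probs_above = 1.0 / (1.0 + np.exp(-scores_above))

print(probs_on_line)
print(probs_above)
```

This is why the line separates the region assigned to class 0 from the region assigned to class 1 when the threshold is 0.5.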

The code below creates a plot of the probability values in different areas of the space of features, together with the linear decision boundary that separates the regions of assigned class 0 and assigned class 1.

In [None]:
# Visualize decision boundaries
def plot_decision_boundary(X, y, model, title):
    # Create 2D grid of points
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
    
    # Prepare the grid points
    grid = np.c_[xx.ravel(), yy.ravel()]
    
    # Predict probabilities
    if isinstance(model, LogisticRegression):
        # For sklearn use .predict_proba (probability of class 1)
        Z = model.predict_proba(grid)[:, 1]
        # The weights are in .intercept_ and .coef_
        w0 = model.intercept_[0]
        w1, w2 = model.coef_[0]
        
    elif isinstance(model, torch.Tensor):
        # For gradient descent the model is a tensor of weights; use it to predict
        X_design = torch.FloatTensor(assemble_design_matrix(grid))
        Z = logistic_regression(X_design, model)
        Z = Z.detach().numpy()
        # The weights are in model; flatten to get three scalars
        w0, w1, w2 = model.detach().numpy().flatten()
    else: 
        raise ValueError("Please enter either a LogisticRegression object or a torch.Tensor of weights as model.")
    
    # Reshape values
    Z = Z.reshape(xx.shape)
    
    # Plot contours of probability
    plt.contourf(xx, yy, Z, alpha=0.2, cmap=cm.coolwarm)
    
    # Plot boundary line
    y_boundary = -(w0 + w1*xx[0])/w2
    plt.plot(xx[0], y_boundary, "k-.", lw=2)
    
    # Plot data points
    mrks = ["x", "."]
    clrs = ["r", "b"]
    for cl in (0, 1):
        plt.scatter(X[y==cl][:, 0], X[y==cl][:, 1], c=clrs[cl], s=20, marker=mrks[cl], alpha=0.7, label=chosen_labels[cl])

    # Adjust plot limits
    plt.ylim(y_min, y_max)
    plt.title(title)
    plt.legend()
    plt.show()

In [None]:
# Raise errors if variable and function from Exercise 1 are not defined
if "w" not in dir():
    raise NameError("Please define a `w` variable containing the solution to the logistic regression model.")

# Plot decision boundaries for both models
plot_decision_boundary(X_test, y_test, w, "PyTorch Model Decision Boundary")
plot_decision_boundary(X_test, y_test, sk_logreg, "Sklearn Model Decision Boundary")

## <a name="4">4. Logistic regression on higher-dimensional data</a>
(<a href="#0">Go to top</a>)

The example above was for 2-dimensional data. The method works exactly the same if the number of features is larger. 

### Exercise 2

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>It is your turn!</b></p>
        <p><b>Exercise 2. Logistic regression with more features.</b></p>
        <p>Train a logistic regression model on more features. Re-do the PCA on the original Fashion-MNIST dataset to retain more principal components. Run the gradient descent algorithm on this higher-dimensional dataset.</p>
        <p>Does the model achieve better or worse performance? What's a good number of features to use?</p>
        </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid lightcoral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_question.png" alt="MLU solution" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Challenge Help</b></p>
        <p>You can reuse the code above, changing only the number of components of the PCA.</p>
        <p>If you're stuck, remove the <code>#</code> before the <code>load</code> instruction in the next code cell to display a sample solution.</p>
    </span>
</div>

In [None]:
# %load solutions/lab43_ex2_solutions.txt

As a stretch goal for this lab, you can go back to the label selection for the Fashion-MNIST data set and build models to classify any other pair of fashion items among the 10 different classes. 

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 4.3: Logistic regression of Lecture 4: Differential calculus of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>