# Week 6: Model Training

# Tutorial Module

With a basic understanding of logistic regression and its role in predicting categorical outcomes, we can now focus on how to train a logistic regression model effectively.

Training is a critical phase where the model learns the relationship between input features and the target variable. This step is especially important in applications such as predicting genetic traits, assessing the likelihood of medical conditions, or classifying species in ecological studies.

## Learning objectives

By the end of this module, you will be able to:

1. Define the role of the loss function in evaluating model performance during training.
2. Describe how gradient descent works and explain its role in optimizing model parameters.
3. Explain the impact of key hyperparameters, including learning rate and number of epochs.
4. Implement gradient descent using SGDClassifier to train a logistic regression model in Python.
5. Use a validation set to select hyperparameter values and avoid overfitting.
6. Assess model performance on training, validation, and test sets to support model selection and evaluation.

By focusing on these areas, you will build a solid understanding of how to train a logistic regression model and apply it to biological data using Python. Below is last week’s preprocessing code, which brings us to the point where training begins.

**Run the code below.**

In [None]:
# Data science
import pandas as pd
import seaborn as sns
import numpy as np

# Machine learning
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('bc_data.csv', index_col=0)

print(f"Shape of original dataset: {df.shape}")

# Data cleaning
# Encode target feature to binary class and split target/predictor vars
y = df["diagnosis"].map({"B" : 0, "M" : 1})
x = df.drop("diagnosis", axis = 1)

# Drop all "worst" columns
cols = ['radius_worst',
        'texture_worst',
        'perimeter_worst',
        'area_worst',
        'smoothness_worst',
        'compactness_worst',
        'concavity_worst',
        'concave points_worst',
        'symmetry_worst',
        'fractal_dimension_worst']
x = x.drop(cols, axis=1)

# Drop perimeter and area (keep radius)
cols = ['perimeter_mean',
        'perimeter_se',
        'area_mean',
        'area_se']
x = x.drop(cols, axis=1)

### Loss Function

Before discussing how to train the model, we need to define what we are optimizing. This is where the **loss function** comes in. 

The <span style="background-color: #AFEEEE">**loss function**</span> (also called the cost or objective function) measures how far the model’s predictions are from the true values. It assigns a numerical value to the model’s errors—larger values mean worse performance, while smaller values mean better predictions. During training, algorithms like gradient descent use this function to guide updates to the model’s parameters, gradually reducing the loss and improving accuracy. You can think of the loss function as a landscape, and training as a process of descending into the lowest valley where the model performs best.

<img src="loss_func.jpeg" alt="Gradient descent steps towards loss function minimum" style="width: 600px;" class="center"/>

<a href="https://blog.gopenai.com/understanding-of-gradient-descent-intuition-and-implementation-b1f98b3645ea">Source</a>
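To make "a numerical value for the model's errors" concrete, here is a minimal sketch of the log loss (binary cross-entropy), which we assume is the loss selected by `loss='log_loss'` later in this module. The function name `binary_log_loss` and the example probabilities are made up for illustration.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) never occurs
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.2]  # confident and correct
bad_probs = [0.2, 0.7, 0.4, 0.9]   # mostly leaning the wrong way

good_loss = binary_log_loss(labels, good_probs)
bad_loss = binary_log_loss(labels, bad_probs)
print(f"Loss for good predictions: {good_loss:.3f}")
print(f"Loss for bad predictions:  {bad_loss:.3f}")
```

The predictions that agree with the labels receive the smaller loss, which is the signal training tries to push downward.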

### Understanding Gradient Descent

When you first learned linear regression in statistics, you may have seen formulas that give an exact solution. Logistic regression has no such closed-form solution, and in many real-world machine learning problems exact solutions are either too slow to compute for large datasets or not available at all because the models are more complex.

This is where **gradient descent** comes in. Instead of solving equations directly, <span style="background-color: #AFEEEE">**gradient descent**</span> trains a model by gradually improving it step by step. This approach works well even for very large or high‑dimensional datasets, like those with thousands of genetic markers or detailed medical images.

Gradient descent works by repeatedly adjusting the model’s parameters to reduce the **loss function**—a measure of how far the model’s predictions are from the correct answers. Each step has two important parts:

- <span style="background-color: #AFEEEE">**Direction:**</span> The gradient tells us which way the loss increases fastest. To minimize loss, we move in the opposite direction.

- <span style="background-color: #AFEEEE">**Step size:**</span> Each step’s size matters. Too large, and we overshoot; too small, and learning becomes very slow. A well‑chosen step size helps the model reach good performance efficiently.

By iterating through these steps, the model gradually “learns” from the data and finds parameter values that work well.
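The direction and step-size ideas can be sketched in a few lines of plain Python. This is a simplified illustration, not what `SGDClassifier` does internally: it uses the full dataset for every step (batch gradient descent) on a made-up one-feature dataset, and the names `xs`, `ys`, and `step_size` are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(w, b, xs, ys):
    """Average log loss of a one-feature logistic regression model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def gradient(w, b, xs, ys):
    """Gradient of the mean log loss with respect to w and b."""
    grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        error = sigmoid(w * x + b) - y
        grad_w += error * x
        grad_b += error
    n = len(xs)
    return grad_w / n, grad_b / n

# Hypothetical toy data: label 1 tends to go with larger x, with some overlap
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = 0.0, 0.0
step_size = 0.5
losses = [mean_log_loss(w, b, xs, ys)]

for step in range(50):
    grad_w, grad_b = gradient(w, b, xs, ys)
    # Direction: move against the gradient. Step size: scale the move.
    w -= step_size * grad_w
    b -= step_size * grad_b
    losses.append(mean_log_loss(w, b, xs, ys))

print(f"Loss before training: {losses[0]:.3f}")
print(f"Loss after 50 steps:  {losses[-1]:.3f}")
```

Each pass through the loop is one "step downhill": the gradient gives the direction, and `step_size` decides how far to move.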

### Working with Gradient Descent

To make gradient descent work well, we need to decide how big each step should be and how many steps to take. These decisions are controlled by <span style="background-color: #AFEEEE">**hyperparameters**</span>. Unlike the model’s weights (which are learned from the data), hyperparameters are settings we choose before training begins. Every machine learning model has them, and two of the most important are the **learning rate** and the **number of epochs**.

Gradient descent updates the model by repeatedly moving in the direction that reduces the loss. The size of each move is set by the **learning rate**, and the number of passes over the data is set by the number of **epochs**. Ideally, training continues until further steps no longer meaningfully reduce the loss, a state called convergence.

- <span style="background-color: #AFEEEE">**Learning Rate**</span>:
The learning rate controls the size of each step. A small learning rate means tiny, careful steps (slower progress but more stable), while a large learning rate means bigger steps (faster progress but risk of overshooting). Choosing a good learning rate is key to efficient training.

- <span style="background-color: #AFEEEE">**Epochs**</span>:
An epoch is one complete pass through the training data. The number of epochs controls how many times the model sees each training example, and therefore how many weight updates it makes. Too few epochs may stop training before the model has learned enough; too many can waste time or even lead to overfitting. The goal is to choose a number that allows the model to converge effectively.
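The interplay between the number of passes and convergence can be illustrated with a toy example. The sketch below minimizes a simple made-up function, `(w - 3)**2`, and stops once a step improves the loss by less than a small tolerance. Each step stands in for one epoch, and the names `toy_loss`, `toy_grad`, `train`, and `tol` are our own choices for this illustration.

```python
def toy_loss(w):
    return (w - 3.0) ** 2

def toy_grad(w):
    return 2.0 * (w - 3.0)

def train(learning_rate, max_steps, tol=1e-6):
    """Gradient descent that stops early once the loss improves by less than tol."""
    w = 0.0
    previous = toy_loss(w)
    for step in range(1, max_steps + 1):
        w -= learning_rate * toy_grad(w)
        current = toy_loss(w)
        if previous - current < tol:
            return w, step, True   # converged
        previous = current
    return w, max_steps, False     # ran out of steps first

w_short, steps_short, converged_short = train(learning_rate=0.1, max_steps=5)
w_long, steps_long, converged_long = train(learning_rate=0.1, max_steps=200)

print(f"5 steps allowed:   w={w_short:.3f}, converged={converged_short}")
print(f"200 steps allowed: w={w_long:.3f}, steps used={steps_long}, converged={converged_long}")
```

With only 5 steps the run ends before the loss has settled, while the longer budget lets training stop on its own well before the limit.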


In practice, when using a tool like `SGDClassifier()`, you will see parameters such as `alpha` and `max_iter`. These are the hyperparameters you set before training:

- `alpha` is, strictly speaking, the regularization strength. With the default `learning_rate='optimal'` setting it is also used to compute the step size, so in this module we adjust it as our handle on the learning rate. For example, in Week 5 we used it while analyzing the breast cancer dataset.
- `max_iter` sets the maximum number of epochs, that is, how many times the model may pass through the training data.

In this case, we are explicitly using an `alpha` of 0.00001 and at most 5 epochs (`max_iter=5`).

Here are two additional functions to train and evaluate our model; we will use them frequently from now on:
| Function | Input parameters | Output | Syntax |
| --- | --- | --- | --- |
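| .fit() | Feature array and label array (NumPy arrays or pandas objects) | returns the fitted model itself; the learned weights are stored on the model | model.fit(x_train, y_train) |
| .score() | Feature array and label array (NumPy arrays or pandas objects) | returns a performance metric that is specific to the type of model (mean accuracy for classifiers) | model.score(x_test, y_test) |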

Now, we're ready to train our model with gradient descent and evaluate how well it works. 

The first code cell below splits our data set into a training set (85% of the data) and a test set (15% of the data). 

The second code cell below trains a logistic regression model using stochastic gradient descent. It initializes the model with an `alpha` of 0.00001 and limits training to 5 epochs. The model is first trained on the training set, and then its accuracy is evaluated on the test set. Finally, the test accuracy is printed as a percentage.

**Run the two code cells below.**

In [None]:
# Splitting the dataset into a training and a test set
test_ratio = 0.15
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=40)

In [None]:
# Load the model
model = SGDClassifier(loss='log_loss', alpha=1e-5, max_iter=5)

# Fit the model on the training data
model.fit(x_train, y_train)

# Evaluate the model on the testing data
test_predictions = model.predict(x_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

Our current model's accuracy is likely lower than we hoped. This suggests that the hyperparameters need adjusting: the model may benefit from additional training time or a different learning rate. You might also have received a `ConvergenceWarning` from `sklearn` saying that the maximum number of iterations was reached before convergence.

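Before we move on, run the previous cell a few more times. You may see that the accuracy changes from run to run. This is because stochastic gradient descent shuffles the training data before each epoch, so every run visits the examples in a different random order and, after only a few epochs, ends up at a different point on the loss landscape. Setting `random_state` makes the runs repeatable.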

---
**Q*1. In the code snippet below, experiment with different values for `alpha` and `max_iter` until you get an accuracy above 80% on the test set.**

> HINT: Refer to the code from the code cell above.

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Load the model

# Fit the model on the training data

# Evaluate the model on the testing data


### Validation Set

It’s tempting to celebrate high accuracy on the test set, but this can be misleading if it results from <span style="background-color: #AFEEEE">**overfitting**</span> (we’ll explore overfitting and underfitting in Week 7). In this case, overfitting may have happened because we have used the test set to choose hyperparameter values, allowing information from the test set to influence our model performance. As a result, the test set no longer provides an objective measure of our model performance.

To prevent this, we need a different strategy. Instead of relying on the test set for choosing hyperparameter values, we will create a validation set by splitting off a portion of the current training set. The validation set allows us to evaluate and adjust hyperparameter values while keeping the test set completely untouched.

To summarize, we will split our data into three distinct sets. It is important these three sets are kept separate and used for their specific purposes only.
- The <span style="background-color: #AFEEEE">**training set**</span> is used to learn the model’s parameters—that is, the internal values that minimize the loss function.
- The <span style="background-color: #AFEEEE">**validation set**</span> is used to choose hyperparameters and guide model selection.
- The <span style="background-color: #AFEEEE">**test set**</span> remains untouched throughout development and is used only once, at the end, to assess how well the final model generalizes to new, unseen data.

In practice, the training set is usually the largest portion, while the validation and test sets are smaller—often around 10–20% of the data each—depending on the size of the overall dataset.

<img src="data_split.png" alt="Original data vs data when it is split into train, validation, and test sets" style="width: 600px;" class="center"/>

Now, let's try to split our current training data (`x_train` and `y_train`) further into a training set (`x_train` and `y_train`) and a validation set (`x_val` and `y_val`).

**Run the code below.**

In [None]:
# Data splitting
# train (70%), val (15%), test (15%)
train_ratio = 0.7
val_ratio = 0.15

# Splitting the training from the validation set
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=val_ratio/(train_ratio + val_ratio), random_state=40)
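
The fraction passed to `test_size` in the second split is not 0.15, because it is measured against the data that remains after the test set has been removed. The arithmetic can be sketched as follows, assuming a hypothetical dataset of 1000 rows. scikit-learn's own rounding may differ by a row.

```python
train_ratio, val_ratio, test_ratio = 0.70, 0.15, 0.15
n_rows = 1000  # hypothetical dataset size, for illustration only

# First split: hold out the test set from the full data
n_test = round(n_rows * test_ratio)
n_remaining = n_rows - n_test

# Second split: the validation fraction is relative to what remains
second_split_fraction = val_ratio / (train_ratio + val_ratio)
n_val = round(n_remaining * second_split_fraction)
n_train_final = n_remaining - n_val

print(f"Fraction passed to the second split: {second_split_fraction:.4f}")
print(f"train={n_train_final}, val={n_val}, test={n_test}")
```

Taking roughly 17.6% of the remaining 85% gives back 15% of the original data, so the final proportions come out as 70/15/15.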

---
**Q*2. Let's experiment with different numbers of epochs/iterations. Specifically, we will test what happens if we increase the number from 5 epochs to 10, 50, and 100. Below, there are three code cells for you to fill out. Copy the code above and change the `max_iter` parameter. Remember to set `random_state` to 1.**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Max iterations of 10
model = SGDClassifier(loss='log_loss', alpha=1e-5, max_iter=..., random_state=...)

# Fit the model on the training data
model.fit(x_train, y_train)

# Evaluate the model on the validation data
val_score = model.score(x_val, y_val)
print(f"Validation Accuracy: {val_score}")

In [None]:
# Max iterations of 50

# Fit the model on the training data

# Evaluate the model on the validation data


In [None]:
# Max iterations of 100

# Fit the model on the training data

# Evaluate the model on the validation data


---
**Q*3. After making these changes, observe how the validation accuracy has shifted. Did it improve? If so, explain conceptually why you think increasing the number of epochs helped in this case. What oddity do you observe when you compare 50 to 100 epochs? Why might that be?**

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

### The Impact of Varying the Learning Rate

Let’s explore how the learning rate influences model training. This hyperparameter controls the step size taken during gradient descent, and its value has a major impact on convergence. If the learning rate is too low, training may be painfully slow and may not reach the minimum within the allotted epochs. If it’s too high, the model might overshoot the minimum or fail to converge entirely.

A common mistake is underestimating the scale of this parameter. For example, a learning rate of 0.1 is not just slightly larger than 0.01—it’s ten times larger, and that difference can dramatically affect training dynamics.

Tuning the learning rate is a common challenge for beginners. A typical starting point is somewhere between 0.001 and 0.1, depending on the model and dataset. Begin with a moderate value like 0.01, observe training behavior, and adjust gradually. Visualizing the loss over time can help guide this process and reveal whether the learning rate needs to be increased or decreased.

The figure below illustrates what happens when the learning rate is set too high: the model fails to settle into a minimum, even after many epochs.


<img src="lr_effect.png" alt="Choice of learning rate and impact on gradient descent" style="width: 600px;" class="center"/>
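The same behavior can be reproduced on a toy problem. The sketch below is an illustration on the made-up function `w**2`, not on our model, and the three learning rates are chosen only to suit this toy function. It runs 20 gradient descent steps with a rate that is too small, one that is moderate, and one that is too large, recording the loss after every step.

```python
def run_gradient_descent(learning_rate, n_steps=20, start=5.0):
    """Minimize f(w) = w**2 and return the loss after every step."""
    w = start
    history = [w ** 2]
    for _ in range(n_steps):
        w -= learning_rate * 2 * w  # the gradient of w**2 is 2*w
        history.append(w ** 2)
    return history

too_small = run_gradient_descent(0.001)
moderate = run_gradient_descent(0.1)
too_large = run_gradient_descent(1.1)

for name, hist in [("0.001", too_small), ("0.1", moderate), ("1.1", too_large)]:
    print(f"learning rate {name}: start loss = {hist[0]:.4g}, final loss = {hist[-1]:.4g}")
```

The smallest rate barely moves, the moderate one gets close to the minimum, and the largest one overshoots further with every step so the loss grows instead of shrinking. Plotting each `history` list would give loss curves like the ones described above.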

---
**Q*4. In the two cells below, examine what happens when you set the `alpha` value to 1 and 1e-1, respectively. You may keep `max_iter` as 100.**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Alpha value of 1
model = SGDClassifier(loss='log_loss', alpha=..., max_iter=..., random_state=1)

# Fit the model on the training data
model.fit(x_train, y_train)

# Evaluate the model on the validation data
val_score = model.score(x_val, y_val)
print(f"Validation Accuracy: {val_score}")

In [None]:
# Alpha value of 0.1

# Fit the model on the training data

# Evaluate the model on the validation data


---
**Q*5. What are the results that you see when changing the alpha value? If the learning rate is too small, what can you do to make sure that the model converges (besides changing the alpha value)?**

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

### Final Testing

Now that we’ve completed training and hyperparameter tuning, it’s time to evaluate our model on the test set. Use the code below to report the validation accuracy once more and then make predictions on the test data to assess the final model's performance.

---
**Q*6. As we have altered and explored different hyperparameter values for our model, retrain the model below with what you believe are the best hyperparameter values and evaluate your model on the test dataset.**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Load the model with the best hyperparameters from our experiments
model = SGDClassifier(loss='log_loss', alpha=..., max_iter=..., random_state=...)

# Fit the model on the training data

# Evaluate the model on the validation data
val_predictions = ...
val_accuracy = ...
print(f"Validation Accuracy: {val_accuracy}")

# Evaluate the model on the testing data
test_predictions = ...
test_accuracy = ...
print(f"Test Accuracy: {test_accuracy}")

## **Graded Exercises: (7 marks)**

**GQ*1: Read in the heart failure dataset (`hf_data_tut.csv`) (1 mark) and split the predictor variables (features) from the labels (response variable) (1 mark).**

> HINT: Look at your work from Week 3.

<span style="background-color: #FFD700">**Write your code below**</span>

**GQ*2: Train an `SGDClassifier()` model on *all* the data (1 mark). What is the accuracy (1 mark)?**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
...
accuracy = ...
print(f"Accuracy: {accuracy*100:.2f}%")

**GQ*3: Split the dataset into a train and test set and retrain your model (1 mark). What are the train and test accuracies (2 marks)?**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Split dataset

# Retrain model

# Training accuracy
train_predictions = ...
train_accuracy = ...
print(f"Train accuracy: {train_accuracy*100:.2f}%")

# Testing accuracy
test_predictions = ...
test_accuracy = ...
print(f"Test accuracy: {test_accuracy*100:.2f}%")

## Conclusion

In this module, we explored the concept of training in the context of logistic regression.

We introduced <span style="background-color: #AFEEEE">**gradient descent**</span>, a widely used optimization method that adjusts model weights to minimize the loss function—a measure of how well the model’s predictions match the actual labels.

We also discussed the role of <span style="background-color: #AFEEEE">**hyperparameters**</span>—user-defined settings like the learning rate and number of epochs—which influence the training process but are not learned from the data.

To tune these hyperparameters effectively, we introduced the <span style="background-color: #AFEEEE">**validation set**</span>, a portion of the data set aside from training to guide model selection while keeping the test set untouched.

Finally, we examined the challenges of hyperparameter tuning, focusing on how poor choices—especially in learning rate or training duration—can significantly affect model performance.