# Week 6: Model Training

# Tutorial Module

With a basic understanding of logistic regression and its role in predicting categorical outcomes, we can now focus on how to train a logistic regression model effectively.

Training is a critical phase where the model learns the relationship between input features and the target variable. This step is especially important in applications such as predicting genetic traits, assessing the likelihood of medical conditions, or classifying species in ecological studies.

## Learning objectives

By the end of this module, you will be able to:

1. Explain the role of the loss function in guiding model training.
2. Apply gradient descent to optimize logistic regression parameters.
3. Use a validation set to monitor performance and prevent overfitting.
4. Perform hyperparameter tuning to improve model accuracy.

By focusing on these areas, you will build a solid understanding of how to train a logistic regression model and apply it to biological data using Python. Below is last week’s preprocessing code, which brings us to the point where training begins.

In [None]:
# Data science
import pandas as pd
import seaborn as sns
import numpy as np

# Machine learning
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('bc_data.csv', index_col=0)

print(f"Shape of original dataset: {df.shape}")

# Data cleaning
# Encode target feature to binary class and split target/predictor vars
y = df["diagnosis"].map({"B" : 0, "M" : 1})
x = df.drop("diagnosis", axis = 1)

# Drop all "worst" columns
cols = ['radius_worst',
        'texture_worst',
        'perimeter_worst',
        'area_worst',
        'smoothness_worst',
        'compactness_worst',
        'concavity_worst',
        'concave points_worst',
        'symmetry_worst',
        'fractal_dimension_worst']
x = x.drop(cols, axis=1)

# Drop perimeter and area (keep radius)
cols = ['perimeter_mean',
        'perimeter_se',
        'area_mean',
        'area_se']
x = x.drop(cols, axis=1)

### Loss Function

Before discussing how to train the model, we need to define what we are optimizing. This is where the **loss function** comes in. 

The <span style="background-color: #AFEEEE">**loss function**</span> (also called the cost or objective function) measures how far the model’s predictions are from the true values. It assigns a numerical value to the model’s errors—larger values mean worse performance, while smaller values mean better predictions. During training, algorithms like gradient descent use this function to guide updates to the model’s parameters, steadily reducing the loss and improving accuracy. You can think of the loss function as a landscape, and training as a process of descending into the lowest valley where the model performs best.

<img src="loss_func.jpeg" alt="Gradient descent steps towards loss function minimum" style="width: 600px;" class="center"/>

<a href="https://blog.gopenai.com/understanding-of-gradient-descent-intuition-and-implementation-b1f98b3645ea">Source</a>
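
To make the idea of a loss value concrete, here is a minimal sketch of the log loss (binary cross-entropy), which is the loss selected by `loss='log_loss'` later in this module. The function name `mean_log_loss` and the example probabilities are illustrative assumptions, not values from the tutorial's dataset.

```python
import math

def mean_log_loss(y_true, y_prob, eps=1e-15):
    """Average log loss for binary labels (0/1) and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

y_true = [0, 1, 1, 0]             # made-up true labels
confident_right = [0.1, 0.9, 0.8, 0.2]  # probabilities close to the truth
confident_wrong = [0.9, 0.1, 0.3, 0.7]  # probabilities far from the truth

good_loss = mean_log_loss(y_true, confident_right)
bad_loss = mean_log_loss(y_true, confident_wrong)
print(f"good predictions: {good_loss:.3f}")  # 0.164
print(f"bad predictions:  {bad_loss:.3f}")   # 1.753
```

The better the predicted probabilities match the labels, the smaller the loss, which is exactly the quantity training tries to push down.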

### Understanding Gradient Descent

When you first learned regression in statistics, you may have seen formulas that give an exact solution, such as the least-squares formula for linear regression. Logistic regression has no such closed-form solution, and in many real-world machine learning problems exact solutions are either too slow to compute for large datasets or not available at all because the models are more complex.

This is where **gradient descent** comes in. Instead of solving equations directly, <span style="background-color: #AFEEEE">**gradient descent**</span> trains a model by gradually improving it step by step. This approach works well even for very large or high‑dimensional datasets, like those with thousands of genetic markers or detailed medical images.

Gradient descent works by repeatedly adjusting the model’s parameters to reduce the **loss function**—a measure of how far the model’s predictions are from the correct answers. Each step has two important parts:

- <span style="background-color: #AFEEEE">**Direction:**</span> The gradient tells us which way the loss increases fastest. To minimize loss, we move in the opposite direction.

- <span style="background-color: #AFEEEE">**Step size:**</span> Each step’s size matters. Too large, and we overshoot; too small, and learning becomes very slow. A well‑chosen step size helps the model reach good performance efficiently.

By iterating through these steps, the model gradually “learns” from the data and finds parameter values that work well.
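
The two ingredients above, direction and step size, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: a single made-up feature, six made-up data points, and a hand-picked `learning_rate` of 0.5. It is not how `sklearn` implements training internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(w, b, xs, ys):
    """Average log loss of the model p = sigmoid(w*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def gradient(w, b, xs, ys):
    """Gradient of the average log loss with respect to w and b."""
    n = len(xs)
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
    return dw, db

# Made-up, deliberately non-separable toy data
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = 0.0, 0.0
learning_rate = 0.5  # step size (assumed value for this sketch)
losses = [mean_log_loss(w, b, xs, ys)]

for step in range(50):
    dw, db = gradient(w, b, xs, ys)  # direction of steepest increase
    w -= learning_rate * dw          # move the opposite way
    b -= learning_rate * db
    losses.append(mean_log_loss(w, b, xs, ys))

print(f"loss before training: {losses[0]:.3f}")
print(f"loss after 50 steps:  {losses[-1]:.3f}")
```

Each pass through the loop computes the gradient, steps against it, and records the new loss, so `losses` traces the descent into the valley.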

### Working with Gradient Descent

To make gradient descent work well, we need to decide how big each step should be and how many steps to take. These decisions are controlled by <span style="background-color: #AFEEEE">**hyperparameters**</span>. Unlike the model’s weights (which are learned from the data), hyperparameters are settings we choose before training begins. Every machine learning model has them, and two of the most important are the **learning rate** and the **number of epochs**.

Gradient descent updates the model by repeatedly moving in the direction that reduces the loss. The size of each move is set by the **learning rate**, and the number of moves depends on how many **epochs** we run. Training can stop once further steps no longer meaningfully reduce the loss; at that point we say the model <span style="background-color: #AFEEEE">**converges**</span>.

<span style="background-color: #AFEEEE">**Learning rate**</span>
The learning rate controls the size of each step. A small learning rate means tiny, careful steps (slower progress but more stable), while a large learning rate means bigger steps (faster progress but risk of overshooting). Choosing a good learning rate is key to efficient training.

<span style="background-color: #AFEEEE">**Epochs**</span>
An epoch is one complete pass through the training data. The number of epochs controls how many times the model updates its weights. Too few epochs may stop training before the model has learned enough; too many can waste time or even lead to overfitting. The goal is to choose a number that allows the model to converge effectively.
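
As a rough illustration of what an epoch means, the sketch below runs stochastic gradient descent by hand: each epoch shuffles the data and then updates the weights once per sample. The helper `train_sgd`, the toy data, and the learning rate of 0.1 are assumptions made for this example only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(w, b, data):
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train_sgd(data, epochs, learning_rate, seed=0):
    """One epoch = one shuffled pass over every sample, one update per sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n_updates = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)  # visit the samples in a new order each epoch
        for x, y in order:
            error = sigmoid(w * x + b) - y
            w -= learning_rate * error * x
            b -= learning_rate * error
            n_updates += 1
    return w, b, n_updates

# Made-up toy data: (feature, label) pairs
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

results = {}
for epochs in (1, 50):
    w, b, n_updates = train_sgd(data, epochs=epochs, learning_rate=0.1)
    results[epochs] = (mean_log_loss(w, b, data), n_updates)
    print(f"{epochs:>2} epoch(s): {n_updates} updates, loss = {results[epochs][0]:.3f}")
```

With six samples, 1 epoch gives 6 weight updates and 50 epochs give 300, which is why too few epochs can stop training before the loss has settled.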


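In practice, when using a tool like `SGDClassifier()`, you will see parameters such as `alpha` and `max_iter`. These are the hyperparameters you set before training:

- `alpha` is the knob we use for the learning rate in this tutorial. Strictly speaking, `alpha` is the regularization strength in `SGDClassifier`, but under the default `learning_rate='optimal'` schedule it is also used to compute the step size, so changing it changes how the model moves across the loss landscape. (To fix the step size directly, `SGDClassifier` also offers `learning_rate='constant'` together with `eta0`.) In Week 5 we used `alpha` while analyzing the breast cancer dataset.
- `max_iter` sets the maximum number of epochs, that is, how many times the model may pass through the training data.

In this case, we are explicitly using an `alpha` of 0.00001 and at most 5 epochs (`max_iter=5`).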

| Function | Input parameters | Output | Syntax |
| --- | --- | --- | --- |
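| .fit() | Array-like features and labels (e.g. NumPy arrays or pandas objects) | trains the model in place and returns the fitted model itself | model.fit(x_train, y_train) |
| .score() | Array-like features and labels (e.g. NumPy arrays or pandas objects) | returns a performance metric that is specific to the type of model (mean accuracy for classifiers) | model.score(x_test, y_test) |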

In [None]:
# Splitting the dataset into a training and a test dataset
test_ratio = 0.15
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=40)

In [None]:
# Load the model
model = SGDClassifier(loss='log_loss', alpha=1e-5, max_iter=5)

# Fit the model on the training data
model.fit(x_train, y_train)

# Evaluate the model on the testing data
test_predictions = model.predict(x_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

It seems our current model's performance is not as high as we hoped. This could be a sign that the model's hyperparameters need adjusting: here, the model may benefit from additional training time or a different learning rate. You may also have received a `ConvergenceWarning` from `sklearn` saying that the maximum number of iterations was reached before the model converged.

Before we move on, run the previous cell a few more times. You may see that the accuracy seems to change at random. This is because `SGDClassifier` shuffles the training data during training and we have not fixed `random_state`, so every run visits the samples in a different order. With only 5 epochs the model has not converged, so each run stops at a different point on the loss landscape.

---
**Q*1. In the code snippet below, experiment with different values for `alpha` and `max_iter` until you get an accuracy above 80% on the test set.**

> HINT: Refer to the code from the code cell above.

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Load the model

# Fit the model on the training data

# Evaluate the model on the testing data


---
It’s tempting to feel accomplished when your model shows high accuracy on the test set, but this can sometimes signal <span style="background-color: #AFEEEE">**overfitting**</span> (we’ll explore overfitting and underfitting in Week 7). In this scenario, overfitting happens because we’ve been using the test set to choose hyperparameter values, allowing information from the test data to influence our model. As a result, the test set is no longer truly unseen and cannot serve as an objective measure of performance.

To avoid this, we need a different strategy. Instead of relying on the test set for tuning hyperparameters, we reserve part of the training data as a <span style="background-color: #AFEEEE">**validation set**</span>. The validation set acts like a stand‑in test set, letting us evaluate and adjust hyperparameters while keeping the true test set untouched. This approach leads to more reliable tuning and a model that generalizes better.

### Validation Set

To validate our trained model, we need to set aside some data points from the training set. We cannot take them from the test set, because the test set is meant to be a completely independent subset used only to evaluate the final performance of the model (it also contains only a small number of samples from the original dataset). The test set should mimic new, unseen data that the model may encounter in a real-world scenario, so it should not influence model training or tuning in any way.

The main function of a validation set is to offer a fair assessment of a trained model while we adjust its hyperparameters. We therefore carve out a further subset of the training set to create this validation set.

In [None]:
# Data splitting
# train (70%), val (15%), test (15%)
train_ratio = 0.7
val_ratio = 0.15

# Splitting the training from the validation set
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=val_ratio/(train_ratio + val_ratio), random_state=40)

It is important that the validation set is kept separate from the training set and is not used in the training process (you won't get a good grasp of your knowledge if you view the mock exam beforehand). This separation ensures that the validation performance is a good indicator of how well the model generalizes to new, unseen data. The test set is then used as the final performance metric, as it remains untouched and completely independent throughout the model development process.

<img src="data_split.png" alt="Original data vs data when it is split into train, validation, and test sets" style="width: 600px;" class="center"/>
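
The arithmetic behind a 70/15/15 split can be checked with a small standalone sketch. The helper `three_way_split` and the 1000-row dataset are hypothetical; in the tutorial itself the split is done with `train_test_split`.

```python
import random

def three_way_split(items, train_ratio=0.7, val_ratio=0.15, seed=40):
    """Shuffle items and cut them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

records = range(1000)  # hypothetical dataset of 1000 row indices
train, val, test = three_way_split(records)
print(len(train), len(val), len(test))  # 700 150 150

# Fraction to request for validation once the test set is already removed:
# 15% of the original data is 0.15 / 0.85 of the remaining 85%.
relative_val = 0.15 / (0.7 + 0.15)
print(f"{relative_val:.4f}")  # 0.1765
```

This is why the second call to `train_test_split` uses `val_ratio/(train_ratio + val_ratio)` rather than `val_ratio` on its own.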


Once these three sets are made, we can now train our model over a series of iterations until we are confident it may perform well on our test set (i.e., new unseen data).

---
**Q*2. Let's experiment with different numbers of epochs/iterations. Specifically, we will test what happens if we increase the number from 5 epochs to 10, 50, and 100. Below, there are three code cells for you to fill out. Copy the code above and change the `max_iter` parameter. Remember to set `random_state` to 1.**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Max iterations of 10
model = SGDClassifier(loss='log_loss', alpha=1e-5, max_iter=..., random_state=...)

# Fit the model on the training data
model.fit(x_train, y_train)

# Evaluate the model on the validation data
val_score = model.score(x_val, y_val)
print(f"Validation Accuracy: {val_score}")

In [None]:
# Max iterations of 50

# Fit the model on the training data

# Evaluate the model on the validation data


In [None]:
# Max iterations of 100

# Fit the model on the training data

# Evaluate the model on the validation data


---
**Q*3. After making these changes, observe how the validation accuracy has shifted. Did it improve? If so, explain conceptually why you think increasing the number of epochs helped in this case. What oddity do you observe when you compare 50 to 100 epochs? Why might that be?**

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

Let's now examine the impact of varying the learning rate. By experimenting with very high and very low values, we can observe how taking steps that are too large or too small affects training.

Although the point of machine learning is to have a model "learn" the patterns in our dataset for us, there are still many parts of the process under our control; hence, there is always room for human error and for us to improve. An issue that frustrates most beginners is hyperparameter tuning.


### Misunderstanding the Learning Rate

<span style="background-color: #AFEEEE">- **Scale and Impact**</span>: A common mistake is underestimating how much the scale of the learning rate matters. For instance, a learning rate of 0.1 is not just slightly larger than 0.01; it is ten times larger. This difference can lead to very different training behavior. A learning rate that is too high might cause the model to overshoot the minimum, while one that is too low might converge so slowly that training ends before the minimum is reached.

<span style="background-color: #AFEEEE">- **Finding the Right Balance**</span>: Beginners should start with a moderate learning rate and adjust it based on performance. Plotting the loss over training iterations can help when tuning this parameter. Below is a graphical example of gradient descent, showing what the process looks like when the learning rate is set too high for a fixed number of epochs.

<img src="lr_effect.png" alt="Choice of learning rate and impact on gradient descent" style="width: 600px;" class="center"/>

If you have identified that your model is underfitting due to a low learning rate, be careful not to set it too high. To find the "perfect" value, be patient and make sensible small adjustments.
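
To see overshooting in numbers, the sketch below runs gradient descent on a toy loss, f(w) = w², whose minimum is at w = 0. The function `descend`, the starting point, and the three learning rates are assumptions chosen for illustration; they are unrelated to the `alpha` values used in the exercises.

```python
def descend(learning_rate, start=5.0, steps=20):
    """Run gradient descent on the toy loss f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient
    return w

# Too small, moderate, and too large step sizes
final = {lr: descend(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in final.items():
    print(f"learning rate {lr}: w after 20 steps = {w:.4f}")
```

With the smallest rate, `w` has barely moved toward 0 after 20 steps; with the moderate rate it is close to 0; with the largest rate every step jumps past the minimum and lands farther away than before, so `w` grows instead of shrinking.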

---
**Q*4. In the two cells below, examine what happens when you set the `alpha` value to 1 and 1e-1, respectively. You may keep `max_iter` at 100.**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Alpha value of 1
model = SGDClassifier(loss='log_loss', alpha=..., max_iter=..., random_state=1)

# Fit the model on the training data
model.fit(x_train, y_train)

# Evaluate the model on the validation data
val_score = model.score(x_val, y_val)
print(f"Validation Accuracy: {val_score}")

In [None]:
# Alpha value of 0.1

# Fit the model on the training data

# Evaluate the model on the validation data


---
**Q*5. What are the results that you see when changing the alpha value? If the learning rate is too small, what can you do to make sure that the model converges (besides changing the alpha value)?**

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

### Final Testing

Now that we have reviewed the complete process of training our model and tuning hyperparameters, it's finally time for us to evaluate our models on the test dataset. Below, predict on the test set (not the validation set) and see how it performs.

---
**Q*6. As we have altered and explored different hyperparameter values for our model, retrain the model below with what you believe are the best hyperparameter values and evaluate your model on the test dataset.**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Load the model with the best hyperparameters from our experiments
model = SGDClassifier(loss='log_loss', alpha=..., max_iter=..., random_state=...)

# Fit the model on the training data

# Evaluate the model on the validation data
val_predictions = ...
val_accuracy = ...
print(f"Validation Accuracy: {val_accuracy}")

# Evaluate the model on the testing data
test_predictions = ...
test_accuracy = ...
print(f"Test Accuracy: {test_accuracy}")

## **Graded Exercises: (7 marks)**

**GQ*1: Read in the heart failure dataset (`hf_data_tut.csv`) (1 mark) and split the predictor variables (features) from the labels (response variable) (1 mark).**

> HINT: Look at your work from Week 3.

<span style="background-color: #FFD700">**Write your code below**</span>

**GQ*2: Train an `SGDClassifier()` model on *all* the data (1 mark). What is the accuracy (1 mark)?**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
...
accuracy = ...
print(f"Accuracy: {accuracy*100:.2f}%")

**GQ*3: Split the dataset into a train and test set and retrain your model (1 mark). What are the train and test accuracies (2 marks)?**

<span style="background-color: #FFD700">**Write your code below**</span>

In [None]:
# Split dataset

# Retrain model

# Training accuracy
train_predictions = ...
train_accuracy = ...
print(f"Train accuracy: {train_accuracy*100:.2f}%")

# Testing accuracy
test_predictions = ...
test_accuracy = ...
print(f"Test accuracy: {test_accuracy*100:.2f}%")

## Conclusion

In this module, we explored the concept of "training" as it applies to logistic regression models.

We looked at gradient descent, a common technique used by many machine learning models, including logistic regression, to minimize their loss function. The loss function is a mathematical measure of how far the model's predictions are from the true values.

Additionally, we discussed hyperparameters, which are variables set by us, the users, that influence the training process. This contrasts with the weights determined by the model during training.

We also examined the role of a validation set. This set comprises data separated from the training dataset, used to refine hyperparameters effectively without relying on the test set, thereby avoiding the risk of overfitting.

Lastly, we addressed the challenges associated with tuning hyperparameters, particularly the impact of choosing suboptimal learning rates and epoch numbers on the model's performance.