# Week 5: Introduction to Machine Learning 
# Tutorial Module

This week introduces you to fundamental machine learning concepts. In the pre-module, you were introduced to what machine learning is, and we hope you are motivated to learn more about how it works. Today, we will break down the stages of the ML pipeline so you can get a sense of the inputs and outputs of each stage. We will first explain key parts of data preparation, then give a glimpse of model training and evaluation, which will be covered in weeks 6 and 7, respectively. Understanding the pipeline at a high level will serve as a solid foundation before we delve into the inner workings of each stage.

## Learning Objectives
After completing this module, you will be able to:
1. Describe the stages of a typical machine learning (ML) pipeline from start to finish.
2. Explain how data and information flow between different stages of the pipeline.
3. Identify strategies to handle missing values and remove incorrect or redundant information during data cleaning.
4. Identify and remove highly correlated features from a data set.
5. Split a data set into a training set and a test set. Explain the motivation behind data splitting.
6. Summarize the conceptual process of model training.
7. Summarize the conceptual process of model evaluation.

To get started with hands‑on work in this module, let’s first make sure your environment is ready by importing some essential Python packages and checking their versions.

In [None]:
import sys
print(sys.executable)  # This will show the Python executable path

import pandas as pd
print(pd.__version__)  # This will confirm the version of pandas

The command below installs (or updates) the required Python libraries—pandas, seaborn, numpy, and scikit‑learn—into your current environment. The ! at the beginning allows you to run a terminal command directly from this notebook, and the -q option makes the installation output less verbose.

In [None]:
!pip install -q pandas seaborn numpy scikit-learn

### `scikit-learn`
`Scikit-learn` is a Python library that provides machine learning tools and models. We will be importing this library, along with some familiar libraries that you have already learned about, in the setup cell below. As we go through each topic in the coming weeks, we will explain more about the key functions that this library provides.

**Run the setup cell below** to read in the raw dataset we will be working with, as well as the dirtified dataset that you will use to practice cleaning. 

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

# Machine learning
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

df_dirty = pd.read_csv('bc_data_dirtified.csv', index_col=0)
df_dirty.head()

print(f"Shape of original dirty dataset: {df_dirty.shape}")

df = pd.read_csv('bc_data.csv', index_col=0)
df = df.drop('Unnamed: 32', axis=1)
df.head()
print(f"Shape of original dataset: {df.shape}")

---
### Recap of pipeline 
Recall the pipeline diagram from the Pre-Module: 

<img src="ml_pipeline.png" alt="Drawing" style="width: 800px;"/>

This week, we would like to focus on the input and output of each stage so you can start to understand how it all fits together. 

Let's get started with the first stage, data preparation.

---
### Data preparation

![dp](data_prep_pipeline.png)

Like with all data science and machine learning tasks, we have to start with the dataset. Depending on the context, researchers may conduct data collection themselves. We will be using a public dataset of breast cancer patients, found here: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic.

This raw dataset is the input to the data preparation stage. After some data manipulation, we should have datasets in a format that is usable for the next stage, which is model training.

#### **Data cleaning**
Our results are heavily affected by the quality and quantity of our data. You are already familiar with data loading and cleaning from week 3; for this module, you are presented with "dirtified" data. Apply the methods you learned in week 3 to this dataset. We will also introduce new processing techniques.

Some datasets include columns with no useful information, such as an index column that only repeats row numbers. Dropping these columns keeps the model focused on meaningful features and makes training more efficient. In this dataset, there is a redundant column called `'Unnamed: 32'` that carries no meaningful information, so we will drop it.

The `'diagnosis'` column currently uses the letters `'B'` and `'M'` to indicate benign and malignant cases. Machine learning algorithms work best with numerical values, so we will convert these labels to `0` for benign and `1` for malignant. This change allows the model to interpret the target variable correctly and perform mathematical operations during training.

Finally, we will separate the predictor variables and the target variable (benign/malignant). This separation makes it clear what the model learns from and what it is trying to predict, and it helps prevent accidental data leakage. As we learned in Week 3 for the heart failure dataset, we will do the same here and assign `x = predictor_vars` and `y = target_var`. 

In the code cell below, we will demonstrate only the first step—dropping a redundant column. The other two steps, converting text labels to numerical values and separating the predictor and target variables, will be demonstrated later in the Data Exploration section.

**Run the code below to perform data cleaning**

In [None]:
# Data cleaning
# Remove the 'Unnamed: 32' column
df_dirty = df_dirty.drop('Unnamed: 32', axis=1)

---
**Q*1. Display the dataset. Identify two aspects of the data that need to be cleaned or processed, and briefly explain why each needs attention.**
>Hint: look for missing or invalid values, inconsistent labels, or columns that need to be transformed for modeling.

In [None]:
df_dirty

<span style="background-color: #FFD700">**Write your answer below**</span> 

Answer here:

---

Let us continue preparing our data set for training a machine learning model. Next, we will focus on two main issues we found in our dataset and explain why and how we addressed them.

**Step 1: Handling Missing Values (`NaN`)**

We found some rows with missing values (`NaN`). Many machine learning algorithms cannot handle missing values directly, so we need to deal with them before training. In our case, notice that some rows consist entirely of `NaN` values and provide no useful information. Therefore, we will **remove the rows with `NaN` values** from the dataset.

If a row contains only a few `NaN` values alongside valid data, you might instead consider one of the following approaches:

1. Imputation with statistics: Replace missing values with the mean, median, or mode of that column.
2. Forward or backward fill: In time‑series data, use previous or next valid entries to fill in gaps.
3. Predictive imputation: Build a simple model to estimate missing values based on other features.

Choosing the best method depends on how much data is missing and the importance of the affected feature.
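The first two approaches listed above can be sketched in a few lines of `pandas`. This is only an illustration on a tiny made-up table; the name `toy_missing` and its values are assumptions and are not part of the breast cancer dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing value per column
toy_missing = pd.DataFrame({
    "radius": [10.0, np.nan, 14.0, 12.0],
    "texture": [20.0, 18.0, np.nan, 22.0],
})

# 1. Imputation with statistics: fill each column with its own median
median_filled = toy_missing.fillna(toy_missing.median())

# 2. Forward fill: copy the previous valid entry downward
#    (mostly meaningful when the rows have a natural order, e.g. time series)
forward_filled = toy_missing.ffill()

print(median_filled)
print(forward_filled)
```

Here the missing `radius` becomes 12.0 under median imputation and 10.0 under forward fill, so the choice of strategy changes the data the model will see.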

Let's analyze our data to see which columns contain `NaN` values and how many each one has.

In [None]:
# Count how many NaN values for each column
nan_count = df_dirty.isnull().sum()
nan_count

---
**Q*2. Remove the rows with NaN values. Print the shape of the dataset before and after removal. How many rows were removed?**

>Hint: refer to Week 3

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
# YOUR CODE HERE


<span style="background-color: #FFD700">**Write your answer below**</span> 


Answer here:

---

**Step 2: Removing Incorrect Target Values**

The `'diagnosis'` column is our target variable, and it should only contain the valid classes: `'M'` for malignant and `'B'` for benign. However, we discovered that some rows contained `"WrongValue"` in this column. Keeping incorrect or inconsistent target labels would confuse the model during training and lead to poor performance.

To fix this, we will **remove all rows where `'diagnosis'` is not `'M'` or `'B'`**, ensuring that our target labels are clean and reliable.

---
**Q*3. Remove the rows where the classification column `"diagnosis"` is not equal to `"M"` or `"B"`. Print the shape before and after removal.**

>Hint: You might want to use the function `isin()`. See here for its documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html

<span style="background-color: #FFD700">**Write your code below**</span> 

In [None]:
# YOUR CODE HERE


---
In addition, you might want to consider the following data cleaning steps. 
1. Modify the data to ensure consistency in date formats, capitalization, or spelling.
2. Remove duplicate rows to avoid redundant data. 

We will not show you examples of these steps in this module. Let's move on to Data Exploration - Feature Selection.

---
### Data Exploration - Feature Selection

![de](data_explore_pipeline.png)

The breast cancer dataset we are working with has 32 features, but not all of them may be useful for our analysis. Before training a model, it can be helpful to do a quick check on which features to keep—this process is called **feature selection**. Feature selection is not mandatory, and in many cases we simply use all available features, but it can be valuable for very large datasets because it can significantly reduce training time.

There are many ways to decide which features to keep. In our analysis, we will focus on detecting and handling **multicollinearity** by looking for highly correlated features. 

<span style="background-color: #AFEEEE">**Multicollinearity**</span> occurs when two or more independent variables are strongly correlated with each other, with correlation coefficients close to −1 or 1. If two features (say A and B) are highly correlated, it becomes difficult to tell whether A truly affects the outcome or if it appears important only because it is correlated with B.

To detect multicollinearity, we will use a heatmap—similar to the one from Week 3—to visualize correlations and decide which features to remove.
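Besides reading a heatmap by eye, highly correlated pairs can also be listed programmatically. The sketch below uses made-up data; the names `toy_features`, `feature_a`, `feature_b`, `feature_c`, and the 0.8 threshold are assumptions chosen for illustration, not values prescribed by this module.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
feature_a = rng.normal(size=n)

# Hypothetical features: feature_b is nearly a scaled copy of feature_a,
# feature_c is unrelated to both
toy_features = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": 2 * feature_a + rng.normal(scale=0.1, size=n),
    "feature_c": rng.normal(size=n),
})

toy_corr = toy_features.corr()
threshold = 0.8

# Keep only the entries above the diagonal so each pair appears once
upper_mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper_mask).stack()

# Pairs whose absolute correlation exceeds the threshold
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Only the (`feature_a`, `feature_b`) pair is reported, which suggests that one of the two could be dropped without losing much information.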

In the previous section on Data Preparation, we worked with a dirtified version of the dataset so that you could get hands‑on practice with data cleaning. From this point onward, we will switch to the original dataset to ensure that all students work with the same clean data and obtain consistent results.

Recall that in the previous section we discussed, but did not demonstrate, how to convert variable values to numerical form and how to separate predictor and target variables. We will now carry out these data preparation steps.

**Run the code below.**

In [None]:
# Extract x (features) and y (classes) values
# Instead of the cleaned df_dirty from the previous exercises,
# we will use df, which is the cleaned dataset to achieve consistent results across students

# Encode target feature to binary class and split target/predictor vars
y = df["diagnosis"].map({"B" : 0, "M" : 1})
x = df.drop("diagnosis", axis = 1)

x

Now the data is ready for feature selection. Let's examine the correlation of the features and visualize them in a heatmap. 

**Run the code below.**

In [None]:
# Feature selection
# Correlation matrix
corr = x.corr().round(2)

# Remove upper triangle half, redundant as data is mirrored
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Generate heatmap
heatmap = sns.heatmap(corr, annot=True, mask=mask, cmap='Blues', square=True, linewidths=.5, xticklabels=1, yticklabels=1)

# Resize the plot for better viewing
heatmap.figure.set_figwidth(20)
heatmap.figure.set_figheight(15)

---
**Q*4. Looking at the heatmap above, what are the top 3 features highly correlated with `radius_worst`?**

<span style="background-color: #FFD700">**Write your answer below**</span> 


Answer here:

---

The heatmap shows that all the radius, perimeter, and area features are highly correlated (correlation coefficients > 0.8). This makes sense because they are mathematically related: if we approximate a cancer cell as a circle, then area = pi * radius^2 and perimeter = 2 * pi * radius. As the radius increases, the area and perimeter increase as well. Since these features all capture cell size, we will keep only one: radius.

We also see strong correlations between the “mean” features and the corresponding “worst” features. According to the dataset description*, “The mean, standard error and ‘worst’ (largest mean of the three largest values) of these features were computed for each image.” For example, for radius measurements, 10 values were taken around each cell image (since it is not a perfect circle), then both the mean of all 10 and the mean of the largest three (“worst”) were recorded. Because the “worst” values are derived from the same measurements as the means, they are mathematically related. To reduce redundancy, we will drop all ‘worst’ columns and continue using the mean features in our analysis.

*https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
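To see why the “mean” and “worst” features move together, here is a small simulation that follows the description above: 10 measurements per cell, from which we compute the mean of all 10 and the mean of the three largest. The cell sizes, the noise level, and all variable names are hypothetical; this is a sketch, not a reconstruction of how the real dataset was produced.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 200

# Hypothetical underlying size for each cell
base_size = rng.uniform(5.0, 25.0, size=n_cells)

# 10 noisy measurements per cell (rows = cells, columns = measurements)
measurements = base_size[:, None] + rng.normal(0.0, 1.0, size=(n_cells, 10))

# "mean": average of all 10 measurements per cell
mean_value = measurements.mean(axis=1)

# "worst": average of the three largest measurements per cell
worst_value = np.sort(measurements, axis=1)[:, -3:].mean(axis=1)

corr_mean_worst = np.corrcoef(mean_value, worst_value)[0, 1]
print(f"Correlation between mean and worst: {corr_mean_worst:.3f}")
```

Because both quantities are computed from the same 10 measurements, the correlation comes out very close to 1 in this simulation, which mirrors the redundancy visible in the heatmap.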

In [None]:
# Drop all "worst" columns
cols = ['radius_worst',
        'texture_worst',
        'perimeter_worst',
        'area_worst',
        'smoothness_worst',
        'compactness_worst',
        'concavity_worst',
        'concave points_worst',
        'symmetry_worst',
        'fractal_dimension_worst']
x = x.drop(cols, axis=1)

# Drop perimeter and area (keep radius)
cols = ['perimeter_mean',
        'perimeter_se',
        'area_mean',
        'area_se']
x = x.drop(cols, axis=1)

# Verify remaining columns
x.columns

### Data splitting
To build a reliable machine learning model, we need to evaluate how well it performs on data it has never seen before. Right now, we have a single dataset containing information from 569 patients. Can we use this dataset for both training and testing our model?

Think of training as studying for an exam. During training, the model works through “practice problems,” adjusting and learning from mistakes. Testing, on the other hand, is like sitting the actual exam. On the exam, you face **new problems**—ones that weren’t on your practice tests. This setup ensures that the exam measures true understanding rather than memorization.

If we were to reuse the same questions from the practice set on the exam, a student might score perfectly by memorizing answers without grasping the concepts. The same is true for machine learning: if we test the model on the same data it was trained on, we might overestimate its ability to handle new, unseen inputs.
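The “memorizing the practice questions” effect can be demonstrated with a small experiment. In the sketch below, the labels are pure random noise, so there is nothing real to learn. A decision tree is used only as a convenient stand-in for a model that can memorize its training data; it is not the model used in this module, and all names here are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: random features and random labels (no real pattern)
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

# First half is "seen" during training, second half is never seen
X_seen, y_seen = X_noise[:200], y_noise[:200]
X_unseen, y_unseen = X_noise[200:], y_noise[200:]

memorizer = DecisionTreeClassifier(random_state=0).fit(X_seen, y_seen)

seen_acc = memorizer.score(X_seen, y_seen)        # accuracy on training data
unseen_acc = memorizer.score(X_unseen, y_unseen)  # accuracy on new data

print(f"Accuracy on seen data:   {seen_acc:.2f}")
print(f"Accuracy on unseen data: {unseen_acc:.2f}")
```

The model scores perfectly on the data it was trained on, yet performs at roughly chance level on new data. Evaluating only on the training data would have badly overestimated its ability.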


Since we can’t realistically collect a completely separate dataset, the solution is to **split our existing dataset into two parts**:
- **Training set**: Used to train the model.
- **Test set**: Used to evaluate the model on unseen data.

By doing this, we simulate real‑world conditions where the model encounters new inputs, giving us a much more accurate sense of its predictive performance.

<img src="data_splitting.png" alt="Drawing" style="width: 600px;"/>

You can decide how much data to allocate to each set, but a common split is about 70–80% for training and the rest for testing. The training set is larger so the model can learn as effectively as possible, while the test set must still be large enough to provide a realistic and representative evaluation. There is always a trade‑off between the sizes of the two sets, and there is no single split that works best for every problem.

To split the dataset, we will use the `train_test_split()` function in `scikit-learn`.

| Function | Input Parameters | Output | Syntax |
| --- | --- | --- | --- |
| train_test_split() | x, y, test_size, random_state | x_train, x_test, y_train, y_test | train_test_split(x, y, test_size, random_state) |

Input parameters:
* `x`: a `pandas` `DataFrame` of the feature columns.
* `y`: a `pandas` `Series` of the outcome column.
* `test_size`: a decimal number between 0 and 1; the fraction of the dataset you wish to set as the test set. 
* `random_state`: any integer of your choice; this determines how the dataset is shuffled before splitting. Call the function with the same number if you want to shuffle it the same way (for reproducing the same split datasets every time).
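To make the roles of `test_size` and `random_state` concrete, here is a minimal sketch on a made-up dataset of 10 rows. The names `toy_x`, `toy_y`, and the `a_`/`b_` prefixes are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 10 samples, one feature column, one label column
toy_x = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.Series([0, 1] * 5, name="label")

# Split twice with the same random_state
a_train, a_test, _, _ = train_test_split(toy_x, toy_y, test_size=0.2, random_state=40)
b_train, b_test, _, _ = train_test_split(toy_x, toy_y, test_size=0.2, random_state=40)

print(len(a_train), len(a_test))          # 8 2
print(a_test.index.equals(b_test.index))  # True
```

With `test_size=0.2`, 2 of the 10 rows go to the test set and 8 to the training set. Using the same `random_state` in both calls selects exactly the same rows each time.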

**Run the code below to split into 4 sets of data: `x_train`, `y_train`, `x_test`, and `y_test`.**

In [None]:
# Data splitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)

x.head()

In [None]:
print(f"x_train: {x_train.shape}")
print(f"y_train: {y_train.shape}")

print(f"x_test: {x_test.shape}")
print(f"y_test: {y_test.shape}")

Look at the shapes printed above. These tell us the dimensions of the datasets, in the form (rows, columns). For example, a matrix with 2 rows and 3 columns would have a shape of (2, 3).

The original dataset before cleaning and splitting had a shape of (569, 32), meaning there were 569 samples and 32 features.

**Answer the following questions:**

---
**Q*5. How many features were removed through data cleaning & feature selection?**

<span style="background-color: #FFD700">**Write your answer below**</span> 

Answer here:

---

**Q*6. How many samples are there in the training set?**

<span style="background-color: #FFD700">**Write your answer below**</span> 

Answer here:

---

**Q*7. How many samples are there in the test set?**

<span style="background-color: #FFD700">**Write your answer below**</span> 


Answer here:

---

**Q*8. What is the train-test split ratio?**

<span style="background-color: #FFD700">**Write your answer below**</span> 

Answer here:

---

### Model training

![mt](model_train_pipeline.png)

Next, we can train our model to learn from the dataset.

A **model** in machine learning is an **algorithm** that calculates an output based on input(s) from a dataset. A model has **weights** that affect the calculation of the output, and these weights are updated during training.

<span style="background-color: #AFEEEE">**Model**</span>: an algorithm that calculates an output based on input(s) from a dataset and its current weights.

<span style="background-color: #AFEEEE">**Algorithm**</span>: a set of procedures to compute an output or result. We can describe an algorithm in plain English (take x and multiply it by a, then add b, save this result in y). Then, algorithms can be translated to code that a computer can understand and execute.

<span style="background-color: #AFEEEE">**Weights (Parameters)**</span>: the learnable parameters of a model. Weights are sometimes called **parameters** in general machine learning literature. For example, if the model is the equation y = a*x + b, then the coefficient 'a' can be considered a weight of this model.

In our analysis, we will use **logistic regression**, a model designed for classification tasks (i.e., benign or malignant). The input to this model is the training set, which includes both the features (represented as 'X' in the diagram below) and the corresponding **true labels** (represented as  'Y').

<span style="background-color: #AFEEEE">**True label (Ground truth or Target)**</span>: the actual outcome for a sample in the dataset. This is what the model would predict for a particular sample if it were 100% accurate. In our case, the true labels are benign/malignant for each patient.

<img src="model_train_system.png" alt="Drawing" style="width: 800px;"/>

The type of model training we describe is an **iterative** process. At each iteration, the model makes a prediction for a data point, we compare that prediction to the actual outcome, and we update the model’s weights accordingly. We repeat this across the dataset until we’ve processed the entire training set.

**High-level model training steps:**
1. The model starts with default/random weights. Think of this part as your starting knowledge at the beginning of a semester. 
2. We feed an input 'X' from the training set into the model. The model uses its current weights to produce a prediction 'Y'.
3. We calculate how far the prediction 'Y' is from the true label. 
4. The weights are adjusted to improve future predictions.
5. Steps 2-4 are repeated for each data point until the entire training set has been processed. 

In practice, we often repeat this process multiple times over the entire training set—each full pass through the data is called an **epoch**—to further improve the model’s performance.
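The high-level training steps above can be sketched as a short loop in `numpy`. This is a simplified illustration of logistic regression trained one data point at a time on made-up, one-feature data. It is not the internal implementation of `SGDClassifier`, and the names `toy_X`, `toy_Y`, `weight_a`, `weight_b`, and `learning_rate` are assumptions chosen to match the y = a*x + b example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: one feature, label 1 tends to have larger values
toy_X = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
toy_Y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: start with default weights
weight_a, weight_b = 0.0, 0.0
learning_rate = 0.1

for epoch in range(5):                       # each full pass is one epoch
    for i in rng.permutation(len(toy_X)):    # visit data points in random order
        # Step 2: predict using the current weights
        prediction = sigmoid(weight_a * toy_X[i] + weight_b)
        # Step 3: measure how far the prediction is from the true label
        error = prediction - toy_Y[i]
        # Step 4: adjust the weights to reduce that error
        weight_a -= learning_rate * error * toy_X[i]
        weight_b -= learning_rate * error
    # Step 5: the inner loop has now processed the whole training set

train_preds = (sigmoid(weight_a * toy_X + weight_b) >= 0.5).astype(int)
train_acc = (train_preds == toy_Y).mean()
print(f"a = {weight_a:.2f}, b = {weight_b:.2f}, training accuracy = {train_acc:.2f}")
```

The weights start at zero and are nudged after every data point. After a few epochs, `weight_a` is positive, meaning larger feature values push the prediction toward class 1.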

Here we are using a logistic regression model with stochastic gradient descent. You will learn about this model in week 6.

**Run the code below to train the model.**

In [None]:
# Load the model
model = SGDClassifier(loss='log_loss', alpha=0.0001, max_iter=100, random_state=16)

# Fit the model
model.fit(x_train, y_train)

---
**Q*9: Look up the term "loss" (in the context of machine learning), and describe it in your own words. Which step in the "high-level model training steps" above does this term relate to? These links may be helpful:**
* https://developers.google.com/machine-learning/glossary#l
* https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html

<span style="background-color: #FFD700">**Write your answer below**</span> 

Answer here:

---

**Q*10: Describe what model weights are in your own words.**

<span style="background-color: #FFD700">**Write your answer below**</span> 

Answer here:

---

### Model evaluation

![me](model_eval_pipeline.png)

Finally, we can check how well our model's predictions match the true labels. If the model performs poorly, we may need to make changes, perhaps picking another model or even re-evaluating if the dataset is appropriate; sometimes, there simply isn't enough useful information in a dataset to learn from.

How good a model needs to be depends on the task and where it will be used. For example, models in medical applications often require very high accuracy, while tasks like applying a TikTok filter may prioritize speed over precision.

What makes a model good also depends on context. There’s no universal answer, but some evaluation metrics are more informative than others. Our job is to choose the right metrics for the problem at hand. We will discuss these metrics in detail in Week 7.

To evaluate our model, we:
1. Feed in our test set inputs (prepared earlier) to the trained model.
2. Obtain predictions from the logistic regression model for this data set, a list of binary classifications (0: benign, 1: malignant).
3. Compare the predictions with the true labels in the test set using various metrics. This gives us quantitative measures of model performance to analyze.

<img src="model_eval_system.png" alt="Drawing" style="width: 800px;"/>

One simple measure of performance is **accuracy**. Accuracy is the fraction of correct predictions out of the total number of predictions (i.e., the total number of samples).
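As a quick check of this definition, accuracy can be computed by hand and compared with `accuracy_score` from `scikit-learn`. The labels and predictions below are made up for illustration; `toy_true` and `toy_preds` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for 5 samples
toy_true = np.array([0, 1, 1, 0, 1])
toy_preds = np.array([0, 1, 0, 0, 1])

# Accuracy by hand: number of correct predictions / total number of predictions
manual_acc = (toy_true == toy_preds).sum() / len(toy_true)

# Same calculation using scikit-learn
sklearn_acc = accuracy_score(toy_true, toy_preds)

print(manual_acc, sklearn_acc)  # 0.8 0.8
```

Four of the five predictions match the true labels, so the accuracy is 4/5 = 0.8 either way.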

**Run the code below to get model predictions and calculate accuracy.**

In [None]:
# Prediction
preds = model.predict(x_test)

# Evaluation metric: Accuracy
acc = accuracy_score(y_test, preds)

print("Model: SGDClassifier (logistic regression via log loss)")
print(f"Predictions: {preds}")
print(f"Accuracy: {acc}")

Above, we printed out the accuracy of this model. At first glance, the model seems to be doing well, as the accuracy is fairly high, at 0.877 or 87.7%.

Spoiler: Accuracy may not be the best measure of model performance! We will discuss these metrics in detail in week 7. 

---
**Q*11: Fit an SGD model using `'log_loss'` and 2 other types of loss functions for the SGDClassifier. Plot the accuracy score for each type. Set `alpha=0.01`, `max_iter=1000`. Refer to the [SGDClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).**

>Hint: If you have trouble picking loss functions, we suggest the perceptron and hinge loss functions.

<span style="background-color: #FFD700">**Write your code below**</span> 

In [None]:
import matplotlib.pyplot as plt

# YOUR CODE HERE

loss_functions = [...]  # Use this in a for loop!
acc_list = []

for ...
    # Load the model

    # Fit the model

    # Prediction

    # Evaluation metric: Accuracy


# Plotting the results
plt.figure(figsize=(10, 6))
plt.bar(loss_functions, acc_list)
plt.xlabel('Loss Functions')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. Loss Function in SGDClassifier')
plt.ylim([0, 1])  # Set y-axis limit for accuracy scale (0 to 1)
plt.xticks(rotation=45)
plt.show()

## **Graded exercises (5 marks)**

**GQ*1. (1 mark) What are the 3 steps within data preparation? Provide a brief description of each.**

<span style="background-color: #FFD700">**Write your answer below**</span> 

Answer here:

---

**GQ*2. (2 marks) What are true labels used for during model training?**

<span style="background-color: #FFD700">**Write your answer below**</span> 


Answer here:

---

**GQ*3. (2 marks) In data preparation, we split the data to create a training and test set. What is the test set used for, and why do we not use a training set for this purpose?**

<span style="background-color: #FFD700">**Write your answer below**</span> 


Answer here:

---

## Conclusion

In this module, we expanded on this simplified pipeline...

<img src="ml_pipeline.png" alt="Drawing" style="width: 450px;"/>

... into a slightly more detailed view, with an explanation of what will be produced at each stage and what is required to produce it:

<img src="data_prep_system.png" alt="Drawing" style="width: 650px;"/>
<img src="model_train_system.png" alt="Drawing" style="width: 650px;"/>
<img src="model_eval_system.png" alt="Drawing" style="width: 650px;"/>


This is getting complex and exciting! Next week, we will delve into the model and training stage and highlight how you can use `sklearn` to do this analysis.
