<a href="https://colab.research.google.com/github/CompOmics/lsabd-machine-learning-tutorials/blob/main/notebooks/1-data-preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing for machine learning

In this notebook, we will explore the basics of data preprocessing for machine learning using the breast cancer dataset from scikit-learn. We will cover loading the dataset, understanding its structure, handling missing values, and scaling features. 

Note that throughout the notebook, there are links to the relevant documentation for the functions and classes used. Click on the links and explore the documentation to deepen your understanding.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Set the style for our visualizations
sns.set_style("whitegrid")
sns.set_palette("deep")

%matplotlib inline

## Loading a dataset

The breast cancer dataset is a classic dataset in machine learning. It contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of the cell nuclei present in the image.

Learn more about the dataset in the [original paper](https://doi.org/10.1016/0304-3835(94)90099-X) and on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

Normally, we load datasets from a file using functions like [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). However, this dataset is used in tutorials so often that it has been included in the [`sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) library. Let's load it and take a look at its structure.


In [None]:
cancer_dataset = datasets.load_breast_cancer(as_frame=True)
cancer_dataset.keys()

When loaded, the dataset is represented as a dictionary-like object with several attributes. The main attributes are:
- `data`: A 2D array where each row represents an instance (a breast mass) and each column represents a feature.
- `target`: A 1D array containing the target variable, indicating whether the cancer is malignant (0) or benign (1).
- `feature_names`: An array of strings representing the names of the features.
- `target_names`: An array of strings representing the names of the target classes.

Note that we used the `as_frame=True` parameter when loading the dataset. The feature table is returned as a pandas [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#dataframe), which provides convenient methods for data manipulation and analysis. The target variable is also returned as a pandas Series, which can be regarded as a single column.
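
As a minimal sketch of how the label encoding can be checked, the snippet below pairs each integer label with its entry in `target_names`. The variable name `label_to_name` is only illustrative and not part of the notebook.

```python
from sklearn import datasets

# Load the dataset as pandas objects (same call as in the notebook)
cancer_dataset = datasets.load_breast_cancer(as_frame=True)

# target_names is ordered by label: position 0 names label 0, position 1 names label 1
label_to_name = {
    label: str(name) for label, name in enumerate(cancer_dataset.target_names)
}
print(label_to_name)

# With as_frame=True, data is a DataFrame and target is a Series
print(type(cancer_dataset.data).__name__, type(cancer_dataset.target).__name__)
```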

The `data` attribute allows us to access the feature values:

In [None]:
features = cancer_dataset.data
features

And the `target` attribute provides the labels for each instance:

In [None]:
targets = cancer_dataset.target
targets.head()

*Question: How many samples and features are in the breast cancer dataset?*

Write your answer here:

...

*Bonus Python refresher question: What is the difference between accessing a name with parentheses (e.g., `features.head()`) and without parentheses (e.g., `cancer_dataset.target`)?*

Write your answer here:

...

## Exploring the dataset

### Exploring the features

First and foremost, it's important to understand the structure and characteristics of the features. This includes checking for missing values, understanding the data types (numerical, categorical), and getting a sense of the distribution of the features.

The Pandas [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) property allows us to check the data types of each feature in the DataFrame:

In [None]:
features.dtypes

This dataset contains only numerical features (float values), so we do not need to handle categorical variables. If it did contain categorical variables, we could use techniques such as one-hot encoding or label encoding to convert them into a numerical format. You can learn more about these techniques in the [Encoding categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features) section of the scikit-learn User Guide.

The Pandas `isnull()` method can be used to check for missing values in the dataset:

In [None]:
features.isnull().sum()

We are lucky again: this dataset does not contain any missing values. If it did, scikit-learn provides several options for handling them, as described in the [Handling missing values](https://scikit-learn.org/stable/modules/impute.html) section of the User Guide.
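
As a rough sketch of what imputation could look like, the example below fills gaps in a hypothetical toy table with the column mean using `SimpleImputer`. The column names and values are invented for illustration, and mean imputation is just one of several possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy table with one missing value per column
toy = pd.DataFrame(
    {"radius": [10.0, np.nan, 14.0], "area": [300.0, 500.0, np.nan]}
)
print(toy.isnull().sum())

# Replace each missing value with the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

In a real workflow, the imputer would be fitted on the training set only, for the same reason as the scalers discussed later in this notebook.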

We can use pandas methods such as `describe()` to get an overview of the numerical features. Adding `.T` at the end transposes the output for better readability.

In [None]:
features.describe().T

This summary provides us with some basic statistics about each feature, such as the mean, standard deviation, minimum, and maximum values. However, it is a bit hard to read. Alternatively, we can visualize the distribution of each feature using box plots. For this, we'll use the Seaborn [`boxplot`](https://seaborn.pydata.org/generated/seaborn.boxplot.html) function.

*Detail:* Because the boxplot function expects the data in a long format, we first use the pandas [`melt`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html) method to reshape the DataFrame. You can read more about long vs wide data formats in the [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#melt-and-wide-to-long).
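
To make the reshaping concrete, here is a small sketch of `melt` on a hypothetical two-column table. The values are made up; only the column names mirror the dataset.

```python
import pandas as pd

# Wide format: one column per feature, one row per sample
wide = pd.DataFrame({"mean radius": [14.1, 20.6], "mean texture": [19.3, 17.8]})

# Long format: one row per (feature, value) pair,
# stored in the default columns "variable" and "value"
long = wide.melt()
print(long)
```

The default column names `variable` and `value` are the ones passed to `x` and `y` in the box plot call.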

In [None]:
sns.boxplot(data=features.melt(), x="value", y="variable")

*Question: What is immediately obvious from the box plots about the features in the dataset? Are there any features that stand out from the others?*

Write your answer here:

...

These general distribution plots do not show the relationship between the features and the target variable. To explore this, we can create box plots for each feature, grouped by the target variable (malignant vs benign). This will help us understand how the features differ between the two classes.

For instance, let's take a look at the `mean radius` feature:

In [None]:
mean_radius_df = pd.DataFrame(
    {"mean radius": features["mean radius"], "target": targets}
)

sns.boxplot(data=mean_radius_df, x="mean radius", hue="target")

Alternatively, we can get a more detailed view of the distribution using histograms.

*Task: Try it out with the Seaborn [`histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) function.*

In [None]:
# Use the histplot function to visualize the feature distribution


*Question: What do you think about the distribution of the `mean radius` feature for malignant vs benign tumors? Would this be a useful feature for classification?*


Write your answer here:

...

*Task: Repeat the above analysis (box plots and histograms) for at least two other features of your choice from the dataset. Summarize your findings.*

Write your findings here:

...

### Exploring the target variable

The target variable indicates whether the breast mass is malignant or benign. We can use the Pandas [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method to see the distribution of the target classes:

In [None]:
targets.value_counts()

We can visualize this distribution using a bar plot with the Seaborn [`countplot`](https://seaborn.pydata.org/generated/seaborn.countplot.html) function:

In [None]:
sns.countplot(y=targets, orient="h")

*Question: What do you observe about the class distribution? Is the dataset balanced or imbalanced? How might this impact model training and evaluation?*

Write your answer here:

...

## Data splitting

In machine learning, it is crucial to evaluate the performance of our models on unseen data. To achieve this, we typically split our dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance. In real-world scenarios, we might also use a validation set for hyperparameter tuning. For now, we will focus on a simple train-test split.

This can be done easily using the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from the [`sklearn.model_selection`](https://scikit-learn.org/stable/api/sklearn.model_selection.html) module. By default, it splits the data into 75% training and 25% testing sets, but we can adjust this ratio using the `test_size` parameter. The function allows us to pass multiple arrays (e.g., features and target) and splits them consistently; i.e., the corresponding rows in the features and target arrays remain aligned after the split.

Note that some datasets contain both features and targets within a single table. In such cases, it is crucial to separate them before proceeding with your analysis.
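
A minimal sketch of that separation, using a hypothetical toy table in which the label column happens to be called `target` (all names and values here are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table that stores features and labels together
table = pd.DataFrame(
    {
        "feature_a": range(8),
        "feature_b": [x * 0.5 for x in range(8)],
        "target": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)

# Separate the label column from the feature columns
toy_features = table.drop(columns="target")
toy_targets = table["target"]

# Without test_size, 25% of the rows go to the test set
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_features, toy_targets, random_state=0
)
print(toy_train_X.shape, toy_test_X.shape)

# Features and targets are split consistently: their row indices still match
print((toy_train_X.index == toy_train_y.index).all())
```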

In [None]:
train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets, test_size=0.3, random_state=42
)

print(f"Training set shape: {train_features.shape}")
print(f"Testing set shape: {test_features.shape}")

# Label 1 is the benign class in this dataset
print(
    f"Percent positives in training set: {train_targets.value_counts(normalize=True)[1]:.1%}"
)
print(
    f"Percent positives in testing set: {test_targets.value_counts(normalize=True)[1]:.1%}"
)

*Question: What does the `random_state` parameter do in the `train_test_split` function? Why is it important to set it?*

Write your answer here:

...

To account for class imbalance during the split, we can use the `stratify` parameter. By setting it to the target variable, we ensure that the proportion of classes in both the training and testing sets reflects that of the original dataset.

In [None]:
train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets, test_size=0.3, random_state=42, stratify=targets
)

print(f"Training set shape: {train_features.shape}")
print(f"Testing set shape: {test_features.shape}")

# Label 1 is the benign class in this dataset
print(
    f"Percent positives in training set: {train_targets.value_counts(normalize=True)[1]:.1%}"
)
print(
    f"Percent positives in testing set: {test_targets.value_counts(normalize=True)[1]:.1%}"
)

*Question: After performing the stratified split, do you see a difference in the percentage of positive classes compared to the initial split? Why is this the case? When would it not be the case?*

Write your answer here:

...

## Feature scaling

Many machine learning algorithms are sensitive to the scale of the input features. Features with larger scales can dominate the learning process, leading to suboptimal performance. To address this, we often apply feature scaling techniques to standardize or normalize the features.

As we saw earlier, the features in the breast cancer dataset have varying scales. For instance, the `mean radius` feature ranges from approximately 7 to 28, while the `worst area` feature ranges from about 185 to 4250. To ensure that all features contribute equally to the learning process, we can apply feature scaling.
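
The ranges of individual features can be read directly from the data. The sketch below reloads the dataset so that it runs on its own; the variable name `ranges` is only illustrative.

```python
from sklearn import datasets

features = datasets.load_breast_cancer(as_frame=True).data

# Minimum and maximum of two features that live on very different scales
ranges = features[["mean radius", "worst area"]].agg(["min", "max"])
print(ranges)
```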

Here, we will cover two common scaling techniques: Standardization and Min-Max Scaling. You can learn more about these techniques in the [scikit-learn User Guide](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler).

### Standardization using `StandardScaler`

The [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) class standardizes features by removing the mean and scaling to unit variance. This means that the resulting distribution of each feature will have a mean of 0 and a standard deviation of 1.

In [None]:
scaler = StandardScaler()
scaler.fit(train_features)
train_features_standardized_array = scaler.transform(train_features)
test_features_standardized_array = scaler.transform(test_features)

We used two steps to apply the `StandardScaler`:

1. **Fit the scaler on the training data**: This computes the mean and standard deviation for each feature in the training set.
2. **Transform both the training and testing data**: This applies the scaling using the parameters computed from the training set.
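
The two steps above can also be reproduced by hand, which may help to see what the scaler stores. This is a sketch on hypothetical toy arrays, not the notebook's data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[4.0, 150.0]])

# Step 1: fitting stores the per-feature mean and standard deviation of the training data
toy_scaler = StandardScaler().fit(toy_train)

# Step 2: transforming applies z = (x - mean) / std with those stored training statistics
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population standard deviation (ddof=0)
manual_test = (toy_test - train_mean) / train_std

print(np.allclose(toy_scaler.mean_, train_mean))
print(np.allclose(manual_test, toy_scaler.transform(toy_test)))
```

Note that the test row is scaled with the training mean and standard deviation, so its scaled values need not have mean 0.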

It is crucial to fit the scaler only on the training data to avoid data leakage. Data leakage occurs when information from the test set is used to inform the training process, leading to overly optimistic performance estimates. While it might not seem intuitive, this also applies to scaling techniques. Scaling always takes place **after** the data splitting.
Never do:

```python
scaler.fit(features)  # Incorrect: fitting on the entire dataset
scaled_features = scaler.transform(features)
```


To simplify your code, sklearn provides a `fit_transform` method that combines both fitting and transforming in one step for the training data. However, remember to use the `transform` method separately for the test data.

*Task: Repeat the entire process in the previous code cell, but using the `fit_transform` method for the training data. Make sure to use the correct methods on the correct data sets.*

In [None]:
# Write the code for the full standardization process using fit_transform and transform methods



Note that these methods return NumPy arrays. To convert them back to pandas DataFrames for easier handling, we can use the following approach, where we also retain the original column names and row index:

In [None]:
train_features_standardized = pd.DataFrame(
    train_features_standardized_array,
    columns=train_features.columns,
    index=train_features.index,  # keep rows aligned with train_targets
)
test_features_standardized = pd.DataFrame(
    test_features_standardized_array,
    columns=test_features.columns,
    index=test_features.index,  # keep rows aligned with test_targets
)

To see the effect of standardization, let's visualize the distributions of the features before and after scaling.

In [None]:
fig, axes = plt.subplots(figsize=(12, 8), ncols=2)

sns.boxplot(data=train_features.melt(), x="value", y="variable", ax=axes[0])
axes[0].set_title("before standardization")

sns.boxplot(
    data=train_features_standardized.melt(), x="value", y="variable", ax=axes[1]
)
axes[1].set_title("after standardization")

plt.tight_layout()
plt.show()

The feature distributions already look more comparable after standardization.

### Scaling features between 0 and 1 using `MinMaxScaler`

The `MinMaxScaler` scales features to a specified range, typically between 0 and 1. This is done by subtracting the minimum value of each feature and then dividing by the range (max - min). The resulting values will be within the specified range. The scikit-learn [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) class provides a similar interface to the `StandardScaler`.
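
The formula itself can be sketched with plain NumPy on hypothetical toy values, without the scikit-learn class. The minimum and maximum come from the training values only:

```python
import numpy as np

# Hypothetical toy values for a single feature
toy_train = np.array([2.0, 4.0, 6.0, 10.0])
toy_test = np.array([1.0, 8.0, 12.0])

# The minimum and range are taken from the training values only
train_min = toy_train.min()
train_range = toy_train.max() - train_min

# Subtract the minimum, then divide by the range (max - min)
toy_train_minmax = (toy_train - train_min) / train_range
toy_test_minmax = (toy_test - train_min) / train_range

print(toy_train_minmax)
print(toy_test_minmax)
```

Because the test values are scaled with the training minimum and maximum, they can fall slightly outside [0, 1] when they lie outside the training range.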

*Task: Repeat the entire scaling process using the `MinMaxScaler`. Visualize the feature distributions before and after scaling, similar to what we did for the `StandardScaler`.*

In [None]:
# Complete the code for the full scaling process using the fit_transform and transform methods.
# Make sure you end up with DataFrames for both the train and test sets in the variables:
# - train_features_scaled
# - test_features_scaled


Let's visualize the distribution of the features before and after scaling for both scaling methods.

In [None]:
fig, axes = plt.subplots(figsize=(12, 8), ncols=3)

sns.boxplot(data=train_features.melt(), x="value", y="variable", ax=axes[0])
axes[0].set_title("before")

sns.boxplot(
    data=train_features_standardized.melt(), x="value", y="variable", ax=axes[1]
)
axes[1].set_title("after standardization")

sns.boxplot(data=train_features_scaled.melt(), x="value", y="variable", ax=axes[2])
axes[2].set_title("after min-max scaling")

# Hide repeated y-axis label and tick labels on the middle and right plots
for idx in [1, 2]:
    axes[idx].set_ylabel("")
    axes[idx].tick_params(labelleft=False, left=False)

plt.tight_layout()
plt.show()

*Question: Based on the box plots, how do the feature distributions differ after applying standardization and min-max scaling? What about outliers?*

Write your answer here:

...

StandardScaler is a common default for algorithms that are sensitive to feature scale, such as logistic regression, SVMs, and neural networks. It transforms each feature to have mean=0 and standard deviation=1 (as estimated on the training set) and, being a linear transformation, preserves the shape of the original distribution. It does not require the data to be normally distributed. The mean and standard deviation are usually less affected by a single extreme value than the minimum and maximum are, though the scaled values remain unbounded.

MinMaxScaler is useful when you need features bounded to a fixed range such as [0, 1], for instance with neural networks that use bounded activation functions (sigmoid, tanh), where inputs on a matching scale can help convergence. It also works with distance-based algorithms like k-NN, which only need features on comparable scales. However, it is sensitive to outliers: a single extreme value can compress all other values into a narrow part of the range.

Start with StandardScaler for linear models and SVMs; use MinMaxScaler when you specifically need bounded ranges or work with neural networks that expect bounded inputs. Tree-based models (Random Forest, XGBoost) generally do not require scaling at all.
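
To make the effect of an outlier concrete, here is a small sketch on a hypothetical toy column in which one value is far larger than the rest. Both scalers are linear, so neither changes the relative spacing of the values; the example mainly shows how a single extreme value determines the min-max range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy column: four ordinary values and one extreme value
toy_values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

toy_minmax = MinMaxScaler().fit_transform(toy_values).ravel()
toy_standard = StandardScaler().fit_transform(toy_values).ravel()

# The outlier maps to 1, and the four ordinary values are squeezed close to 0
print(toy_minmax.round(3))

# Standardized values are centered on 0 and not bounded to a fixed range
print(toy_standard.round(3))
```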

## Training a simple model

The dataset is now ready for training machine learning models! Here's a sneak peek of a simple classification model using the logistic regression algorithm:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the Logistic Regression model
model = LogisticRegression()

# Fit the model on the training data
model.fit(train_features_scaled, train_targets)

# Make predictions on the test data
test_predictions = model.predict(test_features_scaled)

# Generate the classification report
print(
    classification_report(
        test_targets, test_predictions, target_names=cancer_dataset.target_names
    )
)

# Generate the confusion matrix
confusion_mat = confusion_matrix(test_targets, test_predictions)
sns.heatmap(
    confusion_mat,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=cancer_dataset.target_names,
    yticklabels=cancer_dataset.target_names,
)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()

## Using a pipeline for preprocessing and modeling

To streamline the preprocessing and modeling steps, we can use a Scikit-learn [`Pipeline`](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators). A pipeline allows us to chain multiple processing steps together, ensuring that the same transformations are applied consistently during both training and testing. Crucially, it also helps prevent data leakage by encapsulating the entire workflow.

In [None]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(),
)
pipeline.fit(train_features, train_targets)
preds = pipeline.predict(test_features)
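
As a sketch of how a fitted pipeline could be inspected, the self-contained example below rebuilds the stratified split and the pipeline, then checks that the scaler inside it was fitted on the training data only. The names `fitted_scaler` and `test_accuracy` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer_dataset = datasets.load_breast_cancer(as_frame=True)
train_features, test_features, train_targets, test_targets = train_test_split(
    cancer_dataset.data,
    cancer_dataset.target,
    test_size=0.3,
    random_state=42,
    stratify=cancer_dataset.target,
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(train_features, train_targets)

# make_pipeline names each step after its lowercased class name
fitted_scaler = pipeline.named_steps["standardscaler"]

# The scaler's stored means equal the training-set means: no test data was used
print(np.allclose(fitted_scaler.mean_, train_features.mean().to_numpy()))

# score() scales the test features with the stored statistics, then reports accuracy
test_accuracy = pipeline.score(test_features, test_targets)
print(f"Test accuracy: {test_accuracy:.3f}")
```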