# Lab 1: Introduction to Machine Learning with scikit-learn

scikit-learn is an open-source machine learning library for Python. It makes it easy to implement a variety of machine learning tasks, including regression, classification, clustering, dimensionality reduction, model selection and data pre-processing. The full documentation for scikit-learn is located at https://scikit-learn.org/stable/. It contains numerous examples and detailed descriptions of each function (and its associated parameters). This notebook gives you a brief introduction to implementing a basic machine learning pipeline.

In [None]:
# if you are missing any of the packages, uncomment the line(s) below to install
# %pip install scikit-learn
# %pip install pandas
# %pip install numpy

### Loading & Previewing Data

The scikit-learn library contains a few small datasets that we can experiment with. Descriptions of the datasets are included here: https://scikit-learn.org/stable/datasets/toy_dataset.html. We will use two of them in this tutorial:

1) The Boston house prices dataset (for regression tasks) and

2) The Breast Cancer Wisconsin (Diagnostic) dataset (for classification tasks)

First we will load the datasets using helper functions and will take a look at the contents:

In [None]:
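from sklearn.datasets import load_boston
import pandas as pd

# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2
boston = load_boston()
print(boston.DESCR)
boston_X, boston_y = load_boston(return_X_y=True)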

In [None]:
from sklearn.datasets import load_breast_cancer 

cancer = load_breast_cancer()
print(cancer.DESCR)
(cancer_X, cancer_y) = load_breast_cancer(return_X_y=True)

### Splitting Data

Before interacting with our datasets any further, we must split the data into training and testing data. This way we can use the training data to explore the data and fit our models and then use the testing data to provide an unbiased measure of the performance of our models. This can be done using the train_test_split function in scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split

# split the data with a 75%-25% training-test split
# set the random state for reproducible results

boston_X_train, boston_X_test, boston_y_train, boston_y_test = train_test_split(
                                boston_X, boston_y, test_size=0.25, random_state=671)
# check the shapes of the resulting splits
print(boston_X_train.shape)
print(boston_X_test.shape)
print(boston_y_train.shape)
print(boston_y_test.shape)
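
As a small illustration of why we set `random_state`, the sketch below splits a made-up array twice with the same seed and checks that both splits are identical. The toy arrays `X` and `y` are assumptions for illustration only and are not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (hypothetical): 20 samples with 3 features each
X = np.arange(60).reshape(20, 3)
y = np.arange(20)

# split twice using the same test_size and random_state
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=671)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=671)

print(X_train_a.shape, X_test_a.shape)     # 15 training rows, 5 testing rows
print(np.array_equal(X_test_a, X_test_b))  # the same seed gives the same split
```

Changing `random_state` (or leaving it unset) would generally produce a different partition on each run.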

In [None]:
### YOUR CODE: split the breast cancer data with a train-test split of 70%-30% & random_state of 671 into:
# (1) cancer_X_train
# (2) cancer_X_test
# (3) cancer_y_train
# (4) cancer_y_test

### Preprocessing the Data

For almost all datasets, we will need to preprocess the data before we can fit any models. This may involve imputing missing data, scaling our features, applying one-hot encoding, etc.

#### 1. Imputing Missing Data

Missing data is a very common problem across all datasets. One simple strategy for addressing it is to impute missing values using a chosen strategy such as the "mean", "median" or "most_frequent". First, let's check for missing data in our datasets.

In [None]:
boston_X_train_df = pd.DataFrame(boston_X_train, columns=boston.feature_names)
# check for which columns have missing values
columns = boston_X_train_df.columns[boston_X_train_df.isnull().any()]
print(len(columns)) 

In [None]:
cancer_X_train_df = pd.DataFrame(cancer_X_train, columns=cancer.feature_names)
# check for which columns have missing values
columns = cancer_X_train_df.columns[cancer_X_train_df.isnull().any()]
print(len(columns)) 

As these are "toy" datasets, they do not have any missing values, but let us still practice how we would impute missing values.

In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

imp = SimpleImputer(missing_values=np.nan, strategy='mean') # impute using the 'mean' for the feature
boston_X_train_imp = imp.fit_transform(boston_X_train) # use fit_transform on training data
boston_X_test_imp = imp.transform(boston_X_test) # use transform on testing data
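
Because the datasets above have no missing values, the effect of the imputer is not visible on them. The sketch below uses a tiny made-up array (an assumption for illustration, not lab data) to show that the imputer learns the column means from the training data and reuses them on the testing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy data (hypothetical) with missing entries marked as np.nan
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train_imp = imp.fit_transform(X_train)  # learns the column means from the training data
X_test_imp = imp.transform(X_test)        # fills test gaps with the *training* means

print(imp.statistics_)  # per-column means learned during fit
print(X_test_imp)
```

The test row is filled with values computed from the training rows only, which is the behavior we want in order to avoid leaking information from the test set.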

In [None]:
### YOUR CODE: apply the SimpleImputer with a 'median' strategy to the cancer data
### As a result, you should have two variables: 
#(1) cancer_X_train_imp and
#(2) cancer_X_test_imp

#### 2.  One-Hot Encoding

Depending on the nature of our data and the ML models we want to implement, we may need to one-hot encode the categorical variables in our data as some models do not support categorical inputs. This means we will transform the feature into a set of dummy variables.

Although we do not have any categorical variables in our cancer dataset, we do have a variable, 'RAD', in our boston dataset that we want to one-hot encode, as it represents an index of accessibility to radial highways. We can do so using a OneHotEncoder and a ColumnTransformer, which allows us to apply the OneHotEncoder to only the 'RAD' column.

First, let's implement the OneHotEncoder with the default parameters.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

rad_idx = list(boston.feature_names).index('RAD') # find the index of the 'RAD' column
enc_boston = ColumnTransformer([("RAD", OneHotEncoder(),[rad_idx])], remainder="passthrough") # passthrough specifies to keep other columns as they are
boston_X_train_enc = enc_boston.fit_transform(boston_X_train_imp) # use fit_transform on training data
boston_X_test_enc = enc_boston.transform(boston_X_test_imp) # use transform on testing data
print(enc_boston.transformers_)

print(boston_X_train_enc.shape)
print(boston_X_test_enc.shape)

We can see that after applying the OneHotEncoder to the 'RAD' column, we have increased the number of columns in our dataset from 13 to 21, as it replaced 'RAD' with one column for each of its unique values. However, if we use this, we will run into the issue of multicollinearity, since each dummy variable can be represented as a linear combination of the other dummy variables. To solve this issue, we want to create n-1 dummy variables. This can be done using the 'drop' parameter of the OneHotEncoder class, as shown below, resulting in one fewer column.

In [None]:
enc_boston = ColumnTransformer([("RAD", OneHotEncoder(drop='first'),[rad_idx])], remainder="passthrough")
boston_X_train_enc = enc_boston.fit_transform(boston_X_train_imp)
boston_X_test_enc = enc_boston.transform(boston_X_test_imp)
print(enc_boston.transformers_)

print(boston_X_train_enc.shape)
print(boston_X_test_enc.shape)
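
To make the redundancy concrete, here is a minimal sketch on a made-up single-column array (an assumption for illustration). With the default encoder, every row of dummy variables sums to 1, so any one column is determined by the others; `drop='first'` removes that redundant column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy categorical column (hypothetical) with three unique values
X = np.array([[1], [2], [3], [2]])

# the encoder returns a sparse matrix by default, so convert it for printing
full = OneHotEncoder().fit_transform(X).toarray()
reduced = OneHotEncoder(drop='first').fit_transform(X).toarray()

print(full.shape, reduced.shape)  # one column per category vs. one fewer
print(full.sum(axis=1))           # each row sums to 1: the columns are linearly dependent
print(reduced)                    # the first category is encoded as all zeros
```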

#### 3. Feature Scaling

For some machine learning models, we must first scale the features so that they are all on a shared scale. This supports faster model convergence and removes any bias toward features with higher magnitudes (e.g., giving more importance to a feature measured in cm than to the same feature measured in inches).


Two of the most common techniques for feature scaling are 1) StandardScaler and 2) MinMaxScaler. StandardScaler standardizes each feature to have a mean of zero and a standard deviation of 1. MinMaxScaler rescales each feature to a fixed range, which is 0 to 1 by default. There are no hard rules for when to use one over the other, but some factors to consider are the problem we intend to solve, assumptions regarding the distribution of the data (including the presence of outliers) and the ML models we plan on implementing. It can also be a good option to try out both and see which results in better performance. Either way, we must fit the scaler on our training data and then use it to transform our testing data to avoid any data leakage.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler() 
boston_X_train_scal = scaler.fit_transform(boston_X_train_enc) # use fit_transform on training data
boston_X_test_scal = scaler.transform(boston_X_test_enc) # use transform on testing data 
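
The following sketch, on a made-up one-feature array (an assumption for illustration), checks both scalers against the formulas they implement. It also shows that a test value can land outside the 0 to 1 range under MinMaxScaler, because the minimum and maximum come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# toy data (hypothetical): one feature
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])

std = StandardScaler().fit(X_train)  # learns the training mean and standard deviation
mm = MinMaxScaler().fit(X_train)     # learns the training min and max

# the same transformations written out by hand, using training statistics only
manual_std = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
manual_mm = (X_test - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))

print(std.transform(X_test), manual_std)
print(mm.transform(X_test), manual_mm)  # greater than 1: the test value exceeds the training max
```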

In [None]:
from sklearn.preprocessing import MinMaxScaler

### YOUR CODE: apply MinMaxScaler to the cancer dataset ('cancer_X_train_imp' and 'cancer_X_test_imp')
### name the resulting variables 'cancer_X_train_scal' and 'cancer_X_test_scal'

### Feature Selection

When we have large datasets with high dimensionality, feature selection can help us reduce the dimensionality (i.e., the number of input variables or features) by removing irrelevant or redundant features. This can help with reducing the computational costs associated with training models and can also improve the performance of our models in some cases. There are many feature selection techniques, so we will only experiment with a few here.

In [None]:
# Method 1: Mutual Information (regression)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression

selector = SelectKBest(mutual_info_regression, k = 10) # select the top 10 features
boston_X_train_mi = selector.fit_transform(boston_X_train_scal, boston_y_train)
print("Training data shape (after applying Mutual Info):", boston_X_train_mi.shape)
boston_X_test_mi = selector.transform(boston_X_test_scal)
print("Testing data shape (after applying Mutual Info):", boston_X_test_mi.shape)

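# Method 2: Principal component analysis (PCA)
# note: PCA builds new features (linear combinations of the originals) rather than selecting a subset
from sklearn.decomposition import PCA

pca = PCA(n_components = 0.90) # can specify either the fraction of explained variance or the number of components
boston_X_train_pca = pca.fit_transform(boston_X_train_scal) # PCA is unsupervised, so the targets are not needed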
print("Training shape (after applying PCA):", boston_X_train_pca.shape)
boston_X_test_pca = pca.transform(boston_X_test_scal)
print("Testing shape (after applying PCA):", boston_X_test_pca.shape)
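
To see what `n_components = 0.90` does, the sketch below fits PCA on synthetic data (an assumption for illustration) and inspects the cumulative explained variance. PCA keeps the smallest number of components whose cumulative explained variance ratio exceeds the requested fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data (hypothetical): 200 samples, 5 features with decreasing spread
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

# running total of the variance explained by the components that were kept
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", cumulative)
```

The last entry of `cumulative` is above 0.90, while dropping the final component would fall short of it.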

In [None]:
# YOUR CODE: apply Mutual Information (classification) to the breast cancer data
# select the top 15 features and name the resulting variables 'cancer_X_train_mi' & 'cancer_X_test_mi'

from sklearn.feature_selection import mutual_info_classif

### Model Fitting, Cross-Validation & Evaluation

Now we can finally fit some models! scikit-learn allows you to easily implement a variety of models. We will work with a few of them here.

Let's start by fitting two models - (1) Linear Regression and (2) K-Nearest Neighbors Regression - to our boston housing dataset. We will fit the models to each of the three feature sets - (1) all of the features (no feature selection), (2) the feature subset selected using mutual information and (3) the components produced by PCA - so that we can compare performance both between the models and between the different feature sets. We will use cross-validation (3 folds in this example) to get a more reliable estimate of the performance of our models without having to touch our test dataset yet. For regressors, the default performance metric is $R^2$, or the proportion of explained variance.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

lr = LinearRegression()

lr_scores_df = pd.DataFrame(columns=["R^2 on Fold 1", "R^2 on Fold 2", "R^2 on Fold 3", "Mean R^2"])
print("Linear Regression:")
lr_scal_scores = list(cross_val_score(lr, boston_X_train_scal, boston_y_train, cv=3))
lr_scal_scores.append(np.mean(lr_scal_scores))
lr_scores_df.loc["No Feature Selection"] = lr_scal_scores
lr_mi_scores = list(cross_val_score(lr, boston_X_train_mi, boston_y_train, cv=3))
lr_mi_scores.append(np.mean(lr_mi_scores))
lr_scores_df.loc["MI Features"] = lr_mi_scores
lr_pca_scores = list(cross_val_score(lr, boston_X_train_pca, boston_y_train, cv=3))
lr_pca_scores.append(np.mean(lr_pca_scores))
lr_scores_df.loc["PCA Features"] = lr_pca_scores
print(lr_scores_df.head())

knn = KNeighborsRegressor()

knn_scores_df = pd.DataFrame(columns=["R^2 on Fold 1", "R^2 on Fold 2", "R^2 on Fold 3", "Mean R^2"])
print("\nK-Nearest Neighbors Regression:")
knn_scal_scores = list(cross_val_score(knn, boston_X_train_scal, boston_y_train, cv=3))
knn_scal_scores.append(np.mean(knn_scal_scores))
knn_scores_df.loc["No Feature Selection"] = knn_scal_scores
knn_mi_scores = list(cross_val_score(knn, boston_X_train_mi, boston_y_train, cv=3))
knn_mi_scores.append(np.mean(knn_mi_scores))
knn_scores_df.loc["MI Features"] = knn_mi_scores
knn_pca_scores = list(cross_val_score(knn, boston_X_train_pca, boston_y_train, cv=3))
knn_pca_scores.append(np.mean(knn_pca_scores))
knn_scores_df.loc["PCA Features"] = knn_pca_scores
print(knn_scores_df.head())
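
For intuition about what `cross_val_score` does with `cv=3`, here is a sketch that reproduces its scores with an explicit loop over folds. It uses synthetic data from `make_regression` (an assumption for illustration) and relies on the fact that, for a regressor and an integer `cv`, scikit-learn uses an unshuffled `KFold` split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic regression data (hypothetical stand-in for the lab data)
X, y = make_regression(n_samples=90, n_features=4, noise=10.0, random_state=0)

# scores computed by the helper
auto_scores = cross_val_score(LinearRegression(), X, y, cv=3)

# the same procedure written out: fit on two folds, score R^2 on the held-out fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=3).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[val_idx], y[val_idx]))

print(auto_scores)
print(np.array(manual_scores))
```

Each data point is used for validation exactly once, which is why the mean across folds is a steadier estimate than a single train/validation split.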

Looking at the results, we see that linear regression performs similarly across the feature sets, but KNN performs significantly better on the feature subset chosen using mutual information. Let's use GridSearchCV below to see if we can further improve its performance by adjusting the value of n_neighbors (the number of neighbors used).

In [None]:
from sklearn.model_selection import GridSearchCV

knn = KNeighborsRegressor()

# search over n_neighbors
param_grid = [{'n_neighbors': [1, 2, 3, 4, 5, 6, 7] }]

cv_knn = GridSearchCV(knn, param_grid, cv=2)
cv_knn.fit(boston_X_train_mi, boston_y_train)
print("Best Params:", cv_knn.best_params_)
print("Best R^2 Score:", cv_knn.best_score_)
print(cv_knn.cv_results_)

Using GridSearchCV, we see that the value of n_neighbors resulting in the best performance was 3. Let's use this as our final model and apply it to our test set. We will evaluate the test set using Mean Squared Error (MSE) and Mean Absolute Error (MAE), two common measures of model performance for regression tasks. The formula for MSE is:

$\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i )^2$ 

where $n$ is the number of data points, $y_i$ is the actual observation and $\hat{y}_i$ is our prediction

Similarly, the formula for MAE is:

$\frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$

where $n$ is the number of data points, $y_i$ is the actual observation and $\hat{y}_i$ is our prediction
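
As a quick check of the two formulas, the sketch below computes MSE and MAE by hand on a few made-up values (an assumption for illustration) and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# toy observations and predictions (hypothetical)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred       # [1, 0, -2, -1]
mse = np.mean(errors ** 2)     # (1 + 0 + 4 + 1) / 4 = 1.5
mae = np.mean(np.abs(errors))  # (1 + 0 + 2 + 1) / 4 = 1.0

print(mse, mean_squared_error(y_true, y_pred))
print(mae, mean_absolute_error(y_true, y_pred))
```

Because MSE squares each error, the single error of size 2 contributes more to it than to MAE, so MSE is the more sensitive of the two to large mistakes.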

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(boston_X_train_mi, boston_y_train)
preds = knn.predict(boston_X_test_mi)

print("Training R^2:", knn.score(boston_X_train_mi, boston_y_train))
print("Testing R^2:", knn.score(boston_X_test_mi, boston_y_test))
print("MSE:", mean_squared_error(boston_y_test, preds)) 
print("MAE:", mean_absolute_error(boston_y_test, preds))

Nice! Our testing score increased to 0.84! Now, let's move on to the breast cancer data, a classification task...

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf_scores_df = pd.DataFrame(columns=["Fold 1 Accuracy", "Fold 2 Accuracy", "Fold 3 Accuracy", "Mean Accuracy"])
print("Random Forest Classifier:")
rf_scal_scores = list(cross_val_score(rf, cancer_X_train_scal, cancer_y_train, cv=3))
rf_scal_scores.append(np.mean(rf_scal_scores))
rf_scores_df.loc["All Features"] = rf_scal_scores
rf_mi_scores = list(cross_val_score(rf, cancer_X_train_mi, cancer_y_train, cv=3))
rf_mi_scores.append(np.mean(rf_mi_scores))
rf_scores_df.loc["MI Features"] = rf_mi_scores
print(rf_scores_df.head())


### YOUR CODE: find another classifier of your choice from scikit-learn & replicate the code above using it

This is a good example to show that feature selection will not always improve performance. This is why we must experiment with different methods and models to optimize them for the task at hand.

For many classification tasks, we will want to focus our evaluation on precision, recall and the F1-score. The classification_report function in scikit-learn makes it easy to do so. As a review:

Precision = $\frac{TP}{TP + FP}$, or the fraction of our positive predictions that are actually positive instances

Recall = $\frac{TP}{TP + FN}$, or the fraction of positive instances that we actually predict as positive

F1-score = $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, or the harmonic mean of precision and recall
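
The definitions above can be checked by hand on a small example. The sketch below counts true positives, false positives and false negatives for made-up labels (an assumption for illustration) and compares the results with the corresponding scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# toy labels (hypothetical): 1 is the positive class
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # predicted positive, actually positive
fp = np.sum((y_true == 0) & (y_pred == 1))  # predicted positive, actually negative
fn = np.sum((y_true == 1) & (y_pred == 0))  # predicted negative, actually positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```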

In [None]:
from sklearn.metrics import classification_report

### YOUR CODE: using your chosen model, compute predictions for the test data
### use these predictions to create the classification report