# Machine Learning Modelling with Python and Scikit-Learn

This notebook explores a variety of essential and practical features within the Scikit-Learn library. 

It covers a lot of ground, but given how extensive the library is, it is still only a quick overview. For comprehensive details, especially if you get stuck, consult the full [documentation](https://scikit-learn.org/stable/user_guide.html).

## Setting up Python

### Create virtual environment:  

```
python3 -m venv venv
```

macOS/Linux:

```
source venv/bin/activate
```

Windows:

```
.\venv\Scripts\activate
```

### Install packages:

```
pip install -r requirements.txt
```

In [85]:
import datetime
print(f"Last updated: {datetime.datetime.now()}")

Last updated: 2025-02-19 10:08:53.838596


## What is Scikit-Learn (sklearn)?

[Scikit-Learn](https://scikit-learn.org/stable/index.html), also referred to as `sklearn`, is an open-source Python machine learning library.

It's built on top of NumPy (Python library for numerical computing), SciPy (Python library for scientific computing) and Matplotlib (Python library for data visualization).

![](images/sklearn-6-step-ml-framework-tools-scikit-learn-highlight.png) 


## Why Scikit-Learn?

In data science and machine learning, the primary objective is to identify patterns in data and use them for predictions. 

Most problems fall into specific categories, such as [classification](https://en.wikipedia.org/wiki/Statistical_classification#Binary_and_multiclass_classification) (e.g., spam vs. non-spam emails), [regression](https://en.wikipedia.org/wiki/Regression_analysis) (e.g., predicting house prices), and [clustering](https://developers.google.com/machine-learning/clustering/overview) (grouping similar samples).

Regardless of the problem type, common steps are involved, including dividing data into training and testing sets, selecting a model, and evaluating its performance. Scikit-Learn provides Python tools to handle these tasks efficiently, from data preparation to modeling, thereby saving the effort of building everything from scratch.

## What does this notebook cover?

The Scikit-Learn library is very capable. However, learning everything off by heart isn't necessary. Instead, this notebook focuses on some of the main use cases of the library.

More specifically, we'll cover:

![title](images/sklearn-workflow-title.png)

0. Implementing a Complete Scikit-Learn Workflow
1. Data Preparation
2. Selecting the Appropriate Machine Learning Model
3. Training the Model and Making Predictions
4. Assessing Model Performance
5. Enhancing Predictions via Hyperparameter Tuning
6. Saving and Loading Pre-Trained Models
7. Integrating Steps into a Unified Pipeline

## Where can I get help?

If you encounter a challenge or think of something not covered in this notebook, don't worry! Here are some recommended steps to help you find a solution:

1. **Experiment and Try It Out** - Since Scikit-Learn is designed for ease of use, start by applying what you know and attempting to solve your question on your own. It's okay to make mistakes; they're part of the learning process. If unsure, run your code to see the results.

2. **Use Documentation** - Press SHIFT + TAB while inside a function to view its docstring, which provides information about what the function does. Developing this habit will enhance your research skills and deepen your understanding of the library.

3. **Search for Solutions** - If experimenting doesn't work, try searching online. You'll likely find answers in one of these places:  
* [Scikit-Learn documentation/user guide](https://scikit-learn.org/stable/user_guide.html) : This is the most comprehensive resource for Scikit-Learn information.
* [Stack Overflow](https://stackoverflow.com/) : A Q&A platform for developers where you can find solutions to a wide range of software development problems.
* [ChatGPT](https://chat.openai.com/) : Useful for explaining code and answering follow-up questions, but verify any code it generates before using it.
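
Docstrings can also be read outside of Jupyter. The following is a minimal sketch, assuming a plain Python session, that uses the standard library's `inspect.getdoc()` to fetch the docstring of `train_test_split` (chosen here only as an example function):

```python
import inspect

from sklearn.model_selection import train_test_split

# inspect.getdoc returns the cleaned-up docstring as a string
doc = inspect.getdoc(train_test_split)

# Print only the first few lines to keep the output short
print("\n".join(doc.splitlines()[:5]))

# help(train_test_split) would print the full docstring in one go
```

The same approach should work for any Scikit-Learn function or class.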


In [87]:
# Standard imports
# %matplotlib inline # No longer required in newer versions of Jupyter (2022+)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sklearn
print(f"Using Scikit-Learn version: {sklearn.__version__} (materials in this notebook require this version or newer).")

Using Scikit-Learn version: 1.4.2 (materials in this notebook require this version or newer).


## 0. Implementing a Complete Scikit-Learn Workflow

Before diving into the details, let's take a quick look at what a complete Scikit-Learn workflow looks like. After getting an overview, we'll explore each step more thoroughly.

We'll focus on the following hands-on steps:

1. **Data Preparation:** Splitting data into features and labels, and setting up training and testing sets.

2. **Model Selection:** Choosing the right model for our problem.

3. **Model Training and Prediction:** Fitting the model to the data and using it to make predictions.

4. **Model Evaluation:** Assessing the model's performance.

5. **Improvement through Experimentation:** Enhancing the model's performance.

6. **Saving the Model:** Saving a trained model for future use.

> **Note:** The upcoming section provides a comprehensive end-to-end workflow, which might be information-dense. We'll cover it quickly and then break it down further throughout the notebook. Keep in mind that Scikit-Learn is a versatile library, and the workflow presented here is just one example of how it can be used.


### Random Forest Classifier Workflow for Classifying Heart Disease

#### 1. Data Preparation

As an example dataset, we'll import `heart-disease.csv`.

This file contains anonymised patient medical records and whether or not each patient has heart disease (this is a classification problem since we're trying to predict whether something is one thing or another).

In [88]:
import pandas as pd

heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv") # load data directly from URL (requires raw form on GitHub, source: https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/data/heart-disease.csv)
heart_disease.head()

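|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |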


Here, each row is a different patient and all columns except `target` are different patient characteristics. 

The `target` column indicates whether the patient has heart disease (`target=1`) or not (`target=0`). This is our "label" column, the variable we're going to try to predict.

The rest of the columns (often called features) are what we'll be using to predict the `target` value.

> **Note:** It's common practice to save features to a variable `X` and labels to a variable `y`. In practice, we'd like to use the `X` (features) to build a predictive algorithm to predict the `y` (labels).

In [89]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

# Check the head of the features DataFrame
X.head()

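|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |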


In [90]:
# Check the head and the value counts of the labels 
y.head(), y.value_counts()

(0    1
 1    1
 2    1
 3    1
 4    1
 Name: target, dtype: int64,
 target
 1    165
 0    138
 Name: count, dtype: int64)

One of the most important practices in machine learning is to split datasets into training and test sets.

As in, a model will **train on the training set** to learn patterns and then those patterns can be **evaluated on the test set**.

Crucially, a model should **never** see testing data during training.

Scikit-Learn provides the [`sklearn.model_selection.train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split datasets into training and test sets.

> **Note:** A common practice is to use an 80/20, 70/30 or 75/25 split for training/testing data.

In [91]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

# np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.25) # by default train_test_split uses 25% of the data for the test set

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
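
The split above is random, so the exact rows in each set change between runs. As a minimal sketch on made-up toy data (the `X_demo` and `y_demo` arrays are illustrative, not part of the heart disease dataset), passing `random_state` to `train_test_split` should make the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 3 features, binary labels (made up for illustration)
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.array([0, 1] * 10)

# The same random_state gives the same split on every call
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

print(X_tr_a.shape, X_te_a.shape)      # (15, 3) (5, 3)
print(np.array_equal(X_te_a, X_te_b))  # True

# No row appears in both the training and the test set
train_rows = set(map(tuple, X_tr_a))
test_rows = set(map(tuple, X_te_a))
print(train_rows.isdisjoint(test_rows))  # True
```

Fixing the seed this way is a design choice for reproducibility; it does not change how well the model generalises.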

#### 2. Model Selection

Selecting a model in Scikit-Learn depends on the type of problem you're addressing, such as classification or regression. Scikit-Learn provides a variety of models suitable for each problem type, which can be explored in their machine learning map.

In Scikit-Learn, models are referred to as "estimators", though in code they're commonly assigned to variables named `model` or `clf` (short for classifier). Each model has hyperparameters, which are adjustable settings that can be tuned to optimize the model's performance for your specific problem. Hyperparameters can be thought of as knobs on an oven that you adjust to achieve the perfect cooking conditions for your dish.

In [92]:
# Since we're working on a classification problem, we'll start with a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

We can see the current hyperparameters of a model with the [`get_params()`](https://scikit-learn.org/stable/developers/develop.html#get-params-and-set-params) method.

In [93]:
# View the current hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We'll leave this as is for now, as Scikit-Learn models generally have good default settings.
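
For reference, here is a small sketch of how those knobs could be turned. The values `50`, `5` and `200` are arbitrary choices for illustration, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters can be set when the model is created...
demo_clf = RandomForestClassifier(n_estimators=50, max_depth=5)

# ...or changed afterwards with set_params()
demo_clf.set_params(n_estimators=200)

params = demo_clf.get_params()
print(params["n_estimators"], params["max_depth"])  # 200 5
```

Changing hyperparameters does not retrain anything; a model only learns when `fit()` is called.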

#### 3. Model Training and Prediction

Fitting a model to a dataset involves passing it the data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels. 

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.
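
The difference between the two cases can be sketched on synthetic data. This is only an illustration: `make_blobs` generates made-up points, and `KMeans` is used here as one example of an unsupervised model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 100 samples in 2 groups (illustrative only)
X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=42)

# Supervised: the model receives both features and labels
supervised = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Unsupervised: the model receives features only and groups similar samples
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_demo)

print(supervised.predict(X_demo[:5]))  # predicted labels
print(unsupervised.labels_[:5])        # cluster ids (the numbering is arbitrary)
```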

Most Scikit-Learn models have the [`fit(X, y)`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit) method built-in, where the `X` parameter is the features and the `y` parameter is the labels.

In our case, we start by fitting a model on the training split (`X_train`, `y_train`).

In [94]:
clf.fit(X=X_train, y=y_train)

#### Use the model to make a prediction

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once your model instance is trained, you can use the [`predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict) method to predict a target value given a set of features. 

In other words, use the model, along with some new, unseen and unlabelled data to predict the label.

> **Note:** Data you predict on should be in the same shape and format as data you trained on.

In [95]:
# In order to predict a label, data has to be in the same shape as X_train
X_test.head()

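|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 279 | 61 | 1 | 0 | 138 | 166 | 0 | 0 | 125 | 1 | 3.6 | 1 | 1 | 2 |
| 206 | 59 | 1 | 0 | 110 | 239 | 0 | 0 | 142 | 1 | 1.2 | 1 | 1 | 3 |
| 208 | 49 | 1 | 2 | 120 | 188 | 0 | 1 | 139 | 0 | 2.0 | 1 | 3 | 3 |
| 286 | 59 | 1 | 3 | 134 | 204 | 0 | 1 | 162 | 0 | 0.8 | 2 | 2 | 2 |
| 269 | 56 | 1 | 0 | 130 | 283 | 1 | 0 | 103 | 1 | 1.6 | 0 | 0 | 3 |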


In [96]:
# Use the model to make a prediction on the test data (further evaluation)
y_preds = clf.predict(X=X_test)

In [97]:
y_preds

array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 1, 0, 1, 1])
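
To make the "same shape and format" requirement concrete, here is a minimal sketch on a tiny made-up dataset (the values and the `new_patient` row are invented for illustration). A single new sample still needs to be two-dimensional, with the same columns the model was trained on:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set with two feature columns
train_df = pd.DataFrame({"age": [30, 45, 60, 52, 38, 67],
                         "chol": [180, 240, 300, 260, 200, 310]})
labels = pd.Series([0, 0, 1, 1, 0, 1], name="target")

demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(train_df, labels)

# A new sample must be 2D (one row) with the same columns in the same order
new_patient = pd.DataFrame({"age": [58], "chol": [290]})
prediction = demo_clf.predict(new_patient)
print(new_patient.shape, prediction.shape)  # (1, 2) (1,)

# Passing only one of the two columns is rejected
wrong_shape_rejected = False
try:
    demo_clf.predict(new_patient[["age"]])
except ValueError:
    wrong_shape_rejected = True
print(wrong_shape_rejected)  # True
```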

#### 4. Model Evaluation

Now that we've generated predictions, we can use Scikit-Learn methods to assess the quality of our model. Each model, or estimator, includes a built-in [`score()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score) method that evaluates how well the model has learned the relationships between features and labels.

The `score()` method utilizes a standard evaluation metric specific to the model type. For classifiers like ours, a common metric is accuracy, which measures the fraction of correct predictions out of the total predictions.
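
As a quick worked example of that definition, accuracy can be computed by hand. The labels and predictions below are made up for illustration:

```python
# Made-up labels and predictions for illustration
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = number of correct predictions / total number of predictions
correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
accuracy = correct / len(y_true_demo)

print(f"{correct} of {len(y_true_demo)} correct -> accuracy = {accuracy:.2f}")
# 6 of 8 correct -> accuracy = 0.75
```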

Let's examine our model's accuracy on the training set to get an initial sense of its performance.

In [98]:
# Evaluate the model on the training set
train_acc = clf.score(X=X_train, y=y_train)
print(f"The model's accuracy on the training dataset is: {train_acc*100}%")

The model's accuracy on the training dataset is: 100.0%


Looks like our model is doing pretty well. Maybe even suspiciously well. Any ideas what could have happened?


In [99]:
# Evaluate the model on the test set
test_acc = clf.score(X=X_test, y=y_test)
print(f"The model's accuracy on the testing dataset is: {test_acc*100:.2f}%")

The model's accuracy on the testing dataset is: 86.84%


Hmm, looks like our model's accuracy is a bit less on the test dataset than the training dataset.

This is quite often the case, because remember, a model has never seen the testing examples before.

There are also a number of other evaluation methods we can use for our classification models.

All of the following classification metrics come from the [`sklearn.metrics`](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) module:
* [`classification_report(y_true, y_pred)`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) - Builds a text report showing various classification metrics such as [precision, recall](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) and F1-score.
* [`confusion_matrix(y_true, y_pred)`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix) - Creates a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) to compare predictions to truth labels.
* [`accuracy_score(y_true, y_pred)`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score) - Finds the accuracy score (the default metric) for a classifier.

All metrics have the following in common: they compare a model's predictions (`y_pred`) to truth labels (`y_true`).

In [100]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Create a classification report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.91      0.72      0.81        29
           1       0.85      0.96      0.90        47

    accuracy                           0.87        76
   macro avg       0.88      0.84      0.85        76
weighted avg       0.87      0.87      0.86        76



In [101]:
# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[21,  8],
       [ 2, 45]])

In [102]:
# Compute the accuracy score (same as the score() method for classifiers) 
accuracy_score(y_test, y_preds)

0.868421052631579
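
The three outputs above are closely related. As a sketch, the precision, recall and accuracy values can be recovered from the confusion matrix alone (the matrix values are copied from the output above, where rows are true classes and columns are predicted classes):

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
conf_mat_demo = np.array([[21, 8],
                          [2, 45]])

correct_per_class = np.diag(conf_mat_demo)

# Precision: of everything predicted as a class, how much was right (column sums)
precision = correct_per_class / conf_mat_demo.sum(axis=0)

# Recall: of everything truly in a class, how much was found (row sums)
recall = correct_per_class / conf_mat_demo.sum(axis=1)

# Accuracy: all correct predictions over all predictions
accuracy = correct_per_class.sum() / conf_mat_demo.sum()

print(precision.round(2))  # [0.91 0.85]
print(recall.round(2))     # [0.72 0.96]
print(round(accuracy, 4))  # 0.8684
```

These match the per-class rows of the classification report and the value returned by `accuracy_score`.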

#### 5. Improvement through Experimentation

The first model you build is typically treated as a baseline, a reference point that later experiments aim to beat.

To improve upon your baseline, the key strategy is experimentation. Experiments can be categorized into two main types:

**Model Perspective:** This involves using more complex models or tuning the hyperparameters of your existing model. For instance, you can adjust the hyperparameters of a `RandomForestClassifier()` by changing the number of trees (`n_estimators`) to see how it affects performance.

**Data Perspective:** This involves collecting more data or improving the quality of your existing data to help your model learn patterns more effectively.

When working with an existing dataset, it's often easier to start with model-based experiments before considering data enhancements if results are unsatisfactory.

An important consideration when tuning hyperparameters is to ensure that your results are cross-validated. Cross-validation helps confirm that your model's performance is consistent across different training and test sets, rather than being due to chance. This is crucial for ensuring reliable results.

For our [`RandomForestClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), we can experiment with different values for `n_estimators`, starting with the default of 100 and increasing it in steps of 10 up to 190 to observe any improvements. More trees often help up to a point, at the cost of longer training times.

In [103]:
# Try different numbers of estimators (trees)... (no cross-validation)
np.random.seed(42)
for i in range(100, 200, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100:.2f}%")
    print("")

Trying model with 100 estimators...
Model accuracy on test set: 84.21%

Trying model with 110 estimators...
Model accuracy on test set: 84.21%

Trying model with 120 estimators...
Model accuracy on test set: 85.53%

Trying model with 130 estimators...
Model accuracy on test set: 86.84%

Trying model with 140 estimators...
Model accuracy on test set: 84.21%

Trying model with 150 estimators...
Model accuracy on test set: 85.53%

Trying model with 160 estimators...
Model accuracy on test set: 82.89%

Trying model with 170 estimators...
Model accuracy on test set: 86.84%

Trying model with 180 estimators...
Model accuracy on test set: 84.21%

Trying model with 190 estimators...
Model accuracy on test set: 86.84%



The metrics above were measured on a single train and test split.

Let's use [`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) to measure the results across 5 different train and test sets.

We can achieve this by calling `cross_val_score(model, X, y, cv=5)`.

Here, `model` is the estimator to evaluate, `X` is the *full* feature set, `y` is the *full* label set and `cv` is the number of train and test splits `cross_val_score` will automatically create from the data (in our case, `5` different splits, known as 5-fold cross-validation).
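
To see what those splits look like, here is a small sketch on made-up toy data. It uses `StratifiedKFold` directly, which is what Scikit-Learn uses by default when `cv` is an integer and the estimator is a classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 10 samples with balanced binary labels (made up for illustration)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# 5 folds: each fold holds out a different 20% of the samples for testing
splitter = StratifiedKFold(n_splits=5)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(splitter.split(X_demo, y_demo), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_indices.extend(test_idx.tolist())

# Every sample lands in a test fold exactly once
print(sorted(test_indices) == list(range(10)))  # True
```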

In [104]:
from sklearn.model_selection import cross_val_score

# With cross-validation
np.random.seed(42)
for i in range(100, 200, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)

    # Measure the model score on a single train/test split
    model_score = model.score(X_test, y_test)
    print(f"Model accuracy on single test set split: {model_score * 100:.2f}%")
    
    # Measure the mean cross-validation score across 5 different train and test splits
    cross_val_mean = np.mean(cross_val_score(model, X, y, cv=5))
    print(f"5-fold cross-validation score: {cross_val_mean * 100:.2f}%")
    
    print("")

Trying model with 100 estimators...
Model accuracy on single test set split: 84.21%
5-fold cross-validation score: 82.15%

Trying model with 110 estimators...
Model accuracy on single test set split: 84.21%
5-fold cross-validation score: 81.17%

Trying model with 120 estimators...
Model accuracy on single test set split: 88.16%
5-fold cross-validation score: 83.16%

Trying model with 130 estimators...
Model accuracy on single test set split: 84.21%
5-fold cross-validation score: 83.14%

Trying model with 140 estimators...
Model accuracy on single test set split: 84.21%
5-fold cross-validation score: 82.48%

Trying model with 150 estimators...
Model accuracy on single test set split: 84.21%
5-fold cross-validation score: 80.17%

Trying model with 160 estimators...
Model accuracy on single test set split: 86.84%
5-fold cross-validation score: 80.83%

Trying model with 170 estimators...
Model accuracy on single test set split: 86.84%
5-fold cross-validation score: 81.83%

Trying model wit

Which model had the best cross-validation score?

This is usually a better indicator of a quality model than a single split accuracy score.

Rather than set up and track the results of these experiments manually, we can get Scikit-Learn to do the exploration for us.

Scikit-Learn's [`sklearn.model_selection.GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is a way to search over a set of different hyperparameter values and automatically track which combination performs best.

Let's test it!

In [105]:
# Another way to do it with GridSearchCV...
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

# Define the parameters to search over in dictionary form 
# (these can be any of your target model's hyperparameters) 
param_grid = {'n_estimators': list(range(100, 200, 10)),
              'max_features': list(range(1, 5))}

# Setup the grid search
grid = GridSearchCV(estimator=RandomForestClassifier(),
                    param_grid=param_grid,
                    cv=5,
                    verbose=1) 

# Fit the grid search to the data
grid.fit(X, y)

# Find the best parameters
print(f"The best parameter values are: {grid.best_params_}")
print(f"With a score of: {grid.best_score_*100:.2f}%")

Fitting 5 folds for each of 40 candidates, totalling 200 fits
The best parameter values are: {'max_features': 1, 'n_estimators': 160}
With a score of: 84.14%
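
Beyond the single best result, a fitted grid search keeps the scores for every combination it tried in its `cv_results_` attribute. The following is a minimal sketch on synthetic data with a deliberately tiny grid (the data and grid values are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# A deliberately small grid so the search runs quickly
demo_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                         param_grid={"n_estimators": [10, 20],
                                     "max_features": [1, 2]},
                         cv=3)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one entry per hyperparameter combination
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[["params", "mean_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

Looking at the full table can show whether the best combination is a clear winner or only marginally ahead of the others.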


We can extract the best model/estimator with the `best_estimator_` attribute.

In [106]:
# Set the model to be the best estimator
clf = grid.best_estimator_
clf

Now that we've got the best cross-validated model, we can fit and score it on our original single train/test split of the data.

In [107]:
# Fit the best model
clf = clf.fit(X_train, y_train)

# Find the best model scores on our single test split
# (note: this may differ from the cross-validation score since it's only on one split of the data)
print(f"Best model score on single split of the data: {clf.score(X_test, y_test)*100:.2f}%")

Best model score on single split of the data: 85.53%
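
One possible variation, sketched here on synthetic data, is to run the grid search on the training split only, so the test split plays no part in choosing hyperparameters. This is an alternative arrangement rather than what the cells above do, and the data and grid values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                      param_grid={"n_estimators": [10, 30]},
                      cv=3)

# Cross-validation happens inside the training split only
search.fit(X_tr, y_tr)

# With the default refit=True, the best model is refit on all of X_tr,
# so it can be scored on the held-out test split straight away
held_out_score = search.score(X_te, y_te)
print(f"Best params: {search.best_params_}")
print(f"Held-out test accuracy: {held_out_score * 100:.2f}%")
```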


#### 6. Save a model for someone else to use

When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

Saving a model also allows you to reuse it later without retraining, which is especially helpful as training times start to increase.

You can [save a Scikit-Learn model](https://scikit-learn.org/stable/model_persistence.html) using Python's built-in [`pickle` module](https://docs.python.org/3/library/pickle.html).

In [108]:
import pickle

# Save an existing model to file (the context manager closes the file afterwards)
with open("random_forest_model_1_grid.pkl", "wb") as f:
    pickle.dump(grid, f)

In [109]:
# Load a saved pickle model and evaluate it
with open("random_forest_model_1_grid.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)
print(f"Loaded pickle model prediction score: {loaded_pickle_model.score(X_test, y_test) * 100:.2f}%")

Loaded pickle model prediction score: 85.53%
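
An alternative to `pickle` is the `joblib` library, which is installed alongside Scikit-Learn. The following is a minimal sketch on synthetic data; the filename `demo_model.joblib` is hypothetical and the file is written to a temporary directory so nothing is left behind:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small model (illustrative only)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical filename
    dump(demo_clf, model_path)
    loaded_clf = load(model_path)

# The loaded model should make the same predictions as the original
same_predictions = (loaded_clf.predict(X_demo) == demo_clf.predict(X_demo)).all()
print(same_predictions)  # True
```

As with `pickle`, only load model files from sources you trust, since loading can execute arbitrary code.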
