# DSTA Lab: Classification with Scikit-learn

This notebook is available from the [DSTA repo (download only)](https://www.dcs.bbk.ac.uk/~ale/dsta/)

Data is imported from the [Openml.org](https://openml.org/) public repository.

### Supervised Classification with the Python Scikit-learn module

#### Slides and code are courtesy of [Andreas C. Mueller, NYU](https://github.com/amueller/)

### Case studies:
1. **Classification with the blood transfusion dataset from OpenML:**

    - Fetched through sklearn; see the ``fetch_openml`` import statement further below.
    - Details about the dataset: [https://www.openml.org/d/1464](https://www.openml.org/d/1464).


2. **Classification with the Iris dataset from Sklearn:**
    - Imported from sklearn.datasets.
    - Dataset studied during last week's class and lab session.


#### Package Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_openml, load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# from sklearn.linear_model import LogisticRegression

## Classification with the blood transfusion dataset

### Fetch the dataset from sklearn

The sklearn package bundles a few toy datasets and can also download others from OpenML with ``fetch_openml``.
The blood transfusion dataset is downloaded this way (please check the link at the top of this notebook).
Below, the dataset is loaded as a scikit-learn ``Bunch`` object.
The actual data (X, Y) are the ``data`` and ``target`` attributes of the object.

In [2]:
# Fetch the data - returned as a sklearn.utils.Bunch object
blood_data = fetch_openml("blood-transfusion-service-center")

print(f"blood dataset object type: {type(blood_data)}")
print(f"Attributes of the loaded Python object: {dir(blood_data)}")

blood dataset object type: <class 'sklearn.utils._bunch.Bunch'>
Attributes of the loaded Python object: ['DESCR', 'categories', 'data', 'details', 'feature_names', 'frame', 'target', 'target_names', 'url']


### Check predictors X and target Y variable names and data size

In [3]:
print(f"Predictors X variable names: {blood_data.feature_names}")
print(f"Target Y variable name: {blood_data.target_names}")
print(f"X data size: {blood_data.data.shape}")

Predictors X variable names: ['V1', 'V2', 'V3', 'V4']
Target Y variable name: ['Class']
X data size: (748, 4)


### Check the type of X and Y data

X is a pandas.DataFrame and Y is a pandas.Series.
These are the core data structures of the pandas package.

In [4]:
print(f"Type of X data: {type(blood_data.data)}")
print(f"Type of Y data: {type(blood_data.target)}")

Type of X data: <class 'pandas.core.frame.DataFrame'>
Type of Y data: <class 'pandas.core.series.Series'>


### Print the first 5 rows of the predictive features

In [5]:
blood_data.data.head()

|   | V1 | V2 | V3 | V4 |
|---|----|----|----|----|
| 0 | 2 | 50 | 12500 | 98 |
| 1 | 0 | 13 | 3250 | 28 |
| 2 | 1 | 16 | 4000 | 35 |
| 3 | 2 | 20 | 5000 | 45 |
| 4 | 1 | 24 | 6000 | 77 |


### Print the first 5 values of the target variable

In [6]:
blood_data.target.head()

0    2
1    2
2    2
3    2
4    1
Name: Class, dtype: category
Categories (2, object): ['1', '2']

### Check class distribution of Y

In [7]:
blood_data.target.value_counts()

Class
1    570
2    178
Name: count, dtype: int64

The target variable is imbalanced: 570 of the 748 samples (about 76%) belong to class 1 and only 178 (about 24%) to class 2.
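With a class distribution this uneven, raw accuracy needs a reference point. The sketch below uses only the class counts printed above to compute the class proportions and the accuracy that a trivial classifier would reach by always predicting the majority class. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {"1": 570, "2": 178}

total = sum(class_counts.values())
proportions = {label: round(count / total, 3) for label, count in class_counts.items()}

# A trivial classifier that always predicts the most frequent class
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(f"Class proportions: {proportions}")
print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {round(baseline_accuracy, 3)}")
```

Any accuracy reported for this dataset is best read against this baseline rather than against 0.5.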

### Use ``train_test_split`` to prepare your train and test data

As seen above, the class distribution is imbalanced, so a plain random split may leave the training and test sets with different class proportions.
Hint: Look for a "stratified" ``train_test_split``!

Package documentation: [sklearn train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [8]:
x_train, x_test, y_train, y_test = train_test_split(
    blood_data.data,
    blood_data.target,
    random_state=0,
    stratify=blood_data.target
    )
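To see what ``stratify`` changes, the following sketch splits synthetic labels that have the same class counts as the blood transfusion target (570 and 178), once without and once with stratification. The arrays ``X_demo`` and ``y_demo`` and the helper ``class_1_share`` are made up for illustration; only ``train_test_split`` and its ``stratify`` and ``random_state`` parameters come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the blood transfusion target
y_demo = np.array([1] * 570 + [2] * 178)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column


def class_1_share(labels):
    """Fraction of samples that belong to class 1."""
    return float(np.mean(labels == 1))


# Plain random split: class shares may drift between train and test
_, _, y_tr_plain, y_te_plain = train_test_split(X_demo, y_demo, random_state=0)

# Stratified split: class shares are preserved up to rounding
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print(f"Overall class 1 share:    {class_1_share(y_demo):.3f}")
print(f"Plain split (train/test): {class_1_share(y_tr_plain):.3f} / {class_1_share(y_te_plain):.3f}")
print(f"Stratified (train/test):  {class_1_share(y_tr_strat):.3f} / {class_1_share(y_te_strat):.3f}")
```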

### Use ``StandardScaler`` from sklearn to standardize the predictors.

Package documentation: [sklearn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

Alternatively, once ``StandardScaler`` has been imported, use ``help(StandardScaler)`` to print its documentation.
You can use the Python ``help`` function to check the documentation of any function or class.

In [9]:
scaler = StandardScaler()

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
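The scaler is fitted on the training data only and then reused on the test data, so the test set does not influence the preprocessing. As a minimal sketch on made-up numbers (the arrays ``train`` and ``test`` are not from the notebook), the following checks that ``transform`` computes (x - mean) / std with the mean and standard deviation learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # illustrative data
test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual standardization with the training mean and population std (ddof=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(f"transform matches manual formula: {np.allclose(test_scaled, manual_test)}")
print(f"Scaled training means: {train_scaled.mean(axis=0).round(6)}")
print(f"Scaled training stds:  {train_scaled.std(axis=0).round(6)}")
```

The scaled test set will generally not have exactly zero mean and unit variance, because it is standardized with the training statistics.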

### Check class distribution in training and test Y.

Hint: The ``value_counts()`` method can help here

In [10]:
print(f"Training Y class count: \n{y_train.value_counts()}\n")
print(f"Test Y class count: \n{y_test.value_counts()}")

Training Y class count: 
Class
1    427
2    134
Name: count, dtype: int64

Test Y class count: 
Class
1    143
2     44
Name: count, dtype: int64


### Use ``LabelEncoder`` from sklearn to encode target labels with values between 0 and n_classes-1.

Package documentation: [sklearn LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

In [11]:
label_encoder = LabelEncoder()

y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)

mappings = {label: i for i, label in enumerate(label_encoder.classes_)}

print(f"Label Encoder Mapping: {mappings}")

Label Encoder Mapping: {'1': 0, '2': 1}
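The integer codes can be mapped back to the original labels with ``inverse_transform``, which is handy when reporting predictions. A small sketch with made-up string labels that mimic the ``'1'``/``'2'`` categories above (the list ``labels`` is illustrative, not the real target):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["2", "1", "1", "2", "1"]  # illustrative labels, not the real target

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # classes are sorted: '1' -> 0, '2' -> 1
decoded = encoder.inverse_transform(codes)   # back to the original strings

print(f"Classes: {list(encoder.classes_)}")
print(f"Encoded: {codes.tolist()}")
print(f"Decoded: {decoded.tolist()}")
```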


### Use the ``shape`` attribute again to check the dimensions of training and test X.

In [12]:
print(x_train.shape)
print(x_test.shape)

(561, 4)
(187, 4)


### Classify with K-nn

#### Check ``KNeighborsClassifier`` documentation:
[sklearn KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

#### Fit K-nn model

In [13]:
K=5
knn_classifier = KNeighborsClassifier(n_neighbors=K)

knn_classifier.fit(x_train, y_train)

#### Calculate K-nn training and test data accuracy

In [14]:
knn_train_accuracy = knn_classifier.score(x_train, y_train)
knn_test_accuracy = knn_classifier.score(x_test, y_test)

print(f"K-nn training data accuracy: {round(knn_train_accuracy, 3)}")
print(f"K-nn test data accuracy: {round(knn_test_accuracy, 3)}")

K-nn training data accuracy: 0.799
K-nn test data accuracy: 0.733
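For classifiers, ``score`` returns the mean accuracy, that is, the fraction of samples whose predicted label equals the true label. The sketch below checks this on a synthetic dataset; ``make_classification``, ``accuracy_score`` and the variable names are assumptions made for the illustration and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic binary problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_demo, y_demo)

predictions = knn.predict(X_demo)
manual_accuracy = np.mean(predictions == y_demo)  # fraction of correct predictions

print(f"score():          {knn.score(X_demo, y_demo):.3f}")
print(f"accuracy_score(): {accuracy_score(y_demo, predictions):.3f}")
print(f"manual mean:      {manual_accuracy:.3f}")
```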


#### Use Grid Search and Cross Validation to find the best number of neighbors

The default option of 5-fold cross validation is used.
GridSearchCV documentation: [sklearn GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [15]:
# Define parameter grid
num_neighbors = np.array([1, 3, 5, 8, 10, 15, 20, 25, 30])
param_grid = dict(n_neighbors=num_neighbors)

param_grid

{'n_neighbors': array([ 1,  3,  5,  8, 10, 15, 20, 25, 30])}

In [16]:
# Initialize model
knn_model = KNeighborsClassifier()
grid = GridSearchCV(
    estimator=knn_model, 
    param_grid=param_grid,
    scoring="accuracy"
    )

# Run grid search
grid.fit(x_train, y_train)
best_n = grid.best_estimator_.n_neighbors
best_score = round(grid.best_score_, 3)

print(f"Best number of neighbors: {best_n}")
print(f"Best mean cross-validated accuracy for {best_n} neighbors: {best_score}")

Best number of neighbors: 25
Best mean cross-validated accuracy for 25 neighbors: 0.8
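Note that ``grid.best_score_`` is the mean cross-validated accuracy computed on folds of the training data; the held-out test set is not used during the search. The following is a minimal sketch of how to inspect the per-candidate scores in ``cv_results_`` and then evaluate the refitted best model on the test set. It uses the bundled Iris data so that it runs without a download, and the names ending in ``_demo`` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

grid_demo = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 8, 10, 15, 20, 25, 30]},
    scoring="accuracy",
)
grid_demo.fit(X_tr, y_tr)

# Mean cross-validated accuracy for every candidate value of n_neighbors
for params, mean_score in zip(
    grid_demo.cv_results_["params"], grid_demo.cv_results_["mean_test_score"]
):
    print(f"{params}: {mean_score:.3f}")

# With the default refit=True, the best model is refitted on all training data
test_accuracy_demo = grid_demo.score(X_te, y_te)

print(f"Best params: {grid_demo.best_params_}")
print(f"Cross-validated accuracy: {grid_demo.best_score_:.3f}")
print(f"Held-out test accuracy:   {test_accuracy_demo:.3f}")
```

Despite its name, ``mean_test_score`` in ``cv_results_`` refers to the validation folds inside the training data, not to the held-out test set.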


### Classification with the Iris dataset

#### Fetch the dataset from sklearn

In [17]:
# Check load_iris documentation
iris_df, iris_y = load_iris(return_X_y=True, as_frame=True)

#### Check predictors X variable names and data size

In [18]:
print(f"Predictors X variable names: {iris_df.columns}")
print(f"X data size: {iris_df.shape}")

Predictors X variable names: Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
X data size: (150, 4)


#### Check the type of X and Y data

X is a pandas.DataFrame and Y is a pandas.Series.
These are the core data structures of the pandas package.

In [19]:
print(f"Type of X data: {type(iris_df)}")
print(f"Type of Y data: {type(iris_y)}")

Type of X data: <class 'pandas.core.frame.DataFrame'>
Type of Y data: <class 'pandas.core.series.Series'>


#### Print the first 5 rows of the predictive features

In [20]:
iris_df.head()

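|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|-------------------|------------------|-------------------|------------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |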


#### Print the first 5 values of the target variable

In [21]:
iris_y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

#### Check class distribution of Y

In [22]:
iris_y.value_counts()

target
0    50
1    50
2    50
Name: count, dtype: int64

#### Use ``train_test_split`` to prepare your train and test data

Further details are available from the docs: [sklearn train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [23]:
iris_x_train, iris_x_test, iris_y_train, iris_y_test = train_test_split(
    iris_df,
    iris_y,
    random_state=0,
    stratify=iris_y
    )

#### Use ``StandardScaler`` from sklearn to standardize the predictors.

Package documentation: [sklearn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

Alternatively, once ``StandardScaler`` has been imported, use ``help(StandardScaler)`` to print its documentation.
You can use the Python ``help`` function to check the documentation of any function or class.

In [24]:
scaler = StandardScaler()

iris_x_train = scaler.fit_transform(iris_x_train)
iris_x_test = scaler.transform(iris_x_test)

#### Check class distribution in training and test Y.

Hint: The ``value_counts()`` method can help here

In [25]:
print(f"Training Y class count: \n{iris_y_train.value_counts()}\n")
print(f"Test Y class count: \n{iris_y_test.value_counts()}")

Training Y class count: 
target
2    38
1    37
0    37
Name: count, dtype: int64

Test Y class count: 
target
0    13
1    13
2    12
Name: count, dtype: int64


#### Use the ``shape`` attribute again to check the dimensions of training and test X.

In [26]:
print(iris_x_train.shape)
print(iris_x_test.shape)

(112, 4)
(38, 4)


#### Classify with K-nn

##### Fit K-nn model

In [27]:
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(iris_x_train, iris_y_train)

##### Calculate K-nn training and test data accuracy

In [28]:
knn_train_accuracy = knn_classifier.score(iris_x_train, iris_y_train)
knn_test_accuracy = knn_classifier.score(iris_x_test, iris_y_test)

print(f"K-nn training data accuracy: {round(knn_train_accuracy, 3)}")
print(f"K-nn test data accuracy: {round(knn_test_accuracy, 3)}")

K-nn training data accuracy: 0.973
K-nn test data accuracy: 0.974


##### Use Grid Search and Cross Validation to find the best number of neighbors

The default option of 5-fold cross validation is used.
GridSearchCV documentation: [sklearn GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [29]:
# Define parameter grid
num_neighbors = np.array([1, 3, 5, 8, 10, 15, 20, 25, 30])
param_grid = dict(n_neighbors=num_neighbors)

param_grid

{'n_neighbors': array([ 1,  3,  5,  8, 10, 15, 20, 25, 30])}

In [30]:
# Initialize model
knn_model = KNeighborsClassifier()

grid = GridSearchCV(
    estimator=knn_model,
    param_grid=param_grid,
    scoring="accuracy"
    )

# Run grid search
grid.fit(iris_x_train, iris_y_train)
best_n = grid.best_estimator_.n_neighbors
best_score = round(grid.best_score_, 3)

print(f"Best number of neighbors: {best_n}")
print(f"Best mean cross-validated accuracy for {best_n} neighbors: {best_score}")

Best number of neighbors: 8
Best mean cross-validated accuracy for 8 neighbors: 0.938


### In-class Exercise


Choose either the blood transfusion or the Iris dataset.

Then train and evaluate ``sklearn.linear_model.LogisticRegression`` on the chosen dataset.

How does it perform on the training set vs. the test set?



In [31]:
from sklearn.linear_model import LogisticRegression

# Iris: with liblinear, one binary (one-vs-rest) classifier is fitted per class
logreg = LogisticRegression(random_state=42, solver="liblinear")
logreg.fit(iris_x_train, iris_y_train)

lr_train_accuracy = logreg.score(iris_x_train, iris_y_train)
lr_test_accuracy = logreg.score(iris_x_test, iris_y_test)

print(f"Logistic Regression training data accuracy: {round(lr_train_accuracy, 3)}")
print(f"Logistic Regression test data accuracy: {round(lr_test_accuracy, 3)}")

Logistic Regression training data accuracy: 0.938
Logistic Regression test data accuracy: 0.842


In [32]:
from sklearn.linear_model import LogisticRegression

# Blood transfusion: binary target, default solver settings
logreg = LogisticRegression()
logreg.fit(x_train, y_train)

lr_train_accuracy = logreg.score(x_train, y_train)
lr_test_accuracy = logreg.score(x_test, y_test)

print(f"Logistic Regression training data accuracy: {round(lr_train_accuracy, 3)}")
print(f"Logistic Regression test data accuracy: {round(lr_test_accuracy, 3)}")


Logistic Regression training data accuracy: 0.77
Logistic Regression test data accuracy: 0.781


## Take-home Exercise (optional)

Can you construct a binary classification dataset (using ``np.random``, for example) on which ``sklearn.linear_model.LogisticRegression`` achieves an accuracy of 1?


In [None]:
# TODO: Place your code here.
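
One possible approach, offered as a sketch rather than the intended solution: draw the two classes from regions that do not overlap, so that a straight line separates them. The sampling ranges and the variable names below are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two clouds of points that a straight line can separate
X_neg = rng.uniform(low=-2.0, high=-1.0, size=(50, 2))  # class 0
X_pos = rng.uniform(low=1.0, high=2.0, size=(50, 2))    # class 1

X_sep = np.vstack([X_neg, X_pos])
y_sep = np.array([0] * 50 + [1] * 50)

logreg_sep = LogisticRegression()
logreg_sep.fit(X_sep, y_sep)

accuracy_sep = logreg_sep.score(X_sep, y_sep)
print(f"Accuracy on linearly separable data: {accuracy_sep}")
```

Logistic regression learns a linear decision boundary, so any dataset whose classes are linearly separable with a clear margin is a natural candidate.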