# Classification using Scikit-learn Pipelines

## Objectives

- Understand logistic regression for classification
- Use pandas for data exploration
- Explore a way to build machine learning pipelines with scikit-learn
- Understand the confusion matrix as an evaluation metric for classification

## Outline 

- Classification in Machine Learning
- Logistic Regression
- Titanic Example
- Confusion Matrix

## Classification in Machine Learning

The most common supervised learning tasks in machine learning are regression (predicting values) and classification (predicting classes); this notebook focuses on the latter. Classification is the process of assigning the items in a data set to classes, and it can be performed on both structured and unstructured data. There are many classification algorithms, but this notebook focuses on one: logistic regression.

## Logistic Regression

Logistic regression is a classification algorithm that is commonly used to estimate the probability that a given instance belongs to a particular class (e.g., what is the probability that a given email is spam?). If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class, labeled “1”); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.
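As a rough sketch of the thresholding rule described above: logistic regression computes a weighted sum of the features and passes it through the logistic (sigmoid) function to get a probability. The weights, bias, and instances below are made up purely for illustration; they are not learned from any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a model with two features.
weights = np.array([1.5, -0.5])
bias = -0.25

# Two made-up instances (one row per instance).
instances = np.array([[2.0, 1.0],
                      [0.0, 3.0]])

# Estimated probability of the positive class for each instance.
probabilities = sigmoid(instances @ weights + bias)

# Predict 1 (positive class) when the probability exceeds 50%, else 0.
predictions = (probabilities > 0.5).astype(int)
print(probabilities, predictions)
```

The first instance gets a positive score and lands above the 50% threshold, while the second gets a negative score and lands below it.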

## Titanic Example

As an example, we will be using the classic [Titanic challenge](https://www.kaggle.com/c/titanic) from [Kaggle](https://www.kaggle.com/). The goal is to predict whether or not a passenger survived based on attributes such as age, sex, passenger class, where they embarked and so on. 

Scikit-Learn provides many helper functions to download popular datasets, and the Titanic dataset is one of them. Let's import the common Python libraries and load the data using fetch_openml.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 


from sklearn.datasets import fetch_openml # Load data from https://www.openml.org/d/40945
np.random.seed(42)  # to make this notebook's output identical at every run
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

Let's look at the data by calling .head(). By default, this shows the first five rows.

In [2]:
X.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


The attributes have the following meaning:
* **pclass**: passenger class.
* **name**, **sex**, **age**: self-explanatory
* **sibsp**: how many siblings & spouses of the passenger aboard the Titanic.
* **parch**: how many children & parents of the passenger aboard the Titanic.
* **ticket**: ticket id
* **fare**: price paid (in pounds)
* **cabin**: passenger's cabin number
* **embarked**: where the passenger embarked the Titanic
* **boat**: Lifeboat
* **body**: Body Identification Number
* **home.dest**: Home/Destination

The name, boat, home.dest and ticket attributes may have some value, but they are tricky to convert into useful numbers that a model can consume, so for now we will ignore them.
We will also split the dataset into a training set and a test set.

In [3]:
from sklearn.model_selection import train_test_split
X.drop(['name','boat', 'home.dest','ticket'], axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
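To see what the stratify argument in the split above does, here is a small sketch on made-up labels (the toy_ names are hypothetical and unrelated to the Titanic data). With stratification, the class proportions in the training and test sets should match those of the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 70 zeros and 30 ones.
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, test_size=0.2, random_state=42
)

# Fraction of positive labels in each split.
print(toy_y_train.mean(), toy_y_test.mean())
```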

Let's get more info to see how much data is missing:

In [4]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1047 entries, 999 to 668
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   pclass    1047 non-null   float64 
 1   sex       1047 non-null   category
 2   age       838 non-null    float64 
 3   sibsp     1047 non-null   float64 
 4   parch     1047 non-null   float64 
 5   fare      1046 non-null   float64 
 6   cabin     225 non-null    object  
 7   embarked  1047 non-null   category
 8   body      92 non-null     float64 
dtypes: category(2), float64(6), object(1)
memory usage: 67.7+ KB


In [5]:
X_train.isnull().any()

pclass      False
sex         False
age          True
sibsp       False
parch       False
fare         True
cabin        True
embarked    False
body         True
dtype: bool

In [6]:
X_train.isnull().sum()

pclass        0
sex           0
age         209
sibsp         0
parch         0
fare          1
cabin       822
embarked      0
body        955
dtype: int64

In [7]:
X_train.isnull().sum()/len(X_train) * 100

pclass       0.000000
sex          0.000000
age         19.961796
sibsp        0.000000
parch        0.000000
fare         0.095511
cabin       78.510029
embarked     0.000000
body        91.212989
dtype: float64

Okay, the age, fare, cabin and body attributes are sometimes null (fewer than 1047 non-null values), especially cabin (about 79% null) and body (about 91% null). We will drop cabin and body for now and focus on the rest. The age attribute has about 20% null values, so we need to decide what to do with them; replacing them with the median age seems reasonable.
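As a minimal illustration of median imputation, here is a sketch on a handful of made-up ages (not taken from the dataset), using scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages with two missing values (one column, five rows).
toy_ages = np.array([[22.0], [np.nan], [35.0], [58.0], [np.nan]])

imputer = SimpleImputer(strategy="median")
filled_ages = imputer.fit_transform(toy_ages)

print(imputer.statistics_)   # the median learned from the observed values
print(filled_ages.ravel())   # missing entries replaced by that median
```

The imputer learns the median from the observed values only, then writes it into every missing slot.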

In [8]:
# Reassign instead of dropping in place: X_train and X_test are slices of X,
# so an in-place drop triggers pandas' SettingWithCopyWarning.
X_train = X_train.drop(columns=['cabin', 'body'])
X_test = X_test.drop(columns=['cabin', 'body'])


We can see below that the columns have been removed

In [9]:
X_train.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
999,3.0,female,,0.0,0.0,7.75,Q
392,2.0,female,24.0,1.0,0.0,27.7208,C
628,3.0,female,11.0,4.0,2.0,31.275,S
1165,3.0,male,25.0,0.0,0.0,7.225,C
604,3.0,female,16.0,0.0,0.0,7.65,S


Let's take a look at the numerical attributes:

In [10]:
X_train.describe()

Unnamed: 0,pclass,age,sibsp,parch,fare
count,1047.0,838.0,1047.0,1047.0,1046.0
mean,2.314231,29.604316,0.484241,0.385864,32.166838
std,0.831742,14.362736,1.01022,0.862492,48.91582
min,1.0,0.3333,0.0,0.0,0.0
25%,2.0,21.0,0.0,0.0,7.8958
50%,3.0,28.0,0.0,0.0,13.81665
75%,3.0,38.375,1.0,0.0,30.92395
max,3.0,80.0,8.0,9.0,512.3292


Now let's take a quick look at the categorical attributes. Note that pclass can be treated as a categorical attribute even though it is stored as a numerical variable of type float64, because it only takes one of the discrete values 1.0, 2.0 and 3.0.

In [15]:
X_train['pclass'].value_counts()

3.0    578
1.0    249
2.0    220
Name: pclass, dtype: int64

In [16]:
X_train['sex'].value_counts()

male      675
female    372
Name: sex, dtype: int64

In [17]:
X_train['embarked'].value_counts()

S    730
C    214
Q    103
Name: embarked, dtype: int64

The embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.

In [18]:
X_train.dtypes

pclass       float64
sex         category
age          float64
sibsp        float64
parch        float64
fare         float64
embarked    category
dtype: object

## Scikit-learn Pipelines and Imputers

Imputation is the technique of replacing missing values, and there are many strategies for doing it. From the analysis above, we know that the remaining columns with missing values in the training set are age and fare. Our attributes are of two kinds, numerical and categorical, so we need two separate pipelines, one for each kind. The categorical pipeline also gets an imputer, so that missing categories in unseen data are handled too.

Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline
for each of the numerical and categorical attributes:

In [19]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ['age', 'fare', 'sibsp', 'parch']
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

cat_cols = ['embarked', 'sex', 'pclass']
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),  # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
])

In [20]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ])

Here, we have declared two small pipelines and a ColumnTransformer that ties them together. The numerical pipeline fills missing values with the median and then standardizes each column; the categorical pipeline fills missing values with the most frequent category and then one-hot encodes the result. The ColumnTransformer takes the list of numerical column names and the list of categorical column names and applies each pipeline only to its own columns. This is useful because, by default, an imputer or transformer is applied to every column of the data it receives.
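To make the column routing concrete, here is a small sketch on a made-up two-column DataFrame (the toy names are hypothetical). It uses a simplified version of the idea, with a median imputer for the numerical column and a one-hot encoder for the categorical one, and shows that each transformer only touches the columns it was given.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one numerical column with a gap, one categorical column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 40.0, 30.0],
    "sex": ["male", "female", "female", "male"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix depending on its density; make it dense.
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()
toy_out = np.asarray(toy_out, dtype=float)

# Columns: imputed age, then one indicator column per category of sex.
print(toy_out)
```

The output has three columns: the imputed age followed by one indicator column for each of the two categories.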

## Select and train the model

Great! We framed the problem, got the data and explored it, sampled a training set and a test set, and wrote transformation pipelines to clean up and prepare our data for machine learning algorithms automatically. We are now ready to select and train a machine learning model. To do this, we will use Scikit-learn's LogisticRegression as our classifier, chained after the preprocessor in a single pipeline.

In [21]:
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fare', 'sibsp',
                                                   'parch']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
 

Our model is trained. Let's visualize the overall pipeline so far.

In [22]:
from sklearn import set_config

set_config(display="diagram")
clf

Great, our model is trained and we have a clear picture of how the pipeline works. Let's use it to make predictions on the test set:

In [23]:
y_pred = clf.predict(X_test)

We can now compute the accuracy, a common evaluation metric for classifiers. Accuracy is the proportion of correct predictions over total predictions:

accuracy = correct_predictions / total_predictions

The score method of a scikit-learn classifier returns this accuracy, so we can compute it directly from the pipeline:

In [24]:
score = clf.score(X_test, y_test)
print('Test Accuracy Score', score)

Test Accuracy Score 0.8053435114503816


Why don't we use cross-validation to get a better idea of how good our model is?
Scikit-learn has a cross_val_score function that trains and evaluates the model on several different splits of the training data, which shows how well it generalizes.

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")

We can then see the range of how our scores are doing:

In [None]:
scores = pd.Series(scores)
scores.min(), scores.mean(), scores.max()

So our accuracy ranges from 0.74 to 0.81 across the folds, with a mean of about 0.77.
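To show roughly what cross_val_score does internally, here is a sketch that reproduces it with an explicit loop. It assumes synthetic data from make_classification and a plain LogisticRegression rather than the Titanic pipeline; for a classifier and an integer cv, scikit-learn uses stratified folds without shuffling, which is what the loop mimics.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the Titanic dataset).
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(toy_X, toy_y):
    # Fit a fresh copy of the model on the training folds only...
    fold_model = clone(model).fit(toy_X[train_idx], toy_y[train_idx])
    # ...and measure accuracy on the held-out fold.
    manual_scores.append(fold_model.score(toy_X[val_idx], toy_y[val_idx]))

library_scores = cross_val_score(model, toy_X, toy_y, cv=5, scoring="accuracy")
print(np.round(manual_scores, 3), np.round(library_scores, 3))
```

Each fold serves once as the validation set, so every instance is used for evaluation exactly once.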

Accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with
skewed datasets (i.e., when some classes are much more frequent than others).
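A quick sketch of why accuracy can mislead on skewed data: on made-up labels with 90% negatives, a baseline that always predicts the majority class looks accurate while never finding a single positive. DummyClassifier is scikit-learn's built-in baseline estimator; the data here is invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up skewed labels: 90 negatives and 10 positives.
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_pred = baseline.predict(toy_X)

baseline_accuracy = accuracy_score(toy_y, baseline_pred)
baseline_recall = recall_score(toy_y, baseline_pred)
print(baseline_accuracy, baseline_recall)
```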

## Confusion Matrix

A much better way to evaluate the performance of a classifier is to look at
the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B.
The confusion matrix will provide a way to look at things like the accuracy, precision, recall and the F1 score. Let's take a look.
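As a small illustration on made-up labels (not the Titanic predictions), the sketch below extracts the four counts from a binary confusion matrix and derives precision, recall and F1 from them by hand, then compares with scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true labels and predictions for ten instances.
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# For binary labels, rows are true classes and columns are predicted classes,
# so flattening yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
print(precision_score(toy_true, toy_pred),
      recall_score(toy_true, toy_pred),
      f1_score(toy_true, toy_pred))
```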

One useful function in sklearn is the classification_report() function, which, as the name implies, gives us a comprehensive report of many widely-used metrics, such as precision, recall, and the F1 score.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

The report suggests that the accuracy of our model on the test dataset is about 80 percent. We can verify this ourselves by comparing the predictions with the true labels element by element:

In [None]:
sum(y_pred == y_test) / len(y_pred)

Let’s end this task by looking at the confusion matrix, which compactly encodes the four counts used in model evaluation: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Precision and recall are both computed from these counts: precision uses TP and FP, and recall uses TP and FN.


The confusion matrix shows that our model does well at identifying the passengers who actually died, but rather poorly at identifying those who survived (look at the diagonal of the matrix). This kind of insight cannot be obtained from accuracy alone, which is why plotting the confusion matrix is a good way to get a sense of the model’s performance.

In [None]:
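from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (available since 1.0) replaces it.
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap=plt.cm.Blues, normalize='true')
plt.show()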

## Exercise

1) Use another classification model, e.g. a random forest or a support vector machine, and see if the performance changes.

2) Use grid search (GridSearchCV) or randomized search (RandomizedSearchCV) to find the best hyperparameters.