# Chapter 3. Classification Walkthrough: Titanic Dataset
Machine Learning Pocket Reference by Matt Harrison Published by O'Reilly Media, Inc., 2019 

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

## Ask a Question

In this example, we want to create a **predictive model** to answer a question. It will classify whether an individual survives the Titanic ship catastrophe based on individual and trip characteristics. This is a toy example, but it serves as a pedagogical tool for showing many steps of modeling. Our model should be able to take passenger information and predict whether that passenger would survive on the Titanic.

This is a **classification** question, as we are predicting a label for **survival**; either they survived or they died. So we have a **binary classification** task.

## Terms for Data

We typically train a model with a **matrix of data**. (I prefer to use pandas `DataFrames` because it is very nice to have column labels, but numpy arrays work as well.)

For **supervised learning**, such as **regression or classification**, our intent is to have a function that transforms features into a label. If we were to write this as an algebra formula, it would look like this:
$$y = f(\textbf{X})$$

$X$ is a matrix. Each **row represents a sample of data** or information about an individual. Every **column in X is a feature**. The **output** of our function, y, is a vector that contains **labels** (for classification) or **values** (for regression).

![](https://learning.oreilly.com/library/view/machine-learning-pocket/9781492047537/assets/mlpr_0301.png)

This is the standard convention for naming the data and the output. Academic papers and library documentation follow it as well. In Python, we use the variable name $X$ to hold the sample data even though capitalized variable names violate standard naming conventions (PEP 8). Don’t worry, everyone does it, and if you were to name your variable x, people might look at you funny. The variable $y$ stores the labels or targets.

The table shows a basic dataset with two samples and three features for each sample.

|pclass|age|sibsp|
|------|---|-----|
|1|29|0|
|1|2|1|
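
As a minimal sketch, the table above can be written as a pandas DataFrame `X` with a matching label Series `y`. The label values here are made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Features from the two-sample table: one row per sample, one column per feature.
X = pd.DataFrame({"pclass": [1, 1], "age": [29, 2], "sibsp": [0, 1]})

# Hypothetical labels (1 = survived, 0 = died), one per row of X.
y = pd.Series([1, 0], name="survived")

print(X.shape)  # (number of samples, number of features)
print(y.shape)  # one label per sample
```

The only requirement that matters here is that `X` and `y` have the same number of rows, in the same order.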

## Gather Data

We are going to load an **Excel file** (make sure you have `pandas` and `xlrd` installed) with the Titanic features. It has many columns, including a `survived` column that contains the label of what happened to an individual:

In [None]:
url = (
     "http://biostat.mc.vanderbilt.edu/"
     "wiki/pub/Main/DataSets/titanic3.xls"
)
df = pd.read_excel(url)
orig_df = df

The following columns are included in the dataset:

- `pclass` Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- `survived` Survival (0 = No, 1 = Yes)
- `name` Name
- `sex` Sex
- `age` Age
- `sibsp` Number of siblings/spouses aboard
- `parch` Number of parents/children aboard
- `ticket` Ticket number
- `fare` Passenger fare
- `cabin` Cabin
- `embarked` Point of embarkation (`C` = Cherbourg, `Q` = Queenstown, `S` = Southampton)
- `boat` Lifeboat
- `body` Body identification number
- `home.dest` Home/destination

Pandas can read this spreadsheet and convert it into a `DataFrame` for us. We will need to spot-check the data and ensure that it is OK for performing analysis.

## Clean Data

Once we have the data, we need to ensure that it is in a format that we can use to create a model. Most scikit-learn models require that our features be **numeric** (`integer` or `float`). In addition, many models fail if they are passed **missing values** (`NaN` in pandas or numpy). Some models perform better if the data is **standardized** (given a mean value of $0$ and a standard deviation of $1$). We will deal with these issues using pandas or scikit-learn. In addition, the Titanic dataset has leaky features.

**Leaky features** are variables that contain information about the future or target. There’s nothing bad in having data about the target, and we often have that data during model creation time. However, if those variables are not available when we perform a prediction on a new sample, we should remove them from the model as they are leaking data from the future.

Cleaning the data can take a bit of time. It helps to have access to a subject matter expert (SME) who can provide guidance on dealing with outliers or missing data.

In [None]:
df.dtypes

We typically see `int64, float64, datetime64[ns]`, or `object`. These are the types that pandas uses to store a column of data. `int64` and `float64` are **numeric types**. `datetime64[ns]` holds **date and time data**. `object` typically means that it is holding **string data**, though it could be a combination of string and other types.

When reading from CSV files, pandas will try to coerce data into the appropriate type, but will fall back to object. Reading data from spreadsheets, databases, or other systems may provide better types in the DataFrame. In any case, it is worthwhile to look through the data and ensure that the types make sense.

Integer types are typically fine. Float types might have some missing values. Date and string types will need to be converted or used to feature engineer numeric types. String types that have **low cardinality** are called **categorical columns**, and it might be worthwhile to create **dummy columns** from them (the `pd.get_dummies()` function takes care of this).

The `pandas-profiling` library includes a **profile report**. You can generate this report in a notebook. It will summarize the types of the columns and allow you to view details of quantile statistics, descriptive statistics, a histogram, common values, and extreme values:

In [None]:
import pandas_profiling

pandas_profiling.ProfileReport(df)

Use the `.shape` attribute of the DataFrame to inspect the number of rows and columns:

In [None]:
df.shape

Use the `.describe` method to get **summary stats** as well as see the count of **nonnull data**. The default behavior of this method is to only report on numeric columns; pass `include="object"` or `include="all"` to cover the other column types:

In [None]:
df.describe()

In [None]:
df.describe(include="object")

In [None]:
df.describe(include="all")

The count statistic only includes values that are not `NaN`, so it is useful for checking whether a column is missing data. It is also a good idea to spot-check the minimum and maximum values to see if there are **outliers**. Summary statistics are one way to do this. Plotting a histogram or a box plot is a visual representation that we will see later.

We will need to deal with missing data. Use the `.isnull` method to find columns or rows with missing values. Calling `.isnull` on a DataFrame returns a new DataFrame with every cell containing a True or False value. In Python, these values evaluate to 1 and 0, respectively. This allows us to sum them up or even calculate the percent missing (by calculating the mean).

The code indicates the count of missing data in each column:

In [None]:
df.isnull().sum()

> Replace `.sum` with `.mean` to get the fraction of null values (multiply by 100 for a percentage, as in the next cell). By default, calling these methods will apply the operation along `axis=0`, which is along the index. If you want to get the counts of missing features for each sample, you can apply this along `axis=1` (along the columns).

In [None]:
df.isnull().mean() * 100
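
The difference between the two axes is easier to see on a small frame. The sketch below uses a made-up three-row DataFrame (`toy` is a hypothetical name, not the Titanic data) and counts missing values per column and per sample.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; it only borrows a few Titanic column names.
toy = pd.DataFrame(
    {
        "age": [29.0, np.nan, 40.0],
        "cabin": [np.nan, np.nan, "C22"],
        "fare": [211.3, 151.6, 7.9],
    }
)

missing_per_column = toy.isnull().sum()     # axis=0 (default): one count per column
missing_per_row = toy.isnull().sum(axis=1)  # axis=1: one count per sample
pct_missing = toy.isnull().mean() * 100     # percentage missing per column

print(missing_per_column)
print(missing_per_row)
print(pct_missing.round(1))
```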

An SME can help in determining what to do with missing data. The `age` column might be useful, so keeping it and interpolating values could provide some signal to the model. Columns where **most of the values are missing** (`cabin`, `boat`, and `body`) tend to not provide value and can be dropped.

The `body` column (body identification number) is missing for many rows. We should drop this column at any rate because it **leaks data**. This column indicates that the passenger did not survive; by necessity our model could use that to cheat. We will pull it out. (If we are creating a model to predict if a passenger would die, knowing that they had a body identification number a priori would let us know they were already dead. We want our model to not know that information and make the prediction based on the other columns.) Likewise, the boat column leaks the reverse information (that a passenger survived).

Let’s look at some of the rows with missing data. We can create a boolean array (a series with True or False to indicate if the row has missing data) and use it to inspect rows that are missing data:

In [None]:
mask = df.isnull().any(axis=1)

df[mask].head(10)

We will impute (or derive values for) the missing values for the age column later.

Columns with type of object tend to be categorical (but they may also be high cardinality string data, or a mix of column types). For object columns that we believe to be categorical, use the `.value_counts` method to examine the counts of the values:

In [None]:
df.sex.value_counts(dropna=False)

Remember that pandas typically ignores `null` or `NaN` values. If you want to include those, use `dropna=False` to also show counts for `NaN`:

In [None]:
df.embarked.value_counts(dropna=False)

We have a couple of options for dealing with missing embarked values. Using `S` might seem logical as that is the **most common value**. We could dig into the data and try to determine if another option is better. We could also drop those two values. Or, because this is categorical, we can ignore them and use pandas to create **dummy columns**, in which case these two samples will just have 0 entries for every option. We will use this latter choice for this feature.
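
To illustrate that last option, here is a small sketch on a made-up column (`toy` is a hypothetical name). With its default settings, `pd.get_dummies` does not create an indicator for missing values, so a row with a missing category ends up with no indicator set.

```python
import numpy as np
import pandas as pd

# Made-up embarkation values with one missing entry.
toy = pd.DataFrame({"embarked": ["S", "C", np.nan, "Q"]})

dummies = pd.get_dummies(toy)
print(dummies)

# Number of indicators set per row; the row with the missing value has none.
print(dummies.sum(axis=1))
```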

## Create Features

We can drop columns that have no variance or no signal. There aren’t features like that in this dataset, but if there were a column called “is human” that had 1 for every sample, this column would not be providing any information.

Similarly, unless we are using NLP or extracting data out of text columns, a model will not be able to take advantage of a column where **every value is different**. The `name` column is an example of this. Some have pulled out the title from the name and treated it as categorical.

We also want to drop columns that leak information. Both `boat` and `body` columns leak whether a passenger survived.

The pandas `.drop` method can drop either rows or columns:

In [None]:
df.select_dtypes("object").head(3)

In [None]:
df = df.drop(columns=["name", "ticket", "home.dest", "boat", "body", "cabin"])

We need to create **dummy columns** from string columns. This will create new columns for `sex` and `embarked`. Pandas has a convenient `get_dummies` function for that:

In [None]:
df = pd.get_dummies(df)

df.columns

In [None]:
df.head(10)

At this point the `sex_male` and `sex_female` columns are perfectly **inverse correlated**. Typically we remove any columns with perfect or very high positive or negative correlation. **Multicollinearity** can impact interpretation of feature importance and coefficients in some models. Here is code to remove the `sex_male` column:

In [None]:
df = df.drop(columns="sex_male")
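
The perfect inverse correlation can be checked directly. This sketch uses a made-up `sex` column (`toy` is a hypothetical name) and also shows which column `drop_first=True` keeps: pandas drops the first category in sorted order, so here it keeps `sex_male` rather than `sex_female`.

```python
import pandas as pd

# Made-up values; only the column name matches the Titanic data.
toy = pd.DataFrame({"sex": ["male", "female", "female", "male", "female"]})

# Cast to int so the correlation is computed on 0/1 values.
dummies = pd.get_dummies(toy).astype(int)
corr = dummies["sex_male"].corr(dummies["sex_female"])
print(corr)

# drop_first=True removes the first level ("female" sorts before "male").
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))
```

Either column carries the same information, so which one is kept only changes the sign of its relationship with the target.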

Alternatively, we can add a `drop_first=True` parameter to the get_dummies call:

In [None]:
df = pd.get_dummies(df, drop_first=True)

df.columns

Create a DataFrame (`X`) with the features and a Series (`y`) with the labels. We could also use numpy arrays, but then we don’t have column names:

In [None]:
X = df.drop(columns="survived")
y = df.survived

## Sample Data

We always want to train and test on different data. Otherwise you don’t really know how well your model **generalizes** to data that it hasn’t seen before. We’ll use scikit-learn to pull out $30$% for testing (using `random_state=42` to remove an element of randomness if we start comparing different models):

In [None]:
from sklearn.model_selection import train_test_split

# here I would also pass stratify=y to keep the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
(X_train.shape, y_train.shape), (X_test.shape, y_test.shape)
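
As a sketch of what stratifying the split does, the example below uses made-up, imbalanced labels (80 zeros and 20 ones) and passes `stratify=y` to `train_test_split`. The names `X_tr`, `X_te`, `y_tr`, and `y_te` are chosen here to avoid clashing with the walkthrough's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 20% positives.
y = pd.Series([0] * 80 + [1] * 20)
X = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# With stratify=y both splits keep the share of positives found in y.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of positives in each split varies with the random seed, which matters more as the classes get more imbalanced.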

## Impute Data

The `age` column has missing values. We need to **impute** age from the numeric values. We only want to fit the imputer on the training set and then use that imputer to fill in the data for the test set. Otherwise we are leaking data (cheating by giving future information to the model).

Now that we have test and train data, we can impute missing values on the training set and use the trained imputers to fill in the test dataset. The fancyimpute library implements many algorithms. Sadly, most of them are not implemented in an inductive manner: you cannot call `.fit` and then `.transform`, so you cannot impute new data based on how the model was trained.

The `IterativeImputer` class (which was in fancyimpute but has been migrated to scikit-learn) does support inductive mode. To use it we need to add a special experimental import (as of scikit-learn version 0.21.2):

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=42)
X_train.loc[:,:] = imputer.fit_transform(X_train)
X_test.loc[:,:] = imputer.transform(X_test)

If we wanted to impute with the median, we can use pandas to do that (or the `SimpleImputer` class of sklearn):

In [None]:
meds = X_train.median()
X_train = X_train.fillna(meds)
X_test = X_test.fillna(meds)
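
For comparison, here is a minimal sketch of the same median imputation with `SimpleImputer`, using made-up `train` and `test` frames (hypothetical names). The imputer learns the median from the training data only and reuses it on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages; NaN marks the missing entries.
train = pd.DataFrame({"age": [20.0, np.nan, 40.0, 30.0]})
test = pd.DataFrame({"age": [np.nan, 50.0]})

imp = SimpleImputer(strategy="median")

# fit_transform learns the median on train; transform reuses it on test.
train_filled = pd.DataFrame(imp.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imp.transform(test), columns=test.columns)

print(imp.statistics_)  # the median learned from the training column
print(test_filled)
```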

## Normalize Data

**Normalizing** or preprocessing the data helps many models perform better, particularly those that depend on a **distance metric** to **determine similarity**. (Note that tree models, which treat each feature on its own, don’t have this requirement.)

We are going to **standardize** the data for the preprocessing. **Standardizing** is translating the data so that it has a **mean value of zero** and a **standard deviation of one**. This way models don’t treat variables with larger scales as more important than smaller scaled variables. I’m going to stick the result (numpy array) back into a pandas DataFrame for easier manipulation (and to keep column names).

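I also normally don’t standardize dummy columns. For simplicity, the following cell scales every column, dummy columns included; the refactored `get_train_test_X_y` function in the Refactor section takes a `std_cols` argument so that only selected columns are scaled: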

In [None]:
from sklearn.preprocessing import StandardScaler

cols = X_train.columns
sca = StandardScaler()

X_train = sca.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns=cols)

X_test = sca.transform(X_test)
X_test = pd.DataFrame(X_test, columns=cols)

In [None]:
X_train.mean().round(), X_train.std().round()
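
A small sketch of standardizing only the numeric columns and leaving a dummy column alone is shown below, on a made-up frame (`toy`, `num_cols`, and `scaled` are hypothetical names). It also checks the result against the formula $z = (x - \mu) / \sigma$. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to `ddof=1`, so the pandas value is close to but not exactly one on small samples.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values; sex_male stands in for a dummy column.
toy = pd.DataFrame(
    {
        "age": [20.0, 30.0, 40.0, 50.0],
        "fare": [10.0, 20.0, 30.0, 100.0],
        "sex_male": [1, 0, 0, 1],
    }
)
num_cols = ["age", "fare"]

scaler = StandardScaler()
scaled = toy.copy()
scaled[num_cols] = scaler.fit_transform(toy[num_cols])  # dummy column untouched

# Manual version of the same transformation, using the population std.
manual = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std(ddof=0)

print(np.allclose(scaled[num_cols], manual))
print(scaled)
```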

## Refactor

At this point I like to refactor my code. I typically make two functions: one for **general cleaning**, and another that **divides the data into training and testing sets and performs the mutations** that need to happen differently on those sets:

In [None]:
from sklearn.experimental import enable_iterative_imputer

from sklearn.model_selection import train_test_split
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler

def tweak_titanic(df):
    df = df.drop(
        columns=[
            "name",
            "ticket",
            "home.dest",
            "boat",
            "body",
            "cabin",
        ]
    ).pipe(pd.get_dummies, drop_first=True)

    return df

def get_train_test_X_y(df, y_col, size=0.3, std_cols=None):
    y = df[y_col]
    X = df.drop(columns=y_col)
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=size, random_state=42
    )
    
    cols = X.columns
    num_cols = [
        "pclass",
        "age",
        "sibsp",
        "parch",
        "fare",
    ]
    
    fi = IterativeImputer()
    
    X_train.loc[:, num_cols] = fi.fit_transform(X_train[num_cols])
    X_test.loc[:, num_cols] = fi.transform(X_test[num_cols])

    if std_cols:
        std = StandardScaler()
        X_train.loc[:, std_cols] = std.fit_transform(X_train[std_cols])
        X_test.loc[:, std_cols] = std.transform(X_test[std_cols])

    return X_train, X_test, y_train, y_test

In [None]:
ti_df = tweak_titanic(orig_df)

In [None]:
ti_df

In [None]:
std_cols = "pclass,age,sibsp,fare".split(",")
X_train, X_test, y_train, y_test = get_train_test_X_y(df=ti_df, y_col="survived", std_cols=std_cols)

In [None]:
X_train

## Baseline Model

Creating a **baseline model** that does something really simple can give us something to compare our model to. Note that using the default `.score` result gives us the accuracy which can be misleading. A problem where a positive case is $1$ in $10,000$ can easily get over $99$% accuracy by always predicting negative.

In [None]:
from sklearn.dummy import DummyClassifier

bm = DummyClassifier(strategy="stratified")
bm.fit(X_train, y_train)
bm.score(X_test, y_test)  # accuracy is misleading for imbalanced problems

In [None]:
from sklearn import metrics

metrics.precision_score(y_test, bm.predict(X_test))  # precision = TP / (TP + FP)
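
The one-in-10,000 case mentioned above can be reproduced with made-up data. This sketch uses `DummyClassifier(strategy="most_frequent")`, which always predicts the majority class, and compares accuracy with recall; the variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: a single positive among 10,000 samples.
y = np.zeros(10_000, dtype=int)
y[0] = 1
X = np.zeros((10_000, 1))  # the features are irrelevant to a dummy classifier

always_negative = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = always_negative.predict(X)

acc = accuracy_score(y, pred)  # fraction of correct predictions
rec = recall_score(y, pred)    # fraction of true positives that were found

print(acc, rec)
```

The accuracy looks excellent while the recall shows that the single positive case was missed, which is why a second metric is worth checking on imbalanced problems.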

## Various Families

This code tries a variety of algorithm families. The “No Free Lunch” theorem states that no algorithm performs well on all data. However, for some finite set of data, there may be an algorithm that does well on that set. (A popular choice for structured learning these days is a tree-boosted algorithm such as **XGBoost**.)

Here we use a few different families and compare the **AUC score** and standard deviation using k-fold cross-validation. An algorithm that has a slightly smaller average score but tighter standard deviation might be a better choice.

Because we are using k-fold cross-validation, we will feed the model all of `X` and `y`.

**Me**: I disagree with that last point. The test set should stay hidden, especially when comparing different model families. Because cross-validation already evaluates on different portions of the training data, we can run it on the training data alone and keep the test set for the final selected model. The code below therefore uses `X_train` and `y_train`.

In [None]:
from sklearn import model_selection
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

import xgboost

In [None]:
result = []
for model in [
    DummyClassifier, 
    LogisticRegression, 
    DecisionTreeClassifier, 
    KNeighborsClassifier, 
    GaussianNB, 
    SVC,
    RandomForestClassifier,
    xgboost.XGBClassifier
]:
    cls = model() # we keep the defaults
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
    s = model_selection.cross_val_score(cls, X_train, y_train, scoring="roc_auc", cv=kfold)
    
    result.append(pd.DataFrame({'model': [model.__name__], "AUC": [s.mean()], "STD": [s.std()]}))
    print(f"{model.__name__:22} AUC: {s.mean():.3f} STD: {s.std():.2f}")
    
df_result = pd.concat(result)

In [None]:
df_result.set_index("model").plot(kind="bar")

## Stacking

If you were going down the Kaggle route (or want **maximum performance at the cost of interpretability**), stacking is an option. A stacking classifier takes other models and uses their output to predict a target or label. We will use the previous models’ outputs and combine them to see if a stacking classifier can do better:

In [None]:
from mlxtend.classifier import (StackingClassifier)
clfs = [
    x() for x in [
        LogisticRegression,
        DecisionTreeClassifier,
        KNeighborsClassifier,
        GaussianNB,
        SVC,
        RandomForestClassifier,
    ]
]

stack = StackingClassifier(classifiers=clfs, meta_classifier=LogisticRegression())

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)

s = model_selection.cross_val_score(stack, X_train, y_train, scoring="roc_auc", cv=kfold)

print(
 f"{stack.__class__.__name__}  "
 f"AUC: {s.mean():.3f}  STD: {s.std():.2f}"
)

In this case it looks like performance went down a bit, as well as standard deviation.

## Create Model

I’m going to use a random forest classifier to create a model. It is a flexible model that tends to give decent out-of-the-box results. Remember to train it (calling `.fit`) with the training data that we split off earlier:

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

## Evaluate Model

Now that we have a model, we can use the test data to see how well the model **generalizes** to data that it hasn’t seen before. The `.score` method of a classifier returns the **average of the prediction accuracy**. We want to make sure that we call the `.score` method with the test data (presumably it should perform better with the training data):

In [None]:
rf.score(X_test, y_test)

We can also look at other metrics, such as precision:

In [None]:
metrics.precision_score(y_test, rf.predict(X_test))

A nice benefit of tree-based models is that you can inspect the feature importance. The **feature importance** tells you how much a **feature contributes to the model**. Note that removing a feature doesn’t mean that the score will go down accordingly, as other features might be collinear (in this case we could remove either the `sex_male` or `sex_female` column as they have a perfect negative correlation):

In [None]:
for col, val in sorted(zip(X_train.columns, rf.feature_importances_), key=lambda x: x[1], reverse=True):
    print(f"{col:10}{val:10.3f}")

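The feature importance reported by scikit-learn’s random forest (`feature_importances_`) is impurity-based: it reflects how much the splits on a feature reduce impurity, averaged over the trees. A related technique, permutation importance, looks at the error increase instead: if shuffling a feature increases the error in the model, the feature is more important.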
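
To see importance measured by error increase, scikit-learn (0.22 and later) provides `permutation_importance`, which shuffles one feature at a time and records the drop in score. The sketch below uses made-up data with one informative column (`signal`) and one random column (`noise`); both names are hypothetical. For brevity it scores on the training data, whereas held-out data is usually a better choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 300
signal = rng.normal(size=n)

# "signal" determines the label; "noise" is unrelated to it.
X = pd.DataFrame({"signal": signal, "noise": rng.normal(size=n)})
y = (signal > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances (these sum to 1 across features).
impurity = dict(zip(X.columns, rf.feature_importances_))

# Mean drop in accuracy when each feature is shuffled.
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=42)
permuted = dict(zip(X.columns, perm.importances_mean))

print(impurity)
print(permuted)
```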

I really like the **SHAP library** for exploring what features a model deems important, and for explaining predictions. This library works with black-box models, and we will show it later.

## Optimize Model

Models have **hyperparameters** that control how they behave. By varying the values for these parameters, we change their **performance**. Sklearn has a **grid search** class to evaluate a model with different combinations of parameters and return the best result. We can use those parameters to instantiate the model class:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf4 = RandomForestClassifier()
params = {
     # "sqrt" is what "auto" meant for classifiers; newer scikit-learn rejects "auto"
     "max_features": [0.4, "sqrt"],
     "n_estimators": [15, 200],
     "min_samples_leaf": [1, 0.1],
     "random_state": [42],
}
cv = GridSearchCV(
     rf4, params, n_jobs=-1
).fit(X_train, y_train)
print(cv.best_params_)
print(cv.score(X_test, y_test))

rf5 = RandomForestClassifier(
     **{
         "max_features": "sqrt",  # equivalent to the older "auto" for classifiers
         "min_samples_leaf": 0.1,
         "n_estimators": 200,
         "random_state": 42,
     }
)
rf5.fit(X_train, y_train)
rf5.score(X_test, y_test)

We can pass in a `scoring` parameter to `GridSearchCV` to optimize for different metrics. See Chapter 12 for a list of metrics and their meanings.
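
A minimal sketch of the `scoring` parameter follows. It uses a synthetic dataset from `make_classification` and a deliberately small grid so that it runs quickly; the grid values and the choice of `"roc_auc"` are illustrative assumptions, not recommendations from the book.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

params = {"n_estimators": [10, 30], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    params,
    scoring="roc_auc",  # rank parameter combinations by AUC instead of accuracy
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated AUC of the best combination
```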

### Confusion Matrix

A confusion matrix allows us to see the correct classifications as well as **false positives** and **false negatives**. It may be that we want to optimize toward false positives or false negatives, and different models or parameters can alter that. We can use `sklearn` to get a text version, or `Yellowbrick` for a plot:

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = rf5.predict(X_test)
print(confusion_matrix(y_test, y_pred))
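
For a binary problem, the four cells of the matrix can be unpacked with `.ravel()` and used to compute precision and recall by hand. The labels below are made up for illustration (`y_true` and `y_hat` are hypothetical names).

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (0 = died, 1 = survived).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tn, fp, fn, tp)
print(precision, recall)
```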

In [None]:
from yellowbrick.classifier import ConfusionMatrix

mapping = {0: "died", 1: "survived"}
fig, ax = plt.subplots(figsize=(6, 6))
cm_viz = ConfusionMatrix(
    rf5,
    classes=["died", "survived"],
    label_encoder=mapping,
)

cm_viz.score(X_test, y_test)
cm_viz.poof()
fig.savefig(
    "images/mlpr_0304.png",
     dpi=300,
     bbox_inches="tight",
)

Yellowbrick confusion matrix. This is a useful evaluation tool that presents the predicted class along the bottom and the true class along the side. A good classifier would have all of the values along the diagonal, and zeros in the other cells.

### ROC Curve

A **receiver operating characteristic (ROC)** plot is a common tool used to evaluate classifiers. By measuring the **area under the curve (AUC)**, we can get a metric to **compare different classifiers**. It plots the **true positive rate** against the **false positive rate**. We can use sklearn to calculate the AUC:

In [None]:
from sklearn.metrics import roc_auc_score

y_pred = rf5.predict(X_test)
roc_auc_score(y_test, y_pred)
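
`roc_auc_score` also accepts continuous scores such as predicted probabilities, which lets the curve use every threshold rather than a single one. The sketch below uses made-up scores to show that hard 0/1 predictions and the underlying scores can give different AUC values. In the walkthrough, the scores would presumably come from something like `rf5.predict_proba(X_test)[:, 1]`; that is an assumption, not code from the book.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.2, 0.6, 0.55, 0.7, 0.9])

# Hard predictions obtained by thresholding the scores at 0.5.
hard = (scores >= 0.5).astype(int)

auc_scores = roc_auc_score(y_true, scores)  # uses the full ranking of samples
auc_hard = roc_auc_score(y_true, hard)      # only one threshold is available

print(auc_scores, auc_hard)
```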

In [None]:
from yellowbrick.classifier.rocauc import roc_auc

fig, ax = plt.subplots(figsize=(6, 6))
roc_viz = roc_auc(rf5, X_train, y_train)
print(roc_viz.score(X_test, y_test))

roc_viz.poof()
fig.savefig("../img/mlpr_0305.png")

### Learning Curve

A learning curve is used to tell us if we have enough training data. It trains the model with increasing portions of the data and measures the score. If the cross-validation score continues to climb, then we might need to **invest in gathering more data**. Here is a Yellowbrick example:

In [None]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import LearningCurve

fig, ax = plt.subplots(figsize=(6, 4))
cv = StratifiedKFold(n_splits=12)
sizes = np.linspace(0.3, 1.0, 10)
lc_viz = LearningCurve(
    rf5,
    cv=cv,
    train_sizes=sizes,
    scoring="f1_weighted",
    n_jobs=4,
    ax=ax,
)
lc_viz.fit(X_train, y_train)
lc_viz.poof()
fig.savefig("../img/mlpr_0306.png")