# Classification Examples - Titanic Dataset

This lesson uses the [Titanic dataset](course_datasets.md#titanic). It predicts survival from title, passenger class, gender, port of embarkation, fare band and adult-or-child status, using logistic regression and decision tree classifiers.

Steps
* Load data into pandas
* Clean data: select the columns of interest and rename them
* Encode data (convert string columns into numbers, as the models require): one-hot encoding for the categorical columns, ordinal encoding for passenger class
* Encode the label column (Died -> 0, Survived -> 1)
* Split data into training and test sections
* Build a logistic regression model, fit it on the training data and predict on the test data
* Evaluate the model with standard metrics and a confusion matrix
* Build a decision tree model, fit it on the training data and predict on the test data
* Show decision tree model graph


In [25]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import joblib

Load the Titanic data from a CSV file at a public URL into a pandas DataFrame.

In [None]:
titanic_url = 'https://raw.githubusercontent.com/MarkWilcock/CourseDatasets/main/Misc%20Datasets/Titanic%20Passenger.csv'
df = pd.read_csv(titanic_url) # read the data
df.head(5) # show the first 5 rows

Keep only the columns of interest, and rename them in a consistent snake_case (Pythonic) style.

In [None]:
df_slim = df.loc[:, ['Survival', 'Title','Passenger Class','Gender', 'Embarked', 'FareBand', 'Adult Or Child']]
df_slim.columns = ['survival', 'title','pass_class','gender', 'embarked', 'fareband', 'adult_or_child']
df_slim.head(5)

Encode the categorical columns with a one-hot encoder. See [this explainer article](https://www.geeksforgeeks.org/ml-one-hot-encoding/) and the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

In [None]:
category_columns = ['title', 'gender', 'embarked', 'fareband', 'adult_or_child']
categorical_encoders = OneHotEncoder(sparse_output=False)
categorical_encoders

Ordinal data are categorical data with a natural rank order, where the distances between the categories are uneven or unknown, e.g. cool, warm, hot. In this dataset, the pass_class (passenger class) column and, arguably, the fareband column contain ordinal data.

In [None]:
ordinal_columns =  ['pass_class']
pass_class_values = ['1st', '2nd', '3rd']
#fareband_values = ['0 - 10', '10 - 20', '20 - 30', '30 - above']
ordinal_encoders = OrdinalEncoder(categories=[pass_class_values]) 
ordinal_encoders
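
The reason for passing the category order explicitly can be shown with the cool, warm, hot example. This is a small sketch on a made-up `temperature` column (an assumption, not part of the dataset): left to itself, `OrdinalEncoder` sorts string categories alphabetically, which does not match the natural rank here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column using the cool / warm / hot example
toy_temps = pd.DataFrame({'temperature': ['warm', 'cool', 'hot', 'warm']})

# Default: categories are sorted alphabetically (cool=0, hot=1, warm=2)
default_codes = OrdinalEncoder().fit_transform(toy_temps).ravel()

# Explicit order: cool=0, warm=1, hot=2, which matches the natural rank
ranked_encoder = OrdinalEncoder(categories=[['cool', 'warm', 'hot']])
ranked_codes = ranked_encoder.fit_transform(toy_temps).ravel()

print(default_codes)
print(ranked_codes)
```

For '1st', '2nd', '3rd' the alphabetical order happens to agree with the rank, but listing the values explicitly makes the intent clear.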

The ColumnTransformer lets us assemble the transforms on all the dataset columns.

In [None]:
ct = ColumnTransformer(transformers = [
        ('cat', categorical_encoders, category_columns),
        ('ord', ordinal_encoders, ordinal_columns)
        ], 
        remainder = 'drop')
ct.set_output(transform='pandas')
ct

In [None]:
# X is the standard name for the transformed data of features (independent variables)
X = ct.fit_transform(df_slim)
X.head(5)

Create an array of labels from the survival column. survival is a text column (with values Died and Survived) and is transformed into an array of numbers, each either 0 (Died) or 1 (Survived).

In [None]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_slim.loc[:, 'survival'])
y[:5] # show the first 5 elements of y
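
The mapping of Died to 0 and Survived to 1 follows from LabelEncoder sorting the class names. A minimal sketch on a made-up list of labels (the `toy_labels` values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels
toy_labels = ['Survived', 'Died', 'Died', 'Survived']

toy_label_encoder = LabelEncoder()
toy_y = toy_label_encoder.fit_transform(toy_labels)

# classes_ lists the labels in sorted order; the position is the numeric code
print(toy_label_encoder.classes_)
print(toy_y)

# inverse_transform maps numeric codes back to the original text labels
toy_decoded = toy_label_encoder.inverse_transform([0, 1])
print(toy_decoded)
```

Keeping the fitted encoder around makes it possible to turn numeric predictions back into readable labels.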

Split into train and test datasets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
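
The sketch below shows the effect of `test_size` and `random_state` on made-up arrays. It also uses the optional `stratify` argument, which the lesson's split does not use, to keep the class proportions the same in both parts. The `toy_` names and data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, balanced binary labels
toy_features = np.arange(40).reshape(20, 2)
toy_targets = np.array([0] * 10 + [1] * 10)

# test_size=0.2 holds back 20% of rows; random_state makes the shuffle repeatable;
# stratify (optional, not used in the lesson) preserves the class balance
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_targets, test_size=0.2, random_state=42, stratify=toy_targets
)

print(toy_X_train.shape, toy_X_test.shape)
print(toy_y_test)
```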

Build and fit the logistic regression model

In [None]:
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)
predictions_LR = model_LR.predict(X_test)
predictions_LR[:5] # show the first 5 predictions
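
A logistic regression model predicts a class by first estimating a probability for each class. The sketch below, on made-up one-feature data (the `toy_` names are assumptions), shows `predict_proba` alongside `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one numeric feature, label is 1 for the larger values
toy_feature = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_label = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(toy_feature, toy_label)

# One row per sample, one column per class; each row sums to 1
toy_proba = toy_model.predict_proba(toy_feature)

# predict returns the class with the higher estimated probability
toy_pred = toy_model.predict(toy_feature)

print(toy_proba.round(3))
print(toy_pred)
```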

Evaluate using standard metrics

In [None]:
print('Classification Report\n',  classification_report(y_test, predictions_LR))
print(f'f1 score\n {f1_score(y_test, predictions_LR):3.3f}')

Understand how well the model is performing with a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)


In [None]:
confusion_matrix(y_test, predictions_LR)
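
To read the matrix, note that scikit-learn puts the true classes on the rows and the predicted classes on the columns. The sketch below uses made-up labels (the `toy_` names are assumptions) to unpack the four cells and derive the usual metrics by hand.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 0 = Died, 1 = Survived
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_predicted = [0, 0, 0, 1, 1, 1, 0, 0]

toy_cm = confusion_matrix(toy_true, toy_predicted)

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = toy_cm.ravel()

precision = tp / (tp + fp)          # of those predicted Survived, how many were
recall = tp / (tp + fn)             # of those who Survived, how many were found
accuracy = (tp + tn) / toy_cm.sum() # overall fraction predicted correctly

print(toy_cm)
print(precision, recall, accuracy)
```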

## Decision Tree Model

In [None]:
model_DT = DecisionTreeClassifier(max_depth=4)
model_DT.fit(X_train, y_train)
model_DT

In [None]:
predictions_DT = model_DT.predict(X_test)
predictions_DT[:5] # show the first 5 predictions

In [None]:
accuracy_score(y_test, predictions_DT)

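To view the decision tree, export it to a GraphViz .dot file and open the file with an extension such as GraphViz Interactive Preview. The export below writes to the outputs folder, which must already exist.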

In [41]:
tree.export_graphviz(model_DT, 
                      out_file='outputs/titanic_tree.dot', 
                      feature_names=X.columns, 
                      class_names=['Died', 'Survived'],
                      label='all',
                      rounded=True,
                      filled=True)
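
If no GraphViz viewer is available, one alternative is scikit-learn's `export_text`, which prints the tree rules as plain text. This is a sketch on made-up data; the feature names `is_female` and `is_child` are hypothetical and are not columns of `X`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: the label depends only on the first feature
toy_tree_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
toy_tree_y = np.array([0, 0, 1, 1])

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(toy_tree_X, toy_tree_y)

# Render the fitted tree as indented if/else style rules
toy_rules = export_text(toy_tree, feature_names=['is_female', 'is_child'])
print(toy_rules)
```

For the lesson's model, the equivalent call would pass `model_DT` and `list(X.columns)`.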


Persist the model so that it can be reused later without retraining.

In [None]:
joblib.dump(model_DT, 'outputs/titanic_model.pkl')
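
A saved model can be read back with `joblib.load`. The sketch below does a round trip on a made-up model in a temporary folder; the `toy_` names and the file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy model to save and restore
toy_dump_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
toy_dump_y = np.array([0, 0, 1, 1])
toy_saved_model = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_saved_model.fit(toy_dump_X, toy_dump_y)

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_saved_model, toy_path)
    toy_restored_model = joblib.load(toy_path)

# The restored model should give the same predictions as the original
same_predictions = bool(
    (toy_restored_model.predict(toy_dump_X) == toy_saved_model.predict(toy_dump_X)).all()
)
print(same_predictions)
```

Only load model files from sources you trust, since joblib files are pickle-based and loading one can execute arbitrary code.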

## Old code - ignore

In [43]:
# df_slim_no_label = df_slim.drop('survival', axis=1)
# df_slim_no_label.head(5) # show the first 5 rows
# y_DT = df_slim['survival'].apply(lambda x: 1 if x == 'Survived' else 0)
# y_DT.head()
# X_DT = pd.get_dummies(df_slim_no_label) 
# X_DT.head(5)
# X_train_DT, X_test_DT, y_train_DT, y_test_DT = train_test_split(X_DT, y_DT, test_size=0.2)
# (X_train_DT.shape, y_train_DT.shape), (X_test_DT.shape, y_test_DT.shape)