# SI 618 - Homework #7: Classifiers
or: How I Learned to Stop Worrying and Love Machine Learning

v.20210404.1.CT

This is, perhaps, one of the most exciting homework assignments that you have
encountered in this course!

You are going to try your hand at a Kaggle competition to predict Titanic survivorship.
(Recall that we've played with Titanic data earlier in this course -- this data set is
slightly different.)

(NOTE: if you prefer to not submit your work to the Kaggle competition that's fine --
just contact Chris via email (cteplovs@umich.edu) and we will work out an alternative.)

To start with, make sure you have a [Kaggle](https://www.kaggle.com/) account, 
then navigate to the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) project page.

We'll view the [introductory video](https://www.youtube.com/watch?v=8yZMXCaFshs) 
together in class.

The basic steps for this assignment are outlined in the video:

1. Accept the rules and join the competition
2. Download the data (from the data tab of the competition page)
3. Understand the problem
4. EDA (Exploratory Data Analysis)
5. Train, tune, and ensemble (!) your machine learning models
6. Upload your prediction as a submission on Kaggle and receive an accuracy score

additionally, you will

7. Upload your final notebook to Canvas and report your best accuracy score.  

Note that class grades are not entirely dependent on your accuracy score.  
All models that achieve 75% accuracy will receive full points for 
the accuracy component of this assignment.

Rubric:

1. (20 points) EDA
2. (60 points) Train, tune, and ensemble machine learning models
3. (10 points) Accuracy score based on Kaggle submission report (or alternative, see NOTE above).
4. (10 points) PEP-8, grammar, spelling, style, etc.

Some additional notes:

1. If you use another notebook, code, or approaches be sure to reference the original work. (Note that we recommend you study existing Kaggle notebooks before starting your own work.)
2. You can help each other but in the end you must submit your own work, both to Kaggle and to Canvas.

Some additional resources:

* "ensemble" your models with a [VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)
* a good primer on [feature engineering](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/)
* There are a lot of good [notebooks to study](https://www.kaggle.com/c/titanic/notebooks) (check the number of upvotes to help guide your exploration)

## GOOD LUCK!
(and don't cheat)

One final note:  Your submission should be a self-contained notebook that is NOT based
on this one.  Studying the existing Kaggle competition notebooks should 
give you a sense of what makes a "good" notebook.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
train.Ticket.sample(10, random_state=42)

709                2661
439          C.A. 18723
840    SOTON/O2 3101287
720              248727
39                 2651
290               19877
300                9234
333              345764
208              367231
136               11752
Name: Ticket, dtype: object

I don't think there is much information in the 'Ticket' or 'Name' columns so I'm going to drop them.

In [4]:
cleaner_train1 = train.copy()
cleaner_train1.drop(columns=['Ticket', 'Name'], inplace=True)

cleaner_test1 = test.copy()
cleaner_test1.drop(columns=['Ticket', 'Name'], inplace=True)

In [5]:
cleaner_train1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Cabin        204 non-null    object 
 9   Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB


Looks like I need to deal with null vaues in the 'Age', 'Cabin', and 'Embarked' columns.

In [6]:
cleaner_train1['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [7]:
# fill the 2 missing 'Embarked' rows with the most common 'S'
most_common = cleaner_train1.Embarked.mode()
cleaner_train1['Embarked'] = cleaner_train1['Embarked'].apply(lambda e: e if e == None else most_common)
cleaner_test1['Embarked'] = cleaner_test1['Embarked'].apply(lambda e: e if e == None else most_common)
cleaner_train1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Cabin        204 non-null    object 
 9   Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB


In [8]:
cleaner_train1['Cabin'].value_counts()

G6             4
C23 C25 C27    4
B96 B98        4
F2             3
E101           3
              ..
B19            1
E68            1
F G63          1
F E69          1
C47            1
Name: Cabin, Length: 147, dtype: int64

Probably too many Cabin numbers, will see if the letter or number are important

In [9]:
cleaner_train1['Cabin_Letter'] = cleaner_train1['Cabin'].str[0]
cleaner_train1['Cabin_Letter'].value_counts()

C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin_Letter, dtype: int64

There doesn't seem to be a dominant cabin. I'll drop this column.

In [10]:
cleaner_train2 = cleaner_train1.drop(columns=['Cabin', 'Cabin_Letter'])
cleaner_test2 = cleaner_test1.drop(columns=['Cabin'])
cleaner_train2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB


The provided test csv did not contain labels, so I want to create my own test set to test out the accurcay of the models before going back and creating a model with the full training set.

Because there was not an equal number of surviors and those who perished, I'll split using StratifiedShuffleSplit

In [11]:
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(titanic_train_X, titanic_train_y, test_size=0.2,
#                                                     random_state=0)

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(cleaner_train2, cleaner_train2['Survived']):
    titanic_train_set = cleaner_train2.loc[cleaner_train2.index.intersection(train_index)]
    titanic_test_set = cleaner_train2.loc[cleaner_train2.index.intersection(test_index)]

In [12]:
titanic_train_X = titanic_train_set.drop('Survived',axis=1)
titanic_train_y = titanic_train_set['Survived'].copy()
titanic_test_X = titanic_test_set.drop('Survived',axis=1)
titanic_test_y = titanic_test_set[['Survived']].copy()

In [13]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

For the 'Age' column I'll impute with the mean in my ColumnTransformer. Adapted from: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

In [14]:
numeric_features = list(titanic_train_X.select_dtypes(include=[np.number]))
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = list(titanic_train_X.select_dtypes(exclude=[np.number]))
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [15]:
# transform the features to prepare them for the model
train_X_prepared = preprocessor.fit_transform(titanic_train_X)
test_X_prepared = preprocessor.transform(titanic_test_X)

Now I'll try a bunch of models out and see which is the most accuract

In [16]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [17]:
# from https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
]

In [18]:
for name, clf in zip(names, classifiers):
    clf.fit(train_X_prepared, titanic_train_y)
    
    # evaluate predictions
    accuracy = clf.score(test_X_prepared, titanic_test_y)
    print("%s Accuracy: %.2f%%" % (name,accuracy * 100.0))

Nearest Neighbors Accuracy: 81.01%
Linear SVM Accuracy: 77.65%
RBF SVM Accuracy: 79.89%
Gaussian Process Accuracy: 84.36%
Decision Tree Accuracy: 72.63%
Random Forest Accuracy: 81.56%
Neural Net Accuracy: 83.24%
AdaBoost Accuracy: 75.42%
Naive Bayes Accuracy: 80.45%


Excellent! Looks like most Classifiers have an accuracy above 80% on the (my) test set. Best among them, I'll select the Neural Net with an accuracy of almost 85%. Now I'll tune the hyperparameters with a technique that I found on stackexchange: https://datascience.stackexchange.com/questions/36049/how-to-adjust-the-hyperparameters-of-mlp-classifier-to-get-more-perfect-performa

In [19]:
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

In [20]:
from sklearn.model_selection import GridSearchCV

mlp = MLPClassifier(max_iter=1000)

clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
clf.fit(train_X_prepared, titanic_train_y)

GridSearchCV(cv=3, estimator=MLPClassifier(max_iter=1000), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50),
                                                (100,)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'adam']})

In [21]:
# Best paramete set
print('Best parameters found:\n', clf.best_params_)

# All results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

Best parameters found:
 {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
0.789 (+/-0.031) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
0.708 (+/-0.097) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'adam'}
0.789 (+/-0.030) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.708 (+/-0.094) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.781 (+/-0.013) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
0.736 (+/-0.121) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 

Now that I've chosen the MLPClassifier and tuned the hyperparameters, I can pass the model the full set of labeled training data

In [22]:
X = cleaner_train2.drop('Survived',axis=1)
y = cleaner_train2['Survived'].copy()

# tranform the columns of this full training set 
full_train_X_prepared = preprocessor.fit_transform(X)
full_test_X_prepared = preprocessor.transform(cleaner_test2)

In [23]:
# pass the tuned parameters to the model
pipe = Pipeline([
    # ('pca', PCA(n_components=5,random_state=42)), # no need to reduced the dimensions
    ('nn', MLPClassifier(**clf.best_params_)),
])

In [24]:
pipe.fit(full_train_X_prepared, y)
predictions = pipe.predict(full_test_X_prepared)



In [25]:
# prepare a dataframe for csv export
myCSV = pd.DataFrame(data={
    'PassengerId': test['PassengerId'], 
    'Survived': predictions
})

In [26]:
myCSV.to_csv(r'survivorship.csv', index = False)

!['78% accuracy'](78_perc.png)