## Model nº2

Second attempt at building an ML model for the Titanic competition. I'm going to focus on neater preprocessing, as well as trying different types of models and tuning their parameters.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, compose, tree, model_selection, metrics, ensemble, neighbors, svm, naive_bayes, linear_model

We load the files. I found the Survived labels for the test samples on Kaggle, so I can evaluate on the test set without having to submit.

In [2]:
# Load the train and test files, plus the Survived labels for the test samples.

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
results = pd.read_csv('results.csv')

# Missing ages are filled with an arbitrary negative number so that, once Age is binned, they fall into their own bin.

train['Age'] = train['Age'].fillna(-15)
test['Age'] = test['Age'].fillna(-15)

# A missing Fare is filled with the mean Fare of the sample's Pclass.

test['Fare'] = test['Fare'].fillna(test.groupby('Pclass')['Fare'].transform('mean'))

# Custom age binning function

def bin_values(data):
    # Work on a copy so the caller's DataFrame is not modified in place.
    data = data.copy()
    bins = [-20, 0, 10, 60, np.inf]
    labels = [0, 1, 2, 3]
    data['Age'] = pd.cut(data['Age'], bins=bins, labels=labels)
    return data

# Define train and test data samples

x_train = train.drop('Survived', axis=1)
y_train = train['Survived']
#x_train, x_test, y_train, y_test = model_selection.train_test_split(x_train, y_train, test_size=0.7, random_state= 1)
x_test = test
y_test = results['Survived']
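
To check that the two fill strategies in the cell above behave as intended, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the Titanic files; only the bin edges and the sentinel -15 come from the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic files).
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [35.0, np.nan, 4.0, 70.0, np.nan],
    "Fare": [80.0, 60.0, 8.0, np.nan, 10.0],
})

# Sentinel for missing ages: -15 lies inside the first interval (-20, 0].
toy["Age"] = toy["Age"].fillna(-15)

# A missing fare takes the mean fare of the passenger's own class.
class_mean_fare = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean_fare)

# Same bin edges as bin_values: missing, up to 10, up to 60, over 60.
age_bin = pd.cut(toy["Age"], bins=[-20, 0, 10, 60, np.inf], labels=[0, 1, 2, 3])

print(age_bin.astype(int).tolist())
print(toy["Fare"].tolist())
```

The two passengers with a missing age end up in bin 0, which no real age can reach, and the missing third-class fare becomes the mean of the other two third-class fares.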

In [3]:
# Categorical data is encoded using OneHotEncoder. For SibSp and Parch we only consider two categories (i.e. family or no family).
# Age is binned using the custom binning function, and we use KBinsDiscretizer to bin Fare.

column_trans = compose.ColumnTransformer(
    [('sex', preprocessing.OneHotEncoder(drop='if_binary', handle_unknown = 'ignore'), ['Sex']),
    ('pclass', preprocessing.OneHotEncoder(handle_unknown ='ignore', max_categories = 4), ['Pclass']),
    ('embarked', preprocessing.OneHotEncoder(handle_unknown ='ignore', max_categories = 3), ['Embarked']),
    ('family', preprocessing.OneHotEncoder(drop='if_binary', handle_unknown ='ignore', max_categories = 2), ['SibSp','Parch']),
    ('fare', preprocessing.KBinsDiscretizer(n_bins= 5, encode='ordinal', strategy='quantile'), ['Fare']),
    ('age',  preprocessing.FunctionTransformer(bin_values), ['Age'])
    ], remainder='drop')

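# Transform the train and test data. The transformer is fitted on the training data only,
# so the test data is encoded with the same categories and Fare bin edges.

x_train_transformed = column_trans.fit_transform(x_train)
x_test_transformed = column_trans.transform(x_test)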

# The list of classifiers that we are going to try. The parameters have been chosen after playing a bit with them.

clf1 = tree.DecisionTreeClassifier(criterion='entropy', min_samples_leaf = 15)
clf2 = ensemble.GradientBoostingClassifier()
clf3 = naive_bayes.CategoricalNB()
clf4 = neighbors.KNeighborsClassifier(n_neighbors= 50)
clf5 = svm.SVC(kernel='poly')
clf6 = ensemble.RandomForestClassifier(criterion='entropy', min_samples_leaf = 15)
clf7 = linear_model.SGDClassifier()

# Training each model, and obtaining the accuracies in the train as well as test data sets.

classifiers = [clf1, clf2, clf3, clf4, clf5, clf6, clf7]
labels = ['Tree', 'Grad. Boost', 'CategoricalNB', 'K Neighbors', 'SVC', 'Random Forest', 'SGD']

for clf, label in zip(classifiers, labels):
    clf.fit(x_train_transformed, y_train)
    y_train_pred = clf.predict(x_train_transformed)
    y_test_pred = clf.predict(x_test_transformed)
    print("Accuracy: %0.4f  [%s]" % (metrics.accuracy_score(y_train, y_train_pred), 'Train set accuracy ' + label))
    print("Accuracy: %0.4f  [%s]" % (metrics.accuracy_score(y_test, y_test_pred), 'Test set accuracy ' + label))


Accuracy: 0.8159  [Train set accuracy Tree]
Accuracy: 0.7775  [Test set accuracy Tree]
Accuracy: 0.8440  [Train set accuracy Grad. Boost]
Accuracy: 0.7751  [Test set accuracy Grad. Boost]
Accuracy: 0.7508  [Train set accuracy CategoricalNB]
Accuracy: 0.7560  [Test set accuracy CategoricalNB]
Accuracy: 0.7744  [Train set accuracy K Neighbors]
Accuracy: 0.7560  [Test set accuracy K Neighbors]
Accuracy: 0.8036  [Train set accuracy SVC]
Accuracy: 0.7895  [Test set accuracy SVC]
Accuracy: 0.8114  [Train set accuracy Random Forest]
Accuracy: 0.7727  [Test set accuracy Random Forest]
Accuracy: 0.7464  [Train set accuracy SGD]
Accuracy: 0.7057  [Test set accuracy SGD]
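
One detail of the preprocessing worth keeping in mind is that KBinsDiscretizer learns its bin edges from whatever data it is fitted on. The sketch below uses two made-up fare samples (hypothetical values, and 3 bins instead of 5 to keep it small) to show the difference between reusing fitted edges with `transform` and learning new ones with `fit_transform`.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical fares: the two samples cover very different ranges.
fare_a = np.array([[5.0], [10.0], [20.0], [40.0], [80.0], [160.0]])
fare_b = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

binner = preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")

# fit learns the bin edges from fare_a; transform reuses those edges on fare_b.
binner.fit(fare_a)
reused = binner.transform(fare_b).ravel()

# fit_transform on fare_b discards the old edges and learns new ones from fare_b.
refit = binner.fit_transform(fare_b).ravel()

print("edges reused from fare_a:", reused.tolist())
print("edges refitted on fare_b:", refit.tolist())
```

With the reused edges every value of `fare_b` lands in the lowest bin, while refitting spreads the same values over all three bins, so a given bin index only means the same thing across datasets when the edges come from a single fit.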


We obtain fairly similar results with the various models. The tree classifier is still very competitive once we set a good value of min_samples_leaf. Random forest doesn't perform better than a single decision tree, probably because the number of features is limited. Gradient boosting does about as well as a single decision tree on the test set (0.7751 vs. 0.7775), although it fits the training set more closely. The best model, however, is the Support Vector Machine with the degree-3 polynomial kernel (0.7895 test accuracy). It also shows a smaller gap between train and test accuracy than the tree-based models, which suggests less overfitting.
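
A single train/test split gives only one estimate of that gap. As a rough sketch of how it could be checked more systematically, the snippet below runs 5-fold cross-validation with `model_selection.cross_validate` and compares mean train and validation accuracy. The data is a synthetic stand-in from `make_classification` (an assumption, not the Titanic features), so the numbers it prints say nothing about the results above.

```python
from sklearn import datasets, model_selection, svm, tree

# Synthetic stand-in data (an assumption); substitute the transformed Titanic features.
X, y = datasets.make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Tree (min_samples_leaf=1)": tree.DecisionTreeClassifier(min_samples_leaf=1, random_state=0),
    "Tree (min_samples_leaf=15)": tree.DecisionTreeClassifier(min_samples_leaf=15, random_state=0),
    "SVC (poly, degree 3)": svm.SVC(kernel="poly", degree=3),
}

cv_results = {}
for label, clf in candidates.items():
    # return_train_score=True also reports accuracy on the training folds.
    scores = model_selection.cross_validate(
        clf, X, y, cv=5, scoring="accuracy", return_train_score=True
    )
    train_acc = scores["train_score"].mean()
    valid_acc = scores["test_score"].mean()
    cv_results[label] = (train_acc, valid_acc)
    print("%-28s train %0.4f  validation %0.4f  gap %0.4f"
          % (label, train_acc, valid_acc, train_acc - valid_acc))
```

A large gap between the two columns points to overfitting, which is the effect that min_samples_leaf is meant to limit in the tree models.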

In [7]:
# We write the SVC predictions to a file to submit to Kaggle.

clf5.fit(x_train_transformed,y_train)
y_test_pred = clf5.predict(x_test_transformed)
test['Survived']= y_test_pred
test[['PassengerId','Survived']].to_csv('submission.csv',index=False)
metrics.accuracy_score(y_test,y_test_pred)

0.7894736842105263