In this notebook we approach the Titanic dataset with a simple logistic regression. 

In [25]:
import numpy as np
import pandas as pd
from math import log
%matplotlib inline 
import matplotlib.pyplot as plt
import warnings
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
warnings.filterwarnings('ignore')


train_data = pd.read_csv("Titanic/train.csv")
test_data = pd.read_csv("Titanic/test.csv")

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
train_data = train_data.copy()
test_data = test_data.copy()
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")
train_data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [26]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [27]:
train_data.describe()

Statistic,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [28]:
for col in train_data.columns:
    if train_data[col].isna().any():
        print(train_data[col].isna().value_counts())

Age
False    714
True     177
Name: count, dtype: int64
Cabin
True     687
False    204
Name: count, dtype: int64
Embarked
False    889
True       2
Name: count, dtype: int64


From the above, it seems clear that Name, Cabin and Ticket won't be useful features. One can't predict whether someone survived from their name (unless, maybe, it starts with 'Lord' or something similar). Cabin is missing about three quarters of its values (687 of 891), and even if it weren't, there would be too many distinct entries to one-hot encode. One could group cabins in some way, but I believe this would at best become a proxy for class. Similarly, using the ticket number as a predictive feature would be hard (at least, I'm not sure what to do with it).
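
A quick way to back up this kind of judgement is to count distinct values and the share of missing entries per column. The sketch below uses a tiny made-up frame (`toy` and its rows are illustrative, not the real data); the same two calls could be run on `train_data`.

```python
import pandas as pd

# Tiny illustrative frame (made-up rows, not the real Titanic data).
toy = pd.DataFrame({
    "Name": ["A, Mr. X", "B, Mrs. Y", "C, Miss. Z", "D, Mr. W"],
    "Ticket": ["A/5 1", "PC 2", "3", "4"],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

summary = pd.DataFrame({
    "n_unique": toy.nunique(),          # distinct non-null values per column
    "frac_missing": toy.isna().mean(),  # share of missing entries per column
})
print(summary)
```

Columns with nearly as many distinct values as rows (here Name and Ticket) would produce one dummy column per passenger under one-hot encoding, which is why they are poor candidates without further grouping.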

In the solution to this problem as an exercise in Hands-On Machine Learning, the author suggests grouping SibSp and Parch together into a single feature, Family_on_board. He also suggests that age brackets are probably a more relevant predictor than precise age. For imputing missing values, taking the overall median would be good enough, but we will follow Géron's suggestion and impute based on the median for the passenger's class. Neither the training nor the test data has missing class entries, so we won't handle the case where the class is also missing.

In [29]:
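features = ['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilyOnBoard', 'AgeBucket']
# Family size: parents/children plus siblings/spouses travelling with the passenger.
train_data['FamilyOnBoard'] = train_data['Parch'] + train_data['SibSp']
test_data['FamilyOnBoard'] = test_data['Parch'] + test_data['SibSp']
# 15-year age buckets labelled by their lower bound (0, 15, 30, ...); missing ages stay NaN.
test_data["AgeBucket"] = (test_data["Age"] // 15) * 15
train_data["AgeBucket"] = (train_data["Age"] // 15) * 15
# .copy() so that later column assignments act on an independent frame, not a slice of train_data.
train_data_features = train_data[features].copy()
train_data_features.groupby('Pclass', as_index=False)['AgeBucket'].describe()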

Pclass,count,mean,std,min,25%,50%,75%,max
1,186.0,32.016129,15.186935,0.0,15.0,30.0,45.0,75.0
2,173.0,23.236994,13.859193,0.0,15.0,15.0,30.0,60.0
3,355.0,18.71831,12.355587,0.0,15.0,15.0,30.0,60.0


So the typical age bucket differs noticeably between the classes: the median bucket is 30 for first class and 15 for second and third. The imputation code below is adapted from https://stackoverflow.com/questions/51426255/how-to-impute-nan-values-based-on-values-of-other-column. I am also going to be a hack and just enter the numbers by hand, rather than picking the specific values out of a data frame.

In [30]:
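# One boolean mask per passenger class, in the same order as Values.
Class = [train_data['Pclass'] == 1, train_data['Pclass'] == 2, train_data['Pclass'] == 3]
# Hand-entered fill values for classes 1, 2 and 3. Note: these do not match the per-class
# median buckets in the table above (30, 15, 15), and 20 is not a multiple of 15.
Values = [20,20,30]
train_data_features['AgeBucket'] = np.where(train_data['AgeBucket'].isnull(), np.select(Class, Values), train_data['AgeBucket'])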
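
As an alternative to entering the fill values by hand, the per-class medians can be computed from the training rows and looked up by class. This is a minimal sketch on made-up data; `toy_train`, `toy_test` and `class_medians` are hypothetical names, and the rows are chosen only so that the medians come out as whole buckets.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the training and test frames.
toy_train = pd.DataFrame({
    "Pclass":    [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "AgeBucket": [30.0, 30.0, 45.0, np.nan, 15.0, 15.0, np.nan, 0.0, 15.0, 15.0, np.nan],
})
toy_test = pd.DataFrame({
    "Pclass":    [1, 2, 3],
    "AgeBucket": [np.nan, 30.0, np.nan],
})

# Median bucket per class, learned from the training rows only.
class_medians = toy_train.groupby("Pclass")["AgeBucket"].median()

# Fill each gap with the median of that row's class; the same lookup table
# is reused for the test rows so no test information leaks into the fill values.
toy_train["AgeBucket"] = toy_train["AgeBucket"].fillna(toy_train["Pclass"].map(class_medians))
toy_test["AgeBucket"] = toy_test["AgeBucket"].fillna(toy_test["Pclass"].map(class_medians))
print(class_medians)
print(toy_test)
```

Computing the lookup once from the training frame keeps the train and test imputations in sync without copying numbers around.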

In [31]:
# Before I forget, I need to do the same thing to the test data.
Class = [test_data['Pclass'] == 1,test_data['Pclass'] == 2,test_data['Pclass'] == 3]
test_data['AgeBucket'] = np.where(test_data['AgeBucket'].isnull(), np.select(Class, Values), test_data['AgeBucket'])
# This is a manual imputation: the fill values for the test data must come from the train data,
# which is ensured by reusing the same 'Values' list that was used for the train data.
# I don't think I need to do anything more to the training data by hand; the pipeline defined later should handle the rest.

In [32]:
y = train_data_features['Survived']
X = train_data_features.drop('Survived', axis = 1)
X['FamilyOnBoard'].value_counts()

FamilyOnBoard
0     537
1     161
2     102
3      29
5      22
4      15
6      12
10      7
7       6
Name: count, dtype: int64

So the counts for family on board drop off quite significantly after 3 members. I should change this so that the categories are 0, 1, 2 and >=3. This would also mean that I won't need to scale this column. I'll just leave the entry in the FamilyOnBoard column as 3 rather than changing it to '>=3' or something similar, since I would just need to encode it later anyway.

In [33]:
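# Cap family size at 3, so that 3 stands for "3 or more".
X['FamilyOnBoard'] = X['FamilyOnBoard'].clip(upper=3)
# Note: test_data['FamilyOnBoard'] is not capped here, so the submission is predicted from uncapped values.
X['FamilyOnBoard'].value_counts()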

FamilyOnBoard
0    537
1    161
2    102
3     91
Name: count, dtype: int64
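
Because the model is trained on the capped column, the same cap should also be applied to any frame that is later passed to `predict`. A minimal sketch of applying it to both frames at once, using made-up stand-ins (`toy_train` and `toy_test` are hypothetical):

```python
import pandas as pd

# Made-up stand-ins for the training features and the test frame.
toy_train = pd.DataFrame({"FamilyOnBoard": [0, 1, 2, 3, 5, 10]})
toy_test = pd.DataFrame({"FamilyOnBoard": [0, 4, 7]})

# Apply the identical cap to both frames so that 3 means "3 or more" everywhere.
for frame in (toy_train, toy_test):
    frame["FamilyOnBoard"] = frame["FamilyOnBoard"].clip(upper=3)

print(toy_train["FamilyOnBoard"].tolist())
print(toy_test["FamilyOnBoard"].tolist())
```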

In [34]:
OHE_encoder = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy='most_frequent')),
        ("encoder", OneHotEncoder())
    ]
)
num_pipeline = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy = 'mean')),
        ("scaler", StandardScaler())
    ]
)
# Note: 'Pclass' is not listed in any transformer, and ColumnTransformer drops unlisted
# columns by default (remainder='drop'), so the model never sees it.
preprocessor = ColumnTransformer(transformers = [
    ("num", num_pipeline, ['Fare']),  # The other numerical variables are treated as categorical variables that are already encoded.
    ("ord", OrdinalEncoder(), ['Sex']),
    ("ordImputer", SimpleImputer(strategy = 'most_frequent'), ['FamilyOnBoard', 'AgeBucket']),
    ("OHE", OHE_encoder, ['Embarked'])
])
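
One thing worth checking with a `ColumnTransformer` is which columns actually reach the model: any column not named in a transformer is dropped under the default `remainder='drop'`. The sketch below shows this on a made-up two-column frame (`toy`, `drop_rest` and `keep_pclass` are hypothetical names), together with one way of keeping a column unchanged via `"passthrough"`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up frame with one scaled column and one already-encoded column.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Fare": [80.0, 20.0, 8.0, 7.0],
})

# Default remainder="drop": Pclass is not listed, so it is discarded.
drop_rest = ColumnTransformer([("num", StandardScaler(), ["Fare"])])

# Listing Pclass with "passthrough" keeps it, unchanged, in the output.
keep_pclass = ColumnTransformer([
    ("num", StandardScaler(), ["Fare"]),
    ("keep", "passthrough", ["Pclass"]),
])

out_drop = drop_rest.fit_transform(toy)
out_keep = keep_pclass.fit_transform(toy)
print(out_drop.shape)  # one column: scaled Fare
print(out_keep.shape)  # two columns: scaled Fare, then Pclass
```

Whether Pclass should be passed through as an ordinal number or one-hot encoded is a modelling choice; the sketch only shows how to stop it from being dropped.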


In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)
model = LogisticRegression()
my_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                              ('model', model)
                             ])
my_pipeline.fit(X_train,y_train)
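# These metrics are computed on the training split; X_test and y_test are held out but not used here.
y_pred = my_pipeline.predict(X_train)
Acc_score = accuracy_score(y_train, y_pred)
print(Acc_score)
print(confusion_matrix(y_train, y_pred))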

0.7823033707865169
[[374  70]
 [ 85 183]]
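
The figures above are measured on the rows the model was fitted to, so they tend to be optimistic. A sketch of scoring the held-out split alongside the training split follows; it uses synthetic data rather than the Titanic features, and all names in it (`X_toy`, `clf`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.8, random_state=42)
clf = LogisticRegression().fit(X_tr, y_tr)

# Training accuracy: measured on rows the model has already seen.
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
# Held-out accuracy: measured on rows kept aside by train_test_split.
test_acc = accuracy_score(y_te, clf.predict(X_te))
test_cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1])

print("train accuracy:", train_acc)
print("held-out accuracy:", test_acc)
print(test_cm)
```

In the notebook itself, the equivalent would be calling `my_pipeline.predict(X_test)` and comparing against `y_test`.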


In [36]:
predictions = my_pipeline.predict(test_data)
my_submission = pd.DataFrame({'PassengerId': test_data.index, 'Survived': predictions})
my_submission.to_csv('RefinedLogitSubmission.csv', index=False)

This results in a score of 0.76555, which isn't any better than what I got with less feature engineering. It seems that the model has changed, but not by enough to push any of the predictions across the decision threshold. I could compare the two models using binary cross-entropy to see whether the predicted probabilities are moving in the correct direction. Other improvements include:
 * I could create a custom test set designed to be representative of the data. Looking at X_test.describe(), it does seem like we could make a better train-test split, but I'm not sure how much of a difference this would make.
 * Given that only about 38% of passengers survived, I wonder if it might make sense to raise the threshold for predicting survival slightly? I don't think this would make a huge difference though: looking at the confusion matrix, the numbers of false positives (70) and false negatives (85) are roughly similar.
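
The cross-entropy comparison and the threshold idea can both be tried from predicted probabilities. The sketch below is only an illustration on synthetic data: `model_a` and `model_b` are hypothetical stand-ins for the two models being compared, and the 0.6 threshold is an arbitrary example value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + 0.5 * rng.normal(size=400) > 0.3).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, train_size=0.8, random_state=42)

# Two hypothetical models: one sees only the first feature, the other sees both.
model_a = LogisticRegression().fit(X_tr[:, :1], y_tr)
model_b = LogisticRegression().fit(X_tr, y_tr)

# Predicted probability of the positive class on the held-out rows.
proba_a = model_a.predict_proba(X_val[:, :1])[:, 1]
proba_b = model_b.predict_proba(X_val)[:, 1]

# Binary cross-entropy, written out by hand and via scikit-learn's log_loss.
manual_bce_b = -np.mean(y_val * np.log(proba_b) + (1 - y_val) * np.log(1 - proba_b))
print("log loss A:", log_loss(y_val, proba_a))
print("log loss B:", log_loss(y_val, proba_b), "manual:", manual_bce_b)

# Replace the default 0.5 cut-off with a custom threshold on the probabilities.
for threshold in (0.5, 0.6):
    preds = (proba_b >= threshold).astype(int)
    print(threshold, "predicted positive:", preds.sum(), "accuracy:", (preds == y_val).mean())
```

A lower log loss means the probabilities are closer to the true labels even when the thresholded predictions, and hence the accuracy, do not change. With the notebook's pipeline, the probabilities would come from `my_pipeline.predict_proba(...)[:, 1]`.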