##### Predicting Titanic Survivors
This file uses uses feature engineering, missing entry imputation, and machine learning to attempt to predict who did and who did not survive the 
Titanic. The data used here comes from Kaggle.com (specifically here: https://www.kaggle.com/c/titanic/data). As of Jun. 18, 2018, placing in the top 1,000 (out of 11,356) requires accuracy of: 80.382%. My best model submitted placed at 2,532nd place with Kaggle accuracy of 79.425%.

In [1]:
# Importing modules and the data
import pandas
import numpy as np
import math
from sklearn.metrics import mean_squared_error, accuracy_score

df_train = pandas.read_csv('train.csv')
df_test = pandas.read_csv('test.csv')

In [2]:
print(df_test.columns)
print(df_train.columns)

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


The training and testing data will be analyzed together in order to make sure that feature engineering covers everything. The two can be unseparated later once the feature engineering and imputation steps are complete. The 'Survived' column entries for the test set are NaN, but for the tranining set, they are all 1's and 0's. Therefore, as long as the rows don't get mixed up, a query of 'Survived == NaN' should do it.

In [3]:
df_combined = pandas.concat([df_train, df_test], sort=False).reset_index(drop=True)
df_combined.tail(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1308,1309,,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C


Just to get a sense of the data, print the 3 most common values (including NaN's) in each column and the number of missing entries in the column.

In [4]:
for col in df_combined.columns:
    # print top 3 most frequent values and their counts
    top_val_counts = df_combined[col].value_counts(dropna=False).nlargest(3)
    print('{}'.format(top_val_counts))
    
    # print num empties just in case it doesn't show up in the top 7
    num_empties = df_combined[col].isnull().sum()
    percent_empties = int(100 * num_empties / df_combined.shape[0])
    print('There are {} empty entries ({}%) in {}\n'.format(num_empties, percent_empties, col))


1309    1
449     1
431     1
Name: PassengerId, dtype: int64
There are 0 empty entries (0%) in PassengerId

 0.0    549
NaN     418
 1.0    342
Name: Survived, dtype: int64
There are 418 empty entries (31%) in Survived

3    709
1    323
2    277
Name: Pclass, dtype: int64
There are 0 empty entries (0%) in Pclass

Connolly, Miss. Kate           2
Kelly, Mr. James               2
Andersson, Mr. Johan Samuel    1
Name: Name, dtype: int64
There are 0 empty entries (0%) in Name

male      843
female    466
Name: Sex, dtype: int64
There are 0 empty entries (0%) in Sex

NaN      263
 24.0     47
 22.0     43
Name: Age, dtype: int64
There are 263 empty entries (20%) in Age

0    891
1    319
2     42
Name: SibSp, dtype: int64
There are 0 empty entries (0%) in SibSp

0    1002
1     170
2     113
Name: Parch, dtype: int64
There are 0 empty entries (0%) in Parch

CA. 2343    11
1601         8
CA 2144      8
Name: Ticket, dtype: int64
There are 0 empty entries (0%) in Ticket

8.05     60
13.00 

Summary:

+These are ready to go: PassengerId, Survived, Pclass, SibSp, Parch, Fare (except the 1 missing entry)

+These just need encoded: Sex, Embarked

+These need a lot of work:

Age - need to impute missing ages 

Name - extract titles (Mr, Master, Mrs, etc.)

Ticket - separate ticket prefixes (if present) and ticket numbers

Cabin - mostly empty; split empty/not-empty?


That covers all 12 of the original columns

A great place to start seems to be 'Fare' since it just needs 1 value imputed. Given the small scale, the mean fare is imputed for the missing fare. 

In [5]:
# Use mean fare to impute the 1 missing fare
nan_fare_idx = df_combined.query("Fare == 'NaN'").index[0]
df_combined.at[nan_fare_idx, 'Fare'] = df_combined['Fare'].mean()

The next easiest things to do are to encode Embarked and Sex. Each is dummy encoded. Embarked is encoded via pandas' get_dummies since it has more than 2 unique values while Sex is converted manually since it only has 2 unique values.

In [6]:
# Dummy encode Embarked
cols_to_dummy_encoded = ['Embarked']
df_combined = df_combined.join(pandas.get_dummies(data=df_combined[cols_to_dummy_encoded],
                                            dummy_na=True))
# Done with original Embarked col., so go ahead and drop it
df_combined = df_combined.drop(cols_to_dummy_encoded, axis=1)


# Dummy encode Sex
df_combined['Sex_m'] = (df_combined['Sex'] == 'male') * 1  
df_combined = df_combined.drop('Sex', axis=1)  # Done with original Sex col.


In [7]:
# Just to check that everything so far has been successful
df_combined.iloc[nan_fare_idx]

PassengerId                   1044
Survived                       NaN
Pclass                           3
Name            Storey, Mr. Thomas
Age                           60.5
SibSp                            0
Parch                            0
Ticket                        3701
Fare                       33.2955
Cabin                          NaN
Embarked_C                       0
Embarked_Q                       0
Embarked_S                       1
Embarked_nan                     0
Sex_m                            1
Name: 1043, dtype: object

Age still needs to be imputed, but that should be done at the end when the imputer can work with all columns. Let's extract ticket prefix and ticket number. Tickets have 0, 1, or 2 prefixes, and ticket numbers may or may not be present. Given the variable format of entries in the Ticket column, splitting on whitespace seems like a good first step. 

In [8]:
list_of_tickets = []
prefixes = []
for ticket_string in df_combined['Ticket'].values:
    split_on_whitespace = str.split(ticket_string)
    
    # if only 1 part to the ticket id
    if len(split_on_whitespace) == 1:
        try: # try to cast string to an int
            list_of_tickets.append( [int(split_on_whitespace[0])] )
        except ValueError:
            list_of_tickets.append( [ split_on_whitespace[0] ] )
            prefixes.append( split_on_whitespace[0] )
       
    # else assumed to have at least 2 parts; only care about 1st and last parts
    else:
        list_of_tickets.append( [split_on_whitespace[0], int(split_on_whitespace[-1])] )
        prefixes.append( split_on_whitespace[0] )

print('There are {} unique ticket prefixes'.format(len(np.unique(prefixes))))


There are 50 unique ticket prefixes


50 (+ 1 for those with no prefix) unique ticket prefixes is far too many. Dummy encoding ticket prefixes will create 50 very sparse (has a lot of 0's) columns. Let's see if there is anyway to reduce this number by combining some prefixes. 

In [9]:
print('The unique ticket prefixes are:')
print('{}'.format(np.unique(prefixes)))

The unique ticket prefixes are:
['A.' 'A./5.' 'A.5.' 'A/4' 'A/4.' 'A/5' 'A/5.' 'A/S' 'A4.' 'AQ/3.' 'AQ/4'
 'C' 'C.A.' 'C.A./SOTON' 'CA' 'CA.' 'F.C.' 'F.C.C.' 'Fa' 'LINE' 'LP'
 'P/PP' 'PC' 'PP' 'S.C./A.4.' 'S.C./PARIS' 'S.O./P.P.' 'S.O.C.' 'S.O.P.'
 'S.P.' 'S.W./PP' 'SC' 'SC/A.3' 'SC/A4' 'SC/AH' 'SC/PARIS' 'SC/Paris'
 'SCO/W' 'SO/C' 'SOTON/O.Q.' 'SOTON/O2' 'SOTON/OQ' 'STON/O' 'STON/O2.'
 'STON/OQ.' 'SW/PP' 'W./C.' 'W.E.P.' 'W/C' 'WE/P']


Just examining the list visually and based on this, https://www.encyclopedia-titanica.org/community/threads/ticket-numbering-system.20348/page-2, it seems that there are 10 (including the absent 'None' prefix). 

### 1 - 1st letter is A
A.
A./5.
A.5.
A/4
A/4.
A/5
A/5.
A/S
A4.
AQ/3.
AQ/4

### 2 - 1st letter is C
C
C.A.
C.A./SOTON
CA
CA.

### 3 - 1st letter is F
F.C.
F.C.C.
Fa

### 4 - 1st letter is L
LINE
LP

### 5 - 1st letter is P
P/PP
PC
PP

### 6 - First 2 letters are S and C
S.C./A.4.
S.C./PARIS
SC
SC/A.3
SC/A4
SC/AH
SC/PARIS
SC/Paris
SCO/W

### 7 - 1st 2 are NOT S and C
S.O./P.P.
S.O.C.
S.O.P.
S.P.
S.W./PP
SO/C
SW/PP


### 8 - SOTON or STON
SOTON/O.Q.
SOTON/O2
SOTON/OQ
STON/O
STON/O2.
STON/OQ.

### 9 - 1st letter is W
W./C.
W.E.P.
W/C
WE/P

### 10 - No prefix
None

Going from 51 prefixes to 10 is a great improvement, so let's go with these 10 groupings. 


The other thing to worry about is tickets that do not have a number. In such cases, a -1 will be assigned for the ticket number. 

In [10]:
for row_num, parsed_ticket_list in enumerate(list_of_tickets):
    if len(parsed_ticket_list) == 1:
        single_entry = parsed_ticket_list[0]
        try:
            # if the 1 entry is just a number
            int(single_entry) 
            # append to front of list
            list_of_tickets[row_num] = ['None'] + list_of_tickets[row_num] 
        except ValueError: # else, must be text
            list_of_tickets[row_num] = list_of_tickets[row_num] + [-1]
            
    else: # must be 2 entries in this row (representing the parsed ticket)
        first_letter = parsed_ticket_list[0][0]
        if first_letter is 'S':
            prefix = parsed_ticket_list[0].replace('.', '')
            if prefix[0:2] is 'SC':
                list_of_tickets[row_num][0] = 'SC'
            elif prefix[0:4] == 'SOT' or prefix[0:4] == 'STO':
                list_of_tickets[row_num][0] = 'SOT'
            else:
                list_of_tickets[row_num][0] = 'other_S'
        else: # first_letter is in ['A', 'C', 'F', 'L', 'P', 'W']:
            list_of_tickets[row_num][0] = first_letter


By this point, there is nothing left to do to with the tickets, so the ticket prefix and ticket number can be added to the dataframe now. 

In [11]:
# A function for comparing the length of the extracted columns
# and the original data frame. If they are the same length, 
# then it should be o.k. add the extracted columns. 
def safe_to_add_col(df, col_list, col_name):
    same_length = len(col_list) == len(df)
    print('List of {} and combined data frame are same length (safe to add '
          'to the dataframe)? {}'.format(col_name, same_length))

In [12]:
safe_to_add_col(df_combined, list_of_tickets, 'tickets')

df_combined['Ticket Prefix'] = [row[0] for row in list_of_tickets]
df_combined['Ticket Number'] = [row[1] for row in list_of_tickets]

# no longer need original column since it was split and parsed
df_combined = df_combined.drop('Ticket', axis=1)

List of tickets and combined data frame are same length (safe to add to the dataframe)? True


Now moving on to Cabin, let's see if just throwing pandas.get_dummies at the column would be a good idea. 

In [13]:
val_counts = df_combined['Cabin'].value_counts(dropna=False)
print('There are {} unique Cabins'.format(val_counts.shape[0]))

There are 187 unique Cabins


In [14]:
df_combined['Cabin_None'] = (df_combined['Cabin'].isnull()) * 1
#df_combined.head(5) # checked that encoding went o.k.
df_combined = df_combined.drop('Cabin', axis=1) # done w/ this col.


All that remains for the data preparation part is extracting information from Name, and imputing missing ages. Therefore, let's take a look at some of the names to see what we're dealing with. 

In [15]:
# Checkout what the names look like again - sorted order really helps btw
n = 53
for count, name in enumerate(sorted(df_combined['Name'].values)):
   if count % 50 == 0: # print every  nth name
       print(name)

Abbing, Mr. Anthony
Appleton, Mrs. Edward Dale (Charlotte Lamson)
Bazzani, Miss. Albina
Bradley, Mr. George ("George Arthur Brayton")
Carlsson, Mr. August Sigfrid
Collander, Mr. Erik Gustaf
Daniels, Miss. Sarah
Drew, Mr. James Vivian
Ford, Mrs. Edward (Margaret Ann Watson)
Goldschmidt, Mr. George B
Harper, Mr. Henry Sleeper
Hogeboom, Mrs. John C (Anna Andrews)
Johansson, Mr. Nils
Kilgannon, Mr. Thomas J
Lennon, Mr. Denis
Mallet, Mr. Albert
Milling, Mr. Jacob Christian
Natsch, Mr. Charles H
Olsson, Mr. Oscar Wilhelm
Perkin, Mr. John Henry
Richard, Mr. Emile
Sage, Master. William Henry
Sivola, Mr. Antti Wilhelm
Strandberg, Miss. Ida Sofia
Troutt, Miss. Edwina Celia "Winnie"
West, Mr. Edwy Arthur
de Messemaeker, Mrs. Guillaume Joseph (Emma)


Some things to notice from the above: 

+ Last name is always first and that it is separated by a comma from the rest of a person's name. 

+ Titles have a '.' at the end of them

+ Those with nicknames have their nicknames in quotes ("...")

Based on these differences, it should be fairly easy to extract last name and title. Last name will be extracted and stored in a dictionary where the key is the last name, and a list of passenger ids with that last name is the value. Presently, I don't know how to use last names, but hopefully I will think of something later on. 

In [16]:
list_of_lnames_titles = []
family_dict = {}  # key is lname; value is list of passenger ids w/ lname
for row in df_combined[['PassengerId', 'Name']].values:
    passenger_id = row[0]
    name =  row[1]
    
    # extract last name and title
    comma_separated = name.split(',')
    lname = comma_separated[0].strip(' ')  # strip whitespace for uniformity
    title = comma_separated[1].split('.')[0].strip(' ')
    
    list_of_lnames_titles.append([lname, title])
    
    # now append to dictionary
    if lname in family_dict.keys(): # if family name already in dict
        family_dict[lname].append(passenger_id)
    else:  # else not already in dict, so set it up
        family_dict.update( {lname: [passenger_id]} )
        
safe_to_add_col(df_combined, list_of_lnames_titles, 'lnames and tickets')

df_combined['Last Name'] = [lname_title[0] for lname_title in list_of_lnames_titles]
df_combined['Title'] = [lname_title[1] for lname_title in list_of_lnames_titles]
   
# done w/ name, so go ahead and drop the column (columns are on axis 1, not 0)
df_combined = df_combined.drop('Name', axis=1)
df_combined = df_combined.drop('Last Name', axis=1) # as I said, I don't know how to use it

List of lnames and tickets and combined data frame are same length (safe to add to the dataframe)? True


Now we can finally get to some machine learning. Machine learning will be used to impute the missing ages in the data set. Various algorithms will be tried, but as a baseline, let's get a mean squared error (MSE, which measures bias and variance) for imputing the mean for all missing ages.

In [17]:
# A function for telling the mean squared error of an imputer method
def age_mse_results(y_actual, y_predicted, imputer_uses_string):
    mse = mean_squared_error(y_actual, y_predicted)
    
    print('MSE for Age imputation that uses {0}: {1:.2f}'.format(imputer_uses_string, mse))
    print('\t ==> age predictions are off by {0:.1f} years on average'.format(math.sqrt(mse)))

In [18]:
# Dummy encoding since python's machine learning algorithms only take numbers
cols_to_dummy_encode = ['Ticket Prefix', 'Title']
df_combined = df_combined.join( pandas.get_dummies(df_combined[cols_to_dummy_encode]) )
df_combined = df_combined.drop(cols_to_dummy_encode, axis=1)


from sklearn.model_selection import train_test_split

ages_present = df_combined.dropna(subset=['Age']).reset_index(drop=True)

# Split into training (for fitting) and testing data (for testing general-
# izability of the imputer)
x_train_age, x_test_age, y_train_age, y_test_age =\
train_test_split(ages_present.drop(['Age', 'Survived'], axis=1),
                 ages_present['Age'],
                 test_size=0.3)

In [19]:
# Just imputing the mean age for all missing ages
age_mse_results(y_test_age, [y_train_age.mean()] * len(y_test_age), 'mean')

MSE for Age imputation that uses mean: 204.30
	 ==> age predictions are off by 14.3 years on average


A good next step would be to try linear regression.

In [20]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train_age, y_train_age)
age_mse_results(y_test_age, lr.predict(x_test_age), 'lin. regression')


MSE for Age imputation that uses lin. regression: 120.94
	 ==> age predictions are off by 11.0 years on average


  linalg.lstsq(X, y)


A decent improvement! Let's try some other algorithms just to see if we can do any better. Before doing that, however, the age data will be standard scaled. This can help algorithms converge faster by scaling everything down have a mean of 0 with variance of 1.

In [21]:
from sklearn.preprocessing import StandardScaler


sc = StandardScaler().fit(x_train_age, y_train_age)
x_train_age = sc.transform(x_train_age)
x_test_age = sc.transform(x_test_age)

# How about a regression tree?
from sklearn.tree import DecisionTreeRegressor

# set some of the mins in order to avoid overfitting too much
dtr = DecisionTreeRegressor(min_samples_split=int( len(x_train_age) * 0.01 ), 
                            min_samples_leaf=int( len(x_train_age) * 0.01 ),
                            min_impurity_decrease=0.003)
dtr.fit(x_train_age, y_train_age)

# Let's step it up: K neighbors regressor
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor(n_neighbors=7, algorithm='brute')
knr.fit(x_train_age, y_train_age)

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(max_depth=15, n_estimators=1000)
rfr.fit(x_train_age, y_train_age)

age_mse_results(y_test_age, dtr.predict(x_test_age), 'decision tree regressor')
age_mse_results(y_test_age, knr.predict(x_test_age), 'k neighbors regressor')   
age_mse_results(y_test_age, rfr.predict(x_test_age), 'random forest regressor')

MSE for Age imputation that uses decision tree regressor: 152.00
	 ==> age predictions are off by 12.3 years on average
MSE for Age imputation that uses k neighbors regressor: 137.56
	 ==> age predictions are off by 11.7 years on average
MSE for Age imputation that uses random forest regressor: 118.23
	 ==> age predictions are off by 10.9 years on average


It seems that the random forest regressor generally has the lowest mean squared error, so that is the one that will be used to impute the missing ages.

In [22]:
# Get all rows w/ missing ages
nan_ages = df_combined[df_combined.Age.isnull() == True]
imputed_ages = rfr.predict(sc.transform(nan_ages.drop(['Age', 'Survived'], axis=1)))

for idx, imputed_age in zip(nan_ages.index, imputed_ages):
    df_combined.at[idx, 'Age'] = imputed_age

In case these age imputation were terrible, a column will be added telling if the age was estimated. 'Age Was Estimated' will also include those ages that, by the data dictionary's explanation, were present but estimated. Such ages are greater than 1 and have the form xx.5 (e.g., 23.5).

In [23]:
df_combined['Age Was Estimated'] = 0

estimated_age_indecies = df_combined.query("Age % 1.0 != 0.0 and Age >= 1.0").index
for idx in estimated_age_indecies:
    df_combined.at[idx, 'Age Was Estimated'] = 1
    
# Just put cols in alphabetical order just to have a known ordering to cols
df_combined = df_combined[ sorted(df_combined.columns) ]

# Just check that everything went OK and that all columns are numbers
print(df_combined.tail(2))
df_combined.head(1)

            Age  Age Was Estimated  Cabin_None  Embarked_C  Embarked_Q  \
1307  30.457594                  1           1           0           0   
1308   8.084847                  1           1           1           0   

      Embarked_S  Embarked_nan     Fare  Parch  PassengerId  \
1307           1             0   8.0500      0         1308   
1308           0             0  22.3583      1         1309   

             ...          Title_Master  Title_Miss  Title_Mlle  Title_Mme  \
1307         ...                     0           0           0          0   
1308         ...                     1           0           0          0   

      Title_Mr  Title_Mrs  Title_Ms  Title_Rev  Title_Sir  Title_the Countess  
1307         1          0         0          0          0                   0  
1308         0          0         0          0          0                   0  

[2 rows x 42 columns]


Unnamed: 0,Age,Age Was Estimated,Cabin_None,Embarked_C,Embarked_Q,Embarked_S,Embarked_nan,Fare,Parch,PassengerId,...,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess
0,22.0,0,1,0,0,1,0,7.25,0,1,...,0,0,0,0,1,0,0,0,0,0


At this point, all data prepartaion (feature extraction, imputation, and encoding) has been done. Therefore, we are ready to move on to building a model to classify Titanic survivors. 

Before building the model, however, all columns (excpet Survived and PassengerId) will be StandardScaled. This is done to improve performane on neural networks (faster convergence) and k neighbors (distances between points makes more sense since no columns have different scales).

In [24]:
# Go column by column and standard scale each column
cols_to_std_scale = ['Age', 'Fare']
for col in cols_to_std_scale:
    df_combined[col] = StandardScaler().fit_transform(df_combined[col].values.reshape(-1,1))

The data that will be used to train, tune, and test the model (train.csv) has to be split from the data that will be submitted to Kaggle (test.csv).  The two can be split from df_combined by querying the Survived column for NaNs/not NaNs or by querying PassengerId for values from 1 to 891 and 892 to 1309 respectively.  

In [25]:
df_train = df_combined[df_combined.Survived.isnull() == False].reset_index(drop=True)

df_submission = df_combined[df_combined.Survived.isnull() == True].reset_index(drop=True)

As was alluded to already, df_train will be split into a few parts. First, it will be split into training and testing data. Then, testing data will be split into training data (again) and validation data. Training data will be used to fit/train the model. Validation data will be used to tune the model parameters. Testing data will be used at the very very end, when the model tuning process is complete. The idea behind tuning on validation  and testing on separtate testing data for 2 reasons.
1. Ensure that the model is not just memorizing the training data. This can be checked by comparing the accuracies of predictions on the training set and predictions on the validation set. If the training set's predictions are more accurate than the validation set's, then the model is overfitting (memorizing, really, the training data). If the validation set's predictions are more accurate, then the model is underfitting (not memorizing, really, enough of the training data). 

2. Get a sense of how the model performs on unseen data (testing, not validation data))
.

Fixed cutoffs are (0 to 623 and 624 to 891) are used for the initial train/test split in order to really be sure that the model has never seen the testing data. Fixed cutoffs are not used for the next train/validation split because validation is just used to make sure that the model is not memorizing the parameters on a specific run. Now, it is possible to overfit the training and validation set, but that is much harder to do. Also, if the training and validation sets are being overfitted, then the predictions on the separate testing data should be MUCH less accurate.

Also, PassengerId will be kept to be able to check that the query was correct and for the format that Kaggle requires for its submissions.

In [26]:
frac_of_training_data = 0.7
train_test_cutoff = int(df_train.shape[0] * frac_of_training_data)

# train/test split
df_test = df_train.iloc[train_test_cutoff+1:]
df_train = df_train.iloc[0:train_test_cutoff]

# now train/val split
x_train, x_val, y_train, y_val =\
train_test_split(df_train.drop('Survived', axis=1),
                 df_train['Survived'],
                 test_size=0.3,) # test_size really is validation size here
train_val_cutoff = int(df_train.shape[0] * frac_of_training_data)

x_test = df_test.drop('Survived', axis=1)
y_test = df_test['Survived']

A few different algorithms will be tried. Again, to get a good baseline, logistic regression will be used first, followed by a random forest classifier, k neighbors classifier, and a multilayer perceptron classifier (neural network). 

In [27]:
# A function to print the accuracy score of a Titanic survivor classification model
def classify_acc_results(y_actual, y_predicted, model_string):
    acc = accuracy_score(y_actual, y_predicted)
    print('ACC {0}: {1:.2f}%'.format(model_string, 100.0 * acc))

In [28]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(tol=1e-8, max_iter=10000, C=1.1)
lr.fit(x_train, y_train)

classify_acc_results(y_train, lr.predict(x_train), 'train logistic regression')
classify_acc_results(y_val, lr.predict(x_val), 'val logistic regression')

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=1000,
                             max_depth=5)
rfc.fit(x_train, y_train)

classify_acc_results(y_train, rfc.predict(x_train), 'train random forest classifier')
classify_acc_results(y_val, rfc.predict(x_val), 'val random forest classifier')

from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier(n_neighbors=28)
knc.fit(x_train, y_train)

classify_acc_results(y_train, knc.predict(x_train), 'train k neighbors classifier')
classify_acc_results(y_val, knc.predict(x_val), 'val k neighbors classifier')


from sklearn.neural_network import MLPClassifier
mlpc = MLPClassifier(hidden_layer_sizes=(40),
                     alpha=.0005,
                     learning_rate='constant',
                     solver='adam',
                     activation='relu',
                     max_iter=1000,
                     tol=1e-9)
mlpc.fit(x_train, y_train)

classify_acc_results(y_train, mlpc.predict(x_train), 'train mlp classifier')
classify_acc_results(y_val, mlpc.predict(x_val), 'val mlp classifier')

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()

x_train = lda.fit_transform(x_train, y_train)
x_val = lda.transform(x_val)

mlp_lda = MLPClassifier(hidden_layer_sizes=[200,], alpha=26)
mlp_lda.fit(x_train, y_train)

classify_acc_results(y_train, mlp_lda.predict(x_train), 'train lda mlp')
classify_acc_results(y_val, mlp_lda.predict(x_val), 'val lda mlp')

ACC train logistic regression: 82.57%
ACC val logistic regression: 82.89%
ACC train random forest classifier: 86.24%
ACC val random forest classifier: 83.42%
ACC train k neighbors classifier: 63.76%
ACC val k neighbors classifier: 65.24%
ACC train mlp classifier: 42.89%
ACC val mlp classifier: 45.99%




ACC train lda mlp: 77.29%
ACC val lda mlp: 77.54%


Quite surprisingly, it seems that logistic regression did the best of all the bunch. Let's choose the logistic regression model as the one to go with. (original comment; see updates at bottom)

Now that the final model has been selected, let's see how well it performs on the unseen testing data.

In [34]:
classify_acc_results(y_test, mlp_lda.predict(lda.transform(x_test)), 'testing mlp w/ LDA (FINAL MODEL)')

ACC testing mlp w/ LDA (FINAL MODEL): 82.02%


This level of accuracy is what we can expect for the submissions to Kaggle. Let's get our predictions for the test.csv data (found in df_submission if the query was correct)

In [32]:
x_submission = df_submission.drop('Survived', axis=1)
submission_predictions = mlp_lda.predict(lda.transform(x_submission))

df_submission['Survived'] = [int(i) for i in submission_predictions]
cols_to_write = ['PassengerId', 'Survived']

df_submission[cols_to_write].to_csv('kaggle_titanic_submission_jun182018_918pm.csv', index=False)

In [33]:
# Just curious what % of submission is predicted to survive.
# If it is like train.csv, then it should be around 40%, I think.
df_submission[df_submission.Survived > .4].shape[0] / df_submission.shape[0]

0.3684210526315789

On Kaggle, the accuracy score that places the model at 8846th place out of 11403 was: 0.76076. This, of course, is far below what was expected. Kaggle's accuracy score is only on a portion of the submission, not the entirety. So, it is possible that it just chose some of the worst predictions. That, however, is probably not the case, so I am not sure why this happened.

UPDATE 01; Jun. 17 7:36 PM:
Only standard scaling Age and Fare improved Kaggle accuracy to 0.78468, which puts the model in 4624th place.

UPDATE 02; Jun. 18 9:18 PM:
Transforming the data with linear discriminant analysis and then training a neural network on the LDA transformed data improved accuracy up to 0.79425, which puts this model at 2,532nd place out of 11,356 people. This just goes to show either that maybe something's up with Kaggle or that building generalizable models is difficult (for me, at least). 