# Titanic: Extreme Gradient Boosting Approach
This kernel outlines my best-performing solution to the Titanic prediction competition. Whenever I learn a new classification technique, I return to this dataset and test it out in hopes of increasing my prediction accuracy. My current solution places in the top 10% of over 19,000 submissions. 

Table of Contents: <br>
&emsp; [Introduction](#Data_Intro) <br>
&emsp; [Preprocessing](#Data_Prep) <br>
&emsp; &emsp; [Feature Engineering](#Feat) <br>
&emsp; &emsp; [Missing Data Imputation](#Imp) <br>
&emsp; [Model Fitting and Prediction](#Data_Pred) <br>
&emsp; &emsp; [Train-test Split](#TT) <br>
&emsp; &emsp; [Model Fitting](#Fit) <br>
&emsp; &emsp; [Test Set Prediction](#Pred) <br>

In [40]:
import numpy as np
import pandas as pd
from sklearn.linear_model import  SGDClassifier
from sklearn.linear_model import  LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
import warnings
import xgboost as xgb
warnings.filterwarnings('ignore')

train_validate = pd.read_csv("/Users/edwardwang/Desktop/titanic/Titanic/train.csv")
test = pd.read_csv("/Users/edwardwang/Desktop/titanic/Titanic/test.csv")

<a id='Data_Intro'></a>
# Introduction

In [41]:
train_validate.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Get number of NA values

In [42]:
train_validate.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We see that the 'Age' and 'Cabin' variables have a large share of NA values (177 and 687 of 891 rows, respectively). We will keep 'Cabin' for now and later convert it into an indicator of whether a cabin was recorded. We will impute and bin the 'Age' variable later on. We will also remove the 'Ticket' variable since it is irrelevant to our analysis.

<a id='Data_Prep'></a>
# Preprocessing

Drop the 'PassengerId' and 'Ticket' columns and separate the predictor and response variables:

In [43]:
# .copy() gives an independent frame, so the later column assignments do not act on a view
X = train_validate.iloc[:,[2,3,4,5,6,7,9,10,11]].copy()
y = train_validate.iloc[:,1]

Label encode categorical variables:

In [44]:
le = preprocessing.LabelEncoder() 
X['Pclass'] = le.fit_transform(X['Pclass'].astype(str))
X['Sex'] = le.fit_transform(X['Sex'].astype(str))
X['Embarked'] = le.fit_transform(X['Embarked'].astype(str))
X.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,2,"Braund, Mr. Owen Harris",1,22.0,1,0,7.25,,2
1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,71.2833,C85,0
2,2,"Heikkinen, Miss. Laina",0,26.0,0,0,7.925,,2
3,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,53.1,C123,2
4,2,"Allen, Mr. William Henry",1,35.0,0,0,8.05,,2


<a id='Feat'></a>
## Feature Engineering
We now look for transformations of the original variables that may yield useful features.

### Titles

Extract the titles of each name, and use these as an additional feature:

In [45]:
X['Title'] = X['Name'].str.extract(r'(\w+\.)')
X.drop(['Name'],axis = 1,inplace = True)
X['Title']

0        Mr.
1       Mrs.
2      Miss.
3       Mrs.
4        Mr.
       ...  
886     Rev.
887    Miss.
888    Miss.
889      Mr.
890      Mr.
Name: Title, Length: 891, dtype: object

As observed above, we used the regular expression '(\w+\.)' to find the first word in each name that ends with a period, which gives us the title of each passenger. We will now observe the value counts of this new 'Title' variable:
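
As a minimal illustration of how this pattern behaves, the sketch below applies the same regular expression to three names taken from the training data shown earlier. The variable names are illustrative, and `expand=False` is used only so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Three names copied from the head of the training data.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# \w+ matches a run of word characters and \. a literal period, so the first
# match in each name is the first word followed by a period, i.e. the title.
titles = names.str.extract(r'(\w+\.)', expand=False)
print(titles.tolist())
```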

In [46]:
X['Title'].value_counts()

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Major.         2
Mlle.          2
Col.           2
Countess.      1
Mme.           1
Sir.           1
Don.           1
Capt.          1
Jonkheer.      1
Lady.          1
Ms.            1
Name: Title, dtype: int64

We observe that there are many different titles, and some of them appear only once in the dataset. We will combine similar titles:
* Since 'Col.', 'Major.', and 'Capt.' are all terms representing military men, we will group these together.
* Similarly 'Mlle.' (mademoiselle), 'Miss.', and 'Ms.' are all terms signifying an unmarried woman, so we will group these together.
* The titles 'Master.' and 'Jonkheer.' signify unmarried young men and thus will be grouped together.
* The terms 'Lady.', 'Countess.', and 'Mme.' (madame) are all titles of respect for upper-class women and will be grouped together.
* The titles 'Sir.' and 'Don.' represent titles of nobility or respect for upper-class men and thus will be grouped together.
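
The grouping described above can also be expressed as a single lookup table. The following is a sketch rather than the code used in this kernel; the `title_groups` dictionary and the sample titles are illustrative assumptions.

```python
import pandas as pd

# Hypothetical lookup table: rare title -> group it is merged into.
title_groups = {
    'Col.': 'Major.', 'Capt.': 'Major.',
    'Mlle.': 'Miss.', 'Ms.': 'Miss.',
    'Jonkheer.': 'Master.',
    'Countess.': 'Lady.', 'Mme.': 'Lady.',
    'Don.': 'Sir.',
}

sample = pd.Series(['Mr.', 'Capt.', 'Mlle.', 'Countess.', 'Don.', 'Jonkheer.', 'Dr.'])

# Series.replace swaps every key for its value and leaves other titles untouched.
grouped = sample.replace(title_groups)
print(grouped.tolist())
```

Keeping the mapping in one dictionary makes it straightforward to apply exactly the same grouping to another data frame later.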

In [47]:
X[X['Title'].isin(['Col.', 'Major.','Capt.'])]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
449,0,1,52.0,0,0,30.5,C104,2,Major.
536,0,1,45.0,0,0,26.55,B38,2,Major.
647,0,1,56.0,0,0,35.5,A26,0,Col.
694,0,1,60.0,0,0,26.55,,2,Col.
745,0,1,70.0,1,1,71.0,B22,2,Capt.


In [48]:
X.loc[X['Title'].isin(['Col.', 'Major.','Capt.']),'Title'] = 'Major.'
X[X['Title'].isin(['Col.', 'Major.','Capt.'])]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
449,0,1,52.0,0,0,30.5,C104,2,Major.
536,0,1,45.0,0,0,26.55,B38,2,Major.
647,0,1,56.0,0,0,35.5,A26,0,Major.
694,0,1,60.0,0,0,26.55,,2,Major.
745,0,1,70.0,1,1,71.0,B22,2,Major.


In [49]:
X.loc[X['Title'].isin(['Mlle.', 'Miss.','Ms.']),'Title'] = 'Miss.'
X[X['Title'].isin(['Mlle.', 'Miss.','Ms.'])]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
2,2,0,26.0,0,0,7.9250,,2,Miss.
10,2,0,4.0,1,1,16.7000,G6,2,Miss.
11,0,0,58.0,0,0,26.5500,C103,2,Miss.
14,2,0,14.0,0,0,7.8542,,2,Miss.
22,2,0,15.0,0,0,8.0292,,1,Miss.
...,...,...,...,...,...,...,...,...,...
866,1,0,27.0,1,0,13.8583,,0,Miss.
875,2,0,15.0,0,0,7.2250,,0,Miss.
882,2,0,22.0,0,0,10.5167,,2,Miss.
887,0,0,19.0,0,0,30.0000,B42,2,Miss.


In [50]:
X.loc[X['Title'].isin(['Lady.', 'Countess.','Mme.']),'Title'] = 'Lady.'
X[X['Title'].isin(['Lady.', 'Countess.','Mme.'])]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
369,0,0,24.0,0,0,69.3,B35,0,Lady.
556,0,0,48.0,1,0,39.6,A16,0,Lady.
759,0,0,33.0,0,0,86.5,B77,2,Lady.


In [51]:
X.loc[X['Title'].isin(['Sir.','Don.']),'Title'] = 'Sir.'
X[X['Title'].isin(['Sir.','Don.'])]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
30,0,1,40.0,0,0,27.7208,,0,Sir.
599,0,1,49.0,1,0,56.9292,A20,0,Sir.


In [52]:
X.loc[X['Title'].isin(['Master.','Jonkheer.']),'Title'] = 'Master.'
X[X['Title'].isin(['Master.','Jonkheer.'])]

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
7,2,1,2.0,3,1,21.075,,2,Master.
16,2,1,2.0,4,1,29.125,,1,Master.
50,2,1,7.0,4,1,39.6875,,2,Master.
59,2,1,11.0,5,2,46.9,,2,Master.
63,2,1,4.0,3,2,27.9,,2,Master.
65,2,1,,1,1,15.2458,,0,Master.
78,1,1,0.83,0,2,29.0,,2,Master.
125,2,1,12.0,1,0,11.2417,,0,Master.
159,2,1,,8,2,69.55,,2,Master.
164,2,1,1.0,4,1,39.6875,,2,Master.


After combining similar titles, we again observe the value counts and then label encode the variable.

In [53]:
X['Title'].value_counts()

Mr.        517
Miss.      185
Mrs.       125
Master.     41
Dr.          7
Rev.         6
Major.       5
Lady.        3
Sir.         2
Name: Title, dtype: int64

In [54]:
X['Title'] = le.fit_transform(X['Title'].astype(str))
X['Title'].value_counts()

5    517
4    185
6    125
3     41
0      7
7      6
2      5
1      3
8      2
Name: Title, dtype: int64

### Family Size
Next, we will create a new variable which combines the 'Parch' and 'SibSp' variables to indicate the total family size aboard the Titanic. We add 1 so that the passenger is counted as well.

In [55]:
X['Fam_size'] = X['SibSp'] + X['Parch'] + 1
X['Fam_size']

0      2
1      2
2      1
3      2
4      1
      ..
886    1
887    1
888    4
889    1
890    1
Name: Fam_size, Length: 891, dtype: int64

### Cabin variable
We will transform this variable into an indicator of whether a cabin was recorded for the passenger (1) or the value is NA (0).

In [56]:
# 1 if a cabin was recorded, 0 if it was missing (avoids chained assignment)
X['Cabin'] = X['Cabin'].notna().astype(int)
X['Cabin']

0      0
1      1
2      0
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Cabin, Length: 891, dtype: int64

In [57]:
X['Cabin'].value_counts()

0    687
1    204
Name: Cabin, dtype: int64

<a id='Imp'></a>
## Missing Data Imputation
We focus our attention on imputing missing values for the 'Age' column, which we will do using KNN imputation. Note that the 2 missing values in the 'Embarked' column are not imputed by this step: the earlier astype(str) call converted them to the string 'nan', which the label encoder mapped to a category of its own.

In [58]:
knn = KNNImputer(n_neighbors = 5)
imputed = pd.DataFrame(knn.fit_transform(X))
imputed.columns = X.columns
imputed

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title,Fam_size
0,2.0,1.0,22.0,1.0,0.0,7.2500,0.0,2.0,5.0,2.0
1,0.0,0.0,38.0,1.0,0.0,71.2833,1.0,0.0,6.0,2.0
2,2.0,0.0,26.0,0.0,0.0,7.9250,0.0,2.0,4.0,1.0
3,0.0,0.0,35.0,1.0,0.0,53.1000,1.0,2.0,6.0,2.0
4,2.0,1.0,35.0,0.0,0.0,8.0500,0.0,2.0,5.0,1.0
...,...,...,...,...,...,...,...,...,...,...
886,1.0,1.0,27.0,0.0,0.0,13.0000,0.0,2.0,7.0,1.0
887,0.0,0.0,19.0,0.0,0.0,30.0000,1.0,2.0,4.0,1.0
888,2.0,0.0,16.8,1.0,2.0,23.4500,0.0,2.0,4.0,4.0
889,0.0,1.0,26.0,0.0,0.0,30.0000,1.0,0.0,5.0,1.0
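
To make the imputation step concrete, here is a small sketch on a toy array (not the Titanic data). With `n_neighbors=2`, each missing entry is replaced by the mean of that column over the two nearest rows, where distance is measured on the features that both rows have observed. The names `toy` and `toy_imputer` are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

toy_imputer = KNNImputer(n_neighbors=2)
filled = toy_imputer.fit_transform(toy)  # learns the neighbours and fills the NaNs
print(filled)
```

Because the distances are computed on the raw column values, features with a large numeric range (such as 'Fare') can dominate the choice of neighbours unless the columns are scaled first.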


Next, we will continue with our feature engineering by binning the 'Age' variable. We will transform it from a numerical variable to a categorical variable with 4 levels. Doing this will reduce noise in the variable while still preserving the general variable pattern.

In [59]:
imputed['Age'] = pd.qcut(imputed['Age'],4)
imputed['Age'] = le.fit_transform(imputed['Age'].astype(str))
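
As a small sketch of quantile binning (toy ages, not the Titanic data): `pd.qcut` chooses the bin edges so that each bin holds roughly the same number of observations. Passing `retbins=True` also returns those edges, which is one way to apply identical bins to new data with `pd.cut`. That reuse step is shown only as an option and is not what this kernel does.

```python
import pandas as pd

ages = pd.Series([2, 10, 20, 25, 30, 40, 50, 80])

# labels=False returns the bin index (0..3) instead of an Interval;
# retbins=True also returns the quartile edges that were chosen.
age_bin, edges = pd.qcut(ages, 4, labels=False, retbins=True)
print(age_bin.tolist())

# Reuse the same edges on new values. Values outside the original range
# would become NaN, so the toy values here stay within [2, 80].
new_ages = pd.Series([5, 35, 60])
new_bin = pd.cut(new_ages, bins=edges, labels=False, include_lowest=True)
print(new_bin.tolist())
```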

<a id='Data_Pred'></a>
# Model Fitting and Prediction
<a id='TT'></a>
## Train-test split
We begin this section by splitting the training set into training and validation sets using a 70:30 split. This will allow us to tune our hyperparameters by evaluating model performance on the validation set.

In [60]:
X_train, X_validate, y_train, y_validate = train_test_split(imputed, y, test_size = .3, random_state = 7313)

In [61]:
X_train.count()

Pclass      623
Sex         623
Age         623
SibSp       623
Parch       623
Fare        623
Cabin       623
Embarked    623
Title       623
Fam_size    623
dtype: int64

In [62]:
y_train.count()

623

In [63]:
X_validate.count()

Pclass      268
Sex         268
Age         268
SibSp       268
Parch       268
Fare        268
Cabin       268
Embarked    268
Title       268
Fam_size    268
dtype: int64

In [64]:
y_validate.count()

268

We observe that our training set contains 623 observations while our validation set has 268 observations.
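
These sizes follow directly from `test_size`: scikit-learn rounds the held-out count up, so 30% of 891 rows is ceil(267.3) = 268, and the remaining 623 rows are used for training. A quick check on placeholder row indices (the `rows` array merely stands in for the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(891)  # placeholder for the 891 training rows
train_rows, validate_rows = train_test_split(rows, test_size=0.3, random_state=7313)
print(len(train_rows), len(validate_rows))
```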

<a id='Fit'></a>
## Model Fitting
In this section we select our classification algorithm, tune hyperparameters, and evaluate model performance on our validation set.

In [65]:
params = {'n_estimators':range(50,500,50),
          'learning_rate':np.arange(.1,1,.05),
          'max_depth':range(2,9),
          'subsample':np.arange(.1,1,.1),
          'colsample_bytree':np.arange(.1,1,.1)}
gbm = xgb.XGBClassifier()
cv_rf = RandomizedSearchCV(gbm, params, n_iter = 200, cv = 4, random_state = 7339)
cv_rf.fit(X_train,y_train)
cv_rf.best_params_

{'subsample': 0.1,
 'n_estimators': 100,
 'max_depth': 8,
 'learning_rate': 0.15000000000000002,
 'colsample_bytree': 0.9}

After tuning the hyperparameters with randomized-search cross-validation, the best values found are shown above. Next, we evaluate the tuned model on the validation set and calculate its accuracy.

In [66]:
cv_rf.score(X_validate, y_validate)

0.8283582089552238

We observe an accuracy of about 83% on the validation set.
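
The tune-then-validate workflow above can be sketched on synthetic data. To keep the example light on dependencies it swaps in scikit-learn's `GradientBoostingClassifier` for `xgb.XGBClassifier`. The data set, the smaller parameter grid, and `n_iter=5` are all illustrative assumptions, not the settings used in this kernel.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic binary classification data standing in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_params = {
    'n_estimators': range(20, 100, 20),
    'learning_rate': [0.1, 0.2, 0.3],
    'max_depth': range(2, 5),
    'subsample': [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0), demo_params,
                            n_iter=5, cv=4, random_state=0)
search.fit(X_tr, y_tr)  # cross-validates 5 sampled settings, then refits the best one

print(search.best_params_)
print(search.best_score_)          # mean cross-validated accuracy of the best setting
print(search.score(X_val, y_val))  # accuracy of the refitted model on held-out data
```

Note that `score` on the search object uses the best estimator refitted on all of the data passed to `fit`, so the validation rows play no part in choosing the hyperparameters.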

<a id='Pred'></a>
## Test Set Prediction
Before we make predictions on our final test set, we will perform all preprocessing steps on the dataset.

In [67]:
labs = test['PassengerId']
test = test.iloc[:,[1,2,3,4,5,6,8,9,10]].copy()

In [68]:
test

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,"Kelly, Mr. James",male,34.5,0,0,7.8292,,Q
1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,7.0000,,S
2,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,9.6875,,Q
3,3,"Wirz, Mr. Albert",male,27.0,0,0,8.6625,,S
4,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,12.2875,,S
...,...,...,...,...,...,...,...,...,...
413,3,"Spector, Mr. Woolf",male,,0,0,8.0500,,S
414,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,108.9000,C105,C
415,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,7.2500,,S
416,3,"Ware, Mr. Frederick",male,,0,0,8.0500,,S


In [69]:
test['Pclass'] = le.fit_transform(test['Pclass'].astype(str))
test['Sex'] = le.fit_transform(test['Sex'].astype(str))
test['Embarked'] = le.fit_transform(test['Embarked'].astype(str))

In [70]:
test['Title'] = test['Name'].str.extract(r'(\w+\.)')
test.drop(['Name'],axis = 1,inplace = True)
# Build the masks from the test set's own titles (not from X, whose titles are already encoded)
test.loc[test['Title'].isin(['Col.', 'Major.','Capt.']),'Title'] = 'Major.'
test.loc[test['Title'].isin(['Mlle.', 'Miss.','Ms.']),'Title'] = 'Miss.'
test.loc[test['Title'].isin(['Lady.', 'Countess.','Mme.','Dona.']),'Title'] = 'Lady.'
test.loc[test['Title'].isin(['Sir.','Don.']),'Title'] = 'Sir.'
test.loc[test['Title'].isin(['Master.','Jonkheer.']),'Title'] = 'Master.'
test['Title'] = le.fit_transform(test['Title'].astype(str))
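
A label encoder assigns integers in sorted order of the classes it sees while fitting, so the codes on the test set only match the training codes if the mapping is the same. The sketch below contrasts reusing an encoder fitted on the nine grouped training titles with re-fitting on a few sample test titles. The variable names and the three sample titles are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

train_titles = ['Dr.', 'Lady.', 'Major.', 'Master.', 'Miss.', 'Mr.', 'Mrs.', 'Rev.', 'Sir.']
test_titles = ['Mr.', 'Rev.', 'Miss.']

# Fit once on the training titles, then reuse the same mapping on the test titles.
title_encoder = LabelEncoder().fit(train_titles)
consistent = title_encoder.transform(test_titles)
print(consistent.tolist())

# Re-fitting on the test titles alone numbers only the classes seen there.
refit = LabelEncoder().fit_transform(test_titles)
print(refit.tolist())
```

One trade-off of reusing the fitted encoder is that `transform` raises an error for a label it has not seen, so any title that appears only in the test set has to be grouped into a known category first.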

In [71]:
test['Fam_size'] = test['SibSp'] + test['Parch'] + 1

In [72]:
# 1 if a cabin was recorded, 0 if it was missing (avoids chained assignment)
test['Cabin'] = test['Cabin'].notna().astype(int)

In [73]:
# KNN-impute the test set using the imputer already fitted on the training set
imp_test = pd.DataFrame(knn.transform(test))
imp_test.columns = X.columns
imp_test['Age'] = pd.qcut(imp_test['Age'],4)
imp_test['Age'] = le.fit_transform(imp_test['Age'].astype(str))

In [74]:
imp_test

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title,Fam_size
0,2.0,1.0,2,0.0,0.0,7.8292,0.0,1.0,5.0,1.0
1,2.0,0.0,3,1.0,0.0,7.0000,0.0,2.0,6.0,2.0
2,1.0,1.0,3,0.0,0.0,9.6875,0.0,1.0,5.0,1.0
3,2.0,1.0,1,0.0,0.0,8.6625,0.0,2.0,5.0,1.0
4,2.0,0.0,0,1.0,1.0,12.2875,0.0,2.0,6.0,3.0
...,...,...,...,...,...,...,...,...,...,...
413,2.0,1.0,2,0.0,0.0,8.0500,0.0,2.0,5.0,1.0
414,0.0,0.0,2,0.0,0.0,108.9000,1.0,0.0,1.0,1.0
415,2.0,1.0,2,0.0,0.0,7.2500,0.0,2.0,5.0,1.0
416,2.0,1.0,2,0.0,0.0,8.0500,0.0,2.0,5.0,1.0


Next, we combine the training and validation sets and re-tune the hyperparameters of our XGBoost classifier on the full training data.

In [75]:
gbm1 = xgb.XGBClassifier()
cv_rf = RandomizedSearchCV(gbm1, params, n_iter = 200, cv = 4, random_state = 7339)
cv_rf.fit(imputed,y)
cv_rf.best_params_

{'subsample': 0.9,
 'n_estimators': 200,
 'max_depth': 4,
 'learning_rate': 0.5000000000000001,
 'colsample_bytree': 0.1}

In [76]:
preds = cv_rf.predict(imp_test)

Our final predictions are assembled into a data frame and displayed below:

In [77]:
dfdict = {'PassengerID':labs, 'Survived':preds}
df_new = pd.DataFrame(dfdict)
df_new

Unnamed: 0,PassengerID,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0
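
To submit predictions like these, they need to be written to a CSV file without the data frame index. The sketch below uses a tiny stand-in frame and an in-memory buffer. The file name in the comment is hypothetical, and the header `PassengerId` follows the spelling of that column in the input data.

```python
import io
import pandas as pd

# Stand-in for the predictions frame, using the first three rows shown above.
submission = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 0, 0]})

buffer = io.StringIO()
# On disk this would be, e.g., submission.to_csv('submission.csv', index=False)
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```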
