## Introduction

The Titanic dataset is a well-known binary classification problem in machine learning. It requires you to predict which passengers survived the Titanic based on various features of the passengers. The following is my Kaggle submission, which earned an accuracy of 79.9% and placed in the top 12% of all competing teams.

You can download the dataset and see more details about the competition [here](https://www.kaggle.com/c/titanic/overview). You can have a look at my Kaggle profile [here](https://www.kaggle.com/cccmmmddd).

## Importing Pandas and Exploring the Dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [4]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


## Manipulating Columns

In this section I did two things: 1) I added a dummy variable column, `male`, derived from the `Sex` column, where 1 indicates that the passenger is male and 0 indicates that the passenger is female. This gives my machine learning model a numeric input in place of the text values. 2) I filled the missing values in the `Age` column with the mean of the values in the `Age` column.

In [5]:
is_male = pd.get_dummies(df.Sex, drop_first=True)
df = pd.concat([df, is_male], axis = 1)
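
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up four-row frame. The `toy` data is hypothetical, and `dtype=int` is an assumption added so the result is 0/1 regardless of pandas version. Without `drop_first`, `get_dummies` would produce both a `female` and a `male` column; dropping the first level leaves only `male`.

```python
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Levels are ordered alphabetically, so 'female' is the first level and is dropped
dummies = pd.get_dummies(toy["Sex"], drop_first=True, dtype=int)

# Attach the single remaining indicator column to the frame
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

One column is enough here because `male == 0` already implies female, so a second column would carry no extra information.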

In [6]:
df.info() #verifying that the male column has been added

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
male           891 non-null uint8
dtypes: float64(2), int64(5), object(5), uint8(1)
memory usage: 84.5+ KB


In [7]:
df['Age'] = df['Age'].fillna(df['Age'].mean()) #filling in the missing values in the Age column
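
As a small worked example of mean imputation, the sketch below uses three made-up ages (hypothetical values, not rows from the dataset). It assigns the filled column back to the frame, which is one way to write the step without relying on in-place modification of a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.DataFrame({"Age": [22.0, np.nan, 38.0]})

# mean() skips NaN by default, so this is (22 + 38) / 2
mean_age = ages["Age"].mean()

# Replace the missing entry with the mean and store the result back
ages["Age"] = ages["Age"].fillna(mean_age)
print(ages)
```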

In [8]:
df.info() #checking the columns have been adjusted correctly.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
male           891 non-null uint8
dtypes: float64(2), int64(5), object(5), uint8(1)
memory usage: 84.5+ KB


## Creating, Fitting and Running the Model

Now that my data is in the correct format, I can begin creating the model. Since XGBoost is a tree-based algorithm, I do not need to do any feature scaling.
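
The claim about feature scaling can be illustrated with a single decision tree. This is a sketch under stated assumptions: the data is synthetic, and scikit-learn's `DecisionTreeClassifier` stands in for XGBoost's trees. Tree splits depend only on the ordering of values within a feature, so changing a column's unit should leave the predictions unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for two features: age in years, fare rounded to cents
age = rng.integers(1, 80, size=200).astype(float)
fare = np.round(rng.uniform(5, 500, size=200), 2)
y = ((fare > 50) & (age < 40)).astype(int)  # made-up target rule

X = np.column_stack([age, fare])
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1024  # change the unit of the fare column only

tree_a = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_rescaled, y)

# Compare predictions made on the original and on the rescaled features
same = np.array_equal(tree_a.predict(X), tree_b.predict(X_rescaled))
print(same)
```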

In [9]:
from sklearn.model_selection import train_test_split
import xgboost as xgb

In [10]:
X = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'male']] #features. I have left out several columns
y = df['Survived'] #target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=999) #splitting the data

Above, I select the columns I wish to use as features as well as my target column. I have left out several columns that I do not think will help the model train. The Name, Embarked, Ticket and Cabin columns may have some causal impact on the probability of survival, but including them would make the model much more complex. On balance, I decided to leave those four columns, amongst others, out.

In [11]:
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123, max_depth = 5) #instantiating the model

Above, I instantiate the model. After running the model with a range of different hyperparameters, I found that 10 estimators and a maximum depth of 5 produced the best results.
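
The text does not say how the different hyperparameters were tried. One common approach is a cross-validated grid search, sketched below under loud assumptions: the data is synthetic, the grid values are illustrative, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in so the example runs without XGBoost installed. `XGBClassifier` follows the scikit-learn estimator interface, so it could be passed to `GridSearchCV` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the six Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Illustrative grid, not the values actually tried for the submission
param_grid = {"n_estimators": [10, 50], "max_depth": [3, 5]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for xgb.XGBClassifier
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation on the demo data
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Cross-validation scores each combination on several splits, which makes the choice less sensitive to one particular `random_state` in `train_test_split`.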

In [12]:
xg_cl.fit(X_train, y_train) #fitting the training data
y_pred = xg_cl.predict(X_test) #creating predictions

In [13]:
from sklearn.metrics import classification_report, confusion_matrix

In [14]:
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.95      0.88       146
           1       0.85      0.60      0.70        77

    accuracy                           0.83       223
   macro avg       0.83      0.77      0.79       223
weighted avg       0.83      0.83      0.82       223

[[138   8]
 [ 31  46]]


Based on the classification report, my model generates good predictions on the held-out 25% of the training dataset (83% accuracy). However, to be sure it performs well on unseen passengers, I have to generate predictions for the separate test dataset.
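
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below copies the four counts from the output above (rows are actual classes, columns are predicted classes); the variable names are my own.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 138, 8   # actual 0 (did not survive)
fn, tp = 31, 46   # actual 1 (survived)

total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all passengers classified correctly
precision_1 = tp / (tp + fp)      # of those predicted to survive, how many did
recall_1 = tp / (tp + fn)         # of actual survivors, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
```

The recall of 0.60 for class 1 means that 31 of the 77 actual survivors in the held-out split were predicted as not surviving, which is where most of the model's errors lie.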

## Using my Model to make Predictions on the Test Dataset

The first steps involve data manipulation similar to what I performed on the training dataset (i.e. filling the missing values with the mean and creating a dummy variable column from the `Sex` column).

In [15]:
dftest = pd.read_csv('test.csv')

In [16]:
dftest['Age'] = dftest['Age'].fillna(dftest['Age'].mean()) #filling in the missing values in the Age column
dftest['Fare'] = dftest['Fare'].fillna(dftest['Fare'].mean()) #filling in the missing values in the Fare column

In [17]:
dftest.info() #checking the columns have been adjusted correctly.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [18]:
is_male = pd.get_dummies(dftest.Sex, drop_first=True) #creating a dummy variable column from the Sex column
dftest = pd.concat([dftest, is_male], axis = 1) #adding the dummy variable column to the test DataFrame
X_new = dftest[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'male']]
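
Because the same preparation steps are applied to both the training and the test data, they could be collected in one function so the two cannot drift apart. The sketch below is one possible way to do that, not the code used for the submission: `prepare_features` and the `toy` frame are hypothetical, and the comparison `Sex == "male"` is used as an equivalent of the dummy column for this two-valued column.

```python
import numpy as np
import pandas as pd

FEATURES = ["Pclass", "Age", "SibSp", "Parch", "Fare", "male"]

def prepare_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean-fill Age and Fare, add a 0/1 `male` column."""
    out = frame.copy()  # leave the caller's frame untouched
    for col in ["Age", "Fare"]:
        out[col] = out[col].fillna(out[col].mean())
    out["male"] = (out["Sex"] == "male").astype(int)
    return out[FEATURES]

# Made-up rows with one missing Age and one missing Fare
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "male"],
    "Age": [20.0, np.nan, 40.0],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 2],
    "Fare": [7.25, 71.28, np.nan],
})
X_toy = prepare_features(toy)
print(X_toy)
```

With such a helper, the two feature frames would come from `prepare_features(df)` and `prepare_features(dftest)`, each filled with its own column means as in the cells above.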

In [19]:
X_new.info() #checking the columns have been adjusted correctly.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
Pclass    418 non-null int64
Age       418 non-null float64
SibSp     418 non-null int64
Parch     418 non-null int64
Fare      418 non-null float64
male      418 non-null uint8
dtypes: float64(2), int64(3), uint8(1)
memory usage: 16.8 KB


Below, I use the model I fitted on the training data to make predictions for the test dataset.

In [20]:
test_y_pred = xg_cl.predict(X_new) #creating my new predictions

## Creating a csv File with my Predictions for the Test Dataset

I then format the data and create a csv file out of my predictions.

In [21]:
submission = pd.DataFrame({
    'PassengerID': dftest['PassengerId'],
    'Survived': test_y_pred
})

In [22]:
submission.head()

| | PassengerID | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |


In [23]:
submission.to_csv('titanicgit.csv', index = False)
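
Before uploading, it can be worth checking the file's basic shape. The sketch below is a hypothetical sanity check, not part of the original notebook: `check_submission` is an invented helper, the three-row frame is made up, and the CSV is written to an in-memory buffer rather than to disk.

```python
import io
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_rows: int) -> list:
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if sub.shape != (expected_rows, 2):
        problems.append(f"expected shape ({expected_rows}, 2), got {sub.shape}")
    if sub.isna().any().any():
        problems.append("missing values present")
    if sub.shape[1] == 2:
        if not sub.iloc[:, 0].is_unique:
            problems.append("duplicate passenger ids")
        if not set(sub.iloc[:, 1].unique()) <= {0, 1}:
            problems.append("predictions are not all 0 or 1")
    return problems

# Made-up predictions for three passengers
toy_sub = pd.DataFrame({"PassengerID": [892, 893, 894], "Survived": [0, 1, 0]})

# Round-trip through CSV, as the real file would be read back
buffer = io.StringIO()
toy_sub.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(check_submission(reloaded, expected_rows=3))
```

For the real file, the call would be `check_submission(submission, expected_rows=len(dftest))`, which should report no problems if there is one prediction per test passenger.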

My model scored slightly higher on my held-out split of the training dataset (83% accuracy) than it did on Kaggle's test dataset (79.9%). This suggests that it overfitted somewhat to the training data. However, the overfitting was only slight, given that the model still performed well on the test dataset.
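
One way to put a number on overfitting is to compare accuracy on the rows the model was fitted on with accuracy on rows it has not seen. The sketch below shows the comparison under stated assumptions: the data is synthetic, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier`, with the same number of estimators and maximum depth as above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=999
)

# Stand-in model with the same size settings as the XGBoost classifier above
model = GradientBoostingClassifier(n_estimators=10, max_depth=5, random_state=123)
model.fit(X_fit, y_fit)

fit_acc = accuracy_score(y_fit, model.predict(X_fit))     # rows seen during fitting
held_acc = accuracy_score(y_held, model.predict(X_held))  # rows not seen
gap = fit_acc - held_acc

print(f"fitted rows: {fit_acc:.3f}  held-out rows: {held_acc:.3f}  gap: {gap:.3f}")
```

A large positive gap would point to overfitting; a small one, like the few percentage points reported above, is usually less of a concern.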