## I have just learned about the GBDT algorithm and about using grid search to find the best hyperparameters for a GBDT classifier, so I practiced on the Titanic dataset. The public score on the Kaggle leaderboard is 0.77990.  

The raw dataset is available at https://www.kaggle.com/c/titanic/data. I also provide the data as Titanic_train.csv and Titanic_test.csv.


This is my first attempt at using GBDT for classification, and I am not good at feature engineering. Suggestions on any part are welcome; I would appreciate them.  



Attributes description  
PassengerId:  passenger's identification  
Survived: target variable (not present in the test dataset), 0 = No, 1 = Yes  
Pclass: ticket class, 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)  
Name: passenger's name  
Sex: male or female  
Age: passenger's age  
SibSp: number of siblings / spouses aboard  
Parch: number of parents / children aboard  
Ticket: ticket number  
Fare: passenger fare  
Cabin: cabin number  
Embarked: port of embarkation

In [2]:
# import required libraries
import numpy as np 
import pandas as pd 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split,GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [3]:
# read train and test data
train_df = pd.read_csv("Titanic_train.csv")
test_df = pd.read_csv("Titanic_test.csv")

In [4]:
# shape of train and test data
print(train_df.shape)
print(test_df.shape)

(891, 12)
(418, 11)


The training dataset has one more column than the test dataset: the target, Survived.
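
The extra column can also be found by comparing the column names directly. A minimal sketch, using two small hypothetical frames (`train_demo`, `test_demo`) in place of `train_df` and `test_df`:

```python
import pandas as pd

# Hypothetical miniature stand-ins for train_df and test_df
train_demo = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
test_demo = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 2]})

# Columns present in the training frame but absent from the test frame
extra_cols = sorted(set(train_demo.columns) - set(test_demo.columns))
print(extra_cols)
```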

In [5]:
# show summary of train_df
train_df.describe()

       PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
count        891.0     891.0     891.0      714.0     891.0     891.0      891.0
mean         446.0  0.383838  2.308642  29.699118  0.523008  0.381594  32.204208
std     257.353842  0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min            1.0       0.0       1.0       0.42       0.0       0.0        0.0
25%          223.5       0.0       2.0     20.125       0.0       0.0     7.9104
50%          446.0       0.0       3.0       28.0       0.0       0.0    14.4542
75%          668.5       1.0       3.0       38.0       1.0       0.0       31.0
max          891.0       1.0       3.0       80.0       8.0       6.0   512.3292


In [6]:
# check type of each column
train_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [7]:
# convert Pclass to a string (object) column, since it is a class label rather than a quantity
train_df.Pclass = train_df.Pclass.astype('str')
train_df.dtypes

PassengerId      int64
Survived         int64
Pclass          object
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
# count the unique values in each object column
cat_list = ["Pclass","Name","Sex","Ticket","Cabin","Embarked"]
train_df[cat_list].nunique()

Pclass        3
Name        891
Sex           2
Ticket      681
Cabin       147
Embarked      3
dtype: int64

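The training data has 891 rows. "Name" has 891 unique values and "Ticket" has 681, so both come close to one distinct value per passenger. Columns this varied are hard to generalize from when used as raw categories, so we could consider dropping them.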
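
One way to make "so various" concrete is the share of rows that carry a distinct value: 891/891 = 1.0 for "Name" and 681/891 ≈ 0.76 for "Ticket". A minimal sketch of that ratio on a hypothetical toy frame (`demo` is not part of the Titanic data):

```python
import pandas as pd

# Hypothetical toy frame: "Name" is unique per row, "Sex" is not
demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "male", "female"],
})

# Number of distinct values divided by the number of rows, per column
unique_ratio = demo.nunique() / len(demo)
print(unique_ratio)
```

A ratio near 1.0 means almost every row has its own value, which is the situation described for "Name".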

I think the following steps count as feature engineering. If I am wrong, please tell me.

In [9]:
# drop columns
drop_cols = ["Name","Ticket"]
train_df.drop(drop_cols, axis=1, inplace=True)
test_df.drop(drop_cols, axis=1, inplace=True)

In [10]:
# extract the target
train_y = train_df.Survived

In [11]:
# extract ID of test for submission file
test_ID = test_df.PassengerId

In [12]:
# drop PassengerId
train_X = train_df.drop(["PassengerId"], axis=1)
test_X = test_df.drop(["PassengerId"], axis=1)

In [13]:
# extract features for training
train_X = train_X.drop(["Survived"], axis=1)

This part handles missing values and encodes the categorical columns so that they can be used in the XGBoost model.

In [14]:
# check columns with NaN
cols_with_missing = [col for col in train_X.columns 
                                 if train_X[col].isnull().any()]
cols_with_missing

['Age', 'Cabin', 'Embarked']
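
Besides knowing which columns have missing values, it helps to know how many. The summary above already hints at this: Age has a count of 714 out of 891 rows, so 177 values are missing. A minimal sketch of counting them per column, on a hypothetical toy frame rather than the real `train_X`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Cabin": [None, "C85", None, None],
})

missing_counts = demo.isnull().sum()    # number of missing values per column
missing_share = demo.isnull().mean()    # fraction of rows missing per column

# Show only the columns that actually have missing values
print(missing_counts[missing_counts > 0])
print(missing_share[missing_share > 0])
```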

First, decide which columns are numeric and which are categorical, then handle them separately.

In [15]:
# two lists holding the names of the numerical and the categorical columns
num_cols = ["Age","SibSp","Parch","Fare"]
cat_cols = ["Sex","Cabin","Embarked","Pclass"]

In [16]:
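# impute missing values in num_cols (SimpleImputer defaults to the column mean)
num_imputer = SimpleImputer()
# fit on the training data only, then reuse the training means for the test data
train_X[num_cols] = num_imputer.fit_transform(train_X[num_cols])
test_X[num_cols] = num_imputer.transform(test_X[num_cols])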
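
A common convention is to learn the imputation values from the training data only and reuse them for the test data, so that nothing about the test set leaks into preprocessing. A minimal sketch of that behaviour on small hypothetical arrays (`train_age` and `test_age` are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data; the mean of the observed training values is 25
train_age = np.array([[20.0], [30.0], [np.nan]])
test_age = np.array([[np.nan], [60.0]])

imputer = SimpleImputer()                         # default strategy is "mean"
train_filled = imputer.fit_transform(train_age)   # learns the mean from the training data
test_filled = imputer.transform(test_age)         # reuses the training mean

print(imputer.statistics_)   # the learned fill value per column
print(test_filled.ravel())   # the missing test value is filled with the training mean
```

Calling `fit_transform` on the test data instead would fill the gap with the test column's own mean (60 here), which is a different value from the one the model was trained with.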

In [17]:
# encode categorical columns as integer labels; each encoder is fitted on the
# train and test values together so that both share one mapping
for col in cat_cols:
    cat = LabelEncoder()
    cat.fit(list(train_X[col].values.astype('str')) + list(test_X[col].values.astype('str')))
    train_X[col] = cat.transform(list(train_X[col].values.astype('str')))
    test_X[col] = cat.transform(list(test_X[col].values.astype('str')))
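
To see what this encoding does, here is a minimal sketch on one hypothetical column (the values are made up, loosely modelled on Embarked). Note that after `astype('str')` a missing value becomes the string "nan" and gets its own label:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values; the training column has one missing entry
train_col = pd.Series(["S", "C", np.nan, "S"], dtype=object)
test_col = pd.Series(["Q", "S"], dtype=object)

encoder = LabelEncoder()
# Fit on the union of both columns so every category seen in either one gets a label
encoder.fit(list(train_col.values.astype("str")) + list(test_col.values.astype("str")))

encoded_train = encoder.transform(train_col.values.astype("str"))
print(list(encoder.classes_))   # categories in sorted order; index = label
print(list(encoded_train))
```

The integer labels follow the sorted order of the strings, so the numeric order carries no meaning of its own. Tree-based models can usually work with that, but it is worth keeping in mind.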

After feature engineering, handling missing values, and encoding the categorical columns, I use XGBoost to build the model and predict on the test dataset.

In [18]:
# create XGBClassifier instance
classifier = XGBClassifier()
# hyperparameter grid; each list holds only the value that came out best in an
# earlier, larger search, so that this run is fast
grid_param = {'learning_rate': [0.06],
              'n_estimators': [300],
              'colsample_bytree': [0.7],
              'reg_alpha': [0.04]
              }

# 10-fold cross-validated grid search scored by accuracy, using all CPU cores
gd_sr = GridSearchCV(estimator=classifier,
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=10,
                     n_jobs=-1,
                     verbose=1)
gd_sr.fit(train_X, train_y) 

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    2.4s finished


GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'learning_rate': [0.06], 'n_estimators': [300], 'colsample_bytree': [0.7], 'reg_alpha': [0.04]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [19]:
# print the best hyperparameters; here the grid has a single combination, so these are the values set above
# I used best_params_ to pick the best values from a larger grid of hyperparameter combinations
best_parameters = gd_sr.best_params_  
print(best_parameters)  

{'colsample_bytree': 0.7, 'learning_rate': 0.06, 'n_estimators': 300, 'reg_alpha': 0.04}
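
With one value per hyperparameter the search has a single candidate. With several values per key, GridSearchCV tries every combination and keeps the one with the best mean cross-validated score. A minimal sketch, assuming synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` (so it runs without xgboost); the candidate values are hypothetical, not the ones from the original search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for train_X / train_y
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical candidate values: 2 x 2 = 4 combinations
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)                 # best combination found
print(search.best_score_)                  # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))   # number of combinations tried
```

`best_score_` is computed on held-out folds, so it is usually a more realistic estimate than accuracy on the training data.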


In [20]:
# predict on the training data; predict() already returns 0/1 class labels, so no rounding is needed
train_pred = gd_sr.predict(train_X)
# training accuracy (measured on the same data the model was fitted on)
acc_train = accuracy_score(train_y, train_pred)
print("Train_Accuracy: %.2f%%" % (acc_train * 100.0))

Train_Accuracy: 89.45%
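
This accuracy is measured on the same rows the model was fitted on, so it tends to be optimistic; the public leaderboard score is 0.77990. One way to get a fairer estimate is to hold out part of the training data with `train_test_split`, which is imported at the top but not used otherwise. A minimal sketch, assuming synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for the tuned XGBoost model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for train_X / train_y
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Keep 20% of the rows aside; stratify keeps the class balance similar in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)

acc_fit = accuracy_score(y_fit, model.predict(X_fit))   # accuracy on rows seen during fitting
acc_val = accuracy_score(y_val, model.predict(X_val))   # accuracy on held-out rows
print("Fit accuracy: %.2f%%" % (acc_fit * 100.0))
print("Validation accuracy: %.2f%%" % (acc_val * 100.0))
```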


In [21]:
# predict test dataset
predictions = gd_sr.predict(test_X)

In [22]:
# build the submission DataFrame in the required format
my_submission = pd.DataFrame({'PassengerId':test_ID,'Survived':predictions})

In [23]:
# export as csv file
my_submission.to_csv("sub.csv", index=False)
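
Before uploading, it can be worth reading the file back and checking that it has exactly the two expected columns, one row per test passenger (418 for this dataset), and only 0/1 predictions. A minimal sketch with a hypothetical three-row submission written to an in-memory buffer instead of sub.csv:

```python
import io
import pandas as pd

# Hypothetical three-row submission standing in for my_submission
demo_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# Write to an in-memory buffer and read it back, as one would with the CSV file
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

print(list(check.columns))                  # column names and order
print(check.shape)                          # (rows, columns)
print(sorted(check["Survived"].unique()))   # predicted labels present
```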