## I have just learned about the XGBoost algorithm and about using grid search to find the best hyperparameters for an XGBoost classifier, so I practiced on the Titanic dataset. The public score on the Kaggle leaderboard is 0.77990.  

Dataset link: https://www.kaggle.com/c/titanic/data 


This is my first attempt at using XGBoost for classification, and I am not yet good at feature engineering. Suggestions on any part of the notebook are welcome and appreciated.  



Attribute descriptions  
PassengerId: passenger's identification  
Survived: target variable (not present in the test dataset), 0 = No, 1 = Yes  
Pclass: ticket class, 1st = Upper, 2nd = Middle, 3rd = Lower  
Name: passenger's name  
Sex: male or female  
Age: passenger's age  
SibSp: number of siblings / spouses aboard  
Parch: number of parents / children aboard  
Ticket: ticket number  
Fare: passenger fare  
Cabin: cabin number  
Embarked: port of embarkation

In [None]:
# import required libraries
import numpy as np 
import pandas as pd 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split,GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
# read train and test data
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")

In [None]:
# shape of train and test data
print(train_df.shape)
print(test_df.shape)

The train dataset has one more column than the test dataset: the target, called Survived.
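
To confirm which column differs rather than only comparing shapes, the column indexes can be compared directly. This is a minimal sketch on two toy frames; `toy_train` and `toy_test` are hypothetical stand-ins for `train_df` and `test_df`.

```python
import pandas as pd

# Toy frames standing in for train_df and test_df (hypothetical values).
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})

# Columns that appear in the train frame but not in the test frame.
extra_cols = toy_train.columns.difference(toy_test.columns).tolist()
print(extra_cols)
```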

In [None]:
# show summary of train_df
train_df.describe()

In [None]:
# check type of each column
train_df.dtypes

In [None]:
# convert Pclass to str (object dtype) so it is treated as categorical
train_df.Pclass = train_df.Pclass.astype('str')
train_df.dtypes

In [None]:
# count the unique values in each object column
cat_list = ["Pclass","Name","Sex","Ticket","Cabin","Embarked"]
train_df[cat_list].nunique()

There are 891 rows in the train data. "Name" has 891 unique values and "Ticket" has 681 unique values. Because nearly every row has its own value, these columns carry little information in raw form, so we could consider dropping them.
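
One way to make this check less manual is to compute the share of unique values per column and flag the columns above a cutoff. The sketch below uses a toy frame, and the 0.5 cutoff is an arbitrary assumption, not a rule. In the real data the ratios would be 891/891 = 1.0 for "Name" and 681/891 ≈ 0.76 for "Ticket".

```python
import pandas as pd

# Toy frame standing in for the object columns of train_df (hypothetical values).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "male", "female"],
    "Ticket": ["T1", "T2", "T2", "T3"],
})

# Fraction of rows that hold a distinct value, per column.
unique_ratio = toy.nunique() / len(toy)

# Flag columns whose ratio exceeds an assumed cutoff of 0.5.
high_card_cols = unique_ratio[unique_ratio > 0.5].index.tolist()
print(high_card_cols)
```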

I think the following steps count as feature engineering. If I am wrong, please tell me.

In [None]:
# drop columns
drop_cols = ["Name","Ticket"]
train_df.drop(drop_cols, axis=1, inplace=True)
test_df.drop(drop_cols, axis=1, inplace=True)

In [None]:
# extract target
train_y = train_df.Survived

In [None]:
# extract ID of test for submission file
test_ID = test_df.PassengerId

In [None]:
# drop PassengerId
train_X = train_df.drop(["PassengerId"], axis=1)
test_X = test_df.drop(["PassengerId"], axis=1)

In [None]:
# extract features for training
train_X = train_X.drop(["Survived"], axis=1)

This part handles missing values and encodes the categorical columns so that they can be used by the XGBoost model.

In [None]:
# check columns with NaN
cols_with_missing = [col for col in train_X.columns 
                                 if train_X[col].isnull().any()]
cols_with_missing

First, separate the numeric columns from the categorical ones, then handle each group separately.

In [None]:
# lists of numerical and categorical column names
num_cols = ["Age","SibSp","Parch","Fare"]
cat_cols = ["Sex","Cabin","Embarked","Pclass"]

In [None]:
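# impute missing values in num_cols (SimpleImputer defaults to the column mean)
num_imputer = SimpleImputer()
train_X[num_cols] = num_imputer.fit_transform(train_X[num_cols])
# reuse the statistics learned from the training data instead of refitting on test
test_X[num_cols] = num_imputer.transform(test_X[num_cols])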
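
A common convention is to fit the imputer on the training data only and then apply the learned statistics to the test data, so that both sets are filled with the same values. The sketch below illustrates this on toy frames; `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train_X and test_X (hypothetical values).
toy_train = pd.DataFrame({"Age": [20.0, 40.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

imputer = SimpleImputer()  # default strategy is the column mean

# Learn the mean from the training rows (30.0 here) and fill the training gaps.
toy_train[["Age"]] = imputer.fit_transform(toy_train[["Age"]])

# Fill the test gaps with the mean learned from the training rows.
toy_test[["Age"]] = imputer.transform(toy_test[["Age"]])
print(toy_test)
```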

In [None]:
# encode categorical columns as integer labels
for col in cat_cols:
    cat = LabelEncoder()
    cat.fit(list(train_X[col].values.astype('str')) + list(test_X[col].values.astype('str')))
    train_X[col] = cat.transform(list(train_X[col].values.astype('str')))
    test_X[col] = cat.transform(list(test_X[col].values.astype('str')))
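
To see what this encoding does with missing values, here is a small sketch on toy arrays (hypothetical values standing in for `train_X[col].values` and `test_X[col].values`). Casting to `str` turns NaN into the literal string "nan", so missing entries end up as their own category.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy object arrays standing in for one categorical column (hypothetical values).
toy_train = np.array(["S", "C", np.nan], dtype=object).astype("str")
toy_test = np.array(["Q", "S"], dtype=object).astype("str")

# Fit on the union of both sets so every label seen in either one is known.
encoder = LabelEncoder()
encoder.fit(list(toy_train) + list(toy_test))

encoded_train = encoder.transform(toy_train)
print(list(encoder.classes_))
print(encoded_train)
```

Label encoding imposes an arbitrary order on the categories, which tree-based models usually tolerate; whether it is the best choice for a column such as "Cabin" is a separate question.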

After feature engineering, handling missing values, and encoding the categorical values, I will use the XGBoost algorithm to build a model and predict on the test dataset.

In [None]:
# create XGBClassifier instance
classifier = XGBClassifier()
# set hyperparameters; each list holds a single, already-tuned value so the notebook runs fast
grid_param = {"learning_rate" : [0.06],
              'n_estimators': [300],
              'colsample_bytree': [0.7],
              'reg_alpha': [0.04]
              }

gd_sr = GridSearchCV(estimator=classifier,  
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=10,
                     n_jobs=-1,
                     verbose=1)
gd_sr.fit(train_X, train_y) 

In [None]:
# print the best hyperparameters; here they equal the values above because the grid has one candidate
# I originally used this to pick the best hyperparameters from many combinations
best_parameters = gd_sr.best_params_  
print(best_parameters)  
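
When the grid holds several values per hyperparameter, the search cost grows quickly, so it can help to count the combinations before fitting. The sketch below uses scikit-learn's `ParameterGrid`; the candidate values are hypothetical and are not the ones searched for this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical candidate values, for illustration only.
candidate_grid = {
    "learning_rate": [0.03, 0.06, 0.1],
    "n_estimators": [100, 300],
    "colsample_bytree": [0.7, 1.0],
    "reg_alpha": [0.0, 0.04],
}

# Number of distinct hyperparameter combinations in the grid.
n_combinations = len(ParameterGrid(candidate_grid))

# With cv=10, each combination is fitted once per fold.
n_fits = n_combinations * 10
print(n_combinations, n_fits)
```

Passing a dictionary like `candidate_grid` as `param_grid` to `GridSearchCV` would run all of these fits, plus one final refit on the full training data with the best combination.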

In [None]:
# predict() already returns class labels (0 or 1), so no rounding is needed
train_pred = gd_sr.predict(train_X)
# evaluate predictions on the training data (an optimistic estimate)
acc_train = accuracy_score(train_y, train_pred)
print("Train_Accuracy: %.2f%%" % (acc_train * 100.0))
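
Accuracy measured on the rows the model was trained on tends to be higher than accuracy on unseen data. One option is to set aside a validation split with `train_test_split` before fitting; another is to read the mean cross-validated score from the fitted search's `best_score_` attribute. Below is a minimal sketch of a stratified hold-out split on toy arrays; `X` and `y` are hypothetical stand-ins for `train_X` and `train_y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 60 negatives and 40 positives (hypothetical values).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

# Hold out 20% of the rows, keeping the class balance the same in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_fit.shape, X_val.shape, y_val.mean())
```

The model would then be fitted on `X_fit` and `y_fit` and scored on `X_val` and `y_val`.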

In [None]:
# predict test dataset
predictions = gd_sr.predict(test_X)

In [None]:
# build the submission in the required format
my_submission = pd.DataFrame({'PassengerId':test_ID,'Survived':predictions})

In [None]:
# export as csv file
my_submission.to_csv("sub.csv", index=False)