# XGBoost GridSearch
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. My original notebook, which covers all models:
https://github.com/ScottySchmidt/Kaggle/blob/main/Titanic.ipynb

### Special Notes:
* Regular XGBoost with little tuning scores in the top 60%
* Tuning four parameters with grid search scores in the top 18%

# Train Data
Train.csv contains the details of a subset of the passengers on board (891 to be exact) and whether they survived. The shape of the train data is (891, 12).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from statistics import mean
from sklearn.metrics import accuracy_score, classification_report, mean_absolute_error, mean_squared_error, r2_score
from sklearn.datasets import make_classification
from sklearn import ensemble
import sklearn.metrics as metrics
import time
from math import sqrt

train=r'/kaggle/input/titanic/train.csv'
test=r'/kaggle/input/titanic/test.csv' 

df=pd.read_csv(train)
test=pd.read_csv(test)

print(df.shape)
df.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [3]:
#Check Duplicates:
dupstr = df.duplicated()
print('Total no of duplicate values in Training Dataset = %d' % (dupstr.sum()))
df[dupstr]

Total no of duplicate values in Training Dataset = 0


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


In [4]:
#search for columns with missing values:
def findNA():
    print("Missing data by column as a percent:")
    findNA=df.isnull().sum().sort_values(ascending=False)/len(df)
    print(findNA.head())
findNA() 

Missing data by column as a percent:
Cabin          0.771044
Age            0.198653
Embarked       0.002245
PassengerId    0.000000
Survived       0.000000
dtype: float64


# Feature Engineering
We need to convert male and female into numbers. This is a very important part of the process because gender is one of the strongest predictors of whether a person survived.

In [5]:
#GENDER
df['Sex']=df['Sex'].map({'female':0,'male':1})

Females, coded as 0, had a much higher chance of surviving than males. Later, in the feature importance analysis, we will see that gender is one of the strongest predictors.

In [6]:
genderTable = pd.crosstab(df['Survived'],df['Sex'])
genderTable

Sex,0,1
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,81,468
1,233,109
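
The counts in the table above can be turned into a survival rate per sex. The following is a minimal sketch that rebuilds the crosstab from the printed counts instead of reading train.csv; the names `counts`, `rates`, and `survival_rate` are only for illustration.

```python
import pandas as pd

# Counts copied from the crosstab above:
# rows = Survived (0/1), columns = Sex (0 = female, 1 = male)
counts = pd.DataFrame({0: [81, 233], 1: [468, 109]}, index=[0, 1])
counts.index.name = "Survived"
counts.columns.name = "Sex"

# Divide each column by its total to get the share of each sex that died or survived
rates = counts / counts.sum(axis=0)

# Row 1 holds the survival share for each sex
survival_rate = rates.loc[1].round(3)
print(survival_rate)
```

On the real frame, the same shares should be obtainable directly with `pd.crosstab(df['Survived'], df['Sex'], normalize='columns')`.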


# Encode Categorical Variables
Encoding categorical variables is needed for this dataset since several important variables are not yet numeric.

Embarked has three categories: C, Q, and S. Category C seems to have the highest chance of survival, and category S the lowest. Unfortunately, this tends to suggest that economic status played a part in whether someone survived. Embarked does not play a major role in feature importance.

In [7]:
import category_encoders as ce
#encoder = ce.OrdinalEncoder(cols=['Embarked'])

#df = encoder.fit_transform(df)
#test = encoder.transform(test)  # reuse the encoder fitted on df instead of refitting on test
#test.head()
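
The encoding cell above is commented out, so Embarked is never actually encoded in this notebook. As a rough sketch of what an ordinal encoding could look like, the example below uses plain pandas rather than `category_encoders`, on small made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins for `df` and `test`). The mapping is learned from the training frame only and then reused for the test frame, so both frames share the same codes.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are illustrative only)
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", None]})
test_toy = pd.DataFrame({"Embarked": ["Q", "S", "C"]})

# Learn the category -> integer mapping from the training data only
categories = sorted(train_toy["Embarked"].dropna().unique())
mapping = {cat: code for code, cat in enumerate(categories, start=1)}

# Apply the same mapping to both frames; missing or unseen values become -1
train_toy["Embarked_code"] = train_toy["Embarked"].map(mapping).fillna(-1).astype(int)
test_toy["Embarked_code"] = test_toy["Embarked"].map(mapping).fillna(-1).astype(int)

print(mapping)
print(train_toy)
```

Using -1 for missing values is an assumption made for this sketch; filling with the most frequent category would be another reasonable choice.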

# Examine Target Variable
Survived is the Y variable we will be analyzing. Since the survival rate is 0.384, the data is not considered severely imbalanced.

In [8]:
temp=df['Survived'].value_counts()
print(temp)
no=temp[0]
yes=temp[1]
percent=round(yes/(yes+no),3)
print("Percent that survived: ", percent)

0    549
1    342
Name: Survived, dtype: int64
Percent that survived:  0.384


# Numeric DataFrame
For now, we will analyze only numeric values. Categorical values will need to be encoded or analyzed individually.

In [9]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df = df.select_dtypes(include=numerics)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,1,0,3,1,22.0,1,0,7.25
1,2,1,1,0,38.0,1,0,71.2833
2,3,1,3,0,26.0,0,0,7.925
3,4,1,1,0,35.0,1,0,53.1
4,5,0,3,1,35.0,0,0,8.05


# Check for missing values
Age is missing around 20% of its values. Since it is the only numeric column with missing data, we can simply fill in the mean for that one column.

In [10]:
#search for columns with missing values:
def findNA():
    print("Missing data by column as a percent:")
    findNA=df.isnull().sum().sort_values(ascending=False)/len(df)
    print(findNA.head())
findNA() 

Missing data by column as a percent:
Age            0.198653
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Sex            0.000000
dtype: float64


In [11]:
df= df.fillna(df.mean())
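
To make the effect of the mean fill concrete, here is a small sketch on a made-up Age column (the frame `ages` is hypothetical and not taken from train.csv). Passing a per-column dictionary to `fillna` is one way to restrict the fill to Age; on the notebook's frame it should behave like `df.fillna(df.mean())`, because Age is the only numeric column with gaps.

```python
import numpy as np
import pandas as pd

# Toy Age column with two missing entries (values are illustrative only)
ages = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan]})

# Series.mean() skips NaN by default, so this is the mean of the observed ages
age_mean = ages["Age"].mean()

# Fill only the Age column with that mean
filled = ages.fillna({"Age": age_mean})
print(filled)
```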

# Split Data

In [12]:
X=df.drop('Survived', axis=1)
y=df['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.30, random_state = 42)

# Highly Correlated Features
There are no highly correlated variables above 80%. Therefore, we do not need to be concerned about removing variables that are too highly correlated. 
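
The notebook does not show the code behind this check. One possible way to run it is sketched below on a toy frame; the column names `a`, `b`, `c` and the variable names are assumptions made for illustration. On the notebook's data, `toy` would be replaced by the feature frame `X`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric feature frame
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a, so strongly correlated with a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # no strong linear relation to a or b
})

threshold = 0.80
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) pairs and keep those above the threshold
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```

An empty result would correspond to the situation described above, where no pair of features exceeds the 80% threshold.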

# XGBOOST

In [13]:
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

start=time.time()
print("Start")

params = {
        'learning_rate': [0.01, 0.3],
        'n_estimators':[100],
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 2, 5],
        #'colsample_bytree': [0.6, 1.0],
        'max_depth': [3, 6, 9]
        }

boost_gs = xgb.XGBClassifier()
boost_gs = GridSearchCV(boost_gs,param_grid=params,cv=3,scoring="accuracy")
boost_gs.fit(X_train,y_train)

print(boost_gs.best_params_)
print("Done. " , time.time()-start, " seconds")

Start
{'gamma': 2, 'learning_rate': 0.3, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 100}
Done.  45.71243214607239  seconds
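
The run time follows from the size of the grid. A short sketch of the arithmetic, using the same parameter lists as the cell above (the variable names `n_candidates` and `n_fits` are just for illustration):

```python
from math import prod

# Same grid as in the cell above (the commented-out colsample_bytree is left out)
params = {
    "learning_rate": [0.01, 0.3],
    "n_estimators": [100],
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 2, 5],
    "max_depth": [3, 6, 9],
}
cv_folds = 3

# Every combination of values is one candidate model
n_candidates = prod(len(values) for values in params.values())

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 54 162
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set, so adding another value to any list increases the cost multiplicatively.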


In [14]:
xgb_pred =  boost_gs.predict(X_test)

#calculate AUC of model
xgbAUC = round( metrics.roc_auc_score(y_test, xgb_pred), 4 ) 
#print("AUC for XGB is: ", xgbAUC)

xgbMSE = mean_squared_error(y_test, xgb_pred)
#print("MSE XGB on test set: {:.4f}".format(xgbMSE))

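#CROSS VALIDATE TEST RESULTS:
boostScore = boost_gs.score(X_test, y_test).round(4)  # accuracy on the held-out 30% split
# cross_validate re-runs the whole grid search inside each of the 5 folds and scores with R^2
boostCV = cross_validate(boost_gs, X, y, cv=5, scoring='r2')
boostCV = boostCV['test_score'].mean().round(4)
# NOTE: the printed value is held-out accuracy minus mean CV R^2, not a cross-validated accuracy
print(boostScore-boostCV, " cross validate score")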

0.7938000000000001  cross validate score
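
The number printed above combines two different metrics: accuracy on the held-out split and R² from cross-validation. For 0/1 labels these can differ a lot, so the difference is hard to interpret. The sketch below illustrates this on made-up labels (the arrays `y_true` and `y_hat` are hypothetical and not taken from the model).

```python
from sklearn.metrics import accuracy_score, r2_score

# Made-up binary labels and predictions, for illustration only
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_hat = [0, 1, 0, 0, 1, 0, 1, 1]

# Accuracy: share of predictions that match the true label (6 of 8 here)
acc = accuracy_score(y_true, y_hat)

# R^2: 1 - SS_res / SS_tot; with 2 errors and balanced classes, SS_res = SS_tot = 2
r2 = r2_score(y_true, y_hat)

print(acc, r2)
```

Here a classifier that is right 75% of the time has an R² of 0. If the goal is a cross-validated accuracy, passing `scoring='accuracy'` to `cross_validate` would likely be the more direct comparison with `boost_gs.score`.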


# Test Data

In [15]:
#GENDER
test['Sex']=test['Sex'].map({'female':0,'male':1})

features=list(X.columns)
test=test[features]

test=test.fillna(test.mean())
test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare
0,892,3,1,34.5,0,0,7.8292
1,893,3,0,47.0,1,0,7.0
2,894,2,1,62.0,0,0,9.6875
3,895,3,1,27.0,0,0,8.6625
4,896,3,0,22.0,1,1,12.2875


# Final Prediction

In [16]:
test_predictions = boost_gs.predict(test)
passID=test['PassengerId']
tupleData = list(zip(passID, test_predictions))
output = pd.DataFrame(tupleData, columns = ['PassengerId', 'Survived'])
output.head(7)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
5,897,0
6,898,1


In [17]:
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
