## Kaggle Titanic Problem Approach via Gradient Boosting a.k.a. GBM

Before we start, here are a few links you may find useful. I strongly recommend reading the sklearn page to explore the parameters and their possible values:

[Gradient Boosting Tuning: .py Example](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)

[Boosting in Brief (in Russian)](http://www.machinelearning.ru/wiki/index.php?title=%D0%91%D1%83%D1%81%D1%82%D0%B8%D0%BD%D0%B3)


[Sklearn Gradient Boosting Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)



Okay, let's proceed by loading the libraries and reading the data.

In [None]:
import os
import pandas as pd
import numpy as np
import re
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold, ShuffleSplit
from sklearn.metrics import accuracy_score

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
full_data = [train, test]
PassengerId = test['PassengerId']

true = pd.read_csv('predictions/skew.csv')
true = true.iloc[:, 1:].values  # .ix is gone from current pandas; .iloc takes the same positional slice

To estimate the quality of our model we use the *accuracy score* metric.

Since we are provided with the actual results, we can easily compare them with the obtained predictions.

Basically, I have sorted the real Titanic data and picked out the values stored in **train.csv**.
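As a quick reminder of what the metric measures, here is a minimal sketch on made-up labels (the arrays `y_true` and `y_pred` below are illustrative assumptions, not Titanic data). Accuracy is simply the share of predictions that match the true labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, made up for illustration only (not Titanic data)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# accuracy = (number of matching labels) / (total number of labels)
manual_accuracy = (y_true == y_pred).mean()
sk_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sk_accuracy)  # 4 of 5 labels match, so both are 0.8
```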

The next code chunk prepares our data for further analysis. This technique was created by [Sina](https://www.kaggle.com/sinakhorami/titanic/titanic-best-working-classifier).

In [None]:
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)

for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()

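    # cast the float bounds to int explicitly before drawing random ages
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # assign through .loc to avoid chained indexing (SettingWithCopyWarning)
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list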
    dataset['Age'] = dataset['Age'].astype(int)

train['CategoricalAge'] = pd.cut(train['Age'], 5)

def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)  # raw string keeps '\.' a literal dot
    # If the title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir',
                                                 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)

    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

    # Mapping Fare
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

    # Mapping Age
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4

drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp',
                 'Parch', 'FamilySize']
train = train.drop(drop_elements, axis=1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis=1)

test = test.drop(drop_elements, axis=1)

train = train.values
test = test.values

X = train[0::, 1::]
y = train[0::, 0]


The code above is mostly self-explanatory, but you can always check Sina's work to find out more about it.
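To make two of the steps above more concrete, here is a small sketch on a made-up frame (the `toy` DataFrame and its values are assumptions for illustration). It shows how the title regex behaves and how `pd.qcut` assigns quantile bins. The hard-coded fare thresholds in the preprocessing cell appear to correspond to quantile edges of this kind.

```python
import re
import pandas as pd

# Made-up passengers in the same "Surname, Title. Given names" layout as the Name column
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Jones, Mlle. Anna', 'Brown, Master. Tom', 'Doe, Jane'],
    'Fare': [7.25, 13.0, 26.55, 80.0],
})

def get_title(name):
    # The title is the word that follows a space and ends with a period
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ""

# Extract titles, then fold the French 'Mlle' into 'Miss' as in the cell above
toy['Title'] = toy['Name'].apply(get_title).replace('Mlle', 'Miss')

# qcut splits on quantiles, so every bin receives the same number of rows;
# labels=False returns the bin index instead of an Interval object
toy['FareBin'] = pd.qcut(toy['Fare'], 4, labels=False)

print(toy[['Title', 'FareBin']])
```

A name without a title (the last row) yields an empty string, which the mapping step later turns into 0 via `fillna(0)`.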

Let's now look at the basic GB model and its actual *accuracy score* on true values:

In [None]:
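# the parameter is spelled 'max_features'; 'auto' meant 'sqrt' for classifiers in older
# scikit-learn releases, and recent releases no longer accept it
clf = GradientBoostingClassifier(random_state=1337, max_features='auto', warm_start=True)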
clf.fit(X, y)
pred = clf.predict(test)

print('Non-biased scoring accuracy is {}'.format(round(accuracy_score(true, pred), 2)))

# Non-biased scoring accuracy is 0.77

The accuracy looks solid, though it is higher on the *y-train* set by .3.

The next section shows how to use grid search with multiple parameters. We start by varying **max_features** over its available options: we create a dictionary that maps parameters of *GradientBoostingClassifier* (our *clf*) to the values we want to try. Then, to make sure our model is consistent, we use the *KFold* splitter. We keep the classifier parameters with the highest *mean accuracy score* among all candidates.

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=1337)  # splitting into 5 is basically a default action

In [None]:
# on recent scikit-learn releases drop 'auto', which is no longer a valid option
params = {'max_features': ['auto', 'sqrt', 'log2']}

# We pass our classifier and the parameter dictionary to the grid. To follow the fitting process
# we set 'verbose=3', which prints lots of information. Every candidate is validated with the KFold 'cv' splitter:
# its mean accuracy score is based on five consecutive fits with the same parameters. To make this process more
# reliable we set 'shuffle=True' in KFold, which randomises the order of the input data

clf = GradientBoostingClassifier(random_state=1337, warm_start=True)
grid = GridSearchCV(clf, params, scoring='accuracy', verbose=3, cv=cv)  # verbose=[0,1,2,3] is to show fitting process
grid.fit(X, y)

grid.best_estimator_
pred = grid.predict(test)

print('Non-biased scoring accuracy is {}'.format(round(accuracy_score(true, pred), 2)))

# Non-biased scoring accuracy is 0.78

By tweaking *max_features* our predictive power surged from 0.77 to 0.78; a massive increase, I might say.
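To see what the grid's *mean accuracy score* actually is, the sketch below reproduces it by hand for one candidate. It runs on synthetic data from `make_classification` (an assumption; it is not the Titanic feature matrix), and the names `X_demo`, `y_demo`, `cv_demo` and `grid_demo` are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# A fixed random_state makes the splitter return the same folds on every call
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {'max_features': ['sqrt', None]}

grid_demo = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid, scoring='accuracy', cv=cv_demo,
)
grid_demo.fit(X_demo, y_demo)

# Reproduce the first candidate by hand: fit on four folds, score on the fifth, then average
manual_scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=20, random_state=0, max_features='sqrt'),
    X_demo, y_demo, scoring='accuracy', cv=cv_demo,
)

print(grid_demo.cv_results_['mean_test_score'][0], manual_scores.mean())
print(grid_demo.best_params_)
```

The two printed means should agree, because the grid search does nothing more exotic than this loop for every parameter combination and then refits the winner on all of the data.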

Okay, now let's try to alter *learning_rate* and *n_estimators*, as these options are significant and directly related to the training process. All I have heard about choosing their values is basically rules of thumb. For example, *'sqrt'* works fine most of the time. In my case, however, I just tend to create long arrays of values and spend some time finding the best possible model. I may write some time-consuming single-level functions just for demonstration, but that is a different topic.

In this situation we no longer need the KFold data splitter, as we actually know y_test (the true values). I just showed it as an example. You might also want to check other splitters in *sklearn.model_selection*, like **ShuffleSplit** or **train_test_split**.
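For reference, here is a minimal sketch of how those two splitters behave, using ten dummy rows (the arrays and variable names are assumptions for illustration, not part of the Titanic pipeline).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Ten dummy rows, just enough to see how the indices get split
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# ShuffleSplit: repeated random train/test partitions;
# unlike KFold, test sets from different repetitions may overlap
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
splits = list(splitter.split(X_demo))
for train_idx, test_idx in splits:
    print('train:', train_idx, 'test:', test_idx)

# train_test_split: a single random hold-out split
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)
```

A `ShuffleSplit` instance can be passed as `cv=` to `GridSearchCV` in the same way as the `KFold` object above.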

In [None]:
params = {'n_estimators': np.arange(10, 151, 10),
         'learning_rate': np.arange(0.1, 0.31, 0.05)}

clf = GradientBoostingClassifier(max_features='sqrt', random_state=1337, warm_start=True, max_depth=2)  # max_depth == 2 is the best possible
grid = GridSearchCV(clf, params, scoring='accuracy', verbose=3)  # verbose=[0,1,2,3] is to show fitting process
grid.fit(X, y)

grid.best_estimator_
pred = grid.predict(test)

print('Non-biased scoring accuracy is {}'.format(round(accuracy_score(true, pred), 2)))

# Non-biased scoring accuracy is 0.79
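As an aside, when hold-out labels are available, one cheaper way to scan *n_estimators* is to fit a single large model and score it after every boosting stage with `staged_predict`. This is a sketch of that alternative on synthetic data, not the approach used in this notebook; the data and all names below are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Fit once with the largest number of trees we are willing to consider
model = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1,
                                   max_depth=2, random_state=0)
model.fit(X_tr, y_tr)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees
staged_acc = [accuracy_score(y_te, stage_pred) for stage_pred in model.staged_predict(X_te)]

best_n = int(np.argmax(staged_acc)) + 1  # stages are counted from 1
print('best number of trees:', best_n, 'accuracy:', staged_acc[best_n - 1])
```

Each value of *learning_rate* still needs its own fit, but the whole *n_estimators* axis of the grid comes from one model per learning rate.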

Alright, it's been some time, and I'd like to say that **0.79** was the highest I could climb to. I even ran some 1-hour grid searches, which also gave me **0.79** as a score.
Let's check the *accuracy score* on the train data.

In [None]:
print('train accuracy score is {}'.format(round(grid.score(X, y), 2)))

# train accuracy score is 0.83

In [None]:
output = pd.DataFrame({
    'PassengerId': PassengerId,
    'Survived': pred
})
output.to_csv('predictions/GBM_pred.csv', index=False)  # index=False keeps the row index out of the file

#### Okay, till next time.