### Titanic: Machine Learning from Disaster is a fairly easy, 101-level machine learning project.

### However, it is very difficult to pass the 80% mark on the leaderboard. The challenge here is to get past 80.

### Initially, I tried dozens of different data preparation techniques and wasn't able to reach the 80 mark. Very frustrating. The highest I could get was around 78.

### Random forest showed the best results in my comparison of algorithms. It also does not require data transformations: there is no need to standardize or normalize the features.
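The comparison itself is not shown in this notebook. A minimal sketch of how such a comparison could be run with `cross_val_score` is given here; it uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features, and the names `X_demo`, `y_demo`, `candidates` and `comparison` are made up for illustration, so the printed scores say nothing about the real leaderboard.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, RandomForestClassifier)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix (assumption, not Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=7)

# Tree-based candidates, all with default settings apart from the seed and tree count.
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=7),
    "AdaBoost": AdaBoostClassifier(random_state=7),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=7),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=7),
    "Bagging": BaggingClassifier(random_state=7),
}

# Use the same shuffled folds for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=7)
comparison = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    comparison[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because every candidate here is tree-based, none of them needs scaled inputs, which is why no standardization step appears in the sketch.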

### Is feature selection necessary? It is supposed to reduce overfitting and improve predictive power, but in my case it actually gave a worse score on the leaderboard.

### I then started to focus on hyperparameter tuning. It took forever to tune the parameters, but to my surprise it was quite helpful: it bumped the score up to 79.4. The last few points are always the hardest. It took me a few days to get from around 77 to 79.426, and I still failed to pass 80. Dang!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier

In [2]:
df = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [4]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [5]:
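# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update the DataFrame in recent pandas versions.
df['Embarked'] = df['Embarked'].fillna('S')
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median())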

In [6]:
def family_recode(data):
    if data['Family'] == 0:
        value = 'One'
    elif 1<= data['Family'] <=3:
        value = '2 to 4'
    else:
        value = 'Over 4'
    return value

In [7]:
def child(row):
    if row['Age'] < 16:
        value = 1
    else:
        value = 0
    return value

In [8]:
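# Fill missing ages with the median age of passengers sharing the same Sex, Pclass and Embarked.
# transform keeps the original row index, so the result can be assigned straight back.
df['Age'] = df.groupby(['Sex', 'Pclass','Embarked'])['Age'].transform(lambda x: x.fillna(x.median()))
df_test['Age'] = df_test.groupby(['Sex', 'Pclass','Embarked'])['Age'].transform(lambda x: x.fillna(x.median()))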
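To make the group-wise imputation concrete, here is a small sketch on a made-up frame (`toy` is hypothetical, not Titanic data). It uses `groupby(...).transform(...)`, which returns a Series aligned with the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, each with one missing age.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Each missing age is replaced by the median of its own (Sex, Pclass) group.
toy["Age"] = toy.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.median())
)
print(toy["Age"].tolist())  # [20.0, 30.0, 25.0, 40.0, 45.0, 50.0]
```

Note that a group with no observed ages at all has a median of NaN, so its missing values would stay missing; it is worth checking `isna().sum()` afterwards.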

In [9]:
## Name Title - credit to Ahmed BESBES http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
Title_Dictionary = { 
    "Capt":"Officer","Col":"Officer", "Major":"Officer","Jonkheer":"Royalty","Don":"Royalty","Sir":"Royalty",
    "Dr":"Officer", "Rev":"Officer","the Countess":"Royalty","Dona":"Royalty","Mme":"Mrs","Mlle":"Miss","Ms":"Mrs",
    "Mr":"Mr","Mrs":"Mrs","Miss":"Miss","Master":"Master","Lady":"Royalty"
                    }

In [10]:
def data_process(df_preprocess):
    # Encode sex as 1 for male, 0 otherwise
    df_preprocess['Sex'] = [1 if i == 'male' else 0 for i in df_preprocess.Sex]
    # Extract the title between the comma and the first period, then map it to a coarser group
    df_preprocess['Name_title'] = df_preprocess['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())
    df_preprocess['Name_title'] = df_preprocess['Name_title'].map(Title_Dictionary)
    # First letter of the cabin; missing cabins become 'n' (from the string 'nan')
    df_preprocess['Cabin_N'] = df_preprocess['Cabin'].astype('str').str[0]
    # Number of relatives aboard, binned into groups by family_recode
    df_preprocess['Family'] = df_preprocess['SibSp'] + df_preprocess['Parch']
    df_preprocess['Family_group'] = df_preprocess.apply(family_recode, axis=1)
    # Flag passengers younger than 16
    df_preprocess['Child'] = df_preprocess.apply(child, axis=1)
    
    # One-hot encode the categorical features
    Pclass_dummies = pd.get_dummies(df_preprocess.Pclass,prefix='Class')
    Embarked_dummies = pd.get_dummies(df_preprocess.Embarked,prefix='Embarked')
    Title_dummies = pd.get_dummies(df_preprocess.Name_title, prefix='Title')
    Cabin_dummies = pd.get_dummies(df_preprocess.Cabin_N,prefix='Cabin')
    Family_group_dummies = pd.get_dummies(df_preprocess.Family_group, prefix='Family')
    
    df_preprocess = pd.concat([df_preprocess,Pclass_dummies,Embarked_dummies,Family_group_dummies,Title_dummies,Cabin_dummies],axis=1)
    # Drop the raw columns that the engineered features replace
    df_preprocess.drop(['Pclass','Embarked','Ticket',
                        'Family','SibSp','Parch','Family_group','Name','Cabin','Cabin_N','Name_title'],axis=1,inplace=True)
    return df_preprocess
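The title extraction is the least obvious step in `data_process`, so here is a small worked sketch of it. The mapping `title_map` is a shortened, illustrative subset of the dictionary above, and the names are examples in the dataset's `"Surname, Title. Given names"` format (the last two are invented).

```python
import pandas as pd

# Shortened, illustrative subset of the title mapping (assumption for this sketch).
title_map = {"Mr": "Mr", "Miss": "Miss", "Col": "Officer"}

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, Col. John",    # invented example
    "Doe, Prof. Jane",   # invented example with a title missing from the mapping
])

# Take the text after the comma, keep what precedes the first period, trim spaces.
titles = names.apply(lambda name: name.split(",")[1].split(".")[0].strip())
mapped = titles.map(title_map)

print(titles.tolist())  # ['Mr', 'Miss', 'Col', 'Prof']
print(mapped.tolist())  # ['Mr', 'Miss', 'Officer', nan]
```

A title that is missing from the mapping becomes NaN, and `pd.get_dummies` then leaves that row with all title columns set to zero, so it is worth checking for unmapped titles before encoding.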

In [11]:
df=data_process(df)
df_test=data_process(df_test)

In [12]:
df.drop(['PassengerId','Cabin_T'],axis=1,inplace=True)
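`Cabin_T` is dropped by hand here, presumably because that cabin letter appears in the training data but not in the test data, so the two frames would otherwise have different columns. A more general sketch of the same idea, assuming toy frames `train_toy` and `test_toy` that are not part of this notebook, is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Hypothetical toy data: cabin letter 'T' appears only in the training frame.
train_toy = pd.DataFrame({"Cabin_N": ["A", "T", "n"]})
test_toy = pd.DataFrame({"Cabin_N": ["A", "n", "n"]})

train_dummies = pd.get_dummies(train_toy["Cabin_N"], prefix="Cabin")
test_dummies = pd.get_dummies(test_toy["Cabin_N"], prefix="Cabin")

# Give the test frame exactly the training columns, in the same order;
# columns that never occur in the test data are filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))  # ['Cabin_A', 'Cabin_T', 'Cabin_n']
```

This keeps the rare column instead of discarding it, which is a design choice rather than a correction: dropping a column backed by very few rows, as done above, is also reasonable.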

In [13]:
X = df.iloc[:,1:]
Y = df['Survived']

In [14]:
seed = 7
validation_size = 0.20
num_folds = 10
scoring = 'accuracy'

In [15]:
model = RandomForestClassifier(max_depth= 2, max_features= 'sqrt', min_samples_leaf= 2, n_estimators= 50, n_jobs= -1,random_state=seed)
model.fit(X,Y)
best_feature = SelectFromModel(model, prefit=True)
X_feature= best_feature.transform(X)
X_feature.shape

(891, 7)

In [16]:
X_columns = X.columns
df_features = pd.DataFrame(data=model.feature_importances_,index = X_columns,columns=['Score'])
df_features.sort_values(by='Score',ascending=False)

                  Score
Sex            0.281236
Title_Mr       0.229786
Title_Miss     0.123271
Class_3        0.079467
Fare           0.067033
Cabin_n        0.065749
Title_Mrs      0.059978
Family_2 to 4  0.021703
Family_Over 4  0.018984
Class_1        0.013953
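`SelectFromModel` was called without a `threshold`. For an estimator without an L1 penalty, scikit-learn then uses the mean feature importance as the cut-off and keeps the features at or above it. The sketch below checks that behaviour on synthetic data; `X_demo`, `y_demo` and `forest` are stand-ins, not the Titanic features or the model fitted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with a few informative features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=3, random_state=7
)
forest = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=7)
forest.fit(X_demo, y_demo)

# prefit=True reuses the already fitted forest, as in the cell above.
selector = SelectFromModel(forest, prefit=True)
mask = selector.get_support()

# With the default threshold, the mask should equal "importance >= mean importance".
manual_mask = forest.feature_importances_ >= forest.feature_importances_.mean()
print(mask.sum(), "features kept out of", X_demo.shape[1])
```

Passing an explicit `threshold` (for example `"median"` or a float) is the knob to turn if more or fewer features should survive the selection.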


In [17]:
X_test = df_test.iloc[:,1:]

In [18]:
X_test_feature = best_feature.transform(X_test)
X_test_feature.shape

(418, 7)

In [19]:
# train_test_split was already imported in the first cell
X_train, X_validation, Y_train, Y_validation = train_test_split(X_feature, Y, test_size=validation_size, random_state=seed)

In [20]:
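# RandomForestClassifier
model = RandomForestClassifier(random_state=seed)
# random_state only has an effect when shuffle=True, and recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here.
kfold = KFold(n_splits=num_folds)
# 'auto' meant the same as 'sqrt' for classifiers and has been removed from recent
# scikit-learn versions, so it is left out of the grid.
parameters = {'max_features': ['sqrt', 'log2'],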
              'max_depth':[2,6,10,35],
              'n_estimators':[50,150,200,250,300],
              'min_samples_leaf': [2,10,25,50,60],
              'n_jobs':[-1] }
clf = GridSearchCV(estimator=model, param_grid=parameters,scoring=scoring,cv=kfold)
grid_result=clf.fit(X_train,Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.834270 using {'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'n_estimators': 200, 'n_jobs': -1}
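The `print` above reports only the winning combination. To see how close the runners-up were, the fitted search object's `cv_results_` can be loaded into a DataFrame. The sketch below does this on synthetic data with a deliberately tiny grid; `X_demo`, `y_demo`, `small_grid` and `search` are illustrative names, not objects from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption, not the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# Tiny grid so the sketch runs quickly: 2 x 2 = 4 combinations.
small_grid = {"max_depth": [2, 6], "min_samples_leaf": [2, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=7),
    param_grid=small_grid,
    scoring="accuracy",
    cv=KFold(n_splits=5),
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranked.to_string(index=False))
```

If several combinations score within one standard deviation of the best, the simpler one (shallower trees, larger leaves) may generalize just as well.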


In [21]:
model = RandomForestClassifier(max_depth= 6, max_features= 'sqrt', min_samples_leaf= 2, n_estimators= 200, n_jobs= -1)
model.fit(X_train,Y_train)
pred = model.predict(X_validation)
accuracy_score(Y_validation,pred)

0.79329608938547491

In [22]:
model = RandomForestClassifier(max_depth= 6, max_features= 'sqrt', min_samples_leaf= 2, n_estimators= 200, n_jobs= -1)
model.fit(X_train,Y_train)
pred = model.predict(X_test_feature)
df_test['Survived'] = pred
df_submitted = df_test[['PassengerId','Survived']]
df_submitted.to_csv('final.csv',index=False)

<img src="files/image.png">