**Author:** Raoul Malm  

**Description:** The training set lists Titanic passengers together with whether they survived the disaster. The goal is to build a model that predicts survival for the passengers of a test set that does not contain the survival information. This is a supervised classification task. The solution consists of the following steps:
- Analyze data
- Manipulate data: complete, convert, create, delete features
- Model data with kNN, SVC, Decision Tree, Random Forest, Neural Networks

**Outline:**
1. Libraries and settings
2. Analyze data
3. Manipulate data
4. Model data
5. Predict and submit test results

**Results:** 
- With a 90%/10% split of the labeled training data, this implementation trains on 801 passengers and achieves 82% accuracy on the validation set of 90 passengers. Training on all labeled data achieves 79.90% accuracy on the test set.

**Reference:** [Titanic Data Science Solutions by Manav Sehgal](https://www.kaggle.com/startupsci/titanic-data-science-solutions?scriptVersionId=1145136)



# 1. Libraries and settings

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
import sklearn.linear_model
import sklearn.svm
import sklearn.ensemble
import sklearn.neighbors
import sklearn.naive_bayes
import sklearn.tree
import sklearn.neural_network
from subprocess import check_output
import seaborn as sns
import matplotlib.pyplot as plt
import os
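# the network in section 4 uses the TensorFlow 1.x graph API; on TensorFlow 2.x
# that API is available through the compat.v1 module
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()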
%matplotlib inline

valid_set_size_percentage = 10.0; # 10% = default
train_on_all_data = True; # for submission of test results otherwise False
cv_num = 1; # number of cross validations; = 1 for submission of test results

#display parent directory and working directory
print(os.path.dirname(os.getcwd())+':', os.listdir(os.path.dirname(os.getcwd())));
print(os.getcwd()+':', os.listdir(os.getcwd()));

# 2. Analyze data

The train/test sets have 891/418 rows with 12/11 columns. The features are:
- Survived: 0 = No, 1 = Yes 
- Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd 
- Name: Name of the passenger
- Sex: male, female 
- Age: Age in years. Is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
- SibSp: # of siblings / spouses aboard the Titanic (Sibling = brother, sister, stepbrother, stepsister, Spouse = husband, wife). Mistresses and fiancés were ignored.
- Parch: # of parents / children aboard the Titanic (Parent = mother, father, Child = daughter, son, stepdaughter, stepson). Some children travelled only with a nanny, therefore Parch=0 for them.
- Ticket: Ticket number 
- Fare: Passenger fare 
- Cabin: Cabin number 
- Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

The features can be characterized by different types:
- numerical: Age (continuous, float64), Fare (continuous, float64), SibSp (discrete, int64), Parch (discrete, int64)
- categorical: Sex (string), Pclass (int64), Embarked (character), Survived (int64), Ticket (alphanumeric, string), Cabin (alphanumeric, string), Name (string)
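
As a rough illustration of the grouping above, the storage type of each column can be read off with `select_dtypes`. The small frame below is a made-up stand-in for `train.csv` (its values are assumptions, not rows from the dataset), so this is only a sketch of the idea.

```python
import pandas as pd

# Tiny stand-in for train.csv; the values are invented for illustration.
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# Columns stored as numbers vs. columns stored as strings.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
string_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print("stored as numbers:", numeric_cols)
print("stored as strings:", string_cols)

# Pclass and Survived are stored as integers but are categorical in meaning,
# so the storage dtype alone does not settle the feature type.
print("distinct Pclass values:", sample_df["Pclass"].nunique())
```

The storage type is only a first hint: integer-coded columns such as Pclass still have to be classified by hand.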


In [None]:
# read data and have a first look at it
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
combine = [train_df, test_df]
train_df.info()
print('_'*40)
test_df.info()

In [None]:
# look at the first five rows
train_df.head()

In [None]:
# look at the first five rows
test_df.head() 

In [None]:
# describe numerical data
train_df.describe()

In [None]:
# describe numerical data
test_df.describe()

In [None]:
# describe object data
train_df.describe(include=['O'])

In [None]:
# describe object data
test_df.describe(include=['O'])

In [None]:
# check Pclass - Survived correlation
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# check Sex - Survived correlation
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# check SibSp - Survived correlation
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# check Parch - Survived correlation
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# Age histograms depending on Survived
grid = sns.FacetGrid(train_df, col='Survived');
grid.map(plt.hist, 'Age', bins=20);

In [None]:
# Age histograms depending on Survived, Pclass
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

In [None]:
# Survived values depending on Embarked, Sex
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6);  # 'size' was renamed to 'height' in seaborn 0.9
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep');
grid.add_legend();

In [None]:
# Fare depending on Embarked, Survived, Sex
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()

# 3. Manipulate Data

Based on the analysis above, we perform the following steps:

- convert Pclass feature (optional one-hot encoding, disabled by default)
- create new feature: Title
- delete features: Ticket, Cabin, Name, PassengerId
- convert features: Sex
- complete and convert feature: Age
- create new features: IsAlone, Age*Class
- complete and convert feature: Embarked 
- complete and convert feature: Fare
- delete Pclass feature (optional, disabled by default)

### Convert Pclass feature
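
The disabled cell in this subsection builds the one-hot columns by hand with three `map` calls. A shorter route, sketched here on a made-up four-row frame, is `pd.get_dummies`. The column prefix and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up ticket classes for four passengers.
sample_df = pd.DataFrame({"Pclass": [1, 3, 2, 3]})

# One indicator column per class value; dtype=int yields 0/1 instead of booleans.
onehot = pd.get_dummies(sample_df["Pclass"], prefix="Pclass", dtype=int)
encoded_df = pd.concat([sample_df, onehot], axis=1)

print(encoded_df)
```

One caveat: `get_dummies` only creates columns for values that are present, so applying it separately to the train and test frames can yield different columns if a class is missing in one of them. The explicit `map` calls do not have that problem.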

In [None]:
"""
# use one-hot-encoding for Pclass 
for dataset in combine:
    dataset['Pclass 1'] = dataset['Pclass'].map({1: 1, 2: 0, 3: 0}).astype(int)
    dataset['Pclass 2'] = dataset['Pclass'].map({1: 0, 2: 1, 3: 0}).astype(int)
    dataset['Pclass 3'] = dataset['Pclass'].map({1: 0, 2: 0, 3: 1}).astype(int)
"""

### Create new feature: Title
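
The title is the word that directly precedes the first period in the Name field. The sketch below applies the same kind of regular expression to three names written in the dataset's "Surname, Title. Given names" layout, to show what the pattern captures.

```python
import pandas as pd

# Names in the "Surname, Title. Given names" layout used by the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# A space, then a run of letters (captured), then a literal period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(titles.tolist())
```

Because the pattern requires the period right after the letters, words such as "the" in the third name are skipped and "Countess" is captured instead.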

In [None]:
# extract title from Name and then create new feature: Title  
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string keeps '\.' a valid escape

pd.crosstab(train_df['Title'], train_df['Survived'])

In [None]:
pd.crosstab(test_df['Title'], test_df['Sex'])

In [None]:
# reduce the number of titles
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir',
                                                 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')


In [None]:
# no missing titles
print(train_df.Title.isnull().sum())
print(test_df.Title.isnull().sum())

In [None]:
"""# use one-hot-encoding for Title
for dataset in combine:
    dataset['Title Mr'] = dataset['Title'].map({"Mr": 1, "Miss": 0, "Mrs": 0, "Master": 0, "Rare": 0}).astype(int)
    dataset['Title Miss'] = dataset['Title'].map({"Mr": 0, "Miss": 1, "Mrs": 0, "Master": 0, "Rare": 0}).astype(int)
    dataset['Title Mrs'] = dataset['Title'].map({"Mr": 0, "Miss": 0, "Mrs": 1, "Master": 0, "Rare": 0}).astype(int)
    dataset['Title Master'] = dataset['Title'].map({"Mr": 0, "Miss": 0, "Mrs": 0, "Master": 1, "Rare": 0}).astype(int)
    dataset['Title Rare'] = dataset['Title'].map({"Mr": 0, "Miss": 0, "Mrs": 0, "Master": 0, "Rare": 1}).astype(int)

# drop Title
train_df = train_df.drop(['Title'],axis=1)
test_df = test_df.drop(['Title'],axis=1)
combine = [train_df, test_df]
"""

In [None]:
# use ordinal values for Title 
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()

### Delete features: Ticket, Cabin, Name, PassengerId

In [None]:
# delete columns: Ticket, Cabin, Name, PassengerId
train_df = train_df.drop(['Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]
print("train_df = ", train_df.shape)
print("test_df = ", test_df.shape)

### Convert features: Sex

In [None]:
# convert variable 'Sex' into type int64
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_df.head()

### Complete and convert feature: Age
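
Missing ages are filled with the median age of passengers who share the same Sex and Pclass. The nested loops in the next cell do this explicitly; the sketch below shows the same idea with `groupby(...).transform` on a made-up six-row frame. Unlike the notebook's cell, it does not round the median to the nearest 0.5, so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Made-up passengers: two (Sex, Pclass) groups, each with one missing age.
sample_df = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, aligned row by row with the frame.
group_median = sample_df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing entries are replaced; known ages stay untouched.
sample_df["Age"] = sample_df["Age"].fillna(group_median)

print(sample_df)
```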

In [None]:
# complete missing age entries by using information on Sex, Pclass
guess_ages = np.zeros((2,3));

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & 
                               (dataset['Pclass'] == j+1)]['Age'].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()
            guess_ages[i,j] = int(age_guess/0.5 + 0.5 ) * 0.5
            #print(age_guess)
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & 
                        (dataset.Pclass == j+1),'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()

In [None]:
# create new feature AgeBand
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

In [None]:
# Replace Age with ordinals based on the bands in AgeBand
for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
train_df.head()

In [None]:
# remove AgeBand
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()

### Create new features: IsAlone, Age*Class
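
The derived features are simple column arithmetic: FamilySize counts the passenger plus relatives aboard, IsAlone flags a family size of one, and Age*Class multiplies the ordinal age band by the ticket class. A minimal sketch on made-up rows (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up rows; Age is already the ordinal band (0..4) at this point.
sample_df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Age": [1, 2, 0],
    "Pclass": [3, 1, 2],
})

# The passenger plus siblings/spouses plus parents/children.
sample_df["FamilySize"] = sample_df["SibSp"] + sample_df["Parch"] + 1

# 1 if the passenger travels without relatives, else 0.
sample_df["IsAlone"] = (sample_df["FamilySize"] == 1).astype(int)

# Interaction of age band and ticket class.
sample_df["Age*Class"] = sample_df["Age"] * sample_df["Pclass"]

print(sample_df)
```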

In [None]:
# create new feature FamilySize
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# create new feature IsAlone
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()


In [None]:
# remove features: Parch, SibSp (FamilySize is kept; the variant that also drops it is commented out)
#train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
#test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
train_df = train_df.drop(['Parch', 'SibSp'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp'], axis=1)
combine = [train_df, test_df]
train_df.head()

In [None]:
# create new feature Age*Class
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

#train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)
train_df.head()

### Complete and convert feature: Embarked 

In [None]:
# only 2/0 missing values in train/test data
print(train_df.Embarked.isnull().values.sum())
print(test_df.Embarked.isnull().values.sum())

In [None]:
# most frequent occurrence of Embarked value
freq_port = train_df.Embarked.dropna().mode()[0]
print(freq_port);

# replace na entries with most frequent value of Embarked
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
"""
# use one-hot-encoding for Embarked 
for dataset in combine:
    dataset['Embarked S'] = dataset['Embarked'].map({'S': 1, 'C': 0, 'Q': 0}).astype(int)
    dataset['Embarked C'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 0}).astype(int)
    dataset['Embarked Q'] = dataset['Embarked'].map({'S': 0, 'C': 0, 'Q': 1}).astype(int)

# drop Embarked
train_df = train_df.drop(['Embarked'],axis=1)
test_df = test_df.drop(['Embarked'],axis=1)
combine = [train_df, test_df]
"""

In [None]:
# use ordinal values for Embarked
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()


### Complete and convert feature: Fare
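
The cells in this subsection derive fare bands from quantiles of the training fares and then hard-code the cut points. As a hedged alternative, the quantile edges can be captured with `retbins=True` and reused for the test fares, so that both sets share the same bands. The fares below are made up, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up fares; the real values come from train.csv and test.csv.
train_fare = pd.Series([5.0, 7.5, 8.0, 10.0, 14.0, 20.0, 30.0, 80.0])
test_fare = pd.Series([6.0, 11.0, 500.0])

# retbins=True additionally returns the quartile edges found on the training data.
_, train_edges = pd.qcut(train_fare, 4, retbins=True)

# Keep the inner edges and open the outer ones, so that test fares outside
# the training range still fall into the lowest or highest band.
band_edges = np.concatenate(([-np.inf], train_edges[1:-1], [np.inf]))

# labels=False returns the band index (0..3) instead of interval objects.
train_band = pd.cut(train_fare, bins=band_edges, labels=False)
test_band = pd.cut(test_fare, bins=band_edges, labels=False)

print(train_band.tolist())
print(test_band.tolist())
```

Deriving the edges in code avoids the cut points drifting out of sync with the quantile call when the number of bands is changed.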

In [None]:
# only 0/1 missing values in train/test data
print(train_df.Fare.isnull().values.sum())
print(test_df.Fare.isnull().values.sum())

In [None]:
# complete feature Fare in test set
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())  # assign back instead of chained inplace fill
test_df.head()

In [None]:
# create feature FareBand (quartiles, matching the cut points used below)
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

In [None]:
train_df.head()

In [None]:
# replace feature Fare by ordinals based on FareBand
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
    
train_df.head()

In [None]:
train_df.head()

In [None]:
test_df.head()

### Drop feature Pclass

In [None]:
"""
# drop Pclass
train_df = train_df.drop(['Pclass'],axis=1)
test_df = test_df.drop(['Pclass'],axis=1)
combine = [train_df, test_df]
"""


# 4. Model data
- create training, validation, testing sets
- supervised learning plus classification limits the number of machine learning algorithms to:  
    - Logistic Regression
    - kNN (k-Nearest Neighbors)
    - SVM (Support Vector Machine) with different kernels
    - Gaussian Naive Bayes
    - Decision Tree
    - Random Forest
    - Deep Neural Network
- train models and summarize the results
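
The cells below split and normalize the data by hand. For comparison, here is a minimal sketch of the same workflow with scikit-learn utilities: a stratified 90%/10% split, per-feature min-max scaling fitted on the training part only, and a logistic regression benchmark. The data is synthetic and the variable names are hypothetical; this is not the notebook's own procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic ordinal features and labels standing in for the prepared Titanic data.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 4)).astype(float)
y = (X[:, 0] + X[:, 1] > 4).astype(int)

# 90%/10% split; stratify keeps the class ratio similar in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)

# Fit the scaler on the training part only, then apply it to both parts,
# so the validation data does not influence the scaling.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

model = LogisticRegression().fit(X_train_scaled, y_train)
print("train acc = %.4f" % model.score(X_train_scaled, y_train))
print("valid acc = %.4f" % model.score(X_valid_scaled, y_valid))
```

The same fitted scaler would also be applied to the test features before prediction.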

In [None]:
## create training, validation, testing sets

# read train, validation, test data
valid_set_size = int(train_df.shape[0] * valid_set_size_percentage/100.0)

# train on all data
if train_on_all_data:
    train_set_size = train_df.shape[0]
else:
    train_set_size = train_df.shape[0] - valid_set_size
    
X_train_val = train_df.drop("Survived", axis=1).copy().values
Y_train_val = train_df["Survived"].copy().values.reshape(-1)
X_test  = test_df.drop("PassengerId", axis=1).copy().values
test_set_size = X_test.shape[0]

# store used features
# keep every feature column, so that row i matches column i of X_train_val
df_features = pd.DataFrame(train_df.drop("Survived", axis=1).columns)
df_features.columns = ['Feature']
print(df_features)
print('')

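# normalize train, validation, test data with one global scale taken from the
# labeled data, so that train and test features are scaled identically
norm_scale = X_train_val.max() - X_train_val.min()
X_train_val_norm = X_train_val/norm_scale
X_test_norm = X_test/norm_scale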

# split train and validation data
perm_array = np.arange(X_train_val_norm.shape[0])
np.random.shuffle(perm_array)
X_train_norm = X_train_val_norm[perm_array[:train_set_size]]
Y_train = Y_train_val[perm_array[:train_set_size]]
X_valid_norm = X_train_val_norm[perm_array[-valid_set_size:]]
Y_valid = Y_train_val[perm_array[-valid_set_size:]]

#X_train = train_df.drop("Survived", axis=1)[0:train_set_size]
#Y_train = train_df["Survived"][0:train_set_size]
#if valid_set_size > 0:
#    X_valid = train_df.drop("Survived", axis=1)[train_set_size:]
#    Y_valid = train_df["Survived"][train_set_size:]
#else:
#    X_valid = train_df.drop("Survived", axis=1)[801:]
#    Y_valid = train_df["Survived"][801:]
    
print('X_train_norm.shape = ', X_train_norm.shape)
print('Y_train.shape = ', Y_train.shape)
print('X_valid_norm.shape = ', X_valid_norm.shape)
print('Y_valid.shape = ', Y_valid.shape)
print('X_test_norm.shape = ', X_test_norm.shape)

# function to shuffle randomly train and validation data
def shuffle_train_valid_data():
    global X_train_val_norm, X_train_norm, Y_train, X_valid_norm, Y_valid
    np.random.shuffle(perm_array)
    X_train_norm = X_train_val_norm[perm_array[:train_set_size]]
    Y_train = Y_train_val[perm_array[:train_set_size]]
    X_valid_norm = X_train_val_norm[perm_array[-valid_set_size:]]
    Y_valid = Y_train_val[perm_array[-valid_set_size:]]
    return None 

In [None]:
## Logistic Regression as a benchmark model

acc_log_train = 0
acc_log_valid = 0
log_correlation = 0
Y_log_pred = np.zeros(X_test_norm.shape[0])

for i in range(cv_num):
    
    shuffle_train_valid_data()

    logreg = sklearn.linear_model.LogisticRegression()
    logreg.fit(X_train_norm, Y_train)
    Y_log_pred += logreg.predict_proba(X_test_norm)[:,1]

    acc_log_train += logreg.score(X_train_norm, Y_train)
    acc_log_valid += logreg.score(X_valid_norm, Y_valid)
    log_correlation += logreg.coef_[0];

acc_log_train /= cv_num
acc_log_valid /= cv_num
log_correlation /= cv_num
Y_log_pred /= cv_num

print('Logistic Regression: train/valid Acc = %.4f/%.4f'%(acc_log_train, acc_log_valid))
df_features["Correlation"] = pd.Series(log_correlation)
df_features.sort_values(by='Correlation', ascending=False)


In [None]:
## Further Machine Learning Algorithms

acc_svc_rbf_train = acc_svc_rbf_valid = 0
acc_svc_linear_train = acc_svc_linear_valid = 0
acc_knn_train = acc_knn_valid = 0
acc_gaussianNB_train = acc_gaussianNB_valid = 0
acc_decision_tree_train = acc_decision_tree_valid = 0
acc_random_forest_train = acc_random_forest_valid = 0

Y_pred_random_forest = np.zeros(X_test.shape[0])
Y_pred_decision_tree = np.zeros(X_test.shape[0])
Y_pred_gaussianNB = np.zeros(X_test.shape[0])
Y_pred_knn = np.zeros(X_test.shape[0])
Y_pred_svc_linear = np.zeros(X_test.shape[0])
Y_pred_svc_rbf = np.zeros(X_test.shape[0])

for i in range(cv_num):

    shuffle_train_valid_data()

    # support vector machine with rbf kernel
    svc_rbf = sklearn.svm.SVC(kernel='rbf')
    svc_rbf.fit(X_train_norm, Y_train)
    Y_pred_svc_rbf += svc_rbf.predict(X_test_norm)
    acc_svc_rbf_train += svc_rbf.score(X_train_norm, Y_train)
    acc_svc_rbf_valid += svc_rbf.score(X_valid_norm, Y_valid)

    # support vector machine with linear kernel
    svc_linear = sklearn.svm.SVC(kernel='linear')
    svc_linear.fit(X_train_norm, Y_train)
    Y_pred_svc_linear += svc_linear.predict(X_test_norm)
    acc_svc_linear_train += svc_linear.score(X_train_norm, Y_train)
    acc_svc_linear_valid += svc_linear.score(X_valid_norm, Y_valid)

    # k-Nearest-Neighbour Algorithm
    knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors = 5)
    knn.fit(X_train_norm, Y_train)
    Y_pred_knn += knn.predict(X_test_norm)
    acc_knn_train += knn.score(X_train_norm, Y_train)
    acc_knn_valid += knn.score(X_valid_norm, Y_valid)

    # Gaussian Naive Bayes
    gaussianNB = sklearn.naive_bayes.GaussianNB()
    gaussianNB.fit(X_train_norm, Y_train)
    Y_pred_gaussianNB += gaussianNB.predict_proba(X_test_norm)[:,1]
    acc_gaussianNB_train += gaussianNB.score(X_train_norm, Y_train)
    acc_gaussianNB_valid += gaussianNB.score(X_valid_norm, Y_valid)

    # Decision Tree
    decision_tree = sklearn.tree.DecisionTreeClassifier()
    decision_tree.fit(X_train_norm, Y_train)
    Y_pred_decision_tree += decision_tree.predict_proba(X_test_norm)[:,1]
    acc_decision_tree_train += decision_tree.score(X_train_norm, Y_train)
    acc_decision_tree_valid += decision_tree.score(X_valid_norm, Y_valid)

    # Random Forest
    random_forest = sklearn.ensemble.RandomForestClassifier(n_estimators=10)
    random_forest.fit(X_train_norm, Y_train)
    Y_pred_random_forest += random_forest.predict_proba(X_test_norm)[:,1] # prob for 1
    acc_random_forest_train += random_forest.score(X_train_norm, Y_train)
    acc_random_forest_valid += random_forest.score(X_valid_norm, Y_valid)

acc_svc_rbf_train /= cv_num
acc_svc_rbf_valid /= cv_num

acc_svc_linear_train /= cv_num
acc_svc_linear_valid /= cv_num

acc_knn_train /= cv_num
acc_knn_valid /= cv_num

acc_gaussianNB_train /= cv_num
acc_gaussianNB_valid /= cv_num

acc_decision_tree_train /= cv_num
acc_decision_tree_valid /= cv_num

acc_random_forest_train /= cv_num
acc_random_forest_valid /= cv_num

Y_pred_random_forest /= float(cv_num)
Y_pred_decision_tree /= float(cv_num)
Y_pred_gaussianNB /= float(cv_num)
Y_pred_knn /= float(cv_num)
Y_pred_svc_linear /= float(cv_num)
Y_pred_svc_rbf /= float(cv_num)

#print(Y_pred_random_forest)

print('SVC rbf kernel: train/valid Acc = %.4f/%.4f'%(acc_svc_rbf_train, acc_svc_rbf_valid))
print('SVC linear kernel: train/valid Acc = %.4f/%.4f'%(acc_svc_linear_train, acc_svc_linear_valid))
print('kNN: train/valid Acc = %.4f/%.4f'%(acc_knn_train, acc_knn_valid))
print('Gaussian Naive Bayes: train/valid Acc = %.4f/%.4f'%(acc_gaussianNB_train, acc_gaussianNB_valid))
print('Decision Tree: train/valid Acc = %.4f/%.4f'%(acc_decision_tree_train, acc_decision_tree_valid))
print('Random Forest: train/valid Acc = %.4f/%.4f'%(acc_random_forest_train, acc_random_forest_valid))


In [None]:
## Deep Neural Network

x_size = X_train_norm.shape[1]; # number of features
y_size = 1; # binary variable
n_n_fc1 = 256; # number of neurons of first layer
n_n_fc2 = 128; # number of neurons of second layer
n_n_fc3 = 64; # number of neurons of third layer

# variables for input and output 
x_data = tf.placeholder('float', shape=[None, x_size])
y_data = tf.placeholder('float', shape=[None, y_size])

# 1.layer: fully connected
W_fc1 = tf.Variable(tf.truncated_normal(shape = [x_size, n_n_fc1], stddev = 0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape = [n_n_fc1]))  
h_fc1 = tf.nn.relu(tf.matmul(x_data, W_fc1) + b_fc1)

# dropout
tf_keep_prob = tf.placeholder('float')
h_fc1_drop = tf.nn.dropout(h_fc1, tf_keep_prob)

# 2.layer: fully connected
W_fc2 = tf.Variable(tf.truncated_normal(shape = [n_n_fc1, n_n_fc2], stddev = 0.1)) 
b_fc2 = tf.Variable(tf.constant(0.1, shape = [n_n_fc2]))  
h_fc2 = tf.nn.relu(tf.matmul(h_fc1_drop, W_fc2) + b_fc2) 

# dropout
h_fc2_drop = tf.nn.dropout(h_fc2, tf_keep_prob)

# 3.layer: fully connected
W_fc3 = tf.Variable(tf.truncated_normal(shape = [n_n_fc2, n_n_fc3], stddev = 0.1)) 
b_fc3 = tf.Variable(tf.constant(0.1, shape = [n_n_fc3]))  
h_fc3 = tf.nn.relu(tf.matmul(h_fc2_drop, W_fc3) + b_fc3) 

# dropout
h_fc3_drop = tf.nn.dropout(h_fc3, tf_keep_prob)

# 4.layer: fully connected (output layer)
W_fc4 = tf.Variable(tf.truncated_normal(shape = [n_n_fc3, y_size], stddev = 0.1)) 
b_fc4 = tf.Variable(tf.constant(0.1, shape = [y_size]))  
z_pred = tf.matmul(h_fc3_drop, W_fc4) + b_fc4  

# cost function
cross_entropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_data, logits=z_pred));

# optimisation function
tf_learn_rate = tf.placeholder(dtype='float', name="tf_learn_rate")
train_step = tf.train.AdamOptimizer(tf_learn_rate).minimize(cross_entropy)

# evaluation
y_pred = tf.cast(tf.nn.sigmoid(z_pred),dtype = tf.float32);
y_pred_class = tf.cast(tf.greater(y_pred, 0.5),'float')
accuracy = tf.reduce_mean(tf.cast(tf.equal(y_pred_class, y_data ), 'float'))

keep_prob = 0.5; # dropout regularization with keeping probability
learn_rate_range = [0.01,0.005,0.0025,0.001,0.001,0.001,0.00075,0.0005,0.00025,0.0001,
                   0.0001,0.0001,0.0001];
learn_rate_step = 100;

n_epoch = 500; # number of epochs
train_loss, train_acc, valid_loss, valid_acc = 0,0,0,0

acc_DNN_train = 0
acc_DNN_valid = 0
loss_DNN_train = 0
loss_DNN_valid = 0
Y_pred_DNN = np.zeros(X_test.shape[0]).astype(float);  # np.float was removed from NumPy

for j in range(cv_num):
    
    # start TensorFlow session and initialize global variables
    sess = tf.InteractiveSession() 
    sess.run(tf.global_variables_initializer())  

    shuffle_train_valid_data() # shuffle data
    Y_train = Y_train.reshape(-1,1)
    Y_valid = Y_valid.reshape(-1,1)
    n_step = -1;

    # training model
    for i in range(0,n_epoch):

        if i%learn_rate_step == 0:
            n_step += 1;
            learn_rate = learn_rate_range[n_step];
            print('set learnrate = ', learn_rate)

        sess.run(train_step, feed_dict={x_data: X_train_norm, y_data: Y_train, 
                                        tf_keep_prob: keep_prob, tf_learn_rate: learn_rate})

        if i%20==0:
            train_loss = sess.run(cross_entropy,feed_dict={x_data: X_train_norm, 
                                                           y_data: Y_train, 
                                                           tf_keep_prob: 1.0})

            train_acc = accuracy.eval(feed_dict={x_data: X_train_norm, 
                                                 y_data: Y_train, 
                                                 tf_keep_prob: 1.0})    

            valid_loss = sess.run(cross_entropy,feed_dict={x_data: X_valid_norm, 
                                                           y_data: Y_valid, 
                                                           tf_keep_prob: 1.0})

            valid_acc = accuracy.eval(feed_dict={x_data: X_valid_norm, 
                                                 y_data: Y_valid, 
                                                 tf_keep_prob: 1.0})      

            print('%d epoch: train/val loss = %.4f/%.4f, train/val acc = %.4f/%.4f'%(i+1, 
                            train_loss, valid_loss, train_acc, valid_acc))

    acc_DNN_train += train_acc
    acc_DNN_valid += valid_acc
    loss_DNN_train += train_loss
    loss_DNN_valid += valid_loss
    
    # prediction for test set
    Y_pred_DNN += y_pred.eval(feed_dict={x_data: X_test_norm, 
                                        tf_keep_prob: 1.0}).flatten()
    
    sess.close();
    
acc_DNN_train /= float(cv_num)
acc_DNN_valid /= float(cv_num)
loss_DNN_train /= float(cv_num)
loss_DNN_valid /= float(cv_num)
Y_pred_DNN /= float(cv_num)

# final loss and accuracy
print('')
print('final: train/val loss = %.4f/%.4f, train/val acc = %.4f/%.4f'%(loss_DNN_train, 
                                                                      loss_DNN_valid, 
                                                                      acc_DNN_train, 
                                                                      acc_DNN_valid))


In [None]:
# model summary
models = pd.DataFrame({
    'Model': ['SVC with rbf kernel', 'kNN', 'Logistic Regression', 
              'Random Forest', 'Gaussian Naive Bayes', 'SVC with linear kernel', 
              'Decision Tree', 'Deep Neural Network'],
    'Train Acc': [acc_svc_rbf_train, acc_knn_train, acc_log_train, 
                  acc_random_forest_train, acc_gaussianNB_train,
                  acc_svc_linear_train, acc_decision_tree_train, acc_DNN_train],
    'Valid Acc': [acc_svc_rbf_valid, acc_knn_valid, acc_log_valid, 
                  acc_random_forest_valid, acc_gaussianNB_valid, acc_svc_linear_valid, 
                  acc_decision_tree_valid, acc_DNN_valid]})
models.sort_values(by='Valid Acc', ascending=False)

# 5. Predict and submit test results
- combine prediction of probabilities of different algorithms for the test set
- draw classes from probabilities or use fixed cuts
- submit test results
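
The two ways of turning probabilities into classes can be sketched with NumPy alone. The probabilities below are made up and stand in for the per-model test predictions; the variable names are hypothetical.

```python
import numpy as np

# Made-up survival probabilities from two models for four passengers.
proba_a = np.array([0.9, 0.4, 0.2, 0.6])
proba_b = np.array([0.7, 0.7, 0.1, 0.3])

# Combine the models by averaging their probabilities.
proba_mean = (proba_a + proba_b) / 2.0

# Option 1: fixed cut at 0.5.
pred_fixed = (proba_mean > 0.5).astype(int)

# Option 2: draw each class from a Bernoulli distribution with that probability.
rng = np.random.default_rng(0)
pred_drawn = rng.binomial(1, proba_mean)

print(proba_mean)
print(pred_fixed)
print(pred_drawn)
```

The fixed cut is deterministic, while the drawn classes change with the random seed.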

In [None]:
## combined prediction
#Y_pred_submit = (Y_pred_DNN + Y_pred_random_forest + Y_pred_decision_tree)/3.0
Y_pred_submit = Y_pred_DNN

# fixed cut
Y_pred_class_submit = np.greater(Y_pred_submit,0.5).astype(int)  # np.int was removed from NumPy

# draw from probability distribution
#Y_pred_class_submit = [np.random.binomial(1,x) for x in Y_pred_submit] 

In [None]:
# submit the best results
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred_class_submit
    })

#if not os.path.exists(os.path.dirname(os.getcwd())+'/output'): 
#    print('create directory ', os.path.dirname(os.getcwd())+'/output')
#    os.makedirs(os.path.dirname(os.getcwd())+'/output')
#submission.to_csv('../output/submission.csv', index=False)
submission.to_csv('submission.csv', index=False)