This notebook is a midterm project for Machine Learning Zoomcamp.  

The project uses the Titanic - Machine Learning from Disaster dataset. The target is passenger survival, stored in the 'Survived' column.  
The data is available [here](https://www.kaggle.com/competitions/titanic/overview).  
The original Kaggle competition provides three separate files: train, test, and gender submission. This project takes a slightly different approach.  
Only the train data is used, for both training and testing, because the Kaggle test data does not include the target variable we're interested in. That target is hidden, so predictions have to be submitted to Kaggle for verification.  
We skip that step in this project, because the goal is to build a model and deploy it to the cloud.  
***The priority of this project is model development (without refining the accuracy) and deployment.***  
Refining model accuracy will be the next stage, if we get the chance to do it.

In [6]:
import pandas as pd
import numpy as np

%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import export_text



In [7]:
from sklearn.tree import DecisionTreeClassifier

In [8]:
#read the Titanic dataset (Kaggle's train.csv), which we use for both training and testing
titanic_data = pd.read_csv('train.csv')

## Exploratory Data Analysis (EDA)

In [9]:
#checking the first five rows of training data
titanic_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [10]:
#checking how many missing value in the dataset
titanic_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [11]:
#quick check the descriptive statistics of the dataset
titanic_data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [12]:
#checking the number of unique values in each column
titanic_data.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [13]:
#fill missing values with zeros
titanic_data = titanic_data.fillna(0)

In [14]:
#check again how many missing values remain in the dataset after replacing them with zeros
titanic_data.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
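
Filling every gap with zero removes the missing values, but it also changes column statistics. The sketch below uses a small hypothetical `toy` frame (not the Titanic data) to show how a zero-filled 'Age' pulls the mean down compared with the mean pandas computes when it skips NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing age (not taken from train.csv)
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

# pandas skips NaN by default, so the mean uses only the two known ages
mean_before = toy["Age"].mean()

# After fillna(0) the gap counts as a real age of 0
filled = toy.fillna(0)
mean_after = filled["Age"].mean()

print(mean_before, mean_after)
```

This is acceptable here because 'Age' and 'Cabin' are not among the features used for the model later on, but it would matter if they were added.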

In [15]:
# Explore whether gender is a strong indicator of survival.
# Filter the female passengers from titanic_data, then select only the 'Survived' column of those rows.
women = titanic_data.loc[titanic_data.Sex == 'female']['Survived']

# sum(women): the women Series contains 0s (did not survive) and 1s (survived), so its sum is
# the number of female passengers who survived.
# len(women): the number of elements in the women Series,
# which is the total number of female passengers in the titanic_data DataFrame.
rate_women = sum(women)/len(women)

print("percentage of women who survived", rate_women*100,"%")

percentage of women who survived 74.20382165605095 %


In [16]:
# Calculating the men percentage with the same logic as above.
men = titanic_data.loc[titanic_data.Sex == 'male']['Survived']
rate_men = sum(men)/len(men)

print("percentage of men who survived", rate_men*100,"%")

percentage of men who survived 18.890814558058924 %
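
The two cells above can also be written as a single grouped mean, since the mean of a 0/1 column is the share of 1s. This is a minimal sketch on a hypothetical `toy` frame, so the numbers are illustrative and not the Titanic rates.

```python
import pandas as pd

# Hypothetical toy data: three women (two survived) and three men (one survived)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Mean of the 0/1 'Survived' column per group = survival rate per group
rates = toy.groupby("Sex")["Survived"].mean()

print((rates * 100).round(1))
```

Applied to `titanic_data`, the same `groupby` call should reproduce the two percentages printed above.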


Gender is a strong indicator of passenger survival.  
Next, we build a model. Although `RandomForestClassifier` is imported above, the code below trains a `DecisionTreeClassifier`. The model looks at four columns of the data ('Pclass', 'Sex', 'SibSp', and 'Parch').  
The model uses a max_depth of 3 and a random_state of 1.

In [17]:
#do train/validation/test split with 60/20/20 distribution
#create titanic data (td) full train and td test data set
td_full_train, td_test = train_test_split(titanic_data, test_size = 0.2, random_state = 1)

#create titanic data (td) train and td validation data set
td_train, td_val = train_test_split(td_full_train, test_size = 0.25, random_state = 1)
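
The second split uses `test_size = 0.25` because 25% of the remaining 80% is 20% of the whole dataset. The sketch below checks the resulting partition sizes on a stand-in array of 891 row indices (the row count reported by `describe()`), so no CSV file is needed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 891 rows of the dataset
rows = np.arange(891)

# First split: hold out 20% as the test set
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% becomes the validation set
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))  # 534 178 179
```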

In [18]:
#extract target variable
y_train = td_train['Survived'].values
y_val = td_val['Survived'].values
y_test = td_test['Survived'].values

In [19]:
#delete all target variable from all partition to avoid mistakes (human error)
del td_train['Survived']
del td_val['Survived']
del td_test['Survived']

In [20]:
features = ['Pclass', 'Sex', 'SibSp', 'Parch']

#convert the selected td_train columns into a list of dictionaries, the input format DictVectorizer expects
td_train_dict = td_train[features].to_dict(orient = 'records')

#create feature matrix for numerical and one-hot encoding for categorical features
dv = DictVectorizer(sparse = True)
X_train = dv.fit_transform(td_train_dict)

#train to predict target variable
model = DecisionTreeClassifier(max_depth = 3, random_state = 1)
dtreg = model.fit(X_train, y_train)

In [21]:
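# Create the feature matrix of the validation partition.
# Use transform (not fit_transform) so the columns match the ones learned from the training data.
td_val_dict = td_val[features].to_dict(orient='records')
X_val = dv.transform(td_val_dict)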

In [22]:
#creating rmse function
def rmse(y, y_pred):
    error = y_pred - y
    mse = (error**2).mean()
    return np.sqrt(mse)

In [23]:
#predicting y with X_val
y_pred_1 = dtreg.predict(X_val)

#Calculate the RMSE value and print it
print(f'RMSE Decision Tree Classifier: {round(rmse(y_val, y_pred_1),4)}' )

RMSE Decision Tree Classifier: 0.4858


In [24]:
#Checking accuracy score
acc = 100 * (y_pred_1 == y_val).mean()
print("accuracy score is: ", acc, "%")

accuracy score is:  76.40449438202246 %
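
For 0/1 labels and 0/1 predictions, each squared error is 1 for a wrong prediction and 0 for a correct one, so RMSE equals the square root of (1 - accuracy). The sketch below checks this on a small hypothetical pair of arrays and then on the accuracy printed above.

```python
import numpy as np

# Hypothetical labels and predictions: 3 of 5 are correct
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0])

accuracy = (y_true == y_hat).mean()
rmse_value = np.sqrt(((y_hat - y_true) ** 2).mean())

# The two metrics carry the same information for binary predictions
assert np.isclose(rmse_value, np.sqrt(1 - accuracy))

# Accuracy reported by the notebook, expressed as a fraction
reported_accuracy = 0.7640449438202246
rmse_from_accuracy = round(float(np.sqrt(1 - reported_accuracy)), 4)
print(rmse_from_accuracy)  # 0.4858
```

Since the RMSE adds no information beyond accuracy here, a classification metric such as the already imported `roc_auc_score` may be a more useful second measure.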


In [25]:
import pickle

In [26]:
#save model to pickle
with open('model.bin', 'wb') as f_out:
    pickle.dump((dv, model), f_out)

In [27]:
!ls -lh

total 164K
-rw-rw-r-- 1 radz radz 3.2K Dec 11  2019  gender_submission.csv
-rw-rw-r-- 1 radz radz  25K Nov 19 22:22 'Midterm project-Titanic.ipynb'
-rw-rw-r-- 1 radz radz  24K Nov 19 20:40 'Midterm project-Titanic-pipeline version.ipynb'
-rw-rw-r-- 1 radz radz 2.6K Nov 19 22:23  model.bin
-rw-rw-r-- 1 radz radz 5.8K Nov 19 22:11  predict.py
-rw-rw-r-- 1 radz radz  28K Dec 11  2019  test.csv
-rw-rw-r-- 1 radz radz  60K Dec 11  2019  train.csv
-rw-rw-r-- 1 radz radz 4.7K Nov 19 22:14  train.py


In [28]:
#load the vectorizer and the model back from the file (f_out was the output file handle used for saving, f_in is the input file handle)
with open('model.bin', 'rb') as f_in:
    (dv, model) = pickle.load(f_in)

### Scikit-Learn Pipeline
Creating a pipeline that chains the vectorizer and the model, for more convenient code execution.

In [29]:
#import the pipeline
from sklearn.pipeline import make_pipeline

In [30]:
pipeline = make_pipeline(
    DictVectorizer(),
    DecisionTreeClassifier(max_depth = 3, random_state = 1))

In [33]:
td_test_dict = td_test[features].to_dict(orient = 'records')
td_test_dict[3]

{'Pclass': 3, 'Sex': 'female', 'SibSp': 0, 'Parch': 0}
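
A pipeline built this way accepts the dictionaries directly, so the separate vectorizing step disappears from both training and prediction. The sketch below fits such a pipeline on hypothetical records in which only 'Sex' varies, then predicts for a record shaped like the one shown above. `pipe_demo` and the toy data are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical records: only 'Sex' differs, so it is the only usable split
records = [
    {"Pclass": 3, "Sex": "male", "SibSp": 0, "Parch": 0},
    {"Pclass": 3, "Sex": "female", "SibSp": 0, "Parch": 0},
    {"Pclass": 3, "Sex": "male", "SibSp": 0, "Parch": 0},
    {"Pclass": 3, "Sex": "female", "SibSp": 0, "Parch": 0},
]
labels = [0, 1, 0, 1]

pipe_demo = make_pipeline(
    DictVectorizer(),
    DecisionTreeClassifier(max_depth=3, random_state=1),
)

# fit() vectorizes the dictionaries and trains the tree in one call
pipe_demo.fit(records, labels)

# predict() applies the fitted vectorizer before calling the tree
prediction = pipe_demo.predict(
    [{"Pclass": 3, "Sex": "female", "SibSp": 0, "Parch": 0}]
)
print(prediction[0])  # 1
```

With the notebook's data, the equivalent calls would presumably be `pipeline.fit(td_train_dict, y_train)` followed by `pipeline.predict(td_test_dict)`.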