### Decision Tree

#### Build Decision Tree on Titanic Dataset (**Join the Kaggle Competition!**)

In [1]:
# Import libraries used for data management
import numpy as np
import pandas as pd


In [2]:
# load training dataset 'titanic_train.csv' for titanic case, using 'PassengerId' as index column

train = pd.read_csv('titanic_train.csv', index_col='PassengerId')

In [3]:
# load test dataset 'titanic_test.csv' for titanic case, using 'PassengerId' as index column
test = pd.read_csv('titanic_test.csv', index_col='PassengerId')


#### Preprocessing on Training Set and Test Set

In [4]:
# get the types of features for training set (pay attention to attributes with missing values)

train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [5]:
# On training set, replace missing values in 'Age' with the mean (numeric variable) of the training set
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

# Assign the result back rather than using inplace=True on a column selection
train['Age'] = train['Age'].fillna(train['Age'].mean())


In [6]:
# On training set, replace missing values in 'Embarked' with the most frequent value (mode) of training set
# mode() returns a Series, so take its first element to get a scalar fill value
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])


In [7]:
# Delete the column 'Cabin' from training set
train = train.drop(columns='Cabin')


In [8]:
# Check whether the training set has been cleaned
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 76.6+ KB
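
A pitfall worth checking at this step: `Series.mode()` returns a Series (several values can tie for most frequent) rather than a scalar, and `fillna` aligns a Series argument on index labels. The sketch below uses a small made-up column, not the Titanic data, to show the difference between passing the Series and passing its first element.

```python
import pandas as pd

# Toy column (hypothetical data) indexed from 1, like PassengerId
embarked = pd.Series(['S', 'C', None, 'S', 'Q'], index=[1, 2, 3, 4, 5])

mode_series = embarked.mode()   # a Series whose only index label is 0
mode_value = mode_series[0]     # the scalar most frequent value

# Passing the Series aligns on index labels; label 3 is not in mode_series,
# so the missing entry is left untouched
filled_with_series = embarked.fillna(mode_series)

# Passing the scalar fills every missing entry
filled_with_scalar = embarked.fillna(mode_value)

print(filled_with_series.isna().sum())  # missing values remaining
print(filled_with_scalar.isna().sum())
```

If `info()` still reports fewer non-null values than rows after a mode fill, this alignment behaviour is a likely cause.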


**Now we will preprocess the test set. Note that missing values in the test set should be replaced with the mean or most frequent value computed from the training set, so that no information from the test set leaks into the preprocessing.**
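
As a minimal sketch of this rule, the fill values can be computed once from the training set and then applied to both sets. The toy frames and the name `fill_values` below are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values) standing in for the real train/test sets
train_toy = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0],
                          'Fare': [10.0, 20.0, 30.0, 40.0]})
test_toy = pd.DataFrame({'Age': [np.nan, 50.0],
                         'Fare': [5.0, np.nan]})

# Compute the fill values from the training set only
fill_values = {'Age': train_toy['Age'].mean(),
               'Fare': train_toy['Fare'].mean()}

# Apply the same training-set statistics to both sets
train_filled = train_toy.fillna(value=fill_values)
test_filled = test_toy.fillna(value=fill_values)

print(test_filled)
```

Keeping the statistics in one dictionary makes it harder to accidentally compute a mean from the test set.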

In [9]:
# get the types of features for test set (pay attention to attributes with missing values)

test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB


In [10]:
# Calculate the mean of the 'Age' column in the training set
Age_mean = train['Age'].mean()
# On test set, replace missing values in 'Age' with mean of training set
test['Age'] = test['Age'].fillna(Age_mean)
print(Age_mean)

29.699117647058763


In [11]:
# Find the mean of 'Fare' in the training set
Fare_mean = train['Fare'].mean()
# On test set, replace missing values in 'Fare' with mean of training set
test['Fare'] = test['Fare'].fillna(Fare_mean)
print(Fare_mean)

32.2042079685746


In [12]:
# Delete the column 'Cabin' from test set

test = test.drop(columns='Cabin')


**Now, we will do one-hot encoding on categorical variables. Please DO NOT drop the first dummy! When we process training data and test data separately, the first dummy may not always be the same. Please generate all the dummy variables first and manually drop the same dummy from training and test afterwards (so that training and test data have the same set of dummy variables).**
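
To illustrate why the two sets can end up with different dummies, here is a small sketch on made-up data in which the test set contains no 'C' passengers. It uses `reindex` to give the test frame the training columns before dropping the same reference dummy. This is one possible approach and not the assignment's required method; on the real data the alignment should cover feature columns only, since the test set has no 'Survived' column.

```python
import pandas as pd

# Toy frames (hypothetical): the test set happens to contain no 'C' passengers
train_toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test_toy = pd.DataFrame({'Embarked': ['S', 'Q']})

train_d = pd.get_dummies(train_toy, columns=['Embarked'], prefix=['Embarked'], dtype=int)
test_d = pd.get_dummies(test_toy, columns=['Embarked'], prefix=['Embarked'], dtype=int)

# Give the test set exactly the training columns; absent dummies become all-zero
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

# Now drop the same reference dummy from both sets
train_d = train_d.drop(columns='Embarked_C')
test_d = test_d.drop(columns='Embarked_C')

print(list(train_d.columns))
print(list(test_d.columns))
```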

In [13]:
# On training set, create dummy variables for categorical feature 'Sex', using 'Sex' as prefix, and DO NOT drop the first dummy
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

train = pd.get_dummies(train, columns=['Sex'], prefix=['Sex'], drop_first=False)


In [14]:
# On test set, create dummy variables for categorical feature 'Sex', using 'Sex' as prefix, and DO NOT drop the first dummy
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
test = pd.get_dummies(test, columns=['Sex'], prefix=['Sex'], drop_first=False)


In [15]:
# Delete the column 'Sex_female' from training set

train = train.drop(columns='Sex_female')

In [16]:
# Check that the 'Sex_female' column was removed from the train dataframe
train.head()

PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,Sex_male
1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,S,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C,0
3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,S,0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,S,0
5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,S,1


In [17]:
# Delete the column 'Sex_female' from test set

test = test.drop(columns='Sex_female')

In [18]:
# Check that the 'Sex_female' column was removed from the test dataframe
test.head()

PassengerId,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,Sex_male
892,3,"Kelly, Mr. James",34.5,0,0,330911,7.8292,Q,1
893,3,"Wilkes, Mrs. James (Ellen Needs)",47.0,1,0,363272,7.0,S,0
894,2,"Myles, Mr. Thomas Francis",62.0,0,0,240276,9.6875,Q,1
895,3,"Wirz, Mr. Albert",27.0,0,0,315154,8.6625,S,1
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.0,1,1,3101298,12.2875,S,0


In [19]:
# On training set, create dummy variables for categorical feature 'Embarked', using 'Embarked' as prefix, and DO NOT drop the first dummy
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

train = pd.get_dummies(train, columns=['Embarked'], prefix=['Embarked'], drop_first=False)


In [20]:
# Check that the dummy variables were added to the dataframe
train.head()

PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Sex_male,Embarked_C,Embarked_Q,Embarked_S
1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,1,0,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,0,1,0,0
3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,0,0,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,0,0,0,1
5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,1,0,0,1


In [21]:
# On test set, create dummy variables for categorical feature 'Embarked', using 'Embarked' as prefix, and DO NOT drop the first dummy
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
test = pd.get_dummies(test, columns=['Embarked'], prefix=['Embarked'], drop_first=False)


In [22]:
# Check that the dummy variables were added to the dataframe
test.head()

PassengerId,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Sex_male,Embarked_C,Embarked_Q,Embarked_S
892,3,"Kelly, Mr. James",34.5,0,0,330911,7.8292,1,0,1,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",47.0,1,0,363272,7.0,0,0,0,1
894,2,"Myles, Mr. Thomas Francis",62.0,0,0,240276,9.6875,1,0,1,0
895,3,"Wirz, Mr. Albert",27.0,0,0,315154,8.6625,1,0,0,1
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.0,1,1,3101298,12.2875,0,0,0,1


In [23]:
# Delete the column 'Embarked_C' from training set

train = train.drop(columns='Embarked_C')


In [24]:
# Delete the column 'Embarked_C' from test set
test = test.drop(columns='Embarked_C')


In [25]:
# Display the first 5 rows for training set

train.head()

PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Sex_male,Embarked_Q,Embarked_S
1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,1,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,0,0,0
3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,0,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,0,0,1
5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,1,0,1


In [26]:
# Display the first 5 rows for test set

test.head()

PassengerId,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Sex_male,Embarked_Q,Embarked_S
892,3,"Kelly, Mr. James",34.5,0,0,330911,7.8292,1,1,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",47.0,1,0,363272,7.0,0,0,1
894,2,"Myles, Mr. Thomas Francis",62.0,0,0,240276,9.6875,1,1,0
895,3,"Wirz, Mr. Albert",27.0,0,0,315154,8.6625,1,0,1
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.0,1,1,3101298,12.2875,0,0,1


### Model Building

- Recall the purpose of this Titanic Prediction Project
- What are you trying to predict? 
    - The 'Survived' value (target variable)
    
- What features/attributes are you going to use to help you make the prediction? 
    - Pclass, SibSp, Parch, ...

In [27]:
# use 'Pclass','Age','SibSp','Parch','Fare','Sex_male','Embarked_Q','Embarked_S' as features
#** No need to fill in any command in this cell, but you need to execute this command. 

features = ['Pclass','Age','SibSp','Parch','Fare','Sex_male','Embarked_Q','Embarked_S']

In [28]:
# assign the target variable 'Survived'

target = ['Survived']
target

['Survived']

In [29]:
# Import Decision Tree Classifier from sklearn
from sklearn.tree import DecisionTreeClassifier


In [30]:
# In model_5, we try 'entropy' as our criterion. Accuracy score = 0.77033
model_5 = DecisionTreeClassifier(criterion='entropy', splitter='best',
                               min_samples_split=15, min_samples_leaf=5)
# model_6 = DecisionTreeClassifier(criterion='entropy', splitter='best',
#                                min_samples_split=15, min_samples_leaf=1)  # Accuracy score = 0.76315
# model_7 = DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=10,
#                                min_samples_split=20, min_samples_leaf=1)  # Accuracy score = 0.76076

Model 5 gives the highest score. Comparing all the results I got, I noticed that using entropy gave better accuracy than using Gini impurity. It is also important to specify the minimum number of samples a node must contain before it is allowed to split. If min_samples_split is too low, accuracy gets worse, likely because the tree overfits the training data.
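
The scores above come from Kaggle submissions. As a hedged sketch of a local alternative, candidate settings can also be compared with cross-validation on the training data. The example below uses synthetic data from `make_classification` as a stand-in (an assumption; substitute `X_train` and `y_train`), and the candidate names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; substitute X_train and y_train)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'entropy, split=15, leaf=5': dict(criterion='entropy', min_samples_split=15, min_samples_leaf=5),
    'entropy, split=15, leaf=1': dict(criterion='entropy', min_samples_split=15, min_samples_leaf=1),
    'gini, split=15, leaf=5': dict(criterion='gini', min_samples_split=15, min_samples_leaf=5),
}

cv_scores = {}
for name, params in candidates.items():
    model = DecisionTreeClassifier(random_state=0, **params)
    # Mean 5-fold cross-validated accuracy for this setting
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()

for name, score in cv_scores.items():
    print(f'{name}: {score:.3f}')
```

Cross-validated accuracy will not match the Kaggle leaderboard score exactly, but it allows settings to be compared without spending submissions.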

In [31]:
# assign values of independent variables and target variable of training set to X_train and y_train respectively.
#** No need to fill in any command in this cell, but you need to execute this command. 

X_train = train[features]
y_train = train[target]

In [32]:
# train model_5 using training dataset
model_5.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', min_samples_leaf=5,
                       min_samples_split=15)

In [33]:
# import libraries for visualization
#** No need to fill in any command in this cell, but you need to execute this command. 
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
from sklearn import tree
import graphviz

In [34]:
# use export_graphviz to visualize the tree
#** No need to fill in any command in this cell, but you need to execute this command. 
dot_data = tree.export_graphviz(model_5, out_file=None, 
                      feature_names=features,  
                      class_names=['Did not survive', 'Survived'],
                      filled = True, rounded=True,  
                      special_characters=True)

graph = graphviz.Source(dot_data)  
graph.render("titanic_tree_5") 

'titanic_tree_5.pdf'
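
If Graphviz is not installed, scikit-learn's `export_text` offers a plain-text view of the same split rules. The sketch below fits a small tree on synthetic data with hypothetical feature names; for the Titanic model one would pass `model_5` and `features` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
names = ['f0', 'f1', 'f2', 'f3']

toy_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0).fit(X, y)

# Plain-text rendering of the split rules, one line per node
rules = export_text(toy_tree, feature_names=names)
print(rules)
```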

### Model Testing

- We just built a tree (model_5) using the parameter settings chosen above. What will it predict on the test dataset?

In [35]:
# assign values of independent variables of test set to X_test.
#** No need to fill in any command in this cell, but you need to execute this command. 

X_test = test[features]

In [36]:
# test model using test set

y_pred = model_5.predict(X_test)
y_pred

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [37]:
# create a DataFrame with two columns: 'PassengerId' and 'Survived'
#** No need to fill in any command in this cell, but you need to execute this command. 
df = pd.DataFrame({'PassengerId': X_test.index, 'Survived': y_pred})

# write the results to csv file. 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
#** No need to fill in any command in this cell, but you need to execute this command. 
df.to_csv('results_5.csv', index=False)

#### You are ready to upload 'results_5.csv' to Kaggle to join the competition!