#### Decision trees can be used for both classification and regression problems. In classification, the prediction is the mode (most frequent class) of the target values of the training samples in the leaf node, whereas in regression it is the mean of those values.
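
As a minimal sketch of that difference, the snippet below computes both kinds of leaf prediction by hand. The values in `leaf_labels` and `leaf_targets` are made up for illustration and do not come from the dataset used in this notebook.

```python
from collections import Counter
import numpy as np

# Hypothetical target values of the training samples that ended up in one leaf
leaf_labels = [1, 0, 1, 1, 0]            # classification: class labels
leaf_targets = [12.0, 15.0, 9.0, 14.0]   # regression: continuous targets

# Classification: predict the most frequent class in the leaf (the mode)
class_prediction = Counter(leaf_labels).most_common(1)[0][0]

# Regression: predict the average target in the leaf (the mean)
value_prediction = float(np.mean(leaf_targets))

print(class_prediction)  # 1
print(value_prediction)  # 12.5
```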

In [1]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
# reading the cleaned dataset (data exploration has already been done on it)

data = pd.read_csv("data_cleaned.csv")    # missing values, dummy variables, etc. have already been taken care of in the dataset

In [3]:
# checking dimensions

data.shape

(891, 25)

In [4]:
# printing first few rows

data.head()

Unnamed: 0,Survived,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,...,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,7.25,0,0,1,0,1,0,1,...,1,0,0,0,0,0,0,0,0,1
1,1,38.0,71.2833,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,1,0,0
2,1,26.0,7.925,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
3,1,35.0,53.1,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,1
4,0,35.0,8.05,0,0,1,0,1,1,0,...,1,0,0,0,0,0,0,0,0,1


## Predicting the 'Survived' target variable using a decision tree

In [6]:
# separating the independent and dependent variables

x = data.drop(['Survived'], axis=1)    # independent variables (features)
y = data['Survived']    # dependent (target) variable
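
One thing worth checking at this step: the `Unnamed: 0` column in the `data.head()` output looks like a row index that was saved into the CSV, and with the code above it stays in `x` as a feature. The sketch below shows one possible way to leave it out, using a tiny made-up frame (`toy`) rather than the real data. This is an assumption about the file, not something the notebook confirms, and the outputs shown in the rest of the notebook were produced with the column included.

```python
import pandas as pd

# Tiny stand-in frame that mimics the column layout; values are illustrative
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})

# Drop the target and the suspected index column from the features
toy_x = toy.drop(["Survived", "Unnamed: 0"], axis=1)
toy_y = toy["Survived"]

print(list(toy_x.columns))  # ['Age']
```

Alternatively, passing `index_col=0` to `pd.read_csv` would load that column as the index so that it never appears among the features.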

#### > Now we split the dataframe into train and test sets using the train_test_split function from the model_selection module

In [7]:
# import the required module for splitting

from sklearn.model_selection import train_test_split

#### The train_test_split function accepts the independent variables and the dependent variable, and splits each of them into two parts
> It returns the independent variables for train and test first, followed by the dependent variable for train and test

In [8]:
# split the dataset into train and test
# test_size is not set here, so the default applies (25% of the rows go to test)
# random_state : int or RandomState (seed for the pseudo-random number generator used for sampling, so the split is reproducible)
# stratify : keeps the proportions of ones and zeroes similar in train & test

train_x, test_x, train_y, test_y = train_test_split(x,y,random_state=101,stratify=y)    # (independent variable,dependent variable,random_state,stratify)

Now we check the proportion of 1's & 0's in both train and test

In [9]:
train_y.value_counts()/len(train_y)

0    0.616766
1    0.383234
Name: Survived, dtype: float64

In [10]:
test_y.value_counts()/len(test_y)

0    0.61435
1    0.38565
Name: Survived, dtype: float64

So here we can see that the proportions are nearly identical. Stratifying keeps the class balance of the test set close to that of the train set, so that a gap in performance between the two cannot be blamed on different class proportions.
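
To see what `stratify` contributes, here is a small sketch on synthetic labels (80 zeros and 20 ones, chosen only for illustration). The names `toy_x`, `toy_y`, and the `strat_*` / `plain_*` variables are hypothetical and are not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 zeros and 20 ones (illustrative only)
toy_y = np.array([0] * 80 + [1] * 20)
toy_x = np.arange(100).reshape(-1, 1)

# With stratify, both parts keep the 80/20 class ratio
_, _, strat_train_y, strat_test_y = train_test_split(
    toy_x, toy_y, random_state=101, stratify=toy_y
)

# Without stratify, the ratio in each part depends on the random draw
_, _, plain_train_y, plain_test_y = train_test_split(
    toy_x, toy_y, random_state=101
)

# The mean of a 0/1 array is the share of ones
print(strat_train_y.mean(), strat_test_y.mean())
print(plain_train_y.mean(), plain_test_y.mean())
```

In the stratified split the share of ones is the same in both parts, while in the plain split it can drift, and the smaller the dataset the larger that drift tends to be.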

### Now that we have the train and test sets, let's go ahead and fit a decision tree to this dataset
> We'll train our model on train_x & train_y, and predict on test_x
> Later we'll check our performance against test_y, which holds the actual values

#### To use a decision tree, we first import DecisionTreeClassifier from the tree module of scikit-learn

In [12]:
# import decision tree classifier package

from sklearn.tree import DecisionTreeClassifier

In [13]:
# create a DecisionTreeClassifier object with the default parameters

clf = DecisionTreeClassifier()    # to see the default parameters, press Shift+Tab

In [14]:
# fitting our model

clf.fit(train_x,train_y)    # (independent,dependent)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

The output shows the parameters with which the model was fitted; since we passed none, they are all defaults
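
Depending on the scikit-learn version, the fitted estimator may print only the parameters that differ from their defaults. As a small sketch, `get_params()` lists them explicitly. The name `demo_clf` is hypothetical and is used here so that the notebook's `clf` is left untouched.

```python
from sklearn.tree import DecisionTreeClassifier

# A fresh, unfitted classifier with default settings
demo_clf = DecisionTreeClassifier()
params = demo_clf.get_params()

# A few of the defaults that control how far the tree grows
for name in ("criterion", "max_depth", "min_samples_leaf", "min_samples_split"):
    print(name, "=", params[name])
```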

In [15]:
# checking the accuracy of our model for the train dataset

clf.score(train_x,train_y)

0.9880239520958084

So we can see that our model is about 99% accurate on the training dataset that we have used

In [17]:
# checking the accuracy of our model on the test file as well

clf.score(test_x,test_y)    # takes in the independent variable, calculates some predictions and matches those predictions with the actual values

0.7668161434977578

Here we can see that the accuracy on our test dataset is about 77%, well below the training accuracy, which suggests that the tree has overfitted the training data.
> So, to mitigate that, we can change parameters such as max_depth and min_samples_leaf (these are the two important parameters we should play with). They are passed to DecisionTreeClassifier() when the object is created, not to the .fit() function
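
A minimal sketch of that kind of experiment follows. It uses synthetic data from `make_classification` rather than the notebook's dataset, and the particular values tried for `max_depth` and `min_samples_leaf` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the dataset used in this notebook
demo_x, demo_y = make_classification(
    n_samples=800, n_features=10, n_informative=4, flip_y=0.1, random_state=101
)
tr_x, te_x, tr_y, te_y = train_test_split(
    demo_x, demo_y, random_state=101, stratify=demo_y
)

settings = [
    {},                                        # defaults: no limit on tree growth
    {"max_depth": 3},                          # shallow tree
    {"max_depth": 5, "min_samples_leaf": 10},  # moderate depth, larger leaves
]

results = []
for kwargs in settings:
    # The limits are set when the classifier is created, not in .fit()
    tree = DecisionTreeClassifier(random_state=101, **kwargs)
    tree.fit(tr_x, tr_y)
    results.append((kwargs, tree.score(tr_x, tr_y), tree.score(te_x, te_y)))

for kwargs, train_acc, test_acc in results:
    print(kwargs, round(train_acc, 3), round(test_acc, 3))
```

Comparing the train and test columns for each setting shows how restricting the tree trades training accuracy for a smaller gap between the two.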

#### Also, we can get the predictions for the dataset

In [18]:
# predictions for train dataset

clf.predict(train_x)    # (independent variable)

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,

In [19]:
# predictions for test dataset

clf.predict(test_x)

array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 0], dtype=int64)
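
These predictions are what `.score()` compares with the actual values. The sketch below, again on synthetic data with hypothetical `demo_*` names, shows that the accuracy computed by hand from `.predict()` matches both `accuracy_score` and `.score()`, and adds a confusion matrix to show which kinds of errors are made.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the dataset used in this notebook
demo_x, demo_y = make_classification(n_samples=400, n_features=8, random_state=101)
tr_x, te_x, tr_y, te_y = train_test_split(
    demo_x, demo_y, random_state=101, stratify=demo_y
)

demo_clf = DecisionTreeClassifier(random_state=101).fit(tr_x, tr_y)
pred_y = demo_clf.predict(te_x)

# Accuracy is the share of predictions that equal the actual values
manual_acc = np.mean(pred_y == te_y)
metric_acc = accuracy_score(te_y, pred_y)
score_acc = demo_clf.score(te_x, te_y)
print(manual_acc, metric_acc, score_acc)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(te_y, pred_y)
print(cm)
```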