# Decision Tree

## Project Description
In this project, you will see how decision trees work by implementing a decision tree in sklearn.

We'll start by loading the dataset and displaying some of its rows.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
in_file = 'TitanicSurvival.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## (TODO) Describe the features

Recall that these are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No, 1 = Yes)
- **Pclass**: Passenger class (socio-economic status, 1 = 1st Upper, 2 = 2nd Middle, 3 = 3rd Lower)
- **Name**: Full name of the passenger
- **Sex**: Sex of the passenger (male or female)
- **Age**: Age of the passenger
- **SibSp**: Count of siblings and spouses of the passenger
- **Parch**: Count of parents and children of the passenger
- **Ticket**: Numbering system for identification of the passenger's ticket
- **Fare**: Price paid in GBP (pounds sterling) by the passenger for the ticket
- **Cabin**: Cabin number occupied by the passenger 
- **Embarked**: Port from which the passenger embarked on the ship

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets.  
Run the code cell below to remove **Survived** as a feature of the dataset and store it in `outcomes`.

In [2]:
# Store the 'Survived' feature in a new variable
outcomes = full_data['Survived']

# Remove it from the dataset by keeping every other column
keep_col = ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
features_raw = full_data[keep_col]

# Show the new dataset with 'Survived' removed
display(features_raw.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the outcomes of survival) are now *paired*. That means any passenger `features_raw.loc[i]` has the survival outcome `outcomes[i]`.
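
The pairing described above can be checked on a small example. The following is a minimal sketch on a made-up three-row frame (the names `toy`, `toy_features`, and `toy_outcomes` are illustrative and not part of the notebook); it assumes the default integer index that `pd.read_csv` produces.

```python
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Sex': ['male', 'female', 'female'],
})

toy_outcomes = toy['Survived']               # prediction target
toy_features = toy.drop(columns='Survived')  # every other column

# Both objects keep the original row index, so row i lines up with outcome i
print(toy_features.index.equals(toy_outcomes.index))
print(toy_features.loc[1].to_dict(), '->', toy_outcomes[1])
```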

## (TODO) Preprocessing the data

Now, let's do some data preprocessing. First, we'll one-hot encode the features.

In [3]:
# One-Hot encoding for all our categorical variables
features = pd.get_dummies(features_raw)

And now we'll fill in any blanks with zeroes.

In [4]:
features = features.fillna(0)
display(features.head())

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,"Name_Abbing, Mr. Anthony","Name_Abbott, Mr. Rossmore Edward","Name_Abbott, Mrs. Stanton (Rosa Hunt)","Name_Abelson, Mr. Samuel",...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
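
`pd.get_dummies` creates one indicator column per distinct value of each text column, which is why the output above contains columns such as `Name_Abbing, Mr. Anthony`. The sketch below uses a made-up three-row frame to show how the column count grows, and what dropping an identifier-like column first would look like. Dropping `Name` is a hypothetical alternative for comparison, not what this notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina', 'Allen, Mr. William Henry'],
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, 26.0, None],   # one missing value
})

# Encoding every text column: one new column per distinct value
wide = pd.get_dummies(toy)

# Hypothetical alternative: drop identifier-like columns before encoding
narrow = pd.get_dummies(toy.drop(columns=['Name']))

print(wide.shape)    # 3 rows; Age + 3 Name columns + 2 Sex columns
print(narrow.shape)  # 3 rows; Age + 2 Sex columns

# fillna(0) replaces the missing Age with 0, as in the notebook
filled = narrow.fillna(0)
print(filled)
```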


## (TODO) Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets (80%-20%). Then we'll train the model on the training set.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size = 0.2, random_state = 42)
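
To see what the split call does, here is a small sketch on made-up arrays (the names `X`, `y`, `X_tr`, and `X_te` are illustrative). It shows the 80%-20% proportions and that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
y = np.array([0, 1] * 5)           # toy binary labels

# Same call shape as in the notebook: 20% held out, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed yields the same split
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```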

In [6]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# TODO: Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

## (TODO) Testing the model
Now let's see how our model does by calculating the accuracy over both the training and the testing sets.

In [42]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.8044692737430168
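
A training accuracy of 1.0 is typical of a tree grown without any limits: it keeps splitting until every training sample is classified correctly. The sketch below illustrates this on synthetic, deliberately noisy data (not the Titanic set) by comparing an unconstrained tree with one capped at `max_depth=3`. The dataset and the names `deep` and `shallow` are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: flip_y randomly reassigns the labels of a share of the rows
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Compare tree size and accuracy on both sets
for name, clf in [('unconstrained', deep), ('max_depth=3', shallow)]:
    print(name, 'depth:', clf.get_depth(), 'leaves:', clf.get_n_leaves(),
          'train:', clf.score(X_tr, y_tr), 'test:', clf.score(X_te, y_te))
```

The unconstrained tree memorizes the noisy labels, so its training score says little about how it will behave on unseen data.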


## (TODO)  Improving the model

A training accuracy of 1.0 against a testing accuracy of about 0.80 suggests that the model is overfitting.

So now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`

You can use your intuition, trial and error, or even better, feel free to use Grid Search!

**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook next.
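
Before reaching for Grid Search, the trial-and-error route can be sketched as a plain loop over one parameter. The example below is a minimal sketch on synthetic data (the dataset, the candidate depths, and the `scores` dictionary are assumptions for illustration); it scores each `max_depth` with 5-fold cross-validation so that the test set stays untouched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Mean cross-validated accuracy for each candidate depth
scores = {}
for depth in [2, 4, 6, 8]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    scores[depth] = cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()

best_depth = max(scores, key=scores.get)
print(scores)
print('best max_depth:', best_depth)
```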

In [23]:
# Importing metrics and Grid Search
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Choosing the parameters list (the same candidate values, 2 to 14 in steps of 2, for each parameter)
param_range = list(range(2, 15, 2))
parameters = {'max_depth': param_range, 'min_samples_leaf': param_range, 'min_samples_split': param_range}

# Grid Search
grid = GridSearchCV(model, parameters, cv = 5, scoring = 'accuracy')

# Fitting
grid_fit = grid.fit(X_train, y_train)

# Finding the best estimator
best_clf = grid_fit.best_estimator_
print('BestCLF', best_clf)
best_par = grid_fit.best_params_
print('BestPar', best_par)

# TODO: Train the model
best_clf.fit(X_train, y_train)

# TODO: Make predictions
y_train_pred = best_clf.predict(X_train)
y_test_pred = best_clf.predict(X_test)

# TODO: Calculate the accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

BestCLF DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=14,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
BestPar {'max_depth': 8, 'min_samples_leaf': 4, 'min_samples_split': 14}
The training accuracy is 0.8820224719101124
The test accuracy is 0.8547486033519553
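
The grid above has 7 candidate values for each of 3 parameters, so 343 combinations are each evaluated with 5-fold cross-validation. The sketch below uses a much smaller grid on synthetic data (both are assumptions for illustration) to show how the results can be inspected through `best_score_` and `cv_results_`. With the default `refit=True`, `GridSearchCV` refits the best combination on all the data passed to `fit`, so the fitted search object can predict directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=2)

# Small illustrative grid: 2 x 2 = 4 combinations
param_grid = {'max_depth': [2, 4], 'min_samples_leaf': [1, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=2), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)   # best combination and its mean CV accuracy
n_candidates = len(grid.cv_results_['params'])
print(n_candidates)                          # number of combinations tried

# The search object delegates predict() to the refitted best estimator
preds = grid.predict(X[:5])
```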


### Question
1) Describe one real-world application in industry where the model can be applied.  
2) What are the strengths of the model; when does it perform well?  
3) What are the weaknesses of the model; when does it perform poorly?  
4) What makes this model a good candidate for the problem, given what you know about the data?  
  
Please include references with your answer.  

#### Answer
1) Applications of Decision Tree Machine Learning Algorithm

- Decision trees are among the popular machine learning algorithms that find great use in finance for option pricing.
- Remote sensing is an application area for pattern recognition based on decision trees.
- Decision tree algorithms are used by banks to classify loan applicants by their probability of defaulting payments.
- Gerber Products, a popular baby product company, used a decision tree machine learning algorithm to decide whether it should continue using PVC (polyvinyl chloride) plastic in its products.
- Rush University Medical Centre has developed a tool named Guardian that uses a decision tree machine learning algorithm to identify at-risk patients and disease trends.
   
2) Advantages of Using Decision Tree Machine Learning Algorithms

- Decision trees are intuitive and easy to explain. Because the learned rules are self-explanatory, people from a non-technical background can also follow the reasoning behind a prediction.
- Data type is not a constraint: decision trees can handle both categorical and numerical variables.
- Decision trees make no assumption of linearity in the data, so they can be used when the features are non-linearly related to the outcome. They also make no assumptions about the structure of the classifier or the distribution of the data.
- They are useful for data exploration. Decision trees implicitly perform feature selection, which is very important in predictive analytics: when a tree is fit to a training dataset, the features used for the splits near the top of the tree are treated as the most important variables in that dataset.
- They are time-efficient with large data and save data preparation time, because they are relatively insensitive to missing values and outliers. Missing values need not stop the data from being split (in this notebook we still filled them with zeroes before training), and outliers have little effect because each split depends on which side of a threshold a value falls on, not on its exact magnitude.
- Training requires comparatively little effort.
- Decision trees are regarded as robust models that often give good results.
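
The implicit feature selection mentioned above can be observed through the fitted tree's `feature_importances_` attribute. The following is a minimal sketch on synthetic data in which, by construction, only the first of three features determines the label; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 determines the label

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1
importances = clf.feature_importances_
print(importances)
```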
   
3) Drawbacks of Using Decision Tree Machine Learning Algorithms  

- The more decisions a tree contains, the lower the accuracy of any expected outcome tends to be.
- A major drawback is that the outcomes may be based on expectations. When decisions are made in real time, the payoffs and resulting outcomes might not match what was expected or planned. Unrealistic expectations can produce unrealistic decision trees and, in turn, bad decisions, since it is not always possible to plan for every eventuality that can arise from a decision.
- Decision trees are a poor fit for continuous variables, where they can be unstable and produce classification plateaus. They deliver good results only when the information is precise and accurate, and even a slight change in the input data can cause large changes in the tree.
- Decision trees are easy to use compared with other decision-making models, but building large trees with many branches is complex and time-consuming.
- Decision tree algorithms consider only one attribute at a time, so they might not be well suited to the actual data in the decision space.
- Large decision trees with many branches are hard to comprehend and difficult to present.
- Cost can also be a factor: constructing a complex decision tree requires advanced knowledge of quantitative and statistical analysis.
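
The point that a tree tests only one attribute at a time becomes visible when the learned rules are printed. The sketch below fits a shallow tree to scikit-learn's bundled iris dataset (chosen only for illustration, it is not the Titanic data) and prints the rules with `export_text`; each condition compares a single feature with a threshold.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Text rendering of the tree: one feature and one threshold per condition
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```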
  
4) The decision tree algorithm is a good candidate for the Titanic survival problem:
- It can work with categorical attributes.
- Some decision tree implementations can handle missing values by looking at the data in other columns (in this notebook we filled the blanks with zeroes instead).
- It is efficient when the target function has discrete output values, such as survived or did not survive.
- Some of the features ('Sex', 'Embarked') are represented by attribute-value pairs.

References:  
https://www.dezyre.com/article/top-10-machine-learning-algorithms/202  
https://www.educba.com/decision-tree-algorithm/

## (TODO)  Describe your conclusions  

In this notebook we learned how to use the decision tree algorithm to classify data. The pandas library helped us read, understand, preprocess and clean the data, which came in CSV format. The data describes survival on the Titanic, so the model had to predict whether a given passenger survived based on our features. We then trained and tested a decision tree model using the scikit-learn library and improved its test accuracy with grid search. Finally, we looked more closely at the decision tree algorithm by answering these questions: where is it used, what are its advantages and disadvantages, and why is it suitable for our problem given what we know about the data?