# INFO371 Homework: Decision Trees 

Your task for this assignment is to explore a dataset using decision trees, giving you some experience with trees and hyperparameter tuning. This lab asks you to experiment with classification trees and find the best combination of hyperparameters. Your goal is to get the highest accuracy you can! Keep in mind that some of these experiments will take a while to run -- make sure you start early and give yourself enough time to finish the assignment.

## Data
In this assignment, you will work with a dataset to predict whether someone is at high or low risk of having a heart attack, given some general health information about each person. The dataset has the following features:

* age : Age of the patient
* sex : Sex of the patient (0 = Male, 1 = Female)
* exng : exercise-induced angina (1 = yes; 0 = no)
* caa: number of major vessels (0-3)
* cp : Chest Pain type
     * Value 0: typical angina
     * Value 1: atypical angina
     * Value 2: non-anginal pain
     * Value 3: asymptomatic
* trtbps : resting blood pressure (in mm Hg)
* chol : cholesterol in mg/dl fetched via BMI sensor
* fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg : resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalachh : maximum heart rate achieved
* oldpeak : ST depression induced by exercise relative to rest
* slp : the slope of the peak exercise ST segment
    * 0 = upsloping
    * 1 = flat
    * 2 = downsloping
* thall : thalassemia
    * 0 = null
    * 1 = fixed defect
    * 2 = normal
    * 3 = reversible defect
* output : 0 = less chance of heart attack; 1 = more chance of heart attack

Note that the column "output" is your label (i.e. the thing you are trying to predict). 

---
For some more information on some of the health definitions: 
* [Angina](https://www.nhs.uk/conditions/angina/#:~:text=Angina%20is%20chest%20pain%20caused,of%20these%20more%20serious%20problems): chest pain due to reduced blood flow to the heart muscles. There are three types of angina: stable angina, unstable angina, and variant angina.

* ECG: short for electrocardiogram, it's a routine test usually done to check the heart's electrical activity.

* [ST depression](https://litfl.com/st-segment-ecg-library/): a type of ST-segment abnormality. The ST segment is the flat, isoelectric part of the ECG, and it represents the interval between ventricular depolarization and repolarization.

* Thalassemia: a genetic blood disorder characterized by a lower-than-normal level of hemoglobin.

## Dataset Exploration

1. Load the dataset and check that it loaded correctly (for example, by printing the first few rows).


2. Split your dataset into your feature set and label set. Then do a random 80/20 train/test split.


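In [3]:
# code goes here
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Load the dataset and inspect the first few rows
df = pd.read_csv('heart.csv.bz2')
print(df.head())

# Separate the features from the label, then make a random 80/20 train/test split
# (l_train and l_test are assumed names for the label splits)
features = df.drop(columns='output')
labels = df['output']
f_train, f_test, l_train, l_test = train_test_split(features, labels, test_size=0.2)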

   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  \
0   63    1   3     145   233    1        0       150     0      2.3    0   
1   37    1   2     130   250    0        1       187     0      3.5    0   
2   41    0   1     130   204    0        0       172     0      1.4    2   
3   56    1   1     120   236    0        1       178     0      0.8    2   
4   57    0   0     120   354    0        1       163     1      0.6    2   

   caa  thall  output  
0    0      1       1  
1    0      2       1  
2    0      2       1  
3    0      2       1  
4    0      2       1  


## Baseline Model Comparison 

3. Get a baseline accuracy using the naive model (i.e. a model where you assign the same label to all your testing data and that label is the one that appeared the most in your training data). 


4. Train a decision tree classifier using the default parameters except you should use the information gain/entropy metric for splitting. Report the training and testing accuracy. 


5. Plot the resulting decision tree and examine its structure. You can use the [function provided by the scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) to do so. Do the features the tree splits on, and the order in which it uses them, make sense? Does this tell you anything interesting about your features? Did any of these surprise you?
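
The comparison in steps 3 and 4 can be sketched as follows. This is a minimal illustration rather than a reference solution: it assumes a synthetic stand-in dataset from `make_classification` (13 features, to mirror the 13 feature columns shown above) instead of `heart.csv.bz2`, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary-classification data standing in for the heart dataset.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Naive baseline: always predict the most frequent label in the training set.
majority_label = np.bincount(y_train).argmax()
baseline_acc = np.mean(y_test == majority_label)

# Decision tree with default parameters, except the entropy splitting criterion.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)

print(f"baseline test accuracy: {baseline_acc:.3f}")
print(f"tree training accuracy: {train_acc:.3f}")
print(f"tree testing accuracy:  {test_acc:.3f}")
```

A fully grown tree usually fits its training data almost perfectly, so the gap between training and testing accuracy is the number to watch. Passing the fitted `clf` to `sklearn.tree.plot_tree` draws the tree for step 5.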


## Tuning the Model

6. Now let's examine average performance across different random splits. Run the model at least 10 times on different random splits of your data and report the average testing and training accuracy as well as the standard deviation. For an idea of how to do this, take a look at some of the lecture code we've done in class. Please don't use the cross-validation function here -- we want you to write this part yourself.

    Hint -- it's probably a good idea to write a function to do this, since you will reuse this code to tune the various parameters.



7. Now let's examine what happens as you increase the complexity of the model. Tune the model by trying a range of values for the maximum tree depth. Plot the __average__ training and testing accuracy for each value of tree depth (NOTE - this means you'll have to run your model multiple times to get average accuracy. Use your function from step 6 to get this score). Accuracy should be on the y-axis and tree depth should be on the x-axis.



8. Explain what overfitting and underfitting are. How do you know when you are overfitting? Using your plot from question 7, explain where you are underfitting and where you are overfitting your decision tree model. NOTE -- if you do not see overfitting, you may need to keep increasing the maximum tree depth.


9. Now let's examine what happens when you tune a different parameter. Tune the model now by examining min_samples_split. Plot the __average__ training and testing accuracy for each value of min_samples_split. Accuracy should be on the y-axis and min_samples_split should be on the x-axis. Then explain where on this plot overfitting is happening.

    Note -- we are tuning this parameter in isolation so you should set the max tree depth to whatever the default was. Don't use what you got in step 7.


10. Now let's examine what happens when you tune a third parameter. Tune the model now by examining min_samples_leaf. Plot the __average__ training and testing accuracy for each value of min_samples_leaf. Accuracy should be on the y-axis and min_samples_leaf should be on the x-axis. Then explain where on this plot overfitting is happening.

    Note -- we are tuning this parameter in isolation so you should set the max tree depth to whatever the default was. Don't use what you got in step 7 or step 9.


11. Now that we've examined the hyperparameters in isolation, let's perform a 3-D grid search across all three hyperparameters (see your lab notes on how to do this!). Use __average__ testing accuracy to choose your best parameters. What average testing accuracy do you get with all three best parameters? How does it compare to your completely untuned model (i.e. the one with all default parameters)?
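
One possible shape for the helper suggested in step 6, reused for the grid search in step 11, is sketched below. It is only a sketch under stated assumptions: the data is a synthetic stand-in from `make_classification`, the function name `average_accuracy` is hypothetical, and the grid values are arbitrary small examples rather than recommended ranges.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def average_accuracy(X, y, n_runs=10, **tree_params):
    """Fit an entropy tree on n_runs random 80/20 splits.

    Returns (train mean, train std, test mean, test std).
    """
    train_scores, test_scores = [], []
    for seed in range(n_runs):
        # A different seed per run gives a different random split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        clf = DecisionTreeClassifier(
            criterion="entropy", random_state=seed, **tree_params
        )
        clf.fit(X_tr, y_tr)
        train_scores.append(clf.score(X_tr, y_tr))
        test_scores.append(clf.score(X_te, y_te))
    return (np.mean(train_scores), np.std(train_scores),
            np.mean(test_scores), np.std(test_scores))


# Assumption: synthetic data standing in for the heart dataset.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Assumption: a deliberately small illustrative grid.
depths = [2, 4, 8]
splits = [2, 10, 20]
leaves = [1, 5, 10]

# Keep the combination with the highest average testing accuracy.
best_params, best_test = None, -1.0
for depth, split, leaf in product(depths, splits, leaves):
    _, _, test_mean, _ = average_accuracy(
        X, y, max_depth=depth, min_samples_split=split, min_samples_leaf=leaf
    )
    if test_mean > best_test:
        best_params, best_test = (depth, split, leaf), test_mean

print("best (max_depth, min_samples_split, min_samples_leaf):", best_params)
print(f"average testing accuracy: {best_test:.3f}")
```

Because the seeds are fixed, every parameter combination is scored on the same ten splits, which makes the comparison between combinations fairer. The same helper can feed the plots in steps 7, 9, and 10 by looping over a single parameter and collecting the training and testing means.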



In [None]:
#code goes here

## Compare Performance with Random Forests 

Now let's compare our performance with that of a random forest model.

12. Using the same train/test split as you used in Question 4, train a RandomForestClassifier model using default parameters and the "entropy" criterion. Report the testing and training accuracy. How does this model's accuracy compare to what you found in Question 4?


13. Now, using this random forest model, examine the importance of each of the features in the dataset. You don't have to write this yourself; see this [documentation](https://scikit-learn.org/stable/modules/feature_selection.html#tree-based-feature-selection) and this [example](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py) for how to do this in Python. How do these features compare to what you found in Question 5? Are they the same? Do any of these results surprise you?


14. Now let's examine average performance across different random splits. Run the model at least 10 times on different random splits of your data and report the average testing and training accuracy as well as the standard deviation. Again, please don't use the cross-validation function here -- we want you to write this part yourself. You may reuse your function from Question 6 if you would like.


15. Now let's examine what happens as you increase the complexity of the model by increasing the number of trees in the forest. Tune the model by examining a range of n_estimators. Plot the __average__ training and testing accuracy for each value of n_estimators (NOTE - this means you'll have to run your model multiple times to get average accuracy. Use your function from step 14 to get this score). Accuracy should be on the y-axis and n_estimators should be on the x-axis.


16. Using this plot explain what happens as you increase the number of trees in your forest. Are there trade-offs you have to consider?


17. Compare your results with Random Forest to the results you got with just one decision tree and write up your observations and analysis. Which model do you find is typically more accurate? Which model tended to overfit more? Are there any considerations you need to think about when choosing between forests and single trees?
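
For the random forest questions, a rough sketch of reading feature importances and sweeping `n_estimators` might look like this. It assumes synthetic stand-in data and hypothetical feature names, and for brevity it scores each forest on a single split, whereas the assignment asks for averages over several random splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic data and placeholder feature names, not the heart dataset.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Random forest with default parameters, except the entropy criterion.
forest = RandomForestClassifier(criterion="entropy", random_state=0)
forest.fit(X_train, y_train)
print(f"training accuracy: {forest.score(X_train, y_train):.3f}")
print(f"testing accuracy:  {forest.score(X_test, y_test):.3f}")

# Impurity-based importances, sorted from most to least important.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking[:5]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")

# Sweep the number of trees (single split only; average over splits in practice).
results = {}
for n_trees in [1, 10, 50, 100]:
    model = RandomForestClassifier(
        n_estimators=n_trees, criterion="entropy", random_state=0
    )
    model.fit(X_train, y_train)
    results[n_trees] = (model.score(X_train, y_train),
                        model.score(X_test, y_test))

for n_trees, (train_acc, test_acc) in results.items():
    print(f"n_estimators={n_trees}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Training time grows with the number of trees, which is one of the trade-offs to weigh in question 16.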

In [None]:
#code goes here 