## Prepare Python environment


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state=5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the Statlog (Heart) Dataset (1 point)

---


We will use the heart dataset from the UCI Machine Learning Repository. Details of this dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/statlog+(heart)).
The dataset contains the following features with their corresponding feature types:
1. age in years (real)
2. sex (binary; 1=male/0=female)
3. cp: chest pain type (categorical)
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital) (real)
5. chol: serum cholesterol in mg/dl (real)
6. fbs: (fasting blood sugar > 120 mg/dl) (binary; 1=true/0=false)
7. restecg: resting electrocardiographic results (categorical)
8. thalach: maximum heart rate achieved (real)
9. exang: exercise induced angina (1 = yes; 0 = no) (binary)
10. oldpeak: ST depression induced by exercise relative to rest (real)
11. slope: the slope of the peak exercise ST segment (ordinal)
12. ca: number of major vessels colored by fluoroscopy (real)
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect. (categorical)

The objective is to determine whether a person has heart disease or not based on these features.

Note: We will use a subset of the above features because the [scikit-learn implementation of Decision Trees does not support categorical variables](https://scikit-learn.org/stable/modules/tree.html#tree). 

### Loading the dataset

In [None]:
# Download and load the dataset
import os
if not os.path.exists('heart.csv'): 
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364/main/dataset/heart.csv 
df = pd.read_csv('heart.csv')

# Keep the real-valued features and the target column (the last column)
ind_non_categorical_features=np.array([0,3,4,7,9,11,-1])
non_categorical_features=df.columns[ind_non_categorical_features]

df=df[non_categorical_features]

# Display the first five instances in the dataset
df.head()

### Check the data type for each column

In [None]:
df.info()

#### There are a total of 303 entries in this dataset. After the column selection above, the first 6 columns are features and the last column indicates whether the person has heart disease or not.

#### Look at some statistics of the data using the `describe` function in pandas.

In [None]:
df.describe()

1. Count tells us the number of non-missing values in a feature.

2. Mean tells us the mean value of that feature.

3. Std tells us the standard deviation of that feature.

4. Min tells us the minimum value of that feature.

5. 25%, 50%, and 75% are the 25th, 50th (median), and 75th percentiles (the quartiles) of each feature.

6. Max tells us the maximum value of that feature.
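
As a small illustration of the summary rows listed above, the sketch below calls `describe` on a hypothetical toy column (not the heart dataset) that contains one missing value. The names `toy` and `summary` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the heart dataset); one value is missing on purpose.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, np.nan]})
summary = toy.describe()

print(summary)
# count ignores the missing entry, so it reports 4 values rather than 5
print(summary.loc["count", "x"])
# the 50% row is the median of the non-missing values
print(summary.loc["50%", "x"])
```

Note that pandas reports the sample standard deviation (`ddof=1`) in the `std` row.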

#### Look at the distribution of some features across the population. See [here](https://seaborn.pydata.org/generated/seaborn.distplot.html) for details. These have been done for you.

In [None]:
sns.histplot(df['thalach'],bins=30,color='red',stat="density",kde=True)

In [None]:
sns.histplot(df['chol'],bins=30,color='green',stat='density',kde=True)

In [None]:
sns.histplot(df['trestbps'],bins=30,color='blue',stat='density',kde=True)

#### Plot the number of people with and without heart disease at each age. This has been done for you.

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='age',data = df, hue = 'target',palette='coolwarm_r')
plt.show()

#### Extract target and descriptive features (0.5 points)

In [None]:
# Store all the features from the data in X
X= # TODO
# Store all the labels in y
y= # TODO

In [None]:
# Convert data to numpy array
X = # TODO
y = # TODO

#### Create training and test datasets (0.5 points)

Split the data into training and test sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get a consistent split across runs, set `random_state` to the value defined earlier. We use 80% of the data for training and 20% for testing.

In [None]:
X_train,X_test,y_train,y_test = # TODO
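
To show how `test_size` and `random_state` behave, here is a minimal sketch on hypothetical toy arrays (`X_toy`, `y_toy`), not on the heart data. The seed `0` is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays standing in for features and labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; a fixed random_state makes the split repeatable.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
```

Calling the function twice with the same seed returns identical partitions, which is what makes results comparable across runs.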

## Training Decision Tree-based Classifiers (9 points)


### Exercise 1: Learning a Decision Tree (5 points)

#### We will use the `sklearn` library to train a Decision Tree classifier. Review ch.4 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for more details. 

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
# tree visualization helper function
from io import StringIO  # standard-library replacement for six.StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

def get_tree_image(clf):
    """
    clf: fitted DecisionTreeClassifier

    Returns a bytes object representing the PNG image of the tree
    """
    dot_data = StringIO()
    # feature names are all columns except the target
    feature_names = list(df.drop('target', axis=1).columns)
    class_names = ["No heart disease", "Has heart disease"]
    export_graphviz(clf, out_file=dot_data,
                    filled=True, rounded=True,
                    special_characters=True,
                    feature_names=feature_names,
                    class_names=class_names)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return graph.create_png()

#### Exercise 1a: Fit and interpret a decision tree. (3 points)

#### Fit Decision trees using the Gini index and entropy-based impurity measure. 

#### Set the random_state to the value defined above. Keep all other parameters at their default values. 

#### Report the training and test set accuracies for each classifier.
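
For reference, the two impurity measures can be computed by hand. The sketch below uses hypothetical helper functions `gini` and `entropy` (written for this illustration, not part of `sklearn`) on two toy label vectors.

```python
import numpy as np

def gini(labels):
    """Gini index: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

mixed = np.array([0, 0, 1, 1])  # evenly mixed node: maximally impure for two classes
pure = np.array([1, 1, 1, 1])   # pure node: zero impurity under both measures

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one; a split is chosen to reduce the weighted impurity of the child nodes as much as possible.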

In [None]:
# TODO

#### Visualize the Decision Tree with the best test performance.

In [None]:
best_clf=# TODO
tree_image=get_tree_image(best_clf)
Image(tree_image)
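
If Graphviz or `pydotplus` is not available, `sklearn.tree.export_text` gives a plain-text rendering of a fitted tree. The sketch below is an illustration on synthetic data with hypothetical feature names (`f0` to `f3`); it does not use the heart dataset or the classifiers from this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and hypothetical feature names, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_names = ["f0", "f1", "f2", "f3"]

demo_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Each indented line is a threshold test; leaves show the predicted class.
rules = export_text(demo_clf, feature_names=demo_names)
print(rules)
```

Following the first branch at every level of the printed output, from the root down to a leaf, yields one if-then rule of the tree.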

#### Indicate the most informative descriptive feature (with the threshold) and briefly explain why this is the most informative (from an algorithmic viewpoint). 

TO DO

#### Briefly comment on the tree's depth and what factors may contribute to the shallowness/complexity of the tree. 


TO DO

#### Show how one can interpret the tree by specifying the rule from its leftmost branch.

TO DO

#### Exercise 1b: Prune a decision tree. (2 points)

#### Next, let's try pruning the tree to see if we can improve the classifier's generalization performance.

#### Pre-prune a decision tree by varying the `max_depth` among {None (no depth control), 1, 3, 5, 7}.

#### Set the criterion to entropy and the random_state to the value defined above. Keep all other parameters at their default values. 

#### Report the training and test set accuracies for each classifier.
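
To see what `max_depth` controls, the sketch below fits trees on synthetic data (an assumption made for illustration, not the heart dataset) and inspects the fitted depth with `get_depth()` and the number of leaves with `get_n_leaves()`. The seed and the depth values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

depths = {}
for max_depth in [None, 1, 3]:
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth, random_state=0)
    clf.fit(X_syn, y_syn)
    depths[max_depth] = clf.get_depth()
    # max_depth=None lets the tree grow until its stopping rules are met
    print(max_depth, clf.get_depth(), clf.get_n_leaves())
```

A capped tree never exceeds its `max_depth`, while the uncapped tree is free to grow deeper and use more leaves.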

In [None]:
# TODO 

#### Analyze the effect of increasing tree depth on training and test performance.

TO DO

### Exercise 2: Learning an Ensemble of Decision Trees (4 points)

#### We will use the `sklearn` library to implement bagging and boosting. Review ch.4 and read more on [bagging](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html). 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

#### Exercise 2a: Fit a Random Forest. (2 points)

#### Fit different Random Forest classifiers by varying the number of trees among {10, 100, 500,1000}. 

#### Set the `criterion` to entropy and set the random_state to the value defined above. Keep all other parameters at their default values. 

#### Report the test set accuracies for each classifier.

In [None]:
# TODO

#### Comment on the effect of increasing the number of trees on test performance. Compare the performance of the best performing Random Forest classifier against the Decision Tree Classifier trained with entropy (Ex. 1a) and explain any difference. 

TO DO

#### Exercise 2b: Fit a Gradient Boosted Decision Tree (GBDT). (2 points)

#### Fit different GBDTs by varying the number of boosting steps/trees added among {5,50,100,200}. 

#### Set `n_iter_no_change=100`, `validation_fraction=0.2`, and `random_state` to the value defined above. Keep all other parameters at their default values.

#### Report the training and test set accuracies for each classifier.
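
When `n_iter_no_change` is set, `GradientBoostingClassifier` sets aside `validation_fraction` of the training data and may stop before reaching `n_estimators`; the fitted attribute `n_estimators_` reports how many boosting stages were actually kept. The sketch below illustrates this on synthetic data with a deliberately small `n_iter_no_change=5`, which is an assumption for the demonstration and differs from the value used in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, for illustration only.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

gbdt = GradientBoostingClassifier(
    n_estimators=200,         # upper bound on the number of boosting stages
    n_iter_no_change=5,       # illustrative patience, not the exercise setting
    validation_fraction=0.2,  # share of the training data held out for early stopping
    random_state=0,
)
gbdt.fit(X_syn, y_syn)

# Number of stages actually fitted; it can be smaller than n_estimators.
print(gbdt.n_estimators_)
```

Because part of the training data is held out for validation, the trees themselves are fit on fewer samples than were passed to `fit`.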

In [None]:
# TODO

#### Comment on the effect of increasing the number of trees on test performance. Compare the performance of the best performing GBDT against that of the best performing Random Forest classifier (Ex. 2a) and Decision Tree classifier trained with entropy (Ex. 1a). 

TO DO