# SUSS AI-IG Code Along Workshop
# Intermediate Python Workshop: Decision Trees

# **Decision Trees**

## Classification and Regression Trees

### Definition
Classification and Regression Trees (CART) are a set of *supervised* learning models used for problems involving **classification** and **regression**. 

### Python Library: `sklearn.tree`

- Classification: `DecisionTreeClassifier`

- Regression: `DecisionTreeRegressor`
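
Both estimators follow the usual scikit-learn `fit` / `predict` pattern. The sketch below is only an illustration on made-up toy data (the values and the names `X_toy`, `clf`, `reg` are assumptions, not part of the workshop datasets): the classifier returns a class label, while the regressor returns a number.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data (made up for illustration): one feature, four samples.
X_toy = [[1.0], [2.0], [8.0], [9.0]]

# Classification: the targets are discrete class labels.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_toy, ["small", "small", "large", "large"])
label = clf.predict([[1.5]])[0]

# Regression: the targets are continuous values.
reg = DecisionTreeRegressor(random_state=0)
reg.fit(X_toy, [10.0, 12.0, 30.0, 34.0])
value = reg.predict([[8.8]])[0]

print(label, value)
```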

## Decision Tree Classifier

#### Exercise 1: Training your first Classification Tree
Dataset: Wisconsin Breast Cancer Dataset


> https://www.kaggle.com/uciml/breast-cancer-wisconsin-data


We'll predict whether a tumour is **malignant** or **benign** based on 2 features:
1. mean radius of the tumour
2. mean number of concave points

In [1]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
""" Data has already been cleaned """
# import pandas to read dataset
import pandas as pd

""" Read the dataset """
breastCancer = pd.read_csv('gdrive/My Drive/SUSS/AIIG-Workshop/breastCancer.csv')

""" Return overview of the dataset """
breastCancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

Some Things to Note: 

1. **Data Exploration** and **Transformation** are always performed before any modelling is carried out. 

2. The dataset is split into **training** and **testing sets** (and sometimes a validation set, depending on the analytics problem).


> - this ensures that the data used to train the model is not reused to test it
> - testing on unseen data (i.e., the testing set) shows how well the model generalises and exposes **overfitting**
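
To see why the held-out set matters, here is a minimal sketch comparing training and test accuracy for a tree with no depth limit. It assumes scikit-learn's bundled copy of the Wisconsin dataset (`load_breast_cancer`) as a stand-in for the workshop CSV, so the column names differ from the CSV's. The name `deep_tree` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumption: scikit-learn's bundled copy of the dataset stands in for the CSV.
data = load_breast_cancer(as_frame=True)
X = data.data[["mean radius", "mean concave points"]]
y = data.target  # in this copy: 0 = malignant, 1 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# No depth limit: the tree is free to memorise the training set.
deep_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_acc = accuracy_score(y_train, deep_tree.predict(X_train))
test_acc = accuracy_score(y_test, deep_tree.predict(X_test))
print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy:     {test_acc:.2f}")
```

A gap between the two numbers suggests the tree has learned details of the training data that do not carry over to new samples. Scoring only on the training set would hide that gap.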


In [None]:
"""
Some steps done beforehand:
    1. Import the relevant Python libraries
    2. Split the data into train-test sets
"""

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

""" split data into train-test sets, 80-20 split """
# fix random_state so the split is reproducible; stratify=y keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
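
The split above expects a feature matrix `X` and a target vector `y` to exist already. One way they might be built is sketched below, using the column names from the `info()` output. Two things here are assumptions: the DataFrame is rebuilt from scikit-learn's bundled copy of the dataset so the sketch runs without the CSV, and the 1/0 encoding of `diagnosis` is a common choice rather than something the workshop specifies.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in for breastCancer.csv (assumption): rebuild the relevant columns
# from scikit-learn's bundled copy, using the CSV's column names.
raw = load_breast_cancer(as_frame=True)
breastCancer = pd.DataFrame({
    "diagnosis": raw.target.map({0: "M", 1: "B"}),
    "radius_mean": raw.data["mean radius"],
    "concave points_mean": raw.data["mean concave points"],
})

# Feature matrix: the two columns named in the exercise.
X = breastCancer[["radius_mean", "concave points_mean"]]

# Target vector: encode malignant as 1 and benign as 0 (assumed encoding).
y = breastCancer["diagnosis"].map({"M": 1, "B": 0})

print(X.shape)
print(y.value_counts())
```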

In [None]:
""" 1. Import DecisionTreeClassifier from sklearn.tree """
from sklearn.tree import _____

""" 2. Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6 """
dt = DecisionTreeClassifier(max_depth=__)

""" 3. Fit Decision Tree to the Training set """
dt.fit(X_train, _____)

""" 4. Predict the test set labels """
y_pred = dt.predict(_____)
print(y_pred[0:5])  # returns the first 5 results

#### Exercise 2: Evaluate the Classification Tree
It's time to evaluate the Classification Tree's performance on the test set.

We will do so using the **accuracy** metric which corresponds to the fraction of **correct predictions** made on the test set. 
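
As a small worked example of that definition, the sketch below computes accuracy by hand on made-up labels and compares it with `accuracy_score`. The arrays `y_true` and `y_hat` are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration (1 = malignant, 0 = benign).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy by hand: correct predictions divided by total predictions.
manual_acc = (y_true == y_hat).sum() / len(y_true)

# The same quantity from scikit-learn.
sk_acc = accuracy_score(y_true, y_hat)

print(manual_acc, sk_acc)
```

Six of the eight predictions match, so both values come out as 0.75.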

In [None]:
""" 1. Import accuracy_score from sklearn.metrics """
from _____._____ import _____

""" 2. Predict the test set labels """
_____ = dt._____(_____)

""" 3. Compute the accuracy on the test set """
acc = _____(y_test, y_pred)
# print("Test set accuracy: {:.2f}".format(acc))
print(f"Test set accuracy: {acc:.2f}")  # in decimals (2 d.p.)
print(f"Test set accuracy: {acc * 100:.2f}%")   # in percentage (2 d.p.)

## Decision Tree Regressor

#### Exercise 3: Training your first Regression Tree
Dataset: Automotive (Miles per gallon) performances of cars


> https://www.kaggle.com/uciml/autompg-dataset


We'll predict a car's fuel efficiency in miles per gallon (mpg) given 6 different features:
1. Cylinders (multi-valued discrete)
2. Displacement (continuous)
3. Horsepower (continuous)
4. Weight (continuous)
5. Acceleration (continuous)
6. Origin (multi-valued discrete)


*will upload the steps for data cleaning tomorrow*

In [None]:
""" Read the dataset """
automotive = pd.read_csv('gdrive/My Drive/SUSS/AIIG-Workshop/automotive.csv')

""" Return overview of the dataset """
automotive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB


In [None]:
# import relevant sklearn libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
# mean squared error is the regression metric used below
from sklearn.metrics import mean_squared_error as MSE

In [None]:
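# features: every column except the target ('mpg') and the free-text 'car name'
# note: this keeps 'model year' as a feature alongside the 6 listed above
X = automotive.drop(columns=['mpg', 'car name'])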

In [None]:
y = automotive['mpg']

In [None]:
# split dataset into train-test (80-20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

In [None]:
# instantiate DecisionTreeRegressor (dtr) with a max depth of 8
# min_samples_leaf=0.13 is a fraction: each leaf must hold at least 13% of the training samples
dtr = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13)
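
When `min_samples_leaf` is a float, scikit-learn reads it as a fraction of the training rows and rounds up. The sketch below checks this on synthetic data. The data, the 318-row size (an 80% split of 398 cars) and the names `X_demo`, `y_demo` and `tree` are assumptions for illustration, not the workshop's automotive data.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): 318 rows, the size of an 80% split of 398 cars.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(318, 3))
y_demo = X_demo[:, 0] * 2 + rng.normal(0, 1, size=318)

tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=0)
tree.fit(X_demo, y_demo)

# A float is read as a fraction: ceil(0.13 * 318) = 42 samples per leaf.
min_leaf = math.ceil(0.13 * len(X_demo))

# Leaves are the nodes without children (children_left == -1).
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print(min_leaf, leaf_sizes.min(), len(leaf_sizes), tree.get_depth())
```

With 318 rows and at least 42 samples per leaf, the tree can have at most 7 leaves, so in this setting the leaf-size limit stops growth well before `max_depth=8` is reached.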

In [None]:
# fit the regression tree to the training set
dtr.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13)

In [None]:
""" Compute y_pred """
y_pred = dtr.predict(X_test)

""" Compute MSE of dtr """
mse_dtr = MSE(y_test, y_pred)

""" Compute RMSE of dtr """
rmse_dtr = mse_dtr**(1/2)

""" Print the value of rmse_dtr """
print(f"Test set RMSE of dtr: {rmse_dtr:.2f}")

Test set RMSE of dtr: 4.24
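
To make the two metrics concrete, the sketch below works through MSE and RMSE by hand on four made-up mpg values (illustrative numbers, not drawn from the dataset) and checks the MSE against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up mpg values for illustration.
y_actual = np.array([18.0, 25.0, 31.0, 22.0])
y_predicted = np.array([20.0, 24.0, 28.0, 22.0])

# MSE: the mean of the squared errors.
mse_manual = np.mean((y_actual - y_predicted) ** 2)

# RMSE: the square root of the MSE, back in the units of the target (mpg).
rmse_manual = np.sqrt(mse_manual)

# scikit-learn's MSE should agree with the manual calculation.
mse_sklearn = mean_squared_error(y_actual, y_predicted)

print(f"MSE:  {mse_manual:.2f}")
print(f"RMSE: {rmse_manual:.2f}")
```

The squared errors are 4, 1, 9 and 0, so the MSE is 14 / 4 = 3.5 and the RMSE is about 1.87 mpg. Read the same way, the 4.24 reported above means the tree's predictions are off by roughly 4 mpg on a typical test car.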
