# Prerequisite - Decision Tree Algorithm

#### To complete the exercises below, a basic understanding of the decision tree algorithm is required. The figure below provides a simple example of what a decision tree is.


<img src="src/simple_decision_tree.png" width="350" style="margin-left:auto; margin-right:auto"/>


#### In this section, we will use many decision trees at once (a.k.a. a random forest) to help us extract information from large datasets.
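#### As a rough illustration of the relationship between a single tree and a forest, the sketch below fits both on a made-up toy dataset (the names `toy_X` and `toy_y` and the rule "label is 1 when the first feature exceeds 0.5" are assumptions for this example only, not part of the exercises).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up for illustration): the label is 1 when feature_0 exceeds 0.5.
rng = np.random.default_rng(0)
toy_X = rng.random((200, 2))
toy_y = (toy_X[:, 0] > 0.5).astype(int)

# A single decision tree learns a sequence of threshold tests on the features.
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(toy_X, toy_y)
print(export_text(tree, feature_names=["feature_0", "feature_1"]))

# A random forest trains many such trees on bootstrap samples and combines their votes.
forest = RandomForestClassifier(n_estimators=25, max_depth=1, random_state=0)
forest.fit(toy_X, toy_y)
print(f"trees in the forest: {len(forest.estimators_)}")
print(f"type of each member: {type(forest.estimators_[0]).__name__}")
```

#### Each member of `forest.estimators_` is an ordinary decision tree, which is why the importance scores used later can be read as an average over many trees.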

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Data Description

#### Recall the line equation from middle school

# $$ y = m x + c $$

#### and its matrix counterpart

# $$ Y = X \beta + C $$

#### We are going to generate $X$ and $Y$ using secret $\beta$ values. However, some values of $\beta$ are set to zero, which means that not all the variables of $X$ are informative: a variable whose coefficient is zero has no effect on $Y$. In this dataset, the constant term $C$ is not used (it is set to zero).
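#### To make the role of the zero coefficients concrete, here is a minimal sketch on a tiny hypothetical example (5 samples, 4 variables, with only column 2 given a non-zero coefficient; the names `X`, `beta`, and `y` are local to this illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small example: 5 samples, 4 variables, only column 2 is informative.
X = rng.normal(size=(5, 4))
beta = np.array([0.0, 0.0, 3.0, 0.0])

# With the constant term set to zero, y is just the matrix-vector product.
y = X @ beta

# Because every other coefficient is zero, y depends on column 2 alone.
print(np.allclose(y, 3.0 * X[:, 2]))

# Overwriting an uninformative column leaves y unchanged.
X_changed = X.copy()
X_changed[:, 0] = rng.normal(size=5)
print(np.allclose(X_changed @ beta, y))
```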

In [2]:
synthetic_X, synthetic_y, synthetic_beta = make_regression(
    n_samples=317,
    n_features=256,
    n_informative=1,
    random_state=2022,
    coef=True,
)

print(f"X shape: {synthetic_X.shape}",
      f"beta shape: {synthetic_beta.shape}",
      f"y shape: {synthetic_y.shape}",
      f"n_informative: {(synthetic_beta != 0).sum()}",
      f"informative variables are at columns: {[i for i, b in enumerate(synthetic_beta) if b != 0]}",
      sep="\n",
)

X shape: (317, 256)
beta shape: (256,)
y shape: (317,)
n_informative: 1
informative variables are at columns: [41]


# Sanity check of the generated data
#### We need to verify that $X \beta = y$

In [3]:
np.allclose(np.dot(synthetic_X, synthetic_beta), synthetic_y)

True

# Exercise 1

### Identify the informative variables in X

In [4]:
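# Depth-1 trees (stumps) split on a single variable each, so only variables
# that were actually used for a split receive a non-zero importance.
model = RandomForestRegressor(n_estimators=100, max_depth=1, random_state=2022)
model.fit(synthetic_X, synthetic_y)
informative_variable_index = np.flatnonzero(model.feature_importances_)
informative_variable_index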

array([41])

In [5]:
# Keep only the selected column and its coefficient, then confirm they reproduce y.
synthetic_m = synthetic_beta[informative_variable_index]
synthetic_x = synthetic_X[:, informative_variable_index]
np.allclose(np.dot(synthetic_x, synthetic_m), synthetic_y)

True

#### Disclaimer: the method above needs modifications to work with multiple informative variables, and it requires cross-validation to ensure that the identified variables are in fact informative.
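#### One possible modification is sketched below, under stated assumptions: the data are hypothetical (20 variables, with columns 2, 7, and 11 driving `demo_y`), the trees are allowed to grow deeper than `max_depth=1`, and a variable is kept only if its importance exceeds an arbitrary threshold of 0.05 in every cross-validation fold. The threshold and the fold count are illustrative choices, not recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Hypothetical data: 20 variables, of which columns 2, 7 and 11 drive y.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(300, 20))
demo_beta = np.zeros(20)
demo_beta[[2, 7, 11]] = [4.0, -5.0, 6.0]
demo_y = demo_X @ demo_beta

importance_threshold = 0.05  # arbitrary cut-off chosen for this sketch
selection_counts = np.zeros(demo_X.shape[1], dtype=int)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, _ in folds.split(demo_X):
    # Deeper trees than max_depth=1, so that several variables can be split on.
    forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)
    forest.fit(demo_X[train_index], demo_y[train_index])
    # Count the variables whose importance passes the threshold in this fold.
    selection_counts += forest.feature_importances_ > importance_threshold

# Keep only the variables that pass the threshold in every fold.
stable_variables = np.flatnonzero(selection_counts == folds.get_n_splits())
print(stable_variables)
```

#### With deeper trees, uninformative variables usually pick up small non-zero importances, so testing for "non-zero" is no longer enough; a threshold combined with agreement across folds is one simple way to filter them out.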

# Data Description

#### This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD. Note that in the CSV file loaded in Exercise 2, the subject identifier and the label are stored in the "id" and "class" columns.

source: https://archive.ics.uci.edu/ml/datasets/parkinsons

# Exercise 2

### Identify the most informative features in this dataset

In [6]:
pd_dataset = pd.read_csv("../data/pd_speech_features.csv", header=1)
pd_dataset.head()

| | id | gender | PPE | DFA | RPDE | numPulses | numPeriodsPulses | meanPeriodPulses | stdDevPeriodPulses | locPctJitter | ... | tqwt_kurtosisValue_dec_28 | tqwt_kurtosisValue_dec_29 | tqwt_kurtosisValue_dec_30 | tqwt_kurtosisValue_dec_31 | tqwt_kurtosisValue_dec_32 | tqwt_kurtosisValue_dec_33 | tqwt_kurtosisValue_dec_34 | tqwt_kurtosisValue_dec_35 | tqwt_kurtosisValue_dec_36 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.85247 | 0.71826 | 0.57227 | 240 | 239 | 0.008064 | 8.7e-05 | 0.00218 | ... | 1.562 | 2.6445 | 3.8686 | 4.2105 | 5.1221 | 4.4625 | 2.6202 | 3.0004 | 18.9405 | 1 |
| 1 | 0 | 1 | 0.76686 | 0.69481 | 0.53966 | 234 | 233 | 0.008258 | 7.3e-05 | 0.00195 | ... | 1.5589 | 3.6107 | 23.5155 | 14.1962 | 11.0261 | 9.5082 | 6.5245 | 6.3431 | 45.178 | 1 |
| 2 | 0 | 1 | 0.85083 | 0.67604 | 0.58982 | 232 | 231 | 0.00834 | 6e-05 | 0.00176 | ... | 1.5643 | 2.3308 | 9.4959 | 10.7458 | 11.0177 | 4.8066 | 2.9199 | 3.1495 | 4.7666 | 1 |
| 3 | 1 | 0 | 0.41121 | 0.79672 | 0.59257 | 178 | 177 | 0.010858 | 0.000183 | 0.00419 | ... | 3.7805 | 3.5664 | 5.2558 | 14.0403 | 4.2235 | 4.6857 | 4.846 | 6.265 | 4.0603 | 1 |
| 4 | 1 | 0 | 0.3279 | 0.79782 | 0.53028 | 236 | 235 | 0.008162 | 0.002669 | 0.00535 | ... | 6.1727 | 5.8416 | 6.0805 | 5.7621 | 7.7817 | 11.6891 | 8.2103 | 5.0559 | 6.1164 | 1 |


In [7]:
pd_X = pd_dataset.drop(columns=["id", "class"])
pd_y = pd_dataset["class"]

In [8]:
model = RandomForestClassifier(criterion="entropy", max_depth=1, random_state=2022)
model.fit(pd_X, pd_y)

RandomForestClassifier(criterion='entropy', max_depth=1, random_state=2022)

In [9]:
(pd.DataFrame(
    {
        "feature": pd_X.columns, 
        "score": model.feature_importances_
    })
 .sort_values("score", ascending=False)
 .head(13)
)

| | feature | score |
|---|---|---|
| 440 | tqwt_TKEO_mean_dec_12 | 0.06 |
| 125 | std_delta_delta_log_energy | 0.06 |
| 441 | tqwt_TKEO_mean_dec_13 | 0.05 |
| 403 | tqwt_entropy_log_dec_11 | 0.05 |
| 111 | std_delta_log_energy | 0.04 |
| 132 | std_6th_delta_delta | 0.04 |
| 194 | app_entropy_shannon_5_coef | 0.03 |
| 121 | std_9th_delta | 0.03 |
| 368 | tqwt_entropy_shannon_dec_12 | 0.03 |
| 347 | tqwt_energy_dec_27 | 0.03 |


# Exercise 3

### Which features help us identify the gender of the subject?

In [10]:
gender_X = pd_dataset.drop(columns=["id", "gender", "class"])
gender_y = pd_dataset["gender"]

In [11]:
model = RandomForestClassifier(criterion="entropy", max_depth=1, random_state=2022)
model.fit(gender_X, gender_y)

RandomForestClassifier(criterion='entropy', max_depth=1, random_state=2022)

In [12]:
best_features = (pd.DataFrame(
    {
        "feature": gender_X.columns,
        "score": model.feature_importances_
    })
 .sort_values("score", ascending=False)
 .head(13)
)
best_features

| | feature | score |
|---|---|---|
| 285 | app_LT_entropy_shannon_6_coef | 0.05 |
| 190 | app_entropy_shannon_2_coef | 0.05 |
| 222 | app_TKEO_std_4_coef | 0.05 |
| 205 | app_entropy_log_7_coef | 0.04 |
| 203 | app_entropy_log_5_coef | 0.04 |
| 195 | app_entropy_shannon_7_coef | 0.04 |
| 295 | app_LT_entropy_log_6_coef | 0.03 |
| 286 | app_LT_entropy_shannon_7_coef | 0.03 |
| 223 | app_TKEO_std_5_coef | 0.03 |
| 189 | app_entropy_shannon_1_coef | 0.03 |


In [13]:
best_feature = best_features.iloc[0, 0]
best_feature

'app_LT_entropy_shannon_6_coef'

# Possible extra exercises:


#### 1. Implement an algorithm to identify multiple informative variables for regression.
#### 2. For regression, instead of a random forest, use other regression algorithms (e.g., LASSO).
#### 3. Use cross-validation to make the findings more robust.
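
#### As a possible starting point for the second extra exercise, the sketch below applies scikit-learn's `Lasso` to a synthetic regression problem with five informative variables. The penalty strength `alpha=1.0` is an arbitrary assumption made for this sketch; in practice it would need tuning (for example with `LassoCV`), and the variable names are local to this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Same shape as the synthetic dataset above, but with five informative variables.
lasso_X, lasso_y, lasso_beta = make_regression(
    n_samples=317,
    n_features=256,
    n_informative=5,
    random_state=2022,
    coef=True,
)

# The L1 penalty drives the coefficients of unhelpful variables to exactly zero.
lasso = Lasso(alpha=1.0)  # alpha is an arbitrary choice for this sketch
lasso.fit(lasso_X, lasso_y)

selected = np.flatnonzero(lasso.coef_)
true_informative = np.flatnonzero(lasso_beta)
print(f"selected by LASSO: {selected}")
print(f"truly informative: {true_informative}")
```

#### Comparing `selected` with `true_informative` shows how well the penalty recovers the informative columns; a variable with a very small true coefficient may be dropped when `alpha` is too large.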