# Decision Trees

#### Instructions:
- Write modular code with relevant docstrings and comments so that you can reuse the functions you implement in future assignments.
- All theory questions and observations must be written in a markdown cell of your Jupyter notebook. You can also add necessary images in `imgs/` and then include them in markdown. No other submission method for theory questions will be accepted.
- Start the assignment early, push your code regularly and enjoy learning!

### Question 1 Optimal DT from table
**[20 points]**\
We will use the dataset below to learn a decision tree which predicts if people pass machine
learning (Yes or No), based on their previous GPA (High, Medium, or Low) and whether or
not they studied. 

| GPA | Studied | Passed |
|:---:|:-------:|:------:|
|  L  |    F    |    F   |
|  L  |    T    |    T   |
|  M  |    F    |    F   |
|  M  |    T    |    T   |
|  H  |    F    |    T   |
|  H  |    T    |    T   |
    
For this problem, you can write your answers using $\log_2$, but it may be helpful to note that $\log_2 3 \approx 1.6$.

---
1. What is the entropy H(Passed)?
2. What is the entropy H(Passed | GPA)?
3. What is the entropy H(Passed | Studied)?
4. Draw the full decision tree that would be learned for this dataset. You do
not need to show any calculations.
---
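
The quantities above follow from the definitions $H(Y) = -\sum_y p(y)\log_2 p(y)$ and $H(Y \mid X) = \sum_x p(x)\,H(Y \mid X = x)$. As a minimal sketch for checking hand calculations against the table, they can be evaluated numerically. The helper names `entropy` and `conditional_entropy` are illustrative and not part of the assignment.

```python
from collections import Counter
from math import log2

# Rows of the table above: (GPA, Studied, Passed)
rows = [
    ("L", "F", "F"),
    ("L", "T", "T"),
    ("M", "F", "F"),
    ("M", "T", "T"),
    ("H", "F", "T"),
    ("H", "T", "T"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def conditional_entropy(rows, feature_idx, label_idx=2):
    """H(label | feature): entropy of each feature group, weighted by group size."""
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_idx], []).append(row[label_idx])
    return sum(len(g) / n * entropy(g) for g in groups.values())

h_passed = entropy([r[2] for r in rows])
h_given_gpa = conditional_entropy(rows, 0)
h_given_studied = conditional_entropy(rows, 1)
print(h_passed, h_given_gpa, h_given_studied)
```

The feature with the smaller conditional entropy gives the larger information gain, which is the criterion a greedy tree learner would use to pick the root split.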


### Question 2 DT loss functions
**[10 points]**
1. Explain Gini impurity and Entropy. 
2. What are the minimum and maximum values of both Gini impurity and Entropy?
3. Plot the Gini impurity and Entropy for $p\in[0,1]$.
4. Multiply Gini impurity by a factor of 2 and overlay it over entropy.
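
For a binary variable with $P(\text{class 1}) = p$, Gini impurity is $1 - p^2 - (1-p)^2 = 2p(1-p)$ and entropy is $-p\log_2 p - (1-p)\log_2(1-p)$. The following is a hedged sketch of how the two curves and the scaled overlay could be produced. The function names are illustrative, and `plot_impurities` assumes matplotlib is installed.

```python
import numpy as np

def gini(p):
    """Gini impurity of a binary variable with P(class 1) = p."""
    p = np.asarray(p, dtype=float)
    return 2.0 * p * (1.0 - p)

def binary_entropy(p):
    """Entropy in bits of a binary variable, with 0 * log2(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    # Replace zeros by 1 inside the log so that those terms contribute exactly 0
    log_p = np.log2(np.where(p > 0, p, 1.0))
    log_q = np.log2(np.where(q > 0, q, 1.0))
    return -(p * log_p + q * log_q)

def plot_impurities(n_points=201):
    """Plot Gini, 2 x Gini, and entropy over p in [0, 1] (requires matplotlib)."""
    import matplotlib.pyplot as plt
    p = np.linspace(0.0, 1.0, n_points)
    plt.plot(p, gini(p), label="Gini")
    plt.plot(p, 2 * gini(p), "--", label="2 x Gini")
    plt.plot(p, binary_entropy(p), label="Entropy")
    plt.xlabel("p")
    plt.ylabel("impurity")
    plt.legend()
    plt.show()
```

Calling `plot_impurities()` in a notebook cell draws the three curves. Both the doubled Gini curve and the entropy curve are 0 at $p = 0$ and $p = 1$ and reach 1 at $p = 0.5$.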

### Question 3 Training a Decision Tree  
**[40 points]**

You can download the spam dataset from the link given below. This dataset contains feature vectors and the labels of spam/non-spam emails. 
http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data

**NOTE: The last column in each row indicates whether the mail is spam or not spam**\
Although not required, in case you want to know what the individual columns in the feature vector mean, you can read the documentation given below.
http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.DOCUMENTATION

**Download the data and load it using the code given below**

In [318]:
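import pandas as pd
import numpy as np
import random
from sklearn.utils import shuffle

# Path to a local copy of spambase.data (downloaded from the link above)
download_url = "./spambase.data"
# The file has no header row: 57 feature columns followed by 1 label column
cols = np.arange(0, 58, 1)
df = pd.read_csv(download_url, names=cols)
pd.set_option("display.max_rows", None)
data = df.to_numpy()
num = len(data)            # number of samples
feat = len(data[0]) - 1    # number of features (label column excluded)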

In [319]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
X = data[:,0:feat]
label = data[:,-1]

You can try to normalize each column (feature) separately with either one of the following ideas. **Do not normalize labels**.
- Shift-and-scale normalization: subtract the minimum, then divide by the new maximum. Now all values are between 0 and 1.
- Zero mean, unit variance: subtract the mean, then divide by the appropriate value to get variance = 1.
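
The two options can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the handling of constant columns (leaving them unscaled rather than dividing by zero) is a choice made here, not something the assignment specifies.

```python
import numpy as np

def min_max_normalize(X):
    """Shift-and-scale each column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    shifted = X - X.min(axis=0)
    col_max = shifted.max(axis=0)
    # Assumption: constant columns (new maximum 0) stay at 0 instead of dividing by zero
    col_max[col_max == 0] = 1.0
    return shifted / col_max

def standardize(X):
    """Give each column zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    # Assumption: constant columns (std 0) are only centered, not scaled
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std

demo = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
scaled = min_max_normalize(demo)
standardized = standardize(demo)
print(scaled)
print(standardized)
```

Both functions operate column by column (`axis=0`), so only the feature matrix should be passed in; the label vector is left untouched.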

In [320]:
# Zero mean, unit variance normalization of every feature column
X = X - X.mean(axis=0)
X = X / X.std(axis=0)
no_of_features = 57

1. Split your data into a training set (80%) and a test set (20%).
2. **[BONUS]** Visualize the data using PCA. You can reduce the dimension of the data if you want. Bonus marks if this increases your accuracy.

*NOTE: If you are applying PCA or any other type of dimensionality reduction, do it before splitting the dataset*

In [321]:
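from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit PCA with 44 components on the standardized features
Y = X
principal = PCA(n_components=44)
principal.fit(Y)
Y = principal.transform(Y)
print(Y.shape)
# For each principal component, find the original feature with the largest absolute loading
n_pcs = principal.components_.shape[0]
most_important = [np.abs(principal.components_[i]).argmax() for i in range(n_pcs)]
ind = sorted(set(most_important))
print(ind)
# Note: the models below are trained on the first 10 standardized columns of X,
# not on the PCA projection Y or on the feature indices in ind
j = np.arange(0, 10, 1)
X = X[:, j]
no_of_features = X.shape[1]
print(X.shape)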


(4601, 44)
[1, 2, 3, 4, 6, 7, 9, 11, 12, 16, 17, 18, 19, 20, 21, 23, 31, 32, 37, 38, 40, 41, 42, 45, 46, 47, 51, 53, 55, 56]
(4601, 10)


In [322]:
n_pcs= principal.components_.shape[0]
print(principal.components_.shape)
print(n_pcs)

(44, 57)
44


In [323]:
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(X, label, test_size=0.2, random_state=42) 

In [324]:
print(X_train.shape)

(3680, 10)


You need to perform K-fold validation on this and report the average training error over all k validations.
- For this, you need to split the training data into k splits.
- For each split, train a decision tree model and report the training, validation, and test scores.
- Report the scores in tabular form for each validation.

In [325]:
K = 10
# Split the training data into K folds; stacking them into one array
# relies on all folds having equal size (3680 rows -> 10 folds of 368)
x_data = np.array(np.array_split(X_train, K))
y_data = np.array(np.array_split(y_train, K))
tr_ac = []
test_ac = []
val_ac = []
for i in range(K):
    # Indices of every fold except the held-out fold i
    ind = np.setxor1d(np.arange(0, K), i)
    x_trainset = x_data[ind].reshape((-1, no_of_features))
    y_trainset = y_data[ind].ravel()
    x_valset = x_data[i]
    y_valset = y_data[i]
    model = DecisionTreeClassifier()
    model = model.fit(x_trainset, y_trainset)
    y_predval = model.predict(x_valset)
    y_predtest = model.predict(X_test)
    # Note: training accuracy is measured on all of X_train, which includes
    # the held-out fold i that the model has not seen
    y_predtrain = model.predict(X_train)
    tr_ac.append(metrics.accuracy_score(y_train, y_predtrain))
    val_ac.append(metrics.accuracy_score(y_valset, y_predval))
    test_ac.append(metrics.accuracy_score(y_test, y_predtest))

In [326]:
accuracies = pd.DataFrame({"split_number":np.arange(0,10,1),"Training Accuracy":tr_ac, "Validation Accuracy":val_ac,"Testing Accuracy":test_ac})
print(accuracies)

   split_number  Training Accuracy  Validation Accuracy  Testing Accuracy
0             0           0.938587             0.820652          0.820847
1             1           0.938043             0.831522          0.818675
2             2           0.938587             0.793478          0.824104
3             3           0.938587             0.831522          0.818675
4             4           0.938315             0.823370          0.838219
5             5           0.939674             0.820652          0.830619
6             6           0.941848             0.845109          0.828447
7             7           0.938859             0.820652          0.838219
8             8           0.940761             0.839674          0.826276
9             9           0.941033             0.847826          0.813246
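
The splitting step can also be written without assuming that every fold has the same size. The following is a minimal sketch, not the notebook's method: `kfold_indices` is a hypothetical helper that yields index arrays, so that each model can be scored on exactly the rows it was trained on and on the held-out fold.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index arrays for k consecutive folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # Training rows are all folds except the held-out fold i
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Example with 23 samples and 5 folds (fold sizes differ by at most one)
splits = list(kfold_indices(n_samples=23, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Inside such a loop, a model would be fitted on `X_train[train_idx]` and scored on `X_train[train_idx]`, `X_train[val_idx]`, and `X_test`. Averaging each list of scores with `np.mean` then gives the averages the question asks for.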


### Question 4 Random Forest Algorithm
**[30 points]**

1. What are boosting, bagging, and stacking?
Which class do random forests belong to, and why? **[5 points]**

2. Implement the random forest algorithm using different decision trees. **[25 points]** 

In [327]:
def random_forest_algorithm(X_train,y_train,X_test,y_test):
    """Train decision trees on random row/feature subsets and print the majority-vote test accuracy."""
    tot_trees = 100
    # Use the actual number of columns in X_train: the global `feat` (57) no longer
    # matches X_train once it has been reduced to 10 columns, which caused an IndexError
    n_feats = X_train.shape[1]
    # At most 30 features per tree; with only 10 columns every tree sees all of them
    rand_feats = min(30, n_feats)
    rand_data = int(66*X_train.shape[0]/100)
    pred_trees = np.zeros((tot_trees,X_test.shape[0]))
    for i in range(tot_trees):
        # Random subsets of features and training rows, both sampled without replacement
        feat_ind = sorted(random.sample(range(n_feats), rand_feats))
        data_ind = sorted(random.sample(range(X_train.shape[0]), rand_data))
        data = X_train[data_ind][:,feat_ind]
        labels = y_train[data_ind]
        model = DecisionTreeClassifier()
        model = model.fit(data,labels)
        pred_trees[i] = model.predict(X_test[:,feat_ind])
    # Testing: majority vote, predicting 1 when at least half of the trees vote 1
    sums = pred_trees.sum(axis=0)
    pred = (sums >= tot_trees/2).astype(float)
    print("The test accuracy obtained with random forests is ",metrics.accuracy_score(y_test,pred))

In [328]:
random_forest_algorithm(X_train,y_train,X_test,y_test)
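
The function above draws 66% of the training rows without replacement for each tree. Classic bagging instead draws a bootstrap sample, that is, as many rows as the training set has, sampled with replacement. The following is a hedged sketch of that sampling step and of the vote; `bootstrap_indices` and `majority_vote` are hypothetical helper names, and sending ties to class 1 mirrors the `>=` comparison used above.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return rng.integers(0, n_samples, size=n_samples)

def majority_vote(votes):
    """Combine 0/1 predictions of shape (n_trees, n_samples); ties go to class 1."""
    votes = np.asarray(votes)
    return (votes.mean(axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(0)
idx = bootstrap_indices(8, rng)

# Three trees voting on four test samples
votes = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1]])
combined = majority_vote(votes)
print(idx, combined)
```

In a forest, each tree would be fitted on `X_train[idx]` with its own `idx`, and the per-tree test predictions would be stacked row-wise before calling `majority_vote`.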