## Supplement 6: Decision Trees and Random Forest

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from scipy.stats import mode


### 6.3 Programming Task: Song popularity prediction using Random Forest
The goal of this task is to train a random forest model that predicts song popularity, using the datasets already provided in task 4.3.
 

In [2]:
%ls

Supplement-6_2.ipynb  test-reg-tree.csv     train-reg-tree.csv
Supplement-6_3.ipynb  test-songs.csv        train-songs.csv


In [3]:
train_df = pd.read_csv('train-songs.csv')
test_df = pd.read_csv('test-songs.csv')

In [4]:
train_df.head()

Unnamed: 0,danceability,key,loudness,acousticness,instrumentalness,liveness,valence,tempo,popular
0,0.391,8,-9.532,0.478,6e-06,0.116,0.138,105.593,0.0
1,0.628,1,-13.834,0.156,0.0104,0.0836,0.761,102.974,0.0
2,0.613,3,-22.789,0.864,0.0,0.269,0.371,75.104,0.0
3,0.504,2,-5.931,0.414,0.0,0.0845,0.163,135.927,1.0
4,0.698,9,-3.84,0.101,0.0,0.107,0.931,124.042,1.0


In [5]:
test_df.head()

Unnamed: 0,danceability,key,loudness,acousticness,instrumentalness,liveness,valence,tempo,popular
0,0.652,9,-7.319,0.725,2e-06,0.189,0.354,131.955,1.0
1,0.5,11,-7.996,0.0024,0.0,0.133,0.515,77.383,0.0
2,0.422,10,-7.215,0.109,0.0,0.722,0.331,74.98,1.0
3,0.708,5,-5.426,0.0136,0.00221,0.118,0.734,122.006,1.0
4,0.657,9,-8.351,0.705,9e-06,0.084,0.381,141.735,1.0


In [6]:
# Read data
TARGET_COLUMN = 'popular'
#TODO

train_X = train_df.drop(columns=TARGET_COLUMN).values
train_y = train_df[TARGET_COLUMN].values
test_X = test_df.drop(columns=TARGET_COLUMN).values
test_y = test_df[TARGET_COLUMN].values

In [7]:
print(f'{train_X.shape=}, {train_y.shape=}\n{test_X.shape=}, {test_y.shape=}')

train_X.shape=(20000, 8), train_y.shape=(20000,)
test_X.shape=(2000, 8), test_y.shape=(2000,)


   i\. Implement a function that draws a bootstrap sample of size N from the train dataset, where N can be specified by the user.




In [8]:
def generate_bootstrap(train_X, train_y, N):
    # Sample N row indices uniformly at random, with replacement
    selected_indx = np.random.choice(len(train_y), size=N, replace=True)
    return train_X[selected_indx], train_y[selected_indx]
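
Because the indices are drawn with replacement, a bootstrap sample of size N taken from N rows contains duplicates and leaves some rows out. The following is a minimal sketch, on plain integer indices rather than the song data, of how many distinct rows such a sample covers; for large N this is expected to be close to 1 - 1/e ≈ 0.632. The seed and the names `unique_fraction` and `expected` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical setup: N row indices standing in for the training set.
N = 20000
rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability

# Draw N indices with replacement, as a bootstrap sample does.
sample_idx = rng.choice(N, size=N, replace=True)

# Fraction of distinct original rows that appear in the sample.
unique_fraction = np.unique(sample_idx).size / N

# Probability that a given row is picked at least once in N draws.
expected = 1 - (1 - 1 / N) ** N
print(f"{unique_fraction=:.3f}, {expected=:.3f}")
```

The rows that are left out of a given sample are what is usually called the out-of-bag set for that tree.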

   ii\. Complete the implementation of the random forest algorithm. For this task you may use the DecisionTreeClassifier from the scikit-learn library. The other parts of the random forest algorithm must be implemented using only Scipy/Numpy.

In [9]:
class RandomForest:
    def __init__(self, n_trees, max_samples, **tree_kwargs):
        # Weak classifiers: one unfitted decision tree per ensemble member
        self.trees = [tree.DecisionTreeClassifier(**tree_kwargs) for _ in range(n_trees)]
        # Size of the bootstrap sample drawn for each tree
        self.max_samples = max_samples

    def train(self, train_X, train_y):
        # Fit every tree on its own bootstrap sample
        for clf in self.trees:
            X, y = generate_bootstrap(train_X, train_y, self.max_samples)
            clf.fit(X, y)

    def predict(self, test_X):
        # Stack per-tree predictions into shape (n_trees, n_samples)
        predictions = np.array([clf.predict(test_X) for clf in self.trees])
        # Majority vote: most frequent class per sample across all trees
        return np.ravel(mode(predictions, axis=0).mode)
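
The majority vote in `predict` can be illustrated on a tiny hand-made example. The sketch below uses NumPy only and a hypothetical helper `majority_vote`; the prediction matrix is invented for illustration and is not model output. Ties are resolved in favour of the smallest label, which is also how `scipy.stats.mode` behaves.

```python
import numpy as np

# Hypothetical predictions: 3 trees (rows) voting on 4 samples (columns).
predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

def majority_vote(preds):
    """Return the most frequent label in each column of preds."""
    votes = []
    for column in preds.T:
        labels, counts = np.unique(column, return_counts=True)
        # argmax returns the first maximum, so ties go to the smallest label
        votes.append(labels[np.argmax(counts)])
    return np.array(votes)

print(majority_vote(predictions))  # [0 1 0 0]
```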

   iii\. Train the model on the dataset from train-songs.csv using the parameters given below.

| Parameter | Value |
|-----------|-------|
| Number of trees | 100 |
| Maximum features per tree | 2 |
| Bootstrap sample size | 20000 |
| Minimum node size | 1 |
| Maximum tree depth | 10 |


Note: In this task the bootstrap sample size equals the size of the training dataset.


In [10]:
# Note: Run this cell without any changes. The model will train if the implementation of subtask (ii) is correct.

random_forest_model = RandomForest(n_trees=100, max_samples=20000,max_depth=10, min_samples_leaf=1, max_features=2)

random_forest_model.train(train_X, train_y)

   iv\. Calculate the accuracy of the model on the test dataset and compare your results with the RandomForestClassifier from the scikit-learn library using the following parameters.

In [11]:
# TODO Run predict for test data and calculate accuracy
print(f'My RandomForestClassifier Accuracy: {(random_forest_model.predict(test_X) == test_y).sum() / len(test_y)}')

My RandomForestClassifier Accuracy: 0.8055


In [12]:
# TODO: Train and predict using scikit-learn library
sklearn_rf = RandomForestClassifier(n_estimators=100, max_samples=20000,max_depth=10, min_samples_leaf=1, max_features=2)
sklearn_rf.fit(train_X, train_y)
print(f'Sklearn RandomForestClassifier Accuracy: {(sklearn_rf.predict(test_X) == test_y).sum() / len(test_y)}')

Sklearn RandomForestClassifier Accuracy: 0.804


**The two accuracies differ slightly because bootstrap sampling and feature selection are both random.**
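
One way to check that the gap comes from randomness is to fix the seed. The sketch below is an illustration on synthetic data from `make_classification`, not on the song dataset, and the helper `fit_predict` and the small forest size are assumptions. It uses the `random_state` parameter of `RandomForestClassifier`: the same seed should give identical predictions, while a different seed may change them slightly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the song dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

def fit_predict(seed):
    """Fit a small forest with the given seed and return test predictions."""
    model = RandomForestClassifier(
        n_estimators=20, max_depth=10, max_features=2, random_state=seed
    )
    model.fit(X_train, y_train)
    return model.predict(X_test)

pred_a = fit_predict(seed=0)
pred_b = fit_predict(seed=0)  # same seed: same bootstrap and feature draws
pred_c = fit_predict(seed=1)  # different seed: predictions may differ

print("same seed identical:", np.array_equal(pred_a, pred_b))
print("accuracy, seed 0:", np.mean(pred_a == y_test))
print("accuracy, seed 1:", np.mean(pred_c == y_test))
```

The hand-written `RandomForest` above draws from NumPy's global random state, so it would need `np.random.seed` (or an explicit generator) to be made repeatable in the same way.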