# **CS 4361/5361 Machine Learning**

**Predicting the running times of GPU operations under various parameter settings using ensembles of decision trees**

**Author:** [Olac Fuentes](http://www.cs.utep.edu/ofuentes/)<br>
**Last modified:** 2021/09/23<br>

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
import random
from sklearn.metrics import accuracy_score, confusion_matrix,mean_squared_error,mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from google.colab import files

In [3]:
uploaded = files.upload()

Saving gpu_running_time.csv to gpu_running_time.csv


In [4]:
df = pd.read_csv('gpu_running_time.csv')
df

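| Unnamed: 0 | MWG | NWG | KWG | MDIMC | NDIMC | MDIMA | NDIMB | KWI | VWM | VWN | STRM | STRN | SA | SB | Run1 (ms) | Run2 (ms) | Run3 (ms) | Run4 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 128 | 16 | 16 | 32 | 32 | 32 | 8 | 2 | 4 | 1 | 0 | 1 | 1 | 13.29 | 13.25 | 13.36 | 13.37 |
| 1 | 128 | 128 | 16 | 16 | 32 | 32 | 32 | 8 | 2 | 4 | 1 | 1 | 1 | 1 | 13.29 | 13.36 | 13.38 | 13.65 |
| 2 | 128 | 128 | 16 | 16 | 32 | 32 | 32 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 13.78 | 13.76 | 13.73 | 13.69 |
| 3 | 128 | 128 | 16 | 16 | 32 | 32 | 32 | 8 | 2 | 2 | 1 | 1 | 1 | 1 | 14.34 | 14.44 | 14.43 | 14.58 |
| 4 | 128 | 64 | 16 | 16 | 16 | 16 | 32 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 14.61 | 14.69 | 14.80 | 14.78 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 241595 | 128 | 128 | 32 | 8 | 8 | 16 | 32 | 8 | 2 | 2 | 0 | 0 | 1 | 1 | 3322.83 | 3313.44 | 3359.22 | 3342.30 |
| 241596 | 128 | 128 | 32 | 8 | 8 | 32 | 8 | 8 | 2 | 2 | 0 | 0 | 1 | 1 | 3324.15 | 3324.11 | 3332.74 | 3300.80 |
| 241597 | 128 | 128 | 32 | 8 | 8 | 16 | 16 | 8 | 2 | 2 | 0 | 0 | 1 | 1 | 3325.87 | 3340.98 | 3333.41 | 3341.08 |
| 241598 | 128 | 128 | 32 | 8 | 8 | 32 | 16 | 8 | 2 | 2 | 0 | 0 | 1 | 1 | 3333.92 | 3335.08 | 3354.68 | 3317.04 |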


We convert the DataFrame to a NumPy array and extract the features (X, the first 14 columns) and the target (y, the mean of the remaining run-time columns, in milliseconds).



In [5]:
data = df.to_numpy()
X = data[:,:14]
y = np.mean(data[:,14:],axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4361)
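Slicing by position depends on the exact column layout of the CSV. As a hedged alternative, the sketch below selects features and targets by name on a tiny synthetic frame. The helper name `split_features_target` is hypothetical, and it assumes the run-time columns are the ones whose names end in `(ms)`, as in the table shown above.

```python
import numpy as np
import pandas as pd

def split_features_target(frame):
    """Return (X, y), where y is the row-wise mean of the '(ms)' columns."""
    run_cols = [c for c in frame.columns if c.endswith('(ms)')]
    # Drop run-time columns and any unnamed index column left by a CSV export.
    feature_cols = [c for c in frame.columns
                    if c not in run_cols and not c.startswith('Unnamed')]
    X = frame[feature_cols].to_numpy(dtype=float)
    y = frame[run_cols].to_numpy(dtype=float).mean(axis=1)
    return X, y

# Tiny synthetic frame (made-up values, two parameters and two runs only).
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'MWG': [128, 64],
    'NWG': [128, 32],
    'Run1 (ms)': [10.0, 20.0],
    'Run2 (ms)': [12.0, 24.0],
})
X_toy, y_toy = split_features_target(toy)
print(X_toy.shape, y_toy)
```

Selecting by name makes it explicit which columns act as features and which are averaged into the target.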

As a baseline, we evaluate the error obtained when the mean of y_train is used as the prediction for every example in the test set.

In [6]:
pred = np.mean(y_train)+np.zeros_like(y_test)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error = 135001.80
Mean absolute error =  216.05
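The same baseline can also be expressed with scikit-learn's `DummyRegressor`. The sketch below runs on synthetic data (the array names and sizes are placeholders, not the GPU dataset) and checks that it matches the manual mean predictor.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(50, 3)), rng.normal(size=(20, 3))
y_tr, y_te = rng.normal(size=50), rng.normal(size=20)

# DummyRegressor ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
pred_dummy = baseline.predict(X_te)

# Manual version, as in the cell above.
pred_manual = np.mean(y_tr) + np.zeros_like(y_te)

mse_dummy = mean_squared_error(y_te, pred_dummy)
mse_manual = mean_squared_error(y_te, pred_manual)
print(mse_dummy, mse_manual)
```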


Now let's evaluate a decision tree for this task. 

In [7]:
model = DecisionTreeRegressor()
model.fit(X_train,y_train)
pred = model.predict(X_test)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error = 25.12
Mean absolute error =   1.57


We'd expect a random forest to perform better than a decision tree. Let's see if that is the case.

In [8]:
model = RandomForestRegressor()
model.fit(X_train,y_train)
pred = model.predict(X_test)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error = 17.67
Mean absolute error =   1.67
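In the recorded run the forest lowers the mean squared error (17.67 vs. 25.12) while its mean absolute error is slightly higher (1.67 vs. 1.57). The two metrics can disagree because squaring weights large errors more heavily. The made-up residuals below illustrate the effect; they are not taken from the dataset.

```python
import numpy as np

# Made-up residuals for two hypothetical models.
errors_a = np.array([0.0, 0.0, 0.0, 6.0])  # mostly exact, one large miss
errors_b = np.array([2.0, 2.0, 2.0, 2.0])  # uniformly moderate misses

for name, e in [('a', errors_a), ('b', errors_b)]:
    mae = np.mean(np.abs(e))
    mse = np.mean(e ** 2)
    print('model {}: MAE = {:.2f}, MSE = {:.2f}'.format(name, mae, mse))
```

Model `a` has the lower MAE but the higher MSE, so ranking models by one metric need not agree with ranking by the other.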


k-NN can also be used for regression. Let's see how well it works on this problem. Warning: it takes a while to run!

In [None]:
model = KNeighborsRegressor(algorithm='brute',n_jobs=-1)
model.fit(X_train,y_train)
pred = model.predict(X_test)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))
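k-NN relies on distances, so features with large ranges can dominate the neighbor search. `StandardScaler` is imported at the top of the notebook but not used; the sketch below shows one way it could be combined with k-NN in a pipeline. It runs on synthetic data, and whether scaling helps on the GPU dataset is untested here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: the second feature has a much larger scale but carries no signal.
X_syn = np.column_stack([rng.uniform(0, 1, 400), rng.uniform(0, 1000, 400)])
y_syn = 10 * X_syn[:, 0]
X_tr, X_te, y_tr, y_te = X_syn[:300], X_syn[300:], y_syn[:300], y_syn[300:]

plain = KNeighborsRegressor().fit(X_tr, y_tr)
# The pipeline fits the scaler on the training data and reuses it at prediction time.
scaled = make_pipeline(StandardScaler(), KNeighborsRegressor()).fit(X_tr, y_tr)

mae_plain = np.mean(np.abs(plain.predict(X_te) - y_te))
mae_scaled = np.mean(np.abs(scaled.predict(X_te) - y_te))
print('MAE without scaling = {:5.2f}'.format(mae_plain))
print('MAE with scaling    = {:5.2f}'.format(mae_scaled))
```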

## **Exercises**

Let's build our own ensembles of decision trees to see if we can surpass the performance of random forests using the same number of individual trees (the sklearn implementation uses 100 by default). 

1. Build an ensemble that uses regression trees with random splitting instead of best splitting to solve the problem.

Authors: Joshua Zamora and Steve Ramos

In [None]:
t = 30
E = []
for i in range(t):
  model = DecisionTreeRegressor(splitter = 'random')
  model.fit(X_train,y_train)
  pred = model.predict(X_test)
  E.append(pred)

In [None]:
pred = np.mean(np.array(E),axis=0)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))

Mean squared error =  3.76
Mean absolute error =   0.68
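Each tree built with `splitter='random'` draws different split thresholds, so results change from run to run. Below is a sketch of the same averaging scheme with a fixed `random_state` per tree, wrapped in a hypothetical helper `random_split_ensemble` and run on synthetic data rather than the GPU dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_split_ensemble(X_tr, y_tr, X_te, n_trees=30, seed=0):
    """Average the predictions of n_trees randomly split regression trees."""
    preds = []
    for i in range(n_trees):
        # A distinct seed per tree keeps the trees different but repeatable.
        tree_i = DecisionTreeRegressor(splitter='random', random_state=seed + i)
        preds.append(tree_i.fit(X_tr, y_tr).predict(X_te))
    return np.mean(preds, axis=0)

rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(500, 2))
y_syn = np.sin(X_syn[:, 0]) + 0.5 * X_syn[:, 1] + rng.normal(0, 0.3, 500)
X_tr, X_te, y_tr, y_te = X_syn[:400], X_syn[400:], y_syn[:400], y_syn[400:]

single = random_split_ensemble(X_tr, y_tr, X_te, n_trees=1)
ensemble = random_split_ensemble(X_tr, y_tr, X_te, n_trees=30)
mse_single = np.mean((single - y_te) ** 2)
mse_ensemble = np.mean((ensemble - y_te) ** 2)
print('MSE, 1 tree   = {:5.3f}'.format(mse_single))
print('MSE, 30 trees = {:5.3f}'.format(mse_ensemble))
```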


2. Build an ensemble that uses bagging and regression trees to solve the problem.

In [None]:
t = 30
E = []
print(X_train.shape[0])
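for i in range(t):
  # These trees also use random splitting, so this ensemble combines
  # bagging with the randomization from exercise 1.
  model = DecisionTreeRegressor(splitter = 'random')
  # Bootstrap sample: draw as many row indices as training examples, with replacement.
  a = np.random.randint(low = 0, high = X_train.shape[0], size=X_train.shape[0])
  model.fit(X_train[a],y_train[a])
  pred = model.predict(X_test)
  E.append(pred)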

169120


In [None]:
pred = np.mean(np.array(E),axis=0)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test))) 

Mean squared error =  3.91
Mean absolute error =   0.72
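The index array `a` above is a bootstrap sample: it draws as many indices as there are training examples, with replacement, so some examples repeat and others are left out. A quick check of how many distinct examples such a sample contains, using the training-set size printed above (the seed is an arbitrary choice):

```python
import numpy as np

n = 169120  # number of training examples printed above
rng = np.random.default_rng(0)
a = rng.integers(low=0, high=n, size=n)  # bootstrap indices, with replacement

unique_fraction = np.unique(a).size / n
expected = 1 - (1 - 1 / n) ** n  # approaches 1 - 1/e for large n
print('Fraction of distinct examples = {:.3f}'.format(unique_fraction))
print('Expected fraction             = {:.3f}'.format(expected))
```

Each tree therefore trains on a different subset of roughly two thirds of the distinct examples, which is the source of diversity in bagging.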


3. Build an ensemble that uses random attribute selection and regression trees to solve the problem.

In [None]:

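t = 10
E = []

for i in range(t):
  model = DecisionTreeRegressor()
  # Draw column indices with replacement; duplicates add no information,
  # so each tree effectively sees a random subset of the attributes.
  a = np.random.randint(low = 0, high = X_train.shape[1], size=X_train.shape[1])
  model.fit(X_train[:, a], y_train)
  # Predict with the same columns used for training. Predicting on the full
  # X_test mismatches the tree's features, which likely explains the errors
  # above the mean baseline in the recorded run (MSE 165041.37, MAE 179.12).
  pred = model.predict(X_test[:, a])
  E.append(pred)

In [None]:
pred = np.mean(np.array(E),axis=0)
print('Mean squared error = {:5.2f}'.format(mean_squared_error(pred,y_test)))
print('Mean absolute error =  {:5.2f}'.format(mean_absolute_error(pred,y_test)))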
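One variant of random attribute selection draws distinct columns (sampling without replacement) for each tree and applies the same column indices when predicting. The sketch below illustrates this on synthetic data; the helper name `random_subspace_ensemble` and the default of using half the columns are assumptions for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_subspace_ensemble(X_tr, y_tr, X_te, n_trees=10, n_features=None, seed=0):
    """Average trees that each see a random subset of the columns."""
    rng = np.random.default_rng(seed)
    d = X_tr.shape[1]
    k = n_features if n_features is not None else max(1, d // 2)  # assumed default
    preds = []
    for i in range(n_trees):
        cols = rng.choice(d, size=k, replace=False)  # distinct column indices
        tree_i = DecisionTreeRegressor(random_state=seed + i)
        tree_i.fit(X_tr[:, cols], y_tr)
        # The test set must be indexed with the same columns as the training set.
        preds.append(tree_i.predict(X_te[:, cols]))
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
X_syn = rng.uniform(-1, 1, size=(600, 6))
y_syn = X_syn @ np.array([3.0, -2.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.1, 600)
X_tr, X_te, y_tr, y_te = X_syn[:450], X_syn[450:], y_syn[:450], y_syn[450:]

pred_sub = random_subspace_ensemble(X_tr, y_tr, X_te, n_trees=10, n_features=4)
mse_sub = np.mean((pred_sub - y_te) ** 2)
mse_mean = np.mean((np.mean(y_tr) - y_te) ** 2)
print('MSE, subspace ensemble = {:5.3f}'.format(mse_sub))
print('MSE, mean baseline     = {:5.3f}'.format(mse_mean))
```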
