Peter Schnizer


# HW 2
---

### Task 1:
Create your class that implements the Gradient Boosting concept, based on the locally weighted regression method (Lowess class), and that allows a user-prescribed number of boosting steps. The class you develop should have all the mainstream useful options, including “fit”, “is_fitted”, and “predict” methods. Show applications with real data for regression, run 10-fold cross-validations, and compare the effect of different scalers, such as the “StandardScaler”, “MinMaxScaler”, and the “QuantileScaler”. In the case of the “Concrete” data set, determine a choice of hyperparameters that yields lower MSEs for your method when compared to the eXtreme Gradient Boosting library.


**Answer:**

In [224]:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.utils.validation import check_is_fitted
from scipy.spatial.distance import cdist
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
from xgboost import XGBRegressor


In [225]:
# Gaussian Kernel
def Gaussian(w):
  return np.where(w>4,0,1/(np.sqrt(2*np.pi))*np.exp(-1/2*w**2))

# Tricubic Kernel
def Tricubic(w):
  return np.where(w>1,0,70/81*(1-w**3)**3)

# Quartic Kernel
def Quartic(w):
  return np.where(w>1,0,15/16*(1-w**2)**2)

# Epanechnikov Kernel
def Epanechnikov(w):
  return np.where(w>1,0,3/4*(1-w**2))

def weight_function(u,v,kern=Gaussian,tau=0.5):
    return kern(cdist(u, v, metric='euclidean')/(2*tau))
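
To make the role of the weight matrix concrete, here is a minimal sketch on toy data. The names `gaussian_kernel` and `kernel_weights` are illustrative re-statements of the functions above, and the three training points are made up for the example. The point it shows is the layout: entry `[i, j]` is the weight of training row `i` for query row `j`, which is why `predict` later reads the weights column by column.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(w):
    # Standard normal density, truncated to zero beyond 4 scaled units
    return np.where(w > 4, 0.0, np.exp(-0.5 * w**2) / np.sqrt(2 * np.pi))

def kernel_weights(u, v, kern=gaussian_kernel, tau=0.5):
    # Entry [i, j] is the weight of training row u[i] for query row v[j]
    return kern(cdist(u, v, metric="euclidean") / (2 * tau))

# Toy data (illustrative only): three training points on a line, two queries
x_train = np.array([[0.0], [1.0], [5.0]])
x_query = np.array([[0.0], [1.0]])

w = kernel_weights(x_train, x_query, tau=0.5)
print(w.shape)  # (3, 2): one column of weights per query point
print(w.round(4))
```

With `tau=0.5` the scaled distance equals the raw distance, so the training point at 5 gets zero weight for the query at 0 because it lies beyond the truncation at 4.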

In [226]:
class Lowess:
    def __init__(self, kernel = Gaussian, tau=0.05):
        self.kernel = kernel
        self.tau = tau

    def fit(self, x, y):
        # Lowess is a lazy learner: fitting only stores the training data
        self.xtrain_ = x
        self.yhat_ = y
        return self

    def predict(self, x_new):
        check_is_fitted(self)
        x = self.xtrain_
        y = self.yhat_
        lm = linear_model.Ridge(alpha=0.0001)
        # Promote a single observation to a one-row 2-D array so cdist accepts it
        x_new = np.atleast_2d(x_new)
        # w[:, i] holds the kernel weights of all training rows for query row i
        w = weight_function(x,x_new,self.kernel,self.tau)

        yest_test = []
        # Fit one locally weighted ridge regression per query point
        for i in range(len(x_new)):
            lm.fit(np.diag(w[:,i])@x,np.diag(w[:,i])@y)
            yest_test.append(lm.predict([x_new[i]]))
        return np.array(yest_test).flatten()
        
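    def is_fitted(self):
        # check_is_fitted returns None on success and raises otherwise,
        # so test for the stored training data and return a boolean instead
        return hasattr(self, "xtrain_")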
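
The `predict` method above applies the kernel weights by left-multiplying both `x` and `y` with `np.diag(w[:, i])`. scikit-learn's `Ridge.fit` also accepts a `sample_weight` argument, which weights each training row's squared residual directly. The sketch below shows that variant on synthetic data. `local_ridge_predict` is a hypothetical helper, not part of the class above, and the two approaches are not numerically equivalent, so it would not reproduce the numbers reported in this notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import Ridge

def local_ridge_predict(x_train, y_train, x_query, tau=0.5, alpha=1e-4):
    """Hypothetical helper: one weighted ridge fit per query row."""
    x_query = np.atleast_2d(x_query)
    # Untruncated Gaussian weights, shape (n_train, n_query)
    d = cdist(x_train, x_query, metric="euclidean") / (2 * tau)
    w = np.exp(-0.5 * d**2)
    preds = np.empty(len(x_query))
    for j in range(len(x_query)):
        model = Ridge(alpha=alpha)
        # sample_weight scales each training row's squared residual
        model.fit(x_train, y_train, sample_weight=w[:, j])
        preds[j] = model.predict(x_query[j:j + 1])[0]
    return preds

# Synthetic data (illustrative only): a noisy sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

x_new = np.array([[1.0], [3.0], [5.0]])
y_hat = local_ridge_predict(x, y, x_new, tau=0.15)
# Columns: query point, local estimate, noise-free sin(x) for reference
print(np.c_[x_new.ravel(), y_hat, np.sin(x_new).ravel()])
```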

In [227]:
class Boost:
    def __init__(self, model1, model2):
        self.model1 = model1
        self.model2 = model2
    
    def fit(self, x, y):
        self.xtrain_ = x
        self.ytrain_ = y
    
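    def is_fitted(self):
        # check_is_fitted returns None on success and raises otherwise,
        # so test for the stored training data and return a boolean instead
        return hasattr(self, "xtrain_")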
    
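    def predict(self, x_new, boost_iter):
        model1 = self.model1
        model2 = self.model2
        # Initial stage: model1 learns the targets, model2 learns model1's training residuals
        model1.fit(self.xtrain_, self.ytrain_)
        
        yhat_train = model1.predict(self.xtrain_)
        residuals_train = self.ytrain_ - yhat_train
        model2.fit(self.xtrain_, residuals_train)
        
        for _ in range(boost_iter):
            # Corrected prediction at the new points: base prediction plus predicted residual
            residuals_hat = model2.predict(x_new)
            yhat_lw = model1.predict(x_new) + residuals_hat
            # model1 is refit on the new points, using the corrected predictions as targets
            model1.fit(x_new, yhat_lw)
            # model2 is refit on the residuals the updated model1 leaves on the training data
            yhat_train = model1.predict(self.xtrain_)
            residuals_train = self.ytrain_ - yhat_train
            model2.fit(self.xtrain_, residuals_train)
        
        # Final prediction combines the last model1 with the last residual model
        yhat_lw = model1.predict(x_new) + model2.predict(x_new)
        return yhat_lw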
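
For comparison, a more conventional way to expose a user-prescribed number of boosting steps is to train every stage inside `fit` on the residuals left by the earlier stages, and to sum the stage outputs in `predict`. The sketch below is a hypothetical alternative, not the `Boost` class above and not the method behind the results in this notebook. `ResidualBooster`, `n_steps`, and `learning_rate` are illustrative names, and a shallow decision tree stands in for the Lowess learner to keep the example self-contained.

```python
import copy

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

class ResidualBooster:
    """Hypothetical sketch: additive model built from n_steps residual fits."""

    def __init__(self, base_model, n_steps=3, learning_rate=1.0):
        self.base_model = base_model
        self.n_steps = n_steps
        self.learning_rate = learning_rate

    def fit(self, x, y):
        self.models_ = []
        residual = np.asarray(y, dtype=float).copy()
        for _ in range(self.n_steps):
            # Each stage is a fresh copy trained on what earlier stages missed
            model = copy.deepcopy(self.base_model)
            model.fit(x, residual)
            residual -= self.learning_rate * model.predict(x)
            self.models_.append(model)
        return self

    def is_fitted(self):
        return hasattr(self, "models_")

    def predict(self, x_new):
        if not self.is_fitted():
            raise RuntimeError("Call fit before predict.")
        # Sum the stage-wise contributions; no refitting happens at predict time
        return self.learning_rate * sum(m.predict(x_new) for m in self.models_)

# Synthetic data (illustrative only): a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=300)

mse_by_steps = {}
for n in (1, 5, 20):
    weak = DecisionTreeRegressor(max_depth=2, random_state=0)
    booster = ResidualBooster(weak, n_steps=n, learning_rate=0.5).fit(x, y)
    mse_by_steps[n] = mean_squared_error(y, booster.predict(x))
print(mse_by_steps)  # training MSE per number of boosting steps
```

Because `copy.deepcopy` is used instead of `sklearn.base.clone`, any object with `fit` and `predict` methods could be passed as `base_model`, including a plain class such as `Lowess`.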

### Testing Scalers On Real Data:

In [228]:
nba = pd.read_csv('Data/nba.csv')
x = nba.drop(['Date','Matchup','Spread','Margin'],axis=1).to_numpy()[:-2]
y = nba['Margin'].to_numpy()[:-2]

In [229]:
mse_quant = []
mse_mm = []
mse_ss = []
quant_scale = QuantileTransformer(n_quantiles=350)
mm_scale = MinMaxScaler()
ss_scale = StandardScaler()
kf = KFold(n_splits=10,shuffle=True,random_state=1234)
model_1 = Lowess(kernel=Gaussian,tau=0.9)
model_2 = Lowess(kernel=Tricubic,tau=0.9)
full_model = Boost(model_1, model_2)

for idxtrain, idxtest in kf.split(x):
  xtrain = x[idxtrain]
  ytrain = y[idxtrain].ravel()
  ytest = y[idxtest].ravel()
  xtest = x[idxtest]
  xtrain_q = quant_scale.fit_transform(xtrain)
  xtrain_mm = mm_scale.fit_transform(xtrain)
  xtrain_ss = ss_scale.fit_transform(xtrain)

  xtest_q = quant_scale.transform(xtest)
  xtest_mm = mm_scale.transform(xtest)
  xtest_ss = ss_scale.transform(xtest)

  full_model.fit(xtrain_q,ytrain)
  ypred = full_model.predict(xtest_q,3)
  mse_quant.append(mse(ytest,ypred))

  full_model.fit(xtrain_mm,ytrain)
  ypred = full_model.predict(xtest_mm,3)
  mse_mm.append(mse(ytest,ypred))

  full_model.fit(xtrain_ss,ytrain)
  ypred = full_model.predict(xtest_ss,3)
  mse_ss.append(mse(ytest,ypred))

print('The Cross-validated Mean Squared Error for QuantileTransformer : '+str(np.mean(mse_quant)))
print('The Cross-validated Mean Squared Error for MinMaxScaler : '+str(np.mean(mse_mm)))
print('The Cross-validated Mean Squared Error for StandardScaler : '+str(np.mean(mse_ss)))


The Cross-validated Mean Squared Error for QuantileTransformer : 171.29336283977153
The Cross-validated Mean Squared Error for MinMaxScaler : 164.7207573152529
The Cross-validated Mean Squared Error for StandardScaler : 359.83740436352895


The MinMaxScaler achieves the lowest cross-validated MSE (about 164.7), followed by the QuantileTransformer (about 171.3) and the StandardScaler (about 359.8). The StandardScaler scores better with lower values of tau in model 1 and model 2. Overall, the best MSE of about 164.7 improves on the MSE of 178 I was able to obtain in the last homework.
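
The same kind of scaler comparison can be written more compactly with a scikit-learn pipeline, which refits the scaler on each training fold automatically. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_regression` rather than the NBA data, and `KNeighborsRegressor` as a stand-in model, because the `Boost` class takes `boost_iter` in `predict` and so does not plug into `cross_val_score` as written.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

# Synthetic stand-in data (NOT the NBA data used above)
X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)

scalers = {
    "QuantileTransformer": QuantileTransformer(n_quantiles=100),
    "MinMaxScaler": MinMaxScaler(),
    "StandardScaler": StandardScaler(),
}
kf = KFold(n_splits=10, shuffle=True, random_state=1234)

cv_mse = {}
for name, scaler in scalers.items():
    # The pipeline fits the scaler on the training fold only, then transforms the test fold
    pipe = make_pipeline(scaler, KNeighborsRegressor(n_neighbors=10))
    scores = cross_val_score(pipe, X, y, cv=kf, scoring="neg_mean_squared_error")
    # scikit-learn reports negated MSE so that larger is better; flip the sign back
    cv_mse[name] = -scores.mean()

for name, value in cv_mse.items():
    print(f"{name}: {value:.2f}")
```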

### Application On Concrete Dataset:

In [230]:
data = pd.read_csv('Data/concrete.csv')

In [231]:
x = data.drop(columns=['strength']).values
y = data['strength'].values

In [232]:
mse_lwr = []
mse_rf = []
scale = QuantileTransformer(n_quantiles=900)
kf = KFold(n_splits=10,shuffle=True,random_state=1234)
model_rf = XGBRegressor(objective ='reg:squarederror',n_estimators=100,reg_lambda=20,alpha=1,gamma=10,max_depth=7)
model_1 = Lowess(kernel=Gaussian,tau=0.325)
model_2 = Lowess(kernel=Tricubic,tau=0.225)
full_model = Boost(model_1, model_2)

for idxtrain, idxtest in kf.split(x):
  xtrain = x[idxtrain]
  ytrain = y[idxtrain].ravel()
  ytest = y[idxtest].ravel()
  xtest = x[idxtest]
  xtrain = scale.fit_transform(xtrain)
  xtest = scale.transform(xtest)

  full_model.fit(xtrain,ytrain)
  ypred = full_model.predict(xtest,1)

  model_rf.fit(xtrain,ytrain)
  yhat_rf = model_rf.predict(xtest)

  mse_lwr.append(mse(ytest,ypred))
  mse_rf.append(mse(ytest,yhat_rf))
print('The Cross-validated Mean Squared Error for Boosted Locally Weighted Regression is : '+str(np.mean(mse_lwr)))
print('The Cross-validated Mean Squared Error for a DT-based method: '+str(np.mean(mse_rf)))

The Cross-validated Mean Squared Error for Boosted Locally Weighted Regression is : 19.518017495330895
The Cross-validated Mean Squared Error for a DT-based method: 20.18308623182933


With these hyperparameters, my implementation of gradient boosting achieves a lower 10-fold cross-validated MSE (about 19.52) than XGBoost (about 20.18).
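
Since both models are scored on the same ten folds, the per-fold MSE lists can also be compared pairwise rather than only through their means. The sketch below shows one way to summarize such a paired comparison. `compare_folds` is a hypothetical helper, and the numbers passed to it are placeholders chosen for illustration, not results from this notebook.

```python
import numpy as np

def compare_folds(mse_a, mse_b):
    """Hypothetical helper: summarize paired per-fold MSE differences (a - b)."""
    diff = np.asarray(mse_a, dtype=float) - np.asarray(mse_b, dtype=float)
    return {
        "mean_diff": float(diff.mean()),
        "std_diff": float(diff.std(ddof=1)),
        "folds_a_better": int((diff < 0).sum()),
        "n_folds": int(diff.size),
    }

# Placeholder numbers for illustration only; they are NOT results from this notebook
example_a = [18.0, 21.5, 19.0, 20.5]
example_b = [19.0, 21.0, 20.5, 21.0]
summary = compare_folds(example_a, example_b)
print(summary)
```

In the notebook itself, the call would be `compare_folds(mse_lwr, mse_rf)` after the cross-validation loop has filled both lists.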

---
### Task 2:
Based on the Usearch library, create your own class that computes the k-Nearest Neighbors for Regression.

**Answer:**

In [234]:
from usearch.index import search, MetricKind

class KNN_Reg:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors
    
    def fit(self, x, y):
        self.xtrain_ = x
        self.ytrain_ = y
    
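    def is_fitted(self):
        # check_is_fitted returns None on success and raises otherwise,
        # so test for the stored training data and return a boolean instead
        return hasattr(self, "xtrain_")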
    
    def predict(self, x_new):
        kneighbors = self.__get_knn(x_new)
        if len(kneighbors.shape) > 1:
            return [np.mean(row) for row in self.ytrain_[kneighbors]]
        else:
            return np.mean(self.ytrain_[kneighbors])
    
    def __get_knn(self, x_new):
        neighbors = search(self.xtrain_, x_new, self.n_neighbors, MetricKind.L2sq, exact=True)
        return neighbors.keys
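
As a cross-check for the class above, the same prediction rule (average the targets of the k nearest training rows under squared Euclidean distance) can be written as a brute-force reference in NumPy and SciPy. This sketch does not use Usearch. `knn_predict_bruteforce` is a hypothetical helper, and the four training points are toy data.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_predict_bruteforce(x_train, y_train, x_query, k=5):
    """Hypothetical reference: average the targets of the k nearest training rows."""
    x_query = np.atleast_2d(x_query)
    # Squared Euclidean distances, shape (n_query, n_train)
    d2 = cdist(x_query, x_train, metric="sqeuclidean")
    # Indices of the k smallest distances per row (order within the k is arbitrary)
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Toy data (illustrative only)
x_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

pred = knn_predict_bruteforce(x_train, y_train, [[0.9]], k=2)
print(pred)  # [0.5]: the two nearest training points are 1.0 and 0.0
```

On small inputs, the output of `KNN_Reg.predict` should agree with this reference up to floating-point error, provided there are no ties in the distances.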

    

### Testing My Class on NBA Game Data:

In [235]:
# data = pd.read_csv('Data/concrete.csv')
# x = data.drop(columns=['strength']).values
# y = data['strength'].values

In [236]:
nba = pd.read_csv('Data/nba.csv')
x = nba.drop(['Date','Matchup','Spread','Margin'],axis=1).to_numpy()[:-2]
y = nba['Margin'].to_numpy()[:-2]

In [237]:
mse_knn = []
scale = QuantileTransformer(n_quantiles=500)

kf = KFold(n_splits=10,shuffle=True,random_state=1234)
model_knn = KNN_Reg(55)

for idxtrain, idxtest in kf.split(x):
  xtrain = x[idxtrain]
  ytrain = y[idxtrain].ravel()
  ytest = y[idxtest].ravel()
  xtest = x[idxtest]
  xtrain = scale.fit_transform(xtrain)
  xtest = scale.transform(xtest)

  model_knn.fit(xtrain,ytrain)
  ypred = model_knn.predict(xtest)

  mse_knn.append(mse(ytest,ypred))
print('The Cross-validated Mean Squared Error for KNN Regression is : '+str(np.mean(mse_knn)))


The Cross-validated Mean Squared Error for KNN Regression is : 168.82562564478098


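At about 168.8, this is almost as good as the best MSE obtained from the gradient boosted Lowess model on the same data (about 164.7 with the MinMaxScaler), and slightly lower than the 171.3 that model reached with the QuantileTransformer.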

---
### Task 3:
Host your project on your GitHub page.

**Answer:**

https://github.com/Pschnizer/DATA441/blob/main/DATA_441_Project_2.ipynb