# Testing Random Ridge Regression 


For this classifier, we won't use the same approach as for the others. It takes a lot of resources to run and it has many parameters.

We'll start with parameters selected by hand, then modify each parameter one at a time to see its effect on the accuracy.

The validation will be done by splitting the data into a training set and a testing set.
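As an illustration of the kind of split described above, here is a minimal sketch. The name `split_data_sketch` and its `seed` argument are hypothetical; the project's own `split_data` (imported from the helper scripts) may shuffle or cut the data differently.

```python
import numpy as np

def split_data_sketch(x, y, ratio, seed=1):
    """Shuffle the rows, then keep the first `ratio` fraction for training.

    Hypothetical stand-in for the project's split_data; returns
    (x_train, x_test, y_train, y_test) in the same order as used in this notebook.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # random order of the row indices
    cut = int(ratio * len(y))              # number of training rows
    train_idx, test_idx = indices[:cut], indices[cut:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Small synthetic example (not the project's data)
x_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([1.0, -1.0] * 50)
x_tr, x_te, y_tr, y_te = split_data_sketch(x_demo, y_demo, 0.7)
print(x_tr.shape, x_te.shape)
```

Fixing the seed keeps the split reproducible, which matters when runs that differ in a single hyperparameter are compared.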

In [None]:
import numpy as np
import sys
sys.path.insert(0, '../scripts')  # make the project's helper modules importable
from proj1_helpers import *
from preprocessing import *
from classifiers import *
from features_ext import *
from utils import *


In [None]:
TRAIN = '../data/train.csv'


## Load the data and preprocess

In [None]:
y_train, tx_train, ids_train = load_csv_data(TRAIN)


In [None]:
y, x = preprocess(y_train, tx_train, "NanToMean", onehotencoding=True)



In [None]:
degree = 9
centroids = build_centroids(y, x)
centroids = [b for a,b in centroids]
x_extended, d = build_poly_interaction(x, degree, [], centroids)
x_tr_split, x_te_split, y_tr_split, y_te_split = split_data(x_extended, y, 0.7)

## Selecting a reference model


In [None]:
# reference 
n_classifier = 10
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
# Note: this reference score is computed on the training split (x_tr_split)
preds = cl.predict(x_tr_split)
cl.accuracy(preds, y_tr_split)


0.8294685714285714


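This is a better accuracy than with a single ridge regression (about 0.8265, see the comparison further down)! Note, however, that this reference score is measured on the training split, whereas the experiments below are scored on the test split, so the numbers are not directly comparable.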

Let's try to find the best hyperparameters.
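The one-parameter-at-a-time search used in the cells below can be sketched as a small generator. The helper `one_at_a_time` and the dictionary layout are hypothetical; the values are the ones tried in this notebook.

```python
# Reference configuration taken from the reference cell above
reference = {
    "n_classifier": 10,
    "lambda_": 0.01,
    "number_of_rows": 50000,
    "features_per_classifier": 41,
    "use_centroids": True,
}

def one_at_a_time(reference, variations):
    """Yield (name, value, config) where config differs from the reference in one key."""
    for name, values in variations.items():
        for value in values:
            config = dict(reference)   # copy, so the reference is never modified
            config[name] = value
            yield name, value, config

# Alternative values explored in the cells of this notebook
variations = {
    "features_per_classifier": [32],
    "number_of_rows": [40000, 60000],
    "lambda_": [0.5],
    "n_classifier": [6, 8, 12],
}

configs = list(one_at_a_time(reference, variations))
for name, value, config in configs:
    print(f"{name} = {value}")
```

Each `config` could then be unpacked into the classifier constructor. This design only explores the axes around the reference point and does not capture interactions between parameters.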

In [None]:
# Using fewer features
n_classifier = 10
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 32
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)

0.8261066666666667

It seems that decreasing the number of features doesn't improve performance. It might help if we added many more classifiers, but we ran into memory issues when trying that.

In [None]:
# Using fewer rows
n_classifier = 10
lambda_ = 0.01
number_of_rows = 40000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)


0.8278266666666667

The accuracy is still good, although it is slightly lower than the reference score.

In [None]:
# more rows
n_classifier = 10
lambda_ = 0.01
number_of_rows = 60000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)


0.82696

The accuracy decreases here as well (0.82696). It looks like the optimal number of rows is close to 50000.

In [None]:
# higher lambda
n_classifier = 10
lambda_ = 0.5
number_of_rows = 50000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)

0.82724

Modifying the lambda parameter has little effect on the accuracy. Here, all of our classifiers share the same lambda. We could perhaps improve our model by using a different lambda for each classifier.

In [None]:
# More classifiers 
n_classifier = 12
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)

0.8278533333333333

Using more classifiers does not improve the accuracy, and it also takes more time and memory to train and to generate predictions.

Let's try with fewer classifiers to check whether the number of classifiers really impacts our predictions.

In [None]:
# Fewer classifiers
n_classifier = 8
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)

0.82712

With 8 classifiers, the accuracy (0.82712) is slightly lower than with 12 (0.82785). Let's try with even fewer.

In [None]:
# Even fewer classifiers
n_classifier = 6
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)

0.8272

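Compared with 12 classifiers, the accuracy decreases a little bit (0.8272 versus 0.82785) and gets closer to the accuracy of a single ridge regression classifier trained with all the data. It is almost unchanged compared with 8 classifiers (0.82712).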

Let's try to make a lot of bad classifiers and see how they combine.

In [None]:
# More classifiers / fewer features
n_classifier = 50
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 10
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
cl.accuracy(preds, y_te_split)

0.8051066666666666

In [None]:
val, counts = np.unique(preds, return_counts=True)
counts

array([53998, 21002])

The accuracy decreases significantly (0.8051). Combining many weak classifiers this way is probably not a good idea, even though it still gives decent results.
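The counts above are easier to read as proportions. The sketch below rebuilds an array with the same counts, assuming the first count belongs to label -1 and the second to label 1 (`np.unique` returns sorted values, and the labels printed later in this notebook are -1 and 1). The helper `label_fractions` is hypothetical.

```python
import numpy as np

def label_fractions(preds):
    """Return {label: fraction of predictions with that label}."""
    values, counts = np.unique(preds, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

# Synthetic array with the same counts as the output above
# (assumption: 53998 predictions of -1 and 21002 predictions of 1)
preds_demo = np.concatenate([-np.ones(53998), np.ones(21002)])
fractions = label_fractions(preds_demo)
print(len(preds_demo))   # 75000 test rows in total
print(fractions)
```

Under that assumption, about 72% of the test rows are predicted as -1. Comparing this with the label proportions in `y_te_split` would show whether the ensemble is biased toward one class.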


# Ridge regression's performance

We want to compare this classifier, which combines many ridge regression classifiers, to a single ridge regression classifier.
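To make the comparison concrete, here is a rough sketch of what an ensemble of ridge regressions can look like. It is written under assumptions: each base model is fitted on a random subset of rows and columns, and the ensemble predicts by majority vote. The class `RandomRidgeSketch`, the function `ridge_fit`, the penalty scaling and the voting rule are all illustrative; the actual `ClassifierRandomRidgeRegression` (including its use of centroids and of the `d` argument) may work differently.

```python
import numpy as np

def ridge_fit(y, tx, lambda_):
    """Closed-form ridge weights; the penalty scaling here is an assumption."""
    d = tx.shape[1]
    return np.linalg.solve(tx.T @ tx + lambda_ * np.eye(d), tx.T @ y)

class RandomRidgeSketch:
    """Illustrative only: not the project's ClassifierRandomRidgeRegression."""

    def __init__(self, n_classifier, lambda_, number_of_rows,
                 features_per_classifier, seed=0):
        self.n_classifier = n_classifier
        self.lambda_ = lambda_
        self.number_of_rows = number_of_rows
        self.features_per_classifier = features_per_classifier
        self.rng = np.random.default_rng(seed)
        self.models = []

    def train(self, y, tx):
        n, d = tx.shape
        self.models = []
        for _ in range(self.n_classifier):
            # Each base model sees a random subset of rows and of columns
            rows = self.rng.choice(n, size=min(self.number_of_rows, n), replace=False)
            cols = self.rng.choice(d, size=min(self.features_per_classifier, d), replace=False)
            w = ridge_fit(y[rows], tx[np.ix_(rows, cols)], self.lambda_)
            self.models.append((cols, w))

    def predict(self, tx):
        # Majority vote over the signs predicted by the base models
        votes = sum(np.sign(tx[:, cols] @ w) for cols, w in self.models)
        return np.where(votes >= 0, 1.0, -1.0)

# Synthetic demo data (not the project's dataset)
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(1000, 6))
y_demo = np.where(x_demo.sum(axis=1) >= 0, 1.0, -1.0)

model = RandomRidgeSketch(n_classifier=11, lambda_=0.01,
                          number_of_rows=500, features_per_classifier=5)
model.train(y_demo, x_demo)
preds_demo = model.predict(x_demo)
accuracy_demo = float(np.mean(preds_demo == y_demo))
print(accuracy_demo)
```

An odd number of base models avoids ties in the vote. In this sketch, memory and time grow with `n_classifier`, which is consistent with the resource issues mentioned above.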

In [None]:
lambdas = [0.5, 0.1, 0.02, 0.01, 0.001, 0.0001]
res = []
for lambda_ in lambdas:
    ridge = ClassifierLinearRegression(lambda_, "L2")
    ridge.train(y_tr_split, x_tr_split)
    preds2 = ridge.predict(x_te_split)
    res.append(ridge.accuracy(preds2, y_te_split))
res

[0.82636, 0.82644, 0.82652, 0.8265466666666667, 0.8266, 0.82668]

Changing lambda doesn't affect the test accuracy significantly: it stays between 0.82636 and 0.82668 over the whole range.
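One possible explanation, sketched under the assumption that the classifier solves the standard ridge problem. The penalty scaling used inside `ClassifierLinearRegression` is not shown in this notebook, so $\lambda'$ below stands for whatever effective penalty it applies:

$$
w_{\lambda'} = (X^\top X + \lambda' I)^{-1} X^\top y
$$

Writing the singular value decomposition $X = U \Sigma V^\top$ with singular values $\sigma_i$ and singular vectors $u_i$, $v_i$:

$$
w_{\lambda'} = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda'} \, (u_i^\top y) \, v_i
$$

Compared with plain least squares, each component with $\sigma_i > 0$ is shrunk by the factor $\sigma_i^2 / (\sigma_i^2 + \lambda')$. When $\sigma_i^2 \gg \lambda'$, this factor is close to 1 and the weights barely move. In addition, the predicted label only depends on the sign of $x^\top w$, so any shrinkage that is roughly uniform across components leaves the predictions unchanged.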

# Finding the best degree

We want to see the impact of the degree on our Random Ridge classifier. Let's start with a smaller degree.

In [None]:
degree = 6
centroids = build_centroids(y, x)
centroids = [b for a,b in centroids]
x_extended, d = build_poly_interaction(x, degree, [], centroids)
x_tr_split, x_te_split, y_tr_split, y_te_split = split_data(x_extended, y, 0.7)

In [None]:
# smaller degree
n_classifier = 10
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
print(np.unique(preds))
cl.accuracy(preds, y_te_split)


[-1.  1.]


0.81656

The accuracy is lower with a smaller degree. We should try with a higher one.

In [None]:
degree = 10
centroids = build_centroids(y, x)
centroids = [b for a,b in centroids]
x_extended, d = build_poly_interaction(x, degree, [], centroids)
x_tr_split, x_te_split, y_tr_split, y_te_split = split_data(x_extended, y, 0.7)

In [None]:
# higher degree
n_classifier = 10
lambda_ = 0.01
number_of_rows = 50000
features_per_classifier = 41
use_centroids = True
cl = ClassifierRandomRidgeRegression(n_classifier, lambda_, number_of_rows, features_per_classifier, use_centroids)
cl.train(y_tr_split, x_tr_split, d)
preds = cl.predict(x_te_split)
print(np.unique(preds))
cl.accuracy(preds, y_te_split)

[-1.  1.]


0.82796

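Here the degree is a little bit higher, but the test accuracy (0.82796) stays in the same range as the degree 9 runs above (roughly 0.826 to 0.828). Given the extra cost, there seems to be no point in increasing the degree past 9.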
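The cost side of this trade-off can be illustrated with a plain polynomial expansion. The function `build_poly_sketch` is hypothetical and leaves out the interaction terms and centroids handled by `build_poly_interaction`, so the project's real feature count will differ.

```python
import numpy as np

def build_poly_sketch(x, degree):
    """Stack a bias column and x**1 ... x**degree column-wise (no interaction terms)."""
    powers = [x ** p for p in range(1, degree + 1)]
    return np.hstack([np.ones((x.shape[0], 1))] + powers)

x_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
expanded = build_poly_sketch(x_demo, 3)
print(expanded.shape)   # (2, 7): 1 bias column + 2 features * 3 powers
print(expanded[0])      # [1. 1. 2. 1. 4. 1. 8.]
```

In this simplified setting, each extra degree adds one column per input feature, so training time and memory grow with the degree even when the accuracy no longer does.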

