## CROSS VALIDATION FOR DIFFERENT MODELS

In [5]:
import numpy as np  # used below for np.logspace

from crossvalidation import *
from implementations import *
from feature_engineering import *

In [3]:
#Loading train data

filename = 'train.csv'
data_folder = './data/'
file_path = data_folder + filename
y,tx,ids,features = load_train_data(file_path)

# Computing preprocessing routine

list_subsets, list_features, y_0, y_1, y_2_3, columns_to_drop_in_subsets = preprocessing(tx,y,ids,features)

We want to introduce interaction terms between variables during the preprocessing routine. 
However, we have no evidence that multiplying variables by the trigonometric features 
improves model performance. Since the trigonometric features are the last columns of each 
dataset, we store in a list how many trailing columns of each subset are trigonometric, so that 
they can be excluded from the products later.
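
The exclusion described above can be sketched as follows. This is only an illustration, assuming NumPy arrays whose last `n_trig` columns hold the trigonometric features; `add_interactions` is a hypothetical helper, not the function defined in `feature_engineering`.

```python
import numpy as np
from itertools import combinations


def add_interactions(x, n_trig):
    """Append pairwise products of the non-trigonometric columns.

    Hypothetical helper: assumes the last `n_trig` columns of `x` are the
    trigonometric features and must not take part in any product.
    """
    n_base = x.shape[1] - n_trig  # columns allowed to interact
    products = [x[:, i] * x[:, j] for i, j in combinations(range(n_base), 2)]
    if not products:
        return x.copy()
    return np.column_stack([x] + products)


# Toy example: 3 ordinary columns followed by 2 trigonometric ones.
x_demo = np.arange(10, dtype=float).reshape(2, 5)
x_inter = add_interactions(x_demo, n_trig=2)  # 5 original + 3 product columns
```

The original columns are kept in place, so the trigonometric features still enter the model on their own; they are only left out of the products.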

In [4]:
how_many_trig_features=[2,1,2]

Defining parameters to test

In [48]:
lambdas = np.logspace(-7,-3,5)
degrees = [3,5,7]
k_fold = 4
gamma = 0.1
max_iters = 200
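
To make the size of this search concrete, the sketch below lists the lambda grid and builds shuffled fold indices. `build_k_indices` is an assumed helper written for illustration; the actual fold construction inside `crossvalidation` may differ.

```python
import numpy as np


def build_k_indices(n_samples, k_fold, seed=1):
    """Shuffle the row indices and split them into `k_fold` equal folds.

    Assumed helper: rows that do not fit into equal folds are dropped.
    """
    rng = np.random.default_rng(seed)
    fold_size = n_samples // k_fold
    shuffled = rng.permutation(n_samples)
    return shuffled[: fold_size * k_fold].reshape(k_fold, fold_size)


lambdas_demo = np.logspace(-7, -3, 5)  # 1e-7, 1e-6, 1e-5, 1e-4, 1e-3
degrees_demo = [3, 5, 7]
k_fold_demo = 4

k_indices = build_k_indices(n_samples=10, k_fold=k_fold_demo)

# One model fit per (lambda, degree, fold) combination.
n_fits = len(lambdas_demo) * len(degrees_demo) * k_fold_demo
```

With 5 lambdas, 3 degrees and 4 folds, each subset requires 60 fits, which is worth keeping in mind for the iterative logistic regression runs.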

Defining lists to save optimal degrees and lambdas for each subset

In [49]:
optimal_lambdas = [0]*3
optimal_degrees = [1]*3

#### RIDGE REGRESSION

Doing cross validation on subsets_0 for ridge regression

In [16]:
optimal_degrees[0], optimal_lambdas[0], best_rmse = cross_validation_demo_ridge(y_0, list_subsets[0], k_fold, lambdas, degrees,
                                                                               how_many_trig_features[0])

KeyboardInterrupt: 

Doing cross validation on subsets_1 for ridge regression

In [13]:
optimal_degrees[1], optimal_lambdas[1], best_rmse = cross_validation_demo_ridge(y_1, list_subsets[1], k_fold, lambdas, degrees,
                                                                               how_many_trig_features[1])

The choice of lambda which leads to the best test rmse is 0.00010 with a test rmse of 0.371. The best degree is 7.0


Doing cross validation on subsets_2_3 for ridge regression

In [11]:
optimal_degrees[2], optimal_lambdas[2], best_rmse = cross_validation_demo_ridge(y_2_3, list_subsets[2], k_fold, lambdas, degrees,
                                                                               how_many_trig_features[2])

The choice of lambda which leads to the best test rmse is 0.00010 with a test rmse of 0.347. The best degree is 7.0


NameError: name 'optimal_degrees' is not defined
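
For reference, a grid search of this kind can be sketched as below. This is not the code of `cross_validation_demo_ridge`: the function names, the polynomial expansion, and the `2 * N * lambda` scaling of the penalty are all assumptions made for the example, and the trigonometric-column handling is omitted.

```python
import numpy as np


def build_poly(x, degree):
    """Bias column plus powers 1..degree of every column of `x`."""
    powers = [x ** d for d in range(1, degree + 1)]
    return np.column_stack([np.ones(x.shape[0])] + powers)


def ridge_fit(y, tx, lambda_):
    """Closed-form ridge solution (the 2*N*lambda scaling is an assumption)."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)


def rmse(y, tx, w):
    """Root mean squared error of the linear prediction tx @ w."""
    return np.sqrt(np.mean((y - tx @ w) ** 2))


def cv_ridge(y, x, k_fold, lambdas, degrees, seed=1):
    """Return (best_degree, best_lambda, best_rmse) by mean validation RMSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k_fold)
    best = (None, None, np.inf)
    for degree in degrees:
        tx = build_poly(x, degree)
        for lambda_ in lambdas:
            scores = []
            for k in range(k_fold):
                test_idx = folds[k]
                train_idx = np.concatenate(
                    [folds[j] for j in range(k_fold) if j != k]
                )
                w = ridge_fit(y[train_idx], tx[train_idx], lambda_)
                scores.append(rmse(y[test_idx], tx[test_idx], w))
            if np.mean(scores) < best[2]:
                best = (degree, lambda_, np.mean(scores))
    return best


# Toy data: a noisy cubic in one variable.
rng = np.random.default_rng(0)
x_toy = rng.uniform(-1, 1, size=(200, 1))
y_toy = x_toy[:, 0] ** 3 - 0.5 * x_toy[:, 0] + 0.05 * rng.standard_normal(200)
best_degree, best_lambda, best_rmse_toy = cv_ridge(
    y_toy, x_toy, 4, np.logspace(-7, -3, 5), [1, 3, 5]
)
```

The score kept for each (degree, lambda) pair is the mean RMSE over the held-out folds, which corresponds to the "test rmse" reported in the outputs above.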

#### REGULARIZED LOGISTIC REGRESSION

Doing cross validation on subsets_0 for regularized logistic regression

In [17]:
best_degree,best_lambda,_ = cross_validation_demo_log(y_0, list_subsets[0], k_fold, lambdas, gamma, max_iters,degrees, how_many_trig_features[0])

The choice of lambda which leads to the best test logloss is 0.00000 with a test logloss of 0.362. The best degree is 3.0


Doing cross validation on subsets_1 for regularized logistic regression

In [18]:
best_degree,best_lambda,_ = cross_validation_demo_log(y_1, list_subsets[1], k_fold, lambdas, gamma, max_iters,degrees, how_many_trig_features[1])

The choice of lambda which leads to the best test logloss is 0.00000 with a test logloss of 0.419. The best degree is 3.0


Doing cross validation on subsets_2_3 for regularized logistic regression

In [19]:
best_degree,best_lambda,_ = cross_validation_demo_log(y_2_3, list_subsets[2], k_fold, lambdas, gamma, max_iters,degrees, how_many_trig_features[2])

The choice of lambda which leads to the best test logloss is 0.00000 with a test logloss of 0.377. The best degree is 3.0
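
The model being tuned here can be sketched as plain gradient descent on a penalized log-loss, using the same `gamma = 0.1` and `max_iters = 200` as above. This is an illustrative sketch, not the code in `implementations`: it assumes labels in {0, 1}, a mean (not summed) log-loss, and a penalty of the form `lambda * ||w||^2`.

```python
import numpy as np


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))


def reg_logistic_gd(y, tx, lambda_, gamma, max_iters):
    """Gradient descent on mean log-loss + lambda * ||w||^2 (y in {0, 1})."""
    w = np.zeros(tx.shape[1])
    for _ in range(max_iters):
        grad = tx.T @ (sigmoid(tx @ w) - y) / len(y) + 2 * lambda_ * w
        w -= gamma * grad
    return w


def log_loss(y, tx, w):
    """Mean negative log-likelihood, clipped to avoid log(0)."""
    p = np.clip(sigmoid(tx @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# Toy data: two Gaussian features, label 1 when their sum is positive.
rng = np.random.default_rng(0)
x_toy = rng.standard_normal((300, 2))
y_toy = (x_toy[:, 0] + x_toy[:, 1] > 0).astype(float)
tx_toy = np.column_stack([np.ones(300), x_toy])

w_toy = reg_logistic_gd(y_toy, tx_toy, lambda_=1e-7, gamma=0.1, max_iters=200)
loss_start = log_loss(y_toy, tx_toy, np.zeros(3))  # equals log(2) at w = 0
loss_end = log_loss(y_toy, tx_toy, w_toy)
```

Note that the smallest value on the grid, 1e-7, is displayed as `0.00000` when printed with five decimals, which is likely what the outputs above show.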


#### COMPUTING ACCURACY FOR RIDGE REGRESSION

In [30]:
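list_outputs = [y_0,y_1,y_2_3]
# Keep the first 60% of the rows of each subset, together with the matching labels.
# (The original indexing selected a single row instead of a slice.)
for idx in range(3):
    n_keep = int(0.6*list_subsets[idx].shape[0])
    list_subsets[idx] = list_subsets[idx][:n_keep,:]
    list_outputs[idx] = list_outputs[idx][:n_keep]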

Computing accuracy for ridge regression using optimal values as hyperparameters

In [14]:
compute_accuracy(list_outputs,list_subsets,0.7,[0.00001,0.00001,0.00001],[7,7,7],[0,0,0],pred_threshold = 0.5)

Average train accuracy: 0.8362282705107522
std train accuracy: 0.0005229317330015228
Average test accuracy: 0.8349103493434902
std test accuracy: 0.0008072756696673245
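
The averages and standard deviations above suggest accuracy measured over several random train/test splits. A minimal sketch of such a procedure for a single dataset is given below; it is not the code of `compute_accuracy`, and the number of repetitions, the {0, 1} label encoding, and the helper names are assumptions.

```python
import numpy as np


def split_data(y, tx, ratio, seed):
    """Random split: the first `ratio` share of shuffled rows is for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(ratio * len(y))
    return tx[idx[:cut]], y[idx[:cut]], tx[idx[cut:]], y[idx[cut:]]


def accuracy(y, tx, w, pred_threshold):
    """Share of rows where (tx @ w > threshold) agrees with the label."""
    return np.mean((tx @ w > pred_threshold) == (y == 1))


def repeated_accuracy(y, tx, ratio, lambda_, pred_threshold=0.5, seeds=range(5)):
    """Mean and std of train and test accuracy of ridge over several splits."""
    train_acc, test_acc = [], []
    for seed in seeds:
        tx_tr, y_tr, tx_te, y_te = split_data(y, tx, ratio, seed)
        n, d = tx_tr.shape
        a = tx_tr.T @ tx_tr + 2 * n * lambda_ * np.eye(d)
        w = np.linalg.solve(a, tx_tr.T @ y_tr)
        train_acc.append(accuracy(y_tr, tx_tr, w, pred_threshold))
        test_acc.append(accuracy(y_te, tx_te, w, pred_threshold))
    return np.mean(train_acc), np.std(train_acc), np.mean(test_acc), np.std(test_acc)


# Toy data with labels in {0, 1}.
rng = np.random.default_rng(0)
x_toy = rng.standard_normal((300, 2))
y_toy = (x_toy[:, 0] + x_toy[:, 1] > 0).astype(float)
tx_toy = np.column_stack([np.ones(300), x_toy])
mean_tr, std_tr, mean_te, std_te = repeated_accuracy(y_toy, tx_toy, 0.7, 1e-5)
```

A small gap between train and test accuracy, as in the output above, indicates that the chosen degree and lambda are not overfitting the training split.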


#### COMPUTING ACCURACY FOR REGULARIZED LOGISTIC REGRESSION

Computing accuracy for regularized logistic regression using optimal values as hyperparameters

In [54]:
compute_accuracy(list_outputs,list_subsets,0.7,[0.0000, 0.0000, 0.0000],[3,3,3],how_many_trig_features, pred_threshold=0.55,method = 'logistic',gamma = 0.1)

Average train accuracy: 0.8320193575709321
std train accuracy: 0.00046693834276353334
Average test accuracy: 0.8319370556540728
std test accuracy: 0.0010278248657757754
