# Implementations pipeline
##### In this notebook we run our implemented ML methods (regressions) and measure their accuracy
We begin by importing the libraries we need and initialising the constants used by the models.

In [None]:
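import sys

import numpy as np

# Locations of the helper scripts and of the training data
SCRIPTS_FILEPATH = "./../scripts/"
DATA_FILEPATH = "../data/train.csv"

# Make the project scripts importable
sys.path.append(SCRIPTS_FILEPATH)
from implementations import *
from compute import *
from data_cleaner import Data_Cleaner
from proj1_helpers import predict_labels

# Hyperparameters shared by the models below
lambda_ = 1e-6    # regularisation strength
max_iters = 1000  # number of iterations for the iterative methods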

We train our models on 3 different versions of the same dataset, so that we can compare the impact of feature engineering on our implementations.
1. Raw data: the data is loaded, and the missing values and the outliers are treated. Then the data is normalized.
2. Polynomial data: the data is loaded, and the missing values and the outliers are treated. Polynomial feature expansion is applied, then the data is normalized.
3. Interactions data: the data is loaded, and the missing values and the outliers are treated. Feature interaction is applied, then the data is normalized.
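
The two feature-engineering variants listed above can be sketched as follows. This is a minimal illustration, not the `Data_Cleaner` implementation: the helper names `expand_polynomial` and `add_interactions` are hypothetical, and the sketch assumes that polynomial expansion appends element-wise powers of each column up to a given degree and that feature interaction appends the products of all pairs of distinct columns.

```python
from itertools import combinations

import numpy as np


def expand_polynomial(X, degree):
    """Stack the element-wise powers 1..degree of every column (hypothetical helper)."""
    return np.hstack([X ** d for d in range(1, degree + 1)])


def add_interactions(X):
    """Append the product of every pair of distinct columns (hypothetical helper)."""
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    if not pairs:
        # A single column has no pairs to multiply
        return X.copy()
    return np.hstack([X, np.column_stack(pairs)])


# Toy matrix with 2 samples and 3 features (illustrative values only)
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
X_poly = expand_polynomial(X_demo, 2)  # 3 original columns + 3 squared columns
X_inter = add_interactions(X_demo)     # 3 original columns + 3 pairwise products
```

Both variants increase the number of columns, which is why `initial_w` is rebuilt from `tX_train.shape[1]` after each preprocessing run.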

In all cases the dataset is split in 2 so that we can estimate the performance of the model on held-out data:
- Training dataset (80%)
- Validation (test) dataset (20%)
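
An 80/20 split of this kind can be sketched as below. The helper `split_train_validation` is hypothetical, and whether `Data_Cleaner.split_data` shuffles the rows before splitting is an assumption here, not something the notebook states.

```python
import numpy as np


def split_train_validation(tX, y, train_percent, seed=0):
    """Shuffle the rows, then keep train_percent % of them for training (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(tX.shape[0])
    cut = int(tX.shape[0] * train_percent / 100)
    train_idx, val_idx = order[:cut], order[cut:]
    return tX[train_idx], tX[val_idx], y[train_idx], y[val_idx]


# Toy data: 10 samples, 2 features, labels in {-1, 1} (illustrative values only)
tX_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([1, -1] * 5)
tX_tr, tX_val, y_tr, y_val = split_train_validation(tX_toy, y_toy, 80)
```

Fixing the seed keeps the split reproducible, so the different models are compared on the same validation rows.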


## Raw data


In [None]:
data = Data_Cleaner(DATA_FILEPATH)
data._fill_with_NaN()
data.fix_mass_MMC()
data.replace_with_zero()
data.treat_outliers(1.5, 92.5)
data.normalize()

tX_train, tX_test, y_train, y_test = data.split_data(80)
initial_w = np.zeros(tX_train.shape[1])

In [None]:
tX_train.shape

We create our models using different regressions, each time on the same training set described above. We then predict the labels of the test set and measure the accuracy of our predictions.
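
The predict-then-score step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes labels in {-1, 1}, a decision threshold at 0 on the linear score, and that the score reported is plain accuracy. The names `predict_labels_sketch` and `accuracy` are hypothetical stand-ins for `predict_labels` and `compute_leaderboard_score`.

```python
import numpy as np


def predict_labels_sketch(w, tX):
    """Map the linear scores tX @ w to labels in {-1, 1} (assumed threshold at 0)."""
    scores = tX @ w
    return np.where(scores > 0, 1, -1)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(y_true == y_pred)


# Toy weights and samples (illustrative values only)
w_toy = np.array([1.0, -1.0])
tX_toy = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.5], [1.0, 4.0]])
y_toy = np.array([1, -1, -1, -1])

y_hat = predict_labels_sketch(w_toy, tX_toy)  # linear scores: 1, -3, 0.5, -3
score = accuracy(y_toy, y_hat)                # 3 of the 4 labels match
```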

In [None]:
w, loss = least_squares_GD(y_train, tX_train, np.copy(initial_w), max_iters, gamma=1e-1)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = least_squares_SGD(y_train, tX_train, np.copy(initial_w), max_iters, gamma=1e-3)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = least_squares(y_train, tX_train)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = ridge_regression(y_train, tX_train, lambda_)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)
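
For reference, the role of `lambda_` in the ridge step can be sketched with a closed-form solve. This is a minimal sketch, not necessarily what `ridge_regression` in `implementations` does: the function name `ridge_closed_form` is hypothetical, and the `2 * N * lambda_` scaling is one common convention that is assumed here.

```python
import numpy as np


def ridge_closed_form(y, tX, lambda_):
    """Solve (tX^T tX + 2 N lambda_ I) w = tX^T y (assumed scaling convention)."""
    n_samples, n_features = tX.shape
    gram = tX.T @ tX + 2 * n_samples * lambda_ * np.eye(n_features)
    return np.linalg.solve(gram, tX.T @ y)


# Toy problem whose exact least-squares solution is w = [1, 2]
tX_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, 2.0, 3.0])

w_unregularised = ridge_closed_form(y_toy, tX_toy, 0.0)  # reduces to least squares
w_regularised = ridge_closed_form(y_toy, tX_toy, 1.0)    # weights shrink towards 0
```

With `lambda_ = 0` the solve reduces to ordinary least squares, so a very small value such as `1e-6` should be expected to give results close to the `least_squares` cell.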

In [None]:
w, loss = logistic_regression(y_train, tX_train, np.copy(initial_w), max_iters, gamma=1e-6)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = reg_logistic_regression(y_train, tX_train, lambda_, np.copy(initial_w), max_iters, gamma=1e-5)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

## Polynomial data

In [None]:
data = Data_Cleaner(DATA_FILEPATH)
data._fill_with_NaN()
data.fix_mass_MMC()
data.replace_with_zero()
data.treat_outliers(1.5, 92.5)
data.build_polynomial(2)
data.normalize()

tX_train, tX_test, y_train, y_test = data.split_data(80)
initial_w = np.zeros(tX_train.shape[1])

In [None]:
w, loss = least_squares_GD(y_train, tX_train, np.copy(initial_w), max_iters, gamma=1e-1)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = least_squares_SGD(y_train, tX_train, np.copy(initial_w), max_iters, gamma=1e-3)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = least_squares(y_train, tX_train)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = ridge_regression(y_train, tX_train, lambda_)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = logistic_regression(y_train, tX_train, np.copy(initial_w), max_iters, gamma=1e-6)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = reg_logistic_regression(y_train, tX_train, lambda_, np.copy(initial_w), max_iters, gamma=1e-5)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

## Interactions data

In [None]:
data = Data_Cleaner(DATA_FILEPATH)
data._fill_with_NaN()
data.fix_mass_MMC()
data.replace_with_zero()
data.treat_outliers(1.5, 92.5)
data.build_interactions()
data.normalize()

tX_train, tX_test, y_train, y_test = data.split_data(80)
initial_w = np.zeros(tX_train.shape[1])

In [None]:
w, loss = least_squares(y_train, tX_train)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = ridge_regression(y_train, tX_train, lambda_)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = logistic_regression(y_train, tX_train, np.copy(initial_w), max_iters, gamma=1e-6)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)

In [None]:
w, loss = reg_logistic_regression(y_train, tX_train, lambda_, np.copy(initial_w), max_iters, gamma=1e-5)
y_pred = predict_labels(w, tX_test)

compute_leaderboard_score(y_test, y_pred)