# Regularization, Feature Scaling and Cross Validation


The program consists of two parts that interact with each other. We will call these parts the **Poxos** and the **Petros**. 
Fhe **Poxss** picks a **beta_star** vector, that corresponds to some green polynomial line and generates points around it with standard deviation **sigma**. The Poxos calls these points **x_1**, and **y_1** where x_1 and y_1 are vectors. The Poxos also generates another set of points called **x_2** and **y_2**. The Poxos is going to give **x_1** and **y_1** for the Petros to learn and also gives **x_2**, BUT NOT **y_2** to the Petros to test the output.


The **Petros**, takes the **x_1** and **y_1** values of the points but does not know the degree of the polynomial the Poxos used. The **Petros** has just learned regularization and wants to use that in this problem. That's why instead of trying multiple values for **d_train** he decides to use **ONLY** **d_train** = 9 and use **lambda** to see if he can avoid overfitting using regularization. He also knows, that because he wants his regularization to work well, then he needs to scale and standardize his data. To do so, when he constructs the **X** matrix with polynomial degrees, he then does features scaling to get a new matrix **X_scaled**. To do so, he calculates the mean and standard deviation (**mu** and **sigma**) for each COLUMN of **X** and does feature scaling as learned in class.

There is another question, however, how do you choose **lambda** ? The Petros decides that he will try multiple values for **lambda** and see which one will give him the best cross validation score. He will use **k** fold cross validation with **num_trials** random shufflings of the data, and determine the **BEST** lambda to use for the regression. When he determins the best **lambda** he will use that **lambda** to train the model on the entire **x_1**, **y_1** dataset and then apply the result on the **x_2** do determine **y_2_predicted** and return that to the Poxos. He will also print out cross validation scores for many values of the **lambda**.

Extra: the Petros realizses, that since he used feature scaling, the **beta_hat** he received at the end will be meaningless for the Poxos. Because the coordinate system is different. Can he find a way to construct **beta_hat_for_poxos** using the **beta_hat** which will make sense to the Poxos? 


After the Poxos receives the **y_2_predicted**, he will compare that to **y_2** to see how well the Petros learned.








In [1]:
import numpy as np
from numpy import linalg

In [2]:
def generate_x_vec(xmin, xmax, num_points):
    return np.matrix(np.random.uniform(low=xmin, high=xmax, size=num_points)).reshape(-1,1)
    

In [3]:
def linear_regression(X,Y):
    beta = linalg.inv((X.transpose() * X)) * X.transpose() * Y
    return beta

In [4]:
def generate_points(x_vec, sigma, beta_star):
    y =  x_vec * beta_star
    y_star = np.random.normal(y, sigma)
    beta_hat = linear_regression(x_vec, y_star)
    return y_star

In [5]:
sigma = 5
beta_star = [0.5,-0.2,1,1.2]

In [6]:
x_1 = generate_x_vec(10,30,200)
x_2 = generate_x_vec(10,30,200)

y_1 = generate_points(x_1, sigma, beta_star)
y_2 = generate_points(x_1, sigma, beta_star)

In [7]:
from sklearn.preprocessing import StandardScaler

scaler_sd = StandardScaler()

scaler_sd.fit(x_1)

x_1_scaled = scaler_sd.transform(x_1)
x_2_scaled = scaler_sd.transform(x_2)

In [8]:
def transform_x(x,d):
    X = []
    for i in range(1,d+1):
        X.append(((np.asarray(x)).flatten())**i)
        
    X = np.matrix(X).T
    return X

In [9]:
x_1_train = transform_x(x_1_scaled,9)
x_2_test= transform_x(x_2_scaled,9)

In [10]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

In [11]:
enet = ElasticNet()

In [12]:
alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
l1_ratio = np.arange(0, 1.1, 0.1)

param_grid = {'alpha':alpha,'l1_ratio':l1_ratio}

In [20]:
grid = GridSearchCV(enet,param_grid,cv=5,
                    scoring= 'neg_mean_squared_error',
                    verbose=5)

In [14]:
grid.fit(x_1_train,y_1)

Fitting 5 folds for each of 77 candidates, totalling 385 fits
[CV 1/5] END .....................alpha=0.0001, l1_ratio=0.0; total time=   0.0s
[CV 2/5] END .....................alpha=0.0001, l1_ratio=0.0; total time=   0.0s
[CV 3/5] END .....................alpha=0.0001, l1_ratio=0.0; total time=   0.0s
[CV 4/5] END .....................alpha=0.0001, l1_ratio=0.0; total time=   0.0s
[CV 5/5] END .....................alpha=0.0001, l1_ratio=0.0; total time=   0.0s
[CV 1/5] END .....................alpha=0.0001, l1_ratio=0.1; total time=   0.0s
[CV 2/5] END .....................alpha=0.0001, l1_ratio=0.1; total time=   0.0s
[CV 3/5] END .....................alpha=0.0001, l1_ratio=0.1; total time=   0.0s
[CV 4/5] END .....................alpha=0.0001, l1_ratio=0.1; total time=   0.0s
[CV 5/5] END .....................alpha=0.0001, l1_ratio=0.1; total time=   0.0s
[CV 1/5] END .....................alpha=0.0001, l1_ratio=0.2; total time=   0.0s
[CV 2/5] END .....................alpha=0.0001,

[CV 3/5] END ......................alpha=0.001, l1_ratio=1.0; total time=   0.0s
[CV 4/5] END ......................alpha=0.001, l1_ratio=1.0; total time=   0.0s
[CV 5/5] END ......................alpha=0.001, l1_ratio=1.0; total time=   0.0s
[CV 1/5] END .......................alpha=0.01, l1_ratio=0.0; total time=   0.0s
[CV 2/5] END .......................alpha=0.01, l1_ratio=0.0; total time=   0.0s
[CV 3/5] END .......................alpha=0.01, l1_ratio=0.0; total time=   0.0s
[CV 4/5] END .......................alpha=0.01, l1_ratio=0.0; total time=   0.0s
[CV 5/5] END .......................alpha=0.01, l1_ratio=0.0; total time=   0.0s
[CV 1/5] END .......................alpha=0.01, l1_ratio=0.1; total time=   0.0s
[CV 2/5] END .......................alpha=0.01, l1_ratio=0.1; total time=   0.0s
[CV 3/5] END .......................alpha=0.01, l1_ratio=0.1; total time=   0.0s
[CV 4/5] END .......................alpha=0.01, l1_ratio=0.1; total time=   0.0s
[CV 5/5] END ...............

[CV 3/5] END ........................alpha=0.1, l1_ratio=1.0; total time=   0.0s
[CV 4/5] END ........................alpha=0.1, l1_ratio=1.0; total time=   0.0s
[CV 5/5] END ........................alpha=0.1, l1_ratio=1.0; total time=   0.0s
[CV 1/5] END ..........................alpha=1, l1_ratio=0.0; total time=   0.0s
[CV 2/5] END ..........................alpha=1, l1_ratio=0.0; total time=   0.0s
[CV 3/5] END ..........................alpha=1, l1_ratio=0.0; total time=   0.0s
[CV 4/5] END ..........................alpha=1, l1_ratio=0.0; total time=   0.0s
[CV 5/5] END ..........................alpha=1, l1_ratio=0.0; total time=   0.0s
[CV 1/5] END ..........................alpha=1, l1_ratio=0.1; total time=   0.0s
[CV 2/5] END ..........................alpha=1, l1_ratio=0.1; total time=   0.0s
[CV 3/5] END ..........................alpha=1, l1_ratio=0.1; total time=   0.0s
[CV 4/5] END ..........................alpha=1, l1_ratio=0.1; total time=   0.0s
[CV 5/5] END ...............

[CV 2/5] END .........................alpha=10, l1_ratio=1.0; total time=   0.0s
[CV 3/5] END .........................alpha=10, l1_ratio=1.0; total time=   0.0s
[CV 4/5] END .........................alpha=10, l1_ratio=1.0; total time=   0.0s
[CV 5/5] END .........................alpha=10, l1_ratio=1.0; total time=   0.0s
[CV 1/5] END ........................alpha=100, l1_ratio=0.0; total time=   0.0s
[CV 2/5] END ........................alpha=100, l1_ratio=0.0; total time=   0.0s
[CV 3/5] END ........................alpha=100, l1_ratio=0.0; total time=   0.0s
[CV 4/5] END ........................alpha=100, l1_ratio=0.0; total time=   0.0s
[CV 5/5] END ........................alpha=100, l1_ratio=0.0; total time=   0.0s
[CV 1/5] END ........................alpha=100, l1_ratio=0.1; total time=   0.0s
[CV 2/5] END ........................alpha=100, l1_ratio=0.1; total time=   0.0s
[CV 3/5] END ........................alpha=100, l1_ratio=0.1; total time=   0.0s
[CV 4/5] END ...............

GridSearchCV(cv=5, estimator=ElasticNet(),
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
                         'l1_ratio': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])},
             scoring='neg_mean_squared_error', verbose=5)

In [15]:
g = grid.best_estimator_.get_params()
params = {'alpha':g['alpha'],'l1_ratio':g['l1_ratio']}
params

{'alpha': 0.1, 'l1_ratio': 1.0}

In [16]:
model = ElasticNet(alpha=params['alpha'],l1_ratio=params['l1_ratio'])

model.fit(x_1_train,y_1)

pred = model.predict(x_2_test)

In [17]:
from sklearn import metrics

print(f'MAE:{np.round((metrics.mean_absolute_error(y_2,pred)),5)}')

MAE:6.45303


In [18]:
print( f'RMSE:{np.round(np.sqrt(metrics.mean_squared_error(y_2,pred)),5)}')

RMSE:8.33292


In [19]:
for i in range(0,10):
    print(f'{i}\n y_pred--{pred[i]} \n y_2-----{y_2[i]} \n ----------------------------------------------------')

0
 y_pred--[13.83412983 -5.83029107 29.79853335 34.52404472] 
 y_2-----[ 5.77405875 -6.12963862 10.99154063  5.40874969] 
 ----------------------------------------------------
1
 y_pred--[ 9.95140448 -4.54401127 17.94839664 22.9321877 ] 
 y_2-----[13.12964279 -5.19278979  9.00493126 13.8910389 ] 
 ----------------------------------------------------
2
 y_pred--[ 9.20394167 -4.32844908 15.43044284 20.33390367] 
 y_2-----[ 4.6630268  -4.838919   17.83737103 25.03881109] 
 ----------------------------------------------------
3
 y_pred--[ 9.69890913 -4.49920174 17.13142839 22.0903429 ] 
 y_2-----[-3.22842935 -3.02728439 15.28795736 13.2945396 ] 
 ----------------------------------------------------
4
 y_pred--[10.88908331 -4.67492508 20.92256127 25.99720493] 
 y_2-----[ 9.04293487  0.32980195 19.76231485 23.74841697] 
 ----------------------------------------------------
5
 y_pred--[12.01771751 -4.91145875 24.41186663 29.59921454] 
 y_2-----[ 4.38968248 -4.42666473 18.49931145 23.80456674]