### Feature Scaling

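Two main ways:

* Standardization: rescale each feature to mean 0 and standard deviation 1
* Normalization: rescale each feature to the range 0 to 1 using Xmin and Xmax

The .fit() method simply calculates the necessary statistics (Xmin, Xmax, mean, standard deviation).

The .transform() method actually scales the data and returns the new scaled version.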
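To make the split between fitting and transforming concrete, here is a minimal pure-Python sketch for a single feature. `TinyStandardScaler` and `min_max_normalize` are hypothetical helpers written for illustration, not scikit-learn classes; the sketch assumes the population standard deviation is used.

```python
import statistics


class TinyStandardScaler:
    """Hypothetical, minimal stand-in for a standard scaler (one feature)."""

    def fit(self, values):
        # fit() only stores statistics; it does not change the data
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)  # population standard deviation
        return self

    def transform(self, values):
        # transform() applies the stored statistics: z = (x - mean) / std
        return [(v - self.mean_) / self.std_ for v in values]


def min_max_normalize(values, x_min, x_max):
    # Normalization: rescale using a previously computed min and max
    return [(v - x_min) / (x_max - x_min) for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

scaler = TinyStandardScaler().fit(train)   # statistics come from training data only
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)       # the training statistics are reused

normalized_train = min_max_normalize(train, min(train), max(train))

print(scaled_train)
print(scaled_test)
print(normalized_train)
```

Fitting on the training data only and reusing those statistics for the test data keeps information about the test set out of the preprocessing step.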

### Scaling

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df=pd.read_csv("C:\\Users\\2211444\\Desktop\\Udemy Python Masterclass\\UNZIP_FOR_NOTEBOOKS_FINAL\\08-Linear-Regression-Models\\Advertising.csv")

In [3]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [4]:
X=df.drop("sales",axis=1)

In [5]:
y=df["sales"]

In [6]:
from sklearn.preprocessing import PolynomialFeatures

In [7]:
polynomial_converter=PolynomialFeatures(degree=3,include_bias=False)

In [8]:
poly_features=polynomial_converter.fit_transform(X)

In [9]:
X.shape

(200, 3)

In [10]:
poly_features.shape

(200, 19)
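The jump from 3 to 19 columns can be checked by enumerating every product of the three original features up to degree 3. This is a small sketch using only the standard library; the assumption is that the terms are ordered by degree and then by feature, which agrees with the values printed for `poly_features[0]` further down (for example, the fourth value is 230.1 squared).

```python
from itertools import combinations_with_replacement
from math import comb

features = ["TV", "radio", "newspaper"]
degree = 3

# Every product of 1, 2 or 3 features, allowing repeats (e.g. TV*TV)
terms = [
    combo
    for d in range(1, degree + 1)
    for combo in combinations_with_replacement(features, d)
]

print(len(terms))                                  # number of polynomial terms
print(comb(len(features) + degree, degree) - 1)    # closed form without the bias column
print(terms[3], terms[4])                          # first two degree-2 terms
```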

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

In [13]:
X_train.shape

(140, 19)

### Scale the data

In [14]:
from sklearn.preprocessing import StandardScaler

In [15]:
scaler=StandardScaler() #creating an instance

In [16]:
#fit it only to the training set
scaler.fit(X_train)

StandardScaler()

In [17]:
scaled_X_train=scaler.transform(X_train)

In [18]:
scaled_X_test=scaler.transform(X_test)

In [21]:
scaled_X_train[0] #so we see that the values are scaled

array([ 0.49300171, -0.33994238,  1.61586707,  0.28407363, -0.02568776,
        1.49677566, -0.59023161,  0.41659155,  1.6137853 ,  0.08057172,
       -0.05392229,  1.01524393, -0.36986163,  0.52457967,  1.48737034,
       -0.66096022, -0.16360242,  0.54694754,  1.37075536])

In [22]:
poly_features[0]

array([2.30100000e+02, 3.78000000e+01, 6.92000000e+01, 5.29460100e+04,
       8.69778000e+03, 1.59229200e+04, 1.42884000e+03, 2.61576000e+03,
       4.78864000e+03, 1.21828769e+07, 2.00135918e+06, 3.66386389e+06,
       3.28776084e+05, 6.01886376e+05, 1.10186606e+06, 5.40101520e+04,
       9.88757280e+04, 1.81010592e+05, 3.31373888e+05])

### Ridge Regression: L2 regularization

Here lambda (the regularization strength) is referred to as alpha
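The effect of alpha is easiest to see with a single feature and no intercept, where the L2-penalized least-squares problem has a closed-form solution. The sketch below is an illustration under those simplifying assumptions; `ridge_slope` is a hypothetical helper, not part of scikit-learn.

```python
def ridge_slope(x, y, alpha):
    # Minimizes sum((y_i - w * x_i)^2) + alpha * w^2 for one feature, no intercept.
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + alpha).
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]  # exactly y = 2x

for alpha in (0.0, 1.0, 10.0):
    # A larger alpha shrinks the slope towards zero
    print(alpha, ridge_slope(x, y, alpha))
```

With alpha=0 the ordinary least-squares slope is recovered; increasing alpha pulls the coefficient towards zero without ever setting it exactly to zero.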

In [23]:
from sklearn.linear_model import Ridge

In [25]:
ridge_model=Ridge(alpha=10)

In [26]:
ridge_model.fit(scaled_X_train,y_train)

Ridge(alpha=10)

In [27]:
test_predictions=ridge_model.predict(scaled_X_test)

In [28]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

In [29]:
MAE=mean_absolute_error(y_test,test_predictions)

In [30]:
MAE

0.5774404204714166

In [31]:
RMSE=np.sqrt(mean_squared_error(y_test,test_predictions))

In [32]:
RMSE

0.8946386461319645

### How do we decide that alpha=10 is best? We need to perform cross-validation over a set of alpha values

In [33]:
from sklearn.linear_model import RidgeCV #cross validation

In [34]:
ridge_cv_model=RidgeCV(alphas=(0.1,1.0,10.0))

In [35]:
ridge_cv_model.fit(scaled_X_train,y_train)
#Here we are only using training set for hyperparameter tuning

RidgeCV(alphas=array([ 0.1,  1. , 10. ]))

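#### Here only the training set is used: part of it is held out as a validation set to check which alpha performs best, while the rest is used for fitting. With the default cv=None, RidgeCV uses an efficient form of leave-one-out cross-validation, so each training sample acts as the validation point once.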
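The selection procedure can be sketched by hand for a one-feature ridge model: for each candidate alpha, leave one sample out, fit on the rest, and score the prediction for the held-out sample. This is a naive illustration with made-up data and hypothetical helper names; RidgeCV reaches the leave-one-out result through a more efficient computation rather than refitting in a loop.

```python
def ridge_slope(x, y, alpha):
    # Closed-form ridge solution for one feature and no intercept
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


def loo_mse(x, y, alpha):
    # Leave-one-out: each sample is the validation point exactly once
    errors = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        w = ridge_slope(x_rest, y_rest, alpha)
        errors.append((y[i] - w * x[i]) ** 2)
    return sum(errors) / len(errors)


# Made-up data, roughly y = 2x with a little noise
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

alphas = (0.1, 1.0, 10.0)
scores = {a: loo_mse(x, y, a) for a in alphas}
best_alpha = min(scores, key=scores.get)   # lowest validation error wins

print(scores)
print(best_alpha)
```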

In [36]:
ridge_cv_model.alpha_ #this gives us the best performing alpha

0.1

### So the question arises: which scoring metric was used to choose the optimal alpha?

In [38]:
from sklearn.metrics import SCORERS #removed in scikit-learn 1.3; newer versions provide sklearn.metrics.get_scorer_names() instead

In [39]:
SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])

### Here we see neg_mean_absolute_error and neg_mean_squared_error instead of mean_absolute_error and mean_squared_error. Scorers follow a higher-is-better convention, so error metrics are negated.


So we run RidgeCV again, passing scoring='neg_mean_absolute_error' as a parameter
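A tiny sketch of why the sign flip matters: once an error metric is negated, the candidate with the highest score is the one with the lowest error, so every scorer can be maximized in the same way. The data and the `mae` helper are made up for illustration.

```python
def mae(y_true, y_pred):
    # Plain mean absolute error: lower is better
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 14.0]
predictions = {
    "model_a": [11.0, 12.0, 13.0],   # small errors
    "model_b": [13.0, 12.0, 11.0],   # larger errors
}

# Negated error: higher is better, like accuracy or r2
neg_mae = {name: -mae(y_true, pred) for name, pred in predictions.items()}
best = max(neg_mae, key=neg_mae.get)

print(neg_mae)
print(best)
```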

In [40]:
ridge_cv_model=RidgeCV(alphas=(0.1,1.0,10.0),scoring='neg_mean_absolute_error')

In [41]:
ridge_cv_model.fit(scaled_X_train,y_train)
#Here we are only using training set for hyperparameter tuning

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), scoring='neg_mean_absolute_error')

In [42]:
ridge_cv_model.alpha_ #this gives us the best performing alpha

0.1

### Predictions and Evaluation

In [43]:
test_predictions=ridge_cv_model.predict(scaled_X_test)

In [44]:
MAE=mean_absolute_error(y_test,test_predictions)

In [45]:
RMSE=np.sqrt(mean_squared_error(y_test,test_predictions))

In [46]:
MAE,RMSE

(0.4273774884346364, 0.6180719926958731)

With the cross-validated alpha (0.1), both MAE and RMSE on the test set are lower than with the hand-picked alpha=10 (0.427 vs 0.577 and 0.618 vs 0.895), so parameter tuning did find a better alpha value.

In [47]:
ridge_cv_model.coef_

array([ 5.40769392,  0.5885865 ,  0.40390395, -6.18263924,  4.59607939,
       -1.18789654, -1.15200458,  0.57837796, -0.1261586 ,  2.5569777 ,
       -1.38900471,  0.86059434,  0.72219553, -0.26129256,  0.17870787,
        0.44353612, -0.21362436, -0.04622473, -0.06441449])

In [49]:
ridge_cv_model.best_score_ #this gives neg score

-0.3749223340292963

### LASSO REGRESSION : L1 Regularization

In [50]:
from sklearn.linear_model import LassoCV

#### We can pass a list of alphas as we did for Ridge, or leave alphas as None so that the alpha values are set automatically

In [53]:
lasso_cv_model=LassoCV(eps=0.1,n_alphas=100,cv=5)
#eps = alpha_min / alpha_max
#n_alphas is the number of alphas along the regularization path
#cv=5 is the number of folds in k-fold cross-validation

In [54]:
lasso_cv_model.fit(scaled_X_train,y_train)

LassoCV(cv=5, eps=0.1)

In [55]:
lasso_cv_model.alpha_

0.4943070909225832

In [58]:
test_predictions=lasso_cv_model.predict(scaled_X_test)

In [59]:
MAE=mean_absolute_error(y_test,test_predictions)

In [60]:
RMSE=np.sqrt(mean_squared_error(y_test,test_predictions))

In [61]:
MAE,RMSE

(0.6541723161252864, 1.1308001022762542)

### So we see that lasso is not performing as well as Ridge by comparing their MAE and RMSE.


So what might be the cause?

* We check the lasso coefficients

In [62]:
lasso_cv_model.coef_

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

* Most of the coefficients are zero and only two are non-zero, so the model is effectively using only two features. Hence the poorer performance compared to ridge regression.


On the other hand, a model with only two features is easy to interpret, so whether this trade-off is worthwhile depends on the use case.
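Why lasso produces exact zeros can be seen in the one-feature case, where the L1-penalized solution is a soft-thresholding of the least-squares solution. The sketch below assumes the objective (1 / (2n)) * sum of squared errors + alpha * |w| and no intercept; `soft_threshold` and `lasso_slope` are hypothetical helpers written for illustration.

```python
def soft_threshold(z, alpha):
    # Shrinks z towards zero by alpha and clips to exactly zero inside [-alpha, alpha]
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0


def lasso_slope(x, y, alpha):
    # One feature, no intercept:
    # minimizes (1 / (2n)) * sum((y_i - w * x_i)^2) + alpha * |w|
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    norm = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, alpha) / norm


x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]  # exactly y = 2x

for alpha in (0.0, 0.5, 2.0):
    print(alpha, lasso_slope(x, y, alpha))
```

Once alpha exceeds the strength of the feature's correlation with the target, the coefficient becomes exactly zero, which is how a large alpha can switch off most features at once.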

### We can get much better performance by considering a wider range of alpha values (a smaller eps) and by increasing the maximum number of iterations

In [65]:
lasso_cv_model=LassoCV(eps=0.001,n_alphas=100,cv=5,max_iter=1000000)
#eps = alpha_min / alpha_max
#n_alphas is the number of alphas along the regularization path
#cv=5 is the number of folds in k-fold cross-validation

In [66]:
lasso_cv_model.fit(scaled_X_train,y_train)

LassoCV(cv=5, max_iter=1000000)

In [67]:
lasso_cv_model.alpha_

0.004943070909225833
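Since eps is the ratio alpha_min / alpha_max, the searched grid can be sketched as a geometric sequence from alpha_max down to alpha_max * eps. The sketch below assumes that spacing, and the value of `alpha_max` is an assumption inferred by dividing the reported alpha_ by eps; it is not printed anywhere in the notebook.

```python
def alpha_grid(alpha_max, eps, n_alphas):
    # Geometric spacing from alpha_max down to alpha_max * eps
    ratio = eps ** (1.0 / (n_alphas - 1))
    return [alpha_max * ratio ** i for i in range(n_alphas)]


alpha_max = 4.943070909225832  # assumption: inferred as reported alpha_ / eps

narrow = alpha_grid(alpha_max, eps=0.1, n_alphas=100)     # first LassoCV run
wide = alpha_grid(alpha_max, eps=0.001, n_alphas=100)     # second LassoCV run

print(narrow[0], narrow[-1])   # largest and smallest alpha with eps=0.1
print(wide[0], wide[-1])       # largest and smallest alpha with eps=0.001
```

Under this assumption, the alpha_ reported in both runs matches the smallest value of its grid, which suggests that cross-validation preferred the weakest regularization available each time; an even smaller eps may be worth trying.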

In [68]:
test_predictions=lasso_cv_model.predict(scaled_X_test)

In [69]:
MAE=mean_absolute_error(y_test,test_predictions)

In [70]:
RMSE=np.sqrt(mean_squared_error(y_test,test_predictions))

In [71]:
MAE,RMSE

(0.43350346185900707, 0.606314074898403)

In [72]:
lasso_cv_model.coef_

array([ 4.86023329,  0.12544598,  0.20746872, -4.99250395,  4.38026519,
       -0.22977201, -0.        ,  0.07267717, -0.        ,  1.77780246,
       -0.69614918, -0.        ,  0.12044132, -0.        , -0.        ,
       -0.        ,  0.        ,  0.        , -0.        ])

This time the model performs much better and keeps many more features.

So by searching over a wider range of alpha values and allowing more computation time, we get a better-performing model.

Compared with the ridge model, the MAE and RMSE are very close (0.434 vs 0.427 and 0.606 vs 0.618), although lasso achieves this with only 10 non-zero coefficients out of 19, whereas ridge keeps all 19.

* Thus lasso gives good performance and, unlike ridge, also helps with feature selection.
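The feature-selection effect can be read directly from the coefficient vector. The sketch below copies the values from the second `lasso_cv_model.coef_` output (the eps=0.001 run) and counts the entries that survived.

```python
# Coefficients copied from the lasso_cv_model.coef_ output of the eps=0.001 run
lasso_coefs = [
    4.86023329, 0.12544598, 0.20746872, -4.99250395, 4.38026519,
    -0.22977201, -0.0, 0.07267717, -0.0, 1.77780246,
    -0.69614918, -0.0, 0.12044132, -0.0, -0.0,
    -0.0, 0.0, 0.0, -0.0,
]

# Indices of the polynomial features that lasso kept (non-zero coefficient)
selected = [i for i, c in enumerate(lasso_coefs) if c != 0]
n_nonzero = len(selected)

print(n_nonzero, "of", len(lasso_coefs), "coefficients are non-zero")
print(selected)
```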

### Elastic Net: combination of Ridge and Lasso

In [73]:
from sklearn.linear_model import ElasticNetCV

In [76]:
elastic_model=ElasticNetCV(l1_ratio=[.1,.5,.7,.9,.95,.99,1],eps=0.001,
                         n_alphas=100,max_iter=100000)
#l1_ratio is the mix between the L1 and L2 penalties: 1 is pure lasso, 0 is pure ridge

In [77]:
elastic_model.fit(scaled_X_train,y_train)

ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], max_iter=100000)

In [78]:
elastic_model.l1_ratio

[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1]

In [79]:
elastic_model.l1_ratio_ #to get the best l1 value

1.0

So we see that the best l1_ratio is 1, which means the penalty is pure L1, i.e. we get only lasso.

Hence, among the l1_ratio values tried, lasso is the better model here
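How l1_ratio blends the two penalties can be sketched with a small function. The formula follows the penalty documented for scikit-learn's ElasticNet, alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2; `elastic_net_penalty` itself is a hypothetical helper and the coefficients are made up.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    # Weighted mix of the L1 norm and the squared L2 norm of the coefficients
    l1 = sum(abs(c) for c in coefs)
    l2_squared = sum(c * c for c in coefs)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2_squared


coefs = [1.0, -2.0, 0.0]   # made-up coefficients
alpha = 0.5

pure_lasso = elastic_net_penalty(coefs, alpha, l1_ratio=1.0)   # only the L1 term remains
pure_ridge = elastic_net_penalty(coefs, alpha, l1_ratio=0.0)   # only the L2 term remains
mixed = elastic_net_penalty(coefs, alpha, l1_ratio=0.5)

print(pure_lasso, mixed, pure_ridge)
```

With l1_ratio=1 the L2 term vanishes entirely, which is why the elastic net result coincides with lasso.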

In [80]:
elastic_model.alpha_

0.004943070909225833

In [85]:
test_predictions=elastic_model.predict(scaled_X_test)

In [86]:
MAE=mean_absolute_error(y_test,test_predictions)

In [87]:
RMSE=np.sqrt(mean_squared_error(y_test,test_predictions))

In [88]:
MAE,RMSE

(0.43350346185900707, 0.606314074898403)

Hence we get the same values as with lasso: elastic net selected pure lasso as the best model.