# Table of contents
* [Intro](#Sklearn)
* [CrossValidation](#CrossValidation)
    * [Train Test Split](#TrainTestSplit)
    * [Train Validation Test Split](#TrainValidationTestSplit)
    * [K-Fold Cross Validation](#K-FoldCrossValidation)
        * [cross_val_score](#cross_val_score)
        * [cross_validate](#cross_validate)
* [Grid Search](#GridSearch)
* [Model Evaluation using sklearn.metrics](#sklearn.metrics) 
 
https://www.geeksforgeeks.org/how-to-add-a-table-of-contents-in-the-jupyter-notebook/

<a name="Sklearn"></a><h1>Sklearn</h1>
NumPy + SciPy + matplotlib => scikit-learn (sklearn)

- Fast and efficient
- Expects data as arrays (NumPy ndarray); pandas DataFrames are also accepted and converted internally
- Excellent documentation
- Numerically stable (handles very small and very large values well)
- Wide variety of algorithms
    - Regression
    - Classification
    - Clustering
    - Support Vector Machines
    - Dimensionality reduction
- Not the best choice for deep learning; dedicated libraries are better suited

<a name="CrossValidation"></a><h1>Cross Validation</h1>

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

df=pd.read_csv('./resources/Advertising.csv')
df

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,9.7
197,177.0,9.3,6.4,12.8
198,283.6,42.0,66.2,25.5


In [18]:
# define X and y
X = df.drop("sales",axis=1)
y = df["sales"]

<a name="TrainTestSplit"></a><h3>Train Test Split</h3>

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.20,random_state=42, shuffle=True)
print("Total Size::",len(X),
     "X_train::",len(X_train),
     "y_train::",len(y_train), 
     "X_test::",len(X_test),
     "y_test::",len(y_test))

Total Size:: 200 X_train:: 160 y_train:: 160 X_test:: 40 y_test:: 40


<a name="TrainValidationTestSplit"></a><h3>Train Validation Test Split</h3>

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_other, y_train, y_other = train_test_split(X,y,test_size =0.3,random_state=42, shuffle=True)
X_validate, X_test, y_validate, y_test = train_test_split(X_other,y_other,test_size =0.5,random_state=42, shuffle=True)


print("Total Size::",len(X),
     "X_train::",len(X_train),
     "y_train::",len(y_train),
     "X_validate::",len(X_validate),
     "y_validate::",len(y_validate),
     "X_test::",len(X_test),
     "y_test::",len(y_test))

Total Size:: 200 X_train:: 140 y_train:: 140 X_validate:: 30 y_validate:: 30 X_test:: 30 y_test:: 30


<a name="K-FoldCrossValidation"></a><h3>K-Fold Cross Validation</h3>

https://scikit-learn.org/stable/modules/cross_validation.html

<p>However, by partitioning the available data into three sets,
   we drastically reduce the number of samples
   which can be used for learning the model,
   and the results can depend on a particular random choice for the pair of
   (train, validation) sets.
</p>
<p>A solution to this problem is a procedure called
   <a class="reference external" href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">cross-validation</a>
   (CV for short).
   A test set should still be held out for final evaluation,
   but the validation set is no longer needed when doing CV.
   In the basic approach, called <em>k</em>-fold CV,
   the training set is split into <em>k</em> smaller sets
   (other approaches are described below,
   but generally follow the same principles).
   The following procedure is followed for each of the <em>k</em> “folds”:
</p>
<ul class="simple">
   <li>
      <p>
         A model is trained using <em>k</em> − 1
         of the folds as training data;
      </p>
   </li>
   <li>
      <p>the resulting model is validated on the remaining part of the data
         (i.e., it is used as a test set to compute a performance measure
         such as accuracy).
      </p>
   </li>
</ul>
<p>The performance measure reported by <em>k</em>-fold cross-validation
   is then the average of the values computed in the loop.
   This approach can be computationally expensive,
   but does not waste too much data
   (as is the case when fixing an arbitrary validation set),
   which is a major advantage in problems such as inverse inference
   where the number of samples is very small.
</p>
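
The procedure described above can be sketched by hand with `KFold`. This is only an illustration: it uses synthetic data from `make_regression` as a stand-in for the Advertising data, and the names `X_demo`, `y_demo` and `fold_mse` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the Advertising dataset)
X_demo, y_demo = make_regression(n_samples=160, n_features=3,
                                 noise=10.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mse = []
for train_idx, val_idx in kf.split(X_demo):
    # Train on k-1 folds
    fold_model = Ridge(alpha=1.0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Validate on the remaining fold
    y_val_pred = fold_model.predict(X_demo[val_idx])
    fold_mse.append(mean_squared_error(y_demo[val_idx], y_val_pred))

# The reported CV score is the average over the k folds
cv_mse = np.mean(fold_mse)
print(len(fold_mse), cv_mse)
```

`cross_val_score` and `cross_validate` in the next sections wrap essentially this loop.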

<a name="cross_val_score"></a><h3>K-Fold Cross Validation - cross_val_score</h3>

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.20,random_state=42, shuffle=True)
print("Total Size::",len(X),
     "X_train::",len(X_train),
     "y_train::",len(y_train), 
     "X_test::",len(X_test),
     "y_test::",len(y_test))

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train

Total Size:: 200 X_train:: 160 y_train:: 160 X_test:: 40 y_test:: 40


array([[-4.04248386e-01, -1.02823707e+00, -3.37675384e-01],
       [ 3.20607716e-01, -9.19827737e-01, -1.16143931e+00],
       [-1.27051084e+00,  2.59123702e-01,  2.54250789e-01],
       [-1.04235941e+00, -6.96233499e-01, -5.74445854e-01],
       [ 8.79103401e-01, -1.38734296e+00, -7.07629243e-01],
       [-1.32873699e+00, -1.29926038e+00, -7.96418169e-01],
       [-9.43731452e-01, -4.65863678e-01,  5.35415722e-01],
       [-3.23140256e-02,  6.94073782e-02, -5.34984109e-01],
       [-5.39713297e-01, -1.16374872e+00,  2.19721762e-01],
       [-8.75998996e-01,  3.13328366e-01, -6.87898371e-01],
       [-8.53421511e-01,  1.62101588e+00,  2.24654481e-01],
       [ 2.18414888e-01, -1.06889056e+00, -8.45745350e-01],
       [-1.67928215e+00,  1.76330312e+00,  2.22240532e+00],
       [-1.68997675e+00,  1.08574483e+00,  1.01882210e+00],
       [-8.74810708e-01, -1.49575229e+00, -7.47090988e-01],
       [-2.45017701e-01, -1.16374872e+00,  6.68075010e-02],
       [-9.10459368e-01, -3.98107848e-01

In [32]:
model = Ridge(alpha=100)

#https://scikit-learn.org/stable/modules/model_evaluation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model,X_train,y_train,scoring='neg_mean_squared_error',cv=5)
scores

array([-6.78339464, -6.39262409, -8.22891388, -8.33661617, -7.14791425])

In [33]:
abs(scores.mean())

7.377892606778351

In [34]:
model2 = Ridge(alpha=1)

#scoring options are available here:: https://scikit-learn.org/stable/modules/model_evaluation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model2,X_train,y_train,scoring='neg_mean_squared_error',cv=5)
abs(scores.mean())

2.945615462110152

In [37]:
# cross_val_score only fits clones of the model internally, so model2 itself is still unfitted; fit it on the training set before predicting
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred) 

3.1941558922079643
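
In the cells above the scaler is fit on the whole training set before `cross_val_score` runs, so every validation fold has already contributed to the scaling statistics. The effect is usually small, but one common way to avoid it is to put the scaler and the model in a `Pipeline`, so the scaler is refit inside each fold. A minimal sketch, assuming synthetic data in place of the Advertising data (`X_demo`, `y_demo`, `pipe` and the other names are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Advertising dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=3,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# The scaler is fit only on the training part of each CV split
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_scores = cross_val_score(pipe, X_tr, y_tr,
                              scoring="neg_mean_squared_error", cv=5)
cv_mse = -pipe_scores.mean()

# Final fit on the full training set, then evaluate on the held-out test set
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(cv_mse, test_r2)
```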

<a name="cross_validate"></a><h3>K-Fold Cross Validation - cross_validate</h3>

In [38]:
from sklearn.model_selection import cross_validate
model = Ridge(alpha=100)
scores = cross_validate(model,X_train,y_train,scoring=['neg_mean_absolute_error','neg_mean_squared_error','neg_root_mean_squared_error'],cv=10)
scores

{'fit_time': array([0.00100899, 0.00099802, 0.00102997, 0.        , 0.00097799,
        0.00127673, 0.00099969, 0.        , 0.0009985 , 0.00100183]),
 'score_time': array([0.00100112, 0.0010004 , 0.00097084, 0.00102186, 0.00200009,
        0.00101376, 0.        , 0.0010016 , 0.        , 0.        ]),
 'test_neg_mean_absolute_error': array([-1.53398301, -1.91288522, -1.83790899, -1.81848311, -2.13593868,
        -2.18146944, -1.88032224, -2.52485994, -2.34399767, -1.14916067]),
 'test_neg_mean_squared_error': array([ -6.40088697,  -5.55929707,  -6.80689014,  -5.08742434,
         -6.43411081,  -8.1391262 ,  -5.01705767,  -9.63369341,
        -11.64954439,  -2.09027682]),
 'test_neg_root_mean_squared_error': array([-2.52999742, -2.35781616, -2.60900175, -2.25553194, -2.53655491,
        -2.85291539, -2.23987894, -3.10381917, -3.41314289, -1.44577897])}

In [41]:
scores_df = pd.DataFrame(scores)

In [42]:
scores_df.mean()

fit_time                            0.000829
score_time                          0.000801
test_neg_mean_absolute_error       -1.931901
test_neg_mean_squared_error        -6.681831
test_neg_root_mean_squared_error   -2.534444
dtype: float64
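
The `neg_` scorers return negated errors so that "higher is better" holds for every metric. For reporting it can be easier to flip the sign back. A small sketch, again on synthetic data (the names `cv_out`, `errors_df` and `summary` are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: not the Advertising dataset)
X_demo, y_demo = make_regression(n_samples=160, n_features=3,
                                 noise=10.0, random_state=42)

cv_out = cross_validate(
    Ridge(alpha=1.0), X_demo, y_demo,
    scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    cv=10,
)
cv_df = pd.DataFrame(cv_out)

# Keep only the negated score columns and turn them into positive errors
neg_cols = [c for c in cv_df.columns if c.startswith("test_neg_")]
errors_df = -cv_df[neg_cols]
errors_df.columns = [c.replace("test_neg_", "") for c in neg_cols]

# Mean and spread of each error across the 10 folds
summary = errors_df.agg(["mean", "std"])
print(summary)
```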

<a name="GridSearch"></a><h3>Grid Search</h3>

In [46]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

base_estimator = ElasticNet()
param_grid = {'alpha':[0.1,1,5,10,50,100],
              'l1_ratio':[.1,.5,.7,.95,.99,1]}

grid_model = GridSearchCV(estimator=base_estimator,
                          param_grid=param_grid,
                         scoring='neg_mean_squared_error',
                         cv=5,verbose=2)
grid_model.fit(X_train,y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.1; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.5; total time=   0.0s
[CV] END ............................alpha=0.1, l1_ratio=0.7; total time=   0.0s
[CV] END ............................alpha=0.1,

In [47]:
grid_model.best_estimator_

In [49]:
pd.DataFrame(grid_model.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.0006,0.00049,0.0002,0.0004,0.1,0.1,"{'alpha': 0.1, 'l1_ratio': 0.1}",-3.886993,-1.942023,-2.589727,-2.86631,-4.319818,-3.120974,0.867296,6
1,0.000499,0.000445,0.000402,0.000492,0.1,0.5,"{'alpha': 0.1, 'l1_ratio': 0.5}",-3.890509,-1.738867,-2.443394,-2.745195,-4.288074,-3.021208,0.93969,5
2,0.001115,0.000226,0.000192,0.000384,0.1,0.7,"{'alpha': 0.1, 'l1_ratio': 0.7}",-3.906669,-1.644888,-2.377283,-2.691004,-4.26998,-2.977965,0.97489,4
3,0.002101,0.001957,0.000604,0.000493,0.1,0.95,"{'alpha': 0.1, 'l1_ratio': 0.95}",-3.943018,-1.537841,-2.292364,-2.630927,-4.253823,-2.931595,1.021061,3
4,0.001135,0.000194,0.000398,0.000487,0.1,0.99,"{'alpha': 0.1, 'l1_ratio': 0.99}",-3.950681,-1.522053,-2.279191,-2.622128,-4.251968,-2.925204,1.028584,2
5,0.001005,2e-06,0.000701,0.000399,0.1,1.0,"{'alpha': 0.1, 'l1_ratio': 1}",-3.952682,-1.518171,-2.275935,-2.620003,-4.251537,-2.923666,1.030466,1
6,0.000504,0.000447,0.00049,0.000616,1.0,0.1,"{'alpha': 1, 'l1_ratio': 0.1}",-7.696654,-7.310087,-9.586634,-9.800923,-7.948916,-8.468643,1.023055,12
7,0.000465,0.000579,0.0,0.0,1.0,0.5,"{'alpha': 1, 'l1_ratio': 0.5}",-6.389786,-5.892301,-8.356518,-8.797368,-7.108256,-7.308846,1.113986,11
8,0.001286,0.000364,0.000102,0.000204,1.0,0.7,"{'alpha': 1, 'l1_ratio': 0.7}",-5.596964,-5.059916,-7.40809,-7.91996,-6.502475,-6.497481,1.070533,10
9,0.0,0.0,0.000509,0.000449,1.0,0.95,"{'alpha': 1, 'l1_ratio': 0.95}",-4.494258,-3.89497,-5.947435,-6.579584,-5.615657,-5.306381,0.977678,9


In [51]:
y_pred = grid_model.predict(X_test)
mean_squared_error(y_test,y_pred)

3.208876825052843
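
After fitting, `GridSearchCV` exposes the winning combination through `best_params_` and its mean CV score through `best_score_`. With the default `refit=True`, the best estimator is refit on all the data passed to `fit`, which is why `grid_model.predict` works directly. A minimal sketch on synthetic data with a smaller grid (the names `grid`, `results` and `ranked` are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the Advertising dataset)
X_demo, y_demo = make_regression(n_samples=160, n_features=3,
                                 noise=10.0, random_state=42)

grid = GridSearchCV(
    estimator=ElasticNet(),
    param_grid={"alpha": [0.1, 1, 10], "l1_ratio": [0.1, 0.5, 1]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_demo, y_demo)

best_params = grid.best_params_   # dict with the best alpha and l1_ratio
best_cv_mse = -grid.best_score_   # flip the sign of the negated MSE

# Rank all candidates, best first
results = pd.DataFrame(grid.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["param_alpha", "param_l1_ratio", "mean_test_score", "std_test_score"]
]
print(best_params, best_cv_mse)
print(ranked.head())
```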

<a name="sklearn.metrics"></a><h1>Model Evaluation using sklearn.metrics</h1>
https://scikit-learn.org/stable/modules/model_evaluation

In [64]:
y_test = [1,2,3]
y_pred = [1.1,2.2,3.3]

from sklearn.metrics import mean_squared_error
val_mse = mean_squared_error(y_test,y_pred)
val_mse

0.04666666666666666

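<p><code>sklearn.metrics.mean_squared_error</code> accepts a <code>squared</code> keyword argument (default <code>True</code>). Passing <code>squared=False</code> returns the RMSE instead of the MSE. Note that <code>squared</code> was deprecated in scikit-learn 1.4 in favour of <code>sklearn.metrics.root_mean_squared_error</code>, so on recent versions use that function or take the square root yourself.</p>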

In [65]:
# squared=False returns the RMSE instead of the MSE
val_rmse = mean_squared_error(y_test,y_pred,squared=False)
val_rmse

0.21602468994692867

In [67]:
# Alternatively
from math import sqrt
val_rmse = sqrt(mean_squared_error(y_test,y_pred))
val_rmse

0.21602468994692867
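
To check the numbers above by hand: the residuals are 0.1, 0.2 and 0.3, so MSE = (0.01 + 0.04 + 0.09) / 3 ≈ 0.0467, and RMSE is its square root. The same arithmetic in NumPy (the names `y_true` and `y_hat` are illustrative):

```python
import numpy as np

y_true = np.array([1, 2, 3])
y_hat = np.array([1.1, 2.2, 3.3])

residuals = y_true - y_hat        # approximately [-0.1, -0.2, -0.3]
mse = np.mean(residuals ** 2)     # mean of the squared residuals
rmse = np.sqrt(mse)               # square root of the MSE
print(mse, rmse)
```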

# Regression shortcuts
<ul>
    <li>reg = LinearRegression()</li>
    <li>reg.fit(X,y)</li>
    <li>reg.predict(X)</li>
    <li>r2=reg.score(X,y) #r2</li>
    <li>reg.coef_</li>
    <li>reg.intercept_</li> 
    <li>adjusted_r2=1-(1-r2)*(n-1)/(n-p-1) where n=X.shape[0] and p=X.shape[1]</li> 
</ul>
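
The shortcuts above, put together as a runnable sketch. The data is synthetic (`make_regression`), and `X_demo`, `y_demo` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: not the Advertising dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=5.0, random_state=0)

reg = LinearRegression()
reg.fit(X_demo, y_demo)
y_fit = reg.predict(X_demo)

# score() returns R^2 for regressors
r2 = reg.score(X_demo, y_demo)

# Adjusted R^2 penalises the number of predictors p
n, p = X_demo.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(reg.coef_, reg.intercept_)
print(r2, adjusted_r2)
```

Because (n-1)/(n-p-1) is greater than 1 whenever p > 0, the adjusted value is never above the plain R².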