This is the Walmart project for Jedha's certification block 3, which covers supervised machine learning.

Author : Youenn PATAT

<img src="https://upload.wikimedia.org/wikipedia/commons/c/ca/Walmart_logo.svg" alt="WALMART LOGO" />

The main goal of this project is to estimate weekly sales in Walmart stores using a supervised ML model.

Import all the libraries and functions we need.

In [40]:
import pandas as pd
import numpy as np
import plotly.express as px 
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

# Part 1 : EDA and preprocessing

## a) Exploration of the dataset

In [41]:
df = pd.read_csv("Walmart_Store_sales.csv")
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092


In [42]:
print("Number of rows :", df.shape[0])

print("Basics statistics")
display(df.describe(include="all"))

print("percentage of missing values:")
display(100 * df.isnull().sum() / df.shape[0])

Number of rows : 150
Basics statistics


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,150.0,132,136.0,138.0,132.0,136.0,138.0,135.0
unique,,85,,,,,,
top,,19-10-2012,,,,,,
freq,,4,,,,,,
mean,9.866667,,1249536.0,0.07971,61.398106,3.320853,179.898509,7.59843
std,6.231191,,647463.0,0.271831,18.378901,0.478149,40.274956,1.577173
min,1.0,,268929.0,0.0,18.79,2.514,126.111903,5.143
25%,4.0,,605075.7,0.0,45.5875,2.85225,131.970831,6.5975
50%,9.0,,1261424.0,0.0,62.985,3.451,197.908893,7.47
75%,15.75,,1806386.0,0.0,76.345,3.70625,214.934616,8.15


percentage of missing values:


Store            0.000000
Date            12.000000
Weekly_Sales     9.333333
Holiday_Flag     8.000000
Temperature     12.000000
Fuel_Price       9.333333
CPI              8.000000
Unemployment    10.000000
dtype: float64

Most columns have some missing values, which we will handle during preprocessing. At first sight, no value looks absurd, but we will check this more carefully in the pandas preprocessing part to decide whether some rows should be dropped.

In [43]:

def display_bar(f):
    fig = px.bar(df, x = f, y =  "Weekly_Sales", height=600, width=800)
    fig.show()
    
display_bar('Store')

In [44]:
display_bar('Date')

In [45]:
display_bar('Holiday_Flag')

In [46]:
def display_scatter(f):
    fig = px.scatter(df, x = f, y =  "Weekly_Sales", height=600, width=800)
    fig.show()

display_scatter('Temperature')

In [47]:
display_scatter('Fuel_Price')

In [48]:
display_scatter('CPI')

In [49]:
display_scatter('Unemployment')

In [50]:
fig = px.scatter_matrix(df)
fig.update_layout(
        title = go.layout.Title(text = "Bivariate analysis", x = 0.5), showlegend = False, 
            autosize=False, height=1200, width = 1200)
fig.show()

In this scatter matrix, we don't see any obvious collinearity between features.
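The visual check above can be complemented with a numeric one. The following is a minimal sketch on synthetic data (the values, the extra `CPI_copy` column and the 0.8 threshold are assumptions made for illustration, not part of the Walmart dataset): it computes the Pearson correlation matrix and lists the pairs of columns whose absolute correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns (made-up values, illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Temperature": rng.normal(60, 18, 200),
    "Fuel_Price": rng.normal(3.3, 0.5, 200),
    "CPI": rng.normal(180, 40, 200),
})
# Hypothetical column that is deliberately almost collinear with CPI
toy["CPI_copy"] = toy["CPI"] * 2 + rng.normal(0, 1, 200)

# Pearson correlation matrix; values close to +1 or -1 flag collinear pairs
corr = toy.corr()

# List each pair of distinct columns above the (arbitrary) threshold
threshold = 0.8
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

On the real data, the same idea would be `df.corr(numeric_only=True)`; an empty list would support the conclusion drawn from the scatter matrix.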

## b) Preparation of the dataset (preprocessing with pandas)

* **We drop the rows where the target value (weekly sales) is missing**.

In [51]:
df = df.dropna(subset=["Weekly_Sales"])
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092
5,4.0,28-05-2010,1857533.7,0.0,,2.756,126.160226,7.896


* **Create usable features from the date**: *Years*, *Months*, *Day*, *DayOfWeek*

In [52]:
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

df["Years"] = df["Date"].dt.year
df["Months"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["DayOfWeek"] = df["Date"].dt.dayofweek

df = df.drop("Date", axis=1)

df.head()

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Years,Months,Day,DayOfWeek
0,6.0,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,4.0
1,13.0,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,4.0
3,11.0,1244390.03,0.0,84.57,,214.556497,7.346,,,,
4,6.0,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,4.0
5,4.0,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,4.0
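The table above shows the date features as floats (`2011.0`, `4.0`) and as empty cells where the date was missing. The small sketch below, on a made-up three-row frame, reproduces that behaviour: a missing date becomes `NaT`, and the derived columns then contain `NaN`, which forces a float dtype.

```python
import pandas as pd

# Toy frame with one missing date (two of the dates are taken from the head() output)
toy = pd.DataFrame({"Date": ["18-02-2011", None, "28-05-2010"]})

# Parse day-month-year strings; the missing entry becomes NaT
toy["Date"] = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

# Derived features inherit the missing value as NaN, hence the float dtype
toy["Years"] = toy["Date"].dt.year
toy["DayOfWeek"] = toy["Date"].dt.dayofweek  # Monday=0, ..., Friday=4

print(toy)
```

Both known dates fall on a Friday (`DayOfWeek` equal to 4), as do all the rows shown in the table above, so this feature may carry little information for the model.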


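* **Drop rows containing invalid values or outliers:**
In this project, a value of a numeric feature is considered an outlier when it falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$. This applies to the columns *Temperature*, *Fuel_Price*, *CPI* and *Unemployment*. Note that the comparisons used in the filter evaluate to False for missing values, so rows with a missing value in any of these columns are dropped as well.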

In [53]:
print("Dropping outliers in the following columns : Temperature, Fuel_Price, CPI and Unemployment...")

col_concerned = ["Temperature", "Fuel_Price", "CPI", "Unemployment"]

for col in col_concerned:
    to_keep = (df[col] < df[col].mean() + 3 * df[col].std()) & (df[col] > df[col].mean() - 3 * df[col].std())
    df = df.loc[to_keep, :]

print("...Done, number of lines remaining : ", df.shape[0])

df.head()

Dropping outliers in the following columns : Temperature, Fuel_Price, CPI and Unemployment...
...Done, number of lines remaining :  90


Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Years,Months,Day,DayOfWeek
0,6.0,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,4.0
1,13.0,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,4.0
4,6.0,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,4.0
6,15.0,695396.19,0.0,69.8,4.069,134.855161,7.658,2011.0,6.0,3.0,4.0
7,20.0,2203523.2,0.0,39.93,3.617,213.023622,6.961,2012.0,2.0,3.0,4.0
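The filter above is built from `<` and `>` comparisons, which return False for `NaN`. The sketch below uses a made-up `Temperature` column to show the effect, together with one possible variant (an assumption, not what the notebook does) that keeps the rows with missing values so that an imputer can fill them later.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and no real outlier (made-up values)
toy = pd.DataFrame({"Temperature": [60.0, 62.0, np.nan, 58.0, 61.0]})

mean = toy["Temperature"].mean()  # NaN is ignored when computing the statistics
std = toy["Temperature"].std()

# Same rule as in the notebook: strictly inside [mean - 3*std, mean + 3*std]
in_range = (toy["Temperature"] > mean - 3 * std) & (toy["Temperature"] < mean + 3 * std)

# Variant: also keep rows where the value is missing
keep_missing = in_range | toy["Temperature"].isna()

print("rows kept by the 3-sigma rule:", in_range.sum())
print("rows kept when missing values are preserved:", keep_missing.sum())
```

With the first mask, the row holding `NaN` is removed even though it is not an outlier. Whether that is desirable depends on how much data can be spared.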


In [54]:
df["Store"].value_counts()

Store
3.0     9
18.0    7
7.0     7
13.0    7
1.0     6
19.0    6
5.0     5
4.0     5
6.0     4
14.0    4
20.0    4
8.0     4
10.0    4
2.0     4
9.0     3
17.0    3
16.0    3
15.0    3
11.0    2
Name: count, dtype: int64

In [55]:
print("Number of rows :", df.shape[0])

print("Basics statistics")
display(df.describe(include="all"))

print("percentage of missing values:")
display(100 * df.isnull().sum() / df.shape[0])

Number of rows : 90
Basics statistics


Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Years,Months,Day,DayOfWeek
count,90.0,90.0,80.0,90.0,90.0,90.0,90.0,80.0,80.0,80.0,80.0
mean,9.9,1233865.0,0.075,61.061,3.318444,179.524905,7.389733,2010.8875,6.3625,16.125,4.0
std,6.204475,664725.0,0.265053,17.74604,0.484399,39.554303,0.982729,0.826672,3.028321,8.521566,0.0
min,1.0,268929.0,0.0,18.79,2.548,126.128355,5.143,2010.0,1.0,1.0,4.0
25%,4.0,561724.0,0.0,45.3425,2.81475,132.602339,6.64225,2010.0,4.0,10.0,4.0
50%,9.0,1260826.0,0.0,61.45,3.468,197.166416,7.419,2011.0,6.0,16.5,4.0
75%,15.75,1807159.0,0.0,75.7925,3.73775,214.855374,8.099,2012.0,8.25,23.25,4.0
max,20.0,2771397.0,1.0,91.65,4.17,226.968844,9.342,2012.0,12.0,31.0,4.0


percentage of missing values:


Store            0.000000
Weekly_Sales     0.000000
Holiday_Flag    11.111111
Temperature      0.000000
Fuel_Price       0.000000
CPI              0.000000
Unemployment     0.000000
Years           11.111111
Months          11.111111
Day             11.111111
DayOfWeek       11.111111
dtype: float64

* **Separation of the target Y from the features X**

In [56]:
print("Separating labels from features...")
target_variable = "Weekly_Sales"

X = df.drop(target_variable, axis = 1)
Y = df.loc[:, target_variable]

print("...Done.")
print()

print("Y : ")
print(Y.head())
print()
print("X :")
print(X.head())

Separating labels from features...
...Done.

Y : 
0    1572117.54
1    1807545.43
4    1644470.66
6     695396.19
7    2203523.20
Name: Weekly_Sales, dtype: float64

X :
   Store  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment  \
0    6.0           NaN        59.61       3.045  214.777523         6.858   
1   13.0           0.0        42.38       3.435  128.616064         7.470   
4    6.0           0.0        78.89       2.759  212.412888         7.092   
6   15.0           0.0        69.80       4.069  134.855161         7.658   
7   20.0           0.0        39.93       3.617  213.023622         6.961   

    Years  Months   Day  DayOfWeek  
0  2011.0     2.0  18.0        4.0  
1  2011.0     3.0  25.0        4.0  
4  2010.0     5.0  28.0        4.0  
6  2011.0     6.0   3.0        4.0  
7  2012.0     2.0   3.0        4.0  


## c) Preprocessing with Sklearn

Now let's do the Sklearn preprocessing!

* **Split into train and test sets**

In [57]:
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")


Dividing into train and test sets...
...Done.


* **Creation of the pipeline for preprocessing, identifying the numerical and categorical features.**

In [58]:
# Create pipeline for numeric features
numeric_features = ["Temperature", "Fuel_Price", "CPI", "Unemployment", "Years", "Months", "Day", "DayOfWeek"]  
numeric_transformer = Pipeline(
    steps=[
        (
            "imputer",
            SimpleImputer(strategy="median"),
        ),  # replace missing values with the column median: only Years, Months, Day and DayOfWeek still have missing values, and the median suits these discrete values better than the mean
        ("scaler", StandardScaler()),
    ]
)

# Create pipeline for categorical features
categorical_features = ["Store", "Holiday_Flag"]  
categorical_transformer = Pipeline(
    steps=[
        (
            "imputer",
            SimpleImputer(strategy="most_frequent"),
        ),  # missing values will be replaced by most frequent value
        (
            "encoder",
            OneHotEncoder(drop="if_binary"),
        ),  # drop="if_binary" drops one column only for two-category features (here Holiday_Flag); every Store category keeps its own column
    ]
)

# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

* **Apply the preprocessing on train and test sets**

In [59]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train.head())
X_train = preprocessor.fit_transform(X_train)
print("...Done.")
print(X_train[0:5])  
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test.head())
X_test = preprocessor.transform(X_test)  
print("...Done.")
print(X_test[0:5, :]) 

Performing preprocessings on train set...
     Store  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment  \
127   16.0           0.0        61.79       2.711  189.523128         6.868   
63     5.0           0.0        69.17       3.594  224.019287         5.422   
35    19.0           0.0        33.26       3.789  133.958742         7.771   
10     8.0           0.0        82.92       3.554  219.070197         6.425   
95     1.0           0.0        74.78       2.854  210.337426         7.808   

      Years  Months   Day  DayOfWeek  
127  2010.0     7.0   9.0        4.0  
63   2012.0    10.0  19.0        4.0  
35   2011.0     3.0  25.0        4.0  
10   2011.0     8.0  19.0        4.0  
95   2010.0     5.0  14.0        4.0  
...Done.
[[ 0.04260362 -1.26840641  0.20507788 -0.55534542 -1.1763434   0.147002
  -0.86859506  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.       

There is no need to label-encode Y because it is already numerical.

# Part 2 : Baseline model - Linear regression

* **Model Training**

In [60]:
model = LinearRegression()

print("Training model...")
model.fit(X_train, Y_train) 
print("...Done.")

Training model...
...Done.


* **Score of the model**

In [61]:
print("R2 score on training set : ", model.score(X_train, Y_train))
print("R2 score on test set : ", model.score(X_test, Y_test))

R2 score on training set :  0.9868321417045137
R2 score on test set :  0.9352216314000096


At first sight, the model seems quite good, with an R² of about 0.99 on the training set and 0.94 on the test set: the test score is a little below the training score, but not by much.

Let's run a quick cross-validation to check whether there is overfitting.

In [62]:
scores = cross_val_score(model,X_train, Y_train, cv=5)
avg = scores.mean()
std = scores.std()
print('Cross-validated accuracy : {}\nstandard deviation : {}'.format(avg, std))
print("R2 score on test set is finally: ", model.score(X_test, Y_test) + std, "or ", model.score(X_test, Y_test) - std)
print("and that's under the R2 score on training set :", model.score(X_train, Y_train))

Cross-validated accuracy : 0.9553550974246763
standard deviation : 0.028341216024172698
R2 score on test set is finally:  0.9635628474241823 or  0.9068804153758369
and that's under the R2 score on training set : 0.9868321417045137


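The score printed as "accuracy" above is an R², the default score of a regressor. The standard deviation is too low for the interval around the test score (test R² plus or minus one standard deviation) to include the training score. So there is slight overfitting here, which we will try to correct in the last part.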
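In the cell above, the preprocessor was fitted on the whole training set before cross-validation, so each validation fold has already contributed to the imputer and scaler statistics. One possible refinement, sketched below on synthetic data (the arrays, coefficients and step names are assumptions for illustration), is to put the preprocessing and the model in a single `Pipeline` so that everything is refitted inside each fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (made-up coefficients and noise level)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 4))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=80)

# Knock out about 5 % of the feature values to give the imputer some work
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

# Preprocessing and model chained together: each CV fold refits all three steps
full_model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# scoring="r2" is spelled out; it is also the default for regressors
scores = cross_val_score(full_model, X_toy, y_toy, cv=5, scoring="r2")
print("mean R2:", scores.mean(), "std:", scores.std())
```

On the Walmart data, the equivalent would be to chain the `preprocessor` object with the regressor and to pass the untransformed training features to `cross_val_score`.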

* **Get the coefficients to see the feature importance**

In [63]:
preprocessor.get_feature_names_out().tolist()

['num__Temperature',
 'num__Fuel_Price',
 'num__CPI',
 'num__Unemployment',
 'num__Years',
 'num__Months',
 'num__Day',
 'num__DayOfWeek',
 'cat__Store_1.0',
 'cat__Store_2.0',
 'cat__Store_3.0',
 'cat__Store_4.0',
 'cat__Store_5.0',
 'cat__Store_6.0',
 'cat__Store_7.0',
 'cat__Store_8.0',
 'cat__Store_9.0',
 'cat__Store_10.0',
 'cat__Store_11.0',
 'cat__Store_13.0',
 'cat__Store_14.0',
 'cat__Store_15.0',
 'cat__Store_16.0',
 'cat__Store_17.0',
 'cat__Store_18.0',
 'cat__Store_19.0',
 'cat__Store_20.0',
 'cat__Holiday_Flag_1.0']

In [64]:
column_names = preprocessor.get_feature_names_out().tolist()

coefs = pd.DataFrame(index = column_names, data = model.coef_.transpose(), columns=["coefficients"])
coefs.index = coefs.index.str.replace(r"^(num__|cat__)", "", regex=True) # To drop the num__ and cat__ in the name of categories
coefs

Unnamed: 0,coefficients
Temperature,-11462.7
Fuel_Price,-57984.83
CPI,717469.9
Unemployment,32478.5
Years,-6895.022
Months,17243.19
Day,-49592.65
DayOfWeek,-5.529728e-09
Store_1.0,-347190.7
Store_2.0,-75852.39


In [65]:
feature_importance = coefs.sort_values(by = 'coefficients')
# Plot coefficients
fig = px.bar(feature_importance, orientation = 'h')
fig.update_layout(showlegend = False, 
                  margin = {'l': 120}, # to avoid cropping of column names
                  height = 600, width = 800,
                  yaxis_title = "Feature name",
                  xaxis_title = "Coefficient"
                  )
fig.show()

The store number seems to have a high impact on the predicted weekly sales, but only for certain stores, not all of them.
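Because the numeric features are standardised, coefficient magnitudes can be compared, and sorting by absolute value gives a quick ranking. The sketch below uses only five of the coefficients printed above (a subset chosen for illustration), so the ranking it prints is partial.

```python
import pandas as pd

# Subset of the linear-regression coefficients shown in the table above
coefs = pd.DataFrame(
    {"coefficients": [-11462.7, -57984.83, 717469.9, 32478.5, -347190.7]},
    index=["Temperature", "Fuel_Price", "CPI", "Unemployment", "Store_1.0"],
)

# Rank by magnitude: the sign gives the direction of the effect, not its strength
ranked = coefs["coefficients"].abs().sort_values(ascending=False)
print(ranked)
```

In this subset, CPI comes first and Store_1.0 second, so the store indicators are not the only features with large coefficients.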

# Part 3 : Fight Overfitting

## 1) Ridge

Testing several values of alpha for Ridge:

In [71]:
ridge1 = Ridge(alpha = 1e-3)
print(ridge1)
ridge1.fit(X_train, Y_train)

print("R2 score on training set : ", ridge1.score(X_train, Y_train))
print("R2 score on test set : ", ridge1.score(X_test, Y_test))

Ridge(alpha=0.001)
R2 score on training set :  0.9867522621884073
R2 score on test set :  0.9369066011096839


In [72]:
ridge2 = Ridge(alpha = 1)
print(ridge2)
ridge2.fit(X_train, Y_train)

print("R2 score on training set : ", ridge2.score(X_train, Y_train))
print("R2 score on test set : ", ridge2.score(X_test, Y_test))

Ridge(alpha=1)
R2 score on training set :  0.9427813652185592
R2 score on test set :  0.8473082408809713


In [73]:
ridge3 = Ridge(alpha = 10)
print(ridge3)
ridge3.fit(X_train, Y_train)

print("R2 score on training set : ", ridge3.score(X_train, Y_train))
print("R2 score on test set : ", ridge3.score(X_test, Y_test))

Ridge(alpha=10)
R2 score on training set :  0.5492457118018667
R2 score on test set :  0.3632062416752043


In [77]:
data_dict = {
    'Feature': preprocessor.get_feature_names_out().tolist(),
    'Ridge1': ridge1.coef_,
    'Ridge2': ridge2.coef_,
    'Ridge3': ridge3.coef_
            }

coefficients_ridge = pd.DataFrame(data = data_dict)
coefficients_ridge["Feature"] = coefficients_ridge["Feature"].str.replace(r"^(num__|cat__)", "", regex=True)
coefficients_ridge.head()

Unnamed: 0,Feature,Ridge1,Ridge2,Ridge3
0,Temperature,-11967.117727,7982.665994,3064.544703
1,Fuel_Price,-55157.782395,-67119.055228,-58620.577072
2,CPI,535841.866033,-127142.214348,-151391.106402
3,Unemployment,30024.199245,91114.315418,96356.141669
4,Years,4188.164316,69384.730013,8882.192152


In [80]:
import plotly.express as px
fig = px.line(coefficients_ridge, x = 'Feature', y = ['Ridge1', 'Ridge2', 'Ridge3'], height=600, width=800)
fig.show()
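The three Ridge fits above show the scores dropping as alpha grows. The sketch below illustrates the underlying mechanism on synthetic data (the data and the alpha values are assumptions for illustration): a larger alpha penalises the squared coefficients more strongly, so the overall size of the coefficient vector shrinks.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (made-up coefficients and noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=60)

# Fit Ridge for increasing alpha and record the L2 norm of the coefficients
norms = {}
for alpha in [1e-3, 1, 10, 100]:
    ridge = Ridge(alpha=alpha).fit(X_toy, y_toy)
    norms[alpha] = float(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha}: ||coef|| = {norms[alpha]:.3f}")
```

A norm that shrinks too much is a sign of underfitting, which is consistent with the low training score obtained with alpha = 10 above.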

## 2) Lasso

Testing several values of alpha for Lasso:

In [93]:
lasso1 = Lasso(alpha = 1, max_iter=int(1e4)) #max_iter changed because of convergence warning
print(lasso1)
lasso1.fit(X_train, Y_train)

print("R2 score on training set : ", lasso1.score(X_train, Y_train))
print("R2 score on test set : ", lasso1.score(X_test, Y_test))

Lasso(alpha=1, max_iter=10000)
R2 score on training set :  0.9868315040562735
R2 score on test set :  0.9353786704934931


In [94]:
lasso2 = Lasso(alpha = 10, max_iter=int(1e4))
print(lasso2)
lasso2.fit(X_train, Y_train)

print("R2 score on training set : ", lasso2.score(X_train, Y_train))
print("R2 score on test set : ", lasso2.score(X_test, Y_test))

Lasso(alpha=10, max_iter=10000)
R2 score on training set :  0.986775070375541
R2 score on test set :  0.9366217320076096


In [95]:
lasso3 = Lasso(alpha = 30, max_iter=int(1e4))
print(lasso3)
lasso3.fit(X_train, Y_train)

print("R2 score on training set : ", lasso3.score(X_train, Y_train))
print("R2 score on test set : ", lasso3.score(X_test, Y_test))

Lasso(alpha=30, max_iter=10000)
R2 score on training set :  0.9866217199939565
R2 score on test set :  0.9380553830545715


In [96]:
data_dict = {
    'Feature': preprocessor.get_feature_names_out().tolist(),
    'Lasso1': lasso1.coef_,
    'Lasso2': lasso2.coef_,
    'Lasso3': lasso3.coef_
            }

coefficients_lasso = pd.DataFrame(data = data_dict)
coefficients_lasso["Feature"] = coefficients_lasso["Feature"].str.replace(r"^(num__|cat__)", "", regex=True)
coefficients_lasso.head()

Unnamed: 0,Feature,Lasso1,Lasso2,Lasso3
0,Temperature,-11510.767222,-11945.29014,-12130.474518
1,Fuel_Price,-57721.579266,-55420.678661,-53299.659419
2,CPI,701239.795581,563956.180556,423417.779515
3,Unemployment,32248.821089,30306.974641,29113.90682
4,Years,-5914.035941,2294.180593,11235.124897


In [98]:
import plotly.express as px
fig = px.line(coefficients_lasso, x = 'Feature', y = ['Lasso1', 'Lasso2', 'Lasso3'], height=600, width=800)
fig.show()
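Unlike Ridge, Lasso can set coefficients exactly to zero, which acts as a form of feature selection. The following sketch on synthetic data (the data, the true coefficients and the alpha values are assumptions for illustration) counts the coefficients that are exactly zero for several values of alpha.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data where only the first two features matter (made-up values)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 6))
y_toy = X_toy @ np.array([5.0, -4.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

# Count coefficients that Lasso sets exactly to zero as alpha increases
zeros = {}
for alpha in [0.01, 0.5, 10]:
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_toy, y_toy)
    zeros[alpha] = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {zeros[alpha]} coefficients at zero out of {lasso.coef_.size}")
```

The useful range of alpha depends on the scale of the target: here the target is of order 1, whereas weekly sales are of order 10^6, which is why much larger alphas are needed in the notebook before Lasso has a visible effect.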

## 3) Hyperparameter selection

### Ridge

In [107]:
print("Grid search...")
regressor = Ridge()

params = {
    'alpha': [0.0001, 0.001, 0.0011, 0.0012, 0.0013, 0.0015, 0.002, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}
best_ridge = GridSearchCV(regressor, param_grid = params, cv = 10) # cv : the number of folds to be used for CV
best_ridge.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", best_ridge.best_params_)
print("Best R2 score : ", best_ridge.best_score_)

Grid search...
...Done.
Best hyperparameters :  {'alpha': 0.0012}
Best R2 score :  0.9630582852152922


### Lasso

In [106]:
print("Grid search...")
regressor = Lasso(max_iter=int(1e6))

params = {
    'alpha': [1, 2, 3, 5, 10, 20, 25, 26, 27, 28, 30, 32, 35, 40, 50, 100],
}
best_lasso = GridSearchCV(regressor, param_grid = params, cv = 10) # cv : the number of folds to be used for CV
best_lasso.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", best_lasso.best_params_)
print("Best R2 score : ", best_lasso.best_score_)

Grid search...
...Done.
Best hyperparameters :  {'alpha': 28}
Best R2 score :  0.9633526087016703
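The two grids above were written by hand. As an alternative sketch (synthetic data and grid bounds are assumptions for illustration), a logarithmic grid built with `np.logspace` covers several orders of magnitude evenly, and `cv_results_` gives the mean and standard deviation of the score for every alpha, not only the best one.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (made-up coefficients and noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 5))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=80)

# 13 values of alpha evenly spaced on a log scale between 1e-4 and 1e2
grid = {"alpha": np.logspace(-4, 2, 13)}

search = GridSearchCV(Ridge(), param_grid=grid, cv=5, scoring="r2")
search.fit(X_toy, y_toy)
print("best alpha:", search.best_params_["alpha"], "best CV R2:", search.best_score_)

# Mean and spread of the CV score for each candidate
for alpha, mean, std in zip(
    search.cv_results_["param_alpha"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f"alpha={alpha:.4g}: R2 = {mean:.4f} +/- {std:.4f}")
```

Looking at the spread helps judge whether the best alpha is clearly better than its neighbours or only marginally so.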


### Comparison of the two

In [108]:
print("RIDGE / R2 score on training set : ", best_ridge.score(X_train, Y_train))
print("RIDGE / R2 score on test set : ", best_ridge.score(X_test, Y_test))
print()
print("LASSO / R2 score on training set : ", best_lasso.score(X_train, Y_train))
print("LASSO / R2 score on test set : ", best_lasso.score(X_test, Y_test))

RIDGE / R2 score on training set :  0.9867265773355011
RIDGE / R2 score on test set :  0.9371411284508883

LASSO / R2 score on training set :  0.9866477418432557
LASSO / R2 score on test set :  0.9378885059035936


In [109]:
data_dict = {
    'Feature': preprocessor.get_feature_names_out().tolist(),
    'Best_Ridge': best_ridge.best_estimator_.coef_,
    'Best_Lasso': best_lasso.best_estimator_.coef_
            }

coefficients = pd.DataFrame(data = data_dict)
coefficients["Feature"] = coefficients["Feature"].str.replace(r"^(num__|cat__)", "", regex=True)
coefficients.head()

Unnamed: 0,Feature,Best_Ridge,Best_Lasso
0,Temperature,-12038.482259,-12090.434856
1,Fuel_Price,-54738.306234,-53598.68016
2,CPI,508686.395505,442185.04856
3,Unemployment,29674.295611,29320.864306
4,Years,5854.449657,10074.559703


In [110]:
fig = px.line(coefficients, x = 'Feature', y = ['Best_Ridge', 'Best_Lasso'], height=600, width=800)
fig.show()

### Verification that overfitting is corrected

In [111]:
scores = cross_val_score(best_ridge,X_train, Y_train, cv=5)
avg = scores.mean()
std = scores.std()
print('Cross-validated accuracy : {}\nstandard deviation : {}'.format(avg, std))
print("R2 score on test set is finally: ", best_ridge.score(X_test, Y_test) + std, "or ", best_ridge.score(X_test, Y_test) - std)
print("and that's under the R2 score on training set :", best_ridge.score(X_train, Y_train))

Cross-validated accuracy : 0.9333198008802637
standard deviation : 0.06981891957670519
R2 score on test set is finally:  1.0069600480275935 or  0.8673222088741831
and that's under the R2 score on training set : 0.9867265773355011


In [112]:
scores = cross_val_score(best_lasso,X_train, Y_train, cv=5)
avg = scores.mean()
std = scores.std()
print('Cross-validated accuracy : {}\nstandard deviation : {}'.format(avg, std))
print("R2 score on test set is finally: ", best_lasso.score(X_test, Y_test) + std, "or ", best_lasso.score(X_test, Y_test) - std)
print("and that's under the R2 score on training set :", best_lasso.score(X_train, Y_train))

Cross-validated accuracy : 0.9473525321065663
standard deviation : 0.04683547714923878
R2 score on test set is finally:  0.9847239830528324 or  0.8910530287543549
and that's under the R2 score on training set : 0.9866477418432557


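Finally, Ridge (alpha = 0.0012) seems to be the best option here to correct the slight overfitting seen at the beginning: its interval around the test score (0.867 to 1.007) contains the training score (0.987), whereas the Lasso interval (0.891 to 0.985) stops just below it. The two test scores are nearly identical, though (0.9371 for Ridge and 0.9379 for Lasso).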
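One detail of the two verification cells is worth noting: `best_ridge` and `best_lasso` are `GridSearchCV` objects, so `cross_val_score` clones them and re-runs the whole grid search inside each fold (nested cross-validation). The sketch below, on synthetic data with an assumed small grid, contrasts this with scoring only the selected model through `best_estimator_`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression problem (made-up coefficients and noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 5))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=80)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.001, 0.1, 1, 10]}, cv=5)
search.fit(X_toy, y_toy)

# Nested CV: the grid search is repeated in every outer fold,
# so alpha may differ from one fold to the next
nested_scores = cross_val_score(search, X_toy, y_toy, cv=5)

# Fixed model: alpha is frozen at the value selected on the full training data
fixed_scores = cross_val_score(search.best_estimator_, X_toy, y_toy, cv=5)

print("nested CV  :", nested_scores.mean(), "+/-", nested_scores.std())
print("fixed alpha:", fixed_scores.mean(), "+/-", fixed_scores.std())
```

Both approaches are legitimate, but they answer different questions: the nested version evaluates the whole tuning procedure, while the fixed version evaluates one specific value of alpha.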