<img src="https://www.bestdesigns.co/uploads/inspiration_images/4350/990__1511457498_404_walmart.png" alt="WALMART LOGO" />

# Walmart : predict weekly sales

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores from the United States, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing service has asked you to build a machine learning model able to estimate the weekly sales in their stores, with the best precision possible on the predictions made. Such a model would help them understand better how the sales are influenced by economic indicators, and might be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1 : perform an EDA and all the preprocessing needed to prepare the data for machine learning
- Part 2 : train a **linear regression model** (baseline)
- Part 3 : avoid overfitting by training a **regularized regression model**

## Scope of this project 🖼️

For this project, you'll work with a dataset that contains information about weekly sales achieved by different Walmart stores, and other variables such as the unemployment rate or the fuel price, that might be useful for predicting the amount of sales. The dataset has been taken from a Kaggle competition, but we made some changes compared to the original data. Please make sure that you're using **our** custom dataset (available on JULIE). 🤓

## Deliverable 📬

To complete this project, your team should: 

- Create some visualizations
- Train at least one **linear regression model** on the dataset, that predicts the amount of weekly sales as a function of the other variables
- Assess the performance of the model by using a metric that is relevant for regression problems
- Interpret the coefficients of the model to identify what features are important for the prediction
- Train at least one model with **regularization (Lasso or Ridge)** to reduce overfitting
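
To make the "metric that is relevant for regression problems" item concrete, here is a minimal sketch that computes three common regression metrics with scikit-learn. The arrays `y_true` and `y_pred` are made-up values used only for illustration; they are not taken from the Walmart dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up target values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# R2: share of the target variance explained by the model (1.0 is a perfect fit).
r2 = r2_score(y_true, y_pred)
# RMSE: square root of the mean squared error, in the same unit as the target.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# MAE: mean absolute error, less sensitive to a few large errors than RMSE.
mae = mean_absolute_error(y_true, y_pred)

print(f"R2={r2:.3f}  RMSE={rmse:.1f}  MAE={mae:.1f}")
```

R2 is unitless, which makes it convenient for comparing train and test scores, while RMSE and MAE express the error in sales units.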


## Helpers 🦮

Here are a few tips to help you complete this project:

### Part 1 : EDA and data preprocessing

Start your project by exploring your dataset: create figures, compute some statistics, etc.

Then you'll have to preprocess the dataset. You can follow the guidelines from the *preprocessing template*. Some transformations are specific to this dataset, for example on the *Date* column, which can't be included in the model as is. Below are some hints that might help you 🤓

 #### Preprocessing to be planned with pandas

 **Drop rows where target values are missing :**
 - Here, the target variable (Y) corresponds to the column *Weekly_Sales*. This column contains some missing values.
 - We never use imputation techniques on the target : it might introduce bias in the predictions !
 - Therefore, we simply drop the rows of the dataset for which the value in *Weekly_Sales* is missing.
 
**Create usable features from the *Date* column :**
The *Date* column cannot be included in the model as is. You can either drop this column or create new columns that contain the following numeric features : 
- *year*
- *month*
- *day*
- *day of week*
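
As a rough illustration of the list above, the sketch below derives the four numeric features with pandas. The frame `toy` and the column name `DayOfWeek` are assumptions made for the example; the two dates use the day-month-year format of the raw file.

```python
import pandas as pd

# Toy frame: two dates in the dataset's DD-MM-YYYY format and one missing date.
toy = pd.DataFrame({"Date": ["18-02-2011", "25-03-2011", None]})

# Parse with an explicit format so that day and month are never swapped.
toy["Date"] = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

# The .dt accessor exposes the calendar components; missing dates give NaN.
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Day"] = toy["Date"].dt.day
toy["DayOfWeek"] = toy["Date"].dt.dayofweek  # Monday=0 ... Sunday=6

print(toy)
```

If every date in a dataset falls on the same weekday, the day-of-week column is constant and carries no information for the model, so it is worth checking before keeping it.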

**Drop rows containing invalid values or outliers :**
In this project, a row is considered an outlier when any of its numeric features falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$, where $\bar{X}$ and $\sigma$ are the mean and standard deviation of that feature. This applies to the columns *Temperature*, *Fuel_Price*, *CPI* and *Unemployment*.
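
The three-sigma rule can be sketched as a small helper. The function name `drop_three_sigma_outliers` and the toy values are hypothetical, and keeping rows with missing values (so they can be imputed later) is an assumption rather than a requirement of the brief.

```python
import numpy as np
import pandas as pd

def drop_three_sigma_outliers(frame, columns):
    """Keep rows whose values lie within mean +/- 3 std for every listed column.

    Missing values are kept so that they can be imputed later.
    """
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        mean, std = frame[col].mean(), frame[col].std()  # NaN is skipped by default
        inside = frame[col].between(mean - 3 * std, mean + 3 * std)
        keep &= inside | frame[col].isna()
    return frame[keep].copy()

# Toy data: 50 ordinary values, one extreme value and one missing value.
values = np.append(np.linspace(6.5, 8.5, 50), [14.3, np.nan])
toy = pd.DataFrame({"Unemployment": values})

clean = drop_three_sigma_outliers(toy, ["Unemployment"])
print(len(toy), "->", len(clean))
```

On the real data the same helper could be called with the four columns named above. Note that the bounds are computed once on the unfiltered column, so the extreme values themselves inflate the standard deviation.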
 


**Target variable (Y) to predict, to be separated from the other columns** : *Weekly_Sales*

 **------------**

 #### Preprocessings to be planned with scikit-learn

 **Explanatory variables (X)**
We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

 - Categorical variables : Store, Holiday_Flag
 - Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day, DayOfWeek

### Part 2 : Baseline model (linear regression)
Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results ?
It is also worth analyzing the values of the model's coefficients to see which features matter most for the prediction. The `.coef_` attribute of scikit-learn's LinearRegression class gives access to them (coefficients are only comparable across features when the features are on the same scale, for example after standardization). Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

### Part 3 : Fight overfitting
In this last part, you'll have to train a **regularized linear regression model**. You'll find below some useful classes in scikit-learn's documentation :
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

**Bonus question**

In regularized regression models, there's a hyperparameter called *the regularization strength* that can be fine-tuned to get the best generalized predictions on a given dataset. This fine-tuning can be done thanks to scikit-learn's GridSearchCV class : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Also, you'll find here some examples of how to use GridSearchCV together with Ridge or Lasso models : https://alfurka.github.io/2018-11-18-grid-search/
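
As a starting point for the bonus question, here is a minimal sketch of tuning the regularization strength, which scikit-learn's Ridge and Lasso expose as the `alpha` parameter. The synthetic data from `make_regression` and the grid of candidate values are assumptions chosen for illustration, not recommended settings for the Walmart dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the preprocessed features.
X, y = make_regression(n_samples=120, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate regularization strengths (illustrative values only).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training set for each candidate alpha.
search = GridSearchCV(Ridge(), param_grid=param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["alpha"])
print("Mean cross-validated R2:", search.best_score_)
# By default the best model is refit on the whole training set, so it can be scored directly.
print("Test R2:", search.score(X_test, y_test))
```

Swapping `Ridge()` for `Lasso()` keeps the rest of the sketch unchanged, since both estimators use the same `alpha` parameter name.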

In [1]:
#Importing Libraries

import pandas as pd
from pandas_dq import dq_report
import numpy as np
import datetime

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer #, KNNImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score  # used in the Ridge section

In [2]:
url = 'Walmart_Store_sales.csv'
df = pd.read_csv(url)

In [3]:
df.shape

(150, 8)

## MISSING DATA

In [4]:
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092


In [5]:
# How many rows are there for store 12?
df[df['Store'] == 12].shape[0]

5

In [6]:
dq_report(df, verbose=1)

    All variables classified into correct types.



Unnamed: 0,Data Type,Missing Values%,Unique Values%,Minimum Value,Maximum Value,DQ Issue
Store,float64,0.0,,1.0,20.0,No issue
Date,object,12.0,56.0,,,"18 missing values. Impute them with mean, median, mode, or a constant value such as 123., 51 rare categories: Too many to list. Group them into a single category or drop the categories., Mixed dtypes: has 2 different data types: object, float,"
Weekly_Sales,float64,9.333333,,268929.03,2771397.17,"14 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Holiday_Flag,float64,8.0,1.0,,,"12 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Temperature,float64,12.0,,18.79,91.65,"18 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Fuel_Price,float64,9.333333,,2.514,4.193,"14 missing values. Impute them with mean, median, mode, or a constant value such as 123."
CPI,float64,8.0,,126.111903,226.968844,"12 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Unemployment,float64,10.0,,5.143,14.313,"15 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 5 outliers greater than upper bound (10.48) or lower than lower bound(4.27). Cap them or remove them."



In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         150 non-null    float64
 1   Date          132 non-null    object 
 2   Weekly_Sales  136 non-null    float64
 3   Holiday_Flag  138 non-null    float64
 4   Temperature   132 non-null    float64
 5   Fuel_Price    136 non-null    float64
 6   CPI           138 non-null    float64
 7   Unemployment  135 non-null    float64
dtypes: float64(7), object(1)
memory usage: 9.5+ KB


In [8]:
# How many rows are there for store 12?
df[df['Store'] == 12].shape[0]

5

## HANDLING MISSING DATA

In [9]:
# Drop the rows where the target is missing
df.dropna(subset=['Weekly_Sales'], inplace=True)

In [10]:
# How many rows are there for store 12?
df[df['Store'] == 12].shape[0]

5

In [11]:
# Check whether any missing values remain in "Weekly_Sales"
print(df['Weekly_Sales'].isnull().sum())

0


In [12]:
df.shape

(136, 8)

## HANDLING NEGATIVE VALUES, ABERRANT VALUES AND OUTLIERS

In [13]:
# Processing date as proper datetime format
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

In [14]:
df.shape

(136, 8)

In [15]:
# Create new columns for the year, the month and the day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

In [16]:
df.shape

(136, 11)

In [17]:
# converting temperature from fahrenheit to celsius
df['Temperature'] = round((df['Temperature']-32)*5/9 ,1)

In [18]:
df.shape

(136, 11)

In [19]:
# Show the rows without a date
df[df['Date'].isnull()]

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
3,11.0,NaT,1244390.03,0.0,29.2,,214.556497,7.346,,,
9,3.0,NaT,418925.47,0.0,15.6,3.555,224.13202,6.833,,,
17,18.0,NaT,1205307.5,0.0,-5.9,2.788,131.527903,9.202,,,
34,2.0,NaT,1853161.99,0.0,30.9,3.48,214.929625,,,,
42,1.0,NaT,1661767.33,1.0,,3.73,222.439015,6.908,,,
65,10.0,NaT,1714309.9,,6.4,3.287,127.191774,8.744,,,
81,5.0,NaT,359206.21,0.0,,3.63,221.434215,5.943,,,
82,11.0,NaT,1569607.94,0.0,11.5,3.51,223.917015,6.833,,,
83,15.0,NaT,607475.44,0.0,26.6,3.972,135.873839,7.806,,,
86,17.0,NaT,986922.62,0.0,,3.793,131.037548,6.235,,,


# Can some missing dates be recovered by matching the national unemployment rate or the inflation rate?

In [20]:
CPI_same = df[df['CPI'].notna()].sort_values(by='CPI')
CPI_same

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
147,17.0,2010-06-11,845252.21,0.0,14.0,2.841,126.111903,,2010.0,6.0,11.0
135,12.0,2010-09-10,903119.03,1.0,28.7,3.044,126.114581,14.18,2010.0,9.0,10.0
98,10.0,2010-06-25,1768172.31,0.0,32.4,,126.1266,9.524,2010.0,6.0,25.0
137,10.0,NaT,1831676.03,0.0,31.4,3.112,126.128355,9.199,,,
99,13.0,2010-07-02,2018314.71,0.0,26.0,2.814,126.1392,7.951,2010.0,7.0,2.0
5,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0
14,17.0,2010-10-01,829207.27,0.0,15.6,2.853,126.2346,6.885,2010.0,10.0,1.0
131,17.0,2010-11-12,855459.96,0.0,,2.831,126.546161,,2010.0,11.0,12.0
28,17.0,2010-04-16,757738.76,0.0,7.3,2.915,126.5621,6.635,2010.0,4.0,16.0
103,4.0,2010-12-10,2302504.86,0.0,5.8,2.86,126.7934,7.127,2010.0,12.0,10.0


Not at first glance.

In [21]:
# Compare the rows of df with a CPI between 214 and 215 to see whether any values are close to each other
df_filtré = df.query('214 <= CPI < 215')
print(df_filtré)


     Store    Date     Weekly_Sales  Holiday_Flag  Temperature  Fuel_Price  \
0     6.0  2011-02-18   1572117.54        NaN         15.3         3.045     
3    11.0         NaT   1244390.03        0.0         29.2           NaN     
34    2.0         NaT   1853161.99        0.0         30.9         3.480     
52    9.0  2010-06-25    509263.28        0.0         29.5         2.653     
56    8.0  2010-08-27    888816.78        0.0         23.8         2.619     
67    3.0  2010-02-05    461622.22        0.0          7.6         2.572     
90    9.0  2010-07-09    485389.15        NaN         25.8         2.642     
92    3.0         NaT    384200.69        0.0          NaN         2.667     
96    8.0  2010-03-12    860336.16        0.0          9.9           NaN     
100   3.0  2010-06-18    364076.85        0.0         28.6         2.637     
107   8.0  2010-02-12    994801.40        1.0          0.7         2.548     
120   8.0  2010-09-17    836707.85        0.0         24.1      

No match. None of the rows with a missing date can be paired with a dated row that shares the same CPI or unemployment rate.

In [22]:
# Drop the rows with a missing date
df.dropna(subset=['Date'], inplace=True)

In [23]:
df.shape

(118, 11)

In [24]:
# Show the rows without a Holiday_Flag
df[df['Holiday_Flag'].isnull()]

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,2011-02-18,1572117.54,,15.3,3.045,214.777523,6.858,2011.0,2.0,18.0
15,6.0,2010-04-30,1498080.16,,20.5,2.78,211.894272,7.092,2010.0,4.0,30.0
43,7.0,2011-08-26,629994.47,,14.2,3.485,194.379637,8.622,2011.0,8.0,26.0
48,1.0,2011-08-05,1624383.75,,33.1,3.684,215.544618,7.962,2011.0,8.0,5.0
53,14.0,2011-03-25,1879451.23,,5.4,3.625,184.994368,8.549,2011.0,3.0,25.0
73,1.0,2010-08-27,1449142.92,,29.6,2.619,211.567306,7.787,2010.0,8.0,27.0
90,9.0,2010-07-09,485389.15,,25.8,2.642,214.65643,6.442,2010.0,7.0,9.0
118,9.0,2010-06-18,513073.87,,28.3,2.637,215.016648,6.384,2010.0,6.0,18.0
136,4.0,2011-07-08,2066541.86,,29.2,3.469,129.1125,5.644,2011.0,7.0,8.0


In [25]:
# Export a CSV containing only the Date column for the rows with a missing Holiday_Flag
df[df['Holiday_Flag'].isnull()]['Date'].to_csv('missing_holiday_flag.csv', index=False)


None of the dates with a missing Holiday_Flag corresponds to a national holiday, so the NaN values are set to 0.

In [26]:
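# Replace the NaN values in Holiday_Flag with 0
# (assignment instead of an in-place call on the column, which newer pandas versions warn about)
df['Holiday_Flag'] = df['Holiday_Flag'].fillna(0)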

In [27]:
# Show the rows without a Holiday_Flag (none should remain)
df[df['Holiday_Flag'].isnull()]

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day


In [28]:
df.shape

(118, 11)

In [29]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Create a figure with a 2x3 grid (2 rows, 3 columns)
fig = make_subplots(rows=2, cols=3, subplot_titles=('Temperature', 'Fuel Price', 'CPI', 'Unemployment', 'Weekly Sales'))

# Add the box plots to the figure
fig.add_trace(go.Box(y=df['Temperature'], name='Temperature'), row=1, col=1)
fig.add_trace(go.Box(y=df['Fuel_Price'], name='Fuel Price'), row=1, col=2)
fig.add_trace(go.Box(y=df['CPI'], name='CPI'), row=1, col=3)
fig.add_trace(go.Box(y=df['Unemployment'], name='Unemployment'), row=2, col=1)
fig.add_trace(go.Box(y=df['Weekly_Sales'], name='Weekly Sales'), row=2, col=2)

# Layout
fig.update_layout(height=600, width=800, title_text="Distribution of the variables")

# Display
fig.show()

In [30]:
# Show unemployment outliers for each store
fig = px.box(df, x='Store', y='Unemployment')
fig.show()

In [31]:
# Compute the mean and the standard deviation (pandas skips NaN by default)
mean = df['Unemployment'].mean()
std = df['Unemployment'].std()

# The outliers are all high values, so the mask only flags values above mean + 3*std
# (comparisons with NaN are False, so rows with a missing value are kept)
outlier_mask = (df['Unemployment'] > mean + 3*std)

# Create a new DataFrame without the outliers (copy to avoid SettingWithCopyWarning on later in-place edits)
df_clean = df[~outlier_mask].copy()

In [32]:
df_clean.shape

(113, 11)

In [33]:
# Box plot for Unemployment
fig_unemp = px.box(df_clean, y='Unemployment', title='Unemployment')
fig_unemp.show()

In [34]:
df=df_clean 

In [35]:
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,2011-02-18,1572117.54,0.0,15.3,3.045,214.777523,6.858,2011.0,2.0,18.0
1,13.0,2011-03-25,1807545.43,0.0,5.8,3.435,128.616064,7.47,2011.0,3.0,25.0
4,6.0,2010-05-28,1644470.66,0.0,26.0,2.759,212.412888,7.092,2010.0,5.0,28.0
5,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0
6,15.0,2011-06-03,695396.19,0.0,21.0,4.069,134.855161,7.658,2011.0,6.0,3.0


In [36]:
# Drop the Date column
df.drop(columns=['Date'], inplace=True)

In [37]:
df.shape

(113, 10)

In [38]:
df.head()

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,1572117.54,0.0,15.3,3.045,214.777523,6.858,2011.0,2.0,18.0
1,13.0,1807545.43,0.0,5.8,3.435,128.616064,7.47,2011.0,3.0,25.0
4,6.0,1644470.66,0.0,26.0,2.759,212.412888,7.092,2010.0,5.0,28.0
5,4.0,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0
6,15.0,695396.19,0.0,21.0,4.069,134.855161,7.658,2011.0,6.0,3.0

# Now that the dataset is a little cleaner, let's analyze it

In [39]:
# Overall mean of weekly sales
rounded_mean = round(df['Weekly_Sales'].mean(), 2)
print(rounded_mean)

1267414.77


In [40]:
# Mean weekly sales per store
mean_sales_by_store = df.groupby('Store')['Weekly_Sales'].mean().round(2)
print(mean_sales_by_store)

Store
1.0     1550100.94
2.0     1982229.07
3.0      403353.32
4.0     2173758.98
5.0      294398.75
6.0     1551123.58
7.0      536664.40
8.0      888754.13
9.0      506887.40
10.0    1854847.71
11.0    1757242.51
13.0    1997235.41
14.0    2092878.41
15.0     642282.06
16.0     515317.77
17.0     841507.31
18.0    1151981.77
19.0    1400615.22
20.0    1941521.13
Name: Weekly_Sales, dtype: float64


In [41]:
# Total sales per store
total_sales_by_store = df.groupby('Store')['Weekly_Sales'].sum().round(2)
print(total_sales_by_store)

Store
1.0     12400807.53
2.0     11893374.43
3.0      4033533.19
4.0     13042553.90
5.0      2060791.26
6.0      9306741.48
7.0      3756650.81
8.0      5332524.79
9.0      2027549.60
10.0     5564543.12
11.0     1757242.51
13.0    17975118.68
14.0    18835905.73
15.0     1926846.17
16.0     2061271.09
17.0     4207536.54
18.0     8063872.40
19.0    11204921.72
20.0     7766084.51
Name: Weekly_Sales, dtype: float64


In [42]:
# Correlation heatmap
corr = df.corr()
fig = px.imshow(corr)
fig.show()


In [43]:
fig = px.imshow(corr)

# Add the correlation values inside the cells
for y in range(corr.shape[0]):
    for x in range(corr.shape[1]):
        fig.add_annotation(x=x, y=y, 
                           text=str(round(corr.iloc[y, x], 2)), 
                           showarrow=False, 
                           font=dict(color="black"))


# Adjust the layout for better readability
fig.update_layout(
        xaxis=dict(side="top"),
        width=800,  # Figure width in pixels
        height=600)  # Figure height in pixels

fig.show()

In [44]:
df.head()

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,1572117.54,0.0,15.3,3.045,214.777523,6.858,2011.0,2.0,18.0
1,13.0,1807545.43,0.0,5.8,3.435,128.616064,7.47,2011.0,3.0,25.0
4,6.0,1644470.66,0.0,26.0,2.759,212.412888,7.092,2010.0,5.0,28.0
5,4.0,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0
6,15.0,695396.19,0.0,21.0,4.069,134.855161,7.658,2011.0,6.0,3.0


In [45]:
# Plot of the impact of temperature on sales
fig = px.scatter(df, x='Temperature', y='Weekly_Sales', trendline='ols')
fig.show()

In [46]:
# Plot of the impact of the unemployment rate on sales
fig = px.scatter(df, x='Unemployment', y='Weekly_Sales', trendline='ols')
fig.show()


In [47]:
# Plot of the impact of inflation (CPI) on sales
fig = px.scatter(df, x='CPI', y='Weekly_Sales', trendline='ols')
fig.show()

In [48]:
# Plot of the impact of the fuel price on sales
fig = px.scatter(df, x='Fuel_Price', y='Weekly_Sales', trendline='ols')
fig.show()

In [49]:
# Plot of total sales per year
fig = px.bar(df.groupby('Year')['Weekly_Sales'].sum().reset_index(), x='Year', y='Weekly_Sales')
fig.show()


In [50]:
# Plot of total sales per month
fig = px.bar(df.groupby('Month')['Weekly_Sales'].sum().reset_index(), x='Month', y='Weekly_Sales')
fig.show()

PRE PROCESSING

In [52]:
# Separate target variable Y from features X
print("Separating labels from features...")
target_variable = "Weekly_Sales"

X = df.drop(target_variable, axis = 1)
Y = df.loc[:, target_variable]

Separating labels from features...


In [81]:
# Replace NaN with the per-store mean for Temperature and Fuel_Price
for col in ['Temperature', 'Fuel_Price']:
    X[col] = X.groupby('Store')[col].transform(lambda x: x.fillna(x.mean()))

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

numeric_features = ['Day', 'Month', 'Year', 'CPI', 'Unemployment']
special_features = ['Temperature', 'Fuel_Price']
categorical_features = ['Store', 'Holiday_Flag']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

special_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('special', special_transformer, special_features),
        ('cat', categorical_transformer, categorical_features)
    ])

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
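
One caveat about the cell above: the per-store means are computed on the full dataset before the train/test split, so values from the test rows contribute to the imputation. A possible variant, sketched below on toy data, computes the store means on the training rows only and reuses them for the test rows. The frame `toy`, the fallback to the global training mean, and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: two stores, one missing temperature each.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "Temperature": [10.0, np.nan, 14.0, 12.0, 20.0, 22.0, np.nan, 24.0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Statistics are learned from the training rows only.
store_means = train.groupby("Store")["Temperature"].mean()
global_mean = train["Temperature"].mean()

for part in (train, test):
    # Per-store mean, falling back to the global training mean for unseen or empty stores.
    fill = part["Store"].map(store_means).fillna(global_mean)
    part["Temperature"] = part["Temperature"].fillna(fill)

print(train)
print(test)
```

With this ordering, the test set is only ever transformed with quantities estimated on the training set, which is the same principle the `fit_transform` / `transform` pair follows.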

In [109]:
# Train model
print("Train model...")
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("...Done.")
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = regressor.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = regressor.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()
# Print R^2 scores
print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

Train model...
...Done.
Predictions on training set...
...Done.
[ 355479.56861685 1159346.90405919 1403039.49221897 2066863.58616534
 1925765.82798506 1993340.25350217 2170778.68196777 1583907.04996371
  608234.74404722 2435413.84784752 1425017.43269192 2050458.19438805
 1963020.57139294 1988580.20845454 1292929.96804843 1894979.92249855
  577876.9912371  1356675.24231989 1370353.10875379  922125.07782542
 2118573.21239598  330023.76712427 1976491.63835358  363314.0773656
 1650434.02693033 1846981.22975911 2037727.01524297 2047699.24878152
 2074795.82468615  683702.23939525  606385.99993958 1151770.91785922
  370550.90300884  403385.82595278 1538509.07898754 1983767.4458496
 2427628.41990218  429905.20014556 1620383.5913399  1560492.51184586
  508064.11730152  317297.93021119  403643.65878733 1491528.49941411
  756784.10201676  411450.42238104 1994189.90022409 2162549.04629507
 1757242.51        455001.62966197  944594.51310338 1480619.11228554
 2168980.24462973  138757.56818403 139991

In [110]:
# Separate target variable Y from features X
print("Separating labels from features...")
target_variable = "Weekly_Sales"

X = df.drop(target_variable, axis = 1)
Y = df.loc[:, target_variable]

print("...Done.")
print()

print('Y : ')
print(Y.head())
print()
print('X :')
print(X.head())

Separating labels from features...
...Done.

Y : 
0    1572117.54
1    1807545.43
4    1644470.66
5    1857533.70
6     695396.19
Name: Weekly_Sales, dtype: float64

X :
   Store  Holiday_Flag  Temperature  Fuel_Price      CPI     Unemployment  \
0   6.0        0.0         15.3         3.045    214.777523      6.858      
1  13.0        0.0          5.8         3.435    128.616064      7.470      
4   6.0        0.0         26.0         2.759    212.412888      7.092      
5   4.0        0.0          NaN         2.756    126.160226      7.896      
6  15.0        0.0         21.0         4.069    134.855161      7.658      

    Year   Month   Day  
0  2011.0   2.0   18.0  
1  2011.0   3.0   25.0  
4  2010.0   5.0   28.0  
5  2010.0   5.0   28.0  
6  2011.0   6.0    3.0  


In [111]:
# Divide dataset Train set & Test set 
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [112]:
# Distinguish numeric and categorical features:
numeric_features = ['Day', 'Month', 'Year', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
categorical_features = ['Store', 'Holiday_Flag']

In [113]:
# Create pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # missing values will be replaced by columns' mean
    ('scaler', StandardScaler())
])

In [114]:
# Create pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder(drop='first')) # first column will be dropped to avoid creating correlations between features
    ])

In [115]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [116]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5]) # MUST use this syntax because X_train is a numpy array and not a pandas DataFrame anymore
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test) # Don't fit again !! The test set is used for validating decisions
# we made based on the training set, therefore we can only apply transformations that were parameterized using the training set.
# Otherwise this creates what is called a leak from the test set which will introduce a bias in all your results.
print('...Done.')
print(X_test[0:5,:]) # MUST use this syntax because X_test is a numpy array and not a pandas DataFrame anymore
print()

Performing preprocessings on train set...
...Done.
[[ 1.01737744e+00 -2.76032796e-02 -1.05558715e+00  1.53350011e+00
  -1.36747759e+00  9.86281451e-01 -9.67330408e-16  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [-1.61052027e-01  1.21454430e+00 -1.05558715e+00 -4.18941855e-01
  -8.77527971e-01 -1.15855666e+00  2.08608974e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 1.60659218e+00  2.82933616e-01 -1.05558715e+00  8.11945472e-01
  -7.59940063e-01 -1.

In [117]:
# Train model
print("Train model...")
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


In [118]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = regressor.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[ 355479.56861685 1159346.90405919 1403039.49221897 2066863.58616534
 1925765.82798506 1993340.25350217 2170778.68196777 1583907.04996371
  608234.74404721 2435413.84784752 1425017.43269192 2050458.19438804
 1963020.57139294 1988580.20845454 1292929.96804842 1894979.92249855
  577876.9912371  1356675.24231989 1370353.10875379  922125.07782542
 2118573.21239598  330023.76712427 1976491.63835358  363314.0773656
 1650434.02693033 1846981.22975912 2037727.01524296 2047699.24878152
 2074795.82468615  683702.23939525  606385.99993959 1151770.91785922
  370550.90300884  403385.82595278 1538509.07898754 1983767.4458496
 2427628.41990218  429905.20014556 1620383.5913399  1560492.51184586
  508064.11730152  317297.93021119  403643.65878733 1491528.49941411
  756784.10201676  411450.42238104 1994189.90022409 2162549.04629507
 1757242.51        455001.62966197  944594.51310338 1480619.11228554
 2168980.24462973  138757.56818403 1399913.28958961  951420.00373

In [119]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = regressor.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on test set...
...Done.
[ 378434.9262536  1439208.53112662 1610263.64973443  882160.33481902
  467036.84319288 1108436.20055574 2062198.58117078 2316439.3934927
 2054873.89282565 1557575.21453423 1029514.94842001 2045531.68806318
 1120031.60638901  607697.48000837  466273.46117054   79503.65530793
  612246.41204381  168732.14634922 1811123.04670739  487776.41568401
 1924306.22110956  472762.50180452 2069778.86919786]



In [120]:
# Print R^2 scores
print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

R2 score on training set :  0.9727215539002302
R2 score on test set :  0.9396484787972459


In [121]:
column_names = []
for name, pipeline, features_list in preprocessor.transformers_: # loop over pipelines
    if name == 'num': # if pipeline is for numeric variables
        features = features_list # just get the names of columns to which it has been applied
    else: # if pipeline is for categorical variables
        features = pipeline.named_steps['encoder'].get_feature_names_out() # get output columns names from OneHotEncoder
    column_names.extend(features) # concatenate features names
        
print("Names of columns corresponding to each coefficient: ", column_names)

Names of columns corresponding to each coefficient:  ['Day', 'Month', 'Year', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'x0_2.0', 'x0_3.0', 'x0_4.0', 'x0_5.0', 'x0_6.0', 'x0_7.0', 'x0_8.0', 'x0_9.0', 'x0_10.0', 'x0_11.0', 'x0_13.0', 'x0_14.0', 'x0_15.0', 'x0_16.0', 'x0_17.0', 'x0_18.0', 'x0_19.0', 'x0_20.0', 'x1_1.0']


In [122]:
# Create a pandas DataFrame
coefs = pd.DataFrame(index = column_names, data = regressor.coef_.transpose(), columns=["coefficients"])
coefs

feature,coefficients
Day,-37960.29
Month,74691.94
Year,4294.936
Temperature,-31488.91
Fuel_Price,-41295.74
CPI,-95514.7
Unemployment,-71933.97
x0_2.0,355847.9
x0_3.0,-1206250.0
x0_4.0,279391.4


In [123]:
# Compute abs() and sort values
feature_importance = abs(coefs).sort_values(by = 'coefficients')
feature_importance

feature,coefficients
Year,4294.936
Temperature,31488.91
x0_6.0,32639.38
Day,37960.29
Fuel_Price,41295.74
x1_1.0,52695.35
Unemployment,71933.97
Month,74691.94
CPI,95514.7
x0_11.0,161941.8


In [124]:
# Plot coefficients
fig = px.bar(feature_importance, orientation = 'h')
fig.update_layout(showlegend = False, 
                  margin = {'l': 120} # to avoid cropping of column names
                 )
fig.show()

IMPROVING THE SCORE

RIDGE

In [186]:
ridge1 = Ridge()
print(ridge1)
ridge1.fit(X_train, Y_train)

Ridge()


In [187]:
# Compute R^2 scores with cross-validation
scores = cross_val_score(ridge1, X_train, Y_train, cv=10, scoring='r2')

# Print the cross-validated R^2 scores
print("R2 scores from cross-validation : ", scores)

# Print the mean cross-validated R^2 score
print("Average R2 score from cross-validation : ", scores.mean())

# Print the standard deviation of the cross-validated R^2 scores
print("Standard deviation of R2 scores from cross-validation : ", scores.std())

# Print the R^2 scores on the training and test sets
print("R2 score on training set : ", ridge1.score(X_train, Y_train))
print("R2 score on test set : ", ridge1.score(X_test, Y_test))

R2 scores from cross-validation :  [0.77503191 0.77750858 0.9224355  0.82979052 0.80224389 0.84567795
 0.90677325 0.92140108 0.82097463 0.7820829 ]
Average R2 score from cross-validation :  0.8383920211310597
Standard deviation of R2 scores from cross-validation :  0.05590969473789396
R2 score on training set :  0.9277751525312948
R2 score on test set :  0.9150375810428166


LASSO

In [188]:
lasso1 = Lasso()
print(lasso1)
lasso1.fit(X_train, Y_train)

Lasso()


In [189]:
# Compute R^2 scores with cross-validation
scores = cross_val_score(lasso1, X_train, Y_train, cv=10, scoring='r2')

# Print the cross-validated R^2 scores
print("R2 scores from cross-validation : ", scores)

# Print the mean cross-validated R^2 score
print("Average R2 score from cross-validation : ", scores.mean())

# Print the standard deviation of the cross-validated R^2 scores
print("Standard deviation of R2 scores from cross-validation : ", scores.std())

# Print the R^2 scores on the training and test sets
print("R2 score on training set : ", lasso1.score(X_train, Y_train))
print("R2 score on test set : ", lasso1.score(X_test, Y_test))

R2 scores from cross-validation :  [0.66269914 0.93290644 0.97488605 0.93312153 0.92828011 0.91600762
 0.95430032 0.94885695 0.97612127 0.92656023]
Average R2 score from cross-validation :  0.9153739657867594
Standard deviation of R2 scores from cross-validation :  0.08638710987705707
R2 score on training set :  0.9727215515785628
R2 score on test set :  0.9396706539082827


GRIDSEARCH RIDGE

In [190]:
# Perform grid search
print("Grid search...")
regressor = Ridge()
# Grid of values to be tested
params = {
    'alpha': [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
}
best_ridge = GridSearchCV(regressor, param_grid = params, cv = 10) # cv : the number of folds to be used for CV
best_ridge.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", best_ridge.best_params_)
print("Best R2 score : ", best_ridge.best_score_)

Grid search...
...Done.
Best hyperparameters :  {'alpha': 0.05}
Best R2 score :  0.9173043978931755


GRIDSEARCH LASSO

In [191]:
# Perform grid search
print("Grid search...")
regressor = Lasso()
# Grid of values to be tested
params = {
    'alpha': [1, 2, 3, 5, 10, 20, 30, 50, 100,200,300,500,1000,2000,3000,5000]
}
best_lasso = GridSearchCV(regressor, param_grid = params, cv = 10) # cv : the number of folds to be used for CV
best_lasso.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", best_lasso.best_params_)
print("Best R2 score : ", best_lasso.best_score_)

Grid search...
...Done.
Best hyperparameters :  {'alpha': 1000}
Best R2 score :  0.920788357329602
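The "Best R2 score" reported by `GridSearchCV` is the mean cross-validated score of the winning `alpha`, not a score on the test set. The sketch below checks this on synthetic data; the data, the small `alpha` grid, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=120)

# Grid search with 10 folds; the default scoring for a regressor is R^2
grid = GridSearchCV(Lasso(), param_grid={"alpha": [0.01, 0.1, 1.0]}, cv=10)
grid.fit(X, y)

# Re-score the winning alpha by hand with the same 10 unshuffled folds
best_alpha = grid.best_params_["alpha"]
manual_mean = cross_val_score(
    Lasso(alpha=best_alpha), X, y, cv=10, scoring="r2"
).mean()

print(grid.best_score_, manual_mean)
```

Because both calls use the same deterministic folds, the two numbers should agree up to floating-point error, so a separate evaluation on the test set is still needed after the search.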


DETAILS OF THE BEST SCORE

In [192]:
lasso2 = Lasso(alpha=1000) # best alpha found by the grid search
print(lasso2)
lasso2.fit(X_train, Y_train)


Lasso(alpha=1000)


In [193]:
# Compute R^2 scores with cross-validation
scores = cross_val_score(lasso2, X_train, Y_train, cv=10, scoring='r2')

# Print the cross-validated R^2 scores
print("R2 scores from cross-validation : ", scores)

# Print the mean cross-validated R^2 score
print("Average R2 score from cross-validation : ", scores.mean())

# Print the standard deviation of the cross-validated R^2 scores
print("Standard deviation of R2 scores from cross-validation : ", scores.std())

# Print the R^2 scores on the training and test sets
print("R2 score on training set : ", lasso2.score(X_train, Y_train))
print("R2 score on test set : ", lasso2.score(X_test, Y_test))

R2 scores from cross-validation :  [0.73486558 0.92196509 0.97100497 0.92293901 0.91610693 0.91183633
 0.97675107 0.96552846 0.97608419 0.91080195]
Average R2 score from cross-validation :  0.920788357329602
Standard deviation of R2 scores from cross-validation :  0.06744472559399518
R2 score on training set :  0.9707639371771167
R2 score on test set :  0.9558728586972315


In [194]:
column_names = []
for name, pipeline, features_list in preprocessor.transformers_: # loop over pipelines
    if name == 'num': # if pipeline is for numeric variables
        features = features_list # just get the names of columns to which it has been applied
    else: # if pipeline is for categorical variables
        features = pipeline.named_steps['encoder'].get_feature_names_out() # get output columns names from OneHotEncoder
    column_names.extend(features) # concatenate features names
        
print("Names of columns corresponding to each coefficient: ", column_names)

Names of columns corresponding to each coefficient:  ['Day', 'Month', 'Year', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'x0_2.0', 'x0_3.0', 'x0_4.0', 'x0_5.0', 'x0_6.0', 'x0_7.0', 'x0_8.0', 'x0_9.0', 'x0_10.0', 'x0_11.0', 'x0_13.0', 'x0_14.0', 'x0_15.0', 'x0_16.0', 'x0_17.0', 'x0_18.0', 'x0_19.0', 'x0_20.0', 'x1_1.0']


In [195]:
# Create a pandas DataFrame
coefs = pd.DataFrame(index = column_names, data = lasso2.coef_, columns=["coefficients"])
coefs

feature,coefficients
Day,-16724.31
Month,-59890.52
Year,-38245.98
Temperature,-27422.63
Fuel_Price,-36099.36
CPI,74822.22
Unemployment,0.0
x0_2.0,397144.7
x0_3.0,-1141497.0
x0_4.0,490893.6


In [184]:
# Compute abs() and sort values
feature_importance = abs(coefs).sort_values(by = 'coefficients')
feature_importance

feature,coefficients
x1_1.0,0.0
Unemployment,0.0
Day,16724.31
Temperature,27422.63
Fuel_Price,36099.36
Year,38245.98
x0_11.0,52521.71
x0_6.0,55438.54
Month,59890.52
x0_19.0,66294.55
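The coefficients that are exactly 0.0 in the table above (`Unemployment`, `x1_1.0`) reflect a known property of the L1 penalty: Lasso can set coefficients to exactly zero, whereas Ridge only shrinks them. The following sketch illustrates the contrast on synthetic data; the data and the `alpha` values are assumptions and unrelated to the Walmart dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features drive the target; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

A zero Lasso coefficient means the feature was dropped at this value of `alpha`; it does not by itself show that the feature is unrelated to sales, since correlated features can absorb each other's effect.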


In [185]:
# Plot coefficients
fig = px.bar(feature_importance, orientation = 'h')
fig.update_layout(showlegend = False, 
                  margin = {'l': 120} # to avoid cropping of column names
                 )
fig.show()