<img src="https://www.bestdesigns.co/uploads/inspiration_images/4350/990__1511457498_404_walmart.png" alt="WALMART LOGO" />

# Walmart : predict weekly sales

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores. It is headquartered in Bentonville, Arkansas, and was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing service has asked you to build a machine learning model that estimates the weekly sales in their stores as accurately as possible. Such a model would help them better understand how sales are influenced by economic indicators, and might be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1 : perform an EDA and all the preprocessing needed to prepare the data for machine learning
- Part 2 : train a **linear regression model** (baseline)
- Part 3 : avoid overfitting by training a **regularized regression model**

## Scope of this project 🖼️

For this project, you'll work with a dataset that contains information about weekly sales achieved by different Walmart stores, and other variables such as the unemployment rate or the fuel price, that might be useful for predicting the amount of sales. The dataset has been taken from a Kaggle competition, but we made some changes compared to the original data. Please make sure that you're using **our** custom dataset (available on JULIE). 🤓

## Deliverable 📬

To complete this project, your team should: 

- Create some visualizations
- Train at least one **linear regression model** on the dataset, that predicts the amount of weekly sales as a function of the other variables
- Assess the performance of the model by using a metric that is relevant for regression problems
- Interpret the coefficients of the model to identify which features are important for the prediction
- Train at least one model with **regularization (Lasso or Ridge)** to reduce overfitting


## Helpers 🦮

Here are a few tips to help you complete this project:

### Part 1 : EDA and data preprocessing

Start your project by exploring your dataset: create figures, compute some statistics, etc.

Then, you'll have to preprocess the dataset. You can follow the guidelines from the *preprocessing template*. This dataset also needs some specific transformations, for example on the *Date* column, which can't be included as it is in the model. Below are some hints that might help you 🤓

#### Preprocessing to be planned with pandas

**Drop lines where target values are missing :**
- Here, the target variable (Y) corresponds to the column *Weekly_Sales*, which contains some missing values.
- We never use imputation techniques on the target: it might bias the predictions!
- So we simply drop the lines of the dataset for which the value in *Weekly_Sales* is missing.
 
**Create usable features from the *Date* column :**
The *Date* column cannot be included as it is in the model. You can either drop this column or create new columns that contain the following numeric features :
- *year*
- *month*
- *day*
- *day of week*
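
The feature extraction listed above could be sketched as follows. This is a minimal example on toy data, assuming the dates are stored as day-month-year strings (such as `18-02-2011`); the toy DataFrame and the `DayOfWeek` column name are illustrative choices.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({"Date": ["18-02-2011", "25-03-2011", None]})

# Parse day-month-year strings; missing dates become NaT
toy["Date"] = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

# Derive numeric features from the parsed dates
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Day"] = toy["Date"].dt.day
toy["DayOfWeek"] = toy["Date"].dt.dayofweek  # Monday=0 ... Sunday=6

# The raw Date column is no longer needed by the model
toy = toy.drop(columns=["Date"])
print(toy)
```

Rows with a missing date end up with missing values in the new columns, which can be imputed later like any other numeric feature.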

**Drop lines containing invalid values or outliers :**
In this project, a value of a numeric feature is considered an outlier if it falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$. This applies to the columns *Temperature*, *Fuel_Price*, *CPI* and *Unemployment*.
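
One possible way to apply this three-sigma rule with pandas is sketched below. The helper name `drop_3sigma_outliers` and the toy values are hypothetical, and keeping rows with missing values (so that they can be imputed later) is an assumption, not a requirement of the project.

```python
import pandas as pd

def drop_3sigma_outliers(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Keep rows whose values lie within mean +/- 3 std for each listed column."""
    mask = pd.Series(True, index=data.index)
    for col in columns:
        mean, std = data[col].mean(), data[col].std()
        in_range = data[col].between(mean - 3 * std, mean + 3 * std)
        # Assumption: missing values are kept here so they can be imputed later
        mask &= in_range | data[col].isna()
    return data[mask]

# Toy example: 20 ordinary values and one extreme value
toy = pd.DataFrame({"Unemployment": [7.0 + 0.1 * i for i in range(20)] + [50.0]})
cleaned = drop_3sigma_outliers(toy, ["Unemployment"])
print(len(toy), "->", len(cleaned))
```

The bounds are computed once, before any row is dropped, so the result does not depend on the order of the columns.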
 


**Target variable (Y) to predict, to be separated from the other columns** : *Weekly_Sales*

 **------------**

#### Preprocessing to be planned with scikit-learn

**Explanatory variables (X)**
We need to identify which columns contain categorical variables and which contain numerical variables, as they will be treated differently.

 - Categorical variables : Store, Holiday_Flag
 - Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day, DayOfWeek

### Part 2 : Baseline model (linear regression)
Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results?
Besides, it would be interesting to analyze the values of the model's coefficients to see which features matter most for the prediction. Keep in mind that coefficients are only comparable across features when the features are on the same scale, for example after standardization. To do so, the `coef_` attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

### Part 3 : Fight overfitting
In this last part, you'll have to train a **regularized linear regression model**. The following scikit-learn classes will be useful :
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
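
As a rough illustration of how these classes are used, the sketch below fits a plain linear regression, a Ridge and a Lasso model on synthetic data and compares their R2 scores on the train and test sets. The data and the `alpha` values are arbitrary assumptions; on the real project, the features would come from the preprocessing step.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the preprocessed features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Regularization penalizes coefficient size, so features should share a scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),  # alpha is the regularization strength (arbitrary here)
    "lasso": Lasso(alpha=0.1),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the R2 coefficient for regressors
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```

A smaller gap between the train and test scores than for the baseline is one sign that regularization is reducing overfitting.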

**Bonus question**

In regularized regression models, there's a hyperparameter called *the regularization strength* that can be fine-tuned to get the predictions that generalize best on a given dataset. This fine-tuning can be done with scikit-learn's GridSearchCV class : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

You'll also find some examples of how to use GridSearchCV together with Ridge or Lasso models here : https://alfurka.github.io/2018-11-18-grid-search/
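
A minimal sketch of such a grid search is shown below, on synthetic data. The grid of `alpha` values, the number of folds and the data are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the preprocessed training set (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Candidate regularization strengths (arbitrary grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation, each candidate scored by its mean R2
search = GridSearchCV(Ridge(), param_grid=param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best cross-validated R2:", search.best_score_)
```

By default `GridSearchCV` refits the best model on the whole training data, so `search.best_estimator_` can then be evaluated on the held-out test set.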

In [1]:
#Importing Libraries

import pandas as pd
from pandas_dq import dq_report
import numpy as np
import datetime

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from sklearn.model_selection import train_test_split #, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer #, KNNImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression #, Ridge, Lasso
from sklearn.metrics import r2_score
#from sklearn.model_selection import cross_val_score

In [2]:
url = 'Walmart_Store_sales.csv'
df = pd.read_csv(url)

In [3]:
df.shape

(150, 8)

## MISSING DATA

In [4]:
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092


In [5]:
dq_report(df, verbose=1)

    All variables classified into correct types.



Unnamed: 0,Data Type,Missing Values%,Unique Values%,Minimum Value,Maximum Value,DQ Issue
Store,float64,0.0,,1.0,20.0,No issue
Date,object,12.0,56.0,,,"18 missing values. Impute them with mean, median, mode, or a constant value such as 123., 51 rare categories: Too many to list. Group them into a single category or drop the categories., Mixed dtypes: has 2 different data types: object, float,"
Weekly_Sales,float64,9.333333,,268929.03,2771397.17,"14 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Holiday_Flag,float64,8.0,1.0,,,"12 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Temperature,float64,12.0,,18.79,91.65,"18 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Fuel_Price,float64,9.333333,,2.514,4.193,"14 missing values. Impute them with mean, median, mode, or a constant value such as 123."
CPI,float64,8.0,,126.111903,226.968844,"12 missing values. Impute them with mean, median, mode, or a constant value such as 123."
Unemployment,float64,10.0,,5.143,14.313,"15 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 5 outliers greater than upper bound (10.48) or lower than lower bound(4.27). Cap them or remove them."



In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         150 non-null    float64
 1   Date          132 non-null    object 
 2   Weekly_Sales  136 non-null    float64
 3   Holiday_Flag  138 non-null    float64
 4   Temperature   132 non-null    float64
 5   Fuel_Price    136 non-null    float64
 6   CPI           138 non-null    float64
 7   Unemployment  135 non-null    float64
dtypes: float64(7), object(1)
memory usage: 9.5+ KB


## HANDLING MISSING DATA

In [7]:
# Drop the rows where the target is missing
df.dropna(subset=['Weekly_Sales'], inplace=True)
# Check that no missing values remain in Weekly_Sales
print(df['Weekly_Sales'].isnull().sum())

0


## HANDLING NEGATIVE VALUES, ANOMALIES AND OUTLIERS

In [8]:
# Parse the Date column as datetime; the raw strings are day-month-year (e.g. 18-02-2011)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')


In [9]:
df['Date']

0     2011-02-18
1     2011-03-25
3            NaT
4     2010-05-28
5     2010-05-28
6     2011-06-03
7     2012-02-03
8     2010-12-10
9            NaT
10    2011-08-19
11    2010-10-15
12    2011-05-13
13    2012-03-16
14    2010-10-01
15    2010-04-30
16    2010-08-20
17           NaT
18    2011-12-16
20    2010-04-02
21    2011-05-13
22    2012-10-12
23    2010-03-26
24    2012-05-04
25    2012-10-12
26    2012-01-13
27    2011-05-20
28    2010-04-16
29    2011-08-26
30    2011-05-06
32    2012-02-10
33    2012-02-10
34           NaT
35    2011-03-25
36    2011-09-23
37    2011-04-15
38    2011-06-24
39    2011-11-11
40    2012-04-27
41    2012-09-14
42           NaT
43    2011-08-26
44    2010-02-12
45    2012-02-24
46    2010-07-30
47    2010-07-02
48    2011-08-05
49    2012-03-30
50    2012-06-01
51    2010-11-12
52    2010-06-25
53    2011-03-25
54    2012-07-06
55    2011-09-23
56    2010-08-27
58    2010-07-30
59    2012-02-17
60    2011-05-06
61    2012-03-02
62    2010-12-

In [10]:
# Create new columns for the year, the month and the day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

In [11]:
df.head(10)

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0
1,13.0,2011-03-25,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0
3,11.0,NaT,1244390.03,0.0,84.57,,214.556497,7.346,,,
4,6.0,2010-05-28,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0
5,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0
6,15.0,2011-06-03,695396.19,0.0,69.8,4.069,134.855161,7.658,2011.0,6.0,3.0
7,20.0,2012-02-03,2203523.2,0.0,39.93,3.617,213.023622,6.961,2012.0,2.0,3.0
8,14.0,2010-12-10,2600519.26,0.0,30.54,3.109,,,2010.0,12.0,10.0
9,3.0,NaT,418925.47,0.0,60.12,3.555,224.13202,6.833,,,
10,8.0,2011-08-19,895066.5,0.0,82.92,3.554,219.070197,6.425,2011.0,8.0,19.0


In [12]:
# Convert temperature from Fahrenheit to Celsius
df['Temperature'] = round((df['Temperature']-32)*5/9 ,1)

In [13]:
df['Temperature']

0      15.3
1       5.8
3      29.2
4      26.0
5       NaN
6      21.0
7       4.4
8      -0.8
9      15.6
10     28.3
11     11.1
12      2.6
13     18.2
14     15.6
15     20.5
16     24.6
17     -5.9
18     10.1
20      3.5
21     25.2
22     10.5
23      3.9
24     10.4
25      7.1
26     11.0
27      6.6
28      7.3
29      NaN
30     20.2
32     -7.3
33      2.8
34     30.9
35      0.7
36     26.8
37      NaN
38     27.7
39     16.0
40     10.2
41      NaN
42      NaN
43     14.2
44      3.6
45     12.6
46     28.0
47     19.0
48     33.1
49      9.0
50     16.2
51     15.4
52     29.5
53      5.4
54     30.5
55     17.6
56     23.8
58      NaN
59      2.7
60      NaN
61     14.2
62     11.5
63     20.7
64      2.5
65      6.4
66     23.2
67      7.6
68     32.8
70     -2.6
72      2.7
73     29.6
74     20.0
75     24.0
76      8.6
78     16.8
79     20.8
80     10.0
81      NaN
82     11.5
83     26.6
85      7.0
86      NaN
87     24.0
88      4.3
89     24.7
90     25.8
91  

In [14]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Create a figure with a grid of 2 rows x 3 columns
fig = make_subplots(rows=2, cols=3, subplot_titles=('Temperature', 'Fuel Price', 'CPI', 'Unemployment', 'Weekly Sales'))

# Add the box plots to the figure
fig.add_trace(go.Box(y=df['Temperature'], name='Temperature'), row=1, col=1)
fig.add_trace(go.Box(y=df['Fuel_Price'], name='Fuel Price'), row=1, col=2)
fig.add_trace(go.Box(y=df['CPI'], name='CPI'), row=1, col=3)
fig.add_trace(go.Box(y=df['Unemployment'], name='Unemployment'), row=2, col=1)
fig.add_trace(go.Box(y=df['Weekly_Sales'], name='Weekly Sales'), row=2, col=2)

# Update the layout if needed
fig.update_layout(height=600, width=800, title_text="Distribution of the variables")

# Show the figure
fig.show()

In [15]:
# Remove the outliers in Unemployment
# Note: this comparison also drops the rows where Unemployment is missing (NaN < 10 is False)
df = df[df['Unemployment'] < 10]

In [16]:
# Box plot for Unemployment
fig_unemp = px.box(df, y='Unemployment', title='Unemployment')
fig_unemp.show()

# Now that the dataset is slightly cleaner, let's analyze it a bit

In [17]:
# Overall mean of weekly sales
rounded_mean = round(df['Weekly_Sales'].mean(), 2)
print(rounded_mean)

1268911.28


In [18]:
# Mean weekly sales per store
mean_sales_by_store = df.groupby('Store')['Weekly_Sales'].mean().round(2)
print(mean_sales_by_store)

Store
1.0     1569313.26
2.0     1927775.91
3.0      404768.97
4.0     2173758.98
5.0      303042.80
6.0     1551123.58
7.0      536664.40
8.0      895145.74
9.0      506095.44
10.0    1822105.81
11.0    1523746.83
13.0    1997235.41
14.0    2052948.62
15.0     624494.29
16.0     515317.77
17.0     908105.90
18.0    1127564.73
19.0    1400615.22
20.0    1962384.41
Name: Weekly_Sales, dtype: float64


In [19]:
# Total sales per store
total_sales_by_store = df.groupby('Store')['Weekly_Sales'].sum().round(2)
print(total_sales_by_store)

Store
1.0     12554506.09
2.0     11566655.45
3.0      4452458.66
4.0     13042553.90
5.0      2121299.63
6.0      9306741.48
7.0      3756650.81
8.0      4475728.69
9.0      1518286.32
10.0     9110529.05
11.0     4571240.48
13.0    17975118.68
14.0    14370640.37
15.0     1873482.86
16.0     2061271.09
17.0     4540529.51
18.0    10148082.61
19.0    11204921.72
20.0     9811922.06
Name: Weekly_Sales, dtype: float64


In [20]:
# Correlation heatmap
corr = df.corr()
fig = px.imshow(corr)
fig.show()


In [21]:
# Recompute the correlation matrix of the DataFrame df
corr = df.corr()

fig = px.imshow(corr)

# Write the correlation value inside each cell
for y in range(corr.shape[0]):
    for x in range(corr.shape[1]):
        fig.add_annotation(x=x, y=y, 
                           text=str(round(corr.iloc[y, x], 2)), 
                           showarrow=False, 
                           font=dict(color="black"))


# Adjust the layout for better readability
fig.update_layout(
        xaxis=dict(side="top"),
        width=800,  # Figure width in pixels
        height=600)  # Figure height in pixels

fig.show()

In [22]:
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,2011-02-18,1572117.54,,15.3,3.045,214.777523,6.858,2011.0,2.0,18.0
1,13.0,2011-03-25,1807545.43,0.0,5.8,3.435,128.616064,7.47,2011.0,3.0,25.0
3,11.0,NaT,1244390.03,0.0,29.2,,214.556497,7.346,,,
4,6.0,2010-05-28,1644470.66,0.0,26.0,2.759,212.412888,7.092,2010.0,5.0,28.0
5,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0


In [23]:
# Separate target variable Y from features X
print("Separating labels from features...")
features_list = ["Store", "Holiday_Flag", "Temperature","Fuel_Price", "CPI", "Unemployment", "Year", "Month", "Day"]
target_variable = ['Weekly_Sales']

X = df[features_list]
Y = df[target_variable]

print("...Done.")
print()

print('Y : ')
print(Y.head())
print()
print('X :')
print(X.head())

Separating labels from features...
...Done.

Y : 
   Weekly_Sales
0   1572117.54 
1   1807545.43 
3   1244390.03 
4   1644470.66 
5   1857533.70 

X :
   Store  Holiday_Flag  Temperature  Fuel_Price      CPI     Unemployment  \
0   6.0        NaN         15.3         3.045    214.777523      6.858      
1  13.0        0.0          5.8         3.435    128.616064      7.470      
3  11.0        0.0         29.2           NaN    214.556497      7.346      
4   6.0        0.0         26.0         2.759    212.412888      7.092      
5   4.0        0.0          NaN         2.756    126.160226      7.896      

    Year   Month   Day  
0  2011.0   2.0   18.0  
1  2011.0   3.0   25.0  
3     NaN   NaN    NaN  
4  2010.0   5.0   28.0  
5  2010.0   5.0   28.0  


In [24]:
# Divide dataset Train set & Test set 
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [25]:
# Distinguish numeric and categorical features:
numeric_features = ['Day', 'Month', 'Year', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
categorical_features = ['Store', 'Holiday_Flag']

In [26]:
# Create pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # missing values will be replaced by columns' mean
    ('scaler', StandardScaler())
])

In [27]:
# Create pipeline for categorical features
categorical_transformer = Pipeline(
    steps=[
    ('encoder', OneHotEncoder(drop='first')) # first column will be dropped to avoid creating correlations between features
    ])

In [28]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [29]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5]) # positional slicing: X_train is no longer a pandas DataFrame (here it is a sparse matrix)
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test) # Don't fit again! The test set is used to validate decisions
# made on the training set, so we can only apply transformations whose parameters were fitted on the training set.
# Otherwise information leaks from the test set, which biases all your results.
print('...Done.')
print(X_test[0:5,:]) # positional slicing again: X_test is no longer a pandas DataFrame
print()

Performing preprocessings on train set...
...Done.
  (0, 0)	0.0781062689205895
  (0, 1)	-1.0400197084211238
  (0, 2)	1.465479740751528
  (0, 3)	-0.45803970503736263
  (0, 4)	0.43660144011926266
  (0, 5)	-1.158346128029191
  (0, 6)	-1.2796738149441738
  (0, 17)	1.0
  (1, 0)	-0.5574643507273441
  (1, 1)	1.6681100421342723
  (1, 2)	0.17149231008775384
  (1, 3)	0.03195625849097887
  (1, 4)	-0.07953376552490965
  (1, 5)	1.100543297286518
  (1, 6)	-1.0858840361941644
  (1, 10)	1.0
  (2, 0)	0.9679051364276966
  (2, 1)	0.9910776044954233
  (2, 2)	0.17149231008775384
  (2, 3)	1.1823815641662156
  (2, 4)	0.2986687558522858
  (2, 5)	1.1307033324598592
  (2, 6)	0.1668284621541126
  (2, 8)	1.0
  (3, 1)	-3.0066280027389825e-16
  (3, 2)	-2.9421867808736674e-13
  (3, 3)	-0.9906440132203426
  (3, 4)	-0.1017809726647452
  (3, 5)	-1.2519111920209074
  (3, 6)	1.3305558580151395
  (3, 15)	1.0
  (3, 26)	1.0
  (4, 0)	0.9679051364276966
  (4, 1)	0.9910776044954233
  (4, 2)	0.17149231008775384
  (4, 3)	0.20238

In [30]:
# Train model
print("Train model...")
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


In [31]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = regressor.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[[1980031.51104292]
 [ 416993.14045364]
 [ 323625.52800093]
 [1797671.86802106]
 [1948059.70535726]
 [ 830374.88956693]
 [ 143282.86553663]
 [ 558650.11066219]
 [1615514.62997745]
 [1627619.92755072]
 [ 587515.33003648]
 [2013975.50646618]
 [1569607.94327313]
 [1909718.78010581]
 [ 915177.19029695]
 [ 575840.95326725]
 [2298797.47780941]
 [1608662.53413105]
 [1012083.95796791]
 [ 596003.71558477]
 [ 235574.00489801]
 [1564827.3828365 ]
 [2117432.83143002]
 [ 625899.20087711]
 [ 378318.01455623]
 [1983331.20019669]
 [2080189.58517923]
 [1866742.35002267]
 [ 829521.03327607]
 [1535176.80487895]
 [1524104.4015148 ]
 [1997014.4605434 ]
 [1140921.15881513]
 [ 686018.68031903]
 [2122560.01097479]
 [ 221279.38680038]
 [1594345.33332321]
 [ 692237.80955861]
 [1915044.88544552]
 [2145751.95331364]
 [ 446562.3750355 ]
 [1612812.51000456]
 [ 972834.47931415]
 [2036493.27731095]
 [2399520.93304825]
 [ 564937.58141828]
 [ 266198.06487707]
 [ 995359.29505002]


In [32]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = regressor.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on test set...
...Done.
[[ 597374.28739796]
 [1311690.1039306 ]
 [ 373628.30691442]
 [1326543.21229998]
 [ 445664.05392386]
 [ 349243.96200725]
 [1687169.42780489]
 [ 363999.07656635]
 [ 965681.31435409]
 [ 554037.07480523]
 [ 461007.71960108]
 [1438357.05762777]
 [1702262.09974961]
 [ 494116.02247168]
 [1301020.64020484]
 [1849235.43078691]
 [1444535.85054814]
 [1737390.26423106]
 [1496544.86666986]
 [2017972.18694077]
 [1825579.76003339]
 [1728045.63699549]
 [1557206.37508079]
 [ 401011.68096407]]



In [33]:
# Print R^2 scores
print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

R2 score on training set :  0.9788288362713986
R2 score on test set :  0.9152007244644461


In [34]:
column_names = []
for name, pipeline, features_list in preprocessor.transformers_: # loop over pipelines
    if name == 'num': # if pipeline is for numeric variables
        features = features_list # just get the names of columns to which it has been applied
    else: # if pipeline is for categorical variables
        features = pipeline.named_steps['encoder'].get_feature_names_out() # get output columns names from OneHotEncoder
    column_names.extend(features) # concatenate features names
        
print("Names of columns corresponding to each coefficient: ", column_names)

Names of columns corresponding to each coefficient:  ['Day', 'Month', 'Year', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Store_2.0', 'Store_3.0', 'Store_4.0', 'Store_5.0', 'Store_6.0', 'Store_7.0', 'Store_8.0', 'Store_9.0', 'Store_10.0', 'Store_11.0', 'Store_13.0', 'Store_14.0', 'Store_15.0', 'Store_16.0', 'Store_17.0', 'Store_18.0', 'Store_19.0', 'Store_20.0', 'Holiday_Flag_1.0', 'Holiday_Flag_nan']


In [35]:
# Create a pandas DataFrame
coefs = pd.DataFrame(index = column_names, data = regressor.coef_.transpose(), columns=["coefficients"])
coefs

Unnamed: 0,coefficients
Day,-47758.68
Month,39802.23
Year,-19543.71
Temperature,-39619.84
Fuel_Price,-21957.7
CPI,105119.1
Unemployment,-78621.57
Store_2.0,190480.9
Store_3.0,-1260583.0
Store_4.0,669012.3


In [36]:
# Compute abs() and sort values
feature_importance = abs(coefs).sort_values(by = 'coefficients')
feature_importance

Unnamed: 0,coefficients
Year,19543.71
Fuel_Price,21957.7
Store_19.0,33673.06
Temperature,39619.84
Month,39802.23
Store_6.0,47317.78
Day,47758.68
Holiday_Flag_1.0,65519.99
Holiday_Flag_nan,65580.79
Unemployment,78621.57


In [37]:
# Plot coefficients
fig = px.bar(feature_importance, orientation = 'h')
fig.update_layout(showlegend = False, 
                  margin = {'l': 120} # to avoid cropping of column names
                 )
fig.show()