Project 🚧

Walmart's marketing service has asked you to build a machine learning model able to estimate the weekly sales in their stores, with the best precision possible on the predictions made. Such a model would help them understand better how the sales are influenced by economic indicators, and might be used to plan future marketing campaigns.

Goals 🎯

The project can be divided into three steps:

Part 1 : perform an EDA and all the preprocessing needed to prepare the data for machine learning \
Part 2 : train a linear regression model (baseline) \
Part 3 : reduce overfitting by training a regularized regression model

In [127]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from plotly.subplots import make_subplots

In [128]:
walmart_df=pd.read_csv('Walmart_Store_sales.csv')
walmart_df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092


In [129]:
print('Number of rows :{}'.format(walmart_df.shape[0]))
print('Number of columns :{}'.format(walmart_df.shape[1]))

walmart_df.info()  # info() prints its report itself and returns None

print('Basics statistics:')
display(walmart_df.describe(include='all'))

print('Percentage of missing values:')
display(100 * walmart_df.isnull().sum() / walmart_df.shape[0])

Number of rows :150
Number of columns :8
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         150 non-null    float64
 1   Date          132 non-null    object 
 2   Weekly_Sales  136 non-null    float64
 3   Holiday_Flag  138 non-null    float64
 4   Temperature   132 non-null    float64
 5   Fuel_Price    136 non-null    float64
 6   CPI           138 non-null    float64
 7   Unemployment  135 non-null    float64
dtypes: float64(7), object(1)
memory usage: 9.5+ KB



Basics statistics:


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,150.0,132,136.0,138.0,132.0,136.0,138.0,135.0
unique,,85,,,,,,
top,,19-10-2012,,,,,,
freq,,4,,,,,,
mean,9.866667,,1249536.0,0.07971,61.398106,3.320853,179.898509,7.59843
std,6.231191,,647463.0,0.271831,18.378901,0.478149,40.274956,1.577173
min,1.0,,268929.0,0.0,18.79,2.514,126.111903,5.143
25%,4.0,,605075.7,0.0,45.5875,2.85225,131.970831,6.5975
50%,9.0,,1261424.0,0.0,62.985,3.451,197.908893,7.47
75%,15.75,,1806386.0,0.0,76.345,3.70625,214.934616,8.15


Percentage of missing values:


Store            0.000000
Date            12.000000
Weekly_Sales     9.333333
Holiday_Flag     8.000000
Temperature     12.000000
Fuel_Price       9.333333
CPI              8.000000
Unemployment    10.000000
dtype: float64


Part 1 : EDA and data preprocessing

Start your project by exploring your dataset : create figures, compute some statistics etc...

Then, you'll have to preprocess the dataset. You can follow the guidelines from the preprocessing template. Some transformations specific to this dataset will also be needed, for example on the Date column, which can't be included in the model as it is. Below are some hints that might help you 🤓

Preprocessing to be planned with pandas

Drop lines where target values are missing :

Here, the target variable (Y) corresponds to the column Weekly_Sales. One can see above that there are some missing values in this column.
We never use imputation techniques on the target : it might create some bias in the predictions !
Then, we will just drop the lines in the dataset for which the value in Weekly_Sales is missing.
Create usable features from the Date column : The Date column cannot be included as it is in the model. You can either drop this column or create new columns that contain the following numeric features :

year
month
day
day of week
Drop lines containing invalid values or outliers : In this project, a value of a numeric feature is considered an outlier when it falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$. This concerns the columns : Temperature, Fuel_Price, CPI and Unemployment

Target variable/target (Y) that we will try to predict, to separate from the others : Weekly_Sales

------------

Preprocessings to be planned with scikit-learn

Explanatory variables (X) We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

Categorical variables : Store, Holiday_Flag
Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day, DayOfWeek


In [130]:
# Drop rows where the target Y (Weekly_Sales) is missing
walmart_df = walmart_df.dropna(subset=['Weekly_Sales'])

print('Number of rows :{}'.format(walmart_df.shape[0]))
print('Number of columns :{}'.format(walmart_df.shape[1]))
walmart_df.head()


Number of rows :136
Number of columns :8


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092
5,4.0,28-05-2010,1857533.7,0.0,,2.756,126.160226,7.896


In [131]:
print('Number of rows :{}'.format(walmart_df.shape[0]))
print('Number of columns :{}'.format(walmart_df.shape[1]))

print('Percentage of missing values:')
display(100 * walmart_df.isnull().sum() / walmart_df.shape[0])

Number of rows :136
Number of columns :8
Percentage of missing values:


Store            0.000000
Date            13.235294
Weekly_Sales     0.000000
Holiday_Flag     8.088235
Temperature     11.029412
Fuel_Price       8.823529
CPI              8.088235
Unemployment    10.294118
dtype: float64


In [132]:
#Create usable features for the Date column - year, month, day, day of the week

walmart_df['Date']=pd.to_datetime(walmart_df['Date'],dayfirst=True)

walmart_df['Day'] = walmart_df['Date'].dt.day
walmart_df['Month'] = walmart_df['Date'].dt.month
walmart_df['Year'] = walmart_df['Date'].dt.year

#walmart_df['Dayoftheweek']= walmart_df['Date'].dt.day_name() 
#display(walmart_df['Dayoftheweek'].value_counts())
# -> only yields Fridays, so it is useless for the model.

walmart_df=walmart_df.drop('Date',axis=1)
walmart_df.head()



Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Day,Month,Year
0,6.0,1572117.54,,59.61,3.045,214.777523,6.858,18.0,2.0,2011.0
1,13.0,1807545.43,0.0,42.38,3.435,128.616064,7.47,25.0,3.0,2011.0
3,11.0,1244390.03,0.0,84.57,,214.556497,7.346,,,
4,6.0,1644470.66,0.0,78.89,2.759,212.412888,7.092,28.0,5.0,2010.0
5,4.0,1857533.7,0.0,,2.756,126.160226,7.896,28.0,5.0,2010.0
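The commented-out check in the cell above reports that every date is a Friday. A minimal sketch of the same check with numeric day-of-week codes, using three dates copied from the head of the dataset plus one missing value (the explicit format string is used here instead of dayfirst=True, which is an illustrative choice and not the notebook's own call):

```python
import pandas as pd

# Three dates taken from the first rows of the dataset, plus one missing date
dates = pd.to_datetime(
    pd.Series(["18-02-2011", "25-03-2011", "28-05-2010", None]),
    format="%d-%m-%Y",
)

# Monday=0 ... Sunday=6; a missing date gives NaN
day_of_week = dates.dt.dayofweek
distinct_days = day_of_week.dropna().unique()
print(distinct_days)
```

A single distinct code means the column carries no information for the model, which is consistent with leaving DayOfWeek out of the features.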


In [133]:
fig = make_subplots(rows=4,cols=1)
fig.add_trace(
    go.Histogram(
        x = walmart_df['Temperature'],
        name = 'Temperature'),
    row=1,
    col=1)

fig.add_trace(
    go.Histogram(
        x = walmart_df['Fuel_Price'],
        name = 'Fuel_Price'),
    row=2,
    col=1)

fig.add_trace(
    go.Histogram(
        x = walmart_df['CPI'],
        name = 'CPI'),
    row=3,
    col=1)

fig.add_trace(
    go.Histogram(
        x = walmart_df['Unemployment'],
        name = 'Unemployment'),
    row=4,
    col=1)

fig.update_layout(width=700,height=1000)
fig.show()

In [134]:
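# Dropping outliers that don't fall within the range [mean - 3*std, mean + 3*std].
# This concerns the columns : Temperature, Fuel_Price, CPI and Unemployment
# Note: a comparison with NaN evaluates to False, so rows with a missing value
# in one of these columns are removed as well (see the nan entries in the output).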

outliers = []

for i in ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']:
    print(f"Identifying outliers in {i}...")

    to_keep = (walmart_df[i] < walmart_df[i].mean() + 3 * walmart_df[i].std()) \
        & (walmart_df[i] > walmart_df[i].mean() - 3 * walmart_df[i].std())

    outliers.extend(walmart_df.loc[~to_keep, :].to_dict('records'))

    walmart_df = walmart_df.loc[to_keep, :]

    print('...Done')
    print('Number of rows :{}'.format(walmart_df.shape[0]))

print("Outliers identified:")
display(len(outliers))
for outlier in outliers:
    print(outlier)



Identifying outliers in Temperature...
...Done
Number of rows :121
Identifying outliers in Fuel_Price...
...Done
Number of rows :109
Identifying outliers in CPI...
...Done
Number of rows :102
Identifying outliers in Unemployment...
...Done
Number of rows :90
Outliers identified:


46

{'Store': 4.0, 'Weekly_Sales': 1857533.7, 'Holiday_Flag': 0.0, 'Temperature': nan, 'Fuel_Price': 2.756, 'CPI': 126.1602258, 'Unemployment': 7.896, 'Day': 28.0, 'Month': 5.0, 'Year': 2010.0}
{'Store': 6.0, 'Weekly_Sales': 1420405.41, 'Holiday_Flag': 0.0, 'Temperature': nan, 'Fuel_Price': 3.523, 'CPI': 217.2706543, 'Unemployment': 6.925, 'Day': 26.0, 'Month': 8.0, 'Year': 2011.0}
{'Store': 18.0, 'Weekly_Sales': 988157.72, 'Holiday_Flag': 0.0, 'Temperature': nan, 'Fuel_Price': 3.823, 'CPI': 134.2784667, 'Unemployment': 8.975, 'Day': 15.0, 'Month': 4.0, 'Year': 2011.0}
{'Store': 16.0, 'Weekly_Sales': 526525.16, 'Holiday_Flag': 0.0, 'Temperature': nan, 'Fuel_Price': 3.659, 'CPI': 198.1267184, 'Unemployment': 6.061, 'Day': 14.0, 'Month': 9.0, 'Year': 2012.0}
{'Store': 1.0, 'Weekly_Sales': 1661767.33, 'Holiday_Flag': 1.0, 'Temperature': nan, 'Fuel_Price': 3.73, 'CPI': 222.4390153, 'Unemployment': 6.908, 'Day': nan, 'Month': nan, 'Year': nan}
{'Store': 6.0, 'Weekly_Sales': 1532308.78, 'Holiday
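Several of the records listed above have nan in the filtered column: a comparison with NaN evaluates to False, so the mask also discards rows with missing values that the imputer could have handled later. Here is a minimal sketch of a variant that keeps those rows, on a small made-up column (the helper name within_three_sigma and the toy values are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
import pandas as pd


def within_three_sigma(series: pd.Series) -> pd.Series:
    """Return True for values strictly inside mean +/- 3*std, and for missing values."""
    mean, std = series.mean(), series.std()  # both skip NaN by default
    in_range = (series > mean - 3 * std) & (series < mean + 3 * std)
    # NaN comparisons are False, so missing values are kept explicitly
    return in_range | series.isna()


# Made-up data: 20 ordinary values, one extreme value and one missing value
toy = pd.DataFrame(
    {"Temperature": [60.0 + i % 5 for i in range(20)] + [500.0, np.nan]}
)
filtered = toy.loc[within_three_sigma(toy["Temperature"])]

print(len(toy), "->", len(filtered))
```

Applied to the notebook's data, such a mask would leave more rows for training than the 90 kept above, so every downstream score would change; the results shown in this notebook correspond to the original filter.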

In [135]:
#Target variable/target (Y) that we will try to predict, to separate from the others : Weekly_Sales

#Preprocessings to be planned with scikit-learn:
    
    #Explanatory variables (X) We need to identify which columns contain categorical variables 
    #and which columns contain numerical variables, as they will be treated differently.
    
    #Categorical variables : Store, Holiday_Flag
    #Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day (DayOfWeek is not created: every date is a Friday)

target_name = "Weekly_Sales"

print("Separating labels from features...")
Y = walmart_df.loc[:, target_name]
X = walmart_df.drop(target_name, axis=1)
print("...Done.")
print(Y.head())
print()
print(X.head())
print()


Separating labels from features...
...Done.
0    1572117.54
1    1807545.43
4    1644470.66
6     695396.19
7    2203523.20
Name: Weekly_Sales, dtype: float64

   Store  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment  \
0    6.0           NaN        59.61       3.045  214.777523         6.858   
1   13.0           0.0        42.38       3.435  128.616064         7.470   
4    6.0           0.0        78.89       2.759  212.412888         7.092   
6   15.0           0.0        69.80       4.069  134.855161         7.658   
7   20.0           0.0        39.93       3.617  213.023622         6.961   

    Day  Month    Year  
0  18.0    2.0  2011.0  
1  25.0    3.0  2011.0  
4  28.0    5.0  2010.0  
6   3.0    6.0  2011.0  
7   3.0    2.0  2012.0  



In [136]:
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [137]:
print('Percentage of missing values:')
display(100 * walmart_df.isnull().sum() / walmart_df.shape[0])

Percentage of missing values:


Store            0.000000
Weekly_Sales     0.000000
Holiday_Flag    11.111111
Temperature      0.000000
Fuel_Price       0.000000
CPI              0.000000
Unemployment     0.000000
Day             11.111111
Month           11.111111
Year            11.111111
dtype: float64


In [138]:
numeric_features = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Year', 'Month', 'Day']
numeric_transformer = Pipeline(steps=[
    ("imputer",SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_features = ['Store', 'Holiday_Flag']
categorical_transformer = Pipeline(steps=[
        ("imputer",SimpleImputer(strategy="most_frequent")),  # missing values will be replaced by most frequent value
        ("encoder",OneHotEncoder(drop="first"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [139]:
print("Performing preprocessings on train set...")
print(X_train.head())
X_train = preprocessor.fit_transform(X_train)
print("...Done.")
print(X_train[0:5])
print()

print("Performing preprocessings on test set...")
print(X_test.head())
X_test = preprocessor.transform(X_test)
print("...Done.")
print(X_test[0:5, :])
print()

Performing preprocessings on train set...
     Store  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment  \
127   16.0           0.0        61.79       2.711  189.523128         6.868   
63     5.0           0.0        69.17       3.594  224.019287         5.422   
35    19.0           0.0        33.26       3.789  133.958742         7.771   
10     8.0           0.0        82.92       3.554  219.070197         6.425   
95     1.0           0.0        74.78       2.854  210.337426         7.808   

      Day  Month    Year  
127   9.0    7.0  2010.0  
63   19.0   10.0  2012.0  
35   25.0    3.0  2011.0  
10   19.0    8.0  2011.0  
95   14.0    5.0  2010.0  
...Done.
[[ 0.04260362 -1.26840641  0.20507788 -0.55534542 -1.1763434   0.147002
  -0.86859506  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          1.          0.          0.          0.
   0.          0.        ]
 [ 0.4592769   

In [140]:
print("Train model...")
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


In [141]:
print("Predictions on training set...")
Y_train_pred = regressor.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

print("Predictions on test set...")
Y_test_pred = regressor.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on training set...
...Done.
[ 611364.67099396  370577.26212486 1275740.37137493  879179.76718068
 1536772.70829879 1514868.79536837 1965323.8723865   602145.54012832
  948687.87405245 1089144.04045663 2125262.41163193  650336.60787243
 2145312.0623884   610712.16639662  517258.85415893  778674.43751482
  621000.71199925 1637887.71082182  166083.77933535  532890.97130511
 1846150.02967254 2113342.41663075 1117874.96097089 1449549.93545644
 2064847.33029364 1946434.88789985  420203.37186409 2018205.31305822
  911972.28740893 1619671.09250448 2039633.27775499 1566247.21290487
 1544871.47814237 1918280.17346583  329688.5413809   513754.34016273
  930146.16563808 1520404.73250487 2020147.89985993 2062163.07944381
  523043.15752523 1942173.83959015 1592843.57093179  425386.35441876
  245875.50172863  503128.68941671  438285.84900605 1792986.20265494
 1965095.38313835  420314.62383058 2068359.18918786 1881633.53078707
  798003.7668471  1545014.05935314  471641.00853216  408800.486

Part 2 : Baseline model (linear regression)

Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results ? Besides, it would be interesting to analyze the values of the model's coefficients to find out which features are important for the prediction. To do so, the .coef_ attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [142]:
# Given the scores, which are very close to 1, overfitting can be suspected

print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

R2 score on training set :  0.9868321417045137
R2 score on test set :  0.9352216314000099
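For reference, the R2 score reported above is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot the sum of squared deviations of the target from its mean. A small worked example on made-up numbers (the arrays are illustrative, not taken from the Walmart data) shows that the manual formula agrees with r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # 0.25 + 0.25 + 0 + 0.25 = 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2_manual = 1.0 - ss_res / ss_tot               # 1 - 0.75 / 20 = 0.9625

print(r2_manual, r2_score(y_true, y_pred))
```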


In [143]:
column_names = []
for name, pipeline, features_list in preprocessor.transformers_:
    if name == 'num':
        features = features_list 
    else:
        features = pipeline.named_steps['encoder'].get_feature_names_out()
    column_names.extend(features)
        

coefs = pd.DataFrame(index = column_names, data = regressor.coef_.transpose(), columns=["coefficients"])
feature_importance = abs(coefs).sort_values(by = 'coefficients',ascending=False)
feature_importance

Unnamed: 0,coefficients
x0_4.0,2204173.0
x0_13.0,2066814.0
x0_10.0,1798457.0
x0_19.0,1328211.0
x0_3.0,1250987.0
x0_5.0,1227433.0
x0_9.0,1102517.0
x0_14.0,1017347.0
x0_18.0,987464.4
x0_17.0,856056.7


In [144]:
fig = px.bar(feature_importance, orientation = 'h')
fig.update_layout(showlegend = False, 
                  margin = {'l': 120}# to avoid cropping of column names
                 )
fig.show()

Part 3 : Fight overfitting

In this last part, you'll have to train a regularized linear regression model. You'll find below some useful classes in scikit-learn's documentation :

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
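To illustrate what regularization does before applying it to the Walmart data, here is a minimal sketch on synthetic data (the data, the true coefficients and the list of alpha values are all made up for illustration). It fits Ridge with increasing alpha and records the norm of the coefficient vector, which shrinks as the penalty grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=60)

alphas = [0.1, 1.0, 10.0, 100.0]
# Euclidean norm of the fitted coefficients for each penalty strength
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]

for a, n in zip(alphas, norms):
    print(f"alpha={a:>6}: ||coef|| = {n:.3f}")
```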

In [145]:
ridge = Ridge()
print(ridge)
ridge.fit(X_train, Y_train)

print("R2 score on training set : ", ridge.score(X_train, Y_train))
print("R2 score on test set : ", ridge.score(X_test, Y_test))

Ridge()
R2 score on training set :  0.9326481680110414
R2 score on test set :  0.8246510243579797


In [146]:
# Reducing overfitting / given the variance of the R2 score, we can conclude that the results are consistent

print("Grid search - RIDGE...")
regressor = Ridge()

params = {
    'alpha': [2,3,5] 
}
gridsearch = GridSearchCV(regressor, param_grid = params, cv = 10) 
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best R2 score : ", gridsearch.best_score_)

print("R2 score on training set : ", gridsearch.score(X_train, Y_train))
print("R2 score on test set : ", gridsearch.score(X_test, Y_test))

scores = cross_val_score(gridsearch.best_estimator_, X_train, Y_train, cv = 10)
print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

Grid search - RIDGE...
...Done.
Best hyperparameters :  {'alpha': 2}
Best R2 score :  0.7242359828312347
R2 score on training set :  0.863761716662687
R2 score on test set :  0.7214978628303186
The cross-validated R2-score is :  0.7242359828312347
The standard deviation is :  0.09297959777835434
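In the grid search above, the selected alpha (2) is the smallest candidate, which may indicate that the best value lies below the searched range. A sketch of a wider, log-spaced search, with the scaler and the model wrapped in a single Pipeline so that scaling is refit inside each fold; the synthetic data, the step names and the grid bounds are assumptions for illustration, not the notebook's setup:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=80)

# Preprocessing and model in one estimator: each CV fold fits its own scaler
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge())])

# Six candidates from 0.001 to 100, one per decade
param_grid = {"model__alpha": np.logspace(-3, 2, 6)}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X, y)

best_alpha = search.best_params_["model__alpha"]
print("Best alpha :", best_alpha)
print("Best cross-validated R2 :", search.best_score_)
```

If the best value again sits on an edge of the grid, the range can be extended in that direction and the search repeated.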


In [149]:
lasso1 = Lasso()
print(lasso1)
lasso1.fit(X_train, Y_train)

print("R2 score on training set : ", lasso1.score(X_train, Y_train))
print("R2 score on test set : ", lasso1.score(X_test, Y_test))

Lasso()
R2 score on training set :  0.9864577742465312
R2 score on test set :  0.939043642068452



Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.033e+11, tolerance: 3.010e+09
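The warning above suggests increasing the number of iterations. A minimal sketch of how that could look, on synthetic data (the data, the alpha value and the max_iter value are illustrative assumptions; whether this removes the warning on the Walmart features is not verified here):

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso

# Synthetic target on a scale similar to weekly sales, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 1_000_000 + X @ np.array([300_000.0, -200_000.0, 0.0, 100_000.0, 50_000.0])
y = y + rng.normal(scale=20_000, size=60)

lasso = Lasso(alpha=1.0, max_iter=100_000)  # scikit-learn's default max_iter is 1000

# Record warnings raised during fitting to check for a ConvergenceWarning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lasso.fit(X, y)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("converged:", converged, "| iterations used:", lasso.n_iter_)
```

The warning text also mentions increasing regularisation, which is what the grid search over larger alpha values in the next cell does.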



In [147]:
print("Grid search - LASSO...")
regressor = Lasso()

params = {
    'alpha': [15000,16000,18000] 
}
gridsearch = GridSearchCV(regressor, param_grid = params, cv = 10) 
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best R2 score : ", gridsearch.best_score_)

print("R2 score on training set : ", gridsearch.score(X_train, Y_train))
print("R2 score on test set : ", gridsearch.score(X_test, Y_test))

scores = cross_val_score(gridsearch.best_estimator_, X_train, Y_train, cv = 10)
print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

Grid search - LASSO...
...Done.
Best hyperparameters :  {'alpha': 15000}
Best R2 score :  0.6434168537146401
R2 score on training set :  0.7970363974344489
R2 score on test set :  0.614398827157794
The cross-validated R2-score is :  0.6434168537146401
The standard deviation is :  0.08893369952972789


In [148]:
coefs = pd.DataFrame(index = column_names, data = gridsearch.best_estimator_.coef_.transpose(), columns=["coefficients"])
feature_importance = abs(coefs).sort_values(by = 'coefficients',ascending=False)
feature_importance

Unnamed: 0,coefficients
x0_4.0,758489.044441
x0_7.0,718952.176443
x0_13.0,632064.658206
x0_3.0,628180.777937
x0_5.0,478109.757257
x0_20.0,465782.548921
x0_15.0,402638.008131
x0_9.0,204227.032198
x0_14.0,199458.957807
x0_18.0,196149.87928
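Unlike Ridge, Lasso can set coefficients exactly to zero, so it is worth counting how many features a fitted Lasso model actually keeps. A minimal sketch on synthetic data where only the first two features matter (the data and the alpha value are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two features influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# Coefficients removed by the L1 penalty are exactly 0.0
n_zero = int(np.sum(lasso.coef_ == 0))
print("coefficients:", lasso.coef_)
print("exactly zero:", n_zero, "out of", lasso.coef_.size)
```

The same count could be computed on gridsearch.best_estimator_.coef_ to see how many of the one-hot store columns the selected Lasso model discards.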
