# Walmart: predict weekly sales

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores in the United States, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing department has asked you to build a machine learning model that estimates weekly sales in their stores as accurately as possible. Such a model would help them better understand how sales are influenced by economic indicators, and could be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1: perform an EDA and all the preprocessing needed to prepare the data for machine learning
- Part 2: train a **linear regression model** (baseline)
- Part 3: limit overfitting by training a **regularized regression model**

## Deliverable 📬

To complete this project, your team should: 

- Create some visualizations
- Train at least one **linear regression model** on the dataset that predicts weekly sales as a function of the other variables
- Assess the performance of the model using a metric that is relevant for regression problems
- Interpret the coefficients of the model to identify which features are important for the prediction
- Train at least one model with **regularization (Lasso or Ridge)** to reduce overfitting
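
For the assessment step, two common regression metrics are the coefficient of determination R² and the (root) mean squared error. The sketch below computes both with NumPy on made-up numbers (the arrays are illustrative, not taken from the Walmart dataset), mainly to make explicit what the scikit-learn helpers used later in this notebook report.

```python
import numpy as np

# Illustrative values only (not from the Walmart dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # same unit as the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

R² is unitless and easy to compare across models, while the RMSE gives an error magnitude in the unit of the target.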


# Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline

import plotly.express as px


# Dataset exploration

In [3]:
# Import dataset
print("Loading dataset...")
dataset = pd.read_csv(r"G:\Mon Drive\Fichiers\2.Scolarité\1. Jedha_Data_Science\CERTIF_PROJECTS\ML_Engineer_Certification_Projects\04_SUPERVISED_ML_Walmart_&_Conversion_rate\Walmart\Src\Walmart_Store_sales.csv")
print("...Done.")
dataset.head()

Loading dataset...
...Done.


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092


In [4]:
# Basic stats
print("general info : ")
display(dataset.info())
print()

print("Basics statistics: ")
data_desc = dataset.describe(include='all')
display(data_desc)
print()

print("Percentage of missing values: ")
display(100*dataset.isnull().sum()/dataset.shape[0])

print("Display of dataset: ")
display(dataset.head())
print()

general info : 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         150 non-null    float64
 1   Date          132 non-null    object 
 2   Weekly_Sales  136 non-null    float64
 3   Holiday_Flag  138 non-null    float64
 4   Temperature   132 non-null    float64
 5   Fuel_Price    136 non-null    float64
 6   CPI           138 non-null    float64
 7   Unemployment  135 non-null    float64
dtypes: float64(7), object(1)
memory usage: 9.5+ KB


None


Basics statistics: 


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,150.0,132,136.0,138.0,132.0,136.0,138.0,135.0
unique,,85,,,,,,
top,,19-10-2012,,,,,,
freq,,4,,,,,,
mean,9.866667,,1249536.0,0.07971,61.398106,3.320853,179.898509,7.59843
std,6.231191,,647463.0,0.271831,18.378901,0.478149,40.274956,1.577173
min,1.0,,268929.0,0.0,18.79,2.514,126.111903,5.143
25%,4.0,,605075.7,0.0,45.5875,2.85225,131.970831,6.5975
50%,9.0,,1261424.0,0.0,62.985,3.451,197.908893,7.47
75%,15.75,,1806386.0,0.0,76.345,3.70625,214.934616,8.15



Percentage of missing values: 


Store            0.000000
Date            12.000000
Weekly_Sales     9.333333
Holiday_Flag     8.000000
Temperature     12.000000
Fuel_Price       9.333333
CPI              8.000000
Unemployment    10.000000
dtype: float64

Display of dataset: 


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092





In [5]:
# Drop rows where weekly sales is missing (copy to avoid chained-assignment warnings later)
mask_1 = dataset['Weekly_Sales'].notnull()
dataset = dataset.loc[mask_1, :].copy()
print(dataset.shape[0])

136


In [6]:
# Parse the day-first dates once, then extract year / month / day / day of the week
parsed_date = pd.to_datetime(dataset['Date'], format='%d-%m-%Y')
dataset['Year'] = parsed_date.dt.year
dataset['Month'] = parsed_date.dt.month
dataset['Day'] = parsed_date.dt.day
dataset['Week_day'] = parsed_date.dt.dayofweek


dataset = dataset.drop(columns='Date')

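
The dates in this file are written day-first (for example `18-02-2011`). Here is a minimal sketch, on a hand-made series, of why passing an explicit `format` to `pd.to_datetime` is safer than relying on inference: a string such as `05-02-2010` is otherwise ambiguous between 5 February and 2 May. The sample values below are illustrative.

```python
import pandas as pd

# Illustrative day-first strings, including one missing value
raw = pd.Series(["18-02-2011", "05-02-2010", None])

# An explicit format removes the day/month ambiguity; missing values become NaT
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

features = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
    "Week_day": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(features)
```

Rows with a missing date end up with `NaN` in all four derived columns, which is why they show the same missing-value rate in the next cell.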


In [7]:
display(100*dataset.isnull().sum()/dataset.shape[0])
display(dataset.head())
print(dataset.shape[0])

Store            0.000000
Weekly_Sales     0.000000
Holiday_Flag     8.088235
Temperature     11.029412
Fuel_Price       8.823529
CPI              8.088235
Unemployment    10.294118
Year            13.235294
Month           13.235294
Day             13.235294
Week_day        13.235294
dtype: float64

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Week_day
0,6.0,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,4.0
1,13.0,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,4.0
3,11.0,1244390.03,0.0,84.57,,214.556497,7.346,,,,
4,6.0,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,4.0
5,4.0,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,4.0


136


In [8]:
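def drop_outliers(dataset, columns):
    # Keep rows whose value lies within mean +/- 3 standard deviations.
    # Caveat: filtered_df is rebuilt from the full dataset at each iteration,
    # so only the filter on the last column of `columns` is retained, and
    # between() also excludes rows where that column is NaN.
    for col in columns:
        mean = dataset[col].mean()
        std = dataset[col].std()
        lower_bound = mean - 3 * std
        upper_bound = mean + 3 * std
        filtered_df = dataset[dataset[col].between(lower_bound, upper_bound)]

    return filtered_df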

In [9]:
outlier_columns = ["Temperature", "Fuel_Price", "CPI", "Unemployment"]

dataset_filtered = drop_outliers(dataset.copy(), outlier_columns)

In [10]:
display(dataset_filtered.head())
print(dataset_filtered.shape[0])

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Week_day
0,6.0,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,4.0
1,13.0,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,4.0
3,11.0,1244390.03,0.0,84.57,,214.556497,7.346,,,,
4,6.0,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,4.0
5,4.0,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,4.0


117
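
As written, `drop_outliers` rebuilds `filtered_df` from the full dataset at every iteration, so only the filter on the last column (`Unemployment`) is reflected in the result, and `between` also discards rows where that column is missing. A possible cumulative variant is sketched below. The name `drop_outliers_cumulative` and the choice to keep missing values (leaving them to the imputer) are assumptions, and applying it would change the row counts reported in this notebook.

```python
import numpy as np
import pandas as pd

def drop_outliers_cumulative(df, columns, n_std=3.0):
    """Keep rows within mean +/- n_std * std for every listed column.

    Bounds are computed once on the input frame; NaN values are kept so
    that they can be imputed later (an assumption of this sketch).
    """
    keep = pd.Series(True, index=df.index)
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        in_range = df[col].between(mean - n_std * std, mean + n_std * std)
        keep &= in_range | df[col].isna()
    return df.loc[keep]

# Toy data: one extreme value in "a", one missing value in "b"
toy = pd.DataFrame({
    "a": [1.0] * 19 + [100.0],
    "b": [np.nan] + [2.0] * 19,
})
filtered = drop_outliers_cumulative(toy, ["a", "b"])
print(filtered.shape)
```

On the toy frame, the row holding the extreme value is removed while the row with a missing `b` is kept.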


### Remove rows with a missing date (imputing a date would not make sense)

In [11]:
print((dataset_filtered["Year"].isnull()).value_counts())
dataset_time_notnull = dataset_filtered[dataset_filtered['Year'].notnull()]
print(dataset_time_notnull.shape[0])

Year
False    102
True      15
Name: count, dtype: int64
102


### Model_1 Training

In [12]:
target_variable = "Weekly_Sales"

X = dataset_time_notnull.drop(target_variable, axis = 1)
Y = dataset_time_notnull.loc[:,target_variable]

print(X.shape)
print(Y.shape)

(102, 10)
(102,)


In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

In [14]:
categorical_features = ['Store','Holiday_Flag']
numeric_features = list(set(X.columns) - set(categorical_features))


print(numeric_features)
print(categorical_features)

['Unemployment', 'Fuel_Price', 'Month', 'Temperature', 'Year', 'Week_day', 'CPI', 'Day']
['Store', 'Holiday_Flag']
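
Because a `set` does not preserve insertion order, the order of `numeric_features` printed above is not guaranteed to match the column order of `X`, nor to be the same from one run to the next. The coefficient names stay aligned here because the same list is reused later, but an order-preserving alternative is easy to write. A small sketch, with the column list copied from this notebook:

```python
columns = ["Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI",
           "Unemployment", "Year", "Month", "Day", "Week_day"]
categorical_features = ["Store", "Holiday_Flag"]

# A list comprehension keeps the original column order, unlike a set difference
numeric_features = [c for c in columns if c not in categorical_features]
print(numeric_features)
```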


In [15]:
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler",  StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(drop="first")),
    ]
)


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])



print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test)
print('...Done.')
print()
print("Reshaping target")
Y_train = Y_train.values.reshape(-1,1)
Y_test = Y_test.values.reshape(-1,1)
print("...Done")


Performing preprocessings on train set...
Performing preprocessings on test set...
...Done.

Reshaping target
...Done


In [16]:
print("Train model...")
model_1 = LinearRegression()
model_1.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


### Model_1 Performance assessment

In [17]:
print("R2 score on training set : ", model_1.score(X_train, Y_train))
print("R2 score on test set : ", model_1.score(X_test, Y_test))

mse_model_1 = mean_squared_error(Y_test, model_1.predict(X_test))
print("Mean Squared Error:", mse_model_1)

R2 score on training set :  0.9859923021083215
R2 score on test set :  0.9104953289320068
Mean Squared Error: 43313516398.016975
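
The MSE is expressed in squared sales units, which makes it hard to read. Taking its square root gives an error in the same unit as `Weekly_Sales`; the snippet below does this for the value printed above.

```python
import math

mse_model_1 = 43313516398.016975  # test MSE reported above
rmse_model_1 = math.sqrt(mse_model_1)
print(f"RMSE: {rmse_model_1:,.0f}")
```

This comes to roughly 208,000, to be compared with weekly sales whose quartiles in the original, unfiltered data lie at about 0.6 and 1.8 million.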


### Extracting coefficients and plotting them

In [18]:
coefficients = model_1.coef_
print("Coefficients:", coefficients)
# Check
print((X_train).shape)
print((coefficients).shape)

Coefficients: [[-9.78420505e+04 -6.42994768e+04  3.31811461e+04 -3.41829710e+04
   7.64900804e+03  8.67294148e-09  1.10779562e+05 -6.03129690e+04
   1.65433404e+05 -1.29781206e+06  6.77141738e+05 -1.43514100e+06
  -1.34000917e+05 -9.54805696e+05 -8.69505182e+05 -1.31533660e+06
   6.12443459e+05  1.90997466e+05  4.74588204e+05  6.70557689e+05
  -6.69478910e+05 -1.15194484e+06 -7.60066217e+05 -1.66055520e+05
   1.17227077e+05  3.59312624e+05 -9.12239666e+04]]
(81, 27)
(1, 27)


In [19]:
encoded_feature_names = preprocessor.named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(categorical_features)

final_feature_names = numeric_features + list(encoded_feature_names)
# Check
print(len(final_feature_names))
final_feature_names

27


['Unemployment',
 'Fuel_Price',
 'Month',
 'Temperature',
 'Year',
 'Week_day',
 'CPI',
 'Day',
 'Store_2.0',
 'Store_3.0',
 'Store_4.0',
 'Store_5.0',
 'Store_6.0',
 'Store_7.0',
 'Store_8.0',
 'Store_9.0',
 'Store_10.0',
 'Store_11.0',
 'Store_13.0',
 'Store_14.0',
 'Store_15.0',
 'Store_16.0',
 'Store_17.0',
 'Store_18.0',
 'Store_19.0',
 'Store_20.0',
 'Holiday_Flag_1.0']

In [20]:
feature_coef = pd.DataFrame({'feature': final_feature_names, 'coefficient': coefficients.flatten()})
feature_coef.head()

Unnamed: 0,feature,coefficient
0,Unemployment,-97842.050524
1,Fuel_Price,-64299.47677
2,Month,33181.146124
3,Temperature,-34182.971007
4,Year,7649.008045


In [26]:
fig = px.bar(feature_coef, x="feature", y="coefficient", title='Linear regression coefficients')
fig.show()
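
To read the bar chart, it can help to rank features by the absolute value of their coefficient. The sketch below does so on the five rows displayed above (values copied from `feature_coef.head()`); the column name `abs_coefficient` is introduced here for illustration. Note that the numeric features were standardized, so their coefficients are expressed per standard deviation, while the `Store_*` coefficients are offsets relative to the store dropped by `drop="first"`.

```python
import pandas as pd

# First five coefficients, copied from the output above
feature_coef = pd.DataFrame({
    "feature": ["Unemployment", "Fuel_Price", "Month", "Temperature", "Year"],
    "coefficient": [-97842.050524, -64299.47677, 33181.146124,
                    -34182.971007, 7649.008045],
})

# Rank by magnitude, since the sign only indicates the direction of the effect
ranked = (
    feature_coef
    .assign(abs_coefficient=feature_coef["coefficient"].abs())
    .sort_values("abs_coefficient", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```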

### Model_2 using GridSearchCV, trying both Lasso and Ridge regularizations

In [22]:
#Pipeline
ridge_pipeline = Pipeline([
    ('ridge', Ridge())
])

lasso_pipeline = Pipeline([
    ('lasso', Lasso())
])

#Parameters
ridge_params = {
    'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
}
lasso_params = {
    'lasso__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
}

#Training models
ridge_gridsearch = GridSearchCV(estimator = ridge_pipeline,
                                param_grid = ridge_params,
                                cv = 3)
ridge_gridsearch.fit(X_train, Y_train)

lasso_gridsearch = GridSearchCV(estimator = lasso_pipeline,
                                param_grid = lasso_params,
                                cv = 3)
lasso_gridsearch.fit(X_train, Y_train)

print("...Done.")

print("Best parameters for Ridge:", ridge_gridsearch.best_params_)
print("Best score for Ridge:", ridge_gridsearch.best_score_)

print("Best parameters for Lasso:", lasso_gridsearch.best_params_)
print("Best score for Lasso:", lasso_gridsearch.best_score_)


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.625e+11, tolerance: 2.058e+09


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.620e+11, tolerance: 2.058e+09


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.568e+11, tolerance: 2.058e+09


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.159e+11, tolerance: 2.058e+09


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.956e+10, tolerance: 2.058e+09



...Done.
Best parameters for Ridge: {'ridge__alpha': 0.01}
Best score for Ridge: 0.8954046262462662
Best parameters for Lasso: {'lasso__alpha': 0.001}
Best score for Lasso: 0.8936146332436478


In [23]:
# Instantiate each model with the best alpha found by the grid search

ridge_model = Pipeline([
    ('ridge', Ridge(alpha=0.01))
])

lasso_model = Pipeline([
    ('lasso', Lasso(alpha=0.001))
])


#Training models
ridge_model.fit(X_train, Y_train)
lasso_model.fit(X_train, Y_train)

#### Ridge performance assessment

In [24]:
print("R2 score on training set : ", ridge_model.score(X_train, Y_train))
print("R2 score on test set : ", ridge_model.score(X_test, Y_test))

mse_ridge_model = mean_squared_error(Y_test, ridge_model.predict(X_test))
print("Mean Squared Error:", mse_ridge_model)

R2 score on training set :  0.985950103855604
R2 score on test set :  0.911507940670722
Mean Squared Error: 42823488619.28407


#### Lasso performance assessment

In [25]:
print("R2 score on training set : ", lasso_model.score(X_train, Y_train))
print("R2 score on test set : ", lasso_model.score(X_test, Y_test))

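mse_lasso_model = mean_squared_error(Y_test, lasso_model.predict(X_test))
print("Mean Squared Error:", mse_lasso_model)

R2 score on training set :  0.9859923021083183
R2 score on test set :  0.9104953369118504

The MSE originally printed by this cell (42823488619.28407) was computed from `ridge_model.predict(X_test)`, so it is the Ridge value; the cell needs to be re-run to obtain the Lasso MSE.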
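
When several models are scored with near-identical cells, it is easy to pass the wrong estimator to `predict`. One way to reduce that risk is a small helper that takes the model as an argument. The sketch below uses synthetic data; the helper name `evaluate` is hypothetical, and the alpha values simply mirror those selected above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return train R2, test R2 and test MSE."""
    model.fit(X_train, y_train)
    return {
        "r2_train": model.score(X_train, y_train),
        "r2_test": model.score(X_test, y_test),
        "mse_test": mean_squared_error(y_test, model.predict(X_test)),
    }

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.01),
    "lasso": Lasso(alpha=0.001),
}
results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, scores)
```

Collecting the scores in one dictionary also makes it straightforward to turn them into a comparison table.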
