<img src="https://www.bestdesigns.co/uploads/inspiration_images/4350/990__1511457498_404_walmart.png" alt="WALMART LOGO" />

# Walmart : predict weekly sales

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores from the United States, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing department has asked you to build a machine learning model able to estimate the weekly sales in their stores as accurately as possible. Such a model would help them better understand how sales are influenced by economic indicators, and might be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1 : perform an EDA and all the preprocessing needed to prepare the data for machine learning
- Part 2 : train a **linear regression model** (baseline)
- Part 3 : avoid overfitting by training a **regularized regression model**

## Scope of this project 🖼️

For this project, you'll work with a dataset that contains information about weekly sales achieved by different Walmart stores, and other variables such as the unemployment rate or the fuel price, that might be useful for predicting the amount of sales. The dataset has been taken from a Kaggle competition, but we made some changes compared to the original data. Please make sure that you're using **our** custom dataset (available on JULIE). 🤓

## Deliverable 📬

To complete this project, your team should: 

- Create some visualizations
- Train at least one **linear regression model** on the dataset, that predicts the amount of weekly sales as a function of the other variables
- Assess the model's performance using a metric that is relevant for regression problems
- Interpret the coefficients of the model to identify what features are important for the prediction
- Train at least one model with **regularization (Lasso or Ridge)** to reduce overfitting


## Helpers 🦮

Here are a few tips to help you complete this project: 

### Part 1 : EDA and data preprocessing

Start your project by exploring your dataset : create figures, compute some statistics, etc.

Then, you'll have to preprocess the dataset. You can follow the guidelines from the *preprocessing template*. Some transformations specific to this dataset will also be needed, for example on the *Date* column, which can't be included as it is in the model. Below are some hints that might help you 🤓

 #### Preprocessing to be planned with pandas

 **Drop rows where target values are missing :**
 - Here, the target variable (Y) corresponds to the column *Weekly_Sales*, which contains some missing values.
 - We never use imputation techniques on the target : it might create some bias in the predictions !
 - So we will simply drop the rows of the dataset for which the value in *Weekly_Sales* is missing.
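
As a minimal sketch of this step, the snippet below drops the rows whose target is missing with pandas' `dropna(subset=...)`. The `toy` DataFrame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Weekly_Sales": [1500000.0, np.nan, 900000.0, np.nan],
    "CPI": [211.0, 128.6, np.nan, 190.2],
})

# Keep only the rows where the target is known.
# Missing values in the other columns are left alone: they can be imputed later.
toy_clean = toy.dropna(subset=["Weekly_Sales"]).reset_index(drop=True)

print(toy_clean)
```

Restricting `dropna` to the target column avoids throwing away rows that are only missing an explanatory variable.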
 
**Create usable features from the *Date* column :**
The *Date* column cannot be included as it is in the model. You can either drop this column or create new columns containing the following numeric features : 
- *year*
- *month*
- *day*
- *day of week*
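
The features listed above can be sketched with pandas' `to_datetime` and the `.dt` accessor. The two dates below follow the dataset's day-month-year format; the DataFrame name `dates` and the new column names are illustrative choices.

```python
import pandas as pd

# Two dates written day-month-year, as in the dataset
dates = pd.DataFrame({"Date": ["18-02-2011", "25-03-2011"]})

# An explicit format avoids any day/month ambiguity when parsing
dates["Date"] = pd.to_datetime(dates["Date"], format="%d-%m-%Y")

dates["Year"] = dates["Date"].dt.year
dates["Month"] = dates["Date"].dt.month
dates["Day"] = dates["Date"].dt.day
dates["DayOfWeek"] = dates["Date"].dt.dayofweek  # Monday=0 ... Sunday=6

print(dates)
```

It is worth checking afterwards whether each new column actually varies: a feature that takes a single value carries no information for the model.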

**Drop rows containing invalid values or outliers :**
In this project, a value is considered an outlier if it falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$ computed for its column. This applies to the columns *Temperature*, *Fuel_Price*, *CPI* and *Unemployment*.
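
One possible way to apply this rule to several columns at once is to accumulate a single boolean mask, so that the filter computed for one column is not lost when the next column is processed. The helper `drop_three_sigma_outliers` and the synthetic data are hypothetical; keeping rows with missing values (so they can be imputed later) is an assumption of this sketch, not a requirement of the project.

```python
import numpy as np
import pandas as pd

def drop_three_sigma_outliers(df, columns):
    """Return the rows of df whose values in `columns` all lie within mean +/- 3 std.

    Missing values are kept (assumption of this sketch) so they can be imputed later.
    """
    keep = pd.Series(True, index=df.index)
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        in_range = df[col].between(mean - 3 * std, mean + 3 * std)
        # Combine with the masks of the previous columns instead of overwriting them
        keep &= in_range | df[col].isna()
    return df[keep]

# Synthetic data: ordinary values, one extreme value per column, one missing CPI
n = 31
temperature = np.where(np.arange(n) % 2 == 0, 59.0, 61.0)
temperature[30] = 500.0   # extreme temperature in the last row
cpi = np.where(np.arange(n) % 2 == 0, 200.0, 210.0)
cpi[0] = 2000.0           # extreme CPI in the first row
cpi[5] = np.nan           # missing CPI, kept by the filter
toy = pd.DataFrame({"Temperature": temperature, "CPI": cpi})

filtered = drop_three_sigma_outliers(toy, ["Temperature", "CPI"])
print(f"{len(toy)} rows before, {len(filtered)} rows after")
```

Note that the mean and standard deviation are computed on the unfiltered column, so a very extreme value inflates the bounds it is tested against.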
 


**Target variable (Y) to predict, to be separated from the other columns** : *Weekly_Sales*

 **------------**

 #### Preprocessings to be planned with scikit-learn

 **Explanatory variables (X)**
We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

 - Categorical variables : Store, Holiday_Flag
 - Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day, DayOfWeek

### Part 2 : Baseline model (linear regression)
Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results ?
Besides, it would be interesting to analyze the values of the model's coefficients to know what features are important for the prediction. To do so, the `.coef_` attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

### Part 3 : Fight overfitting
In this last part, you'll have to train a **regularized linear regression model**. You'll find below some useful classes in scikit-learn's documentation :
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

**Bonus question**

In regularized regression models, there's a hyperparameter called *the regularization strength* that can be fine-tuned to get the best generalized predictions on a given dataset. This fine-tuning can be done thanks to scikit-learn's GridSearchCV class : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Also, you'll find here some examples of how to use GridSearchCV together with Ridge or Lasso models : https://alfurka.github.io/2018-11-18-grid-search/

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, GridSearchCV

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) # to avoid deprecation warnings

pd.options.mode.chained_assignment = None  # default='warn'

### Part 1 : EDA and data preprocessing

In [2]:
# Import of the data
data = pd.read_csv('Walmart_Store_sales.csv')
display (data.head(2))
print (f"{data.shape[0]} rows x {data.shape[1]} columns")

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47


150 rows x 8 columns


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         150 non-null    float64
 1   Date          132 non-null    object 
 2   Weekly_Sales  136 non-null    float64
 3   Holiday_Flag  138 non-null    float64
 4   Temperature   132 non-null    float64
 5   Fuel_Price    136 non-null    float64
 6   CPI           138 non-null    float64
 7   Unemployment  135 non-null    float64
dtypes: float64(7), object(1)
memory usage: 9.5+ KB


In [4]:
data.describe(include='all')

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,150.0,132,136.0,138.0,132.0,136.0,138.0,135.0
unique,,85,,,,,,
top,,19-10-2012,,,,,,
freq,,4,,,,,,
mean,9.866667,,1249536.0,0.07971,61.398106,3.320853,179.898509,7.59843
std,6.231191,,647463.0,0.271831,18.378901,0.478149,40.274956,1.577173
min,1.0,,268929.0,0.0,18.79,2.514,126.111903,5.143
25%,4.0,,605075.7,0.0,45.5875,2.85225,131.970831,6.5975
50%,9.0,,1261424.0,0.0,62.985,3.451,197.908893,7.47
75%,15.75,,1806386.0,0.0,76.345,3.70625,214.934616,8.15


In [5]:
# Creating a new DataFrame to check Weekly Sales by Temperature
data_temp_category = data[["Weekly_Sales", "Temperature"]].copy()

# Binning the Temperature to the nearest 10°F
data_temp_category["Temperature"] = round(data_temp_category["Temperature"]/10)*10
display(data_temp_category.head(2))

# Sum of Weekly_Sales by Temperature bin
data_temp_category = data_temp_category.groupby(["Temperature"], as_index=False).sum()

Unnamed: 0,Weekly_Sales,Temperature
0,1572117.54,60.0
1,1807545.43,40.0


In [6]:
fig = px.bar(data_temp_category, x="Temperature", y="Weekly_Sales", title="Sum of Weekly Sales by Temperature", color="Temperature", color_continuous_scale=px.colors.sequential.Plasma)
fig.update_layout(title_x=0.5, template='plotly_dark', width=1500, height=500)
fig.show()  

##### Conclusion

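The coldest temperature bins show the lowest total sales. Since the chart sums the sales per bin, this may simply reflect fewer observations at those temperatures rather than lower sales per week, or a seasonal effect (maybe people buy less in winter?).  
For reference, 20°F is about -6.67°C.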
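
To separate "fewer observations" from "lower sales per week", the sum can be compared with the mean and the row count of each temperature bin. The sketch below uses made-up numbers and a hypothetical `Temp_Bin` column; it only illustrates the aggregation.

```python
import pandas as pd

# Made-up data: two cold weeks and four mild weeks
toy = pd.DataFrame({
    "Temperature": [18.0, 22.0, 58.0, 61.0, 63.0, 59.0],
    "Weekly_Sales": [1000.0, 1200.0, 900.0, 1100.0, 1000.0, 1000.0],
})

# Bin the temperature to the nearest 10°F
toy["Temp_Bin"] = (toy["Temperature"] / 10).round() * 10

# Sum, mean and number of rows per bin
summary = toy.groupby("Temp_Bin")["Weekly_Sales"].agg(["sum", "mean", "count"])
print(summary)
```

In this toy example the cold bin has the lower total but the higher mean, simply because it contains fewer rows.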

In [7]:
# Find correlation between the numeric variables (Date is a string column, so it is excluded)
corr = data.select_dtypes(include="number").corr()

# Heatmap
fig = go.Figure()
fig.add_trace(go.Heatmap(
    z = corr,
    x = corr.columns.values,
    y = corr.columns.values,
    colorscale = px.colors.diverging.RdBu,
    zmid=0
    ))

fig.update_layout(width=1500, height=800, paper_bgcolor='black', font_color='white')
fig.show()

##### Conclusion

- CPI is strongly correlated with Store.  
- CPI is moderately correlated with Unemployment

In [8]:
# Mean of the numeric columns by Store
data_store_category = data.groupby(["Store"], as_index=False).mean(numeric_only=True)

# Bar chart of Weekly_Sales by Store
fig = px.bar(data_store_category, x="Store", y="Weekly_Sales", hover_name="Store", title="Mean of Weekly Sales by Store")
fig.update_layout(title_x=0.5, template='plotly_dark', width=1500, height=500, xaxis = dict(type = 'category'))
fig.show()


##### Conclusion

Average weekly sales vary widely from one store to another.

In [9]:
# Checking percentage of missing values in Weekly_Sales
weekly_sales_missing_values = (data['Weekly_Sales'].isnull().sum())/(data['Weekly_Sales'].shape[0])*100
print (f"{round(weekly_sales_missing_values, 2)}% of the values in the 'Weekly_Sales' column are missing.")

# Dropping the rows with missing values
data = data.dropna(subset=['Weekly_Sales'])
print ("Dropping the rows with missing values...")

# Checking percentage of missing values in Weekly_Sales a second time
weekly_sales_missing_values = (data['Weekly_Sales'].isnull().sum())/(data['Weekly_Sales'].shape[0])*100
print (f"{round(weekly_sales_missing_values, 2)}% of the values in the 'Weekly_Sales' column are missing.")

9.33% of the values in the 'Weekly_Sales' column are missing.
Dropping the rows with missing values...
0.0% of the values in the 'Weekly_Sales' column are missing.


In [10]:
# Columns to check for outliers
cols = ["Temperature", "Fuel_Price", "CPI", "Unemployment"]

# Keeping only the values within 3 standard deviations of the mean
# Note: data2 is rebuilt from data on every iteration, so only the filter on the last column (Unemployment) is retained,
# and rows where Unemployment is missing are dropped as well (comparisons with NaN are False)
for col in cols:
    outliers_min = data[col].mean() - data[col].std()*3
    outliers_max = data[col].mean() + data[col].std()*3
    data2 = data[(data[col] > outliers_min) & (data[col] < outliers_max)]

In [11]:
# Transforming the date into a datetime format
data2['Date'] = pd.to_datetime(data2['Date'], format="%d-%m-%Y")

# Extracting the year, month, day and day of week from the Date column
data2['Year'] = data2['Date'].dt.year
data2['Month'] = data2['Date'].dt.month
data2['Day'] = data2['Date'].dt.day
data2['Day_of_Week'] = data2['Date'].dt.dayofweek

# Dropping the rows with missing values in the Date column
data2 = data2.dropna(subset=['Date'])

In [12]:
# Checking values of the day of week
print (data2["Day_of_Week"].value_counts())

4.0    102
Name: Day_of_Week, dtype: int64


##### Conclusion

The day of week is always 4 (Friday). Since the value never changes, this column brings no useful information to our machine learning model.

In [13]:
# Dropping the Date and Day_of_Week columns
data2 = data2.drop(['Date', 'Day_of_Week'], axis=1)
data2.head(2)

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0
1,13.0,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0


### Part 2 : Baseline model (linear regression)

In [14]:
# Separating the explanatory variables (X) from the target variable (y)
X = data2.drop(['Weekly_Sales'], axis=1)
y = data2['Weekly_Sales']

# Checking percentage of missing values in the explanatory variables
missing_pct = X.isnull().sum()/X.shape[0]*100
missing_pct.sort_values(ascending=False)

Fuel_Price      9.803922
Holiday_Flag    8.823529
Temperature     7.843137
CPI             7.843137
Store           0.000000
Unemployment    0.000000
Year            0.000000
Month           0.000000
Day             0.000000
dtype: float64

In [15]:
# Creating a train_test_split from my explanatory and target variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2, random_state = 42)

In [16]:
# Splitting my numerical and categorical features
numeric_features = ["Temperature", "Fuel_Price", "CPI", "Unemployment", "Year", "Month", "Day"]
categorical_features = ["Store", "Holiday_Flag"]

# Create pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # missing values will be replaced by the column's mean
    ('scaler', StandardScaler())
])

# Create pipeline for categorical features
categorical_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder(drop='first')) # first category will be dropped to avoid perfect collinearity between the encoded columns
    ])

# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Preprocessings on train set
X_train = preprocessor.fit_transform(X_train)

# Preprocessings on test set
X_test = preprocessor.transform(X_test)

In [17]:
# Instantiating a linear regression model
regressor = LinearRegression()

# Fitting the model
regressor.fit(X_train, y_train)

LinearRegression()

In [18]:
# Predictions on training set
Y_train_pred = regressor.predict(X_train)

# Predictions on test set
Y_test_pred = regressor.predict(X_test)

In [19]:
# Print R^2 scores
print("R2 score on training set : ", r2_score(y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(y_test, Y_test_pred))

R2 score on training set :  0.9748702066689389
R2 score on test set :  0.9509327344030791


##### Conclusion

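With an R² of about 0.97 on the training set and 0.95 on the test set, the baseline fits the data well, and the small gap between the two scores suggests little overfitting. The test set is small (20% of 102 rows), so the test score should be read with some caution.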
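
For reference, here is a hand-written sketch of what the R² score measures, together with the RMSE, which is expressed in the same unit as the target. The function `r2_and_rmse` and the small arrays are illustrative only; in the notebook itself `r2_score` from scikit-learn does this job.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot and the root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of "always predict the mean"
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return r2, rmse

r2, rmse = r2_and_rmse([3, 5, 7, 9], [2.5, 5.5, 7, 9.5])
print(f"R2 = {r2:.4f}, RMSE = {rmse:.4f}")
```

An R² close to 1 means the model leaves little of the target's variance unexplained, while the RMSE gives an idea of the typical error size in dollars of weekly sales.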

In [20]:
# Building the list of feature names (numeric features first, then the one-hot encoded columns)
all_column = numeric_features + preprocessor.transformers_[1][1].named_steps['encoder'].get_feature_names_out().tolist()

# Creating a DataFrame to store the coefficients
df = pd.DataFrame()

# Adding the Features names
df['Features'] = all_column

# Adding the coefficients in an order matching the feature names based on the preprocessor order
df['Coefficients'] = regressor.coef_

# We want absolute values of the coefficients to sort them from the most important to the least important
df['Coefficients'] = df['Coefficients'].abs()
df.sort_values(by='Coefficients', ascending=False, inplace=True)

df


Unnamed: 0,Features,Coefficients
10,x0_5.0,1326590.0
8,x0_3.0,1239972.0
14,x0_9.0,1217274.0
20,x0_16.0,1144005.0
12,x0_7.0,1087850.0
19,x0_15.0,862457.8
21,x0_17.0,816399.7
13,x0_8.0,792159.5
9,x0_4.0,612451.8
18,x0_14.0,595394.2


##### Conclusion

Judging by the magnitude of the coefficients, the most important features are the store indicators (the one-hot encoded *Store* column, shown as `x0_...`).

### Part 3 : Fight overfitting

In [21]:
# Instantiating 2 ridge regression models with different alpha values
ridge_regressor_small_alpha = Ridge(alpha = 10)
ridge_regressor_large_alpha = Ridge(alpha = 10000)

# Training both models
ridge_regressor_small_alpha.fit(X_train, y_train)
ridge_regressor_large_alpha.fit(X_train, y_train)

Ridge(alpha=10000)

In [22]:
print("Score on training: ")
print("Linear Regression score : {}".format(regressor.score(X_train, y_train)))
print("Ridge with small Alpha score : {}".format(ridge_regressor_small_alpha.score(X_train, y_train)))
print("Ridge with large Alpha score : {}".format(ridge_regressor_large_alpha.score(X_train,y_train)))

Score on training: 
Linear Regression score : 0.9748702066689389
Ridge with small Alpha score : 0.5792883868283731
Ridge with large Alpha score : 0.004807188995228806


##### Conclusion

The two Ridge models, with arbitrarily chosen alpha values, score far lower on the training set than our linear regression. Let's use GridSearchCV to search over more values of alpha.
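
To see why large alpha values hurt the fit, the sketch below solves ridge regression in closed form, $w = (X^TX + \alpha I)^{-1}X^Ty$, on synthetic data and prints the norm of the coefficients for the same three alpha values. It is a simplified illustration: there is no intercept, whereas scikit-learn's `Ridge` also fits an unpenalized intercept, and the data and the helper `ridge_coefficients` are made up.

```python
import numpy as np

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.1, size=80)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
for alpha in [0.0, 10.0, 10000.0]:
    w = ridge_coefficients(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:>8}: ||w|| = {norms[-1]:.4f}")
```

The penalty competes with $X^TX$, whose size grows with the number of rows. With a small training set, and store dummies that are equal to 1 for only a handful of rows each, even alpha = 10 can shrink the coefficients substantially.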

In [23]:
params = {'alpha': np.arange(0, 10000, 100)} # determine the range of parameters to try
ridge = Ridge() # create an instance of the model

grid = GridSearchCV(ridge, params, cv = 10, verbose = 1)
grid_fit = grid.fit(X_train, y_train)

print("Optimal value for alpha : ", grid_fit.best_params_)

print('Train score for the best model : ', grid_fit.best_estimator_.score(X_train,y_train))
print('Test score for the best model : ', grid_fit.best_estimator_.score(X_test,y_test))

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
Optimal value for alpha :  {'alpha': 0}
Train score for the best model :  0.9748702066689389
Test score for the best model :  0.9509327344030788


##### Conclusion

The best value found is alpha = 0, which means no regularization: the scores match those of the baseline linear regression.  
I would like to refine the search, so I will run more GridSearchCV passes with finer steps near 0.

In [24]:
params = {'alpha': np.arange(0, 100, 1)} # determine the range of parameters to try
ridge = Ridge() # create an instance of the model

grid = GridSearchCV(ridge, params, cv = 10, verbose = 1)
grid_fit = grid.fit(X_train, y_train)

print("Optimal value for alpha : ", grid_fit.best_params_)

print('Train score for the best model : ', grid_fit.best_estimator_.score(X_train,y_train))
print('Test score for the best model : ', grid_fit.best_estimator_.score(X_test,y_test))

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
Optimal value for alpha :  {'alpha': 0}
Train score for the best model :  0.9748702066689389
Test score for the best model :  0.9509327344030788


In [25]:
params = {'alpha': np.arange(0, 1, 0.01)} # determine the range of parameters to try
ridge = Ridge() # create an instance of the model

grid = GridSearchCV(ridge, params, cv = 10, verbose = 1)
grid_fit = grid.fit(X_train, y_train)

print("Optimal value for alpha : ", grid_fit.best_params_)

print('Train score for the best model : ', grid_fit.best_estimator_.score(X_train,y_train))
print('Test score for the best model : ', grid_fit.best_estimator_.score(X_test,y_test))

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
Optimal value for alpha :  {'alpha': 0.03}
Train score for the best model :  0.9746093624944998
Test score for the best model :  0.9543602659507405


##### Conclusion

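An alpha of 0.03 is a very light regularization, so the results stay close to the baseline: the test R² goes from 0.951 to 0.954 while the training R² is almost unchanged.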
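
An alternative that might be worth trying is to search alpha on a logarithmic grid and to put the preprocessing inside the estimator passed to GridSearchCV, so that the imputers and the scaler are re-fitted on each training fold. The sketch below is run on synthetic data; the step names (`preprocessor`, `ridge`), the grid bounds, the use of `handle_unknown="ignore"` and the toy columns are assumptions made for illustration, not the setup used in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the dataset (values are made up)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Store": rng.integers(1, 5, size=n).astype(float),
    "CPI": rng.normal(180, 40, size=n),
    "Unemployment": rng.normal(7.5, 1.5, size=n),
})
y = 200000 * toy["Store"] - 1000 * toy["CPI"] + rng.normal(0, 10000, size=n)
toy.loc[::15, "CPI"] = np.nan  # a few missing values to impute

numeric = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                    ("scaler", StandardScaler())])
categorical = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                        ("encoder", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric, ["CPI", "Unemployment"]),
    ("cat", categorical, ["Store"]),
])

# Preprocessing and model in one estimator: each CV fold fits its own imputers and scaler
model = Pipeline([("preprocessor", preprocessor), ("ridge", Ridge())])

# Log-spaced grid from 0.001 to 1000 (13 values)
params = {"ridge__alpha": np.logspace(-3, 3, 13)}
grid = GridSearchCV(model, params, cv=5)
grid.fit(toy, y)

print("Optimal value for alpha : ", grid.best_params_)
print("Cross-validated R2 for the best model : ", grid.best_score_)
```

A logarithmic grid covers several orders of magnitude with few candidates, which avoids running successive linear searches with finer and finer steps.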

In [26]:
# Checking the cross validation score
scores = cross_val_score(grid_fit.best_estimator_, X_train, y_train, cv = 10)

print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

The cross-validated R2-score is :  0.899343822286023
The standard deviation is :  0.05737481772584478


##### Conclusion

The cross-validated R² score (0.90) is lower than my previous scores but still very good, and I consider a standard deviation of about 0.06 across the 10 folds quite reasonable. The model looks reliable.