# Walmart Sales Forecast: Notebook 3

### Problem Statement:

The dataset contains historical sales data for 45 Walmart stores located in different regions. Various events and holidays significantly impact daily sales. Walmart faces challenges due to unforeseen demand fluctuations and stockouts. This project aims to analyze the different attributes influencing sales and to develop accurate predictive models to forecast sales and demand, thereby improving inventory management and reducing stockouts.

### Objectives of this Notebook:

1. Perform feature engineering to create meaningful features for the model.
2. Build and evaluate machine learning models to predict weekly sales.
3. Use the insights gained from the previous analyses to inform model development.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
import pickle

In [2]:
# Load and preprocess the dataset
file_path = 'data/Walmart_Store_sales.csv'
data = pd.read_csv(file_path)

# Convert 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'], format='%d-%m-%Y')

# Display the first few rows of the dataset
data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106


### Feature Engineering

Most of the given attributes correlate only weakly with weekly sales, so we create new features that help the predictive model capture the temporal structure of the data.

1. Extract features such as year, month, week, day of the month, and day of the week from the 'Date' column.

In [3]:
# Extracting date features
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['Week'] = data['Date'].dt.isocalendar().week
data['Day'] = data['Date'].dt.day
data['DayOfWeek'] = data['Date'].dt.dayofweek

data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Week,Day,DayOfWeek
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106,2010,2,5,5,4
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106,2010,2,6,12,4
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,7,19,4
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,8,26,4
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106,2010,3,9,5,4
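
In the five rows shown above, `DayOfWeek` is 4 (Friday) every time. If that holds for the whole dataset, the column carries no information for a model. The sketch below is one way to check for constant columns; the `toy` frame is a hypothetical stand-in that reuses only the five dates printed above.

```python
import pandas as pd

# Stand-in frame built from the five dates shown in the output above.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-19',
                            '2010-02-26', '2010-03-05']),
})
toy['Month'] = toy['Date'].dt.month
toy['DayOfWeek'] = toy['Date'].dt.dayofweek

# Columns with a single distinct value cannot help a model separate rows.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

Running the same list comprehension on `data` would show whether `DayOfWeek` is constant across all stores and weeks.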


2. Create lag features for the 'Weekly_Sales' to capture the sales patterns from previous weeks.

Features such as lags and rolling means often help a model capture time-series structure like that in our dataset.

In [4]:
# Lag features: sales from earlier weeks (Lag_1 => 1 week ago, Lag_2 => 2 weeks ago, Lag_3 => 3 weeks ago)
data['Lag_1'] = data['Weekly_Sales'].shift(1)
data['Lag_2'] = data['Weekly_Sales'].shift(2)
data['Lag_3'] = data['Weekly_Sales'].shift(3)

# Fill NaN values for the initial weeks, which have no lag, with 0
data.fillna(0, inplace=True)
data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Week,Day,DayOfWeek,Lag_1,Lag_2,Lag_3
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106,2010,2,5,5,4,0.0,0.0,0.0
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106,2010,2,6,12,4,1643690.9,0.0,0.0
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,7,19,4,1641957.44,1643690.9,0.0
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,8,26,4,1611968.17,1641957.44,1643690.9
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106,2010,3,9,5,4,1409727.59,1611968.17,1641957.44
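
One caveat: the frame stacks all 45 stores, so a plain `shift` lets the first weeks of each store inherit the last weeks of the store stored above it. The sketch below compares a global shift with a per-store shift, assuming rows are sorted by store and date; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with two stores and three weeks each (values are made up).
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2, 2],
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-19'] * 2),
    'Weekly_Sales': [100.0, 110.0, 120.0, 900.0, 910.0, 920.0],
})
toy = toy.sort_values(['Store', 'Date'])

# Global shift: store 2's first week inherits store 1's last week (120.0).
toy['Lag_1_global'] = toy['Weekly_Sales'].shift(1)

# Per-store shift: each store's first week has no previous value (NaN).
toy['Lag_1_by_store'] = toy.groupby('Store')['Weekly_Sales'].shift(1)

print(toy[['Store', 'Weekly_Sales', 'Lag_1_global', 'Lag_1_by_store']])
```

Whether the missing first weeks are then filled with 0, dropped, or left as NaN is a separate modelling choice.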


3. Create rolling mean and rolling standard deviation features for 'Weekly_Sales' to capture the trend and volatility.


In [5]:
# Create rolling mean and standard deviation features
data['Rolling_Mean_3'] = data['Weekly_Sales'].rolling(window=3).mean().shift(1)
data['Rolling_Std_3'] = data['Weekly_Sales'].rolling(window=3).std().shift(1)

# Fill NaN values resulting from rolling operations with 0
data.fillna(0, inplace=True)
data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Week,Day,DayOfWeek,Lag_1,Lag_2,Lag_3,Rolling_Mean_3,Rolling_Std_3
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106,2010,2,5,5,4,0.0,0.0,0.0,0.0,0.0
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106,2010,2,6,12,4,1643690.9,0.0,0.0,0.0,0.0
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,7,19,4,1641957.44,1643690.9,0.0,0.0,0.0
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,8,26,4,1611968.17,1641957.44,1643690.9,1632539.0,17835.791719
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106,2010,3,9,5,4,1409727.59,1611968.17,1641957.44,1554551.0,126313.968444
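
The rolling statistics can also be computed store by store, so that a window never spans two stores. The sketch below is one possible variant, not the notebook's method: it shifts first and then rolls, which gives the same "previous three weeks" window as `rolling(3).mean().shift(1)`. The `toy` values are made up.

```python
import pandas as pd

# Toy frame: five weeks for each of two stores (values are made up).
toy = pd.DataFrame({
    'Store': [1] * 5 + [2] * 5,
    'Weekly_Sales': [10.0, 20.0, 30.0, 40.0, 50.0,
                     100.0, 200.0, 300.0, 400.0, 500.0],
})

grouped = toy.groupby('Store')['Weekly_Sales']

# shift(1) excludes the current week, so the window covers only past weeks.
toy['Rolling_Mean_3'] = grouped.transform(lambda s: s.shift(1).rolling(3).mean())
toy['Rolling_Std_3'] = grouped.transform(lambda s: s.shift(1).rolling(3).std())

print(toy)
```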


Because this is time-series data, weekly sales depend strongly on the previous weeks' sales and trends: weekly sales correlate highly with the lag and rolling-mean features created above.
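
The correlation claim above can be checked with `DataFrame.corr`. The sketch below uses a synthetic random walk as a hypothetical stand-in for one store's sales, so its numbers say nothing about the real data; on the notebook's frame the same last two lines would be applied to `data`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for one store: a slow random walk around 1.5M (assumption).
sales = 1_500_000 + rng.normal(0, 20_000, size=143).cumsum()
toy = pd.DataFrame({
    'Weekly_Sales': sales,
    'Temperature': rng.normal(60, 15, size=143),  # unrelated noise column
})
toy['Lag_1'] = toy['Weekly_Sales'].shift(1)
toy['Rolling_Mean_3'] = toy['Weekly_Sales'].rolling(window=3).mean().shift(1)

# Correlation of every feature with the target, strongest first.
corr = toy.corr()['Weekly_Sales'].drop('Weekly_Sales').sort_values(ascending=False)
print(corr)
```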

4. Create binary features for special holidays that affect demand.

In [6]:
# Dates for the holidays/events
super_bowl_dates = ['2010-02-12', '2011-02-11', '2012-02-10']
labor_day_dates = ['2010-09-06', '2011-09-05', '2012-09-03']
thanksgiving_dates = ['2010-11-25', '2011-11-24', '2012-11-22']
christmas_dates = ['2010-12-25', '2011-12-25', '2012-12-25']

# Create binary features for each holiday/event
data['IsSuperBowl'] = data['Date'].isin(pd.to_datetime(super_bowl_dates)).astype(int)
data['IsLaborDay'] = data['Date'].isin(pd.to_datetime(labor_day_dates)).astype(int)
data['IsThanksgiving'] = data['Date'].isin(pd.to_datetime(thanksgiving_dates)).astype(int)
data['IsChristmas'] = data['Date'].isin(pd.to_datetime(christmas_dates)).astype(int)

# Check the new features
data.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,...,DayOfWeek,Lag_1,Lag_2,Lag_3,Rolling_Mean_3,Rolling_Std_3,IsSuperBowl,IsLaborDay,IsThanksgiving,IsChristmas
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106,2010,2,...,4,0.0,0.0,0.0,0.0,0.0,0,0,0,0
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106,2010,2,...,4,1643690.9,0.0,0.0,0.0,0.0,1,0,0,0
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,...,4,1641957.44,1643690.9,0.0,0.0,0.0,0,0,0,0
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,...,4,1611968.17,1641957.44,1643690.9,1632539.0,17835.791719,0,0,0,0
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106,2010,3,...,4,1409727.59,1611968.17,1641957.44,1554551.0,126313.968444,0,0,0,0
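
The rows shown above are all dated on a Friday (`DayOfWeek` is 4), and the Super Bowl list uses Friday dates, which is why `IsSuperBowl` is 1 on 2010-02-12. The other three lists use calendar dates, none of which is a Friday, so if every row in the data is a Friday those flags never match. The sketch below rolls each holiday forward to a Friday. Attaching a holiday to the Friday on or after it is an assumption, and `roll_to_weekday` is a hypothetical helper.

```python
import pandas as pd

# Holiday dates as listed in the cell above.
holiday_dates = {
    'IsLaborDay': ['2010-09-06', '2011-09-05', '2012-09-03'],
    'IsThanksgiving': ['2010-11-25', '2011-11-24', '2012-11-22'],
    'IsChristmas': ['2010-12-25', '2011-12-25', '2012-12-25'],
}

def roll_to_weekday(dates, weekday=4):
    """Roll each date forward to the given weekday (Mon=0 ... Fri=4).

    Dates already on that weekday are left unchanged.
    """
    ts = pd.Series(pd.to_datetime(dates))
    days_ahead = (weekday - ts.dt.dayofweek) % 7
    return ts + pd.to_timedelta(days_ahead, unit='D')

for name, dates in holiday_dates.items():
    weekdays = pd.Series(pd.to_datetime(dates)).dt.day_name().tolist()
    rolled = roll_to_weekday(dates).dt.strftime('%Y-%m-%d').tolist()
    print(name, weekdays, '->', rolled)
```

A quick check on the real frame is `data[['IsSuperBowl', 'IsLaborDay', 'IsThanksgiving', 'IsChristmas']].sum()`: a total of zero means that flag never fires.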


### Building and Evaluating Machine Learning Models

1. Split Data into Training and Testing Sets

In [7]:
# Define the target variable and features
X = data.drop(columns=['Date', 'Weekly_Sales']) # Not using date as we are using the derived attributes of it
y = data['Weekly_Sales']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(5148, 20) (1287, 20) (5148,) (1287,)
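
`train_test_split` shuffles rows by default, so test weeks are interleaved with training weeks, and lag features can expose neighbouring sales values to the model. A chronological hold-out is a common alternative for forecasting. The sketch below is one way to do it, assuming a `Date` column; the `toy` frame is made up.

```python
import pandas as pd

# Toy frame: 10 weekly dates for each of 2 stores (values are made up).
dates = pd.date_range('2010-02-05', periods=10, freq='7D')
toy = pd.DataFrame({
    'Store': [1] * 10 + [2] * 10,
    'Date': list(dates) * 2,
    'Weekly_Sales': range(20),
})

# Hold out the most recent 20% of weeks as the test set.
unique_dates = toy['Date'].sort_values().unique()
cutoff = unique_dates[int(len(unique_dates) * 0.8)]

train = toy[toy['Date'] < cutoff]
test = toy[toy['Date'] >= cutoff]

print(train.shape, test.shape)
print(train['Date'].max(), '<', test['Date'].min())
```

Scores from a chronological split are usually lower than from a shuffled one, but they are closer to what a forecast of future weeks would achieve.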


2. Train and evaluate different machine learning models.

In [8]:
# Initialize models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(random_state=42)
gb_model = GradientBoostingRegressor(random_state=42)
svr_model = SVR()
dt_model = DecisionTreeRegressor(random_state=42)

# Train models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
svr_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)

# Predict on test set
lr_predictions = lr_model.predict(X_test)
rf_predictions = rf_model.predict(X_test)
gb_predictions = gb_model.predict(X_test)
svr_predictions = svr_model.predict(X_test)
dt_predictions = dt_model.predict(X_test)

# Evaluate models
lr_mse = mean_squared_error(y_test, lr_predictions)
rf_mse = mean_squared_error(y_test, rf_predictions)
gb_mse = mean_squared_error(y_test, gb_predictions)
svr_mse = mean_squared_error(y_test, svr_predictions)
dt_mse = mean_squared_error(y_test, dt_predictions)

lr_r2 = r2_score(y_test, lr_predictions)
rf_r2 = r2_score(y_test, rf_predictions)
gb_r2 = r2_score(y_test, gb_predictions)
svr_r2 = r2_score(y_test, svr_predictions)
dt_r2 = r2_score(y_test, dt_predictions)

In [9]:
# Displaying the evaluation metrics
print(f'Linear Regression - MSE: {lr_mse}, R2: {lr_r2}')
print(f'Random Forest - MSE: {rf_mse}, R2: {rf_r2}')
print(f'Gradient Boosting - MSE: {gb_mse}, R2: {gb_r2}')
print(f'Support Vector Regression - MSE: {svr_mse}, R2: {svr_r2}')
print(f'Decision Tree - MSE: {dt_mse}, R2: {dt_r2}')

Linear Regression - MSE: 22606821941.55993, R2: 0.9298261628940062
Random Forest - MSE: 7824915911.450474, R2: 0.9757106781325702
Gradient Boosting - MSE: 8619256500.817415, R2: 0.9732449654698611
Support Vector Regression - MSE: 330262290603.64813, R2: -0.025167192583885223
Decision Tree - MSE: 13723319981.139275, R2: 0.9574014417683583
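
The negative R2 for Support Vector Regression is probably a scaling problem rather than evidence that SVR cannot fit this data. With default settings, an RBF-kernel SVR is sensitive to feature scale, and its `C` and `epsilon` defaults are tiny next to a target in the millions. The sketch below standardizes both the features and the target. The data is synthetic and the result was not obtained on the Walmart dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Synthetic features on very different scales and a target in the millions.
X = rng.normal(size=(400, 3)) * [1.0, 50.0, 1000.0]
y = 1_000_000 + 200_000 * X[:, 0] + 2_000 * X[:, 1] + rng.normal(0, 10_000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: SVR on raw features and raw target, as in the cell above.
raw_r2 = r2_score(y_te, SVR().fit(X_tr, y_tr).predict(X_te))

# Scaled: standardize the features in a pipeline and the target via a wrapper.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_r2 = r2_score(y_te, scaled_svr.fit(X_tr, y_tr).predict(X_te))

print(f'raw R2: {raw_r2:.3f}, scaled R2: {scaled_r2:.3f}')
```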


3. Selecting the best performing model based on evaluation metrics.

In [10]:
models_r2 = {
    'Linear Regression': lr_r2,
    'Random Forest': rf_r2,
    'Gradient Boosting': gb_r2,
    'Support Vector Regression': svr_r2,
    'Decision Tree': dt_r2
}

best_model_name = max(models_r2, key=models_r2.get)
best_model_r2 = models_r2[best_model_name]

print(f'Best Model based on R2 score: {best_model_name}')
print(f'R2 score: {best_model_r2}')

Best Model based on R2 score: Random Forest
R2 score: 0.9757106781325702


![Image](https://qph.cf2.quoracdn.net/main-qimg-93345a02ed0276bab498e3cc99acf8c7-pjlq)


The Random Forest model also has the lowest mean squared error.
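
MSE values in the billions are hard to interpret. Taking the square root gives the RMSE, which is in the same units as `Weekly_Sales`. The sketch below applies that to the MSE values printed above.

```python
import math

# MSE values copied from the evaluation output above.
mse = {
    'Linear Regression': 22606821941.55993,
    'Random Forest': 7824915911.450474,
    'Gradient Boosting': 8619256500.817415,
    'Support Vector Regression': 330262290603.64813,
    'Decision Tree': 13723319981.139275,
}

# RMSE = sqrt(MSE), expressed in the same units as the target.
rmse = {name: math.sqrt(value) for name, value in mse.items()}

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f'{name}: RMSE = {value:,.0f}')
```

For Random Forest this comes to roughly 88,000, against weekly sales of about 1.4 to 1.6 million in the rows shown earlier.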

In [11]:
# Cross-validation with the Random Forest model
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='r2')
print(f'Cross-Validation R2 Scores: {cv_scores}')
print(f'Mean CV R2 Score: {cv_scores.mean()}')

Cross-Validation R2 Scores: [0.94605528 0.97157023 0.96123074 0.9672543  0.97503406]
Mean CV R2 Score: 0.9642289213871491
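
With `cv=5`, `cross_val_score` uses ordinary K-fold splits, so some folds train on later rows and validate on earlier ones. For time-ordered data, scikit-learn's `TimeSeriesSplit` keeps every validation fold after its training rows. The sketch below uses synthetic data and assumes rows are already sorted by date.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data; row order stands in for time order (assumption).
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

tscv = TimeSeriesSplit(n_splits=5)

# Each fold trains on an initial segment and validates on the rows after it.
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print(f'train rows 0-{train_idx.max()}, validate rows {test_idx.min()}-{test_idx.max()}')

model = RandomForestRegressor(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=tscv, scoring='r2')
print(f'Mean CV R2: {scores.mean():.3f}')
```

Because the notebook's frame stacks the stores one after another, it would need to be sorted by `Date` first for the folds to respect time.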


4. Hyperparameter tuning for the Random Forest model.

In [12]:
# parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the model
rf_model = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
grid_search.fit(X_train, y_train)

# Displaying the best parameters and best score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")

Fitting 3 folds for each of 81 candidates, totalling 243 fits
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.4s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.6s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.4s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.6s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.5s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.8s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   6.7s
[CV] END max_depth=10, min_sa

5. Evaluating the Tuned Random Forest Model

In [13]:
# Predict on test set
best_rf_model = grid_search.best_estimator_
rf_predictions = best_rf_model.predict(X_test)

# Evaluate the model
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_r2 = r2_score(y_test, rf_predictions)

print(f'Tuned Random Forest - MSE: {rf_mse}, R2: {rf_r2}')

Tuned Random Forest - MSE: 8010414782.0635805, R2: 0.9751348710791329


In [14]:
# Cross-validation with the tuned Random Forest model
cv_scores = cross_val_score(best_rf_model, X_train, y_train, cv=5, scoring='r2')
print(f'Cross-Validation R2 Scores: {cv_scores}')
print(f'Mean CV R2 Score: {cv_scores.mean()}')

Cross-Validation R2 Scores: [0.94611826 0.97100644 0.96157496 0.96826744 0.97757707]
Mean CV R2 Score: 0.9649088348160573


Tuning did not produce a meaningful improvement: the test R2 moved from 0.9757 to 0.9751, and the mean cross-validation R2 from 0.9642 to 0.9649. This indicates that the default parameters are probably already close to optimal for this dataset and that the model's performance depends far more on the features and the data than on fine-tuning of hyperparameters.
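
One way to see which features the model relies on is the forest's `feature_importances_` attribute. The sketch below is illustrative only: the data is synthetic, the column names are borrowed from the notebook, and the target is constructed to depend mostly on `Lag_1`. On the real model the last two lines would use `best_rf_model` and `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic features; the target is driven mostly by 'Lag_1' by construction.
X = pd.DataFrame({
    'Lag_1': rng.normal(size=300),
    'Temperature': rng.normal(size=300),
    'Fuel_Price': rng.normal(size=300),
})
y = 5.0 * X['Lag_1'] + 0.1 * X['Temperature'] + rng.normal(0, 0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; larger means the feature is used more.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```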

6. Export the dataset and optimized model.

In [15]:
data.to_csv("data/Walmart_Store_sales_updated.csv")
with open('best_model.pkl', 'wb') as file:
    pickle.dump(best_rf_model, file)

print("Model has been exported as 'best_model.pkl'")

Model has been exported as 'best_model.pkl'
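
The exported model can be loaded back with `pickle.load`. The sketch below round-trips a small stand-in model through a temporary file and confirms that predictions are unchanged. The data and the temporary path are illustrative; the real model expects the same 20 feature columns, in the same order, that it was trained on.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Small stand-in model trained on synthetic data.
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0]
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Write the model to a temporary file and read it back.
path = os.path.join(tempfile.mkdtemp(), 'best_model.pkl')
with open(path, 'wb') as file:
    pickle.dump(model, file)
with open(path, 'rb') as file:
    restored = pickle.load(file)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that created them.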


## Summary

In this notebook, we focused on developing a machine learning model to accurately predict weekly sales for Walmart stores. The key steps and findings from this notebook are summarized below:


1. **Feature Engineering**:
   - Extracted temporal features such as year, month, week, and day of the week.
   - Created lag features and rolling statistics to capture temporal dependencies in the sales data.
   - Introduced binary features for major holidays and events like Super Bowl, Labor Day, Thanksgiving, and Christmas. 

2. **Model Training and Evaluation**:
   - Split the data into training and testing sets.
   - Trained multiple machine learning models including Linear Regression, Random Forest, Gradient Boosting, Support Vector Regression, and Decision Tree.
   - Performed hyperparameter tuning using GridSearchCV for the Random Forest model.
   - Evaluated model performance using Mean Squared Error (MSE) and R-squared (R2) metrics.

3. **Insights and Conclusions**:
   - Hyperparameter tuning did not significantly improve model performance, indicating that the default parameters were already close to optimal for this dataset and that performance depends more on the features and the data than on tuning.
   - The importance of temporal features was highlighted, suggesting the need for robust feature engineering to capture time-series patterns.

#### Made by Hrishikesh Reddy Papasani
##### LinkedIn: https://www.linkedin.com/in/hrishikesh-reddy-papasani-02110725a/
##### Github: https://github.com/Hrishikesh-Papasani
##### Contact: hrpapasani@gmail.com