# Walmart Store Sales Forecasting Project

## Introduction

In this project, we will analyze and forecast sales data for Walmart stores using time series forecasting and regression techniques. Accurate sales prediction is crucial for Walmart, as it allows the company to manage inventory efficiently, plan promotional markdowns, and meet customer demand effectively. Unexpected fluctuations in demand can lead to stockouts or excessive inventory, impacting customer satisfaction and operational costs.

The dataset used in this project contains historical weekly sales data from 45 Walmart stores across various regions in the United States. It also includes information on holiday events, economic indicators (such as the CPI and the unemployment rate), and promotional markdown events. These features make the dataset well suited to studying both the temporal and the economic factors that affect retail sales.

## About the Dataset

The dataset provides:
- **Sales Data**: Weekly sales figures for each Walmart store, broken down by department (`Dept`).
- **Holiday Events**: Major U.S. holiday weeks and events, such as the Super Bowl, Labor Day, Thanksgiving, and Christmas, which are expected to impact sales patterns.
- **Economic Indicators**: Key metrics, such as the Consumer Price Index (CPI) and the unemployment rate, which may influence consumer spending behavior.
- **Promotional Events**: Information on Walmart markdowns leading up to holidays, which can affect weekly sales.

This data presents an opportunity to explore how seasonal trends, holidays, economic conditions, and promotions affect sales. This is especially relevant for understanding demand patterns and making data-driven business decisions.

## Project Objectives

1. **Data Exploration**:
   - Analyze the dataset to understand its structure, missing values, and basic statistics.
   - Visualize patterns and trends to gain insights into factors influencing Walmart sales.

2. **Time Series Forecasting**:
   - Develop time series models to predict weekly sales over time.
   - Evaluate and fine-tune the models for improved forecasting accuracy.

3. **Anomaly Detection**:
   - Identify and analyze any anomalies or outliers in sales data, especially around major holidays and promotions.

4. **Regression Modeling**:
   - Build and evaluate regression models to predict sales based on features such as economic indicators, holiday events, and promotions.

5. **Model Evaluation**:
   - Compare model performances using metrics like R² and RMSE, and document findings for each model.

This notebook will guide you through each step, from data exploration to model development and evaluation. The goal is to achieve a comprehensive analysis and prediction model for Walmart store sales data, providing insights that could help Walmart optimize stock levels, plan promotions, and respond effectively to changes in demand.
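
The two metrics named in the Model Evaluation objective can be sketched directly with NumPy. This is a minimal illustration, not the evaluation code used later in the notebook: the helper names `rmse` and `r_squared` and the three sample values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: typical error size, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical weekly sales and predictions, for illustration only
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]

print("RMSE:", rmse(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))
```

RMSE is reported in sales units, so it is easy to relate to business impact, while R² is unitless and makes models easier to compare across stores with very different sales volumes.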


In [7]:
# Standard Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# File handling, serialization, and model persistence
import os
import json
import joblib

In [8]:
# Machine-specific working directory; adjust this path to your local clone of the repository
os.chdir("C:/Users/USER/Desktop/GitHub/time-series-forecasting-sales-data")

# Data Loading and Initial Exploration

In [12]:
data = pd.read_csv('data/data.csv')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Unnamed: 0    421570 non-null  int64  
 1   Store         421570 non-null  int64  
 2   Date          421570 non-null  object 
 3   IsHoliday     421570 non-null  int64  
 4   Dept          421570 non-null  float64
 5   Weekly_Sales  421570 non-null  float64
 6   Temperature   421570 non-null  float64
 7   Fuel_Price    421570 non-null  float64
 8   MarkDown1     421570 non-null  float64
 9   MarkDown2     421570 non-null  float64
 10  MarkDown3     421570 non-null  float64
 11  MarkDown4     421570 non-null  float64
 12  MarkDown5     421570 non-null  float64
 13  CPI           421570 non-null  float64
 14  Unemployment  421570 non-null  float64
 15  Type          421570 non-null  int64  
 16  Size          421570 non-null  int64  
dtypes: float64(11), int64(5), object(1)
memory usage

| | Unnamed: 0 | Store | Date | IsHoliday | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2010-02-05 | 0 | 1.0 | 24924.5 | 42.31 | 2.572 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 211.096358 | 8.106 | 3 | 151315 |
| 1 | 1 | 1 | 2010-02-05 | 0 | 26.0 | 11737.12 | 42.31 | 2.572 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 211.096358 | 8.106 | 3 | 151315 |
| 2 | 2 | 1 | 2010-02-05 | 0 | 17.0 | 13223.76 | 42.31 | 2.572 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 211.096358 | 8.106 | 3 | 151315 |
| 3 | 3 | 1 | 2010-02-05 | 0 | 45.0 | 37.44 | 42.31 | 2.572 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 211.096358 | 8.106 | 3 | 151315 |
| 4 | 4 | 1 | 2010-02-05 | 0 | 28.0 | 1085.29 | 42.31 | 2.572 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 211.096358 | 8.106 | 3 | 151315 |
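
The `Unnamed: 0` column in the output above looks like a row index that was saved with the file, `Date` is read as a plain object column, and `Dept` is stored as `float64` even though the values shown are whole numbers. A minimal sketch of a loading step that addresses all three is shown below, using the real pandas parameters `index_col` and `parse_dates`. The `csv_text` excerpt and the name `sample` are hypothetical stand-ins for `data/data.csv`, and the integer cast assumes `Dept` has no missing values, which matches the non-null counts reported by `data.info()`.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for data/data.csv
csv_text = """,Store,Date,IsHoliday,Dept,Weekly_Sales
0,1,2010-02-05,0,1.0,24924.5
1,1,2010-02-05,0,26.0,11737.12
2,1,2010-02-12,1,1.0,46039.49
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,           # treat the unnamed first column as the index
    parse_dates=["Date"],  # parse Date while reading instead of converting later
)

# Department identifiers are whole numbers, so an integer dtype is a better fit
sample["Dept"] = sample["Dept"].astype("int64")

print(sample.dtypes)
```

With the real file, the same arguments would be passed to `pd.read_csv('data/data.csv', ...)`.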


In [11]:
data.shape

(421570, 17)

# Initial Observations and Data Understanding

In [13]:
data.describe()

| | Unnamed: 0 | Store | IsHoliday | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Type | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 | 421570.0 |
| mean | 211611.321278 | 22.200546 | 0.070358 | 44.260317 | 15981.258123 | 60.090059 | 3.361027 | 2590.074819 | 879.974298 | 468.087665 | 1083.132268 | 1662.772385 | 171.201947 | 7.960289 | 2.410088 | 136727.915739 |
| std | 122195.149363 | 12.785297 | 0.25575 | 30.492054 | 22711.183519 | 18.447931 | 0.458515 | 6052.385934 | 5084.538801 | 5528.873453 | 3894.529945 | 4207.629321 | 39.159276 | 1.863296 | 0.666337 | 60980.583328 |
| min | 0.0 | 1.0 | 0.0 | 1.0 | -4988.94 | -2.06 | 2.472 | 0.0 | -265.76 | -29.1 | 0.0 | 0.0 | 126.064 | 3.879 | 1.0 | 34875.0 |
| 25% | 105782.25 | 11.0 | 0.0 | 18.0 | 2079.65 | 46.68 | 2.933 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 132.022667 | 6.891 | 2.0 | 93638.0 |
| 50% | 211603.5 | 22.0 | 0.0 | 37.0 | 7612.03 | 62.09 | 3.452 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182.31878 | 7.866 | 3.0 | 140167.0 |
| 75% | 317424.75 | 33.0 | 0.0 | 74.0 | 20205.8525 | 74.28 | 3.738 | 2809.05 | 2.2 | 4.54 | 425.29 | 2168.04 | 212.416993 | 8.572 | 3.0 | 202505.0 |
| max | 423285.0 | 45.0 | 1.0 | 99.0 | 693099.36 | 100.14 | 4.468 | 88646.76 | 104519.54 | 141630.61 | 67474.85 | 108519.28 | 227.232807 | 14.313 | 3.0 | 219622.0 |
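
The summary above shows negative minimums for `Weekly_Sales` (-4988.94), `MarkDown2` (-265.76), and `MarkDown3` (-29.1). Whether these are returns, corrections, or data errors is not stated in the data, so a reasonable first step is simply to count them. The sketch below does this on a small hypothetical frame (`frame` is an assumption that reuses a few values from the table); on the real data the same expressions would be applied to `data[cols]`.

```python
import pandas as pd

# Hypothetical rows reusing a few values from the summary table above
frame = pd.DataFrame({
    "Weekly_Sales": [24924.5, -4988.94, 37.44],
    "MarkDown2": [0.0, -265.76, 2.2],
    "MarkDown3": [0.0, 4.54, -29.1],
})

# Boolean mask of negative entries, then count and share per column
negative_mask = frame < 0
negative_counts = negative_mask.sum()
negative_share = negative_mask.mean()

print(negative_counts)
print(negative_share.round(3))
```

The counts indicate how much of the data is affected, which helps decide between dropping the rows, clipping the values at zero, or keeping them as they are.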


In [15]:
data.describe(include = 'object')

| | Date |
|---|---|
| count | 421570 |
| unique | 143 |
| top | 2011-12-23 |
| freq | 3027 |


## Check for Missing Values and Data Types

In [16]:
data.isna().sum()

Unnamed: 0      0
Store           0
Date            0
IsHoliday       0
Dept            0
Weekly_Sales    0
Temperature     0
Fuel_Price      0
MarkDown1       0
MarkDown2       0
MarkDown3       0
MarkDown4       0
MarkDown5       0
CPI             0
Unemployment    0
Type            0
Size            0
dtype: int64

In [24]:
data.dtypes

Unnamed: 0               int64
Store                    int64
Date            datetime64[ns]
IsHoliday                int64
Dept                   float64
Weekly_Sales           float64
Temperature            float64
Fuel_Price             float64
MarkDown1              float64
MarkDown2              float64
MarkDown3              float64
MarkDown4              float64
MarkDown5              float64
CPI                    float64
Unemployment           float64
Type                     int64
Size                     int64
dtype: object

In [25]:
data.duplicated().sum()

0
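
`data.duplicated()` only flags rows that are identical in every column. For a panel like this one, it can also be useful to check that each (`Store`, `Dept`, `Date`) combination appears at most once. The sketch below illustrates the difference on a hypothetical four-row frame. Treating these three columns as the natural key is an assumption, not something the dataset documents.

```python
import pandas as pd

# Hypothetical rows: the last one repeats the key of the first with a different sales value
frame = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "Dept": [1, 1, 2, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-05"]),
    "Weekly_Sales": [24924.5, 46039.49, 50605.27, 25000.0],
})

# Exact duplicates across all columns
full_dupes = int(frame.duplicated().sum())

# Duplicates on the assumed key only
key_cols = ["Store", "Dept", "Date"]
key_dupes = int(frame.duplicated(subset=key_cols).sum())

print("Full-row duplicates:", full_dupes)
print("Key duplicates:", key_dupes)
```

On the real data the same check is `data.duplicated(subset=['Store', 'Dept', 'Date']).sum()`.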

# Data Cleaning

In [26]:
data['Date'] = pd.to_datetime(data['Date'])

In [27]:
# Count the gaps between the dates of consecutive rows (row-wise, not per unique date)
date_diff = data['Date'].diff().value_counts()
print("Time intervals:\n", date_diff)


Time intervals:
 0 days       415135
7 days         6390
-994 days        44
Name: Date, dtype: int64
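
The row-wise differences above mix three effects. Most gaps are 0 days because many rows (one per department) share the same store and week. The 7-day gaps are the weekly steps. The -994 day gap equals 142 weeks, which is consistent with the rows jumping back from the last of the 143 unique dates to the first each time the data moves on to the next store (44 jumps for 45 stores). To check the calendar spacing itself, the differences can be taken over the sorted unique dates instead. The sketch below shows both views on a hypothetical two-store, four-week frame; `frame` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical panel: two stores, the same four weekly dates each, sorted by store then date
dates = pd.date_range("2010-02-05", periods=4, freq="7D")
frame = pd.DataFrame({
    "Store": [1] * 4 + [2] * 4,
    "Date": list(dates) * 2,
})

# Row-wise differences, as in the cell above: includes the jump back between stores
raw_diff = frame["Date"].diff().value_counts()

# Differences over sorted unique dates: reflects only the calendar spacing
unique_dates = pd.Series(frame["Date"].unique()).sort_values()
calendar_diff = unique_dates.diff().dropna().value_counts()

print("Row-wise gaps:\n", raw_diff)
print("Calendar gaps:\n", calendar_diff)
```

If the calendar gaps on the real data contain only `7 days`, the series is regularly spaced with no missing weeks.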
