# Workshop on Introduction to Machine Learning with Time Series Data: A Case Study on Walmart Sales Forecasting Using M5 Dataset

Welcome to this workshop on Machine Learning with Time Series Data. We will be using the M5 Forecasting - Accuracy dataset from Kaggle, which represents unit sales data segregated by product categories in various Walmart outlets. This workshop is intended for researchers across the University of Central Florida.

This workshop is divided into five modules, each focusing on different aspects of Machine Learning and Time Series Analysis:

1. Introduction to Dataset and Time Series Data
2. Exploratory Data Analysis
3. Data Preparation and Preprocessing
4. Model Training
5. Prediction and Evaluation

By the end of this workshop, you will have a clear understanding of how to handle time series data and how to apply machine learning models to make predictions based on that data.

Let's get started!

## Setting Up The Environment

Before we begin, let's ensure that we have all the necessary packages installed. Run the cell below to install the required Python packages.

```python
!pip install pandas numpy scikit-learn matplotlib seaborn
```


# Introduction to the Workshop

Welcome to our interactive workshop on "Machine Learning for Time Series Data: A Case Study with Walmart Sales Forecasting". This workshop is designed with researchers in mind, to provide you with a comprehensive understanding of time series data, its characteristics, challenges, and the various machine learning techniques to analyze and forecast this type of data.

## Why Time Series Analysis?

As researchers, you often encounter data that unfolds over time - whether it be tracking disease spread in public health, monitoring climate change indicators in environmental studies, analyzing economic trends, or understanding behavioral patterns in social sciences. The ability to make accurate predictions based on historical data is a crucial tool in a researcher's arsenal.

Time series data has distinctive characteristics, such as trend, seasonality, and autocorrelation. The chronological order of observations carries meaningful information, and ignoring this temporal dependence can lead to misleading insights.

## Our Case Study: Walmart Sales Forecasting

To make this learning process practical and relevant, we will be using the M5 Forecasting dataset from Kaggle in our case study. It contains daily sales data for various products from Walmart, collected over several years. The dataset is an excellent example of real-world time series data with its complexities and challenges, such as multiple seasonal patterns, trends, and the impact of external factors like price and promotions.

While this dataset is focused on sales forecasting, the techniques and methodologies we will explore are highly transferable and can be applied to various research fields:

1. **Public Health**: In studying disease spread and the impact of health interventions, time series analysis can be used for forecasting future cases and evaluating intervention strategies.
2. **Environmental Studies**: Predicting future climate variables or environmental changes, understanding their patterns and trends, all require dealing with time series data.
3. **Economics**: Economic indicators such as GDP, inflation, unemployment rates, and many more, are typically time series data. Forecasting these indicators can provide vital insights for policy-making and economic planning.
4. **Social Sciences**: In studying human behavior over time, researchers often encounter time series data. Whether it's monitoring social trends or predicting future behaviors, time series analysis techniques are invaluable.

By the end of this workshop, you will not only gain an understanding of time series data and forecasting techniques but also acquire hands-on experience with a real-world dataset, using Python and its powerful libraries. You'll leave equipped with a new set of tools to bring to your own research, regardless of your field of study.

Let's start our time series exploration journey.



# Features of Time Series Data

Time series data is distinctive because the observations are ordered in time, so the sequence of data points matters. Here are some key features of time series data:

1. **Trend**: A long-term increase or decrease in the level of the data is called a trend.

2. **Seasonality**: Sometimes data can show a repeating pattern at regular intervals. This is called seasonality. An example is sales of ice cream increasing in the summer months every year.

3. **Cyclical Patterns**: Sometimes data can rise and fall, but not in a regular pattern. This is a cyclical pattern. An example could be economic data that goes through periods of growth and recession, but not at regular intervals.

4. **Noise**: Noise is the random variation in the data that doesn’t fit any pattern.

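5. **Stationarity**: A series is (weakly) stationary if its mean, variance, and autocorrelation structure stay constant over time, which means it has no trend or seasonality. Many classical models, such as ARMA, assume stationarity, so trends and seasonal effects are often removed (for example, by differencing) before modeling.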

6. **Autocorrelation**: A value in a series is often correlated with earlier values of the same series. This is called autocorrelation. For example, a high value today may make a high value tomorrow more likely.

These features can be used to help us model time series data and make forecasts or predictions for future data points.
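To make these features concrete, here is a minimal sketch that builds a synthetic daily series from a trend, a weekly seasonal pattern, and random noise, then measures its autocorrelation. The coefficients, dates, and variable names are illustrative assumptions and are not taken from the M5 data.

```python
import numpy as np
import pandas as pd

# Synthetic example: all numbers below are made up for illustration.
rng = np.random.default_rng(seed=0)
days = pd.date_range("2020-01-01", periods=364, freq="D")
t = np.arange(len(days))

trend = 0.05 * t                               # slow upward drift
seasonality = 3.0 * np.sin(2 * np.pi * t / 7)  # pattern repeating every 7 days
noise = rng.normal(0.0, 1.0, len(days))        # random variation

series = pd.Series(10 + trend + seasonality + noise, index=days)

# Autocorrelation: correlation of the series with a lagged copy of itself.
lag1 = series.autocorr(lag=1)
lag7 = series.autocorr(lag=7)

# First differencing removes most of the linear trend,
# a common step towards a more stationary series.
diffed = series.diff().dropna()

print(f"lag-1 autocorrelation: {lag1:.2f}")
print(f"lag-7 autocorrelation: {lag7:.2f}")
print(f"mean daily change after differencing: {diffed.mean():.3f}")
```

Plotting `series` and `diffed` side by side (for example with `series.plot()`) is a quick way to see the trend disappear while the weekly pattern remains.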



## Dataset Description

The M5 dataset, generously made available by Walmart, includes the unit sales of various products sold in the USA, organized as grouped time series. More specifically, it covers the unit sales of 3,049 products, classified into 3 product categories (Hobbies, Foods, and Household) and 7 product departments. The products are sold across ten stores located in three states (CA, TX, and WI).

### Data Files:

There are three primary data files in the dataset:

1. `calendar.csv`: This file contains information about the dates the products are sold, including various events and whether SNAP purchases are allowed for each state.

2. `sell_prices.csv`: This file contains the price of each product per store and week.

3. `sales_train.csv`: This file contains the historical daily unit sales data per product and store.

Let's load and explore these data files one by one.


## Loading and Exploring the Data

We'll use pandas, a powerful data handling library in Python, to load our data files. If you skipped the setup cell, install it first:

```python
!pip install pandas
```


# Introduction to Pandas

Pandas is a powerful Python library for data manipulation and analysis. It provides the data structures and functions needed to work with structured data. It is built on top of NumPy, which supplies fast array operations, and its plotting methods use Matplotlib when that library is installed.

## Data Structures

Pandas introduces two useful (and powerful) structures: Series and DataFrame, both of which handle most typical use cases:

- A Series is a one-dimensional array-like object that can hold any data type. It is basically a column.
- A DataFrame is a two-dimensional table of data with rows and columns.
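As a small illustration of the two structures, the sketch below builds a Series and a DataFrame by hand. The dates, column names, and values are hypothetical and only serve to show how the objects relate.

```python
import pandas as pd

# A Series: one labelled column of values (here indexed by date).
units = pd.Series(
    [3, 0, 5],
    index=pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
    name="units",
)

# A DataFrame: several columns sharing the same row index.
frame = pd.DataFrame({"units": units, "price": [2.5, 2.5, 2.0]})

# Selecting a single column of a DataFrame gives back a Series.
price_column = frame["price"]

print(frame)
print(type(price_column).__name__)
```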

## Basic Functions

Let's introduce some basic functions you'll frequently use when manipulating data with pandas:

- `pd.read_csv(filepath)`: This function is used to read a CSV (Comma Separated Values) file and convert it into a DataFrame.

- `df.head(n)`: This function returns the first n rows of the DataFrame df. If n is not specified, it returns the first 5 rows.

- `df.describe()`: This function returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of the DataFrame df.

- `df.info()`: This function gives the summary of the DataFrame df including the number of non-null values in each column.

- `df.columns`: This is used to get the list of all column names of the DataFrame df.

- `df.values`: This returns all values of the DataFrame df as a NumPy array. The pandas documentation recommends `df.to_numpy()` for new code.

## Loading Data

You can load data into a DataFrame using pandas' read functions. For example, you can use `pd.read_csv()` to read a CSV file:
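A minimal sketch of that call is shown here. To keep it runnable without any files on disk, it reads a tiny in-memory CSV through `io.StringIO`; the column names and rows are made up for illustration, and with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the contents are illustrative only.
csv_text = io.StringIO(
    "store_id,item_id,units\n"
    "CA_1,FOODS_1_001,3\n"
    "CA_1,FOODS_1_002,0\n"
    "TX_1,FOODS_1_001,5\n"
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(csv_text)

print(df.shape)
print(list(df.columns))
```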



## Viewing Data

To view the first 'n' rows of the DataFrame, use `df.head(n)`. To view the last 'n' rows, use `df.tail(n)`. If 'n' is not provided, by default, it shows 5 rows.


```python
# View the first 5 rows
# (print is used because a notebook only auto-displays a cell's last expression)
print(df.head())

# View the last 5 rows
print(df.tail())
```


You can also view the statistical summary of the DataFrame using `df.describe()`, and a structural summary (column types and non-null counts) using `df.info()`.

```python
# Statistical summary
print(df.describe())

# DataFrame summary (info() prints its output directly)
df.info()
```


### Calendar Data

This file contains information about the dates when products are sold, including various events and whether SNAP purchases are allowed for each state.

```python
import pandas as pd

# Load the calendar data
calendar = pd.read_csv('calendar.csv')

# Display the first few rows of the DataFrame
# (print is used because a notebook only auto-displays a cell's last expression)
print(calendar.head())

# Get a concise summary of the DataFrame
calendar.info()
```
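Calendar-style data becomes easier to work with once the date column is parsed as a real datetime. The sketch below shows this on a tiny stand-in table. The column names (`date`, `wm_yr_wk`, `d`, `event_name_1`, `snap_CA`) are assumed to follow the Kaggle M5 calendar file, and the rows are illustrative, so check them against the output of `calendar.info()` before reusing the code.

```python
import io

import pandas as pd

# Tiny stand-in for calendar.csv; rows and values are illustrative only.
calendar_csv = io.StringIO(
    "date,wm_yr_wk,d,event_name_1,snap_CA\n"
    "2011-01-29,11101,d_1,,0\n"
    "2011-01-30,11101,d_2,,0\n"
    "2011-02-01,11101,d_4,,1\n"
    "2011-02-06,11102,d_9,SuperBowl,1\n"
)

# parse_dates converts the text column into datetime64 values.
calendar_demo = pd.read_csv(calendar_csv, parse_dates=["date"])

# Datetime columns expose calendar features through the .dt accessor.
calendar_demo["day_name"] = calendar_demo["date"].dt.day_name()

# Count days with a named event and days flagged for SNAP purchases.
n_event_days = int(calendar_demo["event_name_1"].notna().sum())
n_snap_days = int(calendar_demo["snap_CA"].sum())

print(calendar_demo)
print("event days:", n_event_days, "| SNAP days:", n_snap_days)
```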




### Sell Prices Data

This file contains the price of each product per store and week.

```python
# Load the sell prices data
sell_prices = pd.read_csv('sell_prices.csv')

# Display the first few rows of the DataFrame
print(sell_prices.head())

# Get a concise summary of the DataFrame
sell_prices.info()

# Display the statistical summary
print("\nSell Data Description:")
print(sell_prices.describe())
```

### Sales Train Data

This file contains the historical daily unit sales data per product and store.

```python
# Load the sales train data
sales_train = pd.read_csv('sales_train.csv')

# Display the first few rows of the DataFrame
print(sales_train.head())

# Get a concise summary of the DataFrame
sales_train.info()

# Display the statistical summary
print("\nSales Data Description:")
print(sales_train.describe())
```

## Goal of the Analysis

Our primary objective for this analysis is to predict future sales for Walmart. With the vast amount of historical sales data available for Walmart's various products, we are uniquely poised to provide valuable insights that can help inform business decisions.

Predicting sales accurately allows a business to maintain an optimal inventory, manage its cash flow better, understand the effect of external factors on sales, and ultimately enhance its bottom line. This is particularly crucial for a retail behemoth like Walmart.

We will be forecasting sales for the next 28 days based on the historical sales data from the M5 forecasting dataset. This dataset provides us with sales data for various products sold across Walmart's ten stores located in three states - California, Texas, and Wisconsin.

Our analysis will involve the following key steps:

1. **Exploratory Data Analysis (EDA):** We'll start by exploring our dataset to understand the various factors that could influence sales. This would include looking at seasonal trends, the impact of holidays and other special events, as well as the effect of SNAP (Supplemental Nutrition Assistance Program) days.

2. **Feature Engineering:** Based on our EDA, we will create meaningful features that can help improve the performance of our predictive model.

3. **Model Selection and Training:** We will choose an appropriate model for time series forecasting, train it on our historical sales data, and tune it for better performance.

4. **Evaluation:** We will evaluate the performance of our model using suitable metrics and make necessary adjustments.

5. **Forecasting:** Finally, we'll use our trained model to predict sales for the next 28 days.
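To give a feel for what a 28-day forecast and its evaluation look like, here is a minimal sketch of a seasonal-naive baseline: it repeats the last observed week four times and scores the result on a held-out window. This is a simple reference point and not the model used later in the workshop. The series is synthetic, and names such as `HORIZON` and `weekly_pattern` are hypothetical.

```python
import numpy as np
import pandas as pd

HORIZON = 28  # number of days to forecast

# Synthetic daily unit sales with a weekly pattern (illustrative values only).
rng = np.random.default_rng(seed=42)
n_days = 140
t = np.arange(n_days)
weekly_pattern = np.array([12, 9, 8, 8, 10, 15, 18])
sales = pd.Series(
    rng.poisson(weekly_pattern[t % 7]),
    index=pd.date_range("2016-01-01", periods=n_days, freq="D"),
    name="units",
)

# Hold out the last 28 days. Splits must respect time order, never shuffle.
train, test = sales.iloc[:-HORIZON], sales.iloc[-HORIZON:]

# Seasonal-naive forecast: repeat the last observed week across the horizon.
last_week = train.iloc[-7:].to_numpy()
forecast = pd.Series(np.tile(last_week, HORIZON // 7), index=test.index)

# Mean absolute error on the held-out window.
mae = (test - forecast).abs().mean()
print(f"Seasonal-naive MAE over {HORIZON} days: {mae:.2f}")
```

Any model trained later should beat this kind of baseline on the same held-out window; if it does not, the extra complexity is not paying off.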

We look forward to uncovering insights from the data and building a model that can predict future sales with a high degree of accuracy.


## Hands-On Activity: Loading and Exploring the Walmart Dataset

We will now load the Walmart dataset into our Python environment and explore its features using basic commands. This activity is designed to familiarize you with the dataset and get you started with the practical aspect of data analysis.

**Step 1: Import Necessary Libraries**

The first step involves importing the libraries that we will need for this activity. We'll be using `pandas` for data manipulation and `numpy` for numerical computations.




```python
import pandas as pd
import numpy as np
```

**Step 2: Load the Data**

We will load the Walmart dataset using pandas. The dataset consists of three CSV files: `sales_train.csv`, `sell_prices.csv`, and `calendar.csv`.

```python
# Load the data
sales_train = pd.read_csv('sales_train.csv')
sell_prices = pd.read_csv('sell_prices.csv')
calendar = pd.read_csv('calendar.csv')
```


**Step 3: Explore the Data**

Next, let's get a glimpse of our datasets using the head() function, which returns the first few rows of each DataFrame. We'll also use the shape attribute to see the dimensions of our datasets.

```python
# View the first few rows of the sales_train dataset
print("Sales Train data:")
print(sales_train.head())

# Dimensions of sales_train dataset
print("\nSales Train data shape:", sales_train.shape)

# Repeat for sell_prices and calendar datasets
print("\nSell Prices data:")
print(sell_prices.head())
print("\nSell Prices data shape:", sell_prices.shape)

print("\nCalendar data:")
print(calendar.head())
print("\nCalendar data shape:", calendar.shape)
```


Take your time to go through each of these steps. As you explore the data, look at the different features and try to understand what each one represents. You can also try running the commands from the previous section.
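One way to understand what each feature represents is to see how the three tables connect. The sketch below uses tiny stand-in DataFrames and assumes the layout of the Kaggle M5 files: a wide sales table with one `d_<n>` column per day, a calendar that maps each `d` label to a date and a week identifier `wm_yr_wk`, and prices keyed by store, item, and week. All values are made up, and the column names should be verified against your own `head()` output.

```python
import pandas as pd

# Stand-in tables; column names are assumed to match the M5 files.
sales_demo = pd.DataFrame({
    "item_id": ["FOODS_1_001", "HOBBIES_1_001"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [3, 0],
    "d_2": [1, 2],
})
calendar_demo = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30"]),
    "wm_yr_wk": [11101, 11101],
})
prices_demo = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["FOODS_1_001", "HOBBIES_1_001"],
    "wm_yr_wk": [11101, 11101],
    "sell_price": [2.00, 9.58],
})

# Reshape from wide (one column per day) to long (one row per item and day).
long_sales = sales_demo.melt(
    id_vars=["item_id", "store_id"], var_name="d", value_name="units"
)

# Attach the calendar date and week, then the weekly price.
merged = long_sales.merge(calendar_demo, on="d", how="left").merge(
    prices_demo, on=["store_id", "item_id", "wm_yr_wk"], how="left"
)

print(merged)
```

On the full dataset the long table is very large, so it is usually worth trying this on a single store or a handful of items first.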

Remember, a thorough understanding of your data is the first step towards building a successful model!

---

## Questions & Answers

Now that we have loaded the dataset and done some initial exploration, let's open the floor for discussion.

This is a great time to ask questions, share observations, or discuss potential challenges that might arise during the analysis of this dataset.

- Do you have any questions about the Walmart dataset or any of the commands we used so far?
- Did you notice anything interesting or unusual while exploring the data?
- Are there any challenges you foresee in predicting Walmart sales based on this dataset?

Please feel free to ask your questions in the chat. I'm here to help!

---
