## Import Libraries

In [None]:
# import the relevant libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [None]:
# import the 3 datasets and assign them to their own variable
features = pd.read_csv('../input/retaildataset/Features data set.csv')
sales = pd.read_csv('../input/retaildataset/sales data-set.csv') 
stores = pd.read_csv('../input/retaildataset/stores data-set.csv')

In [None]:
features.sample(n=5)

In [None]:
sales.sample(n=5)

In [None]:
stores.sample(n=5)

## Merge the Datasets
I used Sathish Kumar's code to merge the dataframes. I like how he used Python to merge the datasets the way you would in SQL. Below is a link to his project.
https://www.kaggle.com/code/ssathishkumar/retail-sales-forecasting-time-series-eda

In [None]:
# merge the 3 separate datasets using the merge function (inner join by default)
features = features.merge(stores, on='Store')
df = features.merge(sales, on=['Store', 'Date', 'IsHoliday'])
# replace missing values with 0
df = df.fillna(0)
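
To illustrate the SQL-style join used above, here is a minimal sketch on two tiny made-up frames. The `left` and `right` names and all values are assumptions for illustration only, not taken from the retail files. `merge` performs an inner join by default, so only keys present in both frames survive.

```python
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration
left = pd.DataFrame({"Store": [1, 1, 2],
                     "Date": ["d1", "d2", "d1"],
                     "Temp": [40.0, 42.5, 38.1]})
right = pd.DataFrame({"Store": [1, 2, 3],
                      "Size": [100, 200, 300]})

# Inner join on 'Store', roughly: SELECT * FROM left JOIN right USING (Store)
joined = left.merge(right, on="Store")  # how='inner' is the default

# Store 3 has no match in `left`, so it is dropped
print(joined)
print(joined.shape)
```

Checking `shape` before and after a merge like this is a quick way to see whether any rows were dropped or duplicated.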

In [None]:
df.shape

In [None]:
df.describe()

## Creating an Index Based on the DATES

In [None]:
# change the 'Date' column to the datetime format first, so that sorting is chronological rather than alphabetical
df['Date'] = pd.to_datetime(df['Date'])
# sort the dataframe by date
df = df.sort_values(by='Date')

In [None]:
# parse the 'Day', 'Month', and 'Year' from the 'Date' column into new columns
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
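
One thing worth checking when parsing dates is whether the raw strings are written month first or day first, because the same string can land in a different month. The sketch below uses two made-up strings (not rows from the dataset) and explicit `format` arguments to show the difference. Which order the CSV actually uses is an assumption to verify against the raw file.

```python
import pandas as pd

# Hypothetical sample strings, made up for illustration
raw = pd.Series(["05/02/2010", "12/02/2010"])

# Same strings, two interpretations
month_first = pd.to_datetime(raw, format="%m/%d/%Y")
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

print(month_first.dt.month.tolist())  # [5, 12] -> May and December
print(day_first.dt.month.tolist())    # [2, 2]  -> both in February
```

If the months in the parsed column look wrong, passing an explicit `format` to `pd.to_datetime` removes the ambiguity.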

We only want to forecast the sales figures, so let's plot the 'Weekly_Sales' column.
The sales data is currently recorded on a weekly basis: 52 weeks a year, or roughly 4 weeks a month.

In [None]:
# set the index to the 'Date' column, which is already in the datetime format
df = df.set_index('Date')

In [None]:
# plot the Weekly Sales column
df['Weekly_Sales'].plot(figsize=(25,8));

## Extract the Sales Data

In [None]:
# The Weekly Sales column will be pulled out to create a new dataframe
df_Sales = df[['Weekly_Sales']]

In [None]:
# check the new dataframe 
df_Sales.head()

In [None]:
# Next, we resample the dataframe to show average sales per month only, instead of every week. 
df_Sales = df_Sales.resample(rule='M').mean()

In [None]:
df_Sales.head()
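
As a small illustration of what the resampling step does, the sketch below averages a made-up weekly series by calendar month. The dates and values are assumptions for illustration. It uses the `'MS'` rule, which labels each month by its first day, whereas the cell above uses `'M'`, which labels by month end.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series: 8 consecutive Fridays starting 2010-01-01
idx = pd.date_range("2010-01-01", periods=8, freq="W-FRI")
weekly = pd.DataFrame({"Weekly_Sales": np.arange(8, dtype=float)}, index=idx)

# Average all weekly rows that fall in the same calendar month
monthly = weekly.resample("MS").mean()

# January averages 5 weekly values here, February only 3
print(monthly)
```

Because a month can contain 4 or 5 weekly observations, the monthly mean is an average over a varying number of weeks.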

In [None]:
# rename the 'Weekly_Sales' column to 'Monthly_Sales'
df_Sales = df_Sales.rename(columns={'Weekly_Sales':'Monthly_Sales'})

In [None]:
# plot the sales on a lineplot
df_Sales.plot(figsize=(20,8))

plt.title('Average Monthly Sales')
plt.xlabel('Date')
plt.ylabel('Dollar Sales');

In [None]:
# check for nan values in the df
df_Sales.isnull().sum()

In [None]:
# export the monthly dataframe so we can use it again (keep the Date index in the file)
df_Sales.to_csv('Retail Sales Monthly.csv')

## Split the Data into Training and Test Splits

In [None]:
# check the size of the dataframe before splitting it into training and test sets
df_Sales.shape

The test set is typically about 20% of the total sample, but here I chose 40%. Why? Because I want to forecast one year into the future and also have about a year's worth of test data to measure the model's accuracy against.

In other words, the test set should be about as long as the horizon I'm willing to forecast.

In [None]:
# multiply 36 by 0.4 to find how many months to hold out for the test set (14.4, rounded down to 14)
36 * 0.4

In [None]:
# subtract 14 from 36 to get 22 months
36 - 14

In [None]:
# Create the train and test variables
# sales train variable: the first 22 months of the dataframe (rows 0 to 21)
sales_train = df_Sales.iloc[:22]
# sales test variable: from row 21 to the end of the dataframe (15 months)
# note: row 21 is also the last training row, so the two sets share one month
sales_test = df_Sales.iloc[21:]

In [None]:
#print the test variable 
sales_test
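
For comparison, here is a minimal sketch of a chronological split in which the training and test sets do not share any rows. The `chrono_split` helper and the `demo` frame are hypothetical names introduced for illustration; they are not part of the notebook above.

```python
import pandas as pd

def chrono_split(frame, n_test):
    """Hypothetical helper: the last n_test rows become the test set, the rest the training set."""
    if not 0 < n_test < len(frame):
        raise ValueError("n_test must be between 1 and len(frame) - 1")
    return frame.iloc[:-n_test], frame.iloc[-n_test:]

# Made-up frame with 36 monthly rows, standing in for the monthly sales data
demo = pd.DataFrame({"Monthly_Sales": range(36)},
                    index=pd.date_range("2010-01-01", periods=36, freq="MS"))

train, test = chrono_split(demo, n_test=14)

print(len(train), len(test))                 # 22 14
print(train.index.max() < test.index.min())  # True: no shared months
```

With a split like this, the first forecast step after the training data lines up with the first row of the test set.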

## Exponential Smoothing

In [None]:
# import Exponential Smoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing

In [None]:
# I'll use the additive method because the seasonal variations are roughly constant through the series
# Here, we fit the model on the training data 'sales_train'
fitted_model = ExponentialSmoothing(sales_train['Monthly_Sales'],
                                   trend = 'add',
                                   seasonal = 'add',
                                   seasonal_periods = 10).fit()
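
The model above is Holt-Winters smoothing with additive trend and seasonal terms. As a much simpler illustration of the smoothing idea only, the sketch below implements the basic level recurrence by hand, with no trend or seasonal component. The `simple_exp_smooth` function is a hypothetical helper written for this example and is not how statsmodels fits the model.

```python
import numpy as np

def simple_exp_smooth(values, alpha):
    """Hypothetical helper: level_t = alpha * y_t + (1 - alpha) * level_{t-1}."""
    values = np.asarray(values, dtype=float)
    level = values[0]          # start the level at the first observation
    levels = [level]
    for y in values[1:]:
        # blend the new observation with the previous smoothed level
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return np.array(levels)

smoothed = simple_exp_smooth([10, 20, 30], alpha=0.5)
print(smoothed)  # [10.  15.  22.5]
```

A larger `alpha` makes the smoothed level react faster to recent observations; a smaller one gives more weight to history.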

In [None]:
# Forecast 24 months (2 years) past the end of the training data
# and assign the result to the test_predictions variable
test_predictions = fitted_model.forecast(24)

In [None]:
# print the test_predictions
# this is a series that predicts certain values for a date
test_predictions

## Plot the Predictions against the Training, and Test Sets

In [None]:
# plot the test predictions against the past sales and test set 
sales_train['Monthly_Sales'].plot(legend=True, label= 'TRAIN', figsize=(15,8))
sales_test['Monthly_Sales'].plot(legend=True, label= 'TEST', figsize=(15,8))
test_predictions.plot(legend=True, label= 'PREDICTIONS', figsize=(15,8))

plt.title('Train, Test, and Sales Predictions')
plt.xlabel("Date")
plt.ylabel("Dollar Sales");

## Check the Accuracy of the Model

In [None]:
# check the accuracy of the model 
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
# find the sales standard deviation and mean
sales_test.describe()

In [None]:
# change the size of the forecast
# the forecast must be the same length as the test set (15 months) to compute the error metrics
test_predictions = fitted_model.forecast(len(sales_test))

In [None]:
# find the mean squared error
MSE = mean_squared_error(sales_test, test_predictions)
# find the mean absolute error
MAE = mean_absolute_error(sales_test, test_predictions)
# find the root mean squared error
RMSE = np.sqrt(MSE)
# suppress scientific notation in the dataframe
pd.options.display.float_format = '{:.2f}'.format
# create a dataframe showing the error results
# the standard deviation (1047) is entered by hand, as a number so the float format applies
results = pd.DataFrame({'Metric': ['MSE', 'MAE', 'RMSE', 'STD DVTN'],
                       'Score': [MSE, MAE, RMSE, 1047]})
results = results.set_index('Metric')
results
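
To make the three error metrics concrete, the sketch below computes them by hand with NumPy on made-up numbers (the `actual` and `predicted` arrays are assumptions for illustration, not the notebook's results). It also shows why the standard deviation is a useful yardstick: the RMSE of always predicting the mean of the actual values equals their (population) standard deviation.

```python
import numpy as np

# Made-up values for illustration only
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

errors = actual - predicted          # [-10, 10, -30]

mse = np.mean(errors ** 2)           # mean of squared errors
mae = np.mean(np.abs(errors))        # mean of absolute errors
rmse = np.sqrt(mse)                  # back in the units of the data

# Baseline: always predict the mean of the actual values
baseline_rmse = np.sqrt(np.mean((actual - actual.mean()) ** 2))

print(round(mse, 2), round(mae, 2), round(rmse, 2))
print(round(baseline_rmse, 2), round(actual.std(), 2))  # the two values match
```

A model whose RMSE is well below this baseline is doing better than simply guessing the average.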

## Conclusion

The root mean squared error of the model is 1270. Compared with the standard deviation of the original data, 1047, the error is of the same order but larger, so the forecast is close but not especially accurate. Visually, the predicted values are off from the test data by quite a bit. When you zoom out to include the training data, though, the prediction seems acceptable given how much the series varies throughout the years, and the trend looks plausible for a year ahead. The data set looks roughly stationary, with a fluctuating seasonal pattern: for example, sales spike during the Thanksgiving and Christmas holiday months.