# Company Inc - Sales Data Analysis

## Office Supplies Market trend and the forecast for the next year (2015)

<br>

### This notebook analyses the Office Supplies market trend and provides a sales forecast for the next year (2015).

### PyCaret is an open-source machine learning library designed for streamlined and effective end-to-end predictive modeling. It was chosen to support the analysis, and the installation steps for the PyCaret library are provided below.

In [0]:
!pip install pycaret==3.2.0

Collecting pycaret==3.2.0
  Using cached pycaret-3.2.0-py3-none-any.whl (484 kB)
Collecting kaleido>=0.2.1
  Using cached kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
Collecting cloudpickle
  Using cached cloudpickle-3.0.0-py3-none-any.whl (20 kB)
Collecting tbats>=1.1.3
  Using cached tbats-1.1.3-py3-none-any.whl (44 kB)
Collecting numba>=0.55.0
  Using cached numba-0.58.1-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.6 MB)
Collecting xxhash
  Using cached xxhash-3.4.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (193 kB)
Collecting scikit-plot>=0.3.7
  Using cached scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Collecting scipy~=1.10.1
  Using cached scipy-1.10.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting wurlitzer
  Using cached wurlitzer-3.0.3-py3-none-any.whl (7.3 kB)
Collecting sktime!=0.17.1,!=0.17.2,!=0.18.0,<0.22.0,>=0.16.1
  Using cached sktime-0.21.1-py3-none-any.whl (17.1 MB)
Collecti

### Import the libraries

In [0]:
import plotly.express as px  # Plotly Express: interactive visualizations
import pandas as pd  # pandas: data processing
import pycaret  # PyCaret: time series modeling
pycaret.__version__  # check the installed PyCaret version

Out[15]: '3.2.0'

### Create a Spark SQL query to get the data and summarize the total sales by Date

In [0]:
# Get the data with Spark SQL, joining the fact table "Order" to the dimension tables "Product", "Product_Category" and "Calendar_Date"
df = spark.sql("""

SELECT  cal.Date -- Summarize the results by date
        ,SUM(Sales) AS Total_Sales -- Sum the sales to get the total sales by date
FROM company.order ord -- Fact table: Order
     ,company.Product prod -- Dimension table: Product
     ,company.Product_Category prod_cat -- Dimension table: Product_Category
     ,company.calendar_date cal -- Dimension table: Calendar_Date
WHERE prod.Product_ID = ord.Product_ID -- Inner join "Order" and "Product" on the model key
      AND prod.Product_Category_ID = prod_cat.Product_Category_ID -- Inner join "Product" and "Product_Category" on the model key
      AND prod_cat.Category = 'Office Supplies' -- Keep only the product category "Office Supplies"
      AND prod_cat.Sub_Category = 'Paper' -- Keep only the product sub-category "Paper"
      AND ord.Order_Date = cal.Date_ID -- Inner join "Order" and "Calendar_Date" on the model key
GROUP BY cal.Date -- Group the results by date
      
""")

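For readers without access to the Spark tables, the same join, filter and aggregation logic can be sketched in pandas. The four toy tables below are hypothetical: only the column names come from the query above, and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy tables; column names follow the query, values are invented.
orders = pd.DataFrame({
    "Product_ID": [1, 2, 1, 3],
    "Order_Date": [20140101, 20140101, 20140102, 20140102],
    "Sales": [10.0, 5.0, 7.5, 100.0],
})
products = pd.DataFrame({
    "Product_ID": [1, 2, 3],
    "Product_Category_ID": [10, 10, 20],
})
categories = pd.DataFrame({
    "Product_Category_ID": [10, 20],
    "Category": ["Office Supplies", "Furniture"],
    "Sub_Category": ["Paper", "Chairs"],
})
calendar = pd.DataFrame({
    "Date_ID": [20140101, 20140102],
    "Date": ["2014-01-01", "2014-01-02"],
})

# Inner joins on the model keys, mirroring the WHERE clause of the query.
joined = (
    orders
    .merge(products, on="Product_ID", how="inner")
    .merge(categories, on="Product_Category_ID", how="inner")
    .merge(calendar, left_on="Order_Date", right_on="Date_ID", how="inner")
)

# Keep only Office Supplies / Paper, then sum the sales by date.
paper = joined[
    (joined["Category"] == "Office Supplies") & (joined["Sub_Category"] == "Paper")
]
toy_daily_sales = (
    paper.groupby("Date", as_index=False)["Sales"].sum()
    .rename(columns={"Sales": "Total_Sales"})
)
print(toy_daily_sales)
```

The furniture order is dropped by the category filter, so only the paper sales remain in each daily total.
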
### Conducting an initial data analysis to assess the distribution across each month.

In [0]:
df_da = df.toPandas() #convert the Spark SQL result to Pandas

df_da['Date'] = pd.to_datetime(df_da['Date']) # convert the Date column to datetime format
df_da['Year-Month'] = df_da['Date'].dt.to_period('M').astype(str) #create a text column with the Year and Month information

In [0]:
fig = px.histogram(df_da, x = "Year-Month") # count the number of dates with sales in each month
fig.show()
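
A histogram with only `x` set counts rows, so the chart above shows how many dates with sales fall in each month, not the sales amount. The sketch below reproduces that count on a small invented frame and contrasts it with the monthly sales sum; the toy values and the names `toy`, `days_per_month` and `sales_per_month` are assumptions for illustration.

```python
import pandas as pd

# Invented daily totals; only the column names match the notebook.
toy = pd.DataFrame({
    "Date": ["2014-01-03", "2014-01-17", "2014-02-05"],
    "Total_Sales": [120.0, 80.0, 45.5],
})
toy["Date"] = pd.to_datetime(toy["Date"])
toy["Year-Month"] = toy["Date"].dt.to_period("M").astype(str)

# What the histogram displays: number of rows (dates) per month.
days_per_month = toy["Year-Month"].value_counts().sort_index()

# For comparison: the sales amount per month.
sales_per_month = toy.groupby("Year-Month")["Total_Sales"].sum()

print(days_per_month)
print(sales_per_month)
```
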

### Prepare the data to calculate the sales trend and the forecast

In [0]:
df_pd = df.toPandas()  # convert the Spark SQL result to pandas
df_pd.index = pd.to_datetime(df_pd['Date']) # use the Date column, converted to datetime, as the index
df_res = df_pd['Total_Sales'].resample('M').sum() # sum the sales by calendar month (month-end labels)
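
The monthly aggregation can be checked on a tiny invented series. This sketch passes `pd.offsets.MonthEnd()` instead of the `'M'` alias, on the assumption that both produce the same month-end bins; the alias is deprecated in recent pandas releases, while the offset object works across versions.

```python
import pandas as pd

# Invented daily totals with a gap in February.
toy_daily = pd.Series(
    [100.0, 50.0, 25.0],
    index=pd.to_datetime(["2014-01-05", "2014-01-20", "2014-03-02"]),
    name="Total_Sales",
)

# Sum by calendar month; each bin is labelled with its month-end date.
toy_monthly = toy_daily.resample(pd.offsets.MonthEnd()).sum()
print(toy_monthly)
```

Note that a month without any rows appears with a total of 0 rather than as a missing value, which matters when reading the monthly series as input to a forecast.
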

### Plot the sales results by month

In [0]:
fig = px.line(df_res, x = df_res.index, y="Total_Sales")
fig.show()

### Plot the monthly sales trend using an Ordinary Least Squares (OLS) regression

In [0]:
fig = px.scatter(df_res, x = df_res.index, y="Total_Sales", trendline="ols")
fig.show()
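
Plotly's `trendline="ols"` option relies on the statsmodels package to fit the line. As a rough illustration of what such a fit computes, the sketch below fits a straight line with NumPy to invented monthly values. Using the month position (0, 1, 2, ...) as the regressor is an assumption made here for readability; Plotly fits against the datetime axis, so its slope is expressed in different units.

```python
import numpy as np

# Invented monthly sales values.
toy_sales = np.array([10.0, 13.0, 13.0, 16.0, 18.0])
months = np.arange(len(toy_sales))  # month position used as the regressor

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept].
slope, intercept = np.polyfit(months, toy_sales, deg=1)
trend = intercept + slope * months

print(f"slope per month: {slope:.2f}, intercept: {intercept:.2f}")
```

A positive slope indicates that sales grow on average from month to month; the fitted `trend` values are the points the trendline passes through.
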

### Plot the monthly sales trend using a moving average with a window size of 3

In [0]:
fig = px.scatter(df_res, x = df_res.index, y="Total_Sales", trendline="rolling", trendline_options=dict(window=3))
fig.show()
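
The rolling trendline corresponds to a pandas rolling mean. The sketch below applies a 3-month window to invented values; the name `toy_monthly` and the numbers are assumptions for illustration.

```python
import pandas as pd

# Invented monthly sales values.
toy_monthly = pd.Series(
    [30.0, 60.0, 90.0, 30.0, 60.0],
    index=pd.date_range("2014-01-01", periods=5, freq="MS"),
    name="Total_Sales",
)

# Each point is the mean of the current month and the two months before it.
moving_avg = toy_monthly.rolling(window=3).mean()
print(moving_avg)
```

The first two values are missing because a full 3-month window is not yet available, which is why the trendline starts later than the data points.
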

### Set up PyCaret to calculate the forecast

In [0]:
from pycaret.time_series import *

#"df_res" = the time series with the sales results by month
#"fh" = the forecast horizon, also used as the test set size (10 months here)
#"session_id" = random seed, so that the results are reproducible
s = setup(df_res, fh = 10, session_id = 123)

| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | Total_Sales |
| 2 | Approach | Univariate |
| 3 | Exogenous Variables | Not Present |
| 4 | Original data shape | (48, 1) |
| 5 | Transformed data shape | (48, 1) |
| 6 | Transformed train set shape | (38, 1) |
| 7 | Transformed test set shape | (10, 1) |
| 8 | Rows with missing values | 0.0% |
| 9 | Fold Generator | ExpandingWindowSplitter |
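
The split reported by `setup` can be reconstructed by hand. The sketch below is not PyCaret's internal code: it assumes the series starts in January 2011 and that three expanding-window folds are used, each with a 10-month validation window. Under those assumptions it reproduces the 38/10 train/test split and the fold cutoffs that appear in the cross-validation output later in this notebook.

```python
import pandas as pd

n_obs, fh, n_folds = 48, 10, 3  # 48 months of data, fh = 10, 3 folds (assumed)

# ASSUMPTION: the monthly series starts in 2011-01.
months = pd.period_range("2011-01", periods=n_obs, freq="M")
train = months[: n_obs - fh]  # first 38 months
test = months[n_obs - fh:]    # last 10 months

# Expanding window: every fold trains on a longer prefix of the training set
# and validates on the fh months that follow its cutoff.
train_sizes = [len(train) - fh * (n_folds - i) for i in range(n_folds)]
cutoffs = [str(train[size - 1]) for size in train_sizes]

print(len(train), len(test))  # sizes of the train and test sets
print(cutoffs)                # last training month of each fold
```
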


### Compare different time series models on the main error metrics

In [0]:
# compare baseline models
best = compare_models()

| | Model | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| snaive | Seasonal Naive Forecaster | 0.8153 | 0.7748 | 1753.6333 | 2124.1308 | 0.4518 | 0.4838 | -0.8371 | 0.1 |
| croston | Croston | 0.854 | 0.8314 | 1779.7118 | 2201.6014 | 0.3666 | 0.4167 | -0.7632 | 0.03 |
| grand_means | Grand Means Forecaster | 0.8699 | 0.8469 | 1804.3254 | 2228.5427 | 0.3706 | 0.42 | -0.8021 | 0.0433 |
| rf_cds_dt | Random Forest w/ Cond. Deseasonalize & Detrending | 0.9105 | 0.8432 | 1987.1863 | 2344.2865 | 0.6712 | 0.4123 | -1.1513 | 0.71 |
| lightgbm_cds_dt | Light Gradient Boosting w/ Cond. Deseasonalize... | 0.9571 | 0.8631 | 2085.047 | 2412.7754 | 0.7037 | 0.4235 | -1.3234 | 0.4067 |
| stlf | STLF | 0.9631 | 0.903 | 2074.4387 | 2499.564 | 0.6635 | 0.5816 | -1.7012 | 0.08 |
| polytrend | Polynomial Trend Forecaster | 0.9689 | 0.8717 | 2113.3083 | 2439.5417 | 0.7078 | 0.4276 | -1.3843 | 0.0333 |
| naive | Naive Forecaster | 0.9909 | 0.9376 | 2101.7333 | 2522.402 | 0.591 | 0.4532 | -1.2841 | 1.61 |
| ada_cds_dt | AdaBoost w/ Cond. Deseasonalize & Detrending | 1.0149 | 0.9276 | 2208.9073 | 2596.7657 | 0.6802 | 0.4491 | -1.8136 | 0.4067 |
| et_cds_dt | Extra Trees w/ Cond. Deseasonalize & Detrending | 1.0337 | 0.9229 | 2234.2002 | 2549.9077 | 0.7335 | 0.4543 | -1.5465 | 0.69 |
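
To help read the metric columns, the sketch below computes MAE, RMSE, MAPE, SMAPE and MASE with NumPy on invented numbers. These are textbook definitions written for illustration; the implementations PyCaret uses may differ in details, for example in the seasonal period applied when scaling MASE (the `sp` argument here is an assumption).

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def smape(y_true, y_pred):
    return float(np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))))

def mase(y_true, y_pred, y_train, sp=1):
    # Scale the forecast MAE by the in-sample MAE of a naive forecast with lag sp.
    scale = np.mean(np.abs(y_train[sp:] - y_train[:-sp]))
    return mae(y_true, y_pred) / float(scale)

# Invented training history, actual values and forecasts.
y_train = np.array([100.0, 110.0, 120.0, 130.0])
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

print(mae(y_true, y_pred), rmse(y_true, y_pred))
print(mape(y_true, y_pred), smape(y_true, y_pred))
print(mase(y_true, y_pred, y_train))
```

A MASE below 1 means the model's error is smaller on average than that of the in-sample naive forecast, and a negative R2 means the forecast is worse than predicting the mean of the evaluation window.
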



### Based on the metrics, the Seasonal Naive Forecaster ("snaive") returned the best result (lowest MASE, RMSSE, MAE and RMSE)

In [0]:
# train the Seasonal Naive Forecaster ("snaive") with default parameters
dt = create_model('snaive')

| | cutoff | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011-08 | 0.8318 | 0.7623 | 2056.3 | 2518.1676 | 0.6902 | 0.6043 | -1.2238 |
| 1 | 2012-06 | 0.8665 | 0.7867 | 1790.9 | 2046.8073 | 0.4081 | 0.5545 | -1.4468 |
| 2 | 2013-04 | 0.7476 | 0.7753 | 1413.7 | 1807.4176 | 0.2571 | 0.2925 | 0.1593 |
| Mean | | 0.8153 | 0.7748 | 1753.6333 | 2124.1308 | 0.4518 | 0.4838 | -0.8371 |
| SD | | 0.0499 | 0.0099 | 263.6605 | 295.2689 | 0.1795 | 0.1368 | 0.7104 |
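
A seasonal naive forecast simply repeats the values of the last observed season. The function below is a minimal sketch of that idea, not PyCaret's implementation; `seasonal_naive` and its `sp` (seasonal period) argument are hypothetical names, and the seasonal period PyCaret selected for this data is not shown in the output above.

```python
import numpy as np

def seasonal_naive(y_train, fh, sp):
    """Forecast fh steps ahead by repeating the last sp observed values."""
    y_train = np.asarray(y_train, dtype=float)
    last_season = y_train[-sp:]
    # Step h (1-based) reuses the value at the same position in the last season.
    return np.array([last_season[(h - 1) % sp] for h in range(1, fh + 1)])

# Invented series with a seasonal period of 4.
history = [1, 2, 3, 4, 5, 6, 7, 8]
forecast = seasonal_naive(history, fh=6, sp=4)
print(forecast)
```

Because the forecast only copies past values, the model has no parameters fitted to the level or trend of the series, which makes it a useful baseline.
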



### Tune the Seasonal Naive Forecaster ("snaive") model, optimizing the MAE metric

In [0]:
# tune the model with the default search settings, optimizing MAE
tuned_dt = tune_model(dt, optimize = 'MAE')

| | cutoff | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2011-08 | 0.8318 | 0.7623 | 2056.3 | 2518.1676 | 0.6902 | 0.6043 | -1.2238 |
| 1 | 2012-06 | 0.8665 | 0.7867 | 1790.9 | 2046.8073 | 0.4081 | 0.5545 | -1.4468 |
| 2 | 2013-04 | 0.7476 | 0.7753 | 1413.7 | 1807.4176 | 0.2571 | 0.2925 | 0.1593 |
| Mean | | 0.8153 | 0.7748 | 1753.6333 | 2124.1308 | 0.4518 | 0.4838 | -0.8371 |
| SD | | 0.0499 | 0.0099 | 263.6605 | 295.2689 | 0.1795 | 0.1368 | 0.7104 |



Fitting 3 folds for each of 4 candidates, totalling 12 fits



### Plot the tuned model forecast for the 10 months in the test split

In [0]:
# plot forecast
plot_model(tuned_dt, plot = 'forecast')

### Using the tuned model, plot the sales forecast for the next year (2015)

In [0]:
# plot a 23-month forecast: the 10 test months plus 13 months beyond the available data
plot_model(tuned_dt, plot = 'forecast', data_kwargs = {'fh' : 23})
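
The choice of `fh = 23` can be checked with a little date arithmetic. The sketch below assumes that the training split ends in February 2014 (38 months starting in January 2011) and that a model which has not been finalized forecasts from the end of the training split; both assumptions are worth verifying against the actual plot.

```python
import pandas as pd

# ASSUMPTION: last month of the training split (38 months from 2011-01).
train_end = pd.Period("2014-02", freq="M")
fh = 23

# Months covered by a 23-step forecast that starts right after the training split.
horizon = pd.period_range(train_end + 1, periods=fh, freq="M")
months_2015 = [p for p in horizon if p.year == 2015]

print(horizon[0], horizon[-1])  # first and last forecast month
print(len(months_2015))         # how many months of 2015 are covered
```

Under these assumptions the horizon spans the 10 test months of 2014, all 12 months of 2015 and one month of 2016, so a horizon of 22 would already cover the whole of 2015.
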