# ShampooSalesTimeSeries

## 1. Introduction and algorithm description
This notebook uses the shampoo sales dataset to demonstrate the following time series algorithms provided by hana_ml:

- ARIMA
- Auto ARIMA
- Auto Exponential Smoothing
- Seasonal Decompose

### - ARIMA 
The Auto Regressive Integrated Moving Average (ARIMA) algorithm is widely used in econometrics, statistics and time series analysis.
ARIMA models are parametrized by three integers (p, d, q), so a nonseasonal ARIMA model is denoted ARIMA(p, d, q):

 - p is the number of autoregressive terms (AR part). It lets the model incorporate the effect of past values. Intuitively, this is similar to stating that it is likely to be warm tomorrow if it has been warm for the past 3 days.
 - d is the number of nonseasonal differences needed for stationarity. Intuitively, this is similar to stating that the temperature tomorrow is likely to be the same if the difference in temperature over the last three days has been very small.
 - q is the number of lagged forecast errors in the prediction equation (MA part). It lets the model express its error as a linear combination of the errors observed at previous time points.
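
The three orders can be read off the usual ARMA recursion written for the differenced series. The symbols below (c, φ, θ, ε and the backshift operator B, with B y_t = y_{t-1}) are standard textbook notation and are not taken from the hana_ml documentation; sign conventions for the MA terms differ between references.

$$
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t,
\qquad y'_t = (1 - B)^d \, y_t
$$

Here p counts the lagged values of the differenced series, q counts the lagged errors, and d is the number of times the original series is differenced before the recursion is applied.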

When dealing with seasonal effects, Seasonal ARIMA (SARIMA) is used, which is denoted ARIMA(p, d, q)(P, D, Q, s). Here, p, d, q are the nonseasonal parameters described above. P, D, Q follow the same definitions but are applied to the seasonal component of the time series. The term s is the periodicity of the time series.
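
To make the roles of d and D more concrete, here is a minimal sketch of plain and seasonal differencing in pure Python. The helper `difference` and the toy series are made up for illustration and are not part of hana_ml.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up series: a linear trend (+2 per step) plus a period-2 pattern (+5 on even steps).
toy = [2 * t + (5 if t % 2 == 0 else 0) for t in range(8)]

# d = 1: the trend is removed, the period-2 pattern is still visible.
first_diff = difference(toy, lag=1)

# D = 1 with s = 2: the period-2 pattern is removed and only a constant is left.
seasonal_diff = difference(toy, lag=2)

print(first_diff)     # [-3, 7, -3, 7, -3, 7, -3]
print(seasonal_diff)  # [4, 4, 4, 4, 4, 4]
```

Differencing at lag 1 corresponds to one unit of d, and differencing at lag s corresponds to one unit of D.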

### - Auto ARIMA 
Although the ARIMA model is useful and powerful in time series analysis, choosing appropriate orders can be difficult. Auto ARIMA determines the orders of an ARIMA model automatically.
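
One common way to automate order selection is to fit several candidate orders and keep the one with the lowest information criterion. The sketch below shows only that selection step. The candidate log-likelihoods are invented numbers, and the search that hana_ml actually performs is not described in this notebook, so treat this as an illustration of the idea rather than of the library.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: the parameter penalty grows with n_obs."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates: (p, d, q) -> (log-likelihood, number of estimated parameters).
# These values are made up for illustration, not fitted to the shampoo data.
candidates = {
    (1, 0, 0): (-205.0, 2),
    (1, 1, 1): (-196.0, 3),
    (2, 1, 2): (-195.5, 5),
}

best_order = min(candidates, key=lambda order: aic(*candidates[order]))
print(best_order)  # (1, 1, 1)
```

The larger model (2, 1, 2) has a slightly better likelihood, but its extra parameters are penalized enough that (1, 1, 1) wins under AIC.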

### - Auto Exponential Smoothing
Auto exponential smoothing is used to calculate optimal parameters of a set of smoothing functions, including Single Exponential Smoothing, Double Exponential Smoothing, and Triple Exponential Smoothing.
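
As a rough illustration of what "calculating optimal parameters" means, the sketch below implements single exponential smoothing and picks the smoothing weight alpha that minimizes the one-step-ahead mean squared error. A plain grid search stands in for whatever optimizer the library uses; the function names and the toy data are assumptions made for this example.

```python
def single_exponential_smoothing(series, alpha):
    """Return one-step-ahead forecasts; forecasts[t] is the prediction for series[t]."""
    forecasts = [series[0]]  # initialize with the first observation
    for value in series[:-1]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts


def mse(series, forecasts):
    """Mean squared error between observations and their forecasts."""
    return sum((y - f) ** 2 for y, f in zip(series, forecasts)) / len(series)


def best_alpha(series, grid):
    """Pick the alpha from `grid` with the lowest one-step-ahead MSE."""
    return min(grid, key=lambda a: mse(series, single_exponential_smoothing(series, a)))


toy = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]  # made-up data
grid = [i / 10 for i in range(1, 10)]        # candidate alphas 0.1 .. 0.9
alpha = best_alpha(toy, grid)
print(alpha, mse(toy, single_exponential_smoothing(toy, alpha)))
```

Double and triple exponential smoothing add a trend term and a seasonal term respectively, each with its own weight to optimize.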

### - Seasonal Decompose
This algorithm decomposes a time series into three components: seasonal, trend, and random.

## 2. Dataset
The shampoo sales dataset describes the monthly number of shampoo sales over a three-year period.
The units are sales counts and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright and Hyndman (1998). The plot below shows an increasing trend and possibly a seasonal component. 

<img src="images/Shampoo-Sales.png" title="Shampoo Sales" width="600" height="1200" />


Dataset source: https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv (for tutorial use only).

### Attribute information
 - ID: ID
 - SALES: Monthly sales 


## 3. Data Loading

### Import packages
First, import the packages needed for data loading.

In [1]:
from hana_ml import dataframe
from data_load_utils import DataSets, Settings

### Setup Connection
In our case, the data is loaded from the CSV file "shampoo.csv" into a HANA table called "SHAMPOO_SALES_DATA_TBL".
To do that, a connection to HANA is created and then passed to the data loader.
The connection parameters are read from a config file, <b>config/e2edata.ini</b>.
A sample section of the config file is shown below; it holds the HANA URL, port, user and password.  

#########################<br>
[hana]<br>
url=host-url<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
#########################<br>
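
For readers who want to see how such a section can be parsed, here is a minimal sketch using Python's standard `configparser`. It is not the implementation of `Settings.load_config`, which lives in `data_load_utils` and is not shown in this notebook. The helper name `read_hana_settings` is hypothetical, the return order simply mirrors how the result is unpacked in the next cell, and the port `30015` is a made-up value standing in for the `3xx15` placeholder.

```python
import configparser

# Sample config text; 30015 is an assumed port, replace it with your own instance's port.
SAMPLE_INI = """
[hana]
url=host-url
user=username
passwd=userpassword
port=30015
"""


def read_hana_settings(ini_text):
    """Hypothetical helper: parse the [hana] section and return (url, port, user, pwd)."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["hana"]
    return section["url"], section.getint("port"), section["user"], section["passwd"]


url, port, user, pwd = read_hana_settings(SAMPLE_INI)
print(url, port, user)
```

Keeping credentials in a separate config file avoids hard-coding them in the notebook itself.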



In [2]:
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
# the connection
connection_context = dataframe.ConnectionContext(url, port, user, pwd)

### Load Data
Next, the function DataSets.load_shampoo_data() decides whether to load the data or reload it from scratch. If the data is being loaded for the first time, the return message looks like the example below:

##################<br>
ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, 'invalid table name:  Could not find table/view SHAMPOO_SALES_DATA_TBL in schema XIN: line 1 col 37 (at pos 36)')
Table SHAMPOO_SALES_DATA_TBL doesn't exist in schema XIN
Creating table SHAMPOO_SALES_DATA_TBL in schema XIN ....
Drop unsuccessful
Creating table XIN.SHAMPOO_SALES_DATA_TBL
Data Loaded:100%
###################<br>

If the data is already loaded, there would be a return message "Table XXX exists and data exists".

In [3]:
data_tbl = DataSets.load_shampoo_data(connection_context)

ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, 'invalid table name:  Could not find table/view SHAMPOO_SALES_DATA_TBL in schema PAL_USER: line 1 col 37 (at pos 36)')


Table SHAMPOO_SALES_DATA_TBL doesn't exist in schema PAL_USER
Creating table SHAMPOO_SALES_DATA_TBL in schema PAL_USER ....
Drop unsuccessful
Creating table PAL_USER.SHAMPOO_SALES_DATA_TBL
Data Loaded:100%


### Create Dataframes
Create a dataframe df from SHAMPOO_SALES_DATA_TBL for the following steps.

In [4]:
df = connection_context.table(data_tbl)

### Simple Data Exploration
We will do some data exploration to get to know the data better.
- First three data points

In [5]:
df.collect().head(3)

Unnamed: 0,ID,SALES
0,25,339.7
1,24,342.3
2,23,264.5


- Columns

In [6]:
print(df.columns)

['ID', 'SALES']


- No. of data points

In [7]:
print('Number of rows in df: {}'.format(df.count()))

Number of rows in df: 36


- Data types

In [8]:
df.dtypes()

[('ID', 'INT', 10, 10, 10, 0), ('SALES', 'DOUBLE', 15, 15, 15, 0)]

## 4. Analysis
In this section, various time series algorithms are applied to analyze the shampoo sales dataset.

### 4.1 Seasonal Decompose
Because the dataset shows an increasing trend and possibly a seasonal component, we first use the seasonal_decompose function to decompose the data.

In [9]:
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose

In [10]:
stats, decompose = seasonal_decompose(df, endog='SALES', alpha=0.2, thread_ratio=0.5)

The seasonal_decompose function returns two tables: stats and decompose.

In [11]:
stats.collect()

Unnamed: 0,STAT_NAME,STAT_VALUE
0,type,multiplicative
1,period,2
2,acf,0.515912


The stats table shows that a seasonal pattern with period 2 was detected and that a multiplicative seasonality model was selected. The decompose table lists the resulting components.

In [12]:
decompose.collect().head(5)

Unnamed: 0,ID,SEASONAL,TREND,RANDOM
0,1,1.030443,235.975,1.093935
1,2,0.969557,185.225,0.812423
2,3,1.030443,157.85,1.125693
3,4,0.969557,150.5,0.817581
4,5,1.030443,162.1,1.079416
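
Since the stats table reports a multiplicative model, each observation should be recoverable as the product of its three components. The sketch below checks this for the first three rows of the decompose table shown above; the numbers are copied from that output and the variable names are only illustrative.

```python
# First three rows of the decompose table shown above: (ID, SEASONAL, TREND, RANDOM)
rows = [
    (1, 1.030443, 235.975, 1.093935),
    (2, 0.969557, 185.225, 0.812423),
    (3, 1.030443, 157.850, 1.125693),
]

# Multiplicative model: observed = seasonal * trend * random
reconstructed = {
    row_id: seasonal * trend * remainder
    for row_id, seasonal, trend, remainder in rows
}

for row_id, value in reconstructed.items():
    print(row_id, round(value, 1))
```

The products come out at roughly 266.0, 145.9 and 183.1. An additive model would instead sum the three components.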


### 4.2 ARIMA
Import the ARIMA module:

In [13]:
from hana_ml.algorithms.pal.tsa.arima import ARIMA

Create and initialize an ARIMA estimator:

In [14]:
arima = ARIMA(order=(1, 0, 0), seasonal_order=(1, 0, 0, 2),
              method='mle', thread_ratio=1.0)

Perform fit on the given data:

In [18]:
arima.fit(df, endog='SALES')

The fitted ARIMA model has two attributes: model_ and fitted_. The model parameters can be found in model_. 

In [19]:
arima.model_.collect()

Unnamed: 0,KEY,VALUE
0,p,1
1,AR,0.0341783
2,d,0
3,q,0
4,MA,
5,s,2
6,P,1
7,SAR,0.881861
8,D,0
9,Q,0


The model_ table also contains the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), which can be minimized to select the best-fitting model. The fitted_ table holds the fitted values and residuals:

In [20]:
arima.fitted_.collect().set_index('ID').head(5)

Unnamed: 0_level_0,FITTED,RESIDUALS
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,312.618416,-46.618416
2,309.623055,-163.723055
3,270.250161,-87.150161
4,162.574303,-43.274303
5,196.81884,-16.51884
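
Assuming the usual definition residual = observed - fitted (the notebook does not state the sign convention explicitly), the original sales values can be recovered by adding the two columns. The sketch below does this for the five rows shown above.

```python
# Rows of the fitted_ table shown above: (ID, FITTED, RESIDUALS)
fitted_rows = [
    (1, 312.618416, -46.618416),
    (2, 309.623055, -163.723055),
    (3, 270.250161, -87.150161),
    (4, 162.574303, -43.274303),
    (5, 196.818840, -16.518840),
]

# Assumption: residual = observed - fitted, hence observed = fitted + residual.
observed = [round(fitted + residual, 1) for _, fitted, residual in fitted_rows]
print(observed)  # [266.0, 145.9, 183.1, 119.3, 180.3]
```

The large negative residuals in the first rows indicate that this model overestimates the early months of the series.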


Predict using the ARIMA model:

In [21]:
result = arima.predict(forecast_method='innovations_algorithm', forecast_length=5)

In [22]:
result.collect()

Unnamed: 0,TIMESTAMP,FORECAST,SE,LO80,HI80,LO95,HI95
0,0,556.080049,80.337725,453.123078,659.037021,398.621001,713.539097
1,1,607.631165,80.384634,504.614076,710.648254,450.080176,765.182155
2,2,527.325305,107.211247,389.928516,664.722093,317.19512,737.45549
3,3,572.778865,107.238661,435.346945,710.210786,362.594951,782.96278
4,4,501.960007,124.152572,342.852031,661.067984,258.625437,745.294578
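
The LO/HI columns appear to be normal-theory prediction intervals of the form forecast ± z · SE. The sketch below recomputes the 80% and 95% bounds for the first forecast row using Python's standard library. The normality assumption is an inference from the numbers in the table, not something stated in this notebook, and the helper name is illustrative.

```python
from statistics import NormalDist


def normal_interval(forecast, se, level):
    """Two-sided interval forecast +/- z * se for a central coverage `level`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return forecast - z * se, forecast + z * se


# First row of the forecast table shown above.
forecast, se = 556.080049, 80.337725

lo80, hi80 = normal_interval(forecast, se, 0.80)
lo95, hi95 = normal_interval(forecast, se, 0.95)
print(round(lo80, 2), round(hi80, 2), round(lo95, 2), round(hi95, 2))
```

The recomputed bounds agree with the LO80, HI80, LO95 and HI95 columns to within rounding. Note how SE grows with the forecast horizon, which widens the intervals for later timestamps.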


### 4.3 Auto ARIMA 
Import the auto ARIMA module:

In [23]:
from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA

Create and initialize an auto ARIMA estimator:

In [24]:
autoarima = AutoARIMA(search_strategy=1, allow_linear=1, thread_ratio=1.0)

Perform fit on the given data:

In [22]:
autoarima.fit(df, endog='SALES')

In [23]:
autoarima.model_.collect()

Unnamed: 0,KEY,VALUE
0,p,1
1,AR,-0.567381
2,d,1
3,q,1
4,MA,-0.51326
5,s,2
6,P,0
7,SAR,
8,D,0
9,Q,0


In [24]:
autoarima.fitted_.collect().set_index('ID').head(6)

Unnamed: 0_level_0,FITTED,RESIDUALS
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,,
2,278.098031,-132.198031
3,257.930136,-74.830136
4,213.876321,-94.576321
5,221.242853,-40.942853
6,185.467432,-16.967432


Predict using the auto ARIMA model:

In [25]:
result = autoarima.predict(forecast_method='innovations_algorithm', forecast_length=5)

In [26]:
result.collect()

Unnamed: 0,TIMESTAMP,FORECAST,SE,LO80,HI80,LO95,HI95
0,0,581.414104,66.62492,496.030804,666.797403,450.831658,711.996549
1,1,637.531732,66.841199,551.87126,723.192205,506.525388,768.538076
2,2,624.653831,75.672918,527.675052,721.632611,476.337636,772.970026
3,3,650.922683,76.666004,552.671213,749.174153,500.660076,801.18529
4,4,654.980411,80.779919,551.456744,758.504078,496.654677,813.306145


### 4.4 Auto Exponential Smoothing 
Import auto exponential smoothing module:

In [27]:
from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing

Create and initialize an auto exponential smoothing estimator:

In [28]:
autoexpsmooth = AutoExponentialSmoothing(model_selection=1, forecast_num=3)

Fit the model on the given data and produce the forecast in one step:

In [29]:
autoexpsmooth.fit_predict(df, endog='SALES')

The stats_ table shows the optimized parameters and that the Triple Exponential Smoothing (TESM) model was selected.

In [30]:
autoexpsmooth.stats_.collect()

Unnamed: 0,STAT_NAME,STAT_VALUE
0,FORECAST_MODEL_NAME,TESM
1,MSE,5995.716823643405
2,NUMBER_OF_ITERATIONS,290
3,SA_NUMBER_OF_ITERATIONS,100
4,NM_NUMBER_OF_ITERATIONS,190
5,NM_EXECUTION_TIME,0.001025
6,SA_STOP_COND,MAX_ITERATION
7,NM_STOP_COND,ERROR_DIFFERENCE
8,ALPHA,0.08199869110618652
9,BETA,0.9999999911615682


The smoothed values and forecasts, together with the columns for the lower and upper prediction-interval bounds, are available in forecast_:

In [31]:
autoexpsmooth.forecast_.collect()

Unnamed: 0,TIMESTAMP,VALUE,PI1_LOWER,PI1_UPPER,PI2_LOWER,PI2_UPPER
0,3,235.192318,,,,
1,4,109.89402,,,,
2,5,158.943465,,,,
3,6,89.801132,,,,
4,7,153.79621,,,,
5,8,128.925369,,,,
6,9,213.984531,,,,
7,10,190.686499,,,,
8,11,203.129699,,,,
9,12,171.233333,,,,


## 5. Close Connection

In [32]:
connection_context.close()