## The Machine Learning Part with Daily Data

First, import the libraries. We install pandasql, statsmodels, and fbprophet from PyPI on Databricks.

Python Lib Ref:  
Pandasql: https://pypi.org/project/pandasql/  
statsmodels: https://www.statsmodels.org/stable/index.html

Time Series Ref:  
https://otexts.com/fpp2/regression.html  
https://www.machinelearningplus.com/time-series/time-series-analysis-python/  
https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/  
https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/

https://facebook.github.io/prophet/docs/quick_start.html#python-api

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#pip install -U pandasql
from pandasql import sqldf

#pip install -U statsmodels
from statsmodels.tsa.seasonal import seasonal_decompose


#pip install fbprophet
# Note: from version 1.0 the package is published as `prophet` (from prophet import Prophet)
from fbprophet import Prophet


import logging
logger = spark._jvm.org.apache.log4j
logging.getLogger("py4j").setLevel(logging.ERROR)

Now we load the data we previously prepared

In [5]:
df_MSC_SalesTemp = pd.read_csv("/dbfs/FileStore/tables/MSC_SalesTemp.csv")

df_MSC_SalesTemp['date'] =  pd.to_datetime(df_MSC_SalesTemp['date'])

df_MSC_SalesTemp.index = pd.to_datetime(df_MSC_SalesTemp.date)

print(df_MSC_SalesTemp.shape)
print(df_MSC_SalesTemp.head())
print(df_MSC_SalesTemp.dtypes)




In [6]:
#Check any missing values

df_MSC_SalesTemp.isnull().sum()



We should review the data by plotting each series against the date.

In [8]:
def plot_df(x, y, title="", xlabel='', ylabel='', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    display()

plot_df(x=df_MSC_SalesTemp.date, y=df_MSC_SalesTemp.AvgTemp, xlabel='Date', ylabel='Temperature', title='Daily Temperature VS Dates') 

plot_df(x=df_MSC_SalesTemp.date, y=df_MSC_SalesTemp.DailyRev, xlabel='Date', ylabel='Revenue', title='Daily Sales Revenue VS Dates')

plot_df(x=df_MSC_SalesTemp.date, y=df_MSC_SalesTemp.ItemsSold, xlabel='Date', ylabel='Counts', title='Daily Sales Count VS Dates')

From the plots above, we can observe some possible seasonality and trends. We will decompose the series later below.

Next, we run the augmented Dickey–Fuller (ADF) test. The null hypothesis is that a unit root is present in the time series, i.e. the series is non-stationary. The alternative hypothesis depends on which version of the test is used, but is usually stationarity or trend-stationarity. A p-value below 0.05 lets us reject the null hypothesis. ADF is an augmented version of the Dickey–Fuller test for a larger and more complicated set of time series models.
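
To make the unit-root idea concrete, the sketch below computes the plain (non-augmented) Dickey-Fuller statistic by hand on synthetic data. It is only an illustration: it omits the lagged differences and the p-value that adfuller provides, and the function name dickey_fuller_stat and both toy series are made up here. A statistic well below the asymptotic 5% critical value of about -2.86 (constant, no trend) points to a stationary series.

```python
import random


def dickey_fuller_stat(y):
    """t-statistic of rho in  dy_t = alpha + rho * y_{t-1} + e_t  (no lagged differences)."""
    x = y[:-1]                                    # y_{t-1}
    dy = [b - a for a, b in zip(y[:-1], y[1:])]   # y_t - y_{t-1}
    n = len(x)
    mean_x = sum(x) / n
    mean_dy = sum(dy) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (di - mean_dy) for xi, di in zip(x, dy))
    rho = sxy / sxx                               # OLS slope
    alpha = mean_dy - rho * mean_x                # OLS intercept
    rss = sum((di - alpha - rho * xi) ** 2 for xi, di in zip(x, dy))
    se_rho = (rss / (n - 2) / sxx) ** 0.5         # standard error of the slope
    return rho / se_rho


rng = random.Random(0)

# Stationary toy series: independent Gaussian noise
noise = [rng.gauss(0, 1) for _ in range(500)]

# Non-stationary toy series: a random walk (cumulative sum of noise)
walk = []
level = 0.0
for _ in range(500):
    level += rng.gauss(0, 1)
    walk.append(level)

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
print('DF statistic, white noise:', stat_noise)
print('DF statistic, random walk:', stat_walk)
```

The white-noise statistic should come out strongly negative, while the random-walk statistic usually stays above the critical value, which is the pattern the ADF p-values below summarize.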

In [10]:
from statsmodels.tsa.stattools import adfuller

# ADF Test
result_adf_temp = adfuller(df_MSC_SalesTemp.AvgTemp.values, autolag='AIC')
print(f'ADF Statistic: {result_adf_temp[0]}')
print(f'p-value: {result_adf_temp[1]}')
print('Critical Values:')
for key, value in result_adf_temp[4].items():
    print(f'   {key}, {value}')
    
    

In [11]:
# ADF Test
result_adf_dr = adfuller(df_MSC_SalesTemp.DailyRev.values, autolag='AIC')
print(f'ADF Statistic: {result_adf_dr[0]}')
print(f'p-value: {result_adf_dr[1]}')
print('Critical Values:')
for key, value in result_adf_dr[4].items():
    print(f'   {key}, {value}')



In [12]:
# ADF Test (use a separate variable so the temperature result is not overwritten)
result_adf_items = adfuller(df_MSC_SalesTemp.ItemsSold.values, autolag='AIC')
print(f'ADF Statistic: {result_adf_items[0]}')
print(f'p-value: {result_adf_items[1]}')
print('Critical Values:')
for key, value in result_adf_items[4].items():
    print(f'   {key}, {value}')



From the ADF tests above, only the daily temperature series has a p-value slightly greater than 0.05, so it is the only series for which we cannot reject the unit-root null hypothesis, and we treat it as non-stationary. The daily item counts and revenue both test as stationary.

We will still decompose all three series.

In [14]:
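# Additive Decomposition for Average Temp
# Note: newer statsmodels releases rename `freq` to `period` (period=30); this applies to every seasonal_decompose call below
result_temp_add = seasonal_decompose(df_MSC_SalesTemp['AvgTemp'], model='additive', freq=30)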

pd.plotting.register_matplotlib_converters()
result_temp_add.plot()
display()

In [15]:
# Additive Decomposition for daily revenue
result_dr_add = seasonal_decompose(df_MSC_SalesTemp['DailyRev'], model='additive', freq=30)


result_dr_add.plot()
display()

In [16]:
# Multiplicative Decomposition for daily revenue
result_dr_mul = seasonal_decompose(df_MSC_SalesTemp['DailyRev'], model='multiplicative', freq = 30)

result_dr_mul.plot()
display()

In [17]:
# Additive Decomposition for # of Items Sold
result_items_add = seasonal_decompose(df_MSC_SalesTemp['ItemsSold'], model='additive', freq=30)


result_items_add.plot()
display()

In [18]:
# Multiplicative Decomposition for # of Items Sold
result_items_mu = seasonal_decompose(df_MSC_SalesTemp['ItemsSold'], model='multiplicative', freq=30)


result_items_mu.plot()
display()

From the plots above, for daily sales revenue and item counts the difference between the additive and multiplicative models is small, so we will use the additive model for all three columns.

Next, we will reconstruct the data set from the decomposed components.

In [20]:
#print(result_temp_add.trend)
#print(result_temp_add.resid)



In [21]:
df_recons_temp_add = pd.concat([result_temp_add.seasonal, result_temp_add.trend, result_temp_add.resid, result_temp_add.observed], axis=1)
df_recons_temp_add.columns = ['seas', 'trend', 'resid', 'actual_values']

print(df_recons_temp_add.shape)
print(df_recons_temp_add.isnull().sum())
print(df_recons_temp_add)


In [22]:
df_recons_dr_add = pd.concat([result_dr_add.seasonal, result_dr_add.trend, result_dr_add.resid, result_dr_add.observed], axis=1)
df_recons_dr_add.columns = ['seas', 'trend', 'resid', 'actual_values']

print(df_recons_dr_add.shape)
print(df_recons_dr_add.isnull().sum())
print(df_recons_dr_add)



In [23]:
df_recons_items_add = pd.concat([result_items_add.seasonal, result_items_add.trend, result_items_add.resid, result_items_add.observed], axis=1)
df_recons_items_add.columns = ['seas', 'trend', 'resid', 'actual_values']

print(df_recons_items_add.shape)
print(df_recons_items_add.isnull().sum())
print(df_recons_items_add)



Then we run the stationarity test on the residuals of the three decompositions, using the same ADF test as above (null hypothesis: a unit root is present). The residuals are sliced with [15:1019] because the decomposition leaves the first and last 15 values undefined (NaN), and adfuller cannot handle missing values.
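
The [15:1019] slice used in the cells below follows from how the trend is estimated. A minimal sketch, assuming the trend is a centered moving average over a 30-day period (to my knowledge the default behaviour of seasonal_decompose) and assuming the data set has 1034 rows (inferred from the slice, not stated above). The helper centered_trend and the toy series are hypothetical.

```python
def centered_trend(values, period):
    """Centered moving average; for an even period the two end points get half weight."""
    if period % 2 == 0:
        weights = [0.5] + [1.0] * (period - 1) + [0.5]
    else:
        weights = [1.0] * period
    weights = [w / period for w in weights]
    half = len(weights) // 2
    trend = [None] * len(values)   # None marks "undefined", like NaN in pandas
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        trend[i] = sum(w * v for w, v in zip(weights, window))
    return trend


n_rows = 1034   # assumption: implied by the [15:1019] slice
series = [float(i % 30) for i in range(n_rows)]   # toy data with a 30-step cycle

trend = centered_trend(series, period=30)
defined = [i for i, t in enumerate(trend) if t is not None]

# First defined row and one past the last defined row, i.e. the slice bounds
print(defined[0], defined[-1] + 1)   # 15 1019
```

Because the residual is observed minus trend minus seasonal, it is undefined wherever the trend is, so the same 15 rows at each end have to be dropped before testing.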

In [25]:
from statsmodels.tsa.stattools import adfuller

# ADF Test
result_adf_tempadd = adfuller(df_recons_temp_add.resid.values[15: 1019], autolag='AIC')
print(f'ADF Statistic: {result_adf_tempadd[0]}')
print(f'p-value: {result_adf_tempadd[1]}')
print('Critical Values:')
for key, value in result_adf_tempadd[4].items():
    print(f'   {key}, {value}')




In [26]:


# ADF Test
result_adf_dradd = adfuller(df_recons_dr_add.resid.values[15: 1019], autolag='AIC')
print(f'ADF Statistic: {result_adf_dradd[0]}')
print(f'p-value: {result_adf_dradd[1]}')
print('Critical Values:')
for key, value in result_adf_dradd[4].items():
    print(f'   {key}, {value}')





In [27]:


# ADF Test
result_adf_itemsadd = adfuller(df_recons_items_add.resid.values[15: 1019], autolag='AIC')
print(f'ADF Statistic: {result_adf_itemsadd[0]}')
print(f'p-value: {result_adf_itemsadd[1]}')
print('Critical Values:')
for key, value in result_adf_itemsadd[4].items():
    print(f'   {key}, {value}')





The p-values for the ADF tests above are all much smaller than 0.05, so we can reject the null hypothesis of a unit root, which means the residuals are stationary.

Next, we compute the covariance of the temperature against daily sales revenue and item counts.  
https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/

In [30]:
covariance_01 = np.cov(df_MSC_SalesTemp.AvgTemp, df_MSC_SalesTemp.DailyRev)

covariance_02 = np.cov(df_MSC_SalesTemp.AvgTemp, df_MSC_SalesTemp.ItemsSold)

print(covariance_01)
print(covariance_02)

Covariance depends on the units of the two series, so it is hard to interpret. We should try a better measure, such as the Pearson correlation.  

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

The Pearson correlation coefficient (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples.

Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviations of the two data samples. It is the normalization of the covariance between the two variables to give an interpretable score between -1 and 1.
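
The definition above can be sketched in plain Python so the arithmetic is visible. The helper names and the toy numbers are made up for illustration; in the notebook itself scipy.stats.pearsonr does this work.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    std_x = sqrt(covariance(x, x))
    std_y = sqrt(covariance(y, y))
    return covariance(x, y) / (std_x * std_y)


# Hypothetical toy data, not taken from the sales data set
temps = [-5.0, 0.0, 5.0, 10.0, 15.0]
revenue = [2.0 * t + 100.0 for t in temps]        # perfectly linear in temperature
revenue_scaled = [1000.0 * v for v in revenue]    # same relationship, different units

print('cov, original units:', covariance(temps, revenue))
print('cov, rescaled units:', covariance(temps, revenue_scaled))
print('r,   original units:', pearson_r(temps, revenue))
print('r,   rescaled units:', pearson_r(temps, revenue_scaled))
```

Rescaling the revenue multiplies the covariance by the same factor but leaves the correlation unchanged, which is why the coefficient is easier to interpret than the raw covariance matrix printed earlier.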

In [32]:
# Pearson correlation coefficient for raw data
# pearsonr returns the coefficient together with a two-sided p-value
from scipy.stats import pearsonr

pcovariance_01 = pearsonr(df_MSC_SalesTemp.AvgTemp, df_MSC_SalesTemp.DailyRev)

pcovariance_02 = pearsonr(df_MSC_SalesTemp.AvgTemp, df_MSC_SalesTemp.ItemsSold)

print(pcovariance_01)
print(pcovariance_02)

From the Pearson correlation coefficients above, the pairs are unlikely to be linearly related, but we can still fit a linear model and see what we get.
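
For a regression with a single feature, R-squared equals the square of the Pearson coefficient, so a weak correlation already limits how well the linear fit can do. A minimal sketch in plain Python on made-up numbers (the helper names are hypothetical, and LinearRegression is what the notebook actually uses):

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
r = pearson_r(x, y)
print('R-squared:', r2)
print('r squared:', r ** 2)
```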

In [34]:
from sklearn.linear_model import LinearRegression

# create X and y
feature_cols_01 = ['AvgTemp']

#print(data[feature_cols_01])

X_01 = df_MSC_SalesTemp[feature_cols_01]
y_01 = df_MSC_SalesTemp['DailyRev']
y_02 = df_MSC_SalesTemp['ItemsSold']

#Check if the features and labels sets are good
print(X_01)
print(y_01)

lm_01 = LinearRegression()
lm_01.fit(X_01, y_01)

lm_02 = LinearRegression()
lm_02.fit(X_01, y_02)


lm_01_intercept = lm_01.intercept_
# print intercept and coefficients
print('Intercept for Daily Revenue LM is:', lm_01_intercept)

lm_01_coef = lm_01.coef_
print('Coefficients for Daily Revenue LM is:', lm_01_coef)


lm_02_intercept = lm_02.intercept_
# print intercept and coefficients
print('Intercept for Daily Item Count LM is:', lm_02_intercept)

lm_02_coef = lm_02.coef_
print('Coefficients for Daily Item Count LM is:', lm_02_coef)


Rsq_01 = lm_01.score(X_01, y_01)
print('R-square for Daily Revenue LM is:', Rsq_01)

Rsq_02 = lm_02.score(X_01, y_02)
print('R-square for Daily Item Count LM is:', Rsq_02)

In [35]:
from sklearn.linear_model import LinearRegression

# create X and y
feature_cols_02 = ['resid']

#print(data[feature_cols_01])

X_rec = df_recons_temp_add[feature_cols_02][15:1019]
y_rec01 = df_recons_dr_add['resid'][15:1019]
y_rec02 = df_recons_items_add['resid'][15:1019]

#Check if the features and labels sets are good
print(X_rec)
print(y_rec01)

lm_rec01 = LinearRegression()
lm_rec01.fit(X_rec, y_rec01)

lm_rec02 = LinearRegression()
lm_rec02.fit(X_rec, y_rec02)


lm_rec01_intercept = lm_rec01.intercept_
# print intercept and coefficients
print('Intercept for Recons Daily Revenue LM is:', lm_rec01_intercept)

lm_rec01_coef = lm_rec01.coef_
print('Coefficients for Recons Daily Revenue LM is:', lm_rec01_coef)


lm_rec02_intercept = lm_rec02.intercept_
# print intercept and coefficients
print('Intercept for Recons Daily Item Count LM is:', lm_rec02_intercept)

lm_rec02_coef = lm_rec02.coef_
print('Coefficients for Recons Daily Item Count LM is:', lm_rec02_coef)


Rsq_rec01 = lm_rec01.score(X_rec, y_rec01)
print('R-square for Recons Daily Revenue LM is:', Rsq_rec01)

Rsq_rec02 = lm_rec02.score(X_rec, y_rec02)
print('R-square for Recons Daily Item Count LM is:', Rsq_rec02)

Next, we proceed with time series forecasting using Facebook Prophet to see how it performs.

https://facebook.github.io/prophet/docs/quick_start.html#python-api

Implement Facebook Prophet for daily temperature

Note that we use the first 30 months of data, from the beginning to 2015-06-30, rows [0:911], as the training set, and 2015-07-01 to 2015-10-30, rows [911:1033], as the validation/test set.
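
The row positions can be checked against the dates with a few lines of standard-library Python. This sketch assumes the series starts on 2013-01-01 and has exactly one row per day with no gaps; the start date is inferred from "30 months ending 2015-06-30" and is not stated explicitly above. The helper row_to_date is hypothetical.

```python
from datetime import date, timedelta

start = date(2013, 1, 1)   # assumption: first row of the daily series


def row_to_date(row):
    """Map a 0-based row position to its calendar date (one row per day, no gaps)."""
    return start + timedelta(days=row)


# Python slices exclude the end position, so [0:911] covers rows 0..910
train_first, train_last = row_to_date(0), row_to_date(910)
val_first, val_last = row_to_date(911), row_to_date(1032)

print('train:', train_first, 'to', train_last)
print('validation:', val_first, 'to', val_last)
print('validation days:', (val_last - val_first).days + 1)
```

If the data had missing days, positional slices would drift away from the intended dates, so filtering on the date column would be the safer choice in that case.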

In [38]:
#date      DailyRev  ItemsSold   AvgTemp

df_fb_daytemp = pd.concat([df_MSC_SalesTemp.date[0:911], df_MSC_SalesTemp.AvgTemp[0:911]], axis=1)
df_fb_daytemp.columns = ['ds', 'y']

df_val_daytemp = pd.concat([df_MSC_SalesTemp.date[911:1033], df_MSC_SalesTemp.AvgTemp[911:1033]], axis=1)
df_val_daytemp.columns = ['ds', 'y_act']

true_daytemp = df_val_daytemp['y_act']

print(true_daytemp)
#print(df_fb_montemp.head())
#print(df_fb_montemp.shape)

#print(df_fb_montemp.ds)

m_temp = Prophet()

m_temp.fit(df_fb_daytemp)


In [39]:
future_temp = m_temp.make_future_dataframe(periods = 180)  

forecast_temp = m_temp.predict(future_temp)
m_temp.plot(forecast_temp)

display()



In [40]:
m_temp.plot_components(forecast_temp)

display()



In [41]:
# Check the y-hats for daily temp that will be used to calculate RMSE

print(forecast_temp[['ds','yhat']][911:1033])

pred_temp = forecast_temp['yhat'][911:1033]
print(pred_temp)



Implement Facebook Prophet for daily revenue

In [43]:


#date      DailyRev  ItemsSold   AvgTemp

df_fb_dayre = pd.concat([df_MSC_SalesTemp.date[0:911], df_MSC_SalesTemp.DailyRev[0:911]], axis=1)
df_fb_dayre.columns = ['ds', 'y']

df_val_dayre = pd.concat([df_MSC_SalesTemp.date[911:1033], df_MSC_SalesTemp.DailyRev[911:1033]], axis=1)
df_val_dayre.columns = ['ds', 'y_act']

true_dayre = df_val_dayre['y_act']

print(true_dayre)
#print(df_fb_montemp.head())
#print(df_fb_montemp.shape)

#print(df_fb_montemp.ds)

m_dr = Prophet()

m_dr.fit(df_fb_dayre)




In [44]:

future_dr = m_dr.make_future_dataframe(periods = 180)  

forecast_dr = m_dr.predict(future_dr)
m_dr.plot(forecast_dr)

display()


In [45]:
m_dr.plot_components(forecast_dr)

display()



In [46]:
# Check the y-hats for daily revenue that will be used to calculate RMSE

print(forecast_dr[['ds','yhat']][911:1033])

pred_dr = forecast_dr['yhat'][911:1033]
print(pred_dr)


In [47]:
from sklearn.metrics import mean_squared_error

mse_temp = mean_squared_error(true_daytemp, pred_temp)
rmse_temp = mse_temp**(0.5)

mse_dr = mean_squared_error(true_dayre, pred_dr)
rmse_dr = mse_dr**(0.5)

print('The RMSE for Daily Temperature is:', rmse_temp)

print('The RMSE for Daily Revenue is:', rmse_dr)



The RMSE for Daily Temperature is: 3.598394550747085  
The RMSE for Daily Revenue is: 157820.8835738632  

From the Data Preparation notebook, we know the following means and standard deviations:

Means:  
DailyRev:     824420.868021  
ItemsSold:      1234.406190  
AvgTemp:           6.950757  
 
Standard Deviations:   
DailyRev:     386737.832552  
ItemsSold:       570.724269  
AvgTemp:          10.367651  

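Against these figures, the Prophet RMSE for daily temperature (about 3.60) is roughly 35% of the temperature standard deviation (10.37), and the RMSE for daily revenue (about 157,821) is roughly 41% of the revenue standard deviation (386,738), or about 19% of the mean daily revenue. So the Facebook Prophet forecasts are not bad, but not great either.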
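
One way to put the two RMSEs on a common scale is to divide each by the standard deviation of its series, since a forecast that always predicted the mean would have an RMSE close to the standard deviation. A minimal sketch using only the numbers reported above; the standard deviations come from the full data set rather than the validation window, so the ratios are a rough guide only.

```python
# Values copied from the notebook output and the Data Preparation summary above
rmse = {'AvgTemp': 3.598394550747085, 'DailyRev': 157820.8835738632}
std = {'AvgTemp': 10.367651, 'DailyRev': 386737.832552}
mean = {'AvgTemp': 6.950757, 'DailyRev': 824420.868021}

# RMSE as a fraction of the series standard deviation (lower is better)
rmse_over_std = {name: rmse[name] / std[name] for name in rmse}

# RMSE as a fraction of the mean; only meaningful for revenue, because the
# temperature mean is close to zero
rev_rmse_over_mean = rmse['DailyRev'] / mean['DailyRev']

for name, ratio in rmse_over_std.items():
    print(f'{name}: RMSE / std = {ratio:.3f}')
print(f'DailyRev: RMSE / mean = {rev_rmse_over_mean:.3f}')
```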