<a id="Introduction"></a> 

<img src="https://triuminfo.com/application/views/images/sales_forecasting_banner.jpg">

<center><h1>💰 Store Sales💰</h1></center>

# Introduction 

Which stores and grocery families sell the most in our markets? 🤔

Are those sales affected by seasonality (holidays) or by any trend?

Consumer behavior is hard to guess, but we can test it. You can find a correlation anywhere you look, if you're really, deeply paying attention.


> **🟢 Goal**: This notebook analyses and predicts sales so that we can stock more of the products that sell in our stores, and checks whether seasonality has any impact on those sales.

Our stores are mainly located in Ecuador, and we need to improve our sales in different states.

The stores belong to Corporación Favorita, a large grocery retailer based in Ecuador.



![WhatsApp Image 2022-05-27 at 3.47.56 PM.jpeg](attachment:43463724-6ec1-49a2-a9c0-f579b4427b03.jpeg)

### 📚 Libraries & Functions

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import datetime as dt
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

# from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score, r2_score 
# from sklearn.metrics import mean_poisson_deviance, mean_gamma_deviance, accuracy_score
from sklearn.preprocessing import MinMaxScaler

# from itertools import product
# import statsmodels.api as sm

# import itertools
# from statsmodels.tsa.arima_model import ARIMA
# from statsmodels.tsa.holtwinters import ExponentialSmoothing
# from statsmodels.tsa.holtwinters import SimpleExpSmoothing
# from statsmodels.tsa.seasonal import seasonal_decompose
# import statsmodels.tsa.api as smt
# from statsmodels.tsa.statespace.sarimax import SARIMAX


from itertools import cycle
import plotly.offline as py
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

from statsmodels.tsa.stattools import adfuller       ### Augmented Dickey Fuller

# set style 

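# Newer Matplotlib releases renamed 'seaborn-darkgrid' to 'seaborn-v0_8-darkgrid'
style_name = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style_name)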

mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False

class clr:
    S = '\033[1m' + '\033[95m'
    E = '\033[0m'


# Color palette

my_colors = ["#ce8f5a", "#efd199", "#80c8bc", "#5ec0ca", "#6287a2"]

In [None]:
# ===== DICKEY FULLER TEST =====

def test_stationarity(x, log='non-stationary'):
    """Test stationarity of a time series and plot its rolling statistics.

    `log` is kept so existing calls keep working; it is not used.
    """
    # Perform the Augmented Dickey-Fuller test
    result = adfuller(x)
    adf_stat, p_value, critical_values = result[0], result[1], result[4]

    print("ADF statistic: {}".format(adf_stat))
    print("p-value: {}".format(p_value))

    # Compare the statistic with the 1% critical value (the strictest level)
    if adf_stat > critical_values['1%']:
        print(clr.S + "The graph is non-stationary! (it has a trend)" + clr.E)
    else:
        print(clr.S + "The graph is stationary! (it doesn't have a trend)" + clr.E)

    print("Critical values:")
    for key, value in critical_values.items():
        print("\t%s: %.3f " % (key, value))

    # Determine rolling statistics
    rolmean = x.rolling(window=22, center=False).mean()
    rolstd = x.rolling(window=22, center=False).std()

    # Plot rolling statistics
    plt.figure(figsize=(25, 11))
    plt.plot(x, color=my_colors[0], lw=8, label="Original")
    plt.plot(rolmean, color=my_colors[2], lw=2.5, ls='--', label='Rolling Mean')
    plt.plot(rolstd, color=my_colors[3], lw=3, label='Rolling Std')
    plt.legend(loc='best', fontsize=20)
    plt.title('Rolling Mean & Standard Deviation', size=25)
    plt.show(block=False)
    
        

In [None]:
# Run this to enable CSS styling
# WARNING: run it in its own cell, not alongside other code
from IPython.display import HTML

def css_styling():
    # Read the stylesheet and close the file afterwards
    with open("../input/alerts-21/alerts.css", "r") as f:
        styles = f.read()
    return HTML("<style>" + styles + "</style>")

css_styling()

<a id="Understand the data "></a> 

# 1. Understand the data 🧐

## 1.1 🏪 Stores overview

<div class="alert simple-alert">
  💡<b>Helpful Notes</b>: 
      
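* store_nbr: identifies the store at which the products are sold.
* city: city where the store is located.
* state: state where the store is located.
* type: store type.
* cluster: a grouping of similar stores.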
    
</div>


### 📥 Read in Data

In [None]:
%%time 

stores = pd.read_csv("../input/store-sales-time-series-forecasting/stores.csv")

In [None]:
stores.head() 

In [None]:
stores.cluster.nunique()

In [None]:
stores.type.nunique()

In [None]:
stores.info()

![](https://besthotelshome.com/wp-content/uploads/2020/03/Map-of-Ecuador.jpg)

In [None]:
stores.state.unique() 


# We will focus mainly on two states: "Pichincha", where the capital of Ecuador ('Quito') is located,
# and "Guayas", where the largest city in Ecuador ('Guayaquil') is located.

In [None]:
stores.city.unique()

In [None]:
no_of_cities = len(stores.city.unique())
no_of_states = len(stores.state.unique()) 

In [None]:
print(clr.S+f"Our stores are located in {no_of_states} different states and {no_of_cities} different cities."+clr.E)

In [None]:
print(clr.S+"Our unique stores:"+clr.E, stores.store_nbr.unique())

In [None]:
stores_per_state_per_city = stores.groupby(['state','city'])['store_nbr'].count().reset_index()

In [None]:
px.bar(stores_per_state_per_city, x = 'city', y = 'store_nbr', color='state', 
       labels = dict(city= 'City', store_nbr='Number of Stores'),
    title = 'Stores per state and city ')

> Increase Plot size 

> We can **conclude** from this chart that only 5 states have more than 2 stores. From the 10th state onward, there is only one store per state.

<a id = 'train dataset overview'></a>

## 1.2 Train dataset overview

<div class="alert simple-alert">
  💡<b>Helpful Notes</b>: 
      
* 3+ million store records.
* store_nbr: Identifies the store at which the products are sold. 
* family: Type of product sold. 
* sales: total sales for a product family at a given store on a given date.
* onpromotion: Total number of items in a product family that were being promoted.
    
</div>

In [None]:
%%time

train = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv', index_col=0)

In [None]:
train.head()

In [None]:
print(clr.S+f"The dataset contains {train.shape[0]} rows "+clr.E)
print(clr.S+f"The dataset contains {train.shape[1]} columns"+clr.E)

train.isna().sum()

<div class="alert simple-alert">
  <p>💡<b>Note</b>: We have no missing values in our dataset :) </p>
</div>

In [None]:
print(clr.S+ "Our stores sell these grocery families:"+clr.E, train.family.unique())

In [None]:
len(train.family.unique())

In [None]:
train.sales.mean()

In [None]:
sales_per_family_mean = train.groupby('family')['sales'].mean().sort_values(ascending=False).reset_index()

In [None]:
px.bar(sales_per_family_mean, x='family', y='sales', 
       labels = dict(family='Category', sales='Average Sales'),
       title = "Average Sales Per Category")

In [None]:
print(clr.S+ f"We have {train.family.nunique()} unique categories in our stores."+clr.E)

In [None]:
train.info() # check the data types of our columns

<div class="alert warning-alert">
  <p>💡<b>Note</b>: We can see that the memory usage of our data is above 135 MB, so some data types need to be changed.</p>
</div>
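
One way to act on this note is to downcast numeric columns and store repeated strings as categories. The sketch below runs on a small synthetic frame that only borrows the column names of `train`; the helper name `shrink` is hypothetical, and the savings on the real data will differ.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for `train` (same column names, made-up values).
demo = pd.DataFrame({
    "store_nbr": np.tile(np.arange(1, 5), 250),
    "family": np.tile(["GROCERY I", "BEVERAGES", "PRODUCE", "DAIRY", "CLEANING"], 200),
    "sales": np.linspace(0.0, 500.0, 1000),
    "onpromotion": np.tile(np.arange(0, 10), 100),
})

def shrink(df):
    """Return a copy of df with more compact dtypes."""
    out = df.copy()
    # Integer columns: pick the smallest integer type that can hold the values.
    for col in ["store_nbr", "onpromotion"]:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # float64 -> float32 halves the size, at the cost of some precision.
    out["sales"] = out["sales"].astype("float32")
    # Low-cardinality strings are much cheaper as a categorical.
    out["family"] = out["family"].astype("category")
    return out

small = shrink(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Whether `float32` is acceptable for `sales` depends on how much precision the later modeling steps need.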


In [None]:
train['date'] = pd.to_datetime(train['date'])

In [None]:
train['year'] = train['date'].dt.year
train.head(3)

In [None]:
sales_made_per_year = train.groupby("year")['sales'].sum().reset_index()

sales_made_per_year

In [None]:
px.line(sales_made_per_year, x='year', y='sales', 
       labels = dict(year= 'Year', sales='Sales'),
       title= "Sales Made Per Year ")

<div class="alert danger-alert">
  <p>💡<b>Note</b>: Total sales peak in 2016, and 2017 shows a sharp drop. Keep in mind that the data ends on 2017-08-15, so 2017 is only a partial year.</p>
</div>
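
Because the last year in the data is incomplete (it ends on 2017-08-15), comparing yearly totals can mislead. A fairer comparison, sketched here on synthetic data with constant daily sales, divides each year's total by the number of days observed. The frame `daily` and its values are illustrative assumptions, not the competition data.

```python
import pandas as pd

# Synthetic daily totals: constant sales of 100 per day, ending mid-August 2017.
dates = pd.date_range("2016-01-01", "2017-08-15", freq="D")
daily = pd.DataFrame({"date": dates, "sales": 100.0})
daily["year"] = daily["date"].dt.year

# Yearly total, number of observed days, and the per-day average.
per_year = daily.groupby("year")["sales"].agg(total="sum", days="count")
per_year["avg_per_day"] = per_year["total"] / per_year["days"]
print(per_year)
```

In this toy series the 2017 total is lower only because fewer days are observed; the per-day average is unchanged.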

In [None]:
count_sales_per_year = train.groupby("year")['sales'].count().reset_index()

count_sales_per_year

In [None]:
px.line(count_sales_per_year, x='year', y='sales', 
       labels = dict(year= 'Year', sales='Sales'),
       title= "Number of Sales Per Year")

<div class="alert danger-alert">
  <p>💡<b>Note</b>: The number of records appears to drop over 2016-2017. Note that 2017 covers only part of the year (the data ends on 2017-08-15).</p>
</div>

In [None]:
train['month'] = train['date'].dt.month

In [None]:
sales_per_month = train.groupby(['year','month'])['sales'].sum().reset_index()


In [None]:
sales_per_month.head()

In [None]:
px.line(sales_per_month, x = 'month', y= 'sales', color='year')

We can see that every year starts with better sales than the preceding year (except for 2015), which is a good indicator overall, but there are a couple of peaks that need further investigation.

In [None]:
top_5_categories = train.groupby('family')['sales'].mean().sort_values(ascending=False)[:5].reset_index()

<div class="alert simple-alert">
  <p>💡<b>Note</b>: These are the top 5 grocery types by average sales :) </p>
</div>

In [None]:
for fam in top_5_categories.family.unique():

    # Monthly mean sales for this family, computed once and reused for both axes
    monthly = (train.loc[train['family'] == fam, ['date', 'sales']]
               .resample('M', on='date')
               .mean()
               .reset_index())

    plt.figure(figsize=(20,8))
    sns.lineplot(data=monthly, x='date', y='sales', linewidth=1.5)

    plt.xlabel('Date')
    plt.ylabel("Sales Mean")
    plt.title('{} Monthly Mean Sales'.format(fam))
    plt.xticks(rotation=45)
    plt.show()

In [None]:
least_5_categories = train.groupby('family')['sales'].mean().sort_values(ascending=True)[:5].reset_index()

<div class="alert simple-alert">
  <p>💡<b>Note</b>: These are the 5 grocery types with the lowest average sales. </p>
</div>

In [None]:
for cat in least_5_categories.family.unique():

    # Monthly mean sales for this family, computed once and reused for both axes
    monthly = (train.loc[train['family'] == cat, ['date', 'sales']]
               .resample('M', on='date')
               .mean()
               .reset_index())

    plt.figure(figsize=(20,8))
    sns.lineplot(data=monthly, x='date', y='sales', linewidth=1.5)

    plt.xlabel('Date')
    plt.ylabel("Sales Mean")
    plt.title('{} Monthly Mean Sales'.format(cat))  # use the loop variable of this loop
    plt.xticks(rotation=45)
    plt.show()

<a id = 'Further data to analyze'></a>

## 1.2 Further data to analyze: the events dataset

**💡 Helpful Notes**:


* date: date of the event.
* type: kind of day (for example, a holiday or an event).
* locale: scope of the holiday (local, regional, or national).
* locale_name: name of the place the holiday applies to.
* description: brief description of the event.


In [None]:
%%time

events = pd.read_csv('../input/store-sales-time-series-forecasting/holidays_events.csv')

oil = pd.read_csv('../input/store-sales-time-series-forecasting/oil.csv')

stores = pd.read_csv('../input/store-sales-time-series-forecasting/stores.csv')

test = pd.read_csv('../input/store-sales-time-series-forecasting/test.csv')

submission = pd.read_csv('../input/store-sales-time-series-forecasting/sample_submission.csv')

In [None]:
events.head()

In [None]:
events['date'] = pd.to_datetime(events.date) # change column type 

In [None]:
events['year'] = events.date.dt.year # create a separate year column to get an overview per year

In [None]:
events['month'] = events.date.dt.month  

In [None]:
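# Name both columns explicitly so the result is the same across pandas versions
events_type_value_counts = (events['type'].value_counts()
                            .rename_axis('event_type')
                            .reset_index(name='count'))

In [None]:
px.bar(events_type_value_counts, x='event_type', y='count',
        labels=dict(event_type="Type", count="Frequency"))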


In [None]:
count_events = events.groupby("year")['type'].count().reset_index()

In [None]:
count_events # count events per year 

In [None]:
px.bar(count_events, x='year', y='type', 
        labels= dict(type="Frequency of events", year= 'Year'))



<div class="alert simple-alert">
  <p>💡<b>Note</b>: It seems the holidays are not equally distributed across years!
    We can take a deeper look at the holidays :)</p>
</div>
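
One compact way to take that closer look is a year-by-type table. The sketch below uses a handful of made-up rows with the same `date` and `type` columns as `events`; the values are purely illustrative.

```python
import pandas as pd

# Made-up events; only the column names mirror the real table.
demo_events = pd.DataFrame({
    "date": pd.to_datetime(["2012-03-02", "2012-12-25", "2013-01-01",
                            "2013-05-12", "2013-12-25", "2013-12-31"]),
    "type": ["Holiday", "Holiday", "Holiday", "Event", "Holiday", "Additional"],
})

# Rows are years, columns are event types, cells are counts.
counts = pd.crosstab(demo_events["date"].dt.year, demo_events["type"])
print(counts)
```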

In [None]:
events.type.value_counts().reset_index()

In [None]:
# Take a copy so that columns added later do not write into a slice of `events`
holidays_or_event = events[events['type'].isin(['Holiday', 'Event'])].copy()

In [None]:
holidays_or_event.head()

In [None]:
x = holidays_or_event.groupby('year')['type'].value_counts()
x.index

In [None]:
holidays_or_event['month'] = holidays_or_event.date.dt.month 
# let's look at holidays per month to get a more concise overview

In [None]:
holidays_or_event_per_year_and_month = holidays_or_event.groupby(['year','month'])['type'].value_counts().reset_index(level=[0,1])

holidays_or_event_per_year_and_month

In [None]:
holidays_or_event_per_year_and_month.columns = ['Year', 'Month', 'Count']

In [None]:
holidays_or_event_per_year_and_month = holidays_or_event_per_year_and_month.reset_index()

In [None]:
holidays_or_event_per_year_and_month.head()

In [None]:
px.line(holidays_or_event_per_year_and_month, y='Count', x='Month', color='Year')

In [None]:
px.line(sales_per_month, x = 'month', y= 'sales', color='year')

In [None]:
#px.scatter(sales_per_month, x = 'month', y= 'sales', color='year')

In [None]:
events.head()

In [None]:
try_this_one = events.groupby('date')['type'].count().reset_index()
try_this_one

In [None]:
px.line(try_this_one, x='date', y='type')

In [None]:
event_per_month = events.groupby(['year','month'])['type'].count().reset_index()
event_per_month

In [None]:
px.line(event_per_month, x='month', y='type',color='year',
       labels = dict(month='Month', type='Frequency'))

In [None]:
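# Name both columns explicitly so the result is the same across pandas versions
diff_events = (events['locale'].value_counts()
               .rename_axis('locale')
               .reset_index(name='count'))

In [None]:
px.bar(diff_events, x = 'locale', y='count', 
       labels = dict(locale='Locale', count='Frequency'))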

## 1.3 Oil dataset

<div class="alert simple-alert">
  <p>💡<b>Note</b>: We could use oil prices for further investigation by joining this dataset to the train dataset. </p>
</div>
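
Before joining, it may help to put the oil series on a complete daily calendar, since a price series of this kind typically has no rows for non-trading days while the stores sell every day. The sketch below is one assumed way to do this: the tiny frame `oil_demo` is made up, and forward-filling is only one possible gap-filling choice.

```python
import numpy as np
import pandas as pd

# Made-up prices: the weekend (Jan 5-6) has no rows and one weekday value is missing.
oil_demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04",
                            "2013-01-07", "2013-01-08"]),
    "dcoilwtico": [93.1, 93.0, np.nan, 93.2, 93.2],
})

# Reindex to every calendar day, then carry the last known price forward.
full_days = pd.date_range(oil_demo["date"].min(), oil_demo["date"].max(), freq="D")
oil_daily = (oil_demo.set_index("date")
                     .reindex(full_days)
                     .ffill()
                     .rename_axis("date")
                     .reset_index())
print(oil_daily)
```

After this step a left merge on `date` would find a price for every day in the covered range.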

In [None]:
oil.head() 
# this dataset gives us the oil price on a given date

In [None]:
oil['date'] = pd.to_datetime(oil.date)

In [None]:
px.line(oil, x='date', y='dcoilwtico', labels=dict(date='Date', dcoilwtico='Oil Price'), title ='Oil price over time')

<a id = 'Further data to analyze'></a>

## 1.4 Deeper analysis by merging all available datasets


In [None]:
events.head()

In [None]:
events_new = events.drop(['year','month', 'locale', 'locale_name','description', 'transferred'], axis=1)

> We previously investigated the **events** dataset on its own, but we can merge it with the other datasets to gain deeper insight.
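
One thing to watch when merging: if the right-hand table has several rows for the same key, a left merge repeats the matching left rows. A holidays table may list more than one event on a date, so the sketch below (made-up rows, hypothetical names `sales_demo` and `events_demo`) collapses events to one row per date first and uses `validate` to confirm the relationship.

```python
import pandas as pd

sales_demo = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2016-05-01", "2016-05-02"]),
    "store_nbr": [1, 2, 1],
    "sales": [10.0, 20.0, 30.0],
})
events_demo = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2016-05-01"]),
    "type": ["Holiday", "Event"],
})

# Naive left merge: each sale on 2016-05-01 is matched twice.
naive = sales_demo.merge(events_demo, how="left", on="date")

# Keep one row per date (here: the first type listed), then merge.
events_one = events_demo.drop_duplicates(subset="date", keep="first")
safe = sales_demo.merge(events_one, how="left", on="date", validate="many_to_one")

print(len(sales_demo), len(naive), len(safe))
```

Which event to keep per date is a modeling decision; keeping the first row is only the simplest option.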

In [None]:
train_stores = train.merge(stores, how='left', on='store_nbr')

In [None]:
data = train_stores.merge(events_new, how='left', on= 'date')

In [None]:
data = data.merge(oil, how ='left', on='date')

data.head()

In [None]:
print(clr.S+f"The merged dataset has {data.shape[0]} rows and {data.shape[1]} columns."+clr.E)

In [None]:
sales_per_family_per_city = data.groupby(['city','family'])['sales'].sum()
sales_per_family_per_city

In [None]:
sales_per_family_per_city = sales_per_family_per_city.reset_index(level=[0,1])
sales_per_family_per_city 

In [None]:
px.bar(sales_per_family_per_city, y='sales', x='city', color='family')

In [None]:
data.head()

In [None]:
# nunique counts distinct stores; count() would count every sales record instead
stores_per_city = data.groupby('city')['store_nbr'].nunique().reset_index()


In [None]:
stores_per_city.head()

In [None]:
px.bar(stores_per_city, y='store_nbr', x ='city')

> It seems to me that the categorical features with an impact on sales are the following:

* state 
* family 
* city
* type_x (store type, from the stores table)
* type_y (event type, from the events table)

In [None]:
data.head()

In [None]:
data['week'] = data['date'].dt.isocalendar().week
data['quarter'] = data['date'].dt.quarter
data['day_of_week'] = data['date'].dt.day_name()


In [None]:
data['month_name'] = data['date'].dt.strftime('%b')

In [None]:
data.head()

In [None]:
df_store_sales = data.groupby('type_x').agg({"sales" : "mean"})\
     .reset_index()\
    .sort_values(by='sales', ascending=False)

df_fam_sales = data.groupby('family').agg({"sales" : "mean"})\
    .reset_index()\
    .sort_values(by='sales', ascending=False)[:10]

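# Keep months in calendar order (groupby would otherwise sort the names alphabetically)
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df_clus_sales = data.groupby('month_name').agg({"sales" : "mean"})\
    .reindex(month_order).rename_axis('month_name')\
    .reset_index() 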

In [None]:
print(px.colors.diverging.RdYlGn.reverse)

In [None]:
colors = px.colors.sequential.amp_r
fig = make_subplots(rows=2, cols=2, 
                    specs=[[{"type": "bar"}, {"type": "pie"}],
                           [{"colspan": 2}, None]],
                    column_widths=[0.7, 0.3], vertical_spacing=0, horizontal_spacing=0.02,
                    subplot_titles=("Top 10 Highest Product Sales", "Highest Sales in Stores", "Average Sales by Month"))

fig.add_trace(go.Bar(x=df_fam_sales['sales'], y=df_fam_sales['family'], marker=dict(color= colors),
                     name='Family', orientation='h'), 
                     row=1, col=1)
fig.add_trace(go.Pie(values=df_store_sales['sales'], labels=df_store_sales['type_x'], name='Store type',
                     marker=dict(colors=colors), hole=0.7,
                     hoverinfo='label+percent+value', textinfo='label'), 
                    row=1, col=2)
fig.add_trace(go.Bar(x=df_clus_sales['month_name'], y=df_clus_sales['sales'], 
                     marker=dict(color= px.colors.sequential.amp[3]), name='Months'), 
                     row=2, col=1)

# styling -------------------------------------------------------------------------------
fig.update_yaxes(showgrid=False, ticksuffix=' ', categoryorder='total ascending', row=1, col=1)
fig.update_xaxes(visible=False, row=1, col=1)
fig.update_xaxes(tickmode = 'array', tickvals=df_clus_sales.month_name,  row=2, col=1)
fig.update_yaxes(visible=True, row=2, col=1, range=[0,700])
fig.update_layout(height=600, bargap=0.2,
                  margin=dict(b=0,r=20,l=20), xaxis=dict(tickmode='linear'),
                  title_text="Average Sales Analysis",
                  template="plotly_white",
                  title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)

fig.show()


In [None]:
top_city  = stores.groupby('city') .size().sort_values(ascending=False)

In [None]:
stores_nbr_id = data.store_nbr.unique()

In [None]:
stores_nbr_id

In [None]:
sales_grouped = data.groupby(['store_nbr','date']).agg({'sales':'sum'})
stores_by_nbr = stores.set_index('store_nbr')  # look up metadata by store number, not by row position
sales_by_store_id = {}
for store in stores_nbr_id:  # iterate over every store number, including the last one
    meta = stores_by_nbr.loc[store]
    sales_by_store_id[store] = {'sale': sales_grouped.loc[store].sales.mean(),
                                'city': meta['city'],
                                'city-id': "{}-{}".format(meta['city'], store),
                                'state': meta['state'],
                                'type': meta['type'],
                                'cluster': meta['cluster']}
sales_by_store_id = pd.DataFrame.from_dict(sales_by_store_id, orient='index') # convert to Pandas DataFrame 

sales_by_city = {}
for city in top_city.index.values:
    in_city = sales_by_store_id['city'] == city  # mask built on the same index it filters
    sales_by_city[city] = {'sum': sales_by_store_id.loc[in_city, 'sale'].sum(),
                           'mean': sales_by_store_id.loc[in_city, 'sale'].mean()}
sales_by_city = pd.DataFrame.from_dict(sales_by_city, orient='index')


In [None]:
# Sorting
sales_by_store_id.sort_values( by='sale', inplace=True, ascending=False )
sales_by_city_sum  = sales_by_city.sort_values( by='sum', ascending=False )

In [None]:
sales_by_store_id.head()

In [None]:
px.bar(sales_by_store_id, x='city-id', y='sale', title= 'Average sales per given store')

In [None]:
sales_by_store_id.sale.mean()

In [None]:
sales_grouped  = data.groupby('date').agg({'sales':'sum'}).to_period("D")
sales_grouped['year']      = sales_grouped.index.year
sales_grouped['quarter']   = sales_grouped.index.quarter
sales_grouped['month']     = sales_grouped.index.month
sales_grouped['week']      = sales_grouped.index.week
sales_grouped['dayofweek'] = sales_grouped.index.dayofweek  # Monday=0, Sunday=6
sales_grouped['dayofmonth']= sales_grouped.index.day  # day in month from 01 to 31
sales_grouped['dayofyear'] = sales_grouped.index.dayofyear

sales_smooth7  = sales_grouped.copy()
sales_smooth30 = sales_grouped.copy()
sales_smooth365= sales_grouped.copy()

sales_smooth7["sales"]   = sales_smooth7.  sales.rolling(window=7,  center=True, min_periods=3 ).mean()
sales_smooth30["sales"]  = sales_smooth30. sales.rolling(window=30, center=True, min_periods=15).mean()
sales_smooth365["sales"] = sales_smooth365.sales.rolling(window=365,center=True, min_periods=183).mean()

In [None]:
figsize = (14,5)

fig, (ax1,ax2) = plt.subplots(1,2,figsize=figsize)

sales_grouped.groupby(['month']).agg({'sales':'mean'}).plot(kind="barh",ax=ax1)

ax1.set(title="Average Sales by Month")
ax1.set(ylabel="Month", xlabel="Average Sales")
ax1.get_legend().remove()

labels1 = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

ax1.set_yticks(range(0,12), labels1)

sales_grouped.groupby(['quarter']).agg({'sales':'mean'}).plot.pie(y="sales",ax=ax2, legend=False, autopct='%1.f%%',
                             startangle=90, labels=["Quarter 1","Quarter 2","Quarter 3","Quarter 4"], fontsize="x-large")
ax2.set(title="Average Sales by Quarter")

<div class="alert simple-alert">
  <p>💡<b>Note</b>: Average sales are highest in December. People tend to stock up at the end of the year to prepare for the new year, to decorate the house, or to plan for a long holiday starting at Christmas. </p>
</div>

<div class="alert simple-alert">
  <p>💡<b>Note</b>: Although the sales in each quarter are very similar, the last (4th) quarter is the biggest, which agrees with the sales by month, where there is a jump in December. </p>
</div>

In [None]:
top_stores_2017 = data.loc[data.date.dt.year==2017].groupby("store_nbr").agg({"sales":"sum"})
top_stores_2017 = pd.merge(top_stores_2017, stores, on="store_nbr").drop(["city","state"],axis=1)

figsize = (15,5)
fig, (ax1,ax2) = plt.subplots(1,2,figsize=figsize)
top_stores_2017.groupby(['type']).agg({'sales':'mean'}).plot.pie(y="sales",ax=ax1, legend=False, autopct='%1.f%%',
                             startangle=90, labels=["Type A","Type B","Type C","Type D","Type E"], fontsize="x-large")
ax1.set(title="Average Sales by Type")

top_stores_2017.groupby(['cluster']).agg({'sales':'mean'}).plot.bar(ax=ax2,  fontsize="large")
ax2.set(title="Average Sales by Cluster")
ax2.set(ylabel="Average Sales", xlabel="Cluster")
ax2.get_legend().remove()

plt.show()

<div class="alert simple-alert">
  💡<b>Note</b>: 
    
* Store Type A has the highest sales which is 37%, followed by Types B and D.

* Store Cluster 5 has the highest sales, followed by Clusters 11, 14, and 8. 
</div>

### Scaling of Numerical Features

In [None]:
cols = list(data[['sales', 'onpromotion']].columns)
cols

In [None]:
scaler = MinMaxScaler(feature_range=(0,1))

train[cols] = scaler.fit_transform(train[cols])
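
For reference, min-max scaling maps each value to `(x - min) / (max - min)`. The NumPy sketch below reproduces that formula by hand and fits the minimum and maximum on a training slice only, a common precaution against leaking test information into the scaler. The arrays and the helper names `fit_min_max` and `apply_min_max` are illustrative assumptions.

```python
import numpy as np

def fit_min_max(x):
    """Return the (min, max) pair learned from x."""
    return float(np.min(x)), float(np.max(x))

def apply_min_max(x, lo, hi):
    """Scale x with a previously fitted (lo, hi) pair."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

train_vals = np.array([0.0, 5.0, 10.0, 20.0])
test_vals = np.array([10.0, 30.0])   # 30 lies outside the training range

lo, hi = fit_min_max(train_vals)
scaled_train = apply_min_max(train_vals, lo, hi)
scaled_test = apply_min_max(test_vals, lo, hi)
print(scaled_train, scaled_test)
```

Values outside the fitted range land outside `[0, 1]`, which is expected when the scaler has only seen the training data.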

In [None]:
train_date = data[cols]

In [None]:
train_data = data.groupby(['year']).agg({'sales':'mean', 'onpromotion':'mean'}).reset_index()
train_data.tail()

In [None]:
px.scatter(train_data, x ='sales', y='onpromotion', trendline='ols', color='year' )

In [None]:
np.corrcoef(train_data.sales, train_data.onpromotion)[0][1]

In [None]:
data.head()

In [None]:
sales_per_oil_price = data.groupby(['year']).agg({'sales':'mean', 'dcoilwtico':'mean'}).reset_index()

In [None]:
sales_per_oil_price

In [None]:
px.scatter(sales_per_oil_price, y='dcoilwtico', x='sales', color='year', labels=dict(dcoilwtico='Avg Oil Price', sales='Avg Sales'))

In [None]:
np.corrcoef(sales_per_oil_price.sales, sales_per_oil_price.dcoilwtico)[0][1]

# 2. Modeling 🏋️‍

<img src='../input/photos-/2022-05-28 (2).png'></img>


> Of course, the open-source community offers many choices: scikit-learn, SciPy, TensorFlow, PyTorch, MXNet.

> We should pick popular algorithms that are in demand.

In [None]:
train.head()

In [None]:
train['date'] = pd.to_datetime(train['date'])
train = train.set_index('date')

In [None]:
top_5_categories.family.values

In [None]:
top_5 = train[train['family'].isin(['GROCERY I', 'BEVERAGES', 'PRODUCE', 'CLEANING', 'DAIRY'])][['family','sales']]
top_5

In [None]:
# we will work on the last year (last date = 2017-08-15)
grocery = top_5[top_5['family']=='GROCERY I'].groupby('date')['sales'].mean()#.resample('7D').mean()
grocery = grocery['2016-08-15':]
grocery

In [None]:
grocery_re = grocery.reset_index()
grocery_re

### 2.2 Testing Stationarity 

Our time series data **may or may not have a trend**. It is of the utmost importance to determine how the series behaves before applying any model to it.

> **Augmented Dickey Fuller test**: it determines how strongly a time series is defined by a trend. 

#### Hypothesis:

<div class ="alert simple-alert"><p> 1. Null Hypothesis (H0): the time series has a unit root, meaning it <b>is not stationary</b>.

2. Alternative Hypothesis (H1): the time series <b>is stationary</b>.</p> </div>
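
In practice the decision is usually read off the p-value: a small p-value (0.05 is a common threshold) rejects H0 and suggests the series is stationary. The helper below is a hypothetical sketch of that rule; it takes the statistic, p-value, and critical values as plain numbers, and the figures in the example are made up.

```python
def adf_decision(adf_stat, p_value, critical_values, alpha=0.05):
    """Summarise an ADF result: reject H0 (unit root) when p_value < alpha."""
    # Levels whose critical value the statistic falls below
    # (a more negative statistic is stronger evidence against H0).
    passed = [level for level, value in critical_values.items() if adf_stat < value]
    return {
        "stationary": p_value < alpha,
        "levels_passed": passed,
    }

# Made-up numbers in the shape returned by an ADF test.
example = adf_decision(
    adf_stat=-3.10,
    p_value=0.026,
    critical_values={"1%": -3.45, "5%": -2.87, "10%": -2.57},
)
print(example)
```

Here the statistic clears the 5% and 10% levels but not the 1% level, so the conclusion depends on how strict a threshold is chosen.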

#### Why is Stationarity Important?

<div class = "alert success-alert"><p>A series is stationary when its statistical properties (such as mean and variance) <b>do not change over time</b>. This does not mean that every data point has the same value, but the overall behavior of the data should remain constant.

If the data is non-stationary (for example, it has a trend), we need to <b>remove</b> that trend in order to proceed with the analysis.</p></div> 

In [None]:
sales = pd.Series(grocery_re['sales'].to_numpy())

test_stationarity(sales, log='non-stationary')

<div class="alert danger-alert">
  <p>💡<i>Note</i>: If our data has a trend (is non-stationary), we can reduce it by applying a <b>natural log</b> transform to the series, and difference the result if a trend remains.</p>
</div>
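
As a sketch of removing a trend this way, the snippet below builds a synthetic series with exponential growth, applies a natural log, and then takes a first difference. The log turns multiplicative growth into a straight line, and the difference removes that line. The series is made up; `np.log1p` would be the safer choice when a series contains zeros.

```python
import numpy as np

t = np.arange(100)
series = 50.0 * np.exp(0.02 * t)        # synthetic sales growing about 2% per step

logged = np.log(series)                 # log(50) + 0.02 * t: a straight line
differenced = np.diff(logged)           # constant 0.02: no trend left

print(differenced[:3])
```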

In [None]:
prediction_period = 30 # last 1 month

# Set train data to be everything up to (total data length - prediction_period)
df_train= grocery[:len(grocery)-prediction_period]


# Set Test data to be the last prediction_period
df_test= grocery[len(grocery)-prediction_period:]

In [None]:
df_train.shape, df_test.shape

In [None]:
plt.figure(figsize=(16,6))
plt.plot(df_train)
plt.plot(df_test)
plt.legend(['TRAIN','TEST'])
plt.show()

In [None]:
#def plot_co2(df_train, df_test, y_pred, title):
#     mae = mean_absolute_error(df_test, y_pred)
#     #plt.plot(df_train), legend=True, label="TRAIN", title=f"{title}, MAE: {round(mae,2)}")
#     #plt.plot(df_test), legend=True, label="TEST", figsize=(6, 4))
#     #plt.plot(y_pred), legend=True, label="PREDICTION")
#     plt.figure(figsize=(16,6))
#     plt.title(f"{title}, MAE: {round(mae,2)}")
#     sns.lineplot(x = df_train.reset_index().date[250:], y = df_train.reset_index().sales[250:])
#     sns.lineplot(x = df_test.reset_index().date, y = df_test.reset_index().sales)
#     sns.lineplot(x = df_test.reset_index().date, y = y_pred.values)
#     plt.xticks(rotation=45)
#     plt.legend(['TRAIN','TEST','PREDICTION'])
#     plt.show()

Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.

#### Trend Elements
There are three trend elements that require configuration.

They are the same as the ARIMA model; specifically:

- p: Trend autoregression order.
- d: Trend difference order.
- q: Trend moving average order.

#### Seasonal Elements
There are four seasonal elements that are not part of ARIMA that must be configured; they are:

- P: Seasonal autoregressive order.
- D: Seasonal difference order.
- Q: Seasonal moving average order.
- m: The number of time steps for a single seasonal period.
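
To make `d`, `D`, and `m` concrete, the sketch below applies the two differencing steps by hand to a synthetic daily series with a linear trend and a weekly pattern (`m = 7`). It only illustrates what the difference orders do; it is not a SARIMA fit, and the series is made up.

```python
import numpy as np
import pandas as pd

m = 7                                             # weekly season for daily data
t = np.arange(8 * m)
weekly_pattern = np.array([5.0, 3.0, 2.0, 2.0, 4.0, 9.0, 12.0])
y = pd.Series(0.5 * t + weekly_pattern[t % m])    # linear trend + repeating week

seasonal_diff = y.diff(m).dropna()                # D = 1: y[t] - y[t - m]
both_diff = seasonal_diff.diff().dropna()         # d = 1 applied on top of D = 1

print(seasonal_diff.unique(), both_diff.abs().max())
```

The seasonal difference cancels the weekly pattern and leaves a constant, and the ordinary difference then removes that constant as well.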

In [None]:
#p = d = q = range(0, 4)
# p = range(0,3)
# q = range(0,3)
# d = range(0,2)
# s = range(1,8)

# pdq = list(itertools.product(p, d, q))
# seasonal_pdq = list(itertools.product(p, d, q, s))


# def sarima_optimizer_mae(train, pdq, seasonal_pdq):
#     best_mae, best_order, best_seasonal_order = float("inf"), float("inf"), None
#     for param in pdq:
#         for param_seasonal in seasonal_pdq:
#             try:
#                 model = SARIMAX(train, order=param, seasonal_order=param_seasonal)
#                 sarima_model = model.fit(disp=0)
#                 y_pred_test = sarima_model.get_forecast(steps=30)
#                 y_pred = y_pred_test.predicted_mean
#                 mae = mean_absolute_error(df_test, y_pred)

#                 # mae = fit_model_sarima(train, val, param, param_seasonal)

#                 if mae < best_mae:
#                     best_mae, best_order, best_seasonal_order = mae, param, param_seasonal
#                 print('SARIMA{}x{}12 - MAE:{}'.format(param, param_seasonal, mae))
#             except:
#                 continue
#     print('SARIMA{}x{}12 - MAE:{}'.format(best_order, best_seasonal_order, best_mae))
#     return best_order, best_seasonal_order

# best_order, best_seasonal_order = sarima_optimizer_mae(df_train, pdq, seasonal_pdq)

# model = SARIMAX(df_train, order=best_order, seasonal_order=best_seasonal_order)
# sarima_final_model = model.fit(disp=0)

# y_pred_test = sarima_final_model.get_forecast(steps=30)
# y_pred = y_pred_test.predicted_mean
# y_pred = pd.Series(y_pred.values, index=df_test.index)


# plot_co2(df_train, df_test, y_pred, "SARIMA")

In [None]:
# fig, ax = plt.subplots(2,1)
# fig = sm.graphics.tsa.plot_acf(df_train, lags=30, ax=ax[0])
# fig = sm.graphics.tsa.plot_pacf(df_train, lags=30, ax=ax[1])
# plt.show()

In [None]:
# model = SARIMAX(df_train, order=(2, 1, 1), seasonal_order=(2, 1, 1, 7))
# sarima_final_model = model.fit(max_iter = 50, method = 'powell')

# y_pred_test = sarima_final_model.get_forecast(steps=30)
# y_pred = y_pred_test.predicted_mean
# y_pred = pd.Series(y_pred.values, index=df_test.index)


# plot_co2(df_train, df_test, y_pred, "SARIMA")

In [None]:
# from pmdarima import auto_arima
# model = auto_arima(df_train.values, start_p=1, start_q=1,
#                       test='adf',       # use adftest to find optimal 'd'
#                       max_p=5, max_q=5, # maximum p and q
#                       m=7,              # frequency of series
#                       d=None,           # let model determine 'd'
#                       seasonal=True,   # Seasonality
#                       start_P=0, 
#                       D=1, 
#                       trace=True,
#                       error_action='ignore',  
#                       suppress_warnings=True, 
#                       stepwise=True)

# print(model.summary())

In [None]:
# n_periods = 30
# fc, confint = model.predict(n_periods=n_periods, return_conf_int=True)
# fc_series = pd.Series(fc, index=df_test.index)

# plot_co2(df_train, df_test, fc_series, "AUTO_ARIMA")

### LSTM

We will frame the problem so that a window of the previous several days of data is used to predict the current day's value.

To do that, we will define a function `dataset_generator_lstm()` that splits the input sequence into windows of data for fitting a supervised learning model, such as an LSTM.

For example, if the sequence was:
```
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
```
Then the samples for training the model will look like:

```
Input               | Output
1, 2, 3, 4, 5       | 6
2, 3, 4, 5, 6       | 7
3, 4, 5, 6, 7       | 8
```

#### Define look_back period

A `lookback period` defines how many previous timesteps are used to predict the subsequent timestep.

For example, if I set the lookback period to 5, that means I am using the time steps at t-4, t-3, t-2, t-1, and t to predict the value at time t+1.
```
lookback = 5

X_train, Y_train = dataset_generator_lstm(train, lookback)

X_val, Y_val = dataset_generator_lstm(val, lookback)
```
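
A minimal, runnable sketch of such a windowing function is shown here. It works on a 1-D sequence and is a simplified stand-in, not the notebook's final `dataset_generator_lstm()`; the name `make_windows` is hypothetical.

```python
import numpy as np

def make_windows(sequence, look_back=5):
    """Split a 1-D sequence into (window, next value) pairs."""
    values = np.asarray(sequence, dtype=float)
    X, y = [], []
    for i in range(len(values) - look_back):
        X.append(values[i:i + look_back])   # the look_back inputs
        y.append(values[i + look_back])     # the value right after the window
    return np.array(X), np.array(y)

X, y = make_windows(range(1, 11), look_back=5)
print(X.shape, y.shape)
print(X[0], y[0])
```

An LSTM layer typically expects 3-D input of shape (samples, timesteps, features), so `X` would usually be reshaped with `X.reshape(len(X), 5, 1)` before training.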

In [None]:
# def dataset_generator_lstm(dataset, look_back=5):
#     # A “lookback period” defines the window-size of how many
#     # previous timesteps are used in order to predict
#     # the subsequent timestep. 
#     dataX, dataY = [], []
    
#     for i in range(len(dataset) - look_back):
#         window_size_x = dataset[i:(i + look_back), 0]
#         dataX.append(window_size_x)
#         dataY.append(dataset[i + look_back, 0]) # this is the label or actual y-value
#     return np.array(dataX), np.array(dataY)

# trainX, trainY = dataset_generator_lstm(scaled_train)

# testX, testY = dataset_generator_lstm(scaled_test)

# print("trainX: ", trainX.shape)
# print("trainY: ", trainY.shape)
# print("testX: ", testX.shape)
# print("testY", testY.shape)

In [None]:
# trainX