# Context

Forecasts are not just for meteorologists. Governments forecast economic growth. Scientists attempt to predict the future population. And businesses forecast product demand, a common task for professional data scientists. Forecasts are especially important for grocery stores, which must decide carefully how much inventory to buy. Predict a little too high, and grocers are left with an excess of perishable stock. Guess a little too low, and the most popular items sell out quickly, which means lost revenue and unhappy customers. More accurate forecasts, thanks to machine learning, could help retailers please customers by having just the right amount of product at the right time.

Current subjective forecasting methods for retail have little data to back them up and are unlikely to be automated. The problem becomes even more complex as retailers add new locations with unique needs, new products, constantly changing seasonal tastes, and unpredictable product marketing.

# Objective

In this "getting started" competition, you will use time-series forecasting to forecast store sales with data from Corporación Favorita, a large grocery retailer based in Ecuador.

Specifically, you will build a model that more accurately predicts the unit sales of thousands of items sold at different Favorita stores. You will practice your machine learning skills with an approachable training dataset containing dates, store and item information, promotions, and unit sales.

# Code

In [908]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [909]:
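# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Needed later by the modelling cells
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess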

#pandas dataframe config
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.4f}'.format

In [910]:
DATA_PATH = '../input/store-sales-time-series-forecasting/'

First, we read the data provided for the challenge.

In [911]:
df_train = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'))
df_test = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'))
df_stores = pd.read_csv(os.path.join(DATA_PATH, 'stores.csv'))
# Sort the transactions by store number and date so the data can be viewed chronologically
df_transactions = pd.read_csv(os.path.join(DATA_PATH, 'transactions.csv')).sort_values(['store_nbr', 'date'])
df_oil = pd.read_csv(os.path.join(DATA_PATH, 'oil.csv'))
df_holidays_events = pd.read_csv(os.path.join(DATA_PATH, 'holidays_events.csv'))

# Read the data
df_store_sales = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'),
                             usecols=['store_nbr', 'family', 'date', 'sales', 'onpromotion'],
                             dtype = {
                                    'store_nbr': 'category',
                                    'family': 'category',
                                    'sales': 'float32',
                                    'onpromotion': 'uint32',
                                },
                                parse_dates=['date'],
                                infer_datetime_format=True)

df_store_sales['date'] = df_store_sales.date.dt.to_period('D')
df_store_sales = df_store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()


df_holiday_events = pd.read_csv(os.path.join(DATA_PATH, 'holidays_events.csv'),
                                dtype={
                                    'type': 'category',
                                    'locale': 'category',
                                    'locale_name': 'category',
                                    'description': 'category',
                                    'transferred': 'bool',
                                },
                                parse_dates=['date'],
                                infer_datetime_format=True)
df_holiday_events = df_holiday_events.set_index('date').to_period('D')

df_test = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'),
                    dtype={
                        'store_nbr': 'category',
                        'family': 'category',
                        'onpromotion': 'uint32',
                    },
                    parse_dates=['date'],
                    infer_datetime_format=True )
df_test['date'] = df_test.date.dt.to_period('D')
df_test = df_test.set_index(['store_nbr', 'family', 'date']).sort_index()

df_stores = pd.read_csv(os.path.join(DATA_PATH, 'stores.csv'))
df_transactions = pd.read_csv(os.path.join(DATA_PATH, 'transactions.csv')).sort_values(['store_nbr', 'date'])
df_oil = pd.read_csv(os.path.join(DATA_PATH, 'oil.csv'))
df_submission_sample = pd.read_csv(os.path.join(DATA_PATH, 'sample_submission.csv'))
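The loading pattern above (categorical dtypes, parsed dates, a daily `Period` index) can be tried on a small scale without the competition files. The sketch below uses a hypothetical two-row CSV held in memory; the column names mirror `train.csv`, but the values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for train.csv (values are made up).
csv_text = (
    "date,store_nbr,family,sales\n"
    "2017-01-01,1,BREAD,3.0\n"
    "2017-01-02,1,BREAD,4.0\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"store_nbr": "category", "family": "category", "sales": "float32"},
    parse_dates=["date"],
)

# Convert timestamps to daily periods and index by store, family and date.
toy["date"] = toy["date"].dt.to_period("D")
toy = toy.set_index(["store_nbr", "family", "date"]).sort_index()
print(toy)
```

Using `float32` and categorical columns mainly reduces memory use, which matters for the full training file.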

In [912]:
df_transactions

The dates are converted to the datetime dtype.

In [913]:
df_train['date'] = pd.to_datetime(df_train['date'])
#df_test['date'] = pd.to_datetime(df_test['date'])
df_transactions['date'] = pd.to_datetime(df_transactions['date'])

Next, we check whether the values in train are related to the other tables.

**Transactions**

In [914]:
df_train_temp = df_train.groupby(['date', 'store_nbr']).sales.mean().reset_index()

In [915]:
df_aux_merge = pd.merge(df_train_temp, df_transactions, how = 'left')

In [916]:
sns.heatmap(data=df_aux_merge.corr(), vmin=-1, vmax=1, cmap = 'RdBu', annot=True, square = True)

**Oil**

In [917]:
# df_oil['date'] is not the index yet, so compare against the parsed date column
pd.date_range(start='2013-01-01', end='2017-08-15').difference(pd.DatetimeIndex(df_oil['date']))

In [918]:
df_oil['date'] = pd.to_datetime(df_oil['date'])
df_oil = df_oil.set_index('date')

In [919]:
df_oil = df_oil.resample('1D').mean()
df_oil.reset_index()

In [920]:
pd.date_range(start = '2013-01-01', end = '2017-08-15' ).difference(df_oil.index)

In [921]:
df_oil['dcoilwtico'] = np.where(df_oil['dcoilwtico']==0, np.nan, df_oil['dcoilwtico'])
df_oil['interpolated_price'] = df_oil.dcoilwtico.interpolate()
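To see what the resample-then-interpolate step does, here is a minimal sketch on a hypothetical four-row price table (the dates and prices are invented). Resampling to a daily calendar exposes the missing days as `NaN`, and linear interpolation then fills them from the neighbouring quotes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices: quoted on weekdays only, with one missing quote.
toy_oil = pd.DataFrame(
    {"price": [50.0, np.nan, 53.0, 54.0]},
    index=pd.to_datetime(["2013-01-03", "2013-01-04", "2013-01-07", "2013-01-08"]),
)

# Resample to a daily calendar: the weekend days appear as NaN rows.
toy_daily = toy_oil.resample("1D").mean()

# Linear interpolation fills every interior gap from its neighbours.
toy_daily["interpolated_price"] = toy_daily["price"].interpolate()
print(toy_daily)
```

Note that `interpolate()` does not fill a gap at the very start of a series, because there is no earlier value to interpolate from.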

In [922]:
df_oil = df_oil.drop('dcoilwtico',axis=1)

In [923]:
df_oil['price_chg'] = df_oil.interpolated_price - df_oil.interpolated_price.shift(1)
# Percentage change relative to the previous day's price
df_oil['pct_chg'] = df_oil['price_chg'] / df_oil.interpolated_price.shift(1)
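A percentage change is conventionally measured against the previous observation. The sketch below, on a hypothetical three-value series, computes it by hand and compares the result with pandas' built-in `pct_change`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for three consecutive days.
prices = pd.Series([100.0, 110.0, 99.0])

# Absolute change versus the previous day, then relative to that previous price.
price_chg = prices - prices.shift(1)
pct_manual = price_chg / prices.shift(1)

# pandas offers the same calculation directly.
pct_builtin = prices.pct_change()

print(pd.DataFrame({"manual": pct_manual, "builtin": pct_builtin}))
```

The first row is `NaN` in both columns because there is no previous day to compare against.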

In [924]:
# Make sure the datetime format is correct on this dataframe
df_train['date'] = pd.to_datetime(df_train['date'])
# Group by date and average the sales for each day
df_dates = df_train.groupby(df_train.date)['sales'].mean().reset_index()

In [925]:
df_dates_t = df_transactions.groupby(df_transactions.date)['transactions'].mean().reset_index()

In [926]:
daily_total_sales = df_dates.copy()

In [927]:
daily_total_sales = daily_total_sales.set_index(pd.to_datetime(daily_total_sales['date']))


In [928]:
daily_total_sales = daily_total_sales.resample('1D').mean()

In [929]:
df_oil.interpolated_price.loc['2013-01-01':'2017-08-15']

In [930]:
# Pass the sales column explicitly so x and y are both one-dimensional
plt.scatter(daily_total_sales['sales'], df_oil.interpolated_price.loc['2013-01-01':'2017-08-15'], alpha=0.2)
plt.ylabel('oil price')
plt.xlabel('daily average sales')
plt.show()

Even where the oil price stays the same, the sales vary widely, which suggests that there is no clear relationship between the two.
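A scatter plot can be complemented with a single number. As a minimal sketch, the Pearson correlation between two date-aligned series can be computed with `Series.corr`; the two series below are hypothetical stand-ins for daily sales and the oil price.

```python
import pandas as pd

idx = pd.date_range("2017-01-01", periods=5, freq="D")

# Hypothetical daily values, aligned on the same dates.
toy_sales = pd.Series([10.0, 12.0, 9.0, 11.0, 13.0], index=idx)
toy_oil = pd.Series([50.0, 50.0, 51.0, 49.0, 50.0], index=idx)

# Pearson correlation; values near 0 indicate little linear relationship.
r = toy_sales.corr(toy_oil)
print(round(r, 4))
```

`Series.corr` aligns both series on their index first, so the dates must match for the pairs to be meaningful.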

**Holidays**

In [931]:
average_sales = df_train.groupby(df_train.date)['sales'].mean().reset_index()

In [932]:
df_he = df_holidays_events.drop(["locale_name", "description", "type"], axis=1)

In [933]:
df_he["locale"].value_counts()

In [934]:
df_he["transferred"].value_counts()

We will drop the regional holidays because there are very few of them, and also the transferred ones, because those were not actually days off: the holiday was moved to another date.

In [935]:
df_di=df_he.loc[(df_he["locale"] == 'Regional')].index
df_he=df_he.drop(df_di)

In [936]:
df_di=df_he.loc[(df_he["transferred"] == True)].index
df_he=df_he.drop(df_di)

In [937]:
df_he = df_he.drop(["transferred"], axis=1)

In [938]:
df_he = df_he.replace(to_replace="Local",
           value=1)
df_he = df_he.replace(to_replace="National",
           value=1)

In [939]:
df_he.columns = df_he.columns.str.replace('locale', 'IsHoliday')
df_he

In [940]:
df_he["date"] = pd.to_datetime(df_he['date'])

In [941]:
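# Flag holidays by matching on date; assigning df_he['IsHoliday'] directly would align on the row index instead
average_sales['IsHoliday'] = average_sales['date'].isin(df_he['date']).astype(int)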

In [942]:
average_sales['IsHoliday'] = average_sales['IsHoliday'].fillna(0)
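When a column from one dataframe is assigned to another, pandas aligns the values on the row index, not on the date. The sketch below, with hypothetical toy frames, contrasts that index alignment with an explicit match on the `date` column.

```python
import pandas as pd

# Hypothetical daily sales and a holiday calendar (values are made up).
toy_sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "sales": [5.0, 20.0, 22.0],
})
holiday_flags = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-05-01"]),
    "IsHoliday": [1, 1],
})

# Index alignment: rows 0 and 1 get a flag regardless of their dates.
toy_sales["by_index"] = holiday_flags["IsHoliday"]
toy_sales["by_index"] = toy_sales["by_index"].fillna(0).astype(int)

# Date matching: only dates present in the holiday calendar are flagged.
toy_sales["by_date"] = toy_sales["date"].isin(holiday_flags["date"]).astype(int)
print(toy_sales)
```

In this toy case, January 2 is flagged under index alignment even though it is not in the holiday calendar.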

In [943]:
average_sales

In [944]:
average_sales.corr()

There appears to be a fairly strong relationship, so this feature will also be included in the final dataframe.

# Building the dataframe

In [945]:
final_df_train = df_train.copy()

Before starting the merge, we remove the atypical days, such as January 1 and the month following the earthquake.

In [946]:
df_dates.plot(kind = 'scatter', x = 'date', y = 'sales')

In [947]:
df_fd=final_df_train.loc[(final_df_train["date"].dt.day == 1) & (final_df_train["date"].dt.month == 1)].index
final_df_train=final_df_train.drop(df_fd)

In [948]:
df_fd=final_df_train.loc[(final_df_train["date"].dt.month == 4) & (final_df_train["date"].dt.day >= 16) & (final_df_train["date"].dt.day <= 31) & (final_df_train["date"].dt.year == 2016)].index
final_df_train=final_df_train.drop(df_fd)

df_fd=final_df_train.loc[(final_df_train["date"].dt.month == 5) & (final_df_train["date"].dt.day >= 1) & (final_df_train["date"].dt.day <= 16) & (final_df_train["date"].dt.year == 2016)].index
final_df_train=final_df_train.drop(df_fd)
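A date window can also be removed with a single boolean mask. This is a minimal sketch on a hypothetical toy frame; the window bounds match the April 16 to May 16, 2016 range used above.

```python
import pandas as pd

# Hypothetical daily rows spanning a few days either side of the window.
toy = pd.DataFrame({"date": pd.date_range("2016-04-14", "2016-05-18", freq="D")})
toy["sales"] = 1.0

# True for every row whose date falls inside the window (both ends included).
in_window = toy["date"].between(pd.Timestamp("2016-04-16"), pd.Timestamp("2016-05-16"))

# Keep only the rows outside the window.
toy_clean = toy.loc[~in_window]
print(toy_clean)
```

A single `between` call avoids splitting the window into per-month conditions.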

In [949]:
df_dates = final_df_train.groupby(final_df_train.date)['sales'].mean().reset_index()


In [950]:
df_dates.plot(kind = 'scatter', x = 'date', y = 'sales')

First, we merge the train dataframe with the transactions dataframe, so that each row is assigned the transactions its store recorded on that day.

In [951]:
final_df_train = pd.merge(final_df_train, df_transactions, on=['date','store_nbr'])
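`pd.merge` defaults to an inner join, so train rows with no matching transactions record are dropped silently. The sketch below, on hypothetical toy frames, uses `how="left"` with `indicator=True` to list the rows that would otherwise be lost, and `validate` to assert that each (date, store) pair appears at most once in the transactions table.

```python
import pandas as pd

# Hypothetical toy tables (values are made up).
toy_train = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02"]),
    "store_nbr": [1, 2, 1],
    "sales": [3.0, 4.0, 5.0],
})
toy_tx = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "store_nbr": [1, 1],
    "transactions": [100, 120],
})

# Left join keeps every train row; _merge records where each row came from.
merged = pd.merge(
    toy_train, toy_tx,
    on=["date", "store_nbr"],
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]

# For comparison: the default inner join drops the unmatched row.
inner = pd.merge(toy_train, toy_tx, on=["date", "store_nbr"])
print(len(merged), len(inner), len(unmatched))
```

Whether dropping unmatched rows is acceptable depends on the analysis; the indicator column simply makes the choice visible.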

Now we add to the resulting dataset a flag indicating whether each day is a holiday.

In [952]:
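# Flag holidays by matching on date; assigning df_he['IsHoliday'] directly would align on the row index instead
final_df_train['IsHoliday'] = final_df_train['date'].isin(df_he['date']).astype(int)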

In [953]:
final_df_train['IsHoliday'] = final_df_train['IsHoliday'].fillna(0)

Since there is an upward trend from year to year, we add a new year column.

In [954]:
#final_df_train['year'] = (final_df_train["date"].dt.year).astype('int')
final_df_train['year'] = ((final_df_train["date"].dt.year).astype('int')) - 2000

We also add dummy variables for each store, so that the store each row belongs to is taken into account.

In [955]:
X_store = pd.get_dummies(final_df_train['store_nbr'])

In [956]:
# X_store shares final_df_train's row index, so join on the index rather than on the store_nbr values
final_df_train = final_df_train.join(X_store).fillna(0.)
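As a small illustration of one-hot encoding, the sketch below builds store dummies for a hypothetical three-row frame. The `prefix` argument is an addition for readability (it is not used in the notebook) and avoids bare integer column names.

```python
import pandas as pd

# Hypothetical toy frame with two stores.
toy = pd.DataFrame({"store_nbr": [1, 2, 1], "sales": [3.0, 4.0, 5.0]})

# One indicator column per store; the result keeps toy's row index.
dummies = pd.get_dummies(toy["store_nbr"], prefix="store", dtype=int)

# Joining on the shared row index attaches each row's own indicators.
toy_encoded = toy.join(dummies)
print(toy_encoded)
```

Because the dummies are built from the same rows, an index join lines each indicator up with the row it was derived from.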

In [957]:
final_df_train['date'] = final_df_train.date.dt.to_period('D')

In [958]:
final_df_train = final_df_train.set_index(['store_nbr', 'family', 'date']).sort_index()

In [959]:
final_df_train

We now have the final dataframe with which the prediction model will be built.

# Building the model

In [960]:
average_sales = df_store_sales.groupby('date').mean().sales

In [961]:
average_sales.plot(style='.', figsize=(20,10));

In [962]:
average_sales_2017 = (
    df_store_sales
    .groupby('date')
    .mean().sales.squeeze().loc['2017'])

In [963]:
average_sales_2017.plot(style='.', figsize=(20,10));

In [964]:
y = average_sales_2017.copy()

In [965]:
fourier_terms = CalendarFourier(freq='M', order = 4)

In [966]:
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order = 1,
    seasonal = True,
    additional_terms=[fourier_terms],
    drop = True,
)

X = dp.in_sample()

In [967]:
X.head()
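The Fourier columns in `X` are sine and cosine pairs evaluated at the fraction of the month that has elapsed. The hand-rolled NumPy sketch below illustrates that idea; the helper `monthly_fourier` is hypothetical and is not the statsmodels implementation, whose exact column names and scaling may differ.

```python
import numpy as np
import pandas as pd


def monthly_fourier(index: pd.DatetimeIndex, order: int) -> pd.DataFrame:
    """Hypothetical helper: sin/cos pairs for a monthly seasonal cycle."""
    # Fraction of the month elapsed at the start of each day, in [0, 1).
    frac = ((index.day - 1) / index.days_in_month).to_numpy()
    feats = {}
    for k in range(1, order + 1):
        feats[f"sin_{k}"] = np.sin(2 * np.pi * k * frac)
        feats[f"cos_{k}"] = np.cos(2 * np.pi * k * frac)
    return pd.DataFrame(feats, index=index)


idx = pd.date_range("2017-01-01", periods=59, freq="D")
X_toy = monthly_fourier(idx, order=2)
print(X_toy.head())
```

Higher orders add faster oscillations within the month, which lets a linear model fit sharper within-month patterns at the cost of more parameters.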

In [968]:
model_1 = LinearRegression().fit(X, y)
y1_pred = pd.Series(model_1.predict(X), index=X.index, 
                    name='4 Fourier Terms')

In [969]:
ax=y.plot(style='.', figsize=(20,10), title="Avg. Sales 2017")
ax=y1_pred.plot(ax=ax, label='4 Fourier terms')
ax.legend();

In [970]:
X_forecast = dp.out_of_sample(steps=30)
y_forecast = pd.Series(model_1.predict(X_forecast),
                      index=X_forecast.index)
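Forecasting with deterministic features amounts to evaluating the fitted model on feature values beyond the training range. As a minimal sketch of that idea with a plain linear trend only (no seasonality, and NumPy's `polyfit` in place of `LinearRegression`), on hypothetical data:

```python
import numpy as np

# Hypothetical series with a perfectly linear trend.
y_toy = np.array([10.0, 12.0, 14.0, 16.0])

# In-sample time index 0..3; fit a degree-1 polynomial (slope and intercept).
t_in = np.arange(len(y_toy))
slope, intercept = np.polyfit(t_in, y_toy, deg=1)

# Out-of-sample time index continues where the training index stopped.
t_out = np.arange(len(y_toy), len(y_toy) + 2)
forecast = intercept + slope * t_out
print(forecast)
```

The important detail is that the out-of-sample time index continues from the last training step rather than restarting at zero.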

In [971]:
ax=average_sales_2017.plot(style='.', figsize=(20,10))
ax=y1_pred.plot(ax=ax, linewidth=3, label='trend', color='C0')
ax=y_forecast.plot(ax=ax, linewidth=3, label='prediction', 
                   color='C3')
ax.legend();

In [972]:
df_holiday_events.head()

In [973]:
holidays = (df_holiday_events.query("locale in ['National', 'Regional']")
    .loc['2017':'2017-08-15', ['description']]
    .assign(description=lambda x: x.description.cat.remove_unused_categories()))

In [974]:
holidays

In [975]:
y_d = y - y1_pred

In [976]:
ax=y_d.plot(style='.', figsize=(20,10))
plt.plot_date(holidays.index, y_d[holidays.index], color='C3')
ax.set_ylim(bottom=-200);

In [977]:
X_holiday = pd.get_dummies(holidays)

In [978]:
X_holiday.head()

In [979]:
X2 = X.join(X_holiday, on='date').fillna(0.)

In [980]:
X2.head()

In [981]:
model_2 = LinearRegression().fit(X2, y)
y2_pred = pd.Series(model_2.predict(X2), index=X2.index,
                   name='Fourier + holidays')

In [982]:
ax1 = y.plot(style='.', figsize=(20,10))
ax1 = y2_pred.plot(ax=ax1, label='Fourier + Holiday')
ax1.set_ylim(bottom=200)
ax1.legend();

In [983]:
holidays2 = (df_holiday_events.query("locale in ['National', 'Regional']")
            .loc['2017-08-15':, ['description']]
            .assign(description=lambda x: x.description.cat.remove_unused_categories()))

In [984]:
holidays2

In [985]:
df_store_sales.head()

In [986]:
y=df_store_sales.unstack(['store_nbr','family']).loc['2017']

In [987]:
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)

In [988]:
X = dp.in_sample()
X['NewYear'] = (X.index.dayofyear == 1)

In [989]:
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

In [990]:
y_pred = pd.DataFrame(model.predict(X), index=X.index, 
                      columns=y.columns)

In [991]:
X_test = dp.out_of_sample(steps=16)

In [992]:
X_test.index.name='date'
X_test['NewYear']=(X_test.index.dayofyear == 1)

In [993]:
y_submit = model.predict(X_test)

In [994]:
y_submit = pd.DataFrame(y_submit, index=X_test.index,
                       columns=y.columns)

In [995]:
y_submit = y_submit.stack(['store_nbr', 'family'])
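The `stack` call converts the wide prediction table (one column per store and family) back into a long table with one row per date, store and family. A minimal sketch on a hypothetical 2x2 table, assuming the same column-level names as above:

```python
import pandas as pd

# Hypothetical wide predictions: columns are (variable, store_nbr, family).
cols = pd.MultiIndex.from_product(
    [["sales"], ["1", "2"], ["BREAD"]],
    names=[None, "store_nbr", "family"],
)
idx = pd.period_range("2017-08-16", periods=2, freq="D", name="date")
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=idx, columns=cols)

# Move the store and family levels from the columns into the row index.
long_df = wide.stack(["store_nbr", "family"])
print(long_df)
```

After stacking, only the variable level remains in the columns, so the `sales` predictions sit in a single column keyed by date, store and family.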

In [996]:
y_submit = y_submit.join(df_test.id).reindex(columns=['id', 'sales'])

In [997]:
y_submit.to_csv('submission.csv', index=False)