# Store Sales 
#### Use time series forecasting and machine learning to predict grocery sales.

Predict sales for the thousands of product families sold at Favorita stores located in Ecuador. The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers. 

The original dataset can be found on Kaggle:
https://www.kaggle.com/c/store-sales-time-series-forecasting/code?competitionId=29781&sortBy=dateRun&tab=profile

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from statistics import mean
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.datasets import make_classification
from sklearn import ensemble
import sklearn.metrics as metrics

oil=r'/kaggle/input/store-sales-time-series-forecasting/oil.csv'
sample=r'/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv'
event=r'/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv'
stores=r'/kaggle/input/store-sales-time-series-forecasting/stores.csv'
train=r'/kaggle/input/store-sales-time-series-forecasting/train.csv'
test=r'/kaggle/input/store-sales-time-series-forecasting/test.csv'
transaction=r'/kaggle/input/store-sales-time-series-forecasting/transactions.csv'

## Train Dataset 
The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers.

In [None]:
trainDF=pd.read_csv(train)
trainDF.head()
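As a small sketch of the layout described above, the cell below parses two made-up rows instead of the real file. The column names (`id`, `date`, `store_nbr`, `family`, `sales`, `onpromotion`) are assumed from the competition's data description, the values are invented, and `sample_csv` and `sample_df` are names used only for this illustration. It shows how `parse_dates` can turn the date column into a datetime when the file is read.

```python
import io
import pandas as pd

# Two invented rows in the assumed column layout of train.csv.
sample_csv = io.StringIO(
    "id,date,store_nbr,family,sales,onpromotion\n"
    "0,2013-01-01,1,AUTOMOTIVE,0.0,0\n"
    "1,2013-01-01,1,BABY CARE,0.0,0\n"
)

# parse_dates converts the date strings to datetime64 at load time;
# storing the product family as a category saves memory on the full file.
sample_df = pd.read_csv(sample_csv, parse_dates=['date'], dtype={'family': 'category'})
print(sample_df.dtypes)
```

Loading the real file the same way would make the later `.dt` calls work without a separate conversion step.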

## Transactions 
The original dataset is 83488 rows × 3 columns.

In [None]:
tDF=pd.read_csv(transaction)
tDF.head()
tDF = tDF.dropna(axis=0, how='any')  # assign the result back; this dataset has no missing data

# Engineer three new columns (year, month, day) from the 'YYYY-MM-DD' date string:
tDF['year']=tDF['date'].apply(lambda x: int(str(x)[:4])) # first four characters
tDF['month']=tDF['date'].apply(lambda x: int(str(x)[5:7])) # characters at index 5 and 6
tDF['day']=tDF['date'].apply(lambda x: int(str(x)[-2:])) # last two characters

print(tDF['day'].unique()) 
#[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
print(tDF['month'].unique()) # [ 1  2  3  4  5  6  7  8  9 10 11 12]
print(tDF['year'].unique()) # [2013 2014 2015 2016 2017]

The date column is currently an object column and needs to be changed to a datetime datatype. This will help later when we use the create_time_features function. Dates are initially in the 2013-01-02 format, so we will keep that consistent. Missing values must be dropped one last time in case the conversion coerced any unparseable dates to NaT. Finally, each date must be a datetime value that the .dt accessor recognizes for the create_time_features function to work. Here is a good SO answer on how to solve this issue:
https://stackoverflow.com/questions/56698521/can-only-use-dt-accessor-with-datetimelike-values/56698574
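To make the steps above concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are invented for this example and are not taken from transactions.csv. It shows how `errors='coerce'` turns an unparseable date into `NaT`, how `dropna` removes that row, and that the `.dt` accessor works afterwards.

```python
import pandas as pd

# Toy data (hypothetical values): the second date is deliberately malformed.
toy = pd.DataFrame({
    'date': ['2013-01-02', 'not-a-date', '2013-01-04'],
    'transactions': [770, 2111, 1833],
})

# Unparseable strings become NaT instead of raising an error.
toy['date'] = pd.to_datetime(toy['date'], errors='coerce', format='%Y-%m-%d')
n_bad = int(toy['date'].isna().sum())

# dropna returns a new frame, so the result has to be assigned back.
toy = toy.dropna(subset=['date'])

# The column is now datetime64, so the .dt accessor is available.
toy = toy.assign(year=toy['date'].dt.year)
print(n_bad, toy['date'].dtype, toy['year'].tolist())
```

Counting the `NaT` values before dropping them is a cheap way to see whether the coercion silently discarded anything.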

In [None]:
tDF['date'] = pd.to_datetime(tDF['date'], errors='coerce', format = '%Y-%m-%d')
print(tDF)

tDF = tDF.dropna(axis=0, how='any')  # errors='coerce' above could have produced NaT values

years = tDF.groupby(tDF['date'].dt.year)['transactions'].sum()  # total transactions per year


tDF['year'] = tDF['date'].dt.year
tDF['dayofyear'] = tDF['date'].dt.dayofyear
tDF['dayofweek'] = tDF['date'].dt.dayofweek
year=tDF.groupby('year')['transactions'].sum()



## Multivariate time series forecasting
Here is a general-purpose function covering the columns you can derive from a datetime. In this sales case we only need the month, year, and day of year, since no time of day was provided.

In [None]:
import datetime as dt
from datetime import datetime

# ADD time features to our model:
def create_time_features(df, target=None):
    """
    Creates time series features from a DataFrame with a datetime index.
    Returns X, or (X, y) when a target column name is given.
    """
    df = df.copy()  # avoid modifying the caller's DataFrame
    df['date'] = df.index  # the index must be a DatetimeIndex for .dt to work
    #df['hour'] = df['date'].dt.hour
    #df['dayofweek'] = df['date'].dt.dayofweek
    #df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    #df['sin_day'] = np.sin(2 * np.pi * df['dayofyear'] / 365.25)
    #df['cos_day'] = np.cos(2 * np.pi * df['dayofyear'] / 365.25)
    #df['dayofmonth'] = df['date'].dt.day
    #df['weekofyear'] = df['date'].dt.isocalendar().week  # .dt.weekofyear is deprecated
    X = df.drop(['date'], axis=1)
    if target is not None:
        y = df[target]
        X = X.drop([target], axis=1)
        return X, y

    return X
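The commented-out `sin_day` and `cos_day` lines in the function hint at a cyclical encoding of the day of year. One common way to do this, sketched below under the assumption of a 365.25-day period, is to scale the day number to an angle before taking the sine and cosine, so that 31 December and 1 January end up close together. The helper name `add_cyclical_day_features` and the `demo` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def add_cyclical_day_features(df, date_col='date', period=365.25):
    """Return a copy of df with sin/cos encodings of the day of year."""
    out = df.copy()
    # Map the day number onto a full circle before applying sin/cos.
    angle = 2 * np.pi * out[date_col].dt.dayofyear / period
    out['sin_day'] = np.sin(angle)
    out['cos_day'] = np.cos(angle)
    return out

demo = pd.DataFrame({'date': pd.to_datetime(['2016-12-31', '2017-01-01', '2017-07-01'])})
demo = add_cyclical_day_features(demo)

# Distance between encoded points: New Year's Eve vs New Year's Day
# compared with New Year's Day vs mid-year.
pts = demo[['sin_day', 'cos_day']].to_numpy()
near = np.linalg.norm(pts[0] - pts[1])
far = np.linalg.norm(pts[1] - pts[2])
print(near < far)
```

With the raw `dayofyear` value, those first two dates would sit at opposite ends of the feature range, which is the problem this encoding is meant to avoid.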

In [None]:
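from sklearn.preprocessing import StandardScaler

# create_time_features expects a datetime index, so index by date and sort chronologically
ts = tDF.set_index('date').sort_index()

# Hold out the final year as the test set (an illustrative split; adjust as needed)
X_train_df, y_train = create_time_features(ts[ts.index.year < 2017], target='transactions')
X_test_df, y_test = create_time_features(ts[ts.index.year >= 2017], target='transactions')

scaler = StandardScaler()
scaler.fit(X_train_df)  # No cheating: fit the scaler on the training set only, never on training+test!
X_train = scaler.transform(X_train_df)
X_test = scaler.transform(X_test_df)

X_train_df = pd.DataFrame(X_train, columns=X_train_df.columns, index=X_train_df.index)
X_test_df = pd.DataFrame(X_test, columns=X_test_df.columns, index=X_test_df.index)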
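The "no cheating" rule in the cell above can be illustrated without any real data. The sketch below uses a synthetic 100-day series (all numbers are made up, and the 80/20 cutoff is an arbitrary choice for the example), splits it chronologically, and computes the scaling statistics by hand from the training rows only. The mean and population standard deviation used here are the same statistics that `StandardScaler` stores when it is fitted.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (made-up numbers) standing in for the real features.
rng = np.random.default_rng(0)
dates = pd.date_range('2013-01-01', periods=100, freq='D')
frame = pd.DataFrame({
    'dayofyear': np.asarray(dates.dayofyear),
    'transactions': rng.integers(500, 3000, size=len(dates)),
}, index=dates)

# Chronological split: the last 20 days are held out, with no shuffling.
cutoff = dates[-20]
train_part = frame.loc[frame.index < cutoff]
test_part = frame.loc[frame.index >= cutoff]

# Scaling statistics come from the training rows only.
mu = train_part['dayofyear'].mean()
sigma = train_part['dayofyear'].std(ddof=0)  # population std, as StandardScaler uses

train_scaled = (train_part['dayofyear'] - mu) / sigma
test_scaled = (test_part['dayofyear'] - mu) / sigma

print(len(train_part), len(test_part))
print(train_part.index.max() < test_part.index.min())
```

Because the test rows are scaled with the training statistics, their scaled mean is not zero. That is expected and shows that no information from the held-out period leaked into the scaler.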

## Stores Dataset

In [None]:
storesDF=pd.read_csv(stores)  # keep the path variable intact so the cell can be re-run
storesDF.head()

## Events

In [None]:
eventDF=pd.read_csv(event)  # keep the path variable intact so the cell can be re-run
eventDF.head()

In [None]:
# Assumes the intent is to total the transactions frame (tDF) by calendar month, keyed as 'YYYY-MM'
tDF['YearMonth'] = pd.to_datetime(tDF['date']).dt.strftime('%Y-%m')
res = tDF.groupby('YearMonth')['transactions'].sum()

## References
Most of these resources are on Time Series:
1. https://www.kaggle.com/ekrembayar/store-sales-ts-forecasting-a-comprehensive-guide
2. https://www.kaggle.com/ilyakondrusevich/54-stores-54-models
3. https://github.com/jiwidi/time-series-forecasting-with-python/blob/master/02-Forecasting_models.ipynb