<a href="https://colab.research.google.com/github/Elsiekiprop/Sales-Prediction--Time-Series-Models/blob/gikonyo/Sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Store Sales - Time Series Forecasting.
## Overview
This project aims to predict sales of various products at Favorita stores in Ecuador.
## Business Understanding

## Data Understanding.

Seven data sets will be used in this project. The data sets are as follows:

**Train.csv**
This data set includes data on:
- date: day the sale occured
- id: the sales id
- store_nbr: the store at which the sale occured
- sales: total sales for a given product family at a given store at a given date.
- onpromotion: total  number of items promoted at a given store at a given date.

The  **test.csv** file contains data similar to the training data. The data contains sales information collected 15 days after the train data.


**Oil.csv**

This file contains details on oil prices since Ecuador's economy heavily depends on Oil


**Stores.csv**

This file includes information on store location:
- city: the city a state is located
- state: the state a city is located
- cluster: a group of similar stores
- type: the type of store




# Importing necessary libraries.



In [16]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

#imports for data analysis
# import plotly.expess as px

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Common Functions

In [21]:
def DropColumns(col_names, df):
    # input: list of column names and a dataframe
    # output: same dataframe with columns dropped
    df = df.drop(col_names, axis=1)
    return df

def iqr_outliers(df,ft):
  """
  input: dataframe and features
  description: will remove outliers based on interquartile range
  output: is a list of rows to be removed
  """
  q1=df[ft].quantile(0.05)
  q2=df[ft].quantile(0.95)
  iqr = q2-q1
  lower = q1 - 1.5 * iqr
  upper = q2 + 1.5 * iqr
  ls = df.index[ (df[ft]<lower) | (df[ft]>upper) ]
  return ls

def remove_(df,ls):
  """
  input:dataframe, list of rows
  description: will remove the rows
  output: dataframe
  """
  ls = sorted(set(ls))
  df = df.drop(ls)
  return df

def fill_null(df, value = None):
  """
    input: dataframe
    description: the fxn will fill missing integer values with 0 and missing categorical values with the mode in their respective columns
    output: dataframe with filled values
  """
  if value != None :
    for col in df:
      if df[col].dtype in ("int64", "float64"):
          df[col] = df[col].fillna(value)
    print("Finished removing null values")
    return df

  for col in df:
    if df[col].dtype in ("int64", "float64"):
      df[col] = df[col].fillna(df[col]).mean()
    elif df[col].dtype == "object":
      df[col] = df[col].fillna(df[col].mode()[0])
  print("Finished removing null values")
  return df

# Data Access and Collection

In [18]:
path_var = '/content/drive/MyDrive/projects/Store-Sales-Forecast/store-sales-time-series-forecasting'

In [19]:
#Reading data
train_df=pd.read_csv(path_var + "/train.csv")
test_df=pd.read_csv(path_var + "/test.csv")
oil_df=pd.read_csv(path_var + "/oil.csv")
sample_submission_df=pd.read_csv(path_var + "/sample_submission.csv")
stores_df=pd.read_csv(path_var + "/stores.csv")
transactions_df=pd.read_csv(path_var + "/transactions.csv")
holidays_events_df=pd.read_csv(path_var + "/holidays_events.csv")

# Data Cleaning


*   Check on how to clean TimeSeries data
*   Filling in values for TimeSeries data
> NB Change date column into index.




In [23]:
train_df.head()

Unnamed: 0_level_0,id,store_nbr,family,sales,onpromotion
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2013-01-01,1500443.5,27.5,AUTOMOTIVE,357.775749,2.60277
2013-01-01,1500443.5,27.5,BABY CARE,357.775749,2.60277
2013-01-01,1500443.5,27.5,BEAUTY,357.775749,2.60277
2013-01-01,1500443.5,27.5,BEVERAGES,357.775749,2.60277
2013-01-01,1500443.5,27.5,BOOKS,357.775749,2.60277


So our data is Categorical!

In [None]:
# set date as the index
train_df = train_df.set_index('date')

In [22]:
#fill null values
train_data = fill_null(train_df)

Finished removing null values


In [24]:
#change dataframe to lowercase
train_data = train_data.applymap(lambda s: s.lower() if type(s) == str else s)

In [25]:
train_data

Unnamed: 0_level_0,id,store_nbr,family,sales,onpromotion
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2013-01-01,1500443.5,27.5,automotive,357.775749,2.60277
2013-01-01,1500443.5,27.5,baby care,357.775749,2.60277
2013-01-01,1500443.5,27.5,beauty,357.775749,2.60277
2013-01-01,1500443.5,27.5,beverages,357.775749,2.60277
2013-01-01,1500443.5,27.5,books,357.775749,2.60277
...,...,...,...,...,...
2017-08-15,1500443.5,27.5,poultry,357.775749,2.60277
2017-08-15,1500443.5,27.5,prepared foods,357.775749,2.60277
2017-08-15,1500443.5,27.5,produce,357.775749,2.60277
2017-08-15,1500443.5,27.5,school and office supplies,357.775749,2.60277


No charts were generated by quickchart
No charts were generated by quickchart
Error: Runtime no longer has a reference to this dataframe, please re-run this cell and try again.


# EDA

In [None]:
#oil data

# show the rows of the dataset
oil_df.head()
# oil_df['date'] = pd.to_datetime(oil_df['date'], format='%Y-%m')

# set the date as the index
oil_df = oil_df.set_index('date')

oil_df_val = oil_df['dcoilwtico']

# plot
oil_df_val.plot(figsize=(14,6))
plt.xlabel('Date')
plt.ylabel('Periodic Oil Consumption')
plt.grid()
plt.show()

## Check for ACF and PACF

# Model Building and Training
Proposed models:
*   Prophet
*   Arma
*   ARIMA
*   SARIMA







# Model Evaluation

Check for accuracy of models

# Deploy Model

Ref: https://colab.research.google.com/drive/15PBqTZELcx73TdXUpsN7TVHOp-x6R7EU#scrollTo=j8-Bzga1LWOn