<a href="https://colab.research.google.com/github/Elsiekiprop/Sales-Prediction--Time-Series-Models/blob/gikonyo/Sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Store Sales - Time Series Forecasting
## Overview
This project aims to predict sales of various products at Favorita stores in Ecuador.
## Business Understanding

## Data Understanding

Seven data sets will be used in this project. The data sets are as follows:

**Train.csv**
This data set includes data on:
- date: the day the sale occurred
- id: the sales id
- store_nbr: the store at which the sale occurred
- sales: total sales for a given product family at a given store on a given date
- onpromotion: total number of items promoted at a given store on a given date

The **test.csv** file contains data similar to the training data. It covers the 15 days after the end of the training data.


**Oil.csv**

This file contains oil prices, since Ecuador's economy depends heavily on oil.


**Stores.csv**

This file includes information on store location:
- city: the city in which the store is located
- state: the state in which the store is located
- cluster: a group of similar stores
- type: the type of store




In [97]:
!pip install dataprep
!pip install seaborn



# Importing necessary libraries.



In [98]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

#imports for data analysis
# import plotly.express as px
from dataprep.eda import create_report

In [99]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Common Functions

In [100]:
def DropColumns(col_names, df):
    # input: list of column names and a dataframe
    # output: same dataframe with columns dropped
    df = df.drop(col_names, axis=1)
    return df

def iqr_outliers(df,ft):
  """
  input: dataframe and feature (column name)
  description: finds outliers based on the interquartile range (1.5 * IQR fences)
  output: index of the rows to be removed
  """
  # IQR is the spread between the 25th and 75th percentiles
  q1=df[ft].quantile(0.25)
  q3=df[ft].quantile(0.75)
  iqr = q3-q1
  lower = q1 - 1.5 * iqr
  upper = q3 + 1.5 * iqr
  ls = df.index[ (df[ft]<lower) | (df[ft]>upper) ]
  return ls

def remove_(df,ls):
  """
  input:dataframe, list of rows
  description: will remove the rows
  output: dataframe
  """
  ls = sorted(set(ls))
  df = df.drop(ls)
  return df

def fill_null(df):
  """
    input: dataframe
    description: fills missing numeric values with the column mean and missing categorical values with the column mode
    output: dataframe with filled values
  """
  for col in df:
    if df[col].dtype in ("int64", "float64"):
      # fill gaps with the column mean
      df[col] = df[col].fillna(df[col].mean())
    elif df[col].dtype == "object":
      # fill gaps with the most frequent value
      df[col] = df[col].fillna(df[col].mode()[0])
  print("Finished filling null values")
  return df
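
The outlier helpers above can be exercised on a small toy frame. The sketch below is self-contained, so it defines its own hypothetical helper, `iqr_outlier_index`, which uses the 25th and 75th percentiles (the conventional IQR definition). The toy values are made up for illustration and are not taken from the Favorita data.

```python
import pandas as pd

def iqr_outlier_index(df, col, k=1.5):
    """Return the index labels of rows whose `col` value lies outside
    the fences [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper for illustration."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    mask = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return df.index[mask]

# Toy data (made up): five ordinary values and one extreme value.
toy = pd.DataFrame({"sales": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})

outlier_rows = iqr_outlier_index(toy, "sales")  # index labels flagged as outliers
cleaned = toy.drop(outlier_rows)                # same idea as remove_(df, ls)
print(list(outlier_rows), len(cleaned))
```

Returning index labels rather than a filtered frame keeps detection and removal separate, so outliers from several columns can be collected first and dropped in one pass.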

# Data Access and Collection

In [101]:
path_var = '/content/drive/MyDrive/projects/Store-Sales-Forecast/store-sales-time-series-forecasting'

In [102]:
#Reading data
train_df=pd.read_csv(path_var + "/train.csv")
test_df=pd.read_csv(path_var + "/test.csv")
oil_df=pd.read_csv(path_var + "/oil.csv")
sample_submission_df=pd.read_csv(path_var + "/sample_submission.csv")
stores_df=pd.read_csv(path_var + "/stores.csv")
transactions_df=pd.read_csv(path_var + "/transactions.csv")
holidays_events_df=pd.read_csv(path_var + "/holidays_events.csv")

# Data Cleaning


*   Check on how to clean TimeSeries data
*   Filling in values for TimeSeries data
> NB Change date column into index.
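
One way to carry out the steps noted above is sketched here on a toy frame: parse the date strings, use them as the index, expose missing days, and fill them. The column name `transactions` and all values are made up for illustration, and forward-filling is only one possible choice, assumed here for simplicity.

```python
import pandas as pd

# Toy frame (made-up values) with string dates and one missing day (2017-01-03).
raw = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-01-04"],
    "transactions": [100, 120, 90],
})

# Parse the strings into timestamps, then use them as a sorted index.
raw["date"] = pd.to_datetime(raw["date"])
ts = raw.set_index("date").sort_index()

# Conform to a complete daily calendar; absent days become NaN rows.
daily = ts.asfreq("D")

# Forward-fill carries the last observed value into each gap.
filled = daily.ffill()
print(filled)
```

Parsing with `pd.to_datetime` before calling `set_index` matters: an index of plain strings does not support `asfreq`, resampling, or date-based slicing.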




## Holidays Events

In [103]:
holidays_events_df.head()

| | date | type | locale | locale_name | description | transferred |
|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Holiday | Local | Manta | Fundacion de Manta | False |
| 1 | 2012-04-01 | Holiday | Regional | Cotopaxi | Provincializacion de Cotopaxi | False |
| 2 | 2012-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | False |
| 3 | 2012-04-14 | Holiday | Local | Libertad | Cantonizacion de Libertad | False |
| 4 | 2012-04-21 | Holiday | Local | Riobamba | Cantonizacion de Riobamba | False |


In [104]:
len(holidays_events_df)

350

Perform some EDA

In [105]:
# set date as the index
holidays_df = holidays_events_df.set_index('date')

In [106]:
holidays_df.isnull().sum()

type           0
locale         0
locale_name    0
description    0
transferred    0
dtype: int64

In [107]:
#fill null values
# holidays_data = fill_null(holidays_df)

In [108]:
#change dataframe to lowercase
holidays_data = holidays_df.applymap(lambda s: s.lower() if isinstance(s, str) else s)

In [109]:
len(holidays_data)

350

In [110]:
holidays_data.describe()

| | type | locale | locale_name | description | transferred |
|---|---|---|---|---|---|
| count | 350 | 350 | 350 | 350 | 350 |
| unique | 6 | 3 | 24 | 101 | 2 |
| top | holiday | national | ecuador | carnaval | False |
| freq | 221 | 174 | 174 | 10 | 338 |


### EDA

In [None]:
# using data prep to explore the dataset
create_report(holidays_df).show()

The plot will not show in a notebook environment, please try 'show_browser' if you want to open it in browser


## Stores Data

In [None]:
stores_df.head()

In [None]:
len(stores_df)

In [None]:
# set the store_nbr as the index
stores_df = stores_df.set_index('store_nbr')

In [None]:
stores_df.isnull().sum()

So there are no null values.

In [None]:
#change dataframe to lowercase
stores_data = stores_df.applymap(lambda s: s.lower() if isinstance(s, str) else s)

In [None]:
stores_data

In [None]:
len(stores_data)

### EDA

In [None]:
# using data prep to explore the dataset
create_report(stores_data).show()

## Oil Data

In [None]:
oil_df.head()
len(oil_df)

In [None]:
oil_df

In [None]:
# check for null in the data
oil_df.isnull().sum()

In [None]:
# oil_data = fill_null(oil_df)

In [None]:
len(oil_df)
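
If the null check reports gaps in the oil prices, interpolation is one option for a slowly moving daily price series. The sketch below runs on toy data. The column name `dcoilwtico` is an assumption about how the price column is named in oil.csv, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy oil frame (made-up prices); `dcoilwtico` is an assumed column name.
oil_toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"]),
    "dcoilwtico": [np.nan, 93.0, np.nan, 95.0],
}).set_index("date")

# Linear interpolation fills interior gaps from the neighbouring prices.
# A leading NaN has no earlier neighbour, so it is back-filled afterwards.
oil_filled = oil_toy.interpolate(method="linear").bfill()
print(oil_filled)
```

Unlike filling with the column mean, interpolation keeps each filled value close to the prices around it, which tends to suit a series with a trend.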

## Transactions Data

In [None]:
transactions_df.head()

In [None]:
len(transactions_df)

In [None]:
transactions_df.isnull().sum()

In [None]:
transactions_df = transactions_df.set_index('date')

### EDA

In [None]:
create_report(transactions_df).show()

## Train Data

In [None]:
train_df.head()

In [None]:
len(train_df)

In [None]:
train_df.isnull().sum()

So our data is Categorical!

In [None]:
# set date as the index
train_df = train_df.set_index('date')

In [None]:
# fill_null is skipped for now; work on a copy so train_df stays unchanged
# train_data = fill_null(train_df)
train_data = train_df.copy()

In [None]:
train_data

In [None]:
#change dataframe to lowercase
train_data = train_data.applymap(lambda s: s.lower() if isinstance(s, str) else s)

In [None]:
train_data

In [None]:
len(train_data)

In [None]:
# cleaning test data
test_df.head()

In [None]:
len(test_df)

### EDA

In [None]:
# using data prep to explore the dataset
from dataprep.eda import create_report
create_report(train_data).show()

## Test Data

## Check for ACF and PACF
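
As a rough illustration of what this check computes, the sketch below implements the sample ACF and PACF with NumPy and applies them to a simulated AR(1) series. The helpers `sample_acf` and `sample_pacf` are hypothetical names introduced here, and the simulated series is not the sales data. In practice, statsmodels provides `acf`, `pacf`, `plot_acf`, and `plot_pacf`, which would normally be preferred.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    lags = [np.dot(x[k:], x[:-k]) / denom for k in range(1, nlags + 1)]
    return np.array([1.0] + lags)

def sample_pacf(x, nlags):
    """Sample partial autocorrelation for lags 0..nlags (Durbin-Levinson recursion)."""
    r = sample_acf(x, nlags)
    out = np.zeros(nlags + 1)
    out[0] = 1.0
    phi_prev = np.array([r[1]])  # coefficients of the order-1 fit
    out[1] = r[1]
    for k in range(2, nlags + 1):
        # phi_prev[j-1] holds phi[k-1, j] for j = 1..k-1
        num = r[k] - np.dot(phi_prev, r[k - 1:0:-1])
        den = 1.0 - np.dot(phi_prev, r[1:k])
        phi_kk = num / den
        phi_prev = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        out[k] = phi_kk
    return out

# Simulated AR(1) series with coefficient 0.7 (synthetic, for illustration only).
rng = np.random.default_rng(0)
n = 5000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + noise[t]

acf_vals = sample_acf(y, 5)
pacf_vals = sample_pacf(y, 5)
print(np.round(acf_vals, 2))
print(np.round(pacf_vals, 2))
```

For an AR(1) process the ACF decays geometrically while the PACF cuts off after lag 1. That pattern is what one looks for when choosing the AR and MA orders of an ARMA, ARIMA, or SARIMA model.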

# Model Building and Training
Proposed models:
*   Prophet
*   ARMA
*   ARIMA
*   SARIMA







# Model Evaluation

Check for accuracy of models
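
One possible way to check accuracy is to hold out the last 15 days, matching the test horizon described earlier, and score forecasts against them. The sketch below uses a synthetic series and a seasonal-naive baseline, which is not one of the proposed models and is there only to make the metrics concrete. RMSE and RMSLE are assumed here as candidate metrics, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (inputs must be non-negative)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Synthetic daily sales: a weekly pattern plus a small upward trend.
dates = pd.date_range("2017-01-01", periods=70, freq="D")
weekly = np.array([10, 12, 11, 13, 15, 25, 30], dtype=float)
sales = pd.Series(np.tile(weekly, 10) + 0.1 * np.arange(70), index=dates)

# Time-ordered split: never shuffle a time series before splitting.
horizon = 15
train, holdout = sales.iloc[:-horizon], sales.iloc[-horizon:]

# Seasonal-naive baseline: repeat the last observed week across the horizon.
last_week = train.iloc[-7:].to_numpy()
forecast = pd.Series(np.resize(last_week, horizon), index=holdout.index)

baseline_rmse = rmse(holdout, forecast)
baseline_rmsle = rmsle(holdout, forecast)
print(round(baseline_rmse, 3), round(baseline_rmsle, 3))
```

A simple baseline like this gives a reference score, and any of the proposed models should beat it on the same holdout window before it is worth deploying.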

# Deploy Model

Ref: https://colab.research.google.com/drive/15PBqTZELcx73TdXUpsN7TVHOp-x6R7EU#scrollTo=j8-Bzga1LWOn