## Project Objective

To build a model that accurately predicts unit sales for the items sold by Corporation Favorita.

## Hypothesis & Questions

### Hypotheses

### Questions

1. Is the train dataset complete (has all the required dates)?
2. Which dates have the lowest and highest sales for each year?
3. Did the earthquake impact sales?
4. Are certain groups of stores selling more products? (Cluster, city, state, type)
5. Are sales affected by promotions, oil prices and holidays?
6. What analysis can we get from the date and its extractable features?
7. What is the difference between RMSLE, RMSE and MSE (and why is the MAE greater than all of them)?
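As a starting point for question 7, the sketch below computes the four metrics on a small made-up set of values. The numbers are illustrative only and are not taken from the Favorita data; plain NumPy is used so that each definition stays visible.

```python
import numpy as np

# Illustrative values only; not taken from the Favorita data
y_true = np.array([0.0, 10.0, 100.0, 1000.0])
y_pred = np.array([1.0, 12.0, 90.0, 1200.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, same units as sales
# RMSLE compares log(1 + value), so it reflects relative rather than absolute error
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  RMSLE={rmsle:.4f}")
```

When computed on the same errors, MAE can never exceed RMSE. RMSLE is on a log scale, so it is not directly comparable with the other three; an MAE that looks larger than every other metric may simply have been computed on a different scale.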

In [1]:
# Extracting the Zip File to Get Access to the Data
import zipfile
with zipfile.ZipFile("store-sales-time-series-forecasting.zip","r") as zip_loaded:
    zip_loaded.extractall("files/")

print("Extraction Complete.")

Extraction Complete.


In [100]:
# Importing and loading relevant libraries and packages
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from itertools import product

import warnings

# Uncomment the next line to hide warnings
#warnings.filterwarnings('ignore')

print("Loading complete.")



**Previewing & exploring the files**

**Train data and complementary data**

In [3]:
train_data = pd.read_csv("files/train.csv")
train_data

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0
1,1,2013-01-01,1,BABY CARE,0.000,0
2,2,2013-01-01,1,BEAUTY,0.000,0
3,3,2013-01-01,1,BEVERAGES,0.000,0
4,4,2013-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB


In [5]:
train_data.nunique()

id             3000888
date              1684
store_nbr           54
family              33
sales           379610
onpromotion        362
dtype: int64

In [6]:
# Setting all floats to display with 2 decimal places
pd.options.display.float_format = '{:,.2f}'.format

In [7]:
## Getting the actual dates
actual_days = train_data["date"].unique()
actual_days

array(['2013-01-01', '2013-01-02', '2013-01-03', ..., '2017-08-13',
       '2017-08-14', '2017-08-15'], dtype=object)

In [8]:
# Parsing the date strings into Python date objects in a new "sales_date" column
train_data["sales_date"] = pd.to_datetime(train_data["date"]).dt.date
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
 6   sales_date   object 
dtypes: float64(1), int64(3), object(3)
memory usage: 160.3+ MB


In [9]:
# Getting the first and last dates in the train data
date_range = train_data.sales_date.min(), train_data.sales_date.max()
date_range

(datetime.date(2013, 1, 1), datetime.date(2017, 8, 15))

In [10]:
# Check completeness of dates
## Number of expected dates
expected_days = pd.date_range(start = train_data["sales_date"].min(), end = train_data["sales_date"].max())
expected_days

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10',
               ...
               '2017-08-06', '2017-08-07', '2017-08-08', '2017-08-09',
               '2017-08-10', '2017-08-11', '2017-08-12', '2017-08-13',
               '2017-08-14', '2017-08-15'],
              dtype='datetime64[ns]', length=1688, freq='D')

The train data contains 1,684 unique dates, but the range from January 1, 2013 to August 15, 2017 spans 1,688 calendar days, a difference of 4 days. We therefore have to find the missing dates and add them so that the date sequence is complete.

This answers question 1 (Is the train dataset complete?): no, it is not.
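The same completeness check can be sketched on toy data with `DatetimeIndex.difference`, which returns the expected days that never appear. The dates below are made up for illustration, and the variable names `observed`, `expected` and `missing` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the "date" column, with 2013-01-03 deliberately left out
observed = pd.to_datetime(
    pd.Series(["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05"])
)

# Every calendar day between the first and last observed date
expected = pd.date_range(start=observed.min(), end=observed.max(), freq="D")

# Days that are expected but never observed
missing = expected.difference(pd.DatetimeIndex(observed))
print(list(missing.strftime("%Y-%m-%d")))
```

Working with a `DatetimeIndex` throughout avoids converting to Python `date` objects, which pandas stores with the generic `object` dtype.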

In [11]:
## Get missing dates
missing_dates = set(expected_days.date) - set(train_data["sales_date"].unique())
missing_dates

{datetime.date(2013, 12, 25),
 datetime.date(2014, 12, 25),
 datetime.date(2015, 12, 25),
 datetime.date(2016, 12, 25)}

In [12]:
# Getting the list of unique stores
unique_stores = train_data["store_nbr"].unique()
unique_stores

array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,  4,
       40, 41, 42, 43, 44, 45, 46, 47, 48, 49,  5, 50, 51, 52, 53, 54,  6,
        7,  8,  9], dtype=int64)

In [13]:
# Getting the unique families
unique_families = train_data["family"].unique()
unique_families

array(['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD'], dtype=object)

Since we are predicting sales for each product family in each store, we have to add the missing dates for every store and family combination. We will do this with the _product_ function from _itertools_.
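An alternative way to build every date, store and family combination, shown here only as a sketch on made-up data, is `pd.MultiIndex.from_product` followed by `reindex`. The toy frame and the names `toy`, `full_index` and `filled` are assumptions for illustration; filling with 0 assumes that an absent row means no sales.

```python
import pandas as pd

# Toy data: two stores, two families, with 2013-12-25 absent
toy = pd.DataFrame({
    "sales_date": pd.to_datetime(["2013-12-24"] * 4 + ["2013-12-26"] * 4),
    "store_nbr": [1, 1, 2, 2] * 2,
    "family": ["BEAUTY", "BOOKS"] * 4,
    "sales": [5.0, 1.0, 7.0, 2.0, 6.0, 0.0, 8.0, 3.0],
})

# Every (date, store, family) combination that should exist
full_index = pd.MultiIndex.from_product(
    [
        pd.date_range(toy["sales_date"].min(), toy["sales_date"].max(), freq="D"),
        sorted(toy["store_nbr"].unique()),
        sorted(toy["family"].unique()),
    ],
    names=["sales_date", "store_nbr", "family"],
)

# Reindexing adds the absent combinations; fill_value=0 marks them as zero sales
filled = (
    toy.set_index(["sales_date", "store_nbr", "family"])
       .reindex(full_index, fill_value=0)
       .reset_index()
)
print(filled[filled["sales_date"] == pd.Timestamp("2013-12-25")])
```

This keeps the result sorted by date and fills the new rows in the same step, at the cost of requiring that each combination appears at most once in the input.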

In [14]:
missing_data = list(product(missing_dates, unique_stores, unique_families))
train_addon = pd.DataFrame(missing_data, columns = ["sales_date", "store_nbr", "family"])
train_addon

Unnamed: 0,sales_date,store_nbr,family
0,2015-12-25,1,AUTOMOTIVE
1,2015-12-25,1,BABY CARE
2,2015-12-25,1,BEAUTY
3,2015-12-25,1,BEVERAGES
4,2015-12-25,1,BOOKS
...,...,...,...
7123,2014-12-25,9,POULTRY
7124,2014-12-25,9,PREPARED FOODS
7125,2014-12-25,9,PRODUCE
7126,2014-12-25,9,SCHOOL AND OFFICE SUPPLIES


In [15]:
train_data = pd.concat([train_data, train_addon], ignore_index=True)
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3008016 entries, 0 to 3008015
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           float64
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  float64
 6   sales_date   object 
dtypes: float64(3), int64(1), object(3)
memory usage: 160.6+ MB


In [16]:
train_data

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_date
0,0.00,2013-01-01,1,AUTOMOTIVE,0.00,0.00,2013-01-01
1,1.00,2013-01-01,1,BABY CARE,0.00,0.00,2013-01-01
2,2.00,2013-01-01,1,BEAUTY,0.00,0.00,2013-01-01
3,3.00,2013-01-01,1,BEVERAGES,0.00,0.00,2013-01-01
4,4.00,2013-01-01,1,BOOKS,0.00,0.00,2013-01-01
...,...,...,...,...,...,...,...
3008011,,,9,POULTRY,,,2014-12-25
3008012,,,9,PREPARED FOODS,,,2014-12-25
3008013,,,9,PRODUCE,,,2014-12-25
3008014,,,9,SCHOOL AND OFFICE SUPPLIES,,,2014-12-25


- December 25 is missing from every year, so I assume the omission was deliberate, most likely because all stores are closed on that day. If so, no items were on promotion and no sales were made, so it is safe to fill the null "sales" and "onpromotion" values with 0.

- I am also dropping the "id" column, as it is not relevant to the subsequent analyses and modelling.

- I am dropping the original "date" column as well, since "sales_date" now holds the same information for every row, including the added ones.

In [17]:
# Dropping "id" and "date" columns
train_data.drop(columns = ["id", "date"], inplace = True)

# Filling missing rows in the sales column with 0 and casting it to numeric
# (assigning back avoids the chained in-place fillna, which newer pandas versions warn about)
train_data["sales"] = train_data["sales"].fillna(0)
train_data["sales"] = pd.to_numeric(train_data["sales"])

# Filling missing rows in the onpromotion column with 0
train_data["onpromotion"] = train_data["onpromotion"].fillna(0)

train_data

Unnamed: 0,store_nbr,family,sales,onpromotion,sales_date
0,1,AUTOMOTIVE,0.00,0.00,2013-01-01
1,1,BABY CARE,0.00,0.00,2013-01-01
2,1,BEAUTY,0.00,0.00,2013-01-01
3,1,BEVERAGES,0.00,0.00,2013-01-01
4,1,BOOKS,0.00,0.00,2013-01-01
...,...,...,...,...,...
3008011,9,POULTRY,0.00,0.00,2014-12-25
3008012,9,PREPARED FOODS,0.00,0.00,2014-12-25
3008013,9,PRODUCE,0.00,0.00,2014-12-25
3008014,9,SCHOOL AND OFFICE SUPPLIES,0.00,0.00,2014-12-25


**Transactions data**

In [18]:
transactions = pd.read_csv("files/transactions.csv")
transactions

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922
...,...,...,...
83483,2017-08-15,50,2804
83484,2017-08-15,51,1573
83485,2017-08-15,52,2255
83486,2017-08-15,53,932


In [19]:
# Viewing basic information about the transactions data
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB


In [20]:
transactions.nunique()

date            1682
store_nbr         54
transactions    4993
dtype: int64

- Since the train data and the transactions data have the same number of unique stores (54), we can reuse the unique stores variable defined earlier when filling in the missing dates.
- The transactions data covers the same period as the train data, yet it has even fewer unique dates (1,682 against 1,684). We therefore have to find and impute the missing dates, as was done for the train data.

In [21]:
transactions["sales_date"] = pd.to_datetime(transactions["date"]).dt.date

In [22]:
# Getting missing dates
missing_txn_dates = set(expected_days.date) - set(transactions["sales_date"].unique())
missing_txn_dates

{datetime.date(2013, 12, 25),
 datetime.date(2014, 12, 25),
 datetime.date(2015, 12, 25),
 datetime.date(2016, 1, 1),
 datetime.date(2016, 1, 3),
 datetime.date(2016, 12, 25)}

In [23]:
missing_txn_data = list(product(missing_txn_dates, unique_stores))
txn_data_addon = pd.DataFrame(missing_txn_data, columns = ["sales_date", "store_nbr"])
txn_data_addon

Unnamed: 0,sales_date,store_nbr
0,2014-12-25,1
1,2014-12-25,10
2,2014-12-25,11
3,2014-12-25,12
4,2014-12-25,13
...,...,...
319,2016-01-01,54
320,2016-01-01,6
321,2016-01-01,7
322,2016-01-01,8


In [24]:
transactions

Unnamed: 0,date,store_nbr,transactions,sales_date
0,2013-01-01,25,770,2013-01-01
1,2013-01-02,1,2111,2013-01-02
2,2013-01-02,2,2358,2013-01-02
3,2013-01-02,3,3487,2013-01-02
4,2013-01-02,4,1922,2013-01-02
...,...,...,...,...
83483,2017-08-15,50,2804,2017-08-15
83484,2017-08-15,51,1573,2017-08-15
83485,2017-08-15,52,2255,2017-08-15
83486,2017-08-15,53,932,2017-08-15


In [25]:
# Adding the data for the missing transaction dates to the main transaction data and filling nulls with 0
transactions = pd.concat([transactions, txn_data_addon], ignore_index=True)
transactions = transactions.drop(columns = "date")
transactions["transactions"] = transactions["transactions"].fillna(0)


In [26]:
# Recasting the sales date column data type to date
transactions["sales_date"] = pd.to_datetime(transactions["sales_date"]).dt.date
transactions

Unnamed: 0,store_nbr,transactions,sales_date
0,25,770.00,2013-01-01
1,1,2111.00,2013-01-02
2,2,2358.00,2013-01-02
3,3,3487.00,2013-01-02
4,4,1922.00,2013-01-02
...,...,...,...
83807,54,0.00,2016-01-01
83808,6,0.00,2016-01-01
83809,7,0.00,2016-01-01
83810,8,0.00,2016-01-01


In [27]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83812 entries, 0 to 83811
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   store_nbr     83812 non-null  int64  
 1   transactions  83812 non-null  float64
 2   sales_date    83812 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 1.9+ MB


**Holidays and events data**

In [28]:
holidays_events = pd.read_csv("files/holidays_events.csv")
holidays_events

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
...,...,...,...,...,...,...
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
346,2017-12-23,Additional,National,Ecuador,Navidad-2,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
348,2017-12-25,Holiday,National,Ecuador,Navidad,False


In [29]:
holidays_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         350 non-null    object
 1   type         350 non-null    object
 2   locale       350 non-null    object
 3   locale_name  350 non-null    object
 4   description  350 non-null    object
 5   transferred  350 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 14.1+ KB


The holidays and events dataframe has no null values, so no cleaning is needed at this point.

In [30]:
holidays_events.nunique()

date           312
type             6
locale           3
locale_name     24
description    103
transferred      2
dtype: int64

**Oil data**

In [31]:
oil_data = pd.read_csv("files/oil.csv")
oil_data.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


In [32]:
oil_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


There are 43 missing values in the oil price column (1,175 non-null out of 1,218 rows). Checks online revealed that these prices were unavailable in real time, so a forward fill will be applied to the nulls, followed by a backfill for any rows that are still missing after that.
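The backfill matters because the very first row of the oil data (2013-01-01) is null and has no earlier value to carry forward. A minimal sketch on a toy series, using made-up positions for the gaps:

```python
import numpy as np
import pandas as pd

# Toy price series with a leading gap and an interior gap
prices = pd.Series([np.nan, 93.14, np.nan, 93.12], name="dcoilwtico")

forward_only = prices.ffill()     # interior gap filled; leading NaN remains
filled = prices.ffill().bfill()   # leading NaN takes the next known price

print(forward_only.tolist())
print(filled.tolist())
```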

In [33]:
# Filling nulls with forward fill and backfill
oil_data = oil_data.ffill().bfill()
oil_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1218 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


In [34]:
oil_data.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,93.14
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


In [35]:
# Converting the dates in the oil data to dates
oil_data["date"] = pd.to_datetime(oil_data["date"]).dt.date

The oil data now has no nulls, but some dates are still absent; for example, it jumps from January 4, 2013 to January 7, 2013. A quick check reveals that the absent dates are weekends, implying that the data covers business days only. With this in mind, I assume that oil prices are frozen at the close of business on Friday and remain constant over the weekend. The "missing dates" (weekends) can therefore be added and another forward fill applied to them.
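The weekend fill can also be sketched by reindexing onto a full daily calendar and forward filling in one chain. The three prices below are taken from the preview above; the names `oil_toy`, `calendar` and `oil_daily` are hypothetical, and the approach assumes that a Friday price holds until the next trading day.

```python
import pandas as pd

# Business-day prices from the preview: a Friday, then the following Monday and Tuesday
oil_toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-04", "2013-01-07", "2013-01-08"]),
    "dcoilwtico": [93.12, 93.20, 93.21],
})

# Full daily calendar spanning the toy data
calendar = pd.date_range(oil_toy["date"].min(), oil_toy["date"].max(), freq="D")

# Reindex onto the calendar, then carry Friday's price over the weekend
oil_daily = (
    oil_toy.set_index("date")
           .reindex(calendar)
           .ffill()
           .rename_axis("date")
           .reset_index()
)
print(oil_daily)
```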

In [36]:
# Getting missing dates
missing_oil_dates = set(expected_days.date) - set(oil_data["date"].unique())
missing_oil_dates

{datetime.date(2013, 1, 5),
 datetime.date(2013, 1, 6),
 datetime.date(2013, 1, 12),
 datetime.date(2013, 1, 13),
 datetime.date(2013, 1, 19),
 datetime.date(2013, 1, 20),
 datetime.date(2013, 1, 26),
 datetime.date(2013, 1, 27),
 datetime.date(2013, 2, 2),
 datetime.date(2013, 2, 3),
 datetime.date(2013, 2, 9),
 datetime.date(2013, 2, 10),
 datetime.date(2013, 2, 16),
 datetime.date(2013, 2, 17),
 datetime.date(2013, 2, 23),
 datetime.date(2013, 2, 24),
 datetime.date(2013, 3, 2),
 datetime.date(2013, 3, 3),
 datetime.date(2013, 3, 9),
 datetime.date(2013, 3, 10),
 datetime.date(2013, 3, 16),
 datetime.date(2013, 3, 17),
 datetime.date(2013, 3, 23),
 datetime.date(2013, 3, 24),
 datetime.date(2013, 3, 30),
 datetime.date(2013, 3, 31),
 datetime.date(2013, 4, 6),
 datetime.date(2013, 4, 7),
 datetime.date(2013, 4, 13),
 datetime.date(2013, 4, 14),
 datetime.date(2013, 4, 20),
 datetime.date(2013, 4, 21),
 datetime.date(2013, 4, 27),
 datetime.date(2013, 4, 28),
 datetime.date(2013, 5, 

In [37]:
oil_dates_add = pd.DataFrame(missing_oil_dates, columns = ["date"])
oil_dates_add

Unnamed: 0,date
0,2016-01-24
1,2017-04-22
2,2014-04-19
3,2016-10-08
4,2016-03-12
...,...
477,2014-03-01
478,2014-06-08
479,2017-01-08
480,2013-11-03


In [38]:
# Adding the missing oil dates to the main dataframe
oil_data = pd.concat([oil_data, oil_dates_add], ignore_index=True)
oil_data["date"] = pd.to_datetime(oil_data["date"])
oil_data = oil_data.sort_values(by = ["date"], ignore_index = True)
oil_data.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,93.14
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-05,
5,2013-01-06,
6,2013-01-07,93.2
7,2013-01-08,93.21
8,2013-01-09,93.08
9,2013-01-10,93.81


In [39]:
# Filling nulls with forward fill and backfill
oil_data = oil_data.ffill().bfill()
oil_data.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,93.14
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-05,93.12
5,2013-01-06,93.12
6,2013-01-07,93.2
7,2013-01-08,93.21
8,2013-01-09,93.08
9,2013-01-10,93.81


In [40]:
# Recasting the oil data dates to datetime dates
oil_data["date"] = pd.to_datetime(oil_data["date"]).dt.date
oil_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700 entries, 0 to 1699
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1700 non-null   object 
 1   dcoilwtico  1700 non-null   float64
dtypes: float64(1), object(1)
memory usage: 26.7+ KB


**Stores data**

In [41]:
stores_data = pd.read_csv("files/stores.csv")
stores_data

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4
5,6,Quito,Pichincha,D,13
6,7,Quito,Pichincha,D,8
7,8,Quito,Pichincha,D,8
8,9,Quito,Pichincha,B,6
9,10,Quito,Pichincha,C,15


In [42]:
stores_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


**Test data**

In [43]:
test_data = pd.read_csv("files/test.csv")
test_data

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0
...,...,...,...,...,...
28507,3029395,2017-08-31,9,POULTRY,1
28508,3029396,2017-08-31,9,PREPARED FOODS,0
28509,3029397,2017-08-31,9,PRODUCE,1
28510,3029398,2017-08-31,9,SCHOOL AND OFFICE SUPPLIES,9


In [44]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           28512 non-null  int64 
 1   date         28512 non-null  object
 2   store_nbr    28512 non-null  int64 
 3   family       28512 non-null  object
 4   onpromotion  28512 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


The test data looks complete, with no nulls. Casting the date column to date will be the only cleaning activity here.

In [45]:
# Casting the date column to date data type
test_data["date"] = pd.to_datetime(test_data["date"]).dt.date

**Sample Submission**

In [46]:
sample_submission = pd.read_csv("files/sample_submission.csv")
sample_submission

Unnamed: 0,id,sales
0,3000888,0.00
1,3000889,0.00
2,3000890,0.00
3,3000891,0.00
4,3000892,0.00
...,...,...
28507,3029395,0.00
28508,3029396,0.00
28509,3029397,0.00
28510,3029398,0.00


No changes will be made to the sample submission as it is only a guide.

## Answering the other questions

**Which dates have the lowest and highest sales for each year?**

Because the originally missing dates were imputed with zero sales, the minimum for each of the four affected years would automatically fall on December 25. That is not what we want: the question is which days had the lowest sales while stores were open, so I will exclude rows where sales are 0.

In [70]:
# Aggregating sales by dates
train_by_date = train_data[train_data["sales"] != 0.00].groupby(by = "sales_date").sales.agg(["sum"]).sort_values(by = "sales_date")
train_by_date

Unnamed: 0_level_0,sum
sales_date,Unnamed: 1_level_1
2013-01-01,2511.62
2013-01-02,496092.42
2013-01-03,361461.23
2013-01-04,354459.68
2013-01-05,477350.12
...,...
2017-08-11,826373.72
2017-08-12,792630.54
2017-08-13,865639.68
2017-08-14,760922.41


In [71]:
# Creating a column for the years for grouping
train_by_date["year"] = pd.to_datetime(train_by_date.index).year
train_by_date.rename(columns = {"sum":"total_sales"}, inplace = True)
train_by_date

Unnamed: 0_level_0,total_sales,year
sales_date,Unnamed: 1_level_1,Unnamed: 2_level_1
2013-01-01,2511.62,2013
2013-01-02,496092.42,2013
2013-01-03,361461.23,2013
2013-01-04,354459.68,2013
2013-01-05,477350.12,2013
...,...,...
2017-08-11,826373.72,2017
2017-08-12,792630.54,2017
2017-08-13,865639.68,2017
2017-08-14,760922.41,2017


In [102]:
data_2013 = train_by_date[train_by_date["year"] == 2013]
data_2013 = data_2013.reset_index()
data_2013

Unnamed: 0,sales_date,total_sales,year
0,2013-01-01,2511.62,2013
1,2013-01-02,496092.42,2013
2,2013-01-03,361461.23,2013
3,2013-01-04,354459.68,2013
4,2013-01-05,477350.12,2013
...,...,...,...
359,2013-12-27,479314.97,2013
360,2013-12-28,556952.31,2013
361,2013-12-29,499719.50,2013
362,2013-12-30,635134.74,2013


In [104]:
fig = px.line(data_2013, x = "sales_date", y = "total_sales", title="Sales trend for Corporation Favorita in 2013", 
             labels = {"sales_date":"Sales Date", "total_sales":"Total Sales"})
fig.show()

In [73]:
min_sales_13 = data_2013["total_sales"].min()
max_sales_13 = data_2013["total_sales"].max()
low_hi_sales_13 = data_2013[(data_2013["total_sales"] == min_sales_13) | (data_2013["total_sales"] == max_sales_13)]
low_hi_sales_13 = low_hi_sales_13.reset_index(drop = True)
low_hi_sales_13

Unnamed: 0,sales_date,total_sales,year
0,2013-01-01,2511.62,2013
1,2013-12-23,792865.28,2013


In [74]:
data_2014 = train_by_date[train_by_date["year"] == 2014]
data_2014 = data_2014.reset_index()
data_2014

Unnamed: 0_level_0,total_sales,year
sales_date,Unnamed: 1_level_1,Unnamed: 2_level_1
2014-01-01,8602.07,2014
2014-01-02,801011.23,2014
2014-01-03,680672.85,2014
2014-01-04,936628.89,2014
2014-01-05,949618.79,2014
...,...,...
2014-12-27,740596.16,2014
2014-12-28,716329.64,2014
2014-12-29,773998.40,2014
2014-12-30,912970.53,2014


In [75]:
min_sales_14 = data_2014["total_sales"].min()
max_sales_14 = data_2014["total_sales"].max()
low_hi_sales_14 = data_2014[(data_2014["total_sales"] == min_sales_14) | (data_2014["total_sales"] == max_sales_14)]
low_hi_sales_14 = low_hi_sales_14.reset_index(drop = True)
low_hi_sales_14

Unnamed: 0,sales_date,total_sales,year
0,2014-01-01,8602.07,2014
1,2014-12-23,1064977.97,2014


In [76]:
data_2015 = train_by_date[train_by_date["year"] == 2015]
data_2015 = data_2015.reset_index()
data_2015

Unnamed: 0_level_0,total_sales,year
sales_date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-01,12773.62,2015
2015-01-02,657763.39,2015
2015-01-03,648880.69,2015
2015-01-04,730923.78,2015
2015-01-05,569267.30,2015
...,...,...
2015-12-27,837714.13,2015
2015-12-28,789684.91,2015
2015-12-29,870762.03,2015
2015-12-30,1030043.74,2015


In [77]:
min_sales_15 = data_2015["total_sales"].min()
max_sales_15 = data_2015["total_sales"].max()
low_hi_sales_15 = data_2015[(data_2015["total_sales"] == min_sales_15) | (data_2015["total_sales"] == max_sales_15)]
low_hi_sales_15 = low_hi_sales_15.reset_index(drop = True)
low_hi_sales_15

Unnamed: 0,sales_date,total_sales,year
0,2015-01-01,12773.62,2015
1,2015-10-04,1234130.94,2015


In [78]:
data_2016 = train_by_date[train_by_date["year"] == 2016]
data_2016 = data_2016.reset_index()
data_2016

Unnamed: 0_level_0,total_sales,year
sales_date,Unnamed: 1_level_1,Unnamed: 2_level_1
2016-01-01,16433.39,2016
2016-01-02,1066677.42,2016
2016-01-03,1226735.72,2016
2016-01-04,955956.88,2016
2016-01-05,835320.44,2016
...,...,...
2016-12-27,842475.49,2016
2016-12-28,951533.71,2016
2016-12-29,894108.24,2016
2016-12-30,1163643.04,2016


In [79]:
min_sales_16 = data_2016["total_sales"].min()
max_sales_16 = data_2016["total_sales"].max()
low_hi_sales_16 = data_2016[(data_2016["total_sales"] == min_sales_16) | (data_2016["total_sales"] == max_sales_16)]
low_hi_sales_16 = low_hi_sales_16.reset_index(drop = True)
low_hi_sales_16

Unnamed: 0,sales_date,total_sales,year
0,2016-01-01,16433.39,2016
1,2016-04-18,1345920.6,2016


In [80]:
data_2017 = train_by_date[train_by_date["year"] == 2017]
data_2017 = data_2017.reset_index()
data_2017

Unnamed: 0_level_0,total_sales,year
sales_date,Unnamed: 1_level_1,Unnamed: 2_level_1
2017-01-01,12082.50,2017
2017-01-02,1402306.37,2017
2017-01-03,1104377.08,2017
2017-01-04,990093.46,2017
2017-01-05,777620.95,2017
...,...,...
2017-08-11,826373.72,2017
2017-08-12,792630.54,2017
2017-08-13,865639.68,2017
2017-08-14,760922.41,2017


In [81]:
min_sales_17 = data_2017["total_sales"].min()
max_sales_17 = data_2017["total_sales"].max()
low_hi_sales_17 = data_2017[(data_2017["total_sales"] == min_sales_17) | (data_2017["total_sales"] == max_sales_17)]
low_hi_sales_17 = low_hi_sales_17.reset_index(drop = True)
low_hi_sales_17

Unnamed: 0,sales_date,total_sales,year
0,2017-01-01,12082.5,2017
1,2017-04-01,1463083.96,2017


In [82]:
# Combining the highest and lowest sales dates 
low_hi_sales_df = pd.concat([low_hi_sales_13, low_hi_sales_14, low_hi_sales_15, 
                             low_hi_sales_16, low_hi_sales_17], ignore_index = True)
low_hi_sales_df

Unnamed: 0,sales_date,total_sales,year
0,2013-01-01,2511.62,2013
1,2013-12-23,792865.28,2013
2,2014-01-01,8602.07,2014
3,2014-12-23,1064977.97,2014
4,2015-01-01,12773.62,2015
5,2015-10-04,1234130.94,2015
6,2016-01-01,16433.39,2016
7,2016-04-18,1345920.6,2016
8,2017-01-01,12082.5,2017
9,2017-04-01,1463083.96,2017


The table above summarizes the dates with the lowest and highest sales in each year. Corporation Favorita made its lowest sales on January 1 in every year. In 2013 and 2014 the highest sales fell on December 23, while in 2016 and 2017 they fell in April (April 18 and April 1 respectively). The outlier is 2015, when the highest sales fell on October 4. Note that the 2017 data ends on August 15, so its maximum covers only part of the year.
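The per-year steps above repeat the same filter five times. As a sketch of a more compact alternative, a single `groupby` with `idxmin` and `idxmax` can pick out both extremes for every year at once. The daily totals below are made up for illustration, and `daily`, `extreme_rows` and `low_hi` are hypothetical names.

```python
import pandas as pd

# Toy daily totals; values are illustrative, not the Favorita figures
daily = pd.DataFrame({
    "sales_date": pd.to_datetime(
        ["2013-01-01", "2013-06-15", "2013-12-23",
         "2014-01-01", "2014-07-04", "2014-12-23"]
    ),
    "total_sales": [2500.0, 400000.0, 790000.0, 8600.0, 650000.0, 1060000.0],
})
daily["year"] = daily["sales_date"].dt.year

# Row labels of the lowest and highest totals within each year
by_year = daily.groupby("year")["total_sales"]
extreme_rows = pd.concat([by_year.idxmin(), by_year.idxmax()]).sort_values()

low_hi = daily.loc[extreme_rows.tolist()].reset_index(drop=True)
print(low_hi)
```

Unlike the equality filter used above, `idxmin` and `idxmax` return only the first matching row when several days tie for the extreme value.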