# Sales data EDA

- Looking for Time Series Analysis (TSA) specific components
  - Trend
  - Seasonality
  - Cyclic
  - Irregularity


- Identifying the Time Series Data type
  - Stationary or Non-Stationary
  - Using statistical tests to check if the data is Stationary
    - Augmented Dickey-Fuller (ADF) Test
    - Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test



- Converting data to Stationary type
  - Detrending
  - Differencing

## Imports

In [10]:
import sys
sys.path.append('../')

import random
import pandas as pd

from src.fetch_data import DataLoader
from src.exploration import Analysis
from src.cleaning import CleanDataFrame
from src.visualization import Plotters
from src.rotating_logs import get_rotating_log

import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go


cleaner = CleanDataFrame()
analyzer = Analysis()
plotters = Plotters(w=6, h=4)

In [2]:
# First load the cleaned stores data
data_path = 'data/cleaned/store.csv'
version = 'stores_missing_filled_v2'
repo = '../'

store_df = DataLoader.dvc_get_data(data_path, version, repo)

# Then load the raw sales data
data_path = 'data/raw/train.csv'
version = 'raw_data'
repo = '../'

sales_df = DataLoader.dvc_get_data(data_path, version, repo)

DataLoaderLogger - INFO - DVC: CSV file read with path: data/cleaned/store.csv | version: stores_missing_filled_v2 | from: ../
  df = pd.read_csv(io.StringIO(content), sep=",")
DataLoaderLogger - INFO - DVC: CSV file read with path: data/raw/train.csv | version: raw_data | from: ../


In [4]:
store_df.head()

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,0.0,0.0,none
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,0.0,0.0,none
4,5,a,a,29910.0,4.0,2015.0,0,0.0,0.0,none


In [5]:
sales_df.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1


First, let's see the date range of our data.

In [19]:
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
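# Pass the aggregation names as strings; passing the builtins min/max is deprecated in recent pandas
start_date, end_date = sales_df['Date'].aggregate(['min', 'max'])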
print(f"start_date: {start_date.date()} ----> end_date: {end_date.date()}")

start_date: 2013-01-01 ----> end_date: 2015-07-31


Next, I will check whether any dates are missing from the data.

In [29]:
unique_dates = sales_df['Date'].unique()
print(f"There are {len(unique_dates)} unique dates in the data.\nThe number of days between the end and start date is {(end_date-start_date).days}")


There are 942 unique dates in the data.
The number of days between the end and start date is 941


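The end date is 941 days after the start date, so the range covers 942 calendar days when both endpoints are counted. That matches the 942 unique dates, so no dates are missing from the time series.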
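
A more direct way to run the same check is to build the full calendar range and list any days that do not appear in the data. The sketch below uses a small toy set of dates (with one day deliberately left out) in place of `sales_df['Date']`; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-in for sales_df['Date']: 2013-01-03 is deliberately left out,
# and 2013-01-01 is repeated (as it would be with many stores per day)
dates = pd.to_datetime(
    ["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-05", "2013-01-01"]
)

# Every calendar day between the first and last observation, inclusive
full_range = pd.date_range(dates.min(), dates.max(), freq="D")

# Days in the full range that never appear in the data
missing = full_range.difference(dates.unique())

print(f"expected {len(full_range)} days, found {dates.nunique()}, missing {len(missing)}")
print(list(missing.strftime("%Y-%m-%d")))
```

An empty `missing` index means every day in the range is present, which is what the count comparison above implies for the sales data.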

In [None]:
# fig = px.histogram(sales_df, x='Date', y='Sales', histfunc="avg", title="Histogram on Date Axes")
# fig.update_traces(xbins_size="M1")
# fig.update_xaxes(showgrid=True, ticklabelmode="period", dtick="M1", tickformat="%b\n%Y")
# fig.update_layout(bargap=0.1)
# fig.add_trace(go.Scatter(mode="markers", x=sales_df["Date"], y=sales_df["Sales"], name="daily"))
# fig.show()