# Exploratory Data Analysis

In this notebook I wrote some code to get an overview of the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

### Calendar
`calendar_afcs2020.csv`

Contains information about the dates on which the products are sold.

* __date + month + year + d__ = date
* __wm_yr_wk__ = year, week (used to join with the sell prices dataset)
* __weekday & wday__ = day of the week (sat:1, sun:2, mon:3, tue:4, wed:5, thu:6, fri:7)
* __event_type & event_name__ = holidays and events
* __snap_CA__ = boolean (0, 1), something to do with discount vouchers (SNAP food benefits)
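The column descriptions above can be sanity-checked on a few rows. The sketch below is only an illustration: `calendar_demo`, `wday_check` and `day_num` are names invented here, and the rows are copied from the first rows of the calendar file. It parses the `date` strings, recomputes `wday` with Saturday as day 1, and extracts the numeric part of `d`.

```python
import pandas as pd

# Small stand-in for the first rows of calendar_afcs2020.csv.
calendar_demo = pd.DataFrame({
    "date": ["1/29/2011", "1/30/2011", "1/31/2011", "2/1/2011", "2/2/2011"],
    "weekday": ["Saturday", "Sunday", "Monday", "Tuesday", "Wednesday"],
    "wday": [1, 2, 3, 4, 5],
    "d": ["d_1", "d_2", "d_3", "d_4", "d_5"],
})

# Parse the month/day/year strings into timestamps.
calendar_demo["date"] = pd.to_datetime(calendar_demo["date"], format="%m/%d/%Y")

# pandas counts Monday=0 ... Sunday=6; shift so that Saturday=1 ... Friday=7.
calendar_demo["wday_check"] = (calendar_demo["date"].dt.dayofweek + 2) % 7 + 1

# Numeric day index taken from the "d" label, e.g. "d_4" -> 4.
calendar_demo["day_num"] = (
    calendar_demo["d"].str.replace("d_", "", regex=False).astype(int)
)

print(calendar_demo)
```

If `wday_check` ever differs from `wday` on the full file, the weekday encoding described above would need another look.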

In [2]:
calendar = pd.read_csv("calendar_afcs2020.csv")
calendar.head()

Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA
0,1/29/2011,11101,Saturday,1,1,2011,d_1,,,,,0
1,1/30/2011,11101,Sunday,2,1,2011,d_2,,,,,0
2,1/31/2011,11101,Monday,3,1,2011,d_3,,,,,0
3,2/1/2011,11101,Tuesday,4,2,2011,d_4,,,,,1
4,2/2/2011,11101,Wednesday,5,2,2011,d_5,,,,,1


In [3]:
calendar["event_type_1"].unique()

array([nan, 'Sporting', 'Cultural', 'National', 'Religious'], dtype=object)

In [4]:
calendar["event_name_1"].unique()

array([nan, 'SuperBowl', 'ValentinesDay', 'PresidentsDay', 'LentStart',
       'LentWeek2', 'StPatricksDay', 'Purim End', 'OrthodoxEaster',
       'Pesach End', 'Cinco De Mayo', "Mother's day", 'MemorialDay',
       'NBAFinalsStart', 'NBAFinalsEnd', "Father's day",
       'IndependenceDay', 'Ramadan starts', 'Eid al-Fitr', 'LaborDay',
       'ColumbusDay', 'Halloween', 'EidAlAdha', 'VeteransDay',
       'Thanksgiving', 'Christmas', 'Chanukah End', 'NewYear',
       'OrthodoxChristmas', 'MartinLutherKingDay', 'Easter'], dtype=object)

In [5]:
calendar["event_type_2"].unique()

array([nan, 'Cultural', 'Religious'], dtype=object)

In [6]:
calendar["event_name_2"].unique()

array([nan, 'Easter', 'Cinco De Mayo', 'OrthodoxEaster', "Father's day"],
      dtype=object)
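Since a day can carry up to two events, it may be handy to collapse the four event columns into simple flags. The following is a minimal sketch on made-up rows. The event names come from the outputs above, but the day labels, the pairing of events and the column names `n_events` and `has_event` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: no event, one event, no event, two events on the same day.
events_demo = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3", "d_4"],
    "event_name_1": [np.nan, "SuperBowl", np.nan, "OrthodoxEaster"],
    "event_type_1": [np.nan, "Sporting", np.nan, "Religious"],
    "event_name_2": [np.nan, np.nan, np.nan, "Easter"],
    "event_type_2": [np.nan, np.nan, np.nan, "Cultural"],
})

# Count the non-missing event names per day.
events_demo["n_events"] = (
    events_demo[["event_name_1", "event_name_2"]].notna().sum(axis=1)
)

# Boolean flag: does the day have at least one event?
events_demo["has_event"] = events_demo["n_events"] > 0

print(events_demo[["d", "n_events", "has_event"]])
```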

In [7]:
print("Boolean options:", calendar["snap_CA"].unique(), "\nMean value:", calendar["snap_CA"].mean())

Boolean options: [0 1] 
Mean value: 0.33011681056373793


### Sell Prices
`sell_prices_afcs2020.csv`

Contains the price of each product sold at the CA_3 store, per week (`wm_yr_wk`).

* __store_id__ = the store, in this case always CA_3
* __item_id__ = product
* __wm_yr_wk__ = year, week (used to join with the calendar dataset)
* __sell_price__ = selling price of the product
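The `wm_yr_wk` key is what links weekly prices to calendar days. As a rough sketch of that join, the frames below use invented rows (only the `HOBBIES_2_001` price of 5.47 is taken from the data; the day labels, the second item and its price are hypothetical), and `daily_prices` is a name chosen here.

```python
import pandas as pd

# Hypothetical calendar excerpt: two days in one week, one day in the next.
calendar_demo = pd.DataFrame({
    "d": ["d_29", "d_30", "d_36"],
    "wm_yr_wk": [11105, 11105, 11106],
})

# Hypothetical weekly prices for two items.
prices_demo = pd.DataFrame({
    "store_id": ["CA_3"] * 4,
    "item_id": ["HOBBIES_2_001", "HOBBIES_2_001", "HOBBIES_2_002", "HOBBIES_2_002"],
    "wm_yr_wk": [11105, 11106, 11105, 11106],
    "sell_price": [5.47, 5.47, 3.00, 3.00],
})

# Each calendar day picks up the price of every item for its week.
daily_prices = calendar_demo.merge(prices_demo, on="wm_yr_wk", how="left")

print(daily_prices)
```

A left join keeps every calendar day, so days in weeks without a price row would show up with a missing `sell_price` rather than disappear.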

In [8]:
prices = pd.read_csv("sell_prices_afcs2020.csv")
prices.head()

Unnamed: 0,store_id,item_id,wm_yr_wk,sell_price
0,CA_3,HOBBIES_2_001,11105,5.47
1,CA_3,HOBBIES_2_001,11106,5.47
2,CA_3,HOBBIES_2_001,11107,5.47
3,CA_3,HOBBIES_2_001,11108,5.47
4,CA_3,HOBBIES_2_001,11109,5.47


In [9]:
prices["store_id"].unique()

array(['CA_3'], dtype=object)

In [10]:
prices["wm"] = prices["wm_yr_wk"].astype(str).str[0]
prices["wm"].unique()

array(['1'], dtype=object)
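The same string slicing can be extended to the rest of the code. The sketch below assumes that `wm_yr_wk` is laid out as one leading digit, then a two-digit year, then a two-digit week. That layout is inferred from the column name and the values shown above, not confirmed by documentation. The names `weeks_demo`, `yr` and `wk` are invented for this example.

```python
import pandas as pd

# A few wm_yr_wk values as they appear in the outputs above.
weeks_demo = pd.DataFrame({"wm_yr_wk": [11101, 11105, 11109]})

code = weeks_demo["wm_yr_wk"].astype(str)

# Assumed layout: [leading digit][2-digit year][2-digit week].
weeks_demo["wm"] = code.str[0]
weeks_demo["yr"] = code.str[1:3].astype(int)
weeks_demo["wk"] = code.str[3:5].astype(int)

print(weeks_demo)
```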

### Train Validation
`sales_train_validation_afcs2020.csv`

Contains the historical daily unit sales per product for the CA_3 store [d_1 - d_1913]. This can be used as a training set for models.

In [11]:
df = pd.read_csv("sales_train_validation_afcs2020.csv")
df.head()

Unnamed: 0,id,d_1,d_2,d_3,d_4,d_5,d_6,d_7,d_8,d_9,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,HOBBIES_2_001_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HOBBIES_2_002_CA_3_validation,0,0,0,1,0,1,0,0,0,...,0,0,1,0,0,0,1,1,0,0
2,HOBBIES_2_003_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,3
3,HOBBIES_2_004_CA_3_validation,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,HOBBIES_2_005_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
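The wide layout (one column per day) is awkward to join with the calendar and price tables, which are keyed by `d`, `item_id` and `store_id`. A possible way to reshape it is sketched below on two rows and four days copied from the output above. The names `sales_demo`, `long_sales` and `units` are chosen here, and the regular expression assumes every `id` follows the pattern `<item_id>_<store_id>_validation`.

```python
import pandas as pd

# Two products and the first four day columns from the output above.
sales_demo = pd.DataFrame({
    "id": ["HOBBIES_2_001_CA_3_validation", "HOBBIES_2_002_CA_3_validation"],
    "d_1": [0, 0],
    "d_2": [0, 0],
    "d_3": [0, 0],
    "d_4": [0, 1],
})

# Wide -> long: one row per (product, day).
long_sales = sales_demo.melt(id_vars="id", var_name="d", value_name="units")

# Split the id into item and store, assuming "<item_id>_<store_id>_validation".
parts = long_sales["id"].str.extract(
    r"^(?P<item_id>.+)_(?P<store_id>[A-Z]{2}_\d+)_validation$"
)
long_sales = pd.concat([long_sales, parts], axis=1)

print(long_sales)
```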


### Train Evaluation
`sales_train_evaluation_afcs2020.csv`

Complete dataset that includes sales for [d_1 - d_1941]. The last 28 days, [d_1914 - d_1941], serve to evaluate models trained on the training sample (`sales_train_validation_afcs2020.csv`).

In [12]:
df_eval = pd.read_csv("sales_train_evaluation_afcs2020.csv")
df_eval.head()

Unnamed: 0,id,d_1,d_2,d_3,d_4,d_5,d_6,d_7,d_8,d_9,...,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940,d_1941
0,HOBBIES_2_001_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,HOBBIES_2_002_CA_3_validation,0,0,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,HOBBIES_2_003_CA_3_validation,0,0,0,0,0,0,0,0,0,...,1,0,4,1,0,3,0,1,2,2
3,HOBBIES_2_004_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,5,0,0,0,0,0,0,0,0
4,HOBBIES_2_005_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
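To use the last 28 days as a holdout, the day columns can be split by position. The sketch below uses a small synthetic frame with 60 days instead of 1941, so the values and the names `eval_demo`, `train_part` and `holdout_part` are purely illustrative. Applied to the real file, the same slicing would give d_1 to d_1913 and d_1914 to d_1941.

```python
import pandas as pd

HORIZON = 28   # evaluation window in days
N_DAYS = 60    # hypothetical; the real file has 1941 day columns

# Synthetic sales for two products.
cols = [f"d_{i}" for i in range(1, N_DAYS + 1)]
eval_demo = pd.DataFrame(
    [[i % 3 for i in range(N_DAYS)], [0] * N_DAYS],
    columns=cols,
)
eval_demo.insert(
    0, "id", ["HOBBIES_2_001_CA_3_validation", "HOBBIES_2_002_CA_3_validation"]
)

# Sort the day columns numerically so the split does not depend on file order.
day_cols = sorted(
    (c for c in eval_demo.columns if c.startswith("d_")),
    key=lambda c: int(c.split("_")[1]),
)

train_cols, holdout_cols = day_cols[:-HORIZON], day_cols[-HORIZON:]
train_part = eval_demo[["id"] + train_cols]
holdout_part = eval_demo[["id"] + holdout_cols]

print(train_cols[-1], holdout_cols[0], holdout_cols[-1])
```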


### Sample Submission
`sample_submission_afcs2020.csv`

The correct format for submissions. For the evaluation rows, this corresponds to [d_1942 - d_1969], the 28 forecast days (F1-F28).

* __Index__: id, the products in the store
* __Features__: 28 forecast days (F1 - F28)

In [13]:
sample = pd.read_csv("sample_submission_afcs2020.csv")
sample.head()

Unnamed: 0,id,F1,F2,F3,F4,F5,F6,F7,F8,F9,...,F19,F20,F21,F22,F23,F24,F25,F26,F27,F28
0,HOBBIES_2_001_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HOBBIES_2_002_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HOBBIES_2_003_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HOBBIES_2_004_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HOBBIES_2_005_CA_3_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
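To show how a frame in this format could be filled, here is a sketch with a deliberately naive baseline: every forecast day gets the product's mean over its last 28 observed days. This baseline is an assumption made for illustration, not a method used in this notebook, and the `history` frame is synthetic (56 days, two products).

```python
import pandas as pd

HORIZON = 28  # number of forecast days, F1 ... F28

# Synthetic history: the first product never sells, the second sells every other day.
history = pd.DataFrame({
    "id": ["HOBBIES_2_001_CA_3_validation", "HOBBIES_2_002_CA_3_validation"],
    **{f"d_{i}": [0, i % 2] for i in range(1, 57)},
})

day_cols = [c for c in history.columns if c.startswith("d_")]

# Naive baseline (illustrative only): mean of the last HORIZON observed days.
recent_mean = history[day_cols[-HORIZON:]].mean(axis=1)

# Build the submission layout: id followed by F1 ... F28.
submission = pd.DataFrame({"id": history["id"]})
for step in range(1, HORIZON + 1):
    submission[f"F{step}"] = recent_mean

print(submission.iloc[:, :5])
```

Writing it out with `submission.to_csv(..., index=False)` would avoid an extra unnamed index column like the `Unnamed: 0` one visible in the outputs above.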
