# Train dataset

Investigate the train dataset, which is similar to the test dataset but longer.

Every row in this dataset represents the number of sales and promotions per day, store, and product family.

In [37]:
# import libs
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

from definitions import TRAIN_FILE

# load data
train_df = pd.read_csv(TRAIN_FILE)
train_df['date'] = pd.to_datetime(train_df['date'])
train_df['store_nbr'] = train_df['store_nbr'].astype('category')
# clip outliers before building the report, so the report reflects the clipped values.
train_df['sales'] = train_df['sales'].clip(upper=2500)
train_df['onpromotion'] = train_df['onpromotion'].clip(upper=2500)

report = ProfileReport(train_df, title="Train dataset report.", infer_dtypes=False)

# save as an html file in case you would rather look at the report in your browser.
report.to_file('train_data_report.html')
report.to_notebook_iframe()


Observations:
- Sales and promotions contain many zeros, but no missing values.
- Sales and promotions have outliers (already clipped in this report).
- There is a mild correlation between product family and promotions.
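
The first observation can be checked without the full profile report. The following is a minimal sketch on a toy frame; `toy_df` and its values are made up for illustration and stand in for `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values, not the real data).
toy_df = pd.DataFrame({
    "sales": [0.0, 3.0, 0.0, 12.5],
    "onpromotion": [0, 0, 0, 2],
})

cols = ["sales", "onpromotion"]
zero_share = (toy_df[cols] == 0).mean()    # fraction of rows equal to 0, per column
missing_count = toy_df[cols].isna().sum()  # number of NaN values, per column

print(zero_share)
print(missing_count)
```

Running the same two lines on `train_df` should give the zero share and missing count that the report summarises.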

## Zero sales or promotions.
Let's replace the 0s in the sales and onpromotion columns with NaNs and see whether the missing-data matrix shows a pattern.
We also group the data by store, summing over product families, so that each column in the matrix is the full sales or onpromotion series of one store through time.
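
To make the target shape concrete, here is a small sketch of the same reshaping on toy data, using `pivot_table` instead of an explicit loop. The frame `toy_df`, its values, and the `family` labels are hypothetical; only the column names follow the train dataset.

```python
import pandas as pd

# Toy long-format data (hypothetical values): two stores, two families, two days.
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01"] * 4 + ["2013-01-02"] * 4),
    "store_nbr": [1, 1, 2, 2] * 2,
    "family": ["A", "B", "A", "B"] * 2,
    "sales": [0.0, 0.0, 5.0, 1.0, 2.0, 3.0, 0.0, 4.0],
    "onpromotion": [0, 0, 0, 0, 1, 0, 0, 0],
})

# Sum over product families, giving one column per (measure, store) pair.
wide = toy_df.pivot_table(
    index="date",
    columns="store_nbr",
    values=["sales", "onpromotion"],
    aggfunc="sum",
)
# Flatten the two-level column index to names like "sales_1".
wide.columns = [f"{measure}_{store}" for measure, store in wide.columns]

# Treat zeros as missing so they show up in a missing-data matrix.
wide = wide.mask(wide == 0)
print(wide)
```

Each row is a date and each column a store-level series, so a gap in a column marks a period in which that store had no sales or no promotions.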


In [39]:
data_by_store = []
for store_nbr, store_df in train_df[['date', 'store_nbr', 'sales', 'onpromotion']].groupby('store_nbr', observed=True):
    # sum over product families per day; select the numeric columns first so the
    # categorical store_nbr column is not passed to sum().
    store_df = store_df.groupby('date')[['sales', 'onpromotion']].sum()
    store_df = store_df.rename(columns={'sales': f'sales_{store_nbr}', 'onpromotion': f'onpromotion_{store_nbr}'})
    data_by_store.append(store_df)

grouped_by_store = pd.concat(data_by_store, axis=1)

# treat zero values (anything below 1) as missing.
grouped_by_store[grouped_by_store < 1] = np.nan
# alternative that only replaces exact zeros:
# grouped_by_store = grouped_by_store.replace(0, np.nan)

# create minimal (descriptives only) profile report.
report = ProfileReport(grouped_by_store, title="Train dataset report (per store).", minimal=True)
# manually enable matrix to visualise missing data.
report.config.missing_diagrams = {'bar': False, 'matrix': True, 'heatmap': False, 'dendrogram': False}

# save as an html file in case you would rather look at the report in your browser.
report.to_file('train_data_report_clipped.html')
report.to_notebook_iframe()


Observations:
 - Aggregated per store, the sales data doesn't seem to have many 0s.
 - The only 0s in the sales data occur when the store doesn't exist yet or is closed for a longer period; the same pattern can be seen in the transaction dataset.
 - There are almost no promotions before approximately mid-2014. After that, every store has pretty much continuous promotions.

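
The last observation can also be checked numerically by looking up the first date on which each store has a non-zero `onpromotion` value. This is a rough sketch on toy data; `toy_df` and its dates are invented for illustration.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values, not the real data).
toy_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-01-01", "2014-06-01", "2014-07-01",
        "2014-01-01", "2014-06-01", "2014-07-01",
    ]),
    "store_nbr": [1, 1, 1, 2, 2, 2],
    "onpromotion": [0, 3, 1, 0, 0, 2],
})

# Keep only rows with a non-zero promotion count,
# then take the earliest date per store.
first_promo = (
    toy_df[toy_df["onpromotion"] > 0]
    .groupby("store_nbr")["date"]
    .min()
)
print(first_promo)
```

Applied to `train_df`, the spread of these dates would show how closely the stores agree on when promotions start.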
 ## promotion - family correlation.