# Rossmann: Generic (Stage1 - Prepare)

## Intro

Kaggle competition: [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)

## Usage

**Input parameters**
1. RAW_TRAIN_CSV: Raw train data about sales
1. RAW_STORE_CSV: Raw data about stores
1. PROCESSED_TRAIN_CSV: The name of the output file used to store the processed train data

**Output**
1. A file with processed train data ready to be used for model training


## Setup env

### Set global variables

In [1]:
!pwd

/opt/shared/notebooks


In [2]:
DATASETS_DIR = '../data'

In [3]:
# this cell is tagged `parameters`
RAW_TRAIN_CSV = DATASETS_DIR + '/raw/train.csv'
RAW_STORE_CSV = DATASETS_DIR + '/raw/store.csv'
PROCESSED_TRAIN_CSV = DATASETS_DIR + '/processed/tst-train.csv'
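
Before running the rest of the notebook it can help to confirm that both input files exist and that the output directory is present. A minimal sketch, assuming a hypothetical helper `check_params` that is not part of the original notebook:

```python
from pathlib import Path


def check_params(raw_train_csv, raw_store_csv, processed_train_csv):
    """Hypothetical helper: validate the input paths and create the output directory."""
    # Collect every input path that does not point at an existing file.
    missing = [p for p in (raw_train_csv, raw_store_csv) if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing input file(s): {missing}")
    # Create the directory that will hold the processed CSV, if needed.
    out_dir = Path(processed_train_csv).parent
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

It could be called as `check_params(RAW_TRAIN_CSV, RAW_STORE_CSV, PROCESSED_TRAIN_CSV)` right after the parameters cell.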

###  Install required packages

If the notebook is run on a clean machine, install the required packages with the following commands ...

In [4]:
#!curl https://raw.githubusercontent.com/andrea-gioia/boostrap.ai/master/fastai07colab | bash

If the notebook is run inside a conda virtual environment in which all the packages listed in the requirements file are already installed, just run a check with the following commands ...

###  Dump environment

In [5]:
!python -V

Python 3.7.3


In [6]:
!conda env list

# conda environments:
#
base                     /opt/conda
custom                *  /opt/conda/envs/custom



In [8]:
conda env export

name: custom
channels:
  - pytorch
  - fastai
  - conda-forge
  - defaults
dependencies:
  - _py-xgboost-mutex=2.0=cpu_0
  - asn1crypto=0.24.0=py37_0
  - attrs=19.1.0=py37_1
  - backcall=0.1.0=py37_0
  - beautifulsoup4=4.7.1=py37_1
  - blas=1.0=mkl
  - bleach=3.1.0=py37_0
  - bottleneck=1.2.1=py37h035aef0_1
  - ca-certificates=2019.5.15=0
  - certifi=2019.3.9=py37_0
  - cffi=1.12.3=py37h2e261b9_0
  - chardet=3.0.4=py37_1
  - cryptography=2.6.1=py37h1ba5d50_0
  - cudatoolkit=10.0.130=0
  - cycler=0.10.0=py37_0
  - cymem=2.0.2=py37hfd86e86_0
  - cytoolz=0.9.0.1=py37h14c3975_1
  - dataclasses=0.6=py_0
  - dbus=1.13.6=h746ee38_0
  - decorator=4.4.0=py37_1
  - defusedxml=0.6.0=py_0
  - dill=0.2.9=py37_0
  - entrypoints=0.3=py37_0
  - expat=2.2.6=he6710b0_0
  - fastai=1.0.53.post2=1
  - fastprogress=0.1.21=py_0
  - fontconfig=2.13.0=h9420a91_0
  - freetype=2.9.1=h8a8886c_1
  - glib=2.56.2=hd408876_0
  - gmp=6.1.2=h6c8ec71_1
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48

In [9]:
#!conda list

In [10]:
#!pip list

### Import packages

In [11]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [12]:
from fastai.imports import *

# fastai v0.7
#from fastai07.structured import *

# fastai v1.0
from fastai.tabular import *

#from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from pandas.api.types import is_string_dtype, is_numeric_dtype

from sklearn import metrics

### Set random seed

In [48]:
# Set a seed value:
seed_value = 42


# Set `python` built-in pseudo-random generator at a fixed value: 
random.seed(seed_value) 

# Set `numpy` pseudo-random generator at a fixed value:
np.random.seed(seed_value) 

# Set `torch` pseudo-random generator at a fixed value:
torch.manual_seed(seed_value)
torch.backends.cudnn.deterministic = True 
torch.backends.cudnn.benchmark = False
    
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
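
A quick way to confirm that seeding takes effect is to reseed and compare the draws. The sketch below covers only `random` and `numpy`; `set_seed` is a hypothetical helper, and some GPU operations may remain nondeterministic even with the cuDNN flags set above.

```python
import random

import numpy as np


def set_seed(seed):
    """Hypothetical helper: seed Python's and NumPy's global generators."""
    random.seed(seed)
    np.random.seed(seed)


set_seed(42)
first = (random.random(), np.random.rand(3).tolist())

set_seed(42)
second = (random.random(), np.random.rand(3).tolist())

# Reseeding with the same value reproduces the same draws.
print(first == second)  # True
```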
    

### Define shared functions

In [14]:
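def display_all(df):
    # NOTE: this early return disables the function, so the display code below never runs.
    # Remove it to show up to 1000 rows and columns of `df`.
    return
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)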

In [15]:
def train_cats(df):
    for n,c in df.items():
        if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()
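
To make the effect of `train_cats` concrete, here is a small sketch on toy data (the values are illustrative, not taken from the Rossmann files). String columns become ordered categoricals whose categories are sorted alphabetically; numeric columns are left alone.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def train_cats(df):
    # Convert every string column to an ordered categorical, in place.
    for n, c in df.items():
        if is_string_dtype(c):
            df[n] = c.astype('category').cat.as_ordered()


toy = pd.DataFrame({'StoreType': ['c', 'a', 'b', 'a'], 'Sales': [10, 20, 30, 40]})
train_cats(toy)

print(list(toy['StoreType'].cat.categories))  # ['a', 'b', 'c']
print(toy['StoreType'].cat.ordered)           # True
print(list(toy['StoreType'].cat.codes))       # [2, 0, 1, 0]
```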

In [16]:
def numericalize(df, col, name, max_n_cat):
    if not is_numeric_dtype(col) and ( max_n_cat is None or len(col.cat.categories)>max_n_cat):
        df[name] = col.cat.codes+1
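
The `+1` in `numericalize` matters because pandas encodes a missing category as `-1`. After the shift, missing values map to `0` and real categories start at `1`. A small illustration on a toy series:

```python
import pandas as pd

# A categorical series with one missing value.
s = pd.Series(['a', None, 'c', 'a'], dtype='category')

print(list(s.cat.codes))      # [0, -1, 1, 0]
print(list(s.cat.codes + 1))  # [1, 0, 2, 1]
```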


In [17]:
def fix_missing(df, col, name, na_dict):
    if is_numeric_dtype(col):
        if pd.isnull(col).sum() or (name in na_dict):
            df[name+'_na'] = pd.isnull(col)
            filler = na_dict[name] if name in na_dict else col.median()
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return na_dict
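
As an illustration of `fix_missing`, the sketch below applies it to a toy column. The column name is borrowed from the store data; the values are made up. The missing entry is filled with the column median, a boolean `_na` indicator column is added, and the filler is recorded in `na_dict` so that the same value can be reused later.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fix_missing(df, col, name, na_dict):
    if is_numeric_dtype(col):
        if pd.isnull(col).sum() or (name in na_dict):
            df[name + '_na'] = pd.isnull(col)  # remember which rows were missing
            filler = na_dict[name] if name in na_dict else col.median()
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return na_dict


toy = pd.DataFrame({'CompetitionDistance': [100.0, None, 300.0]})
na_dict = fix_missing(toy, toy['CompetitionDistance'], 'CompetitionDistance', {})

print(toy['CompetitionDistance'].tolist())     # [100.0, 200.0, 300.0]
print(toy['CompetitionDistance_na'].tolist())  # [False, True, False]
```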

In [18]:
def proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None,
            preproc_fn=None, max_n_cat=None, subset=None, mapper=None):
    if not ignore_flds: ignore_flds=[]
    if not skip_flds: skip_flds=[]
    if subset: df = get_sample(df,subset)  # requires get_sample from fastai v0.7 (fastai07.structured), whose import is commented out above
    else: df = df.copy()
    ignored_flds = df.loc[:, ignore_flds]
    df.drop(ignore_flds, axis=1, inplace=True)
    if preproc_fn: preproc_fn(df)
    if y_fld is None: y = None
    else:
        if not is_numeric_dtype(df[y_fld]): df[y_fld] = df[y_fld].cat.codes
        y = df[y_fld].values
        skip_flds = skip_flds + [y_fld]  # build a new list so the caller's skip_flds is not mutated
    df.drop(skip_flds, axis=1, inplace=True)

    if na_dict is None: na_dict = {}
    else: na_dict = na_dict.copy()
    na_dict_initial = na_dict.copy()
    for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
    if len(na_dict_initial.keys()) > 0:
        df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
    if do_scale: mapper = scale_vars(df, mapper)  # requires scale_vars from fastai v0.7 (fastai07.structured), whose import is commented out above
    for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    df = pd.get_dummies(df, dummy_na=True)
    df = pd.concat([ignored_flds, df], axis=1)
    res = [df, y, na_dict]
    if do_scale: res = res + [mapper]
    return res

## Stage 1: prepare

### Load raw dataset

In [19]:
df_raw = pd.read_csv(RAW_TRAIN_CSV, low_memory=False, parse_dates=["Date"])

In [20]:
display_all(df_raw.tail().T)

In [21]:
display_all(df_raw.describe(include='all').T)

### Load store dataset

In [22]:
df_store = pd.read_csv(RAW_STORE_CSV, low_memory=False)

In [23]:
display_all(df_store.tail().T)

In [24]:
display_all(df_store.describe(include='all').T)

### Merge datasets (train+store)

In [25]:
df_raw = pd.merge(df_raw, df_store, on='Store')
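
`pd.merge` defaults to an inner join, so any training row whose `Store` is absent from the store table would be dropped silently. A hedged sketch of a stricter variant on toy data follows; `how='left'` and `validate` are not used in the original cell.

```python
import pandas as pd

train = pd.DataFrame({'Store': [1, 1, 2], 'Sales': [5263, 5020, 6064]})
store = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

# how='left' keeps every training row; validate raises a MergeError
# if a store id appears more than once in the store table.
merged = pd.merge(train, store, on='Store', how='left', validate='many_to_one')

# The row count must be unchanged by the merge.
assert len(merged) == len(train)
print(merged['StoreType'].tolist())  # ['c', 'c', 'a']
```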

In [26]:
display_all(df_raw.tail().T)

## Feature engineering

### Expand dates

In [27]:
#df_raw = df_raw.sort_values(['Date'])
dates = df_raw['Date']
add_datepart(df_raw, 'Date', drop=True)

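| | Store | DayOfWeek | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | ... | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 5263 | 555 | 1 | 1 | 0 | 1 | c | a | ... | 31 | 4 | 212 | True | False | False | False | False | False | 1438300800 |
| 1 | 1 | 4 | 5020 | 546 | 1 | 1 | 0 | 1 | c | a | ... | 30 | 3 | 211 | False | False | False | False | False | False | 1438214400 |
| 2 | 1 | 3 | 4782 | 523 | 1 | 1 | 0 | 1 | c | a | ... | 29 | 2 | 210 | False | False | False | False | False | False | 1438128000 |
| 3 | 1 | 2 | 5011 | 560 | 1 | 1 | 0 | 1 | c | a | ... | 28 | 1 | 209 | False | False | False | False | False | False | 1438041600 |
| 4 | 1 | 1 | 6102 | 612 | 1 | 1 | 0 | 1 | c | a | ... | 27 | 0 | 208 | False | False | False | False | False | False | 1437955200 |
| 5 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | c | a | ... | 26 | 6 | 207 | False | False | False | False | False | False | 1437868800 |
| 6 | 1 | 6 | 4364 | 500 | 1 | 0 | 0 | 0 | c | a | ... | 25 | 5 | 206 | False | False | False | False | False | False | 1437782400 |
| 7 | 1 | 5 | 3706 | 459 | 1 | 0 | 0 | 0 | c | a | ... | 24 | 4 | 205 | False | False | False | False | False | False | 1437696000 |
| 8 | 1 | 4 | 3769 | 503 | 1 | 0 | 0 | 0 | c | a | ... | 23 | 3 | 204 | False | False | False | False | False | False | 1437609600 |
| 9 | 1 | 3 | 3464 | 463 | 1 | 0 | 0 | 0 | c | a | ... | 22 | 2 | 203 | False | False | False | False | False | False | 1437523200 |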


In [28]:
display_all(df_raw.tail().T)

In [29]:
df_raw.Year.head()

0    2015
1    2015
2    2015
3    2015
4    2015
Name: Year, dtype: int64
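
`add_datepart` replaces the `Date` column with calendar features such as those in the output above. The following is a rough pandas-only sketch of a few of them on toy dates. It approximates the idea and is not fastai's implementation.

```python
import pandas as pd

toy = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-07-30'])})

d = toy['Date'].dt
toy['Year'] = d.year
toy['Month'] = d.month
toy['Day'] = d.day
toy['Dayofweek'] = d.dayofweek        # Monday = 0
toy['Dayofyear'] = d.dayofyear
toy['Is_month_end'] = d.is_month_end
# Seconds since the Unix epoch.
toy['Elapsed'] = (toy['Date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(toy.drop(columns='Date'))
```

For 2015-07-31 this yields `Dayofweek` 4, `Dayofyear` 212, and `Elapsed` 1438300800, matching row 0 of the output above.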

### Convert categorical features

In [30]:
train_cats(df_raw)

In [31]:
display_all(df_raw.describe(include='all').T)

The categories appear to be correctly ordered already ...

In [32]:
df_raw.StoreType.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [33]:
df_raw.Assortment.cat.categories

Index(['a', 'b', 'c'], dtype='object')

... so we do not need to reorder them; the call below is shown for reference only:


```
df_raw['StoreType'] = df_raw['StoreType'].cat.set_categories(['a', 'b', 'c', 'd'], ordered=True)
```



In [34]:
df_raw.StoreType.cat.codes[:5]

0    2
1    2
2    2
3    2
4    2
dtype: int8

### Handle missing values

In [35]:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
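
The expression above computes the fraction of missing values per column. A toy sketch (made-up data) showing the same computation next to the equivalent `isnull().mean()` shortcut:

```python
import pandas as pd

toy = pd.DataFrame({'A': [1.0, None, 3.0, None], 'B': ['x', 'y', None, 'z']})

# Count of missing values divided by the number of rows ...
frac_a = toy.isnull().sum().sort_index() / len(toy)
# ... equals the mean of the boolean missing-value mask.
frac_b = toy.isnull().mean().sort_index()

print(frac_a.tolist())  # [0.5, 0.25]
```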

In [36]:
df_raw.shape

(1017209, 30)

We'll replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable.

In [37]:
X_train, y_train, nas = proc_df(df_raw, 'Sales')

In [38]:
display_all(X_train.isnull().sum().sort_index()/len(X_train))

In [39]:
df_raw['Sales'][-5:]

1017204    4771
1017205    4540
1017206    4297
1017207    3697
1017208       0
Name: Sales, dtype: int64

In [40]:
y_train[-5:]

array([4771, 4540, 4297, 3697,    0])

### Save prepared data

In [41]:
X_train.shape, y_train.shape

((1017209, 34), (1017209,))

In [42]:
display_all(df_raw.tail().T)

In [43]:
y_train = pd.DataFrame(y_train,columns=["Sales"])
y_train[-5:]

| | Sales |
|---|---|
| 1017204 | 4771 |
| 1017205 | 4540 |
| 1017206 | 4297 |
| 1017207 | 3697 |
| 1017208 | 0 |


In [44]:
df_processed_train = pd.concat([X_train, y_train], axis=1)
df_processed_train = pd.concat([df_processed_train, dates], axis=1)
df_processed_train.shape, display_all(df_processed_train.tail().T)

((1017209, 36), None)

In [45]:
nas

{'CompetitionDistance': 2330.0,
 'CompetitionOpenSinceMonth': 8.0,
 'CompetitionOpenSinceYear': 2010.0,
 'Promo2SinceWeek': 22.0,
 'Promo2SinceYear': 2012.0}

In [46]:
df_processed_train = df_processed_train.sort_values(['Date'])
display_all(df_processed_train.tail().T)

In [47]:
df_processed_train.to_csv(PROCESSED_TRAIN_CSV)
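
By default `to_csv` writes the DataFrame index as an unnamed first column, and dates are stored as text. Below is a sketch of reading such a file back, shown on toy data with an in-memory buffer standing in for `PROCESSED_TRAIN_CSV`:

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {'Sales': [4771, 0], 'Date': pd.to_datetime(['2015-07-31', '2013-01-01'])},
    index=[5, 9],
).sort_values('Date')

buf = io.StringIO()
toy.to_csv(buf)  # the index is written as the first column
buf.seek(0)

# index_col=0 restores the original index; parse_dates restores the dtype.
restored = pd.read_csv(buf, index_col=0, parse_dates=['Date'])
print(restored.index.tolist())  # [9, 5]
```

A downstream stage that does not need the original row index could instead write with `index=False`.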

In [None]:
# END