# Making cutoffs
The idea of making cutoffs comes from cross-validation.

Intuitively, the model is unlikely to perform robustly and consistently if we always train it on data from January 2013 to July 2017 and validate it only on August 2017.

Thus, making cutoffs means cutting the long time range into several parts along the time axis and, for each part, training the model and checking its performance. This adds two hyperparameters: the training range and the testing range. For example, with a training range of 90 days and a testing range of 15 days, the model should "learn from the last 90 days and predict the following 15 days" independently for each cutoff.
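To make the windowing concrete, here is a minimal sketch that only computes the window boundaries, without touching any data. The helper name `cutoff_windows` and its parameters are hypothetical and used for illustration; each training window is taken as half-open, `[train_start, test_start)`, followed by a test window `[test_start, test_end)`.

```python
import pandas as pd

def cutoff_windows(start, end, train_days, test_days, stride_days=None):
    """Hypothetical helper: list (train_start, test_start, test_end) per cutoff."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    # Assumption: with no stride given, consecutive cutoffs are train_days apart.
    stride = pd.Timedelta(days=stride_days or train_days)
    train_len = pd.Timedelta(days=train_days)
    test_len = pd.Timedelta(days=test_days)

    windows = []
    train_start = start
    # Keep a cutoff only if its test window ends on or before the last date.
    while train_start + train_len + test_len <= end:
        test_start = train_start + train_len
        windows.append((train_start, test_start, test_start + test_len))
        train_start += stride
    return windows

windows = cutoff_windows("2013-01-01", "2013-12-31", train_days=90, test_days=15)
for train_start, test_start, test_end in windows:
    print(f"train [{train_start.date()}, {test_start.date()})  "
          f"test [{test_start.date()}, {test_end.date()})")
```

With a 90-day training range starting on 2013-01-01, the first test window starts on 2013-04-01, since January, February, and March 2013 add up to 90 days.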

In [1]:
import numpy as np
import pandas as pd

In [6]:
df = pd.read_csv('use_data.csv').drop(columns = 'Unnamed: 0')
df.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,oil_price,holiday
0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,93.14,0.0
1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,93.14,0.0
2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,93.14,0.0
3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13,93.14,0.0
4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,93.14,0.0


### Compose and test a function that makes the training and testing sets with cutoffs, then write it into utils for easy use

In [7]:
def make_cutoffs(df, train_day, test_day, stride=None):
    """Split df into rolling train/test windows ("cutoffs") along the date column.

    df: DataFrame with a 'date' column (converted to datetime in place).
    train_day / test_day: number of days in each training / testing window.
    stride: number of days between consecutive cutoffs; defaults to train_day.
    Returns two DataFrames (train, test), each with a 'cutoff' column holding
    the start date of the training window the row belongs to.
    """
    df['date'] = pd.to_datetime(df['date'])
    if not stride:
        stride = train_day
    d1, d_stop = df['date'].min(), df['date'].max()
    d2 = d1 + pd.DateOffset(days=train_day)
    d3 = d2 + pd.DateOffset(days=test_day)
    train_df_lis, test_df_lis = [], []
    # d2 and d3 are exclusive window ends; loop while d3 does not pass the last date
    while d3 <= d_stop:
        # get the train and test frames for one cutoff
        sub_train = df.loc[(df['date'] >= d1) & (df['date'] < d2)].copy()
        sub_test = df.loc[(df['date'] >= d2) & (df['date'] < d3)].copy()
        sub_train['cutoff'] = d1
        sub_test['cutoff'] = d1
        train_df_lis.append(sub_train)
        test_df_lis.append(sub_test)

        # move all three boundaries forward by one stride
        d1 = d1 + pd.DateOffset(days=stride)
        d2 = d1 + pd.DateOffset(days=train_day)
        d3 = d2 + pd.DateOffset(days=test_day)
    # pd.concat raises ValueError if no window fits inside the date range
    return pd.concat(train_df_lis), pd.concat(test_df_lis)


In [8]:
tr,ts = make_cutoffs(df, 90, 15, stride = False)
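A quick sanity check that may be worth running on the two returned frames: within every cutoff, the latest training date should come strictly before the earliest testing date, otherwise the split leaks future data. The sketch below uses a hypothetical helper, `check_no_leakage`, and a small hand-built example in place of the real `tr` and `ts`; it only assumes the `date` and `cutoff` columns.

```python
import pandas as pd

def check_no_leakage(train, test):
    """Hypothetical check: True if, per cutoff, all train dates precede all test dates."""
    train_end = train.groupby("cutoff")["date"].max()
    test_start = test.groupby("cutoff")["date"].min()
    # Align both series on the cutoff label; a cutoff missing on one side
    # becomes NaT, and the comparison below is then False for that cutoff.
    bounds = pd.concat({"train_end": train_end, "test_start": test_start}, axis=1)
    return bool((bounds["train_end"] < bounds["test_start"]).all())

# Toy frames standing in for the real train/test output.
cutoff = pd.Timestamp("2013-01-01")
toy_train = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-03-31"]),
    "cutoff": cutoff,
})
toy_test = pd.DataFrame({
    "date": pd.to_datetime(["2013-04-01", "2013-04-15"]),
    "cutoff": cutoff,
})
leaky_test = toy_test.assign(date=pd.to_datetime(["2013-03-31", "2013-04-15"]))

ok = check_no_leakage(toy_train, toy_test)
leaky = check_no_leakage(toy_train, leaky_test)
print(ok, leaky)
```

On the real data the same call would be `check_no_leakage(tr, ts)`.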

In [9]:
tr.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,oil_price,holiday,cutoff
0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,93.14,0.0,2013-01-01
1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,93.14,0.0,2013-01-01
2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,93.14,0.0,2013-01-01
3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13,93.14,0.0,2013-01-01
4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,93.14,0.0,2013-01-01


In [10]:
ts.head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,oil_price,holiday,cutoff
2970,2013-04-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,97.1,0.0,2013-01-01
2971,2013-04-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,97.1,0.0,2013-01-01
2972,2013-04-01,1,BEAUTY,1.0,0,Quito,Pichincha,D,13,97.1,0.0,2013-01-01
2973,2013-04-01,1,BEVERAGES,931.0,0,Quito,Pichincha,D,13,97.1,0.0,2013-01-01
2974,2013-04-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,97.1,0.0,2013-01-01
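Because both frames carry the same `cutoff` label, evaluating "independently for each cutoff" comes down to grouping on that column. The following is only a sketch under assumptions: `evaluate_by_cutoff` is a hypothetical helper, and the "model" is a placeholder baseline that predicts the mean training `sales` of each cutoff; any real model would replace the two lines marked in the comments.

```python
import pandas as pd

def evaluate_by_cutoff(train, test, target="sales"):
    """Hypothetical helper: mean absolute error of a mean-baseline, per cutoff."""
    scores = {}
    for cutoff, sub_train in train.groupby("cutoff"):
        sub_test = test[test["cutoff"] == cutoff]
        # Placeholder "fit": remember the mean of the training target.
        prediction = sub_train[target].mean()
        # Placeholder "predict and score": MAE of the constant prediction.
        scores[cutoff] = (sub_test[target] - prediction).abs().mean()
    return pd.Series(scores, name="mae")

# Toy frames with two cutoffs, standing in for the real train/test output.
c1, c2 = pd.Timestamp("2013-01-01"), pd.Timestamp("2013-04-01")
toy_train = pd.DataFrame({"cutoff": [c1, c1, c2, c2], "sales": [1.0, 3.0, 10.0, 20.0]})
toy_test = pd.DataFrame({"cutoff": [c1, c1, c2, c2], "sales": [2.0, 4.0, 15.0, 15.0]})

scores = evaluate_by_cutoff(toy_train, toy_test)
print(scores)
```

Looking at the spread of the per-cutoff scores, rather than a single validation number, is what gives a sense of how consistent the model is over time.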
