## Feature Engineering and CV, continued from eda_4_23

Generate new train and test sets. Features include:
- customer info in the second month
- products in the first month
- combination of first- and second-month `ind_actividad_cliente`
- combination of first- and second-month `tiprel_1mes`
- first-month products combined into a single binary number (`target_combine`)
- encoding of `target_combine` with
    - mean number of new products
    - mean number of customers with new products
    - mean number of customers with each new product
- pattern counts over the last `max_lag` months
- number of months since the customer last purchased each product
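
The `target_combine` feature and its mean encodings can be illustrated with a minimal sketch. The product names, customer ids, and `n_new` values below are made up for illustration, and reading "mean number of customers with new products" as the share of customers who add at least one product is an assumption, not something the helper module is known to do in exactly this way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: first-month ownership flags for three made-up products
products = ['prod_a', 'prod_b', 'prod_c']
first_month = pd.DataFrame(
    [[0, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]],
    columns=products,
    index=pd.Index([101, 102, 103, 104, 105], name='ncodpers'),
)
# Hypothetical number of products each customer adds in the second month
n_new = pd.Series([1, 0, 2, 1, 0], index=first_month.index, name='n_new')

# Read each row of 0/1 flags as a binary number (first column = most significant bit)
weights = 2 ** np.arange(len(products) - 1, -1, -1)
target_combine = pd.Series(
    first_month.values.dot(weights), index=first_month.index, name='target_combine'
)

# Mean encoding: average number of new products for each target_combine value
mean_new_products = n_new.groupby(target_combine).mean()
# Share of customers with at least one new product for each target_combine value
share_with_new = (n_new > 0).groupby(target_combine).mean()

print(target_combine.tolist())
print(mean_new_products)
print(share_with_new)
```

Customers 102 and 103 own the same first-month products, so they share one `target_combine` value and their outcomes are averaged together in both encodings.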


#### CV@2015-12-28:
- benchmark: val = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`: val mlogloss = 1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val = 1.31122
- Private score: 0.0302475, public score: 0.0299266
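
The validation values above appear to be multi-class log loss (`mlogloss`), the metric set as `eval_metric` in the training cell. As a rough sketch of what that number measures, the function below averages the negative log of the probability assigned to the true class. The name `multiclass_logloss` and the clipping constant are illustrative choices, not XGBoost's internals.

```python
import numpy as np

def multiclass_logloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    y_true = np.asarray(y_true)
    # Clip to avoid log(0) when a class receives zero probability
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_proba))

# Two samples, two classes: the true classes receive 0.5 and 0.75
loss = multiclass_logloss([0, 1], [[0.5, 0.5], [0.25, 0.75]])
print(loss)
```

Lower is better: a model that puts all its probability on the correct class for every row would score 0.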

In [1]:
from santander_helper import *

In [50]:
def count_zeros(month1, max_lag):
    if os.path.exists('../input/count_zeros_{}_{}.hdf'.format(month1, max_lag)):
        df = pd.read_hdf('../input/count_zeros_{}_{}.hdf'.format(month1, max_lag), 
            'count_zeros')
        
        return df
    else:
        month_new = month_list.index(month1)+1
        month_end = month_list.index(month1)
        month_start = month_end-max_lag+1
        
        # Only when month_new is not the last month (i.e. this is not the test set)
        if month_new<len(month_list)-1:
            # Customers with new products in month_new
            customer_product_pair = pd.read_hdf('../input/customer_product_pair.hdf', 'customer_product_pair')
            ncodpers_list = customer_product_pair.loc[customer_product_pair.fecha_dato==month_list[month_new], 
                'ncodpers'].unique().tolist()

        # Load data for all the lag related months
        df = []
        for m in range(month_start, month_end+1):
            df.append(pd.read_hdf('../input/data_month_{}.hdf'.format(month_list[m]), 'data_month'))

        # concatenate data
        df = pd.concat(df, ignore_index=True)
        df = df.loc[:, ['ncodpers', 'fecha_dato']+target_cols]
        if month_new<len(month_list)-1:
            # select customers if this is not test set
            df = df.loc[df.ncodpers.isin(ncodpers_list), :]
        # set ncodpers and fecha_dato as index
        df.set_index(['ncodpers', 'fecha_dato'], inplace=True)
        # unstack to make month as columns
        df = df.unstack(level=-1, fill_value=0)

        # count the number of consecutive zeros leading up to the second/current month
        df = df.groupby(level=0, axis=1).progress_apply(lambda x: (1-x).iloc[:, ::-1].cummin(axis=1).sum(axis=1))
        df.columns = [k+'_zc' for k in df.columns]
        
        gc.collect()
        
        df.to_hdf('../input/count_zeros_{}_{}.hdf'.format(month1, max_lag), 'count_zeros')
        
        return df

### Zero counting function

For each (customer, product) pair, count the number of consecutive months immediately before the current month in which the target is zero. The count considers only the `max_lag` months before the current month.

The function has been moved to `santander_helper.py`.
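
A small worked example of the counting step, using the same reverse, `cummin`, and `sum` idea as `count_zeros` but on a made-up ownership history for a single product (the customer ids and months are hypothetical):

```python
import pandas as pd

# Hypothetical history for one product: rows are customers, columns are months (oldest first)
history = pd.DataFrame(
    [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]],
    columns=['2015-02-28', '2015-03-28', '2015-04-28', '2015-05-28'],
    index=pd.Index([101, 102, 103, 104], name='ncodpers'),
)

zeros = 1 - history                    # 1 where the product is not owned
newest_first = zeros.iloc[:, ::-1]     # reverse so the most recent month comes first
# cummin stays at 1 until the first owned month, then is 0 for all older months
unbroken_run = newest_first.cummin(axis=1)
zero_count = unbroken_run.sum(axis=1)  # length of the trailing run of zeros

print(zero_count.tolist())  # [3, 1, 0, 4]
```

Customer 103 owns the product in the most recent month, so the count is 0. Customer 104 never owns it within the window, so the count equals the window length.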

In [2]:
x_train, y_train = create_train('2015-06-28', pattern_flag=True)

100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 87.23it/s]



Start counting patterns:


In [3]:
x_val, y_val = create_train('2015-12-28', pattern_flag=True)

100%|█████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 108.87it/s]



Start counting patterns:


In [4]:
x_test = create_test(pattern_flag=True)

100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:03<00:00,  5.51it/s]



Start counting patterns:


In [5]:
x_test.shape

(929615, 158)

In [6]:
x_val.shape

(46148, 158)

In [7]:
x_train.shape

(45140, 158)

In [8]:
y_train.shape

(45140,)

In [9]:
y_val.shape

(46148,)

### Train model

In [None]:
param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 8, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 50

dtrain = xgb.DMatrix(x_train.values, y_train.values)
dval = xgb.DMatrix(x_val.values, y_val.values)
model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train'), (dval, 'val')], verbose_eval=True)

[0]	train-mlogloss:2.70803	val-mlogloss:2.73915
[1]	train-mlogloss:2.53512	val-mlogloss:2.58417
[2]	train-mlogloss:2.39652	val-mlogloss:2.45986
[3]	train-mlogloss:2.28028	val-mlogloss:2.35618
[4]	train-mlogloss:2.18082	val-mlogloss:2.27307
[5]	train-mlogloss:2.09362	val-mlogloss:2.19459
[6]	train-mlogloss:2.01652	val-mlogloss:2.13468
[7]	train-mlogloss:1.94695	val-mlogloss:2.07281
[8]	train-mlogloss:1.88421	val-mlogloss:2.01761
[9]	train-mlogloss:1.82663	val-mlogloss:1.97114
[10]	train-mlogloss:1.7733	val-mlogloss:1.9244
[11]	train-mlogloss:1.72421	val-mlogloss:1.8813
[12]	train-mlogloss:1.67889	val-mlogloss:1.84189
[13]	train-mlogloss:1.63697	val-mlogloss:1.80524


In [None]:
preds = model.predict(xgb.DMatrix(x_test.values))

df_preds = pd.DataFrame(preds, index=x_test.index, columns=target_cols)
# Zero out products the customer already owns so they are never recommended
df_preds[x_test[target_cols]==1] = 0 
preds = df_preds.values
preds = np.argsort(preds, axis=1)
preds = np.fliplr(preds)[:, :7]

In [None]:
test_id = x_test.loc[:, 'ncodpers'].values
final_preds = [' '.join([target_cols[k] for k in pred]) for pred in preds]

out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('eda_4_24.csv.gz', compression='gzip', index=False)
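
The masking and top-k selection used to build the submission can be checked on a toy example. The sketch below uses made-up product names and probabilities, and keeps the top 2 rather than the top 7. It sorts on negated scores instead of `argsort` followed by `fliplr`; the two approaches should agree whenever there are no tied scores among the selected products.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities and current ownership for two customers
products = ['prod_a', 'prod_b', 'prod_c', 'prod_d']
proba = pd.DataFrame(
    [[0.10, 0.40, 0.30, 0.20],
     [0.70, 0.10, 0.15, 0.05]],
    columns=products, index=[101, 102],
)
owned = pd.DataFrame(
    [[0, 1, 0, 0],
     [1, 0, 0, 1]],
    columns=products, index=[101, 102],
)

# Set the score of already-owned products to 0 so they sort last
masked = proba.mask(owned == 1, 0.0)

# Indices of the top_k highest-scoring products per customer, best first
top_k = 2
order = np.argsort(-masked.values, axis=1)[:, :top_k]

# Space-separated product names, one string per customer
recommended = [' '.join(products[j] for j in row) for row in order]
print(recommended)  # ['prod_c prod_d', 'prod_c prod_b']
```

Customer 101's highest raw score is for `prod_b`, but since that product is already owned it drops out of the recommendation.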