## Feature Engineering and CV, continued from eda_4_22

The train and test sets are regenerated with the following features:
- customer info in the second month
- products held in the first month
- combination of first- and second-month `ind_actividad_cliente`
- combination of first- and second-month `tiprel_1mes`
- first-month products combined into a single binary number (`target_combine`)
- `target_combine` encoded with
    - mean number of new products
    - mean number of customers with new products
    - mean number of customers with each new product
- pattern counts over the last `max_lag` months
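
As a rough illustration of the `target_combine` features listed above, the sketch below reads a row of 0/1 product flags as a binary number and then mean-encodes that number with the count of new products. The column names (`prod_a`, `prod_b`, `prod_c`), the toy values, and the `n_new` series are assumptions made for this example; the actual implementation lives in `santander_helper.py` and may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three product flags for four customers in the first month.
product_cols = ['prod_a', 'prod_b', 'prod_c']  # stand-ins for the real target columns
first_month = pd.DataFrame(
    {'prod_a': [1, 0, 1, 0], 'prod_b': [0, 1, 0, 1], 'prod_c': [1, 0, 1, 0]},
    index=[101, 102, 103, 104])

# Read each row of 0/1 flags as the digits of a binary number.
weights = np.array([2 ** i for i in range(len(product_cols) - 1, -1, -1)])  # [4, 2, 1]
target_combine = first_month[product_cols].dot(weights).rename('target_combine')

# Hypothetical number of new products each customer added in the second month.
n_new = pd.Series([1, 0, 2, 1], index=first_month.index, name='n_new')

# Mean encoding: average number of new products for each `target_combine` value.
mean_new = n_new.groupby(target_combine).mean()
encoded = target_combine.map(mean_new).rename('target_combine_mean_new')

print(pd.concat([target_combine, encoded], axis=1))
```

In a real pipeline the mapping (`mean_new` here) would presumably be computed on the training month only and then applied to the validation and test sets, so that no target information leaks into them.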


#### CV@2015-12-28:
- benchmark: val mlogloss = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`: val mlogloss = 1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val mlogloss = 1.31122
- Private score: 0.0302475, public score: 0.0299266

In [1]:
from santander_helper import *

### New pattern counting function

For the train sets, count patterns only for customers who add new products in the next (second) month.
For the test set, count patterns for all customers.
The pattern counts are saved for later use.

The function has been moved to santander_helper.py.
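
The notebook does not show what a "pattern" is, so the following is only one plausible reading: for a single product, count how often each month-to-month transition (0→0, 0→1, 1→0, 1→1) occurs within the last `max_lag` months. The function name `count_transitions`, the `p00`…`p11` column names, and the toy history are all hypothetical and are not taken from `santander_helper.py`.

```python
import pandas as pd

def count_transitions(history, max_lag):
    """Count 0/1 transitions per customer over the last `max_lag` months.

    `history` is a customers-by-months DataFrame of 0/1 flags for one product,
    with columns ordered from the oldest to the newest month.
    """
    recent = history.iloc[:, -max_lag:]
    prev = recent.iloc[:, :-1].values   # month t
    curr = recent.iloc[:, 1:].values    # month t + 1
    counts = {}
    for a in (0, 1):
        for b in (0, 1):
            # Number of consecutive month pairs that go from state a to state b.
            counts['p{}{}'.format(a, b)] = ((prev == a) & (curr == b)).sum(axis=1)
    return pd.DataFrame(counts, index=history.index)

# Hypothetical five-month history of one product for two customers.
history = pd.DataFrame(
    [[0, 0, 1, 1, 0], [1, 1, 1, 1, 1]],
    index=[101, 102],
    columns=['2015-01', '2015-02', '2015-03', '2015-04', '2015-05'])

patterns = count_transitions(history, max_lag=4)
print(patterns)
```

Under this reading, restricting the train-set counts to customers who add products would amount to filtering `history` by customer ID before calling the function.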

In [5]:
x_train, y_train = create_train('2015-06-28', pattern_flag=True)


Start counting patterns:


In [8]:
x_val, y_val = create_train('2015-12-28', pattern_flag=True)


Start counting patterns:


In [9]:
x_test = create_test(pattern_flag=True)


Start counting patterns:


### Train model

In [10]:
# Multi-class XGBoost: one class per product in target_cols
param = {'objective': 'multi:softprob',
         'eta': 0.05,
         'max_depth': 8,
         'silent': 1,  # deprecated in newer XGBoost releases in favor of 'verbosity'
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 50

# Train on the 2015-06-28 set and monitor mlogloss on the 2015-12-28 validation set
dtrain = xgb.DMatrix(x_train.values, y_train.values)
dval = xgb.DMatrix(x_val.values, y_val.values)
model = xgb.train(param, dtrain, num_rounds,
                  evals=[(dtrain, 'train'), (dval, 'val')],
                  verbose_eval=True)

[0]	train-mlogloss:2.70705	val-mlogloss:2.73845
[1]	train-mlogloss:2.53978	val-mlogloss:2.59421
[2]	train-mlogloss:2.40165	val-mlogloss:2.47167
[3]	train-mlogloss:2.2863	val-mlogloss:2.36808
[4]	train-mlogloss:2.18682	val-mlogloss:2.29041
[5]	train-mlogloss:2.10001	val-mlogloss:2.21372
[6]	train-mlogloss:2.02277	val-mlogloss:2.14565
[7]	train-mlogloss:1.95338	val-mlogloss:2.09042
[8]	train-mlogloss:1.89002	val-mlogloss:2.03509
[9]	train-mlogloss:1.83277	val-mlogloss:1.98523
[10]	train-mlogloss:1.77963	val-mlogloss:1.93927
[11]	train-mlogloss:1.73166	val-mlogloss:1.89692
[12]	train-mlogloss:1.68653	val-mlogloss:1.85732
[13]	train-mlogloss:1.64425	val-mlogloss:1.82003
[14]	train-mlogloss:1.60524	val-mlogloss:1.78635
[15]	train-mlogloss:1.56854	val-mlogloss:1.75496
[16]	train-mlogloss:1.53421	val-mlogloss:1.72568
[17]	train-mlogloss:1.50157	val-mlogloss:1.70161
[18]	train-mlogloss:1.47061	val-mlogloss:1.67457
[19]	train-mlogloss:1.4418	val-mlogloss:1.65358
[20]	train-mlogloss:1.41461	val-

In [11]:
preds = model.predict(xgb.DMatrix(x_test.values))

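# One column of predicted probabilities per product
df_preds = pd.DataFrame(preds, index=x_test.index, columns=target_cols)
# Remove already bought products so they are never recommended
df_preds[x_test[target_cols] == 1] = 0
preds = df_preds.values
# Sort each row by descending probability and keep the indices of the top 7 products
preds = np.argsort(preds, axis=1)
preds = np.fliplr(preds)[:, :7]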
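
To make the masking and ranking step concrete, here is a small self-contained sketch on toy data. The product names, probabilities, and ownership flags are invented for illustration, and `top_k = 2` stands in for the 7 used above. It sorts the negated scores instead of using `argsort` followed by `fliplr`, which gives the same ranking whenever there are no ties.

```python
import numpy as np
import pandas as pd

cols = ['prod_a', 'prod_b', 'prod_c', 'prod_d']  # hypothetical product names
# Hypothetical predicted probabilities for two customers.
probs = pd.DataFrame([[0.15, 0.6, 0.2, 0.05],
                      [0.5, 0.1, 0.3, 0.1]], index=[101, 102], columns=cols)
# Hypothetical flags for products each customer already holds.
owned = pd.DataFrame([[0, 1, 0, 0],
                      [0, 0, 0, 1]], index=[101, 102], columns=cols)

# Zero out products the customer already holds, then rank the rest.
masked = probs.mask(owned == 1, 0.0)
top_k = 2
order = np.argsort(-masked.values, axis=1)[:, :top_k]
recommended = [[cols[j] for j in row] for row in order]

print(recommended)
```

Customer 101 already holds `prod_b`, so its high score of 0.6 is ignored and the recommendation falls back to the next-best products.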

In [12]:
test_id = x_test.loc[:, 'ncodpers'].values
final_preds = [' '.join([target_cols[k] for k in pred]) for pred in preds]

out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('eda_4_23.csv.gz', compression='gzip', index=False)