## Feature Engineering and CV, continued from eda_4_21

New train and test set generation. Features include:
- customer information from the second month
- products held in the first month
- combination of the first- and second-month `ind_actividad_cliente`
- combination of the first- and second-month `tiprel_1mes`
- combination of the first-month products, encoded as a binary number (`target_combine`)
- encodings of `target_combine` with
    - the mean number of new products
    - the mean number of customers with new products
    - the mean number of customers with each new product
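The binary combination and its mean encoding can be sketched on toy data. This is only an illustration of the idea, not the code in `santander_helper`: the column names `prod_a`, `prod_b`, `prod_c`, `n_new` and `target_combine_mean_new` are made up, and the bit order is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three first-month product flags and the number of
# products each customer added in the second month.
product_cols = ['prod_a', 'prod_b', 'prod_c']  # stand-ins for the real product columns
df = pd.DataFrame({
    'prod_a': [1, 0, 1, 1],
    'prod_b': [0, 0, 1, 0],
    'prod_c': [1, 1, 0, 1],
    'n_new':  [0, 2, 1, 1],
})

# Read the flags as the bits of a binary number. The first column is taken as
# the most significant bit (an assumption), so (1, 0, 1) -> 0b101 = 5.
weights = 2 ** np.arange(len(product_cols))[::-1]
df['target_combine'] = df[product_cols].values.dot(weights)

# Mean encoding: replace each combination by the average number of new
# products observed among customers holding that combination.
mean_new = df.groupby('target_combine')['n_new'].mean()
df['target_combine_mean_new'] = df['target_combine'].map(mean_new)

print(df[['target_combine', 'target_combine_mean_new']])
```

In a real pipeline the group means would be computed on the training month only and then mapped onto the validation and test rows, so that the encoding does not leak the labels being predicted.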


#### CV@2015-12-28:
- benchmark: val = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`: mlogloss = 1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val = 1.31122
- Private score: 0.0302475, public score: 0.0299266
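
The validation figures above are multiclass log loss (`mlogloss`), the `eval_metric` used for training in this notebook. As a rough sketch of what that number measures, the helper below (a hypothetical function, not part of XGBoost or `santander_helper`) averages the negative log of the probability assigned to each row's true class.

```python
import numpy as np

def mlogloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class (illustrative helper)."""
    # Clip to avoid log(0) when a class is given zero probability.
    proba = np.clip(proba, eps, 1 - eps)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(proba[rows, y_true]))

# Toy example: three rows, three classes.
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
score = mlogloss(y_true, proba)
print(score)
```

A uniform prediction over K classes scores ln K under this definition, which gives a reference point for reading the values above: lower is better, and a model that scores below ln K is doing better than guessing uniformly.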

In [1]:
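# Wildcard import: the helper is expected to supply create_train, create_test,
# target_cols, and the xgb / pd / np aliases used in the cells below
from santander_helper import *
%matplotlib inline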

### Train and test data sets

In [2]:
x_train, y_train = create_train('2015-06-28')

100%|█████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 113.79it/s]


In [3]:
x_val, y_val = create_train('2015-12-28')

100%|█████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 184.48it/s]


In [4]:
x_test = create_test()

100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:02<00:00,  7.73it/s]


## Train model


In [5]:
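# Multiclass XGBoost: one class per product in target_cols, scored with multiclass log loss
param = {'objective': 'multi:softprob',
         'eta': 0.05,
         'max_depth': 8,
         'silent': 1,  # newer XGBoost releases use 'verbosity' instead
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 50

# Train on 2015-06-28 and monitor the 2015-12-28 validation set after every round
dtrain = xgb.DMatrix(x_train.values, y_train.values)
dval = xgb.DMatrix(x_val.values, y_val.values)
model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train'), (dval, 'val')], verbose_eval=True)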

[0]	train-mlogloss:2.72219	val-mlogloss:2.74082
[1]	train-mlogloss:2.56442	val-mlogloss:2.60281
[2]	train-mlogloss:2.42179	val-mlogloss:2.47642
[3]	train-mlogloss:2.3052	val-mlogloss:2.37478
[4]	train-mlogloss:2.2042	val-mlogloss:2.28715
[5]	train-mlogloss:2.11638	val-mlogloss:2.21133
[6]	train-mlogloss:2.03799	val-mlogloss:2.15155
[7]	train-mlogloss:1.96828	val-mlogloss:2.09571
[8]	train-mlogloss:1.90473	val-mlogloss:2.04013
[9]	train-mlogloss:1.84744	val-mlogloss:1.99022
[10]	train-mlogloss:1.79351	val-mlogloss:1.94516
[11]	train-mlogloss:1.74622	val-mlogloss:1.90364
[12]	train-mlogloss:1.69986	val-mlogloss:1.86328
[13]	train-mlogloss:1.65765	val-mlogloss:1.82695
[14]	train-mlogloss:1.61949	val-mlogloss:1.79345
[15]	train-mlogloss:1.58177	val-mlogloss:1.76408
[16]	train-mlogloss:1.54737	val-mlogloss:1.73608
[17]	train-mlogloss:1.51506	val-mlogloss:1.7094
[18]	train-mlogloss:1.48489	val-mlogloss:1.68581
[19]	train-mlogloss:1.45596	val-mlogloss:1.66122
[20]	train-mlogloss:1.42857	val-m

Prediction from my model

In [6]:
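# Class probabilities for every test customer, one column per product
preds = model.predict(xgb.DMatrix(x_test.values))

df_preds = pd.DataFrame(preds, index=x_test.index, columns=target_cols)
# Zero out products the customer already owns so they rank last
df_preds[x_test[target_cols]==1] = 0
preds = df_preds.values
# Sort each row by probability (ascending), flip to descending, keep the top 7 indices
preds = np.argsort(preds, axis=1)
preds = np.fliplr(preds)[:, :7]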
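
The masking and ranking step can be checked on a small example. The sketch below uses hypothetical product names and probabilities, keeps the top 2 instead of the top 7, and sorts on the negated scores rather than `argsort` plus `fliplr`; the two approaches give the same ranking except possibly in the order of exact ties.

```python
import numpy as np
import pandas as pd

cols = ['prod_a', 'prod_b', 'prod_c', 'prod_d']  # hypothetical product names
proba = pd.DataFrame([[0.5, 0.3, 0.15, 0.05],
                      [0.1, 0.2, 0.3, 0.4]], columns=cols)
owned = pd.DataFrame([[1, 0, 0, 0],
                      [0, 0, 0, 1]], columns=cols)

# Zero out products the customer already holds so they rank last.
masked = proba.mask(owned == 1, 0)

# argsort on the negated scores yields column indices in descending order of probability.
top_k = 2
top_idx = np.argsort(-masked.values, axis=1)[:, :top_k]
top_names = [[cols[k] for k in row] for row in top_idx]
print(top_names)
```

The first customer already owns `prod_a`, so the highest raw probability is skipped and the recommendation starts with `prod_b`; the same happens to `prod_d` for the second customer.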

Write out prediction results from my model

In [7]:
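test_id = x_test.loc[:, 'ncodpers'].values
# Map each customer's ranked class indices back to product names, space-separated
final_preds = [' '.join([target_cols[k] for k in pred]) for pred in preds]

# One row per customer: the customer id and the recommended products
out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('eda_4_22.csv.gz', compression='gzip', index=False)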