## Feature Engineering and CV, continued from eda_4_21

This notebook regenerates the train and test sets. Features include:
- customer information from the second month
- products in the first month
- the combination of first- and second-month `ind_actividad_cliente`
- the combination of first- and second-month `tiprel_1mes`
- the first-month products combined into a single binary number (`target_combine`)
- encodings of `target_combine` by:
    - the mean number of new products
    - the mean number of customers with new products
    - the mean number of customers with each new product
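
To make the `target_combine` idea concrete, here is a minimal sketch on toy data. It is not the code in `santander_helper`; the product names, the `n_new` series and the `target_combine_mean_new` column are hypothetical, and the bit order (first column as the most significant bit) is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three product flags held in the first month.
product_cols = ['prod_a', 'prod_b', 'prod_c']
first_month = pd.DataFrame(
    {'prod_a': [1, 0, 1, 1], 'prod_b': [0, 1, 0, 0], 'prod_c': [1, 1, 1, 0]},
    index=pd.Index([101, 102, 103, 104], name='ncodpers'))

# Read each row of 0/1 flags as a binary number (prod_a is the most significant bit).
weights = 2 ** np.arange(len(product_cols))[::-1]
first_month['target_combine'] = first_month[product_cols].values.dot(weights)

# Hypothetical number of products each customer added in the second month.
n_new = pd.Series([1, 0, 2, 0], index=first_month.index, name='n_new')

# Mean encoding: average number of new products for each target_combine value.
mean_new = n_new.groupby(first_month['target_combine']).mean()
first_month['target_combine_mean_new'] = first_month['target_combine'].map(mean_new)
```

Customers 101 and 103 share the same product set, so they get the same code and the same encoded value. In a real pipeline the means would normally be computed on the training month only and then mapped onto validation and test rows, so that validation targets do not leak into the feature.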


#### CV@2015-12-28:
- benchmark: val = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`: val mlogloss = 1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val = 1.31122
- Private score: 0.0302475, public score: 0.0299266
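
For reference, the multi-class log loss that `mlogloss` reports is the mean negative log of the probability assigned to the true class. The sketch below uses made-up labels and probabilities and a hypothetical helper name, `multiclass_logloss`; it only illustrates the formula and is not the evaluation code used for the numbers above.

```python
import numpy as np

def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log-probability of the true class."""
    probs = np.clip(probs, eps, 1 - eps)             # avoid log(0)
    picked = probs[np.arange(len(y_true)), y_true]   # probability of each row's true class
    return -np.mean(np.log(picked))

# Made-up example: 3 samples, 3 classes.
y_true = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
loss = multiclass_logloss(y_true, probs)

# A uniform prediction over K classes scores log(K), a handy baseline.
uniform_loss = multiclass_logloss(np.array([0, 1]), np.full((2, 4), 0.25))
```

Lower is better, and a model that spreads probability evenly over K classes scores log(K).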

In [1]:
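# The star import is expected to supply create_train, create_test, target_cols
# and the np / pd / xgb aliases used below.
from santander_helper import *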
%matplotlib inline

### Train and test data sets

In [2]:
x_train, y_train = create_train('2015-06-28')

In [3]:
x_val, y_val = create_train('2015-12-28')

In [4]:
x_test = create_test()

## Train model


In [5]:
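# Multi-class model: one output probability per product in target_cols
param = {'objective': 'multi:softprob',
         'eta': 0.05,
         'max_depth': 8,
         'silent': 1,  # quiet logging in older xgboost; newer releases use 'verbosity'
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 50

# Train on 2015-06-28 and monitor mlogloss on the 2015-12-28 validation set
dtrain = xgb.DMatrix(x_train.values, label=y_train.values)
dval = xgb.DMatrix(x_val.values, label=y_val.values)
model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train'), (dval, 'val')], verbose_eval=True)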

[0]	train-mlogloss:2.76452	val-mlogloss:2.77824
[1]	train-mlogloss:2.6277	val-mlogloss:2.66856
[2]	train-mlogloss:2.51632	val-mlogloss:2.56824
[3]	train-mlogloss:2.41917	val-mlogloss:2.4831
[4]	train-mlogloss:2.33502	val-mlogloss:2.40834
[5]	train-mlogloss:2.25995	val-mlogloss:2.34964
[6]	train-mlogloss:2.19215	val-mlogloss:2.2927
[7]	train-mlogloss:2.13084	val-mlogloss:2.2422
[8]	train-mlogloss:2.07419	val-mlogloss:2.19261
[9]	train-mlogloss:2.02298	val-mlogloss:2.14833
[10]	train-mlogloss:1.97543	val-mlogloss:2.10655
[11]	train-mlogloss:1.93209	val-mlogloss:2.06806
[12]	train-mlogloss:1.89061	val-mlogloss:2.03448
[13]	train-mlogloss:1.85299	val-mlogloss:2.00593
[14]	train-mlogloss:1.81786	val-mlogloss:1.97612
[15]	train-mlogloss:1.78488	val-mlogloss:1.94731
[16]	train-mlogloss:1.75323	val-mlogloss:1.9238
[17]	train-mlogloss:1.72337	val-mlogloss:1.89896
[18]	train-mlogloss:1.69568	val-mlogloss:1.87588
[19]	train-mlogloss:1.66876	val-mlogloss:1.85434
[20]	train-mlogloss:1.64356	val-mlo

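Predict on the test set, zero out products each customer already has, and keep the seven highest-scoring products per customer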

In [6]:
preds = model.predict(xgb.DMatrix(x_test.values))

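df_preds = pd.DataFrame(preds, index=x_test.index, columns=target_cols)
# Zero the score of already bought products so they are never recommended
df_preds[x_test[target_cols] == 1] = 0
preds = df_preds.values
# Sort class indices by descending score and keep the top 7 per customer
preds = np.argsort(preds, axis=1)
preds = np.fliplr(preds)[:, :7]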
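
The masking and top-k selection can be checked on a tiny example. This is a sketch with hypothetical product names and made-up scores. It uses `DataFrame.mask` and sorts on the negated scores, which gives the same ranking as the ascending sort plus `np.fliplr` above whenever there are no ties.

```python
import numpy as np
import pandas as pd

target_cols = ['prod_a', 'prod_b', 'prod_c', 'prod_d']  # hypothetical product names

# Made-up predicted scores and current holdings for two customers.
probs = pd.DataFrame([[0.15, 0.6, 0.2, 0.05],
                      [0.5, 0.1, 0.3, 0.1]], index=[101, 102], columns=target_cols)
owned = pd.DataFrame([[0, 1, 0, 0],
                      [0, 0, 0, 1]], index=[101, 102], columns=target_cols)

# Zero the score of every product the customer already holds.
masked = probs.mask(owned == 1, 0.0)

# Rank products by descending score and keep the top_k column indices.
top_k = 2
order = np.argsort(-masked.values, axis=1)[:, :top_k]
recommended = [[target_cols[k] for k in row] for row in order]
```

Customer 101 already holds `prod_b`, the highest-scoring product, so it drops out and the recommendation becomes `prod_c` followed by `prod_a`.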

Write the predictions to a gzipped CSV

In [7]:
test_id = x_test.loc[:, 'ncodpers'].values
final_preds = [' '.join([target_cols[k] for k in pred]) for pred in preds]

out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('eda_4_22.csv.gz', compression='gzip', index=False)