## Feature Engineering in RAM-Limited Data, Part 5

#### CV for mean encoding of `target_combine`

#### CV@2015-12-28:
- benchmark: val mlogloss = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`: val mlogloss = 1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val mlogloss = 1.31122
    - Private score: 0.0302475, public score: 0.0299266
- with all of the above plus mean encoding of the target indicator and the target product count: val mlogloss = 1.30756
    - Private score: 0.0302597, public score: 0.0299519
- with all of the above plus mean encoding of each product: val mlogloss = 1.29115
    - Private score: 0.0301206, public score: 0.0297601
- with all of the above and 120 trees: val mlogloss = 1.15386
    - Private score: 0.0301176, public score: 0.0297002
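
The mean-encoded features in this list are built inside `santander_helper`, whose implementation is not shown in this notebook. As a rough illustration of the general technique only, here is a minimal sketch of out-of-fold mean encoding on toy data. The function name `oof_mean_encode`, the toy columns `cat` and `y`, and the fold scheme are all assumptions made for this example, not the helper's actual code.

```python
import numpy as np
import pandas as pd


def oof_mean_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold mean encoding of `cat_col` against `target_col` (hypothetical helper).

    Each row is encoded with the target mean computed on the other folds,
    so a row's own target never leaks into its own feature.
    """
    rng = np.random.RandomState(seed)
    # Random, balanced fold labels in {0, ..., n_splits - 1}
    fold_id = rng.permutation(len(df)) % n_splits
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)

    for k in range(n_splits):
        in_fold = fold_id == k
        # Category means estimated on every fold except fold k
        means = df.loc[~in_fold].groupby(cat_col)[target_col].mean()
        encoded.loc[in_fold] = df.loc[in_fold, cat_col].map(means).values

    # Categories unseen in the other folds fall back to the global mean
    return encoded.fillna(global_mean)


# Toy data: hypothetical category and binary target
toy = pd.DataFrame({
    'cat': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'y':   [1, 1, 0, 1, 0, 0, 1, 0],
})
toy['cat_mean_enc'] = oof_mean_encode(toy, 'cat', 'y', n_splits=4)
print(toy)
```

Encoding each row from the other folds is one common way to limit target leakage; smoothing toward the global mean is another option that this sketch leaves out.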

In [1]:
from santander_helper import *
%matplotlib inline

Generate the training (2015-06-28), validation (2015-12-28), and test (2016-06-28) sets.

In [None]:
x_train, y_train = create_train_test('2015-06-28', target_flag=True, pattern_flag=True)

In [None]:
x_val, y_val = create_train_test('2015-12-28', max_lag=5, target_flag=True, pattern_flag=True)

In [None]:
x_test = create_train_test('2016-06-28', max_lag=5, target_flag=False, pattern_flag=True)

## Train model


In [None]:
param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 8, 
         'silent': 1,  # newer XGBoost releases use 'verbosity' instead
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 120

dtrain = xgb.DMatrix(x_train.values, y_train.values)
dval = xgb.DMatrix(x_val.values, y_val.values)

train_history = {}
models = {}
n_repeat = 1  # number of repeated runs, each with a different seed
np.random.seed(0)
for n in range(n_repeat):
    train_history[n] = {}
    param['seed'] = np.random.randint(10**6)  # fresh seed for each repeat
    model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train'), (dval, 'val')], 
        verbose_eval=True, evals_result=train_history[n])
    models[n] = model

In [None]:
result = {(k, d): train_history[k][d]['mlogloss'] for k in range(n_repeat) for d in ['train', 'val']}
result = pd.DataFrame(result)

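# Average over repeats per dataset; transposing avoids the deprecated groupby(axis=1).
# With n_repeat = 1 the standard deviation is NaN.
result_mean = result.T.groupby(level=1).mean().T
result_std = result.T.groupby(level=1).std().T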

result = pd.concat((result_mean, result_std), axis=1, ignore_index=True)
result.columns = pd.MultiIndex.from_product([['mean', 'std'], ['train', 'val']], names=['quantity', 'data'])

In [None]:
plt.figure(figsize=(16, 9))
plt.plot(result.loc[:, ('mean', slice(None))])
plt.fill_between(result.index, result.loc[:, ('mean', 'train')]-result.loc[:, ('std', 'train')], 
    result.loc[:, ('mean', 'train')]+result.loc[:, ('std', 'train')], alpha=0.5)
plt.fill_between(result.index, result.loc[:, ('mean', 'val')]-result.loc[:, ('std', 'val')], 
    result.loc[:, ('mean', 'val')]+result.loc[:, ('std', 'val')], alpha=0.5)
plt.grid()

Predict on the test set, zero out products each customer already owns, and keep the seven highest-scoring products.

In [None]:
preds = model.predict(xgb.DMatrix(x_test.values))

df_preds = pd.DataFrame(preds, index=x_test.index, columns=target_cols)
# Remove already bought products 
df_preds[x_test[target_cols]==1] = 0 
preds = df_preds.values
# argsort is ascending, so flip each row and keep the 7 best product indices
preds = np.argsort(preds, axis=1)
preds = np.fliplr(preds)[:, :7]
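
To make the masking and ranking step easier to follow, here is a small self-contained sketch of the same idea on toy data. The arrays `scores` and `owned` and the value `k = 2` are made up for illustration; the notebook itself uses `df_preds`, `x_test[target_cols]`, and 7.

```python
import numpy as np

# Toy scores for 2 customers over 4 hypothetical products
scores = np.array([[0.10, 0.60, 0.05, 0.25],
                   [0.40, 0.30, 0.20, 0.10]])
# 1 marks a product the customer already owns
owned = np.array([[0, 1, 0, 0],
                  [0, 0, 0, 0]])

# Zero out owned products so they rank last
masked = np.where(owned == 1, 0.0, scores)

k = 2
# argsort is ascending; reversing each row puts the best-scoring columns first
top_k = np.fliplr(np.argsort(masked, axis=1))[:, :k]
print(top_k)
# [[3 0]
#  [0 1]]
```

For the first customer, product 1 has the highest raw score but is already owned, so it drops out and products 3 and 0 are returned instead.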

Write the predictions to a gzipped CSV file.

In [None]:
test_id = x_test.loc[:, 'ncodpers'].values
final_preds = [' '.join([target_cols[k] for k in pred]) for pred in preds]

out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('eda_4_19.csv.gz', compression='gzip', index=False)

In [None]:
#pd.Series(x_train.columns).to_csv('x_train_cols.csv')

In [None]:
result