## Feature Engineering and CV-based Winners' Solutions

Continued from eda_4_26.

New in this notebook:
- average of each product's values for each (customer, product) pair
- exponentially weighted average of each product's values for each (customer, product) pair
- time since the product was first present, i.e. distance to the first 1
- time to the last positive flank (0 → 1 transition)
- time to the last negative flank (1 → 0 transition)
- time to the last 1, i.e. to the most recent product purchase
- time to the first 1, i.e. to the first product purchase
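
The history-based features listed above can be sketched on a single (customer, product) 0/1 history as follows. This is only an illustration, not the code in `santander_helper`: the function name `history_features`, the decay parameter `alpha`, and the convention of returning the history length when a pattern never occurs are all assumptions.

```python
import numpy as np

def history_features(history, alpha=0.5):
    """Toy features for one (customer, product) pair.

    history: sequence of 0/1 ownership flags, oldest month first.
    alpha:   hypothetical decay factor for the exponentially weighted average.
    Distances are counted in months back from the most recent month;
    the history length n is used as a sentinel when the pattern never occurs.
    """
    h = np.asarray(history, dtype=int)
    n = len(h)
    feats = {"mean": float(h.mean())}

    # Most recent month gets weight 1, the month before gets alpha, and so on.
    weights = alpha ** np.arange(n - 1, -1, -1)
    feats["ewm"] = float((h * weights).sum() / weights.sum())

    ones = np.flatnonzero(h == 1)
    feats["time_to_last_1"] = int(n - 1 - ones[-1]) if ones.size else n
    feats["time_to_first_1"] = int(n - 1 - ones[0]) if ones.size else n

    # diff[i] = h[i + 1] - h[i]; +1 marks a 0 -> 1 flank, -1 marks a 1 -> 0 flank.
    diff = np.diff(h)
    pos = np.flatnonzero(diff == 1)
    neg = np.flatnonzero(diff == -1)
    feats["time_to_last_pos_flank"] = int(n - 2 - pos[-1]) if pos.size else n
    feats["time_to_last_neg_flank"] = int(n - 2 - neg[-1]) if neg.size else n
    return feats

example = history_features([0, 1, 1, 0, 0, 1])
print(example)
```

In a real pipeline the same logic would be applied per customer and per product column, most likely in vectorized form over a months-by-products array rather than one history at a time.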

Trained@2015-06-28, validated@2015-12-28, mlogloss=1.28481

Private score: 0.0302054, public score: 0.0298683

To-do: 
- mean encoding of products grouped by combinations of `canal_entrada`, `segmento`, and `cod_prov`
- time since change and lags for a few non-product features:
    - segmento
    - ind_actividad_cliente
    - cod_prov
    - canal_entrada
    - indrel_1mes
    - tiprel_1mes
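
A minimal sketch of the second to-do item (a lag and a "months since change" counter for a non-product feature), assuming one row per customer per month. The column names follow the dataset, but the toy values and the output column names `segmento_lag1` and `segmento_months_since_change` are made up for illustration.

```python
import pandas as pd

# Toy data: two customers, four consecutive months each (values are invented).
df = pd.DataFrame({
    "ncodpers": [1, 1, 1, 1, 2, 2, 2, 2],
    "fecha_dato": pd.to_datetime(
        ["2015-01-28", "2015-02-28", "2015-03-28", "2015-04-28"] * 2),
    "segmento": ["A", "A", "B", "B", "C", "C", "C", "C"],
})
df = df.sort_values(["ncodpers", "fecha_dato"]).reset_index(drop=True)

# Lag feature: the customer's value in the previous month.
df["segmento_lag1"] = df.groupby("ncodpers")["segmento"].shift(1)

# A change is a month whose value differs from the previous month's value.
# The first observed month of a customer is not counted as a change.
changed = (df["segmento"] != df["segmento_lag1"]) & df["segmento_lag1"].notna()

# Every change starts a new run; count the months elapsed inside each run.
run_id = changed.astype(int).groupby(df["ncodpers"]).cumsum()
df["segmento_months_since_change"] = df.groupby(["ncodpers", run_id]).cumcount()

print(df)
```

For a customer whose value never changes, the counter simply measures months since the first observation, so it may be worth capping it or adding a separate "never changed" flag.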


Features:
- before eda_4_25
    - customer info in the second month
    - products in the first month
    - combination of first and second month `ind_actividad_cliente`
    - combination of first and second month `tiprel_1mes`
    - combination of first-month products, encoded as a binary number (`target_combine`)
    - encoding `target_combine` with
        - mean number of new products
        - mean number of customers with new products
        - mean number of customers with each new product
    - Count patterns in the last `max_lag` months
    - Number of months since the customer last purchased each product
        - CV@2015-12-28: mlogloss=1.29349
        - Private score: 0.0302475, public score: 0.0299266
- eda_4_25
    - Use all available history data
        - E.g., for the 2016-05-28 training data, use all previous months; for 2015-02-28, use 1 lag month.
        - Each training set needs a matching test set built from the same number of previous months.
        - This idea comes from [the second winner's solution](https://www.kaggle.com/c/santander-product-recommendation/discussion/26824); see the bold part in paragraph 4.
    - Combine models trained on 2016-05-28 and 2015-06-28:
        - Private score: 0.0304583, public score: 0.0300839
        - This aims to capture both seasonality and trend, which are present in the 2015-06-28 and 2016-05-28 data, respectively.
        - This idea is mentioned by many winners, such as the [11th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823) and the [14th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26808).

- eda_4_27
    - Put 2015-06-28 and 2016-05-28 in the same data set, with the same lag=5
        - Private score: 0.0303096, public score: 0.0299867
        - This differs from what is reported in the [11th-place winner's discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823):
            > We tested this by adding 50% of May-16 data to our June model and sure enough, we went from 0.0301 to 0.0303. Then, we built separate models for Jun and May, but the ensemble didn’t work. We weren’t surprised because June data is better for seasonal products, and May data is better for trend products. And vice-versa, June data is bad for trend products and May data is bad for seasonal products. So, they sort of cancelled each other out.

        - However, my scores are consistently lower than theirs, which may be why our observations differ.
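
The `target_combine` encoding listed under "before eda_4_25" can be sketched as follows. This is a toy version under stated assumptions, not the code in `santander_helper`: the three product columns and their values are hypothetical, and the statistics are computed on the same rows they describe, whereas a real pipeline would compute them on training months only.

```python
import numpy as np
import pandas as pd

# Hypothetical product columns; the real data has many more.
product_cols = ["prod_a", "prod_b", "prod_c"]

# Ownership flags in the first (prev) and second (curr) month, one row per customer.
prev = pd.DataFrame([[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1], [0, 0, 0]],
                    columns=product_cols)
curr = pd.DataFrame([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 1]],
                    columns=product_cols)

# Read the first-month product vector as a binary number (prod_a is the lowest bit).
powers = 2 ** np.arange(len(product_cols))
target_combine = prev.values.dot(powers)

# A product is new if it is owned in the second month but not in the first.
new_products = (curr.values == 1) & (prev.values == 0)
n_new = new_products.sum(axis=1)

enc = pd.DataFrame({
    "target_combine": target_combine,
    "n_new": n_new,
    "has_new": (n_new > 0).astype(int),
})

# Mean encoding: average number of new products, and share of customers
# with at least one new product, per first-month product combination.
stats = enc.groupby("target_combine").agg(
    mean_n_new=("n_new", "mean"),
    mean_has_new=("has_new", "mean"),
)
print(stats)
```

The per-product variant ("mean number of customers with each new product") would group the columns of `new_products` by `target_combine` in the same way.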

In [1]:
from santander_helper import *

In [2]:
x_train, y_train = create_train('2015-06-28', pattern_flag=True)

In [3]:
x_val, y_val = create_train('2015-12-28', pattern_flag=True)

In [4]:
x_test = create_test(pattern_flag=True)

In [5]:
param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 8, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 50

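# Wrap features and class labels in DMatrix objects for XGBoost
dtrain = xgb.DMatrix(x_train.values, y_train.values)
dval = xgb.DMatrix(x_val.values, y_val.values)
# Train while reporting mlogloss on both the training and validation sets
model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train'), (dval, 'val')], verbose_eval=True)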

[0]	train-mlogloss:2.70425	val-mlogloss:2.72751
[1]	train-mlogloss:2.53054	val-mlogloss:2.57816
[2]	train-mlogloss:2.39174	val-mlogloss:2.45626
[3]	train-mlogloss:2.27667	val-mlogloss:2.35607
[4]	train-mlogloss:2.17737	val-mlogloss:2.26857
[5]	train-mlogloss:2.09057	val-mlogloss:2.19049
[6]	train-mlogloss:2.01329	val-mlogloss:2.12256
[7]	train-mlogloss:1.94391	val-mlogloss:2.06879
[8]	train-mlogloss:1.8804	val-mlogloss:2.01387
[9]	train-mlogloss:1.8227	val-mlogloss:1.96327
[10]	train-mlogloss:1.76998	val-mlogloss:1.91818
[11]	train-mlogloss:1.7206	val-mlogloss:1.87445
[12]	train-mlogloss:1.67542	val-mlogloss:1.83568
[13]	train-mlogloss:1.63347	val-mlogloss:1.79962
[14]	train-mlogloss:1.59422	val-mlogloss:1.76587
[15]	train-mlogloss:1.55781	val-mlogloss:1.7343
[16]	train-mlogloss:1.52319	val-mlogloss:1.70851
[17]	train-mlogloss:1.49054	val-mlogloss:1.68033
[18]	train-mlogloss:1.45981	val-mlogloss:1.65444
[19]	train-mlogloss:1.43115	val-mlogloss:1.63016
[20]	train-mlogloss:1.404	val-mlog

In [7]:
preds = model.predict(xgb.DMatrix(x_test.values))

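df_preds = pd.DataFrame(preds, index=x_test.index, columns=target_cols)
# Zero out products the customer already owns so they are not recommended again
df_preds[x_test[target_cols]==1] = 0 
preds = df_preds.values
# Sort product indices by ascending probability, reverse, and keep the top 7
preds = np.argsort(preds, axis=1)
preds = np.fliplr(preds)[:, :7]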

In [8]:
test_id = x_test.loc[:, 'ncodpers'].values
final_preds = [' '.join([target_cols[k] for k in pred]) for pred in preds]

out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('eda_4_28.csv.gz', compression='gzip', index=False)