## Feature Engineering and CV based Winners' Solutions

Try to use MAP@7 as `feval` for xgboost, failed because no way to tell xgboost how to group results by users. **Can only do this after train is finished.**

New in this notebook:
- A hacky implementation of MAP@7 evaluation. 
- This method is suitable when training on one month and validate on another month, since ncodpers is the key in ground truth dictionaries.
- This method **only works if the MAP functions and training codes are in the same notebook**.

To-do: 
- mean encoding of products grouped by combinations of: canal_entrada, segmento, cod_prov
- Time since change and lags for a few non-product features: 
    - segmento
    - ind_actividad_cliente
    - cod_prov
    - canal_entrada
    - indrel_1mes
    - tiprel_1mes


Features:
- before eda_4_29
    - average of products for each (customer, product) pair
    - exponent weighted average of products each (customer, product) pair
    - time since presence of products, distance to the first 1
    - time to the last positive flank (01)
    - time to the last negative flank (10)
    - time to the last 1, to the nearest product purchase
    - time to the first 1, to the first product purchase
    - Trained@2015-06-28, validated@2015-12-28, mlogloss=1.28481
    - Private score: 0.0302054, public score: 0.0298683
- before eda_4_25
    - customer info in the second month
    - products in the first month
    - combination of first and second month `ind_actividad_cliente`
    - combination of first and second month `tiprel_1mes`
    - combination of first month product by using binary number (`target_combine`)
    - encoding `target_combine` with 
        - mean number of new products
        - mean number of customers with new products
        - mean number of customers with each new products
    - Count patterns in the last `max_lag` months
    - Number of month to the last time the customer purchase each product
        - CV@2015-12-28: mlogloss=1.29349
        - Private score: 0.0302475, public score: 0.0299266
- eda_4_25
    - Use all available history data
        - E.g., for 2016-05-28 train data, use all previous months, for 2015-02-28, use 1 lag month. 
        - Need to create test set that use the same amount of previous months for each training data set. 
        - This is from [the second winner's solution](https://www.kaggle.com/c/santander-product-recommendation/discussion/26824), his bold part in paragraph 4.
    - Combine models trained on 2016-05-28 and 2015-06-28:
        - Private score: 0.0304583, public score: 0.0300839
        - This is to catch both seasonality and trend, presented in 2015-06-28 and 2016-05-28, respectively. 
        - This idea is mentioned by many winners, like [11-th winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823) and [14-th winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26808)

- eda_4_27
    - put 2015-06-28 and 2016-05-28 in the same data set, with the same lag=5
        - Private score:0.0303096, public score: 0.0299867
        - Different as [11-th winner's discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823)
            > We tested this by adding 50% of May-16 data to our June model and sure enough, we went from 0.0301 to 0.0303. Then, we built separate models for Jun and May, but the ensemble didn’t work. We weren’t surprised because June data is better for seasonal products, and May data is better for trend products. And vice-versa, June data is bad for trend products and May data is bad for seasonal products. So, they sort of cancelled each other out.

        - But my score is always worse than theirs, maybe this is the reason why we have different observations

# Compare two weights

In [1]:
from santander_helper import *

In [2]:
history = {}

x_train, y_train, weight_train = create_train('2015-06-28', pattern_flag=True)
x_val, y_val, weight_val = create_train('2016-05-28', pattern_flag=True)

gt_train = prep_map(x_train, y_train)
gt_val = prep_map(x_val, y_val)

dtrain = xgb.DMatrix(x_train, y_train)
dval = xgb.DMatrix(x_val, y_val)

ground_truth = {'train': gt_train, 'val': gt_val}
data_hash = {'train': hash(dtrain.get_label().tostring()), 'val': hash(dval.get_label().tostring())}

for weight_index in [0, 1]:
    history[weight_index] = {}
    
    dtrain.set_weight(weight_train.values[:, weight_index])
    dval.set_weight(weight_val.values[:, weight_index])

    param = {'objective': 'multi:softprob', 
             'eta': 0.05, 
             'max_depth': 4, 
             'silent': 1, 
             'num_class': len(target_cols),
             'eval_metric': 'mlogloss',
             'min_child_weight': 1,
             'subsample': 0.7,
             'colsample_bytree': 0.7,
             'seed': 0}
    num_rounds = 300

    model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train'), (dval, 'dval')], 
        verbose_eval=True, feval=eval_map, evals_result=history[weight_index], gt=ground_truth, ts=data_hash)

[0]	train-mlogloss:2.67673	dval-mlogloss:2.69823	train-MAP@7:0.869703	dval-MAP@7:0.836259
[1]	train-mlogloss:2.49074	dval-mlogloss:2.52495	train-MAP@7:0.871119	dval-MAP@7:0.832829
[2]	train-mlogloss:2.34356	dval-mlogloss:2.39247	train-MAP@7:0.873126	dval-MAP@7:0.837453
[3]	train-mlogloss:2.22236	dval-mlogloss:2.28039	train-MAP@7:0.873827	dval-MAP@7:0.837061
[4]	train-mlogloss:2.11922	dval-mlogloss:2.18952	train-MAP@7:0.875822	dval-MAP@7:0.836688
[5]	train-mlogloss:2.02934	dval-mlogloss:2.10505	train-MAP@7:0.87617	dval-MAP@7:0.837969
[6]	train-mlogloss:1.94954	dval-mlogloss:2.03087	train-MAP@7:0.876927	dval-MAP@7:0.839329
[7]	train-mlogloss:1.87774	dval-mlogloss:1.96457	train-MAP@7:0.878078	dval-MAP@7:0.840711
[8]	train-mlogloss:1.81368	dval-mlogloss:1.90596	train-MAP@7:0.877812	dval-MAP@7:0.840521
[9]	train-mlogloss:1.75547	dval-mlogloss:1.85178	train-MAP@7:0.878262	dval-MAP@7:0.840807
[10]	train-mlogloss:1.70167	dval-mlogloss:1.80365	train-MAP@7:0.879044	dval-MAP@7:0.841796
[11]	train

[90]	train-mlogloss:0.787226	dval-mlogloss:0.97059	train-MAP@7:0.892729	dval-MAP@7:0.857326
[91]	train-mlogloss:0.785354	dval-mlogloss:0.969351	train-MAP@7:0.892902	dval-MAP@7:0.857364
[92]	train-mlogloss:0.783641	dval-mlogloss:0.96791	train-MAP@7:0.893065	dval-MAP@7:0.857372
[93]	train-mlogloss:0.78191	dval-mlogloss:0.966498	train-MAP@7:0.893141	dval-MAP@7:0.857499
[94]	train-mlogloss:0.780128	dval-mlogloss:0.96512	train-MAP@7:0.893286	dval-MAP@7:0.857563
[95]	train-mlogloss:0.778574	dval-mlogloss:0.964153	train-MAP@7:0.893374	dval-MAP@7:0.857584
[96]	train-mlogloss:0.776962	dval-mlogloss:0.962695	train-MAP@7:0.893487	dval-MAP@7:0.857678
[97]	train-mlogloss:0.775399	dval-mlogloss:0.961222	train-MAP@7:0.8935	dval-MAP@7:0.85775
[98]	train-mlogloss:0.773871	dval-mlogloss:0.959959	train-MAP@7:0.893474	dval-MAP@7:0.857943
[99]	train-mlogloss:0.772415	dval-mlogloss:0.95917	train-MAP@7:0.893581	dval-MAP@7:0.857868
[100]	train-mlogloss:0.770977	dval-mlogloss:0.957982	train-MAP@7:0.8936	dval-M

[178]	train-mlogloss:0.70972	dval-mlogloss:0.921897	train-MAP@7:0.898992	dval-MAP@7:0.860569
[179]	train-mlogloss:0.709252	dval-mlogloss:0.921839	train-MAP@7:0.898981	dval-MAP@7:0.860568
[180]	train-mlogloss:0.708827	dval-mlogloss:0.921852	train-MAP@7:0.899033	dval-MAP@7:0.860525
[181]	train-mlogloss:0.708313	dval-mlogloss:0.921785	train-MAP@7:0.899077	dval-MAP@7:0.860569
[182]	train-mlogloss:0.707862	dval-mlogloss:0.921584	train-MAP@7:0.899111	dval-MAP@7:0.860594
[183]	train-mlogloss:0.707422	dval-mlogloss:0.921229	train-MAP@7:0.899176	dval-MAP@7:0.860627
[184]	train-mlogloss:0.706978	dval-mlogloss:0.921073	train-MAP@7:0.899286	dval-MAP@7:0.860675
[185]	train-mlogloss:0.706554	dval-mlogloss:0.920999	train-MAP@7:0.899251	dval-MAP@7:0.860582
[186]	train-mlogloss:0.706146	dval-mlogloss:0.921164	train-MAP@7:0.899284	dval-MAP@7:0.86065
[187]	train-mlogloss:0.705745	dval-mlogloss:0.921153	train-MAP@7:0.89935	dval-MAP@7:0.860631
[188]	train-mlogloss:0.705303	dval-mlogloss:0.920961	train-MAP@

[266]	train-mlogloss:0.675915	dval-mlogloss:0.922529	train-MAP@7:0.904943	dval-MAP@7:0.859399
[267]	train-mlogloss:0.675593	dval-mlogloss:0.922575	train-MAP@7:0.905059	dval-MAP@7:0.859422
[268]	train-mlogloss:0.675274	dval-mlogloss:0.922472	train-MAP@7:0.905087	dval-MAP@7:0.859524
[269]	train-mlogloss:0.674933	dval-mlogloss:0.922565	train-MAP@7:0.905126	dval-MAP@7:0.859382
[270]	train-mlogloss:0.674614	dval-mlogloss:0.92268	train-MAP@7:0.905215	dval-MAP@7:0.859304
[271]	train-mlogloss:0.674254	dval-mlogloss:0.922714	train-MAP@7:0.905297	dval-MAP@7:0.859307
[272]	train-mlogloss:0.673892	dval-mlogloss:0.922721	train-MAP@7:0.905296	dval-MAP@7:0.859278
[273]	train-mlogloss:0.67357	dval-mlogloss:0.922665	train-MAP@7:0.90537	dval-MAP@7:0.859279
[274]	train-mlogloss:0.673229	dval-mlogloss:0.922602	train-MAP@7:0.905377	dval-MAP@7:0.859315
[275]	train-mlogloss:0.672906	dval-mlogloss:0.922704	train-MAP@7:0.905451	dval-MAP@7:0.859332
[276]	train-mlogloss:0.672606	dval-mlogloss:0.922956	train-MAP@

KeyboardInterrupt: 

### 2016-05-28, max_lag=16

In [3]:
x_train_may16, y_train_may16 = create_train('2015-05-28', pattern_flag=True, max_lag=16)

x_test_may16 = create_test(pattern_flag=True)

param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 8, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 60

dtrain_may16 = xgb.DMatrix(x_train_may16.values, y_train_may16.values)
model_may16 = xgb.train(param, dtrain_may16, num_rounds, evals=[(dtrain_may16, 'train')], verbose_eval=True)

preds_may16 = model_may16.predict(xgb.DMatrix(x_test_may16.values))

df_preds_may16 = pd.DataFrame(preds_may16, index=x_test_may16.index, columns=target_cols)
# Remove already bought products 
df_preds_may16[x_test_may16[target_cols]==1] = 0 
preds_may16 = df_preds_may16.values
preds_may16 = np.argsort(preds_may16, axis=1)
preds_may16 = np.fliplr(preds_may16)[:, :7]

test_id = x_test_may16.loc[:, 'ncodpers'].values
final_preds_may16 = [' '.join([target_cols[k] for k in pred]) for pred in preds_may16]

out_df_may16 = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds_may16})
out_df_may16.to_csv('eda_4_28_may16.csv.gz', compression='gzip', index=False)

[0]	train-mlogloss:2.64321
[1]	train-mlogloss:2.42916
[2]	train-mlogloss:2.26143
[3]	train-mlogloss:2.12057
[4]	train-mlogloss:2.00099
[5]	train-mlogloss:1.89603
[6]	train-mlogloss:1.80356
[7]	train-mlogloss:1.72067
[8]	train-mlogloss:1.64465
[9]	train-mlogloss:1.57663
[10]	train-mlogloss:1.51355
[11]	train-mlogloss:1.45557
[12]	train-mlogloss:1.40214
[13]	train-mlogloss:1.35231
[14]	train-mlogloss:1.30587
[15]	train-mlogloss:1.26249
[16]	train-mlogloss:1.222
[17]	train-mlogloss:1.18383
[18]	train-mlogloss:1.14811
[19]	train-mlogloss:1.11448
[20]	train-mlogloss:1.08268
[21]	train-mlogloss:1.0525
[22]	train-mlogloss:1.02404
[23]	train-mlogloss:0.997108
[24]	train-mlogloss:0.971606
[25]	train-mlogloss:0.947348
[26]	train-mlogloss:0.924398
[27]	train-mlogloss:0.902768
[28]	train-mlogloss:0.881792
[29]	train-mlogloss:0.861978
[30]	train-mlogloss:0.842969
[31]	train-mlogloss:0.824979
[32]	train-mlogloss:0.80799
[33]	train-mlogloss:0.791558
[34]	train-mlogloss:0.776193
[35]	train-mlogloss:0.

### Combine two models

In [4]:
preds_june15 = model_june15.predict(xgb.DMatrix(x_test_june15.values))
preds_may16 = model_may16.predict(xgb.DMatrix(x_test_may16.values))

preds1 = np.sqrt(preds_june15*preds_may16)
preds2 = 0.5*preds_june15 + 0.5*preds_may16

# Geometry mean
df_preds1 = pd.DataFrame(preds1, index=x_test_may16.index, columns=target_cols)
# Remove already bought products 
df_preds1[x_test_may16[target_cols]==1] = 0 
preds1 = df_preds1.values
preds1 = np.argsort(preds1, axis=1)
preds1 = np.fliplr(preds1)[:, :7]

test_id = x_test_may16.loc[:, 'ncodpers'].values
final_preds1 = [' '.join([target_cols[k] for k in pred]) for pred in preds1]

out_df1 = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds1})
out_df1.to_csv('eda_4_28_gm.csv.gz', compression='gzip', index=False)

# Algorithmic mean
df_preds2 = pd.DataFrame(preds2, index=x_test_may16.index, columns=target_cols)
# Remove already bought products 
df_preds2[x_test_may16[target_cols]==1] = 0 
preds2 = df_preds2.values
preds2 = np.argsort(preds2, axis=1)
preds2 = np.fliplr(preds2)[:, :7]

test_id = x_test_may16.loc[:, 'ncodpers'].values
final_preds2 = [' '.join([target_cols[k] for k in pred]) for pred in preds2]

out_df2 = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds2})
out_df2.to_csv('eda_4_28_am.csv.gz', compression='gzip', index=False)
