## Feature Engineering and CV Based on Winners' Solutions

I tried to use MAP@7 as `feval` for xgboost, but it failed because there is no way to tell xgboost how to group results by user. **MAP@7 can only be computed after training has finished.**

New in this notebook:
- A hacky implementation of MAP@7 evaluation.
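
For reference, the metric computed by `apk` and `mapk` in this notebook can be written as follows. This is a restatement of that code, not an official definition: $A_u$ is the set of products customer $u$ actually added, $U$ is the set of evaluated customers, and $\mathrm{rel}_u(i) = 1$ if the $i$-th ranked prediction for $u$ is in $A_u$ and has not appeared earlier in the list, otherwise $0$. Customers with an empty $A_u$ receive the default score of $0$.

$$
\mathrm{AP@7}(u) = \frac{1}{\min(|A_u|, 7)} \sum_{i=1}^{7} \mathrm{rel}_u(i) \, \frac{\sum_{j=1}^{i} \mathrm{rel}_u(j)}{i},
\qquad
\mathrm{MAP@7} = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP@7}(u)
$$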

To-do: 
- Mean encoding of products grouped by combinations of `canal_entrada`, `segmento`, and `cod_prov`
- Time since change and lags for a few non-product features: 
    - segmento
    - ind_actividad_cliente
    - cod_prov
    - canal_entrada
    - indrel_1mes
    - tiprel_1mes
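
A minimal sketch of how the two to-do items could look in pandas. It is not taken from `santander_helper`: the toy data, the `new_product` target column, and the output column names are all made up for illustration.

```python
import pandas as pd

# Toy data; column names follow the Santander data set, values are made up.
df = pd.DataFrame({
    'ncodpers':      [1, 1, 1, 2, 2, 2],
    'fecha_dato':    pd.to_datetime(['2015-01-28', '2015-02-28', '2015-03-28'] * 2),
    'canal_entrada': ['KAT', 'KAT', 'KFC', 'KFC', 'KFC', 'KFC'],
    'segmento':      ['02', '02', '02', '03', '03', '03'],
    'new_product':   [0, 1, 0, 1, 0, 1],   # hypothetical target: 1 if a product was added
})

# Mean encoding: average target per (canal_entrada, segmento) combination.
# In practice the means should come from past months only, to avoid leakage.
group_cols = ['canal_entrada', 'segmento']
df['me_canal_segmento'] = df.groupby(group_cols)['new_product'].transform('mean')

# Time since change: months since the customer's canal_entrada last changed.
df = df.sort_values(['ncodpers', 'fecha_dato']).reset_index(drop=True)
prev = df.groupby('ncodpers')['canal_entrada'].shift(1)
changed = (df['canal_entrada'] != prev) & prev.notna()
# Each change starts a new run; the position inside the run is the time since change.
# The first run of a customer counts from the first observed month.
run_id = changed.astype(int).groupby(df['ncodpers']).cumsum().rename('run_id')
df['canal_months_since_change'] = df.groupby(['ncodpers', run_id]).cumcount()

# One-month lag of the same feature.
df['canal_entrada_lag1'] = prev
```

The same pattern would apply to the other listed features by swapping the column name.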


Features:
- before eda_4_29
    - average of products for each (customer, product) pair
    - exponentially weighted average of products for each (customer, product) pair
    - time since presence of products, distance to the first 1
    - time to the last positive flank (0 → 1)
    - time to the last negative flank (1 → 0)
    - time to the last 1, to the nearest product purchase
    - time to the first 1, to the first product purchase
    - Trained@2015-06-28, validated@2015-12-28, mlogloss=1.28481
    - Private score: 0.0302054, public score: 0.0298683
- before eda_4_25
    - customer info in the second month
    - products in the first month
    - combination of first and second month `ind_actividad_cliente`
    - combination of first and second month `tiprel_1mes`
    - combination of first-month products encoded as a binary number (`target_combine`)
    - encoding `target_combine` with
        - mean number of new products
        - mean number of customers with new products
        - mean number of customers with each new product
    - Count patterns in the last `max_lag` months
    - Number of months since the customer last purchased each product
        - CV@2015-12-28: mlogloss=1.29349
        - Private score: 0.0302475, public score: 0.0299266
- eda_4_25
    - Use all available history data
        - E.g., for the 2016-05-28 training data, use all previous months; for 2015-02-28, use 1 lag month.
        - Need to create a test set that uses the same number of previous months for each training data set.
        - This is from [the second-place winner's solution](https://www.kaggle.com/c/santander-product-recommendation/discussion/26824); see the bold text in paragraph 4.
    - Combine models trained on 2016-05-28 and 2015-06-28:
        - Private score: 0.0304583, public score: 0.0300839
        - This is to catch both seasonality and trend, which are present in 2015-06-28 and 2016-05-28, respectively.
        - This idea is mentioned by many winners, such as the [11th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823) and the [14th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26808).

- eda_4_27
    - Put 2015-06-28 and 2016-05-28 in the same data set, with the same lag=5
        - Private score: 0.0303096, public score: 0.0299867
        - This differs from the [11th-place winner's discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823):
            > We tested this by adding 50% of May-16 data to our June model and sure enough, we went from 0.0301 to 0.0303. Then, we built separate models for Jun and May, but the ensemble didn’t work. We weren’t surprised because June data is better for seasonal products, and May data is better for trend products. And vice-versa, June data is bad for trend products and May data is bad for seasonal products. So, they sort of cancelled each other out.

        - But my score is always worse than theirs, which may be why our observations differ.
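
Several of the history features listed above can be sketched for a single (customer, product) series as below. This is an illustration only: the function names, the `alpha` smoothing factor, and the NaN convention for events that never happened are assumptions, and the actual implementation lives in `santander_helper`.

```python
import numpy as np
import pandas as pd

def history_features(history, alpha=0.5):
    """Summarize one (customer, product) ownership history, oldest month first.

    `history` is a sequence of 0/1 flags. `alpha` is a hypothetical smoothing
    factor chosen for this sketch.
    """
    h = np.asarray(history, dtype=int)
    n = h.size
    diff = np.diff(h)                      # +1 marks a 0->1 flank, -1 a 1->0 flank
    pos = np.flatnonzero(diff == 1)
    neg = np.flatnonzero(diff == -1)
    ones = np.flatnonzero(h == 1)
    return {
        'mean': h.mean(),
        'ewm': pd.Series(h).ewm(alpha=alpha, adjust=False).mean().iloc[-1],
        # Months between the most recent month and the event; NaN if it never happened.
        'to_last_pos_flank': n - 1 - (pos[-1] + 1) if pos.size else np.nan,
        'to_last_neg_flank': n - 1 - (neg[-1] + 1) if neg.size else np.nan,
        'to_last_1': n - 1 - ones[-1] if ones.size else np.nan,
        'to_first_1': n - 1 - ones[0] if ones.size else np.nan,
    }

def target_combine(products):
    """Encode a row of 0/1 product flags as a single integer (binary number)."""
    return int(''.join(str(int(p)) for p in products), 2)

feats = history_features([0, 1, 1, 0, 1])
code = target_combine([1, 0, 1])   # binary 101
```

For the history `[0, 1, 1, 0, 1]` the product was just re-acquired, so the distance to the last positive flank and to the last 1 are both zero, while the last negative flank lies one month back.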

In [1]:
from santander_helper import *

In [87]:
@jit
def apk(actual, predicted, k=7, default=0.0):
    if predicted.size>k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    
    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if actual.size==0:
        return default

    return score / min(actual.size, k)

@jit
def mapk(actual, predicted, k=7, default=0.0):
    
    n = actual.shape[0]
    apks = np.zeros(n)
    for i in range(n):
        apks[i] = apk(actual[i], predicted[i], k, default)
    mean_apk = np.mean(apks)
    
    return mean_apk
    
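# No @jit here: the function works on Python dicts and an xgboost DMatrix,
# which numba cannot compile in nopython mode.
def eval_map(y_prob, dtrain):
    # Pick the ground truth whose size matches the DMatrix being evaluated.
    n_rows = dtrain.get_label().size
    if n_rows == y_train.size:
        gti = gt_train['index']
        gtv = gt_train['value']
    elif n_rows == y_val.size:
        gti = gt_val['index']
        gtv = gt_val['value']
    else:
        raise ValueError('dtrain matches neither the training nor the validation set')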
    
    n = len(gti)
    apks = np.zeros(n)
    y_pred = {}
    for i, (cust_id, idx) in enumerate(gti.items()):
        tmp = np.mean(y_prob[idx, :], axis=0)
        y_pred[cust_id] = np.argsort(tmp)[:-8:-1]
        apks[i] = apk(gtv[cust_id], y_pred[cust_id])
    score = np.mean(apks)

    return 'MAP@7', score

In [88]:
def prep_map(x_train, y_train, file_name=None):
    '''Prepare ground truth value and index for MAP evaluation.

    `file_name` is currently unused; it is optional so that calls such as
    `prep_map(x_train, y_train)` work.
    '''
    # Ground truth value: MAP needs to know the products bought by each customer
    gtv = pd.concat((pd.DataFrame(x_train.loc[:, 'ncodpers'].copy()), y_train), axis=1, ignore_index=True)
    gtv.columns = ['ncodpers', 'target']
    gtv = gtv.groupby('ncodpers')['target'].apply(lambda x: x.values).to_dict()
    # Ground truth index: MAP needs to know for each customer which rows are its corresponding data
    gti = pd.DataFrame(x_train.loc[:, 'ncodpers']).reset_index()
    gti = gti.groupby('ncodpers')['index'].apply(lambda x: x.values).to_dict()
    
    gt = {'value': gtv, 'index': gti}
    
    return gt
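
To make the two dictionaries concrete, here is a toy sketch of the same grouping and of how it lets predictions be ranked per customer once class probabilities exist. All values are made up, and it uses `GroupBy.indices` (row positions) instead of `reset_index`, which is an equivalent choice only when the frame has a default integer index.

```python
import numpy as np
import pandas as pd

# Toy training rows: one row per (customer, newly added product). Values are made up.
x_toy = pd.DataFrame({'ncodpers': [101, 101, 202]})
y_toy = pd.Series([2, 0, 1], name='target')

# Ground truth value: customer -> array of added product indices.
gtv = {cust: grp.values for cust, grp in y_toy.groupby(x_toy['ncodpers'])}
# Ground truth index: customer -> row positions belonging to that customer.
gti = x_toy.groupby('ncodpers').indices

# Hypothetical class probabilities for 3 rows and 4 products.
y_prob = np.array([[0.10, 0.20, 0.60, 0.10],
                   [0.50, 0.10, 0.30, 0.10],
                   [0.25, 0.50, 0.15, 0.10]])

# Average the rows of each customer, then rank products best-first and keep at most 7.
ranked = {cust: np.argsort(y_prob[rows].mean(axis=0))[::-1][:7]
          for cust, rows in gti.items()}
```

Customer 101 owns two rows, so its ranking comes from the mean of both probability vectors; this per-customer grouping is the step xgboost cannot do on its own during training.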

In [44]:
x_train, y_train, weight_train = create_train('2015-06-28', pattern_flag=True)
x_val, y_val, weight_val = create_train('2016-05-28', pattern_flag=True)

dtrain = xgb.DMatrix(x_train, y_train, weight=weight_train)
dval = xgb.DMatrix(x_val, y_val, weight=weight_val)

gt_train = prep_map(x_train, y_train)
gt_val = prep_map(x_val, y_val)

param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 4, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 200

model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train'), (dval, 'dval')], verbose_eval=True, feval=eval_map)

In [70]:
y_pred, y_true = load_pickle('y_pred.pickle')

In [86]:
print(y_true[19232], y_pred[19232])

[11 12 16] [16  7 15 11  8 12  5]


In [85]:
apk(y_true[19232], y_pred[19232])

16 1.0 1.0
11 2.0 1.5
12 3.0 2.0


0.6666666666666666
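
The result above can be checked by hand. The three printed lines appear to be debug output showing the product index, the running hit count, and the running score at each hit. The plain-Python function below is a stand-in written for this check (its name is not part of the notebook) and mirrors the logic of `apk` without numba.

```python
def average_precision_at_k(actual, predicted, k=7):
    """Plain-Python AP@k mirroring the logic of `apk` (no numba)."""
    if len(actual) == 0:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for rank, p in enumerate(predicted[:k], start=1):
        # Count a hit only the first time a correct product appears.
        if p in actual and p not in seen:
            hits += 1
            score += hits / rank
        seen.add(p)
    return score / min(len(actual), k)

value = average_precision_at_k([11, 12, 16], [16, 7, 15, 11, 8, 12, 5])
# Hits at ranks 1, 4 and 6: (1/1 + 2/4 + 3/6) / 3 = 2/3
```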

In [71]:
y_pred

{15889: array([17, 16, 15,  7, 11,  8, 12], dtype=int64),
 15929: array([16,  7, 12, 11,  8, 17, 14], dtype=int64),
 15952: array([16,  7,  2, 12, 11,  8, 17], dtype=int64),
 15988: array([17, 16,  7, 12, 15,  2, 11], dtype=int64),
 15993: array([17,  0, 12, 11,  9,  8,  5], dtype=int64),
 16008: array([17, 11,  8, 14,  5,  2,  9], dtype=int64),
 16056: array([11, 12, 17, 15,  8,  5,  2], dtype=int64),
 16125: array([ 5, 17, 15, 12, 11,  2,  7], dtype=int64),
 16133: array([17, 16,  7, 12, 11,  8,  5], dtype=int64),
 16201: array([11, 12, 17, 18,  8,  6,  5], dtype=int64),
 16202: array([17,  8,  5, 18,  2,  9,  7], dtype=int64),
 16242: array([ 0,  2, 17, 12, 11,  8, 15], dtype=int64),
 16294: array([15, 17,  0,  8,  5, 18,  2], dtype=int64),
 16506: array([11, 12, 17,  8,  0,  5,  2], dtype=int64),
 16525: array([15, 13, 17,  6,  8, 18,  5], dtype=int64),
 16536: array([ 2,  7, 17, 12,  6, 18,  8], dtype=int64),
 16576: array([ 2,  7, 11,  8,  5, 17,  9], dtype=int64),
 16680: array(

In [72]:
y_true

{15889: array([17], dtype=int64),
 15929: array([16], dtype=int64),
 15952: array([16], dtype=int64),
 15988: array([17], dtype=int64),
 15993: array([17], dtype=int64),
 16008: array([17], dtype=int64),
 16056: array([11, 12], dtype=int64),
 16125: array([5], dtype=int64),
 16133: array([17], dtype=int64),
 16201: array([11, 12, 17], dtype=int64),
 16202: array([17], dtype=int64),
 16242: array([2], dtype=int64),
 16294: array([8], dtype=int64),
 16506: array([11, 12], dtype=int64),
 16525: array([6], dtype=int64),
 16536: array([ 2, 17], dtype=int64),
 16576: array([2], dtype=int64),
 16680: array([2], dtype=int64),
 16705: array([17], dtype=int64),
 16731: array([16], dtype=int64),
 16787: array([11, 12], dtype=int64),
 16826: array([16], dtype=int64),
 16857: array([0], dtype=int64),
 16988: array([17], dtype=int64),
 17015: array([17], dtype=int64),
 17059: array([16], dtype=int64),
 17118: array([11, 12], dtype=int64),
 17120: array([17], dtype=int64),
 17151: array([17], dtype=i

### 2015-06-28, max_lag=5

In [3]:
x_train_june15, y_train_june15, weight_train_june15 = create_train('2015-06-28', pattern_flag=True)
x_val_june15, y_val_june15, weight_val_june15 = create_train('2016-05-28', pattern_flag=True)

In [None]:
param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 4, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 150

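# Use plain arrays so the feature names match the array-based DMatrix used at prediction time.
dtrain_june15 = xgb.DMatrix(x_train_june15.values, y_train_june15.values)
dval_june15 = xgb.DMatrix(x_val_june15.values, y_val_june15.values)
# eval_map relies on the globals y_train, y_val, gt_train and gt_val defined in the cells above.
model_june15 = xgb.train(param, dtrain_june15, num_rounds, 
                         evals=[(dtrain_june15, 'train'), (dval_june15, 'dval')], 
                         verbose_eval=True, feval=eval_map)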

In [32]:
x_test_june15 = create_test(pattern_flag=True)
preds_june15 = model_june15.predict(xgb.DMatrix(x_test_june15.values))

df_preds_june15 = pd.DataFrame(preds_june15, index=x_test_june15.index, columns=target_cols)
# Remove already bought products 
df_preds_june15[x_test_june15[target_cols]==1] = 0 
preds_june15 = df_preds_june15.values
preds_june15 = np.argsort(preds_june15, axis=1)
preds_june15 = np.fliplr(preds_june15)[:, :7]

test_id = x_test_june15.loc[:, 'ncodpers'].values
final_preds_june15 = [' '.join([target_cols[k] for k in pred]) for pred in preds_june15]

out_df_june15 = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds_june15})
out_df_june15.to_csv('eda_4_28_june15_it100.csv.gz', compression='gzip', index=False)

NameError: name 'model_june15' is not defined

### 2016-05-28, max_lag=16

In [3]:
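# Train on 2016-05-28 to match the section title; create_train returns (x, y, weight) as in the cells above.
x_train_may16, y_train_may16, weight_train_may16 = create_train('2016-05-28', pattern_flag=True, max_lag=16)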

x_test_may16 = create_test(pattern_flag=True)

param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 8, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 60

dtrain_may16 = xgb.DMatrix(x_train_may16.values, y_train_may16.values)
model_may16 = xgb.train(param, dtrain_may16, num_rounds, evals=[(dtrain_may16, 'train')], verbose_eval=True)

preds_may16 = model_may16.predict(xgb.DMatrix(x_test_may16.values))

df_preds_may16 = pd.DataFrame(preds_may16, index=x_test_may16.index, columns=target_cols)
# Remove already bought products 
df_preds_may16[x_test_may16[target_cols]==1] = 0 
preds_may16 = df_preds_may16.values
preds_may16 = np.argsort(preds_may16, axis=1)
preds_may16 = np.fliplr(preds_may16)[:, :7]

test_id = x_test_may16.loc[:, 'ncodpers'].values
final_preds_may16 = [' '.join([target_cols[k] for k in pred]) for pred in preds_may16]

out_df_may16 = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds_may16})
out_df_may16.to_csv('eda_4_28_may16.csv.gz', compression='gzip', index=False)

[0]	train-mlogloss:2.64321
[1]	train-mlogloss:2.42916
[2]	train-mlogloss:2.26143
[3]	train-mlogloss:2.12057
[4]	train-mlogloss:2.00099
[5]	train-mlogloss:1.89603
[6]	train-mlogloss:1.80356
[7]	train-mlogloss:1.72067
[8]	train-mlogloss:1.64465
[9]	train-mlogloss:1.57663
[10]	train-mlogloss:1.51355
[11]	train-mlogloss:1.45557
[12]	train-mlogloss:1.40214
[13]	train-mlogloss:1.35231
[14]	train-mlogloss:1.30587
[15]	train-mlogloss:1.26249
[16]	train-mlogloss:1.222
[17]	train-mlogloss:1.18383
[18]	train-mlogloss:1.14811
[19]	train-mlogloss:1.11448
[20]	train-mlogloss:1.08268
[21]	train-mlogloss:1.0525
[22]	train-mlogloss:1.02404
[23]	train-mlogloss:0.997108
[24]	train-mlogloss:0.971606
[25]	train-mlogloss:0.947348
[26]	train-mlogloss:0.924398
[27]	train-mlogloss:0.902768
[28]	train-mlogloss:0.881792
[29]	train-mlogloss:0.861978
[30]	train-mlogloss:0.842969
[31]	train-mlogloss:0.824979
[32]	train-mlogloss:0.80799
[33]	train-mlogloss:0.791558
[34]	train-mlogloss:0.776193
[35]	train-mlogloss:0.

### Combine two models

In [4]:
preds_june15 = model_june15.predict(xgb.DMatrix(x_test_june15.values))
preds_may16 = model_may16.predict(xgb.DMatrix(x_test_may16.values))

preds1 = np.sqrt(preds_june15*preds_may16)
preds2 = 0.5*preds_june15 + 0.5*preds_may16

# Geometric mean
df_preds1 = pd.DataFrame(preds1, index=x_test_may16.index, columns=target_cols)
# Remove already bought products 
df_preds1[x_test_may16[target_cols]==1] = 0 
preds1 = df_preds1.values
preds1 = np.argsort(preds1, axis=1)
preds1 = np.fliplr(preds1)[:, :7]

test_id = x_test_may16.loc[:, 'ncodpers'].values
final_preds1 = [' '.join([target_cols[k] for k in pred]) for pred in preds1]

out_df1 = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds1})
out_df1.to_csv('eda_4_28_gm.csv.gz', compression='gzip', index=False)

# Arithmetic mean
df_preds2 = pd.DataFrame(preds2, index=x_test_may16.index, columns=target_cols)
# Remove already bought products 
df_preds2[x_test_may16[target_cols]==1] = 0 
preds2 = df_preds2.values
preds2 = np.argsort(preds2, axis=1)
preds2 = np.fliplr(preds2)[:, :7]

test_id = x_test_may16.loc[:, 'ncodpers'].values
final_preds2 = [' '.join([target_cols[k] for k in pred]) for pred in preds2]

out_df2 = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds2})
out_df2.to_csv('eda_4_28_am.csv.gz', compression='gzip', index=False)
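
The masking and top-7 steps are repeated for every submission in this notebook, so they could be factored into one helper. The sketch below is a suggestion, not code from `santander_helper`: the function name, the toy product names, and the probabilities are made up, and ties are broken by column order rather than by the `fliplr(argsort(...))` order used above.

```python
import numpy as np

def top_k_new_products(probs, owned, columns, k=7):
    """Zero out products the customer already owns, then return the k best names per row."""
    masked = np.where(np.asarray(owned) == 1, 0.0, np.asarray(probs, dtype=float))
    order = np.argsort(-masked, axis=1, kind='stable')[:, :k]
    return [' '.join(columns[j] for j in row) for row in order]

# Toy example with made-up probabilities for three hypothetical products.
cols = ['prod_a', 'prod_b', 'prod_c']
p_june = np.array([[0.60, 0.30, 0.10], [0.62, 0.30, 0.08]])
p_may = np.array([[0.40, 0.50, 0.10], [0.10, 0.30, 0.60]])
owned = np.array([[1, 0, 0], [0, 0, 0]])   # first customer already owns prod_a

geometric = np.sqrt(p_june * p_may)
arithmetic = 0.5 * p_june + 0.5 * p_may
picks_gm = top_k_new_products(geometric, owned, cols, k=2)
picks_am = top_k_new_products(arithmetic, owned, cols, k=2)
```

In this toy case the second customer gets different rankings from the two blends: the geometric mean favors `prod_b`, on which both models agree, while the arithmetic mean favors products that only one model scores highly.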
