## Feature Engineering and CV, continued from eda_4_25

New train and test set generation. Features include:
- before eda_4_25
    - customer info in the second month
    - products in the first month
    - combination of first and second month `ind_actividad_cliente`
    - combination of first and second month `tiprel_1mes`
    - combination of first-month products, encoded as a binary number (`target_combine`)
    - encoding `target_combine` with
        - mean number of new products
        - mean number of customers with new products
        - mean number of customers with each new product
    - Count patterns in the last `max_lag` months
    - Number of months since the customer last purchased each product
        - CV@2015-12-28: mlogloss=1.29349
        - Private score: 0.0302475, public score: 0.0299266
- eda_4_25
    - Use all available history data
        - E.g., for the 2016-05-28 training data, use all previous months; for 2015-02-28, use 1 lag month.
        - Each training set needs a matching test set built with the same number of previous months.
        - This comes from [the second-place winner's solution](https://www.kaggle.com/c/santander-product-recommendation/discussion/26824); see the bold text in paragraph 4.
    - Combine models trained on 2016-05-28 and 2015-06-28:
        - Private score: 0.0304583, public score: 0.0300839
        - This is to capture both seasonality and trend, which are present in 2015-06-28 and 2016-05-28, respectively.
        - This idea is mentioned by many winners, such as the [11th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823) and the [14th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26808).
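
The `target_combine` feature and its mean encoding from the list above can be illustrated on toy data. This is only a sketch: the product columns, the `n_new` column, and the bit order are assumptions made for the example and may not match what `santander_helper.py` actually does.

```python
import pandas as pd

# Hypothetical toy data: 0/1 ownership flags for three products in the first
# month, and the number of products each customer newly added afterwards.
product_cols = ['prod_a', 'prod_b', 'prod_c']  # stand-ins for the real product columns
df = pd.DataFrame({
    'prod_a': [1, 1, 0, 0],
    'prod_b': [0, 0, 1, 1],
    'prod_c': [1, 1, 0, 1],
    'n_new':  [2, 0, 1, 0],
})

# Read each row of flags as a binary number (assumption: prod_a is the
# most significant bit).
weights = [2 ** i for i in range(len(product_cols) - 1, -1, -1)]  # [4, 2, 1]
df['target_combine'] = (df[product_cols].values * weights).sum(axis=1)

# Mean-encode the combination: average number of new products among
# customers who share the same ownership pattern.
mean_new = df.groupby('target_combine')['n_new'].mean()
df['target_combine_mean_new'] = df['target_combine'].map(mean_new)
```

Customers with identical first-month holdings get the same `target_combine` value, so the encoded column summarizes how often that holding pattern leads to new purchases.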

New in this notebook:
- Put 2015-06-28 and 2016-05-28 in the same data set, with the same lag=5
    - Private score: 0.0303096, public score: 0.0299867
    - This differs from what the [11th-place winner's discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823) reports:
        > We tested this by adding 50% of May-16 data to our June model and sure enough, we went from 0.0301 to 0.0303. Then, we built separate models for Jun and May, but the ensemble didn’t work. We weren’t surprised because June data is better for seasonal products, and May data is better for trend products. And vice-versa, June data is bad for trend products and May data is bad for seasonal products. So, they sort of cancelled each other out.
        
    - My scores are always lower than theirs, which may be why our observations differ.

In [1]:
from santander_helper import *

### Zero counting function

For each (customer, product) pair, count the number of consecutive months immediately before the current month in which the target is zero. The count looks back at most `max_lag` months.

The function has been moved to `santander_helper.py`.
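
As a rough illustration of the idea, the sketch below counts trailing zeros on a small array. It is not the implementation in `santander_helper.py`; the function name `count_trailing_zeros` and the input layout (one row per customer-product pair, oldest month first, current month excluded) are assumptions for this example.

```python
import numpy as np

def count_trailing_zeros(history, max_lag):
    """Count consecutive zero months right before the current month.

    history: 2-D array-like of shape (n_pairs, n_months), oldest month first,
             with the current month excluded (hypothetical layout).
    max_lag: how many months to look back; assumed to be >= 1.
    """
    window = np.asarray(history)[:, -max_lag:]   # keep the last max_lag months
    recent_first = window[:, ::-1]               # most recent month in column 0
    nonzero = recent_first != 0
    # Position of the first nonzero month when walking backwards in time.
    first_nonzero = np.argmax(nonzero, axis=1)
    # argmax returns 0 when a row has no nonzero entry, so handle that case.
    all_zero = ~nonzero.any(axis=1)
    return np.where(all_zero, window.shape[1], first_nonzero)

toy_history = [[1, 0, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 0],
               [0, 1, 0, 1]]
counts_lag3 = count_trailing_zeros(toy_history, max_lag=3)
counts_lag4 = count_trailing_zeros(toy_history, max_lag=4)
```

A pair that was nonzero in the most recent month gets a count of 0, and a pair that was zero throughout the window gets the window length.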

In [3]:
x_train_may16, y_train_may16 = create_train('2016-05-28', pattern_flag=True, max_lag=5)
x_train_june15, y_train_june15 = create_train('2015-06-28', pattern_flag=True, max_lag=5)
x_test = create_test(pattern_flag=True, max_lag=5)

In [4]:
x_train_may16.shape

(37889, 158)

In [5]:
x_train_june15.shape

(45140, 158)

In [6]:
x_train = pd.concat((x_train_may16, x_train_june15), ignore_index=True)
y_train = pd.concat((y_train_may16, y_train_june15), ignore_index=True)

### Train model on 2015-06-28 and 2016-05-28 combined

In [7]:
param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 8, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 60

dtrain = xgb.DMatrix(x_train.values, y_train.values)
#dval = xgb.DMatrix(x_val.values, y_val.values)
model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train')], verbose_eval=True)

[0]	train-mlogloss:2.71227
[1]	train-mlogloss:2.54375
[2]	train-mlogloss:2.40782
[3]	train-mlogloss:2.29505
[4]	train-mlogloss:2.19766
[5]	train-mlogloss:2.11216
[6]	train-mlogloss:2.03695
[7]	train-mlogloss:1.96874
[8]	train-mlogloss:1.90646
[9]	train-mlogloss:1.84965
[10]	train-mlogloss:1.7978
[11]	train-mlogloss:1.75018
[12]	train-mlogloss:1.70629
[13]	train-mlogloss:1.66503
[14]	train-mlogloss:1.62675
[15]	train-mlogloss:1.59081
[16]	train-mlogloss:1.55701
[17]	train-mlogloss:1.52529
[18]	train-mlogloss:1.49556
[19]	train-mlogloss:1.46752
[20]	train-mlogloss:1.44085
[21]	train-mlogloss:1.41581
[22]	train-mlogloss:1.39212
[23]	train-mlogloss:1.36958
[24]	train-mlogloss:1.34832
[25]	train-mlogloss:1.32807
[26]	train-mlogloss:1.30886
[27]	train-mlogloss:1.29067
[28]	train-mlogloss:1.27312
[29]	train-mlogloss:1.25666
[30]	train-mlogloss:1.24079
[31]	train-mlogloss:1.22565
[32]	train-mlogloss:1.21119
[33]	train-mlogloss:1.19727
[34]	train-mlogloss:1.184
[35]	train-mlogloss:1.17141
[36]	

In [8]:
preds = model.predict(xgb.DMatrix(x_test.values))

df_preds = pd.DataFrame(preds, index=x_test.index, columns=target_cols)
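# Zero out the scores of products the customer already owns so they are never recommended
df_preds[x_test[target_cols] == 1] = 0
preds = df_preds.values
# Sort product indices by descending score and keep the top 7
preds = np.argsort(preds, axis=1)
preds = np.fliplr(preds)[:, :7]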
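
The masking and ranking step in the cell above can be checked on a toy example. The arrays here are made up for illustration, and only the top 2 products are kept instead of 7.

```python
import numpy as np

# Hypothetical predicted probabilities for 2 customers and 3 products.
probs = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3]])
# 1 means the customer already owns the product.
owned = np.array([[0, 1, 0],
                  [0, 0, 0]])

# Owned products get a score of 0 so they sink to the bottom of the ranking.
masked = np.where(owned == 1, 0.0, probs)

# argsort is ascending, so flip each row to get descending order, then truncate.
top2 = np.fliplr(np.argsort(masked, axis=1))[:, :2]
```

For the first customer, product 1 has the highest raw probability but is already owned, so it drops out of the top 2.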

In [9]:
test_id = x_test.loc[:, 'ncodpers'].values
final_preds = [' '.join([target_cols[k] for k in pred]) for pred in preds]

out_df = pd.DataFrame({'ncodpers': test_id, 'added_products': final_preds})
out_df.to_csv('eda_4_26.csv.gz', compression='gzip', index=False)

### Train on 50% of 2016-05-28 and 100% of 2015-06-28

In [17]:
# Randomly pick half of the 2016-05-28 rows (fixed seed for reproducibility)
np.random.seed(0)
idx = np.arange(len(x_train_may16))
np.random.shuffle(idx)
idx = idx[:len(idx) // 2]

In [18]:
x_train = pd.concat((x_train_may16.iloc[idx, :], x_train_june15), ignore_index=True)
y_train = pd.concat((y_train_may16.iloc[idx], y_train_june15), ignore_index=True)

In [None]:
param = {'objective': 'multi:softprob', 
         'eta': 0.05, 
         'max_depth': 8, 
         'silent': 1, 
         'num_class': len(target_cols),
         'eval_metric': 'mlogloss',
         'min_child_weight': 1,
         'subsample': 0.7,
         'colsample_bytree': 0.7,
         'seed': 0}
num_rounds = 60

dtrain = xgb.DMatrix(x_train.values, y_train.values)
#dval = xgb.DMatrix(x_val.values, y_val.values)
model = xgb.train(param, dtrain, num_rounds, evals=[(dtrain, 'train')], verbose_eval=True)

