## Feature Engineering in RAM-Limited Data, Part 4

#### Mean encoding of `target_combine` on the previous month's target
1. To mean encode `target_combine` with a time lag, I first need the lagged targets of previous months. For example, for the June training set, I can include the products held in April and May, and encode each product combination from April with the mean target in May.

2. Another way to mean encode is to ignore the time series: pool all targets together and compute the mean per combination. In this case, we can get results like those in the [3rd-place solution](http://blog.kaggle.com/2017/02/22/santander-product-recommendation-competition-3rd-place-winners-interview-ryuji-sakata/) and the [forum discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26899).

The first method is too complicated to implement, so I will try the second one.
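
For reference, the first (time-lagged) method could look roughly like the sketch below. It is not the code used in this notebook: the toy frame, the `bought` column and the two-month setup are assumptions made only to illustrate pairing April combinations with May targets.

```python
import pandas as pd

# Toy data (hypothetical): one row per customer per month.
# `bought` flags whether the customer added any product in that month.
toy = pd.DataFrame({
    'month':          ['2015-04'] * 4 + ['2015-05'] * 4,
    'ncodpers':       [1, 2, 3, 4, 1, 2, 3, 4],
    'target_combine': [0.5, 0.5, 2.0, 2.0, 0.5, 2.0, 2.0, 2.0],
    'bought':         [0, 0, 0, 0, 1, 1, 0, 1],
})

# Product combination held in April, per customer.
april = toy.loc[toy.month == '2015-04', ['ncodpers', 'target_combine']]
# Target observed one month later, per customer.
may = toy.loc[toy.month == '2015-05', ['ncodpers', 'bought']]

# Pair each April combination with the May target of the same customer,
# then average the target per combination.
lagged = april.merge(may, on='ncodpers')
lagged_encoding = lagged.groupby('target_combine')['bought'].mean()
print(lagged_encoding)
```

Repeating this for every pair of consecutive months is what makes the first method heavier to implement than the pooled version.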

#### CV@2015-12-28:
- benchmark: val = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, mlogloss=1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val = 1.31122
- Private score: 0.0302475, public score: 0.0299266

In [1]:
from santander_helper import *
%matplotlib inline

## Encoding

Load targets

In [2]:
if os.path.isfile('../input/targets.hdf'):
    targets = pd.read_hdf('../input/targets.hdf', 'targets')
else:
    print('Create targets')
    targets = []
    for m2 in tqdm.tqdm_notebook(month_list[1:-1]):
        target1 = obtain_target(m2)
        target1['fecha_dato'] = m2
        targets.append(target1)

    targets = pd.concat(targets, ignore_index=True, copy=False)
    targets.to_hdf('../input/targets.hdf', 'targets')

Create targets


Load products 

In [3]:
df = []
for month in tqdm.tqdm_notebook(month_list):
    df.append(pd.read_hdf('../input/data_month_{}.hdf'.format(month), 'data_month'))
df = pd.concat(df, ignore_index=True)

Extract product information

In [4]:
df = df.loc[:, ['fecha_dato', 'ncodpers']+target_cols].copy()

Calculate `target_combine`

In [30]:
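# Collapse each row's 0/1 product flags into one number: the i-th product
# contributes 2**(i-3), so each distinct combination gets a distinct value.
df['target_combine'] = np.sum(
    df[target_cols].values * np.float_power(2, np.arange(-3, len(target_cols) - 3)),
    axis=1, dtype=np.float64)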
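
To make the encoding concrete, here is a small worked example with a hypothetical 5-product list instead of the full `target_cols`. The `decode` helper is not part of the notebook; it is only there to show that the code is a bitmask shifted by three binary places and can be inverted.

```python
import numpy as np

# Hypothetical 5-product example; the notebook uses len(target_cols) products.
n_products = 5
weights = np.float_power(2, np.arange(-3, n_products - 3))  # 0.125, 0.25, 0.5, 1, 2

ownership = np.array([
    [1, 0, 0, 0, 0],   # only the first product
    [0, 0, 1, 1, 0],   # third and fourth products
    [1, 1, 1, 1, 1],   # every product
], dtype=np.float64)

# One combined code per row, same formula as in the cell above.
codes = np.sum(ownership * weights, axis=1, dtype=np.float64)

def decode(code, n=n_products):
    """Recover the 0/1 ownership list from a combined code (illustrative helper)."""
    as_int = int(round(code * 8))  # multiply by 2**3 to get an integer bitmask
    return [(as_int >> i) & 1 for i in range(n)]

print(codes)             # combined codes
print(decode(codes[1]))  # ownership flags of the second row
```

Because every weight is a power of two, the sums are exact in `float64` as long as the number of products stays well below the 53 bits of mantissa, so equal combinations always produce exactly equal codes.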

In [31]:
new_product_per_customer = targets.groupby(['ncodpers', 'fecha_dato'])['target'].count()

In [32]:
new_product_per_customer = pd.DataFrame(new_product_per_customer)
new_product_per_customer.reset_index(inplace=True, drop=False)

In [33]:
df.sort_values(['fecha_dato', 'ncodpers'], inplace=True)

In [34]:
# Join on the two shared key columns explicitly.
dt = pd.merge(df, new_product_per_customer, on=['fecha_dato', 'ncodpers'], how='left')

In [35]:
# The merged count column is still called 'target'; rename it by name.
dt = dt.rename(columns={'target': 'target_count'})

In [36]:
dt.target_count = dt.target_count.fillna(0)

In [37]:
dt['target_indicator'] = (dt.target_count>0).astype(int)
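
The cells from the `groupby` down to `target_indicator` can be checked on toy data. The sketch below assumes that `targets` holds one row per customer, month and newly added product, which is what counting the `target` column suggests; the frames and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `targets` (assumed layout: one row per newly added product).
toy_targets = pd.DataFrame({
    'ncodpers':   [1, 1, 3],
    'fecha_dato': ['2015-02-28', '2015-02-28', '2015-02-28'],
    'target':     [2, 7, 4],
})

# Toy stand-in for `df`: one row per customer and month.
toy_df = pd.DataFrame({
    'fecha_dato': ['2015-02-28'] * 3,
    'ncodpers':   [1, 2, 3],
})

# Number of new products per customer and month.
counts = (toy_targets
          .groupby(['ncodpers', 'fecha_dato'])['target']
          .count()
          .rename('target_count')
          .reset_index())

# Left join keeps customers without new products; they get a count of 0.
toy_dt = toy_df.merge(counts, on=['fecha_dato', 'ncodpers'], how='left')
toy_dt['target_count'] = toy_dt['target_count'].fillna(0)
toy_dt['target_indicator'] = (toy_dt['target_count'] > 0).astype(int)
print(toy_dt)
```

Customer 2 has no row in the toy targets, so the left join leaves a missing count that `fillna(0)` turns into "no new product".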

In [53]:
target_combine_mean_encoding = dt.groupby('target_combine')['target_indicator'].mean()
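
Once the per-combination means exist, they still have to be attached to rows as a feature. One way to do that is sketched below on toy data. The column name `target_combine_enc` and the choice to fill unseen combinations with the global mean are assumptions, not something this notebook does.

```python
import pandas as pd

# Toy stand-in for `dt` with made-up values.
toy_dt = pd.DataFrame({
    'target_combine':   [0.5, 0.5, 2.0, 2.0, 2.0, 4.0],
    'target_indicator': [1, 0, 0, 0, 1, 0],
})

# Same pooled mean encoding as in the cell above.
encoding = toy_dt.groupby('target_combine')['target_indicator'].mean()

# Rows to encode; 8.0 is a combination never seen when fitting the encoding.
new_rows = pd.DataFrame({'target_combine': [0.5, 2.0, 8.0]})

# Look up each combination; fall back to the overall mean for unseen ones
# (an assumed fallback, chosen here only for illustration).
global_mean = toy_dt['target_indicator'].mean()
new_rows['target_combine_enc'] = (new_rows['target_combine']
                                  .map(encoding)
                                  .fillna(global_mean))
print(new_rows)
```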

In [42]:
dt.head()

| | fecha_dato | ncodpers | ind_cco_fin_ult1 | ind_cder_fin_ult1 | ind_cno_fin_ult1 | ind_ctju_fin_ult1 | ind_ctma_fin_ult1 | ind_ctop_fin_ult1 | ind_ctpp_fin_ult1 | ind_dela_fin_ult1 | ... | ind_nomina_ult1 | ind_plan_fin_ult1 | ind_pres_fin_ult1 | ind_reca_fin_ult1 | ind_recibo_ult1 | ind_tjcr_fin_ult1 | ind_valo_fin_ult1 | target_combine | target_count | target_indicator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-28 | 15889 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 49160.125 | 0.0 | 0 |
| 1 | 2015-01-28 | 15890 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 26408.5 | 0.0 | 0 |
| 2 | 2015-01-28 | 15892 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 61488.5 | 0.0 | 0 |
| 3 | 2015-01-28 | 15893 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 32784.0 | 0.0 | 0 |
| 4 | 2015-01-28 | 15894 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 62256.625 | 0.0 | 0 |


In [48]:
dt.loc[dt.target_combine==0, 'target_count'].sum()

0.0

In [54]:
target_combine_mean_encoding

target_combine
0.000        0.000000
0.125        0.020101
0.250        0.028846
0.375        0.024462
0.500        0.045138
0.625        0.088940
1.000        0.009800
1.125        0.166667
2.000        0.096502
2.125        0.107792
2.250        1.000000
2.500        0.172746
2.625        0.137931
4.000        0.001032
4.125        0.002693
4.250        0.043478
4.375        0.018595
4.500        0.027962
4.625        0.056604
6.000        0.029851
6.125        0.057554
6.500        0.090909
8.000        0.001359
8.125        0.005789
8.375        0.025641
8.500        0.032741
8.625        0.021978
8.750        0.000000
10.000       0.000000
10.125       0.027778
               ...   
63456.500    0.000000
63456.625    0.428571
63464.500    0.200000
63472.500    0.000000
63472.625    0.250000
63480.500    0.043478
63488.500    0.000000
63560.125    1.000000
63744.500    0.214286
63776.500    1.000000
63872.500    1.000000
63916.500    1.000000
64256.500    0.071429
64260.125    1.00