## Feature Engineering in RAM-Limited Data, Part 4

#### Mean encoding of `target_combine` on the previous month's target
1. To mean encode `target_combine` as a time series, I first need the lagged targets of the previous months. For example, for the June training set, I can include the products held in April and May, and encode each product combination held in April with the mean target observed in May.

2. Another way is to ignore the time dimension: pool the targets from all months and compute the means over the pooled data. This gives results like those in the [3rd-place solution](http://blog.kaggle.com/2017/02/22/santander-product-recommendation-competition-3rd-place-winners-interview-ryuji-sakata/) and the [forum discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26899).

The first method is too complicated to implement, so I will try the second one. 
- data: 
    - first month product from 2015-01-28 to 2016-04-28
    - second month product (new product) from 2015-02-28 to 2016-05-28
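As a rough illustration of the second (pooled) approach, the sketch below averages a target per category while ignoring the month. The frame `pooled`, the column `bought_new`, and all values are made up for illustration; only `target_combine` and `fecha_dato` are names from this notebook.

```python
import pandas as pd

# Toy pooled data: one row per (customer, first month); values are made up
pooled = pd.DataFrame({
    'fecha_dato':     ['2015-01-28', '2015-01-28', '2015-02-28', '2015-02-28'],
    'target_combine': [0.0, 1.0, 0.0, 1.0],
    'bought_new':     [1, 0, 0, 0],  # hypothetical: 1 if a product was added next month
})

# Ignore the month entirely and average the target within each category
pooled_mean = pooled.groupby('target_combine')['bought_new'].mean()

# Attach the encoded value to every row
pooled['bought_new_m'] = pooled['target_combine'].map(pooled_mean)
```

Because the month is never used as a grouping key, every month contributes to the same per-category average.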


#### CV@2015-12-28:
- benchmark: val = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, mlogloss=1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val = 1.31122
- Private score: 0.0302475, public score: 0.0299266

In [1]:
from santander_helper import *
%matplotlib inline

## Encoding

Load targets

In [2]:
if os.path.isfile('../input/targets.hdf'):
    # If the data already exists, just load it
    print('Load targets')
    targets = pd.read_hdf('../input/targets.hdf', 'targets')
else:
    print('Create targets')
    # If data does not exist, need to create one
    targets = []
    # obtain_target only needs the second month of each pair, so iterate over those directly
    for m2 in tqdm.tqdm_notebook(month_list[1:-1]):
        target1 = obtain_target(m2)
        target1['fecha_dato'] = m2
        targets.append(target1)

    targets = pd.concat(targets, ignore_index=True, copy=False)
    targets.to_hdf('../input/targets.hdf', 'targets', complib='blosc:lz4', complevel=9, format='t')

Load targets


New products for each customer at each month through `pivot_table`

In [3]:
targets_p = targets.copy()
targets_p['dummy'] = 1
targets_p = targets_p.pivot_table(index=['ncodpers', 'fecha_dato'], columns=['target'], values=['dummy'])
targets_p.fillna(0.0, inplace=True)
targets_p.reset_index(inplace=True)
targets_p.columns = ['ncodpers', 'fecha_dato']+target_cols
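To see what the pivot does, here is a small sketch on made-up data. The product names `prod_a` and `prod_b` and the frame `toy_targets` are hypothetical, and scalar `columns`/`values` arguments are used here so the result has flat column labels, whereas the cell above passes lists and then overwrites the column names.

```python
import pandas as pd

# Long format: one row per (customer, month, newly added product)
toy_targets = pd.DataFrame({
    'ncodpers':   [1, 1, 2],
    'fecha_dato': ['2015-02-28', '2015-02-28', '2015-03-28'],
    'target':     ['prod_a', 'prod_b', 'prod_a'],
})
toy_targets['dummy'] = 1

# Wide format: one row per (customer, month), one column per product
wide = toy_targets.pivot_table(index=['ncodpers', 'fecha_dato'],
                               columns='target', values='dummy')
wide = wide.fillna(0.0).reset_index()
wide.columns.name = None
```

Note that the final rename in the cell above relies on the pivoted columns coming out in the same order as `target_cols`, and on every product appearing at least once.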

Calculate `target_count`

In [4]:
# Count how many new products each customer purchases in each month
new_product_per_customer = (
    targets.groupby(['ncodpers', 'fecha_dato'])['target']
    .count()
    .rename('target_count')
    .reset_index()
)

Merge with `targets_p` 

In [6]:
new_product_per_customer = new_product_per_customer.merge(targets_p, how='left', on=['ncodpers', 'fecha_dato'])

In [7]:
new_product_per_customer.head()

Unnamed: 0,ncodpers,fecha_dato,target_count,ind_cco_fin_ult1,ind_cder_fin_ult1,ind_cno_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_ctop_fin_ult1,ind_ctpp_fin_ult1,...,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_nom_pens_ult1,ind_nomina_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_recibo_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1
0,15889,2015-05-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,15889,2015-12-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,15889,2016-03-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,15889,2016-05-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,15891,2015-07-28,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Map `fecha_dato` to the previous month. I want a mapping from the products held in the first month to the new products bought in the second month, so the first month should be the key.

In [8]:
month_mapping = dict(zip(month_list[1:-1], month_list[:-2]))
new_product_per_customer.fecha_dato = new_product_per_customer.fecha_dato.map(month_mapping)

In [9]:
new_product_per_customer.head()

Unnamed: 0,ncodpers,fecha_dato,target_count,ind_cco_fin_ult1,ind_cder_fin_ult1,ind_cno_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_ctop_fin_ult1,ind_ctpp_fin_ult1,...,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_nom_pens_ult1,ind_nomina_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_recibo_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1
0,15889,2015-04-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,15889,2015-11-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,15889,2016-02-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,15889,2016-04-28,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,15891,2015-06-28,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
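The shift can be checked on a toy month list. The names `month_list_toy`, `month_mapping_toy`, and `purchases` below are hypothetical, and the slicing is simplified compared with the cell above, which also drops the last month of `month_list`.

```python
import pandas as pd

month_list_toy = ['2015-01-28', '2015-02-28', '2015-03-28', '2015-04-28']

# Each month points to the month before it
month_mapping_toy = dict(zip(month_list_toy[1:], month_list_toy[:-1]))

purchases = pd.DataFrame({'ncodpers': [1, 2],
                          'fecha_dato': ['2015-02-28', '2015-04-28']})
purchases['fecha_dato'] = purchases['fecha_dato'].map(month_mapping_toy)

# A month with no predecessor in the mapping becomes NaN
unmapped = pd.Series(['2015-01-28']).map(month_mapping_toy)
```

After the shift, a purchase row carries the month whose product holdings should predict it, so it can be merged directly with the first-month data.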


Load the current products (products in the first month) and extract product information

In [10]:
if os.path.isfile('../input/df_target_cols.hdf'):
    print('Load df_target_cols')
    df = pd.read_hdf('../input/df_target_cols.hdf', 'df_target_cols')
else:
    print('Create df_target_cols')
    df = []
    for month in tqdm.tqdm_notebook(month_list[:-2]):
        df.append(pd.read_hdf('../input/data_month_{}.hdf'.format(month), 'data_month'))
    df = pd.concat(df, ignore_index=True)
    df = df.loc[:, ['fecha_dato', 'ncodpers']+target_cols].copy()
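    # Encode each customer's product holdings as one number: product i contributes 2**(i-10),
    # so every distinct combination of products gets its own target_combine value
    df['target_combine'] = np.sum(df[target_cols].values*
        np.float_power(2, np.arange(-10, len(target_cols)-10)), 
        axis=1, dtype=np.float64)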
    df.drop(target_cols, axis=1, inplace=True)
    
    df.to_hdf('../input/df_target_cols.hdf', 'df_target_cols', complib='blosc:lz4', complevel=9, format='t')

Create df_target_cols






Merge `target_combine` and `target_count`

In [11]:
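# Left merge keeps every (customer, month) row; rows without new products get NaN, filled with 0
dt = pd.merge(df, new_product_per_customer, how='left', on=['ncodpers', 'fecha_dato'])
dt.fillna(0, inplace=True)
# 1 if the customer added at least one product in the following month
dt['target_indicator'] = (dt.target_count>0).astype(int)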

In [12]:
dt.head()

Unnamed: 0,fecha_dato,ncodpers,target_combine,target_count,ind_cco_fin_ult1,ind_cder_fin_ult1,ind_cno_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_ctop_fin_ult1,...,ind_hip_fin_ult1,ind_nom_pens_ult1,ind_nomina_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_recibo_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,target_indicator
0,2015-01-28,1375586,0.000977,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,2015-01-28,1050611,0.000977,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,2015-01-28,1050612,0.000977,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,2015-01-28,1050613,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,2015-01-28,1050614,0.000977,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
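The `target_combine` values in this table, such as 0.000977 (about 2^-10), come from the power-of-two weighting used when `df_target_cols` was built. The sketch below reproduces that weighting on a hypothetical four-product matrix and adds a `decode` helper, which is not part of the notebook, to show that the product flags can be recovered from the single number.

```python
import numpy as np

# Hypothetical holdings for three customers and four products (1 = holds the product)
products = np.array([
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
])

# Same weighting as the notebook: product i contributes 2**(i - 10)
weights = np.float_power(2, np.arange(-10, products.shape[1] - 10))
target_combine = np.sum(products * weights, axis=1, dtype=np.float64)

def decode(value, n_products):
    """Recover the 0/1 product flags from a target_combine value (illustrative helper)."""
    as_int = int(round(value * 2**10))  # shift so that product 0 becomes bit 0
    return [(as_int >> i) & 1 for i in range(n_products)]
```

Every weight is a power of two, so the sums are exact in `float64` and each product combination gets a distinct value.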


In [13]:
# For each target column, compute its mean within every `target_combine` group
# and merge it back onto dt as a new column with the suffix `_m`
mean_encoding = {}
mean_encoding_cols = target_cols+['target_count', 'target_indicator']
for c in tqdm.tqdm_notebook(mean_encoding_cols):
    mean_encoding[c] = pd.DataFrame(dt.groupby('target_combine')[c].mean())
    mean_encoding[c].columns = [c]
    dt = pd.merge(dt, mean_encoding[c], how='left', left_on='target_combine', right_index=True, suffixes=('', '_m'))

# Remove auxiliary columns
dt.drop(target_cols+['target_count', 'target_indicator', 'fecha_dato', 'ncodpers'], inplace=True, axis=1)
# Remove duplicate rows
dt.drop_duplicates(inplace=True)





In [14]:
dt.head()

Unnamed: 0,target_combine,ind_cco_fin_ult1_m,ind_cder_fin_ult1_m,ind_cno_fin_ult1_m,ind_ctju_fin_ult1_m,ind_ctma_fin_ult1_m,ind_ctop_fin_ult1_m,ind_ctpp_fin_ult1_m,ind_dela_fin_ult1_m,ind_ecue_fin_ult1_m,...,ind_nom_pens_ult1_m,ind_nomina_ult1_m,ind_plan_fin_ult1_m,ind_pres_fin_ult1_m,ind_reca_fin_ult1_m,ind_recibo_ult1_m,ind_tjcr_fin_ult1_m,ind_valo_fin_ult1_m,target_count_m,target_indicator_m
0,0.000977,0.0,3e-06,0.001438,1.75156e-07,0.000775,8e-05,3.1e-05,0.00064,0.000649,...,0.001865,0.001799,1.3e-05,2e-06,0.000359,0.011854,0.000798,0.000137,0.020539,0.016837
3,0.0,0.012075,1e-06,0.00064,0.0002025728,0.000511,0.000158,4.1e-05,0.000386,0.000974,...,0.000497,0.000454,1.3e-05,2e-06,2.1e-05,0.001057,0.000671,3.9e-05,0.017769,0.015525
21,64.000977,0.0,8e-06,0.009034,0.0,0.000341,0.000482,0.000292,0.000591,0.00266,...,0.01202,0.011465,6.3e-05,5e-06,0.003308,0.0,0.006771,0.000831,0.048004,0.027418
41,0.250977,0.0,1.7e-05,0.002663,0.0,0.000105,0.000621,0.000178,0.005602,0.0,...,0.004138,0.003839,3.3e-05,6e-06,0.000804,0.03549,0.011815,0.000943,0.067019,0.05851
44,64.250977,0.0,0.0,0.007761,0.0,0.000207,0.00145,0.000753,0.003202,0.0,...,0.013186,0.012395,0.000188,0.0,0.00373,0.0,0.026975,0.002072,0.072505,0.052462


In [15]:
dt.shape

(9485, 22)
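The loop, merge, drop, and de-duplicate steps leave one row per distinct `target_combine`. On toy data, a single `groupby` appears to produce the same table; the sketch below compares the two. The frame `toy` and its values are made up, and this is offered as a possible simplification, not as what the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    'target_combine':   [0.0, 0.0, 1.0, 1.0, 1.0],
    'target_count':     [1, 0, 2, 0, 1],
    'target_indicator': [1, 0, 1, 0, 1],
})
cols = ['target_count', 'target_indicator']

# Loop-and-merge version, mirroring the notebook cell
looped = toy.copy()
for c in cols:
    means = looped.groupby('target_combine')[c].mean().to_frame(c)
    looped = pd.merge(looped, means, how='left', left_on='target_combine',
                      right_index=True, suffixes=('', '_m'))
looped = looped.drop(columns=cols).drop_duplicates().set_index('target_combine')

# Single groupby version: one row per target_combine, columns suffixed with _m
direct = toy.groupby('target_combine')[cols].mean().add_suffix('_m')
```

The single-groupby form also avoids repeatedly merging onto the full row-level frame, which may matter when RAM is tight.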

In [16]:
dt.set_index('target_combine', inplace=True)

In [17]:
dt.head()

target_combine,ind_cco_fin_ult1_m,ind_cder_fin_ult1_m,ind_cno_fin_ult1_m,ind_ctju_fin_ult1_m,ind_ctma_fin_ult1_m,ind_ctop_fin_ult1_m,ind_ctpp_fin_ult1_m,ind_dela_fin_ult1_m,ind_ecue_fin_ult1_m,ind_fond_fin_ult1_m,...,ind_nom_pens_ult1_m,ind_nomina_ult1_m,ind_plan_fin_ult1_m,ind_pres_fin_ult1_m,ind_reca_fin_ult1_m,ind_recibo_ult1_m,ind_tjcr_fin_ult1_m,ind_valo_fin_ult1_m,target_count_m,target_indicator_m
0.000977,0.0,3e-06,0.001438,1.75156e-07,0.000775,8e-05,3.1e-05,0.00064,0.000649,9.5e-05,...,0.001865,0.001799,1.3e-05,2e-06,0.000359,0.011854,0.000798,0.000137,0.020539,0.016837
0.0,0.012075,1e-06,0.00064,0.0002025728,0.000511,0.000158,4.1e-05,0.000386,0.000974,2.6e-05,...,0.000497,0.000454,1.3e-05,2e-06,2.1e-05,0.001057,0.000671,3.9e-05,0.017769,0.015525
64.000977,0.0,8e-06,0.009034,0.0,0.000341,0.000482,0.000292,0.000591,0.00266,0.000128,...,0.01202,0.011465,6.3e-05,5e-06,0.003308,0.0,0.006771,0.000831,0.048004,0.027418
0.250977,0.0,1.7e-05,0.002663,0.0,0.000105,0.000621,0.000178,0.005602,0.0,0.000765,...,0.004138,0.003839,3.3e-05,6e-06,0.000804,0.03549,0.011815,0.000943,0.067019,0.05851
64.250977,0.0,0.0,0.007761,0.0,0.000207,0.00145,0.000753,0.003202,0.0,0.000584,...,0.013186,0.012395,0.000188,0.0,0.00373,0.0,0.026975,0.002072,0.072505,0.052462


In [18]:
dt.to_hdf('../input/target_mean_encoding.hdf', 'target_mean_encoding', complib='blosc:lz4', complevel=9, format='t')
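One way the saved table might be used later is to join it onto rows that already carry a `target_combine` column. The sketch below uses in-memory stand-ins instead of reading the HDF file; `encoding`, `train`, and all values are hypothetical.

```python
import pandas as pd

# Stand-in for the saved table, indexed by target_combine (values are made up)
encoding = pd.DataFrame(
    {'target_count_m': [0.02, 0.05], 'target_indicator_m': [0.015, 0.03]},
    index=pd.Index([0.0, 1.0], name='target_combine'),
)

# Hypothetical training rows that already carry a target_combine column
train = pd.DataFrame({'ncodpers': [10, 11, 12],
                      'target_combine': [1.0, 0.0, 2.0]})

# Look up the encoded features by key; combinations missing from the table become NaN
train = train.join(encoding, on='target_combine')
```

Product combinations that never appeared in the encoding months get NaN here and would need their own handling.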