## Feature Engineering in RAM-Limited Data, Part 4

#### Mean encoding of `target_combine` on the previous month's target
1. To mean encode `target_combine` as a time series, I first need the lagged targets of the previous months. For example, for the June training set, I can include the products in April and May, and also encode all products bought in April with the mean target in May.

2. Another way is to ignore the time series: pool the targets of all months and compute the encoding on the pooled data. This gives results like those in the [3rd-place solution](http://blog.kaggle.com/2017/02/22/santander-product-recommendation-competition-3rd-place-winners-interview-ryuji-sakata/) and the [forum discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26899).

The first method is too complicated to implement, so I will try the second one.
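
As a minimal sketch of the second (pooled) approach, the toy frame below stands in for the real data. The column names `combo` and `new_count` and all the values are made up for illustration; they play the roles of `target_combine` and the number of newly added products.

```python
import pandas as pd

# Toy data (hypothetical): `combo` plays the role of `target_combine`,
# `new_count` the number of products added in the following month.
toy = pd.DataFrame({
    'combo':     [0.25, 0.25, 1.5, 1.5, 1.5],
    'new_count': [0,    2,    1,   0,   0],
})

# Indicator: did the customer add at least one product?
toy['new_indicator'] = (toy['new_count'] > 0).astype(int)

# Pooled mean encoding: one mean per `combo` value over all rows, ignoring time
toy['count_mean_encoding'] = toy.groupby('combo')['new_count'].transform('mean')
toy['indicator_mean_encoding'] = toy.groupby('combo')['new_indicator'].transform('mean')

print(toy)
```

`transform('mean')` returns a value for every row, so it replaces the separate aggregate-then-merge steps on small data. On the full data set, the explicit `groupby` plus `merge` used later in this notebook does the same thing.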

#### CV@2015-12-28:
- benchmark: val = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, mlogloss=1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val = 1.31122
- Private score: 0.0302475, public score: 0.0299266

In [1]:
from santander_helper import *
%matplotlib inline

## Encoding

Load targets

In [2]:
if os.path.isfile('../input/targets.hdf'):
    print('Load targets')
    targets = pd.read_hdf('../input/targets.hdf', 'targets')
else:
    print('Create targets')
    targets = []
    for m2 in tqdm.tqdm_notebook(month_list[1:-1]):
        target1 = obtain_target(m2)
        target1['fecha_dato'] = m2
        targets.append(target1)

    targets = pd.concat(targets, ignore_index=True, copy=False)
    targets.to_hdf('../input/targets.hdf', 'targets')

Load targets


Calculate `target_count`, the number of new products per customer and month

In [3]:
# Count the target rows (new products) per customer and month
new_product_per_customer = (targets.groupby(['ncodpers', 'fecha_dato'])['target']
                            .count()
                            .rename('target_count')
                            .reset_index())

# Shift each target month back by one, so that the count of products added in
# month m+1 lines up with the product data of month m
month_mapping = dict(zip(month_list[1:-1], month_list[:-2]))
new_product_per_customer['fecha_dato'] = new_product_per_customer['fecha_dato'].map(month_mapping)
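
To make the month shift concrete, here is a small sketch on toy data. The three dates in `month_list_toy` and the values in `counts` are hypothetical; only the mapping pattern mirrors the cell above.

```python
import pandas as pd

# Hypothetical three-month calendar, in chronological order
month_list_toy = ['2015-04-28', '2015-05-28', '2015-06-28']

# Map every month to the month before it
shift_back = dict(zip(month_list_toy[1:], month_list_toy[:-1]))

# Toy counts of new products, dated by the month in which they were added
counts = pd.DataFrame({
    'ncodpers':     [1, 2],
    'fecha_dato':   ['2015-05-28', '2015-06-28'],
    'target_count': [2, 1],
})

# After the shift, each count sits on the month *before* the purchase
counts['fecha_dato'] = counts['fecha_dato'].map(shift_back)
print(counts)
```

After the shift, a left merge on `fecha_dato` and `ncodpers` attaches "what was added next month" to each customer-month row.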

Load products and extract product information

In [4]:
if os.path.isfile('../input/df_target_cols.hdf'):
    print('Load df_target_cols')
    df = pd.read_hdf('../input/df_target_cols.hdf', 'df_target_cols')
else:
    print('Create df_target_cols')
    df = []
    for month in tqdm.tqdm_notebook(month_list):
        df.append(pd.read_hdf('../input/data_month_{}.hdf'.format(month), 'data_month'))
    df = pd.concat(df, ignore_index=True)
    df = df.loc[:, ['fecha_dato', 'ncodpers']+target_cols].copy()
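    # Pack each row's product flags into one float: product i gets weight 2**(i-10),
    # so (assuming 0/1 flags) every distinct product combination gets a distinct value
    df['target_combine'] = np.sum(df[target_cols].values*
        np.float_power(2, np.arange(-10, len(target_cols)-10)),
        axis=1, dtype=np.float64)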
    
    df.to_hdf('../input/df_target_cols.hdf', 'df_target_cols')

Create df_target_cols
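
The `target_combine` column packs the product flags into a single number by weighting them with powers of two. The sketch below shows the idea on a toy case with 4 products and exponents starting at -2. Both choices are made up for illustration, whereas the cell above uses `len(target_cols)` products and exponents starting at -10.

```python
import numpy as np

n_products = 4  # toy size, not the real number of products

# One power-of-two weight per product: 0.25, 0.5, 1, 2
weights = np.float_power(2, np.arange(-2, n_products - 2))

# Three toy customers with 0/1 product flags
holdings = np.array([
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
])
codes = np.sum(holdings * weights, axis=1, dtype=np.float64)
print(codes)  # 0.25, 1.5, 3.75

# Check that all 2**n_products possible flag vectors get distinct codes
all_vectors = np.array([[(i >> b) & 1 for b in range(n_products)]
                        for i in range(2 ** n_products)])
all_codes = all_vectors @ weights
print(len(np.unique(all_codes)) == 2 ** n_products)
```

Because each weight is a distinct power of two, the sum behaves like a binary number. Two customers share a code only if they hold exactly the same set of products, which is what makes the value usable as a grouping key.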


Merge `target_combine` and `target_count`

In [5]:
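# Left merge on the two shared keys; customer-months without new products get NaN
dt = pd.merge(df, new_product_per_customer, how='left', on=['fecha_dato', 'ncodpers'])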

In [7]:
dt.target_count = dt.target_count.fillna(0)

In [8]:
dt['target_indicator'] = (dt.target_count>0).astype(int)

In [17]:
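# Pooled mean encodings per `target_combine` value, computed over all months:
# the mean number of new products, and the share of rows with at least one new product
count_mean_encoding = pd.DataFrame(dt.groupby('target_combine')['target_count'].mean())
count_mean_encoding.columns = ['count_mean_encoding']
indicator_mean_encoding = pd.DataFrame(dt.groupby('target_combine')['target_indicator'].mean())
indicator_mean_encoding.columns = ['indicator_mean_encoding']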

In [19]:
dt = pd.merge(dt, count_mean_encoding, how='left', left_on='target_combine', right_index=True)

In [20]:
dt = pd.merge(dt, indicator_mean_encoding, how='left', left_on='target_combine', right_index=True)

In [23]:
dt.shape

(14576924, 26)

In [24]:
dt.columns

Index(['fecha_dato', 'ncodpers', 'ind_cco_fin_ult1', 'ind_cder_fin_ult1',
       'ind_cno_fin_ult1', 'ind_ctju_fin_ult1', 'ind_ctma_fin_ult1',
       'ind_ctop_fin_ult1', 'ind_ctpp_fin_ult1', 'ind_dela_fin_ult1',
       'ind_ecue_fin_ult1', 'ind_fond_fin_ult1', 'ind_hip_fin_ult1',
       'ind_nom_pens_ult1', 'ind_nomina_ult1', 'ind_plan_fin_ult1',
       'ind_pres_fin_ult1', 'ind_reca_fin_ult1', 'ind_recibo_ult1',
       'ind_tjcr_fin_ult1', 'ind_valo_fin_ult1', 'target_combine',
       'target_count', 'target_indicator', 'count_mean_encoding',
       'indicator_mean_encoding'],
      dtype='object')

In [25]:
dt = dt[['fecha_dato', 'ncodpers', 'target_combine', 'count_mean_encoding', 'indicator_mean_encoding']].copy()
dt.to_hdf('../input/target_mean_encoding.hdf', 'target_mean_encoding')