## Feature Engineering in RAM-Limited Data, Part 4

#### Mean encoding of `target_combine` on the previous month target
1. To mean encode `target_combine` with the time structure kept, I first need the lagged targets of previous months. For example, for the June training set I can include the products in April and May, and also encode the products held in April with the mean target in May.

2. Another way of mean encoding is to ignore the time series: pool all targets together and compute the means over the pooled data. This gives results like those in the [3rd-place solution](http://blog.kaggle.com/2017/02/22/santander-product-recommendation-competition-3rd-place-winners-interview-ryuji-sakata/) and the [forum discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26899).

The first method is too complicated to implement, so I will try the second one.
- data:
    - first-month products: 2015-01-28 to 2016-04-28
    - second-month products (new products): 2015-02-28 to 2016-05-28
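
As a minimal sketch of the second (pooled) method, assuming a toy frame with made-up values: every customer-month row is pooled, and each first-month pattern is encoded with the mean of a second-month target. The column `new_product` is a hypothetical stand-in for any one of the new-product columns.

```python
import pandas as pd

# Toy data (hypothetical): one row per customer-month, pooled across all months.
# `target_combine` is the product pattern held in the first month;
# `new_product` is 1 if the customer added a given product in the second month.
toy = pd.DataFrame({
    'target_combine': [1.0, 1.0, 1.0, 4.0, 4.0],
    'new_product':    [0,   1,   0,   1,   1],
})

# Mean encoding: average target per pattern, ignoring which month a row came from
encoding = toy.groupby('target_combine')['new_product'].mean()

# Attach the encoded value back to every row
toy['new_product_m'] = toy['target_combine'].map(encoding)
```

Here pattern `1.0` is encoded as 1/3 and pattern `4.0` as 1.0.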


#### CV@2015-12-28:
- benchmark: val = 1.62857
- with only `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, mlogloss=1.57141
- with `ind_actividad_client_combine`, `tiprel_1mes_combine`, `target_combine`, `n_products` and patterns: val = 1.31122
- Private score: 0.0302475, public score: 0.0299266

In [1]:
from santander_helper import *
%matplotlib inline

## Encoding

Load targets

In [2]:
if os.path.isfile('../input/targets.hdf'):
    # If the data already exists, just load it
    print('Load targets')
    targets = pd.read_hdf('../input/targets.hdf', 'targets')
else:
    print('Create targets')
    # If data does not exist, need to create one
    targets = []
    # For each pair of months, call obtain_target (it actually does not need a pair, just the second month)
    for m1, m2 in tqdm.tqdm_notebook(list(zip(month_list[:-2], month_list[1:-1]))):
        target1 = obtain_target(m2)
        target1['fecha_dato'] = m2
        targets.append(target1)

    targets = pd.concat(targets, ignore_index=True, copy=False)
    targets.to_hdf('../input/targets.hdf', 'targets', complib='blosc:lz4', complevel=9, format='t')

Load targets


New products for each customer in each month, obtained through `pivot_table`

In [3]:
targets_p = targets.copy()
targets_p['dummy'] = 1
targets_p = targets_p.pivot_table(index=['ncodpers', 'fecha_dato'], columns=['target'], values=['dummy'])
targets_p.fillna(0.0, inplace=True)
targets_p.reset_index(inplace=True)
targets_p.columns = ['ncodpers', 'fecha_dato']+target_cols
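
To make the reshaping concrete, here is a small sketch on made-up data. It assumes `target` holds an integer product index, which this section does not confirm. Note that the cell above renames the pivoted columns positionally, so it relies on every product appearing at least once and on the pivoted order matching `target_cols`.

```python
import pandas as pd

# Toy long-format targets (hypothetical values): one row per newly bought product
toy_targets = pd.DataFrame({
    'ncodpers':   [1, 1, 2],
    'fecha_dato': ['2015-02-28', '2015-02-28', '2015-03-28'],
    'target':     [0, 2, 1],
})
toy_targets['dummy'] = 1

# Passing `values` as a string (not a list) keeps the resulting columns flat
wide = toy_targets.pivot_table(index=['ncodpers', 'fecha_dato'],
                               columns='target', values='dummy')

# Products not bought in a given customer-month come out as NaN; turn them into 0
wide = wide.fillna(0.0).reset_index()
```

Each row of `wide` is one customer-month, with one 0/1 column per product index.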

Calculate `target_combine`

In [4]:
# Count how many new products each customer purchases in each month
new_product_per_customer = targets.groupby(['ncodpers', 'fecha_dato'])['target'].count()
new_product_per_customer = pd.DataFrame(new_product_per_customer)
new_product_per_customer.reset_index(inplace=True, drop=False)
cols = new_product_per_customer.columns.tolist()
cols[-1] = 'target_count'
new_product_per_customer.columns = cols

Merge with `targets_p` 

In [8]:
new_product_per_customer = new_product_per_customer.merge(targets_p, how='left', on=['ncodpers', 'fecha_dato'])

Map `fecha_dato` to the previous month. I want to build a mapping from the products in the first month to the new products in the second month, so the first month should be the key.

In [10]:
month_mapping = dict(zip(month_list[1:-1], month_list[:-2]))
new_product_per_customer.fecha_dato = new_product_per_customer.fecha_dato.map(month_mapping)
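
The shift can be checked on a short, hypothetical month list (the real `month_list` comes from `santander_helper`). This is only a sketch of the slicing logic:

```python
import pandas as pd

# Hypothetical short month list standing in for the real `month_list`
month_list = ['2015-01-28', '2015-02-28', '2015-03-28', '2015-04-28']

# Second month -> first month, leaving out the final month as in the cell above
month_mapping = dict(zip(month_list[1:-1], month_list[:-2]))

toy = pd.DataFrame({'ncodpers': [1, 2],
                    'fecha_dato': ['2015-02-28', '2015-03-28']})
# Re-key each row of new products by the month before the purchase
toy['fecha_dato'] = toy['fecha_dato'].map(month_mapping)
```

Any date that is not a key of `month_mapping` would become `NaN` under `Series.map`, so the mapping has to cover every month present in the data.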

Load the current products (products in the first month) and extract product information

In [12]:
if os.path.isfile('../input/df_target_cols.hdf'):
    print('Load df_target_cols')
    df = pd.read_hdf('../input/df_target_cols.hdf', 'df_target_cols')
else:
    print('Create df_target_cols')
    df = []
    for month in tqdm.tqdm_notebook(month_list[:-2]):
        df.append(pd.read_hdf('../input/data_month_{}.hdf'.format(month), 'data_month'))
    df = pd.concat(df, ignore_index=True)
    df = df.loc[:, ['fecha_dato', 'ncodpers']+target_cols].copy()
    df['target_combine'] = np.sum(df[target_cols].values*
        np.float_power(2, np.arange(0, len(target_cols))), 
        axis=1, dtype=np.float64)
    df.drop(target_cols, axis=1, inplace=True)
    
    df.to_hdf('../input/df_target_cols.hdf', 'df_target_cols', complib='blosc:lz4', complevel=9, format='t')

Load df_target_cols


Merge `target_combine` and `target_count`

In [17]:
dt = pd.merge(df, new_product_per_customer, how='left', on=['ncodpers', 'fecha_dato'])
dt.fillna(0, inplace=True)
dt['target_indicator'] = (dt.target_count>0).astype(int)

In [19]:
dt.head()

| | fecha_dato | ncodpers | target_combine | target_count | ind_cco_fin_ult1 | ind_cder_fin_ult1 | ind_cno_fin_ult1 | ind_ctju_fin_ult1 | ind_ctma_fin_ult1 | ind_ctop_fin_ult1 | ... | ind_hip_fin_ult1 | ind_nom_pens_ult1 | ind_nomina_ult1 | ind_plan_fin_ult1 | ind_pres_fin_ult1 | ind_reca_fin_ult1 | ind_recibo_ult1 | ind_tjcr_fin_ult1 | ind_valo_fin_ult1 | target_indicator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-28 | 1375586 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 2015-01-28 | 1050611 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 2015-01-28 | 1050612 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 2015-01-28 | 1050613 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 2015-01-28 | 1050614 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |


In [20]:
dt.shape

(12715856, 24)

In [21]:
mean_encoding = {}
mean_encoding_cols = target_cols+['target_count', 'target_indicator']
for c in tqdm.tqdm_notebook(mean_encoding_cols):
    mean_encoding[c] = pd.DataFrame(dt.groupby('target_combine')[c].mean())
    mean_encoding[c].columns = [c]
    dt = pd.merge(dt, mean_encoding[c], how='left', left_on='target_combine', right_index=True, suffixes=('', '_m'))

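# Remove auxiliary columns, including the raw per-row target columns:
# only the `_m` mean columns should remain, so that each `target_combine`
# value collapses to a single row
dt.drop(target_cols+['target_count', 'target_indicator', 'fecha_dato', 'ncodpers'], inplace=True, axis=1)
# Remove duplicate rows
dt.drop_duplicates(inplace=True)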


In [25]:
dt.target_combine.unique().shape

(9485,)
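
The per-column loop in In [21] can probably be expressed as one grouped aggregation that yields one row per pattern directly. The following is a sketch on made-up data, not the code that produced the results here; `ind_a` is a hypothetical product column.

```python
import pandas as pd

# Toy frame (hypothetical values): first-month pattern plus second-month targets
toy = pd.DataFrame({
    'target_combine': [1.0, 1.0, 4.0, 4.0],
    'ind_a':          [0.0, 1.0, 0.0, 0.0],   # hypothetical new-product column
    'target_count':   [0.0, 1.0, 0.0, 0.0],
})
toy['target_indicator'] = (toy['target_count'] > 0).astype(int)

cols = ['ind_a', 'target_count', 'target_indicator']

# One pass over the data: the mean of every target column per pattern.
# The result is already indexed by `target_combine`, one row per pattern.
table = toy.groupby('target_combine')[cols].mean().add_suffix('_m')
```

Because the result has exactly one row per `target_combine` value, no `drop_duplicates` or `set_index` step is needed afterwards.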

In [None]:
dt.set_index('target_combine', inplace=True)

In [None]:
dt.head()

In [None]:
dt.to_hdf('../input/target_mean_encoding.hdf', 'target_mean_encoding', complib='blosc:lz4', complevel=9, format='t')

### Another way to implement the mean encoding, as a double check

Calculate the new products for each customer in each month

In [None]:
new_cols = [k+'_new' for k in target_cols]
du = collections.OrderedDict()
for m1, m2 in tqdm.tqdm_notebook(list(zip(month_list[:-2], month_list[1:-1]))):
    df1 = pd.read_hdf('../input/data_month_{}.hdf'.format(m1), 'data_month')
    df2 = pd.read_hdf('../input/data_month_{}.hdf'.format(m2), 'data_month')

    df1 = df1[['fecha_dato', 'ncodpers']+target_cols]
    df2 = df2[['fecha_dato', 'ncodpers']+target_cols]

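    # Left-join first-month products (no suffix) onto second-month products (suffix '_l')
    dt = df2.merge(df1, on=['ncodpers'], how='left', suffixes=('_l', ''))
    dt.fillna(0.0, inplace=True)

    dt.drop(['fecha_dato_l'], axis=1, inplace=True)
    # Columns 1:20 are the second-month products, 21: the first-month products;
    # the difference is 1 for a newly added product and -1 for a dropped one
    x = dt.iloc[:, 1:20].values-dt.iloc[:, 21:].values
    x = pd.DataFrame(x, index=dt.ncodpers, columns=new_cols)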
    df1.drop('fecha_dato', axis=1, inplace=True)
    df1.set_index('ncodpers', inplace=True)
    x = df1.join(x, how='left')
    du[m1] = x.copy()
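
The idea of the loop above, together with the clipping of negative values in the next cell, can be sketched on two tiny, made-up monthly snapshots. The column names `prod_a` and `prod_b` are hypothetical, and the alignment uses `reindex` instead of the merge and join used in the notebook.

```python
import pandas as pd

product_cols = ['prod_a', 'prod_b']  # hypothetical product columns

# Made-up holdings: customer 3 is present in the first month only
first = pd.DataFrame({'ncodpers': [1, 2, 3],
                      'prod_a': [0, 1, 1],
                      'prod_b': [0, 0, 1]}).set_index('ncodpers')
second = pd.DataFrame({'ncodpers': [1, 2],
                       'prod_a': [1, 0],
                       'prod_b': [0, 1]}).set_index('ncodpers')

# Align the second month to the first-month customers (missing ones become NaN)
aligned = second.reindex(first.index)

# +1 = product added, -1 = product dropped, 0 = unchanged
diff = aligned[product_cols] - first[product_cols]

# Only additions count as new products; NaN rows stay NaN
new_products = diff.clip(lower=0).add_suffix('_new')
```

Customer 1 adds `prod_a`, customer 2 drops `prod_a` and adds `prod_b`, and customer 3 has no second-month record, so its new-product values stay missing.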

Calculate the pattern of products held in the first month. The pattern is treated as a binary number and then converted to decimal. Also count the number of new products and compute an indicator of whether any new product was bought.

In [24]:
du = pd.concat(du, ignore_index=True)
# A negative difference means a product was dropped, which does not count as a new product
du[du<0] = 0

du['target_combine'] = np.sum(du.values[:, :19]*np.float_power(2, np.arange(0, 19)), axis=1, dtype=np.float64)
du['target_count'] = du.loc[:, new_cols].sum(axis=1)
du['target_indicator'] = du.loc[:, new_cols].max(axis=1)

du.drop(target_cols, axis=1, inplace=True)
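
The binary-to-decimal step can be illustrated with three hypothetical products instead of 19. The first column acts as the least significant bit, and with 19 products every code stays below 2^19, so `float64` represents it exactly.

```python
import numpy as np

# Made-up 0/1 holdings for three customers and three hypothetical products
holdings = np.array([[1, 0, 0],
                     [0, 0, 1],
                     [1, 1, 0]])

# Weight column j by 2**j, as in the cell above
weights = np.float_power(2, np.arange(holdings.shape[1]))
codes = (holdings * weights).sum(axis=1)

# Decoding recovers the original pattern: shift right by j and keep the last bit
decoded = (codes.astype(np.int64)[:, None] >> np.arange(holdings.shape[1])) & 1
```

The three rows encode to 1, 4 and 3, and `decoded` equals `holdings` again, so each code identifies exactly one product pattern.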

Encode the first-month product pattern with the mean of each new product, of the number of new products, and of the indicator of buying new products.

In [25]:
dg = collections.OrderedDict()
new_cols = new_cols+['target_count', 'target_indicator']
for c in tqdm.tqdm_notebook(new_cols):
    dg[c] = du.groupby('target_combine')[c].mean()
dg = pd.concat(dg, axis=1)
dg.columns = new_cols


In [26]:
dg.shape

(9485, 21)

In [27]:
dg.to_hdf('../input/target_mean_encoding_2.hdf', 'target_mean_encoding', complib='blosc:lz4', complevel=9, format='t')
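
The comparison itself is not shown in this notebook. One way it might be done is sketched below on hypothetical stand-ins for the two tables. It assumes both are indexed by `target_combine`, that the first approach's columns carry the `_m` suffix, and that the second approach's product columns carry the `_new` suffix. The helper `strip_suffix` is made up for this sketch.

```python
import numpy as np
import pandas as pd

def strip_suffix(name, suffix):
    """Return `name` without `suffix` if it ends with it, else unchanged."""
    return name[:-len(suffix)] if name.endswith(suffix) else name

# Hypothetical stand-ins for the two stored tables (made-up values)
enc1 = pd.DataFrame({'prod_a_m': [0.5, 0.0], 'target_count_m': [0.5, 0.0]},
                    index=pd.Index([1.0, 4.0], name='target_combine'))
enc2 = pd.DataFrame({'prod_a_new': [0.0, 0.5], 'target_count': [0.0, 0.5]},
                    index=pd.Index([4.0, 1.0], name='target_combine'))

# Bring both tables to common column names and a common row order
a = enc1.rename(columns=lambda c: strip_suffix(c, '_m')).sort_index()
b = enc2.rename(columns=lambda c: strip_suffix(c, '_new')).sort_index()

# Same patterns, and the same means up to floating-point tolerance
same_keys = a.index.equals(b.index)
same_values = same_keys and np.allclose(a.values, b[a.columns].values)
```

On the real tables the two approaches treat customers who are missing from the second month differently (zero-filled in the first, left as `NaN` in the second), so small differences in the means would not necessarily indicate a bug.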