## Feature Engineering and CV Based on Winners' Solutions

Continued from eda_4_26.

New in this notebook:
- arithmetic mean of each product over the lag months, for each (customer, product) pair
- exponentially weighted average of each product over the lag months, for each (customer, product) pair
- time since each product was present, and distance to the first 1

To-do:
- mean encoding of products grouped by combinations of: canal_entrada, segmento, cod_prov
- time since the last change, and lags, for a few non-product features:
    - segmento
    - ind_actividad_cliente
    - cod_prov
    - canal_entrada
    - indrel_1mes
    - tiprel_1mes
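
As a rough illustration of the mean-encoding to-do item, the sketch below groups a toy frame by two categorical columns and attaches the group mean of one product flag to every row. The data and the `_mean_enc` column name are made up for illustration, and a real version would probably need to compute the means on an earlier month to avoid leaking the target.

```python
import pandas as pd

# Hypothetical toy data: two grouping columns and one binary product column.
toy = pd.DataFrame({
    'canal_entrada': ['KHE', 'KHE', 'KAT', 'KAT', 'KAT'],
    'segmento': ['02', '02', '02', '03', '03'],
    'ind_recibo_ult1': [1, 0, 1, 0, 0],
})

group_cols = ['canal_entrada', 'segmento']

# Mean of the product flag within each group, broadcast back to every row.
toy['ind_recibo_ult1_mean_enc'] = (
    toy.groupby(group_cols)['ind_recibo_ult1'].transform('mean')
)

print(toy['ind_recibo_ult1_mean_enc'].tolist())  # [0.5, 0.5, 1.0, 0.0, 0.0]
```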


Features:
- before eda_4_25
    - customer info in the second month
    - products in the first month
    - combination of first and second month `ind_actividad_cliente`
    - combination of first and second month `tiprel_1mes`
    - combination of first-month products encoded as a binary number (`target_combine`)
    - encoding of `target_combine` with
        - mean number of new products
        - mean number of customers with new products
        - mean number of customers with each new product
    - count of patterns in the last `max_lag` months
    - number of months since the customer last purchased each product
        - CV@2015-12-28: mlogloss=1.29349
        - Private score: 0.0302475, public score: 0.0299266
- eda_4_25
    - Use all available history data
        - E.g., for the 2016-05-28 training data, use all previous months; for 2015-02-28, use 1 lag month.
        - The test set must use the same number of previous months as each training data set.
        - This is from [the second-place winner's solution](https://www.kaggle.com/c/santander-product-recommendation/discussion/26824), the bold part in paragraph 4.
    - Combine models trained on 2016-05-28 and 2015-06-28:
        - Private score: 0.0304583, public score: 0.0300839
        - This is to capture both seasonality and trend, which are present in 2015-06-28 and 2016-05-28, respectively.
        - This idea is mentioned by many winners, such as the [11th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823) and the [14th-place winner](https://www.kaggle.com/c/santander-product-recommendation/discussion/26808).

- eda_4_27
    - Put 2015-06-28 and 2016-05-28 in the same data set, with the same lag=5
        - Private score: 0.0303096, public score: 0.0299867
        - This differs from the [11th-place winner's discussion](https://www.kaggle.com/c/santander-product-recommendation/discussion/26823):
            > We tested this by adding 50% of May-16 data to our June model and sure enough, we went from 0.0301 to 0.0303. Then, we built separate models for Jun and May, but the ensemble didn’t work. We weren’t surprised because June data is better for seasonal products, and May data is better for trend products. And vice-versa, June data is bad for trend products and May data is bad for seasonal products. So, they sort of cancelled each other out.

        - My scores are consistently lower than theirs, which may be why our observations differ.
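
The `target_combine` feature in the list above treats a customer's product flags as the digits of a binary number. A minimal sketch of that idea on made-up data follows; the product names `prod_a` to `prod_c` and the bit order are assumptions, not necessarily what the earlier notebooks use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three binary product flags for four customers.
products = pd.DataFrame(
    {'prod_a': [0, 1, 1, 0], 'prod_b': [0, 0, 1, 1], 'prod_c': [1, 0, 1, 0]},
    index=[101, 102, 103, 104],
)

# Weight column j by 2**j, so each distinct combination of products
# maps to a distinct integer.
powers = 2 ** np.arange(products.shape[1])
target_combine = pd.Series(products.values.dot(powers), index=products.index)

print(target_combine.tolist())  # [4, 1, 7, 2]
```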

In [1]:
from santander_helper import *

In [2]:
month_list

['2015-01-28',
 '2015-02-28',
 '2015-03-28',
 '2015-04-28',
 '2015-05-28',
 '2015-06-28',
 '2015-07-28',
 '2015-08-28',
 '2015-09-28',
 '2015-10-28',
 '2015-11-28',
 '2015-12-28',
 '2016-01-28',
 '2016-02-28',
 '2016-03-28',
 '2016-04-28',
 '2016-05-28',
 '2016-06-28']

In [3]:
month1 = '2016-01-28'
max_lag = 3

In [4]:
month_new = month_list.index(month1)+1
month_end = month_list.index(month1)
month_start = month_end-max_lag+1

In [5]:
month_new

13

In [6]:
month_list[month_new]

'2016-02-28'

In [7]:
month_list[month_start]

'2015-11-28'
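
The index arithmetic in the cells above can be wrapped in a small helper to make the window explicit. This is only a sketch: `lag_window` and `months_demo` are hypothetical names, not part of `santander_helper`.

```python
def lag_window(months, month1, max_lag):
    """Return (lag months ending at month1, the following target month or None)."""
    month_end = months.index(month1)
    month_start = month_end - max_lag + 1
    if month_start < 0:
        raise ValueError('not enough history for max_lag={}'.format(max_lag))
    month_new = month_end + 1
    # The target month does not exist when month1 is the last entry.
    target = months[month_new] if month_new < len(months) else None
    return months[month_start:month_end + 1], target


# Hypothetical shortened month list for the demonstration.
months_demo = ['2015-10-28', '2015-11-28', '2015-12-28', '2016-01-28', '2016-02-28']
window, target = lag_window(months_demo, '2016-01-28', 3)
print(window)  # ['2015-11-28', '2015-12-28', '2016-01-28']
print(target)  # 2016-02-28
```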

In [8]:
# Check that month_new is not the last month (the last month is the test set)
if month_new<len(month_list)-1:
    # Customers with new products in month_new
    customer_product_pair = pd.read_hdf('../input/customer_product_pair.hdf', 'customer_product_pair')
    ncodpers_list = customer_product_pair.loc[customer_product_pair.fecha_dato==month_list[month_new], 
        'ncodpers'].unique().tolist()

In [9]:
# Load data for all the lag related months
df = []
for m in range(month_start, month_end+1):
    df.append(pd.read_hdf('../input/data_month_{}.hdf'.format(month_list[m]), 'data_month'))

In [10]:
# concatenate data
df = pd.concat(df, ignore_index=True)

In [11]:
df = df.loc[:, ['fecha_dato']+cat_cols+target_cols]
if month_new<len(month_list)-1:
    # select customers if this is not test set
    df = df.loc[df.ncodpers.isin(ncodpers_list), :]

In [12]:
# set ncodpers and fecha_dato as index
df.set_index(['ncodpers', 'fecha_dato'], inplace=True)

In [13]:
# unstack to make month as columns
df = df.unstack(level=-1, fill_value=np.nan)

In [16]:
df.isnull().sum().head()

               fecha_dato
canal_entrada  2015-11-28    3087
               2015-12-28    2192
               2016-01-28      17
conyuemp       2015-11-28    3087
               2015-12-28    2192
dtype: int64
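
The null counts above come from customers who have no row in some of the lag months. The toy example below, with made-up data, shows how `unstack` turns months into columns and leaves `NaN` where a customer is missing in a month.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: customer 2 has no row for 2015-12-28.
long_df = pd.DataFrame({
    'ncodpers': [1, 1, 2],
    'fecha_dato': ['2015-12-28', '2016-01-28', '2016-01-28'],
    'ind_recibo_ult1': [0, 1, 1],
})

# One row per customer, one column per (feature, month) pair.
wide = long_df.set_index(['ncodpers', 'fecha_dato']).unstack(level=-1)

# Missing (customer, month) combinations show up as NaN.
print(wide.isnull().sum().tolist())  # [1, 0]
```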

Arithmetic and exponentially weighted averages of products for each (customer, product) pair

In [24]:
# Group data by features
group0 = df.fillna(0.0).groupby(axis=1, level=0)

# Average of products for each (customer, product) pair
mean_product = pd.DataFrame()
mean_product['ncodpers'] = df.index.tolist() # Note: orders of ncodpers in df and ncodpers_list are different! 
for k in target_cols:
    mean_product[k+'_lag_mean'] = group0.get_group(k).mean(axis=1).values

mean_product.set_index('ncodpers', inplace=True)

In [27]:
# Exponentially weighted average of products for each (customer, product) pair
mean_exp_product = pd.DataFrame()
mean_exp_product['ncodpers'] = df.index.tolist() # Note: the order of ncodpers in df differs from ncodpers_list!
# Weights decay by a factor of (1-alpha) per month going back in time,
# so the most recent month (the last column) gets the largest weight
mean_exp_alpha1 = 0.1
mean_exp_weight1 = np.float_power(1-mean_exp_alpha1, np.arange(0, max_lag))
mean_exp_weight1 = mean_exp_weight1[::-1]/np.sum(mean_exp_weight1)
mean_exp_alpha2 = 0.5
mean_exp_weight2 = np.float_power(1-mean_exp_alpha2, np.arange(0, max_lag))
mean_exp_weight2 = mean_exp_weight2[::-1]/np.sum(mean_exp_weight2)
for k in target_cols:
    # Vectorized equivalent of group0.get_group(k).apply(np.average, axis=1, weights=...).values
    values = group0.get_group(k).values
    mean_exp_product[k+'_lag_exp_mean1'] = np.average(values, axis=1, weights=mean_exp_weight1)
    mean_exp_product[k+'_lag_exp_mean2'] = np.average(values, axis=1, weights=mean_exp_weight2)

mean_exp_product.set_index('ncodpers', inplace=True)
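
To make the weighting concrete, here is a small worked example with alpha = 0.5 and three lag months, using the same weight construction as the cell above. The `history` array is made up; its columns run from the oldest month to the newest.

```python
import numpy as np

n_lag = 3
alpha = 0.5

# Unnormalized weights [1, 0.5, 0.25], reversed so the newest month is last,
# then normalized to sum to 1: [1/7, 2/7, 4/7].
w = np.float_power(1 - alpha, np.arange(n_lag))
w = w[::-1] / np.sum(w)

# Hypothetical ownership history for three customers (oldest -> newest).
history = np.array([
    [1, 0, 0],   # held only in the oldest month
    [0, 0, 1],   # held only in the newest month
    [1, 1, 1],   # held throughout
])

ewa = np.average(history, axis=1, weights=w)
print(ewa)  # approximately [0.1429, 0.5714, 1.0]
```

A product held only in the newest month scores four times higher than one held only in the oldest month, which the plain arithmetic mean cannot distinguish.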

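Distance to the positive flank: the number of months back from the most recent month to the latest 0 to 1 transition (product added), or `max_lag` if there is no such transition in the window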

In [164]:
def dist_pos_flank(x):
    x = x.values[:, ::-1]
    x = np.hstack((x, np.ones((x.shape[0], 1)), np.zeros((x.shape[0], 1)) ))
    x = np.diff(x, axis=1)
    x = np.argmin(x, axis=1)
    return x

In [169]:
distance_positive_flank = pd.DataFrame()
distance_positive_flank['ncodpers'] = df.index.tolist()
for k in target_cols:
    distance_positive_flank[k+'_dist_pos_flank'] = dist_pos_flank(group0.get_group(k))
    
distance_positive_flank.set_index('ncodpers', inplace=True)
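
A quick check of `dist_pos_flank` on a made-up frame helps confirm what the returned values mean. The function body is copied from the cell above; the `demo` rows are hypothetical, with columns ordered from the oldest month to the newest.

```python
import numpy as np
import pandas as pd

def dist_pos_flank(x):
    x = x.values[:, ::-1]  # newest month first
    # Append a 1 then a 0 so every row contains at least one 1 -> 0 step
    x = np.hstack((x, np.ones((x.shape[0], 1)), np.zeros((x.shape[0], 1))))
    x = np.diff(x, axis=1)
    # First -1 in reversed time = latest 0 -> 1 step in forward time
    return np.argmin(x, axis=1)

demo = pd.DataFrame([
    [0, 0, 1],   # added in the newest month      -> 0
    [0, 1, 1],   # added one month earlier        -> 1
    [1, 0, 0],   # no 0 -> 1 step in the window   -> 3
    [0, 0, 0],   # never held                     -> 3
    [1, 1, 1],   # held throughout                -> 3
])

print(dist_pos_flank(demo).tolist())  # [0, 1, 3, 3, 3]
```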

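Distance to the negative flank: the number of months back from the most recent month to the latest 1 to 0 transition (product dropped), or `max_lag` if there is no such transition in the window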

In [170]:
def dist_neg_flank(x):
    x = x.values[:, ::-1]
    x = np.hstack((x, np.zeros((x.shape[0], 1)), np.ones((x.shape[0], 1)) ))
    x = np.diff(x, axis=1)
    x = np.argmax(x, axis=1)
    return x

In [171]:
distance_negative_flank = pd.DataFrame()
distance_negative_flank['ncodpers'] = df.index.tolist()
for k in target_cols:
    distance_negative_flank[k+'_dist_neg_flank'] = dist_neg_flank(group0.get_group(k))
    
distance_negative_flank.set_index('ncodpers', inplace=True)
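
The same kind of check for `dist_neg_flank`, again on made-up rows ordered from the oldest month to the newest, with the function body copied from the cell above.

```python
import numpy as np
import pandas as pd

def dist_neg_flank(x):
    x = x.values[:, ::-1]  # newest month first
    # Append a 0 then a 1 so every row contains at least one 0 -> 1 step
    x = np.hstack((x, np.zeros((x.shape[0], 1)), np.ones((x.shape[0], 1))))
    x = np.diff(x, axis=1)
    # First +1 in reversed time = latest 1 -> 0 step in forward time
    return np.argmax(x, axis=1)

demo = pd.DataFrame([
    [1, 1, 0],   # dropped in the newest month    -> 0
    [1, 0, 0],   # dropped one month earlier      -> 1
    [0, 0, 1],   # no 1 -> 0 step in the window   -> 3
    [0, 0, 0],   # never held                     -> 3
    [1, 1, 1],   # held throughout                -> 3
])

print(dist_neg_flank(demo).tolist())  # [0, 1, 3, 3, 3]
```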

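Distance to the first 1: the number of months between the most recent month and the first month in the window in which the product was held, or -1 if it was never held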

In [185]:
def dist_first_one(x):
    x = x.values
    x = np.hstack( (x, np.ones((x.shape[0], 1)) ) )
    x = x.shape[1]-2-np.argmax(x, axis=1)
    return x

In [186]:
distance_first_one = pd.DataFrame()
distance_first_one['ncodpers'] = df.index.tolist()
for k in target_cols:
    distance_first_one[k+'_dist_first_one'] = dist_first_one(group0.get_group(k))
    
distance_first_one.set_index('ncodpers', inplace=True)
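
A small check of `dist_first_one` on made-up rows (oldest month first), with the function body copied from the cell above.

```python
import numpy as np
import pandas as pd

def dist_first_one(x):
    x = x.values
    # Append a column of ones so argmax always finds a 1
    x = np.hstack((x, np.ones((x.shape[0], 1))))
    # argmax returns the position of the first 1 in chronological order
    return x.shape[1] - 2 - np.argmax(x, axis=1)

demo = pd.DataFrame([
    [1, 0, 1],   # first held in the oldest month  -> 2
    [0, 1, 1],   # first held one month ago        -> 1
    [0, 0, 1],   # first held in the newest month  -> 0
    [0, 0, 0],   # never held                      -> -1
])

print(dist_first_one(demo).tolist())  # [2, 1, 0, -1]
```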

In [None]:
month_new = month_list.index(month1)+1
month_end = month_list.index(month1)
month_start = month_end-max_lag+1

# Check that month_new is not the last month (the last month is the test set)
if month_new<len(month_list)-1:
    # Customers with new products in month_new
    customer_product_pair = pd.read_hdf('../input/customer_product_pair.hdf', 'customer_product_pair')
    ncodpers_list = customer_product_pair.loc[customer_product_pair.fecha_dato==month_list[month_new], 
        'ncodpers'].unique().tolist()

# Load data for all the lag related months
df = []
for m in range(month_start, month_end+1):
    df.append(pd.read_hdf('../input/data_month_{}.hdf'.format(month_list[m]), 'data_month'))

# concatenate data
df = pd.concat(df, ignore_index=True)
df = df.loc[:, ['ncodpers', 'fecha_dato']+target_cols]
if month_new<len(month_list)-1:
    # select customers if this is not test set
    df = df.loc[df.ncodpers.isin(ncodpers_list), :]
# set ncodpers and fecha_dato as index
df.set_index(['ncodpers', 'fecha_dato'], inplace=True)
# unstack to make month as columns
df = df.unstack(level=-1, fill_value=0)

# count the number of consecutive zeros immediately before the second/current month
df = df.groupby(level=0, axis=1).progress_apply(lambda x: (1-x).iloc[:, ::-1].cummin(axis=1).sum(axis=1))
df.columns = [k+'_zc' for k in df.columns]

gc.collect()
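
The zero-count expression used in the cell above can be checked on a single made-up product block, with columns ordered from the oldest month to the newest. Reversing the columns and taking a cumulative minimum keeps a 1 only while every month so far, counting back from the newest, has been a zero.

```python
import pandas as pd

# Hypothetical ownership history for four customers (oldest -> newest).
x = pd.DataFrame([
    [1, 0, 0],   # two trailing zeros
    [0, 0, 1],   # newest month is 1, so no trailing zeros
    [0, 0, 0],   # all zeros
    [1, 1, 0],   # one trailing zero
])

# (1 - x) marks zeros with 1; the reversed cummin stops at the first held month.
zero_count = (1 - x).iloc[:, ::-1].cummin(axis=1).sum(axis=1)
print(zero_count.tolist())  # [2, 0, 3, 1]
```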

