## Example 5-6. Bin-counting example

In [21]:
import pandas as pd

- [Click-through ad data from Kaggle competition](https://www.kaggle.com/c/avazu-ctr-prediction/data)
- The training subset used here is the first 99,999 rows of the 6+ GB training set

In [29]:
# Use the first 99,999 rows (100,000 - 1) of this data set, which is over 6 GB, as the training set
df = pd.read_csv('data/avazu-ctr-prediction/train.csv', nrows=99999)

In [30]:
print(df.shape)
df.head(3)

(99999, 24)


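| | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000009e+18 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
| 1 | 1.000017e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 2 | 1.000037e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |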


In [50]:
df[['device_id']].head()

| | device_id |
|---|---|
| 0 | a99f214a |
| 1 | a99f214a |
| 2 | a99f214a |
| 3 | a99f214a |
| 4 | a99f214a |


In [31]:
# How many unique device_id values (categories) are in the training set?
len(df['device_id'].unique())

7201

Features are $\theta$ = [$N^+$, $N^-$, $\log(N^+)-\log(N^-)$, isRest], where $n^+$ and $n^-$ are the numbers of clicks and non-clicks observed for a category:

$N^+$ = $p(+)$ = $n^+/(n^+ + n^-)$

$N^-$ = $p(-)$ = $n^-/(n^+ + n^-)$

$\log(N^+)-\log(N^-)$ = $\log\frac{p(+)}{p(-)}$

isRest = back-off bin (not shown here)

For each category, we want to compute:  
Theta = [counts, p(click), p(no click), p(click)/p(no click)]
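
To make the definitions concrete, the following sketch computes the statistics for a single category by hand. The click log `clicks_log` is invented for illustration and is not taken from the Avazu data.

```python
import math

# Hypothetical click log for one category value: 1 = click, 0 = no click.
clicks_log = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

n_pos = sum(clicks_log)               # n+: number of clicks
n_neg = len(clicks_log) - n_pos       # n-: number of non-clicks

p_click = n_pos / (n_pos + n_neg)     # N+ = p(+)
p_no_click = n_neg / (n_pos + n_neg)  # N- = p(-)
odds = p_click / p_no_click           # p(+)/p(-)
log_odds = math.log(p_click) - math.log(p_no_click)  # log(N+) - log(N-)

theta = [p_click, p_no_click, odds, log_odds]
print(theta)
```

With 3 clicks out of 10 impressions, the odds are 3/7 and the log-odds is negative, which reflects that a click is less likely than a non-click for this category.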

In [57]:
import numpy as np


def click_counting(x, bin_column):
    """Count clicks and non-clicks for each value of bin_column.

    Building a DataFrame from two Series aligns them on their index
    labels, and labels missing from one Series become NaN:

    s1 = pd.Series({'a': 1, 'c': 7, 'b': 4}, name='s1')
    s2 = pd.Series({'b': 1, 'd': 7, 'a': 4}, name='s2')
    pd.DataFrame([s1, s2])

        a    c    b    d
    s1  1.0  7.0  4.0  NaN
    s2  4.0  NaN  1.0  7.0
    """
    clicks = pd.Series(
        x[x['click'] > 0][bin_column].value_counts(), name='clicks')
    no_clicks = pd.Series(
        x[x['click'] < 1][bin_column].value_counts(), name='no_clicks')
    # A category absent from one Series is NaN there; treat it as a count of 0
    counts = pd.DataFrame([clicks, no_clicks]).T.fillna(0).astype('int64')
    counts['total'] = counts['clicks'] + counts['no_clicks']

    return counts


def bin_counting(counts):
    # N+ = p(click) and N- = p(no click) within each category
    counts['N+'] = counts['clicks'].divide(counts['total'])
    counts['N-'] = counts['no_clicks'].divide(counts['total'])
    # Odds p(click)/p(no click) and their log (-inf when a category has no clicks)
    counts['pre_N+'] = counts['N+'].divide(counts['N-'])
    counts['log_N+'] = np.log(counts['pre_N+'])

    # If we wanted to return only the bin-counting properties, we would filter here
    bin_counts = counts[['N+', 'N-', 'pre_N+', 'log_N+']]
    return counts, bin_counts
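
The per-category click and no-click counts can also be obtained with `pd.crosstab`. The sketch below is an alternative formulation rather than the notebook's own method, and the `toy` frame is invented data.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'device_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'click':     [1,   0,   0,   1,   1,   0],
})

# Rows are device_id values, columns are the click outcomes 0 and 1.
# Combinations that never occur get a count of 0.
table = pd.crosstab(toy['device_id'], toy['click'])
# Guarantee both outcome columns exist even if one outcome never appears.
table = table.reindex(columns=[0, 1], fill_value=0)
table.columns = ['no_clicks', 'clicks']
table['total'] = table['clicks'] + table['no_clicks']
print(table)
```

Because `crosstab` fills absent combinations with 0, no separate `fillna` step is needed in this variant.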

In [58]:
# Bin-counting example: device_id
bin_column = 'device_id'
device_clicks = click_counting(df[[bin_column, 'click']], bin_column)
device_all, device_bin_counts = bin_counting(device_clicks)



In [59]:
# check to make sure we have all the devices
len(device_bin_counts)

7201
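
Once a per-device table exists, a typical next step is to attach the statistics to each row so that they can stand in for the raw `device_id` value as model input. A minimal sketch, assuming toy data; the `rows` and `features` frames are hypothetical and only mimic the shape of the tables in this notebook.

```python
import pandas as pd

# Toy rows and toy per-device features; all values are illustrative.
rows = pd.DataFrame({
    'device_id': ['a', 'b', 'a', 'c'],
    'click':     [1,   0,   0,   1],
})
features = pd.DataFrame(
    {'N+': [0.5, 0.0, 1.0], 'N-': [0.5, 1.0, 0.0]},
    index=pd.Index(['a', 'b', 'c'], name='device_id'),
)

# Look up each row's device_id in the index of the feature table.
encoded = rows.join(features, on='device_id')
print(encoded)
```

Since these statistics are derived from the target column, they would generally be computed on data separate from the rows being encoded, so that the label does not leak into the features.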

In [60]:
device_all.sort_values(by='total', ascending=False).head(4)

| | clicks | no_clicks | total | N+ | N- | pre_N+ | log_N+ |
|---|---|---|---|---|---|---|---|
| a99f214a | 15729 | 71206 | 86935 | 0.180928 | 0.819072 | 0.220894 | -1.510071 |
| c357dbff | 33 | 134 | 167 | 0.197605 | 0.802395 | 0.246269 | -1.401332 |
| 31da1bd0 | 0 | 62 | 62 | 0.0 | 1.0 | 0.0 | -inf |
| 936e92fb | 5 | 54 | 59 | 0.084746 | 0.915254 | 0.092593 | -2.379546 |
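
Device `31da1bd0` has no clicks, so its log-odds is `-inf`, and categories with few observations give unreliable estimates in general. The isRest back-off bin mentioned earlier addresses this by pooling rare categories into one shared bin. The following is one possible sketch of that idea, not the notebook's implementation; the counts and the `min_total` threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy per-category counts; the numbers are invented for illustration.
counts = pd.DataFrame(
    {'clicks': [40, 0, 1, 2], 'no_clicks': [160, 3, 1, 0]},
    index=['a', 'b', 'c', 'd'],
)
counts['total'] = counts['clicks'] + counts['no_clicks']

min_total = 10  # hypothetical threshold below which a category counts as rare

is_rare = counts['total'] < min_total
# Sum the rare categories into one shared row and keep the frequent ones as is.
rest = counts[is_rare].sum().rename('rest')
backed_off = pd.concat([counts[~is_rare], rest.to_frame().T])

backed_off['N+'] = backed_off['clicks'] / backed_off['total']
backed_off['N-'] = backed_off['no_clicks'] / backed_off['total']
print(backed_off)
```

Here categories `b`, `c`, and `d` are merged into a single `rest` row with 3 clicks out of 7 observations, so none of them produces a zero or infinite ratio on its own.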


In [62]:
# Compare the memory footprint of the raw columns with that of the bin-counting features;
# a smaller feature table can shorten model evaluation time
from sys import getsizeof

print('Our pandas DataFrame, in bytes: ', getsizeof(df[['device_id', 'click']]))
print('Our bin-counting feature, in bytes: ', getsizeof(device_bin_counts))

Our pandas DataFrame, in bytes:  7300031
Our bin-counting feature, in bytes:  1026201
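
For a per-column breakdown of where the bytes go, pandas also offers `DataFrame.memory_usage`. A small sketch on invented data; the `toy` frame is hypothetical and its sizes will not match the numbers above.

```python
import pandas as pd

# Toy frame with repeated string IDs; values are illustrative only.
toy = pd.DataFrame({
    'device_id': ['a99f214a'] * 1000 + ['c357dbff'] * 10,
    'click': [0] * 1010,
})

# Bytes per column (plus the index). For object-dtype columns,
# deep=True also counts the Python string objects they reference.
per_column = toy.memory_usage(deep=True)
shallow_total = toy.memory_usage(deep=False).sum()
deep_total = per_column.sum()

print(per_column)
print('shallow:', shallow_total, 'deep:', deep_total)
```

The breakdown makes it easier to see that string identifiers such as `device_id` usually account for most of the footprint, which is the part that numeric bin-counting features replace.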
