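## Part 2: Association Analysis
This part runs an association analysis on the orders to find product combinations that frequently appear together in a basket: {A,B}. (We usually look for pairs, as in the classic beer & diaper example; three items co-occurring at high frequency is much rarer.) Such associations are usually explainable and often reflect shopping patterns that people do not readily notice. They can be used for bundle recommendations or promotions, for example raising the price of a strongly associated product B while running a big promotion on product A.
### Association rules
Rules take an if-then form: if A appears, B is very likely to appear; if {A,B} appears, C is very likely to appear.
Our goal is to find such combinations, and the key question is how to measure that likelihood statistically.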
### Key Metrics
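* **Support**:  
Probability that the combination appears: frequency of the itemset / number of orders
<br>support{A,B} = F_{A,B} / F_orders
* **Confidence**:
<br>Similar to a conditional probability: the probability that B appears given that A appears. It is directional.
<br>confidence{A->B} = support{A,B}/support{A}
<br>confidence{B->A} = support{A,B}/support{B}
* **Lift**:
<br>lift{A,B} = support{A,B}/(support{A}*support{B})
<br>The numerator is the probability that A and B appear together; the denominator is the probability that they would appear together if A and B were completely independent.
<br>It measures how strongly A and B are associated:
    <br>>1: they appear together more often than under independence, positive association;
    <br><1: they appear together less often than under independence, negative association;
    <br>=1: independent, no association.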
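
To make the three metrics concrete, here is a minimal sketch that evaluates them on a handful of toy orders. The orders and the helper names (`support`, `confidence`, `lift`) are invented for illustration and are not part of the notebook's pipeline.

```python
# Hypothetical toy orders, invented purely for illustration.
toy_orders = [
    {"beer", "diaper", "milk"},
    {"beer", "diaper"},
    {"milk", "bread"},
    {"beer", "bread"},
    {"diaper", "milk"},
]

def support(itemset, orders):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= order for order in orders) / len(orders)

def confidence(a, b, orders):
    """confidence{a->b} = support{a,b} / support{a}."""
    return support({a, b}, orders) / support({a}, orders)

def lift(a, b, orders):
    """lift{a,b} = support{a,b} / (support{a} * support{b})."""
    return support({a, b}, orders) / (support({a}, orders) * support({b}, orders))

print(round(support({"beer", "diaper"}, toy_orders), 3))    # 0.4
print(round(confidence("beer", "diaper", toy_orders), 3))   # 0.667
print(round(lift("beer", "diaper", toy_orders), 3))         # 1.111
```

In this toy data beer and diapers each appear in 3 of 5 orders and together in 2 of 5, so the lift is slightly above 1, a weak positive association.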

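### The A-Priori algorithm
The biggest problem in finding associated products is that the order data is large and there are many products, so the number of possible combinations is enormous. The A-Priori algorithm addresses exactly this: it starts from the smallest itemsets and works upward level by level, avoiding unnecessary counting.
#### Main idea:
Monotonicity of Frequency
<br>If an itemset is frequent (meets the support threshold), then all of its subsets are also frequent; conversely, if an itemset is not frequent, none of its supersets can meet the support threshold.
#### Procedure:
* Input the dataset (all orders containing at least two products) and set the support threshold (support_threshold).
* First pass: count how often each product appears and filter out those below the support threshold.
* Second pass: generate product pairs order by order, count their frequencies, keep the pairs that meet the threshold, and compute confidence and lift.
* Interpret the resulting product pairs.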
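
The two passes described above can be sketched on toy data as follows. This is a simplified pure-Python illustration, not the pandas implementation used later in this notebook; `apriori_pairs`, `min_count`, and the toy orders are assumptions made for the example, and the threshold is an absolute order count rather than a percentage.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(orders, min_count):
    """Return {pair: count} for pairs occurring in at least `min_count` orders."""
    # First pass: count single items and keep only the frequent ones.
    item_counts = Counter(item for order in orders for item in set(order))
    frequent_items = {item for item, c in item_counts.items() if c >= min_count}

    # Second pass: only pairs built from frequent items can be frequent
    # (monotonicity), so infrequent items are dropped before pairing.
    pair_counts = Counter()
    for order in orders:
        kept = sorted(set(order) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_count}

# Hypothetical toy orders, invented for illustration.
toy_orders = [
    ["beer", "diaper", "milk"],
    ["beer", "diaper"],
    ["milk", "bread"],
    ["beer", "bread"],
    ["diaper", "milk"],
]
frequent_pairs = apriori_pairs(toy_orders, min_count=2)
print(frequent_pairs)
```

Sorting the kept items before pairing means each pair is always emitted in the same order, so (A, B) and (B, A) are never counted separately.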

### Load and clean the data

In [1]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter

In [2]:
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))
orders = pd.read_csv('../market_sells_orders/input/order_products__prior.csv').set_index('order_id')['product_id']
print('total sales: %s'%orders.shape[0])
print('unique orders count: %s'%len(orders.index.unique()))
print('unique products count: %s'%len(orders.value_counts()))
print('data size: %s'%size(orders))
orders.head()

total sales: 32434489
unique orders count: 3214874
unique products count: 49677
data size: 518.95 MB


order_id
2    33120
2    28985
2     9327
2    45918
2    30035
Name: product_id, dtype: int64

**Filter out orders with fewer than 2 products**

In [3]:
order_size = orders.index.value_counts().rename('freq')
qualified_orders = order_size[order_size>=2].index
order_item = orders[orders.index.isin(qualified_orders)]

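### Counting and combination methods
**Our data contains more than 32 million order-product rows and nearly 50,000 products, so the number of possible product pairs is n(n-1)/2 ≈ 1.2×10^9. Even with A-Priori, counting product pairs consumes a lot of memory, so use generators where possible rather than loading all pairs into memory before counting, and take care to deduplicate pairs.**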
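
As a quick check on the size of the search space, the number of unordered product pairs can be computed directly from the unique product count printed earlier (49677). This is only an arithmetic sketch; `n_products` and `n_pairs` are names chosen for the example.

```python
import math

n_products = 49677                    # unique products count reported above
n_pairs = math.comb(n_products, 2)    # equals n * (n - 1) / 2

print(f"{n_pairs:,} candidate pairs (~{n_pairs:.2e})")
```

Over a billion candidate pairs is why single products are pruned first and why the pairs are streamed through a generator rather than materialized.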

In [4]:
def count(iterable):
    # Return a Series of occurrence counts named "freq"
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    return pd.Series(Counter(iterable)).rename("freq")
def order_count(order_item):
    # Number of distinct orders (order_id is the index)
    return len(set(order_item.index))

In [5]:
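def get_pairs(order_item):
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
    order_item = order_item.reset_index().to_numpy()
    # groupby only merges adjacent rows, so rows of one order must be contiguous
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_set = {item[1] for item in order_object}  # deduplicate items within the order
        for item_pair in combinations(item_set, 2):
            # order each pair so (A, B) and (B, A) are counted together
            pair = (min(item_pair), max(item_pair))
            yield pair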
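
One property of `itertools.groupby` worth illustrating is that it only groups consecutive rows, so a pair generator of this kind relies on all rows of an order being adjacent. The sketch below is a pure-Python stand-in, not the notebook's function: `pairs_per_order` and the toy rows are hypothetical, and plain `(order_id, product_id)` tuples replace the pandas Series.

```python
from itertools import combinations, groupby

def pairs_per_order(rows):
    """Yield sorted product pairs per order from (order_id, product_id) rows."""
    for _, group in groupby(rows, key=lambda row: row[0]):
        items = sorted({product_id for _, product_id in group})
        yield from combinations(items, 2)

# Rows of each order are adjacent: pairs are generated per order.
rows = [(1, 10), (1, 20), (1, 30), (2, 20), (2, 10)]
toy_pairs = list(pairs_per_order(rows))

# Rows of one order are interleaved with another: groupby splits each order
# into single-row groups, so no pairs are produced until the rows are sorted.
shuffled = [(1, 10), (2, 20), (1, 20), (2, 10)]
split_pairs = list(pairs_per_order(shuffled))
fixed_pairs = list(pairs_per_order(sorted(shuffled)))
```

If the input file is not already grouped by `order_id`, sorting by that key before generating pairs would be one way to keep the counts correct.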

### Scan the dataset and compute support, confidence, and lift

In [10]:
def association_rules(order_item, support_threshold):
    print("First Pass:")
    orders_count = order_count(order_item)
    item_stats = count(order_item).to_frame("freq")
    item_stats['support']  = (item_stats['freq']/orders_count)*100

    qualifying_items = item_stats[item_stats['support'] >= support_threshold].index
    order_item = order_item[order_item.isin(qualifying_items)]

    print("Products count: %s"%len(qualifying_items))
    print("Remaining order_products: %s"%len(order_item))

    print("Second Pass:")
    #orders_count = order_count(order_item)
    pair_gen = get_pairs(order_item)
    pairs = count(pair_gen).to_frame("freqAB")
    
    pairs['supportAB'] = (pairs['freqAB']/orders_count)*100
    print("Total product pairs count: %s"%len(pairs))
    pairs = pairs[pairs['supportAB'] >= support_threshold]
    print("Remaining product pairs: %s"%len(pairs))

    pairs = pairs.reset_index().rename(columns={'level_0': 'product_A', 'level_1': 'product_B'})
    pairs = pd.merge(pairs,item_stats.rename(columns={'freq':'freqA','support':'supportA'}),left_on='product_A',right_index=True)
    pairs = pd.merge(pairs,item_stats.rename(columns={'freq':'freqB','support':'supportB'}),left_on='product_B',right_index=True)
    
    pairs['confidenceAtoB'] = pairs['supportAB'] / pairs['supportA']
    pairs['confidenceBtoA'] = pairs['supportAB'] / pairs['supportB']
    # Supports are percentages here, so this value is the probability-based lift divided by 100
    pairs['lift'] = pairs['supportAB'] / (pairs['supportA'] * pairs['supportB'])
    
    return pairs.sort_values('lift', ascending=False)

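Given the large number of products, the support threshold should not be set too high. Here it is 0.02, which the code interprets as a percentage of orders (0.02%).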
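
To get a feel for what this threshold means in absolute terms, it can be converted into a minimum number of orders. The sketch assumes the unfiltered unique-order count printed earlier (3214874); the count after removing single-item orders is not shown in the notebook and would be somewhat smaller, so the result is only an approximation.

```python
support_threshold = 0.02    # percent, as passed to association_rules
orders_count = 3_214_874    # assumption: unfiltered unique-order count printed earlier

# Minimum number of orders an item or pair must appear in to pass the threshold.
min_orders = support_threshold / 100 * orders_count
print(f"about {min_orders:.0f} orders")
```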

In [21]:
%%time
rules = association_rules(order_item, 0.02)

First Pass:
Products count: 7208
Remaining order_products: 28056134
Second Pass:
Total product pairs count: 13334712
Remaining product pairs: 23631
CPU times: user 5min 50s, sys: 10.7 s, total: 6min 1s
Wall time: 6min 8s


In [30]:
product = pd.read_csv('../market_sells_orders/input/products.csv')[['product_id','product_name']]
merged = rules.merge(product.rename(columns={'product_name':'productA'}),left_on='product_A',right_on='product_id').merge(product.rename(columns={'product_name':'productB'}),left_on='product_B',right_on='product_id')
result=merged[['productA','productB','supportAB','supportA','supportB', 'confidenceAtoB','confidenceBtoA','lift']].sort_values('lift', ascending=False)

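Filter out pairs whose lift is not greater than 1, and keep only pairs with support above 0.05%. Because the supports are stored as percentages, the `lift` column equals the probability-based lift divided by 100.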

In [31]:
result = result[result['lift']>1]
result = result[result['supportAB']>0.05].reset_index()
result.drop(columns=['index'],inplace=True)
result

| | productA | productB | supportAB | supportA | supportB | confidenceAtoB | confidenceBtoA | lift |
|---|---|---|---|---|---|---|---|---|
| 0 | Almond Milk Blueberry Yogurt | Almond Milk Peach Yogurt | 0.073901 | 0.154179 | 0.15382 | 0.479321 | 0.480442 | 3.116125 |
| 1 | Almond Milk Blueberry Yogurt | Almond Milk Strawberry Yogurt | 0.088355 | 0.154179 | 0.186846 | 0.573065 | 0.472874 | 3.067035 |
| 2 | Almond Milk Strawberry Yogurt | Almond Milk Peach Yogurt | 0.080997 | 0.186846 | 0.15382 | 0.433497 | 0.526573 | 2.818213 |
| 3 | Coconut Chia Bar | Chocolate Peanut Butter | 0.061149 | 0.152021 | 0.152119 | 0.402237 | 0.401978 | 2.644221 |
| 4 | Stage 1 Apples Sweet Potatoes Pumpkin & Bluebe... | Organic 4 Months Butternut Squash Carrots Appl... | 0.050063 | 0.182432 | 0.105849 | 0.274422 | 0.472969 | 2.592576 |
| 5 | Yotoddler Organic Pear Spinach Mango Yogurt | Organic Whole Milk Strawberry Beet Berry Yogur... | 0.092311 | 0.200155 | 0.205126 | 0.461199 | 0.450024 | 2.248374 |
| 6 | YoKids Squeeze Organic Blueberry Blue Yogurt | YoKids Squeeze! Organic Strawberry Flavor Yogurt | 0.055393 | 0.113534 | 0.222358 | 0.487903 | 0.249118 | 2.19422 |
| 7 | Stage 1 Apples Sweet Potatoes Pumpkin & Bluebe... | Organic Pears, Peas and Broccoli Puree Stage 1 | 0.062947 | 0.182432 | 0.164447 | 0.345044 | 0.38278 | 2.098206 |
| 8 | Blueberry Whole Milk Yogurt Pouch | Organic Whole Milk Strawberry Beet Berry Yogur... | 0.066577 | 0.160785 | 0.205126 | 0.414074 | 0.324566 | 2.018634 |
| 9 | Organic Greek Lowfat Yogurt With Strawberries | Organic Greek Lowfat Yogurt With Blueberries | 0.064059 | 0.206172 | 0.156141 | 0.310706 | 0.410262 | 1.9899 |
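
When reading the `lift` column, it helps to remember that `association_rules` stores supports as percentages of orders. The sketch below recomputes row 0 from the table values to show how the column relates to a lift computed from probabilities; the variable names are chosen for the example only.

```python
# Row 0 of the table (supports are percentages of orders).
support_ab, support_a, support_b = 0.073901, 0.154179, 0.15382

# What the notebook's expression computes when fed percentages.
lift_column = support_ab / (support_a * support_b)

# The same ratio with supports converted to probabilities first.
lift_probability = (support_ab / 100) / ((support_a / 100) * (support_b / 100))

print(round(lift_column, 2), round(lift_probability, 1))
```

The two values differ by exactly a factor of 100, so a `lift` column value of about 3.1 corresponds to the pair co-occurring roughly 310 times more often than independence would predict. The ranking of pairs is unaffected by this scaling.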


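### Conclusion
As the table shows, the product pairs with the highest lift are not surprising: most are combinations of common foods. Pairs of different brands or flavors within the same product category suggest that these products are highly homogeneous and that their target customers overlap heavily, so they can be used for related-product recommendations.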