### Association analysis (association rule learning): finding relationships between items in large datasets

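##### The Apriori algorithm:
###### Pros: easy to code up
###### Cons: slow on large datasets
###### Works with: numeric or nominal values
##### Some concepts:
###### Frequent itemset: a collection of items that frequently occur together
###### Association rule: suggests that a strong relationship exists between two items
###### For example: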

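In [2]:
import pandas as pd

dataId=[i for i in range(5)]
dataValue=['soy milk, lettuce','lettuce, diapers, wine, chard','soy milk, diapers, wine, orange juice','lettuce, soy milk, diapers, wine','lettuce, soy milk, diapers, orange juice']
data=pd.DataFrame()
data['Transaction ID']=dataId
data['Items']=dataValue
data

Unnamed: 0,Transaction ID,Items
0,0,"soy milk, lettuce"
1,1,"lettuce, diapers, wine, chard"
2,2,"soy milk, diapers, wine, orange juice"
3,3,"lettuce, soy milk, diapers, wine"
4,4,"lettuce, soy milk, diapers, orange juice"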


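##### In the data above, {wine, soy milk, diapers} is an example of a frequent itemset, and we can observe that people who bought diapers are likely to buy wine as well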

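##### How "frequent" is measured: support and confidence
##### The support of an itemset is defined as the fraction of records in the dataset that contain the itemset; it applies to itemsets
##### For example, in the data above the support of {soy milk} is 4/5
##### Confidence applies to an association rule: it is defined as the ratio of two supports (similar to a conditional probability)
##### For example, the confidence of the rule diapers -> wine is support({diapers, wine}) / support({diapers})
##### Here that is (3/5) / (4/5) = 3/4, so the rule diapers -> wine holds for 75% of the records that contain diapers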
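
The two figures above can be checked with a few lines of Python. This is only a minimal sketch: the names `transactions`, `support` and `confidence` are introduced here for illustration, the item names are English renderings of the sample table, and `Fraction` is used so the results print as exact ratios.

```python
from fractions import Fraction

# The five sample transactions from the table above, one set per transaction
transactions = [
    {'soy milk', 'lettuce'},
    {'lettuce', 'diapers', 'wine', 'chard'},
    {'soy milk', 'diapers', 'wine', 'orange juice'},
    {'lettuce', 'soy milk', 'diapers', 'wine'},
    {'lettuce', 'soy milk', 'diapers', 'orange juice'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({'soy milk'}, transactions))              # 4/5
print(confidence({'diapers'}, {'wine'}, transactions))  # 3/4
```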



In [19]:
x=[i for i in range(4)]
data=pd.DataFrame()
# every possible combination of the 4 items, grouped by size (shorter columns padded with None)
data['1 item']=[str(i) for i in x]+[None,None]
data['2 items']=['01', '02', '03', '12', '13', '23']
data['3 items']=['012','013','023','123']+[None,None]
data['4 items']=[''.join(list(map(str,x)))]+[None,None,None,None,None]
print('With 4 items there are 15 possible itemsets, so the data must be scanned 15 times, as shown below:')
data


With 4 items there are 15 possible itemsets, so the data must be scanned 15 times, as shown below:


Unnamed: 0,1 item,2 items,3 items,4 items
0,0,01,012,0123
1,1,02,013,
2,2,03,023,
3,3,12,123,
4,,13,,
5,,23,,


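##### To reduce the time needed for this computation, we can apply the Apriori principle
##### The principle states: if an itemset is frequent, then all of its subsets are also frequent; equivalently, if an itemset is infrequent, then all of its supersets are infrequent too
##### In the table above, if the two-item set {01} is frequent, then its subsets {0} and {1} are also frequent
##### Likewise, if the two-item set {23} is not frequent, then {023}, {123} and {0123} are not frequent either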
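
To make the pruning concrete, here is a small sketch that enumerates all 15 itemsets over the four items and removes every itemset containing {2, 3}. The helper `prune_supersets` is a hypothetical name used only for this illustration; it is not part of the implementation in this notebook.

```python
from itertools import combinations

items = [0, 1, 2, 3]

# Every non-empty itemset over 4 items: 2**4 - 1 = 15 candidates
all_itemsets = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

def prune_supersets(candidates, infrequent):
    """Drop every candidate that contains the known-infrequent itemset."""
    return [c for c in candidates if not infrequent <= c]

infrequent = frozenset({2, 3})
remaining = prune_supersets(all_itemsets, infrequent)
pruned = [sorted(c) for c in all_itemsets if infrequent <= c]

print(len(all_itemsets))  # 15
print(pruned)             # [[2, 3], [0, 2, 3], [1, 2, 3], [0, 1, 2, 3]]
print(len(remaining))     # 11
```

Knowing that a single two-item set is infrequent removes four of the fifteen itemsets without counting their support.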
##### Now let's implement this in code:

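##### Pseudocode:
    For each transaction tran in the dataset:
        For each candidate itemset can:
            Check whether can is a subset of tran
            If it is:
                Increment the count of can
    For each candidate itemset:
        If its support is at least the minimum support, keep the itemset
    Return the list of all frequent itemsets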
                

In [10]:
def loadDataSet():
    return [[1,2,3],[1,2,3,4],[2,3],[2,3,4],[2,5]]

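# Build C1: the candidate itemsets of size one (one per distinct item), as frozensets
def createC1(dataSet):
    c1=[]
    for transaction in dataSet:
        for item in transaction:
            if not [item] in c1:
                c1.append([item])
    c1.sort()
    # frozensets are immutable and hashable, so they can be used as dict keys in scanD
    return list(map(frozenset,c1))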

# Arguments: the transactions D, the candidate itemsets ck, and the minimum support.
# Returns the candidates that meet minSupport, plus the support of every counted candidate.
def scanD(D,ck,minSupport):
    ssCnt={}
    for tid in D:
        for can in ck:
            if can.issubset(tid):
                # count the transactions that contain this candidate
                ssCnt[can]=ssCnt.get(can,0)+1
    numItem=float(len(D))
    retList=[]
    supportData={}
    for key in ssCnt:
        # compute the support of each counted itemset
        support=ssCnt[key]/numItem
        # keep only the itemsets that meet the minimum support
        if support>= minSupport:
            retList.insert(0,key)
        supportData[key]=support
    return retList,supportData
                
   
def test():
    data=loadDataSet()
    ck=createC1(data)
    print('Items:',ck)
    retList,supportData=scanD(data,ck,0.5)
    print('Frequent 1-itemsets:',retList)
    print('Support of each 1-itemset:',supportData)
    print('Setting a minimum support filters out some itemsets early, which reduces the work in later passes!')
test()


Items: [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})]
Frequent 1-itemsets: [frozenset({3}), frozenset({2})]
Support of each 1-itemset: {frozenset({1}): 0.4, frozenset({2}): 1.0, frozenset({3}): 0.8, frozenset({4}): 0.4, frozenset({5}): 0.2}
Setting a minimum support filters out some itemsets early, which reduces the work in later passes!


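##### Pseudocode for the whole Apriori algorithm:
    While the number of itemsets in the list is greater than 0:
        Build a list of candidate itemsets of length k
        Scan the data to check whether each candidate is frequent
        Keep the frequent itemsets and build a list of candidate itemsets of length k+1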

In [39]:
def loadDataSet():
    return [[1,2,3],[1,2,3,4],[1,2,3,4],[2,5],[1,2,3,4]]

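# Build the candidate itemsets of size k by merging pairs of frequent (k-1)-itemsets from Lk
def aprioriGen(Lk,k):
    retList=[]
    lenLk=len(Lk)
    for i in range(lenLk):
        for j in range(i+1,lenLk):
            # sort before slicing: the iteration order of a frozenset is not guaranteed
            L1=sorted(Lk[i])[:k-2]
            L2=sorted(Lk[j])[:k-2]
            # merge the two itemsets only if their first k-2 items are equal
            if L1==L2:
                retList.append(Lk[i]|Lk[j])
    return retList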


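def apriori(dataSet,minSupport=0.5):
    C1=createC1(dataSet)
    # convert every transaction to a set
    D=list(map(set,dataSet))
    L1,supportData=scanD(D,C1,minSupport)
    L=[L1]
    k=2
    # L[k-2] holds the frequent (k-1)-itemsets; stop once that list is empty
    while(len(L[k-2])>0):
        ck=aprioriGen(L[k-2],k)
        Lk,supK=scanD(D,ck,minSupport)
        supportData.update(supK)
        # scanD always returns a list, so the final (empty) Lk is appended too and ends the loop
        L.append(Lk)
        k+=1
    return L,supportData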

def test():
    dataSet=loadDataSet()
    L,supportData=apriori(dataSet,minSupport=0.5)
    print('Frequent itemsets:',L)
    print('Support of each itemset:',supportData)
    
test()

Frequent itemsets: [[frozenset({4}), frozenset({3}), frozenset({2}), frozenset({1})], [frozenset({1, 4}), frozenset({2, 4}), frozenset({3, 4}), frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})], [frozenset({2, 3, 4}), frozenset({1, 3, 4}), frozenset({1, 2, 4}), frozenset({1, 2, 3})], [frozenset({1, 2, 3, 4})], []]
Support of each itemset: {frozenset({1}): 0.8, frozenset({2}): 1.0, frozenset({3}): 0.8, frozenset({4}): 0.6, frozenset({5}): 0.2, frozenset({2, 3}): 0.8, frozenset({1, 3}): 0.8, frozenset({1, 2}): 0.8, frozenset({3, 4}): 0.6, frozenset({2, 4}): 0.6, frozenset({1, 4}): 0.6, frozenset({1, 2, 3}): 0.8, frozenset({1, 2, 4}): 0.6, frozenset({1, 3, 4}): 0.6, frozenset({2, 3, 4}): 0.6, frozenset({1, 2, 3, 4}): 0.6}
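
The reason `aprioriGen` compares only the first k-2 items is easier to see on a tiny input. The sketch below uses a stand-alone helper, `join_step` (a hypothetical name, written for this illustration rather than taken from the notebook), and assumes the items can be sorted.

```python
def join_step(prev_frequent, k):
    """Build size-k candidates from frequent (k-1)-itemsets.

    Two itemsets are merged only when their first k-2 items (in sorted
    order) agree, so each size-k candidate is produced at most once.
    """
    candidates = []
    ordered = [sorted(s) for s in prev_frequent]
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            if ordered[i][:k - 2] == ordered[j][:k - 2]:
                candidates.append(frozenset(ordered[i]) | frozenset(ordered[j]))
    return candidates

L2 = [frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})]
C3 = join_step(L2, 3)
print([sorted(c) for c in C3])  # [[0, 1, 2]]
```

Taking the union of every pair in `L2` would produce {0, 1, 2} three times; matching on the shared prefix yields it once, so no de-duplication pass is needed.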


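##### A useful observation when computing confidence: for a given frequent itemset, if a rule does not meet the minimum confidence, then every rule built from the same itemset whose right-hand side contains that rule's right-hand side will not meet it either
##### To mine association rules from frequent itemsets: start with one frequent itemset and create a list of rules whose right-hand side contains a single item
##### Then test these rules, and merge the ones that remain to create a new list of rules whose right-hand side contains two items (this is known as the level-wise approach)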
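
The observation about confidence can be checked numerically. The sketch below is an illustration under stated assumptions: it reuses the first sample dataset from this notebook, introduces its own helpers `support` and `confidence`, and lists every rule that can be built from the itemset {2, 3, 4}.

```python
from fractions import Fraction
from itertools import combinations

transactions = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3}, {2, 3, 4}, {2, 5}]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(freq_set, consequent):
    """Confidence of the rule (freq_set - consequent) --> consequent."""
    return support(freq_set) / support(freq_set - consequent)

F = frozenset({2, 3, 4})
for size in (1, 2):
    for H in combinations(sorted(F), size):
        H = frozenset(H)
        print(sorted(F - H), '-->', sorted(H), confidence(F, H))
```

The rule {2, 3} --> {4} has confidence 1/2, and the two rules whose right-hand side also contains 4 do no better: {3} --> {2, 4} has 1/2 and {2} --> {3, 4} has 2/5. Moving items to the right-hand side shrinks the left-hand side, whose support can only stay the same or grow, so the confidence can only stay the same or drop.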


In [41]:
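def generateRules(L, supportData, minConf=0.7):  #supportData is a dict coming from scanD
    bigRuleList = []
    for i in range(1, len(L)):#only get the sets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if (i > 1):
                # itemsets with three or more items: grow the right-hand side level by level
                # (note: as written, this branch starts at two-item right-hand sides and does not
                # evaluate single-item right-hand sides for these itemsets)
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                # two-item itemsets: evaluate the single-item right-hand sides directly
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList         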

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = [] #create new list to return
    for conseq in H:
        conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
        if conf >= minConf: 
            print(freqSet-conseq,'-->',conseq,'conf:',conf)
            brl.append((freqSet-conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if (len(freqSet) > (m + 1)): #try further merging
        Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 1):    #need at least two sets to merge
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

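def test():
    data=loadDataSet()
    print('Generating association rules on the current dataset with confidence of at least 0.7, from itemsets with support of at least 0.5:')
    L,supportData=apriori(data,minSupport=0.5)
    rules=generateRules(L,supportData,minConf=0.7)
    print('Association rules:',rules)

test()

Generating association rules on the current dataset with confidence of at least 0.7, from itemsets with support of at least 0.5: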
frozenset({4}) --> frozenset({1}) conf: 1.0
frozenset({1}) --> frozenset({4}) conf: 0.7499999999999999
frozenset({4}) --> frozenset({2}) conf: 1.0
frozenset({4}) --> frozenset({3}) conf: 1.0
frozenset({3}) --> frozenset({4}) conf: 0.7499999999999999
frozenset({2}) --> frozenset({1}) conf: 0.8
frozenset({1}) --> frozenset({2}) conf: 1.0
frozenset({3}) --> frozenset({1}) conf: 1.0
frozenset({1}) --> frozenset({3}) conf: 1.0
frozenset({3}) --> frozenset({2}) conf: 1.0
frozenset({2}) --> frozenset({3}) conf: 0.8
frozenset({4}) --> frozenset({2, 3}) conf: 1.0
frozenset({3}) --> frozenset({2, 4}) conf: 0.7499999999999999
frozenset({4}) --> frozenset({1, 3}) conf: 1.0
frozenset({3}) --> frozenset({1, 4}) conf: 0.7499999999999999
frozenset({1}) --> frozenset({3, 4}) conf: 0.7499999999999999
frozenset({4}) --> frozenset({1, 2}) conf: 1.0
frozenset({1}) --> frozenset({2, 4}) conf: 0.7499999999999999
frozenset({3}) --> frozenset({1,

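##### Example: finding similar features in poisonous mushrooms
##### Dataset: mushroom.dat
##### The first column indicates whether the mushroom is poisonous (1: edible; 2: poisonous)
##### The second column is the cap shape, which has six possible values
##### There are 23 features in total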

In [49]:
def loadDataSet(filename):
    mushData=[]
    with open(filename) as f:
        data=f.readlines()
    for line in data:
        mushData.append(line.split())
    return mushData

def test():
    mushData=loadDataSet(r'data/mushroom.dat')
    L,supportData=apriori(mushData,minSupport=0.3)
    print('Frequent 2-itemsets that contain feature 2 (poisonous):')
    for i in L[1]:
        # feature values are read as strings, so test for membership of '2'
        if '2' in i:
            print(i)
        
test()

Frequent 2-itemsets that contain feature 2 (poisonous):
frozenset({'2', '28'}

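### Chapter summary:
##### Association analysis is a set of tools for finding interesting relationships between elements in large datasets
##### Frequent itemsets can be used to quantify these relationships, but counting every combination directly takes a great deal of time, which the Apriori approach reduces
##### Support and confidence can be used to generate association rules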