## Experiment 2: Association Rule Mining Analysis

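&emsp;&emsp;Association rules are used to find relationships among the itemsets of a dataset. They expose previously unknown relationships between data items, mined from the statistical regularities of the samples. With the mined relationships, information about one attribute can be inferred from another. A rule is considered to hold when its confidence or lift reaches a given threshold.        
&emsp;&emsp;Association rule mining has two main stages. The first finds every itemset in the transaction database whose support is at least the preset minimum support (the frequent itemsets). The second generates candidate rules from the frequent itemsets and keeps those that meet the preset minimum confidence, which yields the strong association rules.      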
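
As a small illustration of the two thresholds, the sketch below computes support and confidence on a made-up list of transactions. The item names and the helper functions `support` and `confidence` are hypothetical and are not part of the experiment's data or code.

```python
# Toy transactions (hypothetical, for illustration only).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"A", "B"}, transactions))       # 3 of the 5 transactions
print(confidence({"A"}, {"B"}, transactions))  # 3 of the 4 transactions containing A
```

With a minimum support of 0.1 and a minimum confidence of 0.7, the toy rule A -> B would count as a strong rule, since {A, B} appears in 3 of 5 transactions and B appears in 3 of the 4 transactions that contain A.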

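#### The main steps of the Apriori algorithm are:   
(1) Scan the whole dataset to produce the set of candidate 1-itemsets $C_{1}$.   
(2) Using the minimum support, derive the set of frequent 1-itemsets $L_{1}$ from $C_{1}$.   
(3) For $k \geq 1$, repeat steps (4)-(6).   
(4) Apply the join and prune operations to $L_{k}$ to produce the set of candidate $(k+1)$-itemsets $C_{k+1}$.    
(5) Using the minimum support, derive the set of frequent $(k+1)$-itemsets $L_{k+1}$ from $C_{k+1}$.   
(6) If $L_{k+1} \neq \emptyset$, set $k=k+1$ and go to step (4); otherwise go to step (7).   
(7) Using the minimum confidence, generate the strong association rules from the frequent itemsets, then stop.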
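
Step (4) combines a join with a prune. A minimal sketch of that step is given here, assuming itemsets are stored as `frozenset`s; the function name `join_and_prune` and the toy input are illustrative. For brevity the join pairs any two k-itemsets that differ in exactly one item, a simplification of the usual prefix-based join. The prune relies on the Apriori property that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations

def join_and_prune(frequent_k):
    """Build candidate (k+1)-itemsets from a collection of frequent k-itemsets."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            # Join: two k-itemsets that differ in exactly one item
            # combine into a (k+1)-itemset.
            if len(union) == len(a) + 1:
                # Prune: keep the candidate only if all of its
                # k-item subsets are frequent.
                if all(frozenset(s) in frequent_k
                       for s in combinations(union, len(a))):
                    candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets.
L2 = [frozenset(p) for p in (("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"))]
print(join_and_prune(L2))
```

Only {A, B, C} survives here: {A, B, D} and {B, C, D} are pruned because {A, D} and {C, D} are not frequent. The `aprioriGen` function in section 1 performs the join only and leaves the filtering to the support scan in `scanD`.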

### 1. Write a Python program that mines all frequent itemsets in the dataset "腹泻数据.xlsx" with a support threshold of 0.1.

In [1]:
import pandas as pd

trans=pd.read_csv('腹泻数据.csv',header=None,encoding='utf-8')
trans.head()
trans_list=trans.values.tolist()

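def load_data(x):
    # Turn each row into a list of item names, keeping only string cells
    # (non-string cells such as NaN padding in short rows are skipped).
    data=[]
    for trans in x:
        tem=[]
        for i in trans:
            if isinstance(i,str):
                tem.append(i)
        data.append(tem)
    return data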

def createC1(dataSet):
    C1=[]
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    return list(map(frozenset,C1))

def scanD(D,Ck,minSupport):
    ssCnt={}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if not can in ssCnt:
                    ssCnt[can]=1
                else:
                    ssCnt[can]+=1
    numItems=float(len(D))
    retList=[]
    supportData={}
    for key in ssCnt:
        support = ssCnt[key]/numItems
        if support >= minSupport:
            retList.insert(0,key)
        supportData[key]=support
    return retList,supportData

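def aprioriGen(Lk,k):
    # Join step: merge two frequent (k-1)-itemsets whose first k-2 items
    # (in sorted order) are identical, giving a candidate k-itemset.
    retList=[]
    lenLk=len(Lk)
    for i in range(lenLk):
        for j in range(i+1,lenLk):
            # Sort before slicing: a frozenset has no defined order.
            L1=sorted(Lk[i])[:k-2]
            L2=sorted(Lk[j])[:k-2]
            if L1 == L2:
                retList.append(Lk[i]|Lk[j])
    return retList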

def apriori(dataSet,minSupport=0.1):
    C1=createC1(dataSet)
    D=list(map(set,dataSet))
    # Use the caller's threshold rather than a hard-coded 0.1.
    L1,supportData=scanD(D,C1,minSupport)
    L=[L1]
    k=2
    while(len(L[k-2])>0):
        Ck=aprioriGen(L[k-2],k)
        Lk,supK=scanD(D,Ck,minSupport)
        supportData.update(supK)
        L.append(Lk)
        k+=1
    return L,supportData

def generateRules(L,supportData,minConf=0.7):
    bigRuleList=[]
    for i in range(1,len(L)):
        for freqSet in L[i]:
            H1=[frozenset([item]) for item in freqSet]
            if(i>1):
                rulesFromConseq(freqSet,H1,supportData,bigRuleList,minConf)
            else:
                calcConf(freqSet,H1,supportData,bigRuleList,minConf)
    # Return only after every itemset size has been processed.
    return bigRuleList

def calcConf(freqSet,H,supportData,br1,minConf=0.7):
    prunedH=[]
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet-conseq]
        if conf >= minConf:
            print(freqSet-conseq,'-->',conseq,'conf:',conf)
            br1.append((freqSet-conseq,conseq,conf))
            prunedH.append(conseq)
    return prunedH

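def rulesFromConseq(freqSet,H,supportData,br1,minConf=0.7):
    # H holds candidate consequents of size m; try consequents of size m+1.
    # Note: for itemsets with 3 or more items this starts at 2-item consequents.
    m=len(H[0])
    if(len(freqSet)>(m+1)):
        # Build the (m+1)-item consequents with aprioriGen, not apriori.
        Hmp1=aprioriGen(H,m+1)
        Hmp1=calcConf(freqSet,Hmp1,supportData,br1,minConf)
        if(len(Hmp1)>1):
            # Recurse on the surviving consequents so the recursion terminates.
            rulesFromConseq(freqSet,Hmp1,supportData,br1,minConf)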

if __name__=='__main__':
    dataSet=load_data(trans_list)
    L,supportData=apriori(dataSet,0.1)
    rules=generateRules(L,supportData,minConf=0.7)
    #print(rules)
    print(L)

frozenset({'干姜'}) --> frozenset({'甘草'}) conf: 0.7272727272727273
frozenset({'黄芩'}) --> frozenset({'甘草'}) conf: 0.7777777777777779
frozenset({'人参'}) --> frozenset({'甘草'}) conf: 1.0
frozenset({'陈皮'}) --> frozenset({'甘草'}) conf: 0.7
[[frozenset({'干姜'}), frozenset({'黄芩'}), frozenset({'陈皮'}), frozenset({'人参'}), frozenset({'木香'}), frozenset({'茯苓'}), frozenset({'白术'}), frozenset({'甘草'})], [frozenset({'干姜', '甘草'}), frozenset({'黄芩', '甘草'}), frozenset({'人参', '甘草'}), frozenset({'陈皮', '甘草'}), frozenset({'甘草', '白术'})], []]


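### 2. Use the `apriori` function from the efficient_apriori library to mine the frequent itemsets and association rules of "腹泻数据.xlsx", with a support threshold of 0.1 and a confidence threshold of 0.7.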

In [2]:
from efficient_apriori import apriori
trans=pd.read_csv('腹泻数据.csv',header=None,encoding='utf-8')
trans.head()
trans_list=trans.values.tolist()

def load_data(x):
    data=[]
    for trans in x:
        tem=[]
        for i in trans:
            if type(i)==str:
                tem.append(i)
        data.append(tem)
    return data

trans_final=load_data(trans_list)
trans_final

[['肥儿散', '君子仁', '鸡内金', '白术', '茯苓', '山药', '山楂', '甘草'],
 ['青金丹', '巴豆', '木香', '青橘皮', '吴茱萸', '附子'],
 ['法制橘红', '橘红', '檀香', '白豆蔻', '片脑'],
 ['加味温补通阳方',
  '山药',
  '熟地',
  '麻黄',
  '炮姜',
  '鹿角胶',
  '桂枝',
  '补骨脂',
  '白术',
  '炒陈曲',
  '醋香附',
  '当归',
  '熟附子',
  '山茱萸',
  '木香',
  '生黄耆',
  '骨碎补',
  '鸡血藤'],
 ['驱绦汤', '南瓜子肉', '槟榔'],
 ['完带汤', '白术', '山药', '人参', '白芍', '车前子', '苍术', '甘草', '陈皮', '黑芥穗', '柴胡'],
 ['六合定中丸',
  '藿香叶',
  '苏叶',
  '厚朴',
  '枳壳',
  '木香',
  '生甘草',
  '檀香',
  '柴胡',
  '羌活',
  '银花叶',
  '赤茯苓',
  '木瓜'],
 ['磨积散',
  '神曲',
  '山楂',
  '茯苓',
  '陈皮',
  '麦芽',
  '泽泻',
  '白术',
  '法半夏',
  '藿香',
  '苍术',
  '厚朴',
  '甘草'],
 ['健脾补肾汤', '党参', '川续断', '白术', '茯苓', '白芍', '当归', '五味子', '菟丝子', '川厚朴', '香附'],
 ['金蟾丸', '干蟾', '黄连', '芜荑', '芦荟', '人参', '甘草'],
 ['芙蓉截流丸', '清膏烟', '陈米饮'],
 ['自制芙蓉截流丸', '清膏烟', '陈米饭'],
 ['胰腺清化汤', '柴胡', '黄芩', '白芍', '厚朴', '枳实', '佩兰', '金银花', '大青叶', '大黄', '芒消'],
 ['午时茶',
  '茅术',
  '陈皮',
  '柴胡',
  '连翘',
  '白芷',
  '川朴',
  '枳实',
  '楂肉',
  '羌活',
  '防风',
  '前胡',
  '藿香',
  '甘草',
  '陈茶',
  '桔梗',
  '麦芽',
  '苏叶',

In [3]:
itemsets,rules=apriori(trans_final,min_support=0.1,min_confidence=0.7)
#itemsets is a dict: each key is an itemset size, and each value is a dict mapping frequent itemsets to their support counts
print(itemsets,'\n')
#rules is stored as a list
print(rules)

{1: {('白术',): 18, ('茯苓',): 10, ('甘草',): 30, ('木香',): 7, ('人参',): 9, ('陈皮',): 10, ('黄芩',): 9, ('干姜',): 11}, 2: {('人参', '甘草'): 9, ('干姜', '甘草'): 8, ('甘草', '白术'): 12, ('甘草', '陈皮'): 7, ('甘草', '黄芩'): 7}} 

[{人参} -> {甘草}, {干姜} -> {甘草}, {陈皮} -> {甘草}, {黄芩} -> {甘草}]


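### 3. Use the mlxtend library to mine the frequent itemsets and association rules of "腹泻数据.xlsx", with a support threshold of 0.1 and a confidence threshold of 0.7.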

In [4]:
from mlxtend.frequent_patterns import apriori,association_rules
from mlxtend.preprocessing import TransactionEncoder

In [5]:
df=pd.DataFrame(trans_final)

In [6]:
te=TransactionEncoder()
trans_new=te.fit_transform(trans_final)
trans_new=pd.DataFrame(trans_new,columns=te.columns_)
trans_new

| | 一加减正气散 | 三品一条枪 | 丹参 | 乳香 | 二生丹 | 云茯苓 | 五味子 | 五味木香丸 | 京半夏 | 人参 | ... | 黄精 | 黄耆 | 黄芩 | 黄连 | 黑芥穗 | 龙眼肉 | 龙眼肉粥 | 龙胆泻肝汤 | 龙胆草 | 龙脑丸 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 62 | False | False | False | False | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 63 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 64 | False | False | False | False | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | False | False | False |
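
The encoded table has one Boolean column per distinct item and one row per transaction. A rough pure-Python sketch of the same idea follows, on a made-up two-transaction input; the function `one_hot` is hypothetical and is not mlxtend's implementation.

```python
def one_hot(transactions):
    """Return (columns, rows): sorted item names and one Boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[col in set(t) for col in columns] for t in transactions]
    return columns, rows

# Two made-up transactions for illustration.
columns, rows = one_hot([["甘草", "白术"], ["甘草", "人参"]])
print(columns)
for row in rows:
    print(row)
```

Each row marks `True` only in the columns of the items that the transaction contains, which is the input layout that mlxtend's `apriori` expects.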


In [7]:
frequent_itemset = apriori(trans_new,min_support=0.1,use_colnames=True)
frequent_itemset

| | support | itemsets |
|---|---|---|
| 0 | 0.136364 | (人参) |
| 1 | 0.166667 | (干姜) |
| 2 | 0.106061 | (木香) |
| 3 | 0.454545 | (甘草) |
| 4 | 0.272727 | (白术) |
| 5 | 0.151515 | (茯苓) |
| 6 | 0.151515 | (陈皮) |
| 7 | 0.136364 | (黄芩) |
| 8 | 0.136364 | (人参, 甘草) |
| 9 | 0.121212 | (干姜, 甘草) |


In [8]:
frequent_itemset.sort_values(by='support',ascending=True)

| | support | itemsets |
|---|---|---|
| 2 | 0.106061 | (木香) |
| 11 | 0.106061 | (陈皮, 甘草) |
| 12 | 0.106061 | (黄芩, 甘草) |
| 9 | 0.121212 | (干姜, 甘草) |
| 0 | 0.136364 | (人参) |
| 7 | 0.136364 | (黄芩) |
| 8 | 0.136364 | (人参, 甘草) |
| 5 | 0.151515 | (茯苓) |
| 6 | 0.151515 | (陈皮) |
| 1 | 0.166667 | (干姜) |


In [9]:
rules_mlxtend=association_rules(frequent_itemset,metric='confidence',min_threshold=0.7)
rules_mlxtend

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (人参) | (甘草) | 0.136364 | 0.454545 | 0.136364 | 1.0 | 2.2 | 0.07438 | inf |
| 1 | (干姜) | (甘草) | 0.166667 | 0.454545 | 0.121212 | 0.727273 | 1.6 | 0.045455 | 2.0 |
| 2 | (陈皮) | (甘草) | 0.151515 | 0.454545 | 0.106061 | 0.7 | 1.54 | 0.03719 | 1.818182 |
| 3 | (黄芩) | (甘草) | 0.136364 | 0.454545 | 0.106061 | 0.777778 | 1.711111 | 0.044077 | 2.454545 |
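
The metric columns can be checked by hand from the support counts printed in section 2. The sketch below does so for the rule {干姜} -> {甘草}, assuming 66 transactions in total; that figure is not stated in the notebook and is inferred from the supports above (for example 9/66 ≈ 0.136364). The function name `rule_metrics` is illustrative.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Compute common rule metrics from raw transaction counts."""
    s_a = n_antecedent / n_total      # antecedent support
    s_c = n_consequent / n_total      # consequent support
    s_ab = n_both / n_total           # support of the rule
    conf = n_both / n_antecedent      # confidence
    lift = conf / s_c
    leverage = s_ab - s_a * s_c
    # Conviction is infinite when confidence is exactly 1.
    conviction = float("inf") if conf == 1 else (1 - s_c) / (1 - conf)
    return {"support": s_ab, "confidence": conf, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Counts from section 2: 干姜 = 11, 甘草 = 30, (干姜, 甘草) = 8.
# The total of 66 transactions is an assumption inferred from the supports.
m = rule_metrics(66, 11, 30, 8)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

Under that assumption the values agree with row 1 of the table: confidence 8/11, lift 1.6, leverage about 0.045455 and conviction 2.0. A lift above 1 indicates that the antecedent and consequent occur together more often than they would if they were independent.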
