# FP-Growth Algorithm

**Reference:**

pyfpgrowth package: https://fp-growth.readthedocs.io/en/latest/readme.html
<br>Install with `pip install pyfpgrowth`
<br>mlxtend package: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/
<br>Install with `pip install mlxtend`
<br>Supermarket dataset:
https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view

In [1]:
# Import the required packages
# !pip install pyfpgrowth
# !pip install mlxtend
import pandas as pd
import pyfpgrowth
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.preprocessing import TransactionEncoder

In [2]:
df = pd.read_csv('store_data.csv', header=None)  # load the same supermarket dataset as before
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


`find_frequent_patterns` expects a Python `list` of transactions rather than a `DataFrame`, so the NaN values must be removed first.

In [3]:
# Drop the empty (NaN) cells to get the list of items bought in each transaction
record = df.stack().groupby(level=0).apply(list).tolist()
# Inspect the first 5 entries of record:
for i in range(0, 5):
    print(record[i])

['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil']
['burgers', 'meatballs', 'eggs']
['chutney']
['turkey', 'avocado']
['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea']
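
The same NaN-free transaction lists can also be built without pandas. The following is a minimal sketch using only the standard library, not the notebook's own method: `rows_to_transactions` is a hypothetical helper, and the two-line sample stands in for `store_data.csv`.

```python
import csv
import io


def rows_to_transactions(lines):
    """Parse CSV lines into transactions, dropping empty cells."""
    transactions = []
    for row in csv.reader(lines):
        # Empty strings are the cells pandas would read as NaN.
        items = [cell for cell in row if cell]
        if items:  # skip rows that contain no items at all
            transactions.append(items)
    return transactions


# Hypothetical sample with trailing empty cells, as in the real file.
sample = io.StringIO("burgers,meatballs,eggs,,\nchutney,,,,\n")
transactions = rows_to_transactions(sample)
print(transactions)
```

To run it on the real file, an open file handle (for example `open('store_data.csv', newline='')`) could be passed in place of `sample`.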


In [4]:
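min_support = 300  # minimum support count of 300. ⚠️ pyfpgrowth expresses support as an absolute count (the itemset appears at least 300 times), whereas the Apriori implementation takes a fraction such as 300/7500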
patterns = pyfpgrowth.find_frequent_patterns(record, min_support)
print(patterns)

{('salmon',): 319, ('fresh bread',): 323, ('champagne',): 351, ('honey',): 356, ('herb & pepper',): 371, ('soup',): 379, ('cooking oil',): 383, ('grated cheese',): 393, ('whole wheat rice',): 439, ('chicken',): 450, ('turkey',): 469, ('frozen smoothie',): 475, ('olive oil',): 494, ('tomatoes',): 513, ('shrimp',): 536, ('low fat yogurt',): 574, ('escalope',): 595, ('cookies',): 603, ('cake',): 608, ('burgers',): 654, ('pancakes',): 713, ('frozen vegetables',): 715, ('ground beef',): 737, ('ground beef', 'mineral water'): 307, ('milk',): 972, ('milk', 'mineral water'): 360, ('green tea',): 991, ('chocolate',): 1230, ('chocolate', 'mineral water'): 396, ('french fries',): 1282, ('spaghetti',): 1306, ('mineral water', 'spaghetti'): 448, ('eggs',): 1348, ('eggs', 'mineral water'): 382, ('mineral water',): 1788}
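
Because `pyfpgrowth` reports absolute counts while other tools use fractions, converting between the two makes results easier to compare. This is a minimal sketch: `to_relative_support` is a hypothetical helper, the sample counts are copied from the output above, and 7500 is the approximate number of transactions assumed in the comment above.

```python
def to_relative_support(patterns, n_transactions):
    """Convert absolute pattern counts into fractions of all transactions."""
    return {itemset: count / n_transactions for itemset, count in patterns.items()}


# A few counts taken from the pyfpgrowth output above.
sample_patterns = {
    ('mineral water',): 1788,
    ('eggs', 'mineral water'): 382,
}
n_transactions = 7500  # assumed approximate dataset size

relative = to_relative_support(sample_patterns, n_transactions)
min_support_fraction = 300 / n_transactions  # the count threshold as a fraction

for itemset, support in relative.items():
    print(itemset, round(support, 4), support >= min_support_fraction)
```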


Next, try generating association rules:

In [5]:
rules = pyfpgrowth.generate_association_rules(patterns, 0.35)  # minimum confidence of 0.35
print(rules)

{('ground beef',): (('mineral water',), 0.41655359565807326), ('milk',): (('mineral water',), 0.37037037037037035)}
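
The confidence values above can be checked by hand, since the confidence of a rule A → B is the count of A ∪ B divided by the count of A. The sketch below recomputes them from counts copied out of the `patterns` output. `confidence` is a hypothetical helper, and it assumes the pattern keys are alphabetically sorted tuples, as they appear in that output.

```python
# Counts copied from the pyfpgrowth output above.
patterns = {
    ('ground beef',): 737,
    ('ground beef', 'mineral water'): 307,
    ('milk',): 972,
    ('milk', 'mineral water'): 360,
    ('spaghetti',): 1306,
    ('mineral water', 'spaghetti'): 448,
}


def confidence(patterns, antecedent, consequent):
    """confidence(A -> B) = count(A union B) / count(A)."""
    union = tuple(sorted(set(antecedent) | set(consequent)))
    return patterns[union] / patterns[tuple(sorted(antecedent))]


for item in ('ground beef', 'milk', 'spaghetti'):
    conf = confidence(patterns, (item,), ('mineral water',))
    print(f"{item} -> mineral water: {conf:.4f}, kept at 0.35: {conf >= 0.35}")
```

This also shows why `spaghetti` yields no rule even though the pair is frequent: 448/1306 falls just below the 0.35 threshold.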


With a large dataset, the dictionary output of `pyfpgrowth` is hard to read.
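
One way to make the dictionary easier to read is to sort it and print one pattern per line. This is only a sketch: `patterns_to_rows` is a hypothetical helper, and the sample reuses a few counts from the output above.

```python
def patterns_to_rows(patterns):
    """Return (itemset, count) pairs sorted by descending count, then by itemset."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))


# A few counts taken from the pyfpgrowth output above.
sample_patterns = {
    ('milk',): 972,
    ('milk', 'mineral water'): 360,
    ('mineral water',): 1788,
}

rows = patterns_to_rows(sample_patterns)
for itemset, count in rows:
    print(f"{count:>6}  {', '.join(itemset)}")
```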

## Running FP-Growth with the `mlxtend` package:

In [6]:
te = TransactionEncoder()
te_ary = te.fit(record).transform(record)  # encode each transaction as a row of True/False values
df1 = pd.DataFrame(te_ary, columns=te.columns_)
df1.head()

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
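
Conceptually, the encoding step produces one boolean column per distinct item and one row per transaction. The plain-Python sketch below illustrates the idea. It is not mlxtend's implementation, and `one_hot_encode` is a hypothetical helper.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([item in present for item in columns])
    return columns, rows


# Three short transactions taken from the record list above.
sample = [['burgers', 'meatballs', 'eggs'], ['chutney'], ['turkey', 'avocado']]
columns, rows = one_hot_encode(sample)
print(columns)
for row in rows:
    print(row)
```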


In [7]:
print(fpgrowth(df1, min_support=300/7500, use_colnames=True))

     support                      itemsets
0   0.238368               (mineral water)
1   0.132116                   (green tea)
2   0.076523              (low fat yogurt)
3   0.071457                      (shrimp)
4   0.065858                   (olive oil)
5   0.063325             (frozen smoothie)
6   0.047460                       (honey)
7   0.042528                      (salmon)
8   0.179709                        (eggs)
9   0.087188                     (burgers)
10  0.062525                      (turkey)
11  0.129583                        (milk)
12  0.058526            (whole wheat rice)
13  0.170911                (french fries)
14  0.050527                        (soup)
15  0.174110                   (spaghetti)
16  0.095321           (frozen vegetables)
17  0.080389                     (cookies)
18  0.051060                 (cooking oil)
19  0.046794                   (champagne)
20  0.163845                   (chocolate)
21  0.059992                     (chicken)
22  0.06839

Presented as a table, the result is much easier to read.