For this homework assignment, we need to build association rules for a dataset.
We will use apriori and fpgrowth from mlxtend.frequent_patterns to mine frequent itemsets, and association_rules from the same module to derive rules from them (these implementations were chosen because they run a bit faster).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

In [2]:
df = pd.read_csv("dataset.csv", skipinitialspace=True, header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,pork,sandwich bags,lunch meat,all- purpose,flour,soda,butter,vegetables,beef,aluminum foil,all- purpose,dinner rolls,shampoo,all- purpose
1,shampoo,hand soap,waffles,vegetables,cheeses,mixes,milk,sandwich bags,laundry detergent,dishwashing liquid/detergent,waffles,individual meals,hand soap,vegetables
2,pork,soap,ice cream,toilet paper,dinner rolls,hand soap,spaghetti sauce,milk,ketchup,sandwich loaves,poultry,toilet paper,ice cream,ketchup
3,juice,lunch meat,soda,toilet paper,all- purpose,,,,,,,,,
4,pasta,tortillas,mixes,hand soap,toilet paper,vegetables,vegetables,paper towels,vegetables,flour,vegetables,pork,poultry,eggs


As the output shows, the dataset contains missing values (baskets with fewer than 14 items). We replace them with empty strings.

In [3]:
df = df.fillna('')
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,pork,sandwich bags,lunch meat,all- purpose,flour,soda,butter,vegetables,beef,aluminum foil,all- purpose,dinner rolls,shampoo,all- purpose
1,shampoo,hand soap,waffles,vegetables,cheeses,mixes,milk,sandwich bags,laundry detergent,dishwashing liquid/detergent,waffles,individual meals,hand soap,vegetables
2,pork,soap,ice cream,toilet paper,dinner rolls,hand soap,spaghetti sauce,milk,ketchup,sandwich loaves,poultry,toilet paper,ice cream,ketchup
3,juice,lunch meat,soda,toilet paper,all- purpose,,,,,,,,,
4,pasta,tortillas,mixes,hand soap,toilet paper,vegetables,vegetables,paper towels,vegetables,flour,vegetables,pork,poultry,eggs


In [4]:
# Build the list of purchased items for each customer (empty cells dropped, duplicates removed)
shopping_list = []

for i, row in df.iterrows():
    shopping_list.append(row.loc[row != ''].unique().tolist())
    
for i in range(5):
    print(shopping_list[i])

['pork', 'sandwich bags', 'lunch meat', 'all- purpose', 'flour', 'soda', 'butter', 'vegetables', 'beef', 'aluminum foil', 'dinner rolls', 'shampoo']
['shampoo', 'hand soap', 'waffles', 'vegetables', 'cheeses', 'mixes', 'milk', 'sandwich bags', 'laundry detergent', 'dishwashing liquid/detergent', 'individual meals']
['pork', 'soap', 'ice cream', 'toilet paper', 'dinner rolls', 'hand soap', 'spaghetti sauce', 'milk', 'ketchup', 'sandwich loaves', 'poultry']
['juice', 'lunch meat', 'soda', 'toilet paper', 'all- purpose']
['pasta', 'tortillas', 'mixes', 'hand soap', 'toilet paper', 'vegetables', 'paper towels', 'flour', 'pork', 'poultry', 'eggs']


First, we apply the Apriori algorithm from mlxtend

In [5]:
encoder = TransactionEncoder()

transactions = pd.DataFrame(encoder.fit(shopping_list).transform(shopping_list), columns=encoder.columns_)
display(transactions.head())

Unnamed: 0,all- purpose,aluminum foil,bagels,beef,butter,cereals,cheeses,coffee/tea,dinner rolls,dishwashing liquid/detergent,...,shampoo,soap,soda,spaghetti sauce,sugar,toilet paper,tortillas,vegetables,waffles,yogurt
0,True,True,False,True,True,False,False,False,True,False,...,True,False,True,False,False,False,False,True,False,False
1,False,False,False,False,False,False,True,False,False,True,...,True,False,False,False,False,False,False,True,True,False
2,False,False,False,False,False,False,False,False,True,False,...,False,True,False,True,False,True,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,False,False
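
To make the encoding step concrete, here is a minimal pure-Python sketch of what the one-hot table above represents. The baskets and the helper `one_hot` are hypothetical and only illustrate the idea; this is not how TransactionEncoder is implemented, although it also appears to order its columns alphabetically, as the output above suggests.

```python
# Toy baskets (hypothetical data, not taken from dataset.csv)
baskets = [
    ["pork", "soda", "butter"],
    ["milk", "soda"],
    ["butter", "milk", "pork"],
]


def one_hot(baskets):
    """Return (columns, rows): sorted item names and one boolean row per basket."""
    # Every distinct item becomes a column, in alphabetical order
    columns = sorted({item for basket in baskets for item in basket})
    # Each row marks which of those items the basket contains
    rows = [[item in basket for item in columns] for basket in baskets]
    return columns, rows


columns, rows = one_hot(baskets)
print(columns)
for row in rows:
    print(row)
```

Each row has one boolean per known item, which is the input shape that apriori and fpgrowth expect.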


In [6]:
%%time
frequent_itemsets = apriori(transactions, min_support=7 / len(shopping_list), use_colnames=True, max_len=2)
frequent_itemsets

Wall time: 45 ms


Unnamed: 0,support,itemsets
0,0.263509,(all- purpose)
1,0.264176,(aluminum foil)
2,0.278185,(bagels)
3,0.262842,(beef)
4,0.261508,(butter)
...,...,...
736,0.062041,"(waffles, tortillas)"
737,0.068045,"(yogurt, tortillas)"
738,0.168779,"(vegetables, waffles)"
739,0.176117,"(vegetables, yogurt)"
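
The threshold min_support = 7 / len(shopping_list) amounts to requiring that an itemset occur in at least 7 baskets. The sketch below shows that idea with brute-force counting on hypothetical toy baskets. It is not the Apriori algorithm itself (there is no candidate pruning), and the function name `frequent_itemsets_bruteforce` is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets_bruteforce(baskets, min_count, max_len=2):
    """Count every itemset of size <= max_len and keep those seen in >= min_count baskets."""
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeated items within one basket
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    # Support is the share of baskets that contain the itemset
    return {itemset: c / n for itemset, c in counts.items() if c >= min_count}


# Toy baskets (hypothetical data, not taken from dataset.csv)
baskets = [
    ["pork", "soda", "butter"],
    ["milk", "soda"],
    ["butter", "milk", "pork"],
    ["pork", "butter"],
]

result = frequent_itemsets_bruteforce(baskets, min_count=2)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

With min_count=2, only the single items and the pair (butter, pork) survive; every other pair occurs in just one basket.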


In [7]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.15)
display(rules.head(20))
print("Rules identified: ", len(rules))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(fruits),(all- purpose),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173
1,(all- purpose),(fruits),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173
2,(sandwich loaves),(butter),0.248833,0.261508,0.078719,0.316354,1.209731,0.013648,1.080226
3,(butter),(sandwich loaves),0.261508,0.248833,0.078719,0.30102,1.209731,0.013648,1.074663
4,(sandwich bags),(cheeses),0.250167,0.260173,0.075384,0.301333,1.158202,0.010297,1.058912
5,(cheeses),(sandwich bags),0.260173,0.250167,0.075384,0.289744,1.158202,0.010297,1.055722
6,(fruits),(dishwashing liquid/detergent),0.263509,0.268179,0.084056,0.318987,1.189458,0.013389,1.074607
7,(dishwashing liquid/detergent),(fruits),0.268179,0.263509,0.084056,0.313433,1.189458,0.013389,1.072715
8,(individual meals),(dishwashing liquid/detergent),0.271514,0.268179,0.084056,0.309582,1.154388,0.011242,1.059969
9,(dishwashing liquid/detergent),(individual meals),0.268179,0.271514,0.084056,0.313433,1.154388,0.011242,1.061055


Rules identified:  16


In [8]:
rules.nlargest(n=10, columns="lift")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(sandwich loaves),(butter),0.248833,0.261508,0.078719,0.316354,1.209731,0.013648,1.080226
3,(butter),(sandwich loaves),0.261508,0.248833,0.078719,0.30102,1.209731,0.013648,1.074663
6,(fruits),(dishwashing liquid/detergent),0.263509,0.268179,0.084056,0.318987,1.189458,0.013389,1.074607
7,(dishwashing liquid/detergent),(fruits),0.268179,0.263509,0.084056,0.313433,1.189458,0.013389,1.072715
10,(hand soap),(mixes),0.237492,0.273516,0.076051,0.320225,1.170773,0.011093,1.068712
11,(mixes),(hand soap),0.273516,0.237492,0.076051,0.278049,1.170773,0.011093,1.056177
12,(ketchup),(soap),0.250167,0.26551,0.077385,0.309333,1.165052,0.010963,1.06345
13,(soap),(ketchup),0.26551,0.250167,0.077385,0.291457,1.165052,0.010963,1.058275
0,(fruits),(all- purpose),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173
1,(all- purpose),(fruits),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173
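
The metric columns in the table can be recomputed from the three support values using the standard definitions of confidence, lift, leverage and conviction. The sketch below does this for the rule (sandwich loaves) -> (butter), with the rounded supports copied from the table; the helper `rule_metrics` is a hypothetical name, not part of mlxtend.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Compute rule metrics from the antecedent, consequent and joint supports."""
    confidence = support_ac / support_a            # P(C | A)
    lift = confidence / support_c                  # > 1 means positive association
    leverage = support_ac - support_a * support_c  # difference from independence
    # Conviction is undefined when confidence == 1; treat it as infinity here
    conviction = float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports for (sandwich loaves) -> (butter), rounded as shown in the table
metrics = rule_metrics(support_a=0.248833, support_c=0.261508, support_ac=0.078719)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the inputs are rounded to six decimals, the results agree with the table only up to small rounding differences. A lift of about 1.21 means the two items appear together roughly 21% more often than they would if they were independent.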


Next, we apply the fpgrowth algorithm from mlxtend.frequent_patterns

In [9]:
%%time
fpgr_results = fpgrowth(transactions, min_support=7 / len(shopping_list), use_colnames=True, max_len=2)
fpgr_results

Wall time: 577 ms


Unnamed: 0,support,itemsets
0,0.597065,(vegetables)
1,0.276184,(lunch meat)
2,0.274183,(soda)
3,0.264176,(aluminum foil)
4,0.263509,(all- purpose)
...,...,...
736,0.070047,"(fruits, lunch meat)"
737,0.071381,"(fruits, pasta)"
738,0.069380,"(fruits, yogurt)"
739,0.074049,"(fruits, waffles)"


In [10]:
fpgr_rules = association_rules(fpgr_results, metric="lift", min_threshold=1.15)
fpgr_rules.nlargest(n=10, columns="lift")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
10,(sandwich loaves),(butter),0.248833,0.261508,0.078719,0.316354,1.209731,0.013648,1.080226
11,(butter),(sandwich loaves),0.261508,0.248833,0.078719,0.30102,1.209731,0.013648,1.074663
12,(fruits),(dishwashing liquid/detergent),0.263509,0.268179,0.084056,0.318987,1.189458,0.013389,1.074607
13,(dishwashing liquid/detergent),(fruits),0.268179,0.263509,0.084056,0.313433,1.189458,0.013389,1.072715
6,(hand soap),(mixes),0.237492,0.273516,0.076051,0.320225,1.170773,0.011093,1.068712
7,(mixes),(hand soap),0.273516,0.237492,0.076051,0.278049,1.170773,0.011093,1.056177
8,(ketchup),(soap),0.250167,0.26551,0.077385,0.309333,1.165052,0.010963,1.06345
9,(soap),(ketchup),0.26551,0.250167,0.077385,0.291457,1.165052,0.010963,1.058275
14,(fruits),(all- purpose),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173
15,(all- purpose),(fruits),0.263509,0.263509,0.08072,0.306329,1.1625,0.011283,1.06173


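We can therefore conclude that both algorithms produce the same rules with identical metrics (the top 10 rules by lift match; only the row indices differ). On this dataset and in this single run, however, apriori finished faster than fpgrowth: 45 ms versus 577 ms.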
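
The timings of 45 ms and 577 ms each come from a single %%time run, so they may be affected by noise or warm-up effects. One possible way to get a steadier comparison is to repeat each call several times and keep the best result. The helper `best_time` below is a hypothetical sketch, and the workload passed to it is only a placeholder.

```python
import timeit


def best_time(func, repeats=5, number=10):
    """Return the best per-call wall time in seconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeats, number=number)
    return min(timings) / number


# Placeholder workload; in the notebook one might instead pass something like
# lambda: apriori(transactions, min_support=7 / len(shopping_list),
#                 use_colnames=True, max_len=2)
elapsed = best_time(lambda: sum(range(1000)))
print(f"{elapsed * 1e3:.4f} ms per call")
```

Taking the minimum over repeats reduces the influence of background load, which makes it a common choice for micro-benchmarks.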