# Apriori Association Rule Mining

Apriori association rule mining is a machine learning method for finding relationships between variables. A typical application is market basket analysis, which identifies items that are frequently purchased together.

To demonstrate how to use apriori association rules, I will use grocery store data from Kaggle: https://www.kaggle.com/ekrembayar/apriori-association-rules-grocery-store/data

In this dataset, each row lists the grocery items that were purchased together in one transaction.

## Import libraries

In [141]:
import pandas as pd
import os
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Data

In [142]:
path = 'C:/Users/Katia/Documents/Machine learning'
os.chdir(path)

df = pd.read_csv('grocery_data_whole_foods.csv', sep = ",")

In [143]:
df.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,23,24,25,26,27,28,29,30,31,32
0,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,,...,,,,,,,,,,
1,tropical fruit,yogurt,coffee,,,,,,,,...,,,,,,,,,,
2,whole milk,,,,,,,,,,...,,,,,,,,,,
3,pip fruit,yogurt,cream cheese,meat spreads,,,,,,,...,,,,,,,,,,
4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,,...,,,,,,,,,,


## Data Conversion

In [144]:
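# Convert the dataframe into sparse (one-hot encoded) data, necessary for using the apriori function
# Note: get_dummies prefixes each item with the name of its source column, e.g. "1_beef"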

sparce_df = pd.get_dummies(df)

In [145]:
sparce_df.head()

Unnamed: 0,1_Instant food products,1_UHT-milk,1_abrasive cleaner,1_artif. sweetener,1_baby cosmetics,1_bags,1_baking powder,1_bathroom cleaner,1_beef,1_berries,...,28_chocolate,28_hygiene articles,28_napkins,28_sugar,29_cooking chocolate,29_house keeping products,29_soups,30_skin care,31_hygiene articles,32_candles
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
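
The column names above carry the position of the item within the basket (`1_beef`, `28_sugar`), so the same product appearing in two different positions is counted as two different items. If position-independent items are preferred, one possible alternative is to reshape the table to long format and cross-tabulate it. This is only a sketch on made-up toy data, not the method used in this notebook, and the results shown in the following sections would differ under it. The `baskets` frame and its values are hypothetical stand-ins for the grocery table.

```python
import pandas as pd

# Hypothetical stand-in for the wide grocery table: one row per basket,
# one column per position in the basket, None where the basket is shorter.
baskets = pd.DataFrame(
    {
        "1": ["citrus fruit", "whole milk", "yogurt"],
        "2": ["whole milk", "yogurt", None],
        "3": ["yogurt", None, None],
    }
)

# Long format: one row per (basket, item) pair, with empty cells dropped.
long = (
    baskets.reset_index()
    .melt(id_vars="index", value_name="item")
    .dropna(subset=["item"])
)

# One boolean column per item name, regardless of its position in the basket.
onehot = pd.crosstab(long["index"], long["item"]).astype(bool)
print(onehot)
```

A frame shaped like `onehot` (one boolean column per item, one row per transaction) is the kind of input the `apriori` function works on.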


## Support Calculation

Support is a metric of how frequently an item appears in the dataset. For example, if an item is purchased 50 times out of a total of 10,000 transactions, support will be equal to 0.005.
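
To make the definition concrete, here is a minimal sketch that computes support by hand. The `transactions` list and the `support` helper are hypothetical and only illustrate the arithmetic; they are not part of mlxtend.

```python
# Hypothetical toy transactions, not taken from the grocery dataset.
transactions = [
    {"whole milk", "yogurt"},
    {"whole milk", "other vegetables"},
    {"yogurt"},
    {"whole milk", "other vegetables", "yogurt"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


milk_support = support({"whole milk"}, transactions)            # 3 of 4 baskets
pair_support = support({"whole milk", "yogurt"}, transactions)  # 2 of 4 baskets
print(milk_support, pair_support)
```

The same helper works for single items and for combinations, which is why the support table further down can list itemsets of any length.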

In [146]:
# Get support table (how popular an item is) with length value (number of items purchased together)

df = sparce_df

frequent_itemsets = apriori(df, min_support=0.002, use_colnames=True)
# A higher (more conservative) minimum support value is possible, but it might exclude combinations of 3 or more items
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.007117,(1_UHT-milk),1
1,0.030910,(1_beef),1
2,0.012303,(1_berries),1
3,0.008134,(1_beverages),1
4,0.018099,(1_bottled beer),1
...,...,...,...
798,0.002339,"(5_yogurt, 3_other vegetables, 4_whole milk)",3
799,0.004169,"(5_whole milk, 4_other vegetables, 3_root vege...",3
800,0.002034,"(4_curd, 5_yogurt, 3_whole milk)",3
801,0.002135,"(5_whole milk, 6_butter, 4_other vegetables)",3


In [148]:
# Filter by the length of the item combinations (length = 1 is not useful for basket analysis).
# Here I keep the combinations with 3 or more items

frequent_itemsets[frequent_itemsets['length'] >= 3]

Unnamed: 0,support,itemsets,length
783,0.003152,"(3_pip fruit, 2_tropical fruit, 1_citrus fruit)",3
784,0.00244,"(1_other vegetables, 2_whole milk, 3_yogurt)",3
785,0.003254,"(1_root vegetables, 3_whole milk, 2_other vege...",3
786,0.002034,"(3_other vegetables, 4_whole milk, 1_sausage)",3
787,0.00244,"(5_whole milk, 4_other vegetables, 1_sausage)",3
788,0.002135,"(2_other vegetables, 3_whole milk, 1_tropical ...",3
789,0.002237,"(4_butter, 3_whole milk, 2_other vegetables)",3
790,0.003254,"(3_whole milk, 2_other vegetables, 4_yogurt)",3
791,0.002135,"(4_whole milk, 3_other vegetables, 2_pip fruit)",3
792,0.005186,"(4_whole milk, 3_other vegetables, 2_root vege...",3


The most frequent 3-item combination includes whole milk, other vegetables, and root vegetables.

# Association Rules

## Confidence

Confidence is a metric of how often a rule holds: of the transactions that contain the antecedent, it is the fraction that also contain the consequent. The highest value is 1.
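
As a rough illustration, confidence can be computed by hand from a list of baskets. The `transactions` list and the `confidence` helper below are hypothetical and exist only to show the calculation.

```python
# Hypothetical toy baskets, not taken from the grocery dataset.
transactions = [
    {"sausage", "frankfurter"},
    {"sausage", "frankfurter", "whole milk"},
    {"frankfurter"},
    {"whole milk"},
]


def confidence(antecedent, consequent, transactions):
    """Share of baskets containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        raise ValueError("the antecedent never occurs in the transactions")
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)


sausage_to_frank = confidence({"sausage"}, {"frankfurter"}, transactions)  # 2 of 2
frank_to_sausage = confidence({"frankfurter"}, {"sausage"}, transactions)  # 2 of 3
print(sausage_to_frank, frank_to_sausage)
```

Note that the two directions give different values, so a rule and its reverse have to be read separately.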

In [149]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7) 
# I am being conservative here with the threshold for simplification purposes
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(2_liquor),(1_bottled beer),0.003152,0.018099,0.002237,0.709677,39.211671,0.00218,3.382105
1,(2_sausage),(1_frankfurter),0.010066,0.058973,0.010066,1.0,16.956897,0.009472,inf
2,(6_whole milk),(5_other vegetables),0.008846,0.012201,0.007016,0.793103,65.001437,0.006908,4.77436
3,(6_butter),(5_whole milk),0.00366,0.01515,0.002949,0.805556,53.172073,0.002893,5.064943
4,(7_whole milk),(6_other vegetables),0.004982,0.006914,0.004067,0.816327,118.067227,0.004033,5.406801
5,(7_butter),(6_whole milk),0.00305,0.008846,0.002339,0.766667,86.668582,0.002312,4.247803
6,"(3_pip fruit, 1_citrus fruit)",(2_tropical fruit),0.003152,0.036096,0.003152,1.0,27.704225,0.003038,inf
7,"(1_other vegetables, 3_yogurt)",(2_whole milk),0.003355,0.066497,0.00244,0.727273,10.936892,0.002217,3.422844
8,"(1_root vegetables, 3_whole milk)",(2_other vegetables),0.004169,0.055923,0.003254,0.780488,13.956541,0.003021,4.300796
9,"(4_butter, 2_other vegetables)",(3_whole milk),0.002339,0.051449,0.002237,0.956522,18.591682,0.002117,21.816675


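Among these rules, sausage and frankfurter form the combination with the highest support (about 0.010), and the rule has 100% confidence: every transaction that contains 2_sausage also contains 1_frankfurter. The rule from (3_pip fruit, 1_citrus fruit) to 2_tropical fruit also reaches 100% confidence, but with lower support.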

## Lift

"This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is." definition from https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html
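- Lift > 1: item Y is more likely to be purchased when item X is purchased (positive association).
- Lift < 1: item Y is less likely to be purchased when item X is purchased (negative association).
- Lift = 1: items X and Y are not associated.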
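
The metric columns in the rules tables can be recomputed from three support values. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction; the `rule_metrics` helper is a hypothetical name, and the inputs are the rounded values printed for the rule from (2_liquor) to (1_bottled beer), so the results only match the table approximately.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Recompute rule metrics from the supports of X, Y, and X together with Y."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence = 1).
    if confidence >= 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded values as printed for (2_liquor) -> (1_bottled beer).
metrics = rule_metrics(
    antecedent_support=0.003152,
    consequent_support=0.018099,
    support=0.002237,
)
print(metrics)
```

A lift of roughly 39 here means that bottled beer appears about 39 times more often in baskets with liquor than its overall support would suggest.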

In [150]:
# Get lift scores

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(2_citrus fruit),(1_beef),0.024403,0.030910,0.004881,0.200000,6.470395,0.004126,1.211362
1,(1_beef),(2_citrus fruit),0.030910,0.024403,0.004881,0.157895,6.470395,0.004126,1.158522
2,(1_beef),(2_other vegetables),0.030910,0.055923,0.003254,0.105263,1.882297,0.001525,1.055145
3,(2_other vegetables),(1_beef),0.055923,0.030910,0.003254,0.058182,1.882297,0.001525,1.028957
4,(2_root vegetables),(1_beef),0.038943,0.030910,0.005592,0.143603,4.645845,0.004389,1.131590
...,...,...,...,...,...,...,...,...,...
785,"(4_root vegetables, 6_whole milk)",(5_other vegetables),0.002745,0.012201,0.002339,0.851852,69.816358,0.002305,6.667641
786,"(5_other vegetables, 6_whole milk)",(4_root vegetables),0.007016,0.010981,0.002339,0.333333,30.354938,0.002262,1.483528
787,(4_root vegetables),"(5_other vegetables, 6_whole milk)",0.010981,0.007016,0.002339,0.212963,30.354938,0.002262,1.261674
788,(5_other vegetables),"(4_root vegetables, 6_whole milk)",0.012201,0.002745,0.002339,0.191667,69.816358,0.002305,1.233717


Let's look at rules 785 and 786. We can see that if root vegetables and whole milk are purchased, there is a good chance that other vegetables will also be purchased, as suggested by both the lift value (69.8) and the confidence value (0.85). However, even though these three items are highly associated, the reverse direction is weaker: if other vegetables and whole milk are purchased, the confidence that root vegetables are also purchased is only 0.33, as shown by rule 786.
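
The difference comes from the denominator: all four rules 785 to 788 share the same itemset support (0.002339), but each one divides it by the support of a different antecedent. A small sketch using the rounded values printed in the table (the dictionary and its labels are only for illustration):

```python
# Support of the full itemset, as printed for rules 785 to 788.
itemset_support = 0.002339

# Antecedent support of each rule, copied from the table (rounded values).
antecedent_support = {
    "785: (root vegetables, whole milk) -> other vegetables": 0.002745,
    "786: (other vegetables, whole milk) -> root vegetables": 0.007016,
    "787: root vegetables -> (other vegetables, whole milk)": 0.010981,
    "788: other vegetables -> (root vegetables, whole milk)": 0.012201,
}

# Confidence = support of the full itemset / support of the antecedent.
confidences = {
    rule: itemset_support / a_support
    for rule, a_support in antecedent_support.items()
}

for rule, value in confidences.items():
    print(f"{rule}: confidence = {value:.3f}")
```

The rarer the antecedent is relative to the full itemset, the higher the confidence of the rule.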

## Lift and Confidence

In [124]:
# Keep only the rules with both high lift and high confidence

rules[(rules['lift'] >= 5) & (rules['confidence'] >= 0.5)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
20,(2_liquor),(1_bottled beer),0.003152,0.018099,0.002237,0.709677,39.211671,0.00218,3.382105
92,(2_sausage),(1_frankfurter),0.010066,0.058973,0.010066,1.0,16.956897,0.009472,inf
100,(3_pork),(1_frankfurter),0.00427,0.058973,0.002237,0.52381,8.882184,0.001985,1.976157
214,(2_ham),(1_sausage),0.006507,0.083884,0.004372,0.671875,8.009564,0.003826,2.791972
218,(2_meat),(1_sausage),0.007626,0.083884,0.004474,0.586667,6.993778,0.003834,2.216409
380,(2_herbs),(3_other vegetables),0.004881,0.042196,0.002542,0.520833,12.343122,0.002336,1.998895
417,(3_beef),(2_pork),0.004474,0.013726,0.002339,0.522727,38.08165,0.002277,2.066478
444,(3_pip fruit),(2_tropical fruit),0.014947,0.036096,0.007931,0.530612,14.700201,0.007391,2.053536
468,(3_curd),(2_whole milk),0.010168,0.066497,0.005186,0.51,7.669495,0.004509,1.905108
510,(3_onions),(4_other vegetables),0.007524,0.025826,0.003864,0.513514,19.883486,0.003669,2.002469


Filtering rules can help prioritize products for upselling and cross-selling, without getting lost in very long lists.
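
Sorting the filtered rules is one way to shorten the list further. The sketch below rebuilds a few rows of the filtered table by hand, so that it runs on its own, and ranks them by lift. The `rules_subset` frame is a hypothetical stand-in: in the real `rules` dataframe the antecedents and consequents are frozensets rather than plain strings.

```python
import pandas as pd

# A few rules copied from the filtered table (values as printed).
rules_subset = pd.DataFrame(
    {
        "antecedents": ["2_liquor", "2_sausage", "3_beef", "3_curd"],
        "consequents": ["1_bottled beer", "1_frankfurter", "2_pork", "2_whole milk"],
        "support": [0.002237, 0.010066, 0.002339, 0.005186],
        "confidence": [0.709677, 1.0, 0.522727, 0.51],
        "lift": [39.211671, 16.956897, 38.08165, 7.669495],
    }
)

# Same thresholds as in the notebook cell, then rank by lift and confidence.
strong = rules_subset[(rules_subset["lift"] >= 5) & (rules_subset["confidence"] >= 0.5)]
ranked = strong.sort_values(["lift", "confidence"], ascending=False).head(3)
print(ranked)
```

Which column to sort by depends on the goal: lift highlights the strongest associations, while support favors rules that affect more transactions.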