# Association Rule Mining 
- Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases.
- The **Apriori algorithm** is one widely used way to implement association rule mining over structured data.

Ways to measure association:
1. **Support**: relative frequency of an item (or itemset) in the dataset <br>
$support(A) = P(A) = \frac{\text{transactions containing } A}{\text{total transactions}}$<br>
<br> $support(A \rightarrow C) = P(A \cap C) = \frac{\text{transactions containing } A \text{ and } C}{\text{total transactions}}$ <br><br>
2. **Confidence**: the conditional probability of the consequent occurring given the antecedent, $IF \rightarrow THEN$, <br> $confidence(A \rightarrow C) = \frac{P(A \cap C)}{P(A)} = \frac{support(A \rightarrow C)}{support(A)}$, <br>
where $A$ is the antecedent and $C$ is the consequent<br><br>
3. **Lift**: measures how much more often the antecedent and consequent occur together than would be expected if they were independent; think of it as the *lift* that $A$ provides to our confidence of having $C$ in the cart.
 <br> $lift(A \rightarrow C) = \frac{P(A \cap C) / P(A)}{P(C)} = \frac{confidence(A \rightarrow C)}{support(C)}$

E.g.: is the purchase of Bread associated with the purchase of Eggs?
- $A$: bread purchases = 500
- $C$: eggs purchases = 350
- $(A \rightarrow C)$ bread and eggs purchased together = 150 
- total purchases = 5000

- $support(bread) = \frac{bread-purchases}{total-purchases} = \frac{500}{5000}=0.1$
- $confidence(bread \rightarrow eggs) = \frac{150/5000}{500/5000}=0.3$
- $lift(bread \rightarrow eggs)= \frac{confidence(bread \rightarrow eggs)}{support(eggs)} = \frac{0.3}{350/5000} = \frac{0.3}{0.07} \approx 4.29$
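
The arithmetic of the bread and eggs example can be checked with a few lines of plain Python. This is only a sketch of the three formulas applied to the counts listed above; the variable names are made up for illustration.

```python
# Counts taken from the bread/eggs example
total = 5000   # total purchases (transactions)
bread = 500    # transactions containing bread (A)
eggs = 350     # transactions containing eggs (C)
both = 150     # transactions containing bread and eggs

# Support: relative frequency
support_bread = bread / total
support_eggs = eggs / total
support_rule = both / total

# Confidence: P(A and C) / P(A)
confidence = support_rule / support_bread

# Lift: confidence divided by the support of the consequent
lift = confidence / support_eggs

print(f"support(bread)            = {support_bread:.2f}")
print(f"confidence(bread -> eggs) = {confidence:.2f}")
print(f"lift(bread -> eggs)       = {lift:.2f}")
```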

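Lift Score (LS)
- LS > 1: A and C occur together more often than expected under independence (positive association); if bread is purchased, eggs become more likely to be purchased. The further above 1, the stronger the association.
- LS < 1: A and C occur together less often than expected (negative association); if A is purchased, C becomes less likely to be purchased.
- LS ≈ 1: no association between A and C, because independence means $P(A \cap C) = P(A)P(C)$, which makes the ratio equal to 1.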

In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Data Format
- For association rule mining, the data needs to be in a one-hot encoded (**sparse**) format: one row per transaction, one column per item.

In [3]:
address = 'data/groceries.csv'
data = pd.read_csv(address)

In [4]:
data.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9
0,citrus fruit,semi-finished bread,margarine,ready soups,,,,,
1,tropical fruit,yogurt,coffee,,,,,,
2,whole milk,,,,,,,,
3,pip fruit,yogurt,cream cheese,meat spreads,,,,,
4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,


## Data Conversion
- `pd.get_dummies` converts categorical variables into dummy/indicator variables.
- This is similar to one-hot encoding.
- Every item in the dataset, prefixed with the column it came from, now has its own column, which is either <br>
1 (bought) or 0 (not bought) for every row (= observation).

In [4]:
basket_sets = pd.get_dummies(data)
basket_sets.head()

Unnamed: 0,1_Instant food products,1_UHT-milk,1_artif. sweetener,1_baby cosmetics,1_bags,1_baking powder,1_bathroom cleaner,1_beef,1_berries,1_beverages,...,9_sweet spreads,9_tea,9_vinegar,9_waffles,9_whipped/sour cream,9_white bread,9_white wine,9_whole milk,9_yogurt,9_zwieback
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
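
The column names above (`1_beef`, `9_yogurt`, ...) show that `get_dummies` prefixes every item with the column it came from, so the same product in a different basket position ends up as a separate item. If position should not matter, one possible alternative is to build one boolean column per distinct item. The sketch below uses a small toy frame standing in for the first rows of the grocery data; `basket_by_item` is a hypothetical name, and applying this to the full dataset would change the supports and rules shown in the rest of the notebook.

```python
import pandas as pd

# Toy stand-in for the grocery data: one row per transaction,
# one column per basket position, None where the basket has fewer items.
data = pd.DataFrame(
    [
        ["citrus fruit", "semi-finished bread", "margarine", "ready soups"],
        ["tropical fruit", "yogurt", "coffee", None],
        ["whole milk", None, None, None],
        ["pip fruit", "yogurt", "cream cheese", "meat spreads"],
    ],
    columns=["1", "2", "3", "4"],
)

# Collect the items of each row, ignoring empty positions.
transactions = [row.dropna().tolist() for _, row in data.iterrows()]

# One boolean column per distinct item, regardless of basket position.
items = sorted({item for basket in transactions for item in basket})
basket_by_item = pd.DataFrame(
    [[item in basket for item in items] for basket in transactions],
    columns=items,
)

print(basket_by_item["yogurt"].tolist())
```

A boolean frame of this shape (rows are transactions, columns are items) is the kind of input `apriori` accepts.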


## Support Calculation
- Support tells us how popular an item is (its relative frequency).
- Show the itemsets that were bought in at least 2% of all transactions.
- These are mostly single popular items; only one combination of items appears (35).

In [6]:
# use_colnames=True returns item names instead of column indices
apriori(basket_sets, min_support=0.02, use_colnames=True)



Unnamed: 0,support,itemsets
0,0.030421,(1_beef)
1,0.034951,(1_canned beer)
2,0.029126,(1_chicken)
3,0.049191,(1_citrus fruit)
4,0.064401,(1_frankfurter)
5,0.04466,(1_other vegetables)
6,0.024272,(1_pip fruit)
7,0.040453,(1_pork)
8,0.038835,(1_rolls/buns)
9,0.033981,(1_root vegetables)
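
To illustrate what the `min_support` threshold does, here is a minimal pure-Python sketch of the level-wise Apriori idea on a handful of made-up transactions. It is not mlxtend's implementation, and the data, the threshold, and all names are assumptions chosen for the example.

```python
from itertools import combinations

# Hypothetical toy transactions; each basket is a set of items.
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "milk"},
    {"eggs", "milk"},
    {"bread", "eggs"},
    {"milk"},
]
min_support = 0.4
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / n


# Level 1: keep the single items that meet the threshold.
items = sorted({item for basket in transactions for item in basket})
frequent = {}
for item in items:
    s = support(frozenset([item]))
    if s >= min_support:
        frequent[frozenset([item])] = s

# Level k: candidates are built only from itemsets that survived level k-1,
# because every subset of a frequent itemset must itself be frequent.
current = list(frequent)
k = 2
while current:
    candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
    # Prune candidates that have an infrequent (k-1)-subset.
    candidates = {
        c for c in candidates
        if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
    }
    survivors = {c: support(c) for c in candidates}
    survivors = {c: s for c, s in survivors.items() if s >= min_support}
    frequent.update(survivors)
    current = list(survivors)
    k += 1

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With this toy data all three single items and all three pairs reach the threshold, while the triple does not, which mirrors why a high `min_support` mostly returns single items.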


In [9]:
df = basket_sets

# lower min_support so that item combinations also pass the threshold
frequent_itemsets = apriori(basket_sets, min_support=0.002, use_colnames=True)

# number of items in each itemset
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets



Unnamed: 0,support,itemsets,length
0,0.006472,(1_UHT-milk),1
1,0.030421,(1_beef),1
2,0.011974,(1_berries),1
3,0.008414,(1_beverages),1
4,0.014887,(1_bottled beer),1
...,...,...,...
844,0.002265,"(5_other vegetables, 6_whole milk, 3_pip fruit)",3
845,0.002589,"(5_whole milk, 3_root vegetables, 4_other vege...",3
846,0.002913,"(4_curd, 3_whole milk, 5_yogurt)",3
847,0.003236,"(5_other vegetables, 6_whole milk, 4_root vege...",3


In [10]:
# only itemsets with 3 or more items
frequent_itemsets[frequent_itemsets['length'] >= 3]

Unnamed: 0,support,itemsets,length
820,0.002589,"(1_beef, 3_other vegetables, 2_root vegetables)",3
821,0.002589,"(2_other vegetables, 3_whole milk, 1_chicken)",3
822,0.002589,"(2_other vegetables, 3_whole milk, 1_citrus fr...",3
823,0.003236,"(2_tropical fruit, 1_citrus fruit, 3_pip fruit)",3
824,0.002589,"(4_whole milk, 3_other vegetables, 1_citrus fr...",3
825,0.002265,"(5_other vegetables, 6_whole milk, 1_frankfurter)",3
826,0.002265,"(4_whole milk, 1_pork, 3_other vegetables)",3
827,0.00356,"(1_root vegetables, 2_other vegetables, 3_whol...",3
828,0.002589,"(2_rolls/buns, 3_soda, 1_sausage)",3
829,0.002265,"(4_whole milk, 1_sausage, 3_other vegetables)",3


## Association Rules

### Confidence
- Probability that one item shows up in a basket given that another one does, $IF \rightarrow THEN$. <br> 
- Technically, confidence is the conditional probability of the consequent occurring given the antecedent.
- $confidence(A \rightarrow C) =  \frac{P(A \cap C)}{P(A)}$, <br>
where $A$ is the antecedent and $C$ is the consequent<br><br>

In [11]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(2_sausage),(1_frankfurter),0.011327,0.064401,0.011327,1.0,15.527638,0.010597,inf
1,(7_pastry),(1_frankfurter),0.005178,0.064401,0.002589,0.5,7.763819,0.002256,1.871197
2,(2_ham),(1_sausage),0.00712,0.076052,0.004531,0.636364,8.367505,0.003989,2.540858
3,(2_meat),(1_sausage),0.006796,0.076052,0.004854,0.714286,9.392097,0.004338,3.233819
4,(3_beef),(1_sausage),0.004854,0.076052,0.002589,0.533333,7.012766,0.00222,1.979889
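
The table also contains `leverage` and `conviction` columns, which are not defined in the text. Assuming the commonly used definitions, leverage $= support(A \rightarrow C) - support(A) \cdot support(C)$ and conviction $= \frac{1 - support(C)}{1 - confidence(A \rightarrow C)}$, the following sketch recomputes the metrics of row 2, `(2_ham) -> (1_sausage)`, from its three support values. The helper `rule_metrics` is a hypothetical name, and small differences from the table come from the rounded inputs.

```python
import math


def rule_metrics(support_a, support_c, support_ac):
    """Derive rule metrics from antecedent, consequent and joint support."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    # Leverage: observed joint support minus the support expected under independence
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence == 1)
    if confidence >= 1:
        conviction = math.inf
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Row 2 of the table: antecedent support, consequent support, support
ham_sausage = rule_metrics(0.00712, 0.076052, 0.004531)
for name, value in ham_sausage.items():
    print(f"{name:<10} {value:.4f}")
```

Row 0 has a confidence of 1.0, which is why its conviction is reported as `inf`.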


### Lift
- Lift controls for the support (frequency) of the consequent when calculating the conditional probability of $C$ given $A$. 
- Think of it as the *lift* that $A$ provides to our confidence of having $C$ in the cart. 
- It measures how much more often the antecedent and consequent occur together than would be expected if they were independent.  <br> $lift(A \rightarrow C) = \frac{P(A \cap C) / P(A) }{P(C)}$
- Looking at confidence, beef ($C$) is slightly more likely to be bought when citrus fruit ($A$) was bought (0.191) <br> than the other way around (0.181).

In [12]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(2_citrus fruit),(1_beef),0.028803,0.030421,0.005502,0.191011,6.278986,0.004625,1.198508
1,(1_beef),(2_citrus fruit),0.030421,0.028803,0.005502,0.180851,6.278986,0.004625,1.185618
2,(2_other vegetables),(1_beef),0.0589,0.030421,0.003236,0.054945,1.806173,0.001444,1.02595
3,(1_beef),(2_other vegetables),0.030421,0.0589,0.003236,0.106383,1.806173,0.001444,1.053136
4,(1_beef),(2_root vegetables),0.030421,0.036893,0.005502,0.180851,4.902016,0.004379,1.175741
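
Rows 0 and 1 above share the same lift (6.278986) even though their confidences differ. This follows from the definition: lift is symmetric in $A$ and $C$, while confidence is not. A short derivation, using only the formulas from this section:

$$
lift(A \rightarrow C) = \frac{P(A \cap C) / P(A)}{P(C)} = \frac{P(A \cap C)}{P(A)\,P(C)} = \frac{P(A \cap C) / P(C)}{P(A)} = lift(C \rightarrow A)
$$

So lift indicates how strongly two itemsets are associated, and confidence is what tells the two rule directions apart.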


### Lift and Confidence

In [17]:
# high lift & high confidence
rules[(rules['lift'] >= 50) & (rules['confidence']>= 0.8)][['antecedents', 'consequents',  'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,confidence,lift
778,(7_whole milk),(6_other vegetables),0.923077,129.65035
781,(7_brown bread),(6_rolls/buns),0.818182,50.563636
795,(9_whipped/sour cream),(8_yogurt),1.0,206.0
827,"(5_other vegetables, 1_frankfurter)",(6_whole milk),0.875,93.232759
942,"(6_whole milk, 3_pip fruit)",(5_other vegetables),1.0,79.230769
960,"(6_whole milk, 4_root vegetables)",(5_other vegetables),0.833333,66.025641
965,"(5_other vegetables, 7_butter)",(6_whole milk),0.875,93.232759
