# ASSOCIATION RULE 

Association rule mining is a rule-based learning method used to discover frequent patterns and correlations in datasets such as transactional and relational data.

In essence, it computes co-occurrence statistics between items, expressed as an implication of the form X --> Y.

For instance, in market basket analysis, {diaper} --> {beer} means that if diapers are bought, beer tends to be put into the basket as well.

5 fundamental concepts in association rules:
- Support(X) (of an itemset, not a rule): the fraction of all transactions that contain X.
- Support(X --> Y): the probability that X and Y occur together in a transaction.
- Confidence(X --> Y): the probability that Y occurs given that X is present.
- Lift(X --> Y): the probability of Y being bought given that X is present, relative to the overall popularity of Y.
- Conviction(X --> Y): a measure of implication. A value well above 1 indicates that Y depends strongly on X; a value of 1 means X and Y are unrelated.
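
For reference, these measures can be written in terms of supp(.), the fraction of transactions that contain an itemset. These are the usual textbook definitions; the symbols N (number of transactions) and count(.) are notation introduced here for illustration.

$$
\begin{aligned}
\mathrm{supp}(X) &= \frac{\mathrm{count}(X)}{N} \\
\mathrm{supp}(X \to Y) &= \mathrm{supp}(X \cup Y) \\
\mathrm{conf}(X \to Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \to Y) &= \frac{\mathrm{conf}(X \to Y)}{\mathrm{supp}(Y)} \\
\mathrm{conv}(X \to Y) &= \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \to Y)}
\end{aligned}
$$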
    
So it is essentially probability and statistics: a simple but useful decision-making tool for a wide range of uses such as market basket analysis,
customer relationship management, recommender systems, marketing activities, network traffic analysis, intrusion detection (fraud and malware detection), and bioinformatics.


Mlxtend is a rich and useful library for machine learning. It provides methods for association rule mining, including the widely used Apriori algorithm.
You can install mlxtend via pip or conda.



In [41]:
pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


In [42]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

To use association rules, we first need some data in one-hot encoded format.
Imagine a grocery database in which each order ID is associated with some products.

In [43]:
data = {'ID': [1,2,3,4,5,6],
       'Onion':[1,0,0,1,1,1],
       'Potato':[1,1,0,1,1,1],
       'Burger':[1,1,0,0,1,1],
       'Milk':[0,1,1,1,0,1],
       'Beer':[0,0,1,0,1,0]}


In [44]:
df = pd.DataFrame(data)

In [45]:
df = df[['ID','Onion','Potato','Burger','Milk','Beer']]

In [46]:
df

Unnamed: 0,ID,Onion,Potato,Burger,Milk,Beer
0,1,1,1,1,0,0
1,2,0,1,1,1,0
2,3,0,0,0,1,1
3,4,1,1,0,1,0
4,5,1,1,1,0,1
5,6,1,1,1,1,0


# then we can generate frequent itemsets based on support

### Here we need to set the minimum support value within [0, 1]. Using min_support = 0.5 means we only want itemsets that co-occur in at least half of the transactions.

apriori(df,min_support = 0.5,use_colnames = False,max_len = None)

In [47]:
frequent_itemsets = apriori(df[['Onion','Potato','Burger','Milk','Beer']],min_support = 0.5,use_colnames = True)



In [48]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.666667,(Onion)
1,0.833333,(Potato)
2,0.666667,(Burger)
3,0.666667,(Milk)
4,0.666667,"(Onion, Potato)"
5,0.5,"(Burger, Onion)"
6,0.666667,"(Burger, Potato)"
7,0.5,"(Potato, Milk)"
8,0.5,"(Burger, Onion, Potato)"


- Itemsets with 1, 2, or 3 items are returned, each with support >= 0.5.
- The only itemset with 3 products is (Onion, Potato, Burger).
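
To see where these support values come from, here is a minimal brute-force sketch in plain Python. It is not how Apriori works internally (Apriori prunes candidates instead of enumerating every combination), and the function name `frequent_itemsets_bruteforce` is a hypothetical helper introduced only for this illustration.

```python
from itertools import combinations

# Transactions copied from the one-hot table above (one set of items per order ID).
transactions = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]


def frequent_itemsets_bruteforce(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Enumerates all item combinations, so it only suits tiny examples.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= min_support:
                result[frozenset(combo)] = count / n
    return result


itemsets = frequent_itemsets_bruteforce(transactions, min_support=0.5)
for itemset, support in itemsets.items():
    print(sorted(itemset), round(support, 6))
```

On this data the sketch yields the same nine itemsets as the table above, including those whose support is exactly 0.5.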

Next step: generate the rules with their corresponding support, confidence, and lift (and leverage and conviction):
        association_rules(df,metric = 'confidence',min_threshold =0.8)
        
- Here df means the frequent_itemsets dataframe.
- metric is the measure used to decide whether a rule is kept. You can set it to one of the rule metrics, such as 'support', 'confidence', 'lift', 'leverage', or 'conviction'.
- min_threshold is the minimum value required for the specified metric.



In [49]:
rules = association_rules(frequent_itemsets,metric = 'lift',min_threshold=1)

In [50]:
df

Unnamed: 0,ID,Onion,Potato,Burger,Milk,Beer
0,1,1,1,1,0,0
1,2,0,1,1,1,0
2,3,0,0,0,1,1
3,4,1,1,0,1,0
4,5,1,1,1,0,1
5,6,1,1,1,1,0


In [51]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
1,(Potato),(Onion),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
2,(Burger),(Onion),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
3,(Onion),(Burger),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
4,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
5,(Potato),(Burger),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
6,"(Burger, Onion)",(Potato),0.5,0.833333,0.5,1.0,1.2,0.083333,inf,0.333333
7,"(Burger, Potato)",(Onion),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
8,"(Onion, Potato)",(Burger),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
9,(Burger),"(Onion, Potato)",0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333


# interpreting the result 

We can see that quite a few rules have a lift value above 1, which means the items occur together more frequently than would be expected if they were independent.

Several are high in confidence as well. But domain knowledge will be useful in explaining the phenomenon.

In [52]:
rules[(rules['lift']>1.125) &(rules['confidence']>0.7)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
1,(Potato),(Onion),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
4,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
5,(Potato),(Burger),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
6,"(Burger, Onion)",(Potato),0.5,0.833333,0.5,1.0,1.2,0.083333,inf,0.333333


Subsetting on lift and confidence returns the rules whose itemsets are relatively strongly associated in this data.

We can see that:
- If onion or burger is in a user's basket, it is highly likely that the user will buy potato as well.
- If onion and burger are both in a user's basket, it is highly likely that the user will also buy potato.


1. Lift(X --> Y): the likelihood of Y being bought when X is present, relative to the overall popularity of Y.

- When lift = 1, X has no impact on Y (the two are independent).
- When lift > 1, X and Y occur together more often than expected under independence.
- When lift < 1, X and Y occur together less often than expected.


2. Conviction(X --> Y): a measure of implication that has value 1 if the items are unrelated.
    
A high conviction value means that the consequent depends strongly on the antecedent. Conviction is computed as (1 - support(Y)) / (1 - confidence(X --> Y)), so in the case of a perfect confidence score
the denominator becomes zero (1 - 1) and the conviction score is defined as 'inf'.
Similar to lift, if the items are independent the conviction is 1.

3. Leverage(X --> Y): the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.
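
As a sanity check, the metrics can be recomputed by hand for two rules from the first example. This is a plain-Python sketch using the standard definitions; `support` and `rule_metrics` are hypothetical helper names, not mlxtend functions.

```python
import math

# Same six orders as in the first example.
transactions = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(antecedent, consequent):
    """Compute confidence, lift, leverage and conviction for antecedent --> consequent."""
    supp_a = support(antecedent)
    supp_c = support(consequent)
    supp_ac = support(antecedent | consequent)
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    # Conviction divides by (1 - confidence); a perfect rule is reported as infinity.
    conviction = math.inf if confidence == 1 else (1 - supp_c) / (1 - confidence)
    return {
        "support": supp_ac,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


onion_potato = rule_metrics({"Onion"}, {"Potato"})
potato_onion = rule_metrics({"Potato"}, {"Onion"})
for name, metrics in [("Onion --> Potato", onion_potato), ("Potato --> Onion", potato_onion)]:
    print(name, {key: round(value, 6) for key, value in metrics.items()})
```

The values agree with rows 0 and 1 of the rules table: both directions share the same support, lift, and leverage, while confidence and conviction depend on the direction of the rule.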

# EXAMPLE 2

In [53]:
retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                          'Basket':[['Beer','Diaper','Pretzels','Chips','Aspirin'],
                                    ['Diaper','Beer','Chips','Lotion','Juice','Babyfood','Milk'],
                                    ['Soda','Chips','Milk'],
                                    ['Soup','Beer','Diaper','Milk','IceCream'],
                                    ['Soda','Coffee','Milk','Bread'],
                                    ['Beer','Chips']]}



In [54]:
retail = pd.DataFrame(retail_shopping_basket)
retail

Unnamed: 0,ID,Basket
0,1,"[Beer, Diaper, Pretzels, Chips, Aspirin]"
1,2,"[Diaper, Beer, Chips, Lotion, Juice, Babyfood, Milk]"
2,3,"[Soda, Chips, Milk]"
3,4,"[Soup, Beer, Diaper, Milk, IceCream]"
4,5,"[Soda, Coffee, Milk, Bread]"
5,6,"[Beer, Chips]"


In [55]:
retail = retail[['ID','Basket']]

In [56]:
pd.options.display.max_colwidth = 100

Suppose we have a table that maps each customer ID to a list of basket items.

In [57]:
retail

Unnamed: 0,ID,Basket
0,1,"[Beer, Diaper, Pretzels, Chips, Aspirin]"
1,2,"[Diaper, Beer, Chips, Lotion, Juice, Babyfood, Milk]"
2,3,"[Soda, Chips, Milk]"
3,4,"[Soup, Beer, Diaper, Milk, IceCream]"
4,5,"[Soda, Coffee, Milk, Bread]"
5,6,"[Beer, Chips]"


First, one-hot encode the basket. But how?

In [58]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
retail = pd.DataFrame(mlb.fit_transform(retail['Basket']),columns = mlb.classes_)

In [59]:
retail

Unnamed: 0,Aspirin,Babyfood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,0,0,1,0,1,0,0,0,0,0,0,0,0,0


Using scikit-learn's MultiLabelBinarizer, we can easily one-hot encode a column of item lists in a dataframe!
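
A pandas-only route to the same encoding is sketched below: join each list into a delimited string, then let `Series.str.get_dummies` split it into indicator columns. The variable names are illustrative. The final cast to bool is an assumption about what recent mlxtend releases prefer as `apriori` input; 0/1 integers are what this notebook passes.

```python
import pandas as pd

# Baskets copied from the retail table above.
baskets = pd.Series([
    ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
    ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'Babyfood', 'Milk'],
    ['Soda', 'Chips', 'Milk'],
    ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
    ['Soda', 'Coffee', 'Milk', 'Bread'],
    ['Beer', 'Chips'],
])

# Join each list with '|', then expand the delimited strings into one column per item.
encoded = baskets.str.join('|').str.get_dummies(sep='|')

# Optional: boolean columns instead of 0/1 integers (assumed to be preferred by newer mlxtend).
encoded_bool = encoded.astype(bool)

print(encoded.shape)
print(list(encoded.columns))
```

The resulting columns come out in alphabetical order, which matches the layout of the MultiLabelBinarizer output above.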

In [60]:
frequent_itemsets_2 = apriori(retail,min_support = 0.2,use_colnames = True)



In [61]:
frequent_itemsets_2

Unnamed: 0,support,itemsets
0,0.666667,(Beer)
1,0.666667,(Chips)
2,0.5,(Diaper)
3,0.666667,(Milk)
4,0.333333,(Soda)
5,0.5,"(Chips, Beer)"
6,0.5,"(Beer, Diaper)"
7,0.333333,"(Milk, Beer)"
8,0.333333,"(Chips, Diaper)"
9,0.333333,"(Chips, Milk)"


Judging by support alone, (Beer, Chips) and (Beer, Diaper) are the two most frequent baskets of interest, each with support 0.5.

But which pair is more strongly associated?

In [62]:
association_rules(frequent_itemsets_2,metric = 'lift',min_threshold = 1.5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Beer),(Diaper),0.666667,0.5,0.5,0.75,1.5,0.166667,2.0,1.0
1,(Diaper),(Beer),0.5,0.666667,0.5,1.0,1.5,0.166667,inf,0.666667
2,(Milk),(Soda),0.666667,0.333333,0.333333,0.5,1.5,0.111111,1.333333,1.0
3,(Soda),(Milk),0.333333,0.666667,0.333333,1.0,1.5,0.111111,inf,0.5
4,"(Chips, Diaper)",(Beer),0.333333,0.666667,0.333333,1.0,1.5,0.111111,inf,0.5
5,(Beer),"(Chips, Diaper)",0.666667,0.333333,0.333333,0.5,1.5,0.111111,1.333333,1.0
6,"(Milk, Beer)",(Diaper),0.333333,0.5,0.333333,1.0,2.0,0.166667,inf,0.75
7,"(Milk, Diaper)",(Beer),0.333333,0.666667,0.333333,1.0,1.5,0.111111,inf,0.5
8,(Beer),"(Milk, Diaper)",0.666667,0.333333,0.333333,0.5,1.5,0.111111,1.333333,1.0
9,(Diaper),"(Milk, Beer)",0.5,0.333333,0.333333,0.666667,2.0,0.166667,2.0,1.0


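Clearly, of the two pairs, (Beer, Diaper) is the more strongly associated: it clears the lift threshold of 1.5, while (Beer, Chips) does not appear among the rules at all.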
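
The comparison between the two pairs can be reproduced from the support column alone. The sketch below uses exact fractions; the basket counts are read off the frequent itemsets table (support = count / 6), and `pair_lift` is a hypothetical helper name.

```python
from fractions import Fraction

n = 6  # number of baskets in the retail example

# Basket counts read off the support column above (support = count / 6).
count = {
    ("Beer",): 4,
    ("Chips",): 4,
    ("Diaper",): 3,
    ("Beer", "Chips"): 3,
    ("Beer", "Diaper"): 3,
}


def pair_lift(a, b):
    """lift = supp(a, b) / (supp(a) * supp(b)), computed with exact fractions."""
    def supp(key):
        return Fraction(count[key], n)
    return supp((a, b)) / (supp((a,)) * supp((b,)))


print(float(pair_lift("Beer", "Chips")))   # 1.125
print(float(pair_lift("Beer", "Diaper")))  # 1.5
```

Both pairs have the same support, but Chips is more popular on its own than Diaper, so co-occurring with Beer is less surprising for Chips and its lift is lower.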

# EXAMPLE 3 : MOVIE GENRE ASSOCIATION

Playing only with basket analysis and imaginary datasets gets a bit boring, so let's use real data.

The dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service.
It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included.
Each user is represented by an id, and no other information is provided.



In [63]:
movies = pd.read_csv("movies.csv")

In [64]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9120,162672,Mohenjo Daro (2016),Adventure|Drama|Romance
9121,163056,Shin Godzilla (2016),Action|Adventure|Fantasy|Sci-Fi
9122,163949,The Beatles: Eight Days a Week - The Touring Years (2016),Documentary
9123,164977,The Gay Desperado (1936),Comedy


In [66]:
movies_one = movies.drop(columns='genres').join(movies.genres.str.get_dummies())


In [67]:
movies_one.head()

Unnamed: 0,movieId,title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [68]:
movies_one.shape

(9125, 22)

let's get back to analyzing the genre associations

In [69]:
movies_one.set_index(['movieId','title'],inplace = True)

movies_one

movieId,title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162672,Mohenjo Daro (2016),0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
163056,Shin Godzilla (2016),0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
163949,The Beatles: Eight Days a Week - The Touring Years (2016),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
164977,The Gay Desperado (1936),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [70]:
movies_one.Children.value_counts(normalize = True)

0    0.93611
1    0.06389
Name: Children, dtype: float64

In [71]:
movies_one.shape

(9125, 20)

In [72]:
frequent_itemsets_movies = apriori(movies_one,use_colnames = True,min_support = 0.05)





In [73]:
frequent_itemsets_movies

Unnamed: 0,support,itemsets
0,0.169315,(Action)
1,0.122411,(Adventure)
2,0.06389,(Children)
3,0.363288,(Comedy)
4,0.120548,(Crime)
5,0.054247,(Documentary)
6,0.478356,(Drama)
7,0.071671,(Fantasy)
8,0.09611,(Horror)
9,0.059507,(Mystery)


In [74]:
rules_movies = association_rules(frequent_itemsets_movies, metric = 'lift', min_threshold = 1)

In [75]:
rules_movies.sort_values('lift',ascending  = False).head(20)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Adventure),(Action),0.122411,0.169315,0.058301,0.476276,2.812955,0.037575,1.586111,0.734401
1,(Action),(Adventure),0.169315,0.122411,0.058301,0.344337,2.812955,0.037575,1.338475,0.775868
8,(Crime),(Thriller),0.120548,0.189479,0.057863,0.48,2.533256,0.035022,1.558693,0.688214
9,(Thriller),(Crime),0.189479,0.120548,0.057863,0.305379,2.533256,0.035022,1.266089,0.746744
2,(Action),(Thriller),0.169315,0.189479,0.062904,0.371521,1.960746,0.030822,1.289654,0.589863
3,(Thriller),(Action),0.189479,0.169315,0.062904,0.331984,1.960746,0.030822,1.24351,0.604537
5,(Comedy),(Romance),0.363288,0.169315,0.090082,0.247964,1.464511,0.028572,1.104581,0.49815
4,(Romance),(Comedy),0.169315,0.363288,0.090082,0.532039,1.464511,0.028572,1.360609,0.381827
10,(Romance),(Drama),0.169315,0.478356,0.10126,0.598058,1.250236,0.020267,1.29781,0.240947
11,(Drama),(Romance),0.478356,0.169315,0.10126,0.211684,1.250236,0.020267,1.053746,0.383693


#### As we can see, in this dataset the support and hence confidence values are fairly small, which makes it difficult to interpret the result based on these two values alone.


Lift and conviction, by contrast, remain intuitive and representative. That is why we should understand the meaning of all 5 metrics to interpret the result accurately.

In [76]:
rules_movies[rules_movies.lift>2]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Adventure),(Action),0.122411,0.169315,0.058301,0.476276,2.812955,0.037575,1.586111,0.734401
1,(Action),(Adventure),0.169315,0.122411,0.058301,0.344337,2.812955,0.037575,1.338475,0.775868
8,(Crime),(Thriller),0.120548,0.189479,0.057863,0.48,2.533256,0.035022,1.558693,0.688214
9,(Thriller),(Crime),0.189479,0.120548,0.057863,0.305379,2.533256,0.035022,1.266089,0.746744


We might have expected the {Romance, Drama} pair to stand out, but with a lift of about 1.25 it is not as strongly associated as pairs such as {Adventure, Action} and {Crime, Thriller}, which have clearly higher lift and generally higher conviction.

In [77]:
rules_movies[rules_movies.lift>2].sort_values(by = ['lift','confidence'],ascending=[False,False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Adventure),(Action),0.122411,0.169315,0.058301,0.476276,2.812955,0.037575,1.586111,0.734401
1,(Action),(Adventure),0.169315,0.122411,0.058301,0.344337,2.812955,0.037575,1.338475,0.775868
8,(Crime),(Thriller),0.120548,0.189479,0.057863,0.48,2.533256,0.035022,1.558693,0.688214
9,(Thriller),(Crime),0.189479,0.120548,0.057863,0.305379,2.533256,0.035022,1.266089,0.746744


Subsetting on lift and ordering by lift and confidence shows:
- The strongest association: {Adventure, Action}, which holds in both directions with a lift of about 2.81.

- {Crime, Thriller}, with a lift of about 2.53.

How should we interpret these pairs?
 
The best way is to go back to your movies table and check it out!
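
Checks like this can be wrapped in a small helper. The sketch below is illustrative only: `filter_genres` and the five-row `sample` frame (titles and genres copied from the tables in this notebook) are hypothetical names, not part of pandas or the MovieLens tooling. It compares whole genre tags rather than substrings, which avoids accidental partial matches.

```python
import pandas as pd

# A few rows copied from the movies table, standing in for the full dataset.
sample = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Heat (1995)",
        "Sudden Death (1995)",
        "Shin Godzilla (2016)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
        "Action",
        "Action|Adventure|Fantasy|Sci-Fi",
    ],
})


def filter_genres(df, include, exclude=()):
    """Keep rows whose genre set contains every `include` tag and no `exclude` tag."""
    genre_sets = df["genres"].str.split("|").apply(set)
    mask = genre_sets.apply(lambda g: set(include) <= g and not (set(exclude) & g))
    return df[mask]


action_only = filter_genres(sample, include=["Action"], exclude=["Adventure"])
print(action_only["title"].tolist())  # ['Heat (1995)', 'Sudden Death (1995)']
```

The same call with `include=["Adventure", "Children"]` and `exclude=["Animation"]` would keep Jumanji but not Toy Story in this sample.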



In [78]:
pd.options.display.max_rows = 50

Here we look for movies tagged Action but not Adventure:

In [79]:
movies[ (movies.genres.str.contains('Action')) & (~movies.genres.str.contains('Adventure'))]

Unnamed: 0,movieId,title,genres
5,6,Heat (1995),Action|Crime|Thriller
8,9,Sudden Death (1995),Action
19,20,Money Train (1995),Action|Comedy|Crime|Drama|Thriller
22,23,Assassins (1995),Action|Crime|Thriller
40,42,Dead Presidents (1995),Action|Crime|Drama
...,...,...,...
9093,159093,Now You See Me 2 (2016),Action|Comedy|Thriller
9099,160080,Ghostbusters (2016),Action|Comedy|Horror|Sci-Fi
9100,160271,Central Intelligence (2016),Action|Comedy
9101,160438,Jason Bourne (2016),Action


So, what are these movies? I barely know any of them... (which proves again that domain knowledge is of utmost importance in data science).

Tomorrowland is an example of a {Children, Adventure} pairing.

Now You See Me 2, listed above as Action|Comedy|Thriller, is an Action title without the Adventure tag.