# An Introduction to Association Rules in Python

##### Association rule learning is a rule-based method for discovering frequent patterns and correlations in datasets such as transactional and relational data.

##### In essence, it computes co-occurrence statistics between items and expresses them as implications of the form X → Y.

##### For instance, in market basket analysis, {diaper} → {beer} means that if diapers are bought, beer tends to be put into the basket as well.

#### 4 fundamental concepts in association rules:

* *(Not a Rule)* Support(X): the fraction of all instances in which X occurs.

* Support(X→Y) is the probability that both items occur together, measured over all data.

* Confidence(X→Y) is the probability that Y occurs given that X is present.

* Lift(X→Y) is the probability of Y being bought given that X is present, taking into account the popularity of Y as well.

* Conviction(X→Y) is a measure of implication. A value greater than 1 indicates that Y depends on X; the higher the value, the stronger the dependence.

So basically it is probability and statistics: a simple but useful decision-making tool for a wide range of uses such as market basket analysis, customer relationship management, recommender systems, marketing activities, network traffic analysis, intrusion detection (fraud & malware detection) and bioinformatics.
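To make these definitions concrete, here is a minimal sketch in plain Python that computes support, confidence and lift for {diaper} → {beer}. The five baskets and the helper names (`support`, `confidence`, `lift`) are made up for illustration and are not part of any library.

```python
# Hypothetical toy baskets, used only to illustrate the definitions above.
baskets = [
    {"diaper", "beer", "chips"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(x, y, baskets):
    """Estimate of P(Y | X): support of X and Y together divided by support of X."""
    return support(set(x) | set(y), baskets) / support(x, baskets)

def lift(x, y, baskets):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(x, y, baskets) / support(y, baskets)

x, y = {"diaper"}, {"beer"}
print("support   ", round(support(x | y, baskets), 3))
print("confidence", round(confidence(x, y, baskets), 3))
print("lift      ", round(lift(x, y, baskets), 3))
```

Here two of the five baskets contain both items, so the support is 2/5; two of the three diaper baskets also contain beer, so the confidence is 2/3.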


# Example 1

### Before getting into the formulas and terminology, let's begin with a simple example.

Mlxtend is a rich and useful library for machine learning. It provides functions for association rule mining, including the major algorithm *apriori*.

You can install mlxtend via pip or conda.

In [13]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

To use association rules, we first need some data in one-hot encoded format.

Imagine a grocery database in which each order ID comes with some products...

In [14]:
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Onion': [1, 0, 0, 1, 1, 1],
    'Potato': [1, 1, 0, 1, 1, 1],
    'Burger': [1, 1, 0, 0, 1, 1],
    'Milk': [0, 1, 1, 1, 0, 1],
    'Beer': [0, 0, 1, 0, 1, 0]
}

In [15]:
df = pd.DataFrame(data)

In [16]:
df = df[['ID', 'Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]]

In [None]:
df

### Then, we can generate frequent itemsets based on *support*.

Here we need to set the minimum support to a value in [0, 1]. Setting `min_support=0.5` means we only keep itemsets that occur in at least half of the transactions.

`apriori(df, min_support=0.5, use_colnames=False, max_len=None)`

In [18]:
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]], min_support=0.50, use_colnames=True)

In [None]:
frequent_itemsets

Itemsets with 1, 2 or 3 items are returned, each with support ≥ 0.5.

The only itemset with 3 products is [Onion, Potato, Burger].
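As a sanity check on the result above, the following sketch recomputes the frequent itemsets by brute force in plain Python. It enumerates every combination of products instead of using the pruning that Apriori relies on, so it only suits tiny tables like this one. It also assumes that the threshold is inclusive (support ≥ `min_support`), which is what the returned itemsets suggest.

```python
from itertools import combinations

# Same one-hot table as in the example above, one list of flags per product.
data = {
    'Onion':  [1, 0, 0, 1, 1, 1],
    'Potato': [1, 1, 0, 1, 1, 1],
    'Burger': [1, 1, 0, 0, 1, 1],
    'Milk':   [0, 1, 1, 1, 0, 1],
    'Beer':   [0, 0, 1, 0, 1, 0],
}
n_orders = 6
min_support = 0.5

frequent = {}
for size in range(1, len(data) + 1):
    for items in combinations(data, size):
        # An order contains the itemset if every product flag in that row is 1.
        hits = sum(all(data[item][i] for item in items) for i in range(n_orders))
        if hits / n_orders >= min_support:
            frequent[items] = hits / n_orders

for items, supp in frequent.items():
    print(items, round(supp, 3))
```

Nine itemsets pass the threshold, and `('Onion', 'Potato', 'Burger')` is the only one with three products; it appears in orders 1, 5 and 6, so its support is exactly 0.5.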

### Final Step: generate the rules with their corresponding support, confidence and lift, (and leverage & conviction):

```association_rules(df, metric='confidence', min_threshold=0.8)```

* Here, `df` means the frequent_itemsets dataframe;

* `metric` is the measure used to decide whether a rule is kept. You can set it to one of the five metrics listed above.

* `min_threshold` is the minimum value that the specified metric must reach.

In [20]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

In [None]:
rules

### Interpreting the result:

We can see that there are quite a few rules with a high lift value, which means that the items occur together more frequently than would be expected from how often each of them occurs on its own.

Several are high in confidence as well. Domain knowledge is still needed to explain why these associations occur.

In [None]:
rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.8)]

Subsetting on the lift and confidence values returns the itemsets that are relatively highly correlated in this data.

We can see that:

* **If Onion or Burger is in a user's basket, it is highly likely that the user will buy Potato as well.**
* **If Burger and Onion are in a user's basket, it is highly likely that the user will also buy Potato.**

### Some notes on Lift, Conviction & Leverage:


1.  Lift(X→Y) : the likelihood of Y being bought when X is present, taking into account the popularity of Y as well.
    > When Lift=1,  X makes no impact on Y  
    > When Lift>1, there is a relationship between X & Y
2.  Conviction(X→Y): Conviction is a measure of the implication and has value 1 if items are unrelated.
    > A high conviction value means that the consequent depends strongly on the antecedent. For instance, with a perfect confidence score of 1, the denominator of the conviction formula (1 - confidence) becomes 0, and the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.
3.  Leverage(X→Y): the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.
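
For reference, the three metrics can be written in terms of support and confidence as follows. These are the standard textbook definitions, with support(X ∪ Y) denoting the fraction of transactions that contain both X and Y. The worked numbers use the Onion → Potato rule from Example 1, where support(Onion) = 4/6, support(Potato) = 5/6, support(Onion ∪ Potato) = 4/6 and therefore confidence = 1.

```latex
\begin{align*}
\mathrm{lift}(X \to Y)
  &= \frac{\mathrm{confidence}(X \to Y)}{\mathrm{support}(Y)}
   = \frac{1}{5/6} = 1.2 \\[4pt]
\mathrm{leverage}(X \to Y)
  &= \mathrm{support}(X \cup Y) - \mathrm{support}(X)\,\mathrm{support}(Y)
   = \frac{4}{6} - \frac{4}{6}\cdot\frac{5}{6} = \frac{1}{9} \approx 0.111 \\[4pt]
\mathrm{conviction}(X \to Y)
  &= \frac{1 - \mathrm{support}(Y)}{1 - \mathrm{confidence}(X \to Y)}
   = \frac{1 - 5/6}{1 - 1} \;\to\; \infty
\end{align*}
```

The last line is the division by zero mentioned in the note on conviction: every order that contains Onion also contains Potato, so the rule is never violated in this data.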

# Example 2

In [11]:
retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                         'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                   ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                   ['Soda', 'Chips', 'Milk'],
                                   ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                   ['Soda', 'Coffee', 'Milk', 'Bread'],
                                   ['Beer', 'Chips']
                                  ]
                         }

In [12]:
retail = pd.DataFrame(retail_shopping_basket)

In [13]:
retail = retail[['ID', 'Basket']]

In [14]:
pd.options.display.max_colwidth=100

Suppose we have a table that maps each customer ID to a list of basket items:

In [None]:
retail

First one-hot encode the basket, but how?

In [16]:
#from sklearn.preprocessing import MultiLabelBinarizer
#mlb = MultiLabelBinarizer()
#pd.DataFrame(mlb.fit_transform(retail.Basket), columns=mlb.classes_)

In [None]:
retail = retail.drop(columns='Basket').join(retail.Basket.str.join(',').str.get_dummies(','))

In [None]:
retail

Making use of `Series.str.get_dummies`, we can easily encode lists of items in a dataframe's column!
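To see what this encoding produces without relying on pandas, here is a rough equivalent in plain Python. It is a sketch of the same idea rather than what `get_dummies` does internally, and the alphabetical column order is a choice made here for readability.

```python
# The same six baskets as in the `retail` dataframe above.
baskets = [
    ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
    ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
    ['Soda', 'Chips', 'Milk'],
    ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
    ['Soda', 'Coffee', 'Milk', 'Bread'],
    ['Beer', 'Chips'],
]

# One column per distinct item, sorted alphabetically.
items = sorted({item for basket in baskets for item in basket})

# One row per basket, with a 0/1 flag for each item.
encoded = [[int(item in basket) for item in items] for basket in baskets]

print(items)
for row in encoded:
    print(row)
```

One caveat of the `str.join(',')` trick used above: it assumes that no item name contains a comma, otherwise that item would be split into two columns.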

In [None]:
frequent_itemsets_2 = apriori(retail.drop(columns='ID'), use_colnames=True)

In [None]:
frequent_itemsets_2

Judging by support alone, [Beer, Chips] and [Beer, Diaper] are the two frequent baskets of interest.

But which one is more correlated than the other?

In [None]:
association_rules(frequent_itemsets_2, metric='lift')

In [None]:
association_rules(frequent_itemsets_2)

What differences do you notice between the two results? *(Tip: what are the default parameters?)*

Clearly, {Diaper, Beer} is the most associated itemset in this data!
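The numbers behind this conclusion can be checked by hand. The sketch below recomputes confidence and lift for the four candidate rules with exact fractions; the helper names are illustrative only. It assumes that `association_rules` defaults to `metric='confidence'` and `min_threshold=0.8`, as documented for mlxtend at the time of writing.

```python
from fractions import Fraction

# The six baskets from Example 2, as sets.
baskets = [
    {'Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'},
    {'Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'},
    {'Soda', 'Chips', 'Milk'},
    {'Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'},
    {'Soda', 'Coffee', 'Milk', 'Bread'},
    {'Beer', 'Chips'},
]

def support(items):
    """Exact fraction of baskets containing every item in `items`."""
    items = set(items)
    return Fraction(sum(items <= basket for basket in baskets), len(baskets))

def confidence(x, y):
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    return confidence(x, y) / support(y)

candidate_rules = [
    (('Beer',), ('Diaper',)),
    (('Diaper',), ('Beer',)),
    (('Beer',), ('Chips',)),
    (('Chips',), ('Beer',)),
]
for x, y in candidate_rules:
    print(x, '->', y, 'confidence', confidence(x, y), 'lift', lift(x, y))
```

Diaper → Beer has confidence 1 and lift 3/2, while the other three rules have confidence 3/4. Under the assumed default threshold of 0.8 on confidence, only Diaper → Beer survives, whereas a threshold of 0.8 on lift keeps all four rules.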

# Example 3 - Movie Genre Associations

It seems a bit boring playing only with basket analysis and imaginary datasets.

In this example, let's play with an open dataset [MovieLens (small)](https://grouplens.org/datasets/movielens/).

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

We might want to take a look at the data and compute some summary statistics first:

In [23]:
movies = pd.read_csv('ml-latest-small/movies.csv')

In [None]:
movies.head(10)

In [None]:
movies_ohe = movies.drop(columns='genres').join(movies.genres.str.get_dummies())

In [26]:
pd.options.display.max_columns=100

In [None]:
movies_ohe.head()

In [None]:
stat1 = movies_ohe.drop(columns=['title', 'movieId']).apply(pd.Series.value_counts)

In [None]:
stat1 = stat1.transpose().drop(columns=0).sort_values(by=1, ascending=False).rename(columns={1:'No. of movies'})

In [30]:
stat2 = movies.join(movies.genres.str.split('|').reset_index().genres.str.len(), rsuffix='r').rename(columns={'genresr':'genre_count'})

In [None]:
stat2 = stat2[stat2['genre_count'] == 1].groupby('genres')[['genre_count']].sum().sort_values(by='genre_count', ascending=False)

In [32]:
stat = stat1.merge(stat2, how='left', left_index=True, right_index=True).fillna(0)

In [33]:
stat.genre_count=stat.genre_count.astype(int)
stat.rename(columns={'genre_count': 'No. of movies with only 1 genre'},inplace=True)

In [None]:
stat

Voilà! After some dizzying pandas work, we get the number of movies in each genre and the number of movies that have only 1 genre.
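If the chained pandas calls are hard to follow, the same two counts can be sketched in plain Python with `collections.Counter`. The genre strings below are hypothetical rows in the pipe-separated format of the `genres` column, not actual MovieLens data.

```python
from collections import Counter

# Hypothetical rows in the pipe-separated format of the `genres` column.
genres_column = [
    'Adventure|Animation|Children',
    'Comedy|Romance',
    'Drama',
    'Comedy',
    'Drama|Romance',
    'Drama',
]

per_genre = Counter()     # number of movies tagged with each genre
single_genre = Counter()  # number of movies whose only genre is this one

for genres in genres_column:
    parts = genres.split('|')
    per_genre.update(parts)
    if len(parts) == 1:
        single_genre[parts[0]] += 1

# Most frequent genres first, mirroring the sorted `stat` table.
for genre, count in per_genre.most_common():
    print(genre, count, single_genre[genre])
```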

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
movies_ohe.set_index(['movieId','title']).sum(axis=1).hist()
plt.title('distribution of number of genres')

One can always make a graphical illustration as well.

### Let's get back to analysing the genre associations:

In [36]:
movies_ohe.set_index(['movieId','title'],inplace=True)

In [37]:
frequent_itemsets_movies = apriori(movies_ohe,use_colnames=True, min_support=0.025)

In [None]:
frequent_itemsets_movies

In [39]:
rules_movies =  association_rules(frequent_itemsets_movies, metric='lift', min_threshold=1.25)

In [None]:
rules_movies

***As we can see, the support values in this dataset, and hence the confidence values, are fairly small, which makes it difficult to interpret the result from these two values alone. Lift and conviction, by contrast, remain intuitive and representative. That is why we should understand the meaning of all 5 metrics to interpret the result accurately!***
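A small numeric sketch may help here. The supports below are hypothetical values chosen to mimic a rare genre B and a common genre A; `rule_metrics` is an illustrative helper, not an mlxtend function.

```python
def rule_metrics(support_x, support_y, support_xy):
    """Return (confidence, lift, conviction) for the rule X -> Y."""
    confidence = support_xy / support_x
    lift = confidence / support_y
    # Conviction is treated as infinite when confidence is exactly 1.
    if confidence == 1:
        conviction = float('inf')
    else:
        conviction = (1 - support_y) / (1 - confidence)
    return confidence, lift, conviction

# Hypothetical supports: genre A on 30% of movies, genre B on 5%, both on 3%.
a_to_b = rule_metrics(0.30, 0.05, 0.03)
b_to_a = rule_metrics(0.05, 0.30, 0.03)

print('A -> B', [round(v, 3) for v in a_to_b])
print('B -> A', [round(v, 3) for v in b_to_a])
```

With these numbers the confidence of A → B is only about 0.1, yet the lift is about 2 in both directions, meaning the pair occurs twice as often as independence would predict. Conviction, like confidence, depends on the direction of the rule, which is why B → A scores noticeably higher than A → B.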

In [None]:
rules_movies[(rules_movies.conviction>1.25)]

* Although we might expect the {Romance, Drama} pair to stand out, it is not as correlated as other groups such as {Animation, Children}, which has much higher lift & conviction levels.

In [None]:
rules_movies[(rules_movies.conviction>1.5)].sort_values(by=['lift','conviction'], ascending=False)

Subsetting and ordering by lift & conviction shows:

* The highest correlation: {Animation, Children} correlates in both directions! Recall those Pixar & Disney films that we love watching.
* {Children, Adventure} ...
* {Fantasy, Adventure} ... How to interpret these two pairs?

The best way is to go back to your movies table and check it out!

In [43]:
pd.options.display.max_rows=50

So we want Adventure & Children but NOT Animation...

In [1]:
movies[(movies.genres.str.contains('Adventure')) & (movies.genres.str.contains('Children')) & (~movies.genres.str.contains('Animation'))]


# Summary

To recap, here is a straightforward 4-step approach to association rules:

1. One-hot encode the baskets in a dataframe.
2. Generate frequent itemsets using `apriori`.
3. Generate rules with `association_rules`.
4. Interpret & evaluate the result with the metrics.