# Frequent Itemset Mining: Apriori Alternatives

In this notebook, we will apply the **Apriori**, **FP-Growth**, and **maximal frequent itemset** methods to the same retail dataset that we explored in M3.

### Import required libraries

In [1]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth, fpmax
import matplotlib.pyplot as plt
%matplotlib inline

### T1: Data Loading

The data is located here: `/dsa/data/DSA-8410/association-mining/retail_dataset.csv`


In [2]:
df = pd.read_csv('/dsa/data/DSA-8410/association-mining/retail_dataset.csv') 
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


### T2: Show the number of transactions and unique items

In [3]:
print(f"Num of transactions = {df.shape[0]}")
print(f"Maximum num of items per transaction = {df.shape[1]}")

Num of transactions = 315
Maximum num of items per transaction = 7


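Let's list the unique items in this dataset. The `nan` entry in the output is the padding for transactions with fewer than seven items, not a real item.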

In [4]:
set(df.values.flatten())

{'Bagel',
 'Bread',
 'Cheese',
 'Diaper',
 'Eggs',
 'Meat',
 'Milk',
 'Pencil',
 'Wine',
 nan}
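
To report the item count without the `nan` padding, the missing values can be dropped before collecting the unique values. The following is a minimal sketch on a made-up three-row frame (`toy` is a hypothetical stand-in for `df`, not the retail data):

```python
import pandas as pd

# Toy stand-in for `df` (hypothetical data): short rows are padded with missing values.
toy = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", None],
    ["Meat", None, None],
])

# Flatten all cells into one Series, drop the padding, and keep the distinct items.
unique_items = sorted(pd.Series(toy.values.ravel()).dropna().unique())

print(unique_items)
print(f"Num of unique items = {len(unique_items)}")
```

Applied to `df` instead of `toy`, the same two steps should list the nine items shown above without `nan`.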

### T3: Transform the dataset to a binary incidence matrix for applying itemset mining methods

In [5]:
from sklearn.preprocessing import MultiLabelBinarizer

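# Collect each transaction as an array of its items, dropping the NaN padding
trans_data = []
for indx, row in df.iterrows():
    trans_data.append(row.dropna().values)

# One-hot encode: one row per transaction, one 0/1 column per item
mlb = MultiLabelBinarizer()
data = mlb.fit_transform(trans_data)
mlb.classes_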

array(['Bagel', 'Bread', 'Cheese', 'Diaper', 'Eggs', 'Meat', 'Milk',
       'Pencil', 'Wine'], dtype=object)

In [6]:
trans_data_enc = pd.DataFrame(data, columns=mlb.classes_)
trans_data_enc.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine
0,0,1,1,1,1,1,0,1,1
1,0,1,1,1,0,1,1,1,1
2,0,0,1,0,1,1,1,0,1
3,0,0,1,0,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1
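
To make the encoding concrete, here is a small sketch that builds the same kind of binary incidence matrix by hand, using only the standard library. The three transactions are made up for illustration, and this is not how `MultiLabelBinarizer` is implemented internally:

```python
# Toy transactions (hypothetical), each a list of item names.
transactions = [
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat"],
    ["Meat", "Wine"],
]

# Column order: every distinct item, sorted alphabetically.
items = sorted({item for t in transactions for item in t})

# One row per transaction; entry is 1 if the item occurs in it, else 0.
incidence = [[int(item in t) for item in items] for t in transactions]

print(items)
for row in incidence:
    print(row)
```

Each row has a 1 exactly where the transaction contains the column's item, which is the layout that `apriori`, `fpgrowth`, and `fpmax` expect as input.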


### T4.1: Identify Frequent Patterns with Apriori Method. Use min_support = 0.2. Show all frequent 3-itemsets.

In [7]:
freq_items = apriori(trans_data_enc, min_support=0.2, use_colnames=True, verbose=1)

Processing 72 combinations | Sampling itemset size 2
Processing 144 combinations | Sampling itemset size 3
Processing 4 combinations | Sampling itemset size 4


In [8]:
freq_items.shape

(33, 2)

In [9]:
freq_items = freq_items.reindex(columns=['itemsets', 'support'])
freq_items['length'] = freq_items['itemsets'].apply(lambda x: len(x))

In [10]:
# Keep only the itemsets that contain exactly three items
freq_items[freq_items['length'] == 3]

Unnamed: 0,itemsets,support,length
31,"(Cheese, Eggs, Meat)",0.215873,3
32,"(Cheese, Milk, Meat)",0.203175,3
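
The `support` column is the fraction of transactions that contain every item of the itemset, and `min_support=0.2` keeps only itemsets that occur in at least 20% of them. A minimal sketch of that definition, on four made-up transactions (the `support` helper is hypothetical and not part of mlxtend):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Toy transactions (hypothetical), not the retail dataset.
transactions = [
    {"Cheese", "Eggs", "Meat"},
    {"Cheese", "Meat", "Milk"},
    {"Cheese", "Eggs", "Meat", "Wine"},
    {"Bread", "Wine"},
]

print(support({"Cheese", "Meat"}, transactions))          # in 3 of 4 transactions
print(support({"Cheese", "Eggs", "Meat"}, transactions))  # in 2 of 4 transactions
```

Adding an item to an itemset can never raise its support, which is the property Apriori relies on to prune candidates.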


### T4.2: Generate Association Rules from Frequent Itemsets. Show the top-5 rules with the highest conviction.

In [11]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
2,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
3,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
4,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148


In [12]:
rules.sort_values(by=['conviction'], ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
13,"(Meat, Milk)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137
10,"(Meat, Eggs)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667
8,"(Cheese, Eggs)",(Meat),0.298413,0.47619,0.215873,0.723404,1.519149,0.073772,1.893773
9,"(Cheese, Meat)",(Eggs),0.32381,0.438095,0.215873,0.666667,1.521739,0.074014,1.685714
11,"(Cheese, Milk)",(Meat),0.304762,0.47619,0.203175,0.666667,1.4,0.05805,1.571429
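
As a sanity check on the metric columns, the standard definitions of confidence, lift, leverage, and conviction can be recomputed from the three support values of a rule. The sketch below uses the rounded values of rule 13, `(Meat, Milk)` → `(Cheese)`, from the table above; `rule_metrics` is a hypothetical helper, and because the inputs are rounded the results only agree with the table to about three decimal places:

```python
def rule_metrics(support_ac, support_a, support_c):
    """Metrics for a rule A -> C from support(A and C), support(A), support(C)."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded supports for rule 13 in the table: (Meat, Milk) -> (Cheese)
metrics = rule_metrics(support_ac=0.203175, support_a=0.244444, support_c=0.501587)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A conviction well above 1 means the rule is wrong much less often than it would be if antecedent and consequent were independent, which is why sorting by conviction puts this rule first.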


### T5: Infer frequent itemsets with FP-Growth. Use min_support = 0.2.

In [13]:
fpgrowth(trans_data_enc, min_support=0.2, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.504762,(Bread)
1,0.501587,(Cheese)
2,0.47619,(Meat)
3,0.438095,(Wine)
4,0.438095,(Eggs)
5,0.406349,(Diaper)
6,0.361905,(Pencil)
7,0.501587,(Milk)
8,0.425397,(Bagel)
9,0.238095,"(Cheese, Bread)"


### T6: Show the maximal frequent itemsets for min_support = 0.2

In [14]:
max_patterns = fpmax(trans_data_enc, min_support=0.2, use_colnames=True)

In [15]:
# for readability 
max_patterns = max_patterns.reindex(columns=['itemsets', 'support'])
max_patterns['length'] = max_patterns['itemsets'].apply(lambda x: len(x))

In [16]:
print(f"Total number of maximal frequent patterns = {max_patterns.shape[0]}")
max_patterns

Total number of maximal frequent patterns = 19


Unnamed: 0,itemsets,support,length
0,"(Bread, Pencil)",0.2,2
1,"(Cheese, Pencil)",0.2,2
2,"(Wine, Pencil)",0.2,2
3,"(Cheese, Diaper)",0.2,2
4,"(Bread, Diaper)",0.231746,2
5,"(Wine, Diaper)",0.234921,2
6,"(Bagel, Milk)",0.225397,2
7,"(Bagel, Bread)",0.279365,2
8,"(Wine, Eggs)",0.24127,2
9,"(Milk, Eggs)",0.244444,2
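
A frequent itemset is maximal when none of its proper supersets is also frequent. The sketch below applies that definition as a plain filter over a made-up list of frequent itemsets; `maximal_itemsets` is a hypothetical helper for illustration and is not how `fpmax` computes its result:

```python
def maximal_itemsets(frequent):
    """Keep the itemsets that have no proper superset among the frequent itemsets."""
    frequent = [frozenset(s) for s in frequent]
    return [s for s in frequent if not any(s < other for other in frequent)]


# Toy list of frequent itemsets (hypothetical), not the output above.
frequent = [
    {"Cheese"}, {"Eggs"}, {"Meat"}, {"Bagel"},
    {"Cheese", "Eggs"}, {"Cheese", "Meat"}, {"Eggs", "Meat"},
    {"Cheese", "Eggs", "Meat"},
]

for itemset in maximal_itemsets(frequent):
    print(sorted(itemset))
```

Every frequent itemset is a subset of some maximal one, so the maximal set is a compact summary, although it does not retain the support of the subsets.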


# Save your notebook, then `File > Close and Halt`