<a href="https://colab.research.google.com/github/a-nagar/cs4372/blob/main/Frequent_Pattern_Association_Rules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
%pip install mlxtend --upgrade



# Transactions Dataset
Let's look at a set of transactions stored as a list of lists, where each inner list holds the items of one transaction.

In [3]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

In [4]:
type(dataset)

list

## Converting to Transactions Dataframe
Before we can proceed, we need to one-hot encode the transaction list using a TransactionEncoder object. Notice the format of the output dataframe: one row per transaction and one boolean column per item.

In [5]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False
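
To make the encoding concrete, here is a minimal pure-Python sketch that produces the same True/False layout by hand. It is only an illustration of the idea, not how TransactionEncoder is implemented, and the names `items` and `encoded` are made up for this example.

```python
# Same toy transactions as above.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# One column per distinct item, in sorted order.
items = sorted({item for transaction in dataset for item in transaction})

# One row per transaction: True if the item occurs in it, else False.
# A repeated item (the second 'Onion' in the last transaction) still gives one True.
encoded = [[item in set(transaction) for item in items]
           for transaction in dataset]

print(items)
for row in encoded:
    print(row)
```

Each printed row should line up with the corresponding row of the dataframe above.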


# Apriori Algorithm
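Let's run the Apriori algorithm with a minimum support threshold. With min_support=0.6, an itemset is reported only if it appears in at least 3 of the 5 transactions.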

In [6]:
from mlxtend.frequent_patterns import apriori

apriori(df, min_support=0.6, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Kidney Beans, Eggs)"
6,0.6,"(Onion, Eggs)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Yogurt)"
10,0.6,"(Kidney Beans, Onion, Eggs)"
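
As a sanity check on what "support" means, the sketch below counts itemsets by brute force over the toy transactions. It is an assumption-laden illustration, not mlxtend's Apriori implementation: it tries every combination of a given size instead of building candidates from the previous level, and it stops at the first size with no frequent itemset, since no larger itemset can then be frequent. The names `support` and `frequent` are illustrative.

```python
from itertools import combinations

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

transactions = [set(t) for t in dataset]
n = len(transactions)
min_support = 0.6


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n


all_items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(all_items) + 1):
    found_any = False
    for combo in combinations(all_items, size):
        s = support(set(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s
            found_any = True
    # If nothing of this size is frequent, nothing larger can be frequent either.
    if not found_any:
        break

for itemset, s in frequent.items():
    print(round(s, 2), sorted(itemset))
```

Brute force is fine for eleven items, but the number of combinations grows very quickly, which is why Apriori prunes candidates level by level.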


# Frequent Itemsets
Let's store the frequent itemsets found with the minimum support and add a column with the number of items in each itemset.

In [7]:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets


Unnamed: 0,support,itemsets,length
0,0.8,(Eggs),1
1,1.0,(Kidney Beans),1
2,0.6,(Milk),1
3,0.6,(Onion),1
4,0.6,(Yogurt),1
5,0.8,"(Kidney Beans, Eggs)",2
6,0.6,"(Onion, Eggs)",2
7,0.6,"(Kidney Beans, Milk)",2
8,0.6,"(Kidney Beans, Onion)",2
9,0.6,"(Kidney Beans, Yogurt)",2
10,0.6,"(Kidney Beans, Onion, Eggs)",3


In [8]:
frequent_itemsets[ (frequent_itemsets['length'] >= 2) &
                   (frequent_itemsets['support'] >= 0.6) ]


Unnamed: 0,support,itemsets,length
5,0.8,"(Kidney Beans, Eggs)",2
6,0.6,"(Onion, Eggs)",2
7,0.6,"(Kidney Beans, Milk)",2
8,0.6,"(Kidney Beans, Onion)",2
9,0.6,"(Kidney Beans, Yogurt)",2
10,0.6,"(Kidney Beans, Onion, Eggs)",3


# Association Rules
Next, let's find association rules with high confidence values in the transaction dataset.

In [9]:
import warnings
warnings.filterwarnings('ignore')


In [10]:
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7, num_itemsets=df.shape[0])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Kidney Beans),(Eggs),1.0,0.8,0.8,0.8,1.0,1.0,0.0,1.0,0.0,0.8,0.0,0.9
1,(Eggs),(Kidney Beans),0.8,1.0,0.8,1.0,1.0,1.0,0.0,inf,0.0,0.8,0.0,0.9
2,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,1.0,0.12,inf,0.5,0.75,1.0,0.875
3,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,1.0,0.12,1.6,1.0,0.75,0.375,0.875
4,(Milk),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,1.0,0.0,inf,0.0,0.6,0.0,0.8
5,(Onion),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,1.0,0.0,inf,0.0,0.6,0.0,0.8
6,(Yogurt),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,1.0,0.0,inf,0.0,0.6,0.0,0.8
7,"(Kidney Beans, Onion)",(Eggs),0.6,0.8,0.6,1.0,1.25,1.0,0.12,inf,0.5,0.75,1.0,0.875
8,"(Kidney Beans, Eggs)",(Onion),0.8,0.6,0.6,0.75,1.25,1.0,0.12,1.6,1.0,0.75,0.375,0.875
9,"(Onion, Eggs)",(Kidney Beans),0.6,1.0,0.6,1.0,1.0,1.0,0.0,inf,0.0,0.6,0.0,0.8
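
The main columns of this table can be recomputed by hand from transaction counts. The sketch below does this for the two rules between Onion and Eggs, using the standard definitions of confidence, lift, leverage and conviction. The helper `rule_metrics` is hypothetical and written only for this illustration; the counts (Onion in 3 transactions, Eggs in 4, both together in 3, out of 5) come from the toy dataset.

```python
import math


def rule_metrics(count_a, count_c, count_ac, n):
    """Metrics for the rule A -> C, given raw transaction counts.

    count_a: transactions containing the antecedent A
    count_c: transactions containing the consequent C
    count_ac: transactions containing both A and C
    n: total number of transactions
    """
    sup_a, sup_c, sup_ac = count_a / n, count_c / n, count_ac / n
    confidence = sup_ac / sup_a
    lift = confidence / sup_c
    leverage = sup_ac - sup_a * sup_c
    # Conviction is infinite when the rule never fails (confidence of 1).
    if count_ac == count_a:
        conviction = math.inf
    else:
        conviction = (1 - sup_c) / (1 - confidence)
    return {'confidence': confidence, 'lift': lift,
            'leverage': leverage, 'conviction': conviction}


onion_to_eggs = rule_metrics(count_a=3, count_c=4, count_ac=3, n=5)
eggs_to_onion = rule_metrics(count_a=4, count_c=3, count_ac=3, n=5)

for name, metrics in [('Onion -> Eggs', onion_to_eggs),
                      ('Eggs -> Onion', eggs_to_onion)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Note that lift is symmetric (both directions give the same value), while confidence and conviction depend on the direction of the rule.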


Let's add a column with the number of items in each antecedent.

In [11]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2, num_itemsets=df.shape[0])
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski,antecedent_len
0,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,1.0,0.12,inf,0.5,0.75,1.0,0.875,1
1,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,1.0,0.12,1.6,1.0,0.75,0.375,0.875,1
2,"(Kidney Beans, Onion)",(Eggs),0.6,0.8,0.6,1.0,1.25,1.0,0.12,inf,0.5,0.75,1.0,0.875,2
3,"(Kidney Beans, Eggs)",(Onion),0.8,0.6,0.6,0.75,1.25,1.0,0.12,1.6,1.0,0.75,0.375,0.875,2
4,(Onion),"(Kidney Beans, Eggs)",0.6,0.8,0.6,1.0,1.25,1.0,0.12,inf,0.5,0.75,1.0,0.875,1
5,(Eggs),"(Kidney Beans, Onion)",0.8,0.6,0.6,0.75,1.25,1.0,0.12,1.6,1.0,0.75,0.375,0.875,1


The above information can be used to filter for rules with a sufficient number of items in the antecedent (the same approach works for the consequent).

In [12]:
rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski,antecedent_len
2,"(Kidney Beans, Onion)",(Eggs),0.6,0.8,0.6,1.0,1.25,1.0,0.12,inf,0.5,0.75,1.0,0.875,2


# FP Growth Algorithm

FP-Growth is usually a faster alternative to the Apriori algorithm because it does not generate candidate itemsets explicitly.

As per the documentation, "*In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is a frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) data structure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets.*"
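
To give a feel for the FP-tree mentioned in the quote, here is a minimal sketch of the tree-building step only, under simplifying assumptions: it omits the header table and the recursive mining step, and it is not mlxtend's implementation. `FPNode`, `build_fp_tree` and `show` are names invented for this example. Each transaction is reduced to its frequent items, sorted by descending frequency, and inserted into a prefix tree so that shared prefixes are stored once with a count.

```python
from collections import Counter

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


class FPNode:
    """A tree node: an item and the number of transactions passing through it."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Pass 1: count how many transactions each item appears in.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction, keeping only frequent items and
    # ordering them by descending count (ties broken alphabetically).
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item)
            node = node.children[item]
            node.count += 1
    return root


def show(node, depth=0):
    """Print the tree with indentation showing the prefix structure."""
    for child in node.children.values():
        print('  ' * depth + f'{child.item}: {child.count}')
        show(child, depth + 1)


# min_count=3 corresponds to min_support=0.6 on 5 transactions.
tree = build_fp_tree(dataset, min_count=3)
show(tree)
```

Because every transaction contains Kidney Beans, all five transactions share a single branch at the top of the tree, which is the kind of compression that makes the structure attractive for large datasets.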

In [13]:
from mlxtend.frequent_patterns import fpgrowth

fpgrowth(df, min_support=0.6, use_colnames=True)

Unnamed: 0,support,itemsets
0,1.0,(Kidney Beans)
1,0.8,(Eggs)
2,0.6,(Yogurt)
3,0.6,(Onion)
4,0.6,(Milk)
5,0.8,"(Kidney Beans, Eggs)"
6,0.6,"(Kidney Beans, Yogurt)"
7,0.6,"(Onion, Eggs)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Onion, Eggs)"


If you just want the maximal patterns, you can use the *fpmax* algorithm.

As per the documentation, "*FP-Max is a variant of FP-Growth, which focuses on obtaining maximal itemsets. An itemset X is said to be maximal if X is frequent and there exists no frequent super-pattern containing X. In other words, a frequent pattern X cannot be a sub-pattern of a larger frequent pattern to qualify for the definition of maximal itemset.*"

In [14]:
from mlxtend.frequent_patterns import fpmax
fpmax(df, min_support=0.6, use_colnames=True)


Unnamed: 0,support,itemsets
0,0.6,"(Kidney Beans, Milk)"
1,0.6,"(Kidney Beans, Onion, Eggs)"
2,0.6,"(Kidney Beans, Yogurt)"
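
The definition of a maximal itemset can be checked directly against the frequent itemsets found earlier for min_support=0.6. The sketch below is a naive filter written for illustration, not how fpmax works internally: it keeps an itemset only if no other frequent itemset is a proper superset of it. The variable names `toy_frequent` and `maximal` are illustrative.

```python
# The eleven frequent itemsets of the toy dataset at min_support=0.6.
toy_frequent = [
    frozenset(['Eggs']),
    frozenset(['Kidney Beans']),
    frozenset(['Milk']),
    frozenset(['Onion']),
    frozenset(['Yogurt']),
    frozenset(['Kidney Beans', 'Eggs']),
    frozenset(['Onion', 'Eggs']),
    frozenset(['Kidney Beans', 'Milk']),
    frozenset(['Kidney Beans', 'Onion']),
    frozenset(['Kidney Beans', 'Yogurt']),
    frozenset(['Kidney Beans', 'Onion', 'Eggs']),
]

# `s < other` is True when s is a proper subset of other.
maximal = [s for s in toy_frequent
           if not any(s < other for other in toy_frequent)]

for itemset in maximal:
    print(sorted(itemset))
```

The three itemsets that survive the filter are the same three that fpmax reports above.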


# Working With A Real Dataset
Let's work with a dataset from the UCI repository: https://archive.ics.uci.edu/ml/datasets/online+retail

We will download the file and read it into a Pandas dataframe.

In [15]:
import pandas as pd
df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


Let's drop rows with a missing invoice number, convert the invoice numbers to strings, and remove cancelled orders (in this dataset their invoice numbers contain the letter 'C').

In [16]:
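# Drop rows that have no invoice number.
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
# Treat invoice numbers as strings so they can be searched for the letter 'C'.
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
# Remove cancelled orders, whose invoice numbers contain 'C'.
df = df[~df['InvoiceNo'].str.contains('C')]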

In [17]:
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France


**This is an important step**.

You need to convert the data into the form that the package expects, i.e. one row per transaction ID and one column per item.
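
To see what this reshaping does before applying it to the real data, here is a small pure-Python sketch on hypothetical rows. The invoice IDs, item names and quantities are made up, and the variable names are illustrative. It sums quantities per (invoice, item) pair and then lays them out with one row per invoice and one column per item, filling in 0 where an item was not bought.

```python
from collections import defaultdict

# Hypothetical rows in long format: (InvoiceNo, Description, Quantity).
rows = [
    ('A1', 'PEN', 2),
    ('A1', 'NOTEBOOK', 1),
    ('A1', 'PEN', 3),        # same item twice on one invoice: quantities add up
    ('A2', 'NOTEBOOK', 4),
    ('A3', 'STAPLER', 1),
]

# Sum the quantity for each (invoice, item) pair.
totals = defaultdict(int)
for invoice, item, qty in rows:
    totals[(invoice, item)] += qty

invoices = sorted({invoice for invoice, _, _ in rows})
items = sorted({item for _, item, _ in rows})

# Wide format: one row per invoice, one column per item, 0 for missing pairs.
basket_toy = {invoice: [totals.get((invoice, item), 0) for item in items]
              for invoice in invoices}

print(items)
for invoice, quantities in basket_toy.items():
    print(invoice, quantities)
```

With many distinct items and few items per invoice, most cells in the wide table end up as 0.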


In [18]:
basket = (df[df['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

Don't be surprised if you see lots of 0's.

In [19]:
basket.iloc[:100, :]

InvoiceNo,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TRELLIS COAT RACK,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0.0,0.0,0.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
548496,0.0,0.0,0.0,0.0,72.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
548553,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
548606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
548725,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, convert the quantities to 0 or 1, since we only care whether an item appears in a transaction, not how many units were bought.

In [20]:
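def encode_units(x):
    # 1 if the item appears in the invoice (positive quantity), otherwise 0.
    return 1 if x >= 1 else 0

# On pandas >= 2.1, DataFrame.map is the preferred name for applymap.
basket_sets = basket.applymap(encode_units)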

In [21]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

In [22]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.071429,(4 TRADITIONAL SPINNING TOPS)
1,0.096939,(ALARM CLOCK BAKELIKE GREEN)
2,0.102041,(ALARM CLOCK BAKELIKE PINK)
3,0.094388,(ALARM CLOCK BAKELIKE RED )
4,0.081633,(BAKING SET 9 PIECE RETROSPOT )
...,...,...
85,0.084184,"(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE..."
86,0.084184,"(SET/20 RED RETROSPOT PAPER NAPKINS , POSTAGE,..."
87,0.102041,"(SET/6 RED SPOTTY PAPER PLATES, SET/6 RED SPOT..."
88,0.099490,"(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RE..."


In [23]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets.head()

Unnamed: 0,support,itemsets,length
0,0.071429,(4 TRADITIONAL SPINNING TOPS),1
1,0.096939,(ALARM CLOCK BAKELIKE GREEN),1
2,0.102041,(ALARM CLOCK BAKELIKE PINK),1
3,0.094388,(ALARM CLOCK BAKELIKE RED ),1
4,0.081633,(BAKING SET 9 PIECE RETROSPOT ),1


In [24]:
# Widen columns so that long itemsets are not truncated in the output.
pd.set_option('display.max_colwidth', 600)
frequent_itemsets[ (frequent_itemsets['length'] >= 2) &
                   (frequent_itemsets['support'] >= 0.07) ].sort_values(by="length", ascending=False)

Unnamed: 0,support,itemsets,length
89,0.081633,"(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RED SPOTTY PAPER CUPS, POSTAGE, SET/6 RED SPOTTY PAPER PLATES)",4
88,0.09949,"(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY PAPER PLATES)",3
87,0.102041,"(SET/6 RED SPOTTY PAPER PLATES, SET/6 RED SPOTTY PAPER CUPS, POSTAGE)",3
86,0.084184,"(SET/20 RED RETROSPOT PAPER NAPKINS , POSTAGE, SET/6 RED SPOTTY PAPER PLATES)",3
85,0.084184,"(SET/20 RED RETROSPOT PAPER NAPKINS , SET/6 RED SPOTTY PAPER CUPS, POSTAGE)",3
84,0.084184,"(PLASTERS IN TIN WOODLAND ANIMALS, PLASTERS IN TIN SPACEBOY, POSTAGE)",3
83,0.084184,"(PLASTERS IN TIN WOODLAND ANIMALS, PLASTERS IN TIN CIRCUS PARADE , POSTAGE)",3
82,0.07398,"(PLASTERS IN TIN CIRCUS PARADE , PLASTERS IN TIN SPACEBOY, POSTAGE)",3
81,0.071429,"(ALARM CLOCK BAKELIKE RED , ALARM CLOCK BAKELIKE GREEN, POSTAGE)",3
74,0.107143,"(SET/6 RED SPOTTY PAPER PLATES, POSTAGE)",2


In [25]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1, num_itemsets=basket.shape[0])
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.096939,0.07398,0.725,7.478947,1.0,0.064088,3.283859,0.964734,0.591837,0.69548,0.744079
1,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.102041,0.07398,0.763158,7.478947,1.0,0.064088,3.791383,0.959283,0.591837,0.736244,0.744079
2,(ALARM CLOCK BAKELIKE RED ),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,1.0,0.069932,5.568878,0.976465,0.704545,0.820431,0.826814
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED ),0.096939,0.094388,0.079082,0.815789,8.642959,1.0,0.069932,4.916181,0.979224,0.704545,0.79659,0.826814
4,(ALARM CLOCK BAKELIKE GREEN),(POSTAGE),0.096939,0.765306,0.084184,0.868421,1.134737,1.0,0.009996,1.783673,0.131484,0.108197,0.439359,0.489211


# Lab Assignment
You will use the MovieLens 100K dataset available from
https://grouplens.org/datasets/movielens/

We will use the version for education and research. I have already uploaded the relevant files to a server, and the commands below read them.


In [26]:
movies = pd.read_csv("https://an-ml.s3.us-west-1.amazonaws.com/ml-latest-small/movies.csv")

In [27]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [28]:
ratings = pd.read_csv("https://an-ml.s3.us-west-1.amazonaws.com/ml-latest-small/ratings.csv")

In [29]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Let's join the above two tables on the common key movieId.

In [30]:
df = pd.merge(movies, ratings, on="movieId")

In [31]:
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


The columns of interest to us are title and userId. Now, repeat the steps that we did earlier and find significant frequent patterns and association rules. You are free to set the selection parameters.

*Optional* - Do the results make sense? Use your knowledge of movies 😀