# Frequent Itemset and Association Rules Mining using Apriori Algorithm

In this part, you will build a system which can help make recommendations using the Apriori algorithm.

To solve this assignment you will need to go through these pages:

* https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/
* https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/
* https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
* https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/

The `apply` function in `pandas` can prove very useful for this assignment. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

**Source**: Online Retail. (2015). UCI Machine Learning Repository. https://doi.org/10.24432/C5BW33.

In [104]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Load and Inspect Data

In [105]:
invoices = pd.read_csv('./apriori_data.csv')
invoices.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


## Data Transformation

Drop everything except `InvoiceNo` and `StockCode`: `InvoiceNo` serves as the transaction ID and `StockCode` as the item name.

In [106]:
# Your code here
# A list keeps the column order explicit; `columns=` avoids the positional axis argument.
columns_to_drop = ['Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
data = invoices.drop(columns=columns_to_drop)

In [107]:
data.head()

Unnamed: 0,InvoiceNo,StockCode
0,536365,85123A
1,536365,71053
2,536365,84406B
3,536365,84029G
4,536365,84029E


Group the data by `InvoiceNo` and create a list of `StockCode` values for each invoice.

In [108]:
# Your code here
transactions = data.groupby('InvoiceNo')['StockCode'].apply(list).tolist()

In [109]:
transactions[0:4]

[['85123A', '71053', '84406B', '84029G', '84029E', '22752', '21730'],
 ['22633', '22632'],
 ['84879',
  '22745',
  '22748',
  '22749',
  '22310',
  '84969',
  '22623',
  '22622',
  '21754',
  '21755',
  '21777',
  '48187'],
 ['22960', '22913', '22912', '22914']]
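
The grouping step can be illustrated without pandas. The sketch below is a library-free approximation of what `groupby('InvoiceNo')['StockCode'].apply(list).tolist()` produces; the invoice IDs, item names, and the helper `group_items` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Toy (invoice, item) rows; all values here are made up for illustration.
rows = [
    ("inv_2", "mug"),
    ("inv_1", "lantern"),
    ("inv_1", "hanger"),
    ("inv_2", "bottle"),
    ("inv_3", "lantern"),
]

def group_items(rows):
    """Collect the items of each invoice into a list, keeping row order within an invoice."""
    grouped = defaultdict(list)
    for invoice_no, item in rows:
        grouped[invoice_no].append(item)
    # Return one list per invoice, ordered by invoice key (groupby sorts keys by default).
    return [grouped[key] for key in sorted(grouped)]

toy_transactions = group_items(rows)
print(toy_transactions)
```

Each inner list is one transaction, which is the input shape that `TransactionEncoder` expects.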

Using TransactionEncoder, convert the transactions into a dataset where each row represents a transaction and each column represents an item. The values will be True or False depending on whether the item is present in that specific transaction.

In [110]:
# Your code here
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
transactions_df = pd.DataFrame(te_ary, columns=te.columns_)

In [111]:
transactions_df.head()

Unnamed: 0,10002,10080,10120,10123C,10123G,10124A,10124G,10125,10133,10134,...,M,PADS,POST,S,gift_0001_10,gift_0001_20,gift_0001_30,gift_0001_40,gift_0001_50,m
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
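
To make the encoding concrete, here is a minimal pure-Python sketch of the same idea: one column per distinct item (sorted), one boolean row per transaction. It is an approximation of the behavior of `TransactionEncoder`, not its implementation; `one_hot_encode` and the toy items are hypothetical.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)  # set lookup makes the membership test cheap
        rows.append([col in present for col in columns])
    return columns, rows

toy = [["bread", "milk"], ["milk"], ["bread", "eggs"]]
columns, rows = one_hot_encode(toy)
print(columns)
for row in rows:
    print(row)
```

With thousands of distinct stock codes, most cells are `False`, which is why the first rows of `transactions_df` above show no `True` values in the visible columns.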


## Use Apriori to get the frequent itemsets and inspect the results

Use `apriori` to find the frequent itemsets with a minimum support of 1% (`min_support=0.01`).

In [112]:
# Your code here
frequent_itemsets = apriori(transactions_df, min_support=0.01)

In [64]:
frequent_itemsets.shape

(1087, 2)

In [65]:
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.020193,(14)
1,0.012587,(20)
2,0.017876,(21)
3,0.011236,(75)
4,0.01251,(147)
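
The integers in `itemsets` are column positions in `transactions_df`, because `apriori` was called without `use_colnames=True`. As a rough picture of what the algorithm does, the following is a minimal level-wise sketch on toy data. It is not mlxtend's implementation; `support`, `apriori_sketch`, and the transactions are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting `min_support`."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

toy = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
frequent = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is the Apriori property at work: `{bread, eggs}` falls below the threshold, so the three-item candidate containing it is never counted.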


Add an additional column called `items_count` to the dataframe which represents the number of items in the itemset.

In [67]:
# Your code here
frequent_itemsets['items_count'] = frequent_itemsets['itemsets'].apply(len)

In [96]:
frequent_itemsets.head()

Unnamed: 0,support,itemsets,items_count
0,0.020193,(14),1
1,0.012587,(20),1
2,0.017876,(21),1
3,0.011236,(75),1
4,0.01251,(147),1


Display the various itemsets generated sorted (descending) by the items_count.

In [97]:
# Your code here
frequent_itemsets.sort_values('items_count', ascending=False).head()

Unnamed: 0,support,itemsets,items_count
1086,0.011699,"(1608, 1609, 1610, 1348)",4
1085,0.010386,"(944, 1336, 3515, 1317)",4
1084,0.010077,"(177, 178, 179, 1290)",4
1032,0.012548,"(1315, 180, 183)",3
1024,0.011042,"(1315, 180, 181)",3


Show how many itemsets exist by items_count

In [98]:
# Your code here
frequent_itemsets['items_count'].count()

1087
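
Note that `count()` returns the total number of itemsets (1087), not a breakdown by size. A per-size tally could look like the sketch below, which uses `collections.Counter` on a handful of illustrative itemsets (not the full result). With the notebook's dataframe, `frequent_itemsets['items_count'].value_counts()` would likely give the equivalent breakdown.

```python
from collections import Counter

# A few illustrative itemsets of column indices; this is not the full 1087-row result.
itemsets = [
    frozenset({14}),
    frozenset({20}),
    frozenset({180, 1315}),
    frozenset({180, 183, 1315}),
    frozenset({177, 178, 179, 1290}),
]

# Tally how many itemsets exist for each itemset size.
counts_by_size = Counter(len(s) for s in itemsets)
for size in sorted(counts_by_size):
    print(size, counts_by_size[size])
```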

## Generate association rules
Generate all association rules using the `lift` metric with a minimum threshold of 2.

In [99]:
# Your code here
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=2)

In [100]:
rules.shape

(1338, 10)

In [101]:
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(171),(172),0.020541,0.033668,0.011158,0.543233,16.135019,0.010467,2.115591,0.957695
1,(172),(171),0.033668,0.020541,0.011158,0.331422,16.135019,0.010467,1.464989,0.970705
2,(944),(171),0.046371,0.020541,0.011506,0.248127,12.079846,0.010553,1.302692,0.961818
3,(171),(944),0.020541,0.046371,0.011506,0.56015,12.079846,0.010553,2.16808,0.936453
4,(171),(1317),0.020541,0.047529,0.010888,0.530075,11.152679,0.009912,2.026858,0.929426
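
The metric columns follow from the three support values of each rule. As a sanity check, the sketch below recomputes confidence, lift, leverage, and conviction for the first rule, (171) → (172), from the rounded supports shown above. The helper `rule_metrics` is hypothetical, and small differences from the table are expected because the displayed inputs are rounded.

```python
import math

def rule_metrics(support_ab, support_a, support_b):
    """Standard rule metrics for A -> B from the supports of A, B, and A together with B."""
    confidence = support_ab / support_a           # P(B | A)
    lift = confidence / support_b                 # > 1 means A and B co-occur more than chance
    leverage = support_ab - support_a * support_b
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = math.inf if confidence == 1 else (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Rounded values from row 0 of rules.head(): support, antecedent support, consequent support.
metrics = rule_metrics(0.011158, 0.020541, 0.033668)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A lift of roughly 16 means invoices containing item 171 include item 172 about sixteen times more often than the overall rate of item 172 would suggest.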


In [102]:
invoices.head()

Unnamed: 0,StockCode,InvoiceNo,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,85123A,536365,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,71053,536365,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,84406B,536365,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,84029G,536365,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,84029E,536365,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


Add the names of the items back into the data frame and save all rules in a CSV file.
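
One possible approach, sketched under assumptions: the itemsets hold column positions, so each index can be translated to a stock code through the encoder's column list, and optionally to a description through a lookup. The helper `indices_to_names`, the three-item `columns` list (a stand-in for `te.columns_`), and the `descriptions` dictionary are hypothetical; this is not the assignment's reference solution.

```python
def indices_to_names(itemset, columns, descriptions=None):
    """Map a set of column indices to stock codes, or to descriptions if a lookup is given."""
    codes = [columns[i] for i in sorted(itemset)]
    if descriptions is None:
        return frozenset(codes)
    # Fall back to the stock code when no description is known.
    return frozenset(descriptions.get(code, code) for code in codes)

# Stand-in for te.columns_ (the sorted stock codes); only three items for illustration.
columns = ["71053", "84406B", "85123A"]
# Stand-in lookup from stock code to description, using names shown in invoices.head().
descriptions = {
    "71053": "WHITE METAL LANTERN",
    "85123A": "WHITE HANGING HEART T-LIGHT HOLDER",
}

as_codes = indices_to_names(frozenset({0, 2}), columns)
as_names = indices_to_names(frozenset({0, 1}), columns, descriptions)
print(sorted(as_codes))
print(sorted(as_names))
```

With the notebook's objects, this could be applied column-wise, for example `rules['antecedents'].apply(lambda s: indices_to_names(s, te.columns_))` and likewise for `consequents`, before writing the file. Alternatively, passing `use_colnames=True` to `apriori` yields stock codes directly, at the cost of rerunning the earlier cells.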

In [113]:
# Your code here

In [114]:
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(171),(172),0.020541,0.033668,0.011158,0.543233,16.135019,0.010467,2.115591,0.957695
1,(172),(171),0.033668,0.020541,0.011158,0.331422,16.135019,0.010467,1.464989,0.970705
2,(944),(171),0.046371,0.020541,0.011506,0.248127,12.079846,0.010553,1.302692,0.961818
3,(171),(944),0.020541,0.046371,0.011506,0.56015,12.079846,0.010553,2.16808,0.936453
4,(171),(1317),0.020541,0.047529,0.010888,0.530075,11.152679,0.009912,2.026858,0.929426


In [116]:
rules.shape

(1338, 10)

In [117]:
# I used the following line to create the rules_100.csv file which only gives you 100 rules.
# rules.sample(100).to_csv('rules_100.csv', index=False)

# You must submit the rules.csv file that contains all the 1338 rules by running the following command
rules.to_csv('rules.csv', index=False)