# Lab Series 2 - Part 2 - Data Mining - Association Rules - Apriori

## Implementing Market Basket Analysis using mlxtend Package

Market Basket Analysis, also known as association analysis, is a method for understanding customer purchasing patterns from historical transaction data. In other words, it enables merchants to find links between the products that customers buy together.

Learn more: https://www.thepythoncode.com/article/build-a-recommender-system-with-association-rule-mining-in-python

In [None]:
!pip install --user mlxtend-0.21.0-py2.py3-none-any.whl

In [1]:
# Loading necessary packages
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

## Dataset 1 - Simple Example

In [2]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Beans', 'Ice cream', 'Eggs']]

In [3]:
# Convert the list of transactions into a one-hot encoded boolean table (one row per transaction, one column per item)
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
dataframe = pd.DataFrame(te_ary, columns=te.columns_)

In [4]:
# Print the first 5 rows of the dataframe
dataframe.head()

Unnamed: 0,Apple,Beans,Corn,Dill,Eggs,Ice cream,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,True,False,False,True,False,True,True,True,False,True
1,False,True,False,True,True,False,False,True,True,False,True
2,True,True,False,False,True,False,True,False,False,False,False
3,False,True,True,False,False,False,True,False,False,True,True
4,False,True,True,False,True,True,False,False,True,False,False
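
To make the encoding step concrete, here is a minimal sketch of the same one-hot transformation in plain Python. The helper name `one_hot_encode` is hypothetical and is not part of mlxtend; it only assumes that columns are the sorted unique item names, which matches the table above.

```python
# Illustrative sketch only: one-hot encode transactions without mlxtend.
transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Beans', 'Ice cream', 'Eggs'],
]


def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    # Hypothetical helper: collect every distinct item, in alphabetical order
    columns = sorted({item for basket in transactions for item in basket})
    rows = []
    for basket in transactions:
        present = set(basket)  # duplicates such as the repeated 'Onion' collapse here
        rows.append([item in present for item in columns])
    return columns, rows


columns, rows = one_hot_encode(transactions)
print(columns)
print(rows[0])
```

Each row answers, for every known item, whether the transaction contains it, which is the boolean layout that `apriori` expects as input.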


In [5]:
dataframe.shape

(5, 11)

In [6]:
dataframe.columns

Index(['Apple', 'Beans', 'Corn', 'Dill', 'Eggs', 'Ice cream', 'Milk', 'Nutmeg',
       'Onion', 'Unicorn', 'Yogurt'],
      dtype='object')

In [7]:
# Applying Apriori algorithm 
freq_items = apriori(dataframe, min_support=0.6, use_colnames=True)

In [8]:
freq_items

Unnamed: 0,support,itemsets
0,1.0,(Beans)
1,0.8,(Eggs)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Eggs, Beans)"
6,0.6,"(Milk, Beans)"
7,0.6,"(Onion, Beans)"
8,0.6,"(Yogurt, Beans)"
9,0.6,"(Eggs, Onion)"
10,0.6,"(Eggs, Onion, Beans)"
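
The level-wise search behind Apriori can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not mlxtend's implementation: the function name `apriori_sketch` is hypothetical, and support is recounted naively for every candidate. On the five baskets of this example with a minimum support of 0.6, it finds five single items, five pairs, and the triple {Beans, Eggs, Onion}.

```python
from itertools import combinations

transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Beans', 'Ice cream', 'Eggs'],
]


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets (hypothetical helper)."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        ]
        current = [c for c in candidates if support(c) >= min_support]
    return frequent


frequent = apriori_sketch(transactions, min_support=0.6)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what makes the search tractable: a candidate is only counted if all of its smaller subsets already passed the support threshold.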


In [9]:
# Filtering frequent itemsets based on their support
freq_items[freq_items['support'] >= 0.8]

Unnamed: 0,support,itemsets
0,1.0,(Beans)
1,0.8,(Eggs)
5,0.8,"(Eggs, Beans)"


- support(A->C) = support(A+C) [aka 'support'], range: [0, 1]

- confidence(A->C) = support(A+C) / support(A), range: [0, 1]

- lift(A->C) = confidence(A->C) / support(C), range: [0, inf]

- leverage(A->C) = support(A->C) - support(A)*support(C), range: [-1, 1]

- conviction(A->C) = [1 - support(C)] / [1 - confidence(A->C)], range: [0, inf]
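
As a worked example of the definitions above, the following sketch computes all five metrics by hand for the rule Onion -> Eggs on the five-basket dataset. The function names `support` and `rule_metrics` are hypothetical helpers. Returning infinity when the confidence is exactly 1 is an assumption made here to avoid a division by zero.

```python
import math

transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Beans', 'Ice cream', 'Eggs'],
]
baskets = [set(t) for t in transactions]


def support(items):
    """Fraction of baskets containing every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent):
    """Compute the metrics listed above for the rule antecedent -> consequent."""
    sup_a = support(antecedent)
    sup_c = support(consequent)
    sup_ac = support(set(antecedent) | set(consequent))
    confidence = sup_ac / sup_a
    lift = confidence / sup_c
    leverage = sup_ac - sup_a * sup_c
    # Assumption: conviction is reported as infinity when confidence is exactly 1
    conviction = math.inf if confidence == 1 else (1 - sup_c) / (1 - confidence)
    return {
        'support': sup_ac,
        'confidence': confidence,
        'lift': lift,
        'leverage': leverage,
        'conviction': conviction,
    }


metrics = rule_metrics({'Onion'}, {'Eggs'})
print(metrics)
```

Onion appears in 3 of 5 baskets and Eggs in 4 of 5, and every basket with Onion also contains Eggs, so the confidence is 1 and the lift is 1 / 0.8 = 1.25.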

In [10]:
# Extracting association rules from the frequent itemsets
rules = association_rules(freq_items, metric="confidence", min_threshold=0.7)

In [11]:
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Eggs),(Beans),0.8,1.0,0.8,1.0,1.0,0.0,inf
1,(Beans),(Eggs),1.0,0.8,0.8,0.8,1.0,0.0,1.0
2,(Milk),(Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
3,(Onion),(Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
4,(Yogurt),(Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf


In [12]:
# Keep only rules with confidence above 0.75 and lift above 1.2
rules[(rules['confidence'] > 0.75) & (rules['lift'] > 1.2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
6,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
9,"(Onion, Beans)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
11,(Onion),"(Eggs, Beans)",0.6,0.8,0.6,1.0,1.25,0.12,inf
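
The rule extraction step can be illustrated without mlxtend: for each frequent itemset with at least two items, every split into antecedent and consequent is a candidate rule, kept if its confidence reaches the threshold. The sketch below is a simplified illustration, and `generate_rules` is a hypothetical name. The supports are hard-coded from the five baskets of this example, including the triple {Eggs, Onion, Beans}, which occurs in three of the five baskets.

```python
from itertools import combinations

# Supports of the frequent itemsets at min_support=0.6 (five-basket example)
supports = {
    frozenset(['Beans']): 1.0,
    frozenset(['Eggs']): 0.8,
    frozenset(['Milk']): 0.6,
    frozenset(['Onion']): 0.6,
    frozenset(['Yogurt']): 0.6,
    frozenset(['Eggs', 'Beans']): 0.8,
    frozenset(['Milk', 'Beans']): 0.6,
    frozenset(['Onion', 'Beans']): 0.6,
    frozenset(['Yogurt', 'Beans']): 0.6,
    frozenset(['Eggs', 'Onion']): 0.6,
    frozenset(['Eggs', 'Onion', 'Beans']): 0.6,
}


def generate_rules(supports, min_confidence):
    """Return (antecedent, consequent, confidence, lift) tuples (hypothetical helper)."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                # Subsets of a frequent itemset are frequent, so both lookups succeed
                confidence = sup / supports[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / supports[consequent]
                    rules.append((antecedent, consequent, confidence, lift))
    return rules


rules = generate_rules(supports, min_confidence=0.7)
strong = [r for r in rules if r[2] > 0.75 and r[3] > 1.2]
for antecedent, consequent, confidence, lift in strong:
    print(sorted(antecedent), '->', sorted(consequent), confidence, lift)
```

With these inputs the sketch yields 12 rules at the 0.7 confidence threshold, of which three also pass the stricter confidence and lift filter, all with Onion in the antecedent.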


## Dataset 2 - Online Retail Dataset

The dataset is a transnational collection of all transactions made by a UK-based, registered non-store online retailer between 2010 and 2011. It contains roughly 530K transaction rows across eight attributes.

Learn more about the dataset and its analysis here: https://www.thepythoncode.com/article/build-a-recommender-system-with-association-rule-mining-in-python

### Data Load and Preparation

In [13]:
# Loading data from CSV file
myretaildata = pd.read_csv('Online_Retail_Cleaned.csv')


In [14]:
# Data cleaning - drop rows with a missing invoice number
myretaildata.dropna(axis=0, subset=['InvoiceNo'], inplace=True) 
myretaildata.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [15]:
myretaildata['Country'].value_counts()

United Kingdom          487622
Germany                   9042
France                    8408
EIRE                      7894
Spain                     2485
Netherlands               2363
Belgium                   2031
Switzerland               1967
Portugal                  1501
Australia                 1185
Norway                    1072
Italy                      758
Channel Islands            748
Finland                    685
Cyprus                     614
Sweden                     451
Unspecified                446
Austria                    398
Denmark                    380
Poland                     330
Japan                      321
Hong Kong                  284
Singapore                  222
Iceland                    182
USA                        179
Canada                     151
Greece                     145
Malta                      112
United Arab Emirates        68
European Community          60
RSA                         58
Lebanon                     45
Lithuani

In [16]:
myretaildata.shape

(532326, 8)

In [17]:
myretaildata.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [19]:
# Count missing values per column
myretaildata.isnull().sum().sort_values(ascending=False)

CustomerID     134650
Description      1455
InvoiceNo           0
StockCode           0
Quantity            0
InvoiceDate         0
UnitPrice           0
Country             0
dtype: int64

In [20]:
myretaildata.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,532326.0,532326.0,397676.0
mean,10.237364,3.847741,15295.959361
std,159.637302,41.769099,1712.437293
min,-9600.0,-11062.06,12346.0
25%,1.0,1.25,13969.0
50%,3.0,2.08,15159.0
75%,10.0,4.13,16800.0
max,80995.0,13541.33,18287.0


In [21]:
# Build one basket per Italian invoice: rows are invoices, columns are products, values are total quantities
mybasket = (myretaildata[myretaildata['Country'] == "Italy"]
            .groupby(['InvoiceNo', 'Description'])['Quantity']
            .sum().unstack().reset_index().fillna(0)
            .set_index('InvoiceNo'))

In [22]:
# Viewing transaction basket
mybasket.head()

Description,12 EGG HOUSE PAINTED WOOD,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE SKULLS,12 PENCILS TALL TUBE WOODLAND,16 PIECE CUTLERY SET PANTRY DESIGN,20 DOLLY PEGS RETROSPOT,3 GARDENIA MORRIS BOXED CANDLES,3 ROSE MORRIS BOXED CANDLES,3 STRIPEY MICE FELTCRAFT,3 TIER CAKE TIN RED AND CREAM,...,WOODLAND BUNNIES LOLLY MAKERS,WOODLAND CHARLOTTE BAG,WRAP DOILEY DESIGN,WRAP ENGLISH ROSE,WRAP I LOVE LONDON,WRAP RED APPLES,WRAP RED VINTAGE DOILY,YOU'RE CONFUSING ME METAL SIGN,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS
InvoiceNo
537022,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
539752,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
542238,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# Convert every positive quantity to True and everything else (zero or negative) to False to get a boolean table
def my_encode_units(x):
    # A single comparison covers all values, including fractions between 0 and 1
    return x > 0

my_basket_sets = mybasket.applymap(my_encode_units)

In [27]:
# Viewing transaction basket
my_basket_sets.head()

Description,12 EGG HOUSE PAINTED WOOD,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE SKULLS,12 PENCILS TALL TUBE WOODLAND,16 PIECE CUTLERY SET PANTRY DESIGN,20 DOLLY PEGS RETROSPOT,3 GARDENIA MORRIS BOXED CANDLES,3 ROSE MORRIS BOXED CANDLES,3 STRIPEY MICE FELTCRAFT,3 TIER CAKE TIN RED AND CREAM,...,WOODLAND BUNNIES LOLLY MAKERS,WOODLAND CHARLOTTE BAG,WRAP DOILEY DESIGN,WRAP ENGLISH ROSE,WRAP I LOVE LONDON,WRAP RED APPLES,WRAP RED VINTAGE DOILY,YOU'RE CONFUSING ME METAL SIGN,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS
InvoiceNo
537022,False,False,False,False,False,False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
539752,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
541115,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
541703,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
542238,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


## Training Model - Which items are frequently bought together?

In [28]:
# Generating frequent itemsets
my_frequent_itemsets = apriori(my_basket_sets, min_support=0.07, use_colnames=True)

In [32]:
my_frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.105263,(ABC TREASURE BOOK BOX)
1,0.078947,(ADULT APRON APPLE DELIGHT)
2,0.105263,(BAKING SET 9 PIECE RETROSPOT)
3,0.157895,(BREAD BIN DINER STYLE IVORY)
4,0.078947,(BREAD BIN DINER STYLE PINK)


In [30]:
# Generating association rules
my_rules = association_rules(my_frequent_itemsets, metric="confidence", min_threshold=0.5)

In [31]:
# Viewing the first 10 rules
my_rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(RED RETROSPOT CHARLOTTE BAG),(ABC TREASURE BOOK BOX),0.078947,0.105263,0.078947,1.0,9.5,0.070637,inf
1,(ABC TREASURE BOOK BOX),(RED RETROSPOT CHARLOTTE BAG),0.105263,0.078947,0.078947,0.75,9.5,0.070637,3.684211
2,(WOODLAND CHARLOTTE BAG),(ABC TREASURE BOOK BOX),0.078947,0.105263,0.078947,1.0,9.5,0.070637,inf
3,(ABC TREASURE BOOK BOX),(WOODLAND CHARLOTTE BAG),0.105263,0.078947,0.078947,0.75,9.5,0.070637,3.684211
4,(BREAD BIN DINER STYLE IVORY),(DOORMAT UNION FLAG),0.157895,0.157895,0.105263,0.666667,4.222222,0.080332,2.526316
5,(DOORMAT UNION FLAG),(BREAD BIN DINER STYLE IVORY),0.157895,0.157895,0.105263,0.666667,4.222222,0.080332,2.526316
6,(BREAD BIN DINER STYLE IVORY),(JAM MAKING SET WITH JARS),0.157895,0.184211,0.105263,0.666667,3.619048,0.076177,2.447368
7,(JAM MAKING SET WITH JARS),(BREAD BIN DINER STYLE IVORY),0.184211,0.157895,0.105263,0.571429,3.619048,0.076177,1.964912
8,(BREAD BIN DINER STYLE IVORY),(MINT KITCHEN SCALES),0.157895,0.131579,0.078947,0.5,3.8,0.058172,1.736842
9,(MINT KITCHEN SCALES),(BREAD BIN DINER STYLE IVORY),0.131579,0.157895,0.078947,0.6,3.8,0.058172,2.105263


- support(A->C) = support(A+C) [aka 'support'], range: [0, 1]

- confidence(A->C) = support(A+C) / support(A), range: [0, 1]

- lift(A->C) = confidence(A->C) / support(C), range: [0, inf]

- leverage(A->C) = support(A->C) - support(A)*support(C), range: [-1, 1]

- conviction(A->C) = [1 - support(C)] / [1 - confidence(A->C)], range: [0, inf]

In [36]:
# Filtering rules based on condition
my_rules[(my_rules['lift'] >= 5) & (my_rules['confidence'] >= 0.8)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(RED RETROSPOT CHARLOTTE BAG),(ABC TREASURE BOOK BOX),0.078947,0.105263,0.078947,1.0,9.500000,0.070637,inf
2,(WOODLAND CHARLOTTE BAG),(ABC TREASURE BOOK BOX),0.078947,0.105263,0.078947,1.0,9.500000,0.070637,inf
23,(JUMBO BAG TOYS),(CHILDRENS APRON APPLES DESIGN),0.078947,0.157895,0.078947,1.0,6.333333,0.066482,inf
36,(SET OF 20 KIDS COOKIE CUTTERS),(CHILDRENS APRON APPLES DESIGN),0.131579,0.157895,0.105263,0.8,5.066667,0.084488,4.210526
38,(TOY TIDY PINK POLKADOT),(CHILDRENS APRON APPLES DESIGN),0.131579,0.157895,0.105263,0.8,5.066667,0.084488,4.210526
...,...,...,...,...,...,...,...,...,...
1317,"(JUMBO BAG TOYS, TOY TIDY PINK POLKADOT)","(RECYCLING BAG RETROSPOT, TOY TIDY SPACEBOY, J...",0.078947,0.078947,0.078947,1.0,12.666667,0.072715,inf
1318,"(JUMBO BAG WOODLAND ANIMALS, CHILDRENS APRON A...","(RECYCLING BAG RETROSPOT, JUMBO BAG TOYS, TOY ...",0.078947,0.078947,0.078947,1.0,12.666667,0.072715,inf
1319,"(TOY TIDY SPACEBOY, JUMBO BAG WOODLAND ANIMALS)","(RECYCLING BAG RETROSPOT, JUMBO BAG TOYS, TOY ...",0.078947,0.078947,0.078947,1.0,12.666667,0.072715,inf
1320,"(TOY TIDY PINK POLKADOT, JUMBO BAG WOODLAND AN...","(RECYCLING BAG RETROSPOT, JUMBO BAG TOYS, TOY ...",0.078947,0.078947,0.078947,1.0,12.666667,0.072715,inf


### Make Recommendation

In [41]:
# Given an item name, return the items that are frequently bought together with it
def frequently_bought_t(item):
    # Keep only the baskets that contain the given item
    item_d = my_basket_sets.loc[my_basket_sets[item] == True]
    # Apply the Apriori algorithm to those baskets
    frequentitemsets = apriori(item_d, min_support=0.8, use_colnames=True)
    # Derive association rules from the frequent itemsets
    rules = association_rules(frequentitemsets, metric="confidence", min_threshold=0.7)
    # Sort by confidence, highest first (sort_values returns a new DataFrame, so reassign it)
    rules = rules.sort_values('confidence', ascending=False, kind='mergesort').reset_index(drop=True)
    print(f'Items frequently bought together with {item} : ')
    # Return up to 6 distinct consequents, in order of decreasing confidence
    return rules['consequents'].unique()[:6]

In [42]:
frequently_bought_t('JUMBO BAG TOYS')

Items frequently bought together with JUMBO BAG TOYS


array([frozenset({'CHILDRENS APRON APPLES DESIGN'}),
       frozenset({'JUMBO BAG TOYS'}),
       frozenset({'JUMBO BAG WOODLAND ANIMALS'}),
       frozenset({'RECYCLING BAG RETROSPOT'}),
       frozenset({'TOY TIDY PINK POLKADOT'}),
       frozenset({'TOY TIDY SPACEBOY'})], dtype=object)
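
Note that the output above lists JUMBO BAG TOYS among its own recommendations, because the item is present in every basket passed to `apriori`. As a hedged alternative, the sketch below ranks co-purchased items directly by the confidence of the rule item -> other and excludes the queried item itself. The name `bought_with` and its parameters are hypothetical, and the example runs on the small five-basket dataset rather than the retail data.

```python
from collections import Counter

transactions = [
    ['Milk', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Beans', 'Ice cream', 'Eggs'],
]


def bought_with(item, baskets, top_n=6):
    """Rank other items by confidence(item -> other); hypothetical helper."""
    # Keep only the baskets that contain the queried item
    with_item = [set(b) for b in baskets if item in b]
    if not with_item:
        return []
    # Count how often each other item co-occurs with the queried item
    counts = Counter(other for b in with_item for other in b if other != item)
    # Highest count first; ties are broken alphabetically for a stable result
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(other, count / len(with_item)) for other, count in ranked[:top_n]]


recommendations = bought_with('Onion', transactions, top_n=4)
print(recommendations)
```

Dividing each co-occurrence count by the number of baskets containing the item gives exactly confidence(item -> other), so the ranking is consistent with the confidence metric used throughout this notebook.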