Notebook is copyright &copy; of <a href="https://ajaytech.co">Ajay Tech</a>. You can find the same content online at <a href="https://ajaytech.co/python-association-rule-learning">Associate Rule Learning in Python</a> or on <a href="https://github.com/ajaytech002">Ajay Tech's github page</a>

# Association Rule Learning

## Contents

- What is Association Rule Learning
- Key Terms
  - Support
  - Confidence
  - Lift
- Apriori Algorithm
- Eclat Algorithm

### What is Association Rule Learning

Association Rule Learning (or mining) is a Machine Learning technique for discovering relationships between variables. What is new about this, you must be wondering? Standard statistical methods like correlation or regression also do the same thing, right?

Well, for starters, those are typically either supervised algorithms or a way to quantify the relationship between a known set of variables. For example, find the relationship between
- smoking and cancer
- cholesterol and heart disease etc

Association Rule Learning, on the other hand, **discovers** or __learns__ relationships between variables that you might not be aware of. That is why it is classified as an _unsupervised_ Machine Learning Algorithm. It was first introduced in 1993, when a group of researchers were interested in finding relationships between items sold in supermarkets, based on data from their Point-of-Sale systems. Here are two classic examples.

- The classic example of an unusual relationship that human intuition would easily miss is the one between baby diapers and beer in supermarket sales.
- Another example, on a much larger scale, is the movie recommender systems of Netflix and Amazon Prime Video. Even if you have not used either of them, you have probably experienced the same thing in YouTube's video recommendations. They are pretty accurate, actually.

### Key Terms

Before we get into the actual algorithms, let's understand three key terms

- Support
- Lift
- Confidence

Once we understand these terms, we can move on to the actual algorithm itself.


Imagine I have made the following transactions in a super market over the course of a week.

**Txn ID - Items**
- 154265 - { Milk }
- 645858 - { Milk, Cheese }
- 588455 - { Milk, Water, Vegetables }
- 554855 - { Milk, Cheese, Yoghurt, Water }
- 558965 - { Water, Vegetables }

Say we are trying to quantify the association (or rule) between the items Milk and Cheese. Specifically the association 

- Milk -> Cheese

and not the other way around ( NOT Cheese -> Milk ). Meaning, we are trying to quantify the association that implies that I buy Cheese if I already have Milk in my basket.

#### Support


**Support** is a measure of how frequently an item or an item set appears in a dataset. For example, what is the support for the item set { Milk + Cheese } ?


- 154265 - { Milk }
- **645858 - { Milk, Cheese }**
- 588455 - { Milk, Water, Vegetables }
- **554855 - { Milk, Cheese, Yoghurt, Water }**
- 558965 - { Water, Vegetables }

# $ Support_{\color{blue}{Milk + Cheese}} = \frac{Occurrences \ of \ \{Milk, Cheese\}}{Total \ transactions} = \frac{2}{5} = 0.4$

## $Support_{\color{blue}{X + Y}} = How \ Frequent \ is \ the \ combination - {\{X+Y\}}$
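To make the definition concrete, here is a minimal sketch that computes support for the five example transactions. The `support` helper is a hypothetical name written for this illustration, and the last basket is assumed to be { Water, Vegetables }.

```python
# The five example transactions, one set of items per basket.
transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

print(support({"Milk", "Cheese"}, transactions))  # 0.4
print(support({"Milk"}, transactions))            # 0.8
```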

#### Confidence

**Confidence** is a measure of how often this rule is found to be true. It is defined as follows. 

## $ Confidence_{\color{blue}{X->Y}} =  \frac {Support_{ (X \bigcup Y)}}{Support_X}$

For example, in our case, the Confidence for the combination { Milk -> Cheese } would be

## $ Confidence_{\color{blue}{Milk -> Cheese}} = \frac{Support_{Milk + Cheese}}{Support_{Milk}} = \frac{0.4}{\frac{4}{5}} = \frac{0.4}{0.8} = \frac{1}{2} = 0.5$
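The same calculation can be sketched in a few lines of Python. The helper names `support` and `confidence` are made up for this illustration. The sketch also shows that confidence depends on the direction of the rule: { Cheese -> Milk } gives a different value from { Milk -> Cheese }.

```python
# The five example transactions, one set of items per basket.
transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the combined item set divided by the support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"Milk"}, {"Cheese"}))  # 0.5
print(confidence({"Cheese"}, {"Milk"}))  # 1.0
```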

#### Lift

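**Lift** of a rule is defined as follows.

## $Lift_{\color{blue}{X->Y}} = \frac {Support_{ (X \bigcup Y)}}{{Support_X \ } \times {\ Support_Y}}$

For example, in our case, the Lift for the rule { Milk -> Cheese } would be

## $Lift_{\color{blue}{Milk->Cheese}} = \frac{Support_{Milk + Cheese}}{{Support_{Milk} \ } \times {\ Support_{Cheese}}} = \frac{0.4}{{0.8 \ } \times {\ 0.4}} = 1.25$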

Now that we have got the math behind us, let's define in simple terms what these terms mean.

- $Support_{X->Y}$  -  How frequent is this combination? This is relatively straightforward: it is the number of transactions that contain the combination, divided by the total number of transactions.
- $Confidence_{X->Y}$ - How often does the rule hold? In other words, how likely is it that Y is purchased when X is purchased.
- $Lift_{X->Y}$ - Defines the strength of the relationship.
  - Lift = 1
    - P(X and Y) = P(X) × P(Y), meaning the two events are independent (unrelated).
  - Lift > 1
    - X and Y are positively associated. In the example above, since _Lift_ > 1, Milk and Cheese occur together more often than we would expect if they were bought independently.
  - Lift < 1
    - X and Y have a negative relationship. In the case of Milk & Cheese above, if _Lift_ was < 1, Milk and Cheese would occur together less often than we would expect if they were bought independently.
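As a rough illustration of the three cases, the sketch below computes lift for a few pairs from the five example transactions. The helper names are hypothetical, and results are rounded because floating-point division can be off in the last digit.

```python
# The five example transactions, one set of items per basket.
transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)

def lift(x, y):
    """Support of X and Y together, relative to what independence would predict."""
    return support(x | y) / (support(x) * support(y))

print(round(lift({"Milk"}, {"Cheese"}), 3))       # 1.25  -> positive association
print(round(lift({"Water"}, {"Vegetables"}), 3))  # 1.667 -> positive association
print(round(lift({"Milk"}, {"Vegetables"}), 3))   # 0.625 -> negative association
```

Note that lift is symmetric: swapping X and Y gives the same value, unlike confidence.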

With the key terms defined, let's go on to the implementation of the actual algorithms. We are going to focus on just 2 of the rule mining algorithms

- Apriori 
- Eclat

### Apriori Algorithm

Apriori is an algorithm that combs through large datasets to identify rules (associations). At a high level, this is how it works.

- **Step 1** - Identify single items and how frequently they occur - Call this __set 1__. To reduce the complexity, we typically set a minimum support level.
  - _For example, in a supermarket dataset, this step identifies all the individual items (Milk, Cheese, Water etc) and how frequently they occur. If some items (say exotic spices) are not all that frequent, they are removed from this set (not from the actual dataset)_
  
__set 1__ <br>
**_Item   -   Frequency_** <br>
Milk   -   4 <br>
Cheese -   2 <br>
Water  -   3 <br>
Vegetables - 2 <br>
<strike>Yoghurt   - 1</strike> <br>

Say we set a cut-off of 2 (a minimum support of 40%). We are then left with 4 items (Yoghurt is left out).


- **Step 2** - Prepare all 2-item combinations of items in __set 1__ . Once again go through the original dataset to find the frequency of occurrence of each of the 2-item combinations. Once again, to reduce the complexity, set a minimum support level.
  - _For example, we have identified in **set 1** above that {Milk, Cheese, Water, Vegetables} each occur in at least 40% of the transactions. Now, identify all 2-item combinations._
  
__set 2__ <br>
**_Item set - Frequency_** <br>
{Milk, Cheese}  - 2 <br>
{Milk, Water} - 2 <br>
<strike>{Milk, Vegetables} - 1 <br></strike>
<strike>{Cheese, Water} - 1<br></strike>
<strike>{Cheese, Vegetables} - 0 <br></strike>
{Water, Vegetables} - 2<br>

Once again, with a cut-off of 2, only 3 item sets remain <br><br>
         {Milk, Cheese}<br>
         {Milk, Water}<br>
         {Water, Vegetables}<br>
         
         
- **Step 3** - Increase the combination size by one, building candidates from the item sets that survived in _set 2_, and repeat *step 2* recursively until no more sets are found. 
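The three steps can be sketched in plain Python. This is only a minimal illustration of the level-wise idea, not how any of the packages below implement it. The counts are recomputed directly from the five example transactions, and `MIN_COUNT` and the helper names are assumptions made for this sketch.

```python
from itertools import combinations

transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]
MIN_COUNT = 2  # cut-off: an item set must appear in at least 2 transactions

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket)

# Step 1: frequent single items
items = sorted({item for basket in transactions for item in basket})
current = [frozenset([item]) for item in items if count(frozenset([item])) >= MIN_COUNT]

frequent = {}
k = 1
while current:
    for itemset in current:
        frequent[itemset] = count(itemset)
    k += 1
    # Steps 2 and 3: build k-item candidates from the surviving (k-1)-item sets
    candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
    # Apriori pruning: every (k-1)-item subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    current = [c for c in candidates if count(c) >= MIN_COUNT]

for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The loop stops at size 3 because no 3-item candidate has all of its 2-item subsets above the cut-off.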

#### Implementation

Scikit Learn does not have Association Rule mining algorithms. Luckily, there are many implementations of the Apriori Algorithm available as standard Python packages. This notebook uses three of them: _apyori_, _efficient-apriori_ and _mlxtend_. All of them can be installed using pip.

<pre>
> pip install apyori efficient-apriori mlxtend
</pre>

In [4]:
from apyori import apriori

transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
]
results = list(apriori(transactions))
for result in results :
    print (result)

RelationRecord(items=frozenset({'beer'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'beer'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'cheese'}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'cheese'}), confidence=0.5, lift=1.0)])
RelationRecord(items=frozenset({'nuts'}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'nuts'}), confidence=0.5, lift=1.0)])
RelationRecord(items=frozenset({'beer', 'cheese'}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset({'beer'}), items_add=frozenset({'cheese'}), confidence=0.5, lift=1.0), OrderedStatistic(items_base=frozenset({'cheese'}), items_add=frozenset({'beer'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'beer', 'nuts'}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset({'beer'}), items_add=frozenset({'nuts'}), conf

Let's increase the size of the dataset. Instacart has a test database (roughly 200M) that you can download from https://www.instacart.com/datasets/grocery-shopping-2017. I have simplified the dataset and provided it in the data folder. You can access it here.

In [5]:
import pandas as pd

data = pd.read_csv("./data/instacart.csv")
data.head()

Unnamed: 0,order_id,product_id
0,1,49302
1,1,11109
2,1,10246
3,1,49683
4,1,43633


The column *order_id* is the order number and *product_id* is the id of the product ordered. For example, the five rows above all belong to order number 1, one row per product in that order. The **efficient-apriori** package discussed above needs the data in a different format: a list of transactions, each of which is a list (or tuple) of items. Let's prepare the dataset accordingly. 

In [6]:
data.shape

(1048575, 2)

In [7]:
data.dtypes

order_id      int64
product_id    int64
dtype: object

Since the product id (*product_id*) is an integer, let's convert it to a string so that we can do string concatenation later.


In [8]:
data["product_id"] = data["product_id"].astype(str)
data.dtypes

order_id       int64
product_id    object
dtype: object

In [9]:
# this will hold the data in the format that the "efficient-apriori" algorithm requires.
# First, build one comma-separated string of product ids per order
transactions = data.groupby("order_id")["product_id"].apply(",".join)

In [10]:
dataset = []
for i in range(transactions.shape[0]) :
    dataset.append(list(transactions.iloc[i].split(","))) 
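As an aside, the same list-of-lists can be built without joining the ids into a string and splitting them again. The sketch below uses only the standard library, on a handful of (order_id, product_id) rows copied from the outputs shown in this notebook. The variable names are made up for this illustration.

```python
from collections import defaultdict

# (order_id, product_id) rows: the first five mirror data.head(),
# the last two are the first products of order 36 shown later in the notebook.
rows = [
    (1, 49302), (1, 11109), (1, 10246), (1, 49683), (1, 43633),
    (36, 39612), (36, 19660),
]

# Group product ids (as strings) by order id.
baskets = defaultdict(list)
for order_id, product_id in rows:
    baskets[order_id].append(str(product_id))

# One list of items per order, which is the shape the apriori packages expect.
dataset = list(baskets.values())
print(dataset)
```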

In [63]:
from efficient_apriori import apriori

# with min_support=0.1 an item set must appear in at least 10% of all orders
itemsets, rules = apriori(dataset, min_support=0.1,  min_confidence=0.5)
print(rules)  # list of rules; empty if no rule clears both thresholds
for rule in rules :
    print ( rule )

print ( itemsets)

[]
{1: {('13176',): 11639, ('24852',): 14136}}
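The rule list is empty because, at a minimum support of 10%, only two single products are frequent and no pair of products is, so there is nothing to build a rule from. Before picking a threshold it can help to look at how the per-item supports are distributed. The sketch below does that on a made-up toy dataset. `item_supports` is a hypothetical helper, not part of any package used here.

```python
from collections import Counter

def item_supports(dataset):
    """Support of every single item, most frequent first."""
    n = len(dataset)
    # set(basket) makes sure an item is counted once per transaction
    counts = Counter(item for basket in dataset for item in set(basket))
    return {item: c / n for item, c in counts.most_common()}

# Toy transactions, for illustration only
toy = [["a", "b"], ["a", "c"], ["a"], ["b", "d"]]
supports = item_supports(toy)
print(supports)
```

On a real dataset, a `min_support` somewhat below the supports of the most frequent items is a reasonable starting point to experiment with.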


In [48]:
dataset_1 = [('49302', '11109', '10246', '49683', '43633', '13176', '47209', '22035'),
 ('39612', '19660', '49235', '43086', '46620', '34497', '48679', '46979'),
('49302', '11109', '10246', '49683', '43633', '13176', '47209', '22035'),
('49302', '11109', '47209', '22035'),
('49302', '11109', '10246', '49683', '43633', '13176'),
('49302', '11109', '10246', '49683', '43633', '13176', '47209', '22035'),
( '10246', '49683', '43633')]


In [40]:
t.head()

order_id
1     (49302,11109,10246,49683,43633,13176,47209,22035)
36    (39612,19660,49235,43086,46620,34497,48679,46979)
38    (11913,18159,4461,21616,23622,32433,28842,4262...
96          (20574,30391,40706,25610,27966,24489,39275)
98    (8859,19731,43654,13176,4357,37664,34065,35951...
Name: product_id, dtype: object


In [13]:
from apyori import apriori

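tx_data = dataset

# Note: this call passes `transactions` (a Series of comma-joined strings) instead of `tx_data`,
# so every character of each string is treated as an item - see the ',' and digit "items" in the output
results = list(apriori(transactions))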

len(results)
results[0:10]

[RelationRecord(items=frozenset({','}), support=0.9476670616827686, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({','}), confidence=0.9476670616827686, lift=1.0)]),
 RelationRecord(items=frozenset({'0'}), support=0.8775081848675357, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'0'}), confidence=0.8775081848675357, lift=1.0)]),
 RelationRecord(items=frozenset({'1'}), support=0.9404161728965392, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'1'}), confidence=0.9404161728965392, lift=1.0)]),
 RelationRecord(items=frozenset({'2'}), support=0.9403559162030248, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'2'}), confidence=0.9403559162030248, lift=1.0)]),
 RelationRecord(items=frozenset({'3'}), support=0.9383373169702934, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'3'}), confidence=0.9383373169702934, lift=1.0
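The output above lists single characters such as ',' and '0' as items. That is consistent with each transaction being a comma-joined string rather than a list: iterating over a string yields its characters. A small standard-library sketch of the difference, using a shortened order string for illustration:

```python
# One order as a comma-joined string (shortened for illustration)
order = "49302,11109,10246"

# Iterating over the string yields single characters...
chars = sorted(set(order))
print(chars)  # [',', '0', '1', '2', '3', '4', '6', '9']

# ...whereas splitting on the comma yields the product ids.
items = order.split(",")
print(items)  # ['49302', '11109', '10246']
```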

In [20]:
from apyori import apriori

tx_data = dataset

results = list(apriori(tx_data,min_lift = 3))

len(results)
results

[]

In [32]:
t

('49302', '11109', '10246', '49683', '43633', '13176', '47209', '22035')

In [19]:
len(tx_data )

99574

In [95]:
# load the gifts dataset (tab-delimited, latin1-encoded)
data = pd.read_csv("./data/gifts.txt", delimiter="\t", encoding='latin1')

# invoice numbers are identifiers, so treat them as strings
data["InvoiceNo"] = data["InvoiceNo"].astype(str)

# drop rows with missing values
data = data.dropna()

In [96]:
transactions = data.groupby("InvoiceNo")["Description"].apply(",".join)

In [101]:
dataset = []
for i in range(transactions.shape[0]) :
    dataset.append(list(transactions.iloc[i].split(","))) 

In [107]:
from apyori import apriori

tx_data = dataset

results = list(apriori(tx_data))

results

[]

In [109]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [111]:
from mlxtend.frequent_patterns import apriori

apriori(df)

Unnamed: 0,support,itemsets


In [177]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

data = pd.read_excel('./data/online_retail.xlsx')
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [126]:
data.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [178]:
data["Description"] = data["Description"].astype(str)
data["InvoiceNo"] = data["InvoiceNo"].astype(str)

In [114]:
# df['Description'] = df['Description'].str.strip()

In [179]:
data = data.dropna()

In [117]:
# df = df[~df['InvoiceNo'].str.contains('C')]

In [118]:
basket = (data[data['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [119]:
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
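To see what the basket construction and the 0/1 encoding do, here is a sketch of the same pipeline on a tiny, made-up set of invoice lines. It uses a vectorised comparison instead of `applymap`, which is a swapped-in alternative rather than what the cells above do. All invoice numbers, descriptions and quantities are hypothetical.

```python
import pandas as pd

# Hypothetical long-format invoice lines (one row per product line).
lines = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A1", "B2", "B2", "C3"],
    "Description": ["MUG", "CANDLE", "MUG", "MUG", "POSTAGE", "CANDLE"],
    "Quantity":    [2, 1, 3, 1, 1, -1],
})

# One row per invoice, one column per product, total quantity in each cell.
basket = (lines.groupby(["InvoiceNo", "Description"])["Quantity"]
               .sum()
               .unstack(fill_value=0))

# 1 if the product was bought on that invoice, 0 otherwise
# (zero or negative quantities, e.g. returns, become 0).
basket_sets = (basket > 0).astype(int)
basket_sets = basket_sets.drop(columns="POSTAGE")
print(basket_sets)
```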

In [120]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)


In [121]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.102041,0.07398,0.763158,7.478947,0.064088,3.791383
1,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.096939,0.07398,0.725,7.478947,0.064088,3.283859
2,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181
3,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878
4,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE PINK),0.094388,0.102041,0.07398,0.783784,7.681081,0.064348,4.153061



In [167]:
b = data.groupby(["InvoiceNo","Description"])["Quantity"].sum()

In [214]:
# Keep only the orders from Germany
data_de = data[data["Country"] == "Germany"]

# 
# data = data.groupby(["InvoiceNo","Description"])["Quantity"].sum()
# data.head()

In [215]:
data_de = data_de.groupby(["InvoiceNo","Description"])["Quantity"].sum()

In [216]:
data_de.head(20)

InvoiceNo  Description                        
536527     3 HOOK HANGER MAGIC GARDEN             12
           5 HOOK HANGER MAGIC TOADSTOOL          12
           5 HOOK HANGER RED MAGIC TOADSTOOL      12
           ASSORTED COLOUR LIZARD SUCTION HOOK    24
           CHILDREN'S CIRCUS PARADE MUG           12
           HOMEMADE JAM SCENTED CANDLES           12
           HOT WATER BOTTLE BABUSHKA               4
           JUMBO BAG OWLS                         10
           JUMBO BAG WOODLAND ANIMALS             10
           MULTI COLOUR SILVER T-LIGHT HOLDER     12
           PACK 3 FIRE ENGINE/CAR PATCHES         12
           PICTURE DOMINOES                       12
           POSTAGE                                 1
           ROTATING SILVER ANGELS T-LIGHT HLDR     6
           SET OF 6 T-LIGHTS SANTA                 6
536840     6 RIBBONS RUSTIC CHARM                 12
           60 CAKE CASES VINTAGE CHRISTMAS        24
           60 TEATIME FAIRY CAKE CASES            24

In [217]:
data_de = data_de.unstack()
data_de.head()

Description,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536527,,,,,,,,,,,...,,,,,,,,,,
536840,,,,,,,,,,,...,,,,,,,,,,
536861,,,,,,,,,,,...,,,,,,,,,,
536967,,,,,,,,,,,...,,,,,,,,,,
536983,,,,,,,,,,,...,,,,,,,,,,


In [209]:
# data_de = data_de.reset_index()

In [210]:
data_de.head()

Description,InvoiceNo,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
0,536527,,,,,,,,,,...,,,,,,,,,,
1,536840,,,,,,,,,,...,,,,,,,,,,
2,536861,,,,,,,,,,...,,,,,,,,,,
3,536967,,,,,,,,,,...,,,,,,,,,,
4,536983,,,,,,,,,,...,,,,,,,,,,


In [218]:
data_de = data_de.fillna(0)
data_de.head()

Description,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536840,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536983,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [213]:
# only needed if reset_index() was run earlier; after unstack() InvoiceNo is already the index
data_de = data_de.set_index("InvoiceNo")
data_de.head()

Description,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536840,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536983,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



In [226]:
def reduce_to_binary(qty) : 
    if qty >= 1 :
        return 1
    if qty <= 0 :
        return 0

In [227]:
data_de = data_de.applymap(reduce_to_binary)

In [228]:
data_de.head()

Description,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536527,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536840,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536861,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536967,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [233]:
frequent_itemsets = apriori(data_de, min_support=0.07, use_colnames=True)

In [232]:
data_de = data_de.dropna()

In [234]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(6 RIBBONS RUSTIC CHARM),(POSTAGE),0.102845,0.818381,0.091904,0.893617,1.091933,0.007738,1.707221
1,(POSTAGE),(6 RIBBONS RUSTIC CHARM),0.818381,0.102845,0.091904,0.112299,1.091933,0.007738,1.010651
2,(JUMBO BAG WOODLAND ANIMALS),(POSTAGE),0.100656,0.818381,0.087527,0.869565,1.062544,0.005152,1.392414
3,(POSTAGE),(JUMBO BAG WOODLAND ANIMALS),0.818381,0.100656,0.087527,0.106952,1.062544,0.005152,1.007049
4,(PLASTERS IN TIN CIRCUS PARADE ),(POSTAGE),0.115974,0.818381,0.100656,0.867925,1.060539,0.005746,1.375117


In [235]:
data_de = data_de.drop(columns=['POSTAGE'])

In [247]:
frequent_itemsets = apriori(data_de, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

ValueError: cannot call `vectorize` on size 0 inputs unless `otypes` is set

In [246]:
rules.sort_values(by = ["lift"],ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
10,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.070022,0.126915,0.059081,0.84375,6.648168,0.050194,5.587746
11,(WOODLAND CHARLOTTE BAG),(RED RETROSPOT CHARLOTTE BAG),0.126915,0.070022,0.059081,0.465517,6.648168,0.050194,1.739959
0,(PLASTERS IN TIN CIRCUS PARADE ),(PLASTERS IN TIN WOODLAND ANIMALS),0.115974,0.137856,0.067834,0.584906,4.242887,0.051846,2.076984
1,(PLASTERS IN TIN WOODLAND ANIMALS),(PLASTERS IN TIN CIRCUS PARADE ),0.137856,0.115974,0.067834,0.492063,4.242887,0.051846,1.740427
6,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.107221,0.137856,0.061269,0.571429,4.145125,0.046488,2.01167
7,(PLASTERS IN TIN WOODLAND ANIMALS),(PLASTERS IN TIN SPACEBOY),0.137856,0.107221,0.061269,0.444444,4.145125,0.046488,1.607002
12,(ROUND SNACK BOXES SET OF 4 FRUITS ),(ROUND SNACK BOXES SET OF4 WOODLAND ),0.157549,0.245077,0.131291,0.833333,3.400298,0.092679,4.52954
13,(ROUND SNACK BOXES SET OF4 WOODLAND ),(ROUND SNACK BOXES SET OF 4 FRUITS ),0.245077,0.157549,0.131291,0.535714,3.400298,0.092679,1.814509
14,(SPACEBOY LUNCH BOX ),(ROUND SNACK BOXES SET OF4 WOODLAND ),0.102845,0.245077,0.070022,0.680851,2.778116,0.044817,2.365427
15,(ROUND SNACK BOXES SET OF4 WOODLAND ),(SPACEBOY LUNCH BOX ),0.245077,0.102845,0.070022,0.285714,2.778116,0.044817,1.256018
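As a sanity check, the metric columns can be recomputed by hand from the three support columns. The sketch below does this for the row with the highest lift (RED RETROSPOT CHARLOTTE BAG -> WOODLAND CHARLOTTE BAG), using the rounded values printed in the table, so the results only agree to a few decimal places. Leverage and conviction are not defined earlier in this notebook. The formulas used here are the standard ones: leverage is the support minus the product of the two individual supports, and conviction is (1 - consequent support) / (1 - confidence).

```python
# Values copied from the table above (rounded to 6 decimals there).
antecedent_support = 0.070022   # RED RETROSPOT CHARLOTTE BAG
consequent_support = 0.126915   # WOODLAND CHARLOTTE BAG
support = 0.059081              # both together

confidence = support / antecedent_support
lift = confidence / consequent_support  # same as support / (antecedent * consequent)
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```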
