# Association Rule - Apriori and ECLAT 

Training association rule models (Apriori and ECLAT) to find the most related items bought by customers of a French supermarket during one week. Each of the 7501 rows of the dataset lists the items bought by a unique customer during that week.

These algorithms capture which products customers tend to buy together. The results can be used to generate product recommendations and to guide product placement.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Data Loading
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)

# Adding all customers into a list of lists
transactions = []
for i in range(0, len(dataset)):
    transactions.append([str(dataset.values[i,j]) for j in range(0, 20)])

FileNotFoundError: [Errno 2] No such file or directory: 'Market_Basket_Optimisation.csv'

In [3]:
dataset.head(20)

NameError: name 'dataset' is not defined

### Apriori implementation using apyori library 
source: https://github.com/ymoch/apyori

The goal of this part is to find, using the Apriori algorithm, which products are bought together more often than other combinations.

We will apply a few transformations to fit the results into dataframes and make them easier to inspect.

In [4]:
# Inspecting elements
transactions[:3]

NameError: name 'transactions' is not defined

In [5]:
# Training Apriori on the dataset
# The hyperparameters chosen for this training are:
# min_support: items bought at least 3 times a day * 7 days (week) / 7500 customers = 0.0028, rounded to 0.003
# min_confidence: at least 20%; min_lift: at least 3 (anything lower is too weak)
# min_length: we want at least 2 items to be associated; a single item in the result is pointless

from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)

NameError: name 'transactions' is not defined
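
As a rough illustration of what the three thresholds passed to `apriori` measure, the sketch below computes support, confidence and lift by hand on a tiny invented basket list. The helper names (`support`, `confidence`, `lift`) and the toy data are assumptions made for this example; they are not part of apyori or of the supermarket dataset.

```python
# Toy transactions, invented for illustration only
toy_transactions = [
    ["pasta", "olive oil", "water"],
    ["pasta", "olive oil"],
    ["water", "bread"],
    ["bread"],
    ["pasta", "water"],
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(base, add, transactions):
    """Among transactions containing `base`, the fraction that also contain `add`."""
    return support(set(base) | set(add), transactions) / support(base, transactions)

def lift(base, add, transactions):
    """Confidence of base -> add relative to how common `add` is overall."""
    return confidence(base, add, transactions) / support(add, transactions)

supp = support(["olive oil", "pasta"], toy_transactions)
conf = confidence(["olive oil"], ["pasta"], toy_transactions)
lft = lift(["olive oil"], ["pasta"], toy_transactions)
print(supp, conf, lft)
```

A lift above 1 means the two sides of the rule appear together more often than they would if they were bought independently, which is why a high `min_lift` keeps only the strongest associations.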

In [6]:
# Visualising the results
results = list(rules)

NameError: name 'rules' is not defined

In [7]:
lift = []
association = []
for record in results:
    # Each record holds the itemset and its ordered statistics; take the lift of the first rule
    lift.append(record.ordered_statistics[0].lift)
    association.append(list(record.items))

NameError: name 'results' is not defined

### Visualizing results in a dataframe

In [8]:
rank = pd.DataFrame([association, lift]).transpose()
rank.columns = ['Association', 'Lift']

In [9]:
# Show the 10 highest lift scores
rank.sort_values('Lift', ascending=False).head(10)

Unnamed: 0,Association,Lift


According to this study, "olive oil, whole wheat pasta, mineral water" is the combination with the highest lift this week for the supermarket in question.

## ECLAT Implementation

This is a hand-written implementation of ECLAT. It computes which pairs have been bought more frequently than other pairs. At the end, we expect to see the most common combination of products during the week.

The code can be extended to find the most common combinations of three items, four items, and so on.
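
One possible way to generalise to any combination size is sketched below. It borrows the vertical layout usually associated with ECLAT (each item mapped to the set of transaction indices that contain it) and intersects those sets. This is only a sketch under stated assumptions: the function name `eclat_support`, the `ignore` parameter and the toy data are hypothetical, and a full ECLAT would explore itemsets depth-first and prune by a minimum support instead of enumerating every combination.

```python
from itertools import combinations

def eclat_support(transactions, k, ignore=("nan",)):
    """Support of every k-item combination that occurs, via tid-set intersection."""
    # Vertical layout: item -> set of transaction indices containing it
    tidsets = {}
    for tid, basket in enumerate(transactions):
        for item in set(basket):
            if item not in ignore:
                tidsets.setdefault(item, set()).add(tid)

    n = len(transactions)
    supports = {}
    for combo in combinations(sorted(tidsets), k):
        # Transactions containing all items = intersection of their tid-sets
        common = set.intersection(*(tidsets[item] for item in combo))
        if common:
            supports[combo] = len(common) / n
    return supports

# Toy baskets padded with 'nan', mimicking the shape of the notebook's transactions
toy = [
    ["pasta", "olive oil", "water", "nan"],
    ["pasta", "olive oil", "nan", "nan"],
    ["water", "bread", "nan", "nan"],
    ["pasta", "water", "nan", "nan"],
]
pairs = eclat_support(toy, 2)
trios = eclat_support(toy, 3)
print(pairs)
print(trios)
```

Changing `k` is all that is needed to move from pairs to trios or larger sets, although the number of combinations grows quickly with `k`.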

#### Getting the list of products bought this week by all customers

In [10]:
# Putting all transactions in a single list
itens = []
for i in range(0, len(transactions)):
    itens.extend(transactions[i])

# Finding unique items from transactions and removing nan
uniqueItems = list(set(itens))
uniqueItems.remove('nan')

NameError: name 'transactions' is not defined

In [11]:
uniqueItems

NameError: name 'uniqueItems' is not defined

#### Creating combinations with the items - pairs

In [12]:
# Every combination of two distinct items, in the same order as the nested loops would produce
from itertools import combinations
pair = [list(p) for p in combinations(uniqueItems, 2)]

NameError: name 'uniqueItems' is not defined

In [13]:
pair

[]

#### Calculating score
The score is the number of customers who bought both items of the pair, divided by the total number of customers in the week (7501). It is computed for every possible pair and stored in the `score` list.

$ score = \frac{\text{number of lists that contain [item x and item y]}} {\text{number of all lists}} $
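
The same formula can also be evaluated in a single pass over the baskets, counting only the pairs that actually occur instead of testing every possible pair. The sketch below shows the idea on invented toy data; the variable names (`toy`, `pair_counts`, `scores`) are assumptions for this example.

```python
from collections import Counter
from itertools import combinations

# Toy baskets, invented for illustration only
toy = [
    ["pasta", "olive oil", "water"],
    ["pasta", "olive oil"],
    ["water", "bread"],
    ["pasta", "water"],
]

pair_counts = Counter()
for basket in toy:
    # Sort so that each pair is always stored in the same order; drop 'nan' padding if present
    items = sorted(set(basket) - {"nan"})
    pair_counts.update(combinations(items, 2))

# score = number of lists containing both items / number of all lists
scores = {p: count / len(toy) for p, count in pair_counts.items()}
print(scores)
```

Pairs that never occur together are simply absent from `scores`, which corresponds to a score of 0 in the formula.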

In [14]:
%%time
score = []
for i in pair:
    # Count the transactions that contain every item of the pair
    matches = sum(all(item in s for item in i) for s in transactions)
    score.append(matches / len(transactions))

Wall time: 0 ns


#### Showing results

Top 10 most common pairs of items this week

In [15]:
ranking_ECLAT = pd.DataFrame([pair, score]).transpose()
ranking_ECLAT.columns = ['Pair', 'Score']

In [16]:
ranking_ECLAT.sort_values('Score', ascending=False).head(10)

Unnamed: 0,Pair,Score


### What if we do that for trios?

In [17]:
# Creating trios: every combination of three distinct items
# (indexing with j+k and j+l skipped many valid trios)
from itertools import combinations
trio = [list(t) for t in combinations(uniqueItems, 3)]

NameError: name 'uniqueItems' is not defined

In [18]:
trio[:5]

[]

In [19]:
%%time
score_trio = []
for i in trio:
    # Count the transactions that contain every item of the trio
    matches = sum(all(item in s for item in i) for s in transactions)
    score_trio.append(matches / len(transactions))

Wall time: 0 ns


In [20]:
ranking_ECLAT_trio = pd.DataFrame([trio, score_trio]).transpose()
ranking_ECLAT_trio.columns = ['Trio', 'Score']
ranking_ECLAT_trio.sort_values('Score', ascending=False).head(10)

Unnamed: 0,Trio,Score


## What about comparing the results from Apriori and ECLAT?

Apriori tells us that the combination with the most "attractiveness power" (the highest lift) is "olive oil", "whole wheat pasta" and "mineral water". Running the ECLAT code on this set of items gives a score of 0.0039.

This score is too low to place the trio among the top 10, but the two methods measure different things. Apriori ranks by lift: how much more often items are bought together than would be expected if they were bought independently. For example, a customer who picks 'olive oil' is much more likely than the average customer to also pick 'whole wheat pasta' and 'mineral water'. ECLAT, on the other hand, simply ranks combinations by how often they appear across all lists, regardless of how one item influences the purchase of another.

In [21]:
i = ["olive oil", "whole wheat pasta", "mineral water"]
cond = []
for item in i:
    cond.append('("{}") in s'.format(item))
mycode = ('[s for s in transactions if ' + ' and '.join(cond) + ']')
tra = eval(mycode)

NameError: name 'transactions' is not defined

In [22]:
cond

['("olive oil") in s', '("whole wheat pasta") in s', '("mineral water") in s']

In [23]:
mycode

'[s for s in transactions if ("olive oil") in s and ("whole wheat pasta") in s and ("mineral water") in s]'

In [24]:
tra

NameError: name 'tra' is not defined

In [25]:
print ('Score for "olive oil", "whole wheat pasta", "mineral water": {}'.format(len(tra)/7501.))

NameError: name 'tra' is not defined