# The basics of market basket analysis

Market basket analysis uses lists of transactions to identify useful associations between items. Such associations can be written in the form of a rule that has an antecedent and a consequent. Let's assume a small grocery store has asked you to look at their transaction data. After some analysis, you find the rule given below.

In [1]:
import pandas as pd 
import numpy as np
from scipy.stats import ttest_ind
import seaborn as sns
import matplotlib.pyplot as plt

from tweepy import OAuthHandler
from tweepy import API
from tweepy import Stream
import json

from itertools import permutations

from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

sns.set_style('darkgrid')

## Read Datasets

### Movies

In [None]:
movies = pd.read_csv('movielens_movies.csv')
movies.head()

In [None]:
genres = movies['genres'].apply(lambda t: t.split('|'))
genres = list(genres)
genres

In [None]:
movies['genres'].value_counts()

### Books 

In [9]:
bookstore = pd.read_csv('bookstore_transactions.csv')
bookstore.head()

Unnamed: 0,Transaction
0,"History,Bookmark"
1,"History,Bookmark"
2,"Fiction,Bookmark"
3,"Biography,Bookmark"
4,"History,Bookmark"


In [15]:
transactions = bookstore['Transaction'].apply(lambda t: t.split(','))
transactions = list(transactions)
transactions

[['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Poetry', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Fiction', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['History', 'Bookmark'],
 ['History', 'Bookmark'],
 ['Biography', 'Bookmark'],
 ['Biography', 'B

In [11]:
bookstore['Transaction'].value_counts()

Biography,Bookmark    40
History,Bookmark      25
Fiction,Bookmark      25
Poetry,Bookmark        9
Name: Transaction, dtype: int64

## Identifying association rules

Market basket analysis revolves around the use of association rules, which are if-then statements about the relationship between two sets of items. The rule {coffee} -> {milk}, for instance, is read as "if coffee then milk," where coffee is the antecedent and milk is the consequent. Many rules have multiple antecedents and consequents. 

In [12]:
#Extract unique items
flattened = [item for transaction in transactions for item in transaction]
items = list(set(flattened))

In [13]:
items

['Bookmark', 'Biography', 'Fiction', 'Poetry', 'History']

In [14]:
# Compute and print rules
bookstore_rules = list(permutations(items))
bookstore_rules

[('Bookmark', 'Biography', 'Fiction', 'Poetry', 'History'),
 ('Bookmark', 'Biography', 'Fiction', 'History', 'Poetry'),
 ('Bookmark', 'Biography', 'Poetry', 'Fiction', 'History'),
 ('Bookmark', 'Biography', 'Poetry', 'History', 'Fiction'),
 ('Bookmark', 'Biography', 'History', 'Fiction', 'Poetry'),
 ('Bookmark', 'Biography', 'History', 'Poetry', 'Fiction'),
 ('Bookmark', 'Fiction', 'Biography', 'Poetry', 'History'),
 ('Bookmark', 'Fiction', 'Biography', 'History', 'Poetry'),
 ('Bookmark', 'Fiction', 'Poetry', 'Biography', 'History'),
 ('Bookmark', 'Fiction', 'Poetry', 'History', 'Biography'),
 ('Bookmark', 'Fiction', 'History', 'Biography', 'Poetry'),
 ('Bookmark', 'Fiction', 'History', 'Poetry', 'Biography'),
 ('Bookmark', 'Poetry', 'Biography', 'Fiction', 'History'),
 ('Bookmark', 'Poetry', 'Biography', 'History', 'Fiction'),
 ('Bookmark', 'Poetry', 'Fiction', 'Biography', 'History'),
 ('Bookmark', 'Poetry', 'Fiction', 'History', 'Biography'),
 ('Bookmark', 'Poetry', 'History', 'Biog

In [None]:
print(len(bookstore_rules))
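
Called without a length argument, `permutations(items)` yields every ordering of all five items, so the list above holds 5! = 120 tuples rather than rules with one antecedent and one consequent. A minimal sketch of enumerating pairwise candidate rules instead, assuming the optional second argument of `itertools.permutations` (the tuple length) is what is wanted here. The `items` list is re-declared so the block runs on its own:

```python
from itertools import permutations

# Unique items from the bookstore transactions (copied from the output above)
items = ['Bookmark', 'Biography', 'Fiction', 'Poetry', 'History']

# Each ordered pair (antecedent, consequent) is one candidate rule
pair_rules = list(permutations(items, 2))

print(len(pair_rules))   # 5 * 4 = 20 ordered pairs
print(pair_rules[:3])
```

Order matters in these pairs: `('Bookmark', 'Fiction')` and `('Fiction', 'Bookmark')` are different rules and can score differently on asymmetric metrics.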

In [None]:
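#Extract unique items
# Iterate over each movie's genre list (the inner loop must use `genre`, not `genres`)
g_flattened = [item for genre in genres for item in genre]
g_items = list(set(g_flattened))

In [None]:
g_flattened

In [None]:
# Compute and print rules as ordered pairs of distinct genres
# (permuting the full flattened list would be computationally infeasible)
g_rules = list(permutations(g_items, 2))
g_rules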

## Simple Metrics

### Books

In [None]:
encoder = TransactionEncoder().fit(transactions)
onehot = encoder.transform(transactions)

In [None]:
onehot = pd.DataFrame(onehot, columns=encoder.columns_)

In [None]:
onehot['Fiction+Poetry'] = np.logical_and(onehot.Fiction, onehot.Poetry)

In [None]:
onehot.mean()

In [None]:
# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, min_support=0.001,max_len=2,use_colnames=True)

In [None]:
# Compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets, metric="lift",min_threshold=1.0)

### Movies

In [None]:
g_encoder = TransactionEncoder().fit(genres)
g_onehot = g_encoder.transform(genres)

g_onehot = pd.DataFrame(g_onehot, columns=g_encoder.columns_)

In [None]:
g_onehot.mean()

In [None]:
support_AA = np.logical_and(g_onehot.Action, g_onehot.Adventure).mean()
support_AD = np.logical_and(g_onehot.Adventure, g_onehot.Drama).mean()
support_DA = np.logical_and(g_onehot.Drama, g_onehot.Action).mean()

In [None]:
# Print support values
print("Action and Adventure: %.2f" % support_AA)
print("Adventure and Drama: %.2f" % support_AD)
print("Drama and Action: %.2f" % support_DA)

Action and Adventure, or Drama and Action, appear to be the best options for cross-promotion.

## Confidence and lift


### The confidence metric

Confidence is defined as the support of items X and Y together divided by the support of item X. It tells us the probability that Y is purchased, given that X has been purchased.
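
Because the denominator is the support of the antecedent only, swapping X and Y generally changes the result. A small sketch on made-up data, where the arrays `x` and `y` and the helper `confidence` are hypothetical and not part of the notebook:

```python
import numpy as np

# Hypothetical one-hot columns: one entry per transaction
x = np.array([True, True, True, True, False, False, False, False])
y = np.array([True, True, True, False, True, True, True, False])

def confidence(antecedent, consequent):
    # Support of antecedent AND consequent, divided by support of antecedent
    support_both = np.logical_and(antecedent, consequent).mean()
    return support_both / antecedent.mean()

print("X -> Y: %.2f" % confidence(x, y))   # 0.375 / 0.50 = 0.75
print("Y -> X: %.2f" % confidence(y, x))   # 0.375 / 0.75 = 0.50
```

Both rules share the same numerator, so the rule whose antecedent has the lower support ends up with the higher confidence.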

In [None]:
support_action = g_onehot.Action.mean()
support_drama = g_onehot.Drama.mean()

confidenceAD = support_DA / support_action
confidenceDA = support_DA / support_drama

# Print results
print('{0:.2f}, {1:.2f}'.format(confidenceAD, confidenceDA))

The confidence is much higher for Action -> Drama. Both rules share the same numerator, so the higher confidence follows from Action having a lower support than Drama.

### The lift metric

The lift metric provides us with another way to improve over support. Lift is calculated as the support of items X and Y divided by the support of X multiplied by the support of Y. The numerator gives us the proportion of transactions that contain both X and Y. The denominator tells us what that proportion would be if X and Y were randomly and independently assigned to transactions. A lift value of greater than one tells us that two items occur in transactions together more often than we would expect based on their individual support values. This means that the relationship is unlikely to be explained by random chance. This natural threshold is convenient for filtering purposes.

In [None]:
# Compute lift
lift = support_DA / (support_action * support_drama)

print("Lift: %.2f" % lift)

A lift greater than 1.0 gives us some confidence that the association rule we recommended did not arise by random chance. A value of 1.0 or less would offer no such support.

### The Leverage metric

Leverage is also constructed from a simpler metric: support. To compute the leverage of "if X then Y," we take the support of X and Y together and subtract the product of the support of X and the support of Y. Lift and leverage are similar: both compare the joint support with the value expected under independence, lift as a ratio and leverage as a difference. One advantage of using leverage is that it is bounded from below by minus one and from above by plus one, making it easy to identify high and low values. Lift, by contrast, is bounded from below by 0 and is unbounded from above.
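
A minimal sketch of leverage in the same style as the other metrics in this notebook. The function name `leverage` and the toy arrays are assumptions for illustration. The function also accepts boolean pandas columns such as those in `g_onehot`:

```python
import numpy as np

def leverage(antecedent, consequent):
    # Observed share of transactions containing both items
    support_both = np.logical_and(antecedent, consequent).mean()
    # Share expected if the two items were independent
    expected = antecedent.mean() * consequent.mean()
    return support_both - expected

# Hypothetical one-hot columns: one entry per transaction
x = np.array([True, True, True, True, False, False, False, False])
y = np.array([True, True, True, True, True, False, False, False])

# Observed 0.5, expected 0.5 * 0.625 = 0.3125
print("Leverage: %.4f" % leverage(x, y))
```

A positive value plays the same role as a lift above 1: the items co-occur more often than independence would predict.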

### The conviction metric

Conviction is also based on support, but is more complicated and less intuitive than leverage. The conviction of "if X then Y" is computed as the support of X multiplied by the support of NOT Y, divided by the support of X and NOT Y. The support of NOT Y is simply the share of all transactions that do not include Y. The support of X and NOT Y is the share of all transactions that contain X, but not Y.

In [None]:
supportnD = 1.0 - g_onehot['Drama'].mean()

supportAnD = support_action - support_DA
conviction = support_action * supportnD / supportAnD
print("Conviction: %.2f" % conviction)

In [None]:
def conviction(antecedent, consequent):
    # Compute support for antecedent AND consequent
    supportAC = np.logical_and(antecedent, consequent).mean()

    # Compute support for antecedent
    supportA = antecedent.mean()

    # Compute support for NOT consequent
    supportnC = 1.0 - consequent.mean()

    # Compute support for antecedent and NOT consequent
    supportAnC = supportA - supportAC

    # Return conviction
    return supportA * supportnC / supportAnC


In [None]:
print("Conviction: %.2f" % conviction(g_onehot['Drama'],g_onehot['Action']))

In [None]:
print("Conviction: %.2f" % conviction(g_onehot['Action'],g_onehot['Drama']))

In [None]:
print("Conviction: %.2f" % conviction(g_onehot['Action'],g_onehot['Adventure']))

Notice that the value of conviction for "if Drama then Action" was less than 1, suggesting that the rule is not supported.

## Association and dissociation

**Zhang's metric** is comprehensive in the sense that it measures both association and dissociation. It is bounded from below by -1 and from above by 1: a value of 1 indicates perfect association, and -1 indicates perfect dissociation. It is also interpretable and has a definition in terms of simpler metrics.
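
Written out in terms of support, the definition used by the `zhang` function in the next cell can be stated as follows for a rule A -> C. This is a transcription of that code, not a separate derivation, and supp(A, C) denotes the support of A and C together:

```latex
\[
\mathrm{Zhang}(A \rightarrow C) =
\frac{\mathrm{supp}(A, C) - \mathrm{supp}(A)\,\mathrm{supp}(C)}
     {\max\bigl\{\mathrm{supp}(A, C)\,\bigl(1 - \mathrm{supp}(A)\bigr),\;
                 \mathrm{supp}(A)\,\bigl(\mathrm{supp}(C) - \mathrm{supp}(A, C)\bigr)\bigr\}}
\]
```

The numerator is the leverage of the rule, so the sign of Zhang's metric matches the sign of leverage.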

In [None]:
# Define a function to compute Zhang's metric
def zhang(antecedent, consequent):
    supportA = antecedent.mean()
    supportC = consequent.mean()

    supportAC = np.logical_and(antecedent, consequent).mean()

    numerator = supportAC - supportA*supportC
    denominator = max(supportAC*(1-supportA), supportA*(supportC-supportAC))

    # Return Zhang's metric
    return numerator / denominator

In [None]:
supportAc = g_onehot['Action'].mean()
supportAd = g_onehot['Adventure'].mean()

supportAA = np.logical_and(g_onehot['Action'],g_onehot['Adventure']).mean()

# Compute the numerator and denominator
numerator = supportAA - supportAc*supportAd
denominator = max(supportAA*(1-supportAc), supportAc*(supportAd-supportAA))

# Compute and print Zhang's metric
# (stored as zhang_value so the zhang() function defined above is not overwritten)
zhang_value = numerator / denominator
print(zhang_value)

In [None]:
print("Zhang's metric: %.2f" % zhang(g_onehot['Action'],g_onehot['Adventure']))

The association rule "if Action then Adventure" proved robust: it had a positive value for Zhang's metric, indicating that the two genres are not dissociated.

In [None]:
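# Compute frequent itemsets and rules for the movie genres
frequent_itemsets = apriori(g_onehot, min_support=0.001,max_len=2,use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift",min_threshold=1.0)

# Define an empty list for Zhang's metric
zhangs_metric = []

# Loop over the antecedent and consequent itemsets of each rule
for antecedents, consequents in zip(rules['antecedents'], rules['consequents']):
    # Extract the antecedent and consequent columns
    antecedent = g_onehot[list(antecedents)].all(axis=1)
    consequent = g_onehot[list(consequents)].all(axis=1)
    
    # Compute Zhang's metric and append it to the list
    zhangs_metric.append(zhang(antecedent, consequent))
    
# Print results
rules['zhang'] = zhangs_metric
print(rules)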

In [None]:
rules

## Advanced rules

In [None]:
# Preview the rules DataFrame using the .head() method
print(rules.head())

# Select the subset of rules with antecedent support greater than 0.05
rules = rules[rules['antecedent support'] > 0.05]

# Select the subset of rules with a consequent support greater than 0.01
rules = rules[rules['consequent support'] > 0.01]

# Select the subset of rules with a conviction greater than 1.01
rules = rules[rules['conviction'] > 1.01]

# Print remaining rules
print(rules)

In [None]:
# Set the lift threshold to 1.5
rules = rules[rules['lift'] > 1.5]

# Set the conviction threshold to 1.0
rules = rules[rules['conviction'] > 1.0]

# Set the threshold for Zhang's metric to 0.65
rules = rules[rules['zhang'] > 0.65]

# Print rule
print(rules[['antecedents','consequents']])

## Aggregation

In [95]:
gifts = pd.read_csv('online_retail.csv')
gifts.head()

Unnamed: 0,InvoiceNo,StockCode,Description
0,562583,35637A,IVORY STRING CURTAIN WITH POLE
1,562583,35638A,PINK AND BLACK STRING CURTAIN
2,562583,84927F,PSYCHEDELIC TILE HOOK
3,562583,22425,ENAMEL COLANDER CREAM
4,562583,16008,SMALL FOLDING SCISSOR(POINTED EDGE)


In [80]:
len(gifts['InvoiceNo'].unique())

9709

In [96]:
rated_dummies = pd.get_dummies(data=gifts,columns=['Description'], drop_first=True)
rated_dummies.head()

Unnamed: 0,InvoiceNo,StockCode,Description_ 50'S CHRISTMAS GIFT BAG LARGE,Description_ DOLLY GIRL BEAKER,Description_ I LOVE LONDON MINI BACKPACK,Description_ NINE DRAWER OFFICE TIDY,Description_ OVAL WALL MIRROR DIAMANTE,Description_ RED SPOT GIFT BAG LARGE,Description_ SET 2 TEA TOWELS I LOVE LONDON,Description_ SPACEBOY BABY GIFT SET,...,Description_wet boxes,Description_wet pallet,Description_wet rusty,Description_wet?,Description_wrongly coded 20713,Description_wrongly coded 23343,Description_wrongly coded-23343,Description_wrongly marked,Description_wrongly marked 23343,Description_wrongly marked carton 22804
0,562583,35637A,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,562583,35638A,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,562583,84927F,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,562583,22425,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,562583,16008,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [85]:
gifts = pd.concat([gifts, rated_dummies], axis=1)

In [86]:
gifts = gifts.drop(columns=['InvoiceNo','StockCode','Description'])

In [87]:
gifts.tail()

Unnamed: 0,Description_ 50'S CHRISTMAS GIFT BAG LARGE,Description_ DOLLY GIRL BEAKER,Description_ I LOVE LONDON MINI BACKPACK,Description_ NINE DRAWER OFFICE TIDY,Description_ OVAL WALL MIRROR DIAMANTE,Description_ RED SPOT GIFT BAG LARGE,Description_ SET 2 TEA TOWELS I LOVE LONDON,Description_ SPACEBOY BABY GIFT SET,Description_ TRELLIS COAT RACK,Description_10 COLOUR SPACEBOY PEN,...,Description_wet boxes,Description_wet pallet,Description_wet rusty,Description_wet?,Description_wrongly coded 20713,Description_wrongly coded 23343,Description_wrongly coded-23343,Description_wrongly marked,Description_wrongly marked 23343,Description_wrongly marked carton 22804
227755,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
227756,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
227757,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
227758,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
227759,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [88]:
bag_headers = [i  for i in gifts.columns if i.lower().find('bag')>=0]

In [89]:
box_headers = [i  for i in gifts.columns if i.lower().find('box')>=0]

In [90]:
bags = gifts[bag_headers]
boxes = gifts[box_headers]
print(bags)

        Description_ 50'S CHRISTMAS GIFT BAG LARGE  \
0                                                0   
1                                                0   
2                                                0   
3                                                0   
4                                                0   
...                                            ...   
227755                                           0   
227756                                           0   
227757                                           0   
227758                                           0   
227759                                           0   

        Description_ RED SPOT GIFT BAG LARGE  \
0                                          0   
1                                          0   
2                                          0   
3                                          0   
4                                          0   
...                                      ...   
227755         

In [91]:
bags = (bags.sum(axis=1)>0.0).values
boxes = (boxes.sum(axis=1)>0.0).values
print(bags)

[False False False ... False False False]


In [92]:
aggregated = pd.DataFrame(np.vstack([bags,boxes]).T, columns=['bags','boxes'])

In [93]:
aggregated.head()

Unnamed: 0,bags,boxes
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
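
The `aggregated` table above has one row per line item, so a bag and a box bought on the same invoice still sit on different rows. A minimal sketch of rolling the flags up to invoice level before computing support, assuming `InvoiceNo` identifies a basket. The tiny `line_items` frame and its descriptions are made up for illustration:

```python
import pandas as pd

# Hypothetical line items: three invoices, five purchased products
line_items = pd.DataFrame({
    'InvoiceNo': [1, 1, 2, 2, 3],
    'Description': ['RED GIFT BAG', 'LUNCH BOX', 'GIFT BAG LARGE',
                    'MUG', 'STORAGE BOX'],
})

# Flag each line item by category using a plain substring match
desc = line_items['Description'].str.lower()
flags = pd.DataFrame({
    'InvoiceNo': line_items['InvoiceNo'],
    'bags': desc.str.contains('bag', regex=False),
    'boxes': desc.str.contains('box', regex=False),
})

# An invoice contains a category if any of its line items does
per_invoice = flags.groupby('InvoiceNo')[['bags', 'boxes']].any()
print(per_invoice)

# Share of invoices that contain both a bag and a box
support_both = (per_invoice['bags'] & per_invoice['boxes']).mean()
print("Support of bags and boxes: %.2f" % support_both)
```

With baskets defined per invoice, the resulting boolean columns can be passed to the same support, confidence, and lift calculations used earlier.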
