# Ch. 1 Intro to Market Basket Analysis
## What is Market Basket Analysis?
- Identify products frequently purchased together
- Construct useful recommendations based on these findings

### Use Cases
- build a Netflix-style recommendation engine
- improve product recommendations on an e-commerce store
- cross-sell products in a retail setting
- improve inventory management
- upsell products

### Market Basket Analysis
- construct association rules
- identify items frequently bought together

### Association Rules
- Association Rule
    - contains an antecedent (left side) and a consequent (right side)
    - {health} --> {cooking}
- Multi-Antecedent Rule
    - {humor, travel} --> {language}
- Multi-Consequent Rule
    - {biography} --> {history, language}
    
### Difficulty of selecting rules
- set of all possible rules is large
- most rules are not useful
- must discard most rules
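To get a feel for how quickly the rule space grows, the sketch below counts the rules that can be formed from n distinct items. It assumes a rule is any ordered pair of non-empty, disjoint itemsets; the function names are illustrative and not taken from any library.

```python
def count_simple_rules(n_items):
    """Rules with exactly one antecedent item and one consequent item."""
    return n_items * (n_items - 1)


def count_all_rules(n_items):
    """Rules whose antecedent and consequent are non-empty, disjoint itemsets.

    Each item goes to the antecedent, the consequent, or neither (3**n ways);
    the assignments that leave either side empty are then subtracted.
    """
    return 3 ** n_items - 2 ** (n_items + 1) + 1


for n in (3, 9, 20):
    print(n, count_simple_rules(n), count_all_rules(n))
```

Even a small catalog produces far more rules than anyone could inspect by hand, which is why metrics and pruning are needed.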

### Generating the Rules
- use programming tools (here, `itertools.permutations`) to enumerate every ordered pair of items as a candidate rule

In [1]:
import pandas as pd

# Load Transactions
books = pd.read_csv('bookstore_transactions.csv')

# Split transactions strings into lists
transactions = books['Transaction'].apply(lambda t: t.split(','))

# Convert the Series of lists into a list of lists
transactions = list(transactions)

print(transactions[:5])

[['History', 'Bookmark'], ['History', 'Bookmark'], ['Fiction', 'Bookmark'], ['Biography', 'Bookmark'], ['History', 'Bookmark']]


#### Grocery transactions

In [2]:
transactions = [['milk', 'bread', 'biscuit'], ['bread', 'milk', 'biscuit', 'cereal'], ['bread', 'tea'], ['jam', 'bread', 'milk'], ['tea', 'biscuit'], ['bread', 'tea'], ['tea', 'cereal'], ['bread', 'tea', 'biscuit'], ['jam', 'bread', 'tea'], ['bread', 'milk'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'sugar'], ['bread', 'coffee', 'orange'], ['bread', 'sugar', 'biscuit'], ['coffee', 'sugar', 'cereal'], ['bread', 'sugar', 'biscuit'], ['bread', 'coffee', 'sugar'], ['bread', 'coffee', 'sugar'], ['tea', 'milk', 'coffee', 'cereal']]

#### Generating a list of rules

In [3]:
# Import permutations from the itertools module
from itertools import permutations

# Define the set of groceries
flattened = [i for t in transactions for i in t]
groceries = list(set(flattened))

# Generate all possible rules
rules = list(permutations(groceries, 2))

# Print the number of rules
print('Number of Rules: ',len(rules))

Number of Rules:  72


## Metrics and Pruning
### Metrics
- measure of performance for rules
- example: the rule {humor} --> {poetry} might receive a metric value of 0.81

### Pruning
- Use of metrics to discard rules

### The simplest metric
- the <b>Support Metric</b> measures the share of transactions that contain an itemset
    - (number of transactions with items) / (number of transactions)
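The ratio above can also be computed directly from the raw transaction lists, without one-hot encoding. A minimal sketch in plain Python using the grocery transactions from earlier; the `support` helper is illustrative and is not an mlxtend function.

```python
# Grocery transactions, one basket per string (same data as the grocery example)
baskets = [
    "milk bread biscuit", "bread milk biscuit cereal", "bread tea",
    "jam bread milk", "tea biscuit", "bread tea", "tea cereal",
    "bread tea biscuit", "jam bread tea", "bread milk",
    "coffee orange biscuit cereal", "coffee orange biscuit cereal",
    "coffee sugar", "bread coffee orange", "bread sugar biscuit",
    "coffee sugar cereal", "bread sugar biscuit", "bread coffee sugar",
    "bread coffee sugar", "tea milk coffee cereal",
]
transactions = [set(b.split()) for b in baskets]


def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"bread"}, transactions))
print(support({"jam", "bread"}, transactions))
```

The same function works for itemsets of any size, because the subset test `itemset <= t` checks all items at once.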

In [4]:
# Import the transaction encoder function from mlxtend
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Instantiate transaction encoder and identify unique items
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)

# Print the one-hot encoded transaction dataset
print(onehot)

    biscuit  bread  cereal  coffee    jam   milk  orange  sugar    tea
0      True   True   False   False  False   True   False  False  False
1      True   True    True   False  False   True   False  False  False
2     False   True   False   False  False  False   False  False   True
3     False   True   False   False   True   True   False  False  False
4      True  False   False   False  False  False   False  False   True
5     False   True   False   False  False  False   False  False   True
6     False  False    True   False  False  False   False  False   True
7      True   True   False   False  False  False   False  False   True
8     False   True   False   False   True  False   False  False   True
9     False   True   False   False  False   True   False  False  False
10     True  False    True    True  False  False    True  False  False
11     True  False    True    True  False  False    True  False  False
12    False  False   False    True  False  False   False   True  False
13    

#### Single item support values

In [5]:
# Compute the support
support = onehot.mean()

# Print the support
print(support)

biscuit    0.40
bread      0.65
cereal     0.30
coffee     0.40
jam        0.10
milk       0.25
orange     0.15
sugar      0.30
tea        0.35
dtype: float64


#### Multi item support values

In [6]:
# Add a jam+bread column to the DataFrame onehot
import numpy as np
onehot['jam+bread'] = np.logical_and(onehot['jam'], onehot['bread'])

# Compute the support
support = onehot.mean()

# Print the support values
print(support)

biscuit      0.40
bread        0.65
cereal       0.30
coffee       0.40
jam          0.10
milk         0.25
orange       0.15
sugar        0.30
tea          0.35
jam+bread    0.10
dtype: float64


# Ch. 2 Association Rules
### The Confidence Metric
- support alone does not show how strongly one item predicts another
- confidence gives a more complete picture: the share of transactions containing X that also contain Y
- Confidence(X --> Y) = Support(X & Y) / Support(X)

### The Lift Metric
- another metric for evaluating the relationship between items
- Support(X & Y) / (Support(X) * Support(Y))
    - numerator: proportion of transactions that contain X & Y
    - Denominator: proportion if X & Y were assigned randomly and independently
- If lift is greater than 1, the items occur together more often than you might expect based on their individual support values
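Both formulas extend to multi-antecedent rules if the itemsets are treated as sets. The sketch below is one possible way to do this in plain Python on the grocery transactions; `support` and `rule_metrics` are hypothetical helper names, and the code assumes the antecedent and consequent each occur in at least one transaction (otherwise the divisions fail).

```python
baskets = [
    "milk bread biscuit", "bread milk biscuit cereal", "bread tea",
    "jam bread milk", "tea biscuit", "bread tea", "tea cereal",
    "bread tea biscuit", "jam bread tea", "bread milk",
    "coffee orange biscuit cereal", "coffee orange biscuit cereal",
    "coffee sugar", "bread coffee orange", "bread sugar biscuit",
    "coffee sugar cereal", "bread sugar biscuit", "bread coffee sugar",
    "bread coffee sugar", "tea milk coffee cereal",
]
transactions = [set(b.split()) for b in baskets]


def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift for the rule antecedent --> consequent."""
    a, c = set(antecedent), set(consequent)
    s_a = support(a, transactions)
    s_c = support(c, transactions)
    s_ac = support(a | c, transactions)   # transactions containing X and Y
    return {
        "support": s_ac,
        "confidence": s_ac / s_a,         # Support(X & Y) / Support(X)
        "lift": s_ac / (s_a * s_c),       # Support(X & Y) / (Support(X) * Support(Y))
    }


metrics = rule_metrics({"bread", "sugar"}, {"biscuit"}, transactions)
print(metrics)
```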

#### Calculating Support

In [7]:
# Compute support for coffee and sugar
supportCS = np.logical_and(onehot['coffee'], onehot['sugar']).mean()

# Compute support for sugar and tea
supportST = np.logical_and(onehot['sugar'], onehot['tea']).mean()

# Compute support for tea and coffee
supportTC = np.logical_and(onehot['tea'], onehot['coffee']).mean()

# Print support values
print("coffee and sugar: %.2f" % supportCS)
print("sugar and tea: %.2f" % supportST)
print("tea and coffee: %.2f" % supportTC)

coffee and sugar: 0.20
sugar and tea: 0.00
tea and coffee: 0.05


#### Calculating Confidence

In [8]:
# Compute support for coffee and tea
supportCT = np.logical_and(onehot['coffee'], onehot['tea']).mean()

# Compute support for coffee
supportC = onehot['coffee'].mean()

# Compute support for tea
supportT = onehot['tea'].mean()

# Compute confidence for both rules
confidenceCT = supportCT / supportC
confidenceTC = supportCT / supportT

# Print results
print('Coffee then tea: {0:.2f}'.format(confidenceCT))
print('tea then coffee: {0:.2f}'.format(confidenceTC))

Coffee then tea: 0.12
tea then coffee: 0.14


#### Calculating Lift

In [9]:
# Compute support for coffee and tea
supportCT = np.logical_and(onehot['coffee'], onehot['tea']).mean()

# Compute support for coffee
supportC = onehot['coffee'].mean()

# Compute support for tea
supportT = onehot['tea'].mean()

# Compute lift
lift = supportCT / (supportC * supportT)

# Print lift
print("Lift: %.2f" % lift)

Lift: 0.36


## Leverage and Conviction
More advanced metrics tend to build on simpler ones such as support.
### Leverage
- builds on the metric support
- Leverage = Support(XY) - (Support(X) * Support(Y))
- similar to lift, but easier to interpret
- range of -1 to +1 makes it easy to identify low and high values (lift ranges from 0 to infinity)
- a positive value means the items appear together more often than expected if they were independent; a negative value means less often
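A minimal sketch of the leverage formula, shown next to lift for comparison. The function names are illustrative; the inputs are the grocery support values computed earlier (coffee 0.40, sugar 0.30, tea 0.35, coffee with sugar 0.20, coffee with tea 0.05).

```python
def leverage(support_xy, support_x, support_y):
    """Observed joint support minus the joint support expected under independence."""
    return support_xy - support_x * support_y


def lift(support_xy, support_x, support_y):
    """Observed joint support divided by the joint support expected under independence."""
    return support_xy / (support_x * support_y)


# Coffee and sugar: bought together more often than independence predicts
print(leverage(0.20, 0.40, 0.30), lift(0.20, 0.40, 0.30))

# Coffee and tea: bought together less often than independence predicts
print(leverage(0.05, 0.40, 0.35), lift(0.05, 0.40, 0.35))
```

Leverage is positive exactly when lift is above 1, so the two metrics agree on direction and differ only in scale.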

### Conviction
- built using support
- more complicated and less intuitive than leverage
- conviction(X --> Y) = (Support(X) * Support(Not Y)) / Support(X & Not Y)

In [10]:
def conviction(antecedent, consequent):
    # Compute support for antecedent AND consequent
    supportAC = np.logical_and(antecedent, consequent).mean()

    # Compute support for antecedent
    supportA = antecedent.mean()

    # Compute support for NOT consequent
    supportnC = 1.0 - consequent.mean()

    # Compute support for antecedent and NOT consequent
    supportAnC = supportA - supportAC

    # Return conviction
    return supportA * supportnC / supportAnC

In [11]:
conviction(onehot['coffee'], onehot['sugar'])

1.3999999999999997

## Association and Dissociation
### Dissociation
- Zhang's Metric
    - Values between -1 and +1
    - +1 indicates perfect association
    - -1 indicates perfect dissociation
- Comprehensive as it measures association and dissociation
- Interpretable
- Constructed using support
- Numerator: Confidence(A-->B) - Confidence(Not A-->B)
- Denominator: Max[Confidence(A-->B),Confidence(Not A-->B)]
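The confidence form above can be rewritten using support values only, which is the form that is convenient to compute. A short derivation, writing $s(\cdot)$ for support and $\bar{A}$ for "not A" (notation introduced here for illustration) and assuming $0 < s(A) < 1$:

$$
\begin{aligned}
\text{Confidence}(A \to B) &= \frac{s(AB)}{s(A)}, \qquad
\text{Confidence}(\bar{A} \to B) = \frac{s(B) - s(AB)}{1 - s(A)} \\
\text{Zhang}(A \to B) &= \frac{\text{Confidence}(A \to B) - \text{Confidence}(\bar{A} \to B)}{\max\left[\text{Confidence}(A \to B),\ \text{Confidence}(\bar{A} \to B)\right]} \\
&= \frac{s(AB) - s(A)\,s(B)}{\max\left[\,s(AB)\,(1 - s(A)),\ s(A)\,(s(B) - s(AB))\,\right]}
\end{aligned}
$$

The last step multiplies the numerator and the denominator by $s(A)\,(1 - s(A))$; in the numerator the cross terms $s(A)\,s(AB)$ cancel.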

In [12]:
def zhang(ant, con):
    # Compute the support of Ant and Con
    supportA = ant.mean()
    supportC = con.mean()

    # Compute the support of both items
    supportAC = np.logical_and(ant, con).mean()

    # Compute the numerator and denominator from support values
    numerator = supportAC - supportA*supportC
    denominator = max(supportAC*(1-supportA), supportA*(supportC-supportAC))

    # Compute and return Zhang's metric
    return numerator / denominator

In [13]:
zhang(onehot['tea'], onehot['milk'])

-0.5357142857142857

## Advanced Rules
Standard Procedure for market basket analysis
- Generate a large set of rules
- filter rules using metrics
- apply intuition and common sense on remaining rules

### Multi Metric Filtering
In practice, the rules are collected in a DataFrame with one row per rule and one column per metric. Filters are then applied one metric at a time, each with its own threshold, to weed out the weaker rules.
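A minimal sketch of this one-filter-at-a-time workflow with pandas. The metric values are a handful of grocery rules (the same figures appear in the `association_rules()` output later in these notes); the thresholds are hypothetical and chosen only for illustration.

```python
import pandas as pd

# One row per rule, one column per metric
rules = pd.DataFrame({
    "antecedents": ["milk", "bread", "sugar", "cereal", "coffee"],
    "consequents": ["bread", "milk", "bread", "coffee", "sugar"],
    "support":     [0.20, 0.20, 0.20, 0.20, 0.20],
    "confidence":  [0.80, 0.307692, 0.666667, 0.666667, 0.50],
    "lift":        [1.230769, 1.230769, 1.025641, 1.666667, 1.666667],
})

# Hypothetical thresholds, applied one metric at a time
filtered = rules[rules["support"] >= 0.20]
filtered = filtered[filtered["confidence"] >= 0.60]
filtered = filtered[filtered["lift"] >= 1.20]

print(filtered)
```

Applying the filters separately makes it easy to see how many rules each threshold removes before settling on final values.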

# Aggregation and Pruning
## Aggregation
- can be used to simplify MBA problems that have many items

In [110]:
import pandas as pd

# Load novelty gift data
gifts = pd.read_csv('novelty_gifts.csv')
gifts['Description'] = gifts['Description'].str.strip()
gifts[['InvoiceNo','Description']] = gifts[['InvoiceNo','Description']].astype('str')

# Concatenate items into one row per invoice
gifts = pd.DataFrame(gifts.groupby('InvoiceNo')['Description'].apply(','.join)).reset_index()

# preview data
gifts.head()

   InvoiceNo                                        Description
0     549687  DOORMAT RED RETROSPOT,DOORMAT WELCOME SUNRISE,...
1     550644  SET OF 6 SPICE TINS PANTRY DESIGN,PANTRY WASHI...
2     552695  BIRTHDAY PARTY CORDON BARRIER TAPE,ICE CREAM P...
3     553857  VINTAGE GLASS T-LIGHT HOLDER,SET/6 BEAD COASTE...
4     557499  PARTY CHARMS 50 PIECES,PACK OF 6 LARGE FRUIT S...


In [106]:
# Split transactions strings into lists
transactions = gifts['Description'].apply(lambda t: t.split(','))

# Convert the Series of lists into a list of lists
transactions = list(transactions)

In [128]:
# Import the transaction encoder function from mlxtend
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Instantiate transaction encoder and identify unique items
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)

# Save the one-hot encoded transactions to CSV and preview the first rows
onehot.to_csv('onehot_gifts.csv', index=False)
onehot.head()

Unnamed: 0,Unnamed: 1,1 HANGER,BACK DOOR,BIRTHDAY CARD,BLUE,BREAKFAST IN BED,CHOCOLATE SPOTS,DOTCOMGIFTSHOP.COM,DOUGHNUTS,FRONT DOOR,...,wet boxes,wet pallet,wet rusty,wet?,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### Aggregation
- groups items together based on categories or aggregates
- reduces the MBA to rules between categories of items, rather than individual items
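Aggregation can also be done before encoding, by replacing each item with its category. The sketch below assumes a hand-written item-to-category mapping; the mapping, the category names, and the small transaction list are all hypothetical.

```python
# Hypothetical mapping from individual items to broader categories
category_of = {
    "bread": "bakery", "biscuit": "bakery",
    "milk": "dairy",
    "tea": "beverage", "coffee": "beverage",
}

# Small illustrative set of transactions
transactions = [
    ["milk", "bread", "biscuit"],
    ["bread", "tea"],
    ["coffee", "biscuit"],
    ["tea", "milk", "coffee"],
]


def aggregate_transaction(items, mapping):
    """Replace each item with its category, keeping each category once."""
    return sorted({mapping[item] for item in items})


aggregated = [aggregate_transaction(t, category_of) for t in transactions]
print(aggregated)
```

Rules mined from `aggregated` relate categories such as bakery and beverage rather than individual products, which shrinks the number of candidate rules.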

In [114]:
# Select the column headers for sign items
sign_headers = [i for i in onehot.columns if i.lower().find('sign')>=0]

# Select columns of sign items
sign_columns = onehot[sign_headers]

# Perform aggregation of sign items into sign category
signs = sign_columns.sum(axis = 1) >= 1.0

# Print support for signs
print('Share of Signs: %.2f' % signs.mean())

Share of Signs: 0.44


In [115]:
def aggregate(item):
    # Select the column headers that contain the item keyword
    item_headers = [i for i in onehot.columns if i.lower().find(item)>=0]

    # Select the matching columns
    item_columns = onehot[item_headers]
    
    # Return category of aggregated items
    return item_columns.sum(axis = 1) >= 1.0

In [127]:
aggregate('sign').mean()

0.4417550726130394

## The Apriori Algorithm
- Reduces complexity by eliminating low support items
- Reduces the number of itemsets

## The Apriori Principle
- Apriori Principle: Subsets of frequent sets are frequent
    - Equivalently, a set that contains an infrequent subset cannot be frequent
    - Retain sets known to be frequent
    - Prune sets not known to be frequent
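The pruning idea can be sketched in a few lines of plain Python for itemsets of size one and two. This is an illustration of the principle on the grocery transactions, not mlxtend's implementation; the variable names and the 0.2 threshold are choices made for this example.

```python
from itertools import combinations

baskets = [
    "milk bread biscuit", "bread milk biscuit cereal", "bread tea",
    "jam bread milk", "tea biscuit", "bread tea", "tea cereal",
    "bread tea biscuit", "jam bread tea", "bread milk",
    "coffee orange biscuit cereal", "coffee orange biscuit cereal",
    "coffee sugar", "bread coffee orange", "bread sugar biscuit",
    "coffee sugar cereal", "bread sugar biscuit", "bread coffee sugar",
    "bread coffee sugar", "tea milk coffee cereal",
]
transactions = [set(b.split()) for b in baskets]
min_support = 0.2


def support(itemset):
    """Share of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Level 1: keep only the single items that meet the support threshold
items = sorted(set().union(*transactions))
frequent_1 = [i for i in items if support({i}) >= min_support]

# Level 2: build candidate pairs from frequent single items only,
# since a pair containing an infrequent item cannot be frequent
candidates = [set(pair) for pair in combinations(frequent_1, 2)]
frequent_2 = [sorted(c) for c in candidates if support(c) >= min_support]

print(len(list(combinations(items, 2))), "possible pairs")
print(len(candidates), "candidate pairs after pruning")
print(frequent_2)
```

Dropping the infrequent single items first means fewer pairs need their support counted, and the saving grows quickly for larger itemsets.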

In [135]:
transactions = [['milk', 'bread', 'biscuit'], ['bread', 'milk', 'biscuit', 'cereal'], ['bread', 'tea'], ['jam', 'bread', 'milk'], ['tea', 'biscuit'], ['bread', 'tea'], ['tea', 'cereal'], ['bread', 'tea', 'biscuit'], ['jam', 'bread', 'tea'], ['bread', 'milk'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'sugar'], ['bread', 'coffee', 'orange'], ['bread', 'sugar', 'biscuit'], ['coffee', 'sugar', 'cereal'], ['bread', 'sugar', 'biscuit'], ['bread', 'coffee', 'sugar'], ['bread', 'coffee', 'sugar'], ['tea', 'milk', 'coffee', 'cereal']]

# Import the transaction encoder function from mlxtend
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Instantiate transaction encoder and identify unique items
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)

In [150]:
# Import Apriori Algorithm
from mlxtend.frequent_patterns import apriori

# Find frequent itemsets
frequent = apriori(onehot, min_support=0.2,
                   max_len=2, use_colnames=True)

frequent

   support          itemsets
0      0.4         (biscuit)
1     0.65           (bread)
2      0.3          (cereal)
3      0.4          (coffee)
4     0.25            (milk)
5      0.3           (sugar)
6     0.35             (tea)
7     0.25  (bread, biscuit)
8      0.2     (milk, bread)
9      0.2    (sugar, bread)


## Apriori and association rules
- Apriori prunes itemsets
    - applies minimum support threshold
    - modified version can prune by number of items
    - doesn't tell us about association rules
- Association Rules
    - many more association rules than itemsets
- computing rules from apriori results
    - difficult to enumerate for high n and k
    - could undo itemset pruning by Apriori
- Reduce number of association rules
    - mlxtend module offers means of pruning association rules
    - association_rules() takes frequent items, metric, and threshold

In [159]:
# Import the Apriori algorithm and association rules
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori

# Find frequent itemsets
frequent = apriori(onehot, min_support=0.2,
                   max_len=2, use_colnames=True)

# Create association rules from frequent itemsets
rules = association_rules(frequent, metric='lift', min_threshold=1)

rules

   antecedents  consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction
0       (milk)      (bread)                0.25                0.65      0.2         0.8  1.230769    0.0375        1.75
1      (bread)       (milk)                0.65                0.25      0.2    0.307692  1.230769    0.0375    1.083333
2      (sugar)      (bread)                 0.3                0.65      0.2    0.666667  1.025641     0.005        1.05
3      (bread)      (sugar)                0.65                 0.3      0.2    0.307692  1.025641     0.005    1.011111
4     (cereal)     (coffee)                 0.3                 0.4      0.2    0.666667  1.666667      0.08         1.8
5     (coffee)     (cereal)                 0.4                 0.3      0.2         0.5  1.666667      0.08         1.4
6      (sugar)     (coffee)                 0.3                 0.4      0.2    0.666667  1.666667      0.08         1.8
7     (coffee)      (sugar)                 0.4                 0.3      0.2         0.5  1.666667      0.08         1.4


# Visualizing Rules
## Heatmaps


In [161]:
import pandas as pd

# Load Ratings Data
ratings = pd.read_csv('movielens.csv')
ratings.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy
