# 2. Association Rules
Association rules tell us that two or more items are related. Metrics allow us to quantify the usefulness of those relationships. In this chapter, you'll apply six metrics to evaluate association rules: __support, confidence, lift, conviction, leverage, and Zhang's metric__. You'll then use association rules and metrics to assist a library and an e-book seller.

## Recommending books with support
---
A library wants to get members to read more and has decided to use market basket analysis to figure out how. They approach you to do the analysis and ask that you use the five most highly rated books from the goodbooks-10k dataset. Each column in the DataFrame corresponds to a book and has the value True if the book is contained in a reader's library and is rated highly. To make things simpler, we'll work with shortened book names, such as _Hunger, Potter, and Twilight_.

In [1]:
# import packages
import pandas as pd
import numpy as np

# import dataset
books = pd.read_csv('../Datasets/books.csv')

# observe dataset
books.head()

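| Unnamed: 0 | Hunger | Potter | Twilight | Mockingbird | Gatsby |
|---|---|---|---|---|---|
| 0 | False | True | False | True | True |
| 1 | False | True | True | False | True |
| 2 | False | False | False | True | False |
| 3 | False | True | False | False | True |
| 4 | False | False | False | False | True |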


In [2]:
# Compute support for Hunger and Potter
supportHP = np.logical_and(books['Hunger'], books['Potter']).mean()

# Compute support for Hunger and Twilight
supportHT = np.logical_and(books["Hunger"], books["Twilight"]).mean()

# Compute support for Potter and Twilight
supportPT = np.logical_and(books["Potter"], books["Twilight"]).mean()

# Print support values
print("Hunger Games and Harry Potter: %.2f" % supportHP)
print("Hunger Games and Twilight: %.2f" % supportHT)
print("Harry Potter and Twilight: %.2f" % supportPT)

Hunger Games and Harry Potter: 0.12
Hunger Games and Twilight: 0.09
Harry Potter and Twilight: 0.14
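
The three support calculations above repeat the same pattern, so they can be generalized to every pair of columns. The following is a minimal sketch on a small made-up DataFrame; `toy` and `pairwise_support` are hypothetical names used only for illustration, not part of the library's data.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data: each row is a reader, each column a book.
toy = pd.DataFrame({
    "Hunger":   [True, True, False, False, True],
    "Potter":   [True, False, True, True, True],
    "Twilight": [False, False, True, True, True],
})

def pairwise_support(df):
    """Return {(item_a, item_b): support} for every unordered pair of columns."""
    return {
        (a, b): (df[a] & df[b]).mean()
        for a, b in combinations(df.columns, 2)
    }

supports = pairwise_support(toy)
for pair, value in supports.items():
    print(pair, round(value, 2))
```

Because support is symmetric, unordered pairs from `itertools.combinations` are enough here; directional metrics such as confidence need ordered pairs instead.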


## Refining support with confidence
---
After reporting your findings from the previous exercise, the library asks you about the direction of the relationship. _**Should they use Harry Potter to promote Twilight or Twilight to promote Harry Potter?**_

The confidence metric:
1. Refines support by adding a second metric to the analysis.
2. Provides a more complete picture, because it measures how often Y appears among the transactions that contain X.

<center>
$Confidence(X \rightarrow Y) = \Large\frac{Support(X\,and\,Y)}{Support(X)}$
</center>

After thinking about this, you decide to compute the confidence metric, which has a direction, unlike support. You'll compute it for both __{Potter} -> {Twilight}__ and __{Twilight} -> {Potter}__. The DataFrame books, which was loaded earlier, has one column for each book, including _Potter_ and _Twilight_.

In [3]:
# Compute support for Potter and Twilight
supportPT = np.logical_and(books["Potter"], books["Twilight"]).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for Twilight
supportT = books['Twilight'].mean()

# Compute confidence for both rules
confidencePT = supportPT / supportP
confidenceTP = supportPT / supportT

# Print results
print('{0:.2f}, {1:.2f}'.format(confidencePT, confidenceTP))

0.29, 0.55
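
To see why the two directions can differ, it may help to wrap the calculation in a function and try it on a tiny example. This is only a sketch: the `confidence` helper and the six-row `toy` DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four readers like Potter, two of them also like Twilight.
toy = pd.DataFrame({
    "Potter":   [True, True, True, True, False, False],
    "Twilight": [True, True, False, False, False, False],
})

def confidence(antecedent, consequent):
    """Confidence of antecedent -> consequent: Support(X and Y) / Support(X)."""
    support_both = np.logical_and(antecedent, consequent).mean()
    return support_both / antecedent.mean()

conf_pt = confidence(toy["Potter"], toy["Twilight"])  # Potter -> Twilight
conf_tp = confidence(toy["Twilight"], toy["Potter"])  # Twilight -> Potter
print("{0:.2f}, {1:.2f}".format(conf_pt, conf_tp))
```

In this toy data every Twilight reader also likes Potter, but only half of the Potter readers like Twilight, so the numerator is shared while the denominators differ.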


## Further refinement with lift
---
Once again, you report your results to the library: Use Twilight to promote Harry Potter, since the rule has a higher confidence metric. The library thanks you for the suggestion, but asks you to confirm that this is a meaningful relationship using another metric.

You recall that lift may be useful here. If lift is less than 1, this means that Harry Potter and Twilight are paired together less frequently than we would expect if the pairings occurred by random chance.

Lift provides another metric for evaluating the relationship between items:
- __Numerator__: Proportion of transactions that contain both X and Y.
- __Denominator__: Proportion of transactions that would contain both X and Y if the two were assigned randomly and independently.

<center>
$Lift = \Large\frac{Support(X\,and\,Y)}{Support(X)Support(Y)}$
</center>

In [4]:
# Compute support for Potter and Twilight
supportPT = np.logical_and(books["Potter"], books["Twilight"]).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for Twilight
supportT = books['Twilight'].mean()

# Compute lift
lift = supportPT / (supportP * supportT)

# Print lift
print("Lift: %.2f" % lift)

Lift: 1.15
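
The threshold of 1 can be checked on a small constructed example. In the sketch below, which uses made-up Series and a hypothetical `lift` helper, one consequent is arranged to be independent of the antecedent and the other is an exact copy of it.

```python
import numpy as np
import pandas as pd

def lift(x, y):
    """Lift: Support(X and Y) / (Support(X) * Support(Y))."""
    support_xy = np.logical_and(x, y).mean()
    return support_xy / (x.mean() * y.mean())

# Hypothetical toy data
x = pd.Series([True, True, False, False])
y_independent = pd.Series([True, False, True, False])  # co-occurs with x exactly as often as chance predicts
y_identical = x.copy()                                  # always appears together with x

lift_independent = lift(x, y_independent)
lift_identical = lift(x, y_identical)
print(lift_independent, lift_identical)
```

The independent pair sits at the threshold, while the identical pair scores well above it.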


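## Lift versus leverage
---
Lift and leverage are quite similar: both compare Support(X and Y) with Support(X) Support(Y), the value we would expect if X and Y were independent. Lift is the ratio of the two, so good rules have a lift above 1. Leverage is the difference between the two, so good rules have a leverage above 0.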
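
As a rough illustration, both metrics can be computed from the same three support values. Leverage is conventionally defined as Support(X and Y) - Support(X) Support(Y). The numbers below are copied from the Twilight and Potter rows of the rules table shown later in this chapter; the function and variable names are illustrative only.

```python
def lift(support_xy, support_x, support_y):
    """Ratio of observed joint support to the support expected under independence."""
    return support_xy / (support_x * support_y)

def leverage(support_xy, support_x, support_y):
    """Difference between observed joint support and the support expected under independence."""
    return support_xy - support_x * support_y

# Support values taken from the rules table later in the chapter
support_pt = 0.140621  # Potter and Twilight
support_p = 0.477516   # Potter
support_t = 0.256770   # Twilight

lift_pt = lift(support_pt, support_p, support_t)
leverage_pt = leverage(support_pt, support_p, support_t)
print("Lift: %.3f" % lift_pt)
print("Leverage: %.3f" % leverage_pt)
```

Up to rounding, the results should agree with the lift and leverage columns of that table (1.146881 and 0.018009): lift is above 1 and leverage is above 0, so the two metrics tell the same story.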

## Computing conviction
---
After hearing about the useful advice you provided to the library, the founder of a small ebook selling start-up approaches you for consulting services. As a test of your abilities, she asks you if you are able to compute conviction for the rule __{Potter} -> {Hunger}__, so she can decide whether to place the books next to each other on the company's website.

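1. Conviction is also built from support.
2. It is more complicated and less intuitive than leverage: it compares how often X would occur without Y under independence with how often X actually occurs without Y, so values above 1 support the rule.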

<center>
$Conviction(X \rightarrow Y) = \Large\frac{Support(X)\,Support(\bar{Y})}{Support(X\,and\,\bar{Y})}$
</center>

Here $\bar{Y}$ denotes the transactions that do not contain Y.

In [5]:
# Compute support for Potter AND Hunger
supportPH = np.logical_and(books['Potter'], books['Hunger']).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for NOT Hunger
supportnH = 1.0 - books['Hunger'].mean()

# Compute support for Potter and NOT Hunger
supportPnH = supportP - supportPH

# Compute and print conviction for Potter -> Hunger
conviction = supportP * supportnH / supportPnH
print("Conviction: %.2f" % conviction)

Conviction: 0.92
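
The code above relies on two identities: Support(not Y) = 1 - Support(Y), and Support(X and not Y) = Support(X) - Support(X and Y). Substituting them into the definition, and writing $\bar{Y}$ for "not Y", gives an equivalent form in terms of confidence. The derivation is sketched here for reference:

```latex
\begin{align*}
Conviction(X \rightarrow Y)
  &= \frac{Support(X)\,Support(\bar{Y})}{Support(X\,and\,\bar{Y})} \\
  &= \frac{Support(X)\,\bigl(1 - Support(Y)\bigr)}{Support(X) - Support(X\,and\,Y)} \\
  &= \frac{1 - Support(Y)}{1 - Confidence(X \rightarrow Y)}
\end{align*}
```

For {Potter} -> {Hunger}, the rules table later in this chapter lists a consequent support of 0.31913 and a confidence of 0.259365, so the last line gives (1 - 0.31913) / (1 - 0.259365) ≈ 0.92, in agreement with the value printed above.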


## Computing conviction with a function
---
After successful completion of her trial project, the ebook start-up's founder decides to hire you for a much bigger project. She asks you if you are able to compute conviction for every pair of books in the goodbooks-10k dataset, so she can use that information to decide which books to locate closer together on the website.

You agree to take the job, but realize that you need a more efficient way to compute conviction, since you will need to compute it many times. You decide to write a function that computes it.

In [6]:
def conviction(antecedent, consequent):
    # Compute support for antecedent AND consequent
    supportAC = np.logical_and(antecedent, consequent).mean()

    # Compute support for antecedent
    supportA = antecedent.mean()

    # Compute support for NOT consequent
    supportnC = 1.0 - consequent.mean()

    # Compute support for antecedent and NOT consequent
    supportAnC = supportA - supportAC

    # Return conviction
    return supportA * supportnC / supportAnC
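
One edge case is worth keeping in mind when this function is applied to many pairs: if the antecedent never appears without the consequent, the denominator is zero. A possible guarded variant is sketched below. The name `safe_conviction` and the choice to return infinity (or NaN when the antecedent never occurs) are assumptions made for this illustration, not something the dataset requires.

```python
import numpy as np
import pandas as pd

def safe_conviction(antecedent, consequent):
    """Conviction with explicit handling of a zero denominator."""
    support_a = antecedent.mean()
    support_ac = np.logical_and(antecedent, consequent).mean()
    support_not_c = 1.0 - consequent.mean()
    support_a_not_c = support_a - support_ac

    if support_a == 0:
        # The antecedent never occurs, so the rule cannot be evaluated
        return float("nan")
    if support_a_not_c == 0:
        # The antecedent never occurs without the consequent
        return float("inf")
    return support_a * support_not_c / support_a_not_c

# Hypothetical toy data
a = pd.Series([True, True, False, False])
always_with_a = pd.Series([True, True, True, False])   # present whenever a is present
independent_of_a = pd.Series([True, False, True, False])

conv_always = safe_conviction(a, always_with_a)
conv_independent = safe_conviction(a, independent_of_a)
print(conv_always, conv_independent)
```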

## Promoting ebooks with conviction
---
In the previous exercise, we defined a function to compute conviction. We were asked to apply that function to all two-book permutations of the goodbooks-10k dataset. In this exercise, we'll test the function by applying it to the three most popular books, which we used in earlier exercises: The Hunger Games, Harry Potter, and Twilight.

In [7]:
# Compute conviction for twilight -> potter and potter -> twilight
convictionTP = conviction(books.Twilight, books.Potter)
convictionPT = conviction(books.Potter, books.Twilight)

# Compute conviction for twilight -> hunger and hunger -> twilight
convictionTH = conviction(books.Twilight, books.Hunger)
convictionHT = conviction(books.Hunger, books.Twilight)

# Compute conviction for potter -> hunger and hunger -> potter
convictionPH = conviction(books.Potter, books.Hunger)
convictionHP = conviction(books.Hunger, books.Potter)

# Print results
print('Harry Potter -> Twilight: ', round(convictionPT,2))
print('Twilight -> Harry Potter: ', round(convictionTP,2))

Harry Potter -> Twilight:  1.05
Twilight -> Harry Potter:  1.16
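
Writing out each pair by hand does not scale to all two-book permutations. One way to cover every ordered pair is `itertools.permutations`, sketched here on a small made-up DataFrame; `toy` and `results` are hypothetical names, and the `conviction` function is restated so the block runs on its own.

```python
from itertools import permutations

import numpy as np
import pandas as pd

def conviction(antecedent, consequent):
    """Conviction of antecedent -> consequent, as defined earlier."""
    supportAC = np.logical_and(antecedent, consequent).mean()
    supportA = antecedent.mean()
    supportnC = 1.0 - consequent.mean()
    supportAnC = supportA - supportAC
    return supportA * supportnC / supportAnC

# Hypothetical toy data: eight readers, three books
toy = pd.DataFrame({
    "Hunger":   [True, True, False, False, True, False, False, False],
    "Potter":   [True, False, True, True, False, True, False, False],
    "Twilight": [False, True, True, False, False, True, True, False],
})

# One entry per ordered pair, because conviction depends on direction
results = {
    (a, c): conviction(toy[a], toy[c])
    for a, c in permutations(toy.columns, 2)
}

for (a, c), value in results.items():
    print("%s -> %s: %.2f" % (a, c, value))
```

With three books this produces six rules, and the value for Hunger -> Potter differs from the value for Potter -> Hunger.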


## Computing association and dissociation
---
The library has returned to you once again about your recommendation to promote Harry Potter using Twilight. They're worried that the two might be dissociated, which could have a negative impact on their promotional effort. They ask you to verify that this is not the case. You immediately think of __Zhang's metric__, which measures association and dissociation on a continuous scale from -1 to 1: positive values indicate association and negative values indicate dissociation. Zhang's metric is computed as follows:
<center>
$Zhang(A \rightarrow B) = \Large\frac{Support(A\,and\,B) - Support(A)\,Support(B)}{max[Support(A\,and\,B)\,(1-Support(A)),\; Support(A)\,(Support(B)-Support(A\,and\,B))]}$
</center>

In [8]:
# Compute the support of Twilight and Harry Potter
supportT = books['Twilight'].mean()
supportP = books['Potter'].mean()

# Compute the support of both books
supportTP = np.logical_and(books['Twilight'], books["Potter"]).mean()

# Complete the expressions for the numerator and denominator
numerator = supportTP - supportT*supportP
denominator = max(supportTP*(1-supportT), supportT*(supportP-supportTP))

# Compute and print Zhang's metric
zhang = numerator / denominator
print(zhang)

0.17231567178855997


## Defining Zhang's metric
---
In general, when we want to perform a task many times, we'll write a function, rather than coding up each individual instance. In this exercise, we'll define a function for Zhang's metric that takes an antecedent and consequent and outputs the metric itself. When the problems we solve become increasingly complicated in the following chapter, having a convenient means of computing a metric will greatly simplify things.

In [9]:
# Define a function to compute Zhang's metric
def zhang(antecedent, consequent):
	# Compute the support of each book
	supportA = antecedent.mean()
	supportC = consequent.mean()

	# Compute the support of both books
	supportAC = np.logical_and(antecedent, consequent).mean()

	# Complete the expressions for the numerator and denominator
	numerator = supportAC - supportA*supportC
	denominator = max(supportAC*(1-supportA), supportA*(supportC-supportAC))

	# Return Zhang's metric
	return numerator / denominator
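
A quick sanity check of the function is to feed it extreme cases. The sketch below restates `zhang` so it runs on its own and applies it to made-up Series in which the items never co-occur, co-occur exactly as often as chance predicts, or always co-occur.

```python
import numpy as np
import pandas as pd

def zhang(antecedent, consequent):
    """Zhang's metric, as defined above."""
    supportA = antecedent.mean()
    supportC = consequent.mean()
    supportAC = np.logical_and(antecedent, consequent).mean()
    numerator = supportAC - supportA * supportC
    denominator = max(supportAC * (1 - supportA), supportA * (supportC - supportAC))
    return numerator / denominator

# Hypothetical toy data
a = pd.Series([True, True, False, False])
never_together = pd.Series([False, False, True, True])
independent = pd.Series([True, False, True, False])
always_together = a.copy()

z_never = zhang(a, never_together)
z_independent = zhang(a, independent)
z_always = zhang(a, always_together)
print(z_never, z_independent, z_always)
```

The three cases land at the bottom, middle, and top of the metric's range, which is a useful check before applying the function to real columns.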

## Applying Zhang's metric
---
The founder of the ebook start-up has returned for additional consulting services. She has sent you a list of itemsets she's investigating and has asked you to determine whether any of them contain items that are dissociated. When you're finished, she has asked that you add the metric you use to a column in the rules DataFrame, which is available to you, and currently contains columns for antecedents and consequents.

In [10]:
itemsets = [['Potter', 'Hunger'],
 ['Twilight', 'Hunger'],
 ['Mockingbird', 'Hunger'],
 ['Gatsby', 'Hunger'],
 ['Potter', 'Twilight'],
 ['Potter', 'Mockingbird'],
 ['Potter', 'Gatsby'],
 ['Mockingbird', 'Twilight'],
 ['Gatsby', 'Twilight'],
 ['Mockingbird', 'Gatsby']]

rules = pd.DataFrame({'antecedents':['Potter','Twilight','Mockingbird','Gatsby','Potter','Potter','Potter','Mockingbird','Gatsby','Mockingbird'],   
                      'consequents':['Hunger','Hunger','Hunger','Hunger','Twilight','Mockingbird','Gatsby','Twilight','Twilight','Gatsby']})

In [11]:
# Define an empty list for Zhang's metric
zhangs_metric = []

# Loop over lists in itemsets
for itemset in itemsets:
    # Extract the antecedent and consequent columns
    antecedent = books[itemset[0]]
    consequent = books[itemset[1]]

    # Compute Zhang's metric and append it to the list
    zhangs_metric.append(zhang(antecedent, consequent))

# Add the metric to the rules DataFrame and print the results
rules['zhang'] = zhangs_metric
print(rules)

   antecedents  consequents     zhang
0       Potter       Hunger -0.306049
1     Twilight       Hunger  0.109357
2  Mockingbird       Hunger -0.525436
3       Gatsby       Hunger -0.550446
4       Potter     Twilight  0.245118
5       Potter  Mockingbird -0.065537
6       Potter       Gatsby -0.165572
7  Mockingbird     Twilight -0.319008
8       Gatsby     Twilight -0.370875
9  Mockingbird       Gatsby  0.466460
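
To answer the founder's question directly, the dissociated pairs are the rows with a negative Zhang's metric. A minimal sketch of that filter follows; the DataFrame is rebuilt from the values printed above so the block runs on its own, and `rules_zhang` and `dissociated` are illustrative names.

```python
import pandas as pd

# Values copied from the printed rules DataFrame above
rules_zhang = pd.DataFrame({
    "antecedents": ["Potter", "Twilight", "Mockingbird", "Gatsby", "Potter",
                    "Potter", "Potter", "Mockingbird", "Gatsby", "Mockingbird"],
    "consequents": ["Hunger", "Hunger", "Hunger", "Hunger", "Twilight",
                    "Mockingbird", "Gatsby", "Twilight", "Twilight", "Gatsby"],
    "zhang": [-0.306049, 0.109357, -0.525436, -0.550446, 0.245118,
              -0.065537, -0.165572, -0.319008, -0.370875, 0.466460],
})

# Keep only dissociated pairs, most strongly dissociated first
dissociated = rules_zhang[rules_zhang["zhang"] < 0].sort_values("zhang")
print(dissociated)
```

Seven of the ten itemsets come out as dissociated, with Gatsby and Hunger at the negative extreme.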


## Filtering with support and conviction
---
The founder has approached you with the DataFrame rules, which contains the work of a data scientist who was previously on staff. It includes columns for antecedents and consequents, along with the performance for each of those rules with respect to a number of metrics.

In [12]:
# import dataset
rules = pd.read_csv('../Datasets/rules.csv')

# Preview the rules DataFrame using the .head() method
rules.head()

| Unnamed: 0 | antecedents | consequents | antecedent_support | consequent_support | support | confidence | lift | leverage | conviction | zhang |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Hunger) | (Potter) | 0.31913 | 0.477516 | 0.123851 | 0.388089 | 0.812725 | -0.028539 | 0.853857 | -0.252858 |
| 1 | (Potter) | (Hunger) | 0.477516 | 0.31913 | 0.123851 | 0.259365 | 0.812725 | -0.028539 | 0.919305 | -0.306049 |
| 2 | (Hunger) | (Twilight) | 0.31913 | 0.25677 | 0.089193 | 0.279486 | 1.088468 | 0.007249 | 1.031527 | 0.119373 |
| 3 | (Twilight) | (Hunger) | 0.25677 | 0.31913 | 0.089193 | 0.347363 | 1.088468 | 0.007249 | 1.04326 | 0.109357 |
| 4 | (Mockingbird) | (Hunger) | 0.476522 | 0.31913 | 0.096273 | 0.202033 | 0.633075 | -0.055799 | 0.853256 | -0.525436 |


In [13]:
# Select the subset of rules with antecedent support greater than 0.05
rules = rules[rules['antecedent_support'] > 0.05]

# Select the subset of rules with a consequent support greater than 0.01
rules = rules[rules['consequent_support'] > 0.01]

# Select the subset of rules with a conviction greater than 1.01
rules = rules[rules['conviction'] > 1.01]

# Print remaining rules
rules

| Unnamed: 0 | antecedents | consequents | antecedent_support | consequent_support | support | confidence | lift | leverage | conviction | zhang |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | (Hunger) | (Twilight) | 0.319130 | 0.256770 | 0.089193 | 0.279486 | 1.088468 | 0.007249 | 1.031527 | 0.119373 |
| 3 | (Twilight) | (Hunger) | 0.256770 | 0.319130 | 0.089193 | 0.347363 | 1.088468 | 0.007249 | 1.043260 | 0.109357 |
| 8 | (Twilight) | (Potter) | 0.256770 | 0.477516 | 0.140621 | 0.547654 | 1.146881 | 0.018009 | 1.155054 | 0.172316 |
| 9 | (Potter) | (Twilight) | 0.477516 | 0.256770 | 0.140621 | 0.294485 | 1.146881 | 0.018009 | 1.053457 | 0.245118 |
| 18 | (Mockingbird) | (Gatsby) | 0.476522 | 0.295155 | 0.186087 | 0.390511 | 1.323070 | 0.045439 | 1.156452 | 0.466460 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | (Gatsby, Twilight) | (Mockingbird, Potter) | 0.053540 | 0.219503 | 0.024348 | 0.454756 | 2.071754 | 0.012596 | 1.431465 | 0.546581 |
| 144 | (Gatsby, Potter) | (Mockingbird, Twilight) | 0.127702 | 0.098261 | 0.024348 | 0.190661 | 1.940360 | 0.011800 | 1.114168 | 0.555580 |
| 146 | (Mockingbird) | (Gatsby, Twilight, Potter) | 0.476522 | 0.034161 | 0.024348 | 0.051095 | 1.495687 | 0.008069 | 1.017845 | 0.633094 |
| 147 | (Gatsby) | (Mockingbird, Twilight, Potter) | 0.295155 | 0.062981 | 0.024348 | 0.082492 | 1.309778 | 0.005759 | 1.021264 | 0.335551 |
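
The three successive filters can also be written as a single boolean mask, which leaves the original DataFrame untouched. The sketch below applies the same thresholds to the five rules previewed by `.head()` above; `rules_sample` and `filtered` are illustrative names, and only the columns needed for the filter are reproduced.

```python
import pandas as pd

# First five rules, with values copied from the preview of rules.csv above
rules_sample = pd.DataFrame({
    "antecedents": ["(Hunger)", "(Potter)", "(Hunger)", "(Twilight)", "(Mockingbird)"],
    "consequents": ["(Potter)", "(Hunger)", "(Twilight)", "(Hunger)", "(Hunger)"],
    "antecedent_support": [0.31913, 0.477516, 0.31913, 0.25677, 0.476522],
    "consequent_support": [0.477516, 0.31913, 0.25677, 0.31913, 0.31913],
    "conviction": [0.853857, 0.919305, 1.031527, 1.04326, 0.853256],
})

# Combine the three conditions with & (each condition needs its own parentheses)
mask = (
    (rules_sample["antecedent_support"] > 0.05)
    & (rules_sample["consequent_support"] > 0.01)
    & (rules_sample["conviction"] > 1.01)
)
filtered = rules_sample[mask]
print(filtered)
```

Only the two Hunger and Twilight rules pass, which is consistent with rows 2 and 3 surviving in the full filtered output.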


## Using multi-metric filtering to cross-promote books
---
As a final request, the founder of the ebook selling start-up asks you to perform additional filtering. Your previous attempt returned 82 rules, but she wanted only one.

In [14]:
# Set the lift threshold to 1.5
rules = rules[rules['lift'] > 1.5]

# Set the conviction threshold to 1.0
rules = rules[rules['conviction'] > 1.0]

# Set the threshold for Zhang's metric to 0.65
rules = rules[rules['zhang'] > 0.65]

# Print the remaining rules
rules[['antecedents','consequents']]

| Unnamed: 0 | antecedents | consequents |
|---|---|---|
| 114 | (Mockingbird, Potter) | (Gatsby, Hunger) |
| 118 | (Mockingbird) | (Gatsby, Hunger, Potter) |
| 127 | (Mockingbird, Hunger) | (Gatsby, Twilight) |
| 128 | (Mockingbird, Twilight) | (Gatsby, Hunger) |
| 142 | (Mockingbird, Potter) | (Gatsby, Twilight) |
