In [2]:
import pandas as pd
import numpy as np

books = pd.read_csv('datasets/books.csv').drop(columns=['index'], axis=1)
books.head(1)

Unnamed: 0,Hunger,Potter,Twilight
0,False,True,False


In [3]:
# exercise 01

"""
Recommending books with support

A library wants to get members to read more and has decided to use market basket analysis to figure out how. They approach you to do the analysis and ask that you use the five most highly-rated books from the goodbooks-10k dataset, which was introduced in the video. You are given the data in one-hot encoded format in a pandas DataFrame called books.

Each column in the DataFrame corresponds to a book and has the value TRUE if the book is contained in a reader's library and is rated highly. To make things simpler, we'll work with shortened book names: Hunger, Potter, and Twilight.
"""

# Instructions

"""

    Compute the support for {Hunger, Potter}.

    Compute the support for {Hunger, Twilight}.

    Compute the support for {Potter, Twilight}.

"""

# solution

# Compute support for Hunger and Potter
supportHP = np.logical_and(books['Hunger'], books['Potter']).mean()

# Compute support for Hunger and Twilight
supportHT = np.logical_and(books['Hunger'], books['Twilight']).mean()

# Compute support for Potter and Twilight
supportPT = np.logical_and(books['Potter'], books['Twilight']).mean()

# Print support values
print("Hunger Games and Harry Potter: %.2f" % supportHP)
print("Hunger Games and Twilight: %.2f" % supportHT)
print("Harry Potter and Twilight: %.2f" % supportPT)

#----------------------------------#

# Conclusion

"""
Based on the support metric, Harry Potter and Twilight appear to be the best options for cross-promotion. In the next problem, we'll consider whether we should use Harry Potter to promote Twilight or Twilight to promote Harry Potter.
"""

Hunger Games and Harry Potter: 0.12
Hunger Games and Twilight: 0.09
Harry Potter and Twilight: 0.14


"\nBased on the support metric, Harry Potter and Twilight appear to be the best options for cross-promotion. In the next problem, we'll consider whether we should use Harry Potter to promote Twilight or Twilight to promote Harry Potter.\n"

In [5]:
# exercise 02

"""
Refining support with confidence

After reporting your findings from the previous exercise, the library asks you about the direction of the relationship. Should they use Harry Potter to promote Twilight or Twilight to promote Harry Potter?

After thinking about this, you decide to compute the confidence metric, which has a direction, unlike support. You'll compute it for both {Potter}
{Twilight} and {Twilight} {Potter}. The DataFrame books has been imported for you, which has one column for each book: Potter and Twilight.
"""

# Instructions

"""
    Compute the support of {Potter, Twilight}.
    
    Compute the support of {Potter}.
    
    Compute the support of {Twilight}.
    
    Compute the confidence of {Potter}
    {Twilight} and {Twilight} {Potter}.
"""

# solution

# Compute support for Potter and Twilight
supportPT = np.logical_and(books['Potter'], books['Twilight']).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for Twilight
supportT = books['Twilight'].mean()

# Compute confidence for both rules
confidencePT = supportPT / supportP
confidenceTP = supportPT / supportT

# Print results
print('confidencePT: {0:.2f}, confidenceTP: {1:.2f}'.format(confidencePT, confidenceTP))

#----------------------------------#

# Conclusion

"""
Even though the support is identical for the two association rules, the confidence is much higher for Twilight -> Harry Potter, since Harry Potter has a higher support than Twilight.
"""

confidencePT: 0.29, confidenceTP: 0.55


'\nEven though the support is identical for the two association rules, the confidence is much higher for Twilight -> Harry Potter, since Harry Potter has a higher support than Twilight.\n'

In [6]:
# exercise 03

"""
Further refinement with lift

Once again, you report your results to the library: Use Twilight to promote Harry Potter, since the rule has a higher confidence metric. The library thanks you for the suggestion, but asks you to confirm that this is a meaningful relationship using another metric.

You recall that lift may be useful here. If lift is less than 1, this means that Harry Potter and Twilight are paired together less frequently than we would expect if the pairings occurred by random chance. As with the previous two exercises, the DataFrame books has been imported for you, along with numpy under the alias np.
"""

# Instructions

"""
    Compute the support of {Potter, Twilight}.
    
    Compute the support of {Potter}.
    
    Compute the support of {Twilight}.
    
    Compute the lift of {Potter}
    {Twilight}.
"""

# solution

# Compute support for Potter and Twilight
supportPT = np.logical_and(books['Potter'], books['Twilight']).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for Twilight
supportT = books['Twilight'].mean()

# Compute lift
lift = (supportPT) / (supportP * supportT)

# Print lift
print("Lift: %.2f" % lift)

#----------------------------------#

# Conclusion

"""
As it turns out, lift is greater than 1.0. This could give us some confidence that the association rule we recommended did not arise by random chance.
"""

Lift: 1.15


'\nAs it turns out, lift is greater than 1.0. This could give us some confidence that the association rule we recommended did not arise by random chance.\n'

# Lift versus leverage

As we discovered in the video, lift and leverage are quite similar. One is a ratio and the other is a difference. One has a threshold of 0 for good rules and the other has a threshold of 1. Can you remember which is which?

![Answer](images/ch02-01.png)

In [7]:
# exercise 04

"""
Computing conviction

After hearing about the useful advice you provided to the library, the founder of a small ebook selling start-up approaches you for consulting services. As a test of your abilities, she asks you if you are able to compute conviction for the rule {Potter}
{Hunger}, so she can decide whether to place the books next to each other on the company's website. Fortunately, you still have access to the goodreads-10k data, which is available as books. Additionally, pandas has been imported as pd and numpy as np.
"""

# Instructions

"""

    Compute the support for {Potter} and assign it to supportP.

    Compute the support for NOT {Hunger}.

    Compute the support for {Potter} and NOT {Hunger}.

    Complete the expression for the conviction metric in the return statement.

"""

# solution

# Compute support for Potter AND Hunger
supportPH = np.logical_and(books['Potter'], books['Hunger']).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for NOT Hunger
supportnH = 1.0 - books['Hunger'].mean()

# Compute support for Potter and NOT Hunger
supportPnH = supportP - supportPH

# Compute and print conviction for Potter -> Hunger
conviction = supportP * supportnH / supportPnH
print("Conviction: %.2f" % conviction)

#----------------------------------#

# Conclusion

"""
Great job! Notice that the value of conviction was less than 1, suggesting that the rule ``if Potter then Hunger'' is not supported.
"""

Conviction: 0.92


"\nGreat job! Notice that the value of conviction was less than 1, suggesting that the rule ``if Potter then Hunger'' is not supported.\n"

In [8]:
# exercise 05

"""
Computing conviction with a function

After successful completion of her trial project, the ebook start-up's founder decides to hire you for a much bigger project. She asks you if you are able to compute conviction for every pair of books in the goodreads-10k dataset, so she can use that information to decide which books to locate closer together on the website.

You agree to take the job, but realize that you need more a efficient way to compute conviction, since you will need to compute it many times. You decide to write a function that computes it. It will take two columns of a pandas DataFrame as an input, one antecedent and one consequent, and output the conviction metric. Note that pandas is available as pd and numpy is available as np.
"""

# Instructions

"""

    Compute the support for the antecedent and assign it to supportA.

    Compute the support for NOT consequent.

    Compute the support for antecedent and NOT consequent.

"""

# solution

def conviction(antecedent, consequent):
	# Compute support for antecedent AND consequent
	supportAC = np.logical_and(antecedent, consequent).mean()

	# Compute support for antecedent
	supportA = antecedent.mean()

	# Compute support for NOT consequent
	supportnC = 1.0 - consequent.mean()

	# Compute support for antecedent and NOT consequent
	supportAnC = supportA - supportAC

    # Return conviction
	return supportA * supportnC / supportAnC

#----------------------------------#

# Conclusion

"""
Excellent work! In the next exercise, we'll start applying this function to the dataset to see what we find.
"""

"\nExcellent work! In the next exercise, we'll start applying this function to the dataset to see what we find.\n"

In [9]:
potter = books.loc[:, 'Potter']
twilight = books.loc[:, 'Twilight']
hunger = books.loc[:, 'Hunger']

In [10]:
# exercise 06

"""
Promoting ebooks with conviction

In the previous exercise, we defined a function to compute conviction. We were asked to apply that function to all two-book permutations of the goodreads-10k dataset. In this exercise, we'll test the function by applying it to the three most popular books, which we used in earlier exercises: The Hunger Games, Harry Potter, and Twilight.

The function has been defined for you and is available as conviction. Recall that it takes an antecedent and a consequent as its two arguments. Additionally, the columns of the books DataFrame from earlier exercises are available as three separate DataFrames: potter, twilight, and hunger.
"""

# Instructions

"""
Compute conviction for {Twilight} {Potter} and {Potter}
{Twilight}.

Compute conviction for {Twilight}
{Hunger} and {Hunger}
{Twilight}.

Compute conviction for {Potter}
{Hunger} and {Hunger} {Potter}.
"""

# solution

# Compute conviction for twilight -> potter and potter -> twilight
convictionTP = conviction(twilight, potter)
convictionPT = conviction(potter, twilight)

# Compute conviction for twilight -> hunger and hunger -> twilight
convictionTH = conviction(twilight, hunger)
convictionHT = conviction(hunger, twilight)

# Compute conviction for potter -> hunger and hunger -> potter
convictionPH = conviction(potter, hunger)
convictionHP = conviction(hunger, potter)

# Print results
print('Harry Potter -> Twilight: ', convictionHT)
print('Twilight -> Potter: ', convictionTP)

#----------------------------------#

# Conclusion

"""
Great job! Notice that the conviction metric for if Potter then Twilight and if Twilight then Potter are both above 1, indicating that they are both viable rules.
"""

Harry Potter -> Twilight:  1.0315274939515657
Twilight -> Potter:  1.1550539077290998


'\nGreat job! Notice that the conviction metric for if Potter then Twilight and if Twilight then Potter are both above 1, indicating that they are both viable rules.\n'

In [11]:
# exercise 07

"""
Computing association and dissociation

The library has returned to you once again about your recommendation to promote Harry Potter using Twilight. They're worried that the two might be dissociated, which could have a negative impact on their promotional effort. They ask you to verify that this is not the case.

You immediately think of Zhang's metric, which measures association and dissociation continuously. Association is positive and dissociation is negative. As with the previous exercises, the DataFrame books has been imported for you, along with numpy under the alias np. Zhang's metric is computed as follows:


"""

# Instructions

"""
    Compute the support of {Twilight} and the support of {Potter}.

    Compute the support of {Twilight, Potter}.

    Complete the expression for the denominator.

    Compute Zhang's metric for {Twilight}
    {Potter}.
"""

# solution

# Compute the support of Twilight and Harry Potter
supportT = books['Twilight'].mean()
supportP = books['Potter'].mean()

# Compute the support of both books
supportTP = np.logical_and(books['Twilight'], books['Potter']).mean()

# Complete the expressions for the numerator and denominator
numerator = supportTP - supportT*supportP
denominator = max(supportTP*(1-supportT), supportT*(supportP-supportTP))

# Compute and print Zhang's metric
zhang = numerator / denominator
print(zhang)

#----------------------------------#

# Conclusion

"""
Great work! Once again, the association rule ``if Twilight then Harry Potter'' proved robust. It had a positive value for Zhang's metric, indicating that the two books are not dissociated.
"""

'\n\n'

In [12]:
# exercise 08

"""
Defining Zhang's metric

In general, when we want to perform a task many times, we'll write a function, rather than coding up each individual instance. In this exercise, we'll define a function for Zhang's metric that takes an antecedent and consequent and outputs the metric itself. When the problems we solve become increasingly complicated in the following chapter, having a convenient means of computing a metric will greatly simplify things.

Note that numpy has been imported as np and pandas has been imported as pd. Additionally, recall that the expression for Zhang's metric in terms of support calculations is the following:


"""

# Instructions

"""

    Define the support values of the antecedent and consequent individually.

    Define the support of {antecedent, consequent}.

    Complete the expressions for the numerator and denominator.

    Complete the expression for Zhang's metric.

"""

# solution

# Define a function to compute Zhang's metric
def zhang(antecedent, consequent):
	# Compute the support of each book
	supportA = antecedent.mean()
	supportC = consequent.mean()

	# Compute the support of both books
	supportAC = np.logical_and(antecedent, consequent).mean()

	# Complete the expressions for the numerator and denominator
	numerator = supportAC - supportA*supportC
	denominator = max(supportAC*(1-supportA), supportA*(supportC-supportAC))

	# Return Zhang's metric
	return numerator / denominator

#----------------------------------#

# Conclusion

"""
Excellent work! We've now defined Zhang's metric. In the next exercise, we'll apply it to the problem of selecting a website layout for the ebook start-up.
"""

"\nExcellent work! We've now defined Zhang's metric. In the next exercise, we'll apply it to the problem of selecting a website layout for the ebook start-up.\n"

In [14]:
itemsets = [['Potter', 'Hunger'],
 ['Twilight', 'Hunger'],
 ['Mockingbird', 'Hunger'],
 ['Gatsby', 'Hunger'],
 ['Potter', 'Twilight'],
 ['Potter', 'Mockingbird'],
 ['Potter', 'Gatsby'],
 ['Mockingbird', 'Twilight'],
 ['Gatsby', 'Twilight'],
 ['Mockingbird', 'Gatsby']]

In [17]:
books = pd.read_csv('datasets/books_2.csv').drop(columns=['index'],axis=1)

In [22]:
rules = """
	antecedents	consequents
0	Potter	Hunger
1	Twilight	Hunger
2	Mockingbird	Hunger
3	Gatsby	Hunger
4	Potter	Twilight
5	Potter	Mockingbird
6	Potter	Gatsby
7	Mockingbird	Twilight
8	Gatsby	Twilight
9	Mockingbird	Gatsby

"""

In [23]:
rules = pd.read_clipboard()
rules

Unnamed: 0,antecedents,consequents
0,Potter,Hunger
1,Twilight,Hunger
2,Mockingbird,Hunger
3,Gatsby,Hunger
4,Potter,Twilight
5,Potter,Mockingbird
6,Potter,Gatsby
7,Mockingbird,Twilight
8,Gatsby,Twilight
9,Mockingbird,Gatsby


In [21]:
rules.to_clipboard()

In [24]:
# exercise 09

"""
Applying Zhang's metric

The founder of the ebook start-up has returned for additional consulting services. She has sent you a list of itemsets she's investigating and has asked you to determine whether any of them contain items that are dissociated. When you're finished, she has asked that you add the metric you use to a column in the rules DataFrame, which is available to you, and currently contains columns for antecedents and consequents.

The itemsets are available as a list of lists called itemsets. Each list contains the antecedent first and the consequent second. You also have access to the books DataFrame from previous exercises. Note that Zhang's metric has been defined for you and is available as zhang(). Additionally, pandas is available as pd and numpy as np.
"""

# Instructions

"""

    Loop over each itemset in itemsets.
    
    Extract the antecedent and consequent columns from books for each itemset.
    
    Complete the statement and append it to the zhangs_metric list.
    
    Print the metric for each itemset.

"""

# solution

# Define an empty list for Zhang's metric
zhangs_metric = []

# Loop over lists in itemsets
for itemset in itemsets:
    # Extract the antecedent and consequent columns
	antecedent = books[itemset[0]]
	consequent = books[itemset[1]]
    
    # Complete Zhang's metric and append it to the list
	zhangs_metric.append(zhang(antecedent, consequent))
    
# Print results
rules['zhang'] = zhangs_metric
print(rules)

#----------------------------------#

# Conclusion

"""
Good job! Notice that most of the items were dissociated, which suggests that they would have been a poor choice to pair together for promotional purposes.
"""

   antecedents  consequents     zhang
0       Potter       Hunger -0.306049
1     Twilight       Hunger  0.109357
2  Mockingbird       Hunger -0.525436
3       Gatsby       Hunger -0.550446
4       Potter     Twilight  0.245118
5       Potter  Mockingbird -0.065537
6       Potter       Gatsby -0.165572
7  Mockingbird     Twilight -0.319008
8       Gatsby     Twilight -0.370875
9  Mockingbird       Gatsby  0.466460


'\nGood job! Notice that most of the items were dissociated, which suggests that they would have been a poor choice to pair together for promotional purposes.\n'

In [25]:
rules = pd.read_csv('datasets/rules.csv').drop(columns=['index'], axis=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,frozenset({'Hunger'}),frozenset({'Potter'}),0.31913,0.477516,0.123851,0.388089,0.812725,-0.028539,0.853857
1,frozenset({'Potter'}),frozenset({'Hunger'}),0.477516,0.31913,0.123851,0.259365,0.812725,-0.028539,0.919305
2,frozenset({'Twilight'}),frozenset({'Hunger'}),0.25677,0.31913,0.089193,0.347363,1.088468,0.007249,1.04326
3,frozenset({'Hunger'}),frozenset({'Twilight'}),0.31913,0.25677,0.089193,0.279486,1.088468,0.007249,1.031527
4,frozenset({'Mockingbird'}),frozenset({'Hunger'}),0.476522,0.31913,0.096273,0.202033,0.633075,-0.055799,0.853256


In [26]:
# exercise 10

"""
Filtering with support and conviction

In the video, we discussed the continued consulting work you are doing for the founder of an ebook selling start-up. The founder has approached you with the DataFrame rules, which contains the work of a data scientist who was previously on staff. It includes columns for antecedents and consequents, along with the performance for each of those rules with respect to a number of metrics.

Your objective will be to perform multi-metric filtering on the dataset to identify potentially useful rules. Note that pandas is available as pd and numpy as np. Additionally, rules has been defined and is available.
"""

# Instructions

"""

    Use the .head() method with print to preview the dataset.

    Select the subset of rules with an antecedent support greater than 0.05.

    Select the subset of rules with a consequent support greater than 0.02.

    Select the subset of rules with a conviction greater than 1.01.

"""

# solution

# Preview the rules DataFrame using the .head() method
print(rules.head())

# Select the subset of rules with antecedent support greater than 0.05
rules = rules[rules['antecedent support'] > 0.05]

# Select the subset of rules with a consequent support greater than 0.02
rules = rules[rules['consequent support'] > 0.02]

# Select the subset of rules with a conviction greater than 1.01
rules = rules[rules['conviction'] > 1.01]

# Print remaining rules
print(rules)

#----------------------------------#

# Conclusion

"""
Excellent work! You have now successfully performed multi-metric filtering. In the final exercise in this chapter, you'll go even further by including an advanced metric.
"""

                  antecedents              consequents  antecedent support  \
0       frozenset({'Hunger'})    frozenset({'Potter'})            0.319130   
1       frozenset({'Potter'})    frozenset({'Hunger'})            0.477516   
2     frozenset({'Twilight'})    frozenset({'Hunger'})            0.256770   
3       frozenset({'Hunger'})  frozenset({'Twilight'})            0.319130   
4  frozenset({'Mockingbird'})    frozenset({'Hunger'})            0.476522   

   consequent support   support  confidence      lift  leverage  conviction  
0            0.477516  0.123851    0.388089  0.812725 -0.028539    0.853857  
1            0.319130  0.123851    0.259365  0.812725 -0.028539    0.919305  
2            0.319130  0.089193    0.347363  1.088468  0.007249    1.043260  
3            0.256770  0.089193    0.279486  1.088468  0.007249    1.031527  
4            0.319130  0.096273    0.202033  0.633075 -0.055799    0.853256  
                                antecedents  \
2               

"\nExcellent work! You have now successfully performed multi-metric filtering. In the final exercise in this chapter, you'll go even further by including an advanced metric.\n"

In [None]:
# exercise 11

"""
Using multi-metric filtering to cross-promote books

As a final request, the founder of the ebook selling start-up asks you to perform additional filtering. Your previous attempt returned 82 rules, but she wanted only one. The rules dataset has again been made available in the console. Finally, Zhang's metric has been computed for you and included in the rules DataFrame under the column header zhang.
"""

# Instructions

"""

    Set the lift threshold to be greater than 1.5.

    Use a conviction threshold of 1.0.

    Require Zhang's metric to be greater than 0.65.

"""

# solution

# Set the lift threshold to 1.5
rules = rules[rules['lift'] > 1.5]

# Set the conviction threshold to 1.0
rules = rules[rules['conviction'] > 1.0]

# Set the threshold for Zhang's rule to 0.65
rules = rules[rules['zhang'] > 0.65]

# Print rule
print(rules[['antecedents','consequents']])

#----------------------------------#

# Conclusion

"""
Great job! Notice that we started with an enormous set of user libraries, narrowed it down to a subset of interest, and then applied filtering to identify five association rules. We then printed the antecedent and consequent in the iPython Shell.
"""