# The basics of market basket analysis

Market basket analysis uses lists of transactions to identify useful associations between items. Such associations can be written in the form of a rule that has an antecedent and a consequent. Let's assume a small grocery store has asked you to look at their transaction data. After some analysis, you find the rule given below.

{cereal} -> {milk}

Which statement about this rule is correct?

### Possible Answers


    {cereal} is the antecedent, {milk} is the consequent, and both are items. {Answer}
    
    
    {milk} is the antecedent, {cereal} is the consequent, and both are items.
    
    
    {cereal} is the antecedent, but neither is an item.

In [1]:
transactions = [['bread', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum']]

In [2]:
# exercise 01

"""
Cross-selling products

The small grocery store has decided to cross-sell chewing gum with either coffee, cereal, or bread. To determine which of the three items is best to use, the store owner has performed an experiment. For one week, she sold chewing gum next to the register and recorded all transactions where it was purchased with either coffee, cereal, or bread. The transactions from that week are available as a list of lists named transactions. Each transaction is either ['coffee','gum'], ['cereal','gum'], or ['bread','gum'].
"""

# Instructions

"""

    Count the number of transactions that contain coffee and gum.

    Count the number of transactions that contain cereal and gum.

    Count the number of transactions that contain bread and gum.

"""

# solution

# Count the number of transactions with coffee and gum
coffee = transactions.count(['coffee', 'gum'])

# Count the number of transactions with cereal and gum
cereal = transactions.count(['cereal', 'gum'])

# Count the number of transactions with bread and gum
bread = transactions.count(['bread', 'gum'])

# Print the count for each type of transaction
print('coffee:', coffee)
print('cereal:', cereal)
print('bread:', bread)

#----------------------------------#

# Conclusion

"""
Excellent work! Based on our results, which were printed to the console, we can recommend that the store owner cross-sell chewing gum next to the coffee. In the following video, we will consider how to identify useful association rules in more challenging environments.
"""

coffee: 40
cereal: 25
bread: 20
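
The three separate `count` calls can also be collapsed into a single pass. The following is a minimal sketch, assuming a small made-up list in the same shape as `transactions` (the name `sample` and its contents are hypothetical, not the store's data). It uses `collections.Counter` from the standard library to tally each pairing and rank them.

```python
from collections import Counter

# Hypothetical sample in the same shape as `transactions` (not the store's data)
sample = [['coffee', 'gum'], ['bread', 'gum'], ['coffee', 'gum'],
          ['cereal', 'gum'], ['coffee', 'gum'], ['bread', 'gum']]

# Lists are unhashable, so convert each transaction to a tuple before counting
counts = Counter(tuple(t) for t in sample)

# most_common() orders the pairings from most to least frequent
for pairing, n in counts.most_common():
    print(pairing[0] + ':', n)

# The item most often bought together with gum in this sample
best_item = counts.most_common(1)[0][0][0]
print('best item to cross-sell:', best_item)
```

This approach scales to any number of candidate items without adding a line of code per item, which may be convenient once the experiment covers more than three products.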



# Multiple antecedents and consequents

Market basket analysis revolves around the use of association rules, which are if-then statements about the relationship between two sets of items. The rule {coffee} -> {milk}, for instance, is read as "if coffee then milk," where coffee is the antecedent and milk is the consequent. Many rules have multiple antecedents and consequents. We will examine such rules in this exercise.

![Answer](images/ch01-01.png)
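
One way to hold rules with several antecedents or consequents in code is as a pair of sets. The sketch below is illustrative only: the item combinations are made up rather than taken from the grocery data, and `describe` is a hypothetical helper that renders a rule as an if-then statement.

```python
# Represent each rule as an (antecedent, consequent) pair of frozensets.
# The item combinations below are made up for illustration.
rules = [
    (frozenset({'coffee'}), frozenset({'milk'})),
    (frozenset({'coffee', 'sugar'}), frozenset({'milk'})),
    (frozenset({'bread'}), frozenset({'jam', 'milk'})),
]

def describe(rule):
    """Render a rule as an if-then statement (hypothetical helper)."""
    antecedent, consequent = rule
    left = ', '.join(sorted(antecedent))
    right = ', '.join(sorted(consequent))
    return f"if {{{left}}} then {{{right}}}"

descriptions = [describe(rule) for rule in rules]
for description in descriptions:
    print(description)
```

Using `frozenset` keeps the items within each side unordered while preserving the direction of the rule, since the pair itself is ordered.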

In [4]:
# exercise 02

"""
Preparing data for market basket analysis

Throughout this course, you will typically encounter data in one of two formats: a pandas DataFrame or a list of lists. DataFrame objects will be constructed by importing a CSV file using pandas. They will consist of a single column of data, where each element contains a string of items in a transaction, separated by a comma, as in the table below.

Transaction
'milk,bread,biscuit'
'bread,milk,biscuit,cereal'
…
'tea,milk,coffee,cereal'

In this exercise, you will practice loading the data from a CSV file and will prepare it for use as a list of lists. Note that the path to the grocery store dataset has been defined and is available to you as groceries_path.
"""

# Instructions

"""

    Import the pandas package under the alias pd.

    Use pandas to read the csv file at the path specified by groceries_path.

    Select the Transaction column from the DataFrame and split each string of comma-separated items into a list.

    Convert the DataFrame of transactions into a list of lists.

"""

# solution
groceries_path = 'datasets/groceries.csv'
# Import pandas under the alias pd
import pandas as pd

# Load transactions with pandas
groceries = pd.read_csv(groceries_path)

# Split transaction strings into lists
transactions = groceries['Transaction'].apply(lambda t: t.split(','))

# Convert the pandas Series of item lists into a list of lists
transactions = list(transactions)

# Print the list of transactions
print(transactions)

#----------------------------------#

# Conclusion

"""
Excellent work! In the rest of the chapter, we will discuss how to put all of the pieces together to perform market basket analysis. You will load and prepare data, generate association rules, and then discard the subset of rules that are not useful.
"""

[['milk', 'bread', 'biscuit'], ['bread', 'milk', 'biscuit', 'cereal'], ['bread', 'tea'], ['jam', 'bread', 'milk'], ['tea', 'biscuit'], ['bread', 'tea'], ['tea', 'cereal'], ['bread', 'tea', 'biscuit'], ['jam', 'bread', 'tea'], ['bread', 'milk'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'sugar'], ['bread', 'coffee', 'orange'], ['bread', 'sugar', 'biscuit'], ['coffee', 'sugar', 'cereal'], ['bread', 'sugar', 'biscuit'], ['bread', 'coffee', 'sugar'], ['bread', 'coffee', 'sugar'], ['tea', 'milk', 'coffee', 'cereal']]
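
The same loading step can be tried without the dataset on disk. This is a minimal sketch, assuming a two-row in-memory stand-in for the file (`csv_text` and `sample_transactions` are hypothetical names). It also swaps the lambda for the vectorized `.str.split` accessor, which is an alternative to the exercise's solution rather than a replacement for it.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the file at groceries_path.
# Each transaction is quoted so its commas stay inside a single field.
csv_text = 'Transaction\n"milk,bread,biscuit"\n"bread,tea"\n'

groceries = pd.read_csv(io.StringIO(csv_text))

# .str.split(',') is a vectorized alternative to apply(lambda t: t.split(','))
sample_transactions = groceries['Transaction'].str.split(',').tolist()

print(sample_transactions)
```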



In [5]:
# exercise 03

"""
Generating association rules

As you saw, the function permutations from the module itertools can be used to quickly generate the set of all one-antecedent, one-consequent rules. You do not, of course, know which of these rules are useful. You simply know that each is a valid way to combine two items.

Let's practice generating and counting the set of all rules for a subset of the grocery dataset: coffee, tea, milk, and sugar.
"""

# Instructions

"""

    Complete the import statement to import the permutations function.

    Generate all association rules from the groceries list.

    Print the set of rules.

    Print the number of rules.

"""

# solution

# Import permutations from the itertools module
from itertools import permutations

# Define the set of groceries
flattened = [i for t in transactions for i in t]
groceries = list(set(flattened))

# Generate all possible rules from groceries list
rules = list(permutations(groceries, 2))

# Print the set of rules
print(rules)

# Print the number of rules
print(len(rules))

#----------------------------------#

# Conclusion

"""
Excellent work! Later in the chapter, you'll move beyond generating and counting association rules to selecting rules that are useful.
"""

[('tea', 'cereal'), ('tea', 'milk'), ('tea', 'orange'), ('tea', 'coffee'), ('tea', 'sugar'), ('tea', 'jam'), ('tea', 'biscuit'), ('tea', 'bread'), ('cereal', 'tea'), ('cereal', 'milk'), ('cereal', 'orange'), ('cereal', 'coffee'), ('cereal', 'sugar'), ('cereal', 'jam'), ('cereal', 'biscuit'), ('cereal', 'bread'), ('milk', 'tea'), ('milk', 'cereal'), ('milk', 'orange'), ('milk', 'coffee'), ('milk', 'sugar'), ('milk', 'jam'), ('milk', 'biscuit'), ('milk', 'bread'), ('orange', 'tea'), ('orange', 'cereal'), ('orange', 'milk'), ('orange', 'coffee'), ('orange', 'sugar'), ('orange', 'jam'), ('orange', 'biscuit'), ('orange', 'bread'), ('coffee', 'tea'), ('coffee', 'cereal'), ('coffee', 'milk'), ('coffee', 'orange'), ('coffee', 'sugar'), ('coffee', 'jam'), ('coffee', 'biscuit'), ('coffee', 'bread'), ('sugar', 'tea'), ('sugar', 'cereal'), ('sugar', 'milk'), ('sugar', 'orange'), ('sugar', 'coffee'), ('sugar', 'jam'), ('sugar', 'biscuit'), ('sugar', 'bread'), ('jam', 'tea'), ('jam', 'cereal'), ('ja
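
The number of one-antecedent, one-consequent rules grows as n × (n − 1) for n items, because order matters: `('coffee', 'tea')` and `('tea', 'coffee')` are different rules. A short sketch for the four-item subset named in the exercise text (coffee, tea, milk, and sugar) illustrates this; the variable names are chosen here for illustration.

```python
from itertools import permutations

# The four-item subset named in the exercise text
items = ['coffee', 'tea', 'milk', 'sugar']

# Each ordered pair is one (antecedent, consequent) rule
rules = list(permutations(items, 2))

# Compare the generated count against the closed form n * (n - 1)
n = len(items)
expected = n * (n - 1)

print(len(rules), expected)
```

By the same formula, the nine items in the full grocery list give 9 × 8 = 72 rules.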

In [6]:
# exercise 04

"""
One-hot encoding transaction data

Throughout the course, we will use a common pipeline for preprocessing data for use in market basket analysis. The first step is to import a pandas DataFrame and select the column that contains transactions. Each transaction in the column will be a string that consists of a number of items, each separated by a comma. The next step is to use a lambda function to split each transaction string into a list, thereby transforming the column into a list of lists.

In this exercise, you'll start with the list of lists from the grocery dataset, which is available to you as transactions. You will then transform transactions into a one-hot encoded DataFrame, where each column consists of True and False values that indicate whether an item was included in a transaction.
"""

# Instructions

"""

    From mlxtend.preprocessing, import TransactionEncoder.

    Instantiate a transaction encoder and identify the unique items in transactions.

    One-hot encode transactions in an array and assign its values to onehot.

    Convert the array into a pandas DataFrame using the item names as column headers.

"""

# solution

# Import the TransactionEncoder class from mlxtend
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Instantiate transaction encoder and identify unique items in transactions
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)

# Print the one-hot encoded transaction dataset
print(onehot)

#----------------------------------#

# Conclusion

"""
Excellent work! In the next exercise, we'll make use of the one-hot encoded transaction dataset to compute the support metric.
"""

    biscuit  bread  cereal  coffee    jam   milk  orange  sugar    tea
0      True   True   False   False  False   True   False  False  False
1      True   True    True   False  False   True   False  False  False
2     False   True   False   False  False  False   False  False   True
3     False   True   False   False   True   True   False  False  False
4      True  False   False   False  False  False   False  False   True
5     False   True   False   False  False  False   False  False   True
6     False  False    True   False  False  False   False  False   True
7      True   True   False   False  False  False   False  False   True
8     False   True   False   False   True  False   False  False   True
9     False   True   False   False  False   True   False  False  False
10     True  False    True    True  False  False    True  False  False
11     True  False    True    True  False  False    True  False  False
12    False  False   False    True  False  False   False   True  False
13    
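
To make the encoding step less of a black box, here is a plain-Python sketch of what one-hot encoding computes. It is not mlxtend's implementation. It assumes a three-transaction sample drawn from the grocery data, and the names `sample`, `items`, and `onehot_rows` are chosen here for illustration.

```python
# Three transactions drawn from the grocery data, used as a small sample
sample = [['milk', 'bread', 'biscuit'],
          ['bread', 'tea'],
          ['tea', 'biscuit']]

# Step 1: identify the unique items (sorted alphabetically here)
items = sorted({item for t in sample for item in t})

# Step 2: for every transaction, record whether each item is present
onehot_rows = [{item: (item in t) for item in items} for t in sample]

for row in onehot_rows:
    print(row)
```

Passing `onehot_rows` to `pd.DataFrame` would produce a table of the same shape as `onehot`, with one Boolean column per item.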

In [7]:
import numpy as np

In [8]:
# exercise 05

"""
Computing the support metric

In the previous exercise, you one-hot encoded a small grocery store's transactions as the DataFrame onehot. In this exercise, you'll make use of that DataFrame and the support metric to help the store's owner. First, she has asked you to identify frequently purchased items, which you'll do by computing support at the item level. Second, she has asked you to check whether the rule {jam} -> {bread} has a support of over 0.05. Note that onehot has been defined and is available. Additionally, pandas has been imported under the alias pd and numpy has been imported under the alias np.
"""

# Instructions

"""

    Compute the support value for each item in the one-hot encoded dataset, onehot.
    Print the support values.

    Add a column, jam+bread, to onehot that is True if both jam and bread are in the transaction.
    Print support.

"""

# solution

# Compute the support
support = onehot.mean()

# Print the support
print(support)

#----------------------------------#

# Add a jam+bread column to the DataFrame onehot
onehot['jam+bread'] = np.logical_and(onehot['jam'], onehot['bread'])

# Compute the support
support = onehot.mean()

# Print the support values
print(support)

#----------------------------------#

# Conclusion

"""
Excellent work! In the next chapter, we'll start working with much larger datasets, where it will be necessary to prune both items and rules by support.
"""

biscuit    0.40
bread      0.65
cereal     0.30
coffee     0.40
jam        0.10
milk       0.25
orange     0.15
sugar      0.30
tea        0.35
dtype: float64
biscuit      0.40
bread        0.65
cereal       0.30
coffee       0.40
jam          0.10
milk         0.25
orange       0.15
sugar        0.30
tea          0.35
jam+bread    0.10
dtype: float64
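
Support is simply the share of transactions that contain every item in an itemset, so it can also be computed directly from the list of lists. The sketch below assumes only the first five transactions of the grocery data, so its values differ from the full-dataset output above; `support` is a hypothetical helper, not a library function.

```python
# First five transactions of the grocery data, used as a small sample
sample = [['milk', 'bread', 'biscuit'],
          ['bread', 'milk', 'biscuit', 'cereal'],
          ['bread', 'tea'],
          ['jam', 'bread', 'milk'],
          ['tea', 'biscuit']]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

# Item-level support
bread_support = support(['bread'], sample)

# Support for the items in the rule {jam} -> {bread}
jam_bread_support = support(['jam', 'bread'], sample)

print('bread:', bread_support)
print('jam+bread:', jam_bread_support)
```

Because support counts co-occurrence only, it is the same for {jam} -> {bread} and {bread} -> {jam}.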
