
Market basket analysis uses lists of transactions to identify useful associations between items. Such associations can be written in the form of a rule that has an antecedent and a consequent. Let's assume a small grocery store has asked you to look at their transaction data. After some analysis, you find the rule given below.

{cereal} → {milk}



# Introduction to Market Basket Analysis

## What is Market Basket Analysis?

### Cross-selling products

The small grocery store has decided to cross-sell chewing gum with either coffee, cereal, or bread. To determine which of the three items is best to use, the store owner has performed an experiment. For one week, she sold chewing gum next to the register and recorded all transactions where it was purchased with either coffee, cereal, or bread. The transactions from that week are available as a list of lists named transactions. Each transaction is either ['coffee','gum'], ['cereal','gum'], or ['bread','gum'].

In [None]:
transactions = [['bread', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['bread', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['cereal', 'gum'],
 ['cereal', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['bread', 'gum'],
 ['coffee', 'gum'],
 ['coffee', 'gum']]

In [None]:
# Count the number of transactions with coffee and gum
coffee = transactions.count(['coffee', 'gum'])

# Count the number of transactions with cereal and gum
cereal = transactions.count(['cereal', 'gum'])

# Count the number of transactions with bread and gum
bread = transactions.count(['bread', 'gum'])

# Print the count for each pairing with gum
print('coffee:', coffee)
print('cereal:', cereal)
print('bread:', bread)

coffee: 40
cereal: 25
bread: 20
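
As a hedged alternative to calling count once per pairing, the tally can be gathered in a single pass with collections.Counter. This is a minimal sketch: the name sample is a shortened, illustrative stand-in for the transactions list above, not the recorded data.

```python
from collections import Counter

# Illustrative stand-in in the same format as the transactions list above
sample = [['bread', 'gum'], ['coffee', 'gum'], ['coffee', 'gum'], ['cereal', 'gum']]

# Lists are unhashable, so convert each transaction to a tuple before counting
pair_counts = Counter(tuple(t) for t in sample)

# Print the pairings from most to least frequent
for (item, _), count in pair_counts.most_common():
    print(item + ':', count)

# The item most often bought together with gum in this sample
best_item = pair_counts.most_common(1)[0][0][0]
print('best candidate:', best_item)
```

Run on the full list, the same approach would reproduce the counts printed above, which point to coffee as the strongest cross-selling candidate.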


## Identifying association rules

### Preparing data for market basket analysis

Throughout this course, you will typically encounter data in one of two formats: a pandas DataFrame or a list of lists. DataFrame objects will be constructed by importing a csv file using pandas. They will consist of a single column of data, where each element contains a string of items in a transaction, separated by a comma, as in the table below.

In this exercise, you will practice loading the data from a csv file and will prepare it for use as a list of lists. Note that the path to the grocery store dataset has been defined and is available to you as groceries_path; the cell below passes the file name small_grocery_store.csv directly instead.

In [None]:
# Import pandas under the alias pd
import pandas as pd

# Load transactions from the csv file
groceries = pd.read_csv('small_grocery_store.csv')

# Split transaction strings into lists
transactions = groceries['Transaction'].apply(lambda t: t.split(','))

# Convert the resulting pandas Series into a list of lists
transactions = list(transactions)

# Print the list of transactions
print(transactions)

[['milk', 'bread', 'biscuit'], ['bread', 'milk', 'biscuit', 'cereal'], ['bread', 'tea'], ['jam', 'bread', 'milk'], ['tea', 'biscuit'], ['bread', 'tea'], ['tea', 'cereal'], ['bread', 'tea', 'biscuit'], ['jam', 'bread', 'tea'], ['bread', 'milk'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'sugar'], ['bread', 'coffee', 'orange'], ['bread', 'sugar', 'biscuit'], ['coffee', 'sugar', 'cereal'], ['bread', 'sugar', 'biscuit'], ['bread', 'coffee', 'sugar'], ['bread', 'coffee', 'sugar'], ['tea', 'milk', 'coffee', 'cereal']]
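
The loading step can be tried without the csv file on disk. The sketch below assumes a file layout with a single Transaction column whose values are quoted, comma-separated strings; the in-memory text csv_text is a made-up stand-in for small_grocery_store.csv, and Series.str.split is used in place of the lambda.

```python
import io

import pandas as pd

# Hypothetical stand-in for the csv file: one quoted string per transaction
csv_text = 'Transaction\n"milk,bread,biscuit"\n"bread,tea"\n'

# Read the in-memory text exactly as a file path would be read
groceries_sample = pd.read_csv(io.StringIO(csv_text))

# Split each transaction string on commas and convert to a list of lists
transactions_sample = groceries_sample['Transaction'].str.split(',').tolist()

print(transactions_sample)
```

If the real file contains spaces after the commas, each item would also need to be stripped before further use.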


### Generating association rules

As you saw, the function permutations from the module itertools can be used to quickly generate the set of all one-antecedent, one-consequent rules. You do not, of course, know which of these rules are useful. You simply know that each is a valid way to combine two items.

Let's practice generating and counting the set of all rules for the items that appear in the grocery dataset.

In [None]:
# Import permutations from the itertools module
from itertools import permutations

# Define the set of groceries
flattened = [i for t in transactions for i in t]
groceries = list(set(flattened))

# Generate all possible rules
rules = list(permutations(groceries, 2))

# Print the set of rules
print(rules)

# Print the number of rules
print(len(rules))

[('milk', 'orange'), ('milk', 'biscuit'), ('milk', 'cereal'), ('milk', 'jam'), ('milk', 'bread'), ('milk', 'sugar'), ('milk', 'tea'), ('milk', 'coffee'), ('orange', 'milk'), ('orange', 'biscuit'), ('orange', 'cereal'), ('orange', 'jam'), ('orange', 'bread'), ('orange', 'sugar'), ('orange', 'tea'), ('orange', 'coffee'), ('biscuit', 'milk'), ('biscuit', 'orange'), ('biscuit', 'cereal'), ('biscuit', 'jam'), ('biscuit', 'bread'), ('biscuit', 'sugar'), ('biscuit', 'tea'), ('biscuit', 'coffee'), ('cereal', 'milk'), ('cereal', 'orange'), ('cereal', 'biscuit'), ('cereal', 'jam'), ('cereal', 'bread'), ('cereal', 'sugar'), ('cereal', 'tea'), ('cereal', 'coffee'), ('jam', 'milk'), ('jam', 'orange'), ('jam', 'biscuit'), ('jam', 'cereal'), ('jam', 'bread'), ('jam', 'sugar'), ('jam', 'tea'), ('jam', 'coffee'), ('bread', 'milk'), ('bread', 'orange'), ('bread', 'biscuit'), ('bread', 'cereal'), ('bread', 'jam'), ('bread', 'sugar'), ('bread', 'tea'), ('bread', 'coffee'), ('sugar', 'milk'), ('sugar', 'or
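
Because a rule is an ordered pair, n items yield n(n-1) one-antecedent, one-consequent rules, so the nine items in the grocery dataset give 72. The sketch below checks that count on a smaller, illustrative subset of four items (coffee, tea, milk, and sugar); the variable names are chosen for this example only.

```python
from itertools import permutations

# Illustrative four-item subset of the grocery items
subset = ['coffee', 'tea', 'milk', 'sugar']

# Each ordered pair (antecedent, consequent) is one candidate rule
subset_rules = list(permutations(subset, 2))

# Order matters: ('coffee', 'tea') and ('tea', 'coffee') are different rules
n = len(subset)
print(len(subset_rules), n * (n - 1))
```

Both printed values are 12, and the same formula gives 9 * 8 = 72 for the full item set.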

## The simplest metric

### One-hot encoding transaction data

Throughout the course, we will use a common pipeline for preprocessing data for use in market basket analysis. The first step is to import a pandas DataFrame and select the column that contains transactions. Each transaction in the column will be a string that consists of a number of items, each separated by a comma. The next step is to use a lambda function to split each transaction string into a list, thereby transforming the column into a list of lists.

In this exercise, you'll start with the list of lists from the grocery dataset, which is available to you as transactions. You will then transform transactions into a one-hot encoded DataFrame, where each column consists of True and False values that indicate whether an item was included in a transaction.

In [None]:
# Import the TransactionEncoder class from mlxtend
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Instantiate transaction encoder and identify unique items
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)

# Print the one-hot encoded transaction dataset
print(onehot)

    biscuit  bread  cereal  coffee    jam   milk  orange  sugar    tea
0      True   True   False   False  False   True   False  False  False
1      True   True    True   False  False   True   False  False  False
2     False   True   False   False  False  False   False  False   True
3     False   True   False   False   True   True   False  False  False
4      True  False   False   False  False  False   False  False   True
5     False   True   False   False  False  False   False  False   True
6     False  False    True   False  False  False   False  False   True
7      True   True   False   False  False  False   False  False   True
8     False   True   False   False   True  False   False  False   True
9     False   True   False   False  False   True   False  False  False
10     True  False    True    True  False  False    True  False  False
11     True  False    True    True  False  False    True  False  False
12    False  False   False    True  False  False   False   True  False
13    
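
To see what the encoder produces, the same kind of table can be built by hand. The following is a rough sketch using only pandas on three of the grocery transactions; it is not how mlxtend implements the encoding, and the names sample and onehot_sample are illustrative.

```python
import pandas as pd

# Three transactions taken from the grocery list above
sample = [['milk', 'bread', 'biscuit'], ['bread', 'tea'], ['tea', 'cereal']]

# Unique items, sorted to mirror the alphabetical column order seen above
items = sorted({item for t in sample for item in t})

# One row per transaction, one True/False column per item
onehot_sample = pd.DataFrame(
    [[item in t for item in items] for t in sample],
    columns=items,
)

print(onehot_sample)
```

Each row marks with True the items that the corresponding transaction contains and with False every other item.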

### Computing the support metric

In the previous exercise, you one-hot encoded a small grocery store's transactions as the DataFrame onehot. In this exercise, you'll make use of that DataFrame and the support metric to help the store's owner. First, she has asked you to identify frequently purchased items, which you'll do by computing support at the item-level. And second, she asked you to check whether the rule {jam} → {bread} has a support of over 0.05. Note that onehot has been defined and is available. Additionally, pandas has been imported under the alias pd and numpy has been imported under the alias np.

In [None]:
# Compute the support
support = onehot.mean()

# Print the support
print(support)

biscuit    0.40
bread      0.65
cereal     0.30
coffee     0.40
jam        0.10
milk       0.25
orange     0.15
sugar      0.30
tea        0.35
dtype: float64
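
Taking the column mean of a one-hot DataFrame works because support is simply the share of transactions that contain an item. As a cross-check, the sketch below recomputes the same values directly from the list of lists printed earlier, without pandas; the name grocery_transactions is used here only to avoid overwriting transactions.

```python
from collections import Counter

# The 20 grocery transactions printed earlier
grocery_transactions = [
    ['milk', 'bread', 'biscuit'], ['bread', 'milk', 'biscuit', 'cereal'],
    ['bread', 'tea'], ['jam', 'bread', 'milk'], ['tea', 'biscuit'],
    ['bread', 'tea'], ['tea', 'cereal'], ['bread', 'tea', 'biscuit'],
    ['jam', 'bread', 'tea'], ['bread', 'milk'],
    ['coffee', 'orange', 'biscuit', 'cereal'],
    ['coffee', 'orange', 'biscuit', 'cereal'], ['coffee', 'sugar'],
    ['bread', 'coffee', 'orange'], ['bread', 'sugar', 'biscuit'],
    ['coffee', 'sugar', 'cereal'], ['bread', 'sugar', 'biscuit'],
    ['bread', 'coffee', 'sugar'], ['bread', 'coffee', 'sugar'],
    ['tea', 'milk', 'coffee', 'cereal'],
]

n_transactions = len(grocery_transactions)

# Count each item at most once per transaction
item_counts = Counter(item for t in grocery_transactions for item in set(t))

# Support = number of transactions containing the item / total transactions
item_support = {item: count / n_transactions for item, count in item_counts.items()}

for item in sorted(item_support):
    print(f'{item:<10}{item_support[item]:.2f}')
```

Bread, for instance, appears in 13 of the 20 transactions, which gives the 0.65 shown in the output above.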


In [None]:
import numpy as np

# Add a jam+bread column to the DataFrame onehot
onehot['jam+bread'] = np.logical_and(onehot['jam'], onehot['bread'])

# Compute the support
support = onehot.mean()

# Print the support values
print(support)

biscuit      0.40
bread        0.65
cereal       0.30
coffee       0.40
jam          0.10
milk         0.25
orange       0.15
sugar        0.30
tea          0.35
jam+bread    0.10
dtype: float64
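
The owner's question was whether the rule's support exceeds 0.05, so the final comparison can be made explicit. The sketch below rebuilds only the jam and bread columns from the row positions at which those items occur in the transaction list above; the names jam_rows, bread_rows, and min_support are assumptions introduced for this example.

```python
import pandas as pd

# Row positions (0-based) of the 20 transactions that contain each item
jam_rows = {3, 8}
bread_rows = {0, 1, 2, 3, 5, 7, 8, 9, 13, 14, 16, 17, 18}

# Minimal one-hot frame holding just the two columns of interest
pair = pd.DataFrame({
    'jam': [i in jam_rows for i in range(20)],
    'bread': [i in bread_rows for i in range(20)],
})

# Support of the rule = share of transactions containing both items
support_jam_bread = (pair['jam'] & pair['bread']).mean()

# Threshold requested by the store owner
min_support = 0.05
print(support_jam_bread, support_jam_bread > min_support)
```

Support is symmetric, so {jam} → {bread} and {bread} → {jam} share the same value of 0.10, which clears the 0.05 threshold.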


# Association Rules

## Confidence and lift