## Practice Lab: Implementation of Apriori algorithm — Market basket analysis using Python
#### Descriptive market basket analysis (MBA)

Descriptive market basket analysis summarizes which items customers tend to buy together. A retailer can use these patterns to guide product placement and promotions.

### Recap:
1. **Transaction**: A set of items that are purchased together. For example, a customer buying milk, bread, and butter in one shopping trip.
2. **Itemset**: A collection of one or more items. In the context of MBA, an itemset could be {milk, bread, butter}.

A metric measures how frequent an itemset is or how strong a rule is.

1. **Support**: The support of an itemset is the proportion of transactions in the dataset that contain the itemset. It indicates how frequently the itemset occurs.
2. **Confidence**: For a rule $X \rightarrow Y$, confidence measures how often the items in $Y$ appear in transactions that contain $X$. It indicates the strength of the implication.
3. **Lift**: Lift measures how much more likely $Y$ is to be bought when $X$ is bought, compared to $Y$ being bought regardless of $X$. Lift values greater than 1 indicate a positive association, values less than 1 indicate a negative association, and values close to 1 indicate no association.
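
Written as formulas, the three metrics take the following standard form. The symbol $T$ for the set of all transactions and $|\cdot|$ for a count are notation assumed here, not taken from the lab:

- $\text{support}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}$
- $\text{confidence}(X \rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$
- $\text{lift}(X \rightarrow Y) = \frac{\text{confidence}(X \rightarrow Y)}{\text{support}(Y)} = \frac{\text{support}(X \cup Y)}{\text{support}(X) \, \text{support}(Y)}$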
----

## Quiz:
Given the following rule:

$\{bread\} \rightarrow \{milk\}$

What are the antecedent and the consequent?
- Answer: Antecedent - Bread; Consequent - Milk

**Note**: Many rules have multiple items in the antecedent or the consequent. For example:
- $\{Beer, Cola, Milk\} \rightarrow	\{Diapers\}$
- $\{Eggs\} \rightarrow	\{Bread, Diapers\}$
---

### Class Example:
Imagine a supermarket wants to analyze purchasing patterns to optimize product placement. They collect transaction data and perform market basket analysis.
Some typical questions that can be answered:
- What are the most frequently purchased items?
- What are the common item combinations or patterns?
- What are the strong association rules?
- Can customers be segmented based on their purchasing patterns?
- Which products should be placed together or promoted together?
- What are the potential cross-selling or upselling opportunities?
- ...

#### Steps in Market Basket Analysis

In [24]:
# 1. Data Collection
dataset = [['Milk','Bread'],
           ['Diapers', 'Bread', 'Eggs', 'Beer'],
           ['Milk', 'Diapers', 'Beer', 'Cola'],
           ['Bread', 'Milk', 'Diapers','Beer'],
           ['Milk', 'Bread', 'Diapers', 'Cola']]

# Display the dataset
dataset



[['Milk', 'Bread'],
 ['Diapers', 'Bread', 'Eggs', 'Beer'],
 ['Milk', 'Diapers', 'Beer', 'Cola'],
 ['Bread', 'Milk', 'Diapers', 'Beer'],
 ['Milk', 'Bread', 'Diapers', 'Cola']]

- How many unique items do we have?

Answer: 6

---
Next, we clean the data to remove any inconsistencies or missing values, if any.

Convert the data into a suitable format for analysis, often a binary matrix where rows represent transactions and columns represent items.
- That is, we use the transform method to construct an array of one-hot encoded transactions.
- More details about the encoder can be found [here](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/).
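
To see what one-hot encoding produces without relying on any library, here is a minimal pure-Python sketch. The helper name `one_hot_encode` is hypothetical, and sorting the columns alphabetically is an assumption made here so the layout is easy to compare with the encoder output in this lab.

```python
# Hypothetical helper: one-hot encode a list of transactions by hand.
def one_hot_encode(transactions):
    # One column per distinct item, sorted alphabetically (an assumption of this sketch)
    columns = sorted({item for transaction in transactions for item in transaction})
    # One row per transaction: True if the item is in the basket, else False
    rows = [[item in set(transaction) for item in columns]
            for transaction in transactions]
    return columns, rows

dataset = [['Milk', 'Bread'],
           ['Diapers', 'Bread', 'Eggs', 'Beer'],
           ['Milk', 'Diapers', 'Beer', 'Cola'],
           ['Bread', 'Milk', 'Diapers', 'Beer'],
           ['Milk', 'Bread', 'Diapers', 'Cola']]

columns, rows = one_hot_encode(dataset)
print(columns)
for row in rows:
    print(row)
```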


In [25]:
# 2. Data Preprocessing
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Create an instance of TransactionEncoder
te = TransactionEncoder()

# Fit and transform the dataset which returns a NumPy array
dataset_array = te.fit_transform(dataset)

# Convert into a Pandas DataFrame
df = pd.DataFrame(dataset_array, columns=te.columns_)
# Overview
df



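| | Beer | Bread | Cola | Diapers | Eggs | Milk |
|---|---|---|---|---|---|---|
| 0 | False | True | False | False | False | True |
| 1 | True | True | False | True | True | False |
| 2 | True | False | True | True | False | True |
| 3 | True | True | False | True | False | True |
| 4 | False | True | True | True | False | True |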


**Question**: First, your boss has asked you to identify frequently purchased items.

In [26]:
# Compute the support of each item: the mean of a boolean column
# is the fraction of transactions that contain the item
support = df.mean()
print(support)

Beer       0.6
Bread      0.8
Cola       0.4
Diapers    0.8
Eggs       0.2
Milk       0.8
dtype: float64




- What are the most frequently purchased items?

Answer: Bread, Diapers, and Milk (each with a support of 0.8)

---

**Question**: What are the common item combinations or patterns?

**Recall**:
- The `apriori` function is used to find frequent itemsets in transaction data. Read more [here](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/).
- The `min_support` parameter sets the minimum support an itemset must reach to be considered "frequent".
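
To make the role of `min_support` concrete, the following is a minimal pure-Python sketch of the level-wise search behind Apriori. It is an illustration under simplifying assumptions, not the `mlxtend` implementation; the function name `apriori_sketch` and its return format (a dict mapping each itemset to its support) are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support, max_len=None):
    """Return {frozenset(itemset): support} for itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    current = {}
    for item in sorted({item for basket in baskets for item in basket}):
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current and (max_len is None or k <= max_len):
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
    return frequent

dataset = [['Milk', 'Bread'],
           ['Diapers', 'Bread', 'Eggs', 'Beer'],
           ['Milk', 'Diapers', 'Beer', 'Cola'],
           ['Bread', 'Milk', 'Diapers', 'Beer'],
           ['Milk', 'Bread', 'Diapers', 'Cola']]

frequent = apriori_sketch(dataset, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before their support is counted.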

To use the Apriori algorithm we use the mlxtend library in Python. Install the mlxtend library if you haven't already.
`pip install mlxtend`

In [27]:
# 3. Generate Itemsets

from mlxtend.frequent_patterns import apriori

# Run the Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.6)

# print the frequent_itemsets
frequent_itemsets



| | support | itemsets |
|---|---|---|
| 0 | 0.6 | (0) |
| 1 | 0.8 | (1) |
| 2 | 0.8 | (3) |
| 3 | 0.8 | (5) |
| 4 | 0.6 | (0, 3) |
| 5 | 0.6 | (1, 3) |
| 6 | 0.6 | (1, 5) |
| 7 | 0.6 | (3, 5) |


By default, the `apriori` function returns the column indices of the items, which can be useful for downstream operations like association rule mining. For better readability, we can set the parameter `use_colnames=True`, which replaces these integer indices with the corresponding item names.
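
The correspondence between indices and names can be checked by hand. This small sketch assumes the column order shown in the encoded DataFrame of this lab and copies the index itemsets from the output above; the variable names are illustrative only.

```python
# Column order of the encoded DataFrame in this lab
columns = ['Beer', 'Bread', 'Cola', 'Diapers', 'Eggs', 'Milk']

# Itemsets as reported with column indices
index_itemsets = [(0,), (1,), (3,), (5,), (0, 3), (1, 3), (1, 5), (3, 5)]

# Replace every index with the item name stored at that column position
named_itemsets = [frozenset(columns[i] for i in itemset) for itemset in index_itemsets]

for indices, names in zip(index_itemsets, named_itemsets):
    print(indices, '->', sorted(names))
```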

In [28]:
# 3. Generate Itemsets

from mlxtend.frequent_patterns import apriori

# Run the Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

# print the frequent_itemsets
frequent_itemsets



| | support | itemsets |
|---|---|---|
| 0 | 0.6 | (Beer) |
| 1 | 0.8 | (Bread) |
| 2 | 0.8 | (Diapers) |
| 3 | 0.8 | (Milk) |
| 4 | 0.6 | (Diapers, Beer) |
| 5 | 0.6 | (Diapers, Bread) |
| 6 | 0.6 | (Milk, Bread) |
| 7 | 0.6 | (Diapers, Milk) |


- What are the most frequent item pairs?

Answer: Four pairs tie with a support of 0.6: (Diapers, Beer), (Diapers, Bread), (Milk, Bread), and (Diapers, Milk).


----

#### What are the strong association rules?
Next, for each itemset and potential association rule, we calculate the **support**, **confidence**, and **lift** values.

For example, we can use the `association_rules` function from `mlxtend` and filter the rules based on a minimum confidence threshold. Read more [here](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/).

In [29]:
# 4. Generate association rules
from mlxtend.frequent_patterns import association_rules

# Using the association rule with metric = confidence
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)

# Print the strong association rules
print("Strong Association Rules:")
rules

Strong Association Rules:




| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Diapers) | (Beer) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 1 | (Beer) | (Diapers) | 0.6 | 0.8 | 0.6 | 1.0 | 1.25 | 0.12 | inf | 0.5 |
| 2 | (Diapers) | (Bread) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 3 | (Bread) | (Diapers) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 4 | (Milk) | (Bread) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 5 | (Bread) | (Milk) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 6 | (Diapers) | (Milk) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 7 | (Milk) | (Diapers) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |


For example, the rule (Beer) => (Diapers) has a confidence of 1.0, meaning that 100% of transactions containing "Beer" also contain "Diapers".
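
These values can be recomputed directly from the five transactions. The sketch below uses only the standard library. The helper names `support` and `rule_metrics` are hypothetical, and leverage is taken to be support(X ∪ Y) − support(X) · support(Y), which reproduces the values in the table for these two rules.

```python
# The five transactions used in this lab
dataset = [['Milk', 'Bread'],
           ['Diapers', 'Bread', 'Eggs', 'Beer'],
           ['Milk', 'Diapers', 'Beer', 'Cola'],
           ['Bread', 'Milk', 'Diapers', 'Beer'],
           ['Milk', 'Bread', 'Diapers', 'Cola']]
baskets = [set(t) for t in dataset]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent):
    """Hypothetical helper: metrics of the rule antecedent -> consequent."""
    s_x = support(antecedent)
    s_y = support(consequent)
    s_xy = support(set(antecedent) | set(consequent))
    confidence = s_xy / s_x
    return {'support': s_xy,
            'confidence': confidence,
            'lift': confidence / s_y,
            'leverage': s_xy - s_x * s_y}

for x, y in [({'Beer'}, {'Diapers'}), ({'Diapers'}, {'Beer'})]:
    metrics = rule_metrics(x, y)
    # Round for display only; floating-point division may leave tiny errors
    print(x, '->', y, {name: round(value, 4) for name, value in metrics.items()})
```

Both rules share the same support and lift, but their confidence differs because the antecedents have different supports: every Beer basket contains Diapers, while only three of the four Diapers baskets contain Beer.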

------

In [32]:
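# Keep only rules with lift > 0.5 and confidence > 0.7
# (all eight rules above satisfy both conditions)
filtered_rules = rules[(rules['lift'] > 0.5) & (rules['confidence'] > 0.7)]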
filtered_rules



| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Diapers) | (Beer) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 | 1.0 |
| 1 | (Beer) | (Diapers) | 0.6 | 0.8 | 0.6 | 1.0 | 1.25 | 0.12 | inf | 0.5 |
| 2 | (Diapers) | (Bread) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 3 | (Bread) | (Diapers) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 4 | (Milk) | (Bread) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 5 | (Bread) | (Milk) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 6 | (Diapers) | (Milk) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |
| 7 | (Milk) | (Diapers) | 0.8 | 0.8 | 0.6 | 0.75 | 0.9375 | -0.04 | 0.8 | -0.25 |


----
----
Grocery dataset from [Kaggle](https://www.kaggle.com/datasets/umairaslam/grocery/download).

- A grocery retailer wants to uncover association rules between items through market basket analysis.
- This involves identifying items that are frequently bought together, so that the retailer can position these items strategically to increase sales.

In [16]:
# 1. Data Collection
# Load and read our dataset
groceries = pd.read_csv('groceries.csv')



In [17]:
# Overview dataset
groceries.head()



| | Items |
|---|---|
| 0 | citrus fruit,semi-finished bread,margarine,rea... |
| 1 | tropical fruit,yogurt,coffee |
| 2 | whole milk |
| 3 | pip fruit,yogurt,cream cheese ,meat spreads |
| 4 | other vegetables,whole milk,condensed milk,lon... |


In [18]:
groceries.shape



(700, 1)

**Data Preprocessing**:
- We need a dataset where each row represents a transaction and each column represents an item.
- The dataset should have binary values indicating whether or not each item was purchased in that transaction.
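
As a library-free illustration of this target format, the sketch below encodes three of the rows shown in the preview above. Stripping whitespace around item names is an assumption added here: the preview contains `cream cheese ` with a trailing space, which suggests that some names may carry stray whitespace and would otherwise count as separate items.

```python
# Three rows copied from the preview of the 'Items' column
raw_rows = ["tropical fruit,yogurt,coffee",
            "whole milk",
            "pip fruit,yogurt,cream cheese ,meat spreads"]

# Split each row on commas and strip stray whitespace around item names
transactions = [[item.strip() for item in row.split(',') if item.strip()]
                for row in raw_rows]

# One column per distinct item; 1 if the transaction contains it, else 0
columns = sorted({item for transaction in transactions for item in transaction})
encoded = [[int(item in transaction) for item in columns]
           for transaction in transactions]

print(columns)
for row in encoded:
    print(row)
```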

In [19]:
# Use str.get_dummies to split each row on commas and create a binary indicator DataFrame

groceries_encoded = groceries['Items'].str.get_dummies(sep=',')
groceries_encoded




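| | Instant food products | UHT-milk | abrasive cleaner | artif. sweetener | baby cosmetics | baking powder | bathroom cleaner | beef | berries | beverages | ... | tropical fruit | turkey | vinegar | waffles | whipped/sour cream | white bread | white wine | whole milk | yogurt | zwieback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 695 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 696 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 697 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 698 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |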


### Explore the dataset

#### View the 10 top-selling items

In [20]:
# Sum up the occurrences of each item across all transactions
item_counts = groceries_encoded.sum()

# Sort the items based on their counts in descending order
top_sold_item = item_counts.sort_values(ascending=False)

# Print the top sold items
top_sold_item[:10]



whole milk          176
rolls/buns          153
other vegetables    129
soda                112
bottled water        96
yogurt               81
root vegetables      71
citrus fruit         65
tropical fruit       65
sausage              57
dtype: int64

### Generating frequent itemsets
**Identify frequent itemsets** - groups of items that frequently appear together in transactions.

**Recall**: The `apriori` function from `mlxtend.frequent_patterns` applies the Apriori algorithm to find frequent itemsets in transactional data.

Frequent itemsets are selected by **support** (the proportion of transactions that contain the itemset). **Confidence** (how likely one item is to be bought when another is present) comes in later, when association rules are derived from these itemsets.


In [23]:
# 3. Generate Itemsets

from mlxtend.frequent_patterns import apriori

# Run the Apriori algorithm
frequent_itemsets = apriori(groceries_encoded, min_support=0.01, use_colnames=True, max_len=2)

# print the frequent_itemsets
frequent_itemsets



| | support | itemsets |
|---|---|---|
| 0 | 0.020000 | (UHT-milk) |
| 1 | 0.011429 | (baking powder) |
| 2 | 0.060000 | (beef) |
| 3 | 0.040000 | (berries) |
| 4 | 0.031429 | (beverages) |
| ... | ... | ... |
| 274 | 0.021429 | (yogurt, tropical fruit) |
| 275 | 0.012857 | (waffles, whole milk) |
| 276 | 0.024286 | (whipped/sour cream, whole milk) |
| 277 | 0.010000 | (yogurt, whipped/sour cream) |


Besides `min_support`, other parameters control the behavior of `apriori`. Here, `max_len=2` limits the search to itemsets with at most two items.

----
## Task: Generating association rules from the frequent itemsets
The `association_rules()` function enables you to
- specify your desired metric and
- set the corresponding threshold.

For now, we will use two metrics: confidence and lift.

Recall: Lift tells us how much more likely Y is purchased when X is purchased, compared to if they were independent.

1. Suppose you want to analyze rules derived from frequent itemsets only when the confidence level exceeds 70 percent (min_threshold=0.7).

2. Generate association rules based on the lift metric with a minimum lift threshold of 1.5.
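
Before running the task on the grocery data, it may help to see what "choose a metric and a threshold" amounts to. The following pure-Python sketch enumerates rules from the frequent itemsets of the five-transaction example and keeps those that reach the threshold. It is not the `mlxtend` implementation, the helper `generate_rules` is hypothetical, and the lift threshold of 1.2 is chosen only because no rule in the small example reaches 1.5.

```python
from itertools import combinations

# Supports of the frequent itemsets from the five-transaction example in this lab
supports = {
    frozenset(['Beer']): 0.6,
    frozenset(['Bread']): 0.8,
    frozenset(['Diapers']): 0.8,
    frozenset(['Milk']): 0.8,
    frozenset(['Beer', 'Diapers']): 0.6,
    frozenset(['Bread', 'Diapers']): 0.6,
    frozenset(['Bread', 'Milk']): 0.6,
    frozenset(['Diapers', 'Milk']): 0.6,
}

def generate_rules(supports, metric, min_threshold):
    """Hypothetical helper: rules whose chosen metric is >= min_threshold."""
    rules = []
    for itemset, s_xy in supports.items():
        # Split every itemset with two or more items into antecedent -> consequent
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                confidence = s_xy / supports[antecedent]
                lift = confidence / supports[consequent]
                values = {'confidence': confidence, 'lift': lift}
                if values[metric] >= min_threshold:
                    rules.append((set(antecedent), set(consequent), values))
    return rules

confident_rules = generate_rules(supports, metric='confidence', min_threshold=0.7)
lifted_rules = generate_rules(supports, metric='lift', min_threshold=1.2)

print(len(confident_rules), 'rules pass the confidence threshold')
for antecedent, consequent, values in lifted_rules:
    print(antecedent, '->', consequent, round(values['lift'], 4))
```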
-----