# Lab 04: Supermarket basket association pattern mining

In this question, we perform association pattern mining using the supermarket dataset `supermarket.arff` from the [Weka MOOC](https://www.cs.waikato.ac.nz/ml/weka/courses.html).

1. Load the data file `supermarket.arff` into a pandas data frame
2. Remove the following attributes 
    - `department*`
    - `non host support`
    - `total`
3.  Select the Apriori algorithm and perform frequent itemset mining with minsup = 0.2 and minconf = 0.8 and find out: 
    - The numbers of frequent 2-itemsets, and 3-itemsets. 
    - The best three (3) rules with the largest confidence. Examine these rules and describe them in your own words. 
4. The supermarket manager wishes to boost the sale of fruit and therefore needs to know which other itemsets are most likely to be purchased with fruit in order to make promotion decisions. 
    - Using the same minimum support and minimum confidence value. 
    - List the top three itemsets to report to the supermarket manager. 
5. Repeat task 3, but using the FP Growth algorithm instead.  
    - Compare the rules found. 
    - Are they consistent? 

## 0. Upgrade mlxtend
The default version of `mlxtend` on Google Colab is too old for this prac,
so we must upgrade it. We want at least version 0.18.
Note that statements beginning with `!` are not Python code but system calls. If you are running this in a personal JupyterLab, you might have to update this module a different way.

In [2]:
! pip install --upgrade 'mlxtend>=0.18'



In [3]:
# Check we have the right version
import mlxtend
print(mlxtend.__version__)

0.23.4


If you ran the two cells above in reverse order, you will have to restart the kernel before you can load the newer version of the `mlxtend` module.

To do this, choose "Runtime" -> "Restart runtime".
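
If you would rather check the version requirement in code than read it off the printout, a minimal sketch follows. The helper name `version_tuple` is hypothetical (it is not part of `mlxtend`), and it assumes a plain dotted numeric version string such as `0.23.4`.

```python
def version_tuple(version, parts=2):
    """Turn a dotted version string like '0.23.4' into a tuple of ints."""
    return tuple(int(p) for p in version.split(".")[:parts])

# In the notebook you would pass mlxtend.__version__ instead of this literal.
installed = "0.23.4"
assert version_tuple(installed) >= (0, 18), "mlxtend is too old: upgrade, then restart the runtime"
print(version_tuple(installed))
```

Comparing tuples of integers avoids a trap with raw strings: as text, `"0.9.1"` sorts after `"0.18.0"`, even though 0.9 is the older release.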

In [4]:
import pandas as pd
from scipy.io import arff
import urllib
import urllib.request
import numpy as np

## 1. Load the data file `supermarket.arff` into a pandas data frame

In this task, you will load the `supermarket.arff` dataset into a Pandas DataFrame for analysis. We have performed a similar procedure in a previous practical. The steps are as follows:

1. **Download the file** into your working directory using `urllib`.
2. **Load the dataset** using `scipy.io.arff`.
3. **Convert the data** into a Pandas DataFrame.

> **Note:** The `supermarket.arff` file is available on Blackboard under **Labs → Lab04**.

In [5]:
# Read the ARFF file into a tuple (data, metadata)
data = arff.loadarff("supermarket.arff")

# Convert the data portion (index 0) into a Pandas DataFrame
raw_df = pd.DataFrame(data[0])

# Convert integer values (0/1) into boolean values (False/True)
# This makes the dataset more interpretable for transaction-style data
df = raw_df.astype(bool)

# Display summary statistics for each column
# For boolean columns, 'count' is total records, 'unique' will be 2, 
# 'top' will be True/False, and 'freq' is the most common value's count
df.describe()

Unnamed: 0,department1,department2,department3,department4,department5,department6,department7,department8,department9,grocery misc,...,department208,department209,department210,department211,department212,department213,department214,department215,department216,total
count,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627,...,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627
unique,2,2,2,2,2,2,2,1,2,2,...,1,1,2,2,2,2,1,1,1,1
top,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
freq,3580,4496,4537,4543,4452,4625,4560,4627,4545,4449,...,4627,4627,4436,4420,4589,4605,4627,4627,4627,4627
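
To see what the boolean conversion gives you, here is a small sketch on a made-up 0/1 table (the column names and values are illustrative, not taken from `supermarket.arff`). It shows that the mean of a boolean column is the fraction of transactions containing that item, which is the item's support.

```python
import pandas as pd

# Toy 0/1 basket table (made-up data, not the supermarket file)
toy = pd.DataFrame({
    "bread": [1, 1, 0, 1],
    "milk":  [1, 0, 0, 1],
    "tea":   [0, 0, 0, 0],
})

# Same conversion as in the cell above: 0 -> False, 1 -> True
toy_bool = toy.astype(bool)

# The mean of a boolean column is the fraction of True values,
# i.e. the support of that single item.
item_support = toy_bool.mean()
print(item_support)
```

A column such as `tea` here, which is never `True`, corresponds to the columns in the summary above where `unique` is 1.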


## 2. Remove Unnecessary Attributes

The following attributes have been identified as not useful for further analysis and should be removed from the dataset:
- `department*` (every column whose name starts with `department`)
- `non host support`
- `total`


In [6]:
# --- Identify columns to remove from the dataset ---

# Start with known columns to drop
cols_to_drop = ['non host support', 'total']

# Automatically detect all columns starting with 'department'
for col in df.columns:
    if col.startswith('department'):  # Select columns whose names start with 'department'
        cols_to_drop.append(col)

# Display the list of columns to be removed
print("The following columns will be dropped:")
print(cols_to_drop)


The following columns will be dropped:
['non host support', 'total', 'department1', 'department2', 'department3', 'department4', 'department5', 'department6', 'department7', 'department8', 'department9', 'department11', 'department57', 'department70', 'department79', 'department80', 'department81', 'department88', 'department89', 'department98', 'department100', 'department101', 'department102', 'department107', 'department108', 'department109', 'department110', 'department111', 'department112', 'department113', 'department114', 'department116', 'department117', 'department118', 'department119', 'department120', 'department122', 'department123', 'department124', 'department125', 'department126', 'department127', 'department128', 'department129', 'department130', 'department137', 'department138', 'department139', 'department140', 'department141', 'department142', 'department143', 'department144', 'department145', 'department146', 'department147', 'department148', 'department149', 'depar

In [7]:
df = df.drop(columns=cols_to_drop)
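
The same column selection can be written more compactly with a list comprehension. The sketch below runs on a made-up one-row table whose column names only imitate the real ones.

```python
import pandas as pd

# Toy frame with a mix of 'department*' columns and named items (made-up data)
toy = pd.DataFrame(
    [[True, False, True, False, True]],
    columns=["department1", "department2", "bread and cake", "total", "fruit"],
)

# Known columns plus every column whose name starts with 'department'
cols_to_drop = ["total"] + [c for c in toy.columns if c.startswith("department")]

kept = toy.drop(columns=cols_to_drop)
print(list(kept.columns))
```

`drop` returns a new DataFrame, so `toy` itself is unchanged unless you assign the result back to it.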

In [8]:
# Confirm the columns were dropped by showing a summary. We should have 104 columns left, all with descriptive names.
df.describe()

Unnamed: 0,grocery misc,baby needs,bread and cake,baking needs,coupons,juice-sat-cord-ms,tea,biscuits,canned fish-meat,canned fruit,...,casks red wine,750ml white nz,750ml red nz,750ml white imp,750ml red imp,sparkling nz,sparkling imp,brew kits/accesry,port and sherry,ctrled label wine
count,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627,...,4627,4627,4627,4627,4627,4627,4627,4627,4627,4627
unique,2,2,2,2,1,2,2,2,2,2,...,2,2,2,2,2,2,2,1,2,1
top,False,False,True,True,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
freq,4449,4008,3330,2795,4627,2463,3731,2605,3686,3344,...,4576,4346,4536,4528,4530,4498,4604,4627,4602,4627


## 3. Select and Apply the Apriori Algorithm

In this step, you will apply the **Apriori algorithm** to perform frequent itemset mining on the dataset.

### Task Details
- **Minimum support (`minsup`)**: 0.2  
- **Minimum confidence (`minconf`)**: 0.8  

### Required Outcomes
1. Identify the **number of frequent 2-itemsets** and **3-itemsets**.
2. Determine the **top three association rules** with the highest confidence values.
3. Examine these top rules and describe them in your own words, explaining the relationship between items.
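
Before running the library, it can help to compute the two thresholds by hand. The sketch below uses five made-up baskets and two hypothetical helper functions, `support` and `confidence`. It is a plain-Python illustration of the definitions, not how `mlxtend` implements them.

```python
# Five made-up baskets (not from the supermarket data)
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "fruit"},
    {"bread", "fruit"},
    {"milk"},
    {"bread", "milk", "fruit"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(support({"bread", "milk"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
```

An itemset is frequent when its support reaches `minsup`, and a rule is kept when its confidence reaches `minconf`.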

The `apriori` algorithm is found in the `mlxtend` package, so we import it along with the `association_rules` function.

In [9]:
# --- Frequent Itemset Mining with Apriori ---

from mlxtend.frequent_patterns import apriori, association_rules

# Apply the Apriori algorithm to the boolean transaction DataFrame
ap_itemsets = apriori(
    df,
    min_support=?,   # Your answer
    use_colnames=True  # Use actual item names instead of column indices
)

# Display the resulting frequent itemsets with their support values
ap_itemsets

Unnamed: 0,support,itemsets
0,0.719689,(bread and cake)
1,0.604063,(baking needs)
2,0.532310,(juice-sat-cord-ms)
3,0.563000,(biscuits)
4,0.203372,(canned fish-meat)
...,...,...
541,0.224552,"(vegetables, fruit, biscuits, frozen foods)"
542,0.219365,"(fruit, vegetables, milk-cream, biscuits)"
543,0.228442,"(fruit, vegetables, milk-cream, frozen foods)"
544,0.202939,"(vegetables, baking needs, milk-cream, bread a..."


Now that we have our itemsets, we want to choose those with `2<=k<=3`.
The size `k` isn't explicitly stored in our dataframe, so we'll make a new column which is just the value of `len(itemsets)`.

In [10]:
# --- Count the number of items in each frequent itemset ---

def find_k(row):
    """
    Return the number of items in the given itemset.
    
    Parameters
    ----------
    row : pandas.Series
        A row from the frequent itemsets DataFrame, 
        where 'itemsets' is a frozenset of items.
    
    Returns
    -------
    int
        The number of items in the itemset.
    """
    return len(row['itemsets'])

# Apply the function to each row and store the result in a new column 'k'
ap_itemsets['k'] = ap_itemsets.apply(find_k, axis=1)

# Preview the updated DataFrame
ap_itemsets.head()


Unnamed: 0,support,itemsets,k
0,0.719689,(bread and cake),1
1,0.604063,(baking needs),1
2,0.53231,(juice-sat-cord-ms),1
3,0.563,(biscuits),1
4,0.203372,(canned fish-meat),1
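
An equivalent and shorter way to build the `k` column is to apply `len` directly to the `itemsets` column, which avoids the row-wise helper. The sketch below uses a made-up itemsets table with the same two columns.

```python
import pandas as pd

# Made-up frequent-itemsets table with the same layout as ap_itemsets
toy_itemsets = pd.DataFrame({
    "support": [0.7, 0.6, 0.5, 0.3],
    "itemsets": [
        frozenset({"bread"}),
        frozenset({"milk"}),
        frozenset({"bread", "milk"}),
        frozenset({"bread", "milk", "fruit"}),
    ],
})

# len() of each frozenset gives the itemset size k
toy_itemsets["k"] = toy_itemsets["itemsets"].apply(len)
print(toy_itemsets["k"].tolist())
```

Applying a function to a single column is usually simpler than `apply(..., axis=1)`, which builds a full row object for every call.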


In [11]:
ap_itemsets

Unnamed: 0,support,itemsets,k
0,0.719689,(bread and cake),1
1,0.604063,(baking needs),1
2,0.532310,(juice-sat-cord-ms),1
3,0.563000,(biscuits),1
4,0.203372,(canned fish-meat),1
...,...,...,...
541,0.224552,"(vegetables, fruit, biscuits, frozen foods)",4
542,0.219365,"(fruit, vegetables, milk-cream, biscuits)",4
543,0.228442,"(fruit, vegetables, milk-cream, frozen foods)",4
544,0.202939,"(vegetables, baking needs, milk-cream, bread a...",5


In [12]:
# --- Count the number of frequent 2-itemsets and 3-itemsets ---

# Count rows where 'k' equals 2 (frequent 2-itemsets)
k2_itemsets = # Your code

# Count rows where 'k' equals 3 (frequent 3-itemsets)
k3_itemsets = # Your code

# Display the results
print(f"There are {k2_itemsets} frequent itemsets with k = 2")
print(f"There are {k3_itemsets} frequent itemsets with k = 3")


There are 182 frequent itemsets with k = 2
There are 252 frequent itemsets with k = 3


In [13]:
# Now let's see the top 10 itemsets by support
## Your code ###




Note that the top 10 itemsets are all 1-itemsets. Is this surprising to you?
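
As a hint for the question above: every basket that contains an itemset also contains each of its subsets, so adding items can only keep support the same or lower it. The sketch below checks this on made-up baskets with a hypothetical `support` helper.

```python
from itertools import combinations

# Made-up baskets (not from the supermarket data)
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "fruit"},
    {"bread", "fruit"},
    {"milk"},
    {"bread", "milk", "fruit"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

full = {"bread", "milk", "fruit"}

# Every proper subset must have support at least as large as the full itemset
for size in (1, 2):
    for subset in combinations(sorted(full), size):
        assert support(subset, baskets) >= support(full, baskets)
        print(subset, support(subset, baskets))

print(tuple(sorted(full)), support(full, baskets))
```

This property is also what lets Apriori prune its search: once an itemset falls below `minsup`, none of its supersets can be frequent.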

We use these itemsets to generate association rules with a minimum confidence of 0.8.

In [13]:
# --- Generate association rules from frequent itemsets ---

ap_rules = association_rules(
    ap_itemsets,
    metric='confidence',   # Evaluate rules based on confidence
    min_threshold='?') # choose the minimum confidence value

# Preview the first few generated rules
ap_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(canned fruit),(bread and cake),0.277285,0.719689,0.224768,0.8106,1.12632,1.0,0.025208,1.479997,0.155183,0.291072,0.324323,0.561456
1,(jams-spreads),(bread and cake),0.276205,0.719689,0.221958,0.803599,1.116593,1.0,0.023177,1.427242,0.144265,0.286791,0.299348,0.556004
2,(margarine),(bread and cake),0.494489,0.719689,0.395721,0.800262,1.111956,1.0,0.039843,1.403396,0.199172,0.483496,0.287443,0.675056
3,(small goods),(bread and cake),0.241193,0.719689,0.201426,0.835125,1.160398,1.0,0.027843,1.700148,0.182163,0.265225,0.411816,0.557503
4,"(baking needs, biscuits)",(bread and cake),0.381241,0.719689,0.314675,0.825397,1.14688,1.0,0.0403,1.605419,0.206978,0.40022,0.37711,0.631317


Note that the rules above are not sorted by confidence. We should do that ourselves by using the `sort_values` function.

In [14]:
ap_rules.sort_values('confidence', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
177,"(frozen foods, vegetables, fruit, biscuits)",(bread and cake),0.224552,0.719689,0.200778,0.894129,1.242383,1.0,0.039171,2.647667,0.251590,0.270058,0.622309,0.586554
139,"(margarine, fruit, biscuits)",(bread and cake),0.231900,0.719689,0.202723,0.874185,1.214670,1.0,0.035828,2.227955,0.230089,0.270707,0.551158,0.577933
132,"(frozen foods, fruit, biscuits)",(bread and cake),0.282905,0.719689,0.247028,0.873186,1.213282,1.0,0.043425,2.210406,0.245141,0.326945,0.547594,0.608214
138,"(milk-cream, vegetables, biscuits)",(bread and cake),0.267128,0.719689,0.232332,0.869741,1.208496,1.0,0.040083,2.151954,0.235410,0.307935,0.535306,0.596282
117,"(margarine, fruit, baking needs)",(bread and cake),0.244003,0.719689,0.212016,0.868911,1.207342,1.0,0.036410,2.138320,0.227163,0.282059,0.532343,0.581753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154,"(frozen foods, bread and cake, fruit)",(vegetables),0.334558,0.639939,0.268424,0.802326,1.253752,1.0,0.054328,1.821483,0.304150,0.380165,0.450997,0.610889
152,"(bread and cake, frozen foods, vegetables)",(fruit),0.334558,0.640156,0.268424,0.802326,1.253329,1.0,0.054255,1.820389,0.303745,0.380049,0.450667,0.610818
91,"(vegetables, breakfast food)",(fruit),0.275989,0.640156,0.221310,0.801879,1.252632,1.0,0.044634,1.816290,0.278561,0.318507,0.449427,0.573796
173,"(milk-cream, frozen foods, vegetables)",(fruit),0.285066,0.640156,0.228442,0.801365,1.251828,1.0,0.045955,1.811583,0.281380,0.327854,0.447997,0.579109


Now describe the first three that you see above in your own words.
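
When describing a rule it helps to know how the columns relate. The sketch below recomputes `confidence`, `lift` and `leverage` for the most confident rule, using the three support values copied from its row in the table above. Small differences in the last digits are expected because the table values are rounded.

```python
# Values copied from the first row of the sorted rules table:
# (frozen foods, vegetables, fruit, biscuits) -> (bread and cake)
antecedent_support = 0.224552
consequent_support = 0.719689
rule_support = 0.200778

# confidence = support(antecedent and consequent) / support(antecedent)
confidence = rule_support / antecedent_support

# lift = confidence / support(consequent)
lift = confidence / consequent_support

# leverage = support(both) - support(antecedent) * support(consequent)
leverage = rule_support - antecedent_support * consequent_support

print(round(confidence, 4), round(lift, 4), round(leverage, 4))
```

A lift above 1 means the antecedent and consequent appear together more often than they would if they were bought independently.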

## 4. Boost Fruit Sales

The supermarket manager aims to increase the sales of **fruit** and requires insights into other products most frequently purchased alongside fruit. This information will be used to guide promotional strategies.

**Task:**
- Use the same parameters as before:
  - **Minimum support (`minsup`)**: 0.2  
  - **Minimum confidence (`minconf`)**: 0.8  
- Identify and list the **top three itemsets** that are most strongly associated with fruit purchases.
- Present these itemsets to the supermarket manager for consideration in promotion planning.

In [15]:
# --- Find rules that predict 'fruit' as the consequent ---

# Filter for rules where the consequent is exactly 'fruit'
# (Note: This excludes items like 'canned fruit')
fruit_rules = ap_rules[ap_rules['consequents'] == frozenset(['Your answer'])]

# Sort the rules by confidence in descending order and select the top 3
top_fruit_rules = fruit_rules.sort_values(
    by='confidence',
    ascending=False
).head(3)

# Display the top 3 rules
top_fruit_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
178,"(bread and cake, frozen foods, vegetables, bis...",(fruit),0.242057,0.640156,0.200778,0.829464,1.295723,1.0,0.045824,2.110082,0.301118,0.29464,0.526085,0.571552
172,"(milk-cream, vegetables, biscuits)",(fruit),0.267128,0.640156,0.219365,0.821197,1.282809,1.0,0.048361,2.012523,0.300817,0.318882,0.503111,0.581936
140,"(bread and cake, vegetables, biscuits)",(fruit),0.321375,0.640156,0.262805,0.817754,1.27743,1.0,0.057076,1.974497,0.320027,0.376121,0.493542,0.614144
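
The filter above relies on the `consequents` column holding `frozenset` objects. The sketch below shows, on a made-up rules table, the difference between an exact match on the consequent and a looser "contains" match. The rows and confidence values are invented for illustration.

```python
import pandas as pd

# Made-up rules table with the same column layout as ap_rules
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"vegetables"}),
        frozenset({"bread and cake"}),
        frozenset({"vegetables", "biscuits"}),
        frozenset({"bread and cake"}),
    ],
    "consequents": [
        frozenset({"fruit"}),
        frozenset({"canned fruit"}),
        frozenset({"fruit"}),
        frozenset({"fruit", "vegetables"}),
    ],
    "confidence": [0.81, 0.85, 0.83, 0.80],
})

target = frozenset({"fruit"})

# Exact match: the consequent is fruit and nothing else
exact = toy_rules[toy_rules["consequents"].apply(lambda c: c == target)]

# Looser match: fruit appears somewhere in the consequent
contains = toy_rules[toy_rules["consequents"].apply(lambda c: "fruit" in c)]

print(exact.sort_values("confidence", ascending=False)["confidence"].tolist())
print(len(contains))
```

Neither filter picks up `canned fruit`, because set membership compares whole item names rather than substrings.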


## 5. FP-Growth

Repeat the analysis from **Task 3**, but this time use the **FP-Growth** algorithm for frequent itemset mining.

**Tasks:**
1. Apply FP-Growth with the same parameters as before:
   - **Minimum support (`minsup`)**: 0.2  
   - **Minimum confidence (`minconf`)**: 0.8  
2. Determine:
   - The number of frequent 2-itemsets and 3-itemsets.
   - The top three rules with the highest confidence.
3. Compare the results with those obtained using the **Apriori** algorithm:
   - Are the frequent itemsets and rules consistent between the two methods?
   - Highlight any differences in the results or execution time.

Import the `fpgrowth` function from the `mlxtend` module.

There are a lot of rules, so let's compare just the 10 most confident rules from each algorithm.

In [22]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Frequent itemset mining using FP-Growth
fp_itemsets = fpgrowth(
    df,
    min_support= 'Your answer',    # Itemsets must appear in ≥ 20% of transactions
    use_colnames=True   # Show actual item names instead of column indices
)

# Generate association rules from FP-Growth itemsets
fp_rules = association_rules(
    fp_itemsets,
    metric='confidence',  # Filter rules by confidence metric
    min_threshold='Your answer'    # Keep rules with confidence ≥ 80%
)

# Select the top 10 rules (FP-Growth) by confidence
fp_top_10 = fp_rules.sort_values(
    by='confidence',
    ascending=False
).head(10)

# Select the top 10 rules (Apriori) by confidence
ap_top_10 = ap_rules.sort_values(
    by='confidence',
    ascending=False
).head(10)
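
The task also asks about execution time. One way to measure it, sketched below, is a small wrapper around `time.perf_counter`. The helper name `time_call` is hypothetical, and the workload is a stand-in so the snippet runs on its own.

```python
import time

def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return (last result, best seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in workload so the sketch runs on its own.
# In the notebook you could call, for example:
#   time_call(apriori, df, min_support=..., use_colnames=True)
#   time_call(fpgrowth, df, min_support=..., use_colnames=True)
total, seconds = time_call(sum, range(100_000))
print(total, f"{seconds:.6f}s")
```

Taking the best of several runs reduces noise from other activity on the machine. In a notebook, the `%timeit` magic is a common alternative.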

In [23]:
# Display results for side-by-side comparison
print("Top 10 Rules (FP-Growth):")
display(fp_top_10)

print("\nTop 10 Rules (Apriori):")
display(ap_top_10)

Top 10 Rules (FP-Growth):


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
44,"(frozen foods, vegetables, fruit, biscuits)",(bread and cake),0.224552,0.719689,0.200778,0.894129,1.242383,1.0,0.039171,2.647667,0.25159,0.270058,0.622309,0.586554
101,"(margarine, fruit, biscuits)",(bread and cake),0.2319,0.719689,0.202723,0.874185,1.21467,1.0,0.035828,2.227955,0.230089,0.270707,0.551158,0.577933
31,"(frozen foods, fruit, biscuits)",(bread and cake),0.282905,0.719689,0.247028,0.873186,1.213282,1.0,0.043425,2.210406,0.245141,0.326945,0.547594,0.608214
49,"(milk-cream, vegetables, biscuits)",(bread and cake),0.267128,0.719689,0.232332,0.869741,1.208496,1.0,0.040083,2.151954,0.23541,0.307935,0.535306,0.596282
93,"(margarine, baking needs, fruit)",(bread and cake),0.244003,0.719689,0.212016,0.868911,1.207342,1.0,0.03641,2.13832,0.227163,0.282059,0.532343,0.581753
42,"(frozen foods, vegetables, biscuits)",(bread and cake),0.278798,0.719689,0.242057,0.868217,1.206378,1.0,0.041409,2.127067,0.237205,0.32,0.529869,0.602277
94,"(milk-cream, margarine, fruit)",(bread and cake),0.237087,0.719689,0.205749,0.867821,1.205829,1.0,0.03512,2.120699,0.223741,0.273957,0.528457,0.576854
34,"(milk-cream, frozen foods, biscuits)",(bread and cake),0.271234,0.719689,0.235358,0.867729,1.2057,1.0,0.040154,2.11922,0.234103,0.311499,0.528128,0.597378
41,"(vegetables, fruit, biscuits)",(bread and cake),0.303436,0.719689,0.262805,0.866097,1.203432,1.0,0.044426,2.093388,0.242682,0.345651,0.522305,0.615631
88,"(milk-cream, margarine, baking needs)",(bread and cake),0.246812,0.719689,0.213313,0.864273,1.200899,1.0,0.035685,2.065261,0.22211,0.283214,0.5158,0.580335



Top 10 Rules (Apriori):


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
177,"(frozen foods, vegetables, fruit, biscuits)",(bread and cake),0.224552,0.719689,0.200778,0.894129,1.242383,1.0,0.039171,2.647667,0.25159,0.270058,0.622309,0.586554
139,"(margarine, fruit, biscuits)",(bread and cake),0.2319,0.719689,0.202723,0.874185,1.21467,1.0,0.035828,2.227955,0.230089,0.270707,0.551158,0.577933
132,"(frozen foods, fruit, biscuits)",(bread and cake),0.282905,0.719689,0.247028,0.873186,1.213282,1.0,0.043425,2.210406,0.245141,0.326945,0.547594,0.608214
138,"(milk-cream, vegetables, biscuits)",(bread and cake),0.267128,0.719689,0.232332,0.869741,1.208496,1.0,0.040083,2.151954,0.23541,0.307935,0.535306,0.596282
117,"(margarine, fruit, baking needs)",(bread and cake),0.244003,0.719689,0.212016,0.868911,1.207342,1.0,0.03641,2.13832,0.227163,0.282059,0.532343,0.581753
133,"(frozen foods, vegetables, biscuits)",(bread and cake),0.278798,0.719689,0.242057,0.868217,1.206378,1.0,0.041409,2.127067,0.237205,0.32,0.529869,0.602277
163,"(milk-cream, margarine, fruit)",(bread and cake),0.237087,0.719689,0.205749,0.867821,1.205829,1.0,0.03512,2.120699,0.223741,0.273957,0.528457,0.576854
130,"(milk-cream, frozen foods, biscuits)",(bread and cake),0.271234,0.719689,0.235358,0.867729,1.2057,1.0,0.040154,2.11922,0.234103,0.311499,0.528128,0.597378
141,"(vegetables, fruit, biscuits)",(bread and cake),0.303436,0.719689,0.262805,0.866097,1.203432,1.0,0.044426,2.093388,0.242682,0.345651,0.522305,0.615631
114,"(milk-cream, margarine, baking needs)",(bread and cake),0.246812,0.719689,0.213313,0.864273,1.200899,1.0,0.035685,2.065261,0.22211,0.283214,0.5158,0.580335


Do the above tables agree?
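
Comparing the tables by eye is awkward because the row indices differ and the items inside each itemset may be printed in a different order. A sketch of a programmatic check is below. The helper `rule_set` is a hypothetical name, and the two small tables are made up to imitate that situation.

```python
import pandas as pd

def rule_set(rules, decimals=6):
    """Reduce a rules table to a set of (antecedent, consequent, confidence) triples."""
    return {
        (a, c, round(conf, decimals))
        for a, c, conf in zip(rules["antecedents"], rules["consequents"], rules["confidence"])
    }

# Two made-up tables holding the same rules in a different row order
rules_a = pd.DataFrame({
    "antecedents": [frozenset({"fruit", "biscuits"}), frozenset({"margarine"})],
    "consequents": [frozenset({"bread and cake"}), frozenset({"bread and cake"})],
    "confidence": [0.87, 0.80],
}, index=[177, 2])

rules_b = pd.DataFrame({
    "antecedents": [frozenset({"margarine"}), frozenset({"biscuits", "fruit"})],
    "consequents": [frozenset({"bread and cake"}), frozenset({"bread and cake"})],
    "confidence": [0.80, 0.87],
}, index=[5, 44])

# Sets ignore row order and index labels; frozensets ignore item order
print(rule_set(rules_a) == rule_set(rules_b))
```

In the notebook, the same check would be `rule_set(ap_rules) == rule_set(fp_rules)`. If it is `False`, the symmetric difference of the two sets shows which rules differ.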