# Exploring Frequent Itemsets: Closed vs Maximal in Supermarket Data

This notebook focuses on simulating transaction data for a supermarket scenario and applying frequent pattern mining using the Apriori algorithm.

In this section, we simulate transaction data that will later be used to identify:
- Frequent Itemsets
- Closed Frequent Itemsets
- Maximal Frequent Itemsets


##  Task 1: Simulate Supermarket Transaction Data

In this section, we simulate 3,000 supermarket transactions.  
Each transaction contains between 2 and 7 items, randomly selected from a pool of 30 unique items.  
The resulting dataset is saved to `supermarket_transactions.csv` for further analysis.

---

###  Student Responsible: Selmah Tzindori



In [41]:
# [Student: Selmah Tzindori] Simulate 3,000 supermarket transactions
# [Student: Hana Gashaw] Modified for reproducibility

import random
import pandas as pd

# -------------------------------
# Step 0: Set seed for reproducibility
# -------------------------------
# Ensures that random selections (bundles and extra items) are the same every time the script runs.
random.seed(42)

# -------------------------------
# Step 1: Define a pool of items
# -------------------------------
# A diverse list of 30+ common grocery items typically found in a supermarket.
item_pool = [
    'milk', 'bread', 'eggs', 'cheese', 'butter', 'juice', 'apples', 'bananas', 'oranges', 'grapes',
    'cereal', 'chocolate', 'yogurt', 'chicken', 'beef', 'pasta', 'rice', 'tomatoes', 'onions', 'potatoes',
    'carrots', 'lettuce', 'beans', 'soda', 'water', 'coffee', 'tea', 'cookies', 'ice cream', 'toilet paper'
]

# -------------------------------
# Step 2: Define common frequent bundles
# -------------------------------
# These are multi-item sets that will be intentionally injected into 50% of transactions
# to simulate real-world item associations (like milk+bread or apples+bananas+yogurt).
frequent_bundles = [
    ['milk', 'bread'],
    ['apples', 'bananas', 'yogurt'],
    ['chicken', 'rice', 'beans'],
    ['soda', 'chips', 'cookies'],  # Note: 'chips' will be added to item pool
    ['cheese', 'butter', 'eggs']
]

# Add any missing bundle items to item pool (e.g., 'chips' not in original pool)
item_pool = list(set(item_pool + ['chips']))

# -------------------------------
# Step 3: Generate synthetic transactions
# -------------------------------
# Loop generates 3,000 transactions. Each transaction:
# - Has a 50% chance of including one frequent bundle
# - Adds 0 to 4 extra random (non-duplicate) items
# - Randomizes item order to avoid fixed patterns
num_transactions = 3000
transactions = []

for _ in range(num_transactions):
    transaction = []

    # Inject a frequent bundle 50% of the time
    if random.random() < 0.5:
        bundle = random.choice(frequent_bundles)
        transaction.extend(bundle)

    # Add a few additional random items (avoid duplicates)
    num_extra_items = random.randint(0, 4)
    remaining_items = list(set(item_pool) - set(transaction))
    extras = random.sample(remaining_items, num_extra_items)
    transaction.extend(extras)

    # Shuffle items so the order is randomized
    random.shuffle(transaction)
    transactions.append(transaction)

# -------------------------------
# Step 4: Save transactions to CSV
# -------------------------------
# Each transaction is saved as a comma-separated string in one row.
# Useful for visual inspection or loading later.
transaction_strings = [', '.join(t) for t in transactions]
transactions_df = pd.DataFrame({'Transaction': transaction_strings})
transactions_df.to_csv('supermarket_transactions.csv', index=False)

#step 5
#export the DataFrame to a CSV file
# This will create a CSV file named 'supermarket_transactions.csv' in the current directory. 
transactions_df.to_csv('supermarket_transactions.csv', index=False)

# -------------------------------
# Step 6: Preview the simulated data
# -------------------------------
# Display the first few rows to confirm structure and content.
print("Sample Transactions:")
transactions_df.head()


Sample Transactions:


Unnamed: 0,Transaction
0,
1,tea
2,"chips, tomatoes, coffee, bread, ice cream, milk"
3,"tea, eggs, soda, coffee"
4,"lettuce, carrots"


##  Task 2: Convert Transactions to One-Hot Encoded Format After Cleaning

We convert the simulated transaction data into a one-hot encoded format.  
This format is required by the `apriori()` algorithm in the `mlxtend.frequent_patterns` module.

Each transaction becomes a row in the DataFrame, and each unique item becomes a column.  
A value of `True` indicates that the item is present in the transaction.

---

###  Student Responsible: Levin Ekuam



In [26]:
# [Student: Levin Ekuam] Convert Transactions to One-Hot Encoded Format
# [Student : Hana Gashaw] Modified to clean the data from missing values

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Drop any rows where the Transaction is missing (NaN or empty)
transactions_df.dropna(subset=['Transaction'], inplace=True)

# Optional: Drop rows where the transaction string is just empty
transactions_df = transactions_df[transactions_df['Transaction'].str.strip() != '']

# Reset index after cleaning
transactions_df.reset_index(drop=True, inplace=True)

# Convert cleaned string transactions to list format
transactions = transactions_df['Transaction'].apply(lambda x: x.strip().split(', '))


# Encode
# Initializing the encoder object
te = TransactionEncoder()

# Fitting the encoder to the transaction data and transform it to a boolean array
# This will return a 2D array where each row represents a transaction and each column represents an item
# The value will be True if the item is in that transaction, otherwise False
te_ary = te.fit(transactions).transform(transactions)

# Converting the boolean array to a DataFrame with column names as item names
# Each column now corresponds to an item, and each row is a transaction with True/False values
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

df_encoded # displaying the data

Unnamed: 0,apples,bananas,beans,beef,bread,butter,carrots,cereal,cheese,chicken,...,oranges,pasta,potatoes,rice,soda,tea,toilet paper,tomatoes,water,yogurt
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
1,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2720,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2721,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2722,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2723,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,True,True,False,False,False,False


##  Task 3: Find Frequent Itemsets using the Apriori Algorithm


- We mine the frequent items that support ≥ 0.05 by using`apriori()` algorithm in the `mlxtend.frequent_patterns` module.

- Then we displayed the top ten items and their respective support value.
___
###  Student Responsible: Ted Korir


In [43]:
# [Student: Ted Korir] Find Frequent Itemsets using the Apriori Algorithm

from mlxtend.frequent_patterns import apriori

# ----------------------------------------
# Step 1: Generate frequent itemsets
# ----------------------------------------
# This uses the Apriori algorithm to identify combinations of items (itemsets)
# that appear together in at least 5% of the transactions (min_support = 0.05).
# The `use_colnames=True` ensures the output uses actual item names instead of column indices.
frequent_itemsets = apriori(df_encoded, min_support=0.05, use_colnames=True)

# ----------------------------------------
# Step 2: Round support values
# ----------------------------------------
# Support represents the proportion of transactions that contain the itemset.
# We round it to 2 decimal places for cleaner output and easier interpretation.
frequent_itemsets['support'] = frequent_itemsets['support'].round(2)

# ----------------------------------------
# Step 3: Display top 10 frequent itemsets
# ----------------------------------------
# Display the first 10 rows of the resulting frequent itemsets DataFrame.
# This includes single items as well as combinations that occur frequently together.
print(frequent_itemsets.head(10))

#export the top 10 of the DataFrame to a CSV file
# This will create a CSV file named 'frequent_itemsets.csv' in the current directory    
frequent_itemsets.head(10).to_csv('frequent_itemsets.csv', index=False)


   support   itemsets
0     0.17   (apples)
1     0.17  (bananas)
2     0.17    (beans)
3     0.08     (beef)
4     0.17    (bread)
5     0.18   (butter)
6     0.07  (carrots)
7     0.08   (cereal)
8     0.18   (cheese)
9     0.17  (chicken)


#  Task 4: Find Closed Frequent Itemsets

This section focuses on analyzing **closed frequent itemsets** from the supermarket transaction dataset.

A **closed frequent itemset** is an itemset that has **no immediate superset** with the **same support count**. These itemsets provide a more compact and informative representation of frequent patterns compared to all frequent itemsets.

We use a **support count dictionary** to track the frequency of each item and item combination across the dataset.

---
### _Student Responsible: Angela Irungu_



In [44]:
# [Student: Angela Irungu] Find Closed Frequent Itemsets

# NOTE: Assumes that 'frequent_itemsets' has already been generated 
# using the apriori algorithm from mlxtend.frequent_patterns

# Step 2: Identify Closed Itemsets from frequent_itemsets

# Initialize an empty list to hold closed itemsets
closed_itemsets = []

# Loop through each itemset in the frequent itemsets list
for i, row in frequent_itemsets.iterrows():
    current_itemset = row['itemsets']     # The current itemset
    current_support = row['support']      # Its support value
    is_closed = True                      # Assume it's closed initially

    # Compare with all other itemsets to check for supersets
    for j, other_row in frequent_itemsets.iterrows():
        other_itemset = other_row['itemsets']
        other_support = other_row['support']

        # Condition to disqualify 'current_itemset' from being closed:
        # If there's a **proper superset** (i.e., current_itemset < other_itemset)
        # that has the **same support**, then current_itemset is not closed.
        if current_itemset < other_itemset and current_support == other_support:
            is_closed = False
            break   # No need to check further

    # If no such superset found, it is a closed frequent itemset
    if is_closed:
        closed_itemsets.append((current_itemset, current_support))

# Convert the list of closed itemsets to a DataFrame for easy viewing and export
closed_df = pd.DataFrame(closed_itemsets, columns=["itemsets", "support"])

# Optional: Save the closed itemsets to a CSV file
closed_df.to_csv("closed_itemsets.csv", index=False)

# Display the result
print("Closed Frequent Itemsets:")
print(closed_df)  # Show only the first 5 for brevity


Closed Frequent Itemsets:
                     itemsets  support
0                    (apples)     0.17
1                   (bananas)     0.17
2                     (beans)     0.17
3                      (beef)     0.08
4                     (bread)     0.17
5                    (butter)     0.18
6                   (carrots)     0.07
7                    (cereal)     0.08
8                    (cheese)     0.18
9                   (chicken)     0.17
10                    (chips)     0.20
11                (chocolate)     0.07
12                   (coffee)     0.07
13                  (cookies)     0.19
14                     (eggs)     0.18
15                   (grapes)     0.08
16                (ice cream)     0.07
17                    (juice)     0.07
18                  (lettuce)     0.08
19                     (milk)     0.18
20                   (onions)     0.07
21                  (oranges)     0.08
22                    (pasta)     0.07
23                 (potatoes)     0.08

## TASK 5: Maximal Frequent Itemsets





In this section, we identify **maximal frequent itemsets** from the frequent itemsets previously generated using the Apriori algorithm.

A **maximal itemset** is one that is **not a subset** of any other frequent itemset.  
This means that no larger itemset containing it is frequent.

To find them, we compare each itemset with all others.  
If we find no **frequent superset**, we mark it as **maximal**.

Finally, the list of maximal frequent itemsets is saved to `maximal_itemsets.csv` and the first five results are displayed.

___

### **Student Responsible: Trizah Nzioka**

In [45]:
# [Student: Trizah Nzioka] Find Maximal Frequent Itemsets

# Maximal frequent itemsets are those that:
# - Are frequent (appear in enough transactions, i.e., ≥ min_support)
# - Have no **frequent superset** (i.e., no larger itemset that is also frequent)

# Step 1: Create an empty list to hold all maximal itemsets
maximal_itemsets = []

# Step 2: Loop through each frequent itemset found using Apriori
for i, row in frequent_itemsets.iterrows():
    current_itemset = row['itemsets']  # The itemset under consideration
    is_maximal = True                  # Assume it's maximal unless proven otherwise

    # Step 3: Compare current_itemset with all other itemsets
    for j, other_row in frequent_itemsets.iterrows():
        other_itemset = other_row['itemsets']

        # Check if there's a **proper superset** of current_itemset
        # If yes, current_itemset is not maximal
        if current_itemset < other_itemset:
            is_maximal = False
            break   # No need to continue checking other itemsets

    # Step 4: If no frequent superset was found, add to maximal list
    if is_maximal:
        maximal_itemsets.append((current_itemset, row['support']))

# Step 5: Convert the list of maximal itemsets into a DataFrame
maximal_df = pd.DataFrame(maximal_itemsets, columns=["itemsets", "support"])

# Step 6: Save results to a CSV file (optional for reporting)
maximal_df.to_csv("maximal_itemsets.csv", index=False)

# Step 7: Print the first 5 maximal frequent itemsets for review
print("Maximal Frequent Itemsets (first 5):")
print(maximal_df)


Maximal Frequent Itemsets (first 5):
                     itemsets  support
0                      (beef)     0.08
1                   (carrots)     0.07
2                    (cereal)     0.08
3                 (chocolate)     0.07
4                    (coffee)     0.07
5                    (grapes)     0.08
6                 (ice cream)     0.07
7                     (juice)     0.07
8                   (lettuce)     0.08
9                    (onions)     0.07
10                  (oranges)     0.08
11                    (pasta)     0.07
12                 (potatoes)     0.08
13                      (tea)     0.07
14             (toilet paper)     0.07
15                 (tomatoes)     0.08
16                    (water)     0.07
17              (milk, bread)     0.12
18  (yogurt, apples, bananas)     0.10
19     (beans, chicken, rice)     0.10
20     (eggs, cheese, butter)     0.11
21     (soda, cookies, chips)     0.13
