# Exploring Frequent Itemsets: Closed vs Maximal

This notebook focuses on simulating transaction data for a supermarket scenario and applying frequent pattern mining using the Apriori algorithm.

In this section, we simulate transaction data that will later be used to identify:
- Frequent Itemsets
- Closed Frequent Itemsets
- Maximal Frequent Itemsets


##  Task 1: Simulate Supermarket Transaction Data

In this section, we simulate 3,000 supermarket transactions.  
Each transaction contains between 2 and 7 items, randomly selected from a pool of 30 unique items.  
The resulting dataset is saved to `supermarket_transactions.csv` for further analysis.

---

###  Student Responsible: Selmah Tzindori



In [1]:
# [Student: Selmah Tzindori] Simulation of 3,000 supermarket transactions and export to CSV

# Import the random module to help us randomly select items for each transaction
import random

# Import pandas for working with structured data like tables and CSV files
import pandas as pd

# Step 1: Define a list (pool) of 30 unique supermarket items
# These will be randomly picked to form each transaction
item_pool = [
    'milk', 'bread', 'eggs', 'cheese', 'butter', 'juice', 'apples', 'bananas', 'oranges', 'grapes',
    'cereal', 'chocolate', 'yogurt', 'chicken', 'beef', 'pasta', 'rice', 'tomatoes', 'onions', 'potatoes',
    'carrots', 'lettuce', 'beans', 'soda', 'water', 'coffee', 'tea', 'cookies', 'ice cream', 'toilet paper'
]

# Step 2: Set the number of transactions to simulate
num_transactions = 3000  # Total number of customers or baskets

# Create an empty list that will hold each simulated transaction
transactions = []

# Step 3: Loop 3,000 times to create each transaction
for _ in range(num_transactions):
    # Randomly choose a number between 2 and 7 to determine how many items in this transaction
    transaction_length = random.randint(2, 7)

    # Randomly select 'transaction_length' number of unique items from the item pool
    transaction = random.sample(item_pool, transaction_length)

    # Add the generated transaction (a list of items) to our list of all transactions
    transactions.append(transaction)

# Step 4: Convert the list of transactions into a format suitable for saving to CSV
# Each transaction will become one string, with items separated by commas
transaction_strings = [', '.join(t) for t in transactions]

# Create a pandas DataFrame with one column called 'Transaction'
# Each row in the DataFrame represents a customer transaction
transactions_df = pd.DataFrame({'Transaction': transaction_strings})

# Step 5: Save the DataFrame to a CSV file
# This file will be used in the next steps of the project (frequent itemset mining)
transactions_df.to_csv('supermarket_transactions.csv', index=False)

# Step 6: Show the first 5 transactions to check the output looks correct
transactions_df.head()


Unnamed: 0,Transaction
0,"tea, onions, cheese, chicken, juice, apples, b..."
1,"cookies, ice cream, juice, tomatoes, cheese"
2,"bananas, cereal"
3,"chicken, beef, eggs, milk, bananas, oranges"
4,"cookies, tea, apples, grapes"


##  Task 2: Convert Transactions to One-Hot Encoded Format

We convert the simulated transaction data into a one-hot encoded format.  
This format is required by the `apriori()` algorithm in the `mlxtend.frequent_patterns` module.

Each transaction becomes a row in the DataFrame, and each unique item becomes a column.  
A value of `True` indicates that the item is present in the transaction.

---

###  Student Responsible: Levin



In [2]:
#Importing the TransactionEncoder class from mlxtend
from mlxtend.preprocessing import TransactionEncoder

# Initializing the encoder object
te = TransactionEncoder()
# Fitting the encoder to the transaction data and transform it to a boolean array
# This will return a 2D array where each row represents a transaction and each column represents an item
# The value will be True if the item is in that transaction, otherwise False
te_ary = te.fit(transactions).transform(transactions)

# Converting the boolean array to a DataFrame with column names as item names
# Each column now corresponds to an item, and each row is a transaction with True/False values
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

df_encoded # displaying the data


Unnamed: 0,apples,bananas,beans,beef,bread,butter,carrots,cereal,cheese,chicken,...,oranges,pasta,potatoes,rice,soda,tea,toilet paper,tomatoes,water,yogurt
0,True,False,True,False,False,False,False,False,True,True,...,False,False,False,False,False,True,False,False,False,False
1,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False
2,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,True,False,True,False,False,False,False,False,True,...,True,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,True,False,False,False,True,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2996,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2997,True,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2998,False,False,True,False,False,False,False,False,True,False,...,False,False,False,False,False,False,True,False,False,False


##  Task 3: Find Frequent Itemsets using the Apriori Algorithm


We mine the frequent items that support ≥ 0.05 by using`apriori()` algorithm in the `mlxtend.frequent_patterns` module.

The output is the top ten items and their respective support value
###  Student Responsible: Ted Korir


In [5]:


# Importing library
from mlxtend.frequent_patterns import apriori
import pandas as pd

# Loading the transactions
trans_df = pd.read_csv("supermarket_transactions.csv")     

# Split items in the "Transaction" column into lists
transactions = trans_df["Transaction"].str.split(", ")

# One‑hot encoding: create a DataFrame with True/False for each item per transaction
item_pool = sorted({item for basket in transactions for item in basket})
encoded_rows = [{item: (item in basket) for item in item_pool} for basket in transactions]
df = pd.DataFrame(encoded_rows)

# 4. Mining frequent itemsets (support ≥ 0.05)
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)
frequent_itemsets["itemsets"] = frequent_itemsets["itemsets"].apply(set)

# [Student: Angela Irungu 669289] 5. Sort by support 
frequent_itemsets = frequent_itemsets.sort_values("support", ascending=False, ignore_index=True)

# Display top‑10 frequent itemsets
print("\nTop 10 frequent itemsets (min_support = 0.05):")
print(frequent_itemsets.head(10))



Top 10 frequent itemsets (min_support = 0.05):
    support   itemsets
0  0.159333     {beef}
1  0.158333   {apples}
2  0.157333    {water}
3  0.155333   {butter}
4  0.155333  {lettuce}
5  0.155333   {coffee}
6  0.155000   {onions}
7  0.153333   {yogurt}
8  0.153000    {juice}
9  0.152667  {carrots}


<span style='font-size:xx-large'>**TASK 4**</span>

<span style='font-size:large'>_Student Responsible: Angela Irungu_ </span>

<span style='font-size:large'><u>_CLOSED FREQUENT ITEMSETS ANALYSIS SECTION_</u></span>

<span style='font-size:large'>This section reads the supermarket transaction data and identifies closed frequent itemsets.</span>

<span style='font-size:large'>A closed itemset is one that has no superset with the same support count.</span>

<span style='font-size:large'>We use a support count dictionary to track frequency of items and combinations.</span>

<span style='font-size:large'>The output below shows the first five closed itemsets and their support counts.</span>


In [6]:

from collections import defaultdict
import pandas as pd


# Load the transactions from CSV file
df = pd.read_csv("supermarket_transactions.csv")

# Prepare list of transactions
if df.shape[1] == 1:
    transactions = df.iloc[:, 0].apply(lambda x: set(str(x).split(','))).tolist()
else:
    transactions = df.apply(lambda row: set(row.dropna().astype(str)), axis=1).tolist()

# Count support
support_count = defaultdict(int)
for t in transactions:
    for item in t:
        support_count[frozenset([item])] += 1
    for i1 in t:
        for i2 in t:
            if i1 < i2:
                support_count[frozenset([i1, i2])] += 1
    if len(t) >= 3:
        support_count[frozenset(t)] += 1

# Find closed itemsets
closed_itemsets = []
for itemset in support_count:
    is_closed = True
    for other in support_count:
        if itemset < other and support_count[itemset] == support_count[other]:
            is_closed = False
            break
    if is_closed:
        closed_itemsets.append((set(itemset), support_count[itemset]))

# Show first five only
print("Closed Frequent Itemsets (first 5):")
for itemset, count in closed_itemsets[:5]:
    print(f"{itemset} -> support: {count}")


Closed Frequent Itemsets (first 5):
{' onions'} -> support: 354
{' cheese'} -> support: 351
{' chicken'} -> support: 323
{' beans'} -> support: 338
{' juice'} -> support: 359
