## *Exploring Frequent Itemsets: Closed vs Maximal in Supermarket Data*
### *Introduction: Understanding Maximal and Closed Frequent Itemsets*

#### *Frequent Itemsets*

*A frequent itemset is a set of items that appears together in at least a specified minimum share of the transactions in a dataset. That minimum is known as the minimum support threshold.*

---

#### *Closed Frequent Itemsets*

*A closed frequent itemset is a frequent itemset for which no superset has the same support.*

*In other words, it is not possible to add any more items to the set without decreasing how often it appears. Closed itemsets help reduce redundancy while preserving complete support information.*

---

#### *Maximal Frequent Itemsets*

*A maximal frequent itemset is a frequent itemset for which no superset is also frequent.*

*This means that no additional items can be added to the itemset while still satisfying the minimum support threshold. Maximal itemsets provide a highly compact representation of frequent patterns.*
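
*The three definitions can be checked on a tiny example. The sketch below uses five made-up baskets and an assumed absolute threshold of 2 transactions; neither comes from the notebook's data. It enumerates every itemset by brute force, which is only practical at this scale.*

```python
from itertools import combinations

# Hypothetical toy data: five small baskets (not the notebook's dataset).
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Eggs"},
    {"Bread"},
]
min_count = 2  # assumed absolute support threshold

items = sorted(set().union(*transactions))

# Count every non-empty candidate itemset by brute force.
counts = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        itemset = frozenset(combo)
        counts[itemset] = sum(itemset <= t for t in transactions)

# Frequent: the count reaches the threshold.
frequent = {s: c for s, c in counts.items() if c >= min_count}

# Closed: no proper superset has the same count.
closed = {s for s, c in frequent.items()
          if not any(s < t and c == frequent[t] for t in frequent)}

# Maximal: no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

for name, group in [("frequent", frequent), ("closed", closed), ("maximal", maximal)]:
    print(name, sorted(sorted(s) for s in group))
```

*In this toy data, {Eggs} is frequent but not closed, because {Milk, Eggs} occurs in the same two baskets. Every maximal itemset is also closed, so the maximal set is never larger than the closed set.*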

---


### *[Student: Mohammed]*
*Import necessary libraries*

In [1041]:
import pandas as pd # Import pandas for data manipulation
import numpy as np # Import numpy for numerical operations

import random    # Import random for generating random numbers
from mlxtend.frequent_patterns import apriori   # Import apriori algorithm from mlxtend for frequent itemset mining
from collections import defaultdict     # Import defaultdict for creating dictionaries with default values

## *Step 1: Generating Supermarket Transactions*

*This section generates 3,000 supermarket transactions. Each transaction includes between 2 and 7 items randomly selected from a pool of 30 unique grocery items. To ensure reproducibility, a random seed is set. The resulting transactions are stored in a pandas DataFrame and saved as a CSV file for future use.*


### *Define Item Pool*

In [1042]:
# Generate 3000 supermarket transactions
# Each transaction will have between 2 and 7 items randomly chosen from a pool of 30 unique items

#Define the item pool
item_pool = [   # List of 30 unique grocery items
    'Milk', 'Bread', 'Butter', 'Eggs', 'Cheese', 'Apples', 'Bananas', 'Chicken',
    'Beef', 'Fish', 'Rice', 'Pasta', 'Cereal', 'Juice', 'Soda', 'Yogurt',
    'Tomatoes', 'Onions', 'Potatoes', 'Carrots', 'Cookies', 'Chips', 'Ice Cream',
    'Coffee', 'Tea', 'Sugar', 'Flour', 'Salt', 'Pepper', 'Oil'
]   

###  *Generating Supermarket Transactions*

*We generate 3,000 transactions by randomly sampling between 2 and 7 items from the predefined item pool. A random seed is set for reproducibility.*


In [1043]:
# generate transactions
random.seed(42)  # For reproducibility
transactions = []     # List to hold the transactions
# Create 3000 transactions
for _ in range(3000):
    transaction = random.sample(item_pool, k=random.randint(2, 7))  # 2 to 7 items per transaction
    transactions.append(transaction)   # Append the transaction to the list

### *Save and Display*

In [1044]:
# Save the transactions to a CSV file in the data/ directory
import os
os.makedirs('data', exist_ok=True)   # Create the output directory if it does not exist yet
df_transactions = pd.DataFrame({'Transaction': transactions})      # Create a DataFrame from the transactions list
df_transactions.to_csv('data/supermarket_transactions.csv', index=False)   # Save the DataFrame to a CSV file


# Display shape of the DataFrame
print("Number of transactions:", df_transactions.shape[0])    # Display the number of transactions
print("DataFrame shape:", df_transactions.shape)  # Display the shape of the DataFrame
df_transactions.head()   # Display the first few transactions


Number of transactions: 3000
DataFrame shape: (3000, 1)


Unnamed: 0,Transaction
0,"[Eggs, Milk, Coffee, Beef, Chicken, Sugar, Che..."
1,"[Eggs, Chips, Coffee, Onions, Butter, Potatoes..."
2,"[Milk, Butter]"
3,"[Chicken, Tomatoes, Carrots]"
4,"[Onions, Bananas]"


## *[Student: Lesala]*

## *Step 2: Encoding and Mining Frequent Itemsets*

*In this section, we transform the transaction data into a one-hot encoded format and apply the Apriori algorithm to identify the most frequently purchased item combinations. Itemsets that appear in at least 5% of transactions are retained.*

#### *One-Hot Encode the Transactions*
*We convert each transaction—a list of purchased items—into a format suitable for the Apriori algorithm. Each row represents a transaction, and each column corresponds to an item, marked as `True` if present in that transaction and `False` otherwise. This binary structure is crucial for applying the Apriori method.*

---

##### *Why is this necessary?*
*The `apriori` function from `mlxtend` expects a one-hot encoded DataFrame in which each row is a transaction and each column is an item. Without this encoding, the function cannot tell which items co-occur across transactions.*



In [1045]:
# Convert list of items to one-hot encoded DataFrame

# Each row is a transaction, each column is an item, and values are True/False
encoded_data = []

# Loop through each transaction (a list of items)
for transaction in transactions:
    # Create a dictionary for each transaction
    # Key: item name
    # Value: True if item is in the transaction, else False
    encoded_row = {item: (item in transaction) for item in item_pool}
    
    # Add the encoded transaction to the list
    encoded_data.append(encoded_row)

df = pd.DataFrame(encoded_data)  # Create one-hot encoded DataFrame

###  *Find Frequent Itemsets using the Apriori Algorithm*

*We use the `mlxtend` library’s `apriori` function to identify frequent itemsets—combinations of items that appear together in at least 5% of transactions.*

---

#### *Why is this necessary?*

*Identifying frequent itemsets helps uncover common buying patterns. This is foundational for later steps like generating association rules, which tell us how the presence of one item implies another.*

---


In [1046]:
# Generate Frequent Itemsets using Apriori algorithm
# Minimum support threshold is 0.05 (i.e., items appearing in at least 5% of transactions)
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)

### *Sorting and Exporting the Top Itemsets*

*After generating frequent itemsets, we sort them by support (frequency of occurrence) and export the top 10 for reporting and further analysis.*

---

#### *Why is this necessary?*

*Sorting allows us to focus on the most significant patterns, while exporting ensures we can reuse or share the findings in a reproducible and organized way.*

---


In [1047]:
# Sort the itemsets by support in descending order
frequent_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)

#Save the top 10 frequent itemsets to CSV
frequent_itemsets.head(10).to_csv('data/frequent_itemsets.csv', index=False)
#Display output summaries
print("\nTop 10 Frequent Itemsets:\n", frequent_itemsets.head(10))



Top 10 Frequent Itemsets:
      support     itemsets
21  0.170000      (Chips)
3   0.162333       (Eggs)
15  0.161000     (Yogurt)
20  0.159333    (Cookies)
4   0.159000     (Cheese)
22  0.156667  (Ice Cream)
13  0.154333      (Juice)
14  0.154000       (Soda)
23  0.154000     (Coffee)
9   0.152667       (Fish)


### *[Student: Halima]*
## *Step 3: Identify Closed Frequent Itemsets*

### *Introduction: Understanding Maximal and Closed Frequent Itemsets*

In *Market Basket Analysis*, one of our goals is to uncover frequent patterns—groups of items that appear together in many transactions.

However, as we mine more patterns, the number of frequent itemsets can *grow explosively*. This leads to *redundancy* and makes interpretation more difficult.

*To solve this*, we use *condensed representations* of frequent itemsets:

* *Maximal Frequent Itemsets (MFI)*
* *Closed Frequent Itemsets (CFI)*

These approaches help reduce the number of itemsets while retaining the most important information for analysis.



---

#### *What is Support?*

**Support** is a measure that tells you **how often an itemset appears in the dataset**, expressed as a **proportion of total transactions**.

---

###  *Support Formula*
$$
\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}
$$


Where:
- **Support(X)** is the support of itemset **X**.
- **Number of transactions containing X** is how many transactions include the itemset.
- **Total number of transactions** is the total count of all transactions in the dataset.
- The result is a number between 0 and 1 (or between 0% and 100%).
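
As a minimal sketch of the formula, the helper below computes support directly from a list of baskets. The function name `support` and the four baskets are made up for illustration and are not part of the notebook.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= set(t) for t in transactions)
    return hits / len(transactions)

# Hypothetical toy baskets.
baskets = [
    ["Milk", "Bread"],
    ["Milk", "Eggs"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Butter"],
]

print(support({"Milk"}, baskets))           # Milk is in 3 of 4 baskets -> 0.75
print(support({"Milk", "Bread"}, baskets))  # both items are in 2 of 4 baskets -> 0.5
```

Adding an item to an itemset can never raise its support, which is the property Apriori relies on to prune candidates.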

---


In [1048]:

# An itemset is closed if there is no superset with the same support

closed_itemsets = []
for i, row_i in frequent_itemsets.iterrows():
    is_closed = True
    for j, row_j in frequent_itemsets.iterrows():
        if row_i['itemsets'] < row_j['itemsets'] and row_i['support'] == row_j['support']:
            is_closed = False
            break
    if is_closed:
        closed_itemsets.append(row_i)

# Convert closed itemsets list to DataFrame
closed_df = pd.DataFrame(closed_itemsets)

# Add Approximate Occurrences
total_transactions = df.shape[0]  # Total number of transactions in your dataset
closed_df['approx_occurrences'] = (closed_df['support'] * total_transactions).round().astype(int)

# Save to CSV
closed_df.to_csv('data/closed_itemsets.csv', index=False)

# Display output summaries
print("\nTop 10 Closed Itemsets:\n")
print(closed_df[['support', 'itemsets', 'approx_occurrences']].head(10))

print("\nTotal Number of Closed Itemsets:", len(closed_df))


Top 10 Closed Itemsets:

     support     itemsets  approx_occurrences
21  0.170000      (Chips)                 510
3   0.162333       (Eggs)                 487
15  0.161000     (Yogurt)                 483
20  0.159333    (Cookies)                 478
4   0.159000     (Cheese)                 477
22  0.156667  (Ice Cream)                 470
13  0.154333      (Juice)                 463
14  0.154000       (Soda)                 462
23  0.154000     (Coffee)                 462
9   0.152667       (Fish)                 458

Total Number of Closed Itemsets: 30


## *Interpretation of Closed Frequent Itemsets*

*We analyzed 3,000 simulated supermarket transactions using the Apriori algorithm to discover frequent itemsets, and then filtered the results to find closed frequent itemsets. This method helped eliminate redundant patterns while preserving essential support information.*

---

### *What Are Closed Itemsets?*

*A closed itemset is a frequent itemset for which no superset has the same support. In other words, you can't add more items to the set without lowering how often it appears. Closed itemsets are valuable because they:*

*- Represent non-redundant patterns*
*- Preserve the support value of every frequent itemset*
*- Provide compact but complete summaries of frequent patterns*
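
*The "complete" part can be made concrete: the support of any frequent itemset equals the largest support among the closed itemsets that contain it. The sketch below assumes a small hand-written dictionary of closed itemsets and their counts; the values and the helper name `count_from_closed` are illustrative only.*

```python
# Hypothetical closed itemsets with their transaction counts (toy values).
closed_counts = {
    frozenset({"Milk"}): 4,
    frozenset({"Bread"}): 4,
    frozenset({"Milk", "Bread"}): 3,
    frozenset({"Milk", "Eggs"}): 2,
}

def count_from_closed(itemset, closed_counts):
    """Return the count of a frequent itemset, or None if it is not frequent."""
    itemset = frozenset(itemset)
    # Every closed superset gives a lower bound; the closure gives the exact count.
    candidates = [c for s, c in closed_counts.items() if itemset <= s]
    return max(candidates) if candidates else None

print(count_from_closed({"Eggs"}, closed_counts))           # recovered from {Milk, Eggs}
print(count_from_closed({"Milk"}, closed_counts))           # {Milk} is itself closed
print(count_from_closed({"Bread", "Eggs"}, closed_counts))  # no closed superset: not frequent
```

*The same lookup is not possible from maximal itemsets alone, because they only tell us which itemsets are frequent, not how frequent.*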

---

### *Summary of Results*

*We identified:*

*- 30 closed frequent itemsets*
*- Top 10 are all single items*
*- Top-10 support values range from about 15.3% to 17%*

*Here are the top 10 closed itemsets:*

| *Rank* | *Itemset*     | *Support*  | *Approx. Occurrences* |
| ------ | ------------- | ---------- | --------------------- |
| *1*    | *(Chips)*     | *0.170000* | *510 transactions*    |
| *2*    | *(Eggs)*      | *0.162333* | *487 transactions*    |
| *3*    | *(Yogurt)*    | *0.161000* | *483 transactions*    |
| *4*    | *(Cookies)*   | *0.159333* | *478 transactions*    |
| *5*    | *(Cheese)*    | *0.159000* | *477 transactions*    |
| *6*    | *(Ice Cream)* | *0.156667* | *470 transactions*    |
| *7*    | *(Juice)*     | *0.154333* | *463 transactions*    |
| *8*    | *(Soda)*      | *0.154000* | *462 transactions*    |
| *9*    | *(Coffee)*    | *0.154000* | *462 transactions*    |
| *10*   | *(Fish)*      | *0.152667* | *458 transactions*    |

---

### *Interpretation by Key Observations*

#### *1. Most Closed Itemsets Are Single Products*

*The top closed itemsets are individual items.*
*This does not mean these products are bought alone: every simulated transaction contains between 2 and 7 items. It means that no specific pair containing them reaches the 5% support threshold.*
*Each item is closed because no frequent superset (e.g., Chips + Soda) occurs with the same frequency.*

*Insight:*

*These items appear in many baskets, but they are not consistently paired with any one other item.*

---

#### *2. Consistent Support Values Across Items*

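*Support values among the top 10 range narrowly from about 15.3% to 17%.*
*This suggests a balanced set of popular items, not just one or two dominant ones. In this simulation that is expected: items are sampled uniformly and the average basket holds 4.5 of the 30 items, so each item's expected support is 4.5 / 30 = 15%.*

*Insight:*

*Demand is spread across many products rather than concentrated in a few best-sellers.*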

---

#### *3. No Frequent Pairs or Triplets in Top Results*

*All top closed itemsets have length 1 because no pair or larger group reached the 5% support threshold, so none were frequent in the first place.*
*This implies that customers tend to buy diverse item combinations, and not the same sets repeatedly.*

*Insight:*

*Cross-selling opportunities might not come from frequency alone; association rule metrics such as confidence and lift could better capture related buying patterns.*

---

#### *4. Relatively Small Number of Closed Itemsets*

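*Only 30 closed itemsets remain out of more than a billion possible non-empty combinations of 30 items (2^30 - 1).*
*In general, a combination is left out of the closed set because it is either:*
*- Not frequent enough, or*
*- Redundant (i.e., it has the same support as one of its supersets)*

*Here the support threshold does all of the filtering: each of the 30 frequent itemsets is closed, so none was removed as redundant.*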
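
*To give a sense of scale, the short calculation below counts the candidate itemsets that exist for a pool of 30 items. It is plain arithmetic and uses no data from the notebook beyond the pool size and the maximum basket size of 7.*

```python
from math import comb

n_items = 30          # size of the item pool
max_basket_size = 7   # no transaction holds more than 7 items

total_itemsets = 2**n_items - 1   # every non-empty subset of the pool
pairs = comb(n_items, 2)          # 2-item candidates
triples = comb(n_items, 3)        # 3-item candidates

# Only itemsets with at most 7 items can appear in any transaction at all.
possible_in_a_basket = sum(comb(n_items, k) for k in range(1, max_basket_size + 1))

print(total_itemsets)        # 1073741823
print(pairs, triples)        # 435 4060
print(possible_in_a_basket)
```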

*Insight:*

*Closed itemsets offer a concise and meaningful summary of frequent patterns — helping analysts focus on what truly matters.*

---

#### *Business Implications*

*- Promote top individual products like Chips, Eggs, and Yogurt — they are reliable and popular purchases.*
*- Consider targeted bundle offers using association rules to discover less obvious, non-redundant item pairs.*
*- Use closed itemsets for efficient inventory planning and to understand core customer preferences.*

---

#### *Conclusion*

*In this dataset every frequent itemset is also closed, so the closed representation confirms the set of high-support patterns rather than shrinking it. No dominant product combinations were found; the most frequent individual items stand out clearly and give a starting point for sales and marketing analysis.*

---


### *[Student: Snit]*
## *Step 4: Identify Maximal Frequent Itemsets*

In [1049]:
# Identify Maximal Frequent Itemsets
# An itemset is maximal if there is no frequent superset of it

maximal_itemsets = []
for i, row_i in frequent_itemsets.iterrows():
    is_maximal = True
    for j, row_j in frequent_itemsets.iterrows():
        if row_i['itemsets'] < row_j['itemsets']:
            is_maximal = False
            break
    if is_maximal:
        maximal_itemsets.append(row_i)

# Convert to DataFrame
maximal_df = pd.DataFrame(maximal_itemsets)


# Add approximate occurrence count
total_transactions = len(df)  # df is the one-hot encoded DataFrame, one row per transaction
maximal_df['occurrences'] = (maximal_df['support'] * total_transactions).round().astype(int)

# Save to CSV
maximal_df.to_csv('data/maximal_itemsets.csv', index=False)

# Display results
print("\n*Top 10 Maximal Frequent Itemsets:*\n", maximal_df.head(10))
print("\n*Number of Maximal Frequent Itemsets:*", len(maximal_df))



*Top 10 Maximal Frequent Itemsets:*
      support     itemsets  occurrences
21  0.170000      (Chips)          510
3   0.162333       (Eggs)          487
15  0.161000     (Yogurt)          483
20  0.159333    (Cookies)          478
4   0.159000     (Cheese)          477
22  0.156667  (Ice Cream)          470
13  0.154333      (Juice)          463
14  0.154000       (Soda)          462
23  0.154000     (Coffee)          462
9   0.152667       (Fish)          458

*Number of Maximal Frequent Itemsets:* 30


### *Interpretation of Maximal Frequent Itemsets*

*We analyzed 3,000 simulated supermarket transactions using the Apriori algorithm and identified **maximal frequent itemsets** — itemsets for which no frequent superset exists. These represent the **most specific, high-support patterns** that are not further extendable without falling below the support threshold.*

---

### *What Are Maximal Itemsets?*

*A maximal frequent itemset is a frequent itemset that has **no frequent superset**. In simpler terms: you cannot add any more items to the set without its support dropping below the minimum threshold.*

*They are valuable because they:*

* *Represent the "boundary" of frequent patterns*
* *Are **at least as compact** as closed itemsets (every maximal itemset is also closed)*
* *Omit internal structure: the supports of their subsets cannot be recovered from them, only the fact that those subsets are frequent*

---

### *Summary of Results*

*We identified:*

* *30 maximal frequent itemsets*
* *Top 10 are all single items, identical to the closed set*
* *Top-10 support values range from **15.3% to 17%***
* *Top-10 approximate occurrence counts range between **458 and 510** transactions*


### *Why Are Maximal and Closed Itemsets Showing the Same Result?*

This happens when **none of the larger combinations of items** (pairs, triplets, etc.) meet the support threshold. That is:

* Itemsets **larger than size 1** were **not frequent enough** to pass the `min_support` threshold.
* Therefore, **no itemset of size > 1** appears in either the closed or maximal result.
* In such cases, **closed itemsets = maximal itemsets = frequent itemsets of size 1.**

>  *This is a **data-driven limitation**, not an error in logic. You can confirm this by lowering the `min_support` to discover combinations that become frequent.*
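
The result can also be anticipated from how the data was generated. Each basket size from 2 to 7 is equally likely and items are drawn uniformly without replacement, so the expected support of any fixed itemset can be computed exactly. The sketch below does this with exact fractions; the helper name `expected_support` is illustrative and not part of the notebook.

```python
from fractions import Fraction

n_items = 30
basket_sizes = range(2, 8)   # sizes 2..7, matching random.randint(2, 7)

def expected_support(m):
    """Expected support of a fixed m-item set under uniform sampling without replacement."""
    total = Fraction(0)
    for k in basket_sizes:
        # Probability that all m chosen items land in a basket of size k.
        p = Fraction(1)
        for i in range(m):
            p *= Fraction(k - i, n_items - i)   # becomes 0 when m > k
        total += p
    return total / len(basket_sizes)

print(float(expected_support(1)))   # single item: 0.15
print(float(expected_support(2)))   # any pair: roughly 0.021
print(float(expected_support(3)))   # any triple: far smaller still
```

A single item is expected in 15% of baskets, which matches the observed 15% to 17%. A specific pair is expected in only about 2.1% of baskets, well under the 5% threshold, so a `min_support` near 0.02 or lower is where pairs would start to appear.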

---

### *Interpretation by Key Observations*

#### *1. All Maximal Itemsets Are Size 1*

*The absence of longer itemsets (pairs, triplets) indicates that item combinations are not commonly repeated across transactions.*

*This may be due to:*

* *A wide diversity of purchases (many unique combinations)*
* *Support threshold set too high for item pairs to qualify*

**Insight:** *Customers buy a wide mix of items, so identifying strong bundles may require association rule metrics such as confidence and lift.*

---

#### *2. High Support, Independent Items*

*Items like Chips, Eggs, and Yogurt are reliable best-sellers.*

**Insight:** *They are key drivers of traffic and could be used in promotions, featured categories, or stock priority planning.*

---

#### *3. Maximal Itemsets Are a Compact Summary*

*The 30 itemsets cover the essential frequent patterns without including any subsets — this makes them efficient for pattern mining when detail isn't required.*

**Insight:** *Maximal itemsets are ideal when you want the most condensed set of patterns that still capture frequent behaviors.*

---

#### *4. Use Association Rules for Deeper Combinations*

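*If your goal is to discover bundles or co-purchased items, apply `association_rules()` from `mlxtend` to frequent itemsets mined at a lower `min_support`, instead of relying only on frequent/maximal itemsets. With only single-item frequent itemsets, as here, no rules can be formed.*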
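
*As a rough illustration of what rule metrics measure, the sketch below computes confidence and lift by hand for one rule on five made-up baskets. It does not call `mlxtend`; the helper name `rule_metrics` and the data are assumptions for illustration.*

```python
def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    supp_a = sum(a <= set(t) for t in transactions) / n
    supp_c = sum(c <= set(t) for t in transactions) / n
    supp_ac = sum((a | c) <= set(t) for t in transactions) / n
    if supp_a == 0 or supp_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = supp_ac / supp_a   # share of antecedent baskets that also hold the consequent
    lift = confidence / supp_c      # above 1 means the items co-occur more than chance predicts
    return supp_ac, confidence, lift

# Hypothetical toy baskets.
baskets = [
    ["Chips", "Soda"],
    ["Chips", "Soda"],
    ["Chips", "Eggs"],
    ["Soda", "Eggs"],
    ["Eggs", "Milk"],
]

supp, confidence, lift = rule_metrics({"Chips"}, {"Soda"}, baskets)
print(f"support={supp:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

*Here two of the three Chips baskets also contain Soda, so confidence is about 0.67, and the lift is slightly above 1.*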


### *Business Implications*

* *Promote individually strong items seen in top maximal patterns.*
* *Explore bundling or cross-selling by analyzing association rules.*
* *Use maximal itemsets to simplify downstream reporting or clustering.*

---

### *Conclusion*

*The current data and threshold reveal that single products dominate frequent patterns. While maximal and closed sets appear the same here, the methods remain useful tools — and adjusting parameters like `min_support` may reveal deeper insights.*

---
