## *Exploring Frequent Itemsets: Closed vs Maximal in Supermarket Data*
#### *Introduction: Understanding Maximal and Closed Frequent Itemsets*

#### *Frequent Itemsets*

*A frequent itemset is a set of items that appears together in a dataset at least a specified minimum number of times (or in a minimum fraction of transactions), known as the support threshold.*

---

#### *Closed Frequent Itemsets*

*A closed frequent itemset is a frequent itemset for which no superset has the same support.*

*In other words, it is not possible to add any more items to the set without decreasing how often it appears. Closed itemsets help reduce redundancy while preserving complete support information.*

---

#### *Maximal Frequent Itemsets*

*A maximal frequent itemset is a frequent itemset for which no superset is also frequent.*

*This means that no additional items can be added to the itemset while still satisfying the minimum support threshold. Maximal itemsets provide a highly compact representation of frequent patterns.*
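The three definitions above can be checked on a tiny hand-made example. The sketch below uses five hypothetical transactions and an absolute threshold of 2 (neither comes from the supermarket data in this notebook) and enumerates itemsets by brute force, which is only practical for toy inputs.

```python
from itertools import combinations

# Hypothetical toy data, not the notebook's simulated transactions.
toy_transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Eggs"},
    {"Milk"},
    {"Eggs"},
]
min_count = 2  # assumed absolute support threshold


def count(itemset):
    """Number of toy transactions that contain every item in `itemset`."""
    return sum(1 for t in toy_transactions if itemset <= t)


# Brute-force enumeration of all frequent itemsets.
items = sorted(set().union(*toy_transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        c = count(frozenset(combo))
        if c >= min_count:
            frequent[frozenset(combo)] = c

# Closed: no proper superset with the same count.
closed = {s for s, c in frequent.items()
          if not any(s < t and c == frequent[t] for t in frequent)}

# Maximal: no proper superset that is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

print("frequent:", {tuple(sorted(s)): c for s, c in frequent.items()})
print("closed:  ", sorted(tuple(sorted(s)) for s in closed))
print("maximal: ", sorted(tuple(sorted(s)) for s in maximal))
```

Here `{Bread}` is frequent but not closed, because `{Bread, Milk}` occurs exactly as often, while `{Milk}` is closed but not maximal, because `{Bread, Milk}` is still frequent.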

---


### *[Student: Mohammed]*
*Import necessary libraries*

In [1182]:
import pandas as pd # Import pandas for data manipulation
import numpy as np # Import numpy for numerical operations

import random    # Import random for generating random numbers
from mlxtend.frequent_patterns import apriori   # Import apriori algorithm from mlxtend for frequent itemset mining
from collections import defaultdict     # Import defaultdict for creating dictionaries with default values

## *Step-1. Simulate Supermarket Transactions Data*

*This section generates `3,000 supermarket transactions.` Each transaction has a 50% chance of including one predefined bundle of 2 or 3 items, plus `0 to 4` extra items randomly selected from the item pool, which starts from `30 unique grocery items.` A transaction therefore holds between 0 and 7 items. To ensure reproducibility, a random seed is set. The resulting transactions are stored in a pandas DataFrame and saved as a CSV file for future use.*


 ### *Define Item Pool*

In [1183]:
# Generate 3000 supermarket transactions
# Each transaction combines an optional frequent bundle with 0 to 4 extra items from the pool
random.seed(42)
# Define the item pool
item_pool = [   # List of 30 unique grocery items
    'Milk', 'Bread', 'Butter', 'Eggs', 'Cheese', 'Apples', 'Bananas', 'Chicken',
    'Beef', 'Fish', 'Rice', 'Pasta', 'Cereal', 'Juice', 'Soda', 'Yogurt',
    'Tomatoes', 'Onions', 'Potatoes', 'Carrots', 'Cookies', 'Chips', 'Ice Cream',
    'Coffee', 'Tea', 'Sugar', 'Flour', 'Salt', 'Pepper', 'Oil'
]  

# -------------------------------  
# Step 2: Define common frequent bundles  
# -------------------------------  
frequent_bundles = [  
    ['Milk', 'Bread'],  
    ['Apples', 'Bananas', 'Yogurt'],  
    ['Chicken', 'Rice', 'Beans'],  
    ['Soda', 'Chips', 'Cookies'],  
    ['Cheese', 'Butter', 'Eggs']  
]
 
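# Add bundle items that are missing from the pool ('Beans' is used in a bundle but is not listed above)
# sorted() keeps the item order stable across runs, so the seeded sampling stays reproducible
item_pool = sorted(set(item_pool + ['Beans']))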

###  *Generating Supermarket Transactions*

*We generate 3,000 transactions. Each one includes a frequent bundle with 50% probability, followed by 0 to 4 additional items sampled from the rest of the item pool. A random seed is set for reproducibility.*


In [1184]:
# -------------------------------
# Step 3: Generate synthetic transactions
# -------------------------------
# Loop generates 3,000 transactions. Each transaction:
# - Has a 50% chance of including one frequent bundle
# - Adds 0 to 4 extra random (non-duplicate) items
# - Randomizes item order to avoid fixed patterns

num_transactions = 3000
transactions = []

for _ in range(num_transactions):
    transaction = []

    # Inject a frequent bundle 50% of the time
    if random.random() < 0.5:
        bundle = random.choice(frequent_bundles)
        transaction.extend(bundle)

    # Add a few additional random items (avoid duplicates)
    num_extra_items = random.randint(0, 4)
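    # Keep the pool order: a set difference has no stable order, so the seeded sample could vary between runs
    remaining_items = [item for item in item_pool if item not in transaction]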
    extras = random.sample(remaining_items, num_extra_items)
    transaction.extend(extras)

    # Shuffle items so the order is randomized
    random.shuffle(transaction)
    transactions.append(transaction)

### *Save and Display*

In [1185]:
# Step 4: Save transactions to CSV
# -------------------------------
# Each transaction is saved as a comma-separated string in one row.
# Useful for visual inspection or loading later.
transaction_strings = [', '.join(t) for t in transactions]
transactions_df = pd.DataFrame({'Transaction': transaction_strings})
transactions_df.to_csv('supermarket_transactions.csv', index=False)


# -------------------------------
# Step 5: Preview the simulated data
# -------------------------------
# Display the DataFrame to confirm structure and content (pandas shows only the first and last rows).
print("Sample Transactions:")
transactions_df


Sample Transactions:


|      | Transaction |
|------|-------------|
| 0    |             |
| 1    | Yogurt |
| 2    | Juice, Beef, Bananas, Bread, Soda, Milk |
| 3    | Yogurt, Carrots, Chicken, Bananas |
| 4    | Pepper, Coffee |
| ...  | ... |
| 2995 | Carrots |
| 2996 | Cereal, Bananas |
| 2997 | Bananas, Pepper |
| 2998 | Juice |


## *[Student: Lesala]*

## *Step-2: Generate Frequent Itemsets*
### *Encoding and Mining Frequent Itemsets*

*In this section, we transform the transaction data into a one-hot encoded format and apply the Apriori algorithm to identify the most frequently purchased item combinations. Itemsets that appear in at least 5% of transactions are retained.*

#### *One-Hot Encode the Transactions*
*We convert each transaction—a list of purchased items—into a format suitable for the Apriori algorithm. Each row represents a transaction, and each column corresponds to an item, marked as `True` if present in that transaction and `False` otherwise. This binary structure is crucial for applying the Apriori method.*

---

##### *Why this is necessary:*
*The `apriori` function in `mlxtend` expects tabular input in which each transaction is a binary vector over the items. A plain list of item names does not have this shape, so the transactions must be one-hot encoded before the algorithm can count which items co-occur.*
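As a small illustration of why the binary layout is convenient, the sketch below one-hot encodes three hypothetical transactions with plain Python lists and counts co-occurrences by checking two columns per row. The names `toy_transactions`, `toy_items`, and `cooccurrence` are illustrative and do not appear elsewhere in this notebook.

```python
# Hypothetical toy data; the real notebook encodes the 3,000 simulated transactions.
toy_transactions = [["Milk", "Bread"], ["Milk"], ["Bread", "Eggs", "Milk"]]

# Fixed, sorted column order: one column per distinct item.
toy_items = sorted({item for t in toy_transactions for item in t})

# One row per transaction, one boolean per item.
rows = [[item in t for item in toy_items] for t in toy_transactions]


def cooccurrence(item_a, item_b):
    """Count the rows in which both item columns are True."""
    a, b = toy_items.index(item_a), toy_items.index(item_b)
    return sum(1 for row in rows if row[a] and row[b])


print(toy_items)
for row in rows:
    print(row)
print("Milk & Bread together:", cooccurrence("Milk", "Bread"))
```

Once every transaction is a row of booleans, the support of any itemset is just the fraction of rows in which all of its columns are `True`.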



In [1186]:
# Convert list of items to one-hot encoded DataFrame

# Each row is a transaction, each column is an item, and values are True/False
encoded_data = []

# Loop through each transaction (a list of items)
for transaction in transactions:
    # Create a dictionary for each transaction
    # Key: item name
    # Value: True if item is in the transaction, else False
    encoded_row = {item: (item in transaction) for item in item_pool}
    
    # Add the encoded transaction to the list
    encoded_data.append(encoded_row)

df = pd.DataFrame(encoded_data)  # Create one-hot encoded DataFrame
df.head()  # Display the first few rows of the DataFrame

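|   | Bread | Ice Cream | Potatoes | Cookies | Tomatoes | Juice | Oil | Salt | Onions | Pasta | ... | Chicken | Bananas | Milk | Carrots | Sugar | Flour | Fish | chips | Apples | Tea |
|---|-------|-----------|----------|---------|----------|-------|-----|------|--------|-------|-----|---------|---------|------|---------|-------|-------|------|-------|--------|-----|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | True | False | False | False | False | True | False | False | False | False | ... | False | True | True | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | True | True | False | True | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |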


###  *Find Frequent Itemsets using the Apriori Algorithm*

*We use the `mlxtend` library’s `apriori` function to identify frequent itemsets—combinations of items that appear together in at least 5% of transactions.*
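To make the idea behind the algorithm concrete, here is a minimal pure-Python sketch of the level-wise Apriori search: frequent k-itemsets are built only from frequent (k-1)-itemsets, because any superset of an infrequent itemset must itself be infrequent. This is a simplified illustration under the assumption of a small, non-empty list of transactions; it is not how `mlxtend` implements `apriori`, and the function name `apriori_sketch` is hypothetical.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)  # assumed to be > 0

    def support(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in sets for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count: keep the candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data for a quick check.
toy = [["Milk", "Bread"], ["Milk", "Bread", "Eggs"], ["Milk"], ["Eggs"]]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: candidates with an infrequent subset are discarded before any transaction is scanned.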

---

#### *Why this is necessary:*

*Identifying frequent itemsets helps uncover common buying patterns. This is foundational for later steps like generating association rules, which tell us how the presence of one item implies another.*

---


In [1187]:
# Generate Frequent Itemsets using Apriori algorithm
# Minimum support threshold is 0.05 (i.e., items appearing in at least 5% of transactions)
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)

### *Sorting and Exporting the Top Itemsets*

*After generating frequent itemsets, we sort them by support (frequency of occurrence) and export the top 10 for reporting and further analysis.*

---

#### *Why this is necessary:*

*Sorting allows us to focus on the most significant patterns, while exporting ensures we can reuse or share the findings in a reproducible and organized way.*

---


In [1188]:
# Sort the itemsets by support in descending order
frequent_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)

# Save the top 10 frequent itemsets to CSV
frequent_itemsets.head(10).to_csv('frequent_itemsets.csv', index=False)
# Display output summaries
print("\nTop 10 Frequent Itemsets:\n", frequent_itemsets.head(10))



Top 10 Frequent Itemsets:
      support   itemsets
13  0.180333     (Soda)
19  0.179333    (Chips)
3   0.173000  (Cookies)
0   0.165000    (Bread)
23  0.163000     (Milk)
16  0.162667   (Cheese)
22  0.161000  (Bananas)
10  0.157333     (Eggs)
29  0.153333   (Apples)
12  0.151000   (Butter)


### *[Student: Halima]*
## *Step 3: Identify Closed Frequent Itemsets*

### *Introduction: Understanding Maximal and Closed Frequent Itemsets*

In *Market Basket Analysis*, one of our goals is to uncover frequent patterns—groups of items that appear together in many transactions.

However, as we mine more patterns, the number of frequent itemsets can *grow explosively*. This leads to *redundancy* and makes interpretation more difficult.

*To solve this*, we use *condensed representations* of frequent itemsets:

* *Maximal Frequent Itemsets (MFI)*
* *Closed Frequent Itemsets (CFI)*

These approaches help reduce the number of itemsets while retaining the most important information for analysis.



---

#### *What is Support?*

**Support** is a measure that tells you **how often an itemset appears in the dataset**, expressed as a **proportion of total transactions**.

---

###  *Support Formula*
\[
\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}
\]


Where:
- **Support(X)** is the support of itemset **X**.
- **Number of transactions containing X** is how many transactions include the itemset.
- **Total number of transactions** is the total count of all transactions in the dataset.
- The result is a number between 0 and 1 (or between 0% and 100%).
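A short sketch of the formula in code, using four hypothetical transactions (the helper name `support` and the toy data are illustrative only). The last line applies the same ratio to the figures reported in this notebook for Soda: about 541 of 3,000 transactions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


# Hypothetical toy data.
toy = [
    ["Soda", "Chips"],
    ["Soda"],
    ["Chips", "Cookies"],
    ["Soda", "Chips", "Cookies"],
]

print(support(["Soda"], toy))           # 3 of 4 transactions
print(support(["Soda", "Chips"], toy))  # 2 of 4 transactions

# Same ratio with the notebook's reported figures for Soda.
print(round(541 / 3000, 6))
```

Adding an item to an itemset can never raise its support, which is why `support(["Soda", "Chips"], toy)` cannot exceed `support(["Soda"], toy)`.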

---


In [1189]:

# An itemset is closed if there is no superset with the same support

closed_itemsets = []
for i, row_i in frequent_itemsets.iterrows():
    is_closed = True
    for j, row_j in frequent_itemsets.iterrows():
        if row_i['itemsets'] < row_j['itemsets'] and row_i['support'] == row_j['support']:
            is_closed = False
            break
    if is_closed:
        closed_itemsets.append(row_i)

# Convert the closed itemsets list to a DataFrame
closed_df = pd.DataFrame(closed_itemsets)

# Add approximate occurrences (support x number of transactions)
total_transactions = df.shape[0]  # Total number of transactions in the dataset
closed_df['approx_occurrences'] = (closed_df['support'] * total_transactions).round().astype(int)

# Save to CSV once, alongside the other output files
closed_df.to_csv('closed_itemsets.csv', index=False)

# Display output summaries
print("\n Closed Itemsets:\n")
print(closed_df[['support', 'itemsets', 'approx_occurrences']])

print("\nTotal Number of Closed Itemsets:", len(closed_df))


 Closed Itemsets:

     support                   itemsets  approx_occurrences
13  0.180333                     (Soda)                 541
19  0.179333                    (Chips)                 538
3   0.173000                  (Cookies)                 519
0   0.165000                    (Bread)                 495
23  0.163000                     (Milk)                 489
16  0.162667                   (Cheese)                 488
22  0.161000                  (Bananas)                 483
10  0.157333                     (Eggs)                 472
29  0.153333                   (Apples)                 460
12  0.151000                   (Butter)                 453
21  0.150667                  (Chicken)                 452
17  0.147667                   (Yogurt)                 443
20  0.144667                     (Rice)                 434
37  0.123000              (Soda, Chips)                 369
33  0.123000           (Cookies, Chips)                 369
32  0.120667        

## *Interpretation of Closed Frequent Itemsets*

*We analyzed 3,000 simulated supermarket transactions using the Apriori algorithm to discover frequent itemsets, and then filtered the results to find closed frequent itemsets. This method helped eliminate redundant patterns while preserving essential support information.*

---

### *What Are Closed Itemsets?*

*A closed itemset is a frequent itemset for which no superset has the same support. In other words, you can't add more items to the set without lowering how often it appears. Closed itemsets are valuable because they:*

*- Represent non-redundant patterns*
*- Preserve support values for all items*
*- Provide compact but complete summaries of frequent patterns*
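The claim that closed itemsets preserve support information can be sketched as a lookup: the support of any frequent itemset equals the largest support among the closed itemsets that contain it. The dictionary below is a hypothetical toy result (counts rather than proportions), and `support_from_closed` is an illustrative helper, not part of `mlxtend`.

```python
# Hypothetical closed itemsets with their occurrence counts.
toy_closed = {
    frozenset({"Milk"}): 4,
    frozenset({"Eggs"}): 2,
    frozenset({"Milk", "Bread"}): 3,
}


def support_from_closed(itemset, closed):
    """Recover an itemset's count from the closed itemsets alone.

    Returns None when no closed itemset contains it, i.e. the itemset
    is not frequent at the threshold used to build `closed`.
    """
    itemset = frozenset(itemset)
    counts = [c for s, c in closed.items() if itemset <= s]
    return max(counts) if counts else None


print(support_from_closed({"Bread"}, toy_closed))          # not closed itself, recovered from (Milk, Bread)
print(support_from_closed({"Milk"}, toy_closed))           # closed, found directly
print(support_from_closed({"Bread", "Eggs"}, toy_closed))  # not frequent
```

This is why dropping the non-closed itemsets loses nothing: `{Bread}` was removed, yet its count is still available through `{Milk, Bread}`.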

---

### *Summary of Results*

*We identified:*

*- 45 closed frequent itemsets*
*- Top 10 are all single items*
*- Support values range from about 6% to 18%*

#### *Top 10 Closed Itemsets*

| Rank | Support   | Itemsets   | Approx. Occurrences |
|------|-----------|------------|----------------------|
| 1    | 0.180333  | (Soda)     | 541                  |
| 2    | 0.179333  | (Chips)    | 538                  |
| 3    | 0.173000  | (Cookies)  | 519                  |
| 4    | 0.165000  | (Bread)    | 495                  |
| 5    | 0.163000  | (Milk)     | 489                  |
| 6    | 0.162667  | (Cheese)   | 488                  |
| 7    | 0.161000  | (Bananas)  | 483                  |
| 8    | 0.157333  | (Eggs)     | 472                  |
| 9    | 0.153333  | (Apples)   | 460                  |
| 10   | 0.151000  | (Butter)   | 453                  |

**Total Number of Closed Itemsets:** 45



### *Interpretation by Key Observations*

#### *1. Most Closed Itemsets Are Single Products*

*The top closed itemsets are individual items.*
*This does not mean these products are bought alone. A single item is closed when every superset has strictly lower support, that is, when the item also appears in transactions without its usual companions.*
*For example, Soda has a support of 0.180 while (Soda, Chips) has 0.123, so no superset of Soda matches its frequency.*

### *Insight:*
---
*These are strong sellers in their own right: customers purchase them regularly, without always pairing them with the same items.*
*Top 10 Closed Itemsets show the most frequent individual products in transactions.*
*Soda, Chips, and Cookies are the top 3, each appearing in over 500 transactions.*
*Single items dominate the highest support values, suggesting strong individual preferences.*

*As we move lower in the list, combinations of items (e.g., Soda & Chips, Cookies & Chips) emerge.*
*These combinations reflect common co-purchase patterns, such as snack combinations or breakfast items.*

*The itemset (Soda, Cookies, Chips) still has a high support of 11.7%, indicating a significant number of customers buy these three together.*
*This can inform store layout (placing items together), bundling strategies, and promotions.*

*Support values range from about 6% to 18%: the most frequent item (Soda) appears in 18% of all transactions, while the least frequent (Salt) appears in about 6%.*

*Overall, this analysis reveals customer preferences, frequent co-purchases, and potential bundles, helping optimize inventory, marketing, and sales strategies.*




### *[Student: Snit]*
## *Step 4: Identify Maximal Frequent Itemsets*

In [1190]:
# Identify Maximal Frequent Itemsets
# An itemset is maximal if there is no frequent superset of it

maximal_itemsets = []
for i, row_i in frequent_itemsets.iterrows():
    is_maximal = True
    for j, row_j in frequent_itemsets.iterrows():
        if row_i['itemsets'] < row_j['itemsets']:
            is_maximal = False
            break
    if is_maximal:
        maximal_itemsets.append(row_i)

# Convert to DataFrame
maximal_df = pd.DataFrame(maximal_itemsets)


# Add approximate occurrence count
total_transactions = len(df)  # df is the one-hot encoded DataFrame, one row per transaction
maximal_df['occurrences'] = (maximal_df['support'] * total_transactions).round().astype(int)

# Save to CSV
maximal_df.to_csv('maximal_itemsets.csv', index=False)

# Display results
print("\n*Maximal Frequent Itemsets:*\n", maximal_df)
print("\n*Number of Maximal Frequent Itemsets:*", len(maximal_df))



*Maximal Frequent Itemsets:*
      support                   itemsets  occurrences
42  0.116667     (Soda, Cookies, Chips)          350
31  0.107667              (Bread, Milk)          323
43  0.100000     (Cheese, Eggs, Butter)          300
44  0.094000  (Apples, Yogurt, Bananas)          282
40  0.093333            (Rice, Chicken)          280
8   0.083000                   (Onions)          249
5   0.075667                    (Juice)          227
27  0.072333                     (Fish)          217
9   0.071667                    (Pasta)          215
4   0.071333                 (Tomatoes)          214
11  0.071000                   (Cereal)          213
30  0.070333                      (Tea)          211
14  0.069333                     (Beef)          208
1   0.069000                (Ice Cream)          207
28  0.068667                    (chips)          206
15  0.068000                   (Pepper)          204
6   0.067667                      (Oil)          203
2   0.067333   


*We analyzed 3,000 simulated supermarket transactions using the Apriori algorithm and identified* **_maximal frequent itemsets_**, *itemsets for which no frequent superset exists. These represent the* **_most specific frequent patterns_**: *they cannot be extended without falling below the support threshold.*

---

### *What Are Maximal Itemsets?*

*A maximal frequent itemset is a frequent itemset that has* **_no frequent superset_**. *In simpler terms: you cannot add any more items to the set without its support dropping below the minimum threshold.*

*They are valuable because they:*
* *Represent the "boundary" of frequent patterns*
* *Are* **_more compact_** *than closed itemsets*
* *Omit internal structure: the supports of their subsets cannot be recovered from the maximal itemsets alone*
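The "boundary" property can be sketched as follows: every non-empty subset of a maximal itemset is frequent, so the full list of frequent itemsets can be regenerated from the maximal ones, although without any support values. The toy input and the helper name `frequent_from_maximal` are assumptions made for illustration.

```python
from itertools import combinations


def frequent_from_maximal(maximal_sets):
    """Enumerate every frequent itemset implied by a list of maximal itemsets."""
    frequent = set()
    for m in maximal_sets:
        items = sorted(m)
        for size in range(1, len(items) + 1):
            frequent.update(frozenset(c) for c in combinations(items, size))
    return frequent


# Hypothetical maximal itemsets.
toy_maximal = [{"Milk", "Bread"}, {"Eggs"}]

implied = frequent_from_maximal(toy_maximal)
print(sorted(sorted(s) for s in implied))
```

The output lists which itemsets are frequent, but nothing in `toy_maximal` says how often `{Milk}` or `{Bread}` occurs; that is the information closed itemsets keep and maximal itemsets drop.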

---

### *Summary of Results*

*We identified:*
* *23 maximal frequent itemsets*
* *Top results include sets like (Soda, Cookies, Chips), (Cheese, Eggs, Butter), etc.*
* *Support values range from* **_6.1% to 11.7%_**
* *Approximate occurrence counts range from* **_184 to 350 transactions_**

---

### *Interpretation by Key Observations*

#### *1. Strong Triplets Suggest Popular Bundles*

*The itemset (Soda, Cookies, Chips) appears in 11.7% of transactions, a strong indicator of frequent snacking behavior.*

*Likewise, (Cheese, Eggs, Butter) and (Apples, Yogurt, Bananas) suggest breakfast-related patterns.*

**_Insight:_** *These patterns can inform bundle pricing or in-store co-location.*

---

#### *2. Maximal Itemsets Do Not Repeat Subsets*

*Since maximal sets exclude all subsets that are also frequent, these sets are ideal when you need a concise summary of customer behavior.*

**_Insight:_** *If you only need top-level patterns (not all sub-patterns), maximal itemsets give the most compact summary.*

---

#### *3. Some Maximal Sets Are Still Single Items*

*Some single items (like Juice, Fish, Tea) also appear as maximal. Each clears the 5% threshold on its own, with roughly 7% to 8% support, but no larger itemset containing it is frequent.*

**_Insight:_** *These items sell steadily on their own but are not consistently bought together with any other item.*

---

### *Why Do Maximal and Closed Itemsets Sometimes Match?*

*Every maximal itemset is also closed, so the two lists always overlap. An itemset appears in both exactly when it has no frequent superset at the chosen support threshold.*

*In our case, larger item combinations may be just below the support cut-off.*

**_Solution:_** *Lowering `min_support` can uncover more multi-item combinations.*

---

### *Business Implications*

* *Use strong maximal patterns (e.g., triplets) for marketing bundles*
* *Monitor and restock individual high-frequency items*
* *Avoid overcomplicating models when maximal itemsets suffice*
* *Use in recommendation systems or layout optimization*

---

### *Conclusion*

*Maximal itemsets provide a high-level, non-redundant summary of frequent item combinations. In this dataset, they revealed both individual product popularity and a few strong bundles. Their value lies in compactness and clarity, especially when deeper hierarchy isn’t required.*


#### *Mathematically, we can summarize the relationships between these sets as follows:*

### *Maximal ⊆ Closed ⊆ Frequent*
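One way to see why the chain holds is sketched below. The symbols $\sigma(X)$ for the support of an itemset $X$ and $\sigma_{\min}$ for the minimum support threshold are notation introduced here for illustration.

```latex
\begin{aligned}
&\text{Let } X \text{ be maximal: } \sigma(X) \ge \sigma_{\min}
  \text{ and } \sigma(Y) < \sigma_{\min} \text{ for every } Y \supsetneq X. \\
&\text{Then } \sigma(Y) < \sigma_{\min} \le \sigma(X)
  \;\Rightarrow\; \sigma(Y) \ne \sigma(X) \text{ for every } Y \supsetneq X, \\
&\text{so no superset of } X \text{ has the same support, i.e. } X \text{ is closed.} \\
&\text{Closed frequent itemsets are frequent by definition, hence }
  \text{Maximal} \subseteq \text{Closed} \subseteq \text{Frequent}.
\end{aligned}
```

The converse does not hold: a closed itemset such as (Soda) still has frequent supersets, so it is closed without being maximal.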

### **Maximal Itemsets as Subsets of Closed Itemsets**

- *('Apples', 'Bananas', 'Yogurt')* → *('Apples', 'Bananas', 'Yogurt')*
- *('Bananas', 'Yogurt')* → *('Apples', 'Bananas', 'Yogurt')*
- *('Bread', 'Milk')* → *('Bread', 'Milk')*
- *('Butter', 'Cheese', 'Eggs')* → *('Butter', 'Cheese', 'Eggs')*
- *('Cheese', 'Eggs')* → *('Butter', 'Cheese', 'Eggs')*
- *('Chicken', 'Beans', 'Rice')* → *('Chicken', 'Beans', 'Rice')*
- *('Chicken', 'Rice')* → *('Chicken', 'Beans', 'Rice')*
- *('Chips', 'Cookies', 'Soda')* → *('Chips', 'Cookies', 'Soda')*
- *('Chips', 'Cookies')* → *('Chips', 'Cookies', 'Soda')*
- *('Chips', 'Soda')* → *('Chips', 'Cookies', 'Soda')*
- *('Cookies', 'Soda')* → *('Chips', 'Cookies', 'Soda')*
- *('Apples', 'Bananas')* → *('Apples', 'Bananas', 'Yogurt')*
- *('Apples', 'Yogurt')* → *('Apples', 'Bananas', 'Yogurt')*
- *('Bread',)* → *('Bread',)*
- *('Butter', 'Cheese')* → *('Butter', 'Cheese', 'Eggs')*
- *('Cheese',)* → *('Butter', 'Cheese', 'Eggs')*
- *('Chicken', 'Beans')* → *('Chicken', 'Beans', 'Rice')*
- *('Cookies',)* → *('Cookies',)*
- *('Milk',)* → *('Milk',)*
- *('Soda',)* → *('Soda',)*
- *('Eggs',)* → *('Butter', 'Cheese', 'Eggs')*
- *('Butter',)* → *('Butter', 'Cheese', 'Eggs')*
- *('Chips',)* → *('Chips', 'Cookies', 'Soda')*


