# Question 1: Association Rule Mining on Groceries Dataset

This notebook applies association rule mining to analyze patterns in transactional data from the groceries dataset and generate product recommendations.

## Objective
Analyze customer purchase patterns using association rules to identify frequently purchased product combinations.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import warnings
warnings.filterwarnings('ignore')

## Step A: Load and Preprocess the Data

In [8]:
# Load the groceries dataset
# Each row is one transaction: a comma-separated list of products
# Rows vary in length, so read the file line by line rather than as a fixed-width table
transactions_raw = []
with open('groceries.csv', 'r') as file:
    for line in file:
        # Split on commas, strip whitespace, and drop empty fields
        items = [item.strip() for item in line.strip().split(',') if item.strip()]
        # Skip blank lines so they do not become empty transactions
        if items:
            transactions_raw.append(items)

# Convert to DataFrame for easier handling
df = pd.DataFrame(transactions_raw)

# Display basic information
print("Dataset Shape:", df.shape)
print(f"Number of transactions: {len(df)}")
print("\nFirst 5 transactions:")
print(df.head())

Dataset Shape: (9835, 32)
Number of transactions: 9835

First 5 transactions:
                 0                    1               2   \
0      citrus fruit  semi-finished bread       margarine   
1    tropical fruit               yogurt          coffee   
2        whole milk                 None            None   
3         pip fruit               yogurt    cream cheese   
4  other vegetables           whole milk  condensed milk   

                         3     4     5     6     7     8     9   ...    22  \
0               ready soups  None  None  None  None  None  None  ...  None   
1                      None  None  None  None  None  None  None  ...  None   
2                      None  None  None  None  None  None  None  ...  None   
3              meat spreads  None  None  None  None  None  None  ...  None   
4  long life bakery product  None  None  None  None  None  None  ...  None   

     23    24    25    26    27    28    29    30    31  
0  None  None  None  None  None  N

In [4]:
# Convert each row into a list of items (transactions)
transactions = []

for i in range(len(df)):
    # Get all items in the transaction (all columns in the row)
    transaction = df.iloc[i, :].dropna().values.tolist()
    # Strip whitespace from item names
    transaction = [str(item).strip() for item in transaction]
    transactions.append(transaction)

# Display sample transactions
print(f"Total number of transactions: {len(transactions)}")
print("\nFirst 5 transactions:")
for i, transaction in enumerate(transactions[:5]):
    print(f"Transaction {i+1}: {transaction}")

Total number of transactions: 9835

First 5 transactions:
Transaction 1: ['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups']
Transaction 2: ['tropical fruit', 'yogurt', 'coffee']
Transaction 3: ['whole milk']
Transaction 4: ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads']
Transaction 5: ['other vegetables', 'whole milk', 'condensed milk', 'long life bakery product']


In [5]:
# Use TransactionEncoder to convert transactions into one-hot encoded format
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)

# Convert to DataFrame
df_encoded = pd.DataFrame(te_array, columns=te.columns_)

print("One-Hot Encoded Dataset Shape:", df_encoded.shape)
print(f"\nNumber of unique products: {len(te.columns_)}")
print("\nFirst 5 rows of encoded dataset:")
print(df_encoded.head())
print("\nSample of product columns:")
print(df_encoded.columns.tolist()[:20])

One-Hot Encoded Dataset Shape: (9835, 169)

Number of unique products: 169

First 5 rows of encoded dataset:
   Instant food products  UHT-milk  abrasive cleaner  artif. sweetener  \
0                  False     False             False             False   
1                  False     False             False             False   
2                  False     False             False             False   
3                  False     False             False             False   
4                  False     False             False             False   

   baby cosmetics  baby food   bags  baking powder  bathroom cleaner   beef  \
0           False      False  False          False             False  False   
1           False      False  False          False             False  False   
2           False      False  False          False             False  False   
3           False      False  False          False             False  False   
4           False      False  False          False 
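
The encoding step can be illustrated without mlxtend. The following is a minimal sketch, assuming a sorted column order (which the output above suggests); `one_hot_encode` is a hypothetical helper written for illustration, not part of any library, and the toy baskets are adapted from the first transactions shown earlier.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    # Collect every distinct item and fix a column order
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        basket = set(transaction)
        # True where the item is in the basket, False otherwise
        rows.append([item in basket for item in columns])
    return columns, rows


# Toy baskets adapted from the first transactions in the dataset
toy = [
    ["tropical fruit", "yogurt", "coffee"],
    ["whole milk"],
    ["pip fruit", "yogurt"],
]
columns, rows = one_hot_encode(toy)
print(columns)
for row in rows:
    print(row)
```

Each row has one boolean per distinct product, which is the shape `apriori` expects as input.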

## Step B: Generate Frequent Itemsets

Apply the Apriori algorithm to identify frequent itemsets with minimum support of 1%.

In [9]:
# Apply Apriori algorithm with minimum support of 0.01 (1%)
frequent_itemsets = apriori(df_encoded, min_support=0.01, use_colnames=True)

# Sort by support in descending order
frequent_itemsets = frequent_itemsets.sort_values('support', ascending=False).reset_index(drop=True)

print(f"Total number of frequent itemsets found: {len(frequent_itemsets)}")
print(f"\nFrequent itemsets with support >= 1%: {len(frequent_itemsets)}")
print("\n" + "="*80)
print("Top 10 Frequent Itemsets by Support")
print("="*80)

# Display top 10 frequent itemsets
for idx, row in frequent_itemsets.head(10).iterrows():
    itemset = ', '.join(list(row['itemsets']))
    support_pct = row['support'] * 100
    print(f"\n{idx+1}. Itemset: {{{itemset}}}")
    print(f"   Support: {row['support']:.4f} ({support_pct:.2f}%)")
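    # Round before converting: plain int() truncates and can undercount by one due to float error
    print(f"   Appears in {int(round(row['support'] * len(df_encoded)))} transactions")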

Total number of frequent itemsets found: 333

Frequent itemsets with support >= 1%: 333

Top 10 Frequent Itemsets by Support

1. Itemset: {whole milk}
   Support: 0.2555 (25.55%)
   Appears in 2513 transactions

2. Itemset: {other vegetables}
   Support: 0.1935 (19.35%)
   Appears in 1903 transactions

3. Itemset: {rolls/buns}
   Support: 0.1839 (18.39%)
   Appears in 1809 transactions

4. Itemset: {soda}
   Support: 0.1744 (17.44%)
   Appears in 1715 transactions

5. Itemset: {yogurt}
   Support: 0.1395 (13.95%)
   Appears in 1372 transactions

6. Itemset: {bottled water}
   Support: 0.1105 (11.05%)
   Appears in 1087 transactions

7. Itemset: {root vegetables}
   Support: 0.1090 (10.90%)
   Appears in 1072 transactions

8. Itemset: {tropical fruit}
   Support: 0.1049 (10.49%)
   Appears in 1032 transactions

9. Itemset: {shopping bags}
   Support: 0.0985 (9.85%)
   Appears in 969 transactions

10. Itemset: {sausage}
   Support: 0.0940 (9.40%)
   Appears in 924 transactions
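
To make the frequent-itemset search above more concrete, here is a rough sketch of a level-wise (Apriori-style) search in plain Python. It is an illustration under simplifying assumptions, not mlxtend's actual implementation; `frequent_itemsets_levelwise` and the toy baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets_levelwise(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = sorted({item for basket in baskets for item in basket})
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Toy baskets (hypothetical, not taken from groceries.csv)
toy = [
    ["whole milk", "yogurt"],
    ["whole milk", "yogurt", "curd"],
    ["whole milk", "rolls/buns"],
    ["yogurt", "curd"],
]
found = frequent_itemsets_levelwise(toy, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: a candidate is counted only if all of its smaller subsets already passed the support threshold.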


## Step C: Generate Association Rules

Generate association rules from frequent itemsets with minimum confidence of 20%.

In [11]:
# Generate association rules with minimum confidence of 0.2 (20%)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)

# Sort by confidence in descending order
rules = rules.sort_values('confidence', ascending=False).reset_index(drop=True)

print(f"Total number of association rules generated: {len(rules)}")
print(f"\nRules with confidence >= 20%: {len(rules)}")
print("\n" + "="*100)
print("Association Rules - Sorted by Confidence")
print("="*100)

# Display top 20 rules with detailed information
print("\nTop 20 Association Rules:\n")
for idx, rule in rules.head(20).iterrows():
    antecedents = ', '.join(list(rule['antecedents']))
    consequents = ', '.join(list(rule['consequents']))
    
    print(f"{idx+1}. Rule: {{{antecedents}}} => {{{consequents}}}")
    print(f"   Support:    {rule['support']:.4f} ({rule['support']*100:.2f}%)")
    print(f"   Confidence: {rule['confidence']:.4f} ({rule['confidence']*100:.2f}%)")
    print(f"   Lift:       {rule['lift']:.4f}")
    print(f"   Interpretation: If a customer buys {{{antecedents}}},")
    print(f"                   there is a {rule['confidence']*100:.2f}% chance they will also buy {{{consequents}}}")
    print()

Total number of association rules generated: 234

Rules with confidence >= 20%: 234

Association Rules - Sorted by Confidence

Top 20 Association Rules:

1. Rule: {root vegetables, citrus fruit} => {other vegetables}
   Support:    0.0104 (1.04%)
   Confidence: 0.5862 (58.62%)
   Lift:       3.0296
   Interpretation: If a customer buys {root vegetables, citrus fruit},
                   there is a 58.62% chance they will also buy {other vegetables}

2. Rule: {root vegetables, tropical fruit} => {other vegetables}
   Support:    0.0123 (1.23%)
   Confidence: 0.5845 (58.45%)
   Lift:       3.0210
   Interpretation: If a customer buys {root vegetables, tropical fruit},
                   there is a 58.45% chance they will also buy {other vegetables}

3. Rule: {curd, yogurt} => {whole milk}
   Support:    0.0101 (1.01%)
   Confidence: 0.5824 (58.24%)
   Lift:       2.2791
   Interpretation: If a customer buys {curd, yogurt},
                   there is a 58.24% chance they will also buy {w

## Step D: Recommend Items

Create a recommendation function that suggests items based on association rules.

In [18]:
def recommend_items(transaction, rules_df=rules, top_n=5):
    """
    Recommend items based on association rules for a given transaction.
    
    Parameters:
    -----------
    transaction : list
        List of items purchased by the user (e.g., ["whole milk", "rolls/buns"])
    rules_df : pandas.DataFrame
        DataFrame of association rules with antecedents, consequents,
        support, confidence, and lift columns
    top_n : int
        Number of top recommendations to return
    
    Returns:
    --------
    list of tuples : [(recommended_item, confidence, lift, support), ...]
        Recommended items with the confidence, lift, and support of the
        highest-confidence rule that produced each one
    """
    # Convert transaction to a set for easier matching
    transaction_set = set([item.lower().strip() for item in transaction])
    
    # Store recommendations with their scores
    recommendations = {}
    
    # Iterate through all rules
    for idx, rule in rules_df.iterrows():
        # Get antecedents (items in the rule's "if" part)
        antecedents_set = set([item.lower().strip() for item in rule['antecedents']])
        
        # Check if all antecedents are in the transaction
        if antecedents_set.issubset(transaction_set):
            # Get consequents (recommended items)
            consequents_list = list(rule['consequents'])
            
            # Add each consequent as a recommendation
            for item in consequents_list:
                item_lower = item.lower().strip()
                
                # Don't recommend items already in the transaction
                if item_lower not in transaction_set:
                    # If item already recommended, keep the one with higher confidence
                    if item_lower not in recommendations or rule['confidence'] > recommendations[item_lower][1]:
                        recommendations[item_lower] = (item, rule['confidence'], rule['lift'], rule['support'])
    
    # Sort recommendations by confidence (descending) and then by lift
    sorted_recommendations = sorted(recommendations.values(), 
                                   key=lambda x: (x[1], x[2]), 
                                   reverse=True)
    
    # Return top N recommendations
    return sorted_recommendations[:top_n]



print("\nThe function:")
print("  • Takes a list of purchased items as input")
print("  • Finds matching association rules")
print("  • Returns top recommendations with confidence, lift, and support scores")


The function:
  • Takes a list of purchased items as input
  • Finds matching association rules
  • Returns top recommendations with confidence, lift, and support scores


### Testing the Recommendation Function

Let's test the function with various example transactions.

In [15]:
# Test Case 1: Customer buys whole milk and bread (rolls/buns)
print("="*100)
print("TEST CASE 1: Customer purchases ['whole milk', 'rolls/buns']")
print("="*100)

test_transaction_1 = ["whole milk", "rolls/buns"]
recommendations_1 = recommend_items(test_transaction_1, top_n=5)

if recommendations_1:
    print(f"\n✓ Found {len(recommendations_1)} recommendations:\n")
    for i, (item, confidence, lift, support) in enumerate(recommendations_1, 1):
        print(f"{i}. Recommend: {item}")
        print(f"   Confidence: {confidence*100:.2f}% | Lift: {lift:.2f} | Support: {support*100:.2f}%")
        print()
else:
    print("\n✗ No recommendations found for this transaction.")

# Test Case 2: Customer buys yogurt
print("\n" + "="*100)
print("TEST CASE 2: Customer purchases ['yogurt']")
print("="*100)

test_transaction_2 = ["yogurt"]
recommendations_2 = recommend_items(test_transaction_2, top_n=5)

if recommendations_2:
    print(f"\n✓ Found {len(recommendations_2)} recommendations:\n")
    for i, (item, confidence, lift, support) in enumerate(recommendations_2, 1):
        print(f"{i}. Recommend: {item}")
        print(f"   Confidence: {confidence*100:.2f}% | Lift: {lift:.2f} | Support: {support*100:.2f}%")
        print()
else:
    print("\n✗ No recommendations found for this transaction.")

# Test Case 3: Customer buys root vegetables and tropical fruit
print("\n" + "="*100)
print("TEST CASE 3: Customer purchases ['root vegetables', 'tropical fruit']")
print("="*100)

test_transaction_3 = ["root vegetables", "tropical fruit"]
recommendations_3 = recommend_items(test_transaction_3, top_n=5)

if recommendations_3:
    print(f"\n✓ Found {len(recommendations_3)} recommendations:\n")
    for i, (item, confidence, lift, support) in enumerate(recommendations_3, 1):
        print(f"{i}. Recommend: {item}")
        print(f"   Confidence: {confidence*100:.2f}% | Lift: {lift:.2f} | Support: {support*100:.2f}%")
        print()
else:
    print("\n✗ No recommendations found for this transaction.")

TEST CASE 1: Customer purchases ['whole milk', 'rolls/buns']

✓ Found 4 recommendations:

1. Recommend: other vegetables
   Confidence: 31.60% | Lift: 1.63 | Support: 1.79%

2. Recommend: yogurt
   Confidence: 27.47% | Lift: 1.97 | Support: 1.56%

3. Recommend: root vegetables
   Confidence: 22.44% | Lift: 2.06 | Support: 1.27%

4. Recommend: soda
   Confidence: 20.84% | Lift: 1.20 | Support: 3.83%


TEST CASE 2: Customer purchases ['yogurt']

✓ Found 4 recommendations:

1. Recommend: whole milk
   Confidence: 40.16% | Lift: 1.57 | Support: 5.60%

2. Recommend: other vegetables
   Confidence: 31.12% | Lift: 1.61 | Support: 4.34%

3. Recommend: rolls/buns
   Confidence: 24.64% | Lift: 1.34 | Support: 3.44%

4. Recommend: tropical fruit
   Confidence: 20.99% | Lift: 2.00 | Support: 2.93%


TEST CASE 3: Customer purchases ['root vegetables', 'tropical fruit']

✓ Found 4 recommendations:

1. Recommend: other vegetables
   Confidence: 58.45% | Lift: 3.02 | Support: 1.23%

2. Recommend: whol

## Step E: Analysis and Evaluation



### 1. How Association Rule Mining Helped Uncover Patterns

Association rule mining successfully revealed meaningful shopping patterns in the groceries dataset through several key insights:

#### **A. Discovered Strong Product Associations**

The analysis identified **234 association rules** with confidence ≥ 20%, revealing which products are frequently purchased together. Key findings include:

- **Dairy-Produce Connection**: Strong associations between dairy products (whole milk, yogurt, curd) and produce (vegetables and fruit)
  - `{root vegetables, tropical fruit} → {whole milk}` (57% confidence, 2.23 lift)
  - `{other vegetables, yogurt} → {whole milk}` (51% confidence, 2.01 lift)

- **Produce Clustering**: Vegetables and fruits tend to be bought together
  - `{root vegetables, citrus fruit} → {other vegetables}` (59% confidence, 3.03 lift)
  - A lift of about 3 means this combination occurs roughly three times as often as it would if the items were purchased independently

- **Staple Product Patterns**: Basic items like milk, bread (rolls/buns), and vegetables dominate frequent itemsets
  - Whole milk appears in 25.55% of all transactions
  - Most frequent 2-item set: `{other vegetables, whole milk}` (7.48% support)

#### **B. Quantified Pattern Strength**

The metrics provided actionable insights:

- **Support**: Identified how common each pattern is across all transactions
- **Confidence**: Measured the reliability of recommendations (up to 58.62% for top rules)
- **Lift**: Revealed true associations vs. random co-occurrence (values up to 3.30 indicate strong relationships)
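
The three metrics can be computed directly from basket counts. The sketch below assumes the standard definitions (support of the combined itemset, confidence = support(A ∪ C) / support(A), lift = confidence / support(C)); `rule_metrics` and the toy baskets are hypothetical and only meant to show the arithmetic.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    a, c = set(antecedent), set(consequent)

    count_a = sum(a <= b for b in baskets)           # baskets containing the antecedent
    count_c = sum(c <= b for b in baskets)           # baskets containing the consequent
    count_both = sum((a | c) <= b for b in baskets)  # baskets containing both
    if count_a == 0 or count_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")

    support = count_both / n
    confidence = count_both / count_a
    lift = confidence / (count_c / n)
    return support, confidence, lift


# Toy baskets (hypothetical, not taken from groceries.csv)
toy = [
    ["whole milk", "yogurt"],
    ["whole milk", "yogurt", "curd"],
    ["whole milk", "rolls/buns"],
    ["yogurt", "curd"],
]
support, confidence, lift = rule_metrics(toy, ["curd"], ["yogurt"])
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")

# Consistency check against the figures reported in this notebook:
# confidence of the top rule divided by the support of {other vegetables}
print(round(0.5862 / 0.1935, 2))
```

The last line reproduces the reported lift of about 3.03 for `{root vegetables, citrus fruit} → {other vegetables}`, up to rounding of the printed inputs.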

#### **C. Enabled Data-Driven Recommendations**

The patterns translated directly into practical recommendations:
- If customer buys `{curd, yogurt}`, recommend `whole milk` with 58.24% confidence
- If customer buys `{butter, other vegetables}`, recommend `whole milk` with 57.36% confidence
- These suggestions are grounded in observed purchasing behavior across 9,835 transactions rather than in guesswork

### 2. Limitations of Association Rule Mining for Recommendations

While powerful, this approach has several important limitations:

#### **A. Cold Start Problem**

- **New Users**: A customer with an empty basket, or with items that match no rule antecedent, receives no recommendations
- **New Products**: Recently added products won't appear in rules until enough transactions accumulate
- **Solution Needed**: Combine with content-based filtering or demographic data
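
One simple stopgap, which is a different technique from the content-based or demographic approaches named above, is to fall back to globally popular items when the rules return too few suggestions. The sketch below is an assumption-laden illustration: `recommend_with_fallback` is hypothetical, `rule_based` stands in for any function that maps a basket to a list of item names, and the popularity list is taken from the top frequent itemsets reported in Step B.

```python
def recommend_with_fallback(basket, rule_based, popular_items, top_n=3):
    """Use rule-based picks first, then pad with popular items not already chosen."""
    picks = list(rule_based(basket))[:top_n]
    if len(picks) < top_n:
        seen = set(basket) | set(picks)
        for item in popular_items:
            if item not in seen:
                picks.append(item)
                seen.add(item)
            if len(picks) == top_n:
                break
    return picks


# Most frequent single items from Step B, in descending order of support
popular = ["whole milk", "other vegetables", "rolls/buns", "soda", "yogurt"]

# Stand-in rule engines for the demonstration (hypothetical)
no_rules = lambda basket: []
one_rule = lambda basket: ["yogurt"]

print(recommend_with_fallback([], no_rules, popular))
print(recommend_with_fallback(["whole milk"], one_rule, popular))
```

A popularity fallback does not personalize anything; it only ensures an empty basket still receives some suggestion.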


#### **B. Frequent Itemset Bias**

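- **Popular Item Dominance**: Whole milk appears in 25.55% of transactions, so rules with whole milk as the consequent reach high confidence even when their lift is modest, and it gets recommended often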
- **Rare Gems Missed**: Niche products with lower support are overlooked, even if they're perfect for specific users
- **Threshold Challenge**: Setting min_support too high misses rare patterns; too low generates noise
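
One way to see the popularity effect is to re-rank an existing recommendation list by lift instead of confidence. The sketch below reuses the four (item, confidence, lift, support) values printed for Test Case 2; sorting by lift is shown as a possible alternative, not as what `recommend_items` does.

```python
# (item, confidence, lift, support) values copied from the Test Case 2 output
yogurt_recs = [
    ("whole milk", 0.4016, 1.57, 0.0560),
    ("other vegetables", 0.3112, 1.61, 0.0434),
    ("rolls/buns", 0.2464, 1.34, 0.0344),
    ("tropical fruit", 0.2099, 2.00, 0.0293),
]

# Rank by lift (index 2) so that base-rate popularity counts for less
by_lift = sorted(yogurt_recs, key=lambda rec: rec[2], reverse=True)

for item, confidence, lift, support in by_lift:
    print(f"{item}: lift={lift:.2f}, confidence={confidence * 100:.2f}%")
```

Under this ordering tropical fruit moves from last to first and whole milk drops to third, because whole milk's high confidence mostly reflects how often it is bought in general.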

#### **C. Static Rules**

- **No Adaptability**: Rules don't update in real-time as trends change
- **Seasonal Blindness**: Can't detect that ice cream sells more in summer
- **Requires Retraining**: Must periodically regenerate rules from new transaction data

#### **D. Scalability Issues**

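- **Computational Cost**: With 169 products, the 1% support threshold already yields 333 frequent itemsets; the number of candidate itemsets grows combinatorially with catalog size, so a catalog of 10,000+ products would be far more expensive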
- **Memory Requirements**: Large datasets with many products can overwhelm the Apriori algorithm
- **Solution**: Use more efficient algorithms (FP-Growth, ECLAT) for larger catalogs
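
The growth in candidate itemsets can be made concrete with a short calculation. The sketch below counts all possible itemsets of each size, which is an upper bound: Apriori's pruning means far fewer are actually evaluated. `candidate_counts` is a hypothetical helper.

```python
from math import comb


def candidate_counts(n_items, max_len=3):
    """Number of possible itemsets of each size from 1 to max_len (an upper bound)."""
    return {k: comb(n_items, k) for k in range(1, max_len + 1)}


# Catalog sizes: the 169 products in this dataset versus a hypothetical 10,000
for n_items in (169, 10_000):
    counts = candidate_counts(n_items)
    print(n_items, counts)
```

With 169 products there are 14,196 possible pairs; with 10,000 products there are 49,995,000, which is why pruning and more efficient algorithms matter for larger catalogs.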