
# 🛒 Market Basket Analysis (MBA) – Product Recommendations

**Author:** Sri Saranya Chandrapati  
**Objective:** Discover frequent itemsets and association rules from transactional data and visualize insights.

This notebook supports:
- Loading your own dataset **or** using the included `sample_transactions.csv`
- Data wrangling to build baskets (transactions → list of items)
- Apriori frequent itemset mining (pure-Python implementation; only pandas and matplotlib are required, no dedicated mining library)
- Association rules (support, confidence, lift)
- Visualizations (matplotlib only)
- Export of rules to CSV
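
For reference, the three rule metrics listed above can be written as follows, using the conventional definitions that the code in Section 4 also follows. The symbols are notation assumed for this note: $T$ is the set of transactions, and $X$ and $Y$ are disjoint itemsets.

$$
\mathrm{support}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}, \qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}, \qquad
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}
$$

A lift above 1 suggests that $X$ and $Y$ occur together more often than they would if they were independent; a lift below 1 suggests the opposite.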


In [None]:

# =====================
# Setup & Configuration
# =====================
import pandas as pd
from itertools import combinations
import matplotlib.pyplot as plt
from IPython.display import display  # display() is used below; import it explicitly rather than relying on the notebook namespace

# Matplotlib settings (default style and colors; only the figure size is set)
plt.rcParams["figure.figsize"] = (9, 5)

# Paths
DATA_PATH = "/mnt/data/sample_transactions.csv"  # replace with your file path if using your own dataset
EXPORT_RULES_CSV = "/mnt/data/mba_rules.csv"
EXPORT_ITEMSETS_CSV = "/mnt/data/mba_frequent_itemsets.csv"


## 1. Load Transactions

In [None]:

try:
    df_raw = pd.read_csv(DATA_PATH)
except Exception as e:
    raise SystemExit(f"Could not read dataset at {DATA_PATH}: {e}")

print("Preview:")
display(df_raw.head())
print("\nColumns:", list(df_raw.columns))
print("\nRows:", len(df_raw))

# Expecting columns: TransactionID, Item
required_cols = {"TransactionID", "Item"}
missing_cols = required_cols - set(df_raw.columns)
if missing_cols:
    raise ValueError(f"Dataset must contain 'TransactionID' and 'Item' columns; missing: {sorted(missing_cols)}")


## 2. Build Baskets (Transaction → List of Items)

In [None]:

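def to_basket(items):
    # Drop nulls, strip whitespace, and de-duplicate while preserving first-seen order
    cleaned = [str(x).strip() for x in items if pd.notnull(x)]
    return list(dict.fromkeys(cleaned))

baskets_df = df_raw.groupby("TransactionID")["Item"].apply(to_basket).reset_index()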
transactions = baskets_df["Item"].tolist()
print(f"Total transactions: {len(transactions)}")
print("Sample baskets:")
for i in range(min(5, len(transactions))):
    print(f"{i+1}: {transactions[i]}")
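
The basket-building step can be checked on a toy table before running it on real data. The sketch below uses made-up rows (they are not taken from `sample_transactions.csv`) that include a duplicate item, stray whitespace, and a missing value, and applies the same clean-and-deduplicate logic as the cell above. The name `to_basket` is a helper chosen for this illustration.

```python
import pandas as pd

# Hypothetical long-format rows: a duplicate, stray whitespace, and a missing item.
toy = pd.DataFrame({
    "TransactionID": [1, 1, 1, 2, 2, 3],
    "Item": ["Bread", " Milk ", "Bread", "Milk", None, "Butter"],
})

def to_basket(items):
    # Drop nulls, strip whitespace, and de-duplicate while keeping first-seen order.
    cleaned = [str(x).strip() for x in items if pd.notnull(x)]
    return list(dict.fromkeys(cleaned))

toy_baskets = toy.groupby("TransactionID")["Item"].apply(to_basket).tolist()
print(toy_baskets)  # [['Bread', 'Milk'], ['Milk'], ['Butter']]
```

Each transaction ends up as one list with every item appearing at most once, which is the shape the Apriori functions in the next section expect.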


## 3. Apriori – Frequent Itemset Mining (Pure Python)

In [None]:

from collections import defaultdict

def get_support(itemset, transactions):
    count = 0
    for t in transactions:
        if itemset.issubset(set(t)):
            count += 1
    return count / len(transactions) if transactions else 0.0

def generate_candidates(prev_frequents, k):
    # Join step: combine itemsets of size k-1 to produce candidates of size k
    # No sorting needed: every pair (i, j) is examined, and frozensets have no total order anyway
    prev_list = list(prev_frequents)
    candidates = set()
    for i in range(len(prev_list)):
        for j in range(i+1, len(prev_list)):
            a = prev_list[i]
            b = prev_list[j]
            union = a.union(b)
            if len(union) == k:
                # Prune step: all (k-1)-subsets must be frequent
                all_subsets_frequent = True
                for subset in combinations(union, k-1):
                    if frozenset(subset) not in prev_frequents:
                        all_subsets_frequent = False
                        break
                if all_subsets_frequent:
                    candidates.add(frozenset(union))
    return candidates

def apriori(transactions, min_support=0.2):
    # L1: singletons
    item_counts = defaultdict(int)
    n = len(transactions)
    for t in transactions:
        for item in set(t):
            item_counts[item] += 1

    L1 = set()
    supports = {}
    for item, cnt in item_counts.items():
        sup = cnt / n if n else 0.0
        if sup >= min_support:
            L1.add(frozenset([item]))
            supports[frozenset([item])] = sup

    frequents = [L1]
    k = 2
    while True:
        Ck = generate_candidates(frequents[-1], k)
        Lk = set()
        for c in Ck:
            sup = get_support(c, transactions)
            if sup >= min_support:
                Lk.add(c)
                supports[c] = sup
        if not Lk:
            break
        frequents.append(Lk)
        k += 1

    # Flatten frequent itemsets
    all_frequents = set().union(*frequents) if frequents else set()
    return all_frequents, supports

# Run apriori
MIN_SUPPORT = 0.2  # adjust as needed
frequent_itemsets, supports_map = apriori(transactions, min_support=MIN_SUPPORT)
print(f"Found {len(frequent_itemsets)} frequent itemsets with min_support={MIN_SUPPORT}.")
# Convert to DataFrame for review/export
fi_rows = []
for itemset in sorted(frequent_itemsets, key=lambda x: (len(x), sorted(list(x)))):
    fi_rows.append({
        "itemset": ", ".join(sorted(list(itemset))),
        "k": len(itemset),
        "support": supports_map[itemset]
    })
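# Passing explicit columns and dtypes keeps this working when no itemset meets min_support
df_fi = (
    pd.DataFrame(fi_rows, columns=["itemset", "k", "support"])
    .astype({"k": int, "support": float})
    .sort_values(["k", "support"], ascending=[True, False])
    .reset_index(drop=True)
)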
display(df_fi.head(20))
df_fi.to_csv(EXPORT_ITEMSETS_CSV, index=False)
print(f"Saved frequent itemsets to {EXPORT_ITEMSETS_CSV}")
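
To make the join and prune steps of candidate generation concrete, here is a small standalone sketch on hand-picked frequent pairs. The itemsets are hypothetical and the loop is a condensed restatement of the idea behind `generate_candidates`, not a replacement for it.

```python
from itertools import combinations

# Hypothetical frequent 2-itemsets (L2); note that {"Butter", "Milk"} is absent.
L2 = {
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Butter"}),
    frozenset({"Bread", "Eggs"}),
    frozenset({"Milk", "Eggs"}),
}

k = 3
candidates = set()
for a, b in combinations(L2, 2):
    union = a | b
    if len(union) != k:
        continue  # join step: keep only unions that contain exactly k items
    # Prune step: every (k-1)-subset of the candidate must itself be frequent.
    if all(frozenset(s) in L2 for s in combinations(union, k - 1)):
        candidates.add(union)

print(candidates)
```

Only {Bread, Milk, Eggs} survives. {Bread, Milk, Butter} is pruned because its subset {Milk, Butter} is not frequent, so its support never has to be counted against the transactions.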


## 4. Association Rules (Support, Confidence, Lift)

In [None]:

def association_rules(frequents, supports, min_confidence=0.5, min_lift=1.0):
    rules = []
    for itemset in frequents:
        if len(itemset) < 2:  # rules need at least 2 items
            continue
        items = set(itemset)
        # For each non-empty proper subset as antecedent
        for r in range(1, len(itemset)):
            for antecedent in combinations(items, r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                if not consequent:
                    continue
                sup_X = supports.get(antecedent, 0.0)
                sup_XY = supports.get(itemset, 0.0)
                sup_Y = supports.get(consequent, 0.0)
                if sup_X <= 0 or sup_Y <= 0:
                    continue
                conf = sup_XY / sup_X
                lift = conf / sup_Y
                if conf >= min_confidence and lift >= min_lift:
                    rules.append({
                        "antecedent": ", ".join(sorted(list(antecedent))),
                        "consequent": ", ".join(sorted(list(consequent))),
                        "support": sup_XY,
                        "confidence": conf,
                        "lift": lift
                    })
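    # Explicit columns keep sort_values from raising KeyError when no rule passes the thresholds
    columns = ["antecedent", "consequent", "support", "confidence", "lift"]
    rules_df = (
        pd.DataFrame(rules, columns=columns)
        .sort_values(["lift", "confidence", "support"], ascending=False)
        .reset_index(drop=True)
    )
    return rules_df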

MIN_CONFIDENCE = 0.5
MIN_LIFT = 1.0
rules_df = association_rules(frequent_itemsets, supports_map, min_confidence=MIN_CONFIDENCE, min_lift=MIN_LIFT)
print(f"Generated {len(rules_df)} rules (min_confidence={MIN_CONFIDENCE}, min_lift={MIN_LIFT}).")
display(rules_df.head(20))
rules_df.to_csv(EXPORT_RULES_CSV, index=False)
print(f"Saved rules to {EXPORT_RULES_CSV}")
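
As a sanity check on the arithmetic, the three metrics can be computed by hand for a single rule. The baskets below are made up for illustration, and `support` is a throwaway helper rather than the notebook's `get_support`.

```python
# Hypothetical baskets, used only to illustrate the arithmetic.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk", "Butter"},
    {"Milk"},
    {"Bread", "Milk"},
]

def support(itemset):
    # Fraction of baskets that contain every item in `itemset`.
    return sum(itemset <= b for b in baskets) / len(baskets)

X, Y = {"Bread"}, {"Milk"}
sup_xy = support(X | Y)            # 3 of 5 baskets contain both items
confidence = sup_xy / support(X)   # 0.6 / 0.8, about 0.75
lift = confidence / support(Y)     # 0.75 / 0.8, about 0.94
print(sup_xy, confidence, lift)
```

The rule Bread ⇒ Milk has fairly high confidence here, but its lift is below 1 because Milk is already in 80% of the baskets. With `MIN_LIFT = 1.0` such a rule would be filtered out, which is why the notebook checks lift in addition to confidence.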


## 5. Visualizations

In [None]:

# 5.1 Top-N frequent itemsets by support (itemset size k >= 2)
df_pairs_or_more = df_fi[df_fi["k"] >= 2].copy()
top_n = min(10, len(df_pairs_or_more))
top_itemsets = df_pairs_or_more.nlargest(top_n, "support")

if not top_itemsets.empty:
    plt.figure()
    plt.barh(range(len(top_itemsets)), top_itemsets["support"])
    plt.yticks(range(len(top_itemsets)), top_itemsets["itemset"])
    plt.title("Top Frequent Itemsets (k >= 2) by Support")
    plt.xlabel("Support")
    plt.ylabel("Itemset")
    plt.gca().invert_yaxis()
    plt.show()
else:
    print("No itemsets of size >= 2 at current thresholds.")


In [None]:

# 5.2 Scatter plot: Support vs Confidence (all rules that passed the thresholds)
if not rules_df.empty:
    plt.figure()
    plt.scatter(rules_df["support"], rules_df["confidence"])
    plt.title("Rules: Support vs Confidence")
    plt.xlabel("Support")
    plt.ylabel("Confidence")
    plt.show()
else:
    print("No rules to visualize at current thresholds.")


In [None]:

# 5.3 Scatter plot: Confidence vs Lift (all rules that passed the thresholds)
if not rules_df.empty:
    plt.figure()
    plt.scatter(rules_df["confidence"], rules_df["lift"])
    plt.title("Rules: Confidence vs Lift")
    plt.xlabel("Confidence")
    plt.ylabel("Lift")
    plt.show()
else:
    print("No rules to visualize at current thresholds.")



## 6. Use This Notebook With Your Own Data

Your CSV needs **two columns**, in any order and under any names (rename them after loading if they differ from `TransactionID` and `Item`):
- A transaction identifier (e.g., `TransactionID`, `InvoiceNo`, etc.)
- An item / product name (e.g., `Item`, `Description`, etc.)

**Steps:**
1. Upload your CSV to this environment or place it in your project folder.
2. Change `DATA_PATH` at the top to your file path, for example:
   ```python
   DATA_PATH = "/mnt/data/OnlineRetail.csv"
   ```
3. If your column names differ, rename them before building baskets:
   ```python
   df_raw = pd.read_csv(DATA_PATH)
   df_raw = df_raw.rename(columns={"InvoiceNo":"TransactionID", "Description":"Item"})
   ```
4. Tune thresholds:
   ```python
   MIN_SUPPORT = 0.02
   MIN_CONFIDENCE = 0.6
   MIN_LIFT = 1.2
   ```
5. Re-run the notebook. Exports:
   - Frequent Itemsets → `/mnt/data/mba_frequent_itemsets.csv`
   - Rules → `/mnt/data/mba_rules.csv`
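
Steps 2 and 3 can also be wrapped into one small function. The sketch below is one possible way to do that; `load_transactions` is a hypothetical helper that is not part of the notebook, and the CSV rows are made up to stand in for a file with differently named columns.

```python
import io
import pandas as pd

def load_transactions(source, id_col, item_col):
    """Hypothetical helper: read a CSV and rename two columns to the names the notebook expects."""
    df = pd.read_csv(source)
    missing = {id_col, item_col} - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    df = df.rename(columns={id_col: "TransactionID", item_col: "Item"})
    # Keep only the two columns that the basket-building step uses.
    return df[["TransactionID", "Item"]]

# Made-up rows; pass a file path instead of StringIO for a real dataset.
csv_text = "InvoiceNo,Description,Quantity\n536365,MUG,6\n536365,LANTERN,2\n536366,MUG,1\n"
df_raw_example = load_transactions(io.StringIO(csv_text), "InvoiceNo", "Description")
print(list(df_raw_example.columns))  # ['TransactionID', 'Item']
```

The result can be assigned to `df_raw` in place of the plain `pd.read_csv` call, after which the remaining cells run unchanged.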



## 7. Resume Bullet (Copy-Paste)

**Market Basket Analysis** – Implemented Apriori in Python to mine frequent itemsets and generate association rules from retail transaction data. Built matplotlib visualizations and exported rules, uncovering product bundles that inform cross‑sell strategies.
