# Apriori Algorithm Implementation Assignment

### Objective:
You will implement the **Apriori algorithm** from scratch (i.e., without using any libraries like `mlxtend`) to find frequent itemsets and generate association rules.

### Dataset:
Use the [Online Retail Dataset](https://www.kaggle.com/datasets/vijayuv/onlineretail) from Kaggle. You can filter it for a specific country (e.g., `United Kingdom`) and time range to reduce size if needed.

---

In [2]:
import pandas as pd

## Step 1: Data Preprocessing

- Load the dataset
- Remove rows with missing values
- Filter out rows where `Quantity <= 0`
- Convert the data into basket format (one row per invoice, one column per item)

👉 **Implement code below**

In [3]:
# Load the dataset
df = pd.read_csv(r"C:\Ddrive data\Sem 5 Labs\Data Mining\ProjectTasks\Datasets\OnlineRetail.csv", encoding='ISO-8859-1')
# Preprocess as per the instructions above (same steps as in Task 2)

# Remove missing values
df.dropna(subset=["InvoiceNo", "Description"], inplace=True)

# Keep only rows with Quantity > 0 (drops returns and cancellations)
df = df[df["Quantity"] > 0]

# Filter for a specific country (United Kingdom)
df = df[df["Country"] == "United Kingdom"]

# Build the basket: one row per invoice, one column per item, 1 if the item was bought
basket = df.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().fillna(0)
# Boolean comparison replaces applymap, which newer pandas versions deprecate
basket = (basket > 0).astype(int)
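The functions in Step 2 iterate over transactions as collections of items, whereas `basket` is a one-hot table. The sketch below shows one possible way to get from invoice rows to a list of item sets in plain Python. The toy rows and the helper name `rows_to_transactions` are hypothetical and are not part of the assignment.

```python
from collections import defaultdict

# Hypothetical toy rows: (InvoiceNo, Description, Quantity)
rows = [
    ("536365", "MUG", 6),
    ("536365", "LANTERN", 2),
    ("536366", "MUG", 1),
    ("536366", "MUG", 3),        # same item repeated on one invoice
    ("C536367", "LANTERN", -1),  # return: removed by the Quantity > 0 filter
]

def rows_to_transactions(rows):
    """Group item descriptions by invoice, keeping only positive quantities."""
    baskets = defaultdict(set)
    for invoice, description, quantity in rows:
        if quantity > 0:
            baskets[invoice].add(description)  # a set ignores repeated items
    return list(baskets.values())

transactions = rows_to_transactions(rows)
print(transactions)
```

With the real DataFrame, `df.groupby("InvoiceNo")["Description"].apply(set).tolist()` should produce the same kind of list.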



## Step 2: Implement Apriori Algorithm
Step-by-step procedure:

1. **Generate frequent 1-itemsets.** Count the support of each individual item in the dataset and keep only those with support ≥ `min_support`. The result is L1, the set of frequent 1-itemsets.
2. **Iterate candidate generation (k = 2 to n).** While L(k-1) is not empty:
   - **a. Candidate generation.** Generate candidate itemsets Ck of size k from L(k-1) using the Apriori property: a k-itemset can be frequent only if all of its (k−1)-subsets are frequent.
   - **b. Prune candidates.** Eliminate candidates that have any (k−1)-subset not in L(k-1).
   - **c. Count support.** For each candidate in Ck, count the number of transactions that contain it.
   - **d. Generate frequent itemsets.** Form Lk by keeping the candidates from Ck that meet `min_support`.

   Repeat until Lk becomes empty.

Implement the following functions:
1. `get_frequent_itemsets(transactions, min_support)` - Returns frequent itemsets and their support
2. `generate_candidates(prev_frequent_itemsets, k)` - Generates candidate itemsets of length `k`
3. `calculate_support(transactions, candidates)` - Calculates the support count for each candidate

**Write reusable functions** for each part of the algorithm.
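To make the join and prune steps concrete, here is a minimal sketch on a hand-made L2. The itemsets and the helper name `join_and_prune` are hypothetical, chosen only for illustration. Two of the three size-3 unions get pruned because one of their 2-subsets is missing from L2.

```python
from itertools import combinations

# Hypothetical frequent 2-itemsets (L2) over the items A, B, C, D
L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}

def join_and_prune(prev_frequent, k):
    """Return size-k candidates whose (k-1)-subsets are all in prev_frequent."""
    candidates = set()
    for a, b in combinations(prev_frequent, 2):
        union = a | b                      # join step
        if len(union) != k:
            continue
        # prune step: every (k-1)-subset must already be frequent
        if all(frozenset(s) in prev_frequent for s in combinations(union, k - 1)):
            candidates.add(union)
    return candidates

C3 = join_and_prune(L2, 3)
print(C3)  # only {A, B, C} survives; {A, B, D} lacks AD and {B, C, D} lacks CD
```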

In [None]:
# Step 1: Generate Frequent 1-itemsets
def get_frequent_itemsets(transactions, min_support):
    n = len(transactions)
    
    # Count support for each single item
    item_counts = {}
    for transaction in transactions:
        for item in transaction:
            item_counts[frozenset([item])] = item_counts.get(frozenset([item]), 0) + 1

    # Keep only items with support >= min_support
    frequent_itemsets = {item: count / n for item, count in item_counts.items()
                         if count / n >= min_support}
    return frequent_itemsets


# Step 2: Generate candidates manually
def generate_candidates(prev_frequent_itemsets, k):
    candidates = set()
    prev_items = list(prev_frequent_itemsets.keys())
    
    for i in range(len(prev_items)):
        for j in range(i + 1, len(prev_items)):
            # Join step: take union of two sets
            union_set = prev_items[i] | prev_items[j]
            
            # Only consider if size = k
            if len(union_set) == k:
                # Prune step: check all (k-1) subsets are frequent
                all_subsets_frequent = True
                for subset in union_set:
                    subset_set = union_set - frozenset([subset])
                    if subset_set not in prev_frequent_itemsets:
                        all_subsets_frequent = False
                        break
                if all_subsets_frequent:
                    candidates.add(union_set)
    return candidates


# Step 3: Calculate support of candidates
def calculate_support(transactions, candidates, min_support):
    n = len(transactions)
    support_data = {}
    
    for candidate in candidates:
        count = 0
        for transaction in transactions:
            if candidate.issubset(transaction):
                count += 1
        support = count / n
        if support >= min_support:
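            # Store the exact support: rounding here can turn small supports into 0.0
            # and distort (or zero-divide) the confidence computed later.
            support_data[candidate] = support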
    return support_data


# Step 4: Apriori Algorithm
def apriori(transactions, min_support=0.5):
    result = {}
    
    # Step 1: L1
    L1 = get_frequent_itemsets(transactions, min_support)
    result.update(L1)
    
    k = 2
    Lk = L1
    while Lk:
        # Step 2: Generate candidates from L(k-1)
        candidates = generate_candidates(Lk, k)
        
        # Step 3: Calculate support
        Lk = calculate_support(transactions, candidates, min_support)
        
        # Save results
        result.update(Lk)
        k += 1
    return result
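One way to sanity-check an Apriori implementation is to compare it against exhaustive enumeration on a tiny dataset. The sketch below is such a brute-force reference. The toy transactions and the function name `brute_force_frequent` are hypothetical, and the approach is only practical for a handful of items.

```python
from itertools import combinations

# Hypothetical toy transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def brute_force_frequent(transactions, min_support):
    """Enumerate every possible itemset and keep those meeting min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count / n >= min_support:
                frequent[candidate] = count / n
    return frequent

expected = brute_force_frequent(transactions, min_support=0.5)
for itemset, support in sorted(expected.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Calling `apriori(transactions, min_support=0.5)` on the same list should return the same six itemsets as `expected`. The full three-item set appears in only one of four transactions, so it is excluded.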


## Step 3: Generate Association Rules

- Use frequent itemsets to generate association rules
- For each rule `A => B`, calculate:
  - **Support**
  - **Confidence**
- Only return rules that meet a minimum confidence threshold (e.g., 0.5)

👉 **Implement rule generation function below**

In [None]:
def generate_subsets(itemset):
    """Generate all non-empty proper subsets of an itemset using bit masking."""
    item_list = list(itemset)
    n = len(item_list)
    subsets = []
    
    for mask in range(1, 2**n - 1):  # exclude 0 (empty set) and full set
        subset = frozenset(item_list[i] for i in range(n) if mask & (1 << i))
        subsets.append(subset)
    return subsets


def generate_association_rules(frequent_itemsets, min_confidence=0.5):
    rules = []
    
    for itemset, support in frequent_itemsets.items():
        if len(itemset) >= 2:  # only for 2+ items
            for lhs in generate_subsets(itemset):
                rhs = itemset - lhs
                if lhs in frequent_itemsets:
                    confidence = support / frequent_itemsets[lhs]
                    if confidence >= min_confidence:
                        rules.append({
                            'lhs': set(lhs),
                            'rhs': set(rhs),
                            'support': support,
                            'confidence': round(confidence, 2)
                        })
    return rules
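The output step allows ranking rules by lift, which the rule dictionaries above do not include. As a rough sketch, lift for `A => B` is confidence divided by the support of `B`. The helper `rule_metrics` and the toy transactions below are hypothetical. The helper computes all three metrics directly from the transactions so the arithmetic can be checked by hand.

```python
def rule_metrics(transactions, lhs, rhs):
    """Compute support, confidence, and lift for the rule lhs => rhs."""
    n = len(transactions)
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    count_lhs = sum(1 for t in transactions if lhs <= t)
    count_rhs = sum(1 for t in transactions if rhs <= t)
    count_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = count_both / n
    confidence = count_both / count_lhs if count_lhs else 0.0
    lift = confidence / (count_rhs / n) if count_rhs else 0.0
    return support, confidence, lift

# Hypothetical toy transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

support, confidence, lift = rule_metrics(transactions, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here bread and milk each occur in 3 of 4 transactions and together in 2, so support is 2/4, confidence is 2/3, and lift is (2/3) / (3/4) = 8/9. A lift below 1 suggests the two items co-occur slightly less often than independence would predict.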



## Step 4: Output and Visualize

- Print top 10 frequent itemsets
- Print top 10 association rules (by confidence or lift)

👉 **Output results below**

In [None]:
# Step 4: Output and Visualize

# Build the transaction list: one set of item descriptions per invoice
transactions = df.groupby('InvoiceNo')['Description'].apply(set).tolist()

# Run the full Apriori loop (not only the 1-itemset pass) on the transactions,
# not on the raw DataFrame. If nothing is returned, try a lower min_support.
frequent_itemsets = apriori(transactions, min_support=0.3)

# Sort (itemset, support) pairs by support (descending) and take top 10
top_itemsets = sorted(frequent_itemsets.items(), key=lambda x: x[1], reverse=True)[:10]
print("🔹 Top 10 Frequent Itemsets:")
for itemset, support in top_itemsets:
    print(f"{set(itemset)} -> support: {support:.2f}")

print("\n")

# Generate association rules
rules = generate_association_rules(frequent_itemsets, min_confidence=0.5)

# Sort rules by confidence (descending) and take top 10
top_rules = sorted(rules, key=lambda x: x['confidence'], reverse=True)[:10]
print("🔹 Top 10 Association Rules (by confidence):")
for r in top_rules:
    print(f"{r['lhs']} => {r['rhs']} | support: {r['support']:.2f} | confidence: {r['confidence']}")


🔹 Top 10 Frequent Itemsets:


🔹 Top 10 Association Rules (by confidence):
