# Homework 2: Discovery of Frequent Itemsets and Association Rules

**Course:** ID2222 Data Mining (KTH Royal Institute of Technology)  
**Authors:** [Your Names]  
**Date:** [Insert Date]

---

### 🎯 Objective
This notebook implements the **A-Priori algorithm** to find **frequent itemsets** and (optionally) to generate **association rules**.

We will:
1. Load a dataset of transactions (`T10I4D100K.dat`)
2. Implement the A-Priori algorithm in plain Python, using **pandas** to tabulate the results
3. Find all frequent itemsets with support ≥ *s*
4. (Bonus) Generate association rules with confidence ≥ *c*
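
To make the two thresholds concrete, here is a minimal sketch on a hypothetical four-transaction toy dataset (not taken from `T10I4D100K.dat`). The helper names `support_count` and `confidence` are illustrative and are not part of the notebook's own code.

```python
# Toy data for illustration only; these baskets are hypothetical.
toy_transactions = [
    {1, 2, 3},
    {1, 2},
    {2, 3},
    {1, 2, 4},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

print(support_count({1, 2}, toy_transactions))  # 3
print(confidence({1}, {2}, toy_transactions))   # 1.0
print(confidence({2}, {1}, toy_transactions))   # 0.75
```

Note that confidence is not symmetric: every basket with item 1 also holds item 2, but only three of the four baskets with item 2 hold item 1.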

---



In [1]:
import pandas as pd
import itertools
from collections import defaultdict


In [8]:
# Path to the dataset, relative to this notebook (place the file in a data/ folder)
file_path = "data/T10I4D100K.dat"

# Each line = one transaction, items separated by spaces
transactions = []
with open(file_path, "r") as f:
    for line in f:
        items = list(map(int, line.strip().split()))
        transactions.append(items)

# Convert into a pandas DataFrame
df = pd.DataFrame({"TransactionID": range(1, len(transactions) + 1), "Items": transactions})

df.head()

# dimensions of the dataset
num_transactions = df.shape[0]
print(f"Number of transactions: {num_transactions}")

Number of transactions: 100000
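
The file format (one transaction per line, items separated by spaces) can be tried out without the full dataset. The sketch below parses a hypothetical in-memory excerpt and, as an assumption beyond the loading cell above, skips blank lines so that they do not become empty transactions. The names `sample_file`, `sample_transactions`, and `distinct_items` are illustrative.

```python
import io

# Hypothetical excerpt in the same space-separated format (includes a blank line)
sample_file = io.StringIO("25 52 164\n39 120 124 205\n\n1 2\n")

sample_transactions = [
    [int(token) for token in line.split()]
    for line in sample_file
    if line.strip()  # skip blank lines
]

# Basic shape of the data: transaction count, distinct items, mean basket size
distinct_items = {item for t in sample_transactions for item in t}
mean_length = sum(len(t) for t in sample_transactions) / len(sample_transactions)

print(f"Transactions: {len(sample_transactions)}")
print(f"Distinct items: {len(distinct_items)}")
print(f"Mean transaction length: {mean_length:.1f}")
```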


In [13]:
# Minimum support (as an absolute count)
# Start with a threshold that gives interpretable results without overloading memory.
# The dataset has 100,000 transactions, so 1% = 1,000 transactions.
min_support = 1000

# Minimum confidence for association rules (between 0 and 1)
min_confidence = 0.6

print(f"Minimum Support: {min_support}")
print(f"Minimum Confidence: {min_confidence}")


Minimum Support: 1000
Minimum Confidence: 0.6
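
The support threshold in this notebook is an absolute transaction count. A minimal sketch of deriving that count from a relative support, assuming a hypothetical helper `to_absolute_support` that is not part of the notebook:

```python
import math

def to_absolute_support(relative_support, num_transactions):
    """Smallest transaction count that satisfies the relative threshold."""
    # round() guards against floating-point noise in the product before ceil()
    return math.ceil(round(relative_support * num_transactions, 9))

print(to_absolute_support(0.01, 100_000))   # 1000
print(to_absolute_support(0.005, 100_000))  # 500
```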


In [14]:
def get_frequent_itemsets(transactions, min_support):
    """Run the Apriori algorithm to find all frequent itemsets above the given support."""
    # Step 1: Count single items
    item_counts = defaultdict(int)
    for transaction in transactions:
        for item in transaction:
            item_counts[frozenset([item])] += 1

    # Filter by min_support
    frequent_itemsets = {item: count for item, count in item_counts.items() if count >= min_support}
    all_frequent = dict(frequent_itemsets)

    k = 2
    current_Lk = list(frequent_itemsets.keys())

    # Step 2: Iteratively find larger itemsets
    while current_Lk:
        # Generate candidate itemsets of size k by joining pairs of frequent (k-1)-itemsets
        candidate_itemsets = {
            i.union(j) for i in current_Lk for j in current_Lk if len(i.union(j)) == k
        }

        candidate_counts = defaultdict(int)

        # Count occurrences of candidates in all transactions
        for transaction in transactions:
            tset = set(transaction)
            for candidate in candidate_itemsets:
                if candidate.issubset(tset):
                    candidate_counts[candidate] += 1

        # Filter by min_support (the surviving itemsets seed the next iteration)
        frequent_k = {item: count for item, count in candidate_counts.items() if count >= min_support}
        current_Lk = list(frequent_k)

        # Add to results
        all_frequent.update(frequent_k)
        k += 1

    return all_frequent
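
The candidate step in `get_frequent_itemsets` unions every pair of frequent itemsets and counts all results. A common refinement, sketched here on the assumption that the input holds frequent (k-1)-itemsets as `frozenset`s, also discards any candidate with an infrequent (k-1)-subset before the counting pass. The function name `generate_candidates` is hypothetical.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build size-k candidates from frequent (k-1)-itemsets, then prune."""
    prev = set(prev_frequent)
    # Join: union pairs whose result has exactly k items
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune: every (k-1)-subset of a candidate must itself be frequent
    return {
        c for c in joined
        if all(frozenset(s) in prev for s in combinations(c, k - 1))
    }

# Hypothetical frequent pairs; {2, 3} and {3, 4} are absent,
# so {1, 2, 3} and {1, 3, 4} are pruned and only {1, 2, 4} remains.
pairs = [frozenset(p) for p in [(1, 2), (1, 3), (1, 4), (2, 4)]]
candidates = generate_candidates(pairs, 3)
print(candidates)
```

Pruning does not change which itemsets end up frequent; it only reduces the number of subset checks performed against each transaction.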


In [15]:
frequent_itemsets = get_frequent_itemsets(transactions, min_support)

print(f"Number of frequent itemsets found: {len(frequent_itemsets)}")

# Show the first 10 itemsets (dictionary order, not sorted by support)
for itemset, support in list(frequent_itemsets.items())[:10]:
    print(f"Itemset: {set(itemset)}, Support: {support}")


Number of frequent itemsets found: 385
Itemset: {25}, Support: 1395
Itemset: {52}, Support: 1983
Itemset: {240}, Support: 1399
Itemset: {274}, Support: 2628
Itemset: {368}, Support: 7828
Itemset: {448}, Support: 1370
Itemset: {538}, Support: 3982
Itemset: {561}, Support: 2783
Itemset: {630}, Support: 1523
Itemset: {687}, Support: 1762


In [None]:
freq_df = pd.DataFrame(
    [(tuple(itemset), support) for itemset, support in frequent_itemsets.items()],
    columns=["Itemset", "Support"]
)
freq_df.sort_values(by="Support", ascending=False).head(10)


            Itemset  Support
0             (25,)     1395
1             (52,)     1983
2            (240,)     1399
3            (274,)     2628
4            (368,)     7828
..              ...      ...
380      (368, 829)     1194
381      (217, 346)     1336
382      (368, 682)     1193
383      (722, 390)     1042
384  (704, 825, 39)     1035

[385 rows x 2 columns]
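
To see how the frequent itemsets are distributed over sizes, one option is to tally them by length. The sketch below uses a four-entry sample in the same `frozenset -> count` shape that `get_frequent_itemsets` returns, with values copied from the output above. The names `sample` and `size_counts` are illustrative.

```python
from collections import Counter

# A small sample in the shape returned by get_frequent_itemsets
sample = {
    frozenset([25]): 1395,
    frozenset([368]): 7828,
    frozenset([368, 829]): 1194,
    frozenset([39, 704, 825]): 1035,
}

# Tally itemsets by their number of items
size_counts = Counter(len(itemset) for itemset in sample)
for size in sorted(size_counts):
    print(f"{size}-itemsets: {size_counts[size]}")
```

Passing the full `frequent_itemsets` dictionary instead of `sample` would give the breakdown for the whole run.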


In [17]:
def generate_rules(frequent_itemsets, min_confidence, num_transactions):
    """Generate association rules from frequent itemsets."""
    rules = []
    for itemset, support_count in frequent_itemsets.items():
        if len(itemset) < 2:
            continue
        for consequent_size in range(1, len(itemset)):
            for consequent in itertools.combinations(itemset, consequent_size):
                consequent = frozenset(consequent)
                antecedent = itemset - consequent
                if not antecedent:
                    continue
                support_XY = support_count / num_transactions
                support_X = frequent_itemsets.get(antecedent, 0) / num_transactions
                if support_X > 0:
                    confidence = support_XY / support_X
                    if confidence >= min_confidence:
                        rules.append({
                            "Antecedent": tuple(antecedent),
                            "Consequent": tuple(consequent),
                            "Support": round(support_XY, 4),
                            "Confidence": round(confidence, 4)
                        })
    return pd.DataFrame(rules)

rules_df = generate_rules(frequent_itemsets, min_confidence, len(transactions))
rules_df.head(10)


   Antecedent  Consequent  Support  Confidence
0      (704,)      (825,)    0.011      0.6143
1      (704,)       (39,)   0.0111      0.6171
2   (825, 39)      (704,)   0.0103      0.8719
3   (704, 39)      (825,)   0.0103       0.935
4  (704, 825)       (39,)   0.0103      0.9392


In [20]:
print("Top 10 Association Rules by Confidence:")
rules_df.sort_values(by="Confidence", ascending=False).head(10)


Top 10 Association Rules by Confidence:


   Antecedent  Consequent  Support  Confidence
4  (704, 825)       (39,)   0.0103      0.9392
3   (704, 39)      (825,)   0.0103       0.935
2   (825, 39)      (704,)   0.0103      0.8719
1      (704,)       (39,)   0.0111      0.6171
0      (704,)      (825,)    0.011      0.6143


### ✅ Conclusions

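- The **A-Priori algorithm** finds frequent itemsets level by level: frequent *k*-itemsets are combined into candidate (*k*+1)-itemsets, which are then counted in a pass over the transactions.
- Using the **monotonicity property of support** (every subset of a frequent itemset is itself frequent), it avoids counting candidates that cannot be frequent.
- With a support threshold of 1% of transactions (1,000 of 100,000), the run above yields 385 frequent itemsets, so the number of candidates remains manageable.
- The **association rules** provide interpretable patterns, e.g., *“if {X, Y}, then Z”*, useful in market basket analysis. At confidence ≥ 0.6, the run above produces five rules, all among items 39, 704, and 825.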

---

### ⚙️ How to Run
1. Place `T10I4D100K.dat` in a `data/` folder next to this notebook, matching the path `data/T10I4D100K.dat` used in the loading cell.  
2. Open the notebook in VS Code or JupyterLab.  
3. Run all cells sequentially (Shift + Enter).  


