# CS 584 :: Data Mining :: George Mason University :: Fall 2025


# Homework 4: Association Rule Mining

- **100 points [6% of your final grade]**
- **Due Sunday, December 7 by 11:59pm**

- *Goals of this homework:* implement the association rule mining process with the Apriori algorithm.

- *Submission instructions:* submit your notebook to Canvas. Name your submission **FirstName_Lastname_hw4.ipynb**; for example, my submission would be **Ziwei_Zhu_hw4.ipynb**. Your notebook should be fully executed so that we can see all outputs.

In this assignment, you will examine movies using association rules. You need to implement the Apriori algorithm and apply it to a movie rating dataset to find association rules in users' movie-rating behavior. First, run the next cell to load the dataset we are going to use.

In [12]:
import numpy as np
user_movie_data = np.loadtxt("/content/drive/MyDrive/HW4_data/movie_rated.txt", delimiter=',')
print('array of user-movie matrix: shape ' + str(np.shape(user_movie_data)))

array of user-movie matrix: shape (11743, 2)


In [13]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In this dataset, there are two columns: the first column holds the integer IDs of users, and the second column holds the integer IDs of movies. Each row denotes that the user with the given user ID watched the movie with the given movie ID. We are going to treat each user as a transaction, so you will need to collect all the movies watched by a single user into one transaction.
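
As a minimal sketch of the grouping step described above, the snippet below turns (user ID, movie ID) rows into one set of movies per user. The rows in `toy_rows` and the helper name `rows_to_transactions` are hypothetical and used only for illustration; they are not part of the dataset or the required solution.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id) rows in the same two-column layout as the dataset.
# The last row repeats an earlier one to show that duplicates collapse in a set.
toy_rows = [(1, 10), (1, 20), (2, 10), (3, 20), (3, 30), (1, 10)]

def rows_to_transactions(rows):
    """Group movie IDs by user ID; each user becomes one transaction (a set)."""
    by_user = defaultdict(set)
    for user_id, movie_id in rows:
        by_user[int(user_id)].add(int(movie_id))
    return dict(by_user)

toy_transactions = rows_to_transactions(toy_rows)
print(toy_transactions)
```

Using a set per user means a movie listed twice for the same user is counted once, which matches the usual definition of a transaction as a set of items.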

Now, you need to implement the Apriori algorithm and apply it to this dataset to find association rules of user movie-watching behaviors with a **minimum support of 0.2** and a **minimum confidence of 0.8**. We know there are many existing implementations of Apriori online (check GitHub for some good starting points). You are welcome to read existing codebases and let them inform your approach.
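
To make the two thresholds concrete, here is a small sketch of how support and confidence are usually computed. The transactions in `toy_transactions` and the helper names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy transactions (sets of item IDs), not taken from the dataset.
toy_transactions = [
    frozenset({1, 2, 3}),
    frozenset({1, 2}),
    frozenset({1, 3}),
    frozenset({2, 3}),
    frozenset({1, 2, 3}),
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    joint = frozenset(antecedent) | frozenset(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

# {1, 2} appears in 3 of the 5 transactions.
sup_12 = support({1, 2}, toy_transactions)
# {1, 2, 3} appears in 2 of the 5 transactions, so the rule {1, 2} -> {3}
# has confidence (2/5) / (3/5) = 2/3.
conf_12_3 = confidence({1, 2}, {3}, toy_transactions)
print(sup_12, conf_12_3)
```

With thresholds of 0.2 and 0.8, the itemset {1, 2} in this toy example would count as frequent, but the rule {1, 2} -> {3} would be rejected because its confidence is below 0.8.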

**Note: Do not copy-paste any existing code.**

**Note: We want your code to have sufficient comments to explain your steps, to show us that you really know what you are doing.**

**Note: You should add print statements to print out the intermediate steps of your method -- e.g., the size of the candidate set at each step of the method, the size of the filtered set, and any other important information you think will explain the process of the method.**

**Hint: If you implement your algorithm correctly, you should be able to see the intermediate result as:**
- Candidates of length 1 count: 408
- After Pruning 1 count: 21
- Candidates of length 2 count: 210
- After Pruning 2 count: 36
- Candidates of length 3 count: 55
- After Pruning 3 count: 12
- Candidates of length 4 count: 1
- After Pruning 4 count: 0

**Hint: "Candidates of length 1/2/3/4 count" can be different, depending on what methods you use to generate candidate sets. But, your "After Pruning count" should be the same as what is shown above.**
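
One way to see why the candidate counts can differ is to compare two common join strategies on a toy input. The sketch below is illustrative only: the itemsets in `frequent_2` and the function names are hypothetical, and neither join is claimed to be the one used to produce the hint above.

```python
from itertools import combinations

# Hypothetical frequent 2-itemsets, chosen only for illustration.
frequent_2 = [frozenset(s) for s in ({1, 2}, {1, 3}, {2, 3}, {2, 4}, {3, 5})]

def join_by_union(prev, k):
    """Union every pair of (k-1)-itemsets; keep unions of size exactly k."""
    return {a | b for a, b in combinations(prev, 2) if len(a | b) == k}

def join_by_prefix(prev, k):
    """Join two sorted (k-1)-itemsets only when their first k-2 items match."""
    tuples = sorted(tuple(sorted(s)) for s in prev)
    return {frozenset(a) | frozenset(b)
            for a, b in combinations(tuples, 2) if a[:k - 2] == b[:k - 2]}

def apriori_prune(candidates, prev, k):
    """Drop candidates that have any infrequent (k-1)-subset."""
    prev = set(prev)
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

by_union = join_by_union(frequent_2, 3)    # 5 candidates
by_prefix = join_by_prefix(frequent_2, 3)  # 2 candidates
print(len(by_union), len(by_prefix))
print(apriori_prune(by_union, frequent_2, 3) == apriori_prune(by_prefix, frequent_2, 3))
```

The union join produces more raw candidates than the prefix join, but every extra candidate contains at least one infrequent subset and so cannot pass the support threshold. This is why the counts after filtering should agree regardless of the join used.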

In [14]:
# Write your code

import numpy as np
from collections import defaultdict
from itertools import combinations

# 1) Build transactions: one user = one transaction
assert user_movie_data.ndim == 2 and user_movie_data.shape[1] == 2, "Expected 2 columns: userId,movieId"

transactions_by_user = defaultdict(set)
for uid, mid in user_movie_data:
    transactions_by_user[int(uid)].add(int(mid))

transactions = [frozenset(mset) for mset in transactions_by_user.values() if mset]
N = len(transactions)
universe_items = sorted({m for t in transactions for m in t})

print(f"Total transactions (users): {N}")
print(f"Unique items (movies): {len(universe_items)}")
print("Sample transactions (up to 3):", list(transactions)[:3])

# 2) Helpers for Apriori
def compute_support_counts(transactions, candidates):
    """Scan once to count how many transactions contain each candidate itemset."""
    counts = defaultdict(int)
    for t in transactions:
        for c in candidates:
            if c.issubset(t):
                counts[c] += 1
    return counts

def generate_candidates(prev_frequents, k):
    """
    Self-join step: generate k-itemset candidates from frequent (k-1)-itemsets.
    We use simple set union and keep only those whose size becomes exactly k.
    """
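    # Sort by the sorted item tuple: comparing frozensets directly uses the
    # subset relation, which does not give a meaningful or stable order.
    prev = sorted(prev_frequents, key=lambda s: tuple(sorted(s)))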
    Ck = set()
    for i in range(len(prev)):
        for j in range(i+1, len(prev)):
            u = prev[i].union(prev[j])
            if len(u) == k:
                Ck.add(frozenset(u))
    return Ck

def prune_with_apriori_property(candidates, prev_frequents, k):
    """
    candidates are size-k itemsets.
    Keep candidate c only if ALL (k-1)-subsets of c are in prev_frequents (size k-1 frequents).
    """
    prev_frequents = set(prev_frequents)
    pruned = set()
    for c in candidates:
        ok = True
        for sub in combinations(c, k-1):   # <-- (k-1)-subsets
            if frozenset(sub) not in prev_frequents:
                ok = False
                break
        if ok:
            pruned.add(c)
    return pruned


def filter_by_support(transactions, candidates, min_support, label=""):
    """
    Keep only candidates whose support >= min_support.
    Prints "After Pruning {label} count: X" per the assignment instructions.
    Returns: (frequents_set, support_map_for_these, raw_counts)
    """
    counts = compute_support_counts(transactions, candidates)
    frequents, sup_map = set(), {}
    N = len(transactions)
    for itemset, cnt in counts.items():
        sup = cnt / N
        if sup >= min_support:
            frequents.add(itemset)
            sup_map[itemset] = sup
    if label:
        print(f"After Pruning {label} count: {len(frequents)}")
    return frequents, sup_map, counts

def apriori(transactions, min_support=0.2, verbose=True):
    """
    Mine all frequent itemsets using Apriori.
    Returns:
      all_frequents: dict size k -> set of frequent k-itemsets
      support_map : dict itemset -> support (fraction)
    """
    # 1-item candidates
    C1 = set(frozenset([i]) for i in sorted({m for t in transactions for m in t}))
    if verbose:
        print(f"Candidates of length 1 count: {len(C1)}")
    L1, sup1, _ = filter_by_support(transactions, C1, min_support, label="1")

    all_frequents = {1: L1}
    support_map = dict(sup1)

    k = 2
    L_prev = L1
    while L_prev:
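        # Self-join the frequent (k-1)-itemsets to generate candidates of size k.
        # The count printed here is taken BEFORE the Apriori subset prune below.
        Ck = generate_candidates(L_prev, k)
        if verbose:
            print(f"Candidates of length {k} count: {len(Ck)}")

        # Apriori prune: drop candidates that have an infrequent (k-1)-subset.
        # For k = 2 every 1-item subset is already frequent, so the prune is skipped.
        if k >= 3:
            Ck = prune_with_apriori_property(Ck, L_prev, k)
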

        Lk, supk, _ = filter_by_support(transactions, Ck, min_support, label=str(k))
        if not Lk:
            break

        all_frequents[k] = Lk
        support_map.update(supk)
        L_prev = Lk
        k += 1

    return all_frequents, support_map

def all_nonempty_proper_subsets(itemset):
    """Yield all non-empty proper subsets of itemset."""
    s = list(itemset)
    for r in range(1, len(s)):
        for comb in combinations(s, r):
            yield frozenset(comb)

def generate_rules(all_frequents, support_map, min_confidence=0.8):
    """
    Generate rules A -> B where confidence >= min_confidence.
    Returns a list of dicts (antecedent, consequent, support, confidence, lift).
    """
    rules = []
    # consider only frequent itemsets of size >= 2
    for k, sets_k in all_frequents.items():
        if k < 2:
            continue
        for I in sets_k:
            s_I = support_map[I]
            for A in all_nonempty_proper_subsets(I):
                B = I - A
                if not B:
                    continue
                s_A = support_map.get(A, 0.0)
                if s_A == 0:
                    continue
                conf = s_I / s_A
                if conf >= min_confidence:
                    s_B = support_map.get(B, 0.0)
                    lift = (s_I / (s_A * s_B)) if s_B > 0 else float("nan")
                    rules.append({
                        "antecedent": A,
                        "consequent": B,
                        "support": s_I,
                        "confidence": conf,
                        "lift": lift
                    })
    # sort for consistent viewing (confidence desc, support desc, then shorter rules first)
    rules.sort(key=lambda r: (-r["confidence"], -r["support"], len(r["antecedent"]) + len(r["consequent"])))
    return rules

#  3) Run Apriori with required thresholds & print intermediate steps
MIN_SUPPORT = 0.2
MIN_CONFIDENCE = 0.8

print("=== Running Apriori ===")
all_frequents, support_map = apriori(transactions, min_support=MIN_SUPPORT, verbose=True)

print("=== Frequent itemsets summary (k, count) ===")
for k in sorted(all_frequents.keys()):
    print(f"k={k}: {len(all_frequents[k])} frequent itemsets")

rules_found = generate_rules(all_frequents, support_map, min_confidence=MIN_CONFIDENCE)
print(f"Total rules stored (not printed here): {len(rules_found)}")






Total transactions (users): 494
Unique items (movies): 408
Sample transactions (up to 3): [frozenset({2160, 2312, 144, 480}), frozenset({480, 2160, 1221, 2890, 1228}), frozenset({1270})]
=== Running Apriori ===
Candidates of length 1 count: 408
After Pruning 1 count: 21
Candidates of length 2 count: 210
After Pruning 2 count: 36
Candidates of length 3 count: 123
After Pruning 3 count: 12
Candidates of length 4 count: 15
After Pruning 4 count: 0
=== Frequent itemsets summary (k, count) ===
k=1: 21 frequent itemsets
k=2: 36 frequent itemsets
k=3: 12 frequent itemsets
Total rules stored (not printed here): 14


Finally, print your final association rules in the following format:

**movie_name_1, movie_name_2, ... --> movie_name_k**

where the movie names can be fetched by joining the movieId with the file 'movies.csv'. For example, one rule that you should find is:

**Jurassic Park (1993), Back to the Future (1985) --> Star Wars: Episode IV - A New Hope (1977)**

**Hint: You may need to use the Pandas library to load and process the movies.csv file, such as using pandas.read_csv() to load the data. https://pandas.pydata.org/pandas-docs/dev/user_guide/10min.html is a good place to learn the basics about Pandas.**
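
As a rough sketch of the lookup step, the snippet below maps movie IDs to titles with pandas and formats one rule. The in-memory `toy_movies` frame stands in for movies.csv; its IDs are placeholders, and the column names `movieId` and `movie_name` are an assumption about the file's header. The helper `format_rule` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv. The IDs 1-3 are placeholders, and the
# column names `movieId` and `movie_name` are assumed, not confirmed.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "movie_name": [
        "Star Wars: Episode IV - A New Hope (1977)",
        "Jurassic Park (1993)",
        "Back to the Future (1985)",
    ],
})

# Build a plain dict {movieId -> movie_name} for fast lookups.
id_to_title = toy_movies.set_index("movieId")["movie_name"].to_dict()

def format_rule(antecedent, consequent, lookup):
    """Render a rule as 'title, title --> title' using the ID-to-title lookup."""
    left = ", ".join(lookup[i] for i in sorted(antecedent))
    right = ", ".join(lookup[i] for i in sorted(consequent))
    return f"{left} --> {right}"

rule_text = format_rule({2, 3}, {1}, id_to_title)
print(rule_text)
```

With the real file, `pandas.read_csv()` would replace the hand-built frame; a dictionary lookup is used here instead of a full join because each rule only needs a handful of titles.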

**Hint: if you implement the algorithm correctly, you will find 14 rules in total:**

In [15]:
# Write your code to print out the rules



import pandas as pd

# 1) Load movies.csv and build a lookup: movieId -> title

movies_df = pd.read_csv('/content/drive/MyDrive/HW4_data/movies.csv')

# read_csv already returns a DataFrame, so no extra conversion is needed.
df = movies_df

# The file is expected to have the columns movieId and movie_name.
assert {'movieId', 'movie_name'}.issubset(df.columns), \
    "movies.csv must have columns: movieId, movie_name"

# Build lookup dict {movieId -> movie_name}
id_to_title = dict(zip(df['movieId'].astype(int), df['movie_name'].astype(str)))

def ids_to_titles(id_iterable):
    ids_sorted = sorted(int(x) for x in id_iterable)
    return [id_to_title.get(i, f"<movieId {i}>") for i in ids_sorted]

def print_rule_human(rule):
    left  = ', '.join(ids_to_titles(rule['antecedent']))
    right = ', '.join(ids_to_titles(rule['consequent']))
    sup = rule.get('support', float('nan'))
    conf = rule.get('confidence', float('nan'))
    lift = rule.get('lift', float('nan'))
    print(f"{left} --> {right}  (support={sup:.3f}, confidence={conf:.3f}, lift={lift:.3f})")

# Ensure rules are present
try:
    _ = rules_found
except NameError as e:
    raise RuntimeError("`rules_found` is not defined. Run the Apriori cell first.") from e

print("=== Final Association Rules ===")
for r in rules_found:
    left = ', '.join(ids_to_titles(r["antecedent"]))
    right = ', '.join(ids_to_titles(r["consequent"]))
    print(f"{left} --> {right}")



if len(rules_found) == 14:
    print("\nMatches the hint: 14 rules found.")
else:
    print(f"\nThe hint says 14 rules, but {len(rules_found)} were found (can differ if data/thresholds differ).")

=== Final Association Rules ===
Star Wars: Episode IV - A New Hope (1977), Godfather: Part II, The (1974) --> Godfather, The (1972)
Jurassic Park (1993), Princess Bride, The (1987) --> Star Wars: Episode IV - A New Hope (1977)
Jurassic Park (1993), Back to the Future (1985) --> Star Wars: Episode IV - A New Hope (1977)
Schindler's List (1993), Back to the Future (1985) --> Star Wars: Episode IV - A New Hope (1977)
Princess Bride, The (1987), Saving Private Ryan (1998) --> Star Wars: Episode IV - A New Hope (1977)
Godfather, The (1972), Godfather: Part II, The (1974) --> Star Wars: Episode IV - A New Hope (1977)
Princess Bride, The (1987), Back to the Future (1985) --> Star Wars: Episode IV - A New Hope (1977)
Godfather: Part II, The (1974) --> Godfather, The (1972)
Back to the Future (1985), Saving Private Ryan (1998) --> Star Wars: Episode IV - A New Hope (1977)
Star Wars: Episode IV - A New Hope (1977), Groundhog Day (1993) --> Back to the Future (1985)
Jurassic Park (1993), Saving P