                    Data Mining Using the Apriori Algorithm for Movie Rentals

In [1]:
import pandas as pd

# filepath: import_csv.py

# Import the CSV file into a DataFrame
file_path = r"C:\Users\George\Downloads\large_customer_movie_rentals - large_customer_movie_rentals.csv"
df = pd.read_csv(file_path)

# Display the first few rows to verify import
print(df.head())

  CustomerID                                           Sequence
0      C0001  [The Matrix],[Thor Iron Man],[Thor Ant-Man],[T...
1      C0002              [Inception Captain America],[Ant-Man]
2      C0003  [Guardians of the Galaxy],[Guardians of the Ga...
3      C0004  [Inception Iron Man],[Ant-Man The Dark Knight]...
4      C0005  [Shang-Chi Doctor Strange Thor The Matrix],[An...


                    1. Check Basic Info and Shape

In [2]:
print("Shape of the DataFrame:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:\n", df.dtypes)

Shape of the DataFrame: (500, 2)

Column Names: ['CustomerID', 'Sequence']

Data Types:
 CustomerID    object
Sequence      object
dtype: object


                      2. Preview the Data

In [3]:
print("\nFirst 5 Rows:\n", df.head())
print("\nLast 5 Rows:\n", df.tail())


First 5 Rows:
   CustomerID                                           Sequence
0      C0001  [The Matrix],[Thor Iron Man],[Thor Ant-Man],[T...
1      C0002              [Inception Captain America],[Ant-Man]
2      C0003  [Guardians of the Galaxy],[Guardians of the Ga...
3      C0004  [Inception Iron Man],[Ant-Man The Dark Knight]...
4      C0005  [Shang-Chi Doctor Strange Thor The Matrix],[An...

Last 5 Rows:
     CustomerID                                           Sequence
495      C0496  [Inception The Matrix The Dark Knight],[Black ...
496      C0497  [Doctor Strange Black Panther],[Guardians of t...
497      C0498  [The Dark Knight Hulk Ant-Man Iron Man],[Thor ...
498      C0499  [Inception Captain America Spider-Man Guardian...
499      C0500  [Hulk The Matrix Black Panther Thor],[Guardian...


              3. Check for Missing Values

In [4]:
print("\nMissing Values:\n", df.isnull().sum())


Missing Values:
 CustomerID    0
Sequence      0
dtype: int64


                  Exploring Movie Rental Data: Manual Analysis Before Apriori

Before applying the Apriori algorithm to the movie rental dataset, it helps to explore the data manually. Doing so lets us form hypotheses about which movies and movie combinations are most popular and which patterns the algorithm is likely to find. A few lines of Python are enough to gain these insights and to choose better parameters for the analysis.

    Exploring the data manually helps us to:

        1. Understand the data structure: see how movies are grouped and rented.
        2. Spot popular items: identify which movies and pairs are rented most often.
        3. Set expectations: form hypotheses about frequent itemsets and rules before running Apriori.
        4. Tune parameters: use the findings to choose appropriate support and confidence thresholds for the algorithm.


Task 1: Counting the Most Frequently Picked Movies


                Count how many times each individual movie appears across all customer baskets.

In [7]:
import re

# Print the first 10 baskets for inspection
for seq in df['Sequence'].head(10):
    baskets = re.findall(r'\[([^\]]+)\]', seq)
    print(baskets)

Top 10 Most Picked Movies:
The Matrix: 43
Shang-Chi: 39
Avengers: 38
Interstellar: 37
The Dark Knight: 36
Inception: 34
Hulk: 34
Spider-Man: 34
Black Panther: 31
Guardians of the Galaxy: 28


`The Matrix` is the most popular title (43 picks), followed closely by `Shang-Chi` (39), `Avengers` (38), and `Interstellar` (37).
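
The cell above prints baskets rather than tallying titles, so the code that produced these counts is not shown. One way to tally individual titles might look like the sketch below. It assumes the longest-title-first matching used later in this notebook; `count_titles`, the shortened title list, and the sample row are hypothetical, and the totals it produces on the full dataset need not match the figures listed above.

```python
import re
from collections import Counter

# Assumption: a shortened title list for illustration only
movie_titles = ["The Matrix", "Thor", "Iron Man", "Ant-Man", "The Dark Knight"]

def count_titles(sequences, titles):
    """Count the number of baskets in which each known title appears."""
    counts = Counter()
    # Try longer titles first and remove each match before testing shorter ones
    ordered = sorted(titles, key=len, reverse=True)
    for seq in sequences:
        for basket in re.findall(r'\[([^\]]+)\]', seq):
            remaining = basket
            for title in ordered:
                if title in remaining:
                    counts[title] += 1
                    remaining = remaining.replace(title, "")
    return counts

# Hypothetical sample row in the same format as the Sequence column
sample = ["[The Matrix],[Thor Iron Man],[Thor Ant-Man],[The Dark Knight]"]
title_counts = count_titles(sample, movie_titles)
for title, count in title_counts.most_common(3):
    print(f"{title}: {count}")
```

On the real data, the call would be `count_titles(df['Sequence'], movie_titles)` with the full title list.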

Task 2: Counting Movie Pairs Manually Before Running Apriori

In [11]:
# Basket structure
import re

# Print the first 10 baskets for inspection
for seq in df['Sequence'].head(10):
    baskets = re.findall(r'\[([^\]]+)\]', seq)
    print(baskets)

['The Matrix', 'Thor Iron Man', 'Thor Ant-Man', 'The Dark Knight']
['Inception Captain America', 'Ant-Man']
['Guardians of the Galaxy', 'Guardians of the Galaxy The Dark Knight', 'Guardians of the Galaxy Black Panther', 'Guardians of the Galaxy', 'Avengers Hulk', 'Black Panther Interstellar Ant-Man']
['Inception Iron Man', 'Ant-Man The Dark Knight', 'The Dark Knight']
['Shang-Chi Doctor Strange Thor The Matrix', 'Ant-Man Spider-Man', 'The Matrix', 'The Dark Knight Interstellar', 'Avengers', 'Ant-Man']
['Avengers Ant-Man Interstellar The Matrix', 'The Matrix Hulk Captain America Shang-Chi', 'Avengers Hulk Spider-Man The Matrix', 'The Dark Knight']
['Guardians of the Galaxy Doctor Strange Ant-Man', 'Inception The Dark Knight', 'Interstellar', 'Iron Man Interstellar']
['Thor Avengers Ant-Man The Matrix', 'Spider-Man Inception', 'The Dark Knight', 'Thor Captain America The Matrix Shang-Chi', 'Iron Man Avengers Guardians of the Galaxy Black Panther']
['Thor Ant-Man Inception', 'Captain Amer

In [13]:
movie_titles = [
    "The Matrix", "Thor", "Iron Man", "Ant-Man", "The Dark Knight",
    "Inception", "Captain America", "Guardians of the Galaxy",
    "Black Panther", "Interstellar", "Shang-Chi", "Doctor Strange",
    "Avengers", "Hulk", "Spider-Man"
]

In [14]:
import re
from collections import Counter
from itertools import combinations

pair_counts = Counter()

for seq in df['Sequence']:
    baskets = re.findall(r'\[([^\]]+)\]', seq)
    for basket in baskets:
        movies_in_basket = []
        basket_copy = basket
        # Match longest titles first to avoid partial matches
        for title in sorted(movie_titles, key=len, reverse=True):
            if title in basket_copy:
                movies_in_basket.append(title)
                basket_copy = basket_copy.replace(title, "")
        # Count pairs if more than one movie in the basket
        if len(movies_in_basket) > 1:
            for pair in combinations(sorted(set(movies_in_basket)), 2):
                pair_counts[pair] += 1

print("Top 10 Most Common Movie Pairs:")
for pair, count in pair_counts.most_common(10):
    print(f"{pair[0]} & {pair[1]}: {count}")

Top 10 Most Common Movie Pairs:
Doctor Strange & Hulk: 64
Interstellar & The Matrix: 62
Captain America & The Dark Knight: 62
Guardians of the Galaxy & Iron Man: 60
The Dark Knight & The Matrix: 60
Inception & Iron Man: 59
Guardians of the Galaxy & Inception: 59
Guardians of the Galaxy & Spider-Man: 59
Ant-Man & Black Panther: 58
Ant-Man & Hulk: 58


`Guardians of the Galaxy` appears in three of the top 10 pairs and `The Matrix` in two, so both titles are often rented alongside other movies, although they are not paired with each other in this list.
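
Raw pair counts become comparable with Apriori's support threshold once they are divided by the number of baskets. The sketch below shows that conversion; the helper names are hypothetical and the total of 2,000 baskets is an assumed figure for illustration, not a value taken from the dataset.

```python
import math

def count_to_support(count, total_baskets):
    """Convert a raw co-occurrence count into a support fraction."""
    return count / total_baskets

def min_count_for_support(min_support, total_baskets):
    """Smallest raw count that meets a given support threshold."""
    return math.ceil(min_support * total_baskets)

# Assumption: 2,000 baskets in total (hypothetical figure)
total_baskets = 2000

pair_support = count_to_support(64, total_baskets)
required_count = min_count_for_support(0.02, total_baskets)

print(f"Support of a pair seen 64 times: {pair_support:.3f}")
print(f"Count needed for min_support=0.02: {required_count}")
```

Reading the threshold as a count makes it easier to judge whether a chosen `min_support` would keep or discard the pairs found manually.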

Apriori Algorithm Implementation on Movie Rental Data

Libraries

In [15]:
import re
import pandas as pd
from itertools import combinations, chain

In [None]:
#  1. Data Preparation: Extract Transactions from Dataset 

# List of all possible movie titles (update as needed)
movie_titles = [
    "The Matrix", "Thor", "Iron Man", "Ant-Man", "The Dark Knight",
    "Inception", "Captain America", "Guardians of the Galaxy",
    "Black Panther", "Interstellar", "Shang-Chi", "Doctor Strange",
    "Avengers", "Hulk", "Spider-Man"
]


In [22]:
# Load your dataset
file_path = r"C:\Users\George\Downloads\large_customer_movie_rentals - large_customer_movie_rentals.csv"
df = pd.read_csv(file_path)

# Extract transactions: each transaction is a list of movie titles
transactions = []
for seq in df['Sequence']:
    baskets = re.findall(r'\[([^\]]+)\]', seq)
    for basket in baskets:
        movies_in_basket = []
        basket_copy = basket
        for title in sorted(movie_titles, key=len, reverse=True):
            if title in basket_copy:
                movies_in_basket.append(title)
                basket_copy = basket_copy.replace(title, "")
        if movies_in_basket:
            transactions.append(set(movies_in_basket))

In [None]:
#  2. Improved Apriori: Find Frequent Itemsets and Their Supports 

def get_frequent_itemsets(transactions, min_support=0.02):
    """
    Finds all frequent itemsets in the transactions with support >= min_support.
    Also enables discovery of larger itemsets (pairs, triplets, etc.).
    """
    itemset_size = 1
    total_transactions = len(transactions)
    frequent_itemsets = dict()
    # Start with all unique items as candidates
    candidates = set(chain.from_iterable(transactions))
    prev_frequent = set()
    while candidates:
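        itemset_counts = {}
        # Sort the candidates so every itemset key is a tuple in canonical order;
        # the rule-generation step looks subsets up by sorted tuple.
        candidate_itemsets = list(combinations(sorted(candidates), itemset_size))
        for transaction in transactions:
            for itemset in candidate_itemsets:
                if set(itemset).issubset(transaction):
                    itemset_counts[itemset] = itemset_counts.get(itemset, 0) + 1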
        # Calculate support and filter by min_support
        curr_frequent = {
            itemset: count / total_transactions
            for itemset, count in itemset_counts.items()
            if count / total_transactions >= min_support
        }
        if not curr_frequent:
            break
        frequent_itemsets.update(curr_frequent)
        # Prepare candidates for next round (grow itemsets)
        prev_frequent = set(chain.from_iterable([set(itemset) for itemset in curr_frequent]))
        candidates = prev_frequent
        itemset_size += 1
    return frequent_itemsets

# Find frequent itemsets
frequent_itemsets = get_frequent_itemsets(transactions, min_support=0.02)

# Show top 5 frequent itemsets (any size)
top5 = sorted(frequent_itemsets.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 Frequent Itemsets with Support (any size):")
for itemset, support in top5:
    print(f"{set(itemset)}: {support:.3f}")

# Show top 5 largest frequent itemsets (pairs/triplets)
top5_largest = sorted(
    frequent_itemsets.items(),
    key=lambda x: (len(x[0]), x[1]),
    reverse=True
)[:5]
print("\nTop 5 Largest Frequent Itemsets with Support:")
for itemset, support in top5_largest:
    print(f"{set(itemset)}: {support:.3f}")

Top 5 Frequent Itemsets with Support (any size):
{'The Matrix'}: 0.185
{'Guardians of the Galaxy'}: 0.181
{'The Dark Knight'}: 0.176
{'Ant-Man'}: 0.175
{'Hulk'}: 0.173

Top 5 Largest Frequent Itemsets with Support:
{'Doctor Strange', 'Hulk'}: 0.032
{'The Matrix', 'Interstellar'}: 0.031
{'The Dark Knight', 'Captain America'}: 0.031
{'Iron Man', 'Guardians of the Galaxy'}: 0.030
{'The Dark Knight', 'The Matrix'}: 0.030
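
The function above grows candidates from the union of all items that appear in the current frequent itemsets. The textbook Apriori step is stricter: a (k+1)-item candidate is kept only if every one of its k-item subsets is already frequent. A minimal sketch of that pruning rule follows; `generate_candidates` and the toy itemsets are hypothetical and are not part of the notebook's implementation.

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Build (k+1)-item candidates from frequent k-itemsets given as sorted tuples."""
    frequent = set(frequent_k)
    items = sorted({item for itemset in frequent for item in itemset})
    candidates = []
    for candidate in combinations(items, k + 1):
        # Apriori property: every k-subset of a frequent itemset must itself be frequent
        if all(subset in frequent for subset in combinations(candidate, k)):
            candidates.append(candidate)
    return candidates

# Toy frequent pairs; only (A, B, C) has all three of its pairs in the list
frequent_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
next_candidates = generate_candidates(frequent_pairs, 2)
print(next_candidates)
```

Pruning this way reduces the number of itemsets that have to be counted against the transactions in the next pass.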


## Comparison of Apriori Algorithm and Manual Counts

| Aspect                            | Apriori Algorithm Results                                                                                                                                                       | Manual Count Results                                               |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| Most Frequent Item                | The Matrix (Support: 0.185)                                                                                                                                                     | The Matrix (Picked 43 times)                                       |
| Second Most Frequent              | Guardians of the Galaxy (0.181)                                                                                                                                                 | Shang-Chi (39 picks)                                               |
| Other Frequent Items              | The Dark Knight, Ant-Man, Hulk                                                                                                                                                  | Avengers, Interstellar, Inception, Hulk, Spider-Man, Black Panther |
| Items in Both Lists               | The Matrix, Guardians of the Galaxy, The Dark Knight, Hulk                                                                                                                      | The Matrix, Guardians of the Galaxy, The Dark Knight, Hulk         |
| Items Missing in Apriori Top 5    | Shang-Chi, Avengers, Interstellar, Inception, Spider-Man, Black Panther | N/A |
| Unique to Apriori Top 5           | Ant-Man                                                                                                                                                                         | N/A                                                                |
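| Largest Frequent Itemsets (Pairs) | Doctor Strange & Hulk (0.032), The Matrix & Interstellar (0.031), The Dark Knight & Captain America (0.031), Iron Man & Guardians of the Galaxy (0.030), The Dark Knight & The Matrix (0.030) | Doctor Strange & Hulk (64), Interstellar & The Matrix (62), Captain America & The Dark Knight (62), Guardians of the Galaxy & Iron Man (60), The Dark Knight & The Matrix (60) |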

## Comments

`The Matrix` is the most frequent movie in both methods.

* Apriori reveals co-occurrence patterns, showing which movies tend to be picked together.
* Manual counts show overall popularity but lack relationship insights.

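`Shang-Chi`, `Avengers`, and `Interstellar` rank near the top of the manual counts but do not reach the Apriori top 5 by support. `Interstellar` nevertheless appears in the second most common pair, with `The Matrix`.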

* The Apriori output includes titles absent from the manual top 10, such as `Ant-Man` (support 0.175) and `Doctor Strange` (in the top pair with `Hulk`), which suggests these titles are often rented alongside other movies.


In [None]:
#  3.  Association Rules: Support, Confidence, and Lift 

def generate_association_rules(frequent_itemsets, min_confidence=0.1):
    """
    Generate association rules from frequent itemsets.
    Only rules with confidence >= min_confidence are returned.
    """
    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue  # Only consider itemsets of size 2 or more
        itemset_support = frequent_itemsets[itemset]
        # Generate all possible non-empty antecedent/consequent splits
        for i in range(1, len(itemset)):
            for antecedent in combinations(itemset, i):
                antecedent = tuple(sorted(antecedent))
                consequent = tuple(sorted(set(itemset) - set(antecedent)))
                if not consequent:
                    continue
                antecedent_support = frequent_itemsets.get(antecedent, 0)
                consequent_support = frequent_itemsets.get(consequent, 0)
                if antecedent_support == 0 or consequent_support == 0:
                    continue
                confidence = itemset_support / antecedent_support
                lift = confidence / consequent_support
                if confidence >= min_confidence:
                    rules.append({
                        'antecedent': set(antecedent),
                        'consequent': set(consequent),
                        'support': itemset_support,
                        'confidence': confidence,
                        'lift': lift
                    })
    return rules

# Generate rules with a minimum confidence threshold
rules = generate_association_rules(frequent_itemsets, min_confidence=0.1)

# Keep only rules with lift > 1 (a positive association).
# Note: the rules are displayed in generation order, not ranked by any metric.
strong_rules = [rule for rule in rules if rule['lift'] > 1]

print("\nTop 10 Strong Association Rules (Support, Confidence, Lift):")
for rule in strong_rules[:10]:
    print(f"{rule['antecedent']} => {rule['consequent']} | support: {rule['support']:.3f}, confidence: {rule['confidence']:.3f}, lift: {rule['lift']:.3f}")


Top 10 Strong Association Rules (Support, Confidence, Lift):
{'Inception'} => {'Iron Man'} | support: 0.030, confidence: 0.178, lift: 1.032
{'Iron Man'} => {'Inception'} | support: 0.030, confidence: 0.172, lift: 1.032
{'Thor'} => {'Doctor Strange'} | support: 0.025, confidence: 0.164, lift: 1.013
{'Doctor Strange'} => {'Thor'} | support: 0.025, confidence: 0.152, lift: 1.013
{'The Matrix'} => {'Interstellar'} | support: 0.031, confidence: 0.168, lift: 1.045
{'Interstellar'} => {'The Matrix'} | support: 0.031, confidence: 0.193, lift: 1.045
{'Captain America'} => {'Shang-Chi'} | support: 0.028, confidence: 0.169, lift: 1.021
{'Shang-Chi'} => {'Captain America'} | support: 0.028, confidence: 0.170, lift: 1.021
{'Spider-Man'} => {'Avengers'} | support: 0.028, confidence: 0.170, lift: 1.053
{'Avengers'} => {'Spider-Man'} | support: 0.028, confidence: 0.171, lift: 1.053
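
To make the three metrics concrete, the sketch below computes them by hand on a toy dataset. The `support` helper and the five toy transactions are hypothetical; the formulas are the same ones used in `generate_association_rules`, namely confidence = support(X and Y) / support(X) and lift = confidence / support(Y).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Toy transactions, for illustration only
toy = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]

support_a = support({"A"}, toy)        # 3 of 5 transactions
support_b = support({"B"}, toy)        # 3 of 5 transactions
support_ab = support({"A", "B"}, toy)  # 2 of 5 transactions

# Rule A => B
confidence = support_ab / support_a
lift = confidence / support_b

print(f"support={support_ab:.3f}, confidence={confidence:.3f}, lift={lift:.3f}")
```

A lift of exactly 1 means the two items occur together as often as independence would predict. The lifts reported above, between 1.01 and 1.05, are only slightly higher than that.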


## Comparison of Association Rules: Apriori Algorithm vs Manual Counts

| Aspect                    | Apriori Algorithm (Top 10 Rules)                                                   | Manual Count (Top 10 Pairs)                                                     |
| ------------------------- | ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| Strongest Rule by Support | `Interstellar` => `The Matrix` (0.031)                                             | `Doctor Strange` & `Hulk` (64 co-occurrences)                                   |
| High Confidence Rule      | `Interstellar` => `The Matrix` (0.193)                                             | `Interstellar` & `The Matrix` (62)                                              |
| Symmetric Associations    | `Inception` <=> `Iron Man`, `Thor` <=> `Doctor Strange`                            | `Inception` & `Iron Man` (59), `Ant-Man` & `Hulk` (58)                          |
| Popular Character Pairs   | `Captain America` <=> `Shang-Chi`, `Avengers` <=> `Spider-Man` | `Captain America` & `The Dark Knight`, `Guardians of the Galaxy` & `Spider-Man` |
| Overlapping Pairs         | `Interstellar` & `The Matrix`, `Inception` & `Iron Man` | Both in the manual top 10 (62 and 59 co-occurrences) |
| Unique to Apriori Rules   | `Thor` <=> `Doctor Strange`, `Captain America` <=> `Shang-Chi`, `Avengers` <=> `Spider-Man` | Not in the manual top 10 pairs |
| Unique to Manual Pairs    | `Guardians of the Galaxy` & `Inception`, `Ant-Man` & `Black Panther`, `Guardians of the Galaxy` & `Spider-Man` | Not among the ten Apriori rules displayed |

## Comments

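The association rule `Interstellar` => `The Matrix` has the highest support (0.031) and confidence (0.193) among the displayed rules, and the same pair ranks second in the manual counts (62). Its lift of 1.045 indicates only a slight positive association.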

* Apriori rules reveal directional patterns, such as `Inception` leading to `Iron Man`, with calculated metrics like confidence and lift.
* Manual pair counts show raw co-occurrence but do not indicate direction or strength of influence.

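Pairs like `Doctor Strange` & `Hulk` and `Guardians of the Galaxy` & `Spider-Man` are common in the manual counts but do not appear among the ten rules displayed. Those ten are simply the first rules generated with lift > 1, not a ranking, so a pair's absence does not by itself mean its association is weak.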

* Apriori surfaces unique associations like `Thor` <=> `Doctor Strange`, which might be less obvious from raw frequency.
* Manual counts are better for identifying raw frequency, while Apriori provides insights into probabilistic relationships between movie choices.
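
Because `strong_rules[:10]` lists rules in the order they were generated, sorting by a metric first would give a ranked top 10. The sketch below sorts by lift and breaks ties by confidence. The sort key is one possible choice rather than the notebook's method, and `sample_rules` holds three rules copied from the output above.

```python
# Three rules copied from the displayed output, in the same dict layout
sample_rules = [
    {"antecedent": {"Inception"}, "consequent": {"Iron Man"},
     "support": 0.030, "confidence": 0.178, "lift": 1.032},
    {"antecedent": {"Interstellar"}, "consequent": {"The Matrix"},
     "support": 0.031, "confidence": 0.193, "lift": 1.045},
    {"antecedent": {"Spider-Man"}, "consequent": {"Avengers"},
     "support": 0.028, "confidence": 0.170, "lift": 1.053},
]

# Highest lift first; confidence breaks ties
ranked = sorted(
    sample_rules,
    key=lambda rule: (rule["lift"], rule["confidence"]),
    reverse=True,
)

for rule in ranked:
    print(f"{rule['antecedent']} => {rule['consequent']} | lift: {rule['lift']:.3f}")
```

Applied to the full `strong_rules` list, the same sort would show whether pairs such as `Doctor Strange` & `Hulk` produce rules that rank among the strongest.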
