(c) Identify pairs of items (X, Y) such that the support of {X, Y} is at least 100. For all such pairs, compute the confidence scores of the corresponding association rules: X ⇒ Y, Y ⇒ X. Sort the rules in decreasing order of confidence scores and list the top 5 rules in the writeup. Break ties, if any, by lexicographically increasing order on the left-hand side of the rule.
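
The confidence of a rule X ⇒ Y is the support of {X, Y} divided by the support of {X}. As a rough cross-check on the library-based cells that follow, the same quantity can be computed by direct counting. The sketch below is only an illustration on toy data: the function name `pair_rules` and the toy sessions are hypothetical and not part of the assignment.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_count):
    """Return (lhs, rhs, confidence) for each rule built from a pair with support >= min_count."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))              # de-duplicate items within a session
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (x, y), n_xy in pair_counts.items():
        if n_xy < min_count:
            continue                             # pair is not frequent
        rules.append((x, y, n_xy / item_counts[x]))  # conf(X => Y)
        rules.append((y, x, n_xy / item_counts[y]))  # conf(Y => X)
    return rules

# Hypothetical toy sessions, with an absolute support threshold of 2
toy = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b"]]
toy_rules = pair_rules(toy, min_count=2)
```

Here `c` appears in two sessions and both contain `a`, so `c => a` has confidence 1.0, while `a => c` has confidence 2/3.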

In [1]:
!pip install apyori




In [2]:
from apyori import apriori

In [3]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [4]:
# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [5]:
# ID of the files on Google Drive
browsing_behaviour = '1hJn7T9hYcC4qYq-4t-PkjTK_aXkzE-gQ'

# Download 'browsing.txt'
browsing_downloaded = drive.CreateFile({'id': browsing_behaviour})
browsing_downloaded.GetContentFile('browsing.txt')

In [6]:
data = []
with open('browsing.txt', 'r') as file:
    for line in file:
        data.append(line.strip().split())

In [7]:
# Convert the data into a list of transactions
transactions = [list(set(session)) for session in data]

In [8]:
min_support = 100 / len(transactions)

# Find frequent items in the first pass
first_pass_results = list(apriori(transactions, min_support=min_support, min_confidence=0, min_lift=1, max_length=1))

# Extract the frequent items from the first pass
frequent_items = set()
for result in first_pass_results:
    frequent_items.update(result.items)

# Calculate the number of frequent items found in the first pass
num_frequent_items = len(frequent_items)

# Perform the sanity check
expected_frequent_items_count = 647  # The expected count
if num_frequent_items == expected_frequent_items_count:
    print("Sanity check passed: The number of frequent items matches the expected count.")
else:
    print("Sanity check failed: The number of frequent items does not match the expected count.")

Sanity check passed: The number of frequent items matches the expected count.
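
The assignment states the threshold as an absolute count (100 sessions), whereas `apriori` takes a relative support, which is why the cell above divides by `len(transactions)`. A minimal sketch of an independent check by plain counting is shown below. The helper name `frequent_items_by_count` and the toy data are assumptions for illustration only.

```python
from collections import Counter

def frequent_items_by_count(transactions, min_count):
    """Return the set of items that occur in at least min_count transactions."""
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))   # count each item at most once per session
    return {item for item, n in counts.items() if n >= min_count}

# Hypothetical toy sessions: "a" and "b" occur in 3 sessions each, "c" in 1
toy = [["a", "b", "a"], ["a", "c"], ["b"], ["a", "b"]]
frequent = frequent_items_by_count(toy, min_count=2)

# The equivalent relative threshold, as passed to a library expecting a fraction
relative_threshold = 2 / len(toy)
```

On the real data, `len(frequent_items_by_count(transactions, 100))` would be the number to compare against the expected count used in the sanity check.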


In [9]:
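# Define a function to find frequent pairs and the confidence of their rules
def find_frequent_pairs(transactions, min_support):
    # min_lift=0 keeps every rule; min_lift=1 would discard rules with lift below 1
    results = list(apriori(transactions, min_support=min_support, min_confidence=0, min_lift=0, max_length=2))

    # Extract the rules X => Y and Y => X with their confidence
    conf_pairs = []
    for result in results:
        if len(result.items) != 2:
            continue  # skip single-item records, which are also returned
        for ordered_stat in result.ordered_statistics:
            if len(ordered_stat.items_base) != 1:
                continue  # skip the empty-antecedent entry {} => {X, Y}
            x = sorted(ordered_stat.items_base)  # sorted() gives a deterministic order
            y = sorted(ordered_stat.items_add)
            conf = ordered_stat.confidence
            conf_pairs.append(((x, y), conf))

    return conf_pairs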

In [10]:
# Find frequent pairs
conf_pairs = find_frequent_pairs(transactions, min_support)

In [11]:
# Sort by decreasing confidence; break ties by the left-hand side in lexicographic order
conf_pairs.sort(key=lambda x: (-x[1], x[0]))
top_5_pairs = conf_pairs[:5]

# Print the top 5 pairs
for i, (pair, conf) in enumerate(top_5_pairs):
    x, y = pair
    print(f"Top Pair Rule {i + 1}: {x} => {y}, Confidence: {conf}")

Top Pair Rule 1: ['DAI93865'] => ['FRO40251'], Confidence: 1.0
Top Pair Rule 2: ['GRO85051'] => ['FRO40251'], Confidence: 0.9991762767710051
Top Pair Rule 3: ['GRO38636'] => ['FRO40251'], Confidence: 0.9906542056074765
Top Pair Rule 4: ['ELE12951'] => ['FRO40251'], Confidence: 0.9905660377358491
Top Pair Rule 5: ['DAI88079'] => ['FRO40251'], Confidence: 0.9867256637168142
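
The ordering used above relies on Python comparing tuples element by element: negating the confidence puts the highest values first, and the left-hand side decides among equal confidences. A small sketch on made-up rules (the item names and confidences are hypothetical) shows the effect:

```python
# Hypothetical rules as (lhs, rhs, confidence)
toy_rules = [
    (("B",), ("Z",), 1.0),
    (("A",), ("Z",), 0.9),
    (("A",), ("Y",), 1.0),
    (("C",), ("Z",), 1.0),
]

# Primary key: confidence, descending. Secondary key: left-hand side, ascending.
ranked = sorted(toy_rules, key=lambda r: (-r[2], r[0]))

for lhs, rhs, conf in ranked:
    print(f"{lhs} => {rhs}, Confidence: {conf}")
```

The three rules with confidence 1.0 come out ordered by left-hand side (A, B, C), followed by the rule with confidence 0.9.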


(d) Identify item triples (X, Y, Z) such that the support of {X, Y, Z} is at least 100. For all such triples, compute the confidence scores of the corresponding association rules: (X, Y) ⇒ Z, (X, Z) ⇒ Y, (Y, Z) ⇒ X. Sort the rules in decreasing order of confidence scores and list the top 5 rules in the writeup. Order the left-hand-side pair lexicographically and break ties, if any, by lexicographical order of the first then the second item in the pair.
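
For a triple, each rule has a two-item left-hand side, and its confidence is the support of {X, Y, Z} divided by the support of that pair. The sketch below computes this by brute-force counting on toy data. It is an illustration only: `triple_rules` is a hypothetical name, and enumerating every triple per session without pruning is assumed to be acceptable only for small inputs.

```python
from collections import Counter
from itertools import combinations

def triple_rules(transactions, min_count):
    """Return (lhs_pair, rhs, confidence) for triples with support >= min_count, ranked."""
    pair_counts = Counter()
    triple_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))                 # sorted, so tuples are canonical
        pair_counts.update(combinations(items, 2))
        triple_counts.update(combinations(items, 3))

    rules = []
    for triple, n_xyz in triple_counts.items():
        if n_xyz < min_count:
            continue                                # triple is not frequent
        for rhs in triple:
            lhs = tuple(i for i in triple if i != rhs)  # remaining pair, still sorted
            rules.append((lhs, rhs, n_xyz / pair_counts[lhs]))

    # Decreasing confidence, then left-hand-side pair (first item, then second)
    rules.sort(key=lambda r: (-r[2], r[0]))
    return rules

# Hypothetical toy sessions, with an absolute support threshold of 2
toy = [["a", "b", "c"], ["a", "b", "c"], ["a", "b"], ["b", "c", "d"]]
toy_rules = triple_rules(toy, min_count=2)
```

In the toy data only {a, b, c} is frequent. The pair (a, c) always occurs together with b, so `(a, c) => b` ranks first with confidence 1.0, and the two rules with confidence 2/3 are ordered by their left-hand-side pairs.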

In [12]:
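# Define a function to find frequent triples and the confidence of their rules
def find_frequent_triples(transactions, min_support):
    # min_lift=0 keeps every rule; min_lift=1 would discard rules with lift below 1
    results = list(apriori(transactions, min_support=min_support, min_confidence=0, min_lift=0, max_length=3))

    # Extract the rules (X, Y) => Z, (X, Z) => Y, (Y, Z) => X with their confidence
    conf_triples = []
    for result in results:
        if len(result.items) != 3:
            continue  # skip single-item and pair records, which are also returned
        for ordered_stat in result.ordered_statistics:
            if len(ordered_stat.items_base) != 2:
                continue  # keep only rules with a two-item left-hand side
            x = sorted(ordered_stat.items_base)  # sorted() orders the pair lexicographically
            y = sorted(ordered_stat.items_add)
            conf = ordered_stat.confidence
            conf_triples.append(((x, y), conf))

    return conf_triples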

In [13]:
# Find frequent triples
min_support = 100 / len(transactions)
conf_triples = find_frequent_triples(transactions, min_support)

In [14]:
# Sort by decreasing confidence; break ties by the left-hand-side pair in lexicographic order
conf_triples.sort(key=lambda x: (-x[1], x[0]))
top_5_triples = conf_triples[:5]

# Print the top 5 triples
for i, (triple, conf) in enumerate(top_5_triples):
    x, y = triple
    print(f"Top Triple Rule {i + 1}: {x} => {y}, Confidence: {conf}")

Top Triple Rule 1: ['DAI23334', 'ELE92920'] => ['DAI62779'], Confidence: 1.0
Top Triple Rule 2: ['DAI31081', 'GRO85051'] => ['FRO40251'], Confidence: 1.0
Top Triple Rule 3: ['DAI55911', 'GRO85051'] => ['FRO40251'], Confidence: 1.0
Top Triple Rule 4: ['DAI62779', 'DAI88079'] => ['FRO40251'], Confidence: 1.0
Top Triple Rule 5: ['DAI75645', 'GRO85051'] => ['FRO40251'], Confidence: 1.0
