<div align="center">

## DMML Assignment 1: Frequent Itemsets

**Trishita Patra - MDS202440**  
**Boda Surya Venkata Jyothi Sowmya - MDS202413**

</div>

This file contains the final code used to complete the task of Assignment 1. 

We aim to compute frequent itemsets for a document-word dataset. As usual, a K-itemset of words is a collection of K words that occur together in the same document. The task is to write a program that finds all K-itemsets of words occurring in at least F documents, where K and F are parameters to the program.

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Documentation of packages used:
1. The `time` module is part of Python's standard library and provides functions for time-related operations. In this notebook it is used to measure the execution time of the Apriori algorithm.
2. `from collections import defaultdict` imports the `defaultdict` class from the `collections` module, which is also part of Python's standard library. `defaultdict` is a dictionary subclass that supplies a default value for keys that do not yet exist in the dictionary. Compared with `.get()`, it removes the manual missing-key check, which keeps the code cleaner.
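
The difference between the two approaches can be sketched as follows. This is a minimal illustration, not part of the assignment code; the `pairs` list and the names `plain` and `grouped` are made up for the example.

```python
from collections import defaultdict

# Hypothetical (doc_id, word_id) pairs, made up for illustration.
pairs = [(1, 10), (2, 10), (2, 20), (3, 10)]

# Plain dict: every insertion needs an explicit fallback for missing keys.
plain = {}
for doc_id, word_id in pairs:
    docs = plain.get(word_id, set())
    docs.add(doc_id)
    plain[word_id] = docs

# defaultdict(set): a missing key is created with an empty set on first access.
grouped = defaultdict(set)
for doc_id, word_id in pairs:
    grouped[word_id].add(doc_id)

# Both approaches build the same word -> documents mapping.
print(plain == dict(grouped))  # True
```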

 **Vocabulary File Parsing**:
  - The vocabulary file (`vocab.file.txt`) is read using a helper function `read_vocab(file_path)`.
  - This function maps word IDs to their corresponding words, using each word's 1-based line number as its ID. The result is a dictionary where keys are word IDs and values are the actual words.
  - **Input**: Path to the vocabulary file.
  - **Output**: A dictionary mapping word IDs to words.
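
As a small illustration of the mapping, the sketch below applies the same line-number-as-ID idea to an in-memory sample. The three-word vocabulary and the name `vocab_demo` are assumptions for the example, not the real file.

```python
import io

# Hypothetical three-word vocabulary, one word per line (not the real file).
sample = io.StringIO("energy\nmarket\npower\n")

# Same idea as read_vocab: line numbers, starting at 1, serve as word IDs.
vocab_demo = {i: line.strip() for i, line in enumerate(sample, start=1)}
print(vocab_demo)  # {1: 'energy', 2: 'market', 3: 'power'}
```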


In [14]:
import time
from collections import defaultdict

def read_vocab(file_path):
    vocab = {}
    with open(file_path, 'r') as f:
        for i, line in enumerate(f, start=1):
            vocab[i] = line.strip()
    return vocab

**Document-Word File Parsing**:
  - The document-word file (`docword.file.txt`) is processed using the helper function `read_docword(file_path)`.
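  - The first three lines of the file give the number of documents, the number of words, and the number of nonzero entries. For each nonzero entry, the function reads the document ID, word ID, and count. The count is not used, since only the presence of a word in a document matters for support. The occurrences are stored in a `defaultdict(set)`, where each word ID maps to the set of document IDs containing that word.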
  - **Input**: Path to the document-word file.
  - **Output**: A dictionary (`word_docs`) where keys are word IDs and values are sets of document IDs containing those words.
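
The layout this parser expects can be illustrated with a tiny in-memory sample. The file contents below are made up for the example (3 documents, 4 words, 5 nonzero entries), and `word_docs_demo` is a hypothetical name.

```python
import io
from collections import defaultdict

# Hypothetical file contents: three header lines, then "docID wordID count" rows.
sample = io.StringIO("3\n4\n5\n1 1 2\n1 3 1\n2 1 4\n2 2 1\n3 3 7\n")

num_docs = int(sample.readline())
num_words = int(sample.readline())
nnz = int(sample.readline())

word_docs_demo = defaultdict(set)
for _ in range(nnz):
    doc_id, word_id, count = map(int, sample.readline().split())
    word_docs_demo[word_id].add(doc_id)  # count is not needed for support

# Show the mapping with sorted keys and sorted document lists.
print({w: sorted(word_docs_demo[w]) for w in sorted(word_docs_demo)})
# {1: [1, 2], 2: [2], 3: [1, 3]}
```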

In [15]:
def read_docword(file_path):
    with open(file_path, 'r') as f:
        D = int(f.readline().strip())  # Number of documents
        W = int(f.readline().strip())  # Number of words
        NNZ = int(f.readline().strip())  # Nonzero entries

        word_docs = defaultdict(set)  # wordID -> set of documents containing it
        for _ in range(NNZ):
            doc_id, word_id, count = map(int, f.readline().strip().split())
            word_docs[word_id].add(doc_id)

    return word_docs

### Apriori Algorithm Implementation *`apriori(word_docs, K, F)`*:
- **Purpose**: Identifies frequent itemsets efficiently using the Apriori algorithm.
- **Input**:
  - `word_docs`: Dictionary mapping word IDs to sets of document IDs.
  - `K`: Size of frequent itemsets.
  - `F`: Minimum support threshold (minimum number of documents an itemset must appear in).
- **Output**: A list of frequent K-itemsets with their support counts, sorted by descending support. The list is empty if no frequent K-itemset exists.

---

### Key Steps and Data Structures:

1. **Frequent 1-Itemsets**:
   - **Step**: Identify words that appear in at least `F` documents.
   - **Data Structure**: Dictionary `{(word,): docs}` where keys are single-word tuples, and values are sets of document IDs.
   - **Purpose**: Filters out infrequent words early, reducing the search space.

2. **Candidate Generation**:
   - **Step**: Generate candidate itemsets of size `k` by merging pairs of frequent `(k-1)`-itemsets that share their first `(k-2)` elements.
   - **Data Structure**: A set of candidates, which ensures uniqueness.
   - **Purpose**: Merging only itemsets with a common prefix avoids generating most candidates that cannot be frequent.

3. **Support Counting**:
   - **Step**: Compute support for each candidate by intersecting document sets of its constituent words using `set.intersection()`.
   - **Data Structure**: Retain frequent itemsets as a dictionary `{itemset: docs}`.

4. **Early Termination**:
   - **Step**: If no frequent itemsets are found at some iteration, stop and return an empty list, since no larger itemset can then be frequent.
   - **Benefit**: Avoids running the remaining iterations.
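
The candidate-generation and support-counting steps can be traced on a toy index. This is a minimal sketch under stated assumptions: the words, document sets, and threshold are made up, and the join appends the last element of the second itemset rather than taking a sorted union, which yields the same candidates when itemsets are stored as sorted tuples.

```python
from itertools import combinations

# Hypothetical word -> document-set index (made up for illustration).
word_docs_demo = {
    "energy": {1, 2, 3, 4},
    "market": {1, 2, 3},
    "power": {2, 3, 4},
    "price": {1, 4},
}
min_support = 2

# Frequent 2-itemsets as sorted tuples, found by intersecting document sets.
words = sorted(word_docs_demo)
frequent_pairs = [
    pair for pair in combinations(words, 2)
    if len(word_docs_demo[pair[0]] & word_docs_demo[pair[1]]) >= min_support
]

# Prefix join: merge two 2-itemsets only if they share their first element.
candidates = set()
for a, b in combinations(frequent_pairs, 2):
    if a[:-1] == b[:-1]:
        candidates.add(a + (b[-1],))

# Support counting: intersect the document sets of all words in a candidate.
frequent_triples = {}
for c in candidates:
    docs = set.intersection(*(word_docs_demo[w] for w in c))
    if len(docs) >= min_support:
        frequent_triples[c] = len(docs)

print(sorted(candidates))
# [('energy', 'market', 'power'), ('energy', 'market', 'price'), ('energy', 'power', 'price')]
print(frequent_triples)  # {('energy', 'market', 'power'): 2}
```

Here `('market', 'power')` has no partner with the same prefix, so it produces no candidate, and two of the three candidates are discarded at the support-counting stage.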



In [3]:
def apriori(word_docs, K, F):

    # Step 1: Find frequent 1-itemsets
    freq_itemsets = { (word,): docs for word, docs in word_docs.items() if len(docs) >= F }

    # Step 2: Generate k-itemsets iteratively
    for k in range(2, K + 1):
        candidates = set()
        freq_keys = list(freq_itemsets.keys())  # List of current frequent itemsets

        # Generate candidate itemsets of size k using (k-1)-itemsets
        for i in range(len(freq_keys)):
            for j in range(i + 1, len(freq_keys)):
                a, b = freq_keys[i], freq_keys[j]

                # Merge only if the first (k-2) elements match (prefix join)
                if a[:-1] == b[:-1]:
                    new_itemset = tuple(sorted(set(a) | set(b)))  # Union
                    if len(new_itemset) == k:
                        candidates.add(new_itemset)

        # Count support for candidate itemsets
        new_freq_itemsets = {}
        for c in candidates:
            intersect_docs = set.intersection(*(word_docs[word] for word in c))
            if len(intersect_docs) >= F:
                new_freq_itemsets[c] = intersect_docs

        # If no candidate is frequent, no larger itemset can be: return an empty result
        if not new_freq_itemsets:
            return []

        freq_itemsets = new_freq_itemsets

    # Convert sets to counts for readability
    return sorted([(itemset, len(docs)) for itemset, docs in freq_itemsets.items()],
                  key=lambda x: x[1], reverse=True)

`main(vocab_file, docword_file, K, F)`

  - Runs the Apriori algorithm using the helper functions above, measures the time taken for a given `(K, F)` pair, and prints the results in a readable format.
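
The timing pattern can be sketched in isolation. The helper `timed` below is hypothetical and not part of the notebook; it uses `time.perf_counter()`, the standard-library clock intended for measuring intervals, whereas the notebook's `main` uses `time.time()`. The `sum` call is only a stand-in workload.

```python
import time

def timed(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed_seconds) for one call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook the call would be apriori(word_docs, K, F).
result, elapsed = timed(sum, range(1_000_000))
print(f"Time taken: {elapsed:.2f} seconds")
```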

In [None]:
def main(vocab_file, docword_file, K, F):
    print("Reading dataset...")
    word_docs = read_docword(docword_file)
    vocab = read_vocab(vocab_file)

    print(f"Running Apriori for K={K}, F={F}")
    start_time = time.time()
    frequent_itemsets = apriori(word_docs, K, F)

    elapsed = time.time() - start_time

    print(f"\nTime taken: {elapsed:.2f} seconds")

    if not frequent_itemsets:
        print("\nNo itemsets found.")
    else:
        print(f"\nTotal Frequent K-itemsets Found: {len(frequent_itemsets)}")
        print("Frequent K-itemsets:")
        for itemset, count in frequent_itemsets:
            print(f"Itemset: {tuple(vocab[word] for word in itemset)}, Count: {count}")

Example with Enron dataset


In [4]:
vocab_file = "/content/drive/My Drive/DMML_Assignment_1/vocab.enron.txt"
docword_file = "/content/drive/My Drive/DMML_Assignment_1/docword.enron.txt"

In [None]:
main(vocab_file, docword_file, K = 3, F = 1700)

Reading dataset...
Running Apriori for K=3, F=1700

Time taken: 25.14 seconds

Total Frequent K-itemsets Found: 1
Frequent K-itemsets:
Itemset: ('energy', 'market', 'power'), Count: 1764


In [None]:
main(vocab_file, docword_file, K = 3, F = 1300)

Reading dataset...
Running Apriori for K=3, F=1300

Time taken: 51.52 seconds

Total Frequent K-itemsets Found: 25
Frequent K-itemsets:
Itemset: ('energy', 'market', 'power'), Count: 1764
Itemset: ('california', 'energy', 'power'), Count: 1667
Itemset: ('california', 'market', 'power'), Count: 1602
Itemset: ('market', 'power', 'price'), Count: 1481
Itemset: ('california', 'energy', 'market'), Count: 1473
Itemset: ('energy', 'market', 'price'), Count: 1439
Itemset: ('business', 'company', 'market'), Count: 1412
Itemset: ('cost', 'energy', 'power'), Count: 1403
Itemset: ('energy', 'power', 'price'), Count: 1401
Itemset: ('california', 'cost', 'power'), Count: 1398
Itemset: ('cost', 'market', 'power'), Count: 1397
Itemset: ('electricity', 'energy', 'power'), Count: 1386
Itemset: ('market', 'price', 'prices'), Count: 1381
Itemset: ('california', 'electricity', 'power'), Count: 1378
Itemset: ('market', 'power', 'prices'), Count: 1344
Itemset: ('california', 'market', 'price'), Count: 1342
(remaining output truncated)

In [None]:
main(vocab_file, docword_file, K = 5, F = 1300)

Reading dataset...
Running Apriori for K=5, F=1300

Time taken: 53.76 seconds

No itemsets found.
