
# Assignment 1: Boolean Model, TF-IDF, and Data Retrieval vs. Information Retrieval Conceptual Questions

**Student names**: _Your_names_here_ <br>
**Group number**: _Your_group_here_ <br>
**Date**: _Submission Date_

## Important notes
Please read the following notes carefully and take them into account when preparing your submission. Submissions that do not fulfill these requirements will not be assessed and must be resubmitted.
1. You may work in groups of at most 2 students.
2. The assignment must be delivered in ipynb format.
3. The assignment must be typed. Handwritten assignments are not accepted.

**Due date**: 14.09.2025 23:59

In this assignment, you will:
- Implement a Boolean retrieval model
- Compute TF-IDF vectors for documents
- Run retrieval on queries
- Answer conceptual questions 

---
## Dataset

You will use the **Cranfield** dataset, provided in this file:

- `cran.all.1400`: The document collection (1400 documents)

**The code to parse the file is provided. Just update the Cranfield file path to match your own file location, and use the `docs` variable in your code to access the parsed documents.**

### Load and parse documents (provided)

Run the cell to parse the Cranfield documents. Update the path so it points to your `cran.all.1400` file.


In [3]:

# Read 'cran.all.1400' and parse the documents into a suitable data structure

CRAN_PATH = r"cran.all.1400"  # <-- change this!

def parse_cranfield(path):
    docs = {}
    current_id = None
    current_field = None
    buffers = {"T": [], "A": [], "B": [], "W": []}
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(".I "):
                if current_id is not None:
                    docs[current_id] = {
                        "id": current_id,
                        "title": " ".join(buffers["T"]).strip(),
                        "abstract": " ".join(buffers["W"]).strip()
                    }
                current_id = int(line.split()[1])
                buffers = {k: [] for k in buffers}
                current_field = None
            elif line.startswith("."):
                tag = line[1:].strip()
                current_field = tag if tag in buffers else None
            else:
                if current_field is not None:
                    buffers[current_field].append(line)
    if current_id is not None:
        docs[current_id] = {
            "id": current_id,
            "title": " ".join(buffers["T"]).strip(),
            "abstract": " ".join(buffers["W"]).strip()
        }
    print(f"Parsed {len(docs)} documents.")
    return docs

docs = parse_cranfield(CRAN_PATH)



Parsed 1400 documents.


## Part 1.1 – Boolean Retrieval Model

### Tokenize documents

Implement tokenization using the given list of stopwords. Create a list of normalized terms per document (e.g., lowercase, remove punctuation/digits; drop stopwords). Store the token lists to use in later steps.
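
As a minimal sketch of the normalization steps listed above, the snippet below lowercases a single string, replaces punctuation and digits with spaces, and drops stopwords. The names `normalize_text` and `TOY_STOPWORDS` are hypothetical and exist only for this illustration; the toy stopword set stands in for the full list given in the assignment.

```python
import string

# Toy stand-in for the assignment's stopword list (assumption for illustration).
TOY_STOPWORDS = {"the", "of", "a", "in", "at"}

# Map every punctuation character and digit to a space.
_STRIP = string.punctuation + string.digits
_TABLE = str.maketrans(_STRIP, " " * len(_STRIP))

def normalize_text(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation/digits, split on whitespace, drop stopwords."""
    tokens = text.lower().translate(_TABLE).split()
    return [t for t in tokens if t not in stopwords]

print(normalize_text("The slip-stream of a wing, tested at Mach 2.5!"))
# ['slip', 'stream', 'wing', 'tested', 'mach']
```

One consequence of replacing apostrophes with spaces is that contraction entries in a stopword list (such as `aren't`) can never match a token, so fragments like `aren` and `t` pass through the filter.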

In [19]:
# TODO: Implement tokenization using the given list of stopwords, create list of terms per document

STOPWORDS = set("""a about above after again against all am an and any are aren't as at be because been
before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down
during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers
herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most
mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she
she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's
these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're
we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't
you you'd you'll you're you've your yours yourself yourselves""".split())

# Your code here

import string

def tokenization(docs):
    """
    Tokenize documents from Cranfield corpus dictionary.
    
    Args:
        docs (dict): Dictionary from parse_cranfield() with document data
        
    Returns:
        list: List of tokenized terms per document (stopwords removed)
    """
    # Map each punctuation character to a space (built once, reused for every document)
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    result = []

    for doc in docs.values():
        # Combine title and abstract, then convert to lowercase
        text = (doc["title"] + " " + doc["abstract"]).lower()

        # Replace punctuation with spaces and split on whitespace
        tokens = text.translate(translator).split()

        # Remove stopwords (split() never yields empty strings)
        filtered_tokens = [token for token in tokens if token not in STOPWORDS]

        result.append(filtered_tokens)

    return result

tokenized_documents = tokenization(docs)  # <-- This creates tokenized_documents


### Build vocabulary

Create a set (or list) of unique terms from all tokenized documents. Report the number of unique terms.


In [21]:
# TODO: Create a set or list of unique terms

# Report: 
# - Number of unique terms

# Your code here

def build_vocabulary(tokenized_documents):
    """
    Build vocabulary from tokenized documents.
    
    Args:
        tokenized_documents (list): List of tokenized documents (list of lists)
        
    Returns:
        set: Set of unique terms from all documents
    """
    vocabulary = set()
    
    for document_tokens in tokenized_documents:
        vocabulary.update(document_tokens)
        
    
    return vocabulary

vocabulary = build_vocabulary(tokenized_documents)  # Receives the returned set
print(f"Number of unique terms: {len(vocabulary)}")  # Uses the returned set



Number of unique terms: 7361


### Build inverted index

For each term, store the list (or set) of document IDs where the term appears.
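
To make the data structure concrete, here is a small sketch on made-up documents. It assumes the token lists are held in a dictionary keyed by explicit document IDs, so the postings carry those IDs rather than list positions. `build_toy_index` and `toy_docs` are hypothetical names used only here.

```python
from collections import defaultdict

def build_toy_index(token_lists_by_id):
    """Map each term to the set of document IDs whose token list contains it."""
    index = defaultdict(set)
    for doc_id, tokens in token_lists_by_id.items():
        for term in tokens:
            index[term].add(doc_id)
    return dict(index)

# Made-up documents keyed by ID (not Cranfield data).
toy_docs = {
    1: ["wing", "flutter", "wing"],
    2: ["heat", "transfer"],
    3: ["wing", "heat"],
}
toy_index = build_toy_index(toy_docs)
print(sorted(toy_index["wing"]))  # [1, 3]
print(sorted(toy_index["heat"]))  # [2, 3]
```

Using sets for the postings means a term that occurs several times in one document is still recorded once, which is all the Boolean model needs.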


In [None]:

# TODO: For each term, store list of document IDs where the term appears
# Your code here

def build_inverted_index(tokenized_documents):
    """
    Build an inverted index from tokenized documents.
    
    Args:
        tokenized_documents (list): List of tokenized documents (list of lists)
        
    Returns:
        dict: Dictionary where keys are terms and values are sets of document
              indices that contain the term. Indices are 0-based positions in
              tokenized_documents, not the Cranfield .I numbers.
    """
    inverted_index = {}
    
    for doc_id, document_tokens in enumerate(tokenized_documents):
        for term in document_tokens:
            if term not in inverted_index:
                inverted_index[term] = set()
            inverted_index[term].add(doc_id)
    
    return inverted_index

inverted_index = build_inverted_index(tokenized_documents)
print(f"Number of unique terms in index: {len(inverted_index)}")
print(f"Term 'wing' appears in documents: {inverted_index.get('wing', set())}")


Number of unique terms in index: 7361
Term 'wing' appears in documents: {0, 519, 12, 13, 29, 30, 544, 546, 1061, 1063, 41, 560, 1073, 1074, 51, 59, 1088, 1089, 1090, 1091, 68, 1093, 1094, 75, 77, 598, 599, 600, 1110, 1114, 91, 94, 1127, 631, 632, 1143, 635, 637, 642, 1161, 1162, 1163, 1167, 1168, 145, 146, 1169, 670, 672, 673, 674, 675, 676, 1185, 1187, 679, 680, 681, 682, 1196, 1201, 691, 692, 693, 694, 695, 1206, 697, 698, 1207, 188, 190, 702, 704, 1217, 194, 708, 198, 199, 710, 201, 711, 203, 204, 713, 1228, 1232, 1238, 1242, 221, 1245, 223, 224, 225, 228, 229, 234, 746, 747, 748, 751, 1265, 754, 756, 245, 246, 1270, 249, 251, 1275, 1276, 255, 767, 1279, 1288, 1289, 778, 780, 782, 278, 790, 792, 793, 794, 283, 795, 796, 286, 287, 288, 800, 802, 807, 808, 810, 1327, 1330, 1332, 310, 1335, 1336, 1337, 1339, 1340, 1341, 1342, 332, 1361, 859, 1379, 876, 882, 378, 894, 900, 901, 394, 915, 916, 917, 918, 919, 920, 922, 923, 926, 415, 928, 419, 431, 432, 433, 441, 452, 969, 970, 463, 990, 

### Retrieve documents for a Boolean query (AND/OR)

Create a function to retrieve documents for a Boolean query (AND/OR) with query terms.  
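
One way to read a flat AND/OR query is to let AND bind tighter than OR, so that `a AND b OR c` means `(a AND b) OR c`. The sketch below illustrates that reading on a made-up index. `toy_boolean` and `toy_index` are hypothetical names, and the precedence rule is an assumption rather than something the assignment prescribes.

```python
def toy_boolean(query, index):
    """Evaluate 'a AND b OR c' as (a AND b) OR c over a term -> set-of-IDs index."""
    result = set()
    # Each OR-separated part is a conjunction of terms.
    for conjunct in query.lower().split(" or "):
        terms = [t for t in conjunct.split() if t != "and"]
        if not terms:
            continue
        # Copy the first postings set so the index itself is never modified.
        matches = set(index.get(terms[0], set()))
        for term in terms[1:]:
            matches &= index.get(term, set())
        result |= matches
    return sorted(result)

# Made-up postings (not Cranfield data).
toy_index = {"wing": {1, 3}, "flutter": {1}, "heat": {2, 3}}
print(toy_boolean("wing AND flutter", toy_index))          # [1]
print(toy_boolean("wing AND flutter OR heat", toy_index))  # [1, 2, 3]
print(toy_boolean("wing AND heat", toy_index))             # [3]
```

Intersection corresponds to AND and union to OR; a term missing from the index contributes an empty set, which empties its conjunction but leaves the other OR branches untouched.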


In [35]:
# TODO: Create a function for retrieving documents for a Boolean query (AND/OR) with query terms

def boolean_retrieve(query: str):
    """
    Retrieve documents for a Boolean query (AND/OR) with query terms.
    
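    Args:
        query (str): Boolean query string (e.g., "wing AND flutter", "heat OR transfer").
            AND binds tighter than OR, so "a AND b OR c" is read as "(a AND b) OR c".
            Parentheses and NOT are not supported.

    Returns:
        list: Sorted list of document IDs (as stored in inverted_index) that match
              the Boolean query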
    """
    # Convert query to lowercase and strip whitespace
    query = query.lower().strip()
    
    # Handle empty query
    if not query:
        return []
    
    # Split query into tokens
    tokens = query.split()
    
    # Filter out stopwords but keep AND/OR operators
    filtered_tokens = []
    for token in tokens:
        if token in ['and', 'or'] or (token not in STOPWORDS and token.strip()):
            filtered_tokens.append(token)
    
    if not filtered_tokens:
        return []
    
    # Handle OR operations first (lower precedence)
    # Split by OR and process each OR group with AND operations
    or_groups = []
    current_group = []
    
    for token in filtered_tokens:
        if token == 'or':
            if current_group:
                or_groups.append(current_group)
                current_group = []
        else:
            current_group.append(token)
    
    # Add the last group
    if current_group:
        or_groups.append(current_group)
    
    # Process each OR group (handle AND operations within each group)
    final_results = set()
    
    for group in or_groups:
        # Filter out 'and' tokens and keep only search terms
        terms = [token for token in group if token != 'and']
        
        if not terms:
            continue
            
        # For AND operation: start with first term and intersect with others
        group_result = inverted_index.get(terms[0], set()).copy()
        
        for term in terms[1:]:
            term_docs = inverted_index.get(term, set())
            group_result = group_result.intersection(term_docs)
        
        # Union this group's results with final results (OR operation between groups)
        final_results = final_results.union(group_result)
    
    # Return matching document IDs in ascending order
    return sorted(final_results)


In [31]:
# Do not change this code
boolean_queries = [
  "gas AND pressure",
  "structural AND aeroelastic AND flight AND high AND speed OR aircraft",
  "heat AND conduction AND composite AND slabs",
  "boundary AND layer AND control",
  "compressible AND flow AND nozzle",
  "combustion AND chamber AND injection",
  "laminar AND turbulent AND transition",
  "fatigue AND crack AND growth",
  "wing AND tip AND vortices",
  "propulsion AND efficiency"
]

In [36]:
# Run Boolean queries in batch, using the function you created
def run_batch_boolean(queries):
    results = {}
    for i, q in enumerate(queries, 1):
        res = boolean_retrieve(q)
        results[f"Q{i}"] = res
    return results

boolean_results = run_batch_boolean(boolean_queries)
for qid, res in boolean_results.items():
    print(qid, "=>", res[:5])


Q1 => [26, 48, 84, 100, 109]
Q2 => [11, 13, 28, 46, 50]
Q3 => [4, 398]
Q4 => [0, 60, 243, 264, 341]
Q5 => [117, 130]
Q6 => []
Q7 => [6, 8, 79, 88, 95]
Q8 => []
Q9 => [674]
Q10 => [967]


## Part 1.2 – TF-IDF Indexing


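$tf_{t,d} = \text{raw frequency of term } t \text{ in document } d$

$idf_t = \log\left(\frac{N}{df_t}\right)$, where $N$ is the number of documents in the collection and $df_t$ is the number of documents that contain term $t$.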

### Build document–term matrix (TF and IDF weights)

Compute tf and idf using the formulas above and store the weights in a document–term matrix (rows = documents, columns = terms).
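
A small worked example may help check the arithmetic before scaling up to the full collection. The sketch below applies raw term frequency and $\log(N / df_t)$ to three made-up documents. It assumes the natural logarithm, since the formula does not state a base, and `tfidf_weight` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

# Made-up tokenized documents (not Cranfield data).
toy_docs = [
    ["wing", "flutter", "wing"],
    ["heat", "transfer"],
    ["wing", "heat"],
]
N = len(toy_docs)

# df_t: number of documents that contain term t at least once.
df = Counter(term for doc in toy_docs for term in set(doc))

def tfidf_weight(term, doc_tokens):
    """Raw term frequency times log(N / df_t), using the natural logarithm."""
    tf = doc_tokens.count(term)
    return tf * math.log(N / df[term])

print(round(tfidf_weight("wing", toy_docs[0]), 4))     # 0.8109, i.e. 2 * ln(3/2)
print(round(tfidf_weight("flutter", toy_docs[0]), 4))  # 1.0986, i.e. 1 * ln(3/1)
```

Although "wing" occurs twice in the first document, it appears in two of the three documents, so its weight ends up below that of "flutter", which occurs once but is unique to that document. This sketch only handles terms that occur somewhere in the collection; a term with $df_t = 0$ would need a guard against division by zero.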



In [None]:
# TODO: Calculate the weights for the documents and the terms using tf and idf weighting. Put these values into a document–term matrix (rows = documents, columns = terms).

# Your code here
import math
import numpy as np
from collections import Counter

def calculate_tf_idf_matrix():
    """
    Calculate TF-IDF weights using previously created variables:
    - tokenized_documents (from tokenization cell)
    - vocabulary (from build_vocabulary cell)
    - inverted_index (from inverted index cell)
    
    Returns:
        tuple: (tf_idf_matrix, terms_list)
    """
    # Convert vocabulary to sorted list for consistent column ordering
    terms_list = sorted(list(vocabulary))
    num_docs = len(tokenized_documents)
    num_terms = len(terms_list)
    
    # Create term to index mapping
    term_to_idx = {term: idx for idx, term in enumerate(terms_list)}
    
    # Initialize TF-IDF matrix
    tf_idf_matrix = np.zeros((num_docs, num_terms))
    
    # Calculate TF-IDF for each document
    for doc_idx, doc_tokens in enumerate(tokenized_documents):
        # Calculate term frequency (tf) for this document
        term_counts = Counter(doc_tokens)
        
        for term in term_counts:
            if term in term_to_idx:
                term_idx = term_to_idx[term]
                
                # TF: Raw frequency
                tf = term_counts[term]
                
                # IDF: log(N / df_t) - use inverted_index to get df_t
                df_t = len(inverted_index[term])  # Number of docs containing this term
                idf = math.log(num_docs / df_t)
                
                # TF-IDF weight
                tf_idf = tf * idf
                
                tf_idf_matrix[doc_idx, term_idx] = tf_idf
    
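    return tf_idf_matrix, terms_list

# Build the matrix once so that later cells can reuse it
tf_idf_matrix, terms_list = calculate_tf_idf_matrix()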


### Build TF–IDF document vectors

From the matrix, build a TF–IDF vector for each document (consider normalization if needed for cosine similarity).
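
Why normalization matters for cosine similarity can be seen from the definition. Writing $\vec{q}$ for a query vector and $\vec{d}$ for a document vector (notation introduced here for illustration, not taken from the assignment text), and assuming both are nonzero:

$$
\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert_2 \, \lVert \vec{d} \rVert_2}
$$

If every vector is first divided by its own L2 norm, both norms in the denominator equal 1 and the score reduces to a plain dot product:

$$
\hat{q} = \frac{\vec{q}}{\lVert \vec{q} \rVert_2}, \quad \hat{d} = \frac{\vec{d}}{\lVert \vec{d} \rVert_2} \quad \Longrightarrow \quad \cos(\vec{q}, \vec{d}) = \hat{q} \cdot \hat{d}
$$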


In [None]:

# TODO: Build TF–IDF document vectors from the document–term matrix
# Your code here

def build_document_vectors(tf_idf_matrix, normalize=True):
    """
    Build TF-IDF document vectors from the document-term matrix.
    
    Args:
        tf_idf_matrix (np.array): TF-IDF matrix (rows = docs, cols = terms)
        normalize (bool): Whether to normalize vectors for cosine similarity
        
    Returns:
        np.array: Document vectors (normalized if specified)
    """
    document_vectors = tf_idf_matrix.copy()
    
    if normalize:
        # L2 normalization for each document vector (row)
        # This prepares vectors for cosine similarity calculations
        norms = np.linalg.norm(document_vectors, axis=1, keepdims=True)
        # Avoid division by zero for empty documents
        norms = np.where(norms == 0, 1, norms)
        document_vectors = document_vectors / norms
    
    return document_vectors


### Implement cosine similarity

Implement a function to compute cosine similarity scores between a (tokenized) query and all documents.
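
As a rough sketch of the scoring step, the snippet below computes cosine similarity between one query vector and a few made-up document vectors, then ranks the documents by score. It uses plain Python lists to stay dependency-free; `cosine`, `rank_documents`, and the toy vectors are hypothetical and are not the assignment's required interface.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (row_index, score) pairs sorted by descending score."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 3-term weight vectors (not Cranfield data).
toy_docs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
toy_query = [1.0, 0.0, 0.0]

ranking = rank_documents(toy_query, toy_docs)
print([i for i, _ in ranking])  # [0, 2, 1]
```

In the notebook itself, the query would first have to be tokenized and mapped onto the same term columns as the document matrix, so that the query vector and each document row have matching dimensions.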


In [None]:

# TODO: Create a function for calculating the similarity score of all the documents by their relevance to query terms

def tfidf_retrieve(query: str):
    # Your code here
    raise NotImplementedError("tfidf_retrieve is not implemented yet")


In [None]:
# Do not change this code
tfidf_queries = [
  "gas pressure",
  "structural aeroelastic flight high speed aircraft",
  "heat conduction composite slabs",
  "boundary layer control",
  "compressible flow nozzle",
  "combustion chamber injection",
  "laminar turbulent transition",
  "fatigue crack growth",
  "wing tip vortices",
  "propulsion efficiency"
]

In [None]:
# Run TF-IDF queries in batch (print top-5 results for each), using the function you created
def run_batch_tfidf(queries):
    results = {}
    for i, q in enumerate(queries, 1):
        res = tfidf_retrieve(q)
        results[f"Q{i}"] = res
    return results

tfidf_results = run_batch_tfidf(tfidf_queries)

for qid, res in tfidf_results.items():
    print(qid, "=>", res[:5])



## Part 1.3 – Conceptual Questions

Answer the following questions:

**1. What is the difference between data retrieval and information retrieval?**
*Your answer here*

**For each of the following scenarios, which approach would be more suitable: data retrieval or information retrieval? Explain your reasoning.** <br>
1.a A clerk in a pharmacy uses the following query: Medicine_name = Ibuprofen_400mg
*Your answer here*

1.b A clerk in a pharmacy uses the following query: An antibiotic medicine
*Your answer here*

1.c Searching for the schedule of a flight using the following query: Flight_ID = ZEFV2
*Your answer here*

1.d Searching an e-commerce website using the following query to find a specific shoe: Brooks Ghost 15
*Your answer here*

1.e Searching the same e-commerce website using the following query: Nice running shoes
*Your answer here*
