
# Assignment 1: Boolean Model, TF-IDF, and Data Retrieval vs. Information Retrieval Conceptual Questions

**Student names**: Ramtin Forouzandehjoo Samavat <br>
**Group number**: 30 <br>
**Date**: 09.09.2025

## Important notes
Please carefully read the following notes and consider them for the assignment delivery. Submissions that do not fulfill these requirements will not be assessed and should be submitted again.
1. You may work in groups of maximum 2 students.
2. The assignment must be delivered in ipynb format.
3. The assignment must be typed. Handwritten assignments are not accepted.

**Due date**: 14.09.2025 23:59

In this assignment, you will:
- Implement a Boolean retrieval model
- Compute TF-IDF vectors for documents
- Run retrieval on queries
- Answer conceptual questions 

---
## Dataset

You will use the **Cranfield** dataset, provided in this file:

- `cran.all.1400`: The document collection (1400 documents)

**The code to parse the file is ready; just update the `cran.all.1400` path to match your own file location. The parsed documents are stored in the `documents` variable, which the later cells use.**

### Load and parse documents (provided)

Run the cell to parse the Cranfield documents. Update the path so it points to your `cran.all.1400` file.


In [1]:
# Read 'cran.all.1400' and parse the documents into a suitable data structure

CRAN_PATH = r"cran.all.1400"  # <-- change this!

def parse_cranfield(path):
    docs = {} # Dictionary with documents.
    current_id = None
    current_field = None
    buffers = {"T": [], "A": [], "B": [], "W": []}
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(".I "):
                if current_id is not None:
                    docs[current_id] = {
                        "id": current_id,
                        "title": " ".join(buffers["T"]).strip(),
                        "abstract": " ".join(buffers["W"]).strip()
                    }
                current_id = int(line.split()[1])
                buffers = {k: [] for k in buffers}
                current_field = None
            elif line.startswith("."):
                tag = line[1:].strip()
                current_field = tag if tag in buffers else None
            else:
                if current_field is not None:
                    buffers[current_field].append(line)
    if current_id is not None:
        docs[current_id] = {
            "id": current_id,
            "title": " ".join(buffers["T"]).strip(),
            "abstract": " ".join(buffers["W"]).strip()
        }
    print(f"Parsed {len(docs)} documents.")
    return docs

documents = parse_cranfield(CRAN_PATH)



Parsed 1400 documents.


## Part 1.1 – Boolean Retrieval Model

### Tokenize documents

Implement tokenization using the given list of stopwords. Create a list of normalized terms per document (e.g., lowercase, remove punctuation/digits; drop stopwords). Store the token lists to use in later steps.

In [6]:
# TODO: Implement tokenization using the given list of stopwords, create list of terms per document

STOPWORDS = set("""a about above after again against all am an and any are aren't as at be because been
before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down
during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers
herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most
mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she
she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's
these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're
we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't
you you'd you'll you're you've your yours yourself yourselves""".split())

# Your code here

import re

def tokenize(text, stopwords):
  # Lowercase, then keep only runs of a-z. Apostrophes split words ("don't" -> "don", "t"),
  # and alphanumeric strings such as "400mg" produce no token at all.
  tokens = re.findall(r"\b[a-z]+\b", text.lower())
  tokens = [word for word in tokens if word not in stopwords] # Only keep words that are not stopwords.
  return tokens

document_tokens = {} # Maps doc_id -> list of tokens for that document.
for doc_id, doc in documents.items():
  combined_text = f"{doc['title']} {doc['abstract']}"
  document_tokens[doc_id] = tokenize(combined_text, STOPWORDS)

# Check if done correctly.
for i in range(1, 3):
  print(f"Doc {i}: {document_tokens[i][:30]} ...")


Doc 1: ['experimental', 'investigation', 'aerodynamics', 'wing', 'slipstream', 'experimental', 'investigation', 'aerodynamics', 'wing', 'slipstream', 'experimental', 'study', 'wing', 'propeller', 'slipstream', 'made', 'order', 'determine', 'spanwise', 'distribution', 'lift', 'increase', 'due', 'slipstream', 'different', 'angles', 'attack', 'wing', 'different', 'free'] ...
Doc 2: ['simple', 'shear', 'flow', 'past', 'flat', 'plate', 'incompressible', 'fluid', 'small', 'viscosity', 'simple', 'shear', 'flow', 'past', 'flat', 'plate', 'incompressible', 'fluid', 'small', 'viscosity', 'study', 'high', 'speed', 'viscous', 'flow', 'past', 'two', 'dimensional', 'body', 'usually'] ...
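
Before building the index, it can help to check how the regular expression in `tokenize` treats apostrophes and digits. The sketch below re-declares a small tokenizer with the same pattern; the function name `tokenize_sketch`, the sample sentence, and the reduced stopword set are made up for illustration.

```python
import re

def tokenize_sketch(text, stopwords):
    # Same pattern as in the cell above: runs of a-z delimited by word boundaries.
    tokens = re.findall(r"\b[a-z]+\b", text.lower())
    return [t for t in tokens if t not in stopwords]

sample_stopwords = {"the", "don't", "a"}  # hypothetical, reduced stopword set
sample = "Don't ignore the Mach-2 flow at 400mg."
tokens = tokenize_sketch(sample, sample_stopwords)
print(tokens)  # ['don', 't', 'ignore', 'mach', 'flow', 'at']
```

Contractions are split at the apostrophe, so a stopword such as `don't` never matches as a whole, and `400mg` disappears because there is no word boundary between a digit and a letter.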


### Build vocabulary

Create a set (or list) of unique terms from all tokenized documents. Report the number of unique terms.


In [7]:
# TODO: Create a set or list of unique terms

# Report: 
# - Number of unique terms

# Your code here

vocabulary = set() # A set keeps each term only once.
for tokens in document_tokens.values():
  vocabulary.update(tokens) # Add every term of this document to the vocabulary.

print(f"Number of unique terms in the vocabulary: {len(vocabulary)}")


Number of unique terms in the vocabulary: 6928


### Build inverted index

For each term, store the list (or set) of document IDs where the term appears.


In [10]:

# TODO: For each term, store list of document IDs where the term appears
# Your code here

# An inverted index maps each term to the documents that contain it.
# Sets prevent duplicate doc IDs when a term occurs several times in the same document.
inverted_index = {term: set() for term in vocabulary}

for doc_id, tokens in document_tokens.items():
  for token in tokens:
    inverted_index[token].add(doc_id)

# Check result.
for term in list(inverted_index.keys())[:3]:
    print(term, "→", inverted_index[term])


ogival → {359}
disparity → {1359}
alternate → {705, 618, 300}
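
As a small illustration of the data structure, the sketch below builds an inverted index for a three-document toy collection. The collection and the names `toy_tokens` and `toy_index` are hypothetical; `defaultdict(set)` is used here as an alternative to pre-populating the index from the vocabulary.

```python
from collections import defaultdict

# Hypothetical toy collection: doc_id -> token list.
toy_tokens = {
    1: ["wing", "flow", "wing"],
    2: ["flow", "pressure"],
    3: ["pressure", "gas", "flow"],
}

toy_index = defaultdict(set)
for doc_id, tokens in toy_tokens.items():
    for token in tokens:
        toy_index[token].add(doc_id)  # sets absorb repeated occurrences

# Sort terms and postings for a stable, readable printout.
for term in sorted(toy_index):
    print(term, "->", sorted(toy_index[term]))
```

Sets have no guaranteed iteration order, which is why the terms printed above do not appear alphabetically; sorting before printing makes the check reproducible.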


### Retrieve documents for a Boolean query (AND/OR)

Create a function to retrieve documents for a Boolean query (AND/OR) with query terms.  


In [11]:
# TODO: Create a function for retrieving documents for a Boolean query (AND/OR) with query terms

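def boolean_retrieve(query:str):
# Your code here

  # Queries have the form "term OP term OP ..." and are evaluated strictly
  # left to right: no operator precedence and no parentheses.
  query_parts = query.split()
  if not query_parts:
    return []
  result = inverted_index.get(query_parts[0].lower(), set()) # Lowercase to match the index.

  index = 1
  while index < len(query_parts):
    if index + 1 >= len(query_parts):
      raise ValueError(f"Operator without a following term: {query_parts[index]}")
    operator = query_parts[index].upper()
    term = query_parts[index + 1].lower()
    docs = inverted_index.get(term, set())

    if operator == "AND":
      result = result & docs
    elif operator == "OR":
      result = result | docs
    else:
      raise ValueError(f"Unknown operator: {operator}")

    index += 2

  return sorted(result)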


In [12]:
# Do not change this code
boolean_queries = [
  "gas AND pressure",
  "structural AND aeroelastic AND flight AND high AND speed OR aircraft",
  "heat AND conduction AND composite AND slabs",
  "boundary AND layer AND control",
  "compressible AND flow AND nozzle",
  "combustion AND chamber AND injection",
  "laminar AND turbulent AND transition",
  "fatigue AND crack AND growth",
  "wing AND tip AND vortices",
  "propulsion AND efficiency"
]

In [17]:
# Run Boolean queries in batch, using the function you created
def run_batch_boolean(queries):
    results = {}
    for i, q in enumerate(queries, 1):
        res = boolean_retrieve(q)
        results[f"Q{i}"] = res
    return results

boolean_results = run_batch_boolean(boolean_queries)
for qid, res in boolean_results.items():
    print(qid, "=>", res[:10])


Q1 => [27, 49, 85, 101, 110, 166, 168, 169, 183, 185]
Q2 => [12, 14, 29, 47, 51, 75, 76, 78, 100, 172]
Q3 => [5, 399]
Q4 => [1, 61, 244, 265, 342, 416, 792, 798, 933, 974]
Q5 => [118, 131]
Q6 => []
Q7 => [7, 9, 80, 89, 96, 142, 187, 207, 261, 294]
Q8 => []
Q9 => [675]
Q10 => [968]
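
`boolean_retrieve` applies the operators strictly from left to right. For the ten queries above this gives the same result as the conventional reading in which AND binds tighter than OR, because the only OR is the last operator in Q2. For a query such as `a OR b AND c` the two readings differ. The sketch below shows one possible precedence-aware evaluation; the function name, the toy index, and the assumption that operators are uppercase and separated by single spaces are all illustrative.

```python
def boolean_retrieve_precedence(query, index):
    """Evaluate a flat AND/OR query with AND binding tighter than OR."""
    result = set()
    # Split into OR-groups first; each group is a conjunction of terms.
    for group in query.split(" OR "):
        terms = [t.strip().lower() for t in group.split(" AND ")]
        postings = [index.get(t, set()) for t in terms]
        result |= set.intersection(*postings)
    return sorted(result)

# Hypothetical toy index: term -> set of doc ids.
toy_index = {"gas": {1, 2}, "flow": {2, 3}, "wing": {3, 4}}

# AND first: gas OR (flow AND wing) = {1, 2} | {3}
print(boolean_retrieve_precedence("gas OR flow AND wing", toy_index))  # [1, 2, 3]
```

Evaluated left to right, the same query would be `(gas OR flow) AND wing`, which yields only document 3 for this toy index.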


## Part 1.2 – TF-IDF Indexing


$tf_{i,j} = \text{Raw Frequency}$ - the number of times term $i$ appears in document $j$.

$idf_t = \log\left(\frac{N}{df_t}\right)$ - $N$ is the total number of documents and $df_t$ is the number of documents that contain term $t$.
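
To get a feel for the idf formula, the sketch below evaluates it for a few document frequencies with $N = 1400$. The document frequencies are hypothetical values chosen only to show the trend, and the natural logarithm is assumed (this is what `math.log` computes).

```python
import math

N = 1400  # size of the Cranfield collection used in this assignment

def idf(df_t, n_docs=N):
    """Inverse document frequency with the natural logarithm."""
    return math.log(n_docs / df_t)

# Hypothetical document frequencies, from very rare to present everywhere.
for df_t in (1, 14, 140, 1400):
    print(f"df = {df_t:>4}  idf = {idf(df_t):.3f}")
```

A term that occurs in every document gets an idf of 0 and therefore contributes nothing to the TF-IDF weight, while a term that occurs in a single document gets the maximum value of about 7.24.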

### Build document–term matrix (TF and IDF weights)

Compute tf and idf using the formulas above and store the weights in a document–term matrix (rows = documents, columns = terms).



In [18]:
# TODO: Calculate the weights for the documents and the terms using tf and idf weighting. Put these values into a document–term matrix (rows = documents, columns = terms).

# Your code here

import math
from collections import Counter
import pandas as pd

N = len(documents) # Total number of documents.

# Document frequency for each term.
df = {term: len(doc_ids) for term, doc_ids in inverted_index.items()}

# Inverse document frequency for each term.
idf = {term: math.log(N / doc_freq) for term, doc_freq in df.items()}

document_term_matrix = {}

for doc_id, tokens in document_tokens.items():
  tf_counter = Counter(tokens) # tf for each term in the current document.
  document_term_matrix[doc_id] = {}
  for term in vocabulary:
    tf = tf_counter.get(term, 0) # tf for specific term.
    document_term_matrix[doc_id][term] = tf * idf.get(term, 0) # TF * IDF

# Convert dictionary to DataFrame to make it easier to work with.
vocabulary_list = sorted(list(vocabulary)) # Columns do not support a set, need to convert the vocabulary to a list.
dtm_df = pd.DataFrame.from_dict(document_term_matrix, orient='index', columns=vocabulary_list)

print(dtm_df) # Check result.


       ab  abbreviated  ability  ablated  ablating  ablation  ablative  able  \
1     0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
2     0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
3     0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
4     0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
5     0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
...   ...          ...      ...      ...       ...       ...       ...   ...   
1396  0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
1397  0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
1398  0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
1399  0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   
1400  0.0          0.0      0.0      0.0       0.0       0.0       0.0   0.0   

      abrupt  abruptly  ...  zehnder   

### Build TF–IDF document vectors

From the matrix, build a TF–IDF vector for each document (consider normalization if needed for cosine similarity).


In [21]:

# TODO: Build TF–IDF document vectors from the document–term matrix
# Your code here

import numpy as np

document_vectors = {}

for doc_id, row in dtm_df.iterrows():
  vector = row.values.astype(float) # Convert each row in the document term matrix to a document vector.

  magnitude = np.linalg.norm(vector)
  if magnitude > 0:
    vector = vector / magnitude # Normalize vector so that the length is 1.

  document_vectors[doc_id] = vector

# Check vectors.
for doc_id in list(document_vectors.keys())[:3]:
  print(f"Doc {doc_id} vector: {document_vectors[doc_id][:20]}")

Doc 1 vector: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Doc 2 vector: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Doc 3 vector: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
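
The leading zeros are expected: the first columns belong to alphabetically early and rare terms such as `ab` and `abbreviated`, which do not occur in these documents. As an alternative to the row-by-row loop, the normalization can be done for the whole matrix at once. The sketch below uses a small hypothetical weight matrix; the names `weights`, `safe_norms`, and `unit_rows` are illustrative.

```python
import numpy as np

# Hypothetical 3 x 4 weight matrix (rows = documents); the last row is all zeros.
weights = np.array([
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

norms = np.linalg.norm(weights, axis=1, keepdims=True)  # one L2 norm per row
safe_norms = np.where(norms == 0, 1.0, norms)           # avoid dividing by zero
unit_rows = weights / safe_norms                        # broadcasting divides row-wise

print(unit_rows)
```

With the notebook's data, `dtm_df.to_numpy()` would take the place of `weights`.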


### Implement cosine similarity

Implement a function to compute cosine similarity scores between a (tokenized) query and all documents.
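
As a small illustration of the quantity to compute, the sketch below evaluates the cosine similarity for two hypothetical document vectors over a three-term vocabulary. The helper name `cosine_similarity` and all weights are made up for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

# Hypothetical weights over a three-term vocabulary.
query = np.array([1.0, 1.0, 0.0])
doc_a = np.array([2.0, 2.0, 0.0])  # same direction as the query
doc_b = np.array([0.0, 0.0, 5.0])  # no terms in common with the query

print(cosine_similarity(query, doc_a))  # close to 1
print(cosine_similarity(query, doc_b))  # 0.0
```

Vector length does not affect the score, only direction does. When both vectors are already normalized to unit length, the cosine similarity reduces to a plain dot product.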


In [22]:

# TODO: Create a function for calculating the similarity score of all the documents by their relevance to query terms

def tfidf_retrieve(query: str):
    # Your code here

    # Tokenize the query the same way as the documents so the terms are comparable.
    query_terms = tokenize(query, STOPWORDS)
    tf_counter_query = Counter(query_terms) # tf for each term in the query.

    # The vector must match the vocabulary size so query and document vectors are comparable.
    query_vector = np.zeros(len(vocabulary_list))

    for index, term in enumerate(vocabulary_list):
      if term in tf_counter_query:
        tf_query = tf_counter_query[term] # Term frequency of this term in the query.
        query_vector[index] = tf_query * idf.get(term, 0)

    # Normalize query vector.
    magnitude_query = np.linalg.norm(query_vector)
    if magnitude_query > 0:
      query_vector = query_vector / magnitude_query

    scores = {} # Values between 0 (no shared terms) and 1 (same direction).
    for doc_id, doc_vector in document_vectors.items():
      # Both vectors have unit length, so the dot product equals the cosine similarity.
      similarity = float(np.dot(query_vector, doc_vector))
      scores[doc_id] = similarity

    ranked_documents = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    return ranked_documents


In [23]:
# Do not change this code
tfidf_queries = [
  "gas pressure",
  "structural aeroelastic flight high speed aircraft",
  "heat conduction composite slabs",
  "boundary layer control",
  "compressible flow nozzle",
  "combustion chamber injection",
  "laminar turbulent transition",
  "fatigue crack growth",
  "wing tip vortices",
  "propulsion efficiency"
]

In [24]:
# Run TF-IDF queries in batch (print top-4 results for each), using the function you created
def run_batch_tfidf(queries):
    results = {}
    for i, q in enumerate(queries, 1):
        res = tfidf_retrieve(q)
        results[f"Q{i}"] = res
    return results

tfidf_results = run_batch_tfidf(tfidf_queries)

for qid, res in tfidf_results.items():
    print(qid, "=>", res[:4])


Q1 => [(169, 0.3264772671329021), (1286, 0.3008925433931835), (167, 0.298238680232968), (185, 0.2822065495985724)]
Q2 => [(12, 0.5274627281621773), (51, 0.3411923675365786), (746, 0.29878572016098404), (875, 0.29208208108755074)]
Q3 => [(399, 0.6253357148814741), (144, 0.46639427061347216), (485, 0.4441821036132606), (5, 0.406749496370075)]
Q4 => [(368, 0.3952257052043), (748, 0.3746534145938327), (638, 0.35794322304922394), (451, 0.30243005168793896)]
Q5 => [(389, 0.29052624820998024), (118, 0.27718780183272584), (1187, 0.2664773147364016), (172, 0.2603483717758538)]
Q6 => [(974, 0.24580667067168463), (628, 0.23388314284392048), (397, 0.22916719468407404), (308, 0.22385350738346987)]
Q7 => [(418, 0.5512607808825641), (1264, 0.4246955942843387), (315, 0.37108249990632514), (272, 0.3625940045253208)]
Q8 => [(768, 0.40030390719716785), (726, 0.3739708023618002), (1196, 0.3737047186693344), (883, 0.32582652155412056)]
Q9 => [(1284, 0.35384882554362423), (433, 0.3097586233179562), (675, 0.


## Part 1.3 – Conceptual Questions

Answer the following questions:

**1. What is the difference between data retrieval and information retrieval?**

*Your answer here*:

The difference lies in the type of data and in how queries are matched. Data retrieval operates on structured data with a well-defined schema: a query specifies exact conditions, and every returned record satisfies them completely. Information retrieval operates on unstructured or semi-structured data such as text documents: queries are often written in natural language and may be vague, so the system ranks documents by estimated relevance instead of returning exact matches.
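
The contrast can be sketched with a toy example. Everything below is hypothetical: the catalogue, its field names, and the very simple word-overlap score, which stands in for a real ranking function.

```python
# Hypothetical mini-catalogue; field names and entries are made up.
catalogue = [
    {"id": "M1", "name": "Ibuprofen_400mg", "description": "pain relief anti inflammatory tablet"},
    {"id": "M2", "name": "Amoxicillin_500mg", "description": "antibiotic capsule for bacterial infection"},
    {"id": "M3", "name": "Doxycycline_100mg", "description": "broad spectrum antibiotic tablet"},
]

def data_retrieval(field, value):
    """Exact match on a structured field: a record either matches or it does not."""
    return [r["id"] for r in catalogue if r[field] == value]

def information_retrieval(query):
    """Rank records by the number of query words found in the description."""
    words = set(query.lower().split())
    scored = [(len(words & set(r["description"].split())), r["id"]) for r in catalogue]
    # Highest score first; ties are broken by id; records with score 0 are dropped.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

print(data_retrieval("name", "Ibuprofen_400mg"))    # ['M1']
print(information_retrieval("antibiotic tablet"))   # ['M3', 'M1', 'M2']
```

The exact-match lookup returns one record or nothing, while the ranked search returns partially matching records in order of how well they fit the query.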

**For each of the following scenarios, which approach is more suitable: data retrieval or information retrieval? Explain your reasoning.** <br>
1.a A clerk in a pharmacy uses the following query: Medicine_name = Ibuprofen_400mg

*Your answer here*:

In this scenario, it would be most suitable to use data retrieval, because the query specifies an exact value for a predefined field, indicating that the user is looking for an exact match for a specific medicine.

1.b A clerk in a pharmacy uses the following query: An anti-biotic medicine

*Your answer here*:

In this scenario, it would be most suitable to use information retrieval, because the query is in natural language and is vague. The system should return all relevant medicines matching the description rather than an exact match.

1.c Searching for the schedule of a flight using the following query: Flight_ID = ZEFV2

*Your answer here*:

In this scenario, it would be most suitable to use data retrieval, because the query specifies a unique identifier for an exact match.

1.d Searching an e-commerce website using the following query to find a specific shoe: Brooks Ghost 15

*Your answer here*:

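In this scenario, it would be most suitable to use information retrieval. Although the query names a specific product, it is free text rather than a lookup on a key field, so the system has to match it against product titles and descriptions, rank the results by relevance, and handle different variations of the product.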

1.e Searching the same e-commerce website using the following query: Nice running shoes

*Your answer here*:

In this scenario, it would be most suitable to use information retrieval, because the query is vague and in natural language, requiring the system to identify and rank products that match the description rather than an exact match.
