# Assignment 3: Local Association Matrix 

**Student names**: Ramtin Forouzandehjoo Samavat <br>
**Group number**: 30 <br>
**Date**: 05.10.2025

## Important notes
Please read and follow these rules. Submissions that do not fulfill them may be returned.
1. You may work in groups of maximum 2 students.
2. Submit in **.ipynb** format only.
3. The assignment must be typed. Handwritten answers are not accepted.

**Due date**: 12.10.2025 23:59

### What you will do 
- Build a **local association matrix** from the Cranfield collection.
- Compute the **normalized association matrix**.
- Use the normalized matrix to **identify neighborhood terms** for expansion for given queries.


---
## Dataset

You will use the **Cranfield** dataset, provided in this file:

- `cran.all.1400`: The document collection (1400 documents)

**The parsing code is provided; just update the Cranfield file path to match your own file location. Use the parsed output (the `documents` variable assigned in the cell below) in your code.**


### Load and parse documents (provided)

Run the cell to parse the Cranfield documents. Update the path so it points to your `cran.all.1400` file.

In [10]:
# Read 'cran.all.1400' and parse the documents into a suitable data structure

CRAN_PATH = r"cran.all.1400"  # <-- change this!

def parse_cranfield(path):
    docs = {}
    current_id = None
    current_field = None
    buffers = {"T": [], "A": [], "B": [], "W": []}
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(".I "):
                if current_id is not None:
                    docs[current_id] = {
                        "id": current_id,
                        "title": " ".join(buffers["T"]).strip(),
                        "abstract": " ".join(buffers["W"]).strip()
                    }
                current_id = int(line.split()[1])
                buffers = {k: [] for k in buffers}
                current_field = None
            elif line.startswith("."):
                tag = line[1:].strip()
                current_field = tag if tag in buffers else None
            else:
                if current_field is not None:
                    buffers[current_field].append(line)
    if current_id is not None:
        docs[current_id] = {
            "id": current_id,
            "title": " ".join(buffers["T"]).strip(),
            "abstract": " ".join(buffers["W"]).strip()
        }
    print(f"Parsed {len(docs)} documents.")
    return docs

documents = parse_cranfield(CRAN_PATH)

Parsed 1400 documents.


## 3.1  Local association matrix

For the given Cranfield document collection in `cran.all.1400`, construct a local association matrix to identify association clusters. Use the parsed documents (the `documents` variable from the cell above). Omit the stopwords in the `STOPWORDS` list given below from the vocabulary.


The correlation factors $c_{u,v}$ between any pair of terms $w_u$ and $w_v$ are defined as  
$c_{u,v} = \sum_{d_j \in D} f_{u,j} \cdot f_{v,j}$  

$f_{u,j}$ is the raw term frequency of $w_u$ in document $d_j$.
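
As a small illustration of this definition, the sketch below builds the correlation factors for a hypothetical two-document, three-term collection. The matrix `F` and the term names are made up for the example; the only point is that summing $f_{u,j} \cdot f_{v,j}$ over documents is the same as the matrix product $F^T F$.

```python
import numpy as np

# Hypothetical toy document-term matrix F: rows are documents d_j,
# columns are terms w_u, cells are raw frequencies f_{u,j}.
terms = ["flow", "layer", "wing"]
F = np.array([
    [2, 1, 0],   # d_1
    [0, 3, 1],   # d_2
])

# c_{u,v} = sum_j f_{u,j} * f_{v,j}, which is the matrix product F^T F.
C = F.T @ F

# The same cell computed directly from the definition, as a cross-check.
c_flow_layer = sum(F[j, 0] * F[j, 1] for j in range(F.shape[0]))
assert C[0, 1] == c_flow_layer

print(C)
```

The diagonal cells $c_{u,u}$ are the sums of squared frequencies of each term, which is what the normalization in section 3.2 relies on.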

### Weighting variants: **scalar** and **metric**

Add two alternative weighting schemes for the matrix (only the formula for assigning the matrix cell value changes):

- **Metric weighting**:
Let $w_u(n,j)$ and $w_v(m,j)$ denote the $n$-th and $m$-th occurrences of terms $w_u$ and $w_v$ in document $d_j$.  
Define a distance function $r(w_u(n,j), w_v(m,j))$ between two occurrences (e.g., $r(i,k) = 1 + |i - k|$, where $i$ and $k$ are the token positions of the two occurrences in $d_j$).  
Then:

$$
c_{u,v} = \sum_{d_j \in D} \sum_n \sum_m \frac{1}{r(w_u(n,j), w_v(m,j))}
$$


- **Scalar weighting**:
Let $\vec{s}_u = \langle c_{u,x_1}, c_{u,x_2}, \dots, c_{u,x_n} \rangle$ be the neighborhood vector of term $w_u$, and similarly for $w_v$.  
Then:

$$
c_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \cdot |\vec{s}_v|}
$$
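
Both variants can be checked by hand on tiny inputs. The following is a minimal sketch, not the assignment solution: `metric_association` and `scalar_association` are hypothetical helper names, the token list and the 2x2 matrix are made up, and the metric helper assumes the example distance $r(i,k) = 1 + |i - k|$ over token positions.

```python
from collections import defaultdict

import numpy as np


def metric_association(tokens):
    """Metric weights for one document: the sum of 1 / (1 + |i - k|) over all
    pairs of token positions i, k (including i == k, as in the double sum)."""
    weights = defaultdict(float)
    for i, term_u in enumerate(tokens):
        for k, term_v in enumerate(tokens):
            weights[(term_u, term_v)] += 1.0 / (1 + abs(i - k))
    return weights


def scalar_association(C):
    """Cosine similarity between the rows (neighborhood vectors) of C."""
    C = np.asarray(C, dtype=float)
    norms = np.linalg.norm(C, axis=1)
    norms[norms == 0] = 1.0           # all-zero rows keep similarity 0
    unit_rows = C / norms[:, None]    # scale every row to unit length
    return unit_rows @ unit_rows.T    # pairwise dot products = cosines


# Hypothetical one-document example: "wing" at positions 0 and 2, "flow" at 1.
m = metric_association(["wing", "flow", "wing"])
print(m[("wing", "flow")])

# Hypothetical 2x2 association matrix.
S = scalar_association([[1, 0], [1, 1]])
print(S)
```

In the toy document each "wing" occurrence is one position away from "flow", so the pair contributes $1/2 + 1/2$. Summing such per-document weights over all documents gives the metric matrix.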

In [11]:
# TODO: Construct a local association matrix for the cranfield collection. Use both weighting variants.

STOPWORDS = set("""a about above after again against all am an and any are aren't as at be because been
before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down
during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers
herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most
mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she
she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's
these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're
we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't
you you'd you'll you're you've your yours yourself yourselves""".split())

In [12]:
# Tokenize documents and remove stopwords.

import re

def tokenize(text, stopwords):
  tokens = re.findall(r"\b[a-z]+\b", text.lower()) # Lowercase the text and extract runs of the letters a-z.
  tokens = [word for word in tokens if word not in stopwords] # Keep only words that are not stopwords.
  return tokens

document_tokens = {} # Maps each doc_id to that document's list of tokens.
for doc_id, doc in documents.items():
  combined_text = f"{doc['title']} {doc['abstract']}"
  document_tokens[doc_id] = tokenize(combined_text, STOPWORDS)

# Check if tokenization is done correctly.
for i in range(1, 2):
  print(f"Doc {i}: {document_tokens[i][:30]} ...")

Doc 1: ['experimental', 'investigation', 'aerodynamics', 'wing', 'slipstream', 'experimental', 'investigation', 'aerodynamics', 'wing', 'slipstream', 'experimental', 'study', 'wing', 'propeller', 'slipstream', 'made', 'order', 'determine', 'spanwise', 'distribution', 'lift', 'increase', 'due', 'slipstream', 'different', 'angles', 'attack', 'wing', 'different', 'free'] ...


In [13]:
# Create a term vocabulary.

vocabulary = set() # A set keeps only unique terms.
for tokens in document_tokens.values():
  vocabulary.update(tokens) # Add the terms of each document to the vocabulary.

# Turn it into a dictionary that maps each term to its column index in the document-term matrix.
vocabulary = {term: index for index, term in enumerate(sorted(vocabulary))}

print(f"Number of terms in the vocabulary: {len(vocabulary)}")

Number of terms in the vocabulary: 6928


In [14]:
# Create a document term frequency matrix with raw term frequency

from collections import Counter
from scipy.sparse import lil_matrix

number_documents = len(document_tokens)
number_tokens = len(vocabulary)

# Initialize matrix
dtm = lil_matrix((number_documents, number_tokens)) # Creates a number_docs x number_tokens empty matrix.

# Mapping from doc id to row indices in the matrix.
doc_index = {doc_id: index for index, doc_id in enumerate(document_tokens.keys())}

for doc_id, tokens in document_tokens.items():
  row = doc_index[doc_id]
  tf_counter = Counter(tokens) # tf for each term in the current document.
  for term, tf in tf_counter.items():
    col = vocabulary[term]
    dtm[row, col] = tf # Assign raw term frequency

print("Document-term matrix shape:", dtm.shape)

Document-term matrix shape: (1400, 6928)


In [15]:
# Local Association Matrix (Raw Correlation)

lam = dtm.T @ dtm # Term–term local association matrix.

print("Local Association Matrix shape:", lam.shape)

Local Association Matrix shape: (6928, 6928)


In [16]:
# Metric Weighting Local Association Matrix

number_terms = len(vocabulary)

metric_lam = lil_matrix((number_terms, number_terms), dtype=float) # Initialize metric weighted Matrix

for tokens in document_tokens.values(): # Iterate over each document.
  for position1, term1 in enumerate(tokens): # Every occurrence of the first term.
    row = vocabulary[term1] # Every token is in the vocabulary by construction.

    for position2, term2 in enumerate(tokens): # Every occurrence of the second term.
      distance = 1 + abs(position1 - position2) # r(i, k) = 1 + |i - k|

      # Increment the matrix cell by the inverse distance.
      metric_lam[row, vocabulary[term2]] += 1 / distance

print("Metric Weighting Local Association Matrix shape:", metric_lam.shape)

Metric Weighting Local Association Matrix shape: (6928, 6928)


In [17]:
# Scalar Weighting Local Association Matrix

import numpy as np
from numpy.linalg import norm

number_terms = len(vocabulary)

lam_array = lam.toarray() # Convert sparse matrix to dense array for cosine computation.

# Norm |s_u| of every neighborhood vector (matrix row), computed once.
row_norms = norm(lam_array, axis=1)

# Scale each row to unit length; all-zero rows stay zero to avoid division by zero.
safe_norms = np.where(row_norms == 0, 1.0, row_norms)
unit_rows = lam_array / safe_norms[:, None]

# Cosine similarity (scalar weighting) for every term pair: dot products of the unit rows.
scalar_lam = unit_rows @ unit_rows.T

print("Scalar Weighting Local Association Matrix shape:", scalar_lam.shape)

Scalar Weighting Local Association Matrix shape: (6928, 6928)


## 3.2 Normalized association matrix

Compute the normalized association matrix from the unnormalized matrix computed above. 

To normalize the matrix use the following formula: <br>
$c'_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$  
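
To make the formula concrete, here is a small sketch on a hypothetical 3x3 association matrix. `normalize_association` is an illustrative helper name, not something the assignment provides. It applies the formula to all cells at once with NumPy broadcasting and leaves cells with a zero denominator at 0.

```python
import numpy as np


def normalize_association(C):
    """Apply c'_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v}) to every cell."""
    C = np.asarray(C, dtype=float)
    diag = np.diag(C)                             # c_{u,u} for each term
    denom = diag[:, None] + diag[None, :] - C     # c_{u,u} + c_{v,v} - c_{u,v}
    out = np.zeros_like(C)
    np.divide(C, denom, out=out, where=denom != 0)  # zero denominators stay 0
    return out


# Hypothetical unnormalized association matrix (symmetric, made-up values).
C = np.array([
    [4, 2, 0],
    [2, 10, 3],
    [0, 3, 1],
])
N = normalize_association(C)
print(N)
```

For example, the cell for terms 0 and 1 becomes $2 / (4 + 10 - 2) = 1/6$, and every diagonal cell becomes 1 because $c_{u,u} / (c_{u,u} + c_{u,u} - c_{u,u}) = 1$.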


In [21]:
#TODO: Compute the normalized association matrix 

# Your code here

dense_lam = lam.toarray() # Convert sparse to dense matrix.

number_terms = len(vocabulary)

normalized_lam = np.zeros_like(dense_lam, dtype=float) # new array of zeros with the same shape as given input array.

diagonal = np.diag(dense_lam) # self-correlation values c_{u,u} for each term

# Denominator c_{u,u} + c_{v,v} - c_{u,v} for every term pair, via broadcasting.
denominator = diagonal[:, None] + diagonal[None, :] - dense_lam

# c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v}); cells with a zero denominator stay 0.0.
np.divide(dense_lam, denominator, out=normalized_lam, where=denominator != 0)


print("Value range before normalization:", lam.min(), lam.max())
print("Value range after normalization:", normalized_lam.min(), normalized_lam.max())

Value range before normalization: 0.0 9649.0
Value range after normalization: 0.0 1.0


## 3.3 Neighborhood terms

With the help of the normalized local association matrix, identify the neighborhood terms that should be used for expansion for the following queries (queries_assignment3):
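
One possible way to read neighborhood terms off a normalized matrix is sketched below on a hypothetical four-term vocabulary with made-up association values. The helper names `top_neighbors` and `expand_query` are illustrative. Filtering out terms that already appear in the query and keeping the ranking order are design choices of this sketch, not requirements stated by the assignment.

```python
import numpy as np


def top_neighbors(normalized, vocabulary, term, k=2):
    """Return the k terms with the highest normalized association to `term`,
    excluding the term itself. Unknown terms yield an empty list."""
    if term not in vocabulary:
        return []
    index_to_term = {i: t for t, i in vocabulary.items()}
    row = normalized[vocabulary[term]]
    order = np.argsort(-row, kind="stable")   # descending; ties keep index order
    return [index_to_term[i] for i in order if i != vocabulary[term]][:k]


def expand_query(query, normalized, vocabulary, k=2):
    """Collect neighbors of every query term, skipping terms already in the
    query and duplicates, while preserving the ranking order."""
    query_terms = query.split()
    expansion = []
    for term in query_terms:
        for neighbor in top_neighbors(normalized, vocabulary, term, k):
            if neighbor not in query_terms and neighbor not in expansion:
                expansion.append(neighbor)
    return expansion


# Hypothetical vocabulary and normalized association matrix (made-up values).
vocab = {"flow": 0, "layer": 1, "wing": 2, "boundary": 3}
N = np.array([
    [1.0, 0.2, 0.1, 0.3],
    [0.2, 1.0, 0.4, 0.9],
    [0.1, 0.4, 1.0, 0.0],
    [0.3, 0.9, 0.0, 1.0],
])
print(expand_query("boundary layer", N, vocab, k=2))
```

Here "boundary" and "layer" are each other's strongest neighbor, so after removing terms already in the query only the next-best candidates remain as expansion terms.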


In [22]:
# Do not change this code
queries_assignment3 = [
  "gas pressure",
  "structural aeroelastic flight high speed aircraft",
  "heat conduction composite slabs",
  "boundary layer control",
  "compressible flow nozzle",
  "combustion chamber injection",
  "laminar turbulent transition",
  "fatigue crack growth",
  "wing tip vortices",
  "propulsion efficiency"
]

In [38]:
#TODO: Identify neighborhood terms for queries_assignment3

# Your code here

# Reverse vocabulary (index -> term) for mapping index from matrix to term.
index_to_term = {index: term for term, index in vocabulary.items()}

neighborhood_terms = {}

top_k = 5

for query in queries_assignment3:

  query_terms = query.split()

  for query_term in query_terms:

    if query_term not in vocabulary: continue  # skip unknown terms

    query_index = vocabulary[query_term]
    associations = normalized_lam[query_index, :] # row for query term

    # Get top_k indices with the highest association (excluding the term itself)
    top_indices = np.argsort(associations)[::-1]
    top_indices = [index for index in top_indices if index != query_index][:top_k]

    # Convert indices back to terms
    top_terms = [index_to_term[index] for index in top_indices]

    neighborhood_terms[query_term] = top_terms


# Expansion terms for all queries
print("Query Expansion Terms:\n")

for query in queries_assignment3:

  query_terms = query.split()

  expansion_terms = []
  for term in query_terms:
    if term in neighborhood_terms:
      expansion_terms.extend(neighborhood_terms[term])

  expansion_terms = list(set(expansion_terms)) # Remove duplicates by using set

  print(f"Query: '{query}'")
  print(f"Expansion terms: {expansion_terms}\n")


Query Expansion Terms:

Query: 'gas pressure'
Expansion terms: ['mach', 'equilibrium', 'number', 'injection', 'ideal', 'flow', 'results', 'real', 'air', 'jet']

Query: 'structural aeroelastic flight high speed aircraft'
Expansion terms: ['altitude', 'aircraft', 'range', 'effects', 'may', 'speed', 'fatigue', 'test', 'stations', 'vtol', 'high', 'low', 'characteristics', 'loads', 'entirely', 'structural', 'numbers', 'slipstream', 'speeds', 'ground', 'thermo', 'random', 'effect', 'structure', 'responses', 'piston']

Query: 'heat conduction composite slabs'
Expansion terms: ['boundary', 'variational', 'transfer', 'slabs', 'controlled', 'solid', 'shielded', 'medium', 'trail', 'input', 'slab', 'temperature', 'laminar', 'refractory', 'periodic', 'radiation', 'layer', 'melting', 'composite']

Query: 'boundary layer control'
Expansion terms: ['boundary', 'use', 'characteristics', 'longitudinal', 'transfer', 'number', 'trimmed', 'flow', 'laminar', 'wall', 'utilizing', 'layer']

Query: 'compressib