<H3>PRI 2023/24: second
    project delivery</H3>

**GROUP 1**
- Amanda Tofthagen, 113124
- Tora Kristine Løtveit, 112927
- Tuva Grønvold Natvig, 113107

<H2>Main facilities</H2>

#### Loading and preproccesing data (using with functions made in project 1)

In [2]:
import json
import os
import time
import xml.etree.ElementTree as ET
from collections import defaultdict
from string import punctuation
import nltk  
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ndcg_score, average_precision_score
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/torakristinelotveit/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/torakristinelotveit/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/torakristinelotveit/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/torakristinelotveit/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
def preprocess_text(text):
    # Convert text to lower case and tokenize
    tokens = word_tokenize(text.lower())
    # Remove punctuation
    tokens = [token for token in tokens if token not in punctuation]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Load metadata as both dataframe and list
def load_metadata(file_path):
    df = pd.read_csv(file_path, low_memory=False)
    df = df[['cord_uid', 'title', 'abstract']].dropna()  # Keep only required columns
    df['title'] = df['title'].astype(str)
    df['abstract'] = df['abstract'].astype(str)

    # Store as a list of "title + abstract" for ranking models
    doc_list = df.apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
    
    return df, doc_list  # Return both formats


# Load qrels
def load_qrels(file_path):
    qrels = defaultdict(dict)
    with open(file_path, 'r') as f:
        for line in f:
            topic_id, _, doc_id, relevance = line.strip().split()
            qrels[topic_id][doc_id] = int(relevance)
    return qrels

def load_queries(file_path):
    tree = ET.parse(file_path)
    root = tree.getroot()
    queries = {}
    for topic in root.findall('topic'):
        topic_number = topic.get('number')
        query_text = preprocess_text(topic.find('query').text)
        queries[topic_number] = " ".join(query_text)  # Ensure consistency
    return queries

metadata_path = "data2/metadata.csv"
qrels_path = "data2/qrels.txt"
queries_path = "data2/topics.xml"

D, D_list = load_metadata(metadata_path)
qrels = load_qrels(qrels_path)
queries = load_queries(queries_path)

def indexing(D):
    start_time = time.time()  # Start timing
    
    # Initialize the inverted index
    inverted_index = defaultdict(dict)

    # Process each document
    for index, row in D.iterrows():
        # Combine title and abstract for indexing
        document_text = f"{row['title']} {row['abstract']}"
        # Preprocess text
        tokens = preprocess_text(document_text)

        # Build the index
        for term in tokens:
            if index in inverted_index[term]:
                inverted_index[term][index] += 1
            else:
                inverted_index[term][index] = 1

    # Calculate time and space used
    indexing_time = time.time() - start_time
    index_size = sum(sum(freq.values()) for freq in inverted_index.values())  # Calculate the size of the index

    # Return the inverted index, time taken, and estimated size of the index
    return inverted_index, indexing_time, index_size

def save_index_to_json(inverted_index, file_name='inverted_index.json'):
    with open(file_name, 'w') as f:
        json.dump(inverted_index, f, indent=4)

def save_index_to_json(inverted_index, directory='output_data', file_name='inverted_index.json'):
    # Check if the directory exists, if not create it
    if not os.path.exists(directory):
        os.makedirs(directory)

    # Full path to save the file
    path_to_file = os.path.join(directory, file_name)

    # Save the JSON file
    with open(path_to_file, 'w') as f:
        json.dump(inverted_index, f, indent=4)

# Run it:
inverted_index, time_taken, size = indexing(D)
save_index_to_json(inverted_index)

<h3>Part I: clustering</h3>

*A) Clustering*

In [4]:
#code, statistics and/or charts here

*B) Summarization*

In [5]:
#code, statistics and/or charts here

*C) Keyword extraction*

In [6]:
#code, statistics and/or charts here

*D) Evaluation*

In [7]:
#code, statistics and/or charts here

<h3>Part II: classification</h3>

*A) Feature extraction*

In [None]:
#Function to extract features from documents using TF-IDF
def extract_features(doc_list):
    processed_docs = []
    for doc in doc_list:
        tokens = preprocess_text(doc)  
        processed_docs.append(" ".join(tokens)) 
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(processed_docs)
    return tfidf_matrix, vectorizer

# Usage
tfidf_matrix, vectorizer = extract_features(D_list)


*B) Classification*

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Function to build a classification dataset based on qrels and the document list
def build_classification_dataset(qrels, queries, D, vectorizer, max_neg_per_query=50):
    X, y = [], []

    # Map cord_uid to row index
    cord_uid_to_index = {uid: idx for idx, uid in enumerate(D['cord_uid'])}

    # Precompute all document vectors
    processed_docs = [" ".join(preprocess_text(f"{row['title']} {row['abstract']}")) for _, row in D.iterrows()]
    doc_vectors = vectorizer.transform(processed_docs)

    for topic_id, docs in qrels.items():
        if topic_id not in queries:
            continue

        query_text = queries[topic_id]
        query_vec = vectorizer.transform([query_text])

        # Positive samples
        for doc_id, rel in docs.items():
            doc_idx = cord_uid_to_index.get(doc_id)
            if doc_idx is None:
                continue
            label = 1 if rel > 0 else 0
            doc_vec = doc_vectors[doc_idx]
            combined = 0.5 * query_vec + 0.5 * doc_vec
            X.append(combined.toarray()[0])
            y.append(label)

        # Negative samples (not in qrels)
        neg_added = 0
        for i, doc_id in enumerate(D['cord_uid']):
            if doc_id in docs or neg_added >= max_neg_per_query:
                continue
            doc_vec = doc_vectors[i]
            combined = 0.5 * query_vec + 0.5 * doc_vec
            X.append(combined.toarray()[0])
            y.append(0)
            neg_added += 1

    return np.array(X), np.array(y)

# Build and train
X, y = build_classification_dataset(qrels, queries, D, vectorizer)
print("Shape of X:", X.shape)
print("Positive samples:", sum(y))
print("Negative samples:", len(y) - sum(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

clf = LogisticRegression(max_iter=200, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test) # Predicting on the test set

print(classification_report(y_test, y_pred))


Shape of X: (8965, 109895)
Positive samples: 2096
Negative samples: 6869
              precision    recall  f1-score   support

           0       0.89      0.73      0.80      1374
           1       0.45      0.72      0.55       419

    accuracy                           0.73      1793
   macro avg       0.67      0.72      0.68      1793
weighted avg       0.79      0.73      0.74      1793



*C) Ranking extension*

In [29]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import networkx as nx

# Function to vectorize queries
def vectorize_queries(queries_dict, vectorizer):
    query_vectors = {}
    for qid, query in queries_dict.items():
        query_vec = vectorizer.transform([query]) 
        query_vectors[qid] = query_vec
    return query_vectors

# Function to get top-k documents for each query, based on cosine similarity
def get_top_k_documents_per_query(query_vectors, doc_matrix, top_k=1000):
    top_docs = {}
    for qid, qvec in query_vectors.items():
        sims = cosine_similarity(qvec, doc_matrix).flatten()
        top_indices = np.argsort(sims)[::-1][:top_k]
        top_scores = sims[top_indices]
        top_docs[qid] = list(zip(top_indices, top_scores))  # (doc_id, similarity)
    return top_docs


# Function to build an undirected graph where edges are added between documents
# based on cosine similarity above a certain threshold
def build_graph(doc_indices, doc_matrix, theta=0.2):
    G = nx.Graph()
    docs_subset = doc_matrix[doc_indices]

    sim_matrix = cosine_similarity(docs_subset)

    for i, idx1 in enumerate(doc_indices):
        for j, idx2 in enumerate(doc_indices):
            if i < j and sim_matrix[i, j] >= theta:
                G.add_edge(idx1, idx2, weight=sim_matrix[i, j])
    return G

# Function to run PageRank on the similarity graph
def run_pagerank(graph, p=0.15, max_iter=50):
    return nx.pagerank(graph, alpha=1 - p, max_iter=max_iter)

# Function to rerank documents using PageRank based on importance
# of documents in the similarity graph
def undirected_page_rank(qid, tfidf_matrix, top_docs_per_query, theta=0.2):
    if qid not in top_docs_per_query:
        print(f"Query {qid} not found.")
        return []

    top_docs = [doc_id for doc_id, _ in top_docs_per_query[qid]]
    G = build_graph(top_docs, tfidf_matrix, theta=theta)
    scores = run_pagerank(G)

    reranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return reranked


# Function to rerank documents using PageRank with a combination of original score
# and PageRank score
def undirected_page_rank_fused(qid, tfidf_matrix, top_docs_per_query, theta=0.2, alpha=0.6):
    if qid not in top_docs_per_query:
        print(f"Query {qid} not found.")
        return []

    # Get top-1000 document indices
    doc_indices = [doc_id for doc_id, _ in top_docs_per_query[qid]]
    original_scores = dict(top_docs_per_query[qid])  # doc_id → cosine_sim

    # Build graph
    G = build_graph(doc_indices, tfidf_matrix, theta=theta)
    pr_scores = run_pagerank(G)

    # Combine original score with PageRank
    combined_scores = {
        doc_id: alpha * pr_scores.get(doc_id, 0) + (1 - alpha) * original_scores.get(doc_id, 0)
        for doc_id in doc_indices
    }

    reranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return reranked


# Example usage


# Step 1: Vectorize queries
query_vectors = vectorize_queries(queries, vectorizer)

# Step 2: Find top-1000 most similar docs per query
top_docs_per_query = get_top_k_documents_per_query(query_vectors, tfidf_matrix)

# Step 3: Rerank one query using PageRank
qid = list(top_docs_per_query.keys())[0]
reranked_docs = undirected_page_rank(qid, tfidf_matrix, top_docs_per_query, theta=0.2)

# Show top 5 reranked docs
print(f"Top 5 reranked documents for query {qid}:")
for doc_id, score in reranked_docs[:5]:
    print(f"Doc {doc_id} → PR score: {score:.6f}")
    print(D.iloc[doc_id][['cord_uid', 'title']])



Top 5 reranked documents for query 1:
Doc 33981 → PR score: 0.005286
cord_uid                                             wl121lg4
title       Molecular immune pathogenesis and diagnosis of...
Name: 41082, dtype: object
Doc 39810 → PR score: 0.004797
cord_uid                                             msohf5oa
title       The origin, transmission and clinical therapie...
Name: 48161, dtype: object
Doc 15165 → PR score: 0.004604
cord_uid                                             52kqp9yw
title       From SARS coronavirus to novel animal and huma...
Name: 16201, dtype: object
Doc 4168 → PR score: 0.004365
cord_uid                                             nj1p4ehx
title       T-cell-mediated immune response to respiratory...
Name: 4373, dtype: object
Doc 34462 → PR score: 0.004342
cord_uid                                             kwq2y3il
title       Coronavirus Disease 2019: Coronaviruses and Bl...
Name: 41663, dtype: object


*D) Evaluation*

In [44]:
from sklearn.metrics import ndcg_score

# Function to get the true relevance scores for a given query from qrels
def get_qrels_for_query(qid, qrels):
    return qrels.get(qid, {})

#Function to measure precision at k
def precision_at_k(ranked_docs, relevant_docs, k=10):
    top_k = ranked_docs[:k]
    relevant = [1 if doc_id in relevant_docs and relevant_docs[doc_id] > 0 else 0 for doc_id in top_k]
    return sum(relevant) / k

# Function to calculate Mean Average Precision (MAP) for a ranked list of documents for a query
def mean_average_precision(ranked_docs, relevant_docs):
    num_relevant = 0
    avg_precision = 0.0

    for i, doc_id in enumerate(ranked_docs):
        if doc_id in relevant_docs and relevant_docs[doc_id] > 0:
            num_relevant += 1
            avg_precision += num_relevant / (i + 1)

    if num_relevant == 0:
        return 0.0
    return avg_precision / num_relevant

# Function to calculate NDCG at k
def ndcg_at_k(ranked_docs, relevant_docs, k=10):
    true_rels = [relevant_docs.get(doc_id, 0) for doc_id in ranked_docs[:k]]
    ideal_rels = sorted(relevant_docs.values(), reverse=True)[:k]

    if not ideal_rels:
        return 0.0

    return ndcg_score([ideal_rels], [true_rels])

# Function to evaluate the baseline cosine similarity ranking
def evaluate_baseline_cosine(qrels, tfidf_matrix, query_vectors, D, top_k=1000):
    D_reset = D.reset_index(drop=True)
    index_to_uid = dict(enumerate(D_reset['cord_uid']))

    all_precisions, all_maps, all_ndcgs = [], [], []
    evaluated_queries = 0

    for qid, qvec in query_vectors.items():
        sims = cosine_similarity(qvec, tfidf_matrix).flatten()
        top_indices = np.argsort(sims)[::-1][:top_k]
        ranked_doc_ids = [index_to_uid[idx] for idx in top_indices]

        relevant_docs = get_qrels_for_query(qid, qrels)
        if not relevant_docs:
            continue

        p_at_10 = precision_at_k(ranked_doc_ids, relevant_docs, k=10)
        avg_prec = mean_average_precision(ranked_doc_ids, relevant_docs)
        ndcg = ndcg_at_k(ranked_doc_ids, relevant_docs, k=10)

        all_precisions.append(p_at_10)
        all_maps.append(avg_prec)
        all_ndcgs.append(ndcg)
        evaluated_queries += 1

    print(f"\n📊 Baseline Cosine Similarity Ranking (no graph):")
    print(f"Evaluated {evaluated_queries} queries.")
    print(f"Avg Precision@10: {sum(all_precisions)/evaluated_queries:.4f}")
    print(f"Avg MAP:          {sum(all_maps)/evaluated_queries:.4f}")
    print(f"Avg NDCG@10:      {sum(all_ndcgs)/evaluated_queries:.4f}")


def evaluate_all_queries_fused(qrels, tfidf_matrix, top_docs_per_query, D, alpha=0.6, theta=0.2):
    D_reset = D.reset_index(drop=True)
    index_to_uid = dict(enumerate(D_reset['cord_uid']))

    all_precisions, all_maps, all_ndcgs = [], [], []
    evaluated_queries = 0

    for qid in top_docs_per_query:
        # ⚠️ Pass theta into your fused PageRank function
        reranked = undirected_page_rank_fused(qid, tfidf_matrix, top_docs_per_query, alpha=alpha, theta=theta)
        reranked_doc_ids = [index_to_uid[idx] for idx, _ in reranked]

        relevant_docs = get_qrels_for_query(qid, qrels)
        if not relevant_docs:
            continue

        p10 = precision_at_k(reranked_doc_ids, relevant_docs, k=10)
        map_ = mean_average_precision(reranked_doc_ids, relevant_docs)
        ndcg = ndcg_at_k(reranked_doc_ids, relevant_docs, k=10)

        all_precisions.append(p10)
        all_maps.append(map_)
        all_ndcgs.append(ndcg)
        evaluated_queries += 1

    print(f"\n📊 Fused Ranking (alpha={alpha}, θ={theta}):")
    print(f"Avg Precision@10: {sum(all_precisions)/evaluated_queries:.4f}")
    print(f"Avg MAP:          {sum(all_maps)/evaluated_queries:.4f}")
    print(f"Avg NDCG@10:      {sum(all_ndcgs)/evaluated_queries:.4f}")


# Ensure the DataFrame has a clean integer index
D_reset = D.reset_index(drop=True)

# Build index-to-uid map safely
index_to_uid = dict(enumerate(D_reset['cord_uid']))
reranked_doc_ids = [index_to_uid[idx] for idx, _ in reranked_docs]


# Function to evaluate all queries
def evaluate_all_queries(qrels, tfidf_matrix, top_docs_per_query, D, theta=0.2):
    D_reset = D.reset_index(drop=True)
    index_to_uid = dict(enumerate(D_reset['cord_uid']))

    all_precisions = []
    all_maps = []
    all_ndcgs = []
    evaluated_queries = 0

    for qid in qrels.keys():
        if qid not in top_docs_per_query:
            continue

        # Pass the theta value into the graph builder
        reranked_docs = undirected_page_rank(qid, tfidf_matrix, top_docs_per_query, theta=theta)
        reranked_doc_ids = [index_to_uid.get(idx) for idx, _ in reranked_docs if idx in index_to_uid]

        relevant_docs = get_qrels_for_query(qid, qrels)
        if not relevant_docs:
            continue

        p_at_10 = precision_at_k(reranked_doc_ids, relevant_docs, k=10)
        avg_prec = mean_average_precision(reranked_doc_ids, relevant_docs)
        ndcg = ndcg_at_k(reranked_doc_ids, relevant_docs, k=10)

        all_precisions.append(p_at_10)
        all_maps.append(avg_prec)
        all_ndcgs.append(ndcg)
        evaluated_queries += 1

    avg_p10 = sum(all_precisions) / evaluated_queries if evaluated_queries > 0 else 0.0
    avg_map = sum(all_maps) / evaluated_queries if evaluated_queries > 0 else 0.0
    avg_ndcg = sum(all_ndcgs) / evaluated_queries if evaluated_queries > 0 else 0.0

    print(f"\nEvaluated {evaluated_queries} queries with θ = {theta}")
    print(f"Average Precision@10: {avg_p10:.4f}")
    print(f"Average MAP: {avg_map:.4f}")
    print(f"Average NDCG@10: {avg_ndcg:.4f}")


#
# Evaluate baseline cosine similarity ranking
evaluate_baseline_cosine(qrels, tfidf_matrix, query_vectors, D)
# Evaluate with different θ values
for theta in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    evaluate_all_queries(qrels, tfidf_matrix, top_docs_per_query, D, theta=theta)
    evaluate_all_queries_fused(qrels, tfidf_matrix, top_docs_per_query, D, theta=theta)
    # Evaluate the fused function with different α values
best_theta = 0.2
# Evaluate with different α values
for alpha in [0.3, 0.5, 0.7]:
    evaluate_all_queries_fused(qrels, tfidf_matrix, top_docs_per_query, D, alpha=alpha, theta=best_theta)







📊 Baseline Cosine Similarity Ranking (no graph):
Evaluated 30 queries.
Avg Precision@10: 0.3800
Avg MAP:          0.2475
Avg NDCG@10:      0.9943

Evaluated 30 queries with θ = 0.1
Average Precision@10: 0.1200
Average MAP: 0.0876
Average NDCG@10: 0.9940

📊 Fused Ranking (alpha=0.6, θ=0.1):
Avg Precision@10: 0.3833
Avg MAP:          0.2482
Avg NDCG@10:      0.9943

Evaluated 30 queries with θ = 0.2
Average Precision@10: 0.0900
Average MAP: 0.0779
Average NDCG@10: 0.9940

📊 Fused Ranking (alpha=0.6, θ=0.2):
Avg Precision@10: 0.3833
Avg MAP:          0.2489
Avg NDCG@10:      0.9943

Evaluated 30 queries with θ = 0.3
Average Precision@10: 0.0733
Average MAP: 0.0682
Average NDCG@10: 0.9940

📊 Fused Ranking (alpha=0.6, θ=0.3):
Avg Precision@10: 0.3833
Avg MAP:          0.2483
Avg NDCG@10:      0.9943

Evaluated 30 queries with θ = 0.4
Average Precision@10: 0.0633
Average MAP: 0.0737
Average NDCG@10: 0.9940

📊 Fused Ranking (alpha=0.6, θ=0.4):
Avg Precision@10: 0.3800
Avg MAP:          0.248

Weighted PageRank

In [48]:
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
import time


#Function to rerank documents using undirected PageRank with weighted edges
# and classifier-based priors
def undirected_page_rank_weighted(
    qid,
    tfidf_matrix,
    top_docs_per_query,
    classifier=None,
    doc_vectors=None,
    query_vectors=None,
    D=None,
    theta=0.2,
    p=0.15,
    max_iter=50,
    top_k_neighbors=10,
    use_cosine_fusion=False,
    alpha=0.6
):
    if qid not in top_docs_per_query:
        print(f"[SKIP] Query {qid} not found in top_docs.")
        return []

    doc_indices = [doc_id for doc_id, _ in top_docs_per_query[qid]]
    docs_subset = tfidf_matrix[doc_indices]
    sim_matrix = cosine_similarity(docs_subset)

    # Build graph with top-k neighbors
    G = nx.Graph()
    for i, idx1 in enumerate(doc_indices):
        sim_scores = list(enumerate(sim_matrix[i]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        for j, score in sim_scores[:top_k_neighbors + 1]:
            idx2 = doc_indices[j]
            if i < j and score >= theta:
                G.add_edge(idx1, idx2, weight=score)

    if classifier and doc_vectors is not None and query_vectors is not None:
        qvec = query_vectors[qid]
        priors = {}
        for i, idx in enumerate(doc_indices):
            dvec = doc_vectors[idx]
            combined = 0.5 * qvec + 0.5 * dvec
            clf_score = classifier.predict_proba(combined)[0][1]
            priors[idx] = clf_score ** 2  # emphasize strong confidence
    else:
        priors = {idx: 1.0 for idx in doc_indices}

    # Normalize priors
    total_prior = sum(priors.values())
    priors = {k: v / total_prior for k, v in priors.items()}

    # Weighted PageRank
    PR = {node: 1 / len(G) for node in G.nodes()}
    for _ in range(max_iter):
        new_PR = {}
        for node in G.nodes():
            neighbors = list(G[node])
            link_sum = 0.0
            for neighbor in neighbors:
                weight = G[neighbor][node]['weight']
                total_neighbor_weight = sum(G[neighbor][n]['weight'] for n in G[neighbor])
                link_sum += PR[neighbor] * (weight / total_neighbor_weight)
            new_PR[node] = p * priors[node] + (1 - p) * link_sum
        PR = new_PR

    # Combine PR with cosine similarity if enabled
    if use_cosine_fusion:
        cosine_scores = dict(top_docs_per_query[qid])
        reranked = [(doc_id, alpha * PR.get(doc_id, 0.0) + (1 - alpha) * cosine_scores.get(doc_id, 0.0))
                    for doc_id in doc_indices]
        reranked = sorted(reranked, key=lambda x: x[1], reverse=True)
    else:
        reranked = sorted(PR.items(), key=lambda x: x[1], reverse=True)

    return reranked

# Function to evaluate all queries using the weighted PageRank method
def evaluate_all_queries_weighted(
    qrels,
    tfidf_matrix,
    top_docs_per_query,
    classifier,
    doc_vectors,
    query_vectors,
    D,
    theta=0.2,
    p=0.15,
    max_iter=50,
    top_k_neighbors=10,
    use_cosine_fusion=False,
    alpha=0.6,
    limit_queries=None
):
    start = time.time()
    D_reset = D.reset_index(drop=True)
    index_to_uid = dict(enumerate(D_reset['cord_uid']))

    all_precisions, all_maps, all_ndcgs = [], [], []
    evaluated_queries = 0

    print("Evaluating weighted PageRank across queries...\n")
    for i, qid in enumerate(qrels.keys()):
        if limit_queries and i >= limit_queries:
            break
        if qid not in top_docs_per_query:
            continue

        reranked_docs = undirected_page_rank_weighted(
            qid=qid,
            tfidf_matrix=tfidf_matrix,
            top_docs_per_query=top_docs_per_query,
            classifier=classifier,
            doc_vectors=doc_vectors,
            query_vectors=query_vectors,
            D=D,
            theta=theta,
            p=p,
            max_iter=max_iter,
            top_k_neighbors=top_k_neighbors,
            use_cosine_fusion=use_cosine_fusion,
            alpha=alpha
        )

        reranked_doc_ids = [index_to_uid.get(idx) for idx, _ in reranked_docs if idx in index_to_uid]
        relevant_docs = get_qrels_for_query(qid, qrels)
        if not relevant_docs:
            continue

        p_at_10 = precision_at_k(reranked_doc_ids, relevant_docs, k=10)
        avg_prec = mean_average_precision(reranked_doc_ids, relevant_docs)
        ndcg = ndcg_at_k(reranked_doc_ids, relevant_docs, k=10)

        all_precisions.append(p_at_10)
        all_maps.append(avg_prec)
        all_ndcgs.append(ndcg)
        evaluated_queries += 1

    avg_p10 = sum(all_precisions) / evaluated_queries if evaluated_queries > 0 else 0.0
    avg_map = sum(all_maps) / evaluated_queries if evaluated_queries > 0 else 0.0
    avg_ndcg = sum(all_ndcgs) / evaluated_queries if evaluated_queries > 0 else 0.0

    end = time.time()
    print(f"\nEvaluated {evaluated_queries} queries in {end - start:.1f} seconds.")
    print(f"Average Precision@10: {avg_p10:.4f}")
    print(f"Average MAP:          {avg_map:.4f}")
    print(f"Average NDCG@10:      {avg_ndcg:.4f}")


evaluate_all_queries_weighted(
    qrels,
    tfidf_matrix,
    top_docs_per_query,
    classifier=clf,
    doc_vectors=tfidf_matrix,
    query_vectors=query_vectors,
    D=D,
    theta=0.2,
    p=0.15,
    max_iter=50,
    top_k_neighbors=10,
    use_cosine_fusion=True,
    alpha=0.6
)

# Explore the effect of top_k_neighbors and alpha
for k in [5, 10, 20]:
    for alpha in [0.4, 0.6, 0.8]:
        print(f"\nTesting: k={k}, alpha={alpha}")
        evaluate_all_queries_weighted(
            qrels=qrels,
            tfidf_matrix=tfidf_matrix,
            top_docs_per_query=top_docs_per_query,
            classifier=clf,
            doc_vectors=tfidf_matrix,
            query_vectors=query_vectors,
            D=D,
            theta=0.2,
            p=0.15,
            max_iter=50,
            top_k_neighbors=k,
            use_cosine_fusion=True,
            alpha=alpha
        )


Evaluating weighted PageRank across queries...


Evaluated 30 queries in 45.3 seconds.
Average Precision@10: 0.3833
Average MAP:          0.2506
Average NDCG@10:      0.9943

Testing: k=5, alpha=0.4
Evaluating weighted PageRank across queries...


Evaluated 30 queries in 24.3 seconds.
Average Precision@10: 0.3833
Average MAP:          0.2491
Average NDCG@10:      0.9943

Testing: k=5, alpha=0.6
Evaluating weighted PageRank across queries...


Evaluated 30 queries in 23.1 seconds.
Average Precision@10: 0.3833
Average MAP:          0.2515
Average NDCG@10:      0.9943

Testing: k=5, alpha=0.8
Evaluating weighted PageRank across queries...


Evaluated 30 queries in 23.6 seconds.
Average Precision@10: 0.3933
Average MAP:          0.2561
Average NDCG@10:      0.9943

Testing: k=10, alpha=0.4
Evaluating weighted PageRank across queries...


Evaluated 30 queries in 44.0 seconds.
Average Precision@10: 0.3833
Average MAP:          0.2487
Average NDCG@10:      0.9943

Testing: k=10, alpha=0.6
Eva

<H2>Question materials (optional)</H2>

<H3>Part I: clustering</H3>

**(1)** Do clustering-guided summarization alters the behavior and efficacy of the IR system?

In [None]:
#code, statistics and/or charts here

**(2)** How sentence representations, clustering choices, and rank criteria impact summarization?

In [None]:
#code, statistics and/or charts here

**...** (additional questions with empirical results)

<H3>END</H3>