# **Text, Web, & Media Analytics Assignment 2**

# Setup

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

from ir_models import BM25, JM_LM, My_PRM
from ir_tools import write_scores_to_file
from parsing_functions import parse_stop_words, parse_collection, parse_query, parse_query_set, parse_evaluations, parse_ranking_files

In [None]:
# Parse in stop words
stop_words = parse_stop_words('common-english-words.txt')

# Load the document set (series of collection objects)
document_set = {}
input_path = 'Data_Collection'
for collection_path in os.listdir(input_path):
    data_key = collection_path.split('_C', 1)[1]
    document_set[data_key] = parse_collection(stop_words, os.path.join(input_path, collection_path))

# Parse in query set, apply term specificity to parsed queries
query_frame = parse_query_set('the50Queries.txt')
query_frame['parsed_title'] = query_frame['title'].apply(lambda row: parse_query(row, stop_words))

# Experiment to see if adding quarter-weighted frequency of description element helps
query_frame['parsed_description'] = query_frame['description'].apply(lambda row: parse_query(row, stop_words) if row is not pd.NA else pd.NA)
query_frame['parsed_description'] = query_frame['parsed_description'].apply(lambda row: {k:v/4 for k,v in row.items()} if row is not pd.NA else pd.NA)

query_frame['parsed_query'] = query_frame.apply(
    lambda row: {**row['parsed_title'], **{k: v for k, v in row['parsed_description'].items() if k not in row['parsed_title']}} if row['parsed_description'] is not pd.NA else row['parsed_title'], 
    axis=1
)
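
The merge above keeps every title term at its full frequency and adds description-only terms at a quarter of their frequency. A minimal sketch of that behaviour on made-up term counts follows; the `merge_query_terms` helper, its `description_weight` parameter, and the sample dictionaries are illustrative assumptions, not part of the notebook's modules.

```python
def merge_query_terms(title_terms, description_terms, description_weight=0.25):
    """Combine title and description term frequencies into one weighted query.

    Hypothetical helper: title terms keep their full frequency, and terms that
    appear only in the description are scaled by description_weight.
    """
    merged = dict(title_terms)
    for term, freq in (description_terms or {}).items():
        if term not in merged:  # title terms take precedence
            merged[term] = freq * description_weight
    return merged


# Made-up term frequencies for illustration
title = {'econom': 1, 'espionag': 1}
description = {'espionag': 2, 'industri': 2, 'trade': 1}

merged_query = merge_query_terms(title, description)
print(merged_query)
```

Because title terms take precedence, a term that occurs in both elements (`espionag` here) is not boosted by its description frequency.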

# Task 1: BM25 ✔️

# Task 2: Jelinek-Mercer Language Model ✔️

# Task 3: Pseudo-Relevance Model ✔️

# Task 4: Model Testing ✔️

**Description:** Use Python to implement three models: `BM25`, `JM_LM`, and `My_PRM`, and **test them on the given 50 data collections for the corresponding 50 queries (topics)**. 

Design Python programs to implement these three models. You can use a .py file (or a .ipynb file) for each model.


For each long query, your Python programs will produce ranked results and save them into .dat files. For example, for query R107, you can save the ranked results of the three models into “BM25_R107Ranking.dat”, “JM_LM_R107Ranking.dat”, and “My_PRM_R107Ranking.dat”, respectively, using the following format:
- The first column is the document id (the itemid in the corresponding XML document)
- The second column is the document score (or probability).
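
The two-column format can be sketched as below. This is an illustration only: `write_ranking` is a hypothetical stand-in (the notebook's own `write_scores_to_file` may behave differently), the space separator is an assumption, and the document ids and scores are made up.

```python
import os
import tempfile


def write_ranking(scores, path):
    """Write 'doc_id score' pairs to path, highest score first (hypothetical helper)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    with open(path, 'w', encoding='utf-8') as handle:
        for doc_id, score in ranked:
            handle.write(f'{doc_id} {score}\n')


# Made-up document ids and scores for illustration
example_scores = {'6146': 2.5, '7412': 4.0, '9088': 0.75}

output_dir = tempfile.mkdtemp()
output_path = os.path.join(output_dir, 'BM25_R107Ranking.dat')
write_ranking(example_scores, output_path)

# Read the file back to show the two columns
with open(output_path, encoding='utf-8') as handle:
    written_lines = handle.read().splitlines()
print(written_lines)
```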

**Describe:** 
- Python packages or modules (or any open-source software) you used
- The data structures used to represent a single document and a set of documents for each model (you can use different data structures for different models).


You also need to **test the three models on the given 50 data collections for the 50 queries (topics) by *printing out the top 15 documents* for each data collection (in descending order)**. The **output will also be put in the appendix of your final report**.

In [None]:
# Initialise result dicts
BM25_results = {}
JM_LM_results = {}
My_PRM_results = {}

# Loop over queries/collection objects
for query_key, collection in document_set.items():
    query = query_frame.loc[query_frame['number'] == query_key, 'parsed_query'].iloc[0]  # retrieve weighted query

    # Rank documents
    BM25_results[query_key] = BM25(collection=collection, query=query)
    JM_LM_results[query_key] = JM_LM(collection=collection, query=query)
    My_PRM_results[query_key] = My_PRM(weighting_function=BM25, collection=collection, query=query, threshold=0.7, theta=0)  # NOTE: GSCV threshold/theta?

    # Save results
    write_scores_to_file(BM25_results[query_key], f"BM25_R{query_key}Ranking")
    write_scores_to_file(JM_LM_results[query_key], f"JM_LM_R{query_key}Ranking")
    write_scores_to_file(My_PRM_results[query_key], f"My_PRM_R{query_key}Ranking")

In [None]:
def get_top_15(model_results):
    """
    Takes the model results, prints out the top-15 sorted by weights.
    """

    # Iterate over each {query: predictions} pair, where predictions maps doc_id to document weight
    for query, predictions in model_results.items():
        print(f'Query {query} (DocID Weight):')  # print result header information

        # Sort by weight (descending) and keep the top 15; slicing is safe for lists shorter than 15
        top_15 = sorted(predictions.items(), key=lambda item: item[1], reverse=True)[:15]

        # Print each doc_id with its weight
        for doc_id, weight in top_15:
            print(f'{doc_id}: {weight}')  # print results data

        print()  # print linebreak for readability

In [None]:
get_top_15(BM25_results)
get_top_15(JM_LM_results)
get_top_15(My_PRM_results)

# Task 5: Model Evaluation ✔️

**Description:** Use three effectiveness measures to evaluate the three models.

In this task, you need to **use the relevance judgments (EvaluationBenchmark.zip)** to **compare with the ranking outputs in the folder of “RankingOutputs” for the selected effectiveness metric** for the three models.


You need to use the following three different effectiveness measures to evaluate the document ranking results you saved in the folder “RankingOutputs”:
1) Average precision (and MAP)
2) Precision@10 (and their average)
3) Discounted cumulative gain at rank position 10 ($p = 10$), $DCG_{10}$ (and their average):  
    $DCG_p=rel_1+\sum_{i=2}^p\frac{rel_i}{\log_2(i)}$  
        $rel_i=1$ if the document at position $i$ is relevant; otherwise, it is 0.
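
As a worked instance of the DCG formula, the sketch below evaluates it on a short, made-up list of binary relevance labels. The `dcg_at_p` helper is a hypothetical illustration and takes labels directly rather than model scores.

```python
import math


def dcg_at_p(relevances, p=10):
    """DCG over the first p binary relevance labels, with no discount at rank 1."""
    total = 0.0
    for rank, rel in enumerate(relevances[:p], start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
    return total


# Made-up relevance labels for a ranked list (1 = relevant, 0 = not relevant)
ranked_relevance = [1, 1, 0, 1]
print(dcg_at_p(ranked_relevance))  # 1 + 1/log2(2) + 0 + 1/log2(4) = 2.5
```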

Evaluation results can be summarized in tables or graphs. Examples are provided in the specification sheet.

In [None]:
# # Parse DATs
# BM25_results = parse_ranking_files('RankingOutputs', 'BM25')
# JM_LM_results = parse_ranking_files('RankingOutputs', 'JM_LM')
# My_PRM_results = parse_ranking_files('RankingOutputs', 'My_PRM')

# Parse in evaluation benchmarks
evaluations = parse_evaluations('EvaluationBenchmark/')

## Average Precision (MAP)

In [None]:
def calculate_precision(evaluations, model_results, threshold: float, top_k: int = None) -> pd.DataFrame:
    """
    Calculate set-based precision for each topic, either over the documents scoring above threshold
    or, if top_k is specified, over the top_k highest-scored documents (threshold is then ignored).
    The mean precision across all topics is appended as a final row ('MAP', or 'Average' when top_k is set).
    """

    precisions = []

    for topic, relevancy in evaluations.items():
        predicted_scores = model_results.get(topic, {})

        # Sort and possibly limit the results to top_k if specified
        if top_k:
            top_items = sorted(predicted_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
            filtered_scores = dict(top_items)
        else:
            filtered_scores = {doc_id: score for doc_id, score in predicted_scores.items() if score > threshold}

        # Calculate the number of predicted relevant documents that meet the threshold
        retrieved_docs = len(filtered_scores)

        # Calculate the number of correctly retrieved documents (true positives)
        true_positives = sum(1 for doc_id in filtered_scores.keys() if relevancy.get(doc_id) == 1)

        # Calculate precision
        precision = true_positives / retrieved_docs if retrieved_docs > 0 else 0

        # Append results to list for DataFrame conversion
        precisions.append({'topic': topic, 'precision': precision})

    # Create DataFrame
    precision_df = pd.DataFrame(precisions)

    # Calculate MAP (Average) and append as a new row
    map_score = precision_df['precision'].mean()
    average_row = pd.DataFrame([{'topic': 'MAP' if not top_k else 'Average', 'precision': map_score}])
    precision_df = pd.concat([precision_df, average_row], ignore_index=True)

    return precision_df

In [None]:
# Defining thresholds (used in both average precision and precision@10)
bm25_threshold = 0.6
jm_lm_threshold = 0.000001
prm_threshold = 0.005

In [None]:
# Calculate precision for each query
bm25_precision = calculate_precision(evaluations, BM25_results, bm25_threshold, top_k = None).rename({'precision': 'bm25_precision'}, axis = 1)

# Calculate precision for each query
jm_lm_precision = calculate_precision(evaluations, JM_LM_results, jm_lm_threshold, top_k = None).rename({'precision': 'jm_lm_precision'}, axis = 1)

# Calculate precision for each query
prm_precision = calculate_precision(evaluations, My_PRM_results, prm_threshold, top_k = None).rename({'precision': 'prm_precision'}, axis = 1)

# Merging results
average_precision = pd.merge((pd.merge(bm25_precision, jm_lm_precision, on='topic')), prm_precision, on='topic')
average_precision
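
The `calculate_precision` function above reports set-based precision over the documents that pass each model's threshold. Rank-based average precision, the quantity usually averaged into MAP, instead averages the precision at each rank where a relevant document appears. A minimal sketch for comparison follows, assuming the same `{doc_id: score}` and `{doc_id: 0 or 1}` dictionary shapes. The `rank_based_average_precision` helper is hypothetical, and dividing by the number of relevant documents in the judgments is one common convention rather than the only one.

```python
def rank_based_average_precision(scores, relevancy):
    """Average of precision@k over the ranks k that hold a relevant document.

    Hypothetical helper. Normalises by the number of relevant documents in the
    judgments (an assumption; some formulations divide by relevant retrieved).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    total_relevant = sum(1 for label in relevancy.values() if label == 1)
    if total_relevant == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, (doc_id, _score) in enumerate(ranked, start=1):
        if relevancy.get(doc_id) == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / total_relevant


# Made-up scores and judgments for illustration
example_scores = {'d1': 0.9, 'd2': 0.8, 'd3': 0.7, 'd4': 0.6}
example_relevancy = {'d1': 1, 'd2': 0, 'd3': 1, 'd4': 0}
ap = rank_based_average_precision(example_scores, example_relevancy)
print(ap)  # (1/1 + 2/3) / 2
```

Averaging this value over all topics would give a MAP figure that does not depend on a score threshold.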

## Precision @ 10

In [None]:
# Rank variable
top_k = 10

# Calculate precision for each model
bm25_precision_10 = calculate_precision(evaluations, BM25_results, bm25_threshold, top_k = top_k).rename({'precision': f'bm25_precision@{top_k}'}, axis = 1)
jm_lm_precision_10 = calculate_precision(evaluations, JM_LM_results, jm_lm_threshold, top_k = top_k).rename({'precision': f'jm_lm_precision@{top_k}'}, axis = 1)
prm_precision_10 = calculate_precision(evaluations, My_PRM_results, prm_threshold, top_k = top_k).rename({'precision': f'prm_precision@{top_k}'}, axis = 1)

# Merging results
precision_10 = pd.merge((pd.merge(bm25_precision_10, jm_lm_precision_10, on='topic')), prm_precision_10, on='topic')
precision_10

## DCG @ 10

In [None]:
def calculate_dcg(evaluations, model_results, threshold, p):
    """
    Calculate the Discounted Cumulative Gain (DCG) at rank p (effectively top k) for each topic in a collection.
    DCG is calculated using a logarithmic discount factor to decrease the weight of relevance for documents retrieved later in the list.
    """

    dcgs = []

    for topic, relevancy in evaluations.items():
        predicted_scores = model_results.get(topic, {})

        # Sort predicted scores and limit results to top p
        top_p_scores = sorted(predicted_scores.items(), key=lambda x: x[1], reverse=True)[:p]

        # Calculate DCG using the logarithmic discount
        dcg = 0
        for rank, (doc_id, score) in enumerate(top_p_scores, start=1):
            relevance = 1 if score > threshold and relevancy.get(doc_id, 0) == 1 else 0
            if rank == 1:
                dcg += relevance  # no discount for the first item
            else:
                dcg += relevance / np.log2(rank)  # discounting starts from the second item

        # Append results to list for DataFrame conversion
        dcgs.append({'topic': topic, 'DCG': dcg})

    # Create DataFrame
    dcg_df = pd.DataFrame(dcgs)

    # Calculate Average DCG and append as a new row
    average_dcg = dcg_df['DCG'].mean()
    average_row = pd.DataFrame([{'topic': 'Average DCG', 'DCG': average_dcg}])
    dcg_df = pd.concat([dcg_df, average_row], ignore_index=True)

    return dcg_df

In [None]:
# Rank variable
p = 10

# Calculate DCG for each model
bm25_dcg_10 = calculate_dcg(evaluations, BM25_results, bm25_threshold, p = p).rename({'DCG': f'bm25_DCG_p{p}'}, axis = 1)
jm_lm_dcg_10 = calculate_dcg(evaluations, JM_LM_results, jm_lm_threshold, p = p).rename({'DCG': f'jm_lm_DCG_p{p}'}, axis = 1)
prm_dcg_10 = calculate_dcg(evaluations, My_PRM_results, prm_threshold, p = p).rename({'DCG': f'prm_DCG_p{p}'}, axis = 1)

# Merging results
dcg_10 = pd.merge((pd.merge(bm25_dcg_10, jm_lm_dcg_10, on='topic')), prm_dcg_10, on='topic')
dcg_10

## F1 Optimisation

In [None]:
def calculate_weighted_f1_score(evaluations, model_results, threshold: float, beta: float = 1.0, top_k: int = None) -> float:
    """
    Calculate the weighted F1 score across all topics for a given threshold and beta, optionally considering only the top_k results.
    """
    total_f1_score = 0
    count = 0

    for topic, relevancy in evaluations.items():
        predicted_scores = model_results.get(topic, {})

        # Sort and possibly limit the results to top_k if specified
        if top_k:
            top_items = sorted(predicted_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
            filtered_scores = dict(top_items)  # Convert sorted list back to dict
        else:
            filtered_scores = {doc_id: score for doc_id, score in predicted_scores.items() if score > threshold}

        # Calculate true positives, false positives, and false negatives
        true_positives = sum(1 for doc_id in filtered_scores.keys() if relevancy.get(doc_id) == 1)
        false_positives = sum(1 for doc_id in filtered_scores.keys() if relevancy.get(doc_id) == 0)
        false_negatives = sum(1 for doc_id, is_relevant in relevancy.items() if is_relevant == 1 and doc_id not in filtered_scores)

        # Calculate precision and recall
        precision = true_positives / (true_positives + false_positives) if true_positives + false_positives > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives > 0 else 0

        # Calculate weighted F1 score
        if (beta**2 * precision + recall) != 0:
            f1_score = (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
        else:
            f1_score = 0

        # Add to total F1 score and increment count
        total_f1_score += f1_score
        count += 1

    # Calculate average F1 score across all topics
    average_f1_score = total_f1_score / count if count > 0 else 0
    return average_f1_score

def manual_grid_search(evaluations, model_results, thresholds, top_ks: list = None):
    """
    Search the given thresholds (and optionally top_k values) for the combination that maximises the average F1 score.
    """
    best_score = -float('inf')  # higher F1 is better
    best_params = {}

    # Loop over all combinations of threshold and top_k
    for threshold in thresholds:
        if top_ks:
            for top_k in top_ks:
                # Calculate average F1 for the current combination of threshold and top_k
                # (top_k is passed by keyword; positionally it would be bound to beta)
                average_f1 = calculate_weighted_f1_score(evaluations, model_results, threshold, top_k=top_k)

                # Keep the best score and parameters seen so far
                if average_f1 > best_score:
                    best_score = average_f1
                    best_params = {'threshold': threshold, 'top_k': top_k}

        else:
            # Calculate average F1 for the current threshold
            average_f1 = calculate_weighted_f1_score(evaluations, model_results, threshold)

            # Keep the best score and parameters seen so far
            if average_f1 > best_score:
                best_score = average_f1
                best_params = {'threshold': threshold}

    return best_score, best_params

# Build threshold list
low_threshold = 0.0000001
step_size = 0.0001
num_steps = 100
thresholds = [low_threshold + i * step_size for i in range(num_steps)]
top_ks = [10]
manual_grid_search(evaluations, JM_LM_results, thresholds)

# Task 6: Recommendation

**Description:** Recommend a model based on significance test and your analysis. 

You need to conduct a significance test to compare models. You can choose a t-test to perform a significance test on the evaluation results (e.g., in Tables 1, 2 and 3). 

You can compare models between:
- **BM25** and **JM_LM**
- **BM25** and **My_PRM**
- **JM_LM** and **My_PRM**

Based on the $t$-test results ($p$-value and $t$-statistic), you can recommend a model (you ***want the proposed "My_PRM" to be the best because it is your own model***). You can perform the $t$-test using a single effectiveness measure or multiple measures. Generally, using more effectiveness measures provides stronger evidence against the null hypothesis. Note that if the $t$-test is unsatisfactory, you can use the evaluation results to refine the **My_PRM** model. For example, you can adjust parameter settings or update your design and implementation.
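
Since every model is scored on the same topics, a paired $t$-test is one reasonable choice for these comparisons. The sketch below computes the paired $t$-statistic by hand on made-up per-topic scores. The `paired_t_statistic` helper and the sample values are illustrative assumptions, and the optional SciPy call is included only to obtain the $p$-value where SciPy is installed.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on per-topic scores."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    std_diff = statistics.stdev(differences)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Made-up per-topic scores for two models, for illustration only
model_a = [0.50, 0.40, 0.70, 0.60, 0.30]
model_b = [0.40, 0.35, 0.60, 0.55, 0.20]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(f't = {t_stat:.3f}, degrees of freedom = {dof}')

# Optional: SciPy's ttest_rel reports the two-sided p-value, if SciPy is installed
try:
    from scipy import stats
    print(stats.ttest_rel(model_a, model_b))
except ImportError:
    pass
```

With the result tables built earlier, the inputs would be the per-topic columns for two models, with the final average row excluded so that it is not counted as a topic.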