# Extrinsic Evaluation
## Post-filter approach

- We use the sensitivity labels from our intrinsic evaluation.

- In the post-filter approach, we rank the documents with the coordinate ascent algorithm, optimizing towards normalized Discounted Cumulative Gain (nDCG).

- For that we use the predefined functions from https://github.com/rueycheng/CoordinateAscent/blob/master, as the paper does not mention which implementation was used.
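
For orientation, nDCG@k can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the common exponential gain `2^rel - 1` with a `log2` rank discount; the `NDCGScorer` from the repository above may use a different gain or tie handling, and the function names and sample grades here are made up.

```python
import math

def dcg_at_k(grades, k=10):
    """Discounted cumulative gain over the first k grades, given in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k=10):
    """DCG divided by the DCG of the ideal ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned documents, in ranked order (hypothetical values)
ranked_grades = [2, 0, 1, 0]
print(round(ndcg_at_k(ranked_grades, k=10), 4))
```

A perfectly ordered list scores 1.0, and every relevant document that sits below a less relevant one lowers the score.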



In [48]:
import pandas as pd

In [49]:
test_predicted = pd.read_csv("intermediate_results/logistic_regression_test_predicted.csv")

query_groups = test_predicted.groupby("Query")

In [50]:
test_predicted["Query"] = test_predicted["Query"].astype(int)
test_predicted["title_abstract"] = test_predicted["title_abstract"].astype(str)

In [51]:
def parse_queries_alternate(file_path):
    with open(file_path, "r") as file:
        lines = file.readlines()
    
    queries = []
    current_id = None
    current_text = []
    
    for line in lines:
        line = line.strip()
        if line.startswith(".I"):
            if current_id is not None:
                queries.append({"query_id": current_id, "query_text": " ".join(current_text)})
            current_id = int(line.split()[1])
            current_text = []
        elif line.startswith(".W"):
            continue  # Skip the .W line
        else:
            current_text.append(line)
    
    # Add the last query
    if current_id is not None:
        queries.append({"query_id": current_id, "query_text": " ".join(current_text)})
    
    return pd.DataFrame(queries)

# Parse the files using the alternate function
queries1 = parse_queries_alternate("data/Queries1.txt")
queries2 = parse_queries_alternate("data/Queries2.txt")

# Combine the queries
queries_df = pd.concat([queries1, queries2], ignore_index=True)

# Verify
queries_df.head()

Unnamed: 0,query_id,query_text
0,1,.B 60 year old menopausal woman without hormon...
1,2,.B 60 yo male with disseminated intravascular ...
2,3,.B prolonged prothrombin time anticardiolipin ...
3,4,.B 88 yo with subdural reviews on subdurals in...
4,5,.B 58 yo with cancer and hypercalcemia effecti...


In [52]:
test_predicted = test_predicted.merge(queries_df, left_on="Query", right_on="query_id", how="left")

We use TF-IDF vectorization to compare each query with the document's title and abstract, and use this similarity as the feature for the ranking algorithm.
-> The problem is that the paper never mentions how these scores were computed.
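
To make the similarity feature concrete, here is a small plain-Python sketch of TF-IDF weighting followed by cosine similarity. It assumes whitespace tokenization and the smoothed idf `ln((1 + n) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`; the helper names and the example texts are made up, and stop-word removal is left out.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one L2-normalized {term: weight} dict per text."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times smoothed inverse document frequency
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

query = "cancer hypercalcemia treatment"
docs = ["treatment of hypercalcemia in cancer patients",
        "subdural hematoma in elderly patients"]
vecs = tfidf_vectors([query] + docs)
scores = [cosine(vecs[0], d) for d in vecs[1:]]
print(scores)
```

The first document shares three terms with the query and gets a positive score; the second shares none and scores zero.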

In [53]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit one vocabulary on queries and documents so both share the same feature space
corpus = test_predicted["query_text"].tolist() + test_predicted["title_abstract"].tolist()

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(corpus)

n_queries = len(test_predicted["query_text"])
query_tfidf = tfidf_matrix[:n_queries]
doc_tfidf = tfidf_matrix[n_queries:]

# TfidfVectorizer L2-normalizes rows by default, so the row-wise dot product
# equals the cosine similarity of each (query, document) pair. This avoids
# building the full n x n similarity matrix only to read its diagonal.
test_predicted["tfidf_similarity"] = np.asarray(
    query_tfidf.multiply(doc_tfidf).sum(axis=1)
).ravel()

The values need to be between 0 and 2.
(Because only the relative ordering of the scores matters for the ranking, this rescaling does not change the result.)
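
The rescaling is a monotonic min-max map, so it leaves the order of the scores untouched. A minimal plain-Python check of that property follows; the helper name `rescale` and the sample values are made up for illustration.

```python
def rescale(values, low=0.0, high=2.0):
    """Min-max rescale a list of numbers into [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: map everything to the lower bound
        return [low for _ in values]
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]

raw = [0.05, 0.40, 0.10, 0.25]
scaled = rescale(raw)

# Rank positions (best first) before and after rescaling
order_raw = sorted(range(len(raw)), key=lambda i: raw[i], reverse=True)
order_scaled = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
print(order_raw == order_scaled)
```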

In [54]:
from sklearn.preprocessing import MinMaxScaler

# Rescale TF-IDF similarities to the range [0, 2]
scaler = MinMaxScaler(feature_range=(0, 2))
test_predicted["tfidf_similarity"] = scaler.fit_transform(
    test_predicted[["tfidf_similarity"]]
)

In [55]:
print(test_predicted["tfidf_similarity"].describe())

count    28860.000000
mean         0.284187
std          0.246498
min          0.000000
25%          0.097436
50%          0.231182
75%          0.407963
max          2.000000
Name: tfidf_similarity, dtype: float64


In [56]:
X_test = test_predicted["tfidf_similarity"].values
y_test = test_predicted["Relevance_total"].values
qid_test= test_predicted["Query"].values

In [57]:
X_test = X_test.reshape(-1, 1)

We use the predefined coordinate ascent and NDCGScorer modules here. They can be found at https://github.com/rueycheng/CoordinateAscent/blob/master

- We group the entries by query ID and calculate the average nDCG score over the entire test set.
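
The grouping and averaging step can be sketched as follows. This is a plain-Python illustration and not the repository's scorer: it assumes the gain `2^rel - 1`, counts queries without any relevant document as 0, and uses made-up function names and data.

```python
import math
from collections import defaultdict

def ndcg_at_k(grades, k=10):
    """nDCG@k for relevance grades given in ranked order."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def mean_ndcg(y_true, scores, qids, k=10):
    """Group rows by query ID, rank each group by score, and average nDCG@k."""
    groups = defaultdict(list)
    for grade, score, qid in zip(y_true, scores, qids):
        groups[qid].append((score, grade))
    per_query = []
    for rows in groups.values():
        ranked = [g for _, g in sorted(rows, key=lambda r: r[0], reverse=True)]
        per_query.append(ndcg_at_k(ranked, k))
    return sum(per_query) / len(per_query)

# Two queries with two documents each (hypothetical values)
y_true = [1, 0, 0, 2]
scores = [0.9, 0.1, 0.8, 0.3]
qids = [1, 1, 2, 2]
print(round(mean_ndcg(y_true, scores, qids), 4))
```

Query 1 is ranked perfectly, while query 2 places its only relevant document second, which pulls the average below 1.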

In [58]:
from utils.coordinate_ascent import CoordinateAscent
from utils.metrics import NDCGScorer
from scipy.sparse import csr_matrix

scorer = NDCGScorer(k=10, idcg_cache={})

# Convert X_test to a sparse matrix
X_test_sparse = csr_matrix(X_test)

model = CoordinateAscent(n_restarts=2, max_iter=25, verbose=True, scorer=scorer).fit(X_test_sparse, y_test, qid_test)

pred = model.predict(X_test_sparse, qid_test)

test_predicted["predicted_scores"] = pred

In [59]:
test_predicted_filtered = test_predicted[test_predicted["predicted_label"] == 0]

In [60]:
X_test_filtered = test_predicted_filtered["tfidf_similarity"].values.reshape(-1, 1)  # Feature matrix
y_test_filtered = test_predicted_filtered["Relevance_total"].values  # Relevance labels
qid_test_filtered = test_predicted_filtered["Query"].values  # Query IDs

In [61]:
ndcg_score = scorer(y_test_filtered, test_predicted_filtered["predicted_scores"].values, qid_test_filtered).mean()

# Print the final score
print(f"Postfilter Average nCS-DCG@10: {ndcg_score:.4f}")

Postfilter Average nCS-DCG@10: 0.3017


As these results are not very good, we try the approach from paper 28, which the authors reference.

Paper 28 mentions other feature measures such as BM25 and the proximity count, so we try these as well.
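
As a rough illustration of the BM25 feature, the scoring can be written in plain Python. This sketch assumes whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, and a non-negative idf variant `ln(1 + (N - df + 0.5) / (df + 0.5))`; the `rank_bm25` package and the referenced paper may define the idf differently, and the function name and example texts are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi-style BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = ["hypercalcemia in cancer patients",
        "subdural hematoma review",
        "cancer"]
bm25 = bm25_scores("cancer hypercalcemia", docs)
print(bm25)
```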

In [62]:
# This runs for a very long time, so it is commented out

#from rank_bm25 import BM25Okapi
#
## Tokenize documents and queries
#documents = [doc.split() for doc in test_predicted["title_abstract"]]
#queries = [query.split() for query in test_predicted["query_text"]]
#
## Initialize BM25 model
#bm25 = BM25Okapi(documents)
#
## Compute BM25 scores

In [63]:
test_predicted = pd.read_csv("intermediate_results/test_predicted_withBM25.csv")

We also use the proximity count.

In [64]:
def proximity_count(query, doc, window=8):
    """
    Slide a fixed-size window over the document and sum, over all windows,
    the number of distinct query terms each window contains.
    Documents shorter than `window` tokens yield 0.
    """
    # Tokenize on whitespace; build the query term set once
    query_terms = set(query.split())
    doc_terms = doc.split()

    count = 0

    # Slide a window over the document terms
    for i in range(len(doc_terms) - window + 1):
        window_terms = doc_terms[i:i + window]
        # Distinct query terms that occur in this window
        count += len(query_terms & set(window_terms))

    return count

In [65]:
test_predicted["proximity_count"] = test_predicted.apply(
    lambda row: proximity_count(row["query_text"], row["title_abstract"]),
    axis=1
)

# Inspect proximity counts
print(test_predicted["proximity_count"].describe())

count    28860.000000
mean        79.681635
std         69.239140
min          0.000000
25%         24.000000
50%         68.000000
75%        120.000000
max        591.000000
Name: proximity_count, dtype: float64


In [66]:
X_test = test_predicted[["bm25_score", "proximity_count"]].values
y_test = test_predicted["Relevance_total"].values
qid_test= test_predicted["Query"].values

We fit the CoordinateAscent model with the nDCG scorer on the bm25_score and proximity_count features and predict the ranking scores.
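
The idea behind coordinate ascent for ranking is to change one feature weight at a time and keep a change only if the ranking metric improves. The sketch below is a simplified plain-Python version of that idea, not the implementation in `utils.coordinate_ascent`: it has no random restarts, uses a fixed set of step sizes, and all names and data are made up for illustration.

```python
import math
from collections import defaultdict

def mean_ndcg(y_true, scores, qids, k=10):
    """Average nDCG@k over queries, assuming the gain 2^rel - 1."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    groups = defaultdict(list)
    for grade, score, qid in zip(y_true, scores, qids):
        groups[qid].append((score, grade))
    values = []
    for rows in groups.values():
        ranked = [g for _, g in sorted(rows, key=lambda r: r[0], reverse=True)]
        ideal = dcg(sorted(ranked, reverse=True))
        values.append(dcg(ranked) / ideal if ideal > 0 else 0.0)
    return sum(values) / len(values)

def coordinate_ascent(X, y, qids, steps=(0.5, 0.1, 0.02), max_iter=25):
    """Greedy one-weight-at-a-time search over a linear scoring function."""
    n_features = len(X[0])
    weights = [1.0 / n_features] * n_features

    def evaluate(w):
        scores = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
        return mean_ndcg(y, scores, qids)

    best = evaluate(weights)
    for _ in range(max_iter):
        improved = False
        for j in range(n_features):
            for step in steps:
                for delta in (step, -step):
                    trial = list(weights)
                    trial[j] += delta
                    score = evaluate(trial)
                    if score > best:  # keep only strict improvements
                        weights, best, improved = trial, score, True
        if not improved:
            break  # no single-weight change helps any more
    return weights, best

# Two features per document, two queries (hypothetical values)
X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.2], [0.1, 0.9]]
y = [0, 2, 0, 1]
qids = [1, 1, 2, 2]
weights, best = coordinate_ascent(X, y, qids)
print(weights, round(best, 4))
```

In this toy data only the second feature tracks relevance, so the search lowers the weight of the first feature until both queries are ranked perfectly.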

In [67]:
scorer = NDCGScorer(k=10, idcg_cache={})

X_test_sparse = csr_matrix(X_test)

model = CoordinateAscent(n_restarts=2, max_iter=25, verbose=True, scorer=scorer).fit(X_test_sparse, y_test, qid_test)

pred = model.predict(X_test_sparse, qid_test)

test_predicted["predicted_scores"] = pred

1	1	1	0.30425724608270827
1	2	0	0.30458623209584335
1	1	1	0.30490724996267293
1	2	0	0.30493944056703975
1	1	0	0.3049623310198546
1	2	1	0.3049945216242214
2	1	0	0.30458623209584335
2	2	1	0.30494094090988433
2	1	1	0.30495510080330307
2	2	0	0.3049623310198546
2	1	0	0.3049945216242214


In [68]:
test_predicted_filtered= test_predicted[test_predicted["predicted_label"] == 0]

X_test_filtered = test_predicted_filtered[["bm25_score", "proximity_count"]].values
y_test_filtered = test_predicted_filtered["Relevance_total"].values  # Relevance labels
qid_test_filtered = test_predicted_filtered["Query"].values  # Query IDs

In [69]:
ndcg_score = scorer(y_test_filtered, test_predicted_filtered["predicted_scores"].values, qid_test_filtered).mean()

# Print the final score
print(f"Postfilter Average nCS-DCG@10: {ndcg_score:.4f}")

Postfilter Average nCS-DCG@10: 0.3016


## Joint approach
- Here we want a result that balances sensitivity and relevance.
- For that, we apply the sensitivity penalty directly during the ranking process.
- We again use the features from above, but also take sensitivity directly into account.

In [70]:
feature_columns = ["bm25_score", "proximity_count"]
X_joint = test_predicted[feature_columns].values

y_joint = test_predicted[["Relevance_total", "predicted_label"]].values
qid_joint = test_predicted["Query"].values

In [71]:
X_joint = csr_matrix(X_joint)

The paper mentions that a penalty of 12 was applied.
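
The effect of such a penalty on a ranking can be seen in a small plain-Python example. It follows the reading used in this notebook, where the penalty is subtracted from the predicted score of every document labeled sensitive; whether the paper applies the penalty in exactly this way is an assumption, and the function name and values are made up.

```python
def apply_sensitivity_penalty(scores, sensitive, penalty=12.0):
    """Subtract `penalty` from the score of every document flagged as sensitive (1)."""
    return [s - penalty if flag == 1 else s for s, flag in zip(scores, sensitive)]

scores = [3.2, 2.9, 1.5, 0.7]   # predicted ranking scores (hypothetical)
sensitive = [1, 0, 0, 1]        # 1 = predicted sensitive, 0 = not sensitive

penalized = apply_sensitivity_penalty(scores, sensitive)

# Document indices ordered from best to worst after the penalty
ranking = sorted(range(len(scores)), key=lambda i: penalized[i], reverse=True)
print(ranking)
```

With a penalty this large relative to the score range, every sensitive document falls below every non-sensitive one, while the order inside each group is preserved.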

In [72]:
from utils.metrics import NDCGScorer

class nCS_DCGScorer:
    def __init__(self, y_sensitivity, k=10, sensitivity_penalty=12, idcg_cache=None):
        # Sensitivity labels, aligned row by row with the predictions
        self.y_sensitivity = y_sensitivity
        self.k = k
        self.sensitivity_penalty = sensitivity_penalty
        # Create a fresh cache per instance instead of sharing a mutable default argument
        if idcg_cache is None:
            idcg_cache = {}
        self.ndcg_scorer = NDCGScorer(k=k, idcg_cache=idcg_cache)

    def __call__(self, y_relevance, pred, qid):
        # Subtract the penalty from the predicted scores of sensitive documents
        penalized_pred = pred.copy()
        penalized_pred[self.y_sensitivity == 1] -= self.sensitivity_penalty

        # Compute nDCG@k on the penalized predictions
        return self.ndcg_scorer(y_relevance, penalized_pred, qid)

In [73]:
y_relevance = y_joint[:, 0]
y_sensitivity = y_joint[:, 1]

ncs_dcg_scorer = nCS_DCGScorer(y_sensitivity=y_sensitivity, k=10, sensitivity_penalty=12)

model = CoordinateAscent(
    n_restarts=5,
    max_iter=50,
    verbose=True,
    scorer=ncs_dcg_scorer  
).fit(X_joint, y_relevance, qid_joint)


1	1	1	0.3042431191865809
1	2	0	0.30461842270021017
1	1	0	0.3046991298652665
1	2	1	0.305047633729398
2	1	1	0.3042431191865809
2	2	0	0.30461842270021017
2	1	1	0.30465121320050803
2	2	0	0.3046991298652665
2	1	1	0.305047633729398
3	1	0	0.30461842270021017
3	2	1	0.30465121320050803
3	1	0	0.3046991298652665
3	2	1	0.3049925526722163
3	1	0	0.3049952052072145
3	2	1	0.305047633729398
4	1	0	0.30461842270021017
4	2	1	0.30465121320050803
4	1	1	0.30502093433537214
4	2	0	0.3050296030184499
4	1	1	0.305047633729398
5	1	1	0.3042431191865809
5	2	0	0.30461842270021017
5	1	0	0.3046991298652665
5	2	1	0.305047633729398


In [74]:
pred_joint = model.predict(X_joint, qid_joint)

test_predicted["joint_scores"] = pred_joint

average_ncs_dcg = ncs_dcg_scorer(y_relevance, pred_joint, qid_joint).mean()
print(f"Joint Approach - Average nCS-DCG@10: {average_ncs_dcg:.4f}")

Joint Approach - Average nCS-DCG@10: 0.3050


This does not seem reproducible either.

- The exact features used are never mentioned, so we cannot tell what the experimental setup looked like.