# Consumer Health Search TREC Evaluation

In this notebook, we'll calculate the topicality score using PyTerrier and the credibility score using a trained model. We'll then compute a weighted average of the topicality and credibility scores.

## Setup

First, let's install and import the necessary libraries.

In [None]:
!pip install python-terrier
!pip install bs4
!pip install gensim
!pip install scikit-learn pandas numpy trectools



## Initialization

We'll initialize PyTerrier and load our datasets.


In [None]:
import os
import tarfile
import tempfile

import joblib
import pandas as pd
import pyterrier as pt
from bs4 import BeautifulSoup
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from trectools import TrecQrel, TrecRun, TrecEval

# Initialize PyTerrier
pt.init()

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



## Data Loading

Load the document CSV, qrels, and query datasets.

Dataset Download:



In [None]:
import os

# List of datasets and their URLs
datasets = {
    "CHS_docs.csv": "https://owncloud.tuwien.ac.at/index.php/s/3HHq8r94QP9Vu1b/download",
    "trec_qrels.csv": "https://owncloud.tuwien.ac.at/index.php/s/CSVE6tWnh4G8UIF/download",
    "trec_topics.csv": "https://owncloud.tuwien.ac.at/index.php/s/1G1yjzlV4AsLl9N/download"
}

# Download datasets if they don't exist
for filename, url in datasets.items():
    if not os.path.exists(filename):
        !wget {url} -O {filename}
    else:
        print(f"{filename} already exists. Skipping download.")


--2023-08-24 18:37:57--  https://owncloud.tuwien.ac.at/index.php/s/3HHq8r94QP9Vu1b/download
Resolving owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)... 128.130.35.207, 2001:629:3800:335::207
Connecting to owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)|128.130.35.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 633213130 (604M) [text/csv]
Saving to: ‘CHS_docs.csv’


2023-08-24 18:38:23 (24.1 MB/s) - ‘CHS_docs.csv’ saved [633213130/633213130]

--2023-08-24 18:38:23--  https://owncloud.tuwien.ac.at/index.php/s/CSVE6tWnh4G8UIF/download
Resolving owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)... 128.130.35.207, 2001:629:3800:335::207
Connecting to owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)|128.130.35.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 331879 (324K) [text/csv]
Saving to: ‘trec_qrels.csv’


2023-08-24 18:38:24 (702 KB/s) - ‘trec_qrels.csv’ saved [331879/331879]

--2023-08-24 18:38:24--  https://owncloud.tuwien.ac.at/index

In [None]:
# Replace with your paths

data_set = 'TREC'
dataset_path = '/content'
config = {
    'TREC': {
        'file_path': f'{dataset_path}/CHS_docs.csv',
        'index_path': f'{dataset_path}/CHS_bm25',
        'topics': f'{dataset_path}/trec_topics.csv',
        'result_name': f'{dataset_path}/CHS_BM25.csv',
        'qrels': f'{dataset_path}/trec_qrels.csv',
        'final_retrieved_name': f'{dataset_path}/TREC/TREC2020_BM25_clean_100.csv',
    }
}



# Load the documents from the specified file path
documents = pd.read_csv(config[data_set]['file_path'], sep='\t')
print("Loaded documents:")
print(documents.head())  # Display the first few rows of the documents dataframe

# Load the qrels (query relevance judgments) from the specified file path
qrels = pd.read_csv(config[data_set]['qrels'], sep=' ', header=None,names=['qid','Q0','docno','topical','credible'])
print("\nLoaded qrels:")
print(qrels.head())  # Display the first few rows of the qrels dataframe

# Load the queries/topics from the specified file path
queries = pd.read_csv(config[data_set]['topics'], sep=' ', header=None)
print("\nLoaded queries:")
print(queries.head())  # Display the first few rows of the queries dataframe

Loaded documents:
                                  docno  \
0  0113bb03-2a3a-4602-9394-d2fe911b624a   
1  015c98bf-8632-4537-9038-7bc3e128cb97   
2  01e198e3-ec00-432d-92f0-cca8251db33d   
3  02700110-5195-4cee-b584-8fe6d870e2dd   
4  02fb6095-115b-4418-bb34-8b76cc65059c   

                                                text  
0  tyler perry reveals role vitamin d plays in fi...  
1  this is why you should include vitamin c and z...  
2  supplements for coronavirus probably won t hel...  
3  coronavirus top ways to protect yourself from ...  
4  coronavirus it s time to debunk claims that vi...  

Loaded qrels:
   qid  Q0                                 docno  topical  credible
0    1   0  0113bb03-2a3a-4602-9394-d2fe911b624a        1         0
1    1   0  015c98bf-8632-4537-9038-7bc3e128cb97        1         1
2    1   0  01e198e3-ec00-432d-92f0-cca8251db33d        1         1
3    1   0  02700110-5195-4cee-b584-8fe6d870e2dd        1         1
4    1   0  02fb6095-115b-4418-bb34-8b

## Topicality Scoring with PyTerrier

We'll index the documents and retrieve the topicality scores.

In [None]:
# Index the documents using PyTerrier

# Drop rows without text and exact duplicate texts before indexing
index_doc = documents.dropna(subset=['text']).drop_duplicates(subset=['text'])
index_doc = index_doc[['docno', 'text']]

index_path = config[data_set]['index_path']
if not os.path.exists(f'{index_path}/data.properties'):
    indexer = pt.DFIndexer(index_path, overwrite=True, verbose=True, Threads=8)
    # Keep only the Porter stemmer in the term pipeline; this drops the stopword
    # removal that Terrier applies by default ("Stopwords,PorterStemmer")
    indexer.setProperty("termpipelines", "PorterStemmer")
    indexref3 = indexer.index(index_doc["text"], index_doc)
else:
    # Reuse the index that already exists on disk
    indexref3 = pt.IndexRef.of(f'{index_path}/data.properties')




  0%|          | 0/86779 [00:00<?, ?documents/s]



In [None]:
# Retrieve the topicality scores using PyTerrier
indexref3 = pt.IndexRef.of(f'''{config[data_set]['index_path']}/data.properties''')
BM25 = pt.BatchRetrieve(indexref3, num_results=50, controls = {"wmodel": "BM25"}) # change the num_results according to requirement
topics=pt.io.read_topics(config[data_set]['topics'],format='singleline')
results=BM25.transform(topics)
results.head()

|   | qid | docid | docno | rank | score | query |
|---|-----|-------|-------|------|-------|-------|
| 0 | 1 | 16 | 0b1b5f08-0ae2-4974-9ee9-21ebc16e2ad9 | 0 | 32.178085 | can vitamin d cure covid 19 |
| 1 | 1 | 61 | 2c53accc-8c15-46cd-8cdc-99269ce6eb63 | 1 | 31.808723 | can vitamin d cure covid 19 |
| 2 | 1 | 257 | c80bd81d-2112-42fe-b04e-373b3a2172bf | 2 | 31.577835 | can vitamin d cure covid 19 |
| 3 | 1 | 279 | e971787d-4b8c-4c8d-b4ef-890749d996ab | 3 | 31.530513 | can vitamin d cure covid 19 |
| 4 | 1 | 390 | ab2e7f3c-d0a6-46f6-bf55-ccdf8ed8e4d1 | 4 | 31.454145 | can vitamin d cure covid 19 |
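
For intuition about what the `score` column measures, here is a minimal sketch of a textbook BM25 scorer in plain Python. It is not Terrier's implementation: the IDF variant, the parameter values `k1=1.2` and `b=0.75`, the function name `bm25_scores`, and the toy corpus are all assumptions made for illustration, so the numbers will not match the scores above.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, not taken from the CHS collection
toy_docs = [
    "vitamin d and covid 19 what we know".split(),
    "healthy recipes for the summer".split(),
    "vitamin c supplements and the common cold".split(),
]
toy_scores = bm25_scores("vitamin d covid".split(), toy_docs)
print(toy_scores)
```

The first toy document matches all three query terms and scores highest, the third matches only `vitamin`, and the second matches nothing and scores zero.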


In [None]:
# Attach the document text to each retrieved result by joining on 'docno'
ranked_document=results.merge(documents,on='docno')[['qid','docno','rank','score','query','text']]
ranked_document.head()

|   | qid | docno | rank | score | query | text |
|---|-----|-------|------|-------|-------|------|
| 0 | 1 | 0b1b5f08-0ae2-4974-9ee9-21ebc16e2ad9 | 0 | 32.178085 | can vitamin d cure covid 19 | vitamin d ingredients market estimated to soar... |
| 1 | 1 | 2c53accc-8c15-46cd-8cdc-99269ce6eb63 | 1 | 31.808723 | can vitamin d cure covid 19 | fact check does vitamin d protect from coronav... |
| 2 | 2 | 2c53accc-8c15-46cd-8cdc-99269ce6eb63 | 29 | 31.919435 | can vitamin c cure covid 19 | fact check does vitamin d protect from coronav... |
| 3 | 1 | c80bd81d-2112-42fe-b04e-373b3a2172bf | 2 | 31.577835 | can vitamin d cure covid 19 | coronavirus there are no miracle foods or diet... |
| 4 | 2 | c80bd81d-2112-42fe-b04e-373b3a2172bf | 13 | 33.303274 | can vitamin c cure covid 19 | coronavirus there are no miracle foods or diet... |


## Credibility Scoring with Logistic Regression

In this section, we assess the credibility of the retrieved documents. Our approach involves two main steps:

1. **Vectorization of Documents**: We use a pre-trained Word2Vec model to convert the text of each document into a vector.
2. **Credibility Prediction**: We feed these vectors to a Logistic Regression model, trained for credibility detection on the dataset described in [this research](https://link.springer.com/chapter/10.1007/978-3-642-28997-2_19), to predict a credibility score for each document.


### Clinical Embeddings

The Word2Vec model we're using is sourced from a collection of clinical embeddings. These embeddings are specifically tailored for clinical texts, making them highly relevant for our use-case.

> **Reference**: [Clinical Embeddings on GitHub](https://github.com/gweissman/clinical_embeddings)

By using domain-specific embeddings, we aim to capture clinical vocabulary that general-purpose embeddings may miss, which we expect to help the credibility predictions.
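
The two steps of this section (averaging word vectors, then applying logistic regression) can be sketched in plain Python. This is only an illustration: the 3-dimensional `toy_embeddings`, the coefficients `toy_weights` and `toy_bias`, and both function names are hypothetical stand-ins for the 100-dimensional clinical embeddings and the trained classifier used in the notebook.

```python
import math

# Hypothetical 3-dimensional embeddings; the notebook uses 100-dimensional clinical vectors
toy_embeddings = {
    "vitamin": [0.2, 0.1, 0.4],
    "cure":    [0.6, -0.3, 0.0],
    "covid":   [0.1, 0.5, 0.2],
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def credibility_probability(vector, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients standing in for a trained classifier
toy_weights, toy_bias = [1.0, -0.5, 0.8], 0.1

# "can" and "d" are out of vocabulary here and are simply skipped
doc_vector = average_vector("can vitamin d cure covid".split(), toy_embeddings)
prob = credibility_probability(doc_vector, toy_weights, toy_bias)
print(doc_vector, prob)
```

Out-of-vocabulary tokens do not contribute to the average, and a document with no known tokens falls back to the zero vector, so its predicted probability depends on the bias term alone.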


In [None]:
# Load the trained Word2Vec model
embedding_url='https://owncloud.tuwien.ac.at/index.php/s/Z5JwuDQgef32BgU/download'
embed_filename='w2v_100d_oa_cr.tar.gz'
!wget {embedding_url} -O {embed_filename}


with tarfile.open("./w2v_100d_oa_cr.tar.gz", "r:gz") as tar:
    tar.extractall(path="clinical_embeddings")

# Load the clinical embeddings using gensim
model = Word2Vec.load('./clinical_embeddings/W2V_100/w2v_OA_CR_100d.bin')


--2023-08-24 18:57:35--  https://owncloud.tuwien.ac.at/index.php/s/Z5JwuDQgef32BgU/download
Resolving owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)... 128.130.35.207, 2001:629:3800:335::207
Connecting to owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)|128.130.35.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 255220810 (243M) [application/gzip]
Saving to: ‘w2v_100d_oa_cr.tar.gz’


2023-08-24 18:57:46 (22.8 MB/s) - ‘w2v_100d_oa_cr.tar.gz’ saved [255220810/255220810]



In [None]:
import numpy as np

texts = ranked_document['text'].values

# Simple whitespace tokenization
tokenized_texts = [text.split() for text in texts]

def text_to_vector(tokens, model):
    """Average the Word2Vec vectors of all in-vocabulary tokens of a document."""
    vectors = [model.wv[word] for word in tokens if word in model.wv.key_to_index]
    if vectors:
        return np.mean(vectors, axis=0)
    # No known tokens: fall back to a zero vector of the same dimensionality
    return np.zeros(model.vector_size)


# Convert document texts to vectors
document_vectors = [text_to_vector(tokens, model) for tokens in tqdm(tokenized_texts)]


100%|██████████| 2500/2500 [00:12<00:00, 193.94it/s]


In [None]:
# Load the trained credibility classifier
!wget https://owncloud.tuwien.ac.at/index.php/s/4QzyW9GNhP2Zkro/download -O credibility_model.pkl
clf = joblib.load('credibility_model.pkl')


# Predict credibility scores: probability of the second class in clf.classes_,
# which is used here as the credibility score
credibility_scores = clf.predict_proba(document_vectors)[:, 1]

ranked_document['credibility_score'] = credibility_scores

--2023-08-24 18:58:11--  https://owncloud.tuwien.ac.at/index.php/s/4QzyW9GNhP2Zkro/download
Resolving owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)... 128.130.35.207, 2001:629:3800:335::207
Connecting to owncloud.tuwien.ac.at (owncloud.tuwien.ac.at)|128.130.35.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1663 (1.6K) [application/octet-stream]
Saving to: ‘credibility_model.pkl’


2023-08-24 18:58:12 (955 MB/s) - ‘credibility_model.pkl’ saved [1663/1663]



https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


## Combine Scores

To evaluate documents along both dimensions, we combine topicality and credibility into a single score: a weighted average of the min-max normalized BM25 score and the predicted credibility probability. A document's position in the final ranking then reflects both how well it matches the query and how credible it is predicted to be.
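
As a small worked example of this combination, the sketch below normalizes three hypothetical BM25 scores to [0, 1] and averages them with three hypothetical credibility probabilities using equal weights. The function names and all numbers are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]; return zeros if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def combine(topicality, credibility, weight_top=0.5, weight_cred=0.5):
    """Weighted average of normalized topicality scores and credibility scores."""
    normalized = min_max_normalize(topicality)
    return [weight_top * t + weight_cred * c for t, c in zip(normalized, credibility)]

# Hypothetical BM25 scores and credibility probabilities for three documents
bm25 = [30.0, 20.0, 10.0]
cred = [0.2, 0.9, 0.6]
combined = combine(bm25, cred)
print(combined)
```

Here the second document ends up first: its normalized topicality is only 0.5, but its high credibility lifts the combined score above that of the top BM25 hit.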



In [None]:
# Combine topicality and credibility scores

# Normalize the BM25 'score' column to [0, 1] using Min-Max normalization
# (the minimum and maximum are taken over all queries at once)
min_score = ranked_document['score'].min()
max_score = ranked_document['score'].max()
ranked_document['normalized_score'] = (ranked_document['score'] - min_score) / (max_score - min_score)

# Assuming a weight of 0.5 for both scores for simplicity
weight_top = 0.5
weight_cred = 0.5

# Combine the normalized score with the 'credibility_score'
combined_scores = weight_top * ranked_document['normalized_score'] + weight_cred * ranked_document['credibility_score']

ranked_document['combined_scores'] = combined_scores


In [None]:
def top_k_per_query(df, score_column, k=100):
    """Keep the k highest-scoring documents per query and rank them (1 = best)."""
    top = (df.sort_values(['qid', score_column], ascending=[True, False])
             .groupby('qid')
             .head(k)
             .reset_index(drop=True))
    # method='first' breaks ties by order of appearance, so ranks stay unique integers
    top['rank'] = top.groupby('qid')[score_column].rank(ascending=False, method='first').astype(int)
    return top[['qid', 'docno', 'rank', score_column, 'query']]

# Rank based on 'score'
ranked_by_score = top_k_per_query(ranked_document, 'score')

# Rank based on 'credibility_score'
ranked_by_credibility = top_k_per_query(ranked_document, 'credibility_score')

# Rank based on 'combined_scores'
ranked_by_combined = top_k_per_query(ranked_document, 'combined_scores')


## Evaluation

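We evaluate the combined ranking with the MM framework, which scores each dimension separately (here topicality and credibility, using NDCG or MAP) and then takes the harmonic mean of the two values.

Joao Palotti, Guido Zuccon, and Allan Hanbury. 2018. **MM: A new Framework for Multidimensional Evaluation of Search Engines**. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18). Association for Computing Machinery, New York, NY, USA, 1699–1702. [DOI](https://doi.org/10.1145/3269206.3269261)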

---

**For the MM equation with NDCG:**

$$
\text{MM}_{\text{NDCG}} = \frac{2 \times \text{NDCG}_{\text{topicality}} \times \text{NDCG}_{\text{credibility}}}{\text{NDCG}_{\text{topicality}} + \text{NDCG}_{\text{credibility}}}
$$

---

**For the MM equation with MAP:**

$$
\text{MM}_{\text{MAP}} = \frac{2 \times \text{MAP}_{\text{topicality}} \times \text{MAP}_{\text{credibility}}}{\text{MAP}_{\text{topicality}} + \text{MAP}_{\text{credibility}}}
$$


In [None]:

# Convert ranked documents to the six-column TREC run format
# (query, q0, docid, rank, score, system) that TrecRun reads
def convert_to_trec_format(df, score_column, system_tag='combined'):
    trec_df = df[['qid', 'docno', 'rank', score_column]].copy()
    trec_df['q0'] = 'Q0'  # Constant placeholder column
    trec_df['system'] = system_tag  # Run tag; the value is an arbitrary label
    trec_df = trec_df[['qid', 'q0', 'docno', 'rank', score_column, 'system']]
    trec_df.columns = ['query', 'q0', 'docid', 'rank', 'score', 'system']
    return trec_df

# Convert qrels to trectools format
def convert_qrels_to_trec_format(df, relevance_column):
    trec_df = df[['qid', 'Q0', 'docno', relevance_column]].copy()
    trec_df.columns = ['query', 'q0', 'docid', 'rel']
    return trec_df

# Calculate NDCG and MAP
def evaluate(trec_qrels, trec_run):
    te = TrecEval(trec_run, trec_qrels)
    ndcg = te.get_ndcg()
    map_score = te.get_map()
    return ndcg, map_score

# Write ranked_by_combined to a temporary file in TREC run format
temp_file = tempfile.NamedTemporaryFile(delete=False)
temp_file.close()
convert_to_trec_format(ranked_by_combined, 'combined_scores').to_csv(temp_file.name, sep='\t', header=False, index=False)

# Load the temporary file into TrecRun
trec_run = TrecRun(temp_file.name)
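
For reference, the standard TREC run format has six whitespace-separated columns: query id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below formats a few hypothetical rows in that layout; the function name, the document ids, and the tag `example_run` are made up for illustration.

```python
def to_trec_run_lines(rows, system_tag="example_run"):
    """Format (qid, docno, rank, score) tuples as six-column TREC run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score:.4f} {system_tag}"
        for qid, docno, rank, score in rows
    ]

# Hypothetical ranking for one query
rows = [("1", "doc-a", 1, 0.7), ("1", "doc-b", 2, 0.6)]
run_lines = to_trec_run_lines(rows)
for line in run_lines:
    print(line)
```

Checking a few lines of the written file against this layout is a quick way to confirm that the score ends up in the fifth column before the run is evaluated.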

In [None]:
# Convert qrels to trectools format and save to a temporary file
def convert_qrels_to_trec_format_and_save(df, relevance_column):
    trec_df = convert_qrels_to_trec_format(df, relevance_column)
    temp_file = tempfile.NamedTemporaryFile(delete=False)
    trec_df.to_csv(temp_file.name, sep='\t', header=None, index=False)
    return temp_file.name

# Convert qrels to trectools format for topicality and credibility and load into TrecQrel
qrels_topicality_path = convert_qrels_to_trec_format_and_save(qrels, 'topical')
qrels_credibility_path = convert_qrels_to_trec_format_and_save(qrels, 'credible')

qrels_topicality = TrecQrel(qrels_topicality_path)
qrels_credibility = TrecQrel(qrels_credibility_path)



In [None]:
# Evaluate based on topicality
ndcg_topicality, map_topicality = evaluate(qrels_topicality, trec_run)

# Evaluate based on credibility
ndcg_credibility, map_credibility = evaluate(qrels_credibility, trec_run)


# MM evaluation framework: harmonic mean of the topicality and credibility metrics
harmonic_mean_ndcg = 2 * (ndcg_topicality * ndcg_credibility) / (ndcg_topicality + ndcg_credibility)
harmonic_mean_map = 2 * (map_topicality * map_credibility) / (map_topicality + map_credibility)

print(f"MM Framework using NDCG: {harmonic_mean_ndcg}")
print(f"MM Framework using MAP: {harmonic_mean_map}")

MM Framework using NDCG: 0.38715400366651526
MM Framework using MAP: 0.23120627159732557


## Conclusion

We've computed a combined score based on topicality and credibility for the Consumer Health Search TREC dataset and evaluated the resulting ranking with the MM framework. Adjust the weights and models as needed for further optimization.
