# 1. Label Ranker

This notebook analyses the Anserini BM25 top-100 passage rankings and checks whether the MSMARCO relevant passage for each query appears in the top 100. It then assigns a label to each relevant passage according to its BM25 rank.

The label is either:
- high: rank 1-20
- medium: rank 21-80
- low: rank 81-100
- outside scope: rank >100
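
The label boundaries above can be sketched as a small helper. This is only an illustration: the name `rank_to_label` is hypothetical and is not used elsewhere in the notebook, and `None` is assumed to stand for a passage that BM25 did not retrieve at all.

```python
def rank_to_label(rank):
    """Map a 1-based BM25 rank to a label (hypothetical helper).

    `rank` is None when the passage was not retrieved in the top 100.
    """
    if rank is None or rank > 100:
        return "outside scope"
    if rank <= 20:
        return "high"
    if rank <= 80:
        return "medium"
    return "low"
```

For example, `rank_to_label(20)` falls in the high band, while `rank_to_label(21)` falls in the medium band.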

## Imports

In [44]:
import pandas as pd
import numpy as np

## Paths

In [45]:
msmarco_dir = "../data/msmarco_files/"
anserini_dir = "../data/anserini_output/"
output_dir = "../data/output/"

## Load Data

CHANGE THE FOLLOWING FILENAMES

In [46]:
anserini_filename = 'run_development_top100.tsv'
relevance_filename = 'qrels.dev.small.tsv'
top100_query_id_filename = 'top100_dev_small_query_ids.tsv'
output_filename = 'bm25_rank_labels_dev_small.tsv'

#### Anserini BM25 Top 100 Ranking

In [47]:
bm25_rankings = pd.read_csv(anserini_dir + anserini_filename,delimiter='\t',encoding='utf-8', header=None)
bm25_rankings.columns = ['query_id', 'passage_id', 'ranking']

In [48]:
bm25_rankings.shape

(697868, 3)

In [49]:
len(np.unique(bm25_rankings['query_id'].tolist()))

6980

SKIP THIS: Apparently Anserini was not able to find a full top 100 for every query. One option is to filter out the queries that do not have a complete top-100 ranking. However, it may also be the case that these queries have passages that share a rank.
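
One way to see which queries are affected is to count the retrieved passages per query. The sketch below uses a toy dataframe with the same columns as `bm25_rankings`; the values and the name `expected_depth` are made up for illustration (in this notebook the depth would be 100).

```python
import pandas as pd

# Toy stand-in for bm25_rankings (hypothetical values).
toy_rankings = pd.DataFrame({
    "query_id": [1, 1, 1, 2, 2],
    "passage_id": [10, 11, 12, 20, 21],
    "ranking": [1, 2, 3, 1, 2],
})

expected_depth = 3  # would be 100 for a top-100 run

# Number of retrieved passages per query.
passages_per_query = toy_rankings.groupby("query_id").size()

# Queries with fewer passages than the expected depth.
incomplete_ids = passages_per_query[passages_per_query < expected_depth].index.tolist()
```

Counting rows per query does not depend on the highest rank value, so it also covers the case where passages share a rank.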

In [5]:
'''
ids_100 = bm25_rankings[bm25_rankings['ranking'] == 100]['query_id'].tolist()
bm25_rankings = bm25_rankings[bm25_rankings['query_id'].isin(ids_100)].copy()
bm25_rankings.shape
bm25_rankings.ranking.value_counts()[100]
'''

"\nids_100 = bm25_rankings[bm25_rankings['ranking'] == 100]['query_id'].tolist()\nbm25_rankings = bm25_rankings[bm25_rankings['query_id'].isin(ids_100)].copy()\nbm25_rankings.shape\nbm25_rankings.ranking.value_counts()[100]\n"

#### MSMARCO Relevance File

In [50]:
msmarco_relevance_df = pd.read_csv(msmarco_dir + relevance_filename,delimiter='\t',encoding='utf-8', header=None)
msmarco_relevance_df.columns = ['query_id', 'label1', 'passage_id', 'label2']

In [51]:
len(np.unique(msmarco_relevance_df['query_id'].tolist()))

6980

## Merge Anserini Ranking with MSMARCO Relevance

Here we left-merge the BM25 ranking dataframe with the MSMARCO relevance dataframe on query\_id and passage\_id. The MSMARCO relevance dataframe contains a column named label2. After the merge, label2 is either NaN (the ranked passage is not in the relevance file) or the actual value from the MSMARCO relevance file, which is 1. We then create a new column named true\_label, which sets each NaN to zero, and sum that column to get the number of MSMARCO relevant passages that appear in a top-100 ranking.

In [52]:
ranking_relevance_merged_df = bm25_rankings.merge(msmarco_relevance_df,how='left',on=['query_id','passage_id'])
ranking_relevance_merged_df['true_label'] = ranking_relevance_merged_df['label2'].fillna(0)
ranking_relevance_merged_df['true_label'].sum()

4925.0

In total, there are 6980 queries with a top 100 computed with BM25. Of the passages labeled relevant by MSMARCO, 4925 are ranked inside the top 100.
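
The merge-and-count step can be illustrated on a toy example. The dataframes below are made up and only mirror the column layout used in this notebook; the second query's relevant passage (id 99) is deliberately absent from the toy ranking.

```python
import pandas as pd

# Toy ranking: two queries with two ranked passages each (hypothetical values).
toy_rankings = pd.DataFrame({
    "query_id": [1, 1, 2, 2],
    "passage_id": [10, 11, 20, 21],
    "ranking": [1, 2, 1, 2],
})

# Toy relevance file: one relevant passage per query.
toy_qrels = pd.DataFrame({
    "query_id": [1, 2],
    "label1": [0, 0],
    "passage_id": [11, 99],
    "label2": [1, 1],
})

# The left merge keeps every ranked row; label2 is NaN where there is no match.
toy_merged = toy_rankings.merge(toy_qrels, how="left", on=["query_id", "passage_id"])
toy_merged["true_label"] = toy_merged["label2"].fillna(0)

# Number of relevant passages found in the rankings.
hits = int(toy_merged["true_label"].sum())
```

Only the pair (1, 11) matches, so three of the four merged rows have NaN in `label2` and contribute zero to the sum.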

#### Save queries with relevant passage in top 100

In [53]:
top100_query_ids = np.unique(ranking_relevance_merged_df[ranking_relevance_merged_df['label2'] == 1]['query_id'].tolist())

In [54]:
len(top100_query_ids)

4737

There are 4737 unique query ids with one or more MSMARCO relevant passages in the top 100.
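
The number of unique queries can be lower than the number of relevant passages found, because a single query may have more than one relevant passage in its top 100. A minimal sketch with made-up data:

```python
import pandas as pd

# Toy merged rows (hypothetical): query 1 has two relevant passages,
# query 2 has one, and query 3 has none in its ranking.
toy = pd.DataFrame({
    "query_id": [1, 1, 2, 3],
    "label2": [1.0, 1.0, 1.0, None],
})

relevant = toy[toy["label2"] == 1]
n_relevant_passages = len(relevant)              # rows with a relevant passage
n_queries_with_hit = relevant["query_id"].nunique()  # distinct queries among them
```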

In [55]:
top100_query_ids_df = pd.DataFrame(top100_query_ids)
top100_query_ids_df.columns = ['query_id']
top100_query_ids_df.to_csv(output_dir + top100_query_id_filename,sep="\t", header=False,index=False)

## Label Relevance Rank

#### Sort Query ids

In [56]:
high_ids = ranking_relevance_merged_df[(ranking_relevance_merged_df['ranking'] < 21) & (ranking_relevance_merged_df['label2'] == 1)][['query_id', 'passage_id']].values.tolist()
medium_ids = ranking_relevance_merged_df[(ranking_relevance_merged_df['ranking'] > 20) & (ranking_relevance_merged_df['ranking'] < 81) & (ranking_relevance_merged_df['label2'] == 1)][['query_id', 'passage_id']].values.tolist()
low_ids = ranking_relevance_merged_df[(ranking_relevance_merged_df['ranking'] > 80) & (ranking_relevance_merged_df['label2'] == 1)][['query_id', 'passage_id']].values.tolist()

#### Create New Dataframe

In [57]:
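def getLabel(query_id, passage_id):
    """Return the rank label for a (query_id, passage_id) pair.

    Pairs that are not in high_ids, medium_ids or low_ids were not
    retrieved in the BM25 top 100 and get the label "outside scope".
    """
    entry = [query_id,passage_id]
    if entry in high_ids:
        return "high"
    elif entry in medium_ids:
        return "medium"
    elif entry in low_ids:
        return "low"
    else:
        return "outside scope"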
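
As an alternative to the list lookups, the same labels could be derived in a vectorised way with `pd.cut` followed by a merge. This is a sketch on toy data, not the approach used in this notebook; the dataframes below are hypothetical stand-ins for `ranking_relevance_merged_df` and the relevance rows to be labelled.

```python
import pandas as pd

# Toy stand-in for ranking_relevance_merged_df (hypothetical values).
toy_merged = pd.DataFrame({
    "query_id": [1, 2, 3],
    "passage_id": [10, 20, 30],
    "ranking": [5, 42, 97],
    "label2": [1.0, 1.0, 1.0],
})

# Toy stand-in for the relevance rows to label; (4, 40) was not retrieved.
toy_qrels = pd.DataFrame({
    "query_id": [1, 2, 3, 4],
    "passage_id": [10, 20, 30, 40],
})

# Bin the ranks of relevant passages: (0, 20], (20, 80], (80, 100].
relevant = toy_merged[toy_merged["label2"] == 1].copy()
relevant["label"] = pd.cut(
    relevant["ranking"],
    bins=[0, 20, 80, 100],
    labels=["high", "medium", "low"],
).astype(object)

# Pairs without a rank in the top 100 get NaN, which becomes "outside scope".
labelled = toy_qrels.merge(
    relevant[["query_id", "passage_id", "label"]],
    how="left",
    on=["query_id", "passage_id"],
)
labelled["label"] = labelled["label"].fillna("outside scope")
```

This avoids scanning a Python list for every row, which may matter for larger relevance files.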

In [58]:
output_df = pd.DataFrame(top100_query_ids)
output_df.columns = ['query_id']
output_df.shape

(4737, 1)

In [59]:
output_df = output_df.merge(msmarco_relevance_df,how='left',on=['query_id'])
output_df.shape

(5057, 4)

In [60]:
output_df['label'] = output_df.apply(lambda x: getLabel(x.query_id, x.passage_id), axis=1)
output_df.shape

(5057, 5)

In [61]:
len(np.unique(output_df['query_id'].tolist()))

4737

Some query_ids are linked to multiple relevant passages. For now, I remove those query_ids, so every remaining query has exactly one labeled passage.

In [62]:
ids_of_interest = []
vc = output_df['query_id'].value_counts()
for k,v in vc.items():
    if v > 1:
        ids_of_interest.append(k)

In [63]:
output_df = output_df[~output_df['query_id'].isin(ids_of_interest)]
output_df.shape

(4460, 5)
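
The same filtering could also be written without an explicit loop by using `duplicated(keep=False)`, which marks every row whose query_id occurs more than once. A sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for output_df (hypothetical values): query 2 has two passages.
toy_output = pd.DataFrame({
    "query_id": [1, 2, 2, 3],
    "passage_id": [10, 20, 21, 30],
})

# keep=False flags all occurrences of a repeated query_id, not just the later ones.
has_multiple = toy_output["query_id"].duplicated(keep=False)
single_passage_df = toy_output[~has_multiple]
```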

In [64]:
del output_df['label1']
del output_df['label2']

In [65]:
output_df.to_csv(output_dir + output_filename,sep="\t", header=False,index=False)