# Analyse Labels

This notebook gathers statistics on the rank labels that BM25 and BERT assign to each query-passage pair.

These labels are:
- high: rank 1-20
- medium: rank 21-80
- low: rank 81-100
- outside scope: rank > 100
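
The thresholds above can be written as a small function. This is only a sketch: `rank_to_label` is a hypothetical helper that does not appear in this notebook (the label files are produced elsewhere), and it assumes ranks are 1-based integers.

```python
def rank_to_label(rank):
    """Map a 1-based rank to its label (hypothetical helper, not part of the notebook)."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    if rank <= 20:
        return "high"
    if rank <= 80:
        return "medium"
    if rank <= 100:
        return "low"
    return "outside scope"


# Print the label on either side of each boundary.
for r in (1, 20, 21, 80, 81, 100, 101):
    print(r, rank_to_label(r))
```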

## Imports

In [1]:
import pandas as pd
import numpy as np

## Paths

In [2]:
msmarco_dir = "../data/msmarco_files/"
anserini_dir = "../data/anserini_output/"
output_dir = "../data/output/"

## Load Data

Change the following filenames to match the dataset you want to analyse.

In [3]:
bert_labels_filename = 'bert_rank_labels_dev_small.tsv'
bm25_labels_filename = 'bm25_rank_labels_dev_small.tsv'
query_filename = 'queries.dev.small.tsv'
output_filename = "label_analysis_dev_small.tsv"

In [4]:
bert_labels_df = pd.read_csv(output_dir + bert_labels_filename,delimiter='\t',encoding='utf-8', header=None)
bert_labels_df.columns = ['query_id', 'passage_id', 'bert_label']

In [5]:
bert_labels_df.head(5)

Unnamed: 0,query_id,passage_id,bert_label
0,2,4339068,high
1,1215,7395960,high
2,1288,7473138,high
3,2235,7609417,high
4,2798,7800991,high


In [6]:
bert_labels_df.shape

(4460, 3)

In [7]:
bm25_labels_df = pd.read_csv(output_dir + bm25_labels_filename,delimiter='\t',encoding='utf-8',header=None)
bm25_labels_df.columns = ['query_id', 'passage_id', 'bm25_label']

In [8]:
bm25_labels_df.head(5)

Unnamed: 0,query_id,passage_id,bm25_label
0,2,4339068,high
1,1215,7395960,high
2,1288,7473138,high
3,2235,7609417,high
4,2798,7800991,high


In [10]:
bm25_labels_df.shape

(4460, 3)

In [11]:
query_df = pd.read_csv(msmarco_dir + query_filename,delimiter='\t',encoding='utf-8',header=None)
query_df.columns = ['query_id','query_text']

In [12]:
query_df.head(5)

Unnamed: 0,query_id,query_text
0,1048585,what is paula deen's brother
1,2,Androgen receptor define
2,524332,treating tension headaches without medication
3,1048642,what is paranoid sc
4,524447,treatment of varicose veins in legs


In [13]:
query_df.shape

(6980, 2)

In [14]:
passage_df = pd.read_csv(msmarco_dir + 'collection.tsv',delimiter='\t',encoding='utf-8',header=None)
passage_df.columns = ['passage_id','passage_text']

In [15]:
passage_df.head(5)

Unnamed: 0,passage_id,passage_text
0,0,The presence of communication amid scientific ...
1,1,The Manhattan Project and its atomic bomb help...
2,2,Essay on The Manhattan Project - The Manhattan...
3,3,The Manhattan Project was the name for a proje...
4,4,versions of each volume as well as complementa...


In [16]:
passage_df.shape

(8841823, 2)

## Analyse Labels

In [17]:
print("BM25 labels")
print("Nr high labels: " + str(bm25_labels_df[bm25_labels_df['bm25_label'] == 'high'].shape[0]))
print("Nr medium labels: " + str(bm25_labels_df[bm25_labels_df['bm25_label'] == 'medium'].shape[0]))
print("Nr low labels: " + str(bm25_labels_df[bm25_labels_df['bm25_label'] == 'low'].shape[0]))
print("Nr outside scope labels: " + str(bm25_labels_df[bm25_labels_df['bm25_label'] == 'outside scope'].shape[0]))

BM25 labels
Nr high labels: 3249
Nr medium labels: 1063
Nr low labels: 148
Nr outside scope labels: 0


In [18]:
print("BERT labels")
print("Nr high labels: " + str(bert_labels_df[bert_labels_df['bert_label'] == 'high'].shape[0]))
print("Nr medium labels: " + str(bert_labels_df[bert_labels_df['bert_label'] == 'medium'].shape[0]))
print("Nr low labels: " + str(bert_labels_df[bert_labels_df['bert_label'] == 'low'].shape[0]))
print("Nr outside scope labels: " + str(bert_labels_df[bert_labels_df['bert_label'] == 'outside scope'].shape[0]))

BERT labels
Nr high labels: 4147
Nr medium labels: 310
Nr low labels: 3
Nr outside scope labels: 0
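
The per-label counts printed above could also be obtained in one step with `value_counts`. The sketch below uses a toy Series rather than the real label files; `LABEL_ORDER` and `count_labels` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Fixed reporting order, so labels that never occur still show up with a count of 0.
LABEL_ORDER = ["high", "medium", "low", "outside scope"]


def count_labels(labels):
    """Return the number of occurrences of each label, in LABEL_ORDER."""
    return labels.value_counts().reindex(LABEL_ORDER, fill_value=0)


# Toy data standing in for a label column such as bm25_labels_df['bm25_label'].
toy_labels = pd.Series(["high", "high", "medium", "low", "high"])
counts = count_labels(toy_labels)
print(counts)
```

Reindexing with `fill_value=0` keeps the "outside scope" row in the result even when no pair carries that label, which matches the explicit zero printed by the cells above.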


In [19]:
merged_labels_df = bm25_labels_df.merge(bert_labels_df,how='left',on=['query_id','passage_id'])

In [20]:
merged_labels_df.head(5)

Unnamed: 0,query_id,passage_id,bm25_label,bert_label
0,2,4339068,high,high
1,1215,7395960,high,high
2,1288,7473138,high,high
3,2235,7609417,high,high
4,2798,7800991,high,high


In [21]:
merged_labels_df.shape

(4460, 4)
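
Because this is a left merge, a BM25 row with no BERT counterpart would end up with a missing `bert_label`. Matching row counts alone do not rule this out. A minimal sketch of how to check, using toy frames in place of the real data and pandas' `indicator` and `validate` options:

```python
import pandas as pd

# Toy stand-ins for bm25_labels_df and bert_labels_df; the third pair has no BERT label.
bm25 = pd.DataFrame({
    "query_id": [1, 2, 3],
    "passage_id": [10, 20, 30],
    "bm25_label": ["high", "medium", "low"],
})
bert = pd.DataFrame({
    "query_id": [1, 2],
    "passage_id": [10, 20],
    "bert_label": ["high", "high"],
})

merged = bm25.merge(
    bert,
    how="left",
    on=["query_id", "passage_id"],
    validate="one_to_one",  # raises if a key pair is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Rows present only on the BM25 side have NaN in bert_label.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "pair(s) without a BERT label")
```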

In [22]:
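def label2num(label):
    """Map a rank label to an ordinal score (a higher score means a better rank)."""
    if label == "outside scope":
        return 0
    elif label == "low":
        return 1
    elif label == "medium":
        return 2
    elif label == "high":
        return 3
    # Fail loudly on unexpected values (e.g. NaN from an unmatched left merge)
    # instead of silently returning None.
    raise ValueError(f"Unknown label: {label!r}")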

In [23]:
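def scanLabels(bm25_label, bert_label):
    """Compare numeric labels: one '+' per level BERT is above BM25,
    one '-' per level it is below, and '0' when both agree."""
    if bert_label > bm25_label:
        return "+" * (bert_label - bm25_label)
    elif bert_label == bm25_label:
        return "0"
    else:
        return "-" * (bm25_label - bert_label)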

In [24]:
# Map each label column element-wise; no row-wise apply is needed.
merged_labels_df['bm25_num_label'] = merged_labels_df['bm25_label'].map(label2num)
merged_labels_df['bert_num_label'] = merged_labels_df['bert_label'].map(label2num)

In [25]:
merged_labels_df.head(5)

Unnamed: 0,query_id,passage_id,bm25_label,bert_label,bm25_num_label,bert_num_label
0,2,4339068,high,high,3,3
1,1215,7395960,high,high,3,3
2,1288,7473138,high,high,3,3
3,2235,7609417,high,high,3,3
4,2798,7800991,high,high,3,3


In [26]:
merged_labels_df['label_comparison'] = merged_labels_df.apply(lambda x: scanLabels(x.bm25_num_label,x.bert_num_label),axis=1)
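
The `label_comparison` column records how far BERT moves each pair relative to BM25. A complementary view, not part of the original analysis, is a contingency table of the two label columns. The sketch below builds one with `pd.crosstab` on toy data:

```python
import pandas as pd

# Toy stand-in for merged_labels_df with the two label columns.
toy = pd.DataFrame({
    "bm25_label": ["high", "medium", "low", "medium", "high"],
    "bert_label": ["high", "high", "high", "medium", "medium"],
})

# Rows: BM25 label, columns: BERT label, cells: number of query-passage pairs.
table = pd.crosstab(toy["bm25_label"], toy["bert_label"])
print(table)
```

Cells where the row and column labels are equal count agreements; all other cells count pairs whose label differs between the two rankers.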

## Create Output Dataframe

In [27]:
output_df = merged_labels_df.merge(query_df,how='left',on=['query_id'])

In [28]:
output_df = output_df.merge(passage_df,how='left',on=['passage_id'])

In [29]:
output_df.head(5)

Unnamed: 0,query_id,passage_id,bm25_label,bert_label,bm25_num_label,bert_num_label,label_comparison,query_text,passage_text
0,2,4339068,high,high,3,3,0,Androgen receptor define,"The androgen receptor (AR), also known as NR3C..."
1,1215,7395960,high,high,3,3,0,3 levels of government in canada and their res...,"In Canada, there are 3 levels of government. E..."
2,1288,7473138,high,high,3,3,0,3/5 of 60,3/5 = 60/1 3*60 = 5*1 180/5 = 36*1 answer = 36.
3,2235,7609417,high,high,3,3,0,Bethel University was founded in what year,Bethel University is a private institution tha...
4,2798,7800991,high,high,3,3,0,Does Suddenlink Carry ESPN3,"Earlier this month, Suddenlink and ESPN parent..."


In [30]:
output_df.to_csv(output_dir + output_filename,sep="\t", header=False,index=False)
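
Since the file is written with `header=False`, the column names have to be supplied again when it is read back. The sketch below shows one way to do that; it reads from an in-memory string holding two toy rows instead of the real output file, and `COLUMNS` is an illustrative name.

```python
import io

import pandas as pd

# Column order as written by the notebook's output_df.
COLUMNS = [
    "query_id", "passage_id", "bm25_label", "bert_label",
    "bm25_num_label", "bert_num_label", "label_comparison",
    "query_text", "passage_text",
]

# Two toy rows in the same tab-separated, headerless layout.
sample = (
    "1\t10\thigh\thigh\t3\t3\t0\ttoy query a\ttoy passage a\n"
    "2\t20\tmedium\thigh\t2\t3\t+\ttoy query b\ttoy passage b\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=COLUMNS,
    # Read the comparison column as text so "0" is not parsed as the integer 0.
    dtype={"label_comparison": str},
)
print(df[["query_id", "label_comparison"]])
```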