#### **BEIR SGPT**

This notebook reranks BM25 results with GPT models, scoring each query-document pair by the log-probability the model assigns to the query given the document.
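
As a rough illustration of the idea (not the implementation used in this notebook, which runs a real GPT model), the sketch below scores each candidate document by the summed log-probability of the query tokens and sorts candidates by that score. `toy_logits` is a made-up stand-in for a language model, and all function names here are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def query_logprob(logits_per_position, query_tokens):
    """Sum log P(token) over the query positions.

    logits_per_position[i] holds the logits used to predict query_tokens[i].
    """
    total = 0.0
    for row, tok in zip(logits_per_position, query_tokens):
        total += log_softmax(row)[tok]
    return total

def rerank(candidates, query_tokens, logits_fn):
    """Sort doc ids by descending log P(query | doc)."""
    scores = {doc_id: query_logprob(logits_fn(doc, query_tokens), query_tokens)
              for doc_id, doc in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical stand-in for a model: favours tokens that also occur in the doc.
def toy_logits(doc_tokens, query_tokens, vocab_size=5):
    return [[2.0 if v in doc_tokens else 0.0 for v in range(vocab_size)]
            for _ in query_tokens]

candidates = {"d1": [0, 1], "d2": [3, 4]}
ranking, scores = rerank(candidates, query_tokens=[3, 4], logits_fn=toy_logits)
print(ranking)
```

Here the document sharing tokens with the query ranks first, because the toy model assigns those tokens a higher probability.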

##### Setup

In [1]:
# Quick torch & cuda test
import torch
print(torch.cuda.current_device())
print(torch.cuda.is_available())
device = torch.device("cuda:1")
torch.rand(10).to(device)

0
True


tensor([0.3950, 0.0391, 0.8199, 0.9315, 0.2679, 0.5504, 0.0635, 0.1426, 0.1325,
        0.8696], device='cuda:1')

In [1]:
from beir import util, LoggingHandler
import logging
# Code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])



In [2]:
%%writefile kaggle.json
{"username":"YOURNAME","key":"APIKEY"}

Writing kaggle.json


In [None]:
# Creates the dir /home/.kaggle
!kaggle

In [4]:
!mv kaggle.json /home/.kaggle/kaggle.json
!kaggle datasets download -d 'ANONYMIZED/beirbm25results'
!unzip beirbm25results

Downloading beirbm25results.zip to /home/repos/semanticsearch
100%|████████████████████████████████████████| 433M/433M [01:32<00:00, 4.97MB/s]
100%|████████████████████████████████████████| 433M/433M [01:32<00:00, 4.93MB/s]
Archive:  beirbm25results.zip
  inflating: beir_bm25_ndcgs.json    
  inflating: results_arguana.json    
  inflating: results_climate-fever.json  
  inflating: results_cqadupstack_android.json  
  inflating: results_cqadupstack_english.json  
  inflating: results_cqadupstack_gaming.json  
  inflating: results_cqadupstack_gis.json  
  inflating: results_cqadupstack_mathematica.json  
  inflating: results_cqadupstack_physics.json  
  inflating: results_cqadupstack_programmers.json  
  inflating: results_cqadupstack_stats.json  
  inflating: results_cqadupstack_tex.json  
  inflating: results_cqadupstack_unix.json  
  inflating: results_cqadupstack_webmasters.json  
  inflating: results_cqadupstack_wordpress.json  
  inflating: results_dbpedia-entity.json  
  inflatin

##### GPT Cross Encoder Experiments

In [2]:
##### Reranker & Logprobs extraction
##### Taken & adapted from: https://github.com/EleutherAI/lm-evaluation-harness/blob/ff8de12027f6d2d774bf56c3885f75122a37377c/lm_eval/models/gpt2.py#L110

import transformers
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
import numpy as np

def encode(requests, tokenizer):
    new_reqs = []
    # Order swapped relative to the original: requests are (query, doc) pairs and the query must be the continuation
    for continuation, context in requests:
        if context == "":
            # end of text as context
            context_enc = [tokenizer.eos_token_id]
        else:
            context_enc = tokenizer.encode(context, add_special_tokens=False)

        continuation_enc = tokenizer.encode(continuation, add_special_tokens=False)

        new_reqs.append(((context, continuation), context_enc, continuation_enc))

    return new_reqs 
    #self._loglikelihood_tokens(new_reqs)

import collections

def group(arr, fn):
    res = collections.defaultdict(list)

    for ob in arr:
        res[fn(ob)].append(ob)
    
    return list(res.values())

class Reorderer:
    def __init__(self, arr, fn):
        self.size = len(arr)
        arr = list(enumerate(arr))
        arr = group(arr, lambda x: fn(x[1]))
        arr = [
            ([y[0] for y in x], x[0][1]) for x in arr
        ]
        arr.sort(key=lambda x: fn(x[1]))

        self.arr = arr
        
    
    def get_reordered(self):
        return [x[1] for x in self.arr]
    
    def get_original(self, newarr):
        res = [None] * self.size
        cov = [False] * self.size

        for (inds, _), v in zip(self.arr, newarr):
            for ind in inds: 
                res[ind] = v
                cov[ind] = True
        
        assert all(cov)
        
        return res

def chunks(iter, n):
    arr = []
    for x in iter:
        arr.append(x)
        if len(arr) == n:
            yield arr
            arr = []
    
    if arr: yield arr

def _model_call(inps, model):
    """
    inps: a torch tensor of shape [batch, sequence]
    the size of sequence may vary from call to call
    returns: a torch tensor of shape [batch, sequence, vocab] with the
    logits returned from the model
    """
    return model(inps)[0][:, :, :50257]

def _loglikelihood_tokens(requests, model, max_length, device, disable_tqdm=False, batch_size=1, 
                          sub_select_idx=None, instruction_len=0, tokenizer=None, debug=False):
    # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
    res = []
    with torch.no_grad():

        def _collate(x):
            # the negative sign on len(toks) sorts descending - this has a few advantages:
            # - time estimates will always be over not underestimates, which is more useful for planning
            # - to know the size of a batch when going through the list, you know the first one is always the batch padded context length.
            #   this is useful to simplify the batching logic and more importantly to make automatic adaptive batches much much easier to implement
            # - any OOMs will happen right away rather than near the end

            toks = x[1] + x[2]
            return (-len(toks), tuple(toks))
        
        # TODO: automatic (variable) batch size detection for vectorization
        reord = Reorderer(requests, _collate)
        for chunk in chunks(tqdm(reord.get_reordered(), disable=disable_tqdm), batch_size):
            inps = []
            contlens = []
            inplens = []

            padding_length = None

            # because vectorizing is annoying, we first convert each (context, continuation) pair to padded
            # tensors, then we pack them together into a batch, call the model, and then pick it all apart
            # again because vectorizing is annoying

            for _, context_enc, continuation_enc in chunk:
                # sanity check
                assert len(context_enc) > 0
                assert len(continuation_enc) > 0
                assert len(continuation_enc) <= max_length

                # how this all works:
                #          CTX      CONT
                # inp    0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
                # gpt2    \               \
                # logits   1 2 3|4 5 6 7 8 9   <- the ctx half gets tossed out by the [:, -len(continuation_enc):, :self.VOCAB_SIZE] slice
                # cont_toks      4 5 6 7 8 9

                # Original:
                #inp = torch.tensor(
                #    (context_enc + continuation_enc)[-(self.max_length+1):][:-1]
                #, dtype=torch.long).to(self.device)
                
                # Modified
                # when too long to fit in context, truncate from the left (after the initial instruction) & remove the final token
                inp = torch.tensor(
                    # Instruction + Text + Continuation
                    # Truncation from right: [:(max_length+1-instruction_len)]
                    # Truncation from left:  [-(max_length+1-instruction_len):]
                    #(context_enc[:instruction_len] + ((context_enc[instruction_len:] + continuation_enc)[:(max_length+1-instruction_len)]))[:-1]
                    (context_enc[:instruction_len] + ((context_enc[instruction_len:] + continuation_enc)[-(max_length+1-instruction_len):]))[:-1]
                , dtype=torch.long).to(device)
                inplen, = inp.shape

                cont = continuation_enc

                # since in _collate we make sure length is descending, the longest is always the first one.
                padding_length = padding_length if padding_length is not None else inplen

                # pad to length
                inp = torch.cat([
                    inp, # [seq]
                    torch.zeros(padding_length - inplen, dtype=torch.long).to(inp.device) # [padding_length - seq]
                ], dim=0)

                if debug:
                    print("Model Input")
                    print(tokenizer.decode(inp))

                inps.append(inp.unsqueeze(0))
                contlens.append(cont)
                inplens.append(inplen)
               
            # Run the model once on the padded batch: [batch, seq, vocab]
            output_logits = _model_call(torch.cat(inps, dim=0), model)

            if sub_select_idx:
                if debug:
                    print("Subselecting tokens:")
                    print(tokenizer.decode(sub_select_idx))
                # Subselect vocab for softmax by masking out all other vocab
                mask = torch.zeros_like(output_logits)
                mask[:,:,sub_select_idx] = 1
                output_logits = output_logits.masked_fill(mask == 0, float('-inf'))

            multi_logits = F.log_softmax(output_logits, dim=-1).cpu()  # [batch, seq, vocab]

            for (cache_key, _, _), logits, inp, inplen, cont_toks in zip(chunk, multi_logits, inps, inplens, contlens):
                contlen = len(cont_toks)

                logits = logits[inplen-contlen:inplen].unsqueeze(0) # [1, seq, vocab]

                greedy_tokens = logits.argmax(dim=-1)

                # cont_toks :: [1, seq]
                cont_toks = torch.tensor(cont_toks, dtype=torch.long).unsqueeze(0)
                
                if debug:
                    print("Continuation Given")
                    print(tokenizer.batch_decode(cont_toks))
                    print("Continuation Produced")
                    print(tokenizer.batch_decode(greedy_tokens))

                max_equal = (greedy_tokens == cont_toks).all()

                #last_token_slice = logits[:, -1, :].squeeze(0).tolist()

                # cont_toks are the vocab indices that make up the perfect continuation
                # Hence we gather those vocab indices from the logits, i.e. their probabilities
                logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(-1) # [1, seq]

                #answer = (float(logits.sum()), bool(max_equal))
                # Sum to get a total score of that continuation
                res.append(float(logits.sum()))

                # partial caching
                #if cache_key is not None:
                #    self.cache_hook.add_partial("loglikelihood", cache_key, answer)

                #res.append(answer)

    return reord.get_original(res)
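
The truncation step in `_loglikelihood_tokens` is easy to misread, so here is a minimal sketch of it on plain integer lists. It mirrors the slicing expression from the cell above; the function name `build_input` and the toy token ids are made up for illustration.

```python
def build_input(context_enc, continuation_enc, max_length, instruction_len=0):
    """Keep the first instruction_len context tokens, left-truncate the rest,
    then drop the final token (it is only ever a prediction target)."""
    head = context_enc[:instruction_len]
    tail = (context_enc[instruction_len:] + continuation_enc)[-(max_length + 1 - instruction_len):]
    return (head + tail)[:-1]

instruction = [100, 101]          # toy ids for the prompt prefix
document = [1, 2, 3, 4, 5, 6]     # toy ids for the document text
query = [7, 8, 9]                 # toy ids for the continuation

inp = build_input(instruction + document, query, max_length=6, instruction_len=2)
print(inp)
```

With `max_length=6` the instruction survives, the oldest document tokens are dropped, and the last query token is removed from the input. The logits at the final three input positions are then the ones that predict the three query tokens.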

In [3]:
from typing import List, Tuple

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

from beir.reranking import Rerank

class GPTRanker:
    def __init__(self, model=None, model_path="EleutherAI/gpt-neo-1.3B", use_prompt=True, prompt_doc="{}\n", prompt_doc_start="{}\n{}\n",
                 debug=False, fewshots="", **kwargs):
        """
        GPTRanker producing log-probabilities for reranking doc & query with a GPT-like model
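        Args:
            model: Optional preloaded model (e.g. half-precision or parallelized GPT-J); loaded from model_path if None
            model_path: HuggingFace weight name of a decoder transformer model
            use_prompt: Whether to use a prompt
            prompt_doc: Prompting scheme to embed the document
                Needs to contain exactly one {} for the document; the query is the continuation whose logprobs are scored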
            prompt_doc_start: Prompting scheme specifically used for the first example, e.g. to include description
            fewshots: Fewshot example to use [doc, query]
            debug: To get information while running
        """
        
        #if device is None:
        self.device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
        
        if model is not None:
            # Allow specific model loads, e.g. for GPT-J half-precision; Or parallel
            self.model = model
        else:
            self.model = AutoModelForCausalLM.from_pretrained(model_path).to(self.device)

        self.model.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        
        if self.model.config.model_type == 'gpt2':
            self.max_length = self.model.config.n_ctx
        elif self.model.config.model_type == 'gptj':
            self.max_length = self.model.config.n_ctx
        elif self.model.config.model_type == 'gpt_neo':
            self.max_length = self.model.config.max_position_embeddings
        else:
            raise ValueError(f"Unknown model of type {self.model.config.model_type}")
            
        # Truncation will be done from the left in the log likelihood
        self.prompt_doc = prompt_doc
        self.use_prompt = use_prompt
        self.instruction_len = len(self.tokenizer.tokenize(self.prompt_doc[:self.prompt_doc.index("{")]))
        self.debug = debug
    
        self.fewshots = fewshots
        if self.fewshots:
            # doc, query
            self.fewshots = prompt_doc_start.format(self.fewshots[0], self.fewshots[1])
            # Still take overflowing tokens away from the current doc (not the fewshot doc)
            self.instruction_len += len(self.tokenizer.tokenize(self.fewshots))
            
    def predict(self, sentences: List[Tuple[str,str]], batch_size: int, **kwargs) -> List[float]:
        """
        Args:
          sentences: [query, document]
          batch_size: Unused

        Returns:
          log_probs: float log probability for each query-doc pair
        """
        # TODO: Possibly feed in batched?; Depending on model size?
        if self.use_prompt:
            # Leave queries as is, as all its tokens will be used to compute the loglikelihoods
            sentences = [(query, self.fewshots + self.prompt_doc.format(doc)) for (query, doc) in sentences]

        encoded = encode(sentences, self.tokenizer)
        # loglikelihood batch_size is not the batch_size fed into this func
        log_probs = _loglikelihood_tokens(encoded, self.model, self.max_length, self.device, instruction_len=self.instruction_len, 
                                          tokenizer=self.tokenizer, debug=self.debug)

        return log_probs
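
To make the prompt handling in `GPTRanker` concrete, the sketch below shows how a (query, document) pair becomes a (continuation, context) pair and how the instruction length is derived from the text before the first placeholder. It assumes a whitespace tokenizer (`toy_tokenize`, hypothetical) in place of the HuggingFace tokenizer, so the token counts differ from what a real model would see.

```python
def toy_tokenize(text):
    """Hypothetical whitespace tokenizer standing in for the HF tokenizer."""
    return text.split()

def build_requests(pairs, prompt_doc, fewshots=""):
    """Turn (query, doc) pairs into (continuation, context) pairs."""
    return [(query, fewshots + prompt_doc.format(doc)) for query, doc in pairs]

prompt_doc = 'The document "{}" is a good search result for "'

# Tokens before the first placeholder form the instruction that truncation must keep.
instruction_len = len(toy_tokenize(prompt_doc[:prompt_doc.index("{")]))

requests = build_requests(
    [("what causes rain", "Rain forms when water vapour condenses.")],
    prompt_doc,
)
continuation, context = requests[0]
print(instruction_len)
print(context)
```

The query is left untouched because every one of its tokens contributes to the score, while the document is wrapped in the prompt and serves only as conditioning context.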

In [4]:
### Main Loop A: Using fewshot=0, varying prompts with GPT Reranker ###

import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval


### GPT-Neo ###

#model_path = "EleutherAI/gpt-neo-125M"
#model_out_name = "gptneo01" 

#model_path = "EleutherAI/gpt-neo-1.3B"
#model_out_name = "gptneo13" 

#model_path = "EleutherAI/gpt-neo-2.7B"
#model_out_name = "gptneo27" 

### GPT-J ###

from transformers import GPTJForCausalLM
import torch

model_path = "EleutherAI/gpt-j-6B"
model_out_name = "gptj"

# Option a) - Half-precision on one GPU (~14GB total GPU Memory needed)
#model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
#device = torch.device('cuda:1') if torch.cuda.is_available() else torch.device('cpu')
#model = model.to(device)

# Option b) - Full-precision on two GPUs (~28GB total GPU Memory needed) [This is used in the experiments]
model = GPTJForCausalLM.from_pretrained(model_path)
model.parallelize()

# Set to True for GPT-J; set to False and comment out the model loading above when using GPT-Neo
use_custom_model = True


prompts_ablations = {
    "A": "{} ",
    "B": "{}\n",
    "C": "Document:\n{}\n\nQuery:\n",
    "D": "Body:{}\n\nTitle:\n",
    "E": "selected document:\n{}\n\nrelevant query:\n",
    "F": "The selected text is:\n{}\n\nThe relevant query is:\n",
    "G": 'Documents are searched to find matches with the same content.\nThe document "{}" is a good search result for "',
    "H": 'Documents are searched to find matches with the same content.\nDocument: "{}"\n\nThe above document is a good match for the query: "',
    "I": '# Get matching document and query with the same content\nget_document()\n{}\nget_query_matching_document()\n"',
    
    # Quora ablations (Make sure to set datasets = ["quora"] only):
    #"quoraA": 'Questions are searched to find matches with the same content.\nThe question "{}" is a good search result for "',
    #"quoraB": 'Below are two similar questions asking the same thing.\nThe question "{}" is similar to "',
    #"quoraC": 'These two questions are the same: 1. {} 2.',
    #"quoraD": 'Question Body: {} Question Title:',
    #"quoraE": ('Question Body: {} Question Title: {}\n', 'Question Body: {} Question Title:'),
    
    
    # Use fewshots=1 scripts
    #"J": "Documents are searched to find matches with the same content.\nDocument:\n{}\nQuery:\n{}\n", # "Document:\n{}\nQuery:\n",),
    #"K": "Document:\n{}\nQuery:\n{}\n",# "Document:\n{}\nQuery:\n",),
    
    # Use script using GPTYesRanker
    #"L": 'An intelligent, helpful bot is given. The bot responds "Yes" if the document is a fit to the query and "No" otherwise.\n###\nDocument: {}\nQuery: {}\nBot:',

    
}

# All datasets
datasets = ["trec-covid", "webis-touche2020", "nfcorpus", "scifact", "fiqa", "dbpedia-entity",
            "nq", "hotpotqa", "quora", "fever", "climate-fever", "arguana", "msmarco", "scidocs", "cqadupstack",
            "signal1m", "trec-news", "bioasq", "robust04"]

# Main prompt
prompts = {"G": 'Documents are searched to find matches with the same content.\nThe document "{}" is a good search result for "',}


def clean_titles(corpus):
    for k in corpus:
        if "title" in corpus[k] and corpus[k]["title"] is None:
            corpus[k]["title"] = ""
    return corpus


def run_reranking(results_bm25_path, results_path, data_path, top_k=100, k_values=[1, 3, 5, 10, 100, 1000]):
    """
    Args:
        results_bm25_path: Path to .json results from bm25 for the dataset
        results_path: Path to .json to write rerank results
        top_k: How many docs to rerank per query
        k_values: For how many docs per query to compute the scores
    """
    
    split = "dev" if "msmarco" in data_path else "test"
    
    corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
    
    corpus = clean_titles(corpus) if "robust04" in data_path else corpus
    
    with open(results_bm25_path, 'r') as fp:
        results_bm25 = json.load(fp)
    
    # Optional, make sure results are correct
    ndcg_bm25, _map_bm25, recall_bm25, precision_bm25 = EvaluateRetrieval.evaluate(qrels, results_bm25, k_values)

    # Rerank top-100 results using the reranker provided
    results_rerank = reranker.rerank(corpus, queries, results_bm25, top_k=top_k)
    
    # Save rerank results
    with open(results_path, 'w') as fp:
        json.dump(results_rerank, fp)

    #### Evaluate retrieval using NDCG@k, MAP@K ...
    ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results_rerank, k_values)

    return (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision)


for prompt_id, prompt_doc in prompts.items():
    
    scores_out_path = f"beir_scores_{model_out_name}_{prompt_id}.json"
    
    # Optionally skip
    #if os.path.exists(os.path.join(os.getcwd(), scores_out_path)):
    #    continue

    ndcgs_bm25 = {}
    ndcgs = {}
    
    logging.info(f"\n{'-' * 20} Running prompt {prompt_id}: {prompt_doc} {'-' * 20}\n")
    
    if use_custom_model:
        reranker = Rerank(GPTRanker(model=model, model_path=model_path, use_prompt=True, prompt_doc=prompt_doc), batch_size=128)
    else:
        reranker = Rerank(GPTRanker(model_path=model_path, use_prompt=True, prompt_doc=prompt_doc, debug=False), batch_size=128)

    for i, dataset in enumerate(datasets):

        logging.info(f"\n{'-' * 10} Running {dataset} {'-' * 10}\n")
        
        if not(os.path.exists(os.path.join(os.getcwd(), 'datasets', dataset))):
            url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
            out_dir = os.path.join(os.getcwd(), "datasets")
            data_path = util.download_and_unzip(url, out_dir)
            print("Dataset downloaded here: {}".format(data_path))
            
        # Load the dataset into BEIR
        data_path = f"datasets/{dataset}"

        # cqadupstack - Contains several sub datasets
        if dataset == "cqadupstack":
            cqa_ndcgs_bm25, cqa_maps_bm25, cqa_recalls_bm25, cqa_precisions_bm25 = [], [], [], []
            cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
            for sub_dataset in os.listdir(data_path):
                sub_data_path = f"datasets/{dataset}/{sub_dataset}"
                
                results_bm25_path = f"results_{dataset}_{sub_dataset}.json"
                results_path = f"results_{model_out_name}_prompt{prompt_id}_{dataset}_{sub_dataset}.json"
                # Skip if already computed these results
                if os.path.exists(os.path.join(os.getcwd(), results_path)):
                    continue

                (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, sub_data_path)

                cqa_ndcgs_bm25.append(ndcg_bm25)
                cqa_maps_bm25.append(_map_bm25)
                cqa_recalls_bm25.append(recall_bm25)
                cqa_precisions_bm25.append(precision_bm25)

                cqa_ndcgs.append(ndcg)
                cqa_maps.append(_map)
                cqa_recalls.append(recall)
                cqa_precisions.append(precision)

            for (metric, metric_group) in [(ndcg_bm25, cqa_ndcgs_bm25), (_map_bm25, cqa_maps_bm25), (recall_bm25, cqa_recalls_bm25), (precision_bm25, cqa_precisions_bm25)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in metric_group]) / len(metric_group)

            for (metric, metric_group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in metric_group]) / len(metric_group)

            logging.info("CQA Final BM25")
            logging.info(f"{ndcg_bm25}")
            logging.info(f"{_map_bm25}")
            logging.info(f"{recall_bm25}")
            logging.info(f"{precision_bm25}")

            logging.info("CQA Final")
            logging.info(f"{ndcg}")
            logging.info(f"{_map}")
            logging.info(f"{recall}")
            logging.info(f"{precision}")

        else:
            results_bm25_path = f"results_{dataset}.json"
            results_path = f"results_{model_out_name}_prompt{prompt_id}_{dataset}.json"
            # Skip if already computed these results
            if os.path.exists(os.path.join(os.getcwd(), results_path)):
                continue
            (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, data_path)

        ndcgs[dataset] = ndcg
        ndcgs_bm25[dataset] = ndcg_bm25

        # Optionally clean-up each time to avoid running out of space
        # !rm -r datasets

    with open(scores_out_path, 'w') as fp:
        json.dump(ndcgs, fp)
        
    # Optionally also save the bm25 results, though they should be the same each time
    #with open(f"./beir_scores_bm25_{prompt_id}.json", 'w') as fp:
    #    json.dump(ndcgs_bm25, fp)

2022-01-16 09:43:49 - Loading faiss with AVX2 support.
2022-01-16 09:43:49 - Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2022-01-16 09:43:49 - Loading faiss.
2022-01-16 09:43:49 - Successfully loaded faiss.
2022-01-16 09:45:08 - 
-------------------- Running prompt quoraD: Question Body: {} Question Title: --------------------

2022-01-16 09:45:13 - 
---------- Running quora ----------

2022-01-16 09:45:13 - Loading Corpus...


  0%|          | 0/522931 [00:00<?, ?it/s]

2022-01-16 09:45:15 - Loaded 522931 TEST Documents.
2022-01-16 09:45:15 - Doc Example: {'text': 'What is the step by step guide to invest in share market in india?', 'title': ''}
2022-01-16 09:45:15 - Loading Queries...
2022-01-16 09:45:15 - Loaded 10000 TEST Queries.
2022-01-16 09:45:15 - Query Example: Which question should I ask on Quora?
2022-01-16 09:45:24 - 

2022-01-16 09:45:24 - NDCG@1: 0.7230
2022-01-16 09:45:24 - NDCG@3: 0.7701
2022-01-16 09:45:24 - NDCG@5: 0.7895
2022-01-16 09:45:24 - NDCG@10: 0.8077
2022-01-16 09:45:24 - NDCG@100: 0.8277
2022-01-16 09:45:24 - NDCG@1000: 0.8312
2022-01-16 09:45:24 - 

2022-01-16 09:45:24 - MAP@1: 0.6310
2022-01-16 09:45:24 - MAP@3: 0.7294
2022-01-16 09:45:24 - MAP@5: 0.7476
2022-01-16 09:45:24 - MAP@10: 0.7596
2022-01-16 09:45:24 - MAP@100: 0.7669
2022-01-16 09:45:24 - MAP@1000: 0.7672
2022-01-16 09:45:24 - 

2022-01-16 09:45:24 - Recall@1: 0.6310
2022-01-16 09:45:24 - Recall@3: 0.7969
2022-01-16 09:45:24 - Recall@5: 0.8495
2022-01-16 09:45:

100%|██████████| 997731/997731 [17:40:34<00:00, 15.68it/s]  


2022-01-17 03:28:44 - 

2022-01-17 03:28:44 - NDCG@1: 0.7343
2022-01-17 03:28:44 - NDCG@3: 0.7915
2022-01-17 03:28:44 - NDCG@5: 0.8127
2022-01-17 03:28:44 - NDCG@10: 0.8297
2022-01-17 03:28:44 - NDCG@100: 0.8441
2022-01-17 03:28:44 - NDCG@1000: 0.8441
2022-01-17 03:28:44 - 

2022-01-17 03:28:44 - MAP@1: 0.6408
2022-01-17 03:28:44 - MAP@3: 0.7494
2022-01-17 03:28:44 - MAP@5: 0.7692
2022-01-17 03:28:44 - MAP@10: 0.7813
2022-01-17 03:28:44 - MAP@100: 0.7879
2022-01-17 03:28:44 - MAP@1000: 0.7879
2022-01-17 03:28:44 - 

2022-01-17 03:28:44 - Recall@1: 0.6408
2022-01-17 03:28:44 - Recall@3: 0.8229
2022-01-17 03:28:44 - Recall@5: 0.8794
2022-01-17 03:28:44 - Recall@10: 0.9277
2022-01-17 03:28:44 - Recall@100: 0.9772
2022-01-17 03:28:44 - Recall@1000: 0.9772
2022-01-17 03:28:44 - 

2022-01-17 03:28:44 - P@1: 0.7343
2022-01-17 03:28:44 - P@3: 0.3454
2022-01-17 03:28:44 - P@5: 0.2299
2022-01-17 03:28:44 - P@10: 0.1270
2022-01-17 03:28:44 - P@100: 0.0145
2022-01-17 03:28:44 - P@1000: 0.0014
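
For CQADupStack, Main Loop A reports one score per metric by averaging over the sub-datasets. A minimal sketch of that macro-averaging step is shown below; the helper name `average_metrics` and all numbers are made up for illustration and are not results from this notebook.

```python
def average_metrics(metric_dicts):
    """Macro-average a list of {metric_name: value} dicts key by key."""
    if not metric_dicts:
        raise ValueError("no sub-dataset results to average")
    keys = metric_dicts[0].keys()
    return {k: sum(d[k] for d in metric_dicts) / len(metric_dicts) for k in keys}

# Made-up numbers for two sub-datasets, for illustration only.
bm25_runs = [{"NDCG@10": 0.30, "NDCG@100": 0.40}, {"NDCG@10": 0.50, "NDCG@100": 0.60}]
rerank_runs = [{"NDCG@10": 0.35, "NDCG@100": 0.45}, {"NDCG@10": 0.55, "NDCG@100": 0.65}]

bm25_avg = average_metrics(bm25_runs)
rerank_avg = average_metrics(rerank_runs)
print(bm25_avg, rerank_avg)
```

Keeping the BM25 and reranked lists separate matters here: averaging the wrong list silently reports reranker scores as the baseline. The explicit empty-list check also surfaces the case where every sub-dataset was skipped because its results file already existed.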


In [4]:
### Main Loop B: Using fewshot=1, varying prompts with GPT Reranker ###

import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

# Pick one model; only the last uncommented assignment takes effect
#model = "EleutherAI/gpt-neo-125M"
#model = "EleutherAI/gpt-neo-1.3B"
model = "EleutherAI/gpt-neo-2.7B"

prompts = {"J": ("Documents are searched to find matches with the same content.\nDocument:\n{}\nQuery:\n{}\n", "Document:\n{}\nQuery:\n",),
           "K": ("Document:\n{}\nQuery:\n{}\n", "Document:\n{}\nQuery:\n",),
           
           # Quora ablations
           #"quoraE": ('Question Body: {} Question Title: {}\n', 'Question Body: {} Question Title:'),
}


MIN_CORP_QUERY_LEN = 0 
# For Quora set the below to avoid a short question with no meaning:
# MIN_CORP_QUERY_LEN = 10 # Prompt will be Question Body: Why do I have nightmares? Question Title: What causes a nightmare?


def get_match_len(qid, tokenizer, corpus, queries, qrels, get_corpus_id=False):
    """
    Get the shortest corpus len given a query
    
    Args:
        qid: id of a query
        tokenizer: HF Tokenizer
        corpus, queries, qrels: Dataset
        get_corpus_id: Whether to return the min len corpus id
    """
    
    query_len = len(tokenizer.tokenize(queries[qid]))
    
    corpora = qrels[qid]
    corpora_lens = []
    for corpus_id, score in corpora.items():
        corpus_len = len(tokenizer.tokenize(corpus[corpus_id]["text"]))
        # Prefer corpora with high score (won't matter for most cases where all scores are 1)
        if (corpus_len + query_len) > MIN_CORP_QUERY_LEN:
            corpora_lens.append((corpus_len + query_len) / (score + 1e-10))
        else:
            corpora_lens.append(float("inf"))
    # Optionally get the shortest fitting corpus id
    if get_corpus_id: return list(corpora.keys())[np.argmin(corpora_lens)]
    return min(corpora_lens)

def run_reranking(results_bm25_path, results_path, data_path, top_k=100, k_values=[1, 3, 5, 10, 100, 1000], fewshots=True):
    """
    Args:
        results_bm25_path: Path to .json results from bm25 for the dataset
        results_path: Path to .json to write rerank results
        top_k: How many docs to rerank per query
        k_values: For how many docs per query to compute the scores
    """
    split = "dev" if "msmarco" in data_path else "test"
    
    corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
    
    with open(results_bm25_path, 'r') as fp:
        results_bm25 = json.load(fp)
    
    # Optional, make sure results are correct
    ndcg_bm25, _map_bm25, recall_bm25, precision_bm25 = EvaluateRetrieval.evaluate(qrels, results_bm25, k_values)
    
    if fewshots:
        tokenizer = AutoTokenizer.from_pretrained(model)
        # Find shortest query
        min_q_id = min(qrels, key=lambda x: get_match_len(x, tokenizer, corpus, queries, qrels))
        # Find matching shortest corpus
        min_corp_id = get_match_len(min_q_id, tokenizer, corpus, queries, qrels, get_corpus_id=True)

        query_shot = queries[min_q_id]
        corp_shot = corpus[min_corp_id]["text"]
        
        fewshots = [corp_shot, query_shot]
        
    reranker = Rerank(GPTRanker(model_path=model, use_prompt=True, prompt_doc=prompt_doc,
                                fewshots=fewshots, prompt_doc_start=prompt_doc_start, debug=False), batch_size=128)

    # Rerank top_k results using the reranker provided
    results_rerank = reranker.rerank(corpus, queries, results_bm25, top_k=top_k)
    
    # Save rerank results
    with open(results_path, 'w') as fp:
        json.dump(results_rerank, fp)

    #### Evaluate retrieval using NDCG@k, MAP@K ...
    ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results_rerank, k_values)

    return (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision)


for prompt_id, (prompt_doc_start, prompt_doc) in prompts.items():
    
    scores_out_path = f"beir_scores_gpt_{prompt_id}.json"
    # TODO: RENAME TO f"beir_gptprompt{prompt_id}_ndcgs.json"
    if os.path.exists(os.path.join(os.getcwd(), scores_out_path)):
        continue

    ndcgs_bm25 = {}
    ndcgs = {}
    
    logging.info(f"\n{'-' * 20} Running prompt {prompt_id}: {prompt_doc} {'-' * 20}\n")

    for i, dataset in enumerate(datasets):

        logging.info(f"\n{'-' * 10} Running {dataset} {'-' * 10}\n")
        
        if not(os.path.exists(os.path.join(os.getcwd(), 'datasets', dataset))):
            url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
            out_dir = os.path.join(os.getcwd(), "datasets")
            data_path = util.download_and_unzip(url, out_dir)
            print("Dataset downloaded here: {}".format(data_path))
            
        # Load the dataset into BEIR
        data_path = f"datasets/{dataset}"

        # cqadupstack - Contains several sub datasets
        if dataset == "cqadupstack":
            cqa_ndcgs_bm25, cqa_maps_bm25, cqa_recalls_bm25, cqa_precisions_bm25 = [], [], [], []
            cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
            for sub_dataset in os.listdir(data_path):
                sub_data_path = f"datasets/{dataset}/{sub_dataset}"
                
                results_bm25_path = f"results_{dataset}_{sub_dataset}.json"
                results_path = f"results_gpt_prompt{prompt_id}_{dataset}_{sub_dataset}.json"
                # Skip if already computed these results
                if os.path.exists(os.path.join(os.getcwd(), results_path)):
                    continue

                (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, sub_data_path)

                cqa_ndcgs_bm25.append(ndcg_bm25)
                cqa_maps_bm25.append(_map_bm25)
                cqa_recalls_bm25.append(recall_bm25)
                cqa_precisions_bm25.append(precision_bm25)

                cqa_ndcgs.append(ndcg)
                cqa_maps.append(_map)
                cqa_recalls.append(recall)
                cqa_precisions.append(precision)

            for (metric, group) in [(ndcg_bm25, cqa_ndcgs_bm25), (_map_bm25, cqa_maps_bm25), (recall_bm25, cqa_recalls_bm25), (precision_bm25, cqa_precisions_bm25)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

            for (metric, group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

            logging.info("CQA Final BM25")
            logging.info(f"{ndcg_bm25}")
            logging.info(f"{_map_bm25}")
            logging.info(f"{recall_bm25}")
            logging.info(f"{precision_bm25}")

            logging.info("CQA Final")
            logging.info(f"{ndcg}")
            logging.info(f"{_map}")
            logging.info(f"{recall}")
            logging.info(f"{precision}")

        else:
            results_bm25_path = f"results_{dataset}.json"
            results_path = f"results_gpt_prompt{prompt_id}_{dataset}.json"
            # Skip if already computed these results
            if os.path.exists(os.path.join(os.getcwd(), results_path)):
                continue
            (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, data_path)

        ndcgs[dataset] = ndcg
        ndcgs_bm25[dataset] = ndcg_bm25

        # Optionally clean-up each time to avoid running out of space
        # !rm -r datasets

    with open(scores_out_path, 'w') as fp:
        json.dump(ndcgs, fp)
    # Optionally also save the BM25 scores, though they should be the same each time
    with open(f"./beir_scores_bm25_{prompt_id}.json", 'w') as fp:
        json.dump(ndcgs_bm25, fp)



2022-01-14 08:41:33 - Loading faiss with AVX2 support.
2022-01-14 08:41:33 - Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2022-01-14 08:41:33 - Loading faiss.
2022-01-14 08:41:33 - Successfully loaded faiss.
2022-01-14 08:41:33 - 
-------------------- Running prompt quoraE: Question Body: {} Question Title: --------------------

2022-01-14 08:41:33 - 
---------- Running quora ----------

2022-01-14 08:41:33 - Loading Corpus...


  0%|          | 0/522931 [00:00<?, ?it/s]

2022-01-14 08:41:35 - Loaded 522931 TEST Documents.
2022-01-14 08:41:35 - Doc Example: {'text': 'What is the step by step guide to invest in share market in india?', 'title': ''}
2022-01-14 08:41:35 - Loading Queries...
2022-01-14 08:41:35 - Loaded 10000 TEST Queries.
2022-01-14 08:41:35 - Query Example: Which question should I ask on Quora?
2022-01-14 08:41:45 - 

2022-01-14 08:41:45 - NDCG@1: 0.7230
2022-01-14 08:41:45 - NDCG@3: 0.7701
2022-01-14 08:41:45 - NDCG@5: 0.7895
2022-01-14 08:41:45 - NDCG@10: 0.8077
2022-01-14 08:41:45 - NDCG@100: 0.8277
2022-01-14 08:41:45 - NDCG@1000: 0.8312
2022-01-14 08:41:45 - 

2022-01-14 08:41:45 - MAP@1: 0.6310
2022-01-14 08:41:45 - MAP@3: 0.7294
2022-01-14 08:41:45 - MAP@5: 0.7476
2022-01-14 08:41:45 - MAP@10: 0.7596
2022-01-14 08:41:45 - MAP@100: 0.7669
2022-01-14 08:41:45 - MAP@1000: 0.7672
2022-01-14 08:41:45 - 

2022-01-14 08:41:45 - Recall@1: 0.6310
2022-01-14 08:41:45 - Recall@3: 0.7969
2022-01-14 08:41:45 - Recall@5: 0.8495

100%|██████████| 997731/997731 [3:11:31<00:00, 86.82it/s]  


2022-01-14 11:56:55 - 

2022-01-14 11:56:55 - NDCG@1: 0.6598
2022-01-14 11:56:55 - NDCG@3: 0.7257
2022-01-14 11:56:55 - NDCG@5: 0.7497
2022-01-14 11:56:55 - NDCG@10: 0.7726
2022-01-14 11:56:55 - NDCG@100: 0.7959
2022-01-14 11:56:55 - NDCG@1000: 0.7959
2022-01-14 11:56:55 - 

2022-01-14 11:56:55 - MAP@1: 0.5794
2022-01-14 11:56:55 - MAP@3: 0.6833
2022-01-14 11:56:55 - MAP@5: 0.7030
2022-01-14 11:56:55 - MAP@10: 0.7171
2022-01-14 11:56:55 - MAP@100: 0.7255
2022-01-14 11:56:55 - MAP@1000: 0.7255
2022-01-14 11:56:55 - 

2022-01-14 11:56:55 - Recall@1: 0.5794
2022-01-14 11:56:55 - Recall@3: 0.7650
2022-01-14 11:56:55 - Recall@5: 0.8265
2022-01-14 11:56:55 - Recall@10: 0.8915
2022-01-14 11:56:55 - Recall@100: 0.9772
2022-01-14 11:56:55 - Recall@1000: 0.9772
2022-01-14 11:56:55 - 

2022-01-14 11:56:55 - P@1: 0.6598
2022-01-14 11:56:55 - P@3: 0.3166
2022-01-14 11:56:55 - P@5: 0.2128
2022-01-14 11:56:55 - P@10: 0.1203
2022-01-14 11:56:55 - P@100: 0.0145
2022-01-14 11:56:55 - P@1000: 0.0014
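The CQADupStack branch of the loop above reports one score per metric by averaging over its sub-datasets. A minimal sketch of that macro-average follows; `average_metrics` and the toy values are hypothetical and not part of the notebook. Unlike the in-place update in the loop, it returns a new dict and leaves the per-sub-dataset scores untouched.

```python
from typing import Dict, List


def average_metrics(group: List[Dict[str, float]]) -> Dict[str, float]:
    """Macro-average a list of metric dicts that share the same keys."""
    if not group:
        raise ValueError("Need at least one metric dict to average")
    return {k: sum(scores[k] for scores in group) / len(group) for k in group[0]}


# Toy NDCG values for two hypothetical sub-datasets (not real results)
toy_ndcgs = [{"NDCG@1": 0.2, "NDCG@10": 0.4}, {"NDCG@1": 0.4, "NDCG@10": 0.6}]
cqa_ndcg = average_metrics(toy_ndcgs)
print(cqa_ndcg)
```

Raising on an empty group makes the case where every sub-dataset was skipped explicit instead of failing later with a division by zero.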


In [10]:
# Compute scores based on results.json

import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval


def compute_result(results_path, data_path, top_k=100, k_values=[1, 3, 5, 10, 100]):
    """
    Args:
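        results_path: Path to the .json file with rerank results to read
        data_path: Path to the BEIR dataset directory (corpus, queries, qrels)
        top_k: Unused here; the stored results are evaluated as they are
        k_values: Cutoffs k at which to compute the scores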
    """
    split = "dev" if "msmarco" in data_path else "test"
    corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
    
    with open(results_path, 'r') as fp:
        results = json.load(fp)
    
    ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, k_values)

    return (ndcg, _map, recall, precision)


# All datasets
datasets = ["trec-covid", "webis-touche2020", "nfcorpus", "scifact", "fiqa", "dbpedia-entity",
            "nq", "hotpotqa", "quora", "fever", "climate-fever", "arguana", "msmarco", "scidocs", "cqadupstack",
            "signal1m", "trec-news", "bioasq", "robust04"]

prompts = {
    "G": 'Documents are searched to find matches with the same content.\nThe document "{}" is a good search result for "',
}


latex_help = {}
latex_help_avgs = {}
results_prefix = ""
model_name = "gptj"
# Leave empty if the result filenames carry no maxrerank suffix (no suffix means 100 docs were reranked)
maxrerank = ""

for prompt_id, prompt_doc in prompts.items():
    
    scores_out_path = f"{results_prefix}beir_{model_name}_prompt{prompt_id}{maxrerank}_ndcgs.json"
    ndcgs = {}
    
    for i, dataset in enumerate(datasets):
        
        data_path = f"datasets/{dataset}"

        if dataset == "cqadupstack":
            cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
            for sub_dataset in os.listdir(data_path):
                sub_data_path = f"datasets/{dataset}/{sub_dataset}"
                results_path = f"{results_prefix}results_{model_name}_prompt{prompt_id}{maxrerank}_{dataset}_{sub_dataset}.json"
                assert os.path.exists(os.path.join(os.getcwd(), results_path)), f"Missing path: {results_path}"

                ndcg, _map, recall, precision = compute_result(results_path, sub_data_path)
                cqa_ndcgs.append(ndcg)
                cqa_maps.append(_map)
                cqa_recalls.append(recall)
                cqa_precisions.append(precision)

            for (metric, group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

        else:
            results_path = f"{results_prefix}results_{model_name}_prompt{prompt_id}{maxrerank}_{dataset}.json"
            assert os.path.exists(os.path.join(os.getcwd(), results_path)), f"Missing path: {results_path}"
            ndcg, _map, recall, precision = compute_result(results_path, data_path)
           
        latex_help.setdefault(dataset, "")
        latex_help[dataset] += f" & {ndcg['NDCG@10']}"
        latex_help_avgs.setdefault(prompt_id, 0)
        latex_help_avgs[prompt_id] += ndcg['NDCG@10']

        ndcgs[dataset] = ndcg

    with open(scores_out_path, 'w') as fp:
        json.dump(ndcgs, fp)
        
for k, v in latex_help.items():
    print(f"{k} {v}")

print("& ".join([f"{k}: {round(v/len(datasets), 5)}" for k,v in latex_help_avgs.items()]))

2022-01-17 06:47:22 - Loading Corpus...


  0%|          | 0/522931 [00:00<?, ?it/s]

2022-01-17 06:47:24 - Loaded 522931 TEST Documents.
2022-01-17 06:47:24 - Doc Example: {'text': 'What is the step by step guide to invest in share market in india?', 'title': ''}
2022-01-17 06:47:24 - Loading Queries...
2022-01-17 06:47:24 - Loaded 10000 TEST Queries.
2022-01-17 06:47:24 - Query Example: Which question should I ask on Quora?
2022-01-17 06:47:25 - 

2022-01-17 06:47:25 - NDCG@1: 0.7343
2022-01-17 06:47:25 - NDCG@3: 0.7915
2022-01-17 06:47:25 - NDCG@5: 0.8127
2022-01-17 06:47:25 - NDCG@10: 0.8297
2022-01-17 06:47:25 - NDCG@100: 0.8441
2022-01-17 06:47:25 - 

2022-01-17 06:47:25 - MAP@1: 0.6408
2022-01-17 06:47:25 - MAP@3: 0.7494
2022-01-17 06:47:25 - MAP@5: 0.7692
2022-01-17 06:47:25 - MAP@10: 0.7813
2022-01-17 06:47:25 - MAP@100: 0.7879
2022-01-17 06:47:25 - 

2022-01-17 06:47:25 - Recall@1: 0.6408
2022-01-17 06:47:25 - Recall@3: 0.8229
2022-01-17 06:47:25 - Recall@5: 0.8794
2022-01-17 06:47:25 - Recall@10: 0.9277
2022-01-17 06:47:25 - Recall@100: 0.9772
2022-01-17 06:4
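For readers who want to sanity-check the NDCG@k numbers in these logs, here is a minimal single-query sketch. It assumes the common formulation with linear gain and a log2(rank + 1) discount; `EvaluateRetrieval.evaluate` relies on its own evaluation backend, so tie-breaking and averaging over queries may differ from this toy. `ndcg_at_k` and the toy data are hypothetical.

```python
import math
from typing import Dict


def ndcg_at_k(qrel: Dict[str, int], scores: Dict[str, float], k: int) -> float:
    """NDCG@k for one query: linear gain, log2(rank + 1) discount (ranks start at 1)."""
    # Rank documents by descending retrieval score and keep the top k
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    dcg = sum(qrel.get(doc_id, 0) / math.log2(rank + 2) for rank, doc_id in enumerate(ranked))
    # The ideal ranking places the most relevant judged documents first
    ideal = sorted(qrel.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: two relevant docs, one of them ranked last (not real data)
toy_qrel = {"d1": 1, "d3": 1}
toy_scores = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
toy_ndcg = ndcg_at_k(toy_qrel, toy_scores, k=3)
print(round(toy_ndcg, 4))
```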

In [None]:
# Remove kaggle.json again
!rm /home/.kaggle/kaggle.json

In [10]:
# References
# ipynb error
# conda install -c conda-forge ipywidgets

# tensorflow error; metric already exists
# downgrade to 2.5.0
# ~/miniconda3/envs/semanticsearch/bin/pip install tensorflow==2.5.0

# CUDA error: no kernel image is available for execution on the device
# Installing the below worked
# ~/miniconda3/envs/semanticsearch/bin/pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

# Kill Nvidia processes when out of memory
# kill -9 PID

# Try to clear cache without killing
# del reranker
# torch.cuda.empty_cache()

# Kaggle API
# kaggle datasets init

False

##### Reranker module not using query logprobs, but custom output, e.g. "Yes" [Prompt Ablation L]
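The idea behind restricting the output to a small vocabulary can be illustrated without a model. The sketch below renormalises next-token logits over the allowed tokens only and returns the log-probability of the target token. How `_loglikelihood_tokens` actually uses `sub_select_idx` is defined elsewhere and not shown here, so treat this as one plausible reading; `sub_vocab_logprob` and the logits are hypothetical.

```python
import math
from typing import Dict, List


def sub_vocab_logprob(logits: Dict[str, float], target: str, sub_vocab: List[str]) -> float:
    """Log-probability of `target` after renormalising over `sub_vocab` only."""
    # Log-sum-exp over the allowed tokens, shifted by the max for numerical stability
    max_logit = max(logits[tok] for tok in sub_vocab)
    log_norm = max_logit + math.log(sum(math.exp(logits[tok] - max_logit) for tok in sub_vocab))
    return logits[target] - log_norm


# Hypothetical next-token logits after "...Bot:" (toy numbers, not model output)
toy_logits = {" Yes": 2.0, " No": 1.0, " Maybe": 5.0}
score = sub_vocab_logprob(toy_logits, " Yes", [" Yes", " No"])
print(score)
```

Tokens outside the sub-vocabulary (here `" Maybe"`) do not affect the score, even when the unrestricted model would prefer them.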

In [8]:
class GPTYesRanker:
    def __init__(self, model_path="EleutherAI/gpt-neo-1.3B", use_prompt=True, prompt_doc="{}\n{}\n", prompt_doc_start="{}\n{}\n",
                 continuation="Yes", sub_select_voc=["Yes", "No"], fewshots="", debug=False, **kwargs):
        """
        Variation of GPTRanker only allowing the output of specific vocabulary.
        Args:
            model_path: HuggingFace weight name of a decoder transformer model
            use_prompt: Whether to use a prompt
            prompt_doc: Prompting scheme to embed document and query 
                Needs to contain two {} as query is not used for logprobs in this ranker
            prompt_doc_start: Prompting scheme specifically used for the first example, e.g. to include description
            continuation: Expected continuation to measure logprobs of
            sub_select_voc: Vocabulary to use for measuring probability
            fewshots: Fewshot example to use [doc, query]
            debug: To get information while running
        """
        
        self.device = torch.device('cuda:1') if torch.cuda.is_available() else torch.device('cpu')

        self.model = AutoModelForCausalLM.from_pretrained(model_path).to(self.device)
        self.model.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        
        if self.model.config.model_type == 'gpt2':
            self.max_length = self.model.config.n_ctx
        elif self.model.config.model_type == 'gpt_neo':
            self.max_length = self.model.config.max_position_embeddings
        else:
            raise ValueError(f"Unknown model of type {self.model.config.model_type}")

        # Truncation will be done from the left in the log likelihood
        self.prompt_doc = prompt_doc
        self.use_prompt = use_prompt
        self.instruction_len = len(self.tokenizer.tokenize(self.prompt_doc[:self.prompt_doc.index("{")]))
        
        self.continuation = continuation
        
        self.fewshots = fewshots
        if self.fewshots:
            # doc, query
            self.fewshots = prompt_doc_start.format(self.fewshots[0], self.fewshots[1]) + self.continuation
            # Still take overflowing tokens away from the current doc (not the fewshot doc)
            self.instruction_len += len(self.tokenizer.tokenize(self.fewshots))
            
            
        self.debug = debug
        self.sub_select_idx = self.tokenizer.encode(sub_select_voc, add_special_tokens=False)
    
    # Write your own score function, which takes in query-document text pairs and returns the similarity scores
    def predict(self, sentences: List[Tuple[str,str]], batch_size: int, **kwargs) -> List[float]:
        """
        Args:
          sentences: [query, document]
          batch_size: Unused

        Returns:
          log_probs: float log probability for each query-doc pair
        """
        # TODO: Possibly feed in batched?; Depending on model size?
        if self.use_prompt:
            # Score the continuation (e.g. "Yes") given the few-shot block plus the formatted doc/query prompt
            sentences = [(self.continuation, self.fewshots + self.prompt_doc.format(doc, query)) for (query, doc) in sentences]

        encoded = encode(sentences, self.tokenizer)
        # loglikelihood batch_size is not the batch_size fed into this func
        log_probs = _loglikelihood_tokens(encoded, self.model, self.max_length, self.device, instruction_len=self.instruction_len, 
                                          tokenizer=self.tokenizer, sub_select_idx=self.sub_select_idx,debug=self.debug)

        return log_probs

{"trec-covid": [{"NDCG@1": 0.88, "NDCG@3": 0.82642, "NDCG@5": 0.78436, "NDCG@10": 0.76104, "NDCG@100": 0.50256, "NDCG@1000": 0.19689}, {"MAP@1": 0.00246, "MAP@3": 0.00653, "MAP@5": 0.00974, "MAP@10": 0.01779, "MAP@100": 0.09178, "MAP@1000": 0.09178}, {"Recall@1": 0.00246, "Recall@3": 0.00674, "Recall@5": 0.01033, "Recall@10": 0.02003, "Recall@100": 0.11731, "Recall@1000": 0.11731}, {"P@1": 0.92, "P@3": 0.88, "P@5": 0.82, "P@10": 0.802, "P@100": 0.5084, "P@1000": 0.05084}], "webis-touche2020": [{"NDCG@1": 0.28571, "NDCG@3": 0.2358, "NDCG@5": 0.23434, "NDCG@10": 0.2339, "NDCG@100": 0.3984, "NDCG@1000": 0.3984}, {"MAP@1": 0.0238, "MAP@3": 0.04423, "MAP@5": 0.05902, "MAP@10": 0.08103, "MAP@100": 0.15511, "MAP@1000": 0.15511}, {"Recall@1": 0.0238, "Recall@3": 0.05602, "Recall@5": 0.08951, "Recall@10": 0.15152, "Recall@100": 0.56093, "Recall@1000": 0.56093}, {"P@1": 0.30612, "P@3": 0.2381, "P@5": 0.24082, "P@10": 0.21633, "P@100": 0.09531, "P@1000": 0.00953}], "nfcorpus": [{"NDCG@1": 0.41391
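How `GPTYesRanker.predict` assembles its input can be sketched without loading a model. The templates below are copied from prompt `L` and its matching base prompt and continuation in the cell that follows; `build_context` and the example texts are hypothetical. The model would then be asked for the log-probability of the continuation given the returned context.

```python
from typing import Tuple

PROMPT_START = (
    'An intelligent, helpful bot is given. The bot responds "Yes" if the document '
    'is a fit to the query and "No" otherwise.\n###\nDocument: {}\nQuery: {}\nBot:'
)
PROMPT_BASE = "\nDocument: {}\nQuery: {}\nBot:"
CONTINUATION = " Yes"


def build_context(shot: Tuple[str, str], doc: str, query: str) -> str:
    """Return the text the model conditions on before CONTINUATION is scored."""
    shot_doc, shot_query = shot
    # The few-shot example uses the start prompt and is answered with the continuation
    fewshot_block = PROMPT_START.format(shot_doc, shot_query) + CONTINUATION
    # The pair to be ranked follows in the base prompt and ends right before the answer
    return fewshot_block + PROMPT_BASE.format(doc, query)


context = build_context(
    ("Paris is the capital of France.", "capital of France"),
    "Berlin is the capital of Germany.",
    "capital of Germany",
)
print(context + CONTINUATION)
```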

In [None]:
import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

# Pick one model; only the uncommented assignment is used
# model = "EleutherAI/gpt-neo-1.3B"
model = "EleutherAI/gpt-neo-2.7B"

prompts_start = {"L": 'An intelligent, helpful bot is given. The bot responds "Yes" if the document is a fit to the query and "No" otherwise.\n###\nDocument: {}\nQuery: {}\nBot:',
                 "M": 'An intelligent, helpful bot is given. The bot responds "Yes" if the document is a fit to the query and "No" otherwise.\n###\nDocument: {}\nQuery: {}\nBot: ',
}

prompts_base = ["\nDocument: {}\nQuery: {}\nBot:", "\nDocument: {}\nQuery: {}\nBot: "]

continuations = [" Yes", "Yes"]
sub_select_vocs = [[" Yes", " No"], ["Yes", "No"]]
  

def get_match_len(qid, tokenizer, corpus, queries, qrels, get_corpus_id=False):    
    query_len = len(tokenizer.tokenize(queries[qid]))
    
    corpora = qrels[qid]
    corpora_lens = []
    for corpus_id, score in corpora.items():
        corpus_len = len(tokenizer.tokenize(corpus[corpus_id]["text"]))
        # Prefer corpora with high score (won't matter for most cases where all scores are 1)
        corpora_lens.append((corpus_len + query_len) / (score + 1e-10))
    # Optionally get the shortest fitting corpus
    if get_corpus_id: return list(corpora.keys())[np.argmin(corpora_lens)]
    return min(corpora_lens)


def run_reranking(results_bm25_path, results_path, data_path, top_k=100, k_values=[1, 3, 5, 10, 100, 1000], fewshots=True):
    """
    Args:
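        results_bm25_path: Path to .json results from bm25 for the dataset
        results_path: Path to .json to write rerank results
        data_path: Path to the BEIR dataset directory
        top_k: How many docs to rerank per query
        k_values: For how many docs per query to compute the scores
        fewshots: If truthy, prepend the shortest relevant (doc, query) pair as a few-shot example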
    """
    
    corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
    
    with open(results_bm25_path, 'r') as fp:
        results_bm25 = json.load(fp)
    
    # Optional, make sure results are correct
    ndcg_bm25, _map_bm25, recall_bm25, precision_bm25 = EvaluateRetrieval.evaluate(qrels, results_bm25, k_values)
    
    
    if fewshots:
        tokenizer = AutoTokenizer.from_pretrained(model)
        min_q_id = min(qrels, key=lambda x: get_match_len(x, tokenizer, corpus, queries, qrels))
        min_corp_id = get_match_len(min_q_id, tokenizer, corpus, queries, qrels, get_corpus_id=True)

        query_shot = queries[min_q_id]
        corp_shot = corpus[min_corp_id]["text"]
        
        fewshots = [corp_shot, query_shot]
        
    reranker = Rerank(GPTYesRanker(model_path=model, use_prompt=True, prompt_doc=prompt_base, 
                                   sub_select_voc=sub_select_voc, continuation=continuation, 
                                   fewshots=fewshots, prompt_doc_start=prompt_start, debug=False), batch_size=128)
    

    # Rerank top-100 results using the reranker provided
    results_rerank = reranker.rerank(corpus, queries, results_bm25, top_k=top_k)
    
    # Save rerank results
    with open(results_path, 'w') as fp:
        json.dump(results_rerank, fp)

    #### Evaluate retrieval using NDCG@k, MAP@K ...
    ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results_rerank, k_values)

    return (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision)


for (prompt_id, prompt_start), continuation, sub_select_voc, prompt_base in zip(prompts_start.items(), continuations, sub_select_vocs, prompts_base):
    
    scores_out_path = f"beir_scores_gpt_{prompt_id}.json"
    if os.path.exists(os.path.join(os.getcwd(), scores_out_path)):
        continue

    ndcgs_bm25 = {}
    ndcgs = {}
    
    logging.info(f"\n{'-' * 20} Running prompt {prompt_id}: {prompt_start} {'-' * 20}\n")
    
    for i, dataset in enumerate(datasets):

        logging.info(f"\n{'-' * 10} Running {dataset} {'-' * 10}\n")
        
        if not(os.path.exists(os.path.join(os.getcwd(), 'datasets', dataset))):
            url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
            out_dir = os.path.join(os.getcwd(), "datasets")
            data_path = util.download_and_unzip(url, out_dir)
            print("Dataset downloaded here: {}".format(data_path))
            
        # Load the dataset into BEIR
        data_path = f"datasets/{dataset}"

        # cqadupstack - Contains several sub datasets
        if dataset == "cqadupstack":
            cqa_ndcgs_bm25, cqa_maps_bm25, cqa_recalls_bm25, cqa_precisions_bm25 = [], [], [], []
            cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
            for sub_dataset in os.listdir(data_path):
                sub_data_path = f"datasets/{dataset}/{sub_dataset}"
                
                results_bm25_path = f"results_{dataset}_{sub_dataset}.json"
                results_path = f"results_gpt_prompt{prompt_id}_{dataset}_{sub_dataset}.json"
                # Skip if already computed these results
                if os.path.exists(os.path.join(os.getcwd(), results_path)):
                    continue

                (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, sub_data_path)

                cqa_ndcgs_bm25.append(ndcg_bm25)
                cqa_maps_bm25.append(_map_bm25)
                cqa_recalls_bm25.append(recall_bm25)
                cqa_precisions_bm25.append(precision_bm25)

                cqa_ndcgs.append(ndcg)
                cqa_maps.append(_map)
                cqa_recalls.append(recall)
                cqa_precisions.append(precision)

            for (metric, group) in [(ndcg_bm25, cqa_ndcgs_bm25), (_map_bm25, cqa_maps_bm25), (recall_bm25, cqa_recalls_bm25), (precision_bm25, cqa_precisions_bm25)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

            for (metric, group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

            logging.info("CQA Final BM25")
            logging.info(f"{ndcg_bm25}")
            logging.info(f"{_map_bm25}")
            logging.info(f"{recall_bm25}")
            logging.info(f"{precision_bm25}")

            logging.info("CQA Final")
            logging.info(f"{ndcg}")
            logging.info(f"{_map}")
            logging.info(f"{recall}")
            logging.info(f"{precision}")

        else:
            results_bm25_path = f"results_{dataset}.json"
            results_path = f"results_gpt_prompt{prompt_id}_{dataset}.json"
            # Skip if already computed these results
            if os.path.exists(os.path.join(os.getcwd(), results_path)):
                continue
            (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, data_path)

        ndcgs[dataset] = ndcg
        ndcgs_bm25[dataset] = ndcg_bm25

        # Optionally clean-up each time to avoid running out of space
        # !rm -r datasets

    with open(scores_out_path, 'w') as fp:
        json.dump(ndcgs, fp)
    # Optionally also save the BM25 scores, though they should be the same each time
    with open(f"./beir_scores_bm25_{prompt_id}.json", 'w') as fp:
        json.dump(ndcgs_bm25, fp)
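The few-shot example in `run_reranking` is the relevant query-document pair with the lowest length divided by relevance score, so short and highly relevant pairs win. A minimal sketch of that selection follows, with two simplifications that are assumptions: whitespace word counts stand in for `tokenizer.tokenize`, and the corpus maps ids directly to text. `pick_fewshot` and the toy data are hypothetical.

```python
from typing import Dict, Tuple


def pick_fewshot(corpus: Dict[str, str], queries: Dict[str, str],
                 qrels: Dict[str, Dict[str, int]]) -> Tuple[str, str]:
    """Return the (query_id, doc_id) pair with the lowest length / relevance cost."""
    def cost(qid: str, doc_id: str) -> float:
        # Whitespace token counts stand in for the model tokenizer (assumption)
        length = len(queries[qid].split()) + len(corpus[doc_id].split())
        # The small epsilon avoids dividing by zero for pairs judged with score 0
        return length / (qrels[qid][doc_id] + 1e-10)

    pairs = [(qid, doc_id) for qid, docs in qrels.items() for doc_id in docs]
    return min(pairs, key=lambda pair: cost(*pair))


# Toy data (not taken from any BEIR dataset)
toy_corpus = {"d1": "a very long document with many many words in it", "d2": "short doc"}
toy_queries = {"q1": "long query text here", "q2": "short"}
toy_qrels = {"q1": {"d1": 1}, "q2": {"d2": 1, "d1": 1}}
shot_qid, shot_doc_id = pick_fewshot(toy_corpus, toy_queries, toy_qrels)
print(toy_corpus[shot_doc_id], "|", toy_queries[shot_qid])
```

Keeping the few-shot pair short leaves more of the context window for the document that is actually being ranked.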

##### Computing max_rerank=10 based on max_rerank=100
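The idea in this section is that scores from a run that reranked 100 documents per query can be reused to simulate reranking only the top 10: keep the reranker scores for the documents BM25 ranked highest and drop the rest. A minimal sketch on toy data follows; `truncate_rerank` is a hypothetical name, and skipping documents the reranker never scored is an added safeguard rather than something the notebook does.

```python
from typing import Dict

Results = Dict[str, Dict[str, float]]


def truncate_rerank(results_bm25: Results, results_rerank: Results, new_max_rerank: int) -> Results:
    """Keep reranker scores only for each query's top-`new_max_rerank` BM25 docs."""
    truncated = {}
    for qid, bm25_scores in results_bm25.items():
        # Docs that would have been passed to the reranker under the smaller budget
        topk = sorted(bm25_scores, key=bm25_scores.get, reverse=True)[:new_max_rerank]
        # Docs the reranker never scored are skipped instead of raising KeyError
        reranked = results_rerank.get(qid, {})
        truncated[qid] = {doc_id: reranked[doc_id] for doc_id in topk if doc_id in reranked}
    return truncated


# Toy scores: BM25 scores are positive, reranker scores are log-probabilities
toy_bm25 = {"q1": {"d1": 12.0, "d2": 9.5, "d3": 3.1}}
toy_rerank = {"q1": {"d1": -4.2, "d2": -1.3, "d3": -0.7}}
toy_top2 = truncate_rerank(toy_bm25, toy_rerank, new_max_rerank=2)
print(toy_top2)
```

In the toy data `d3` has the best reranker score but is dropped, because it falls outside the BM25 top 2 and would never have reached the reranker.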

In [7]:
import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

# Subselecting all available ones
datasets = ["trec-covid", "webis-touche2020", "nfcorpus", "scifact", "fiqa", "dbpedia-entity",
            "nq", "hotpotqa", "quora", "fever", "climate-fever", "arguana", "msmarco", "scidocs", "cqadupstack",
            "signal1m", "trec-news", "bioasq", "robust04"]

k_values = [1, 10]
max_rerank = 10 # In BEIR 100 documents are reranked for their rerank encoder benchmark

# max_rerank must be at least the largest k, e.g. at least 100 reranked docs for @100
assert max_rerank >= max(k_values), "max_rerank is too small to compute scores at the largest k"

results_prefix = ""
model_name = "gptneo01"
ndcgs = {}

def simulate_rerank(results_bm25, results_rerank, new_max_rerank):
    """
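    Simulate reranking only the top `new_max_rerank` BM25 docs per query,
    reusing the scores from a run that reranked more docs.
    results_bm25, results_rerank: Dict[str, Dict[str, float]] mapping query id -> doc id -> score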
    """
    simulate_rerank_results = {}
    for qid in results_bm25:
        # Topk items that would have been fed into the model
        topk = sorted(results_bm25[qid].items(), key=lambda item: item[1], reverse=True)[:new_max_rerank]
        topk = [key_val_pair[0] for key_val_pair in topk]

        # Scores the model would have given to these topk items
        simulate_rerank_results[qid] = {k: results_rerank[qid][k] for k in topk}

    return simulate_rerank_results

for i, dataset in enumerate(datasets):
    
    results_out_path = f"{results_prefix}results_{model_name}_promptG_{max_rerank}_{dataset}.json"
    
    if os.path.exists(results_out_path):
        continue

    if not(os.path.exists(os.path.join(os.getcwd(), 'datasets', dataset))):
        url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
        out_dir = os.path.join(os.getcwd(), "datasets")
        data_path = util.download_and_unzip(url, out_dir)
        print("Dataset downloaded here: {}".format(data_path))
    # Load the dataset into BEIR
    data_path = f"datasets/{dataset}"
    # According to the BEIR paper, msmarco is evaluated on the dev set
    split = "dev" if dataset == "msmarco" else "test"
    

    # cqadupstack - Contains several sub datasets
    if dataset == "cqadupstack":
        cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
        for sub_dataset in os.listdir(data_path):
            
            sub_results_out_path = f"{results_prefix}results_{model_name}_promptG_{max_rerank}_{dataset}_{sub_dataset}.json"
            if os.path.exists(sub_results_out_path):
                continue
        
            sub_data_path = f"datasets/{dataset}/{sub_dataset}"
            corpus, queries, qrels = GenericDataLoader(sub_data_path).load(split=split)
            with open(f"./results_{dataset}_{sub_dataset}.json", 'r') as fp:
                results_loaded = json.load(fp)
            with open(f"{results_prefix}results_{model_name}_promptG_{dataset}_{sub_dataset}.json", 'r') as fp:
                results_gpt_loaded = json.load(fp)
            
            simulate_rerank_results = simulate_rerank(results_loaded, results_gpt_loaded, max_rerank)

            with open(sub_results_out_path, 'w') as fp:
                json.dump(simulate_rerank_results, fp)

            ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, simulate_rerank_results, k_values)

            cqa_ndcgs.append(ndcg)
            cqa_maps.append(_map)
            cqa_recalls.append(recall)
            cqa_precisions.append(precision)
        
        if cqa_ndcgs:
            for (metric, group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

            logging.info("CQA Final")
            logging.info(f"{ndcg}")
            logging.info(f"{_map}")
            logging.info(f"{recall}")
            logging.info(f"{precision}")

    else:
        corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
        with open(f"./results_{dataset}.json", 'r') as fp:
            results_loaded = json.load(fp)
        with open(f"{results_prefix}results_{model_name}_promptG_{dataset}.json", 'r') as fp:
            results_gpt_loaded = json.load(fp)


        simulate_rerank_results = simulate_rerank(results_loaded, results_gpt_loaded, max_rerank)
        ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, simulate_rerank_results, k_values)
        with open(results_out_path, 'w') as fp:
            json.dump(simulate_rerank_results, fp)

    ndcgs[dataset] = ndcg

2022-01-07 18:07:19 - Loading Corpus...


  0%|          | 0/171332 [00:00<?, ?it/s]

2022-01-07 18:07:20 - Loaded 171332 TEST Documents.
2022-01-07 18:07:20 - Doc Example: {'text': 'OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract 

  0%|          | 0/382545 [00:00<?, ?it/s]

2022-01-07 18:07:25 - Loaded 382545 TEST Documents.
2022-01-07 18:07:25 - Doc Example: {'text': 'My opponent forfeited every round. None of my arguments were answered. I don’t like the idea of winning by default, but here we are.Tule: it’s good for students to get involved and address big issues like teen pregnancy. You need to be able to answer arguments like mine and not simply prepare for an abstinence-only type of response. You should also be aware that, in the U.S., condoms may be sold to minors in ANY state. A retailer who says it is illegal to sell you them is, frankly, wrong.', 'title': 'Contraceptive Forms for High School Students'}
2022-01-07 18:07:25 - Loading Queries...
2022-01-07 18:07:25 - Loaded 49 TEST Queries.
2022-01-07 18:07:25 - Query Example: Should teachers get tenure?
2022-01-07 18:07:25 - 

2022-01-07 18:07:25 - NDCG@1: 0.3469
2022-01-07 18:07:25 - NDCG@10: 0.3396
2022-01-07 18:07:25 - 

2022-01-07 18:07:25 - MAP@1: 0.0272
2022-01-07 18:07:25 - MAP@10: 0.1295

  0%|          | 0/3633 [00:00<?, ?it/s]

2022-01-07 18:07:25 - Loaded 3633 TEST Documents.
2022-01-07 18:07:25 - Doc Example: {'text': 'Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants die

  0%|          | 0/5183 [00:00<?, ?it/s]

2022-01-07 18:07:25 - Loaded 5183 TEST Documents.
2022-01-07 18:07:25 - Doc Example: {'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 vers

  0%|          | 0/57638 [00:00<?, ?it/s]

2022-01-07 18:07:26 - Loaded 57638 TEST Documents.
2022-01-07 18:07:26 - Doc Example: {'text': "I'm not saying I don't like the idea of on-the-job training too, but you can't expect the company to do that. Training workers is not their job - they're building software. Perhaps educational systems in the U.S. (or their students) should worry a little about getting marketable skills in exchange for their massive investment in education, rather than getting out with thousands in student debt and then complaining that they aren't qualified to do anything.", 'title': ''}
2022-01-07 18:07:26 - Loading Queries...
2022-01-07 18:07:26 - Loaded 648 TEST Queries.
2022-01-07 18:07:26 - Query Example: How to deposit a cheque issued to an associate in my business into my business account?
2022-01-07 18:07:26 - 

2022-01-07 18:07:26 - NDCG@1: 0.2870
2022-01-07 18:07:26 - NDCG@10: 0.2804
2022-01-07 18:07:26 - 

2022-01-07 18:07:26 - MAP@1: 0.1472
2022-01-07 18:07:26 - MAP@10: 0.2227
2022-01-07 18:07:26

  0%|          | 0/4635922 [00:00<?, ?it/s]

2022-01-07 18:07:52 - Loaded 4635922 TEST Documents.
2022-01-07 18:07:53 - Doc Example: {'text': "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold.   A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket.", 'title': 'Animalia (book)'}
2022-01-07 18:07:53 - Loading Queries...
2022-01-07 18:07:53 - Loaded 400 TEST Queries.
2022-01-07 18:07:53 - Query Example: Szechwan dish food cuisine
2022-01-07 18:07:53 - 

2022-01-07 18:07:53 - NDCG@1: 0.4163
2022-01-07 18:07:53 - NDCG@10: 0.3275
2022-01-07 18:07:53 - 

2022-01-07 18:07:53 - MAP@1: 0.0657
2022-01-07 18:07:53 - MAP@10: 0.1568
2022-01-07 18:07:53 - 

2022-01-07 18:07:53 - Recall@1: 0.0657
2022-01-07 18:07:53 - Recall@10: 0.2089
2022-01-07 18:07:53 - 

2022-01-07 18:07:53 - P@1: 0.5375
2022-01-07 18:07:53 - P@10

  0%|          | 0/2681468 [00:00<?, ?it/s]

2022-01-07 18:08:08 - Loaded 2681468 TEST Documents.
2022-01-07 18:08:08 - Doc Example: {'text': "In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]", 'title': 'Minority interest'}
2022-01-07 18:08:08 - Loading Queries...
2022-01-07 18:08:08 - Loaded 3452 TEST Queries.
2022-01-07 18:08:08 - Query Example: what is non controlling interest on balance sheet
2022-01-07 18:08:12 - 

2022-01-07 18:08:12 - NDCG@1: 0.1860
2022-01-07 18:08:12 - NDCG@10: 0.3361
2022-01-07 18:08:12 - 

2022-01-07 18:08:12 - MAP@1: 0.1660
2022-01-07 18:08:12 - MAP@10: 0.2742
2022-01-07 18:08:12 - 

2022-01-07 18:08:12 - Recall@1: 0.1660
2022-01-07 18:08:12 - Recall@10: 0.5067
2022-01-07 18:08:12 - 

2022-01-07 18:08:12

  0%|          | 0/5233329 [00:00<?, ?it/s]

2022-01-07 18:08:41 - Loaded 5233329 TEST Documents.
2022-01-07 18:08:41 - Doc Example: {'text': 'Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary and harmful.', 'title': 'Anarchism'}
2022-01-07 18:08:41 - Loading Queries...
2022-01-07 18:08:41 - Loaded 7405 TEST Queries.
2022-01-07 18:08:41 - Query Example: Were Scott Derrickson and Ed Wood of the same nationality?
2022-01-07 18:08:48 - 

2022-01-07 18:08:48 - NDCG@1: 0.7772
2022-01-07 18:08:48 - NDCG@10: 0.6327
2022-01-07 18:08:48 - 

2022-01-07 18:08:48 - MAP@1: 0.3886
2022-01-07 18:08:48 - MAP@10: 0.5462
2022-01-07 18:08:48 - 

2022-01-07 18:08:48 - Recall@1: 0.3886
2022-01-07 18:08:48 - Recall@10: 0.6295
2022-01-07 18:08:48 - 

2022-01-07 18:08:48 

  0%|          | 0/522931 [00:00<?, ?it/s]

2022-01-07 18:08:50 - Loaded 522931 TEST Documents.
2022-01-07 18:08:50 - Doc Example: {'text': 'What is the step by step guide to invest in share market in india?', 'title': ''}
2022-01-07 18:08:50 - Loading Queries...
2022-01-07 18:08:50 - Loaded 10000 TEST Queries.
2022-01-07 18:08:50 - Query Example: Which question should I ask on Quora?
2022-01-07 18:08:59 - 

2022-01-07 18:08:59 - NDCG@1: 0.6877
2022-01-07 18:08:59 - NDCG@10: 0.7942
2022-01-07 18:08:59 - 

2022-01-07 18:08:59 - MAP@1: 0.6002
2022-01-07 18:08:59 - MAP@10: 0.7415
2022-01-07 18:08:59 - 

2022-01-07 18:08:59 - Recall@1: 0.6002
2022-01-07 18:08:59 - Recall@10: 0.9026
2022-01-07 18:08:59 - 

2022-01-07 18:08:59 - P@1: 0.6877
2022-01-07 18:08:59 - P@10: 0.1220
2022-01-07 18:08:59 - Loading Corpus...


  0%|          | 0/5416568 [00:00<?, ?it/s]

2022-01-07 18:09:31 - Loaded 5416568 TEST Documents.
2022-01-07 18:09:31 - Doc Example: {'text': 'The following are the football ( soccer ) events of the year 1928 throughout the world .', 'title': '1928 in association football'}
2022-01-07 18:09:31 - Loading Queries...
2022-01-07 18:09:32 - Loaded 6666 TEST Queries.
2022-01-07 18:09:32 - Query Example: Ukrainian Soviet Socialist Republic was a founding participant of the UN.
2022-01-07 18:09:38 - 

2022-01-07 18:09:38 - NDCG@1: 0.6589
2022-01-07 18:09:38 - NDCG@10: 0.7348
2022-01-07 18:09:38 - 

2022-01-07 18:09:38 - MAP@1: 0.6178
2022-01-07 18:09:38 - MAP@10: 0.6961
2022-01-07 18:09:38 - 

2022-01-07 18:09:38 - Recall@1: 0.6178
2022-01-07 18:09:38 - Recall@10: 0.8141
2022-01-07 18:09:38 - 

2022-01-07 18:09:38 - P@1: 0.6589
2022-01-07 18:09:38 - P@10: 0.0888
2022-01-07 18:09:38 - Loading Corpus...


  0%|          | 0/5416593 [00:00<?, ?it/s]

2022-01-07 18:10:10 - Loaded 5416593 TEST Documents.
2022-01-07 18:10:10 - Doc Example: {'text': 'The following are the football ( soccer ) events of the year 1928 throughout the world .', 'title': '1928 in association football'}
2022-01-07 18:10:10 - Loading Queries...
2022-01-07 18:10:10 - Loaded 1535 TEST Queries.
2022-01-07 18:10:10 - Query Example: Global warming is driving polar bears toward extinction
2022-01-07 18:10:13 - 

2022-01-07 18:10:13 - NDCG@1: 0.1785
2022-01-07 18:10:13 - NDCG@10: 0.1942
2022-01-07 18:10:13 - 

2022-01-07 18:10:13 - MAP@1: 0.0803
2022-01-07 18:10:13 - MAP@10: 0.1367
2022-01-07 18:10:13 - 

2022-01-07 18:10:13 - Recall@1: 0.0803
2022-01-07 18:10:13 - Recall@10: 0.2318
2022-01-07 18:10:13 - 

2022-01-07 18:10:13 - P@1: 0.1785
2022-01-07 18:10:13 - P@10: 0.0575
2022-01-07 18:10:13 - Loading Corpus...


  0%|          | 0/8674 [00:00<?, ?it/s]

2022-01-07 18:10:13 - Loaded 8674 TEST Documents.
2022-01-07 18:10:13 - Doc Example: {'text': "You don’t have to be vegetarian to be green. Many special environments have been created by livestock farming – for example chalk down land in England and mountain pastures in many countries. Ending livestock farming would see these areas go back to woodland with a loss of many unique plants and animals. Growing crops can also be very bad for the planet, with fertilisers and pesticides polluting rivers, lakes and seas. Most tropical forests are now cut down for timber, or to allow oil palm trees to be grown in plantations, not to create space for meat production.  British farmer and former editor Simon Farrell also states: “Many vegans and vegetarians rely on one source from the U.N. calculation that livestock generates 18% of global carbon emissions, but this figure contains basic mistakes. It attributes all deforestation from ranching to cattle, rather than logging or development. It also m

  0%|          | 0/8841823 [00:00<?, ?it/s]

2022-01-07 18:10:58 - Loaded 8841823 DEV Documents.
2022-01-07 18:10:59 - Doc Example: {'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.', 'title': ''}
2022-01-07 18:10:59 - Loading Queries...
2022-01-07 18:11:00 - Loaded 6980 DEV Queries.
2022-01-07 18:11:00 - Query Example: how many years did william bradford serve as governor of plymouth colony?
2022-01-07 18:11:05 - 

2022-01-07 18:11:05 - NDCG@1: 0.1125
2022-01-07 18:11:05 - NDCG@10: 0.2370
2022-01-07 18:11:05 - 

2022-01-07 18:11:05 - MAP@1: 0.1097
2022-01-07 18:11:05 - MAP@10: 0.1902
2022-01-07 18:11:05 - 

2022-01-07 18:11:05 - Recall@1: 0.1097
2022-01-07 18:11:05 - Recall@10: 0.3817
2022-01-07 18:11:05 - 

2022-01-07 18:11:05 - P@1: 0.1125
2022-0

  0%|          | 0/25657 [00:00<?, ?it/s]

2022-01-07 18:11:06 - Loaded 25657 TEST Documents.
2022-01-07 18:11:06 - Doc Example: {'text': 'An evolutionary recurrent network which automates the design of recurrent neural/fuzzy networks using a new evolutionary learning algorithm is proposed in this paper. This new evolutionary learning algorithm is based on a hybrid of genetic algorithm (GA) and particle swarm optimization (PSO), and is thus called HGAPSO. In HGAPSO, individuals in a new generation are created, not only by crossover and mutation operation as in GA, but also by PSO. The concept of elite strategy is adopted in HGAPSO, where the upper-half of the best-performing individuals in a population are regarded as elites. However, instead of being reproduced directly to the next generation, these elites are first enhanced. The group constituted by the elites is regarded as a swarm, and each elite corresponds to a particle within it. In this regard, the elites are enhanced by PSO, an operation which mimics the maturing pheno

  0%|          | 0/47382 [00:00<?, ?it/s]

2022-01-07 18:11:09 - Loaded 47382 TEST Documents.
2022-01-07 18:11:09 - Loading Queries...
2022-01-07 18:11:12 - Loaded 1072 TEST Queries.
2022-01-07 18:11:12 - Query Example: Yanked USB Key During Move
2022-01-07 18:11:13 - 

2022-01-07 18:11:13 - NDCG@1: 0.2883
2022-01-07 18:11:13 - NDCG@10: 0.3301
2022-01-07 18:11:13 - 

2022-01-07 18:11:13 - MAP@1: 0.2475
2022-01-07 18:11:13 - MAP@10: 0.3003
2022-01-07 18:11:13 - 

2022-01-07 18:11:13 - Recall@1: 0.2475
2022-01-07 18:11:13 - Recall@10: 0.3841
2022-01-07 18:11:13 - 

2022-01-07 18:11:13 - P@1: 0.2883
2022-01-07 18:11:13 - P@10: 0.0477
2022-01-07 18:11:13 - Loading Corpus...


  0%|          | 0/38316 [00:00<?, ?it/s]

2022-01-07 18:11:13 - Loaded 38316 TEST Documents.
2022-01-07 18:11:13 - Doc Example: {'text': "Let's discuss about $SU(3)$. I understand that the most important representations (relevant to physics) are the defining and the adjoint. In the defining representation of $SU(3)$; namely $\\mathbf{3}$, the Gell-Mann matrices are used to represent the generators $$ \\left[T^{A}\\right]_{ij} = \\dfrac{1}{2}\\lambda^{A}, $$ where $T^A$ are the generators and $\\lambda^A$ the Gell-Mann matrices. In adjoint representation, on the other hand, an $\\mathbf{8}$, the generators are represented by matrices according to $$ \\left[ T_{i} \\right]_{jk} = -if_{ijk}, $$ where $f_{ijk}$ are the structure constants. My question is this, how can one represent the generators in the $\\mathbf{10}$ of $SU(3)$, which corresponds to a symmetric tensor with 3 upper or lower indices (or for that matter how to represent the $\\mathbf{6}$ with two symmetric indices). What is the general procedure to represent the gen

  0%|          | 0/48605 [00:00<?, ?it/s]

2022-01-07 18:11:17 - Loaded 48605 TEST Documents.
2022-01-07 18:11:17 - Doc Example: {'text': "In a shortcode context, is there any difference here?               array(             'slideshow' => '',         ),       and               array(             'slideshow' => NULL,         ),       Is there a best practice for that?", 'title': 'What is the difference between Null vs Empty (Zero Length) string?'}
2022-01-07 18:11:17 - Loading Queries...
2022-01-07 18:11:18 - Loaded 541 TEST Queries.
2022-01-07 18:11:18 - Query Example: How to enqueue script or style in a theme's template file?
2022-01-07 18:11:19 - 

2022-01-07 18:11:19 - NDCG@1: 0.2606
2022-01-07 18:11:19 - NDCG@10: 0.3154
2022-01-07 18:11:19 - 

2022-01-07 18:11:19 - MAP@1: 0.2363
2022-01-07 18:11:19 - MAP@10: 0.2864
2022-01-07 18:11:19 - 

2022-01-07 18:11:19 - Recall@1: 0.2363
2022-01-07 18:11:19 - Recall@10: 0.3816
2022-01-07 18:11:19 - 

2022-01-07 18:11:19 - P@1: 0.2606
2022-01-07 18:11:19 - P@10: 0.0442
2022-01-07 18:

  0%|          | 0/37637 [00:00<?, ?it/s]

2022-01-07 18:11:19 - Loaded 37637 TEST Documents.
2022-01-07 18:11:19 - Doc Example: {'text': "There is a satellite image it's size is 10 GB and I need to display this image using GeoServer and OpenLayers. When user select the Satellite image in the layer switcher need to display image within 10 seconds. I tried geopdf but the image quality loss isn't acceptable to customer. I want to achieve 10 seconds response time using 32 GB satellite image. Please advice me how to achieve this? Thanks in advance.", 'title': 'Satellite image display with the help of GeoServer and OpenLayers'}
2022-01-07 18:11:19 - Loading Queries...
2022-01-07 18:11:22 - Loaded 885 TEST Queries.
2022-01-07 18:11:22 - Query Example: Calculating mean upslope aspect from each cell in DEM using Python?
2022-01-07 18:11:22 - 

2022-01-07 18:11:22 - NDCG@1: 0.2791
2022-01-07 18:11:22 - NDCG@10: 0.3362
2022-01-07 18:11:22 - 

2022-01-07 18:11:22 - MAP@1: 0.2559
2022-01-07 18:11:22 - MAP@10: 0.3078
2022-01-07 18:11:22 - 


  0%|          | 0/45301 [00:00<?, ?it/s]

2022-01-07 18:11:22 - Loaded 45301 TEST Documents.
2022-01-07 18:11:22 - Doc Example: {'text': 'What\'s your Supreme Commander 2 build order. I don\'t just want "6 mass extractors, 2 power and a factory". List of building and units out to the second or third factory, please.', 'title': 'Supreme Commander 2 - Build Orders'}
2022-01-07 18:11:22 - Loading Queries...
2022-01-07 18:11:28 - Loaded 1595 TEST Queries.
2022-01-07 18:11:28 - Query Example: Can the trophy system protect me against bullets?
2022-01-07 18:11:28 - 

2022-01-07 18:11:28 - NDCG@1: 0.4357
2022-01-07 18:11:28 - NDCG@10: 0.5051
2022-01-07 18:11:28 - 

2022-01-07 18:11:28 - MAP@1: 0.3795
2022-01-07 18:11:28 - MAP@10: 0.4648
2022-01-07 18:11:28 - 

2022-01-07 18:11:28 - Recall@1: 0.3795
2022-01-07 18:11:28 - Recall@10: 0.5864
2022-01-07 18:11:28 - 

2022-01-07 18:11:28 - P@1: 0.4357
2022-01-07 18:11:28 - P@10: 0.0750
2022-01-07 18:11:28 - Loading Corpus...


  0%|          | 0/42269 [00:00<?, ?it/s]

2022-01-07 18:11:29 - Loaded 42269 TEST Documents.
2022-01-07 18:11:29 - Doc Example: {'text': "I'm a beginner in statistics and R, sorry if this question may seem trivial. I've collected data measuring several different parameters in 40 subjects at two time-points (t1 and t2). There are 3 main parameters in which I'm interested, let's call them ParA, ParB, ParC. ParA is a score of disability. It is on an arbitrary scale (so it is an ordinal scale measure, if my understanding is correct) and values range from 0.0 to 10.0. Note that the increments in this scale are by 0.5 unit, so values like, e.g. 1.5 are possible. I have two measures, at t1 and t2, so I can describe at least three variables from ParA: ParA at t1, ParA at t2, and whether a subject progressed or not (0 or 1). Being a ratio scale measure, I think it would not make much sense to compute a difference (eg. ParA at t2 - ParA at t1), but I'm willing to accept suggestions on this matter. ParB and ParC are meausurements of two 

  0%|          | 0/17405 [00:00<?, ?it/s]

2022-01-07 18:11:31 - Loaded 17405 TEST Documents.
2022-01-07 18:11:31 - Doc Example: {'text': 'I\'m making a website for a small hotel in php. The hotel owners want a reservation system that uses paypal. They want people to see a calendar and choose a date to make a reservation. If the day has vacancy, they want the user to request booking a room. This would then require the hotel owner to accept the purchase. I have not worked on a project that has this "request to purchase" method of buying with paypal. Is this possible? Does anyone know of an open php system that handles this?', 'title': 'Hotel Reservation Request Booking Paypal PHP'}
2022-01-07 18:11:31 - Loading Queries...
2022-01-07 18:11:32 - Loaded 506 TEST Queries.
2022-01-07 18:11:32 - Query Example: Someone else is using our Google Analytics Tracking code number. What do we do?
2022-01-07 18:11:32 - 

2022-01-07 18:11:32 - NDCG@1: 0.3123
2022-01-07 18:11:32 - NDCG@10: 0.3562
2022-01-07 18:11:32 - 

2022-01-07 18:11:32 - MAP

  0%|          | 0/16705 [00:00<?, ?it/s]

2022-01-07 18:11:32 - Loaded 16705 TEST Documents.
2022-01-07 18:11:32 - Doc Example: {'text': "I'm trying to use `Get` to load some pretty substantial packages from a custom menu in the _Mathematica_ toolbar (added via MenuSetup.tr).   The problem is, the standard 5-second evaluation timeout seems to apply to commands executed with `KernelExecute`, so only a fraction of my `Get` is evaluated before the command times out. I'm wondering whether there's an option that can be passed to `KernelExecute` (or to `Item` / `MenuItem`) that will remove that time constraint so that my command can be executed completely.", 'title': 'Time constraints on KernelExecute commands or MenuItems?'}
2022-01-07 18:11:32 - Loading Queries...
2022-01-07 18:11:35 - Loaded 804 TEST Queries.
2022-01-07 18:11:35 - Query Example: How to use Automorphisms[] on a graph?
2022-01-07 18:11:35 - 

2022-01-07 18:11:35 - NDCG@1: 0.1866
2022-01-07 18:11:35 - NDCG@10: 0.2422
2022-01-07 18:11:35 - 

2022-01-07 18:11:35 - MAP

  0%|          | 0/22998 [00:00<?, ?it/s]

2022-01-07 18:11:35 - Loaded 22998 TEST Documents.
2022-01-07 18:11:35 - Doc Example: {'text': "I want to send files to android tablet with a application from PC. - I can send files directly to tablet (2.3 android OS) PC see it as a external usb drive. - But i can't send files to tablet (4.2 android OS), because PC see it as a portable media player.(MTP) - How can i fix this problem ? - How can show my device as a external drive? my application that sent files written via Delphi.", 'title': 'How can show android tablet as a external storage to PC?'}
2022-01-07 18:11:35 - Loading Queries...
2022-01-07 18:11:36 - Loaded 699 TEST Queries.
2022-01-07 18:11:36 - Query Example: Android chroot ubuntu - is it possible to get ubuntu to recognise usb devices
2022-01-07 18:11:37 - 

2022-01-07 18:11:37 - NDCG@1: 0.3619
2022-01-07 18:11:37 - NDCG@10: 0.4204
2022-01-07 18:11:37 - 

2022-01-07 18:11:37 - MAP@1: 0.2879
2022-01-07 18:11:37 - MAP@10: 0.3735
2022-01-07 18:11:37 - 

2022-01-07 18:11:37 -

  0%|          | 0/32176 [00:00<?, ?it/s]

2022-01-07 18:11:37 - Loaded 32176 TEST Documents.
2022-01-07 18:11:37 - Doc Example: {'text': "I am in the midst of writing a web application for work. Everything is from scratch. I have been a PHP programmer for about 13 years, Node.js programmer for the past 2 years, and have no shortage of experience with JavaScript. I love Node.js, and recently rebuilt the company's API in it... So, in planning this web application, the approach I'm considering is, have the Node.js API for getting data from the server, but render everything in the browser. Use AJAX for retrieving data, History API for loading pages, and a MVC-like pattern for the different components. I have read articles detailing twitters rebuild a few years ago. It was more or less a client-side JavaScript app, but a couple years after launching it, they started moving a lot of processing/rendering back to the server, claiming the app improved dramatically in terms of speed. So, my question is as the title asks, is a client-sid

  0%|          | 0/40221 [00:00<?, ?it/s]

2022-01-07 18:11:40 - Loaded 40221 TEST Documents.
2022-01-07 18:11:40 - Doc Example: {'text': 'An eponym is one way to eternal (if posthumous) fame. But is there a word meaning an eponym someone would sooner not have? (One would presume that Captain Charles _Boycott_ , Mr Justice _Lynch_ , and Patrick _Hooligan_ would not appreciate their undying notoriety.)', 'title': 'Is there a word meaning "an unwanted eponym"?'}
2022-01-07 18:11:40 - Loading Queries...
2022-01-07 18:11:47 - Loaded 1570 TEST Queries.
2022-01-07 18:11:47 - Query Example: Is "a wide range of features" singular or plural?
2022-01-07 18:11:48 - 

2022-01-07 18:11:48 - NDCG@1: 0.3401
2022-01-07 18:11:48 - NDCG@10: 0.3781
2022-01-07 18:11:48 - 

2022-01-07 18:11:48 - MAP@1: 0.2710
2022-01-07 18:11:48 - MAP@10: 0.3393
2022-01-07 18:11:48 - 

2022-01-07 18:11:48 - Recall@1: 0.2710
2022-01-07 18:11:48 - Recall@10: 0.4299
2022-01-07 18:11:48 - 

2022-01-07 18:11:48 - P@1: 0.3401
2022-01-07 18:11:48 - P@10: 0.0643
2022-01-07

  0%|          | 0/68184 [00:00<?, ?it/s]

2022-01-07 18:11:49 - Loaded 68184 TEST Documents.
2022-01-07 18:11:49 - Doc Example: {'text': "I am using a pgfplots stacked bar to display the aggregated energy demand of a houshold and the associated price. When the energy demand exceeds a certain threshold, than a higher price has to be paid. This is visualized by the color red and blue of the bars. The threshold is displayed by the thick red horizontal line. My problem is, that I want this red line to exceed the width of the bar, so that it's width is circa 120 percent of the width of the bar. Is there any possibility to achieve this? Thanks ![enter image description here](http://i.stack.imgur.com/3qeEi.jpg)               \\documentclass[tikz]{standalone}     \\usepackage{pgfplots}     \\pgfplotsset{compat=1.10}     \\begin{document}     \\begin{tikzpicture}     \\begin{axis}[       ymin=0,ymax=4,       samples=3,       enlarge x limits={abs=0.5},       bar width=0.6,       ybar stacked,       legend pos=south east,         every 

  0%|          | 0/2866316 [00:00<?, ?it/s]

2022-01-07 18:12:52 - Loaded 2866316 TEST Documents.
2022-01-07 18:12:52 - Doc Example: {'text': 'This Boston college professor who lives in #NH is on leave after being arrested for child pornography, endangerment:', 'title': ''}
2022-01-07 18:12:52 - Loading Queries...
2022-01-07 18:12:52 - Loaded 97 TEST Queries.
2022-01-07 18:12:52 - Query Example: VIDEO:Good Samaritans Stop Alleged Hit-and-Run Driver in Miami
2022-01-07 18:12:52 - 

2022-01-07 18:12:52 - NDCG@1: 0.4691
2022-01-07 18:12:52 - NDCG@10: 0.3388
2022-01-07 18:12:52 - 

2022-01-07 18:12:52 - MAP@1: 0.0368
2022-01-07 18:12:52 - MAP@10: 0.1286
2022-01-07 18:12:52 - 

2022-01-07 18:12:52 - Recall@1: 0.0368
2022-01-07 18:12:52 - Recall@10: 0.1607
2022-01-07 18:12:52 - 

2022-01-07 18:12:52 - P@1: 0.5464
2022-01-07 18:12:52 - P@10: 0.3083
2022-01-07 18:12:52 - Loading Corpus...


  0%|          | 0/594977 [00:00<?, ?it/s]

2022-01-07 18:13:06 - Loaded 594977 TEST Documents.
2022-01-07 18:13:06 - Doc Example: {'text': 'NEW ORLEANS — Whenever a Virginia Tech offensive coach is asked how the most prolific receiving duo in school history came to be, inevitably the first road game in 2008 against North Carolina comes up. Midway through the first quarter, Virginia Tech had to call two timeouts in a row because then-freshmen Jarrett Boykin and Danny Coale couldn’t seem to line up right, and “they had those big eyes out there looking around,” Kevin Sherman, their position coach, said recently. Now that Boykin and Coale have only Tuesday’s Sugar Bowl remaining before leaving Virginia Tech with every major school record for a wide receiver, they’ve taken a different stance. “I still don’t think that was on us. Macho [Harris] was in the game and he lined up wrong,” said Boykin, as Coale sat next to him nodding in agreement. Just add that to the list of slights these seniors have had to overcome. Boykin has been the

  0%|          | 0/14914714 [00:00<?, ?it/s]

2022-01-07 18:15:14 - Loaded 14914604 TEST Documents.
2022-01-07 18:15:15 - Doc Example: {'text': 'Depressive disorder is one of the most widespread forms of mental disorders which lead to a significant public health concern, such as disability, suicide, and so on. Its etiology remains vague but it is believed that depressive disorder is a multifactorial disease which is induced by the interaction of social, psychological, and biological factors. Thus, there is no clear and definite pathological theory could illustrate its mechanism independently until now, involving genetics, neuroimaging, neuroinflammation, neuroendocrine, and others. Comprehensive assessment to patients with depression is the starting point for a right diagnosis. History-taking of physical condition is as important as psychiatric interview and rational usage of scales would be beneficial for screening. There are many kinds of therapeutic measures for depressive patients nowadays, including general intervention, phar

  0%|          | 0/528155 [00:00<?, ?it/s]

2022-01-07 18:15:26 - Loaded 528155 TEST Documents.
2022-01-07 18:15:26 - Doc Example: {'text': '\n\nPOLITICIANS,  PARTY PREFERENCES \n\n   Summary:  Newspapers in the Former Yugoslav Republic of \n   Macedonia have published the results of opinion polls, \n   indicating the relative popularity of politicians, \n   political parties, and attitudes toward the political system. \n\n   The 22-23 January edition of the Skopje newspaper VECER in \nMacedonian published on pages 6-7 the results of an opinion poll \nconducted by the "BriMa" agency in November 1993. According to \nVECER, 1,036 respondents were classified by age and residence, but \nthe paper did not explain the methodology or give the margin of \nerror.  For the purpose of comparison, the paper cited the results \nof an unidentified poll made in May 1993. The approval/disapproval \nratings, in percent, for ten Macedonian politicians were: \n\n                                           November 1993    May 1993 \n\nKiro Gligorov

##### API - 13B model

In [None]:
### Run Aleph Alpha Client ###

from typing import List, Tuple

from aleph_alpha_client import AlephAlphaClient

from beir.reranking import Rerank

import logging
import time

# tokenizers==0.10.3
from tokenizers import Tokenizer as HF_Tokenizer


AA_TOKEN = "API_TOKEN"

class Tokenizer:
    """
    Wrapper around HF tokenizer to be able to exchange easily at a later point
    """
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    @classmethod
    def from_file(cls, filename):
        tokenizer = HF_Tokenizer.from_file(filename)
        return cls(tokenizer=tokenizer)

    def __len__(self) -> int:
        """
        Returns the vocab size of the tokenizer
        """
        return self.tokenizer.get_vocab_size(with_added_tokens=False)

    def encode(self, text: str) -> List[int]:
        """
        converts a string into token ids
        """
        return self.tokenizer.encode(text).ids

    def decode(self, token_ids: List[int]):
        """
        converts a list of token ids to a string
        """
        return self.tokenizer.decode(token_ids)


class AARanker:
    def __init__(self, model="EUTran13B", tokenizer_file="alpha-03-128k.json", use_prompt=True, prompt_doc="{}", prompt_doc_start="{}\n{}\n",
                 fewshots="", **kwargs):
        """
        Ranker producing log-probabilities for reranking query-document pairs with a GPT-like model via the Aleph Alpha API
        """
        
        self.client = AlephAlphaClient(host="https://api.aleph-alpha.de", token=AA_TOKEN)
        self.model = model
        self.tokenizer = Tokenizer.from_file(tokenizer_file)
            
        # Truncation will be done from the left in the log likelihood
        self.prompt_doc = prompt_doc
        self.use_prompt = use_prompt
        
        self.instruction_len = len(self.tokenizer.encode(self.prompt_doc[:self.prompt_doc.index("{")]))
        
        self.fewshots = fewshots
        if self.fewshots:
            # doc, query
            self.fewshots = prompt_doc_start.format(self.fewshots[0], self.fewshots[1])
            # Still take overflowing tokens away from the current doc (not the fewshot doc)
            self.instruction_len += len(self.tokenizer.encode(self.fewshots))
            
    # Write your own score function, which takes in query-document text pairs and returns the similarity scores
    def predict(self, sentences: List[Tuple[str,str]], batch_size: int, **kwargs) -> List[float]:
        """
        Args:
          sentences: List of (query, document) text pairs
          batch_size: Unused

        Returns:
          log_probs: float log probability for each query-doc pair
        """
        
        log_probs = []

        for query, doc in sentences:
            
            doc_with_prompt = self.prompt_doc.format(doc)
            
            doc_with_prompt_tokens = self.tokenizer.encode(doc_with_prompt)
            query_tokens = self.tokenizer.encode(query)
            
            token_len = len(doc_with_prompt_tokens) + len(query_tokens)
            
            # AA does not accept token len > 2048
            max_len = 2048
            if token_len > max_len:
                # Truncate from the left of the doc only, keeping the instruction part of the prompt intact
                # Truncate just enough to leave space for the instruction & query (& the rest of the doc)
                # If the API still rejects the request, the retry below removes one more token,
                # as there seem to be some discrepancies between this tokenizer and the one of the API
                doc_with_prompt_tokens_trunc = (
                    doc_with_prompt_tokens[:self.instruction_len] + doc_with_prompt_tokens[-(max_len-self.instruction_len-len(query_tokens)):]
                )
                
                logging.info(f"Truncated by {len(doc_with_prompt_tokens) - len(doc_with_prompt_tokens_trunc)} tokens")
                
                doc_with_prompt = self.tokenizer.decode(doc_with_prompt_tokens_trunc)
                
                # The tokenizer sometimes prepends a whitespace when decoding; remove it so the input stays the same apart from the truncated tokens
                if self.prompt_doc[0] != " " and doc_with_prompt[0] == " ":
                    doc_with_prompt = doc_with_prompt[1:]
            
            for i in range(1,20,2):
                try:
                    result = self.client.evaluate(self.model, prompt=doc_with_prompt, completion_expected=query)
                except Exception as e:
                    logging.info(f"Retrying after {i} seconds due to {e}")
                    time.sleep(i)
                    
                    
                    if token_len > max_len:
                        ### Truncate by one more element ###
                        # Copy of above truncation except for additional "- 1"
                        doc_with_prompt_tokens_trunc = (
                            doc_with_prompt_tokens[:self.instruction_len] + doc_with_prompt_tokens[-(max_len-self.instruction_len-len(query_tokens)- 1):]
                        )
                        logging.info(f"Truncated by {len(doc_with_prompt_tokens) - len(doc_with_prompt_tokens_trunc)} tokens")
                        doc_with_prompt = self.tokenizer.decode(doc_with_prompt_tokens_trunc)
                        # The tokenizer sometimes prepends a whitespace when decoding; remove it so the input stays the same apart from the truncated tokens
                        if self.prompt_doc[0] != " " and doc_with_prompt[0] == " ":
                            doc_with_prompt = doc_with_prompt[1:]
                
                    # For other errors just try again
                    
                    continue
                    
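                break
            else:
                # All retries failed: raise instead of appending an undefined or stale result
                raise RuntimeError(f"Evaluation failed after all retries for query: {query}")
                
            log_probs.append(result["result"]["log_probability"])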
        
        assert len(log_probs) == len(sentences), f"Only produced {len(log_probs)} results for {len(sentences)} sentences"
        
        return log_probs
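
The truncation step inside `predict` can be looked at in isolation. The following is a minimal sketch on plain token-id lists, assuming the same layout as above (instruction tokens first, then the document, then whatever follows the `{}` in the prompt). The helper name `truncate_doc_tokens` and its parameters are hypothetical and not part of the notebook or of any library.

```python
from typing import List


def truncate_doc_tokens(doc_tokens: List[int], query_len: int,
                        instruction_len: int, max_len: int = 2048) -> List[int]:
    """Hypothetical helper: keep the instruction prefix and the right-most
    prompt tokens so that prompt + query fit into max_len tokens."""
    if len(doc_tokens) + query_len <= max_len:
        # Everything fits, nothing to truncate
        return doc_tokens
    # Number of tokens that may be kept after the instruction prefix
    keep = max_len - instruction_len - query_len
    if keep <= 0:
        # Guard: a slice like tokens[-0:] would return the whole list
        return doc_tokens[:instruction_len]
    return doc_tokens[:instruction_len] + doc_tokens[-keep:]


# Toy example: 10 prompt tokens, 2 of them instruction, query of 3 tokens, budget of 8
truncated = truncate_doc_tokens(list(range(10)), query_len=3, instruction_len=2, max_len=8)
print(truncated)
```

Because the tokens are dropped directly after the instruction prefix, the end of the prompt (the part that leads into the query) is kept. The explicit `keep <= 0` guard is an addition of this sketch; the cell above does not handle that case.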

In [5]:
### Main Loop A: Using fewshot=0, varying prompts with GPT Reranker ###

import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

# All datasets
datasets = ["trec-covid", "webis-touche2020", "nfcorpus", "scifact", "fiqa", "dbpedia-entity",
            "nq", "hotpotqa", "quora", "fever", "climate-fever", "arguana", "msmarco", "scidocs", 
            "trec-news", "cqadupstack"]


datasets = ["robust04"]


# "trec-covid", "webis-touche2020", "nfcorpus", "scifact", "fiqa", "dbpedia-entity","nq", "hotpotqa", "quora", 

# Datasets by speed - Used for prompt ablations
#datasets = ["trec-covid"]#, "webis-touche2020", "nfcorpus", "scifact", "fiqa", "dbpedia-entity"]

# Main prompt
prompts = {"G": 'Documents are searched to find matches with the same content.\nThe document "{}" is a good search result for "',}


model_out_name = "aleph"

def clean_titles(corpus):
    for k in corpus:
        if "title" in corpus[k] and corpus[k]["title"] is None:
            corpus[k]["title"] = ""
    return corpus


def run_reranking(results_bm25_path, results_path, data_path, top_k=100, k_values=[1, 3, 5, 10, 100, 1000]):
    """
    Args:
        results_bm25_path: Path to .json results from bm25 for the dataset
        results_path: Path to .json to write rerank results
        data_path: Path to the BEIR dataset directory (the split is inferred from it)
        top_k: How many docs to rerank per query
        k_values: Cutoffs k at which NDCG@k, MAP@k, Recall@k and P@k are computed
    """
    
    split = "dev" if "msmarco" in data_path else "test"
    
    corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
    
    corpus = clean_titles(corpus) if "robust04" in data_path else corpus
    
    with open(results_bm25_path, 'r') as fp:
        results_bm25 = json.load(fp)
    
    # Optional sanity check: score the loaded BM25 results
    ndcg_bm25, _map_bm25, recall_bm25, precision_bm25 = EvaluateRetrieval.evaluate(qrels, results_bm25, k_values)

    # Rerank the top_k BM25 results per query with the module-level `reranker` set in the prompt loop below
    results_rerank = reranker.rerank(corpus, queries, results_bm25, top_k=top_k)
    
    # Save rerank results
    with open(results_path, 'w') as fp:
        json.dump(results_rerank, fp)

    #### Evaluate retrieval using NDCG@k, MAP@K ...
    ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results_rerank, k_values)

    return (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision)


for prompt_id, prompt_doc in prompts.items():
    
    scores_out_path = f"beir_scores_{model_out_name}_{prompt_id}.json"
    if os.path.exists(os.path.join(os.getcwd(), scores_out_path)):
        continue

    ndcgs_bm25 = {}
    ndcgs = {}
    
    logging.info(f"\n{'-' * 20} Running prompt {prompt_id}: {prompt_doc} {'-' * 20}\n")
    
    reranker = Rerank(AARanker(use_prompt=True, prompt_doc=prompt_doc), batch_size=128)

    for i, dataset in enumerate(datasets):

        logging.info(f"\n{'-' * 10} Running {dataset} {'-' * 10}\n")
        
        if not(os.path.exists(os.path.join(os.getcwd(), 'datasets', dataset))):
            url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
            out_dir = os.path.join(os.getcwd(), "datasets")
            data_path = util.download_and_unzip(url, out_dir)
            print("Dataset downloaded here: {}".format(data_path))
            
        # Load the dataset into BEIR
        data_path = f"datasets/{dataset}"

        # cqadupstack - Contains several sub datasets
        if dataset == "cqadupstack":
            cqa_ndcgs_bm25, cqa_maps_bm25, cqa_recalls_bm25, cqa_precisions_bm25 = [], [], [], []
            cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
            for sub_dataset in os.listdir(data_path):
                sub_data_path = f"datasets/{dataset}/{sub_dataset}"
                
                results_bm25_path = f"results_{dataset}_{sub_dataset}.json"
                results_path = f"results_{model_out_name}_prompt{prompt_id}_{dataset}_{sub_dataset}.json"
                # Skip if already computed these results
                if os.path.exists(os.path.join(os.getcwd(), results_path)):
                    continue

                (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, sub_data_path)

                cqa_ndcgs_bm25.append(ndcg_bm25)
                cqa_maps_bm25.append(_map_bm25)
                cqa_recalls_bm25.append(recall_bm25)
                cqa_precisions_bm25.append(precision_bm25)

                cqa_ndcgs.append(ndcg)
                cqa_maps.append(_map)
                cqa_recalls.append(recall)
                cqa_precisions.append(precision)

            for (metric, group) in [(ndcg_bm25, cqa_ndcgs_bm25), (_map_bm25, cqa_maps_bm25), (recall_bm25, cqa_recalls_bm25), (precision_bm25, cqa_precisions_bm25)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

            for (metric, group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

            logging.info("CQA Final BM25")
            logging.info(f"{ndcg_bm25}")
            logging.info(f"{_map_bm25}")
            logging.info(f"{recall_bm25}")
            logging.info(f"{precision_bm25}")

            logging.info("CQA Final")
            logging.info(f"{ndcg}")
            logging.info(f"{_map}")
            logging.info(f"{recall}")
            logging.info(f"{precision}")

        else:
            results_bm25_path = f"results_{dataset}.json"
            results_path = f"results_{model_out_name}_prompt{prompt_id}_{dataset}.json"
            # Skip if already computed these results
            if os.path.exists(os.path.join(os.getcwd(), results_path)):
                continue
            (ndcg_bm25, _map_bm25, recall_bm25, precision_bm25), (ndcg, _map, recall, precision) = run_reranking(results_bm25_path, results_path, data_path)

        ndcgs[dataset] = ndcg
        ndcgs_bm25[dataset] = ndcg_bm25

        # Optionally clean-up each time to avoid running out of space
        # !rm -r datasets

    with open(scores_out_path, 'w') as fp:
        json.dump(ndcgs, fp)

2021-12-13 05:09:09 - 
-------------------- Running prompt G: Documents are searched to find matches with the same content.
The document "{}" is a good search result for " --------------------

2021-12-13 05:09:10 - 
---------- Running robust04 ----------

2021-12-13 05:09:10 - Loading Corpus...


  0%|          | 0/528155 [00:00<?, ?it/s]

2021-12-13 05:09:21 - Loaded 528155 TEST Documents.
2021-12-13 05:09:21 - Doc Example: {'text': '\n\nPOLITICIANS,  PARTY PREFERENCES \n\n   Summary:  Newspapers in the Former Yugoslav Republic of \n   Macedonia have published the results of opinion polls, \n   indicating the relative popularity of politicians, \n   political parties, and attitudes toward the political system. \n\n   The 22-23 January edition of the Skopje newspaper VECER in \nMacedonian published on pages 6-7 the results of an opinion poll \nconducted by the "BriMa" agency in November 1993. According to \nVECER, 1,036 respondents were classified by age and residence, but \nthe paper did not explain the methodology or give the margin of \nerror.  For the purpose of comparison, the paper cited the results \nof an unidentified poll made in May 1993. The approval/disapproval \nratings, in percent, for ten Macedonian politicians were: \n\n                                           November 1993    May 1993 \n\nKiro Gligorov
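
The cqadupstack branch of the loop above reports one score per metric by averaging over its sub-datasets. A minimal sketch of that macro-average follows. The helper name `average_metrics` and the example values are made up for illustration; the notebook itself overwrites the last metric dict in place instead of returning a new one.

```python
def average_metrics(group):
    """Macro-average a list of metric dicts (one per sub-dataset) key by key.

    Hypothetical helper; assumes every dict in `group` has the same keys.
    """
    if not group:
        # Happens if every sub-dataset was skipped; avoids a ZeroDivisionError
        raise ValueError("No sub-dataset scores to average")
    return {k: sum(scores[k] for scores in group) / len(group) for k in group[0]}


# Illustrative values only, not results from this notebook
sub_dataset_ndcgs = [
    {"NDCG@10": 0.5, "NDCG@100": 0.75},
    {"NDCG@10": 0.25, "NDCG@100": 0.25},
]
averaged = average_metrics(sub_dataset_ndcgs)
print(averaged)
```

Each sub-dataset counts equally in this average regardless of how many queries it has.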

##### Compute perfect rerank scores

In [3]:
import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

# Subselect datasets
datasets = ["robust04"]

k_values = [1, 10]
max_rerank = 10 # In BEIR 100 documents are reranked for their rerank encoder benchmark

# max_rerank must cover the largest cutoff, e.g. at least 100 to compute scores @100
assert max_rerank >= max(k_values), "max_rerank is smaller than the largest value in k_values"

ndcgs = {}

def perfect_rerank(results, qrels, max_rerank):
    """
    qrels: Dict[str, Dict[str, int]]
    results: Dict[str, Dict[str, float]]
    """
    perfect_rerank_results = {}
    for qid in qrels:
        if qid in results:
            # Subselect max rerank items with highest score
            topk = sorted(results[qid].items(), key=lambda item: item[1], reverse=True)[:max_rerank]
            topk = [key_val_pair[0] for key_val_pair in topk]

            perfect_rerank_results[qid] = {doc: float(score) for doc, score in qrels[qid].items() if doc in topk}
        else:
            # CAUTION: Skipping as we do here inflates the results vs putting an empty list, as the score is ignored
            # It might be more appropriate to put an empty list, i.e. no results, i.e. worse total score
            # However it seems like the default in BEIR is not to do so - Also only one dataset is affected: NFCorpus
            logging.info(f"Skipping {qid}")
        # Adding all other results with lower ranks is unnecessary as NDCG ignores them anyways
        #extra_results = {doc: float(score) * 0.0001 for doc, score in results[qid].items() if doc not in perfect_rerank_results[qid]}
        #perfect_rerank_results[qid] = {**perfect_rerank_results[qid], **extra_results}

    return perfect_rerank_results

for i, dataset in enumerate(datasets):

    if not(os.path.exists(os.path.join(os.getcwd(), 'datasets', dataset))):
        url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
        out_dir = os.path.join(os.getcwd(), "datasets")
        data_path = util.download_and_unzip(url, out_dir)
        print("Dataset downloaded here: {}".format(data_path))
    # Load the dataset into BEIR
    data_path = f"datasets/{dataset}"
    # According to the BEIR paper, msmarco is evaluated on the dev set
    split = "dev" if dataset == "msmarco" else "test"

    # cqadupstack - Contains several sub datasets
    if dataset == "cqadupstack":
        cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
        for sub_dataset in os.listdir(data_path):
            sub_data_path = f"datasets/{dataset}/{sub_dataset}"
            corpus, queries, qrels = GenericDataLoader(sub_data_path).load(split=split)
            with open(f"./results_{dataset}_{sub_dataset}.json", 'r') as fp:
                results_loaded = json.load(fp)
            
            perfect_rerank_results = perfect_rerank(results_loaded, qrels, max_rerank)

            with open(f"./beir_perfect_rerank_{max_rerank}_{dataset}_{sub_dataset}.json", 'w') as fp:
                json.dump(perfect_rerank_results, fp)

            ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, perfect_rerank_results, k_values)

            cqa_ndcgs.append(ndcg)
            cqa_maps.append(_map)
            cqa_recalls.append(recall)
            cqa_precisions.append(precision)

        for (metric, group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
            for k in metric.keys():
                metric[k] = sum([score[k] for score in group]) / len(group)

        logging.info("CQA Final")
        logging.info(f"{ndcg}")
        logging.info(f"{_map}")
        logging.info(f"{recall}")
        logging.info(f"{precision}")

    else:
        corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
        with open(f"./results_{dataset}.json", 'r') as fp:
            results_loaded = json.load(fp)
        perfect_rerank_results = perfect_rerank(results_loaded, qrels, max_rerank)
        ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, perfect_rerank_results, k_values)
        with open(f"./beir_perfect_rerank_{max_rerank}_{dataset}.json", 'w') as fp:
            json.dump(perfect_rerank_results, fp)

    ndcgs[dataset] = ndcg

    # Cleanup if necessary
    #if ("signal1m" not in data_path) and ("trec-news" not in data_path):
    #    !rm -r {data_path}
    #    !rm -r {data_path}.zip


2021-12-17 16:57:41 - Loading Corpus...


  0%|          | 0/528155 [00:00<?, ?it/s]

2021-12-17 16:57:52 - Loaded 528155 TEST Documents.
2021-12-17 16:57:52 - Doc Example: {'text': '\n\nPOLITICIANS,  PARTY PREFERENCES \n\n   Summary:  Newspapers in the Former Yugoslav Republic of \n   Macedonia have published the results of opinion polls, \n   indicating the relative popularity of politicians, \n   political parties, and attitudes toward the political system. \n\n   The 22-23 January edition of the Skopje newspaper VECER in \nMacedonian published on pages 6-7 the results of an opinion poll \nconducted by the "BriMa" agency in November 1993. According to \nVECER, 1,036 respondents were classified by age and residence, but \nthe paper did not explain the methodology or give the margin of \nerror.  For the purpose of comparison, the paper cited the results \nof an unidentified poll made in May 1993. The approval/disapproval \nratings, in percent, for ten Macedonian politicians were: \n\n                                           November 1993    May 1993 \n\nKiro Gligorov
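
To make the idea of a perfect (oracle) rerank concrete, here is a small self-contained sketch on toy data. It follows the same logic as `perfect_rerank` above but drops the logging and uses a set for the membership test. The function name `perfect_rerank_toy` and all ids and scores are invented for illustration.

```python
def perfect_rerank_toy(results, qrels, max_rerank):
    """Score each top-max_rerank retrieved doc by its gold relevance label.

    results: Dict[str, Dict[str, float]] of retrieval scores per query
    qrels:   Dict[str, Dict[str, int]] of gold relevance labels per query
    """
    reranked = {}
    for qid, judged in qrels.items():
        if qid not in results:
            # Queries without retrieval results are skipped, as in the cell above
            continue
        # Ids of the max_rerank highest-scoring retrieved documents
        topk = set(sorted(results[qid], key=results[qid].get, reverse=True)[:max_rerank])
        # Keep only judged documents that the retriever surfaced in its top-k
        reranked[qid] = {doc: float(rel) for doc, rel in judged.items() if doc in topk}
    return reranked


toy_results = {"q1": {"d1": 12.0, "d2": 9.5, "d3": 3.1}, "q2": {"d4": 1.0}}
toy_qrels = {"q1": {"d2": 1, "d3": 2}, "q3": {"d9": 1}}
toy_reranked = perfect_rerank_toy(toy_results, toy_qrels, max_rerank=2)
print(toy_reranked)
```

In this toy case `d3` is relevant but sits outside the top 2, so even a perfect reranker cannot recover it. The resulting score is an upper bound for any reranker working on the same candidate list.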

In [2]:
# Compute scores based on results.json

import json
import os

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval


def compute_result(results_path, data_path, top_k=100, k_values=[1, 3, 5, 10, 100]):
    """
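    Args:
        results_path: Path to .json to read rerank results
        data_path: Path to the BEIR dataset directory (the split is inferred from it)
        top_k: Unused here; the results file already fixes how many docs were reranked
        k_values: Cutoffs k at which NDCG@k, MAP@k, Recall@k and P@k are computed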
    """
    split = "dev" if "msmarco" in data_path else "test"
    corpus, queries, qrels = GenericDataLoader(data_path).load(split=split)
    
    with open(results_path, 'r') as fp:
        results = json.load(fp)
    
    ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, k_values)

    return (ndcg, _map, recall, precision)


# All datasets
datasets = ["trec-covid", "webis-touche2020", "nfcorpus", "scifact", "fiqa", "dbpedia-entity",
            "nq", "hotpotqa", "quora", "fever", "climate-fever", "arguana", "msmarco", "scidocs", "cqadupstack",
            "signal1m", "trec-news", "bioasq", "robust04"]

prompts = {
    "": '',
}

latex_help = {}
latex_help_avgs = {}
results_prefix = "beirbm25perfectrerankresults/"
model_name = "perfect_rerank"
# Set to "" if the result file names carry no max-rerank suffix (no suffix means 100 docs were reranked)
maxrerank = "_10"

for prompt_id, prompt_doc in prompts.items():
    
    scores_out_path = f"{results_prefix}beir_{model_name}{prompt_id}{maxrerank}_ndcgs.json"
    ndcgs = {}
    
    for i, dataset in enumerate(datasets):
        
        data_path = f"datasets/{dataset}"

        if dataset == "cqadupstack":
            cqa_ndcgs, cqa_maps, cqa_recalls, cqa_precisions = [], [], [], []
            for sub_dataset in os.listdir(data_path):
                sub_data_path = f"datasets/{dataset}/{sub_dataset}"
                results_path = f"{results_prefix}beir_{model_name}{prompt_id}{maxrerank}_{dataset}_{sub_dataset}.json"
                assert os.path.exists(os.path.join(os.getcwd(), results_path)), f"Missing path: {results_path}"

                ndcg, _map, recall, precision = compute_result(results_path, sub_data_path)
                cqa_ndcgs.append(ndcg)
                cqa_maps.append(_map)
                cqa_recalls.append(recall)
                cqa_precisions.append(precision)

            for (metric, group) in [(ndcg, cqa_ndcgs), (_map, cqa_maps), (recall, cqa_recalls), (precision, cqa_precisions)]:
                for k in metric.keys():
                    metric[k] = sum([score[k] for score in group]) / len(group)

        else:
            results_path = f"{results_prefix}beir_{model_name}{prompt_id}{maxrerank}_{dataset}.json"
            assert os.path.exists(os.path.join(os.getcwd(), results_path)), f"Missing path: {results_path}"
            ndcg, _map, recall, precision = compute_result(results_path, data_path)
           
        latex_help.setdefault(dataset, "")
        latex_help[dataset] += f" & {ndcg['NDCG@10']}"
        latex_help_avgs.setdefault(prompt_id, 0)
        latex_help_avgs[prompt_id] += ndcg['NDCG@10']

        ndcgs[dataset] = ndcg

    with open(scores_out_path, 'w') as fp:
        json.dump(ndcgs, fp)
        
for k, v in latex_help.items():
    print(f"{k} {v}")

print("& ".join([f"{k}: {round(v/len(datasets), 5)}" for k,v in latex_help_avgs.items()]))

2021-12-20 13:56:35 - Loading faiss with AVX2 support.
2021-12-20 13:56:35 - Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2021-12-20 13:56:35 - Loading faiss.
2021-12-20 13:56:35 - Successfully loaded faiss.
2021-12-20 13:56:35 - Loading Corpus...


  0%|          | 0/171332 [00:00<?, ?it/s]

2021-12-20 13:56:36 - Loaded 171332 TEST Documents.
2021-12-20 13:56:36 - Doc Example: {'text': 'OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract 

  0%|          | 0/382545 [00:00<?, ?it/s]

2021-12-20 13:56:41 - Loaded 382545 TEST Documents.
2021-12-20 13:56:41 - Doc Example: {'text': 'My opponent forfeited every round. None of my arguments were answered. I don’t like the idea of winning by default, but here we are.Tule: it’s good for students to get involved and address big issues like teen pregnancy. You need to be able to answer arguments like mine and not simply prepare for an abstinence-only type of response. You should also be aware that, in the U.S., condoms may be sold to minors in ANY state. A retailer who says it is illegal to sell you them is, frankly, wrong.', 'title': 'Contraceptive Forms for High School Students'}
2021-12-20 13:56:41 - Loading Queries...
2021-12-20 13:56:41 - Loaded 49 TEST Queries.
2021-12-20 13:56:41 - Query Example: Should teachers get tenure?
2021-12-20 13:56:41 - 

2021-12-20 13:56:41 - NDCG@1: 0.8571
2021-12-20 13:56:41 - NDCG@3: 0.7704
2021-12-20 13:56:41 - NDCG@5: 0.6615
2021-12-20 13:56:41 - NDCG@10: 0.4666
2021-12-20 13:56:41 - NDC

  0%|          | 0/3633 [00:00<?, ?it/s]

2021-12-20 13:56:41 - Loaded 3633 TEST Documents.
2021-12-20 13:56:41 - Doc Example: {'text': 'Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants die

  0%|          | 0/5183 [00:00<?, ?it/s]

2021-12-20 13:56:42 - Loaded 5183 TEST Documents.
2021-12-20 13:56:42 - Doc Example: {'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 vers

  0%|          | 0/57638 [00:00<?, ?it/s]

2021-12-20 13:56:42 - Loaded 57638 TEST Documents.
2021-12-20 13:56:42 - Doc Example: {'text': "I'm not saying I don't like the idea of on-the-job training too, but you can't expect the company to do that. Training workers is not their job - they're building software. Perhaps educational systems in the U.S. (or their students) should worry a little about getting marketable skills in exchange for their massive investment in education, rather than getting out with thousands in student debt and then complaining that they aren't qualified to do anything.", 'title': ''}
2021-12-20 13:56:42 - Loading Queries...
2021-12-20 13:56:42 - Loaded 648 TEST Queries.
2021-12-20 13:56:42 - Query Example: How to deposit a cheque issued to an associate in my business into my business account?
2021-12-20 13:56:42 - 

2021-12-20 13:56:42 - NDCG@1: 0.5015
2021-12-20 13:56:42 - NDCG@3: 0.3874
2021-12-20 13:56:42 - NDCG@5: 0.3697
2021-12-20 13:56:42 - NDCG@10: 0.3632
2021-12-20 13:56:42 - NDCG@100: 0.3627
202

  0%|          | 0/4635922 [00:00<?, ?it/s]

2021-12-20 13:57:09 - Loaded 4635922 TEST Documents.
2021-12-20 13:57:09 - Doc Example: {'text': "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold.   A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket.", 'title': 'Animalia (book)'}
2021-12-20 13:57:09 - Loading Queries...
2021-12-20 13:57:09 - Loaded 400 TEST Queries.
2021-12-20 13:57:09 - Query Example: Szechwan dish food cuisine
2021-12-20 13:57:09 - 

2021-12-20 13:57:09 - NDCG@1: 0.6800
2021-12-20 13:57:09 - NDCG@3: 0.5504
2021-12-20 13:57:09 - NDCG@5: 0.4787
2021-12-20 13:57:09 - NDCG@10: 0.3974
2021-12-20 13:57:09 - NDCG@100: 0.3063
2021-12-20 13:57:09 - 

2021-12-20 13:57:09 - MAP@1: 0.1031
2021-12-20 13:57:09 - MAP@3: 0.1742
2021-12-20 13:57:09 - MAP@5: 0.1947
2021-12-20 13:57:09 - MAP@10: 0.2089
20

  0%|          | 0/2681468 [00:00<?, ?it/s]

2021-12-20 13:57:26 - Loaded 2681468 TEST Documents.
2021-12-20 13:57:26 - Doc Example: {'text': "In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]", 'title': 'Minority interest'}
2021-12-20 13:57:26 - Loading Queries...
2021-12-20 13:57:26 - Loaded 3452 TEST Queries.
2021-12-20 13:57:26 - Query Example: what is non controlling interest on balance sheet
2021-12-20 13:57:26 - 

2021-12-20 13:57:26 - NDCG@1: 0.5403
2021-12-20 13:57:26 - NDCG@3: 0.5143
2021-12-20 13:57:26 - NDCG@5: 0.5143
2021-12-20 13:57:26 - NDCG@10: 0.5143
2021-12-20 13:57:26 - NDCG@100: 0.5143
2021-12-20 13:57:26 - 

2021-12-20 13:57:26 - MAP@1: 0.4818
2021-12-20 13:57:26 - MAP@3: 0.5066
2021-12-20 13:57:26 - MAP@5: 0.506

  0%|          | 0/5233329 [00:00<?, ?it/s]

2021-12-20 13:57:56 - Loaded 5233329 TEST Documents.
2021-12-20 13:57:56 - Doc Example: {'text': 'Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary and harmful.', 'title': 'Anarchism'}
2021-12-20 13:57:56 - Loading Queries...
2021-12-20 13:57:57 - Loaded 7405 TEST Queries.
2021-12-20 13:57:57 - Query Example: Were Scott Derrickson and Ed Wood of the same nationality?
2021-12-20 13:57:57 - 

2021-12-20 13:57:57 - NDCG@1: 0.8952
2021-12-20 13:57:57 - NDCG@3: 0.6896
2021-12-20 13:57:57 - NDCG@5: 0.6896
2021-12-20 13:57:57 - NDCG@10: 0.6896
2021-12-20 13:57:57 - NDCG@100: 0.6896
2021-12-20 13:57:57 - 

2021-12-20 13:57:57 - MAP@1: 0.4476
2021-12-20 13:57:57 - MAP@3: 0.6295
2021-12-20 13:57:57 - MAP@5: 0.6295

  0%|          | 0/522931 [00:00<?, ?it/s]

2021-12-20 13:57:59 - Loaded 522931 TEST Documents.
2021-12-20 13:57:59 - Doc Example: {'text': 'What is the step by step guide to invest in share market in india?', 'title': ''}
2021-12-20 13:57:59 - Loading Queries...
2021-12-20 13:58:00 - Loaded 10000 TEST Queries.
2021-12-20 13:58:00 - Query Example: Which question should I ask on Quora?
2021-12-20 13:58:00 - 

2021-12-20 13:58:00 - NDCG@1: 0.9447
2021-12-20 13:58:00 - NDCG@3: 0.9233
2021-12-20 13:58:00 - NDCG@5: 0.9174
2021-12-20 13:58:00 - NDCG@10: 0.9138
2021-12-20 13:58:00 - NDCG@100: 0.9125
2021-12-20 13:58:00 - 

2021-12-20 13:58:00 - MAP@1: 0.8140
2021-12-20 13:58:00 - MAP@3: 0.8969
2021-12-20 13:58:00 - MAP@5: 0.9017
2021-12-20 13:58:00 - MAP@10: 0.9026
2021-12-20 13:58:00 - MAP@100: 0.9026
2021-12-20 13:58:00 - 

2021-12-20 13:58:00 - Recall@1: 0.8140
2021-12-20 13:58:00 - Recall@3: 0.8969
2021-12-20 13:58:00 - Recall@5: 0.9017
2021-12-20 13:58:00 - Recall@10: 0.9026
2021-12-20 13:58:00 - Recall@100: 0.9026
2021-12-20 13:5

  0%|          | 0/5416568 [00:00<?, ?it/s]

2021-12-20 13:58:32 - Loaded 5416568 TEST Documents.
2021-12-20 13:58:32 - Doc Example: {'text': 'The following are the football ( soccer ) events of the year 1928 throughout the world .', 'title': '1928 in association football'}
2021-12-20 13:58:32 - Loading Queries...
2021-12-20 13:58:33 - Loaded 6666 TEST Queries.
2021-12-20 13:58:33 - Query Example: Ukrainian Soviet Socialist Republic was a founding participant of the UN.
2021-12-20 13:58:33 - 

2021-12-20 13:58:33 - NDCG@1: 0.8609
2021-12-20 13:58:33 - NDCG@3: 0.8258
2021-12-20 13:58:33 - NDCG@5: 0.8246
2021-12-20 13:58:33 - NDCG@10: 0.8243
2021-12-20 13:58:33 - NDCG@100: 0.8243
2021-12-20 13:58:33 - 

2021-12-20 13:58:33 - MAP@1: 0.8019
2021-12-20 13:58:33 - MAP@3: 0.8141
2021-12-20 13:58:33 - MAP@5: 0.8141
2021-12-20 13:58:33 - MAP@10: 0.8141
2021-12-20 13:58:33 - MAP@100: 0.8141
2021-12-20 13:58:33 - 

2021-12-20 13:58:33 - Recall@1: 0.8019
2021-12-20 13:58:33 - Recall@3: 0.8141
2021-12-20 13:58:33 - Recall@5: 0.8141
2021-12-20

  0%|          | 0/5416593 [00:00<?, ?it/s]

2021-12-20 13:59:05 - Loaded 5416593 TEST Documents.
2021-12-20 13:59:06 - Doc Example: {'text': 'The following are the football ( soccer ) events of the year 1928 throughout the world .', 'title': '1928 in association football'}
2021-12-20 13:59:06 - Loading Queries...
2021-12-20 13:59:06 - Loaded 1535 TEST Queries.
2021-12-20 13:59:06 - Query Example: Global warming is driving polar bears toward extinction
2021-12-20 13:59:06 - 

2021-12-20 13:59:06 - NDCG@1: 0.4612
2021-12-20 13:59:06 - NDCG@3: 0.2963
2021-12-20 13:59:06 - NDCG@5: 0.2805
2021-12-20 13:59:06 - NDCG@10: 0.2805
2021-12-20 13:59:06 - NDCG@100: 0.2805
2021-12-20 13:59:06 - 

2021-12-20 13:59:06 - MAP@1: 0.1957
2021-12-20 13:59:06 - MAP@3: 0.2316
2021-12-20 13:59:06 - MAP@5: 0.2318
2021-12-20 13:59:06 - MAP@10: 0.2318
2021-12-20 13:59:06 - MAP@100: 0.2318
2021-12-20 13:59:06 - 

2021-12-20 13:59:06 - Recall@1: 0.1957
2021-12-20 13:59:06 - Recall@3: 0.2316
2021-12-20 13:59:06 - Recall@5: 0.2318
2021-12-20 13:59:06 - Recall

  0%|          | 0/8674 [00:00<?, ?it/s]

2021-12-20 13:59:07 - Loaded 8674 TEST Documents.
2021-12-20 13:59:07 - Doc Example: {'text': "You don’t have to be vegetarian to be green. Many special environments have been created by livestock farming – for example chalk down land in England and mountain pastures in many countries. Ending livestock farming would see these areas go back to woodland with a loss of many unique plants and animals. Growing crops can also be very bad for the planet, with fertilisers and pesticides polluting rivers, lakes and seas. Most tropical forests are now cut down for timber, or to allow oil palm trees to be grown in plantations, not to create space for meat production.  British farmer and former editor Simon Farrell also states: “Many vegans and vegetarians rely on one source from the U.N. calculation that livestock generates 18% of global carbon emissions, but this figure contains basic mistakes. It attributes all deforestation from ranching to cattle, rather than logging or development. It also m

  0%|          | 0/8841823 [00:00<?, ?it/s]

2021-12-20 13:59:50 - Loaded 8841823 DEV Documents.
2021-12-20 13:59:51 - Doc Example: {'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.', 'title': ''}
2021-12-20 13:59:51 - Loading Queries...
2021-12-20 13:59:52 - Loaded 6980 DEV Queries.
2021-12-20 13:59:52 - Query Example: how many years did william bradford serve as governor of plymouth colony?
2021-12-20 13:59:52 - 

2021-12-20 13:59:52 - NDCG@1: 0.3888
2021-12-20 13:59:52 - NDCG@3: 0.3833
2021-12-20 13:59:52 - NDCG@5: 0.3833
2021-12-20 13:59:52 - NDCG@10: 0.3833
2021-12-20 13:59:52 - NDCG@100: 0.3833
2021-12-20 13:59:52 - 

2021-12-20 13:59:52 - MAP@1: 0.3777
2021-12-20 13:59:52 - MAP@3: 0.3817
2021-12-20 13:59:52 - MAP@5: 0.3817
2021-12-20 13:59:52

  0%|          | 0/25657 [00:00<?, ?it/s]

2021-12-20 13:59:55 - Loaded 25657 TEST Documents.
2021-12-20 13:59:55 - Doc Example: {'text': 'An evolutionary recurrent network which automates the design of recurrent neural/fuzzy networks using a new evolutionary learning algorithm is proposed in this paper. This new evolutionary learning algorithm is based on a hybrid of genetic algorithm (GA) and particle swarm optimization (PSO), and is thus called HGAPSO. In HGAPSO, individuals in a new generation are created, not only by crossover and mutation operation as in GA, but also by PSO. The concept of elite strategy is adopted in HGAPSO, where the upper-half of the best-performing individuals in a population are regarded as elites. However, instead of being reproduced directly to the next generation, these elites are first enhanced. The group constituted by the elites is regarded as a swarm, and each elite corresponds to a particle within it. In this regard, the elites are enhanced by PSO, an operation which mimics the maturing pheno

  0%|          | 0/47382 [00:00<?, ?it/s]

2021-12-20 13:59:55 - Loaded 47382 TEST Documents.
2021-12-20 13:59:55 - Loading Queries...
2021-12-20 13:59:59 - Loaded 1072 TEST Queries.
2021-12-20 13:59:59 - Query Example: Yanked USB Key During Move
2021-12-20 13:59:59 - 

2021-12-20 13:59:59 - NDCG@1: 0.4366
2021-12-20 13:59:59 - NDCG@3: 0.3993
2021-12-20 13:59:59 - NDCG@5: 0.3962
2021-12-20 13:59:59 - NDCG@10: 0.3957
2021-12-20 13:59:59 - NDCG@100: 0.3956
2021-12-20 13:59:59 - 

2021-12-20 13:59:59 - MAP@1: 0.3697
2021-12-20 13:59:59 - MAP@3: 0.3841
2021-12-20 13:59:59 - MAP@5: 0.3841
2021-12-20 13:59:59 - MAP@10: 0.3841
2021-12-20 13:59:59 - MAP@100: 0.3841
2021-12-20 13:59:59 - 

2021-12-20 13:59:59 - Recall@1: 0.3697
2021-12-20 13:59:59 - Recall@3: 0.3841
2021-12-20 13:59:59 - Recall@5: 0.3841
2021-12-20 13:59:59 - Recall@10: 0.3841
2021-12-20 13:59:59 - Recall@100: 0.3841
2021-12-20 13:59:59 - 

2021-12-20 13:59:59 - P@1: 0.4366
2021-12-20 13:59:59 - P@3: 0.1589
2021-12-20 13:59:59 - P@5: 0.0953
2021-12-20 13:59:59 - P@10: 0

  0%|          | 0/38316 [00:00<?, ?it/s]

2021-12-20 13:59:59 - Loaded 38316 TEST Documents.
2021-12-20 13:59:59 - Doc Example: {'text': "Let's discuss about $SU(3)$. I understand that the most important representations (relevant to physics) are the defining and the adjoint. In the defining representation of $SU(3)$; namely $\\mathbf{3}$, the Gell-Mann matrices are used to represent the generators $$ \\left[T^{A}\\right]_{ij} = \\dfrac{1}{2}\\lambda^{A}, $$ where $T^A$ are the generators and $\\lambda^A$ the Gell-Mann matrices. In adjoint representation, on the other hand, an $\\mathbf{8}$, the generators are represented by matrices according to $$ \\left[ T_{i} \\right]_{jk} = -if_{ijk}, $$ where $f_{ijk}$ are the structure constants. My question is this, how can one represent the generators in the $\\mathbf{10}$ of $SU(3)$, which corresponds to a symmetric tensor with 3 upper or lower indices (or for that matter how to represent the $\\mathbf{6}$ with two symmetric indices). What is the general procedure to represent the gen

  0%|          | 0/48605 [00:00<?, ?it/s]

2021-12-20 14:00:03 - Loaded 48605 TEST Documents.
2021-12-20 14:00:03 - Doc Example: {'text': "In a shortcode context, is there any difference here?               array(             'slideshow' => '',         ),       and               array(             'slideshow' => NULL,         ),       Is there a best practice for that?", 'title': 'What is the difference between Null vs Empty (Zero Length) string?'}
2021-12-20 14:00:03 - Loading Queries...
2021-12-20 14:00:04 - Loaded 541 TEST Queries.
2021-12-20 14:00:04 - Query Example: How to enqueue script or style in a theme's template file?
2021-12-20 14:00:04 - 

2021-12-20 14:00:04 - NDCG@1: 0.4103
2021-12-20 14:00:04 - NDCG@3: 0.3903
2021-12-20 14:00:04 - NDCG@5: 0.3886
2021-12-20 14:00:04 - NDCG@10: 0.3883
2021-12-20 14:00:04 - NDCG@100: 0.3883
2021-12-20 14:00:04 - 

2021-12-20 14:00:04 - MAP@1: 0.3709
2021-12-20 14:00:04 - MAP@3: 0.3812
2021-12-20 14:00:04 - MAP@5: 0.3816
2021-12-20 14:00:04 - MAP@10: 0.3816
2021-12-20 14:00:04 - MAP

  0%|          | 0/37637 [00:00<?, ?it/s]

2021-12-20 14:00:04 - Loaded 37637 TEST Documents.
2021-12-20 14:00:04 - Doc Example: {'text': "There is a satellite image it's size is 10 GB and I need to display this image using GeoServer and OpenLayers. When user select the Satellite image in the layer switcher need to display image within 10 seconds. I tried geopdf but the image quality loss isn't acceptable to customer. I want to achieve 10 seconds response time using 32 GB satellite image. Please advice me how to achieve this? Thanks in advance.", 'title': 'Satellite image display with the help of GeoServer and OpenLayers'}
2021-12-20 14:00:04 - Loading Queries...
2021-12-20 14:00:07 - Loaded 885 TEST Queries.
2021-12-20 14:00:07 - Query Example: Calculating mean upslope aspect from each cell in DEM using Python?
2021-12-20 14:00:07 - 

2021-12-20 14:00:07 - NDCG@1: 0.4316
2021-12-20 14:00:07 - NDCG@3: 0.4114
2021-12-20 14:00:07 - NDCG@5: 0.4104
2021-12-20 14:00:07 - NDCG@10: 0.4101
2021-12-20 14:00:07 - NDCG@100: 0.4101
2021-12

  0%|          | 0/45301 [00:00<?, ?it/s]

2021-12-20 14:00:07 - Loaded 45301 TEST Documents.
2021-12-20 14:00:07 - Doc Example: {'text': 'What\'s your Supreme Commander 2 build order. I don\'t just want "6 mass extractors, 2 power and a factory". List of building and units out to the second or third factory, please.', 'title': 'Supreme Commander 2 - Build Orders'}
2021-12-20 14:00:07 - Loading Queries...
2021-12-20 14:00:13 - Loaded 1595 TEST Queries.
2021-12-20 14:00:13 - Query Example: Can the trophy system protect me against bullets?
2021-12-20 14:00:13 - 

2021-12-20 14:00:13 - NDCG@1: 0.6332
2021-12-20 14:00:13 - NDCG@3: 0.6022
2021-12-20 14:00:13 - NDCG@5: 0.5991
2021-12-20 14:00:13 - NDCG@10: 0.5978
2021-12-20 14:00:13 - NDCG@100: 0.5970
2021-12-20 14:00:13 - 

2021-12-20 14:00:13 - MAP@1: 0.5486
2021-12-20 14:00:13 - MAP@3: 0.5844
2021-12-20 14:00:13 - MAP@5: 0.5860
2021-12-20 14:00:13 - MAP@10: 0.5864
2021-12-20 14:00:13 - MAP@100: 0.5864
2021-12-20 14:00:13 - 

2021-12-20 14:00:13 - Recall@1: 0.5486
2021-12-20 14:00:

  0%|          | 0/42269 [00:00<?, ?it/s]

2021-12-20 14:00:13 - Loaded 42269 TEST Documents.
2021-12-20 14:00:13 - Doc Example: {'text': "I'm a beginner in statistics and R, sorry if this question may seem trivial. I've collected data measuring several different parameters in 40 subjects at two time-points (t1 and t2). There are 3 main parameters in which I'm interested, let's call them ParA, ParB, ParC. ParA is a score of disability. It is on an arbitrary scale (so it is an ordinal scale measure, if my understanding is correct) and values range from 0.0 to 10.0. Note that the increments in this scale are by 0.5 unit, so values like, e.g. 1.5 are possible. I have two measures, at t1 and t2, so I can describe at least three variables from ParA: ParA at t1, ParA at t2, and whether a subject progressed or not (0 or 1). Being a ratio scale measure, I think it would not make much sense to compute a difference (eg. ParA at t2 - ParA at t1), but I'm willing to accept suggestions on this matter. ParB and ParC are meausurements of two 

  0%|          | 0/17405 [00:00<?, ?it/s]

2021-12-20 14:00:15 - Loaded 17405 TEST Documents.
2021-12-20 14:00:15 - Doc Example: {'text': 'I\'m making a website for a small hotel in php. The hotel owners want a reservation system that uses paypal. They want people to see a calendar and choose a date to make a reservation. If the day has vacancy, they want the user to request booking a room. This would then require the hotel owner to accept the purchase. I have not worked on a project that has this "request to purchase" method of buying with paypal. Is this possible? Does anyone know of an open php system that handles this?', 'title': 'Hotel Reservation Request Booking Paypal PHP'}
2021-12-20 14:00:15 - Loading Queries...
2021-12-20 14:00:15 - Loaded 506 TEST Queries.
2021-12-20 14:00:15 - Query Example: Someone else is using our Google Analytics Tracking code number. What do we do?
2021-12-20 14:00:15 - 

2021-12-20 14:00:15 - NDCG@1: 0.4644
2021-12-20 14:00:15 - NDCG@3: 0.4360
2021-12-20 14:00:15 - NDCG@5: 0.4298
2021-12-20 14

  0%|          | 0/16705 [00:00<?, ?it/s]

2021-12-20 14:00:16 - Loaded 16705 TEST Documents.
2021-12-20 14:00:16 - Doc Example: {'text': "I'm trying to use `Get` to load some pretty substantial packages from a custom menu in the _Mathematica_ toolbar (added via MenuSetup.tr).   The problem is, the standard 5-second evaluation timeout seems to apply to commands executed with `KernelExecute`, so only a fraction of my `Get` is evaluated before the command times out. I'm wondering whether there's an option that can be passed to `KernelExecute` (or to `Item` / `MenuItem`) that will remove that time constraint so that my command can be executed completely.", 'title': 'Time constraints on KernelExecute commands or MenuItems?'}
2021-12-20 14:00:16 - Loading Queries...
2021-12-20 14:00:18 - Loaded 804 TEST Queries.
2021-12-20 14:00:18 - Query Example: How to use Automorphisms[] on a graph?
2021-12-20 14:00:18 - 

2021-12-20 14:00:18 - NDCG@1: 0.3769
2021-12-20 14:00:18 - NDCG@3: 0.3335
2021-12-20 14:00:18 - NDCG@5: 0.3301
2021-12-20 14

  0%|          | 0/22998 [00:00<?, ?it/s]

2021-12-20 14:00:18 - Loaded 22998 TEST Documents.
2021-12-20 14:00:18 - Doc Example: {'text': "I want to send files to android tablet with a application from PC. - I can send files directly to tablet (2.3 android OS) PC see it as a external usb drive. - But i can't send files to tablet (4.2 android OS), because PC see it as a portable media player.(MTP) - How can i fix this problem ? - How can show my device as a external drive? my application that sent files written via Delphi.", 'title': 'How can show android tablet as a external storage to PC?'}
2021-12-20 14:00:18 - Loading Queries...
2021-12-20 14:00:19 - Loaded 699 TEST Queries.
2021-12-20 14:00:19 - Query Example: Android chroot ubuntu - is it possible to get ubuntu to recognise usb devices
2021-12-20 14:00:19 - 

2021-12-20 14:00:19 - NDCG@1: 0.5765
2021-12-20 14:00:19 - NDCG@3: 0.5343
2021-12-20 14:00:19 - NDCG@5: 0.5245
2021-12-20 14:00:19 - NDCG@10: 0.5184
2021-12-20 14:00:19 - NDCG@100: 0.5155
2021-12-20 14:00:19 - 

2021-

  0%|          | 0/32176 [00:00<?, ?it/s]

2021-12-20 14:00:20 - Loaded 32176 TEST Documents.
2021-12-20 14:00:20 - Doc Example: {'text': "I am in the midst of writing a web application for work. Everything is from scratch. I have been a PHP programmer for about 13 years, Node.js programmer for the past 2 years, and have no shortage of experience with JavaScript. I love Node.js, and recently rebuilt the company's API in it... So, in planning this web application, the approach I'm considering is, have the Node.js API for getting data from the server, but render everything in the browser. Use AJAX for retrieving data, History API for loading pages, and a MVC-like pattern for the different components. I have read articles detailing twitters rebuild a few years ago. It was more or less a client-side JavaScript app, but a couple years after launching it, they started moving a lot of processing/rendering back to the server, claiming the app improved dramatically in terms of speed. So, my question is as the title asks, is a client-sid

  0%|          | 0/40221 [00:00<?, ?it/s]

2021-12-20 14:00:23 - Loaded 40221 TEST Documents.
2021-12-20 14:00:23 - Doc Example: {'text': 'An eponym is one way to eternal (if posthumous) fame. But is there a word meaning an eponym someone would sooner not have? (One would presume that Captain Charles _Boycott_ , Mr Justice _Lynch_ , and Patrick _Hooligan_ would not appreciate their undying notoriety.)', 'title': 'Is there a word meaning "an unwanted eponym"?'}
2021-12-20 14:00:23 - Loading Queries...
2021-12-20 14:00:30 - Loaded 1570 TEST Queries.
2021-12-20 14:00:30 - Query Example: Is "a wide range of features" singular or plural?
2021-12-20 14:00:30 - 

2021-12-20 14:00:30 - NDCG@1: 0.5051
2021-12-20 14:00:30 - NDCG@3: 0.4613
2021-12-20 14:00:30 - NDCG@5: 0.4531
2021-12-20 14:00:30 - NDCG@10: 0.4482
2021-12-20 14:00:30 - NDCG@100: 0.4452
2021-12-20 14:00:30 - 

2021-12-20 14:00:30 - MAP@1: 0.3946
2021-12-20 14:00:30 - MAP@3: 0.4277
2021-12-20 14:00:30 - MAP@5: 0.4295
2021-12-20 14:00:30 - MAP@10: 0.4299
2021-12-20 14:00:30 -

  0%|          | 0/68184 [00:00<?, ?it/s]

2021-12-20 14:00:31 - Loaded 68184 TEST Documents.
2021-12-20 14:00:31 - Doc Example: {'text': "I am using a pgfplots stacked bar to display the aggregated energy demand of a houshold and the associated price. When the energy demand exceeds a certain threshold, than a higher price has to be paid. This is visualized by the color red and blue of the bars. The threshold is displayed by the thick red horizontal line. My problem is, that I want this red line to exceed the width of the bar, so that it's width is circa 120 percent of the width of the bar. Is there any possibility to achieve this? Thanks ![enter image description here](http://i.stack.imgur.com/3qeEi.jpg)               \\documentclass[tikz]{standalone}     \\usepackage{pgfplots}     \\pgfplotsset{compat=1.10}     \\begin{document}     \\begin{tikzpicture}     \\begin{axis}[       ymin=0,ymax=4,       samples=3,       enlarge x limits={abs=0.5},       bar width=0.6,       ybar stacked,       legend pos=south east,         every 

  0%|          | 0/2866316 [00:00<?, ?it/s]

2021-12-20 14:01:32 - Loaded 2866316 TEST Documents.
2021-12-20 14:01:32 - Doc Example: {'text': 'This Boston college professor who lives in #NH is on leave after being arrested for child pornography, endangerment:', 'title': ''}
2021-12-20 14:01:32 - Loading Queries...
2021-12-20 14:01:32 - Loaded 97 TEST Queries.
2021-12-20 14:01:32 - Query Example: VIDEO:Good Samaritans Stop Alleged Hit-and-Run Driver in Miami
2021-12-20 14:01:32 - 

2021-12-20 14:01:32 - NDCG@1: 0.6804
2021-12-20 14:01:32 - NDCG@3: 0.5734
2021-12-20 14:01:32 - NDCG@5: 0.5029
2021-12-20 14:01:32 - NDCG@10: 0.3900
2021-12-20 14:01:32 - NDCG@100: 0.2722
2021-12-20 14:01:32 - 

2021-12-20 14:01:32 - MAP@1: 0.0523
2021-12-20 14:01:32 - MAP@3: 0.1099
2021-12-20 14:01:32 - MAP@5: 0.1364
2021-12-20 14:01:32 - MAP@10: 0.1607
2021-12-20 14:01:32 - MAP@100: 0.1607
2021-12-20 14:01:32 - 

2021-12-20 14:01:32 - Recall@1: 0.0523
2021-12-20 14:01:32 - Recall@3: 0.1099
2021-12-20 14:01:32 - Recall@5: 0.1364
2021-12-20 14:01:32 - R

  0%|          | 0/594977 [00:00<?, ?it/s]

2021-12-20 14:01:48 - Loaded 594977 TEST Documents.
2021-12-20 14:01:48 - Doc Example: {'text': 'NEW ORLEANS — Whenever a Virginia Tech offensive coach is asked how the most prolific receiving duo in school history came to be, inevitably the first road game in 2008 against North Carolina comes up. Midway through the first quarter, Virginia Tech had to call two timeouts in a row because then-freshmen Jarrett Boykin and Danny Coale couldn’t seem to line up right, and “they had those big eyes out there looking around,” Kevin Sherman, their position coach, said recently. Now that Boykin and Coale have only Tuesday’s Sugar Bowl remaining before leaving Virginia Tech with every major school record for a wide receiver, they’ve taken a different stance. “I still don’t think that was on us. Macho [Harris] was in the game and he lined up wrong,” said Boykin, as Coale sat next to him nodding in agreement. Just add that to the list of slights these seniors have had to overcome. Boykin has been the

  0%|          | 0/14914714 [00:00<?, ?it/s]

2021-12-20 14:03:55 - Loaded 14914604 TEST Documents.
2021-12-20 14:03:56 - Doc Example: {'text': 'Depressive disorder is one of the most widespread forms of mental disorders which lead to a significant public health concern, such as disability, suicide, and so on. Its etiology remains vague but it is believed that depressive disorder is a multifactorial disease which is induced by the interaction of social, psychological, and biological factors. Thus, there is no clear and definite pathological theory could illustrate its mechanism independently until now, involving genetics, neuroimaging, neuroinflammation, neuroendocrine, and others. Comprehensive assessment to patients with depression is the starting point for a right diagnosis. History-taking of physical condition is as important as psychiatric interview and rational usage of scales would be beneficial for screening. There are many kinds of therapeutic measures for depressive patients nowadays, including general intervention, phar

  0%|          | 0/528155 [00:00<?, ?it/s]

2021-12-20 14:04:10 - Loaded 528155 TEST Documents.
2021-12-20 14:04:11 - Doc Example: {'text': '\n\nPOLITICIANS,  PARTY PREFERENCES \n\n   Summary:  Newspapers in the Former Yugoslav Republic of \n   Macedonia have published the results of opinion polls, \n   indicating the relative popularity of politicians, \n   political parties, and attitudes toward the political system. \n\n   The 22-23 January edition of the Skopje newspaper VECER in \nMacedonian published on pages 6-7 the results of an opinion poll \nconducted by the "BriMa" agency in November 1993. According to \nVECER, 1,036 respondents were classified by age and residence, but \nthe paper did not explain the methodology or give the margin of \nerror.  For the purpose of comparison, the paper cited the results \nof an unidentified poll made in May 1993. The approval/disapproval \nratings, in percent, for ten Macedonian politicians were: \n\n                                           November 1993    May 1993 \n\nKiro Gligorov