<a href="https://colab.research.google.com/github/Stern5497/Doc2docBeirIR/blob/main/Swiss_Doc2doc_IR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Doc2doc IR:  Information Retrieval models for Swiss Court Rulings**

@misc{rasiah2023scale,
      title={SCALE: Scaling up the Complexity for Advanced Language Model Evaluation},
      author={Vishvaksenan Rasiah and Ronja Stern and Veton Matoshi and Matthias Stürmer and Ilias Chalkidis and Daniel E. Ho and Joel Niklaus},
      year={2023},
      eprint={2306.09237},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


This notebook contains code from BEIR. https://github.com/beir-cellar


## References
We are using the BEIR package and the examples provided at:
  https://github.com/beir-cellar/beir/wiki/Examples-and-tutorials
  Nandan Thakur, Researcher @ UKP Lab, TU Darmstadt
  (https://nthakur.xyz) (nandant@gmail.com)



# **Run Experiments**

We have 4 different experiments:


**Lexical Retrieval using BM25 (elasticsearch)** run_lr_bm25(queries, qrels, corpus)

Each experiment uses a different dataset and one of two text-processing variants. The query text is always shortened to avoid errors. Optionally, stopwords are removed as well, so that more informative words fit into the shortened input.

*   LR-BM25_S_DE (Remove stopwords and shorten, German)
*   LR-BM25_DE (Shortened, German)
*   LR-BM25_S_FR (Remove stopwords and shorten, French)
*   LR-BM25_FR (Shortened, French)
*   LR-BM25_S_IT (Remove stopwords and shorten, Italian)
*   LR-BM25_IT (Shortened, Italian)
*   LR-BM25_S (Remove stopwords and shorten, Mixed)
*   LR-BM25 (Shortened, Mixed)
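
The two preprocessing variants can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: `preprocess_query` and the tiny `GERMAN_STOPWORDS` set are hypothetical, and the 500-character limit is an assumption taken from the commented-out `[:500]` slice in `remove_stopwords` further down. The notebook itself takes its stopword lists from NLTK.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses nltk.corpus.stopwords.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "in", "den", "von", "zu", "mit", "ist"}

def preprocess_query(text, stopwords=None, max_chars=500):
    """Optionally drop stopwords, then cut the text to at most max_chars characters."""
    # Replace every run of non-word characters with a single space, then tokenize.
    tokens = re.sub(r"\W+", " ", text).split()
    if stopwords is not None:
        tokens = [t for t in tokens if t.lower() not in stopwords]
    return " ".join(tokens)[:max_chars]

query = "Die Beschwerde ist von der Vorinstanz mit Urteil abgewiesen worden."
print(preprocess_query(query))                              # shortened only
print(preprocess_query(query, stopwords=GERMAN_STOPWORDS))  # stopwords removed, then shortened
```

Removing stopwords before cutting means that the characters that survive the cut carry more content words.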


**Multilingual Retrieval BM25 with Elasticsearch** run_mr_bm25(queries, qrels, corpus, language)

Specify the language to use. Each experiment uses a different dataset and one of two text-processing variants. The query text is always shortened to avoid errors. Optionally, stopwords are removed as well, so that more informative words fit into the shortened input.

*   MR-BM25_S_DE (Remove stopwords and shorten, German)
*   MR-BM25_DE (Shortened, German)
*   MR-BM25_S_FR (Remove stopwords and shorten, French)
*   MR-BM25_FR (Shortened, French)
*   MR-BM25_S_IT (Remove stopwords and shorten, Italian)
*   MR-BM25_IT (Shortened, Italian)
*   MR-BM25_S (Remove stopwords and shorten, Mixed)
*   MR-BM25 (Shortened, Mixed)

**Reranking BM25 using Cross-Encoder** run_rce_bm25(queries, qrels, corpus, model_name)

Here we use cross-encoders. A recently published multilingual model is cross-encoder/mmarco-mMiniLMv2-L12-H384-v1.
It may be interesting to compare results on single-language linked datasets, where query and corpus share one language, with results on a dataset whose corpus is multilingual.
Open question: do we need to shorten the input?

*  RCE_BM25_M (Mixed dataset)
*  RCE_BM25 (All languages but not mixed)
*  RCE_BM25_DE (German)
*  RCE_BM25_FR (French)
*  RCE_BM25_IT (Italian)
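
The reranking step itself is independent of the model: take the top-k hits of the first stage and replace their scores with scores computed on (query, document) pairs. The sketch below shows only that control flow. `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a crude token-overlap stand-in for a real cross-encoder, used so the example runs without downloading a model.

```python
def rerank(queries, corpus, results, score_pair, top_k=100):
    """Re-score the top_k first-stage hits of every query with score_pair(query_text, doc_text)."""
    reranked = {}
    for query_id, hits in results.items():
        # Keep only the top_k candidates of the first-stage (e.g. BM25) ranking.
        candidates = sorted(hits, key=hits.get, reverse=True)[:top_k]
        reranked[query_id] = {
            doc_id: score_pair(queries[query_id], corpus[doc_id]["text"])
            for doc_id in candidates
        }
    return reranked

def overlap_score(query_text, doc_text):
    """Stand-in for a cross-encoder (assumption): fraction of query tokens found in the document."""
    query_tokens = set(query_text.lower().split())
    doc_tokens = set(doc_text.lower().split())
    return len(query_tokens & doc_tokens) / max(len(query_tokens), 1)

corpus = {
    "d1": {"title": "", "text": "Urteil zum Mietrecht"},
    "d2": {"title": "", "text": "Urteil zum Strafrecht und Strafmass"},
}
queries = {"q1": "Strafmass im Strafrecht"}
bm25_results = {"q1": {"d1": 7.2, "d2": 6.9}}  # first-stage scores, same shape as qrels

print(rerank(queries, corpus, bm25_results, overlap_score, top_k=2))
```

Because only the first-stage candidates are re-scored, a relevant document that BM25 misses can never be recovered by the reranker.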

**Rerank sBert with bm25** run_sb_r(queries, qrels, corpus, model_name)

We want to compare different multilingual sBert models. We distinguish between a dataset with mixed links (query and corpus text in different languages) and one in which query and corpus text share the same language.

> We will use two different multilingual models: "distiluse-base-multilingual-cased" (or named "distiluse-base-multilingual-cased-v1") and "paraphrase-albert-small-v2"


*  SB_R_M1_M
*  SB_R_M1_
*  SB_R_M2_M
*  SB_R_M2_
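
For the sBert variant, candidates are ordered by the cosine similarity between a query embedding and each document embedding. A minimal sketch with hand-written toy vectors follows; the three-dimensional vectors are an assumption for illustration only, since a real sentence-embedding model returns vectors with hundreds of dimensions.

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two equally long vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "embeddings" standing in for model output.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {"d1": [1.0, 0.0, 1.0], "d2": [0.0, 1.0, 0.0], "d3": [1.0, 1.0, 0.0]}

# Rank the candidate documents by similarity to the query, best first.
ranking = sorted(doc_vecs, key=lambda doc_id: cos_sim(query_vec, doc_vecs[doc_id]), reverse=True)
print(ranking)
```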












In [None]:
# Install the beir, datasets, tensorflow-text and nltk packages from PyPI
!pip install beir
!pip install datasets
!pip install tensorflow-text
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# **Data Loading**

In [None]:
import ast
import gc
import json
import os, pathlib
import random
import re

import datasets
import nltk
import pandas as pd
from datasets import load_dataset
from nltk.corpus import stopwords
from tqdm import tqdm

nltk.download('stopwords')


class ProcessData:

    def __init__(self):
        self.counter = 0

    def load_from_hf(self, name):
        dataset = load_dataset(name)
        dataset = dataset['train']
        return dataset

    def get_data(self):
        qrel_dataset = self.load_from_hf('Stern5497/qrel')
        querie_dataset = self.load_from_hf("Stern5497/querie")
        corpus_dataset = self.load_from_hf("Stern5497/corpus")
        querie_dataset_train, querie_dataset_test = self.create_splits(querie_dataset)
        return querie_dataset, querie_dataset_train, querie_dataset_test, qrel_dataset, corpus_dataset

    def create_splits(self, querie_dataset):
        splits = querie_dataset.train_test_split(test_size=0.9)
        querie_dataset_train = splits['train']
        querie_dataset_test = splits['test']
        return querie_dataset_train, querie_dataset_test

    def create_corpus_dict(self, corpus_dataset):
        corpus_dict = {}
        def write_dict(row):
            id = str(row['id'])
            corpus_dict[id] = {"title": '', "text": row['text']}
            return row

        corpus_dataset.apply(write_dict, axis="columns")

        return corpus_dict

    def create_qrels_dict(self, qrels_dataset, corpus_dict):
        qrels_dict = {}

        def write_qrels(row):
            if row['corp_id'] in corpus_dict:
                if row['id'] not in qrels_dict:
                    qrels_dict[row['id']] = {row['corp_id']: 1}
                else:
                    qrels_dict[row['id']][row['corp_id']] = 1

        qrel_dataset = qrels_dataset.apply(write_qrels, axis='columns')

        return qrels_dict

    def create_query_dict(self, queries_dataset, qrels_dict):
        queries_dict = {}

        def write_dict(row):
            id = row['id']
            if id in qrels_dict:
                text = row['text']
                queries_dict[id] = str(text)

        queries_dataset.apply(write_dict, axis="columns")
        return queries_dict

    def create_data_dicts(self, querie_dataset, qrel_dataset, corpus_dataset):
        corpu = {}

        def write_corpus(row):
            corpu[row['id']] = {'text': row['text'], 'title': ""}
            return row

        corpus_dataset = corpus_dataset.map(write_corpus)
        print(f"Corpus: {len(corpu.items())}")

        querie_tmp = {}

        def write_queries(row):
            id = row['id']
            text = row['text']
            querie_tmp[id] = str(text)
            return row

        querie_dataset = querie_dataset.map(write_queries, keep_in_memory=True)
        print(f"Query before cleaning: {len(querie_tmp.items())}")

        qrel = {}
        ids = []
        def write_qrels(row):
            # only use qrels with valid query
            if row['id'] in querie_tmp:
                if row['id'] not in qrel:
                    qrel[row['id']] = {row['corp_id']: 1}
                    ids.append(row['id'])
                else:
                    qrel[row['id']][row['corp_id']] = 1
                    ids.append(row['id'])
            return row
        qrel_dataset = qrel_dataset.map(write_qrels)
        print(f"Qrels: {len(qrel.items())}")

        query = {}

        # every id in `ids` is also a key of `qrel`, so test the dict instead of scanning the list
        for key, value in querie_tmp.items():
            if key in qrel:
                query[key] = value

        print(f"Query: {len(query.items())}")

        return query, qrel, corpu, querie_dataset, qrel_dataset, corpus_dataset

    def remove_stopwords(self, queries, corpus, language_long):

        sw_list = stopwords.words(language_long)
        print(sw_list)
        # a set makes the per-word membership test O(1)
        sw_nltk = set(sw_list)

        queries_filtered = {}

        for key, value in queries.items():
            # raw string: '\W' in a normal string literal is an invalid escape sequence
            value = re.sub(r'\W+', ' ', str(value))
            words = [word for word in value.split() if word.lower() not in sw_nltk]
            new_value = " ".join(words)
            # new_value = new_value[:500]
            queries_filtered[key] = new_value

        corpus_filtered = {}

        for key, value in corpus.items():
            text = re.sub(r'\W+', ' ', str(value['text']))
            words = [word for word in text.split() if word.lower() not in sw_nltk]
            new_value = " ".join(words)
            # new_value = new_value[:500]
            # build a new entry instead of overwriting the text of the input corpus
            corpus_filtered[key] = {**value, 'text': new_value}

        print("Removed stopwords from queries and corpus")

        return queries_filtered, corpus_filtered

    def create_subset(self, n, queries, qrels, corpus):
        print("Start creating subsets")
        counter = 0
        qrels_subset = {}
        queries_subset = {}
        corpus_subset = {}

        print(len(queries.items()))
        print(len(qrels.items()))
        print(len(corpus.items()))
        short = 0

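        for key, value in qrels.items():
            # keep the query if at least one of its documents has usable text;
            # resetting the flag inside the inner loop would only remember the last document
            found = False
            for corp_id, _ in value.items():
                if len(corpus[corp_id]['text']) > 10: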
                    found = True
                    corpus_subset[corp_id] = corpus[corp_id]
                else:
                    short = short + 1
            if found:
                qrels_subset[key] = value
                queries_subset[key] = queries[key]
            counter += 1
            if counter > n:
                break

        print(f"create subset of {n} queries.")
        print(f"Found {short} entries in corpus with short text.")

        return queries_subset, qrels_subset, corpus_subset

    def create_splits_old(self, n, queries, qrels, corpus):
        print("Start creating splits")
        counter = 0
        qrels_train = {}
        queries_train = {}
        qrels_test = {}
        queries_test = {}

        print(f"We have {len(queries)} queries in total. {n * len(queries)} will be used for train, the rest for test.")
        print(f"We have {len(qrels)} qrels")
        print(f"We have {len(corpus)} corpus")
        missing_corpus = 0
        missing_qrel_train = 0
        missing_qrel_test = 0

        for key, value in queries.items():
            if counter < n * len(queries):
                if key in qrels:
                    # assert all corp_ids exist in corpus!
                    queries_train[key] = value
                    qrels_train[key] = qrels[key]
                    counter = counter + 1
                    for corp_id in qrels[key].keys():
                        if corp_id not in corpus:
                            missing_corpus = missing_corpus + 1
                else:
                    missing_qrel_train = missing_qrel_train + 1
            elif counter >= n * len(queries):
                if key in qrels:
                    # assert all corp_ids exist in corpus!
                    queries_test[key] = value
                    qrels_test[key] = qrels[key]
                    counter = counter + 1
                    for corp_id in qrels[key].keys():
                        if corp_id not in corpus:
                            missing_corpus = missing_corpus + 1
                else:
                    missing_qrel_test = missing_qrel_test + 1

        print(f"created splits.")
        print(f"Found {missing_corpus} missing entries in corpus.")
        print(f"We have {len(qrels_train)} train qrels")
        print(f"We have {len(qrels_test)} test qrels")
        print(f"We have {len(queries_train)} train queries")
        print(f"We have {len(queries_test)} test queries")
        return qrels_train, queries_train, qrels_test, queries_test



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
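
The `ProcessData` methods above all produce the three dictionary shapes that BEIR expects: `corpus` maps a document id to a dict with `title` and `text`, `queries` maps a query id to its text, and `qrels` maps a query id to a dict of relevant document ids. The sketch below shows these shapes on toy data together with a consistency filter. `clean_qrels` is a hypothetical helper written for this illustration and is not part of the class.

```python
# Toy data in the three shapes used throughout this notebook.
corpus = {
    "c1": {"title": "", "text": "Text of a cited ruling"},
    "c2": {"title": "", "text": "Text of another ruling"},
}
queries = {"q1": "Text of the citing ruling", "q2": "Another citing ruling"}
# "c9" and "q3" do not exist above; they are dangling on purpose.
qrels = {"q1": {"c1": 1, "c9": 1}, "q3": {"c2": 1}}

def clean_qrels(queries, qrels, corpus):
    """Keep only judgments whose query and document both exist."""
    cleaned = {}
    for query_id, docs in qrels.items():
        if query_id not in queries:
            continue
        kept = {doc_id: rel for doc_id, rel in docs.items() if doc_id in corpus}
        if kept:
            cleaned[query_id] = kept
    return cleaned

cleaned = clean_qrels(queries, qrels, corpus)
print(cleaned)
```

Dangling ids of this kind are one plausible source of a `KeyError` when a function later looks up `corpus[doc_id]` or `queries[query_id]`.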


In [None]:
def get_formatted_data(filter_language):
    process_data = ProcessData()
    querie_dataset, querie_dataset_train, querie_dataset_test, qrels_dataset, corpus_dataset = process_data.get_data()

    print(querie_dataset_train)
    print(querie_dataset_test)
    print(qrels_dataset)
    print(corpus_dataset)

    # corpus stays always the same
    corpus_dict = process_data.create_corpus_dict(pd.DataFrame(corpus_dataset))

    # qrels are split into mixed, and single languages
    qrels_dataset_ssl = qrels_dataset.filter(lambda row: row['language'] == filter_language)
    qrels_dict = process_data.create_qrels_dict(pd.DataFrame(qrels_dataset), corpus_dict)
    qrels_dict_ssl = process_data.create_qrels_dict(pd.DataFrame(qrels_dataset_ssl), corpus_dict)

    # we always split so results are comparable
    if filter_language != 'mixed':
        querie_dataset_test = querie_dataset_test.filter(lambda row: row['language'] == filter_language)
    queries_dict_train = process_data.create_query_dict(pd.DataFrame(querie_dataset_train), qrels_dict)
    queries_dict_test = process_data.create_query_dict(pd.DataFrame(querie_dataset_test), qrels_dict)

    query_adapted = {}
    for id in queries_dict_test.keys():
        if id in qrels_dict_ssl:
            # make_sure_this_is_done = 'XXX'
            query_adapted[id] = queries_dict_test[id]

    print("####################################################################################################")
    print("Successfully loaded data.")
    print("####################################################################################################")

    print(len(corpus_dict))
    print(len(qrels_dict))
    print(len(qrels_dict_ssl))
    print(len(queries_dict_train))
    print(len(queries_dict_test))

    return queries_dict_train, queries_dict_test, qrels_dict, query_adapted, corpus_dict

# queries_dict_train, queries_dict_test, qrels_dict, qrels_dict_ssl, corpus_dict = get_formatted_data('it')
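
Results are only comparable across runs if every run sees the same train/test split. `create_splits` calls `train_test_split(test_size=0.9)` without a seed, so the split changes on each run; passing the `seed` argument of `Dataset.train_test_split` is one way to fix it. An alternative that needs no stored state is to derive the split from the query id itself. The sketch below illustrates that idea; `assign_split` is a hypothetical helper and the 0.9 test fraction mirrors the value used above.

```python
import hashlib

def assign_split(query_id, test_fraction=0.9):
    """Deterministically map an id to 'test' or 'train' using a hash of the id."""
    digest = hashlib.sha256(str(query_id).encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"

example_ids = [f"q{i}" for i in range(10)]
print({query_id: assign_split(query_id) for query_id in example_ids})
```

With this scheme a query keeps its split even when the dataset is filtered by language or reloaded in a new session.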

# **Models**

**Lexical Retrieval using BM25 (Elasticsearch)**


In [None]:
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval
import random

def run_lr_bm25(queries, qrels, corpus):

  #### Provide parameters for elastic-search
  hostname = "localhost"
  index_name = "scifact"
  initialize = True # True, will delete existing index with same name and reindex all documents

  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
  retriever = EvaluateRetrieval(model)

  #### Retrieve BM25 (lexical) results (format of results is identical to qrels)
  results = retriever.retrieve(corpus, queries)

  #### Evaluate your retrieval using NDCG@k, MAP@K ...
  ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
  recall_cap = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="recall_cap")
  print(ndcg)
  print(_map)
  print(recall)
  print(precision)
  print(recall_cap)
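
The function prints both plain recall and the `recall_cap` custom metric. The difference matters here because a ruling can cite more documents than fit into the top k. The sketch below computes both for a single query; to our understanding, BEIR's capped recall divides by `min(number of relevant documents, k)` and then averages over queries, but the helper names are hypothetical and this is not BEIR's code.

```python
def recall_at_k(relevant, ranked_doc_ids, k):
    """Share of all relevant documents that appear in the top k."""
    hits = len(set(ranked_doc_ids[:k]) & relevant)
    return hits / len(relevant)

def capped_recall_at_k(relevant, ranked_doc_ids, k):
    """Like recall, but the denominator is capped at k, the most hits a top-k list can hold."""
    hits = len(set(ranked_doc_ids[:k]) & relevant)
    return hits / min(len(relevant), k)

relevant = {"a", "b", "c", "d"}          # four relevant documents for one query
ranking = ["a", "x", "b", "y", "c", "z"]  # system output, best first

print(recall_at_k(relevant, ranking, k=2))         # 1 hit out of 4 relevant
print(capped_recall_at_k(relevant, ranking, k=2))  # 1 hit out of at most 2 possible
```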

**Multilingual Retrieval BM25 with Elasticsearch**


In [None]:

import logging
import random

def run_mr_bm25(queries, qrels, corpus, language):

  hostname = "localhost"
  index_name = "scifact"
  initialize = True # True, will delete existing index with same name and reindex all documents

  #### Language ####
  # For languages supported by Elasticsearch by default, check here ->
  # https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
  # Please provide full names in lowercase for eg. english, hindi ...

  #### Sharding ####
  # (1) For datasets with small corpus (datasets ~ < 5k docs) => limit shards = 1
  # number_of_shards = 1
  # model = BM25(index_name=index_name, hostname=hostname, language=language, initialize=initialize, number_of_shards=number_of_shards)

  # (2) For datasets with big corpus ==> keep default configuration
  model = BM25(index_name=index_name, hostname=hostname, language=language, initialize=initialize)
  retriever = EvaluateRetrieval(model)

  #### Retrieve BM25 (lexical) results (format of results is identical to qrels)
  results = retriever.retrieve(corpus, queries)

  #### Evaluate your retrieval using NDCG@k, MAP@K ...
  logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
  ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
  recall_cap = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="recall_cap")
  print(ndcg)
  print(_map)
  print(recall)
  print(precision)
  print(recall_cap)

  #### Retrieval Example ####
  query_id, scores_dict = random.choice(list(results.items()))
  logging.info("Query : %s\n" % queries[query_id])

  scores = sorted(scores_dict.items(), key=lambda item: item[1], reverse=True)
  for rank in range(min(10, len(scores))):  # a query may have fewer than 10 hits
      doc_id = scores[rank][0]
      logging.info("Doc %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))
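
`run_mr_bm25` expects the full lowercase language name, while the datasets are filtered with two-letter codes. A small lookup keeps the two in sync. The mapping and the `analyzer_for` helper below are hypothetical; the three names are the ones used in the experiment cells of this notebook and are built-in Elasticsearch language analyzers.

```python
# Hypothetical mapping from the dataset's language codes to Elasticsearch analyzer names.
ANALYZER_BY_CODE = {"de": "german", "fr": "french", "it": "italian"}

def analyzer_for(language_code):
    """Return the analyzer name for a language code, or None when no single language applies."""
    return ANALYZER_BY_CODE.get(language_code)

for code in ("de", "fr", "it", "mixed"):
    analyzer = analyzer_for(code)
    if analyzer is None:
        print(f"{code}: no single language analyzer, use run_lr_bm25")
    else:
        print(f"{code}: run_mr_bm25(..., language={analyzer!r})")
```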


**Reranking BM25 using Cross-Encoder**

In this example, we rerank the top-100 documents retrieved by BM25 with the SBERT cross-encoder passed in as `model_name`, for example [cross-encoder/ms-marco-electra-base](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html) or the multilingual cross-encoder/mmarco-mMiniLMv2-L12-H384-v1.

In [None]:
import logging
import random

from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

def run_rce_bm25(queries, qrels, corpus, model_name):

  hostname = "localhost"
  index_name = "scifact"
  initialize = True # True, will delete existing index with same name and reindex all documents

  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
  retriever = EvaluateRetrieval(model)

  #### Retrieve BM25 (lexical) results (format of results is identical to qrels)
  results = retriever.retrieve(corpus, queries)

  ################################################
  #### (2) RERANK Top-100 docs using Cross-Encoder
  ################################################

  #### Reranking using Cross-Encoder models #####
  #### https://www.sbert.net/docs/pretrained_cross-encoders.html
  # 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
  cross_encoder_model = CrossEncoder(model_name)

  #### Or use MiniLM, TinyBERT etc. CE models (https://www.sbert.net/docs/pretrained-models/ce-msmarco.html)
  # cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
  # cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')

  reranker = Rerank(cross_encoder_model, batch_size=128)

  # Rerank top-100 results using the reranker provided
  rerank_results = reranker.rerank(corpus, queries, results, top_k=100)

  #### Evaluate the reranked results using NDCG@k, MAP@K ...
  ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, rerank_results, retriever.k_values)
  # capped recall is computed on the reranked results too, so all metrics describe the same ranking
  recall_cap = retriever.evaluate_custom(qrels, rerank_results, retriever.k_values, metric="recall_cap")
  print(ndcg)
  print(_map)
  print(recall)
  print(precision)
  print(recall_cap)

  #### Print top-k documents retrieved ####
  top_k = 10

  query_id, ranking_scores = random.choice(list(rerank_results.items()))
  scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
  logging.info("Query : %s\n" % queries[query_id])

  for rank in range(min(top_k, len(scores_sorted))):  # a query may have fewer than top_k hits
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] Body
    logging.info("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

**Rerank sbert with bm25**

In [None]:
from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval import models

import pathlib, os
import logging
import random

def run_sb_r(queries, qrels, corpus, model_name):

  hostname = "localhost"
  index_name = "scifact"
  initialize = True # True, will delete existing index with same name and reindex all documents

  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
  retriever = EvaluateRetrieval(model)

  #### Retrieve BM25 (lexical) results (format of results is identical to qrels)
  results = retriever.retrieve(corpus, queries)

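  #### Reranking top-100 docs using Dense Retriever model
  # distiluse-base-multilingual-cased
  # distiluse-base-multilingual-cased-v1
  # paraphrase-albert-small-v2
  # Import locally under an alias: a later cell rebinds the global name `models`
  # to sentence_transformers.models, which has no SentenceBERT wrapper.
  from beir.retrieval import models as beir_models
  model = DRES(beir_models.SentenceBERT(model_name), batch_size=128)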
  dense_retriever = EvaluateRetrieval(model, score_function="cos_sim", k_values=[1,3,5,10,100])

  #### Retrieve dense results (format of results is identical to qrels)
  rerank_results = dense_retriever.rerank(corpus, queries, results, top_k=100)

  #### Evaluate your retrieval using NDCG@k, MAP@K ...
  # ndcg, _map, recall, precision, hole = dense_retriever.evaluate(qrels, rerank_results, retriever.k_values)
  ndcg, _map, recall, precision = dense_retriever.evaluate(qrels, rerank_results, retriever.k_values)
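  # capped recall is computed on the reranked results too, so all metrics describe the same ranking
  recall_cap = retriever.evaluate_custom(qrels, rerank_results, retriever.k_values, metric="recall_cap")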
  print(ndcg)
  print(_map)
  print(recall)
  print(precision)
  print(recall_cap)

  #### Print top-k documents retrieved ####
  top_k = 10

  query_id, ranking_scores = random.choice(list(rerank_results.items()))
  scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
  logging.info("Query : %s\n" % queries[query_id])

  for rank in range(min(top_k, len(scores_sorted))):  # a query may have fewer than top_k hits
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] Body
    logging.info("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

**Train S-Bert hard neg**

In [None]:


from sentence_transformers import losses, models, SentenceTransformer
from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.train import TrainRetriever
import pathlib, os, tqdm
import logging
import json

def create_triplets(corpus, queries, qrels):

  hostname = "localhost"
  index_name = "scifact"
  initialize = True # True, will delete existing index with same name and reindex all documents

  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
  bm25 = EvaluateRetrieval(model)

  print("successful until retrieve corpus")

  #### Index passages into the index (separately)
  bm25.retriever.index(corpus)

  triplets = []
  qids = list(qrels)
  print(qids[0])
  hard_negatives_max = 10

  #### Retrieve BM25 hard negatives => Given a positive document, find most similar lexical documents
  for idx in tqdm.tqdm(range(len(qids)), desc="Retrieve Hard Negatives using BM25"):
    # the following lines were adapted because of a KeyError
    query_id = qids[idx]
    query_text = queries[qids[idx]]
    pos_docs = [doc_id for doc_id in qrels[query_id] if qrels[query_id][doc_id] > 0]
    pos_doc_texts = [corpus[doc_id]["title"] + " " + corpus[doc_id]["text"] for doc_id in pos_docs]
    hits = bm25.retriever.es.lexical_multisearch(texts=pos_doc_texts, top_hits=hard_negatives_max+1)
    print(hits)
    for (pos_text, hit) in zip(pos_doc_texts, hits):
        for (neg_id, _) in hit.get("hits"):
            if neg_id not in pos_docs:
                neg_text = corpus[neg_id]["title"] + " " + corpus[neg_id]["text"]
                triplets.append([query_text, pos_text, neg_text])

  print("created triplets")
  with open('triplets.json', 'w', encoding='utf-8') as f:
    json.dump(triplets, f, ensure_ascii=False, indent=4)
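
The triplet logic of `create_triplets` can be followed without a running Elasticsearch instance. In the sketch below the BM25 lookup is replaced by a precomputed dict, `hits_per_positive`, that maps each positive document to the ids of its most similar documents. Both that dict and the `build_triplets` helper are hypothetical stand-ins introduced for illustration.

```python
def build_triplets(query_text, pos_docs, hits_per_positive, corpus, max_negatives=10):
    """Pair each positive document with lexically similar documents that are not relevant."""
    triplets = []
    for pos_id in pos_docs:
        pos_text = corpus[pos_id]["text"]
        # A hit is a hard negative only if it is not itself a positive for this query.
        negatives = [doc_id for doc_id in hits_per_positive[pos_id] if doc_id not in pos_docs]
        for neg_id in negatives[:max_negatives]:
            triplets.append([query_text, pos_text, corpus[neg_id]["text"]])
    return triplets

corpus = {
    "c1": {"title": "", "text": "ruling on rent reduction"},
    "c2": {"title": "", "text": "ruling on rent increase"},
    "c3": {"title": "", "text": "ruling on rent deposit"},
}
hits_per_positive = {"c1": ["c1", "c2", "c3"]}  # the positive usually retrieves itself first

triplets = build_triplets("appeal about rent reduction", ["c1"], hits_per_positive, corpus)
for triplet in triplets:
    print(triplet)
```

Each triplet is `[query, positive, hard negative]`, the same layout that `create_triplets` writes to `triplets.json`.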



# **Set up Elasticsearch**

## 1. Download and setup the Elasticsearch instance
Reference: https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/elasticsearch.ipynb

For demo purposes, the open-source version of the elasticsearch package is used.

In [None]:
%%bash

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512

elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK


Run the instance as a daemon process


In [None]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

In [None]:
import time

# Sleep for few seconds to let the instance start.
time.sleep(20)
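
A fixed 20-second sleep can be too short on a slow machine and wastes time on a fast one. One alternative is to poll the HTTP endpoint until it answers. The sketch below assumes the default address `http://localhost:9200/` used by the `curl` check further down; `wait_for_elasticsearch` and its `probe` parameter are hypothetical names, and `probe` exists only so the loop can be tried without a running server.

```python
import time
import urllib.error
import urllib.request

def wait_for_elasticsearch(url="http://localhost:9200/", timeout=60.0, interval=2.0, probe=None):
    """Poll until the server answers or `timeout` seconds have passed; return True on success."""
    def http_probe():
        try:
            with urllib.request.urlopen(url, timeout=max(interval, 1.0)) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the instance is not up yet.
            return False

    probe = probe or http_probe
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# Usage in the notebook (not executed here):
# if not wait_for_elasticsearch():
#     raise RuntimeError("Elasticsearch did not start in time")
```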

Once the instance has been started, grep for ``elasticsearch`` in the processes list to confirm the availability.

In [None]:
%%bash

ps -ef | grep elasticsearch

root       13952   13950  0 09:57 ?        00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
daemon     13953   13952 10 09:57 ?        00:05:21 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-15157512740479983317 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:fileco

In [None]:
%%bash

curl -sX GET "localhost:9200/"

{
  "name" : "5c892e65cf69",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "oYJ7HPjxSTGcTbuOJf-x_A",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


# **Experiments**


In [None]:
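# We use the train queries only to create triplets.
# Note: the fourth return value is the language-filtered test query dict (`query_adapted`), not a qrels dict.
queries_dict_train, queries_dict_test, qrels_dict, qrels_dict_ssl, corpus_dict = get_formatted_data('mixed')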

In [None]:
# keep only the qrels that belong to the training queries
qrels_adapted = {}
for id in queries_dict_train.keys():
    qrels_adapted[id] = qrels_dict[id]

In [None]:
print(len(qrels_adapted))

11246


In [None]:
create_triplets(corpus_dict, queries_dict_train, qrels_adapted)

In [None]:
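# test lexical retrieval; the qrels must cover the test queries, not the train queries
qrels_test = {qid: qrels_dict[qid] for qid in queries_dict_test}
run_lr_bm25(queries_dict_test, qrels_test, corpus_dict)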

  0%|          | 0/9897 [00:00<?, ?docs/s]
que:   0%|          | 0/791 [00:21<?, ?it/s]


KeyError: ignored

# **Results**

create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

In [None]:
querie_dataset, qrel_dataset, corpus_dataset = load_from_hf()



Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 72344
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 674296
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 18177
})


In [None]:
removed_querie_dataset = querie_dataset.filter(lambda row: len(row['text']) <= 3)
querie_dataset = querie_dataset.filter(lambda row: len(row['text']) > 3)
print(querie_dataset)


Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 62557
})


In [None]:
sll = False
querie_language = 'mixed'
stopwords_remove = False
qrel_language = 'mixed'
split = False

process_data = ProcessData()
# get_data() returns five values; the train/test query splits are not needed here
querie_dataset, _, _, qrel_dataset, corpus_dataset = process_data.get_data()

data = {}

# filter for languages
data['corpus'] = corpus_dataset
data['query'] = {'de': querie_dataset.filter(lambda x: x['language'] == 'de')}
data['query']['fr'] = querie_dataset.filter(lambda x: x['language'] == 'fr')
data['query']['it'] = querie_dataset.filter(lambda x: x['language'] == 'it')
data['query']['mixed'] = querie_dataset.filter(lambda x: x['language'] == 'mixed')
data['qrel'] = {'de': qrel_dataset.filter(lambda x: x['language'] == 'de')}
data['qrel']['fr'] = qrel_dataset.filter(lambda x: x['language'] == 'fr')
data['qrel']['it'] = qrel_dataset.filter(lambda x: x['language'] == 'it')
data['qrel']['mixed'] = qrel_dataset.filter(lambda x: x['language'] == 'mixed')
print("After filtering for language")
print(data.keys())

# create Train Test split (only for mixed dataset)
queries, qrels, corpus, _, __, ___ = process_data.create_data_dicts(data['query']['mixed'], data['qrel']['mixed'], data['corpus'])


In [None]:
qrels_test_mixed, queries_test_mixed = process_data.create_sll(data['qrel']['mixed'], qrels, queries, corpus)
qrels_test_de, queries_test_de = process_data.create_sll(data['qrel']['de'], qrels, queries, corpus)
qrels_test_fr, queries_test_fr = process_data.create_sll(data['qrel']['fr'], qrels, queries, corpus)
qrels_test_it, queries_test_it = process_data.create_sll(data['qrel']['it'], qrels, queries, corpus)

In [None]:
from datasets import concatenate_datasets, load_dataset
use_subset = True
subset_length = 10000
remove_stopwords = True
language_long = 'german'

querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')
querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')
querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')

querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])
qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])
corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])

querie_dataset_p = querie_dataset_p.shuffle(seed=42)
qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)
corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)

queries, qrels, corpus = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)

In [None]:
run_lr_bm25(queries, qrels, corpus)
create_triplets(corpus, queries, qrels)

# Lexical Retrieval for monolingual datasets with stopword removal

In [None]:
### IT ###
language = 'it'
language_long = 'italian'
remove_stopwords = False
use_subset = False
subset_length = 0
# store the result in `corpus`, the name the evaluation cell below passes to run_lr_bm25
queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 1873
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 4882
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 248
})
Finished filtering for language


1873
248
1873
finished creating dictionaries
['ad', 'al', 'allo', 'ai', 'agli', 'all', 'agl', 'alla', 'alle', 'con', 'col', 'coi', 'da', 'dal', 'dallo', 'dai', 'dagli', 'dall', 'dagl', 'dalla', 'dalle', 'di', 'del', 'dello', 'dei', 'degli', 'dell', 'degl', 'della', 'delle', 'in', 'nel', 'nello', 'nei', 'negli', 'nell', 'negl', 'nella', 'nelle', 'su', 'sul', 'sullo', 'sui', 'sugli', 'sull', 'sugl', 'sulla', 'sulle', 'per', 'tra', 'contro', 'io', 'tu', 'lui', 'lei', 'noi', 'voi', 'loro', 'mio', 'mia', 'miei', 'mie', 'tuo', 'tua', 'tuoi', 'tue', 'suo', 'sua', 'suoi', 'sue', 'nostro', 'nostra', 'nostri', 'nostre', 'vostro', 'vostra', 'vostri', 'vostre', 'mi', 'ti', 'ci', 'vi', 'lo', 'la', 'li', 'le', 'gli', 'ne', 'il', 'un', 'uno', 'una', 'ma', 'ed', 'se', 'perché', 'anche', 'come', 'dov', 'dove', 'che', 'chi', 'cui', 'non', 'più', 'quale', 'quanto', 'quanti', 'quanta', 'quante', 'quello', 'quelli', 'quella', 'quelle', 'questo', 'questi', 'questa', 'queste', 'si', 'tutto', 'tutti', 'a', 'c', 'e', '

In [None]:
print("#######################################################################################################")
print("LR-BM25_IT")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("MR-BM25_IT")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_IT


  0%|          | 0/248 [00:00<?, ?docs/s]
que: 100%|██████████| 15/15 [00:31<00:00,  2.09s/it]


{'NDCG@1': 0.062, 'NDCG@3': 0.0556, 'NDCG@5': 0.05578, 'NDCG@10': 0.06314, 'NDCG@100': 0.14398, 'NDCG@1000': 0.25371}
{'MAP@1': 0.02395, 'MAP@3': 0.03475, 'MAP@5': 0.03742, 'MAP@10': 0.0409, 'MAP@100': 0.05172, 'MAP@1000': 0.06007}
{'Recall@1': 0.02395, 'Recall@3': 0.04495, 'Recall@5': 0.05543, 'Recall@10': 0.07595, 'Recall@100': 0.40679, 'Recall@1000': 0.99767}
{'P@1': 0.062, 'P@3': 0.04328, 'P@5': 0.03253, 'P@10': 0.02195, 'P@100': 0.01182, 'P@1000': 0.00288}
#######################################################################################################
MR-BM25_IT


  0%|          | 0/248 [00:00<?, ?docs/s]
que: 100%|██████████| 15/15 [00:21<00:00,  1.44s/it]


{'NDCG@1': 0.05835, 'NDCG@3': 0.05583, 'NDCG@5': 0.05796, 'NDCG@10': 0.06482, 'NDCG@100': 0.15248, 'NDCG@1000': 0.25566}
{'MAP@1': 0.02231, 'MAP@3': 0.03447, 'MAP@5': 0.03816, 'MAP@10': 0.04123, 'MAP@100': 0.05286, 'MAP@1000': 0.06082}
{'Recall@1': 0.02231, 'Recall@3': 0.0466, 'Recall@5': 0.0601, 'Recall@10': 0.08027, 'Recall@100': 0.44213, 'Recall@1000': 0.99767}
{'P@1': 0.05835, 'P@3': 0.04449, 'P@5': 0.03559, 'P@10': 0.02334, 'P@100': 0.01279, 'P@1000': 0.00288}


In [None]:
### FR ###
language = 'fr'
language_long = 'french'
remove_stopwords = True
use_subset = False
subset_length = 0
queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

Filter:   0%|          | 0/72344 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 12164
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 63203
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 2379
})
Finished filtering for language


Map:   0%|          | 0/63203 [00:00<?, ? examples/s]

12164


Map:   0%|          | 0/2379 [00:00<?, ? examples/s]

2379


Map:   0%|          | 0/12164 [00:00<?, ? examples/s]

12164
finished creating dictionaries
['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'au

In [None]:
print("#######################################################################################################")
print("LR-BM25_FR")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("MR-BM25_FR")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_FR


  0%|          | 0/2379 [00:00<?, ?docs/s]
que: 100%|██████████| 96/96 [14:54<00:00,  9.32s/it]


{'NDCG@1': 0.06465, 'NDCG@3': 0.0549, 'NDCG@5': 0.0522, 'NDCG@10': 0.05596, 'NDCG@100': 0.09514, 'NDCG@1000': 0.1701}
{'MAP@1': 0.0136, 'MAP@3': 0.02177, 'MAP@5': 0.02477, 'MAP@10': 0.02818, 'MAP@100': 0.03448, 'MAP@1000': 0.03721}
{'Recall@1': 0.0136, 'Recall@3': 0.03061, 'Recall@5': 0.04111, 'Recall@10': 0.06049, 'Recall@100': 0.18197, 'Recall@1000': 0.57429}
{'P@1': 0.06465, 'P@3': 0.04893, 'P@5': 0.04026, 'P@10': 0.03054, 'P@100': 0.00995, 'P@1000': 0.00304}
#######################################################################################################
MR-BM25_FR


  0%|          | 0/2379 [00:00<?, ?docs/s]
que: 100%|██████████| 96/96 [12:28<00:00,  7.80s/it]


{'NDCG@1': 0.06439, 'NDCG@3': 0.05416, 'NDCG@5': 0.05216, 'NDCG@10': 0.05614, 'NDCG@100': 0.09538, 'NDCG@1000': 0.1727}
{'MAP@1': 0.01358, 'MAP@3': 0.02159, 'MAP@5': 0.02472, 'MAP@10': 0.02821, 'MAP@100': 0.03451, 'MAP@1000': 0.03732}
{'Recall@1': 0.01358, 'Recall@3': 0.03026, 'Recall@5': 0.04129, 'Recall@10': 0.06103, 'Recall@100': 0.18304, 'Recall@1000': 0.58918}
{'P@1': 0.06439, 'P@3': 0.0479, 'P@5': 0.04036, 'P@10': 0.03083, 'P@100': 0.00998, 'P@1000': 0.0031}


In [None]:
### DE ###
language = 'de'
language_long = 'german'
remove_stopwords = True
use_subset = False
subset_length = 0
queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

Filter:   0%|          | 0/72344 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 21739
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 147486
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 5653
})
Finished filtering for language


Map:   0%|          | 0/147486 [00:00<?, ? examples/s]

21739


Map:   0%|          | 0/5653 [00:00<?, ? examples/s]

5653


Map:   0%|          | 0/21739 [00:00<?, ? examples/s]

21739
finished creating dictionaries
['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'dass', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ih

In [None]:
print("#######################################################################################################")
print("LR-BM25_DE")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("MR-BM25_DE")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_DE


  0%|          | 0/5653 [00:00<?, ?docs/s]
que: 100%|██████████| 170/170 [21:54<00:00,  7.73s/it]


{'NDCG@1': 0.06041, 'NDCG@3': 0.05317, 'NDCG@5': 0.04906, 'NDCG@10': 0.04843, 'NDCG@100': 0.08245, 'NDCG@1000': 0.13682}
{'MAP@1': 0.00902, 'MAP@3': 0.01494, 'MAP@5': 0.01735, 'MAP@10': 0.02023, 'MAP@100': 0.02607, 'MAP@1000': 0.02816}
{'Recall@1': 0.00902, 'Recall@3': 0.02101, 'Recall@5': 0.02917, 'Recall@10': 0.04425, 'Recall@100': 0.14389, 'Recall@1000': 0.38765}
{'P@1': 0.06041, 'P@3': 0.0498, 'P@5': 0.04213, 'P@10': 0.03306, 'P@100': 0.01154, 'P@1000': 0.00303}
#######################################################################################################
MR-BM25_DE


  0%|          | 0/5653 [00:00<?, ?docs/s]
que: 100%|██████████| 170/170 [18:59<00:00,  6.70s/it]


{'NDCG@1': 0.06353, 'NDCG@3': 0.05486, 'NDCG@5': 0.05059, 'NDCG@10': 0.04992, 'NDCG@100': 0.08444, 'NDCG@1000': 0.13893}
{'MAP@1': 0.00948, 'MAP@3': 0.0154, 'MAP@5': 0.01792, 'MAP@10': 0.02091, 'MAP@100': 0.0269, 'MAP@1000': 0.02902}
{'Recall@1': 0.00948, 'Recall@3': 0.02154, 'Recall@5': 0.03006, 'Recall@10': 0.04567, 'Recall@100': 0.14661, 'Recall@1000': 0.39049}
{'P@1': 0.06353, 'P@3': 0.05121, 'P@5': 0.04336, 'P@10': 0.03389, 'P@100': 0.01175, 'P@1000': 0.00305}


# **Lexical Retrieval using the full corpus, all queries (mixed: de, fr, it) and the full qrels. The main language for multilingual lexical retrieval varies**

In [None]:
### Mixed ###
language = 'mixed'
language_long = 'german'
remove_stopwords = False
use_subset = False
subset_length = 0
queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 31566
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 458725
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 9897
})
Finished filtering for language


Map:   0%|          | 0/9897 [00:00<?, ? examples/s]

9897


Map:   0%|          | 0/31566 [00:00<?, ? examples/s]

31566


Map:   0%|          | 0/458725 [00:00<?, ? examples/s]

31566
finished creating dictionaries
Shortened queries


In [None]:
language_long='french'
print("Attention: As one language needs to be specified, French was chosen for this run.")
print("MR-BM25_MXF")
run_mr_bm25(queries, qrels, corpus, language_long)

Attention: As one language needs to be specified, French was chosen for this run.
MR-BM25_MXF


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:02:34<00:00, 15.20s/it]


{'NDCG@1': 0.1137, 'NDCG@3': 0.10314, 'NDCG@5': 0.09392, 'NDCG@10': 0.08335, 'NDCG@100': 0.11514, 'NDCG@1000': 0.16627}
{'MAP@1': 0.0102, 'MAP@3': 0.01785, 'MAP@5': 0.02127, 'MAP@10': 0.02521, 'MAP@100': 0.03283, 'MAP@1000': 0.03518}
{'Recall@1': 0.0102, 'Recall@3': 0.0254, 'Recall@5': 0.03616, 'Recall@10': 0.0555, 'Recall@100': 0.16537, 'Recall@1000': 0.34645}
{'P@1': 0.1137, 'P@3': 0.09885, 'P@5': 0.08538, 'P@10': 0.06656, 'P@100': 0.02082, 'P@1000': 0.00456}
{'R_cap@1': 0.1137, 'R_cap@3': 0.09998, 'R_cap@5': 0.08823, 'R_cap@10': 0.07742, 'R_cap@100': 0.16537, 'R_cap@1000': 0.34645}
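
The result dictionaries report `P@k`, `Recall@k` and, from this run on, `R_cap@k`. As a rough illustration of how these three differ, here is a minimal sketch for a single query with binary relevance. It is not the BEIR implementation: the function names and the toy ranking are made up for this example, and the capped variant assumes the denominator is `min(k, number of relevant documents)`.

```python
def hits_at_k(ranked_ids, relevant_ids, k):
    """Count relevant documents among the top-k ranked ids."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k results that are relevant.
    return hits_at_k(ranked_ids, relevant_ids, k) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of all relevant documents found in the top-k.
    if not relevant_ids:
        return 0.0
    return hits_at_k(ranked_ids, relevant_ids, k) / len(relevant_ids)


def capped_recall_at_k(ranked_ids, relevant_ids, k):
    # Assumption: the denominator is capped at k, so a query with more
    # relevant documents than k can still reach a score of 1.0.
    if not relevant_ids:
        return 0.0
    return hits_at_k(ranked_ids, relevant_ids, k) / min(k, len(relevant_ids))


# Toy example (hypothetical ids): three relevant documents, five retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

for k in (1, 2, 5):
    print(
        k,
        precision_at_k(ranked, relevant, k),
        recall_at_k(ranked, relevant, k),
        capped_recall_at_k(ranked, relevant, k),
    )
```

Under this definition the capped recall at k=1 equals `P@1`, which matches the pattern in the printed results, where `R_cap@1` and `P@1` are identical.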


In [None]:
language_long='italian'
print("Attention: As one language needs to be specified, Italian was chosen for this run.")
print("MR-BM25_MXI")
run_mr_bm25(queries, qrels, corpus, language_long)
language_long='english'
print("Attention: As one language needs to be specified, English was chosen for this run.")
print("MR-BM25_MXE")
run_mr_bm25(queries, qrels, corpus, language_long)

Attention: As one language needs to be specified, Italian was chosen for this run.
MR-BM25_MXI


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:08:08<00:00, 16.55s/it]


{'NDCG@1': 0.1008, 'NDCG@3': 0.09112, 'NDCG@5': 0.08385, 'NDCG@10': 0.07582, 'NDCG@100': 0.11021, 'NDCG@1000': 0.16311}
{'MAP@1': 0.00956, 'MAP@3': 0.01644, 'MAP@5': 0.01955, 'MAP@10': 0.02321, 'MAP@100': 0.03074, 'MAP@1000': 0.03316}
{'Recall@1': 0.00956, 'Recall@3': 0.02328, 'Recall@5': 0.03315, 'Recall@10': 0.05121, 'Recall@100': 0.16294, 'Recall@1000': 0.35015}
{'P@1': 0.1008, 'P@3': 0.08696, 'P@5': 0.0762, 'P@10': 0.06069, 'P@100': 0.02049, 'P@1000': 0.00462}
{'R_cap@1': 0.1008, 'R_cap@3': 0.0882, 'R_cap@5': 0.0791, 'R_cap@10': 0.07118, 'R_cap@100': 0.16294, 'R_cap@1000': 0.35015}
Attention: As one language needs to be specified, English was chosen for this run.
MR-BM25_MXE


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:11:32<00:00, 17.38s/it]


{'NDCG@1': 0.08376, 'NDCG@3': 0.07664, 'NDCG@5': 0.07207, 'NDCG@10': 0.06658, 'NDCG@100': 0.10226, 'NDCG@1000': 0.15642}
{'MAP@1': 0.00763, 'MAP@3': 0.01335, 'MAP@5': 0.01621, 'MAP@10': 0.01966, 'MAP@100': 0.02694, 'MAP@1000': 0.02941}
{'Recall@1': 0.00763, 'Recall@3': 0.01923, 'Recall@5': 0.02857, 'Recall@10': 0.04607, 'Recall@100': 0.15764, 'Recall@1000': 0.34937}
{'P@1': 0.08376, 'P@3': 0.07335, 'P@5': 0.06634, 'P@10': 0.05457, 'P@100': 0.0199, 'P@1000': 0.00461}
{'R_cap@1': 0.08376, 'R_cap@3': 0.07449, 'R_cap@5': 0.06905, 'R_cap@10': 0.06434, 'R_cap@100': 0.15764, 'R_cap@1000': 0.34937}


In [None]:
print("#######################################################################################################")
print("LR-BM25_MX")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
language_long='german'
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_MX


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:09:08<00:00, 16.79s/it]


{'NDCG@1': 0.08376, 'NDCG@3': 0.07664, 'NDCG@5': 0.07207, 'NDCG@10': 0.06658, 'NDCG@100': 0.10226, 'NDCG@1000': 0.15642}
{'MAP@1': 0.00763, 'MAP@3': 0.01335, 'MAP@5': 0.01621, 'MAP@10': 0.01966, 'MAP@100': 0.02694, 'MAP@1000': 0.02941}
{'Recall@1': 0.00763, 'Recall@3': 0.01923, 'Recall@5': 0.02857, 'Recall@10': 0.04607, 'Recall@100': 0.15764, 'Recall@1000': 0.34937}
{'P@1': 0.08376, 'P@3': 0.07335, 'P@5': 0.06634, 'P@10': 0.05457, 'P@100': 0.0199, 'P@1000': 0.00461}
{'R_cap@1': 0.08376, 'R_cap@3': 0.07449, 'R_cap@5': 0.06905, 'R_cap@10': 0.06434, 'R_cap@100': 0.15764, 'R_cap@1000': 0.34937}
#######################################################################################################
Attention: As one language needs to be specified, German was chosen as it is the most common.
MR-BM25_MXD


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:03:41<00:00, 15.47s/it]


{'NDCG@1': 0.08687, 'NDCG@3': 0.07908, 'NDCG@5': 0.0745, 'NDCG@10': 0.06816, 'NDCG@100': 0.10426, 'NDCG@1000': 0.15816}
{'MAP@1': 0.0079, 'MAP@3': 0.01384, 'MAP@5': 0.01671, 'MAP@10': 0.02018, 'MAP@100': 0.02767, 'MAP@1000': 0.03016}
{'Recall@1': 0.0079, 'Recall@3': 0.0199, 'Recall@5': 0.02927, 'Recall@10': 0.04652, 'Recall@100': 0.15989, 'Recall@1000': 0.35064}
{'P@1': 0.08687, 'P@3': 0.07558, 'P@5': 0.06859, 'P@10': 0.05571, 'P@100': 0.02034, 'P@1000': 0.00464}
{'R_cap@1': 0.08687, 'R_cap@3': 0.07681, 'R_cap@5': 0.07146, 'R_cap@10': 0.06539, 'R_cap@100': 0.15989, 'R_cap@1000': 0.35064}
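
The mixed-language cells above repeat the same call with a different main language each time. A small loop could keep the labels and language names in one place. This is only a sketch: `run_for_languages` and the `fake_run` stand-in are hypothetical, and whether `run_mr_bm25` returns anything is an assumption; the labels are the ones printed in this notebook.

```python
# Labels as used in the cells above, keyed by main language.
LABELS = {
    "german": "MR-BM25_MXD",
    "french": "MR-BM25_MXF",
    "italian": "MR-BM25_MXI",
    "english": "MR-BM25_MXE",
}


def run_for_languages(run_fn, queries, qrels, corpus, languages):
    """Call run_fn once per main language and collect whatever it returns."""
    results = {}
    for language_long in languages:
        label = LABELS[language_long]
        print("#" * 103)
        print(label)
        results[label] = run_fn(queries, qrels, corpus, language_long)
    return results


# Stand-in for run_mr_bm25 so the sketch runs without Elasticsearch.
def fake_run(queries, qrels, corpus, language_long):
    return {"language": language_long, "n_queries": len(queries)}


demo = run_for_languages(
    fake_run,
    queries={"q1": "text"},
    qrels={"q1": {"d1": 1}},
    corpus={"d1": {"text": "text"}},
    languages=["german", "french"],
)
```

In the notebook, `run_mr_bm25` would take the place of `fake_run`.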


# **Lexical Retrieval with stopword removal**

In [None]:
# add stopword removal:
queries_s, corpus_s = shorten_and_reduce(queries, corpus, language_long)

['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'dass', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euc

In [None]:
queries_s, corpus_s = shorten_and_reduce(queries_s, corpus_s, 'italian')
queries_s, corpus_s = shorten_and_reduce(queries_s, corpus_s, 'french')

['ad', 'al', 'allo', 'ai', 'agli', 'all', 'agl', 'alla', 'alle', 'con', 'col', 'coi', 'da', 'dal', 'dallo', 'dai', 'dagli', 'dall', 'dagl', 'dalla', 'dalle', 'di', 'del', 'dello', 'dei', 'degli', 'dell', 'degl', 'della', 'delle', 'in', 'nel', 'nello', 'nei', 'negli', 'nell', 'negl', 'nella', 'nelle', 'su', 'sul', 'sullo', 'sui', 'sugli', 'sull', 'sugl', 'sulla', 'sulle', 'per', 'tra', 'contro', 'io', 'tu', 'lui', 'lei', 'noi', 'voi', 'loro', 'mio', 'mia', 'miei', 'mie', 'tuo', 'tua', 'tuoi', 'tue', 'suo', 'sua', 'suoi', 'sue', 'nostro', 'nostra', 'nostri', 'nostre', 'vostro', 'vostra', 'vostri', 'vostre', 'mi', 'ti', 'ci', 'vi', 'lo', 'la', 'li', 'le', 'gli', 'ne', 'il', 'un', 'uno', 'una', 'ma', 'ed', 'se', 'perché', 'anche', 'come', 'dov', 'dove', 'che', 'chi', 'cui', 'non', 'più', 'quale', 'quanto', 'quanti', 'quanta', 'quante', 'quello', 'quelli', 'quella', 'quelle', 'questo', 'questi', 'questa', 'queste', 'si', 'tutto', 'tutti', 'a', 'c', 'e', 'i', 'l', 'o', 'ho', 'hai', 'ha', 'ab
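
The cells above call `shorten_and_reduce` once per language on the mixed data. Its body is not shown in this section, so the following is only a sketch of what stopword removal plus shortening could look like. The function `strip_stopwords`, the tiny stopword sets and the `max_tokens` parameter are all assumptions made for this example, not the notebook's implementation.

```python
# Tiny illustrative stopword sets; the real lists printed above are much longer.
STOPWORDS = {
    "german": {"der", "die", "das", "und"},
    "french": {"le", "la", "les", "et"},
    "italian": {"il", "lo", "la", "e"},
}


def strip_stopwords(text, languages, max_tokens=None):
    """Drop stopwords of the given languages, then optionally truncate."""
    # Applying several languages one after another is equivalent to
    # filtering once against the union of their stopword sets.
    stop = set().union(*(STOPWORDS[lang] for lang in languages))
    tokens = [tok for tok in text.split() if tok.lower() not in stop]
    if max_tokens is not None:
        tokens = tokens[:max_tokens]
    return " ".join(tokens)


print(strip_stopwords("Der Beschwerdeführer und die Vorinstanz", ["german"]))
print(strip_stopwords("la decisione e il ricorso", ["italian", "french"], max_tokens=1))
```

Note that short words can be stopwords in one language and content words in another, so filtering a mixed corpus against all three lists may remove more than intended.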

In [None]:
print("#######################################################################################################")
print("LR-BM25_MX")
run_lr_bm25(queries_s, qrels, corpus_s)
print("#######################################################################################################")
language_long='german'
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD")
run_mr_bm25(queries_s, qrels, corpus_s, language_long)
language_long='french'
print("Attention: As one language needs to be specified, French was chosen for this run.")
print("MR-BM25_MXF")
run_mr_bm25(queries_s, qrels, corpus_s, language_long)

#######################################################################################################
LR-BM25_MX


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:08:44<00:00, 16.70s/it]


{'NDCG@1': 0.10644, 'NDCG@3': 0.09688, 'NDCG@5': 0.08906, 'NDCG@10': 0.08044, 'NDCG@100': 0.11329, 'NDCG@1000': 0.16542}
{'MAP@1': 0.00986, 'MAP@3': 0.0172, 'MAP@5': 0.02047, 'MAP@10': 0.02444, 'MAP@100': 0.03208, 'MAP@1000': 0.03448}
{'Recall@1': 0.00986, 'Recall@3': 0.02444, 'Recall@5': 0.03482, 'Recall@10': 0.0543, 'Recall@100': 0.16465, 'Recall@1000': 0.34962}
{'P@1': 0.10644, 'P@3': 0.09277, 'P@5': 0.08119, 'P@10': 0.06478, 'P@100': 0.0208, 'P@1000': 0.00461}
{'R_cap@1': 0.10644, 'R_cap@3': 0.09403, 'R_cap@5': 0.08417, 'R_cap@10': 0.07566, 'R_cap@100': 0.16465, 'R_cap@1000': 0.34962}
#######################################################################################################
Attention: As one language needs to be specified, German was chosen as it is the most common.
MR-BM25_MXD


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:08:43<00:00, 16.69s/it]


{'NDCG@1': 0.10879, 'NDCG@3': 0.098, 'NDCG@5': 0.09019, 'NDCG@10': 0.0814, 'NDCG@100': 0.11527, 'NDCG@1000': 0.1671}
{'MAP@1': 0.00995, 'MAP@3': 0.0173, 'MAP@5': 0.02062, 'MAP@10': 0.02465, 'MAP@100': 0.03251, 'MAP@1000': 0.03494}
{'Recall@1': 0.00995, 'Recall@3': 0.02449, 'Recall@5': 0.0351, 'Recall@10': 0.05471, 'Recall@100': 0.16793, 'Recall@1000': 0.35134}
{'P@1': 0.10879, 'P@3': 0.09376, 'P@5': 0.08232, 'P@10': 0.06564, 'P@100': 0.02128, 'P@1000': 0.00465}
{'R_cap@1': 0.10879, 'R_cap@3': 0.09499, 'R_cap@5': 0.08519, 'R_cap@10': 0.07646, 'R_cap@100': 0.16793, 'R_cap@1000': 0.35134}
Attention: As one language needs to be specified, French was chosen for this run.
MR-BM25_MXF


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 247/247 [1:07:40<00:00, 16.44s/it]


{'NDCG@1': 0.10971, 'NDCG@3': 0.09911, 'NDCG@5': 0.09113, 'NDCG@10': 0.08135, 'NDCG@100': 0.11405, 'NDCG@1000': 0.16614}
{'MAP@1': 0.00993, 'MAP@3': 0.01732, 'MAP@5': 0.02072, 'MAP@10': 0.02461, 'MAP@100': 0.03229, 'MAP@1000': 0.0347}
{'Recall@1': 0.00993, 'Recall@3': 0.02457, 'Recall@5': 0.03542, 'Recall@10': 0.05441, 'Recall@100': 0.16522, 'Recall@1000': 0.35001}
{'P@1': 0.10971, 'P@3': 0.09484, 'P@5': 0.0831, 'P@10': 0.06525, 'P@100': 0.02088, 'P@1000': 0.00462}
{'R_cap@1': 0.10971, 'R_cap@3': 0.09599, 'R_cap@5': 0.08606, 'R_cap@10': 0.07602, 'R_cap@100': 0.16522, 'R_cap@1000': 0.35001}


# **Lexical Retrieval: SSL datasets**

In [None]:
# SSL for lexical
from datasets import concatenate_datasets, load_dataset
use_subset = True
subset_length = 10000
remove_stopwords = False
language_long = 'german'

# Filter queries, qrels and corpus separately for each language
querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')
querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')
querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')

# Merge the per-language splits back into one dataset each
querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])
qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])
corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])

# Shuffle with a fixed seed so that the order is reproducible
querie_dataset_p = querie_dataset_p.shuffle(seed=42)
qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)
corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)

# Convert the merged datasets into queries, qrels and corpus dictionaries
queries_ssl, qrels_ssl, corpus_ = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)

Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 1371
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 4882
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 248
})
Finished filtering for language


Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 11942
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 63203
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 2379
})
Finished filtering for language


Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 17678
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 147486
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 5653
})
Finished filtering for language


Map:   0%|          | 0/8280 [00:00<?, ? examples/s]

8280


Map:   0%|          | 0/30991 [00:00<?, ? examples/s]

30991


Map:   0%|          | 0/215571 [00:00<?, ? examples/s]

30991
finished creating dictionaries


In [None]:
print("#######################################################################################################")
print("LR-BM25_MX")
run_lr_bm25(queries_ssl, qrels_ssl, corpus)
print("#######################################################################################################")
language_long='german'
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD")
run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)
print("#######################################################################################################")
language_long='french'
print("Attention: As one language needs to be specified, French was chosen for this run.")
print("MR-BM25_MXF")
run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)
print("#######################################################################################################")
language_long='italian'
print("Attention: As one language needs to be specified, Italian was chosen for this run.")
print("MR-BM25_MXI")
run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)
print("#######################################################################################################")
language_long='english'
print("Attention: As one language needs to be specified, English was chosen for this run.")
print("MR-BM25_MXE")
run_mr_bm25(queries_ssl, qrels_ssl, corpus, language_long)

#######################################################################################################
LR-BM25_MX


  0%|          | 0/9897 [00:00<?, ?docs/s]
que:   0%|          | 0/243 [00:21<?, ?it/s]


KeyError: ignored
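
The traceback is cut off, so the cause of this `KeyError` is not visible here. One thing worth checking before a long run, offered as an assumption rather than a diagnosis, is whether every id referenced in the qrels actually exists in the queries and corpus dictionaries that are passed to the retrieval function. The sketch below assumes the usual BEIR layout (`queries`: id to text, `corpus`: id to a dict with the text, `qrels`: query id to a dict of document id and relevance); the helper name `find_dangling_ids` is made up for this example.

```python
def find_dangling_ids(queries, qrels, corpus):
    """Return query ids and document ids that qrels reference but that are missing."""
    missing_queries = sorted(qid for qid in qrels if qid not in queries)
    missing_docs = sorted(
        {doc_id for docs in qrels.values() for doc_id in docs if doc_id not in corpus}
    )
    return missing_queries, missing_docs


# Toy data (hypothetical ids): "q2" and "d9" are referenced but not defined.
toy_queries = {"q1": "first query"}
toy_corpus = {"d1": {"text": "first document"}}
toy_qrels = {"q1": {"d1": 1, "d9": 1}, "q2": {"d1": 1}}

missing_q, missing_d = find_dangling_ids(toy_queries, toy_qrels, toy_corpus)
print("queries missing:", missing_q)
print("documents missing:", missing_d)
```

If both lists come back empty for the dictionaries used in the failing cell, the error has a different origin.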

# Lexical Retrieval with queries in a single language




In [None]:
### Italian ###
language = 'it'
language_long = 'italian'
remove_stopwords = False
use_subset = False
subset_length = 0
queries, qrels, corpus_it = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 1371
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 4882
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 248
})
Finished filtering for language


Map:   0%|          | 0/248 [00:00<?, ? examples/s]

248


Map:   0%|          | 0/1371 [00:00<?, ? examples/s]

1371


Map:   0%|          | 0/4882 [00:00<?, ? examples/s]

1371
finished creating dictionaries
Shortened queries


In [None]:
print("#######################################################################################################")
print("LR-BM25_IT")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("MR-BM25_IT")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_IT_S


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 11/11 [03:18<00:00, 18.09s/it]


{'NDCG@1': 0.14223, 'NDCG@3': 0.13575, 'NDCG@5': 0.15606, 'NDCG@10': 0.19881, 'NDCG@100': 0.34109, 'NDCG@1000': 0.37782}
{'MAP@1': 0.04739, 'MAP@3': 0.081, 'MAP@5': 0.09817, 'MAP@10': 0.118, 'MAP@100': 0.1531, 'MAP@1000': 0.15644}
{'Recall@1': 0.04739, 'Recall@3': 0.11602, 'Recall@5': 0.17553, 'Recall@10': 0.27552, 'Recall@100': 0.80278, 'Recall@1000': 1.0}
{'P@1': 0.14223, 'P@3': 0.11743, 'P@5': 0.10562, 'P@10': 0.08366, 'P@100': 0.02298, 'P@1000': 0.00289}
{'R_cap@1': 0.14223, 'R_cap@3': 0.13713, 'R_cap@5': 0.1785, 'R_cap@10': 0.27568, 'R_cap@100': 0.80278, 'R_cap@1000': 1.0}
#######################################################################################################
MR-BM25_IT_S


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 11/11 [02:57<00:00, 16.10s/it]


{'NDCG@1': 0.24799, 'NDCG@3': 0.22079, 'NDCG@5': 0.23772, 'NDCG@10': 0.28604, 'NDCG@100': 0.41971, 'NDCG@1000': 0.44462}
{'MAP@1': 0.08239, 'MAP@3': 0.13673, 'MAP@5': 0.15927, 'MAP@10': 0.18438, 'MAP@100': 0.21999, 'MAP@1000': 0.22222}
{'Recall@1': 0.08239, 'Recall@3': 0.18622, 'Recall@5': 0.25373, 'Recall@10': 0.37164, 'Recall@100': 0.86336, 'Recall@1000': 1.0}
{'P@1': 0.24799, 'P@3': 0.18867, 'P@5': 0.15244, 'P@10': 0.10985, 'P@100': 0.02484, 'P@1000': 0.00289}
{'R_cap@1': 0.24799, 'R_cap@3': 0.21882, 'R_cap@5': 0.25773, 'R_cap@10': 0.37182, 'R_cap@100': 0.86336, 'R_cap@1000': 1.0}


In [None]:
### French ###
language = 'fr'
language_long = 'french'
remove_stopwords = False
use_subset = False
subset_length = 0
queries, qrels, corpus_fr = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 11942
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 63203
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 2379
})
Finished filtering for language


Map:   0%|          | 0/2379 [00:00<?, ? examples/s]

2379


Map:   0%|          | 0/11942 [00:00<?, ? examples/s]

11942


Map:   0%|          | 0/63203 [00:00<?, ? examples/s]

11942
finished creating dictionaries
Shortened queries


In [None]:
print("#######################################################################################################")
print("LR-BM25_FR")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("MR-BM25_FR")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_FR_S


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 94/94 [34:58<00:00, 22.32s/it]


{'NDCG@1': 0.07888, 'NDCG@3': 0.07454, 'NDCG@5': 0.07614, 'NDCG@10': 0.08818, 'NDCG@100': 0.17314, 'NDCG@1000': 0.25976}
{'MAP@1': 0.01655, 'MAP@3': 0.02964, 'MAP@5': 0.03603, 'MAP@10': 0.0436, 'MAP@100': 0.05942, 'MAP@1000': 0.0644}
{'Recall@1': 0.01655, 'Recall@3': 0.04378, 'Recall@5': 0.06573, 'Recall@10': 0.1071, 'Recall@100': 0.38333, 'Recall@1000': 0.81509}
{'P@1': 0.07888, 'P@3': 0.06861, 'P@5': 0.06197, 'P@10': 0.05052, 'P@100': 0.01844, 'P@1000': 0.00413}
{'R_cap@1': 0.07888, 'R_cap@3': 0.07419, 'R_cap@5': 0.0799, 'R_cap@10': 0.10915, 'R_cap@100': 0.38333, 'R_cap@1000': 0.81509}
#######################################################################################################
MR-BM25_FR_S


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 94/94 [30:26<00:00, 19.43s/it]


{'NDCG@1': 0.15893, 'NDCG@3': 0.14974, 'NDCG@5': 0.147, 'NDCG@10': 0.16145, 'NDCG@100': 0.2463, 'NDCG@1000': 0.31715}
{'MAP@1': 0.03556, 'MAP@3': 0.064, 'MAP@5': 0.07524, 'MAP@10': 0.0873, 'MAP@100': 0.10642, 'MAP@1000': 0.11082}
{'Recall@1': 0.03556, 'Recall@3': 0.09148, 'Recall@5': 0.12714, 'Recall@10': 0.1866, 'Recall@100': 0.46047, 'Recall@1000': 0.80652}
{'P@1': 0.15893, 'P@3': 0.13739, 'P@5': 0.11496, 'P@10': 0.08482, 'P@100': 0.02152, 'P@1000': 0.00407}
{'R_cap@1': 0.15893, 'R_cap@3': 0.14896, 'R_cap@5': 0.15089, 'R_cap@10': 0.18957, 'R_cap@100': 0.46047, 'R_cap@1000': 0.80652}


In [None]:
### German ###
language = 'de'
language_long = 'german'
remove_stopwords = False
use_subset = False
subset_length = 0
queries, qrels, corpus_de = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)

Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 17678
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 147486
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 5653
})
Finished filtering for language


Map:   0%|          | 0/5653 [00:00<?, ? examples/s]

5653


Map:   0%|          | 0/17678 [00:00<?, ? examples/s]

17678


Map:   0%|          | 0/147486 [00:00<?, ? examples/s]

17678
finished creating dictionaries
Shortened queries


In [None]:
print("#######################################################################################################")
print("LR-BM25_DE")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("MR-BM25_DE")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_DE_S


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 139/139 [54:42<00:00, 23.62s/it]


{'NDCG@1': 0.08525, 'NDCG@3': 0.0797, 'NDCG@5': 0.07827, 'NDCG@10': 0.08341, 'NDCG@100': 0.15674, 'NDCG@1000': 0.24341}
{'MAP@1': 0.01374, 'MAP@3': 0.02411, 'MAP@5': 0.02963, 'MAP@10': 0.03636, 'MAP@100': 0.05085, 'MAP@1000': 0.05598}
{'Recall@1': 0.01374, 'Recall@3': 0.03471, 'Recall@5': 0.05265, 'Recall@10': 0.08684, 'Recall@100': 0.30672, 'Recall@1000': 0.68687}
{'P@1': 0.08525, 'P@3': 0.07548, 'P@5': 0.06834, 'P@10': 0.05675, 'P@100': 0.02128, 'P@1000': 0.00501}
{'R_cap@1': 0.08525, 'R_cap@3': 0.07835, 'R_cap@5': 0.07865, 'R_cap@10': 0.0939, 'R_cap@100': 0.30672, 'R_cap@1000': 0.68687}
#######################################################################################################
MR-BM25_DE_S


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 139/139 [44:49<00:00, 19.35s/it]


{'NDCG@1': 0.09662, 'NDCG@3': 0.08977, 'NDCG@5': 0.0882, 'NDCG@10': 0.09295, 'NDCG@100': 0.1705, 'NDCG@1000': 0.25436}
{'MAP@1': 0.01587, 'MAP@3': 0.02783, 'MAP@5': 0.03391, 'MAP@10': 0.04128, 'MAP@100': 0.05715, 'MAP@1000': 0.06233}
{'Recall@1': 0.01587, 'Recall@3': 0.03987, 'Recall@5': 0.05944, 'Recall@10': 0.09569, 'Recall@100': 0.32771, 'Recall@1000': 0.69299}
{'P@1': 0.09662, 'P@3': 0.08451, 'P@5': 0.07693, 'P@10': 0.06278, 'P@100': 0.02278, 'P@1000': 0.00507}
{'R_cap@1': 0.09662, 'R_cap@3': 0.08833, 'R_cap@5': 0.08848, 'R_cap@10': 0.10343, 'R_cap@100': 0.32771, 'R_cap@1000': 0.69299}


# Cross-Encoder, SBERT Reranking, and Lexical Retrieval for a Subset of 100 Queries



In [None]:
querie_dataset = querie_dataset.shuffle(seed=42)
qrel_dataset = qrel_dataset.shuffle(seed=42)
corpus_dataset = corpus_dataset.shuffle(seed=42)

In [None]:
print(querie_dataset)
print(qrel_dataset)
print(corpus_dataset)

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 62557
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 674296
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 18177
})


In [None]:
### Mixed ###
language = 'mixed'
language_long = 'german'
remove_stopwords = False
use_subset = True
subset_length = 10000
queries, qrels, corpus = create_dataset(language, language_long, remove_stopwords, use_subset, subset_length)



Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 31566
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 458725
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 9897
})
Finished filtering for language


Map:   0%|          | 0/9897 [00:00<?, ? examples/s]

9897


Map:   0%|          | 0/31566 [00:00<?, ? examples/s]

31566


Map:   0%|          | 0/458725 [00:00<?, ? examples/s]

31566
finished creating dictionaries
create subset of 10000 queries.
10001
10001
9897
Shortened queries


In [None]:
### Subset 100 ###
queries_100, qrels_100, corpus_100 = create_subset(100, queries, qrels, corpus)

create subset of 100 queries.
101
101
9897
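
`create_subset` is defined earlier in the notebook and is not shown in this section. As an illustration only, a helper with the same call shape could look like the sketch below. The name `take_query_subset` and its behaviour are assumptions: it keeps the first `n` queries, keeps only their relevance judgments, and returns the corpus unchanged (as the logged corpus size of 9897 suggests). Unlike the logged run, which reports 101 entries for a requested 100, this sketch returns exactly `n` queries.

```python
from itertools import islice
from typing import Dict, Tuple

Queries = Dict[str, str]
Qrels = Dict[str, Dict[str, int]]
Corpus = Dict[str, Dict[str, str]]


def take_query_subset(
    n: int, queries: Queries, qrels: Qrels, corpus: Corpus
) -> Tuple[Queries, Qrels, Corpus]:
    """Hypothetical stand-in for create_subset: keep the first n queries."""
    # Dicts preserve insertion order, so this takes the first n entries.
    subset_queries = dict(islice(queries.items(), n))
    # Keep relevance judgments only for the selected queries.
    subset_qrels = {qid: qrels[qid] for qid in subset_queries if qid in qrels}
    # The corpus is returned as-is; every document stays searchable.
    return subset_queries, subset_qrels, corpus
```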


In [None]:
print("#######################################################################################################")
print("RCE_BM25_MX_100")
model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
run_rce_bm25(queries_100, qrels_100, corpus_100, model_name)
print("#######################################################################################################")
print("SB_R_M1_MX_100")
# Model 1 (M1): paraphrase-albert-small-v2
model_name = "paraphrase-albert-small-v2"
run_sb_r(queries_100, qrels_100, corpus_100, model_name)
print("#######################################################################################################")
print("SB_R_M2_MX_100")
# Model 2 (M2): distiluse-base-multilingual-cased-v1
model_name = 'sentence-transformers/distiluse-base-multilingual-cased-v1'
run_sb_r(queries_100, qrels_100, corpus_100, model_name)

#######################################################################################################
RCE_BM25_MX_100


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:20<00:00, 20.96s/it]


Batches:   0%|          | 0/79 [00:00<?, ?it/s]

{'NDCG@1': 0.0297, 'NDCG@3': 0.02213, 'NDCG@5': 0.01989, 'NDCG@10': 0.01839, 'NDCG@100': 0.07353, 'NDCG@1000': 0.07353}
{'MAP@1': 0.00182, 'MAP@3': 0.00255, 'MAP@5': 0.00278, 'MAP@10': 0.0032, 'MAP@100': 0.00893, 'MAP@1000': 0.00893}
{'Recall@1': 0.00182, 'Recall@3': 0.00355, 'Recall@5': 0.00472, 'Recall@10': 0.0081, 'Recall@100': 0.14198, 'Recall@1000': 0.14198}
{'P@1': 0.0297, 'P@3': 0.0198, 'P@5': 0.01782, 'P@10': 0.01683, 'P@100': 0.02426, 'P@1000': 0.00243}
{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}
#######################################################################################################
SB_R_M1_MX_100


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:15<00:00, 15.66s/it]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

{'NDCG@1': 0.0396, 'NDCG@3': 0.02323, 'NDCG@5': 0.02113, 'NDCG@10': 0.01654, 'NDCG@100': 0.02023, 'NDCG@1000': 0.02023}
{'MAP@1': 0.00182, 'MAP@3': 0.00229, 'MAP@5': 0.00279, 'MAP@10': 0.00305, 'MAP@100': 0.00388, 'MAP@1000': 0.00388}
{'Recall@1': 0.00182, 'Recall@3': 0.00324, 'Recall@5': 0.005, 'Recall@10': 0.00708, 'Recall@100': 0.02562, 'Recall@1000': 0.02562}
{'P@1': 0.0396, 'P@3': 0.0198, 'P@5': 0.01782, 'P@10': 0.01287, 'P@100': 0.00465, 'P@1000': 0.00047}
{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}
#######################################################################################################
SB_R_M2_MX_100


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:14<00:00, 14.41s/it]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

{'NDCG@1': 0.0198, 'NDCG@3': 0.03446, 'NDCG@5': 0.0291, 'NDCG@10': 0.02518, 'NDCG@100': 0.03546, 'NDCG@1000': 0.03584}
{'MAP@1': 0.00096, 'MAP@3': 0.00329, 'MAP@5': 0.00384, 'MAP@10': 0.00444, 'MAP@100': 0.00614, 'MAP@1000': 0.00617}
{'Recall@1': 0.00096, 'Recall@3': 0.00589, 'Recall@5': 0.00752, 'Recall@10': 0.01149, 'Recall@100': 0.05395, 'Recall@1000': 0.05477}
{'P@1': 0.0198, 'P@3': 0.0363, 'P@5': 0.02772, 'P@10': 0.02277, 'P@100': 0.0095, 'P@1000': 0.00097}
{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}


In [None]:
print("#######################################################################################################")
print("SB_R_M3_MX_100")
# Model 3 (M3): Stern5497/sBert-swiss-legal-base
model_name = 'Stern5497/sBert-swiss-legal-base'
run_sb_r(queries_100, qrels_100, corpus_100, model_name)

#######################################################################################################
SB_R_M3_MX_100


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:21<00:00, 21.54s/it]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

{'NDCG@1': 0.17822, 'NDCG@3': 0.15366, 'NDCG@5': 0.15579, 'NDCG@10': 0.14174, 'NDCG@100': 0.21998, 'NDCG@1000': 0.22114}
{'MAP@1': 0.01264, 'MAP@3': 0.02295, 'MAP@5': 0.03383, 'MAP@10': 0.04498, 'MAP@100': 0.07343, 'MAP@1000': 0.07364}
{'Recall@1': 0.01264, 'Recall@3': 0.03062, 'Recall@5': 0.05417, 'Recall@10': 0.08554, 'Recall@100': 0.3219, 'Recall@1000': 0.32485}
{'P@1': 0.17822, 'P@3': 0.14851, 'P@5': 0.15248, 'P@10': 0.12772, 'P@100': 0.05267, 'P@1000': 0.00532}
{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}


In [None]:
print("#######################################################################################################")
print("LR-BM25_MX_100")
run_lr_bm25(queries_100, qrels_100, corpus_100)
print("#######################################################################################################")
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD_100")
run_mr_bm25(queries_100, qrels_100, corpus_100, language_long)

#######################################################################################################
LR-BM25_MX_100


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:22<00:00, 22.23s/it]


{'NDCG@1': 0.05941, 'NDCG@3': 0.10974, 'NDCG@5': 0.09872, 'NDCG@10': 0.08517, 'NDCG@100': 0.10635, 'NDCG@1000': 0.16756}
{'MAP@1': 0.00291, 'MAP@3': 0.01122, 'MAP@5': 0.01411, 'MAP@10': 0.01803, 'MAP@100': 0.02495, 'MAP@1000': 0.02798}
{'Recall@1': 0.00291, 'Recall@3': 0.01884, 'Recall@5': 0.0282, 'Recall@10': 0.04683, 'Recall@100': 0.14198, 'Recall@1000': 0.3337}
{'P@1': 0.05941, 'P@3': 0.11881, 'P@5': 0.09901, 'P@10': 0.07723, 'P@100': 0.02426, 'P@1000': 0.00585}
{'R_cap@1': 0.05941, 'R_cap@3': 0.11881, 'R_cap@5': 0.0995, 'R_cap@10': 0.08044, 'R_cap@100': 0.14198, 'R_cap@1000': 0.3337}
#######################################################################################################
Attention: As one language needs to be specified, German was chosen as it is the most common.
MR-BM25_MXD_100


  0%|          | 0/9897 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:19<00:00, 19.58s/it]


{'NDCG@1': 0.09901, 'NDCG@3': 0.11477, 'NDCG@5': 0.11066, 'NDCG@10': 0.09139, 'NDCG@100': 0.1158, 'NDCG@1000': 0.17426}
{'MAP@1': 0.00601, 'MAP@3': 0.01368, 'MAP@5': 0.01783, 'MAP@10': 0.0208, 'MAP@100': 0.02878, 'MAP@1000': 0.03167}
{'Recall@1': 0.00601, 'Recall@3': 0.02098, 'Recall@5': 0.0336, 'Recall@10': 0.04694, 'Recall@100': 0.1519, 'Recall@1000': 0.33719}
{'P@1': 0.09901, 'P@3': 0.11881, 'P@5': 0.11089, 'P@10': 0.08119, 'P@100': 0.02614, 'P@1000': 0.00588}
{'R_cap@1': 0.09901, 'R_cap@3': 0.11881, 'R_cap@5': 0.11139, 'R_cap@10': 0.08412, 'R_cap@100': 0.1519, 'R_cap@1000': 0.33719}


# SSL Experiments

In [None]:
### Mixed, single language links ###
from datasets import concatenate_datasets, load_dataset
use_subset = True
subset_length = 10000
remove_stopwords = False
language_long = 'german'

querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')
querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')
querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')

querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])
qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])
corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])

querie_dataset_p = querie_dataset_p.shuffle(seed=42)
qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)
corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)

queries, qrels, corpus = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)
### Subset 100 ###
queries_100, qrels_100, corpus_100 = create_subset(100, queries, qrels, corpus)

Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 1371
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 4882
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 248
})
Finished filtering for language


Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 11942
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 63203
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 2379
})
Finished filtering for language


Filter:   0%|          | 0/62557 [00:00<?, ? examples/s]

Filter:   0%|          | 0/674296 [00:00<?, ? examples/s]

Filter:   0%|          | 0/18177 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 17678
})
Dataset({
    features: ['id', 'corp_id', 'match', 'language'],
    num_rows: 147486
})
Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 5653
})
Finished filtering for language


Map:   0%|          | 0/8280 [00:00<?, ? examples/s]

8280


Map:   0%|          | 0/30991 [00:00<?, ? examples/s]

30991


Map:   0%|          | 0/215571 [00:00<?, ? examples/s]

30991
finished creating dictionaries
create subset of 100 queries.
101
101
8280


In [None]:
print("#######################################################################################################")
print("LR-BM25_MX_SSL_100")
run_lr_bm25(queries_100, qrels_100, corpus_100)
print("#######################################################################################################")
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD_SSL_100")
run_mr_bm25(queries_100, qrels_100, corpus_100, language_long)
print("#######################################################################################################")
print("Attention: This run specifies French as the language instead of German.")
# Label assumed by analogy with MXD (D = German): F = French.
print("MR-BM25_MXF_SSL_100")
run_mr_bm25(queries_100, qrels_100, corpus_100, "french")
print("#######################################################################################################")
print("SB_R_M3_MX_SSL_100")
# Model 3 (M3): Stern5497/sBert-swiss-legal-base
model_name = 'Stern5497/sBert-swiss-legal-base'
run_sb_r(queries_100, qrels_100, corpus_100, model_name)

#######################################################################################################
LR-BM25_MX_SSL_100


  0%|          | 0/8280 [00:00<?, ?docs/s]
que:   0%|          | 0/1 [00:15<?, ?it/s]


KeyError: ignored
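
The traceback for this `KeyError` is not shown, so its cause is unknown. One possibility, stated here purely as an assumption, is that after filtering and concatenating the per-language splits some relevance judgments reference query or corpus IDs that are no longer present. A minimal sketch of a consistency check that prunes such entries before evaluation, assuming BEIR-style dictionaries (the helper name `prune_qrels` is hypothetical):

```python
from typing import Dict, Tuple

Queries = Dict[str, str]
Qrels = Dict[str, Dict[str, int]]
Corpus = Dict[str, Dict[str, str]]


def prune_qrels(
    queries: Queries, qrels: Qrels, corpus: Corpus
) -> Tuple[Queries, Qrels]:
    """Drop judgments whose query or document id is missing."""
    pruned: Qrels = {}
    for qid, docs in qrels.items():
        if qid not in queries:
            continue  # judgment for a query that was filtered out
        # Keep only documents that still exist in the corpus.
        kept = {doc_id: rel for doc_id, rel in docs.items() if doc_id in corpus}
        if kept:
            pruned[qid] = kept
    # Keep only queries that still have at least one judged document.
    kept_queries = {qid: text for qid, text in queries.items() if qid in pruned}
    return kept_queries, pruned
```

Comparing `len(qrels)` with the size of the pruned result shows quickly whether dangling IDs exist at all. If the sizes match, the error has a different cause.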

# SSL Results with Shortened Queries


In [None]:
### Mixed, single language links ###
from datasets import concatenate_datasets, load_dataset
use_subset = True
subset_length = 10000
remove_stopwords = True
language_long = 'german'

querie_dataset_it, qrel_dataset_it, corpus_dataset_it = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'it')
querie_dataset_fr, qrel_dataset_fr, corpus_dataset_fr = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'fr')
querie_dataset_de, qrel_dataset_de, corpus_dataset_de = filter_data(querie_dataset, qrel_dataset, corpus_dataset, 'de')

querie_dataset_p = concatenate_datasets([querie_dataset_it, querie_dataset_fr, querie_dataset_de])
qrel_dataset_p = concatenate_datasets([qrel_dataset_it, qrel_dataset_fr, qrel_dataset_de])
corpus_dataset_p = concatenate_datasets([corpus_dataset_it, corpus_dataset_fr, corpus_dataset_de])

querie_dataset_p = querie_dataset_p.shuffle(seed=42)
qrel_dataset_p = qrel_dataset_p.shuffle(seed=42)
corpus_dataset_p = corpus_dataset_p.shuffle(seed=42)

queries, qrels, corpus = write_dictionaries(querie_dataset_p, qrel_dataset_p, corpus_dataset_p)
queries = shorten(queries)

In [None]:
print("#######################################################################################################")
print("LR-BM25_MX-SLL_10000")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD_SLL_10000")
language_long = 'german'
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_MX-SLL_10000


  0%|          | 0/8280 [00:00<?, ?docs/s]
que: 100%|██████████| 280/280 [1:26:13<00:00, 18.48s/it]


{'NDCG@1': 0.09128, 'NDCG@3': 0.08575, 'NDCG@5': 0.08629, 'NDCG@10': 0.09645, 'NDCG@100': 0.18027, 'NDCG@1000': 0.26529}
{'MAP@1': 0.01785, 'MAP@3': 0.03149, 'MAP@5': 0.03831, 'MAP@10': 0.04651, 'MAP@100': 0.06339, 'MAP@1000': 0.06853}
{'Recall@1': 0.01785, 'Recall@3': 0.04554, 'Recall@5': 0.06803, 'Recall@10': 0.11037, 'Recall@100': 0.37469, 'Recall@1000': 0.76708}
{'P@1': 0.09128, 'P@3': 0.07969, 'P@5': 0.07143, 'P@10': 0.05846, 'P@100': 0.02106, 'P@1000': 0.00469}
{'R_cap@1': 0.07908, 'R_cap@3': 0.07362, 'R_cap@5': 0.07727, 'R_cap@10': 0.09993, 'R_cap@100': 0.32457, 'R_cap@1000': 0.66449}
#######################################################################################################
Attention: As one language needs to be specified, German was chosen as it is the most common.
MR-BM25_MXD_S


  0%|          | 0/8280 [00:00<?, ?docs/s]
que: 100%|██████████| 280/280 [1:19:08<00:00, 16.96s/it]


{'NDCG@1': 0.09293, 'NDCG@3': 0.08733, 'NDCG@5': 0.08782, 'NDCG@10': 0.09699, 'NDCG@100': 0.18167, 'NDCG@1000': 0.26676}
{'MAP@1': 0.01802, 'MAP@3': 0.03164, 'MAP@5': 0.03845, 'MAP@10': 0.04653, 'MAP@100': 0.06359, 'MAP@1000': 0.0688}
{'Recall@1': 0.01802, 'Recall@3': 0.04568, 'Recall@5': 0.06785, 'Recall@10': 0.10937, 'Recall@100': 0.37672, 'Recall@1000': 0.76984}
{'P@1': 0.09293, 'P@3': 0.08121, 'P@5': 0.07333, 'P@10': 0.05944, 'P@100': 0.02144, 'P@1000': 0.00473}
{'R_cap@1': 0.0805, 'R_cap@3': 0.07498, 'R_cap@5': 0.07827, 'R_cap@10': 0.09936, 'R_cap@100': 0.32633, 'R_cap@1000': 0.66687}


In [None]:
### Subset 1000 ###
queries, qrels, corpus = create_subset(1000, queries, qrels, corpus)

create subset of 1000 queries.


In [None]:
print("#######################################################################################################")
print("LR-BM25_MX-SLL_1000")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD_SLL_1000")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_MX-SLL_1000


  0%|          | 0/2413 [00:00<?, ?docs/s]
que: 100%|██████████| 8/8 [01:56<00:00, 14.51s/it]


{'NDCG@1': 0.2266, 'NDCG@3': 0.1866, 'NDCG@5': 0.17495, 'NDCG@10': 0.17431, 'NDCG@100': 0.28217, 'NDCG@1000': 0.39647}
{'MAP@1': 0.0302, 'MAP@3': 0.05143, 'MAP@5': 0.06379, 'MAP@10': 0.07935, 'MAP@100': 0.1103, 'MAP@1000': 0.12076}
{'Recall@1': 0.0302, 'Recall@3': 0.06724, 'Recall@5': 0.09781, 'Recall@10': 0.15643, 'Recall@100': 0.46355, 'Recall@1000': 0.90347}
{'P@1': 0.2266, 'P@3': 0.17376, 'P@5': 0.15426, 'P@10': 0.12394, 'P@100': 0.04066, 'P@1000': 0.00854}
{'R_cap@1': 0.21279, 'R_cap@3': 0.16533, 'R_cap@5': 0.15468, 'R_cap@10': 0.16586, 'R_cap@100': 0.4353, 'R_cap@1000': 0.84842}
#######################################################################################################
Attention: As one language needs to be specified, German was chosen as it is the most common.
MR-BM25_MXD_SLL_1000


  0%|          | 0/2413 [00:00<?, ?docs/s]
que: 100%|██████████| 8/8 [01:49<00:00, 13.73s/it]


{'NDCG@1': 0.22021, 'NDCG@3': 0.19416, 'NDCG@5': 0.18155, 'NDCG@10': 0.18012, 'NDCG@100': 0.28781, 'NDCG@1000': 0.40035}
{'MAP@1': 0.02877, 'MAP@3': 0.05277, 'MAP@5': 0.06582, 'MAP@10': 0.08192, 'MAP@100': 0.11312, 'MAP@1000': 0.12355}
{'Recall@1': 0.02877, 'Recall@3': 0.06997, 'Recall@5': 0.10269, 'Recall@10': 0.16347, 'Recall@100': 0.47183, 'Recall@1000': 0.90584}
{'P@1': 0.22021, 'P@3': 0.18404, 'P@5': 0.16149, 'P@10': 0.12904, 'P@100': 0.04167, 'P@1000': 0.00856}
{'R_cap@1': 0.20679, 'R_cap@3': 0.17516, 'R_cap@5': 0.16214, 'R_cap@10': 0.17292, 'R_cap@100': 0.44308, 'R_cap@1000': 0.85064}


In [None]:
### Subset 100 ###
queries, qrels, corpus = create_subset(100, queries, qrels, corpus)

In [None]:
print("#######################################################################################################")
print("RCE_BM25_MX_100")
model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
run_rce_bm25(queries, qrels, corpus, model_name)
print("#######################################################################################################")
print("SB_R_M1_MX_100")
# Model 1 (M1): paraphrase-albert-small-v2
model_name = "paraphrase-albert-small-v2"
run_sb_r(queries, qrels, corpus, model_name)
print("#######################################################################################################")
print("SB_R_M2_MX_100")
# Model 2 (M2): distiluse-base-multilingual-cased-v1
model_name = 'sentence-transformers/distiluse-base-multilingual-cased-v1'
run_sb_r(queries, qrels, corpus, model_name)

#######################################################################################################
RCE_BM25_MX_100
#######################################################################################################
SB_R_M1_MX_100


  0%|          | 0/525 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:05<00:00,  5.71s/it]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

{'NDCG@1': 0.06931, 'NDCG@3': 0.06931, 'NDCG@5': 0.07174, 'NDCG@10': 0.07365, 'NDCG@100': 0.19027, 'NDCG@1000': 0.19177}
{'MAP@1': 0.01071, 'MAP@3': 0.01854, 'MAP@5': 0.02231, 'MAP@10': 0.02803, 'MAP@100': 0.05022, 'MAP@1000': 0.05046}
{'Recall@1': 0.01071, 'Recall@3': 0.02776, 'Recall@5': 0.04216, 'Recall@10': 0.07312, 'Recall@100': 0.40525, 'Recall@1000': 0.41032}
{'P@1': 0.06931, 'P@3': 0.06601, 'P@5': 0.06535, 'P@10': 0.05347, 'P@100': 0.03604, 'P@1000': 0.00364}
{'R_cap@1': 0.42574, 'R_cap@3': 0.40924, 'R_cap@5': 0.35842, 'R_cap@10': 0.3372, 'R_cap@100': 0.69103, 'R_cap@1000': 0.94839}
#######################################################################################################
SB_R_M2_MX_100


  0%|          | 0/525 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:05<00:00,  5.84s/it]


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

{'NDCG@1': 0.11881, 'NDCG@3': 0.10952, 'NDCG@5': 0.11224, 'NDCG@10': 0.10551, 'NDCG@100': 0.19898, 'NDCG@1000': 0.19985}
{'MAP@1': 0.02043, 'MAP@3': 0.03049, 'MAP@5': 0.03721, 'MAP@10': 0.04471, 'MAP@100': 0.06809, 'MAP@1000': 0.06824}
{'Recall@1': 0.02043, 'Recall@3': 0.03537, 'Recall@5': 0.05329, 'Recall@10': 0.08054, 'Recall@100': 0.35261, 'Recall@1000': 0.35513}
{'P@1': 0.11881, 'P@3': 0.09901, 'P@5': 0.10297, 'P@10': 0.08317, 'P@100': 0.03564, 'P@1000': 0.00359}
{'R_cap@1': 0.42574, 'R_cap@3': 0.40924, 'R_cap@5': 0.35842, 'R_cap@10': 0.3372, 'R_cap@100': 0.69103, 'R_cap@1000': 0.94839}


In [None]:
print("#######################################################################################################")
print("SB_R_M3_MX_SSL_100")
# Model 3 (M3): Stern5497/sBert-swiss-legal-base
model_name = 'Stern5497/sBert-swiss-legal-base'
run_sb_r(queries, qrels, corpus, model_name)
print("#######################################################################################################")
print("RCE_BM25_MX_100")
model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
# `queries` was already shortened with shorten() above, so it is passed directly.
run_rce_bm25(queries, qrels, corpus, model_name)

In [None]:
print("#######################################################################################################")
print("LR-BM25_MX-SLL_100")
run_lr_bm25(queries, qrels, corpus)
print("#######################################################################################################")
print("Attention: As one language needs to be specified, German was chosen as it is the most common.")
print("MR-BM25_MXD_SLL_100")
run_mr_bm25(queries, qrels, corpus, language_long)

#######################################################################################################
LR-BM25_MX-SLL_100


  0%|          | 0/525 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:05<00:00,  5.46s/it]


{'NDCG@1': 0.44792, 'NDCG@3': 0.43547, 'NDCG@5': 0.39547, 'NDCG@10': 0.3697, 'NDCG@100': 0.5127, 'NDCG@1000': 0.59221}
{'MAP@1': 0.0557, 'MAP@3': 0.1331, 'MAP@5': 0.15931, 'MAP@10': 0.19631, 'MAP@100': 0.26777, 'MAP@1000': 0.28189}
{'Recall@1': 0.0557, 'Recall@3': 0.16642, 'Recall@5': 0.21655, 'Recall@10': 0.31046, 'Recall@100': 0.72702, 'Recall@1000': 0.99779}
{'P@1': 0.44792, 'P@3': 0.41667, 'P@5': 0.35417, 'P@10': 0.2625, 'P@100': 0.06917, 'P@1000': 0.00991}
{'R_cap@1': 0.42574, 'R_cap@3': 0.40924, 'R_cap@5': 0.35842, 'R_cap@10': 0.3372, 'R_cap@100': 0.69103, 'R_cap@1000': 0.94839}
#######################################################################################################
Attention: As one language needs to be specified, German was chosen as it is the most common.
MR-BM25_MXD_SLL_100


  0%|          | 0/525 [00:00<?, ?docs/s]
que: 100%|██████████| 1/1 [00:03<00:00,  3.83s/it]

{'NDCG@1': 0.53125, 'NDCG@3': 0.44811, 'NDCG@5': 0.41081, 'NDCG@10': 0.37753, 'NDCG@100': 0.52324, 'NDCG@1000': 0.60228}
{'MAP@1': 0.07362, 'MAP@3': 0.13567, 'MAP@5': 0.16665, 'MAP@10': 0.20199, 'MAP@100': 0.27637, 'MAP@1000': 0.29058}
{'Recall@1': 0.07362, 'Recall@3': 0.15661, 'Recall@5': 0.21891, 'Recall@10': 0.30757, 'Recall@100': 0.72761, 'Recall@1000': 0.99779}
{'P@1': 0.53125, 'P@3': 0.41319, 'P@5': 0.35833, 'P@10': 0.26146, 'P@100': 0.06958, 'P@1000': 0.00991}
{'R_cap@1': 0.50495, 'R_cap@3': 0.40429, 'R_cap@5': 0.3637, 'R_cap@10': 0.3356, 'R_cap@100': 0.69159, 'R_cap@1000': 0.94839}



