## GETTING HARD NEGATIVES

In [1]:
import logging
from simpletransformers.retrieval import RetrievalModel, RetrievalArgs


In [2]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [3]:
import torch
import pandas as pd
import numpy as np


In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [5]:
claim_df = pd.read_parquet('processed_df/claim_df.parquet')
wiki_df = pd.read_parquet('processed_df/wiki_df.parquet')

In [6]:
queries = list(set(claim_df['claim']))
print(len(queries))

115305


In [14]:
passages = list(set(wiki_df['text']))
print(len(passages))

3715301


In [6]:
model_type = "dpr"
context_name = "facebook/dpr-ctx_encoder-single-nq-base"
query_name = "facebook/dpr-question_encoder-single-nq-base"

model_args = RetrievalArgs()

# Create a RetrievalModel (DPR context and query encoders)
model = RetrievalModel(
    model_type=model_type,
    context_encoder_name=context_name,
    query_encoder_name=query_name,
    args=model_args
)

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.weight', 'ctx_encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokeniz

In [7]:
merged_df = pd.read_parquet('processed_df/merged_df.parquet')


In [8]:
train_data = merged_df[['claim', 'wiki_text', 'wiki_title']]

In [9]:
len(train_data)

102748

In [10]:
# Rename to the column names the retrieval model expects. Reassigning the result
# (instead of renaming in place on a slice of merged_df) avoids pandas'
# SettingWithCopyWarning.
train_data = train_data.rename(
    columns={'claim': 'query_text', 'wiki_text': 'gold_passage', 'wiki_title': 'title'}
)


In [11]:
train_data.head()

Unnamed: 0,query_text,gold_passage,title
0,nikolaj coster waldau worked with the fox broa...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
1,nikolaj coster waldau was not in a danish thri...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
2,nikolaj coster waldau worked with peter dinklage,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
3,nikolaj coster waldau refused to ever work wit...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
4,nikolaj coster waldau was in a film,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau


In [12]:
queries = train_data['query_text'].tolist()
len(queries)

102748

In [13]:
passages = list(set(train_data['gold_passage'].tolist()))
len(passages)

9528
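
`list(set(...))` removes duplicate passages, but the resulting order can change between Python sessions because string hashing is randomized by default. If the passage order matters (for instance, if list positions are later used as passage ids), an order-preserving deduplication is one option. A minimal sketch using only the standard library; `dedupe_keep_order` is a hypothetical helper name, not part of simpletransformers, and the example strings are made up:

```python
def dedupe_keep_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Toy stand-in for train_data['gold_passage'].tolist()
example_passages = ["b passage", "a passage", "b passage", "c passage", "a passage"]
unique_passages = dedupe_keep_order(example_passages)
print(unique_passages)
```

The result has the same elements as `set(...)` would give, so the passage count is unchanged; only the ordering becomes reproducible.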

In [17]:
# The hard negatives will be written to the output dir by default.
hard_df = model.build_hard_negatives(
    queries=queries,
    passage_dataset=passages,
    retrieve_n_docs=1,
#     output_dir = '/home/rahvk/tmp/cache/hard_neg'
)


INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages started
INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages completed
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for prediction passages started


Map:   0%|          | 0/9528 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for prediction passages completed


Saving the dataset (0/1 shards):   0%|          | 0/9528 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to prediction passages


  0%|          | 0/10 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to prediction passages completed


Generating query embeddings: 0it [00:00, ?it/s]

Retrieving docs:   0%|          | 0/201 [00:00<?, ?it/s]

In [18]:
len(hard_df)

102748

In [19]:
hard_df.to_parquet('processed_df/hard_df.parquet')

In [14]:
hard_df = pd.read_parquet('processed_df/hard_df.parquet')


In [20]:
len(train_data)

102748

In [15]:
# hard_df holds one retrieved passage per query, in the same order as `queries`,
# so attach its single column by position. assign() returns a new frame, which
# avoids the SettingWithCopyWarning raised when writing into a slice of merged_df.
train_data = train_data.assign(hard_negative=hard_df.iloc[:, 0].to_numpy())


In [16]:
train_data.head()

Unnamed: 0,query_text,gold_passage,title,hard_negative
0,nikolaj coster waldau worked with the fox broa...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,frederick fred seibert lrb born september 15 1...
1,nikolaj coster waldau was not in a danish thri...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,lars von trier lrb lars trier 30 april 1956 rr...
2,nikolaj coster waldau worked with peter dinklage,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,arthur schopenhauer lrb lsb a t o pm ha rsb 22...
3,nikolaj coster waldau refused to ever work wit...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,harry herbert frazee lrb june 29 1880 june 4 1...
4,nikolaj coster waldau was in a film,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,armin mueller stahl lrb born 17 december 1930 ...
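
With `retrieve_n_docs=1`, the single retrieved passage can be the gold passage itself whenever the retriever already ranks it first, and in that case it would not be a true negative. One way to guard against this, sketched below under assumptions, is to retrieve several candidates per query and keep the first one that differs from the gold passage. `pick_hard_negative` is a hypothetical helper and the toy strings are made up for illustration; whether `build_hard_negatives` already filters out gold passages is not shown in this notebook.

```python
from typing import Optional, Sequence


def pick_hard_negative(candidates: Sequence[str], gold: str) -> Optional[str]:
    """Return the highest-ranked candidate that is not the gold passage."""
    for passage in candidates:
        if passage != gold:
            return passage
    return None  # every candidate matched the gold passage


# Toy data: ranked candidates per query (best first) and the gold passage.
ranked = [
    ["gold A", "near miss A", "other"],
    ["near miss B", "gold B"],
    ["gold C"],
]
golds = ["gold A", "gold B", "gold C"]

hard_negatives = [pick_hard_negative(c, g) for c, g in zip(ranked, golds)]
print(hard_negatives)
```

Rows for which the helper returns `None` would need either more retrieved candidates or removal from the training set.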


In [None]:
# %pip install simpletransformers

In [17]:
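from simpletransformers.retrieval import RetrievalModel, RetrievalArgs


model_type = "dpr"
context_encoder_name = "facebook/dpr-ctx_encoder-single-nq-base"
question_encoder_name = "facebook/dpr-question_encoder-single-nq-base"

# hard_negatives and include_title are RetrievalArgs options, so set them on
# the args object rather than passing them as constructor keyword arguments.
model_args = RetrievalArgs()
model_args.hard_negatives = True
model_args.include_title = True

model = RetrievalModel(
    model_type=model_type,
    context_encoder_name=context_encoder_name,
    query_encoder_name=question_encoder_name,
    args=model_args,
)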

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.weight', 'ctx_encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokeniz

In [19]:
hist = model.train_model(
    train_data,
    output_dir='/home/rahvk/data/tmp/cache/model-it',
    show_running_loss=True,
    use_cuda=True,
)
hist

Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_model: Training started


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/12844 [00:00<?, ?it/s]

  (max_idxs == torch.tensor(labels)).sum().cpu().detach().numpy().item()
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-it/checkpoint-2000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-it/checkpoint-4000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-it/checkpoint-6000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-it/checkpoint-8000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-it/checkpoint-10000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-it/checkpoint-12000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-it/checkpoint-12844-epoch-1
INFO:simpletransformers.retrieval.retrieval_model:Saving model into outputs/

(12844, 0.274144477266532)
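
The step count in the log is consistent with simple arithmetic: 102748 training rows in batches of 8 give 12844 steps for one epoch, and a checkpoint every 2000 steps yields the six intermediate checkpoints listed above. The batch size of 8 and the 2000-step save interval are inferred from the output here, not read from the model configuration, and gradient accumulation is assumed to be 1.

```python
import math

num_examples = 102748   # len(train_data), from the notebook
batch_size = 8          # assumption: inferred from 12844 steps per epoch
save_steps = 2000       # assumption: inferred from the checkpoint names

# The last batch is partial, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
# Steps at which an intermediate checkpoint would be written.
checkpoint_steps = list(range(save_steps, steps_per_epoch + 1, save_steps))

print(steps_per_epoch)
print(checkpoint_steps)
```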

In [20]:
global_step, training_loss = hist

In [21]:
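# Note: this evaluates on the training data, so the scores measure fit to the
# training set rather than generalization.
result = model.eval_model(train_data)
# result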

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset


Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset completed.
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages


Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages completed.


Saving the dataset (0/2 shards):   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages


  0%|          | 0/103 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages completed.


Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

Running Evaluation:   0%|          | 0/12844 [00:00<?, ?it/s]

Retrieving docs:   0%|          | 0/201 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model:{'eval_loss': 8.48934105770011, 'mrr@1': 0.34977809787051817, 'mrr@2': 0.36124304122707984, 'mrr@3': 0.3646105033674622, 'mrr@5': 0.3672037411920427, 'mrr@10': 0.3689218644660599, 'top_1_accuracy': 0.34977809787051817, 'top_2_accuracy': 0.3727079845836415, 'top_3_accuracy': 0.38281037100478843, 'top_5_accuracy': 0.39420718651457937, 'top_10_accuracy': 0.40698602405886247}
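
The reported metrics follow the usual retrieval definitions: top-k accuracy is the fraction of queries whose gold passage appears among the first k retrieved results, and MRR@k averages the reciprocal rank of the gold passage, counting 0 when it is not in the first k. This also explains why `mrr@1` and `top_1_accuracy` are identical in the log above. The sketch below illustrates those definitions on toy ids; it is not simpletransformers' own implementation, and the function names are hypothetical.

```python
def top_k_accuracy(retrieved, gold, k):
    """Fraction of queries whose gold id is among the first k retrieved ids."""
    hits = sum(1 for ids, g in zip(retrieved, gold) if g in ids[:k])
    return hits / len(gold)


def mrr_at_k(retrieved, gold, k):
    """Mean reciprocal rank of the gold id, 0 if it is not in the top k."""
    total = 0.0
    for ids, g in zip(retrieved, gold):
        top = ids[:k]
        if g in top:
            total += 1.0 / (top.index(g) + 1)
    return total / len(gold)


# Toy example: ranked ids for 4 queries and the gold id of each.
retrieved = [[7, 2, 9], [4, 1, 3], [5, 6, 8], [0, 2, 1]]
gold = [7, 1, 8, 3]

print(top_k_accuracy(retrieved, gold, 1), mrr_at_k(retrieved, gold, 1))
print(top_k_accuracy(retrieved, gold, 3), mrr_at_k(retrieved, gold, 3))
```

Because a hit at rank r contributes 1/r instead of 1, MRR@k can never exceed top-k accuracy, which matches the pattern in the logged values.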


In [None]:
result

In [None]:
res, doc_ids, doc_vectors, doc_dicts = result

In [None]:
res

In [19]:
doc_ids.shape

(102751, 10)

In [20]:
doc_ids[0]

array([ 3442., 14938.,  3440.,  3447., 69013.,  3448.,  3443., 14943.,
       69021., 14945.])
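
`doc_ids` holds, for each query, the ids of the top 10 retrieved passages as floats. To inspect what was retrieved, the ids can be cast to integers and looked up in the passage collection. The sketch below assumes that an id is the row position of a passage in the evaluation passage set and that a negative id marks an empty slot (FAISS returns -1 when it finds fewer results than requested); both points should be verified against the library, and `lookup_passages` is a hypothetical helper.

```python
def lookup_passages(id_row, passages):
    """Map one row of retrieved ids (possibly floats) to passage texts."""
    texts = []
    for raw_id in id_row:
        idx = int(raw_id)
        if idx < 0:  # assumption: a negative id means "no result"
            continue
        texts.append(passages[idx])
    return texts


# Toy stand-ins for the evaluation passages and one row of doc_ids.
toy_passages = ["passage zero", "passage one", "passage two", "passage three"]
toy_row = [2.0, 0.0, -1.0, 3.0]

print(lookup_passages(toy_row, toy_passages))
```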