## GETTING HARD NEGATIVES

In [1]:
import logging
from simpletransformers.retrieval import RetrievalModel, RetrievalArgs


In [2]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [3]:
import torch
import pandas as pd
import numpy as np


In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [5]:
claim_df = pd.read_parquet('processed_df/claim_df.parquet')
wiki_df = pd.read_parquet('processed_df/wiki_df.parquet')

In [6]:
queries = list(set(claim_df['claim']))
print(len(queries))

115305


In [14]:
passages = list(set(wiki_df['text']))
print(len(passages))

3715301


In [6]:
model_type = "dpr"
context_name = "facebook/dpr-ctx_encoder-single-nq-base"
query_name = "facebook/dpr-question_encoder-single-nq-base"

model_args = RetrievalArgs()

# Create a RetrievalModel from the pretrained DPR context and query encoders
model = RetrievalModel(
    model_type=model_type,
    context_encoder_name=context_name,
    query_encoder_name=query_name,
    args=model_args
)

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.weight', 'ctx_encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokeniz

In [5]:
merged_df = pd.read_parquet('processed_df/merged_df.parquet')


In [6]:
train_data = merged_df[['claim', 'wiki_text', 'wiki_title']]

In [7]:
len(train_data)

102748

In [8]:
# Rename without inplace=True: train_data is a slice of merged_df, and an in-place rename triggers SettingWithCopyWarning.
train_data = train_data.rename(columns={'claim': 'query_text', 'wiki_text': 'gold_passage', 'wiki_title': 'title'})


In [9]:
train_data.head()

Unnamed: 0,query_text,gold_passage,title
0,nikolaj coster waldau worked with the fox broa...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
1,nikolaj coster waldau was not in a danish thri...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
2,nikolaj coster waldau worked with peter dinklage,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
3,nikolaj coster waldau refused to ever work wit...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau
4,nikolaj coster waldau was in a film,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau


In [11]:
queries = train_data['query_text'].tolist()
len(queries)

102748

In [12]:
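# Unique gold passages in first-seen order (a set would give a different order on each run).
passages = train_data['gold_passage'].drop_duplicates().tolist()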
len(passages)

9528

In [17]:
# The hard negatives will be written to the output dir by default.
hard_df = model.build_hard_negatives(
    queries=queries,
    passage_dataset=passages,
    retrieve_n_docs=1,
#     output_dir = '/home/rahvk/tmp/cache/hard_neg'
)


INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages started
INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages completed
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for prediction passages started


Map:   0%|          | 0/9528 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for prediction passages completed


Saving the dataset (0/1 shards):   0%|          | 0/9528 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to prediction passages


  0%|          | 0/10 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to prediction passages completed


Generating query embeddings: 0it [00:00, ?it/s]

Retrieving docs:   0%|          | 0/201 [00:00<?, ?it/s]

In [18]:
len(hard_df)

102748

In [10]:
# Save the hard negatives; run this in the same session that built hard_df.
hard_df.to_parquet('processed_df/hard_df.parquet')

## PREPARING DF TO TRAIN ON HARD NEGATIVES

In [11]:
hard_df = pd.read_parquet('processed_df/hard_df.parquet')


In [12]:
len(train_data)

102748

In [13]:
# hard_df holds one hard negative per query (retrieve_n_docs=1); take its only column, aligned on the index.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
train_data = train_data.assign(hard_negative=hard_df.iloc[:, 0])


In [14]:
train_data.head()

Unnamed: 0,query_text,gold_passage,title,hard_negative
0,nikolaj coster waldau worked with the fox broa...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,frederick fred seibert lrb born september 15 1...
1,nikolaj coster waldau was not in a danish thri...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,lars von trier lrb lars trier 30 april 1956 rr...
2,nikolaj coster waldau worked with peter dinklage,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,arthur schopenhauer lrb lsb a t o pm ha rsb 22...
3,nikolaj coster waldau refused to ever work wit...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,harry herbert frazee lrb june 29 1880 june 4 1...
4,nikolaj coster waldau was in a film,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,armin mueller stahl lrb born 17 december 1930 ...
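Since `build_hard_negatives` was called with only the queries and the passage pool, it may be worth checking that the retrieved hard negative is never the gold passage itself. The following is a minimal sketch of such a check on plain Python lists; the helper name `gold_overlap_rate` and the toy rows are hypothetical and not part of the notebook.

```python
def gold_overlap_rate(gold_passages, hard_negatives):
    """Return the fraction of rows whose hard negative equals the gold passage."""
    if len(gold_passages) != len(hard_negatives):
        raise ValueError("both columns must have the same number of rows")
    if not gold_passages:
        return 0.0
    clashes = sum(
        1
        for gold, neg in zip(gold_passages, hard_negatives)
        if gold.strip() == neg.strip()
    )
    return clashes / len(gold_passages)


# Toy rows standing in for the gold_passage and hard_negative columns.
toy_gold = ["passage a", "passage b", "passage c", "passage d"]
toy_neg = ["passage x", "passage b", "passage y", "passage z"]

overlap = gold_overlap_rate(toy_gold, toy_neg)
print(f"{overlap:.2%} of hard negatives equal the gold passage")
```

On the real data the inputs would be `train_data['gold_passage'].tolist()` and `train_data['hard_negative'].tolist()`; any rows that clash could be dropped or given a different negative before training.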


In [15]:
train_data.to_parquet('processed_df/train_data.parquet')

In [17]:
# %pip install simpletransformers

In [10]:
train_data = pd.read_parquet('processed_df/train_data.parquet')


In [11]:
train_data.head()

Unnamed: 0,query_text,gold_passage,title,hard_negative
0,nikolaj coster waldau worked with the fox broa...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,frederick fred seibert lrb born september 15 1...
1,nikolaj coster waldau was not in a danish thri...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,lars von trier lrb lars trier 30 april 1956 rr...
2,nikolaj coster waldau worked with peter dinklage,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,arthur schopenhauer lrb lsb a t o pm ha rsb 22...
3,nikolaj coster waldau refused to ever work wit...,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,harry herbert frazee lrb june 29 1880 june 4 1...
4,nikolaj coster waldau was in a film,nikolaj coster waldau lrb lsb ne ola k sd ald ...,Nikolaj_Coster-Waldau,armin mueller stahl lrb born 17 december 1930 ...


In [12]:
from simpletransformers.retrieval import RetrievalModel


model_type = "dpr"
context_encoder_name = "facebook/dpr-ctx_encoder-single-nq-base"
question_encoder_name = "facebook/dpr-question_encoder-single-nq-base"

model = RetrievalModel(
    model_type=model_type,
    context_encoder_name=context_encoder_name,
    query_encoder_name=question_encoder_name,
    args={
        "hard_negatives": True,
        "include_title": True,
        "num_train_epochs": 5,
        "save_model_every_epoch": True,
        "save_steps": -1,
    },
)

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.weight', 'ctx_encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokeniz

In [13]:
model.args

RetrievalArgs(adafactor_beta1=None, adafactor_clip_threshold=1.0, adafactor_decay_rate=-0.8, adafactor_eps=(1e-30, 0.001), adafactor_relative_step=True, adafactor_scale_parameter=True, adafactor_warmup_init=True, adam_betas=(0.9, 0.999), adam_epsilon=1e-08, best_model_dir='outputs/best_model', cache_dir='cache_dir/', config={}, cosine_schedule_num_cycles=0.5, custom_layer_parameters=[], custom_parameter_groups=[], dataloader_num_workers=0, do_lower_case=False, dynamic_quantize=False, early_stopping_consider_epochs=False, early_stopping_delta=0, early_stopping_metric='eval_loss', early_stopping_metric_minimize=True, early_stopping_patience=3, encoding=None, eval_batch_size=8, evaluate_during_training=False, evaluate_during_training_silent=True, evaluate_during_training_steps=2000, evaluate_during_training_verbose=False, evaluate_each_epoch=True, fp16=True, gradient_accumulation_steps=1, learning_rate=4e-05, local_rank=-1, logging_steps=50, loss_type=None, loss_args={}, manual_seed=None,

In [14]:
hist = model.train_model(
    train_data,
    output_dir='/home/rahvk/data/tmp/cache/model-hn-5',
    show_running_loss=True,
    use_cuda=True,
)
# hist

Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_model: Training started


Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/12844 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-2000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-4000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-6000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-8000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-10000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-12000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-12844-epoch-1


Running Epoch 1 of 5:   0%|          | 0/12844 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-14000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-16000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-18000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-20000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-22000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-24000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-25688-epoch-2


Running Epoch 2 of 5:   0%|          | 0/12844 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-26000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-28000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-30000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-32000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-34000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-36000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-38000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-38532-epoch-3


Running Epoch 3 of 5:   0%|          | 0/12844 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-40000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-42000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-44000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-46000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-48000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-50000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-51376-epoch-4


Running Epoch 4 of 5:   0%|          | 0/12844 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-52000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-54000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-56000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-58000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-60000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-62000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-64000
INFO:simpletransformers.retrieval.retrieval_model:Saving model into /home/rahvk/data/tmp/cache/model-hn-5/checkpoint-64220-epoch-5
INFO:sim

In [15]:
global_step, training_loss = hist

In [16]:
print(training_loss)

0.1602270204794271
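
The step counts in the training log can be reproduced with a little arithmetic. The sketch below assumes a training batch size of 8, which is not visible in the truncated `model.args` output above, so treat that value as an assumption; the example count and number of epochs come from the notebook.

```python
import math

num_examples = 102748   # rows in train_data
train_batch_size = 8    # ASSUMPTION: not shown in the truncated args output
num_epochs = 5          # "num_train_epochs" passed to RetrievalModel

# Each epoch runs one optimizer step per batch; the last batch may be partial.
steps_per_epoch = math.ceil(num_examples / train_batch_size)

# Global step reached at the end of each epoch.
epoch_end_steps = [steps_per_epoch * (epoch + 1) for epoch in range(num_epochs)]

print(steps_per_epoch)   # 12844
print(epoch_end_steps)   # [12844, 25688, 38532, 51376, 64220]
```

Under that assumption the values match the `12844` steps per epoch in the progress bars and the `checkpoint-<step>-epoch-<n>` names in the log.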


## LOADING PRETRAINED MODEL

In [20]:
from simpletransformers.retrieval import RetrievalModel


model_type = "dpr"
model_name = "/home/rahvk/data/tmp/cache/model-hn-5/checkpoint-12844-epoch-1"

# Initialize a RetrievalModel from the checkpoint saved after the first training epoch
model = RetrievalModel(
    model_type=model_type,
    model_name=model_name,
)

## EVALUATE THE MODEL

In [None]:
# Note: this evaluates on the training set itself, so the scores will be optimistic.
result = model.eval_model(train_data)
# result

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset


Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Loading evaluation passages to a Huggingface Dataset completed.
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages


Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for evaluation passages completed.


Saving the dataset (0/1 shards):   0%|          | 0/102748 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages


  0%|          | 0/103 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to evaluation passages completed.


Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

Map:   0%|          | 0/102748 [00:00<?, ? examples/s]

Running Evaluation:   0%|          | 0/12844 [00:00<?, ?it/s]

Retrieving docs:   0%|          | 0/201 [00:00<?, ?it/s]

In [None]:
res, doc_ids, doc_vectors, doc_dicts = result

In [None]:
res

In [26]:
doc_ids.shape

(102748, 10)

In [27]:
doc_ids[0]

array([3440., 3441., 3442., 3443., 3444., 3445., 3446., 3447., 3448.,
       3449.])

## MAKING PREDICTIONS

In [29]:
to_predict = queries[:5]

predicted_passages, doc_ids, doc_vectors, doc_dicts = model.predict(to_predict, prediction_passages=passages)

INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages started
INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages completed
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for prediction passages started


Map:   0%|          | 0/9528 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for prediction passages completed


Saving the dataset (0/1 shards):   0%|          | 0/9528 [00:00<?, ? examples/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to prediction passages


  0%|          | 0/10 [00:00<?, ?it/s]

INFO:simpletransformers.retrieval.retrieval_utils:Adding FAISS index to prediction passages completed


Generating query embeddings: 0it [00:00, ?it/s]

Retrieving docs:   0%|          | 0/1 [00:00<?, ?it/s]

In [31]:
to_predict

['nikolaj coster waldau worked with the fox broadcasting company',
 'nikolaj coster waldau was not in a danish thriller film',
 'nikolaj coster waldau worked with peter dinklage',
 'nikolaj coster waldau refused to ever work with the fox broadcasting company',
 'nikolaj coster waldau was in a film']

In [33]:
len(predicted_passages)

5

In [None]:
predicted_passages
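
One way to judge the predictions is to check whether each query's gold passage appears among its top retrieved passages. The sketch below assumes `predicted_passages` is a list with one inner list of passage strings per query; the helper name `hit_at_k` and the toy data are hypothetical.

```python
def hit_at_k(retrieved, gold, k=10):
    """Fraction of queries whose gold passage is among the first k retrieved passages."""
    if len(retrieved) != len(gold):
        raise ValueError("retrieved and gold must have one entry per query")
    if not gold:
        return 0.0
    hits = sum(
        1 for passages, target in zip(retrieved, gold) if target in passages[:k]
    )
    return hits / len(gold)


# Toy data: three queries, three retrieved passages each.
toy_retrieved = [["p1", "p7", "p3"], ["p4", "p2", "p9"], ["p5", "p6", "p8"]]
toy_targets = ["p7", "p9", "p0"]

print(hit_at_k(toy_retrieved, toy_targets, k=1))
print(hit_at_k(toy_retrieved, toy_targets, k=3))
```

In this notebook the inputs would be `predicted_passages` and the first five values of `train_data['gold_passage']`, assuming those rows line up with `to_predict`.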