# Dense Retrieval with Simple Transformers - Indexing

In [1]:
import logging
from simpletransformers.retrieval import RetrievalModel, RetrievalArgs


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)



In [2]:
# Path to the trained model
model_type = "custom"
model_name = "outputs"

# Path to the Wikipedia passage collection
index_path = "indices/wiki_passages_to_build_index"

We load the model and pass the path to the passage collection to `prediction_passages`. This automatically builds the index and saves it to `<output_dir>/prediction_passage_dataset`.

In [3]:
model_args = RetrievalArgs()
model_args.output_dir = model_name

# Loading the model with prediction_passages automatically builds the index
model = RetrievalModel(
    model_type=model_type,
    model_name=model_name,
    args=model_args,
    prediction_passages=index_path,
)

INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages started
INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages completed
INFO:simpletransformers.retrieval.retrieval_utils:Generating embeddings for prediction passages started
  0%|          | 207/164183 [00:41<9:03:42,  5.03ba/s] 


KeyboardInterrupt: 

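The run above was stopped with a `KeyboardInterrupt`: the progress bar estimated roughly nine hours to embed the full passage collection. Once the index has been built, calling `model.predict()` with a list of queries retrieves the matching passages from it.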
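To make the retrieval step concrete, here is a minimal sketch of what a dense retriever does at query time: it scores each passage embedding against the query embedding with a dot product (the similarity DPR-style models are trained with) and keeps the highest-scoring passages. This is an illustration in plain Python, not Simple Transformers' implementation. The helper names `dot` and `top_k_passages` and the toy three-dimensional vectors are hypothetical, and a real index would use FAISS rather than a brute-force scan.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(query_vec, passage_vecs, passages, k=2):
    """Return the k (score, passage) pairs with the largest dot-product score."""
    scored = [(dot(query_vec, vec), text) for vec, text in zip(passage_vecs, passages)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:k]


# Toy "embeddings" (hypothetical values, for illustration only)
passages = ["passage about dunes", "passage about spices", "passage about recall"]
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]
query_vec = [0.2, 0.9, 0.0]

results = top_k_passages(query_vec, passage_vecs, passages, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

In the toy data the query vector points mostly along the second dimension, so the passage whose embedding is largest in that dimension is ranked first.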

## Loading the model with the generated index

In [3]:
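# Use the pretrained DPR encoders from the Hugging Face Hub instead of a local checkpoint
model_type = "dpr"
model_name = None
context_name = "facebook/dpr-ctx_encoder-single-nq-base"
query_name = "facebook/dpr-question_encoder-single-nq-base"


# Point prediction_passages at a previously built passage dataset so the index is loaded, not rebuilt
model = RetrievalModel(
    model_type=model_type,
    model_name=model_name,
    context_encoder_name=context_name,
    query_encoder_name=query_name,
    prediction_passages="indices/wiki_prediction_passage_dataset",
)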

INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages started
INFO:simpletransformers.retrieval.retrieval_utils:Preparing prediction passages completed
INFO:simpletransformers.retrieval.retrieval_utils:Loaded FAISS index from trained_model/prediction_passage_dataset/hf_dataset_index.faiss


In [4]:
docs, *_ = model.predict(["What is precision and recall?"])
docs[0]

Generating query embeddings: 1it [00:00,  1.23it/s]
Retrieving docs: 100%|██████████| 1/1 [00:11<00:00, 11.13s/it]


['"Precision and recall" weighted harmonic mean of precision and recall with weights formula_20. Their relationship is formula_21 where formula_22. There are other parameters and strategies for performance metric of information retrieval system, such as the area under the ROC curve (AUC). For web document retrieval, if the user\'s objectives are not clear, the precision and recall can\'t be optimized. As summarized by Lopresti, Precision and recall In pattern recognition, information retrieval and binary classification, precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that',
 '"Precision and recall" each other as ROC curves and provide a principled mechanism to explore operating point tradeoffs. Outside of Information Retrieval, the application of Recall, Precision and F-measure are argued to be flawed as they ignore the true negative cell

In [5]:
docs, *_ = model.predict(["What is spice?"])
docs[0][:5]

Generating query embeddings: 1it [00:00, 35.16it/s]
Retrieving docs: 100%|██████████| 1/1 [00:10<00:00, 10.89s/it]


['"The Spice Trail" be patronising and mention a Blue Peter style of reporting. Though every so often, she took a nibble out of a hard-edged news story – the disease wrecking India’s pepper crops and driving farmers to suicide, the rip-off deals that turn spice workers into virtual slaves – she quickly skipped merrily on, playing the colonial grande dame out to charm the exotic locals. "The Express" gave a factual account of the show and made it one of their "picks of the day" KATE Humble travels along India’s fascinating Spice Coast to uncover the story of pepper, once known as black',
 '"ISO/IEC 15504" standard and used the acronym SPICE. SPICE initially stood for "Software Process Improvement and Capability Evaluation", but in consideration of French concerns over the meaning of "evaluation", SPICE has now been renamed "Software Process Improvement and Capability Determination". SPICE is still used for the user group of the standard, and the title for the annual conference. The firs

In [6]:
docs, *_ = model.predict(["What is a dune?"])
docs[0]

Generating query embeddings: 0it [00:00, ?it/s]

Retrieving docs:   0%|          | 0/1 [00:00<?, ?it/s]

["the Arabian Peninsula, a vast erg, called the Rub' al Khali or Empty Quarter, contains seif dunes that stretch for almost 200 km and reach heights of over 300 m. Linear loess hills known as pahas are superficially similar. These hills appear to have been formed during the last ice age under permafrost conditions dominated by sparse tundra vegetation. Radially symmetrical, star dunes are pyramidal sand mounds with slipfaces on three or more arms that radiate from the high center of the mound. They tend to accumulate in areas with multidirectional wind regimes. Star dunes grow upward rather than laterally.",
 'They dominate the Grand Erg Oriental of the Sahara. In other deserts, they occur around the margins of the sand seas, particularly near topographic barriers. In the southeast Badain Jaran Desert of China, the star dunes are up to 500 metres tall and may be the tallest dunes on Earth. Oval or circular mounds that generally lack a slipface. Dome dunes are rare and occur at the far 

In [7]:
docs, *_ = model.predict(["What is spice in dune?"])
docs[0]

Generating query embeddings: 0it [00:00, ?it/s]

Retrieving docs:   0%|          | 0/1 [00:00<?, ?it/s]

['surface. In "Dune", the desert of Arrakis is the only known source of the spice melange, the most essential and valuable commodity in the universe. Melange is a geriatric drug that gives the user a longer life span, greater vitality, and heightened awareness; it can also unlock prescience in some subjects, depending upon the dosage and the consumer\'s physiology. This prescience-enhancing property makes interstellar travel ("folding space") possible. Melange comes with a steep price however: it is highly addictive, and withdrawal is a fatal process. A by-product of the sandworm life cycle, sandtrout excretions exposed to water become a pre-spice',
 'the new leader of the Empire of Man. Sandworm (Dune) A sandworm is a fictional creature that appears in the "Dune" novels written by Frank Herbert. Sandworms are colossal worm-like creatures that live on the hot desert planet Arrakis. Arrakis is the only known source in the Universe of the spice "melange", a drug highly prized for its med
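
The raw outputs above are long and hard to compare across queries. A small helper along the following lines could print a truncated preview of the top passages for several queries at once. It is only a sketch: the name `preview` is hypothetical, and it assumes `docs` is a list containing one list of passage strings per query, as the outputs above suggest. The stand-in data below is placeholder text, not real retrieval output.

```python
import textwrap


def preview(queries, docs, k=3, width=80):
    """Format the first k passages for each query, shortened to `width` characters."""
    lines = []
    for query, passages in zip(queries, docs):
        lines.append(f"Query: {query}")
        for rank, passage in enumerate(passages[:k], start=1):
            # shorten() collapses whitespace and cuts on a word boundary
            snippet = textwrap.shorten(passage, width=width, placeholder=" ...")
            lines.append(f"  {rank}. {snippet}")
    return "\n".join(lines)


# Placeholder data standing in for the result of model.predict(queries)
queries = ["What is spice?", "What is spice in dune?"]
docs = [
    ["placeholder passage one " * 10, "placeholder passage two"],
    ["placeholder passage three"],
]

text = preview(queries, docs, k=2, width=40)
print(text)
```

With a loaded model, the same helper could be called as `preview(queries, model.predict(queries)[0])`, assuming the first returned element is the nested list of passages used throughout this notebook.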