## Import Necessary Libraries
Run this cell to import the libraries used by the Frequently Requested Documents Model Testing Notebook.

In [1]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import pandas as pd
import os
import pickle

from frequently_requested_docs.docs_helper import getModel, getSaveName, loadEmbeddings, getEmbeddingPath, test_sentence
from frequently_requested_docs.docs_config import TOP_K, MODEL_NAMES, DATA_CSV_PATH

## Model Selection and Initialization

Run these cells to select and initialize the model you wish to test. Set m to a number from 0 to 8, matching the model's position in the model_name list.

In [2]:
model_name = [
    'nli-mpnet-base-v2',
    'nli-roberta-base-v2',
    'princeton-nlp/sup-simcse-roberta-large',
    'princeton-nlp/unsup-simcse-roberta-large',
    'stsb-distilroberta-base-v2',
    'stsb-mpnet-base-v2',
    'stsb-roberta-base',
    'stsb-roberta-base-v2',
    'stsb-roberta-large',
]

m = 0  # index into model_name (0 to 8)

In [3]:
# Resolve the save name for the selected model, then load the model
save_name = getSaveName(model_name[m])
model = getModel(model_name[m], save_name)

Loading model from disc


## Initialize and Load Corpus Embeddings
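Run this cell to read the Frequently Requested Documents dataset and load its corpus embeddings. Pre-computed embeddings are loaded from disk when they exist; otherwise they are encoded with the selected model.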

In [4]:
# Collect the corpus rows whose 'Document' field is a string
corpus_docs = []
data = pd.read_csv(DATA_CSV_PATH)
data = data.reset_index(drop=True)  # reset_index returns a new DataFrame, so assign it

for _, row in data.iterrows():
    if isinstance(row['Document'], str):
        corpus_docs.append(row)

# Load corpus embeddings if they exist; otherwise encode them
embedding_path = getEmbeddingPath(save_name)
corpus_docs, corpus_embeddings = loadEmbeddings(model, embedding_path, corpus_docs)

Loading pre-computed embeddings from disc
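
The load-or-encode behavior described above can be sketched as a simple pickle cache. This is only an illustration: `load_or_encode`, `fake_encode`, and the cache layout are hypothetical names invented here, and the notebook's `loadEmbeddings` helper may work differently.

```python
import os
import pickle
import tempfile


def load_or_encode(encode_fn, embedding_path, docs):
    """Return (docs, embeddings), reusing a pickle cache when one exists.

    Hypothetical sketch: the real loadEmbeddings helper is not shown in
    this notebook and may store its data in another format.
    """
    if os.path.exists(embedding_path):
        # Cache hit: read the stored documents and embeddings back from disk.
        with open(embedding_path, "rb") as f:
            cached = pickle.load(f)
        return cached["docs"], cached["embeddings"]

    # Cache miss: encode once, then store the result for later runs.
    embeddings = encode_fn(docs)
    with open(embedding_path, "wb") as f:
        pickle.dump({"docs": docs, "embeddings": embeddings}, f)
    return docs, embeddings


# Stand-in encoder (an assumption) that records how often it is called.
calls = []


def fake_encode(texts):
    calls.append(len(texts))
    return [[float(len(t)), 1.0] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.pkl")
    docs_a, emb_a = load_or_encode(fake_encode, path, ["alpha", "be"])
    docs_b, emb_b = load_or_encode(fake_encode, path, ["alpha", "be"])
```

The second call finds the cache file and skips encoding, which is why switching between test sentences does not require re-encoding the corpus.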


## Test Sentence Selection
Run these cells to select and embed a test sentence. Set i to a number from 0 to 2, matching the sentence's position in the examples list.
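
The test cell prints the TOP_K corpus entries, each with a similarity score. The `test_sentence` helper is not shown in this notebook; the following is a minimal sketch, assuming it ranks corpus embeddings by cosine similarity to the query embedding. The names `cosine_similarity`, `top_k_similar`, and the toy vectors are hypothetical. The `util` module imported above also provides `semantic_search`, which performs this kind of ranking on real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query_embedding, corpus_embeddings, k):
    """Return (corpus_index, score) pairs for the k highest cosine scores."""
    scored = [
        (idx, cosine_similarity(query_embedding, emb))
        for idx, emb in enumerate(corpus_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings standing in for real model output.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.0]

ranked = top_k_similar(toy_query, toy_corpus, k=2)
for idx, score in ranked:
    print(f"doc {idx} (Score: {score:.4f})")
```

Under this assumption, a higher score means the corpus entry points in a direction closer to the query in embedding space, so entries are listed from most to least similar.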

In [5]:
examples = [
    'I am searching for the Detention Facility Reviews for the Randall County Jail in Amarillo, Texas',
    'Statements made by former georgia senator david perdue about visas.',
    "All documents regarding the TSA’s throughput data for August 2017",
]

i = 0


In [6]:
sentence = examples[i]

test_sentence(sentence, model, corpus_docs, corpus_embeddings, TOP_K)

Sentence: I am searching for the Detention Facility Reviews for the Randall County Jail in Amarillo, Texas 

Top 25 most similar sentences in corpus:
2008 | Randall County Jail (Feb. 14-15), Amarillo, TX  (Score: 0.6632)
2008 | Randall County Jail (Feb. 10-12), Amarillo, TX  (Score: 0.6190)
2007 | Randall County Sheriff's Office, Amarillo, TX  (Score: 0.6114)
2007 | Bexar County, GEO Central Texas Detention Facility, San Antonio, TX  (Score: 0.6096)
2009 | Central Texas Detention Facility, San Antonio, TX  (Score: 0.5924)
2008 | Central Texas Detention Facility, San Antonio, TX  (Score: 0.5897)
2007 | Frio County Detention Center, Pearsall, TX  (Score: 0.5883)
2009 | Lubbock County Detention Center, Lubbock, TX  (Score: 0.5843)
2007 | Lubbock County Detention Center, Lubbock, TX  (Score: 0.5799)
2007 | Brooks County Detention Center, Falfurrias, TX  (Score: 0.5795)
2007 | Webb County Detention Center, Laredo, TX  (Score: 0.5795)
EROIGSA-15-0001: Prairieland Detention Center – Alvarado,