# Passage Retrievers

- TF-IDF
- BM25
- Dense Passage Retriever


# SQuAD Example
The SQuAD format contains a question, a context and answers

Extractive QA:
Open book extractive model:
- Accepts question
- Converts the question to a query vector
- Compares the query vector with a database of document or passage vectors
- Returns the top k most similar contexts
- Contexts are sent to the reader
- Reader 'reads' the contexts and selects the 'answer' span
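
The retrieval steps above can be sketched without any libraries. This is only a minimal illustration, assuming toy 3-dimensional vectors with made-up values in place of real sentence embeddings; `cosine_similarity` and `top_k` are hypothetical helper names, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, context_vecs, k=2):
    """Return the ids of the k contexts most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), ctx_id)
              for ctx_id, vec in context_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [ctx_id for _, ctx_id in scored[:k]]

# toy "database" of context vectors (made-up values)
contexts = {
    'ctx-a': [1.0, 0.0, 0.0],
    'ctx-b': [0.9, 0.1, 0.0],
    'ctx-c': [0.0, 0.0, 1.0],
}
query_vec = [0.8, 0.0, 0.2]

retrieved = top_k(query_vec, contexts, k=2)
print(retrieved)
```

In a real pipeline the ids returned here would be used to look up the context text that is passed on to the reader.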


## PreProcessing

Using Sentence Transformers from sbert.net
<br>We want a pretrained QA model that uses cosine similarity, so we pick one that was trained for that use
<br>For this demo we use the fastest model, which is not necessarily the best performing


In [1]:
import datasets

qa = datasets.load_dataset('squad', split='validation')
qa 


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})

In [2]:
qa[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


Remove Duplicates

In [3]:
unique_contexts = set()
unique_ids = set()

# collect the id of the first row in which each context appears
for row in qa:
    if row['context'] not in unique_contexts:
        unique_contexts.add(row['context'])
        unique_ids.add(row['id'])

# keep only the rows whose id is in unique_ids
qa = qa.filter(lambda x: x['id'] in unique_ids)


100%|██████████| 11/11 [00:00<00:00, 16.86ba/s]


Create context vectors with the retriever model

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
model



SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
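
The final `Normalize()` layer in the printed model scales every embedding to unit length. For unit-length vectors the dot product equals the cosine similarity, so ranking by either gives the same order. A small check of that identity, using made-up vectors rather than real model output (`normalize`, `dot` and `cosine` are hypothetical helpers):

```python
import math

def normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 2.0]

cos_raw = cosine(a, b)                        # cosine similarity of the raw vectors
dot_unit = dot(normalize(a), normalize(b))    # dot product after normalizing

print(math.isclose(cos_raw, dot_unit))
```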

Encode the context vectors

In [None]:
# Takes a long time; stores each context's embedding in a new feature called 'encoding'
qa = qa.map(lambda x:{
    'encoding': model.encode(x['context']).tolist()
}, batched=True, batch_size=32)
qa

In [7]:
len(model.encode('hello world').tolist())

384

## Vector Database

Create the vector database and index the context vectors.

The retriever queries the vector database to retrieve contexts.

Either FAISS or Pinecone can be used.
This example uses Pinecone, which requires an API key from app.pinecone.io

### Pinecone

In [8]:
API_KEY = "<YOUR_PINECONE_API_KEY>"  # placeholder: never commit a real key to a notebook


In [9]:
import pinecone

pinecone.init(API_KEY, environment='us-west1-gcp')

In [11]:
# Pass in an index name and the dimensionality of the index
# the dimension must match the length of the encodings the model produces; all our vectors have that dimension
dim = len(model.encode('hello world').tolist())
pinecone.create_index('qa-index', dim)

KeyboardInterrupt: 

Populate the index

In [None]:
# connecting to qa index
index = pinecone.Index('qa-index')

'Upsert' (update or insert) the vectors into the index
<br>Using a batch upload process
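
As a rough illustration of what an upsert does, a plain dictionary keyed by id behaves the same way: a new id is inserted and an existing id is overwritten. This is only an in-memory analogy with a hypothetical `upsert` helper and made-up vectors, not Pinecone's implementation.

```python
def upsert(store, records):
    """Insert each (id, vector) pair, overwriting any vector already stored under that id."""
    for record_id, vector in records:
        store[record_id] = vector
    return len(records)

store = {}
upsert(store, [('a', [0.1, 0.2]), ('b', [0.3, 0.4])])  # two inserts
upsert(store, [('b', [0.5, 0.6]), ('c', [0.7, 0.8])])  # one update, one insert

print(sorted(store))
print(store['b'])
```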

In [13]:
from tqdm.auto import tqdm
 
# When upserting to Pinecone we want a tuple with id and the vector | metadata optional
upserts = [(v['id'], v['encoding']) for v in qa]

l_ups = len(upserts)

# upload in batches of 50, with a progress bar
for i in tqdm(range(0, l_ups, 50)):
    i_end = min(i + 50, l_ups)
    index.upsert(vectors=upserts[i:i_end])

100%|██████████| 42/42 [00:00<00:00, 56371.45it/s]


In [14]:
i, i_end

(2050, 2067)
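
The batch boundaries used by the loop can be checked in isolation. A minimal sketch, assuming 2,067 records (the final `i_end` shown above) and the same batch size of 50; `batch_bounds` is a hypothetical helper, not part of tqdm or Pinecone.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        # the last batch is clipped so it never runs past the end
        yield start, min(start + batch_size, total)

bounds = list(batch_bounds(2067, 50))
print(len(bounds))   # number of batches
print(bounds[-1])    # last (start, end) pair
```

This gives 42 batches, the last being `(2050, 2067)` with 17 records, which agrees with the progress bar count and the `i, i_end` output above.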

### FAISS

## QA Inference

We want to encode our queries in this section

In [None]:
query = ""
model.encode(query).tolist()

Regression Analysis
    Linear regression
    Multiple regression analysis
    Dummy Variables

