# **Vanilla BERT Text Encoding**

In this colab, I compute the text encodings for the queries using a vanilla BERT model. These are saved for future reference.

## **Setup**

In [None]:
!pip install transformers

In [None]:
!pip install --upgrade python-terrier

In [None]:
import pyterrier as pt

if not pt.started():
  pt.init()

In [2]:
import pandas as pd
import pickle

In [None]:
cord19 = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = cord19.get_topics('title')

In [3]:
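# Quote the URL so the shell does not interpret '?', and use -O to save under a clean file name
!wget -O /content/round5_docs.csv "https://github.com/DavidONeill75101/level-4-project/blob/master/Datasets/CORD-19_Datasets/round5_docs.csv?raw=true"
round5_docs = pd.read_csv('/content/round5_docs.csv').drop(columns=['Unnamed: 0'])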

## **Prepare for Encoding**

In [None]:
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# AutoModel loads the bare BERT encoder, without a randomly initialised classification head;
# output_hidden_states=True makes the hidden states of every layer available in the output
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

model = model.to(device)

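The following function encodes a single string as a fixed-size vector by averaging BERT's final-layer token embeddings. The input can be a query on its own, as in this notebook, or a query concatenated to a document with a '\n' character in between.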

In [None]:
def text_to_embed(text):
  # Tokenize the text, truncating it to at most 500 tokens
  inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=500)

  # Move the token IDs to the same device as the model (GPU if available)
  input_ids = inputs['input_ids'].to(device)

  # Run the model without tracking gradients, since this is inference only
  with torch.no_grad():
    outputs = model(input_ids=input_ids)

    # Take the final layer's hidden states, drop the batch dimension,
    # and average the embeddings of all the tokens.
    # Some researchers use the vector of just the first or final token
    # instead of an average; there is no definitive best approach.
    embed = outputs.hidden_states[-1].squeeze(0).mean(dim=0)

  # Return the embedding to the CPU and convert it to a numpy array
  return embed.cpu().numpy()
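
The comments in the function mention two ways of turning per-token vectors into one text vector: averaging over all tokens, or taking the vector of a single token. The sketch below shows the arithmetic of both on a toy example. It uses plain Python lists in place of tensors. The helper names `mean_pool` and `first_token_pool` are hypothetical and not part of this notebook. The attention mask is also an assumption: `text_to_embed` encodes one string at a time, so it has no padding to skip, and the mask would only matter if several texts were batched together.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def first_token_pool(token_vectors):
    """Use the vector of the first token ([CLS] in BERT) as the text embedding."""
    return list(token_vectors[0])


# Toy "final layer" output: 3 token positions, 2 dimensions each.
# The last position is padding and should not influence the mean.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

mean_embedding = mean_pool(hidden, mask)
cls_embedding = first_token_pool(hidden)
print(mean_embedding)  # [2.0, 3.0]
print(cls_embedding)   # [1.0, 2.0]
```

With real model output the same two choices would be tensor operations on the final hidden layer. Which one works better is likely to depend on the downstream task.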

## **Get Query Embeddings**

In [None]:
query_embeddings = {}
# Key each embedding by the topic's own ID rather than assuming 50 topics in order
for _, row in topics.iterrows():
  query_embeddings[str(row['qid'])] = text_to_embed(row['query'])
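
Since the encodings are meant to be saved for future reference, one option is to pickle a dictionary such as `query_embeddings`. The following is a minimal sketch, not necessarily how this project stores them. The helper names, the file name `query_embeddings.pkl`, and the temporary directory are assumptions made for illustration, and short lists stand in for the numpy arrays.

```python
import os
import pickle
import tempfile


def save_embeddings(embeddings, path):
    """Write a {topic_id: vector} dictionary to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)


def load_embeddings(path):
    """Read back a dictionary written by save_embeddings."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for query_embeddings: topic ID -> vector
example_embeddings = {"1": [0.1, 0.2], "2": [0.3, 0.4]}

# Hypothetical file name, written to a temporary directory for this demo
path = os.path.join(tempfile.mkdtemp(), "query_embeddings.pkl")
save_embeddings(example_embeddings, path)

restored = load_embeddings(path)
print(restored == example_embeddings)  # True
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.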
