# We want to build a QA Information Retrieval System with a Language Model on Wikipedia Data

Link to [article](https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html)

## Setting up Environment

In [1]:
!pip install torch  torchvision -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers==2.5.1
!pip install wikipedia==1.4.0

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting transformers==2.5.1
  Downloading transformers-2.5.1-py3-none-any.whl (499 kB)
Collecting tokenizers==0.5.2
  Downloading tokenizers-0.5.2-cp37-cp37m-manylinux1_x86_64.whl (5.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.12.1
    Uninstalling tokenizers-0.12.1:
      Successfully uninstalled tokenizers-0.12.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.20.1
    Uninstalling transformers-4.20.1:
      Successfully uninstalled transformers-4.20.1
ERROR: pip's dependency resolver does not current

Also consider the choice of model: BERT does not do machine translation and GPT does not do QA, so each LM has its limitations. We will go ahead with BERT.

## QA Dataset

SQuAD is the canonical dataset for QA. In SQuAD 1.1, every question has its answer in the passage, whereas SQuAD 2.0 also includes questions that cannot be answered from the provided passage.
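
To make the difference between the two versions concrete, here is a minimal sketch that counts answerable and unanswerable questions in SQuAD-style JSON. The nested `data` / `paragraphs` / `qas` layout and the `is_impossible` flag follow the published SQuAD 2.0 files; the tiny `sample` dict and the helper name `count_questions` are made up for illustration.

```python
# Minimal sketch: count answerable vs. unanswerable questions in SQuAD-style JSON.
# `sample` is a hand-written stand-in for a real file such as train-v2.0.json.

def count_questions(squad):
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD 1.1 entries have no `is_impossible` key, so default to False
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


sample = {
    "version": "v2.0",
    "data": [{
        "title": "Macedonia",
        "paragraphs": [{
            "context": "The kingdom was founded by the Argead dynasty.",
            "qas": [
                {"id": "q1", "question": "Who founded the kingdom?",
                 "answers": [{"text": "the Argead dynasty", "answer_start": 27}],
                 "is_impossible": False},
                {"id": "q2", "question": "When did the kingdom fall?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

print(count_questions(sample))
```

To run it on a downloaded file, load the file with `json.load` and pass the resulting dict to `count_questions`.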

In [2]:
# code to download the specified version of squad

# set path with magic
%env DATA_DIR=./data/squad 

# download the data
def download_squad(version=1):
    if version == 1:
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
    else:
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
        !wget -P $DATA_DIR https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
            
download_squad(version=2)

env: DATA_DIR=./data/squad
--2022-10-23 11:05:10--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘./data/squad/train-v2.0.json’


2022-10-23 11:05:10 (183 MB/s) - ‘./data/squad/train-v2.0.json’ saved [42123633/42123633]

--2022-10-23 11:05:11--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.109.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘./data/squad/dev-v2.0.json’


2022-10-23 11:05:12 (52.1 MB/s) - 

## Fine-Tuning Script

In [3]:
# grab run_squad.py training script
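# use the raw file URL: the github.com/.../blob/... page URL returns an HTML page instead of
# the script, which is what produces the SyntaxError recorded further down
!curl -L -O https://raw.githubusercontent.com/huggingface/transformers/b90745c5901809faef3136ed09a689e7d733526c/examples/run_squad.py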

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  462k    0  462k    0     0   859k      0 --:--:-- --:--:-- --:--:--  859k


In [4]:
# fine-tuning your own model for QA using HF's `run_squad.py`
# turn flags on and off according to the model you're training

cmd = [
    'python', 
#    '-m torch.distributed.launch --nproc_per_node 2', # use this to perform distributed training over multiple GPUs
    'run_squad.py', 
    
    '--model_type', 'bert',                            # model type (one of the list under "Pick a Model" above)
    '--model_name_or_path', 'bert-base-uncased',       # specific model name of the given model type (shown, a list is here: https://huggingface.co/transformers/pretrained_models.html) 
                                                       # on first execution this initiates a download of pre-trained model weights;
                                                       # can also be a local path to a directory with model weights
    
    '--output_dir', './models/bert/bbu_squad2',        # directory for model checkpoints and predictions
    
#    '--overwrite_output_dir',                         # use when adding output to a directory that is non-empty --
                                                       # for instance, when training crashes midway through and you need to restart it
    
    '--do_train',                                      # execute the training method 
    '--train_file', '$DATA_DIR/train-v2.0.json',       # provide the training data
    '--version_2_with_negative',                       # ** MUST use this flag if training on SQuAD 2.0! DO NOT use if training on SQuAD 1.1
    '--do_lower_case',                                 # ** set this flag if using an uncased model; don't use for Cased Models
    '--do_eval',                                       # execute the evaluation method on the dev set -- note: 
                                                       # if coupled with --do_train, evaluation runs after fine-tuning 
    
    '--predict_file', '$DATA_DIR/dev-v2.0.json',       # provide evaluation data (dev set)
    '--eval_all_checkpoints',                          # evaluate the model on the dev set at each checkpoint
    '--per_gpu_eval_batch_size', '12',                 # evaluation batch size for each gpu
    '--per_gpu_train_batch_size', '12',                # training batch size for each gpu
    '--save_steps', '5000',                            # how often checkpoints (complete model snapshot) are saved 
    '--threads', '8',                                  # num of CPU threads to use for converting SQuAD examples to model features
    
    # --- Model and Feature Hyperparameters --- 
    '--num_train_epochs', '3',                         # number of training epochs - usually 2-3 for SQuAD 
    '--learning_rate', '3e-5',                         # learning rate for the default optimizer (Adam in this case)
    '--max_seq_length', '384',                         # maximum length allowed for the full input sequence 
    '--doc_stride', '128'                              # used for long documents that must be chunked into multiple features -- 
                                                       # this "sliding window" controls the amount of stride between chunks
]
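
One thing to watch with the `cmd` list: its entries are plain Python strings, so `$DATA_DIR` is only expanded if a shell sees it. The notebook does not show how `cmd` is executed, so the following is a sketch under the assumption that it is run with `subprocess`. The helper names `expand_args` and `run` and the short `demo_cmd` list are hypothetical.

```python
import os
import subprocess


def expand_args(args):
    """Expand $VARS in each argument, as a shell would have done."""
    return [os.path.expandvars(a) for a in args]


def run(args, dry_run=True):
    """Print the expanded command, or execute it when dry_run is False."""
    expanded = expand_args(args)
    if dry_run:
        print(" ".join(expanded))
        return expanded
    subprocess.run(expanded, check=True)  # raises CalledProcessError on failure
    return expanded


# assumption: DATA_DIR matches the value set with %env earlier
os.environ.setdefault("DATA_DIR", "./data/squad")
demo_cmd = ["python", "run_squad.py", "--train_file", "$DATA_DIR/train-v2.0.json"]
run(demo_cmd)
```

Keeping `dry_run=True` as the default makes it easy to inspect the final command line before committing to a long training run.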

When `run_squad.py` is executed for the first time, the following happens:
1. Pre-trained model weights are downloaded for the specified model (`bert-base-uncased`)
2. SQuAD training examples are converted into features (15-30 mins)
3. Training features are saved to a cache file
4. With `--do_train`, training runs for as many epochs as we specify, saving the model every `save_steps` steps until training finishes
5. Final model weights and peripheral files are saved to `output_dir`
6. With `--do_eval`, SQuAD dev examples are converted to features
7. Dev features are also saved to a cache
8. Evaluation commences and outputs an assortment of performance scores

In [5]:
!python run_squad.py  \
    --model_type bert   \
    --model_name_or_path bert-base-uncased  \
    --output_dir models/bert/ \
    --data_dir data/squad   \
    --overwrite_output_dir \
    --overwrite_cache \
    --do_train  \
    --train_file train-v2.0.json   \
    --version_2_with_negative \
    --do_lower_case  \
    --do_eval   \
    --predict_file dev-v2.0.json   \
    --per_gpu_train_batch_size 2   \
    --learning_rate 3e-5   \
    --num_train_epochs 2.0   \
    --max_seq_length 384   \
    --doc_stride 128   \
    --threads 10   \
    --save_steps 5000 

  File "run_squad.py", line 8
    <!DOCTYPE html>
    ^
SyntaxError: invalid syntax
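
The `<!DOCTYPE html>` in the traceback indicates that `run_squad.py` on disk is a web page rather than Python source, which is what a `github.com/.../blob/...` URL serves. A minimal sketch of converting such a URL to its raw-file form follows; `to_raw_url` is a hypothetical helper name, and it only handles the plain `blob` URL shape.

```python
def to_raw_url(blob_url):
    """Convert https://github.com/<owner>/<repo>/blob/<ref>/<path> to its raw form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix):
        raise ValueError("not a github.com URL")
    # split into owner, repo, the literal 'blob', and '<ref>/<path>'
    owner, repo, kind, rest = blob_url[len(prefix):].split("/", 3)
    if kind != "blob":
        raise ValueError("expected a /blob/ URL")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"


url = ("https://github.com/huggingface/transformers/blob/"
       "b90745c5901809faef3136ed09a689e7d733526c/examples/run_squad.py")
print(to_raw_url(url))
```

Passing the printed URL to `curl -L -O` should fetch the script itself instead of the HTML page.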


### Training Output

In the output directory we'll find the model's tokenizer files:
- `tokenizer_config.json`
- `vocab.txt`
- `special_tokens_map.json`

Model Files:
- `pytorch_model.bin`: Actual model weights (can be several GB for some models)
- `config.json`: details of model architecture

A binary representation of the command-line arguments used to train the model: `training_args.bin`

And if we included `--do_eval`:
- `predictions_.json`: Official best answer for each example
- `nbest_predictions.json`: Top n best answers for each example
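
Before loading a fine-tuned model it can help to confirm that training actually wrote these files. The sketch below checks a directory against the file names listed above; `missing_files` and `EXPECTED_FILES` are hypothetical names, and the directory path is the one this notebook uses for its own model.

```python
from pathlib import Path

# file names taken from the lists above
EXPECTED_FILES = [
    "config.json",
    "pytorch_model.bin",
    "vocab.txt",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


print(missing_files("./models/bert/bbu_squad2"))
```

An empty list means every expected file is present; anything else points at a training run that did not finish or wrote to a different directory.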

In [1]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("./models/bert/bbu_squad2")
model = AutoModelForQuestionAnswering.from_pretrained("./models/bert/bbu_squad2")

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

If we don't have time to train, we can just load a fine-tuned model from Hugging Face! There are several available.

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# executing these commands for the first time initiates a download of the 
# model weights to ~/.cache/torch/transformers/
tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2") 
model = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")


### Try Our Model

Three steps to QA:
1. Tokenise input
2. Obtain model scores
3. Get the answer span

In [4]:
question = "Who ruled Macedonia"

context = """Macedonia was an ancient kingdom on the periphery of Archaic and Classical Greece, 
and later the dominant state of Hellenistic Greece. The kingdom was founded and initially ruled 
by the Argead dynasty, followed by the Antipatrid and Antigonid dynasties. Home to the ancient 
Macedonians, it originated on the northeastern part of the Greek peninsula. Before the 4th 
century BC, it was a small kingdom outside of the area dominated by the city-states of Athens, 
Sparta and Thebes, and briefly subordinate to Achaemenid Persia."""


# 1. TOKENIZE THE INPUT
# note: if you don't include return_tensors='pt' you'll get a list of lists which is easier for 
# exploration but you cannot feed that into a model. 
inputs = tokenizer.encode_plus(question, context, return_tensors="pt") 

# 2. OBTAIN MODEL SCORES
# the AutoModelForQuestionAnswering class includes a span predictor on top of the model. 
# the model returns answer start and end scores for each token in the input
answer_start_scores, answer_end_scores = model(**inputs)
answer_start = torch.argmax(answer_start_scores)  # get the most likely beginning of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1  # get the most likely end of answer with the argmax of the score

# 3. GET THE ANSWER SPAN
# once we have the most likely start and end tokens, we grab all the tokens between them
# and convert tokens back to words!
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

'the Argead dynasty'
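
Taking the argmax of the start and end scores independently works here, but nothing forces the end to come after the start or keeps the span short. A common alternative, which is not what the cell above does, is to search for the best (start, end) pair jointly. The sketch below uses plain Python lists; `best_span`, `max_answer_len`, and the score values are illustrative assumptions. With the model above, the lists could be obtained with something like `answer_start_scores[0].tolist()`.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start_scores[s] + end_scores[e]
    subject to s <= e < s + max_answer_len. `end` is inclusive."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# made-up scores: independent argmax would pick start=2 and end=0, an invalid span
start = [0.1, 0.2, 5.0, 0.3, 0.1]
end = [4.0, 0.1, 0.2, 3.5, 0.1]
print(best_span(start, end))
```

Because `end` is inclusive here, slicing the token ids would use `input_ids[start:end + 1]`.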

## QA On Wikipedia Pages

In [7]:
# pull up a wikipedia page
import wikipedia as wiki
import pprint as pp

question = "What is the wingspan of an albatross?"

results = wiki.search(question)
print("Wikipedia search results for our question: \n")
pp.pprint(results)

page = wiki.page(results[0])
text = page.content
print(f"\nThe {results[0]} Wikipedia article contains {len(text)} characters.")

Wikipedia search results for our question: 

['Wandering albatross',
 'Pelagornis sandersi',
 'List of largest birds',
 'Black-browed albatross',
 'Argentavis',
 'List of birds by flight speed',
 'Mollymawk',
 'Largest body part',
 'Aspect ratio (aeronautics)',
 'Early flying machines']

The Wandering albatross Wikipedia article contains 10427 characters.


In [8]:
inputs = tokenizer.encode_plus(question, text, return_tensors="pt")
print(f"This translates into {len(inputs['input_ids'][0])} tokens.")

This translates into 2536 tokens.


Note: Always use the tokenizer that belongs to the model, since each model's vocabulary (and therefore its token ids) is different.

2536 tokens exceeds what we can feed to the model at once: most BERT-esque models accept at most 512 tokens, so we need to split the article into chunks, and each complete input must not exceed 512 tokens in total.

Each input must follow the format: [CLS] question tokens [SEP] context tokens [SEP].

So for every chunk we prepend the original question, followed by the next "chunk" of article tokens.

In [12]:
from collections import OrderedDict

# identify question tokens (token_type_ids = 0)
qmask = inputs["token_type_ids"].lt(1) # mask of positions where token_type_ids == 0, i.e. the question tokens

qt = torch.masked_select(inputs['input_ids'], qmask) # question tokens
print(f"The question consists of {qt.size()[0]} tokens.")

# 1 accounts for having to add a [SEP] token to the end of each chunk
chunk_size = model.config.max_position_embeddings - qt.size()[0] - 1 # max chunk size=512 - len(question_tokens) - 1
print(f"Each chunk will contain {chunk_size - 2} tokens of the Wikipedia article.")

# create a dict of dicts; each sub-dict mimics the structure of pre-chunked model inputs
chunked_input = OrderedDict()

# loop through input_ids, token_type_ids, attention mask
for k, v in inputs.items():
    q = torch.masked_select(v, qmask) 
    c = torch.masked_select(v, ~qmask)
    chunks = torch.split(c, chunk_size) # chunk remaining tokens (non-question)
    
    # loop through chunks
    for i, chunk in enumerate(chunks): 
        
        # make key first time around
        if i not in chunked_input:
            chunked_input[i] = {}
        
        # append question to chunk
        thing = torch.cat((q, chunk))
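        if i != len(chunks)-1:
            # append [SEP] (id 102) to close the chunk; the last chunk already ends with [SEP]
            if k == "input_ids":
                thing = torch.cat((thing, torch.tensor([102])))
            # for token_type_ids and attention_mask, append 1 to match the added [SEP]
            else:
                thing = torch.cat((thing, torch.tensor([1])))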
                
        chunked_input[i][k] = torch.unsqueeze(thing, dim=0)

The question consists of 12 tokens.
Each chunk will contain 497 tokens of the Wikipedia article.


In [34]:
inputs

{'input_ids': tensor([[  101,  1327,  1110,  ..., 19021,  1233,   102]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 1, 1, 1]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

In [21]:
for i in range(len(chunked_input.keys())):
    print(f"Number of tokens in chunk {i}: {len(chunked_input[i]['input_ids'].tolist()[0])}")

Number of tokens in chunk 0: 512
Number of tokens in chunk 1: 512
Number of tokens in chunk 2: 512
Number of tokens in chunk 3: 512
Number of tokens in chunk 4: 512
Number of tokens in chunk 5: 41
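
The chunk sizes above can be checked with a version of the same idea on plain lists of token ids. This is a sketch, not the notebook's tensor-based implementation: `chunk_inputs` is a hypothetical helper, the ids are dummy integers, and it assumes the question ids already include [CLS] and the first [SEP] (as the `token_type_ids == 0` mask does) while the context ids exclude the final [SEP].

```python
def chunk_inputs(question_ids, context_ids, max_len=512, sep_id=102):
    """Split context_ids so each question + chunk + [SEP] fits in max_len.

    question_ids: ids for "[CLS] question [SEP]" (special tokens included)
    context_ids:  ids for the article, without the trailing [SEP]
    """
    chunk_size = max_len - len(question_ids) - 1  # reserve one slot for [SEP]
    if chunk_size <= 0:
        raise ValueError("question leaves no room for context")
    chunks = []
    for i in range(0, len(context_ids), chunk_size):
        chunks.append(question_ids + context_ids[i:i + chunk_size] + [sep_id])
    return chunks


# 12 question tokens and 2536 - 12 - 1 = 2523 article tokens, as in the run above
chunks = chunk_inputs(list(range(12)), list(range(2523)))
print([len(c) for c in chunks])
```

With these numbers every full chunk carries 512 - 12 - 1 = 499 article tokens, and the last one carries the remaining 28, giving 12 + 28 + 1 = 41 tokens in total.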


The chunks can now be fed to the model without indexing errors. We'll get an "answer" for each chunk; however, not all answers are useful, since not every part of the Wikipedia article is informative for our question. The model returns the [CLS] token when it determines that the context does not contain an answer to the question.

In [38]:
def convert_ids_to_string(tokenizer, input_ids):
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids))

answer = ""

# iterate over chunks, look for best answer from each chunk
for _, chunk in chunked_input.items():
    answer_start_scores, answer_end_scores = model(**chunk)
    
    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1
    
    ans = convert_ids_to_string(tokenizer, chunk['input_ids'][0][answer_start:answer_end])
    
    # if ans = [CLS] then the model did not find a real answer in this chunk
    if ans != '[CLS]':
        answer += ans + " / "
    
print(answer)
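
The loop above joins every chunk's answer with " / ". One possible refinement, which the notebook itself does not use, is to keep a score per chunk and return only the best-scoring real answer. The sketch below assumes each chunk yields a `(text, score)` pair, for example the sum of the chosen start and end scores; `pick_best_answer` and the sample pairs are hypothetical.

```python
def pick_best_answer(candidates, no_answer="[CLS]"):
    """candidates: list of (answer_text, score) pairs, one per chunk.

    Returns the highest-scoring real answer, or None if every chunk abstained.
    """
    real = [(text.strip(), score) for text, score in candidates
            if text.strip() and text.strip() != no_answer]
    if not real:
        return None
    return max(real, key=lambda pair: pair[1])[0]


# placeholder texts and scores, not model output
candidates = [("[CLS]", 9.1), ("span A", 7.4), ("", 2.0), ("span B", 6.8)]
print(pick_best_answer(candidates))
```

Raw scores from different chunks are not normalised against each other, so this comparison is a heuristic rather than a calibrated ranking.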




## Putting it all together

In [39]:
import torch
from collections import OrderedDict
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

class DocumentReader:
    def __init__(self, pretrained_model_name_or_path="bert-large-uncased"):
        self.READER_PATH = pretrained_model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.READER_PATH)
        self.model = AutoModelForQuestionAnswering.from_pretrained(self.READER_PATH)
        self.max_len = self.model.config.max_position_embeddings
        self.chunked = False
        
    def tokenize(self, question, text):
        self.inputs = self.tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
        self.input_ids = self.inputs["input_ids"].tolist()[0]
        
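        # reset the flag on every call so that a short document tokenized after
        # a long one is not treated as chunked
        self.chunked = len(self.input_ids) > self.max_len
        if self.chunked:
            self.inputs = self.chunkify()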
            
    def chunkify(self):
        """
        Break up a long article into chunks that fit within the max token
        requirement for that Transformer model.
        
        Calls to BERT / RoBERTa / ALBERT require the following format:
        [CLS] question tokens [SEP] context tokens [SEP].
        """
        # create a question mask based on token_type_ids
        # value is 0 for question tokens, 1 for context tokens
        qmask = self.inputs["token_type_ids"].lt(1)
        qt = torch.masked_select(self.inputs["input_ids"], qmask)
        chunk_size = self.max_len - qt.size()[0] - 1 # 1 accounts for having to add [SEP] to the end
        
        # create dict of dicts; each sub-dict mimics structure of pre-chunked models inputs
        chunked_input = OrderedDict()
        for k, v in self.inputs.items():
            q = torch.masked_select(v, qmask)
            c = torch.masked_select(v, ~qmask)
            chunks = torch.split(c, chunk_size)
            
            for i, chunk in enumerate(chunks):
                if i not in chunked_input:
                    chunked_input[i] = {}
                    
                thing = torch.cat((q, chunk))
                if i != len(chunks) - 1:
                    if k == 'input_ids':
                        thing = torch.cat((thing, torch.tensor([102])))
                    else:
                        thing = torch.cat((thing, torch.tensor([1])))
                        
                chunked_input[i][k] = torch.unsqueeze(thing, dim=0)
        return chunked_input
    
    
    def get_answer(self):
        if self.chunked:
            answer = ""
            for k, chunk in self.inputs.items():
                answer_start_scores, answer_end_scores = self.model(**chunk)
                
                answer_start = torch.argmax(answer_start_scores)
                answer_end = torch.argmax(answer_end_scores) + 1
                
                ans = self.convert_ids_to_string(chunk['input_ids'][0][answer_start:answer_end])
                if ans != '[CLS]':
                    answer += ans + " / "
            return answer
        else:
            answer_start_scores, answer_end_scores = self.model(**self.inputs)
            
            answer_start = torch.argmax(answer_start_scores) # most likely beginning of answer
            answer_end = torch.argmax(answer_end_scores) + 1
            
            return self.convert_ids_to_string(self.inputs['input_ids'][0][answer_start:answer_end])
        
    def convert_ids_to_string(self, input_ids):
        return self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids))

In [40]:
questions = [
    'When was Barack Obama born?',
    'Why is the sky blue?',
    'How many sides does a pentagon have?'
]

reader = DocumentReader("deepset/bert-base-cased-squad2") 

# if you trained your own model using the training cell earlier, you can access it with this:
#reader = DocumentReader("./models/bert/bbu_squad2")

for question in questions:
    print(f"Question: {question}")
    results = wiki.search(question)

    page = wiki.page(results[0])
    print(f"Top wiki result: {page}")

    text = page.content

    reader.tokenize(question, text)
    print(f"Answer: {reader.get_answer()}")
    print()

Question: When was Barack Obama born?
Top wiki result: <WikipediaPage 'Family of Barack Obama'>
Answer: January 17 , 1964 / circa 1829 / 1895 / 1934 / c . 1940 / [CLS] When was Barack Obama born ? [SEP] Siaya in 2013 . His campaign slogan was " Obama here , Obama there " in reference to his half - brother who was serving his second term as the president of the United States . Malik garnered a meager 2 , 792 votes , about 140 , 000 votes behind the eventual winner . Prior to the 2016 United States presidential election , he stated that he supported Donald Trump , the candidate for the Republican Party . He attended the third presidential debate as one of Trump ' s guests . = = = Auma Obama = = = Barack Obama ' s half - sister , born c . 1960 / August 24 , 1912 / 

Question: Why is the sky blue?
Top wiki result: <WikipediaPage 'Diffuse sky radiation'>
Answer: Rayleigh scattering / its intrinsic nature , can illuminate under - canopy leaves permitting more efficient total whole - plant ph

Incorrect answers are more often due to the retrieved content than to the model itself! So the system needs a search engine (document retriever) that is as good as the document reader.
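
As a rough illustration of what a retriever adds, the sketch below ranks candidate passages by how many question terms they contain. This is a crude term-overlap score standing in for a real retrieval method such as BM25, not something the notebook implements. The function names are hypothetical, the first passage is the opening of the retrieved Pentagon article, and the second is a hand-written example sentence.

```python
import re


def terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, passage):
    """Fraction of distinct question terms that appear in the passage."""
    q_terms = set(terms(question))
    if not q_terms:
        return 0.0
    return len(q_terms & set(terms(passage))) / len(q_terms)


def rank_passages(question, passages):
    """Return passages sorted by descending overlap with the question."""
    return sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)


passages = [
    "The Pentagon is the headquarters building of the United States Department of Defense.",
    "In geometry, a pentagon is a polygon that has five sides.",
]
print(rank_passages("How many sides does a pentagon have?", passages)[0])
```

Even this simple score prefers the geometry passage, since it shares "sides" with the question in addition to "pentagon".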

In [42]:
page.content

'The Pentagon is the headquarters building of the United States Department of Defense. It was constructed on an accelerated schedule during World War II. As a symbol of the U.S. military, the phrase The Pentagon is often used as a metonym for the Department of Defense and its leadership.\nLocated in Arlington County, Virginia, across the Potomac River from Washington, D.C., the building was designed by American architect George Bergstrom and built by contractor John McShain. Ground was broken on 11 September 1941, and the building was dedicated on 15 January 1943. General Brehon Somervell provided the major impetus to gain Congressional approval for the project; Colonel Leslie Groves was responsible for overseeing the project for the U.S. Army Corps of Engineers, which supervised it.\nThe Pentagon is the world\'s largest office building, with about 6.5 million square feet (150 acres; 60 ha) of floor space, of which 3.7 million square feet (85 acres; 34 ha) are used as offices. Some 23,