<a href="https://colab.research.google.com/github/ArijaK/QuestionAnswering/blob/main/QA_wikipedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Question Answering using Wikipedia articles**

Inspired by article *Building a QA System with BERT on Wikipedia* [found here](https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html#So-you've-decided-to-build-a-QA-system).






In [1]:
!pip install wikipedia
!pip install datasets
!pip install transformers[torch]

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11680 sha256=2b991494c3b3a7d42cbcfecc2ed49c00fa1b31b1ce642204abbacfc3ddb5cbc7
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.8/40.8 MB[

In [7]:
PATH_TO_MODEL = 'drive/MyDrive/Colab Notebooks/Fine-tuned_models/albert-base-v2-squadv2'
TOKENIZER = 'twmkn9/albert-base-v2'

In [3]:
# In case the fine-tuned model is saved on google drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [42]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
model = AutoModelForQuestionAnswering.from_pretrained(PATH_TO_MODEL)

In [84]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [90]:
import torch
import string
import wikipedia
from collections import OrderedDict

def get_wiki_answer(question, tokenizer, model):
  search_results = wikipedia.search(question)
  try:
    # Faster option.
    content = wikipedia.summary(search_results[0], sentences=100)
    # Look through the whole article.
    # content = wikipedia.page(search_results[0]).content
  except:
    # Sometimes, article cannot be found.
    return 'Cannot find answer!'

  inputs = tokenizer.encode_plus(question, content, return_tensors='pt')

  qmask = inputs['token_type_ids'].lt(1)
  qt = torch.masked_select(inputs['input_ids'], qmask)
  sample_size = model.config.max_position_embeddings-qt.size()[0]-1

  inputs_split = OrderedDict()
  for k,v in inputs.items():
      q = torch.masked_select(v, qmask)
      c = torch.masked_select(v, ~qmask)
      samples = torch.split(c, sample_size)

      for i, sample in enumerate(samples):
          if i not in inputs_split:
              inputs_split[i] = {}

          data = torch.cat((q, sample))
          if i != len(samples)-1:
              if k == 'input_ids':
                  data = torch.cat((data, torch.tensor([102])))
              else:
                  data = torch.cat((data, torch.tensor([1])))

          inputs_split[i][k] = torch.unsqueeze(data, dim=0).to(device)

  answers = []
  for _, input in inputs_split.items():
    output = model(**input)
    start_scores = output.start_logits.squeeze()
    end_scores = output.end_logits.squeeze()

    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
      tokenizer.convert_ids_to_tokens(inputs.input_ids[0][answer_start:answer_end]))
    if answer != '[CLS]':
      logit_score = start_scores[answer_start] + end_scores[answer_end]
      answers.append({
          'text': answer,
          'logit_score': logit_score.tolist(),
        })

  if len(answers) > 0:
    best_answer = max(answers, key=lambda x: x['logit_score'])
    return best_answer['text']
  else:
    return 'Cannot find answer!'

In [91]:
while True:
    question = input("Enter a question: ")
    if question == '':
      print('Exiting...')
      break

    answer = get_wiki_answer(question, tokenizer, model)
    print(f'Answer: {answer}\n')

Enter a question: How old is Barack Obama?
Answer: 44

Enter a question: Where is London?
Answer: river thames in south-east england

Enter a question: Can I eat pasta?




  lis = BeautifulSoup(html).find_all('li')


Answer: Cannot find answer!

Enter a question: How to cook pasta?
Answer: Cannot find answer!

Enter a question: Where can I eat pasta?
Answer: Cannot find answer!

Enter a question: Which fruit is the sweetest?
Answer: antigua black pineapple

Enter a question: Which food is disgusting?
Answer: Cannot find answer!

Enter a question: Who is the president of Latvia?
Answer: head of state and commander-in-chief of the national armed forces of the republic of latvia

Enter a question: Where is Latvia?
Answer: Cannot find answer!

Enter a question: Where is Lithuania?
Answer: baltic region of europe

Enter a question: Where is Estonia?
Answer: northern europe

Enter a question: 
Exiting...
