
> # NLP Project



> ### Group Members 🤠

*   Haseebullah
*   Mohammad Maaz
*   Ali Musa



# Question Answering With Hugging Face Transformers


*   Model Used: https://huggingface.co/deepset/bert-base-cased-squad2
*   Model Training Dataset: https://huggingface.co/datasets/squad_v2


---
Note: The model is already fine-tuned on the dataset listed above, so this notebook only runs inference.




# Installing required packages

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


# Importing a Question Answering Class from Transformers
Using this class we can initialize different BERT checkpoints for question answering.
It loads the specified model together with the question-answering layer (head) added on top of it.

In [4]:
from transformers import BertForQuestionAnswering
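
To make the "question-answering layer" more concrete, here is a minimal sketch of the idea: a single linear layer that turns each token's hidden vector into two numbers, a start logit and an end logit. The function name `qa_head`, the tiny hidden size, and all weights are made-up illustration values, not the real model's parameters.

```python
# Toy illustration of a question-answering head: one linear layer that maps
# each token's hidden vector to a start logit and an end logit.
# All sizes and weights below are made-up values, not the real model's.

def qa_head(hidden_states, weight, bias):
    """hidden_states: list of token vectors; weight: hidden_size x 2; bias: 2 values."""
    start_logits, end_logits = [], []
    for vec in hidden_states:
        # Column 0 of the weight matrix scores "answer starts here"
        start = sum(h * w[0] for h, w in zip(vec, weight)) + bias[0]
        # Column 1 of the weight matrix scores "answer ends here"
        end = sum(h * w[1] for h, w in zip(vec, weight)) + bias[1]
        start_logits.append(start)
        end_logits.append(end)
    return start_logits, end_logits

# 4 tokens with a hidden size of 3 (BERT-base uses a hidden size of 768)
hidden_states = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
weight = [[0.5, -0.5], [2.0, 0.0], [0.0, 3.0]]
bias = [0.0, 0.25]

start_logits, end_logits = qa_head(hidden_states, weight, bias)
print("start logits:", start_logits)
print("end logits:  ", end_logits)
```

The token with the highest start logit and the token with the highest end logit are the candidates for the boundaries of the answer span.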

# Importing the Model
In our case it's deepset/bert-base-cased-squad2
<br>
Any BERT-based model from https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads can be loaded with this class

In [5]:
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')


# Defining Context and Questions 

In [6]:
# data has been taken from Imran Khan's Wikipedia page
context = ("Imran Ahmed Khan Niazi HI(M) PP (Urdu: عمران احمد خان نیازی; born 25 November 1952) is a Pakistani politician and former cricket captain "
           "who served as the 22nd Prime Minister of Pakistan from August 2018 until April 2022. "
           "He is the founder and chairman of the political party Pakistan Tehreek-e-Insaf (PTI). "
           "Born to a Niazi Pashtun family in Lahore, Khan graduated from Keble College, University of Oxford, England, in 1975. "
           "He began his international cricket career at age 18, in a 1971 Test series against England. "
           "Khan played until 1992, served as the team's captain intermittently between 1982 and 1992, and won the 1992 Cricket World Cup, "
           "in what is Pakistan's first and only victory in the competition. Considered one of cricket's greatest all-rounders, "
           "Khan scored 3,807 runs and took 362 wickets in Test cricket and was inducted into the ICC Cricket Hall of Fame. "
           "He founded cancer hospitals in Lahore and Peshawar, and Namal College in Mianwali, prior to entering politics. "
           "After retiring from cricket, Khan founded the PTI in 1996 and became its national leader. "
           "His party became the leading opposition in the 2013 Pakistani general elections and gained a majority in the 2018 general elections, "
           "leading to his appointment as Prime Minister. "
           "As Prime Minister, Khan focused on anti-corruption measures, education, healthcare, and poverty alleviation programs. "
           )

questions = [
    "Who is Imran Khan?",
    "What are Imran Khan's achievements?",
    "What is Imran Khan's political party?",
    "When did Imran Khan graduate from the University of Oxford?",
    "Which cricket team did Imran Khan captain?",
    "When did Imran Khan retire from cricket?",
    "How many runs and wickets did Imran Khan achieve in Test cricket?",
    "What social initiatives did Imran Khan establish?",
    "When was Imran Khan elected as the Prime Minister of Pakistan?",
    "What were Imran Khan's main focuses during his tenure as Prime Minister?"
]



# Tokenizer
BERT works with token ids rather than raw strings, so the question and context strings must first be converted into a format BERT understands. That conversion is the tokenizer's job.

In [7]:
from transformers import AutoTokenizer

In [8]:
# initializing the tokenizer
tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')


# Working of the tokenizer


*   Truncation caps the input at 512 tokens (the maximum BERT accepts), and padding fills shorter sequences in a batch up to a common length
*   Each token id is an index into the model's vocabulary, the same vocabulary BERT was trained with
*   The '101' at the start is the id of the '[CLS]' token, which marks the start of the sequence; '102' is the id of '[SEP]', which marks the end of a segment
*   The BERT Q&A model takes input in the format '[CLS] <question> [SEP] <context> [SEP] [PAD] [PAD]'



In [9]:
# tokenizing an example

tokenizer.encode(questions[0], truncation=True, padding=True)

[101, 2627, 1110, 146, 1306, 4047, 4340, 136, 102]
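
The single-question example above shows only one segment. As a rough sketch of what a question/context pair looks like, the helper below lays out the ids by hand, assuming the question-first layout that the Hugging Face BERT tokenizer produces for a pair. The function `build_pair_inputs`, the context ids, and `max_length=16` are hypothetical; `[PAD]=0` is the usual BERT padding id and is assumed here.

```python
# Special-token ids as seen in the notebook output: [CLS]=101, [SEP]=102.
# [PAD]=0 is the usual BERT padding id (an assumption in this sketch).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def build_pair_inputs(question_ids, context_ids, max_length):
    """Lay out '[CLS] question [SEP] context [SEP]' and pad to max_length.

    The context is truncated if the pair does not fit into max_length.
    """
    # Room left for the context after [CLS], the question, and two [SEP] tokens
    room = max_length - len(question_ids) - 3
    if room < 0:
        raise ValueError("max_length is too small for the question")
    context_ids = context_ids[:room]

    input_ids = [CLS_ID] + question_ids + [SEP_ID] + context_ids + [SEP_ID]
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers context + [SEP]
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(input_ids)

    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }

question_ids = [2627, 1110, 146, 1306, 4047, 4340, 136]  # question ids from the output above
context_ids = [11, 12, 13, 14]                           # hypothetical context ids
encoded = build_pair_inputs(question_ids, context_ids, max_length=16)
for name, values in encoded.items():
    print(name, values)
```

In practice the tokenizer builds all three lists itself when it is called with a question and a context; the sketch only spells out the layout.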

# Set Up our Tokenizer and Model into a Q&A Pipeline
We get this pipeline from the transformers library.

In [10]:
from transformers import pipeline

# The Pipeline
> Takes an input dictionary of the format: <br><br>
{
*   'question': question
*   'context': context      
}

<br>


> Returns an output dictionary: <br><br>
{
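*   'score': the model's confidence in the answer (between 0 and 1)
*   'start': character index in the context where the answer starts
*   'end': character index in the context where the answer ends (exclusive, so context[start:end] gives the answer)
*   'answer': the answer text chosen by the model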
}
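
As a simplified sketch of how such an output dictionary can be derived from the model's start and end logits: apply a softmax to each, pick the highest-scoring span whose end is not before its start, and map the chosen tokens back to character positions. This is not the library's exact implementation; `best_span`, the logits, and the token offsets below are made-up illustration values.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, offsets, context, max_answer_len=15):
    """Pick the (start, end) token pair with the highest joint probability."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = None
    for i, p_start in enumerate(start_probs):
        # Only consider spans that end at or after their start and are not too long
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    score, i, j = best
    # Map token positions back to character positions in the context
    start_char, end_char = offsets[i][0], offsets[j][1]
    return {
        "score": score,
        "start": start_char,
        "end": end_char,
        "answer": context[start_char:end_char],
    }

toy_context = "Khan graduated in 1975."
# Character offsets of the tokens: Khan | graduated | in | 1975 | .
offsets = [(0, 4), (5, 14), (15, 17), (18, 22), (22, 23)]
start_logits = [0.1, 0.2, 0.0, 5.0, -1.0]  # made-up model outputs
end_logits = [0.0, 0.1, 0.3, 6.0, 0.5]     # made-up model outputs

result = best_span(start_logits, end_logits, offsets, toy_context)
print(result)
```

Because the score is a product of two probabilities, it drops quickly when the model is unsure about either boundary.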


In [11]:
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)

# Example Question and Response

In [12]:
# we pass our question and context to the pipeline as a dictionary

response = nlp({
    'question': questions[0],
    'context': context
})

response

{'score': 0.4549984633922577,
 'start': 87,
 'end': 109,
 'answer': 'a Pakistani politician'}

In [13]:
print(context[response['start']:response['end']])

a Pakistani politician
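
One possible way to use these fields downstream is to check that the span matches the answer text and to discard low-confidence answers. The sketch below assumes a hypothetical helper `accept_answer` and an arbitrary cut-off of 0.1; the two response dictionaries are hand-written examples, not pipeline output.

```python
def accept_answer(response, context, min_score=0.1):
    """Return the answer text if it is consistent and confident enough, else None.

    min_score is an arbitrary illustrative cut-off, not a recommended value.
    """
    # The answer should be exactly the slice of the context given by start/end
    span = context[response["start"]:response["end"]]
    if span != response["answer"]:
        return None
    # Treat low scores as "no reliable answer"
    if response["score"] < min_score:
        return None
    return response["answer"]

toy_context = "He is the founder of the PTI."
# Hand-written responses in the same shape the pipeline returns
confident = {"score": 0.57, "start": 25, "end": 28, "answer": "PTI"}
doubtful = {"score": 0.0024, "start": 25, "end": 28, "answer": "PTI"}

print(accept_answer(confident, toy_context))
print(accept_answer(doubtful, toy_context))
```

A suitable threshold depends on the model and the data, so it would need to be tuned on real examples.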


# Testing on All Questions

In [14]:
for q in questions:
    res = nlp({
        'question': q,
        'context': context
    })
    print('Question: ', q)
    print('Answer: ', res['answer'])
    print('Score: ', round(res['score'], 4))
    print()

Question:  Who is Imran Khan?
Answer:  a Pakistani politician
Score:  0.455

Question:  What are Imran Khan's achievements?
Answer:  Considered one of cricket's greatest all-rounders
Score:  0.3033

Question:  What is Imran Khan's political party?
Answer:  Pakistan Tehreek-e-Insaf
Score:  0.5698

Question:  When did Imran Khan graduate from the University of Oxford?
Answer:  1975
Score:  0.9702

Question:  Which cricket team did Imran Khan captain?
Answer:  Pakistan Tehreek-e-Insaf (PTI
Score:  0.0024

Question:  When did Imran Khan retire from cricket?
Answer:  1996
Score:  0.6242

Question:  How many runs and wickets did Imran Khan achieve in Test cricket?
Answer:  362
Score:  0.9134

Question:  What social initiatives did Imran Khan establish?
Answer:  poverty alleviation programs
Score:  0.758

Question:  When was Imran Khan elected as the Prime Minister of Pakistan?
Answer:  August 2018 until April 2022
Score:  0.5098

Question:  What were Imran Khan's main focuses during his tenu