<a href="https://colab.research.google.com/github/DanielHolzwart/Question-Answering-with-roberta/blob/main/Question_Answering_with_roberta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we will try to extract information from a sample CV file. Instead of summarization, we will work with a RoBERTa question-answering model that has been trained on the SQuAD2 dataset. Let's see how the model performs on this data. We may not get the desired results, because the SQuAD2 dataset draws its answers from *articles*, while a CV generally consists of bullet points.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


PyMuPDF is used to read a PDF file and export its text.

In [None]:
!pip install pymupdf



In [None]:
import fitz  # PyMuPDF

In [None]:
def extract_text_from_pdf(pdf_path):

    #read in document
    doc = fitz.open(pdf_path)

    #declare empty string
    text = ""
    for page_num in range(doc.page_count):
        page = doc[page_num]
        #add page to empty string
        text += page.get_text()
    doc.close()
    return text

In [None]:
#read in CV sample file from google drive
import os
cv_path = os.path.join(os.getcwd(), 'drive', 'My Drive', '2024-09-21 Question Answering with roberta', 'cvsamples.pdf')
cv_text = extract_text_from_pdf(cv_path)

The model can't process the whole CV at once, so we must split it into chunks. The following function does that and also implements a stride, so that consecutive chunks overlap and information at a chunk boundary is not lost.

In [None]:
def split_text_into_chunks(text, max_length=150, stride=25):
    words = text.split()
    chunks = []
    current_chunk = []

    for word in words:
        # Close the current chunk once the threshold is exceeded.
        # Note: the check adds the number of words collected so far to the
        # character length of the next word, so chunks end up somewhat
        # shorter than max_length words.
        if len(current_chunk) + len(word) + 1 > max_length:
            chunks.append(" ".join(current_chunk))
            # Start the next chunk with the last `stride` words as overlap
            current_chunk = current_chunk[-stride:]
            current_chunk.append(word)
        else:
            current_chunk.append(word)
    # Add whatever remains at the end of the document
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

In [None]:
cv_text_chunks = split_text_into_chunks(cv_text)
print(len(cv_text_chunks))

22


Let us check whether the stride works as intended.

In [None]:
cv_text_chunks[0].split()[-25:] == cv_text_chunks[1].split()[:25]

True
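
The threshold in `split_text_into_chunks` combines the number of words in the chunk with the character length of the next word. As a minimal sketch, a variant that measures chunks purely in words might look like the following. The names `split_words_with_overlap` and `max_words` and the toy text are illustrative assumptions and not part of the notebook, and this variant would not reproduce the 22 chunks reported above.

```python
def split_words_with_overlap(text, max_words=150, stride=25):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `stride` words. Hypothetical helper,
    shown only to illustrate the overlap idea.
    """
    if not 0 <= stride < max_words:
        raise ValueError("stride must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - stride  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks


# Toy input: ten placeholder words w0 ... w9
toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = split_words_with_overlap(toy_text, max_words=4, stride=2)
print(toy_chunks)
```

With `max_words=4` and `stride=2`, each toy chunk repeats the last two words of the previous one, which is the same property the check above verifies for the CV chunks.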

The overlap check passes. Now we can set up a question-answering model from Hugging Face. The easiest way is via the `pipeline` API:

In [None]:
from transformers import pipeline, AutoTokenizer

model = "consciousAI/question-answering-roberta-base-s-v2"
tokenizer = AutoTokenizer.from_pretrained(model)
pipe = pipeline("question-answering", model=model,tokenizer=tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
question = "What is Mike's profession?"
context =  "Mike was born in August 1972 in Canada, always wanted to be a pilot and thus went to pilot school. However, he eventually realized that he is afraid of heights and became a handyman."

In [None]:
pipe(question=question, context=context, top_k=3)

[{'score': 0.41693350672721863,
  'start': 173,
  'end': 181,
  'answer': 'handyman'},
 {'score': 0.09210065752267838, 'start': 62, 'end': 67, 'answer': 'pilot'},
 {'score': 0.02496851421892643,
  'start': 171,
  'end': 181,
  'answer': 'a handyman'}]
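
The `start` and `end` values in this output can be read as character offsets into `context`. The short check below, which copies the context string and the three predictions from the cells above, slices the context with those offsets and compares the result with the reported answer. The variable name `predictions` is only illustrative.

```python
context = "Mike was born in August 1972 in Canada, always wanted to be a pilot and thus went to pilot school. However, he eventually realized that he is afraid of heights and became a handyman."

# Offsets and answers copied from the pipeline output above
predictions = [
    {"start": 173, "end": 181, "answer": "handyman"},
    {"start": 62, "end": 67, "answer": "pilot"},
    {"start": 171, "end": 181, "answer": "a handyman"},
]

for prediction in predictions:
    # Slice the context with the reported character offsets
    span = context[prediction["start"]:prediction["end"]]
    print(repr(span), span == prediction["answer"])
```

Each slice matches the reported answer, so the offsets can be used to locate an answer inside its chunk.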

The first test looks good. Now we can apply the model to the chunks from the CV file. The model will try to answer the question for every chunk of the CV, and we will pick the answer with the highest score. The correct answer to the question 'How many undergrade students did Juan Garcia take care of?' is 2.

In [None]:
question = "How many undergrade students did Juan Garcia take care of?"
output_scores = []
answers = []
for chunk in cv_text_chunks:
    output = pipe(question=question, context=chunk)
    output_scores.append(output["score"])
    answers.append(output["answer"])
    print(output)


{'score': 0.00012995986617170274, 'start': 0, 'end': 3, 'answer': '217'}
{'score': 1.3816425781243424e-10, 'start': 636, 'end': 637, 'answer': '.'}
{'score': 1.8613463548255993e-10, 'start': 496, 'end': 503, 'answer': '06/2014'}
{'score': 7.08324371379021e-11, 'start': 172, 'end': 173, 'answer': '.'}
{'score': 8.089430514335305e-11, 'start': 366, 'end': 367, 'answer': '.'}
{'score': 4.136669062848597e-10, 'start': 1015, 'end': 1016, 'answer': '3'}
{'score': 1.1071007999241544e-10, 'start': 197, 'end': 198, 'answer': '.'}
{'score': 1.0605149397546754e-10, 'start': 571, 'end': 572, 'answer': '.'}
{'score': 6.384082151811299e-11, 'start': 389, 'end': 390, 'answer': '.'}
{'score': 1.010520556121719e-10, 'start': 974, 'end': 976, 'answer': '.”'}
{'score': 3.3165772350685074e-09, 'start': 202, 'end': 206, 'answer': '20xx'}
{'score': 7.596036688539698e-09, 'start': 84, 'end': 88, 'answer': '20xx'}
{'score': 2.993368608539271e-10, 'start': 744, 'end': 795, 'answer': '. \uf0b7 Edited copy for p

One observation is that, apart from stray punctuation, the model outputs are mostly numbers. This is to be expected, as the question starts with 'How many'. Moreover, in most cases the score is extremely small, and we only get an output because the pipeline returns an answer for every chunk, whether or not the chunk contains one. Let us pick out the 3 predictions with the highest scores.

In [None]:
top_3_position = sorted(enumerate(output_scores), key=lambda x: x[1], reverse=True)[:3]
for position, score in top_3_position:
    print(f"Answer: {answers[position]} - with score {score} in chunk {position}")

Answer: two - with score 0.8707802295684814 in chunk 18
Answer: 3 - with score 0.1274404227733612 in chunk 20
Answer: two - with score 0.05405785143375397 in chunk 17
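
Since most chunks yield an answer with a vanishingly small score, one option is to discard such answers before ranking. The sketch below assumes a cutoff of `0.01`, which is an arbitrary illustrative value and not something tuned in this notebook. The helper `filter_by_score` and the added `chunk` key are hypothetical. The sample scores are copied from the outputs above.

```python
def filter_by_score(results, min_score=0.01):
    """Keep only results whose score reaches min_score (hypothetical cutoff)."""
    return [result for result in results if result["score"] >= min_score]


# Scores and answers copied from the outputs above; "chunk" is added for illustration
sample_results = [
    {"score": 0.8707802295684814, "answer": "two", "chunk": 18},
    {"score": 0.1274404227733612, "answer": "3", "chunk": 20},
    {"score": 0.05405785143375397, "answer": "two", "chunk": 17},
    {"score": 1.3816425781243424e-10, "answer": ".", "chunk": 1},
    {"score": 7.08324371379021e-11, "answer": ".", "chunk": 3},
]

kept = filter_by_score(sample_results)
for result in kept:
    print(result["answer"], result["score"], result["chunk"])
```

The question-answering pipeline in `transformers` also documents a `handle_impossible_answer` argument, which may be worth trying as an alternative to a manual cutoff. Its effect on this model has not been checked here.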


It looks like the model is fairly confident that Juan Garcia mentored 2 students. Interestingly, answers 1 and 3 both have the value 'two', and a quick check reveals that this comes from the stride we implemented: chunks 17 and 18 overlap, so both contain the same sentence.

In [None]:
cv_text_chunks[17]

'by Their Students. Spring 20XX - Present Instructor, Latino/a Culture Anthropology Department, University of Illinois \uf0b7 Integrated multimedia approaches and used instructional technology to enhance pedagogical approach. \uf0b7 Explained challenging concepts using planned lessons, assignments and targeted discussions for 75 freshmen and sophomore students. Spring - Fall 20XX Graduate Mentor, Illinois Summer Research Opportunities Program The Graduate College, University of Illinois \uf0b7 Mentored two undergraduate students in data collection and analysis to visualize the properties of various geotechnical materials. \uf0b7 Guided the students in preparation and presentation of research findings. Summer 20XX, 20XX CV SAMPLE 7 grad.illinois.edu/CareerDevelopment Juan Garcia, page 2 of 3 TEACHING AND MENTORING EXPERIENCE CONTINUED Graduate Mentor, Illinois Summer Research Opportunities Program The Graduate College, University of Illinois \uf0b7 Mentored two undergraduate students in

In [None]:
cv_text_chunks[18]

'College, University of Illinois \uf0b7 Mentored two undergraduate students in data collection and analysis to visualize the properties of various geotechnical materials. \uf0b7 Guided the students in preparation and presentation of research findings. Summer 20XX, 20XX HONORS AND AWARDS Fulbright Scholarship to pursue a PhD \uf0b7 20 scholarships awarded in Argentina that year 20XX Flag Honor Guard Member \uf0b7 Qualified by graduating with honors and ranking 4th among engineering majors at UNSJ 20XX GRANTS Granting Agency, “Title of Grant”, $00,000 20XX - 20XX PUBLICATIONS Garcia, J., other authors. (Year). Title. Journal, Volume (Issue), page numbers. doi:. Garcia, J., other authors. (in press). Title. Journal, Volume (Issue), page numbers. Garcia, J., other authors. (Year produced). Title. Manuscript submitted for publication. Garcia, J., other authors. (Year draft produced). Title. Manuscript in preparation. CONFERENCE PRESENTATIONS ORAL PRESENTATIONS Garcia, J., other authors. (Ye
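
Because overlapping chunks can report the same answer twice, it may help to merge identical answers and keep only the best-scoring occurrence. The following is a minimal sketch under that assumption. `best_per_answer` is a hypothetical helper, and the tuples are the three top predictions printed earlier.

```python
def best_per_answer(candidates):
    """Collapse repeated answer strings, keeping the highest-scoring occurrence.

    Each candidate is a tuple (answer, score, chunk_index).
    """
    best = {}
    for answer, score, chunk_index in candidates:
        key = answer.strip().lower()  # treat "Two" and "two" as the same answer
        if key not in best or score > best[key][1]:
            best[key] = (answer, score, chunk_index)
    # Return the surviving answers, best score first
    return sorted(best.values(), key=lambda item: item[1], reverse=True)


# Top-3 predictions copied from the output above
top_candidates = [
    ("two", 0.8707802295684814, 18),
    ("3", 0.1274404227733612, 20),
    ("two", 0.05405785143375397, 17),
]

unique_answers = best_per_answer(top_candidates)
for answer, score, chunk_index in unique_answers:
    print(f"Answer: {answer} - with score {score} in chunk {chunk_index}")
```

This keeps 'two' from chunk 18 and '3' from chunk 20, and drops the lower-scoring duplicate from chunk 17.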

It is hard to say where the score for the second-best answer, 3, comes from. Looking at the snippet below, the text contains the name Juan Garcia, student activities and 'page 3 of 3', so the model may have linked the 3 to the students. Nevertheless, the score for the second answer is still more than 6 times lower than that of the first.

In [None]:
cv_text_chunks[20]

'Illinois \uf0b7 Participated in the organization of the Principal’s Scholars Program 20XX GEAR UP College Bound Summer Program, where a group of minority children from elementary and middle school visited the college to learn about different paths in engineering. \uf0b7 Prepared a bridge design competition using popsicle sticks and glue, where the children demonstrated their skills and their creativity. July 20XX Student Assistant Office of International Student and Scholar Services (ISSS), University of Illinois \uf0b7 Assisted with check-in procedures for incoming international students. \uf0b7 Helped incoming international students with information on procedures and resources for their successful arrival on campus. July 20XX 8 grad.illinois.edu/CareerDevelopment Juan Garcia, page 3 of 3 TECHNICAL SKILLS \uf0b7 Programming languages and mathematical packages: Matlab, Mathematica, C, C ++ \uf0b7 Computer aided design/engineering: optical imaging, AutoCAD, Patran, Abaqus. \uf0b7 Other
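
To make the comparison between the two best answers concrete, the ratio of their scores can be computed directly from the values printed earlier. The variable names below are illustrative.

```python
# Scores copied from the top-3 output above
best_score = 0.8707802295684814    # answer "two", chunk 18
second_score = 0.1274404227733612  # answer "3", chunk 20

ratio = best_score / second_score
print(f"The best answer scores {ratio:.1f} times higher than the second best")
```

The ratio comes out a little under 7, so the first answer is clearly preferred even though the second one is not negligible.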