Query formulation

In [27]:
#!pip install -U pip setuptools wheel
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
import nltk
import numpy as np
#nltk.download()

In [15]:
tokens = nltk.word_tokenize("What are the different OS for personal computers?")
tagged = nltk.pos_tag(tokens)  # tag once and reuse the result in the next cell
print(tagged)

[('What', 'WP'), ('are', 'VBP'), ('the', 'DT'), ('different', 'JJ'), ('OS', 'NNP'), ('for', 'IN'), ('personal', 'JJ'), ('computers', 'NNS'), ('?', '.')]


In [18]:
# Keep only the nouns: Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS).
query_words = [word for word, tag in tagged if tag.startswith('NN')]
query_words

['OS', 'computers']
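
The noun filter above can be wrapped in a small helper. This is only a minimal sketch and not part of the original notebook: the name `extract_query_terms` is hypothetical, and the function takes already-tagged tokens so that it runs without downloading the NLTK models.

```python
def extract_query_terms(tagged_tokens):
    """Return the tokens whose POS tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


# Tagged output copied from the cell above.
tagged_example = [
    ("What", "WP"), ("are", "VBP"), ("the", "DT"), ("different", "JJ"),
    ("OS", "NNP"), ("for", "IN"), ("personal", "JJ"), ("computers", "NNS"),
    ("?", "."),
]
print(extract_query_terms(tagged_example))  # ['OS', 'computers']
```

Adjectives such as "personal" are dropped, so the query keeps only the terms that are likely to name a topic.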

Document retrieval

In [19]:
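import os

basepath = '.'
document = None  # stays None if no folder matches a query word
for fname in os.listdir(basepath):
    path = os.path.join(basepath, fname)
    # A folder whose name equals a query word is treated as the document collection.
    if os.path.isdir(path) and fname in query_words:
        document = path
        print(path)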

.\OS
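
The loop above compares folder names with the query words exactly, so a query word such as "os" would not match the folder "OS". A possible case-insensitive variant is sketched below. The helper name `find_document_dirs` and the demo folder names are assumptions made for illustration, not part of the original notebook.

```python
import os
import tempfile


def find_document_dirs(basepath, query_words):
    """Return paths of subdirectories whose name matches a query word, ignoring case."""
    wanted = {word.lower() for word in query_words}
    matches = []
    for fname in sorted(os.listdir(basepath)):
        path = os.path.join(basepath, fname)
        if os.path.isdir(path) and fname.lower() in wanted:
            matches.append(path)
    return matches


# Demo on a throwaway directory tree (folder names are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "OS"))
    os.mkdir(os.path.join(tmp, "Networks"))
    found = find_document_dirs(tmp, ["os", "computers"])
    found_names = [os.path.basename(p) for p in found]

print(found_names)  # ['OS']
```

Returning a list instead of a single path also makes the "no folder matched" case explicit: the result is simply empty.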


Passage retrieval

In [23]:
#!pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = []
for filename in os.listdir(document):
    print(filename)
    with open(os.path.join(document, filename), 'r') as file:
        corpus.append(file.read())

corpus

OS_extended.txt
OS_memory.txt
OS_program_execution.txt


["An operating system is the most important software that runs on a computer. It manages the computer's memory and processes, as well as all of its software and hardware. It also allows you to communicate with the computer without knowing how to speak the computer's language. Without an operating system, a computer is useless.\nYour computer's operating system (OS) manages all of the software and hardware on the computer. Most of the time, there are several different computer programs running at the same time, and they all need to access your computer's central processing unit (CPU), memory, and storage. The operating system coordinates all of this to make sure each program gets what it needs.\nOperating systems usually come pre-loaded on any computer you buy. Most people use the operating system that comes with their computer, but it's possible to upgrade or even change operating systems. The three most common operating systems for personal computers are Microsoft Windows, macOS, and 

In [31]:
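# BM25 works on token lists; matching is exact, so 'OS' and 'os' are different tokens.
tokenized_corpus = [nltk.word_tokenize(passage) for passage in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# One score per passage; the passage with the highest score is used as context.
doc_scores = bm25.get_scores(query_words)
max_score_index = np.argmax(doc_scores)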
print(doc_scores)
print(max_score_index)
print(corpus[max_score_index])

[0.8355942 0.        0.       ]
0
An operating system is the most important software that runs on a computer. It manages the computer's memory and processes, as well as all of its software and hardware. It also allows you to communicate with the computer without knowing how to speak the computer's language. Without an operating system, a computer is useless.
Your computer's operating system (OS) manages all of the software and hardware on the computer. Most of the time, there are several different computer programs running at the same time, and they all need to access your computer's central processing unit (CPU), memory, and storage. The operating system coordinates all of this to make sure each program gets what it needs.
Operating systems usually come pre-loaded on any computer you buy. Most people use the operating system that comes with their computer, but it's possible to upgrade or even change operating systems. The three most common operating systems for personal computers are 
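
To make the score array above easier to interpret, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the `rank_bm25` implementation: the function name `bm25_scores`, the toy passages, and the smoothed IDF formula are assumptions of this sketch, so the numbers will differ from those printed by `BM25Okapi`. What it shares with the output above is that a passage containing none of the query tokens scores 0.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with a basic Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            freq = term_freq.get(term, 0)
            if freq == 0:
                continue  # a missing term contributes nothing
            n_t = doc_freq[term]
            # Smoothed IDF that stays non-negative (an assumption of this sketch).
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = freq + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages, made up for illustration.
toy_docs = [
    "the OS manages memory and hardware".split(),
    "memory is divided into pages".split(),
    "a program is loaded and executed".split(),
]
toy_scores = bm25_scores(["OS", "computers"], toy_docs)
best_index = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(best_index)  # 0
```

Because tokens are compared exactly, the query `["os"]` scores 0 everywhere in this toy corpus. Lowercasing both the corpus and the query before scoring is one way to avoid that.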

Answer retrieval

In [32]:
from transformers import pipeline

In [33]:
nlp = pipeline("question-answering")

In [35]:
# Read the full text of OS_extended.txt from the document folder found earlier.
with open(os.path.join(document, "OS_extended.txt")) as file:
    OS_original_text = file.read()
print(OS_original_text)

# Each call returns a dict with the keys 'score', 'start', 'end' and 'answer'.
print(nlp(question="What are the different OS for personal computers?", context=corpus[max_score_index]))
print(nlp(question="What does the operating system coordinate?", context=corpus[max_score_index]))
print(nlp(question="What is necessary for a program to run?", context=corpus[max_score_index]))

An operating system is the most important software that runs on a computer. It manages the computer's memory and processes, as well as all of its software and hardware. It also allows you to communicate with the computer without knowing how to speak the computer's language. Without an operating system, a computer is useless.
Your computer's operating system (OS) manages all of the software and hardware on the computer. Most of the time, there are several different computer programs running at the same time, and they all need to access your computer's central processing unit (CPU), memory, and storage. The operating system coordinates all of this to make sure each program gets what it needs.
Operating systems usually come pre-loaded on any computer you buy. Most people use the operating system that comes with their computer, but it's possible to upgrade or even change operating systems. The three most common operating systems for personal computers are Microsoft Windows, macOS, and Linu
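
The question-answering pipeline returns, for each call, a dict with the keys `score`, `start`, `end` and `answer`. When the same question is asked against several passages, the results can be compared by score. The sketch below shows one way to do that. The helper name `pick_best_answer`, the threshold, and the example dicts (including their scores and offsets) are invented placeholders, not output of the model.

```python
def pick_best_answer(results, min_score=0.0):
    """Return the highest-scoring result dict, or None if none reaches min_score."""
    candidates = [r for r in results if r["score"] >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["score"])


# Placeholder results with invented values, shaped like the pipeline's output dicts.
example_results = [
    {"score": 0.12, "start": 0, "end": 19, "answer": "An operating system"},
    {"score": 0.58, "start": 4, "end": 20, "answer": "operating system"},
]
best = pick_best_answer(example_results, min_score=0.3)
print(best["answer"])  # operating system
```

Returning `None` below the threshold lets the caller report "no answer found" instead of printing a low-confidence span.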

Using the model and tokenizer

In [None]:
!pip install transformers sentencepiece

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=False)

In [None]:
questions = [
    "What are the different OS for personal computers?",
    "What does the operating system coordinate?",
    "What is necessary for a program to run?",
]

for question in questions:
    # Encode question and context as one pair; truncate the context to BERT's 512-token limit.
    inputs = tokenizer(question, OS_original_text, truncation="only_second", max_length=512, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        answer_start_scores, answer_end_scores = model(**inputs)

    # Most likely start token, and most likely end token (+1 so the slice includes it).
    answer_start = torch.argmax(answer_start_scores).item()
    answer_end = torch.argmax(answer_end_scores).item() + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
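
Taking the argmax of the start scores and of the end scores independently can yield an end position that lies before the start, which produces an empty answer. A common remedy is to search for the best valid pair jointly. The following is a minimal sketch of that idea on plain Python lists. The helper name `best_span`, the `max_answer_len` default, and the toy logits are assumptions for illustration, not part of the notebook.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end].

    Only spans with start <= end and at most max_answer_len tokens are
    considered. The returned end index is inclusive.
    """
    best = None
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits (invented): the independent argmaxes would give start=3 and end=1.
toy_start = [0.1, 2.0, 0.3, 2.5]
toy_end = [0.2, 3.0, 1.0, 0.4]
print(best_span(toy_start, toy_end))  # (1, 1)
```

With the model above, the inputs could be obtained as `answer_start_scores[0].tolist()` and `answer_end_scores[0].tolist()`, and the answer tokens would then be `input_ids[start:end + 1]`.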