In [1]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

In [2]:

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [4]:
from PyPDF2 import PdfReader
import nltk
nltk.download('punkt')

# Extracting Text from PDF
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        text = " ".join(page.extract_text() for page in pdf.pages)
    return text

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
text = extract_text_from_pdf("CollieryControlOorder2000.pdf")

In [6]:
# Step 1: Remove special characters and symbols
cleaned_text = ''.join(e for e in text if (e.isalnum() or e.isspace() or e in ['.', ',', ';', ':', '(', ')']))

# Step 2: Remove extra spaces and line breaks
cleaned_text = ' '.join(cleaned_text.split())

# Step 3: Join lines
cleaned_text = cleaned_text.replace('\n', ' ')

In [7]:
cleaned_text

'Colliery Control Order COLLIERY CONTROL ORDER, 2000 In exercise of the powers conferred by section 3 read with section 5 of the Essential Commodities Act, 1955 (10 of 1955) and in supersession of the Colliery Control Order, 1945, except as respects things done or omitted to be done before such supersession, the Government of India has issued a Gazette Notification on 1.1.2000 to publish the Colliery Control Order, 2000. The content of the Colliery Control Order, 2000 is given below. 1. Short title and commencement. (1) This Order may be called the Colliery Control Order, 2000. (2) It shall come into force on the 1st day of January, 2000. 2. Definitions. In this Order, unless there is anything repugnant in the subject or context, (a) coal includes anthracite, bituminous coal, lignite, peat and any other form of carbonaceous matter sold or marketed as coal and also coke; (b) Coal Controller means the person appointed as such by the Central Government under the provisions of the Coal Con

In [8]:
def process_text_chunk(chunk_text, question, max_seq_length=512):
    tokenized_question = tokenizer.encode(question, add_special_tokens=True, return_tensors="pt")
    
    all_answers = []
    print("len=",len(chunk_text))
    for start in range(0, len(chunk_text), max_seq_length):
        end = start + max_seq_length
        chunk = chunk_text[start:end]
        print("count: ",start)
        tokenized_chunk = tokenizer.encode(chunk, add_special_tokens=True, return_tensors="pt")
        
        input_ids = torch.cat([tokenized_question, tokenized_chunk], dim=1)
        
        output = model(input_ids)
        answer_start = torch.argmax(output.start_logits)
        answer_end = torch.argmax(output.end_logits) + 1
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[0][answer_start:answer_end]))
        
        # Compute confidence level based on the sum of start and end logits
        confidence = output.start_logits[0][answer_start] + output.end_logits[0][answer_end - 1]
        
        all_answers.append({"answer": answer, "confidence": confidence.item()})
    
    return all_answers

In [9]:
# Example chunk text and question
chunk_text = cleaned_text
question = "How can a colliery owner obtain permission to open a coal mine, seam, or section of a seam?"

# Process the chunk with the question
answers = process_text_chunk(chunk_text, question)

# Sort answers by confidence level (higher confidence first)
answers.sort(key=lambda x: x["confidence"], reverse=True)

# Print answers with confidence levels
for i, answer in enumerate(answers):
    print(f"Answer {i + 1}: {answer['answer']} (Confidence: {answer['confidence']:.2f})")

len= 8175
count:  0
count:  512
count:  1024
count:  1536
count:  2048
count:  2560
count:  3072
count:  3584
count:  4096
count:  4608
count:  5120
count:  5632
count:  6144
count:  6656
count:  7168
count:  7680
Answer 1: in writing of the central government (Confidence: 9.41)
Answer 2: power to inspect collieries (Confidence: 5.65)
Answer 3: require any owner or agent or manager of a colliery to give any information in his possession (Confidence: 5.52)
Answer 4: in accordance with the procedure specified in subclause ( 1 ) (Confidence: 4.89)
Answer 5: cause the owner , agent or manager of a colliery or any person engaged in or incharge of the loading of coal in wagons , trolleys or trucks in a colliery (Confidence: 4.56)
Answer 6: declaration of grades of coal , the same may be referred to the coal controller (Confidence: 4.55)
Answer 7: without the prior permission in writing of the central government (Confidence: 4.45)
Answer 8: the owner , agent or manager of a colliery shall dec