# Our Goal

I intend to build a question-answering setup that uses a PDF file as the context. I will run several pre-trained models on the same questions, which I carefully crafted from various sections of the PDF study file, and compare how each model responds. I purposely excluded one question to observe how our models handle such cases.



## Table of Contents

1. **Introduction**
   - Brief overview of the notebook's purpose and scope.

2. **Question-Answering with BERT**
   - Loading and utilizing a pre-trained BERT model for question-answering tasks.
   - Moving the BERT model to the GPU if available.
   - Function to ask questions from the context using BERT and printing the answers.

3. **Question-Answering with DistilBERT**
   - Loading and utilizing a pre-trained DistilBERT model for question-answering tasks.
   - Moving the DistilBERT model to the GPU if available.
   - Function to ask questions from the context using DistilBERT and printing the answers.

4. **Question-Answering with BART**
   - Loading and utilizing a pre-trained BART model for question-answering tasks.
   - Moving the BART model to the GPU if available.
   - Function to ask questions from the context using BART and printing the answers.

5. **Question-Answering with T5**
   - Loading and utilizing a pre-trained T5 model for question-answering tasks.
   - Moving the T5 model to the GPU if available.
   - Function to ask questions from the context using T5 and printing the answers.

6. **Conclusion**
   - Summary of the notebook's key points and findings.

7. **References**
   - Citations or links to relevant sources used in the notebook, if applicable.



In [None]:
pip install PyMuPDF pandas

The command `pip install PyMuPDF pandas` is used in Python to install two Python packages: PyMuPDF and pandas. Here's what these packages do:

1. **PyMuPDF (PyMuPDF is a Python binding for the MuPDF library):**
   - **Purpose:** PyMuPDF is a Python library that allows you to work with PDF files. It provides functionalities to read, write, and manipulate PDF documents.
   - **Usage:** You can use PyMuPDF to extract text, images, and metadata from PDF files, as well as perform various operations such as merging, splitting, and rotating PDF pages.

2. **pandas (pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and data manipulation library built on top of Python):**
   - **Purpose:** pandas is a popular Python library for data manipulation and analysis. It provides data structures like Series and DataFrame, which allow you to handle and analyze structured data efficiently.
   - **Usage:** pandas is commonly used for tasks such as data cleaning, data transformation, and statistical analysis. It's widely used in data science and data analysis projects for handling and processing large datasets.

By running `pip install PyMuPDF pandas`, you're installing these libraries in your Python environment, enabling you to use their functionalities in your code. Run the command in a notebook cell, or in your terminal or command prompt, before importing the packages.

The provided code below defines a Python function and utilizes the PyMuPDF library (also known as fitz) to extract text from a PDF document. Here's a general explanation of what the entire code does:

**Explanation:**

1. **Library Import:**
   - The code imports the PyMuPDF library, often referred to as fitz, which provides functions for working with PDF files.

2. **PDF Text Extraction Function:**
   - The `extract_text_from_pdf` function takes a PDF file path as input.
   - Inside the function, it opens the PDF document specified by the input path.
   - It iterates through each page of the PDF, extracting text from each page using the `page.get_text()` method.
   - The extracted text from all pages is concatenated into a single string and returned.

3. **PDF Path and Text Extraction:**
   - The code specifies the path to the PDF file (`pdf_textbook_path`) that needs to be processed.
   - It then calls the `extract_text_from_pdf` function with the specified PDF path, extracting text from the PDF and storing it in the `context` variable.

In summary, the code defines a function `extract_text_from_pdf` that extracts text from a PDF document using the PyMuPDF library. It then utilizes this function to extract text from a specific PDF file (`LLM Note for Inference (1).pdf`) and stores the extracted text in the `context` variable for further use.

In [None]:
import fitz  # PyMuPDF

# Extract the text of every page in a PDF and return it as one string
def extract_text_from_pdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as pdf_document:
        # Iterating over the document yields its pages in order
        for page in pdf_document:
            text += page.get_text()
    return text

# Replace the path below with the actual path to your PDF textbook
# pdf_textbook_path = '/kaggle/input/physics-context-for-large-lang-models-1-page-pdf/LLM Note for Inference.pdf'
pdf_textbook_path = '/kaggle/input/llm-training-little-pdf/LLM Note for Inference (1).pdf'

context = extract_text_from_pdf(pdf_textbook_path)

The provided code snippet sets up a list of example questions and their reference answers, imports `torch` and `pandas`, checks for the availability of a CUDA-enabled GPU, initializes empty lists to store model names and corresponding durations, and creates an empty DataFrame for collecting each model's answers. There's also a commented-out function, `run_and_measure_time`, which seems to be intended for measuring the time taken by different models to execute specific code.

**Explanation:**

1. **Example Questions:**
   - The `questions` list contains example questions related to the given context. These questions are likely intended for a question-answering task using a machine learning model.

2. **Device Check:**
   - The code checks if a CUDA-enabled GPU is available. If available, it sets the `device` variable to `"cuda"`; otherwise, it sets it to `"cpu"`. This is essential for utilizing GPU acceleration if it's available, which can significantly speed up computations, especially for deep learning models.

3. **Imports:**
   - The code imports `torch` (PyTorch), the deep learning library the models run on, and `pandas`, which is used to tabulate the answers.

4. **Initialization:**
   - The `model_names` and `durations` lists are initialized to store model names and their corresponding execution durations, respectively.

5. **Commented-Out Function (Not Executed):**
   - There is a commented-out function `run_and_measure_time` which seems to be designed to measure the execution time of specific code related to different machine learning models. The function takes a `model_name` and `code` as inputs, runs the provided `code`, measures the time taken, and appends the model name and duration to the respective lists. However, this function is currently not executed.
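
As a possible alternative to the commented-out helper, the sketch below times a callable rather than passing a string to `exec`, which keeps the timed code visible to linters and debuggers. The signature and the `timed_names` / `timed_durations` lists are assumptions made for illustration; they are not part of the original notebook.

```python
import time

# Hypothetical variant of run_and_measure_time: takes a zero-argument callable
# instead of a string of code, and records the result in the lists passed in.
def run_and_measure_time(model_name, fn, names, times):
    start = time.perf_counter()              # monotonic, high-resolution timer
    result = fn()                            # run the code being timed
    duration = time.perf_counter() - start   # elapsed seconds
    names.append(model_name)
    times.append(duration)
    print(f"{model_name} took {duration:.2f} seconds to run.")
    return result

# Illustrative usage with a cheap stand-in workload
timed_names, timed_durations = [], []
total = run_and_measure_time("demo", lambda: sum(range(1000)), timed_names, timed_durations)
```

In the notebook, `fn` could wrap one model's question loop, for example a small function that calls `ask_question_bert` for every question.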


In [None]:
# Context string
# context = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity, one of the two pillars of modern physics."
import torch
import pandas as pd


# Example questions
questions = [
    "What is momentum?",
    "What is the nationality of Albert Einstein?",
    "What did Albert Einstein develop?",
    "What are the two pillars of modern physics?"
]

answers = [
    "The momentum of an object is defined as its mass multiplied by its velocity. Mathematically: p = mv",
    "Albert Einstein is from Germany",
    "He developed the theory of relativity",
    "the theory of relativity and the quantum theory",
]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


import time
import matplotlib.pyplot as plt

# Lists to store model names and corresponding durations
model_names = []
durations = []

# Initialize an empty DataFrame with questions and correct answers
df = pd.DataFrame(columns=['question', 'answer', 'BERT_answer', 'DistilBERT_answer', 'BART_answer', 'T5T_answer'])


# Add questions and correct answers to the DataFrame
# for i in range(len(questions)):
#     question = questions[i]
#     correct_answer = answers[i]  # You need to provide the correct answers
#     df = df.append({'question': question, 'answer': correct_answer, 'BERT_answer': None, 'DistilBERT_answer': None, 'BART_answer': None, 'T5T_answer': None}, ignore_index=True)

# # Function to run code and measure time
# def run_and_measure_time(model_name, code):
#     start_time = time.time()  # Start the timer
#     exec(code)  # Execute the code for the specific model
#     end_time = time.time()  # Stop the timer
#     duration = end_time - start_time  # Calculate the duration in seconds
#     model_names.append(model_name)
#     durations.append(duration)
#     print(f"{model_name} took {duration:.2f} seconds to run.")



# BERT MODEL

The code snippet below performs question-answering using a pre-trained BERT model. Here's a general explanation of what the code does:


**Explanation:**

1. **Loading Pre-trained Model and Tokenizer:**
   - The code imports the `BertTokenizer` and `BertForQuestionAnswering` classes from the `transformers` library.
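   - It loads a pre-trained BERT tokenizer and model (`bert-large-uncased-whole-word-masking-finetuned-squad`), a BERT-large checkpoint fine-tuned on SQuAD for question-answering tasks. The plain `bert-base-uncased` checkpoint is left commented out in the code.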

2. **Moving Model to GPU:**
   - If a CUDA-enabled GPU is available (as determined earlier in the code), the model is moved to the GPU using `model.to(device)`.

3. **Question-Answering Function:**
   - The `ask_question_bert` function takes a `question`, `context` (the provided text), pre-trained `model`, `tokenizer`, and `device` as inputs.
   - It tokenizes the question and context, encodes them as input tensors, and truncates the input to the model's maximum sequence length (512 tokens), so any context beyond that limit is not seen by the model.
   - The model predicts start and end logits for the answer span. The indices with the highest logits correspond to the start and end positions of the answer.
   - The function decodes the answer span and returns the answer as a string.

4. **Question-Answering Loop:**
   - The code iterates through the `questions` list.
   - For each question, it calls the `ask_question_bert` function, gets the answer, and prints the question along with the extracted answer.

This code essentially performs question-answering using a pre-trained BERT model, providing answers to the example questions based on the given context.
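
One caveat of picking the start and end positions with two independent `argmax` calls is that the end index can land before the start index, which yields an empty answer. A common remedy is to choose the pair jointly. The sketch below illustrates that idea on plain Python lists; `best_span` and `max_answer_len` are hypothetical names, not part of the notebook or of `transformers`.

```python
# Hypothetical helper: pick the (start, end) pair with the highest combined
# logit, subject to start <= end and a maximum answer length.
def best_span(start_logits, end_logits, max_answer_len=30):
    best = (0, 0, float("-inf"))  # (start index, end index, score)
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Toy logits where independent argmax would give start=3, end=1 (an empty span)
toy_start = [0.1, 0.2, 0.3, 5.0]
toy_end = [0.1, 4.0, 0.2, 3.0]
span = best_span(toy_start, toy_end)
```

With real model outputs, the lists could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.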

In [None]:
from transformers import BertTokenizer, BertForQuestionAnswering
# from transformers import DistilBertForQuestionAnswering, DistilBertTokenizer

bert_ans = []
# Load pre-trained BERT model and tokenizer
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
# model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


# Move the model to the GPU if available
model.to(device)


# Function to ask questions from the context using BERT on the GPU
def ask_question_bert(question, context, model, tokenizer, device):
    inputs = tokenizer.encode_plus(question, context, return_tensors='pt', max_length=512, truncation=True).to(device)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    outputs = model(input_ids, attention_mask=attention_mask)
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)
    
    answer = tokenizer.decode(input_ids[0][start_idx:end_idx + 1])
    return answer

# Ask questions and print answers using BERT (on GPU if available)
for question in questions:
    answer = ask_question_bert(question, context, model, tokenizer, device)
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
    bert_ans.append(answer)


# DistilBERT MODEL

The provided code utilizes the Hugging Face `transformers` library to perform question-answering using a pre-trained DistilBERT model. Here's a general explanation of what the code does:

**Explanation:**

1. **Loading Pre-trained Model and Tokenizer:**
   - The code imports the `DistilBertTokenizer` and `DistilBertForQuestionAnswering` classes from the `transformers` library.
   - It loads a pre-trained DistilBERT tokenizer and model (`distilbert-base-cased-distilled-squad`). This model is specifically fine-tuned for question-answering tasks.

2. **Moving Model to GPU:**
   - If a CUDA-enabled GPU is available (as determined earlier in the code), the model is moved to the GPU using `model.to(device)`.

3. **Question-Answering Function:**
   - The `ask_question_distilbert` function takes a `question`, `context` (the provided text), pre-trained `model`, `tokenizer`, and `device` as inputs.
   - It tokenizes the question and context, encodes them as input tensors, and truncates the input to the model's maximum sequence length (512 tokens), so any context beyond that limit is not seen by the model.
   - The model predicts start and end logits for the answer span. The indices with the highest logits correspond to the start and end positions of the answer.
   - The function decodes the answer span and returns the answer as a string.

4. **Question-Answering Loop:**
   - The code iterates through the `questions` list.
   - For each question, it calls the `ask_question_distilbert` function, gets the answer, and prints the question along with the extracted answer.

This code essentially performs question-answering using a pre-trained DistilBERT model, providing answers to the example questions based on the given context.
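
Because of the 512-token limit, an answer located late in a long PDF can never be found by a single call. One common workaround is to split the context tokens into overlapping windows, run the model on each window, and keep the highest-scoring span. The sketch below shows only the windowing step on a plain list of token IDs; `sliding_windows`, `window_size`, and `stride` are hypothetical names chosen for this illustration.

```python
# Hypothetical helper: split a token list into overlapping windows so that
# every token appears in at least one window.
def sliding_windows(token_ids, window_size, stride):
    if window_size <= 0 or stride <= 0 or stride > window_size:
        raise ValueError("need 0 < stride <= window_size")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += stride
    return windows

# Ten dummy token IDs, windows of four tokens that advance two tokens at a time
chunks = sliding_windows(list(range(10)), window_size=4, stride=2)
```

The `transformers` tokenizers also accept `stride` and `return_overflowing_tokens` arguments that serve the same purpose; the manual version above is only meant to make the idea explicit.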

In [None]:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

distil_bert_ans = []

# Load pre-trained DistilBERT tokenizer and model
# tokenizer2 = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
# model2 = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')

# Load pre-trained DistilBERT tokenizer and model for question-answering
tokenizer2 = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
model2 = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')

# Move the model to the GPU if available
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model2.to(device)

# Function to ask questions from the context using DistilBERT on the GPU
def ask_question_distilbert(question, context, model, tokenizer, device):
    inputs = tokenizer.encode_plus(question, context, return_tensors='pt', max_length=512, truncation=True).to(device)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    outputs = model(input_ids, attention_mask=attention_mask)
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)
    
    answer = tokenizer.decode(input_ids[0][start_idx:end_idx + 1])
    return answer

# Example usage
# context = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity, one of the two pillars of modern physics."
# questions = ["What is the theory developed by Albert Einstein?", "Where was Albert Einstein born?"]

# Ask questions and print answers using DistilBERT (on GPU if available)
for question in questions:
    answer = ask_question_distilbert(question, context, model2, tokenizer2, device)
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
    distil_bert_ans.append(answer)


# BART MODEL

This code snippet uses a pre-trained BART (Bidirectional and Auto-Regressive Transformers) model for question-answering tasks. Here's an explanation of what the code does:


**Explanation:**

1. **Loading Pre-trained Model and Tokenizer:**
   - The code imports the `BartTokenizer` and `BartForConditionalGeneration` classes from the `transformers` library.
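   - It loads a pre-trained BART tokenizer and model (`facebook/bart-large`). BART is a transformer-based sequence-to-sequence model for text generation tasks. Note that this checkpoint is the general pre-trained model and has not been fine-tuned for question-answering, so its answers may largely echo the input.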

2. **Moving Model to GPU:**
   - If a CUDA-enabled GPU is available (as determined earlier in the code), the model is moved to the GPU using `model3.to(device)`.

3. **Question-Answering Function:**
   - The `ask_question_bart` function takes a `question`, `context` (the provided text), pre-trained `model`, `tokenizer`, and `device` as inputs.
   - It combines the question and context into a single input string, encodes it as input tensors, and moves them to the selected device (the GPU if available).
   - The model generates an answer sequence based on the combined input.
   - The function decodes the generated answer sequence, skipping special tokens, and returns the answer as a string.

4. **Question-Answering Loop:**
   - The code iterates through the `questions` list.
   - For each question, it calls the `ask_question_bart` function, gets the answer, and prints the question along with the generated answer.

This code performs question-answering using a pre-trained BART model, providing answers to the example questions based on the given context.

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

bart_ans = []

# Load pre-trained BART tokenizer and model
tokenizer3 = BartTokenizer.from_pretrained('facebook/bart-large')
model3 = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

# Move the model to the GPU if available
model3.to(device)

# Function to ask questions from the context using BART (on GPU if available)
def ask_question_bart(question, context, model, tokenizer, device):
    # Tokenize the combined input with the tokenizer passed in;
    # truncate to BART's 1024-token input limit so long contexts do not fail
    inputs_bart = tokenizer.encode("question: " + question + " context: " + context, return_tensors='pt', max_length=1024, truncation=True).to(device)

    # Generate the answer from the model passed in
    output_bart = model.generate(inputs_bart)

    # Decode the output to get the answer
    answer_bart = tokenizer.decode(output_bart[0], skip_special_tokens=True)

    return answer_bart


# Ask questions and print answers using BART (on GPU if available)
for question in questions:
    answer = ask_question_bart(question, context, model3, tokenizer3, device)
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
    bart_ans.append(answer)



# T5 MODEL

The code snippet below uses a pre-trained T5 (Text-To-Text Transfer Transformer) model for question-answering tasks. Here's an explanation of what the code does:


**Explanation:**

1. **Loading Pre-trained Model and Tokenizer:**
   - The code imports the `T5Tokenizer` and `T5ForConditionalGeneration` classes from the `transformers` library.
   - It loads a pre-trained T5 tokenizer and model (`t5-small`). T5 is a transformer-based model that can be applied to various text generation tasks.

2. **Moving Model to GPU:**
   - If a CUDA-enabled GPU is available (as determined earlier in the code), the model is moved to the GPU using `model4.to(device)`.

3. **Question-Answering Function:**
   - The `ask_question_t5t` function takes a `question`, `context` (the provided text), pre-trained `model`, `tokenizer`, and `device` as inputs.
   - It combines the question and context into a single input string, encodes it as input tensors, and moves them to the selected device (the GPU if available).
   - The model generates an answer sequence based on the combined input.
   - The function decodes the generated answer sequence, skipping special tokens, and returns the answer as a string.

4. **Question-Answering Loop:**
   - The code iterates through the `questions` list.
   - For each question, it calls the `ask_question_t5t` function, gets the answer, and prints the question along with the generated answer.

This code performs question-answering using a pre-trained T5 model, providing answers to the example questions based on the given context.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

t5t_ans = []

tokenizer4 = T5Tokenizer.from_pretrained('t5-small')
model4 = T5ForConditionalGeneration.from_pretrained('t5-small')

# Move the model to the GPU if available
model4.to(device)

# Fine-tuning code not shown (you would need labeled data for fine-tuning)

# Function to ask questions from the context using T5 (on GPU if available)
def ask_question_t5t(question, context, model, tokenizer, device):

    input_text = f"question: {question} context: {context}"
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    output_ids = model.generate(input_ids)
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    return answer


# Ask questions and print answers using T5 (on GPU if available)
for question in questions:
    answer = ask_question_t5t(question, context, model4, tokenizer4, device)
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
    t5t_ans.append(answer)


In [None]:
df['question'] = questions
df['answer'] = answers
df['BERT_answer'] = bert_ans
df['DistilBERT_answer'] = distil_bert_ans
df['BART_answer'] = bart_ans
df['T5T_answer'] = t5t_ans


df
# 'answer', 'BERT_answer', 'DistilBERT_answer', 'BART_answer', 'T5T_answer']

# Analysis of Models

**1. BLEU (Bilingual Evaluation Understudy) Score (for text generation tasks):**
BLEU score measures the similarity between the generated text and a reference text. Higher BLEU scores indicate better performance.
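
To make the metric less of a black box, here is a minimal sketch of two BLEU ingredients: clipped unigram precision and the brevity penalty. Full BLEU combines clipped precisions for several n-gram orders, so this is an illustration only, not a replacement for a library implementation. The helper names are hypothetical.

```python
import math
from collections import Counter

# Fraction of candidate tokens that also occur in the reference, where each
# token is credited at most as many times as it appears in the reference.
def clipped_unigram_precision(reference, candidate):
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    if not cand_counts:
        return 0.0
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / sum(cand_counts.values())

# Penalty applied when the candidate is shorter than the reference.
def brevity_penalty(reference, candidate):
    r, c = len(reference.split()), len(candidate.split())
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

ref = "He developed the theory of relativity"
cand = "the theory of relativity"
precision = clipped_unigram_precision(ref, cand)  # every candidate token is in the reference
penalty = brevity_penalty(ref, cand)              # below 1 because the candidate is shorter
```

This shows why short extractive answers can score low against a full-sentence reference even when every word is correct: the precision is perfect, but the brevity penalty pulls the score down.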

You can use the nltk library in Python to calculate BLEU scores:

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Function to calculate BLEU score for a model's answer with smoothing
def calculate_bleu(reference, candidate):
    # Use smoothing function to handle zero counts
    smoothing_function = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(), smoothing_function=smoothing_function)

# Function to calculate BLEU score for a model's answer
# def calculate_bleu(reference, candidate):
#     return sentence_bleu([reference.split()], candidate.split())

# Calculate BLEU scores for each model's answers and add to the DataFrame
for index, row in df.iterrows():
    reference_answer = row['answer']

    # Calculate BLEU scores
    bleu_bert = calculate_bleu(reference_answer, row['BERT_answer'])
    bleu_distilbert = calculate_bleu(reference_answer, row['DistilBERT_answer'])
    bleu_bart = calculate_bleu(reference_answer, row['BART_answer'])
    bleu_t5t = calculate_bleu(reference_answer, row['T5T_answer'])
    
    # Add BLEU scores to the DataFrame
    df.at[index, 'BLEU_BERT'] = bleu_bert
    df.at[index, 'BLEU_DistilBERT'] = bleu_distilbert
    df.at[index, 'BLEU_BART'] = bleu_bart
    df.at[index, 'BLEU_T5T'] = bleu_t5t

# Print the updated DataFrame with BLEU scores
df


In [None]:
import matplotlib.pyplot as plt

# Plotting BLEU scores for each question and model
questions = df['question'].tolist()
models = ['BERT', 'DistilBERT', 'BART', 'T5T']
bleu_scores = df[['BLEU_BERT', 'BLEU_DistilBERT', 'BLEU_BART', 'BLEU_T5T']].values.T

# Create bar plots
for i, question in enumerate(questions):
    plt.figure(figsize=(8, 6))
    plt.bar(models, bleu_scores[:, i])
    plt.title(f'BLEU Scores for Question: {question}')
    plt.xlabel('Models')
    plt.ylabel('BLEU Score')
    plt.ylim(0, 1)  # Set y-axis limit to 1 for BLEU score range [0, 1]
    plt.show()


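**2. Token Intersection over Union (Token IoU), a relaxed alternative to Exact Match (EM):**

EM score measures the percentage of answers that exactly match the reference answers. Because free-form answers rarely match a reference word for word, the `calculate_token_iou` function in this section computes Token IoU instead: the number of distinct tokens shared by the reference and the model's answer, divided by the number of distinct tokens that appear in either.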
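
For reference, a strict EM check could be sketched as follows. The normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) is in the spirit of SQuAD-style evaluation; the function names are hypothetical and this is an illustration under those assumptions rather than the notebook's own metric.

```python
import re
import string

# Lowercase, drop punctuation and English articles, and collapse whitespace
def normalize_answer(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

# 1.0 if the normalized strings are identical, otherwise 0.0
def exact_match(reference, candidate):
    return float(normalize_answer(reference) == normalize_answer(candidate))

em_same = exact_match("He developed the theory of relativity",
                      "he developed the theory of relativity.")
em_diff = exact_match("Albert Einstein is from Germany", "german")
```

Averaging `exact_match` over all questions gives the EM percentage for a model.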

In [None]:
from nltk.tokenize import word_tokenize

# Function to calculate Token Intersection over Union (Token IoU)
def calculate_token_iou(reference, candidate):
    reference_tokens = set(word_tokenize(reference.lower()))
    candidate_tokens = set(word_tokenize(candidate.lower()))
    
    intersection = len(reference_tokens.intersection(candidate_tokens))
    union = len(reference_tokens.union(candidate_tokens))
    
    if union == 0:
        return 0.0
    else:
        return intersection / union

# Calculate Token Intersection over Union (Token IoU) scores for each model's answers
for index, row in df.iterrows():
    reference_answer = row['answer']

    # Calculate Token IoU scores
    iou_bert = calculate_token_iou(reference_answer, row['BERT_answer'])
    iou_distilbert = calculate_token_iou(reference_answer, row['DistilBERT_answer'])
    iou_bart = calculate_token_iou(reference_answer, row['BART_answer'])
    iou_t5t = calculate_token_iou(reference_answer, row['T5T_answer'])
    
    # Add Token IoU scores to the DataFrame
    df.at[index, 'IOU_BERT'] = iou_bert
    df.at[index, 'IOU_DistilBERT'] = iou_distilbert
    df.at[index, 'IOU_BART'] = iou_bart
    df.at[index, 'IOU_T5T'] = iou_t5t

# Print the updated DataFrame with Token IoU scores
df


In [None]:
import matplotlib.pyplot as plt

# Plotting BLEU and Token IoU scores for each question and model
# (the scores do not change per question, so read them from the DataFrame once)
iou_scores = df[['IOU_BERT', 'IOU_DistilBERT', 'IOU_BART', 'IOU_T5T']].values.T

for i, question in enumerate(questions):
    plt.figure(figsize=(16, 6))

    # Plot BLEU scores
    plt.subplot(1, 2, 1)
    plt.bar(models, bleu_scores[:, i], color='skyblue')
    plt.title(f'BLEU Scores for Question: {question}')
    plt.xlabel('Models')
    plt.ylabel('BLEU Score')
    plt.ylim(0, 1)  # Set y-axis limit to 1 for BLEU score range [0, 1]

    # Plot Token IoU scores
    plt.subplot(1, 2, 2)
    plt.bar(models, iou_scores[:, i], color='salmon')
    plt.title(f'Token IoU Scores for Question: {question}')
    plt.xlabel('Models')
    plt.ylabel('Token IoU Score')
    plt.ylim(0, 1)  # Set y-axis limit to 1 for Token IoU score range [0, 1]

    # No EM panel: the EM_* columns are never computed in this notebook

    plt.tight_layout()
    plt.show()
