## Installing Required Libraries

This cell installs the required Python libraries with the pip package manager:
1. transformers: Pre-trained natural language processing models, plus the Trainer utilities used below.
2. datasets: Loading and preprocessing of datasets for machine learning tasks.
3. sentencepiece: The subword tokenization library that T5Tokenizer depends on.
4. accelerate: Distributed training and inference with PyTorch.

Run this cell first so that the required dependencies are installed.


In [None]:
!pip install transformers datasets sentencepiece accelerate

## Importing Libraries

In this cell, we import the necessary modules and classes from the installed libraries.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset
import torch

## Loading Pre-trained Model and Tokenizer

 In this cell, we load a pre-trained T5 model and its corresponding tokenizer.
 - model_name: The name of the pre-trained model to be used, in this case, "lmqg/t5-base-squad-qag".
 - tokenizer: T5Tokenizer instance created from the pre-trained model.
 - model: T5ForConditionalGeneration instance loaded from the pre-trained model.


In [None]:
model_name = "lmqg/t5-base-squad-qag"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

## Loading Dataset

In this cell, we load the dataset for the question and answer generation (QAG) task using the load_dataset function.
 - dataset: A dataset object with train and validation splits, loaded from the "lmqg/qag_squad" dataset.

In [None]:
# Load and prepare dataset
dataset = load_dataset("lmqg/qag_squad")

## Data Preprocessing Function

In this cell, we define a data preprocessing function, preprocess_data, to prepare the dataset for the T5 model.
 - preprocess_data function takes examples as input and tokenizes the input context and target (question-answer pair).
 - The input is tokenized using the T5 tokenizer with specified maximum lengths and padding.
 - Labels (target) are tokenized separately as they require a different tokenizer configuration.


In [None]:
def preprocess_data(examples):
    # Preprocess input (context) and target (question and answer pair) for the model
    model_inputs = tokenizer(examples['paragraph'], max_length=512, truncation=True, padding="max_length")

    # Tokenize the targets; text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples['questions_answers'], max_length=128, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_data, batched=True)
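
A common refinement that the cell above does not apply: because the targets are padded to max_length, the pad token ids end up in the labels and count toward the loss. Hugging Face seq2seq models ignore label positions set to -100, so the padding is often masked out. The following is a minimal sketch in plain Python, assuming the pad id is 0 (the value of tokenizer.pad_token_id for T5); mask_padding is a hypothetical helper and the token ids in the toy batch are made up.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in a batch of label sequences with ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]

# Toy batch of two padded target sequences (made-up ids, 0 is the assumed pad id)
batch = [[71, 1525, 1, 0, 0], [363, 19, 8, 1, 0]]
masked = mask_padding(batch)
print(masked)
```

Inside preprocess_data, this could be applied as `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`.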

## Training Configuration

In this cell, we define the training arguments for configuring the training process.
 - TrainingArguments is a class that holds all the hyperparameters for training the model.
 - We specify the output directory for saving model checkpoints, evaluation strategy, learning rate, batch sizes for training and evaluation, number of training epochs, and weight decay.

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",             # Output directory for model checkpoints
    evaluation_strategy="epoch",        # Evaluate each epoch (newer transformers releases name this argument eval_strategy)
    learning_rate=2e-5,                 # Learning rate
    per_device_train_batch_size=8,      # Batch size per device during training
    per_device_eval_batch_size=8,       # Batch size for evaluation
    num_train_epochs=3,                 # Number of training epochs
    weight_decay=0.01,                  # Weight decay
)
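
The batch size and epoch count above determine how many optimizer steps the Trainer runs. The following is a rough sketch of that arithmetic, assuming a single device, no gradient accumulation, and that the last partial batch is kept (the Trainer defaults). The function name is hypothetical, and the dataset size of 10,000 examples is a made-up figure, not the size of lmqg/qag_squad.

```python
import math

def total_training_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Estimate optimizer steps for a run without gradient accumulation."""
    # One step consumes one batch on every device
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs

# Hypothetical dataset of 10,000 examples with the settings from this notebook
steps = total_training_steps(num_examples=10_000, per_device_batch_size=8, num_epochs=3)
print(steps)
```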

## Model Training

In this cell, we set up the Trainer class for training the T5 model.
 - Trainer is responsible for managing the training process, including optimization, logging, and evaluation.
 - We pass the pre-trained T5 model, training arguments, and tokenized training and validation datasets to the Trainer.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

## Model Training Execution

In [None]:
# Train the model
trainer.train()

## Saving the Model
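
The notebook does not show the save step itself. A minimal sketch, assuming the trainer and tokenizer objects from the cells above: Trainer.save_model writes the model weights and config, and save_pretrained writes the tokenizer files. The helper name save_finetuned and the directory "./qgen_model" are placeholders, not names used by the original author.

```python
def save_finetuned(trainer, tokenizer, output_dir="./qgen_model"):
    """Write the fine-tuned model and its tokenizer to one directory."""
    trainer.save_model(output_dir)         # model weights and config
    tokenizer.save_pretrained(output_dir)  # vocabulary and tokenizer config
    return output_dir
```

After trainer.train() has finished, calling `save_finetuned(trainer, tokenizer)` would leave a directory that from_pretrained can load again.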

## Importing NLTK and Transformers Libraries

In this cell, we import the NLTK library for natural language processing tasks, download the necessary NLTK data, and import modules from the Transformers library.


In [None]:
import nltk
import json

# Download NLTK data
nltk.download('punkt')

# Import modules from the Transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import random

## Loading Pre-trained Question Generation Model

In this cell, we load a pre-trained question generation model and its corresponding tokenizer.
 - tokenizer: AutoTokenizer instance created from the pre-trained model "abhir00p/Qgen_model_ACM".
 - model: AutoModelForSeq2SeqLM instance loaded from the pre-trained model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("abhir00p/Qgen_model_ACM")
model = AutoModelForSeq2SeqLM.from_pretrained("abhir00p/Qgen_model_ACM")

## Function for Generating Question-Answer Pairs

In this cell, we define a function, generate_qa_pairs, that takes a context and generates question-answer pairs using a pre-trained model.
 - context: The input text from which questions are generated.
 - model: The pre-trained question generation model.
 - tokenizer: The tokenizer corresponding to the model.
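 - num_pairs: The number of sequences to generate per chunk (default is 1). Beam search requires this to be at most num_beams, which is 3 here.
 - max_length: The chunk size in characters used to slice the context; the same value also caps the tokenized input length (default is 256).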

In [None]:
def generate_qa_pairs(context, model, tokenizer, num_pairs=1, max_length=256):
    qa_pairs = []

    current_start = 0
    while current_start < len(context):
        current_end = min(current_start + max_length, len(context))
        chunk = context[current_start:current_end]

        encoding = tokenizer.encode_plus(chunk, max_length=max_length, truncation=True, return_tensors="pt")
        input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

        output_sequences = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            early_stopping=True,
            num_beams=3,
            num_return_sequences=num_pairs,
            no_repeat_ngram_size=2,
            max_length=200,
        )

        for sequence in output_sequences:
            # Decode each generated sequence to text, dropping special tokens
            qa_text = tokenizer.decode(sequence, skip_special_tokens=True)
            qa_pairs.append(qa_text)

        current_start = current_end

    return qa_pairs
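
The loop in generate_qa_pairs slices the context by character count, so a chunk boundary can fall in the middle of a word or sentence. The sketch below isolates that slicing step to make the effect visible. char_windows is a hypothetical helper and the sample sentence is made up for illustration.

```python
def char_windows(text, max_length=256):
    """Yield consecutive, non-overlapping character windows of text."""
    for start in range(0, len(text), max_length):
        yield text[start:start + max_length]

sample = "CoAP runs over UDP. It defines four message types."
windows = list(char_windows(sample, max_length=20))
for window in windows:
    print(repr(window))
```

With a window of 20 characters, the word "message" is cut in two. Splitting on sentence boundaries first would avoid this, at the cost of uneven chunk sizes.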


In [None]:
context = """Introduction:

Constrained Application Protocol resulted from Internet Engineering Task Force (IETF) Constrained RESTful Environments Request For Comments (CORE - RFC) working group's efforts to develop a generic framework for resource-oriented applications targeting constrained nodes & networks.

COAP framework (RFC 7252) defines simple & flexible ways to manipulate sensors & actuators for data or device management.

COAP messaging model is primarily designed to facilitate the exchange of messages over UDP between endpoints, including secure transport protocol Datagram Transport Layer Security (DTLS).

• COAP is based on request/response communication model similar to HTTP & supports additional protocol features that are useful in IoT scenarios.

Popular use case: Wired & wireless sensor networks.

Due to its frequent usage in constrained & local networks, CoAP is more suitable for Internet wide data transfer over HTTP.

CoAP can efficiently work on constrained devices, even when these devices are connected to highly lossy networks with high packet loss, high error rates & bandwidth in range of kilobits.

Highlights: Service discovery, resource discovery, URIs (uniform resource identifiers), Internet media handling support, easy HTTP integration & multicasting while maintaining low overheads.

• CoAP implementations can act as both clients & servers (not simultaneously).

COAP client's request signifies a request for action from an identified resource on a server, which is similar to HTTP.

• Response sent by the server in the form of a response code can contain resource representations as well.

Interchanges are asynchronous & datagram-oriented over UDP.

• Packet collisions are handled by a logical message layer incorporating the exponential backoff mechanism for providing reliability.

Two distinct layers of messaging (which handle UDP & asynchronous messaging) and request-response (which handles connection establishment) are part of CoAP header.
UDP as Transport Protocol:

• CoAP is based on UDP, which has a variety of unique features while being slim & efficient, yet not always ideal for Internet communication or communication between multiple networks due to its non-reliable nature.

CoAP implements simple mechanisms to mitigate these issues:

Simple stop-and-wait retransmission: CoAP message can be marked as a confirmable message by adding a protocol flag.

Deduplication: CoAP has a built-in deduplication mechanism based on message identifiers. This mechanism is in place for all CoAP messages.

CoAP is capable of using other transport protocols like TCP or SMS.

Message Format:

COAP message is composed of a short fixed-length header field (4 bytes), a variable-length but mandatory token field (0-8 bytes), options fields if necessary & payload field.

CoAP message delivers low overhead while decreasing parsing complexity.

Functionality:

• COAP can run over IPv4 or IPv6, it is recommended that message fit within a single IP packet & UDP payload to avoid fragmentation.

For IPv6, with the default MTU size being 1280 bytes & allowing for no fragmentation across nodes, maximum CoAP message size could be up to 1152 bytes, including 1024 bytes for the payload.

For IPv4, as IP fragmentation may exist across network, implementations should limit themselves to more conservative values & set IPv4 Don't Fragment (DF) bit.

While most sensor & actuator traffic utilizes small-packet payloads, some use cases, such as firmware upgrades, require capability to send larger payloads.

COAP doesn't rely on IP fragmentation but defines a pair of Block options for transferring multiple blocks of information from a resource representation in multiple request/response pairs.

• Like HTTP, CoAP is based on the REST architecture, but with a "thing" acting as both client & server.

• Through the exchange of asynchronous messages, a client requests an action via a method code on a server resource.

• Uniform resource identifier (URI) localized on server identifies this resource and responds back.

COAP request/response semantics include methods GET, POST, PUT & DELETE.

Reliable Transmission:

COAP defines four types of messages: confirmable, non-confirmable, acknowledgement & reset.

While running over UDP, CoAP offers a reliable transmission of messages when a CoAP header is marked as "confirmable."

If a request or response is tagged as confirmable, recipient must explicitly either acknowledge or reject."""

In [None]:
qa_pairs = generate_qa_pairs(context, model, tokenizer)

In [None]:
qa_pairs

## Function for Cleaning Question-Answer Pairs

In this cell, we define a function, clean_qa_pairs, to clean the generated question-answer pairs.
The function extracts and formats the question and answer from each generated pair.


In [None]:
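def clean_qa_pairs(qa_pairs):
    cleaned_qa_pairs = []
    for pair in qa_pairs:
        # Expected shape (inferred from the separators used here):
        # "question: ..., answer: ... | question: ..., answer: ..."
        parts = pair.split(", answer: ", 1)
        if len(parts) > 1 and "question: " in parts[0]:
            question = parts[0].split("question: ", 1)[1].strip()
            # Only the first answer is kept; any pairs after "|" are dropped
            answer = parts[1].split("|")[0].strip().capitalize()
            cleaned_qa_pairs.append({"question": question, "answer": answer})
    return cleaned_qa_pairs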
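
clean_qa_pairs keeps only the first question and answer from each generated string. If a string holds several pairs separated by "|", the rest are lost. The sketch below extracts every pair, assuming the output format "question: ..., answer: ... | question: ..., answer: ..." that the separators in clean_qa_pairs imply. parse_all_pairs is a hypothetical helper and the sample string is made up.

```python
def parse_all_pairs(generated_text):
    """Extract every question-answer pair from one generated string."""
    pairs = []
    for segment in generated_text.split("|"):
        question_part, separator, answer = segment.partition(", answer: ")
        if not separator or "question: " not in question_part:
            continue  # skip segments that do not match the assumed format
        question = question_part.split("question: ", 1)[1].strip()
        pairs.append({"question": question, "answer": answer.strip()})
    return pairs

sample = (
    "question: What does CoAP run over?, answer: UDP | "
    "question: How many message types does CoAP define?, answer: four"
)
parsed = parse_all_pairs(sample)
print(parsed)
```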

In [None]:
cleaned_pairs = clean_qa_pairs(qa_pairs)

In [None]:
cleaned_pairs

In [None]:
# Convert the cleaned pairs to a CSV file
import pandas as pd

def conversiontocsv(cleaned_pairs, path="csv_cleaned_qa_pairs.csv"):
    # cleaned_pairs is a list of {"question": ..., "answer": ...} dicts
    df = pd.DataFrame(cleaned_pairs)
    df.to_csv(path, index=False)
    return path

## Using the LMQG Package Pipeline

## Installing lmqg and Using TransformersQG for Question Generation

In this cell, we install the lmqg library using the pip package manager.
Then, we use the TransformersQG class to generate question-answer pairs from a given context.


In [None]:
!pip install lmqg

In [None]:
from lmqg import TransformersQG

# initialize model
model = TransformersQG(model="abhir00p/Qgen_model_ACM")


In [None]:
# model prediction
question_answer_pairs = model.generate_qa(context)

## Using the Inference API

In [None]:
# Install langchain library
!pip install langchain

In [None]:
from langchain.text_splitter import CharacterTextSplitter

## Text Chunking

 In this cell, we define a function, get_text_chunks, to split a given text into chunks using langchain's CharacterTextSplitter.
 The function takes a text as input, splits it into chunks of up to 1000 characters with an overlap of up to 200 characters, and replaces the newline characters in each chunk with spaces.

In [None]:
def get_text_chunks(text):
    # Split on newlines, then merge the pieces into chunks of up to 1000
    # characters that share up to 200 characters with their neighbours
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )

    # Split the text into main_chunks
    main_chunks = text_splitter.split_text(text)

    # Replace newlines with spaces so that words on adjacent lines are not joined
    return [chunk.replace("\n", " ") for chunk in main_chunks]
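
To show what chunk_size and chunk_overlap mean, here is a simplified sliding-window splitter in plain Python. It is not langchain's algorithm: CharacterTextSplitter splits on the separator and merges pieces, while this sketch cuts fixed-size windows. sliding_chunks is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
    return chunks

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each window starts chunk_size - chunk_overlap characters after the previous one, so the last two characters of one chunk are repeated at the start of the next. The overlap keeps a sentence that straddles a boundary visible in at least one chunk.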

In [None]:
context = """
Toy Story 4 is a 2019 American animated comedy-drama film produced by Pixar Animation Studios for Walt Disney Pictures. It is the fourth installment in Pixar's Toy Story series and the sequel to Toy Story 3 (2010). It was directed by Josh Cooley (in his feature directorial debut) from a screenplay by Andrew Stanton and Stephany Folsom; the three also conceived the story alongside John Lasseter, Rashida Jones, Will McCormack, Valerie LaPointe, and Martin Hynes. Tom Hanks, Tim Allen, Annie Potts, Joan Cusack, Don Rickles, Wallace Shawn, John Ratzenberger, Estelle Harris (in her final film role), Blake Clark, Jeff Pidgeon, Bonnie Hunt, Jeff Garlin, Kristen Schaal, and Timothy Dalton reprise their character roles from the first three films, and are joined by Tony Hale, Keegan-Michael Key, Jordan Peele, Christina Hendricks, Keanu Reeves, and Ally Maki, who voice new characters introduced in this film. Set after the third film, Toy Story 4 follows Woody (Hanks) and Buzz Lightyear (Allen) as the pair and the other toys go on a road trip with Bonnie (Madeleine McGraw), who creates Forky (Hale), a spork made with recycled materials from her school. Meanwhile, Woody is reunited with Bo Peep (Potts), and must decide where his loyalties lie.

Talks for a fourth film began in 2010, and Hanks stated that Pixar was working on the sequel in 2011. When the film was officially announced in November 2014 during an investor's call, it was reported that the film would be directed by Lasseter, who later announced it would be a love story, after writing a film treatment with Stanton, and input from Pete Docter and Lee Unkrich, while Galyn Susman would serve as the producer. Cooley became the film's co-director in March 2015, while Pixar president Jim Morris said it was not a continuation of the third film, who described the film as a romantic comedy. In July 2017, Lasseter was stepping down and leaving Cooley as the sole director. Despite this, Lasseter still retained writing credits. New characters for the film were announced in 2018 and 2019 along with new cast members. Composer Randy Newman returned to score the film, marking his ninth collaboration with Pixar. The film is dedicated to Don Rickles (the voice of Mr. Potato Head) and animator Adam Burke, who died in 2017 and 2018, respectively.

Toy Story 4 premiered in Hollywood, Los Angeles, on June 11, 2019, and was released in the United States on June 21. It grossed $1.073 billion worldwide, becoming the eighth-highest-grossing film of 2019 and is the highest-grossing film in the franchise, marginally surpassing Toy Story 3. Like its predecessors, the film received acclaim from critics, with praise for its story, humor, emotional depth, musical score, animation, and vocal performances. The film was nominated for two awards at the 92nd Academy Awards, winning Best Animated Feature, and received numerous other accolades. Whilst Toy Story 4 film was initially expected to be the final film in the main Toy Story film series a fifth installment is in development.
"""

In [None]:
context_chunks = get_text_chunks(context)

## Question Generation via Hugging Face Inference API

 In this cell, we define a function, generate_qa_pairs, to generate question-answer pairs using the Hugging Face Inference API.
 The function takes a context, model_name, API token, and other optional parameters for generating QA pairs.

In [None]:
import requests

# API_URL = "https://api-inference.huggingface.co/models/abhir00p/Qgen_model_ACM"
# headers = {"Authorization": "Bearer your_ACCESS_TOKEN"}


In [None]:
def generate_qa_pairs(context, model_name, api_token, num_pairs=1, max_length=256):
    headers = {
        "Authorization": f"Bearer {api_token}"
    }

    qa_pairs = []
    current_start = 0
    while current_start < len(context):
        current_end = min(current_start + max_length, len(context))
        chunk = context[current_start:current_end]

        payload = {
            "inputs": chunk,
            "parameters": {
                "max_length": 200,
                "num_beams": 3,
                "num_return_sequences": num_pairs,
                "no_repeat_ngram_size": 2
            },
            "options": {
                "use_cache": False,
                "wait_for_model": True
            }
        }

        response = requests.post(f"https://api-inference.huggingface.co/models/{model_name}", headers=headers, json=payload)
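        if response.status_code == 200:
            output_sequences = response.json()
            for sequence in output_sequences:
                qa_text = sequence["generated_text"]
                # Optionally split the text into Q&A parts
                # qa_parts = qa_text.split("[SEP]")
                qa_pairs.append(qa_text)
        else:
            # Report failed requests instead of silently skipping the chunk
            print(f"Request failed with status {response.status_code}: {response.text}")

        # Advance to the next chunk; this must stay inside the while loop,
        # otherwise the loop never terminates
        current_start = current_end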

    return qa_pairs
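
Hosted inference endpoints can return a non-200 status while a model is loading or under load. One possible way to handle that is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not part of the original notebook: post_with_retry is a hypothetical helper, and the HTTP function is passed in as the post argument (for example requests.post) so the logic can be exercised without network access. It assumes max_attempts is at least 1.

```python
import time

def post_with_retry(post, url, headers, payload, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call post(url, headers=..., json=...) until it returns HTTP 200 or attempts run out."""
    for attempt in range(max_attempts):
        response = post(url, headers=headers, json=payload)
        if response.status_code == 200:
            return response.json()
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"request failed after {max_attempts} attempts (last status {response.status_code})"
    )
```

Inside generate_qa_pairs, the requests.post call could then be replaced with `post_with_retry(requests.post, url, headers, payload)`, where url is the model endpoint built from model_name.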



In [None]:
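# get_text_chunks returns a list of strings, so generate pairs chunk by chunk
qa_pairs = []
for chunk in context_chunks:
    qa_pairs.extend(generate_qa_pairs(chunk, "abhir00p/Qgen_model_ACM", "your_ACCESS_TOKEN"))
qa_pairs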


References:
 1. https://github.com/asahi417/lm-question-generation?tab=readme-ov-file
 2. https://huggingface.co/learn/nlp-course/chapter1/1
 3. https://huggingface.co/lmqg


