<a href="https://colab.research.google.com/github/EMWetzel/AI_in_Const/blob/LLM%2FNLP/QA_Gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The following exercise uses open-source large language models (LLMs) to read a text document, tokenize it, and generate questions with answers. Note: a few commented-out lines would export the Q&A pairs to a Word document.


**Before we get started, change your runtime type to GPU. This exercise isn't especially compute-heavy, but a GPU can help speed it up a bit.**

First, we need to import the required libraries.

"Transformers" is a Hugging Face library for LLM/NLP work, used here with PyTorch. It provides models, tokenizers, pretrained model weights, etc.


In [2]:
from transformers import pipeline, T5Tokenizer, T5ForConditionalGeneration
# To export a Word document (e.g. when running locally in VS Code), run "pip install python-docx" and add "from docx import Document"

Next, we load the models and tokenizers used for question generation and answer generation. The cell also loads a summarization pipeline, which the rest of this notebook does not call.

In [3]:
question_generation_model_name = "valhalla/t5-small-qg-hl"
answer_generation_model_name = "t5-base"

question_tokenizer = T5Tokenizer.from_pretrained(question_generation_model_name)
question_model = T5ForConditionalGeneration.from_pretrained(question_generation_model_name)

answer_tokenizer = T5Tokenizer.from_pretrained(answer_generation_model_name)
answer_model = T5ForConditionalGeneration.from_pretrained(answer_generation_model_name)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
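
A GPU runtime only helps if the models and their inputs are actually placed on the GPU. The two T5 models loaded with `from_pretrained` start on the CPU. A minimal sketch of choosing a device, assuming PyTorch is installed (`pick_device` is a hypothetical helper name, not part of the original notebook):

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch available: fall back to the CPU label.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")

# Assumed follow-up (not in the original notebook):
#   question_model.to(device)
#   answer_model.to(device)
# Each encoded input tensor would then also need `.to(device)`
# before it is passed to `generate()`.
```

If the printed device is `cpu` in Colab, the runtime type has probably not been switched to GPU yet.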


Next, we define all of our functions. The order in which the functions appear in this cell doesn't matter, as long as the cell has run before we call them and we call them in the correct order.
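
To illustrate why definition order is flexible: Python looks up a function name when the calling code runs, not when the caller is defined. A small sketch with two hypothetical functions, `first` and `second`:

```python
def first():
    # `second` is looked up when `first` runs, not when it is defined,
    # so it is fine that `second` appears later in the file.
    return second() + 1

def second():
    return 41

# Both functions exist by the time this call executes.
result = first()
print(result)
```

Calling `first()` before `second` has been defined would raise a `NameError`, which is why the call order still matters.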

In [9]:
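def generate_qa_pairs(text):
    """Generate five questions from `text`, then answer each one using `text` as context."""
    # Note: the model card for this checkpoint marks answer spans with <hl> tokens;
    # here the whole text is passed with a "highlight: " prefix instead.
    # " </s>" is the end-of-sequence marker; recent tokenizer versions also add it automatically.
    inputs = question_tokenizer.encode("highlight: " + text + " </s>", return_tensors="pt")
    # Sampling (do_sample=True) returns five candidate questions that differ from run to run.
    outputs = question_model.generate(inputs, max_length=150, num_return_sequences=5, do_sample=True)

    questions = [question_tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

    qa_pairs = []
    for question in questions:
        # Prompt format for the answer model: "question: ... context: ..."
        input_text = "question: " + question + " context: " + text + " </s>"
        inputs = answer_tokenizer.encode(input_text, return_tensors="pt")
        outputs = answer_model.generate(inputs, max_length=150, num_return_sequences=1, do_sample=True)

        answer = answer_tokenizer.decode(outputs[0], skip_special_tokens=True)
        qa_pairs.append({"question": question, "answer": answer})

    return qa_pairs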

def print_qa_pairs(qa_pairs):
    print("Generated Q&A Pairs:\n")
    for i, pair in enumerate(qa_pairs):
        print(f'Q{i+1}: {pair["question"]}')
        print(f'A{i+1}: {pair["answer"]}\n')

# Not used unless you run locally (e.g. in VS Code) and want to export a Word document; requires "from docx import Document"
def save_to_word(qa_pairs, output_file):
    document = Document()
    document.add_heading('Generated Q&A Pairs', 0)

    for i, pair in enumerate(qa_pairs):
        document.add_heading(f'Q{i+1}: {pair["question"]}', level=1)
        document.add_paragraph(pair["answer"])

    document.save(output_file)

def read_text_from_file(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
        print("File read successfully!")
        return text
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def tokenize_text(text):
    # Split the text into subword tokens with the question model's tokenizer and print them
    tokens = question_tokenizer.tokenize(text)
    print("Tokenized text:", tokens)
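
The functions above pass the whole document to each model in a single call. T5 models were pre-trained on inputs of up to 512 tokens, so a long document may be handled poorly. One possible workaround, sketched here under assumptions, is to split the text into smaller pieces and generate Q&A pairs per piece. `chunk_text` is a hypothetical helper, word count is only a rough stand-in for token count, and the default of 300 words is an arbitrary choice:

```python
def chunk_text(text, max_words=300):
    """Split `text` into consecutive chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Short sample sentence so the chunking is easy to see; a real document would be longer.
sample = "Active learning must include student activity and engagement in the learning process."
chunks = chunk_text(sample, max_words=5)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed to `generate_qa_pairs` in a loop, at the cost of answers that only see their own chunk as context.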

Now we will begin calling the functions. The first call reads the text file from its file path. Note that the function definition includes a small "print" check to confirm that the file was read properly. If the read succeeds, running the cell prints a notification.

For this to work, we first need to upload the text document we want to use.

Click on the document icon on the left, upload the text document, click on the three dots next to it, and copy the file path. Paste it between the quotes assigned to input_file.

In [13]:
input_file = "/content/TestText.txt"  # Replace with your .txt file path
text = read_text_from_file(input_file)

File read successfully!


Now let's see how the tokenizer works. The code below tokenizes the attached document and prints the tokens.

In [14]:
tokenize_text(text)

Tokenized text: ['▁Active', '▁learning', '▁is', '▁distinguished', '▁from', '▁more', '▁traditional', '▁college', '▁lectures', '▁where', '▁students', '▁passive', 'ly', '▁receive', '▁information', '▁from', '▁the', '▁instructor', '.', '▁Active', '▁learning', '▁must', '▁include', '▁‘', 'stud', 'ent', '▁activity', '▁and', '▁engagement', '▁in', '▁the', '▁learning', '▁process', '’', '▁(', 'Pri', 'nce', ',', '▁2004,', '▁', 'p', '.', '▁22', '3)', '.', '▁Students', '▁', 'tang', 'ibly', '▁participate', '▁in', '▁activities', '▁within', '▁the', '▁classroom', '▁during', '▁the', '▁class', '▁session', '.', '▁This', '▁student', '▁participation', '▁requires', '▁higher', '▁order', '▁thinking', '▁where', '▁students', '▁generate', '▁knowledge', '▁and', '▁understanding', '.', '▁Although', '▁not', '▁always', '▁included', ',', '▁meta', 'c', 'ogni', 'tion', ',', '▁where', '▁students', '▁reflect', '▁on', '▁what', '▁they', '▁have', '▁learned', ',', '▁is', '▁often', '▁', 'a', '▁key', '▁link', '▁between', '▁activit
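
In the output above, the "▁" character marks the start of a new word, and tokens without it (such as 'ly') continue the previous word. A minimal sketch of joining such tokens back into text, using the first few tokens from the output (`detokenize` is a hypothetical helper, not a Transformers function):

```python
WORD_START = "\u2581"  # the "▁" character that marks the start of a word

def detokenize(tokens):
    """Join SentencePiece-style tokens back into plain text."""
    # Concatenate the pieces, then turn each word-start marker into a space.
    return "".join(tokens).replace(WORD_START, " ").strip()

tokens = ['▁Active', '▁learning', '▁is', '▁distinguished', '▁from', '▁more',
          '▁traditional', '▁college', '▁lectures', '▁where', '▁students',
          '▁passive', 'ly', '▁receive', '▁information']
sentence = detokenize(tokens)
print(sentence)
```

Note how '▁passive' and 'ly' merge into the single word "passively".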

Our last step is to generate the Q&A pairs, print them, and optionally save them to a Word document.

In [15]:

qa_pairs = generate_qa_pairs(text)

print_qa_pairs(qa_pairs)

# If you are running in VS Code and want to export a Word document, uncomment and run the lines below

#output_file = "qa_pairs.docx"
#save_to_word(qa_pairs, output_file)
#print(f"Q&A pairs saved to {output_file}")

Generated Q&A Pairs:

Q1: When applying active learning, the teacher has the ability to create new arguments and ideas for students.
A1: student

Q2: How is active learning correlated with traditional teacher study?
A2: passively receive information from the instructor

Q3: What major gap identifies as whether active learning is defined as a student willingness to participate in classes?
A3: student ability to generate new arguments and ideas

Q4: Introducment on active learning in the ELSE is about students learning and learning, but with one's approach actively learning is characterized by how traditional instructors perceive it?
A4: active learning can be impactful in large classes

Q5: What did Vigotsky believe in the value of active learning?
A5: solving problems beyond their actual developmental level when under the guidance of an instructor or peers



It shouldn't be a surprise if the Q&A pairs don't make a lot of sense: these are small, general-purpose models that have never been trained on this content, and sampling (`do_sample=True`) makes every run different. Even so, Q&A generation is a vital step in training a text-based model.
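
Because some generated "questions" are not questions at all (Q1 above is a statement), a simple post-processing filter can help before the pairs are reused. The sketch below is one assumed approach, not part of the original notebook; `keep_well_formed` is a hypothetical helper, and the sample data is copied by hand from the output above:

```python
def keep_well_formed(qa_pairs):
    """Keep pairs whose question ends with "?" and whose answer is not empty."""
    return [
        pair for pair in qa_pairs
        if pair["question"].strip().endswith("?") and pair["answer"].strip()
    ]

# Two of the pairs printed above (Q1 and Q5), copied for illustration.
sample_pairs = [
    {"question": "When applying active learning, the teacher has the ability to "
                 "create new arguments and ideas for students.",
     "answer": "student"},
    {"question": "What did Vigotsky believe in the value of active learning?",
     "answer": "solving problems beyond their actual developmental level when "
               "under the guidance of an instructor or peers"},
]
filtered = keep_well_formed(sample_pairs)
print(f"Kept {len(filtered)} of {len(sample_pairs)} pairs")
```

A check like this only looks at surface form; it says nothing about whether the answer is correct.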