In [None]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# **Putting Everything Together**

The question generator can be repurposed to turn answers into questions.

As a reminder: the question generator's training inputs were a `context` with a `highlighted answer`, and its labels were the matching `question`.

So, the steps we will run through are:

* Take in a context
* Chunk the context into segments
* Based on the context, generate an answer
* Highlight the answer in the context
* Feed the context with highlight into the model
* Get an **answerable** question
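
The steps above can be sketched as one small function. This is only an illustration of the control flow, not the notebook's actual implementation: `run_pipeline` and the `toy_*` helpers are hypothetical names, and the toy helpers stand in for the chunker and the two fine-tuned models so the sketch runs without any weights.

```python
from typing import Callable, List, Tuple


def run_pipeline(
    context: str,
    chunk: Callable[[str], List[str]],
    generate_answer: Callable[[str], str],
    highlight: Callable[[str, str], str],
    generate_question: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return one (question, answer) pair per chunk of the context."""
    pairs = []
    for segment in chunk(context):
        answer = generate_answer(segment)          # answer model
        highlighted = highlight(segment, answer)   # mark the answer in the text
        question = generate_question(highlighted)  # question model
        pairs.append((question, answer))
    return pairs


# Toy stand-ins (assumptions for illustration only).
def toy_chunk(text):
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def toy_answer(segment):
    return segment.split()[0]  # pretend the first word is the answer


def toy_highlight(segment, answer):
    return segment.replace(answer, f"[ANSWER] {answer} [/ANSWER]", 1)


def toy_question(highlighted):
    return "Which word is highlighted?"


pairs = run_pipeline(
    "Whales sing.\n\nFireflies glow.",
    toy_chunk, toy_answer, toy_highlight, toy_question,
)
print(pairs)
```

Passing the four stages in as callables keeps the loop independent of the models, which makes it easy to swap in a different highlighter later.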

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

from google.colab import drive
drive.mount("/content/drive/")

import torch

# Everything runs on the CPU here; genq and gena below move their inputs to this device.
device = torch.device("cpu")

mod_q = torch.load("/content/drive/My Drive/TeachMy/base4epoch.pth", map_location=device)
mod_a = torch.load("/content/drive/My Drive/TeachMy/adult_tuned2e.pth", map_location=device)

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


# Prepare Texts

In [None]:
import nltk
nltk.download('punkt')

# don't forget to initialize the T5 tokenizer

def chunker(text, tokenizer, min_tokens=140, max_tokens=200):
    sentences = nltk.tokenize.sent_tokenize(text)

    segments = []
    current_segment = []
    current_token_count = 0

    for sentence in sentences:
        sentence_tokens = tokenizer.tokenize(sentence)
        sentence_token_count = len(sentence_tokens)

        if current_token_count + sentence_token_count > max_tokens:
            # Flush the current segment, unless it is empty (avoids appending "")
            if current_segment:
                segments.append(' '.join(current_segment))
            current_segment = [sentence]
            current_token_count = sentence_token_count
        else:
            current_segment.append(sentence)
            current_token_count += sentence_token_count

        if min_tokens <= current_token_count <= max_tokens:
            segments.append(' '.join(current_segment))
            current_segment = []
            current_token_count = 0

    if current_segment:
        segments.append(' '.join(current_segment))

    return segments
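
The chunker can still return segments outside the `min_tokens`/`max_tokens` window, for example a short final chunk or a single sentence that is longer than `max_tokens`. A minimal sketch for checking this is shown here. `segment_token_counts`, `out_of_range` and `WhitespaceTokenizer` are hypothetical helpers, and the whitespace tokenizer is only a stand-in so the example runs without downloading T5. Any object with a `.tokenize(text)` method, such as the T5 tokenizer, can be passed instead.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for the T5 tokenizer: one token per word."""

    def tokenize(self, text):
        return text.split()


def segment_token_counts(segments, tokenizer):
    """Number of tokens in each segment."""
    return [len(tokenizer.tokenize(segment)) for segment in segments]


def out_of_range(segments, tokenizer, min_tokens=140, max_tokens=200):
    """Return (index, token_count) for segments outside [min_tokens, max_tokens]."""
    counts = segment_token_counts(segments, tokenizer)
    return [(i, n) for i, n in enumerate(counts) if not min_tokens <= n <= max_tokens]


demo_segments = ["one two three", "four five", "six seven eight nine ten eleven"]
demo_tok = WhitespaceTokenizer()

print(segment_token_counts(demo_segments, demo_tok))
print(out_of_range(demo_segments, demo_tok, min_tokens=3, max_tokens=5))
```

Running `out_of_range` on the real segments would show which chunks give the models unusually little (or too much) context.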

In [None]:
longcon1 = """Humpback whales are one of the most fascinating species in the marine world, known for their impressive size, unique behaviors, and haunting songs. These majestic creatures belong to the family of baleen whales, which includes the largest animals on Earth. Humpback whales can grow up to 60 feet in length and weigh as much as 40 tons, making them one of the larger whale species, though still smaller than the blue whale.

One of the most distinctive features of humpback whales is their acrobatic nature. They are known for breaching, a behavior in which the whale propels itself out of the water in a spectacular leap. This behavior, along with their habit of slapping their massive tails and fins on the water's surface, makes humpback whales a favorite among whale watchers. Scientists believe these behaviors could serve multiple purposes, including communication, mating displays, or as a way to dislodge parasites.

Humpback whales are also renowned for their complex and melodious songs. Male humpbacks sing intricate songs that can last for up to 20 minutes and be heard miles away under the ocean. These songs are thought to play a role in attracting mates and asserting dominance. Remarkably, all males within a population sing the same song, which evolves gradually over time. The exact meaning and content of these songs remain one of the ocean's great mysteries.

Another interesting aspect of humpback whales is their feeding technique known as bubble net feeding. They create a unique "net" of bubbles by swimming in a spiral and releasing air from their blowholes. This bubble net traps schools of fish or krill, and the whales then swim upwards with their mouths open to engulf thousands of gallons of water filled with prey. This cooperative hunting strategy showcases the intelligence and social behavior of humpback whales.

Humpback whales embark on long migrations, one of the longest of any mammal on Earth. They spend the summer months in colder, polar waters where they feed extensively to build up fat reserves. In the winter, they migrate to warmer, tropical waters to breed and give birth. During the breeding season, humpbacks fast and live off their fat reserves, focusing on mating and nursing their young.

Conservation efforts have been critical in protecting humpback whales from the brink of extinction. Once hunted to near extinction, humpback populations have made a significant recovery thanks to international protection and the ban on commercial whaling. However, they still face threats from entanglement in fishing gear, ship strikes, pollution, and climate change which affects their prey availability.

The humpback whale continues to capture the human imagination with its beauty, mysterious songs, and awe-inspiring acrobatics. Their presence in the oceans serves as a reminder of the need to preserve and protect the incredible biodiversity of our planet."""

In [None]:
longcon2 = """Let's dive into the fascinating world of bioluminescence, the natural phenomenon where living organisms produce light. Bioluminescence is one of nature's most enchanting spectacles, observed in a variety of organisms, including certain types of fish, insects, fungi, and microorganisms, most notably in the depths of the ocean.

The light produced by these organisms is the result of a biochemical reaction in which enzyme luciferase acts on the substrate luciferin, in the presence of oxygen, to produce light. The colors of this light can vary greatly, from the common green or blue to the rare red, depending on the species and the environment in which they live.

Bioluminescence serves multiple purposes in nature. For deep-sea creatures, it can be a tool for communication, attracting mates, luring prey, or as a defense mechanism to confuse predators. Fireflies, on the other hand, use bioluminescence to attract mates with their distinctive patterns of flashing lights.

One of the most magical occurrences related to bioluminescence is the sea sparkle caused by dinoflagellates, microscopic plankton that illuminate the water along coastlines. When agitated by waves or movement, these organisms emit a stunning, ethereal glow, turning the sea into a canvas of sparkling light.

Bioluminescence has not only captivated the imagination of many but has also found practical applications in scientific research. For instance, the gene for luciferase has been inserted into various organisms, from bacteria to plants, making them glow. This technique is used in genetic engineering, bioassays, and medical research to track the expression of genes, monitor the spread of infections, or visualize the location of specific proteins within cells.

Despite its widespread occurrence and utility, much about bioluminescence remains a mystery. Deep-sea expeditions continue to discover new species that challenge our understanding of this beautiful biological phenomenon. The study of bioluminescent organisms not only broadens our knowledge of the natural world but also opens up new avenues for biotechnological applications, from eco-friendly lighting solutions to advanced medical diagnostics.

As we continue to explore and learn from the natural world, bioluminescence stands as a glowing testament to the wonders of life on Earth, reminding us of the complexity, diversity, and beauty hidden in the depths of the oceans and the nooks of our terrestrial ecosystems."""

In [None]:
longcon3 = """
Neuropsychology delves into the intricate relationships between the brain's physical structure and its cognitive functions, exploring how different brain areas and their interconnections underpin specific mental processes and behaviors. A pivotal area within this field is the study of executive functions, which are crucial high-level cognitive processes responsible for organizing, planning, strategizing, paying attention to and remembering details, and managing time and space.

Executive functions are primarily mediated by the prefrontal cortex, a part of the brain located at the front of the frontal lobe. This region is involved in the complex processes of decision-making, problem-solving, and behavior modulation. The prefrontal cortex works in tandem with other brain regions to orchestrate a range of activities necessary for goal-directed behavior. For instance, when planning a task, the prefrontal cortex might activate to assess the sequence of actions required, estimate time needed, and predict potential outcomes.

Damage or dysfunction in the prefrontal cortex can lead to significant difficulties with executive functions, manifesting in various ways depending on the affected area. For example, damage to the dorsolateral prefrontal cortex can result in problems with working memory and task flexibility, while damage to the orbitofrontal cortex can lead to poor impulse control and decision-making.

Executive functions are often assessed through neuropsychological tests designed to probe the complex interplay of cognitive processes involved in goal-directed behavior. These tests might evaluate an individual's ability to form strategies, switch between tasks, apply rules, and inhibit inappropriate responses.

Understanding executive functions and their neural underpinnings is not only crucial for basic neuroscience research but also has practical implications in clinical settings. Neuropsychologists apply this knowledge to diagnose and treat conditions characterized by executive dysfunction, such as traumatic brain injury, ADHD, and frontal lobe syndromes. Rehabilitation strategies often include cognitive exercises aimed at improving planning, problem-solving skills, and impulse control, thereby helping individuals regain functionality and improve their quality of life.

In sum, the study of executive functions within neuropsychology offers profound insights into how our brains enable us to navigate complex, goal-oriented tasks. It highlights the remarkable capacity of the human brain to adapt and organize behavior in response to ever-changing environmental demands, underscoring the intricate link between brain structure and cognitive function."""

In [None]:
text1 = chunker(longcon1, tokenizer)
text2 = chunker(longcon2, tokenizer)
text3 = chunker(longcon3, tokenizer)

# **Define Functions**

* Models:

In [None]:
def genq(new_context):
    mod_q.eval()  # Set the model to evaluation mode
    input_text = f"generate question: {new_context} </s>"  # Prepare input text
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)  # Tokenize input

    with torch.no_grad():  # Disable gradient calculation
        outputs = mod_q.generate(input_ids)  # Generate output tokens

    generated_question = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decode tokens to string
    return generated_question

def gena(context):
    mod_a.eval()  # Set the model to evaluation mode
    # Since there's no question prefix required, we directly use the context
    input_ids = tokenizer.encode(context, return_tensors="pt").to(device)  # Tokenize input

    with torch.no_grad():  # Disable gradient calculation
        outputs = mod_a.generate(input_ids, max_length=50)  # Generate output tokens, you might want to adjust max_length

    generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decode tokens to string
    return generated_answer

* Answer Highlighter

In [None]:
def highlight_answer(context, answer):
    highlighted_context = context.replace(answer, f"[ANSWER] {answer} [/ANSWER]")
    return highlighted_context

In [None]:
generated_answers1 = [gena(segment) for segment in text1]
generated_answers2 = [gena(segment) for segment in text2]
generated_answers3 = [gena(segment) for segment in text3]

In [None]:
for segment, answer in zip(text1, generated_answers1):
    highlighted_segment = highlight_answer(segment, answer)

# Show the result for the last segment only
highlighted_segment

'The [ANSWER] humpback whale [/ANSWER] continues to capture the human imagination with its beauty, mysterious songs, and awe-inspiring acrobatics. Their presence in the oceans serves as a reminder of the need to preserve and protect the incredible biodiversity of our planet.'

# **Run the complete pipeline**

**Humpback Whales:**

In [None]:
for segment, answer in zip(text1, generated_answers1):

    highlighted_segment = highlight_answer(segment, answer)
    generated_question = genq(highlighted_segment)

    print(f"Segment: {segment}")
    print(f"Highlighted Segment: {highlighted_segment}")
    print(f"Generated Answer: {answer}")
    print(f"Generated Question: {generated_question}")
    print("-" * 50)

Segment: Humpback whales are one of the most fascinating species in the marine world, known for their impressive size, unique behaviors, and haunting songs. These majestic creatures belong to the family of baleen whales, which includes the largest animals on Earth. Humpback whales can grow up to 60 feet in length and weigh as much as 40 tons, making them one of the larger whale species, though still smaller than the blue whale. One of the most distinctive features of humpback whales is their acrobatic nature. They are known for breaching, a behavior in which the whale propels itself out of the water in a spectacular leap.
Highlighted Segment: Humpback whales are one of the most fascinating species in the marine world, known for their impressive size, unique behaviors, and haunting songs. These majestic creatures belong to the family of baleen whales, which includes the largest animals on Earth. Humpback whales can grow up to [ANSWER] 60 feet [/ANSWER] in length and weigh as much as 40 

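Nice! Only the last Q+A pair is weak: the question is cut off before its question mark, and it is a boring question (the answer would be "humpback whale"). A probable reason is the short input: it's the last chunk, so it has only two sentences. The cut-off itself may also come from the generation length limit, since `genq` calls `generate` without setting `max_length`.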

**Bioluminescence:**

In [None]:
for segment, answer in zip(text2, generated_answers2):

    highlighted_segment = highlight_answer(segment, answer)
    generated_question = genq(highlighted_segment)

    print(f"Segment: {segment}")
    print(f"Highlighted Segment: {highlighted_segment}")
    print(f"Generated Answer: {answer}")
    print(f"Generated Question: {generated_question}")
    print("-" * 50)



Segment: Let's dive into the fascinating world of bioluminescence, the natural phenomenon where living organisms produce light. Bioluminescence is one of nature's most enchanting spectacles, observed in a variety of organisms, including certain types of fish, insects, fungi, and microorganisms, most notably in the depths of the ocean. The light produced by these organisms is the result of a biochemical reaction in which enzyme luciferase acts on the substrate luciferin, in the presence of oxygen, to produce light. The colors of this light can vary greatly, from the common green or blue to the rare red, depending on the species and the environment in which they live.
Highlighted Segment: Let's dive into the fascinating world of bioluminescence, the natural phenomenon where living organisms produce light. Bioluminescence is one of nature's most enchanting spectacles, observed in a variety of organisms, including certain types of fish, insects, fungi, and microorganisms, most notably in t

The Question/Answer pair of Segment 2 is not good.

* **Q:** "What is one of the most magical occurrences related to bioluminescence?"

* **A:** "fireflies"

The actual answer would be "the sea sparkle", not "fireflies".

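The problem is that although our model tagged "fireflies" as the answer, the highlight failed. The most likely cause is that `str.replace` is case-sensitive: the text only contains "Fireflies", capitalized at the start of a sentence, so there is nothing to replace. Repeated occurrences would not explain it, since `replace` highlights every match. This still needs to be checked.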
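
To check the case-sensitivity explanation, here is a minimal sketch of a case-insensitive highlighter. `highlight_answer_ci` is a hypothetical replacement, not the function used for the outputs in this notebook. It assumes that marking only the first match is acceptable, and it also reports whether a match was found so that failed highlights can be spotted.

```python
import re


def highlight_answer_ci(context, answer):
    """Wrap the first case-insensitive match of `answer` in [ANSWER] tags.

    Returns (highlighted_context, found). The original casing of the
    matched text is preserved.
    """
    answer = answer.strip()
    if not answer:
        return context, False

    # re.escape keeps punctuation in the answer from being read as regex syntax
    pattern = re.compile(re.escape(answer), flags=re.IGNORECASE)
    match = pattern.search(context)
    if match is None:
        return context, False

    start, end = match.span()
    highlighted = (
        context[:start] + "[ANSWER] " + context[start:end] + " [/ANSWER]" + context[end:]
    )
    return highlighted, True


segment = "Fireflies, on the other hand, use bioluminescence to attract mates."

print("fireflies" in segment)  # the exact, lower-case string is not in the text
highlighted, found = highlight_answer_ci(segment, "fireflies")
print(found)
print(highlighted)
```

Returning the `found` flag makes it possible to skip or log segments where the question generator would otherwise receive a context without any highlight.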

**Neuropsychology:**

In [None]:
for segment, answer in zip(text3, generated_answers3):

    highlighted_segment = highlight_answer(segment, answer)
    generated_question = genq(highlighted_segment)

    print(f"Segment: {segment}")
    print(f"Highlighted Segment: {highlighted_segment}")
    print(f"Generated Answer: {answer}")
    print(f"Generated Question: {generated_question}")
    print("-" * 50)



Segment: 
Neuropsychology delves into the intricate relationships between the brain's physical structure and its cognitive functions, exploring how different brain areas and their interconnections underpin specific mental processes and behaviors. A pivotal area within this field is the study of executive functions, which are crucial high-level cognitive processes responsible for organizing, planning, strategizing, paying attention to and remembering details, and managing time and space. Executive functions are primarily mediated by the prefrontal cortex, a part of the brain located at the front of the frontal lobe. This region is involved in the complex processes of decision-making, problem-solving, and behavior modulation.
Highlighted Segment: 
Neuropsychology delves into the intricate relationships between the brain's physical structure and its cognitive functions, exploring how different brain areas and their interconnections underpin specific mental processes and behaviors. A pivot

Question + Answer Pair 3 is not so good.

* **Q:** "What are frontal lobe syndromes and traumatic brain injury?"

* **A:** "executive dysfunction"

The question should be "What are frontal lobe syndromes and traumatic brain injury *characterized by*?"

Question + Answer Pair 4 is also not good.

* **Q:** "The link between brain structure and cognitive function is highlighted by the study of the brain's ability"

* **A:** "adaptive and organize behavior"

The generated question is junk.

Reviewing the context shows that the answer was not highlighted.

Why? The answer model seems to have paraphrased: it produced "adaptive and organize behavior", but the text contains "adapt and organize behavior", so there is no exact match to highlight. The question generator depends heavily on having a highlight to refer to.
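
One possible way to recover from a paraphrased answer is to look for the closest phrase in the context instead of an exact match. The sketch below uses `difflib.SequenceMatcher` from the standard library. `closest_span` and the `min_ratio` threshold of 0.8 are assumptions for illustration, not something that was tuned or used for the results above.

```python
from difflib import SequenceMatcher


def closest_span(context, answer, min_ratio=0.8):
    """Return the word window in `context` most similar to `answer`.

    Windows of n-1, n and n+1 words are compared, where n is the number
    of words in the answer. Returns None if nothing reaches `min_ratio`.
    """
    words = context.split()
    n = len(answer.split())
    if n == 0 or not words:
        return None

    best_ratio, best_span = 0.0, None
    for size in (n - 1, n, n + 1):
        if size < 1 or size > len(words):
            continue
        for i in range(len(words) - size + 1):
            candidate = " ".join(words[i:i + size])
            ratio = SequenceMatcher(None, candidate.lower(), answer.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_span = ratio, candidate

    return best_span if best_ratio >= min_ratio else None


context = (
    "It highlights the remarkable capacity of the human brain to adapt and "
    "organize behavior in response to ever-changing environmental demands."
)

span = closest_span(context, "adaptive and organize behavior")
print(span)
```

The returned phrase can then be passed to the highlighter in place of the generated answer. Because the context is split on whitespace, punctuation stays attached to words, so this is a rough heuristic rather than a robust matcher.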

# **Conclusion**

If there's time, I should rewrite the highlight function so it's case-insensitive. I should also look into why the model paraphrased its answer, and why it only did so once. Kind of weird.

It is also interesting to compare the questions from the pure question generator model in our first notebook with the ones generated by our new approach.

These questions were generated for the humpback whale text by a base T5, fine-tuned on both the SQuAD and NewsQA datasets:


```
1. "What is the name of the whale's mate?"
2. "What do scientists believe?"
3. "What is the name of the song?"
4. "What do humpback whales do?"
5. "What does the humpback whale do?"
```

For comparison, here are the questions from above again:

```
1. "How long can a humpback whale grow?"
2. "How long can a male humpback whale's song last?"
3. "How do humpback whales create a unique "net" of bubbles?"
4. "Why do humpback whales migrate to warmer tropical waters?"
5. "What animal continues to capture the human imagination with its beauty, mysterious songs, and awe"
```


The new questions are much better: they're more interesting, more complex, and more specific. Except for #5, of course.