In [None]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# **Training to identify ANSWERS**

I tried the following to improve the Question Generator:

* Switched from the small T5 to the base
* Added the NewsQA dataset
* Trained for additional epochs

The problem of non-questions and incoherent questions being generated was largely solved. However, the model continued to generate unanswerable questions.

A different approach is necessary: *an answer must first be defined*, and based on the chosen answer, a question will be generated.

So I trained a base T5 on the SQuAD dataset again, but this time taught it to identify an `answer` based on a `context`.

In [None]:
def tokenize_data(data, tokenizer, source_max_token_len=512, target_max_token_len=32):
  tokenized = []
  for sample in data:
    source = f"{sample['context']}"  # source is still the context
    target = f"{sample['answer']}"  # but the target is now the answer

    inputs = tokenizer(source, max_length=source_max_token_len, padding="max_length", truncation=True, return_tensors="pt")

    # Labels use the shorter target length limit, not the source one
    with tokenizer.as_target_tokenizer():
      labels = tokenizer(target, max_length=target_max_token_len, padding="max_length", truncation=True, return_tensors="pt")

    tokenized.append({
        "input_ids": inputs["input_ids"].squeeze(0),
        "attention_mask": inputs["attention_mask"].squeeze(0),
        "labels": labels["input_ids"].squeeze(0)
    })
  return tokenized

In [None]:
from torch.utils.data import Dataset, DataLoader

class T5Dataset(Dataset):
    def __init__(self, tokenized_data):
        self.tokenized_data = tokenized_data

    def __len__(self):
        return len(self.tokenized_data)

    def __getitem__(self, idx):
        return self.tokenized_data[idx]

dataset = T5Dataset(tokenized_data=squadt)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

*Training was done on Kaggle, so code is a bit incomplete again*
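Since the Kaggle training code isn't included here, the following is only a minimal sketch of what one training epoch could look like, not the code that was actually run. The function name `train_one_epoch` is hypothetical. It assumes `model` is a Hugging Face seq2seq model such as `T5ForConditionalGeneration` (which returns a loss when `labels` is passed), `optimizer` is a PyTorch optimizer, and `dataloader` yields the dicts produced by `T5Dataset`.

```python
def train_one_epoch(model, dataloader, optimizer, device="cpu"):
    """Run one pass over the dataloader and return the mean training loss."""
    model.train()
    total_loss = 0.0
    num_batches = 0

    for batch in dataloader:
        # Move every tensor in the batch to the training device
        batch = {key: value.to(device) for key, value in batch.items()}

        # The model computes the loss itself when labels are supplied
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        num_batches += 1

    # Guard against an empty dataloader
    return total_loss / max(num_batches, 1)
```

One thing worth considering in a real run: Hugging Face models ignore label positions set to `-100` when computing the loss, so padded label tokens are usually replaced with `-100` before training. The sketch above leaves the labels untouched.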

# **Testing out the Answer Generator**

Here are a few longer texts to test the model on.

**Humpback Whale Text**

In [None]:
longcon1 = """Humpback whales are one of the most fascinating species in the marine world, known for their impressive size, unique behaviors, and haunting songs. These majestic creatures belong to the family of baleen whales, which includes the largest animals on Earth. Humpback whales can grow up to 60 feet in length and weigh as much as 40 tons, making them one of the larger whale species, though still smaller than the blue whale.

One of the most distinctive features of humpback whales is their acrobatic nature. They are known for breaching, a behavior in which the whale propels itself out of the water in a spectacular leap. This behavior, along with their habit of slapping their massive tails and fins on the water's surface, makes humpback whales a favorite among whale watchers. Scientists believe these behaviors could serve multiple purposes, including communication, mating displays, or as a way to dislodge parasites.

Humpback whales are also renowned for their complex and melodious songs. Male humpbacks sing intricate songs that can last for up to 20 minutes and be heard miles away under the ocean. These songs are thought to play a role in attracting mates and asserting dominance. Remarkably, all males within a population sing the same song, which evolves gradually over time. The exact meaning and content of these songs remain one of the ocean's great mysteries.

Another interesting aspect of humpback whales is their feeding technique known as bubble net feeding. They create a unique "net" of bubbles by swimming in a spiral and releasing air from their blowholes. This bubble net traps schools of fish or krill, and the whales then swim upwards with their mouths open to engulf thousands of gallons of water filled with prey. This cooperative hunting strategy showcases the intelligence and social behavior of humpback whales.

Humpback whales embark on long migrations, one of the longest of any mammal on Earth. They spend the summer months in colder, polar waters where they feed extensively to build up fat reserves. In the winter, they migrate to warmer, tropical waters to breed and give birth. During the breeding season, humpbacks fast and live off their fat reserves, focusing on mating and nursing their young.

Conservation efforts have been critical in protecting humpback whales from the brink of extinction. Once hunted to near extinction, humpback populations have made a significant recovery thanks to international protection and the ban on commercial whaling. However, they still face threats from entanglement in fishing gear, ship strikes, pollution, and climate change which affects their prey availability.

The humpback whale continues to capture the human imagination with its beauty, mysterious songs, and awe-inspiring acrobatics. Their presence in the oceans serves as a reminder of the need to preserve and protect the incredible biodiversity of our planet."""

**Bioluminescence Text**

In [None]:
longcon2 = """Let's dive into the fascinating world of bioluminescence, the natural phenomenon where living organisms produce light. Bioluminescence is one of nature's most enchanting spectacles, observed in a variety of organisms, including certain types of fish, insects, fungi, and microorganisms, most notably in the depths of the ocean.

The light produced by these organisms is the result of a biochemical reaction in which enzyme luciferase acts on the substrate luciferin, in the presence of oxygen, to produce light. The colors of this light can vary greatly, from the common green or blue to the rare red, depending on the species and the environment in which they live.

Bioluminescence serves multiple purposes in nature. For deep-sea creatures, it can be a tool for communication, attracting mates, luring prey, or as a defense mechanism to confuse predators. Fireflies, on the other hand, use bioluminescence to attract mates with their distinctive patterns of flashing lights.

One of the most magical occurrences related to bioluminescence is the sea sparkle caused by dinoflagellates, microscopic plankton that illuminate the water along coastlines. When agitated by waves or movement, these organisms emit a stunning, ethereal glow, turning the sea into a canvas of sparkling light.

Bioluminescence has not only captivated the imagination of many but has also found practical applications in scientific research. For instance, the gene for luciferase has been inserted into various organisms, from bacteria to plants, making them glow. This technique is used in genetic engineering, bioassays, and medical research to track the expression of genes, monitor the spread of infections, or visualize the location of specific proteins within cells.

Despite its widespread occurrence and utility, much about bioluminescence remains a mystery. Deep-sea expeditions continue to discover new species that challenge our understanding of this beautiful biological phenomenon. The study of bioluminescent organisms not only broadens our knowledge of the natural world but also opens up new avenues for biotechnological applications, from eco-friendly lighting solutions to advanced medical diagnostics.

As we continue to explore and learn from the natural world, bioluminescence stands as a glowing testament to the wonders of life on Earth, reminding us of the complexity, diversity, and beauty hidden in the depths of the oceans and the nooks of our terrestrial ecosystems."""

**Neuropsychology Text**

In [None]:
longcon3 = """
Neuropsychology delves into the intricate relationships between the brain's physical structure and its cognitive functions, exploring how different brain areas and their interconnections underpin specific mental processes and behaviors. A pivotal area within this field is the study of executive functions, which are crucial high-level cognitive processes responsible for organizing, planning, strategizing, paying attention to and remembering details, and managing time and space.

Executive functions are primarily mediated by the prefrontal cortex, a part of the brain located at the front of the frontal lobe. This region is involved in the complex processes of decision-making, problem-solving, and behavior modulation. The prefrontal cortex works in tandem with other brain regions to orchestrate a range of activities necessary for goal-directed behavior. For instance, when planning a task, the prefrontal cortex might activate to assess the sequence of actions required, estimate time needed, and predict potential outcomes.

Damage or dysfunction in the prefrontal cortex can lead to significant difficulties with executive functions, manifesting in various ways depending on the affected area. For example, damage to the dorsolateral prefrontal cortex can result in problems with working memory and task flexibility, while damage to the orbitofrontal cortex can lead to poor impulse control and decision-making.

Executive functions are often assessed through neuropsychological tests designed to probe the complex interplay of cognitive processes involved in goal-directed behavior. These tests might evaluate an individual's ability to form strategies, switch between tasks, apply rules, and inhibit inappropriate responses.

Understanding executive functions and their neural underpinnings is not only crucial for basic neuroscience research but also has practical implications in clinical settings. Neuropsychologists apply this knowledge to diagnose and treat conditions characterized by executive dysfunction, such as traumatic brain injury, ADHD, and frontal lobe syndromes. Rehabilitation strategies often include cognitive exercises aimed at improving planning, problem-solving skills, and impulse control, thereby helping individuals regain functionality and improve their quality of life.

In sum, the study of executive functions within neuropsychology offers profound insights into how our brains enable us to navigate complex, goal-oriented tasks. It highlights the remarkable capacity of the human brain to adapt and organize behavior in response to ever-changing environmental demands, underscoring the intricate link between brain structure and cognitive function."""

I decided to build a "chunker" that splits texts into segments in order to:


* prevent truncation of long inputs
* allow more than one question to be generated per text
* let longer texts yield more questions than shorter ones

In [None]:
import nltk
nltk.download('punkt')

def chunker(text, tokenizer, min_tokens=140, max_tokens=200):
    sentences = nltk.tokenize.sent_tokenize(text)

    segments = []
    current_segment = []
    current_token_count = 0

    for sentence in sentences:
        sentence_tokens = tokenizer.tokenize(sentence)
        sentence_token_count = len(sentence_tokens)

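        if current_token_count + sentence_token_count > max_tokens:
            # Flush the running segment first; skip this when it is empty,
            # otherwise a single over-long sentence would emit an empty string
            if current_segment:
                segments.append(' '.join(current_segment))
            current_segment = [sentence]
            current_token_count = sentence_token_count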
        else:
            current_segment.append(sentence)
            current_token_count += sentence_token_count

        if min_tokens <= current_token_count <= max_tokens:
            segments.append(' '.join(current_segment))
            current_segment = []
            current_token_count = 0

    if current_segment:
        segments.append(' '.join(current_segment))

    return segments

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
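To make the interaction between `min_tokens` and `max_tokens` easier to see, here is a simplified sketch of the same greedy idea that runs without NLTK or a T5 tokenizer. It is an illustration only: `chunk_sentences` and `word_count` are hypothetical names, the sentences are pre-split by hand, and whitespace-separated words stand in for tokens.

```python
def chunk_sentences(sentences, count_tokens, min_tokens=140, max_tokens=200):
    """Greedily pack sentences into segments of roughly min_tokens..max_tokens."""
    segments, current, current_count = [], [], 0

    for sentence in sentences:
        n = count_tokens(sentence)

        if current_count + n > max_tokens:
            # Adding this sentence would overflow: close the running segment
            if current:
                segments.append(" ".join(current))
            current, current_count = [sentence], n
        else:
            current.append(sentence)
            current_count += n

        # Close the segment as soon as it is long enough
        if min_tokens <= current_count <= max_tokens:
            segments.append(" ".join(current))
            current, current_count = [], 0

    # Whatever is left over becomes a final (possibly short) segment
    if current:
        segments.append(" ".join(current))
    return segments


def word_count(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words
    return len(text.split())


sentences = ["one two three.", "four five.", "six seven eight nine.", "ten."]
segments = chunk_sentences(sentences, word_count, min_tokens=4, max_tokens=6)
for segment in segments:
    print(word_count(segment), segment)
# 5 one two three. four five.
# 4 six seven eight nine.
# 1 ten.
```

Note that the last segment can fall below `min_tokens`, and a single sentence longer than `max_tokens` still ends up as its own oversized segment.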


# **Comparing Model Performance**

I had to segment the training data (into 4 parts), because it was too glitchy to fine-tune my model on all of it at once. This means that I ended up with four answer-generating models, trained on 1/4 to 4/4 of the dataset. I thought it would be interesting to compare their performance.

* The **baby model** was trained for one epoch on a fourth of the dataset.

* The **adult model** was trained for one epoch on all four parts of the dataset.
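How the data was actually split isn't shown in this notebook. As a rough sketch, assuming the training samples sit in a plain Python list, a hypothetical helper like `split_into_parts` could cut them into four contiguous parts of near-equal size:

```python
def split_into_parts(data, n_parts=4):
    """Split a list into n_parts contiguous slices of near-equal size."""
    base, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts each get one additional item
        end = start + base + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


parts = split_into_parts(list(range(10)), n_parts=4)
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Each part could then be used for one fine-tuning run, saving a checkpoint after each.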

In [None]:
import torch

babymodel = torch.load("/content/drive/My Drive/TeachMy/answermod_d3e1.pth", map_location = torch.device("cpu"))
adultmodel = torch.load("/content/drive/My Drive/TeachMy/answermod_d1e1_d2e1_d3e1_d4e1.pth", map_location = torch.device("cpu"))
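The helpers `gen_baby` and `gen_adult` used in the comparisons are not defined in this notebook. The following is a sketch of how such a helper might be built, under the assumption that each loaded object is a full Hugging Face T5 model with a `generate` method. The factory name `make_answer_generator` and the length limits are hypothetical.

```python
def make_answer_generator(model, tokenizer, max_input_tokens=512, max_answer_tokens=32):
    """Return a function that maps a text segment to one generated answer string."""
    model.eval()  # switch off dropout for inference

    def generate_answer(segment):
        # Encode the segment the same way as the training inputs
        encoded = tokenizer(
            segment,
            max_length=max_input_tokens,
            truncation=True,
            return_tensors="pt",
        )
        output_ids = model.generate(
            input_ids=encoded["input_ids"],
            attention_mask=encoded["attention_mask"],
            max_length=max_answer_tokens,
        )
        # Decode the first (and only) generated sequence
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return generate_answer
```

With this, something like `gen_baby = make_answer_generator(babymodel, tokenizer)` and `gen_adult = make_answer_generator(adultmodel, tokenizer)` would give one helper per model.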

**Humpback Whales:**

In [None]:
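# Chunk the text first; iterating over the raw string would yield single characters
segments1 = chunker(longcon1, tokenizer)
print("Baby:", [gen_baby(segment) for segment in segments1])
print("Adult:", [gen_adult(segment) for segment in segments1])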

Baby: ['acrobatic', "slapping their massive tails and fins on the water's surface", 'bubble net feeding', 'entanglement in fishing gear, ship strikes, pollution, and climate change', 'humpback whale']
Adult: ['60 feet', 'humpback whales', 'bubble net feeding', 'entanglement in fishing gear, ship strikes, pollution, and climate change', 'humpback whale']


**Bioluminescence:**

In [None]:
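# Chunk the text first; iterating over the raw string would yield single characters
segments2 = chunker(longcon2, tokenizer)
print("Baby:", [gen_baby(segment) for segment in segments2])
print("Adult:", [gen_adult(segment) for segment in segments2])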

Baby: ['bioluminescence', 'a canvas of sparkling light', 'bioluminescence', 'bioluminescence']
Adult: ['bioluminescence', 'Fireflies', 'bioluminescence', 'bioluminescence']


**Neuropsychology:**

In [None]:
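# Chunk the text first; iterating over the raw string would yield single characters
segments3 = chunker(longcon3, tokenizer)
print("Baby:", [gen_baby(segment) for segment in segments3])
print("Adult:", [gen_adult(segment) for segment in segments3])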

Baby: ['executive functions', 'prefrontal cortex', 'executive dysfunction', 'brain structure and cognitive function']
Adult: ['prefrontal cortex', 'damage to the dorsolateral prefrontal cortex', 'traumatic brain injury, ADHD, and frontal lobe syndromes', 'adapt and organize behavior']


**Conclusions:**

* The baby model does surprisingly well; in my opinion there is not much difference in answer quality between the two.

* Both models have issues with the bioluminescence text; they seem to have a "What" bias.

* Based on these 3 examples, the adult model in particular seems to show this bias. For the humpback whale text, it picks "humpback whale(s)" as an answer twice. For the bioluminescence text, it picks "bioluminescence" as an answer three times.

This is bad: I don't want repetitive or overly easy questions and answers.
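The repetition can also be counted rather than eyeballed. Here is a small sketch that does this for the adult model's outputs copied from above. The helper name `repeated_answers` is hypothetical, and stripping a trailing "s" is only a crude way to treat "humpback whale" and "humpback whales" as the same answer.

```python
from collections import Counter


def repeated_answers(answers):
    """Return the answers that occur more than once, ignoring case and a trailing 's'."""
    normalized = [answer.lower().rstrip("s") for answer in answers]
    counts = Counter(normalized)
    return {answer: n for answer, n in counts.items() if n > 1}


adult_whales = ['60 feet', 'humpback whales', 'bubble net feeding',
                'entanglement in fishing gear, ship strikes, pollution, and climate change',
                'humpback whale']
adult_biolum = ['bioluminescence', 'Fireflies', 'bioluminescence', 'bioluminescence']

print(repeated_answers(adult_whales))  # {'humpback whale': 2}
print(repeated_answers(adult_biolum))  # {'bioluminescence': 3}
```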

# **Improving the Model**

Let's look at what the most common SQuAD questions look like.

In [None]:
from collections import Counter

firstwords = [" ".join(data["question"].split()[:4]) for data in squadp]
firstwordcounts = Counter(firstwords)
firstwordcounts.most_common(5)

[('In what year did', 827),
 ('What is the name', 776),
 ('What was the name', 668),
 ('In what year was', 488),
 ('What was the first', 178)]

The most common questions written by the crowdworkers are pretty lazy, of course, and ask about very simple things like years and names. That's not good for the model.

What about just looking at the first word of each question?

*some code is missing, I worked across so many different notebooks 😩*

In [None]:
first_word_percentage = {word: (count / total) * 100 for word, count in first_word_freq}
first_word_percentage

{'WHAT': 44.882528612611985,
 'WHO': 23.49255122650141,
 'HOW': 8.346424070845346,
 'WHERE': 7.743962436776053,
 'WHEN': 4.258661464896684,
 'WHICH': 2.4909803379999653,
 'THE': 0.7543717309119785,
 'IN': 0.6542491670838441,
 'WHOSE': 0.383227744307687,
 'FOR': 0.2969151892834332}
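The code that built `first_word_freq` and `total` is missing from this notebook. As a sketch of how such percentages could be computed, here is one possible version. The function name `first_word_percentages` and the toy questions are made up for illustration and are not the SQuAD data.

```python
from collections import Counter


def first_word_percentages(questions, top_n=10):
    """Percentage of questions starting with each of the top_n first words."""
    # Upper-case the first word so "What" and "what" are counted together
    first_words = [q.split()[0].upper() for q in questions if q.strip()]
    total = len(first_words)
    top_counts = Counter(first_words).most_common(top_n)
    return {word: (count / total) * 100 for word, count in top_counts}


toy_questions = ["What is X?", "what was Y?", "Who did Z?", "How many?"]
print(first_word_percentages(toy_questions))
# {'WHAT': 50.0, 'WHO': 25.0, 'HOW': 25.0}
```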

The most common question words are "What" and "Who", as expected. While "What" questions can definitely be complex, the analysis of the most common 4-word openings above suggests that many "What" questions simply ask for the name of something.

So, I extracted data triplets based on the first word of their respective questions, in order to generate higher quality answers and consequently, questions. I extracted the triplets with the following question words:



* How
* Why
* Which
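A sketch of what this extraction could look like, assuming each sample is a dict with `context`, `question` and `answer` keys as in the cells above. The function name `filter_by_question_word` and the toy samples are hypothetical.

```python
def filter_by_question_word(samples, question_words=("how", "why", "which")):
    """Keep only the samples whose question starts with one of the given words."""
    kept = []
    for sample in samples:
        words = sample["question"].split()
        # Compare case-insensitively; skip empty questions
        if words and words[0].lower() in question_words:
            kept.append(sample)
    return kept


toy_samples = [
    {"context": "c1", "question": "What is the name of the whale?", "answer": "a1"},
    {"context": "c2", "question": "How do humpbacks feed?", "answer": "a2"},
    {"context": "c3", "question": "why do they migrate?", "answer": "a3"},
    {"context": "c4", "question": "In which year was it banned?", "answer": "a4"},
]
filtered = filter_by_question_word(toy_samples)
print([sample["answer"] for sample in filtered])  # ['a2', 'a3']
```

Matching on the first word alone means questions such as "In which year ..." are dropped, which fits the goal of filtering out simple year and name questions.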



I then compared the performance between the **old** adult model and the new **"tuned"** adult model.

In [None]:
adultnew = torch.load("/content/drive/My Drive/TeachMy/adult_tuned.pth", map_location=torch.device("cpu"))

**Humpback Whales:**

In [None]:
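# gen_tuned and text1..text3 are not defined in this notebook; presumably
# text1..text3 hold the chunked segments of longcon1..longcon3
print("Old:", [gen_adult(segment) for segment in text1])
print("New:", [gen_tuned(segment) for segment in text1])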

Old: ['60 feet', 'humpback whales', 'bubble net feeding', 'entanglement in fishing gear, ship strikes, pollution, and climate change', 'humpback whale']
New: ['60 feet', '20 minutes', 'by swimming in a spiral and releasing air from their blowholes', 'to build up fat reserves', 'beauty, mysterious songs, and awe-inspiring acrobatics']


* The tuned model does not define "humpback whales" as an answer at all, which is great

**Bioluminescence:**

In [None]:
print("Old:", [gen_adult(segment) for segment in text2])
print("New:", [gen_tuned(segment) for segment in text2])

Old: ['bioluminescence', 'Fireflies', 'bioluminescence', 'bioluminescence']
New: ['biochemical reaction', 'to confuse predators', 'biotechnological', 'bioluminescence']


* The tuned model puts out 4 unique answers, as opposed to the old one, which has bioluminescence x3

**Neuropsychology:**

In [None]:
print("Old:", [gen_adult(segment) for segment in text3])
print("New:", [gen_tuned(segment) for segment in text3])

Old: ['prefrontal cortex', 'damage to the dorsolateral prefrontal cortex', 'traumatic brain injury, ADHD, and frontal lobe syndromes', 'adapt and organize behavior']
New: ['executive functions', 'through neuropsychological tests', 'traumatic brain injury, ADHD, and frontal lobe syndromes', 'brain structure and cognitive function']


* No problem with the answers from either model

Cool! We now have an answer generator. The only step left is to convert the answers into questions. Luckily enough, the question generator made in Part 1 can actually be repurposed for this task and no new model has to be trained.