# **Choosing a model**

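A T5 model was chosen for question generation for two reasons:

*   Its sequence-to-sequence (encoder-decoder) architecture suits tasks that map one text to another, such as turning a passage into a question.
*   T5 frames every task as text-to-text, so question generation can be selected with a task prefix (here, `generate question:`) that the model learns during fine-tuning.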



# **Building and loading the first model**

The initial model was a **small T5** (`t5-small`), fine-tuned on the SQuAD dataset for 1 epoch.

**Preprocessing the training data**

Excerpt of the dataset:




* **Context:** "In 2015-2016, Notre Dame ranked 18th overall among 'national universities' in the United States in U.S. News & World Report's Best Colleges 2016."

* **Question:** "Where did U.S. News & World Report rank Notre Dame in its 2015-2016 university rankings?"

* **Answer:** {'answer_start': 32, 'text': '18th overall'}


Each `context` was preprocessed by wrapping the `answer` span in `[ANSWER]` and `[/ANSWER]` markers:


* **Context:** "In 2015-2016, Notre Dame ranked [ANSWER]18th overall[/ANSWER] among 'national universities' in the United States in U.S. News & World Report's Best Colleges 2016."

* **Question:** "Where did U.S. News & World Report rank Notre Dame in its 2015-2016 university rankings?"
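
The highlighting step can be sketched as follows. This is a minimal illustration, not the original preprocessing code: `highlight_answer` is a hypothetical helper, and it assumes the answer is given as a single `answer_start` offset and `text` string, as in the excerpt above.

```python
def highlight_answer(context, answer, open_tag="[ANSWER]", close_tag="[/ANSWER]"):
    """Wrap the answer span of a SQuAD-style example in highlight markers."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    # Guard against offsets that do not line up with the answer text.
    if context[start:end] != answer["text"]:
        raise ValueError("answer_start does not point at the answer text")
    return context[:start] + open_tag + context[start:end] + close_tag + context[end:]


context = (
    "In 2015-2016, Notre Dame ranked 18th overall among 'national universities' "
    "in the United States in U.S. News & World Report's Best Colleges 2016."
)
answer = {"answer_start": 32, "text": "18th overall"}

highlighted = highlight_answer(context, answer)
print(highlighted)
```

The offset check is there because a misaligned `answer_start` would otherwise silently highlight the wrong span.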

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("t5-small")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
import torch
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
# Excerpt of the training code.
# Training was largely done on Kaggle, so some data preprocessing
# (e.g. building `train_loader`) is not shown here.
from tqdm import tqdm

model.train()
for epoch in range(1):
    total_loss = 0
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}"):
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch + 1} | Loss: {total_loss / len(train_loader)}")
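
The data loader itself is not shown above. As a rough, library-free sketch of what a collate step usually has to produce for this loop, the hypothetical `collate` helper below pads a batch into the three fields the loop reads (`input_ids`, `attention_mask`, `labels`). It is an assumption about the missing code, not the code that was run on Kaggle. The token ids are made up, and in practice the lists would be converted to tensors before reaching the model.

```python
PAD_ID = 0           # T5's padding token id
IGNORE_INDEX = -100  # label value that the loss skips


def collate(examples, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Pad tokenized examples into the three fields the training loop reads."""
    max_in = max(len(e["input_ids"]) for e in examples)
    max_out = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in, n_out = len(e["input_ids"]), len(e["labels"])
        batch["input_ids"].append(e["input_ids"] + [pad_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Padded label positions get ignore_index so they do not affect the loss.
        batch["labels"].append(e["labels"] + [ignore_index] * (max_out - n_out))
    return batch


# Made-up token ids, for illustration only.
examples = [
    {"input_ids": [7, 8, 9, 1], "labels": [4, 5, 1]},
    {"input_ids": [7, 1], "labels": [6, 1]},
]
batch = collate(examples)
print(batch)
```

Replacing label padding with `-100` keeps padded positions out of the loss, so the loss is computed only over real question tokens.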

In [None]:
import torch

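# The file holds the entire pickled model (not just a state dict), which newer PyTorch versions only load with weights_only=False.
model = torch.load('/content/drive/My Drive/TeachMy/t5_model_entire.pth', map_location=torch.device('cpu'), weights_only=False)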

In [None]:
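def genq(new_context):
    """Generate one question for `new_context` with the fine-tuned T5 model."""
    model.eval()
    # The "generate question:" prefix selects the question generation task.
    input_text = f"generate question: {new_context} </s>"
    # Keep the inputs on the same device as the model (it was loaded onto the CPU above).
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # No generation arguments are passed, so library defaults apply
        # (including the default limit on output length).
        outputs = model.generate(input_ids)

    generated_question = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_question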

# **Trying out the model**

Let's test out the model on different texts.

In [None]:
context = "The Berlin International Film Festival (German: Internationale Filmfestspiele Berlin), usually called the Berlinale, is a major international film festival held annually in Berlin, Germany. Founded in 1951 and originally run in June, the festival has been held every February since 1978 and is one of Europe's Big Three film festivals alongside the Venice Film Festival held in Italy and the Cannes Film Festival held in France. Furthermore, it is one of the Big Five, the most prestigious film festivals in the world. The festival regularly draws tens of thousands of visitors each year."

In [None]:
genq(context)

'What is the name of the major international film festival held in Berlin?'

* The question is simple yet answerable based on the given context.




In [None]:
context = "In 1963, two years after the Berlin Wall had been erected, a daily show of the Berlinale was shown on television in East Germany, with five films in competition broadcast."

In [None]:
genq(context)



'How many films were in competition with the Berlinale?'

* The question does not make sense: the five films were in competition ***at*** the Berlinale, not ***with*** it.

In [None]:
genq(context)

'What was the name of the daily show of the Berlinale shown on television in East Germany'

* The question is low quality: it cannot be answered from the context, and it is missing a question mark.

**Let's try a new input:**

In [None]:
context = "Prior to the erection of the Berlin Wall in 1961, a selection of the films were also screened in East Berlin."

In [None]:
genq(context)

'What was the name of the film that was screened in East Berlin?'

* Again, the question does not make sense: it asks for information that the context does not provide.

# **Next Steps**

Here is an example text that was given to the model for more detailed testing:


```
Humpback whales are one of the most fascinating species in the marine world, known for their impressive size, unique behaviors, and haunting songs. These majestic creatures belong to the family of baleen whales, which includes the largest animals on Earth. Humpback whales can grow up to 60 feet in length and weigh as much as 40 tons, making them one of the larger whale species, though still smaller than the blue whale.

One of the most distinctive features of humpback whales is their acrobatic nature. They are known for breaching, a behavior in which the whale propels itself out of the water in a spectacular leap. This behavior, along with their habit of slapping their massive tails and fins on the water's surface, makes humpback whales a favorite among whale watchers. Scientists believe these behaviors could serve multiple purposes, including communication, mating displays, or as a way to dislodge parasites.

Humpback whales are also renowned for their complex and melodious songs. Male humpbacks sing intricate songs that can last for up to 20 minutes and be heard miles away under the ocean. These songs are thought to play a role in attracting mates and asserting dominance. Remarkably, all males within a population sing the same song, which evolves gradually over time. The exact meaning and content of these songs remain one of the ocean's great mysteries.

Another interesting aspect of humpback whales is their feeding technique known as bubble net feeding. They create a unique "net" of bubbles by swimming in a spiral and releasing air from their blowholes. This bubble net traps schools of fish or krill, and the whales then swim upwards with their mouths open to engulf thousands of gallons of water filled with prey. This cooperative hunting strategy showcases the intelligence and social behavior of humpback whales.

Humpback whales embark on long migrations, one of the longest of any mammal on Earth. They spend the summer months in colder, polar waters where they feed extensively to build up fat reserves. In the winter, they migrate to warmer, tropical waters to breed and give birth. During the breeding season, humpbacks fast and live off their fat reserves, focusing on mating and nursing their young.

Conservation efforts have been critical in protecting humpback whales from the brink of extinction. Once hunted to near extinction, humpback populations have made a significant recovery thanks to international protection and the ban on commercial whaling. However, they still face threats from entanglement in fishing gear, ship strikes, pollution, and climate change which affects their prey availability.

The humpback whale continues to capture the human imagination with its beauty, mysterious songs, and awe-inspiring acrobatics. Their presence in the oceans serves as a reminder of the need to preserve and protect the incredible biodiversity of our planet.

```



Testing revealed that training the model for more epochs did not solve the aforementioned problems.


Here are some examples of questions generated by a model fine-tuned for 4 epochs:


```
1. "What is one of the most distinctive features of humpback whales?"
2. "Humpback whales are known for their melodious songs that can last for up to 20"
3. "What is the name of the song that humpback whales sing?"
4. "What is the name of the whales that feeds them to build up fat reserves?"
5. "The humpback whale is a reminder of the need to preserve and protect our planet"
```
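
Some of these outputs are not questions at all. A simple heuristic check, sketched below, can flag such cases automatically. `looks_like_question` is a hypothetical helper and was not part of the original experiments. It only catches malformed outputs (no question mark, no leading question word) and says nothing about whether a question is answerable from the context.

```python
QUESTION_WORDS = ("what", "who", "whom", "whose", "when", "where", "which", "why", "how")


def looks_like_question(text):
    """Return True if the text starts with a question word and ends with '?'."""
    stripped = text.strip()
    words = stripped.split()
    return bool(words) and stripped.endswith("?") and words[0].lower() in QUESTION_WORDS


generated = [
    "What is one of the most distinctive features of humpback whales?",
    "Humpback whales are known for their melodious songs that can last for up to 20",
    "What is the name of the song that humpback whales sing?",
    "What is the name of the whales that feeds them to build up fat reserves?",
    "The humpback whale is a reminder of the need to preserve and protect our planet",
]

# 1-based positions of outputs that do not look like questions.
flagged = [i for i, q in enumerate(generated, start=1) if not looks_like_question(q)]
print(flagged)  # [2, 5]
```

Question 4 passes this check even though it makes little sense, which shows the limits of surface-level filtering.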


Switching from the small T5 to the base T5 improved the quality of the questions, but several of the generated questions were still not answerable from the given context.


```
1. "What is the acrobatic nature of humpback whales?"
2. "What is the habit of slapping their massive tails and fins on the water"
3. "What is the meaning of the songs?"
4. "Humpback whales spend the summer months in colder, polar waters where they build"
5. "What is the importance of humpback whales in the oceans?"
```


Diversifying the training data by adding the NewsQA dataset made the questions more complete, but it did not solve the problem of unanswerable and uninteresting questions.


```
1. "What is the name of the whale's mate?"
2. "What do scientists believe?"
3. "What is the name of the song?"
4. "What do humpback whales do?"
5. "What does the humpback whale do?"
```


**Conclusion:** A different approach to generating questions must be taken.