> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.</small></small></p>

<img src="https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C5-white-bg.png">

# Lab: Format Text for Turn-Based Dialogue

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_5/gdm_lab_5_2_format_text_for_turn_based_dialogue.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

15 minutes

Format the data for building a revision study flashcard generator.


## Overview

In the last lab, you saw that a pre-trained base model is a poor question-answerer. In this lab, you will take the first step in fixing that problem by preparing a high-quality dataset for fine-tuning. Your goal is to create a dataset formatted so that it can be used to train a model to generate the answers to revision study flashcards about African culture and geography. You will load a dataset of question-answer pairs and reformat them using special delimiter tokens that clearly separate the user's turn (the question) from the model's turn (the answer).





### What you will learn:

By the end of this lab, you will be able to:

* Explain the importance of using special delimiter tokens to structure conversational data for fine-tuning.

* Write a Python function to transform raw data into a specific, formatted structure.


### Tasks

Your main task is to write a function to format your data for the flashcard generation task.


**In this lab, you will**:

* Load a JSON file containing question-answer pairs.

* Write a function that takes a single data entry as input.

* Add special delimiter tokens inside the function to mark the start and end of the question and the answer.





## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in cells that are executed on a remote server.

To run a cell, hover over a cell, and click the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [None]:
from datetime import datetime
print(f"Today is {datetime.today():%A}.")

Note that the order in which you run the cells matters. When you are working through a lab, make sure to always run all cells in order. Otherwise, the code might not work. If you take a break while working on a lab, Colab may disconnect you; in that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__ from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

### Using Colab with a GPU

Follow these steps to run the activities in this lab on a GPU:

1.  In the top menu bar, click on **Runtime**.
2.  Select **Change runtime type** from the dropdown menu.
3.  In the pop-up window under **Hardware Accelerator**, select **GPU** (usually listed as `T4 GPU`).
4.  Click **Save**.

Your Colab session will now restart with GPU access.

Note that access to GPUs is limited, and at times you may not be able to run this lab on a GPU. All activities will still work, but they will run more slowly and you will have to wait longer for some of the cells to finish running.
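
If you want to confirm whether your session actually received a GPU, the following is a minimal sketch of one way to check. It assumes that GPU runtimes expose NVIDIA's `nvidia-smi` command-line tool, which is a heuristic rather than an official Colab check. The helper name `gpu_visible` is hypothetical and not part of this course's package.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Heuristic check: True if `nvidia-smi` exists and lists a GPU."""
    # Assumption: a GPU runtime has the `nvidia-smi` tool on its PATH.
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU it can see.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the lab still works on the CPU; it only runs more slowly.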


## Imports



In this lab, you will primarily work with the `pandas` library to load the dataset.

In [None]:
%%capture
# Install the custom package for this course.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

# Packages used.
import pandas as pd # For loading the dataset.
from textwrap import fill # For formatting longer paragraphs.

# Functions for providing feedback.
from ai_foundations.feedback.course_5 import formatting as feedback

### Load and inspect the dataset

As a first step, load the question-and-answer dataset and inspect its structure. This dataset contains the information that your model will eventually need to learn.

Run the following cell to see the number of examples and the properties of the first entry in the dataset.


In [None]:
africa_galore_qa = pd.read_json(
    "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_qa_v2.json"
)

print(
    f"Loaded dataset with {africa_galore_qa.shape[0]:,}"
    f" question-answer pairs.\n"
)
# Print the names of all fields of the dataset.
print(f"Dataset columns: {', '.join(africa_galore_qa.columns)}\n")

# Print the first example of the dataset.
print("First example:")
print(f"Category: {africa_galore_qa['category'][0]}")
print(f"Name: {africa_galore_qa['name'][0]}")
print(f"Question: {africa_galore_qa['question'][0]}")
print(fill(f"Answer: {africa_galore_qa['answer'][0]}"))

Notice that while the raw answer is informative, it is not structured for a specific task like a chatbot response or the answer to a revision study flashcard. Before you can fine-tune a model to produce better, more concise answers, you first need to reformat the data. This will make it suitable for training the model to generate the question-answer format required for the flashcard generator.


### Format questions and answers

A model trained on narrative text is not able to process the back-and-forth nature of a dialogue. To train a model to do this, you need to explicitly mark where one speaker's turn ends and the next one begins. As mentioned in the previous article, you can achieve this by wrapping each question and answer with special delimiter tokens. The function below takes a question and an answer and prepares them for this turn-based format. The format includes information on whose turn it is (user or model).



In [None]:
def format_qa(
    question: str,
    answer: str,
    sot: str = "<start_of_turn>",
    eot: str = "<end_of_turn>",
) -> tuple[str, str]:

    """Add special delimiters at start and end of question and answer.

    Args:
      question: The question for the flashcard.
      answer: The answer for the flashcard.
      sot: The token to mark the start of a turn.
      eot: The token to mark the end of a turn.

    Returns:
      formatted_q: Formatted string of the question.
      formatted_a: Formatted string of the answer.
    """

    formatted_q = f"{sot}user\n{question}{eot}\n"
    formatted_a = f"{sot}model\n{answer}{eot}"

    return formatted_q, formatted_a

Run the following cell to format a question-answer pair you can specify. Investigate what the function does for different question-answer pairs.



In [None]:
# @title Format question-answer pair
question = "What is Jollof rice?" #@param {type: "string"}
answer   = "Jollof rice is a tasty African dish that is made with a red sauce." #@param {type: "string"}

formatted_q, formatted_a = format_qa(question, answer)

print(formatted_q + formatted_a)

As the output of this cell shows, the function wraps the text with `<start_of_turn>` and `<end_of_turn>` tokens. These are special tokens, like the `<PAD>` and `<UNK>` tokens that you have encountered before. Rather than carrying content, they act as signals to the model, in this case marking where each turn begins and ends.

Furthermore, each turn starts with an indication of who is speaking, the user or the model. By consistently using this format throughout your dataset, a model can learn not just the content, but the conversational format of a question followed by an answer.
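
One way to convince yourself that the format is consistent is to parse a formatted dialogue back into its turns. The sketch below is an illustration only: the helper name `split_turns` and the regular expression are assumptions made for this example, not part of the course package, and it assumes the role is always `user` or `model`.

```python
import re

SOT = "<start_of_turn>"
EOT = "<end_of_turn>"

# One turn: start token, role name, newline, text (may span lines), end token.
TURN_PATTERN = re.compile(
    re.escape(SOT) + r"(user|model)\n(.*?)" + re.escape(EOT),
    flags=re.DOTALL,
)


def split_turns(dialogue: str) -> list[tuple[str, str]]:
    """Return (role, text) pairs for every well-formed turn in `dialogue`."""
    return TURN_PATTERN.findall(dialogue)


dialogue = (
    f"{SOT}user\nWhat is Jollof rice?{EOT}\n"
    f"{SOT}model\nJollof rice is a tasty African dish.{EOT}"
)

for role, text in split_turns(dialogue):
    print(f"{role}: {text}")
```

Because every turn follows the same pattern, a simple rule is enough to recover who said what. A model fine-tuned on such data can pick up on the same regularity.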



### A more appealing custom format

Marking turns is the first step, but for your flashcard generator, you want a more specific and helpful format. An effective flashcard gets straight to the point. One way to achieve this is to include a category at the beginning of the answer on a flashcard as follows:

<br />

------
>**Category**: Food
>
>Jollof rice is a popular and iconic one-pot rice dish that is a staple in many West African countries.
------

<br />

For the model to learn this behavior, you need to process the dataset with a new function that not only adds the turn-based delimiters but also formats the answer string to include the category information available in your dataset.





## Coding Activity 1: Create the flashcard format

------
> 💻 **Your task:**
>
> Complete the function so that it returns the question and answer in the desired format.
>
> The function takes one row of the dataset as input. The output should be two variables, the first for the formatted question and the second for the formatted answer.
>
> The output should be in a format that can be processed by a language model. It should contain the delimiters for start and end of a turn. For the answer, it should also contain the category, preceded by "Category:" as in the example above.
>
> For example, for the question "What is jollof rice?" and the corresponding answer above, your function should set `formatted_q` and `formatted_a` to:
>```
> formatted_q = "<start_of_turn>user\nWhat is jollof rice?<end_of_turn>\n"
> formatted_a = "<start_of_turn>model\nCategory: Food\nJollof rice is a popular and iconic one-pot rice dish that is a staple in many West African countries.<end_of_turn>"
>```
------
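
Before you run the feedback cell, you may find it useful to check the shape of your output yourself. The following is a minimal sketch of such a self-check. The helpers `looks_like_turn` and `check_flashcard_pair` are hypothetical names introduced for this example, and they are not the checks used by the course's feedback function. The sketch tolerates a trailing newline after the end-of-turn token.

```python
def looks_like_turn(
    text: str,
    role: str,
    sot: str = "<start_of_turn>",
    eot: str = "<end_of_turn>",
) -> bool:
    """Check that `text` has the shape: sot + role + newline ... eot."""
    body = text.rstrip("\n")  # Tolerate a trailing newline after the end token.
    return body.startswith(f"{sot}{role}\n") and body.endswith(eot)


def check_flashcard_pair(
    formatted_q: str, formatted_a: str, category: str
) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not looks_like_turn(formatted_q, "user"):
        problems.append("question is not a well-formed user turn")
    if not looks_like_turn(formatted_a, "model"):
        problems.append("answer is not a well-formed model turn")
    if f"Category: {category}\n" not in formatted_a:
        problems.append("answer is missing the 'Category: ...' line")
    return problems


# The expected strings from the task description above.
expected_q = "<start_of_turn>user\nWhat is jollof rice?<end_of_turn>"
expected_a = (
    "<start_of_turn>model\nCategory: Food\n"
    "Jollof rice is a popular and iconic one-pot rice dish that is a staple "
    "in many West African countries.<end_of_turn>"
)
print(check_flashcard_pair(expected_q, expected_a, "Food"))
```

Once your own function is complete, you could pass its two return values to `check_flashcard_pair` in the same way and read the list of problems it reports.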

In [None]:
def format_qa(
    data: pd.Series | dict[str, str],
    sot: str = "<start_of_turn>",
    eot: str = "<end_of_turn>",
) -> tuple[str, str]:
    """Add special delimiters at start and end of question and answer.

    Args:
      data: Row of a dataframe with fields "category", "question" and "answer".
      sot:  String of the token for start of a turn.
      eot:  String of the token for end of a turn.

    Returns:
      formatted_q: Formatted string of the question.
      formatted_a: Formatted string of the answer.
    """

    category = data["category"]
    question = data["question"]
    answer = data["answer"]

    formatted_q = # Add your code here.
    formatted_a = # Add your code here.

    return formatted_q, formatted_a


# Add your code here to test your function.
# For example, run the function and print its outputs. Check that the outputs
# correspond to what you expect them to be.
# You can access the dataset through the variable `africa_galore_qa`.


In [None]:
# @title Run this cell to test your code
feedback.check_qa_format(format_qa, africa_galore_qa)

## Summary

In this lab, you formatted the dataset into the question-and-answer format that can now be used to train your transformer. Each example marks the user's turn and the model's turn with delimiter tokens, and each answer begins with its category. You are now ready to fine-tune your model so that it can be used as a flashcard generator.

## Solutions

The following cells provide reference solutions to the coding activities in this notebook. If you really get stuck after trying to solve the activities yourself, you may want to consult these solutions.

It is recommended that you *only* look at the solutions after you have tried to solve the activities *multiple times*. The best way to learn challenging concepts in computer science and artificial intelligence is to debug your code piece-by-piece until it works, rather than copying existing solutions.

If you feel stuck, you may want to first try to debug your code, for example by adding print statements to see what your code is doing at every step. This will provide you with a much deeper understanding of the code and the materials. It will also provide you with practice on how to solve challenging coding problems beyond this course.

To view the solutions for an activity, click on the arrow to the left of the activity name. If you consult the solutions, do not copy and paste them into the cells above. Instead, look at them, and type them manually into the cell. This will help you understand where you went wrong.


### Coding Activity 1

In [None]:
# Complete implementation of format_qa.
def format_qa(
    data: pd.Series | dict[str, str],
    sot: str = "<start_of_turn>",
    eot: str = "<end_of_turn>",
) -> tuple[str, str]:
    """Add special delimiters at start and end of question and answer.

    Args:
      data: Row of a dataframe with fields "category", "question" and "answer".
      sot:  String of the token for start of a turn.
      eot:  String of the token for end of a turn.

    Returns:
      formatted_q: Formatted string of the question.
      formatted_a: Formatted string of the answer.
    """

    category = data["category"]
    question = data["question"]
    answer = data["answer"]

    formatted_q = f"{sot}user\n{question}{eot}\n"
    formatted_a = f"{sot}model\nCategory: {category}\n{answer}{eot}"

    return formatted_q, formatted_a
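
To see how the reference solution might be applied to more than one entry, the sketch below formats a small list of rows and joins each question and answer into a single string, as the earlier `print(formatted_q + formatted_a)` call did. It is self-contained, so it redefines the formatting logic under the hypothetical name `format_row`. The `rows` list is a hand-written stand-in for the dataset, and its second entry is an invented illustration rather than an entry from `africa_galore_qa`.

```python
def format_row(
    row: dict[str, str],
    sot: str = "<start_of_turn>",
    eot: str = "<end_of_turn>",
) -> tuple[str, str]:
    """Mirror of the reference solution, taking a plain dict as the row."""
    formatted_q = f"{sot}user\n{row['question']}{eot}\n"
    formatted_a = (
        f"{sot}model\nCategory: {row['category']}\n{row['answer']}{eot}"
    )
    return formatted_q, formatted_a


# Stand-in rows with the same fields as the dataset (illustrative values).
rows = [
    {
        "category": "Food",
        "question": "What is jollof rice?",
        "answer": (
            "Jollof rice is a popular and iconic one-pot rice dish that is "
            "a staple in many West African countries."
        ),
    },
    {
        "category": "Geography",
        "question": "What is the Sahara?",
        "answer": "The Sahara is a large desert in North Africa.",
    },
]

# One string per example: the user turn followed by the model turn.
training_texts = ["".join(format_row(row)) for row in rows]

print(training_texts[0])
```

With the DataFrame from this lab, `africa_galore_qa.to_dict(orient="records")` should give a list of rows of the same shape, although how the examples are fed into fine-tuning depends on the next lab.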