# Bonito Tutorial with A100
This tutorial runs Bonito on A100 GPUs to generate synthetic instruction tuning datasets.
To use Bonito with A100 GPUs, you will need to purchase compute units from Google. Prices start at $9.99 for 100 compute units. See [pricing](https://colab.research.google.com/signup) for more details.

To run Bonito for free on T4 GPUs instead, see our [quantized Bonito tutorial](https://colab.research.google.com/drive/1tfAqUsFaLWLyzhnd1smLMGcDXSzOwp9r?usp=sharing).



## Setup
First, clone the repository and install the dependencies. This will take several minutes.

In [None]:
!git clone https://github.com/BatsResearch/bonito.git
!pip install -U bonito/

## Load the Bonito Model
Load the weights from Hugging Face into the `Bonito` class.

In [None]:
from bonito import Bonito

bonito = Bonito("BatsResearch/bonito-v1")

## Synthetic Data Generation
We first show how to use the Bonito model on a single unannotated text, and then how to generate an instruction tuning dataset from a small unannotated dataset.


### Single example

In [None]:
unannotated_paragraph = """1. “Confidential Information”, whenever used in this Agreement, shall mean any data, document, specification and other information \nor material, that is delivered or disclosed by UNHCR to the Recipient in any form whatsoever, whether orally, visually in writing \nor otherwise (including computerized form), and that, at the time of disclosure to the Recipient, is designated as \nconfidential."""
print(unannotated_paragraph)

Now generate a synthetic instruction (an input and output pair) for the unannotated paragraph.

In [None]:
from datasets import Dataset
from vllm import SamplingParams
from transformers import set_seed

set_seed(2)


def convert_to_dataset(text):
    dataset = Dataset.from_list([{"input": text}])
    return dataset


sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    convert_to_dataset(unannotated_paragraph),
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params,
)
print("----Generated Instructions----")
print(f'Input: {synthetic_dataset[0]["input"]}')
print(f'Output: {synthetic_dataset[0]["output"]}')

Now we change the task type from natural language inference (`nli`) to multiple-choice question answering (`mcqa`). For more details, see the [supported task types](https://github.com/BatsResearch/bonito?tab=readme-ov-file#supported-task-types).

In [None]:
set_seed(0)
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.7, n=1)
synthetic_dataset = bonito.generate_tasks(
    convert_to_dataset(unannotated_paragraph),
    context_col="input",
    task_type="mcqa",  # changed
    sampling_params=sampling_params,
)
print("----Generated Instructions----")
print(f'Input: {synthetic_dataset[0]["input"]}')
print(f'Output: {synthetic_dataset[0]["output"]}')

### Small dataset
We select 10 unannotated samples from the ContractNLI dataset and convert them into an NLI instruction tuning dataset.


In [None]:
# load dataset with unannotated text
from datasets import load_dataset

unannotated_dataset = load_dataset(
    "BatsResearch/bonito-experiment", "unannotated_contract_nli"
)["train"].select(range(10))
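
If your own text is not already split into rows, one possible way to prepare it is sketched below. This is a minimal sketch, not part of the Bonito API: the helper `paragraphs_to_records`, its `min_chars` threshold, and the sample text are all hypothetical. It assumes paragraphs are separated by blank lines and produces records keyed by `input`, the same column name used as `context_col` in this tutorial.

```python
import re


def paragraphs_to_records(document, min_chars=50, context_col="input"):
    """Split raw text into paragraph-level records keyed by `context_col`.

    Hypothetical helper for illustration; not part of Bonito.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", document)
    records = []
    for paragraph in paragraphs:
        # Collapse internal line breaks and repeated whitespace.
        text = " ".join(paragraph.split())
        # Skip fragments shorter than the (arbitrary) length threshold.
        if len(text) >= min_chars:
            records.append({context_col: text})
    return records


# Made-up sample text, used only to exercise the helper.
sample_document = (
    "The Recipient shall keep all Confidential Information secret.\n\n"
    "Short.\n\n"
    "The Recipient shall return all materials\n"
    "upon written request by the Discloser."
)

records = paragraphs_to_records(sample_document, min_chars=20)
for record in records:
    print(record)
```

The resulting list can be wrapped with `Dataset.from_list(records)`, as in `convert_to_dataset` above, before passing it to `bonito.generate_tasks`.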

Generate the synthetic NLI dataset.

In [None]:
# Generate synthetic instruction tuning dataset
from vllm import SamplingParams
from transformers import set_seed

set_seed(42)

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_dataset,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params,
)
print("----Generated Instructions----")
print(f'Input: {synthetic_dataset[0]["input"]}')
print(f'Output: {synthetic_dataset[0]["output"]}')
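
To reuse the generated pairs for fine-tuning, you may want to write them to disk. The sketch below shows one way to do this with the standard library only. The helper `rows_to_jsonl`, the `prompt`/`completion` field names, and the output file name are assumptions chosen for illustration; only the `input` and `output` keys come from the cells above. Stand-in rows are used so that the sketch runs without a GPU.

```python
import json
import os
import tempfile


def rows_to_jsonl(rows, path, prompt_key="input", completion_key="output"):
    """Write prompt/completion pairs to a JSON Lines file; return the row count.

    Hypothetical helper for illustration; not part of Bonito.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            prompt = row[prompt_key].strip()
            completion = row[completion_key].strip()
            # Skip rows where either side of the pair is empty.
            if not prompt or not completion:
                continue
            record = {"prompt": prompt, "completion": completion}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in rows; in the notebook you would pass `synthetic_dataset` instead.
example_rows = [
    {"input": "Premise and hypothesis text ...", "output": "Yes"},
    {"input": "Another premise and hypothesis ...", "output": "   "},
    {"input": "A third premise and hypothesis ...", "output": "No"},
]

output_path = os.path.join(tempfile.gettempdir(), "synthetic_nli.jsonl")
written = rows_to_jsonl(example_rows, output_path)
print(f"Wrote {written} rows to {output_path}")
```

Iterating over a `datasets.Dataset` yields one dictionary per row, so `rows_to_jsonl(synthetic_dataset, output_path)` should work unchanged in the notebook.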

Now try it out with your own datasets! You can vary the `task_type` to generate different types of instructions.
You can also experiment with sampling hyperparameters such as `top_p` and `temperature`.
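
As a starting point for such experiments, the sketch below loops over several task types and collects the results by name. The helper `generate_for_task_types` and the `EchoModel` stand-in are hypothetical and exist only so the example runs without a GPU. The keyword arguments passed to `generate_tasks` mirror the calls shown earlier in this tutorial.

```python
def generate_for_task_types(model, dataset, task_types, sampling_params, context_col="input"):
    """Call `generate_tasks` once per task type and collect results by name.

    Hypothetical helper for illustration; not part of Bonito.
    """
    results = {}
    for task_type in task_types:
        results[task_type] = model.generate_tasks(
            dataset,
            context_col=context_col,
            task_type=task_type,
            sampling_params=sampling_params,
        )
    return results


class EchoModel:
    """Stand-in for the Bonito model; it only tags each input with the task type."""

    def generate_tasks(self, dataset, context_col, task_type, sampling_params):
        return [
            {"input": f"[{task_type}] {row[context_col]}", "output": "placeholder"}
            for row in dataset
        ]


demo = generate_for_task_types(
    EchoModel(),
    [{"input": "Some contract clause."}],
    ["nli", "mcqa"],
    sampling_params=None,  # the stand-in ignores sampling parameters
)
for task_type, rows in demo.items():
    print(task_type, "->", rows[0]["input"])
```

In the notebook, the equivalent call would be `generate_for_task_types(bonito, unannotated_dataset, ["nli", "mcqa"], sampling_params)`, with `sampling_params` built from `SamplingParams` as in the cells above.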
