# Overview

Let's generate an instruction dataset from arbitrary documents. Here we will use Bonito. The workflow is shown below:

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/171/384/916/341/941/original/0518bdfdaf362c60.webp)

In [1]:
%%capture
!conda create -n bonito python=3.9 -y

In [2]:
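# `conda activate` needs an initialized interactive shell, and each `!` command runs in its own subshell,
# so this call fails and the later `pip install` commands target the notebook's default environment.
!conda activate bonito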

usage: conda [-h] [--no-plugins] [-V] COMMAND ...
conda: error: argument COMMAND: invalid choice: 'activate' (choose from 'clean', 'compare', 'config', 'create', 'info', 'init', 'install', 'list', 'notices', 'package', 'remove', 'uninstall', 'rename', 'run', 'search', 'update', 'upgrade', 'doctor', 'env')


In [3]:
!git clone https://github.com/BatsResearch/bonito.git
!pip install -U bonito/

Cloning into 'bonito'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 87 (delta 34), reused 24 (delta 24), pack-reused 45
Unpacking objects: 100% (87/87), 783.25 KiB | 8.33 MiB/s, done.


In [None]:
!ls

Obtaining file:///kaggle/working
ERROR: file:///kaggle/working does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

In [5]:
# https://github.com/vllm-project/vllm/issues/2747#issuecomment-2017133246
# !pip install vllm==0.3.3

In [6]:
# %capture
# !git clone https://github.com/BatsResearch/bonito.git
# !pip install -U bonito/

In [None]:
%%capture
# https://github.com/huggingface/datasets/issues/6753
!pip install datasets==2.17.0
!pip install PyMuPDF==1.24.0
!pip install spacy==3.7.4
!pip install huggingface-hub==0.22.1

In [None]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

# Loading data

In [None]:
import fitz

pdf_path = '/kaggle/input/pdf-for-data-generation/cssf12_552eng.pdf'

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:  # Iterate through each page
        text += page.get_text()  # Extract text and append it to the text variable
    return text

text = extract_text_from_pdf(pdf_path)  # Call the function with the path to your PDF

# Text to sentences

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the English language model

def split_into_sentences(text):
    doc = nlp(text)  # Process the text with SpaCy
    sentences = [sent.text.strip() for sent in doc.sents]  # Extract sentences and strip whitespace
    return sentences

sentences = split_into_sentences(text)  # Split the extracted text into sentences
print(len(sentences))

In [None]:
print(sentences[500])
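Sentence splitting on raw PDF text often leaves fragments such as page numbers or headings, which give the generator very little context. The sketch below shows one optional way to drop them before building the dataset. `filter_sentences` and its `min_words` threshold are hypothetical names and values chosen for illustration; they are not part of SpaCy or Bonito, and the right threshold depends on the document.

```python
def filter_sentences(sentences, min_words=5):
    """Normalize whitespace and drop fragments shorter than `min_words` words.

    Hypothetical helper: the threshold of 5 words is an assumption, not a tuned value.
    """
    cleaned = []
    for sentence in sentences:
        # Collapse line breaks and repeated spaces left over from PDF extraction.
        normalized = " ".join(sentence.split())
        if len(normalized.split()) >= min_words:
            cleaned.append(normalized)
    return cleaned


example = ["12", "The bank shall\nreport  all incidents promptly."]
print(filter_sentences(example))
```

If this step is used, `sentences = filter_sentences(sentences)` would run before the list is converted to a dataset.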

# Loading to Huggingface Dataset Format

In [None]:
from datasets import Dataset

# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)
dataset

# Generating the Synthetic Dataset

We use the Bonito library to generate a synthetic dataset for the "question generation" task. Bonito also supports a wide range of other task types; see the links at the end of this notebook.
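Bonito selects the task through a short code passed as `task_type`. The mapping below is a sketch of the short codes as listed in the Bonito README to the best of our knowledge; it should be checked against the installed version. `TASK_TYPES` and `describe_task` are hypothetical names for this notebook and are not part of the library.

```python
# Short code -> full task name, as we understand the Bonito README (assumption: verify
# against the version you install).
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def describe_task(short_code):
    """Return the full task name for a short code, or raise ValueError if unknown."""
    try:
        return TASK_TYPES[short_code]
    except KeyError:
        raise ValueError(f"Unknown task type: {short_code!r}") from None


print(describe_task("qg"))
```

Validating the code up front is cheap compared with discovering a typo after the model has been loaded.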

In [None]:
from bonito import Bonito
from vllm import SamplingParams

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1", dtype="float16")

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qg",
    sampling_params=sampling_params
)
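To make the `generate_tasks` call less of a black box, here is a rough sketch of what we believe happens around the model call: each context is wrapped in a prompt with special markers, and each generation is split into an instruction and a response. The marker strings (`<|tasktype|>`, `<|context|>`, `<|task|>`, `<|pipe|>`), the `{{context}}` placeholder, and the `input`/`output` field names reflect our reading of the Bonito source and should be treated as assumptions. `build_prompt` and `split_prediction` are hypothetical helpers, not library functions.

```python
def build_prompt(context, task_type="question generation"):
    """Wrap a context in the prompt layout Bonito appears to use (assumption)."""
    return (
        "<|tasktype|>\n" + task_type.strip()
        + "\n<|context|>\n" + context.strip()
        + "\n<|task|>\n"
    )


def split_prediction(prediction, context):
    """Split a raw generation into an input/output pair.

    Returns None when the separator is missing, mirroring the idea that
    malformed generations are discarded.
    """
    if "<|pipe|>" not in prediction:
        return None
    task_input, task_output = prediction.strip().split("<|pipe|>", 1)
    return {
        # The generated instruction may refer back to the context via a placeholder.
        "input": task_input.strip().replace("{{context}}", context.strip()),
        "output": task_output.strip(),
    }


context = "A bank must report incidents."
print(build_prompt(context))
print(split_prediction("Ask a question about: {{context}} <|pipe|> What must a bank report?", context))
```

Because generations without a separator are dropped in this sketch, the resulting dataset can contain fewer rows than the number of input sentences.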

In [None]:
import pandas as pd

df = pd.DataFrame(synthetic_dataset)
df.head()

# Pushing to Hub

In [None]:
synthetic_dataset.push_to_hub('aisuko/generate_dataset12_552')

# Acknowledgements

* https://arxiv.org/pdf/2402.18334.pdf
* https://medium.com/towards-data-science/how-to-generate-instruction-datasets-from-any-documents-for-llm-fine-tuning-abb319a05d91
* https://huggingface.co/BatsResearch/bonito-v1