# Overview

Let's generate an instruction-tuning dataset from an unannotated document using Bonito. The workflow is shown below:

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/171/384/916/341/941/original/0518bdfdaf362c60.webp)

In [1]:
%%capture
!git clone https://github.com/BatsResearch/bonito.git
!pip install -U bonito/

In [2]:
%%capture
# https://github.com/huggingface/datasets/issues/6753
!pip install datasets==2.17.0
!pip install PyMuPDF==1.24.0
!pip install spacy==3.7.4
!pip install huggingface-hub==0.22.1
!pip install vllm==0.3.3

In [3]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Loading data

In [4]:
import fitz

pdf_path = '/kaggle/input/pdf-for-data-generation/cssf12_552eng.pdf'

def extract_text_from_pdf(pdf_path):
    """Return the concatenated plain text of every page in the PDF."""
    with fitz.open(pdf_path) as doc:  # Close the file once extraction is done
        # Extract the text of each page and join the pages into one string
        return "".join(page.get_text() for page in doc)

text = extract_text_from_pdf(pdf_path)  # Call the function with the path to your PDF

# Text to sentences

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the English language model

def split_into_sentences(text):
    doc = nlp(text)  # Process the text with SpaCy
    sentences = [sent.text.strip() for sent in doc.sents]  # Extract sentences and strip whitespace
    return sentences

sentences = split_into_sentences(text)  # Split the extracted text into sentences
print(len(sentences))

1175


In [6]:
print(sentences[500])

The second line consists of support functions, such as the financial and 
accounting function, and especially the compliance and the risk control 
functions which control risks on an independent basis and support the business 
units in complying with the applicable policies and procedures.
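
The sentence printed above still carries the line breaks of the PDF layout, and short fragments such as page numbers or running headers tend to survive sentence splitting. Below is a minimal cleaning sketch, assuming that collapsing whitespace and dropping very short fragments is acceptable for this document. `clean_sentences` and the `min_words` threshold are hypothetical and not part of the original notebook; the second sample string is invented for illustration.

```python
import re

def clean_sentences(sentences, min_words=5):
    """Collapse internal whitespace and drop fragments shorter than min_words."""
    cleaned = []
    for sentence in sentences:
        # Replace any run of whitespace (including newlines) with one space.
        flat = re.sub(r"\s+", " ", sentence).strip()
        # Keep only sentences with enough words to carry real content.
        if len(flat.split()) >= min_words:
            cleaned.append(flat)
    return cleaned

sample = [
    # Shortened from the notebook output above, line break included.
    "The second line consists of support functions, such as the financial and \naccounting function.",
    # Invented fragment standing in for a page footer.
    "Page 12",
]
print(clean_sentences(sample))
```

In the notebook this would be applied as `sentences = clean_sentences(sentences)` before building the dataset; the right threshold depends on the document.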


# Loading to Huggingface Dataset Format

In [7]:
from datasets import Dataset

# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)
dataset

Dataset({
    features: ['sentence'],
    num_rows: 1175
})

# Generating the Synthetic Dataset

We use the Bonito library to generate a synthetic dataset for the "question generation" task (`task_type="qg"`). Bonito also supports a wide range of other task types; see the links at the end of this notebook.
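
For orientation, the short codes below are the task types listed in the Bonito README, to the best of our recollection. Treat the mapping as an assumption and verify it against the repository before relying on it. `describe_task` is a hypothetical helper, not part of Bonito.

```python
# Short task codes as listed in the Bonito README (assumed; verify upstream).
BONITO_TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}

def describe_task(code):
    """Return the full task name for a short code, or raise for unknown codes."""
    try:
        return BONITO_TASK_TYPES[code]
    except KeyError:
        raise ValueError(f"Unknown task type: {code!r}") from None

print(describe_task("qg"))
```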

In [8]:
from bonito import Bonito
from vllm import SamplingParams

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1", dtype="float16")

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qg",
    sampling_params=sampling_params
)

2024-03-28 05:17:05,467	INFO util.py:124 -- Outdated packages:
  ipywidgets==7.7.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 03-28 05:17:05 llm_engine.py:87] Initializing an LLM engine with config: model='BatsResearch/bonito-v1', tokenizer='BatsResearch/bonito-v1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


ImportError: /opt/conda/lib/python3.10/site-packages/vllm/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKSs
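
The run above stops with an `ImportError`: vLLM's compiled extension (`_C...so`) cannot resolve a symbol from PyTorch's `c10` library. An undefined-symbol error of this kind usually means the installed PyTorch differs from the version the vLLM wheel was built against, so `synthetic_dataset` is never created and the remaining cells did not run. Here is a small diagnostic sketch using only the standard library. `installed_versions` is a hypothetical helper, and the assumption that vllm 0.3.3 expects torch 2.1.2 should be checked against the vLLM release notes.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Compare the reported torch version with the one vllm was built against
# (assumed to be 2.1.2 for vllm 0.3.3; verify before pinning).
print(installed_versions(["torch", "vllm"]))
```

If the versions disagree, pinning a matching `torch` in the install cell and restarting the kernel is the likely fix.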

In [None]:
import pandas as pd

df = pd.DataFrame(synthetic_dataset)
df.head()
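
Once generation succeeds, each row should hold a generated instruction and its target. Below is a sketch of turning such rows into prompt/completion pairs for fine-tuning, assuming the columns are named `input` and `output`. That naming is our recollection of Bonito's output, so check `synthetic_dataset.column_names` first. `to_jsonl` is a hypothetical helper and the sample row is invented for illustration.

```python
import json

def to_jsonl(rows, input_key="input", output_key="output"):
    """Serialize rows as JSON Lines with prompt/completion fields."""
    lines = []
    for row in rows:
        record = {
            "prompt": row[input_key].strip(),
            "completion": row[output_key].strip(),
        }
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

# Invented example row; real rows would come from synthetic_dataset.
rows = [
    {
        "input": " Generate a question about the second line of defence. ",
        "output": "What does the second line consist of?",
    }
]
print(to_jsonl(rows))
```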

# Pushing to Hub

In [None]:
synthetic_dataset.push_to_hub('aisuko/generate_dataset12_552')

# Acknowledgements

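* Bonito paper: https://arxiv.org/pdf/2402.18334.pdf
* How to Generate Instruction Datasets from Any Documents for LLM Fine-Tuning: https://medium.com/towards-data-science/how-to-generate-instruction-datasets-from-any-documents-for-llm-fine-tuning-abb319a05d91
* Bonito model card: https://huggingface.co/BatsResearch/bonito-v1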