# Dataset Generator

We will use models from the Hugging Face Hub to generate text-based datasets, and Gradio to provide a user-friendly interface.

This implementation requires a GPU. If you do not have a GPU-based system, I recommend running this notebook on Google Colab with Google's free T4 GPU.

**Note:** Create a Hugging Face account and generate an access token. On Google Colab, add the token as `hf_token` to your Colab secrets and grant the notebook access to it. On a GPU-based system, add the token to the system environment (or a `.env` file) as `hf_token`.

Hugging Face models: https://huggingface.co/models

### Package Installations

In [None]:
!pip install -q torch bitsandbytes transformers sentencepiece accelerate gradio huggingface_hub python-dotenv

### Imports

In [None]:
import os
from dotenv import load_dotenv
from IPython.display import Markdown
import torch
import gc
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
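try:
    from google.colab import userdata  # only available on Google Colab
except ImportError:
    userdata = None  # running outside Colab; use the .env route instead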
from pprint import pprint
import gradio as gr

### Import Huggingface token

In [None]:
# For Google Colab users
hf_token = userdata.get('hf_token')
login(hf_token, add_to_git_credential=True)

# For PCs with a GPU: comment out the two Colab lines above and uncomment this block
# load_dotenv(override=True)
# hf_token = os.getenv('hf_token')
# if hf_token:
#     print("HuggingFace key exists.")
#     login(hf_token)
# else:
#     print("HuggingFace key not set.")
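
The two token-loading routes above could also be folded into a single helper. This is a minimal sketch, not part of the original notebook: `get_hf_token` is a hypothetical name, and it assumes the secret and the environment variable are both called `hf_token`, as in the note at the top.

```python
import os


def get_hf_token(env_var: str = "hf_token"):
    """Return the token from Colab secrets if available, else from the environment.

    `get_hf_token` is a hypothetical helper, not part of the original notebook.
    """
    try:
        from google.colab import userdata  # importable only on Google Colab
    except ImportError:
        # Not on Colab: fall back to the process environment
        return os.environ.get(env_var)
    return userdata.get(env_var)


token = get_hf_token()
print("HuggingFace key exists." if token else "HuggingFace key not set.")
```

If a token is found, it can then be passed to `login(token)` as in the cell above.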

### Define Models from HuggingFace

In [None]:
models = {
    "llama" : "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "mistral" : "mistralai/Mistral-7B-Instruct-v0.3",
    "qwen" : "Qwen/Qwen3-8B",
    "gemma": "google/gemma-3-1b-it",
    "huggingface": "HuggingFaceH4/zephyr-7b-beta"
}

### Function to generate prompt

In [None]:
def generate_prompt(topic, q1, a1, q2, a2, q3, a3, output_size):
    multi_shot_examples = [
        {"instruction": q1, "response": a1},
        {"instruction": q2, "response": a2},
        {"instruction": q3, "response": a3},
    ]
    messages = [
        {
            "role": "system", "content": f"You are a purely text-based dataset generator. You are going to generate text on the topic: {topic} Here are a few examples: {multi_shot_examples}. Generate exactly {output_size} outputs. The output format should be the same as the provided examples. Do not include the example prompts in the responses. Do not repeat the responses."
        },
        {
            "role": "user", "content": f"Generate a dataset for the topic: {topic}"
        }
    ]
    return messages

In [None]:
# Run one of the "Prompt example set" cells below first; they define the variables used here.
m = generate_prompt(default_topic, q_1, a_1, q_2, a_2, q_3, a_3, default_number_of_data)
pprint(m)
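
The prompt above embeds the examples using Python's default list formatting. One possible alternative, shown here only as a sketch, is to serialize each example as a JSON line so the model sees a stricter format to imitate. `format_examples` is a hypothetical helper, not part of the original notebook.

```python
import json


def format_examples(pairs):
    """Render (instruction, response) pairs as one JSON object per line.

    Hypothetical helper for illustration; the notebook itself interpolates
    the list of dicts directly into the system prompt.
    """
    return "\n".join(
        json.dumps({"instruction": q, "response": a}, ensure_ascii=False)
        for q, a in pairs
    )


examples = [
    ("How to reverse a string in Python?", "result = s[::-1]"),
    ("How do you calculate the factorial of a number?", "result = math.factorial(n)"),
]
formatted = format_examples(examples)
print(formatted)
```

The resulting string could be placed in the system prompt in place of `{multi_shot_examples}`.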

### Quantization Config

Quantizing the weights to 4 bits lets us load the model using much less GPU memory.

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
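
To see why 4-bit loading matters on a 16 GB card such as the T4, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under simplifying assumptions: it ignores activations, the KV cache, and the small overhead of quantization constants, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_8b = 8e9  # an 8-billion-parameter model, e.g. the "llama" or "qwen" entries

fp16_gb = weight_memory_gb(params_8b, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(params_8b, 4)    # 4-bit quantized weights

print(f"16-bit: ~{fp16_gb:.0f} GB, 4-bit: ~{nf4_gb:.0f} GB")
```

Under these assumptions, the 16-bit weights alone would roughly fill the card, while the 4-bit version leaves room for generation.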

### Function to define model and stream output
**Note:** The streamer prints tokens only when the cell is run in the Jupyter notebook; it does not stream output into the Gradio interface.

In [None]:
def Model(default_topic, q1, a1, q2, a2, q3, a3, default_number_of_data, model_id):
    # Build the chat messages
    messages = generate_prompt(default_topic, q1, a1, q2, a2, q3, a3, default_number_of_data)
    model_name = models.get(model_id)
    if model_name is None:
        raise ValueError(f"Unknown model key: {model_id}")
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # Tokenize the inputs; add_generation_prompt appends the assistant turn header
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer)
    # Load the model with 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
    # Generate output
    outputs = model.generate(inputs, max_new_tokens=2000, streamer=streamer)
    # Decode only the newly generated tokens, skipping the prompt
    decoded_output = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    # Free GPU memory so that another model can be loaded on the next call
    del model, inputs, tokenizer, outputs, streamer
    gc.collect()
    torch.cuda.empty_cache()
    return decoded_output.strip()
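
Regarding the note that the streamer does not reach Gradio: Gradio treats a generator function as a streaming handler and updates the output each time it yields. The sketch below shows only the accumulation pattern, using a stand-in list of chunks. In a real handler the chunks might come from `TextIteratorStreamer` in `transformers`, with `model.generate` running in a background thread; that wiring is an assumption and is not part of the original notebook. `accumulate_chunks` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def accumulate_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the full text generated so far each time a new chunk arrives."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text


# Stand-in for a real token stream (made-up data for illustration only)
fake_stream = ["Headline: ", "Example ", "text."]
partials = list(accumulate_chunks(fake_stream))
print(partials[-1])
```

A Gradio handler written as a generator would yield each partial string in turn, so the output panel grows as tokens arrive.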

### Prompt example set 1

In [None]:
default_topic = "Fake News vs. Real News. Goal: Generate examples of plausible real and fake headlines. Label them as real or fake, with fact based explanation."
default_number_of_data = 10
q_1 = "Headline: NASA Confirms Discovery of Liquid Water on Mars."
a_1 = "Label: Real. Explanation: This is a verified news story from NASA, first announced in 2015, based on evidence from satellite imaging."
# Note: no trailing commas here; a trailing comma would turn the string into a tuple
q_2 = "Headline: Scientists Reveal That Bananas Can Cure COVID-19."
a_2 = "Label: Fake. Explanation: There is no such scientific evidence. This is a typical example of misinformation."
q_3 = "Headline: Elon Musk to Launch First Tesla-Branded Country in the Pacific Ocean."
a_3 = "Label: Fake. Explanation: This headline sounds sensational and implausible. No credible news source has reported such an event; it's a satirical or fabricated claim."


### Prompt example set 2

In [None]:
default_topic = "Python programming questions."
default_number_of_data = 10
q_1 = "How to reverse a string in Python?"
a_1 = "result = ''.join(reversed(s)) or result = s[::-1]"
# Note: no trailing commas here; a trailing comma would turn the string into a tuple
q_2 = "How do you find the product of an array of integers except the integer itself?"
a_2 = "result = [math.prod(arr[:i] + arr[i+1:]) for i in range(len(arr))]"
q_3 = "How do you calculate the factorial of a number?"
a_3 = "result = math.factorial(n)"

In [None]:
Model(default_topic, q_1, a_1, q_2, a_2, q_3, a_3, 5, "huggingface")

### Gradio Interface

In [None]:
with gr.Blocks() as ui:
  with gr.Row():
    gr.Markdown("# Dataset Generator")
  with gr.Row():
    gr.Markdown("### This is a dataset generation app. Start by entering the topic of your dataset, then provide 3 examples showing the kind of data you want. The dataset will be produced in the same format as the provided examples. Choose any of the available models. Have fun exploring...")
  with gr.Row():
    with gr.Column(scale=1):
      topic = gr.Textbox(label= "Topic (5-10 words):", value = default_topic)
      with gr.Row():
        q1 = gr.Textbox(label= "Instruction 1:", value=q_1)
      with gr.Row():
        a1 = gr.Textbox(label= "Response 1:", value=a_1)
      with gr.Row():
        q2 = gr.Textbox(label= "Instruction 2:", value=q_2)
      with gr.Row():
        a2 = gr.Textbox(label= "Response 2:", value=a_2)
      with gr.Row():
        q3 = gr.Textbox(label= "Instruction 3:", value=q_3)
      with gr.Row():
        a3 = gr.Textbox(label= "Response 3:", value=a_3)
      with gr.Row():
        default_number_of_data = gr.Dropdown([1,2,3,4,5,6,7,8,9,10], label="Number of examples to generate:", value=10)
      with gr.Row():
        model = gr.Dropdown(["mistral", "llama", "qwen", "gemma", "huggingface"], label="Select model", value="gemma")
      with gr.Row():
        generate_button = gr.Button("Generate Dataset")
      with gr.Row():
        clear_button= gr.Button("Clear")
    with gr.Column(scale=1):
      dataset_output = gr.Markdown("### Generated Dataset will appear here...", height=1000)

  generate_button.click(Model, inputs=[topic, q1, a1, q2, a2, q3, a3, default_number_of_data, model], outputs=dataset_output)

  # Reset the output panel to its placeholder text
  clear_button.click(lambda: "### Generated Dataset will appear here...", inputs=None, outputs=dataset_output)

ui.launch(inbrowser=True, share=True)


