# 🧠 Nuzantara Synthetic Data Generator

This notebook generates high-quality synthetic Question-Answer pairs from Indonesian legal text using open-source LLMs (Llama 3 / Mistral).

**Goal**: Create a "Gold Dataset" to train and test Nuzantara.

### Steps:
1.  **Setup**: Install dependencies.
2.  **Load Data**: Paste your legal text or load from file.
3.  **Generate**: The model writes question-answer pairs in the voice of several user personas.
4.  **Save**: Export as JSON for Nuzantara.
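
For step 2, a long statute usually needs to be cut into smaller pieces before it fits in a prompt. The sketch below is one possible way to do that, assuming the text marks each article with a line that starts with `Pasal <number>`. The helper names `load_legal_text` and `split_into_articles` are hypothetical and are not part of Nuzantara, and the second article in `sample` is placeholder text, not real statute wording.

```python
import re
from pathlib import Path


def load_legal_text(path):
    """Read a UTF-8 text file; the path is supplied by the caller."""
    return Path(path).read_text(encoding="utf-8")


def split_into_articles(legal_text):
    """Split raw legal text into chunks, one per 'Pasal <number>' heading."""
    # Zero-width split just before each line that starts with 'Pasal <number>'.
    # Splitting on empty matches needs Python 3.7 or newer.
    parts = re.split(r"(?m)^(?=[ \t]*Pasal[ \t]+\d+)", legal_text)
    return [part.strip() for part in parts if part.strip()]


sample = """Pasal 4
(1) Orang Asing yang berada di Wilayah Indonesia wajib memiliki Izin Tinggal.
Pasal 5
(placeholder line, not real statute text)"""

for chunk in split_into_articles(sample):
    print(chunk.splitlines()[0])
```

A wrapped line that happens to begin with "Pasal" would also trigger a split, so the chunks are worth a quick visual check before generation.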

In [None]:
# @title 1. Install Dependencies
!pip install -q -U torch transformers accelerate bitsandbytes
!pip install -q -U langchain langchain-community

In [None]:
# @title 2. Load Model (Llama 3 8B or Mistral 7B)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

# Pre-quantized 4-bit Llama 3 8B Instruct: quick to load and runs on a free Colab GPU
MODEL_ID = "unsloth/llama-3-8b-Instruct-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    # 4-bit loading via bitsandbytes
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # temperature and top_p only take effect when sampling is on
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

print("✅ Model Loaded Successfully!")

In [None]:
# @title 3. Define the Generator Function


def generate_synthetic_data(legal_text, num_pairs=5):
    prompt = f"""
    You are an expert legal AI trainer.
    Your task is to generate {num_pairs} diverse Question-Answer pairs based strictly on the provided legal text.
    
    **Scenarios to simulate:**
    1. A confused foreigner (simple English).
    2. A professional lawyer (formal Indonesian).
    3. A digital nomad (casual slang).
    
    **Format:**
    Return a JSON list of objects with 'question', 'answer', 'context_used', and 'persona'.
    
    **Legal Text:**
    {legal_text}
    
    **Output (JSON only):**
    """

    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs only valid JSON.",
        },
        {"role": "user", "content": prompt},
    ]

    prompt_formatted = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # return_full_text=False returns only the newly generated text, without the prompt
    outputs = pipe(prompt_formatted, return_full_text=False)
    return outputs[0]["generated_text"].strip()


print("✅ Generator Function Ready!")
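
`generate_synthetic_data` returns a raw string, and instruction-tuned models often wrap the JSON in extra prose or Markdown fences. The sketch below shows one way to recover the list, assuming the output contains a single JSON array. `parse_qa_pairs` is a hypothetical helper, not part of the original notebook, and the `raw` string is made-up sample output.

```python
import json


def parse_qa_pairs(raw_output):
    """Return the JSON list found in the model output, or [] if none parses."""
    # Take everything from the first '[' to the last ']' so that
    # surrounding prose or code fences are ignored.
    start = raw_output.find("[")
    end = raw_output.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(raw_output[start : end + 1])
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []


# Made-up model output, for illustration only.
raw = (
    "Here is the JSON:\n"
    '[{"question": "Q?", "answer": "A.", '
    '"context_used": "Pasal 4 (1)", "persona": "lawyer"}]\n'
    "Done."
)
pairs = parse_qa_pairs(raw)
print(len(pairs), pairs[0]["persona"])
```

Returning an empty list on failure keeps a batch run going. Logging the raw string for failed parses would make it easier to tune the prompt later.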

In [None]:
# @title 4. Run Generation (Paste Text Here)

legal_text_input = """
Pasal 4
(1) Orang Asing yang berada di Wilayah Indonesia wajib memiliki Izin Tinggal.
(2) Izin Tinggal sebagaimana dimaksud pada ayat (1) diberikan kepada Orang Asing sesuai dengan Visa yang dimilikinya.
"""

result = generate_synthetic_data(legal_text_input, num_pairs=3)
print(result)
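
The cell above only prints the result, while step 4 of the outline (Save) calls for a JSON export. A minimal sketch of that step follows, assuming the generated text has already been parsed into a list of dicts. The helper names and the default file name `gold_dataset.json` are hypothetical, and the schema Nuzantara actually expects may differ.

```python
import json

# The four keys requested in the generation prompt.
REQUIRED_KEYS = {"question", "answer", "context_used", "persona"}


def keep_valid_pairs(pairs):
    """Keep only dicts that contain every key the prompt asks for."""
    return [p for p in pairs if isinstance(p, dict) and REQUIRED_KEYS <= p.keys()]


def save_pairs(pairs, path="gold_dataset.json"):
    """Write the pairs as UTF-8 JSON and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Indonesian text readable in the file.
        json.dump(pairs, f, ensure_ascii=False, indent=2)
    return path
```

Typical use would be `save_pairs(keep_valid_pairs(parsed_pairs))`, where `parsed_pairs` stands for the parsed model output.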