# Generate synthetic data

## Why do we need synthetic data?
1. Data Scarcity:
   - Limited availability of Darija (Moroccan Arabic) text data
   - Most available data comes from social media or YouTube comments, which tend to:
     * Be short
     * Use informal language
     * Contain spelling errors and non-standard writing

2. Quality Control:
   - Synthetic data allows us to:
     * Generate longer, more structured text
     * Control the language style and formality
     * Ensure grammatical correctness
     * Create diverse topics and contexts

3. Training Benefits:
   - Larger training dataset
   - Better coverage of language patterns
   - More consistent quality
   - Ability to generate specific types of content

4. Cost Efficiency:
   - Using GCP credits for inference
   - More cost-effective than manual data collection
   - Scalable solution for generating large datasets


## Prompt best practices

See the examples in Google Cloud's prompt engineering guide: https://cloud.google.com/discover/what-is-prompt-engineering#use-cases-and-examples-of-prompt-engineering

Example Prompt Structure:

In [None]:
"""
Context: [Domain/Topic]
Audience: [Target readers]
Style: [Formal/Informal]
Length: [Word count]
Language: Darija (Moroccan Arabic)
Format: [Structure requirements]

Examples:
[2-3 example inputs and outputs]

Constraints:
- Avoid [specific issues]
- Include [required elements]
- Maintain [style guidelines]
"""

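The structure above can also be filled in programmatically, so that each field is set explicitly for every request. The following is a minimal sketch: `build_prompt` and its parameter names are hypothetical helpers introduced here for illustration, and the field values in the usage example are placeholders.

```python
def build_prompt(context, audience, style, length, fmt, examples, constraints):
    """Fill the prompt structure with concrete values (hypothetical helper)."""
    # Each example is an (input, output) pair rendered on two lines
    example_lines = "\n".join(
        f"Input: {inp}\nOutput: {out}" for inp, out in examples
    )
    # Each constraint becomes one bullet line
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Context: {context}\n"
        f"Audience: {audience}\n"
        f"Style: {style}\n"
        f"Length: {length}\n"
        "Language: Darija (Moroccan Arabic)\n"
        f"Format: {fmt}\n\n"
        f"Examples:\n{example_lines}\n\n"
        f"Constraints:\n{constraint_lines}\n"
    )


prompt = build_prompt(
    context="Daily life",
    audience="General readers",
    style="Informal",
    length="100-150 words",
    fmt='JSON with a single key "paragraph"',
    examples=[("Write about going to the market", '{"paragraph": "[Example in Darija]"}')],
    constraints=["Avoid formal Arabic", "Include common expressions"],
)
print(prompt)
```

Keeping the fields as function arguments makes it easy to vary one of them (for example the context) while holding the rest of the prompt fixed.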

## Generate data using the Google Gemini model

➪➪➪ **In this section we use the free tier of the Gemini model from [AI Studio](https://aistudio.google.com/). You can also use models paid for with your GCP credits.**

In [None]:
import json

from google import genai
from google.colab import userdata
from google.genai import types

# Read the API key from Colab's secrets manager
token = userdata.get('GEMINI_API_KEY')

In [None]:
# Send a list of chat messages to a Google AI model and return its answer
def generate(messages):
    client = genai.Client(api_key=token)

    model = "gemini-2.0-flash"
    # Convert {"role", "content"} dicts into the SDK's Content objects
    contents = [
        types.Content(
            role=message["role"],
            parts=[types.Part(text=message["content"])],
        )
        for message in messages
    ]
    # Request JSON output; the sampling parameters below can be tuned
    generate_content_config = types.GenerateContentConfig(
        response_mime_type="application/json",
        temperature=0.7,
        max_output_tokens=1024,
        top_p=0.8,
        top_k=40,
    )
    output = client.models.generate_content(
        model=model,
        contents=contents,
        config=generate_content_config,
    ).candidates[0].content.parts[0].text
    # Parse the JSON output into a Python dictionary
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        print("Warning: Model did not return valid JSON.")
        return output  # Fall back to the raw text if JSON parsing fails

In [None]:
# Example prompt template modified to request JSON output
prompt_template = """
Generate a paragraph in Moroccan Arabic Darija (arabic letters) about daily life.
Style: Informal
Length: 100-150 words
Topic: Random daily activities

Format: JSON with a single key "paragraph" containing the generated text.

Example:
Input: Write about going to the market
Output:
{
  "paragraph": "[Example in Darija]"
}


Constraints:
- Use natural, conversational Darija
- Include common expressions
- Avoid formal Arabic
"""

messages=[
  {
      "role": "user",
      "content": prompt_template
  }
]

In [None]:
from pprint import pprint  # pretty-printer for nested structures
# Call generate and print the output (a dictionary if the JSON was parsed)
json_output = generate(messages)
pprint(json_output)

{'paragraph': 'اليوم فقت معطل شوية، شي تسعود هكاك. دغيا دغيا غسلت وجهي و شربت '
              'قهوة كحلة باش نصحصح. من بعد مشيت للسوق باش نتقدى الخضرة و '
              'الفاكية، ضروري خاصني شي طويجين ديال الخضرة للعشا. تلاقيت مع '
              'واحد صاحبي تما، بقينا كنهضرو شوية على الماتش ديال البارح. ملي '
              'رجعت للدار، بديت كنوجد الغدا، درت غير شي حاجة خفيفة. مع العشية '
              'عيطت على ماما، سولتها على ختي، توحشتها بزاف. بالليل تفرجت فشي '
              'فيلم مع خوتي و نعسنا. ياك ما طولت عليكم؟ ههه.'}


## Generate many samples

In [None]:
from datetime import datetime
from tqdm import tqdm
from time import sleep

generated_data = []
num_samples = 10
SLEEP_TIME = 60
SLEEP_NUM_SAMPLES = 10

for i in tqdm(range(num_samples)):
    # Pause before every SLEEP_NUM_SAMPLES-th request, then still generate that sample
    if (i + 1) % SLEEP_NUM_SAMPLES == 0:
        print(f"Sleeping for {SLEEP_TIME} seconds...")
        sleep(SLEEP_TIME)

    messages = [{"role": "user", "content": prompt_template}]
    try:
        # Generate one sample
        response = generate(messages)

        # Store the text together with generation metadata
        generated_data.append({
            "text": response["paragraph"],
            "metadata": {
                "generated_at": datetime.now().isoformat(),
                "sample_id": i + 1,
            },
        })
    except Exception as e:
        # A failed or malformed response is reported and skipped
        print(f"Error in sample {i+1}: {e}")
        continue

 90%|█████████ | 9/10 [00:22<00:02,  2.49s/it]

Sleeping for 60 seconds...


100%|██████████| 10/10 [01:22<00:00,  8.22s/it]
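Because `generate` falls back to returning raw text when the model output is not valid JSON, it can help to validate each response before storing it. The sketch below is one possible check, not part of the original pipeline: `extract_paragraph`, the `min_words` threshold, and the 80% Arabic-script ratio are all assumptions chosen for illustration.

```python
def extract_paragraph(response, min_words=20, min_arabic_ratio=0.8):
    """Return the paragraph text if the response looks usable, else None."""
    # generate() may return a raw string when JSON parsing fails
    if not isinstance(response, dict):
        return None
    text = response.get("paragraph")
    if not isinstance(text, str):
        return None
    text = text.strip()
    # Reject very short outputs (threshold is an arbitrary assumption)
    if len(text.split()) < min_words:
        return None
    # The prompt asks for Arabic letters, so require that most letters
    # fall in the basic Arabic Unicode block (U+0600 to U+06FF)
    letters = [ch for ch in text if ch.isalpha()]
    arabic = [ch for ch in letters if "\u0600" <= ch <= "\u06FF"]
    if not letters or len(arabic) / len(letters) < min_arabic_ratio:
        return None
    return text


# Usage with made-up responses
print(extract_paragraph("not json") is None)
print(extract_paragraph({"paragraph": "too short"}) is None)
```

In the generation loop, a sample would then be appended only when `extract_paragraph(response)` returns a string.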


In [None]:
# Convert the generated data to a pandas DataFrame
import pandas as pd
df = pd.DataFrame(generated_data)
df.head()

| | text | metadata |
|---|---|---|
| 0 | اليوما فقت معطل، شي عادة عندي. دغيا غسلت وجهي ... | {'generated_at': '2025-05-23T14:53:38.656600',... |
| 1 | فالصباح، كانفيق معطل شوية، شي مرة مع تسعود شي ... | {'generated_at': '2025-05-23T14:53:41.050817',... |
| 2 | فالصباح، كانفيق معطل شوية، شي مرة مع 10 شي مرة... | {'generated_at': '2025-05-23T14:53:43.190018',... |
| 3 | اليوم فقت معطل شوية، شي تسعود هكاك. دغيا دغيا ... | {'generated_at': '2025-05-23T14:53:45.815748',... |
| 4 | اليوما فقت معطل شوية، شي تسعود هكاك. دغيا غسلت... | {'generated_at': '2025-05-23T14:53:48.353040',... |
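The preview shows several samples opening with nearly the same sentence, which is expected when every request uses the identical prompt. As a first cleanup step, exact duplicates can be dropped after normalizing whitespace. This is a minimal sketch with hypothetical helper names (`normalize`, `deduplicate`); it does not catch near-duplicates, which would need fuzzy matching or more varied prompts.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(samples):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        key = normalize(sample["text"])
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


# Demo records in the same shape as generated_data (texts are illustrative)
demo = [
    {"text": "اليوم فقت معطل شوية"},
    {"text": "اليوم  فقت معطل شوية "},
    {"text": "فالصباح، كانفيق معطل شوية"},
]
unique_demo = deduplicate(demo)
print(len(unique_demo))
```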


## Push data to Hugging Face Hub

In [None]:
# Install required packages
!uv pip install datasets huggingface_hub -q

In [None]:
from huggingface_hub import login
login(token="")  # paste your Hugging Face access token here

In [None]:
from datasets import Dataset

In [None]:
dataset=Dataset.from_list(generated_data)
dataset

Dataset({
    features: ['text', 'metadata'],
    num_rows: 9
})

In [None]:
dataset.push_to_hub("username/dataset_title")

## Challenge

Try to translate 1000 **random** samples (not the first 1000) from [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) into Moroccan Arabic. Make sure the quality is good on a small batch before translating everything and pushing it to the Hub.
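As a starting hint for the random selection, the sketch below picks distinct row indices with a fixed seed so the subset is reproducible. `pick_random_indices` is a hypothetical helper, and `dataset_size=50_000` is a placeholder rather than the real size of TinyStories; in practice it would be the length of the loaded split.

```python
import random


def pick_random_indices(dataset_size, k=1000, seed=42):
    """Choose k distinct row indices uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), k))


# Placeholder size; replace with len(ds) for the loaded split
indices = pick_random_indices(dataset_size=50_000, k=1000)
print(len(indices), indices[:5])
```

With the `datasets` library, the chosen rows could then be taken with `ds.select(indices)`; calling `ds.shuffle(seed=42).select(range(1000))` is another way to get a random subset.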