# Multi-Turn Conversation Dataset Generator

## Description
This notebook generates a multi-turn conversation dataset for training an AI to act as a career interviewer on the Diploy platform.

## Initial Setup

### 1. Install Dependencies
```bash
pip install openai pandas openpyxl tqdm python-dotenv
```

### 2. Set API Key
**For security, do NOT hardcode the API key!**

#### Option A: Environment Variable (Recommended)
```bash
export OPENAI_API_KEY='your-api-key-here'
```

#### Option B: .env File
Create a `.env` file in the root folder:
```
OPENAI_API_KEY=your-api-key-here
```
Then load it with:
```python
from dotenv import load_dotenv
load_dotenv()
```

### 3. Prepare the Data
Make sure the Excel file `datasetmultiturncreateV2.xlsx` has the following columns:
- Jenjang Pendidikan
- Jurusan
- Pelatihan
- Sertifikasi
- Pekerjaan Saat Ini
- Pengalaman Kerja
- Keterampilan
- Area_Fungsi
- Level
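
A quick check along these lines can confirm that a sheet has the expected columns before any API calls are made. This is only a minimal sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names that do not appear elsewhere in this notebook, and the list simply mirrors the bullets above.

```python
# Hypothetical helper: the names below are illustrative, not part of the notebook.
REQUIRED_COLUMNS = [
    "Jenjang Pendidikan",
    "Jurusan",
    "Pelatihan",
    "Sertifikasi",
    "Pekerjaan Saat Ini",
    "Pengalaman Kerja",
    "Keterampilan",
    "Area_Fungsi",
    "Level",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns`, in order."""
    # Strip whitespace so a header such as "Level " still counts as present.
    present = {str(c).strip() for c in columns}
    return [name for name in required if name not in present]


# With pandas (assumed installed) this could be called as:
#   missing_columns(pd.read_excel(path).columns)
example = ["Jurusan", "Keterampilan", "Area_Fungsi", "Level "]
print(missing_columns(example))
```

The `extract_fields` function later in this notebook also accepts alternate names such as `Nama_Pelatihan` and `Level_Okupasi`, so the list would need adapting to the sheet being checked.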

## Output Format
A JSONL file in which each line is one object of the following form (pretty-printed here for readability):
```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```
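
To inspect a generated file, the records can be read back one line at a time. The sketch below assumes one JSON object per line, as in the format above; `parse_jsonl` and `roles` are hypothetical helper names, and the sample record is made up for illustration.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of records, skipping blank lines."""
    # Split on "\n" only, since message content may hold other Unicode line breaks.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def roles(record):
    """Return the sequence of roles in one record's `messages` list."""
    return [m["role"] for m in record["messages"]]


# Made-up sample record; a real file would be read with open(path, encoding="utf-8").
sample = (
    '{"messages": [{"role": "system", "content": "s"}, '
    '{"role": "user", "content": "u"}, '
    '{"role": "assistant", "content": "a"}]}\n'
)
records = parse_jsonl(sample)
print(roles(records[0]))
```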

## Conversation Modes
- **FAST_DIRECT**: Immediate recommendation
- **FAST_SHORT**: 1 question → recommendation
- **MEDIUM**: 4-6 turns with 1-2 questions
- **LONG**: 8-10 turns with 3-5 questions
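
The turn counts above could be checked after generation with something like the sketch below. It rests on assumptions that the list does not state: a "turn" is taken to mean one non-system message, FAST_DIRECT is taken to be 2 turns, and FAST_SHORT is taken to be 4 turns. `TURN_RANGES` and `turns_in_range` are hypothetical names.

```python
# Assumed ranges of non-system messages per mode (inclusive); not from the notebook.
TURN_RANGES = {
    "fast_direct": (2, 2),  # assumed: user profile, then recommendation
    "fast_short": (4, 4),   # assumed: profile, 1 question, answer, recommendation
    "medium": (4, 6),
    "long": (8, 10),
}


def turns_in_range(messages, mode):
    """Return True if the non-system message count fits the range assumed for `mode`."""
    low, high = TURN_RANGES[mode]
    count = sum(1 for m in messages if m.get("role") != "system")
    return low <= count <= high


# Made-up conversation with two non-system messages.
conv = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
]
print(turns_in_range(conv, "fast_direct"), turns_in_range(conv, "medium"))
```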

In [8]:
import os
from pathlib import Path

# =========================
# PATH CONFIG
# =========================
# For Google Colab, uncomment the following lines:
# from google.colab import drive
# drive.mount('/content/drive')
# DATASET_DIR = "/content/drive/MyDrive/Colab Notebooks"

DATASET_DIR = Path.cwd().parent / "Flagged_500_Per_Class"
DATASET_DIR = str(DATASET_DIR)

print(f"Dataset directory: {DATASET_DIR}")
print(f"Directory exists: {os.path.exists(DATASET_DIR)}")

# List files to verify the directory contents
if os.path.exists(DATASET_DIR):
    files = os.listdir(DATASET_DIR)
    excel_files = [f for f in files if f.endswith('.xlsx')]
    print(f"Found {len(excel_files)} Excel files")
    if excel_files:
        print(f"   First few: {excel_files[:3]}")
else:
    print("WARNING: Directory not found!")

Dataset directory: /home/wildanaziz/dtp-data-pipeline/Pipeline Multiturn/Flagged_500_Per_Class
Directory exists: True
Found 46 Excel files
   First few: ['Pengembangan_Produk_Digital_2.xlsx', 'Tata_Kelola_Teknologi_Informasi_9.xlsx', 'Sains_Data_Kecerdasan_Artifisial_8.xlsx']


## Quick Setup: Set API Key

**Choose one of the following methods:**

### Method 1: Directly in the Notebook (Fastest)
Run this cell first, then run the next cell. The key is stored in the notebook this way, so remove it before sharing or committing the file:
```python
import os
os.environ['OPENAI_API_KEY'] = 'sk-proj-your-actual-key-here'  # Replace with your key
```

### Method 2: Create a .env File
Create a `.env` file in the folder `/home/wildanaziz/dtp-data-pipeline/Pipeline Multiturn/script/` containing:
```
OPENAI_API_KEY=sk-proj-your-actual-key-here
```

### Method 3: Terminal (Persistent)
```bash
export OPENAI_API_KEY='sk-proj-your-actual-key-here'
```
Then restart the notebook kernel.
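
If none of these fit, the key can also be requested interactively so that it never appears in a saved cell. The following is a minimal sketch using the standard-library `getpass`; `resolve_api_key` is a hypothetical helper that is not part of this notebook, and its `env` and `prompt` parameters exist only to make it easy to test.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the key from `env`, or ask for it interactively if it is unset."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed text, so the key is not echoed into cell output.
        key = prompt("OPENAI_API_KEY: ").strip()
    if not key:
        raise ValueError("OPENAI_API_KEY is empty")
    return key
```

In a notebook cell, `os.environ['OPENAI_API_KEY'] = resolve_api_key()` would then set the variable for the current kernel session only.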

In [None]:
import pandas as pd
import json
import asyncio
import random
from openai import AsyncOpenAI, APIError, RateLimitError
import time
import os
from tqdm.auto import tqdm
from datetime import datetime
from pathlib import Path
import glob
import re

# Load environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()  # Load from the .env file
    print(".env file loaded successfully")
except ImportError:
    print("python-dotenv tidak terinstall, gunakan environment variable manual")
except Exception as e:
    print(f"Error loading .env: {e}")

# config env
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")

if not OPENAI_API_KEY:
    env_file = Path.cwd() / ".env"
    if env_file.exists():
        with open(env_file, 'r') as f:
            for line in f:
                if line.startswith('OPENAI_API_KEY'):
                    # Split on the first '=' only, so a key containing '=' is not truncated
                    OPENAI_API_KEY = line.split('=', 1)[1].strip().strip('"').strip("'")
                    print("API Key loaded from .env file manually")
                    break
    
    if not OPENAI_API_KEY:
        print("\n" + "="*60)
        print("OPENAI_API_KEY TIDAK DITEMUKAN!")
        print("="*60)
        print("\nCara Set API Key:")
        print("\nOpsi 1: Buat file .env di folder ini:")
        print(f"   Location: {Path.cwd()}")
        print("   Isi file .env:")
        print('   OPENAI_API_KEY=sk-proj-your-actual-key-here')
        print("\nOpsi 2: Set di notebook (temporary):")
        print("   Tambahkan cell baru dengan:")
        print("   import os")
        print("   os.environ['OPENAI_API_KEY'] = 'sk-proj-your-actual-key-here'")
        print("\nOpsi 3: Set di terminal (persistent):")
        print("   export OPENAI_API_KEY='sk-proj-your-actual-key-here'")
        print("   Lalu restart kernel notebook")
        print("\n" + "="*60 + "\n")
        raise ValueError("OPENAI_API_KEY tidak ditemukan! Ikuti instruksi di atas.")

print(f"API Key: {OPENAI_API_KEY[:20]}... (truncated for security)")

client = AsyncOpenAI(
    api_key=OPENAI_API_KEY,
    base_url="https://openrouter.ai/api/v1"
)

# Keep the batch size consistent
BATCH_SIZE = 10
MAX_TOKENS = 1400
TEMPERATURE = 0.7

RETRY_LIMIT = 5
RETRY_DELAY = 3  # seconds
CONCURRENT_REQUESTS = 5

# Choose a model from the OpenRouter model list; make sure it is a GPT model
MODEL_NAME = "openai/gpt-5.1-chat"

# output dir config
OUTPUT_BASE_DIR = Path.cwd().parent / "MultiturnDatasetOutput"  

print(f"Output akan disimpan di: {OUTPUT_BASE_DIR}")

# SYSTEM PROMPT (LOCKED)
SYSTEM_PROMPT = (
    "Anda adalah interviewer dari platform talenta digital Diploy. "
    "Tugas Anda adalah menggali informasi pendidikan, pelatihan, sertifikasi, "
    "pengalaman kerja, dan keterampilan talenta sesuai data yang tersedia. "
    "Gunakan bahasa profesional & natural."
)

# GLOBAL RULES
GLOBAL_RULES = f"""
ATURAN FORMAT OUTPUT
====================
Output WAJIB berupa ARRAY JSON VALID dan HARUS dimulai dengan:

  {{
    "role": "system",
    "content": "{SYSTEM_PROMPT}"
  }}

TIDAK BOLEH mengubah teks system prompt tersebut.

FORMAT WAJIB:
[
  {{"role": "system", "content": "{SYSTEM_PROMPT}"}},
  {{"role": "user", "content": "..."}},
  {{"role": "assistant", "content": "..."}},
  {{"role": "user", "content": "..."}},
  {{"role": "assistant", "content": "..."}}
]

PENTING:
- Turn pertama JSON = EXACT system prompt di atas.
- Tidak boleh ada kalimat tambahan sebelum ARRAY JSON.
- Tidak boleh ada JSON di dalam string.
- Seluruh user turn setelah intro WAJIB dihasilkan model.
- User turn tidak boleh invent data baru.
- Assistant hanya boleh bertanya berdasarkan field yang ADA dalam Excel.
- **TURN TERAKHIR ASSISTANT HARUS BERISI REKOMENDASI AKHIR (TIDAK ADA PERTANYAAN LAGI):**
    - Format: "Berdasarkan informasi Anda, rekomendasi yang sesuai adalah: Area Fungsi: [value], Level: [value]."
    - Level HARUS integer (1–9), tanpa desimal.
    - Jika tidak ada data level, gunakan nilai hidden yang diberikan.
    - Jangan tambahkan pertanyaan atau kalimat lain setelah rekomendasi.

MODE FAST / MEDIUM / LONG
=========================
FAST_DIRECT:
- system
- user membagikan data profil
- assistant memberi rekomendasi

FAST_SHORT:
- assistant bertanya 1 pertanyaan
- user menjawab
- assistant memberi rekomendasi

MEDIUM:
- 4–6 turn total
- 1–2 pertanyaan
- rekomendasi terakhir

LONG:
- 8–10 turn total
- 3–5 pertanyaan
- rekomendasi terakhir

- Untuk semua mode FAST_DIRECT,FAST_SHORT, MEDIUM, LONG: Pastikan turn terakhir assistant adalah rekomendasi area fungsi dan level, WAJIB tidak ada pertanyaan tambahan.
- Jika data tidak cukup, gunakan fallback: "Anda belum dapat dipetakan pada area fungsi apapun", Jangan ada pertanyaan lagi diakhir.
- Gunakan parameter dan format tetap untuk menjaga konsistensi output.
"""

# MODE RULES
MODE_RULES = {
    "fast_direct": "FAST DIRECT MODE",
    "fast_short": "FAST SHORT-QA MODE",
    "medium": "MEDIUM MODE",
    "long": "LONG MODE",
}

FAST_VARIANTS = ["fast_direct", "fast_short"]

# Normalization
MISSING = {"", "-", "–", "—", "none", "nan", "n/a", "null", "tidak ada"}

def normalize(v):
    """Normalize empty/missing values to None."""
    if v is None or pd.isna(v):
        return None
    s = str(v).strip()
    return None if s.lower() in MISSING else s

# Extract fields
def extract_fields(row):
    """Extract public fields (for the conversation) and hidden fields (for validation)."""
    pub_map = {
        "Jenjang_Pendidikan": "Pendidikan",  
        "Jurusan": "Jurusan",
        "Pelatihan": "Pelatihan",
        "Nama_Pelatihan": "Pelatihan",  
        "Bidang_Pelatihan": "Bidang Pelatihan",  
        "Sertifikasi": "Sertifikasi",
        "Pekerjaan Saat Ini": "Pekerjaan",
        "Posisi_Pekerjaan": "Pekerjaan",  
        "Pengalaman Kerja": "Pengalaman Kerja",
        "Deskripsi_tugas_dan_tanggung_jawab": "Deskripsi Pekerjaan",  
        "Lama_Bekerja": "Lama Bekerja",  
        "Keterampilan": "Keterampilan",
    }

    hid_map = {
        "Area_Fungsi": "Area Fungsi", 
        "Level": "Level",
        "Level_Okupasi": "Level"
    }

    public, hidden = {}, {}

    for c, k in pub_map.items():
        v = normalize(row.get(c))
        if v:
            public[k] = v

    for c, k in hid_map.items():
        v = normalize(row.get(c))
        if v:
            if k == "Level":
                try:
                    v = str(int(float(v)))
                except (ValueError, OverflowError):
                    pass  # keep a non-numeric level value as-is
            hidden[k] = v

    return public, hidden

# Build user intro
def make_user_intro(public):
    """Generate the user introduction from the profile data."""
    seg = []

    if "Pendidikan" in public and "Jurusan" in public:
        seg.append(f"Saya lulusan {public['Pendidikan']} jurusan {public['Jurusan']}.")
    if "Bidang Pelatihan" in public and "Pelatihan" in public:
        seg.append(f"Saya pernah mengikuti pelatihan {public['Pelatihan']} di bidang {public['Bidang Pelatihan']}.")
    elif "Pelatihan" in public:
        seg.append(f"Saya pernah mengikuti pelatihan {public['Pelatihan']}.")

    if "Sertifikasi" in public:
        seg.append(f"Saya memiliki sertifikasi {public['Sertifikasi']}.")

    if "Pekerjaan" in public:
        seg.append(f"Saat ini saya bekerja sebagai {public['Pekerjaan']}.")

    if "Lama Bekerja" in public:
        seg.append(f"Saya memiliki pengalaman kerja selama {public['Lama Bekerja']}.")
    elif "Pengalaman Kerja" in public:
        seg.append(f"Pengalaman kerja saya adalah {public['Pengalaman Kerja']}.")

    if "Deskripsi Pekerjaan" in public:
        desc = public['Deskripsi Pekerjaan']
        if len(desc) > 150:
            desc = desc[:150] + "..."
        seg.append(f"Tanggung jawab saya meliputi: {desc}")

    if "Keterampilan" in public:
        seg.append(f"Saya memiliki keterampilan {public['Keterampilan']}.")

    random.shuffle(seg)

    if not seg:
        return "Halo, saya ingin mengikuti asesmen karier."

    return "Halo, saya ingin mengikuti asesmen karier. " + " ".join(seg)

# Build prompt
def build_prompt(row, mode_key):
    """Generate the prompt for the GPT model."""
    public, hidden = extract_fields(row)
    intro = make_user_intro(public)
    pdesc = "\n".join([f"- {k}: {v}" for k, v in public.items()])

    level_val = hidden.get("Level", "-")

    # Normalize the level before passing it to the model
    try:
        level_val = str(int(float(level_val)))
    except Exception:
        pass

    return f"""
USER INTRO:
{intro}

PROFILE DATA (referensi saja):
{pdesc}

TARGET OUTPUT:
Area Fungsi = {hidden.get('Area Fungsi','')}
Level = {level_val}

MODE:
{MODE_RULES[mode_key]}

PATUHI ATURAN BERIKUT:
{GLOBAL_RULES}

TEMPLATE REKOMENDASI AKHIR (harus digunakan di turn terakhir assistant):
"Berdasarkan informasi Anda, rekomendasi yang sesuai adalah: Area Fungsi: {hidden.get('Area Fungsi','')}, Level: {level_val}".

HASILKAN ARRAY JSON VALID dengan system prompt EXACT berikut:
{SYSTEM_PROMPT}
"""

# enhanced validation function
def validate_conversation_structure(messages):
    """
    Validate the conversation structure to ensure the format is correct.
    
    Returns:
        tuple: (is_valid, error_message)
    """
    if not isinstance(messages, list):
        return False, "Messages bukan list"
    
    if len(messages) < 2:
        return False, f"Conversation terlalu pendek: {len(messages)} messages"
    
    # Check first message is system
    if messages[0].get('role') != 'system':
        return False, f"First message bukan system: {messages[0].get('role')}"
    
    # Check last message is assistant
    if messages[-1].get('role') != 'assistant':
        return False, f"Last message bukan assistant: {messages[-1].get('role')}"
    
    # Check no JSON-in-string corruption
    for i, msg in enumerate(messages):
        content = msg.get('content', '')
        
        # Detect if content starts with JSON markers (corruption sign)
        if isinstance(content, str):
            stripped = content.strip()
            if stripped.startswith('[{') or stripped.startswith('[\n  {'):
                return False, f"Message {i} ({msg.get('role')}) contains JSON array as string (corruption detected)"
            
            # Additional check: content shouldn't contain an escaped JSON structure
            if '\\"role\\"' in content:
                return False, f"Message {i} ({msg.get('role')}) contains escaped JSON (corruption detected)"
    
    # Check all messages have required fields
    for i, msg in enumerate(messages):
        if 'role' not in msg:
            return False, f"Message {i} missing 'role' field"
        if 'content' not in msg:
            return False, f"Message {i} missing 'content' field"
        if msg['role'] not in ['system', 'user', 'assistant']:
            return False, f"Message {i} has invalid role: {msg['role']}"
    
    return True, "Valid"

def clean_and_parse_json(raw_text):
    """
    Clean and parse JSON from a model response using several fallback strategies.
    
    Returns:
        tuple: (parsed_array, success, error_message)
    """
    # Stage 1: Basic cleaning
    cleaned = (
        raw_text.replace("```json", "")
                .replace("```", "")
                .replace("\n\n", "\n")
                .strip()
    )
    
    # Stage 2: Remove common prefixes
    prefixes = [
        "Here is the JSON array:",
        "Berikut adalah array JSON:",
        "JSON output:",
        "Output JSON:",
        "Here's the conversation:",
        "Berikut percakapannya:",
    ]
    for prefix in prefixes:
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
    
    # Stage 3: Direct parse attempt
    if cleaned.startswith("[") and cleaned.endswith("]"):
        try:
            arr = json.loads(cleaned)
            return arr, True, None
        except json.JSONDecodeError:
            # Try repair: drop trailing commas before a closing brace or bracket
            try:
                repaired = re.sub(r',\s*}', '}', cleaned)
                repaired = re.sub(r',\s*]', ']', repaired)
                arr = json.loads(repaired)
                return arr, True, None
            except json.JSONDecodeError:
                pass  # fall through to Stage 4
    
    # Stage 4: Extract array substring
    if "[" in cleaned and "]" in cleaned:
        try:
            start = cleaned.index("[")
            end = cleaned.rindex("]") + 1
            extracted = cleaned[start:end]
            
            arr = json.loads(extracted)
            return arr, True, None
        except json.JSONDecodeError as e:
            # Try repair on the extracted substring
            try:
                repaired = re.sub(r',\s*}', '}', extracted)
                repaired = re.sub(r',\s*]', ']', repaired)
                arr = json.loads(repaired)
                return arr, True, None
            except json.JSONDecodeError:
                return None, False, f"JSON parse error: {str(e)}"
    
    return None, False, "No valid JSON array found in response"

# OpenAI call with ENHANCED error handling and validation
async def call_api(prompt, row_index=None, mode="unknown"):
    """
    Call the chat completions API with a retry mechanism, validation, and error handling.

    KEY IMPROVEMENTS:
    1. Real-time validation after parsing
    2. Automatic retry with different parameters if validation fails
    3. More detailed logging for debugging
    4. Corrupt data is rejected before it is saved
    """
    for attempt in range(RETRY_LIMIT):
        try:
            # Adjust temperature based on attempt (lower = more deterministic)
            attempt_temp = max(0.3, TEMPERATURE - (attempt * 0.1))
            
            resp = await client.chat.completions.create(
                model=MODEL_NAME,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=MAX_TOKENS,
                temperature=attempt_temp
            )

            raw = resp.choices[0].message.content.strip()

            # Parse with enhanced cleaning
            parsed_arr, parse_success, parse_error = clean_and_parse_json(raw)
            
            if not parse_success:
                print(f"Parse failed (row {row_index}, {mode}, attempt {attempt+1}): {parse_error}")
                
                # Retry; the next attempt uses a lower temperature
                if attempt < RETRY_LIMIT - 1:
                    await asyncio.sleep(RETRY_DELAY)
                    continue
                else:
                    # Last attempt failed - return error
                    return [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "assistant", "content": f"ERROR: Parse failed after {RETRY_LIMIT} attempts - {parse_error}"}
                    ]
            
            # Remove duplicate system messages
            cleaned_arr = [m for m in parsed_arr if m.get("role") != "system"]
            final_messages = [{"role": "system", "content": SYSTEM_PROMPT}] + cleaned_arr
            
            # CRITICAL: Validate conversation structure
            is_valid, validation_error = validate_conversation_structure(final_messages)
            
            if not is_valid:
                print(f"Validation failed (row {row_index}, {mode}, attempt {attempt+1}): {validation_error}")
                
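                # REJECT corrupt data and retry with a lower temperature
                if attempt < RETRY_LIMIT - 1:
                    # Report the temperature the next attempt will actually use (floored at 0.3)
                    next_temp = max(0.3, TEMPERATURE - (attempt + 1) * 0.1)
                    print(f"Retrying dengan temperature={next_temp:.2f}...")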
                    await asyncio.sleep(RETRY_DELAY)
                    continue
                else:
                    # Last attempt failed - return error instead of corrupt data
                    print(f"Row {row_index} ({mode}) REJECTED after {RETRY_LIMIT} attempts")
                    return [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "assistant", "content": f"ERROR: Validation failed - {validation_error}"}
                    ]
            
            # SUCCESS - valid conversation
            print(f"Row {row_index} ({mode}, attempt {attempt+1}): {len(final_messages)} turns, validated")
            return final_messages

        except RateLimitError:
            print(f"Rate limit (row {row_index}, {mode}, attempt {attempt+1}/{RETRY_LIMIT})")
            await asyncio.sleep(RETRY_DELAY * 2)  # Longer delay for rate limit
        except APIError as e:
            print(f"API error (row {row_index}, {mode}, attempt {attempt+1}/{RETRY_LIMIT}): {e}")
            await asyncio.sleep(RETRY_DELAY)
        except Exception as e:
            print(f"Unexpected error (row {row_index}, {mode}, attempt {attempt+1}): {e}")
            await asyncio.sleep(RETRY_DELAY)

    # All retries exhausted
    print(f"Row {row_index} ({mode}) FAILED after {RETRY_LIMIT} attempts")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "assistant", "content": f"ERROR: All retry attempts exhausted (row {row_index})"}
    ]

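# Batch processing with semaphore
# Note: the semaphore limits concurrent rows; each row issues 3 API calls at once,
# so up to 3 * CONCURRENT_REQUESTS requests may be in flight.
SEM = asyncio.Semaphore(CONCURRENT_REQUESTS)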

async def safe_process_row(row, row_index):
    """Process a single row with concurrency control and validation."""
    async with SEM:
        fast_variant = random.choice(FAST_VARIANTS)
        modes = [fast_variant, "medium", "long"]
        prompts = [
            build_prompt(row, fast_variant),
            build_prompt(row, "medium"),
            build_prompt(row, "long"),
        ]
        
        # Pair each prompt with its mode label for logging
        tasks = []
        for prompt, mode in zip(prompts, modes):
            tasks.append(call_api(prompt, row_index, mode))
        
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Handle exceptions
        processed_results = []
        for i, result in enumerate(results):
            mode = modes[i]
            if isinstance(result, Exception):
                print(f"Exception (row {row_index}, {mode}): {result}")
                processed_results.append([
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "assistant", "content": f"ERROR: Exception - {str(result)}"}
                ])
            else:
                processed_results.append(result)
        
        fast, med, long = processed_results
        return {"fast": fast, "medium": med, "long": long}

async def process_batch(df, writer, batch_num, pbar):
    """Process a batch of rows with validation."""
    tasks = []
    for idx, row in df.iterrows():
        tasks.append(safe_process_row(row, idx))
    
    results = []
    for coro in asyncio.as_completed(tasks):
        out = await coro
        results.append(out)
        pbar.update(1)
    
    # Write results after a final validation pass
    valid_count = 0
    error_count = 0
    
    for out in results:
        for mode_name, msgs in out.items():
            # Final check before writing
            is_valid, _ = validate_conversation_structure(msgs)
            
            if is_valid:
                writer.write(json.dumps({"messages": msgs}, ensure_ascii=False) + "\n")
                valid_count += 1
            else:
                # Skip corrupt conversations
                error_count += 1
                print(f"Skipped corrupt conversation ({mode_name})")
    
    if error_count > 0:
        print(f"Batch {batch_num}: {valid_count} valid, {error_count} skipped")

# PROCESS SINGLE FILE
async def process_single_file(input_path, output_dir, file_pbar=None):
    """Process a single Excel file."""
    file_name = Path(input_path).stem
    
    print(f"\n{'='*60}")
    print(f"Processing: {file_name}")
    print(f"{'='*60}")
    
    # Read Excel with error handling
    try:
        df = pd.read_excel(input_path)
        print(f"Columns detected: {df.columns.tolist()}")
    except FileNotFoundError:
        print(f"Error: File tidak ditemukan: {input_path}")
        if file_pbar:
            file_pbar.update(1)
        return
    except Exception as e:
        print(f"Error membaca Excel: {e}")
        if file_pbar:
            file_pbar.update(1)
        return
    
    total = len(df)
    print(f"Total rows: {total}")
    
    # Create output directory for this file
    file_output_dir = output_dir / file_name
    os.makedirs(file_output_dir, exist_ok=True)
    
    batch_count = 1
    
    # Progress bar for rows
    with tqdm(total=total, desc=f"  Rows ({file_name})", unit="row", leave=False) as pbar:
        for i in range(0, total, BATCH_SIZE):
            batch = df.iloc[i:i + BATCH_SIZE]
            
            file_path = f"{file_output_dir}/batch_{batch_count:03d}.jsonl"
            
            try:
                with open(file_path, "w", encoding="utf-8") as w:
                    await process_batch(batch, w, batch_count, pbar)
            except Exception as e:
                print(f"Error writing batch {batch_count}: {e}")
            
            batch_count += 1
    
    print(f"Done: {file_name} - {batch_count - 1} batches created")
    
    if file_pbar:
        file_pbar.update(1)

# MAIN - PROCESS ALL FILES
async def process_all_files(input_dir, output_base):
    """Process every Excel file in the input directory."""
    print(f"\n{'='*60}")
    print(f"Multi-File Dataset Generation")
    print(f"{'='*60}")
    print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Input Dir: {input_dir}")
    print(f"Output Dir: {output_base}")
    print(f"Model: {MODEL_NAME}")
    print(f"Batch size: {BATCH_SIZE}")
    print(f"Concurrent requests: {CONCURRENT_REQUESTS}")
    print(f"{'='*60}\n")
    
    # Find all Excel files
    excel_files = list(Path(input_dir).glob("*.xlsx"))
    
    if not excel_files:
        print("Tidak ada file Excel ditemukan!")
        return
    
    print(f"Found {len(excel_files)} Excel files\n")
    
    # Create base output directory
    os.makedirs(output_base, exist_ok=True)
    
    # Process files sequentially (to avoid rate limits)
    with tqdm(total=len(excel_files), desc="Files", unit="file") as file_pbar:
        for excel_file in excel_files:
            await process_single_file(excel_file, output_base, file_pbar)
    
    print(f"\n{'='*60}")
    print(f"ALL DONE!")
    print(f"Output: {output_base}")
    print(f"Processed: {len(excel_files)} files")
    print(f"{'='*60}\n")

input_directory = DATASET_DIR
output_directory = OUTPUT_BASE_DIR
print("\nMemulai proses generation...")
print("Harap tunggu, ini akan memakan waktu...\n")

await process_all_files(input_directory, output_directory)

.env file loaded successfully
API Key: sk-or-v1-621bc3f020f... (truncated for security)
Output akan disimpan di: /home/wildanaziz/dtp-data-pipeline/Pipeline Multiturn/MultiturnDatasetOutput

Memulai proses generation...
Harap tunggu, ini akan memakan waktu...


Multi-File Dataset Generation
Time: 2025-12-09 23:52:06
Input Dir: /home/wildanaziz/dtp-data-pipeline/Pipeline Multiturn/Flagged_500_Per_Class
Output Dir: /home/wildanaziz/dtp-data-pipeline/Pipeline Multiturn/MultiturnDatasetOutput
Model: openai/gpt-5.1-chat
Batch size: 10
Concurrent requests: 5

Found 46 Excel files



Files:   0%|          | 0/46 [00:00<?, ?file/s]


Processing: Pengembangan_Produk_Digital_2
Columns detected: ['Jenjang_Pendidikan', 'Jurusan', 'Judul_Tugas_Akhir', 'Bidang_Pelatihan', 'Nama_Pelatihan', 'Sertifikasi', 'Bidang_Sertifikasi', 'Posisi_Pekerjaan', 'Deskripsi_tugas_dan_tanggung_jawab', 'Lama_Bekerja', 'Keterampilan', 'Area_Fungsi', 'Level_Okupasi']
Total rows: 500


  Rows (Pengembangan_Produk_Digital_2):   0%|          | 0/500 [00:00<?, ?row/s]

Row 3 (fast_direct, attempt 1): 3 turns, validated
Row 2 (fast_short, attempt 1): 5 turns, validated
Row 9 (medium, attempt 1): 5 turns, validated
Row 7 (fast_short, attempt 1): 5 turns, validated
Row 3 (medium, attempt 1): 5 turns, validated
Row 9 (fast_short, attempt 1): 5 turns, validated
Row 6 (medium, attempt 1): 5 turns, validated
Row 6 (fast_direct, attempt 1): 5 turns, validated
Row 7 (medium, attempt 1): 5 turns, validated
Row 9 (long, attempt 1): 11 turns, validated
Parse failed (row 6, long, attempt 1): JSON parse error: Expecting ',' delimiter: line 39 column 16 (char 2341)
Row 2 (medium, attempt