# Gretel AI to Opik Dataset Integration - Complete Cookbook

A comprehensive guide with ready-to-run examples for generating synthetic Q&A datasets using Gretel Navigator and importing them into Opik for model evaluation.

---

## 🎯 What This Cookbook Covers

- **Authentication setup** for both Gretel and Opik
- **Synthetic data generation** using Gretel Navigator
- **Data format conversion** from Gretel to Opik
- **Dataset import** into Opik for evaluation
- **Complete examples** for different use cases

---

## 📋 Prerequisites & Setup

Before starting, you'll need:
1. **Gretel Account**: Sign up at [gretel.ai](https://gretel.ai)
2. **Comet Account**: Sign up at [comet.com](https://comet.com) for Opik access
3. **API Keys**: Gretel API key and Comet API key

### Install Required Packages

In [1]:
%pip install gretel_client opik langchain tiktoken pandas --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import getpass
from gretel_client import Gretel
import opik
import pandas as pd
import json

print("🚀 Starting Gretel to Opik integration setup...")

# Set up Opik
opik.configure()

# Set up Gretel API key
if "GRETEL_API_KEY" not in os.environ:
    os.environ["GRETEL_API_KEY"] = getpass.getpass("Enter your Gretel API key: ")

gretel = Gretel(api_key=os.environ["GRETEL_API_KEY"], cache=True, validate=True)

OPIK: Opik is already configured. You can check the settings by viewing the config file at /home/mavrick/.opik.config


🚀 Starting Gretel to Opik integration setup...
Logged in as mavrickrishi@gmail.com ✅
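
For non-interactive runs (CI jobs, scheduled scripts) the `getpass` prompt above is not available, so it can help to check for credentials up front. The following is a minimal sketch, assuming the keys are supplied through environment variables: `GRETEL_API_KEY` is the name used in the cell above, while `OPIK_API_KEY` is assumed here to be how Opik is configured in your environment. The helper name `missing_credentials` is hypothetical.

```python
import os

def missing_credentials(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

# Assumed variable names; adjust to match your own setup.
REQUIRED = ["GRETEL_API_KEY", "OPIK_API_KEY"]

# A plain dict stands in for os.environ so the example is reproducible.
missing = missing_credentials(REQUIRED, env={"GRETEL_API_KEY": "example-key"})
print(missing)
```

Calling `missing_credentials(REQUIRED)` without `env` checks the real process environment; an empty list means every required variable is set.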


## Find Working Model

In [3]:
# Test different models to find one that works
def find_working_navigator_model():
    """Find a Navigator model that works in your environment"""
    models_to_try = ['gretelai/auto', 'gretelai/apache-2.0', 'gretelai/llama-3.x']
    
    for model in models_to_try:
        try:
            print(f"🔄 Testing model: {model}")
            test_navigator = gretel.factories.initialize_navigator_api("tabular", backend_model=model)
            
            # Quick test generation
            test_result = test_navigator.generate(
                "Create a small dataset with columns 'name' and 'age' for 3 people.", 
                num_records=3
            )
            
            print(f"✅ Model {model} works! Generated {len(test_result)} records")
            
            # Print the test dataset
            print("📊 Test dataset:")
            print(test_result)
            
            return test_navigator, model
            
        except Exception as e:
            print(f"❌ Model {model} failed: {e}")
    
    raise Exception("No working Navigator model found")

# Find and use a working model
navigator, working_model = find_working_navigator_model()
print(f"🎉 Using working model: {working_model}")

🔄 Testing model: gretelai/auto
Backend model: gretelai/auto
API path: https://api.gretel.cloud/v1/inference/tabular/
Navigator Tabular initialized 🚀


Generating records: 100%|██████████| 3/3 [00:07, 0.41 records/s]

✅ Model gretelai/auto works! Generated 3 records
📊 Test dataset:
         name  age
0  John Smith   25
1   Maria Lee   31
2   David Kim   42
🎉 Using working model: gretelai/auto
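
The model discovery step is an instance of a general "try candidates in order" pattern. The sketch below isolates that pattern from the Gretel client so it can be tested on its own; `first_working` and `fake_probe` are hypothetical names, and the probe is a stand-in that does not call Navigator.

```python
def first_working(candidates, probe):
    """Return (candidate, result) for the first candidate whose probe succeeds.

    `probe` is any callable that raises on failure. Errors are collected so
    the final message explains why every candidate was rejected.
    """
    errors = {}
    for candidate in candidates:
        try:
            return candidate, probe(candidate)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors[candidate] = str(exc)
    raise RuntimeError(f"No working candidate found: {errors}")


# Stand-in probe for illustration only; a real probe would run a small generation.
def fake_probe(model):
    if model != "gretelai/apache-2.0":
        raise ValueError("unavailable")
    return 3  # pretend three records were generated


model, n_records = first_working(
    ["gretelai/auto", "gretelai/apache-2.0", "gretelai/llama-3.x"], fake_probe
)
print(model, n_records)
```

Collecting the per-candidate errors, rather than only printing them, makes the final exception useful when the function runs outside a notebook.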





## 📝 Configure Prompt and Source Content

In [4]:
# Base prompt for Q&A dataset creation
PROMPT = (
    "From the source text below, create a dataset with the following columns:\n"
    "* `question`: Ask a set of unique questions related to the topic that a customer might ask. "
    "Questions should be relatively complex and specific enough to be addressed in a short answer.\n"
    "* `context`: Copy the exact sentence(s) from the source text and surrounding details from where the answer can be derived.\n"
    "* `truth`: Respond to the question with a clear, textbook quality answer that provides relevant details to fully address the question.\n"
)

# Your source content (customize this with your domain-specific content)
source_text = """
Artificial Intelligence (AI) has revolutionized numerous industries by automating complex tasks 
and providing intelligent insights. Machine learning, a subset of AI, enables systems to learn 
from data without explicit programming. Deep learning, using neural networks with multiple layers, 
has achieved breakthroughs in image recognition, natural language processing, and decision making.
The field continues to evolve with advancements in transformer architectures, reinforcement learning,
and federated learning approaches that preserve privacy while enabling collaborative model training.
"""

print("📝 Prompt and source text configured")
print(f"📄 Source text length: {len(source_text)} characters")

📝 Prompt and source text configured
📄 Source text length: 596 characters
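
The sample source text is under 600 characters, so it fits in a single prompt. For longer documents, one option is to split the text into sentence-aligned chunks and run generation once per chunk. This is a rough sketch under stated assumptions: the helper name `chunk_text` and the character budget are illustrative, and the regular expression is a naive sentence splitter that will mis-split abbreviations.

```python
import re

def chunk_text(text, max_chars=500):
    """Split text into chunks of whole sentences, each at most max_chars long.

    A single sentence longer than max_chars becomes its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "First sentence here. Second sentence follows. Third one ends it."
for chunk in chunk_text(sample, max_chars=45):
    print(repr(chunk))
```

Each chunk could then be passed in place of `source_text` when building the prompt, with the resulting DataFrames concatenated afterwards.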


## 🚀 Generate Synthetic Q&A Dataset

In [5]:
def generate_qa_dataset_robust(navigator, prompt, source_text, max_attempts=4):
    """Generate Q&A dataset, trying up to max_attempts fallback strategies"""
    
    # Parameter strategies, ordered from the largest request to the smallest
    strategies = [
        {"num_records": 10, "temperature": 0.7, "top_p": 0.9},
        {"num_records": 8, "temperature": 0.5},
        {"num_records": 5},
        {"num_records": 3, "temperature": 0.3}
    ]
    # Honor the caller's attempt limit (previously this argument was ignored)
    strategies = strategies[:max_attempts]
    
    for attempt, params in enumerate(strategies, 1):
        try:
            print(f"🔄 Attempt {attempt}: Generating with params {params}")
            result = navigator.generate(f"{prompt}\n\n{source_text}", **params)
            
            if len(result) > 0:
                print(f"✅ Success! Generated {len(result)} records")
                return result
            else:
                print("⚠️ Empty result, trying next strategy...")
                
        except Exception as e:
            print(f"❌ Attempt {attempt} failed: {e}")
            if attempt < len(strategies):
                print("🔄 Trying next strategy...")
    
    return None

# Generate the dataset
print("🚀 Starting data generation...")
synthetic_df = generate_qa_dataset_robust(navigator, PROMPT, source_text)

if synthetic_df is not None:
    print(f"\n📊 Dataset generated successfully!")
    print(f"   Shape: {synthetic_df.shape}")
    print(f"   Columns: {list(synthetic_df.columns)}")
    
    # Display sample data
    pd.set_option('display.max_colwidth', 100)
    print(f"\n📋 Sample generated data:")
    print(synthetic_df.head(3))
else:
    print("❌ Failed to generate any data")

🚀 Starting data generation...
🔄 Attempt 1: Generating with params {'num_records': 10, 'temperature': 0.7, 'top_p': 0.9}


Generating records: 100%|██████████| 10/10 [00:33, 0.30 records/s]

✅ Success! Generated 10 records

📊 Dataset generated successfully!
   Shape: (10, 3)
   Columns: ['question', 'context', 'truth']

📋 Sample generated data:
                                                                  question  \
0                  What is the primary function of machine learning in AI?   
1  What are some key areas where deep learning has achieved breakthroughs?   
2                    What are some recent advancements in the field of AI?   

                                                                                               context  \
0   Machine learning, a subset of AI, enables systems to learn from data without explicit programming.   
1  Deep learning, using neural networks with multiple layers, has achieved breakthroughs in image r...   
2  The field continues to evolve with advancements in transformer architectures, reinforcement lear...   

                                                                                                 truth  
0
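
Generated rows can contain blank fields or near-duplicate questions, so a light cleaning pass before conversion is often worthwhile. The sketch below works on a list of dicts, such as the result of `synthetic_df.to_dict("records")`; the column names come from the prompt above, and `clean_records` is a hypothetical helper rather than part of Gretel or Opik.

```python
def clean_records(records, required=("question", "context", "truth")):
    """Drop rows with missing or blank required fields and repeated questions."""
    seen, cleaned = set(), []
    for row in records:
        values = [row.get(col) for col in required]
        # Reject rows where any required field is absent, non-string, or blank.
        # (pandas NaN is a float, so it is rejected here as well.)
        if not all(isinstance(v, str) and v.strip() for v in values):
            continue
        # Compare questions case-insensitively with whitespace collapsed
        key = " ".join(row["question"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

rows = [
    {"question": "What is ML?", "context": "ML is a subset of AI.", "truth": "A subset of AI."},
    {"question": "what is  ml?", "context": "ML is a subset of AI.", "truth": "Duplicate."},
    {"question": "What is DL?", "context": "", "truth": "Missing context."},
]
print(len(clean_records(rows)))
```

Only the first row survives: the second repeats the same question with different casing and spacing, and the third has an empty `context`.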




## 🔄 Convert Gretel Format to Opik Format

In [6]:
def convert_gretel_to_opik_format(df, model_name="unknown"):
    """Convert Gretel DataFrame to Opik dataset format"""
    
    print(f"🔄 Converting {len(df)} rows to Opik format...")
    print(f"📝 Available columns: {list(df.columns)}")
    
    opik_items = []
    
    # Detect columns automatically
    question_col = None
    answer_col = None
    context_col = None
    
    for col in df.columns:
        col_lower = col.lower().strip()
        
        if col_lower == 'q' or any(word in col_lower for word in ['question', 'query']):
            question_col = col
            print(f"✅ Question column: {col}")
        elif any(word in col_lower for word in ['truth', 'answer', 'response', 'reply']):
            answer_col = col
            print(f"✅ Answer column: {col}")
        elif any(word in col_lower for word in ['context', 'background', 'source']):
            context_col = col
            print(f"✅ Context column: {col}")
    
    # Convert each row
    successful_conversions = 0
    for idx, row in df.iterrows():
        try:
            # Build input dictionary
            input_data = {}
            
            if question_col and pd.notna(row.get(question_col)):
                input_data["question"] = str(row[question_col]).strip()
            
            if context_col and pd.notna(row.get(context_col)):
                input_data["context"] = str(row[context_col]).strip()
            
            # Get expected output
            expected_output = ""
            if answer_col and pd.notna(row.get(answer_col)):
                expected_output = str(row[answer_col]).strip()
            
            # Only include items with both question and answer
            if input_data.get("question") and expected_output:
                item = {
                    "input": input_data,
                    "expected_output": expected_output,
                    "metadata": {
                        "source": "gretel_navigator",
                        "generated": True,
                        "row_index": idx,
                        "model": model_name
                    }
                }
                opik_items.append(item)
                successful_conversions += 1
            else:
                print(f"⚠️ Skipping row {idx}: missing question or answer")
                
        except Exception as e:
            print(f"❌ Error converting row {idx}: {e}")
    
    print(f"✅ Successfully converted {successful_conversions}/{len(df)} rows")
    return opik_items

# Convert the dataset
if synthetic_df is not None and len(synthetic_df) > 0:
    opik_formatted_data = convert_gretel_to_opik_format(synthetic_df, working_model)
    
    if opik_formatted_data:
        print(f"\n📋 Sample converted item:")
        print(json.dumps(opik_formatted_data[0], indent=2))
    else:
        print("❌ No items were successfully converted")

🔄 Converting 10 rows to Opik format...
📝 Available columns: ['question', 'context', 'truth']
✅ Question column: question
✅ Context column: context
✅ Answer column: truth
✅ Successfully converted 10/10 rows

📋 Sample converted item:
{
  "input": {
    "question": "What is the primary function of machine learning in AI?",
    "context": "Machine learning, a subset of AI, enables systems to learn from data without explicit programming."
  },
  "expected_output": "Machine learning allows systems to learn from data without being explicitly programmed, enabling them to improve their performance on a task over time.",
  "metadata": {
    "source": "gretel_navigator",
    "generated": true,
    "row_index": 0,
    "model": "gretelai/auto"
  }
}
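
Substring matching on column names can pick up unintended columns when an export has many fields. As a stricter alternative, the sketch below prefers exact name matches, falls back to substring matches, and never assigns the same column to two roles. The alias lists mirror the keywords used in the converter; `detect_columns` is a hypothetical helper.

```python
ALIASES = {
    "question": ["question", "query", "q"],
    "answer": ["truth", "answer", "response", "reply"],
    "context": ["context", "background", "source"],
}

def detect_columns(columns):
    """Map each role to a column name, or None if nothing matches."""
    normalized = {col: col.lower().strip() for col in columns}
    mapping = {}
    for role, aliases in ALIASES.items():
        # Pass 1: exact name match
        exact = [c for c, n in normalized.items() if n in aliases]
        # Pass 2: substring match, skipping one-letter aliases as too ambiguous
        partial = [c for c, n in normalized.items()
                   if any(len(a) > 1 and a in n for a in aliases)]
        taken = set(mapping.values())
        candidates = [c for c in exact + partial if c not in taken]
        mapping[role] = candidates[0] if candidates else None
    return mapping

print(detect_columns(["frequency", "user_question", "truth", "source_context"]))
```

Here `frequency` is ignored even though it contains the letter "q", and `source_context` is assigned to the context role only.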


## 📤 Push Dataset to Opik

In [7]:
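def push_to_opik(opik_data, dataset_name="gretel-qa-dataset", model_name=None):
    """Push converted data to Opik as a dataset"""
    
    if not opik_data:
        return False, "No data to push"
    
    # Use the model found earlier in the notebook when no name is given;
    # fall back to "unknown" if that cell was never run (e.g. file imports)
    if model_name is None:
        model_name = globals().get("working_model", "unknown")
    
    print(f"📤 Pushing {len(opik_data)} items to Opik...")
    
    try:
        # Initialize Opik client
        opik_client = opik.Opik()
        
        # Create or get dataset
        opik_dataset = opik_client.get_or_create_dataset(
            name=dataset_name,
            description=f"Synthetic Q&A dataset generated using Gretel Navigator ({model_name})"
        )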
        
        print(f"📊 Dataset created/found: {opik_dataset.name}")
        print(f"🆔 Dataset ID: {opik_dataset.id}")
        
        # Insert data
        opik_dataset.insert(opik_data)
        
        print(f"✅ Successfully pushed {len(opik_data)} items!")
        print(f"📊 Dataset name: {opik_dataset.name}")
        print(f"🆔 Dataset ID: {opik_dataset.id}")
        
        # Show sample of what was pushed
        if opik_data:
            sample = opik_data[0]
            print(f"\n📋 Sample item pushed:")
            print(f"   Question: {sample['input'].get('question', 'N/A')[:80]}...")
            print(f"   Answer: {sample['expected_output'][:80]}...")
        
        return True, opik_dataset.name
        
    except Exception as e:
        print(f"❌ Failed to push to Opik: {e}")
        import traceback
        traceback.print_exc()
        return False, str(e)

# Push the data to Opik
if 'opik_formatted_data' in locals() and opik_formatted_data:
    success, result = push_to_opik(opik_formatted_data, "gretel-ai-qa-cookbook")
    
    if success:
        print(f"\n🎉 Integration completed successfully!")
        print(f"📊 Dataset '{result}' is now available in Opik")
        print(f"\n🔗 Next steps:")
        print(f"   1. Go to your Comet workspace")
        print(f"   2. Navigate to Opik → Datasets")
        print(f"   3. Find your dataset: {result}")
        print(f"   4. Use it in model evaluations!")
    else:
        print(f"❌ Failed to complete integration: {result}")
else:
    print("❌ No data available to push")

📤 Pushing 10 items to Opik...
📊 Dataset created/found: gretel-ai-qa-cookbook
🆔 Dataset ID: 0197a84b-cf53-7f88-afef-ffb5c0fa95b2
✅ Successfully pushed 10 items!
📊 Dataset name: gretel-ai-qa-cookbook
🆔 Dataset ID: 0197a84b-cf53-7f88-afef-ffb5c0fa95b2

📋 Sample item pushed:
   Question: What is the primary function of machine learning in AI?...
   Answer: Machine learning allows systems to learn from data without being explicitly prog...

🎉 Integration completed successfully!
📊 Dataset 'gretel-ai-qa-cookbook' is now available in Opik

🔗 Next steps:
   1. Go to your Comet workspace
   2. Navigate to Opik → Datasets
   3. Find your dataset: gretel-ai-qa-cookbook
   4. Use it in model evaluations!


The `gretel-ai-qa-cookbook` dataset can now be viewed in the Opik UI:

![gretel-qa-dataset](https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/fern/img/cookbook/gretel_opik_integration_cookbook.png)
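
When scaling beyond a handful of rows, inserting in batches keeps each request small and makes a failed batch easier to retry. This is a minimal sketch with an arbitrary batch size of 100; the SDK may already batch internally, so treat it as optional. `batched` is a hypothetical helper, and the placeholder items stand in for `opik_formatted_data`.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of `items`, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder items for illustration; in the notebook each batch would be
# passed to opik_dataset.insert(batch) instead of being measured.
items = [{"input": {"question": f"Q{i}"}, "expected_output": f"A{i}"} for i in range(250)]
sizes = [len(batch) for batch in batched(items, batch_size=100)]
print(sizes)
```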

## ✅ Verify Dataset Creation

In [8]:
def verify_opik_dataset(dataset_name):
    """Verify the dataset was created and provide access instructions"""
    
    try:
        print(f"🔍 Verifying dataset: {dataset_name}")
        opik_client = opik.Opik()
        
        # Get the dataset
        dataset = opik_client.get_dataset(dataset_name)
        print(f"✅ Dataset verified: {dataset.name}")
        print(f"🆔 Dataset ID: {dataset.id}")

        print(f"\n📋 How to view your dataset:")
        print(f"   1. Go to https://www.comet.com")
        print(f"   2. Navigate to your workspace")
        print(f"   3. Click on 'Opik' in the left sidebar")
        print(f"   4. Go to 'Datasets' tab")
        print(f"   5. Look for dataset: {dataset_name}")
        
        print(f"\n🧪 How to use in evaluations:")
        print(f"""
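# Example evaluation code:
import opik
from opik.evaluation import evaluate

opik_client = opik.Opik()
dataset = opik_client.get_dataset('{dataset_name}')

@opik.track
def my_qa_model(question, context):
    # Your model logic here
    return "Your model's answer"

def evaluation_task(item):
    # Each dataset item carries the 'input' dict built during conversion
    input_data = item.get('input', {{}})
    answer = my_qa_model(input_data.get('question', ''), input_data.get('context', ''))
    return {{'output': answer}}  # evaluation tasks return a dict

# Run evaluation
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    experiment_name="gretel-synthetic-eval"
)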
        """)
        
        return True
        
    except Exception as e:
        print(f"❌ Could not verify dataset: {e}")
        return False

# Verify the dataset (use the actual dataset name from previous step)
if 'result' in locals() and success:
    verify_opik_dataset(result)
else:
    print("⚠️ No dataset to verify - make sure previous steps completed successfully")

🔍 Verifying dataset: gretel-ai-qa-cookbook
✅ Dataset verified: gretel-ai-qa-cookbook
🆔 Dataset ID: 0197a84b-cf53-7f88-afef-ffb5c0fa95b2

📋 How to view your dataset:
   1. Go to https://www.comet.com
   2. Navigate to your workspace
   3. Click on 'Opik' in the left sidebar
   4. Go to 'Datasets' tab
   5. Look for dataset: gretel-ai-qa-cookbook

🧪 How to use in evaluations:

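# Example evaluation code:
import opik
from opik.evaluation import evaluate

opik_client = opik.Opik()
dataset = opik_client.get_dataset('gretel-ai-qa-cookbook')

@opik.track
def my_qa_model(question, context):
    # Your model logic here
    return "Your model's answer"

def evaluation_task(item):
    # Each dataset item carries the 'input' dict built during conversion
    input_data = item.get('input', {})
    answer = my_qa_model(input_data.get('question', ''), input_data.get('context', ''))
    return {'output': answer}  # evaluation tasks return a dict

# Run evaluation
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    experiment_name="gretel-synthetic-eval"
)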
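
An evaluation also needs some way to compare the model's answer with `expected_output`. The sketch below computes a token-overlap F1 score in plain Python. It is not one of Opik's built-in metrics, and `token_f1` is a hypothetical name; using it inside an Opik experiment would require wrapping it as a custom metric, which is not shown here.

```python
import re
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, punctuation ignored)."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "Machine learning enables systems to learn from data."
print(round(token_f1("Systems learn from data.", reference), 3))
print(token_f1("Completely unrelated reply", reference))
```

Token overlap rewards shared wording rather than meaning, so it suits short factual answers better than free-form explanations.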
        


## 🔄 Alternative: Load from Gretel Export Files

If you have pre-existing Gretel datasets exported as files, you can also import them:

In [9]:
def load_gretel_export(file_path, format_type="csv"):
    """
    Load a Gretel dataset export from local file.
    Supports CSV, JSON, and JSONL formats.
    """
    try:
        if format_type.lower() == "csv":
            df = pd.read_csv(file_path)
        elif format_type.lower() == "json":
            df = pd.read_json(file_path)
        elif format_type.lower() == "jsonl":
            df = pd.read_json(file_path, lines=True)
        else:
            raise ValueError("Supported formats: csv, json, jsonl")
        
        print(f"✅ Loaded {len(df)} records from {file_path}")
        print(f"📊 Dataset shape: {df.shape}")
        print(f"📋 Columns: {list(df.columns)}")
        
        # Display sample data from Gretel
        print("\n📄 Sample data from Gretel:")
        pd.set_option('display.max_columns', None)
        pd.set_option('display.max_colwidth', 100)
        print(df.head(3))
        
        # Show data types
        print(f"\n📈 Data types:")
        print(df.dtypes)
        
        # Basic statistics
        print(f"\n📊 Basic statistics:")
        if 'question' in df.columns:
            print(f"  - Average question length: {df['question'].str.len().mean():.1f} characters")
        if 'answer' in df.columns or 'truth' in df.columns:
            answer_col = 'answer' if 'answer' in df.columns else 'truth'
            print(f"  - Average answer length: {df[answer_col].str.len().mean():.1f} characters")
        if 'topic' in df.columns:
            print(f"  - Unique topics: {df['topic'].nunique()}")
        if 'difficulty' in df.columns or 'user_profile' in df.columns:
            diff_col = 'difficulty' if 'difficulty' in df.columns else 'user_profile'
            print(f"  - Difficulty distribution: {dict(df[diff_col].value_counts())}")
            
        return df
    
    except Exception as e:
        print(f"❌ Error loading file: {e}")
        return None

# Example usage:
# df_gretel = load_gretel_export("your_gretel_export.csv", "csv")
# df_gretel = load_gretel_export("your_gretel_export.jsonl", "jsonl")

# Then convert and push to Opik:
# opik_data = convert_gretel_to_opik_format(df_gretel, "gretel-export")
# success, result = push_to_opik(opik_data, "gretel-imported-dataset")
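
Passing the format by hand is easy to get wrong when importing several files. A small sketch, assuming the file extension reliably reflects the format; `infer_format` is a hypothetical helper whose return values match the `format_type` strings accepted by `load_gretel_export`.

```python
from pathlib import Path

SUPPORTED = {".csv": "csv", ".json": "json", ".jsonl": "jsonl"}

def infer_format(file_path):
    """Return the format name for a path based on its extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(
            f"Unsupported extension {suffix!r}; expected one of {sorted(SUPPORTED)}"
        )
    return SUPPORTED[suffix]

print(infer_format("exports/qa_pairs.JSONL"))
```

With this in place, a call could look like `load_gretel_export(path, infer_format(path))`.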

## 🎯 Complete Integration Summary

This cookbook provides a complete workflow for integrating Gretel AI datasets with Opik:

### ✅ **What We Accomplished:**
1. **Authentication Setup** - Both Gretel and Opik API configurations
2. **Model Discovery** - Automatic detection of working Gretel models
3. **Synthetic Data Generation** - Using Gretel Navigator for Q&A creation
4. **Format Conversion** - Transform Gretel output to Opik-compatible format
5. **Dataset Import** - Push datasets to Opik for evaluation use
6. **Verification** - Confirm successful import and provide usage guidance

### 🔧 **Key Features:**
- **Robust Error Handling**: Multiple fallback strategies
- **Automatic Column Detection**: Smart mapping of data fields
- **Flexible Input**: Supports both live generation and file imports
- **Validation and Guidance**: Skips rows missing a question or answer and prints where to find the dataset in Opik

### 📊 **Use Cases:**
- **Model Testing**: Create evaluation datasets for Q&A models
- **Benchmarking**: Generate consistent test sets across experiments
- **Agent Optimization**: Provide training data for Opik's Agent Optimizer
- **Continuous Evaluation**: Regular model performance monitoring

### 🚀 **Next Steps:**
1. Customize the `source_text` with your domain-specific content
2. Adjust generation parameters based on your needs
3. Use the imported dataset in Opik evaluations
4. Scale up for larger dataset generation

This integration enables seamless data flow from Gretel's synthetic data generation capabilities into Opik's model evaluation and optimization ecosystem! 🎉