# Dataset Exploration and Endpoint Benchmarking

This notebook loads instruction-following datasets, sends their samples to an AI model selection endpoint, and estimates the cost of the models the endpoint picks.

## Overview

This notebook enables you to:
- **Load and analyze datasets**: Load instruction datasets from HuggingFace Hub with sampling options
- **Test endpoints**: Send real dataset samples to an AI model selection endpoint
- **Estimate costs**: Estimate API usage costs per model from approximate token counts
- **Export results**: Save detailed results and summaries to CSV files

### Supported Datasets:
- **Databricks Dolly 15k**: High-quality instruction dataset (default)
- **Open-Orca datasets**: Large-scale instruction datasets
- **Any HuggingFace instruction dataset**: Customizable dataset loading

### Features:
- ✅ **Model cost tracking** with a per-model pricing table (USD per 1M tokens)
- ✅ **Progress monitoring** with periodic updates (every 50 samples)
- ✅ **Error handling** for robust API testing
- ✅ **Flexible JSON payload** supporting multiple AI providers
- ✅ **Detailed analytics** with model selection distributions

## Setup Instructions

### 1. Install Dependencies
```bash
# Using pip
pip install datasets pandas requests numpy matplotlib seaborn huggingface_hub

# Using uv (recommended)
uv add datasets pandas requests numpy matplotlib seaborn huggingface_hub
```

### 2. Configure HuggingFace Authentication
```bash
huggingface-cli login
```

### 3. Update Endpoint Configuration
Before running, update the `ENDPOINT_URL` variable in the "Configuration" section with your actual endpoint URL.

### 4. Customize Model Costs (Optional)
Update the `model_costs` dictionary in the `process_dataset_with_endpoint` function to match your actual model pricing.
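
As a rough illustration of how a per-million-token price turns into a per-request figure, the sketch below mirrors the arithmetic used by the notebook's `calculate_cost` helper. The price entry is copied from `model_costs`; the token counts are invented example values, and `PRICES` is a name used only for this illustration.

```python
# Prices in USD per 1M tokens, copied from the notebook's model_costs entry for gpt-4o-mini.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}


def calculate_cost(model_name: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost; unknown models are priced at 0.0, as in the notebook."""
    if model_name not in PRICES:
        return 0.0
    price = PRICES[model_name]
    input_cost = (input_tokens / 1_000_000) * price["input"]
    output_cost = (output_tokens / 1_000_000) * price["output"]
    return input_cost + output_cost


# Example values (assumed): a 200-token prompt and a 100-token answer.
cost = calculate_cost("gpt-4o-mini", 200, 100)
print(f"${cost:.6f}")  # prints $0.000090
```

Because models missing from the dictionary are priced at zero, any total that includes such models is an underestimate.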

## Quick Start

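1. Run the cells in order. The sample endpoint test needs `dolly_info`, so rerun it after the dataset loading cell if it reports that the dataset is not loaded yet
2. The notebook loads 1,000 randomly sampled rows from the Databricks Dolly 15k dataset
3. It checks connectivity to your endpoint
4. It sends the samples to the endpoint and estimates costs
5. It saves the results as CSV files with timestamps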

## Output Files

- `dolly_endpoint_results_YYYYMMDD_HHMMSS.csv`: Detailed results for each sample
- `dolly_summary_YYYYMMDD_HHMMSS.csv`: Summary statistics and costs
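
The naming pattern above can be sketched as follows. This is a minimal illustration using the standard library's `datetime` with the same format string the notebook passes to `strftime`; the helper name `timestamped_name` and the fixed example time are assumptions made for the example.

```python
from datetime import datetime


def timestamped_name(prefix: str, now: datetime) -> str:
    """Build '<prefix>_YYYYMMDD_HHMMSS.csv' from a datetime."""
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# Fixed example time so the result is predictable.
example = datetime(2025, 1, 31, 14, 5, 9)
results_name = timestamped_name("dolly_endpoint_results", example)
summary_name = timestamped_name("dolly_summary", example)
print(results_name)  # dolly_endpoint_results_20250131_140509.csv
print(summary_name)  # dolly_summary_20250131_140509.csv
```

Since every field is zero-padded to a fixed width, sorting these names alphabetically also sorts them chronologically.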

## Configuration

**✏️ Update these settings before running the notebook:**

In [23]:
# ===== CONFIGURATION SECTION =====
# Update these variables before running the notebook

# Dataset Configuration
DATASET_NAME = "databricks/databricks-dolly-15k"  # HuggingFace dataset identifier
SAMPLE_SIZE = 1000  # Number of samples to process (None for all)
DATASET_SPLIT = "train"  # Dataset split to use

# Endpoint Configuration
ENDPOINT_URL = "https://prompt-classifer-dev.mangoplant-a7a21605.swedencentral.azurecontainerapps.io/predict"  # ⚠️ UPDATE THIS

# Processing Configuration
MAX_SAMPLES_TO_PROCESS = 1000  # Maximum samples to send to endpoint
REQUEST_TIMEOUT = 30  # Timeout for each API request (seconds)
DELAY_BETWEEN_REQUESTS = 0.05  # Delay between requests (seconds)

# Output Configuration
SAVE_CSV_RESULTS = True  # Whether to save results to CSV
SAVE_SUMMARY = True  # Not read by the code below; the summary CSV is written whenever SAVE_CSV_RESULTS is True

print("✅ Configuration loaded")
print(f"📊 Dataset: {DATASET_NAME}")
print(f"📊 Sample size: {SAMPLE_SIZE}")
print(f"🌐 Endpoint: {ENDPOINT_URL}")
print("⚠️  Remember to update ENDPOINT_URL before running!")

✅ Configuration loaded
📊 Dataset: databricks/databricks-dolly-15k
📊 Sample size: 1000
🌐 Endpoint: https://prompt-classifer-dev.mangoplant-a7a21605.swedencentral.azurecontainerapps.io/predict
⚠️  Remember to update ENDPOINT_URL before running!
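
One possible way to avoid editing the endpoint URL in the notebook itself is to read it from the environment. The sketch below is an optional variation, not part of the notebook; the variable names `BENCH_ENDPOINT_URL` and `BENCH_SAMPLE_SIZE` and the helper `load_config` are hypothetical.

```python
import os
from collections.abc import Mapping


def load_config(env: Mapping[str, str]) -> dict:
    """Read settings from an environment-like mapping, falling back to the notebook defaults."""
    raw_size = env.get("BENCH_SAMPLE_SIZE", "1000")  # hypothetical variable name
    return {
        # Hypothetical variable name; an empty string means "not configured yet".
        "endpoint_url": env.get("BENCH_ENDPOINT_URL", ""),
        # The string "all" maps to None, which the notebook treats as "use every sample".
        "sample_size": None if raw_size.lower() == "all" else int(raw_size),
    }


config = load_config(os.environ)
if not config["endpoint_url"]:
    print("BENCH_ENDPOINT_URL is not set; set it before calling the endpoint.")
```

Passing the mapping in as an argument keeps the function easy to exercise with a plain dictionary instead of the real environment.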


## Setup and Imports

In [24]:
# Core imports
import json
import time
from typing import Any
import warnings

from datasets import load_dataset
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Plotting setup
plt.style.use('default')
sns.set_palette("husl")

print("✓ All imports successful")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print("Requests available for API testing")

✓ All imports successful
Pandas version: 2.2.2
NumPy version: 1.26.4
Requests available for API testing


## Authentication Check

In [25]:
# Check HuggingFace authentication
from huggingface_hub import whoami

try:
    user_info = whoami()
    print(f"✓ Authenticated as: {user_info['name']}")
    print(f"User type: {user_info.get('type', 'Unknown')}")
except Exception as e:
    print(f"✗ Authentication failed: {e}")
    print("Please run: huggingface-cli login")

✓ Authenticated as: AImen44
User type: user


## Analysis Functions

In [26]:
def analyze_dataset_structure(dataset_info: dict[str, Any]) -> None:
    """
    Analyze and display dataset structure and statistics.
    """
    if 'error' in dataset_info:
        print(f"Cannot analyze {dataset_info['dataset_name']} due to error: {dataset_info['error']}")
        return

    samples = dataset_info['samples']
    if not samples:
        print("No samples to analyze")
        return

    print(f"\n{'='*60}")
    print(f"DATASET ANALYSIS: {dataset_info['dataset_name']}")
    print(f"{'='*60}")

    # Basic info
    print(f"Total samples: {dataset_info['num_samples']:,}")
    print(f"Load time: {dataset_info['load_time']:.2f}s")
    print(f"Streaming mode: {dataset_info['streaming']}")

    # Schema analysis
    first_sample = samples[0]
    print(f"\nSchema ({len(first_sample)} fields):")
    for field, value in first_sample.items():
        value_type = type(value).__name__
        if isinstance(value, str):
            # Reports the first sample's length only; per-field averages are printed below
            print(f"  {field}: {value_type} (avg length: {len(value)} chars)")
        elif isinstance(value, (list, dict)):
            print(f"  {field}: {value_type} (length: {len(value)})")
        else:
            print(f"  {field}: {value_type} = {value}")

    # Text statistics for string fields
    print(f"\nText Statistics (based on {len(samples)} samples):")
    for field in first_sample.keys():
        if isinstance(first_sample[field], str):
            lengths = [len(str(sample[field])) for sample in samples]
            print(f"  {field}:")
            print(f"    Min: {min(lengths)} chars")
            print(f"    Max: {max(lengths)} chars")
            print(f"    Avg: {np.mean(lengths):.1f} chars")
            print(f"    Median: {np.median(lengths):.1f} chars")

def display_sample_data(dataset_info: dict[str, Any], num_samples: int = 3) -> None:
    """
    Display sample data from the dataset.
    """
    if 'error' in dataset_info:
        return

    samples = dataset_info['samples']
    if not samples:
        return

    print(f"\n{'='*60}")
    print(f"SAMPLE DATA: {dataset_info['dataset_name']}")
    print(f"{'='*60}")

    for i, sample in enumerate(samples[:num_samples]):
        print(f"\n--- Sample {i+1} ---")
        for field, value in sample.items():
            if isinstance(value, str):
                if len(value) > 200:
                    print(f"{field}: {value[:200]}...")
                else:
                    print(f"{field}: {value}")
            else:
                print(f"{field}: {value}")
        print("-" * 40)

print("✓ Analysis functions defined")

✓ Analysis functions defined
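
The per-field text statistics can be tried on a few toy records without downloading anything. The sketch below is a dependency-free approximation of what `analyze_dataset_structure` computes, using `statistics` instead of NumPy; the helper name `text_length_stats` and the sample rows are invented for illustration, and the function assumes a non-empty sample list.

```python
from statistics import mean, median


def text_length_stats(samples: list[dict], field: str) -> dict:
    """Return min, max, mean, and median character length of one string field."""
    lengths = [len(str(sample.get(field, ""))) for sample in samples]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "avg": mean(lengths),
        "median": median(lengths),
    }


# Toy records shaped like Dolly rows (contents invented for illustration).
toy_samples = [
    {"instruction": "Name three primary colors.", "context": ""},
    {"instruction": "Summarize the passage.", "context": "Bees pollinate many crops."},
    {"instruction": "What is 2 + 2?", "context": ""},
]
stats = text_length_stats(toy_samples, "context")
print(stats)
```

A field that is empty in most rows has a median of 0 even when its mean is well above 0, which is why the median is reported alongside the average.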


## Dataset Testing and Benchmarking

Now let's test and benchmark different instruction datasets.

In [27]:
def process_dataset_with_endpoint(dataset_info: dict, endpoint_url: str,
                                max_samples: int = None, save_csv: bool = True) -> pd.DataFrame:
    """
    Process dataset instructions through the endpoint and collect results with cost calculation.
    
    Args:
        dataset_info: Dataset information from load_and_sample_dataset
        endpoint_url: Endpoint URL
        max_samples: Maximum number of samples to process (None for all)
        save_csv: Whether to save results to CSV file
    
    Returns:
        DataFrame with all requests and responses
    """
    if 'error' in dataset_info or not dataset_info['samples']:
        print("❌ No valid dataset to process")
        return pd.DataFrame()

    # Model costs in USD per 1M tokens; update these to match your model catalog
    model_costs = {
        "gemini-2.5-flash-lite-preview-06-17": {"input": 0.075, "output": 0.30},
        "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
        "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
        "mistral-small-latest": {"input": 0.10, "output": 0.30},
        "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
        "gpt-4.1": {"input": 2.00, "output": 8.00},
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "o3-mini": {"input": 1.10, "output": 4.40},
        "o4-mini": {"input": 1.10, "output": 4.40},
        "o3": {"input": 10.00, "output": 40.00},
        "gpt-4.5": {"input": 75.00, "output": 150.00},
        "o1": {"input": 15.00, "output": 60.00},
        "o1-pro": {"input": 150.00, "output": 600.00},
        "deepseek-chat": {"input": 0.14, "output": 0.28},
        "deepseek-reasoner": {"input": 0.55, "output": 2.19},
        "grok-3-mini": {"input": 0.30, "output": 0.50},
        "grok-3": {"input": 3.00, "output": 15.00},
        "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
        "claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
        "Qwen/Qwen2.5-14B-Instruct": {"input": 0.12, "output": 0.12},
        "meta-llama/Llama-3.1-8B-Instruct": {"input": 0.10, "output": 0.10},
        "codellama/CodeLlama-13b-Instruct-hf": {"input": 0.11, "output": 0.11},
        "mistralai/Mistral-7B-Instruct-v0.3": {"input": 0.08, "output": 0.08},
        "google/flan-t5-xl": {"input": 0.06, "output": 0.06},
        "microsoft/deberta-v3-large": {"input": 0.04, "output": 0.04},
    }

    def estimate_tokens(text: str) -> int:
        """Rough token estimation (1 token ≈ 4 characters)"""
        return len(text) // 4

    def calculate_cost(model_name: str, input_tokens: int, output_tokens: int) -> float:
        """Calculate cost based on model and token counts"""
        if model_name not in model_costs:
            return 0.0

        costs = model_costs[model_name]
        input_cost = (input_tokens / 1_000_000) * costs["input"]
        output_cost = (output_tokens / 1_000_000) * costs["output"]
        return input_cost + output_cost

    samples = dataset_info['samples']
    if max_samples:
        samples = samples[:max_samples]

    print(f"🚀 Processing {len(samples)} samples through endpoint...")
    print(f"📊 Endpoint: {endpoint_url}")

    results = []
    successful_calls = 0
    failed_calls = 0
    total_cost = 0.0

    for i, sample in enumerate(samples):
        if i % 50 == 0:
            print(f"Progress: {i}/{len(samples)} ({i/len(samples)*100:.1f}%) - Cost so far: ${total_cost:.6f}")

        # Extract fields from sample
        instruction = sample.get('instruction', '')
        context = sample.get('context', '')
        response = sample.get('response', '')  # Use for output token estimation only

        # Create full prompt
        full_prompt = f"{instruction}\n\nContext: {context}" if context.strip() else instruction

        # Call endpoint
        start_time = time.time()
        api_result = call_endpoint(endpoint_url, instruction, context)
        end_time = time.time()

        # Estimate token counts
        input_tokens = estimate_tokens(full_prompt)
        output_tokens = estimate_tokens(response) if response else 100  # Estimate from the dataset's reference response; assume 100 tokens if it is empty

        # Prepare result row
        result_row = {
            'sample_id': i,
            'instruction': instruction,
            'context': context,
            'full_prompt': full_prompt,
            'input_token_estimate': input_tokens,
            'output_token_estimate': output_tokens,
            'api_success': api_result['success'],
            'api_status_code': api_result['status_code'],
            'api_error': api_result['error'],
            'response_time_seconds': end_time - start_time,
            'timestamp': pd.Timestamp.now().isoformat()
        }

        # Add API response fields if successful
        if api_result['success'] and api_result['response']:
            api_response = api_result['response']

            # Extract protocol and model information
            protocol = api_response.get('protocol', '')
            selected_model = ''
            selected_provider = ''
            estimated_cost = 0.0

            # Parse based on protocol type
            if protocol == 'standard' and 'standard' in api_response:
                standard_info = api_response['standard']
                selected_model = standard_info.get('model', '')
                selected_provider = standard_info.get('provider', '')

                # Calculate cost for selected model
                estimated_cost = calculate_cost(selected_model, input_tokens, output_tokens)

            elif protocol == 'minion' and 'minion' in api_response:
                minion_info = api_response['minion']
                selected_model = minion_info.get('model', '')
                selected_provider = 'huggingface'  # Minions are HuggingFace models

                # Calculate cost for selected model
                estimated_cost = calculate_cost(selected_model, input_tokens, output_tokens)

            # Update result with API response details
            result_row.update({
                'api_protocol': protocol,
                'api_selected_model': selected_model,
                'api_selected_provider': selected_provider,
                'api_estimated_cost_usd': estimated_cost,
                'api_full_response': json.dumps(api_response, indent=2)
            })

            total_cost += estimated_cost
            successful_calls += 1
        else:
            # Add empty fields for failed calls
            result_row.update({
                'api_protocol': '',
                'api_selected_model': '',
                'api_selected_provider': '',
                'api_estimated_cost_usd': 0.0,
                'api_full_response': ''
            })
            failed_calls += 1

        results.append(result_row)

        # Small delay to avoid overwhelming the endpoint
        time.sleep(DELAY_BETWEEN_REQUESTS)

    # Create DataFrame
    df = pd.DataFrame(results)

    print("\n✅ Processing complete!")
    print(f"📊 Results: {successful_calls} successful, {failed_calls} failed")
    print(f"📈 Success rate: {successful_calls/(successful_calls+failed_calls)*100:.1f}%")
    print(f"💰 Total estimated cost: ${total_cost:.6f} USD")
    print(f"💰 Average cost per request: ${total_cost/len(samples):.6f} USD")

    if save_csv:
        timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
        filename = f"dolly_endpoint_results_{timestamp}.csv"
        df.to_csv(filename, index=False)
        print(f"💾 Results saved to: {filename}")

        # Also save a summary
        summary_filename = f"dolly_summary_{timestamp}.csv"

        summary_df = pd.DataFrame([{
            'total_samples': len(samples),
            'successful_calls': successful_calls,
            'failed_calls': failed_calls,
            'success_rate_percent': successful_calls/(successful_calls+failed_calls)*100,
            'total_cost_usd': total_cost,
            'avg_cost_per_request_usd': total_cost/len(samples),
            'endpoint_url': endpoint_url,
            'timestamp': pd.Timestamp.now().isoformat()
        }])
        summary_df.to_csv(summary_filename, index=False)
        print(f"📋 Summary saved to: {summary_filename}")

    return df

print("✅ Dataset processing function with cost calculation defined")

✅ Dataset processing function with cost calculation defined
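
The protocol-dependent parsing inside `process_dataset_with_endpoint` can be isolated into a small function, which makes it easier to check against example payloads. This is a sketch under the response shape the notebook reads (`protocol`, `standard.model`, `standard.provider`, `minion.model`); the helper name `extract_selection` and the example payloads are invented, and real responses may carry more fields.

```python
def extract_selection(api_response: dict) -> tuple[str, str]:
    """Return (model, provider) from a response, or ('', '') if the protocol is unrecognized."""
    protocol = api_response.get("protocol", "")
    if protocol == "standard" and "standard" in api_response:
        info = api_response["standard"]
        return info.get("model", ""), info.get("provider", "")
    if protocol == "minion" and "minion" in api_response:
        # The notebook labels every minion model as a HuggingFace model.
        return api_response["minion"].get("model", ""), "huggingface"
    return "", ""


# Invented example payloads for illustration.
standard = {"protocol": "standard", "standard": {"model": "gpt-4o-mini", "provider": "openai"}}
minion = {"protocol": "minion", "minion": {"model": "Qwen/Qwen2.5-14B-Instruct"}}
print(extract_selection(standard))  # ('gpt-4o-mini', 'openai')
print(extract_selection(minion))    # ('Qwen/Qwen2.5-14B-Instruct', 'huggingface')
print(extract_selection({}))        # ('', '')
```

A successful call with an unrecognized protocol therefore yields an empty model name and a cost of zero, just like a failed call.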


In [None]:
# Endpoint configuration and functions
def test_endpoint_connection(url: str) -> bool:
    """Test if the endpoint is accessible."""
    try:
        # For LitServe endpoints, probe the predict URL directly with a GET;
        # a 405 response is handled below as "reachable"
        response = requests.get(url.rstrip('/'), timeout=10)
        if response.status_code == 200:
            print(f"✅ Endpoint {url} is accessible")
            return True
        elif response.status_code == 405:  # Method not allowed (GET on POST endpoint)
            print(f"✅ Endpoint {url} is accessible (405 expected for GET on POST endpoint)")
            return True
        else:
            print(f"⚠️ Endpoint returned status code: {response.status_code}")
            return False
    except requests.exceptions.RequestException as e:
        print(f"❌ Failed to connect to endpoint: {e}")
        return False

def create_model_selection_request(prompt: str, context: str = "") -> dict:
    """Create a request payload matching ModelSelectionRequest structure."""
    full_prompt = f"{prompt}\n\nContext: {context}" if context.strip() else prompt

    # Only active providers: OpenAI, Groq, and DeepSeek
    active_providers = [
        "openai",      # ProviderType.OPENAI
        "groq",        # ProviderType.GROQ (note: grok-3 is an xAI model, not a Groq one)
        "deepseek",    # ProviderType.DEEPSEEK
    ]

    # Correct JSON format matching ModelSelectionRequest from llm_core_models.py
    return {
        "prompt": full_prompt,
        "user_id": None,
        "provider_constraint": active_providers,  # Only include active providers
        "cost_bias": None
    }

def call_endpoint(url: str, prompt: str, context: str = "", timeout: int = None) -> dict:
    """Call the endpoint with a single prompt."""
    if timeout is None:
        timeout = REQUEST_TIMEOUT

    try:
        payload = create_model_selection_request(prompt, context)

        response = requests.post(
            url,
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=timeout
        )

        if response.status_code == 200:
            return {
                "success": True,
                "response": response.json(),
                "status_code": 200,
                "error": None
            }
        else:
            return {
                "success": False,
                "response": None,
                "status_code": response.status_code,
                "error": f"HTTP {response.status_code}: {response.text[:200]}"
            }
    except Exception as e:
        return {
            "success": False,
            "response": None,
            "status_code": None,
            "error": str(e)
        }

# Test endpoint connection
print("Testing endpoint connection...")
endpoint_accessible = test_endpoint_connection(ENDPOINT_URL)
print(f"Endpoint accessible: {endpoint_accessible}")
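
To see what actually goes over the wire, the payload can be serialized locally without contacting any endpoint. The sketch below re-creates the payload logic under a different name (`build_payload`, chosen for this example) so that it runs on its own; the prompt text is invented.

```python
import json


def build_payload(prompt: str, context: str = "") -> dict:
    """Combine prompt and optional context the same way the notebook does."""
    full_prompt = f"{prompt}\n\nContext: {context}" if context.strip() else prompt
    return {
        "prompt": full_prompt,
        "user_id": None,
        "provider_constraint": ["openai", "groq", "deepseek"],
        "cost_bias": None,
    }


# A whitespace-only context is treated as empty, so only the prompt is sent.
body = json.dumps(build_payload("Summarize this.", "  "))
print(body)

# With a real context, the two parts are joined by a blank line and a "Context:" label.
print(build_payload("Summarize this.", "Bees pollinate crops.")["prompt"])
```

Python's `None` is serialized as JSON `null`, so `user_id` and `cost_bias` are sent as explicit nulls rather than omitted.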

In [None]:
# Test endpoint with sample data
if endpoint_accessible:
    print("🧪 Testing endpoint with sample data...")

    # Check if dolly_info is available
    if 'dolly_info' in globals() and 'error' not in dolly_info:
        # Test with first sample from Dolly dataset
        test_sample = dolly_info['samples'][0]
        test_instruction = test_sample['instruction']
        test_context = test_sample['context']

        print(f"Test instruction: {test_instruction}")
        print(f"Test context: {test_context[:100]}..." if len(test_context) > 100 else f"Test context: {test_context}")

        # Create and display the JSON payload
        test_payload = create_model_selection_request(test_instruction, test_context)
        print("\nJSON payload being sent:")
        print(json.dumps(test_payload, indent=2))

        # Test the endpoint
        result = call_endpoint(ENDPOINT_URL, test_instruction, test_context)

        if result['success']:
            print("\n✅ Endpoint test successful!")
            print(f"Status code: {result['status_code']}")
            print("API Response preview:")
            response_preview = json.dumps(result['response'], indent=2)[:500]
            print(f"{response_preview}...")
        else:
            print("\n❌ Endpoint test failed:")
            print(f"Status code: {result['status_code']}")
            print(f"Error: {result['error']}")
    else:
        print("⚠️ Dataset not loaded yet. Please run the dataset loading cell first.")

else:
    print("❌ Endpoint not accessible - skipping test")
    print("💡 Make sure to update ENDPOINT_URL in the configuration section")

In [30]:
# Load and analyze the Dolly dataset
def load_and_sample_dataset(dataset_name: str, sample_size: int = None,
                          dataset_split: str = "train", streaming: bool = False) -> dict:
    """
    Load a dataset from HuggingFace Hub with optional sampling.
    
    Args:
        dataset_name: HuggingFace dataset identifier
        sample_size: Number of samples to load (None for all)
        dataset_split: Dataset split to use
        streaming: Whether to use streaming mode
    
    Returns:
        Dict with dataset info and samples
    """
    try:
        start_time = time.time()

        # Load dataset
        if streaming:
            dataset = load_dataset(dataset_name, split=dataset_split, streaming=True)
            if sample_size:
                dataset = dataset.take(sample_size)
            samples = list(dataset)
        else:
            dataset = load_dataset(dataset_name, split=dataset_split)
            if sample_size:
                # Get a sample
                if sample_size >= len(dataset):
                    samples = list(dataset)
                else:
                    indices = np.random.choice(len(dataset), sample_size, replace=False)
                    samples = [dataset[int(i)] for i in indices]
            else:
                samples = list(dataset)

        load_time = time.time() - start_time

        return {
            'dataset_name': dataset_name,
            'num_samples': len(samples),
            'samples': samples,
            'load_time': load_time,
            'streaming': streaming,
            'split': dataset_split
        }

    except Exception as e:
        return {
            'dataset_name': dataset_name,
            'error': str(e),
            'samples': []
        }

print("✅ Dataset loading function defined")

# Load the Dolly dataset
print(f"🔄 Loading {DATASET_NAME} dataset...")
dolly_info = load_and_sample_dataset(
    DATASET_NAME,
    sample_size=SAMPLE_SIZE,
    dataset_split=DATASET_SPLIT
)

if 'error' not in dolly_info:
    print("✅ Dataset loaded successfully!")
    print(f"📊 Dataset: {dolly_info['dataset_name']}")
    print(f"📊 Samples loaded: {dolly_info['num_samples']:,}")
    print(f"⏱️  Load time: {dolly_info['load_time']:.2f}s")

    # Analyze and display dataset structure
    analyze_dataset_structure(dolly_info)

    # Display sample data
    display_sample_data(dolly_info, num_samples=2)
else:
    print(f"❌ Failed to load dataset: {dolly_info['error']}")

✅ Dataset loading function defined
🔄 Loading databricks/databricks-dolly-15k dataset...
✅ Dataset loaded successfully!
📊 Dataset: databricks/databricks-dolly-15k
📊 Samples loaded: 1,000
⏱️  Load time: 1.96s

DATASET ANALYSIS: databricks/databricks-dolly-15k
Total samples: 1,000
Load time: 1.96s
Streaming mode: False

Schema (4 fields):
  instruction: str (avg length: 45 chars)
  context: str (avg length: 0 chars)
  response: str (avg length: 340 chars)
  category: str (avg length: 13 chars)

Text Statistics (based on 1000 samples):
  instruction:
    Min: 12 chars
    Max: 1759 chars
    Avg: 68.1 chars
    Median: 52.0 chars
  context:
    Min: 0 chars
    Max: 8851 chars
    Avg: 319.6 chars
    Median: 0.0 chars
  response:
    Min: 2 chars
    Max: 4866 chars
    Avg: 364.4 chars
    Median: 189.5 chars
  category:
    Min: 7 chars
    Max: 22 chars
    Avg: 11.8 chars
    Median: 10.0 chars

SAMPLE DATA: databricks/databricks-dolly-15k

--- Sample 1 ---
instruction: Give me a list
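
The loader draws its sample with `np.random.choice` and no fixed seed, so each run benchmarks a different subset. If repeatable runs matter, one option is to seed the draw. The sketch below uses the standard library's `random` module rather than NumPy; the `seed` parameter, the helper name `sample_indices`, and the figure `15_000` (a stand-in for `len(dataset)`) are assumptions for illustration.

```python
import random
from typing import Optional


def sample_indices(total: int, sample_size: Optional[int], seed: int = 42) -> list[int]:
    """Pick row indices without replacement; return every index if no sampling is needed."""
    if sample_size is None or sample_size >= total:
        return list(range(total))
    # A dedicated Random instance keeps the draw reproducible without touching global state.
    return random.Random(seed).sample(range(total), sample_size)


first = sample_indices(15_000, 1000)
second = sample_indices(15_000, 1000)
print(first == second)   # True: same seed, same rows
print(len(set(first)))   # 1000: no duplicates
```

The resulting indices could then be used in place of the unseeded ones, for example as `[dataset[i] for i in indices]`.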

In [31]:
# Process samples through endpoint using configuration
if endpoint_accessible:
    print(f"🚀 Processing {MAX_SAMPLES_TO_PROCESS} samples through endpoint...")

    # Check if dolly_info is available
    if 'dolly_info' in globals() and 'error' not in dolly_info:
        # Process samples using configuration values
        results_1000 = process_dataset_with_endpoint(
            dolly_info,
            ENDPOINT_URL,
            max_samples=MAX_SAMPLES_TO_PROCESS,
            save_csv=SAVE_CSV_RESULTS
        )

        if not results_1000.empty:
            print("\n📊 Results Summary:")
            print(f"Total samples processed: {len(results_1000)}")
            print(f"Successful API calls: {results_1000['api_success'].sum()}")
            print(f"Failed API calls: {(~results_1000['api_success']).sum()}")
            print(f"Total estimated cost: ${results_1000['api_estimated_cost_usd'].sum():.6f}")

            print("\n📋 Sample Results Preview:")
            preview_cols = ['sample_id', 'api_protocol', 'api_selected_model',
                           'api_estimated_cost_usd', 'api_success', 'response_time_seconds']
            print(results_1000[preview_cols].head(10))

            # Calls without a selected model store '' rather than NaN, so filter on non-empty names
            selected = results_1000[results_1000['api_selected_model'] != '']

            print("\n📈 Model Selection Distribution:")
            if not selected.empty:
                model_counts = selected['api_selected_model'].value_counts()
                print(model_counts.head())

            print("\n💰 Cost by Model:")
            if not selected.empty:
                cost_by_model = selected.groupby('api_selected_model')['api_estimated_cost_usd'].agg(['count', 'sum', 'mean'])
                print(cost_by_model.head())

            print("\n📊 Final CSV Columns:")
            print(f"Total columns: {len(results_1000.columns)}")
            print("Column list:", list(results_1000.columns))
        else:
            print("❌ No results to process")
    else:
        print("⚠️ Dataset not loaded yet. Please run the dataset loading cell first.")

else:
    print("❌ Endpoint not accessible - cannot process samples")
    print("💡 Make sure to update ENDPOINT_URL in the configuration section")

❌ Endpoint not accessible - cannot process samples
💡 Make sure to update ENDPOINT_URL in the configuration section
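
Because the run above did not reach the endpoint, no cost breakdown was produced. The sketch below shows the kind of per-model aggregation the cell performs, using a few invented rows with the notebook's column names and only the standard library instead of pandas.

```python
from collections import defaultdict

# Invented rows using the notebook's column names; '' marks a call with no selected model.
rows = [
    {"api_selected_model": "gpt-4o-mini", "api_estimated_cost_usd": 0.00009},
    {"api_selected_model": "gpt-4o-mini", "api_estimated_cost_usd": 0.00003},
    {"api_selected_model": "deepseek-chat", "api_estimated_cost_usd": 0.00002},
    {"api_selected_model": "", "api_estimated_cost_usd": 0.0},
]

totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
for row in rows:
    model = row["api_selected_model"]
    if not model:  # skip rows without a model so they do not form an empty-named group
        continue
    totals[model]["count"] += 1
    totals[model]["sum"] += row["api_estimated_cost_usd"]

for model, agg in sorted(totals.items()):
    mean_cost = agg["sum"] / agg["count"]
    print(f"{model}: n={agg['count']}, total=${agg['sum']:.6f}, mean=${mean_cost:.6f}")
```

The counts correspond to the model selection distribution, and the sums and means correspond to the cost-by-model table.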


## Conclusion

This notebook provides a framework for:

1. **Dataset Loading**: Loading with sampling and streaming options
2. **Structure Analysis**: Inspecting dataset schema and text-length statistics
3. **Endpoint Testing**: Checking connectivity and sending dataset prompts to the model selection endpoint
4. **Cost Estimation**: Estimating per-request and total cost from approximate token counts
5. **Result Export**: Saving per-sample results and a summary to timestamped CSV files

### Usage Tips:

- **For Experimentation**: Use `sample_size` parameter to work with smaller subsets
- **For Large Datasets**: Use `streaming=True` to avoid memory issues
- **For Production**: Consider the load times and implement caching strategies
- **For Quality**: Pay attention to empty fields and text length distributions
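
As one possible reading of the caching tip, the sampled rows can be written to disk once and reused on later runs. The `datasets` library already keeps its own download cache; this sketch instead caches the sampled rows themselves. The helper `load_with_cache`, the file name, and the stand-in loader are all hypothetical, and the rows are assumed to be JSON-serializable.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_with_cache(cache_path: Path, loader: Callable[[], list]) -> list:
    """Return cached samples if the file exists; otherwise call loader() and write the cache."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    samples = loader()
    cache_path.write_text(json.dumps(samples), encoding="utf-8")
    return samples


calls = []


def fake_loader() -> list:
    """Stand-in for a slow download; records each invocation."""
    calls.append(1)
    return [{"instruction": "Say hi.", "context": "", "response": "Hi!"}]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "samples.json"
    first = load_with_cache(path, fake_loader)
    second = load_with_cache(path, fake_loader)

print(len(calls))       # 1: the second call was served from disk
print(first == second)  # True
```

Caching the sampled rows also pins the random subset, so repeated benchmark runs see the same prompts.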

### Next Steps:

1. Extend this notebook to include more datasets
2. Add more sophisticated quality metrics
3. Implement data preprocessing pipelines
4. Create automated benchmarking workflows
5. Add model evaluation capabilities