# PyForge CLI End-to-End Testing - Databricks Serverless

This notebook tests PyForge CLI functionality in a Databricks Serverless environment, using the wheel deployed to a Unity Catalog volume.

## Databricks Widgets
This notebook uses Databricks widgets for parameter configuration. The widgets appear at the top of the notebook after you run the "Initialize Notebook Widgets" cell:

- **sample_datasets_base_path**: Base path for sample datasets installation
  - Default: `/Volumes/cortex_dev_catalog/0000_santosh/volume_sandbox/sample-datasets/`
  - Type: Text input widget
  
- **pyforge_version**: PyForge CLI version to test
  - Default: `1.0.8.dev2`
  - Type: Text input widget
  
- **databricks_username**: Your Databricks username
  - Default: `usa-sdandey@deloitte.com`
  - Type: Text input widget
  
- **force_conversion**: Whether to force overwrite existing conversions
  - Default: `True`
  - Type: Dropdown (True/False)
  
- **use_pyspark_for_csv**: Enable PySpark converter for CSV files
  - Default: `True`
  - Type: Dropdown (True/False)
  
- **test_smallest_files_only**: Test only the smallest file of each type
  - Default: `True`
  - Type: Dropdown (True/False)
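
Widget values are always returned as strings, so the three dropdowns above yield `"True"` or `"False"` rather than booleans. A minimal sketch of the comparison the configuration cell applies to them; the helper name `parse_bool_widget` is hypothetical, since the notebook inlines the expression:

```python
def parse_bool_widget(raw_value: str) -> bool:
    """Interpret a dropdown widget value such as "True" or "False" as a bool.

    Hypothetical helper; the notebook writes this comparison inline.
    """
    # Case-insensitive match, mirroring `.lower() == "true"` in the config cell.
    return raw_value.lower() == "true"


print(parse_bool_widget("True"))
print(parse_bool_widget("False"))
```

Any value other than a case-insensitive `"true"` is treated as `False`, which is why the dropdowns restrict the choices to the two expected strings.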

## Test Configuration
- **Environment**: Databricks Serverless Compute
- **Installation Source**: Unity Catalog Volume (deployed wheel)
- **Sample Data**: Real sample datasets from v1.0.5 release
- **Output Format**: Parquet (optimized for Databricks)

## Prerequisites
1. PyForge CLI wheel deployed to volume via `scripts/deploy_pyforge_to_databricks.py`
2. Unity Catalog access permissions to the specified volume path
3. Workspace access to the CoreDataEngineers folder
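
The notebook expects the deployed wheel at a path built from the `databricks_username` and `pyforge_version` widgets. A minimal sketch of that derivation, using the volume root that appears in the configuration cell; the function name `build_wheel_path` and the placeholder username are illustrative assumptions:

```python
# Volume root taken from the notebook's configuration cell.
WHEEL_ROOT = "/Volumes/cortex_dev_catalog/sandbox_testing/pkgs"


def build_wheel_path(username: str, version: str) -> str:
    """Return the expected wheel location for a user and PyForge version.

    Hypothetical helper; the notebook builds the same string with an f-string.
    """
    return f"{WHEEL_ROOT}/{username}/pyforge_cli-{version}-py3-none-any.whl"


# Placeholder username; substitute your own Databricks username.
print(build_wheel_path("your-username@company.com", "1.0.8.dev2"))
```

If the deployment script wrote the wheel under a different username or version, the environment check cell will report the wheel as missing.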

## ⚠️ Important: PyPI Index URL Configuration
**All `%pip install` commands in this notebook pass an explicit PyPI index URL so that dependencies resolve correctly in corporate environments:**

```python
%pip install package --no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org
```

**Required flags:**
- `--no-cache-dir`: Ensures fresh installation without cached packages
- `--quiet`: Reduces installation output verbosity  
- `--index-url https://pypi.org/simple/`: Specifies PyPI index for dependency resolution
- `--trusted-host pypi.org`: Marks `pypi.org` as trusted even if its HTTPS certificate cannot be validated

This configuration is recorded in `CLAUDE.md` for all future Databricks Serverless notebooks.

## How to Use This Notebook
1. Run the "Initialize Notebook Widgets" cell to create the widgets
2. Modify widget values as needed (they appear at the top of the notebook)
3. Run all remaining cells in sequence
4. Review the test results and summary report

## Widget Benefits
- **No Code Changes**: Modify parameters without editing code cells
- **Persistence**: Widget values persist across cell executions
- **Job Parameters**: Widgets can be passed as parameters when running as Databricks Jobs
- **User-Friendly**: Interactive UI elements for configuration

## Key Features of This Notebook
1. **Improved File Discovery**: Displays all downloaded files with sizes using `dbutils.fs.ls`
2. **Smart File Selection**: Option to test only smallest files or all files
3. **Detailed Observations**: Logs detailed test observations for each conversion
4. **No `--verbose` Flag**: The conversion command no longer passes the unsupported `--verbose` flag
5. **Better Error Handling**: Enhanced error messages and timeout management
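
The file selection and timeout behavior can be sketched as follows. The timeout tiers are the ones used in the conversion cell. The selection logic is an assumption about how "smallest file of each type" could be implemented, and both function names are hypothetical:

```python
def select_smallest_per_type(files):
    """Keep only the smallest file for each file_type.

    Illustrative sketch; the notebook's own discovery code is not shown here.
    """
    smallest = {}
    for info in files:
        current = smallest.get(info["file_type"])
        if current is None or info["size_mb"] < current["size_mb"]:
            smallest[info["file_type"]] = info
    return list(smallest.values())


def timeout_for_size(size_mb):
    """Return the conversion timeout in seconds for a file size in MB."""
    if size_mb > 100:
        return 600  # 10 minutes for large files
    if size_mb > 10:
        return 300  # 5 minutes for medium files
    return 120  # 2 minutes for small files


# Made-up catalog entries, shaped like the dicts the conversion cell reads.
sample_files = [
    {"file_name": "big.csv", "file_type": "CSV", "size_mb": 150.0},
    {"file_name": "small.csv", "file_type": "CSV", "size_mb": 0.5},
    {"file_name": "report.xml", "file_type": "XML", "size_mb": 12.0},
]
selected = select_smallest_per_type(sample_files)
for info in selected:
    print(info["file_name"], timeout_for_size(info["size_mb"]))
```

Testing only the smallest file per type keeps each conversion in the shortest timeout tier where possible, which shortens the end-to-end run.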

In [ ]:
# DBTITLE 1,Configuration Parameters from Widgets
# =============================================================================
# CONFIGURATION SECTION - Using Widget Values
# =============================================================================

# Get widget values
SAMPLE_DATASETS_BASE_PATH = dbutils.widgets.get("sample_datasets_base_path")
PYFORGE_VERSION = dbutils.widgets.get("pyforge_version")
DATABRICKS_USERNAME = dbutils.widgets.get("databricks_username")
FORCE_CONVERSION = dbutils.widgets.get("force_conversion").lower() == "true"
USE_PYSPARK_FOR_CSV = dbutils.widgets.get("use_pyspark_for_csv").lower() == "true"
TEST_SMALLEST_FILES_ONLY = dbutils.widgets.get("test_smallest_files_only").lower() == "true"

# Derived paths
PYFORGE_WHEEL_PATH = f"/Volumes/cortex_dev_catalog/sandbox_testing/pkgs/{DATABRICKS_USERNAME}/pyforge_cli-{PYFORGE_VERSION}-py3-none-any.whl"
SAMPLE_DATASETS_PATH = SAMPLE_DATASETS_BASE_PATH.rstrip('/')  # Remove trailing slash for consistency
CONVERTED_OUTPUT_PATH = SAMPLE_DATASETS_PATH.replace('/sample-datasets', '/converted_output')

print(f"🔧 Configuration (from widgets):")
print(f"   PyForge Version: {PYFORGE_VERSION}")
print(f"   Databricks Username: {DATABRICKS_USERNAME}")
print(f"   PyForge Wheel Path: {PYFORGE_WHEEL_PATH}")
print(f"   Sample Datasets Base Path: {SAMPLE_DATASETS_BASE_PATH}")
print(f"   Sample Datasets Path: {SAMPLE_DATASETS_PATH}")
print(f"   Output Path: {CONVERTED_OUTPUT_PATH}")
print(f"   Force Conversion: {FORCE_CONVERSION}")
print(f"   Use PySpark for CSV: {USE_PYSPARK_FOR_CSV}")
print(f"   Test Smallest Files Only: {TEST_SMALLEST_FILES_ONLY}")

# Validate paths
if not SAMPLE_DATASETS_BASE_PATH.startswith("/Volumes/"):
    print("⚠️  Warning: Sample datasets path should start with /Volumes/ for Unity Catalog volumes")

print("\n📝 Tip: You can change these values using the widgets at the top of the notebook!")

In [ ]:
# DBTITLE 1,Initialize Notebook Widgets
# =============================================================================
# DATABRICKS WIDGETS INITIALIZATION
# =============================================================================

# Remove any existing widgets to ensure clean state
dbutils.widgets.removeAll()

# Create widgets for notebook parameters
dbutils.widgets.text(
    "sample_datasets_base_path", 
    "/Volumes/cortex_dev_catalog/0000_santosh/volume_sandbox/sample-datasets/",
    "Sample Datasets Base Path"
)

dbutils.widgets.text(
    "pyforge_version",
    "1.0.8.dev2",
    "PyForge Version"
)

dbutils.widgets.text(
    "databricks_username",
    "usa-sdandey@deloitte.com",
    "Databricks Username"
)

dbutils.widgets.dropdown(
    "force_conversion",
    "True",
    ["True", "False"],
    "Force Conversion"
)

dbutils.widgets.dropdown(
    "use_pyspark_for_csv",
    "True", 
    ["True", "False"],
    "Use PySpark for CSV"
)

dbutils.widgets.dropdown(
    "test_smallest_files_only",
    "True",
    ["True", "False"],
    "Test Smallest Files Only"
)

# Display widget values
print("📋 Widget Parameters Initialized:")
print(f"   Sample Datasets Base Path: {dbutils.widgets.get('sample_datasets_base_path')}")
print(f"   PyForge Version: {dbutils.widgets.get('pyforge_version')}")
print(f"   Databricks Username: {dbutils.widgets.get('databricks_username')}")
print(f"   Force Conversion: {dbutils.widgets.get('force_conversion')}")
print(f"   Use PySpark for CSV: {dbutils.widgets.get('use_pyspark_for_csv')}")
print(f"   Test Smallest Files Only: {dbutils.widgets.get('test_smallest_files_only')}")

print("\n✅ Widgets created successfully! You can modify the parameters using the widgets above.")
print("📝 Note: Widget values will persist across cell executions until changed.")

# MAGIC %md
# MAGIC ### Using this Notebook in Databricks Jobs
# MAGIC 
# MAGIC When running this notebook as a Databricks Job, you can pass widget values as job parameters:
# MAGIC 
# MAGIC ```json
# MAGIC {
# MAGIC   "notebook_task": {
# MAGIC     "notebook_path": "/path/to/02-test-cli-end-to-end-serverless",
# MAGIC     "base_parameters": {
# MAGIC       "sample_datasets_base_path": "/Volumes/your_catalog/your_schema/sample-datasets/",
# MAGIC       "pyforge_version": "1.0.8",
# MAGIC       "databricks_username": "your-username@company.com",
# MAGIC       "force_conversion": "True",
# MAGIC       "use_pyspark_for_csv": "True",
# MAGIC       "test_smallest_files_only": "True"
# MAGIC     }
# MAGIC   }
# MAGIC }
# MAGIC ```
# MAGIC 
# MAGIC The widgets will automatically use the job parameter values instead of the defaults.

In [ ]:
# DBTITLE 1,Validate Widget Parameters
# =============================================================================
# WIDGET PARAMETER VALIDATION
# =============================================================================

# Validate widget parameters before proceeding
validation_errors = []

# Check sample datasets path
if not SAMPLE_DATASETS_BASE_PATH:
    validation_errors.append("❌ Sample datasets base path cannot be empty")
elif not SAMPLE_DATASETS_BASE_PATH.startswith("/Volumes/"):
    validation_errors.append("⚠️  Sample datasets path should start with /Volumes/ for Unity Catalog volumes")

# Check PyForge version format
if not PYFORGE_VERSION:
    validation_errors.append("❌ PyForge version cannot be empty")
elif not any(char.isdigit() for char in PYFORGE_VERSION):
    validation_errors.append("❌ PyForge version should contain version numbers")

# Check username
if not DATABRICKS_USERNAME:
    validation_errors.append("❌ Databricks username cannot be empty")
elif "@" not in DATABRICKS_USERNAME and "-" not in DATABRICKS_USERNAME:
    validation_errors.append("⚠️  Username format may be incorrect (expected email or ID format)")

# Display validation results
if validation_errors:
    print("⚠️  PARAMETER VALIDATION WARNINGS:")
    for error in validation_errors:
        print(f"   {error}")
    print("\n📝 Please review the widget parameters above and update if needed.")
    
    # For critical errors, stop execution
    critical_errors = [e for e in validation_errors if e.startswith("❌")]
    if critical_errors:
        raise ValueError(f"Critical validation errors found: {critical_errors}")
else:
    print("✅ All widget parameters validated successfully!")
    
# Additional checks for wheel path existence will be done in the next cell
print(f"\n📦 Expected wheel path: {PYFORGE_WHEEL_PATH}")
print("   (Will verify existence in the next cell)")

In [None]:
# DBTITLE 1,Environment Check
# =============================================================================
# ENVIRONMENT VERIFICATION
# =============================================================================

import os
import subprocess
import json
from datetime import datetime

print("🔍 Verifying Databricks Serverless environment...")

# Check if we're in Databricks environment
try:
    dbutils
    print("✅ Running in Databricks environment")
    
    # Get current user info
    current_user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
    print(f"   Current user: {current_user}")
    
    # Check if wheel file exists
    try:
        dbutils.fs.ls(PYFORGE_WHEEL_PATH.replace('/Volumes/', 'dbfs:/Volumes/'))
        print(f"✅ PyForge wheel found: {PYFORGE_WHEEL_PATH}")
    except Exception as e:
        print(f"❌ PyForge wheel not found: {PYFORGE_WHEEL_PATH}")
        print(f"   Error: {e}")
        print("   Please run deployment script first: scripts/deploy_pyforge_to_databricks.py")
        raise
        
except NameError:
    print("❌ Not running in Databricks environment")
    print("   This notebook is designed for Databricks Serverless only")
    raise RuntimeError("This notebook requires Databricks environment")

print(f"\n🕐 Test started at: {datetime.now()}")

In [ ]:
# DBTITLE 1,Install PyForge CLI from Unity Catalog Volume
# =============================================================================
# INSTALLATION FROM DEPLOYED WHEEL WITH PYPI INDEX URL
# =============================================================================

print(f"📦 Installing PyForge CLI from deployed wheel...")
print(f"   Installing from: {PYFORGE_WHEEL_PATH}")
print(f"   Using --no-cache-dir to ensure fresh installation")
print(f"   Using an explicit PyPI index URL for dependency resolution")

# Install PyForge CLI from volume wheel with no cache and proper index URL
%pip install {PYFORGE_WHEEL_PATH} --no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org

print(f"✅ PyForge CLI installed successfully from volume!")
print("🔄 Restarting Python environment to ensure clean import...")

In [ ]:
# Restart Python to ensure clean environment
dbutils.library.restartPython()

In [ ]:
# DBTITLE 1,Re-initialize Configuration After Restart
# =============================================================================
# VARIABLE RE-INITIALIZATION AFTER PYTHON RESTART
# =============================================================================

# Re-initialize all configuration variables from widgets since Python was restarted
# Widgets persist across Python restarts, so we can get the values again

# Get widget values
SAMPLE_DATASETS_BASE_PATH = dbutils.widgets.get("sample_datasets_base_path")
PYFORGE_VERSION = dbutils.widgets.get("pyforge_version")
DATABRICKS_USERNAME = dbutils.widgets.get("databricks_username")
FORCE_CONVERSION = dbutils.widgets.get("force_conversion").lower() == "true"
USE_PYSPARK_FOR_CSV = dbutils.widgets.get("use_pyspark_for_csv").lower() == "true"
TEST_SMALLEST_FILES_ONLY = dbutils.widgets.get("test_smallest_files_only").lower() == "true"

# Derived paths
PYFORGE_WHEEL_PATH = f"/Volumes/cortex_dev_catalog/sandbox_testing/pkgs/{DATABRICKS_USERNAME}/pyforge_cli-{PYFORGE_VERSION}-py3-none-any.whl"
SAMPLE_DATASETS_PATH = SAMPLE_DATASETS_BASE_PATH.rstrip('/')  # Remove trailing slash for consistency
CONVERTED_OUTPUT_PATH = SAMPLE_DATASETS_PATH.replace('/sample-datasets', '/converted_output')

print(f"🔄 Re-initialized configuration variables from widgets after Python restart:")
print(f"   PyForge Version: {PYFORGE_VERSION}")
print(f"   Databricks Username: {DATABRICKS_USERNAME}")
print(f"   PyForge Wheel Path: {PYFORGE_WHEEL_PATH}")
print(f"   Sample Datasets Base Path: {SAMPLE_DATASETS_BASE_PATH}")
print(f"   Sample Datasets Path: {SAMPLE_DATASETS_PATH}")
print(f"   Output Path: {CONVERTED_OUTPUT_PATH}")
print(f"   Force Conversion: {FORCE_CONVERSION}")
print(f"   Use PySpark for CSV: {USE_PYSPARK_FOR_CSV}")
print(f"   Test Smallest Files Only: {TEST_SMALLEST_FILES_ONLY}")

print("\n✅ Configuration restored from widgets successfully!")

In [ ]:
# DBTITLE 1,Re-initialize Configuration After Restart
# =============================================================================
# VARIABLE RE-INITIALIZATION AFTER PYTHON RESTART
# =============================================================================

# Hard-coded fallback values for runs without widgets.
# NOTE: running this cell overrides the widget-derived values set in the previous cell.
SAMPLE_DATASETS_BASE_PATH = "/Volumes/cortex_dev_catalog/0000_santosh/volume_sandbox/sample-datasets/"

PYFORGE_VERSION = "1.0.8.dev2"
DATABRICKS_USERNAME = "usa-sdandey@deloitte.com"  # Update with your username
PYFORGE_WHEEL_PATH = f"/Volumes/cortex_dev_catalog/sandbox_testing/pkgs/{DATABRICKS_USERNAME}/pyforge_cli-{PYFORGE_VERSION}-py3-none-any.whl"

# Paths configuration using the base path parameter
SAMPLE_DATASETS_PATH = SAMPLE_DATASETS_BASE_PATH.rstrip('/')  # Remove trailing slash for consistency
CONVERTED_OUTPUT_PATH = SAMPLE_DATASETS_PATH.replace('/sample-datasets', '/converted_output')

FORCE_CONVERSION = True
USE_PYSPARK_FOR_CSV = True

print(f"🔄 Re-initialized configuration variables after Python restart:")
print(f"   PyForge Version: {PYFORGE_VERSION}")
print(f"   Databricks Username: {DATABRICKS_USERNAME}")
print(f"   PyForge Wheel Path: {PYFORGE_WHEEL_PATH}")
print(f"   Sample Datasets Base Path: {SAMPLE_DATASETS_BASE_PATH}")
print(f"   Sample Datasets Path: {SAMPLE_DATASETS_PATH}")
print(f"   Output Path: {CONVERTED_OUTPUT_PATH}")
print(f"   Force Conversion: {FORCE_CONVERSION}")
print(f"   Use PySpark for CSV: {USE_PYSPARK_FOR_CSV}")

In [None]:
%%sh
echo "📋 PyForge CLI Help Information:"
pyforge --help

In [None]:
%%sh
echo "📊 PyForge CLI Version Information:"
pyforge --version

In [None]:
# DBTITLE 1,Check PySpark Availability in Serverless
# =============================================================================
# PYSPARK AVAILABILITY CHECK FOR SERVERLESS
# =============================================================================

def check_pyspark_availability():
    """Check if PySpark is available in the Databricks Serverless environment."""
    try:
        import pyspark
        from pyspark.sql import SparkSession
        print("✅ PySpark is available in this Databricks Serverless environment")
        print(f"   PySpark Version: {pyspark.__version__}")
        
        # Try to get or create a Spark session
        try:
            spark = SparkSession.builder.getOrCreate()
            print(f"   Spark Session: Active")
            print(f"   Spark Version: {spark.version}")
            
            # Check if it's Spark Connect (serverless)
            try:
                master = spark.sparkContext.master
                print(f"   Spark Master: {master}")
            except Exception:
                print(f"   Spark Mode: Serverless (Spark Connect)")
            
            return True
        except Exception as e:
            print(f"   ⚠️  Could not create Spark session: {e}")
            return False
    except ImportError:
        print("❌ PySpark is NOT available in this environment")
        print("   CSV files will be converted using pandas")
        return False

# Check PySpark availability
pyspark_available = check_pyspark_availability()

# Update USE_PYSPARK_FOR_CSV based on availability
if not pyspark_available and USE_PYSPARK_FOR_CSV:
    print("\n⚠️  Note: PySpark not available, CSV conversion will fall back to pandas")
    USE_PYSPARK_FOR_CSV = False
elif pyspark_available:
    print("\n🚀 PySpark is available! PyForge CLI will auto-detect and use PySpark for CSV conversions")

In [ ]:
# DBTITLE 1,Comprehensive Conversion Testing
# =============================================================================
# BULK CONVERSION TESTING IN DATABRICKS SERVERLESS
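# =============================================================================

# Re-import modules: the earlier imports were lost when Python was restarted
import subprocess
import time
from datetime import datetime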

def run_serverless_conversion_test(file_info):
    """Run conversion test for a single file in Databricks Serverless environment."""
    file_path = file_info['file_path']
    file_type = file_info['file_type']
    file_name = file_info['file_name']
    file_ext = file_info['extension']
    
    # Create output path in volume
    output_name = file_name.split('.')[0]
    output_path = f"{CONVERTED_OUTPUT_PATH}/{file_info['category']}/{output_name}.parquet"
    
    # Build conversion command (removed --verbose flag as it's not supported)
    force_flag = '--force' if FORCE_CONVERSION else ''
    pyspark_flag = '--force-pyspark' if USE_PYSPARK_FOR_CSV and file_ext == '.csv' else ''
    excel_flag = '--separate' if file_ext in ['.xlsx', '.xls'] else ''
    
    cmd = [
        'pyforge', 'convert', file_path, output_path, 
        '--format', 'parquet', force_flag, pyspark_flag, excel_flag
    ]
    cmd = [arg for arg in cmd if arg]  # Remove empty strings
    
    print(f"\n🔄 Converting {file_name} ({file_type})...")
    print(f"   File size: {file_info.get('size_readable', 'Unknown')}")
    print(f"   Command: {' '.join(cmd)}")
    
    # Log test observation
    observation = {
        'file': file_name,
        'type': file_type,
        'size': file_info.get('size_readable', 'Unknown'),
        'start_time': datetime.now().strftime('%H:%M:%S')
    }
    
    try:
        start_time = time.time()
        
        # Set timeout based on file size
        file_size_mb = file_info.get('size_mb', 0)
        if file_size_mb > 100:
            timeout = 600  # 10 minutes for large files
        elif file_size_mb > 10:
            timeout = 300  # 5 minutes for medium files
        else:
            timeout = 120  # 2 minutes for small files
        
        print(f"   Timeout: {timeout}s")
        
        # Run conversion
        result = subprocess.run(
            cmd, 
            capture_output=True, 
            text=True, 
            timeout=timeout
        )
        
        end_time = time.time()
        duration = round(end_time - start_time, 2)
        
        if result.returncode == 0:
            status = 'SUCCESS'
            error_message = None
            # Check if PySpark was used for CSV files
            converter_used = 'PySpark' if (file_ext == '.csv' and 'Using PySpark' in result.stdout) else 'Standard'
            print(f"   ✅ Success ({duration}s) - {converter_used} converter")
            
            # Log observation
            observation['status'] = 'SUCCESS'
            observation['duration'] = f"{duration}s"
            observation['converter'] = converter_used
            
            # Verify output file exists in volume
            try:
                dbutils.fs.ls(output_path.replace('/Volumes/', 'dbfs:/Volumes/'))
                print(f"   ✅ Output file verified in volume")
                observation['output_verified'] = True
            except Exception:
                print(f"   ⚠️  Output file not found in volume")
                observation['output_verified'] = False
                
        else:
            status = 'FAILED'
            error_message = result.stderr.strip() if result.stderr else result.stdout.strip()
            converter_used = 'Unknown'
            print(f"   ❌ Failed ({duration}s)")
            print(f"   Error: {error_message[:200]}...")
            
            # Log observation
            observation['status'] = 'FAILED'
            observation['duration'] = f"{duration}s"
            observation['error'] = error_message[:200]
        
        # Print detailed observation
        print(f"\n📝 Test Observation:")
        for key, value in observation.items():
            print(f"   {key}: {value}")
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': status,
            'duration_seconds': duration,
            'error_message': error_message,
            'output_path': output_path if status == 'SUCCESS' else None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': converter_used,
            'observation': observation
        }
        
    except subprocess.TimeoutExpired:
        observation['status'] = 'TIMEOUT'
        observation['duration'] = f"{timeout}s"
        print(f"   ⏰ Timeout after {timeout}s")
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': 'TIMEOUT',
            'duration_seconds': timeout,
            'error_message': f'Conversion timed out after {timeout} seconds',
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown',
            'observation': observation
        }
    except Exception as e:
        observation['status'] = 'ERROR'
        observation['error'] = str(e)
        print(f"   🚫 Error: {str(e)}")
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': 'ERROR',
            'duration_seconds': 0,
            'error_message': str(e),
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown',
            'observation': observation
        }

def run_bulk_serverless_tests():
    """Run conversion tests for selected files in Databricks Serverless."""
    print(f"\n🚀 Starting conversion tests in Databricks Serverless...")
    print(f"📁 Output directory: {CONVERTED_OUTPUT_PATH}")
    print(f"📊 Test mode: {'Smallest files only' if TEST_SMALLEST_FILES_ONLY else 'All files'}")
    print(f"🔧 Force conversion: {FORCE_CONVERSION}")
    print(f"🚀 Use PySpark for CSV: {USE_PYSPARK_FOR_CSV}")
    
    test_results = []
    test_observations = []
    total_start_time = time.time()
    
    for i, file_info in enumerate(files_catalog, 1):
        print(f"\n{'='*60}")
        print(f"📝 Test {i}/{len(files_catalog)}")
        result = run_serverless_conversion_test(file_info)
        test_results.append(result)
        test_observations.append(result['observation'])
    
    total_end_time = time.time()
    total_duration = round(total_end_time - total_start_time, 2)
    
    # Print test observations summary
    print(f"\n{'='*60}")
    print("📊 TEST OBSERVATIONS SUMMARY:")
    print(f"{'='*60}")
    for obs in test_observations:
        print(f"\n{obs['file']} ({obs['type']}, {obs['size']}):")
        print(f"   Status: {obs['status']}")
        print(f"   Duration: {obs.get('duration', 'N/A')}")
        if 'converter' in obs:
            print(f"   Converter: {obs['converter']}")
        if 'error' in obs:
            print(f"   Error: {obs['error'][:100]}...")
    
    return test_results, total_duration

# Run the bulk conversion tests
print("🎯 Executing conversion tests...")
test_results, total_test_duration = run_bulk_serverless_tests()

print(f"\n🏁 Conversion testing completed in {total_test_duration} seconds!")

# MAGIC %md
# MAGIC ### Conversion Testing Complete
# MAGIC The conversion tests have been executed above. Continue to the next cell for the summary report.

In [None]:
# DBTITLE 1,Setup Sample Datasets in Volume
# =============================================================================
# SAMPLE DATASETS SETUP IN UNITY CATALOG VOLUME
# =============================================================================

# Re-import subprocess: the earlier import was lost when Python was restarted
import subprocess

print(f"📥 Setting up sample datasets in volume: {SAMPLE_DATASETS_PATH}")

# Create volume directories using dbutils
volume_datasets_path = SAMPLE_DATASETS_PATH.replace('/Volumes/', 'dbfs:/Volumes/')
volume_output_path = CONVERTED_OUTPUT_PATH.replace('/Volumes/', 'dbfs:/Volumes/')

try:
    # Create sample datasets directory
    dbutils.fs.mkdirs(volume_datasets_path)
    print(f"✅ Created sample datasets directory: {SAMPLE_DATASETS_PATH}")
    
    # Create output directory
    dbutils.fs.mkdirs(volume_output_path)
    print(f"✅ Created output directory: {CONVERTED_OUTPUT_PATH}")
    
except Exception as e:
    print(f"⚠️  Directory creation warning: {e}")
    print("   Directories may already exist")

# Install sample datasets using PyForge CLI
print("\n📦 Installing sample datasets using PyForge CLI...")
try:
    # Use shell command to install sample datasets to volume path
    result = subprocess.run([
        'pyforge', 'install', 'sample-datasets', SAMPLE_DATASETS_PATH, '--force'
    ], capture_output=True, text=True, timeout=300)
    
    if result.returncode == 0:
        print("✅ Sample datasets installed successfully!")
        print(f"   Output: {result.stdout}")
    else:
        print(f"⚠️  Sample datasets installation had issues: {result.stderr}")
        print("   Proceeding with available data...")
        
except subprocess.TimeoutExpired:
    print("⚠️  Sample datasets installation timed out, creating minimal test datasets...")
except Exception as e:
    print(f"⚠️  Sample datasets installation failed: {e}")
    print("   Creating minimal test datasets in volume...")

# Always create minimal test datasets directly in the volume as a fallback
try:
    # Create test CSV file in volume
    test_csv_data = """id,name,category,value,date
1,Sample Item 1,Category A,100.50,2023-01-01
2,Sample Item 2,Category B,250.75,2023-01-02
3,Sample Item 3,Category A,175.25,2023-01-03
4,Sample Item 4,Category C,90.00,2023-01-04
5,Sample Item 5,Category B,320.80,2023-01-05"""
    
    csv_path = f"{SAMPLE_DATASETS_PATH}/csv/test_data.csv"
    dbutils.fs.mkdirs(f"{volume_datasets_path}/csv")
    dbutils.fs.put(csv_path.replace('/Volumes/', 'dbfs:/Volumes/'), test_csv_data, overwrite=True)
    print(f"✅ Created test CSV file: {csv_path}")
    
    # Create test XML file in volume
    test_xml_data = """<?xml version="1.0" encoding="UTF-8"?>
<data>
    <items>
        <item id="1">
            <name>Sample Item 1</name>
            <category>Category A</category>
            <value>100.50</value>
            <date>2023-01-01</date>
        </item>
        <item id="2">
            <name>Sample Item 2</name>
            <category>Category B</category>
            <value>250.75</value>
            <date>2023-01-02</date>
        </item>
    </items>
</data>"""
    
    xml_path = f"{SAMPLE_DATASETS_PATH}/xml/test_data.xml"
    dbutils.fs.mkdirs(f"{volume_datasets_path}/xml")
    dbutils.fs.put(xml_path.replace('/Volumes/', 'dbfs:/Volumes/'), test_xml_data, overwrite=True)
    print(f"✅ Created test XML file: {xml_path}")
    
except Exception as e:
    print(f"⚠️  Error creating test files: {e}")

print("\n✅ Sample datasets setup completed!")

In [ ]:
# DBTITLE 1,Comprehensive Conversion Testing
# =============================================================================
# BULK CONVERSION TESTING IN DATABRICKS SERVERLESS
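# =============================================================================

# Re-import modules: the earlier imports were lost when Python was restarted
import subprocess
import time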

def run_serverless_conversion_test(file_info, verbose=None):
    """Run conversion test for a single file in Databricks Serverless environment."""
    # Use VERBOSE_OUTPUT if it is defined; otherwise default to non-verbose output
    if verbose is None:
        verbose = globals().get('VERBOSE_OUTPUT', False)
        
    file_path = file_info['file_path']
    file_type = file_info['file_type']
    file_name = file_info['file_name']
    file_ext = file_info['extension']
    
    # Create output path in volume
    output_name = file_name.split('.')[0]
    output_path = f"{CONVERTED_OUTPUT_PATH}/{file_info['category']}/{output_name}.parquet"
    
    # Build conversion command
    force_flag = '--force' if FORCE_CONVERSION else ''
    pyspark_flag = '--force-pyspark' if USE_PYSPARK_FOR_CSV and file_ext == '.csv' else ''
    excel_flag = '--separate' if file_ext in ['.xlsx', '.xls'] else ''
    # Note: '--verbose' is not passed to the CLI because it is not supported;
    # the `verbose` argument only controls how much this notebook prints.
    
    cmd = [
        'pyforge', 'convert', file_path, output_path, 
        '--format', 'parquet', force_flag, pyspark_flag, excel_flag
    ]
    cmd = [arg for arg in cmd if arg]  # Remove empty strings
    
    print(f"🔄 Converting {file_name} ({file_type})...")
    if verbose:
        print(f"   Command: {' '.join(cmd)}")
    
    try:
        start_time = time.time()
        
        # Set timeout based on file size
        file_size_mb = file_info.get('size_mb', 0)
        if file_size_mb > 100:
            timeout = 600  # 10 minutes for large files
        elif file_size_mb > 10:
            timeout = 300  # 5 minutes for medium files
        else:
            timeout = 120  # 2 minutes for small files
        
        print(f"   Timeout: {timeout}s (file size: {file_size_mb:.3f} MB)")
        
        # Run conversion
        result = subprocess.run(
            cmd, 
            capture_output=True, 
            text=True, 
            timeout=timeout
        )
        
        end_time = time.time()
        duration = round(end_time - start_time, 2)
        
        if result.returncode == 0:
            status = 'SUCCESS'
            error_message = None
            # Check if PySpark was used for CSV files
            converter_used = 'PySpark' if (file_ext == '.csv' and 'PySpark' in result.stdout) else 'Standard'
            print(f"  ✅ Success ({duration}s) - {converter_used} converter")
            
            # Verify output file exists in volume
            try:
                dbutils.fs.ls(output_path.replace('/Volumes/', 'dbfs:/Volumes/'))
                print(f"  ✅ Output file verified in volume: {output_path}")
            except Exception:
                print(f"  ⚠️  Output file not found in volume: {output_path}")
                
        else:
            status = 'FAILED'
            error_message = result.stderr.strip() if result.stderr else result.stdout.strip()
            converter_used = 'Unknown'
            print(f"  ❌ Failed ({duration}s)")
            print(f"     Error: {error_message[:200]}...")
            if len(error_message) > 200:
                print(f"     (Full error saved in results)")
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': status,
            'duration_seconds': duration,
            'error_message': error_message,
            'output_path': output_path if status == 'SUCCESS' else None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': converter_used
        }
        
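    except subprocess.TimeoutExpired:
        # Report the timeout in the notebook output as well as in the results
        print(f"  ⏱️ Timed out after {timeout}s")
        return {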
            'file_name': file_name,
            'file_type': file_type,
            'status': 'TIMEOUT',
            'duration_seconds': timeout,
            'error_message': f'Conversion timed out after {timeout} seconds',
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown'
        }
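    except Exception as e:
        # Report unexpected errors (for example, the pyforge executable not being found)
        print(f"  ❌ Error running conversion: {e}")
        return {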
            'file_name': file_name,
            'file_type': file_type,
            'status': 'ERROR',
            'duration_seconds': 0,
            'error_message': str(e),
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown'
        }

def run_bulk_serverless_tests():
    """Run conversion tests for all discovered files in Databricks Serverless."""
    print("🚀 Starting bulk conversion tests in Databricks Serverless...")
    print(f"📁 Output directory: {CONVERTED_OUTPUT_PATH}")
    print(f"📊 Verbose mode: {'ON' if VERBOSE_OUTPUT else 'OFF'}")
    
    test_results = []
    total_start_time = time.time()
    
    for i, file_info in enumerate(files_catalog, 1):
        print(f"\n📝 Test {i}/{len(files_catalog)}: {file_info['file_name']}")
        result = run_serverless_conversion_test(file_info)
        test_results.append(result)
    
    total_end_time = time.time()
    total_duration = round(total_end_time - total_start_time, 2)
    
    return test_results, total_duration

# Run the bulk conversion tests
print("🎯 Executing bulk conversion tests in Databricks Serverless...")
test_results, total_test_duration = run_bulk_serverless_tests()

print(f"\n🏁 Bulk conversion testing completed in {total_test_duration} seconds!")
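
The command assembly in `run_serverless_conversion_test` builds a list with empty-string placeholders and then filters them out. A minimal sketch of an alternative that appends each flag only when it applies is shown below. The function name `build_convert_command` and its parameters are hypothetical; the flags themselves are the ones already used in this notebook.

```python
def build_convert_command(file_path, output_path, file_ext,
                          force=True, use_pyspark_for_csv=True, verbose=False):
    """Assemble the pyforge convert command as a list of arguments.

    Hypothetical helper: it mirrors the flag logic used in this notebook,
    but appends each optional flag only when its condition holds.
    """
    cmd = ['pyforge', 'convert', file_path, output_path, '--format', 'parquet']
    if force:
        cmd.append('--force')
    if use_pyspark_for_csv and file_ext == '.csv':
        cmd.append('--force-pyspark')
    if file_ext in ('.xlsx', '.xls'):
        cmd.append('--separate')
    if verbose:
        cmd.append('--verbose')
    return cmd


# Example: a CSV file with the default settings
print(' '.join(build_convert_command('data.csv', 'out/data.parquet', '.csv')))
```

Because no placeholder strings are created, the list can be passed to `subprocess.run` without a filtering step.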

In [None]:
# DBTITLE 1,Comprehensive Conversion Testing
# =============================================================================
# BULK CONVERSION TESTING IN DATABRICKS SERVERLESS
# =============================================================================

def run_serverless_conversion_test(file_info, verbose=True):
    """Run conversion test for a single file in Databricks Serverless environment."""
    file_path = file_info['file_path']
    file_type = file_info['file_type']
    file_name = file_info['file_name']
    file_ext = file_info['extension']
    
    # Create output path in volume
    # Strip only the final extension so names such as "report.v2.csv" keep their full stem
    output_name = file_name.rsplit('.', 1)[0]
    output_path = f"{CONVERTED_OUTPUT_PATH}/{file_info['category']}/{output_name}.parquet"
    
    # Build conversion command
    force_flag = '--force' if FORCE_CONVERSION else ''
    pyspark_flag = '--force-pyspark' if USE_PYSPARK_FOR_CSV and file_ext == '.csv' else ''
    excel_flag = '--separate' if file_ext in ['.xlsx', '.xls'] else ''
    verbose_flag = '--verbose' if verbose else ''
    
    cmd = [
        'pyforge', 'convert', file_path, output_path, 
        '--format', 'parquet', force_flag, pyspark_flag, excel_flag, verbose_flag
    ]
    cmd = [arg for arg in cmd if arg]  # Remove empty strings
    
    print(f"🔄 Converting {file_name} ({file_type})...")
    if verbose:
        print(f"   Command: {' '.join(cmd)}")
    
    try:
        start_time = time.time()
        
        # Set timeout based on file size
        file_size_mb = file_info.get('size_mb', 0)
        if file_size_mb > 100:
            timeout = 600  # 10 minutes for large files
        elif file_size_mb > 10:
            timeout = 300  # 5 minutes for medium files
        else:
            timeout = 120  # 2 minutes for small files
        
        print(f"   Timeout: {timeout}s (file size: {file_size_mb:.3f} MB)")
        
        # Run conversion
        result = subprocess.run(
            cmd, 
            capture_output=True, 
            text=True, 
            timeout=timeout
        )
        
        end_time = time.time()
        duration = round(end_time - start_time, 2)
        
        if result.returncode == 0:
            status = 'SUCCESS'
            error_message = None
            # Check if PySpark was used for CSV files
            converter_used = 'PySpark' if (file_ext == '.csv' and 'PySpark' in result.stdout) else 'Standard'
            print(f"  ✅ Success ({duration}s) - {converter_used} converter")
            
            # Verify output file exists in volume
            try:
                dbutils.fs.ls(output_path.replace('/Volumes/', 'dbfs:/Volumes/'))
                print(f"  ✅ Output file verified in volume: {output_path}")
            except Exception:
                print(f"  ⚠️  Output file not found in volume: {output_path}")
                
        else:
            status = 'FAILED'
            error_message = result.stderr.strip() if result.stderr else result.stdout.strip()
            converter_used = 'Unknown'
            print(f"  ❌ Failed ({duration}s)")
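            # Append an ellipsis only when the message was actually truncated
            shown_error = error_message[:200] + ('...' if len(error_message) > 200 else '')
            print(f"     Error: {shown_error}")
            if len(error_message) > 200:
                print("     (Full error saved in results)")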
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': status,
            'duration_seconds': duration,
            'error_message': error_message,
            'output_path': output_path if status == 'SUCCESS' else None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': converter_used
        }
        
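    except subprocess.TimeoutExpired:
        # Report the timeout in the notebook output as well as in the results
        print(f"  ⏱️ Timed out after {timeout}s")
        return {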
            'file_name': file_name,
            'file_type': file_type,
            'status': 'TIMEOUT',
            'duration_seconds': timeout,
            'error_message': f'Conversion timed out after {timeout} seconds',
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown'
        }
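    except Exception as e:
        # Report unexpected errors (for example, the pyforge executable not being found)
        print(f"  ❌ Error running conversion: {e}")
        return {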
            'file_name': file_name,
            'file_type': file_type,
            'status': 'ERROR',
            'duration_seconds': 0,
            'error_message': str(e),
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown'
        }

def run_bulk_serverless_tests():
    """Run conversion tests for all discovered files in Databricks Serverless."""
    print("🚀 Starting bulk conversion tests in Databricks Serverless...")
    print(f"📁 Output directory: {CONVERTED_OUTPUT_PATH}")
    
    test_results = []
    total_start_time = time.time()
    
    for i, file_info in enumerate(files_catalog, 1):
        print(f"\n📝 Test {i}/{len(files_catalog)}: {file_info['file_name']}")
        result = run_serverless_conversion_test(file_info, verbose=True)
        test_results.append(result)
    
    total_end_time = time.time()
    total_duration = round(total_end_time - total_start_time, 2)
    
    return test_results, total_duration

# Run the bulk conversion tests
print("🎯 Executing bulk conversion tests in Databricks Serverless...")
test_results, total_test_duration = run_bulk_serverless_tests()

print(f"\n🏁 Bulk conversion testing completed in {total_test_duration} seconds!")
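
The size-based timeout tiers inside `run_serverless_conversion_test` could also be factored into a small helper so the thresholds live in one place. This is only a sketch: the name `select_timeout_seconds` is hypothetical, while the thresholds and timeout values are the ones used in the cell above.

```python
def select_timeout_seconds(size_mb):
    """Return the subprocess timeout (in seconds) for a file of the given size in MB.

    Hypothetical helper mirroring the tiers used in this notebook:
    more than 100 MB -> 600 s, more than 10 MB -> 300 s, otherwise 120 s.
    """
    if size_mb > 100:
        return 600  # 10 minutes for large files
    if size_mb > 10:
        return 300  # 5 minutes for medium files
    return 120      # 2 minutes for small files


for size in (0.5, 10, 25, 100, 250):
    print(f"{size} MB -> {select_timeout_seconds(size)}s")
```

The comparisons are strict, so a file of exactly 10 MB falls in the small tier and a file of exactly 100 MB falls in the medium tier.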

In [None]:
# DBTITLE 1,Validate Converted Files with Spark
# =============================================================================
# CONVERTED FILE VALIDATION USING SPARK
# =============================================================================

def validate_converted_files_with_spark():
    """Validate converted Parquet files using Spark in Databricks Serverless."""
    print("🔍 Validating converted Parquet files with Spark...")
    
    successful_conversions = df_detailed_results[df_detailed_results['status'] == 'SUCCESS']
    validation_results = []
    
    if len(successful_conversions) == 0:
        print("⚠️  No successful conversions to validate.")
        return
    
    for _, result in successful_conversions.iterrows():
        output_path = result['output_path']
        file_name = result['file_name']
        
        try:
            # Try to read the parquet file with Spark
            df_spark = spark.read.parquet(output_path)
            row_count = df_spark.count()
            col_count = len(df_spark.columns)
            
            # Get schema info
            schema_info = [(field.name, str(field.dataType)) for field in df_spark.schema.fields]
            
            validation_results.append({
                'file_name': file_name,
                'status': 'VALID',
                'rows': row_count,
                'columns': col_count,
                'schema_sample': str(schema_info[:3]) if schema_info else 'No schema',
                'error': None
            })
            
            print(f"  ✅ {file_name}: {row_count} rows, {col_count} columns")
            
            # Show a sample of data for small files
            if row_count <= 10:
                print("     Sample data:")
                df_spark.show(5, truncate=False)
            
        except Exception as e:
            validation_results.append({
                'file_name': file_name,
                'status': 'INVALID',
                'rows': 0,
                'columns': 0,
                'schema_sample': None,
                'error': str(e)
            })
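            # Append an ellipsis only when the message was actually truncated
            error_text = str(e)
            print(f"  ❌ {file_name}: Validation failed - {error_text[:100]}{'...' if len(error_text) > 100 else ''}")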
    
    if validation_results:
        print("\n📊 Spark Validation Summary:")
        df_validation = pd.DataFrame(validation_results)
        display(df_validation)
        
        valid_count = len(df_validation[df_validation['status'] == 'VALID'])
        total_count = len(df_validation)
        print(f"\n✅ Validation Results: {valid_count}/{total_count} files are valid Parquet files")
        
        if valid_count == total_count:
            print("🎉 ALL CONVERTED FILES ARE VALID PARQUET FILES!")
            print("✅ All PyForge CLI outputs are readable by Spark in the Databricks Serverless environment")

# Run Spark validation
validate_converted_files_with_spark()
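
Both the conversion results and the validation results are lists of dictionaries with a `status` key, so the same tally can serve both. The sketch below uses only the standard library. The helper name `summarize_statuses` and the sample records are hypothetical and are not output from an actual run.

```python
from collections import Counter


def summarize_statuses(results, ok_status):
    """Count results per status and compute the share matching ok_status.

    Hypothetical helper: `results` is a list of dicts with a 'status' key,
    such as the dictionaries built by the conversion or validation steps.
    """
    counts = Counter(r['status'] for r in results)
    total = sum(counts.values())
    ok_rate = round(counts.get(ok_status, 0) / total * 100, 1) if total else 0.0
    return dict(counts), ok_rate


# Illustrative records only
sample_results = [
    {'file_name': 'a.csv', 'status': 'SUCCESS'},
    {'file_name': 'b.xlsx', 'status': 'SUCCESS'},
    {'file_name': 'c.pdf', 'status': 'FAILED'},
    {'file_name': 'd.mdb', 'status': 'TIMEOUT'},
]
counts, rate = summarize_statuses(sample_results, ok_status='SUCCESS')
print(counts, f"{rate}% succeeded")
```

For the validation step, the same call would use `ok_status='VALID'`.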