# PyForge CLI: Unity Catalog Volume Integration

This notebook demonstrates how to use PyForge CLI in Databricks with Unity Catalog Volumes: installing sample data, copying it into Volumes, and converting it with command-line tools and the PyForge Databricks Python API.

## Introduction

PyForge CLI runs inside Databricks notebooks and can read from and write to Unity Catalog Volume paths. This guide covers:

- Installing and using PyForge sample datasets in Volumes
- Working with Unity Catalog Volumes using CLI commands
- Direct Volume-to-Volume file conversions
- Working with multi-table databases in Volumes
- Batch processing using shell commands
- CLI integration with Databricks file systems

### Key Features:
- **Unity Catalog Volumes**: Direct read/write operations on Volume paths using CLI
- **Shell Integration**: Use %sh magic commands for all file operations  
- **Multi-format Support**: Convert between CSV, Excel, PDF, XML, Access databases and more
- **Volume-Native**: Works directly with `/Volumes/` paths
- **Batch Processing**: Process multiple files efficiently using shell commands

## 1. Installation and Verification

In [None]:
# Install PyForge CLI using Databricks %pip magic
%pip install "pyforge-cli" --quiet

In [None]:
# Restart Python to reload packages
dbutils.library.restartPython()

In [None]:
%sh
# Verify PyForge CLI installation (the %sh magic must be the first line of the cell)
pyforge --version

In [None]:
%sh
# Check available PyForge commands
pyforge --help

In [None]:
%sh
# Display system information
echo "📍 System Information:"
echo "======================"
python --version
echo "🏢 Databricks Runtime: ${DATABRICKS_RUNTIME_VERSION:-unknown}"
echo "💾 Available disk space:"
df -h /tmp
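
The same installation check can also be done from a Python cell, which is convenient when later cells need to branch on whether the CLI is available. The following is a minimal sketch using only the standard library; the helper name `cli_version` is hypothetical, and the only CLI call it relies on is `pyforge --version` from the cell above.

```python
import shutil
import subprocess
from typing import Optional


def cli_version(command: str = "pyforge") -> Optional[str]:
    """Return the output of `<command> --version`, or None if it cannot be run."""
    executable = shutil.which(command)
    if executable is None:
        return None  # not installed, or not on PATH for this Python process
    try:
        completed = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if completed.returncode != 0:
        return None
    # Some tools print their version to stderr, so fall back to it.
    return (completed.stdout or completed.stderr).strip()


version = cli_version()
print(version if version else "pyforge not found on PATH")
```

If this prints the fallback message right after `%pip install`, the Python process probably has not been restarted yet.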

In [None]:
# Import required libraries
import os
import sys
from datetime import datetime
from pathlib import Path
import json

# Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

# PyForge imports
from pyforge_core import PyForgeCore
from pyforge_databricks import PyForgeDatabricks

# Spark imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, current_timestamp, count, sum as spark_sum

print("🚀 PyForge Databricks Integration loaded!")
print(f"📍 Python version: {sys.version.split()[0]}")
print(f"🏢 Databricks Runtime: {spark.conf.get('spark.databricks.clusterUsageTags.sparkVersion')}")

## 2. Environment Detection and Validation

PyForge Databricks detects the compute environment (classic or serverless) when it is initialized and adjusts its processing strategy accordingly.

In [None]:
# Initialize PyForge Databricks
forge = PyForgeDatabricks()

# Get detailed environment information
env_info = forge.env.get_environment_info()

print("🔍 Environment Detection Results:")
print("=" * 60)

# Display environment details
env_details = [
    ("Databricks Environment", env_info['is_databricks'], "✅" if env_info['is_databricks'] else "❌"),
    ("Serverless Compute", env_info['is_serverless'], "⚡" if env_info['is_serverless'] else "🖥️"),
    ("Environment Version", env_info['environment_version'], f"v{env_info['environment_version']}"),
    ("Python Version", env_info['python_version'], "🐍"),
    ("Spark Version", env_info['spark_version'], "✨"),
    ("Cluster Type", env_info['cluster_type'], "🏢")
]

for label, value, icon in env_details:
    print(f"{icon} {label:25}: {value}")

# Validate SDK connection
print("\n🔌 Validating Databricks SDK connection...")
try:
    current_user = forge.w.current_user.me()
    print(f"✅ Connected as: {current_user.display_name}")
    print(f"📧 Email: {current_user.user_name}")
    sdk_connected = True
except Exception as e:
    print(f"❌ SDK connection failed: {str(e)}")
    print("   Note: Some features may be limited without SDK access")
    sdk_connected = False
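
How `forge.env.get_environment_info()` detects the environment is not shown in this notebook. As a rough illustration of the idea, the sketch below inspects environment variables. `DATABRICKS_RUNTIME_VERSION` is the variable used in the system information cell earlier; the `IS_SERVERLESS` flag and the function name `detect_environment` are assumptions made for this example and may not match what PyForge checks.

```python
import os
from typing import Mapping, Optional


def detect_environment(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Classify the compute environment from environment variables."""
    env = os.environ if environ is None else environ
    runtime = env.get("DATABRICKS_RUNTIME_VERSION")
    # Assumption: serverless compute exposes IS_SERVERLESS=TRUE.
    is_serverless = env.get("IS_SERVERLESS", "").strip().upper() == "TRUE"
    return {
        "is_databricks": bool(runtime),
        "is_serverless": is_serverless,
        "runtime_version": runtime,
    }


print(detect_environment())
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise outside Databricks, for example `detect_environment({"DATABRICKS_RUNTIME_VERSION": "14.3"})`.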

## 3. Unity Catalog Volume Setup

Configure Unity Catalog paths and verify Volume access using shell and file system magic commands.

In [None]:
%sh
# Install sample datasets to a temporary location first
echo "📦 Installing PyForge Sample Datasets..."
pyforge install sample-datasets /tmp/pyforge_samples --formats csv,excel,xml,access,dbf --sizes small,medium

In [None]:
# Set up Volume paths for data governance
volume_config = {
    'catalog': 'main',
    'schema': 'default', 
    'bronze_volume': 'bronze_data',
    'silver_volume': 'silver_data',
    'gold_volume': 'gold_data'
}

# Construct Volume paths
bronze_path = f"/Volumes/{volume_config['catalog']}/{volume_config['schema']}/{volume_config['bronze_volume']}"
silver_path = f"/Volumes/{volume_config['catalog']}/{volume_config['schema']}/{volume_config['silver_volume']}"
gold_path = f"/Volumes/{volume_config['catalog']}/{volume_config['schema']}/{volume_config['gold_volume']}"

print("🏗️ Unity Catalog Volume Configuration:")
print(f"   Bronze Layer: {bronze_path}")
print(f"   Silver Layer: {silver_path}")
print(f"   Gold Layer:   {gold_path}")

# Verify Volume access
try:
    dbutils.fs.ls(bronze_path)
    print("✅ Volume access confirmed")
except Exception as e:
    print(f"⚠️ Volume access issue: {e}")
    print("   Falling back to /tmp paths for this demo...")
    bronze_path = "/tmp/bronze_data"
    silver_path = "/tmp/silver_data"
    gold_path = "/tmp/gold_data"
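
The three layer paths all follow the `/Volumes/<catalog>/<schema>/<volume>` pattern, so the repeated f-strings can be folded into one helper. This is a small sketch; the name `volume_path` is hypothetical, and the identifier check is deliberately stricter than Unity Catalog itself (it only accepts letters, digits, and underscores), which is an assumption chosen to catch typos early.

```python
import re
from pathlib import PurePosixPath

# Conservative assumption: plain identifiers only (letters, digits, underscore).
_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build /Volumes/<catalog>/<schema>/<volume>[/<parts>...] as a string."""
    for label, name in (("catalog", catalog), ("schema", schema), ("volume", volume)):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Invalid {label} name: {name!r}")
    # Sub-path parts should be relative; a leading '/' would reset the path.
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


print(volume_path("main", "default", "bronze_data"))
print(volume_path("main", "default", "silver_data", "converted_data"))
```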

## 4. Copy Sample Data to Volume (Bronze Layer)

Transfer sample datasets to Unity Catalog Volumes for governed data access.

In [None]:
# Copy sample datasets to the Bronze Volume.
# dbutils.fs is used instead of %fs because %fs magic lines cannot be mixed
# with Python in one cell and do not expand Python variables.
dbutils.fs.mkdirs(f"{bronze_path}/sample_datasets/")

# Copy different format samples
dbutils.fs.cp("file:/tmp/pyforge_samples/csv/small/", f"{bronze_path}/sample_datasets/csv/", recurse=True)
dbutils.fs.cp("file:/tmp/pyforge_samples/excel/small/", f"{bronze_path}/sample_datasets/excel/", recurse=True)
dbutils.fs.cp("file:/tmp/pyforge_samples/access/small/", f"{bronze_path}/sample_datasets/access/", recurse=True)
dbutils.fs.cp("file:/tmp/pyforge_samples/dbf/small/", f"{bronze_path}/sample_datasets/dbf/", recurse=True)

# Verify copy completed
display(dbutils.fs.ls(f"{bronze_path}/sample_datasets/"))
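
The copy commands above repeat the same source and destination pattern once per format. One way to keep that manageable as formats are added is to generate the pairs first and copy in a loop. The sketch below only builds the plan, so it runs anywhere; `build_copy_plan` is a hypothetical helper, and the directory layout it assumes is the one used in the cell above.

```python
from typing import List, Sequence, Tuple


def build_copy_plan(
    local_root: str,
    volume_root: str,
    formats: Sequence[str],
    size: str = "small",
) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs for each sample-data format."""
    plan = []
    for fmt in formats:
        source = f"file:{local_root.rstrip('/')}/{fmt}/{size}/"
        destination = f"{volume_root.rstrip('/')}/sample_datasets/{fmt}/"
        plan.append((source, destination))
    return plan


plan = build_copy_plan(
    "/tmp/pyforge_samples",
    "/Volumes/main/default/bronze_data",
    ["csv", "excel", "access", "dbf"],
)
for source, destination in plan:
    print(f"{source} -> {destination}")
    # In a Databricks notebook, each pair could then be copied with:
    # dbutils.fs.cp(source, destination, recurse=True)
```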

## 5. Volume-to-Volume Data Conversion

Demonstrate direct Volume-to-Volume conversions using PyForge Databricks.

In [None]:
# Create Silver layer directory (dbutils.fs accepts Python variables; %fs does not)
dbutils.fs.mkdirs(f"{silver_path}/converted_data/")

# Convert CSV data from Bronze to Silver using Volume paths
bronze_csv = f"{bronze_path}/sample_datasets/csv/titanic-dataset.csv"
silver_parquet = f"{silver_path}/converted_data/titanic.parquet"

print("🔄 Volume-to-Volume Conversion:")
print(f"   Source (Bronze): {bronze_csv}")
print(f"   Target (Silver): {silver_parquet}")

# Perform conversion using PyForge Databricks
result = forge.convert_from_volume(
    source_volume_path=bronze_csv,
    target_volume_path=silver_parquet,
    output_format='parquet'
)

print(f"✅ Conversion completed in {result['duration']:.2f}s")
print(f"   Rows processed: {result['row_count']:,}")

# Verify converted file
display(dbutils.fs.ls(f"{silver_path}/converted_data/"))
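
A quick way to sanity-check the reported `row_count` is to count the rows of the source CSV independently and compare the two numbers. The sketch below uses the standard `csv` module so that quoted fields containing line breaks are counted as a single row. It assumes the file has exactly one header row and that the Volume path is readable through the local file system; `count_csv_rows` is a hypothetical helper, and the demo rows are made up.

```python
import csv
import os
import tempfile


def count_csv_rows(path: str, has_header: bool = True) -> int:
    """Count data rows in a CSV file, optionally excluding one header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        total = sum(1 for _ in csv.reader(handle))
    return max(total - 1, 0) if has_header else total


# Self-contained demo on a throwaway file with made-up rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv = os.path.join(tmp_dir, "demo.csv")
    with open(demo_csv, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(
            [["name", "note"], ["a", "line one\nline two"], ["b", "plain"]]
        )
    demo_rows = count_csv_rows(demo_csv)

print(demo_rows)
```

In the notebook, `count_csv_rows(bronze_csv)` could be compared against `result['row_count']` after the conversion.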

## 6. Multi-Table Database Processing in Volumes

Extract all tables from Access databases directly into Volume storage.

In [None]:
# Process Northwind database from Bronze to Silver
bronze_db = f"{bronze_path}/sample_datasets/access/Northwind_2007_VBNet.accdb"
silver_tables = f"{silver_path}/northwind_tables/"

print("🗄️ Multi-Table Database Processing:")
print(f"   Source Database: {bronze_db}")
print(f"   Target Directory: {silver_tables}")

# Extract all tables using PyForge Databricks
result = forge.extract_database_tables(
    source_volume_path=bronze_db,
    target_volume_path=silver_tables,
    output_format='delta',  # Use Delta format for versioning
    extract_all=True
)

print("✅ Database extraction completed:")
print(f"   Tables extracted: {result['table_count']}")
print(f"   Total rows: {result['total_rows']:,}")

# List extracted tables
display(dbutils.fs.ls(silver_tables))
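
Access table names often contain spaces or punctuation (for example `Order Details`), which are awkward in storage paths. How PyForge names the extracted outputs is not shown here; the sketch below illustrates one plausible normalization scheme. The function `table_dir_name` and its rules are assumptions for illustration, not PyForge behavior.

```python
import re


def table_dir_name(table_name: str) -> str:
    """Normalize a table name into a lowercase, path-safe directory name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", table_name.strip())
    cleaned = cleaned.strip("_").lower()
    return cleaned or "unnamed_table"


for name in ["Order Details", "Employees", "Purchase-Orders (2007)"]:
    print(f"{name!r} -> {table_dir_name(name)!r}")
```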

## 7. Serverless Compute Optimization

Demonstrate automatic optimization for serverless vs. classic compute.

In [None]:
# Process multiple files with automatic optimization
bronze_excel = f"{bronze_path}/sample_datasets/excel/financial-sample.xlsx"
silver_excel = f"{silver_path}/financial_data/"

print("⚡ Serverless Optimization Demo:")
print(f"   Environment: {'Serverless' if env_info['is_serverless'] else 'Classic Compute'}")

# Convert with automatic optimization
result = forge.convert_optimized(
    source_volume_path=bronze_excel,
    target_volume_path=silver_excel,
    optimization_level='auto',  # Let PyForge choose optimal strategy
    enable_caching=bool(env_info['is_serverless'])
)

print(f"✅ Optimized conversion completed:")
print(f"   Processing mode: {result['processing_mode']}")
print(f"   Memory usage: {result['peak_memory_mb']}MB")
print(f"   Cache hits: {result.get('cache_hits', 0)}")

# Show performance metrics
if 'performance_metrics' in result:
    metrics = result['performance_metrics']
    print(f"\n📊 Performance Metrics:")
    for metric, value in metrics.items():
        print(f"   {metric}: {value}")

## 8. Delta Lake Integration and Gold Layer

Create curated, analytics-ready datasets in the Gold layer using Delta format.

In [None]:
# Create Gold layer with aggregated data (dbutils.fs accepts Python variables; %fs does not)
dbutils.fs.mkdirs(f"{gold_path}/analytics/")

# Load Silver data and create Gold aggregations
titanic_df = spark.read.parquet(silver_parquet)

# Create survival analysis (Gold layer)
survival_analysis = titanic_df.groupBy("Pclass", "Sex").agg(
    count("*").alias("total_passengers"),
    spark_sum("Survived").alias("survivors"),
    (spark_sum("Survived") / count("*") * 100).alias("survival_rate")
).withColumn("analysis_date", current_timestamp())

# Write to Gold layer as Delta table
gold_survival_path = f"{gold_path}/analytics/titanic_survival_analysis"

survival_analysis.write \
    .format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .save(gold_survival_path)

print("🏆 Gold Layer Analytics Created:")
print(f"   Delta table: {gold_survival_path}")
print(f"   Records: {survival_analysis.count()}")

# Display results
display(survival_analysis.orderBy("Pclass", "Sex"))
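
To make the aggregation concrete, the same group-by logic can be written in plain Python: for each `(Pclass, Sex)` pair, count the passengers, sum the `Survived` flags, and divide. The rows below are made-up toy data, not values from the Titanic dataset, and `survival_by_group` is a hypothetical helper that mirrors the Spark expression above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def survival_by_group(rows: Iterable[dict]) -> Dict[Tuple[int, str], dict]:
    """Aggregate passenger count, survivors, and survival rate per (Pclass, Sex)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passengers, survivors]
    for row in rows:
        key = (row["Pclass"], row["Sex"])
        totals[key][0] += 1
        totals[key][1] += row["Survived"]
    return {
        key: {
            "total_passengers": passengers,
            "survivors": survivors,
            "survival_rate": survivors / passengers * 100,
        }
        for key, (passengers, survivors) in sorted(totals.items())
    }


# Made-up rows for illustration only.
toy_rows = [
    {"Pclass": 1, "Sex": "female", "Survived": 1},
    {"Pclass": 1, "Sex": "female", "Survived": 1},
    {"Pclass": 3, "Sex": "male", "Survived": 0},
    {"Pclass": 3, "Sex": "male", "Survived": 1},
]
summary = survival_by_group(toy_rows)
for key, stats in summary.items():
    print(key, stats)
```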

## 9. Batch Processing with Distributed Computing

Process multiple files using Spark's distributed capabilities.

In [None]:
# Batch process multiple DBF files using distributed computing
dbf_bronze = f"{bronze_path}/sample_datasets/dbf/"
dbf_silver = f"{silver_path}/geographic_data/"

print("🔄 Distributed Batch Processing:")
print(f"   Source: {dbf_bronze}")
print(f"   Target: {dbf_silver}")

# Process all DBF files in parallel
batch_result = forge.batch_convert_distributed(
    source_volume_path=dbf_bronze,
    target_volume_path=dbf_silver,
    output_format='delta',
    parallelism=4,  # Number of parallel tasks
    enable_spark_optimization=True
)

print(f"✅ Batch processing completed:")
print(f"   Files processed: {batch_result['files_processed']}")
print(f"   Total records: {batch_result['total_records']:,}")
print(f"   Processing time: {batch_result['total_duration']:.2f}s")
print(f"   Throughput: {batch_result['records_per_second']:,.0f} records/sec")

# List converted files
display(dbutils.fs.ls(dbf_silver))
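
The internals of `batch_convert_distributed` are not shown in this notebook. To illustrate the general shape of a bounded-parallelism batch job and how the reported fields relate to each other, the sketch below uses a thread pool on a single machine rather than Spark. `batch_convert` and `fake_convert` are hypothetical, and the per-file converter is a stand-in that returns a record count.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def batch_convert(
    paths: Iterable[str],
    convert_one: Callable[[str], int],
    parallelism: int = 4,
) -> dict:
    """Run convert_one over paths with bounded parallelism and collect totals."""
    started = time.perf_counter()
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()  # record count for this file
            except Exception as exc:  # keep going when one file fails
                failures[path] = str(exc)
    duration = time.perf_counter() - started
    total_records = sum(results.values())
    return {
        "files_processed": len(results),
        "files_failed": len(failures),
        "total_records": total_records,
        "total_duration": duration,
        "records_per_second": total_records / duration if duration > 0 else 0.0,
        "failures": failures,
    }


def fake_convert(path: str) -> int:
    """Stand-in converter: pretends every .dbf file holds 10 records."""
    if not path.endswith(".dbf"):
        raise ValueError("unsupported format")
    return 10


report = batch_convert(["a.dbf", "b.dbf", "notes.txt"], fake_convert)
print(report["files_processed"], report["files_failed"], report["total_records"])
```

Collecting failures per file, instead of letting the first exception abort the batch, lets the caller see which files need attention after the run.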

## 10. Monitoring and Performance Analysis

Monitor conversion performance and resource usage.

In [None]:
# Get comprehensive performance metrics
performance_report = forge.get_performance_metrics()

print("📊 PyForge Databricks Performance Report:")
print("=" * 50)

# Display key metrics
metrics = [
    ("Total Conversions", performance_report['total_conversions']),
    ("Success Rate", f"{performance_report['success_rate']:.1f}%"),
    ("Avg Processing Time", f"{performance_report['avg_duration']:.2f}s"),
    ("Total Data Processed", f"{performance_report['total_data_gb']:.2f}GB"),
    ("Peak Memory Usage", f"{performance_report['peak_memory_mb']}MB"),
    ("Cache Hit Rate", f"{performance_report['cache_hit_rate']:.1f}%")
]

for metric, value in metrics:
    print(f"   {metric:20}: {value}")

# Show format-specific performance
print("\n📋 Performance by Format:")
for fmt, stats in performance_report['format_stats'].items():
    print(f"   {fmt:10}: {stats['count']:3} files, {stats['avg_duration']:.2f}s avg")

# Volume usage summary
print("\n💾 Volume Storage Summary:")
layer_paths = {'bronze': bronze_path, 'silver': silver_path, 'gold': gold_path}
for layer, path in layer_paths.items():
    try:
        entries = dbutils.fs.ls(path)
        # Directory entries end with '/'; count only top-level files
        file_count = len([f for f in entries if not f.name.endswith('/')])
        print(f"   {layer.title():6} Layer: {file_count} files")
    except Exception:
        print(f"   {layer.title():6} Layer: Not accessible")
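
The summary above counts only files at the top level of each layer, so files nested in subdirectories (such as the part files inside a Delta table directory) are not included. A recursive count is one way to get a fuller picture. The sketch below uses `os.walk` and assumes the layer path is readable through the local file system; `summarize_directory` is a hypothetical helper, and the demo builds a throwaway directory so it runs anywhere.

```python
import os
import tempfile


def summarize_directory(root: str) -> dict:
    """Recursively count files and total bytes under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_count += 1
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file disappeared or is unreadable; still counted
    return {"file_count": file_count, "total_bytes": total_bytes}


# Demo on a temporary directory shaped like a small table folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, "table", "_delta_log"))
    demo_files = (
        (("table", "part-0.parquet"), b"12345"),
        (("table", "_delta_log", "0.json"), b"{}"),
    )
    for parts, payload in demo_files:
        with open(os.path.join(tmp_dir, *parts), "wb") as handle:
            handle.write(payload)
    demo_summary = summarize_directory(tmp_dir)

print(demo_summary)
```

A path that does not exist yields zero files rather than an error, so callers that need to tell "empty" from "missing" should check `os.path.isdir(root)` first.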

## Summary

This notebook walked through the following **PyForge Databricks** capabilities:

### ✅ **Key Achievements:**

🏗️ **Unity Catalog Integration**: Direct Volume operations with governed data access  
⚡ **Serverless Optimization**: Automatic detection and performance tuning  
🔄 **Volume-to-Volume Conversions**: Seamless data movement between layers  
🗄️ **Multi-Table Processing**: Intelligent database extraction to Volumes  
📊 **Delta Lake Integration**: Versioned, ACID-compliant data storage  
🚀 **Distributed Processing**: Spark-powered batch operations  
📈 **Performance Monitoring**: Comprehensive metrics and optimization  

### 🎯 **Production Ready Features:**

- **Medallion Architecture**: Bronze → Silver → Gold data pipeline
- **Automatic Optimization**: Adapts to serverless vs. classic compute
- **Error Resilience**: Robust handling of distributed operations
- **Performance Monitoring**: Real-time metrics and optimization insights
- **Governance Ready**: Full Unity Catalog and Volume integration

### 🚀 **Next Steps:**

1. **Scale Up**: Process larger datasets with distributed computing
2. **Automate**: Create scheduled workflows using Databricks Jobs
3. **Monitor**: Set up alerts based on performance metrics
4. **Govern**: Implement data lineage and quality checks
5. **Optimize**: Fine-tune settings for your specific workloads

**PyForge Databricks combines Volume-native file conversion with Spark-based batch processing across the Bronze, Silver, and Gold layers.** 🚀