# PyForge CLI: Data Format Conversion in Databricks

This notebook demonstrates how to use PyForge CLI for data format conversion in Databricks. PyForge CLI is a command-line tool that converts between data formats such as CSV, Excel, PDF, XML, Access databases, and DBF.

## Introduction

PyForge CLI is a data format conversion tool that runs in Databricks notebooks through shell magic commands. This guide walks you through:

- Installing PyForge CLI in Databricks
- Installing and using sample datasets
- Converting between various file formats using CLI commands
- Working with multi-table databases and auto-detection
- Running the CLI operations with the `%sh` magic command
- Relying on the CLI's automatic format detection

## 1. Installation and Setup

First, let's install PyForge CLI and verify our environment using Databricks shell magic commands.

In [None]:
# Install PyForge CLI using %pip magic
%pip install "pyforge-cli" --quiet

# Restart Python to reload packages  
dbutils.library.restartPython()

In [None]:
# Verify PyForge CLI installation and check version
%sh
pyforge --version

# Check available PyForge commands
%sh 
pyforge --help

# Display system information
%sh
echo "📍 System Information:"
echo "======================"  
python --version
echo "🏢 Databricks Runtime: ${DATABRICKS_RUNTIME_VERSION:-not set}"
echo "💾 Available disk space:"
df -h /tmp
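
The same check can be made from a Python cell, which is handy when later cells should stop early if the CLI is missing. The sketch below is a minimal illustration, not part of PyForge: `pyforge_version` is a hypothetical helper name, and it assumes only that the `pyforge` executable is on the driver's `PATH` and accepts `--version`, as the cell above does.

```python
import shutil
import subprocess


def pyforge_version(executable="pyforge"):
    """Return the `--version` output, or None if the executable is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some CLIs print the version to stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()


version = pyforge_version()
print(version or "pyforge not found on PATH; rerun the %pip cell above")
```

Returning `None` instead of raising keeps the helper usable as a simple guard at the top of later cells.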

## 2. Installing Sample Datasets

PyForge CLI provides an `install sample-datasets` command that installs curated sample datasets for its supported formats. We'll use shell magic commands to install and explore these datasets.

In [None]:
# Display available PyForge CLI commands
%sh
echo "📋 PyForge CLI Commands:"
echo "========================"
pyforge --help

# Show supported formats
%sh
echo "🎯 Supported Formats:"
echo "===================="
pyforge formats

### Install Sample Datasets Using Shell Magic

In [None]:
# Install sample datasets using PyForge CLI
%sh
echo "📦 Installing PyForge Sample Datasets..."
pyforge install sample-datasets /tmp/pyforge_samples --formats csv,excel,xml,pdf,access,dbf --sizes small,medium

# Verify installation success
%sh
echo "✅ Installation completed. Checking directory structure..."
ls -la /tmp/pyforge_samples/

### Explore Installed Datasets

In [None]:
# List the installed sample datasets by format
%sh  
echo "📊 Exploring Sample Dataset Structure:"
echo "===================================="
for format in csv excel xml pdf access dbf; do
    echo ""
    echo "📁 $format format:"
    ls -la /tmp/pyforge_samples/$format/small/ 2>/dev/null || echo "  (no small files for $format)"
    ls -la /tmp/pyforge_samples/$format/medium/ 2>/dev/null || echo "  (no medium files for $format)"
done

# Show total dataset sizes
%sh
echo "💾 Dataset Size Summary:"
echo "======================="
du -sh /tmp/pyforge_samples/*/
echo ""
echo "📊 Total collection size:"
du -sh /tmp/pyforge_samples/

In [None]:
# Display file details for each format
%sh
echo "📋 Sample Dataset Details:"
echo "========================="
echo ""
echo "📈 CSV Files (Analytics Data):"
ls -lh /tmp/pyforge_samples/csv/small/
echo ""
echo "📊 Excel Files (Business Data):"  
ls -lh /tmp/pyforge_samples/excel/small/
echo ""
echo "📄 PDF Files (Document Data):"
ls -lh /tmp/pyforge_samples/pdf/small/ 2>/dev/null || echo "  No PDF files in small category"
echo ""
echo "🗄️ Access Database Files (Multi-table Data):"
ls -lh /tmp/pyforge_samples/access/small/
echo ""
echo "🗺️ DBF Files (Geographic Data):"
ls -lh /tmp/pyforge_samples/dbf/small/
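
The `ls` and `du` listings above can also be collected into a Python structure for use in later cells. This is a rough sketch using only the standard library; `summarize_samples` is a hypothetical helper, and it assumes the `<root>/<format>/<size>/` layout that the listings above rely on.

```python
from pathlib import Path


def summarize_samples(root):
    """Map each format directory under `root` to (file count, total bytes)."""
    summary = {}
    root = Path(root)
    if not root.is_dir():
        return summary
    for format_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Walk every size category (small, medium, ...) below the format folder.
        files = [f for f in format_dir.rglob("*") if f.is_file()]
        summary[format_dir.name] = (len(files), sum(f.stat().st_size for f in files))
    return summary


for fmt, (count, size) in summarize_samples("/tmp/pyforge_samples").items():
    print(f"{fmt:8} {count:3} files {size:>12,} bytes")
```

If the sample directory has not been installed yet, the function returns an empty dictionary and the loop prints nothing.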

## 3. Basic File Conversion - CSV to Parquet

Let's start with a simple CSV conversion using the Titanic dataset and CLI commands.

### Example 1: Convert CSV to Parquet using CLI
*Command: `pyforge convert titanic-dataset.csv titanic.parquet`*

In [None]:
# Create output directory
%sh
mkdir -p /tmp/pyforge_output

# Check the source CSV file
%sh
echo "📁 Source file information:"
ls -lh /tmp/pyforge_samples/csv/small/titanic-dataset.csv

# Convert CSV to Parquet using PyForge CLI
%sh
echo "🔄 Converting Titanic CSV to Parquet..."
pyforge convert /tmp/pyforge_samples/csv/small/titanic-dataset.csv /tmp/pyforge_output/titanic.parquet

# Verify conversion completed
%sh
echo "✅ Conversion completed! Output file:"
ls -lh /tmp/pyforge_output/titanic.parquet
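
When the conversion needs to be driven from Python rather than `%sh`, the same command can be wrapped with `subprocess`. The sketch below is an illustration only: `convert` is a hypothetical wrapper, it uses just the `pyforge convert <input> <output>` form shown above, and it assumes the CLI signals failure with a non-zero exit code.

```python
import subprocess
from pathlib import Path


def convert(src, dst, runner=subprocess.run):
    """Run `pyforge convert src dst` and return the completed process.

    `runner` is injectable so the wrapper can be exercised without PyForge.
    """
    # Make sure the output directory exists before the CLI writes to it.
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    result = runner(
        ["pyforge", "convert", str(src), str(dst)],
        capture_output=True, text=True, check=False,
    )
    # Assumption: a non-zero exit code means the conversion failed.
    if result.returncode != 0:
        raise RuntimeError(f"pyforge convert failed for {src}: {result.stderr.strip()}")
    return result
```

In this notebook the call would be `convert("/tmp/pyforge_samples/csv/small/titanic-dataset.csv", "/tmp/pyforge_output/titanic.parquet")`. Raising on failure makes a broken conversion stop the cell instead of surfacing later as a Spark read error.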

### Example 2: File Information using CLI

In [None]:
# Get detailed file information using PyForge CLI
%sh
echo "📋 File Information:"
echo "==================="
pyforge info /tmp/pyforge_samples/csv/small/titanic-dataset.csv

# Show first few lines of the CSV
%sh
echo ""
echo "📊 CSV Content Preview:"
echo "======================="
head -n 6 /tmp/pyforge_samples/csv/small/titanic-dataset.csv

### Example 3: View Converted Data

In [None]:
# Load and display the converted Parquet data
df = spark.read.parquet("file:/tmp/pyforge_output/titanic.parquet")
print(f"📊 Converted data: {df.count()} rows, {len(df.columns)} columns")
print("Columns:", df.columns)
display(df.limit(5))
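
Before handing a file to Spark, a cheap sanity check is to look for the Parquet magic bytes: an unencrypted Parquet file begins and ends with the four bytes `PAR1`. The sketch below is a minimal check under that assumption; `looks_like_parquet` is a hypothetical helper, and it does not validate the schema or the data pages.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    if not path.is_file() or path.stat().st_size < 12:
        return False
    with path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # 2 means "relative to the end of the file"
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


print(looks_like_parquet("/tmp/pyforge_output/titanic.parquet"))
```

A `False` result here usually means the conversion did not finish or wrote something other than a single Parquet file, which is worth checking before debugging the Spark read.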

## 4. Working with Multi-Sheet Excel Files

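The sample datasets include Excel files with multiple sheets. PyForge CLI detects the sheets automatically. In this notebook, converting to an output folder produces one Parquet file per sheet, and the `--combine-sheets` flag produces a single combined file.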

### Excel File Analysis and Multi-Sheet Detection

In [None]:
# Inspect the Excel file structure
%sh
echo "📊 Excel File Analysis:"
echo "======================"
file /tmp/pyforge_samples/excel/small/financial-sample.xlsx

# Get detailed Excel file information using PyForge CLI
%sh
echo ""
echo "📋 Excel Sheet Detection:"
echo "========================="
pyforge info /tmp/pyforge_samples/excel/small/financial-sample.xlsx

### Convert Excel with Sheet Detection
*Command: `pyforge convert financial-sample.xlsx output-folder/`*

In [None]:
# Convert Excel file - PyForge CLI automatically detects multiple sheets
%sh
echo "🔄 Converting Excel with Multi-Sheet Detection..."
pyforge convert /tmp/pyforge_samples/excel/small/financial-sample.xlsx /tmp/pyforge_output/financial-sheets/

# List the output files created
%sh
echo "✅ Conversion completed! Output files:"
ls -la /tmp/pyforge_output/financial-sheets/

### Alternative: Convert to Single Combined File

In [None]:
# Convert all sheets to a single combined file
%sh
echo "🔄 Converting Excel to Single Combined Parquet..."
pyforge convert /tmp/pyforge_samples/excel/small/financial-sample.xlsx /tmp/pyforge_output/financial-combined.parquet --combine-sheets

# Check the combined output
%sh
echo "✅ Combined conversion completed:"
ls -lh /tmp/pyforge_output/financial-combined.parquet

### View the Results

In [None]:
# View individual sheet files
import os
sheet_files = [f for f in os.listdir("/tmp/pyforge_output/financial-sheets/") if f.endswith('.parquet')]
print(f"📊 Individual sheets converted: {len(sheet_files)}")
for sheet_file in sheet_files:
    print(f"  • {sheet_file}")

# Load and display one sheet
if sheet_files:
    df = spark.read.parquet(f"file:/tmp/pyforge_output/financial-sheets/{sheet_files[0]}")
    print(f"\n📊 Sample sheet '{sheet_files[0]}': {df.count()} rows, {len(df.columns)} columns")
    display(df.limit(3))

## 6. Working with Multi-Table Databases (Access/MDB) - Key Feature!

This is where PyForge CLI really shines! It automatically detects multiple tables in Access databases and can extract them all or individually.

### Database Analysis and Multi-Table Detection

In [None]:
# Check available Access database files
%sh
echo "🗄️ Available Access Database Files:"
echo "==================================="
ls -lh /tmp/pyforge_samples/access/small/

# Analyze the Northwind database - PyForge CLI automatically detects all tables
%sh
echo ""
echo "📋 Multi-Table Database Analysis:"
echo "================================="
pyforge info /tmp/pyforge_samples/access/small/Northwind_2007_VBNet.accdb

### Extract ALL Tables Automatically using CLI
*Command: `pyforge convert database.accdb output-folder/` - Auto-detects and extracts all tables!*

In [None]:
# PyForge CLI automatically detects ALL tables and converts them
%sh
echo "🔄 Auto-Extracting ALL Tables from Northwind Database..."
echo "PyForge CLI will automatically detect and convert all tables!"
pyforge convert /tmp/pyforge_samples/access/small/Northwind_2007_VBNet.accdb /tmp/pyforge_output/northwind_all_tables/

# List all the extracted table files
%sh
echo "✅ All tables extracted! Generated files:"
ls -la /tmp/pyforge_output/northwind_all_tables/
echo ""
echo "📊 File sizes:"
ls -lh /tmp/pyforge_output/northwind_all_tables/*.parquet

### Extract Specific Table using CLI
*Command: `pyforge convert database.accdb table_name.parquet --table "TableName"`*

In [None]:
# Extract just the Customers table
%sh
echo "🔄 Extracting specific table: Customers"
pyforge convert /tmp/pyforge_samples/access/small/Northwind_2007_VBNet.accdb /tmp/pyforge_output/customers_only.parquet --table "Customers"

# Extract the Orders table
%sh
echo "🔄 Extracting specific table: Orders"
pyforge convert /tmp/pyforge_samples/access/small/Northwind_2007_VBNet.accdb /tmp/pyforge_output/orders_only.parquet --table "Orders"

# List specific table extractions
%sh
echo "✅ Specific table extractions completed:"
ls -lh /tmp/pyforge_output/*_only.parquet
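
When more than a couple of tables are needed, the per-table commands above can be generated rather than typed out. The sketch below only builds the argument lists and uses the `--table` flag exactly as shown in this section; `table_commands` and the `<table>_only.parquet` naming are illustrative choices that mirror the cell above, not PyForge conventions.

```python
from pathlib import Path


def table_commands(database, out_dir, tables):
    """Build one `pyforge convert ... --table NAME` command per table."""
    out_dir = Path(out_dir)
    commands = []
    for table in tables:
        # Hypothetical naming scheme: lower-cased table name plus "_only".
        target = out_dir / f"{table.lower()}_only.parquet"
        commands.append(
            ["pyforge", "convert", str(database), str(target), "--table", table]
        )
    return commands


for cmd in table_commands(
    "/tmp/pyforge_samples/access/small/Northwind_2007_VBNet.accdb",
    "/tmp/pyforge_output",
    ["Customers", "Orders"],
):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with table names that contain spaces.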

### Demonstrate with Another Database

In [None]:
# Try with the Sakila database (movie rental database)
%sh
echo "🎬 Analyzing Sakila Database (Movie Rental System):"
echo "=================================================="
pyforge info /tmp/pyforge_samples/access/small/access_sakila.mdb

# Auto-extract all tables from Sakila
%sh
echo ""
echo "🔄 Auto-extracting all Sakila tables..."
pyforge convert /tmp/pyforge_samples/access/small/access_sakila.mdb /tmp/pyforge_output/sakila_tables/

# Show Sakila results
%sh
echo "✅ Sakila database tables extracted:"
ls -la /tmp/pyforge_output/sakila_tables/

### View the Multi-Table Results

In [None]:
# Analyze the extracted Northwind tables
import os
northwind_files = [f for f in os.listdir("/tmp/pyforge_output/northwind_all_tables/") if f.endswith('.parquet')]
print(f"🗄️ Northwind Database: {len(northwind_files)} tables extracted automatically!")
print("Tables:")
for table_file in sorted(northwind_files):
    table_name = table_file.replace('.parquet', '')
    df = spark.read.parquet(f"file:/tmp/pyforge_output/northwind_all_tables/{table_file}")
    print(f"  • {table_name:20}: {df.count():5} rows, {len(df.columns):2} columns")

# Display sample data from Customers table
print("\n📊 Sample data from Customers table:")
customers_df = spark.read.parquet("file:/tmp/pyforge_output/northwind_all_tables/Customers.parquet")
display(customers_df.limit(3))

## 8. Working with DBF Files (Geographic Data) using CLI

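DBF is the dBase table format. It also stores the attribute table of Esri shapefiles, which is why it appears in geographic datasets such as the US Census TIGER/Line files used here. PyForge CLI converts this legacy format like any other supported input.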

### DBF Analysis and Conversion using CLI

In [None]:
# Check available DBF files
%sh
echo "🗺️ Available DBF Files (Geographic Data):"
echo "=========================================="
ls -lh /tmp/pyforge_samples/dbf/small/

# Analyze DBF file structure using PyForge CLI
%sh
echo ""
echo "📋 DBF File Analysis:"
echo "===================="
pyforge info /tmp/pyforge_samples/dbf/small/tl_2024_us_county.dbf

### Convert DBF to Parquet using CLI
*Command: `pyforge convert file.dbf output.parquet`*

In [None]:
# Convert US Counties DBF to Parquet
%sh
echo "🔄 Converting US Counties DBF to Parquet..."
pyforge convert /tmp/pyforge_samples/dbf/small/tl_2024_us_county.dbf /tmp/pyforge_output/us-counties.parquet

# Convert another DBF file (Places for state FIPS 01, Alabama)
%sh
echo "🔄 Converting Alabama Places DBF to Parquet..."
pyforge convert /tmp/pyforge_samples/dbf/small/tl_2024_01_place.dbf /tmp/pyforge_output/us-places.parquet

# List converted files
%sh
echo "✅ DBF conversions completed:"
ls -lh /tmp/pyforge_output/us-*.parquet

### View the Geographic Data

In [None]:
# Load and analyze the converted geographic data
counties_df = spark.read.parquet("file:/tmp/pyforge_output/us-counties.parquet")
print(f"🗺️ US Counties Data: {counties_df.count()} counties")
print(f"Columns: {counties_df.columns}")

# Show sample county data
print("\n📊 Sample US Counties Data:")
display(counties_df.select("NAME", "STATEFP", "COUNTYFP", "ALAND", "AWATER").limit(10))

# Show data types
counties_df.printSchema()
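
One way to cross-check the row count reported by Spark is to read the record count straight from the DBF header. The sketch below assumes the standard dBase header layout (version byte, three date bytes, then a little-endian 32-bit record count and 16-bit header and record lengths); `parse_dbf_header` and `dbf_record_count` are hypothetical helpers, not PyForge functions.

```python
import struct


def parse_dbf_header(header):
    """Decode the fixed fields in the first 12 bytes of a DBF header."""
    if len(header) < 12:
        raise ValueError("a DBF header needs at least 12 bytes")
    # B = version, 3x = skip the last-update date, I = record count,
    # H = header length in bytes, H = length of one record in bytes.
    version, records, header_len, record_len = struct.unpack("<B3xIHH", header[:12])
    return {
        "version": version,
        "records": records,
        "header_length": header_len,
        "record_length": record_len,
    }


def dbf_record_count(path):
    """Read only the header of the file at `path` and return its record count."""
    with open(path, "rb") as handle:
        return parse_dbf_header(handle.read(12))["records"]


# Synthetic header for illustration; the values are made up.
print(parse_dbf_header(struct.pack("<B3xIHH", 3, 2, 65, 10)))
```

Comparing `dbf_record_count("/tmp/pyforge_samples/dbf/small/tl_2024_us_county.dbf")` with `counties_df.count()` gives a quick completeness check. The header count includes records flagged as deleted, so a small difference does not necessarily mean rows were lost.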

## 9. Batch Processing Multiple Files using CLI

Because PyForge is a command-line tool, batch conversion can be automated with an ordinary shell script.

### Multi-Format Batch Processing Script

In [None]:
# Create a comprehensive batch processing script
%sh
cat << 'EOF' > /tmp/pyforge_batch_all.sh
#!/bin/bash
echo "🔄 PyForge CLI Batch Processing - All Formats"
echo "=============================================="

mkdir -p /tmp/pyforge_output/batch_all/

# Function to convert files
convert_files() {
    local format="$1"
    echo ""
    echo "📊 Processing $format files..."
    for file in /tmp/pyforge_samples/"$format"/small/*; do
        if [ -f "$file" ]; then
            filename=$(basename "$file")
            filename_no_ext="${filename%.*}"
            echo "  Converting $filename..."
            # Report a failure but keep going so one bad file does not stop the batch
            pyforge convert "$file" "/tmp/pyforge_output/batch_all/${format}_${filename_no_ext}.parquet" \
                || echo "  ⚠️ Failed to convert $filename"
        fi
    done
}

# Process all supported formats
convert_files "csv"
convert_files "excel"
convert_files "access"
convert_files "dbf"

echo ""
echo "✅ Batch processing completed! Results:"
ls -la /tmp/pyforge_output/batch_all/
echo ""
echo "📊 File count by format:"
for format in csv excel access dbf; do
    count=$(ls /tmp/pyforge_output/batch_all/${format}_* 2>/dev/null | wc -l)
    echo "  $format: $count files converted"
done
EOF

chmod +x /tmp/pyforge_batch_all.sh

# Run the batch processing script
%sh
/tmp/pyforge_batch_all.sh
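
The same loop can be written in Python when the list of failures needs to be available to later cells. This is a sketch under the same assumptions as the shell script: it walks `<root>/<format>/small/`, calls `pyforge convert <input> <output>`, and treats a non-zero exit code as a failure. `batch_convert` and its injectable `runner` parameter are hypothetical names used for illustration.

```python
import subprocess
from pathlib import Path


def batch_convert(sample_root, out_dir, formats, runner=subprocess.run):
    """Convert every file under <sample_root>/<format>/small/ and report failures."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    succeeded, failed = [], []
    for fmt in formats:
        source_dir = Path(sample_root) / fmt / "small"
        if not source_dir.is_dir():
            continue  # format not installed, nothing to convert
        for src in sorted(source_dir.iterdir()):
            if not src.is_file():
                continue
            dst = out_dir / f"{fmt}_{src.stem}.parquet"
            result = runner(
                ["pyforge", "convert", str(src), str(dst)],
                capture_output=True, text=True, check=False,
            )
            # Assumption: exit code 0 means the conversion succeeded.
            (succeeded if result.returncode == 0 else failed).append(src.name)
    return succeeded, failed
```

Calling `batch_convert("/tmp/pyforge_samples", "/tmp/pyforge_output/batch_all", ["csv", "excel", "access", "dbf"])` mirrors the script above and returns two lists of file names, so the notebook can decide whether any failures should stop the run.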

### View Batch Processing Results

In [None]:
# Analyze all batch processing results
import os
import glob

print("📊 Comprehensive Batch Processing Analysis:")
print("=" * 50)

# Count files by format in all batch directories
batch_dirs = [
    "/tmp/pyforge_output/batch_all/",
]

total_files = 0
for batch_dir in batch_dirs:
    if os.path.exists(batch_dir):
        parquet_files = glob.glob(f"{batch_dir}*.parquet")
        print(f"\n📁 {batch_dir}:")
        print(f"   Files: {len(parquet_files)}")
        total_files += len(parquet_files)
        
        # Show file sizes
        for file in parquet_files[:3]:  # Show first 3 files
            filename = os.path.basename(file)
            size = os.path.getsize(file)
            print(f"   • {filename}: {size:,} bytes")

print(f"\n🎯 Total converted files across all batches: {total_files}")

# Load and preview one batch result
if os.path.exists("/tmp/pyforge_output/batch_all/"):
    sample_files = glob.glob("/tmp/pyforge_output/batch_all/*.parquet")
    if sample_files:
        print("\n📊 Sample from batch processing:")
        df = spark.read.parquet(f"file:{sample_files[0]}")
        print(f"File: {os.path.basename(sample_files[0])}")
        print(f"Rows: {df.count()}, Columns: {len(df.columns)}")
        display(df.limit(3))

## 10. File Validation and CLI Help using Shell Commands

PyForge CLI has built-in help (`--help`) and a `validate` command for checking input files. Both are run here through shell commands.

### CLI Help System

In [None]:
# Get comprehensive PyForge CLI help
%sh
echo "📋 PyForge CLI Help System:"
echo "=========================="
pyforge --help

# Get help for specific commands
%sh
echo ""
echo "🔄 Convert Command Help:"
echo "======================="
pyforge convert --help

### File Validation using CLI

In [None]:
# Validate various file types using PyForge CLI
%sh
echo "🔍 File Validation using PyForge CLI:"
echo "====================================="

# Validate CSV file
echo ""
echo "📊 Validating CSV file:"
pyforge validate /tmp/pyforge_samples/csv/small/titanic-dataset.csv

# Validate Excel file
echo ""
echo "📊 Validating Excel file:"
pyforge validate /tmp/pyforge_samples/excel/small/financial-sample.xlsx

# Validate Access database
echo ""
echo "🗄️ Validating Access database:"
pyforge validate /tmp/pyforge_samples/access/small/Northwind_2007_VBNet.accdb

# Validate DBF file
echo ""
echo "🗺️ Validating DBF file:"
pyforge validate /tmp/pyforge_samples/dbf/small/tl_2024_us_county.dbf

## Summary

You've successfully learned how to use **PyForge CLI** for comprehensive data format conversion in Databricks:

### ✅ **Key Achievements:**

🎯 **CLI-First Approach**: Ran every conversion with `%sh pyforge` commands and used Spark only to inspect the results  
📊 **Multi-Table Database Processing**: Demonstrated automatic detection and extraction of all tables from Access databases  
🗂️ **Real Sample Datasets**: Converted PyForge sample datasets in four formats (CSV, Excel, Access, and DBF)  
🔄 **Intelligent Format Detection**: PyForge CLI automatically detects file types and database structures  
⚡ **Batch Processing**: Created a shell script for automated multi-file conversions  
🏢 **Databricks Integration**: Used the `%pip` and `%sh` magic commands throughout  
🛡️ **Verification Steps**: Checked each conversion by listing the output files and loading them with Spark  

### 🎯 **Special Focus - Multi-Table Database Capability:**

**This is PyForge CLI's standout feature!** When you run:
```bash
pyforge convert database.accdb output_folder/
```

PyForge CLI automatically:
- 🔍 **Detects** all tables in the database
- 📊 **Analyzes** table structures and relationships  
- 🔄 **Converts** each table to individual Parquet files
- 📁 **Organizes** output with clear naming conventions

No need to specify table names or know the database structure beforehand!

### 🚀 **What Makes This Different:**

Unlike Python API approaches, the CLI provides:
- **Zero configuration** - Just point it at a file/folder
- **Intelligent detection** - Automatically handles complex structures
- **Shell integration** - Perfect for Databricks notebook environments
- **Progress reporting** - Clear feedback on processing status
- **Error resilience** - Continues processing even if individual files fail

### 📈 **Production Use Cases:**

1. **Data Migration**: Migrate legacy Access databases to modern formats
2. **ETL Pipelines**: Batch convert incoming files of various formats
3. **Data Lake Ingestion**: Convert mixed-format datasets to Parquet for analytics
4. **Archive Processing**: Extract data from old database backups
5. **Format Standardization**: Normalize data formats across your organization

**PyForge CLI transforms complex data conversion into simple shell commands!** 🚀