# Remote Spark Connection Example

This notebook demonstrates how to connect to a remote Databricks cluster using Databricks Connect.

## Prerequisites

1. **Databricks Workspace**: Access to a Databricks workspace.
2. **Personal Access Token**: A personal access token generated from your Databricks workspace.
3. **Cluster**: A running Databricks cluster (or specify a cluster ID for auto-start).
4. **Environment**: Either the `delta-lake` or the `unity-catalog` conda environment.

## Setup

Run the configuration script to set up your credentials:
```bash
./scripts/configure-remote.sh
```

Or manually create a `.env` file with:
```
DATABRICKS_HOST=https://your-workspace.databricks.com
DATABRICKS_TOKEN=your-personal-access-token
DATABRICKS_CLUSTER_ID=your-cluster-id  # Optional
```
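
Before connecting, it can help to sanity-check these values. The following is a minimal sketch, not part of this repository: the helper name `find_config_problems` is hypothetical, and the `https://` check is an assumption based on the example host above.

```python
from typing import List, Mapping

# The two settings this notebook treats as mandatory.
REQUIRED_KEYS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def find_config_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the settings look usable.

    Hypothetical helper for illustration only.
    """
    problems = []
    for key in REQUIRED_KEYS:
        # Treat a missing key and a whitespace-only value the same way.
        if not env.get(key, "").strip():
            problems.append(f"{key} is missing or empty")

    host = env.get("DATABRICKS_HOST", "").strip()
    # Assumption: the workspace URL uses HTTPS, as in the example above.
    if host and not host.startswith("https://"):
        problems.append("DATABRICKS_HOST should start with https://")
    return problems


# Example with placeholder values; DATABRICKS_CLUSTER_ID is optional and not checked.
sample = {
    "DATABRICKS_HOST": "https://your-workspace.databricks.com",
    "DATABRICKS_TOKEN": "your-personal-access-token",
}
print(find_config_problems(sample))
```

In a notebook, you could pass `os.environ` after `load_dotenv()` instead of the sample dictionary.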

In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
import sys
sys.path.append('..')
from shared.spark_utils import SparkSessionManager

In [None]:
# Load environment variables
load_dotenv()

# Check configuration
databricks_host = os.getenv("DATABRICKS_HOST")
databricks_token = os.getenv("DATABRICKS_TOKEN")
cluster_id = os.getenv("DATABRICKS_CLUSTER_ID")

print(f"Databricks Host: {databricks_host}")
print(f"Token configured: {'Yes' if databricks_token else 'No'}")
print(f"Cluster ID: {cluster_id if cluster_id else 'Not specified'}")

if not databricks_host or not databricks_token:
    print("\n❌ Error: DATABRICKS_HOST and DATABRICKS_TOKEN must be configured!")
    print("Please run: ./scripts/configure-remote.sh")
else:
    print("\n✅ Configuration looks good!")

In [None]:
# Connect to remote Spark cluster
print("🚀 Connecting to remote Databricks cluster...")

try:
    spark = SparkSessionManager.get_session("remote", "Remote-Jupyter-Example")
    print("✅ Successfully connected!")
    print(f"   Spark Version: {spark.version}")
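    # The application ID comes from sparkContext, which may be unavailable on
    # Spark Connect-based sessions (newer Databricks Connect), so treat it as optional.
    try:
        print(f"   Application ID: {spark.sparkContext.applicationId}")
    except Exception:
        print("   Application ID: not available for this session type")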
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\nTroubleshooting:")
    print("1. Verify your credentials are correct")
    print("2. Ensure your Databricks cluster is running")
    print("3. Check network connectivity")

In [None]:
# Test basic functionality
print("📊 Testing basic Spark functionality...")

# Simple test query
df = spark.sql("""
    SELECT 
        'Hello from Remote Spark!' as message,
        current_timestamp() as timestamp,
        version() as spark_version
""")

df.show(truncate=False)

In [None]:
# Explore the environment
print("🗄️ Exploring available databases...")

# Show databases
databases_df = spark.sql("SHOW DATABASES")
databases_df.show()

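# Get database list as Python list. The result column may be named `namespace`
# or `databaseName` depending on the runtime, so read it by position.
databases = [row[0] for row in databases_df.collect()]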
print(f"\nFound {len(databases)} databases: {', '.join(databases)}")

In [None]:
# Create and work with a DataFrame
print("📈 Creating sample DataFrame...")

# Create sample data
sample_data = [
    (1, "Alice", 25, "Engineering"),
    (2, "Bob", 30, "Marketing"),
    (3, "Charlie", 35, "Engineering"),
    (4, "Diana", 28, "Sales"),
    (5, "Eve", 32, "Engineering")
]

columns = ["id", "name", "age", "department"]

# Create DataFrame
df = spark.createDataFrame(sample_data, columns)

print("\n📋 Sample data:")
df.show()

print("\n📊 Department statistics:")
df.groupBy("department").agg(
    {"age": "avg", "id": "count"}
).withColumnRenamed("avg(age)", "avg_age").withColumnRenamed("count(id)", "employee_count").show()

In [None]:
# Optional: Work with Delta Lake (if available)
print("🔺 Testing Delta Lake functionality...")

try:
    # Try to create a temporary Delta table
    temp_table_path = "/tmp/delta-table-test"
    
    # Write as Delta format
    df.write.format("delta").mode("overwrite").save(temp_table_path)
    
    # Read it back
    delta_df = spark.read.format("delta").load(temp_table_path)
    
    print("✅ Delta Lake is working!")
    print("\n📊 Data from Delta table:")
    delta_df.show()
    
    # Clean up
    spark.sql(f"DROP TABLE IF EXISTS delta.`{temp_table_path}`")
    
except Exception as e:
    print(f"⚠️ Delta Lake test failed: {e}")
    print("This might be expected if Delta Lake is not configured in your cluster.")

In [None]:
# Clean up
print("🔄 Cleaning up...")
SparkSessionManager.stop_session()
print("✅ Spark session stopped.")