# Setup Environment for Lakeflow Data Ingestion

This notebook sets up the environment for the Data Ingestion with Lakeflow exercises.

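It will create:
- Catalog: `lakeflow_demo`
- Schema: `lakeflow_schema`
- Volume: `raw` (within `lakeflow_schema`)
- Volumes: `csv_files_autoloader_source` and `autoloader_staging_files` (for the streaming table exercises)
- Sample data files in these volumes for the ingestion exercises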


## Step 1: Create Catalog and Schema


In [0]:
%sql
-- Create catalog
CREATE CATALOG IF NOT EXISTS lakeflow_demo;

-- Create schema within the catalog
CREATE SCHEMA IF NOT EXISTS lakeflow_demo.lakeflow_schema;

-- Set default catalog and schema
USE CATALOG lakeflow_demo;
USE SCHEMA lakeflow_schema;

-- Verify current catalog and schema
SELECT current_catalog(), current_schema();


## Step 2: Create Volume


In [0]:
%sql
-- Create volume for raw data files
CREATE VOLUME IF NOT EXISTS lakeflow_demo.lakeflow_schema.raw;

-- Verify volume creation
DESCRIBE VOLUME lakeflow_demo.lakeflow_schema.raw;
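
Files in a Unity Catalog volume are addressed as `/Volumes/<catalog>/<schema>/<volume>/<path>`, and the later cells repeat that prefix many times. As a minimal sketch, a small helper could assemble these paths. The name `volume_path` is hypothetical and is not used elsewhere in this notebook.

```python
def volume_path(catalog, schema, volume, *parts):
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path (hypothetical helper)."""
    segments = [catalog, schema, volume, *parts]
    for segment in segments:
        # Reject empty names and embedded slashes so the path stays well-formed
        if not segment or "/" in segment:
            raise ValueError(f"Invalid path segment: {segment!r}")
    return "/Volumes/" + "/".join(segments)


# Paths used by this notebook, rebuilt with the helper
raw_root = volume_path("lakeflow_demo", "lakeflow_schema", "raw")
users_dir = volume_path("lakeflow_demo", "lakeflow_schema", "raw", "users-historical")
print(raw_root)
print(users_dir)
```

The helper only formats strings; it does not check that the catalog, schema, or volume exists.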


## Step 3: Create Sample Data Files

We'll create sample data files in different formats (Parquet, CSV, JSON) for the ingestion exercises.

### 3.1: Create Sample Parquet Files (Users Historical Data)


In [0]:
%python
import pandas as pd
import numpy as np
from datetime import datetime
import random

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

# Generate sample user data
n_records = 10000
user_ids = [f"UA{str(i).zfill(12)}" for i in range(1, n_records + 1)]
emails = [f"user{i}@example.com" for i in range(1, n_records + 1)]

# Generate Unix timestamps in microseconds, spread over the 365 days following
# 2020-01-01 (midnight in the cluster's local time zone)
base_timestamp = int(datetime(2020, 1, 1).timestamp() * 1_000_000)
timestamps = [base_timestamp + random.randint(0, 365*24*60*60*1_000_000) for _ in range(n_records)]

# Create DataFrame
users_df = pd.DataFrame({
    'user_id': user_ids,
    'user_first_touch_timestamp': timestamps,
    'email': emails
})

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(users_df)

# Write to Parquet files in the raw volume
output_path = "/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/"
spark_df.coalesce(5).write.mode("overwrite").parquet(output_path)

print(f"Created {n_records} user records in Parquet format at: {output_path}")
print("Files created:")
files = dbutils.fs.ls(output_path)
for file in files:
    if file.name.startswith('part-'):
        print(f"  {file.name} - {file.size} bytes")
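
The `user_first_touch_timestamp` column holds integer microseconds since the Unix epoch, so later exercises need to scale it before treating it as a timestamp. The following is a minimal sketch of that conversion in plain Python. The function name `micros_to_datetime` is hypothetical, and the example value assumes the base timestamp was computed in UTC.

```python
from datetime import datetime, timezone


def micros_to_datetime(micros):
    """Convert integer microseconds since the Unix epoch to an aware UTC datetime."""
    # Split into whole seconds and leftover microseconds to avoid float rounding
    seconds, remainder = divmod(micros, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=remainder)


# 2020-01-01T00:00:00 UTC expressed in microseconds
base_micros = 1_577_836_800_000_000
print(micros_to_datetime(base_micros).isoformat())
```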


### 3.2: Create Sample CSV Files (Sales Data)


In [0]:
%python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Set random seed (sale_date values still depend on the current time)
np.random.seed(42)
random.seed(42)

# Generate sample sales data
n_records = 5000
order_ids = [f"ORD{str(i).zfill(6)}" for i in range(1, n_records + 1)]
products = ["Product A", "Product B", "Product C", "Product D", "Product E"]
quantities = np.random.randint(1, 10, n_records)
prices = np.random.uniform(10.0, 1000.0, n_records).round(2)

sales_df = pd.DataFrame({
    'order_id': order_ids,
    'product': [random.choice(products) for _ in range(n_records)],
    'quantity': quantities,
    'price': prices,
    'sale_date': [datetime.now() - timedelta(days=random.randint(0, 365)) for _ in range(n_records)]
})

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(sales_df)

# Write to CSV files in the raw volume (pipe-delimited)
output_path = "/Volumes/lakeflow_demo/lakeflow_schema/raw/sales-csv/"
spark_df.coalesce(3).write.mode("overwrite").option("sep", "|").option("header", "true").csv(output_path)

print(f"Created {n_records} sales records in CSV format at: {output_path}")
print("Files created:")
files = dbutils.fs.ls(output_path)
for file in files:
    if file.name.endswith('.csv'):
        print(f"  {file.name} - {file.size} bytes")
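
Because these files use `|` as the separator, any reader has to be told about the delimiter; otherwise each line is treated as a single column. A minimal sketch with the standard-library `csv` module illustrates this. The sample rows and their timestamp formatting are made up for illustration and are not taken from the generated files.

```python
import csv
from io import StringIO

# Hypothetical sample with the same column layout as the sales files
SAMPLE = (
    "order_id|product|quantity|price|sale_date\n"
    "ORD000001|Product A|3|19.99|2024-01-15T10:30:00.000Z\n"
    "ORD000002|Product C|1|250.5|2024-02-01T08:00:00.000Z\n"
)


def read_pipe_delimited(text):
    """Parse pipe-delimited CSV text and cast the numeric columns."""
    reader = csv.DictReader(StringIO(text), delimiter="|")
    return [
        {**row, "quantity": int(row["quantity"]), "price": float(row["price"])}
        for row in reader
    ]


rows = read_pipe_delimited(SAMPLE)

# With the default comma delimiter the header collapses into one column name
wrong_columns = csv.DictReader(StringIO(SAMPLE)).fieldnames
print(len(rows), wrong_columns)
```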


### 3.3: Create Sample JSON Files (Kafka Events Data)


In [0]:
%python
import json
import base64
import random
from datetime import datetime

# Set random seed
random.seed(42)

# Generate sample Kafka event data
n_records = 2000
devices = ["iOS", "Android", "Linux", "Windows", "Mac"]
cities = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"]
states = ["NY", "CA", "IL", "TX", "AZ"]
traffic_sources = ["google", "email", "direct", "social", "organic"]
event_names = ["main", "add_item", "finalize", "purchase", "view_item"]

events = []
for i in range(n_records):
    user_id = f"UA{str(i+1).zfill(12)}"
    
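    # Build the event payload (serialized to JSON and base64-encoded below).
    # Timestamps here are in milliseconds, unlike the microsecond values in 3.1.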
    event_value = {
        "device": random.choice(devices),
        "ecommerce": {},
        "event_name": random.choice(event_names),
        "event_timestamp": int(datetime.now().timestamp() * 1000) + i,
        "geo": {
            "city": random.choice(cities),
            "state": random.choice(states)
        },
        "items": [],
        "traffic_source": random.choice(traffic_sources),
        "user_first_touch_timestamp": int(datetime.now().timestamp() * 1000) - random.randint(0, 86400000),
        "user_id": user_id
    }
    
    # Encode key and value in base64
    key_encoded = base64.b64encode(user_id.encode()).decode()
    value_encoded = base64.b64encode(json.dumps(event_value).encode()).decode()
    
    # Create Kafka event record
    kafka_event = {
        "key": key_encoded,
        "offset": 219255000 + i,
        "partition": i % 3,
        "timestamp": int(datetime.now().timestamp() * 1000) + i,
        "topic": "clickstream",
        "value": value_encoded
    }
    
    events.append(kafka_event)

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(events)

# Write to JSON files in the raw volume
output_path = "/Volumes/lakeflow_demo/lakeflow_schema/raw/events-kafka/"
spark_df.coalesce(5).write.mode("overwrite").json(output_path)

print(f"Created {n_records} Kafka event records in JSON format at: {output_path}")
print("Files created:")
files = dbutils.fs.ls(output_path)
for file in files:
    if file.name.startswith('part-'):
        print(f"  {file.name} - {file.size} bytes")
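
Each record stores `key` and `value` as base64 strings, and the decoded `value` is itself a JSON document. The sketch below reverses that encoding in plain Python. The function name `decode_kafka_record` and the shortened sample payload are assumptions for illustration.

```python
import base64
import json


def decode_kafka_record(record):
    """Return a copy of the record with key decoded to text and value parsed as JSON."""
    key = base64.b64decode(record["key"]).decode("utf-8")
    value = json.loads(base64.b64decode(record["value"]).decode("utf-8"))
    return {**record, "key": key, "value": value}


# Build one sample record the same way the cell above does (payload shortened)
sample_value = {"device": "iOS", "event_name": "main", "user_id": "UA000000000001"}
sample_record = {
    "key": base64.b64encode("UA000000000001".encode()).decode(),
    "offset": 219255000,
    "partition": 0,
    "topic": "clickstream",
    "value": base64.b64encode(json.dumps(sample_value).encode()).decode(),
}

decoded = decode_kafka_record(sample_record)
print(decoded["key"], decoded["value"]["event_name"])
```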


### 3.4: Create Sample CSV File with Malformed Data (for Rescued Data Exercise)


In [0]:
%python
import pandas as pd

# Create sample product data with one malformed row
products_data = [
    {"item_id": "M_PREM_Q", "name": "Premium Queen Mattress", "price": 1795.0},
    {"item_id": "M_STAN_F", "name": "Standard Full Mattress", "price": 945.0},
    {"item_id": "M_PREM_A", "name": "Premium Queen Mattress", "price": "$100.00"},  # Malformed: price with $ sign
    {"item_id": "M_STAN_T", "name": "Standard Twin Mattress", "price": 595.0}
]

products_df = pd.DataFrame(products_data)

# Ensure output directory exists
output_dir = "/Volumes/lakeflow_demo/lakeflow_schema/raw/products-csv"
dbutils.fs.mkdirs(output_dir)

# Write to CSV file (comma-delimited)
output_path = f"{output_dir}/lab_malformed_data.csv"
products_df.to_csv(output_path, index=False, sep=",")

print(f"Created malformed CSV file at: {output_path}")
print(f"File contains {len(products_df)} rows, including one malformed row with price='$100.00'")
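
The point of the malformed row is that `$100.00` cannot be read as a number when `price` is treated as numeric. The sketch below illustrates that idea in plain Python by separating rows that parse from rows that do not. It is a simplified illustration, not how Auto Loader implements its `_rescued_data` column, and the CSV text is an assumption about what the cell above writes.

```python
import csv
from io import StringIO

# Assumed contents of lab_malformed_data.csv
CSV_TEXT = """item_id,name,price
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_A,Premium Queen Mattress,$100.00
M_STAN_T,Standard Twin Mattress,595.0
"""


def split_rescued(text):
    """Split rows into those whose price parses as a float and those that do not."""
    parsed, rescued = [], []
    for row in csv.DictReader(StringIO(text)):
        try:
            parsed.append({**row, "price": float(row["price"])})
        except ValueError:
            # Keep the original strings so nothing is lost
            rescued.append(row)
    return parsed, rescued


parsed_rows, rescued_rows = split_rescued(CSV_TEXT)
print(len(parsed_rows), [r["item_id"] for r in rescued_rows])
```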

### 3.5: Create Volumes and Data for Streaming Tables (Auto Loader)


In [0]:
%python
# Create volumes for streaming table exercises
spark.sql("CREATE VOLUME IF NOT EXISTS lakeflow_demo.lakeflow_schema.autoloader_staging_files")
spark.sql("CREATE VOLUME IF NOT EXISTS lakeflow_demo.lakeflow_schema.csv_files_autoloader_source")

print("Created volumes for streaming table exercises")


In [0]:
%python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Set random seed
np.random.seed(42)
random.seed(42)

print("Creating clean CSV files for streaming table exercises...")
print("=" * 70)

# Important: write plain CSV files WITHOUT Spark metadata.
# Auto Loader should read from a raw file drop zone, not a Spark output directory
# (Spark writes add marker files such as _SUCCESS alongside the data).

# Step 1: Generate sample sales data
n_records_per_file = 2000
products = ["Product A", "Product B", "Product C", "Product D", "Product E"]

def create_csv_file(output_path, n_records, file_number):
    """Create a clean CSV file without Spark metadata"""
    order_ids = [f"ORD{str(i + file_number * n_records).zfill(6)}" for i in range(1, n_records + 1)]
    quantities = np.random.randint(1, 10, n_records)
    prices = np.random.uniform(10.0, 1000.0, n_records).round(2)
    
    sales_df = pd.DataFrame({
        'order_id': order_ids,
        'product': [random.choice(products) for _ in range(n_records)],
        'quantity': quantities,
        'price': prices,
        'sale_date': [datetime.now() - timedelta(days=random.randint(0, 365)) for _ in range(n_records)]
    })
    
    # Write CSV content to a string buffer (avoids local filesystem access)
    from io import StringIO
    csv_buffer = StringIO()
    sales_df.to_csv(csv_buffer, index=False, sep="|", lineterminator='\n')
    csv_content = csv_buffer.getvalue()
    csv_buffer.close()
    
    # Write directly to volume using dbutils.fs.put()
    # This avoids local filesystem access and Spark metadata files
    dbutils.fs.put(output_path, csv_content, overwrite=True)
    
    return len(sales_df)

# Step 2: Clean and prepare Auto Loader volumes (remove any existing top-level files)
print("\n1. Cleaning Auto Loader volumes (removing any existing files)...")
try:
    # Remove all files from autoloader volumes to start clean
    for path in [
        "/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source",
        "/Volumes/lakeflow_demo/lakeflow_schema/autoloader_staging_files"
    ]:
        try:
            existing_files = dbutils.fs.ls(path)
            for f in existing_files:
                if not f.isDir():
                    dbutils.fs.rm(f.path)
            print(f"  Cleaned: {path}")
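        except Exception as e:
            # Listing failed (e.g. the path does not exist yet), so try to create it.
            # An empty directory lists fine and does not reach this branch.
            print(f"  Could not list {path}: {e}")
            dbutils.fs.mkdirs(path)
            print(f"  Created: {path}")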
except Exception as e:
    print(f"  Note: {e}")

# Step 3: Create initial CSV file for autoloader source (file drop zone)
print("\n2. Creating initial CSV file for csv_files_autoloader_source...")
initial_file_path = "/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source/sales_initial.csv"
rows_created = create_csv_file(initial_file_path, n_records_per_file, 0)
print(f"  ✓ Created: sales_initial.csv with {rows_created} rows")

# Step 4: Create staging CSV files for incremental ingestion demo
print("\n3. Creating staging CSV files for autoloader_staging_files...")
staging_files = []
for i in range(1, 4):  # Create 3 files
    staging_file_path = f"/Volumes/lakeflow_demo/lakeflow_schema/autoloader_staging_files/sales_staging_{i:03d}.csv"
    rows_created = create_csv_file(staging_file_path, n_records_per_file, i)
    staging_files.append(f"sales_staging_{i:03d}.csv")
    print(f"  ✓ Created: sales_staging_{i:03d}.csv with {rows_created} rows")

# Step 5: Verify files (should ONLY see CSV files, no Spark metadata)
print("\n4. Verifying Auto Loader volumes (should contain ONLY CSV files)...")
print("\n   csv_files_autoloader_source:")
source_files = dbutils.fs.ls("/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source")
csv_files = [f for f in source_files if f.name.endswith('.csv') and not f.isDir()]
non_csv = [f for f in source_files if not f.name.endswith('.csv') and not f.isDir()]
print(f"     CSV files: {len(csv_files)}")
for f in csv_files:
    print(f"       - {f.name} ({f.size:,} bytes)")
if non_csv:
    print(f"     ⚠ WARNING: Found {len(non_csv)} non-CSV files (should be 0):")
    for f in non_csv:
        print(f"       - {f.name}")

print("\n   autoloader_staging_files:")
staging_files_list = dbutils.fs.ls("/Volumes/lakeflow_demo/lakeflow_schema/autoloader_staging_files")
staging_csv = [f for f in staging_files_list if f.name.endswith('.csv') and not f.isDir()]
staging_non_csv = [f for f in staging_files_list if not f.name.endswith('.csv') and not f.isDir()]
print(f"     CSV files: {len(staging_csv)}")
for f in staging_csv:
    print(f"       - {f.name} ({f.size:,} bytes)")
if staging_non_csv:
    print(f"     ⚠ WARNING: Found {len(staging_non_csv)} non-CSV files (should be 0):")
    for f in staging_non_csv:
        print(f"       - {f.name}")

print("\n" + "=" * 70)
print("✓ Streaming table data setup complete!")
print("\nThese volumes are now clean file drop zones suitable for Auto Loader:")
print("  • Initial file: /Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source/")
print("  • Staging files: /Volumes/lakeflow_demo/lakeflow_schema/autoloader_staging_files/")
print("\nNote: Files were built with pandas and written with dbutils.fs.put (not Spark) to avoid metadata artifacts.")
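
The `order_id` values in these files are offset by `file_number * n_records`, so the initial file and the three staging files should cover consecutive, non-overlapping ID ranges. The sketch below reproduces that arithmetic on its own. The function name `order_id_range` is hypothetical.

```python
from typing import Tuple


def order_id_range(file_number: int, n_records: int = 2000) -> Tuple[str, str]:
    """Return the first and last order_id that create_csv_file generates for a file."""
    first = 1 + file_number * n_records
    last = n_records + file_number * n_records
    return f"ORD{str(first).zfill(6)}", f"ORD{str(last).zfill(6)}"


# File 0 is sales_initial.csv; files 1 to 3 are the staging files
ranges = {file_number: order_id_range(file_number) for file_number in range(4)}
for file_number, (first_id, last_id) in ranges.items():
    print(file_number, first_id, last_id)
```

Under these assumptions, moving the staging files into the source volume one at a time adds 2,000 new order IDs per file.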


## Step 4: Verify Setup


In [0]:
%sql
-- Verify catalog and schema
SHOW SCHEMAS IN lakeflow_demo;

-- Verify volume
DESCRIBE VOLUME lakeflow_demo.lakeflow_schema.raw;

-- List files in raw volume
LIST '/Volumes/lakeflow_demo/lakeflow_schema/raw';


## Setup Complete!

Your environment is now ready for the Data Ingestion with Lakeflow exercises.

**Catalog:** `lakeflow_demo`  
**Schema:** `lakeflow_schema`  
**Volumes:**
- `raw` (located at `/Volumes/lakeflow_demo/lakeflow_schema/raw/`)
- `csv_files_autoloader_source` (for streaming table exercises)
- `autoloader_staging_files` (for incremental ingestion demo)

**Sample data files created:**
- Parquet files: `/Volumes/lakeflow_demo/lakeflow_schema/raw/users-historical/`
- CSV files: `/Volumes/lakeflow_demo/lakeflow_schema/raw/sales-csv/`
- JSON files: `/Volumes/lakeflow_demo/lakeflow_schema/raw/events-kafka/`
- Malformed CSV: `/Volumes/lakeflow_demo/lakeflow_schema/raw/products-csv/lab_malformed_data.csv`

**Streaming table data:**
- Initial CSV file: `/Volumes/lakeflow_demo/lakeflow_schema/csv_files_autoloader_source/` (1 file)
- Staging CSV files: `/Volumes/lakeflow_demo/lakeflow_schema/autoloader_staging_files/` (3 files)

You can now proceed with the Data Ingestion exercises!
