# Exercise Environment Setup

This notebook sets up the environment for the Data Ingestion with Lakeflow exercises.

It will create:
- Catalog: `lakeflow_exercise`
- Schema: `exercise_schema`
- Volume: `exercise_raw` (within `exercise_schema`)
- Sample data files in the `exercise_raw` volume for the practice exercises

**Note:** This setup creates different datasets from those used in the main notebooks, so the exercises start from fresh practice scenarios.
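
Every output path in the cells below follows the volume path layout `/Volumes/<catalog>/<schema>/<volume>/...`. As a minimal sketch of that convention, the snippet below assembles such paths from the three names listed above. The helper `volume_path` is hypothetical (it is not part of the notebook or of any Databricks API) and exists only for illustration.

```python
from posixpath import join

CATALOG = "lakeflow_exercise"
SCHEMA = "exercise_schema"
VOLUME = "exercise_raw"


def volume_path(*parts: str) -> str:
    """Hypothetical helper: build a path under the exercise volume."""
    return join("/Volumes", CATALOG, SCHEMA, VOLUME, *parts)


# Path of the folder that Step 3.1 writes to
print(volume_path("customers-parquet"))
```

Calling `volume_path()` with no arguments returns the root of the volume itself.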


## Step 1: Create Catalog and Schema


In [0]:
%sql
-- Create catalog
CREATE CATALOG IF NOT EXISTS lakeflow_exercise;

-- Create schema within the catalog
CREATE SCHEMA IF NOT EXISTS lakeflow_exercise.exercise_schema;

-- Set default catalog and schema
USE CATALOG lakeflow_exercise;
USE SCHEMA exercise_schema;

-- Verify current catalog and schema
SELECT current_catalog(), current_schema();


## Step 2: Create Volume


In [0]:
%sql
-- Create volume for raw data files
CREATE VOLUME IF NOT EXISTS lakeflow_exercise.exercise_schema.exercise_raw;

-- Verify volume creation
DESCRIBE VOLUME lakeflow_exercise.exercise_schema.exercise_raw;


## Step 3: Create Sample Data Files

We'll create sample data files in different formats (Parquet, CSV, JSON) for the exercise.

### 3.1: Create Sample Parquet Files (Customer Data)


In [0]:
%python
import pandas as pd
import numpy as np
from datetime import datetime
import random

# Set random seed for reproducibility
np.random.seed(123)
random.seed(123)

# Generate sample customer data
n_records = 15000
customer_ids = [f"CUST{str(i).zfill(8)}" for i in range(1, n_records + 1)]
customer_names = [f"Customer_{i}" for i in range(1, n_records + 1)]
# Generate Unix timestamps in microseconds, spread over the 730 days starting 2019-01-01
base_timestamp = int(datetime(2019, 1, 1).timestamp() * 1_000_000)
max_offset = 730 * 24 * 60 * 60 * 1_000_000  # 730 days expressed in microseconds
registration_timestamps = [base_timestamp + random.randint(0, max_offset) for _ in range(n_records)]

# Create DataFrame
customers_df = pd.DataFrame({
    'customer_id': customer_ids,
    'customer_name': customer_names,
    'registration_timestamp': registration_timestamps,
    'country': [random.choice(['USA', 'UK', 'Canada', 'Australia', 'Germany']) for _ in range(n_records)]
})

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(customers_df)

# Write to Parquet files in the raw volume
output_path = "/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/"
spark_df.coalesce(6).write.mode("overwrite").parquet(output_path)

print(f"Created {n_records} customer records in Parquet format at: {output_path}")
print("Files created:")
files = dbutils.fs.ls(output_path)
for file in files:
    if file.name.startswith('part-'):
        print(f"  {file.name} - {file.size} bytes")
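
The `registration_timestamp` column holds integers (microseconds since the Unix epoch), not timestamp values, so any consumer has to convert it. The sketch below shows one way to do that in plain Python. The function name `micros_to_datetime` and the sample value are illustrative assumptions. Note that the cell above builds its base value from a naive `datetime`, so the exact offsets in the generated data depend on the time zone of the machine running the cell.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_datetime(micros: int) -> datetime:
    """Convert microseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding at microsecond precision
    return EPOCH + timedelta(microseconds=micros)


# Illustrative value: midnight UTC on 2019-01-01, expressed in microseconds
sample = 1_546_300_800_000_000
print(micros_to_datetime(sample).isoformat())
```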


### 3.2: Create Sample CSV Files (Transactions Data)


In [0]:
%python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Set random seed (transaction_date is relative to datetime.now(), so it still varies between runs)
np.random.seed(123)
random.seed(123)

# Generate sample transaction data
n_records = 8000
transaction_ids = [f"TXN{str(i).zfill(7)}" for i in range(1, n_records + 1)]
categories = ["Electronics", "Clothing", "Food", "Books", "Home"]
amounts = np.random.uniform(5.0, 2000.0, n_records).round(2)

transactions_df = pd.DataFrame({
    'transaction_id': transaction_ids,
    'category': [random.choice(categories) for _ in range(n_records)],
    'amount': amounts,
    'transaction_date': [datetime.now() - timedelta(days=random.randint(0, 180)) for _ in range(n_records)]
})

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(transactions_df)

# Write to CSV files in the raw volume (pipe-delimited)
output_path = "/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/transactions-csv/"
spark_df.coalesce(4).write.mode("overwrite").option("sep", "|").option("header", "true").csv(output_path)

print(f"Created {n_records} transaction records in CSV format at: {output_path}")
print("Files created:")
files = dbutils.fs.ls(output_path)
for file in files:
    if file.name.endswith('.csv'):
        print(f"  {file.name} - {file.size} bytes")
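
Because these files are written with `sep` set to `|` and a header row, a reader must use the same delimiter and header settings. The following is a minimal sketch of that idea using only the Python standard library on an in-memory sample. The two data rows are made-up values in the same column layout, not actual output of the cell above.

```python
import csv
import io

# Made-up sample in the same layout as the generated files:
# one header row, then pipe-delimited values.
sample = (
    "transaction_id|category|amount|transaction_date\n"
    "TXN0000001|Books|19.99|2024-01-15T10:30:00.000Z\n"
    "TXN0000002|Home|250.0|2024-02-01T08:00:00.000Z\n"
)

# delimiter="|" mirrors the sep option used when the files were written
rows = list(csv.DictReader(io.StringIO(sample), delimiter="|"))

for row in rows:
    print(row["transaction_id"], row["category"], float(row["amount"]))
```

Parsing the same text with the default comma delimiter would produce a single column per row, which is the typical symptom of a mismatched separator.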


### 3.3: Create Sample CSV File with Malformed Data (for Rescued Data Exercise)


In [0]:
%python
import pandas as pd

# Create sample inventory data with malformed rows
inventory_data = [
    {"product_id": "P001", "product_name": "Laptop", "stock_quantity": 50},
    {"product_id": "P002", "product_name": "Mouse", "stock_quantity": 200},
    {"product_id": "P003", "product_name": "Keyboard", "stock_quantity": "150 units"},  # Malformed: quantity with text
    {"product_id": "P004", "product_name": "Monitor", "stock_quantity": 75},
    {"product_id": "P005", "product_name": "Headphones", "stock_quantity": "N/A"},  # Malformed: text instead of number
    {"product_id": "P006", "product_name": "Webcam", "stock_quantity": 30}
]

inventory_df = pd.DataFrame(inventory_data)

# Ensure output directory exists
output_dir = "/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/inventory-csv"
dbutils.fs.mkdirs(output_dir)

# Write to CSV file (comma-delimited)
output_path = f"{output_dir}/exercise_malformed_data.csv"
inventory_df.to_csv(output_path, index=False, sep=",")

print(f"Created malformed CSV file at: {output_path}")
print(f"File contains {len(inventory_df)} rows, including malformed rows with stock_quantity containing text")
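
To make it concrete which rows count as malformed, the sketch below checks each `stock_quantity` value against an integer parse. It is plain Python for illustration only and does not reproduce how any ingestion tool handles rescued data. The helper name `is_valid_quantity` is an assumption.

```python
def is_valid_quantity(value) -> bool:
    """Return True when the value parses cleanly as a non-negative integer."""
    try:
        return int(str(value).strip()) >= 0
    except ValueError:
        return False


# Same product IDs and quantities as the inventory data above
rows = [
    ("P001", 50),
    ("P002", 200),
    ("P003", "150 units"),
    ("P004", 75),
    ("P005", "N/A"),
    ("P006", 30),
]

malformed = [product_id for product_id, qty in rows if not is_valid_quantity(qty)]
print(malformed)
```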


### 3.4: Create Sample JSON Files (Web Events Data)


In [0]:
%python
import json
import base64
import random
from datetime import datetime

# Set random seed (timestamps below use datetime.now(), so they still vary between runs)
random.seed(123)

# Generate sample web event data
n_records = 3000
browsers = ["Chrome", "Firefox", "Safari", "Edge", "Opera"]
pages = ["home", "products", "cart", "checkout", "about"]
actions = ["view", "click", "add_to_cart", "purchase", "search"]
cities = ["San Francisco", "New York", "London", "Toronto", "Sydney"]
countries = ["US", "GB", "CA", "AU", "DE"]

events = []
for i in range(n_records):
    customer_id = f"CUST{str(i+1).zfill(8)}"
    
    # Build the event payload as a dict (serialized to JSON and base64-encoded below)
    event_value = {
        "browser": random.choice(browsers),
        "page": random.choice(pages),
        "action": random.choice(actions),
        "event_timestamp": int(datetime.now().timestamp() * 1000) + i,
        "location": {
            "city": random.choice(cities),
            "country": random.choice(countries)
        },
        "session_id": f"SESS{str(i+1).zfill(8)}",
        "customer_id": customer_id
    }
    
    # Encode key and value in base64
    key_encoded = base64.b64encode(customer_id.encode()).decode()
    value_encoded = base64.b64encode(json.dumps(event_value).encode()).decode()
    
    # Create web event record
    web_event = {
        "key": key_encoded,
        "offset": 500000 + i,
        "partition": i % 4,
        "timestamp": int(datetime.now().timestamp() * 1000) + i,
        "topic": "web_events",
        "value": value_encoded
    }
    
    events.append(web_event)

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(events)

# Write to JSON files in the raw volume
output_path = "/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/web-events-json/"
spark_df.coalesce(6).write.mode("overwrite").json(output_path)

print(f"Created {n_records} web event records in JSON format at: {output_path}")
print("Files created:")
files = dbutils.fs.ls(output_path)
for file in files:
    if file.name.startswith('part-'):
        print(f"  {file.name} - {file.size} bytes")
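
Each record stores its `key` and `value` as base64 text, so reading the payload means reversing the two encoding steps: base64-decode first, then parse the JSON. A minimal sketch of that round trip follows. The function name `decode_event` and the sample record are illustrative assumptions, not part of the notebook.

```python
import base64
import json


def decode_event(record: dict):
    """Return (key, payload) decoded from a base64-encoded web event record."""
    key = base64.b64decode(record["key"]).decode()
    payload = json.loads(base64.b64decode(record["value"]).decode())
    return key, payload


# Illustrative record built the same way as in the cell above
payload = {"browser": "Chrome", "page": "home", "customer_id": "CUST00000001"}
record = {
    "key": base64.b64encode("CUST00000001".encode()).decode(),
    "value": base64.b64encode(json.dumps(payload).encode()).decode(),
}

key, value = decode_event(record)
print(key, value["page"])
```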


## Step 4: Verify Setup


In [0]:
%sql
-- Verify catalog and schema
SHOW SCHEMAS IN lakeflow_exercise;

-- Verify volume
DESCRIBE VOLUME lakeflow_exercise.exercise_schema.exercise_raw;

-- List files in raw volume
LIST '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw';
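
The same check can be scripted. The sketch below reports which of the expected outputs are missing under a given root folder. It assumes the volume is reachable through ordinary local file APIs at its `/Volumes/...` path, as the pandas `to_csv` call in Step 3.3 already relies on. The names `missing_paths` and `EXPECTED` are hypothetical.

```python
import os

# Outputs that Steps 3.1 to 3.4 are expected to create, relative to the volume root
EXPECTED = [
    "customers-parquet",
    "transactions-csv",
    "web-events-json",
    os.path.join("inventory-csv", "exercise_malformed_data.csv"),
]


def missing_paths(root: str, expected: list) -> list:
    """Return the entries of `expected` that do not exist under `root`."""
    return [rel for rel in expected if not os.path.exists(os.path.join(root, rel))]


root = "/Volumes/lakeflow_exercise/exercise_schema/exercise_raw"
# An empty list means every expected output was found
print(missing_paths(root, EXPECTED))
```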


## Setup Complete!

Your environment is now ready for the Data Ingestion exercises.

**Catalog:** `lakeflow_exercise`  
**Schema:** `exercise_schema`  
**Volume:** `exercise_raw` (located at `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/`)

**Sample data files created:**
- Parquet files: `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/`
- CSV files: `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/transactions-csv/`
- JSON files: `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/web-events-json/`
- Malformed CSV: `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/inventory-csv/exercise_malformed_data.csv`

You can now proceed with the exercise notebook!
