# Data Ingestion with Lakeflow - Exercise

This exercise notebook tests your understanding of data ingestion techniques covered in the practice notebooks.

**Instructions:**
1. Make sure you have run the `00_Exercise_Setup_Environment.ipynb` notebook first
2. Complete each exercise below
3. You can use either SQL or Python/PySpark (or both) to solve the exercises
4. All tables should be created in the `lakeflow_exercise.exercise_schema` catalog and schema

---

**Environment Setup:**
- **Catalog:** `lakeflow_exercise`
- **Schema:** `exercise_schema`
- **Volume:** `exercise_raw` (located at `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/`)

**Note:** Make sure to run the `00_Exercise_Setup_Environment.ipynb` notebook first to create the catalog, schema, volume, and sample data files.


## A. Setup

Run the following cell to configure your working environment for this exercise.


In [0]:
%sql
-- Set default catalog and schema
USE CATALOG lakeflow_exercise;
USE SCHEMA exercise_schema;

-- View current catalog and schema
SELECT 
  current_catalog(), 
  current_schema();


# Exercise 1: Parquet File Ingestion with CTAS

**Objective:** Ingest customer data from Parquet files into a Delta table.

**Tasks:**
1. Explore the Parquet files located at `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/`
2. Use `read_files()` with a CTAS statement to create a table named `customers_bronze_ctas`
3. Verify the table was created successfully and contains the expected data

**Hints:**
- Use `LIST` to explore the files
- Use `read_files()` with `format => 'parquet'`
- Use `CREATE TABLE ... AS SELECT` syntax


### Your Solution:


In [0]:
%sql
LIST '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/'



In [0]:
%sql
CREATE TABLE customers_bronze_ctas
USING DELTA
AS SELECT * FROM read_files('/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/', format => 'parquet')

In [0]:
%sql
SELECT * FROM customers_bronze_ctas LIMIT 10;

In [0]:
%sql
DESCRIBE customers_bronze_ctas;

# Exercise 2: Incremental Ingestion with COPY INTO

**Objective:** Use `COPY INTO` to incrementally load data into an existing Delta table.

**Tasks:**
1. Create an empty table named `customers_bronze_copy` with only the `customer_id` and `customer_name` columns
2. Use `COPY INTO` to load data from the Parquet files
3. Handle the schema mismatch error by using `COPY_OPTIONS` with `mergeSchema = 'true'`
4. Verify that the data was loaded successfully

**Hints:**
- The Parquet files contain more columns than initially defined in the table
- Use `COPY_OPTIONS ('mergeSchema' = 'true')` to handle schema evolution


### Your Solution:


In [0]:
%sql
-- TODO: Write your solution here
-- Step 1: Create empty table with partial schema
CREATE TABLE customers_bronze_copy (
  customer_id STRING,
  customer_name STRING
);

-- Step 2: Use COPY INTO with mergeSchema option

COPY INTO customers_bronze_copy
  FROM '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/'
  FILEFORMAT = parquet
  COPY_OPTIONS ('mergeSchema' = 'true');

-- Step 3: Verify the data

SELECT * FROM customers_bronze_copy LIMIT 10;


# Exercise 3: Adding Metadata Columns During Ingestion

**Objective:** Create a bronze table with metadata columns from customer Parquet files.

**Tasks:**
1. Create a table named `customers_bronze_metadata` that includes:
   - All original columns from the Parquet files
   - A `registration_date` column (convert `registration_timestamp` from Unix microseconds to DATE)
   - A `file_modification_time` column (from `_metadata`)
   - A `source_file` column (from `_metadata`)
   - An `ingestion_time` column (current timestamp)

**Hints:**
- Use `from_unixtime()` to convert Unix timestamp (divide by 1,000,000 to convert microseconds to seconds)
- Use `_metadata.file_modification_time` and `_metadata.file_name`
- Use `current_timestamp()` for ingestion time
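
The microseconds-to-date arithmetic can be sanity-checked outside Spark. The following is a minimal sketch in plain Python, assuming the date should be interpreted in UTC (Spark's `from_unixtime()` uses the session time zone, so results near midnight can differ). The `micros_to_date` helper and the sample value are hypothetical.

```python
from datetime import date, datetime, timezone


def micros_to_date(micros: int) -> date:
    """Convert Unix microseconds to a calendar date in UTC."""
    seconds = micros // 1_000_000  # microseconds -> whole seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# Hypothetical sample: 2021-01-01 00:00:00 UTC expressed in microseconds
sample_micros = 1_609_459_200_000_000
print(micros_to_date(sample_micros))  # 2021-01-01
```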


### Your Solution:


In [0]:
%sql
-- TODO: Write your solution here
-- Create table with metadata columns using CTAS and read_files()
CREATE TABLE customers_bronze_metadata
USING DELTA
AS SELECT *,
    DATE(from_unixtime(registration_timestamp/1000000)) AS registration_date,
    _metadata.file_modification_time AS file_modification_time,
    _metadata.file_name   AS source_file,
    current_timestamp()   AS ingestion_time

FROM read_files('/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/', format => 'parquet')  


# Exercise 4: CSV Ingestion with Rescued Data Column

**Objective:** Ingest a CSV file with malformed data and handle rescued data.

**Tasks:**
1. Inspect the malformed CSV file at `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/inventory-csv/exercise_malformed_data.csv`
2. Create a table named `inventory_bronze` that:
   - Uses the schema: `product_id STRING, product_name STRING, stock_quantity INT`
   - Includes the `_rescued_data` column to capture malformed rows
   - Includes metadata columns: `file_modification_time`, `source_file`, and `ingestion_time`
3. Query the table to identify which rows have rescued data

**Hints:**
- The CSV file is comma-delimited
- Use `read_files()` with `rescuedDataColumn => '_rescued_data'`
- Some `stock_quantity` values contain text instead of numbers


### Your Solution:


In [0]:
%sql
LIST '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/inventory-csv/exercise_malformed_data.csv'

In [0]:
%sql
-- TODO: Write your solution here
-- Step 1: Inspect the CSV file (optional)
SELECT *
FROM read_files(
  '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/inventory-csv/exercise_malformed_data.csv',
  format => 'csv',
  header => true
);

-- Step 2: Create table with rescued_data column
CREATE TABLE inventory_bronze
USING DELTA
AS
SELECT
  product_id,
  product_name,
  stock_quantity,
  _rescued_data,
  _metadata.file_modification_time AS file_modification_time,
  _metadata.file_path AS source_file,
  current_timestamp() AS ingestion_time
FROM read_files(
  '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/inventory-csv/exercise_malformed_data.csv',
  format => 'csv',
  header => true,
  schema => 'product_id STRING, product_name STRING, stock_quantity INT',
  rescuedDataColumn => '_rescued_data'
);

-- Step 3: Query to find rows with rescued data
SELECT *
FROM inventory_bronze
WHERE _rescued_data IS NOT NULL;


# Exercise 5: JSON File Ingestion and Decoding

**Objective:** Ingest JSON files containing base64-encoded web event data and decode the fields.

**Tasks:**
1. Create a table named `web_events_bronze_raw` from the JSON files at `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/web-events-json/`
2. Create a second table named `web_events_bronze_decoded` that:
   - Decodes the base64-encoded `key` column to STRING (name it `decoded_key`)
   - Decodes the base64-encoded `value` column to STRING (name it `decoded_value`)
   - Includes all other columns: `offset`, `partition`, `timestamp`, `topic`
3. Verify the decoded data is readable

**Hints:**
- Use `unbase64()` function to decode base64 strings
- Cast the decoded BINARY to STRING using `cast(unbase64(...) AS STRING)`


### Your Solution:


In [0]:
%sql
-- TODO: Write your solution here
-- Step 1: Create raw bronze table from JSON files
CREATE TABLE web_events_bronze_raw
USING DELTA
AS
SELECT *
FROM read_files(
  '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/web-events-json/',
  format => 'json'
);

SELECT * FROM web_events_bronze_raw LIMIT 10;
  
-- Step 2: Create decoded bronze table with unbase64()
CREATE TABLE web_events_bronze_decoded
USING DELTA
AS
SELECT
  CAST(unbase64(key) AS STRING)   AS decoded_key,
  CAST(unbase64(value) AS STRING) AS decoded_value,
  offset,
  partition,
  timestamp,
  topic
FROM web_events_bronze_raw;


-- Step 3: Verify the decoded data

SELECT
  decoded_key,
  decoded_value
FROM web_events_bronze_decoded
LIMIT 10;



# Exercise 6: Flattening JSON String Columns

**Objective:** Extract and flatten fields from JSON-formatted strings.

**Tasks:**
1. Create a table named `web_events_bronze_flattened` from `web_events_bronze_decoded` that extracts:
   - `decoded_value:browser` as `browser`
   - `decoded_value:page` as `page`
   - `decoded_value:action` as `action`
   - `decoded_value:location` as `location` (this will be a JSON string)
   - `decoded_value:customer_id` as `customer_id`
   - Include `decoded_key`, `offset`, `partition`, `timestamp`, `topic`
2. Query the table to verify the flattened structure

**Hints:**
- Use SQL-style JSON path extraction: `decoded_value:field_name`
- The `location` field contains nested JSON, so it will remain as a JSON string


### Your Solution:


In [0]:
%sql
-- TODO: Write your solution here
-- Create flattened table using JSON path extraction
CREATE TABLE web_events_bronze_flattened
USING DELTA
AS
SELECT
    decoded_key,
    decoded_value:browser AS browser,
    decoded_value:page AS page,
    decoded_value:action AS action,
    decoded_value:location AS location,
    decoded_value:customer_id AS customer_id,
    offset,
    partition,
    timestamp,
    topic
FROM
    web_events_bronze_decoded;

SELECT * FROM web_events_bronze_flattened;

# Exercise 7: CSV Ingestion with Different Delimiter (Challenge)

**Objective:** Ingest CSV files with a pipe delimiter and create a bronze table.

**Tasks:**
1. Explore the CSV files at `/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/transactions-csv/`
2. Create a table named `transactions_bronze` that:
   - Reads the pipe-delimited CSV files
   - Includes metadata columns: `file_modification_time`, `source_file`, and `ingestion_time`
3. Verify the table contains the expected number of rows

**Hints:**
- The CSV files use pipe (`|`) as delimiter
- Use `read_files()` with `sep => '|'` option
- Don't forget to set `header => true`


### Your Solution:


In [0]:
%sql
-- TODO: Write your solution here
-- Step 1: Preview the pipe-delimited files

SELECT * FROM read_files('/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/transactions-csv/',
 format => 'csv',
 sep => '|',
 header => true);

-- Step 2: Create table with pipe delimiter

CREATE TABLE transactions_bronze
USING DELTA
AS
SELECT *,
    _metadata.file_modification_time AS file_modification_time,
    _metadata.file_name AS source_file,
    current_timestamp() AS ingestion_time
FROM read_files('/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/transactions-csv/',
 format => 'csv',
 sep => '|',
 header => true);

-- Step 3: Verify row count

SELECT COUNT(*) AS total_rows
FROM transactions_bronze;

# Exercise 8: Python/PySpark Alternative (Optional)

**Objective:** Complete Exercise 3 using Python/PySpark instead of SQL.

**Tasks:**
1. Recreate the `customers_bronze_metadata` table using PySpark
2. Use PySpark functions to:
   - Convert Unix timestamp to DATE
   - Add file name using `input_file_name()` or `read_files()` SQL function
   - Add ingestion timestamp using `current_timestamp()`

**Hints:**
- You can use `spark.read.format("parquet")` or `spark.sql()` with `read_files()`
- Use `from_unixtime()` from `pyspark.sql.functions`
- Use `df.write.saveAsTable()` to create the table


### Your Solution:


In [0]:
%python
# TODO: Write your solution here
# Option 1: Using native PySpark

from pyspark.sql.functions import col, current_timestamp, from_unixtime

# `spark` is already defined in Databricks notebooks, so no SparkSession setup is needed.
source_path = "/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/"

# Read the Parquet files
df = spark.read.format("parquet").load(source_path)

# Add the derived and metadata columns:
# - registration_date: Unix microseconds -> seconds -> DATE
# - file_modification_time and source_file: read from the hidden _metadata column,
#   because input_file_name() returns only a file path, not a modification time
# - ingestion_time: timestamp at which this load runs
df_transformed = df.select(
    "*",
    from_unixtime(col("registration_timestamp") / 1000000).cast("date").alias("registration_date"),
    col("_metadata.file_modification_time").alias("file_modification_time"),
    col("_metadata.file_name").alias("source_file"),
    current_timestamp().alias("ingestion_time"),
)

# Write the DataFrame to a Delta table, overwriting the table created in Exercise 3
df_transformed.write.format("delta").mode("overwrite").saveAsTable("customers_bronze_metadata")

# Option 2: Using spark.sql() with read_files() to access _metadata
# CREATE OR REPLACE avoids a "table already exists" error, since the table
# was already created in Exercise 3 (and again by Option 1 above)
spark.sql("""
    CREATE OR REPLACE TABLE customers_bronze_metadata
    USING DELTA
    AS
    SELECT
        *,
        DATE(from_unixtime(registration_timestamp / 1000000)) AS registration_date,
        _metadata.file_modification_time AS file_modification_time,
        _metadata.file_name AS source_file,
        current_timestamp() AS ingestion_time
    FROM read_files(
        '/Volumes/lakeflow_exercise/exercise_schema/exercise_raw/customers-parquet/',
        format => 'parquet'
    )
""")



# Exercise 9: Data Quality Check (Challenge)

**Objective:** Analyze the rescued data and create a cleaned version.

**Tasks:**
1. From the `inventory_bronze` table created in Exercise 4, create a new table `inventory_bronze_cleaned` that:
   - Extracts numeric values from the `_rescued_data` column for malformed `stock_quantity` values
   - For rows where `stock_quantity` is NULL but `_rescued_data` contains a quantity, extract and use that value
   - Sets `stock_quantity` to NULL for rows where no valid numeric value can be extracted (e.g., "N/A")
   - Includes all other columns: `product_id`, `product_name`, and metadata columns

**Hints:**
- Use `COALESCE()` to prefer the original `stock_quantity` value
- Use JSON path extraction on `_rescued_data` to get rescued values: `_rescued_data:stock_quantity`
- Use `REPLACE()` or string functions to clean extracted values if needed


### Your Solution:


In [0]:
%sql
-- TODO: Write your solution here
-- Create cleaned table by extracting values from _rescued_data
CREATE TABLE inventory_bronze_cleaned
USING DELTA
AS
SELECT
    product_id,
    product_name,
    COALESCE(
        stock_quantity,
        CASE 
            WHEN REGEXP_REPLACE(_rescued_data:stock_quantity, '[^0-9]', '') != '' 
            THEN TRY_CAST(REGEXP_REPLACE(_rescued_data:stock_quantity, '[^0-9]', '') AS INT)
            ELSE NULL
        END
    ) AS stock_quantity,
    -- Reuse the metadata columns captured when inventory_bronze was ingested
    file_modification_time,
    source_file,
    ingestion_time
FROM inventory_bronze;



In [0]:
%sql
SELECT *
FROM inventory_bronze_cleaned


# Exercise 10: Summary Query (Verification)

**Objective:** Create summary queries to verify all your work.

**Tasks:**
1. Write a query that shows the row count for each bronze table you created
2. Write a query that shows how many rows have rescued data in the `inventory_bronze` table
3. Write a query that shows the distribution of browsers in the `web_events_bronze_flattened` table

**Hints:**
- Use `COUNT(*)` for row counts
- Use `WHERE _rescued_data IS NOT NULL` to find rescued rows
- Use `GROUP BY` for distributions


### Your Solution:


In [0]:
%sql
-- TODO: Write your summary queries here
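-- Query 1: Row counts for all bronze tables

SELECT 'customers_bronze_ctas' AS table_name, COUNT(*) AS row_count
FROM customers_bronze_ctas
UNION ALL
SELECT 'customers_bronze_copy' AS table_name, COUNT(*) AS row_count
FROM customers_bronze_copy
UNION ALL
SELECT 'customers_bronze_metadata' AS table_name, COUNT(*) AS row_count
FROM customers_bronze_metadata
UNION ALL
SELECT 'inventory_bronze' AS table_name, COUNT(*) AS row_count
FROM inventory_bronze
UNION ALL
SELECT 'inventory_bronze_cleaned' AS table_name, COUNT(*) AS row_count
FROM inventory_bronze_cleaned
UNION ALL
SELECT 'web_events_bronze_raw' AS table_name, COUNT(*) AS row_count
FROM web_events_bronze_raw
UNION ALL
SELECT 'web_events_bronze_decoded' AS table_name, COUNT(*) AS row_count
FROM web_events_bronze_decoded
UNION ALL
SELECT 'web_events_bronze_flattened' AS table_name, COUNT(*) AS row_count
FROM web_events_bronze_flattened
UNION ALL
SELECT 'transactions_bronze' AS table_name, COUNT(*) AS row_count
FROM transactions_bronze;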

-- Query 2: Count of rows with rescued data

SELECT COUNT(*) AS rows_with_rescued_data
FROM inventory_bronze
WHERE _rescued_data IS NOT NULL;

-- Query 3: Browser distribution


SELECT browser, COUNT(*) AS browser_count
FROM web_events_bronze_flattened
GROUP BY browser
ORDER BY browser_count DESC;




## Exercise Complete!

Congratulations on completing the Data Ingestion exercises!

**Expected Tables Created:**
1. `customers_bronze_ctas` - Customer data ingested with CTAS
2. `customers_bronze_copy` - Customer data ingested with COPY INTO
3. `customers_bronze_metadata` - Customer data with metadata columns
4. `inventory_bronze` - Inventory data with rescued data column
5. `inventory_bronze_cleaned` - Cleaned inventory data (Challenge)
6. `web_events_bronze_raw` - Raw web events JSON data
7. `web_events_bronze_decoded` - Decoded web events data
8. `web_events_bronze_flattened` - Flattened web events data
9. `transactions_bronze` - Transaction data from CSV files

**Key Skills Demonstrated:**
- ✅ CTAS with `read_files()`
- ✅ COPY INTO with schema evolution
- ✅ Adding metadata columns during ingestion
- ✅ Handling rescued data columns
- ✅ JSON ingestion and base64 decoding
- ✅ JSON string flattening
- ✅ CSV ingestion with different delimiters
