# Batch Data Ingestion - Demo

**Training objective:** Master techniques for idempotent batch loading of data into Delta Lake.

**Topics covered:**
- COPY INTO (idempotent batch load)
- Different file formats (CSV, JSON, Parquet)
- Schema management (inference vs enforcement)
- Error handling (badRecordsPath)
- CTAS (CREATE TABLE AS SELECT)
- Incremental loading patterns

## Context and requirements

- **Training day**: Day 2 - Delta Lake & Lakehouse
- **Notebook type**: Demo
- **Technical requirements**:
  - Databricks Runtime 13.0+ (recommended: 14.3 LTS)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY

## Theoretical introduction

**COPY INTO** - the recommended method for batch ingestion:
- **Idempotency**: Automatically tracks which files have already been processed
- **File tracking**: The load history is kept with the target Delta table, so only new files are loaded
- **Use cases**: Incremental batch loads, data lake ingestion from S3/ADLS/GCS

**CTAS (CREATE TABLE AS SELECT)**:
- Creates a table from a SELECT query
- **NOT** idempotent
- **Use cases**: One-time loads, transformations, aggregations

**KION dataset**:
- **customers.csv**: customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment
- **orders_batch.json**: order_id, customer_id, product_id, store_id, order_datetime, quantity, unit_price, discount_percent, total_amount, payment_method
- **products.parquet**: product_id, product_name, subcategory_code, brand, unit_cost, list_price, weight_kg, status

## Per-user isolation

In [0]:
%run ../00_setup

## Configuration

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime

# Data paths
CUSTOMERS_CSV = f"{DATASET_BASE_PATH}/customers/customers.csv"
ORDERS_JSON = f"{DATASET_BASE_PATH}/orders/orders_batch.json"
PRODUCTS_PARQUET = f"{DATASET_BASE_PATH}/products/products.parquet"

# Show the user context (variables from 00_setup)
print("=== User context ===")
print(f"Catalog: {CATALOG}")
print(f"Bronze schema: {BRONZE_SCHEMA}")
print(f"Silver schema: {SILVER_SCHEMA}")
print(f"Gold schema: {GOLD_SCHEMA}")
print(f"User: {raw_user}")

# Set the catalog and schema as defaults
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

print(f"\n Default catalog: {CATALOG}")
print(f" Default schema: {BRONZE_SCHEMA}")

---

## Section 1: COPY INTO - CSV (Customers)

**Goal:** Idempotent loading of customer data from CSV.

**Schema customers.csv:**
- customer_id, first_name, last_name, email, phone
- city, state, country
- registration_date, customer_segment

### Example 1.1: COPY INTO from CSV

In [0]:
# Example 1.1 - COPY INTO from CSV

TABLE_CUSTOMERS = f"{BRONZE_SCHEMA}.customers_batch"

# Step 1: Create the target table with a schema matching customers.csv
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {TABLE_CUSTOMERS} (
  customer_id STRING,
  first_name STRING,
  last_name STRING,
  email STRING,
  phone STRING,
  city STRING,
  state STRING,
  country STRING,
  registration_date DATE,
  customer_segment STRING,
  _ingestion_timestamp TIMESTAMP
) USING DELTA
COMMENT 'Customers data - Bronze layer'
""")

print(f"✓ Table {TABLE_CUSTOMERS} ready")

# Step 2: COPY INTO with transformations
result = spark.sql(f"""
COPY INTO {TABLE_CUSTOMERS}
FROM (
  SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    phone,
    city,
    state,
    country,
    TO_DATE(registration_date) as registration_date,
    customer_segment,
    current_timestamp() as _ingestion_timestamp
  FROM '{CUSTOMERS_CSV}'
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'delimiter' = ','
)
""")

display(result)

count = spark.table(TABLE_CUSTOMERS).count()
print(f"\n✓ Table contains {count} customers")

# Inspect the data
print("\n=== Sample data ===")
display(spark.table(TABLE_CUSTOMERS).limit(5))

**Idempotency test:**

Run the cell above again: COPY INTO will not load duplicates. Delta Lake keeps track of the files that were already processed and skips them.
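
The skip-on-rerun behavior can be illustrated without a cluster. The sketch below is a toy model in plain Python, not how COPY INTO is implemented: the `ToyTable` class and the file names are hypothetical, and the only idea it borrows from the text is that already-loaded files are remembered and skipped.

```python
class ToyTable:
    """Toy stand-in for a target table with load tracking (hypothetical, not a Databricks API)."""

    def __init__(self):
        self.rows = []
        self.loaded_files = set()  # plays the role of the table's load history

    def copy_into(self, files):
        """Load only files not seen before; return the number of rows added."""
        added = 0
        for path, file_rows in files.items():
            if path in self.loaded_files:
                continue  # already ingested: skip, so a rerun adds nothing
            self.rows.extend(file_rows)
            self.loaded_files.add(path)
            added += len(file_rows)
        return added


source = {"customers/part-1.csv": [{"customer_id": "C1"}, {"customer_id": "C2"}]}
table = ToyTable()

first_run = table.copy_into(source)
second_run = table.copy_into(source)  # same files again: nothing is loaded

source["customers/part-2.csv"] = [{"customer_id": "C3"}]
third_run = table.copy_into(source)   # only the new file is loaded

print(first_run, second_run, third_run)
```

A plain append without the `loaded_files` check would have added two more rows on the second run, which is the duplication that COPY INTO is meant to avoid.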

---

## Section 2: COPY INTO - JSON (Orders)

**Goal:** Load orders from JSON with audit columns.

**Schema orders_batch.json:**
- order_id, customer_id, product_id, store_id
- order_datetime, quantity, unit_price
- discount_percent, total_amount, payment_method

### Example 2.1: COPY INTO from JSON + audit columns

In [0]:
# Example 2.1 - COPY INTO from JSON

TABLE_ORDERS = f"{BRONZE_SCHEMA}.orders_batch"

# Create the table with a schema matching orders_batch.json
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {TABLE_ORDERS} (
  order_id STRING,
  customer_id STRING,
  product_id STRING,
  store_id STRING,
  order_datetime TIMESTAMP,
  quantity INT,
  unit_price DECIMAL(10,2),
  discount_percent INT,
  total_amount DECIMAL(10,2),
  payment_method STRING,
  _ingestion_timestamp TIMESTAMP,
  _source_file STRING
) USING DELTA
COMMENT 'Orders data - Bronze layer'
""")

print(f"✓ Table {TABLE_ORDERS} ready")

# COPY INTO with SELECT - add audit columns
result = spark.sql(f"""
COPY INTO {TABLE_ORDERS}
FROM (
  SELECT 
    order_id,
    customer_id,
    product_id,
    store_id,
    TO_TIMESTAMP(order_datetime) as order_datetime,
    CAST(quantity AS INT) as quantity,
    CAST(unit_price AS DECIMAL(10,2)) as unit_price,
    CAST(discount_percent AS INT) as discount_percent,
    CAST(total_amount AS DECIMAL(10,2)) as total_amount,
    payment_method,
    current_timestamp() as _ingestion_timestamp,
    _metadata.file_path as _source_file
  FROM '{ORDERS_JSON}'
)
FILEFORMAT = JSON
""")

display(result)

count = spark.table(TABLE_ORDERS).count()
print(f"\n✓ Table contains {count} orders")

# Show the audit columns
print("\n=== Data with audit columns ===")
display(spark.table(TABLE_ORDERS).select(
    "order_id", "customer_id", "total_amount", 
    "_ingestion_timestamp", "_source_file"
).limit(5))

**Audit columns:**

The `_ingestion_timestamp` and `_source_file` columns are key for:
- Traceability: Where did the data come from?
- Debugging: When was it loaded?
- Data lineage: Full history of the data's origin
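
To make the role of these columns concrete, here is a minimal plain-Python sketch of the same idea outside Spark. The helper `add_audit_columns` and the sample rows are hypothetical; only the two column names come from the notebook.

```python
from datetime import datetime, timezone


def add_audit_columns(rows, source_file, ingested_at=None):
    """Return copies of the rows with the two audit fields attached."""
    ingested_at = ingested_at or datetime.now(timezone.utc)
    return [
        {**row, "_ingestion_timestamp": ingested_at, "_source_file": source_file}
        for row in rows
    ]


orders = [
    {"order_id": "O1", "total_amount": "19.99"},
    {"order_id": "O2", "total_amount": "5.00"},
]
audited = add_audit_columns(orders, "orders/orders_batch.json")

# Traceability question: which rows came from a given file?
from_batch_file = [r["order_id"] for r in audited if r["_source_file"] == "orders/orders_batch.json"]
print(from_batch_file)
```

Because every row carries its origin and load time, a bad batch can be located and removed by filtering on `_source_file` instead of reloading the whole table.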

---

## Section 3: COPY INTO - Parquet (Products)

**Goal:** The fastest load path, from Parquet (columnar format).

**Schema products.parquet:**
- product_id, product_name, subcategory_code
- brand, unit_cost, list_price
- weight_kg, status

### Example 3.1: COPY INTO from Parquet

In [0]:
# Example 3.1 - COPY INTO from Parquet

TABLE_PRODUCTS = f"{BRONZE_SCHEMA}.products_batch"

# Create the table with a schema matching products.parquet
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {TABLE_PRODUCTS} (
  product_id STRING,
  product_name STRING,
  subcategory_code STRING,
  brand STRING,
  unit_cost DECIMAL(10,2),
  list_price DECIMAL(10,2),
  weight_kg DECIMAL(10,2),
  status STRING,
  _ingestion_timestamp TIMESTAMP
) USING DELTA
COMMENT 'Products data - Bronze layer'
""")

print(f"✓ Table {TABLE_PRODUCTS} ready")

# COPY INTO from Parquet - typically the fastest of the three formats
result = spark.sql(f"""
COPY INTO {TABLE_PRODUCTS}
FROM (
  SELECT 
    product_id,
    product_name,
    subcategory_code,
    brand,
    CAST(unit_cost AS DECIMAL(10,2)) as unit_cost,
    CAST(list_price AS DECIMAL(10,2)) as list_price,
    CAST(weight_kg AS DECIMAL(10,2)) as weight_kg,
    status,
    current_timestamp() as _ingestion_timestamp
  FROM '{PRODUCTS_PARQUET}'
)
FILEFORMAT = PARQUET
""")

display(result)

count = spark.table(TABLE_PRODUCTS).count()
print(f"\n✓ Table contains {count} products")

print("\n=== Sample products ===")
display(spark.table(TABLE_PRODUCTS).limit(5))

**Performance: Parquet vs CSV/JSON:**

Illustrative read times for 100 GB of data (actual numbers depend on cluster size, compression, and schema):
- **CSV**: ~5 min read time
- **JSON**: ~3 min read time
- **Parquet**: ~30 sec read time ⚡

**💡 Best practice:** Convert CSV/JSON → Parquet in the Bronze layer!
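
One reason a columnar format can be read faster is that a query touching a single column does not have to parse the others. The sketch below is a toy comparison in plain Python, not a benchmark and not how Parquet is implemented: the sample rows are made up, and it ignores compression, encoding, and statistics-based skipping.

```python
import csv
import io

# Hypothetical sample data: 1000 products with four string fields each.
rows = [
    {"product_id": f"P{i}", "brand": "Acme", "list_price": str(10 + i), "status": "active"}
    for i in range(1000)
]

# Row-oriented layout (CSV-like): each line is parsed in full,
# even though only list_price is needed.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)

fields_parsed_row_layout = 0
total_row_layout = 0
for record in csv.DictReader(buffer):
    fields_parsed_row_layout += len(record)
    total_row_layout += int(record["list_price"])  # text is converted on every read

# Column-oriented layout (Parquet-like): values of one column sit together
# and are typed once, at write time.
columns = {name: [row[name] for row in rows] for name in rows[0]}
columns["list_price"] = [int(value) for value in columns["list_price"]]

values_read_column_layout = len(columns["list_price"])
total_column_layout = sum(columns["list_price"])

print(fields_parsed_row_layout, values_read_column_layout)
```

Both layouts give the same total, but the row layout parses four fields per record while the column layout reads one value per record.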

---

## Section 4: Schema Management

**Introduction:**

Two approaches to schema:

**1. Schema inference (automatic):**
- ✅ Fast for prototyping
- ❌ Can be imprecise
- ❌ Scans the data (slower)

**2. Explicit schema (defined up front):**
- ✅ Precise data types
- ✅ Validation at read time
- ✅ Documentation in code
- ✅ **RECOMMENDED for production**
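
The "can be imprecise" point is easiest to see on identifiers that look numeric. The sketch below uses plain Python with a deliberately naive inference rule, so it is an illustration under assumptions rather than Spark's actual inference logic. The sample values (IDs and phone numbers with leading zeros) are hypothetical and are not taken from the KION dataset.

```python
import csv
import io
from datetime import date

RAW = (
    "customer_id,phone,registration_date\n"
    "00123,0048123456,2024-01-15\n"
    "00456,0048987654,2024-02-01\n"
)


def infer_value(text):
    """Naive inference: anything that parses as an integer becomes an int."""
    try:
        return int(text)
    except ValueError:
        return text


rows = list(csv.DictReader(io.StringIO(RAW)))

# Inference: leading zeros are lost, and the date silently stays a string.
inferred = [{key: infer_value(value) for key, value in row.items()} for row in rows]

# Explicit schema: one converter per column, decided by the pipeline author.
EXPLICIT_SCHEMA = {
    "customer_id": str,
    "phone": str,
    "registration_date": date.fromisoformat,
}
explicit = [{key: EXPLICIT_SCHEMA[key](value) for key, value in row.items()} for row in rows]

print(inferred[0])
print(explicit[0])
```

With the explicit schema, a malformed date raises an error at read time instead of passing through as text, which is the validation benefit listed above.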

### Example 4.1: Schema Inference vs Explicit

In [0]:
# Example 4.1 - Schema Inference vs Enforcement

print("=== 1. Schema Inference ===\n")

# Automatic schema detection
df_inferred = spark.read.csv(CUSTOMERS_CSV, header=True, inferSchema=True)

print("Inferred schema:")
df_inferred.printSchema()
print(f"Record count: {df_inferred.count()}")

display(df_inferred.limit(3))

print("\n=== 2. Explicit Schema ===\n")

# Explicitly defined schema (RECOMMENDED!)
schema_explicit = StructType([
  StructField("customer_id", StringType(), False),
  StructField("first_name", StringType(), True),
  StructField("last_name", StringType(), True),
  StructField("email", StringType(), True),
  StructField("phone", StringType(), True),
  StructField("city", StringType(), True),
  StructField("state", StringType(), True),
  StructField("country", StringType(), True),
  StructField("registration_date", DateType(), True),
  StructField("customer_segment", StringType(), True)
])

df_explicit = spark.read.schema(schema_explicit).csv(CUSTOMERS_CSV, header=True)

print("Schema explicit:")
df_explicit.printSchema()
print(f"Record count: {df_explicit.count()}")

display(df_explicit.limit(3))

print("\n💡 Production: always use an explicit schema!")

---

## Section 5: Error Handling

**Error-handling strategies:**

**Parse modes:**
- `PERMISSIVE` (default): Parses what it can; malformed rows go to `_corrupt_record`
- `DROPMALFORMED`: Drops malformed records (use with care!)
- `FAILFAST`: Stops at the first error

**badRecordsPath:**
- Writes invalid records to a folder
- Enables after-the-fact analysis
- **Recommended for production**
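
The three parse modes differ only in what happens to a record that cannot be parsed. The sketch below imitates that behavior with a toy JSON-lines parser in plain Python. The function `parse_lines` and the sample lines are hypothetical, and this is an approximation of the semantics described above, not Spark's implementation.

```python
import json


def parse_lines(lines, mode="PERMISSIVE"):
    """Toy JSON-lines parser imitating the three parse modes."""
    parsed = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            if mode == "FAILFAST":
                # Stop the whole load at the first malformed record.
                raise ValueError(f"malformed record: {line!r}") from err
            if mode == "DROPMALFORMED":
                continue  # silently discard the record
            # PERMISSIVE: keep the row, preserving the raw text for later analysis.
            record = {"_corrupt_record": line}
        parsed.append(record)
    return parsed


lines = [
    '{"order_id": "O1", "quantity": 2}',
    '{"order_id": "O2", "quantity": ',  # truncated on purpose
    '{"order_id": "O3", "quantity": 1}',
]

permissive = parse_lines(lines, "PERMISSIVE")
dropped = parse_lines(lines, "DROPMALFORMED")

try:
    parse_lines(lines, "FAILFAST")
    failfast_error = None
except ValueError as err:
    failfast_error = str(err)

print(len(permissive), len(dropped), failfast_error is not None)
```

Note that `DROPMALFORMED` returns fewer rows without any signal, which is why the list above marks it "use with care".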

### Example 5.1: Error Handling with badRecordsPath

In [0]:
# Example 5.1 - Error Handling with badRecordsPath

BAD_RECORDS_PATH = f"/tmp/{raw_user}/bad_records"

# Clear the folder (for the demo)
try:
    dbutils.fs.rm(BAD_RECORDS_PATH, True)
except Exception:
    pass  # the folder may not exist yet

print(f"Bad records path: {BAD_RECORDS_PATH}")

# Create the table with a _corrupt_record column
TABLE_ERRORS = f"{BRONZE_SCHEMA}.customers_with_validation"

spark.sql(f"""
CREATE TABLE IF NOT EXISTS {TABLE_ERRORS} (
  customer_id STRING,
  first_name STRING,
  last_name STRING,
  email STRING,
  phone STRING,
  city STRING,
  state STRING,
  country STRING,
  registration_date STRING,
  customer_segment STRING,
  _corrupt_record STRING,
  _ingestion_timestamp TIMESTAMP
) USING DELTA
""")

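# Read with error handling.
# Assumption to verify on your runtime: with badRecordsPath set, malformed rows are
# expected to be written under that path, so _corrupt_record may stay NULL in the table.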
df_with_errors = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .option("badRecordsPath", BAD_RECORDS_PATH)
    .load(CUSTOMERS_CSV)
    .withColumn("_ingestion_timestamp", F.current_timestamp())
)

df_with_errors.write.mode("overwrite").saveAsTable(TABLE_ERRORS)

print(f"✓ Data loaded into {TABLE_ERRORS}")

# Analyze the malformed records
print("\n=== Statistics ===")
total = spark.table(TABLE_ERRORS).count()
corrupt = spark.table(TABLE_ERRORS).filter(F.col("_corrupt_record").isNotNull()).count()
valid = total - corrupt

print(f"Total: {total}")
print(f"Valid: {valid}")
print(f"Malformed: {corrupt}")

if corrupt > 0:
    print("\n⚠️ Malformed records:")
    display(spark.table(TABLE_ERRORS).filter(F.col("_corrupt_record").isNotNull()))
else:
    print("\n✅ All records are valid!")

---

## Section 6: CTAS (CREATE TABLE AS SELECT)

**Introduction:**

CTAS creates a table from a SELECT query:
- **NOT** idempotent in the COPY INTO sense: nothing is tracked, so every run creates or fully overwrites the table
- Ideal for transformations and aggregations
- Fast execution (parallel processing)

**When to use CTAS:**
1. One-time historical loads
2. Bronze → Silver/Gold transformations
3. Aggregations (summary tables)
4. Format conversion (CSV → Delta)
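
The rerun behavior of CTAS can be tried locally with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for Delta Lake, the table and sample rows are hypothetical, and because SQLite has no `CREATE OR REPLACE TABLE`, a `DROP TABLE IF EXISTS` followed by CTAS is used to imitate it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, customer_segment TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("C1", "Premium"), ("C2", "Basic"), ("C3", "Premium")],
)

CTAS = """
CREATE TABLE segment_summary AS
SELECT customer_segment, COUNT(*) AS customer_count
FROM customers
GROUP BY customer_segment
"""

conn.execute(CTAS)  # first run creates the table

try:
    conn.execute(CTAS)  # a plain CTAS cannot simply be rerun
    rerun_error = None
except sqlite3.OperationalError as err:
    rerun_error = str(err)

# Stand-in for CREATE OR REPLACE TABLE: the whole table is rebuilt from scratch.
conn.execute("DROP TABLE IF EXISTS segment_summary")
conn.execute(CTAS)

summary = dict(conn.execute("SELECT customer_segment, customer_count FROM segment_summary"))
print(rerun_error)
print(summary)
```

The rebuild gives a correct result on every run, but it recomputes everything each time, which is why CTAS suits one-time loads and derived tables rather than incremental ingestion.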

### Example 6.1: CTAS for aggregation

In [0]:
# Example 6.1 - CTAS for aggregation

AGG_TABLE = f"{SILVER_SCHEMA}.customer_segment_summary"

print(f"=== Creating table: {AGG_TABLE} ===\n")

# CTAS with aggregation
spark.sql(f"""
CREATE OR REPLACE TABLE {AGG_TABLE}
USING DELTA
COMMENT 'Customer segmentation summary'
AS
SELECT 
  customer_segment,
  COUNT(*) as customer_count,
  COUNT(DISTINCT state) as states_count,
  COUNT(DISTINCT country) as countries_count,
  MIN(registration_date) as first_registration,
  MAX(registration_date) as last_registration,
  current_timestamp() as snapshot_timestamp
FROM {TABLE_CUSTOMERS}
WHERE customer_segment IS NOT NULL
GROUP BY customer_segment
ORDER BY customer_count DESC
""")

print(f"✓ Table {AGG_TABLE} created\n")

print("=== Summary by segment ===")
display(spark.table(AGG_TABLE))

### Example 6.2: CTAS Bronze → Silver (data quality)

In [0]:
# Example 6.2 - CTAS Bronze → Silver transformation

SILVER_CUSTOMERS = f"{SILVER_SCHEMA}.customers_clean"

print(f"=== Bronze → Silver transformation: {SILVER_CUSTOMERS} ===\n")

# CTAS with data quality improvements
spark.sql(f"""
CREATE OR REPLACE TABLE {SILVER_CUSTOMERS}
USING DELTA
COMMENT 'Cleaned customers - Silver layer'
AS
SELECT 
  customer_id,
  TRIM(UPPER(first_name)) as first_name,
  TRIM(UPPER(last_name)) as last_name,
  CONCAT(TRIM(first_name), ' ', TRIM(last_name)) as full_name,
  LOWER(TRIM(email)) as email,
  phone,
  UPPER(city) as city,
  UPPER(state) as state,
  UPPER(country) as country,
  registration_date,
  customer_segment,
  DATEDIFF(CURRENT_DATE(), registration_date) as days_since_registration,
  CASE 
    WHEN customer_segment = 'Premium' THEN 'High Value'
    WHEN customer_segment = 'Basic' THEN 'Standard Value'
    ELSE 'Unknown'
  END as value_tier,
  current_timestamp() as processed_timestamp
FROM {TABLE_CUSTOMERS}
WHERE 
  customer_id IS NOT NULL
  AND email IS NOT NULL
  AND email LIKE '%@%'
  AND registration_date IS NOT NULL
""")

print("✓ Silver table created\n")

# Statistics
bronze_count = spark.table(TABLE_CUSTOMERS).count()
silver_count = spark.table(SILVER_CUSTOMERS).count()
filtered = bronze_count - silver_count

print("=== Transformation statistics ===")
print(f"Bronze: {bronze_count}")
print(f"Silver: {silver_count}")
print(f"Filtered out: {filtered}")
if bronze_count > 0:
    print(f"Quality rate: {(silver_count/bronze_count*100):.2f}%")

print("\n=== Sample Silver data ===")
display(spark.table(SILVER_CUSTOMERS).limit(5))

---

## Sekcja 7: Best Practices

### 7.1 File Size Optimization

**Ideal sizes:**
- Minimum: 128 MB per file
- Optimum: 256 MB - 1 GB per file
- Maximum: < 1 GB per file

**The small-files problem:**
```python
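# BAD: one write per input file produces thousands of small files
for file in files:
    spark.read.csv(file).write.mode("append").saveAsTable("table")

# GOOD: read the whole batch at once, then coalesce before writing
df = spark.read.csv(files)
df.coalesce(10).write.mode("append").saveAsTable("table")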
```
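
The `coalesce(10)` above uses a fixed number. A rough way to derive it from the size guidance in this section is sketched below. The helper `target_file_count` is hypothetical, and the 256 MB default is simply the lower bound of the "optimum" range given above.

```python
import math


def target_file_count(total_bytes, target_file_bytes=256 * 1024**2):
    """Number of output files so that each one is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1  # always write at least one file
    return max(1, math.ceil(total_bytes / target_file_bytes))


ten_gib = 10 * 1024**3
files_at_256_mib = target_file_count(ten_gib)
files_at_1_gib = target_file_count(ten_gib, target_file_bytes=1024**3)

print(files_at_256_mib, files_at_1_gib)
```

The result could then be passed to `coalesce(...)` in place of the hard-coded value, assuming the input size is known or can be estimated before the write.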

### 7.2 Idempotency Patterns

**Pattern 1: COPY INTO (Recommended)**
```sql
COPY INTO table FROM 'path/'
FILEFORMAT = PARQUET PATTERN = '*.parquet'  -- already-loaded files are skipped automatically
```

**Pattern 2: MERGE with watermark**
```sql
MERGE INTO target USING (SELECT * FROM source WHERE date >= '2024-01-01') AS src
ON target.id = src.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

**Pattern 3: Overwrite partition**
```sql
INSERT OVERWRITE TABLE target PARTITION (date = '2024-01-15')
SELECT * FROM source WHERE date = '2024-01-15'
```
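
What Patterns 2 and 3 have in common is that a retry leaves the target in the same state as the first run. The plain-Python sketch below shows that property with dictionaries standing in for tables. The helpers `merge_upsert` and `overwrite_partition` and the sample rows are hypothetical illustrations, not Delta Lake APIs.

```python
def merge_upsert(target, source_rows, key="id"):
    """Upsert by key: a matching row is replaced, a new row is inserted."""
    for row in source_rows:
        target[row[key]] = row
    return target


def overwrite_partition(partitions, partition_value, rows):
    """Replace one partition wholesale; other partitions are untouched."""
    partitions[partition_value] = list(rows)
    return partitions


batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

# Naive append: a retry duplicates the batch.
appended = []
appended.extend(batch)
appended.extend(batch)

# MERGE-style upsert: a retry changes nothing.
target = {}
merge_upsert(target, batch)
merge_upsert(target, batch)

# Partition overwrite: a retry rewrites the same partition with the same rows.
partitions = {"2024-01-14": [{"id": 0, "amount": 5}]}
overwrite_partition(partitions, "2024-01-15", batch)
overwrite_partition(partitions, "2024-01-15", batch)

print(len(appended), len(target), len(partitions["2024-01-15"]))
```

The upsert depends on a stable key, while the partition overwrite depends on each batch mapping cleanly onto one partition value. Which pattern fits is decided by the shape of the data.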

### 7.3 Audit Columns (Mandatory!)

```python
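df_audited = (
    df.withColumn("_ingestion_timestamp", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))  # same source as in Example 2.1
    # jobId() is only populated when the notebook runs as a job
    .withColumn("_job_id", F.lit(dbutils.notebook.entry_point.getDbutils().notebook().getContext().jobId().get()))
)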
```

### 7.4 Quick Reference Card

| Scenario | Recommended Approach |
|----------|---------------------|
| Incremental loads (daily/hourly) | COPY INTO |
| One-time historical load | CTAS |
| SaaS integration | Lakeflow Connect |
| High-frequency (< 1h) | Streaming (Auto Loader) |
| Transformations | CTAS (Bronze → Silver) |
| Upserts | MERGE INTO |

---

## Summary

### What we covered:

✅ **COPY INTO - Idempotent Loading**
- Automatic file tracking
- No duplicates on retry
- Different formats: CSV, JSON, Parquet

✅ **Schema Management**
- Inference vs explicit schema
- Explicit schema for production

✅ **Error Handling**
- badRecordsPath for quarantine
- PERMISSIVE mode
- Analysis of corrupt records

✅ **CTAS Transformations**
- Aggregations (segment summary)
- Bronze → Silver (data quality)
- Fast parallel processing

✅ **Best Practices**
- File size optimization
- Idempotency patterns
- Audit columns
- Performance tuning

### Key takeaways:

💡 **1. COPY INTO > INSERT**
- Always prefer COPY INTO for batch loads

💡 **2. Explicit Schema > Inference**
- Production pipelines require a defined schema

💡 **3. Always Handle Errors**
- badRecordsPath + PERMISSIVE mode + monitoring

💡 **4. Audit Everything**
- _ingestion_timestamp, _source_file, _job_id

💡 **5. Optimize Performance**
- Parquet > CSV/JSON
- Files: 256 MB - 1 GB
- Coalesce small files

### Next steps:

📚 **Next notebook:** `03_streaming_data_ingestion.ipynb`
- Auto Loader (cloudFiles)
- Structured Streaming
- Incremental processing

🛠️ **Workshop:** `02_ingestion_pipeline_workshop.ipynb`
- Hands-on: End-to-end pipeline
- Real-world scenarios
- Error handling & monitoring

---

## Cleanup (Optional)

**Note:** Run the cleanup only if you no longer need these tables!

In [0]:
# Cleanup - drop the demo tables
# NOTE: Enable only if you are sure!

CLEANUP_ENABLED = False  # Change to True to enable

if CLEANUP_ENABLED:
    tables_to_drop = [
        f"{CATALOG}.{BRONZE_SCHEMA}.customers_batch",
        f"{CATALOG}.{BRONZE_SCHEMA}.orders_batch",
        f"{CATALOG}.{BRONZE_SCHEMA}.products_batch",
        f"{CATALOG}.{BRONZE_SCHEMA}.customers_with_validation",
        f"{CATALOG}.{SILVER_SCHEMA}.customer_segment_summary",
        f"{CATALOG}.{SILVER_SCHEMA}.customers_clean"
    ]
    
    for table in tables_to_drop:
        try:
            spark.sql(f"DROP TABLE IF EXISTS {table}")
            print(f"✅ Dropped: {table}")
        except Exception as e:
            print(f"⚠️ Error: {str(e)}")

    print("\n✅ Cleanup finished!")
else:
    print("⚠️ Cleanup DISABLED")
    print("Set CLEANUP_ENABLED = True to drop the tables")
    print("\n💡 Recommended: keep the data for the next notebooks!")