# Task 1: ETL Pipeline - CSV to Parquet to SQLite

## Scenario
You need to build a data pipeline that:
1. Loads ticket data from CSV
2. Transforms and cleans the data
3. Writes to compressed Parquet format
4. Creates partitioned Parquet files for efficient querying
5. Loads the data into a SQLite database

## Learning Objectives
- Understand CSV vs Parquet trade-offs (size, speed, schema)
- Work with Parquet compression and partitioning
- Create SQLite databases from pandas DataFrames
- Validate data integrity across formats

## Dataset
Support tickets CSV with columns:
- `ticket_id`, `user_id`, `category`, `description`
- `created_at`, `resolved_at`, `priority`, `status`

---
## Setup (Provided)

In [None]:
import pandas as pd
import sqlite3
import os
from pathlib import Path

# Create output directories
OUTPUT_DIR = Path("../fixtures/output")
# parents=True also creates ../fixtures if it does not exist yet
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

PARTITIONED_DIR = OUTPUT_DIR / "partitioned_tickets"
PARTITIONED_DIR.mkdir(parents=True, exist_ok=True)

print("Setup complete!")

---
## Task 1.1: Load and Explore CSV Data

Load the tickets CSV file and explore its structure.

**File location:** `../fixtures/input/tickets.csv`

**Tasks:**
1. Load the CSV into a DataFrame called `df`
2. Parse `created_at` and `resolved_at` as datetime columns
3. Display basic statistics
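
The steps above can be sketched on a tiny in-memory sample. This is only an illustration: the two rows and the name `tickets_sample` are made up, and the real file has more columns. It shows how `parse_dates` turns the named columns into datetimes and how an empty cell becomes `NaT`.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for tickets.csv
sample_csv = io.StringIO(
    "ticket_id,created_at,resolved_at\n"
    "T001,2024-01-05 09:00:00,2024-01-05 17:30:00\n"
    "T002,2024-01-06 10:15:00,\n"
)

# parse_dates converts the named columns to datetime64; empty cells become NaT
tickets_sample = pd.read_csv(sample_csv, parse_dates=["created_at", "resolved_at"])

# Basic exploration: column types, missing values, and the date range
print(tickets_sample.dtypes)
print(tickets_sample.isna().sum())
print(tickets_sample["created_at"].min(), "to", tickets_sample["created_at"].max())
```

For the exercise itself, the same `parse_dates` argument should apply when reading from the file path given above.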

In [None]:
# YOUR CODE HERE
# Load CSV with datetime parsing



In [None]:
# TEST - Do not modify
assert 'df' in dir(), "DataFrame 'df' not found"
assert len(df) == 50, f"Expected 50 tickets, got {len(df)}"
assert 'created_at' in df.columns, "Missing 'created_at' column"
assert pd.api.types.is_datetime64_any_dtype(df['created_at']), "created_at should be datetime type"
assert pd.api.types.is_datetime64_any_dtype(df['resolved_at']), "resolved_at should be datetime type"

print("✓ Task 1.1 PASSED!")
print(f"\nLoaded {len(df)} tickets")
print(f"Date range: {df['created_at'].min()} to {df['created_at'].max()}")
print(f"\nCategories: {df['category'].value_counts().to_dict()}")

---
## Task 1.2: Data Transformation

Add calculated columns for analysis:
1. `resolution_hours`: Hours between created_at and resolved_at
2. `month`: Month from created_at (as string like '2024-01')
3. `category_code`: Numeric code for category (Technical=1, Billing=2, Account=3)

Store the result in `df_transformed`.
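
A minimal sketch of the three derived columns on a made-up three-row frame (the names `toy` and `toy_transformed` are illustrative, not part of the exercise). It assumes both timestamp columns are already datetimes. An unresolved ticket is included to show that a missing `resolved_at` yields `NaN` hours.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself works on the loaded `df`
toy = pd.DataFrame({
    "category": ["Technical", "Billing", "Account"],
    "created_at": pd.to_datetime(
        ["2024-01-05 09:00", "2024-02-10 12:00", "2024-03-01 08:00"]
    ),
    "resolved_at": pd.to_datetime(["2024-01-05 17:30", "2024-02-11 12:00", None]),
})

toy_transformed = toy.copy()  # leave the input frame untouched

# Timedelta converted to hours; unresolved tickets (NaT) produce NaN
elapsed = toy_transformed["resolved_at"] - toy_transformed["created_at"]
toy_transformed["resolution_hours"] = elapsed.dt.total_seconds() / 3600

# strftime yields strings such as '2024-01'
toy_transformed["month"] = toy_transformed["created_at"].dt.strftime("%Y-%m")

# Mapping taken from the task statement; unlisted categories would map to NaN
category_codes = {"Technical": 1, "Billing": 2, "Account": 3}
toy_transformed["category_code"] = toy_transformed["category"].map(category_codes)

print(toy_transformed[["resolution_hours", "month", "category_code"]])
```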

In [None]:
# YOUR CODE HERE
# Create transformed DataFrame with new columns



In [None]:
# TEST - Do not modify
assert 'df_transformed' in dir(), "DataFrame 'df_transformed' not found"
assert len(df_transformed) == 50, "Transformed df should have same number of rows"
assert 'resolution_hours' in df_transformed.columns, "Missing 'resolution_hours' column"
assert 'month' in df_transformed.columns, "Missing 'month' column"
assert 'category_code' in df_transformed.columns, "Missing 'category_code' column"

# Check resolution_hours calculation
first_row_hours = (df.iloc[0]['resolved_at'] - df.iloc[0]['created_at']).total_seconds() / 3600
assert abs(df_transformed.iloc[0]['resolution_hours'] - first_row_hours) < 0.01, \
    "resolution_hours calculation incorrect"

# Check month format
assert df_transformed['month'].iloc[0] in ['2024-01', '2024-02', '2024-03', '2024-04', '2024-05'], \
    "Month should be in format YYYY-MM"

# Check category codes
assert set(df_transformed['category_code'].unique()) <= {1, 2, 3}, \
    "category_code should only contain 1, 2, or 3"

print("✓ Task 1.2 PASSED!")
print(f"\nAverage resolution time: {df_transformed['resolution_hours'].mean():.2f} hours")
print(f"Months covered: {sorted(df_transformed['month'].unique())}")

---
## Task 1.3: Write to Parquet with Compression

Write the transformed data to Parquet format with compression.

**Tasks:**
1. Write `df_transformed` to `../fixtures/output/tickets.parquet`
2. Use `snappy` compression
3. Compare file sizes between CSV and Parquet (with only 50 rows, Parquet's fixed metadata overhead can offset the compression gain)

**Hint:** Use `df.to_parquet()` with `compression` parameter

In [None]:
# YOUR CODE HERE
# Write to Parquet with snappy compression



In [None]:
# TEST - Do not modify
parquet_path = OUTPUT_DIR / "tickets.parquet"
assert parquet_path.exists(), "Parquet file not created"

# Load back and verify
df_loaded = pd.read_parquet(parquet_path)
assert len(df_loaded) == 50, "Parquet file should contain 50 rows"
assert list(df_loaded.columns) == list(df_transformed.columns), "Columns don't match"

# Compare file sizes
csv_size = os.path.getsize("../fixtures/input/tickets.csv")
parquet_size = os.path.getsize(parquet_path)

print("✓ Task 1.3 PASSED!")
print(f"\nFile size comparison:")
print(f"  CSV:     {csv_size:,} bytes")
print(f"  Parquet: {parquet_size:,} bytes")
print(f"  Reduction: {(1 - parquet_size/csv_size)*100:.1f}%")

---
## Task 1.4: Create Partitioned Parquet Files

Create partitioned Parquet files organized by month and category.

**Structure** (`data.parquet` is a placeholder; the Parquet engine generates the actual file names):
```
partitioned_tickets/
  month=2024-01/
    category=Technical/
      data.parquet
    category=Billing/
      data.parquet
  month=2024-02/
    ...
```

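**Hint:** Use `df.to_parquet()` with `partition_cols` parameter. Depending on the engine, re-running the cell may add new files next to the existing ones, so clear `partitioned_tickets/` first if the row count ends up above 50.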

In [None]:
# YOUR CODE HERE
# Create partitioned Parquet files by month and category



In [None]:
# TEST - Do not modify
# Check that partitioned directory has data
partition_files = list(PARTITIONED_DIR.rglob("*.parquet"))
assert len(partition_files) > 0, "No partitioned Parquet files created"

# Verify partition structure
month_dirs = [d for d in PARTITIONED_DIR.iterdir() if d.is_dir() and d.name.startswith("month=")]
assert len(month_dirs) > 0, "No month partitions found"

# Load back and verify data integrity
df_partitioned = pd.read_parquet(PARTITIONED_DIR)
assert len(df_partitioned) == 50, "Partitioned data should have 50 rows"

# Verify we can filter by partition
df_jan = pd.read_parquet(PARTITIONED_DIR, filters=[('month', '=', '2024-01')])
assert len(df_jan) > 0, "Should have January data"
assert all(df_jan['month'] == '2024-01'), "Partition filter not working"

print("✓ Task 1.4 PASSED!")
print(f"\nCreated {len(partition_files)} partition files")
print(f"Months: {sorted([d.name for d in month_dirs])}")
print(f"January tickets: {len(df_jan)}")

---
## Task 1.5: Create SQLite Database

Load the data into a SQLite database and index the columns used for lookups.

**Tasks:**
1. Create connection to `../fixtures/output/tickets.db`
2. Write `df_transformed` to table named `tickets`
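3. Create an index on `user_id` for faster queries (the test expects the index name to contain `user_id`)
4. Create an index on `category` for filtering (the test expects the index name to contain `category`)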

**Hint:** Use `df.to_sql()` and `conn.execute()` for indexes
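
A small sketch of the hinted calls against an in-memory database. The table name `toy_tickets`, the index names, and the rows are all hypothetical. It uses `if_exists="replace"` and `IF NOT EXISTS` so that the cell can be re-run without errors.

```python
import sqlite3

import pandas as pd

# Hypothetical toy rows; the exercise writes df_transformed instead
toy = pd.DataFrame({
    "ticket_id": ["T001", "T002", "T003"],
    "user_id": ["U1", "U2", "U1"],
    "category": ["Technical", "Billing", "Account"],
})

conn = sqlite3.connect(":memory:")  # in-memory database for illustration
try:
    # if_exists="replace" drops and recreates the table on re-runs
    toy.to_sql("toy_tickets", conn, if_exists="replace", index=False)

    # Index names include the column name so they are easy to recognise
    conn.execute("CREATE INDEX IF NOT EXISTS idx_toy_user_id ON toy_tickets (user_id)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_toy_category ON toy_tickets (category)")
    conn.commit()

    toy_index_names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
        )
    ]
    toy_row_count = conn.execute("SELECT COUNT(*) FROM toy_tickets").fetchone()[0]
finally:
    conn.close()

print(toy_index_names, toy_row_count)
```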

In [None]:
# YOUR CODE HERE
# Create SQLite database and load data



In [None]:
# TEST - Do not modify
db_path = OUTPUT_DIR / "tickets.db"
assert db_path.exists(), "Database file not created"

# Verify table and data
conn = sqlite3.connect(db_path)

# Check table exists
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn)
assert 'tickets' in tables['name'].values, "Table 'tickets' not found"

# Check row count
count = pd.read_sql("SELECT COUNT(*) as cnt FROM tickets", conn)['cnt'].iloc[0]
assert count == 50, f"Expected 50 rows in database, got {count}"

# Check indexes exist
indexes = pd.read_sql("SELECT name FROM sqlite_master WHERE type='index'", conn)
index_names = indexes['name'].tolist()
has_user_index = any('user_id' in idx.lower() for idx in index_names)
has_category_index = any('category' in idx.lower() for idx in index_names)

assert has_user_index, "Index on user_id not found"
assert has_category_index, "Index on category not found"

# Verify data integrity
sample = pd.read_sql("SELECT * FROM tickets LIMIT 5", conn)
assert len(sample) == 5, "Could not query data"

conn.close()

print("✓ Task 1.5 PASSED!")
print(f"\nDatabase created with:")
print(f"  Rows: {count}")
print(f"  Indexes: {[idx for idx in index_names if 'auto' not in idx.lower()]}")
print(f"\nSample query result:")
print(sample[['ticket_id', 'category', 'priority']].head())

---
## Task 1.6: Validate Data Integrity

Verify that data is consistent across all formats.

**Tasks:**
1. Load data from the Parquet file and the SQLite database
2. Compare row counts
3. Verify ticket_id values match
4. Check that resolution_hours are consistent
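
One possible way to structure these checks is a small helper that compares two frames by key. The function name `compare_frames`, its parameters, and the toy frames are assumptions made for illustration. Unlike a plain difference check, it treats a value that is missing on both sides as a match.

```python
import pandas as pd


def compare_frames(left, right, key, numeric_col, tol=0.01):
    """Return simple consistency checks between two DataFrames."""
    left_sorted = left.sort_values(key).reset_index(drop=True)
    right_sorted = right.sort_values(key).reset_index(drop=True)

    same_rows = len(left_sorted) == len(right_sorted)
    same_keys = left_sorted[key].tolist() == right_sorted[key].tolist()

    values_close = False
    if same_keys:  # row-wise comparison only makes sense when keys align
        diff = (left_sorted[numeric_col] - right_sorted[numeric_col]).abs()
        both_missing = (
            left_sorted[numeric_col].isna() & right_sorted[numeric_col].isna()
        )
        values_close = bool(((diff < tol) | both_missing).all())

    return {"same_rows": same_rows, "same_keys": same_keys, "values_close": values_close}


# Hypothetical frames standing in for the Parquet and SQLite loads
from_parquet = pd.DataFrame(
    {"ticket_id": ["T002", "T001", "T003"], "resolution_hours": [24.0, 8.5, None]}
)
from_sqlite = pd.DataFrame(
    {"ticket_id": ["T001", "T002", "T003"], "resolution_hours": [8.5, 24.0, None]}
)

report = compare_frames(
    from_parquet, from_sqlite, key="ticket_id", numeric_col="resolution_hours"
)
print(report)
```

Sorting by the key first matters because neither format guarantees that rows come back in the order they were written.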

In [None]:
# YOUR CODE HERE
# Load from both sources and compare



In [None]:
# TEST - Do not modify
parquet_data = pd.read_parquet(OUTPUT_DIR / "tickets.parquet")
db_conn = sqlite3.connect(OUTPUT_DIR / "tickets.db")
db_data = pd.read_sql("SELECT * FROM tickets", db_conn)
db_conn.close()

# Compare row counts
assert len(parquet_data) == len(db_data), "Row counts don't match"

# Compare ticket IDs (sorted)
parquet_ids = sorted(parquet_data['ticket_id'].tolist())
db_ids = sorted(db_data['ticket_id'].tolist())
assert parquet_ids == db_ids, "Ticket IDs don't match"

# Compare resolution hours for first 10 tickets
parquet_sorted = parquet_data.sort_values('ticket_id').reset_index(drop=True)
db_sorted = db_data.sort_values('ticket_id').reset_index(drop=True)

hours_match = all(
    abs(parquet_sorted.loc[i, 'resolution_hours'] - db_sorted.loc[i, 'resolution_hours']) < 0.01
    for i in range(10)
)
assert hours_match, "Resolution hours don't match between formats"

print("✓ Task 1.6 PASSED!")
print("\n✅ Data integrity verified across all formats!")
print(f"\nPipeline summary:")
print(f"  CSV → Parquet: {len(parquet_data)} rows")
print(f"  CSV → SQLite:  {len(db_data)} rows")
print(f"  All formats consistent: ✓")

---
## Summary

Congratulations! You've built a complete ETL pipeline:

✅ Loaded CSV data with proper datetime parsing
✅ Transformed data with calculated columns
✅ Wrote compressed Parquet files (usually smaller than CSV on larger datasets)
✅ Created partitioned Parquet for efficient queries
✅ Loaded data into SQLite with indexes
✅ Validated data integrity across formats

## Key Takeaways

**Format Comparison:**
- **CSV**: Human-readable, universally compatible, untyped, larger on disk at scale
- **Parquet**: Columnar, compressed, fast reads of selected columns, typed schema stored in the file
- **SQLite**: SQL queries, ACID transactions, indexes for fast lookups
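
To make the point about indexes concrete, the sketch below asks SQLite for its query plan before and after creating an index. The table, index name, and rows are made up, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
try:
    # Hypothetical table with 100 rows spread over five users
    conn.execute("CREATE TABLE toy_tickets (ticket_id TEXT, user_id TEXT)")
    conn.executemany(
        "INSERT INTO toy_tickets VALUES (?, ?)",
        [(f"T{i:03d}", f"U{i % 5}") for i in range(100)],
    )

    query = "EXPLAIN QUERY PLAN SELECT * FROM toy_tickets WHERE user_id = ?"

    # The last column of each plan row holds the human-readable detail
    plan_before = [row[-1] for row in conn.execute(query, ("U1",))]
    conn.execute("CREATE INDEX idx_toy_user_id ON toy_tickets (user_id)")
    plan_after = [row[-1] for row in conn.execute(query, ("U1",))]
finally:
    conn.close()

print("before:", plan_before)
print("after: ", plan_after)
```

Without the index the plan reports a full table scan; with it, the plan reports a search that uses `idx_toy_user_id`.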

**Partitioning Benefits:**
- Only read relevant data (partition pruning)
- Parallel processing of partitions
- Easier data management (delete old months)

**When to use what:**
- **CSV**: Sharing data, manual inspection, compatibility
- **Parquet**: Analytics, data lakes, large datasets
- **SQLite**: Application databases, complex queries, transactions