# Module 04: Data Loading and Storage

**Estimated Time:** 45-60 minutes

## Learning Objectives

By the end of this module, you will:
- Load data to various file formats (CSV, JSON, Parquet)
- Write data to SQL databases
- Understand batch vs. incremental loading strategies
- Implement upsert operations
- Learn data partitioning strategies
- Optimize loading performance

---

## 1. The Load Phase in ETL

The Load phase is where transformed data lands in its final destination.

### Common Destinations
- **Files**: CSV, Parquet, JSON (for data lakes or sharing)
- **Databases**: PostgreSQL, MySQL, SQL Server
- **Data Warehouses**: Snowflake, BigQuery, Redshift
- **Object Storage**: S3, GCS, Azure Blob
- **APIs**: REST endpoints for downstream systems

### Loading Strategies
1. **Full Load**: Replace all data
2. **Incremental Load**: Only new/changed records
3. **Upsert**: Update existing, insert new
4. **Append**: Add new records without updates

### Key Considerations
- Performance: How fast can you load?
- Atomicity: All or nothing?
- Idempotency: Safe to run multiple times?
- Data quality: Validation before loading?
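
Atomicity and idempotency are easiest to see with a file target. The following is a minimal sketch, assuming the destination is on a local filesystem: it writes to a temporary file in the same directory and then swaps it into place with `os.replace`. The helper name `atomic_write_csv` and the sample rows are hypothetical and are not used elsewhere in this module.

```python
import csv
import os
import tempfile


def atomic_write_csv(rows, header, dest_path):
    """Write rows to dest_path so readers never see a half-written file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)

    # Temp file in the same directory, so the final rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
        # os.replace overwrites the destination in a single step
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the partial temp file, then re-raise
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "sales.csv")
demo_rows = [("2024-01-01", "P001", 3), ("2024-01-02", "P002", 5)]
demo_header = ["date", "product_id", "quantity"]

# Running the same load twice leaves the same file: the load is idempotent
atomic_write_csv(demo_rows, demo_header, demo_path)
atomic_write_csv(demo_rows, demo_header, demo_path)
```

Because the second run replaces the file rather than appending to it, rerunning after a failure cannot duplicate records.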

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sqlalchemy import create_engine, text
import os

print("[OK] Libraries loaded")

---

## 2. Loading to File Formats

### 2.1 CSV Files

In [None]:
# Create sample transformed data
sales_data = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=100),
        "product_id": np.random.choice(["P001", "P002", "P003"], 100),
        "quantity": np.random.randint(1, 50, 100),
        "revenue": np.random.uniform(100, 1000, 100).round(2),
        "customer_id": np.random.randint(1, 50, 100),
    }
)

print(f"Sample Data: {len(sales_data)} records")
sales_data.head()

In [None]:
# Load to CSV
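output_path = "../data/processed/sales_data.csv"

# Make sure the output directory exists before writing
os.makedirs(os.path.dirname(output_path), exist_ok=True)

sales_data.to_csv(output_path, index=False)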
print(f"[OK] Loaded {len(sales_data)} records to {output_path}")

# Verify file size
file_size = os.path.getsize(output_path)
print(f"   File size: {file_size:,} bytes ({file_size/1024:.2f} KB)")

In [None]:
# Load with options
sales_data.to_csv(
    "../data/processed/sales_data_advanced.csv",
    index=False,
    date_format="%Y-%m-%d",  # Format dates
    float_format="%.2f",  # 2 decimal places for floats
    encoding="utf-8",
)

print("[OK] Loaded with advanced options")
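
CSV stores everything as text, so column types are not preserved on disk. The sketch below, which uses a small hypothetical frame and a temporary file rather than the module's data, shows one way to check what comes back when the file is read again.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
        "revenue": [100.5, 250.25],
    }
)

path = os.path.join(tempfile.mkdtemp(), "roundtrip.csv")
df.to_csv(path, index=False, date_format="%Y-%m-%d", float_format="%.2f")

# Without parse_dates the date column is read back as plain strings
raw = pd.read_csv(path)
# With parse_dates it is restored to a datetime column
parsed = pd.read_csv(path, parse_dates=["date"])

print(raw.dtypes)
print(parsed.dtypes)
```

Downstream readers of a CSV need to know which columns to parse, which is one reason typed formats such as Parquet are often preferred for larger pipelines.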

### 2.2 Parquet Files (Recommended for Large Data)

In [None]:
# Load to Parquet
parquet_path = "../data/processed/sales_data.parquet"

sales_data.to_parquet(
    parquet_path,
    engine="pyarrow",
    compression="snappy",  # Fast compression
    index=False,
)

# Compare file sizes
csv_size = os.path.getsize(output_path)
parquet_size = os.path.getsize(parquet_path)

print("[OK] Loaded to Parquet")
print(f"   CSV size: {csv_size:,} bytes")
print(f"   Parquet size: {parquet_size:,} bytes")
print(f"   Size reduction: {(1 - parquet_size/csv_size)*100:.1f}% smaller than CSV")

### 2.3 JSON Files

In [None]:
# Load to JSON
json_path = "../data/processed/sales_data.json"

sales_data.to_json(
    json_path,
    orient="records",  # List of objects
    date_format="iso",
    indent=2,
)

print("[OK] Loaded to JSON")
print(f"   File size: {os.path.getsize(json_path):,} bytes")
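
A single JSON array has to be parsed as a whole. When records are appended or streamed, a line-delimited layout (one JSON object per line) is a common alternative. This is a minimal sketch using pandas' `lines=True` option; the sample frame and the `.jsonl` file name are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
        "product_id": ["P001", "P002"],
        "quantity": [3, 5],
    }
)

jsonl_path = os.path.join(tempfile.mkdtemp(), "sales.jsonl")

# lines=True requires orient="records": one JSON object per line
df.to_json(jsonl_path, orient="records", lines=True, date_format="iso")

# Count the non-empty lines: there should be one per record
with open(jsonl_path, encoding="utf-8") as f:
    line_count = sum(1 for line in f if line.strip())

# Read the file back, again treating each line as one record
restored = pd.read_json(jsonl_path, lines=True)
print(f"{line_count} lines, {len(restored)} records restored")
```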

---

## 3. Loading to Databases

### 3.1 Basic Database Loading

In [None]:
# Create SQLite database for demo
db_path = "../data/processed/warehouse.db"
engine = create_engine(f"sqlite:///{db_path}")

print("[OK] Database connection created")

In [None]:
# Load data to database - Method 1: Replace
sales_data.to_sql(
    name="sales",
    con=engine,
    if_exists="replace",  # Drop and recreate table
    index=False,
)

print("[OK] Data loaded to 'sales' table (replace mode)")

# Verify
count = pd.read_sql("SELECT COUNT(*) as count FROM sales", engine)
print(f"   Records in database: {count['count'][0]:,}")

In [None]:
# Load data to database - Method 2: Append
new_sales = pd.DataFrame(
    {
        "date": [datetime.now()],
        "product_id": ["P001"],
        "quantity": [25],
        "revenue": [500.00],
        "customer_id": [999],
    }
)

new_sales.to_sql(
    name="sales",
    con=engine,
    if_exists="append",  # Add to existing table
    index=False,
)

print("[OK] New data appended")

# Verify
count = pd.read_sql("SELECT COUNT(*) as count FROM sales", engine)
print(f"   Records in database: {count['count'][0]:,}")

### 3.2 Incremental Loading (Load Only New Records)

In [None]:
from sqlalchemy import inspect


def incremental_load(df, table_name, engine, date_column="date"):
    """
    Load only records newer than the latest date already in the table
    """
    # Check for the table explicitly instead of catching every exception:
    # a failed append must never fall through to a full replace
    if not inspect(engine).has_table(table_name):
        print("[DATA] Table doesn't exist, creating and loading all records")
        df.to_sql(table_name, engine, if_exists="fail", index=False)
        print(f"[OK] Created table and loaded {len(df)} records")
        return

    # Get the latest date from the database
    query = f"SELECT MAX({date_column}) as max_date FROM {table_name}"
    result = pd.read_sql(query, engine)
    max_date = result["max_date"][0]

    if pd.notna(max_date):
        # SQLite returns dates as text, so convert before comparing
        max_date = pd.to_datetime(max_date)
        df_incremental = df[df[date_column] > max_date]
        print(f"[DATA] Latest date in DB: {max_date}")
        print(f"[DATA] New records to load: {len(df_incremental)}")
    else:
        # Table exists but is empty, load all
        df_incremental = df
        print("[DATA] No existing data, loading all records")

    if len(df_incremental) > 0:
        df_incremental.to_sql(table_name, engine, if_exists="append", index=False)
        print(f"[OK] Loaded {len(df_incremental)} new records")
    else:
        print("[INFO] No new records to load")


# Test incremental load
future_sales = pd.DataFrame(
    {
        "date": pd.date_range("2024-05-01", periods=10),
        "product_id": np.random.choice(["P001", "P002"], 10),
        "quantity": np.random.randint(1, 50, 10),
        "revenue": np.random.uniform(100, 1000, 10).round(2),
        "customer_id": np.random.randint(1, 50, 10),
    }
)

incremental_load(future_sales, "sales_incremental", engine)
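
Reading `MAX(date)` from the target works when dates only move forward. Another common approach keeps the last loaded position in a separate bookkeeping table, often called a watermark. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `load_watermarks` table and `load_new_rows` function are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product_id TEXT, quantity INTEGER)")
conn.execute(
    "CREATE TABLE load_watermarks (table_name TEXT PRIMARY KEY, last_loaded TEXT)"
)


def load_new_rows(conn, rows):
    """Insert rows newer than the stored watermark, then advance it."""
    found = conn.execute(
        "SELECT last_loaded FROM load_watermarks WHERE table_name = 'sales'"
    ).fetchone()
    watermark = found[0] if found else ""

    # ISO-8601 date strings sort chronologically, so string comparison works here
    new_rows = [r for r in rows if r[0] > watermark]
    if not new_rows:
        return 0

    # Data and watermark change in one transaction: both commit or neither does
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)
        conn.execute(
            "INSERT OR REPLACE INTO load_watermarks VALUES ('sales', ?)",
            (max(r[0] for r in new_rows),),
        )
    return len(new_rows)


batch = [("2024-05-01", "P001", 3), ("2024-05-02", "P002", 7)]
first = load_new_rows(conn, batch)
second = load_new_rows(conn, batch)  # rerun: nothing is newer than the watermark
third = load_new_rows(conn, batch + [("2024-05-03", "P001", 1)])
print(first, second, third)  # 2 0 1
```

Updating the watermark in the same transaction as the data keeps the two from drifting apart if a load fails halfway.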

### 3.3 Upsert Operations (Update or Insert)

In [None]:
from sqlalchemy import inspect


def upsert_data(df, table_name, engine, primary_key):
    """
    Upsert: Update if exists, Insert if new

    Note: This is a simplified version. Production systems use
    database-specific UPSERT syntax (INSERT ... ON CONFLICT, MERGE, etc.)
    """
    # First load: the main table doesn't exist yet, so create it
    if not inspect(engine).has_table(table_name):
        df.to_sql(table_name, engine, if_exists="fail", index=False)
        print(f"[OK] Created table and inserted {len(df)} records")
        return

    # Stage the incoming rows in a temporary table
    temp_table = f"{table_name}_temp"
    df.to_sql(temp_table, engine, if_exists="replace", index=False)

    try:
        # engine.begin() runs the DELETE and INSERT in one transaction:
        # it commits on success and rolls back both statements on error
        with engine.begin() as conn:
            # Delete existing records that will be updated
            delete_query = text(
                f"""
                DELETE FROM {table_name}
                WHERE {primary_key} IN (SELECT {primary_key} FROM {temp_table})
            """
            )
            deleted = conn.execute(delete_query).rowcount

            # Insert all records from temp table
            insert_query = text(
                f"""
                INSERT INTO {table_name}
                SELECT * FROM {temp_table}
            """
            )
            conn.execute(insert_query)

        print("[OK] Upsert complete")
        print(f"   Updated: {deleted} records")
        print(f"   Total processed: {len(df)} records")

    finally:
        # Drop temp table, even if the upsert failed
        with engine.begin() as conn:
            conn.execute(text(f"DROP TABLE IF EXISTS {temp_table}"))


# Test upsert
customer_data = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Carol"],
        "email": ["alice@ex.com", "bob@ex.com", "carol@ex.com"],
        "lifetime_value": [1000, 2000, 1500],
    }
)

upsert_data(customer_data, "customers", engine, "customer_id")

# Update some customers
updated_customers = pd.DataFrame(
    {
        "customer_id": [2, 3, 4],  # 2 & 3 exist, 4 is new
        "name": ["Bob Updated", "Carol Updated", "David"],
        "email": ["bob@ex.com", "carol@ex.com", "david@ex.com"],
        "lifetime_value": [2500, 1800, 500],
    }
)

upsert_data(updated_customers, "customers", engine, "customer_id")

# Verify
result = pd.read_sql("SELECT * FROM customers ORDER BY customer_id", engine)
print("\nFinal customer data:")
result
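
The delete-then-insert approach above is portable, but most databases also offer a native upsert statement. As a hedged sketch of the `INSERT ... ON CONFLICT` form mentioned in the docstring, the example below uses Python's built-in `sqlite3` module against an in-memory database. It assumes SQLite 3.24 or newer and a table with a declared primary key; the syntax differs on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, lifetime_value INTEGER)"
)

# "excluded" refers to the row that was proposed for insertion
UPSERT_SQL = """
    INSERT INTO customers (customer_id, name, lifetime_value)
    VALUES (?, ?, ?)
    ON CONFLICT(customer_id) DO UPDATE SET
        name = excluded.name,
        lifetime_value = excluded.lifetime_value
"""

# Initial load: both rows are new, so both are inserted
with conn:
    conn.executemany(UPSERT_SQL, [(1, "Alice", 1000), (2, "Bob", 2000)])

# Second load: customer 2 already exists (updated), customer 3 is new (inserted)
with conn:
    conn.executemany(UPSERT_SQL, [(2, "Bob Updated", 2500), (3, "Carol", 1500)])

rows = conn.execute("SELECT * FROM customers ORDER BY customer_id").fetchall()
print(rows)
# [(1, 'Alice', 1000), (2, 'Bob Updated', 2500), (3, 'Carol', 1500)]
```

Unlike the temp-table version, this relies on the primary key constraint to detect conflicts, so the target table must declare one.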

---

## 4. Data Partitioning Strategies

In [None]:
# Create larger dataset for partitioning: 2 years of hourly data
dates = pd.date_range("2023-01-01", "2024-12-31 23:00", freq="h")
n_rows = len(dates)  # derive the length instead of hard-coding it

large_sales = pd.DataFrame(
    {
        "date": dates,
        "product_id": np.random.choice(["P001", "P002", "P003"], n_rows),
        "revenue": np.random.uniform(100, 1000, n_rows).round(2),
    }
)

large_sales["year"] = large_sales["date"].dt.year
large_sales["month"] = large_sales["date"].dt.month

print(f"Large dataset: {len(large_sales):,} records")
print(f"Date range: {large_sales['date'].min()} to {large_sales['date'].max()}")

In [None]:
# Partition by year and month
def save_partitioned_data(df, base_path, partition_cols):
    """
    Save data partitioned by specified columns

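    Example structure: base_path/year=2024/month=1/data.parquet
    """
    for group_keys, group_df in df.groupby(partition_cols):
        # Depending on the pandas version, grouping by a single column yields
        # either a scalar key or a 1-tuple; normalize to a tuple
        if not isinstance(group_keys, tuple):
            group_keys = (group_keys,)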

        partition_path = base_path
        for col, value in zip(partition_cols, group_keys):
            partition_path = os.path.join(partition_path, f"{col}={value}")

        os.makedirs(partition_path, exist_ok=True)

        # Save partition
        file_path = os.path.join(partition_path, "data.parquet")
        group_df.to_parquet(file_path, index=False)

        print(f"[OK] Saved partition: {partition_path} ({len(group_df):,} records)")


# Save partitioned data
partition_base = "../data/processed/sales_partitioned"
save_partitioned_data(large_sales, partition_base, ["year", "month"])

print("\n[DATA] Partitioning complete!")
print("   Benefit: Faster queries when filtering by year/month")
print("   Benefit: Can process one partition at a time")
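
The payoff of a `key=value` directory layout is that a reader can skip whole directories without opening any files. The sketch below illustrates that idea with the standard library only; `find_partition_files` is a hypothetical helper, and the demo builds empty placeholder files in a temporary directory rather than real Parquet data.

```python
import os
import tempfile


def find_partition_files(base_path, **filters):
    """Return data files whose key=value directories match every filter."""
    matches = []
    for root, _dirs, files in os.walk(base_path):
        # Collect key=value pairs from the directory names below base_path
        rel = os.path.relpath(root, base_path)
        parts = dict(p.split("=", 1) for p in rel.split(os.sep) if "=" in p)
        if all(parts.get(k) == str(v) for k, v in filters.items()):
            matches.extend(os.path.join(root, f) for f in files)
    return sorted(matches)


# Build a demo layout with empty placeholder files
base = tempfile.mkdtemp()
for year, month in [(2023, 12), (2024, 1), (2024, 2)]:
    part_dir = os.path.join(base, f"year={year}", f"month={month}")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "data.parquet"), "w"):
        pass

files_2024 = find_partition_files(base, year=2024)
files_jan = find_partition_files(base, year=2024, month=1)
print(len(files_2024), len(files_jan))  # 2 1
```

Query engines that understand this layout apply the same pruning automatically when a filter references a partition column.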

---

## 5. Performance Optimization

In [None]:
import time

# Create test data
test_data = pd.DataFrame(
    {
        "id": range(10000),
        "value": np.random.randn(10000),
        "category": np.random.choice(["A", "B", "C"], 10000),
    }
)

print(f"Test data: {len(test_data):,} records")

In [None]:
# Compare loading speeds
def benchmark_loading(df, method="default"):
    """
    Benchmark different loading methods
    """
    start = time.time()

    if method == "default":
        df.to_sql("benchmark", engine, if_exists="replace", index=False)
    elif method == "batched":
        df.to_sql("benchmark", engine, if_exists="replace", index=False, chunksize=1000)

    elapsed = time.time() - start
    return elapsed


# Benchmark
time_default = benchmark_loading(test_data, "default")
print(f"Default loading: {time_default:.3f}s")

time_batched = benchmark_loading(test_data, "batched")
print(f"Batched loading: {time_batched:.3f}s")

if time_batched < time_default:
    print(f"\n[OK] Batched is {((time_default - time_batched) / time_default * 100):.1f}% faster")
else:
    print(f"\n[OK] Default is {((time_batched - time_default) / time_batched * 100):.1f}% faster")
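
Much of the cost of a database load comes from per-statement and per-commit overhead rather than from the data itself. As a rough illustration, assuming an in-memory SQLite database and hypothetical helper names, the sketch below compares committing every row against inserting in batches with `executemany`. Timings vary by machine and are printed, not asserted.

```python
import sqlite3
import time

rows = [(i, i * 0.5) for i in range(5000)]


def load_row_by_row(conn):
    # One INSERT and one commit per record
    for row in rows:
        conn.execute("INSERT INTO t VALUES (?, ?)", row)
        conn.commit()


def load_in_batches(conn, batch_size=1000):
    # One executemany call and one commit per batch
    for start in range(0, len(rows), batch_size):
        with conn:
            conn.executemany(
                "INSERT INTO t VALUES (?, ?)", rows[start:start + batch_size]
            )


timings = {}
counts = {}
for name, load_fn in [("row_by_row", load_row_by_row), ("batched", load_in_batches)]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, value REAL)")
    started = time.perf_counter()
    load_fn(conn)
    timings[name] = time.perf_counter() - started
    counts[name] = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s for {counts[name]:,} rows")
```

On a disk-backed or networked database the gap is usually larger, because each commit involves a flush or a round trip.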

---

## 6. Best Practices for Data Loading

### 1. Choose the Right Format
- **CSV**: Simple, human-readable, universal
- **Parquet**: Large datasets, analytics, compression
- **JSON**: Nested/hierarchical data, APIs

### 2. Use Appropriate Loading Strategy
- **Full Load**: Simple but slow for large datasets
- **Incremental**: Faster, but requires tracking what has already been loaded (for example, a max date or watermark)
- **Upsert**: Best for dimensional data

### 3. Implement Data Validation
- Validate before loading
- Check row counts
- Verify data types
- Log any discrepancies
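
The checks above can be gathered into a small gate that runs before any write. This is a minimal sketch; the function name `validate_before_load`, its parameters, and the sample frames are hypothetical, and a real pipeline would likely add type and range checks.

```python
import pandas as pd


def validate_before_load(df, required_columns, not_null_columns):
    """Return a list of problems; an empty list means the frame may be loaded."""
    problems = []

    # Every required column must be present
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # An empty frame usually signals an upstream failure
    if df.empty:
        problems.append("no rows to load")

    # Key columns must not contain nulls
    for col in not_null_columns:
        if col in df.columns and df[col].isna().any():
            problems.append(f"null values in '{col}'")

    return problems


good = pd.DataFrame({"customer_id": [1, 2], "revenue": [10.0, 20.0]})
bad = pd.DataFrame({"customer_id": [1, None], "amount": [10.0, 20.0]})

good_problems = validate_before_load(good, ["customer_id", "revenue"], ["customer_id"])
bad_problems = validate_before_load(bad, ["customer_id", "revenue"], ["customer_id"])

print(good_problems)  # []
print(bad_problems)   # ["missing columns: ['revenue']", "null values in 'customer_id'"]
```

Returning a list rather than raising on the first problem lets the caller log every discrepancy in one pass.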

### 4. Handle Failures Gracefully
- Use transactions when possible
- Implement retry logic
- Log errors
- Alert on failures
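
Retry logic is usually a thin wrapper around the load call. The sketch below shows one possible shape, with exponential backoff between attempts. `load_with_retry` and `flaky_load` are hypothetical names, and the delay is kept very short so the example runs quickly.

```python
import time


def load_with_retry(load_fn, max_attempts=3, base_delay=0.01):
    """Call load_fn until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return load_fn()
        except Exception as exc:
            print(f"[FAIL] Attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_load():
    # Stand-in for a real load: fails twice, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary outage")
    return "loaded"


outcome = load_with_retry(flaky_load)
print(outcome, calls["count"])  # loaded 3
```

Retries are only safe when the load itself is idempotent. Otherwise a retry after a partial write can duplicate records.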

### 5. Partition Large Datasets
- By date (year/month/day)
- By category/region
- Enables parallel processing
- Faster queries

### 6. Monitor Performance
- Track loading times
- Monitor resource usage
- Optimize bottlenecks
- Scale as needed

---

## 7. Complete Loading Example

In [None]:
class DataLoader:
    """
    Simple data loader that applies the practices from this module
    """

    def __init__(self, engine=None):
        self.engine = engine
        self.load_stats = {}

    def load(self, df, destination_type, **kwargs):
        """
        Load data to various destinations
        """
        start = time.time()

        try:
            if destination_type == "csv":
                df.to_csv(kwargs["file_path"], index=False)
            elif destination_type == "parquet":
                df.to_parquet(kwargs["file_path"], compression="snappy", index=False)
            elif destination_type == "database":
                df.to_sql(
                    kwargs["table_name"],
                    self.engine,
                    if_exists=kwargs.get("if_exists", "replace"),
                    index=False,
                )
            else:
                raise ValueError(f"Unsupported destination: {destination_type}")

            elapsed = time.time() - start

            self.load_stats[destination_type] = {
                "records": len(df),
                "duration_seconds": elapsed,
                "timestamp": datetime.now().isoformat(),
            }

            print(f"[OK] Loaded {len(df):,} records to {destination_type} in {elapsed:.2f}s")
            return True

        except Exception as e:
            print(f"[FAIL] Load failed: {e}")
            return False


# Use the loader
loader = DataLoader(engine=engine)

# Load to multiple destinations
sample_data = sales_data.head(50)

loader.load(sample_data, "csv", file_path="../data/processed/final_output.csv")
loader.load(sample_data, "parquet", file_path="../data/processed/final_output.parquet")
loader.load(sample_data, "database", table_name="final_sales", if_exists="replace")

print("\n[DATA] Loading Statistics:")
for dest, stats in loader.load_stats.items():
    print(f"   {dest}: {stats['records']:,} records in {stats['duration_seconds']:.2f}s")

---

## 8. Key Takeaways

[OK] **File Formats**: Choose based on use case (CSV, Parquet, JSON)

[OK] **Loading Strategies**: Full, incremental, upsert, append

[OK] **Incremental Loading**: Only load new/changed data

[OK] **Upsert**: Update existing, insert new records

[OK] **Partitioning**: Split data for better performance

[OK] **Validation**: Always verify data before and after loading

### Next Steps

In **Module 05: Building ETL Pipelines**, we'll:
- Combine Extract, Transform, and Load into complete pipelines
- Learn pipeline design patterns
- Implement error handling and logging
- Build modular, reusable pipeline components

---

**Ready to build complete pipelines?** Open `05_building_etl_pipelines.ipynb`!