# Lesson 1: Data Sources & Formats

**Module 3: Data & Pipeline Engineering**  
**Estimated Time**: 1-2 hours  
**Difficulty**: Beginner-Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand the trade-offs between CSV, Parquet, and Avro  
✅ Learn why columnar storage is critical for analytics and ML  
✅ Benchmark read/write speeds of different formats  
✅ Answer interview questions on Big Data formats  

---

## 📚 Table of Contents

1. [The Big Three: CSV, Parquet, Avro](#1-formats)
2. [Deep Dive: Row vs Columnar Storage](#2-storage)
3. [Hands-On: Benchmarking Performance](#3-hands-on)
4. [Compression Strategies](#4-compression)
5. [Interview Preparation](#5-interview-questions)

---

## 1. The Big Three: CSV, Parquet, Avro

| Feature | CSV | Parquet | Avro |
|---------|-----|---------|------|
| **Type** | Text (Human Readable) | Binary (Columnar) | Binary (Row-based) |
| **Schema** | None (types inferred on read) | Embedded in file footer | Defined in JSON, stored in file header |
| **Use Case** | Excel, small data | Analytics, ML training | Streaming (Kafka) |
| **Compression** | Poor (none built in) | Excellent (Snappy/Gzip) | Good |
| **Write Speed** | Slow | Slow (Encoding overhead) | Fast |
| **Read Speed** | Very Slow | Very Fast (Column pruning) | Fast |

**Key Insight**: For training ML models on tabular data, **Parquet** is almost always the best choice.

## 2. Deep Dive: Row vs Columnar Storage

### Row-Oriented (CSV, OLTP Databases, Avro)
Stores data record by record:  
`[ID:1, Name:John, Age:30], [ID:2, Name:Jane, Age:25]`

**Good for**: Transactional (OLTP) systems, e.g. writing ONE new user.

### Column-Oriented (Parquet)
Stores data column by column:  
`IDs:[1, 2], Names:[John, Jane], Ages:[30, 25]`

**Good for**: Analytical (OLAP) queries, e.g. "Calculate average Age".
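
To make the two layouts concrete, here is a minimal pure-Python sketch. It uses plain lists and dicts as a stand-in for real file formats, and the third record (`Ana`, 41) is a made-up value added only for illustration.

```python
# Row-oriented: one complete record after another (like CSV or Avro).
rows = [
    {"id": 1, "name": "John", "age": 30},
    {"id": 2, "name": "Jane", "age": 25},
]

# Column-oriented: one list per column (like Parquet).
columns = {key: [row[key] for row in rows] for key in rows[0]}

# OLTP-style write: a new record is a single append in the row layout...
new_user = {"id": 3, "name": "Ana", "age": 41}  # hypothetical record
rows.append(new_user)

# ...but has to touch every column in the columnar layout.
for key, value in new_user.items():
    columns[key].append(value)

# OLAP-style query: the average age only needs the "age" column.
average_age = sum(columns["age"]) / len(columns["age"])
print(average_age)
```

The append touches one place in the row layout and three in the columnar one, while the aggregate reads a single list instead of every record.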

**Why for ML?**
When training, you often select only a subset of features (columns). Parquet lets you read ONLY the columns you need and skip the rest, which can cut I/O substantially on wide tables.

## 3. Hands-On: Benchmarking Performance

Simulate a large dataset and compare formats.

```python
import os
import time

import numpy as np
import pandas as pd  # to_parquet/read_parquet need pyarrow or fastparquet installed

# 1. Create a "Large" Dataset (1 Million rows)
print("Generating data...")
n_rows = 1_000_000
df = pd.DataFrame({
    'id': np.arange(n_rows),
    'category': np.random.choice(['A', 'B', 'C'], n_rows),
    'value1': np.random.randn(n_rows),
    'value2': np.random.randn(n_rows)
})

# 2. Benchmark CSV (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
df.to_csv('data.csv', index=False)
write_csv = time.perf_counter() - start

start = time.perf_counter()
pd.read_csv('data.csv')
read_csv = time.perf_counter() - start
size_csv = os.path.getsize('data.csv') / (1024 * 1024)

# 3. Benchmark Parquet (Snappy compression default)
start = time.perf_counter()
df.to_parquet('data.parquet', index=False)
write_pq = time.perf_counter() - start

start = time.perf_counter()
pd.read_parquet('data.parquet')
read_pq = time.perf_counter() - start
size_pq = os.path.getsize('data.parquet') / (1024 * 1024)

# 4. Results
print(f"\n{'Format':<10} {'Write(s)':<10} {'Read(s)':<10} {'Size(MB)':<10}")
print("-"*40)
print(f"{'CSV':<10} {write_csv:<10.2f} {read_csv:<10.2f} {size_csv:<10.2f}")
print(f"{'Parquet':<10} {write_pq:<10.2f} {read_pq:<10.2f} {size_pq:<10.2f}")

print(f"\nParquet is {size_csv/size_pq:.1f}x smaller and {read_csv/read_pq:.1f}x faster to read!")

# Cleanup
os.remove('data.csv')
os.remove('data.parquet')
```

## 5. Interview Preparation

### Common Questions

#### Q1: "Why is Parquet preferred over CSV for S3 data lakes?"
**Answer**: 
1. **Columnar Storage**: Allows scanning only required columns (saving I/O and cost with Athena/Spark).
2. **Schema Enforcement**: Stores data types, preventing "everything is a string" issues common in CSV.
3. **Compression**: Significantly smaller file sizes (saving storage cost).
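4. **Splittable**: A Parquet file is divided into row groups that are compressed independently, so Spark executors can process them in parallel. A gzip-compressed CSV, by contrast, has to be read by a single task.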

#### Q2: "When would you use Avro?"
**Answer**: "I reach for Avro in **streaming data pipelines** (e.g., Kafka). It is row-based, which makes it efficient for writing records one by one. It also supports schema evolution: a field added with a default value can still be resolved against records written with the old schema, which is crucial for long-running producers and consumers."
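
The schema evolution idea can be sketched without any Avro library. The code below is a simplified pure-Python imitation of how a reader schema with a default value resolves a record written under an older schema. It is not the Avro implementation, and the `User` record, its fields, and the `resolve` helper are all hypothetical.

```python
# Writer schema (v1) and reader schema (v2), written in Avro's JSON style.
writer_schema = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}
reader_schema = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # New field: the default keeps old records readable.
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, reader):
    """Project a decoded record onto the reader schema, filling defaults."""
    resolved = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# A record produced by an old (v1) producer.
old_record = {"id": 1, "name": "John"}
assert set(old_record) == {f["name"] for f in writer_schema["fields"]}

print(resolve(old_record, reader_schema))
```

Adding a field without a default would make `resolve` raise for old records, which mirrors why defaults matter when evolving a schema.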

#### Q3: "Explain Dictionary Encoding in Parquet."
**Answer**: "For columns with low cardinality (few unique values), Parquet stores a dictionary of the distinct values (e.g., ['A', 'B']) and stores each row value as a small integer index into that dictionary. Repeated values then cost a few bits each instead of a full string, which compresses categorical data very well."
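
A minimal pure-Python sketch of the idea follows. It illustrates the encoding scheme only, not Parquet's actual on-disk format (which additionally run-length encodes and bit-packs the indexes); the function names are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) where indices point into dictionary."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            # First time we see this value: give it the next free index.
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and indices."""
    return [dictionary[i] for i in indices]


column = ["A", "B", "A", "A", "C", "B"]
dictionary, indices = dictionary_encode(column)

print(dictionary)  # ['A', 'B', 'C']
print(indices)     # [0, 1, 0, 0, 2, 1]
```

Each distinct string is stored once, and every row shrinks to a small integer, which is where the savings come from on low-cardinality columns.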