# File Formats Comparison: CSV, JSON, Parquet

This notebook compares three data formats commonly used in data engineering and ML pipelines:

- **CSV**: Text-based, human-readable, widely supported
- **JSON**: Flexible schema, hierarchical data, human-readable
- **Parquet**: Columnar binary format, optimized for analytics, compressed

We'll compare:
1. File sizes
2. Read/write performance
3. Compression options
4. Column pruning (reading a subset of columns)
5. Use cases and trade-offs

In [None]:
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import json
import time
import os
from pathlib import Path

# Create output directory
output_dir = Path('../fixtures/output')
output_dir.mkdir(exist_ok=True, parents=True)

print(f"pandas version: {pd.__version__}")
print(f"pyarrow version: {pa.__version__}")

## 1. Create Sample Dataset

We'll create a synthetic e-commerce dataset with a mix of data types:
- Numeric columns (integers, floats)
- Text columns (strings)
- Categorical data
- Timestamps

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate sample e-commerce data (100,000 rows)
n_rows = 100_000

df = pd.DataFrame({
    'order_id': range(1, n_rows + 1),
    'customer_id': np.random.randint(1000, 10000, n_rows),
    'product_name': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones', 'Mouse', 'Keyboard'], n_rows),
    'category': np.random.choice(['Electronics', 'Accessories', 'Computers'], n_rows),
    'price': np.round(np.random.uniform(10, 2000, n_rows), 2),
    'quantity': np.random.randint(1, 10, n_rows),
    'discount_percent': np.round(np.random.uniform(0, 30, n_rows), 1),
    'order_date': pd.date_range('2023-01-01', periods=n_rows, freq='5min'),
    'customer_email': [f'user{i}@example.com' for i in np.random.randint(1000, 10000, n_rows)],
    'shipping_country': np.random.choice(['USA', 'UK', 'Canada', 'Germany', 'France'], n_rows),
    'notes': np.random.choice(['Gift wrap requested', 'Express shipping', None, 'Standard delivery'], n_rows)
})

# Calculate total
df['total'] = df['price'] * df['quantity'] * (1 - df['discount_percent'] / 100)

print(f"Dataset shape: {df.shape}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
df.head()

In [None]:
# Data types
df.dtypes

## 2. File Size Comparison

Let's save the data in different formats and compare file sizes.

In [None]:
# CSV (uncompressed)
csv_path = output_dir / 'data.csv'
df.to_csv(csv_path, index=False)

# CSV (gzip compressed)
csv_gz_path = output_dir / 'data.csv.gz'
df.to_csv(csv_gz_path, index=False, compression='gzip')

# JSON (uncompressed)
json_path = output_dir / 'data.json'
df.to_json(json_path, orient='records', date_format='iso')

# JSON (gzip compressed)
json_gz_path = output_dir / 'data.json.gz'
df.to_json(json_gz_path, orient='records', date_format='iso', compression='gzip')

# Parquet (snappy - default compression)
parquet_snappy_path = output_dir / 'data_snappy.parquet'
df.to_parquet(parquet_snappy_path, compression='snappy', index=False)

# Parquet (gzip compression)
parquet_gzip_path = output_dir / 'data_gzip.parquet'
df.to_parquet(parquet_gzip_path, compression='gzip', index=False)

# Parquet (zstd compression - best compression ratio)
parquet_zstd_path = output_dir / 'data_zstd.parquet'
df.to_parquet(parquet_zstd_path, compression='zstd', index=False)

# Parquet (uncompressed)
parquet_none_path = output_dir / 'data_none.parquet'
df.to_parquet(parquet_none_path, compression=None, index=False)

print("Files saved successfully!")

In [None]:
# Compare file sizes
def get_file_size_mb(path):
    return os.path.getsize(path) / 1024**2

sizes = {
    'CSV': get_file_size_mb(csv_path),
    'CSV (gzip)': get_file_size_mb(csv_gz_path),
    'JSON': get_file_size_mb(json_path),
    'JSON (gzip)': get_file_size_mb(json_gz_path),
    'Parquet (snappy)': get_file_size_mb(parquet_snappy_path),
    'Parquet (gzip)': get_file_size_mb(parquet_gzip_path),
    'Parquet (zstd)': get_file_size_mb(parquet_zstd_path),
    'Parquet (none)': get_file_size_mb(parquet_none_path),
}

size_df = pd.DataFrame(list(sizes.items()), columns=['Format', 'Size (MB)'])
# Ratio relative to the uncompressed CSV file (higher means smaller on disk)
size_df['Compression Ratio'] = sizes['CSV'] / size_df['Size (MB)']
size_df = size_df.sort_values('Size (MB)')

print("File Size Comparison:")
print(size_df.to_string(index=False))

# Visualize
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.barh(size_df['Format'], size_df['Size (MB)'])
plt.xlabel('File Size (MB)')
plt.title('File Format Size Comparison')
plt.tight_layout()
plt.show()

### Key Observations:

1. **Parquet with compression (zstd/gzip)** typically gives the smallest file size
2. **Uncompressed CSV is large** because every number and timestamp is stored as text
3. **JSON is larger still** because `orient='records'` repeats every key name in every row, on top of braces and quotes
4. **Snappy compression** (the default for `to_parquet`) favors speed over compression ratio
5. **Zstd** usually compresses better than Snappy, at the cost of somewhat slower writes
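
To see where the JSON overhead comes from, here is a minimal standard-library sketch that serializes the same rows both ways. The three sample rows are made up for illustration and are not taken from the dataset above.

```python
import csv
import io
import json

# Hypothetical sample rows, for illustration only
rows = [
    {"order_id": 1, "product_name": "Laptop", "price": 999.99},
    {"order_id": 2, "product_name": "Phone", "price": 599.5},
    {"order_id": 3, "product_name": "Mouse", "price": 19.0},
]

# CSV: column names appear once, in the header line
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_bytes = len(csv_buffer.getvalue().encode("utf-8"))

# JSON (records layout): every row repeats every key name
json_bytes = len(json.dumps(rows).encode("utf-8"))

print(f"CSV:  {csv_bytes} bytes")
print(f"JSON: {json_bytes} bytes")
```

The gap grows with the number of rows, since the key names are paid for once per record rather than once per file.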

## 3. Read Performance Comparison

Let's benchmark how long it takes to read each format.

In [None]:
def time_read(func, path, runs=5):
    """Time a read operation over multiple runs."""
    times = []
    for _ in range(runs):
        # perf_counter is a monotonic, high-resolution clock meant for timing
        start = time.perf_counter()
        func(path)
        end = time.perf_counter()
        times.append(end - start)
    return np.mean(times), np.std(times)

# Benchmark reads
read_times = {}

print("Benchmarking read performance...\n")

# CSV
mean_time, std_time = time_read(pd.read_csv, csv_path)
read_times['CSV'] = mean_time
print(f"CSV: {mean_time:.4f}s ± {std_time:.4f}s")

# CSV (gzip)
mean_time, std_time = time_read(pd.read_csv, csv_gz_path)
read_times['CSV (gzip)'] = mean_time
print(f"CSV (gzip): {mean_time:.4f}s ± {std_time:.4f}s")

# JSON
mean_time, std_time = time_read(lambda p: pd.read_json(p, orient='records'), json_path)
read_times['JSON'] = mean_time
print(f"JSON: {mean_time:.4f}s ± {std_time:.4f}s")

# JSON (gzip)
mean_time, std_time = time_read(lambda p: pd.read_json(p, orient='records', compression='gzip'), json_gz_path)
read_times['JSON (gzip)'] = mean_time
print(f"JSON (gzip): {mean_time:.4f}s ± {std_time:.4f}s")

# Parquet (snappy)
mean_time, std_time = time_read(pd.read_parquet, parquet_snappy_path)
read_times['Parquet (snappy)'] = mean_time
print(f"Parquet (snappy): {mean_time:.4f}s ± {std_time:.4f}s")

# Parquet (gzip)
mean_time, std_time = time_read(pd.read_parquet, parquet_gzip_path)
read_times['Parquet (gzip)'] = mean_time
print(f"Parquet (gzip): {mean_time:.4f}s ± {std_time:.4f}s")

# Parquet (zstd)
mean_time, std_time = time_read(pd.read_parquet, parquet_zstd_path)
read_times['Parquet (zstd)'] = mean_time
print(f"Parquet (zstd): {mean_time:.4f}s ± {std_time:.4f}s")

In [None]:
# Visualize read performance
perf_df = pd.DataFrame(list(read_times.items()), columns=['Format', 'Read Time (s)'])
perf_df = perf_df.sort_values('Read Time (s)')

plt.figure(figsize=(10, 6))
plt.barh(perf_df['Format'], perf_df['Read Time (s)'])
plt.xlabel('Read Time (seconds)')
plt.title('File Format Read Performance Comparison')
plt.tight_layout()
plt.show()

print("\nRead Performance Ranking:")
print(perf_df.to_string(index=False))

### Key Observations:

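1. **Parquet is typically fastest** for reading, especially with snappy compression
2. **CSV is slower** because every value must be parsed from text and each column's type inferred
3. **JSON is usually slowest** because the parser also has to handle the repeated keys and nesting syntax of every record
4. **Compression adds CPU overhead** on reads, but the smaller files often pay for it when network or disk I/O is the bottleneck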
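
One way to reduce the type-inference work mentioned above is to tell `pd.read_csv` the column types up front. The sketch below uses a small made-up in-memory CSV rather than the notebook's file. Whether this gives a measurable speedup depends on the data and the pandas version, so treat it as something to benchmark rather than a guaranteed win.

```python
import io

import pandas as pd

# Hypothetical CSV content, for illustration only
csv_text = (
    "order_id,category,price\n"
    "1,Electronics,999.99\n"
    "2,Accessories,19.50\n"
    "3,Electronics,599.00\n"
)

# Default: pandas infers each column's type from the text
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit: declare the types so pandas does not have to guess
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"order_id": "int32", "category": "category", "price": "float64"},
)

print(inferred.dtypes)
print(explicit.dtypes)
```

Declaring low-cardinality text columns as `category` can also reduce memory use after loading.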

## 4. Write Performance Comparison

In [None]:
def time_write(func, path, df, runs=3):
    """Time a write operation over multiple runs."""
    times = []
    for _ in range(runs):
        # perf_counter is a monotonic, high-resolution clock meant for timing
        start = time.perf_counter()
        func(df, path)
        end = time.perf_counter()
        times.append(end - start)
    return np.mean(times), np.std(times)

write_times = {}

print("Benchmarking write performance...\n")

# CSV
mean_time, std_time = time_write(lambda d, p: d.to_csv(p, index=False), csv_path, df)
write_times['CSV'] = mean_time
print(f"CSV: {mean_time:.4f}s ± {std_time:.4f}s")

# CSV (gzip)
mean_time, std_time = time_write(lambda d, p: d.to_csv(p, index=False, compression='gzip'), csv_gz_path, df)
write_times['CSV (gzip)'] = mean_time
print(f"CSV (gzip): {mean_time:.4f}s ± {std_time:.4f}s")

# Parquet (snappy)
mean_time, std_time = time_write(lambda d, p: d.to_parquet(p, compression='snappy', index=False), parquet_snappy_path, df)
write_times['Parquet (snappy)'] = mean_time
print(f"Parquet (snappy): {mean_time:.4f}s ± {std_time:.4f}s")

# Parquet (gzip)
mean_time, std_time = time_write(lambda d, p: d.to_parquet(p, compression='gzip', index=False), parquet_gzip_path, df)
write_times['Parquet (gzip)'] = mean_time
print(f"Parquet (gzip): {mean_time:.4f}s ± {std_time:.4f}s")

# Parquet (zstd)
mean_time, std_time = time_write(lambda d, p: d.to_parquet(p, compression='zstd', index=False), parquet_zstd_path, df)
write_times['Parquet (zstd)'] = mean_time
print(f"Parquet (zstd): {mean_time:.4f}s ± {std_time:.4f}s")

# Visualize
write_df = pd.DataFrame(list(write_times.items()), columns=['Format', 'Write Time (s)'])
write_df = write_df.sort_values('Write Time (s)')

plt.figure(figsize=(10, 6))
plt.barh(write_df['Format'], write_df['Write Time (s)'])
plt.xlabel('Write Time (seconds)')
plt.title('File Format Write Performance Comparison')
plt.tight_layout()
plt.show()

## 5. Column Pruning with Parquet

One of Parquet's biggest advantages is **column pruning**: reading only the columns you need.

This is efficient because Parquet stores data in a **columnar format**: each column's values are kept together, so a reader can skip the bytes of columns it does not need.

In [None]:
# Read all columns vs specific columns
print("Reading ALL columns from Parquet:")
start = time.time()
df_all = pd.read_parquet(parquet_snappy_path)
time_all = time.time() - start
print(f"Time: {time_all:.4f}s")
print(f"Shape: {df_all.shape}")
print(f"Memory: {df_all.memory_usage(deep=True).sum() / 1024**2:.2f} MB\n")

# Read only 3 columns
print("Reading ONLY 3 columns (order_id, product_name, total):")
start = time.time()
df_subset = pd.read_parquet(parquet_snappy_path, columns=['order_id', 'product_name', 'total'])
time_subset = time.time() - start
print(f"Time: {time_subset:.4f}s")
print(f"Shape: {df_subset.shape}")
print(f"Memory: {df_subset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\nSpeedup: {time_all / time_subset:.2f}x faster")
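# Compare in-memory footprints (the original line compared read times by mistake)
memory_all = df_all.memory_usage(deep=True).sum()
memory_subset = df_subset.memory_usage(deep=True).sum()
print(f"Memory savings: {(1 - memory_subset / memory_all) * 100:.1f}%")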

In [None]:
df_subset.head()

### Comparison with CSV (no column pruning)

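CSV is row-oriented, so the parser must scan every line of the file even if you only want specific columns. `usecols` reduces memory and conversion work, but not the amount of text that has to be read.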
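
Why a CSV reader cannot skip columns can be illustrated with the standard library's `csv` module. In this sketch (sample data made up for illustration) we want only one column, yet the reader still has to split every row into all of its fields to find it.

```python
import csv
import io

# Hypothetical CSV content, for illustration only
csv_text = (
    "order_id,product_name,total\n"
    "1,Laptop,999.99\n"
    "2,Phone,599.50\n"
    "3,Mouse,19.00\n"
)

fields_parsed = 0
totals = []

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
total_index = header.index("total")

for row in reader:
    # The whole row is split into fields before we can pick one of them
    fields_parsed += len(row)
    totals.append(float(row[total_index]))

print(f"Values kept: {len(totals)}, fields parsed: {fields_parsed}")
```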

In [None]:
# CSV with usecols parameter
print("CSV with usecols (still has to scan entire file):")
start = time.time()
df_csv_subset = pd.read_csv(csv_path, usecols=['order_id', 'product_name', 'total'])
time_csv_subset = time.time() - start
print(f"Time: {time_csv_subset:.4f}s")
print(f"\nParquet is {time_csv_subset / time_subset:.2f}x faster for column pruning!")

## 6. Compression Options Deep Dive

Let's examine the trade-offs between different compression algorithms.

In [None]:
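# Uncompressed Parquet was not timed in sections 3 and 4; benchmark it here
# so the 'None' row shows real numbers instead of the 0 fallback
if 'Parquet (none)' not in read_times:
    read_times['Parquet (none)'], _ = time_read(pd.read_parquet, parquet_none_path)
if 'Parquet (none)' not in write_times:
    write_times['Parquet (none)'], _ = time_write(
        lambda d, p: d.to_parquet(p, compression=None, index=False), parquet_none_path, df)

# Create comparison table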
compression_comparison = pd.DataFrame([
    {
        'Compression': 'None',
        'Size (MB)': get_file_size_mb(parquet_none_path),
        'Write Time': write_times.get('Parquet (none)', 0),
        'Read Time': read_times.get('Parquet (none)', 0),
    },
    {
        'Compression': 'Snappy',
        'Size (MB)': get_file_size_mb(parquet_snappy_path),
        'Write Time': write_times.get('Parquet (snappy)', 0),
        'Read Time': read_times.get('Parquet (snappy)', 0),
    },
    {
        'Compression': 'Gzip',
        'Size (MB)': get_file_size_mb(parquet_gzip_path),
        'Write Time': write_times.get('Parquet (gzip)', 0),
        'Read Time': read_times.get('Parquet (gzip)', 0),
    },
    {
        'Compression': 'Zstd',
        'Size (MB)': get_file_size_mb(parquet_zstd_path),
        'Write Time': write_times.get('Parquet (zstd)', 0),
        'Read Time': read_times.get('Parquet (zstd)', 0),
    },
])

print("Parquet Compression Algorithm Comparison:\n")
print(compression_comparison.to_string(index=False))

### Compression Algorithm Recommendations:

| Algorithm | Speed | Compression | Use Case |
|-----------|-------|-------------|----------|
| **None** | Fastest | No compression | Local processing, temporary files |
| **Snappy** | Very fast | Moderate | **Default choice**, good balance |
| **Gzip** | Slower | Good | Network transfer, long-term storage |
| **Zstd** | Moderate | Best | Best compression ratio, modern choice |

**General recommendation**: Use **Snappy** for most cases, **Zstd** for storage optimization.

## 7. Schema and Type Preservation

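Parquet stores the schema in the file, so column types such as timestamps survive a round trip. CSV stores only text, so types must be re-inferred (or restated by the caller) on every read.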

In [None]:
# Original dtypes
print("Original DataFrame dtypes:")
print(df.dtypes)
print("\n" + "="*50 + "\n")

# Read from CSV (loses type information)
df_csv = pd.read_csv(csv_path)
print("After CSV round-trip (notice order_date becomes string):")
print(df_csv.dtypes)
print("\n" + "="*50 + "\n")

# Read from Parquet (preserves types)
df_parquet = pd.read_parquet(parquet_snappy_path)
print("After Parquet round-trip (types preserved):")
print(df_parquet.dtypes)

In [None]:
# Parquet schema inspection with PyArrow
parquet_file = pq.ParquetFile(parquet_snappy_path)
print("Parquet Schema (stored in file metadata):")
print(parquet_file.schema)

## 8. Summary and Recommendations

### When to Use Each Format:

#### CSV
- Human-readable debugging
- Interoperability with non-technical tools (Excel)
- Simple data exchange
- When schema changes frequently

**Pros**: Universal support, human-readable  
**Cons**: Large file size, slow parsing, no type preservation, no column pruning

#### JSON
- Hierarchical/nested data structures
- Web APIs and microservices
- Configuration files
- Flexible schemas

**Pros**: Flexible schema, handles nested data, human-readable  
**Cons**: Largest file size, slowest to parse, verbose syntax

#### Parquet
- **Data lakes and analytics**
- **ML feature stores**
- **Large-scale data processing**
- Long-term data storage

**Pros**: Small size, fast reads, column pruning, type preservation, excellent compression  
**Cons**: Binary format (not human-readable), requires special tools

### Best Practices:

1. **Use Parquet for analytics pipelines** - it's optimized for read-heavy workloads
2. **Use CSV for human interaction** - debugging, data sharing with non-technical users
3. **Use JSON for APIs and nested data** - web services, configuration
4. **Default to Snappy compression** for Parquet - best speed/compression trade-off
5. **Use Zstd for archival storage** - best compression ratio
6. **Specify columns when reading Parquet** if you need only a subset - this leverages column pruning

In [None]:
# Final summary visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# File size comparison (select formats)
formats = ['CSV', 'JSON', 'Parquet (snappy)', 'Parquet (zstd)']
sizes_selected = [sizes[f] for f in formats]
axes[0].bar(formats, sizes_selected)
axes[0].set_ylabel('Size (MB)')
axes[0].set_title('File Size Comparison')
axes[0].tick_params(axis='x', rotation=45)

# Read performance
read_selected = [read_times[f] for f in formats]
axes[1].bar(formats, read_selected)
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Read Performance')
axes[1].tick_params(axis='x', rotation=45)

# Write performance (JSON writes were not benchmarked in section 4, so JSON is skipped here)
write_selected = [write_times.get(f, 0) for f in formats if f in write_times]
axes[2].bar([f for f in formats if f in write_times], write_selected)
axes[2].set_ylabel('Time (seconds)')
axes[2].set_title('Write Performance')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()