# Data Import and Exploration

## The Story Continues...

Remember our e-commerce company? The data team just got access to the source systems:

- **CRM System** exports customers to CSV (daily dump)
- **Order Management** sends JSON via API 
- **Product Catalog** is in Parquet (from legacy Hadoop system)

**Your first task:** Import this data and understand its structure before building pipelines.

---

## Why This Matters (Real-World Context)

### The "inferSchema" Trap

Most tutorials show: `spark.read.option("inferSchema", "true")`

**In production, this is often a bad idea:**

| Scenario | With inferSchema | With explicit schema |
|----------|------------------|---------------------|
| 10 GB file | Scans entire file first | Direct read |
| Column `"123"` | Might be INT or STRING? | You control it |
| New column added | Schema changes silently | Fails fast with `mode=FAILFAST` (good!) |
| Null values | Might guess wrong type | Explicit nullable |

**Rule of thumb:** Use `inferSchema=true` for exploration, explicit schema for production.

### The Bronze Layer Philosophy

In the Medallion architecture, the Bronze layer should:
- Keep data as STRING (preserve original values)
- Add metadata (ingestion time, source file)
- NOT apply business logic
- Be idempotent (re-run safe)

*We'll explore this more in the ingestion workshop.*

---

## What You'll Learn

1. **DataFrame Reader API** - unified interface for all formats
2. **Schema definition** - explicit vs inferred, when to use which
3. **Exploration operations** - quick data profiling
4. **Format-specific options** - CSV, JSON, Parquet gotchas

---

## Environment Initialization

Run the central configuration script:

In [0]:
%run ../00_setup

## Notebook Configuration

Define variables specific to this notebook:

In [0]:
# Paths to data directories (subdirectories in DATASET_BASE_PATH from 00_setup)
CUSTOMERS_PATH = f"{DATASET_BASE_PATH}/customers"
ORDERS_PATH = f"{DATASET_BASE_PATH}/orders"
PRODUCTS_PATH = f"{DATASET_BASE_PATH}/products"

# Paths to specific files
CUSTOMERS_CSV = f"{CUSTOMERS_PATH}/customers.csv"
ORDERS_JSON = f"{ORDERS_PATH}/orders_batch.json"
PRODUCTS_PARQUET = f"{PRODUCTS_PATH}/products.parquet"

print(f"Path to customers CSV file: {CUSTOMERS_CSV}")
print(f"Path to orders JSON file: {ORDERS_JSON}")
print(f"Path to products Parquet file: {PRODUCTS_PARQUET}")

---

## Part 1: CSV Data Import (Customers)

### 1.1. Loading CSV with Automatic Schema Inference

In [0]:
# Load CSV data with automatic schema inference
customers_auto_df = (
    spark.read
    .format("csv")
    .option("header", "true")       # First line contains column names
    .option("inferSchema", "true")  # Automatic data type inference
    .load(CUSTOMERS_CSV)
)

In [0]:
# Display schema
print(" Automatically detected schema:")
customers_auto_df.printSchema()

In [0]:
print("\n Data sample (5 rows):")
customers_auto_df.show(5, truncate=False)

In [0]:
# Display data sample as an interactive table
# (head() returns a single Row, so limit(5) is used to keep a DataFrame)
print("\n Data sample:")
display(customers_auto_df.limit(5))

### 1.2. CSV Data Exploration

In [0]:
# List of column names
print(" DataFrame Columns:")
customers_auto_df.columns

In [0]:
# List of data types
print("\n Column data types:")
customers_auto_df.dtypes

In [0]:
# Row count
row_count = customers_auto_df.count()
print(f"\nRow count: {row_count}")

### 1.3. Descriptive Statistics

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

# All-string schema for customers (Bronze style: every value is kept exactly as read).
# Not used by the statistics below; section 1.4 redefines `customers_schema` with typed columns.
# Actual structure: customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment
customers_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("first_name", StringType(), nullable=True),
    StructField("last_name", StringType(), nullable=True),
    StructField("email", StringType(), nullable=True),
    StructField("phone", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
    StructField("state", StringType(), nullable=True),
    StructField("country", StringType(), nullable=True),
    StructField("registration_date", StringType(), nullable=True),  # Date as string, will convert later
    StructField("customer_segment", StringType(), nullable=True)
])

# Descriptive statistics (count, mean, stddev, min, max)
print(" Descriptive statistics (describe):")
customers_auto_df.describe().display()

In [0]:
# Extended statistics (+ percentiles)
print("\n Extended statistics (summary):")
customers_auto_df.summary().display()

### Extended Reader Options – Typical Production Issues

In production, CSV files often arrive with:

- a different separator (`;` instead of `,`),
- quotes inside fields,
- corrupted rows.

Key options:

- `delimiter` (alias of `sep`) – custom column separator,
- `quote` – character that opens and closes text fields (default `"`),
- `escape` – character that escapes quotes inside a quoted field (default `\`),
- `mode` – handling of malformed records: `PERMISSIVE` (default; keeps the row and sets unparseable fields to null), `DROPMALFORMED` (drops the row), `FAILFAST` (raises an error).

### 1.4. Manual Schema Definition for CSV

**Best Practice:** Defining schema manually ensures control and performance.

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

In [0]:
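# Schema definition for customers
# With header=true, Spark still applies a user-supplied CSV schema by position,
# so the fields must follow the file's column order:
# customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment
customers_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),  # string, as in the orders schema; use IntegerType() only if IDs are purely numeric
    StructField("first_name", StringType(), nullable=True),
    StructField("last_name", StringType(), nullable=True),
    StructField("email", StringType(), nullable=True),
    StructField("phone", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
    StructField("state", StringType(), nullable=True),
    StructField("country", StringType(), nullable=True),
    StructField("registration_date", TimestampType(), nullable=True),
    StructField("customer_segment", StringType(), nullable=True)
])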

In [0]:
# Load CSV data with manually defined schema
customers_df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(customers_schema)  # Use defined schema
    .load(CUSTOMERS_CSV)
)
customers_df.printSchema()

In [0]:
print("\n Data sample with manual schema:")
customers_df.show(5, truncate=False)

---

## Part 2: JSON Data Import (Orders)

### 2.1. Loading JSON with Automatic Schema Inference

In [0]:
# Load JSON data with automatic schema inference
# (the JSON reader has no inferSchema option: it infers the schema whenever none is supplied)
orders_auto_df = (
    spark.read
    .format("json")
    .load(ORDERS_JSON)
)

print(" JSON schema detected automatically:")
orders_auto_df.printSchema()

In [0]:
print("\n JSON data sample:")
orders_auto_df.display()

### 2.2. JSON Data Exploration

In [0]:
# Number of columns and rows
print(f" Number of columns: {len(orders_auto_df.columns)}")
print(f" Column names: {orders_auto_df.columns}")
print(f" Number of rows: {orders_auto_df.count()}")

In [0]:
# Data types
print("\n Data types:")
for col_name, col_type in orders_auto_df.dtypes:
    print(f"  - {col_name}: {col_type}")

### 2.3. Manual Schema Definition for JSON

In [0]:
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType, IntegerType

# Schema definition for orders
# Actual structure: order_id, customer_id, product_id, store_id, order_datetime, quantity, unit_price, discount_percent, total_amount, payment_method
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=True),  # Can be null in data
    StructField("customer_id", StringType(), nullable=False),
    StructField("product_id", StringType(), nullable=False),
    StructField("store_id", StringType(), nullable=False),
    StructField("order_datetime", StringType(), nullable=True),  # String, will convert to timestamp later
    StructField("quantity", IntegerType(), nullable=False),
    StructField("unit_price", DoubleType(), nullable=False),
    StructField("discount_percent", IntegerType(), nullable=False),
    StructField("total_amount", DoubleType(), nullable=False),
    StructField("payment_method", StringType(), nullable=False)
])

In [0]:
%skip
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("order_datetime", TimestampType(), nullable=True),
    StructField("total_amount", DoubleType(), nullable=True),
    StructField("status", StringType(), nullable=True)
])

In [0]:
# Load JSON data with manually defined schema
orders_df = (
    spark.read
    .format("json")
    .schema(orders_schema)
    .load(ORDERS_JSON)
)

print(" JSON schema defined manually:")
orders_df.printSchema()

print("\n Data sample with manual schema:")
orders_df.show(5, truncate=False)

### 2.4. Descriptive Statistics for JSON

In [0]:
# Statistics for numerical columns (IDs are strings, so they are left out)
print(" Statistics for orders:")
orders_df.select("quantity", "unit_price", "discount_percent", "total_amount").describe().display()

In [0]:
# Order distribution by payment method
print("\n Order distribution by payment method (payment_method):")
orders_df.groupBy("payment_method").count().orderBy("count", ascending=False).display()

In [0]:
# Top 10 customers with the highest number of orders
print("\n Top 10 customers with the highest number of orders:")
orders_df.groupBy("customer_id").count().orderBy("count", ascending=False).limit(10).display()

---

## Part 3: Parquet Data Import (Products)

### 3.1. Loading Parquet (built-in schema)

In [0]:
# Parquet already contains built-in schema - no need to define it
products_df = (
    spark.read
    .format("parquet")
    .load(PRODUCTS_PARQUET)
)

print(" Parquet Schema (built-in):")
products_df.printSchema()

print("\n Parquet data sample:")
products_df.show(5, truncate=False)

### 3.2. Parquet Data Exploration

In [0]:
# Basic information
print(f" Number of columns: {len(products_df.columns)}")
print(f" Column names: {products_df.columns}")
print(f" Number of products: {products_df.count()}")

In [0]:
# Data types
print("\n Data types:")
for col_name, col_type in products_df.dtypes:
    print(f"  - {col_name}: {col_type}")

### 3.3. Descriptive Statistics for Parquet

In [0]:
# Statistics for all columns (mean and stddev are null for string columns)
print(" Statistics for products:")
products_df.describe().display()

In [0]:
# Extended statistics
print("\n Extended statistics:")
products_df.summary().display()

---

## Part 4: Performance Comparison

### 4.1. Loading CSV: inferSchema vs Manual Schema

In [0]:
import time

# Test 1: Automatic schema inference
start_auto = time.time()
df_auto = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(CUSTOMERS_CSV)
)
count_auto = df_auto.count()  # Action - forces execution
time_auto = time.time() - start_auto

print(f"[BENCHMARK] Loading CSV with inferSchema: {time_auto:.3f} seconds")
print(f"[BENCHMARK] Row count: {count_auto}")

# Test 2: Manual schema definition
start_manual = time.time()
df_manual = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(customers_schema)
    .load(CUSTOMERS_CSV)
)
count_manual = df_manual.count()  # Action - forces execution
time_manual = time.time() - start_manual

print(f"\n[BENCHMARK] Loading CSV with manual schema: {time_manual:.3f} seconds")
print(f"[BENCHMARK] Row count: {count_manual}")

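# Comparison: share of the inferSchema run time saved by the manual schema
# Note: a single run is noisy, and the second read may benefit from warm caches
time_saved_pct = (time_auto - time_manual) / time_auto * 100
print(f"\n[RESULT] Time saved with manual schema: {time_saved_pct:.1f}%")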
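A single timed run is sensitive to warm-up and caching, so a slightly more robust comparison repeats each read and compares medians. This is a sketch, not part of the workshop code: `time_runs` and `percent_faster` are hypothetical helpers, and the stand-in workload is plain Python so the block runs anywhere.

```python
import statistics
import time
from typing import Callable, List


def time_runs(action: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run `action` `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def percent_faster(baseline: List[float], candidate: List[float]) -> float:
    """Median time saved by `candidate` relative to `baseline`, in percent."""
    base = statistics.median(baseline)
    cand = statistics.median(candidate)
    return (base - cand) / base * 100


# Stand-in workload so the sketch is runnable without data files
demo_durations = time_runs(lambda: sum(range(100_000)), repeats=5)
print(f"median: {statistics.median(demo_durations):.6f} s over {len(demo_durations)} runs")
```

In the notebook, each `action` would be a lambda wrapping one of the reads above, for example `lambda: spark.read.format("csv").option("header", "true").schema(customers_schema).load(CUSTOMERS_CSV).count()`.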

---

## Part 5: Best Practices

### Recommendations for Data Import:

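1. **Define the schema manually for production loads**
   - Faster loading (no inference pass over CSV/JSON)
   - Data type control
   - Avoiding type errors
   - `inferSchema` remains fine for quick exploration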

2. **Choose the right format**
   - **Parquet** - best for analytics (columnar, compression)
   - **CSV** - easy to debug, but slower
   - **JSON** - flexible for semi-structured data

3. **Use partitioning**
   - Speeds up queries that filter on the partition column
   - Example: partitioning orders by date

4. **Check data quality immediately**
   - `count()` - check row count
   - `describe()` - check value distribution
   - `printSchema()` - verify types

5. **Use `limit()` when experimenting**
   - Speeds up code iterations
   - Example: `df.limit(1000).display()`

6. **Document schemas**
   - Facilitates code maintenance
   - Example: comments at StructType definition

---

## Summary

In this notebook you learned:

✅ **DataFrame Reader API**
- Loading data from CSV, JSON, Parquet
- Configuring options (header, inferSchema, delimiter)

✅ **Manual Schema Definition**
- Creating schemas using StructType and StructField
- Performance comparison: inferSchema vs manual schema

✅ **Exploration Operations**
- Basic operations: columns, dtypes, count
- Statistics: describe(), summary()
- Grouping and aggregation

✅ **Best Practices**
- Recommendation for manual schema definition
- Choosing data format for different scenarios
- Checking data quality

---

## Additional Resources

- [Databricks - Reading Data](https://docs.databricks.com/ingestion/index.html)
- [PySpark DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)