# Lab 01 - Spark DataFrames Deep Dive

## Objective:

Understand how Spark DataFrames work internally and how to manipulate them efficiently:

- Schema inference vs explicit schema
- Column expressions
- Transformations vs actions (deep view)
- Filtering & projections
- Derived columns
- Null handling
- Aggregations
- Window functions
- Execution plans

___
```Section 1``` — Loading Data the Right Way

In real pipelines, bad schema inference causes:
- Wrong types
- Slow reads
- Broken aggregations
- Unexpected casting

In [0]:
# =======================
# Create a sample dataset
# =======================
data = [
    (1, "2026-01-01", "card", 120.50, "CR"),
    (2, "2026-01-01", "transfer", 540.00, "CR"),
    (3, "2026-01-02", "card", 75.25, "US"),
    (4, "2026-01-02", "card", None, "CR"),
]
print(type(data))

# =======================
# Create an explicit schema
# =======================

from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DoubleType,
)

schema = StructType([
    StructField("transaction_id", IntegerType(), False),  # (name, data type, nullable)
    StructField("date", StringType(), True),
    StructField("channel", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("country", StringType(), True),
])

# =======================
# Build the DataFrame
# =======================
df = spark.createDataFrame(data, schema)
df.printSchema()


In [0]:
display(df)

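Some comments:
- When reading files, Spark does not strictly validate data against the schema unless told to: in the default ```PERMISSIVE``` mode, CSV and JSON values that do not match become ```null``` instead of raising an error.
- Schema affects ```Catalyst*``` optimization, since the optimizer works with column names and types.
- An explicit schema is faster than inference for large files, because inference requires an extra read of the data.
- A string date column is not ideal, so we will cast it to a date type.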


**Note**:
Catalyst is the query optimizer of Apache Spark. It decides:
- When to apply filters
- How to reorder operations
- Whether to push filters down to the data source
- Whether to eliminate unused columns
- How to optimize joins (for example, which join strategy to use)

Catalyst aims to build **an efficient physical plan**, chosen by its rules and cost estimates.

In [0]:
spark.version