# Part 1: The PySpark Speedrun

## Key Concepts
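- **DataFrame**: The core abstraction: a distributed collection of rows organized into named columns
- **Transformation (Lazy)**: An instruction that adds a step to the plan without running it (e.g., `.select()`, `.filter()`)
- **Action (Eager)**: A command that runs the job and returns or displays a result (e.g., `.show()`, `.count()`)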

## Module 1.1: First Load (10 mins)

**Goal**: Load the TPC-H dataset and explore it


In [None]:
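# Setup: Import only the functions this notebook uses
# (a wildcard import would shadow Python built-ins such as sum, min, and max)
from pyspark.sql.functions import col, year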

# Verify Spark session (pre-configured in Databricks)
# Note: In a local environment, you may need to create the SparkSession manually:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("PySpark Speedrun").getOrCreate()
print(f"Spark Version: {spark.version}")


In [None]:
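# Browse the sample datasets bundled with Databricks
# (display and dbutils are only available inside Databricks notebooks)
display(dbutils.fs.ls('/databricks-datasets'))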

In [None]:
# Load the TPC-H orders dataset (built into Databricks)
# Start simple: just one table to learn the basics
df = spark.table("samples.tpch.orders")


In [None]:
# What data do we have? - Print the schema
df.printSchema()


In [None]:
# Let's see it - Show first 5 rows
df.show(5)


## Module 1.2: Core API (10 mins)

**Goal**: Learn the essential DataFrame operations


In [None]:
# Select specific columns
df.select("o_orderkey", "o_totalprice", "o_orderstatus", "o_orderdate").show(5)


In [None]:
# Create a new column with calculation
df.withColumn(
    "order_year", 
    year(col("o_orderdate"))
).select("o_orderkey", "o_orderdate", "order_year").show(5)


In [None]:
# Filter rows based on condition
df.filter(col("o_orderstatus") == "F").show(5)


In [None]:
# Count total orders
# 💡 count() is an ACTION - it actually runs the computation, just as show() did above
total_orders = df.count()
print(f"Total Orders: {total_orders:,}")


### 🎯 Understanding Lazy Evaluation

**Transformations** (Lazy - build execution plan, don't execute):
- `.select()`, `.filter()`, `.withColumn()`, `.groupBy()`, `.join()`

**Actions** (Eager - trigger actual computation):
- `.show()`, `.count()`, `.collect()`, `.write.save()`

**Key Insight**: Because nothing runs until an action is called, Spark can optimize the entire plan before executing it!
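
To make the transformation/action split concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Spark is implemented: the `LazyRows` class, its methods, and the three sample rows are hypothetical and exist purely for illustration. Transformations record a step and return a new object; actions run the recorded steps.

```python
# Toy model of lazy evaluation. NOT Spark's implementation:
# LazyRows and the sample rows below are hypothetical, for illustration only.

class LazyRows:
    def __init__(self, rows, plan=()):
        self._rows = rows   # source data: a list of dicts
        self._plan = plan   # recorded steps, not yet executed

    # "Transformations": record a step and return a new LazyRows
    def filter(self, predicate):
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyRows(self._rows, self._plan + (("select", columns),))

    # "Actions": run every recorded step against the source data
    def collect(self):
        rows = list(self._rows)
        for kind, arg in self._plan:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

    def count(self):
        return len(self.collect())


orders = [
    {"o_orderkey": 1, "o_orderstatus": "F", "o_totalprice": 100.0},
    {"o_orderkey": 2, "o_orderstatus": "O", "o_totalprice": 250.0},
    {"o_orderkey": 3, "o_orderstatus": "F", "o_totalprice": 75.5},
]

calls = []  # records every time the predicate actually runs

def is_finished(row):
    calls.append(row["o_orderkey"])
    return row["o_orderstatus"] == "F"

# Building the plan does not call the predicate at all
lazy = LazyRows(orders).filter(is_finished).select("o_orderkey")
calls_before_action = len(calls)

# The action runs the plan: the predicate is called once per source row
result = lazy.collect()
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

Each action in this sketch re-runs the whole plan from the source data. Spark likewise recomputes a DataFrame for every action unless you cache it.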


In [None]:
# Demonstrate lazy evaluation
print("Creating transformation (no execution yet)...")
filtered_orders = df.filter(col("o_orderstatus") == "F")
print("Transformation created! (Nothing computed yet)")

In [None]:
print("\nTriggering action...")
# Named finished_count so it does not shadow pyspark.sql.functions.count
finished_count = filtered_orders.count()
print(f"Finished orders: {finished_count:,} (Now computation happened!)")