# PySpark DataFrames Basics

## Overview
This notebook introduces PySpark DataFrames, the fundamental data structure in Spark for structured data processing.

## Learning Objectives
- Understand DataFrames and their benefits
- Create DataFrames from various sources
- Perform basic transformations
- Understand lazy evaluation and actions
- Work with schemas and data types

---

## 1. What are DataFrames?

### DataFrame Concept

A **DataFrame** is a distributed collection of data organized into named columns, similar to a table in a relational database or a pandas DataFrame, but with optimizations for distributed computing.

**Key Features**:
- ✅ Distributed across cluster nodes
- ✅ Immutable (transformations create new DataFrames)
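- ✅ Lazy evaluation (transformations are not executed until an action is called)
- ✅ Optimized execution plans (Catalyst optimizer)
- ✅ Schema-aware (every column has a name and a data type; in PySpark these are checked at runtime, not at compile time)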

**DataFrame vs RDD**:
- DataFrames have named columns and schema
- Usually faster, because the Catalyst optimizer can use the schema to plan the query
- Easier to use (SQL-like operations)
- DataFrames are recommended over RDDs for most use cases

## 2. Creating DataFrames

### Method 1: From Python Collections

In [None]:
from pyspark.sql import SparkSession
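# Note: this wildcard import shadows Python built-ins such as min, max, sum, and round
# with Spark column functions. Later cells in this notebook rely on it.
from pyspark.sql.functions import *
from pyspark.sql.types import *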

# In Databricks, a SparkSession named `spark` is already available
# For local development:
# spark = SparkSession.builder.appName("DataFrameBasics").getOrCreate()

# Create DataFrame from list of tuples
data = [
    (1, "Alice", 25, "Engineering"),
    (2, "Bob", 30, "Sales"),
    (3, "Charlie", 35, "Engineering"),
    (4, "Diana", 28, "Marketing"),
    (5, "Eve", 32, "Sales")
]

columns = ["id", "name", "age", "department"]

df = spark.createDataFrame(data, columns)

# Display DataFrame (display() is Databricks-only; use df.show() elsewhere)
display(df)

In [None]:
# Show DataFrame (standard Spark command)
df.show()

# Show with options
df.show(3, truncate=False)  # Show 3 rows, don't truncate

### Method 2: From Files (CSV, JSON, Parquet)

In [None]:
# Read CSV file
# df_csv = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)

# Equivalent form using .option() calls (this still infers the schema;
# pass an explicit one with .schema(...) to skip inference)
# df_csv = spark.read \
#     .option("header", "true") \
#     .option("inferSchema", "true") \
#     .csv("/path/to/file.csv")

# Read JSON
# df_json = spark.read.json("/path/to/file.json")

# Read Parquet (columnar format)
# df_parquet = spark.read.parquet("/path/to/file.parquet")

# Read Delta (recommended in Databricks)
# df_delta = spark.read.format("delta").load("/path/to/delta/table")

print("File reading examples (commented out - need actual files)")

### Method 3: From SQL Query

In [None]:
# Register DataFrame as temporary view
df.createOrReplaceTempView("employees")

# Create DataFrame from SQL query
df_sql = spark.sql("""
    SELECT id, name, age, department
    FROM employees
    WHERE age > 28
""")

display(df_sql)

## 3. DataFrame Schema

A schema defines the structure of your DataFrame: its column names, their data types, and whether each column may contain nulls.

In [None]:
# Print schema
df.printSchema()

# Get schema
print("\nSchema object:")
print(df.schema)

In [None]:
# Define explicit schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True)
])

# Create DataFrame with explicit schema
data_with_salary = [
    (1, "Alice", 25, "Engineering", 75000.0),
    (2, "Bob", 30, "Sales", 65000.0),
    (3, "Charlie", 35, "Engineering", 85000.0),
]

df_typed = spark.createDataFrame(data_with_salary, schema)
df_typed.printSchema()
display(df_typed)

## 4. Basic DataFrame Operations

### Select Columns

In [None]:
# Select specific columns
df.select("name", "age").show()

# Select using col() function
df.select(col("name"), col("age")).show()

# Select all columns
df.select("*").show()

# Select with expressions
df.select(
    col("name"),
    col("age"),
    (col("age") + 5).alias("age_in_5_years")
).show()

### Filter Rows

In [None]:
# Filter with condition
df.filter(col("age") > 28).show()

# Alternative: where() is an alias for filter()
df.where(col("age") > 28).show()

# Multiple conditions with &, |, ~
df.filter(
    (col("age") > 25) & (col("department") == "Engineering")
).show()

# Filter with SQL expression
df.filter("age > 28 AND department = 'Engineering'").show()

### Add New Columns

In [None]:
# Add new column with withColumn()
df_with_bonus = df.withColumn(
    "bonus",
    when(col("department") == "Engineering", 5000)
    .when(col("department") == "Sales", 3000)
    .otherwise(2000)
)

display(df_with_bonus)

# Add multiple columns
df_enhanced = df \
    .withColumn("age_category", 
                when(col("age") < 30, "Young")
                .otherwise("Senior")) \
    .withColumn("name_length", length(col("name")))

display(df_enhanced)

### Rename Columns

In [None]:
# Rename single column
df.withColumnRenamed("name", "employee_name").show()

# Rename multiple columns (using select with alias)
df.select(
    col("id").alias("employee_id"),
    col("name").alias("employee_name"),
    col("age"),
    col("department").alias("dept")
).show()

### Drop Columns

In [None]:
# Drop single column
df.drop("age").show()

# Drop multiple columns
df.drop("age", "department").show()

## 5. Aggregations and Grouping

In [None]:
# Simple aggregations
print("Total count:", df.count())
print("Average age:", df.select(avg("age")).collect()[0][0])

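# Multiple aggregations
# (count, avg, min, and max here are Spark functions from the wildcard import, not Python built-ins)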
df.select(
    count("*").alias("total_count"),
    avg("age").alias("avg_age"),
    min("age").alias("min_age"),
    max("age").alias("max_age")
).show()

In [None]:
# Group by department
df.groupBy("department").count().show()

# Group by with aggregations
df.groupBy("department").agg(
    count("*").alias("employee_count"),
    avg("age").alias("avg_age"),
    min("age").alias("youngest"),
    max("age").alias("oldest")
).show()

## 6. Sorting

In [None]:
# Sort by single column
df.orderBy("age").show()

# Sort descending
df.orderBy(col("age").desc()).show()

# Sort by multiple columns
df.orderBy(col("department").asc(), col("age").desc()).show()

## 7. Handling Missing Data

In [None]:
# Create DataFrame with nulls
data_with_nulls = [
    (1, "Alice", 25, "Engineering"),
    (2, "Bob", None, "Sales"),
    (3, None, 35, "Engineering"),
    (4, "Diana", 28, None),
]

df_nulls = spark.createDataFrame(data_with_nulls, ["id", "name", "age", "department"])
display(df_nulls)

# Drop rows with any null
print("Drop any nulls:")
df_nulls.dropna().show()

# Drop rows only when every value is null (no row qualifies here, so all 4 rows remain)
print("Drop rows where all values are null:")
df_nulls.dropna(how="all").show()

# Drop based on specific columns
print("Drop nulls in specific columns:")
df_nulls.dropna(subset=["name", "age"]).show()

In [None]:
# Fill null values
# A string fill value applies only to string columns, so null ages stay null
print("Fill nulls in string columns with a default:")
df_nulls.fillna("Unknown").show()

# Fill with different values per column
print("Fill with column-specific values:")
df_nulls.fillna({
    "name": "Unknown",
    "age": 0,
    "department": "Unassigned"
}).show()

## 8. Transformations vs Actions

### Understanding Lazy Evaluation

**Transformations** (lazy - not executed immediately):
- `select()`, `filter()`, `where()`, `groupBy()`
- `withColumn()`, `drop()`, `orderBy()`
- Return a new DataFrame (`groupBy()` returns a `GroupedData` object, which becomes a DataFrame after `agg()` or `count()`)

**Actions** (eager - trigger execution):
- `show()`, `count()`, `collect()`, `take()`, `head()`
- Writes through the `df.write` interface, such as `df.write.save()` or `df.write.parquet()`
- Return results to the driver or write data

In [None]:
# Transformations are lazy - these don't execute yet
df_transformed = df \
    .filter(col("age") > 25) \
    .withColumn("age_next_year", col("age") + 1) \
    .select("name", "age", "age_next_year")

print("Transformations defined but not executed yet")

# Action triggers execution
print("\nNow executing with show():")
df_transformed.show()

## 9. Common DataFrame Methods

In [None]:
# Get column names
print("Columns:", df.columns)

# Get row count
print("Row count:", df.count())

# Get number of partitions
print("Partitions:", df.rdd.getNumPartitions())

# Describe statistics
df.describe().show()

# Get distinct values
print("\nDistinct departments:")
df.select("department").distinct().show()

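# head(n) and take(n) are equivalent: both are actions that return
# the first n rows to the driver as a list of Row objects
print("\nFirst 2 rows with head():")
print(df.head(2))

print("\nFirst 2 rows with take():")
print(df.take(2))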

## 10. Practice Exercises

In [None]:
# Sample data for exercises
sales_data = [
    (1, "Product A", 100, 50, "2024-01-15", "North"),
    (2, "Product B", 150, 30, "2024-01-16", "South"),
    (3, "Product A", 100, 75, "2024-01-17", "North"),
    (4, "Product C", 200, 25, "2024-01-18", "East"),
    (5, "Product B", 150, 40, "2024-01-19", "West"),
]

sales_df = spark.createDataFrame(
    sales_data,
    ["id", "product", "price", "quantity", "date", "region"]
)

display(sales_df)

### Exercise 1: Calculate total revenue per product
Add a column 'revenue' = price * quantity, then group by product and sum revenue

In [None]:
# Your solution here
# TODO: Add revenue column and calculate total per product

### Exercise 2: Find products with average quantity > 40

In [None]:
# Your solution here
# TODO: Group by product, calculate avg quantity, filter > 40

## Summary

In this notebook, you learned:

- ✅ What DataFrames are and their benefits
- ✅ Creating DataFrames from various sources
- ✅ Working with schemas and data types
- ✅ Basic transformations (select, filter, withColumn)
- ✅ Aggregations and grouping
- ✅ Handling missing data
- ✅ Transformations vs Actions (lazy evaluation)

## Next Steps

1. Complete the practice exercises
2. Experiment with your own data
3. Move to advanced DataFrame operations

## Additional Resources

- [PySpark DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)
- [Databricks PySpark Guide](https://docs.databricks.com/pyspark/index.html)