# Module 06 - Spark Basics in Azure Context

## Overview

This module covers Apache Spark fundamentals in the context of Azure data engineering. You'll learn how Spark is used in Azure for processing large-scale data.

## Learning Objectives

By the end of this module, you will understand:
- What Apache Spark is and why it matters
- Spark architecture and components
- Spark in Azure (Databricks, Synapse, HDInsight)
- Basic Spark operations (transformations and actions)
- Working with DataFrames in Spark
- Spark best practices in Azure


## What is Apache Spark?

**Apache Spark** is an open-source, distributed processing system used for big data workloads. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

### Key Features

- **Fast Processing**: In-memory computing for faster processing
- **Distributed**: Processes data across multiple nodes
- **Fault Tolerant**: Handles node failures gracefully
- **Multiple Languages**: Supports Python, Scala, Java, R, SQL
- **Multiple Data Sources**: Works with various data formats
- **Unified Platform**: Batch and streaming processing

### Why Spark for Big Data?

- **Performance**: Can be up to 100x faster than Hadoop MapReduce for certain in-memory workloads
- **Ease of Use**: High-level APIs (DataFrames, SQL)
- **Versatility**: Batch, streaming, machine learning, graph processing
- **Scalability**: Handles petabytes of data
- **Ecosystem**: Rich ecosystem of libraries

### Spark Use Cases

- ✅ **ETL Processing**: Transform large datasets
- ✅ **Data Analytics**: Analyze big data
- ✅ **Real-time Processing**: Stream processing
- ✅ **Machine Learning**: MLlib for ML workloads
- ✅ **Data Warehousing**: Query large datasets


## Spark Architecture

### Core Components

#### 1. Spark Driver
- **Purpose**: Main program that creates the SparkSession (and its underlying SparkContext)
- **Responsibilities**: 
  - Convert user code into tasks
  - Schedule tasks on executors
  - Coordinate execution

#### 2. Spark Executors
- **Purpose**: Worker processes (JVMs running on worker nodes) that execute tasks
- **Responsibilities**:
  - Run tasks assigned by driver
  - Store data in memory/disk
  - Report status to driver

#### 3. Cluster Manager
- **Purpose**: Manages cluster resources
- **Types**: Standalone, YARN, Kubernetes, Mesos (deprecated since Spark 3.2)

### Spark Architecture Diagram

```
┌─────────────┐
│   Driver    │  (Main Program)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Cluster   │
│   Manager   │
└──────┬──────┘
       │
   ┌───┴───┐
   │       │
┌──▼──┐ ┌──▼──┐
│Exec1│ │Exec2│  (Workers)
└─────┘ └─────┘
```

### Key Concepts

- **RDD (Resilient Distributed Dataset)**: Immutable distributed collection
- **DataFrame**: Distributed collection organized into named columns
- **Dataset**: Type-safe variant of the DataFrame API (Scala/Java only)
- **Partition**: Logical division of data across nodes
- **Task**: Unit of work sent to executor
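
To make the partition and task concepts concrete, here is a toy model in plain Python (not Spark itself). It splits records into partitions by hashing a key, then runs one "task" per partition and combines the partial results, which is roughly how a distributed aggregation proceeds. The function names and the CRC32 hash are illustrative assumptions; Spark uses its own partitioners internally.

```python
import zlib
from collections import defaultdict


def assign_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, key_field, num_partitions):
    """Group records into partitions so that equal keys always land together."""
    partitions = defaultdict(list)
    for record in records:
        index = assign_partition(record[key_field], num_partitions)
        partitions[index].append(record)
    return dict(partitions)


def run_task(partition):
    """One 'task': sum the Amount values of a single partition."""
    return sum(record["Amount"] for record in partition)


records = [
    {"Region": "East", "Amount": 100},
    {"Region": "West", "Amount": 250},
    {"Region": "East", "Amount": 300},
    {"Region": "North", "Amount": 50},
]

partitions = partition_records(records, "Region", num_partitions=2)

# Each partition is processed independently; the "driver" combines the partial sums.
partial_sums = [run_task(p) for p in partitions.values()]
total = sum(partial_sums)
```

In a real cluster the tasks would run in parallel on different executors; here they run one after another in a single process.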


## Spark in Azure

Azure provides multiple options for running Spark workloads:

### 1. Azure Databricks

**Purpose**: Unified analytics platform built on Spark

**Features:**
- Fully managed Spark clusters
- Collaborative notebooks
- Optimized Spark runtime
- Integration with Azure services
- MLflow for ML lifecycle

**Use Cases:**
- Data engineering and ETL
- Data science and ML
- Real-time analytics
- Collaborative analytics

### 2. Azure Synapse Analytics (Spark Pools)

**Purpose**: Spark pools within Synapse workspace

**Features:**
- Integrated with Synapse workspace
- Serverless Spark pools with autoscale and auto-pause
- Direct integration with Data Lake
- SQL and Spark in one platform

**Use Cases:**
- Data warehousing with Spark
- Unified analytics platform
- ELT workloads

### 3. Azure HDInsight

**Purpose**: Managed Hadoop and Spark clusters

**Features:**
- Multiple cluster types (Spark, Hive, etc.)
- Open-source Hadoop ecosystem
- Integration with Azure services

**Use Cases:**
- Big data processing
- Hadoop ecosystem needs
- Legacy Hadoop migrations

### Comparison

| Feature | Databricks | Synapse Spark | HDInsight |
|---------|-----------|---------------|-----------|
| **Managed** | Fully | Fully | Fully |
| **Optimization** | High | Medium | Standard |
| **Integration** | Excellent | Native | Good |
| **ML Support** | Excellent | Good | Standard |
| **Cost** | Higher | Medium | Lower |


## Spark Operations

### Transformations vs Actions

#### Transformations
- **Lazy Evaluation**: Not executed immediately
- **Create New RDD/DataFrame**: Return new dataset
- **Examples**: `filter()`, `map()`, `select()`, `groupBy()`

#### Actions
- **Eager Evaluation**: Executed immediately
- **Trigger Execution**: Cause transformations to run
- **Return Results**: Return values to driver
- **Examples**: `count()`, `collect()`, `show()`, `write.save()`
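
The difference between lazy transformations and eager actions can be illustrated with a small analogy in plain Python, whose built-in `filter()` and `map()` are also lazy. This is a sketch of the idea only, not of Spark's execution engine; the `executed` list is a hypothetical probe added to show when work actually happens.

```python
executed = []  # records which values were actually processed


def is_large(amount):
    executed.append(amount)  # side effect so we can observe when evaluation occurs
    return amount > 1000


amounts = [500, 1500, 2500]

# "Transformations": these only describe the pipeline, nothing is evaluated yet.
large = filter(is_large, amounts)
doubled = map(lambda a: a * 2, large)
work_before_action = list(executed)  # still empty at this point

# "Action": consuming the pipeline forces every step to run.
result = list(doubled)
```

Spark behaves similarly at a larger scale: it records transformations in a plan and only runs them when an action asks for a result.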

### Common Transformations

```python
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum()

# Filter rows
df_filtered = df.filter(df.Amount > 1000)

# Select columns
df_selected = df.select("CustomerID", "Amount")

# Add column
df_with_new = df.withColumn("Total", df.Amount * 1.1)

# Group by and aggregate
df_agg = df.groupBy("Region").agg(
    F.sum("Amount").alias("TotalSales"),
    F.count("*").alias("OrderCount")
)

# Join
df_joined = df1.join(df2, on="CustomerID", how="inner")

# Sort
df_sorted = df.orderBy(df.Amount.desc())
```

### Common Actions

```python
# Show data
df.show()

# Count rows
row_count = df.count()

# Collect to driver (use carefully: every row is loaded into driver memory)
data = df.collect()

# Write to storage
df.write.format("parquet").save("path/to/output")

# Take first N rows
first_rows = df.take(10)
```


## Working with DataFrames

### Creating DataFrames

```python
# From CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# From Parquet
df = spark.read.parquet("path/to/file.parquet")

# From JSON
df = spark.read.json("path/to/file.json")

# From Azure Data Lake
df = spark.read.format("csv").load("abfss://container@storage.dfs.core.windows.net/path")

# From SQL Database (a real read also needs a "dbtable" or "query" option and credentials)
df = spark.read.format("jdbc").option("url", "jdbc:sqlserver://...").load()
```

### DataFrame Operations

```python
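from pyspark.sql import functions as F  # Spark aggregate functions

# View schema
df.printSchema()

# Show data
df.show(10)

# Select columns
df.select("col1", "col2")

# Filter
df.filter(df.col1 > 100)

# Group by (F.sum is Spark's aggregate, not Python's built-in sum)
df.groupBy("category").agg(F.sum("amount"))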

# Join
df1.join(df2, df1.id == df2.id, "inner")

# Write
df.write.format("parquet").mode("overwrite").save("output/path")
```

### Reading from Azure Data Lake

```python
# Using the abfss protocol (recommended for Azure Data Lake Storage Gen2)
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("abfss://container@storageaccount.dfs.core.windows.net/path/to/data")

# Using the legacy wasbs protocol (Blob Storage endpoint; prefer abfss for new work)
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("wasbs://container@storageaccount.blob.core.windows.net/path/to/data")
```
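
Because the `abfss://` URI has a fixed shape (container, then storage account, then path), a small helper can reduce typos when many paths are built. The sketch below is a hypothetical convenience function, not part of Spark or any Azure SDK, and the container and account names are made-up placeholders.

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI for Azure Data Lake Storage Gen2."""
    clean_path = path.lstrip("/")  # avoid a double slash after the host name
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{clean_path}"


# Hypothetical container and account names, for illustration only.
sales_uri = abfss_uri("raw", "mydatalake", "/sales/2024")
```

The resulting string can be passed to any reader, for example `spark.read.parquet(sales_uri)`.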


## Spark Best Practices in Azure

### 1. Partitioning

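- ✅ **Partition Data**: Partition by columns you commonly filter on, such as date or region
- ✅ **Optimal Partition Size**: As a rule of thumb, aim for roughly 128 MB to 1 GB per partition
- ✅ **Avoid Too Many Partitions**: Many small partitions add scheduling and small-file overhead
- ✅ **Coalesce When Needed**: Use `coalesce()` to reduce the partition count without a full shuffle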

```python
# Partition when writing
df.write.partitionBy("year", "month").parquet("output/path")

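# Repartition if needed (triggers a full shuffle)
df_repartitioned = df.repartition(10)

# Coalesce to reduce the partition count without a full shuffle
df_coalesced = df.coalesce(4)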
```
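
The partition-size guideline translates into simple arithmetic. The sketch below estimates a partition count from a dataset size and shows the `column=value` directory layout that `partitionBy` produces on disk. Both helper names are hypothetical, and the 128 MiB default is just the lower end of the rule of thumb above, not a Spark setting.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_partitions(total_bytes: int, target_bytes: int = 128 * MIB) -> int:
    """Estimate how many partitions keep each one near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_bytes))


def hive_partition_path(base: str, **partition_values) -> str:
    """Show the column=value directory layout used for partitioned output."""
    parts = [f"{column}={value}" for column, value in partition_values.items()]
    return "/".join([base.rstrip("/")] + parts)


# A 10 GiB dataset at 128 MiB per partition needs 80 partitions.
num_partitions = estimate_partitions(10 * GIB)

# Layout of one partition directory after partitionBy("year", "month").
example_path = hive_partition_path("output/path", year=2024, month=1)
```

The estimate could then feed a call such as `df.repartition(num_partitions)`; treat it as a starting point to tune, since the best value depends on the cluster and the query.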

### 2. Caching

- ✅ **Cache Frequently Used Data**: Use `cache()` or `persist()`
- ✅ **Unpersist When Done**: Free up memory
- ✅ **Choose Storage Level**: MEMORY_ONLY, DISK_ONLY, etc.

```python
# Cache DataFrame (lazy: data is materialized by the first action that runs on it)
df.cache()

# Use cached data
result = df.filter(...).groupBy(...)

# Unpersist when done
df.unpersist()
```

### 3. Data Formats

- ✅ **Use Columnar Formats**: Parquet, Delta Lake
- ✅ **Avoid Text Formats**: CSV and JSON are slow and bulky for large datasets
- ✅ **Compression**: Enable compression (snappy, gzip)

```python
# Prefer Parquet
df.write.format("parquet").save("path")

# Use Delta Lake for ACID transactions
df.write.format("delta").save("path")
```

### 4. Resource Management

- ✅ **Right-Size Clusters**: Match cluster size to workload
- ✅ **Auto-Scaling**: Enable auto-scaling when available
- ✅ **Monitor Resource Usage**: Watch CPU, memory usage
- ✅ **Shut Down Idle Clusters**: Save costs

### 5. Error Handling

- ✅ **Handle Nulls**: Use `na.drop()` or `na.fill()`
- ✅ **Validate Data**: Check data quality
- ✅ **Log Errors**: Log failures for debugging
- ✅ **Retry Logic**: Implement retry for transient failures
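
Retry logic for transient failures is often written as a small wrapper with exponential backoff. The following is a minimal sketch in plain Python; `with_retries` and `flaky_write` are hypothetical names, and the choice of `ConnectionError` as the retryable error is an assumption to adapt to the errors your storage or database client actually raises.

```python
import time


def with_retries(operation, max_attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on the given errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...


# Demo with a stand-in operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient storage error")
    return "written"


delays = []  # capture the delays instead of really sleeping in the demo
outcome = with_retries(flaky_write, retry_on=(ConnectionError,), sleep=delays.append)
```

In a notebook, the wrapped operation might be a lambda around a write such as `df.write.format("parquet").save(...)`. Keep `retry_on` narrow so that genuine bugs fail fast instead of being retried.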


## Summary

In this module, we've covered:

- ✅ What Apache Spark is and why it matters
- ✅ Spark architecture and components
- ✅ Spark in Azure (Databricks, Synapse, HDInsight)
- ✅ Spark operations (transformations and actions)
- ✅ Working with DataFrames
- ✅ Spark best practices in Azure

### Key Takeaways

1. **Spark** is a distributed processing engine for big data
2. **Transformations** are lazy, **Actions** trigger execution
3. **Azure offers multiple Spark options**: Databricks, Synapse, HDInsight
4. **DataFrames** provide a high-level API for data processing
5. **Partitioning** is crucial for performance
6. **Use columnar formats** (Parquet, Delta) for better performance
7. **Cache frequently used data** to avoid recomputation

### Next Steps

Proceed to **Module 07: Azure Synapse Analytics Basics** to learn about:
- Synapse workspace
- SQL Pools (dedicated and serverless)
- Spark Pools
- Unified analytics platform
