# Apache Spark and Modern Data Architectures

## Introduction

This notebook covers what came after Hadoop MapReduce: Apache Spark, and the shift in data architecture from Data Warehouse to Data Lake to Data Lakehouse.

## What You'll Learn

- Challenges with MapReduce
- Apache Spark and its advantages
- Data Architecture Evolution: Data Warehouse → Data Lake → Lakehouse
- Summary and key takeaways


## Challenges with MapReduce

Despite its success, Hadoop MapReduce faced several challenges:

### 1. Disk-Bound Storage
- **Issue**: Throughput is limited by disk I/O performance
- **Impact**: Slow data access and processing
- **Reason**: MapReduce writes intermediate results to disk between the map and reduce phases, and again between chained jobs

### 2. Processing Time
- **Issue**: High latency for iterative and interactive workloads
- **Impact**: Not suitable for real-time or near-real-time processing
- **Reason**: 
  - Disk-based processing (slow I/O)
  - Overhead of job setup and teardown
  - Not optimized for iterative algorithms (machine learning)
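
The cost of disk-based iteration can be sketched with a toy example. The `DiskDataset` class and both functions below are hypothetical names invented for illustration, not Hadoop or Spark APIs. The sketch only counts how often the input is read from disk when each iteration behaves like a separate job, compared with loading it once and keeping it in memory.

```python
import json
import os
import tempfile


class DiskDataset:
    """Toy dataset stored in a temp file that counts how often it is read."""

    def __init__(self, values):
        fd, self.path = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(values, f)
        self.disk_reads = 0

    def load(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return json.load(f)

    def close(self):
        os.remove(self.path)


def iterate_from_disk(dataset, iterations):
    """MapReduce-style: every iteration acts like a new job and re-reads its input."""
    estimate = 0.0
    for _ in range(iterations):
        values = dataset.load()
        # Move the estimate halfway toward the mean of the data
        estimate += 0.5 * (sum(values) / len(values) - estimate)
    return estimate


def iterate_in_memory(dataset, iterations):
    """Spark-style: read the input once, then iterate over the in-memory copy."""
    values = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate += 0.5 * (sum(values) / len(values) - estimate)
    return estimate


if __name__ == "__main__":
    for run in (iterate_from_disk, iterate_in_memory):
        ds = DiskDataset([1, 2, 3, 4])
        result = run(ds, iterations=5)
        print(run.__name__, "result:", result, "disk reads:", ds.disk_reads)
        ds.close()
```

Both functions compute the same result; only the number of disk reads differs. On a real cluster the gap is wider, because each re-read also involves job startup and network transfer.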

### 3. Programming Complexity
- **Issue**: Writing MapReduce programs is complex
- **Impact**: Requires deep understanding of distributed systems
- **Reason**: Low-level programming model
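
To give a feel for the low-level model, here is a minimal single-process sketch of a word count written in the map, shuffle, and reduce style. The function names are illustrative and this is plain Python, not the Hadoop API; a real MapReduce job would also need job configuration, input and output formats, and serialization code.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["spark is fast", "Spark is general"]))
```

Even this simple task forces the developer to think in terms of key-value pairs and phases rather than in terms of the question being asked.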

### 4. Limited Language Support
- **Issue**: The MapReduce API is primarily Java-based
- **Impact**: Limited accessibility for developers using other languages, who had to rely on add-ons such as Hadoop Streaming
- **Reason**: Native Java implementation

### 5. Hive Performance
- **Issue**: Hive queries ran much slower than comparable SQL on a traditional relational database
- **Impact**: Not suitable for interactive SQL workloads
- **Reason**: 
  - Hive translates SQL to MapReduce jobs
  - MapReduce overhead adds latency
  - Not optimized for SQL workloads
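
The translation step can be illustrated with a small sketch. It is a simplified, assumed picture of how a `GROUP BY` aggregate maps onto map, shuffle, and reduce stages, not Hive's actual query planner. The `employees` table and sample rows are hypothetical, and SQLite is used only to check that both paths give the same answer.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (department, salary)
ROWS = [("eng", 100.0), ("eng", 150.0), ("ops", 80.0)]


def group_by_sum_as_mapreduce(rows):
    """SELECT dept, SUM(salary) ... GROUP BY dept, expressed as three stages."""
    # Map: emit (key, value) pairs keyed by the GROUP BY column
    mapped = [(dept, salary) for dept, salary in rows]
    # Shuffle: bring all values for the same key together
    grouped = defaultdict(list)
    for dept, salary in mapped:
        grouped[dept].append(salary)
    # Reduce: apply the aggregate function per key
    return {dept: sum(salaries) for dept, salaries in grouped.items()}


def group_by_sum_in_sql(rows):
    """The same query run directly by a SQL engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (dept TEXT, salary REAL)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    result = dict(
        conn.execute("SELECT dept, SUM(salary) FROM employees GROUP BY dept")
    )
    conn.close()
    return result


if __name__ == "__main__":
    print(group_by_sum_as_mapreduce(ROWS))
    print(group_by_sum_in_sql(ROWS))
```

In Hive on MapReduce, each such stage ran as part of a batch job with its own startup cost and disk writes, which is where the added latency came from.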

### Need for a Better Solution

These challenges led to the development of **Apache Spark**, which addressed many of these limitations.


## Enter Apache Spark

**Apache Spark** is a general-purpose, in-memory compute engine developed to address the limitations of Hadoop MapReduce.

### Advantages of Apache Spark over Hadoop MapReduce

**1. Performance**
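- **Up to 100 times faster** than Hadoop MapReduce on some in-memory workloads (the actual gain depends heavily on the job)
- **In-memory processing**: Keeps intermediate data in memory where possible, spilling to disk only when it does not fit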
- **Optimized execution engine**: Advanced query optimization
- **Reduced overhead**: Less job setup/teardown overhead

**2. Ease of Development**
- **Spark SQL**: High-performance SQL engine
- **Composable Function API**: Easy to build complex data pipelines
- **DataFrame API**: Similar to Pandas, intuitive for data engineers
- **High-level APIs**: Less boilerplate code compared to MapReduce
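
The idea of a composable API can be sketched in a few lines of plain Python. The `Pipeline` class below is a hypothetical stand-in, not a Spark class. It borrows the method names `map`, `filter`, and `collect`, which also exist on Spark RDDs, to show how each call returns a new object that can be chained further.

```python
class Pipeline:
    """Tiny stand-in for a composable dataset API (not Spark's actual classes)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Return a new pipeline; the original is left unchanged
        return Pipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return Pipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Only here are the recorded steps actually applied to the data
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


if __name__ == "__main__":
    # Hypothetical sample records
    orders = [
        {"id": 1, "amount": 40},
        {"id": 2, "amount": 120},
        {"id": 3, "amount": 300},
    ]
    large_orders = Pipeline(orders).filter(lambda o: o["amount"] >= 100)
    large_order_ids = large_orders.map(lambda o: o["id"])
    print(large_order_ids.collect())
```

Because every step returns a new pipeline, intermediate results such as `large_orders` can be reused as the starting point for several different downstream computations.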

**3. Language Support**
- **Java**: Full support
- **Scala**: Native language (Spark is written in Scala)
- **Python**: PySpark API
- **R**: SparkR API
- **SQL**: Spark SQL

**4. Storage**
- **HDFS**: Can read from and write to HDFS
- **Multiple formats**: CSV, JSON, Parquet, ORC, Avro, etc.
- **Multiple sources**: Databases, cloud storage, streaming sources

**5. Resource Management**
- **YARN**: Can run on Hadoop YARN
- **Mesos**: Apache Mesos support (deprecated since Spark 3.2)
- **Kubernetes**: Native Kubernetes support
- **Standalone**: Can run in standalone mode

### Apache Spark Deployment Options

**1. With Hadoop (Data Lake)**
- Spark runs on top of Hadoop cluster
- Uses HDFS for storage
- Uses YARN for resource management
- Traditional Data Lake architecture

**2. Without Hadoop (Lakehouse)**
- Spark runs independently
- **Cloud platforms**: AWS, Azure, GCP
- **Databricks Spark Platform**: Managed Spark service
- Modern Lakehouse architecture
- Can use cloud storage (S3, ADLS, GCS) instead of HDFS

**Note**: Running Spark without Hadoop does not by itself make a Lakehouse. You usually also add a table format such as Delta Lake on top of the cloud storage.

### Spark and PySpark

**Apache Spark (The Engine)**

Spark is the "muscle." It is written in Scala and runs on the Java Virtual Machine (JVM). Its job is to handle the heavy lifting: breaking data into partitions, distributing tasks across a cluster of computers, and managing memory and fault tolerance.
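
The partition-and-distribute idea can be sketched on a single machine. This is a rough analogy under stated assumptions: threads stand in for cluster executors, the helper names are made up for this example, and memory management and fault tolerance are not modeled at all.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous partitions of nearly equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def distributed_sum(data, num_partitions=4):
    """Run one task per partition, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)


if __name__ == "__main__":
    print(split_into_partitions(list(range(10)), 4))
    print(distributed_sum(list(range(1, 101))))
```

Each task works only on its own partition, which is what lets the same computation scale out when partitions live on different machines.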

**PySpark (The Interface)**

Because the Spark engine runs on the JVM, Python code cannot call it directly. PySpark acts as a bridge. When you write Python code using PySpark, a library called Py4J translates those calls into calls on JVM objects that the Spark engine can execute.
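
The bridge pattern can be sketched with a toy proxy. Everything here is hypothetical: `ToyEngine` and `ToyGateway` are invented for this example, and the JSON messages are not the real Py4J protocol, which talks to the JVM over a local socket. The sketch only shows the general idea of turning a Python method call into a message that another runtime executes.

```python
import json


class ToyEngine:
    """Stands in for the JVM side: receives serialized commands and runs them."""

    def __init__(self):
        self._operations = {"sum": sum, "count": len, "max": max}

    def handle(self, message):
        command = json.loads(message)
        result = self._operations[command["method"]](command["args"])
        return json.dumps({"result": result})


class ToyGateway:
    """Python-side proxy: turns attribute calls into messages for the engine."""

    def __init__(self, engine):
        self._engine = engine

    def __getattr__(self, method):
        # Called only for names not found on the proxy itself, e.g. gateway.sum
        def call(*args):
            message = json.dumps({"method": method, "args": list(args)})
            reply = self._engine.handle(message)
            return json.loads(reply)["result"]

        return call


if __name__ == "__main__":
    gateway = ToyGateway(ToyEngine())
    print(gateway.sum(1, 2, 3))
    print(gateway.count(10, 20))
```

The Python caller never runs the computation itself; it only describes the call and receives the result, which is the role PySpark plays in front of the Spark engine.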

## Data Architecture Evolution

Data architectures have evolved significantly to meet changing business needs and technological capabilities.

### 1. Data Warehouse Architecture

**Era**: 1980s - 2000s

**Flow:**
```
Structured Data → ETL → Data Warehouse → BI / Reports
```

**Characteristics:**
- **Input**: Only structured data
- **Process**: ETL (Extract, Transform, Load)
- **Storage**: Data Warehouse (optimized for analytics)
- **Output**: BI tools and reports
- **Schema**: Schema-on-write (structured before storage)
- **Use Case**: Business intelligence, reporting, analytics

**Limitations:**
- Only handles structured data
- Expensive to scale
- Long ETL processes
- Limited flexibility
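
Schema-on-write can be illustrated with SQLite standing in for a warehouse. This is a minimal sketch, assuming a hypothetical `sales` table and sample rows; a real warehouse load would go through an ETL tool rather than row-by-row inserts. The schema is declared first, and rows that do not fit it are rejected at load time.

```python
import sqlite3


def create_warehouse():
    """Declare the schema up front, before any data is loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sales (
            order_id INTEGER PRIMARY KEY,
            region   TEXT NOT NULL,
            amount   REAL NOT NULL CHECK (typeof(amount) = 'real')
        )
        """
    )
    return conn


def load_rows(conn, rows):
    """Insert rows one by one and return the rows the schema rejected."""
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError:
            rejected.append(row)
    conn.commit()
    return rejected


if __name__ == "__main__":
    conn = create_warehouse()
    rejected = load_rows(
        conn,
        [
            (1, "EMEA", 120.5),   # matches the schema
            (2, None, 10.0),      # missing region
            (3, "APAC", "n/a"),   # amount is not numeric
        ],
    )
    print("rejected:", rejected)
    print("loaded:", conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
    conn.close()
```

Bad rows never reach the table, which gives reliable reports but means every new data shape needs a schema change before it can be stored.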

### 2. Data Lake Architecture

**Era**: 2010s

**Flow:**
```
Structured, Semi-structured, Unstructured Data (Data Lake)
    │
    ├─→ ETL → Data Warehouse → BI / Reports
    │
    └─→ Data Science / ML
```

**Characteristics:**
- **Input**: All data types (structured, semi-structured, unstructured)
- **Storage**: Data Lake (raw data storage)
- **Processing**: 
  - ETL to Data Warehouse for BI/Reports
  - Direct access for Data Science/ML
- **Schema**: Schema-on-read (flexible schema)
- **Use Case**: 
  - Traditional BI (via Data Warehouse)
  - Advanced analytics and machine learning (direct from Data Lake)

**Advantages:**
- Handles all data types
- Cost-effective storage
- Flexible schema
- Supports both BI and ML use cases

**Challenges:**
- Data quality issues (raw data)
- Governance challenges
- Performance issues for BI workloads
- Still requires Data Warehouse for some use cases
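
Schema-on-read can be sketched as follows. The raw events and the `read_with_schema` helper are hypothetical, made up for this example. Records are stored exactly as they arrive, and each consumer applies its own schema only when it reads them.

```python
import json

# Raw events stored as-is; the records do not share a common shape
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "device": "ios"}',
    '{"user": "bo", "amount": 5}',
    '{"user": "cy", "clicks": [1, 2, 3]}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick fields, cast them, default missing ones to None."""
    records = []
    for line in raw_lines:
        raw = json.loads(line)
        record = {}
        for field, cast in schema.items():
            value = raw.get(field)
            record[field] = cast(value) if value is not None else None
        records.append(record)
    return records


if __name__ == "__main__":
    # Two consumers read the same raw data through different schemas
    print(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
    print(read_with_schema(RAW_EVENTS, {"user": str, "device": str}))
```

Nothing is rejected at write time, which is what makes the lake flexible. It is also why data quality problems, such as the missing `amount` above, only surface when someone reads the data.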


### 3. Data Lakehouse Architecture

**Era**: 2020s (Modern Approach)

**Flow:**
```
Structured, Semi-structured, Unstructured Data (Data Lake)
    │
    └─→ Metadata and Governance Layer
            │
            ├─→ BI / Reports
            └─→ Data Science / ML
```

**Characteristics:**
- **Input**: All data types (structured, semi-structured, unstructured)
- **Storage**: Data Lake (single storage layer)
- **Enhancement**: Metadata and governance layer on top
- **Output**: 
  - Direct BI/Reports (no separate Data Warehouse needed)
  - Direct Data Science/ML
- **Schema**: Schema-on-read flexibility combined with schema enforcement through the metadata layer
- **Key Innovation**: Data Lake performs on par with a Data Warehouse

**Key Features:**
- **ACID Transactions**: Ensures data consistency
- **Schema Enforcement**: Data quality and governance
- **Performance Optimization**: Query performance similar to Data Warehouse
- **Unified Storage**: Single source of truth for all data
- **Cost-Effective**: Eliminates need for separate Data Warehouse
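
How a metadata layer can add transactions and schema enforcement on top of plain files can be sketched with a toy table. `ToyLakehouseTable`, its file names, and its log format are all hypothetical. It is loosely modeled on the idea of a transaction log and is far simpler than Delta Lake or any real table format. It has no concurrency control, no crash safety, and no updates or deletes.

```python
import json
import os
import tempfile


class ToyLakehouseTable:
    """Data files plus a commit log; readers only see files listed in the log."""

    def __init__(self, root, schema):
        self.root = root
        self.schema = schema  # column name -> Python type
        self.log_path = os.path.join(root, "_commit_log.jsonl")

    def append(self, rows):
        # Schema enforcement: reject the whole batch before anything is written
        for row in rows:
            if set(row) != set(self.schema) or not all(
                isinstance(row[col], typ) for col, typ in self.schema.items()
            ):
                raise ValueError(f"row does not match schema: {row}")
        version = len(self._committed_files())
        data_file = f"part-{version:05d}.json"
        with open(os.path.join(self.root, data_file), "w") as f:
            json.dump(rows, f)
        # Commit point: the batch becomes visible only once this log line exists
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"version": version, "add": data_file}) + "\n")

    def _committed_files(self):
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path) as log:
            return [json.loads(line)["add"] for line in log if line.strip()]

    def read(self):
        rows = []
        for name in self._committed_files():
            with open(os.path.join(self.root, name)) as f:
                rows.extend(json.load(f))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        table = ToyLakehouseTable(root, {"id": int, "amount": float})
        table.append([{"id": 1, "amount": 9.5}])
        try:
            table.append([{"id": 2, "amount": 1.0}, {"id": "x", "amount": 2.0}])
        except ValueError as err:
            print("rejected batch:", err)
        print(table.read())
```

The data still lives in ordinary files in the lake. What changes is that readers consult the log to decide which files count, so a failed or invalid write leaves the table exactly as it was.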

**Advantages:**
- **Simplified Architecture**: One storage layer instead of two
- **Cost Reduction**: No separate Data Warehouse needed
- **Better Governance**: Metadata layer provides data quality
- **Performance**: Optimized for both BI and ML workloads
- **Flexibility**: Supports all data types and use cases

**Technology Stack:**
- **Storage**: Cloud storage (S3, ADLS, GCS) or HDFS
- **Processing**: Apache Spark
- **Table Format**: Delta Lake
- **Governance**: Apache Hive Metastore, Unity Catalog (Databricks)
- **Query Engine**: Spark SQL, Presto, Trino

### Architecture Comparison

| Aspect | Data Warehouse | Data Lake | Data Lakehouse |
|--------|---------------|-----------|----------------|
| **Data Types** | Structured only | All types | All types |
| **Schema** | Schema-on-write | Schema-on-read | Schema-on-read + governance |
| **Storage** | Data Warehouse | Data Lake | Data Lake |
| **BI Performance** | Optimized | Via Data Warehouse | Optimized |
| **ML Support** | Limited | Direct access | Direct access |
| **Cost** | High | Low | Low |
| **Governance** | Strong | Weak | Strong |
| **Complexity** | Medium | High | Medium |


## Summary: Big Data Evolution Timeline

### Key Milestones

| Year | Milestone | Impact |
|------|-----------|--------|
| **1959** | COBOL developed | Beginning of structured business data processing |
| **1970s-1980s** | RDBMS revolution (Oracle, SQL Server) | SQL-based data management becomes standard |
| **1990s-2000s** | Internet and Mobile Revolution | Explosion of data variety, volume, and velocity |
| **2003** | Google File System (GFS) paper | Foundation for distributed file systems |
| **2004** | MapReduce paper | Foundation for distributed computing |
| **2006** | Apache Hadoop project started | Open-source big data platform |
| **2008** | Hadoop becomes top-level Apache project | Enterprise adoption begins |
| **2010s** | Data Lake architecture emerges | Support for all data types |
| **2014** | Apache Spark 1.0 released | Faster alternative to MapReduce |
| **2020s** | Data Lakehouse architecture | Unified architecture for BI and ML |

### Evolution Path

```
COBOL (1959)
    ↓
RDBMS (1970s-1980s)
    ↓
Internet/Mobile Revolution (1990s-2000s)
    ↓
Big Data Problem (3Vs)
    ↓
Hadoop (2006)
    ↓
Apache Spark (2014)
    ↓
Data Lakehouse (2020s)
```

### Key Takeaways

1. **Data processing evolved** from single-machine to distributed systems
2. **Data types expanded** from structured to include semi-structured and unstructured
3. **Scalability shifted** from vertical (monolithic) to horizontal (distributed)
4. **Performance improved** with in-memory processing (Spark vs MapReduce)
5. **Architecture simplified** with Lakehouse (unified storage for BI and ML)
6. **Cost reduced** through commodity hardware and cloud computing

### Modern Big Data Stack

**Storage**: Data Lake (S3, ADLS, GCS, HDFS)

**Processing**: Apache Spark

**Governance**: Metadata layer (Hive Metastore, Unity Catalog)

**Query**: Spark SQL, Presto, Trino

**Platforms**: Databricks, AWS EMR, Azure Synapse, GCP Dataproc


## Reflection Questions

1. **Why did traditional RDBMSs struggle to handle Big Data?**
   - Think about the 3Vs (Variety, Volume, Velocity)
   - Consider schema-on-write vs schema-on-read

2. **What are the key advantages of distributed systems over monolithic systems?**
   - Scalability, fault tolerance, cost-effectiveness

3. **How does Hadoop address the Big Data problem?**
   - YARN for resource management
   - HDFS for distributed storage
   - MapReduce for distributed computing

4. **Why is Apache Spark faster than Hadoop MapReduce?**
   - In-memory processing
   - Optimized execution engine
   - Reduced overhead

5. **What is the key difference between Data Lake and Data Lakehouse?**
   - Governance and metadata layer
   - Performance optimization
   - Unified architecture

6. **How has data architecture evolved over time?**
   - From Data Warehouse to Data Lake to Data Lakehouse
   - From structured-only to all data types
   - From separate systems to unified architecture
