### Summary of HDFS, MapReduce, and Spark for Data Engineers

#### **Big Data Overview**
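- **Definition**: Big Data refers to datasets too large to store or process on a single machine. A common rule of thumb is that data larger than one machine's RAM (roughly 32 GB on a well-equipped workstation) calls for a distributed system to handle storage and computation.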
- **Local vs. Distributed Systems**:
  - **Local Systems**: Use a single machine's resources (CPU, RAM).
  - **Distributed Systems**: Leverage multiple machines for computation and storage, offering scalability and fault tolerance.

---

#### **HDFS (Hadoop Distributed File System)**
- **Purpose**: Distributes and stores large datasets across multiple machines.
- **Key Features**:
  - Data is split into **blocks** (default size: 128 MB).
  - Each block is stored as **3 copies** by default (replication factor 3), placed on different machines for fault tolerance.
  - **Scalability**: New machines can be added to increase capacity.
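- **Fault Tolerance**:
  - If a node fails, each of its blocks can still be read from a replica on another node, so no data is lost.
- **Parallelism**: Because a large file is divided into blocks spread across machines, different blocks can be processed at the same time.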
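
The block splitting and replication described above can be sketched in plain Python. This is a toy model, not HDFS code: the node names, the 1000 MB file, and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware and handled by the NameNode).

```python
BLOCK_SIZE_MB = 128   # default block size cited above
REPLICATION = 3       # default number of copies cited above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: this only guarantees distinct nodes per block.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


# Hypothetical 1000 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(1000)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"])
raw_storage_mb = sum(blocks) * REPLICATION  # total disk space used across the cluster
```

With three copies on distinct nodes, every block in this model survives the loss of any two nodes; the trade-off is that the cluster stores three times the file's size.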

---

#### **MapReduce**
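- **Function**: Splits a computation into tasks that run in parallel on the nodes holding the data in HDFS.
- **Architecture** (classic Hadoop 1; from Hadoop 2 onward, YARN takes over resource management and scheduling):
  - **Job Tracker**: Runs on the master node, splits a job into tasks, and assigns them to worker nodes.
  - **Task Tracker**: Runs on each worker node, executes its assigned tasks, and reports progress back to the Job Tracker.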
- **Workflow**:
  - Breaks down computations into **Map** and **Reduce** steps.
  - Writes intermediate results to disk after each Map and Reduce step.
- **Challenges**:
  - Writing to and reading from disk at every step makes it slower than in-memory processing, especially for multi-step jobs.
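
As a rough illustration of the Map and Reduce steps, here is a minimal word count in plain Python. It is a single-process sketch, not the Hadoop API: the function names and sample lines are made up, and the intermediate pairs stay in memory here, whereas MapReduce would write them to disk.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, the step between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["spark reads hdfs", "mapreduce reads hdfs"]
counts = word_count(lines)
```

In a real cluster, each call to `map_phase` could run on a different worker against a different block, which is why the Map step must not depend on any other line's output.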

![{712ECA83-819E-4730-B785-26AA96CE75CC}.png](attachment:{712ECA83-819E-4730-B785-26AA96CE75CC}.png)

---

#### **Spark**
- **Introduction**:
  - An open-source Apache project that began in 2009 at UC Berkeley's AMPLab and was donated to the Apache Software Foundation in 2013.
  - Designed for high-speed, distributed data processing.
  - Works with multiple storage systems: HDFS, AWS S3, Cassandra, etc.
- **Advantages over MapReduce**:
  - **Speed**: Up to 100x faster than MapReduce for some in-memory workloads, because intermediate data stays in memory instead of being written to disk after every step.
  - **Flexibility**: Does not require data to be stored in HDFS.
- **Core Concepts**:
  - **RDD (Resilient Distributed Dataset)**:
    - Immutable, fault-tolerant, and partitioned for parallel processing.
    - Lazily evaluated: transformations are only recorded until an action runs, which lets Spark plan the whole chain before executing it.
  - **Transformations**: Define data manipulations (e.g., `map`, `filter`).
  - **Actions**: Trigger actual computation (e.g., `count`, `collect`).
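
The split between transformations and actions can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this note, not Spark's RDD implementation: it runs in one process and has no partitions. It only shows that transformations record work and actions execute it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # the original data, never modified
        self._steps = steps    # recorded transformations, in order

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # record each invocation so laziness is observable
    return x * x


numbers = LazyDataset([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
result = even_squares.collect()   # the action runs the whole chain
```

In PySpark, the methods with the same names on an RDD (`map`, `filter`, `collect`, `count`) follow this pattern: the first two return a new RDD immediately, and the last two start a job.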

![{224EF1FE-18FC-4A59-959B-9DC496EE85E5}.png](attachment:{224EF1FE-18FC-4A59-959B-9DC496EE85E5}.png)

---

#### **Spark DataFrames**
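- **Evolution**: Introduced in Spark 1.3 and promoted to the recommended API for structured data in Spark 2.0, offering a more user-friendly syntax than RDDs.
- **Machine Learning**: The DataFrame-based `spark.ml` package is the primary API for Spark's machine learning library (MLlib); the older RDD-based API is in maintenance mode.
- **Documentation**: The DataFrame API continues to evolve between releases, so check the documentation for the Spark version in use.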

---

#### **When to Use Spark vs. MapReduce**
- **Spark**: Preferred for iterative algorithms, interactive queries, or machine learning tasks due to its in-memory processing and speed.
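- **MapReduce**: A reasonable fit for simple, one-pass batch jobs, particularly in existing Hadoop pipelines or when the data is far larger than the cluster's memory. Fault tolerance alone does not favor it, since Spark is fault-tolerant as well.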

---

#### **Key Takeaways for Data Engineers**
- Use **HDFS** for scalable, fault-tolerant storage of large datasets.
- Apply **MapReduce** for traditional batch processing with distributed computations.
- Leverage **Spark** for faster, more flexible distributed computing, especially when in-memory processing can significantly boost performance.