# Hadoop Architecture

## Introduction

This notebook provides a deep dive into Hadoop's architecture, including YARN, HDFS, MapReduce, and the foundational research that led to Hadoop's development.

## What You'll Learn

- YARN (Yet Another Resource Negotiator) architecture
- HDFS (Hadoop Distributed File System) architecture
- MapReduce programming model and framework
- Google's contribution to Big Data


## Hadoop Architecture: YARN

### YARN (Yet Another Resource Negotiator)

**YARN** is Hadoop's cluster resource management layer, often described as the cluster's operating system: it decides which applications get CPU and memory on which nodes.

### Three Main Components

**1. Resource Manager (RM)**
- **Location**: Master node
- **Function**: 
  - Manages cluster resources
  - Allocates resources to applications
  - Coordinates with Node Managers
- **Responsibilities**:
  - Scheduler: Allocates resources to various applications
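  - Applications Manager: Accepts submitted applications, launches each application's Application Master in its first container, and restarts it on failure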

**2. Node Manager (NM)**
- **Location**: One per worker node (Worker Node 1, Worker Node 2, Worker Node 3, etc.)
- **Function**:
  - Monitors resource usage on each node
  - Reports to Resource Manager
  - Manages containers on the node
- **Responsibilities**:
  - Container lifecycle management
  - Resource monitoring (CPU, memory, disk)

**3. Application Master (AM)**
- **Location**: Runs in a container on a worker node
- **Function**:
  - Manages the lifecycle of an application
  - Requests resources from Resource Manager
  - Works with Node Managers to launch and monitor the application's containers
- **Responsibilities**:
  - Negotiates resources for the application
  - Monitors application progress
  - Handles application failures
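
The interaction between the three components can be sketched with a toy simulation. This is only an illustration, not YARN's API: the class names mirror the components above, but the methods (`allocate`, `launch`, `run`), the memory-only resource model, and the first-fit placement are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks free memory on one worker node (toy model, memory only)."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, container_id: str, mb: int) -> None:
        self.free_mb -= mb
        self.containers.append(container_id)


class ResourceManager:
    """Grants containers on the first node with enough free memory (assumed policy)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 0

    def allocate(self, mb: int):
        for node in self.nodes:
            if node.free_mb >= mb:
                self.next_id += 1
                container_id = f"container_{self.next_id}"
                node.launch(container_id, mb)
                return container_id, node.name
        return None  # no capacity left in this toy model


class ApplicationMaster:
    """Requests task containers from the Resource Manager for one application."""

    def __init__(self, rm: ResourceManager):
        self.rm = rm

    def run(self, num_tasks: int, mb_per_task: int):
        grants = [self.rm.allocate(mb_per_task) for _ in range(num_tasks)]
        return [g for g in grants if g is not None]


nodes = [NodeManager("worker1", 4096), NodeManager("worker2", 4096)]
rm = ResourceManager(nodes)

# Step 1: the RM grants a first container that hosts the Application Master.
am_container = rm.allocate(1024)
# Step 2: the AM asks the RM for containers to run the application's tasks.
task_containers = ApplicationMaster(rm).run(num_tasks=4, mb_per_task=2048)

print("AM container:", am_container)
print("Task containers:", task_containers)
```

With two 4096 MB workers, only three of the four 2048 MB task requests fit once the Application Master has taken its own container, which shows why the Resource Manager has to arbitrate between competing requests.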

### YARN Architecture Diagram

```
┌─────────────────────────────────────────┐
│         Resource Manager (RM)           │
│         (Master Node)                   │
└─────────────────────────────────────────┘
                    │
        ┌───────────┼───────────┐
        │           │           │
┌───────▼───┐ ┌─────▼─────┐ ┌───▼──────┐
│ Worker    │ │ Worker    │ │ Worker   │
│ Node 1    │ │ Node 2    │ │ Node 3   │
│           │ │           │ │          │
│ Node      │ │ Node      │ │ Node     │
│ Manager   │ │ Manager   │ │ Manager  │
│           │ │           │ │          │
│ App       │ │ App       │ │ App      │
│ Master    │ │ Master    │ │ Master   │
│ Container │ │ Container │ │ Container│
└───────────┘ └───────────┘ └──────────┘
```

**Key Points:**
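- The Application Master runs in its own container and coordinates the application; the application's tasks run in additional containers that it requests
- Each application has its own Application Master
- The Application Master runs on a worker node, not on the master node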


## Hadoop Architecture: HDFS

### HDFS (Hadoop Distributed File System)

**HDFS** provides distributed storage on a Hadoop cluster.

### Two Main Components

**1. Name Node**
- **Function**: Stores file metadata
- **Metadata Information**:
  - File name
  - Directory location
  - File size
  - File blocks
  - Block ID
  - Block sequence
  - Block location (which Data Node stores each block)

**2. Data Node**
- **Function**: Stores actual data blocks
- **Responsibilities**:
  - Store and retrieve data blocks
  - Copy blocks to other Data Nodes for fault tolerance, as directed by the Name Node
  - Report block status to Name Node
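
The division of labour between the two components can be sketched as a small in-memory model. This is an assumption-laden illustration, not HDFS code: the `NameNode` class, its method names, the file path, and the block IDs are all hypothetical. It shows the Name Node holding only metadata, and learning block locations from what Data Nodes report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FileMeta:
    size_bytes: int
    block_ids: list  # ordered: the index in the list is the block sequence


class NameNode:
    """Holds metadata only; the block contents live on the Data Nodes."""

    def __init__(self):
        self.files = {}                          # path -> FileMeta
        self.block_locations = defaultdict(set)  # block id -> Data Node names

    def add_file(self, path: str, size_bytes: int, block_ids: list) -> None:
        self.files[path] = FileMeta(size_bytes, list(block_ids))

    def block_report(self, datanode: str, block_ids: list) -> None:
        """Record which blocks a Data Node says it holds."""
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path: str):
        """Return (block_id, Data Node names) pairs in file order."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path].block_ids]


nn = NameNode()
nn.add_file("/data/events.csv", 300 * 1024 * 1024, ["blk_1", "blk_2", "blk_3"])
nn.block_report("datanode1", ["blk_1", "blk_2"])
nn.block_report("datanode2", ["blk_2", "blk_3"])
nn.block_report("datanode3", ["blk_1", "blk_3"])

for block_id, holders in nn.locate("/data/events.csv"):
    print(block_id, "->", holders)
```

A client reading the file would ask the Name Node for this list and then fetch each block directly from one of the listed Data Nodes.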

### How HDFS Works

1. **File Storage**: Large files are split into fixed-size blocks (128 MB by default in Hadoop 2 and later; 256 MB is a common configured value)
2. **Replication**: Each block is replicated across multiple Data Nodes (default: 3 replicas)
3. **Metadata**: Name Node maintains metadata about all files and blocks
4. **Fault Tolerance**: If a Data Node fails, data can be retrieved from replicas
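
The four steps above can be worked through with a small calculation. The sketch below assumes a 128 MB block size and a replication factor of 3. The round-robin placement is a deliberate simplification chosen for the example (HDFS uses a rack-aware placement policy), and the function names and node names are hypothetical.

```python
BLOCK_SIZE_MB = 128   # assumed block size
REPLICATION = 3       # default replication factor from the text


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB):
    """Return the size of each block; only the last block may be smaller."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, datanodes: list, replication: int = REPLICATION):
    """Round-robin placement, for illustration only."""
    return {
        i: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        for i in range(num_blocks)
    }


def surviving_replicas(placement: dict, failed: set):
    """List the replicas of each block that sit on nodes that are still alive."""
    return {b: [n for n in nodes if n not in failed] for b, nodes in placement.items()}


blocks = split_into_blocks(1000)                      # a 1000 MB file
datanodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(len(blocks), datanodes)
survivors = surviving_replicas(placement, failed={"dn2"})

print(len(blocks), "blocks, last block is", blocks[-1], "MB")
print("raw storage used:", sum(blocks) * REPLICATION, "MB")
print("every block still readable:", all(survivors.values()))
```

A 1000 MB file becomes seven full 128 MB blocks plus one 104 MB block, and with three replicas it occupies 3000 MB of raw disk across the cluster. Losing one Data Node leaves at least two replicas of every block in this layout.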

### HDFS Architecture

```
┌─────────────────────────────────────┐
│         Name Node                   │
│    (File Metadata Storage)          │
│  - File names                       │
│  - Block locations                  │
│  - Directory structure              │
└─────────────────────────────────────┘
            │
    ┌───────┼───────┐
    │       │       │
┌───▼───┐ ┌─▼───┐ ┌─▼───┐
│ Data  │ │Data │ │Data │
│ Node 1│ │Node2│ │Node3│
│       │ │     │ │     │
│ Block │ │Block│ │Block│
│ Repl. │ │Repl.│ │Repl.│
└───────┘ └─────┘ └─────┘
```

**Key Features:**
- **Fault Tolerance**: Data replicated across multiple nodes
- **Scalability**: Add more Data Nodes to increase storage
- **High Throughput**: Optimized for streaming reads of large files rather than low-latency random access


## Hadoop Architecture: MapReduce

### MapReduce

**MapReduce** is both a **programming model** and a **programming framework** for processing large datasets in parallel.

### MapReduce Model

The MapReduce model requires implementing logic in **two functions**:

**1. Map Function**
- **Input**: Reads a data block
- **Process**: Applies logic at the block level
- **Output**: Produces intermediate key-value pairs
- **Characteristics**: 
  - Processes data in parallel across multiple nodes
  - Each map task processes one input split (by default, one HDFS block) independently

**2. Reduce Function**
- **Input**: Receives map output (intermediate key-value pairs)
- **Process**: Consolidates the results
- **Output**: Final aggregated results
- **Characteristics**:
  - Groups data by key
  - Performs aggregation operations (sum, count, average, etc.)
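
Word count is the usual first example of the two functions. The sketch below runs entirely in memory in plain Python and does not use the Hadoop API. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and `run_job` is a stand-in for the work the framework does between the two phases.

```python
from collections import defaultdict


def map_fn(block: str):
    """Emit an intermediate (word, 1) pair for every word in one block."""
    for word in block.lower().split():
        yield word, 1


def reduce_fn(key: str, values: list):
    """Consolidate all values seen for one key into a single count."""
    return key, sum(values)


def run_job(blocks: list):
    grouped = defaultdict(list)
    for block in blocks:                 # each block is mapped independently
        for key, value in map_fn(block):
            grouped[key].append(value)   # group intermediate pairs by key
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))


blocks = ["big data big cluster", "data node data block", "big block"]
counts = run_job(blocks)
print(counts)
```

Because `map_fn` only ever sees one block and `reduce_fn` only ever sees one key, both can be run on many machines at once without coordination, which is the point of the model.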

### How MapReduce Works

```
Input Data Blocks
    │
    ├─ Block 1 ──► Map ──► (key1, value1)
    │                    (key2, value2)
    ├─ Block 2 ──► Map ──► (key1, value3)
    │                    (key2, value4)
    └─ Block 3 ──► Map ──► (key1, value5)
                            (key2, value6)
                                │
                                ▼
                        Shuffle & Sort
                                │
                                ▼
                    ┌───────────┴───────────┐
                    │                       │
            Reduce (key1)            Reduce (key2)
                    │                       │
                    ▼                       ▼
            (key1, aggregated)    (key2, aggregated)
```
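
The Shuffle & Sort step in the diagram can be sketched as follows: every intermediate pair is routed to a reducer chosen from its key, so that all values for one key end up in the same place. This is a simplified in-memory illustration. The function names are hypothetical, and `zlib.crc32` is used only as a stable stand-in hash for the example.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key with a stable hash (crc32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs: list, num_reducers: int):
    """Route each pair to a reducer, then sort each reducer's input by key."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket, key=itemgetter(0)) for bucket in buckets]


def reduce_bucket(bucket: list):
    """Sum the values of each key; groupby works because the bucket is sorted by key."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(bucket, key=itemgetter(0))
    }


map_outputs = [
    [("key1", 1), ("key2", 2)],   # output of the map task for block 1
    [("key1", 3), ("key2", 4)],   # output of the map task for block 2
    [("key1", 5), ("key2", 6)],   # output of the map task for block 3
]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)

results = {}
for bucket in reducer_inputs:
    results.update(reduce_bucket(bucket))
print(results)
```

Whichever reducer a key lands on, it lands on exactly one, so each reducer can aggregate its keys without talking to the others.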

### MapReduce Framework Implementation

- **Hadoop MapReduce Framework**: Implements the MapReduce model
- **YARN**: Manages resource allocation for MapReduce jobs
- **HDFS**: Manages data blocks that MapReduce processes

### Key Concepts

1. **Data Locality**: The scheduler tries to run each map task on a node that already stores its input block, moving the computation to the data to reduce network traffic
2. **Parallel Processing**: Multiple map and reduce tasks run simultaneously
3. **Fault Tolerance**: Failed tasks are automatically retried
4. **Scalability**: Can process petabytes of data across thousands of nodes
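
Data locality can be illustrated with a greedy assignment sketch: prefer a node that holds a replica of the block, and fall back to any node with a free slot. This is a simplified assumption for teaching purposes, not the scheduling algorithm Hadoop uses. The function name, the slot model, and the node names are hypothetical.

```python
def assign_map_tasks(block_locations: dict, free_slots: dict):
    """Greedy sketch: prefer a node that holds the block, else use any free node."""
    slots = dict(free_slots)   # copy so the caller's dict is left untouched
    assignments = {}
    for block_id, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            assignments[block_id] = (None, "waiting")
            continue
        node = candidates[0]
        slots[node] -= 1
        assignments[block_id] = (node, "local" if local else "remote")
    return assignments


block_locations = {
    "blk_1": ["worker1", "worker2"],
    "blk_2": ["worker1", "worker2"],
    "blk_3": ["worker1", "worker2"],
}
free_slots = {"worker1": 1, "worker2": 1, "worker3": 1}

assignments = assign_map_tasks(block_locations, free_slots)
for block_id, (node, kind) in assignments.items():
    print(block_id, "->", node, f"({kind})")
```

Here the first two blocks are processed where they are stored, while the third has to be read over the network by `worker3` because both nodes holding it are busy.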


## Google's Contribution to Big Data

Google ran into Big Data problems early, at web scale, and addressed them in four key areas:

### 1. Data Collection and Ingestion
- Efficiently collect data from various sources
- Handle high-velocity data streams

### 2. Data Storage and Management
- Store massive volumes of data
- Ensure reliability and fault tolerance

### 3. Data Processing and Transformation
- Process large datasets efficiently
- Enable parallel processing

### 4. Data Access and Retrieval
- Fast data retrieval
- Support for various access patterns

### Google's Research Papers

**1. Google File System (GFS) Whitepaper - 2003**
- Introduced distributed file system architecture
- Designed for large-scale distributed applications
- Key concepts: a single master coordinating many chunkservers, replication, fault tolerance
- **Impact**: Foundation for HDFS (Hadoop Distributed File System)

**2. MapReduce Paper - 2004**
- Introduced the MapReduce programming model
- Simplified parallel and distributed computing
- Key concepts: Map and Reduce functions, automatic parallelization
- **Impact**: Foundation for Hadoop MapReduce

### Open Source Development

These Google research papers became the basis for the development of **open-source Hadoop**:
- **2003**: Google File System paper published
- **2004**: MapReduce paper published
- **2006**: Hadoop started as an Apache subproject, split out of the Nutch web crawler (inspired by Google's papers)
- **2008**: Hadoop became a top-level Apache project

**Key Insight**: Google's research demonstrated that distributed computing on commodity hardware could solve big data problems cost-effectively.
