# Hadoop Overview

## What is Hadoop?

>_Apache Hadoop_ is an open-source, distributed processing framework that is used to efficiently store and process large datasets. It was the first big data management tool to gain popularity in industry. 

From the 1970s onwards, as computers became more widely used, the data management domain started to emerge. Organisations had small datasets that they stored on physical paper, in digital files or in a centralised database. Data analysis was usually done manually using spreadsheet software (such as Microsoft Excel) or by querying a database and generating reports.

More recently, however, and especially after the introduction of the Internet and smart devices (such as smartphones and Internet of Things devices), there was a major shift that resulted in big data becoming the dominant trend, and the traditional approaches were no longer able to keep up. _Big data_ refers to data with several distinctive properties, known as the 3 Vs:
- Exponentially increasing data quantity (volume)
- Accelerating data generation speeds (velocity)
- New types of data being produced and analysed (variety)

For example, there was a rapid increase in the images and videos that people shared on social media websites and mobile applications. Image and video data were rarely stored and analysed in the past, and they are examples of data types that don't fit well in a relational database or a spreadsheet.

Accordingly, new tools started to emerge to handle these new data varieties. In 2006, Hadoop was introduced. It was created by Doug Cutting and Mike Cafarella and quickly became the first big data tool that global companies adopted.

Below is a diagram showing the overall Hadoop ecosystem. Note the various Hadoop components and how other tools can be integrated:
<p align="center">
<img src= "images/hadoop-overview6.png" width=600>
</p>


## Features of Hadoop

> Since Hadoop was introduced, it has gone through a few iterations of maturity. With each iteration, the tools became more flexible and able to handle a wider variety of industry use cases.

In 2008, the initial version of Hadoop (dubbed MapReduce 1.0) won the terabyte sort benchmark by sorting 1 Terabyte of data (1,000 Gigabytes) in 209 seconds, which was a new record at the time.

### Strengths of MapReduce 1.0
Hadoop MapReduce 1.0 was revolutionary when it was introduced because it offered features that had not been available before:
- Enabling the use of commodity (inexpensive) hardware to store data (previously, expensive specialised servers were required)
- Open-source and free (previously, expensive database and data warehouse licenses were required)
- Handling unstructured data as well as massive amounts of structured data
- Enabling data engineers to write their own functions and code using Java
- Managing massive volumes of batch data effectively

### Limitations of MapReduce 1.0
However, despite all of these features, there were some limitations which included:
- Java MapReduce coding was extremely complex and time-consuming
- It wasn't able to handle stream data processing use cases (no streaming tools were yet available)
- It couldn't integrate with other data management tools due to its design, which coupled MapReduce closely to HDFS

### Strengths of MapReduce 2.0 - YARN
Ultimately, a new version, Hadoop MapReduce 2.0 (known as YARN), was released in 2012. YARN stands for _Yet Another Resource Negotiator_, and it was the first Hadoop component dedicated to automatically managing cluster resources (like memory and CPU). This version decoupled data processing from data storage, which was a big step: it transformed Hadoop from a single product into a framework or ecosystem that other tools can connect to, using YARN as an "operating system". This enabled novel approaches to data engineering and unlocked new business use cases.

To help visualise the differences between the 2 versions, take a look at the diagram below:
<p align="center">
<img src= "images/yarn.png" width=600>
</p>

The main benefits YARN provided include:
- Decoupling data storage from data processing, thus transforming Hadoop into a wider-use framework which can easily integrate with other tools
- Allowing Hadoop to more efficiently harness the power of cloud computing by automating resource management
- Enabling stream data processing by integrating Hadoop with other tools like Apache Storm (prior to YARN, only batch processing was possible)
- Higher reliability as YARN has no single point of failure
- Supporting a larger default data block size (128MB compared to 64MB previously), which improves performance as fewer data blocks need to be tracked
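
To make the block size point concrete, here is a minimal Python sketch of the arithmetic. The 10GB file is a hypothetical example and `block_count` is an illustrative helper, not part of any Hadoop API:

```python
MB = 1024 * 1024


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to store a file (integer ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


file_size = 10 * 1024 * MB  # hypothetical 10GB file

blocks_64 = block_count(file_size, 64 * MB)    # older 64MB block size
blocks_128 = block_count(file_size, 128 * MB)  # newer 128MB block size

print(f"64MB blocks:  {blocks_64}")
print(f"128MB blocks: {blocks_128}")
```

Doubling the block size halves the number of blocks, so there are fewer blocks for the cluster to track and schedule work against.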


## Hadoop Modules

> In the latest version of Hadoop, the framework consists of 4 core modules: HDFS, YARN, MapReduce and Hadoop Common.

### 1. Common  
- Also referred to as the Hadoop Core
- Contains the essential JAR (Java ARchive) files and scripts required to start Hadoop
- Provides source code and product documentation
- Refers to the collection of common utilities and libraries that support the functions of other Hadoop modules

### 2. HDFS
- This module is responsible for reliably storing the data across multiple nodes in a distributed cluster network
- It is also responsible for fault-tolerance, which is implemented by replicating each block of data multiple times (3X by default) so that the data remains available if a node fails
- Raw data, intermediate results of processing, in addition to fully processed data are all stored in HDFS
- The architecture of HDFS is a master-worker design which consists of 3 types of nodes:

#### NameNode
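- The master node, which manages the file system operations
- Also stores the metadata
- Usually needs to be a powerful machine with high memory
#### Secondary NameNode
- A helper node that periodically merges the NameNode's edit log into a checkpoint of the file system metadata
- Despite its name, it is not a hot standby and does not automatically take over if the NameNode fails (high-availability setups use a separate Standby NameNode for that)
- The checkpoints keep NameNode restarts fast and help with recovering the metadata
#### DataNodes
- Actual worker nodes which physically store the data (processing tasks also run on these machines)
- Perform read and write requests sent by clients
- Usually consist of cheap machines with large storage space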

<p align="center">
<img src= "images/hdfs2.png" width=400>
</p>
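
The replication idea described above can be illustrated with a small Python simulation. This is only a sketch: the round-robin placement and the node names (`dn1` to `dn5`) are simplifying assumptions, and real HDFS uses a more sophisticated, rack-aware placement policy.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Assign every block to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    n = len(datanodes)
    return {
        block_id: [datanodes[(block_id + r) % n] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def available_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return [
        block_id
        for block_id, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]


datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical cluster
placement = place_blocks(num_blocks=4, datanodes=datanodes)

# Simulate two DataNodes failing at the same time
survivors = available_blocks(placement, failed_nodes={"dn2", "dn3"})
print(placement)
print(f"Blocks still readable: {len(survivors)} of {len(placement)}")
```

Because each block lives on 3 different nodes, losing any 2 nodes still leaves at least one copy of every block.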

### 3. YARN
- YARN is an acronym for Yet Another Resource Negotiator
- Sits in between HDFS and MapReduce and acts as a dynamic resource manager for the entire Hadoop cluster
- Schedules the various MapReduce processing jobs across the cluster
- Coordinates everything under the hood - effectively acting as a Hadoop operating system
- Separates the resource management layer from the data processing layer, acting as an interface between the HDFS storage layer and other tools (like Apache Spark) which can be connected
- Consists of several sub-components which include:
    - Client
    - Resource manager
    - Node manager
    - Application master
    - Container

<p align="center">
<img src= "images/yarn-architecture.png" width=400>
</p>

Let's look at a high-level example of how these components work together. In a real Hadoop cluster, a client would be an application like Apache Spark, and the HDFS NameNode and the YARN Resource manager will usually be running on the master node(s) (both are Java processes). Once Spark submits a job (for instance, a SELECT query), the job goes to the Resource manager, which launches an Application master for it. The Application master then requests containers from the Resource manager, and these containers are started by the Node managers on the worker nodes. Wherever possible, containers are placed on the DataNodes that store the data the query requires (the NameNode provides the block locations), and the results are returned to the client. This principle is called _data locality_, which refers to running the data processing tasks in the location where data is stored, rather than moving lots of data across the network.
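
The data locality principle can be sketched in a few lines of Python. This is a simplified illustration rather than how a real scheduler is implemented; the function `pick_node`, the block names and the node names are all hypothetical.

```python
def pick_node(block_replicas, free_containers):
    """Prefer a node that already stores the block; otherwise use any free node.

    Returns (node, is_local), or (None, False) if the cluster is full.
    """
    for node in block_replicas:
        if free_containers.get(node, 0) > 0:
            return node, True
    for node, free in free_containers.items():
        if free > 0:
            return node, False
    return None, False


# Hypothetical block locations (which nodes hold a replica of each block)
block_locations = {
    "block-0": ["dn1", "dn2", "dn3"],
    "block-1": ["dn2", "dn3", "dn4"],
}
# Hypothetical number of free containers per node
free_containers = {"dn1": 0, "dn2": 1, "dn3": 0, "dn4": 0, "dn5": 2}

assignments = {}
for block, replicas in block_locations.items():
    node, is_local = pick_node(replicas, free_containers)
    if node is not None:
        free_containers[node] -= 1  # the container is now in use
    assignments[block] = (node, is_local)

print(assignments)
```

In this run the task for `block-0` lands on a node that holds the data, while the task for `block-1` has to fall back to a remote node because all of its replica holders are busy, which means the block would need to be read over the network.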

### 4. MapReduce
- Is a programming paradigm for processing data that can handle huge datasets by scaling massively across a distributed network
- It is the core of the Apache Hadoop framework and works by processing data in parallel on various machines
- The term refers to 2 separate and distinct tasks that form the basis of all data processing activities:
    #### Mapper Function
    - The map job splits the input data (usually stored in HDFS) into equal size chunks called _input splits_
    - Every input split is passed to one mapping function to produce intermediate output values
    - The output values consist of key-value pairs which are pushed to the Reducers
    - For example, if a large input file gets split into 3 smaller files (3 being the number of mappers configured), each one of the mappers would receive 1 out of the 3 smaller files. Each mapper would then be responsible for processing the data within that smaller file only.
        
    #### Reducer Function
    - Key-value pairs produced by the mappers first go through a step called shuffling. In the _shuffling_ step, the framework groups together the records that share the same key.
    - The output values from each shuffling step are then aggregated together
    - The aggregated output is then further processed to produce the final result
    - For instance, the results from the 3 smaller files which the mappers processed in the previous step would then be combined and integrated to produce only one large output file as the final result

To help clarify how MapReduce operates, let's look at a word count example. Imagine we had an input text consisting of a few words, and we wanted to count the frequency of each word. Here is the text we will be processing (imagine billions of words instead of this toy example):

`Deer Bear River`
`Car Car River`
`Deer Car Bear`

This is how MapReduce would go about executing the word count program:
<p align="center">
<img src= "images/mapreduce-example.png" width=400>
</p>

1. _Input_: 
    - The first step is splitting all the input data into equal chunks (128MB) to enable parallel data processing
2. _Splitting:_ 
    -   Each chunk is transformed into a key-value pair and passed on to _one mapper_
3. _Mapping_: 
    -   In this step, data processing gets initiated. Each mapper performs a word count _on the input split it was provided_ and produces an intermediate list of key-value pairs containing the word and its frequency
4. _Shuffling_: 
    -   In this step, the mapper outputs are grouped by key, so that all key-value pairs containing the _same word_ from all mappers end up together (in this example, each group holds only one word)
5. _Reducing_: 
    -   Now that each group contains one word, the reducers run the aggregate functions required to calculate the total count of each word. In this toy example, every reducer creates the final count for _one word only_ (in practice, a reducer usually handles many keys).
6. _Final Result_: 
    -   In the final step, the output from _all_ reducers is combined together and produced as a final key-value pair output
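
The steps above can be sketched in plain Python to show the flow of key-value pairs. This is an in-memory illustration only, not Hadoop code: the function names are hypothetical, each text line is treated as one input split, and the input assumes the first word of the toy text is `Deer`. Each mapper here emits a `(word, 1)` pair per word, and the counting happens in the reducer.

```python
from collections import defaultdict


def mapper(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Shuffle: group the values from all mappers by key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            groups[word].append(count)
    return dict(groups)


def reducer(word, counts):
    """Reduce: aggregate all values for one key into a final count."""
    return word, sum(counts)


# Input / splitting: one split per line of the toy text
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [mapper(split) for split in splits]   # mapping
grouped = shuffle(mapped)                      # shuffling
result = dict(reducer(word, counts) for word, counts in sorted(grouped.items()))

print(result)  # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```

In a real cluster, the mappers and reducers would run in parallel on different machines, and the shuffle would move data across the network.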

Although these steps might seem a bit complex, it's important to note that nowadays data engineers need not worry about manually executing these tasks, as they are almost always performed under the hood by the tool or software that's being used. For example, if we are running a Hive SELECT query, Hive can automatically transform this query into MapReduce jobs in the back-end, and the results are returned to the user in the Hive client using the standard tabular output format (with rows and columns).

## Tools compatible with Hadoop

> After the introduction of YARN, it started becoming possible to integrate Hadoop with a wider variety of big data engineering tools for data warehousing (such as Hive), streaming data processing (such as Storm) and NoSQL data stores (such as HBase).
<p>
Below is a table showing some of the most popular tools used alongside Hadoop in industry:
</p>


 <table>
    <thead>
        <tr>
            <th style="width:auto;text-align:center">Tool Name</th>
            <th style="width:auto;text-align:center">Tool Type</th>
            <th style="width:auto;text-align:center">Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Hive</th>
            <td>Data Warehouse</td>
            <td><li>Hadoop's native data warehousing tool
                <li>Uses HiveQL (HQL), which is very similar to SQL, to query data
                <li>Can connect to various data store technologies (HDFS, NoSQL and so on)
            </td>
        </tr>
        <tr>
            <th>Impala</th>
            <td>Data Warehouse</td>
            <td><li>Open-source SQL engine for Hadoop
                <li>Faster and newer than Hive
                <li>Easily integrates with various Hadoop components (such as HBase)
            </td>
        </tr>
        <tr>
            <th>Spark</th>
            <td>Data Processing</td>
            <td><li>Supports in-memory batch and streaming data processing (Spark Core and Spark Streaming)
                <li>Provides modules to enable structured (SparkSQL) and unstructured (Spark Core/Streaming) data processing
                <li>Has built-in machine learning module (MLlib)
            </td>
        </tr>  
		<tr>
            <th>Storm</th>
            <td>Data Processing</td>
            <td><li>Supports processing of streaming data only
                <li>Easy to set up and integrate with Hadoop
                <li>Can handle up to millions of data points per second per node
            </td>
        </tr>
	    <tr>
            <th>Flume</th>
            <td>Data Ingestion</td>
            <td><li>Designed to ingest data into a Hadoop cluster and to move data within the cluster 
                <li>Mainly handles unstructured and semi-structured data
                <li>Can handle both batch and streaming data
                <li>Can connect to various types of data sources and data consumers
            </td>
        </tr>  
		<tr>
            <th>Kafka</th>
            <td>Data Ingestion</td>
            <td><li>Designed to ingest streaming data at scale
                <li>Easily connects with various Hadoop components
                <li>Newer than Flume, more reliable and offers some basic data processing capabilities
                <li>Can connect to a wide variety of data sources and data consumers
            </td>
        </tr>
        <tr>
            <th>Sqoop</th>
            <td>Data Ingestion</td>
            <td><li>Command-line batch data ingestion tool
                <li>Mainly designed to connect HDFS to relational databases like MySQL
                <li>Bi-directional (can move data into HDFS or from HDFS into a database)
            </td>
        </tr>  
	    <tr>
            <th>Oozie</th>
            <td>Workflow Scheduler</td>
            <td><li>Schedules and runs various Hadoop jobs/programs
                <li>Consists of a Coordinator and a Workflow component
                <li>Provides dashboards and metrics regarding the various Hadoop jobs
            </td>
        </tr>
  	    <tr>
            <th>Airflow</th>
            <td>Workflow Scheduler</td>
            <td><li>Used to create, schedule, run and monitor data engineering pipelines
                <li>Easily integrates with Hadoop components
                <li>Newer than Oozie with a richer user interface (GUI)
            </td>
        </tr>
        <tr>
            <th>ZooKeeper</th>
            <td>Coordinator</td>
            <td><li>ZooKeeper coordinates the various Hadoop services together
                <li>Performs automatic synchronisation of tasks, configuration maintenance, and grouping of jobs
            </td>
        </tr>
     	<tr>
            <th>Pig</th>
            <td>Scripting</td>
            <td><li>Used for data transformation and cleaning
                <li>Scripts are written in Pig Latin, a procedural data-flow language
                <li>Relatively old tool that is becoming much less popular
            </td>
        </tr>
	    <tr>
            <th>HBase</th>
            <td>NoSQL Data Store</td>
            <td><li>One of the most popular NoSQL tools used by global companies
                <li>It's a column-oriented data store
                <li>Supports streaming and batch data processing
            </td>
        </tr>
		<tr>
            <th>Elasticsearch</th>
            <td>Text Analytics</td>
            <td><li>Used as a real-time text analytics application
                <li>Used to store and index massive amounts of text data
                <li>Can easily and efficiently query text data in a Google-like manner
            </td>
        </tr>
    </tbody>
 </table>
 


## Hadoop vs Relational Databases

 <table>
    <thead>
        <tr>
            <th style="width:auto;text-align:center"></th>
            <th style="width:auto;text-align:center">Hadoop</th>
            <th style="width:auto;text-align:center">Relational Database</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Scalability</th>
            <td><li>Leverages distributed technology to enable multiple computers to analyse massive datasets in parallel
    		    <li>Can easily scale vertically (up or down per node) and horizontally (increase the number of nodes)
            </td>
            <td><li>Traditionally scales vertically only (up or down) due to the centralised architecture</td>
        </tr>
        <tr>
            <th>Data Variety</th>
            <td><li>Unstructured, semi-structured and structured data are all supported</td>
            <td><li>Structured data only</td>
        </tr>  
		<tr>
            <th>Fault Tolerance</th>
            <td><li>High availability with multiple backup nodes in the cluster
				<li>No single point of failure
			</td>
            <td><li>Single point of failure</td>
        </tr>
	    <tr>
            <th>Data Redundancy</th>
            <td><li>Data replicated automatically (3X by default)</td>
            <td><li>Requires manual setup for backups and data redundancy</td>
        </tr>
		<tr>
            <th>Data Processing</th>
            <td><li>Can process batch and streaming data</td>
            <td><li>Can process batch and streaming data</td>
        </tr>  
		<tr>
            <th>Schema</th>
            <td><li>When tools read data stored in HDFS, they parse the files to check whether a schema exists and what the data formatting looks like
            	<li>Data can first be loaded into HDFS then explored later (in a process called ELT)
                <li>This is known as a Schema-on-read model
            </td>
            <td><li>Schema required before loading data into database
				<li>Data can't be written before it's transformed into the required format (in a process called ETL)
                <li>This is known as a Schema-on-write model 
			</td>
        </tr>
    </tbody>
 </table>
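
The difference between the two schema models in the last row can be illustrated with a short Python sketch. Everything here is hypothetical (the two-column schema, the sample records and the helper names); it only shows where in the pipeline the schema gets applied.

```python
SCHEMA = [("user_id", int), ("amount", float)]  # hypothetical schema


def parse(line, schema=SCHEMA):
    """Apply the schema to one raw CSV line; raises ValueError on a mismatch."""
    values = line.split(",")
    if len(values) != len(schema):
        raise ValueError("wrong number of fields")
    return {name: cast(value) for (name, cast), value in zip(schema, values)}


raw_records = ["1,9.99", "2,abc", "3,5.00"]  # the second record is malformed

# Schema-on-write: data is validated BEFORE it is stored, bad rows are rejected
table, rejected = [], []
for line in raw_records:
    try:
        table.append(parse(line))
    except ValueError:
        rejected.append(line)

# Schema-on-read: raw data is stored as-is, the schema is applied when reading
data_lake = list(raw_records)


def read(lines):
    """Parse rows lazily at query time, skipping rows that don't fit."""
    for line in lines:
        try:
            yield parse(line)
        except ValueError:
            continue


rows_on_read = list(read(data_lake))

print(len(table), len(rejected))  # rows stored vs rejected on write
print(len(data_lake), len(rows_on_read))  # raw rows kept vs rows parsed on read
```

With schema-on-read, the malformed record is still kept in storage, so it can be inspected or re-parsed later with a different schema.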

## Limitations of Hadoop

### MapReduce is not the most efficient for all use cases
- Despite being a disruptive technology when it was first released, there are now faster and more advanced big data processing technologies available (like Apache Spark)

### Difficulty integrating various tools
- Although YARN enables integrating various tools with Hadoop, the setup process can be complex and time-consuming
- Also, there are often compatibility issues which require lots of time to debug

### Complexity of writing Java MapReduce code
- Java MapReduce code is very complicated and bulky, thus requiring a lot of time to create. Previously, Java was the main way to create code within Hadoop.
- Data engineers now often use other languages (such as Python) to create code

### Talent gap
- It's not easy finding skilled experts with real hands-on experience working in large, enterprise Hadoop clusters
- Accordingly, recruiting and retaining qualified staff can be expensive and time-consuming

### Doesn't handle small files well
- Hadoop's architecture makes it much more efficient at handling larger-sized files (> 128MB), rather than a vast number of smaller files
- As the default block size of HDFS is 128MB, having a huge number of small files (for instance, 1MB in size) is inefficient, since every file needs at least one block of its own and its own metadata entry. This also results in overwhelming the NameNode (which keeps track of where each file and block is stored in the cluster).
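
A rough back-of-the-envelope calculation shows why small files put pressure on the NameNode. The sketch below compares storing 1TB as 1MB files versus 128MB files. The figure of about 150 bytes of NameNode memory per file or block object is a commonly quoted rule of thumb and is used here as an assumption, not as a measured value.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB
BYTES_PER_OBJECT = 150  # assumption: rough rule of thumb per file/block object


def namenode_objects(num_files, file_size):
    """Metadata objects tracked by the NameNode: 1 per file + 1 per block."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return num_files * (1 + blocks_per_file)


total_data = 1024 * 1024 * MB  # 1TB of data in both scenarios

small = namenode_objects(total_data // (1 * MB), 1 * MB)      # 1MB files
large = namenode_objects(total_data // (128 * MB), 128 * MB)  # 128MB files

print(f"1MB files:   {small:>9,} objects, ~{small * BYTES_PER_OBJECT / MB:.1f}MB of memory")
print(f"128MB files: {large:>9,} objects, ~{large * BYTES_PER_OBJECT / MB:.1f}MB of memory")
```

The same amount of data needs 128 times more metadata objects when it is stored as 1MB files.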

### Cluster Security 
- Due to the complexity of the ecosystem, Hadoop's security is difficult to implement
- Security tools are not yet as mature as those for other technologies (such as relational database systems)

### Competition from newer tools
- Since Hadoop was introduced, newer tools like Apache Spark and Apache Flink started to emerge
- These tools are sometimes easier to use and are faster than Hadoop in processing data, thus causing them to overtake Hadoop in popularity


## Popular Use Cases

Below are some examples of how Hadoop is used in industry:

### Leveraging HDFS as a Data Lake
- This is one of the most popular Hadoop use cases as its storage is cheap, since it uses commodity hardware (if on-premise), and cloud data storage is also supported
- Hadoop is often used as an extension to a traditional data warehouse by storing unstructured data in addition to low-value structured data
- Twitter has been using Hadoop since 2010 to store and process its tweets and log files. The data stored is compressed using LZO compression to help optimise resources


### Data Aggregation
- Spotify is using a large Hadoop cluster to generate its content and perform various types of data aggregation

### Hadoop in the Financial Sector
- Several global banks, such as Credit Suisse and Bank of America, use Hadoop to store their customer and transaction data
- This data is then used for a variety of use cases including:
    - Customer segmentation
    - Fraud detection
    - Personalised marketing

### Hadoop in the Healthcare Sector
- Many healthcare providers use Hadoop to store large amounts of medical data
- This data is used to keep track of large-scale health indices and metrics
- The main use cases include:
    - Patient detailed record keeping
    - Regulatory requirements


## Key Takeaways

- Apache Hadoop is an open-source, distributed processing framework that is used to efficiently store and process large datasets. It was the first big data management tool to gain popularity in industry. 
- The tool went through several iterations until it reached its current maturity level. Nowadays, it can integrate with other big data engineering tools to support a wide variety of use cases.
- Hadoop was initially designed to address the shortcomings of traditional relational database systems, which included the inability to handle unstructured data (such as images and video) and the difficulty of storing massive quantities of low-value, structured data
- Hadoop YARN is the newer version which tackles many of the drawbacks in the earlier Hadoop releases. The main benefits YARN provides include: separating data storage from processing, enabling Hadoop to more efficiently harness the power of cloud technologies, allowing streaming data processing, providing a higher cluster availability and supporting larger sized data blocks
- Hadoop is composed of 4 main modules which are: Hadoop Common (core files required to run the system), HDFS (the data storage layer), YARN (the resource management layer) and MapReduce (the data processing framework)
- MapReduce is a technical term that refers to the process of how Hadoop divides a large task into a number of smaller tasks to enable parallel processing to occur. As the name implies, the process consists of a Mapper function and a Reducer function
- There are currently a large number of tools that can integrate with Hadoop. These include data warehousing tools (such as Hive), data processing tools (like Spark), data ingestion tools (like Kafka), job scheduling tools (such as Airflow) and NoSQL data stores (like HBase) to name a few.
- Hadoop should not be confused with a relational database system or a relational data warehouse. There are many differences between them. For instance, Hadoop can handle unstructured data while databases/data warehouses cannot. Also, Hadoop leverages parallel processing across a distributed network while databases usually run on a single, powerful machine.
- Hadoop has some drawbacks, with the main ones including: Difficulty of tool integration, complexity of writing MapReduce code, the lack of available talent, inability to handle small files well and the maturing security tools. There are also newer big data frameworks, like Apache Spark, which are quickly overtaking Hadoop.
- There are many global companies that leverage Hadoop for various use cases such as using it as a data lake, a data warehouse or to perform batch and streaming data processing