# Apache HBase - Overview

## What is HBase?

> HBase is an open-source, column-oriented distributed data store that typically runs in a Hadoop environment.

Although it can handle structured data, HBase is designed mainly for semi-structured and unstructured data that traditional relational databases handle poorly.

HBase can store and interact with massive amounts of data (terabytes to petabytes) held in _table_ structures. An HBase table can consist of billions of rows and millions of columns. HBase is built for low-latency random reads and writes at this scale, which is where it offers benefits over traditional relational databases.

For an introductory video on HBase, check out this Huawei lecture:
- [Introduction to HBase](https://www.youtube.com/embed/VUkPIT97J9A)



## HBase Architecture

> HBase can run independently in a stand-alone mode, or in a distributed environment running on top of Hadoop's HDFS (in a pseudo-distributed or fully distributed mode).

HBase's flexible architecture allows the tool to be installed in 3 different modes:
### Stand-alone Mode 
  - This is mainly used for testing and proof of concept purposes
  - Data will be stored on the local disk storage


### Fully-distributed Mode  
  - In this mode, the 3 HBase components (which we'll explain shortly) run on separate computer nodes
  - In global companies using large-scale production environments, HBase is normally integrated with Hadoop to leverage HDFS as the back-end storage repository
  - This enables massive scaling and strong fault-tolerance

### Pseudo-distributed Mode
  - In this mode, the 3 HBase components run as separate processes but on a _single_ machine/node
  - Data can be stored on the local filesystem or in HDFS; when HDFS is used, it can be a separate cluster that scales up and down as required
  - This mode is mainly suited to testing and prototyping, or to small workloads with less intensive data needs

Although HBase can run on top of other storage systems such as Amazon S3, HDFS is the popular choice because of its low cost, fault tolerance and scalability.


The main storage entity in HBase is a _table_, which consists of rows and columns. The intersection of a row and a column is called a _cell_, which stores the data. Rows in a table are sorted by row key. Table schemas are defined in terms of _column families_, and each column family can have any number of columns associated with it. Each column is a collection of _key-value_ pairs.

Below is a visual representation of a typical HBase table:

<p align="center">
  <img src="images/hbase-column-family.png" width=600>
  <figcaption align="center"><cite>HBase Column Family</cite></figcaption>
</p>

Notice that the flexible schema allows rows to have a varying number of columns, which relational databases don't allow. Moreover, columns don't have to appear in the same order or contain the same data in every row. For example, the first column of row `101` is `email`, while the first column of row `104` is `name`.
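
One way to picture this data model is as a nested map: row key → column family → column name → value. The sketch below is plain Python rather than the HBase client API, and the `info` family, the helper names and the sample values are all hypothetical.

```python
# Toy model of an HBase table as nested dictionaries (not the HBase API).
# Layout: row key -> column family -> column name -> value.
# The "info" family and every sample value are made up for illustration.
FAMILIES = ("info",)  # column families are fixed by the table schema

table = {
    "101": {"info": {"email": "a@example.com", "name": "Ann"}},
    "104": {"info": {"name": "Bob"}},  # this row has no "email" column
}


def put(table, row_key, family, column, value):
    """Store one cell; columns are created on demand, families are not."""
    if family not in FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get(table, row_key, family, column):
    """Return one cell value, or None when the cell does not exist."""
    return table.get(row_key, {}).get(family, {}).get(column)


def sorted_row_keys(table):
    """Row keys in sorted order, mirroring how a table is ordered by row key."""
    return sorted(table)


put(table, "102", "info", "phone", "555-0100")  # a column no other row has
row_keys = sorted_row_keys(table)
```

Each row carries only the columns it actually uses, which is why a missing cell is simply absent rather than stored as a null.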

_Regions_ are the units that hold the actual data: each region is a contiguous range of a table's rows. The data stored in a region consists of all the rows between the start key and the end key assigned to that region. In practice, the size of a region is usually between 5 GB and 20 GB.
 
_Region servers_ are the machines that host regions, keep track of the data in the regions under their supervision, and coordinate reads and writes. One region server is usually responsible for many regions. As a best practice, the number of regions per region server should be between 20 and 200 (although going above 200 is possible). See [here](https://www.cloudaeon.co.uk/regions-in-hbase.html) for more details.
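
To make the start key / end key idea concrete, here is a minimal sketch of how a row key could be mapped to a region and then to the region server hosting it. The region boundaries, region names and server names are hypothetical, and the binary search is only an illustration of range lookup, not HBase's actual lookup code.

```python
from bisect import bisect_right

# Hypothetical layout: region i covers [START_KEYS[i], START_KEYS[i + 1]).
# An empty start key means "from the beginning of the table".
START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]

# Hypothetical assignment: one region server hosts several regions.
REGION_TO_SERVER = {
    "region-1": "rs-a",
    "region-2": "rs-a",
    "region-3": "rs-b",
    "region-4": "rs-b",
}


def find_region(row_key):
    """Return the region whose key range contains row_key."""
    # bisect_right gives the first start key greater than row_key,
    # so the region just before that position owns the key.
    index = bisect_right(START_KEYS, row_key) - 1
    return REGION_NAMES[index]


def find_server(row_key):
    """Return the region server responsible for row_key."""
    return REGION_TO_SERVER[find_region(row_key)]
```

For example, `find_region("alice")` falls in the first range, while `find_region("g")` lands in the second because start keys are inclusive.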

<p align="center">
  <img src="images/hbase-architecture.png" width=600>
  <figcaption align="center"><cite>HBase Architecture</cite></figcaption>
</p>

Overall, in HBase:
- Region - a contiguous "range" of a table's rows. Tables are split into regions, which usually store their data in HDFS.
- Region server - the server that communicates with clients and oversees a group of regions. It coordinates all read/write requests to the regions under its command.
- Table - a collection of rows
- Row - all of the key-value pairs for a record, which may be spread across several column families
- Column - the key-value pairs that share the same key (the column name) across different rows
- Column family - a collection of columns


## HBase Components

> HBase consists of 3 main components: HMaster, the Region Server and ZooKeeper.

_Note: for most daily tasks, data engineers don't have to deal directly with the HBase components described below, as HBase abstracts most of this away._

### 1. HMaster

HMaster is the master server in HBase. It mainly handles region assignment, load balancing and cluster-wide operations. More specifically, the main responsibilities of the master include:

-   Assigning regions to the region servers with help from Apache ZooKeeper
-   Handling load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
-   Being responsible for schema changes and other metadata operations such as creation of tables and column families

For metadata-related information and for any schema changes, the client contacts the _HMaster_.
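
The load-balancing responsibility can be illustrated with a deliberately simple greedy routine that moves regions from the busiest server to the least busy one. This is a sketch under the assumption that "load" is just the number of regions per server; the function and server names are hypothetical, and HBase's real balancer takes more factors into account.

```python
def balance(assignments):
    """Even out region counts across servers.

    assignments maps a server name to the list of regions it hosts.
    Regions are moved one at a time from the busiest server to the
    least busy one until the counts differ by at most one.
    Returns the list of (region, from_server, to_server) moves.
    """
    moves = []
    while True:
        busiest = max(assignments, key=lambda s: len(assignments[s]))
        idlest = min(assignments, key=lambda s: len(assignments[s]))
        if len(assignments[busiest]) - len(assignments[idlest]) <= 1:
            return moves
        region = assignments[busiest].pop()
        assignments[idlest].append(region)
        moves.append((region, busiest, idlest))


# Hypothetical cluster: one overloaded server and two lightly loaded ones.
cluster = {
    "rs-a": ["r1", "r2", "r3", "r4", "r5"],
    "rs-b": ["r6"],
    "rs-c": [],
}
moves = balance(cluster)
region_counts = {server: len(regions) for server, regions in cluster.items()}
```

After the call, each of the three servers in this example hosts two regions.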

### 2. Region Server

HBase tables are divided horizontally into regions, each of which contains a contiguous range of row keys. Regions are simply pieces of HBase tables, split up and spread across the machines of the cluster, which are called region servers. This split improves performance and data reliability. Region servers typically run on the same machines as Hadoop's HDFS data nodes. They are essentially the worker nodes which handle read, write, update and delete requests from the various clients.

Region servers manage the regions under their control. They:
-   Communicate with the client and handle data-related operations
-   Handle read and write requests
-   Decide the size of the region by following the region size thresholds

For read and write operations, the client communicates directly with the region server, which then coordinates the reads and writes.
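
The idea of a region size threshold can be sketched as follows: once a region grows past the threshold, it is split at a middle row key into two smaller regions. This is a toy model, with the threshold counted in rows rather than bytes and with hypothetical names throughout; it is not HBase's actual split policy.

```python
from dataclasses import dataclass, field

MAX_REGION_ROWS = 4  # hypothetical threshold; real thresholds are sizes in bytes


@dataclass
class Region:
    start_key: str  # inclusive
    end_key: str    # exclusive; "" means "up to the end of the table"
    rows: dict = field(default_factory=dict)


def split_if_needed(region):
    """Return [region] if it is small enough, else two daughter regions."""
    if len(region.rows) <= MAX_REGION_ROWS:
        return [region]
    keys = sorted(region.rows)
    split_key = keys[len(keys) // 2]  # middle row key becomes the boundary
    left = Region(
        region.start_key,
        split_key,
        {k: v for k, v in region.rows.items() if k < split_key},
    )
    right = Region(
        split_key,
        region.end_key,
        {k: v for k, v in region.rows.items() if k >= split_key},
    )
    return [left, right]


# A region holding six rows exceeds the four-row threshold and is split.
big = Region("", "", {key: f"value-{key}" for key in "abcdef"})
daughters = split_if_needed(big)
```

The two daughters together cover exactly the key range of the original region, so a range lookup by row key keeps working after the split.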

### 3. ZooKeeper

ZooKeeper is an open-source Apache project that provides services such as maintaining configuration information, naming, and distributed synchronization. HBase uses it as a distributed coordination service that keeps track of the cluster's state, so that the master, the region servers and the clients can find one another.

Some of the main tasks include:
-   Discovering available servers
-   Tracking server failures so that the regions of a failed server can be reassigned
-   Remembering where key cluster metadata is stored
-   Helping clients locate the region servers they need to talk to

By default, HBase manages its own ZooKeeper instance, so you rarely need to administer ZooKeeper directly.
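
Server discovery and failure tracking can be pictured as a registry in which each server must keep refreshing its entry, and entries that go stale are treated as failures. The sketch below is a plain Python simulation with hypothetical names and explicit timestamps; it does not use the ZooKeeper client API and only approximates the session-timeout idea.

```python
class LivenessTracker:
    """Toy registry of servers that must heartbeat to stay 'alive'."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        """Register a server, or refresh an existing registration."""
        self.last_heartbeat[server] = now

    def live_servers(self, now):
        """Servers whose last heartbeat is within the session timeout."""
        return sorted(
            server
            for server, seen in self.last_heartbeat.items()
            if now - seen <= self.session_timeout
        )

    def expire(self, now):
        """Remove timed-out servers and return their names."""
        dead = sorted(
            server
            for server, seen in self.last_heartbeat.items()
            if now - seen > self.session_timeout
        )
        for server in dead:
            del self.last_heartbeat[server]
        return dead


tracker = LivenessTracker(session_timeout=30)
tracker.heartbeat("rs-a", now=0)
tracker.heartbeat("rs-b", now=0)
tracker.heartbeat("rs-a", now=25)   # rs-a keeps refreshing, rs-b goes silent
alive = tracker.live_servers(now=40)
failed = tracker.expire(now=40)
```

In this run `rs-b` is reported as failed, which is the point at which a master could reassign its regions to other servers.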

## HBase Data Redundancy

> To help ensure that the stored data is not lost if the node storing it crashes, most NoSQL tools store replicas of the same data.

The vast majority of modern big data tools, such as Hadoop and NoSQL data stores, replicate data on separate, physically isolated computer nodes, perhaps in different geographic locations, to ensure resiliency to system failures. The cost is that you pay for storing redundant copies of data that already exists elsewhere.

It is standard to store 3 replicas of the same data to ensure durability. For instance, Hadoop's HDFS uses triple replication by default, and the number of replicas can be changed in its configuration files (the `dfs.replication` property). In HBase, data replication is configured at column family granularity. It is also possible to replicate entire HBase clusters, not just specific tables. For a more detailed explanation of HBase replication, [check this link](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_bdr_hbase_replication.html).
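
The storage cost of replication is simple arithmetic, sketched below for a hypothetical 10 TB dataset. The function names and the dataset size are assumptions made for illustration; only the default factor of 3 comes from the text above.

```python
def raw_storage_tb(logical_tb, replication_factor=3):
    """Disk space used when every piece of data is stored replication_factor times."""
    return logical_tb * replication_factor


def replicas_that_can_be_lost(replication_factor=3):
    """Copies that can disappear while at least one copy still survives."""
    return replication_factor - 1


logical_tb = 10                       # hypothetical dataset size
raw_tb = raw_storage_tb(logical_tb)   # space actually consumed on disk
overhead_tb = raw_tb - logical_tb     # space spent purely on redundancy
```

With triple replication, the 10 TB dataset occupies 30 TB of raw disk, of which 20 TB is redundancy, in exchange for surviving the loss of two copies.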

## HBase Features

Below are the main features provided by HBase:

- HBase is built for low latency operations
- HBase provides fast random read operations. It achieves this by keeping data sorted by row key and indexing the files it stores in HDFS, so a lookup does not have to scan the whole table.
- HBase can store large amounts of data easily (terabytes and even petabytes) as clusters can be scaled up and down as required
- Automatic and configurable sharding (division) of tables
- Automatic failover support between region servers
- Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
- Easy to use API for client access
- Supports real-time querying efficiently

It's also important to note what HBase is __not__:

-   It's not a SQL database and doesn't store data using the relational model
-   It's not designed for Online Transaction Processing (OLTP)
-   It doesn't provide typical database features like full ACID (atomicity, consistency, isolation and durability) transactions across multiple rows, or data normalization
-   It's not designed to be used with small datasets - that would be overkill
-   It doesn't index arbitrary columns out of the box: data is referenced _only_ using the row key, like in a key-value data store
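
Row-key-only access can be pictured as a store that keeps its keys sorted and supports two operations: fetching one row by its key, and scanning a contiguous key range. The class below is a hypothetical in-memory sketch in plain Python, not the HBase client API, and the `user#...` keys are made up.

```python
from bisect import bisect_left, insort


class SortedStore:
    """Toy store that is addressable only by row key."""

    def __init__(self):
        self._keys = []    # row keys, always kept sorted
        self._values = {}  # row key -> row data

    def put(self, row_key, value):
        if row_key not in self._values:
            insort(self._keys, row_key)  # keep the key list sorted
        self._values[row_key] = value

    def get(self, row_key):
        """Random read of a single row by its key."""
        return self._values.get(row_key)

    def scan(self, start_key, stop_key):
        """Rows with start_key <= row key < stop_key, in key order."""
        lo = bisect_left(self._keys, start_key)
        hi = bisect_left(self._keys, stop_key)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


store = SortedStore()
for key, name in [("user#010", "Cy"), ("user#001", "Ann"), ("user#002", "Bob")]:
    store.put(key, name)

one_row = store.get("user#002")
key_range = store.scan("user#001", "user#010")  # stop key is exclusive
```

Anything that cannot be expressed as a key lookup or a key-range scan, such as finding a row by name here, would require reading every row, which is why row key design matters so much.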

## Key Takeaways

- HBase is a modern tool for storing and analyzing big data in tables. It does so using a column-oriented approach. This should not be confused with the row-oriented approach that traditional relational databases use.
- The intersection of a row and column in a table is called a _cell_.  Cells store data, which in turn is accessed using a unique ID called the _row key_.
- Related columns in HBase are grouped together into _column families_. An HBase table can have more than one column family. 
- HBase's architecture is composed of 3 main components: _HMaster_ (which acts as the master server), _Region Servers_ (the nodes that serve the regions of each table), and _ZooKeeper_ (which coordinates the various administrative tasks).
- HBase is designed to efficiently handle unstructured and semi-structured data using low-latency operations. The tool is easy to scale and supports both batch and real-time querying of data.