# NoSQL - HBase

## What is HBase?

> HBase is an open-source, column-oriented distributed data store that typically runs in a Hadoop environment.

Although it can handle structured data, HBase is designed mainly to store the semi-structured and unstructured data that traditional relational databases handle poorly.

HBase can store and interact with massive amounts of data (terabytes to petabytes) which are stored in _table_ structures. The tables present in HBase can consist of billions of rows having millions of columns. HBase is built for low latency operations, which provides benefits compared to traditional relational models.

For an introductory video on HBase, check out this Huawei lecture:
- [Introduction to HBase](https://www.youtube.com/embed/VUkPIT97J9A)



## HBase Architecture

> HBase can run independently in a stand-alone mode, or in a distributed environment running on top of Hadoop's HDFS (in a pseudo-distributed or fully distributed mode).

HBase's flexible architecture allows it to be installed in 3 different modes:
### Stand-alone mode 
  - This is mainly used for testing and proof of concept purposes
  - Data will be stored on the local disk storage


### Fully-distributed mode  
  - In this mode, the 3 HBase components (which we'll explain shortly) run on separate computer nodes
  - In global companies using large-scale production environments, HBase is normally integrated with Hadoop to leverage HDFS as the back-end storage repository
  - This enables massive scaling and strong fault-tolerance

### Pseudo-distributed mode
  - In this mode, the 3 HBase components run as separate processes but on a _single_ machine/node
  - HBase can store its data either on the local filesystem or in HDFS (this tutorial uses HDFS)
  - This mode is meant for testing and prototyping rather than for production use

Although HBase can run on top of other storage systems such as Amazon S3, HDFS is the popular choice because of its low cost, fault tolerance and scalability.


The main storage entity in HBase is a _table_, which consists of rows and columns. The intersection of a row and a column is called a _cell_, which stores data. Rows are kept sorted by their row key. Table schemas are defined in terms of _column families_, and each column family can have any number of columns associated with it. Each column is stored as a collection of _key-value_ pairs.

Below is a visual representation of a typical HBase table:

<p align="center">
  <img src="images/hbase-column-family.png" width=600>
</p>

Notice that the flexible schema allows rows to have a varying number of columns, which relational databases don't allow. Moreover, columns don't always have to be in the exact same order nor contain the exact same data. For example, for row `101`, the first column is `email`, while the first column for row `104` is `name`.
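
One way to picture this flexible schema is as a nested map. The sketch below is plain Python rather than HBase client code. The row keys `101` and `104` come from the figure, while the column family name `personal` and all cell values are made up for illustration.

```python
# A toy model of an HBase table: row key -> column family -> column -> value.
# The family name "personal" and every value here are hypothetical.
table = {
    "101": {"personal": {"email": "a@example.com", "name": "Ann"}},
    "104": {"personal": {"name": "Bob"}},  # this row simply has no email column
}

def get_cell(table, row_key, family, column):
    """Return the value stored in one cell, or None if the cell does not exist."""
    return table.get(row_key, {}).get(family, {}).get(column)

def columns_of(table, row_key):
    """List the 'family:column' names that actually exist for a row."""
    return sorted(
        f"{family}:{column}"
        for family, cols in table.get(row_key, {}).items()
        for column in cols
    )

print(columns_of(table, "101"))
print(columns_of(table, "104"))
print(get_cell(table, "104", "personal", "email"))
```

A missing cell takes up no space in this model, which is roughly why sparse rows are cheap in a column-family store.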

_Regions_ are the units that hold the actual data. Each region is a contiguous range of a table's rows: all the rows between the start key and the end key assigned to that region. In practice, the size of a region is usually between 5 GB and 20 GB.

_Region servers_ are the machines that host regions, keep track of the data in the regions under their supervision and coordinate reads/writes. One region server is usually responsible for many regions. As a best practice, the number of regions per region server should be between 20 and 200 (although going above 200 is possible). See [here](https://www.cloudaeon.co.uk/regions-in-hbase.html) for more details.

<p align="center">
  <img src="images/hbase-architecture.png" width=600>
</p>

Overall, in HBase:
- Regions - tables are split into regions, with each region storing a "range" of rows. The data itself is usually stored in HDFS.
- Region server - this server communicates with the clients of the system and oversees a group of regions. It coordinates all read/write requests to the regions under its command.
- Table - a collection of rows
- Row - all of the key-value pairs for a record, which may be spread across several column families
- Column - a collection of key-value pairs that share the same key across different rows
- Column family - a collection of columns
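
To make the idea of a region as a "range" of rows more concrete, here is a minimal Python sketch of how a row key could be matched to the region whose key range contains it. The split points and region names are hypothetical, and this illustrates the lookup idea only, not HBase's actual implementation.

```python
from bisect import bisect_right

# Hypothetical split points: each region holds the row keys in [start, next start).
# The first region starts at the empty key, so it covers everything below "g".
region_start_keys = ["", "g", "n", "t"]
region_names = ["region-1", "region-2", "region-3", "region-4"]

def find_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    index = bisect_right(region_start_keys, row_key) - 1
    return region_names[index]

for key in ["apple", "g", "melon", "zebra"]:
    print(key, "->", find_region(key))
```

Because rows are sorted by row key, a binary search over the start keys is enough to find the right region.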


## HBase Components

> HBase consists of 3 main components: HMaster, the Region Server and Zookeeper

_Note: For most daily tasks, data engineers don't have to deal directly with the HBase components described below, as most of this is abstracted away by HBase._

### 1. HMaster

HMaster is the master server in HBase. Mainly, the master handles task assignment, load balancing and cluster operations. To be more specific, the main responsibilities of the master include:

-   Assigning regions to the region servers with help from Apache ZooKeeper
-   Handling load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
-   Being responsible for schema changes and other metadata operations such as creation of tables and column families

For information related to metadata and for performing any schema changes, the client contacts the _HMaster_.

### 2. Region Server

HBase tables are divided horizontally into regions, each of which contains a contiguous group of row keys. These regions are spread across the machines of the cluster, which are called region servers. This split improves performance and data reliability. Region servers typically run on the same machines as Hadoop's HDFS data nodes. They are essentially the worker nodes which handle read, write, update and delete requests from the various clients.

Region servers manage the regions under their control and:
-   Communicate with the client and handle data-related operations
-   Handle read and write requests
-   Decide the size of the region by following the region size thresholds

For read and write operations, the client communicates directly with the region server, which then coordinates the reads and writes.

### 3. Zookeeper

Zookeeper is an open-source Apache project that provides services like maintaining configuration information, server/host naming, and distributed synchronization. In HBase it acts as the distributed coordination service through which the HMaster, the region servers and the clients keep track of one another.

Some of the main tasks include:
-   Discovering available servers
-   Tracking server failures so that the HMaster can reassign their regions
-   Keeping track of which region server hosts the catalog table (`hbase:meta`)
-   Helping clients locate the region servers they need to talk to

In the setup used in this tutorial, HBase starts and manages its own Zookeeper instance.

## HBase data redundancy

> To help ensure that the stored data is not lost if the node storing it crashes, most NoSQL tools store replicas of the same data

The vast majority of modern big data tools, such as Hadoop and NoSQL data stores, replicate data on separate, physically isolated computer nodes, perhaps in different geographic locations, to ensure resiliency to system failures. The cost of this is that you have to pay for storing redundant copies of data which exists elsewhere.

It is standard to store 3 replicas of the same data to ensure durability. For instance, Hadoop's HDFS is configured to have triple replication by default. The number of replicas can be changed using the parameters available in the configuration files. In HBase, data replication is configured at column family granularity. It is also possible to replicate entire HBase clusters, not just specific tables. For a more detailed explanation of HBase replication, [check this link](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_bdr_hbase_replication.html).
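
The storage cost of replication is simple arithmetic. The sketch below is a rough estimate that deliberately ignores compression and filesystem overhead; the function name and the 10 TB figure are just for illustration.

```python
def raw_storage_needed(logical_tb, replication_factor=3):
    """Raw disk capacity (in TB) used when every block is stored replication_factor times."""
    return logical_tb * replication_factor

# 10 TB of data with the default triple replication occupies 30 TB of raw disk.
print(raw_storage_needed(10))
# The same data with a replication factor of 1 (as in a single-node setup) occupies 10 TB.
print(raw_storage_needed(10, replication_factor=1))
```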

## HBase Features

Below are the main features provided by Hbase:

- HBase is built for low latency operations
- HBase provides fast random read operations. It does so by keeping rows sorted by row key and by indexing the data files it stores in HDFS.
- HBase can store large amounts of data easily (terabytes and even petabytes) as clusters can be scaled up and down as required
- Automatic and configurable sharding (division) of tables
- Automatic failover support between region servers
- Convenient base classes available for backing Hadoop MapReduce jobs in HBase tables
- Easy to use API for client access
- Supports real-time querying efficiently

It's also important to note what HBase is __not__:

-   It's not a SQL database and doesn't store data using the relational model
-   It's not designed for Online Transaction Processing (OLTP)
-   It doesn't provide typical database features like multi-row ACID (atomicity, consistency, isolation and durability) transactions or data normalization
-   It's not designed to be used with small datasets - that would be overkill
-   Data is referenced _only_ using the row key, like in a key-value data store

## Download and install HBase

> It's highly recommended to use a Linux system as HBase is known to cause issues when used with Windows

Now that we have a high-level understanding of what HBase is, let's download and install HBase to get a better feel of how it operates. 

At a high-level, to get HBase working on your system, the steps involve:

- Downloading and installing Hadoop (Hadoop's file system is required for HBase in pseudo-distributed mode, but is not required if HBase is run in standalone mode)
    -   _Note: Refer to the Hadoop notebook for detailed instructions on how to implement this step._
    <p></p>
    
- Configuring the following files:
    -   `.bashrc`
    -   `hadoop-env.sh`
    -   `core-site.xml`
    -   `hdfs-site.xml`
    -   `mapred-site.xml`
    -   `yarn-site.xml`
    <p></p>

- Downloading and installing HBase in pseudo-distributed mode (to leverage HDFS for data storage)
    <p></p>
    
- Configuring the following HBase files:
    -   `hbase-env.sh`
    -   `hbase-site.xml`

_Note: We'll be using Linux (Ubuntu) for this tutorial, so if you're on a different operating system the steps might differ._

Let's begin:

#### 1. Open your terminal and run the following commands to update all existing applications:

In [None]:
sudo apt update
sudo apt -y upgrade
# sudo reboot # if you run into issues, try restarting, which is what this command will do

#### 2. Ensure that __Java__ is installed on your system. 

_Note: It is highly recommended to use `Java 8` as this is the version fully compatible with both Hadoop and HBase. Otherwise, we may face some errors._

To do this, run the following command:

In [None]:
# Check the Java version
java -version

# Check the Java compiler version
javac -version

If Java is installed correctly, you'll see an output showing the Java and compiler version. Otherwise, you will need to download and install Java and the Java compiler. 

For detailed steps on how to do this, please refer to the Hadoop notebook.

#### 3. Next, we need to ensure that the `JAVA_HOME` variable is correctly set up on your system.  To do that, run the below command:

In [None]:
echo $JAVA_HOME

If the variable is correctly set up, you should see a path show up similar to the one below:

In [None]:
echo $JAVA_HOME

# Expected output should be similar to:
/usr/lib/jvm/java-8-openjdk-amd64

If you get no output, or if the output is simply repeating JAVA_HOME, then the variable is not set up correctly.

To fix this, refer to the Hadoop notebook for detailed steps.

#### 4. Now that Hadoop has been successfully installed, our next objective is to download and set up HBase.

> Remember that we'll be using version 1.7.1 as it is compatible with both Hadoop and Java 8.

Recall that HBase is composed of 3 components:
-   HMaster (assigns regions to region servers and handles admin functions)
-   Region Server (serves the reads and writes for the regions it hosts)
-   Zookeeper (coordinates the servers in the cluster)

We need to see all 3 of these components running correctly to be able to use HBase.

Let's start by downloading the HBase installation files:

In [None]:
# Set the version variable
VERSION="1.7.1"
# Download that HBase release
# (if this URL returns a 404, older releases are kept at https://archive.apache.org/dist/hbase/)
wget https://dlcdn.apache.org/hbase/${VERSION}/hbase-${VERSION}-bin.tar.gz

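#### 5. Extract the downloaded archive and move it to `/usr/local/hbase` (create that folder if it's not already there):

In [None]:
tar -xzvf hbase-${VERSION}-bin.tar.gz
# Create the target folder if needed, then move the extracted files into it
sudo mkdir -p /usr/local/hbase
sudo mv hbase-${VERSION}/ /usr/local/hbase/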

#### 6. We need to set your `HBASE_HOME` variable similar to what we did with `JAVA_HOME`.  To do this, copy the path of your HBase folder and open the `.bashrc` file to add the variable:

In [None]:
nano ~/.bashrc

Add the below lines (use your HBase folder if the path is different), then run `source ~/.bashrc` to apply them:

In [None]:
export HBASE_HOME=/usr/local/hbase/hbase-1.7.1
export PATH=$PATH:$HBASE_HOME/bin

#### 7. Next, we need to update the HBase environment configuration file, which contains the configurable parameters for the HBase environment.

For running HBase in pseudo-distributed mode, we need to set 3 properties within this file:
-   `JAVA_HOME`
-   `HBASE_MANAGES_ZK`
-   `HBASE_REGIONSERVERS`

The file is located in the `conf` folder within the location we unpacked HBase into, which should be `/usr/local/hbase/hbase-1.7.1/`.

Go ahead and open the `hbase-env.sh` file and uncomment/add the below settings:

In [None]:
sudo nano /usr/local/hbase/hbase-1.7.1/conf/hbase-env.sh

In [None]:
# Add/uncomment the below settings
export HBASE_MANAGES_ZK=true
export HBASE_REGIONSERVERS=/usr/local/hbase/hbase-1.7.1/conf/regionservers
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

#### 8. Before the next step, we should create a Zookeeper data folder first. Place a nested folder called `data/zookeeper` inside the `/usr/local/hbase` folder (not inside the `conf` folder), so that its path is `/usr/local/hbase/data/zookeeper`.

The folder structure should look like this:

<p align="center">
  <img src="images/hbase-data-zookeeper.png" width=400>
</p>



#### 9. Then, we will need to update another file inside the same `conf` folder called the `hbase-site.xml` file.  
Add the following between the `<configuration>` tags:

In [None]:
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>./tmp</value>
</property>
<property>
  <name>hbase.unsafe.stream.capability.enforce</name>
  <value>false</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:9000/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/hbase/data/zookeeper</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>


Here is what each of these parameters does:

- `hbase.cluster.distributed`
    -   Set to `true` to run HBase in distributed mode (pseudo-distributed or fully distributed), or to `false` to run it in stand-alone mode.

- `hbase.tmp.dir`
    -   A temporary directory on the local filesystem (not in HDFS)

- `hbase.unsafe.stream.capability.enforce`
    -   Controls whether or not HBase will check that the underlying filesystem supports the stream capabilities it relies on
    -   These capabilities are `hflush` and `hsync`, which HBase uses when flushing data to help guarantee durability. See more [here](https://hadoop-hbase.blogspot.com/2012/05/hbase-hdfs-and-durable-sync.html)

- `hbase.rootdir`
    -   Specifies the root directory (here in HDFS) where HBase stores its data

- `hbase.zookeeper.property.dataDir`
    -   Tells Zookeeper where to store its data files
    
- `hbase.zookeeper.quorum`
    -   The comma-separated list of hosts running Zookeeper, which clients and HBase servers connect to

- `dfs.replication`
    -   The replication factor for HDFS data (this should match the Hadoop settings we configured earlier)
    
- `hbase.zookeeper.property.clientPort`
    -   Tells Zookeeper which port it should use for communication
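
When a setting does not seem to take effect, it can help to list what a configuration file actually contains. The sketch below uses Python's standard `xml.etree.ElementTree` module on a shortened, inlined copy of the configuration above. The helper name `read_properties` is hypothetical, and in practice you would read the text from your own `hbase-site.xml`.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the configuration above, inlined so the sketch is self-contained.
SAMPLE = """
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
"""

def read_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop-style configuration as a dict."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.findall("property")
    }

settings = read_properties(SAMPLE)
for name, value in settings.items():
    print(f"{name} = {value}")
```

Parsing the file this way also catches malformed XML early, since `ET.fromstring` raises an error on unbalanced tags.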

#### 10. Start all the Hadoop daemons first by running the following commands:

In [None]:
cd $HADOOP_HOME/sbin
bash start-dfs.sh
bash start-yarn.sh

If everything runs smoothly, you should see output similar to the below:

<p align="center">
  <img src="images/start-dfs-yarn.png" width=600>
</p>

If everything looks good, we'll stop the services and grant the Hadoop user access to HBase. Run the below commands:

In [None]:
# Stop all running Hadoop processes
stop-all.sh

# Change to the directory that contains both the hadoop and hbase folders
cd /usr/local

# Change the owner of the hadoop directory from root to the Hadoop account
sudo chown -R hadoop:root hadoop

# Allow read and execute access to all users and write access for the new owner
sudo chmod -R 755 hadoop

# Change the owner of the HBase directory from root to the Hadoop account
sudo chown -R hadoop:root hbase

# Allow read and execute access to all users and write access for the new owner
sudo chmod -R 755 hbase

#### 11. Test HDFS to make sure everything is working smoothly.

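To do this, make sure the Hadoop daemons are running again (repeat the start commands from step 10 if you stopped them), then create a `test` directory using the below commands: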

In [None]:
hadoop fs -mkdir /test
hadoop fs -ls /

_Note: The Hadoop file system (HDFS) is _not_ the same as the local file system. In reality, HDFS will be hosted on multiple servers across a distributed network._

The output should be:

In [None]:
hadoop fs -ls /

# Expected output
2021-12-22 12:27:34,309 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - hadoop supergroup      	0 2021-12-22 12:26 /test
hadoop@dodz-vm:/usr/local/hadoop/hadoop-3.3.1/etc/hadoop$


#### 12. Next, we'll start HBase by running the `start-hbase.sh` script. This starts the 3 components we mentioned earlier: HMaster, the region server, and Zookeeper.

In [None]:
bash $HBASE_HOME/bin/start-hbase.sh

_Note: If a `permission denied` error shows up, we may need to grant the Hadoop user access to the HBase folders.  To do this, run the below commands (using your corresponding HBase and Zookeeper paths)_

In [None]:
sudo chmod -R 755 /usr/local/hbase/*

sudo chown -R hadoop:hadoop /usr/local/hbase/

sudo chmod -R 755 /usr/local/hbase/data/zookeeper/*

sudo chown -R hadoop:hadoop /usr/local/hbase/data/zookeeper

To ensure all the proper HBase processes are running, run the below Linux command which shows the status of all active Java processes:

In [None]:
jps

The expected output should include all of the below processes (3 for HBase and 5 for Hadoop plus jps itself):

In [None]:
jps

# Expected output
9298 HMaster
5652 SecondaryNameNode
5286 NameNode
9238 HQuorumPeer
9399 HRegionServer
5784 ResourceManager
6684 DataNode
5918 NodeManager
9486 Jps
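
Comparing this list by eye gets tedious if you restart the services often. As a small convenience, the Python sketch below checks a `jps` listing for the eight daemons named above. The helper name `missing_daemons` is hypothetical, and the sample text is the expected output from this step.

```python
# Daemon names as they appear in the expected jps output above.
REQUIRED = {
    "HMaster", "HRegionServer", "HQuorumPeer",     # HBase
    "NameNode", "DataNode", "SecondaryNameNode",   # HDFS
    "ResourceManager", "NodeManager",              # YARN
}

def missing_daemons(jps_output):
    """Return the required daemons that do not appear in the given jps output."""
    running = set()
    for line in jps_output.splitlines():
        fields = line.split()
        if len(fields) == 2:  # each line looks like "<pid> <name>"
            running.add(fields[1])
    return sorted(REQUIRED - running)

sample = """9298 HMaster
5652 SecondaryNameNode
5286 NameNode
9238 HQuorumPeer
9399 HRegionServer
5784 ResourceManager
6684 DataNode
5918 NodeManager
9486 Jps"""

print(missing_daemons(sample))
```

An empty list means every expected process was found. On a real machine the text could be captured with `subprocess.run(["jps"], capture_output=True, text=True).stdout`.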

Once the required processes run, we now need to run the HBase shell to ensure that we can start interacting with HBase.

To do this, run the below command:

In [None]:
hbase shell

You should now be inside the HBase shell as we can see below:  

<p align="center">
  <img src="images/hbase-shell.png" width=600>
</p>

Now we are inside HBase and can begin to use its commands.

Try to run the `status` command to ensure HBase is working successfully.  This command shows a summary of the cluster: the number of active and backup masters, live and dead region servers, and the average load.

In [None]:
status

The output should be something like:

In [None]:
hbase(main):001:0> status

#Expected output
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load

hbase(main):002:0>


If you get an error saying that HMaster is not running, double-check the `/etc/hosts` file to ensure the VM's hostname and `localhost` both map to the same IP (127.0.0.1):

In [None]:
sudo nano /etc/hosts

Nice! We've successfully set up HBase. 


### Importing data into HBase

#### 1. The first step is to download the Retail data CSV file

You can [download it from here](https://aicore-files.s3.amazonaws.com/Data-Eng/retail.csv)


#### 2. The next step is to import the Retail data file into HBase.

To do that, we first need to create a new HBase table and specify the column family. To do this, type the below command from inside the `hbase shell`:

In [None]:
create 'retail_table',{NAME => 'cf'}

To check the table was created successfully, run the `list` command to see all available HBase tables:

In [None]:
list

#Expected output:
hbase(main):002:0> list
retail_table                                                                         	 
1 row(s) in 0.3180 seconds


Once the table is created, we need to run the below command to copy the CSV file to HDFS so we can import it into HBase:

_Note: Ensure you are using your folder path where you saved the `retail.csv` file_

In [None]:
hadoop fs -mkdir -p /data  # create the target directory in HDFS first
hadoop fs -put /YOURPATH/retail.csv /data

Now, to check that the file has been properly copied to HDFS, type the below command:

In [None]:
hadoop fs -ls /data

You should see output like:

In [None]:
hadoop fs -ls /data

# Expected output
22/01/26 17:20:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup   45580638 2022-01-26 17:20 /data/retail.csv


Finally, we need to load the `retail.csv` file into HBase. To do this, run the below command from the __terminal__:

In [None]:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,cf:description,cf:quantity,cf:price,cf:customer,cf:country retail_table /data/retail.csv

_Note: You could of course write code to generate the column names if you have too many to write out by hand._
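
As a minimal sketch of that idea, the Python function below builds the value for `-Dimporttsv.columns` from a list of field names. The function name `importtsv_columns` and the field name `id` are hypothetical, while the remaining names and the `cf` family match the command above.

```python
def importtsv_columns(field_names, family="cf", row_key_field=None):
    """Build the value for -Dimporttsv.columns from a list of CSV field names.

    The field used as the row key (the first one by default) is mapped to
    HBASE_ROW_KEY; every other field becomes '<family>:<field>'.
    """
    if row_key_field is None:
        row_key_field = field_names[0]
    parts = []
    for field in field_names:
        if field == row_key_field:
            parts.append("HBASE_ROW_KEY")
        else:
            parts.append(f"{family}:{field}")
    return ",".join(parts)

# "id" is a hypothetical name for the first CSV field; the rest match the command above.
fields = ["id", "description", "quantity", "price", "customer", "country"]
print(importtsv_columns(fields))
```

Pasting the printed value after `-Dimporttsv.columns=` reproduces the column mapping used in the ImportTsv command.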

_Note: If you get any errors, such as "Bad Lines" or "Failed Map", check that you didn't miss any characters from the above command and try typing it yourself instead of copying and pasting it._

If everything works smoothly, you should see output similar to:

In [None]:
2022-01-07 13:34:28,910 INFO  [main] mapreduce.Job: erations=0
   	 HDFS: Number of bytes read=756
   	 HDFS: Number of bytes written=0
   	 HDFS: Number of read operations=2
   	 HDFS: Number of large read operations=0
   	 HDFS: Number of write operations=0
    Job Counters
   	 Launched map tasks=1
   	 Data-local map tasks=1
   	 Total time spent by all maps in occupied slots (ms)=5154
   	 Total time spent by all reduces in occupied slots (ms)=0
   	 Total time spent by all map tasks (ms)=5154
   	 Total vcore-milliseconds taken by all map tasks=5154
   	 Total megabyte-milliseconds taken by all map tasks=5277696
    Map-Reduce Framework
   	 Map input records=15
   	 Map output records=15
   	 Input split bytes=104
   	 Spilled Records=0
   	 Failed Shuffles=0
   	 Merged Map outputs=0
   	 GC time elapsed (ms)=77
   	 CPU time spent (ms)=1600
   	 Physical memory (bytes) snapshot=183992320
   	 Virtual memory (bytes) snapshot=1874804736
   	 Total committed heap usage (bytes)=137953280
    ImportTsv
   	 Bad Lines=0
    File Input Format Counters
   	 Bytes Read=652
    File Output Format Counters
   	 Bytes Written=0

#### 3. Now, we need to go into the HBase shell and check that the data is correctly loaded. 

To do that, we'll use the `scan` command, which is similar to a SQL `SELECT`. It will scan over the entire table and retrieve the relevant data.

For example, the below code will return the _first 5 rows_ of the Retail table:

In [None]:
scan 'retail_table', {LIMIT => 5}

In [None]:
# Expected output:
Hbase::Table - retail_table
hbase(main):009:0> scan 'retail_table', {LIMIT => 5}
ROW                	COLUMN+CELL                                               	 
 1                 	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 1                 	column=cf:customer, timestamp=1643213704999, value=17850  	 
 1                 	column=cf:description, timestamp=1643213704999, value=WHITE HAN
                   	GING HEART T-LIGHT HOLDER                                 	 
 1                 	column=cf:price, timestamp=1643213704999, value=2.55      	 
 1                 	column=cf:quantity, timestamp=1643213704999, value=6      	 
 10                	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 10                	column=cf:customer, timestamp=1643213704999, value=13047  	 
 10                	column=cf:description, timestamp=1643213704999, value=ASSORTED
                   	COLOUR BIRD ORNAMENT                                      	 
 10                	column=cf:price, timestamp=1643213704999, value=1.69      	 
 10                	column=cf:quantity, timestamp=1643213704999, value=32     	 
 100               	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 100               	column=cf:customer, timestamp=1643213704999, value=14688  	 
 100               	column=cf:description, timestamp=1643213704999, value=60 TEATIM
                   	E FAIRY CAKE CASES                                        	 
 100               	column=cf:price, timestamp=1643213704999, value=0.55      	 
 100               	column=cf:quantity, timestamp=1643213704999, value=24     	 
 1000              	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 1000              	column=cf:customer, timestamp=1643213704999, value=14729  	 
 1000              	column=cf:description, timestamp=1643213704999, value=TOAST ITS
                    	- HAPPY BIRTHDAY                                         	 
 1000              	column=cf:price, timestamp=1643213704999, value=1.25      	 
 1000              	column=cf:quantity, timestamp=1643213704999, value=2      	 
 10000             	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 10000             	column=cf:customer, timestamp=1643213704999, value=13174  	 
 10000             	column=cf:description, timestamp=1643213704999, value=SET OF 2
                   	TINS VINTAGE BATHROOM                                     	 
 10000             	column=cf:price, timestamp=1643213704999, value=4.25      	 
 10000             	column=cf:quantity, timestamp=1643213704999, value=2      	 
5 row(s) in 0.1360 seconds


Take a detailed look at how the data is displayed in HBase, as it may seem confusing at first.  Unlike a relational database, which stores data row by row, HBase stores data using a __column-based__ approach.

Each line in the output represents a single column value (a cell) and includes an automatically assigned timestamp. The __ROW__ value is the unique row key that tells HBase how the columns are connected to each other (i.e. whether they are part of the same logical row or not).
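
You may have noticed that the first five rows returned are `1`, `10`, `100`, `1000` and `10000` rather than `1` to `5`. HBase orders row keys byte by byte, not numerically. The Python sketch below reproduces that ordering with byte strings. The zero-padding shown at the end is a common workaround and an assumption on our part, not something this tutorial's import does.

```python
numbers = [1, 2, 10, 100, 1000, 10000]

# Row keys stored as text are compared byte by byte, so b"10" sorts before b"2".
as_text = sorted(str(n).encode() for n in numbers)
print(as_text)

# Zero-padding to a fixed width makes the byte order match the numeric order.
padded = sorted(f"{n:05d}".encode() for n in numbers)
print(padded)
```

This is worth keeping in mind when designing row keys, since range scans follow the same byte order.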

### Querying data in HBase

You can, of course, run more complex queries.

To find out how many total rows we have in the table, we can use the `count` command as follows:

In [None]:
count 'retail_table'

The output should look like:

<p align="center">
  <img src="images/hbase-count.png" width=600>
</p>

To do more advanced querying using filters (similar to SQL's `WHERE` clause), we'll first need to import 3 HBase classes:

- `SingleColumnValueFilter`
- `CompareFilter`
- `BinaryComparator`
    
These 3 classes work together to provide flexible filtering criteria.

To achieve this, run the below 3 commands from inside the HBase shell:

In [None]:
# Import the required 3 classes 
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter 
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.BinaryComparator

The output should be similar to:

<p align="center">
  <img src="images/hbase-filter-import.png" width=600>
</p>

Now we can run queries with specific filters. First, let's query the table for all rows whose `country` is `United Kingdom`. The query has the following format:

In [None]:
scan 'retail_table', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('cf'), Bytes.toBytes('country'), CompareFilter::CompareOp.valueOf('EQUAL'),BinaryComparator.new(Bytes.toBytes('United Kingdom')))}

The output will look something like this:

<p align="center">
  <img src="images/hbase-scan-filter.png" width=600>
</p>

Next, let's run a query to check how many products have a `price` equal to `12.75`:

In [None]:
scan 'retail_table', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('cf'), Bytes.toBytes('price'), CompareFilter::CompareOp.valueOf('EQUAL'),BinaryComparator.new(Bytes.toBytes('12.75')))}

The output should be `826 products` as indicated by the number of rows seen below:

<p align="center">
  <img src="images/hbase-filter-price.png" width=600>
</p>

Using the same combination of filter classes, we can use the below comparison operators inside `CompareFilter::CompareOp.valueOf` to compare column values:

- `EQUAL`
- `GREATER`
- `GREATER_OR_EQUAL`
- `LESS`
- `LESS_OR_EQUAL`
- `NOT_EQUAL`
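
One caveat with the ordering operators: `BinaryComparator` compares raw bytes. Assuming the values were loaded as text (as in the import above), `GREATER` and `LESS` therefore follow byte order rather than numeric order. The Python sketch below mimics that comparison. The helper name `binary_greater` is hypothetical and the sample prices are made up.

```python
def binary_greater(stored_value, threshold):
    """Mimic a GREATER test on text values using byte-wise comparison."""
    return stored_value.encode() > threshold.encode()

# Compare a few made-up prices against the threshold "12.75".
for price in ["13.00", "9.5", "100.0"]:
    print(price, binary_greater(price, "12.75"))
```

Here `9.5` counts as greater than `12.75` and `100.0` does not, because only the bytes are compared. `EQUAL` and `NOT_EQUAL` are unaffected as long as the text matches exactly.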

### HBase commands:

Below are some of the typical commands you would be using to interact with data in HBase:

- `put`
    -   This command allows you to insert a new cell value or update the data in an already existing cell.

- `get`
    -   This command is used to read data from a table in HBase. It returns the values associated with one row of data at a time.

- `delete`
    -   This command allows you to delete a specific cell in an HBase table.

- `deleteall`
    -   This command deletes all of the cells in a given row.

- `scan`
    -   This command is used to view the data stored in an HBase table.

- `count`
    -   This command is used to count the number of rows of a table.

- `disable`
    -   This command disables (takes offline) a table so that it can be altered or dropped.

- `drop`
    -   This command deletes a disabled table.

- `truncate`
    -   This command does 3 things in sequence:
        -   Disables a table
        -   Drops a table
        -   Recreates the table with the same name


For a detailed explanation of HBase commands, check the following guide:
-    [HBase Cheat Sheet](https://sparkbyexamples.com/hbase/hbase-shell-commands-cheat-sheet/)


## Key Takeaways

- HBase is a modern tool for storing and analyzing big data in tables.  It does so using a column-oriented approach.  This should not be confused with the row-oriented approach that traditional relational databases use.
- The intersection of a row and column in a table is called a _cell_.  Cells store data, which in turn is accessed using a unique ID called the _row key_.
- Related columns in HBase are grouped together into _column families_. An HBase table can have more than one column family. 
- HBase's architecture is composed of 3 main components: _HMaster_ (which acts as the master server), _Region Servers_ (the worker nodes that host the regions of tables and serve reads and writes), and _Zookeeper_ (which coordinates the servers in the cluster).
- HBase is designed to efficiently handle unstructured and semi-structured data using low-latency operations.  The tool is easy to scale and supports batch and real-time querying of data.
- HBase can be installed in different modes including stand-alone (on a local machine), pseudo-distributed (all components as separate processes on a single machine) and fully distributed (across a cluster of machines).
- To download and install HBase in pseudo-distributed mode, we'll need to have a compatible Java and Hadoop version installed beforehand. Using a Linux operating system is highly recommended.
- Tables in HBase can be created using the `create` command.  Table querying can be done using `scan` and `get` commands, while inserting data can be done using the `put` command.
- In order to delete an HBase table, we first need to `disable` the table and then `drop` the disabled table.  Alternatively, to empty a table while keeping it, the `truncate` command disables, drops and recreates the table in one go.