# NoSQL - HBase

## What is HBase?

> HBase is an open-source, column-oriented distributed data store that typically runs in a Hadoop environment.

HBase is a tool used to store and access massive amounts of data. Although it can handle structured data, HBase is designed mainly to efficiently manage the semi-structured and unstructured data that a traditional relational database handles poorly. HBase is modeled after Google's Bigtable and is developed as an open-source project under the Apache Software Foundation.

Apache HBase is typically used for _real-time_ big data applications. It can store and interact with massive amounts of data (terabytes to petabytes) held in _table_ structures, and a single table can consist of billions of rows and millions of columns. HBase is built for low-latency operations, which is its main advantage over traditional relational models at this scale.

For an introductory video on HBase, check out this Huawei lecture:
- [Introduction to HBase](https://www.youtube.com/embed/VUkPIT97J9A)

### HBase Architecture

> HBase can run independently in a stand-alone mode, or in a distributed environment running on top of Hadoop's HDFS (in a pseudo-distributed or fully distributed mode).

HBase's flexible architecture allows the tool to be installed independently, in a mode called stand-alone. This is mainly used for testing and proof-of-concept purposes. In real production environments, however, HBase is normally integrated with Hadoop to leverage HDFS as the back-end storage repository.

The main storage entity in HBase is a _table_, which consists of rows and columns. The intersection of a row and a column is called a _cell_, which stores data. Rows are sorted by their row key. Table schemas are defined using __column families__, and each column family can have any number of columns associated with it. Each column is a collection of _key-value_ pairs.

To summarize, in HBase:

- Table - is a collection of rows
- Row - is a collection of column families
- Column family - is a collection of columns
- Column - is a collection of key-value pairs
- Regions - tables are split into regions, with each region storing a "range" of rows.  They store the data in HDFS.
- Region server - this server communicates with the user of the system and oversees a group of regions.  It coordinates all read/write data related requests to the regions under its command.
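
To make the hierarchy above more concrete, here is a minimal sketch that models a table as plain text with one cell per line: row key, `family:qualifier`, and value. The `TABLE` variable, the `|` separator and the `get_cell` helper are illustrative assumptions and not part of HBase, whose real storage format is different.

```shell
# Illustrative only: a tiny "table" stored as one cell per line.
# Fields are separated by "|": row key | column family:qualifier | value.
TABLE='7369|cf:ename|SMITH
7369|cf:sal|800
7499|cf:ename|ALLEN
7499|cf:sal|1600'

# get_cell ROW COLUMN: print the value stored at the given row and column.
get_cell() {
  printf '%s\n' "$TABLE" | awk -F'|' -v row="$1" -v col="$2" \
    '$1 == row && $2 == col { print $3 }'
}

get_cell 7369 cf:ename   # prints SMITH
get_cell 7499 cf:sal     # prints 1600
```

Each line behaves like a key-value pair whose key is the combination of row key and column, which is roughly how a cell is addressed.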

Below is a visual representation of a typical HBase table:

<p align="center">
  <img src="images/hbase-table-schema.png" width=600>
</p>


HBase consists of 3 main components:

<p align="center">
  <img src="images/hbase-architecture.png" width=600>
</p>


__1. HMaster__

HMaster represents the master server in HBase. The master mainly handles region assignment, load balancing and cluster operations. More specifically, its main responsibilities include:

-   Assigns regions to the region servers (uses help from Apache ZooKeeper for this task).
-   Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
-   Is responsible for schema changes and other metadata operations such as creation of tables and column families.

__2. Region Server__

HBase tables are divided horizontally by the _row key_ into regions. _Regions_ are contiguous ranges of rows from an HBase table, spread across the nodes of the cluster that serve them, called _region servers_. This split is done for performance and data reliability reasons.

The region servers have regions under their control that:
-   Communicate with the client and handle data-related operations.
-   Handle read and write requests.
-   Decide the size of the region by following the region size thresholds.
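
The idea that each region holds a range of row keys can be sketched as a simple lookup. The region boundaries and the server names (`rs1` to `rs3`) below are made up for illustration, and `find_region_server` is a hypothetical helper, not an HBase command.

```shell
# Hypothetical region layout: "start_key end_key region_server" per line.
# "-" marks an open boundary (the first region has no start key and the
# last region has no end key). The start key is inclusive, the end key exclusive.
REGIONS='- 3000 rs1
3000 6000 rs2
6000 - rs3'

# find_region_server ROW_KEY: print the server whose region contains the key.
find_region_server() {
  printf '%s\n' "$REGIONS" | LC_ALL=C awk -v key="$1" '
    {
      # Appending "" forces string (lexicographic) comparison, which is
      # how row keys are ordered, instead of numeric comparison.
      k = key ""; lo = $1 ""; hi = $2 ""
      if ((lo == "-" || k >= lo) && (hi == "-" || k < hi)) { print $3; exit }
    }'
}

find_region_server 1500    # prints rs1
find_region_server 7369    # prints rs3
find_region_server 10000   # prints rs1: "10000" sorts before "3000" as a string
```

The last call shows why row key design matters: keys are compared as byte strings, so numeric keys of different lengths do not sort in numeric order.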

__3. ZooKeeper__

ZooKeeper is an additional component that works as a coordinator for HBase, especially when HDFS is used to store the data. ZooKeeper is an open-source Apache project that provides services such as maintaining configuration information, naming, and distributed synchronization. Its main tasks include:

-   ZooKeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers.
-   In addition to availability, the nodes are also used to track server failures or network partitions.
-   Clients use ZooKeeper to locate the region servers, then communicate with them directly.
-   In pseudo-distributed and standalone modes, HBase itself takes care of running ZooKeeper.

## Apache HBase Features

Below are the main features provided by HBase:

- HBase is built for low-latency operations
- HBase provides fast random read operations. It does so because it uses hash tables and indexes the data stored in HDFS.
- HBase can store large amounts of data easily (terabytes and even petabytes)
- Provides scalability within cluster environments
- Automatic and configurable sharding (division) of tables
- Automatic failover support between region servers
- Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
- Easy-to-use API for client access
- Supports real-time querying efficiently


It's also important to note what HBase is __not__:

-   It's not a SQL database and doesn't store data using a relational model.
-   It's not designed for Online Transaction Processing (OLTP).
-   It doesn't provide typical relational database features like multi-row ACID (atomicity, consistency, isolation and durability) transactions or data normalization.
-   It's not designed to be used with small datasets.
-   Data is sorted _only_ using the row key.

## Download and install HBase

> It's highly recommended to use a Linux system as HBase is known to cause issues when used with Windows

Now that we have a high-level understanding of what HBase is, let's download and install HBase to get a better feel of how it operates. 

At a high-level, to get HBase working on your system, the steps involve:

- Downloading and installing Hadoop (Hadoop's file system is required for HBase in pseudo-distributed mode)
- Configuring the following files:
    -   .bashrc
    -   hadoop-env.sh
    -   core-site.xml
    -   hdfs-site.xml
    -   mapred-site.xml
    -   yarn-site.xml
- Downloading and installing HBase in pseudo-distributed mode (to leverage HDFS for data storage).
- Configuring the following HBase files:
    -   hbase-env.sh
    -   hbase-site.xml

__Note__: We'll be using Linux (Ubuntu) for this tutorial, so if you're on a different operating system the steps might differ.  

Let's begin:

1. Open your terminal and run the following commands to __update__ all installed packages:

In [None]:
sudo apt update
sudo apt -y upgrade
sudo reboot

2. Ensure that __Java__ is installed on your system. 

__Note__: It is highly recommended to use `Java 8` as this is the version fully compatible with both Hadoop and HBase. Otherwise, we may face some errors.

To do this, run the following command:

In [None]:
# Check the Java version
java -version

# Check the Java compiler version
javac -version

If Java is installed correctly, you'll see an output showing the Java and compiler version.  

Otherwise, you will need to download and install Java by running the following commands (sometimes both the JRE and JDK are required, so let's go ahead and install both just in case):

In [None]:
# Add the OpenJDK repository first, then refresh the package index
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update

# Install the Java 8 runtime environment (JRE)
sudo apt-get install openjdk-8-jre

# Install the Java 8 development kit (JDK)
sudo apt-get install openjdk-8-jdk

Check that both the JDK and the JRE have installed correctly.  If all is well, run the below command and you should see the Java version:

In [None]:
java -version

# Output should be similar to this
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)


In [None]:
javac -version

# Output should be similar to this
javac 1.8.0_312
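
Since the rest of the tutorial assumes Java 8, the version check can also be scripted. The sketch below uses a hypothetical helper, `is_java8`, that inspects the first line of the version output; it assumes the line contains `version "1.8.` as in the sample output above.

```shell
# is_java8 VERSION_LINE: succeed if the line reports a 1.8.x (Java 8) version.
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample line copied from the expected output above.
if is_java8 'java version "1.8.0_201"'; then
  echo "Java 8 detected"
fi
```

Against a real installation, the line could be captured with `java -version 2>&1 | head -n 1`, because `java -version` writes to standard error.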

__Note__: If you have multiple Java versions installed, it's possible to switch between them.  To do this, check what available Java versions currently exist by running the below command:

In [None]:
# Check Java versions already installed
sudo update-alternatives --config java

You should see a menu similar to the below one.  You can select the desired version directly by typing in its number:

In [None]:
sudo update-alternatives --config java
[sudo] password for hadoop:

# Output should be similar to this
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection	Path                                        	Priority   Status
------------------------------------------------------------
* 0        	/usr/lib/jvm/java-11-openjdk-amd64/bin/java  	1111  	auto mode
  1        	/usr/lib/jvm/java-11-openjdk-amd64/bin/java  	1111  	manual mode
  2        	/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081  	manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
$

Repeat the same step above for the Java compiler `javac`:

In [None]:
sudo update-alternatives --config javac

3. Next, we need to ensure that the `JAVA_HOME` variable is correctly set up on your system.  To do that, run the below command:

In [None]:
echo $JAVA_HOME

If the variable is correctly set up, you should see a path show up similar to the one below:

In [None]:
echo $JAVA_HOME

# Expected output should be similar to:
/usr/lib/jvm/java-8-openjdk-amd64

If you get no output, or if the output is simply repeating JAVA_HOME, then the variable is not set up correctly.

To fix this, you will need to add the correct Java path to your `.bashrc` file.  To do this, first find the location of the Java installation, then open and update the `.bashrc` file to include the following lines at the end of the file (ensure you use the path to your Java folder):

In [None]:
# Open the .bashrc file using the Nano editor (you can use any other editor like Vim if you prefer)
sudo nano ~/.bashrc

In [None]:
# Add the below to the .bashrc file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

After saving the file, we need to use the `source` command to apply the changes to the current shell session:

In [None]:
source ~/.bashrc

Now, to make sure everything is set up correctly, check that `JAVA_HOME` is working.  You should now be able to see the Java folder path correctly:

In [None]:
echo $JAVA_HOME

4. Next, we should create a new user account for Hadoop.  HBase uses Hadoop's HDFS to store the data, so it's recommended to keep a separate account for Hadoop rather than using your regular Linux account, to avoid confusion.  Go ahead and type the below commands:

In [None]:
# Create a new user called Hadoop
sudo adduser hadoop
sudo usermod -aG sudo hadoop

You will then be asked for some additional information including a password for the new user account.  Go ahead and add that information.  You should see output similar to:

In [None]:
# Expected output should be similar to the below
Adding user `hadoop' ...
Adding new group `hadoop' (1001) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
Copying files from `/etc/skel' ...
New password:
Retype new password:
passwd: password updated successfully
Changing the user information for hadoop
Enter the new value, or press ENTER for the default
    Full Name []: 	 
    Room Number []:
    Work Phone []:
    Home Phone []:
    Other []:
Is the information correct? [Y/n] y
adiwany@dodz-vm:~$ sudo usermod -aG sudo hadoop
adiwany@dodz-vm:~$


5. Now, we need to generate an __SSH key pair__ for the new Hadoop user.  Run the below commands:

In [None]:
sudo su hadoop
ssh-keygen -t rsa

If all goes well, you should see output similar to this:

In [None]:
ssh-keygen -t rsa

# Expected output:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:tol1mX4v1GrDLeHHq1Wa/nvKaZUMlEGDCZisTAkWVPc hadoop@dodz-vm
The key's randomart image is:
+---[RSA 3072]----+
|  .=+.o.o.. ++o  |
|  .  o.+.  o o.  |
|	o .  E  .	|
| 	o 	o .   |
|    	S +  .o o|
|   	+ =  o .*.|
|  	. o .+.=+. |
|       	.O==..|
|       	..BB=+|
+----[SHA256]-----+
hadoop@dodz-vm:~$

6. Next, add this newly generated key to the list of authorized SSH keys:

In [None]:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
sudo chmod 0600 ~/.ssh/authorized_keys

If no errors come up, then everything is set up correctly so far.

7. Verify that we can use SSH with the newly generated key:

In [None]:
ssh localhost

If everything runs smoothly, you should see output similar to:

In [None]:
ssh localhost

# Expected output
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:fPYKPrq8VD1pNfI+7EXyKqQFFm4/eWi0+jjADURdHhU.
Are you sure you want to continue connecting (yes/no/[fingerprint])? y
Please type 'yes', 'no' or the fingerprint: yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 20.04.3 LTS (GNU/Linux 5.11.0-41-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management: 	https://landscape.canonical.com
 * Support:    	https://ubuntu.com/advantage

23 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

Your Hardware Enablement Stack (HWE) is supported until April 2025.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

If you get an error saying something like __Connection Refused__, then the localhost is not properly set up or SSH is not yet installed.

To install SSH, enter the following command:

In [None]:
sudo apt install ssh 

You should see output similar to this:

In [None]:
sudo apt install ssh

# Expected output 
[sudo] password for hadoop:
Reading package lists... Done
Building dependency tree  	 
Reading state information... Done
The following packages were automatically installed and are no longer required:
  chromium-codecs-ffmpeg-extra gstreamer1.0-vaapi
  libgstreamer-plugins-bad1.0-0 libva-wayland2
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  ncurses-term openssh-server openssh-sftp-server ssh-import-id
Suggested packages:
  molly-guard monkeysphere ssh-askpass
The following NEW packages will be installed:
  ncurses-term openssh-server openssh-sftp-server ssh ssh-import-id
0 upgraded, 5 newly installed, 0 to remove and 18 not upgraded.
Need to get 693 kB of archives.
After this operation, 6,130 kB of additional disk space will be used.
Do you want to continue? [Y/n] y

8. Download and install `Hadoop`

Hadoop is required for HBase to work correctly in pseudo-distributed mode.  In this mode, HBase uses Hadoop's file system (HDFS) to store and retrieve the data.

To ensure that Hadoop and HBase are compatible, you need to use compatible versions.  For this tutorial, we'll be using the following versions:
-   Hadoop 2.10.1
-   HBase 1.7.1

Save the version to a variable as follows:

In [None]:
Version="2.10.1"

Then download the corresponding Hadoop version as follows:

In [None]:
sudo wget https://www-eu.apache.org/dist/hadoop/common/hadoop-$Version/hadoop-$Version.tar.gz

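9. Extract the files and move the resulting folder to the location of your choice.  We'll be using `/usr/lib/hadoop`, so that the final path matches the `HADOOP_HOME` value set in the next step:

In [None]:
# Create the target directory if it doesn't exist yet
sudo mkdir -p /usr/lib/hadoop
tar -xzvf hadoop-$Version.tar.gz
rm hadoop-$Version.tar.gz
# Results in /usr/lib/hadoop/hadoop-$Version
sudo mv hadoop-$Version/ /usr/lib/hadoop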

10. Now, we need to set `HADOOP_HOME` and add the directory containing the Hadoop binaries to your `.bashrc` file.  To do this, run the following command:

In [None]:
sudo nano ~/.bashrc

Define the Hadoop environment variables required to configure the tool on your system.  To do this, we need to add the following content to the end of the `.bashrc` file (remember to use your own Java and Hadoop folder paths):

In [None]:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/lib/hadoop/hadoop-2.10.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Save the file and exit.  It's vital to run the below command to apply the changes we just added:

In [None]:
source ~/.bashrc

11. Next, we'll update the `hadoop-env.sh` file.  This file contains important configurations related to Hadoop's setup:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the `$JAVA_HOME` variable (i.e., remove the # sign) and _add the full path to the OpenJDK installation on your system without the bin directory_. If you don't know the Java path, run the following command to find out:

In [None]:
readlink -f /usr/bin/javac
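
The path printed by `readlink` ends in `/bin/javac`, while `JAVA_HOME` needs the folder two levels up. As a small sketch, a hypothetical helper called `java_home_from_javac` can strip those last two components:

```shell
# java_home_from_javac JAVAC_PATH: print the Java home for a resolved javac path.
java_home_from_javac() {
  # Strip the trailing /bin/javac by taking the parent directory twice.
  dirname "$(dirname "$1")"
}

# Example using the OpenJDK 8 location from earlier in this tutorial.
java_home_from_javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac
# prints /usr/lib/jvm/java-8-openjdk-amd64
```

On a real system, the argument would be the output of the `readlink` command above, for example `java_home_from_javac "$(readlink -f /usr/bin/javac)"`.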

Your file should look something like this:
<p align="center">
  <img src="images/hadoop-env3.png" width=600>
</p>


12. The `core-site.xml` file defines important HDFS and Hadoop cluster properties.  To set up Hadoop properly, we need to provide the URL of the NameNode (master node).  To do that, run the following command to open the file:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

To specify that we'll be running Hadoop locally, add the following in between the `<configuration>` and `</configuration>` tags and save the file:

In [None]:
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
      <description>The default file system URI</description>
   </property>
</configuration>

13. Next, we need to edit the `hdfs-site.xml` file.  This file stores the details regarding the location of the metadata, NameNode and DataNode directories. 

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

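Add the following properties to the file and adjust the NameNode and DataNode directories to your own locations (create the directories first).  Note that shell variables such as `$HADOOP_HOME` are not expanded inside XML values, so replace `$HADOOP_HOME` below with the absolute path of your Hadoop folder.  We'll also set the default HDFS replication factor to 1: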

In [None]:
<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>file:///$HADOOP_HOME/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>file:///$HADOOP_HOME/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>
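
Hadoop reads this file as plain XML, so a shell variable written inside a `<value>` element is not expanded by the shell. One way around this, sketched below, is to let the shell generate the file content with the path already substituted. The `render_hdfs_site` function and the `/usr/local/hadoop` argument are illustrative assumptions.

```shell
# render_hdfs_site HADOOP_DIR: print an hdfs-site.xml with absolute paths.
render_hdfs_site() {
  # The heredoc delimiter is unquoted, so the shell expands $1 here.
  cat <<EOF
<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>file://$1/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>file://$1/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>
EOF
}

# Example with a hypothetical install location.
render_hdfs_site /usr/local/hadoop
```

To use the result, the output could be redirected into `hdfs-site.xml` (for example `render_hdfs_site "$HADOOP_HOME" > "$HADOOP_HOME/etc/hadoop/hdfs-site.xml"`), ideally after backing up the existing file.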

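14. Edit the `mapred-site.xml` file to define the required MapReduce values.  If the file doesn't exist yet, create it by copying `mapred-site.xml.template` from the same folder: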

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following configuration to change the default MapReduce framework value to `yarn`:

In [None]:
<configuration> 
<property> 
  <name>mapreduce.framework.name</name> 
  <value>yarn</value> 
</property> 
</configuration>

15. Edit the `yarn-site.xml` file, which is used to define YARN-specific settings.  Open the file:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following configurations:

In [None]:
<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
</configuration>

This completes the required Hadoop configurations.

16. Reboot your system to ensure all new settings are loaded before running Hadoop:

In [None]:
systemctl reboot -i

17.  Validate Hadoop settings and configurations

After completing the above steps, we want to make sure that Hadoop was properly set up.  To do this, check the Hadoop version:

In [None]:
hadoop version

You should see output like this:

In [None]:
# Expected output
Hadoop 2.10.1
Subversion https://github.com/apache/hadoop -r 1827467c9a56f133025f28557bfc2c562d78e816
Compiled by centos on 2020-09-14T13:17Z
Compiled with protoc 2.5.0
From source with checksum 3114edef868f1f3824e7d0f68be03650
This command was run using /usr/lib/hadoop/hadoop-2.10.1/share/hadoop/common/hadoop-common-2.10.1.jar

18. Now we are ready to start the Hadoop cluster.  To do this, we need to run a number of commands:

- `start-dfs.sh` 
    -   This command starts HDFS
- `start-yarn.sh`
    -   This command starts YARN
- `jps`
    -   This command checks all Java processes to ensure the correct daemons are active

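__Note__: The start scripts live in the `sbin` folder, so run these commands from inside your `$HADOOP_HOME/sbin` folder.  If this is the first time you start HDFS, format the NameNode first with `hdfs namenode -format`: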

In [None]:
bash start-dfs.sh

This will take some time to start HDFS.  You should see output similar to:
<p align="center">
  <img src="images/start-hdfs.png" width=600>
</p>


Then run:

In [None]:
bash start-yarn.sh
jps

You should see the below output:

<p align="center">
  <img src="images/start-yarn-jps.png" width=600>
</p>


__Note__: If the DataNode process doesn't show up in the `jps` list, double check that the correct HDFS directory has been created and that the Hadoop user has the proper permissions on it.

The commands should be similar to the below (make sure to use your HDFS path as the below is just an example):

In [None]:
sudo chmod -R 755 /usr/lib/hadoop/hdfsdata/*
sudo chown -R hadoop:hadoop /usr/lib/hadoop/hdfsdata

Now, if you run `jps`, you should see the DataNode process correctly displayed along with the remaining Hadoop processes:

<p align="center">
  <img src="images/datanode-jps.png" width=600>
</p>

__Note__: Another possible error you may encounter is the `incompatible cluster ID error`.  To resolve this, delete both the `datanode` and `namenode` folders and create them again.

19. Use your preferred browser and navigate to your localhost URL or IP.  The NameNode user interface, which allows you to monitor the Hadoop environment, listens on port 50070 by default in Hadoop 2.x, the version used in this tutorial (Hadoop 3.x uses port 9870 instead):

In [None]:
http://localhost:50070

You should be able to see a page similar to this one:

<p align="center">
  <img src="images/hadoop-ui.png" width=600>
</p>

Similarly, you can access the DataNode user interface on port 50075 in Hadoop 2.x (port 9864 in Hadoop 3.x):

In [None]:
http://localhost:50075

And port 8088 can be used to view the YARN Resource Manager:

In [None]:
http://localhost:8088

20. Now that Hadoop has been successfully installed, our next objective is to download and set up HBase.

Remember that we'll be using version 1.7.1 as it is compatible with both Hadoop and Java 8.

__Note__: Recall that HBase is composed of 3 components:

-   HMaster (assigns regions to the region servers and handles admin functions)
-   Region Server (serves the read and write requests for its regions)
-   ZooKeeper (coordinates the servers in the cluster)

All 3 of these components need to be running correctly for us to be able to use HBase.

Let's start by downloading the HBase installation files:

In [None]:
# Set the Version variable
Version="1.7.1"
# Download the HBase version (note the $ in front of each Version)
wget https://dlcdn.apache.org/hbase/$Version/hbase-$Version-bin.tar.gz

21. Extract the downloaded archive and move it to `/usr/local/HBase` (create that folder if it's not already there):

In [None]:
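# Create the target directory if it doesn't exist yet
sudo mkdir -p /usr/local/HBase
tar xvf hbase-$Version-bin.tar.gz
# Results in /usr/local/HBase/hbase-$Version
sudo mv hbase-$Version/ /usr/local/HBase/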

22. We need to set your `HBASE_HOME` variable similar to what we did with `JAVA_HOME`.  To do this, copy the path of your HBase folder and open the `.bashrc` file to add the variable:

In [None]:
sudo nano ~/.bashrc

Add the below lines (use your HBase folder if the path is different):

In [None]:
export HBASE_HOME=/usr/local/HBase/hbase-1.7.1
export PATH=$PATH:$HBASE_HOME/bin

23. Next, we need to update the `hbase-env.sh` file, which contains the configurable parameters for the HBase environment.

For running HBase in pseudo-distributed mode, we need to set 3 properties within this file:
-   JAVA_HOME
-   HBASE_MANAGES_ZK
-   HBASE_REGIONSERVERS

Go ahead and open the file and uncomment/add the below settings:

In [None]:
export HBASE_MANAGES_ZK=true
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
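# Use the explicit Java path (adjust to your system), since JAVA_HOME
# may not be set when the HBase scripts start daemons over SSH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64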

24. Then, we will need to update the `hbase-site.xml` file by adding the properties below between the `<configuration>` tags (you may need to create the ZooKeeper data folder first).

Here is what each one of these parameters does:

- `hbase.cluster.distributed`
    -   Tells HBase whether to run in stand-alone local mode (`false`) or on a distributed cluster via Hadoop (`true`).

- `hbase.tmp.dir`
    -   The temporary data storage folder on the local file system

- `hbase.unsafe.stream.capability.enforce`
    -   Controls whether or not HBase will check for stream capabilities

- `hbase.rootdir`
    -   Specifies the root HDFS folder where HBase stores its data

- `hbase.zookeeper.property.dataDir`
    -   Tells ZooKeeper where to store its data files
    
- `hbase.zookeeper.quorum`
    -   The list of one or more hosts running ZooKeeper that are available for client requests

- `dfs.replication`
    -   The replication factor for HDFS data (this should match the Hadoop settings we configured earlier)
    
- `hbase.zookeeper.property.clientPort`
    -   Tells ZooKeeper which port it should use for client communication

In [None]:
<property>
	<name>hbase.cluster.distributed</name>
	<value>true</value>
</property>
<property>
	<name>hbase.tmp.dir</name>
	<value>./tmp</value>
</property>
<property>
	<name>hbase.unsafe.stream.capability.enforce</name>
	<value>false</value>
</property>
<property>
  	<name>hbase.rootdir</name>
  	<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
  	<name>hbase.zookeeper.property.dataDir</name>
  	<value>/usr/lib/hbase/data/zookeeper</value>
</property>
<property>
    	<name>hbase.zookeeper.quorum</name>
    	<value>localhost</value>
</property>
<property>
    	<name>dfs.replication</name>
    	<value>1</value>
</property>
<property>
    	<name>hbase.zookeeper.property.clientPort</name>
    	<value>2181</value>
</property>

25. Start all the Hadoop daemons first by running the following commands:

In [None]:
cd $HADOOP_HOME/sbin
bash start-dfs.sh
bash start-yarn.sh

If everything runs smoothly, you should see output similar to the below:

<p align="center">
  <img src="images/start-dfs-yarn.png" width=600>
</p>

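If everything looks good, we'll stop the services and grant the Hadoop user access to the Hadoop and HBase folders.  The below commands use relative folder names, so run them from the directory that contains both folders and adjust the names to match your own: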

In [None]:
stop-all.sh
chown -R hadoop:root hadoop
chmod -R 755 hadoop

chown -R hadoop:root Hbase
chmod -R 755 Hbase

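26. Test HDFS to make sure everything is working smoothly.  Since the previous step stopped the Hadoop services, start them again with `start-dfs.sh` and `start-yarn.sh` before continuing.

To do this, we'll create a `test` directory using the below command: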

In [None]:
hadoop fs -mkdir /test
hadoop fs -ls /

__Note__: The Hadoop file system (HDFS) is _not_ the same as the local file system. In reality, HDFS will be hosted on multiple servers across a distributed network.

The output should be:

In [None]:
hadoop fs -ls /

# Expected output
2021-12-22 12:27:34,309 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - hadoop supergroup      	0 2021-12-22 12:26 /test
hadoop@dodz-vm:/usr/local/hadoop/hadoop-3.3.1/etc/hadoop$


27. Next, we'll initiate the HBase process using the provided script `start-hbase.sh`

In [None]:
cd $HBASE_HOME/bin
bash start-hbase.sh

__Note__: If a `permission denied` error shows up, we may need to grant the Hadoop user access to the HBase folders.  To do this, run the below commands (using your corresponding HBase and ZooKeeper paths):

In [None]:
sudo chmod -R 755 /usr/lib/hbase/*
sudo chown -R hadoop:hadoop /usr/lib/hbase/

# The ZooKeeper path should match hbase.zookeeper.property.dataDir in hbase-site.xml
sudo chmod -R 755 /usr/lib/hbase/data/zookeeper/*
sudo chown -R hadoop:hadoop /usr/lib/hbase/data/zookeeper

To ensure all the proper HBase processes are running, run the below Linux command which shows all active processes:

In [None]:
jps

The expected output should include all of the below processes (3 for HBase and 5 for Hadoop plus jps itself):

In [None]:
jps

# Expected output
9298 HMaster
5652 SecondaryNameNode
5286 NameNode
9238 HQuorumPeer
9399 HRegionServer
5784 ResourceManager
6684 DataNode
5918 NodeManager
9486 Jps
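
Comparing the `jps` list by eye is error-prone, so the check can be scripted. The sketch below defines a hypothetical helper, `missing_daemons`, that reads `jps`-style output and prints every expected process name that is absent. The list of expected names is taken from the output above.

```shell
# Process names expected in pseudo-distributed mode (from the output above).
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode ResourceManager NodeManager HMaster HRegionServer HQuorumPeer"

# missing_daemons: read jps output on stdin, print each expected name not found.
missing_daemons() {
  jps_output=$(cat)
  for daemon in $EXPECTED_DAEMONS; do
    # Compare against the whole second field so that "NameNode"
    # does not match "SecondaryNameNode".
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      printf '%s\n' "$daemon"
    fi
  done
}

# Sample jps output in which the DataNode process is absent.
SAMPLE='9298 HMaster
5652 SecondaryNameNode
5286 NameNode
9238 HQuorumPeer
9399 HRegionServer
5784 ResourceManager
5918 NodeManager
9486 Jps'

printf '%s\n' "$SAMPLE" | missing_daemons   # prints DataNode
```

On a live system the function would be fed directly from the command, as in `jps | missing_daemons`; empty output means every expected process was found.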

Once the required processes run, we now need to run the HBase shell to ensure that we can start interacting with HBase.

To do this, run the below command:

In [None]:
hbase shell

You should now be inside the HBase shell as we can see below:  

<p align="center">
  <img src="images/hbase-shell.png" width=600>
</p>

Now we are inside HBase and can begin to use its commands.

Try to run the `status` command to ensure HBase is working successfully.  This command shows the list of active HBase servers. 
The output should be something like:

In [None]:
status

In [None]:
hbase(main):001:0> status

#Expected output
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load

hbase(main):002:0>


If you get an error that mentions HMaster is not running, double check the `/etc/hosts` file to ensure that the VM hostname and `localhost` both map to the same IP (127.0.0.1):

In [None]:
sudo nano /etc/hosts
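
To see which IP address a hostname maps to without opening the editor, a small lookup can help. The `hosts_ip` helper below is a hypothetical sketch that parses `/etc/hosts`-style text, and the sample content is an assumption used only for illustration.

```shell
# hosts_ip NAME: read /etc/hosts-style text on stdin and print the first IP
# address mapped to NAME (comment lines and blank lines are ignored).
hosts_ip() {
  awk -v name="$1" '
    /^[ \t]*#/ || NF == 0 { next }
    { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }'
}

# Sample content; "dodz-vm" is the VM hostname seen in the outputs above.
SAMPLE_HOSTS='127.0.0.1 localhost
127.0.1.1 dodz-vm'

printf '%s\n' "$SAMPLE_HOSTS" | hosts_ip localhost   # prints 127.0.0.1
printf '%s\n' "$SAMPLE_HOSTS" | hosts_ip dodz-vm     # prints 127.0.1.1
```

In this sample the two names map to different addresses, which is the situation the note above asks you to correct. On a real system, the input would come from the file itself, for example `hosts_ip "$(hostname)" < /etc/hosts`.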

28. Now that HBase is up and running, we'll proceed to create a table and populate it with data.

The first step is to [download the Employee data file from here](LINK).

__Note__: If you are using a Virtual Machine, try to download the file directly in the VM rather than the host operating system.  This avoids having to do a file transfer between the two systems.


29. The next step is to import the Employee data file into HBase.

To do that, we need to create an HBase table and specify the Column Family:

In [None]:
create 'emp_data',{NAME => 'cf'}

To check the table was created successfully, run the `list` command to see all available HBase tables:

In [None]:
list

#Expected output:
hbase(main):002:0> list
emp_data                                                                         	 
1 row(s) in 0.3180 seconds



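Once the table is created, the CSV file needs to be available in HDFS (for example, uploaded to `/data` with `hadoop fs -put`).  Then, exit the HBase shell and run the below command from your `HBASE_HOME` folder to import the file from HDFS into the HBase table: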

In [None]:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno emp_data /data/emp_data.csv

__Note__: If you get any errors, such as "Bad Lines" or "Failed Map", check that you didn't miss any characters from the above command and try typing it directly yourself instead of copying and pasting it.
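
Another likely cause of "Bad Lines" is a row whose number of fields does not match the 7 columns listed in `-Dimporttsv.columns`. As a rough pre-check, the sketch below counts comma-separated fields per line. The `bad_lines` helper is hypothetical, the sample rows are built from the values shown in the scan output later in this tutorial, and the check does not handle quoted values that contain commas.

```shell
# bad_lines EXPECTED_FIELDS: read CSV on stdin and print the line number of
# every line whose comma-separated field count differs from EXPECTED_FIELDS.
bad_lines() {
  awk -F',' -v n="$1" 'NF != n { print NR }'
}

# Two sample rows: the first has the 7 expected fields,
# the second is missing its last field.
SAMPLE_CSV='7369,SMITH,CLERK,7902,12/17/1980,800,20
7499,ALLEN,SALESMAN,7698,2/20/1981,1600'

printf '%s\n' "$SAMPLE_CSV" | bad_lines 7   # prints 2
```

For the real file, the check could be run on the local copy before uploading it, for example `bad_lines 7 < emp_data.csv`.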

If everything works smoothly, you should see output similar to:

In [None]:
2022-01-07 13:34:28,910 INFO  [main] mapreduce.Job: erations=0
   	 HDFS: Number of bytes read=756
   	 HDFS: Number of bytes written=0
   	 HDFS: Number of read operations=2
   	 HDFS: Number of large read operations=0
   	 HDFS: Number of write operations=0
    Job Counters
   	 Launched map tasks=1
   	 Data-local map tasks=1
   	 Total time spent by all maps in occupied slots (ms)=5154
   	 Total time spent by all reduces in occupied slots (ms)=0
   	 Total time spent by all map tasks (ms)=5154
   	 Total vcore-milliseconds taken by all map tasks=5154
   	 Total megabyte-milliseconds taken by all map tasks=5277696
    Map-Reduce Framework
   	 Map input records=15
   	 Map output records=15
   	 Input split bytes=104
   	 Spilled Records=0
   	 Failed Shuffles=0
   	 Merged Map outputs=0
   	 GC time elapsed (ms)=77
   	 CPU time spent (ms)=1600
   	 Physical memory (bytes) snapshot=183992320
   	 Virtual memory (bytes) snapshot=1874804736
   	 Total committed heap usage (bytes)=137953280
    ImportTsv
   	 Bad Lines=0
    File Input Format Counters
   	 Bytes Read=652
    File Output Format Counters
   	 Bytes Written=0

30. Now, we need to go into the HBase shell and check that the data is correctly loaded.  To do that, we'll use the `scan` command (which is similar to a SQL SELECT):

In [None]:
scan 'emp_data'

# Expected output:
hbase(main):004:0> scan 'emp_data'
ROW                	COLUMN+CELL                                               	 
 7369              	column=cf:deptno, timestamp=1641555244509, value=20       	 
 7369              	column=cf:designation, timestamp=1641555244509, value=CLERK    
 7369              	column=cf:ename, timestamp=1641555244509, value=SMITH     	 
 7369              	column=cf:hire_date, timestamp=1641555244509, value=12/17/1980
 7369              	column=cf:manager, timestamp=1641555244509, value=7902    	 
 7369              	column=cf:sal, timestamp=1641555244509, value=800         	 
 7499              	column=cf:deptno, timestamp=1641555244509, value=30       	 
 7499              	column=cf:designation, timestamp=1641555244509, value=SALESMAN
 7499              	column=cf:ename, timestamp=1641555244509, value=ALLEN     	 
 7499              	column=cf:hire_date, timestamp=1641555244509, value=2/20/1981  
 7499              	column=cf:manager, timestamp=1641555244509, value=7698    	 
 7499              	column=cf:sal, timestamp=1641555244509, value=1600        	 
 7521              	column=cf:deptno, timestamp=1641555244509, value=30       	 
 7521              	column=cf:designation, timestamp=1641555244509, value=SALESMAN
 7521              	column=cf:ename, timestamp=1641555244509, value=WARD      	 
 7521              	column=cf:hire_date, timestamp=1641555244509, value=2/22/1981  
 7521              	column=cf:manager, timestamp=1641555244509, value=7698    	 
 7521              	column=cf:sal, timestamp=1641555244509, value=1250        	 
 7566              	column=cf:deptno, timestamp=1641555244509, value=20       	 
 7566              	column=cf:designation, timestamp=1641555244509, value=MANAGER  
 7566              	column=cf:ename, timestamp=1641555244509, value=TURNER    	 
 7566              	column=cf:hire_date, timestamp=1641555244509, value=4/2/1981   
 7566              	column=cf:manager, timestamp=1641555244509, value=7839    	 
 7566              	column=cf:sal, timestamp=1641555244509, value=2975        	 
 7654              	column=cf:deptno, timestamp=1641555244509, value=30       	 
 7654              	column=cf:designation, timestamp=1641555244509, value=SALESMAN
 7654              	column=cf:ename, timestamp=1641555244509, value=MARTIN    	 
 7654              	column=cf:hire_date, timestamp=1641555244509, value=9/28/1981  
 7654              	column=cf:manager, timestamp=1641555244509, value=7698    	 
 7654              	column=cf:sal, timestamp=1641555244509, value=1250        	 
 7698              	column=cf:deptno, timestamp=1641555244509, value=30       	 
 7698              	column=cf:designation, timestamp=1641555244509, value=MANAGER  
 7698              	column=cf:ename, timestamp=1641555244509, value=MILLER    	 
 7698              	column=cf:hire_date, timestamp=1641555244509, value=5/1/1981   
 7698              	column=cf:manager, timestamp=1641555244509, value=7839    	 
 7698              	column=cf:sal, timestamp=1641555244509, value=2850        	 
 7782              	column=cf:deptno, timestamp=1641555244509, value=10       	 
 7782              	column=cf:designation, timestamp=1641555244509, value=MANAGER  
 7782              	column=cf:ename, timestamp=1641555244509, value=CLARK     	 
 7782              	column=cf:hire_date, timestamp=1641555244509, value=6/9/1981   
 7782              	column=cf:manager, timestamp=1641555244509, value=7839    	 
 7782              	column=cf:sal, timestamp=1641555244509, value=2450        	 
 7788              	column=cf:deptno, timestamp=1641555244509, value=20       	 
 7788              	column=cf:designation, timestamp=1641555244509, value=ANALYST  
 7788              	column=cf:ename, timestamp=1641555244509, value=SCOTT     	 
 7788              	column=cf:hire_date, timestamp=1641555244509, value=12/9/1982  
 7788              	column=cf:manager, timestamp=1641555244509, value=7566    	 
 7788              	column=cf:sal, timestamp=1641555244509, value=3000        	 
 7839              	column=cf:deptno, timestamp=1641555244509, value=10       	 
 7839              	column=cf:designation, timestamp=1641555244509, value=PRESIDENT
 7839              	column=cf:ename, timestamp=1641555244509, value=KING      	 
 7839              	column=cf:hire_date, timestamp=1641555244509, value=11/17/1981
 7839              	column=cf:manager, timestamp=1641555244509, value=NULL    	 
 7839              	column=cf:sal, timestamp=1641555244509, value=5000        	 
 7844              	column=cf:deptno, timestamp=1641555244509, value=30       	 
 7844              	column=cf:designation, timestamp=1641555244509, value=SALESMAN
 7844              	column=cf:ename, timestamp=1641555244509, value=TURNER    	 
 7844              	column=cf:hire_date, timestamp=1641555244509, value=9/8/1981   
 7844              	column=cf:manager, timestamp=1641555244509, value=7698    	 
 7844              	column=cf:sal, timestamp=1641555244509, value=1500        	 
 7876              	column=cf:deptno, timestamp=1641555244509, value=20       	 
 7876              	column=cf:designation, timestamp=1641555244509, value=CLERK    
 7876              	column=cf:ename, timestamp=1641555244509, value=ADAMS     	 
 7876              	column=cf:hire_date, timestamp=1641555244509, value=1/12/1983  
 7876              	column=cf:manager, timestamp=1641555244509, value=7788    	 
 7876              	column=cf:sal, timestamp=1641555244509, value=1100        	 
 7900              	column=cf:deptno, timestamp=1641555244509, value=30       	 
 7900              	column=cf:designation, timestamp=1641555244509, value=CLERK    
 7900              	column=cf:ename, timestamp=1641555244509, value=JAMES     	 
 7900              	column=cf:hire_date, timestamp=1641555244509, value=12/3/1981  
 7900              	column=cf:manager, timestamp=1641555244509, value=7698    	 
 7900              	column=cf:sal, timestamp=1641555244509, value=950         	 
 7902              	column=cf:deptno, timestamp=1641555244509, value=20       	 
 7902              	column=cf:designation, timestamp=1641555244509, value=ANALYST  
 7902              	column=cf:ename, timestamp=1641555244509, value=FORD      	 
 7902              	column=cf:hire_date, timestamp=1641555244509, value=12/3/1981  
 7902              	column=cf:manager, timestamp=1641555244509, value=7566    	 
 7902              	column=cf:sal, timestamp=1641555244509, value=3000        	 
 7934              	column=cf:deptno, timestamp=1641555244509, value=10       	 
 7934              	column=cf:designation, timestamp=1641555244509, value=CLERK    
 7934              	column=cf:ename, timestamp=1641555244509, value=MILLER    	 
 7934              	column=cf:hire_date, timestamp=1641555244509, value=1/23/1982  
 7934              	column=cf:manager, timestamp=1641555244509, value=7782    	 
 7934              	column=cf:sal, timestamp=1641555244509, value=1300        	 
 empno             	column=cf:deptno, timestamp=1641555244509, value=deptno   	 
 empno             	column=cf:designation, timestamp=1641555244509, value=designati
 empno             	column=cf:ename, timestamp=1641555244509, value=ename     	 
 empno             	column=cf:hire_date, timestamp=1641555244509, value=hire_date  
 empno             	column=cf:manager, timestamp=1641555244509, value=manager 	 
 empno             	column=cf:sal, timestamp=1641555244509, value=sal         	 
15 row(s) in 0.5130 seconds


31. Take a detailed look at how the data is displayed in HBase, as it may seem confusing at first.  Unlike a relational database, which stores data in a row-based manner, HBase stores data using a __column-based__ approach.

Each line of the `scan` output represents a single cell (one column value) together with its automatically assigned timestamp. The __ROW__ field is the unique row key, which tells HBase which cells belong together (i.e. whether they are part of the same logical row or not). Note that the header line of the CSV file was imported as well, which is why the output contains a row with the key `empno` and the total is 15 rows rather than 14.
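To make the link between cell lines and logical rows concrete, here is a small sketch that regroups a few lines of the `scan` output by row key. It is only an illustration of the layout: the function name `group_cells_by_row` and the regular expression are assumptions made for this example, and they match the output format shown above (numeric timestamps), not every HBase version.

```python
import re
from collections import defaultdict

# Matches one cell line of the scan output: row key, column, timestamp, value.
CELL_LINE = re.compile(r"^\s*(\S+)\s+column=([^,]+), timestamp=(\d+), value=(.*?)\s*$")


def group_cells_by_row(scan_lines):
    """Collect the cells of each row key into one dict per logical row."""
    rows = defaultdict(dict)
    for line in scan_lines:
        match = CELL_LINE.match(line)
        if match is None:
            continue  # header, footer, or blank line
        row_key, column, _timestamp, value = match.groups()
        rows[row_key][column] = value
    return dict(rows)


# A shortened excerpt of the scan output shown earlier.
scan_output = """\
ROW                 COLUMN+CELL
 7369               column=cf:deptno, timestamp=1641555244509, value=20
 7369               column=cf:designation, timestamp=1641555244509, value=CLERK
 7369               column=cf:ename, timestamp=1641555244509, value=SMITH
 7499               column=cf:deptno, timestamp=1641555244509, value=30
 7499               column=cf:ename, timestamp=1641555244509, value=ALLEN
15 row(s) in 0.5130 seconds
"""

for row_key, cells in group_cells_by_row(scan_output.splitlines()).items():
    print(row_key, cells)
# 7369 {'cf:deptno': '20', 'cf:designation': 'CLERK', 'cf:ename': 'SMITH'}
# 7499 {'cf:deptno': '30', 'cf:ename': 'ALLEN'}
```

Each printed dict corresponds to what a relational database would show as a single row, with the column family prefix (`cf:`) kept in the column names.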

Once you feel you have a good sense of how the data is structured in HBase, let's go ahead and look at some HBase commands we can use to interact with the data.

### HBase commands:

Below are some of the typical commands you would be using to interact with data in HBase:

- `put`
    -   This command allows you to insert a new cell or update the data in an already existing cell.

- `get`
    -   This command is used to read data from a table in HBase. It returns the values associated with one row of data at a time.

- `delete`
    -   This command allows you to delete a specific cell in an HBase table.

- `deleteall`
    -   This command deletes all of the cells in a given row of a table.

- `scan`
    -   This command is used to view the data stored in an HBase table.

- `count`
    -   This command is used to count the number of rows of a table.

- `disable`
    -   This command disables (turns off) a table so that it can be deleted.

- `drop`
    -   This command deletes a disabled table.

- `truncate`
    -   This command does 3 things in sequence:
        -   Disables a table
        -   Drops a table
        -   Recreates the table with the same name
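The difference between these commands is easier to see on a toy model. The sketch below is not the HBase API: `ToyTable` is a hypothetical in-memory stand-in written for this example, which stores one value per cell and ignores timestamps, versions, and column family definitions. It only mimics what `put`, `get`, `delete`, `deleteall`, `count`, and `truncate` do to the rows and cells of a table.

```python
class ToyTable:
    """Toy stand-in for one HBase table: {row_key: {column: value}}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        # Inserts the cell if it is missing, overwrites it otherwise.
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        # Returns every cell of one row (an empty dict if the row is absent).
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key, column):
        # Removes a single cell; a row with no cells left disappears.
        cells = self.rows.get(row_key, {})
        cells.pop(column, None)
        if not cells:
            self.rows.pop(row_key, None)

    def deleteall(self, row_key):
        # Removes every cell of the given row.
        self.rows.pop(row_key, None)

    def count(self):
        # Number of rows, not number of cells.
        return len(self.rows)

    def truncate(self):
        # The table still exists afterwards, but it is empty.
        self.rows = {}


table = ToyTable()
table.put("7369", "cf:ename", "SMITH")
table.put("7369", "cf:sal", "800")
table.put("7499", "cf:ename", "ALLEN")
table.put("7369", "cf:sal", "900")   # same cell again: the value is overwritten
print(table.get("7369"))             # {'cf:ename': 'SMITH', 'cf:sal': '900'}
table.delete("7369", "cf:sal")       # one cell removed
table.deleteall("7499")              # whole row removed
print(table.count())                 # 1
table.truncate()
print(table.count())                 # 0
```

In the real shell, each of these commands additionally takes the table name as its first argument, and `disable` and `drop` act on the table itself rather than on its data.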


For a detailed explanation of HBase commands, check the following guide:
-    [HBase Cheat Sheet](https://sparkbyexamples.com/hbase/hbase-shell-commands-cheat-sheet/)


## Key Takeaways

- HBase is a modern tool for storing and analyzing big data in tables.  It does so using a column-oriented approach.  This should not be confused with the row-oriented approach that traditional relational databases use.
- The intersection of a row and column in a table is called a _cell_.  Cells store data, which in turn is accessed using a unique ID called the _row key_.
- Related columns in HBase are grouped together into _column families_. An HBase table can have more than one column family. 
- HBase's architecture is composed of 3 main components: _HMaster_ (which acts as the master server), _Region Servers_ (which are various nodes that store tables), and _ZooKeeper_ (which coordinates the various administrative tasks).
- HBase is designed to efficiently handle unstructured and semi-structured data using low-latency operations.  The tool is easy to scale and supports batch and real-time querying of data.
- HBase can be installed in different modes including stand-alone (on a local machine), pseudo-distributed (using Hadoop as the underlying data store) and fully distributed (across a corporate cluster).
- To download and install HBase in pseudo-distributed mode, we'll need to have a compatible Java and Hadoop version installed beforehand. Using a Linux operating system is highly recommended.
- Tables in HBase can be created using the `create` command.  Table querying can be done using `scan` and `get` commands, while inserting data can be done using the `put` command.
- In order to delete an HBase table, we first need to `disable` the table and then `drop` the disabled table.  Alternatively, the `truncate` command disables, drops, and recreates the table in one step, which empties the table while keeping its name.