# Apache HBase - Installation

> It is highly recommended to use a Linux system, as HBase is known to cause issues when used on Windows.

Now that we have a high-level understanding of what HBase is, let's download and install HBase to get a better feel of how it operates. 

At a high level, getting HBase working on your system involves the following steps:

- Downloading and installing Hadoop (Hadoop's file system is required for HBase in pseudo-distributed mode, but is not required if HBase is run in standalone mode)
    -   _Note: Refer to the Hadoop notebook for detailed instructions on how to implement this step._
    <p></p>
    
- Configuring the following files:
    -   `.bashrc`
    -   `hadoop-env.sh`
    -   `core-site.xml`
    -   `hdfs-site.xml`
    -   `mapred-site.xml`
    -   `yarn-site.xml`
    <p></p>

- Downloading and installing HBase in pseudo-distributed mode (to leverage HDFS for data storage)
    <p></p>
    
- Configuring the following HBase files:
    -   `hbase-env.sh`
    -   `hbase-site.xml`

_Note: We'll be using Linux (Ubuntu) for this tutorial, so if you're on a different operating system the steps might differ._

Let's begin:

#### 1. Open your terminal and run the following commands to refresh the package index and upgrade all installed packages:

In [None]:
sudo apt update
sudo apt -y upgrade
# sudo reboot # if you run into issues, try restarting, which is what this command will do

#### 2. Ensure that __Java__ is installed on your system. 

_Note: It is highly recommended to use `Java 8` as this is the version fully compatible with both Hadoop and HBase. Otherwise, we may face some errors._

To do this, run the following command:

In [None]:
# Check the Java version
java -version

# Check the Java compiler version
javac -version

If Java is installed correctly, you'll see an output showing the Java and compiler version. Otherwise, you will need to download and install Java and the Java compiler. 

For detailed steps on how to do this, please refer to the Hadoop notebook.
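
Because Java 8 is the recommended version, it can help to extract the major version number from the `java -version` output. The function below is a minimal sketch: `java_major_version` is a hypothetical helper name, and it assumes the first line of the output has the usual `version "..."` form.

```shell
# Print the major Java version found in a `java -version` line.
# Hypothetical helper for illustration; not part of Java, Hadoop, or HBase.
java_major_version() {
    # $1: first line of `java -version`, e.g. openjdk version "1.8.0_312"
    local ver
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$ver" in
        1.*) ver=${ver#1.}; printf '%s\n' "${ver%%.*}" ;;   # legacy scheme: 1.8.x means Java 8
        "")  return 1 ;;                                    # no version string found
        *)   printf '%s\n' "${ver%%[.-]*}" ;;               # modern scheme: 11.0.x means Java 11
    esac
}
```

Since `java -version` writes to standard error, a typical call would be `java_major_version "$(java -version 2>&1 | head -n 1)"`.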

#### 3. Next, we need to ensure that the `JAVA_HOME` variable is correctly set up on your system.  To do that, run the below command:

In [None]:
echo $JAVA_HOME

If the variable is correctly set up, you should see a path show up similar to the one below:

In [None]:
echo $JAVA_HOME

# Expected output should be similar to:
/usr/lib/jvm/java-8-openjdk-amd64

If you get no output, or if the output is simply repeating JAVA_HOME, then the variable is not set up correctly.

To fix this, refer to the Hadoop notebook for detailed steps.
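
The check in this step can also be scripted. The sketch below uses a hypothetical helper, `check_java_home`, and assumes that a valid `JAVA_HOME` is one containing an executable `bin/java`.

```shell
# Report whether a path looks like a usable JAVA_HOME.
# Hypothetical helper for illustration.
check_java_home() {
    # $1: candidate JAVA_HOME path
    if [ -z "$1" ]; then
        echo "JAVA_HOME is not set"
        return 1
    elif [ ! -x "$1/bin/java" ]; then
        echo "No executable found at $1/bin/java"
        return 1
    fi
    echo "JAVA_HOME looks valid: $1"
}
```

Call it as `check_java_home "$JAVA_HOME"`; a non-zero exit status means the variable still needs fixing.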

#### 4. Ensure Hadoop is already installed, as our next objective is to download and set up HBase.

> Remember that we'll be using HBase version 2.4.10, as it is compatible with both Hadoop and Java 8.

Recall that HBase is composed of 3 components:
-   HMaster (coordinates the region servers and handles admin functions)
-   Region Server (serves reads and writes for the regions assigned to it)
-   ZooKeeper (tracks cluster state and coordinates the HMaster and the region servers)

We need to see all of these 3 components correctly running to be able to use HBase.

Let's start by downloading the HBase installation files:

In [None]:
# Set the version variable
VERSION="2.4.10"
# Download that HBase version
# (if the download fails, older releases are kept at https://archive.apache.org/dist/hbase/)
wget https://dlcdn.apache.org/hbase/${VERSION}/hbase-${VERSION}-bin.tar.gz

#### 5. Extract the downloaded archive and move the resulting folder into `/usr/local/`, so that HBase ends up in `/usr/local/hbase-2.4.10`:

In [None]:
tar -xzvf hbase-${VERSION}-bin.tar.gz
# The archive unpacks into hbase-${VERSION}/ (without the -bin suffix)
sudo mv hbase-${VERSION}/ /usr/local/

#### 6. We need to set your `HBASE_HOME` variable, similar to what we did with `JAVA_HOME`. To do this, copy the path of your HBase folder and open the `.bashrc` file to add the variable:

In [None]:
nano ~/.bashrc

Add the below lines (use your HBase folder if the path is different), then save the file and run `source ~/.bashrc` to apply the changes:

In [None]:
export HBASE_HOME=/usr/local/hbase-2.4.10
export PATH=$PATH:$HBASE_HOME/bin
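
If you prefer to script this step instead of editing the file by hand, the sketch below appends a line only when it is not already present, so running it twice does not duplicate the export. `append_once` is a hypothetical helper name, not a standard command.

```shell
# Append a line to a file only if that exact line is not already there.
# Hypothetical helper for illustration.
append_once() {
    # $1: file to edit, $2: exact line to add
    local file=$1 line=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage (commented out so that nothing is modified by accident):
# append_once ~/.bashrc 'export HBASE_HOME=/usr/local/hbase-2.4.10'
# append_once ~/.bashrc 'export PATH=$PATH:$HBASE_HOME/bin'
```

The single quotes in the usage lines keep `$PATH` and `$HBASE_HOME` unexpanded, so the variables are resolved each time `.bashrc` is loaded.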

#### 7. Next, we need to update the HBase environment configuration file, which contains the configurable parameters for the HBase environment.  

For running HBase in pseudo-distributed mode, we need to set 3 properties within this file:
-   `JAVA_HOME`
-   `HBASE_MANAGES_ZK`
-   `HBASE_REGIONSERVERS`

The file is located in the `conf` folder within the location where we unpacked HBase, which should be `/usr/local/hbase-2.4.10/`.

Go ahead and open the `hbase-env.sh` file and uncomment/add the below settings:

In [None]:
sudo nano /usr/local/hbase-2.4.10/conf/hbase-env.sh

In [None]:
# Add/uncomment the below settings
export HBASE_MANAGES_ZK=true
export HBASE_REGIONSERVERS=/usr/local/hbase-2.4.10/conf/regionservers
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

#### 8. Before the next step, create a ZooKeeper data folder: place a nested folder called `data/zookeeper` inside the HBase folder (not inside the `conf` folder).

The folder structure should look like this:

<p align="center">
  <img src="images/hbase-data-zookeeper.png" width=400>
  <figcaption align="center"><cite>Zookeeper Folder Structure</cite></figcaption>
</p>
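
The folder can be created from the terminal as well. This is a minimal sketch; `make_zk_data_dir` is a hypothetical helper name, and it assumes you have write access to the HBase folder.

```shell
# Create <hbase folder>/data/zookeeper and print the resulting path.
# Hypothetical helper for illustration.
make_zk_data_dir() {
    # $1: HBase installation folder, e.g. /usr/local/hbase-2.4.10
    mkdir -p "$1/data/zookeeper" && printf '%s\n' "$1/data/zookeeper"
}
```

For an installation under `/usr/local`, which is normally owned by root, running `sudo mkdir -p /usr/local/hbase-2.4.10/data/zookeeper` directly is probably simpler.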



#### 9. Then, we will need to update another file inside the same `conf` folder called the `hbase-site.xml` file.  
Add the following between the `<configuration>` tags:

In [None]:
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.tmp.dir</name>
    <value>./tmp</value>
</property>
<property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
</property>
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
</property>
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/hbase-2.4.10/data/zookeeper</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>


Here is what each of these parameters does:

- `hbase.cluster.distributed`
    -   This parameter tells HBase whether to run in stand-alone local mode (`false`) or in distributed mode on top of Hadoop (`true`). Pseudo-distributed mode also uses `true`.

- `hbase.tmp.dir`
    -   This is the temporary data folder on the local file system (not on HDFS)

- `hbase.unsafe.stream.capability.enforce`
    -   Controls whether or not HBase will check for stream capabilities
    -   This toggles HBase's check that the underlying file system supports `hflush` and `hsync`, the calls that help guarantee data durability. See more [here](https://hadoop-hbase.blogspot.com/2012/05/hbase-hdfs-and-durable-sync.html)

- `hbase.rootdir`
    -   Specifies the root folder where HBase stores its data, here a location on HDFS

- `hbase.zookeeper.property.dataDir`
    -   Tells Zookeeper where to store its data files
    
- `hbase.zookeeper.quorum`
    -   This is the comma-separated list of hosts running ZooKeeper that clients connect to. In this setup it is just `localhost`.

- `dfs.replication`
    -   The replication factor for HDFS data (this should match the Hadoop settings we configured earlier)
    
- `hbase.zookeeper.property.clientPort`
    -   Tells ZooKeeper which port to listen on for client connections
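
To double-check which value a parameter has after editing, a small lookup can help. The sketch below is a naive text scan, not a real XML parser: `get_hbase_property` is a hypothetical helper, and it assumes each `<name>` and its `<value>` sit on their own lines, as in the snippet above.

```shell
# Print the value of one property from an hbase-site.xml style file.
# Hypothetical helper for illustration; relies on the one-tag-per-line layout.
get_hbase_property() {
    # $1: path to the XML file, $2: property name to look up
    awk -v want="$2" '
        /<name>/ {
            # Remember the most recent property name
            name = $0
            sub(/.*<name>/, "", name); sub(/<\/name>.*/, "", name)
        }
        /<value>/ && name == want {
            # Print the value that follows the requested name, then stop
            val = $0
            sub(/.*<value>/, "", val); sub(/<\/value>.*/, "", val)
            print val; exit
        }
    ' "$1"
}
```

For example, `get_hbase_property /usr/local/hbase-2.4.10/conf/hbase-site.xml hbase.rootdir` should print the HDFS location configured above, and it prints nothing if the property is absent.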

#### 10. Start all the Hadoop daemons first by running the following commands:

In [None]:
# Assumes HADOOP_HOME is already set in your ~/.bashrc (see the Hadoop notebook)

cd $HADOOP_HOME/sbin
bash start-dfs.sh
bash start-yarn.sh

If everything runs smoothly, you should see output similar to the below:

<p align="center">
  <img src="images/start-dfs-yarn.png" width=600>
  <figcaption align="center"><cite>Hadoop Daemons</cite></figcaption>

</p>

If everything looks good, we'll stop the services, grant the Hadoop user access to HBase, and then start the Hadoop daemons again so that HDFS is available for the next steps. Run the below commands:

In [None]:
# Stop all running Hadoop processes
"$HADOOP_HOME"/sbin/stop-all.sh

# Change the owner of the Hadoop directory from root to the hadoop account
# (assumes the `hadoop` user from the Hadoop notebook)
sudo chown -R hadoop:root "$HADOOP_HOME"

# Allow read and execute access for all users, and write access for the owner
sudo chmod -R 755 "$HADOOP_HOME"

# Do the same for the HBase directory
sudo chown -R hadoop:root "$HBASE_HOME"
sudo chmod -R 755 "$HBASE_HOME"

# Start the Hadoop daemons again so that HDFS is running for the next steps
"$HADOOP_HOME"/sbin/start-dfs.sh
"$HADOOP_HOME"/sbin/start-yarn.sh

#### 11. Test HDFS to make sure everything is working smoothly.

To do this, we'll create a `test` directory using the below command:

In [None]:
hadoop fs -mkdir /test
hadoop fs -ls /

_Note: The Hadoop file system (HDFS) is **not** the same as the local file system. In reality, HDFS will be hosted on multiple servers across a distributed network._

The output should be:

In [None]:
hadoop fs -ls /

# Expected output
2021-12-22 12:27:34,309 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2021-12-22 12:26 /test


#### 12. Next, we'll start HBase by running the `start-hbase.sh` script. This starts the 3 components we mentioned earlier: HMaster, the region server, and Zookeeper.

In [None]:
# start-hbase.sh lives in $HBASE_HOME/bin, which we added to the PATH earlier
start-hbase.sh

_Note: If a `permission denied` error shows up, we may need to grant the Hadoop user access to the HBase folders. To do this, run the below commands (using your corresponding HBase and ZooKeeper paths):_

In [None]:
sudo chown -R hadoop:hadoop /usr/local/hbase-2.4.10
sudo chmod -R 755 /usr/local/hbase-2.4.10

sudo chown -R hadoop:hadoop /usr/local/hbase-2.4.10/data/zookeeper
sudo chmod -R 755 /usr/local/hbase-2.4.10/data/zookeeper

To ensure all the proper HBase processes are running, run the below Linux command which shows the status of all active Java processes:

In [None]:
jps

The expected output should include all of the below processes (3 for HBase and 5 for Hadoop plus jps itself):

In [None]:
jps

# Expected output
9298 HMaster
5652 SecondaryNameNode
5286 NameNode
9238 HQuorumPeer
9399 HRegionServer
5784 ResourceManager
6684 DataNode
5918 NodeManager
9486 Jps
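
Rather than comparing the list by eye, the comparison can be scripted. The sketch below takes the text printed by `jps` and lists any expected daemon that is absent; `missing_daemons` is a hypothetical helper name, and the list of expected names is taken from the output above.

```shell
# Print the name of every expected Hadoop/HBase daemon missing from jps output.
# Hypothetical helper for illustration.
missing_daemons() {
    # $1: the text printed by `jps` (lines of "<pid> <name>")
    local name
    for name in HMaster HRegionServer HQuorumPeer NameNode DataNode \
                SecondaryNameNode ResourceManager NodeManager; do
        # Compare the second field exactly, so NameNode does not match SecondaryNameNode
        printf '%s\n' "$1" |
            awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' ||
            printf '%s\n' "$name"
    done
}
```

Running `missing_daemons "$(jps)"` prints nothing when all eight daemons are up.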

Once the required processes are running, we need to open the HBase shell to ensure that we can start interacting with HBase.

To do this, run the below command:

In [None]:
hbase shell

You should now be inside the HBase shell as we can see below:  

<p align="center">
  <img src="images/hbase-shell.png" width=600>
  <figcaption align="center"><cite>HBase Shell</cite></figcaption>

</p>

Now we are inside HBase and can begin to use its commands.

Try to run the `status` command to ensure HBase is working successfully.  This command shows the list of active HBase servers. 

In [None]:
status

The output should be something like:

In [None]:
hbase(main):001:0> status

# Expected output
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load

hbase(main):002:0>


If you get an error saying that HMaster is not running, double-check the `/etc/hosts` file to ensure that both `localhost` and your VM's hostname map to the same IP address (`127.0.0.1`):

In [None]:
sudo nano /etc/hosts

Nice! We've successfully set up HBase on your local machine. 


## Key Takeaways

- HBase can be installed in different modes including stand-alone (on a local machine), pseudo-distributed (using Hadoop as the underlying data store) and fully distributed (across a corporate cluster).
- To download and install HBase in pseudo-distributed mode, we'll need to have compatible Java and Hadoop versions installed beforehand. Using a Linux operating system is highly recommended.
- To run HBase in pseudo-distributed mode, we also need to download and install Hadoop, as HBase uses HDFS to store the data
- We need to ensure that `JAVA_HOME`, `HADOOP_HOME` and `HBASE_HOME` are properly set up in the `.bashrc` file
- HBase has 2 configuration files that must be properly set up: `hbase-env.sh` and `hbase-site.xml`
- All Hadoop and HBase daemons must be initiated before we can open the HBase shell which enables us to type commands
- To start the HBase shell, type `hbase shell` from the terminal
- Once inside the shell, the `status` command can be used to provide information about the HBase cluster and to see if it is running properly
