# Apache HBase - Data Loading

Below are the steps required to load data into an HBase table. Make sure you have followed the previous steps to download, install and set up HBase before proceeding.

#### 1. The first step is to download the Retail data CSV file

You can [download it from here](https://aicore-files.s3.amazonaws.com/Data-Eng/retail.csv).


#### 2. The next step is to import the Retail data file into HBase.

First, we need to create a new HBase table and specify its column family. To do this, type the below command from inside the `hbase shell`:

In [None]:
create 'retail_table',{NAME => 'cf'}

To check the table was created successfully, run the `list` command to see all available HBase tables:

In [None]:
list

# Expected output:
hbase(main):002:0> list
retail_table                                                                         	 
1 row(s) in 0.3180 seconds


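Once the table is created, copy the CSV file to HDFS so that it can be imported into HBase. The command below assumes the `/data` directory already exists in HDFS; if it does not, create it first with `hadoop fs -mkdir -p /data`.

_Note: Replace `/YOURPATH` with the path to the folder where you saved the `retail.csv` file._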

In [None]:
hadoop fs -put /YOURPATH/retail.csv /data

Now, to check that the file has been properly copied to HDFS, type the below command:

In [None]:
hadoop fs -ls /data

You should see output like:

In [None]:
hadoop fs -ls /data

# Expected output
22/01/26 17:20:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup   45580638 2022-01-26 17:20 /data/retail.csv


Finally, we need to load the `retail.csv` file into HBase. To do this, run the below command from the __terminal__:

In [None]:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,cf:description,cf:quantity,cf:price,cf:customer,cf:country retail_table /data/retail.csv

_Note: You could of course write code to generate the column names if you have too many to write out by hand._
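
As a rough illustration of the note above, the column mapping can be generated from a list of column names. This is a minimal Python sketch, not part of HBase: the helper names `importtsv_columns` and `importtsv_command` are hypothetical, and it assumes the first CSV field is the row key and that all other fields go into the `cf` column family.

```python
# Hypothetical helpers: they only build strings, they do not call HBase.
def importtsv_columns(qualifiers, family="cf"):
    """Return the value for -Dimporttsv.columns; the first field is the row key."""
    return ",".join(["HBASE_ROW_KEY"] + [f"{family}:{q}" for q in qualifiers])


def importtsv_command(table, hdfs_path, qualifiers, family="cf", separator=","):
    """Assemble the full ImportTsv command line as a single string."""
    return (
        "hbase org.apache.hadoop.hbase.mapreduce.ImportTsv "
        f"-Dimporttsv.separator='{separator}' "
        f"-Dimporttsv.columns={importtsv_columns(qualifiers, family)} "
        f"{table} {hdfs_path}"
    )


columns = ["description", "quantity", "price", "customer", "country"]
print(importtsv_command("retail_table", "/data/retail.csv", columns))
```

Printing the result reproduces the command shown above, which can then be pasted into the terminal.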

_Note: If you get any errors, such as "Bad Lines" or "Failed Map", check that you didn't miss any characters from the above command, and try typing it in directly instead of copying and pasting it._
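
If "Bad Lines" is still non-zero after retyping the command, it can help to inspect the CSV itself. The Python sketch below works under an assumption that is not confirmed by the text above: that a line is rejected when its number of comma-separated fields differs from the six columns in the mapping, and that the separator is split naively without any handling of quoted fields. The function name `find_bad_lines` is hypothetical, and the second sample line is made up for illustration.

```python
def find_bad_lines(lines, expected_fields=6, separator=","):
    """Return (line_number, field_count) for lines with an unexpected field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Naive split: a separator inside a quoted field still counts as a split.
        count = len(line.rstrip("\n").split(separator))
        if count != expected_fields:
            bad.append((number, count))
    return bad


sample = [
    "1,WHITE HANGING HEART T-LIGHT HOLDER,6,2.55,17850,United Kingdom\n",
    "2,MUG, LARGE,1,3.00,12345,France\n",  # made-up line with an extra comma
]
print(find_bad_lines(sample))  # → [(2, 7)]
```

To check the real file, pass an open file object instead of the sample list, for example `find_bad_lines(open("retail.csv"))`.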

If everything works smoothly, you should see output similar to:

In [None]:
2022-01-07 13:34:28,910 INFO  [main] mapreduce.Job: erations=0
   	 HDFS: Number of bytes read=756
   	 HDFS: Number of bytes written=0
   	 HDFS: Number of read operations=2
   	 HDFS: Number of large read operations=0
   	 HDFS: Number of write operations=0
    Job Counters
   	 Launched map tasks=1
   	 Data-local map tasks=1
   	 Total time spent by all maps in occupied slots (ms)=5154
   	 Total time spent by all reduces in occupied slots (ms)=0
   	 Total time spent by all map tasks (ms)=5154
   	 Total vcore-milliseconds taken by all map tasks=5154
   	 Total megabyte-milliseconds taken by all map tasks=5277696
    Map-Reduce Framework
   	 Map input records=15
   	 Map output records=15
   	 Input split bytes=104
   	 Spilled Records=0
   	 Failed Shuffles=0
   	 Merged Map outputs=0
   	 GC time elapsed (ms)=77
   	 CPU time spent (ms)=1600
   	 Physical memory (bytes) snapshot=183992320
   	 Virtual memory (bytes) snapshot=1874804736
   	 Total committed heap usage (bytes)=137953280
    ImportTsv
   	 Bad Lines=0
    File Input Format Counters
   	 Bytes Read=652
    File Output Format Counters
   	 Bytes Written=0

#### 3. Now, we need to go into the HBase shell and check that the data is correctly loaded. 

To do that, we'll use the `scan` command, which is similar to a SQL `SELECT`. It will scan over the entire table and retrieve the relevant data.

For example, the below code will return the _first 5 rows_ of the Retail table:

In [None]:
scan 'retail_table', {LIMIT => 5}

In [None]:
# Expected output:
Hbase::Table - retail_table
hbase(main):009:0> scan 'retail_table', {LIMIT => 5}
ROW                	COLUMN+CELL                                               	 
 1                 	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 1                 	column=cf:customer, timestamp=1643213704999, value=17850  	 
 1                 	column=cf:description, timestamp=1643213704999, value=WHITE HAN
                   	GING HEART T-LIGHT HOLDER                                 	 
 1                 	column=cf:price, timestamp=1643213704999, value=2.55      	 
 1                 	column=cf:quantity, timestamp=1643213704999, value=6      	 
 10                	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 10                	column=cf:customer, timestamp=1643213704999, value=13047  	 
 10                	column=cf:description, timestamp=1643213704999, value=ASSORTED
                   	COLOUR BIRD ORNAMENT                                      	 
 10                	column=cf:price, timestamp=1643213704999, value=1.69      	 
 10                	column=cf:quantity, timestamp=1643213704999, value=32     	 
 100               	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 100               	column=cf:customer, timestamp=1643213704999, value=14688  	 
 100               	column=cf:description, timestamp=1643213704999, value=60 TEATIM
                   	E FAIRY CAKE CASES                                        	 
 100               	column=cf:price, timestamp=1643213704999, value=0.55      	 
 100               	column=cf:quantity, timestamp=1643213704999, value=24     	 
 1000              	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 1000              	column=cf:customer, timestamp=1643213704999, value=14729  	 
 1000              	column=cf:description, timestamp=1643213704999, value=TOAST ITS
                    	- HAPPY BIRTHDAY                                         	 
 1000              	column=cf:price, timestamp=1643213704999, value=1.25      	 
 1000              	column=cf:quantity, timestamp=1643213704999, value=2      	 
 10000             	column=cf:country, timestamp=1643213704999, value=United Kingdo
                   	m                                                         	 
 10000             	column=cf:customer, timestamp=1643213704999, value=13174  	 
 10000             	column=cf:description, timestamp=1643213704999, value=SET OF 2
                   	TINS VINTAGE BATHROOM                                     	 
 10000             	column=cf:price, timestamp=1643213704999, value=4.25      	 
 10000             	column=cf:quantity, timestamp=1643213704999, value=2      	 
5 row(s) in 0.1360 seconds
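
The row order above (1, 10, 100, 1000, 10000) follows from HBase keeping rows sorted by the bytes of their row keys rather than by numeric value. Below is a minimal Python sketch of that ordering, assuming for illustration that the row keys are the integers 1 to 10000 stored as text (the actual range in `retail.csv` may differ).

```python
# Assumed for illustration: row keys are the numbers 1..10000 stored as text.
row_keys = [str(n).encode() for n in range(1, 10001)]

# Byte-wise (lexicographic) sort, the order in which a scan returns rows.
first_five = sorted(row_keys)[:5]
print(first_five)  # → [b'1', b'10', b'100', b'1000', b'10000']

# Zero-padded keys sort the same way numerically and byte-wise.
padded = sorted(f"{n:05d}".encode() for n in range(1, 10001))
print(padded[:3])  # → [b'00001', b'00002', b'00003']
```

Zero-padding is one common way to make byte order match numeric order; it is shown here as a suggestion and is not something the import above does.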


Take a close look at how the data is displayed in HBase, as it may seem confusing at first. Unlike a relational database, which stores data row by row, HBase takes a __column-based__ approach and groups columns into column families.

Each line of the `scan` output represents a single column value (a cell) and includes a timestamp that HBase assigns automatically. The __Row__ is the unique row key that tells HBase how the columns are connected to each other: cells that share a row key are part of the same logical row.
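
To make the cell-per-line layout more concrete, the Python sketch below regroups a few cells from the scan output into logical rows keyed by row key. It is only an in-memory illustration of the idea; the tuple layout and the function name `to_rows` are assumptions made for this example, not HBase structures.

```python
from collections import defaultdict

# (row key, "family:qualifier", timestamp, value), copied from the scan output.
cells = [
    ("1", "cf:country", 1643213704999, "United Kingdom"),
    ("1", "cf:customer", 1643213704999, "17850"),
    ("1", "cf:description", 1643213704999, "WHITE HANGING HEART T-LIGHT HOLDER"),
    ("1", "cf:price", 1643213704999, "2.55"),
    ("1", "cf:quantity", 1643213704999, "6"),
    ("10", "cf:country", 1643213704999, "United Kingdom"),
    ("10", "cf:customer", 1643213704999, "13047"),
]


def to_rows(cells):
    """Group cells that share a row key into one dictionary per logical row."""
    rows = defaultdict(dict)
    for row_key, column, _timestamp, value in cells:
        rows[row_key][column] = value
    return dict(rows)


rows = to_rows(cells)
print(rows["1"]["cf:price"])  # → 2.55
```

Each logical row ends up as a mapping from `family:qualifier` to value, which is closer to how a relational row would look.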

## Querying data in HBase

You can, of course, run more complex queries.

To find out how many total rows we have in the table, we can use the `count` command as follows:

In [None]:
count 'retail_table'

The output should look like:

<p align="center">
  <img src="images/hbase-count.png" width=600>
  <figcaption align="center"><cite>Output of HBase Count Command</cite></figcaption>
</p>

To do more advanced querying using filters (similar to SQL's `WHERE` clause), we'll first need to import 3 HBase classes:

- `SingleColumnValueFilter`
- `CompareFilter`
- `BinaryComparator`
    
These 3 classes work together to provide flexible filtering criteria.

To achieve this, run the below 3 commands from inside the HBase shell:

In [None]:
# Import the required 3 classes 
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter 
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.BinaryComparator

The output should be similar to:

<p align="center">
  <img src="images/hbase-filter-import.png" width=600>
  <figcaption align="center"><cite>HBase Class Imports</cite></figcaption>
</p>

Now we can run queries with specific filters. First, let's query the table for all rows where `country` is `United Kingdom`. The query has the following format:

In [None]:
scan 'retail_table', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('cf'), Bytes.toBytes('country'), CompareFilter::CompareOp.valueOf('EQUAL'),BinaryComparator.new(Bytes.toBytes('United Kingdom')))}

The output will look something like this:

<p align="center">
  <img src="images/hbase-scan-filter.png" width=600>
  <figcaption align="center"><cite>HBase Scan Command Output</cite></figcaption>
</p>

Next, let's run a query to check how many products have a `price equal to 12.75`:

In [None]:
scan 'retail_table', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('cf'), Bytes.toBytes('price'), CompareFilter::CompareOp.valueOf('EQUAL'),BinaryComparator.new(Bytes.toBytes('12.75')))}

The output should be `826 products` as indicated by the number of rows seen below:

<p align="center">
  <img src="images/hbase-filter-price.png" width=600>
  <figcaption align="center"><cite>Filtered Scan Results</cite></figcaption>
</p>

With the same combination of classes, any of the following comparison operators can be passed to `CompareFilter::CompareOp.valueOf` to compare column values:

- `EQUAL`
- `GREATER`
- `GREATER_OR_EQUAL`
- `LESS`
- `LESS_OR_EQUAL`
- `NOT_EQUAL`
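
One point worth keeping in mind with these operators: `BinaryComparator` compares raw bytes, so values imported as text are compared lexicographically rather than numerically. The Python sketch below mimics that behaviour with plain `bytes` comparisons. It is an illustration only, not HBase code, and the names `COMPARE_OPS` and `matches` are hypothetical.

```python
import operator

# Operator names from the list above, mapped to Python comparisons.
COMPARE_OPS = {
    "EQUAL": operator.eq,
    "NOT_EQUAL": operator.ne,
    "GREATER": operator.gt,
    "GREATER_OR_EQUAL": operator.ge,
    "LESS": operator.lt,
    "LESS_OR_EQUAL": operator.le,
}


def matches(cell_value: bytes, op_name: str, target: bytes) -> bool:
    """True if `cell_value <op> target` holds under byte-wise comparison."""
    return COMPARE_OPS[op_name](cell_value, target)


print(matches(b"12.75", "EQUAL", b"12.75"))  # → True
print(matches(b"9.5", "GREATER", b"12.75"))  # → True: b"9" sorts after b"1"
print(matches(b"100", "LESS", b"12.75"))     # → True: compared as text, not numbers
```

In other words, `EQUAL` and `NOT_EQUAL` behave as expected on text values, while range operators such as `GREATER` only give numeric results if the values were stored in a byte-order-preserving format.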

### HBase commands:

Below are some of the typical commands you would be using to interact with data in HBase:

- `put`
    -   This command allows you to insert a new cell or update the data in an already existing cell.

- `get`
    -   This command is used to read data from a table in HBase. It returns the values of one row at a time, identified by its row key.

- `delete`
    -   This command allows you to delete a specific cell in an HBase table.

- `deleteall`
    -   This command deletes all of the cells in a given row of a table.

- `scan`
    -   This command is used to view the data stored in an HBase table.

- `count`
    -   This command is used to count the number of rows of a table.

- `disable`
    -   This command disables (turns off) a table so that it can be deleted.

- `drop`
    -   This command deletes a disabled table.

- `truncate`
    -   This command does 3 things in sequence:
        -   Disables a table
        -   Drops a table
        -   Recreates the table with the same name
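
The relationship between `disable`, `drop` and `truncate` described above can be sketched with a toy in-memory model. The `ToyCatalog` class below is purely hypothetical and is not an HBase client; it only mirrors the rule that a table must be disabled before it is dropped, and that `truncate` chains the three steps.

```python
class ToyCatalog:
    """In-memory stand-in for table states; not an HBase client."""

    def __init__(self):
        self.tables = {}  # name -> {"enabled": bool, "families": list, "rows": dict}

    def create(self, name, families):
        self.tables[name] = {"enabled": True, "families": list(families), "rows": {}}

    def disable(self, name):
        self.tables[name]["enabled"] = False

    def drop(self, name):
        # Dropping is only allowed once the table has been disabled.
        if self.tables[name]["enabled"]:
            raise RuntimeError(f"table {name} is enabled; disable it first")
        del self.tables[name]

    def truncate(self, name):
        # Disable, drop, then recreate with the same name and column families.
        families = self.tables[name]["families"]
        self.disable(name)
        self.drop(name)
        self.create(name, families)


catalog = ToyCatalog()
catalog.create("retail_table", ["cf"])
catalog.tables["retail_table"]["rows"]["1"] = {"cf:price": "2.55"}
catalog.truncate("retail_table")
print(catalog.tables["retail_table"]["rows"])  # → {}
```

After `truncate`, the table still exists with its column family but holds no rows, whereas `disable` followed by `drop` removes it entirely.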


For a detailed explanation of HBase commands, check the following guide:
- [HBase Cheat Sheet](https://sparkbyexamples.com/hbase/hbase-shell-commands-cheat-sheet/)


## Key Takeaways

- Tables in HBase can be created using the `create` command. Tables can be queried using the `scan` and `get` commands, while data can be inserted using the `put` command.
- In order to delete an HBase table, we first need to `disable` the table and then `drop` the disabled table. The `truncate` command runs both of these steps and then recreates the table under the same name, which leaves it empty.