hive-bulkload-hbase

Import a Hive table into HBase as fast as possible.

Directories

  • bin: Contains the shell script that starts the program.
  • src: Contains the source code and the test code.
  • schema: Contains the schema file of a table.

Compilation

$ mvn clean compile

$ mvn clean package

$ mvn assembly:assembly

Description

HBase gives random read and write access to your big data, but getting that data into HBase in the first place can be a challenge. There are three ways to do it:

  1. Use the client API to put the data one row at a time (a minimal sketch follows this list).
  2. Use the Hive HBase integration (see HBaseIntegration).
  3. Use HBase's built-in bulk load capabilities.
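
For contrast with the bulk load path, here is a minimal sketch of the first method, assuming the classic HTable/Put client API (HBase 0.9x era); the table name, column family, and values are placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SinglePutExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "hbase_table");  // placeholder table name
        // Each Put travels the full HBase write path (WAL, MemStore, flush),
        // which is what makes this approach slow for bulk data.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
        table.put(put);
        table.close();
    }
}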

However, the first two methods are slower than the third, which bypasses the normal write path entirely: you create the HFiles yourself and place them directly in HDFS. The HBase bulk load process consists of two steps if Hive and HBase are on the same cluster:

  1. HFile preparation via a MapReduce job.
  2. Importing the HFiles into HBase using LoadIncrementalHFiles.doBulkLoad() (e.g. Driver2.java).

The bulk load process consists of three steps if Hive and HBase are on different clusters:

  1. HFile preparation via a MapReduce job.
  2. Copying the HFiles from the Hive cluster to the HBase cluster.
  3. Importing the HFiles into HBase via the HBase command line on the HBase cluster.

Usage

The aim of the MapReduce job is to generate HBase data files (HFiles) from your input RCFile using HFileOutputFormat. Before you generate the HFiles, you need the Hive table's schema, which you can obtain in one of the following ways:

  • Reading the Hive metadata (a JDBC sketch follows this list).
    • Using JDBC to read it from MySQL.
    • Using HCatalog to read it from MySQL.
  • Parsing a file that records the schema. In my opinion, this is more efficient than reading the metadata, even if a table contains several thousand columns.
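
As an illustration of the JDBC route, here is a minimal sketch that reads a table's column names and types from the metastore database. The JDBC URL, credentials, and metastore table names (TBLS, SDS, COLUMNS_V2) are assumptions; the metastore schema varies across Hive versions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MetastoreSchemaReader {
    public static void main(String[] args) throws Exception {
        // Assumed metastore location and credentials -- adjust to your setup.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://metastore-host:3306/hive", "hive", "secret");
        // Join the table, storage descriptor, and column tables to list the
        // columns of one Hive table in declaration order.
        String sql = "SELECT c.COLUMN_NAME, c.TYPE_NAME "
                   + "FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID "
                   + "JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID "
                   + "WHERE t.TBL_NAME = ? ORDER BY c.INTEGER_IDX";
        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setString(1, "my_hive_table");  // placeholder table name
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close(); ps.close(); conn.close();
    }
}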

The Mapper class outputs ImmutableBytesWritable, KeyValue pairs, which the subsequent partitioner and reducer use to create the HFiles.
There is no need to write your own reducer: HFileOutputFormat.configureIncrementalLoad(), called in the driver code, sets up the correct reducer and partitioner for you.
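
Here is a minimal sketch of such a mapper plus the driver wiring, assuming an HBase 0.9x-era API; the row key and column layout are placeholders, and RCFile parsing is elided in favor of plain text input.

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HFileMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");  // placeholder parsing
        byte[] rowKey = Bytes.toBytes(fields[0]);
        // Emit one KeyValue per output cell, keyed by the row key.
        KeyValue kv = new KeyValue(rowKey, Bytes.toBytes("cf"),
                Bytes.toBytes("col1"), Bytes.toBytes(fields[1]));
        context.write(new ImmutableBytesWritable(rowKey), kv);
    }
}

// In the driver, configureIncrementalLoad() wires up the reducer and the
// TotalOrderPartitioner against the target table's region boundaries:
//   job.setMapperClass(HFileMapper.class);
//   job.setMapOutputKeyClass(ImmutableBytesWritable.class);
//   job.setMapOutputValueClass(KeyValue.class);
//   HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "hbase_table"));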
Then, if Hive and HBase are on different clusters, copy the generated HFiles from one cluster to the other:

hadoop distcp hdfs://mycluster-hive/hfile/hbase hdfs://mycluster-hbase/hbase/test

Finally, import the HFiles into HBase via the HBase command line on the HBase cluster:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hbase/test hbase_table

Or import the HFiles into HBase via Java code on the HBase cluster (e.g. Driver2.java):

// Importing the generated HFiles into an HBase table
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
HTable htable = new HTable(conf, "hbase_table");
loader.doBulkLoad(new Path(outputPath), htable);
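
Note that doBulkLoad() moves the HFiles into the table's region directories rather than copying them, so a successful load consumes the job output directory.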
