
Introduction

The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Requirements

At least 4 AWS instances
Hadoop installed and running (you must already have started start-dfs.sh, start-yarn.sh, and mr-jobhistory-daemon.sh; see the sketch below)
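If you are unsure whether the Hadoop daemons are up, a minimal check from the Namenode looks roughly like the following (assuming the Hadoop sbin and bin directories are already on your PATH from the Hadoop setup):

namenode:~$ start-dfs.sh
namenode:~$ start-yarn.sh
namenode:~$ mr-jobhistory-daemon.sh start historyserver
namenode:~$ jps

jps should list at least NameNode, ResourceManager and JobHistoryServer on the master.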

Install Hive

This installation process only needs to be executed on the Master/Namenode.

We will grab the Hive 1.2.0 release and save it to a Downloads folder. Next we will extract it into the /usr/local directory and rename the folder to simply hive.

namenode:~$ wget http://apache.mesi.com.ar/hive/stable/apache-hive-1.2.0-bin.tar.gz -P ~/Downloads
namenode:~$ sudo tar zxvf ~/Downloads/apache-hive-*.tar.gz -C /usr/local
namenode:~$ sudo mv /usr/local/apache-hive-* /usr/local/hive
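Since the archive was extracted with sudo, the files under /usr/local/hive are owned by root. Optionally, you can hand ownership to the login user (assumed here to be the default ubuntu user on the instance) so that later edits under /usr/local/hive/conf do not require sudo:

namenode:~$ sudo chown -R ubuntu:ubuntu /usr/local/hive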

Set Environment Variables

We will next add the Hive environment variables to ~/.profile:

…
…
…
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Source the file so the environment variables are in the current shell session.

namenode:~$ . ~/.profile
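To verify that the new environment variables are picked up, something along these lines should print the location and version of the Hive binary we just unpacked:

namenode:~$ which hive
namenode:~$ hive --version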

Bring up WebUIs

Let's open the WebUI to view the Hadoop cluster and the jobs that are submitted to the cluster. This can be viewed at public_dns_of_namenode:8088. The UI should look like the following:

We should also view the job history UI to see previous MapReduce jobs that we’ve run on the cluster. This can be seen at public_dns_of_namenode:19888. It should look like the following:

The example shows jobs that have been run, but if this is your first time, there should be no Retired Jobs in the Job History WebUI. We will now run through an example of writing a Hive script to compute some aggregate functions on a sample dataset.

Hive Example Problem

We will now run through an example which will involve performing some cleaning of the data and then running a MapReduce job using the Hive framework.

First download the dataset from the Google Drive link to your local machine: price data

The data is a couple of days of tick-level Gold Futures data from the CME Group. Each row contains a timestamp, the last price at that time, and the number of contracts (or volume) traded at that price. The rows are in the following format:

Schema:
<year><month><day> <hour><minute><seconds>; <price>; <volume>

Example:
20140602 22:34:12; 1250.5; 12  

Let's now transfer this file over to your Hadoop cluster and load it onto HDFS into the folder /user/price_data, which is where the Hive external table below will look for its data.

local_machine:~$ scp -i   ubuntu@:~/
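As a purely hypothetical example (substitute your own key pair, dataset file name, and the Namenode's public DNS), the transfer would look something like:

local_machine:~$ scp -i ~/.ssh/my-aws-key.pem ~/Downloads/price_data*.txt ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/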

namenode:~$ hdfs dfs -mkdir -p /user/price_data
namenode:~$ hdfs dfs -copyFromLocal ~/price_data*.txt /user/price_data
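To double-check that the file landed on HDFS, list the directory:

namenode:~$ hdfs dfs -ls /user/price_data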

We would like to compute the average price and total volume traded at each 30 minute interval. To do this we will need to map each timestamp to a 30 minute time slot. Next we will perform an average on the price and a sum of the contracts traded in each 30 minute time slot.

Mapping each timestamp to a 30 minute interval requires some string manipulation and can be done within a Hive query. We will now create a Hive script named price_data.sql and place the following into the file:

namenode:~$ touch price_data.sql

DROP TABLE price_data;

CREATE EXTERNAL TABLE IF NOT EXISTS price_data ( time string, price double, volume int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\073" LINES TERMINATED BY "\n"
STORED AS TEXTFILE LOCATION '/user/price_data';

DESCRIBE price_data;

INSERT OVERWRITE DIRECTORY '/user/price_data_output_hive'
SELECT min30, AVG(price), SUM(volume)
FROM (SELECT concat(substr(time,1,11), if(substr(time,12,2)<"30","00","30"), "00") AS min30, price, volume FROM price_data) compressed
GROUP BY min30
ORDER BY min30;
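As a quick sanity check of the time-slot expression, you can evaluate it on a literal timestamp from the Hive CLI. The snippet below assumes the timestamps in the data file are stored without the colon separators shown in the earlier example (i.e. 20140602 223412); if they do contain colons, the substr offsets would need to shift accordingly.

hive> SELECT concat(substr('20140602 223412',1,11), if(substr('20140602 223412',12,2)<'30','00','30'), '00');

This should return 20140602 223000, the 30 minute slot that the tick falls into.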

The Hive script can be run with the following command:

namenode:~$ hive -f price_data.sql

While the job is running you can go to public_dns_of_namenode:8088 and view the submitted job and its status. Once the job is finished, it will show up in the job history under public_dns_of_namenode:19888. The output for this dataset, which can be viewed with the following command, should look somewhat like the following:

namenode:~$ hdfs dfs -cat /user/price_data_output_hive/part-r-00000

We can see here that the date times have been mapped to 30 minute windows, the price is now the average price over each window (a floating point number), and the volume is larger because it is the sum of all contracts traded in each window.

Bonus: How do the run times compare between Pig and Hive? How about ease of programming?

Hive CLI

We will create a CSV file and populate it with some data. This should be done on the Hadoop namenode.

namenode:~$ nano user_info.csv

Add the following lines to the CSV file:

1,Steven,Smith,24,42000,Male
2,Pawan,Lathwal,24,30000,Male
3,Mariya,Gilbert,25,44000,Female
4,Taylor,Lockwood,24,41000,Male
5,Sanjiv,Singh,25,51000,Male

namenode:~$ hdfs dfs -mkdir /data
namenode:~$ hdfs dfs -copyFromLocal user_info.csv /data

Start Hive on the namenode.

namenode:~$ hive
hive> create external table user_info (id INT, fname STRING, lname STRING, age INT, salary INT, gender STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/data/user_info.csv' INTO TABLE user_info;

Check the table contents

hive> SELECT * FROM user_info;

It would look as follows:
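Since the table is a straight projection of the CSV columns, the query should return the five rows we loaded, roughly like this (Hive separates columns with tabs):

1    Steven    Smith    24    42000    Male
2    Pawan    Lathwal    24    30000    Male
3    Mariya    Gilbert    25    44000    Female
4    Taylor    Lockwood    24    41000    Male
5    Sanjiv    Singh    25    51000    Male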