<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# HIVE Lab

---

## Introduction

In previous labs we introduced Hadoop and used it to run increasingly complex MapReduce jobs.

It would be nice, however, to use the familiar SQL syntax we learned with relational databases when working with Hadoop. Luckily, the Hadoop ecosystem includes several sub-projects (tools), such as Sqoop, Pig, and Hive, that complement the core Hadoop modules and make this possible. In particular:

- _Sqoop_ imports and exports data between HDFS and relational databases (RDBMS).
- _Pig_ is a platform with a procedural scripting language for writing MapReduce operations.
- _Hive_ is a platform for writing SQL-like scripts that run as MapReduce operations.

In this lab we will focus on **Hive**.

## Hive
Hive enables analysis of large data sets using a language very similar to standard ANSI SQL. This means anyone who can write SQL queries can access data stored on the Hadoop cluster. Hive offers a simple interface for:

- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (e.g., Google Analytics)
- Predictive modeling, hypothesis testing

Let's start Hive by typing `hive` at the prompt of our remote machine.


You should see a prompt like this:

```bash
hive>
```

The `SHOW TABLES;` command lists the tables in the current database:

```bash
hive> SHOW TABLES;
```

**Do you remember the equivalent postgres command?**

Let's create a table called gutenberg where we'll store the word counts for the project_gutenberg documents.

```SQL
CREATE EXTERNAL TABLE gutenberg (
    word STRING,
    count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/output_gutenberg';
```

We have just created a table called `gutenberg` that references the output folder of the project_gutenberg Hadoop MapReduce job we ran earlier.
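
Because the table is `EXTERNAL`, Hive does not copy or convert the files in that folder; it applies the declared schema (`word STRING`, `count INT`, tab-delimited) to each line when a query runs. The plain-Python sketch below mimics that idea on a few sample lines. The `Row` type, the `parse_row` helper, and the malformed third line are made up for illustration and are not part of Hive.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    word: str
    count: Optional[int]


def parse_row(line: str) -> Row:
    """Split one line on the tab delimiter and cast the second field to int."""
    fields = line.rstrip("\n").split("\t")
    word = fields[0]
    try:
        count = int(fields[1])
    except (IndexError, ValueError):
        # A missing or non-numeric field becomes None, standing in for NULL.
        count = None
    return Row(word, count)


# The first two lines reuse values from the lab output; the third is deliberately malformed.
sample_lines = ["the\t63656\n", "of\t34367\n", "broken line\n"]
rows = [parse_row(line) for line in sample_lines]
for row in rows:
    print(row)
```

The point of the sketch is that nothing is validated when the table is created: a line that does not match the schema only shows up as a missing value at query time.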

Go back to the file browser, or run the following command from the shell (outside Hive), to check the contents of that folder:

```bash
hadoop fs -cat output_gutenberg/part*
```

Now that we have defined the table in Hive, we can query it using a SQL-like statement:

```SQL
hive> SELECT * FROM gutenberg ORDER BY count DESC LIMIT 10;
```

As you will see, this starts a MapReduce job on the output files and should return something like this:

```bash
Total MapReduce CPU Time Spent: 4 seconds 460 msec
OK
the 63656
of  34367
and 32787
to  31399
a   24811
in  18168
I   18070
his 13485
he  13299
was 13029
Time taken: 37.311 seconds, Fetched: 10 row(s)
```
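
Conceptually, the query sorts the rows by `count` in descending order and keeps the first ten. A minimal plain-Python sketch of the same idea, using a few rows from the output above in shuffled order (the `top_n` helper is hypothetical and only illustrates the logic, not how Hive executes it):

```python
import heapq

# A handful of (word, count) rows taken from the sample output, deliberately unsorted.
counts = [("of", 34367), ("a", 24811), ("the", 63656), ("and", 32787), ("to", 31399)]


def top_n(rows, n=10):
    """Return the n rows with the largest count, largest first."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


print(top_n(counts, 3))
```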

## Word count in Hive

Let's go ahead and perform the word count for one of the Project Gutenberg books using Hive.

### 1. Alice in Wonderland word count

Let's start by counting the words of Alice in Wonderland (pg11.txt).

- Create a table called `alice_text` that maps to the lines of the text file.
- Create a table called `alice` that holds the word counts.
    - Hint: you will need the `LATERAL VIEW` keywords to split each line of the text table into words.

### alice table

```SQL
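-- One row per line of the text file; the whole line goes into a single column
DROP TABLE IF EXISTS alice_text;

CREATE TABLE alice_text (
    text STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;

-- LOCAL means the path is on the machine's file system, not in HDFS
LOAD DATA LOCAL INPATH 'project_gutenberg/pg11.txt' OVERWRITE INTO TABLE alice_text;

-- Split each line on spaces, emit one row per word, then count per word
DROP TABLE IF EXISTS alice;

CREATE TABLE alice AS
SELECT word, COUNT(*) AS cnt
FROM alice_text
LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;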
```
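
To see what the `LATERAL VIEW explode(split(text, ' '))` step followed by `GROUP BY word` is doing, here is a rough plain-Python equivalent. It is only a sketch of the logic: the `explode_words` helper is hypothetical, and the three sample lines stand in for rows of `alice_text` rather than the real file contents.

```python
from collections import Counter


def explode_words(lines):
    """Yield one word per output row, splitting each line on single spaces."""
    for text in lines:
        for word in text.split(" "):
            yield word


# Stand-in for the rows of alice_text (one row per line of the file).
alice_text = [
    "Alice was beginning to get very tired",
    "of sitting by her sister on the bank",
    "and of having nothing to do",
]

# Equivalent of GROUP BY word with COUNT(*).
alice = Counter(explode_words(alice_text))
print(alice.most_common(3))
```

Note that splitting on a single space turns consecutive spaces into empty-string tokens in this sketch, so it is worth checking whether a blank "word" appears in the Hive results as well.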


You can use these three resources as references to find the appropriate commands:

- https://www.linkedin.com/pulse/word-count-program-using-r-spark-map-reduce-pig-hive-python-sahu
- http://www.hadooplessons.info/2014/12/in-this-post-i-am-going-to-discuss-how.html
- http://stackoverflow.com/questions/10039949/word-count-program-in-hive

### 2. Peter Pan word count

Repeat the operation, creating a new table called `peter` to store the word counts from Peter Pan (pg16.txt).

Note that you can get the definition of a table by using the `describe` command:


```SQL
hive> DESCRIBE alice;
hive> DESCRIBE peter;
```

### peter table

### 3. Joins in Hive 

The advantage of having a SQL-like interface is that it makes join operations much easier to perform.

Find the words common to the `alice` and `peter` tables and sort them by the sum of their counts in decreasing order. Limit the output to the 20 most common words.

The result should look like:

|word|alice_count|peter_count|sum|
|---|---|---|---|
|the|1664|2331|3995|
|and|780|1396|2176|
|...|...|...|...|
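
The logic of the exercise (keep only words present in both tables, add the two counts, sort by the sum, keep the top 20) can be sketched in plain Python as follows. This is an illustration of the join semantics, not the Hive query itself: the `common_words` helper is hypothetical, the counts for "the" and "and" come from the table above, and the other entries are invented.

```python
# Toy word counts; "rabbit" and "hook" are invented to show non-matching words.
alice = {"the": 1664, "and": 780, "rabbit": 5}
peter = {"the": 2331, "and": 1396, "hook": 7}


def common_words(left, right, limit=20):
    """Inner join on word, add both counts, sort by the sum in decreasing order."""
    joined = [
        (word, left[word], right[word], left[word] + right[word])
        for word in left.keys() & right.keys()
    ]
    joined.sort(key=lambda row: row[3], reverse=True)
    return joined[:limit]


for row in common_words(alice, peter):
    print(row)
```

Words that appear in only one of the two dictionaries drop out, which is the behavior of an inner join on the `word` column.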

## Additional Resources

- [Serde example](https://community.hortonworks.com/articles/8313/apache-hive-csv-serde-example.html)
- [Logs Page](http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html)
- [Cloudera Twitter example](https://github.com/cloudera/cdh-twitter-example)
- [AWS Serde example](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-gs.html)