# Hive assignment 1

The purpose of this task is to create an external table on the posts data of the `stackoverflow.com` website.

## Step 1. Intro. Creation of the DB

Let's create the sandbox database where you will complete your assignment.

<b>Note!</b> This code must not be in your submission. Please remove it from the notebook before submitting.

In [1]:
%%writefile creation_db.hql
DROP DATABASE IF EXISTS jovyan CASCADE;

Overwriting creation_db.hql


In [2]:
%%writefile -a creation_db.hql
CREATE DATABASE jovyan ;

Appending to creation_db.hql


In [3]:
! hive -f creation_db.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 2.508 seconds
OK
Time taken: 0.403 seconds


**Don't forget to remove this code before submission!**


## Step 2. Exploration of the dataset

Okay, we have created the database. Let's create your own table for users and posts.

First of all, let's look at the datasets: `posts` is located at `/data/stackexchange1000/posts` and `users` is located at `/data/stackexchange1000/users`. Print the first three rows of those datasets.


In [4]:
%%writefile get_data.hql

CREATE DATABASE IF NOT EXISTS jovyan;

USE jovyan;

DROP TABLE IF EXISTS data_raw;

CREATE EXTERNAL TABLE data_raw (
     line STRING
)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION '/data/stackexchange1000/posts/';

SELECT COUNT(*) FROM data_raw;
SELECT * FROM data_raw LIMIT 3;

Overwriting get_data.hql


In [5]:
! hive -f get_data.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 0.695 seconds
OK
Time taken: 0.017 seconds
OK
Time taken: 0.088 seconds
OK
Time taken: 0.738 seconds
Query ID = jovyan_20200813071717_e9128864-8bf3-4ce1-a0d7-c7baa221fae3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1597302836302_0001, Tracking URL = http://2bcb91321cef:8088/proxy/application_1597302836302_0001/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1597302836302_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-08-13 07:18:03,863 Stage-1

In [9]:
%%writefile formatted_data.hql

DROP TABLE IF EXISTS formatted_raw;

CREATE TABLE formatted_raw (
  tag STRING,
  count INT
)
PARTITIONED BY (year INT);

SELECT COUNT(*) FROM formatted_raw;
SELECT * FROM formatted_raw LIMIT 3;

Writing formatted_data.hql


In [10]:
! hive -f formatted_data.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 0.794 seconds
OK
Time taken: 0.652 seconds
Query ID = jovyan_20200813072020_1d6cec07-3981-492c-aa32-f958321954a9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1597302836302_0002, Tracking URL = http://2bcb91321cef:8088/proxy/application_1597302836302_0002/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1597302836302_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-08-13 07:20:20,657 Stage-1 map = 0%,  reduce = 0%
2020-08-13 07:20:27,092 Stage-1 ma

In [4]:
%%writefile mapper.py
#!/usr/bin/env python3

import sys
import re
from datetime import datetime

for line in sys.stdin:
    tags = re.findall(r"<row\s.*CreationDate=\"([^\"]+)\"\s.*Tags=\"([^\"]+)", line, flags=re.UNICODE)
    if len(tags) == 0:
        continue
    tags = tags[0]

    createDate = datetime.strptime(tags[0].split(".")[0], '%Y-%m-%dT%H:%M:%S')
    tags = tags[1].replace("&gt;&lt;", ","). \
        replace("&lt;", "").replace("&gt;", "").lower().split(",")
    for tag in tags:
        print(tag, 1, createDate.year, sep="\t")

Overwriting mapper.py
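To see what the mapper emits for a single row, the same parsing steps can be run on an in-memory string. This is only a sketch: `parse_row` is a hypothetical helper that mirrors `mapper.py`, and the sample row is made up and shortened for illustration.

```python
import re
from datetime import datetime

# Same regular expression as in mapper.py
ROW_RE = r"<row\s.*CreationDate=\"([^\"]+)\"\s.*Tags=\"([^\"]+)"

def parse_row(line):
    """Return the (tag, 1, year) tuples that mapper.py would print for one line."""
    found = re.findall(ROW_RE, line)
    if not found:
        return []  # not a post row, or a row without tags
    creation, raw_tags = found[0]
    year = datetime.strptime(creation.split(".")[0], "%Y-%m-%dT%H:%M:%S").year
    tags = (raw_tags.replace("&gt;&lt;", ",")
                    .replace("&lt;", "")
                    .replace("&gt;", "")
                    .lower()
                    .split(","))
    return [(tag, 1, year) for tag in tags]

# Hypothetical row, not taken from the dataset
sample = ('<row Id="4" PostTypeId="1" CreationDate="2008-07-31T21:42:52.667" '
          'Score="5" Tags="&lt;C#&gt;&lt;winforms&gt;" />')
rows = parse_row(sample)
print(rows)
```

Lines that are not post rows, or rows without a `Tags` attribute, produce nothing, which is why the mapper simply skips them.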


In [22]:
%%writefile reducer.py
#!/usr/bin/env python3

import sys
from collections import Counter

dict_counter = Counter()

for line in sys.stdin:
    # A raw line always ends with "\n", so strip it before checking for blanks
    line = line.strip()
    if not line:
        continue
    tag, count, year = line.split("\t")
    dict_counter[year+"\x01"+tag] += int(count)

for key, count in dict_counter.items():
    year, tag = key.split("\x01")
    if count > 10 and (year == '2010' or year == '2016'):
        print(tag, count, year, sep="\t")

Overwriting reducer.py
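The aggregation in the reducer can be tried without Hive by feeding it a list of tab-separated lines. The sketch below is an assumption-laden rewrite, not the submitted script: `reduce_lines` is a hypothetical helper, and the threshold and years are passed as parameters instead of being hard-coded.

```python
from collections import Counter

def reduce_lines(lines, min_count=10, years=("2010", "2016")):
    """Sum counts per (year, tag) and keep the frequent tags of the selected years."""
    counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tag, count, year = line.split("\t")
        counter[(year, tag)] += int(count)
    return sorted(
        (tag, total, year)
        for (year, tag), total in counter.items()
        if total > min_count and year in years
    )

# Hypothetical mapper output: 11 x java in 2010, 10 x php in 2010, 20 x java in 2012
lines = ["java\t1\t2010\n"] * 11 + ["php\t1\t2010\n"] * 10 + ["java\t1\t2012\n"] * 20
result = reduce_lines(lines)
print(result)
```

Only `java` in 2010 survives here: `php` does not exceed the threshold, and 2012 is not one of the selected years.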


In [14]:
! hdfs dfs -ls 

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:49 precreate
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2008
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2008/month=2008-08
-rwxr-xr-x   1 jovyan supergroup       7464 2018-07-28 17:51 precreate/posts/year=2008/month=2008-08/000000_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2008/month=2008-09
-rwxr-xr-x   1 jovyan supergroup      31728 2018-07-28 17:51 precreate/posts/year=2008/month=2008-09/000000_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2008/month=2008-10
-rwxr-xr-x   1 jovyan supergroup

-rwxr-xr-x   1 jovyan supergroup     165058 2018-07-28 17:51 precreate/posts/year=2012/month=2012-04/000000_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2012/month=2012-05
-rwxr-xr-x   1 jovyan supergroup     173774 2018-07-28 17:51 precreate/posts/year=2012/month=2012-05/000000_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2012/month=2012-06
-rwxr-xr-x   1 jovyan supergroup     166502 2018-07-28 17:51 precreate/posts/year=2012/month=2012-06/000000_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2012/month=2012-07
-rwxr-xr-x   1 jovyan supergroup     181487 2018-07-28 17:51 precreate/posts/year=2012/month=2012-07/000000_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2012/month=2012-08
-rwxr-xr-x   1 jovyan supergroup     178336 2018-07-28 17:51 precreate/posts/year=2012/month=2012-08/000000_0
drwxr-xr-x   - jovyan supergroup          0 20

-rwxr-xr-x   1 jovyan supergroup     226802 2018-07-28 17:51 precreate/posts/year=2015/month=2015-06/000001_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2015/month=2015-07
-rwxr-xr-x   1 jovyan supergroup     234679 2018-07-28 17:51 precreate/posts/year=2015/month=2015-07/000001_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2015/month=2015-08
-rwxr-xr-x   1 jovyan supergroup     221281 2018-07-28 17:51 precreate/posts/year=2015/month=2015-08/000001_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2015/month=2015-09
-rwxr-xr-x   1 jovyan supergroup     213721 2018-07-28 17:51 precreate/posts/year=2015/month=2015-09/000001_0
drwxr-xr-x   - jovyan supergroup          0 2018-07-28 17:52 precreate/posts/year=2015/month=2015-10
-rwxr-xr-x   1 jovyan supergroup     224685 2018-07-28 17:51 precreate/posts/year=2015/month=2015-10/000001_0
drwxr-xr-x   - jovyan supergroup     

In [20]:
%%writefile final.hql

SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

ADD FILE mapper.py;
ADD FILE reducer.py;

CREATE DATABASE IF NOT EXISTS jovyan;

USE jovyan;

DROP TABLE IF EXISTS data_raw;

CREATE EXTERNAL TABLE data_raw (
    line STRING
)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION '/data/stackexchange1000/posts/';

--- SELECT COUNT(*) FROM data_raw;
--- SELECT * FROM data_raw LIMIT 3;

DROP TABLE IF EXISTS formatted_raw;

CREATE TABLE formatted_raw (
    tag STRING,
    count INT
)
PARTITIONED BY (year INT);

--- SELECT COUNT(*) FROM formatted_raw;
--- SELECT * FROM formatted_raw LIMIT 3;

FROM (
    FROM data_raw
    SELECT TRANSFORM (line)
    USING "mapper.py" AS tag, count, year
    DISTRIBUTE BY year SORT BY tag
) Tags
INSERT OVERWRITE TABLE formatted_raw
    PARTITION (year)
    SELECT TRANSFORM(Tags.tag, Tags.count, Tags.year)
    USING "reducer.py" AS tag, count, year;

SELECT year, tag, count
FROM (
    SELECT year, tag, count, rnk
    FROM (
        SELECT year, tag, count, RANK() OVER(
            PARTITION BY year 
            ORDER BY count DESC, tag) AS rnk
        FROM formatted_raw
    ) AS tgs
    WHERE rnk <= 10
) AS tgs
ORDER BY year ASC, count DESC;
--- INTO OUTFILE 'hw3_hive.out';

Overwriting final.hql
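The last query in `final.hql` keeps the ten most frequent tags of each year. The same idea can be sketched in plain Python on a few made-up rows; `top_tags` is a hypothetical helper, and ties in `count` are broken by `tag`, as in the `ORDER BY count DESC, tag` clause of the window.

```python
from itertools import groupby

def top_tags(rows, n=10):
    """rows: iterable of (year, tag, count). Return the n most frequent tags per year."""
    # Sort by year, then by count descending, then by tag to break ties
    ordered = sorted(rows, key=lambda r: (r[0], -r[2], r[1]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        result.extend(list(group)[:n])
    return result

# Made-up counts, for illustration only
rows = [
    (2016, "java", 30),
    (2010, "php", 15),
    (2010, "c#", 15),
    (2016, "python", 40),
    (2010, "java", 20),
]
top2 = top_tags(rows, n=2)
print(top2)
```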


In [None]:
! hive -f final.hql > hw3_hive.out


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties


As you can see, those rows contain some information about posts and users in XML format.

<b>Question.</b> Which fields for users and posts do you think are the most important for the analysis? And for joining tables? 

<h3><b>Please, check your answer with this information!</b></h3>

Lines that do not start with the `row` tag should be ignored. A valid row contains the following fields, and their order is not defined:

* Id (integer) - id of the post
* PostTypeId (integer: 1 or 2) - 1 for questions, 2 for answers
* CreationDate (date) - post creation date in the format "YYYY-MM-DDTHH:MM:SS.ms"
* Tags (string, optional) - list of post tags, each tag is wrapped with html entities `&lt;` and `&gt;`
* OwnerUserId (integer, optional) - user id of the post's author
* ParentId (integer, optional) - for answers - id of the question
* Score (integer) - score (votes) of a question or an answer, can be negative (!)
* FavoriteCount (integer, optional) - how many times the question was added to favorites
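To get a feeling for these fields before writing any regular expression, a single row can be parsed with Python's standard XML parser. This is only an exploration sketch: `parse_post` and the sample row are hypothetical, and the assignment itself still expects a regex-based Hive table.

```python
import re
import xml.etree.ElementTree as ET

def parse_post(line):
    """Parse one line of the posts dump; return None for lines that are not rows."""
    line = line.strip()
    if not line.startswith("<row"):
        return None
    attrs = ET.fromstring(line).attrib  # attribute order does not matter here
    return {
        "id": int(attrs["Id"]),
        "post_type_id": int(attrs["PostTypeId"]),
        "creation_date": attrs["CreationDate"],
        # The XML parser decodes &lt; and &gt;, so tags look like <c#><.net>
        "tags": re.findall(r"<([^>]+)>", attrs.get("Tags", "")),
        "owner_user_id": int(attrs["OwnerUserId"]) if "OwnerUserId" in attrs else None,
        "score": int(attrs["Score"]),
    }

# Made-up row with the attributes in an arbitrary order
sample = ('<row Score="-2" Id="7" PostTypeId="1" '
          'CreationDate="2010-01-02T03:04:05.678" Tags="&lt;c#&gt;&lt;.net&gt;" />')
post = parse_post(sample)
print(post)
```

Optional fields such as `OwnerUserId` come back as `None` when they are missing, which corresponds to `NULL` values in a Hive table.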

The second part of the dataset contains StackOverflow users.

The fields are the following and their order is also not defined:

* Id (integer) - user id
* Reputation (integer) - user's reputation
* CreationDate (string) - creation date in the format "YYYY-MM-DDTHH:MM:SS.ms"
* DisplayName (string) - user's name
* Location (string, optional) - user's country
* Age (integer, optional) - user's age

## Step 3. Train your regexp skills

In this step you will find out how to parse information from complex rows, and you will write a few parsing examples yourself. Here are some general rules for parsing:

1. To create a regular expression that matches strings containing two patterns in any order, use the so-called "positive lookahead assertion" with the `?=` group modifier. For example, both strings "Washington Irving" and "Irving Washington" match the pattern:
```
(?=.*Washington)(?=.*Irving)
```
2. To capture groups, use round brackets. The pattern `(?=.*(Washington))(?=.*(Irving))` captures `Washington` and `Irving` from both strings: "William Arthur Irving Washington was an English first-class Cricketer" and "Washington Irving was an American writer".
3. Use `\b` to specify boundaries of words and increase accuracy of your pattern. For example: pattern `(?=.*\bID=(\d+))(?=.*\bUserID=(\d+))` captures `1` and `2` from the string `ID=1 UserID=2`, whereas pattern without `\b`: `(?=.*ID=(\d+))(?=.*UserID=(\d+))` returns the wrong groups: `2` and `2`.
4. In Hive, the `input.regex` pattern in the SERDEPROPERTIES of an external table must describe the whole input string, so add `.*` at the end of the pattern.
5. The beginning of the string must be covered as well. That's why the pattern should start with a lazy `.*?` placed before the lookaheads.
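Rules 1 to 3 can be checked quickly in Python. Keep in mind that Hive uses Java regular expressions, so this is a sketch under the assumption that lookaheads, capturing groups and `\b` behave the same way in both engines for these simple patterns.

```python
import re

# Rules 1 and 2: lookaheads match in any order and still capture the groups
both = re.compile(r"(?=.*(Washington))(?=.*(Irving))")
first = both.match("Washington Irving was an American writer")
second = both.match("William Arthur Irving Washington was an English first-class Cricketer")
print(first.groups(), second.groups())

# Rule 3: without \b the greedy .* finds "ID=" inside "UserID="
text = "ID=1 UserID=2"
with_boundary = re.match(r"(?=.*\bID=(\d+))(?=.*\bUserID=(\d+))", text)
without_boundary = re.match(r"(?=.*ID=(\d+))(?=.*UserID=(\d+))", text)
print(with_boundary.groups())     # ('1', '2')
print(without_boundary.groups())  # ('2', '2')
```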

To sum up, you can create your first regex for parsing Id from posts!

<b>Question!</b> What will be your first regex for parsing Id from the posts? Don't forget to apply rules 4 and 5!

<div class="panel-group">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" href="#collapse-answer">Check your answer!</a>
      </h4>
    </div>
    <div id="collapse-answer" class="panel-collapse collapse">
      <div class="panel-body">The correct answer is `".*?(?=.*\\bId=\"(\\d+)\").*"` </div>
    </div>
  </div>
</div>
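Rules 4 and 5 are easy to verify in Python as well. The sketch below uses `re.fullmatch` as a stand-in for the requirement that `input.regex` must cover the whole line (an assumption about how closely Python mimics the Hive SerDe), and the row is a shortened, hypothetical version of a real one.

```python
import re

# Shortened row for illustration; real rows contain more attributes
row = ('<row Id="1394" PostTypeId="2" ParentId="1390" '
       'CreationDate="2008-08-04T16:38:03.667" Score="16" OwnerUserId="91" />')

# Lazy prefix, lookahead with a capturing group, and a trailing .*
full_pattern = r'.*?(?=.*\bId="(\d+)").*'
# The same pattern without the trailing .* cannot cover the whole row
short_pattern = r'.*?(?=.*\bId="(\d+)")'

matched = re.fullmatch(full_pattern, row)
not_matched = re.fullmatch(short_pattern, row)

print(matched.group(1))  # 1394
print(not_matched)       # None
```

Note that `\b` keeps the pattern from picking up `PostTypeId`, `ParentId` or `OwnerUserId` instead of `Id`.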

Let's create the first external table with a single column that contains only the `Id` field. Let's name it `posts_external_only_id`.
You can watch the lecture for the SerDe format: <a href="https://www.coursera.org/learn/big-data-analysis/lecture/wAGe6/hive-analytics-regexserde-views">Serde Format</a> and <a href="/notebooks/demos/course02_week02-Demo_submission.ipynb#2.-Creation-the-external-table">creation of external table</a> tutorial.

In [None]:
%%writefile demo_example.hql

-- adding necessary JARs and including database
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

USE demodb;
DROP TABLE IF EXISTS posts_external_only_id;


-- Create external table 

-- Your code here

In [None]:
!hive -f demo_example.hql

Hooray! You have created your first table. Let's take a look at it!

In [None]:
%%writefile describe.hql

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

USE demodb;
DESCRIBE posts_external_only_id;

In [None]:
!hive -f describe.hql

Let's check that the data is parsed correctly. To do this, write a select query that returns the first 10 rows.

In [None]:
%%writefile my_first_select.hql
-- Your code here

In [None]:
!hive -f my_first_select.hql

How many posts are there in the dataset? Don't forget to exclude the `NULL` values!
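Why are there `NULL` values at all? As described in the RegexSerDe documentation, a line that does not match the regex gets `NULL` in every column. The Python sketch below imitates this on a few made-up lines; `extract_id` is a hypothetical helper and `None` plays the role of `NULL`.

```python
import re

ID_REGEX = re.compile(r'.*?(?=.*\bId="(\d+)").*')

def extract_id(line):
    """Return the post id, or None when the line does not match (NULL in Hive)."""
    match = ID_REGEX.fullmatch(line)
    return int(match.group(1)) if match else None

# Made-up file content: the header and the wrapping tags are not posts
lines = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<posts>',
    '<row Id="4" PostTypeId="1" />',
    '<row Id="6" PostTypeId="1" />',
    '</posts>',
]
ids = [extract_id(line) for line in lines]
post_count = sum(1 for post_id in ids if post_id is not None)
print(ids, post_count)
```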

In [None]:
%%writefile how_many_posts.hql
-- Your code here

In [None]:
!hive -f how_many_posts.hql

Try to parse different fields: for example, the day and the month of the creation date. Remember that Hive also takes column values from capturing groups placed inside lookaheads.
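As a starting point, here is a sketch of a pattern with two lookaheads that captures the id and the parts of the creation date. The pattern and the sample row are only an illustration checked with Python's `re` module, not the reference answer, and Hive would additionally need the backslashes escaped.

```python
import re

# One lookahead per attribute, so the attribute order does not matter
pattern = (r'.*?(?=.*\bId="(\d+)")'
           r'(?=.*\bCreationDate="(\d{4})-(\d{2})-(\d{2})T).*')

# Made-up row where CreationDate comes before Id
row = '<row CreationDate="2008-08-04T16:38:03.667" Score="16" Id="1394" />'

match = re.fullmatch(pattern, row)
post_id, year, month, day = match.groups()
print(post_id, year, month, day)  # 1394 2008 08 04
```

The groups are numbered in the order in which they appear in the pattern, not in the row, so the first column is always the id.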

Now you are ready to complete your task! Before that, you can check your regular expression in Python:

In [None]:
import re

In [None]:
CHECK_ROW = '<row Id="1394" PostTypeId="2" ParentId="1390" CreationDate="2008-08-04T16:38:03.667" Score="16" Body="&lt;p&gt;Not sure how credible &lt;a href=&quot;http://www.builderau.com.au/program/windows/soa/Getting-started-with-Windows-Server-2008-Core-edition/0,339024644,339288700,00.htm&quot; rel=&quot;nofollow noreferrer&quot;&gt;this source is&lt;/a&gt;, but:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;The Windows Server 2008 Core edition can:&lt;/p&gt;&#xA;  &#xA;  &lt;ul&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the file server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the Hyper-V virtualization server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the Directory Services role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the DHCP server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the IIS Web server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the DNS server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run Active Directory Lightweight Directory Services.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the print server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;/ul&gt;&#xA;  &#xA;  &lt;p&gt;The Windows Server 2008 Core edition cannot:&lt;/p&gt;&#xA;  &#xA;  &lt;ul&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run a SQL Server.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run an Exchange Server.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run Internet Explorer.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run Windows Explorer.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Host a remote desktop session.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run MMC snap-in consoles locally.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;/ul&gt;&#xA;&lt;/blockquote&gt;&#xA;" OwnerUserId="91" LastEditorUserId="1" LastEditorDisplayName="Jeff Atwood" LastEditDate="2008-08-27T13:02:50.273" LastActivityDate="2008-08-27T13:02:50.273" CommentCount="1" />'

In [None]:
CHECK_REGEX = <Paste your regex>

In [None]:
result = re.match(CHECK_REGEX, CHECK_ROW)

In [None]:
# Sanity check
assert result.group(0) == CHECK_ROW

In [None]:
# Check that your groups are correct
print(result.groups())

## Step 4. Complete the assignment

In [None]:
%%writefile task1_create_external_table.hql
-- Create external table posts_sample_external with suitable values
-- Your code here
    

In [None]:
!hive -f task1_create_external_table.hql

Make sure that you have created your table correctly. Select the first 10 posts from the dataset.

In [None]:
%%writefile task1_check_select.hql

-- Write select query for the first 10 rows
-- Your code here

In [None]:
!hive -f task1_check_select.hql

Create the managed table `posts_sample`, partitioned by year and by month.
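The `precreate/posts` listing shown earlier in this notebook uses directories such as `year=2008/month=2008-08`. Assuming your table should follow the same layout (this is an assumption, check the task requirements), the partition values can be derived from `CreationDate` as in the sketch below; both helpers are hypothetical.

```python
def partition_values(creation_date):
    """Derive the (year, month) partition values from a CreationDate string."""
    year = creation_date[:4]    # e.g. "2008"
    month = creation_date[:7]   # e.g. "2008-08", as in the directory listing
    return year, month

def partition_path(table_dir, creation_date):
    """Build the partition directory that a row with this date would land in."""
    year, month = partition_values(creation_date)
    return f"{table_dir}/year={year}/month={month}"

values = partition_values("2008-08-04T16:38:03.667")
path = partition_path("posts_sample", "2008-08-04T16:38:03.667")
print(values, path)
```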

In [None]:
%%writefile task1_create_managed_table.hql
-- create managed table
-- Check that this table contains info about year and month
-- Your code here

In [None]:
!hive -f task1_create_managed_table.hql

Populate data from the table `posts_sample_external` to the table `posts_sample`. Don't forget about the partitioning rules!

In [None]:
%%writefile task1_insert_table.hql

-- Insert data to the managed table

USE <your database>;
-- filling managed posts table from external one
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Your code here for inserting data

In [None]:
!hive -f task1_insert_table.hql

Make sure that your table contains the appropriate data about posts.

In [None]:
%%writefile task1_watch_new_table.hql
-- Your code here

In [None]:
!hive -f task1_watch_new_table.hql

Take the third row of the dataset, with the posts sorted in ascending order (first by year, then by month).

In [None]:
%%writefile task1_result.hql
-- Your code here

In [None]:
!hive -f task1_result.hql

## Step 5. Submission part. Do not touch!! And simple run all cells below!

Copy the cells from <a href="#Step-4.-Complete-the-assignment">Step 4</a> and <a href="#Step-5.-Submission-part.-Do-not-touch!!-And-simple-run-all-cells-below!">Step 5</a> into a new notebook. Run all the cells, and submit the copied notebook!

In [None]:
!cat task1_create_external_table.hql > task1.hql
!cat task1_create_managed_table.hql >> task1.hql
!cat task1_insert_table.hql >> task1.hql
!cat task1_result.hql >> task1.hql

Take a look at your submission query!

In [None]:
!cat task1.hql

In [None]:
%%javascript

$(document).ready(function() {
    console.log('Ready');
    
    
    function is_hive_command(list_tokens) {
        return list_tokens.indexOf('hive') > -1 && 
             list_tokens.indexOf('f') > -1 &&
             list_tokens.indexOf('-') > -1 && 
             list_tokens.indexOf('!') > -1 &&
             list_tokens.indexOf('hql') > -1 && 
             list_tokens.indexOf('writefile') == -1;
    } 
    
    function collectText(input_tag) {

        var result_string = [];
        $.each($(input_tag).children(), function(index, child) {
            result_string.push($(child).text());
        });
        return [result_string, is_hive_command(result_string)];
    };
    
    var filtered_results = $(".cell.code_cell.rendered").filter(function(index, element) {
        var out = collectText($(element).find('.CodeMirror-line').find('span'));
        console.log(out);
        return collectText($(element).find('.CodeMirror-line').find('span'))[1];
    });
    $(filtered_results).remove();
});

In [None]:
%%bash
hive -f task1.hql

Congratulations! You have completed the assignment! Now you can submit it to the system and get your results!
