### Hive assignment. Task 1

The purpose of this task is to create an external table on the posts data of the stackoverflow.com website.

Create your own database and `USE` it. Create an external table `posts_sample_external` over the sample dataset of posts in the `/data/stackoverflow_1000` directory. Create a managed table `posts_sample` and populate it with the data from the external table. The `posts_sample` table should be partitioned by year and by month of post creation. Provide the output of a query that selects the number of lines in each partition, in the format:

`year <tab> month <tab> lines count`

where year is in `YYYY` format and month is in `YYYY-MM` format. The result is the 3rd line of the last query output.

Example:

`2008 2008-07 123`
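
The expected line layout can be checked mechanically. The following is a minimal Python sketch and not part of the assignment: the helper name `is_valid_result_line` is hypothetical, and it assumes only the tab-separated `YYYY`, `YYYY-MM`, count layout described above.

```python
import re

# Hypothetical helper: checks the "year <tab> month <tab> lines count" layout.
RESULT_LINE = re.compile(r"^(\d{4})\t(\d{4})-(\d{2})\t(\d+)$")

def is_valid_result_line(line):
    """Return True if the line is 'YYYY<TAB>YYYY-MM<TAB>count' with a consistent year."""
    match = RESULT_LINE.match(line)
    if match is None:
        return False
    year, month_year, month, _count = match.groups()
    # The month column must belong to the year column and be a real month.
    return year == month_year and 1 <= int(month) <= 12

print(is_valid_result_line("2008\t2008-07\t123"))  # True
print(is_valid_result_line("2008 2008-07 123"))    # False: spaces, not tabs
```

A check like this catches the most likely formatting slip, which is separating the columns with spaces instead of tabs.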

In [1]:
%%writefile query.hql

CREATE DATABASE IF NOT EXISTS mskorokhod;
USE mskorokhod;


Overwriting query.hql


In [2]:
%%writefile -a query.hql

-- Create posts_sample_external table

DROP TABLE IF EXISTS `posts_sample_external`;
CREATE EXTERNAL TABLE IF NOT EXISTS `posts_sample_external`(
  `id` int,
  `post_type_id` tinyint,
  `date` string,
  `owner_user_id` int,
  `parent_id` int,
  `score` int,
  `favorite_count` int,
  `tags` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^<row.*?(?=.*\\bId=\"(\\d+)\")(?=.*\\bPostTypeId=\"(\\d+)\")(?=.*\\bCreationDate=\"([^"]*)\")(?=.*\\bOwnerUserId=\"(\\d+)\")?(?=.*\\bParentId=\"(\\d+)\")?(?=.*\\bScore=\"(-?\\d+)\")(?=.*\\bFavoriteCount=\"(\\d+)\")?(?=.*\\bTags=\"([^"]*)\")?.*',
  "input.regex.case.insensitive" = 'true'
)
STORED AS TEXTFILE
LOCATION
  '/data/stackexchange100/posts'
;


Appending to query.hql


In [3]:
%%writefile -a query.hql

-- Create managed table and fill data

DROP TABLE IF EXISTS `posts_sample`;
CREATE TABLE IF NOT EXISTS `posts_sample`(
  `id` int,
  `post_type_id` tinyint,
  `date` string,
  `owner_user_id` int,
  `parent_id` int,
  `score` int,
  `favorite_count` int,
  `tags` array <string>
)
PARTITIONED BY ( 
  `year` string, 
  `month` string
)
CLUSTERED BY ( 
  `date`
) 
SORTED BY ( 
  id ASC
) 
INTO 8 BUCKETS
STORED AS TEXTFILE
LOCATION
  '/user/hjudge/task1'
;

-- Let Hive derive every partition value (year, month) from the SELECT output
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE `posts_sample`
PARTITION (`year`, `month`)
SELECT
  `id`,
  `post_type_id`,
  `date`,
  `owner_user_id`,
  `parent_id`,
  `score`,
  `favorite_count`,
  split(regexp_replace(`tags`, '(&lt\;|&gt\;$)', ''), '&gt\;') AS `tags`,
  regexp_extract(`date`, '^(\\d{4})', 1) AS `year`,
  regexp_extract(`date`, '^(\\d{4}-\\d{2})', 1) AS `month`
FROM `posts_sample_external`
;


Appending to query.hql
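
To see what the derived columns in the `INSERT` look like, the three expressions can be mimicked outside Hive. This is a rough Python sketch under stated assumptions: the function names are hypothetical, the sample values are made up for illustration, and Python's `re` stands in for Hive's Java regular expressions (the two agree on these simple patterns). The `\;` in the query appears to be there only to keep the Hive CLI from treating the semicolon as a statement terminator, so the sketch uses a plain `;`.

```python
import re

def split_tags(raw_tags):
    """Mimic split(regexp_replace(tags, '(&lt;|&gt;$)', ''), '&gt;')."""
    if raw_tags is None:
        # Assumption: posts without a Tags attribute arrive as NULL and stay NULL.
        return None
    # Drop every '&lt;' and the trailing '&gt;', then split on the remaining '&gt;'.
    stripped = re.sub(r"(&lt;|&gt;$)", "", raw_tags)
    return stripped.split("&gt;")

def partition_keys(date):
    """Mimic the two regexp_extract calls that derive `year` and `month`."""
    year = re.match(r"^(\d{4})", date)
    month = re.match(r"^(\d{4}-\d{2})", date)
    # Assumption for this sketch: an unparseable date yields empty strings.
    return (year.group(1) if year else "", month.group(1) if month else "")

# Made-up sample values, shaped like the escaped attributes in the posts XML.
print(split_tags("&lt;java&gt;&lt;hive&gt;"))     # ['java', 'hive']
print(partition_keys("2008-07-31T21:42:52.667"))  # ('2008', '2008-07')
```

Because `month` already contains the year, every `month=YYYY-MM` directory nests under exactly one `year=YYYY` directory.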


In [4]:
%%writefile -a query.hql

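-- Count lines per partition and return the 3rd row in (year, month) order.
-- The explicit ORDER BY makes "the first three rows" deterministic.

SELECT * FROM (
    SELECT year, month, count(1) AS lines_count
    FROM posts_sample
    GROUP BY year, month
    ORDER BY year, month
    LIMIT 3
) AS SubQ
ORDER BY month DESC
LIMIT 1
;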


Appending to query.hql
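
The intent of the final query, counting rows per `(year, month)` partition and reporting the third row, can be sketched in plain Python. This is an illustration only: the function name and the sample keys are hypothetical, and it assumes that "the 3rd line" means the third partition in ascending `(year, month)` order.

```python
from collections import Counter

def third_partition_line(partition_keys):
    """Count rows per (year, month) and format the 3rd group in ascending order."""
    counts = Counter(partition_keys)   # GROUP BY year, month with count(1)
    ordered = sorted(counts.items())   # ascending by (year, month)
    if len(ordered) < 3:
        return None                    # fewer than three partitions
    (year, month), lines_count = ordered[2]
    return f"{year}\t{month}\t{lines_count}"

# Made-up (year, month) keys, one per post; not taken from the real dataset.
sample = [
    ("2008", "2008-07"), ("2008", "2008-08"), ("2008", "2008-08"),
    ("2008", "2008-09"), ("2008", "2008-09"), ("2008", "2008-09"),
    ("2008", "2008-10"),
]
print(third_partition_line(sample))  # 2008<TAB>2008-09<TAB>3
```

Taking the first three groups and then the largest month among them is the same as indexing the third group directly, as long as the groups are sorted in ascending order first.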


In [5]:
! hive -f query.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 0.526 seconds
OK
Time taken: 0.016 seconds
OK
Time taken: 0.894 seconds
OK
Time taken: 0.288 seconds
OK
Time taken: 0.02 seconds
OK
Time taken: 0.151 seconds
Query ID = jovyan_20171101144848_082d6dcf-37b7-4f2d-8183-50f8a6f9ffbe
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1509524427668_0072, Tracking URL = http://1abfcb4c4022:8088/proxy/application_1509524427668_0072/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1509524427668_0072
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2017-11-01 14:48:24,625 Stage-1 map = 0%,  reduce = 0%
2017-11-01 14:48:42,739 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 32.2 sec
2017-11-01 14:48:48,978 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 44.31 sec
2017-11-01 14:48:54,182 Sta

Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2011/month=2011-09
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2011/month=2011-10
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2011/month=2011-11
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2011/month=2011-12
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2012/month=2012-01
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2012/month=2012-02
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-st

Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2016/month=2016-03
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2016/month=2016-04
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2016/month=2016-05
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2016/month=2016-06
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2016/month=2016-07
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-staging_hive_2017-11-01_14-48-17_302_6355347109182599690-1/-ext-10000/year=2016/month=2016-08
Moving data to: hdfs://localhost:9000/user/jovyan/task1/.hive-st

Partition mskorokhod.posts_sample{year=2012, month=2012-05} stats: [numFiles=1, numRows=3633, totalSize=237630, rawDataSize=233997]
Partition mskorokhod.posts_sample{year=2012, month=2012-06} stats: [numFiles=1, numRows=3494, totalSize=228150, rawDataSize=224656]
Partition mskorokhod.posts_sample{year=2012, month=2012-07} stats: [numFiles=1, numRows=3795, totalSize=248511, rawDataSize=244716]
Partition mskorokhod.posts_sample{year=2012, month=2012-08} stats: [numFiles=1, numRows=3794, totalSize=246157, rawDataSize=242363]
Partition mskorokhod.posts_sample{year=2012, month=2012-09} stats: [numFiles=1, numRows=3492, totalSize=227343, rawDataSize=223851]
Partition mskorokhod.posts_sample{year=2012, month=2012-10} stats: [numFiles=1, numRows=4006, totalSize=262596, rawDataSize=258590]
Partition mskorokhod.posts_sample{year=2012, month=2012-11} stats: [numFiles=1, numRows=3839, totalSize=251064, rawDataSize=247225]
Partition mskorokhod.posts_sample{year=2012, month=2012-12} stats: [numFiles

Starting Job = job_1509524427668_0074, Tracking URL = http://1abfcb4c4022:8088/proxy/application_1509524427668_0074/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1509524427668_0074
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-11-01 14:50:19,054 Stage-1 map = 0%,  reduce = 0%
2017-11-01 14:50:25,299 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.72 sec
2017-11-01 14:50:30,504 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.68 sec
MapReduce Total cumulative CPU time: 5 seconds 680 msec
Ended Job = job_1509524427668_0074
Launching Job 2 out of 4
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1509524427668_0075, Tracking URL = h