# Hive assignment 1

The purpose of this task is to create an external table on the posts data of the `stackoverflow.com` website.

## Step 1. Intro. Creation of the DB

Let's create the sandbox database where you will complete your assignment.

<b>Note!</b> This code must not appear in your submission. Remove it from the notebook before you submit.

In [4]:
%%writefile creation_db.hql
DROP DATABASE IF EXISTS demodb CASCADE;

Overwriting creation_db.hql


In [5]:
%%writefile -a creation_db.hql
CREATE DATABASE demodb LOCATION '/user/jovyan/test_metastore';

Appending to creation_db.hql


In [6]:
! hive -f creation_db.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 6.245 seconds
OK
Time taken: 0.286 seconds


**Don't forget to remove this code before submission!**


## Step 2. Exploration of the dataset

Now that the database exists, let's create your own tables for users and posts.

First, let's look at the datasets: `posts` is located at `/data/stackexchange1000/posts` and `users` is located at `/data/stackexchange1000/users`. Print the first three lines of each dataset.


In [4]:
! head -n 3 /data/stackexchange1000/posts/part-00000

reporter:status:Reading 	
<row Id="1394" PostTypeId="2" ParentId="1390" CreationDate="2008-08-04T16:38:03.667" Score="16" Body="&lt;p&gt;Not sure how credible &lt;a href=&quot;http://www.builderau.com.au/program/windows/soa/Getting-started-with-Windows-Server-2008-Core-edition/0,339024644,339288700,00.htm&quot; rel=&quot;nofollow noreferrer&quot;&gt;this source is&lt;/a&gt;, but:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;The Windows Server 2008 Core edition can:&lt;/p&gt;&#xA;  &#xA;  &lt;ul&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the file server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the Hyper-V virtualization server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the Directory Services role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the DHCP server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the IIS Web server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the DNS server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run

In [5]:
! head -n 3 /data/stackexchange1000/users/part-00000

reporter:status:Reading 	
<row Id="756" Reputation="2358" CreationDate="2008-08-08T15:31:50.013" DisplayName="Simon Gillbee" LastAccessDate="2016-12-09T15:38:03.453" WebsiteUrl="http://simon.gillbee.com" Location="Pearland, TX" AboutMe="&lt;p&gt;Personally, I am a husband, step-father, grandfather, Christ-worshiper, singer, worship leader, computer programmer, reader, game player, kite flyer, generally all around fun guy (or that fungi?).&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Professionally, I've been developing both commercial and proprietary systems for the last 20 years. These days I'm primarily writing enterprise-scale client and server software using .NET and loving it!&lt;/p&gt;&#xA;&#xA;&lt;h1&gt;SOreadytohelp&lt;/h1&gt;&#xA;" Views="478" UpVotes="352" DownVotes="25" Age="45" AccountId="587" />	
<row Id="2050" Reputation="4177" CreationDate="2008-08-20T00:32:49.217" DisplayName="Eric Platon" LastAccessDate="2016-12-10T22:24:27.217" WebsiteUrl="" Location="Tokyo, Japan" AboutMe="" Views=

As you can see, those rows contain some information about posts and users in XML format.

<b>Question.</b> Which fields for users and posts do you think are the most important for the analysis? And for joining tables? 

<h3><b>Please, check your answer with this information!</b></h3>

Lines that do not start with the `<row` tag should be ignored. A valid row contains the following fields, in no particular order:

* Id (integer) - id of the post
* PostTypeId (integer: 1 or 2) - 1 for questions, 2 for answers
* CreationDate (date) - post creation date in the format "YYYY-MM-DDTHH:MM:SS.ms"
* Tags (string, optional) - list of post tags, each tag is wrapped with html entities `&lt;` and `&gt;`
* OwnerUserId (integer, optional) - user id of the post's author
* ParentId (integer, optional) - for answers - id of the question
* Score (integer) - score (votes) of a question or an answer, can be negative (!)
* FavoriteCount (integer, optional) - how many times the question was added in the favorites

The second part of the dataset contains StackOverflow users.

The fields are the following and their order is also not defined:

* Id (integer) - user id
* Reputation (integer) - user's reputation
* CreationDate (string) - creation date in the format "YYYY-MM-DDTHH:MM:SS.ms"
* DisplayName (string) - user's name
* Location (string, optional) - user's country
* Age (integer, optional) - user's age
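
Before writing any Hive code, it can help to see how such rows break down into fields. The following is a minimal Python sketch, not part of the assignment: the helper names `parse_row` and `split_tags` are hypothetical, the second sample row is made up, and the sketch assumes that attribute values never contain a raw `"` character (in the dataset quotes appear as `&quot;`).

```python
import html
import re

# name="value" pairs; assumes values never contain a raw double quote.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# A single tag after HTML entities have been decoded, e.g. <c#>.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_row(line):
    """Return the attributes of a <row ... /> line as a dict, or None for any other line."""
    line = line.strip()
    if not line.startswith("<row"):
        return None
    return dict(ATTR_RE.findall(line))


def split_tags(raw_tags):
    """Decode '&lt;c#&gt;&lt;.net&gt;' into ['c#', '.net']."""
    return TAG_RE.findall(html.unescape(raw_tags))


sample_lines = [
    "reporter:status:Reading",
    '<row Id="1394" PostTypeId="2" ParentId="1390" CreationDate="2008-08-04T16:38:03.667" Score="16" OwnerUserId="91" />',
    # Made-up question row: different field order, negative score, tags present.
    '<row Score="-3" PostTypeId="1" Id="7" Tags="&lt;c#&gt;&lt;.net&gt;" CreationDate="2008-08-01T00:00:00.000" />',
]

rows = [row for row in map(parse_row, sample_lines) if row is not None]
for row in rows:
    # Optional fields are simply absent from the dict.
    print(row["Id"], row.get("ParentId"), split_tags(row.get("Tags", "")))
```

Because the fields are looked up by name, their order in the line does not matter, and optional fields such as `ParentId` or `Tags` come back as missing rather than breaking the parse.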

## Step 3. Train your regexp skills

In this step you will learn how to parse information from complex rows and try out a few parsing examples. There are some general rules for parsing:

1. To write a regular expression that matches strings containing two patterns in any order, use the so-called "positive lookahead assertion" with the `?=` group modifier. For example, both strings "Washington Irving" and "Irving Washington" match the pattern:
```
(?=.*Washington)(?=.*Irving)
```
2. To capture groups, use round brackets. The pattern `(?=.*(Washington))(?=.*(Irving))` captures `Washington` and `Irving` from both strings: "William Arthur Irving Washington was an English first-class Cricketer" and "Washington Irving was an American writer".
3. Use `\b` to mark word boundaries and make your pattern more accurate. For example, the pattern `(?=.*\bID=(\d+))(?=.*\bUserID=(\d+))` captures `1` and `2` from the string `ID=1 UserID=2`, whereas the pattern without `\b`, `(?=.*ID=(\d+))(?=.*UserID=(\d+))`, returns the wrong groups: `2` and `2`.
4. In Hive, the `input.regex` value in the SERDEPROPERTIES of an external table must describe the whole input string, so add `.*` at the end of the pattern.
5. The beginning of the string must be covered as well, so start the pattern with the lazy `.*?`, placed before the lookaheads.
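
The rules above can be tried out in Python before moving to Hive. This is a small sketch with Python's `re` module; it assumes that Java regular expressions (which Hive uses) behave the same way for these particular constructs, and the last sample row is shortened for illustration.

```python
import re

# Rule 1: two lookaheads match regardless of the order of the words.
both = re.compile(r"(?=.*Washington)(?=.*Irving)")
order_free = [bool(both.match(s)) for s in ("Washington Irving", "Irving Washington")]

# Rule 2: round brackets inside the lookaheads capture the words.
captured = re.match(r"(?=.*(Washington))(?=.*(Irving))", "Irving Washington").groups()

# Rule 3: without \b the greedy .* first finds "ID=" inside "UserID=".
text = "ID=1 UserID=2"
with_b = re.match(r"(?=.*\bID=(\d+))(?=.*\bUserID=(\d+))", text).groups()
without_b = re.match(r"(?=.*ID=(\d+))(?=.*UserID=(\d+))", text).groups()

# Rules 4 and 5: leading .*? and trailing .* make the pattern cover the whole line.
whole = re.fullmatch(r'.*?(?=.*\bId="(\d+)").*', '<row PostTypeId="2" Id="1394" />')

print(order_free)      # [True, True]
print(captured)        # ('Washington', 'Irving')
print(with_b)          # ('1', '2')
print(without_b)       # ('2', '2')
print(whole.group(1))  # 1394
```

Note that `\b` also keeps `Id=` from being found inside `PostTypeId=`, since there is no word boundary between `e` and `I`.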

To sum up, you can now create your first regex for parsing Id from posts!

<b>Question!</b> What will be your first regex for parsing Id from the posts? Don't forget to apply rules 4 and 5!

<div class="panel-group">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" href="#collapse-answer">Check your answer!</a>
      </h4>
    </div>
    <div id="collapse-answer" class="panel-collapse collapse">
      <div class="panel-body">The correct answer is `".*?(?=.*\\bId=\"(\\d+)\").*"` </div>
    </div>
  </div>
</div>

Let's create the first external table with a single column that contains only the `Id` field. Let's name it `posts_external_only_id`.
For background, see the lecture on the <a href="https://www.coursera.org/learn/big-data-analysis/lecture/wAGe6/hive-analytics-regexserde-views">Serde Format</a> and the <a href="/notebooks/demos/course02_week02-Demo_submission.ipynb#2.-Creation-the-external-table">creation of external table</a> tutorial.

In [6]:
%%writefile demo_example.hql

-- adding necessary JARs and including database
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

USE demodb;
DROP TABLE IF EXISTS posts_external_only_id;


-- Create external table 

-- Your code here
CREATE EXTERNAL TABLE posts_external_only_id (
    id string
)
ROW FORMAT
    SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        "input.regex" = '.*?(?=.*\\bId=\"(\\d+)\").*' 
    )
LOCATION '/data/stackexchange1000/posts';

Writing demo_example.hql


Hooray! You have created your first table. Let's take a look at it!

In [None]:
%%writefile describe.hql

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

USE demodb;
DESCRIBE posts_external_only_id;

Let's check that the data is parsed correctly. To do so, write a select query that returns the first 10 rows.

In [None]:
%%writefile my_first_select.hql
-- Your code here
SELECT * FROM demodb.posts_external_only_id
LIMIT 10;

How many posts are there in the dataset? Don't forget to exclude the `NULL` values!

In [None]:
%%writefile how_many_posts.hql
-- Your code here
SELECT count(*) FROM demodb.posts_external_only_id
WHERE id IS NOT NULL;
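
Why do `NULL` values appear at all? Lines that do not match `input.regex`, such as the `reporter:status:Reading` line at the top of the file, are returned as rows whose columns are all `NULL`. The sketch below imitates that behaviour in Python; the `deserialize` helper is hypothetical and the rows are shortened, so treat it as an illustration rather than Hive's actual implementation.

```python
import re

# Same pattern as input.regex above, written as a Python raw string.
INPUT_REGEX = re.compile(r'.*?(?=.*\bId="(\d+)").*')

lines = [
    "reporter:status:Reading",
    '<row Id="1394" PostTypeId="2" />',
    '<row PostTypeId="1" Id="3543" />',
]


def deserialize(line):
    """Return the captured Id, or None (Hive's NULL) when the whole line does not match."""
    match = INPUT_REGEX.fullmatch(line)
    return match.group(1) if match else None


ids = [deserialize(line) for line in lines]
total_rows = len(ids)                                       # like count(*) with no filter
non_null_rows = sum(1 for value in ids if value is not None)  # like filtering on IS NOT NULL

print(ids)            # [None, '1394', '3543']
print(total_rows)     # 3
print(non_null_rows)  # 2
```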

Try to parse other fields as well, for example the day and the month of the creation date. Remember that Hive also fills columns from capturing groups that are placed inside lookaheads.

Now you are ready to complete your task! Before you do, you can check your regular expression for parsing:

In [None]:
import re

In [None]:
CHECK_ROW = '<row Id="1394" PostTypeId="2" ParentId="1390" CreationDate="2008-08-04T16:38:03.667" Score="16" Body="&lt;p&gt;Not sure how credible &lt;a href=&quot;http://www.builderau.com.au/program/windows/soa/Getting-started-with-Windows-Server-2008-Core-edition/0,339024644,339288700,00.htm&quot; rel=&quot;nofollow noreferrer&quot;&gt;this source is&lt;/a&gt;, but:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;The Windows Server 2008 Core edition can:&lt;/p&gt;&#xA;  &#xA;  &lt;ul&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the file server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the Hyper-V virtualization server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the Directory Services role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the DHCP server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the IIS Web server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the DNS server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run Active Directory Lightweight Directory Services.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run the print server role.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;/ul&gt;&#xA;  &#xA;  &lt;p&gt;The Windows Server 2008 Core edition cannot:&lt;/p&gt;&#xA;  &#xA;  &lt;ul&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run a SQL Server.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run an Exchange Server.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run Internet Explorer.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run Windows Explorer.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Host a remote desktop session.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;li&gt;&lt;p&gt;Run MMC snap-in consoles locally.&lt;/p&gt;&lt;/li&gt;&#xA;  &lt;/ul&gt;&#xA;&lt;/blockquote&gt;&#xA;" OwnerUserId="91" LastEditorUserId="1" LastEditorDisplayName="Jeff Atwood" LastEditDate="2008-08-27T13:02:50.273" LastActivityDate="2008-08-27T13:02:50.273" CommentCount="1" />'

In [None]:
CHECK_REGEX = '.*?(?=.*\\bId=\"(\\d+)\")(?=.*\\bPostTypeId=\"(\\d+)\")(?=.*\\bParentId=\"(\\d+)\")(?=.*\\bCreationDate=\"(\\d+).*\")(?=.*\\bCreationDate=\"(\\d{4}-\\d+).*\")(?=.*\\bCreationDate=\"\\d{4}-\\d{2}-(\\d+).*\").*'

In [None]:
result = re.match(CHECK_REGEX, CHECK_ROW)

In [None]:
# Sanity check
assert result.group(0) == CHECK_ROW

In [None]:
# Check that your groups are correct
print(result.groups())
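
Note that `CHECK_REGEX` requires every listed field to be present, so a question row (which has no `ParentId`) would not match at all. One possible way to handle optional fields is to make the body of the lookahead optional, so that the group is left empty instead of failing the whole match. This is a sketch under assumptions: the question row is made up, both rows are shortened, and how Hive treats an unmatched group should be verified on the cluster.

```python
import re

# ParentId is mandatory: rows without it fail to match.
REQUIRED = r'.*?(?=.*\bId="(\d+)")(?=.*\bParentId="(\d+)").*'
# ParentId is optional: the inner non-capturing group may match zero times.
OPTIONAL = r'.*?(?=.*\bId="(\d+)")(?=(?:.*\bParentId="(\d+)")?).*'

answer = '<row Id="1394" PostTypeId="2" ParentId="1390" Score="16" />'
question = '<row Id="1390" PostTypeId="1" Score="5" />'  # made-up question row

required_on_question = re.fullmatch(REQUIRED, question)
optional_on_question = re.fullmatch(OPTIONAL, question).groups()
optional_on_answer = re.fullmatch(OPTIONAL, answer).groups()

print(required_on_question)  # None
print(optional_on_question)  # ('1390', None)
print(optional_on_answer)    # ('1394', '1390')
```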

## Step 4. Complete the assignment

In [7]:
%%writefile task1_create_external_table.hql
-- Create external table posts_sample_external with suitable values
-- Your code here
-- adding necessary JARs and including database
-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

USE demodb;
DROP TABLE IF EXISTS posts_sample_external;


-- Create external table 

-- Your code here
CREATE EXTERNAL TABLE posts_sample_external (
    Id string,
    CreationYear string,
    CreationMonth string
)
ROW FORMAT
    SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        "input.regex" = '.*?(?=.*\\bId=\"(\\d+)\")(?=.*\\bCreationDate=\"(\\d+).*\")(?=.*\\bCreationDate=\"(\\d{4}-\\d+).*\").*' 
    )
LOCATION '/data/stackexchange1000/posts';

Writing task1_create_external_table.hql


In [8]:
!hive -f task1_create_external_table.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 1.138 seconds
OK
Time taken: 0.099 seconds
OK
Time taken: 0.898 seconds


Make sure that you have created your table correctly. Select the first 10 posts from the dataset.

In [9]:
%%writefile task1_check_select.hql

-- Write select query for the first 10 rows
-- Your code here
SELECT * FROM demodb.posts_sample_external
LIMIT 10;

Writing task1_check_select.hql


In [10]:
!hive -f task1_check_select.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
NULL	NULL	NULL
1394	2008	2008-08
3543	2008	2008-08
4521	2008	2008-08
8689	2008	2008-08
9062	2008	2008-08
14671	2008	2008-08
16307	2008	2008-08
18780	2008	2008-08
18929	2008	2008-08
Time taken: 2.87 seconds, Fetched: 10 row(s)


Create the managed table `posts_sample`, partitioned by month and by year.

In [11]:
%%writefile task1_create_managed_table.hql
-- create managed table
-- Check that this table contains info about year and month
-- Your code here
-- adding necessary JARs and including database
-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

USE demodb;

DROP TABLE IF EXISTS posts_sample;

CREATE TABLE posts_sample (
    Id string
)
PARTITIONED BY (
    CreationMonth string,
    CreationYear string
);


Writing task1_create_managed_table.hql


In [12]:
!hive -f task1_create_managed_table.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 1.128 seconds
OK
Time taken: 0.096 seconds
OK
Time taken: 0.996 seconds


Populate data from the table `posts_sample_external` to the table `posts_sample`. Don't forget about the partitioning rules!

In [13]:
%%writefile task1_insert_table.hql

-- Insert data to the managed table

USE demodb;
-- filling managed posts table from external one
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2048;
SET hive.exec.max.dynamic.partitions.pernode=256;
SET hive.exec.max.created.files=10000;
SET hive.enforce.bucketing=true;
SET hive.mapred.supports.subdirectories=true;

-- Your code here for inserting data
FROM posts_sample_external pse
INSERT overwrite table posts_sample
PARTITION (CreationMonth, CreationYear)
SELECT pse.Id, pse.CreationMonth, pse.CreationYear;



Writing task1_insert_table.hql
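
With dynamic partitioning, the partition values are taken from the last columns of the `SELECT`, in the same order as in the `PARTITION` clause, and each distinct combination ends up in its own `column=value` subdirectory. The Python sketch below only imitates this grouping; the `partition_path` helper is hypothetical and the third row is made up.

```python
from collections import defaultdict

# Same order as in PARTITION (CreationMonth, CreationYear).
PARTITION_COLUMNS = ("CreationMonth", "CreationYear")


def partition_path(row):
    """Build a relative directory such as creationmonth=2008-08/creationyear=2008."""
    return "/".join(f"{column.lower()}={row[column]}" for column in PARTITION_COLUMNS)


rows = [
    {"Id": "1394", "CreationMonth": "2008-08", "CreationYear": "2008"},
    {"Id": "3543", "CreationMonth": "2008-08", "CreationYear": "2008"},
    {"Id": "99999", "CreationMonth": "2009-01", "CreationYear": "2009"},  # made-up row
]

# Only the non-partition column (Id) is stored inside each partition.
partitions = defaultdict(list)
for row in rows:
    partitions[partition_path(row)].append(row["Id"])

for path, ids in sorted(partitions.items()):
    print(path, ids)
```

If the two partition columns were swapped in the `SELECT`, the year would be written into `creationmonth` and the month into `creationyear`, because Hive assigns them by position, not by name.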


In [14]:
!hive -f task1_insert_table.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 1.107 seconds
Query ID = jovyan_20190515080707_acf8fffa-d372-4bd2-ba48-5c730c07300e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1557896819805_0001, Tracking URL = http://b43f8bc3608b:8088/proxy/application_1557896819805_0001/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1557896819805_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-15 08:08:10,663 Stage-1 map = 0%,  reduce = 0%
2019-05-15 08:08:29,194 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 17.77 sec
2019-05-15 08:08:35,623 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 23.42 sec
2019-05-15 08:08:40,951 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 29.28 sec
2019-05-15 08:08:47,341 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 33.35 sec
MapRe

Partition demodb.posts_sample{creationmonth=2008-11, creationyear=2008} stats: [numFiles=1, numRows=54, totalSize=378, rawDataSize=324]
Partition demodb.posts_sample{creationmonth=2008-12, creationyear=2008} stats: [numFiles=1, numRows=51, totalSize=357, rawDataSize=306]
Partition demodb.posts_sample{creationmonth=2009-01, creationyear=2009} stats: [numFiles=1, numRows=84, totalSize=588, rawDataSize=504]
Partition demodb.posts_sample{creationmonth=2009-02, creationyear=2009} stats: [numFiles=1, numRows=84, totalSize=588, rawDataSize=504]
Partition demodb.posts_sample{creationmonth=2009-03, creationyear=2009} stats: [numFiles=1, numRows=85, totalSize=595, rawDataSize=510]
Partition demodb.posts_sample{creationmonth=2009-04, creationyear=2009} stats: [numFiles=1, numRows=97, totalSize=679, rawDataSize=582]
Partition demodb.posts_sample{creationmonth=2009-05, creationyear=2009} stats: [numFiles=1, numRows=111, totalSize=777, rawDataSize=666]
Partition demodb.posts_sample{creationmonth=200

Partition demodb.posts_sample{creationmonth=2015-01, creationyear=2015} stats: [numFiles=1, numRows=502, totalSize=4518, rawDataSize=4016]
Partition demodb.posts_sample{creationmonth=2015-02, creationyear=2015} stats: [numFiles=1, numRows=506, totalSize=4554, rawDataSize=4048]
Partition demodb.posts_sample{creationmonth=2015-03, creationyear=2015} stats: [numFiles=1, numRows=568, totalSize=5112, rawDataSize=4544]
Partition demodb.posts_sample{creationmonth=2015-04, creationyear=2015} stats: [numFiles=1, numRows=581, totalSize=5229, rawDataSize=4648]
Partition demodb.posts_sample{creationmonth=2015-05, creationyear=2015} stats: [numFiles=1, numRows=566, totalSize=5094, rawDataSize=4528]
Partition demodb.posts_sample{creationmonth=2015-06, creationyear=2015} stats: [numFiles=1, numRows=570, totalSize=5130, rawDataSize=4560]
Partition demodb.posts_sample{creationmonth=2015-07, creationyear=2015} stats: [numFiles=1, numRows=585, totalSize=5265, rawDataSize=4680]
Partition demodb.posts_samp

Make sure that your table contains appropriate data about posts.

In [15]:
%%writefile task1_watch_new_table.hql
-- Your code here
SELECT * FROM demodb.posts_sample
LIMIT 10;

Writing task1_watch_new_table.hql


In [16]:
!hive -f task1_watch_new_table.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
1394	2008-08	2008
3543	2008-08	2008
4521	2008-08	2008
8689	2008-08	2008
9062	2008-08	2008
14671	2008-08	2008
16307	2008-08	2008
18780	2008-08	2008
18929	2008-08	2008
19668	2008-08	2008
Time taken: 3.076 seconds, Fetched: 10 row(s)


Take the third row of the per-month post counts, sorted in ascending order (first by year, then by month).

In [61]:
%%writefile task1_result.hql
-- Your code here
with cte as (
    select count(*) as total, CreationYear, CreationMonth
    from demodb.posts_sample
    group by CreationYear, CreationMonth
), cte2 as (
    select total, CreationYear, CreationMonth,
        dense_rank() over (order by CreationYear, CreationMonth) as rang
    from cte
)
select CreationYear, CreationMonth, total
from cte2
where rang=3;


Overwriting task1_result.hql
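
The logic of this query (count posts per year and month, sort ascending, take the third group) can be sketched in plain Python to check the reasoning on a tiny example. The data below is made up and is not taken from the dataset.

```python
from collections import Counter

# (CreationYear, CreationMonth) for each post; made-up sample.
posts = [
    ("2008", "2008-08"), ("2008", "2008-08"),
    ("2008", "2008-09"),
    ("2008", "2008-10"), ("2008", "2008-10"), ("2008", "2008-10"),
    ("2009", "2009-01"),
]

# GROUP BY CreationYear, CreationMonth with count(*).
counts = Counter(posts)

# Ascending order: first by year, then by month (tuples compare element by element).
ordered = sorted(counts.items())

# Third row of the ordered result (index 2).
(year, month), total = ordered[2]
print(year, month, total)  # 2008 2008-10 3
```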


In [62]:
!hive -f task1_result.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
Query ID = jovyan_20190515085353_7b1fa961-b82d-441e-a5d6-f16dfe911d53
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1557896819805_0027, Tracking URL = http://b43f8bc3608b:8088/proxy/application_1557896819805_0027/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1557896819805_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-05-15 08:53:36,376 Stage-1 map = 0%,  reduce = 0%
2019-05-15 08:53:45,105 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 

## Step 5. Submission part. Do not touch!! And simple run all cells below!

Copy <a href="#Step-4.-Complete-the-assignment">Step 4</a> and <a href="#Step-5.-Submission-part.-Do-not-touch!!-And-simple-run-all-cells-below!">Step 5</a> of this notebook into a new notebook, run all the cells, and submit the copied notebook.

In [None]:
!cat task1_create_external_table.hql > task1.hql
!cat task1_create_managed_table.hql >> task1.hql
!cat task1_insert_table.hql >> task1.hql
!cat task1_result.hql >> task1.hql

Take a look at your submission query!

In [None]:
!cat task1.hql

In [None]:
%%javascript

$(document).ready(function() {
    console.log('Ready');
    
    
    function is_hive_command(list_tokens) {
        return list_tokens.indexOf('hive') > -1 && 
             list_tokens.indexOf('f') > -1 &&
             list_tokens.indexOf('-') > -1 && 
             list_tokens.indexOf('!') > -1 &&
             list_tokens.indexOf('hql') > -1 && 
             list_tokens.indexOf('writefile') == -1;
    } 
    
    function collectText(input_tag) {

        var result_string = [];
        $.each($(input_tag).children(), function(index, child) {
            result_string.push($(child).text());
        });
        return [result_string, is_hive_command(result_string)];
    };
    
    var filtered_results = $(".cell.code_cell.rendered").filter(function(index, element) {
        var out = collectText($(element).find('.CodeMirror-line').find('span'));
        console.log(out);
        return collectText($(element).find('.CodeMirror-line').find('span'))[1];
    });
    $(filtered_results).remove();
});

In [None]:
%%bash
!hive -f task1.hql

Congratulations! You have completed the assignment! Now you can submit it to the system and get your results!
