## Step 1: Start Hive

Follow the steps in the lecture to start your Hortonworks Sandbox VM.

Start two `ssh` sessions into it.

Start the Hive shell in one session.

Leave the other session running the Bash shell. 

- Start VirtualBox.
- Select Hortonworks Sandbox.
- Holding down the Shift key, click Start. This brings up the machine headless.

#### Window 1

- Connect to it using `ssh -p 2222 root@127.0.0.1`.
- Use `hadoop` as the password.
- Type `hive` to start the Hive shell.

#### Window 2

- Connect to it using `ssh -p 2222 root@127.0.0.1`.
- Use `hadoop` as the password.

## Step 2: Upload MovieLens Data to HDFS

Download the MovieLens data files from the web (`ml-latest-small`).

Unzip it and upload these files into HDFS:

- `links.csv`
- `movies.csv`
- `ratings.csv`
- `tags.csv`
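
The upload can be sketched with the Hive CLI's built-in `dfs` command, which passes its arguments to the HDFS file system shell. The target directory `/user/root/movie_data` is the one the `LOAD DATA` statements in Step 3 read from. The local directory `/root/ml-latest-small` is an assumption; replace it with wherever the archive was unzipped. The same commands should also work from the Bash session when prefixed with `hdfs` (for example `hdfs dfs -ls /user/root/movie_data`).

```sql
-- Hive CLI `dfs` commands (not HiveQL); run them inside the Hive shell.
-- /root/ml-latest-small is a hypothetical local directory holding the unzipped files.
dfs -mkdir -p /user/root/movie_data;
dfs -put /root/ml-latest-small/links.csv /user/root/movie_data/;
dfs -put /root/ml-latest-small/movies.csv /user/root/movie_data/;
dfs -put /root/ml-latest-small/ratings.csv /user/root/movie_data/;
dfs -put /root/ml-latest-small/tags.csv /user/root/movie_data/;

-- Confirm that all four files arrived.
dfs -ls /user/root/movie_data;
```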

## Step 3: Create Tables

Here are the header lines from the files.

File           |First Line
----           |----------
`links.csv`    |`movieId,imdbId,tmdbId`
`movies.csv`   |`movieId,title,genres`
`ratings.csv`  |`userId,movieId,rating,timestamp`
`tags.csv`     |`userId,movieId,tag,timestamp`

Execute `CREATE TABLE` commands to create an internal (managed) table
for each of these files.

Use `DESCRIBE FORMATTED` to verify that the tables are created
correctly.
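
As a minimal sketch of that check, once the statements below have been run for `links`, the table can be inspected like this. For an internal table, the `Table Type` field in the output should read `MANAGED_TABLE`, `Location` should point into the Hive warehouse directory, and `skip.header.line.count` should appear under the table parameters. The `LIMIT 5` preview is an extra sanity check added here, not part of the original exercise.

```sql
-- Show columns, storage format, location and table properties.
DESCRIBE FORMATTED links;

-- Peek at a few rows to confirm the header line was skipped
-- and the columns were parsed as expected.
SELECT * FROM links LIMIT 5;
```

Repeating the two statements for `movies`, `ratings` and `tags` covers the remaining tables.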

    -- Drop table if it exists.
    DROP TABLE IF EXISTS  links;

    -- Create table.
    CREATE TABLE links(
      movieId INT,
      imdbId INT,
      tmdbId INT
    )
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");

    -- Load data to Hive.
    LOAD DATA 
    INPATH '/user/root/movie_data/links.csv' 
    OVERWRITE 
    INTO TABLE links;

    -- Drop table if it exists.
    DROP TABLE IF EXISTS movies;

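    -- Create table.
    -- Caveat: some titles in movies.csv are quoted and contain commas,
    -- so a plain ',' delimiter splits those rows into extra fields.
    CREATE TABLE movies(
      movieId INT,
      title STRING,
      genres STRING
    )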
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");

    -- Load data to Hive.
    LOAD DATA 
    INPATH '/user/root/movie_data/movies.csv' 
    OVERWRITE 
    INTO TABLE movies;

    -- Drop table if it exists.
    DROP TABLE IF EXISTS ratings;

    -- Create table.
    CREATE TABLE ratings(
      userId INT,
      movieId INT,
      rating FLOAT,
      time INT
    )
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");

    -- Load data to Hive.
    LOAD DATA 
    INPATH '/user/root/movie_data/ratings.csv' 
    OVERWRITE 
    INTO TABLE ratings;

    -- Drop table if it exists.
    DROP TABLE IF EXISTS tags;

    -- Create table.
    CREATE TABLE tags(
      userId INT,
      movieId INT,
      tag STRING,
      time INT
    )
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");

    -- Load data to Hive.
    LOAD DATA 
    INPATH '/user/root/movie_data/tags.csv' 
    OVERWRITE 
    INTO TABLE tags;
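
One caveat with `FIELDS TERMINATED BY ','` is that `movies.csv` quotes titles that contain commas, and a plain delimiter splits those titles across columns. A possible alternative, sketched here under the assumption that the sandbox runs Hive 0.14 or later, is the built-in `OpenCSVSerde`. The table name `movies_quoted` is hypothetical and used only for illustration.

```sql
-- movies_quoted is a hypothetical table name used only for this sketch.
DROP TABLE IF EXISTS movies_quoted;

-- OpenCSVSerde honours quoted fields, so a comma inside a quoted title
-- stays part of the title column.
CREATE TABLE movies_quoted(
  movieId STRING,
  title STRING,
  genres STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

-- Assumes movies.csv has been uploaded to this path again, because the
-- earlier LOAD DATA INPATH moved the original file into the movies table.
LOAD DATA
INPATH '/user/root/movie_data/movies.csv'
OVERWRITE
INTO TABLE movies_quoted;

-- The SerDe returns every column as STRING, so cast where a number is needed.
SELECT CAST(movieId AS INT) AS movieId, title, genres
FROM movies_quoted
LIMIT 5;
```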

## Step 4: Hive Queries

Write Hive queries to perform the following actions.

- Count the number of movies in the `movies` table.

- Count how many times each distinct tag occurs in the `tags` table (group by `tag`).

    hive> select count(DISTINCT movieId) from movies;

    Query ID = root_20150917045139_59c38079-70fc-4b58-a2fc-eb635eef5f11
    Total jobs = 1
    Launching Job 1 out of 1


    Status: Running (Executing on YARN cluster with App id application_1442465149828_0001)

    Map 1: -/-	Reducer 2: 0/1	Reducer 3: 0/1
    Map 1: 0/1	Reducer 2: 0/1	Reducer 3: 0/1
    Map 1: 0(+1)/1	Reducer 2: 0/1	Reducer 3: 0/1
    Map 1: 1/1	Reducer 2: 0(+1)/1	Reducer 3: 0/1
    Map 1: 1/1	Reducer 2: 1/1	Reducer 3: 0/1
    Map 1: 1/1	Reducer 2: 1/1	Reducer 3: 0(+1)/1
    Map 1: 1/1	Reducer 2: 1/1	Reducer 3: 1/1
    OK
    718
    Time taken: 20.148 seconds, Fetched: 1 row(s)

#### The `GROUP BY` clause groups all the records in a result set by the values of a particular column

    hive> select tag, count(*) from tags group by tag limit 10;

    Query ID = root_20150917051333_85ed1b26-c657-46bf-9132-214c95cf375e
    Total jobs = 1
    Launching Job 1 out of 1


    Status: Running (Executing on YARN cluster with App id application_1442465149828_0003)

    Map 1: 0/1	Reducer 2: 0/1
    Map 1: 0(+1)/1	Reducer 2: 0/1
    Map 1: 1/1	Reducer 2: 0/1
    Map 1: 1/1	Reducer 2: 0(+1)/1
    Map 1: 1/1	Reducer 2: 1/1
    OK
    007	6
    06 Oscar Nominated Best Movie - Animation	3
    1900s	1
    1920s	2
    1950s	2
    1960s	1
    1970s	3
    1980s	2
    80's cult movie	1
    AIDs	2
    Time taken: 6.929 seconds, Fetched: 10 row(s)
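
The query above returns the first ten groups in whatever order the reducer emits them. To see the most frequently used tags instead, the counts can be ordered explicitly. This is a small sketch; the alias `cnt` is chosen here for illustration.

```sql
-- Count occurrences of each tag and list the ten most common ones.
SELECT tag, COUNT(*) AS cnt
FROM tags
GROUP BY tag
ORDER BY cnt DESC
LIMIT 10;
```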

## Step 5: Extra Challenge

- For each movie find how many ratings it has.

    hive> select movieId, count(rating) from ratings group by movieId limit 10;

    Query ID = root_20150917051727_64e0c987-d8e5-4391-ac7e-b2a37dbb4482
    Total jobs = 1
    Launching Job 1 out of 1


    Status: Running (Executing on YARN cluster with App id application_1442465149828_0003)

    Map 1: 0/1	Reducer 2: 0/1
    Map 1: 0(+1)/1	Reducer 2: 0/1
    Map 1: 1/1	Reducer 2: 0/1
    Map 1: 1/1	Reducer 2: 0(+1)/1
    Map 1: 1/1	Reducer 2: 1/1
    OK
    1	263
    2	107
    3	59
    4	8
    5	58
    6	124
    7	51
    8	4
    9	17
    10	142
    Time taken: 9.028 seconds, Fetched: 10 row(s)

- For each movie find out the average rating.

    hive> select movieId, avg(rating) from ratings group by movieId sort by movieId desc limit 10;

    135887	4.5
    134853	4.3
    134783	0.5
    134393	4.0
    134368	5.0
    134170	3.75
    133897	3.0
    133545	2.5
    133419	1.75
    132796	5.0
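
Note that `SORT BY` in Hive only orders rows within each reducer. With a single reducer, as in the runs shown here, the result is the same as a total order, but that is not guaranteed once several reducers are used. A sketch of the same query with `ORDER BY`, which does guarantee a global ordering (the alias `avg_rating` is illustrative):

```sql
-- Average rating per movie, globally ordered by movieId (highest ids first).
SELECT movieId, AVG(rating) AS avg_rating
FROM ratings
GROUP BY movieId
ORDER BY movieId DESC
LIMIT 10;
```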

- Find the top 10 movies with the highest average ratings that have at least 5 ratings.

    -- Aggregate per movie, keep movies with at least 5 ratings,
    -- then take the 10 highest averages.
    select movieId, avg(rating) as ave, count(rating) as cnt
    from ratings
    group by movieId
    having count(rating) >= 5
    order by ave desc
    limit 10;
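
To list titles rather than ids for this task, the per-movie aggregate can be computed in a subquery and joined to `movies`. This is a sketch that assumes the tables from Step 3. The aliases `s`, `m`, `avg_rating` and `num_ratings` are illustrative, and titles containing commas may appear truncated if `movies` was loaded with the plain `','` delimiter.

```sql
-- Per-movie average and count, restricted to movies with at least 5 ratings,
-- joined to movies to show the title.
SELECT m.title, s.avg_rating, s.num_ratings
FROM (
  SELECT movieId,
         AVG(rating) AS avg_rating,
         COUNT(rating) AS num_ratings
  FROM ratings
  GROUP BY movieId
  HAVING COUNT(rating) >= 5
) s
JOIN movies m ON s.movieId = m.movieId
ORDER BY avg_rating DESC
LIMIT 10;
```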

## Step 6: Drop Tables

- When you are done, drop the tables.

- Do you need to do anything else to get rid of the data?
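
Because Step 3 created internal (managed) tables, dropping a table also deletes its data from the Hive warehouse, and `LOAD DATA INPATH` had already moved the uploaded files out of `/user/root/movie_data`. A sketch of the cleanup, assuming the table names from Step 3; the `dfs` lines are Hive CLI commands that check and then remove the upload directory, which is presumably empty by this point. If HDFS trash is enabled on the cluster, deleted files may linger in the user's `.Trash` directory until it is emptied.

```sql
-- Drop the four managed tables; their warehouse data goes with them.
DROP TABLE IF EXISTS links;
DROP TABLE IF EXISTS movies;
DROP TABLE IF EXISTS ratings;
DROP TABLE IF EXISTS tags;

-- Confirm that no tables from the exercise remain.
SHOW TABLES;

-- Hive CLI dfs commands: inspect the upload directory, then remove it.
dfs -ls /user/root/movie_data;
dfs -rm -r /user/root/movie_data;
```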