##Individual Assignment

###Objectives:

- Contrasting different file formats
- Understanding Partitioning
- Understanding Bucketing
- Understanding Skewed Tables

###Lab:

####File Formats

This lab assumes you have Hive and its dependencies installed and running.   

####I will only be uploading movies.csv into the tables, for a total of 3 tables.

- Step1: Download the MovieLens `ml-latest-small` movies data from
  <http://files.grouplens.org/datasets/movielens/ml-latest-small.zip>
  

    localhost $ wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
    localhost $ unzip ml-latest-small.zip

- Step2: Load data into HDFS

    hadoop fs -put ~/ml-latest-small /user/root/

- Create `<tablename>_txt` to load data as a text file.
- Create `<tablename>_rc` to load data in RC format.
- Create `<tablename>_orc` to load data in ORC format.

##Note
    The ORC and RC tables have not had data loaded into them.
    1. I need to change the file format from .csv to .orc (or whatever)
    2. I need to copy data from the movies_text table into the other two tables
        - This step is much easier and is a common use for Hive!

###`STORED AS RC` in a table declaration is not recognized (the Hive keyword is `RCFILE`)


    -- Drop table if it exists.
    DROP TABLE IF EXISTS movies_text;

    -- Create managed table with TEXT format.
    CREATE TABLE movies_text(
          movieId INT,
          title string,
          genre string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");

    -- Load data to Hive
    LOAD DATA
    INPATH '/user/root/ml-latest-small/movies.csv'
    OVERWRITE
    INTO TABLE movies_text;

    -- Drop table if it exists.
    DROP TABLE IF EXISTS movies_rc;

    -- Create managed table with RC format (the keyword is RCFILE, not RC).
    -- The delimiter and header settings only apply to text files, so they are omitted.
    CREATE TABLE movies_rc(
          movieId INT,
          title string,
          genre string
    )
    STORED AS RCFILE;

    -- Copy to RC table
    INSERT INTO TABLE movies_rc SELECT * FROM movies_text;

    -- Drop table if it exists.
    DROP TABLE IF EXISTS movies_orc;

    -- Create managed table with ORC format.
    CREATE TABLE movies_orc(
          movieId INT,
          title string,
          genre string
    )
    STORED AS ORC
    TBLPROPERTIES("orc.compress"="ZLIB");

    -- Copy to ORC table
    INSERT INTO TABLE movies_orc SELECT * FROM movies_text;

- Step3: Load the data into the above tables from HDFS and note the
  timings. Which table took more time to load?
  

    Text file load time: 11.192 sec
    Transfer data from text table to orc table: 61.623 seconds
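
Besides load time, the formats can also be contrasted by their size on disk. The following is a minimal sketch, assuming the tables are managed tables stored under `/apps/hive/warehouse` (the warehouse path that appears in the Partitioning section of this lab) and that the commands are run from the Hive CLI:

```sql
-- Run inside the Hive CLI; `dfs` forwards the command to HDFS.
-- Directory names assume managed tables under /apps/hive/warehouse.
dfs -du -s -h /apps/hive/warehouse/movies_text;
dfs -du -s -h /apps/hive/warehouse/movies_orc;
```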

- Step4: How many movies with the tag `Action` are present in the
  list? Save a list of the titles and IDs of such movies to HDFS.
  Contrast the timings. *Hint: Case sensitive?*
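
One possible way to approach Step 4 is sketched below. It assumes the `movies_text` and `movies_orc` tables defined earlier hold the data, and the output directory `/user/root/action_movies` is a hypothetical name. Hive's `LIKE` is case sensitive, so the genre is lower-cased before matching.

```sql
-- Count movies whose genre list contains "Action", ignoring case.
-- Run the same query against movies_orc to contrast the timings.
SELECT COUNT(*)
FROM movies_text
WHERE lower(genre) LIKE '%action%';

-- Save the matching IDs and titles to HDFS as comma-separated text.
-- The target directory is hypothetical and is overwritten if it exists.
INSERT OVERWRITE DIRECTORY '/user/root/action_movies'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT movieId, title
FROM movies_text
WHERE lower(genre) LIKE '%action%';
```

Note that titles in `movies.csv` that contain a comma are quoted, and the plain `','` delimiter used by the text table splits them across columns, so counts taken from these tables may come out slightly low.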

####Partitioning

- Step1: Review the data in `data/state-pops.csv`  

- Step2: Load it into HDFS.  

    hadoop fs -mkdir -p /user/root/state_data
    hadoop fs -put data/state-pops.csv /user/root/state_data/

- Step3: Create table `states` in Hive partitioned on `country`.  

    -- Create external table.
    DROP TABLE IF EXISTS  statePop;
    CREATE EXTERNAL TABLE statePop(
      country STRING,
      state STRING,
      pop    INT
    )
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/root/state_data'
    TBLPROPERTIES("skip.header.line.count"="1");

    -- Create partitioned table.
    DROP TABLE IF EXISTS  statePop_part;
    CREATE TABLE statePop_part(
      country STRING,
      pop    INT
    )
    PARTITIONED BY (state STRING)
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
    
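    -- Allow every partition value to be determined dynamically from the data.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Insert select into partitioned table.
    -- NOTE: the partition column must be placed last in the SELECT list
    --       when inserting data into the table.
    FROM statePop 
    INSERT INTO TABLE statePop_part PARTITION(state)
    SELECT country,pop,state;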

    select * from statePop limit 10;
    
    OK
    
    Canada	Quebec	7903001
    Canada	British Columbia	4400057
    Canada	Alberta	3645257
    Canada	Manitoba	1208268
    Canada	Saskatchewan	1033381
    Canada	Nova Scotia	921727
    Canada	New Brunswick	751171
    Canada	Newfoundland and Labrador	514536
    Canada	Prince Edward Island	140204
    US	California	38332521
    
    Time taken: 2.307 seconds, Fetched: 10 row(s)
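
Step 3 asks for a table partitioned on `country`, whereas the table above is partitioned on `state`. A minimal sketch of the country-partitioned variant follows. It assumes the `statePop` source table defined above and uses the table name `states` from the step description.

```sql
-- Let Hive derive the partition values from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

DROP TABLE IF EXISTS states;
CREATE TABLE states(
  state STRING,
  pop   INT
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- The partition column (country) comes last in the SELECT list.
FROM statePop
INSERT OVERWRITE TABLE states PARTITION(country)
SELECT state, pop, country;
```

With this layout the warehouse directory would hold one `country=...` folder per country rather than one folder per state.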

- Step4: Query the description of the table. 

    hive> DESCRIBE statePop;
    OK
    country             	string
    state               	string
    pop                 	int
    Time taken: 1.111 seconds, Fetched: 3 row(s)
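
The `DESCRIBE` output above is for the unpartitioned source table. To inspect the partitioned table instead, something like the following could be used. It assumes the `statePop_part` table created in Step 3.

```sql
-- Columns, followed by a "# Partition Information" section.
DESCRIBE statePop_part;

-- Storage location, file format, and table parameters.
DESCRIBE FORMATTED statePop_part;

-- One line per partition, e.g. state=Alabama.
SHOW PARTITIONS statePop_part;
```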

- Step5: Load states with data from HDFS.  

#####I don't understand Step 5
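
One possible reading of Step 5 is that the partitioned table should be filled directly from files in HDFS with `LOAD DATA ... PARTITION`, instead of an `INSERT ... SELECT`. The sketch below is written under that assumption. The per-country file `/user/root/state_data_by_country/canada.csv` is hypothetical and would have to contain only `state,pop` rows, because `LOAD DATA` moves a file as-is and does not split it by partition value.

```sql
-- Target table from Step 3, partitioned on country.
CREATE TABLE IF NOT EXISTS states(
  state STRING,
  pop   INT
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical input file; the partition value is given explicitly
-- because the file itself has no country column.
LOAD DATA INPATH '/user/root/state_data_by_country/canada.csv'
OVERWRITE INTO TABLE states PARTITION (country='Canada');
```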

- Step6: Check the directory structure in HDFS.

    [root@sandbox ~]# hadoop fs -ls  /apps/hive/warehouse/statepop_part
    Found 60 items
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Alabama
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Alaska
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Alberta
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Arizona
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Arkansas
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=British Columbia
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=California
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Colorado
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Connecticut
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Delaware
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=District of Columbia
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Florida
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Georgia
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Hawaii
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Idaho
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Illinois
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Indiana
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Iowa
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Kansas
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Kentucky
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Louisiana
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Maine
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Manitoba
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Maryland
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Massachusetts
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Michigan
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Minnesota
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Mississippi
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Missouri
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Montana
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Nebraska
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Nevada
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=New Brunswick
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=New Hampshire
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=New Jersey
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=New Mexico
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=New York
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Newfoundland and Labrador
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=North Carolina
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=North Dakota
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Nova Scotia
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Ohio
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Oklahoma
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Oregon
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Pennsylvania
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Prince Edward Island
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Quebec
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Rhode Island
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Saskatchewan
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=South Carolina
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=South Dakota
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Tennessee
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Texas
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Utah
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Vermont
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Virginia
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Washington
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=West Virginia
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Wisconsin
    drwxrwxrwx   - root hdfs          0 2015-09-27 01:07 /apps/hive/warehouse/statepop_part/state=Wyoming

- The data for each country should be in a separate folder.

####Bucketing

- Step1: Review data in `movies.csv`.

- Step2: Create table `movies1` **without** bucketing.  

- Step3: Create table `movies2` **with bucketing over movieID (4 buckets)**.  

- Step4: Load same data to both tables and notice difference in time.   

- Step5: Run `count(*)` command on both tables and notice difference in time.   

- Step6: Perform sampling on `movies2`.
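
The bucketing steps above could be sketched as follows. This assumes the `movies_text` table from the File Formats section already holds the data. The `hive.enforce.bucketing` setting is needed on Hive 1.x and earlier; it was removed in Hive 2.0, where bucketing is always enforced, so the line should be dropped there.

```sql
-- Hive 1.x and earlier: make INSERT honour the bucket definition.
SET hive.enforce.bucketing=true;

-- Step 2: plain table, no bucketing.
DROP TABLE IF EXISTS movies1;
CREATE TABLE movies1(
  movieId INT,
  title   STRING,
  genre   STRING
)
STORED AS TEXTFILE;

-- Step 3: same columns, hashed on movieId into 4 buckets.
DROP TABLE IF EXISTS movies2;
CREATE TABLE movies2(
  movieId INT,
  title   STRING,
  genre   STRING
)
CLUSTERED BY (movieId) INTO 4 BUCKETS
STORED AS TEXTFILE;

-- Step 4: populate both from the text table and note the timings.
-- INSERT ... SELECT is used because LOAD DATA would not bucket the rows.
INSERT OVERWRITE TABLE movies1 SELECT * FROM movies_text;
INSERT OVERWRITE TABLE movies2 SELECT * FROM movies_text;

-- Step 5: compare the timings of a full count.
SELECT COUNT(*) FROM movies1;
SELECT COUNT(*) FROM movies2;

-- Step 6: sample by reading only the first of the four buckets.
SELECT * FROM movies2 TABLESAMPLE(BUCKET 1 OUT OF 4 ON movieId) s LIMIT 10;
```

After the insert, the `movies2` directory in the warehouse should contain four files, one per bucket, which is what lets `TABLESAMPLE` read a single bucket instead of scanning the whole table.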