Aggregating CSV Data (Spatial Binning)

Sarah Ambrose edited this page Jun 15, 2015 · 10 revisions

Aggregating CSV Data (Spatial Binning)

This tutorial will show you how to take a CSV file containing latitude and longitude information -> copy it to HDFS -> aggregate the dataset into bins for a useful visual analysis.

View the corresponding blog post.

  1. The sample data being used are 2013 NYC taxi data. This dataset is freely available for download. For this tutorial, we will only download trip_data_1.csv.zip for simplicity.

  2. Downloaded trip_data_1.csv.zip, unzip it and store it on your cluster. I am storing mine in gis-tools-for-hadoop/samples/data/taxi-data

  3. Open your terminal (windows use Cygwin) and move your folder containing the data from the local directory (gis-tools-for-hadoop/samples/data/taxi-data) to hdfs by using the put command. To do this make a folder (if necessary), and then move the folder containing the csv:

    #make a directory if needed
    hadoop fs -mkdir taxidemo
    #hadoop fs -put /path/on/localsystem /path/to/hdf
    hadoop fs -put gis-tools-for-hadoop/samples/data/taxi-data taxidemo
    #check that it worked:
    hadoop fs -ls taxidemo
    

    You should see something like:

    Found 1 items
    drwxr-xr-x   - sarah hdfs          0 2015-03-19 12:00 taxidemo/taxi-data
    
  4. Start hive in your terminal and add the appropriate jars (similar to the sample example):

    hive
    add jar
        gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
        gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
    
  5. Drop a table in case it exists, and then create a new table that the taxi data will reside in:

    drop table taxi_demo;
    
    CREATE EXTERNAL TABLE taxi_demo(medallion STRING, hack_license STRING,vendor_id STRING,
    rate_code STRING,store_and_fwd_flag STRING,pickup_datetime STRING, 
    dropoff_datetime STRING, passenger_count DOUBLE, trip_time_in_secs DOUBLE, 
    trip_distance DOUBLE, pickup_longitude DOUBLE,pickup_latitude DOUBLE, 
    dropoff_longitude DOUBLE, dropoff_latitude DOUBLE)
    ROW FORMAT delimited fields terminated by ',' STORED AS textfile
    tblproperties ("skip.header.line.count"="1");
    

    I found the schema of the data by typing in the terminal (not hive) head -2 trip_data_1.csv > header_taxi.csv and then viewing the resulting CSV. You can always find out what folder you are currently in by typing pwd.

  6. Check that the table was made correctly:

    describe taxi_demo;
    
  7. Load the taxi CSV data into the table:

    LOAD DATA INPATH 'taxidemo/taxi-data/trip_data_1.csv' OVERWRITE INTO TABLE taxi_demo;
    
  8. Create the temporary functions that will be used in aggregating bins:

    create temporary function ST_Bin as 'com.esri.hadoop.hive.ST_Bin';
    create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
    create temporary function ST_BinEnvelope as 'com.esri.hadoop.hive.ST_BinEnvelope';
    

    A full list of Hive UDFs can be found here

  9. Try out the aggregation:

    FROM (SELECT ST_Bin(0.001, ST_Point(dropoff_longitude,dropoff_latitude)) bin_id, *FROM taxi_demo) bins
    SELECT ST_BinEnvelope(0.001, bin_id) shape,
    COUNT(*) count
    GROUP BY bin_id;
    

    0.001 refers to 0.001 degrees (the unit the data is in). This can easily be changed if you want less or more detail.

  10. If it ran without errors, create a new hive table to save the results to:

        drop table taxi_agg;
    
        CREATE TABLE taxi_agg(area BINARY, count DOUBLE)
        ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'              
        STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedJsonInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
    

    'com.esri.json.hadoop.UnenclosedJsonInputFormat' -> "Unenclosed" or "Enclosed" must match what is used in the JSON to Features - Step 4 tool.

  11. Rerun the query, and save the results to the new table taxi_agg:

    FROM (SELECT ST_Bin(0.001, ST_Point(dropoff_longitude,dropoff_latitude)) bin_id, *FROM taxi_demo) bins
    INSERT OVERWRITE TABLE taxi_agg
    SELECT ST_BinEnvelope(0.001, bin_id) shape, COUNT(*) count
    GROUP BY bin_id;
    
  12. To view the new table results in ArcMap follow steps 5:9 in the Geoprocessing tools for Hadoop tutorial.