# Merge the Nearest Neighbors, Census, and Sensor ID tables

This notebook merges the nearest-neighbor calculations with the census information. Three tables are involved:
1. nearest_neighbors (nn): contains tract_id (census tract), location_id (the location of the sensor nearest to the census tract), distance (the distance from the sensor to the census tract, in km), and rank (should always be 1, as a result of the nearest-neighbor analysis).
2. filtered_sensors (fs): provides location_id, sensor_id, parameter, units, datetime_last (the sensor's last measurement), and the coordinates of the location.
3. filtered_geo_census/census_tracts (ct): gives the census tract, centroid coordinates, median_income, and the margin of error of the median income.

JOINs
- nn.tract_id = ct.GEOID
- nn.location_id = fs.location_id

- The nearest_neighbors and census_tracts tables have the same number of rows, since both are keyed by tract_id/GEOID. This join can be an INNER or a LEFT join and still preserve every row.
- The second join, to filtered_sensors, expands the row count: the tract/nn tables carry only location_id values, but some locations have multiple sensors. There are 2982 locations and 8609 sensors.
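The row-count behavior described above can be checked on toy data. The following is a minimal sketch that uses Python's built-in `sqlite3` rather than Spark, so it runs anywhere; the table and column names follow this notebook, but every row is made up for illustration.

```python
import sqlite3

# Toy data only: names mirror the notebook, values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census_tracts (GEOID TEXT, median_income INTEGER);
    CREATE TABLE nearest_neighbors (tract_id TEXT, location_id INTEGER, distance REAL);
    CREATE TABLE filtered_sensors (location_id INTEGER, sensor_id INTEGER, parameter_name TEXT);

    INSERT INTO census_tracts VALUES ('T1', 42000), ('T2', 91000);
    INSERT INTO nearest_neighbors VALUES ('T1', 10, 3.2), ('T2', 20, 17.5);
    -- Location 10 hosts two sensors, location 20 hosts one.
    INSERT INTO filtered_sensors VALUES (10, 100, 'pm25'), (10, 101, 'o3'), (20, 200, 'pm25');
""")


def count_rows(sql):
    """Run a COUNT(*) query and return the single integer result."""
    return conn.execute(sql).fetchone()[0]


# First join: one nearest neighbor per tract, so the row count is unchanged.
tract_rows = count_rows("""
    SELECT COUNT(*)
    FROM census_tracts ct
    INNER JOIN nearest_neighbors nn ON ct.GEOID = nn.tract_id
""")

# Second join: each tract is repeated once per sensor at its nearest location.
sensor_rows = count_rows("""
    SELECT COUNT(*)
    FROM census_tracts ct
    INNER JOIN nearest_neighbors nn ON ct.GEOID = nn.tract_id
    INNER JOIN filtered_sensors fs ON fs.location_id = nn.location_id
""")

print(tract_rows, sensor_rows)
```

In this toy case the first join returns one row per tract and the second returns one row per tract-sensor pair, which is the same fan-out to expect in `sensors_with_income_levels`.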

The API is called once per location, but the response doesn't include the parameter or units for the sensor measurements. This table is static and serves only as a reference for that information, so that it can be looked up without making additional API calls after getting the locations from the FIRMS data.
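One way to use a static reference table like this is to load it once into a lookup keyed by location and sensor. The sketch below is plain Python with hypothetical rows shaped like `filtered_sensors`; the helper name `build_sensor_lookup` is made up here, and it assumes that each measurement returned by the API can be matched to a sensor_id, which this notebook does not show.

```python
from collections import defaultdict

# Hypothetical rows: (location_id, sensor_id, parameter_name, parameter_units).
sensor_rows = [
    (10, 100, "pm25", "µg/m³"),
    (10, 101, "o3", "ppm"),
    (20, 200, "pm25", "µg/m³"),
]


def build_sensor_lookup(rows):
    """Map location_id -> {sensor_id: (parameter_name, parameter_units)}."""
    lookup = defaultdict(dict)
    for location_id, sensor_id, name, units in rows:
        lookup[location_id][sensor_id] = (name, units)
    return dict(lookup)


lookup = build_sensor_lookup(sensor_rows)

# Assumption: a measurement for location 10 arrives tagged with sensor_id 101.
parameter_name, parameter_units = lookup[10][101]
print(parameter_name, parameter_units)
```

Because the lookup is built from the static table, labeling a measurement with its parameter and units needs no further API call.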

In [0]:
%sql
SELECT COUNT(*)
FROM nearest_neighbors

In [0]:
filtered_geo_census_path = "/Volumes/tabular/dataexpert/freshoats_capstone/filtered_geo_census.parquet"
filtered_geo_census_df = spark.read.parquet(filtered_geo_census_path)

# Persist the DataFrame as a Delta table named census_tracts (a saved table, not a temporary view)
filtered_geo_census_df.write.format("delta") \
    .mode("overwrite") \
    .option("path", "dbfs:/delta/census_tracts") \
    .saveAsTable("census_tracts")

In [0]:
%sql
SELECT *
FROM census_tracts 

In [0]:
%sql
SELECT *
FROM filtered_sensors

In [0]:
%sql
SELECT COUNT(DISTINCT location_id) AS n_locations
  , COUNT(DISTINCT sensor_id) AS n_sensors
FROM filtered_sensors

In [0]:
%sql
CREATE OR REPLACE TABLE sensors_with_income_levels 
USING DELTA AS 
SELECT
  ct.GEOID AS GEOID
  , sl.state_abbreviation AS state
  , ct.INTPTLAT AS tract_lat
  , ct.INTPTLON AS tract_lon
  , ct.median_income AS median_income
  , ct.median_income_margin AS median_income_margin
  , CASE 
      WHEN ct.median_income <= 30000 THEN 'Low'
      WHEN ct.median_income <= 50000 THEN 'Lower-Middle'
      WHEN ct.median_income <= 80000 THEN 'Middle'
      WHEN ct.median_income <= 150000 THEN 'Upper-Middle'
      WHEN ct.median_income <= 250000 THEN 'High'
      WHEN ct.median_income > 250000 THEN 'Very-High'
      ELSE 'Unknown'
    END AS income_bracket
  , nn.location_id AS nn_location_id
  , nn.distance AS nn_distance_km
  , CASE 
        WHEN nn.distance <= 5 THEN 'High'
        WHEN nn.distance <= 20 THEN 'Moderate'
        WHEN nn.distance <= 50 THEN 'Low'
        ELSE 'Very Low'
    END AS aq_confidence_level
  , fs.latitude AS location_lat
  , fs.longitude AS location_lon
  , fs.datetime_last AS datetime_last
  , fs.sensor_id AS sensor_id
  , fs.parameter_name AS parameter_name
  , fs.parameter_units AS parameter_units
FROM census_tracts ct
  LEFT JOIN state_lookup sl
    ON ct.STATEFP = sl.STATEFP
  INNER JOIN nearest_neighbors nn 
    ON ct.GEOID = nn.tract_id
  INNER JOIN filtered_sensors fs 
    ON fs.location_id = nn.location_id



In [0]:
%sql
SELECT * 
FROM sensors_with_income_levels
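To spot-check the two derived columns, the CASE expressions from the CREATE TABLE statement can be mirrored in plain Python. This is a sketch for sanity-checking individual values only, not part of the pipeline; the function names are made up here, and `None` stands in for SQL NULL (which falls through every WHEN to the ELSE branch).

```python
def income_bracket(median_income):
    """Mirror of the income_bracket CASE expression; None models SQL NULL."""
    if median_income is None:
        return "Unknown"
    thresholds = [
        (30000, "Low"),
        (50000, "Lower-Middle"),
        (80000, "Middle"),
        (150000, "Upper-Middle"),
        (250000, "High"),
    ]
    # The first threshold that the value does not exceed determines the label.
    for upper_bound, label in thresholds:
        if median_income <= upper_bound:
            return label
    return "Very-High"


def aq_confidence_level(distance_km):
    """Mirror of the aq_confidence_level CASE expression; None models SQL NULL."""
    if distance_km is None:
        return "Very Low"
    if distance_km <= 5:
        return "High"
    if distance_km <= 20:
        return "Moderate"
    if distance_km <= 50:
        return "Low"
    return "Very Low"


print(income_bracket(50000), aq_confidence_level(17.5))
```

Both mappings use inclusive upper bounds, so a tract with a median income of exactly 50000 lands in 'Lower-Middle' and a sensor exactly 5 km away still counts as 'High' confidence.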