# Perform the Nearest Neighbor Calculations 
This step loads the filtered geo census data and the sensor location data, then matches each census tract to its nearest sensor using the tract centroid coordinates and the sensor location coordinates.

Note: because the nearest-neighbor computation was intensive, I changed the cluster's compute type to r5d.xlarge and set the following Spark parameters in the advanced section:

- spark.memory.offHeap.enabled true
- spark.driver.memory 16g
- spark.executor.instances 4
- spark.memory.offHeap.size 2g
- spark.driver.cores 4
- spark.executor.memory 16g
- spark.executor.cores 4

I changed these settings back afterward, since this was a one-time process that won't need to be performed again.

In [0]:
filtered_geo_census_path = "/Volumes/tabular/dataexpert/freshoats_capstone/filtered_geo_census.parquet"

filtered_geo_census_df = spark.read.parquet(filtered_geo_census_path)
filtered_sensors = spark.table('filtered_sensors')

In [0]:
display(filtered_geo_census_df)

In [0]:
# Create a temporary view of the census DataFrame so it can be queried with SQL
filtered_geo_census_df.createOrReplaceTempView("census_tracts")

## Verify the views

In [0]:
# Verify the census_tracts view
spark.sql("SELECT * FROM census_tracts LIMIT 5").show()

# Verify the filtered_sensors table (rows with no reading on or after 2025-01-01)
spark.sql("""
    SELECT *
    FROM filtered_sensors
    WHERE datetime_last < '2025-01-01' OR datetime_last IS NULL
    LIMIT 5
""").show()

In [0]:
%sql
WITH sensors AS (
  SELECT *
  FROM filtered_sensors
  WHERE datetime_last >= '2025-01-01' AND datetime_last IS NOT NULL
)
SELECT
    c.GEOID AS tract_id
    , s.location_id
    , 2 * 6371 * ASIN(SQRT(
        POWER(SIN(RADIANS(s.latitude - c.INTPTLAT) / 2), 2) +
        COS(RADIANS(c.INTPTLAT)) * COS(RADIANS(s.latitude)) *
        POWER(SIN(RADIANS(s.longitude - c.INTPTLON) / 2), 2)
    )) AS distance
FROM
    census_tracts c
CROSS JOIN
    sensors s
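
For reference, the `distance` expression in the query above appears to be the standard Haversine formula. In the form below, $\varphi$ is latitude and $\lambda$ is longitude (both in radians), the subscript $c$ marks the tract centroid, $s$ marks the sensor, and $R = 6371$ km is the mean Earth radius used in the query. The symbols are introduced here for illustration only and do not appear in the notebook.

$$
d = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_s - \varphi_c}{2}\right) + \cos\varphi_c \,\cos\varphi_s \,\sin^2\left(\frac{\lambda_s - \lambda_c}{2}\right)}\right)
$$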

In [0]:
# SQL query that uses the Haversine formula to calculate the distance (in km) from each census tract centroid to each sensor
nearest_neighbors_query = """
    WITH sensors AS (
      SELECT *
          FROM filtered_sensors
          WHERE datetime_last >= '2025-01-01' AND datetime_last IS NOT NULL
    )
      SELECT
          c.GEOID AS tract_id
          , s.location_id
          , 2 * 6371 * ASIN(SQRT(
              POWER(SIN(RADIANS(s.latitude - c.INTPTLAT) / 2), 2) +
              COS(RADIANS(c.INTPTLAT)) * COS(RADIANS(s.latitude)) *
              POWER(SIN(RADIANS(s.longitude - c.INTPTLON) / 2), 2)
          )) AS distance
      FROM
          census_tracts c
      CROSS JOIN
          sensors s
    """

# Execute the query
distances_df = spark.sql(nearest_neighbors_query)

# Show the calculated distances
distances_df.show()
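
One way to sanity-check the SQL distance expression is to reproduce it in plain Python and compare a few values by hand. The sketch below is not part of the original pipeline. The function name `haversine_km` and the test coordinates are hypothetical, and it assumes the same 6371 km Earth radius as the query.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as the SQL query


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    d_lat = math.radians(lat2 - lat1)
    d_lon = math.radians(lon2 - lon1)
    # Same structure as the SQL: sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlon/2)
    a = (
        math.sin(d_lat / 2) ** 2
        + math.cos(math.radians(lat1))
        * math.cos(math.radians(lat2))
        * math.sin(d_lon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is R * pi / 180 km
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(one_degree, 2))
```

Comparing a handful of rows from `distances_df` against this function is a cheap way to catch a swapped latitude/longitude column before running the more expensive ranking step.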

The cross join produced the distance for every tract-sensor pair, currently in no particular order. The next step runs a window function that partitions by tract and orders by distance, which sorts in ascending order by default. The first row in each partition is therefore the shortest distance, and that sensor is the tract's nearest neighbor.

Because this step requires sorting and shuffling, it takes substantially longer and needs more computational power to complete.

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Define a window partitioned by tract_id and ordered by distance
window = Window.partitionBy("tract_id").orderBy("distance")

# Add a rank column and filter for the nearest sensor (rank = 1)
nearest_neighbors_df = distances_df.withColumn("rank", row_number().over(window)).filter("rank = 1")

# Show the nearest neighbors
nearest_neighbors_df.select("tract_id", "location_id", "distance").show()
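
The logic of the window step can be illustrated without Spark. Keeping the `rank = 1` row per tract gives the same result as keeping the minimum-distance row for each tract. The sketch below is illustrative only: `nearest_per_tract` and the sample rows are hypothetical and not taken from the real data.

```python
def nearest_per_tract(rows):
    """Return {tract_id: (location_id, distance)} with the smallest distance per tract.

    rows is an iterable of (tract_id, location_id, distance) tuples.
    On ties, the first row seen is kept; Spark's row_number() also keeps
    exactly one row on ties, but which one is not guaranteed.
    """
    best = {}
    for tract_id, location_id, distance in rows:
        if tract_id not in best or distance < best[tract_id][1]:
            best[tract_id] = (location_id, distance)
    return best


# Hypothetical tract-sensor distances, standing in for distances_df
sample_rows = [
    ("T1", "A", 5.0),
    ("T1", "B", 2.5),
    ("T2", "A", 1.0),
    ("T2", "C", 7.2),
]
print(nearest_per_tract(sample_rows))
```

In Spark the same reduction still needs a shuffle, because all of a tract's rows have to end up on the same executor before the minimum can be chosen.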

In [0]:
%sql
DROP TABLE IF EXISTS nearest_neighbors

In [0]:
nearest_neighbors_df.write.format("delta") \
    .mode("overwrite") \
    .option("path", "dbfs:/delta/nearest_neighbors") \
    .saveAsTable("nearest_neighbors")

In [0]:
nearest_neighbors_df = spark.read.table("nearest_neighbors")

In [0]:
display(nearest_neighbors_df)

There are 1455 tracts without median income data. This is fine, as they will be grouped into all of the income brackets if null. The next step is to set up the marketing brackets.

In [0]:
# Save the result as a managed Delta table
result_df.write.format("delta").mode("overwrite").saveAsTable("sensors_with_tract_income_brackets")