# Geodata cleaning and integration of sensor locations from OpenAQ

In [0]:
geo_census_df_path = "/Volumes/tabular/dataexpert/freshoats_capstone/Geo_Census.parquet"

geo_census_df = spark.read.parquet(geo_census_df_path)


In [0]:
geo_census_df.printSchema()


In [0]:
geo_census_df.show()

Keep only the 50 states and the District of Columbia, excluding territories. Their STATEFP codes range from 01 to 56; territory codes start at 60.

In [0]:
# Filter States by STATEFP
filtered_geo_census = geo_census_df.filter(geo_census_df.STATEFP.cast("int") < 57)
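
To make the cutoff concrete, here is a minimal plain-Python sketch of the same rule applied to a handful of FIPS codes. The sample list and the helper name `is_state_or_dc` are illustrative and not part of the notebook; the sketch assumes STATEFP is stored as a zero-padded string, which is why the filter casts it to an integer before comparing.

```python
# Hypothetical sample of STATEFP values (zero-padded strings).
# 01 = Alabama, 02 = Alaska, 11 = District of Columbia, 56 = Wyoming,
# 60 = American Samoa, 66 = Guam, 72 = Puerto Rico, 78 = U.S. Virgin Islands.
sample_statefp = ["01", "02", "11", "56", "60", "66", "72", "78"]


def is_state_or_dc(statefp: str) -> bool:
    """Mirror the notebook filter: cast to int and keep codes below 57."""
    return int(statefp) < 57


kept = [code for code in sample_statefp if is_state_or_dc(code)]
dropped = [code for code in sample_statefp if not is_state_or_dc(code)]

print("kept:", kept)
print("dropped:", dropped)
```

Note that the rule keeps code 11, so the District of Columbia stays in the filtered data alongside the states.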

In [0]:
from pyspark.sql.functions import col 

# Confirm there are no STATEFP values over 56
filtered_geo_census.orderBy(col('STATEFP').desc()).show()

In [0]:
duplicates_df = filtered_geo_census.groupBy(filtered_geo_census.columns).count().filter(col("count") > 1)

duplicates_df.show()

In [0]:
# Reuse duplicates_df from the previous cell instead of recomputing the aggregation
duplicate_count = duplicates_df.count()

print(f"Number of duplicate rows: {duplicate_count}")

There are no duplicate rows in the Geo Census data.
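
The check above groups on every column and keeps the groups that occur more than once, so the number it prints is the count of distinct duplicated rows, not the count of surplus copies. The plain-Python sketch below illustrates that difference on a few hypothetical rows; the tuples and variable names are made up for illustration and do not come from the Geo Census data.

```python
from collections import Counter

# Hypothetical rows: (STATEFP, COUNTYFP, name). One row appears three times.
rows = [
    ("01", "001", "Autauga"),
    ("01", "003", "Baldwin"),
    ("01", "001", "Autauga"),
    ("01", "001", "Autauga"),
]

# Equivalent of groupBy(all columns).count()
counts = Counter(rows)

# Equivalent of .filter(col("count") > 1)
duplicated_groups = {row: n for row, n in counts.items() if n > 1}

# What the notebook prints: the number of distinct rows that are duplicated
num_duplicated_groups = len(duplicated_groups)

# How many rows would disappear if duplicates were dropped
num_surplus_rows = sum(n - 1 for n in duplicated_groups.values())

print(num_duplicated_groups, num_surplus_rows)
```

Either number being zero means the data has no duplicates, which is the case reported here.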

Validate that the coordinates fall within a bounding box around the U.S.:

In [0]:
# Define the bounding box for the U.S.
min_lat, max_lat = 18.0, 71.538800
min_lon, max_lon = -179.148909, -66.93457

# Filter the DataFrame
valid_locations_df = filtered_geo_census.filter(
    (filtered_geo_census["INTPTLAT"] >= min_lat) &
    (filtered_geo_census["INTPTLAT"] <= max_lat) &
    (filtered_geo_census["INTPTLON"] >= min_lon) &
    (filtered_geo_census["INTPTLON"] <= max_lon)
)

# Show the valid locations
valid_locations_df.show()

In [0]:
filtered_geo_census.count() == valid_locations_df.count()

In [0]:
filtered_geo_census.count()

In [0]:
valid_locations_df.count()

In [0]:
# Find rows in filtered_geo_census but not in valid_locations_df
missing_row_df = filtered_geo_census.subtract(valid_locations_df)

# Show the missing row(s)
missing_row_df.show()

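This location is in Alaska, just across the International Date Line. It most likely fails the check because its longitude is positive (east of 180°), which puts it outside the longitude range of the bounding box. The coordinates are still valid, so the row is kept in filtered_geo_census.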
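
A bounding-box check can be made tolerant of this case by accepting a second longitude range on the far side of the 180° meridian. The sketch below is one possible way to do that in plain Python, not the notebook's method. The latitude and longitude limits are copied from the cell above; `ALEUTIAN_MIN_LON = 172.0` and the function name `in_us_bounds` are assumptions introduced for illustration, and the check is deliberately coarse because it reuses the same latitude range for both boxes.

```python
# Bounding box copied from the notebook cell above.
MIN_LAT, MAX_LAT = 18.0, 71.538800
MIN_LON, MAX_LON = -179.148909, -66.93457

# Assumption: the westernmost Aleutian Islands lie east of roughly 172 degrees E.
ALEUTIAN_MIN_LON = 172.0


def in_us_bounds(lat: float, lon: float) -> bool:
    """Return True if (lat, lon) is inside the main box or the Aleutian strip."""
    if not (MIN_LAT <= lat <= MAX_LAT):
        return False
    in_main_box = MIN_LON <= lon <= MAX_LON
    # Positive longitudes just west of the 180th meridian (coarse check).
    in_aleutian_strip = ALEUTIAN_MIN_LON <= lon <= 180.0
    return in_main_box or in_aleutian_strip


# Hypothetical test points, not rows from the dataset.
print(in_us_bounds(61.2, -149.9))  # point in mainland Alaska
print(in_us_bounds(52.9, 172.9))   # point in the western Aleutians
print(in_us_bounds(48.85, 2.35))   # point in Europe
```

The same two-range condition could be expressed as a Spark filter by combining the existing longitude comparison with a second one using `|`.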

In [0]:
filtered_geo_census.count()

In [0]:
sensor_locations_df.count()

In [0]:
# Drop geometry, COUNTYFP, and the original GEOID, then rename GEOID_Data to GEOID before saving to Parquet
filtered_geo_census = filtered_geo_census.drop("GEOID", "COUNTYFP", "geometry").withColumnRenamed("GEOID_Data", "GEOID")
display(filtered_geo_census)

Prior to filtering out inactive sensor locations, there were 4,700 of them. Removing the inactive ones significantly reduces the computation needed to find nearest neighbors.
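
The notebook does not show how nearest neighbors are computed, so the following is only a sketch under the assumption of a brute-force great-circle (haversine) search. It illustrates why the sensor count matters: each census location is compared against every sensor, so the work grows with the number of sensors retained. The function names, the sensor tuples, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_sensor(point, sensors):
    """Return the (sensor_id, lat, lon) tuple closest to point = (lat, lon)."""
    lat, lon = point
    return min(sensors, key=lambda s: haversine_km(lat, lon, s[1], s[2]))


# Hypothetical sensors: (sensor_id, latitude, longitude)
sensors = [
    ("sensor_a", 40.71, -74.01),
    ("sensor_b", 34.05, -118.24),
]

closest = nearest_sensor((41.0, -73.5), sensors)
print(closest[0])
```

With N census locations and M sensors, this approach performs N × M distance evaluations, so every inactive sensor removed saves N of them.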

In [0]:
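# filtered_geo_census_path is not defined in this section; it must be set to the output location before this cell runs
filtered_geo_census.write.mode("overwrite").parquet(filtered_geo_census_path)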