# Access Metrics Made Easy: Nearest Destination and Access Count
This notebook provides a simple workflow and toolkit to calculate access metrics. Primarily, it calculates the travel cost to the nearest destination and the number of destinations within a given threshold. The basic workflow is as follows:
1. Import data: Point data for destinations, origin geographies, and a transit cost matrix of pre-computed travel costs (e.g., minutes, miles).
2. Spatially join destinations to origins: Based on geospatial location, this associates each destination with the origin geography that contains it. Since the travel cost between every pair of origin geographies is known, we can look up the cost from any origin to the geography containing each destination.
3. Calculate metrics: For the nearest location, we'll sort the origin-destination pairs by travel time, then take the first entry for each origin. For the count within a given threshold, we filter the pairs by travel time and count the entries at or under the threshold.

---

Getting started: imports and a helper function for later. Here, we'll install the needed libraries (on the Colab remote machine) and import them:

In [1]:
!pip install pandas geopandas access rtree pygeos pyarrow

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
# from access import Access, weights, Datasets

def dfToGdf(df, lon, lat, crs='EPSG:4326'):
  '''
    df: pandas dataframe
    lon: longitude column name
    lat: latitude column name
    crs: EPSG code or similar coordinate reference system
  '''
  return gpd.GeoDataFrame(
    df.drop([lon, lat], axis=1),
    crs=crs,
    geometry=[Point(xy) for xy in zip(df[lon], df[lat])])





### Public data:
We've provided a set of public datasets to help get you started. Specifically, we have travel cost matrices for US Zip codes and US Census Tracts, with the travel cost as a value in minutes. The data here is in the Apache Parquet format, an efficient format for storing tabular data.

We've also included some base geographies and population data (should you need it) and some sample destination data, such as Federally Qualified Health Centers (FQHCs).

In [2]:
matrices = {
    'tract': {
        'car':'https://uchicago.box.com/shared/static/hkipez75z2p7zivtjdgsfzut4rhm6t6h.parquet',
        'bike':'https://uchicago.box.com/shared/static/cvkq3dytr6rswzrxlgzeejmieq3n5aal.parquet',
        'walk':'https://uchicago.box.com/shared/static/swggh8jxj59c7vpxzx1emt7jnd083rmh.parquet'
    },
    'zip': {
        'car':'https://uchicago.box.com/shared/static/swggh8jxj59c7vpxzx1emt7jnd083rmh.parquet',
        'bike':'https://uchicago.box.com/shared/static/7yzgf1gx3k3sacntjqber6l40m0d5aqw.parquet', 
        'walk':'https://uchicago.box.com/shared/static/lrxeqclmpkflibg9c7sphmun4kny9xsb.parquet',  
    }
}

geographies = {
    'tract': 'https://uchicago.box.com/shared/static/kfoid6fzlbpyfecmwpe9u16cl5n89ru0.zip',
    'zip':'https://uchicago.box.com/shared/static/270ca6syxcg3dlvohhnt3ap92m4t7cxc.zip'
}
pop_data = {
    'tract':'https://uchicago.box.com/shared/static/z6xm6tre935xbc06gg4ukzgyicro26cw.csv',
    'zip': 'https://uchicago.box.com/shared/static/njjpskiuj7amcztrxjws2jfwqlv66t49.csv'
}
sample_point_data = {
    'FQHC': 'https://uchicago.box.com/shared/static/uylcq23g5z8jhvmp7cnofr074j4hwj6e.csv',
    'pharmacies': 'https://uchicago.box.com/shared/static/njjpskiuj7amcztrxjws2jfwqlv66t49.csv',
    'opioid_treatment_facilities': 'https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/master/data_raw/Opioid_Treatment_Directory_Geocoded.csv',
    'hospitals': 'https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/master/data_final/Resources/Hospitals_Geocoded.csv',
    'mentalhealth': 'https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/master/data_final/Resources/MentalHealthProviders_Geocoded.csv',
    'moud_full': 'https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/master/data_final/moud/us-wide-moudsCleaned_geocoded.csv'
}

geoid_cols = {
    "tract":"GEOID",
    "zip": "GEOID10"
}

Specify your preferred geographic unit and transit mode below. Or do your own thing!

In [3]:
GEOGRAPHIC_UNIT = 'zip' # 'tract' or 'zip'
TRANSIT_MODE = 'car' # 'car' or 'bike' or 'walk'

TRANSIT_MATRIX = pd.read_parquet(matrices[GEOGRAPHIC_UNIT][TRANSIT_MODE])
GEOGRAPHIES = gpd.read_file(geographies[GEOGRAPHIC_UNIT]).to_crs('EPSG:4326')
GEOGRAPHIES['FIPS'] = GEOGRAPHIES[geoid_cols[GEOGRAPHIC_UNIT]].astype('int64')

Set your destinations of interest here (i.e., a CSV of point locations with longitude and latitude columns).

In [4]:
DESTINATIONS = dfToGdf(pd.read_csv(sample_point_data['hospitals']), 'Longitude', 'Latitude') 

### Spatial Join
Spatially joining points and polygons is easy. We're using the `intersects` geometric predicate here, which simply means that if a point intersects a polygon, the two become joined or associated.

This lets us see which polygon from our geographies each destination falls in, and from the travel matrix, we'll know (roughly) how far it is from the other geographies.

In [5]:
# `predicate` replaces the deprecated `op` keyword (geopandas >= 0.10)
merged_destinations = gpd.sjoin(DESTINATIONS, GEOGRAPHIES[['FIPS', 'geometry']], how='inner', predicate='intersects')
merged_destinations.head()



Unnamed: 0.1,Unnamed: 0,Name,Hospital.Type,Address,Address_2,City,State,Zipcode,County,Staffed.All.Beds,...,Licensed.All.Beds...SOURCE,All.Bed.Occupancy.Rate...SOURCE,ICU.Bed.Occupancy.Rate...SOURCE,CCM_ID,DH.ID,HCRIS.ID,HIFLD.ID,geometry,index_right,FIPS
0,1,IU HEALTH UNIVERSITY HOSPITAL,GENERAL ACUTE CARE,550 UNIVERSITY BLVD,,INDIANAPOLIS,IN,46202,MARION,,...,,,,100,,,100,POINT (-86.17656 39.77528),20351,46202
79,80,RILEY HOSPITAL FOR CHILDREN,CHILDREN,705 RILEY HOSPITAL DR,,INDIANAPOLIS,IN,46202,MARION,,...,,,,102,,,102,POINT (-86.18028 39.77711),20351,46202
1798,1799,INDIANA UNIVERSITY HEALTH,GENERAL ACUTE CARE,1701 N SENATE BLVD,,INDIANAPOLIS,IN,46202,MARION,1226.0,...,DH-NUM_LICENS,DH-BED_UTILIZ,,156546202,1269.0,,156546202,POINT (-86.16310 39.78980),20351,46202
1818,1819,ESKENAZI HEALTH,GENERAL ACUTE CARE,720 ESKENAZI AVENUE,,INDIANAPOLIS,IN,46254,MARION,316.0,...,DH-NUM_LICENS,HCRIS-Total Bed Occupancy Rate,HCRIS-ICU Occupancy Rate,157146254,1267.0,150024.0,157146254,POINT (-86.18399 39.77788),20351,46202
4669,4670,RICHARD L. ROUDEBUSH VA MEDICAL CENTER,MILITARY,1481 W 10TH ST,,INDIANAPOLIS,IN,46202,MARION,,...,,,,4646202,1268.0,,4646202,POINT (-86.18716 39.77846),20351,46202
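
To make the join semantics concrete, here is a minimal conceptual sketch in plain Python. It is a stand-in, not what geopandas does internally: it uses axis-aligned boxes instead of real polygons, and all IDs and coordinates are hypothetical.

```python
# Conceptual stand-in for an inner point-in-polygon join.
# Boxes are (min_x, min_y, max_x, max_y); all values are made up for illustration.
boxes = {
    "A": (0.0, 0.0, 1.0, 1.0),
    "B": (1.0, 0.0, 2.0, 1.0),
}
points = {
    "clinic_1": (0.5, 0.5),
    "clinic_2": (1.5, 0.2),
    "clinic_3": (5.0, 5.0),  # outside every box
}

def intersects(point, box):
    """True if the point lies inside the box or on its boundary."""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y

# Inner join: keep only the (point, box) pairs that intersect
joined = [
    (name, box_id)
    for name, pt in points.items()
    for box_id, box in boxes.items()
    if intersects(pt, box)
]
print(joined)
```

Two things carry over to the real join: with `how='inner'`, points that fall outside every geography (like `clinic_3`) are dropped, and a point sitting exactly on a shared boundary intersects both neighbors, so it can appear twice.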


### Less Is More
Let's just snag the columns we need. We'll need to join this data again to the travel matrix, so we keep `FIPS` as the shared key that gets everyone speaking the same language.

In [6]:
## Pull out the simplified columns we need for the analysis
destinations_simplified = merged_destinations[['index_right','FIPS']]
destinations_simplified.head()
#destinations_simplified.shape

Unnamed: 0,index_right,FIPS
0,20351,46202
79,20351,46202
1798,20351,46202
1818,20351,46202
4669,20351,46202


### The Big Join
Currently, we have destinations associated with our origin geographies (with the default settings above, ZIP codes and hospitals), but we need to bring it all together by joining the origins and destinations to the travel matrix. Below, we'll join the travel matrix to the destinations:

In [7]:
## Merge onto the transit matrix, giving us the distance from each origin to each destination
merge_transit_matrix = TRANSIT_MATRIX.merge(destinations_simplified, left_on="destination", right_on="FIPS")
merge_transit_matrix.head()

Unnamed: 0,origin,destination,minutes,index_right,FIPS
0,1001,1040,19.88,17553,1040
1,1001,1040,19.88,17553,1040
2,1002,1040,27.49,17553,1040
3,1002,1040,27.49,17553,1040
4,1003,1040,29.33,17553,1040
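
The repeated rows in the output above are expected: when several destinations fall in the same geography, the merge emits one row per origin for each of them. A minimal sketch with made-up toy values (the IDs and minutes are hypothetical, not taken from the public matrices):

```python
import pandas as pd

# Toy travel matrix: two origins, one destination geography
toy_matrix = pd.DataFrame({
    "origin": [1001, 1002],
    "destination": [1040, 1040],
    "minutes": [19.88, 27.49],
})

# Two destination points that were both joined to geography 1040,
# so they share the same index_right and FIPS values
toy_destinations = pd.DataFrame({
    "index_right": [17553, 17553],
    "FIPS": [1040, 1040],
})

toy_merged = toy_matrix.merge(toy_destinations, left_on="destination", right_on="FIPS")
print(toy_merged)
```

Each origin shows up twice, once per destination point. That duplication is what makes the later row count equal to the number of destinations rather than the number of geographies.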


### Analysis Time
Let's get down to business. To begin, let's declare some variables that will help us a bit later: our origin column (by default, creatively, `origin`), the destination ID column (`index_right`, which the spatial join added to identify the matched geography), the travel cost column, and the threshold for travel time.

In [8]:
origin_col = 'origin'
destination_id_col = 'index_right'
travel_cost_col = 'minutes'
travel_threshold = 30

### Data Cleanup
We have some weird -1000 values. Let's fix them and replace them with 999, the default null value of this travel matrix.

In [9]:
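## Clean up the -1000 values first, so they can't sort to the top as "nearest"
travel_costs = merge_transit_matrix.copy()
travel_costs[travel_cost_col] = travel_costs[travel_cost_col].replace(-1000, 999)
travel_costs[origin_col] = travel_costs[origin_col].astype('int64')
## Then sort by travel cost, lowest first
travel_costs = travel_costs.sort_values(travel_cost_col, ascending=True)
travel_costs.head()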

Unnamed: 0,origin,destination,minutes,index_right,FIPS
1328611,55912,55912,0.0,25673,55912
1906226,94040,94040,0.0,28940,94040
1482319,64701,64701,0.0,22226,64701
1482553,64735,64735,0.0,918,64735
1482836,65301,65301,0.0,4532,65301
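
One thing to watch with sentinel values like these: the order of cleaning and sorting matters for the nearest-location step that follows. The toy sketch below (hypothetical values) compares the two orders.

```python
import pandas as pd

toy = pd.DataFrame({
    "origin": [1, 1, 2],
    "minutes": [-1000.0, 12.0, 8.0],  # -1000 is the sentinel to clean
})

# Sort first, clean second: the sentinel row stays at the top, now as 999
sorted_then_cleaned = toy.sort_values("minutes").assign(
    minutes=lambda d: d["minutes"].replace(-1000, 999)
)

# Clean first, sort second: real travel times lead
cleaned_then_sorted = toy.assign(
    minutes=toy["minutes"].replace(-1000, 999)
).sort_values("minutes")

# Taking the first row per origin picks very different "nearest" values
nearest_bad = sorted_then_cleaned[~sorted_then_cleaned["origin"].duplicated()]
nearest_good = cleaned_then_sorted[~cleaned_then_sorted["origin"].duplicated()]
print(nearest_bad.set_index("origin")["minutes"].to_dict())
print(nearest_good.set_index("origin")["minutes"].to_dict())
```

With the first order, origin 1 would report 999 as its nearest travel time even though a 12-minute destination exists.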


### Nearest location
To get the nearest location, sort the values by lowest cost, then keep only the first appearance of each origin ID. Because the rows are sorted by travel cost, the first time an origin shows up is its lowest-cost destination.

We'll use pandas' `.duplicated()` method with the not (`~`) operator in front of it, which keeps only rows whose origin has not appeared earlier.

In [10]:
time_to_nearest = travel_costs[~travel_costs.origin.duplicated()][[origin_col, travel_cost_col]]
time_to_nearest.head()

Unnamed: 0,origin,minutes
1328611,55912,0.0
1906226,94040,0.0
1482319,64701,0.0
1482553,64735,0.0
1482836,65301,0.0
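
As a sanity check, the first-occurrence trick should agree with a plain per-origin minimum. Here is a small sketch on hypothetical toy data; `groupby(...).min()` is an alternative formulation, not the method this notebook uses.

```python
import pandas as pd

toy_costs = pd.DataFrame({
    "origin": [1, 1, 2, 2, 3],
    "minutes": [12.0, 5.0, 30.0, 45.0, 7.5],
}).sort_values("minutes")

# First-occurrence approach: relies on the rows already being sorted by cost
first_seen = toy_costs[~toy_costs["origin"].duplicated()][["origin", "minutes"]]

# Cross-check: minimum cost per origin, independent of row order
per_origin_min = toy_costs.groupby("origin")["minutes"].min()

print(first_seen.set_index("origin")["minutes"].sort_index().to_dict())
print(per_origin_min.to_dict())
```

The first-occurrence version keeps the whole row, which is handy if you also want to know which destination was nearest. The `groupby` version does not depend on the sort order.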


### Count in Threshold 
To get the count of destinations within a given threshold (by default, 30 minutes), we can chain a couple of pandas methods.

First, we'll filter the `travel_costs` dataframe for costs that are less than or equal to the threshold. Then we group by the origin column, giving us sets of rows that share the same origin ID, and count the rows in each group, giving us the number of destinations for each origin ID with a travel cost at or under our threshold.

We'll re-label some columns for easy reference.

In [11]:
## For count, we simply filter for the cost at or under a given threshold
## Then group by and count the results
count_within_threshold = travel_costs[travel_costs[travel_cost_col] <= travel_threshold] \
  .groupby(origin_col).count() \
  .reset_index()[[origin_col, travel_cost_col]]
count_within_threshold.columns = [origin_col, f"count within {travel_threshold}"]
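
For intuition, here is the same filter-then-count pattern on hypothetical toy data. This sketch uses `.size()` rather than `.count()`, a swapped-in variant that counts rows per group directly instead of non-null values per column.

```python
import pandas as pd

toy_costs = pd.DataFrame({
    "origin": [1, 1, 1, 2, 2],
    "minutes": [5.0, 30.0, 31.0, 45.0, 10.0],
})
threshold = 30

# Keep rows at or under the threshold (the boundary value 30.0 is included)
within = toy_costs[toy_costs["minutes"] <= threshold]

# Count the remaining rows per origin
counts = within.groupby("origin").size().reset_index(name=f"count within {threshold}")
print(counts)
```

Note that an origin with no rows at or under the threshold does not appear in the result at all, rather than appearing with a count of 0.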

### Merge Results
Now, we can merge our two findings into an easy, breezy, beautiful dataframe.

In [12]:
merged_metrics = count_within_threshold.merge(time_to_nearest, on=origin_col, how="outer")
merged_metrics.head()

Unnamed: 0,origin,count within 30,minutes
0,1001,9.0,10.4
1,1002,6.0,18.07
2,1003,5.0,20.03
3,1005,1.0,29.7
4,1007,5.0,18.88
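
The counts above print as `9.0` rather than `9` because of the outer merge: any origin missing from the count table gets a null count, which forces the column to floating point. A toy sketch with hypothetical values:

```python
import pandas as pd

# Origin 3 has a nearest destination but nothing within the threshold
toy_counts = pd.DataFrame({"origin": [1, 2], "count within 30": [3, 1]})
toy_nearest = pd.DataFrame({"origin": [1, 2, 3], "minutes": [4.0, 12.5, 48.0]})

toy_metrics = toy_counts.merge(toy_nearest, on="origin", how="outer")
print(toy_metrics)
print(toy_metrics.dtypes)
```

Origin 3 keeps its travel time but has a null count, which is the gap filled in during cleanup.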


### Cleanup
One last edge case to handle here: It is possible some origins are not within 30 minutes of a destination, meaning some of the data will be null. Or, we might have lost some origins from the full geographies dataset.

While not the end of the world, we can clean this up here before shipping off the results to our (soon to be disgruntled) data scientist colleagues.

The below finds the missing origin IDs and fills them in, giving us the revered `findings` dataframe. 

🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉

In [13]:
## To clean up any missing data, we can check back with our origin list
analyzed_origins = set(merged_metrics[origin_col])  # set gives fast membership checks
missing_origins = [o for o in GEOGRAPHIES.FIPS if o not in analyzed_origins]

## Then, fill the missing data
missing_data = []
for o in missing_origins:
    missing_entry = {}
    missing_entry[origin_col] = o
    missing_entry[f"count within {travel_threshold}"]=0
    missing_entry[travel_cost_col]=None
    missing_data.append(missing_entry)
missing_df = pd.DataFrame(missing_data)

## and concatenate results
findings = pd.concat([merged_metrics, missing_df])
# Fill any null values with 0 for count within
findings['count within 30'] = findings['count within 30'].fillna(0).astype(int)
# Replace error value "999" in matrices with blanks
findings['minutes'] = findings['minutes'].replace(999.0, None)
findings.head()

Unnamed: 0,origin,count within 30,minutes
0,1001,9,10.4
1,1002,6,18.07
2,1003,5,20.03
3,1005,1,29.7
4,1007,5,18.88
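
If the loop feels heavy, the same fill can be sketched with `reindex`. This is an alternative formulation on hypothetical toy data, and it assumes each origin appears only once in the metrics table; `all_origins` stands in for `GEOGRAPHIES.FIPS`.

```python
import pandas as pd

toy_metrics = pd.DataFrame({
    "origin": [1001, 1002],
    "count within 30": [9.0, 6.0],
    "minutes": [10.4, 18.07],
})
all_origins = [1001, 1002, 1003]  # hypothetical full list of origin IDs

toy_findings = (
    toy_metrics.set_index("origin")
    .reindex(all_origins)      # adds an all-NaN row for each missing origin
    .rename_axis("origin")     # make sure the index keeps its name
    .reset_index()
)
# Missing origins get a count of 0; their travel time stays blank
toy_findings["count within 30"] = toy_findings["count within 30"].fillna(0).astype(int)
print(toy_findings)
```

The result also comes back in the order of `all_origins`, which can make a later join to the geographies easier to eyeball.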


### What's Next?
Well, now you could take this data and export it as a CSV, or join it back to the geographies and visualize it, or try running this analysis with some different data. Ball is in your court, you got this!


In [14]:
# Export to csv
findings.to_csv('hospitals_drive_zip.csv', index = False)