In [1]:
import numpy as np
import pandas as pd

Here I extract the relevant columns and convert the zip code to a string for processing. Many zip codes carry a +4 suffix that identifies a delivery route (e.g. 12345-6789). Since the delivery route isn't needed to determine location, I strip it and keep only the 5-digit zip.

In [98]:
data = pd.read_csv('./data/scorecard/Most-Recent-Cohorts-Full.csv')
data = data.iloc[:, 0:1027]
# Keep the institution name, zip, and coordinates; drop rows missing any of them
data = data[['INSTNM', 'ZIP', 'LATITUDE', 'LONGITUDE']].dropna(how='any')
data.columns = ['name', 'zip', 'lat', 'long']
# Treat zip as text and keep only the 5-digit part before any '-'
data['zip'] = data['zip'].astype('string')
data['zip'] = data['zip'].str.split('-', expand=True)[0]
len(data)

6189
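
As a quick check of the +4 stripping step, here is a minimal sketch on a few made-up zip values (the sample values are hypothetical, not taken from the Scorecard file). It also shows slicing the first five characters, which should give the same result whenever the zip is already stored as text.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from the Scorecard CSV.
raw = pd.Series(['92804-1234', '02139', '90639'], dtype='string')

# Keep only the part before the hyphen, i.e. the 5-digit zip.
zip5 = raw.str.split('-', expand=True)[0]

# Alternative: take the first five characters of each string.
zip5_alt = raw.str.slice(0, 5)

print(zip5.tolist())
print(zip5_alt.tolist())
```

Both approaches assume the zip was read as text. If it were read as an integer, leading zeros (as in 02139) would already be lost before this step.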

This function calculates the great-circle distance between two latitude/longitude points using the haversine formula and returns a value in km. I took the code from this Stack Overflow question: https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula. This might be useful if we want to directly calculate the distance from the user to a college they're interested in.

In [99]:
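from math import cos, asin, sqrt, pi

def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    p = pi/180  # degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    return 12742 * asin(sqrt(a))  # 12742 km = 2 * Earth's mean radius (6371 km)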
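
A small sanity check of the function, as a sketch: one degree of latitude along a meridian should come out to roughly 111 km. The second call uses the 92804 zip coordinates and the Chapman University coordinates that appear in the outputs later in this notebook. The function is repeated so the cell runs on its own.

```python
from math import cos, asin, sqrt, pi

def distance(lat1, lon1, lat2, lon2):
    # Same haversine function as above, repeated so this cell is self-contained.
    p = pi / 180
    a = 0.5 - cos((lat2 - lat1) * p) / 2 + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2
    return 12742 * asin(sqrt(a))

# One degree of latitude along the prime meridian.
one_deg = distance(0.0, 0.0, 1.0, 0.0)

# Zip 92804 coordinates to Chapman University.
to_chapman = distance(33.818271, -117.975017, 33.79302, -117.852518)

print(round(one_deg, 2), 'km per degree of latitude')
print(round(to_chapman, 2), 'km from 92804 to Chapman University')
```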

I downloaded the zip codes from here: https://gist.github.com/abatko/ee7b24db82a6f50cfce02afafa1dfd1e. This zip code file is from 2018, so it isn't the most up-to-date version. I also found an API that could serve as an alternative to an outdated database, but it requires a precise address as input: https://geocoding.geo.census.gov/geocoder/Geocoding_Services_API.pdf. I then count how many college zip codes in our college database do not have a corresponding entry in the zip code file. Ideally we'll use this zip code information to convert the user's specified location to latitude/longitude for the nearest-colleges calculation.

In [100]:
zipcodes = pd.read_csv('./data/zipcodes.csv', dtype={'ZIP': object})
zipcodes.columns = ['zip', 'zlat', 'zlong']
zipcodes['zip'] = zipcodes['zip'].astype('string')
sum(data.merge(zipcodes, how='left', on='zip')['zlat'].isna())

272
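
To see which zip codes are missing rather than only how many, one option is the indicator argument of merge. This is a minimal sketch on toy frames; the college names and coordinates below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the college and zip code tables (hypothetical values).
colleges = pd.DataFrame({
    'name': ['A College', 'B College', 'C College'],
    'zip': ['92804', '00000', '90639'],
})
zips = pd.DataFrame({
    'zip': ['92804', '90639'],
    'zlat': [33.8, 33.9],
    'zlong': [-117.9, -118.0],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = colleges.merge(zips, how='left', on='zip', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

print(len(unmatched))
print(unmatched['zip'].tolist())
```

Rows marked left_only are colleges whose zip has no entry in the zip code file, which is the same set counted by the isna() sum above.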

After looking into spatial-partitioning trees for efficient neighbor queries, I found that scipy already provides an implementation in scipy.spatial.KDTree. Documentation can be found here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree.

In [102]:
from scipy.spatial import KDTree
tree = KDTree(data[['lat', 'long']])

For the following function, I convert a distance in km to degrees of latitude to provide a search radius for the scipy KDTree. I chose latitude arbitrarily: the same ground distance corresponds to a different number of degrees of latitude than of longitude (i.e. 10 miles is not the same number of degrees in each direction). Using a single degree value as the radius is therefore somewhat inaccurate, as described here: http://janmatuschek.de/LatitudeLongitudeBoundingCoordinates. However, since the KDTree query takes a single radius value as input, I decided to move forward with this.

In [167]:
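def dist_to_lat(dist, unit="kilometers"):
    """Convert a distance (km by default, or miles) to degrees of latitude."""
    if unit=="miles":
        dist *= 1.60934  # miles to km
    return dist/110.574  # approx. km per degree of latitude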
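
To get a feel for how large the latitude/longitude mismatch is, here is a rough sketch. The helper dist_to_lon and the constant 111.320 km per degree of longitude at the equator are my own assumptions for illustration; they are not part of the notebook's pipeline.

```python
from math import cos, pi

KM_PER_DEG_LAT = 110.574          # same constant as dist_to_lat
KM_PER_DEG_LON_EQUATOR = 111.320  # assumed approximate value at the equator

def dist_to_lon(dist_km, lat_deg):
    """Hypothetical helper: km to degrees of longitude at a given latitude."""
    return dist_km / (KM_PER_DEG_LON_EQUATOR * cos(lat_deg * pi / 180))

radius_km = 10 * 1.60934                    # 10 miles in km
lat_deg = radius_km / KM_PER_DEG_LAT        # radius in degrees of latitude
lon_deg = dist_to_lon(radius_km, 33.818271) # radius in degrees of longitude

print(round(lat_deg, 4), 'degrees of latitude')
print(round(lon_deg, 4), 'degrees of longitude at latitude 33.818271')
```

Because a degree of longitude covers less ground away from the equator, a radius expressed in degrees of latitude reaches a shorter distance east-west than north-south.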

## Testing Query

Define the zip code and the search radius in miles.

In [188]:
user_zip = '92804'
miles_radius = 10

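This runs a KDTree ball query around the coordinates of the user's zip code and looks up the colleges that fall within the radius.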

In [187]:
query = tree.query_ball_point(zipcodes[zipcodes['zip'] == user_zip][['zlat', 'zlong']], dist_to_lat(miles_radius, unit="miles"))
tree_method = data.iloc[query[0]]
tree_method

Unnamed: 0,name,zip,lat,long
187,Bethesda University,92801,33.842295,-117.941323
188,Biola University,90639,33.906203,-118.014374
189,Brownson Technical School,92805,33.819032,-117.905774
208,California State University-Fullerton,92831,33.881506,-117.885446
210,California State University-Long Beach,90840,33.782818,-118.11204
226,Haven University,92840,33.777392,-117.939412
233,Career Academy of Beauty,92845,33.781018,-118.031697
241,Cerritos College,90650,33.886874,-118.097337
245,Chapman University,92866,33.79302,-117.852518
258,Coastline Community College,92708,33.715634,-117.929143
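
The query[0] indexing above is needed because a two-dimensional input makes query_ball_point return one list of indices per query point. A minimal sketch on three points (the first two are colleges from the output above, the third is a made-up far-away point):

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([
    [33.842295, -117.941323],  # Bethesda University
    [33.906203, -118.014374],  # Biola University
    [34.5, -119.0],            # hypothetical point outside the radius
])
small_tree = KDTree(points)

center = [33.818271, -117.975017]  # zip 92804 coordinates
radius_deg = 10 * 1.60934 / 110.574

# A single point returns a plain list of row indices.
single = small_tree.query_ball_point(center, radius_deg)

# A 2-D input (one row per query point) returns an array of lists.
batch = small_tree.query_ball_point([center], radius_deg)

print(sorted(single))
print(sorted(batch[0]))
```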


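This calculates the L2-norm (Euclidean) distance, measured in degrees, between the coordinates of the provided zip code and every college in the dataframe, and keeps the colleges within the same radius.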

In [190]:
curr_point = zipcodes[zipcodes['zip'] == user_zip][['zlat', 'zlong']]
print(curr_point)
lat = curr_point['zlat'].values[0]
long = curr_point['zlong'].values[0]
calc_method = data[np.sqrt((data['lat'] - lat)**2 + (data['long'] - long)**2) <= dist_to_lat(miles_radius, "miles")]
calc_method

            zlat       zlong
30642  33.818271 -117.975017


Unnamed: 0,name,zip,lat,long
187,Bethesda University,92801,33.842295,-117.941323
188,Biola University,90639,33.906203,-118.014374
189,Brownson Technical School,92805,33.819032,-117.905774
208,California State University-Fullerton,92831,33.881506,-117.885446
210,California State University-Long Beach,90840,33.782818,-118.11204
226,Haven University,92840,33.777392,-117.939412
233,Career Academy of Beauty,92845,33.781018,-118.031697
241,Cerritos College,90650,33.886874,-118.097337
245,Chapman University,92866,33.79302,-117.852518
258,Coastline Community College,92708,33.715634,-117.929143
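
Both methods above measure distance in degrees. As a sketch of where that can differ from a true 10-mile radius, the cell below compares the degree-based filter with a vectorized haversine distance for a made-up point 0.16 degrees due east of the zip code. The function haversine_km and the test point are assumptions for illustration only.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Vectorized haversine distance in km; inputs are decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

zlat, zlong = 33.818271, -117.975017  # zip 92804 coordinates
radius_km = 10 * 1.60934
radius_deg = radius_km / 110.574

# Hypothetical point 0.16 degrees due east of the zip code.
pt_lat, pt_long = zlat, zlong + 0.16

deg_dist = np.sqrt((pt_lat - zlat) ** 2 + (pt_long - zlong) ** 2)
km_dist = haversine_km(zlat, zlong, pt_lat, pt_long)

in_deg = bool(deg_dist <= radius_deg)  # degree-space filter used above
in_km = bool(km_dist <= radius_km)     # great-circle filter

print(in_deg, in_km)
```

For this point the degree-based filter excludes it while the haversine distance puts it inside 10 miles, so a haversine pass over a slightly larger candidate set might be worth considering if exact radii matter.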


Confirm that both methods provide the same results:

In [191]:
tree_method == calc_method

Unnamed: 0,name,zip,lat,long
187,True,True,True,True
188,True,True,True,True
189,True,True,True,True
208,True,True,True,True
210,True,True,True,True
226,True,True,True,True
233,True,True,True,True
241,True,True,True,True
245,True,True,True,True
258,True,True,True,True
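
For a single yes/no answer instead of a table of booleans, DataFrame.equals or a reduction over the elementwise comparison could be used. A minimal sketch on toy frames (values are made up):

```python
import pandas as pd

# Toy frames standing in for tree_method and calc_method.
a = pd.DataFrame({'name': ['X', 'Y'], 'lat': [1.0, 2.0]}, index=[187, 188])
b = a.copy()

# True only if shape, labels, and values all match.
same = a.equals(b)

# Reduce the elementwise comparison to one boolean.
all_equal = bool((a == b).all().all())

# equals also handles frames with different rows without raising.
different = a.equals(a.iloc[:1])

print(same, all_equal, different)
```

Note that == raises an error when the two frames have different row or column labels, whereas equals simply returns False in that case.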
