In [1]:
import numpy as np
import pandas as pd

Here I extract the relevant columns and convert zip to string for processing. Many zipcodes have a +4 identifier for delivery routes (e.g. 12345-6789). Since these delivery routes aren't essential to calculating location, I remove these and focus only on the 5 digit zip.

In [2]:
data = pd.read_csv('./data/scorecard/Most-Recent-Cohorts-Full.csv')
data = data.iloc[:, 0:1027]
data = data[['INSTNM', 'ZIP', 'LATITUDE', 'LONGITUDE']].dropna(how='any')
data.columns = ['name', 'zip', 'lat', 'long']
data['zip'] = data['zip'].astype('string')
data['zip'] = data['zip'].str.split('-', expand=True)[0]
len(data)

6189
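
As a quick check of the ZIP+4 handling described above, here is a minimal sketch on a few toy values (the series below is made up for illustration and is not read from the Scorecard file). It applies the same `str.split` step used in the cell above.

```python
import pandas as pd

# Toy ZIP values for illustration only (not loaded from the Scorecard CSV).
zips = pd.Series(['02532-1803', '93106', '16563-0001'], dtype='string')

# Keep the part before the hyphen; values without a +4 suffix pass through unchanged.
zip5 = zips.str.split('-', expand=True)[0]
print(zip5.tolist())
```

Keeping the column as a string matters here because a numeric cast would drop the leading zero of a zip such as 02532.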

This function calculates the distance between two latitude/longitude points using the haversine formula and returns a value in km. I found the code in this Stack Overflow thread: https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula. This might be useful if we want to directly calculate the distance from the user to a college they're interested in.

In [3]:
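from math import cos, asin, sqrt, pi

def distance(lat1, lon1, lat2, lon2):
    # Haversine distance in km between two points given in decimal degrees.
    p = pi/180  # degrees-to-radians factor
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    return 12742 * asin(sqrt(a))  # 12742 km = 2 * Earth's mean radius (6371 km)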
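
To sanity-check the function, a small usage sketch follows. The coordinates are arbitrary test points I picked, not values from the dataset. On a sphere of radius 6371 km, one degree of latitude should come out near 111.2 km, which is close to the 110.574 km per degree figure used later in this notebook.

```python
from math import cos, asin, sqrt, pi

def distance(lat1, lon1, lat2, lon2):
    # Same haversine helper as in the cell above.
    p = pi / 180
    a = (0.5 - cos((lat2 - lat1) * p) / 2
         + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a))

# Identical points should be 0 km apart.
same = distance(33.8, -118.0, 33.8, -118.0)

# Two points one degree of latitude apart on the same meridian.
one_deg = distance(33.0, -118.0, 34.0, -118.0)

print(same, round(one_deg, 2))
```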

I downloaded the zipcodes from here: https://gist.github.com/abatko/ee7b24db82a6f50cfce02afafa1dfd1e. This zipcode file is from 2018, so it isn't the most up-to-date version. I found an API that could serve as an alternative to an outdated database, but it requires a precise address as input: https://geocoding.geo.census.gov/geocoder/Geocoding_Services_API.pdf. I then count how many zip codes in our college database do not have a corresponding entry in the zipcode file. Ideally we'll use this zipcode information to convert the user's specified location to latitude/longitude for the nearest-colleges calculations.

In [4]:
zipcodes = pd.read_csv('./data/zipcodes.csv', dtype={'ZIP': object})
zipcodes.columns = ['zip', 'zlat', 'zlong']
zipcodes['zip'] = zipcodes['zip'].astype('string')
sum(data.merge(zipcodes, how='left', on='zip')['zlat'].isna())

272
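
To see which rows are unmatched rather than only how many, one option is pandas' `indicator=True` flag on `merge`. This is a minimal sketch on toy frames: the college names and the second zip's coordinates are invented for illustration, and only the 92804 centroid is taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for `data` and `zipcodes`.
colleges = pd.DataFrame({
    'name': ['College A', 'College B', 'College C'],
    'zip': ['92804', '00000', '93106'],
})
zip_table = pd.DataFrame({
    'zip': ['92804', '93106'],
    'zlat': [33.818271, 34.4],      # second value is a placeholder
    'zlong': [-117.975017, -119.8], # second value is a placeholder
})

# indicator=True adds a `_merge` column marking where each row was found.
merged = colleges.merge(zip_table, how='left', on='zip', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['name', 'zip']])
```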

After looking into building trees with spatial partition for efficient neighbor querying, I found a handy scipy package that implements this. Documentation can be found here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree.

In [5]:
from scipy.spatial import KDTree
tree_kd = KDTree(data[['lat', 'long']])

For the following function, I convert from km to degrees of latitude to provide a search radius for the scipy KDTree. I chose latitude arbitrarily: one degree of latitude and one degree of longitude cover different ground distances, so 10 miles corresponds to a different number of degrees along each axis. Using only latitude or only longitude to calculate the radius is somewhat inaccurate, as described here: http://janmatuschek.de/LatitudeLongitudeBoundingCoordinates. However, as KDTree takes a single radius value as input for the query, I decided to move forward with this.

In [6]:
def dist_to_lat(dist, unit="kilometers"):
    # Convert miles to km first (1 mile = 1.60934 km).
    if unit == "miles":
        dist *= 1.60934
    # One degree of latitude spans roughly 110.574 km.
    return dist / 110.574
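
To illustrate how much the two axes differ, here is a rough sketch that converts a radius to degrees along each axis. The 110.574 km per degree of latitude is the constant used above. The 111.320 km per degree of longitude at the equator, scaled by the cosine of the latitude, is a common approximation that I am assuming here, and the helper names are hypothetical.

```python
from math import cos, radians

KM_PER_DEG_LAT = 110.574          # constant used in this notebook
KM_PER_DEG_LON_EQUATOR = 111.320  # assumed approximation at the equator

def km_per_deg_lon(lat_deg):
    # A degree of longitude shrinks with the cosine of the latitude.
    return KM_PER_DEG_LON_EQUATOR * cos(radians(lat_deg))

def radius_in_degrees(km, lat_deg):
    # Returns (degrees of latitude, degrees of longitude) covering `km`.
    return km / KM_PER_DEG_LAT, km / km_per_deg_lon(lat_deg)

# 10 miles (16.0934 km) around the 92804 centroid latitude.
dlat, dlon = radius_in_degrees(16.0934, 33.818271)
print(round(dlat, 4), round(dlon, 4))
```

At this latitude the longitude radius is noticeably larger in degrees than the latitude radius, so a single latitude-based radius undercounts colleges to the east and west.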

## Testing Query

Define zipcode and mile radius for search.

In [7]:
user_zip = '92804'
user_zip_loc = zipcodes[zipcodes['zip'] == user_zip][['zlat', 'zlong']]
miles_radius = 10

This uses the KDTree query.

In [8]:
query_kd = tree_kd.query_ball_point(user_zip_loc, dist_to_lat(miles_radius, unit="miles"))
kd_method = data.iloc[query_kd[0]]

This calculates the L2-norm (Euclidean) distance, in degrees, between the provided zip code and every college in the dataframe, and keeps the colleges that fall within the radius.

In [9]:
curr_point = zipcodes[zipcodes['zip'] == user_zip][['zlat', 'zlong']]
print(curr_point)
lat = curr_point['zlat'].values[0]
long = curr_point['zlong'].values[0]
calc_method = data[np.sqrt((data['lat'] - lat)**2 + (data['long'] - long)**2) <= dist_to_lat(miles_radius, "miles")]

            zlat       zlong
30642  33.818271 -117.975017


Confirm that both methods provide the same results:

In [10]:
(kd_method != calc_method).sum()

name    0
zip     0
lat     0
long    0
dtype: int64

## SKLearn's BallTree and KDTree

After discussing with Jordan, two issues came to light: (1) the ability to create custom distance functions and (2) being able to alter the top nearest-neighbor choices based on different column weights. I decided to look into scikit-learn's nearest-neighbor and distance-based tree classes. First I check whether scikit-learn's BallTree returns the same result as the scipy KDTree.

In [11]:
from sklearn.neighbors import BallTree
from sklearn.neighbors import KDTree  # note: shadows the scipy.spatial.KDTree imported earlier
from sklearn.neighbors import DistanceMetric

In [12]:
tree_sk = BallTree(data[['lat', 'long']])  

In [13]:
query_sk = tree_sk.query_radius(user_zip_loc, dist_to_lat(miles_radius, unit="miles"))
sk_method = data.iloc[query_sk[0]]
(sk_method.sort_index() != kd_method).sum()

name    0
zip     0
lat     0
long    0
dtype: int64
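
Since scikit-learn's BallTree accepts `metric='haversine'`, one way to sidestep the degrees-based radius is to query in great-circle distance directly. This is a sketch of that alternative, not what the cells above do. The haversine metric expects `[lat, lon]` in radians and works on a unit sphere, so the radius is divided by an Earth radius of 6371 km. The second and third points are hypothetical; the first is the 92804 centroid from this notebook.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# Toy coordinates in degrees, [lat, lon].
coords_deg = np.array([
    [33.818271, -117.975017],  # 92804 centroid
    [33.90, -117.90],          # hypothetical point about 11 km away
    [34.41, -119.85],          # hypothetical point well outside the radius
])

# Haversine metric requires radians.
tree = BallTree(np.radians(coords_deg), metric='haversine')

query = np.radians([[33.818271, -117.975017]])
radius_km = 10 * 1.60934  # 10 miles in km
idx = tree.query_radius(query, r=radius_km / EARTH_RADIUS_KM)[0]
print(sorted(idx.tolist()))
```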

We will have to rebuild the tree every time we want to use a different distance metric. I don't believe we'll need anything beyond Minkowski distance, but here I try a different metric (Manhattan), first as a custom Python function and then through scikit-learn's built-in metrics, which (not surprisingly) include Manhattan.

In [14]:
def manhattan(x, y):
    return np.sum(np.absolute(x-y))

In [15]:
tree_sk_man = BallTree(data[['lat', 'long']], metric=manhattan)
query_sk_man = tree_sk_man.query_radius(user_zip_loc, dist_to_lat(miles_radius, unit="miles"))
sk_man_method = data.iloc[query_sk_man[0]]

In [16]:
tree_man = BallTree(data[['lat', 'long']], metric=DistanceMetric.get_metric('manhattan'))
query_man = tree_man.query_radius(user_zip_loc, dist_to_lat(miles_radius, unit="miles"))
man_method = data.iloc[query_man[0]]
(sk_man_method != man_method).sum()

name    0
zip     0
lat     0
long    0
dtype: int64

Here I time the creation of 1,000 trees of each type to see how costly it would be to rebuild a tree.

In [17]:
%%time
for i in range(1000):
    BallTree(data[['lat', 'long']])

CPU times: user 2.49 s, sys: 9.04 ms, total: 2.5 s
Wall time: 2.49 s


In [18]:
%%time
for i in range(1000):
    KDTree(data[['lat', 'long']])

CPU times: user 2.61 s, sys: 14.9 ms, total: 2.62 s
Wall time: 2.62 s
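
Because a rebuild is only needed when the metric changes, one option is to keep one tree per metric and reuse it. The sketch below uses `functools.lru_cache` for that. The `get_tree` helper and the toy coordinates are hypothetical, and the cache key is limited to metric names given as strings.

```python
from functools import lru_cache

import numpy as np
from sklearn.neighbors import BallTree

# Toy coordinates standing in for data[['lat', 'long']].
coords = np.array([[33.8, -118.0], [34.0, -118.2], [40.7, -74.0]])

@lru_cache(maxsize=None)
def get_tree(metric='euclidean'):
    # Built once per metric name, then served from the cache.
    return BallTree(coords, metric=metric)

first = get_tree('manhattan')
second = get_tree('manhattan')
print(first is second)
```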


## SKLearn NN

In [121]:
from sklearn.neighbors import NearestNeighbors

In [132]:
data = pd.read_csv('./data/scorecard/Most-Recent-Cohorts-Full.csv')
columns = pd.read_excel('./data/scorecard/columns-simplified.xlsx')


I drop the irrelevant identification columns and, for now, replace any privacy-suppressed and NaN values with 0. We may want to figure out a better way to handle these values in the future.

In [140]:
simplified = data[list(columns['VARIABLE NAME'])]
# A non-inplace drop returns a new frame and avoids pandas' SettingWithCopyWarning.
simplified = simplified.drop(columns=['UNITID', 'INSTNM', 'CITY', 'STABBR', 'ZIP', 'ACCREDAGENCY'])


In [156]:
simplified.columns[(simplified == "PrivacySuppressed").any()]

Index(['COMP_ORIG_YR2_RT', 'COMP_ORIG_YR3_RT', 'COMP_ORIG_YR4_RT',
       'COMP_ORIG_YR6_RT', 'COMP_ORIG_YR8_RT'],
      dtype='object')

In [157]:
simplified[['COMP_ORIG_YR2_RT', 'COMP_ORIG_YR3_RT', 'COMP_ORIG_YR4_RT', 'COMP_ORIG_YR6_RT', 'COMP_ORIG_YR8_RT']]

Unnamed: 0,COMP_ORIG_YR2_RT,COMP_ORIG_YR3_RT,COMP_ORIG_YR4_RT,COMP_ORIG_YR6_RT,COMP_ORIG_YR8_RT
0,0.036073329391,0.114434330299,0.210526315789,0.28077232502,0.314393939394
1,0.145747707872,0.327259204165,0.461707585196,0.53630239521,0.524893863373
2,PrivacySuppressed,0.0859375,0.162962962963,0.141463414634,0.239726027397
3,0.165609584214,0.313755795981,0.464680851064,0.529255319149,0.485385296723
4,0.026500389712,0.130295763389,0.237909516381,0.284132841328,0.266284896206
...,...,...,...,...,...
6689,0.136350857464,0.163543897216,0.17,0.160021124901,0.147801009373
6690,0.538461538462,0.457142857143,0.448979591837,PrivacySuppressed,PrivacySuppressed
6691,0.630136986301,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed
6692,,,,,


In [170]:
# Treat privacy-suppressed entries as missing, then fill every missing value with 0 (placeholder choice).
simplified = simplified.replace("PrivacySuppressed", np.nan).fillna(0)


In [171]:
simplified[['COMP_ORIG_YR2_RT', 'COMP_ORIG_YR3_RT', 'COMP_ORIG_YR4_RT', 'COMP_ORIG_YR6_RT', 'COMP_ORIG_YR8_RT']]

Unnamed: 0,COMP_ORIG_YR2_RT,COMP_ORIG_YR3_RT,COMP_ORIG_YR4_RT,COMP_ORIG_YR6_RT,COMP_ORIG_YR8_RT
0,0.036073329391,0.114434330299,0.210526315789,0.28077232502,0.314393939394
1,0.145747707872,0.327259204165,0.461707585196,0.53630239521,0.524893863373
2,0,0.0859375,0.162962962963,0.141463414634,0.239726027397
3,0.165609584214,0.313755795981,0.464680851064,0.529255319149,0.485385296723
4,0.026500389712,0.130295763389,0.237909516381,0.284132841328,0.266284896206
...,...,...,...,...,...
6689,0.136350857464,0.163543897216,0.17,0.160021124901,0.147801009373
6690,0.538461538462,0.457142857143,0.448979591837,0,0
6691,0.630136986301,0,0,0,0
6692,0,0,0,0,0
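
Filling with 0 makes a suppressed completion rate look like a true 0% rate. As one possible alternative (not what this notebook does), the sketch below coerces the columns to numeric, keeps a mask of what was missing, and fills with the column median. The toy values are made up, and median imputation is only an assumption about what might work better.

```python
import numpy as np
import pandas as pd

# Toy rates mixing numeric strings, suppressed entries, and NaN.
rates = pd.DataFrame({
    'COMP_ORIG_YR2_RT': ['0.036', 'PrivacySuppressed', np.nan],
    'COMP_ORIG_YR3_RT': ['0.114', '0.086', 'PrivacySuppressed'],
})

# errors='coerce' turns anything non-numeric (including "PrivacySuppressed") into NaN.
numeric = rates.apply(pd.to_numeric, errors='coerce')

# Remember which cells were missing before filling them.
missing_mask = numeric.isna()

# Fill each column with its own median instead of 0.
filled = numeric.fillna(numeric.median())
print(filled)
```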


In [185]:
neigh = NearestNeighbors()
neigh.fit(simplified)

NearestNeighbors()

In [191]:
simplified = simplified.loc[:,~simplified.columns.duplicated()]

In [193]:
neigh.fit(simplified[['CONTROL', 'LOCALE', 'CIP14BACHL']])

NearestNeighbors()

In [195]:
college_query = neigh.kneighbors(pd.DataFrame([{'CONTROL': 0, 'LOCALE': 22, 'CIP14BACHL': 1}]), return_distance=False)

In [199]:
data.iloc[college_query[0]]

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,COUNT_WNE_MALE1_P8,MD_EARN_WNE_MALE1_P8,GT_THRESHOLD_P10,MD_EARN_WNE_INC1_P10,MD_EARN_WNE_INC2_P10,MD_EARN_WNE_INC3_P10,MD_EARN_WNE_INDEP1_P10,MD_EARN_WNE_INDEP0_P10,MD_EARN_WNE_MALE0_P10,MD_EARN_WNE_MALE1_P10
1482,166692,218100,2181,Massachusetts Maritime Academy,Buzzards Bay,MA,02532-1803,New England Commission on Higher Education,https://www.maritime.edu/,https://www.maritime.edu/netprice/,...,50.0,77731.0,0.9153,,,107188.0,,96539.0,,
2900,214591,332905,3329,Pennsylvania State University-Penn State Erie-...,Erie,PA,16563-0001,Middle States Commission on Higher Education,behrend.psu.edu/,tuition.psu.edu/CostEstimate.aspx,...,12228.0,58866.0,0.8214,48397.0,56055.0,66884.0,48950.0,59218.0,49994.0,66336.0
1518,167987,221000,2210,University of Massachusetts-Dartmouth,North Dartmouth,MA,02747-2300,New England Commission on Higher Education,www.umassd.edu/,https://umassd.studentaidcalculator.com/welcom...,...,128.0,54542.0,0.7889,54536.0,58193.0,63504.0,40750.0,60978.0,54123.0,69010.0
221,110705,132000,1320,University of California-Santa Barbara,Santa Barbara,CA,93106,Western Association of Schools and Colleges Se...,www.ucsb.edu/,finaid.ucsb.edu/net-price-calculator,...,2359.0,64209.0,0.8453,60925.0,67183.0,71665.0,58764.0,67149.0,64201.0,69859.0
2613,204839,310004,3100,Ohio University-Southern Campus,Ironton,OH,45638,Higher Learning Commission,https://www.ohio.edu/southern/,https://npc.collegeboard.org/app/ohio,...,4562.0,46751.0,0.7386,35953.0,46943.0,56032.0,35699.0,49460.0,42012.0,54537.0
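
On the column-weights question raised earlier: a weighted Euclidean distance is the same as plain Euclidean distance on columns scaled by the square root of each weight, so weights can be applied by rescaling the inputs before fitting rather than by writing a custom metric. The sketch below checks that equivalence with NumPy on toy rows. The weights and values are arbitrary assumptions.

```python
import numpy as np

# Toy rows in the order CONTROL, LOCALE, CIP14BACHL (values are illustrative).
X = np.array([[1.0, 22.0, 1.0],
              [2.0, 11.0, 0.0],
              [1.0, 21.0, 1.0]])
q = np.array([1.0, 22.0, 1.0])   # query point
w = np.array([4.0, 0.25, 1.0])   # hypothetical per-column weights

# Weighted Euclidean distance computed directly.
weighted = np.sqrt((w * (X - q) ** 2).sum(axis=1))

# Plain Euclidean distance after scaling every column by sqrt(weight).
scale = np.sqrt(w)
scaled = np.linalg.norm(X * scale - q * scale, axis=1)

order = np.argsort(weighted)
print(order.tolist(), np.allclose(weighted, scaled))
```

With this approach, changing the weights means refitting on rescaled columns, which the timing in this notebook suggests is affordable.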


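Refitting the nearest-neighbor model is significantly slower than building the KDTree or BallTree: about 39 ms per fit in the timing below versus about 2.5 ms per tree build above, although this fit uses every column in simplified rather than only latitude and longitude. However, it may still be fast enough for our use in the project.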

In [175]:
%%time
for i in range(100):
    neigh = NearestNeighbors()
    neigh.fit(simplified)

CPU times: user 3.55 s, sys: 360 ms, total: 3.91 s
Wall time: 3.91 s


### Example Filter and Query

Question: Which information should we directly filter vs. which information should we use for nearest neighbor search?

In [144]:
### Basic Information

## Input zip -> string
user_zip = '92804'

## Dropdown of 10, 25, 50, 100, All (None) -> int
miles_radius = 10

## Checkbox of the provided majors -> list
degree = ['Engineering']

## Slider -> tuple of ints
tuition_range = ()

## Dropdown of public, private, or both -> string
public_private = "public"

In [140]:
## Dropdown of religious affiliation or None -> string
religious_affiliation = None

## Checkbox of setting type urban, rural, suburban -> list of string
setting = []

### Checkbox certain states -> list of string
states = []
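
As an example of an input that seems better suited to a direct filter than to nearest-neighbor search, here is a sketch of a public/private filter. It assumes the Scorecard `CONTROL` coding of 1 = public, 2 = private nonprofit, 3 = private for-profit (worth confirming against the data dictionary). The `control_filter` helper and the toy frame are hypothetical.

```python
import pandas as pd

# Assumed CONTROL coding: 1 = public, 2 = private nonprofit, 3 = private for-profit.
CONTROL_CODES = {'public': [1], 'private': [2, 3], 'both': [1, 2, 3]}

def control_filter(df, public_private='both'):
    # Keep only the rows whose CONTROL code matches the dropdown choice.
    if public_private not in CONTROL_CODES:
        raise ValueError("public_private must be 'public', 'private', or 'both'")
    return df[df['CONTROL'].isin(CONTROL_CODES[public_private])]

# Toy frame for illustration.
toy = pd.DataFrame({'INSTNM': ['A', 'B', 'C'], 'CONTROL': [1, 2, 3]})
public_only = control_filter(toy, 'public')
print(public_only['INSTNM'].tolist())
```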

In [141]:
data = pd.read_csv('./data/scorecard/Most-Recent-Cohorts-Full.csv')
columns = pd.read_excel('./data/scorecard/columns-simplified.xlsx')

simplified = data[list(columns['VARIABLE NAME'])]

## Figure out what we want to do with privacy suppressed and nan values later
simplified = simplified.replace("PrivacySuppressed", np.nan).fillna(0)
simplified = simplified.loc[:, ~simplified.columns.duplicated()]


In [157]:
## Description: Finds colleges within a specified radius in miles of the provided zip code
## Inputs:
##   df - dataframe of college data
##   user_zip - provided zipcode
##   zip_to_lat - dataframe converting zipcodes to latitude/longitude
##   miles - radius in miles around user_zip, if none returns all colleges
## Outputs:
##   returns filtered college dataframe based on location

def radius_filter(df, user_zip, zip_to_lat, miles = None):
    if not miles:
        return df
    
    user_zip_loc = zip_to_lat[zip_to_lat['zip'] == user_zip][['zlat', 'zlong']]
    
    if user_zip_loc.empty:
        raise ValueError('Invalid zipcode input')
    
    tree = BallTree(df[['LATITUDE', 'LONGITUDE']])  
    query = tree.query_radius(user_zip_loc, (miles*1.60934/110.574))
    return df.iloc[query[0]]
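
A short usage sketch for `radius_filter` on toy frames follows. The function is repeated so the block runs on its own. The college rows are invented, and only the 92804 centroid comes from this notebook.

```python
import pandas as pd
from sklearn.neighbors import BallTree

def radius_filter(df, user_zip, zip_to_lat, miles=None):
    # Same logic as the cell above.
    if not miles:
        return df
    user_zip_loc = zip_to_lat[zip_to_lat['zip'] == user_zip][['zlat', 'zlong']]
    if user_zip_loc.empty:
        raise ValueError('Invalid zipcode input')
    tree = BallTree(df[['LATITUDE', 'LONGITUDE']])
    query = tree.query_radius(user_zip_loc, (miles * 1.60934 / 110.574))
    return df.iloc[query[0]]

# Toy colleges: one near the 92804 centroid, one on the other coast.
toy_colleges = pd.DataFrame({
    'INSTNM': ['Nearby College', 'Faraway College'],
    'LATITUDE': [33.85, 40.70],
    'LONGITUDE': [-117.95, -74.00],
})
toy_zips = pd.DataFrame({'zip': ['92804'], 'zlat': [33.818271], 'zlong': [-117.975017]})

nearby = radius_filter(toy_colleges, '92804', toy_zips, miles=10)
everything = radius_filter(toy_colleges, '92804', toy_zips, miles=None)
print(nearby['INSTNM'].tolist(), len(everything))
```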

In [158]:
## Description: Keeps the colleges that offer every major in the provided list.
##              Each major is mapped to its CIP column name; if no degree is specified,
##              all CIP columns are used.
## Inputs:
##   df - dataframe of college data
##   columns - dataframe from columns-simplified.xlsx without modifications
##   degree - list of interested majors
## Outputs:
##   returns filtered college dataframe based on interested majors

def degree_filter(df, columns, degree):

    temp_col = columns[columns['VARIABLE NAME'].str.contains('CIP')]
    var_name = temp_col['VARIABLE NAME'].astype('string')

    deg_cols = []
    if not degree:
        # Note: selecting every CIP column keeps only colleges with a nonzero value in all of them.
        deg_cols = list(var_name)
    else:
        if not isinstance(degree, list):
            raise TypeError("degree input must be a list")
        # Strip the "Bachelor's degree in " prefix and the trailing period to get the major name.
        degs = temp_col['NAME OF DATA ELEMENT'].str.split("Bachelor's degree in ", expand=True).iloc[:, 1]
        degs = degs.str.split('.', expand=True).iloc[:, 0]
        degs = degs.astype('string')
        degree_dict = dict(zip(degs, var_name))
        for deg in degree:
            deg_cols.append(degree_dict[deg])
    print(deg_cols)
    # Keep rows where every selected CIP column is nonzero.
    return df[df[deg_cols].all(axis=1)]
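
To make the name-to-column mapping inside `degree_filter` concrete, here is a sketch on a toy version of the columns sheet. The label strings are my assumption about how the entries in `columns-simplified.xlsx` are worded, and the college rows are invented.

```python
import pandas as pd

# Toy stand-in for columns-simplified.xlsx (label wording is assumed).
columns = pd.DataFrame({
    'VARIABLE NAME': ['CIP14BACHL', 'CIP11BACHL', 'CONTROL'],
    'NAME OF DATA ELEMENT': [
        "Bachelor's degree in Engineering.",
        "Bachelor's degree in Computer And Information Sciences And Support Services.",
        "Control of institution",
    ],
})

# Same steps as degree_filter: keep CIP rows, then strip the prefix and trailing period.
cip = columns[columns['VARIABLE NAME'].str.contains('CIP')]
labels = (cip['NAME OF DATA ELEMENT']
          .str.split("Bachelor's degree in ", expand=True).iloc[:, 1]
          .str.split('.', expand=True).iloc[:, 0])
degree_dict = dict(zip(labels, cip['VARIABLE NAME']))

# Toy colleges: a nonzero CIP value means the program is offered.
toy = pd.DataFrame({'INSTNM': ['A', 'B', 'C'], 'CIP14BACHL': [1, 0, 2]})
engineering = toy[toy[[degree_dict['Engineering']]].all(axis=1)]
print(degree_dict['Engineering'], engineering['INSTNM'].tolist())
```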