## Augment Carnegie Data with Locations

There are several ways to get location data for the universities listed in the Carnegie data. The first is to use GeoPy's Google geocoder, which queries Google's Geocoding API to convert a university name into an address and coordinates. When that fails (around 5% of the time), we can fall back on the National Center for Education Statistics (NCES) Integrated Postsecondary Education Data System (IPEDS) file [Institutional Characteristics: Directory information (HD2023)](https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx?year=2023&sid=943e89a7-2401-4cb2-a0c5-8cce57f04a7e&rtid=7). The two datasets share university IDs (`unitid` in Carnegie, `UNITID` in HD2023), so location information from HD2023 can be joined directly. The remaining few dozen can be filled in by hand.

I split the data into two batches (the first 2,000 rows and the remainder) because of API limitations.
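A batched run could be sketched as below. The notebook does not say which API limitation applies, so the batch size, the pause between batches, and the helper names `chunked`, `geocode_in_batches`, and `fake_geocode` are all assumptions made for illustration. The stub stands in for the real geocoder so the sketch runs offline.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Coords = Optional[Tuple[float, float]]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def geocode_in_batches(
    names: List[str],
    geocode: Callable[[str], Coords],
    batch_size: int = 2000,      # assumed batch size, mirrors the 2,000-row split
    pause_seconds: float = 1.0,  # assumed pause between batches
) -> Dict[str, Coords]:
    """Geocode names batch by batch, pausing between batches."""
    results: Dict[str, Coords] = {}
    for batch_number, batch in enumerate(chunked(names, batch_size)):
        if batch_number > 0:
            time.sleep(pause_seconds)
        for name in batch:
            # Skip names already looked up so duplicates cost no extra request
            if name not in results:
                results[name] = geocode(name)
    return results


def fake_geocode(name: str) -> Coords:
    """Hypothetical stand-in for a real geocoder; returns None when unknown."""
    known = {"Example University": (40.0, -75.0)}
    return known.get(name)


demo_results = geocode_in_batches(
    ["Example University", "Unknown College", "Example University"],
    fake_geocode,
    batch_size=2,
    pause_seconds=0.0,
)
print(demo_results)
```

Passing the geocoder in as a callable keeps the batching logic testable without an API key. In the notebook, a function wrapping `g.geocode` would take the place of `fake_geocode`.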

In [1]:
import os
import pandas as pd
import geopandas as gpd
import numpy as np
from geopy import geocoders
# GoogleV3 raises ConfigurationError if GOOGLE_API_KEY is not set in the environment
g = geocoders.GoogleV3(api_key=os.getenv('GOOGLE_API_KEY'))

ConfigurationError: Since July 2018 Google requires each request to have an API key. Pass a valid `api_key` to GoogleV3 geocoder to fix this error. See https://developers.google.com/maps/documentation/geocoding/usage-and-billing

In [52]:
# Read the workbook once, then slice it into the two batches
carnegie = pd.read_excel('../data/CCIHE2021-PublicData_limited.xlsx', sheet_name='Data')
# .copy() so that adding columns to a batch does not trigger SettingWithCopyWarning
carnegie_first_2000 = carnegie[:2000].copy()
carnegie_last = carnegie[2000:].copy()

In [None]:
# Wrapper for the geocoder: always returns the same four fields,
# filled with None when the lookup fails or finds nothing
def get_location_attributes(name):
    empty = pd.Series({
        'address': None,
        'latitude': None,
        'longitude': None,
        'point': None
    })
    try:
        location = g.geocode(name, timeout=10)
    except Exception:
        # In case of an error (e.g., timeout), return None values
        return empty
    if not location:
        # The geocoder found no match for this name
        return empty
    return pd.Series({
        'address': location.address,
        'latitude': location.latitude,
        'longitude': location.longitude,
        'point': location.point
    })

In [None]:
# Request Google geocoder
carnegie_last[['address', 'latitude', 'longitude', 'point']] = carnegie_last['name'].apply(get_location_attributes)

In [None]:
# Number of rows that Google failed to resolve
len(carnegie_last[carnegie_last['latitude'].isna()])

117

In [None]:
# Option 2: HD2023 data
hd_df = pd.read_csv('data/hd2023.csv')

# Build full address from components
hd_df['FULL_ADDR'] = hd_df['ADDR'] + ', ' + hd_df['CITY'] + ', ' + hd_df['STABBR'] + ' ' + hd_df['ZIP']

# Merge Carnegie and HD2023 data
merged_df = carnegie_last.merge(hd_df[['UNITID', 'LATITUDE', 'LONGITUD', 'FULL_ADDR']], how='left', left_on='unitid', right_on='UNITID', suffixes=('', '_B'))

# Look into the failures
failures = carnegie_last[carnegie_last['latitude'].isna()]
hd_ids = list(hd_df.UNITID)
# Inspect how many of the previously unresolved rows could be resolved by this new dataset
shared = list(set(failures.unitid) & set(hd_ids))
print(len(shared))
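The same check can be run on the merge itself. The sketch below uses toy frames with made-up IDs and coordinates (`carnegie_demo` and `hd_demo` are hypothetical, not the real data) and relies on the `indicator` and `validate` options of `DataFrame.merge` to show which unresolved rows found a match.

```python
import pandas as pd

# Toy stand-ins for the two datasets; every value here is made up
carnegie_demo = pd.DataFrame({
    "unitid": [1, 2, 3],
    "latitude": [40.0, None, None],
})
hd_demo = pd.DataFrame({
    "UNITID": [2, 4],
    "LATITUDE": [41.5, 42.5],
})

merged_demo = carnegie_demo.merge(
    hd_demo,
    how="left",
    left_on="unitid",
    right_on="UNITID",
    validate="m:1",   # raise if a UNITID appears more than once on the right
    indicator=True,   # adds a `_merge` column: 'both' or 'left_only'
)

# Rows the geocoder missed that the fallback dataset can fill
recoverable = merged_demo[
    merged_demo["latitude"].isna()
    & (merged_demo["_merge"] == "both")
    & merged_demo["LATITUDE"].notna()
]
print(recoverable["unitid"].tolist())
```

Using `validate="m:1"` also guards against duplicate IDs in the fallback file, which would otherwise silently add rows to the merged frame.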

In [None]:
# Keep the Google result where it exists; otherwise fall back to the HD2023 value
merged_df['latitude'] = merged_df['latitude'].combine_first(merged_df['LATITUDE'])
merged_df['longitude'] = merged_df['longitude'].combine_first(merged_df['LONGITUD'])
merged_df['address'] = merged_df['address'].combine_first(merged_df['FULL_ADDR'])
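
To make the fallback rule concrete, here is a minimal toy illustration of `Series.combine_first`. The values are made up; `primary` plays the role of the geocoder column and `fallback` the HD2023 column.

```python
import pandas as pd

# Made-up values: the second entry is missing from the primary source
primary = pd.Series([40.88, None, 39.95], name="latitude")
fallback = pd.Series([0.0, 41.20, None], name="LATITUDE")

# combine_first keeps primary values and uses fallback only where primary is null
filled = primary.combine_first(fallback)
print(filled.tolist())
```

The first entry stays at 40.88 even though the fallback disagrees, so existing geocoder results are never overwritten.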

In [None]:
# Test output
merged_df[merged_df.unitid == shared[0]]

Unnamed: 0,unitid,name,city,stabbr,basic2000,basic2005,basic2010,basic2015,basic2018,basic2021,...,rooms,selindex,address,latitude,longitude,point,UNITID,LATITUDE,LONGITUD,FULL_ADDR
1129,369668,Central Pennsylvania Institute of Science and ...,Pleasant Gap,PA,-2,-2,-2,-2,-2,11,...,0,,"540 N Harrison Rd, Pleasant Gap, PA 16823",40.882168,-77.740923,,369668.0,40.882168,-77.740923,"540 N Harrison Rd, Pleasant Gap, PA 16823"


In [None]:
# Drop the helper columns brought in by the merge
merged_df.drop(columns = ['UNITID', 'LATITUDE', 'LONGITUD', 'FULL_ADDR'], inplace = True)

In [None]:
# Rows still unresolved; expected to equal the earlier failure count minus len(shared)
len(merged_df[merged_df['latitude'].isna()])

10
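
The last rows can then be filled in by hand. One way to do that is a lookup keyed by `unitid`, sketched below on a toy frame. The IDs and coordinates are placeholders, not real institutions, and `manual_coords` is a hypothetical name.

```python
import pandas as pd

# Toy frame standing in for merged_df; all values are placeholders
demo = pd.DataFrame({
    "unitid": [101, 102, 103],
    "latitude": [40.0, None, None],
    "longitude": [-75.0, None, None],
})

# Hand-collected coordinates keyed by unitid (hypothetical values)
manual_coords = {102: (35.5, -80.25)}
lat_lookup = {uid: lat for uid, (lat, lon) in manual_coords.items()}
lon_lookup = {uid: lon for uid, (lat, lon) in manual_coords.items()}

# Fill only the gaps; rows that already have coordinates are left untouched
demo["latitude"] = demo["latitude"].fillna(demo["unitid"].map(lat_lookup))
demo["longitude"] = demo["longitude"].fillna(demo["unitid"].map(lon_lookup))

print(demo)
```

Keeping the manual values in one dictionary makes them easy to review and to extend as more rows are looked up.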