## Predicting Health Professional Shortage Areas 

HPSA, short for "Health Professional Shortage Area", is a federal government term for a specific region or location that is experiencing a shortage of healthcare professionals. Periodically, HPSA scores are developed for use by the National Health Service Corps to determine the priority with which clinicians are assigned to these areas. Scores range from 0 to 26 depending on the discipline (0 to 25 for primary care), and the higher the score, the greater the priority. In this project, I will train a machine learning model to predict Primary Care* HPSA scores from various location metrics (county income, unemployment rate, etc.), using features taken from other government websites such as the US Bureau of Labor Statistics.

### Step 1: ETL

This data is taken from the https://data.hrsa.gov/ website as individual XLSX files, one per state. Because each state's data is stored separately, we will have to extract and load each state iteratively. For now, let's take a peek at a single state: the variable peek_data holds the data for Alabama.

In [1]:
import pandas as pd
import numpy as np
import re 
np.set_printoptions(threshold=np.inf) #prints full arrays without truncation, which helps when troubleshooting
peek_data=pd.read_excel("utility/data/HPSAdata/Hpsa_Find_Export.xlsx",index_col=None,header=3)
#header=3 because the column titles are stored in row 3 (zero-indexed) of the sheet

peek_data.head().style

| | Discipline | HPSA ID | HPSA Name | Designation Type | Primary State Name | County Name | HPSA FTE Short | HPSA Score | Status | Rural Status | Designation Date | Update Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Primary Care | 1016018546 | LI-Marion County | Low Income Population HPSA | Alabama | Marion County, AL | 1.673 | 14.0 | Designated | Rural | 06/22/2022 | 06/22/2022 |
| 1 | | Component State Name | Component County Name | Component Name | Component Type | Component GEOID | Component Rural Status | | | | | |
| 2 | | Alabama | Marion | Marion | Single County | 01093 | Rural | | | | | |
| 3 | Primary Care | 1019011119 | Perry County | High Needs Geographic HPSA | Alabama | Perry County, AL | 0.87 | 19.0 | Designated | Rural | 01/15/1979 | 09/08/2021 |
| 4 | | Component State Name | Component County Name | Component Name | Component Type | Component GEOID | Component Rural Status | | | | | |


Note that the table above contains several rows of extraneous information. However, the Component GEOID, a unique identifier for each county in the US (also known as the FIPS code), is not extraneous and will need to be extracted. The GEOID matters because it is the key we will use later to merge new features (e.g., unemployment rates by county) into the dataset using SQL.
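
To make the role of the GEOID concrete, here is a minimal sketch of the kind of SQL join it enables, using Python's built-in sqlite3 module. The table names, column names, and unemployment rates are hypothetical placeholders, not the schema or data used later in this project; only the two county names, FIPS codes, and HPSA scores are taken from the Alabama preview.

```python
import sqlite3

# Hypothetical toy tables; the table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpsa (hpsa_name TEXT, fips TEXT, hpsa_score INTEGER)")
conn.execute("CREATE TABLE unemployment (fips TEXT, unemployment_rate REAL)")

conn.executemany(
    "INSERT INTO hpsa VALUES (?, ?, ?)",
    [("LI-Marion County", "01093", 14), ("Perry County", "01105", 19)],
)
# Made-up rates for illustration only, not real BLS figures.
conn.executemany(
    "INSERT INTO unemployment VALUES (?, ?)",
    [("01093", 3.1), ("01105", 5.4)],
)

# LEFT JOIN keeps every HPSA row even when a county has no matching feature row.
merged = conn.execute(
    """
    SELECT h.hpsa_name, h.fips, h.hpsa_score, u.unemployment_rate
    FROM hpsa AS h
    LEFT JOIN unemployment AS u ON u.fips = h.fips
    ORDER BY h.fips
    """
).fetchall()
conn.close()

for row in merged:
    print(row)
```

Storing the FIPS code as text on both sides keeps leading zeros intact, so "01093" in one table matches "01093" in the other.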

A closer review of the table shows that some entries provide ZIP codes in lieu of FIPS codes, and ZIP codes are useless to us. Luckily, these entries also give the county name alongside the ZIP code. Therefore, before we write a function to clean our data, we will create a dictionary that maps county names to FIPS codes for the cases where the FIPS code is not already provided.

In [2]:
#Creating a dictionary of county names to FIPS codes

#Formatting our data to enter into a dictionary
url='https://www.mdreducation.com/pdfs/US_FIPS_Codes.xls'
FIPS_Map = pd.read_excel(url, header =1, dtype={'FIPS State': str, 'FIPS County': str})
FIPS_Map['FIPS Code'] = FIPS_Map['FIPS State'] + FIPS_Map['FIPS County'] 

#loading county names and FIPS codes into a dictionary
CountyDict = dict(zip(FIPS_Map['County Name'],FIPS_Map['FIPS Code']))
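
One caveat with a dictionary keyed on the county name alone: county names repeat across states, so later entries silently overwrite earlier ones. The sketch below illustrates the effect and one possible workaround, keying on a (state, county) pair. It uses a toy stand-in for FIPS_Map, and the 'State' column name is an assumption; whether the source spreadsheet exposes such a column would need to be checked.

```python
import pandas as pd

# Toy stand-in for FIPS_Map; the 'State' column name is an assumption.
fips_map = pd.DataFrame({
    "State": ["AL", "GA"],
    "County Name": ["Washington", "Washington"],
    "FIPS Code": ["01129", "13303"],
})

# Keying on the county name alone keeps only the last duplicate.
by_name = dict(zip(fips_map["County Name"], fips_map["FIPS Code"]))

# Keying on (state, county) keeps one entry per county.
by_state_and_name = dict(
    zip(zip(fips_map["State"], fips_map["County Name"]), fips_map["FIPS Code"])
)

print(by_name)
print(by_state_and_name)
```

With a composite key, the lookup in the cleaning step would also need the state of each component row, which appears to sit in the same row as the county name.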

Now that we have a dictionary mapping county names to FIPS codes, we can write our general cleaning function.

In [3]:
def clean(data):
    """Drops unnecessary rows and columns and generates a new FIPS column."""

    GeoIDs=[]

    # Component header rows store their labels in the HPSA columns, so the 'County Name'
    # column tells us whether a GEOID or a ZIP code was provided for each HPSA.
    for i in range(len(data['County Name'])):

        string=str(data['County Name'][i]).casefold() #component label, lower-cased for matching

        if 'geoid' in string: #GEOID was provided: the value sits on the next row; keep its first 5 characters
            GeoIDs.append(data['County Name'][i+1][:5])
            continue

        if 'zip' in string: #ZIP code was provided instead: look up the FIPS code from the county name
            CountyName = data['HPSA FTE Short'][i+1]
            GeoIDs.append(CountyDict.get(CountyName)) #None if the county name is not in the dictionary
            continue

    #Keeps only rows whose HPSA Score (column 7) is numeric
    #Dropped rows include component label rows, component value rows, and blank rows
    data=data.loc[pd.to_numeric(data.iloc[:,7],errors='coerce').notna()]

    #Renumbers our rows after dropping unnecessary ones
    data=data.reset_index(drop=True)

    #Drops the ID, status, and two date columns, as these are logistical in nature
    #Drops discipline since all pulled data is from Primary Care only
    #Drops HPSA FTE Short since this only exists for regions experiencing dire shortages (and is therefore biased)
    #.copy() makes the result an independent table so the column assignment below is unambiguous
    data=data.iloc[:,[2,3,4,5,7,9]].copy()

    #Adds the geolocation codes (aka FIPS codes) as a column to the table
    data['FIPS'] = GeoIDs

    return data
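
The row filter in clean relies on pd.to_numeric with errors='coerce' followed by notna(). As a small illustration of that idiom on its own, the sketch below applies it to a toy column; the values are invented stand-ins for an HPSA Score column that mixes real scores with blanks and label text.

```python
import pandas as pd

# Toy column mimicking 'HPSA Score': real scores mixed with blanks and label text.
scores = pd.Series([14.0, None, None, 19.0, "HPSA Score"])

# errors='coerce' turns anything non-numeric into NaN instead of raising an error.
numeric = pd.to_numeric(scores, errors="coerce")

# notna() then yields a boolean mask that selects only the genuine score rows.
mask = numeric.notna()

print(mask.tolist())
print(scores[mask].tolist())
```

Only the first and fourth entries survive the mask, which is the same mechanism that removes the component rows from each state table.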

In [4]:
peek_data=clean(peek_data)
peek_data.head().style

| | HPSA Name | Designation Type | Primary State Name | County Name | HPSA Score | Rural Status | FIPS |
|---|---|---|---|---|---|---|---|
| 0 | LI-Marion County | Low Income Population HPSA | Alabama | Marion County, AL | 14 | Rural | 1093 |
| 1 | Perry County | High Needs Geographic HPSA | Alabama | Perry County, AL | 19 | Rural | 1105 |
| 2 | Marengo County | High Needs Geographic HPSA | Alabama | Marengo County, AL | 19 | Rural | 1091 |
| 3 | Wilcox County | High Needs Geographic HPSA | Alabama | Wilcox County, AL | 21 | Rural | 1131 |
| 4 | Bullock County | High Needs Geographic HPSA | Alabama | Bullock County, AL | 22 | Rural | 1011 |


As seen above, our data has been cleaned and a FIPS column has been added. Now that we've created and tested a function to clean and structure a dataset, we will proceed to combine all 50 state datasets.

In [5]:
import os

directory = 'utility/data/HPSAdata'
frames = [] #one cleaned DataFrame per state file

for filename in os.listdir(directory):

    path = os.path.join(directory, filename) #generate file path

    if os.path.isfile(path):
        state_data=pd.read_excel(path,index_col=None,header=3) #import
        frames.append(clean(state_data)) #clean and collect

#concatenate once at the end instead of growing a DataFrame inside the loop
data = pd.concat(frames)
counter = len(frames)

print(f'{counter} datasets were successfully concatenated with a final shape of {data.shape}')
        
        

50 datasets were successfully concatenated with a final shape of (6804, 7)
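
The preview of the cleaned Alabama data displays FIPS values such as 1093 rather than 01093. That may only be how the table was rendered, but if any codes really did lose their leading zero, joins on FIPS would silently miss those counties. The following is a minimal sketch of a normalization step under that assumption; normalize_fips is a hypothetical helper, not part of the original notebook, and the sample values are invented to cover the likely cases.

```python
import pandas as pd

def normalize_fips(value):
    """Return a 5-character, zero-padded FIPS string, or None if the value is missing."""
    if pd.isna(value):
        return None
    return str(int(value)).zfill(5)

# Toy values: an integer, a string missing its leading zero,
# an already well-formed code, and a failed dictionary lookup.
fips = pd.Series([1093, "1105", "01091", None])

normalized = [normalize_fips(v) for v in fips]
print(normalized)
```

Applied to the combined table, this could look like `data['FIPS'] = data['FIPS'].map(normalize_fips)`; counting the None values afterwards would also show how many ZIP-based entries failed the county-name lookup.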
