# Data Prep

### Read data from CSV and Excel files, combine into one dataframe, and save to a CSV file

#### Files used (11)

Texas_Airports.csv

Texas_county_numbers.csv

cy18-all-enplanements TX.csv

Financial_Tax_summary.csv

Income_poverty.csv

Population.csv

General_Information_land_use.csv

policy_scores_us_counties_reformat.xls

all_county_policies.csv

2018 County Health Rankings Texas Data - v3.xls

Healthcare worker shortage.csv

Load libraries

In [173]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Airports

#### Airport location data from http://gis-txdot.opendata.arcgis.com/datasets/texas-airports

In [174]:
airports=pd.read_csv(r'data/Texas_Airports.csv')
print(airports.shape)
airports.head()

(362, 10)


Unnamed: 0,X,Y,FID,GID,ARPT_NM,CSTMS_FLAG,FAA_CD,CNTY_NBR,DIST_NBR,DSPLY_FLAG
0,-100.184849,35.236166,1,287,Shamrock Municipal Airport,N,2F1,242,25,N
1,-101.705939,35.219376,2,288,Amarillo International Airport,Y,AMA,188,4,N
2,-100.996269,35.612996,3,289,Perry LeFors Airfield,N,PPA,91,4,N
3,-101.394059,35.700046,4,290,Hutchinson County Airport,N,BGD,118,4,N
4,-102.547289,36.022596,5,291,Dalhart Municipal Airport,N,DHT,104,4,N


In [175]:
airports[airports['CNTY_NBR']==57]

Unnamed: 0,X,Y,FID,GID,ARPT_NM,CSTMS_FLAG,FAA_CD,CNTY_NBR,DIST_NBR,DSPLY_FLAG
139,-96.868198,32.680866,140,62,Redbird,N,RBD,57,18,N
140,-96.865568,32.557086,141,63,Carroll Air Park,N,F66,57,18,Y
317,-96.836458,32.968566,318,242,Addison Airport,Y,ADS,57,18,N
318,-96.851778,32.847116,319,243,Dallas Love Field,Y,DAL,57,18,N
319,-96.719057,32.579196,320,244,Lancaster Airport,N,LNC,57,18,N


Drop unnecessary columns

In [176]:
airports=airports.drop(['X', 'Y', 'FID', 'GID', 'ARPT_NM', 'DIST_NBR', 'DSPLY_FLAG'], axis=1)
print(airports.shape)
airports.head()

(362, 3)


Unnamed: 0,CSTMS_FLAG,FAA_CD,CNTY_NBR
0,N,2F1,242
1,Y,AMA,188
2,N,PPA,91
3,N,BGD,118
4,N,DHT,104


## County Codes

The airport data identifies counties by code rather than by name, so a county-code lookup table is needed to retrieve the county names.

#### Data from http://onlinemanuals.txdot.gov/txdotmanuals/tri/texas_counties_and_code_numbers.htm

In [177]:
county_codes=pd.read_csv(r'data/Texas_county_numbers.csv', dtype=str)
county_codes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Data columns (total 2 columns):
CO        255 non-null object
County    255 non-null object
dtypes: object(2)
memory usage: 4.2+ KB


In [178]:
county_codes.head()

Unnamed: 0,CO,County
0,1,Anderson
1,2,Andrews
2,3,Angelina
3,4,Aransas
4,5,Archer


Merge airport dataframe with county code dataframe on county code

In [179]:
airports['CNTY_NBR']=airports['CNTY_NBR'].astype(str)
airports=pd.merge(airports, county_codes, how='left', left_on='CNTY_NBR', right_on='CO')
airports.head()

Unnamed: 0,CSTMS_FLAG,FAA_CD,CNTY_NBR,CO,County
0,N,2F1,242,242,Wheeler
1,Y,AMA,188,188,Potter
2,N,PPA,91,91,Gray
3,N,BGD,118,118,Hutchinson
4,N,DHT,104,104,Hartley


In [180]:
airports.drop('CO', axis=1, inplace=True)

In [181]:
airports[airports['County']=='Harris']

Unnamed: 0,CSTMS_FLAG,FAA_CD,CNTY_NBR,County
12,N,HPY,102,Harris
95,N,T51,102,Harris
96,N,EFD,102,Harris
97,N,EYQ,102,Harris
231,N,IWS,102,Harris
232,Y,HOU,102,Harris
233,Y,IAH,102,Harris
234,N,DWH,102,Harris
235,N,T41,102,Harris


## Airport Enplanement

#### Enplanement data from https://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/

In [182]:
airport_enplane=pd.read_csv(r'data/cy18-all-enplanements TX.csv')
airport_enplane.head()

Unnamed: 0,Rank,RO,ST,Locid,City,Airport Name,S/L,Hub,CY 18 Enplanements,CY 17 Enplanements,% Change
0,58,AL,AK,ANC,Anchorage,Ted Stevens Anchorage International,P,M,2642607,2556191,3.38%
1,122,AL,AK,FAI,Fairbanks,Fairbanks International,P,S,549289,543839,1.00%
2,133,AL,AK,JNU,Juneau,Juneau International,P,N,440277,422266,4.27%
3,198,AL,AK,BET,Bethel,Bethel,P,N,160110,146652,9.18%
4,212,AL,AK,KTN,Ketchikan,Ketchikan International,P,N,135389,131144,3.24%


Filter enplanement data by Texas

In [183]:
airport_enplane_TX=airport_enplane[airport_enplane['ST']=='TX']
print(airport_enplane_TX.shape)
airport_enplane_TX.head()

(126, 11)


Unnamed: 0,Rank,RO,ST,Locid,City,Airport Name,S/L,Hub,CY 18 Enplanements,CY 17 Enplanements,% Change
1548,4,SW,TX,DFW,Fort Worth,Dallas-Fort Worth International,P,L,32821799,31816933,3.16%
1549,14,SW,TX,IAH,Houston,George Bush Intercontinental/Houston,P,L,21157398,19603731,7.93%
1550,32,SW,TX,DAL,Dallas,Dallas Love Field,P,M,8011221,7593361,5.50%
1551,33,SW,TX,AUS,Austin,Austin-Bergstrom International,P,M,7714479,6813171,13.23%
1552,35,SW,TX,HOU,Houston,William P Hobby,P,M,7053886,6538976,7.87%


Drop unnecessary columns

In [184]:
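# 'Airport Name' was listed twice in the original drop list; each column is named once here
airport_enplane_TX=airport_enplane_TX.drop(['Rank', 'RO', 'ST', 'City', 'Airport Name', 'S/L',
                                            'CY 17 Enplanements', '% Change'], axis=1)
# Reassign instead of renaming in place, since this frame is a filtered slice of airport_enplane
airport_enplane_TX=airport_enplane_TX.rename(columns={'CY 18 Enplanements': 'Enplanements'})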


Convert enplanements to float

In [185]:
airport_enplane_TX['Enplanements']=airport_enplane_TX['Enplanements'].str.replace(',', '').astype(float)
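As a minimal sketch of the same conversion, `pd.to_numeric` with `errors='coerce'` turns any value that still fails to parse into `NaN` instead of raising. The sample strings below are illustrative stand-ins, not rows from the FAA file.

```python
import pandas as pd

# Illustrative enplanement strings; 'n/a' stands in for an unparseable cell
raw = pd.Series(['32,821,799', '355,705', '42', 'n/a'])

# Remove thousands separators literally, then convert; bad values become NaN
enplanements = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')
print(enplanements)
```

Counting the `NaN` values afterwards gives a quick check on how many cells did not parse.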

In [186]:
airport_enplane_TX.shape

(126, 3)

In [187]:
airports.shape

(363, 4)
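The airport table had 362 rows before the left merge with the county codes and has 363 afterwards, which suggests that one county code appears more than once in the lookup table (it holds 255 non-null codes, while the other county tables have 254 rows). A hedged sketch of how such a case could be detected and guarded against, using small made-up tables rather than the real files:

```python
import pandas as pd

# Toy stand-ins for the airport and county-code tables (FAA codes are hypothetical)
airports_demo = pd.DataFrame({'FAA_CD': ['AAA', 'BBB', 'CCC'],
                              'CNTY_NBR': ['1', '2', '2']})
codes_demo = pd.DataFrame({'CO': ['1', '2', '2', None],
                           'County': ['Anderson', 'Andrews', 'Andrews', None]})

# Show every lookup row whose code is repeated
codes_nonnull = codes_demo.dropna(subset=['CO'])
dup_codes = codes_nonnull[codes_nonnull['CO'].duplicated(keep=False)]
print(dup_codes)

# Keep one row per code so each airport matches at most one county
codes_clean = codes_nonnull.drop_duplicates(subset='CO')

# validate='many_to_one' raises MergeError if the right-hand keys are not unique
merged_demo = pd.merge(airports_demo, codes_clean, how='left',
                       left_on='CNTY_NBR', right_on='CO', validate='many_to_one')
print(merged_demo.shape)
```

With the lookup deduplicated, the merged frame keeps the same number of rows as the airport table.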

Merge airport data with enplanement data

In [188]:
airports_TX=pd.merge(airports, airport_enplane_TX, how='left', left_on='FAA_CD', right_on='Locid')
airports_TX.drop('Locid', axis=1, inplace=True)

In [189]:
airports_TX.head()

Unnamed: 0,CSTMS_FLAG,FAA_CD,CNTY_NBR,County,Hub,Enplanements
0,N,2F1,242,Wheeler,,
1,Y,AMA,188,Potter,N,355705.0
2,N,PPA,91,Gray,,3.0
3,N,BGD,118,Hutchinson,,
4,N,DHT,104,Hartley,,42.0


The Hub column indicates whether an airport is a small (S), medium (M), or large (L) hub, or not a hub (N). It is recoded to a binary flag: hub (Y) or not a hub (N).

In [190]:
airports_TX['Hub'].value_counts()

None    75
N       14
M        4
S        3
L        2
Name: Hub, dtype: int64

In [191]:
airports_TX['Hub']=airports_TX['Hub'].replace({'M': 'Y', 'S': 'Y', 'L': 'Y', 'None': 'N'})
airports_TX['Hub'].value_counts()

N    89
Y     9
Name: Hub, dtype: int64

In [192]:
airports_TX['Hub'].isnull().sum()

265

In [193]:
airports_TX['Hub']=airports_TX['Hub'].fillna('N')

In [194]:
airports_TX['Hub'].value_counts()

N    354
Y      9
Name: Hub, dtype: int64
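The recode-then-fill sequence above can also be sketched as a single membership test: anything that is a small, medium, or large hub becomes `Y`, and everything else (including `'None'` strings and missing values) becomes `N`. The sample values are illustrative.

```python
import pandas as pd

# Illustrative Hub values, including the 'None' string and a true missing value
hub = pd.Series(['L', 'M', 'S', 'N', 'None', None])

# isin() returns False for missing values, so no separate fillna is needed
hub_flag = hub.isin(['L', 'M', 'S']).map({True: 'Y', False: 'N'})
print(hub_flag.value_counts())
```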

In [195]:
airports_TX.isnull().sum()

CSTMS_FLAG        0
FAA_CD            0
CNTY_NBR          0
County            0
Hub               0
Enplanements    265
dtype: int64

In [196]:
airports_TX['Enplanements']=airports_TX['Enplanements'].fillna(0)

In [197]:
airports_TX.sort_values(by=['County'], inplace=True)
airports_TX

Unnamed: 0,CSTMS_FLAG,FAA_CD,CNTY_NBR,County,Hub,Enplanements
288,N,PSN,1,Anderson,N,13.0
301,N,E11,2,Andrews,N,0.0
9,N,LFK,3,Angelina,N,22.0
205,N,RKP,4,Aransas,N,34.0
347,N,T39,5,Archer,N,0.0
...,...,...,...,...,...,...
73,N,E15,252,Young,N,0.0
74,N,10F,252,Young,N,0.0
10,N,ONY,252,Young,N,0.0
200,N,T86,253,Zapata,N,0.0


In [198]:
airports_TX.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 363 entries, 288 to 186
Data columns (total 6 columns):
CSTMS_FLAG      363 non-null object
FAA_CD          363 non-null object
CNTY_NBR        363 non-null object
County          363 non-null object
Hub             363 non-null object
Enplanements    363 non-null float64
dtypes: float64(1), object(5)
memory usage: 19.9+ KB


#### Combine airport data into data by county

In [199]:
airport_count=pd.DataFrame(airports_TX['County'].value_counts()).sort_index()
airport_count.rename(columns={'County': 'Number'}, inplace=True)

In [200]:
airports_county=pd.DataFrame(airports_TX.groupby('County').
             agg({'Enplanements': sum, 'CSTMS_FLAG': list, 'Hub': list}))

In [201]:
airports_county=pd.merge(airports_county, airport_count, how='left', left_index=True, right_index=True)

In [202]:
airports_county.head()

Unnamed: 0_level_0,Enplanements,CSTMS_FLAG,Hub,Number
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Anderson,13.0,[N],[N],1
Andrews,0.0,[N],[N],1
Angelina,22.0,[N],[N],1
Aransas,34.0,[N],[N],1
Archer,0.0,[N],[N],1


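Collapse each county's list of flags to a single Y/N value (Y if any airport in the county has the flag), then use one-hot encoding to convert the Hub and Customs flags to numeric data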

In [203]:
def checkflag(flags):
    if 'Y' in flags:
        return 'Y'
    else:
        return 'N'

In [204]:
airports_county['CSTMS_FLAG']=airports_county['CSTMS_FLAG'].map(checkflag)
airports_county['Hub']=airports_county['Hub'].map(checkflag)

In [205]:
airports_county.head(15)

Unnamed: 0_level_0,Enplanements,CSTMS_FLAG,Hub,Number
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Anderson,13.0,N,N,1
Andrews,0.0,N,N,1
Angelina,22.0,N,N,1
Aransas,34.0,N,N,1
Archer,0.0,N,N,1
Atascosa,0.0,N,N,1
Austin,0.0,N,N,1
Bailey,0.0,N,N,1
Bastrop,0.0,N,N,1
Baylor,0.0,N,N,1


In [206]:
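# dtype=int keeps the 0/1 output shown below on pandas versions where get_dummies defaults to booleans
customs=pd.get_dummies(airports_county['CSTMS_FLAG'], prefix='customs', drop_first=True, dtype=int)
hub=pd.get_dummies(airports_county['Hub'], prefix='hub', drop_first=True, dtype=int)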

In [207]:
customs.head()

Unnamed: 0_level_0,customs_Y
County,Unnamed: 1_level_1
Anderson,0
Andrews,0
Angelina,0
Aransas,0
Archer,0


In [208]:
airports_county.drop(['CSTMS_FLAG', 'Hub'], axis=1, inplace=True)

In [209]:
airports_county=pd.concat([airports_county, customs, hub], axis=1)
airports_county.head()

Unnamed: 0_level_0,Enplanements,Number,customs_Y,hub_Y
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Anderson,13.0,1,0,0
Andrews,0.0,1,0,0
Angelina,22.0,1,0,0
Aransas,34.0,1,0,0
Archer,0.0,1,0,0
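The county roll-up above (sum of enplanements, airport count, and "any airport has the flag" indicators) could also be written as one grouped aggregation. This is a sketch on a small made-up table; the FAA codes and values are hypothetical.

```python
import pandas as pd

# Toy airport-level table with hypothetical FAA codes
airports_demo = pd.DataFrame({
    'County': ['Harris', 'Harris', 'Dallas', 'Archer'],
    'FAA_CD': ['AAA', 'BBB', 'CCC', 'DDD'],
    'CSTMS_FLAG': ['N', 'Y', 'Y', 'N'],
    'Hub': ['N', 'Y', 'Y', 'N'],
    'Enplanements': [0.0, 100.0, 50.0, 0.0],
})

# Turn the Y/N flags into booleans so that max() means "any airport has it"
flags = airports_demo.assign(
    customs_Y=airports_demo['CSTMS_FLAG'].eq('Y'),
    hub_Y=airports_demo['Hub'].eq('Y'),
)

county_demo = flags.groupby('County').agg(
    Enplanements=('Enplanements', 'sum'),
    Number=('FAA_CD', 'count'),
    customs_Y=('customs_Y', 'max'),
    hub_Y=('hub_Y', 'max'),
).astype({'customs_Y': int, 'hub_Y': int})
print(county_demo)
```

This produces the same four columns as the step-by-step version, without the intermediate list columns.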


## County Tax Info

#### Data from https://imis.county.org/iMIS/CountyInformationProgram/QueriesCIP.aspx

In [210]:
tax=pd.read_csv(r'data/Financial_Tax_summary.csv')

In [211]:
tax.head()

Unnamed: 0,County,Total Market Value,Total Actual Levy,Total County Tax Rate,Total Appraised Value (Total Taxable Value for FM/FC),Total Appraised Value (Total Taxable Value for County Tax Purposes)
0,Anderson,"$4,058,915,351","$15,833,018",$0.590892,"$2,675,103,554","$2,679,692,570"
1,Andrews,"$5,154,109,977","$28,889,823",$0.603900,"$4,877,521,855","$4,768,535,096"
2,Angelina,"$5,548,157,129","$18,372,078",$0.437121,$0,"$4,202,972,959"
3,Aransas,"$3,023,164,864","$13,196,810",$0.466022,"$2,850,101,038","$2,828,595,246"
4,Archer,"$1,843,359,643","$4,355,848",$0.664240,"$651,667,245","$656,137,785"


In [212]:
tax.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 0 to 253
Data columns (total 6 columns):
County                                                                 254 non-null object
Total Market Value                                                     254 non-null object
Total Actual Levy                                                      254 non-null object
Total County Tax Rate                                                  254 non-null object
Total Appraised Value (Total Taxable Value for FM/FC)                  254 non-null object
Total Appraised Value (Total Taxable Value for County Tax Purposes)    254 non-null object
dtypes: object(6)
memory usage: 12.0+ KB


In [213]:
tax['County']=tax['County'].str.strip(' ')
tax['County']=tax['County'].str.title()

Clean currency data and convert to float

In [214]:
num_cols=tax.columns[tax.columns!='County']
tax[num_cols]=tax[num_cols].apply(lambda x: x.str.strip('$'))
tax[num_cols]=tax[num_cols].apply(lambda x: x.str.replace(',', '').astype(float))
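Because the same currency cleanup is needed for more than one table, it could be wrapped in a small helper. `currency_to_float` is a hypothetical name introduced for this sketch, and the demo frame holds only a few values copied from the output above.

```python
import pandas as pd

def currency_to_float(col):
    """Strip '$' and thousands separators from a string column and convert to float."""
    return (col.str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))

tax_demo = pd.DataFrame({
    'County': ['Anderson', 'Angelina'],
    'Total Actual Levy': ['$15,833,018', '$18,372,078'],
    'Total County Tax Rate': ['$0.590892', '$0.437121'],
})

# Apply the helper to every column except the county name
money_cols = tax_demo.columns.drop('County')
tax_demo[money_cols] = tax_demo[money_cols].apply(currency_to_float)
print(tax_demo.dtypes)
```

Passing `regex=False` makes the `'$'` replacement literal regardless of the pandas version's default.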

In [215]:
tax.sort_values(by='County', inplace=True)
tax.set_index('County', inplace=True)
tax.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Anderson to Zavala
Data columns (total 5 columns):
Total Market Value                                                     254 non-null float64
Total Actual Levy                                                      254 non-null float64
Total County Tax Rate                                                  254 non-null float64
Total Appraised Value (Total Taxable Value for FM/FC)                  254 non-null float64
Total Appraised Value (Total Taxable Value for County Tax Purposes)    254 non-null float64
dtypes: float64(5)
memory usage: 11.9+ KB


In [216]:
print(tax.shape)
tax.head()

(254, 5)


Unnamed: 0_level_0,Total Market Value,Total Actual Levy,Total County Tax Rate,Total Appraised Value (Total Taxable Value for FM/FC),Total Appraised Value (Total Taxable Value for County Tax Purposes)
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Anderson,4058915000.0,15833018.0,0.590892,2675104000.0,2679693000.0
Andrews,5154110000.0,28889823.0,0.6039,4877522000.0,4768535000.0
Angelina,5548157000.0,18372078.0,0.437121,0.0,4202973000.0
Aransas,3023165000.0,13196810.0,0.466022,2850101000.0,2828595000.0
Archer,1843360000.0,4355848.0,0.66424,651667200.0,656137800.0


## Income and Poverty by County

#### Data from https://imis.county.org/iMIS/CountyInformationProgram/QueriesCIP.aspx

In [217]:
income=pd.read_csv(r'data/Income_poverty.csv')

In [218]:
income['County']=income['County'].str.strip(' ')
income['County']=income['County'].str.title()
income.sort_values(by='County', inplace=True)
income.set_index('County', inplace=True)
income.head()

Unnamed: 0_level_0,Per Capita Income,Total Personal Income,Median Household Income,Average Annual Pay,% of Population in Poverty,% of Population Under 18 in Poverty
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Anderson,"$34,242","$1,987,998,000","$45,969","$44,146",19.8,22.6
Andrews,"$50,011","$906,592,000","$84,946","$68,340",10.7,14.0
Angelina,"$38,897","$3,387,655,000","$46,653","$40,464",17.9,26.7
Aransas,"$48,389","$1,151,262,000","$46,912","$38,613",19.9,34.7
Archer,"$50,310","$442,022,000","$61,190","$38,231",10.6,14.3


In [219]:
income.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Anderson to Zavala
Data columns (total 6 columns):
Per Capita Income                      254 non-null object
Total Personal Income                  254 non-null object
Median Household Income                254 non-null object
Average Annual Pay                     254 non-null object
% of Population in Poverty             254 non-null float64
% of Population Under 18 in Poverty    254 non-null float64
dtypes: float64(2), object(4)
memory usage: 13.9+ KB


In [220]:
num_cols=income.select_dtypes(include='float').columns
str_cols=income.select_dtypes(include='object').columns
print(num_cols)
print(str_cols)

Index(['% of Population in Poverty', '% of Population Under 18 in Poverty'], dtype='object')
Index(['Per Capita Income', 'Total Personal Income', 'Median Household Income',
       'Average Annual Pay'],
      dtype='object')


Clean currency data and convert to float

In [221]:
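# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
income[str_cols]=income[str_cols].apply(lambda x: x.str.replace('$', '', regex=False))
income[str_cols]=income[str_cols].apply(lambda x: x.str.replace(',', '', regex=False).astype(float))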

In [222]:
#income[num_cols]=income[num_cols]/100
print(income.shape)
income.head()

(254, 6)


Unnamed: 0_level_0,Per Capita Income,Total Personal Income,Median Household Income,Average Annual Pay,% of Population in Poverty,% of Population Under 18 in Poverty
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Anderson,34242.0,1987998000.0,45969.0,44146.0,19.8,22.6
Andrews,50011.0,906592000.0,84946.0,68340.0,10.7,14.0
Angelina,38897.0,3387655000.0,46653.0,40464.0,17.9,26.7
Aransas,48389.0,1151262000.0,46912.0,38613.0,19.9,34.7
Archer,50310.0,442022000.0,61190.0,38231.0,10.6,14.3


## Population by County

#### Data from https://imis.county.org/iMIS/CountyInformationProgram/QueriesCIP.aspx

In [223]:
pop=pd.read_csv(r'data/Population.csv')

In [224]:
pop['County']=pop['County'].str.strip(' ')
pop['County']=pop['County'].str.title()
pop.sort_values(by='County', inplace=True)
pop.set_index('County', inplace=True)
pop.head()

Unnamed: 0_level_0,County Population,Population Density Per Sq Mile,County Seat,County Seat Population
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Anderson,57735,55.01,Palestine,18712
Andrews,18705,9.85,Andrews,11088
Angelina,86715,108.77,Lufkin,35067
Aransas,23510,91.87,Rockport,8766
Archer,8553,10.03,Archer City,1834


Drop County Seat and County Seat Population

In [225]:
pop=pop.drop(['County Seat', 'County Seat Population'], axis=1)
pop.columns=['Population', 'Population Density']
pop.head()

Unnamed: 0_level_0,Population,Population Density
County,Unnamed: 1_level_1,Unnamed: 2_level_1
Anderson,57735,55.01
Andrews,18705,9.85
Angelina,86715,108.77
Aransas,23510,91.87
Archer,8553,10.03


In [226]:
pop.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Anderson to Zavala
Data columns (total 2 columns):
Population            254 non-null object
Population Density    254 non-null object
dtypes: object(2)
memory usage: 6.0+ KB


Convert population and population density to numeric data

In [227]:
pop['Population']=pop['Population'].str.replace(',', '').astype(int)
pop['Population Density']=pop['Population Density'].str.replace(',', '').astype(float)
print(pop.shape)
pop.head()

(254, 2)


Unnamed: 0_level_0,Population,Population Density
County,Unnamed: 1_level_1,Unnamed: 2_level_1
Anderson,57735,55.01
Andrews,18705,9.85
Angelina,86715,108.77
Aransas,23510,91.87
Archer,8553,10.03


## County Land Use

#### Data from https://imis.county.org/iMIS/CountyInformationProgram/QueriesCIP.aspx

In [228]:
land=pd.read_csv(r'data/General_Information_land_use.csv')

In [229]:
land['County']=land['County'].str.strip(' ')
land['County']=land['County'].str.title()
land.sort_values(by='County', inplace=True)
land.set_index('County', inplace=True)
land.head()

Unnamed: 0_level_0,Land Area,Water Area,Total Area,Percent Urban,Percent Rural
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Anderson,1062.6,15.4,1078.0,32.94,67.06
Andrews,1500.7,0.4,1501.1,83.5,16.5
Angelina,797.8,66.9,864.7,56.92,43.08
Aransas,252.1,275.9,528.0,72.74,27.26
Archer,903.3,22.3,925.6,11.01,88.99


Drop Land Area, Water Area, Total Area

In [230]:
land=land.drop(['Land Area', 'Water Area', 'Total Area'], axis=1)
land.head()

Unnamed: 0_level_0,Percent Urban,Percent Rural
County,Unnamed: 1_level_1,Unnamed: 2_level_1
Anderson,32.94,67.06
Andrews,83.5,16.5
Angelina,56.92,43.08
Aransas,72.74,27.26
Archer,11.01,88.99


In [231]:
land.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Anderson to Zavala
Data columns (total 2 columns):
Percent Urban    254 non-null float64
Percent Rural    254 non-null float64
dtypes: float64(2)
memory usage: 6.0+ KB


In [232]:
#land=land/100
land.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Anderson to Zavala
Data columns (total 2 columns):
Percent Urban    254 non-null float64
Percent Rural    254 non-null float64
dtypes: float64(2)
memory usage: 6.0+ KB


In [233]:
print(land.shape)
land.head()

(254, 2)


Unnamed: 0_level_0,Percent Urban,Percent Rural
County,Unnamed: 1_level_1,Unnamed: 2_level_1
Anderson,32.94,67.06
Andrews,83.5,16.5
Angelina,56.92,43.08
Aransas,72.74,27.26
Archer,11.01,88.99
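The tax, income, population, and land-use tables all receive the same county-name cleanup (strip, title-case, sort, set as index). A minimal sketch of a shared helper follows; `index_by_county` is a hypothetical name, and it strips all surrounding whitespace rather than only spaces.

```python
import pandas as pd

def index_by_county(df, col='County'):
    """Normalize county names, sort by them, and use them as the index."""
    out = df.copy()
    out[col] = out[col].str.strip().str.title()
    return out.sort_values(by=col).set_index(col)

# Illustrative input with inconsistent spacing and capitalization
demo = pd.DataFrame({'County': [' andrews', 'ANDERSON '],
                     'Population': [18705, 57735]})
cleaned = index_by_county(demo)
print(cleaned)
```

Using one helper for every table keeps the county keys consistent, which matters when the tables are later joined on the index.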


## Covid19 Policy

#### Data from 

### Policy scores by county

In [234]:
policy_scores=pd.read_excel(r'data/policy_scores_us_counties_reformat.xls')
policy_scores.head()

Unnamed: 0,FIPS,NAME,state,Deaths per Capita,Percentage Death,Percentage Infected,Policy Score
0,1001,Autauga,Alabama,0.0%,2.5%,1.6%,4
1,1003,Baldwin,Alabama,0.0%,0.7%,1.1%,5
2,1005,Barbour,Alabama,0.0%,0.8%,1.8%,3
3,1007,Bibb,Alabama,0.0%,0.7%,1.2%,4
4,1009,Blount,Alabama,0.0%,0.2%,0.9%,4


In [235]:
policy_scores=policy_scores[policy_scores['state']=='Texas']
print(policy_scores.shape)
policy_scores.head()

(145, 7)


Unnamed: 0,FIPS,NAME,state,Deaths per Capita,Percentage Death,Percentage Infected,Policy Score
1064,48001,Anderson,Texas,0.0%,0.2%,3.5%,3
1065,48005,Angelina,Texas,0.0%,1.2%,1.2%,3
1066,48013,Atascosa,Texas,0.0%,0.6%,0.8%,2
1067,48015,Austin,Texas,0.0%,0.0%,0.6%,3
1068,48019,Bandera,Texas,0.0%,0.0%,0.2%,6


In [236]:
policy_scores=policy_scores[['NAME', 'Policy Score']]
policy_scores.rename(columns={'NAME': 'County'}, inplace=True)
policy_scores['County']=policy_scores['County'].str.replace(' ', '')
policy_scores['County']=policy_scores['County'].str.title()
policy_scores.sort_values(by='County', inplace=True)
policy_scores.set_index('County', inplace=True)
policy_scores.head()

Unnamed: 0_level_0,Policy Score
County,Unnamed: 1_level_1
Anderson,3
Angelina,3
Atascosa,2
Austin,3
Bandera,6


### Policy by category

In [237]:
policy=pd.read_csv(r'data/all_county_policies.csv')
print(policy.shape)
policy.head()

(2704, 24)


Unnamed: 0,fips,county_name,school,school_url,school_date,work,work_url,work_date,shelter_enforcement,shelter_enforcement_url,...,event,event_url,event_date,testing,testing_url,testing_date,transport,transport_url,transport_date,updated
0,6073,"San Diego County, California",True,https://www.sandiegocounty.gov/content/sdc/hhs...,2020-03-16,True,https://www.sandiegocounty.gov/content/sdc/hhs...,2020-03-15,False,,...,True,https://www.sandiegocounty.gov/content/sdc/hhs...,2020-03-16,True,https://health.ucsd.edu/coronavirus/Pages/defa...,,True,https://www.sdmts.com/schedules-real-time/covi...,2020-04-12,2020-04-09 11:03:27
1,17031,"Cook County, Illinois",True,https://www.cookcountypublichealth.org/communi...,2020-03-17,True,https://www.cookcountypublichealth.org/communi...,2020-03-20,False,,...,True,https://www.cookcountypublichealth.org/communi...,2020-04-01,False,,,False,,,2020-04-09 11:32:26
2,48201,"Harris County, Texas",True,https://publichealth.harriscountytx.gov/Resour...,2020-03-24,True,https://publichealth.harriscountytx.gov/Resour...,2020-03-24,True,https://www.readyharris.org/Stay-Home,...,True,https://www.readyharris.org/Stay-Home,2020-03-24,True,https://publichealth.harriscountytx.gov/Resour...,,True,https://www.ridemetro.org/Pages/Coronavirus.aspx,2020-03-30,2020-04-09 18:08:51
3,4013,"Maricopa County, Arizona",True,https://azgovernor.gov/governor/news/2020/03/g...,2020-03-30,True,https://www.fox10phoenix.com/video/669286,2020-03-31,False,,...,False,https://www.maricopa.gov/Calendar.aspx,,True,https://www.fox10phoenix.com/news/banner-healt...,2020-03-23,False,https://www.maricopa.gov/5307/Transportation-M...,,2020-04-09 18:33:06
4,49043,"Summit County, Utah",True,http://www.pcschools.us/wp-content/uploads/202...,2020-03-16,True,https://www.summitcounty.org/DocumentCenter/Vi...,2020-03-27,True,https://www.summitcounty.org/DocumentCenter/Vi...,...,True,https://www.summitcounty.org/DocumentCenter/Vi...,2020-03-27,True,https://kutv.com/news/local/curbside-coronavir...,2020-04-02,False,,,2020-04-09 23:04:50


In [238]:
policy['County']=policy['county_name'].apply(lambda x: x.split(', ')[0])
policy['State']=policy['county_name'].apply(lambda x: x.split(',')[1].lstrip(' '))
policy=policy[policy['State']=='Texas']
print(policy.shape)
policy.head()

(292, 26)


Unnamed: 0,fips,county_name,school,school_url,school_date,work,work_url,work_date,shelter_enforcement,shelter_enforcement_url,...,event_date,testing,testing_url,testing_date,transport,transport_url,transport_date,updated,County,State
2,48201,"Harris County, Texas",True,https://publichealth.harriscountytx.gov/Resour...,2020-03-24,True,https://publichealth.harriscountytx.gov/Resour...,2020-03-24,True,https://www.readyharris.org/Stay-Home,...,2020-03-24,True,https://publichealth.harriscountytx.gov/Resour...,,True,https://www.ridemetro.org/Pages/Coronavirus.aspx,2020-03-30,2020-04-09 18:08:51,Harris County,Texas
8,48113,"Dallas County, Texas",True,https://www.dallasisd.org/cms/lib/TX01001475/C...,2020-03-20,True,https://www.dallascounty.org/covid-19/judge-or...,2020-03-31,False,,...,2020-03-31,True,https://www.dallascounty.org/covid-19/testing-...,2020-04-10,False,,,2020-04-10 19:36:29,Dallas County,Texas
14,48439,"Tarrant County, Texas",True,https://dfw.cbslocal.com/2020/03/24/all-school...,2020-03-24,True,https://www.star-telegram.com/news/local/fort-...,2020-03-24,True,https://www.texastribune.org/2020/03/23/texas-...,...,2020-03-24,True,https://www.dshs.state.tx.us/coronavirus/testi...,2020-03-31,False,,,2020-04-11 13:51:35,Tarrant County,Texas
31,48121,"Denton County, Texas",True,https://www.dentonisd.org/site/default.aspx?Pa...,2020-03-16,True,https://www.msn.com/en-us/news/us/denton-count...,2020-03-24,True,https://dentonrc.com/coronavirus_outbreak/dent...,...,2020-03-24,True,https://www.crosstimbersgazette.com/2020/03/30...,2020-03-20,False,https://www.dcta.net/schedulechanges,,2020-04-11 22:14:51,Denton County,Texas
37,48215,"Hidalgo County, Texas",True,https://www.ecisd.us/apps/pages/coronavirus,2020-03-17,True,https://www.hidalgocounty.us/DocumentCenter/Vi...,2020-03-26,True,https://www.gosanangelo.com/story/news/2020/03...,...,2020-03-26,False,https://www.themonitor.com/2020/03/13/hidalgo-...,,False,,,2020-04-11 23:51:48,Hidalgo County,Texas


In [239]:
policy['County']=policy['County'].str.replace('County', '')
policy['County']=policy['County'].str.replace(' ', '')
policy['County']=policy['County'].str.title()
policy.sort_values(by='County', inplace=True)

Drop unnecessary columns

In [240]:
policy.drop(['school_url', 'work_url', 'shelter_enforcement_url', 'shelter_url', 'testing_url', 
            'transport_url', 'event_url', 'school_date', 'work_date', 'shelter_enforcement_date', 
            'shelter_date', 'testing_date', 'transport_date', 'event_date'], 
       axis=1, inplace=True)
policy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 292 entries, 806 to 1614
Data columns (total 12 columns):
fips                   292 non-null int64
county_name            292 non-null object
school                 144 non-null object
work                   284 non-null object
shelter_enforcement    115 non-null object
shelter                256 non-null object
event                  142 non-null object
testing                148 non-null object
transport              123 non-null object
updated                292 non-null object
County                 292 non-null object
State                  292 non-null object
dtypes: int64(1), object(11)
memory usage: 29.7+ KB


Some counties have more than one row, so only the most recently updated row is kept for each county

In [241]:
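county_count=policy['County'].value_counts()
rows=[]
for i in county_count.index:
    # restrict to this county so a matching timestamp from another county is not picked up
    county_rows=policy[policy['County']==i]
    max_date=county_rows['updated'].max()
    rows.append(county_rows[county_rows['updated']==max_date])
# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated once
df=pd.concat(rows)
df.head()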

Unnamed: 0,fips,county_name,school,work,shelter_enforcement,shelter,event,testing,transport,updated,County,State
1930,48339,"Montgomery County, Texas",,True,,True,,,,2020-06-14 14:30:18,Montgomery,Texas
2346,48179,"Gray County, Texas",,False,,False,,,,2020-06-22 16:42:44,Gray,Texas
1614,48507,"Zavala County, Texas",True,True,,False,True,False,False,2020-07-06 15:13:36,Zavala,Texas
1947,48215,"Hidalgo County, Texas",,True,,False,,,,2020-06-14 15:12:56,Hidalgo,Texas
1611,48031,"Blanco County, Texas",True,True,False,True,True,False,False,2020-07-06 15:03:59,Blanco,Texas
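Keeping the latest row per county can also be sketched without a loop: sort by the `updated` timestamp and keep the last row for each county. This relies on the timestamps being in `YYYY-MM-DD HH:MM:SS` form, as they appear above, so that string order matches time order. The rows below are made up for illustration.

```python
import pandas as pd

# Toy policy table: Harris appears twice with different update times
policy_demo = pd.DataFrame({
    'County': ['Harris', 'Harris', 'Gray'],
    'updated': ['2020-04-09 18:08:51', '2020-06-14 14:30:18', '2020-06-22 16:42:44'],
    'work': [True, False, False],
})

latest = (policy_demo.sort_values('updated')
                     .drop_duplicates(subset='County', keep='last')
                     .sort_values('County')
                     .set_index('County'))
print(latest)
```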


In [242]:
df.drop(['county_name', 'fips', 'State', 'updated'], inplace=True, axis=1)
df.fillna(False, inplace=True)
df.sort_values(by='County', inplace=True)
df.set_index('County', inplace=True)
df=df.astype(int)
df.head()

Unnamed: 0_level_0,school,work,shelter_enforcement,shelter,event,testing,transport
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Anderson,0,0,0,0,0,0,0
Angelina,0,0,0,0,0,0,0
Atascosa,0,0,0,0,0,0,0
Austin,0,0,0,0,0,0,0
Bandera,0,0,0,0,0,0,0


In [243]:
df.shape

(145, 7)

In [244]:
policy=df.copy()

## Health

#### Data from 2018 https://www.countyhealthrankings.org/app/texas/2020/downloads

In [245]:
health=pd.read_excel(r'data/2018 County Health Rankings Texas Data - v3.xls', 
                sheet_name='Ranked Measure Data', header=1)
print(health.shape)
health.head()

(255, 164)


Unnamed: 0,FIPS,State,County,Years of Potential Life Lost Rate,95% CI - Low,95% CI - High,Z-Score,Years of Potential Life Lost Rate (Black),Years of Potential Life Lost Rate (Hispanic),Years of Potential Life Lost Rate (White),...,95% CI - High.20,Z-Score.33,% Drive Alone (Black),% Drive Alone (Hispanic),% Drive Alone (White),# Workers who Drive Alone,% Long Commute - Drives Alone,95% CI - Low.21,95% CI - High.21,Z-Score.34
0,48000,Texas,,6674.727141,6641.090491,6708.363792,,,,,...,80.472629,,,,,9830530,36.9,36.661702,37.138298,
1,48001,Texas,Anderson,10118.882607,9289.179726,10948.585489,0.982814,10232.688515,12460.553717,9706.965989,...,88.382323,0.950584,76.267496,84.214162,86.808355,16394,24.3,21.019615,27.580385,-0.388055
2,48003,Texas,Andrews,8133.116006,6666.336541,9599.895471,-0.104561,,7128.961752,10359.353201,...,82.002163,-0.42437,,72.743764,84.096955,6113,26.3,19.564132,33.035868,-0.23255
3,48005,Texas,Angelina,8802.202511,8110.802767,9493.602256,0.261821,13478.300066,4705.742581,9058.431525,...,83.819606,0.315319,77.271893,79.643612,85.182618,28432,14.8,12.661949,16.938051,-1.126701
4,48007,Texas,Aransas,11303.977725,9622.581329,12985.374122,1.631754,,8737.720683,12762.437844,...,86.317978,0.157691,,81.697171,77.104745,7577,27.0,21.23265,32.76735,-0.178123


Drop the first row, which contains the statewide totals for Texas

In [246]:
health.drop(0, inplace=True)

Drop confidence-interval, z-score, race/ethnicity breakdown, raw-count, and ratio columns, along with measures that are repeated in other datasets

In [247]:
col_list=['CI', 'Z-Score', 'White', 'Black', 'Hispanic', '# ', 'Poverty', 'Income', 'Ratio']
for c in col_list:
    drop_cols=health.columns[health.columns.str.contains(c)]
    health.drop(drop_cols, axis=1, inplace=True)
health.drop(['FIPS', 'State', 'Population', 'Teen Birth Rate', '% Uninsured', 'Violent Crime Rate', '% Obese',
      '% Single-Parent Households', '% LBW', 'Injury Death Rate', '% Unemployed'], axis=1, inplace=True)
health.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 254 entries, 1 to 254
Data columns (total 29 columns):
County                               254 non-null object
Years of Potential Life Lost Rate    229 non-null float64
% Fair/Poor                          254 non-null float64
Physically Unhealthy Days            254 non-null float64
Mentally Unhealthy Days              254 non-null float64
Unreliable                           13 non-null object
% Smokers                            254 non-null float64
Food Environment Index               253 non-null float64
% Physically Inactive                254 non-null float64
% With Access                        254 non-null float64
% Excessive Drinking                 254 non-null float64
% Alcohol-Impaired                   254 non-null float64
Chlamydia Rate                       232 non-null float64
PCP Rate                             234 non-null float64
Dentist Rate                         240 non-null float64
MHP Rate                    
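The substring-based column drop above can be sketched as a single mask built from all patterns at once. The column names below are a small sample taken from the health table header; `re.escape` is used so that every pattern is matched literally.

```python
import re
import pandas as pd

# Empty frame carrying a few of the health column names
health_demo = pd.DataFrame(columns=['County', '% Smokers', '95% CI - Low', 'Z-Score',
                                    '# Workers who Drive Alone', '% Drive Alone (Black)'])

patterns = ['CI', 'Z-Score', 'Black', '# ']

# One regex of the form 'CI|Z\-Score|...' matches a column if any pattern occurs in it
mask = health_demo.columns.str.contains('|'.join(re.escape(p) for p in patterns))
kept = health_demo.loc[:, ~mask]
print(list(kept.columns))
```

Printing the dropped names (`health_demo.columns[mask]`) before dropping is a cheap way to confirm that a short pattern such as `CI` does not remove a column unintentionally.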

Drop columns with too many null values.

In [248]:
health.drop(['Cohort Size', 'Unreliable'], axis=1, inplace=True)

Format county to remove spaces, make title case, and set as index

In [249]:
health['County']=health['County'].str.replace(' ', '')
health['County']=health['County'].str.title()
health.sort_values(by='County', inplace=True)
health.set_index('County', inplace=True)

Use one-hot encoding for categorical data

In [250]:
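# dtype=int keeps the encoded column numeric (0/1) on pandas versions that default to booleans
violations=pd.get_dummies(health['Presence of violation'], drop_first=True, dtype=int)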
health.drop('Presence of violation', axis=1, inplace=True)
health=pd.concat([health, violations], axis=1)

Use median to fill null values

In [251]:
health=health.apply(lambda x: x.fillna(x.median()))
health.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Anderson to Zavala
Data columns (total 26 columns):
Years of Potential Life Lost Rate    254 non-null float64
% Fair/Poor                          254 non-null float64
Physically Unhealthy Days            254 non-null float64
Mentally Unhealthy Days              254 non-null float64
% Smokers                            254 non-null float64
Food Environment Index               254 non-null float64
% Physically Inactive                254 non-null float64
% With Access                        254 non-null float64
% Excessive Drinking                 254 non-null float64
% Alcohol-Impaired                   254 non-null float64
Chlamydia Rate                       254 non-null float64
PCP Rate                             254 non-null float64
Dentist Rate                         254 non-null float64
MHP Rate                             254 non-null float64
Preventable Hosp. Rate               254 non-null float64
% Receiving HbA1c   
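
A minimal sketch of the median fill on toy data, assuming every column is numeric as in the `info()` output above. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both columns
toy = pd.DataFrame({
    'PCP Rate': [10.0, np.nan, 30.0, 50.0],
    'Dentist Rate': [np.nan, 20.0, 40.0, np.nan],
})

# median() skips NaN by default, so each column gets its own fill value
medians = toy.median()

# Passing a Series to fillna fills column by column
filled = toy.fillna(medians)

# Equivalent to the apply/lambda form used in the notebook
same = toy.apply(lambda x: x.fillna(x.median())).equals(filled)
print(filled)
```

The median is less sensitive to outliers than the mean, which matters for skewed county-level rates.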

## Healthcare worker shortage

#### Data from https://data.hrsa.gov/tools/shortage-area/hpsa-find

In [252]:
hws=pd.read_csv(r'data/Healthcare worker shortage.csv')
print(hws.shape)
hws.head()

(104, 12)


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,Primary Care,1483633182,Andrews County,Geographic HPSA,Texas,"Andrews County, TX",0.975,11,Designated,Rural,02/27/2020,02/27/2020
1,Primary Care,1482120909,Aransas County,Geographic HPSA,Texas,"Aransas County, TX",2.31,17,Designated,Partially Rural,09/26/2013,02/27/2020
2,Primary Care,1486293868,Archer County,Geographic HPSA,Texas,"Archer County, TX",2.47,16,Designated,Partially Rural,09/16/1979,02/03/2020
3,Primary Care,1486348360,Armstrong County,Geographic HPSA,Texas,"Armstrong County, TX",0.53,15,Designated,Rural,08/09/1979,12/22/2019
4,Primary Care,1485163155,Atascosa County,Geographic HPSA,Texas,"Atascosa County, TX",3.985,13,Designated,Non-Rural,05/30/1978,04/26/2020


Drop unneeded columns and rename the remaining columns. `Short` is the number of additional healthcare workers needed to meet the target, and `Score` indicates priority (a lower score means lower priority).

In [253]:
cols=[0,1,3,4,5,8,9,10,11]
hws.drop(hws.columns[cols], inplace=True, axis=1)
hws.columns=['County', 'Short', 'Score']
hws.head()

Unnamed: 0,County,Short,Score
0,Andrews County,0.975,11
1,Aransas County,2.31,17
2,Archer County,2.47,16
3,Armstrong County,0.53,15
4,Atascosa County,3.985,13
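
Dropping nine positions to keep three can be hard to audit. One alternative, sketched here on a narrower toy frame, is to select the wanted positions directly with `iloc`. The column positions and ID values below belong to the toy frame only and are assumptions; the shortage and score values are copied from the output above.

```python
import pandas as pd

# Hypothetical toy frame with unnamed (positional) columns
toy = pd.DataFrame([
    ['Primary Care', 1, 'Andrews County', 'Geographic HPSA', 0.975, 11],
    ['Primary Care', 2, 'Aransas County', 'Geographic HPSA', 2.31, 17],
])

# In this toy frame: 2 = county, 4 = shortage, 5 = score
keep = [2, 4, 5]
subset = toy.iloc[:, keep].copy()
subset.columns = ['County', 'Short', 'Score']
print(subset)
```

Listing the kept positions keeps the selection and the new names side by side, so a mismatch between them is easier to spot.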


In [254]:
hws['County']=hws['County'].str.replace('County', '')
hws['County']=hws['County'].str.replace(' ', '')
hws['County']=hws['County'].str.title()
hws.sort_values(by='County', inplace=True)
hws.set_index('County', inplace=True)
hws.head()

Unnamed: 0_level_0,Short,Score
County,Unnamed: 1_level_1,Unnamed: 2_level_1
Andrews,0.975,11
Aransas,2.31,17
Archer,2.47,16
Armstrong,0.53,15
Atascosa,3.985,13
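
The county clean-up steps are repeated for several dataframes, so they could be gathered into one helper. This is a sketch only: `normalize_county` is a hypothetical name, and removing `' County'` with its leading space is a slight variation on the cell above.

```python
import pandas as pd

def normalize_county(names):
    """Remove ' County', strip spaces, and title-case a Series of names."""
    cleaned = names.str.replace(' County', '', regex=False)
    cleaned = cleaned.str.replace(' ', '', regex=False)
    return cleaned.str.title()

# Hypothetical inputs showing the different spellings the helper absorbs
raw = pd.Series(['Andrews County', 'la salle', 'DE WITT County', 'Van Zandt'])
normalized = normalize_county(raw)
print(normalized.tolist())
```

Note that `str.title()` lowercases every letter that follows another letter, so multi-word names such as Van Zandt collapse to `Vanzandt`. That is harmless as long as every dataframe goes through the same steps.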


## Combine dataframes and save to csv file

In [255]:
land_airports=pd.merge(land, airports_county, how='left', left_index=True, right_index=True)
land_airports.shape

(254, 6)

In [256]:
land_airports_tax=pd.merge(land_airports, tax, how='left', left_index=True, right_index=True)
land_airports_tax.shape

(254, 11)

In [257]:
land_airports_tax_income=pd.merge(land_airports_tax, income, how='left', left_index=True, right_index=True)
land_airports_tax_income.shape

(254, 17)

In [258]:
combined=pd.merge(land_airports_tax_income, pop, how='left', left_index=True, right_index=True)
combined.shape

(254, 19)
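
The chain of left merges above could also be written as a single fold over a list of dataframes. The sketch below uses toy frames with made-up values; only the merge pattern matches the notebook.

```python
from functools import reduce

import pandas as pd

idx = ['Anderson', 'Andrews', 'Angelina']

# Hypothetical stand-ins for land, airports_county, and tax
land = pd.DataFrame({'Percent Urban': [30.0, 80.0, 55.0]}, index=idx)
airports_county = pd.DataFrame({'Number': [2]}, index=['Andrews'])
tax = pd.DataFrame({'Total County Tax Rate': [0.5, 0.6, 0.4]}, index=idx)

# Left-merge each frame onto the running result, keyed on the index
frames = [land, airports_county, tax]
merged = reduce(
    lambda left, right: pd.merge(
        left, right, how='left', left_index=True, right_index=True
    ),
    frames,
)
print(merged.shape)
```

Because every merge is a left merge starting from `land`, the row count stays fixed and counties missing from a later frame show up as nulls, as with the airport columns in the notebook.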

In [259]:
combined.isnull().sum()

Percent Urban                                                           0
Percent Rural                                                           0
Enplanements                                                           43
Number                                                                 43
customs_Y                                                              43
hub_Y                                                                  43
Total Market Value                                                      0
Total Actual Levy                                                       0
Total County Tax Rate                                                   0
Total Appraised Value (Total Taxable Value for FM/FC)                   0
Total Appraised Value (Total Taxable Value for County Tax Purposes)     0
Per Capita Income                                                       0
Total Personal Income                                                   0
Median Household Income               

In [260]:
combined.fillna(0, inplace=True)

In [261]:
combined.shape

(254, 19)

In [262]:
combined=pd.merge(combined, policy_scores, how='left', left_index=True, right_index=True)
combined=pd.merge(combined, policy, how='left', left_index=True, right_index=True)

In [263]:
combined.isnull().sum()

Percent Urban                                                            0
Percent Rural                                                            0
Enplanements                                                             0
Number                                                                   0
customs_Y                                                                0
hub_Y                                                                    0
Total Market Value                                                       0
Total Actual Levy                                                        0
Total County Tax Rate                                                    0
Total Appraised Value (Total Taxable Value for FM/FC)                    0
Total Appraised Value (Total Taxable Value for County Tax Purposes)      0
Per Capita Income                                                        0
Total Personal Income                                                    0
Median Household Income  

In [264]:
policy_score_med=combined['Policy Score'].median()

In [265]:
combined['Policy Score']=combined['Policy Score'].fillna(policy_score_med)

In [266]:
combined.fillna(0, inplace=True)

In [267]:
combined.index=combined.index.str.replace(' ', '')
combined.index=combined.index.str.title()
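
Since the remaining merges match on the index, a county spelled differently in two dataframes would silently produce nulls. One possible check, sketched with toy indexes (the mismatched spelling is invented for illustration), is to compare the index sets before merging.

```python
import pandas as pd

# Hypothetical indexes; 'De Witt' was deliberately left unnormalized
combined_idx = pd.Index(['Anderson', 'Dewitt', 'Lasalle'])
health_idx = pd.Index(['Anderson', 'De Witt', 'Lasalle'])

# Keys present on one side but not the other
missing_in_health = combined_idx.difference(health_idx)
missing_in_combined = health_idx.difference(combined_idx)
print(missing_in_health.tolist(), missing_in_combined.tolist())
```

If both differences are empty, a left merge should introduce no new nulls from key mismatches.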

In [268]:
health.shape

(254, 26)

In [269]:
combined=pd.merge(combined, health, how='left', left_index=True, right_index=True)
combined.isnull().sum()

Percent Urban                                                          0
Percent Rural                                                          0
Enplanements                                                           0
Number                                                                 0
customs_Y                                                              0
hub_Y                                                                  0
Total Market Value                                                     0
Total Actual Levy                                                      0
Total County Tax Rate                                                  0
Total Appraised Value (Total Taxable Value for FM/FC)                  0
Total Appraised Value (Total Taxable Value for County Tax Purposes)    0
Per Capita Income                                                      0
Total Personal Income                                                  0
Median Household Income                            

In [270]:
combined=pd.merge(combined, hws, how='left', left_index=True, right_index=True)
combined.isnull().sum()

Percent Urban                                                            0
Percent Rural                                                            0
Enplanements                                                             0
Number                                                                   0
customs_Y                                                                0
hub_Y                                                                    0
Total Market Value                                                       0
Total Actual Levy                                                        0
Total County Tax Rate                                                    0
Total Appraised Value (Total Taxable Value for FM/FC)                    0
Total Appraised Value (Total Taxable Value for County Tax Purposes)      0
Per Capita Income                                                        0
Total Personal Income                                                    0
Median Household Income  
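
A left merge keeps one row per county only if the right-hand index is unique. The sketch below shows how a repeated county in the shortage data would add rows. The duplicate entry is hypothetical; the notebook does not show whether `hws` contains any.

```python
import pandas as pd

combined = pd.DataFrame(
    {'Percent Urban': [30.0, 80.0]}, index=['Andrews', 'Aransas']
)

# Hypothetical shortage data with 'Aransas' listed twice
hws = pd.DataFrame(
    {'Short': [0.975, 2.31, 1.0], 'Score': [11, 17, 9]},
    index=['Andrews', 'Aransas', 'Aransas'],
)

# Counties that appear more than once in the right-hand frame
dupes = hws.index[hws.index.duplicated()].unique().tolist()

# The left merge repeats the left row for each matching right row
merged = pd.merge(combined, hws, how='left', left_index=True, right_index=True)
print(dupes, len(merged))
```

Comparing `combined.shape` before and after the merge, as the notebook does for earlier steps, would reveal this kind of row growth.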

In [271]:
combined.fillna(0, inplace=True)

In [272]:
combined.to_csv(r'data/all_tlip.csv')
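
As a quick sanity check, the saved file can be read back with the county index restored. The sketch below writes a toy frame to a temporary directory rather than the `data` folder; the values are copied from the shortage table above.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a named county index
toy = pd.DataFrame(
    {'Short': [0.975, 2.31]},
    index=pd.Index(['Andrews', 'Aransas'], name='County'),
)

# Write to a temporary location, then read back using the first column as index
path = os.path.join(tempfile.mkdtemp(), 'all_tlip.csv')
toy.to_csv(path)
restored = pd.read_csv(path, index_col=0)
print(restored.shape)
```

Without `index_col=0`, the county names would come back as an ordinary column and the frame would gain a default integer index.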