# Top 25 Most Affordable Places to Raise a Family

Analysis by Alex Mahadevan  
Data journalist, [The Penny Hoarder](https://www.thepennyhoarder.com)

### Data Cleaning

<p>We begin by importing the Pandas data analysis library.</p> 

In [1]:
import pandas as pd

<p>We used a combination of pandas and Excel to clean and merge all of the data used here. Sources include the [U.S. Census Bureau's American Community Survey](https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/2015/), the [University of Michigan's Institute for Social Research](http://www.icpsr.umich.edu/icpsrweb/NACJD/studies/35019), the [U.S. Bureau of Labor Statistics](https://data.bls.gov/map/MapToolServlet?survey=la&map=county&seasonal=u) and the [Robert Wood Johnson Foundation](http://www.countyhealthrankings.org/reports/2017-county-health-rankings-key-findings-report).</p>
<p>The data cleaning process was daunting, dirty and rigorous, so we omitted it from the final analysis.</p>

<p>Now, we'll read in the initial dataset we'll be working from.</p>
<p>Notes:</p>
* The crime data for some smaller counties may be missing or underreported.
* We used counties instead of metropolitan statistical areas or cities so the analysis could draw on the largest number of variables and capture suburbs.
* Some of the data points were not available for Alaska and Hawaii. That's OK, since those states are pretty expensive to live in anyway (they'll likely be thrown out of the analysis).

In [2]:
df = pd.read_csv("/Users/alexmahadevan/Code/raise_a_family/county data.csv", index_col=0)


<p>Here is a list of the variables we are considering.</p>

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 3220 entries, 1.0 to nan
Data columns (total 40 columns):
SUMLEV                        3141 non-null float64
REGION                        3141 non-null float64
DIVISION                      3141 non-null float64
STATE                         3141 non-null float64
COUNTY                        3141 non-null float64
STNAME                        3141 non-null object
CTY                           3141 non-null object
HEALTHCARE_COST               3134 non-null float64
PERCENT_FOOD_INSECURE         3135 non-null float64
CHILD_MORTALITY_RATE          1943 non-null float64
MENTAL_DISTRESS               3135 non-null float64
CHILDREN_UNINSURED            3135 non-null float64
DISCONNECTED_YOUTH            2046 non-null float64
DAILY_POLLUTION               3108 non-null float64
WATER_VIOLATION               3077 non-null float64
HOUSING_PROBLEMS              3135 non-null float64
POPESTIMATE2016               3141 non-null float64
AVG_NETM

<p>As you can see, for most variables, we have more than 3,100 counties to consider.</p>
<p>Since most people prefer to live where other people are living (and not in an igloo in Yakutat, Alaska), let's cull this list to only the counties with a larger-than-average population.</p>
<p>First, let's find the average (the mean).</p>

In [4]:
df.POPESTIMATE2016.mean()

102874.06080865966

<p>As you can see, the average population of a U.S. county as of 2016 was 102,874 (plus about six hundredths of a person).</p>
<p>Let's drop the counties with a population less than that amount.</p>

In [5]:
df = df[df.POPESTIMATE2016 > df.POPESTIMATE2016.mean()]

In [6]:
df.COUNTY.describe()

count    584.000000
mean      90.041096
std      104.140746
min        1.000000
25%       25.000000
50%       67.000000
75%      113.000000
max      810.000000
Name: COUNTY, dtype: float64
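<p>Because county populations are heavily skewed, a mean cutoff and a median cutoff keep very different numbers of counties. The filter above uses the mean, which left 584 counties. The sketch below uses made-up populations (not values from the dataset) to show the difference.</p>

```python
import pandas as pd

# Toy populations (hypothetical numbers): one very large county skews the mean
pop = pd.Series([5_000, 12_000, 30_000, 80_000, 2_000_000], name="POPESTIMATE2016")

mean_cut = pop.mean()      # pulled upward by the largest county
median_cut = pop.median()  # the middle value

above_mean = pop[pop > mean_cut]
above_median = pop[pop > median_cut]

print(mean_cut, median_cut)
print(len(above_mean), len(above_median))
```

<p>With skewed data like this, the mean cutoff is the stricter of the two, so it keeps fewer than half of the rows.</p>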

<p>Now I'm going to tweak pandas so it displays numbers with three decimal places instead of switching to scientific notation.</p>

In [7]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
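<p>This setting changes only how floats are displayed, not the stored values. As a minimal sketch, the same formatter can be tried out temporarily with pd.option_context, which restores the previous setting afterward. The two sample values are for illustration only.</p>

```python
import pandas as pd

# Two illustrative values: the county mean from the notebook and a tiny number
values = pd.Series([102874.06080865966, 0.000012345])

fmt = lambda x: '%.3f' % x  # same formatter as the cell above

# option_context applies the setting only inside the with-block
with pd.option_context('display.float_format', fmt):
    shown = values.to_string()

print(shown)
```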

In [40]:
# Display a copy with missing values shown as 0; df itself is not modified
df.fillna(0)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTY,HEALTHCARE_COST,PERCENT_FOOD_INSECURE,CHILD_MORTALITY_RATE,...,zFOOD_ENVIRONMENT_INDEX,zCIVIC_ASSOCIATIONS,zCRIME_RATE,zDAILY_POLLUTION,zWATER_VIOLATION,zCHILD_MORTALITY_RATE,zCHILDREN_UNINSURED,zDISCONNECTED_YOUTH,zHOUSING_PROBLEMS,zSCORE_TOTAL
2.000,50.000,3.000,6.000,1.000,3.000,Alabama,"Baldwin County, Alabama",9413.000,14.000,48.000,...,0.174,-0.305,0.582,0.071,-0.628,0.155,0.324,-0.631,0.527,-0.051
8.000,50.000,3.000,6.000,1.000,15.000,Alabama,"Calhoun County, Alabama",10673.000,18.000,70.000,...,-1.281,-0.450,-1.054,-1.182,1.589,-1.258,1.069,-1.535,0.527,-0.588
35.000,50.000,3.000,6.000,1.000,69.000,Alabama,"Houston County, Alabama",9997.000,18.000,78.000,...,-0.969,-0.534,-0.211,0.133,1.589,-1.772,0.697,-1.535,0.741,-0.451
37.000,50.000,3.000,6.000,1.000,73.000,Alabama,"Jefferson County, Alabama",9495.000,20.000,90.000,...,-1.697,1.459,-2.076,-1.996,1.589,-2.542,1.069,-0.631,-0.117,0.000
41.000,50.000,3.000,6.000,1.000,81.000,Alabama,"Lee County, Alabama",8563.000,18.000,67.000,...,-1.385,-0.529,0.316,-0.869,-0.628,-1.065,0.697,1.403,-0.332,-0.066
45.000,50.000,3.000,6.000,1.000,89.000,Alabama,"Madison County, Alabama",9793.000,16.000,65.000,...,-0.450,0.065,-1.100,-1.182,1.589,-0.937,0.697,0.273,1.171,0.214
49.000,50.000,3.000,6.000,1.000,97.000,Alabama,"Mobile County, Alabama",9860.000,20.000,83.000,...,-1.697,0.297,-0.944,-0.180,1.589,-2.093,0.697,-1.083,-0.117,-0.533
51.000,50.000,3.000,6.000,1.000,101.000,Alabama,"Montgomery County, Alabama",9046.000,23.000,87.000,...,-2.424,-0.040,-0.294,-0.931,1.589,-2.350,1.069,-0.857,-0.332,-0.494
52.000,50.000,3.000,6.000,1.000,103.000,Alabama,"Morgan County, Alabama",9990.000,14.000,78.000,...,0.278,-0.498,0.696,-0.994,1.589,-1.772,0.324,-0.857,0.956,-0.195
59.000,50.000,3.000,6.000,1.000,117.000,Alabama,"Shelby County, Alabama",9866.000,11.000,48.000,...,1.109,-0.326,0.751,-1.056,-0.628,0.155,0.697,1.177,1.385,0.498
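<p>Note that fillna returns a new DataFrame unless the result is assigned back, so the call above affects only what is displayed. The sketch below, on a made-up three-row frame, shows this, and also shows why actually replacing missing values with 0 would shift a column's mean (pandas skips NaN when averaging).</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value
toy = pd.DataFrame({"CHILD_MORTALITY_RATE": [48.0, np.nan, 70.0]})

filled = toy.fillna(0)  # returns a copy; toy is left untouched

missing_in_toy = toy["CHILD_MORTALITY_RATE"].isna().sum()
missing_in_filled = filled["CHILD_MORTALITY_RATE"].isna().sum()

# mean() ignores NaN by default, so the two means differ
mean_skipping_nan = toy["CHILD_MORTALITY_RATE"].mean()
mean_with_zero = filled["CHILD_MORTALITY_RATE"].mean()

print(missing_in_toy, missing_in_filled)
print(mean_skipping_nan, mean_with_zero)
```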


### Normalizing Data

<p>Now let's start with the initial Z-scores. Here's the formula we'll use to normalize all the data so we can compare apples to apples: z = (x – μ) / σ.</p>
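<p>As a quick illustration of the formula on made-up numbers: after the transformation, a column has a mean of 0 and a standard deviation of 1, which is what makes different variables comparable. One detail worth knowing is that the pandas std() method uses the sample standard deviation (ddof=1) by default.</p>

```python
import pandas as pd

# Hypothetical values for a single variable
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z = (x - mean) / standard deviation
z = (x - x.mean()) / x.std()

print(z.round(3).tolist())
print(z.mean(), z.std())
```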

In [41]:
df['zFAMKIDSCHANGE'] = (df.CHANGE_IN_FAM_WITH_KIDS - df.CHANGE_IN_FAM_WITH_KIDS.mean())/ df.CHANGE_IN_FAM_WITH_KIDS.std()

In [42]:
df['zPERCENT_WITH_KIDS'] = (df.PERCENT_WITH_KIDS - df.PERCENT_WITH_KIDS.mean())/ df.PERCENT_WITH_KIDS.std()

In [43]:
df['zBIRTHS_PER_CAPITA'] = (df.BIRTHS_PER_CAPITA - df.BIRTHS_PER_CAPITA.mean())/ df.BIRTHS_PER_CAPITA.std()

In [44]:
df['zAVG_NETMIG'] = (df.AVG_NETMIG - df.AVG_NETMIG.mean())/ df.AVG_NETMIG.std()

In [45]:
df['zPOP_CHANGE'] = (df.POP_CHANGE - df.POP_CHANGE.mean())/ df.POP_CHANGE.std()

In [46]:
df['zINCOME_AS_PERCENT_OF_STATE'] = (df.INCOME_AS_PERCENT_OF_STATE - df.INCOME_AS_PERCENT_OF_STATE.mean())/ df.INCOME_AS_PERCENT_OF_STATE.std()

In [47]:
df['zPERCENTAGE_IN_POVERTY'] = -1*((df.PERCENTAGE_IN_POVERTY - df.PERCENTAGE_IN_POVERTY.mean())/ df.PERCENTAGE_IN_POVERTY.std())

In [48]:
df['zPERCENT_W_HEALTH_INSURANCE'] = (df.PERCENT_W_HEALTH_INSURANCE - df.PERCENT_W_HEALTH_INSURANCE.mean())/ df.PERCENT_W_HEALTH_INSURANCE.std()

In [49]:
df['zUNEMPLOYMENT'] = -1*((df.UNEMPLOYMENT - df.UNEMPLOYMENT.mean())/ df.UNEMPLOYMENT.std())

In [50]:
df['zBACHELORS_DEGREE'] = (df.BACHELORS_DEGREE - df.BACHELORS_DEGREE.mean())/ df.BACHELORS_DEGREE.std()

In [51]:
df['zMORTGAGE_INCOME_RATIO'] = (df.MORTGAGE_INCOME_RATIO - df.MORTGAGE_INCOME_RATIO.mean())/ df.MORTGAGE_INCOME_RATIO.std()

In [52]:
df['zRENT_INCOME_RATIO'] = (df.RENT_INCOME_RATIO - df.RENT_INCOME_RATIO.mean())/ df.RENT_INCOME_RATIO.std()

In [53]:
df['zHEALTHCARE_INCOME_RATIO'] = (df.HEALTHCARE_INCOME_RATIO - df.HEALTHCARE_INCOME_RATIO.mean())/ df.HEALTHCARE_INCOME_RATIO.std()

In [54]:
df['zDOCTORS'] = (df.DOCTORS - df.DOCTORS.mean())/ df.DOCTORS.std()

In [55]:
df['zEXERCISE_ACCESS'] = (df.EXERCISE_ACCESS - df.EXERCISE_ACCESS.mean())/ df.EXERCISE_ACCESS.std()

In [56]:
df['zFOOD_ENVIRONMENT_INDEX'] = (df.FOOD_ENVIRONMENT_INDEX - df.FOOD_ENVIRONMENT_INDEX.mean())/ df.FOOD_ENVIRONMENT_INDEX.std()

In [57]:
df['zCIVIC_ASSOCIATIONS'] = (df.CIVIC_ASSOCIATIONS - df.CIVIC_ASSOCIATIONS.mean())/ df.CIVIC_ASSOCIATIONS.std()

In [58]:
df['zCRIME_RATE'] = -1*((df.CRIME_RATE - df.CRIME_RATE.mean())/ df.CRIME_RATE.std())

In [59]:
df['zDAILY_POLLUTION'] = -1*((df.DAILY_POLLUTION - df.DAILY_POLLUTION.mean())/ df.DAILY_POLLUTION.std())

In [61]:
df['zCHILD_MORTALITY_RATE'] = -1*((df.CHILD_MORTALITY_RATE - df.CHILD_MORTALITY_RATE.mean())/ df.CHILD_MORTALITY_RATE.std())

In [62]:
df['zCHILDREN_UNINSURED'] = -1*((df.CHILDREN_UNINSURED - df.CHILDREN_UNINSURED.mean())/ df.CHILDREN_UNINSURED.std())

In [63]:
df['zDISCONNECTED_YOUTH'] = -1*((df.DISCONNECTED_YOUTH - df.DISCONNECTED_YOUTH.mean())/ df.DISCONNECTED_YOUTH.std())

In [64]:
df['zHOUSING_PROBLEMS'] = -1*((df.HOUSING_PROBLEMS - df.HOUSING_PROBLEMS.mean())/ df.HOUSING_PROBLEMS.std())
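<p>The cells above repeat one pattern per column. A more compact alternative, sketched here with a hypothetical helper named add_z_scores and a made-up two-column frame, loops over the column names and flips the sign for variables where lower is better. It names every new column "z" plus the original name, which differs slightly from the notebook (for example, zFAMKIDSCHANGE).</p>

```python
import pandas as pd

def add_z_scores(frame, higher_is_better, lower_is_better):
    """Return a copy of frame with a z-score column added for each listed column.

    Hypothetical helper, not part of the original notebook.
    """
    out = frame.copy()
    for col in higher_is_better:
        out['z' + col] = (out[col] - out[col].mean()) / out[col].std()
    for col in lower_is_better:
        # Multiply by -1 so that a higher z-score always means "better"
        out['z' + col] = -1 * (out[col] - out[col].mean()) / out[col].std()
    return out

# Toy data for illustration only
toy = pd.DataFrame({
    "DOCTORS": [10.0, 20.0, 30.0],
    "CRIME_RATE": [100.0, 200.0, 300.0],
})

scored = add_z_scores(toy, higher_is_better=["DOCTORS"], lower_is_better=["CRIME_RATE"])
print(scored)
```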

### Ranking Counties
<p>Now that we have normalized scores for all 23 of our variables, we can average them into a single total score for each county. You'll notice we multiplied some of the scores by -1, because families want less pollution, crime, etc.</p>

In [68]:
df['zSCORE_TOTAL'] = (df['zPERCENT_WITH_KIDS'] + df['zBIRTHS_PER_CAPITA'] + df['zFAMKIDSCHANGE'] + df['zAVG_NETMIG'] + df['zPOP_CHANGE']  + df['zINCOME_AS_PERCENT_OF_STATE'] + df['zPERCENTAGE_IN_POVERTY'] + df['zPERCENT_W_HEALTH_INSURANCE'] + df['zUNEMPLOYMENT'] + df['zBACHELORS_DEGREE'] + df['zMORTGAGE_INCOME_RATIO'] + df['zRENT_INCOME_RATIO'] + df['zHEALTHCARE_INCOME_RATIO'] + df['zDOCTORS'] + df['zEXERCISE_ACCESS'] + df['zFOOD_ENVIRONMENT_INDEX'] + df['zCIVIC_ASSOCIATIONS'] + df['zCRIME_RATE'] + df['zDAILY_POLLUTION'] + df['zCHILD_MORTALITY_RATE'] + df['zCHILDREN_UNINSURED'] + df['zDISCONNECTED_YOUTH'] + df['zHOUSING_PROBLEMS'])/23
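<p>One property of adding columns with + is that a missing value in any single component makes the whole total NaN for that county. An alternative, sketched below on made-up data with hypothetical column names, is a row-wise mean, which skips missing components by default. Whether skipping is appropriate is a judgment call, since it averages over fewer variables for those counties.</p>

```python
import numpy as np
import pandas as pd

# Toy z-scores; the second row is missing one component
toy = pd.DataFrame({
    "zA": [1.0, 0.5],
    "zB": [0.0, np.nan],
    "zC": [-1.0, 1.5],
})
z_cols = ["zA", "zB", "zC"]

# Same pattern as the cell above: NaN in any column propagates to the total
by_sum = (toy["zA"] + toy["zB"] + toy["zC"]) / 3

# Row-wise mean: NaN is skipped, so the second row averages two values
by_mean = toy[z_cols].mean(axis=1)

print(by_sum.tolist())
print(by_mean.tolist())
```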

In [69]:
df_sorted = df.sort_values('zSCORE_TOTAL', ascending=False)

In [70]:
df_sorted.head(50)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTY,HEALTHCARE_COST,PERCENT_FOOD_INSECURE,CHILD_MORTALITY_RATE,...,zFOOD_ENVIRONMENT_INDEX,zCIVIC_ASSOCIATIONS,zCRIME_RATE,zDAILY_POLLUTION,zWATER_VIOLATION,zCHILD_MORTALITY_RATE,zCHILDREN_UNINSURED,zDISCONNECTED_YOUTH,zHOUSING_PROBLEMS,zSCORE_TOTAL
2919.0,50.0,3.0,5.0,51.0,107.0,Virginia,"Loudoun County, Virginia",8272.0,4.0,30.0,...,2.771,-0.286,1.187,-0.18,-0.628,1.311,0.324,1.403,1.171,1.582
268.0,50.0,4.0,8.0,8.0,35.0,Colorado,"Douglas County, Colorado",9521.0,9.0,25.0,...,1.524,-0.412,1.2,1.699,-0.628,1.633,1.069,1.403,1.385,1.361
2564.0,50.0,3.0,6.0,47.0,187.0,Tennessee,"Williamson County, Tennessee",8545.0,8.0,26.0,...,1.628,-0.248,0.98,-0.305,1.589,1.568,0.697,0.951,1.171,1.296
740.0,50.0,2.0,3.0,18.0,57.0,Indiana,"Hamilton County, Indiana",9303.0,9.0,28.0,...,1.317,-0.171,1.407,-1.307,-0.628,1.44,0.324,1.177,1.815,1.264
2099.0,50.0,2.0,3.0,39.0,41.0,Ohio,"Delaware County, Ohio",9350.0,9.0,26.0,...,1.628,-0.453,1.205,-1.369,-0.628,1.568,0.697,1.403,1.385,1.221
455.0,50.0,3.0,5.0,13.0,117.0,Georgia,"Forsyth County, Georgia",10233.0,7.0,28.0,...,1.94,-0.539,1.26,-0.43,-0.628,1.44,-0.048,0.499,0.956,1.179
2895.0,50.0,3.0,5.0,51.0,59.0,Virginia,"Fairfax County, Virginia",7799.0,6.0,33.0,...,2.356,1.378,1.168,0.572,1.589,1.119,-0.048,1.403,0.527,1.141
1226.0,50.0,3.0,5.0,24.0,27.0,Maryland,"Howard County, Maryland",8563.0,8.0,32.0,...,1.94,-0.197,0.66,-0.618,1.589,1.183,1.069,1.177,0.956,1.069
1407.0,50.0,2.0,4.0,27.0,139.0,Minnesota,"Scott County, Minnesota",8093.0,6.0,24.0,...,2.044,-0.61,1.118,-0.618,-0.628,1.697,1.069,1.177,1.385,1.057
2873.0,50.0,3.0,5.0,51.0,13.0,Virginia,"Arlington County, Virginia",7469.0,8.0,42.0,...,1.94,-0.133,0.907,-0.242,1.589,0.541,0.324,1.403,0.527,1.016


In [71]:
df_sorted.to_csv("/Users/alexmahadevan/Code/raise_a_family/county data sorted.csv")

<p>Just glancing at the top few counties, you can see affordability kind of got lost in there, even though we did include healthcare, rent and mortgage costs in the analysis.</p>
<p>To do the next part of the analysis, I'm only picking out the top 200 counties.</p> 
<p>Then I'll hard-code the school grades from [Niche](https://www.niche.com/k12/search/best-school-districts/), and re-run the analysis putting more weight on affordability. That should produce our final list.</p>
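<p>The re-weighting step is not shown in this notebook. As a rough sketch of what a weighted score could look like, the example below uses two made-up counties and hypothetical weights that count an affordability variable twice as heavily as crime. The actual weights and variables used for the final list are not given here.</p>

```python
import pandas as pd

# Toy z-scores for two made-up counties
toy = pd.DataFrame(
    {"zRENT_INCOME_RATIO": [1.0, -1.0], "zCRIME_RATE": [0.0, 2.0]},
    index=["County A", "County B"],
)

# Hypothetical weights: affordability counts double
weights = pd.Series({"zRENT_INCOME_RATIO": 2.0, "zCRIME_RATE": 1.0})

# Weighted average: multiply each column by its weight, sum, divide by total weight
weighted = (toy[weights.index] * weights).sum(axis=1) / weights.sum()
unweighted = toy.mean(axis=1)

ranked = weighted.sort_values(ascending=False)
print(unweighted.tolist())
print(ranked)
```

<p>In this toy case the two counties tie on the unweighted average, and the extra weight on affordability moves County A ahead.</p>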