## Question

We want to find out whether existing correlations already explain education levels across the US. That is, even if internet connectivity correlates with education levels, can that relationship be explained by other factors?

* How much of the variability of internet connectivity's relationship with education can be explained by other factors?

## Steps

1. Load in datasets
2. Clean each dataset and combine them into one dataframe keyed by FIPS code
3. Inspect the correlations and possibly perform PCA on the dataset

## Code

Imports

In [93]:
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn import preprocessing

Dataset loading

In [83]:
#Load in all datasets
edu = pd.read_csv("../data/Education.csv", encoding='latin-1')
conn = pd.read_csv("../data/county_broadband_adoption.csv", encoding='utf-8')
pop = pd.read_csv('../data/population_estimates.csv', encoding='latin-1')
pov = pd.read_csv("../data/poverty_estimates.csv", encoding='latin-1')
unem = pd.read_csv("../data/unemployment.csv", encoding='latin-1', thousands=',')

In [84]:
#Clean and keep relevant education related data
edu = edu.rename(columns={
    "FIPS Code": "fips",
    "Percent of adults with less than a high school diploma, 2014-18": "perc_less_highschool",
    "Percent of adults with a high school diploma only, 2014-18": "perc_highschool",
    "Percent of adults completing some college or associate's degree, 2014-18": "perc_some_college",
    "Percent of adults with a bachelor's degree or higher, 2014-18": "perc_college"
})
edu = edu.set_index('fips')
edu = edu.drop([0])
edu = edu[["perc_less_highschool", "perc_highschool", "perc_some_college", "perc_college"]]
edu.head()

fips,perc_less_highschool,perc_highschool,perc_some_college,perc_college
1000,14.2,30.9,29.9,24.9
1001,11.3,32.6,28.4,27.7
1003,9.7,27.6,31.3,31.3
1005,27.0,35.7,25.1,12.2
1007,16.8,47.3,24.4,11.5


In [85]:
#Clean broadband adoption data
conn = conn.rename(columns={"cfips": "fips"})
conn = conn.set_index("fips")
conn.head()

fips,statenam,county,year,id,broadband
1001,Alabama,Autauga County,2000,0500000US01001,
1001,Alabama,Autauga County,2001,0500000US01001,
1001,Alabama,Autauga County,2002,0500000US01001,
1001,Alabama,Autauga County,2003,0500000US01001,
1001,Alabama,Autauga County,2004,0500000US01001,


In [86]:
#Clean population data, will figure out how to use this later
pop = pop.rename(columns={"FIPStxt": "fips"})
pop = pop.set_index("fips")
pop = pop.drop([0])
pop.head()

fips,State,Area_Name,Rural-urban_Continuum Code_2003,Rural-urban_Continuum Code_2013,Urban_Influence_Code_2003,Urban_Influence_Code_2013,Economic_typology_2015,CENSUS_2010_POP,ESTIMATES_BASE_2010,POP_ESTIMATE_2010,...,R_DOMESTIC_MIG_2019,R_NET_MIG_2011,R_NET_MIG_2012,R_NET_MIG_2013,R_NET_MIG_2014,R_NET_MIG_2015,R_NET_MIG_2016,R_NET_MIG_2017,R_NET_MIG_2018,R_NET_MIG_2019
1000,AL,Alabama,,,,,,4779736,4780125,4785437,...,1.9,0.6,1.2,1.5,0.6,0.6,0.7,1.1,1.8,2.5
1001,AL,Autauga County,2.0,2.0,2.0,2.0,0.0,54571,54597,54773,...,4.8,6.0,-6.2,-3.9,2.0,-1.7,4.8,0.8,0.5,4.6
1003,AL,Baldwin County,4.0,3.0,5.0,2.0,5.0,182265,182265,183112,...,24.0,16.6,17.5,22.8,20.2,17.7,21.3,22.4,24.7,24.4
1005,AL,Barbour County,6.0,6.0,6.0,6.0,3.0,27457,27455,27327,...,-5.7,0.3,-6.9,-8.1,-5.1,-15.7,-18.2,-25.0,-8.8,-5.2
1007,AL,Bibb County,1.0,1.0,1.0,1.0,0.0,22915,22915,22870,...,1.4,-5.0,-3.8,-5.8,1.3,1.3,-0.7,-3.2,-6.9,1.8


In [87]:
#Clean poverty data, will figure out how to use this later
pov = pov.rename(columns={"FIPStxt": "fips"})
pov = pov.set_index("fips")
pov = pov.drop([0])
pov.head()

fips,Stabr,Area_name,Rural-urban_Continuum_Code_2003,Urban_Influence_Code_2003,Rural-urban_Continuum_Code_2013,Urban_Influence_Code_2013,POVALL_2018,CI90LBAll_2018,CI90UBALL_2018,PCTPOVALL_2018,...,CI90UB517P_2018,MEDHHINC_2018,CI90LBINC_2018,CI90UBINC_2018,POV04_2018,CI90LB04_2018,CI90UB04_2018,PCTPOV04_2018,CI90LB04P_2018,CI90UB04P_2018
1000,AL,Alabama,,,,,801758,785668,817848,16.8,...,23.7,49881,49123,50639,73915.0,69990.0,77840.0,26.0,24.6,27.4
1001,AL,Autauga County,2.0,2.0,2.0,2.0,7587,6334,8840,13.8,...,23.9,59338,53628,65048,,,,,,
1003,AL,Baldwin County,4.0,5.0,3.0,2.0,21069,17390,24748,9.8,...,16.9,57588,54437,60739,,,,,,
1005,AL,Barbour County,6.0,6.0,6.0,6.0,6788,5662,7914,30.9,...,45.9,34382,31157,37607,,,,,,
1007,AL,Bibb County,1.0,1.0,1.0,1.0,4400,3445,5355,21.8,...,33.6,46064,41283,50845,,,,,,


In [88]:
#Clean unemployment data
unem = unem.rename(columns={
    "FIPStxt": "fips",
    "Unemployment_rate_2019": "unemployment_rate",
    "Median_Household_Income_2018": "median_house_income"
})
unem = unem.set_index("fips")
unem = unem.drop([0])
unem = unem[["unemployment_rate", "median_house_income"]]
unem.head()

fips,unemployment_rate,median_house_income
1000,3.0,49881.0
1001,2.7,59338.0
1003,2.7,57588.0
1005,3.8,34382.0
1007,3.1,46064.0


Dataset cleansing and dataframe formation

In [89]:
#Join connectivity, education and unemployment dataframes on the shared fips index
df = conn.join(edu, on="fips")
df = df.join(unem, on="fips")
#The education percentages cover 2014-18, the unemployment rate is for 2019 and median household
#income is for 2018, so average the yearly broadband values over 2014 to 2018
df = df[df["year"].between(2014, 2018)]
df = df.dropna()
#Select the numeric columns before averaging, since text columns such as county have no mean
cols = ["broadband", "perc_less_highschool", "perc_highschool", "perc_some_college", "perc_college", "unemployment_rate", "median_house_income"]
df = df.groupby("fips")[cols].mean()
df.head()

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income
1001,0.703591,11.3,32.6,28.4,27.7,2.7,59338.0
1003,0.771106,9.7,27.6,31.3,31.3,2.7,57588.0
1005,0.541496,27.0,35.7,25.1,12.2,3.8,34382.0
1007,0.633871,16.8,47.3,24.4,11.5,3.1,46064.0
1009,0.635847,19.8,34.0,33.5,12.6,2.7,50412.0


Correlation

In [90]:
#View the correlation matrix of variables
df.corr()

,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income
broadband,1.0,-0.477714,-0.512254,0.124391,0.643752,-0.266776,0.667774
perc_less_highschool,-0.477714,1.0,0.249938,-0.485483,-0.596502,0.348893,-0.534972
perc_highschool,-0.512254,0.249938,1.0,-0.279324,-0.776847,0.252409,-0.537826
perc_some_college,0.124391,-0.485483,-0.279324,1.0,-0.01095,-0.150285,0.092397
perc_college,0.643752,-0.596502,-0.776847,-0.01095,1.0,-0.344441,0.719382
unemployment_rate,-0.266776,0.348893,0.252409,-0.150285,-0.344441,1.0,-0.410773
median_house_income,0.667774,-0.534972,-0.537826,0.092397,0.719382,-0.410773,1.0
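
One way to probe whether a third variable accounts for a pairwise correlation is a first-order partial correlation. The sketch below is not part of the original analysis: it plugs three coefficients from the matrix above into the standard formula to estimate how much of the broadband and perc_college correlation remains once median_house_income is held fixed. The helper name `partial_corr` is hypothetical, and the approach assumes roughly linear relationships.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Coefficients copied by hand from the df.corr() output above
r_bb_college = 0.643752      # broadband vs perc_college
r_bb_income = 0.667774       # broadband vs median_house_income
r_college_income = 0.719382  # perc_college vs median_house_income

r_partial = partial_corr(r_bb_college, r_bb_income, r_college_income)
print(f"raw r = {r_bb_college:.3f}, partial r given income = {r_partial:.3f}")
```

With these inputs the partial correlation comes out at roughly 0.32, about half of the raw 0.64. That would suggest income accounts for a substantial share of the relationship, but not all of it.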


In [94]:
#Perform PCA to see how much these variables explain each other's variance
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df)
pca = PCA(n_components=2)
pca.fit(x_scaled)
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)
print(pca.components_)

[0.0618032  0.01699122]
[0.56236823 0.15460884]
[13.89952478  7.28797247]
[[ 0.55862795 -0.24794624 -0.45987457  0.12393222  0.46443085 -0.14388621
   0.40399801]
 [ 0.17999905  0.29601051  0.09220403 -0.90304185  0.19444289  0.06905925
   0.115822  ]]
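
The raw arrays above are easier to read once each loading is paired with its column name. The following sketch is an addition, not part of the original notebook. It copies the printed values by hand and assumes the column order of `df` from the join step. It also recovers the cumulative explained variance and the approximate row count, using the relation `explained_variance_ = singular_values_**2 / (n_samples - 1)` that scikit-learn's PCA follows.

```python
# Column order assumed to match df as built in the join step
columns = ["broadband", "perc_less_highschool", "perc_highschool",
           "perc_some_college", "perc_college", "unemployment_rate",
           "median_house_income"]

# Values copied by hand from the printed PCA output above
explained_variance = [0.0618032, 0.01699122]
explained_variance_ratio = [0.56236823, 0.15460884]
singular_values = [13.89952478, 7.28797247]
components = [
    [0.55862795, -0.24794624, -0.45987457, 0.12393222,
     0.46443085, -0.14388621, 0.40399801],
    [0.17999905, 0.29601051, 0.09220403, -0.90304185,
     0.19444289, 0.06905925, 0.115822],
]

# Share of total variance captured by the two components together
cumulative = sum(explained_variance_ratio)
print(f"cumulative explained variance: {cumulative:.3f}")

# Row count implied by explained_variance = singular_value**2 / (n_samples - 1)
n_samples = round(singular_values[0] ** 2 / explained_variance[0]) + 1
print(f"implied number of rows: {n_samples}")

# Rank the variables in each component by absolute loading
ranked = []
for i, row in enumerate(components, start=1):
    order = sorted(zip(columns, row), key=lambda pair: abs(pair[1]), reverse=True)
    ranked.append(order)
    print(f"PC{i}: " + ", ".join(f"{name}={value:+.2f}" for name, value in order))
```

Read this way, the first component mixes broadband, college attainment, high-school-only attainment and income, while the second is dominated by perc_some_college.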


## Answer

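Unemployment rate and median household income do not strongly explain the variance in the data. The first principal component captures about 56% of the variance and loads on broadband (0.56), perc_college (0.46) and perc_highschool (-0.46) at least as heavily as on median_house_income (0.40), while unemployment_rate contributes little (-0.14).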

## Interpretation/Observation

Other variables still need to be blocked (controlled for) at a fairly granular level, but a regression analysis of broadband connectivity versus education levels may be worth performing.
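
As a first pass at such a regression, standardized coefficients for a two-predictor model can be computed directly from correlations already reported in the Correlation section. The sketch below is an illustration rather than the author's method. It treats perc_college as the outcome and broadband and median_house_income as predictors, a choice made here purely for illustration, and the function name `standardized_betas` is hypothetical.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized coefficients and R^2 for y ~ x1 + x2, from pairwise correlations."""
    denom = 1 - r_12**2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared

# Correlations copied by hand from the df.corr() output
beta_broadband, beta_income, r_squared = standardized_betas(
    r_y1=0.643752,  # perc_college vs broadband
    r_y2=0.719382,  # perc_college vs median_house_income
    r_12=0.667774,  # broadband vs median_house_income
)
print(f"beta broadband = {beta_broadband:.3f}")
print(f"beta income    = {beta_income:.3f}")
print(f"R^2            = {r_squared:.3f}")
```

Correlations alone cannot provide standard errors, so a full analysis would fit the model on the county-level rows, for example with an ordinary least squares routine.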