# Sampling Notebook
This notebook draws random samples, and samples from the extremes of individual features, to build a better understanding of the data.

In [19]:
#imports
import pandas as pd

## Dataset Loading

In [20]:
#Load in relevant cleaned datasets
df = pd.read_csv("../data/dataframe.csv").set_index("fips").dropna()
fips = pd.read_csv("../data/fips.csv").set_index("fips")

In [25]:
df.shape

(3127, 10)

## Random Sampling
Let's look at a few random data points for reference.

In [6]:
df.sample(8, random_state=42).join(fips, on="fips")

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income,urbanization_class,poverty_percentage,poverty_percentage_0-17,state,area
12055,0.691948,15.3,38.0,29.8,17.0,4.5,39994.0,4.0,20.8,38.5,FL,Highlands County
40005,0.497167,17.3,40.6,27.1,15.0,4.3,39702.0,6.0,20.8,27.8,OK,Atoka County
54049,0.677162,10.7,40.2,25.7,23.4,5.1,46244.0,5.0,16.6,23.7,WV,Marion County
47137,0.642814,23.7,40.9,26.7,8.7,4.7,40182.0,6.0,16.0,24.3,TN,Pickett County
8095,0.633842,11.8,27.6,35.1,25.5,1.6,52367.0,6.0,10.8,15.6,CO,Phillips County
49047,0.658945,13.9,35.2,36.2,14.7,4.2,63631.0,5.0,11.5,13.5,UT,Uintah County
13053,0.534857,7.2,25.9,33.6,33.4,4.6,46869.0,3.0,17.3,19.3,GA,Chattahoochee County
26153,0.582705,9.9,44.3,30.0,15.8,7.0,44887.0,6.0,13.2,22.0,MI,Schoolcraft County


This sample doesn't give us any particularly strong points to zero in on, but it does show the general shape of the information available. It also raises the question of how to define an effect on education, since the only education metric we have for each county is the percentage of residents at each level of attainment. We could regress the features against each of these percentages, or we could try classification and grouping via one-hot encodings of these features.
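As a rough sketch of the regression option, the snippet below fits `perc_college` on `broadband` plus a one-hot encoding of `urbanization_class`, using the eight sampled rows above as stand-in data. The choice of `perc_college` as the response, the use of ordinary least squares via `numpy.linalg.lstsq`, and the `sample` frame itself are assumptions made for illustration; eight rows are far too few to draw any conclusion from.

```python
import numpy as np
import pandas as pd

# Stand-in data: the eight sampled rows shown above (subset of columns).
sample = pd.DataFrame(
    {
        "broadband": [0.691948, 0.497167, 0.677162, 0.642814,
                      0.633842, 0.658945, 0.534857, 0.582705],
        "urbanization_class": [4.0, 6.0, 5.0, 6.0, 6.0, 5.0, 3.0, 6.0],
        "perc_college": [17.0, 15.0, 23.4, 8.7, 25.5, 14.7, 33.4, 15.8],
    },
    index=pd.Index([12055, 40005, 54049, 47137, 8095, 49047, 13053, 26153],
                   name="fips"),
)

# One-hot encode the categorical urbanization class; drop_first avoids
# perfect collinearity with the intercept column.
dummies = pd.get_dummies(
    sample["urbanization_class"].astype(int),
    prefix="urban",
    drop_first=True,
).astype(float)

# Design matrix: intercept, broadband, and the urbanization dummies.
X = pd.concat([sample[["broadband"]], dummies], axis=1)
X.insert(0, "intercept", 1.0)

# Ordinary least squares fit of perc_college on the design matrix.
coef, *_ = np.linalg.lstsq(
    X.to_numpy(), sample["perc_college"].to_numpy(), rcond=None
)
coefficients = pd.Series(coef, index=X.columns)
print(coefficients)
```

On the full dataset the same steps would apply with `df` in place of `sample`, and could be repeated once per attainment column.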

## Broadband
Let's take a look at some of the places at the high and low ends of broadband connectivity.

In [8]:
sort_by_broadband = df.sort_values(by=["broadband"])

#### Lower broadband connectivity
Let's look at the lower end first

In [10]:
sort_by_broadband.join(fips, on="fips").head(3)

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income,urbanization_class,poverty_percentage,poverty_percentage_0-17,state,area
35021,0.309,12.0,32.9,30.8,24.3,5.1,37544.0,6.0,16.7,20.3,NM,Harding County
8109,0.348778,19.6,29.8,27.8,22.9,4.0,36869.0,6.0,24.6,37.0,CO,Saguache County
48261,0.35,66.3,21.1,5.8,6.8,3.6,42153.0,5.0,14.3,20.3,TX,Kenedy County


Much of what we see here matches what we might hypothesize, and we can check later whether the full analysis bears it out. These tend to be more rural, lower-income places with higher poverty and lower rates of college graduation.

#### Higher broadband connectivity
We can also look at the results from the flip side

In [11]:
sort_by_broadband.join(fips, on="fips").tail(3)

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income,urbanization_class,poverty_percentage,poverty_percentage_0-17,state,area
18057,0.936337,3.6,15.3,23.3,57.8,2.5,101740.0,2.0,4.2,4.5,IN,Hamilton County
51107,0.937183,6.5,12.4,20.3,60.8,2.3,140382.0,2.0,3.6,3.6,VA,Loudoun County
8035,0.958492,2.0,12.2,27.9,57.9,2.4,120670.0,2.0,2.6,2.5,CO,Douglas County


Here we see the opposite story. These look to be more metropolitan areas, bordering on urban (all three have urbanization class 2), with lower rates of poverty and unemployment and higher rates of college graduation.
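To put numbers on that contrast, one option is to average a few columns over each tail. The sketch below does this with `nsmallest` and `nlargest` on a stand-in frame built from the six rows shown above; the `extremes` frame, the `compare_cols` list, and the choice of `n = 3` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Stand-in data: the three lowest- and three highest-broadband rows shown above.
extremes = pd.DataFrame(
    {
        "broadband": [0.309, 0.348778, 0.35, 0.936337, 0.937183, 0.958492],
        "perc_college": [24.3, 22.9, 6.8, 57.8, 60.8, 57.9],
        "unemployment_rate": [5.1, 4.0, 3.6, 2.5, 2.3, 2.4],
        "median_house_income": [37544.0, 36869.0, 42153.0,
                                101740.0, 140382.0, 120670.0],
        "poverty_percentage": [16.7, 24.6, 14.3, 4.2, 3.6, 2.6],
    },
    index=pd.Index([35021, 8109, 48261, 18057, 51107, 8035], name="fips"),
)

n = 3  # number of counties to take from each end
compare_cols = ["perc_college", "unemployment_rate",
                "median_house_income", "poverty_percentage"]

# Column means for the n lowest and n highest broadband counties, side by side.
summary = pd.DataFrame(
    {
        "lowest_broadband": extremes.nsmallest(n, "broadband")[compare_cols].mean(),
        "highest_broadband": extremes.nlargest(n, "broadband")[compare_cols].mean(),
    }
)
print(summary.round(2))
```

Run against the full `df` with a larger `n`, the same comparison would be less sensitive to individual outliers than three rows per tail.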

## Education
We can also look at samples from our response variables by sorting the dataframe by each educational attainment statistic.

In [14]:
#Lowest levels of educational attainment
sort_by_less_highschool = df.sort_values(by=["perc_less_highschool"])
sort_by_less_highschool.join(fips, on="fips").tail(3)

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income,urbanization_class,poverty_percentage,poverty_percentage_0-17,state,area
48377,0.48541,47.6,19.1,13.2,20.1,6.5,37738.0,6.0,22.4,35.2,TX,Presidio County
48427,0.553879,48.5,25.0,16.2,10.3,9.8,30490.0,5.0,33.2,45.3,TX,Starr County
48261,0.35,66.3,21.1,5.8,6.8,3.6,42153.0,5.0,14.3,20.3,TX,Kenedy County


In [15]:
#Educational attainment mostly at only high school graduation
sort_by_highschool = df.sort_values(by=["perc_highschool"])
sort_by_highschool.join(fips, on="fips").tail(3)

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income,urbanization_class,poverty_percentage,poverty_percentage_0-17,state,area
8073,0.703167,9.7,53.1,24.4,12.8,2.3,46690.0,6.0,16.2,21.0,CO,Lincoln County
54015,0.653068,24.3,53.5,12.7,9.5,8.5,35983.0,4.0,25.1,32.7,WV,Clay County
42053,0.54023,17.6,55.6,19.4,7.4,6.8,41750.0,6.0,24.0,30.1,PA,Forest County


In [16]:
#Educational attainment mostly at some college
sort_by_some_college = df.sort_values(by=["perc_some_college"])
sort_by_some_college.join(fips, on="fips").tail(3)

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income,urbanization_class,poverty_percentage,poverty_percentage_0-17,state,area
31007,0.7495,2.9,29.2,47.2,20.7,3.7,53865.0,5.0,12.5,25.0,NE,Banner County
49007,0.666006,9.2,27.2,47.3,16.4,3.7,51394.0,5.0,14.4,17.0,UT,Carbon County
6091,0.589684,6.7,26.6,48.0,18.7,5.4,52308.0,6.0,13.3,18.0,CA,Sierra County


In [17]:
#Highest levels of educational attainment (college graduation)
sort_by_college = df.sort_values(by=["perc_college"])
sort_by_college.join(fips, on="fips").tail(3)

fips,broadband,perc_less_highschool,perc_highschool,perc_some_college,perc_college,unemployment_rate,median_house_income,urbanization_class,poverty_percentage,poverty_percentage_0-17,state,area
35028,0.739749,2.2,9.9,21.5,66.5,3.2,124947.0,5.0,3.9,3.1,NM,Los Alamos County
51013,0.90509,5.9,8.1,11.4,74.6,1.9,120950.0,1.0,6.3,7.3,VA,Arlington County
51610,0.75479,1.2,5.5,14.8,78.5,2.0,137551.0,2.0,3.6,2.9,VA,Falls Church city


These samples show the counties where each level of educational attainment peaks, which may help guide our thinking.
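The four education cells above repeat the same sort-and-tail pattern, so they could be folded into one helper. The sketch below is one way to do that; `top_by_column`, `demo`, and `labels` are hypothetical names, and the stand-in frames hold only the top county from each of the four tables above. Note that `nlargest` returns rows in descending order, whereas `sort_values(...).tail(3)` lists them ascending.

```python
import pandas as pd

def top_by_column(data: pd.DataFrame, labels: pd.DataFrame,
                  column: str, n: int = 3) -> pd.DataFrame:
    """Return the n rows with the largest `column`, joined to state/area labels."""
    # Both frames are indexed by fips, so a plain index join is enough.
    return data.nlargest(n, column).join(labels)

# Stand-in data: the top county from each of the four tables above.
index = pd.Index([48261, 42053, 6091, 51610], name="fips")
demo = pd.DataFrame(
    {
        "perc_less_highschool": [66.3, 17.6, 6.7, 1.2],
        "perc_highschool": [21.1, 55.6, 26.6, 5.5],
        "perc_some_college": [5.8, 19.4, 48.0, 14.8],
        "perc_college": [6.8, 7.4, 18.7, 78.5],
    },
    index=index,
)
labels = pd.DataFrame(
    {
        "state": ["TX", "PA", "CA", "VA"],
        "area": ["Kenedy County", "Forest County",
                 "Sierra County", "Falls Church city"],
    },
    index=index,
)

education_cols = ["perc_less_highschool", "perc_highschool",
                  "perc_some_college", "perc_college"]

# Name of the single top area for each attainment level.
top_areas = {
    col: top_by_column(demo, labels, col, n=1)["area"].iloc[0]
    for col in education_cols
}
print(top_areas)
```

With the notebook's own frames, `top_by_column(df, fips, "perc_college")` would stand in for the last cell above.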