In [None]:
# Covid19 Impact on K-12 Education
By Megan Nalani Chun  
Last modified November 2020

## Motivation and Problem Statement
The children of today will be the decision makers of tomorrow and their education will help determine what those decisions will be. The Covid19 pandemic has changed daily normal activity drastically for people around the world and some areas of society have been impacted more than others. The focus of this study will be on K-12 education during Covid19 since the pandemic and lack of national government guidance has forced school districts to implement varying policies such as those in the data section below. I hope to learn the following:
- How much school did students miss?
- Under what factors is there a significant difference in the amount of school missed? If there was a similar pandemic, would early implementation of certain policies allow students to have a better continued education? 
- Do school districts with more income have more access to the internet? And are these districts more likely to have an online policy?

## Data 
This analysis will be scoped to students enrolled in public and private schools across the United States. The following datasets will be used because of their completeness, trustworthiness, and accessibility in addition to their contents. Licensing and data source information can be found in the [readme](https://github.com/NalaniKai/data-512-final/blob/main/README.md).

Combining these datasets by school districts and counties will enable the research questions to be explored.

### Covid19 K-12 Education Data by MCH Strategic Data
This dataset has the following information for 12,643 (86%) school districts in the US for the 2020-2021 school year:
- Enrollment 
- School open date 
- Teaching methods (online, on premise, etc)
- Sports participation 
- Online instruction increase
- Network investment
- Hardware investment
- Staff mask policy
- Student mask policy
- Student illness return policy
- Student isolation area 
- School temporary shutdown 

Data source: Covid19 K-12 Education Data by MCH Strategic Data. Compiled from public federal, state, and local school districts information and media updates. Go to the [Main page](https://www.mchdata.com/covid19/schoolclosings), scroll down, make a free account, and press the "download list of districts" button. Select subscribe and sign up for free subscription to download the dataset. No credit card information is needed.

Licensing: [MCH 2019 Standard Licensing Stipulations and Conditions Agreement](https://www.mchdata.com/about/terms-conditions)

### Demographic Data from the Institute of Education Sciences
The following demographic data was pulled from this dataset for all school districts in the US:
- Race 
- Income
- Computer & internet availability 
- School enrollment

Data source: Education Demographic and Geographic Estimates. National Center for Education Statistics. Institute of Education Sciences.   
- [Dataset: ACS 2014-2018 Profile](https://nces.ed.gov/programs/edge/TableViewer/acsProfile/2018)  
- Geography: All Districts  
- Population: Relevant Children  
- Tables:          
        - CDP02.2 SCHOOL ENROLLMENT [CDP02.2_102_USSchoolDistrictAll_111231815433.txt](https://github.com/NalaniKai/data-512-final/blob/main/Data/CDP02.2_102_USSchoolDistrictAll_111231815433.txt)  
        - CDP02.11 COMPUTERS AND INTERNET USE [CDP02.11_102_USSchoolDistrictAll_111233642475.txt](https://github.com/NalaniKai/data-512-final/blob/main/Data/CDP02.11_102_USSchoolDistrictAll_111233642475.txt)  
        - CDP03.2 INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) [CDP03.2_102_USSchoolDistrictAll_111234614792.txt](https://github.com/NalaniKai/data-512-final/blob/main/Data/CDP03.2_102_USSchoolDistrictAll_111234614792.txt)  
        - CDP05.2 RACE [CDP05.2_102_USSchoolDistrictAll_11123448541.txt](https://github.com/NalaniKai/data-512-final/blob/main/Data/CDP05.2_102_USSchoolDistrictAll_11123448541.txt)  

Licensing: [Open Data Policy](https://digital.gov/open-data-policy-m-13-13/) 

### Johns Hopkins University COVID-19 Data
This dataset has the daily Covid19 case counts by county and is continually being updated.

Data source: COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. [csse_covid_19_time_series](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/)

Licensing: [Public Domain U.S.- Johns Hopkins University Center for Systems Science and Engineering. [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/deed.ast)

### Intermediary Data Source for Covid19 Data Sources
This dataset will be used to join the Covid19 school district data to the Covid19 case count data.

Data source: United States Census Bureau [School Districts and Associated Counties](https://www.census.gov/programs-surveys/saipe/guidance-geographies/districts-counties.html).

Licensing: [Public Domain U.S. Government](https://www.usa.gov/government-works)

## Unknowns & Dependencies 
There are a couple dependencies and unknowns that will need to be checked: 
- the percent of matches when joining the above datasets by their school district IDs and county codes
- the amount of missing data


In [None]:
## Research questions and/or hypotheses

The overarching research question for this analysis is: What is the impact of Covid19 on school closures and teaching methods for K-12 students across the US? How does this differ for students based on race, income, and computer/internet access?

Hypotheses that will be explored include:
- School districts that have more students with access to computers and internet are more likely to have online learning teaching methods. 
- Students in poorer school districts with more racial minorities have seen the most school closures. 
- School districts implementing on-premise policies around masks, sports, returning from illness, and isolation have the least amount of school closures and most amount of on premise learning.
- Counties with stricter school closures, online learning, and on campus protection policies will have a lower percent of Covid19 cases. 

Data exploration will also take into account the number of students in the school district along with the number of Covid19 cases per county which are factors also influencing school policies and education. 

## Background/Related Work

According to the American Society for Microbiology, the decisions for how to structure K-12 learning during Covid19 is the responsibility of each school district or private institution [ASM article](https://asm.org/Articles/2020/August/A-National-Crisis-K-12-Education-During-the-COVID). The ASM article also points out that "Centers for Disease Control and Prevention (CDC) data show the 5-17 year old age group has the highest SARS-CoV-2 test positivity rate of any age group" which means K-12 students can easily spread Covid19 to others in their communities. In the case of the annual flu, "The Director of the CDC, Robert Redfield, has been warning since late April about the specter of a second wave of coronavirus during flu season. A potential driver of this second wave could be K-12 students bringing 1 or both viruses home to their care providers, who are more vulnerable to these pathogens." Since schools are expected to be a hot spot for spreading Covid19, one of the hypotheses my analysis will investigate is whether county Covid19 cases tend to be better or worse depending on measures taken by the school districts. In preparation for school openings, ASM made a list of considerations for school distrincts to plan for such as new staff training, school bus policies, policies around student lockers, and additional hand washing stations many of which will not be in scope for this analysis. 

A similar analysis on Covid19 impact on K-12 students was performed by the Pew Research Center where data was collected via surveys from parents who were asked a series of questions about their children and their education during this time. According to this study, "parents of children attending school fully in person are ... less likely to be concerned about their education" but these parents also are concerned about their children's exposure to covid19 [Pew Research Center article](https://www.pewsocialtrends.org/2020/10/29/most-parents-of-k-12-students-learning-online-worry-about-them-falling-behind/). As a parent, this is an understandable concern and logically it would make sense for Covid19 to spread more rapidly with in person classes. To test this hypothesis, part of my analysis will investigate whether counties with higher in person schooling correlates with having higher Covid19 case counts. The Pew Research Center study also showed that lower income parents reported higher percentage of online only instruction and 72% stated they were concerned about their children falling behind in school whereas only 55% of upper income parents said they were concerned. In response to these findings, my analysis will investigate the hypothesis that lower income school districts are more likely to have online learning policies. Further, my analysis will also combine the MCH Covid19 data and the US government education statistics to determine the percent of students with computer and internet access especially in school districts with online learning policies. 

## Analysis Methodology

This study will leverage data visualizations, correlation calculations, and logistic regression to investigate the research questions and hypotheses stated above. 

The distributions will be plotted and the data summaries will be compared for school district policies, Covid19 cases, and demographics across school districts. These plots and summaries will provide insight into data skew and any outliers. 

The correlations and p-values for the percent of Covid19 cases per county and the following in person school district policies will be calculated to determine which policies have the most impact on Covid19 cases. The results will also help inform whether implementing certain policies early could potentially be more useful at the start of future pandemics. This approach was selected since the data is easy to interpret and plot via a visualization and comparing the correlation value between these policies is straightforward. 
- Sports participation
- Staff mask policy
- Student mask policy
- Student illness return policy
- Student isolation area 

Logistic regression will be used to investigate which of the following attributes are most correlated with school closures, online learning, in person learning, and hardware/network investments. This approach was selected because each of the predictor variables can be split into binary labels. Moreover, the features are able to be easily compared which will help determine whether demographics play a significant role in the types of learning and investment students receive. 
- race
- household income of students
- access to computers/internet 
- % of Covid19 cases per county 
- Number of students enrolled 


In [None]:
## Step 1: Data Preparation

In [92]:
import pandas as pd

In [223]:
df_covid = pd.read_csv("Data/Covid19_K-12_Education/covid-data.csv")
df_covid.head()

Unnamed: 0,SchoolYear,DistrictNCES,DistrictID,DistrictName,Control,PhysicalCity,PhysicalState,Enrollment,OpenDate,TeachingMethod,...,OnlineInstructionIncrease,NetworkInvestment,HardwareInvestment,StaffMaskPolicy,StudentMaskPolicy,StudentIllnessReturnPolicy,StudentIsolationArea,SchoolTemporaryShutdown,ParentOptOutClassroomTeaching,LastVerifiedDate
0,2020-2021,200004.0,888659,Yupiit School District,Public,Akiachak,AK,463.0,08/12/2020,On Premises,...,No,No,No,Pending,Pending,Pending,Pending,Pending,Pending,08/12/2020
1,2020-2021,200010.0,888610,Aleutian Region School District,Public,Anchorage,AK,27.0,09/08/2020,On Premises,...,Unknown,Yes,Yes,Required for all staff,Required for all students,Unknown,Unknown,Never closed,Unknown,10/23/2020
2,2020-2021,200180.0,888658,Anchorage School District,Public,Anchorage,AK,48347.0,08/25/2020,Hybrid,...,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Yes,10/26/2020
3,2020-2021,200800.0,888657,Chugach School District,Public,Anchorage,AK,517.0,08/17/2020,Online Only,...,Yes,Unknown,Unknown,Required for all staff,Required for all students,Yes,Yes,Closed 6-14 days,Yes,10/29/2020
4,2020-2021,200730.0,888599,Chatham School District,Public,Angoon,AK,156.0,10/05/2020,Hybrid,...,Unknown,Pending,Yes,Required for all staff,Required for all students,Yes,Yes,Closed 6-14 days,Yes,10/22/2020


In [242]:
covid_cases = pd.read_csv("Data/Covid19_CasesByCounty/time_series_covid19_deaths_US.csv")       #covid19 confirmed cases
covid_cases["Confirmed_Cases_Sept1_Nov17"] = covid_cases["11/17/20"] - covid_cases["9/1/20"]    #New confirmed cases between September 1, 2020 - November 17, 2020 
keep_cols = ["UID", "Admin2", "Province_State", "Population", "Confirmed_Cases_Sept1_Nov17"]    
covid_cases_cols = covid_cases.columns.values
remove_cols = [c for c in covid_cases_cols if c not in keep_cols]
covid_cases.drop(columns=remove_cols, axis=1, inplace=True)
covid_cases.head()

Unnamed: 0,UID,Admin2,Province_State,Population,Confirmed_Cases_Sept1_Nov17
0,84001001,Autauga,Alabama,55869,14
1,84001003,Baldwin,Alabama,223234,46
2,84001005,Barbour,Alabama,24686,2
3,84001007,Bibb,Alabama,22394,10
4,84001009,Blount,Alabama,57826,23


In [243]:
#extract the state-county UID to join with the intermediary stream
covid_cases["UID_Counties"] = covid_cases["UID"].apply(lambda f: 0 if math.isnan(f) else str(f))
covid_cases["UID_Counties"] = covid_cases["UID_Counties"].apply(lambda f: f[3:] if len(f) == 8 else -1)
covid_cases["UID_Counties"] = covid_cases["UID_Counties"].apply(lambda f: int(f))
covid_cases = covid_cases[covid_cases["UID_Counties"] > -1]
covid_cases.head()

Unnamed: 0,UID,Admin2,Province_State,Population,Confirmed_Cases_Sept1_Nov17,UID_Counties
0,84001001,Autauga,Alabama,55869,14,1001
1,84001003,Baldwin,Alabama,223234,46,1003
2,84001005,Barbour,Alabama,24686,2,1005
3,84001007,Bibb,Alabama,22394,10,1007
4,84001009,Blount,Alabama,57826,23,1009


In [244]:
#combine the state and county FIPS to join with the covid cases dataset 
df_counties = pd.read_csv("Data/County_SchoolDistrict_Intermediary/sdlist-19.csv", encoding='gbk')
df_counties["FIPS"] = df_counties.apply(lambda r: str(r["State FIPS"]) + "0" + str(r["County FIPS"]) if len(str(r["State FIPS"])) == 1 else str(r["State FIPS"]) + str(r["County FIPS"]), axis=1)
df_counties = df_counties.drop(columns=["State FIPS", "County FIPS"], axis=1)

def is_integer(n):
    try:
        float(n)
    except ValueError:
        return False
    else:
        return float(n).is_integer()

#convert to int
df_counties['FIPS'] = df_counties['FIPS'].apply(lambda f: int(f) if is_integer(f) else -1)
df_counties = df_counties[df_counties['FIPS'] > -1]
df_counties.head()

Unnamed: 0,State Postal Code,District ID Number,School District Name,County Names,FIPS
0,AL,190,Alabaster City School District,Shelby County,10117
1,AL,5,Albertville City School District,Marshall County,1095
2,AL,30,Alexander City City School District,Tallapoosa County,10123
3,AL,60,Andalusia City School District,Covington County,1039
4,AL,90,Anniston City School District,Calhoun County,1015


In [246]:
#get the number of confirmed covid19 cases by school district per county 
covid_cases_county = pd.merge(covid_cases, df_counties, how='inner', left_on='UID_Counties', right_on='FIPS')
covid_cases_county.head()

Unnamed: 0,UID,Admin2,Province_State,Population,Confirmed_Cases_Sept1_Nov17,UID_Counties,State Postal Code,District ID Number,School District Name,County Names,FIPS
0,84001011,Bullock,Alabama,10101,6,1011,AL,480,Bullock County School District,Bullock County,1011
1,84001013,Butler,Alabama,19448,5,1013,AL,510,Butler County School District,Butler County,1013
2,84001015,Calhoun,Alabama,113605,45,1015,AL,90,Anniston City School District,Calhoun County,1015
3,84001015,Calhoun,Alabama,113605,45,1015,AL,540,Calhoun County School District,Calhoun County,1015
4,84001015,Calhoun,Alabama,113605,45,1015,AL,1860,Jacksonville City School District,Calhoun County,1015


In [252]:
#get the number of confirmed covid19 cases per school district with their 2020-2021 school year policies 
covid_cases_schools = pd.merge(df_covid, covid_cases_county, how='inner', left_on='DistrictName', right_on='School District Name')
remove_cols = ['SchoolYear', 'DistrictID', 'OpenDate', 'LastVerifiedDate', 'UID', 'Admin2', 'Province_State',
'UID_Counties', 'State Postal Code', 'District ID Number', 'School District Name', 'County Names', 'FIPS']
covid_cases_schools.drop(columns=remove_cols, axis=1, inplace=True)
covid_cases_schools.head()

Unnamed: 0,DistrictNCES,DistrictName,Control,PhysicalCity,PhysicalState,Enrollment,TeachingMethod,SportsParticipation,OnlineInstructionIncrease,NetworkInvestment,HardwareInvestment,StaffMaskPolicy,StudentMaskPolicy,StudentIllnessReturnPolicy,StudentIsolationArea,SchoolTemporaryShutdown,ParentOptOutClassroomTeaching,Population,Confirmed_Cases_Sept1_Nov17
0,200004.0,Yupiit School District,Public,Akiachak,AK,463.0,On Premises,No,No,No,No,Pending,Pending,Pending,Pending,Pending,Pending,18386,0
1,200760.0,Kuspuk School District,Public,Aniak,AK,385.0,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,18386,0
2,200001.0,Lower Kuskokwim School District,Public,Bethel,AK,4310.0,Unknown,Unknown,Unknown,Yes,Yes,Required for all staff,Required for all students,Unknown,Unknown,Unknown,Yes,18386,0
3,200770.0,Denali Borough School District,Public,Healy,AK,950.0,On Premises,No,No,Yes,No,Required for all staff,Required for all students,Yes,Yes,Never closed,Yes,2097,0
4,200520.0,Iditarod Area School District,Public,McGrath,AK,330.0,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,Pending,18386,0
