# Covid19 Impact on K-12 Education
By Megan Nalani Chun  
Last modified November 2020

## Motivation and Problem Statement
The children of today will be the decision makers of tomorrow, and their education will help determine what those decisions will be. The Covid19 pandemic has drastically changed normal daily life for people around the world, and some areas of society have been impacted more than others. This study focuses on K-12 education during Covid19 because the pandemic and the lack of national government guidance have forced school districts to implement varying policies, such as those listed in the data section below. I hope to learn the following:
- How much school did students miss?
- Which factors are associated with a significant difference in the amount of school missed? If there were a similar pandemic, would early implementation of certain policies allow students to have a better continued education?
- Do school districts with more income have more access to the internet? And are these districts more likely to have an online policy?

## Data 
This analysis will be scoped to students enrolled in public and private schools across the United States. The following two datasets will be used because of their completeness, trustworthiness, and accessibility in addition to their contents. The first dataset is the Covid19 K-12 Education Data by MCH Strategic Data and the second is demographic data from the Institute of Education Sciences. Licensing and data source information can be found in the [readme](https://github.com/NalaniKai/data-512-final/blob/main/README.md).

Combining these datasets by school district will enable the above research questions to be explored.

### Covid19 K-12 Education Data by MCH Strategic Data
This dataset has the following information for 12,643 (86%) school districts in the US:
- Enrollment 
- School open date 
- Teaching methods (online, on premise, etc)
- Sports participation 
- Online instruction increase
- Network investment
- Hardware investment
- Staff mask policy
- Student mask policy
- Student illness return policy
- Student isolation area 
- School temporary shutdown 

#### Data source & licensing (also in readme)
Covid19 K-12 Education Data by MCH Strategic Data. Compiled from public federal, state, and local school district information and media updates. Go to the [main page](https://www.mchdata.com/covid19/schoolclosings), scroll down, make a free account, and press the "download list of districts" button. Select subscribe and sign up for a free subscription to download the dataset. No credit card information is needed.

[MCH 2019 Standard Licensing Stipulations and Conditions Agreement](https://www.mchdata.com/about/terms-conditions)

### Demographic Data from the Institute of Education Sciences
The following demographic data was pulled from this dataset for all school districts in the US:
- Race 
- Income
- Computer & internet availability 
- School enrollment

#### Data source & licensing (also in readme)
Education Demographic and Geographic Estimates. National Center for Education Statistics. Institute of Education Sciences.   
- [Dataset: ACS 2014-2018 Profile](https://nces.ed.gov/programs/edge/TableViewer/acsProfile/2018)  
- Geography: All Districts  
- Population: Relevant Children  
- Tables:
    - CDP02.2 SCHOOL ENROLLMENT
    - CDP02.11 COMPUTERS AND INTERNET USE
    - CDP03.2 INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS)
    - CDP05.2 RACE

[Open Data Policy](https://digital.gov/open-data-policy-m-13-13/) 

### Covid19 Cases per State from the CDC
This dataset has the daily Covid19 case counts by state and is continually updated.

#### Data source & licensing (also in readme)
Centers for Disease Control and Prevention: [United States COVID-19 Cases and Deaths by State over Time](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data)

[Public Domain U.S. Government](https://www.usa.gov/government-works)

### Population size per State from the US Census 
This dataset has the population size by state between April 1, 2010 and July 1, 2019. 

#### Data source & licensing (also in readme)
United States Census Bureau: [Population, Population Change, and Estimated Components of Population Change: April 1, 2010 to July 1, 2019 (NST-EST2019-alldata)](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html#par_textimage)

[Public Domain U.S. Government](https://www.usa.gov/government-works)
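
One plausible way to combine the CDC case counts with the Census population sizes is to express cumulative cases as a percent of each state's population. The sketch below assumes that definition; the function name, the two-letter state keys, and all numbers are made up for illustration and are not taken from either dataset.

```python
def percent_of_population(cumulative_cases, population):
    """Cumulative cases as a percent of population (assumed definition)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * cumulative_cases / population


# Hypothetical state-level inputs, keyed by a placeholder state code.
cases_by_state = {"AA": 50_000, "BB": 12_000}
population_by_state = {"AA": 5_000_000, "BB": 600_000}

# Only states present in both sources get a value.
case_percent = {
    state: percent_of_population(cases_by_state[state], population_by_state[state])
    for state in cases_by_state
    if state in population_by_state
}
```

Normalizing by population keeps large states from dominating comparisons that raw case counts would otherwise skew.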

## Unknowns & Dependencies 
There are a couple of dependencies and unknowns that will need to be checked:
- the percent of matches when joining the MCH and Institute of Education Sciences datasets by their school district IDs
- the amount of missing data
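
Both checks can be sketched with plain Python before any real join is attempted. The helper names, the `district_id` and `open_date` field names, and the sample rows below are hypothetical placeholders, not the actual column names or values in either dataset.

```python
def join_coverage(left_ids, right_ids):
    """Summarize how well two ID columns line up before joining on them."""
    left, right = set(left_ids), set(right_ids)
    matched = left & right
    return {
        "matched": len(matched),
        "left_only": len(left - right),
        "right_only": len(right - left),
        "left_match_pct": 100.0 * len(matched) / len(left) if left else 0.0,
    }


def missing_pct(rows, column):
    """Percent of rows where `column` is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return 100.0 * missing / len(rows)


# Made-up rows standing in for the two district-level datasets.
mch_rows = [
    {"district_id": "0100005", "open_date": "2020-08-17"},
    {"district_id": "0100006", "open_date": ""},
    {"district_id": "0100007", "open_date": None},
    {"district_id": "0100008", "open_date": "2020-09-01"},
]
nces_ids = ["0100005", "0100006", "0100009"]

coverage = join_coverage((row["district_id"] for row in mch_rows), nces_ids)
open_date_missing = missing_pct(mch_rows, "open_date")
```

With pandas, `DataFrame.merge(..., how="outer", indicator=True)` gives a similar matched versus unmatched breakdown; the IDs are kept as strings here so that any leading zeros survive the comparison.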


## Research questions and/or hypotheses

The overarching research question for this analysis is: What is the impact of Covid19 on school closures and teaching methods for K-12 students across the US? How does this differ for students based on race, income, and computer/internet access?

Hypotheses that will be explored include:
- School districts that have more students with access to computers and internet are more likely to have online learning teaching methods. 
- Students in poorer school districts with more racial minorities have seen the most school closures. 
- School districts implementing on-premise policies around masks, sports, returning from illness, and isolation have the fewest school closures and the most on-premise learning.
- States with stricter school closures, online learning, and on campus protection policies will have a lower percent of Covid19 cases. 

Data exploration will also take into account the number of students in the school district and the number of Covid19 cases in the state, since both factors also influence school policies and education.

## Background/Related Work

According to the American Society for Microbiology (ASM), the decision for how to structure K-12 learning during Covid19 is the responsibility of each school district or private institution ([ASM article](https://asm.org/Articles/2020/August/A-National-Crisis-K-12-Education-During-the-COVID)). The ASM article also points out that "Centers for Disease Control and Prevention (CDC) data show the 5-17 year old age group has the highest SARS-CoV-2 test positivity rate of any age group," which suggests K-12 students can easily spread Covid19 to others in their communities. Regarding the annual flu, the article notes: "The Director of the CDC, Robert Redfield, has been warning since late April about the specter of a second wave of coronavirus during flu season. A potential driver of this second wave could be K-12 students bringing 1 or both viruses home to their care providers, who are more vulnerable to these pathogens." Since schools are expected to be a hot spot for spreading Covid19, one of the hypotheses my analysis will investigate is whether state Covid19 cases tend to be better or worse depending on measures taken by the school districts. In preparation for school openings, ASM made a list of considerations for school districts to plan for, such as new staff training, school bus policies, policies around student lockers, and additional hand washing stations, many of which will not be in scope for this analysis.

A similar analysis of the Covid19 impact on K-12 students was performed by the Pew Research Center, which collected data via surveys of parents who were asked a series of questions about their children and their education during this time. According to this study, "parents of children attending school fully in person are ... less likely to be concerned about their education," but these parents are also concerned about their children's exposure to Covid19 ([Pew Research Center article](https://www.pewsocialtrends.org/2020/10/29/most-parents-of-k-12-students-learning-online-worry-about-them-falling-behind/)). For a parent, this is an understandable concern, and it would make sense for Covid19 to spread more rapidly with in-person classes. To test this hypothesis, part of my analysis will investigate whether states with more in-person schooling tend to have higher Covid19 case counts. The Pew Research Center study also showed that lower-income parents reported a higher percentage of online-only instruction, and 72% of them stated they were concerned about their children falling behind in school, whereas only 55% of upper-income parents said they were concerned. In response to these findings, my analysis will investigate the hypothesis that lower-income school districts are more likely to have online learning policies. Further, my analysis will combine the MCH Covid19 data and the US government education statistics to determine the percent of students with computer and internet access, especially in school districts with online learning policies.

## Analysis Methodology

This study will leverage data visualizations, correlation calculations, and logistic regression to investigate the research questions and hypotheses stated above. 

The distributions will be plotted and the data summaries will be compared for school district policies, Covid19 cases, and demographics across school districts and states. These plots and summaries will provide insight into data skew and any outliers. 

Correlations and p-values will be calculated between the percent of Covid19 cases per state and each of the in-person school district policies listed below, to determine which policies are most strongly associated with Covid19 cases. The results will also help inform whether implementing certain policies early could be useful at the start of future pandemics. This approach was selected because the results are easy to interpret and plot, and comparing correlation values across policies is straightforward.
- Sports participation
- Staff mask policy
- Student mask policy
- Student illness return policy
- Student isolation area 
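
A minimal sketch of the correlation step, assuming each district-level policy has first been aggregated to a state-level share (for example, the share of districts requiring student masks). The aggregation, the variable names, and every number below are assumptions for illustration. The p-value here comes from a simple permutation test, a stand-in chosen to keep the example dependency-free; it is not necessarily the test the final analysis will use.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: how often shuffled data correlates as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Made-up state-level values: share of districts with a student mask policy
# and percent of the state population with a reported case.
mask_policy_share = [0.9, 0.8, 0.85, 0.7, 0.3, 0.2, 0.4, 0.1]
case_percent = [1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1]

r = pearson_r(mask_policy_share, case_percent)
p_value = permutation_p_value(mask_policy_share, case_percent)
```

With SciPy available, `scipy.stats.pearsonr` returns both the coefficient and a p-value in one call. Either way, a strong correlation at the state level does not by itself show that a policy caused the difference in cases.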

Logistic regression will be used to investigate which of the following attributes are most strongly associated with school closures, online learning, in-person learning, and hardware/network investments. This approach was selected because each of these outcome variables can be expressed as a binary label. Moreover, the fitted coefficients can be compared across features, which will help determine whether demographics play a significant role in the types of learning and investment students receive.
- race
- household income of students
- access to computers/internet 
- percent of Covid19 cases per state
- number of students enrolled
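
To illustrate how fitted coefficients can be compared across features, here is a small hand-rolled logistic regression trained by gradient descent. It is a sketch under stated assumptions: the two binary features (`has_internet`, `low_income`), the outcome (whether a district teaches online), and the counts are all invented, and a real analysis would more likely rely on a statistics library than on this loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_rows, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(features, labels):
            prediction = sigmoid(bias + sum(w * v for w, v in zip(weights, row)))
            error = prediction - label
            grad_b += error
            for j, value in enumerate(row):
                grad_w[j] += error * value
        bias -= learning_rate * grad_b / n_rows
        weights = [w - learning_rate * g / n_rows for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up counts: (has_internet, low_income) -> (districts online, districts total)
toy_counts = {(1, 0): (3, 4), (1, 1): (3, 4), (0, 0): (1, 4), (0, 1): (1, 4)}
features, labels = [], []
for cell, (online, total) in toy_counts.items():
    for i in range(total):
        features.append(list(cell))
        labels.append(1 if i < online else 0)

weights, bias = fit_logistic(features, labels)
# exp(coefficient) is the odds ratio for a one-unit change in that feature.
odds_ratios = [math.exp(w) for w in weights]
```

In this toy data the online share depends only on internet access, so the first odds ratio comes out well above 1 while the second stays close to 1. Comparing odds ratios in this way is what makes the features directly comparable.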

