In [None]:
import numpy as np 
import pandas as pd 
import math
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

plt.rcParams["figure.figsize"] = [8, 6]

## Problem Statement
The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. Concerns about an exacerbated digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

## Challenge
Explore (1) the state of digital learning in 2020 and (2) how the engagement of digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

## Questions
1. What is the picture of digital connectivity and engagement in 2020?
2. What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?
3. How does student engagement with different types of education technology change over the course of the pandemic?
4. How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
5. Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

### Products data

In [None]:
products_df = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
products_df[("Sector(s)")] = products_df[("Sector(s)")].astype('category')
products_df[['Function-1', 'Function-2', 'Function-3']] = products_df['Primary Essential Function'].str.split('-', n=2, expand=True)
products_df[("Primary Essential Function")] = products_df[("Primary Essential Function")].astype('category')

In [None]:
products_df.shape

There are 372 products. The primary function of each product is broken down into three components to capture its broader functions. These are called Function-1, Function-2, and Function-3, and are derived by splitting the Primary Essential Function string on the dash (-).
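As a minimal sketch of that split, the toy frame below uses made-up function strings that are assumed to follow the same `Category - Subcategory - Detail` pattern as the real column. It also strips the padding spaces around each piece, which the notebook's own split does not do.

```python
import pandas as pd

# Illustrative strings only; they mimic the "A - B - C" pattern of the real column.
toy = pd.DataFrame({
    "Primary Essential Function": [
        "LC - Digital Learning Platforms",
        "LC - Sites, Resources & Reference - Games & Simulations",
        "SDO - Data, Analytics & Reporting",
    ]
})

# Split on the first two dashes only, so any later dash stays in the third piece.
parts = toy["Primary Essential Function"].str.split("-", n=2, expand=True)
# Remove the spaces left around each piece ("LC " -> "LC").
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["Function-1", "Function-2", "Function-3"]
print(parts)
```

Rows with fewer than three pieces get a missing value in the unused columns, which is why Function-3 is sparse.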

In [None]:
products_df.describe(include='all')

In [None]:
outsumm = products_df.groupby("Function-1").size().rename('Total').reset_index().sort_values('Total', ascending = False)
outsumm

In [None]:
sns.barplot(x="Function-1", y='Total', data = outsumm)

The largest group of products is LC ('Learning & Curriculum'), followed by CM ('Classroom Management') and then SDO ('School & District Operations').

In [None]:
outsumm = products_df.groupby("Primary Essential Function").size().rename('Total').reset_index().sort_values('Total', ascending = False).head(10)
outsumm

In [None]:
outsumm["Primary Essential Function"] = outsumm["Primary Essential Function"].astype(str)
sns.barplot(y="Primary Essential Function", x='Total',data=outsumm)

The largest number of products is Digital Learning Platforms, followed by Sites, Resources & Reference.

### Districts data

In [None]:
districts_df = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
districts_df[("locale")] = districts_df[("locale")].astype('category')
ax = sns.countplot(x="locale", data=districts_df)

Most of the districts in the data are in suburbs.

In [None]:
districts_df.shape

The data contains 233 districts. However, this is less than two percent of all [public school districts in the United States](https://nces.ed.gov/programs/digest/d20/tables/dt20_214.10.asp?current=yes).

In [None]:
districts_df.describe(include='all')

In [None]:
districts_df.count(axis=0) / districts_df.shape[0]

However, the districts data has a lot of missing values. While state, locale, and pct_black/hispanic each have about 25 percent missing, almost 50 percent of per pupil expenditures are missing. Imputation is a possibility, but there is generally great variation at the school district level within a state, so substituting state-level averages would not be a good idea. Moreover, since the district id is anonymized, we cannot use existing district-level data for imputation.
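The missing shares quoted above are the complement of the non-missing shares printed by `count() / shape[0]`. A small sketch on invented data (the toy frame is an assumption, not the real `districts_df`) shows the direct calculation:

```python
import pandas as pd

# Toy stand-in for districts_df; values are invented for illustration.
toy = pd.DataFrame({
    "state": ["Utah", None, "Ohio", None],
    "pp_total_raw": ["[8000, 10000[", None, None, None],
})

# isna() is True for missing cells; the column mean of booleans is the missing share.
missing_share = toy.isna().mean()
# count() ignores missing cells, so this is the non-missing share.
present_share = toy.count() / len(toy)
print(missing_share)
```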

In [None]:
plot = districts_df.groupby(['state']).size().rename('Count').reset_index().sort_values('Count').plot(kind='barh',x='state')
plot = plot.bar_label(plot.containers[0])

There are many states with a small number of districts. Arizona, Florida, North Dakota and Minnesota only have data on one district.

In [None]:
districts_df.groupby(['state', 'county_connections_ratio']).size()

There is very little variation in the county connections ratio. Every district except one in North Dakota falls in the lower bracket.

In [None]:
plot = districts_df.groupby(['pct_black/hispanic']).size().plot(kind="bar").set(ylabel = 'Count')

Most districts have low percentages of Black and Hispanic students.

In [None]:
plot = districts_df.groupby(['pct_free/reduced']).size().plot(kind="bar").set(ylabel = 'Count')

Most districts have fewer than 50% of students on free or reduced lunch.

In [None]:
plot = districts_df.groupby(['state', 'locale']).size().unstack().plot(kind='bar', stacked=True)

Because most of the districts are in the suburbs and the sample is small, there is very little variation in locale within a state.

In [None]:
from pandas.api.types import CategoricalDtype
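# Any value not listed here becomes NaN on astype; '[20000, 22000[' is included so that bracket is not silently dropped (harmless if it is absent from the data).
pp_categories = CategoricalDtype(categories = ['[4000, 6000[', '[6000, 8000[', '[8000, 10000[', '[10000, 12000[', '[12000, 14000[', '[14000, 16000[', '[16000, 18000[', '[18000, 20000[', '[20000, 22000[', '[22000, 24000[', '[32000, 34000['], ordered=True)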
districts_df['pp_total_raw'] = districts_df.pp_total_raw.astype(pp_categories)

In [None]:
districts_df.groupby(['pp_total_raw']).size().plot(kind="bar")

The most common spending bracket is the middle range of 8,000 - 10,000 per student, although almost half of the data is missing.
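The bars above appear in numeric order only because the column was cast to an ordered categorical. The sketch below, on invented bracket strings, shows how plain string sorting differs and how a bracket that is not declared in the category list turns into a missing value.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented sample of spending brackets, deliberately out of order.
raw = pd.Series(["[8000, 10000[", "[10000, 12000[", "[4000, 6000[", "[20000, 22000["])

# Plain strings sort character by character, so "[10000" comes before "[4000".
lexical = sorted(raw)

# Only three brackets are declared here, in numeric order.
brackets = CategoricalDtype(
    categories=["[4000, 6000[", "[8000, 10000[", "[10000, 12000["], ordered=True
)
ordered = raw.astype(brackets)

# Values missing from the declared categories silently become NaN.
numeric_order = ordered.dropna().sort_values().tolist()
dropped = int(ordered.isna().sum())
print(lexical, numeric_order, dropped)
```

Comparing the set of unique raw values with the declared categories before casting is a cheap way to catch an undeclared bracket.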

In [None]:
plot = districts_df.groupby(['locale','pp_total_raw']).size().unstack().plot(kind='barh', stacked=True)

Per pupil spending in the city, town, and rural is mostly in the 8,000-10,000 range; in suburbs it is in the 14,000-16,000 range.
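Because the locales differ a lot in size, within-locale shares are easier to compare than raw counts. One compact way to get them, sketched here on invented rows, is `pd.crosstab` with `normalize="index"`; it is an alternative to a manual groupby and transform, not the notebook's own code.

```python
import pandas as pd

# Invented locale / spending pairs standing in for districts_df.
toy = pd.DataFrame({
    "locale": ["Suburb", "Suburb", "Suburb", "Suburb", "Rural", "Rural"],
    "pp_total_raw": ["[14000, 16000[", "[14000, 16000[", "[14000, 16000[",
                     "[8000, 10000[", "[8000, 10000[", "[8000, 10000["],
})

# normalize="index" divides each row by its total, giving within-locale shares.
shares = pd.crosstab(toy["locale"], toy["pp_total_raw"], normalize="index")
print(shares)
```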

In [None]:
df = districts_df.groupby(['locale','pp_total_raw']).size().rename('Count').reset_index()
df['pct'] = df.groupby('locale')['Count'].transform(lambda x: x/x.sum())
plot = df[['locale','pp_total_raw','pct']].set_index(['pp_total_raw','locale']).unstack(0).plot(kind='barh', y='pct', stacked=True)
plot = plot.legend(bbox_to_anchor=(1,1), title='Per pupil spending')



### Engagement data

In [None]:
path = '/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
efiles = glob.glob(path + "/*.csv")

elist = []

for filename in efiles:
    df = pd.read_csv(filename, index_col=None, header=0)
    # Take the district id from the file name itself rather than a fixed path depth
    district_id = os.path.splitext(os.path.basename(filename))[0]
    df["district_id"] = district_id
    elist.append(df)
    
engagement_df = pd.concat(elist)
engagement_df = engagement_df.reset_index(drop=True)

In [None]:
engagement_df['time'] = pd.to_datetime(engagement_df['time'])
engagement_df["district_id"] = engagement_df["district_id"].astype(str).astype(int)
engagement_df.describe()

The engagement index shows an extreme value of over 20000 when the median is around 2. Likewise, percent access ranges from 0 to 100 percent with a median of 0.02 percent. This suggests that averages may be biased by extreme values in the data.

In [None]:
# engagement_df['scalePct_access'] = np.sqrt(100*engagement_df['pct_access'])
engagement_df['lnEngagement_index'] = np.log(engagement_df['engagement_index'])

One alternative is to transform the measures of interest, here by taking the natural log of the engagement index.
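To see why a transformation or a median helps, the sketch below uses five invented values with one extreme observation. The numbers are illustrative only and are not taken from the engagement data.

```python
import numpy as np
import pandas as pd

# Invented right-skewed values with a single extreme observation.
engagement = pd.Series([1.0, 2.0, 2.0, 3.0, 20000.0])

mean_raw = engagement.mean()      # pulled far upward by the outlier
median_raw = engagement.median()  # unaffected by the outlier

# The natural log compresses large values. It is undefined at zero, so
# np.log1p (log(1 + x)) is a common alternative when zeros are present.
log_values = np.log(engagement)
geometric_mean = np.exp(log_values.mean())
print(mean_raw, median_raw, geometric_mean)
```

The mean of the logged values, mapped back with `exp`, lands much closer to the median than the raw mean does.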

In [None]:
engagement_df.hist('engagement_index')

In [None]:
engagement_df.hist('lnEngagement_index')

In [None]:
# Count number of zeros in percent access
len(engagement_df.query('pct_access == 0'))

In [None]:
engagement_df['cutPct_access'] = pd.cut(100*engagement_df['pct_access'], 
                                        [0, 0.5, 1, 2, 3, 4, 6, 8, 10, 12, 16, 20, 25, 30, 40, 45, 50, 60, 70, 80, 90, 100], include_lowest=True)
pctAccess = engagement_df.groupby('cutPct_access').size().rename('Counts').reset_index()
pctAccess['Share'] = pctAccess['Counts']/sum(pctAccess['Counts']) * 100
pctAccess

This shows that roughly 33 percent of the binned daily district-product records fall in the lowest access bin, meaning that essentially no students accessed the product on that day.
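When reading a binned table like this, it helps to remember that `include_lowest=True` makes the first bin hold exact zeros together with small positive values. The sketch below, on invented values, counts the lowest bin and the exact zeros separately.

```python
import pandas as pd

# Invented access values, assumed to be on the same scale as the bin edges.
pct = pd.Series([0.0, 0.0, 0.3, 5.0, 50.0])

# With include_lowest=True the first bin is closed on the left, so it holds
# exact zeros as well as positive values up to 0.5.
bins = pd.cut(pct, [0, 0.5, 1, 100], include_lowest=True)
bin_counts = bins.value_counts(sort=False)

# Category code 0 is the first (lowest) bin.
lowest_bin_count = int((bins.cat.codes == 0).sum())

# Share of rows that are exactly zero, computed separately from the bins.
zero_share = (pct == 0).mean()
print(bin_counts, lowest_bin_count, zero_share)
```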

In [None]:
plot = engagement_df.plot.scatter(x="pct_access", y="engagement_index")

While the share of students with at least one page load is positively correlated with engagement as measured by total page loads per 1000 students, it is also possible to have 100 percent access (every student has at least one page load) but very little engagement (few page-load events per student).
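A small worked example makes the difference between the two measures concrete. It assumes the definitions used above (access is the percent of students with at least one page load, engagement is total page loads per 1000 students); the helper function and the student counts are hypothetical.

```python
def access_and_engagement(page_loads_per_student):
    """Return (pct_access, engagement_index) for one product on one day.

    Assumes pct_access is the percent of students with at least one page load
    and engagement_index is total page loads per 1000 students.
    """
    n_students = len(page_loads_per_student)
    active = sum(1 for loads in page_loads_per_student if loads > 0)
    pct_access = 100 * active / n_students
    engagement_index = 1000 * sum(page_loads_per_student) / n_students
    return pct_access, engagement_index

# Every one of 1000 students loads exactly one page: full access, modest engagement.
broad = access_and_engagement([1] * 1000)

# Only 10 of 1000 students use the product, but each loads 500 pages.
narrow = access_and_engagement([500] * 10 + [0] * 990)
print(broad, narrow)
```

In this sketch the narrowly used product has five times the engagement index of the broadly used one despite reaching only 1 percent of students.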

In [None]:
plot = engagement_df.groupby([engagement_df["time"].dt.month, engagement_df["time"].dt.year])["pct_access"].median().plot(kind="line", ylabel='Median pct_access', 
                                                                                                                         xlabel = 'Month, year')

This time series chart shows that, at the median, access peaked just prior to the pandemic in March 2020 and did not recover to its pre-pandemic levels. However, by the fall of 2020 [almost all states had returned to partial or full in-person instruction](https://www.edweek.org/leadership/map-where-are-schools-closed/2020/07), and the low levels of access may indicate that

- fewer students required digital access
- disenchantment with virtual learning and digital resources had set in during the spring of 2020

In [None]:
plot = engagement_df.groupby([engagement_df["time"].dt.month, engagement_df["time"].dt.year])["engagement_index"].median().plot(kind="line", ylabel = 'Median engagement', xlabel = 'Month,year')

This chart shows that engagement, like the share of students with at least one page load, did not recover to its pre-pandemic levels, plausibly for the same reasons outlined above.
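The monthly charts group by month and year as separate keys. An alternative sketch, on invented dates and values, groups by a monthly period instead, which keeps year and month together and sorts chronologically even if the data spanned more than one year.

```python
import pandas as pd

# Invented observations spread over two months.
toy = pd.DataFrame({
    "time": pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-01", "2020-02-15", "2020-02-28"]
    ),
    "pct_access": [1.0, 3.0, 10.0, 20.0, 90.0],
})

# A monthly Period combines year and month into a single sortable key.
monthly_median = toy.groupby(toy["time"].dt.to_period("M"))["pct_access"].median()
print(monthly_median)
```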

### Merge data

#### Product-engagement data

In [None]:
products_engagement_df = pd.merge(products_df, engagement_df, left_on = "LP ID", right_on="lp_id")

In [None]:
products_engagement_df[['pct_access', 'engagement_index', 'lnEngagement_index']].describe()

In [None]:
plot = products_engagement_df.groupby("Function-1")["engagement_index"].median().rename('Median engagement').reset_index().plot.bar(x='Function-1', y='Median engagement')

SDO (school and district operations) had the highest level of engagement. This indicates that, at the median, more page-load activity went to administrative tasks than to learning (LC = Learning & Curriculum).

In [None]:
plot = products_engagement_df.groupby("Function-1")["engagement_index"].mean().rename('Average engagement').reset_index().plot.bar(x='Function-1', y='Average engagement')

Even though averages may be highly biased due to the skewed nature of the data, school and district operations (SDO) also had the highest engagement on average.

In [None]:
plot = products_engagement_df.groupby("Function-1")["pct_access"].median().rename('Median access').reset_index().plot.bar(x='Function-1', y='Median access')

However, measured by percent access at the median, learning and curriculum was greater than school and district operations.

In [None]:
plot = products_engagement_df.groupby("Function-1")["pct_access"].mean().rename('Average access').reset_index().plot.bar(x='Function-1', y='Average access')

Because of the skewed nature of the data, however, the averages tell a different story: learning and curriculum had the lowest average access.

In [None]:
products_engagement_df['Year'] = products_engagement_df["time"].dt.year
products_engagement_df['Month'] = products_engagement_df["time"].dt.month
pdf = products_engagement_df.groupby([products_engagement_df["Function-1"], products_engagement_df['Year'], products_engagement_df['Month']])["engagement_index"].median().rename('Median engagement').reset_index()

In [None]:
plot = pdf.pivot(index = ['Year', 'Month'], columns='Function-1', values='Median engagement').plot(kind='line')

At the median, engagement peaked just prior to the pandemic and never recovered. A large portion of engagement went to school and district operations (SDO) rather than learning (LC). Engagement declined over the pandemic: from April through August, classroom management (CM) showed higher levels of engagement than learning (LC). Encouragingly, as the new school year began in September of 2020, learning showed higher levels of engagement, although school and district operations remained elevated.

In [None]:
pdf = products_engagement_df.groupby([products_engagement_df["Function-1"], products_engagement_df['Year'], products_engagement_df['Month']])["pct_access"].median().rename('Median access').reset_index()
plot = pdf.pivot(index = ['Year', 'Month'], columns='Function-1', values='Median access').plot(kind='line')

Measured by median percent access, however, learning (LC) and school and district operations (SDO) were at fairly similar levels, although neither recovered to its pre-pandemic level.

In [None]:
#zoom = products_engagement_df[products_engagement_df['Product Name'].str.lower().str.contains('zoom')]
#hangouts = products_engagement_df[products_engagement_df['Product Name'].str.lower().str.contains('hangouts')]
#webex = products_engagement_df[products_engagement_df['Product Name'].str.lower().str.contains('webex')]

#### Virtual classroom

We can also look at access and engagement in virtual classrooms. We determine this with product names that contain the words zoom, hangouts, and webex.
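Whether a name "contains" a keyword or "equals" it makes a real difference to which products are selected. The sketch below uses hypothetical product names (the real catalogue may spell them differently) to contrast exact matching with substring matching.

```python
import pandas as pd

# Hypothetical product names; the real catalogue may spell these differently.
names = pd.Series(["Zoom", "Google Hangouts", "Cisco Webex", "Khan Academy"])
keywords = ["zoom", "hangouts", "webex"]

lowered = names.str.lower()
# isin() keeps only names that equal a keyword exactly.
exact = lowered.isin(keywords)
# str.contains() with an alternation pattern keeps names that include any keyword.
partial = lowered.str.contains("|".join(keywords), na=False)
print(names[exact].tolist(), names[partial].tolist())
```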

In [None]:
virtualClass = ['zoom', 'hangouts', 'webex']
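# Substring match so that names containing a keyword are kept, not only exact matches
virtualLessons = products_engagement_df[products_engagement_df['Product Name'].str.lower().str.contains('|'.join(virtualClass), na=False)]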

In [None]:
pdf = virtualLessons.groupby([virtualLessons['Year'], virtualLessons['Month']])["pct_access"].median().rename('Median access').reset_index()
plot = pdf.set_index(['Month', 'Year']).plot(kind='line', y='Median access')

In [None]:
pdf = virtualLessons.groupby([virtualLessons['Year'], virtualLessons['Month']])["engagement_index"].median().rename('Median engagement').reset_index()
plot = pdf.set_index(['Month', 'Year']).plot(kind='line', y='Median engagement')

As with learning products, engagement and access declined for the rest of the spring term. While access increased in the fall of 2020, engagement increased only gradually.

### Conclusion

In terms of digital learning, it appears that a larger proportion of engagement at the median went to school and district operations rather than learning. Moreover, just as digital learning was beginning to gain traction, the pandemic halted all progress as measured by engagement. The higher level of engagement with products associated with school and district operations may be an indication that these products were more difficult to use or required more frequent page loads than products associated with learning and curriculum. One possible explanation for this discrepancy is that learning products did not require many page loads and consisted mainly of students watching the screen instead of interacting with the product.

Based on the above findings, the general sense is that engagement is low: many students spend more time on administrative tasks (measured by SDO products) than on learning (measured by LC products). Even before the pandemic, this struggle was obvious, and although access and engagement began to increase, the trend quickly reversed when the pandemic began and never recovered to its original levels. We can conclude that the future of online learning is not too bright.

#### Districts-engagement data

In [None]:
districts_engagement_df = pd.merge(districts_df, engagement_df, left_on='district_id', right_on='district_id')

In [None]:
districts_engagement_df[['pct_access', 'engagement_index', 'lnEngagement_index']].describe()

The extreme skewness of the engagement data carries over when merged with the districts data. Because of this we use the median as the more appropriate measure for each outcome.

In [None]:
plot = districts_engagement_df.groupby('state')["pct_access"].median().rename('Median access').reset_index().sort_values('Median access').plot.barh(x='state', y='Median access')

In [None]:
plot = districts_engagement_df.groupby('state')["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').plot.barh(x='state', y='Median engagement')

Measured by access and engagement, North Dakota, Arizona, and New York rank highest, although this is not surprising for the first two states, since each has very few districts in the data.
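Since some state medians rest on a single district, it can help to report the number of districts next to each median. A minimal sketch with invented rows (the values and district ids are assumptions) uses named aggregation for this:

```python
import pandas as pd

# Invented rows: one district in North Dakota, two in Utah.
toy = pd.DataFrame({
    "state": ["North Dakota"] * 2 + ["Utah"] * 4,
    "district_id": [1, 1, 2, 2, 3, 3],
    "engagement_index": [50.0, 70.0, 1.0, 2.0, 3.0, 4.0],
})

# Named aggregation returns the median and the number of distinct districts per state.
summary = toy.groupby("state").agg(
    median_engagement=("engagement_index", "median"),
    n_districts=("district_id", "nunique"),
)
print(summary)
```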

In [None]:
plot = districts_engagement_df.groupby('locale')["pct_access"].median().rename('Median access').reset_index().sort_values('Median access').plot.barh(x='locale', y='Median access')

In [None]:
plot = districts_engagement_df.groupby('locale')["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').plot.barh(x='locale', y='Median engagement')

Engagement and access follow the same pattern across locales: highest in rural areas and lowest in cities.

In [None]:
plot = districts_engagement_df.groupby('pct_black/hispanic')["pct_access"].median().rename('Median access').reset_index().sort_values('Median access').plot.barh(x='pct_black/hispanic', y='Median access')

In [None]:
plot = districts_engagement_df.groupby('pct_black/hispanic')["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').plot.barh(x='pct_black/hispanic', y='Median engagement')

Access and engagement are highest in districts with the highest enrollments of Black and Hispanic students, followed by districts with less than 40 percent Black or Hispanic students.

In [None]:
plot = districts_engagement_df.groupby('pct_free/reduced')["pct_access"].median().rename('Median access').reset_index().sort_values('Median access').plot.barh(x='pct_free/reduced', y='Median access')

In [None]:
plot = districts_engagement_df.groupby('pct_free/reduced')["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').plot.barh(x='pct_free/reduced', y='Median engagement')

While access does not vary much across districts with different shares of students on free or reduced lunch, engagement is highest in the poorest and richest districts as measured by share of students on free or reduced lunch.

In [None]:
plot = districts_engagement_df.groupby('county_connections_ratio')["pct_access"].median().rename('Median access').reset_index().sort_values('Median access').plot.barh(x='county_connections_ratio', y='Median access')

In [None]:
plot = districts_engagement_df.groupby('county_connections_ratio')["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').plot.barh(x='county_connections_ratio', y='Median engagement')

Access and engagement are higher when the share of high-speed connections is higher, although this should be interpreted with caution since there is only one district in this category.

In [None]:
plot = districts_engagement_df.groupby('pp_total_raw')["pct_access"].median().rename('Median access').reset_index().sort_values('Median access').plot.barh(x='pp_total_raw', y='Median access')

In [None]:
plot = districts_engagement_df.groupby('pp_total_raw')["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').plot.barh(x='pp_total_raw', y='Median engagement')

Access and engagement are higher in districts where per pupil spending is higher. This is somewhat puzzling, since rural districts, where per pupil spending is mostly in the middle range, had the highest access and engagement.

In [None]:
districts_engagement_df.groupby(['locale', 'pp_total_raw'])["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').dropna()

Diving slightly deeper, we find that engagement is highest in rural districts with high per pupil spending. However, there are very few districts in this category.

In [None]:
districts_engagement_df.groupby(['pct_black/hispanic', 'pp_total_raw'])["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').dropna()

Engagement is highest in districts that spend more than 20,000 per pupil.

In [None]:
pdf = districts_engagement_df.groupby(['pct_black/hispanic', 'pp_total_raw'])["engagement_index"].median().rename('Median engagement').reset_index().sort_values('Median engagement').dropna()

In [None]:
plot = pdf.set_index(['pct_black/hispanic', 'pp_total_raw']).unstack(0).plot(kind='barh', y='Median engagement', subplots=True, 
                                                                            layout = (5,1), figsize=(4,12))

However, engagement does not vary in the same way within districts with different shares of minority students. In districts that have the lowest minority shares, engagement is highest at the lowest and highest levels of spending while in districts with the highest share of minority students, engagement is slightly higher in the district with lower spending.

### Conclusion

1. While there are geographical differences by state in engagement and access, these are confounded by small sample sizes.
2. Only one district has a high county connections ratio. It exhibited greater engagement, but no conclusion can be drawn from a single district about whether better broadband access raises engagement.
3. Rural districts and suburbs have the highest engagement, but this is possibly mediated by per pupil spending. However, per pupil spending data is missing for almost half the districts.
4. Engagement is highest in districts with the highest and lowest shares of minority students and pupils on free/reduced lunch. However, there is no variation at the district level in per pupil spending at the highest level of minority share. Therefore it cannot be concluded that this finding is driven by differences in per pupil spending.

#### Products-districts-engagement data

In [None]:
districts_products_engagement_df = pd.merge(districts_df, products_engagement_df, left_on = "district_id", right_on="district_id")

In [None]:
pdf = districts_products_engagement_df.groupby([districts_products_engagement_df["pct_black/hispanic"], districts_products_engagement_df['Year'], districts_products_engagement_df['Month']])["engagement_index"].median().rename('Median engagement').reset_index()
plot = pdf.pivot(index = ['Year', 'Month'], columns='pct_black/hispanic', values='Median engagement').plot(kind='line')

Over time, engagement was higher in districts with the highest and lowest shares of Black or Hispanic students. Engagement was initially higher for districts with the lowest share of minority students but fell more steeply over the pandemic.

In [None]:
pdf = districts_products_engagement_df.groupby([districts_products_engagement_df["locale"], districts_products_engagement_df['Year'], districts_products_engagement_df['Month']])["engagement_index"].median().rename('Median engagement').reset_index()
plot = pdf.pivot(index = ['Year', 'Month'], columns='locale', values='Median engagement').plot(kind='line')

Engagement declined across all locales with districts in towns and rural areas showing the steepest declines over the pandemic. 

In [None]:
pdf = districts_products_engagement_df.groupby([districts_products_engagement_df["pct_free/reduced"], districts_products_engagement_df['Year'], districts_products_engagement_df['Month']])["engagement_index"].median().rename('Median engagement').reset_index()
plot = pdf.pivot(index = ['Year', 'Month'], columns='pct_free/reduced', values='Median engagement').plot(kind='line')

Districts with the lowest share of students on free or reduced lunch showed the sharpest decline. Engagement remained higher for districts with the highest share of students on free or reduced lunch.

In [None]:
pdf = districts_products_engagement_df.groupby(["pct_black/hispanic", "Function-1"])["engagement_index"].median().rename('Median engagement')
plot = pdf.unstack().plot.bar()

However, districts with the highest proportion of minority students had their highest engagement in school and district operations (SDO) and their lowest in learning (LC). While their engagement in learning was not the lowest among districts with different minority shares, they also devoted a larger proportion of their engagement to classroom management (CM).

In [None]:
pdf = districts_products_engagement_df.groupby(["pct_free/reduced", "Function-1"])["engagement_index"].median().rename('Median engagement')
plot = pdf.unstack().plot.bar()

Likewise, the poorest districts (those with the highest share of students on free or reduced lunch) also devoted the largest amount of engagement to school and district operations (SDO) instead of learning.
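The charts above compare median engagement by function. If the question is what proportion of a group's engagement goes to each function, one possible complement is to sum engagement and normalize within each group. The sketch below does this on invented numbers; it is a different summary from the medians plotted above, and the bracket labels and values are assumptions.

```python
import pandas as pd

# Invented totals for two free/reduced-lunch brackets and two function groups.
toy = pd.DataFrame({
    "pct_free/reduced": ["[0, 0.2[", "[0, 0.2[", "[0.8, 1[", "[0.8, 1["],
    "Function-1": ["LC", "SDO", "LC", "SDO"],
    "engagement_index": [60.0, 40.0, 30.0, 90.0],
})

# Total engagement per bracket and function, one column per function.
totals = (
    toy.groupby(["pct_free/reduced", "Function-1"])["engagement_index"]
    .sum()
    .unstack()
)
# Divide each row by its total so the functions sum to 1 within a bracket.
shares = totals.div(totals.sum(axis=1), axis=0)
print(shares)
```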

In [None]:
pdf = districts_products_engagement_df.groupby(["pp_total_raw", "Function-1"])["engagement_index"].median().rename('Median engagement')
plot = pdf.unstack().plot.bar()

The richest district \[32000,34000\] had higher engagement in learning than in school and district operations.

#### Conclusion

1. Even though districts with a high percentage of minority students and a high percentage of students on free or reduced lunch had the highest levels of engagement, the students spent more time on school and district operations than on learning.
2. Students in the richest district spent more time on learning than on school and district operations.