<center><h1>State of K12 Digital Learning in 2020: Impact of the Pandemic</h1></center>

# Summary

Across the nation, **we have a participation problem**. The percentage of students participating in digital learning has decreased in 30 out of 40 observed subgroups from before the pandemic. Engagement, however, has almost doubled, suggesting a widening gap in digital learning. The most vulnerable groups are minorities or low-income level districts; but we found that the most disadvantaged groups are making above-average improvements. **It is the second-most-disadvantaged groups that may require the most urgent help (at the end of 2020)**. Some districts which lagged in digital learning before the pandemic also face an engagement problem, as they see above-average participation increases but not as much engagement increase. 

In particular, I defined the metrics **Engagement Improvement Index (EII)** and **Participation Improvement Index (PII)** to help gauge the state of digital learning in 2020. EII is the *engagement_index* in fall 2020 divided by the *engagement_index* before the pandemic. **The national EII stands at 1.90**. PII is the *pct_access* in the fall divided by the *pct_access* before the pandemic. **The national PII stands at 0.92**. This suggests that the COVID-19 impact on digital learning is imbalanced: while engagement improved significantly (90%) after the onset of the pandemic, **participation actually decreased by ~8%**. Effective policies and plans are needed to make access to digital learning more equitable. **The national PII is 0.86 excluding Virtual Classroom traffic**, suggesting an even bigger (14%) drop in participation without Virtual Classroom traffic.

EII and PII, together with **EIPP (pre-pandemic engagement index)** and **PIPP (pre-pandemic participation index)**, are the 4 metrics that I use to quantify the state of digital learning. They are calculated for different demographic, socioeconomic, and product usage subgroups. For example, the subgroup with a high free-or-reduced-lunch percentage is the most vulnerable group that requires attention. With the help of decision trees, state is identified as the most important factor affecting engagement and participation. **I thus call for state-level policy accommodations to help bridge the digital gap**.

Not surprisingly, virtual classroom products saw a significant increase in participation and engagement. However, Google Docs remains the most widely used product among schools. If monopoly is not a concern, **I suggest further promoting Google products in schools**. *Seesaw*, an app for portfolio-sharing among students, teachers, and parents, came out as the winner based on EII, with a 2,076 fold increase in usage. **I recommend promoting *Seesaw* in schools**. 

I also looked into how race, free/reduced lunch, and locale affect digital learning engagement and improvement. For pct_black/hispanic, EII is the highest in the \[0.4, 0.6\[ group. **School districts with a considerable hispanic/black student body see a comparable (or slightly bigger) engagement improvement**. The \[0.8, 1.0\[ group, which has the highest hispanic/black percentage, performs well both by engagement improvement (EII) and by absolute engagement. Similar trends were observed for the free/reduced lunch dimension. For locale, Rural, Suburban, Town, and City performed in decreasing order. 

# Methodology

**Define reliable metrics:** I first wanted to define a few reliable metrics that could be used to answer questions related to digital engagement. *Engagement_index* and *pct_access* on their own are inadequate because districts vary in size and other factors; a larger engagement_index value does not necessarily signify better engagement. Engagement change, which measures the change in engagement_index or pct_access for a district, is less dependent on district size and is therefore more reliable. 

**Exclude outliers:** I excluded 34 districts with fewer than 300 days of data, leaving 199. Visual inspection revealed 5 other districts that seemed to have inaccurate/unreliable data (for example, heavy usage before the summer but almost no usage in fall). They were removed from the general analysis, but may be included in more specific analyses. I also looked into two other criteria to define outliers: (1) any school district with <365 days of data; (2) any click not corresponding to a known lp_id in products_info.csv. The first criterion would remove ~35% of districts (159 remaining); the second criterion would remove half of the clicks (22.3M to 11.7M). Both were too restrictive and thus not adopted.
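
The day-count filter can be sketched on a toy frame as follows. The column names `district_id`, `time`, and `engagement_index` match the dataset; the helper name `districts_with_enough_days` and the toy values are assumptions made only for illustration, and the threshold is lowered from 300 to 2 so the toy data can exercise it.

```python
import pandas as pd

def districts_with_enough_days(df, min_days=300):
    """Return ids of districts with at least `min_days` distinct days of non-null engagement."""
    valid = df[df["engagement_index"].notnull()]
    # Count distinct days per district, ignoring how many products were logged each day
    days = valid.groupby("district_id")["time"].nunique()
    return days[days >= min_days].index.tolist()

# Hypothetical data: district 1 has 3 valid days, district 2 has only 1
toy = pd.DataFrame({
    "district_id": [1, 1, 1, 2, 2, 2],
    "time": ["2020-01-01", "2020-01-02", "2020-01-03",
             "2020-01-01", "2020-01-01", "2020-01-02"],
    "engagement_index": [10.0, 12.0, 9.0, 5.0, 7.0, None],
})

kept = districts_with_enough_days(toy, min_days=2)
filtered = toy[toy["district_id"].isin(kept)]
```

Counting distinct days rather than rows matters because each district logs many products per day.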

**Decision Tree for Causal Analysis:** Decision trees are usually used for prediction/classification. I show that they can also be used to identify the attributes that most likely contributed to the variance in EII/PII. 


## Preparation & Data Cleansing

In this phase, I

1. Import libraries, define color palette
2. Read in all engagement data from different districts and merge them
3. Filter out districts with fewer than 300 days of data
4. Filter out 5 outliers after visual inspection
5. Join engagement data with district and product data

In the end, about 10% of the page load records are filtered out, leaving 20.8M records and 194 districts to be analyzed. 

I then plot daily page load over time to reveal 4 periods in 2020 that show distinct engagement characteristics. 

In [None]:
import glob
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.offline as po
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.express as px
import random
import plotly.figure_factory as ff
import gc
import warnings
import matplotlib.image as mpimg

warnings.filterwarnings('ignore') 
init_notebook_mode(connected=True)
number_of_colors = 64
colors = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
             for i in range(number_of_colors)]

def pdf(df): 
    print(np.shape(df))
    print(df.head(5))

d_i = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
p_i = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
PEF1=[]
PEF2=[]
for i in p_i['Primary Essential Function']:
    if(not pd.isnull(i)):
        i1 = i.split("-",1)[0].strip()
        PEF1.append(i1)
    else:
        PEF1.append(np.nan)
        
    if(not pd.isnull(i)):
        i2 = i.split("-",1)[1].strip()
        PEF2.append(i2)
    else:
        PEF2.append(np.nan)
p_i['PEF1']=PEF1
p_i['PEF2']=PEF2

path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/' 
all_files = glob.glob(path + "/*.csv")

# concat the district-wise engagement data from all files
# code excerpt to merge the district-wise data from Ruchi Bhatia - "😷COVID-19 Impact on Digital Learning💻: EDA + W&B"
engagement_data = []

for filename in all_files:
    district_data = pd.read_csv(filename, index_col=None, header=0)
    district_id = filename.split("/")[4].split(".")[0]
    district_data["district_id"] = district_id
    engagement_data.append(district_data)

e_df = pd.concat(engagement_data)
e_df["district_id"] = e_df["district_id"].astype(int)
del engagement_data
gc.collect()

e_by_d = e_df[e_df["engagement_index"].notnull()].groupby(["district_id","time"]).size().reset_index()
e_by_d = e_by_d.groupby(["district_id"]).size().reset_index()
e_by_d.rename(columns={ e_by_d.columns[1]: "days" }, inplace = True)
days_count = e_by_d.groupby(["days"]).size().reset_index()
days_count.rename(columns={days_count.columns[1]: "count" }, inplace = True)
#px.line(days_count, x="days", y="count")
keys=e_by_d[e_by_d["days"]>=300]["district_id"].tolist()
print("Total # of Page Load Records: " + str(np.shape(e_df)[0]))
q_e = e_df[e_df["district_id"].isin(keys)]
print("After filtering out districts with less than 300 days of data, " + str(len(keys)) + " districts left.")
print("Remaining # of Page Load Records: " + str(np.shape(q_e)[0]))
del e_by_d
del e_df
gc.collect()

In [None]:
e_d_f = pd.merge(q_e, d_i,how="left", on="district_id")
del q_e
gc.collect()
e_d_p = pd.merge(e_d_f, p_i, how="left", left_on="lp_id", right_on="LP ID")
del e_d_f
gc.collect()
district_count = e_d_p.groupby(["district_id"]).size().reset_index()

# Remove 5 outliers identified by visual inspection
e = e_d_p[~e_d_p['district_id'].isin([9536, 9007,3692,2870,4808])]  
print("After removing districts 3692, 9536, 9007, 4808, 2870, " + str(np.shape(district_count)[0] - 5) + " districts remain")

# garbage collection to free up space
del e_d_p
gc.collect()

print("After data cleansing, the shape of the engagement data is " + str(np.shape(e)))
#e.head(3)

fig = px.line(e.groupby('time')['engagement_index'].sum(), title='All Districts Everyday Page Load')
fig.show()

# State of Digital Learning

Based on the graph above, I first identified 4 periods in 2020 in which engagement/participation differed significantly from one another. They are:

* **Pre-pandemic**: Jan 05 - Feb 08
* **Spring**: Mar 15 - May 24
* **Summer**: Jun 12 - Aug 09
* **Fall**: Aug 30 - Dec 19
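
As a minimal sketch, the four periods can be expressed as inclusive date ranges and used to label each day. The dictionary name `PERIODS` and the helper `label_period` are hypothetical names introduced here; the notebook's own code selects the periods by column position instead.

```python
import pandas as pd

# Inclusive period boundaries taken from the list of periods; all dates are in 2020
PERIODS = {
    "pre-pandemic": ("2020-01-05", "2020-02-08"),
    "spring": ("2020-03-15", "2020-05-24"),
    "summer": ("2020-06-12", "2020-08-09"),
    "fall": ("2020-08-30", "2020-12-19"),
}

def label_period(dates):
    """Map a Series of dates to period names; days that fall in the gaps stay None."""
    dates = pd.to_datetime(dates)
    labels = pd.Series([None] * len(dates), index=dates.index, dtype="object")
    for name, (start, end) in PERIODS.items():
        mask = dates.between(pd.Timestamp(start), pd.Timestamp(end))
        labels.loc[mask] = name
    return labels

sample = pd.Series(["2020-01-05", "2020-02-20", "2020-04-01", "2020-12-19"])
labels = label_period(sample)
```

With these boundaries the pre-pandemic period spans 35 days and the fall period 112 days, which are the two constants used to normalize the national ratio in `national_metrics`.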

There are gaps between periods, as we want to reliably measure engagement differences between them. We define 4 metrics to measure the improvement in digital learning engagement. 

### $EIPP = Avg(\text{Engagement Index(Pre-pandemic)}) \hspace{0.2in} PIPP = Avg(\text{Pct Access(Pre-pandemic)})$
## $EII = \frac{Engagement Index (Other Period)}{Engagement Index (Pre-pandemic)}\hspace{1.2in} PII = \frac{Pct Access (Other Period)}{Pct Access (Pre-pandemic)}$


EIPP and PIPP are averages of averages. They are needed because districts are of different sizes; given a very large district and a very small district, the small district would be overshadowed by the large district in EII. The numerator in the EII/PII definition could correspond to any of the three periods (spring, summer, or fall). My analysis focuses on the fall period. 
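
To illustrate why district size matters, here is a small sketch with two hypothetical districts (the numbers are invented, not taken from the dataset). It contrasts a pooled ratio of totals with a mean of per-district ratios, the two flavors that `national_metrics` returns as `eii` and `eiia`.

```python
import pandas as pd

# Hypothetical average daily engagement per district (not real data)
toy = pd.DataFrame(
    {"pre_pandemic": [1000.0, 10.0], "fall": [1500.0, 40.0]},
    index=["large_district", "small_district"],
)

# Ratio of totals: dominated by the large district
eii_pooled = toy["fall"].sum() / toy["pre_pandemic"].sum()

# Mean of per-district ratios: each district counts equally
eii_per_district = toy["fall"] / toy["pre_pandemic"]
eii_mean_of_ratios = eii_per_district.mean()

print(round(eii_pooled, 3), eii_mean_of_ratios)
```

The pooled value stays close to the large district's ratio of 1.5, while the mean of ratios is pulled toward the small district's ratio of 4.0.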

The following code section calculates the 4 metrics for various subgroups (districts, states, locale, percent hispanic/black, population, etc.)

## National EII & PII

We first calculate EII and PII for the whole nation. EII = 1.90, suggesting a 90% engagement increase, i.e., **90% usage increase across America**. PII = 0.92, suggesting an **8% decrease in participation**.

In [None]:
winter=range(5,41)
spring=range(75,145)
summer=range(150,225)
fall=range(243,355)

# prepare time-series data, supports top-N
def time_series(df,idx,col,val,start,howmany):
    # get a list of districts
    district_list = df[[col,val]].groupby([col])[val].mean().sort_values(ascending=False)
    if start == -1:
        district_list = district_list.index[0:].tolist()
    else:
        district_list = district_list.index[start:(start+howmany)].tolist()
    
    df = df[df[col].isin(district_list)].reset_index(drop=True)[[idx, col, val]]
    df = df.pivot_table(index=idx, columns=col, values=val)
    return df

def national_metrics(df, desc):
    df=df.assign( avg1=df.iloc[:,winter].mean(axis=1),avg2=df.iloc[:,spring].mean(axis=1),avg3=df.iloc[:,summer].mean(axis=1),avg4=df.iloc[:,fall].mean(axis=1),
                  sum1=df.iloc[:,winter].sum(axis=1),sum2=df.iloc[:,spring].sum(axis=1),sum3=df.iloc[:,summer].sum(axis=1),sum4=df.iloc[:,fall].sum(axis=1)  )
    eii = df["sum4"].sum() / df["sum1"].sum() / 112 * 35 # national engagement ratio
    avgs=df[df.columns[366:370]]
    avgs["av4_s"]=df["avg4"]/df["avg1"]
    eiia = avgs["av4_s"].mean()
    #avgs.to_csv("pattern.csv")
    return eii, eiia

def prepare_time_series(df):
    e_by_d=time_series(df, "district_id", "time", "engagement_index", -1, 5)
    p_by_d=time_series(df, "district_id", "time","pct_access", -1, 5)
    e_by_p=time_series(df, "lp_id","time","engagement_index", -1, 5)
    p_by_p=time_series(df, "lp_id","time","pct_access", -1, 5)
    return e_by_d, p_by_d, e_by_p, p_by_p 
    
e_by_d, p_by_d, e_by_p, p_by_p = prepare_time_series(e)
#np.shape(e_by_p)
#e_by_p.to_csv("engagement_product_summary.csv")
#p_by_p.to_csv("pct_access_product_summary.csv")

eii, eiia=national_metrics(e_by_d, " by engagement_index")
pii, piia = national_metrics(p_by_d, " by pct_access")
print("National EII: " + str(eii) + "\t\tNational PII:" + str(pii))
#print("National EII Average: " + str(eiia) + "\t\tNational PII Average:" + str(piia))

## Participation/Engagement by Various District Attributes

Next, I calculate EII, PII, EIPP and PIPP for various district attributes. I am able to put subgroups from different dimensions (eg: state, locale, pct_black/hispanic) into the same table/graph for comparison. 

It seems that **county_connections_ratio values are either \[0.18, 1\[ or NA, except for 1 district**. county_connections_ratio is therefore excluded from further study. However, it might play an important role in digital learning, so collecting more data is desirable.

In [None]:
# Expected df: district_merged (193, 373)
def calc_eipp_pipp(df, by, period):
    eng_by = df.groupby([by]).mean().reset_index()      # Group by across rows, take mean so it is comparable
    return eng_by.iloc[:,period].mean(axis=1)           # Take mean across days 

def calc_metrics_grouped(df, by, filename, include_district=False):
    eng_by = df.groupby([by]).mean().reset_index()
    eng_by=eng_by.assign( avg1=eng_by.iloc[:,winter].mean(axis=1),avg2=eng_by.iloc[:,spring].mean(axis=1),avg3=eng_by.iloc[:,summer].mean(axis=1),avg4=eng_by.iloc[:,fall].mean(axis=1) )
    avgs=eng_by[[by, "avg1", "avg2", "avg3", "avg4"]]
    avgs=avgs.assign(er=avgs["avg4"] / avgs["avg1"]) # districts engagement ratio
    
    avgs = avgs.assign(ipp=calc_eipp_pipp(df, by, spring))
    df=df.assign( avg1=df.iloc[:,winter].mean(axis=1),avg2=df.iloc[:,spring].mean(axis=1),avg3=df.iloc[:,summer].mean(axis=1),avg4=df.iloc[:,fall].mean(axis=1))
    
    if include_district == True:
        avgs2=df[[by,"district_id", "avg1", "avg2", "avg3", "avg4"]]
    else:
        avgs2=df[[by, "avg1", "avg2", "avg3", "avg4"]]
    avgs2=avgs2.assign(dera=avgs2["avg4"] / avgs2["avg1"]) # district engagement ratio
    if include_district == True:
        avgs2.to_csv(filename)
    avgs2 = avgs2.groupby([by]).mean().reset_index()

    avgs=avgs.assign(era=avgs2["dera"]) 
    return avgs[[by, "er", "ipp"]]

# Wrapper to calculate 4 metrics using engagement index and pct access input
def calc_er_eras_grouped(df, df_pct, by, include_district=False):      # calculate er/ipp from both engagement_index and pct_access
    e_by_x = calc_metrics_grouped(df, by, "district_eng_eii.csv", include_district)
    learn_by_x = e_by_x.rename(columns={by: "Segment", "er": "EII", "ipp": "EIPP"})
        
    p_by_x = calc_metrics_grouped(df_pct, by, "district_pct_eii.csv", include_district)
    learn_by_x = learn_by_x.assign(PII=p_by_x["er"], PIPP=p_by_x["ipp"])
    return learn_by_x

def plot_df(df, w, h, with_p, log_scale=False, title=""):
    plt.figure(figsize=(w,h))
    sns.lineplot(data=df, x='Segment', y="EII")
    sns.lineplot(data=df, x='Segment', y="PII")
    plt.legend(labels=['EII', 'PII'])
    if with_p == True:
        #sns.lineplot(data=df, x='Segment', y="EIPP")
        sns.lineplot(data=df, x='Segment', y="PIPP")
        plt.legend(labels=['EII', 'PII', 'PIPP'])
    plt.axhline(y=1.0, color='gray', linestyle='--')
    plt.xticks(rotation=90)
    if log_scale==True:
        plt.yscale('log')
    plt.title(title, x=0.4, y=0.9)
    plt.show()

def dlbyd(df):    # digital learning metrics by district attributes
    dl_by_x = pd.DataFrame([], columns = ['Segment', "EIPP", "EII", "PIPP", "PII"])

    #e_by_d.to_csv("engagement_summary.csv")
    #p_by_d.to_csv("pct_access_summary.csv")

    e_d_f = pd.merge(df, d_i,how="left", on="district_id")
    p_by_d_f = pd.merge(p_by_d, d_i,how="left", on="district_id")

    dl_by_state = calc_er_eras_grouped(e_d_f, p_by_d_f, "state", True)
    dl_by_x = pd.concat([dl_by_x, dl_by_state.sort_values(by = ['EII'], ascending = False)], ignore_index=True)

    dl_by_locale = calc_er_eras_grouped(e_d_f, p_by_d_f, "locale")
    dl_by_x = pd.concat([dl_by_x, dl_by_locale], ignore_index=True)

    dl_by_race = calc_er_eras_grouped(e_d_f, p_by_d_f, "pct_black/hispanic")
    dl_by_race["Segment"] = "PBH:" + dl_by_race["Segment"]
    dl_by_x = pd.concat([dl_by_x, dl_by_race], ignore_index=True)

    dl_by_pp = calc_er_eras_grouped(e_d_f, p_by_d_f, "pp_total_raw")
    dl_by_pp["Segment"] = "PP:" + dl_by_pp["Segment"]
    dl_by_x = pd.concat([dl_by_x, dl_by_pp.reindex([8,9,10,0,1,2,3,4,5,6,7])], ignore_index=True)

    dl_by_free = calc_er_eras_grouped(e_d_f, p_by_d_f, "pct_free/reduced")
    dl_by_free["Segment"] = "Lunch:" + dl_by_free["Segment"]
    dl_by_x = pd.concat([dl_by_x, dl_by_free], ignore_index=True)
    return dl_by_x

dl_by_x = dlbyd(e_by_d)

#dl_by_ccr = calc_er_eras_grouped(e_d_f, p_by_d_f, "county_connections_ratio")
#dl_by_x = dl_by_x.append(dl_by_ccr, ignore_index=True)

print(dl_by_x)
#print(dl_by_x.iloc[:, [0,1,3,2,4]])

In [None]:
#dl_by_x.to_csv("eii.csv")
plot_df(dl_by_x, 15, 8, True, False, "EII/PII/PIPP for Different Segments")

## Analyzing Subgroup Performance

Next, we plot three bubble graphs with the 4 metrics we just calculated. 

* The first chart puts PIPP and EIPP on the x and y axes, respectively. We can see high correlations between the two metrics, which is not too surprising; engagement and participation are correlated. **Despite the low start, groups with low engagement/participation before the pandemic do not see larger percentage improvements.**

* The second chart puts EIPP and EII on the x and y axes, respectively. Three sub-regions are analyzed:
* 1. **EIPP > 350**: Two states (New York and New Hampshire) and 3 high expenditure groups (20-22k, 22k-24k, 32k-34k) had a high EIPP prior to the pandemic. Current EII is also on the higher end, except for the 22k-24k expenditure group. This shows that **richer areas were more digitally engaged pre-pandemic, and saw relatively high percentage improvements during the pandemic even with a larger baseline.**  
* 2. **200<EIPP<350**: 35% of school districts belong to this group. They are mid-range districts financially, and their engagement improvements are close to the national mean (1.9). 
* 3. **0<EIPP<200**: These are districts that lagged behind in digital learning even before the pandemic. The impact of distance learning varies significantly for this group. Some see a larger percentage increase in engagement (state=Florida, locale=city, percent_bh=\[0.4, 0.6\[, lunch free/reduced=\[0.6, 0.8\[), though this may be due to a lower baseline. The most worrisome observation is that, **among districts lagging behind before the pandemic, more than half see below-average improvements during the pandemic**. 
* 4. **The Most Needy Group**: North Carolina, Tennessee, Washington, Virginia, and Missouri are the 5 states that scored the worst before the pandemic. Their engagement improvement (EII) also scores poorly. **Across all subgroups covering several dimensions, the subgroup that needs the most lunch help (pct_free/reduced=\[0.8, 1.0\[) fares the worst**. They will need a lot of support.
* 5. **Other low improvers:** Other subgroups that see low improvements in EII are Utah, Ohio, locale=town, and pp_total_raw=\[6000, 8000\[. It is surprising that **schools in towns (not rural or city) lag behind in digital improvement**. While it is understandable that **the second-lowest expenditure subgroup performed badly both pre-pandemic and in fall**, it is worth noting that the lowest expenditure group (pp_total_raw=\[4000, 6000\[) performed decently. In addition, districts with a high black/hispanic population, which lagged behind pre-pandemic, are seeing decent (above-average) improvements. 

My findings may not be totally reliable for a few reasons:
* 1. **Sample size bias**. Some subgroups have a small number of samples.
* 2. **Sample bias**. The installation rate for LearnPlatform may not be the same across districts. In a high-engagement, low-installation district, participation/engagement could be underestimated. 
* 3. **District size bias**. Some districts are large, some are small. The pct_access based metrics may be more reliable against this type of bias. 
* 4. **Baseline bias**. Subgroups with little participation/engagement before the pandemic may see a large percentage change even though the absolute change is small. I take special care not to draw definitive conclusions when the baseline is small. 
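
A short numeric sketch of baseline bias, using invented values rather than figures from the dataset: a subgroup starting from a tiny baseline can post a far larger ratio than a subgroup whose absolute gain is many times greater. The helper name `relative_and_absolute_change` is hypothetical.

```python
def relative_and_absolute_change(before, after):
    """Return (ratio, difference) between two period averages."""
    return after / before, after - before

# Hypothetical low-baseline subgroup: large ratio, small absolute gain
low_ratio, low_abs = relative_and_absolute_change(before=2.0, after=8.0)

# Hypothetical high-baseline subgroup: modest ratio, large absolute gain
high_ratio, high_abs = relative_and_absolute_change(before=400.0, after=600.0)

print(low_ratio, low_abs)    # ratio 4.0, gain 6.0
print(high_ratio, high_abs)  # ratio 1.5, gain 200.0
```

Reading a ratio such as EII or PII alongside its pre-pandemic baseline (EIPP or PIPP) is one way to keep the two cases apart.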

Despite these biases, I feel fairly confident about most of the findings, as they are confirmed in the third figure below, where I plot PIPP and PII together. 

In [None]:
fig = px.scatter(dl_by_x, x="PIPP", y="EIPP",size="EII", color="PII",hover_name="Segment", log_x=False)
fig.show()
fig = px.scatter(dl_by_x, x="EIPP", y="EII",size="PIPP", color="PII",hover_name="Segment", log_x=False)
fig.show()
fig = px.scatter(dl_by_x, x="PIPP", y="PII",size="EIPP", color="EII",hover_name="Segment", log_x=False)
fig.show()

## We have a Participation Problem!


The 3rd plot above offers great insight into national/state digital learning participation. **Out of 40 subgroups we looked into, 30 have seen participation decrease.** It does not seem that the *most* disadvantaged group performs the worst:

* 1. One of the most disadvantaged groups, pp_total_raw=\[4000-6000\[, has a rather high PIPP (top 20%) and achieved a 20% increase in participation. 
* 2. The next group in spending, pp_total_raw=\[6000-8000\[, fared poorly, seeing a 25% drop in participation.
* 3. The vast majority black/hispanic group, pct_black/hispanic=\[0.8-1\[, saw a participation increase of 10%. 
* 4. The next black/hispanic group, pct_black/hispanic=\[0.6-0.8\[, saw a drop of 14% in participation. 

I also observed that

* 1. **Among all locales, schools in towns perform meaningfully worse than the others**.  
* 2. The "most needy group", where 80\%-100\% of students are on free/reduced lunch, shows an above-average participation improvement. However, they have an engagement problem, as they show low engagement improvement. **This might suggest that they were introduced to digital learning during the pandemic, but did not use it much**. 
* 3. Tennessee and North Carolina seem to suffer similar engagement problems. Starting with a low participation/engagement, they were able to improve participation significantly, but are subpar on engagement improvement. Note that engagement improvement is measured on a relative basis, so there is no baseline bias here. 

# Engagement by Products 

## Engagement by Primary Essential Function (PEF) 

I now extend my analysis to product-level. For Primary Essential Functions (PEF), at a high level,

* CM is the area that sees the biggest participation/engagement boost, with EII = 7.5 and PII = 3.5. Note that Zoom belongs to this category ("CM - Virtual Classroom - Video Conferencing & Screen Sharing").
* The other 3 categories see usage approximately double. 
* LC and SDO see a participation drop of ~8% each, whereas LC/CM/SDO sees a bigger drop (22%).  
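
The two levels of a Primary Essential Function label can be separated at the first hyphen. The sketch below is a vectorized variant of the loop in the data preparation cell; the two category strings are ones quoted in this notebook, while the toy frame `products` is an assumption standing in for `products_info.csv`.

```python
import pandas as pd

# Toy stand-in for products_info.csv, including one missing label
products = pd.DataFrame({
    "Primary Essential Function": [
        "CM - Virtual Classroom - Video Conferencing & Screen Sharing",
        "LC - Sites, Resources & Reference - Streaming",
        None,
    ]
})

# Split once at the first hyphen: top-level code on the left, sub-category on the right
parts = products["Primary Essential Function"].str.split("-", n=1, expand=True)
products["PEF1"] = parts[0].str.strip()
products["PEF2"] = parts[1].str.strip()
```

Splitting only once keeps hyphens inside the sub-category intact, and rows with a missing label stay missing in both new columns.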

Digging in to the next level in PEF, here are my major findings:

* On Primary Essential Function, "CM-Virtual Classroom  - Video Conferencing and Screensharing" garnered the largest traffic increase, with a 352-fold increase in PII and a 334-fold increase in EII. 
* The category that garnered the second-largest traffic increase is "LC - Sites, Resources & Reference - Streaming", with an EII of 19 and a PII of 6.4.
* Every other category saw at most a single-digit-fold increase in usage or participation.
* "CM - Classroom Management Classroom Management" and "SDO - Environmental, Health and Safety" both have an increase in participation and engagement of 2-to-3-fold.
* "SDO - Admissions, Enrollment & Rostering" sees the most significant decrease (95%) in usage. "LC Study Tools - Test Prep & Study Skills" sees a drop of ~65%.

The top 6 product categories before the pandemic fare decently. None has seen a steep drop. 

1. **SDO - LMS, LC - Online Course Providers & Technical Skills Development, LC-Content Creation & Curation, CM- Assessment & Classroom Response**: These 4 categories see similar level of participation before and during pandemic and see 2-3 times increase in engagement.
2. **SDO - School Management - SSO**. Participation and engagement stay at the same level as before.
3. **CM-Virtual Classroom  - Video Conferencing and Screensharing**. There is a 352-fold increase in PII and 334-fold increase in EII.

In [None]:
dl_by_x = pd.DataFrame([], columns = ['Segment', 'EII', "EIPP", "PII", "PIPP"])

e_p_f = pd.merge(e_by_p, p_i,how="left", left_on="lp_id", right_on="LP ID")
p_p_f = pd.merge(p_by_p, p_i,how="left", left_on="lp_id", right_on="LP ID")

dl_by_pef = calc_er_eras_grouped(e_p_f, p_p_f, "PEF1").sort_values(by = ['EII'], ascending = False)
dl_by_x = pd.concat([dl_by_x, dl_by_pef], ignore_index=True)
plot_df(dl_by_pef, 15, 4, False, False, "EII By Primary Essential Functions Category")

dl_by_ppf = calc_er_eras_grouped(e_p_f, p_p_f, "Primary Essential Function").sort_values(by = ['EII'], ascending = False)
#dl_by_ppf["Segment"] = dl_by_ppf["Segment"].str.slice(0,20)
plot_df(dl_by_ppf, 15, 6, False, True, "EII By Primary Essential Functions")
dl_by_x = pd.concat([dl_by_x, dl_by_ppf], ignore_index=True)

In [None]:
dl_by_p = calc_er_eras_grouped(e_p_f, p_p_f, "Product Name").sort_values(by = ['EII'], ascending = False).dropna()
dl_by_p = dl_by_p.iloc[np.r_[0:20, -20:0]]
dl_by_x = pd.concat([dl_by_x, dl_by_p], ignore_index=True)

fig = px.scatter(dl_by_p, x="EIPP", y="EII",size="PIPP", color="PII",hover_name="Segment", log_x=True, log_y=True)
fig.show()
#print(dl_by_p)
plot_df(dl_by_p, 15, 6, False, True, "EII/PII By Product, Top-20 and Bottom 20")

#print(dl_by_x.iloc[:, [0,1,3,2,4]])
dl_by_x.to_csv("eii_pef.csv")
# pdf(dl_by_x)

## Decision Tree Analysis

Decision trees are popularly used for prediction/classification. Here we leverage decision trees to determine what factors affect the observed EII/PII behaviors. 

I built two decision tree models (in R using rpart) to predict district level EII and PII, respectively. Since the number of samples is low, I did not partition the dataset into train/test sets, but just let the decision tree algorithm find the best partition based on all district data (194 total). Some observations:

1. The decision tree algorithm identifies **state as the most important attribute for predicting district-level EII/PII**. In both models the first- and second-level splits are based on state.
2. pp_total_raw, pct_black/hispanic, and pct_free/reduced all play a role in some subgroups after partitioning by state. 

The fact that state plays the most important role suggests that **government policy plays the primary role in affecting digital learning participation/engagement**. 
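
The idea of ranking attributes by how much variance a split removes can be sketched in a few lines. This is not the rpart model used for the analysis: it is a simplified, one-level illustration that scores a single multiway split per attribute (rpart builds binary splits recursively), and the toy data, the attribute values, and the helper name `variance_explained` are all assumptions.

```python
import pandas as pd

def variance_explained(df, attribute, target):
    """Share of the target's variance removed by one multiway split on `attribute`."""
    total_ss = ((df[target] - df[target].mean()) ** 2).sum()
    group_means = df.groupby(attribute)[target].transform("mean")
    within_ss = ((df[target] - group_means) ** 2).sum()
    return 1.0 - within_ss / total_ss

# Hypothetical district-level table in which EII differs mainly by state
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "C", "C"],
    "locale": ["City", "Rural", "City", "Rural", "City", "Rural"],
    "EII": [1.0, 1.2, 2.0, 2.2, 3.0, 3.2],
})

scores = {col: variance_explained(toy, col, "EII") for col in ["state", "locale"]}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy table `state` is chosen as the first split because grouping by it leaves almost no variance within groups. Note that an attribute with many levels, such as state, tends to score well on this kind of measure for small samples.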

In [None]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

img = mpimg.imread('../input/eii-pii-decision-tree/EII_Decision_Tree.png')
plt.figure(figsize=(15, 12))
plt.imshow(img)
plt.show()

img = mpimg.imread('../input/eii-pii-decision-tree/PII_Decision_Tree.png')
plt.figure(figsize=(15, 12))
plt.imshow(img)
plt.show()

# Digital Learning Improvements: Are they Sustainable? 

## Impact of Virtual Classroom
During the pandemic, Virtual Classrooms (VC) became the norm for most school districts. Now that many schools have reopened, we expect traffic from VCs to drop significantly. How much of the traffic increase was due to Virtual Classrooms?

I recalculated PII/EII, this time excluding traffic from Virtual Classroom PEF. Here are the findings:

1. National EII goes down from 1.9 to 1.83; National PII goes down from 0.92 to 0.86. 
2. For each subgroup, the EII/PII decrease is moderate. The biggest drop is in Michigan, which goes down from 3.4 to 2.8, a 19% drop. Other states see an EII change < 8%.
3. VC traffic accounts for < 10% of overall traffic, and most of the earlier findings remain qualitatively intact. 

In short, **Virtual Classroom traffic does not change my findings qualitatively**. Next, I look at the impact of Google Docs. 

In [None]:
print(np.shape(e))
e = e[~e["PEF2"].str.contains("Virtual Classroom", na=False)]
print(np.shape(e))
del e_by_d
del p_by_d
del e_by_p
del p_by_p
gc.collect()

e_by_d, p_by_d, e_by_p, p_by_p = prepare_time_series(e)

eii, eiia=national_metrics(e_by_d, " by engagement_index")
pii, piia = national_metrics(p_by_d, " by pct_access")
print("National EII: " + str(eii) + "\t\tNational PII:" + str(pii)+ "\n")

dl_by_x = dlbyd(e_by_d)
print(dl_by_x)

## Impact of Google Docs

I then removed Google Docs from the engagement dataset. **Without Google Docs, national EII drops to 1.53 (vs. 1.83), with PII almost unchanged at 0.86 (vs. 0.87)**. 

We notice that several popular products are all made by Google, namely: Google Docs, Google Sheets, Google Drawings, Jamboard, etc, so the impact of Google's product usage on digital learning is even bigger than measured. 

If monopoly is not a concern, I would suggest that **the government and Google team up to promote Google products to school districts**. Both have tremendous power, and a joint task force may help reduce the digital gap efficiently. 

In [None]:
e = e[~e["Product Name"].str.contains("Google Docs", na=False)]
print(np.shape(e))
del e_by_d
del p_by_d
del e_by_p
del p_by_p
gc.collect()

e_by_d, p_by_d, e_by_p, p_by_p = prepare_time_series(e)

eii, eiia=national_metrics(e_by_d, " by engagement_index")
pii, piia = national_metrics(p_by_d, " by pct_access")
print("National EII: " + str(eii) + "\t\tNational PII:" + str(pii)+ "\n")

dl_by_x = dlbyd(e_by_d)
print(dl_by_x)

## Seesaw: The Learning Journey

A product called "Seesaw" garnered the highest EII (2,076) and is worth further investigation. Among the top-10 EII performers, Seesaw is one of two that are not considered Virtual Classroom technology (the other one is Microsoft 365). 

> **EIPP: 1.48   
EII: 2,076       
PIPP: 0.04       
PII: 157**

According to the company, **Seesaw is a "student-driven digital portfolio that empowers students to independently document what they are learning at school"**. Students can create their portfolio and share it with teachers. Since the product helps build bonds between teachers and students, and given its success during the pandemic, **I strongly recommend promoting *Seesaw: The Learning Journey* among school districts**. 

# Digital Gap 

I have observed that the *most* disadvantaged groups are not performing the worst. Now let's dig a level down to see if this finding is robust, or if it may be due to some other reasons (like outliers, data collection glitches, not enough sample, etc.) 

## Hispanic/Black Percentage

I found that

* Before the pandemic, engagement for the highest black/hispanic percentage group (\[0.8-1\[) and the highest non-black/hispanic group (\[0-0.1\[) is comparable. 
* From the beginning of the pandemic to mid-March, which is when schools started to close, engagement ramped up for both groups by a similar magnitude.
* Then, as schools closed, engagement dropped significantly between March 15 and April 15. The drop is much larger than for any other group, suggesting that they were less prepared for the sudden onset of the pandemic. 
* Engagement rebounded before summer to about the average level. 
* When fall came, engagement jumped to higher levels than in any other group. 

I have carefully inspected the districts that are mainly black/hispanic (\[0.8-1.0\[). I tried to remove outliers, using different criteria. No matter what I change, it seems that **the finding that the most disadvantaged group performed at least average, if not better, is sound and robust**. 

Excluding the \[0.8-1.0\[ group, the performance of the rest of the groups is as expected: districts with more black/hispanic students (by percentage) perform worse. **Therefore, the second-highest black/hispanic group needs more attention and help.**

My hypothesis for this observation is: **policy makers, private businesses, and even the most disadvantaged schools themselves have focused more effort/resources on digital learning, even before the pandemic**. These schools received more help during the pandemic than other groups, such that after a steep drop, their engagement bounced back to top all other groups. **It is time to give more attention to the next most disadvantaged group**. 

In [None]:
import datetime

import plotly.graph_objects as go  # provides go.Figure and go.Scatter used below

palette_darkgrey = "#383C45" 
palette_silver = "#A2A5A9" 
palette_green = "#5BAD27" 
palette_blue = "#278BD3" 
palette_platinum = "#E3E4E5"
palette_grey2 = "#676A6C" 
palette_grey3 = "#959894" 
palette_grey4 = "#C4C5BB"

def annotation_helper(fig, texts, x, y, line_spacing, align="left", bgcolor="rgba(0,0,0,0)", borderpad=4, ref="axes", width=100):
    """Stack `texts` as annotations starting at (x, y), moving down by `line_spacing` after each line."""

    is_line_spacing_list = isinstance(line_spacing, list)
    total_spacing = 0
    current_line_spacing = 0

    for index, text in enumerate(texts):
        # Use the per-line spacing while entries remain; afterwards reuse the last value.
        if is_line_spacing_list and index < len(line_spacing):
            current_line_spacing = line_spacing[index]
        elif not is_line_spacing_list:
            current_line_spacing = line_spacing

        fig.add_annotation(dict(
            x= x,
            y= y - total_spacing,
            width = width,
            showarrow=False,
            text= text,
            bgcolor= bgcolor,
            align= align,
            borderpad=borderpad,
            # "axes" positions in data coordinates; anything else uses paper coordinates
            xref= "x" if ref=="axes" else "paper",
            yref= "y" if ref=="axes" else "paper"
        ))

        total_spacing  += current_line_spacing

def power_draw(df, by):
    e_hb = df.groupby([by,"time"])["engagement_index"].mean().reset_index()

    window = 14  # rolling-average window (labelled "14 day average" on the chart)

    layout = dict(
        margin = dict(t=150),
        xaxis = dict(showline=True, linewidth=1, linecolor=palette_darkgrey, dtick="M1",tickformat="%b\n%Y"),
        yaxis = dict(showline=False, showgrid=True, gridwidth=1, gridcolor='#ddd', linecolor=palette_darkgrey),
        showlegend = False,
        width = 900,
        height = 550,
        plot_bgcolor= "#fff",
        hoverlabel=dict(
            bgcolor="white",
            font_size=12
        )
    )

    fig = go.Figure(layout=layout)

    engagement_02_04= e_hb[e_hb[by]=="[0.2, 0.4["]
    fig.add_trace(go.Scatter(
                        x=engagement_02_04["time"], 
                        y= engagement_02_04["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_grey4, width=1.5),
                        name='[0.2,0.4]'))

    engagement_04_06= e_hb[e_hb[by]=="[0.4, 0.6["]
    fig.add_trace(go.Scatter(
                        x=engagement_04_06["time"], 
                        y= engagement_04_06["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_grey3, width=1.5),
                        name='[0.4,0.6]'))

    engagement_06_08= e_hb[e_hb[by]=="[0.6, 0.8["]
    fig.add_trace(go.Scatter(
                        x=engagement_06_08["time"], 
                        y= engagement_06_08["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_grey2, width=1.5),
                        name='[0.6,0.8]'))

    engagement_08_10= e_hb[e_hb[by]=="[0.8, 1["]
    fig.add_trace(go.Scatter(
                        x= engagement_08_10["time"],
                        y= engagement_08_10["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_darkgrey, width=1.8),
                        name='[0.8,1.0]'))

    engagement_00_02= e_hb[e_hb[by]=="[0, 0.2["]


    # draws the filled in learning gap
    #fig.add_trace(go.Scatter(
    #                    x=engagement_00_02["time"], 
    #                    y= engagement_00_02["engagement_index"].rolling(window).mean(),
    #                    mode='lines',
    #                    line= dict(color="#ccc", width=0),
    #                    name='[0.0,0.2]',
    #                    fill="tonexty"
    #))

    fig.add_trace(go.Scatter(
                        x=engagement_00_02["time"], 
                        y= engagement_00_02["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_green, width=3),
                        name='[0.0,0.2]'))


    text = [
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>80-100%% </b>" % (palette_darkgrey)
    ]

    annotation_helper(fig, text, datetime.date(2020, 11, 25), 160, [25,30], width=200)


    text = [
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>0-20%%</b>" % (palette_green)
    ]

    annotation_helper(fig, text, datetime.date(2021, 2, 12), 100, [25,30])


    text = [
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>20-40%%</b>" % (palette_grey4),
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>40-60%%</b>" % (palette_grey3),
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>60-80%%</b>" % (palette_grey2),
    ]

    annotation_helper(fig, text, datetime.date(2020, 11, 5), 115, [25,30], width=50, bgcolor="rgba(255,255,255,0.7)")


    text = [
        "<span style='color:%s; font-family:Tahoma; font-size:14px'>average engagement index</span>" % palette_darkgrey,
        "<span style='color:%s; font-family:Tahoma; font-size:13px'>14 day average</span>" % palette_grey2,
    ]

    annotation_helper(fig, text, datetime.date(2020, 3, 15), 520, [35], width=200)


    # title annotation
    text = [
        
    #"<span style='font-size:26px; font-family:Times New Roman;'>Did districts with students of color engage less?</span>", 
    #"<span style='font-size:13px; font-family:Helvetica'><b style='color:%s'> Districts with more hispanic/black students</b> seem to engage more, contrary to our expectation. </span>" % (palette_darkgrey) ,
    #"<span style='font-size:13px; font-family:Helvetica'> This result could skewed due to a smaller sample size in these districts,ie 8 compared to the 116 </span>",
    #"<span style='font-size:13px; font-family:Helvetica'> in <b style='color:%s'>white-dominated school districts</b>. </span>" % (palette_green)
        
        "<span style='font-size:26px; font-family:Times New Roman;'>Did districts with more " + by + " students engage less?</span>", 
    "<span style='font-size:13px; font-family:Helvetica'><b style='color:%s'> No. </b> They seem to have been hurt initially, but recovered well and even became more engaged. </span>" % (palette_darkgrey), 
    "<span style='font-size:13px; font-family:Helvetica'> <b>It is the second-most-disadvantaged group that needs the most help.</b></span>"
    ]

    annotation_helper(fig, text, 0.9, 1.175, [0.12,0.055,0.055], ref="paper", width=700)

    fig.update_yaxes(range=[0, 185])
    fig.show()
    
power_draw(e, "pct_black/hispanic")

## Free/Reduced Lunch Percentage

We observed a similar trend along the pct_free/reduced lunch dimension. The main differences are:

* Right after schools closed in March, **the \[0.8-1.0\[ group saw an even deeper drop in engagement (84\%!)**. The Black/Hispanic group only saw a drop of 55\%.
* This clearly shows that the **pct_free/reduced=\[0.8-1.0\[ group is the most vulnerable group**. 
* After the drop, engagement rebounded to only about 50\% of the pre-pandemic level; for the Black/Hispanic-heavy group, the rebound reached the pre-pandemic level.

My hypothesis is that, right after schools closed, **the Black/Hispanic districts received immediate help, whereas help for districts with high free/reduced lunch percentages only came after summer**. 
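The drop percentages quoted above compare average engagement before and after schools closed. A minimal sketch of that arithmetic follows. The helper `pct_drop`, the choice of the "before" window, and the toy series are assumptions made for illustration, so the printed figure only demonstrates the calculation and is not a result from the dataset.

```python
import pandas as pd


def pct_drop(series, before, after):
    """Percentage drop of the mean of `series` from window `before` to window `after`.

    `series` is indexed by date; `before` and `after` are (start, end) date strings.
    """
    mean_before = series.loc[before[0]:before[1]].mean()
    mean_after = series.loc[after[0]:after[1]].mean()
    return 100.0 * (mean_before - mean_after) / mean_before


# Toy daily series (made-up numbers): 200 before closure, 32 afterwards.
dates = pd.date_range("2020-03-01", "2020-04-15", freq="D")
values = [200.0 if d < pd.Timestamp("2020-03-15") else 32.0 for d in dates]
toy = pd.Series(values, index=dates)

drop = pct_drop(toy, ("2020-03-01", "2020-03-14"), ("2020-03-15", "2020-04-15"))
print(f"{drop:.0f}% drop")
```

On real data, `series` would be the daily mean `engagement_index` of one subgroup, and the result depends on how the two windows are chosen.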

In [None]:
power_draw(e, "pct_free/reduced")

## How Does Locale Affect Engagement?

Before the pandemic, Rural, Suburban, Town, and City districts ranked in decreasing order of engagement. In fall, the order was the same, except that Suburban and Town districts had become nearly indistinguishable. In short,

* **Rural schools fare better than other locales.** This may be further evidence that past investments are paying off.
* Unlike other disadvantaged groups, **rural schools did not see an immediate drop after schools closed**.
* One explanation is that rural schools were already more dependent on digital learning before the pandemic. 
* **City schools saw much lower engagement both before and during the pandemic**. Improving participation/engagement in this group would help close the digital learning gap.
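The ordering of locales can also be checked numerically rather than read off a chart. The sketch below assumes a frame with the `time`, `locale`, and `engagement_index` columns used in this notebook. The helper `rank_locales`, the period boundaries, and the toy rows are hypothetical and serve only to show the mechanics.

```python
import pandas as pd


def rank_locales(df, periods):
    """Return, for each named period, locales sorted by mean engagement (highest first)."""
    ranking = {}
    for name, (start, end) in periods.items():
        start, end = pd.Timestamp(start), pd.Timestamp(end)
        in_period = df[(df["time"] >= start) & (df["time"] <= end)]
        means = in_period.groupby("locale")["engagement_index"].mean()
        ranking[name] = means.sort_values(ascending=False).index.tolist()
    return ranking


# Toy rows (made-up numbers, chosen only to illustrate the mechanics).
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-02-01"] * 4 + ["2020-10-01"] * 4),
    "locale": ["Rural", "Suburb", "Town", "City"] * 2,
    "engagement_index": [120, 100, 90, 60, 210, 150, 148, 80],
})

periods = {
    "pre_pandemic": ("2020-01-01", "2020-03-01"),  # assumed window
    "fall": ("2020-09-01", "2020-12-31"),          # assumed window
}
ranking = rank_locales(toy, periods)
print(ranking)
```

Comparing the two lists shows at a glance whether the order changed between periods; comparing the underlying means shows how close Suburban and Town districts have become.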

In [None]:
def locale_draw(df, by):
    e_hb = df.groupby([by,"time"])["engagement_index"].mean().reset_index()

    window = 14  # rolling-average window (labelled "14 day average" on the chart)

    layout = dict(
        margin = dict(t=150),
        xaxis = dict(showline=True, linewidth=1, linecolor=palette_darkgrey, dtick="M1",tickformat="%b\n%Y"),
        yaxis = dict(showline=False, showgrid=True, gridwidth=1, gridcolor='#ddd', linecolor=palette_darkgrey),
        showlegend = False,
        width = 800,
        height = 550,
        plot_bgcolor= "#fff",
        hoverlabel=dict(
            bgcolor="white",
            font_size=12
        )
    )

    fig = go.Figure(layout=layout)

    # Suburban districts (the trace was previously mislabelled 'City')
    engagement_suburb = e_hb[e_hb[by]=="Suburb"]
    fig.add_trace(go.Scatter(
                        x=engagement_suburb["time"], 
                        y= engagement_suburb["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_grey4, width=1.5),
                        name='Suburb'))


    engagement_06_08= e_hb[e_hb[by]=="Town"]
    fig.add_trace(go.Scatter(
                        x=engagement_06_08["time"], 
                        y= engagement_06_08["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_grey2, width=1.5),
                        name='Town'))

    engagement_08_10= e_hb[e_hb[by]=="Rural"]
    fig.add_trace(go.Scatter(
                        x= engagement_08_10["time"],
                        y= engagement_08_10["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_darkgrey, width=1.8),
                        name='Rural'))

    engagement_00_02= e_hb[e_hb[by]=="City"]

    fig.add_trace(go.Scatter(
                        x=engagement_00_02["time"], 
                        y= engagement_00_02["engagement_index"].rolling(window).mean(),
                        mode='lines',
                        line= dict(color=palette_green, width=3),
                        name='City'))


    text = [
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>Rural</b>" % (palette_darkgrey),
    ]

    annotation_helper(fig, text, datetime.date(2020, 11, 20), 200, [25,30], width=200)


    text = [
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>City</b>" % (palette_green)
    ]

    annotation_helper(fig, text, datetime.date(2021, 2, 12), 50, [25,30])


    # label colors match the trace colors: Suburb is drawn in grey4, Town in grey2
    text = [
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>Suburban</b>" % (palette_grey4),
        "<b style='color:%s; font-family:Tahoma; font-size:12px'>Town</b>" % (palette_grey2)
    ]

    annotation_helper(fig, text, datetime.date(2020, 9, 25), 130, [25,30], width=100, bgcolor="rgba(255,255,255,0.7)")


    text = [
        "<span style='color:%s; font-family:Tahoma; font-size:14px'>average engagement index</span>" % palette_darkgrey,
        "<span style='color:%s; font-family:Tahoma; font-size:13px'>14 day average</span>" % palette_grey2
    ]

    annotation_helper(fig, text, datetime.date(2020, 3, 15), 520, [35], width=200)


    # title annotation
    text = [
        "<span style='font-size:26px; font-family:Times New Roman;'>How does locale affect digital learning engagement?</span>", 
        "<span style='font-size:13px; font-family:Helvetica'><b style='color:%s'> Engagement is highest in Rural areas, followed by Suburban and Town; City has the lowest engagement. </b></span>" % (palette_darkgrey)
    ]

    annotation_helper(fig, text, 0.9, 1.375, [0.12,0.055,0.055], ref="paper", width=600)

    fig.update_yaxes(range=[0, 230])
    fig.show()

locale_draw(e, "locale")

# Conclusions And Call for Action

I used 4 metrics (instead of just 1 or 2) to gauge the state of digital learning in 2020 and the impact of the COVID-19 pandemic. Engagement is not the only problem; participation is an even more serious one. Based on my study, I recommend these steps: 

1. Focus on participation as well as engagement in promoting digital learning.
2. For disadvantaged groups, work on engagement in addition to participation. That is, once we get the students onto digital platforms, keeping them there is an equally big challenge.
3. State demographics and policies are still the most important factors affecting digital learning. We should continue pushing for more help from government.
4. The most vulnerable group is the high pct_free/reduced group. 
5. The most disadvantaged groups need help, but we need to help the second-most-disadvantaged groups now as they have likely been neglected in the past.
6. Google Docs is the most popular product during the pandemic; *Seesaw* sees the largest increase in engagement percentage-wise. Both should be preferred for promotion in schools. 

**Disclaimer**: I copied and modified code from a few public notebooks, such as https://www.kaggle.com/girishkumarsahu/learnplatform-covid-19-impact, https://www.kaggle.com/ruchi798/covid-19-impact-on-digital-learning-eda-w-b, https://www.kaggle.com/spitfire2nd/the-learning-gap. The way I plotted the last 3 charts is particularly inspired by https://www.kaggle.com/spitfire2nd/the-learning-gap.