<a href="https://colab.research.google.com/github/MicMiao/data_visualization/blob/master/COVID_19_Impact_on_Digital_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## COVID-19 Impact on Digital Learning

<img src="https://images.unsplash.com/photo-1597933471507-1ca5765185d8?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=751&q=80" style="width:400px;"/>

This is a notebook for the competition launched on Kaggle (https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning).

With the dataset at our disposal and other sources of information, we will try to **uncover trends in digital learning** and how such trends relate to factors like district demographics, broadband access, and state/national level policies and events.

In particular, we will try to answer the following questions:

- What is the picture of digital learning in 2020 for the school districts included in the dataset?
- What is the effect of the COVID-19 pandemic on online and distance learning, and how might this evolve in the future?
- How does student engagement with different types of education technology change over the course of the pandemic?
- How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
- Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?


In [None]:
#@title Datasets from Kaggle can be downloaded using the opendatasets library
!pip install opendatasets

import opendatasets as od

While downloading the dataset, you will be asked to provide your Kaggle username and credentials, which you can obtain using the "Create New API Token" button on your account page on Kaggle. Upload the `kaggle.json` file using the files tab, or enter the username and key manually when prompted.

In [None]:
#@title Download the datasets
od.download('https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/data')

In [None]:
#@title Import the necessary libraries
import os
import pandas as pd
import numpy as np
import math
import glob
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pal = sns.color_palette()
import missingno as msno


## Load engagement data into DataFrame

### Engagement data
> The engagement data are aggregated at school district level, and each file in the folder ```engagement_data``` represents data from **one school district**.

📝The 4-digit file name represents ```district_id``` which can be used to link to district information in ```district_info.csv```. 

📝The ```lp_id``` can be used to link to product information in ```product_info.csv```.

| Name             | Description                                                                                                    |
|------------------|----------------------------------------------------------------------------------------------------------------|
| time             | date in "YYYY-MM-DD"                                                                                           |
| lp_id            | The unique identifier of the product                                                                           |
| pct_access       | Percentage of students in the district who have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product on a given day                             |
| district_id      | The unique identifier of the school district (custom column added from the filenames)                          |


Here we load all the CSV files into a single DataFrame, using each filename as the value of the 'district_id' column. We'll need this column later to merge with the district information DataFrame.

In [None]:
#@title Load engagement data:
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/'
all_files = glob.glob(os.path.join(path, "*.csv"))

data = [] # pd.concat takes a list of dataframes as an argument
for csv in all_files:
  # read the csv file
  frame = pd.read_csv(csv)
  # strip the directory and the extension: what remains of the file name is the district ID
  frame['district_id'] = os.path.splitext(os.path.basename(csv))[0]
  # append the frame to the list
  data.append(frame)

engagement_df = pd.concat(data)
engagement_df

In [None]:
engagement_df.info()

Check for any missing values:

In [None]:
engagement_df.isnull().sum()

For both 'pct_access' and 'engagement_index' we can fill the NaN values with zero. As for 'lp_id', we can either delete these 541 rows or fill them with zero as well; we choose the latter.

In [None]:
engagement_df.fillna(0, inplace=True)
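The two options for rows without an `lp_id` can be compared on a small example. This is only a sketch on toy data: the values and product ids below are made up, and only the column names come from the engagement files.

```python
import numpy as np
import pandas as pd

# Toy frame with the same columns as the engagement files (values are invented)
toy = pd.DataFrame({
    'time': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'lp_id': [1001.0, np.nan, 1002.0],
    'pct_access': [0.5, 0.1, np.nan],
    'engagement_index': [12.0, 3.0, np.nan],
})

# Option 1 (the choice made in this notebook): fill every NaN with zero
filled = toy.fillna(0)

# Option 2: drop rows with no product id, then zero-fill only the two metrics
dropped = toy.dropna(subset=['lp_id']).fillna({'pct_access': 0, 'engagement_index': 0})

print(len(filled), len(dropped))
```

With option 1 the rows without a product end up under a placeholder `lp_id` of 0, so they still contribute to per-district sums; option 2 removes them from every later aggregation.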

In [None]:
# convert district_id from object type to numeric type
engagement_df['district_id'] = pd.to_numeric(engagement_df['district_id'])

We also have a 'time' column. We leave it untransformed for now and will convert it later if necessary.

In [None]:
# def split_date(df):
#   df['time'] = pd.to_datetime(df['time'])
#   df['Month'] = df.time.dt.month
#   df['Day'] = df.time.dt.day
#   df['DayOfWeek'] = df.time.dt.dayofweek
#   df['WeekOfYear'] = df.time.dt.isocalendar().week
#   df['DayOfYear'] = df.time.dt.dayofyear

In [None]:
# split_date(engagement_df)

## Load districts_info data into DataFrame & some EDA

### District information data

>The district file ```districts_info.csv``` includes information about the **characteristics of school districts**, including data from 
>- NCES (2018-19), 
>- FCC (Dec 2018), and 
>- Edunomics Lab. 

Steps taken to preserve Privacy 🔒 
- Identifiable information about the school districts has been removed. 
- An open source tool ARX (Prasser et al. 2020) was used to transform several data fields and reduce the risks of re-identification. 

📝 For data generalization purposes, some data points are released as a range that contains the actual value. Additionally, many missing values are marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.

| Name                   | Description                                                                                                                                                                                                                                                                              |
|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| district_id            | The unique identifier of the school district                                                                                                                                                                                                                                             |
| state                  | The state where the district is located |
| locale                 | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information.                                                                                                          |
| pct_black/hispanic     | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data                                                                                                                                                                                       |
| pct_free/reduced       | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data                                                                                                                                                                              |
| county_connections_ratio | ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See FCC data for more information. |
| pp_total_raw             | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |
                                                         

In [None]:
districts_df = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
districts_df 

In [None]:
districts_df.info()

We have a total of **233** school districts. We also have some missing values, which we fill with the string value 'Not disclosed'. Later on, we will also map each range value and string value to categories like [1, 2, 3, 4] for our ML modeling.


In [None]:
districts_df.fillna('Not disclosed', inplace=True)
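The mapping from range strings to categories like [1, 2, 3, 4] could look roughly like the sketch below. It uses a toy column in the `[low, high[` format of `pp_total_raw`; the helper `range_lower_bound` and the choice of code 0 for undisclosed values are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy column using the "[low, high[" range format of pp_total_raw
toy = pd.Series(
    ['[8000, 10000[', '[4000, 6000[', 'Not disclosed', '[6000, 8000[', '[4000, 6000['],
    name='pp_total_raw',
)

def range_lower_bound(value):
    """Lower bound of a '[low, high[' string, or NaN for undisclosed values."""
    if not value.startswith('['):
        return float('nan')
    return float(value.strip('[').split(',')[0])

lower = toy.map(range_lower_bound)

# Rank the distinct lower bounds: lowest range -> 1, next -> 2, and so on
ordered_bounds = sorted(lower.dropna().unique())
code_of = {bound: rank for rank, bound in enumerate(ordered_bounds, start=1)}

# Assumption: undisclosed values get code 0 so they stay distinguishable
codes = lower.map(code_of).fillna(0).astype(int)
print(codes.tolist())
```

Ranking by lower bound keeps the natural order of the ranges, which a plain alphabetical sort of the strings would not.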

### No. of school districts per State:

In [None]:
plt.figure(figsize=(16,10))
plt.title('No. of school districts per State', size=20)
sns.countplot(y='state', data=districts_df, order=districts_df.state.value_counts(ascending=False).index)
plt.show();

In [None]:
plt.title('No. of school districts per State', size=20)
districts_df['state'].value_counts().plot(kind='pie', autopct='%1.1f%%', figsize=(16,10), startangle=0).legend(loc='right', bbox_to_anchor=(1.5, 0.5));

The dataset is not uniform. There are States with many school districts, like Connecticut and Utah, and States with very small samples, like Arizona, Florida, North Dakota and Minnesota. We have to be careful when comparing average engagement values across States, as the numbers might not be representative if the difference in sample size is too big.

42% of the school districts come from just 4 States: Connecticut, Utah, Massachusetts and Illinois.
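A share like the one above can be computed directly from the value counts. The sketch below uses an invented list of states (not the real district counts) to show the calculation and how to list States with very small samples; the cut-off of 2 districts is an arbitrary assumption.

```python
import pandas as pd

# Toy 'state' column: the counts are invented for illustration
toy_states = pd.Series(
    ['Connecticut'] * 3 + ['Utah'] * 3 + ['Illinois'] * 2 + ['Arizona', 'Florida'],
    name='state',
)

counts = toy_states.value_counts()   # districts per State, largest first
share = counts / counts.sum()        # fraction of all districts per State

# Combined share of the two largest States
top2_share = share.head(2).sum()

# States whose sample is too small to compare safely (hypothetical threshold)
small_samples = counts[counts < 2].index.tolist()

print(round(top2_share, 2), sorted(small_samples))
```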

In [None]:
print(engagement_df.district_id.nunique())
print(districts_df.district_id.nunique())

In [None]:
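dlist = set(districts_df.district_id.unique())
edlist = set(engagement_df.district_id.unique())

# district_id is numeric in both DataFrames at this point, so we compare the ids directly.
# Print every id that appears in only one of the two DataFrames
# (an empty list means the two sets of ids match):
print(sorted(dlist ^ edlist))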

We have checked that all 233 district_id values are the same in both DataFrames: engagement_df and districts_df.

### Locale Distribution Chart:

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
fig.suptitle('Locale Distribution', size=20)
explode = (0.05, 0.05, 0.05, 0.05, 0.05)
labels = list(districts_df.locale.value_counts().index)
sizes = districts_df.locale.value_counts().values
ax.pie(sizes, explode=explode, startangle=60, labels=labels, autopct='%1.0f%%', pctdistance=0.7)
ax.add_artist(plt.Circle((0,0), 0.4, fc='white'))
plt.show()


The Locale distribution is not uniform, or close to uniform, either. Our data is heavily skewed towards Suburb and Rural.

### Distribution pct_black/hispanic 

In [None]:
sns.countplot(data=districts_df, x='pct_black/hispanic')
plt.show()

![Screenshot at Sep 24 19-48-51.png](attachment:366c5710-06f4-46aa-bfb8-9b02b8086a6c.png)

There is an interesting survey conducted by AP and Chalkbeat in 2020, covering 677 school districts and 13 million students. The main takeaway is that race is a strong predictor of which public schools are offering in-person instruction and which aren't.

We will try to seek confirmation of such findings also from our dataset.

link: https://www.chalkbeat.org/2020/9/11/21431146/hispanic-and-black-students-more-likely-than-white-students-to-start-the-school-year-online

### Distribution pct_free/reduced

In [None]:
sns.countplot(data=districts_df, x='pct_free/reduced')
plt.show()

In [None]:
#@title High speed connection data
districts_df.county_connections_ratio.value_counts()

We'll ignore this column for now. We will have better data to address this feature.

### Per-pupil total expenditure data:

In [None]:
custom_sort = np.array(['[4000, 6000[','[6000, 8000[', '[8000, 10000[', '[10000, 12000[' ,'[12000, 14000[', '[14000, 16000[', '[16000, 18000[', '[18000, 20000[', '[20000, 22000[', '[22000, 24000[', '[32000, 34000['])

In [None]:
plt.figure(figsize=(12,10))
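# Pass the column as x explicitly: recent seaborn versions no longer accept it as a positional argument
sns.countplot(x=districts_df.pp_total_raw, order=custom_sort)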
plt.xticks(rotation=60);

The majority of spending per pupil is concentrated between 6K and 18K.


## The overall takeaway from analyzing the districts data is that:
- A large share of the school districts in the dataset (42%) comes from only 4 States, therefore we have an unbalanced dataset;
- School districts in the dataset are mainly in suburb and rural locales;
- The majority of the school districts (~120 out of 176) in the dataset have a lower percentage of Black and Hispanic students (0-20%) than the national average (31-40%, based on data from the National Center for Education Statistics: https://nces.ed.gov/programs/coe/indicator/cge#fn4).

Therefore we have to be careful when drawing conclusions from whatever insights we find in this dataset: it represents only the school districts covered by this LearnPlatform dataset, while the national picture might differ significantly from it.

# Analysis of digital learning metrics for school districts and States

Now that we have some idea about our dataset, we proceed with our first series of analyses. In particular, we would like to investigate the trend of our two performance metrics across all the available school districts:
- **pct_access**: Percentage of students in the district who have at least one page-load event of a given product on a given day.
- **engagement_index**: Total page-load events per one thousand students of a given product on a given day.

## Q1: How was the students' engagement during 2020 in each of our districts?

We'll use the following metrics as proxies:
- **pct_access.max()**: Why the max value, you may ask? Because many products are available to the students each day, and taking the mean across products would underestimate the engagement level.

>> For example: on Jan. 1st 2020, school district 1000 has 138 different products, each with its own pct_access. Their mean is only 0.145%, which clearly understates engagement: if 5% of students are active on a single product, then at least 5% of students were engaged that day, and averaging over little-used products hides that. In this case the max value is 3.6% on a 'certain' product. We don't care which product at this stage; we just know that at least 3.6% of the students were active on some technology product that day.

- **engagement_index.sum()**: for engagement_index we choose the sum, as it gives us the total page-load events per one thousand students across all products on a given day for each of our school districts.
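The difference between the three aggregations can be seen on a small example. This is a sketch with invented numbers for one district and one day (three products instead of 138); only the column names follow the engagement data.

```python
import pandas as pd

# Toy data: one district, one day, three products (values are invented)
toy = pd.DataFrame({
    'district_id': [1000, 1000, 1000],
    'time': ['2020-01-01'] * 3,
    'lp_id': [1, 2, 3],
    'pct_access': [0.05, 3.6, 0.10],
    'engagement_index': [1.0, 40.0, 2.0],
})

# One row per (district, day): mean vs. max for pct_access, sum for engagement_index
daily = toy.groupby(['district_id', 'time']).agg(
    pct_access_mean=('pct_access', 'mean'),
    pct_access_max=('pct_access', 'max'),
    engagement_index_sum=('engagement_index', 'sum'),
)
row = daily.iloc[0]
print(row)
```

The mean (1.25%) is pulled down by the two little-used products, while the max (3.6%) is a lower bound on the share of students who were active that day.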

In [None]:
#@title Calculate Max pct_access value for each district
# engagement_df[(engagement_df['district_id']==1000) & (engagement_df['time']=='2020-01-01')].pct_access.max()

# For each district and each day, we take the maximum pct_access value registered on all available products:
dist_engage_max_pct_access = engagement_df.groupby(['district_id', 'time']).pct_access.max()

# max pct_access for the district no. 1000
# dist_engage_max_pct_access.loc[1000]

# max pct_access for the district no. 1000, on Jan. 1st
# dist_engage_max_pct_access.loc[1000, '2020-01-01']

# max pct_access for all the districts on Jan. 1st
# dist_engage_max_pct_access.loc[:, '2020-01-01']

# max pct_access for each district during 2020:
# it's a df
# dist_engage_max_pct_access_df = dist_engage_max_pct_access.unstack()
# dist_engage_max_pct_access_df

# dist_engage_max_pct_access_df.loc[1000]
# dist_engage_max_pct_access_df.loc[1000, '2020-01-01']
# dist_engage_max_pct_access_df.loc[:, '2020-01-01']

# This also works:
# engagement_df.pivot_table(values='pct_access', index='district_id', columns='time', aggfunc='max')

In [None]:
#@title Calculate the daily total engagement_index (summed over products) for each district
# For the engagement_index we choose to sum all the values up to get a total number
dist_engagement_index_sum = engagement_df.groupby(['district_id', 'time']).engagement_index.sum()


In [None]:
#@title Create a single DataFrame for both pct_access and engagement_index
dist_pct_access_engagement_df = pd.concat([dist_engage_max_pct_access, dist_engagement_index_sum], axis=1)
dist_pct_access_engagement_df.columns = ['pct_access_max', 'engagement_index_sum']
dist_pct_access_engagement_df

In [None]:
#@title Remove districts with missing data

# pct_access
# dist_engage_max_pct_access_df = dist_engage_max_pct_access.unstack()
# to_be_removed = list(dist_engage_max_pct_access_df.isnull().sum(axis=1).sort_values(ascending=False).head(18).index)

# engagement_index
dist_engagement_index_sum_df = dist_engagement_index_sum.unstack()
to_be_removed = list(dist_engagement_index_sum_df.isnull().sum(axis=1).sort_values(ascending=False).head(18).index)

# Since they both yield the same list of districts, we can use either one:
for index_num in to_be_removed:
  dist_pct_access_engagement_df.drop(index_num, inplace=True)

We have a substantial number of districts with a lot of missing pct_access and engagement_index data.

For our analysis, we decide to drop all districts with more than 100 days of missing pct_access and engagement_index data (18 of 233 in this case).
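The "more than N missing days" rule can also be written as an explicit threshold instead of a fixed number of districts. The sketch below applies it to a toy wide table (one row per district, one column per day); the threshold of 2 days and all values are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per district, one column per day (NaN = no data that day)
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [np.nan, np.nan, np.nan, 5.0],
     [1.0, np.nan, 2.0, 3.0]],
    index=pd.Index([1000, 1204, 9927], name='district_id'),
    columns=['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
)

max_missing_days = 2  # hypothetical threshold (the notebook uses 100 on a full year)

# Count missing days per district and keep only the ids above the threshold
missing_days = wide.isnull().sum(axis=1)
to_drop = missing_days[missing_days > max_missing_days].index.tolist()

kept = wide.drop(index=to_drop)
print(to_drop, kept.index.tolist())
```

A threshold states the rule itself, so the number of dropped districts follows from the data rather than being fixed in advance.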

In [None]:
#@title Some manual search and check code:

dist_pct_access_engagement_df.loc[(1000),('pct_access_max', 'engagement_index_sum')]

# .loc[(what rows do I want), (what columns do I want)]
# dist_pct_access_engagement_df.loc[(1000, '2020-01-01'), :]
dist_pct_access_engagement_df.loc[(1000, '2020-01-01'), ('pct_access_max')]

dist_pct_access_engagement_df.loc[([1000, 9927], '2020-01-01'), :]

dist_pct_access_engagement_df.loc[(1000, ['2020-01-01', '2020-01-02']), :]

dist_pct_access_engagement_df.loc[(slice(None), ['2020-01-01', '2020-01-02']), :]


In [None]:
#@title Let's define a general function to plot both pct_access and engagement_index for each district we choose:

def district_metrics(district_num):
  # All the engagement data for the selected district, rolling 7 days to smooth out the effect of weekend:
  district_df = dist_pct_access_engagement_df.loc[district_num].rolling(7, min_periods=1).mean()

  # Use the district's own dates as x values, so that x and y always have the same length
  # (a fixed 366-day range fails for districts with missing days):
  dates = pd.to_datetime(district_df.index)

  # Create chart:
  fig, axs = plt.subplots(2, 1, figsize=(16,16), sharex=True)

  # Plotting pct_access:
  axs[0].set_title('School district no. ' + str(district_num) + ' daily pct_access maximum values during 2020', size=16)
  axs[0].plot(dates, district_df['pct_access_max'])

  # Plotting engagement_index:
  axs[1].set_title('School district no. ' + str(district_num) + ' daily engagement_index sum values during 2020', size=16)
  axs[1].plot(dates, district_df['engagement_index_sum'])
  plt.show();

In [None]:
#@title Insert the district number to get its engagement metrics:

district_number =  1000#@param {type:"number"}

district_metrics(district_number)

For school district no. 1000, the maximum percentage of students using a digital learning product (**max pct_access**) oscillates between **25-40%** on the most active days. We can see the drop during the summer vacation and the subsequent rise in September, when the school year restarts.

The total page-load events per one thousand students across all digital learning products, on the other hand, oscillate wildly during the first part of the year (almost **80k page-load events at peak and less than 10k page-load events at the bottom**) and remain more stable in the second half. This could be due to the outbreak of the pandemic in the first half of 2020, with schools being closed in March. Further investigation at school district level can tell us more about these oscillations.
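The charts are drawn after a 7-day rolling mean, which `district_metrics` applies to remove the weekday/weekend rhythm. A minimal sketch of that smoothing on an invented series (100 on weekdays, 0 on weekends) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy daily series: 100 on weekdays, 0 on weekends (2020-01-01 is a Wednesday)
dates = pd.date_range('2020-01-01', periods=14, freq='D')
raw = pd.Series(np.where(dates.dayofweek < 5, 100.0, 0.0), index=dates)

# Same smoothing as in district_metrics: 7-day window, partial windows allowed at the start
smooth = raw.rolling(7, min_periods=1).mean()

print(smooth.round(2).tolist())
```

Once the window is full, every 7-day span contains five weekdays and two weekend days, so the smoothed value settles at 500/7 and the weekend dips disappear. With `min_periods=1` the first six values are averages over shorter windows, so the very beginning of each curve is less smooth.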

While it is useful to have a picture of each district, it would be more interesting to compare them and gain more insights, such as:

- Which are the top 10 districts for both pct_access and engagement_index?
- And which are the bottom 10?
- How big is the engagement difference between top 10 and bottom 10?
- If we choose two or more districts that we're interested in, how big is their engagement difference? And what are the underlying causes?

In [None]:
#@title Define a function for pct_access metric:

def districts_pa(*args):  # pass any number of district ids

  # Collect one smoothed pct_access_max series per selected district
  # (7-day rolling mean to smooth out the effect of weekends):
  series_list = []
  for dist in args:
    pa = dist_pct_access_engagement_df.xs(dist, level='district_id')['pct_access_max'].rolling(7, min_periods=1).mean()
    series_list.append(pa.rename('pa' + str(dist)))

  # One column per district, aligned on the date index:
  tot_df = pd.concat(series_list, axis=1)
  tot_df.index = pd.to_datetime(tot_df.index)

  # Plot the chart:
  fig, ax = plt.subplots(figsize=(16,8))
  sns.lineplot(data=tot_df, palette='tab10', linewidth=2.5)
  ax.set_title("max pct_access trend for selected districts", size=16)
  ax.set_ylabel('max pct_access')


In [None]:
districts_pa(1000, 1204, 9927)

Even with a basic chart we can already ask some very interesting questions about pct_access:

- The percentage of students with at least one page-load event of a given product on a given day is higher in district no. 1000 than in district no. 9927 during the first half of 2020. After the summer break, however, the latter outperforms the former. What happened in the second part of the year? What caused this switch?
- What is the reason for the underperformance of school district no. 1204? Its pct_access values are less than half those of the other two.

On a district level, this type of inquiry can be very interesting. We just need to select the districts of interest, compare them, and further investigate any substantial difference.
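A switch between two districts can be quantified by comparing half-year averages. The sketch below uses two invented series (`pa_A` and `pa_B` are hypothetical column names, in the style of the columns built by `districts_pa`) that swap positions on July 1st.

```python
import numpy as np
import pandas as pd

# Toy daily pct_access for two hypothetical districts that swap positions mid-year
dates = pd.date_range('2020-01-01', '2020-12-31', freq='D')
toy = pd.DataFrame({'pa_A': 30.0, 'pa_B': 20.0}, index=dates)
toy.loc['2020-07-01':, 'pa_A'] = 15.0
toy.loc['2020-07-01':, 'pa_B'] = 35.0

# Label each day with its half of the year and average per half
half = np.where(toy.index.month <= 6, 'H1', 'H2')
half_means = toy.groupby(half).mean()

# Positive gap: A ahead of B; negative gap: B ahead of A
gap = half_means['pa_A'] - half_means['pa_B']
print(gap)
```

A sign change in `gap` between the two halves marks the kind of switch seen in the chart, and its size gives a first number to investigate.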

In [None]:
#@title Define a function for engagement_index metric:

def districts_ei(*args):  # pass any number of district ids

  # Collect one smoothed engagement_index_sum series per selected district
  # (7-day rolling mean to smooth out the effect of weekends):
  series_list = []
  for dist in args:
    ie = dist_pct_access_engagement_df.xs(dist, level='district_id')['engagement_index_sum'].rolling(7, min_periods=1).mean()
    series_list.append(ie.rename('ie' + str(dist)))

  # One column per district, aligned on the date index:
  tot_df = pd.concat(series_list, axis=1)
  tot_df.index = pd.to_datetime(tot_df.index)

  # Plot the chart:
  fig, ax = plt.subplots(figsize=(16,8))
  sns.lineplot(data=tot_df, palette='tab10', linewidth=2.5)
  ax.set_title("sum engagement index trend for selected districts", size=16)
  ax.set_ylabel('sum engagement_index')


In [None]:
districts_ei(1000, 1204, 9927)

As for engagement_index, we've chosen the same three districts as before, and we can see some similarities with pct_access:
- District no. 1000 outperforms in the first part of the year, while district no. 9927 overtakes it later on (here the overtaking already happens in April!);
- District no. 1204 generally underperforms over the year, especially in the second half.
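How closely the two metrics move together within a district can be checked with a per-district correlation. This is a sketch on invented data with the same index layout (`district_id`, `time`) and column names as `dist_pct_access_engagement_df`; it is not a result from the real dataset.

```python
import pandas as pd

# Toy frame with the same index and columns as dist_pct_access_engagement_df
idx = pd.MultiIndex.from_product(
    [[1000, 1204], ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04']],
    names=['district_id', 'time'],
)
toy = pd.DataFrame({
    'pct_access_max': [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    'engagement_index_sum': [100.0, 200.0, 300.0, 400.0, 400.0, 300.0, 200.0, 100.0],
}, index=idx)

# Pearson correlation between the two metrics, computed separately for each district
corr = pd.Series({
    district: group['pct_access_max'].corr(group['engagement_index_sum'])
    for district, group in toy.groupby(level='district_id')
})
print(corr)
```

A value near 1 means the two metrics tell the same story for that district; a low or negative value would be worth a closer look.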

## Q2: Which are the top 10 districts?

We have 215 school districts in total (233 - 18 that we have removed due to too much missing data). Let's find out which are the best performers.

We use the annual mean for both pct_access and engagement_index to decide which are the top performers and which are the bottom ones.

In [None]:
#@title Create a new dataframe with annual mean values of the metrics for all districts:

# Average each daily metric over the year, giving one row per district:
dist_pct_engagement_mean_df = (
    dist_pct_access_engagement_df
    .groupby(level='district_id')[['pct_access_max', 'engagement_index_sum']]
    .mean()
    .rename(columns={'pct_access_max': 'pct_access_mean',
                     'engagement_index_sum': 'engagement_index_mean'})
)
dist_pct_engagement_mean_df

In [None]:
#@title Top 10 pct_access districts:
top10_pa = dist_pct_engagement_mean_df.pct_access_mean.sort_values(ascending=False).head(10).to_frame()
top10_pa

In [None]:
districts_pa(5890, 9553, 2779, 8815, 3228, 9536, 6577, 6194, 2598, 8702)

We noticed 2 suspicious data segments:
- District no. 8815: the data for the first three months may contain errors, as pct_access reaches 100% (which is hard to believe);
- District no. 9536: conversely, it has very low data points for the first three months. This could also be a data acquisition problem.

We need to check these points with our client/colleagues who handle the data acquisition process.
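Segments like these can also be flagged automatically before any visual inspection. The sketch below counts, per district, the days at the 100% ceiling and the days at zero on an invented table; the district ids, values and cut-offs are all assumptions for illustration.

```python
import pandas as pd

# Toy daily max pct_access for three hypothetical districts
toy = pd.DataFrame({
    'district_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'pct_access_max': [100.0, 100.0, 45.0, 0.0, 0.0, 30.0, 40.0, 42.0, 38.0],
})

# Mark the days that look implausible: at the 100% ceiling or at exactly zero
checks = toy.assign(
    at_ceiling=toy['pct_access_max'] >= 100,
    at_zero=toy['pct_access_max'] == 0,
)

# Count flagged days per district
flags = checks.groupby('district_id')[['at_ceiling', 'at_zero']].sum()

# Hypothetical rule: any day at the ceiling, or more than one day at zero
suspicious = flags[(flags['at_ceiling'] > 0) | (flags['at_zero'] > 1)].index.tolist()
print(suspicious)
```

The flagged list is only a starting point for the conversation with whoever handles data acquisition; it does not say whether the values are actually wrong.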


With the exception of these two segments, the overall picture is quite clear:
- The top 10 districts for the **pct_access metric** average **50-60%** during the school calendar peak months, with an **overall annual average in the range of 35-40%**.


In [None]:
#@title Let's link the top 10 districts with some other information to have a more complete picture:
top10_pa.index = top10_pa.index.astype(int)
districts_df.district_id = districts_df.district_id.astype(int)

top10_pa.merge(districts_df, how='left', left_index=True, right_on='district_id').set_index('district_id')

New findings emerged:
- The top 4 are all from the **State of Illinois**, and half of the top 10 are from this State. Our dataset is not balanced at national level, and Illinois is among the top 4 States by number of school districts in our dataset (see the chart of No. of school districts per State above). However, Utah and Massachusetts have no presence in this ranking, and Connecticut has only one district, at position no. 7. The outperformance of Illinois' districts is an interesting point for further study.
- **District no. 9536** from New York has a very high percentage of both Black/Hispanic students and students eligible for free/reduced-price lunch, both between 80-100%. This is very encouraging news, as we might infer from it that harsh economic conditions don't have to be an obstacle to digital learning. Maybe other districts from poor areas can adopt some of the best practices of this district to increase their students' engagement with digital learning products.

Let's do the same list for engagement_index before we try to come up with other insights.

In [None]:
#@title Top 10 engagement index districts:
pd.options.display.float_format = "{:,.2f}".format

top10_ei = dist_pct_engagement_mean_df.engagement_index_mean.sort_values(ascending=False).head(10).to_frame()
top10_ei

In [None]:
districts_ei(8815, 9536, 2779, 9553, 6194, 5890, 6512, 3314, 9463, 2393)

For our top 10 districts on engagement index:
- The overall annual average total page-load events per one thousand students ranges from **71,000 to 100,000 events**.
- During peak months the metric stays **between 100,000 and 200,000**, with the highest peak touching 300,000 events.
- The trend oscillates a lot during the first half of the year and remains more stable during the second half. This may reflect school disruption and closures during the pandemic outbreak in the first 3 months of 2020. As students, teachers and parents got progressively more prepared during the year, the usage of digital learning tools became more stable in the last few months of 2020.


In [None]:
#@title Let's link the top 10 districts with some other information to have a more complete picture:
top10_ei.index = top10_ei.index.astype(int)

top10_ei.merge(districts_df, how='left', left_index=True, right_on='district_id').set_index('district_id')

It looks like a big portion of the top 10 consists of the same districts for both pct_access and engagement_index, as might be expected.

In [None]:
#@title Top performers on both metrics:

list1 = top10_pa.index.values.tolist()
list2 = top10_ei.index.values.tolist()
set1 = set(list1)
intersection = list(set1.intersection(list2))
intersection


dist_pct_engagement_mean_df.index = dist_pct_engagement_mean_df.index.astype(int)
top10_combined = dist_pct_engagement_mean_df.loc[intersection]

top10_combined.index = top10_combined.index.astype(int)
top10_combined.merge(districts_df, how='left', left_index=True, right_on='district_id').set_index('district_id').sort_values(['pct_access_mean', 'engagement_index_mean'], ascending=[False, False])


Now that we have completed our top 10 analysis on both metrics, we can see that it confirms our previous observations:
- **The State of Illinois** occupies half of the top 10 ranking; its districts have very high engagement with digital learning technologies. However, we have to keep in mind that Illinois and Connecticut are two of the four States where the bulk of the data (over 55%) comes from. So we have to be careful here: districts from other States might show higher engagement metrics if we added more data to the analysis. The same reasoning applies to the locale: since ~60% of the data comes from Suburb, we can't draw any hasty conclusion here.
- In NY City, district no. 9536, with its peculiar characteristics of a high percentage of minority students and a high percentage of students eligible for free and reduced-price lunch, is among the top 6 in the combined ranking. This is very encouraging news, as we might infer from it that **harsh economic conditions don't have to be an obstacle to digital learning**.

## Having analyzed the top 10, let's have a look at the bottom 10:
## Q3: Which are the bottom 10 districts?

In [None]:
#@title Bottom 10 pct_access districts:
bottom10_pa = dist_pct_engagement_mean_df.pct_access_mean.sort_values(ascending=True).head(10).to_frame()
bottom10_pa

In [None]:
b10pa = list(bottom10_pa.index)
districts_pa(*b10pa)

Two notable examples may indicate a data acquisition problem:
- District no. 5042 around March 2020.
- District no. 8017, which has zero pct_access until December, when the data explodes.

We have to check these examples with our client/colleagues and make sure that they are correct.

Apart from the two examples above, the overall picture is still valid. For our bottom 10, the pct_access metric has **an average of 10-20% during the school calendar peak months** (vs. 50-60% for the top 10) and **an overall annual average in the range of 3-6%** (vs. 35-40% for the top 10).

For pct_access, the **top 10 are 3-5X on peak values** and **8-10X on annual average values** compared to the bottom 10.

In [None]:
#@title Bottom 10 engagement_index districts:
bottom10_ei = dist_pct_engagement_mean_df.engagement_index_mean.sort_values(ascending=True).head(10).to_frame()
bottom10_ei

In [None]:
b10ei = list(bottom10_ei.index)
districts_ei(*b10ei)

For our bottom 10 districts on engagement index:

- During peak months the metric stays in the **range from 10,000 to 40,000** (vs. 100,000 to 200,000 events of the top10).
- The overall annual average total page-load events per one thousand students ranges **from 1,500 to 7,500 events** (vs. 71,000 to 100,000 events of the top10).


We are seeing that for engagement_index, top10 are **5-10X on peak values** and **5-13X on annual average values** compared to the bottom10.

In [None]:
#@title Bottom performers on both metrics:

# Make sure district ids are integers so they can be matched against districts_df.district_id:
dist_pct_engagement_mean_df.index = dist_pct_engagement_mean_df.index.astype(int)

# Districts that appear in both bottom-10 lists:
intersection = sorted(set(bottom10_pa.index.astype(int)) & set(bottom10_ei.index.astype(int)))

bottom10_combined = dist_pct_engagement_mean_df.loc[intersection]
bottom10_combined.merge(districts_df, how='left', left_index=True, right_on='district_id').set_index('district_id').sort_values(['pct_access_mean', 'engagement_index_mean'], ascending=[True, True])

Analyzing the bottom10, we noticed the following:

- Both pct_access and engagement_index are **around or less than 1/10** of the values of the best districts. The differences are very large. There may be more than one way to interpret this result: if a school relies more on in-person teaching, its students will use digital learning tools relatively less. However, in a pandemic year like 2020, we would read lower engagement metrics as a negative sign rather than a positive one, especially since almost all U.S. States closed schools around the end of March.
- A curious finding is that **district no.5042** is located in Illinois! We have already seen that many of the top districts are in Illinois, so it is surprising that one of the worst is in the same State, and we wonder what causes this huge difference. Maybe district no.5042 can learn something from its neighbours.
- **Lower-income districts might be correlated with lower performance on digital learning**: 4 out of 6 bottom districts have a fairly high percentage of students eligible for free/reduced-price lunch.

Clearly something is not working in these districts, and the reason might be economic. But as we have seen with the NY district in the top10, low income doesn't necessarily have to be an obstacle. To better understand the reasons, further research is needed, including interviews with teachers, parents, etc.



## Q4: Which are the best States? And which are the worst?

Here we must not rush into comparisons. We already know that our dataset is not uniform: 4 States account for more than 55% of the data.

Therefore, we first need to establish a cut-off threshold to drop the States with too few data points.

We can see that Florida, Minnesota, Arizona and North Dakota have only 1 school district each, while Minnesota, Arizona, North Dakota and New Hampshire have the fewest data points for both pct_access and engagement_index.

We'll set the cut-off at 100K data points for both pct_access and engagement_index, and drop States with fewer than 100K data points. The goal is to allow a certain comparability, as too little data will probably give us a distorted picture of reality.
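
As a minimal sketch of how such a cut-off can be derived from the data rather than by eye, the toy example below counts districts and non-missing data points per State and keeps only the States above a threshold. The table, its values and the tiny `MIN_POINTS` threshold are made up for illustration; the notebook itself uses 100K.

```python
import pandas as pd

# Toy stand-in for the merged engagement/district table (values are made up):
toy = pd.DataFrame({
    'state': ['Illinois'] * 5 + ['Utah'] * 4 + ['Arizona'] * 1,
    'district_id': [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    'pct_access': [10, 12, 8, 9, 11, 5, 6, 7, None, 3],
})

MIN_POINTS = 3  # illustrative threshold only

summary = toy.groupby('state').agg(
    n_districts=('district_id', 'nunique'),
    n_points=('pct_access', 'count'),  # count() ignores missing values
)
kept_states = summary.index[summary['n_points'] >= MIN_POINTS].tolist()

print(summary)
print(kept_states)
```

The same summary table also shows at a glance which States are represented by a single district.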


In [None]:
# Merge the two datasets:
engagement_districts_df = pd.merge(engagement_df, districts_df, on=['district_id'])
# engagement_districts_df.sample(100)

In [None]:
#@title We drop the States of Minnesota, Arizona, North Dakota, New Hampshire from our analysis, as they contain too few districts or examples.

# Rows whose State is 'Not disclosed' are dropped as well:
excluded_states = ['Minnesota', 'Arizona', 'North Dakota', 'New Hampshire', 'Not disclosed']
engagement_districts_df = engagement_districts_df[~engagement_districts_df.state.isin(excluded_states)]


First, let's define our metrics for comparing and ranking States.

We take the average value among all districts in a given State as the State value. To recap:

- For pct_access, we take the daily max value for each school district (we don't care which technology product it comes from: if the max pct_access on a given day is 10%, we use 10% for that day). Once we have the daily max pct_access for the whole year for each district in the State, we average these values across districts.

- For engagement_index, we use the sum instead of the max, so we have the cumulative daily value for the whole year for each district in the State. As above, we average these values across all districts to obtain the State-level metric.

We will explain further once we have calculated the metrics and plotted the charts.
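
To make the definition concrete, here is a small worked example on made-up numbers (one State, two districts, two days, two products per day). It is a sketch of the aggregation steps described above, not the code used for the ranking.

```python
import pandas as pd

# Toy engagement records (values are made up): two products per district per day.
toy = pd.DataFrame({
    'state':       ['A'] * 8,
    'district_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'time':        ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'] * 2,
    'pct_access':  [10.0, 4.0, 20.0, 6.0, 2.0, 1.0, 4.0, 3.0],
    'engagement_index': [100.0, 50.0, 200.0, 100.0, 10.0, 20.0, 30.0, 40.0],
})

# Step 1: one value per district and day (max for pct_access, sum for engagement_index).
daily = toy.groupby(['state', 'district_id', 'time']).agg(
    pct_access=('pct_access', 'max'),
    engagement_index=('engagement_index', 'sum'),
)

# Step 2: average across the districts of each State, day by day.
state_daily = daily.groupby(level=['state', 'time']).mean()

# Step 3: average over the days to get a single annual value per State.
state_annual = state_daily.groupby(level='state').mean()
print(state_annual)
```

For State A the daily pct_access values are 6 and 12, so the annual value is 9; the daily engagement_index values are 90 and 185, giving 137.5.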

In [None]:
#@title State level **pct_access** and **engagement_index** ranking:

# states_pct_access_df.loc['California']
# states_pct_access_df.loc['California'].mean(axis=0).head(20)
# df = pd.DataFrame(states_pct_access_df.loc['California'].mean(axis=0))
# df.index = df.index.droplevel()
# df.transpose()

# Extract pct_access data from the df:
states_pct_access_df = engagement_districts_df.groupby(['state', 'district_id', 'time']).pct_access.max().to_frame().unstack()

# Create a new temporary DataFrame with States as index:
states_index = states_pct_access_df.index.get_level_values(level=0).drop_duplicates()
states_pct_index = pd.DataFrame(index=states_index)

frames = []

# For each State we calculate the mean of pct_access among its districts:
for state, sub_df in states_pct_access_df.groupby(level=0):
  new_df = pd.DataFrame(states_pct_access_df.loc[state].mean(axis=0))
  new_df.index = new_df.index.droplevel()
  frames.append(new_df.transpose())

states_pct_temp = pd.concat(frames)
states_pct_mean_df = states_pct_temp.set_index(states_pct_index.index)

# States Ranking for pct_access:
states_pa = states_pct_mean_df.mean(axis=1).sort_values(ascending=False).to_frame()

# plt.figure(figsize=(20,10))
# sns.barplot(x=states_pa.values.flatten().astype('int'), y=states_pa.index)
# plt.title('States Ranking for pct_access', size=20)
# plt.ylabel('States')
# plt.xlabel('Annual mean pct_access');


# Extract engagement_index data from the df:
states_engagement_index_df = engagement_districts_df.groupby(['state', 'district_id', 'time']).engagement_index.sum().to_frame().unstack()

# Create a new temporary DataFrame with States as index:
states_index = states_engagement_index_df.index.get_level_values(level=0).drop_duplicates()
states_engagement_index_index = pd.DataFrame(index=states_index)

frames = []

for state, sub_df in states_engagement_index_df.groupby(level=0):
  new_df = pd.DataFrame(states_engagement_index_df.loc[state].mean(axis=0))
  new_df.index = new_df.index.droplevel()
  frames.append(new_df.transpose())

states_engagement_temp = pd.concat(frames)
# engagement_index for each State:
states_engagement_mean_df = states_engagement_temp.set_index(states_engagement_index_index.index)

# States Ranking for engagement_index:
states_ei = states_engagement_mean_df.mean(axis=1).sort_values(ascending=False).to_frame()


# plt.figure(figsize=(20,10))
# sns.barplot(x=states_ei.values.flatten().astype('int'), y=states_ei.index)
# plt.title('States Ranking for engagement_index', size=20)
# plt.ylabel('States')
# plt.xlabel('Annual mean engagment_index');

#Plot the charts:
f, (ax0, ax1) = plt.subplots(1, 2, figsize=(18, 10))

sns.barplot(x=states_pa.values.flatten().astype('int'), y=states_pa.index, ax=ax0)
ax0.set_title('States Ranking for pct_access', size=16)
ax0.set_xlabel('Annual mean pct_access')

sns.barplot(x=states_ei.values.flatten().astype('int'), y=states_ei.index, ax=ax1)
ax1.set_title('States Ranking for engagement_index', size=16)
ax1.set_xlabel('Annual mean engagement_index')

plt.show();

We know very well that our dataset is not balanced, so any findings depend on the data currently available. If new data is added in the future, we will very likely see a different picture.

That said, for now the State ranking confirms our previous analysis of the districts:

- The best States are: **Illinois, New York, Indiana, Connecticut/Wisconsin**.
- At the bottom of the ranking we find: **Texas, Michigan, North Carolina**. (This might be partially due to the smaller amount of data; we can't draw a confident conclusion unless we have more data on these States.)
- **California**'s performance is quite poor compared to the best: both pct_access and engagement_index are only 1/2 of the values at the top of the ranking.

# **#WIP...**

## Load products_info data into DataFrame

### Product information data
> The product file ```products_info.csv``` includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. 

📝 Some products may not have labels due to being duplicate, lack of accurate url or other reasons.

| Name                       | Description                                                                                                                                                                                                                                                                                                                    |
|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LP ID                      | The unique identifier of the product                                                                                                                                                                                                                                                                                           |
| URL                        | Web Link to the specific product                                                                                                                                                                                                                                                                                               |
| Product Name               | Name of the specific product                                                                                                                                                                                                                                                                                                   |
| Provider/Company Name      | Name of the product provider                                                                                                                                                                                                                                                                                                   |
| Sector(s)                  | Sector of education where the product is used                                                                                                                                                                                                                                                                                  |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories have multiple sub-categories with which the products were labeled |

We split the Primary Essential Function into two separate columns as there are two layers/levels of specifications here.

In [None]:
products_df = pd.read_csv('/content/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
products_df

In [None]:
products_df.columns

As per the description above, the last column 'Primary Essential Function' actually contains two layers of information. As a first step, we therefore split it as follows:

In [None]:
# Split on the first ' - ' only; `n` must be passed by keyword in recent pandas versions:
products_df['func_category'] = products_df['Primary Essential Function'].str.split(' - ', n=1).str[0]
products_df['primary_function'] = products_df['Primary Essential Function'].str.split(' - ', n=1).str[1]
products_df
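
A short illustration of why the split is limited to the first separator: some labels may contain ' - ' more than once, and products without a label should stay missing. The strings below are illustrative values in the `<category> - <function>` layout and are assumptions, not rows copied from the file.

```python
import pandas as pd

# Illustrative labels; None mimics a product without a label.
labels = pd.Series([
    'LC - Digital Learning Platforms',
    'LC - Sites, Resources & Reference - Games & Simulations',
    None,
])

# n=1 splits on the first ' - ' only, so the rest of the label stays in one piece.
parts = labels.str.split(' - ', n=1)
func_category = parts.str[0]
primary_function = parts.str[1]

print(pd.DataFrame({'func_category': func_category,
                    'primary_function': primary_function}))
```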

In [None]:
products_df.info()

In [None]:
print(products_df["LP ID"].nunique())
print(engagement_df["lp_id"].nunique())

We can see from the output above that products_df has information on the top 372 products, while engagement_df contains many more products. So we have to be mindful of this when we merge the datasets.
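
One way to quantify this mismatch before merging is sketched below on made-up data: `isin` gives the share of engagement rows that have product information, and a left merge with `indicator=True` marks the rows that would be silently dropped by a default (inner) merge. The toy frames and their values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins (made up): three catalogued products, four products seen in the usage data.
toy_products = pd.DataFrame({'LP ID': [101, 102, 103]})
toy_engagement = pd.DataFrame({
    'lp_id': [101, 101, 102, 104, 105, 105],
    'engagement_index': [5.0, 7.0, 3.0, 9.0, 1.0, 2.0],
})

# Share of engagement rows whose product appears in the catalogue:
in_catalogue = toy_engagement['lp_id'].isin(toy_products['LP ID'])
print(f'rows with product info: {in_catalogue.mean():.0%}')

# A left merge keeps every engagement row; the _merge column flags unmatched ones.
merged = toy_engagement.merge(
    toy_products, how='left', left_on='lp_id', right_on='LP ID', indicator=True
)
print(merged['_merge'].value_counts())
```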

### Top 10 companies in terms of number of products:

In [None]:
companies_num_prod = products_df['Provider/Company Name'].value_counts(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x = companies_num_prod.values, y = companies_num_prod.index, alpha=0.8)
plt.title('Top 10 companies in terms of number of products', size=20)
plt.ylabel('Companies')
plt.xlabel('Number of products')
plt.show();

The chart shows the importance of Google for digital learning, although we still have to dig deeper into which products are used most by students.

### Sectors Distribution

In [None]:
products_df['Sector(s)'].unique()

In [None]:
c1=c2=c3=0
for s in products_df["Sector(s)"]:
    if(not pd.isnull(s)):
        s = s.split(";")
        for i in range(len(s)):
            sub = s[i].strip()
            if(sub == 'PreK-12'): c1+=1
            if(sub == 'Higher Ed'): c2+=1
            if(sub == 'Corporate'): c3+=1

fig, ax  = plt.subplots(figsize=(16, 8))
fig.suptitle('Sector Distribution', size = 20)
explode = (0.05, 0.05, 0.05)
labels = ['PreK-12','Higher Ed','Corporate']
sizes = [c1,c2, c3]
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.1f%%', pctdistance=0.7, colors=["#ff228a","#20b1fd","#ffb703"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show();

The majority of products target PreK-12, which matches well with the goal of our analysis.

### Function Categories Distribution

In [None]:
products_df['func_category'].value_counts()

In [None]:
c1=c2=c3=0

for s in products_df["func_category"]:
    if(not pd.isnull(s)):
        c1 += s.count("CM")
        c2 += s.count("LC")
        c3 += s.count("SDO")

fig, ax  = plt.subplots(figsize=(16, 8))
fig.suptitle('Function Categories', size = 20)
explode = (0.05, 0.05, 0.05)
labels = ['CM','LC','SDO']
sizes = [c1, c2, c3]
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.1f%%', pctdistance=0.7, colors=["#18ff9f","#2cfbff","#ffb703"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

LC (Learning & Curriculum) is by far the largest function category, which matches well with our objective.

### Primary Functions Ranking


In [None]:
func_dist = products_df['primary_function'].value_counts(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x = func_dist.values, y = func_dist.index, alpha=0.8)
plt.title('Primary Functions Ranking', size=20)
plt.ylabel('Primary functions')
plt.xlabel('count')
plt.show();

As expected, the highest counts belong to functions focused on student learning, such as Digital Learning Platforms, Resources & Reference, and Content Creation & Curation.

The lower end is populated by school management software, admissions, enrollment, teacher resources, etc.

## Merge products_df with engagement_df

In [None]:
products_engagement_df = pd.merge(products_df, engagement_df, left_on='LP ID', right_on='lp_id')
products_engagement_df.head()

In [None]:
products_engagement_df.info()

In [None]:
def plot_pct(statename, closedate):
  # Daily State-level pct_access, smoothed with a 7-day rolling mean:
  pct_access_state = states_pct_mean_df[states_pct_mean_df.index == statename].transpose()
  pct_access_state = pct_access_state.rolling(7, min_periods=1).mean()
  pct_access_state.index = pd.to_datetime(pct_access_state.index)

  fig, ax = plt.subplots(figsize=(20, 10))
  pct_access_state.plot(ax=ax)
  ax.set_title('State of {} pct_access 2020'.format(statename), size=20)
  ax.axvline(pd.to_datetime(closedate), color='red', linestyle='--', label='The date a state closed K-12 public schools statewide.')
  ax.legend()  # called after axvline so that the closure-date label is shown
  plt.show();

In [None]:
plot_pct('Michigan', '2020-03-16')

Any inference on Michigan must be taken with a grain of salt, as we have fewer than 200K data points for it, compared to 2.5M for Connecticut.

We also notice that the trend in the second part of the year is not that low, only a little lower than NY's. So the low ranking is mostly due to the first part of the year. With more districts and data, the pct_access might go up.

In [None]:
plot_pct('North Carolina', '2020-03-16')

North Carolina is even stranger: notice the gap starting in April. This might also be due to too few data points.
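
A quick way to test the "not enough data" explanation is to count how many districts contribute to the State average on each day. The sketch below does this on made-up records; the column names follow the merged table used above, but the values are assumptions.

```python
import pandas as pd

# Toy records (made up): district 1 reports on all four days, district 2 only on the first two.
toy = pd.DataFrame({
    'state': ['North Carolina'] * 6,
    'district_id': [1, 1, 1, 1, 2, 2],
    'time': pd.to_datetime(['2020-03-30', '2020-03-31', '2020-04-01', '2020-04-02',
                            '2020-03-30', '2020-03-31']),
    'pct_access': [5.0, 6.0, 4.0, 5.0, 7.0, 8.0],
})

# Number of distinct districts behind the State average on each day:
coverage = toy.groupby(['state', 'time'])['district_id'].nunique()
print(coverage.loc['North Carolina'])
```

A drop in this count that lines up with the gap in the chart would suggest that the gap reflects missing districts rather than a real change in usage.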

### We now do the same analysis for the engagement_index:

In [None]:
def plot_engagement(statename, closedate):
  # Daily State-level engagement_index, smoothed with a 7-day rolling mean:
  engagement_index_state = states_engagement_mean_df[states_engagement_mean_df.index == statename].transpose()
  engagement_index_state = engagement_index_state.rolling(7, min_periods=1).mean()
  engagement_index_state.index = pd.to_datetime(engagement_index_state.index)

  fig, ax = plt.subplots(figsize=(20, 10))
  engagement_index_state.plot(ax=ax)
  ax.set_title('State of {} engagement_index 2020'.format(statename), size=20)
  ax.axvline(pd.to_datetime(closedate), color='red', linestyle='--', label='The date a state closed K-12 public schools statewide.')
  ax.legend()  # called after axvline so that the closure-date label is shown
  plt.show();

In [None]:
plot_engagement('Illinois', '2020-03-17')

In [None]:
plot_engagement('Texas', '2020-03-21')

For Texas, the gap is probably due to missing data/not enough data in our dataset.

In [None]:
plot_engagement('Michigan', '2020-03-16')

For Michigan we see a very big difference between the first and second parts of the year. Again, we have much less data for this State, which may partially explain the difference.

In [None]:
plot_engagement('California', '2020-03-23')

No such excuse for California. Despite a fairly large and complete sample, California's performance on both pct_access and engagement_index is only half that of the States at the top of the ranking, namely Illinois and New York.