# **Introduction**

The spread of COVID-19 has sent shockwaves across the globe. Emerging evidence from some of the region’s highest-income countries indicates that the pandemic is giving rise to learning losses and increases in inequality.

Some countries have introduced short-term support measures, such as providing digital learning devices, financial support for students and schools, and funds for safety and cleaning equipment. Countries used online platforms in three ways: (a) educational content, (b) real-time lessons on virtual meeting platforms, and (c) self-paced formalised lessons.

One limitation of emergency remote learning is the loss of instructional time delivered in a school setting and the lack of personal interaction between teacher and student.

In this study, we investigate the different modes of learning and the socioeconomic factors that may contribute to the level of engagement amongst students. Please refer to [this Kaggle page](https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/data) for the datasets and data dictionaries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **Content Outline**
> * # [Exploratory Data Analysis](#section-one)
> * # [Merging Datasets](#section-two)
> * # [Data Visualization](#section-three)
> * # [Conclusion & Proposed Recommendations](#section-four)

<a id="section-one"></a>

# 1. Exploratory Data Analysis

The analysis will be performed in the following order:

- (a) District Data
- (b) Product Data
- (c) Engagement Data

### (a) Background on District Data:
- The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. For data generalization purposes, some data points are released as a range that contains the actual value. Additionally, many values are missing and marked as 'NaN', indicating that the data were suppressed to maximize anonymization of the dataset.

In [None]:
district = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

In [None]:
district.info()

In [None]:
district.head(10)

In [None]:
#Because columns like 'pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio' and 'pp_total_raw' are formatted as a range, we may wish to split the minimum value and maximum value into different columns

#'pct_black/hispanic'
district[['Min pct_black/hispanic','Max pct_black/hispanic']] = district['pct_black/hispanic'].str.split(",",expand=True)
district['Min pct_black/hispanic'] = district['Min pct_black/hispanic'].str.strip('[')
district['Max pct_black/hispanic'] = district['Max pct_black/hispanic'].str.strip('[')


#'pct_free/reduced'
district[['Min pct_free/reduced','Max pct_free/reduced']] = district['pct_free/reduced'].str.split(",",expand=True)
district['Min pct_free/reduced'] = district['Min pct_free/reduced'].str.strip('[')
district['Max pct_free/reduced'] = district['Max pct_free/reduced'].str.strip('[')

#'county_connections_ratio'
district[['Min county_connections_ratio','Max county_connections_ratio']] = district['county_connections_ratio'].str.split(",",expand=True)
district['Min county_connections_ratio'] = district['Min county_connections_ratio'].str.strip('[')
district['Max county_connections_ratio'] = district['Max county_connections_ratio'].str.strip('[')

#'pp_total_raw'
district[['Min pp_total_raw','Max pp_total_raw']] = district['pp_total_raw'].str.split(",",expand=True)
district['Min pp_total_raw'] = district['Min pp_total_raw'].str.strip('[')
district['Max pp_total_raw'] = district['Max pp_total_raw'].str.strip('[')

#drop the original range columns now that they have been split into Min/Max columns
district.drop(['pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw'], axis = 1, inplace = True)
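
The four near-identical blocks above could also be folded into one helper. The following is a minimal sketch, assuming the ranges look like `[0.2, 0.4[` (as the `strip('[')` calls suggest). The `split_range` function and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy frame with made-up values in the assumed "[min, max[" range format.
toy = pd.DataFrame({
    "pct_free/reduced": ["[0, 0.2[", "[0.2, 0.4[", None],
    "pp_total_raw": ["[14000, 16000[", None, "[8000, 10000["],
})

def split_range(df, col):
    """Add numeric 'Min <col>' and 'Max <col>' columns parsed from a range string."""
    # Capture the two numbers between the brackets; non-matching rows become NaN.
    bounds = df[col].str.extract(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\[")
    df[f"Min {col}"] = pd.to_numeric(bounds[0], errors="coerce")
    df[f"Max {col}"] = pd.to_numeric(bounds[1], errors="coerce")
    return df

range_cols = ["pct_free/reduced", "pp_total_raw"]
for col in range_cols:
    toy = split_range(toy, col)

# Drop the original string columns once the numeric bounds exist.
toy = toy.drop(columns=range_cols)
print(toy)
```

Parsing to numbers inside the helper keeps missing (suppressed) ranges as `NaN` instead of failing, which matches how the notebook treats them later.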

In [None]:
#convert object to numbers
cols = district.columns.drop(['state', 'locale'])
district[cols] = district[cols].apply(pd.to_numeric, errors = 'coerce')

In [None]:
district.head()

In [None]:
district_max = district[['locale', 'Max pct_black/hispanic',
                         'Max pct_free/reduced', 
                         'Max county_connections_ratio',
                         'Max pp_total_raw']]

district_min = district[['locale', 'Min pct_black/hispanic',
                         'Min pct_free/reduced',
                         'Min county_connections_ratio',
                         'Min pp_total_raw']]

### (b) Background on Product Data:
- The products_info.csv file includes information about the characteristics of the top 372 products with most users in 2020. 
- Data were labeled by our team. Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

In [None]:
product = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
product.info()

In [None]:
product.head(10)

In [None]:
product[['Category', 'Subcategory']] = product['Primary Essential Function'].str.split("-", expand = True, n =1)
product.drop('Primary Essential Function', axis = 1, inplace = True)
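
Splitting on the first `-` leaves the surrounding spaces attached to both parts. A small sketch of the same split with whitespace trimmed is shown here. The two label strings are illustrative stand-ins, not rows copied from products_info.csv.

```python
import pandas as pd

# Illustrative labels in the assumed "<category> - <subcategory>" form.
toy_product = pd.DataFrame({
    "Primary Essential Function": [
        "LC - Digital Learning Platforms",
        "CM - Classroom Engagement & Instruction - Classroom Management",
        None,  # products without a label stay missing
    ]
})

# n=1 splits only on the first hyphen, so any later hyphens stay in the subcategory.
parts = toy_product["Primary Essential Function"].str.split("-", n=1, expand=True)
toy_product["Category"] = parts[0].str.strip()
toy_product["Subcategory"] = parts[1].str.strip()
print(toy_product[["Category", "Subcategory"]])
```

Trimming matters for later `groupby` and `get_dummies` calls, where `"LC "` and `"LC"` would otherwise count as different categories.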

In [None]:
product.head()

### (c) Background on Engagement Data:
- The engagement data are aggregated at school district level, and each file in the folder engagement_data represents data from one school district. 
- The 4-digit file name represents district_id, which can be used to link to district information in districts_info.csv.
- The lp_id can be used to link to product information in products_info.csv.
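
As a minimal sketch of how a district_id can be recovered from a file name, the example below uses `pathlib`. The two paths are hypothetical. It also shows why `str.strip` is a risky way to remove an extension: it removes a set of characters from both ends, not a suffix.

```python
from pathlib import Path

# Hypothetical engagement file paths; only the naming pattern matters here.
paths = ["engagement_data/1000.csv", "engagement_data/8815.csv"]

# Path.stem drops the directory and the final extension in one step.
district_ids = [int(Path(p).stem) for p in paths]
print(district_ids)

# str.strip('.csv') strips any of the characters '.', 'c', 's', 'v' from both ends.
# It happens to work for purely numeric names but mangles other names.
stripped = "vics.csv".strip(".csv")
print(repr(stripped))
```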

In [None]:
import glob
import os

# get data file names
globbed_files = glob.glob("../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/*.csv") 
engage = []

In [None]:
for filename in globbed_files:
    frame = pd.read_csv(filename)
    frame['district_id'] = os.path.basename(filename)
    engage.append(frame)

# Concatenate all data into one DataFrame
engage = pd.concat(engage, ignore_index=True)

In [None]:
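# Remove the '.csv' extension literally (str.strip would strip a character set, not a suffix)
engage['district_id'] = engage['district_id'].str.replace('.csv', '', regex=False).astype(int)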

In [None]:
engage.head()

In [None]:
engage['time'] = pd.to_datetime(engage['time'], errors = 'coerce')
engage['time(month)'] = pd.DatetimeIndex(engage['time']).month
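
The effect of `errors = 'coerce'` on the month column can be sketched on a toy frame. The dates and engagement values below are made up. Unparseable dates become `NaT`, their month becomes `NaN`, and `groupby` then leaves those rows out of the monthly means.

```python
import pandas as pd

# Made-up rows, including one value that is not a valid date.
toy_engage = pd.DataFrame({
    "time": ["2020-01-15", "2020-01-20", "2020-02-01", "not a date"],
    "engagement_index": [10.0, 30.0, 50.0, 70.0],
})

# Invalid dates are coerced to NaT instead of raising an error.
toy_engage["time"] = pd.to_datetime(toy_engage["time"], errors="coerce")
toy_engage["time(month)"] = toy_engage["time"].dt.month

# Rows with a missing month are excluded from the grouped result by default.
monthly_mean = toy_engage.groupby("time(month)")["engagement_index"].mean()
print(monthly_mean)
```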

In [None]:
engage[['pct_access', 'engagement_index']].describe()

<a id="section-two"></a>
# 2. Merging Datasets

In [None]:
# The 4-digit file name represents district_id, which can be used to link to district information in districts_info.csv.
# The lp_id can be used to link to product information in products_info.csv.

In [None]:
engage_district = pd.merge(engage, district, on = 'district_id')

In [None]:
final = pd.merge(engage_district, product, left_on = 'lp_id', right_on = 'LP ID')
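
Both merges use the default inner join, so rows without a matching district or product are dropped silently. One way to see how many rows that affects is a left join with `indicator`, sketched here on toy frames. All values and the `*_toy` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three tables; values are invented for illustration.
engage_toy = pd.DataFrame({
    "district_id": [1000, 1000, 2000],
    "lp_id": [1, 2, 3],
    "engagement_index": [5.0, 7.0, 9.0],
})
district_toy = pd.DataFrame({"district_id": [1000], "locale": ["City"]})
product_toy = pd.DataFrame({"LP ID": [1, 3], "Category": ["LC", "CM"]})

# validate="many_to_one" raises if the right-hand key is not unique.
step1 = engage_toy.merge(district_toy, on="district_id", how="left",
                         validate="many_to_one", indicator="district_match")
step2 = step1.merge(product_toy, left_on="lp_id", right_on="LP ID", how="left",
                    validate="many_to_one", indicator="product_match")

# Rows flagged "left_only" would have been lost by an inner join.
unmatched_districts = int((step1["district_match"] == "left_only").sum())
unmatched_products = int((step2["product_match"] == "left_only").sum())
print(unmatched_districts, unmatched_products)
```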

In [None]:
final.isnull().sum()

In [None]:
final.dropna(inplace = True)

In [None]:
#The entire dataset is large, so work with a 5% random sample (frac=0.05; replace=True samples with replacement, so rows can repeat).

final = final.sample(frac=0.05, replace=True, random_state=1)
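
A short sketch of what the `sample` arguments do, on a toy frame with made-up rows: `frac` is a fraction of the row count, so `0.05` keeps 5% of the rows, and `replace` controls whether the same row may be drawn more than once.

```python
import pandas as pd

# Toy frame with 1000 rows standing in for the merged dataset.
toy_final = pd.DataFrame({"row": range(1000)})

# Sampling with replacement: the same row can appear more than once.
with_replacement = toy_final.sample(frac=0.05, replace=True, random_state=1)

# Sampling without replacement: every sampled row is distinct.
without_replacement = toy_final.sample(frac=0.05, replace=False, random_state=1)

print(len(with_replacement), len(without_replacement))
print(without_replacement.index.is_unique)
```

For summary statistics, sampling without replacement is usually the simpler choice, because no row is counted twice.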

In [None]:
final.info()

In [None]:
final.describe()

<a id="section-three"></a>
# 3. Data Visualization

* **The mean engagement index improved over the course of 2020, with a marked dip between April and July before picking up again in August. This could be attributed to the 10 to 11-week summer break in the USA, which begins between May and June and ends between August and September. Notably, September was the month with the highest mean engagement index. Learning & Curriculum emerged as the main category of digital learning tools and recorded one of the highest engagement indexes.**

In [None]:
df1 = final.groupby(['time(month)'])['engagement_index'].mean()
df1.plot(kind = 'bar', color = 'r')

In [None]:
sns.lmplot(data = final, x = 'time(month)', y = 'engagement_index', hue = 'Category')

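* **`county_connections_ratio` is the ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households, based on county-level data from FCC Form 477 (December 2018 version). From Jan 2020 to Dec 2020, the mean minimum and mean maximum connection ratios remained constant. Because the ratio is a single December 2018 value per district, any month-to-month movement in these means would only reflect which districts have rows in each month, not a change in connectivity.**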

In [None]:
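# Select several columns with a list (double brackets); tuple-style selection fails in recent pandas
df2 = final.groupby('time(month)')[['Min county_connections_ratio', 'Max county_connections_ratio']].mean()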
df2.plot(kind = 'line')

* **The percentage of students in the districts identified as Black or Hispanic, based on 2018-19 NCES data, remains roughly constant throughout 2020, at approximately 30% to 35%. The percentage of Black or Hispanic students is highest in City districts and lowest in Town districts.**

In [None]:
df3 = final.groupby('time(month)')[['Min pct_black/hispanic', 'Max pct_black/hispanic']].mean()
df3.plot(kind = 'line')

In [None]:
df4 = final.groupby('locale')[['Min pct_black/hispanic', 'Max pct_black/hispanic']].mean()
df4.plot(kind = 'bar')

* **Both the minimum and maximum percentage of students in the districts eligible for free or reduced-price lunch, based on 2018-19 NCES data, are highest in City districts, followed by Town, Rural and Suburb.**

In [None]:
df5 = final.groupby('locale')[['Min pct_free/reduced', 'Max pct_free/reduced']].mean()
df5.plot(kind = 'bar')

* **Per-pupil total expenditure (the sum of local and federal expenditure) comes from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. By locale, Rural and Suburb districts have higher per-pupil total expenditure than City and Town districts.**

In [None]:
df6 = final.groupby('locale')[['Min pp_total_raw', 'Max pp_total_raw']].mean()
df6.plot(kind = 'bar')

* **The percentage of students in the district who have at least one page-load event of a given product on a given day (`pct_access`) is highest in Rural and Suburb districts, followed by Town and City. This suggests that the engagement level should also be higher in Rural and Suburb districts. The second chart below, which plots the mean engagement index by locale, is consistent with this hypothesis.**

In [None]:
df7 = final.groupby('locale')['pct_access'].mean()
sns.set_style("white")
plt.figure(figsize = (10, 5))
df7.plot(kind = 'bar', ylabel = 'pct_access', title = 'Percentage of students in the district with at least one page-load event of a given product on a given day')

In [None]:
df7 = final.groupby('locale')['engagement_index'].mean()
sns.set_style("white")
plt.figure(figsize = (10, 5))
df7.plot(kind = 'bar', ylabel = 'engagement_index')
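
Comparing two bar charts shows that the locales rank similarly on both measures, but not how strongly the measures move together. A minimal sketch of putting a number on that relationship is given below. The toy values are invented, so the printed correlation says nothing about the real data.

```python
import pandas as pd

# Invented rows; only the method is of interest here.
toy_access = pd.DataFrame({
    "locale": ["City", "City", "Town", "Rural", "Suburb", "Suburb"],
    "pct_access": [0.1, 0.2, 0.4, 0.8, 0.5, 0.3],
    "engagement_index": [10.0, 25.0, 38.0, 90.0, 60.0, 30.0],
})

# Row-level Pearson correlation between the two measures.
pearson_r = toy_access["pct_access"].corr(toy_access["engagement_index"])

# Locale-level means, the same quantities the two bar charts display.
by_locale = toy_access.groupby("locale")[["pct_access", "engagement_index"]].mean()

print(round(pearson_r, 3))
print(by_locale)
```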

* **Amongst the different learning tools, Learning & Curriculum (LC) achieved the highest adoption. Specifically, digital learning platforms were the subcategory with the highest take-up rate. In addition, the sector with the highest adoption rate was PreK-12. By contrast, the corporate and higher ed sectors made up a minority of the demand for learning tools.**

In [None]:
sns.set_style("white")
plt.figure(figsize = (7, 10))
final['Category'].value_counts(normalize = True).plot(kind = 'pie')

In [None]:
sns.set_style("white")
plt.figure(figsize = (7, 10))
final['Subcategory'].value_counts(normalize = True).plot(kind = 'pie')

In [None]:
plt.figure(figsize = (10, 5))
final['Sector(s)'].value_counts(normalize = True).plot(kind = 'pie')

* **Based on the PCA, the following features appear to be correlated with the principal components (PCs):**

    1. Min/max pct black/hispanic is highly and positively correlated to PC 2
    2. Min/max pct free/reduced is highly and positively correlated to PC 2
    3. Category_LC is highly and negatively correlated to PC 1
    4. Sectors_PreK-12 is moderately and negatively correlated to PC 1
    5. Locale_City is moderately and positively correlated to PC 2
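
The rows of `pca.components_` are unit-length weight vectors, not correlations, so statements like those above can be checked by correlating each scaled feature with the PC scores directly. The following is a sketch on synthetic data. The column names `x1`, `x2`, `x3` and the random values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: x1 and x2 are strongly related, x3 is independent noise.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
toy_features = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=n),
    "x3": rng.normal(size=n),
})

# Standardize, fit a 2-component PCA, and compute the PC scores.
scaled = StandardScaler().fit_transform(toy_features)
pca_toy = PCA(n_components=2).fit(scaled)
scores = pd.DataFrame(pca_toy.transform(scaled), columns=["PC1", "PC2"])
scaled_df = pd.DataFrame(scaled, columns=toy_features.columns)

# Pearson correlation of every feature with every PC score.
feature_pc_corr = (
    pd.concat([scaled_df, scores], axis=1)
    .corr()
    .loc[toy_features.columns, ["PC1", "PC2"]]
)
print(feature_pc_corr.round(2))
```

The sign of a principal component is arbitrary, so the magnitude of each correlation, and whether two features share a sign, is more informative than the sign itself.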

In [None]:
num_features = final[['pct_access', 'engagement_index',
       'time(month)', 'Min pct_black/hispanic',
       'Max pct_black/hispanic', 'Min pct_free/reduced',
       'Max pct_free/reduced', 'Min county_connections_ratio',
       'Max county_connections_ratio', 'Min pp_total_raw', 'Max pp_total_raw']]
cat_features = final[['locale', 'Sector(s)',
       'Category', 'Subcategory']]

In [None]:
cat_features = pd.get_dummies(cat_features)

In [None]:
features = pd.concat([num_features, cat_features], axis = 1)

In [None]:
features.head()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(features)

In [None]:
scaled_features = scaler.transform(features)

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(scaled_features)
x_pca = pca.transform(scaled_features)

In [None]:
plt.figure(figsize = (8, 6))
plt.scatter(x_pca[:, 0], x_pca[:,1], c = final['engagement_index'], cmap = 'plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

In [None]:
pca.components_

In [None]:
df_comp = pd.DataFrame(pca.components_, columns = features.columns)

In [None]:
plt.figure(figsize=(20,6))
sns.heatmap(df_comp,cmap='plasma',)

In [None]:
final_pca = final[['Min pct_black/hispanic', 'Min pct_free/reduced','engagement_index']]

* **As the minimum percentage of Black/Hispanic students increases, so does the minimum percentage of students in the districts who are eligible for free or reduced-price lunch, based on 2018-19 NCES data. The engagement index is consistent across communities with different minimum percentages of Black/Hispanic students and of students eligible for free or reduced-price lunch. This may imply that the engagement index is independent of these socioeconomic factors.**

In [None]:
hypothesis = final[['Min pct_black/hispanic', 'Min pct_free/reduced', 'Min pp_total_raw', 'engagement_index']]
sns.heatmap(hypothesis.corr(), annot=True, cmap = 'Spectral')

* **The mean engagement level is highest for School & District Operations (SDO), followed by Classroom Management (CM), Learning & Curriculum (LC) and a combination of the three. By sector, the mean engagement level is highest for "PreK-12; Higher Ed; Corporate", followed by "PreK-12", "Corporate", "PreK-12; Higher Ed" and "Higher Ed; Corporate". Rural and Suburb districts have the higher engagement index.**

In [None]:
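# Select the numeric column before aggregating; .mean() over all columns fails on text columns in recent pandas
category = final.groupby('Category')['engagement_index'].mean().sort_values(ascending=False)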
category.plot(kind = 'bar', ylabel = 'Mean Engagement Index')

In [None]:
sector = final.groupby('Sector(s)')['engagement_index'].mean().sort_values(ascending=False)
sector.plot(kind = 'bar', ylabel = 'Mean Engagement Index')

In [None]:
locale = final.groupby('locale')['engagement_index'].mean().sort_values(ascending=False)
locale.plot(kind = 'bar', ylabel = 'Mean Engagement Index')

<a id="section-four"></a>

# **4. Conclusion & Proposed Recommendations**
* **Over Time:** The mean engagement index improved over the course of 2020, with a marked dip between April and July before picking up again in August. This could be attributed to the 10 to 11-week summer break in the USA, which begins between May and June and ends between August and September. Notably, September was the month with the highest mean engagement index.
* **Digital Learning Tools:** Amongst the different learning tools, Learning & Curriculum (LC) achieved the highest adoption. Specifically, digital learning platforms were the subcategory with the highest take-up rate. Despite the high adoption rate of LC/digital learning platforms, the mean engagement level is highest for School & District Operations (SDO), followed by Classroom Management (CM), LC and a combination of the three. Therefore, more could be done to boost the adoption of SDO tools to increase the engagement index.
* **Sectors:** The sector with the highest adoption rate was PreK-12. By contrast, the corporate and higher ed sectors made up a minority of the demand for learning tools. The mean engagement level is highest for "PreK-12; Higher Ed; Corporate", followed by "PreK-12", "Corporate", "PreK-12; Higher Ed" and "Higher Ed; Corporate".
* **Expenditure:** The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. By locale, Rural and Suburb districts have higher per-pupil total expenditure than City and Town districts. Rural and Suburb districts are also the areas with the higher engagement index.
* **One page-load event:** The percentage of students in the district who have at least one page-load event of a given product on a given day is highest in Rural and Suburb districts, followed by Town and City. This suggests that the engagement level should also be higher in Rural and Suburb districts, and the mean engagement index by locale is consistent with this hypothesis.
* **% of Black/Hispanic Students and Lunch Subsidies:** As the minimum percentage of Black/Hispanic students increases, so does the minimum percentage of students in the districts who are eligible for free or reduced-price lunch, based on 2018-19 NCES data. The engagement index is consistent across communities with different minimum percentages of Black/Hispanic students and of students eligible for free or reduced-price lunch. This may imply that the engagement index is independent of these socioeconomic factors.
* **Strength of Internet Connection:** The connection ratio (residential fixed high-speed connections over 200 kbps in at least one direction, divided by households) is based on county-level data from FCC Form 477 (December 2018 version). From Jan 2020 to Dec 2020, the mean minimum and mean maximum connection ratios remained constant.
* Further investigation would be required to ensure that the findings above are conclusive.