# LearnPlatform🏆: COVID-19😷 Impact on Digital Learning 

The challenge involves exploring:
* the state of digital learning in 2020 and 
* how engagement with digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

**Table of Contents**

* [Loading packages📚](#section-one)
* [Loading the data📘📗📒](#section-two)
* [Merge Files📁](#section-three)
* [Exploratory Data Analysis📊](#section-four)
    * [Districts 🔵](#subsection-one)
    * [Products 🟡](#subsection-two)
    * [Merged data🟢](#subsection-three)
        * [Time series plot of engagement index and % access📉](#subsection-three-one)
        * [Engagement vs district demographics📈](#subsection-three-two)
        * [Engagement vs internet access📉](#subsection-three-three)

<a id="section-one"></a>
# Loading packages📚

In [None]:
import pandas as pd
import numpy as np
import ast
import os
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
#~~~HELPER FUNCTIONS~~~

'''The fix_column function uses the str_repl function to 
clean the columns that have values with intervals'''
def str_repl(j, str2find, str2repl, n):
    find = j.find(str2find)
    i = find != -1
    while find != -1 and i != n:
        find = j.find(str2find, find + 1)
        i += 1
    if i == n:
        return j[:find] + str2repl + j[find+len(str2find):]
    return j

def fix_column(data:pd.DataFrame,col:str):
    y = []
    for j in data[col]:
        try:
            j = str_repl(j,"[","]",2)
        except AttributeError:
            j = j
        y.append(j)
    return y

'''get average of features with interval values'''
def agg_func(data:pd.DataFrame,col:str)->list:
    x = []
    for i in data[col]:
        try:
            new = ast.literal_eval(i) 
            val = np.mean(new)
        except:
            val = i
        x.append(val)
    return x

'''checks null values and proportion of null values per feature'''
def check_proportion_null(df:pd.DataFrame):
    dff = pd.DataFrame()
    proportion = []
    summing = []
    for i in df.columns:
        summ = df[i].isnull().sum()
        summing.append(summ)
        missing =  summ/(df.shape[0])*100
        missing = missing.round(2).astype(str) + "%"   
        proportion.append(missing)
    dff['columns'] = df.columns.to_list()
    dff['sum_null_values'] = summing
    dff['proportion_null_values'] = proportion
    return dff
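
The same null summary can also be computed without an explicit loop. The sketch below is a vectorized alternative, not the helper used in this notebook; `null_summary` and the toy frame are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, computed column-wise."""
    n_null = df.isnull().sum()                       # number of nulls per column
    pct_null = (df.isnull().mean() * 100).round(2)   # share of nulls per column, in %
    return pd.DataFrame({
        "columns": df.columns.to_list(),
        "sum_null_values": n_null.to_list(),
        "proportion_null_values": (pct_null.astype(str) + "%").to_list(),
    })

# Hypothetical toy frame, not the competition data.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": ["x", "y", None, "z"]})
summary = null_summary(toy)
print(summary)
```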

<a id="section-two"></a>
# Loading the data 📘📗📒

**District Information data**

The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, we removed the identifiable information about the school districts. We also used an open source tool ARX (Prasser et al. 2020) to transform several data fields and reduce the risks of re-identification. For data generalization purposes, some data points are released as a range that contains the actual value. ***Additionally, many values are missing and marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.***

|Name | Description |
| --- | --- |
|district_id | The unique identifier of the school district |
|state | The state in which the district is located |
|locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information. |
|pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
|pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
|county_connections_ratio | Ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households, based on county-level data from FCC Form 477 (December 2018 version). See FCC data for more information. |
|pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |

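To make the range encoding concrete, here is a minimal sketch of turning a range string into a single number by taking its midpoint. It assumes the ranges are written as half-open intervals such as `[0.2, 0.4[`, which is consistent with the `fix_column` helper replacing the second `[` with `]`. The function name and example values are hypothetical, and unlike `agg_func` it returns NaN for anything it cannot parse.

```python
import re
import numpy as np

# Matches strings of the assumed form "[low, high[".
_INTERVAL = re.compile(r"^\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\[$")

def interval_midpoint(value):
    """Return the midpoint of a '[low, high[' string, or NaN if it does not parse."""
    if not isinstance(value, str):
        return np.nan
    match = _INTERVAL.match(value.strip())
    if match is None:
        return np.nan
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2

# Hypothetical example values, not taken from the data set.
examples = ["[0.2, 0.4[", "[14000, 16000[", None]
midpoints = [interval_midpoint(v) for v in examples]
print(midpoints)
```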
In [None]:
district = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
cols = ['pct_black/hispanic','pct_free/reduced', 'county_connections_ratio', 'pp_total_raw']
for i in cols:
    district[i] = fix_column(district,i)

district['pct_black/hispanic'] = agg_func(district,'pct_black/hispanic')
district['pct_free/reduced'] = agg_func(district,'pct_free/reduced')
district['pp_total_raw'] = agg_func(district,'pp_total_raw')
district.head()

In [None]:
check_proportion_null(district)

In [None]:
'''district_id is the only column that doesn't have missing values.
We create a new df without the district_id column and check whether there are rows that have missing values
in all remaining columns. We drop these rows from our district data'''
district_df = district[district.columns.drop('district_id')]

district = district[~district_df.isnull().all(axis=1)]
district.head()

In [None]:
check_proportion_null(district)

**Product information data**

The product file products_info.csv includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

|Name | Description | 
| --- | --- |
|LP ID | The unique identifier of the product |
|URL | Web Link to the specific product |
|Product Name | Name of the specific product |
|Provider/Company Name | Name of the product provider |
|Sector(s) | Sector of education where the product is used |
|Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled. |

In [None]:
product = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
product.rename(columns = {'LP ID':'lp_id'},inplace = True)

'''creating a new column with the 3 main categories of Primary Essential Functions; LC,CM,SDO'''
z = []
for i in product['Primary Essential Function']:
    try:
        i = i.split('-')[0]
    except:
        i = i
    z.append(i)
    
product['Essential_category'] = z
product.head()

In [None]:
check_proportion_null(product)

**Engagement data**

The engagement data are aggregated at school district level, and each file in the folder engagement_data represents data from one school district. The 4-digit file name represents **district_id, which can be used to link to district information in districts_info.csv.** The lp_id can be used to link to product information in products_info.csv.

| Name | Description |
| --- | --- |
|time | date in "YYYY-MM-DD"|
|lp_id | The unique identifier of the product |
|pct_access |Percentage of students in the district that have at least one page-load event of a given product on a given day |
|engagement_index | Total page-load events per one thousand students of a given product on a given day |

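The file-name convention described above can be sketched with `pathlib`, where the file stem becomes the `district_id`. This is an illustrative alternative to the `os.walk` loop, not the loader used in this notebook. The temporary folder, the district ids `1000` and `1039`, and all values are hypothetical so that the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_engagement(folder: Path) -> pd.DataFrame:
    """Concatenate every CSV in `folder`, tagging rows with the file stem as district_id."""
    frames = []
    for path in sorted(folder.glob("*.csv")):
        frame = pd.read_csv(path, parse_dates=["time"])
        frame["district_id"] = path.stem  # "1000.csv" -> "1000" (kept as a string)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Build a hypothetical two-district folder on the fly.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for district_id in ("1000", "1039"):
        pd.DataFrame({
            "time": ["2020-01-01", "2020-01-02"],
            "lp_id": [101, 102],
            "pct_access": [0.5, 1.2],
            "engagement_index": [10.0, 25.5],
        }).to_csv(folder / f"{district_id}.csv", index=False)
    toy_engagement = load_engagement(folder)

print(toy_engagement)
```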
In [None]:
schools = []
for i, (dirpath, dirnames, filenames) in enumerate(os.walk("../input/learnplatform-covid19-impact-on-digital-learning/engagement_data")):
    for file in filenames:
        f = dirpath + "/"+file
        data = pd.read_csv(f,index_col=False)
        district_id = file.replace(".csv","")
        data['district_id'] = district_id
        schools.append(data)

engagement = pd.concat(schools)
engagement = engagement.reset_index(drop=True)
engagement['time'] = pd.to_datetime(engagement['time'])
engagement['day'] = engagement['time'].dt.day_name()
engagement['month'] = engagement['time'].dt.month_name()
engagement.head()

In [None]:
check_proportion_null(engagement)

In [None]:
'''drop all rows where engagement index is null, because a proportion > 20% could skew the data 
and also because we have about 21 million rows of data'''
engagement = engagement[engagement['engagement_index'].notna()]
"""we drop the remaining rows where lp_id is null as it's a unique product identifier"""
engagement = engagement[engagement['lp_id'].notna()]
check_proportion_null(engagement)

<a id="section-three"></a>
# Merge Files📁

In [None]:
'''This step involves merging the 3 dataframes i.e. district,product and engagement to one, 
based on columns LP ID (product and engagement) and district_id (district and engagement)'''
df = pd.merge(product,engagement,left_on = 'lp_id',right_on = 'lp_id')
district['district_id'] = district['district_id'].astype(str)
data = pd.merge(df,district,left_on = 'district_id', right_on = 'district_id')
data.rename(columns={'Product Name':'Product_Name','Provider/Company Name':'Provider_Name'},inplace=True)
data = data[~data['county_connections_ratio'].isnull()]
#data.to_csv('merged.csv',index=False)
data.head()
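
The merge step depends on the key columns having matching dtypes, which is why `district_id` is cast to a string before merging. The sketch below reproduces the idea on hypothetical miniature tables and adds `validate="many_to_one"`, an optional pandas check that each key is unique on the right-hand side. All table contents are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
product = pd.DataFrame({"lp_id": [1, 2], "Product Name": ["Tool A", "Tool B"]})
engagement = pd.DataFrame({
    "lp_id": [1, 1, 2, 3],
    "district_id": ["1000", "1039", "1000", "1000"],  # strings, taken from file names
    "engagement_index": [10.0, 20.0, 5.0, 7.0],
})
district = pd.DataFrame({"district_id": [1000, 1039], "state": ["Utah", "Illinois"]})

# Align the key dtypes first so the string ids from the file names can match.
district["district_id"] = district["district_id"].astype(str)

merged = (
    engagement
    .merge(product, on="lp_id", how="inner", validate="many_to_one")
    .merge(district, on="district_id", how="inner", validate="many_to_one")
)
print(merged)  # the row with lp_id 3 is dropped because no product matches it
```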

<a id="section-four"></a>
# Exploratory Data Analysis📊

In [None]:
#~~~Plotting functions~~~
#pie chart plot
def plot_pie(data:pd.DataFrame,col:str,n:int)->None:
    data[col].value_counts().plot(kind = 'pie',autopct = '%1.1f%%',explode=[0.05]*n,
                                colormap = 'summer', figsize = (10,10)).legend()
    plt.title(f'Distribution of {col}',size=15)
    plt.show()
    
#distribution plot
def plot_distribution(data:pd.DataFrame,col1:str,col2:str)->None:
    dist = data.groupby([col1,col2]).size().reset_index().pivot(columns = col2,index = col1,values=0)
    dist.plot(kind = 'bar',stacked = True,colormap = 'RdYlGn')
    plt.show()

def count_plot(data:pd.DataFrame,col:str):
    sns.countplot(y=col,data=data,order=data[col].value_counts().index,palette="RdYlGn",linewidth=3)
    plt.title(f'Count of {col}', size=15)
    plt.show()

def top_n(data:pd.DataFrame,col:str,n:int):
    sns.countplot(y=col,data = data, order=data[col].value_counts().head(n).index,palette='RdYlGn',linewidth=3)
    plt.title(f'{col} top {n}',size = 15)
    plt.show()
    
def plot_main(data:pd.DataFrame,col:str,labels:list,n)->None:
    values = []
    for i in labels:
        x = data[col].str.count(i).sum()
        values.append(x)

    plt.figure(figsize = (10,10))    
    plt.pie(values,labels = labels,explode = [0.05]*n,autopct = '%1.1f%%',pctdistance=0.85)
    centre_circle = plt.Circle((0,0),0.5,fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)
    plt.title(f'{col} distribution',size=15)
    plt.legend()
    plt.show()

def plot_bar(data:pd.DataFrame,col1:str,col2:str):
    df=data.sort_values(by=col1,ascending=False)
    plt.figure(figsize=(12,6))
    plt.subplot(1,2,1)
    # x and y are passed by keyword; seaborn 0.12+ rejects them as positional arguments
    ax=sns.barplot(x=df.index,y=df[col1],palette='RdYlGn',dodge=False)
    ax.set_xticklabels(df.index,rotation=80)
    plt.title(f'Distribution of {col1}')
    
    plt.subplot(1,2,2)
    ax=sns.barplot(x=df.index,y=df[col2],palette='RdYlBu',dodge=False)
    ax.set_xticklabels(df.index,rotation=80)
    plt.title(f'Distribution of {col2}')
    plt.show()

<a id="subsection-one"></a>
## Districts🔵

In [None]:
dist = district.groupby('locale').agg({'state':'count','pct_black/hispanic':'mean','pct_free/reduced':'mean','pp_total_raw':'mean'}).reset_index()
dist.rename(columns={'state':'#school_districts_per_locale','pct_black/hispanic':'pct_black/hispanic_mean',
                     'pct_free/reduced':'pct_free/reduced_mean',},inplace=True)

fig,ax = plt.subplots(2,2,sharey=True,figsize = (15,10))
ax = ax.flatten()
    
# x and y are passed by keyword; seaborn 0.12+ rejects them as positional arguments
g1 = sns.barplot(x=dist['#school_districts_per_locale'],y=dist['locale'],palette='RdYlGn',dodge=False,orient = 'h',ax=ax[0])
g1.set(ylabel = None)
g1.set(title = 'No of school districts per locale')

g2 = sns.barplot(x=dist['pct_black/hispanic_mean'],y=dist['locale'],palette='RdYlBu',dodge=False,orient = 'h',ax=ax[1])
g2.set(ylabel = None)
g2.set(title = '% of students that identify as black or hispanic per locale')
    
g3 = sns.barplot(x=dist['pct_free/reduced_mean'],y=dist['locale'],palette='RdYlBu',dodge=False,orient = 'h',ax=ax[2])
g3.set(ylabel = None)
g3.set(title = '% of students eligible for free or reduced-price lunch per locale')
    
g4 = sns.barplot(x=dist['pp_total_raw'],y=dist['locale'],palette='RdYlGn',dodge=False,orient = 'h',ax=ax[3])
g4.set(ylabel = None)
g4.set(title = 'Per-pupil total expenditure (sum of local and federal expenditure) per locale')
    
fig.text(0.04, 0.5, 'Locale', va='center', rotation='vertical',size=15)
plt.show()

* Suburbs have the highest number of school districts but a lower % of students who identify as Black or Hispanic, especially when compared to Cities.
* Cities have the highest % of students eligible for free or reduced-price lunch and a considerably high per-pupil total expenditure.

In [None]:
plot_pie(district,'locale',4)

***Most of the school districts are located in the suburbs.***

In [None]:
count_plot(district,"locale")

In [None]:
count_plot(district,'state')

In [None]:
#distribution of locales per state
plot_distribution(district,'state','locale')

The states with the most school districts in each locale are:
* Connecticut and Utah, for Suburbs
* Connecticut, for Rural
* California, for Cities
* Utah, for Towns

<a id="subsection-two"></a>
## Products🟡

In [None]:
cols = ['Product Name','Provider/Company Name','Essential_category','Primary Essential Function']
for i in cols:
    print(f'Number of unique values in product feature {i}:',product[i].nunique())

In [None]:
#Plot only the 3 main sectors for better analysis
labels = ['PreK-12','Higher Ed','Corporate']
plot_main(product,'Sector(s)',labels,3)

In [None]:
#Top Providers
top_n(product,'Provider/Company Name',10)

In [None]:
top_n(product,'Primary Essential Function',15)

Primary Essential Function  represents the basic function of the product. 
Products are first labeled as one of these three categories: 
* LC = Learning & Curriculum
* CM = Classroom Management
* SDO = School & District Operations.

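The two label layers can be separated in one step. The sketch below assumes the labels follow a `CATEGORY - Subcategory` pattern and splits only on the first separator, which also avoids the trailing space left by splitting on `-` alone. The example label strings are hypothetical stand-ins, not a verified list from the data.

```python
import pandas as pd

# Hypothetical labels in the assumed "CATEGORY - Subcategory" form.
labels = pd.Series([
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction",
    "SDO - Data, Analytics & Reporting",
    None,  # unlabeled product
])

# Split once on " - " so any later hyphens stay inside the subcategory.
parts = labels.str.split(" - ", n=1, expand=True)
layers = pd.DataFrame({"category": parts[0], "subcategory": parts[1]})
print(layers)
```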
In [None]:
plot_pie(product,'Essential_category',4)

The Learning & Curriculum category has the most products.

<a id="subsection-three"></a>
## Merged data🟢

<a id="subsection-three-one"></a>
### Time series plot of engagement index and % access📉  

In [None]:
time_series = data.groupby('time').agg({'engagement_index':'mean','pct_access':'mean'}).reset_index()
time_series.set_index('time',inplace=True)
cols_plot = ['engagement_index','pct_access']
axes = time_series[cols_plot].plot(alpha=0.8, linestyle='solid',
                          title='Time Series plot of Engagement index and % access',grid=True,
                          figsize=(10,8), subplots=True, sharey=False,legend=False)
[ax.legend(loc=1) for ax in plt.gcf().axes]
plt.show()

* pct_access represents the % of students in the district that have at least one page-load event of a given product on a given day
* engagement_index represents the total page-load events per one thousand students of a given product and on a given day
* The WHO declared COVID-19 a pandemic in March 2020; schools then began to close to curb the spread of the virus
* In the United States, summer 2020 ran from 20 June to 22 September, and most school districts are on summer break for much of that period (exact dates vary by district). This likely explains why the engagement index and % access dropped over the summer, as students were on a break from learning.

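The two metric definitions above amount to simple ratios. The worked example below is a sketch inferred from those definitions, not LearnPlatform's actual computation; the function names and the district-day numbers are hypothetical.

```python
def pct_access(students_with_page_load: int, enrolled_students: int) -> float:
    """Percentage of students with at least one page load for a product on a day."""
    return 100 * students_with_page_load / enrolled_students

def engagement_index(page_loads: int, enrolled_students: int) -> float:
    """Page-load events per one thousand students for a product on a day."""
    return 1000 * page_loads / enrolled_students

# Hypothetical district-day for one product.
enrolled, active, loads = 2000, 300, 4500
access = pct_access(active, enrolled)
index = engagement_index(loads, enrolled)
print(access)  # 15.0
print(index)   # 2250.0
```

Because the engagement index counts events rather than students, it can exceed 1000 when students load many pages each, while pct_access stays between 0 and 100.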

In [None]:
col1_df = data[data.Product_Name.isin(['Google Docs'])]
col2_df = data[data.Product_Name.isin(['Google Classroom'])]
col3_df = data[data.Product_Name.isin(['YouTube'])]
col4_df = data[data.Product_Name.isin(['Canvas'])]
col5_df = data[data.Product_Name.isin(['Schoology'])]

col1_dff = col1_df.groupby('time')['engagement_index'].mean().reset_index()
col1_dff.rename(columns = {'engagement_index':'Google Docs'},inplace=True)
col2_dff = col2_df.groupby('time')['engagement_index'].mean().reset_index()
col2_dff.rename(columns = {'engagement_index':'Google Classroom'},inplace=True)
col3_dff = col3_df.groupby('time')['engagement_index'].mean().reset_index()
col3_dff.rename(columns = {'engagement_index':'YouTube'},inplace=True)
col4_dff = col4_df.groupby('time')['engagement_index'].mean().reset_index()
col4_dff.rename(columns = {'engagement_index':'Canvas'},inplace=True)
col5_dff = col5_df.groupby('time')['engagement_index'].mean().reset_index()
col5_dff.rename(columns = {'engagement_index':'Schoology'},inplace=True)

In [None]:
dff = col1_dff.merge(col2_dff, how='left', left_on='time', right_on='time').merge(col3_dff, how='left', left_on='time', right_on='time').merge(
    col4_dff, how='left', left_on='time', right_on='time').merge(col5_dff, how='left', left_on='time', right_on='time')
dff.set_index('time',inplace=True)
dff.plot()

In [None]:
plt.figure(figsize=(12,6))

g  = sns.lineplot(data=dff,palette = 'mako_r')
g.legend(loc = 'center right',bbox_to_anchor = (1.25,0.5),ncol=1,title = 'Learning Tool',)
plt.show()

In [None]:
def time_series_plot(df:pd.DataFrame,col1:str,col2:str,col3:str,col4:str,col5:str):
    col1_df = df[df.Product_Name.isin([col1])]
    col2_df = df[df.Product_Name.isin([col2])]
    col3_df = df[df.Product_Name.isin([col3])]
    col4_df = df[df.Product_Name.isin([col4])]
    col5_df = df[df.Product_Name.isin([col5])]
    
    col1_dff = col1_df.groupby('time')['engagement_index'].mean().reset_index()
    col1_dff.rename(columns = {'engagement_index':f'{col1}'},inplace=True)
    col2_dff = col2_df.groupby('time')['engagement_index'].mean().reset_index()
    col2_dff.rename(columns = {'engagement_index':f'{col2}'},inplace=True)
    col3_dff = col3_df.groupby('time')['engagement_index'].mean().reset_index()
    col3_dff.rename(columns = {'engagement_index':f'{col3}'},inplace=True)
    col4_dff = col4_df.groupby('time')['engagement_index'].mean().reset_index()
    col4_dff.rename(columns = {'engagement_index':f'{col4}'},inplace=True)
    col5_dff = col5_df.groupby('time')['engagement_index'].mean().reset_index()
    col5_dff.rename(columns = {'engagement_index':f'{col5}'},inplace=True)
    
    dff = col1_dff.merge(col2_dff, how='left', left_on='time', right_on='time').merge(col3_dff, how='left', left_on='time', right_on='time').merge(
    col4_dff, how='left', left_on='time', right_on='time').merge(col5_dff, how='left', left_on='time', right_on='time')
    dff.set_index('time',inplace=True)
    
    plt.figure(figsize=(12,6))
    g  = sns.lineplot(data=dff,palette = 'mako_r')
    g.legend(loc = 'center right',bbox_to_anchor = (1.25,0.5),ncol=1,title = 'Learning Tool')
    g.set(title='Engagement index of top 5 learning tools in 2020')
    plt.show()

In [None]:
time_series_plot(data,'Google Docs','Google Classroom','YouTube','Canvas','Schoology')

<a id="subsection-three-two"></a>
### Engagement vs district demographics📈

In [None]:
'''get top 10 tools overall per engagement index mean'''
top_products = data.groupby('Product_Name')['engagement_index'].mean().reset_index()
top_products.sort_values(by='engagement_index',ascending=False).head(10)

In [None]:
def page_load_states(col1:str,col2:str,col3:str,col4:str):
    df = data.groupby(['state','Product_Name'])['engagement_index'].mean().reset_index()
    #df['engagement_index'] = df['engagement_index'].round(2)
    df = df[df['Product_Name'].isin(['Google Docs','Google Classroom','YouTube','Canvas','Schoology',
                               'Meet','Kahoot','YouTube','Google Forms','Google Drive','Seesaw : The Learning Journal'])]
    df = df.sort_values(by='engagement_index',ascending=False)
    
    fig,ax = plt.subplots(2,2,sharey=True,figsize = (15,10))
    ax = ax.flatten()
    
    x = df[df.state.isin([col1])]
    # x and y are passed by keyword; seaborn 0.12+ rejects them as positional arguments
    g1 = sns.barplot(x=x.engagement_index,y=x.Product_Name,palette='RdYlGn',dodge=False,orient = 'h',ax=ax[0])
    g1.set(xlabel = None)
    g1.set(ylabel = None)
    g1.set(title = (f'Average engagement index of the top 10 tools in {col1} state'))
    
    y = df[df.state.isin([col2])]
    g2 = sns.barplot(x=y.engagement_index,y=y.Product_Name,palette='RdYlBu',dodge=False,orient = 'h',ax=ax[1])
    g2.set(xlabel = None)
    g2.set(ylabel = None)
    g2.set(title = (f'Average engagement index of the top 10 tools in {col2} state'))
    
    xy = df[df.state.isin([col3])]
    g3 = sns.barplot(x=xy.engagement_index,y=xy.Product_Name,palette='RdYlBu',dodge=False,orient = 'h',ax=ax[2])
    g3.set(xlabel = None)
    g3.set(ylabel = None)
    g3.set(title = (f'Average engagement index of the top 10 tools in {col3} state'))
    
    z = df[df.state.isin([col4])]
    g4 = sns.barplot(x=z.engagement_index,y=z.Product_Name,palette='RdYlGn',dodge=False,orient = 'h',ax=ax[3])
    g4.set(xlabel = None)
    g4.set(ylabel = None)
    g4.set(title = (f'Average engagement index of the top 10 tools in {col4} state'))
    
    fig.text(0.5, 0.04, 'Engagement index',ha='center',size=15)
    plt.show()
    

page_load_states('Illinois', 'Utah', 'Wisconsin', 'North Carolina')
page_load_states('Missouri','Washington', 'Connecticut', 'Massachusetts')

* The engagement index represents the ***mean of the total page-load events per one thousand students of a given product on a given day***.
* The states were selected by order of occurrence in the district dataframe.
* Across these 8 states, Google Docs was the most popular learning tool.
* Illinois had the highest average engagement index, with Google Docs averaging approximately 17000 page-load events per 1000 students.
* North Carolina had the lowest engagement index, with Google Docs averaging approximately 3500 page-load events per 1000 students.

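The per-state comparison above can also be read off a table instead of the bar charts. The sketch below averages the engagement index per state and product, then keeps the highest-ranked product in each state. The toy rows are hypothetical and only mirror the column names of the merged data.

```python
import pandas as pd

# Hypothetical rows using the merged data's column names.
toy = pd.DataFrame({
    "state": ["Illinois", "Illinois", "Illinois", "Utah", "Utah", "Utah"],
    "Product_Name": ["Google Docs", "Google Docs", "YouTube",
                     "Google Docs", "YouTube", "YouTube"],
    "engagement_index": [100.0, 300.0, 50.0, 80.0, 40.0, 60.0],
})

# Mean engagement index for every (state, product) pair.
state_means = (
    toy.groupby(["state", "Product_Name"], as_index=False)["engagement_index"].mean()
)

# Row label of the maximum within each state, used to select the top product.
top_rows = state_means.groupby("state")["engagement_index"].idxmax()
top_per_state = state_means.loc[top_rows]
print(top_per_state)
```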
<a id="subsection-three-three"></a>
### Engagement vs internet access📉

In [None]:
filtered = data.groupby(['Product_Name','county_connections_ratio'])['engagement_index'].mean().reset_index()
filtered = filtered[filtered['Product_Name'].isin(['Google Docs','Google Classroom','YouTube','Canvas','Schoology',
                               'Meet','Kahoot','YouTube','Google Forms','Google Drive','Seesaw : The Learning Journal'])]
filtered = filtered.sort_values(by='engagement_index',ascending=False)

plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
sns.barplot(x="locale", y="engagement_index",hue = "county_connections_ratio" ,data=data,palette = 'rainbow')
plt.title('Average engagement index per locale given county connections ratio')

plt.subplot(1,2,2)
sns.barplot(x="Product_Name", y="engagement_index",hue = "county_connections_ratio" ,data=filtered,palette = 'summer_r')
plt.xticks(rotation=80)
plt.title('Average engagement index per top learning products given county connections ratio')
plt.show()

* The county connection ratio represents the ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households
* Only school districts in the rural locale have a county connection ratio in the range [1,2], as seen in the left-hand bar plot
* Google Classroom has the highest average engagement index, around 4500, for school districts that have a county connection ratio in the range [1,2]

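As a sketch of how such a ratio and its range label could be derived, the example below divides connections by households and bins the result into half-open ranges with `pd.cut`. The county counts and the bin edges `[0, 1, 2]` are hypothetical choices for illustration, not the edges used when the data set was anonymized.

```python
import pandas as pd

# Hypothetical county-level counts.
counties = pd.DataFrame({
    "connections": [450, 1200, 900],
    "households": [1000, 1000, 600],
})

# Connections per household; values above 1 mean more connections than households.
counties["ratio"] = counties["connections"] / counties["households"]

# Half-open bins [0, 1) and [1, 2), labeled in the "[low, high[" style (assumed edges).
counties["ratio_range"] = pd.cut(
    counties["ratio"], bins=[0, 1, 2], right=False, labels=["[0, 1[", "[1, 2["]
)
print(counties)
```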
In [None]:
months = data.groupby('month').agg({'pct_access':'mean','engagement_index':'mean'})
days = data.groupby('day').agg({'pct_access':'mean','engagement_index':'mean'})
plot_bar(months,'pct_access','engagement_index')
plot_bar(days,'pct_access','engagement_index')