<h1 align='center'> Impact of COVID-19 on Digital Learning</h1>
<img align='center' src="https://i0.wp.com/oecd-development-matters.org/wp-content/uploads/2021/05/Africa-covid-19-education-DevMatters.jpg?resize=458%2C305&ssl=1" >


## Objective

The purpose is to explore:
* the state of digital learning in 2020, and
* how engagement with digital learning relates to factors such as district demographics, broadband access, and state- and national-level policies and events.

Packages and helper functions

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objects as go
import plotly.express as px
# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

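def plot_hist(df: pd.DataFrame, column: str, color: str) -> None:
    """Plot the distribution of one column."""
    # displot is figure-level and creates its own figure, so no plt.figure() call is needed
    sns.displot(data=df, x=column, color=color, height=7, aspect=2)
    plt.title(f'Distribution of {column}', size=22, fontweight='bold')
    plt.show()


def mode_fill(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill missing values in a column with its most frequent value."""
    df = df.copy()  # work on a copy so the caller's DataFrame (or a slice of it) is not modified
    mode = df[column].mode()[0]
    df[column] = df[column].fillna(mode)
    return df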

# Data Overview
#### The following three basic sets of files are provided for this competition:

* The engagement_data folder is based on LearnPlatform’s Student Chrome Extension. The extension collects page-load events for over 10K education technology products in LearnPlatform's product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at the school district level, and each file represents data from one school district.
* The products_info.csv file includes information about the characteristics of the top 372 products with most users in 2020.
* The districts_info.csv file includes information about the characteristics of school districts, including data from NCES and FCC.
* Using other publicly available data is also encouraged, for example the [COVID-19 US State Policy database](https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC), [KIDS Count](https://www.aecf.org/resources/2020-kids-count-data-book/?gclid=CjwKCAiAudD_BRBXEiwAudakXyXtNK90IAicHQ5T3kT12l4TdJYfAQsYsHlMPNJLZnETp0XgshKE4xoC2UcQAvD_BwE), [KFF](https://www.kff.org/coronavirus-covid-19/issue-brief/state-covid-19-data-and-policy-actions/) and others.

In [None]:
#load data
districts_info=pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products_info=pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
PATH='../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'

## districts_info data

In [None]:
districts_info.head(3)

In [None]:
districts_info.info()

In [None]:
# Null values 
print("Percentage of Nulls per column\n", districts_info.isnull().sum()*100/districts_info.shape[0])

As shown below, rows with a null value in the state column are also null in the other columns. Since no insights can be extracted from these rows, we drop them.

In [None]:
districts_info[districts_info.state.isnull()].head(8)
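
Looking at the first eight rows only suggests the pattern; it can also be checked for every row. The following is a minimal sketch on a made-up stand-in for `districts_info` (the `toy_districts` frame and its values are assumptions for illustration, not the competition data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for districts_info; the values are invented for illustration.
toy_districts = pd.DataFrame({
    "district_id": [1, 2, 3, 4],
    "state": ["Utah", np.nan, "Ohio", np.nan],
    "locale": ["Suburb", np.nan, "City", np.nan],
    "pct_free/reduced": ["[0, 0.2[", np.nan, np.nan, np.nan],
})

# Rows where state is missing, without the identifier column.
missing_state = toy_districts[toy_districts["state"].isna()].drop(columns="district_id")

# True only if every remaining cell in those rows is null.
all_other_columns_null = bool(missing_state.isna().all().all())
print(all_other_columns_null)
```

If the same check on the real `districts_info` returns `False`, some of these rows still carry information and dropping them would lose it.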

In [None]:
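# .copy() makes the filtered result an independent DataFrame, avoiding SettingWithCopyWarning later
districts_info=districts_info[districts_info['state'].notna()].copy()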

In [None]:
plot_hist(districts_info,'locale', 'blue')

Most of the school districts belong to the Suburb locale classification.

In [None]:
plt.figure(figsize=(13,8))
plt.title('Distribution of state')
sns.countplot(y ='state', data = districts_info, order = districts_info['state'].value_counts().index)

In [None]:
plot_hist(districts_info,'pct_black/hispanic', 'green')

In [None]:
districts_info=mode_fill(districts_info,'pct_free/reduced')
plot_hist(districts_info,'pct_free/reduced', 'blue')

In [None]:
districts_info=mode_fill(districts_info,'pp_total_raw')
plot_hist(districts_info,'pp_total_raw', 'blue')

In [None]:
# districts_info=mode_fill(districts_info,'county_connections_ratio')
# plot_hist(districts_info,'county_connections_ratio', 'blue')
districts_info['county_connections_ratio'].value_counts()

# Products_info

In [None]:
products_info.head()

In [None]:
products_info.info()

In [None]:
print('percentage of null values per column\n',products_info.isnull().sum()*100/products_info.shape[0])

In [None]:
products_info.dropna(inplace=True)

In [None]:
# plot_hist(products_info, 'Sector(s)','green')
plt.figure(figsize=(10, 6))
products_info['Sector(s)'].value_counts().plot(kind='pie', autopct='%.2f%%')

# Top 10 providers by number of products

In [None]:
#Group by Provider/Company Name

df = products_info[['Provider/Company Name','LP ID']].groupby('Provider/Company Name').count().sort_values(by='LP ID',ascending=False)
df = df.iloc[:10]
_= sns.barplot(x=df['LP ID'],y=df.index)
plt.xlabel('count')

# Engagement data

In [None]:
# load the engagement file of every district that remains in districts_info
all_files=[]
for district in districts_info.district_id.unique():
    df=pd.read_csv(f'{PATH}/{district}.csv')
    df["district_id"]=district
    all_files.append(df)
engagement = pd.concat(all_files)
engagement = engagement.reset_index(drop=True)

In [None]:
engagement.head()

In [None]:
engagement.info()

In [None]:
#change time column to datetime type
engagement.time = pd.to_datetime(engagement.time)

All rows with a null engagement_index have a pct_access of 0, so these rows provide no insight into user engagement and are dropped.

In [None]:
engagement=engagement[engagement.engagement_index.notna()]
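
Before dropping these rows, the claim that they carry no engagement can be verified. Below is a minimal sketch on made-up data, assuming `pct_access` is the column that is 0 wherever `engagement_index` is null (the `toy_engagement` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the engagement table; the values are invented for illustration.
toy_engagement = pd.DataFrame({
    "lp_id": [10.0, 11.0, 12.0, 13.0],
    "pct_access": [0.0, 1.5, 0.0, 0.3],
    "engagement_index": [np.nan, 42.0, np.nan, 7.5],
})

# Rows with no engagement_index recorded.
null_rows = toy_engagement[toy_engagement["engagement_index"].isna()]

# True only if every such row also has pct_access equal to 0.
nulls_have_zero_access = bool((null_rows["pct_access"] == 0).all())

# Keep only the rows that have an engagement_index.
cleaned = toy_engagement[toy_engagement["engagement_index"].notna()]
print(nulls_have_zero_access, len(cleaned))
```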

In [None]:
engagement.isnull().sum()

#### Merge all data to get deep information

In [None]:
all_data=engagement.merge(districts_info,how='inner')


In [None]:
products_info['LP ID']=products_info['LP ID'].astype('float')
products_info.rename({'LP ID': 'lp_id'}, axis=1, inplace=True)
all_data=all_data.merge(products_info, how='inner')
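
Both merges above rely on pandas inferring the join keys from the shared column names. A minimal sketch of a more explicit variant follows, assuming `district_id` and `lp_id` are the intended keys and that each appears once in its lookup table; the toy frames are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the three tables; the values are invented for illustration.
toy_engagement = pd.DataFrame({
    "lp_id": [1.0, 2.0, 3.0],
    "district_id": [100, 100, 200],
    "engagement_index": [5.0, 8.0, 2.0],
})
toy_districts = pd.DataFrame({"district_id": [100, 200], "state": ["Utah", "Ohio"]})
toy_products = pd.DataFrame({"lp_id": [1.0, 2.0], "Product Name": ["A", "B"]})

# Naming the key makes the join explicit; validate raises MergeError
# if a key is duplicated on the lookup (right-hand) side.
merged = toy_engagement.merge(
    toy_districts, on="district_id", how="inner", validate="many_to_one"
)
merged = merged.merge(toy_products, on="lp_id", how="inner", validate="many_to_one")

# The row with lp_id 3.0 has no match in toy_products, so the inner join drops it.
print(merged.shape)
```

An inner join silently discards unmatched rows, so comparing row counts before and after each merge shows how much engagement data is lost.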

In [None]:
all_data.to_csv('all_data.csv',index=False)

In [None]:
all_data.head()

In [None]:
def find_agg(df:pd.DataFrame,agg_column:str, wanted_col:str,agg_metric:str, col_name:str, order=False )->pd.DataFrame:
    """ This function calculates aggregate of column """
    new_df = df.groupby(agg_column)[wanted_col].agg(agg_metric).reset_index(name=col_name).\
    sort_values(by=col_name, ascending=order)
    return new_df
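
The `agg_metric` argument decides what the resulting column means, so the column name should match it. A minimal sketch on made-up data showing how `"sum"` and `"mean"` can rank states differently (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "state": ["Utah", "Utah", "Ohio"],
    "engagement_index": [10.0, 30.0, 50.0],
})

# Compute both metrics per state and sort by the total, largest first.
summary = (
    toy.groupby("state")["engagement_index"]
    .agg(["sum", "mean"])
    .reset_index()
    .sort_values(by="sum", ascending=False)
)
print(summary)
```

A state with more districts or more products accumulates a larger sum even when its average engagement is lower, which matters when comparing states.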

In [None]:
# The metric is "sum", so the result is a total per state, not an average
total_engagement_index=find_agg(all_data,"state","engagement_index" ,"sum", "total_engagement_index",order=False )

In [None]:
def plot_bar(data,x,y,title):
    plt.figure(figsize=(16,8))
    sns.barplot(data=data,x=x,y=y)
    plt.xticks(rotation=90, size=14)
    plt.xlabel(None)
    plt.title(title, size=20)
    plt.show()
plot_bar(total_engagement_index, 'state','total_engagement_index',"Total page-load events per one thousand students in 2020")

In [None]:
# The metric is "sum", so the result is a total per state, not an average
total_pct_access=find_agg(all_data,"state","pct_access" ,"sum", "total_pct_access",order=False )

In [None]:
plot_bar(total_pct_access, 'state','total_pct_access',"Total percentage of students per state with at least one page-load event in 2020")

In [None]:
all_data.time= pd.to_datetime(all_data.time, format = '%Y-%m-%d')

In [None]:
eng = all_data.groupby('time').agg({'engagement_index':'mean','pct_access':'mean'}).reset_index()
eng.set_index('time',inplace=True)
eng['engagement_index'].plot(linestyle='solid',title='Engagement index 2020',figsize=(10,8),sharey=False,legend=False)
plt.show()

#### The first American COVID-19 case was reported on January 20, and President Donald Trump declared the US outbreak a public health emergency on January 31. We can see that engagement with digital learning increased around the end of January, decreased in June, July, and August (summer break), and increased again after the summer.
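
The seasonal pattern is easier to confirm from monthly averages than from the daily line. A minimal sketch on made-up data, assuming a `time` column of dates and an `engagement_index` column as in `all_data` (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Toy daily records; the values are invented for illustration.
dates = pd.to_datetime(
    ["2020-01-15", "2020-02-15", "2020-07-10", "2020-07-20", "2020-10-05"]
)
toy = pd.DataFrame({
    "time": dates,
    "engagement_index": [100.0, 180.0, 20.0, 40.0, 200.0],
})

# Average engagement per calendar month (1 = January, ..., 12 = December).
monthly = toy.groupby(toy["time"].dt.month)["engagement_index"].mean()

# Month with the lowest average engagement.
lowest_month = monthly.idxmin()
print(monthly)
print(lowest_month)
```

Applied to the real data, a dip in months 6 to 8 would support the summer-break reading of the plot above.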