<img src="https://i.imgur.com/S9enGUW.png">

## Problem Statement
The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most state and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about a widening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

## Challenge
We challenge the Kaggle community to explore (1) the state of digital learning in 2020 and (2) how the engagement of digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.
We encourage you to guide the analysis with questions related to the themes described above. Below are some example questions that relate to our problem statement:

* What is the picture of digital connectivity and engagement in 2020?
* What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?
* How does student engagement with different types of education technology change over the course of the pandemic?
* How does student engagement with online learning platforms relate to geography? To demographic context (e.g., race/ethnicity, ESL, learning disability)? To learning context? To socioeconomic status?
* Do certain state interventions, practices, or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

# Import libraries 📚

In [None]:
!pip install -q klib

In [None]:
import glob

import numpy as np
import pandas as pd

import missingno as msno
import klib

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
import plotly.tools as tls
from sklearn import preprocessing

%matplotlib inline
plt.style.use('fivethirtyeight')

# Reading data files 👓

### Product Data Dictionary
> The product file ```products_info.csv``` includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. 


| Name | Description |
|------|-------------|
| LP ID | The unique identifier of the product |
| URL | Web link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled. |

In [None]:
products_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
products_df.head()
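
The data dictionary above describes `Primary Essential Function` as a two-layer label. The cell below is a minimal sketch of splitting it into a main category and a sub-category. It assumes the values look like `LC - Digital Learning Platforms`; the `" - "` separator, the sample rows, and the new column names `main_function` and `sub_function` are assumptions for illustration, not something the data dictionary states.

```python
import pandas as pd

# Hypothetical sample rows; the " - " separator is an assumption about the file format.
sample = pd.DataFrame({
    "Primary Essential Function": [
        "LC - Digital Learning Platforms",
        "CM - Classroom Engagement & Instruction - Classroom Management",
        "SDO - Learning Management Systems (LMS)",
        None,
    ]
})

# Split only on the first separator so sub-categories containing " - " stay intact.
parts = sample["Primary Essential Function"].str.split(" - ", n=1, expand=True)
sample["main_function"] = parts[0]
sample["sub_function"] = parts[1]

print(sample[["main_function", "sub_function"]])
```

If the real file uses a different separator, only the first argument of `str.split` needs to change.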

### District information data

>The district file ```districts_info.csv``` includes information about the **characteristics of school districts**, including data from 
>- NCES (2018-19), 
>- FCC (Dec 2018), and 
>- Edunomics Lab. 

Steps taken to preserve Privacy 🔒 
- Identifiable information about the school districts has been removed. 
- An open source tool ARX (Prasser et al. 2020) was used to transform several data fields and reduce the risks of re-identification. 

📝 For data generalization purposes, some data points are released as a range that contains the actual value. Additionally, many values are missing and marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.

| Name | Description |
|------|-------------|
| district_id | The unique identifier of the school district |
| state | The state in which the district is located |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information. |
| pct_black/hispanic | Percentage of students in the district identified as Black or Hispanic, based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the district eligible for free or reduced-price lunch, based on 2018-19 NCES data |
| county_connections_ratio | Ratio (residential fixed high-speed connections over 200 kbps in at least one direction / households), based on county-level data from FCC Form 477 (December 2018 version). See FCC data for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and the median value is used to represent the expenditure of a given school district. |
                                                         

In [None]:
districts_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
districts_df.head()
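
Since the note above says some district values are released as ranges rather than exact numbers, numeric analysis needs a conversion step. The sketch below turns a range string into its midpoint. It assumes ranges are encoded as strings like `[0.2, 0.4[`; that encoding, the sample values, and the helper name `range_midpoint` are assumptions for illustration. A midpoint is a lossy approximation, so treat results as coarse.

```python
import numpy as np
import pandas as pd

def range_midpoint(value):
    """Return the midpoint of a string such as '[0.2, 0.4[', or NaN if missing or unparseable."""
    if not isinstance(value, str):
        return np.nan
    try:
        # Drop the surrounding brackets, then read the two comma-separated bounds.
        low, high = (float(p) for p in value.strip("[] ").split(","))
    except ValueError:
        return np.nan
    return (low + high) / 2

# Invented sample values in the assumed range format.
sample = pd.DataFrame({
    "pct_free/reduced": ["[0, 0.2[", "[0.2, 0.4[", np.nan],
    "pp_total_raw": ["[14000, 16000[", np.nan, "[8000, 10000["],
})
for col in ["pct_free/reduced", "pp_total_raw"]:
    sample[col + "_mid"] = sample[col].map(range_midpoint)

print(sample)
```

Suppressed values stay `NaN` after conversion, so the missing-value plots later in the notebook still reflect the anonymization.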

### Engagement data
> The engagement data are aggregated at school district level, and each file in the folder ```engagement_data``` represents data from **one school district**. 

📝 The 4-digit file name represents ```district_id```, which can be used to link to district information in ```districts_info.csv```. 

📝 The ```lp_id``` can be used to link to product information in ```products_info.csv```.

| Name             | Description                                                                                                    |
|------------------|----------------------------------------------------------------------------------------------------------------|
| time             | date in "YYYY-MM-DD"                                                                                           |
| lp_id            | The unique identifier of the product                                                                           |
| pct_access       | Percentage of students in the district who have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day                         |
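
To make the two metric definitions in the table concrete, here is a small worked example. The numbers are invented and the function names are hypothetical; the sketch only restates the definitions above and is not LearnPlatform's actual computation.

```python
def pct_access(students_with_page_load: int, total_students: int) -> float:
    """Percentage of students with at least one page-load event for a product on a day."""
    return 100 * students_with_page_load / total_students

def engagement_index(page_loads: int, total_students: int) -> float:
    """Page-load events per 1,000 students for a product on a day."""
    return 1000 * page_loads / total_students

# Invented numbers: a district of 2,500 students, 500 of whom generated 4,000 page loads.
print(pct_access(500, 2500))         # 20.0
print(engagement_index(4000, 2500))  # 1600.0
```

Note that `engagement_index` can exceed 1,000 when students load several pages each, whereas `pct_access` is bounded by 100.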

In [None]:
import os

path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, dtype={'lp_id': str})
    # The file name without its extension is the district_id; this avoids relying on path depth.
    district_id = os.path.splitext(os.path.basename(filename))[0]
    df["district_id"] = district_id
    li.append(df)

engagement_df = pd.concat(li, ignore_index=True)
engagement_df.head()

In [None]:
engagement_df.shape[0]
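
The notes in the Engagement data section describe two links: `district_id` to the district table and `lp_id` to the product table. The sketch below shows one way to join on both keys and check how many rows fail to match, using toy tables with invented IDs and names. A low match rate often points to a key format mismatch (for example, a hypothetical `"93690.0"` on one side and `"93690"` on the other), so it is worth checking before any aggregation.

```python
import pandas as pd

# Toy stand-ins for the real tables; all IDs and names are invented.
engagement = pd.DataFrame({
    "district_id": [1000, 1000, 2000],
    "lp_id": ["11111", "22222", "99999"],
    "pct_access": [1.5, 0.3, 0.1],
})
districts = pd.DataFrame({"district_id": [1000, 2000], "state": ["Utah", "Ohio"]})
products = pd.DataFrame({"LP ID": [11111, 22222], "Product Name": ["Product A", "Product B"]})

# Align key names and dtypes before joining.
products = products.rename(columns={"LP ID": "lp_id"}).astype({"lp_id": str})

linked = (
    engagement
    .merge(districts, on="district_id", how="left", validate="many_to_one")
    .merge(products, on="lp_id", how="left", validate="many_to_one", indicator=True)
)

# Rows marked "left_only" have an lp_id with no match in the product table.
print(linked["_merge"].value_counts())
```

`validate="many_to_one"` raises an error if a key is duplicated in the lookup table, which guards against silently multiplying engagement rows.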

# Missing values🔮

In [None]:
klib.missingval_plot(products_df)

In [None]:
klib.missingval_plot(districts_df)

In [None]:
msno.dendrogram(districts_df)

In [None]:
engagement_df.describe()

In [None]:
klib.missingval_plot(engagement_df.sample(50000))

# EDA 📊

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x=districts_df.state)  # pass the column by keyword; recent seaborn versions require it
plt.xticks(rotation=90)

In [None]:
districts_df["state"].value_counts().head(10).plot(kind = 'pie', autopct='%1.1f%%', figsize=(10, 10), startangle=0).legend()

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x=districts_df.locale)  # pass the column by keyword; recent seaborn versions require it
plt.xticks(rotation=90)

In [None]:
districts_df["locale"].value_counts().head(10).plot(kind = 'pie', autopct='%1.1f%%', figsize=(10, 10), startangle=0).legend()

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x=districts_df.pp_total_raw)  # pass the column by keyword; recent seaborn versions require it
plt.xticks(rotation=90)

In [None]:
ds = districts_df['state'].value_counts().reset_index()
ds.columns = [
    'state', 
    'percent'
]
ds['percent'] /= len(districts_df)

fig = px.pie(
    ds, 
    names='state', 
    values='percent',
    color_discrete_sequence=px.colors.sequential.Mint,
    title='Occurrence of states in the District Information Data:', 
    width=700,
    height=500
)
fig.show()

In [None]:
ds = products_df['Sector(s)'].value_counts().reset_index()
ds.columns = [
    'Sector(s)', 
    'percent'
]
ds['percent'] /= len(products_df)

fig = px.pie(
    ds, 
    names='Sector(s)', 
    values='percent',
    color_discrete_sequence=px.colors.sequential.Mint,
    title='Distribution of Sector(s) in the Product Information Data:', 
    width=700,
    height=500
)
fig.show()

In [None]:
# Code by Mysterious Ben https://www.kaggle.com/myster/eda-prophet-winning-solution-3-0

_ = pd.pivot_table(engagement_df, values='engagement_index', index='time').plot(style='-o', title="Learning Engagement in Pandemics")
plt.xticks(rotation=45);

In [None]:
engagement_df['district_id']=engagement_df['district_id'].astype('int')
districts_df['district_id']=districts_df['district_id'].astype('int')
engagement_df = pd.merge(engagement_df,districts_df[['district_id','state','locale']],on='district_id',how='left')

In [None]:
engagement_df.head()

In [None]:
# Select and rename in one step so the result is a new DataFrame, not a view of products_df.
products_df_trim = products_df[['LP ID','URL','Product Name','Provider/Company Name','Primary Essential Function']].rename(columns={'LP ID':'lp_id'})

In [None]:
products_df_trim['lp_id']=products_df_trim['lp_id'].astype('str')
engagement_df = pd.merge(engagement_df,products_df_trim,on='lp_id',how='left')

In [None]:
engagement_df.info()

In [None]:
engagement_df.isnull().sum()/engagement_df.shape[0]*100

In [None]:
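# Drop rows without pct_access; copy so later column assignments do not trigger SettingWithCopyWarning.
engagement_df = engagement_df[engagement_df['pct_access'].notnull()].copy()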

In [None]:
engagement_df['time']=pd.to_datetime(engagement_df['time'])
engagement_df['day'] = engagement_df.time.dt.day
engagement_df['week'] = engagement_df.time.dt.isocalendar().week  # dt.week was removed in pandas 2.0
engagement_df['month'] = engagement_df.time.dt.month

In [None]:
engagement_df.groupby(['state']).agg({'pct_access':'mean'})

In [None]:
klib.dist_plot(engagement_df.loc[engagement_df['state']=='Wisconsin'][['pct_access']])

In [None]:
prd_pc_access_mean = engagement_df.groupby(['Product Name']).agg({'pct_access':'mean'}).reset_index().sort_values('pct_access',ascending=False)
prd_pc_access_median = engagement_df.groupby(['Product Name']).agg({'pct_access':'median'}).reset_index().sort_values('pct_access',ascending=False)
prd_pc_access_median.rename(columns={'pct_access':'pct_access_median'},inplace=True)
prd_pc_access_mean.rename(columns={'pct_access':'pct_access_mean'},inplace=True)

In [None]:
prd_pc_access_mean.head(10)['Product Name']

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=prd_pc_access_mean.head(15)['Product Name'],y=prd_pc_access_mean.head(15)['pct_access_mean'])

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=prd_pc_access_mean.tail(15)['Product Name'],y=prd_pc_access_mean.tail(15)['pct_access_mean'])

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x=prd_pc_access_median.head(15)['Product Name'],y=prd_pc_access_median.head(15)['pct_access_median'])

In [None]:
engagement_df.locale.unique()

In [None]:
demog_pc_access_mean = engagement_df.loc[~engagement_df['state'].isnull()].groupby(['state','locale']).agg({'pct_access':'mean'}).reset_index().sort_values('pct_access',ascending=False)
demog_pc_access_median = engagement_df.loc[~engagement_df['state'].isnull()].groupby(['state','locale']).agg({'pct_access':'median'}).reset_index().sort_values('pct_access',ascending=False)
demog_pc_access_median.rename(columns={'pct_access':'pct_access_median'},inplace=True)
demog_pc_access_mean.rename(columns={'pct_access':'pct_access_mean'},inplace=True)

In [None]:
demog_pc_access_mean.plot(x='state',y='pct_access_mean',figsize=(25,15),kind='bar',stacked=True)

In [None]:
for l in ['Suburb', 'City', 'Rural', 'Town']:
    print(" Locale - ",l)
    demog_pc_access_median.loc[demog_pc_access_median['locale']==l].plot(x='state',y='pct_access_median',figsize=(15,7),kind='bar')

In [None]:
engagement_df

In [None]:
#engagement_df
engagement_df.dropna(axis=0, subset=["state"], inplace=True)

In [None]:
engagement_df.head()

In [None]:
demog_eng_index_median = engagement_df.groupby(['state','locale']).agg({'engagement_index':'median'}).reset_index().sort_values('engagement_index',ascending=False)
demog_eng_index_median.rename(columns={'engagement_index':'engagement_index_median'},inplace=True)
engagement_df_trim = pd.merge(engagement_df,demog_eng_index_median,on=['state','locale'],how='left')
# Fill missing engagement_index values with the median for the same state and locale.
engagement_df_trim['engagement_index'] = engagement_df_trim['engagement_index'].fillna(engagement_df_trim['engagement_index_median'])

In [None]:
engagement_df_trim

In [None]:
engagement_df_trim['year']= engagement_df_trim.time.dt.year

In [None]:
engagement_df_trim.columns

In [None]:
# ====================

In [None]:
eng_pct_acc_time = engagement_df_trim.loc[engagement_df_trim['year']==2020].groupby(['time','year','month','week']).agg({'pct_access':'median','engagement_index':'median'}).reset_index()

In [None]:
eng_pct_acc_time[['time','engagement_index']].plot(x='time',y='engagement_index',figsize=(20,7))

In [None]:
#pd.merge(engagement_df,demog_pc_access_median,on=['state','locale'],how='left'