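# ☠COVID-19 and Online Digital Education🎓

Using data provided by [LearnPlatform](https://learnplatform.com/), this notebook analyzes how COVID-19 has affected the existing online education system, and how that effect relates to each state's demographics, network connectivity, policies, and events.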

In [None]:
import numpy as np
import pandas as pd

import os,sys,random,math,time
from datetime import datetime

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
%matplotlib inline

## Data loading and field descriptions

The data comes from two sources: the competition files (districts_info.csv, products_info.csv, and engagement_data) and other public datasets, such as the [COVID-19 US State Policy database](https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC), [KIDS Count](https://www.aecf.org/resources/2020-kids-count-data-book/?gclid=CjwKCAiAudD_BRBXEiwAudakXyXtNK90IAicHQ5T3kT12l4TdJYfAQsYsHlMPNJLZnETp0XgshKE4xoC2UcQAvD_BwE), and [KFF](https://www.kff.org/coronavirus-covid-19/issue-brief/state-covid-19-data-and-policy-actions/).

### Competition data

In [None]:
# README.md of the competition data
!cat /kaggle/input/learnplatform-covid19-impact-on-digital-learning/README.md

In [None]:
'''
district_id                unique identifier of the school district
state                      state the district belongs to
locale                     locale type: city, suburb, town, or rural
pct_black/hispanic         share of students identified as Black or Hispanic
pct_free/reduced           share of students eligible for free or reduced-price lunch
county_connections_ratio   ratio of residential high-speed connections (in at least one direction) to households
pp_total_raw               total per-pupil expenditure for the given district
'''
districts_df = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
# districts_df = districts_df.fillna('')
districts_df = districts_df.dropna()

def range_bound(value, idx):
    # Values are bracketed pairs such as "[a, b[": strip the brackets and return element idx as a float
    return float(value[1:-1].split(',')[idx].strip()) if value != '' else ''

# Split each bracketed pair into two columns.
# Note: the two new columns hold the first and second number of the pair, whatever their names suggest.
for source, (first, second) in {
    'pct_black/hispanic': ('pct_black', 'pct_hispanic'),
    'pct_free/reduced': ('pct_free', 'pct_reduced'),
    'county_connections_ratio': ('ratio_direction', 'ratio_households'),
    'pp_total_raw': ('pp_total_a', 'pp_total_b'),
}.items():
    districts_df[first] = districts_df[source].apply(range_bound, idx=0)
    districts_df[second] = districts_df[source].apply(range_bound, idx=1)
districts_df = districts_df[['district_id','state','locale','pct_black','pct_hispanic',
                             'pct_free','pct_reduced','ratio_direction','ratio_households','pp_total_a','pp_total_b']]
districts_df.info()

In [None]:
districts_df.head()
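
The parsing cell above drops every district with a missing value before splitting the bracketed strings. A minimal sketch of an alternative that keeps such rows and yields NaN instead is shown below. `parse_range` is a hypothetical helper, and the sample strings are illustrative; they assume the `[a, b[` shape implied by the slicing above.

```python
import numpy as np
import pandas as pd

def parse_range(value):
    """Split a string such as "[0.2, 0.4[" into a (first, second) pair of floats.

    Missing or empty values give (nan, nan) instead of raising.
    """
    if pd.isna(value) or value == '':
        return (np.nan, np.nan)
    first, second = value[1:-1].split(',')
    return (float(first.strip()), float(second.strip()))

# Illustrative values only, not taken from the competition files
sample = pd.Series(["[0.2, 0.4[", None, "[0, 0.2["])
bounds = pd.DataFrame(sample.map(parse_range).tolist(), columns=['low', 'high'])
print(bounds)
```

Keeping NaN rows lets later group-bys decide per column whether a district can be used, at the cost of having to handle NaN explicitly when plotting.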

In [None]:
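'''
LP ID                       unique identifier of the product
URL                         link to the product's website
Product Name                name of the product
Provider/Company Name       name of the product's provider
Sector(s)                   education sector(s) in which the product is used
Primary Essential Function  basic function of the product, given as a two-level label. Each product is first tagged with
                            one of three categories: LC = Learning & Curriculum, CM = Classroom Management,
                            SDO = School & District Operations. Each category has several sub-categories.
'''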
products_df = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
products_df = products_df.dropna()
products_df['PEF L1'] = products_df['Primary Essential Function'].apply(lambda pef:pef.split('-')[0].strip())
products_df['PEF L2'] = products_df['Primary Essential Function'].apply(lambda pef:pef.split('-')[1].strip())
products_df = products_df[['LP ID','URL','Product Name','Provider/Company Name','Sector(s)','PEF L1','PEF L2']]
products_df.info()

In [None]:
products_df.head()
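
The cell above takes `split('-')[0]` and `split('-')[1]`, so a label with more than one hyphen keeps only its second segment. A small sketch of splitting on the first separator only is shown below; the two labels are illustrative and the multi-segment shape is an assumption about the data.

```python
import pandas as pd

# Illustrative labels only; the "L1 - L2 - L3" shape is an assumption
pef = pd.Series([
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction - Classroom Management",
])

# Split on the first " - " only, so everything after the category stays together
parts = pef.str.split(" - ", n=1, expand=True)
parts.columns = ['PEF L1', 'PEF L2']
print(parts)
```

Whether the trailing segment belongs in `PEF L2` or in a separate third column depends on how fine-grained the later grouping should be.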

In [None]:
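'''
Data aggregated at the district level; each file name is the district's unique identifier
time              date
lp_id             unique identifier of the product
pct_access        share of students in the district with at least one page-load event for the product on that day
engagement_index  total page-load events per one thousand students for the product in the district on that day
'''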

def read_engagement(district_id):
    engagement = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data/"+str(district_id)+".csv")
    engagement = engagement.fillna(0)
    engagement['dt_id'] = district_id
    return engagement

engagement_8815 = read_engagement(8815)
engagement_8815.info()

In [None]:
engagement_8815.head()
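
`read_engagement` loads one district at a time. To compare districts, the same idea can be extended to every file in the folder. The sketch below is an assumption-laden illustration: `read_all_engagement` is a hypothetical helper, and it writes two tiny made-up files to a temporary directory so that it runs without the Kaggle input path.

```python
import glob
import os
import tempfile

import pandas as pd

def read_all_engagement(folder):
    """Concatenate every <district_id>.csv in folder, tagging rows with dt_id."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        frame = pd.read_csv(path)
        # The file name (without extension) is the district id
        frame['dt_id'] = int(os.path.splitext(os.path.basename(path))[0])
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Tiny made-up files so the sketch runs anywhere
with tempfile.TemporaryDirectory() as folder:
    for district_id in (1000, 1001):
        pd.DataFrame({
            'time': ['2020-01-01', '2020-01-02'],
            'lp_id': [1, 1],
            'pct_access': [0.5, 0.7],
            'engagement_index': [10.0, 12.0],
        }).to_csv(os.path.join(folder, f"{district_id}.csv"), index=False)
    all_engagement = read_all_engagement(folder)

print(all_engagement.groupby('dt_id').size())
```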

### External public data
- covid-19 us state data
- kids count
- kff

In [None]:
'''
date    date
state   state
cases   cumulative confirmed cases
deaths  cumulative deaths
'''

us_covid = pd.read_csv('/kaggle/input/us-counties-covid-19-dataset/us-counties.csv')
us_covid = us_covid.drop('county',axis=1)
us_covid = us_covid.groupby(by=['date','state']).agg({'cases':'sum','deaths':'sum'}).reset_index()
us_covid.info()

In [None]:
us_covid.head()
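
Because `cases` and `deaths` are cumulative, they only ever grow, which can hide changes in the pace of the outbreak. A minimal sketch of deriving daily increments per state is shown below; the frame is made-up data and `new_cases` is a column name introduced here for illustration.

```python
import pandas as pd

# Made-up cumulative counts for two states
toy = pd.DataFrame({
    'date': ['2020-03-01', '2020-03-02', '2020-03-03'] * 2,
    'state': ['A'] * 3 + ['B'] * 3,
    'cases': [1, 4, 9, 0, 2, 2],
})

toy = toy.sort_values(['state', 'date'])
# diff() within each state turns running totals into daily increments;
# the first day has no predecessor, so it keeps its cumulative value
toy['new_cases'] = toy.groupby('state')['cases'].diff().fillna(toy['cases'])
print(toy)
```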

### Fetch and join related data by district id

In [None]:
def district_product_covid(district_id):
    # Join one district's engagement rows with its district attributes and its state's COVID counts
    engagement = read_engagement(district_id)
    district = districts_df[districts_df['district_id']==district_id]
    state_covid = us_covid[us_covid['state']==district['state'].iloc[0]]
    dpc_df = engagement.merge(district,left_on='dt_id',right_on='district_id',how='inner').merge(state_covid,left_on='time',right_on='date',how='inner')
    return dpc_df

dpc_df = district_product_covid(8815).sort_values(by='date')
dpc_df = dpc_df[['date','state_x','district_id','lp_id','locale','pct_access','engagement_index',
                 'pct_black','pct_hispanic','pct_free','pct_reduced','ratio_direction','ratio_households',
                'pp_total_a','pp_total_b','cases','deaths']]
dpc_df.info()

In [None]:
dpc_df.head()
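
Both merges above use `how='inner'`, so engagement days with no matching COVID row (for instance, any dates before the state appears in the COVID table) are dropped without notice. A short sketch of checking how many rows that affects, using made-up frames with the same join keys:

```python
import pandas as pd

# Made-up frames that share the join keys used above
engagement = pd.DataFrame({'time': ['2020-01-01', '2020-03-01'], 'pct_access': [0.4, 0.6]})
covid = pd.DataFrame({'date': ['2020-03-01'], 'cases': [5]})

# how='left' keeps every engagement row; indicator=True records where each row matched
joined = engagement.merge(covid, left_on='time', right_on='date', how='left', indicator=True)
unmatched = joined[joined['_merge'] == 'left_only']
print(len(unmatched), "engagement rows have no COVID record")
```

If the unmatched share is large, a left join with `cases` and `deaths` filled with 0 may describe the early months better than an inner join; that choice is a judgment call rather than something the data dictates.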

## Data visualization and analysis

In [None]:
def bar_plot(data,title,y='district_id',xlabel='US State',ylabel='Number of school districts',rotation=90):
    plt.figure(figsize=(20,5))
    plt.bar(x=[str(x) for x in data.index.tolist()],height=data[y].tolist())
    plt.xticks(rotation=rotation)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()

In [None]:
def line_plot(data,product_id=32213,y='pct_access'):
    # Plot y for one product against cumulative cases (top panel) and cumulative deaths (bottom panel)
    tmp = data[data['lp_id']==product_id]
    date = [datetime.strptime(d, '%Y-%m-%d').date() for d in tmp['date'].tolist()]

    fig, axes = plt.subplots(2, 1, figsize=(20,8))
    for ax, (col, color) in zip(axes, [('cases','red'), ('deaths','gray')]):
        ax.plot(date,tmp[y].tolist(),label=y)
        # Format and thin the date ticks on the axes that is actually drawn
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        ax.set_xticks(date[::30])
        ax.legend(loc='upper left')
        # Second y-axis for the COVID series; separate legend corner so the two do not overlap
        ax2 = ax.twinx()
        ax2.plot(date,tmp[col].tolist(),color=color,label=col)
        ax2.legend(loc='upper right')

    plt.show()

### District table visualization

In [None]:
state_group = districts_df.groupby('state')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by state')

In [None]:
state_group = districts_df.groupby('locale')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by locale',xlabel='Locale Type')

In [None]:
districts_df['pct_black_box'] = districts_df['pct_black'].apply(lambda pb:str(float(pb)*100)+'%')
state_group = districts_df.groupby('pct_black_box')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by pct_black_box',xlabel='Black percent',rotation=0)

In [None]:
districts_df['pct_hispanic_box'] = districts_df['pct_hispanic'].apply(lambda pb:str(float(pb)*100)+'%')
state_group = districts_df.groupby('pct_hispanic_box')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by pct_hispanic_box',xlabel='Hispanic percent',rotation=0)

In [None]:
districts_df['pct_free_box'] = districts_df['pct_free'].apply(lambda pb:str(float(pb)*100)+'%')
state_group = districts_df.groupby('pct_free_box')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by pct_free_box',xlabel='Free lunch percent',rotation=0)

In [None]:
districts_df['pct_reduced_box'] = districts_df['pct_reduced'].apply(lambda pb:str(float(pb)*100)+'%')
state_group = districts_df.groupby('pct_reduced_box')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by pct_reduced_box',xlabel='Reduced lunch percent',rotation=0)

In [None]:
districts_df['ratio_direction_box'] = districts_df['ratio_direction'].apply(lambda pb:str(float(pb)*100)+'%')
state_group = districts_df.groupby('ratio_direction_box')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by ratio_direction_box',xlabel='Direction ratio',rotation=0)

In [None]:
districts_df['ratio_households_box'] = districts_df['ratio_households'].apply(lambda pb:str(float(pb)*100)+'%')
state_group = districts_df.groupby('ratio_households_box')[['district_id']].count().sort_values(by='district_id')
bar_plot(state_group,'Number of school districts by ratio_households_box',xlabel='Households ratio',rotation=0)
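
The `_box` labels above are built with `str(float(pb)*100)+'%'`, which can expose floating-point noise in a tick label. A small sketch of the format-spec alternative, using illustrative fractions rather than values from the data:

```python
# Illustrative fractions only
fractions = [0.0, 0.07, 0.2, 0.6]

# The "%" format type multiplies by 100 and rounds to the requested precision
labels = [f"{value:.0%}" for value in fractions]
print(labels)
```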

### Product table visualization

In [None]:
state_group = products_df.groupby('Provider/Company Name')[['LP ID']].count().sort_values(by='LP ID')
state_group = state_group[state_group['LP ID']>1]
bar_plot(state_group,'Number of product by Provider/Company Name',y='LP ID',xlabel='Provider/Company Name',ylabel='Number of product',rotation=90)

In [None]:
state_group = products_df.groupby('Sector(s)')[['LP ID']].count().sort_values(by='LP ID')
bar_plot(state_group,'Number of product by Sector(s)',y='LP ID',xlabel='Sector(s)',ylabel='Number of product',rotation=0)

In [None]:
state_group = products_df.groupby('PEF L1')[['LP ID']].count().sort_values(by='LP ID')
bar_plot(state_group,'Number of product by PEF L1',y='LP ID',xlabel='Primary Essential Function Level1',ylabel='Number of product',rotation=0)

### Engagement table visualization
Districts: 8815 4921 5987 3710 7177

In [None]:
engagement_8815 = read_engagement(8815)

In [None]:
group_8815 = engagement_8815.groupby('lp_id')[['pct_access']].mean().sort_values(by='pct_access')
group_8815 = group_8815[group_8815['pct_access']>1]
bar_plot(group_8815,"Mean of product's pct_access by LP ID in district 8815",y='pct_access',xlabel='Product ID',ylabel="Mean of product's access percent",rotation=90)

In [None]:
group_8815 = engagement_8815.groupby('lp_id')[['engagement_index']].mean().sort_values(by='engagement_index')
group_8815 = group_8815[group_8815['engagement_index']>100]
bar_plot(group_8815,"Mean of product's engagement_index by LP ID in district 8815",y='engagement_index',xlabel='Product ID',ylabel="Mean of product's engagement index",rotation=90)

### Combined district, product, engagement, and COVID data visualization

In [None]:
dpc_df.info()

In [None]:
line_plot(dpc_df,y='pct_access')

In [None]:
line_plot(dpc_df,y='engagement_index')

## Association mining
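
One possible starting point for this step, sketched under assumptions rather than taken from the notebook, is the Pearson correlation between a product's daily `pct_access` and the daily increase in cases. The frame below is made-up data that reuses the column names of `dpc_df`; `new_cases` is a name introduced here.

```python
import pandas as pd

# Made-up rows shaped like dpc_df for a single product
toy = pd.DataFrame({
    'date': ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'],
    'pct_access': [0.10, 0.20, 0.30, 0.40],
    'cases': [1, 3, 6, 10],  # cumulative
})

# Daily increase; the first row has no predecessor and becomes NaN
toy['new_cases'] = toy['cases'].diff()
# Pearson correlation; pairs containing NaN are left out
corr = toy['pct_access'].corr(toy['new_cases'])
print(round(corr, 3))
```

A correlation on two trending series says little about cause, so a result like this is best read as a prompt for closer inspection (for example by state, locale, or product category) rather than as a finding.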

The end.
From SIBAT.