<p style="color:Red;"><h1> Did COVID-19 Impact e-Learning Habits in 2020? </h1></p>

<h3>Problem Statement</h3>

* The **COVID-19** Pandemic has disrupted learning for more than **56 million** students in the United States. 
 * In the Spring of **2020**, most *states and local governments* across the U.S. closed educational institutions to stop the spread of the virus. 
 * In response, schools and teachers have attempted to reach students remotely through *distance learning tools and digital platforms.*
 * To this day, concerns about the widening *digital divide and long-term learning loss* among America’s most vulnerable learners continue to grow.

**Main Content**

This workbook consists of two main parts:

* **Data Processing and Library Import**

* **Analysis**


In [None]:
# Library Import
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import figure
import plotly.graph_objects as go
import plotly.express as px

<h2><p style="color:Blue;"> Data Definition : Engagement Data</p></h2>
A) Engagement data files

* The engagement data are aggregated at school district level, and each file in the folder engagement_data represents data from one school district. 
* The 4-digit file name represents district_id which can be used to link to district information in district_info.csv. 
* The lp_id can be used to link to product information in product_info.csv.
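
The linking described above can be sketched with toy data. All rows and values below are invented for illustration; only the column names (`district_id`, `lp_id`, `LP ID`) come from the dataset description, and the `*_toy` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are invented for illustration.
engagement_toy = pd.DataFrame({
    "district_id": ["1000", "1000", "2000"],
    "lp_id": [11, 22, 11],
    "engagement_index": [5.0, 1.5, 3.0],
})
districts_toy = pd.DataFrame({
    "district_id": ["1000", "2000"],
    "state": ["Utah", "Ohio"],
})
products_toy = pd.DataFrame({
    "LP ID": [11, 22],
    "Product Name": ["Product A", "Product B"],
})

# Link engagement rows to district info via district_id,
# then to product info via lp_id (named 'LP ID' in the product table).
linked = (
    engagement_toy
    .merge(districts_toy, on="district_id", how="left", validate="many_to_one")
    .merge(products_toy, left_on="lp_id", right_on="LP ID", how="left",
           validate="many_to_one")
)
print(linked[["district_id", "state", "Product Name", "engagement_index"]])
```

The `validate="many_to_one"` argument makes pandas raise an error if a key is duplicated in the lookup table, which helps catch accidental row duplication early.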

The first thing to do is to merge all the engagement files from the directory **engagement_data**

In [None]:
# glob finds all path names that match a given pattern

import glob

eng_path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
eng_files = glob.glob(eng_path + "/*.csv")

files = []
# Loop over all files and merge them into one dataframe
for file in eng_files:
    df = pd.read_csv(file, index_col = None, header = 0)
    # The file name without its extension is the district_id
    district_id = os.path.splitext(os.path.basename(file))[0]
    df['district_id'] = district_id
    files.append(df)
    
engagement = pd.concat(files)
engagement = engagement.reset_index(drop = True)
engagement['time'] = pd.to_datetime(engagement['time'])
# Fill missing engagement_index values with 0
engagement['engagement_index'] = engagement['engagement_index'].fillna(0)


<h2><p style="color:Blue;">Data Definition: districts_info.csv</p></h2>

B) District information data
* The district file districts_info.csv includes information about the characteristics of school districts, including data from **NCES** (2018-19), **FCC** (Dec 2018), and Edunomics Lab. 
 * In this data set, we removed the identifiable information about the school districts. We also used an open source tool ARX (Prasser et al. 2020) to transform several data fields and reduce the risks of re-identification. 
 * For data generalization purposes, some data points are released as a range that contains the actual value. 
 * Additionally, many missing values are marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.
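
A range label can be turned into numeric bounds with a small parser. This is a minimal sketch, assuming the labels follow the half-open interval notation `[low, high[` seen in this file. `parse_range` is a hypothetical helper and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

def parse_range(label):
    """Return (low, high, midpoint) for a label such as '[0.2, 0.4['.

    Missing values (NaN) are passed through as NaN bounds.
    """
    if pd.isna(label):
        return (np.nan, np.nan, np.nan)
    low, high = (float(part) for part in label.strip("[]").split(","))
    return (low, high, (low + high) / 2)

# Illustrative labels, including one suppressed (NaN) value.
sample = pd.Series(["[0, 0.2[", "[0.2, 0.4[", np.nan, "[4000, 6000["])
bounds = pd.DataFrame(sample.map(parse_range).tolist(),
                      columns=["low", "high", "mid"])
print(bounds)
```

Keeping both bounds, rather than only the midpoint, preserves the fact that the true value is known only up to the width of the range.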

In [None]:
#go.Choropleth to make USA map
# https://www.kaggle.com/varshachalageri/kletech-a06

us_state_abbrev = {
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

In [None]:
districts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
districts.dropna(inplace = True)
districts['state_abbrev'] = districts['state'].replace(us_state_abbrev)

districts['district_id'] = districts['district_id'].astype(str)

districts.head(3)

districts.to_csv('district.csv')

In [None]:
districts_info_by_state = districts['state_abbrev'].value_counts().to_frame().reset_index(drop=False)
districts_info_by_state.columns = ['state_abbrev', 'num_districts']

fig = go.Figure()
layout = dict(
    title_text = "Number of Available School Districts per State",
    geo_scope='usa',
)

fig.add_trace(
    go.Choropleth(
        locations=districts_info_by_state.state_abbrev,
        zmax=1,
        z = districts_info_by_state.num_districts,
        locationmode = 'USA-states', # set of locations match entries in `locations`
        marker_line_color='white',
        geo='geo',
        colorscale=px.colors.sequential.Teal, 
    )
)
            
fig.update_layout(layout)   
fig.show()

**Field descriptions for districts_info.csv**

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **district_id**   | The unique identifier of the school district|
|02| **state** | The state in which the district is located |
|03| **locale**   | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information. |
|04| **pct_black/hispanic** | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
|05| **pct_free/reduced**   | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
|06| **county_connections_ratio**   | Ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See FCC data for more information.|
|07| **pp_total_raw**   | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district.|

**Data Definition: products_info.csv**

C) Product information data
* The product file products_info.csv includes information about the characteristics of the top 372 products with most users in 2020. 
 * The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. 
 * Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

In [None]:
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
products.dropna(inplace = True)

**Fields in products_info.csv**

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **LP ID**   | The unique identifier of the product|
|02| **URL** | Web Link to the specific product                 |
|03| **Product Name**   | Name of the specific product|
|04| **Provider/Company Name** | Name of the product provider                 |
|05| **Sector(s)**   | Sector of education where the product is used|
|06| **Primary Essential Function** | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |
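
The two-layer label can be split into its parts as follows. This is a sketch assuming the two layers are separated by `' - '`, as in `'LC - Digital Learning Platforms'`; the sample labels are illustrative and not an exhaustive list.

```python
import pandas as pd

# Illustrative two-layer labels: '<category> - <sub-category>'.
labels = pd.Series([
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction",
    "SDO - Data, Analytics & Reporting",
])

# Split once on the first ' - ' so that any later hyphens stay in the sub-category.
parts = labels.str.split(" - ", n=1, expand=True)
parts.columns = ["category", "subcategory"]
print(parts)
```

Limiting the split to one occurrence (`n=1`) guarantees exactly two columns even when a sub-category name itself contains the separator.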

In [None]:
####################################################
# Trend in the number of COVID-19 cases in the USA #
####################################################


# Thanks to SRK for providing this dataset at the following link:
# https://www.kaggle.com/sudalairajkumar/covid19-in-usa
COVID_cases = pd.read_csv('../input/covid19-in-usa/us_counties_covid19_daily.csv')
COVID_cases['date'] = pd.to_datetime(COVID_cases['date'])
COVID_cases['year'], COVID_cases['month'] = COVID_cases['date'].dt.year, COVID_cases['date'].dt.month

monthly_case = pd.DataFrame(COVID_cases.groupby(['month']).agg({'cases': 'mean'})).reset_index()
total_case_by_state = pd.DataFrame(COVID_cases.groupby('state').agg({'cases': 'sum'})).reset_index()

# Plot the mean reported case count per calendar month
# (matplotlib and seaborn were already imported in the first cell)
sns.lineplot(data=monthly_case, x="month", y="cases")

In [None]:
used_app = engagement['lp_id'].unique()
used_product = products[products['LP ID'].isin(used_app)]
engagement = engagement.rename(columns={"lp_id": "LP ID"})
app_engage = pd.merge(left=engagement,right=used_product,on='LP ID')


district_appeared = engagement['district_id'].unique()
district = districts[districts['district_id'].isin(district_appeared)]
district_app_engage = pd.merge(left=app_engage,right=district,on='district_id')


In [None]:
district_app_engage['app_category'] = district_app_engage['Primary Essential Function'].str.split(' - ',expand=True)[0]
district_app_engage['app_subcategory'] = district_app_engage['Primary Essential Function'].str.split(' - ',expand=True)[1]

district_app_engage['time'] = pd.to_datetime(district_app_engage['time'])
district_app_engage['year'], district_app_engage['month'] = district_app_engage['time'].dt.year, district_app_engage['time'].dt.month

district_app_engage = district_app_engage.drop(district_app_engage[(district_app_engage['pct_access'] == 0) & (district_app_engage['engagement_index'] == 0.0)].index)

In [None]:
# Convert the pp_total_raw range labels into numeric values.
# Each label is a half-open interval such as '[4000, 6000[';
# it is represented by its midpoint (here 5000).
# Parsing the bounds avoids silently mis-mapping a label that is
# missing from a hand-written if/elif chain (e.g. '[20000, 22000[').
def range_midpoint(label):
    low, high = label.strip('[]').split(',')
    return (float(low) + float(high)) / 2

district_app_engage['pp_total_raw_numeric'] = district_app_engage['pp_total_raw'].map(range_midpoint)

In [None]:
#from sklearn.utils import shuffle
#district_app_engage = shuffle(district_app_engage)
#district_app_engage.head(2000).to_csv('DAE_sample.csv')

# Overall Top 10 Engaged Apps by Product Name

In [None]:
# Top 10 most popular learning apps,
# ranked by mean engagement_index
TopProduct = pd.DataFrame(district_app_engage.groupby(['Product Name']).agg({'engagement_index': ['mean','max']})).reset_index()
TopProduct.columns = ['Product Name','Mean_engage','Max_engage']
TopProduct = TopProduct.sort_values(by='Mean_engage', ascending=False)
TopProduct.head(10)

# Black/Hispanic Ratio and e-Learning

**We will look at whether districts with a high Black/Hispanic ratio prefer different e-learning apps.**

In the code below, districts with a Black/Hispanic ratio between 0 and 0.4 are labeled low, those between 0.4 and 0.6 medium, and those at 0.6 or above high.
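
The same grouping can also be written as a lookup table. This sketch assumes the five labels shown are the only ones present; `BH_LEVELS` is a hypothetical name and the sample series is illustrative.

```python
import pandas as pd

# Map each range label to a coarse level; 0.6 and above counts as 'high'.
BH_LEVELS = {
    "[0, 0.2[": "low",
    "[0.2, 0.4[": "low",
    "[0.4, 0.6[": "medium",
    "[0.6, 0.8[": "high",
    "[0.8, 1[": "high",
}

# Illustrative labels in the same format as pct_black/hispanic.
sample = pd.Series(["[0, 0.2[", "[0.4, 0.6[", "[0.8, 1[", "[0.2, 0.4["])
levels = sample.map(BH_LEVELS)
print(levels.tolist())
```

Unlike an if/else chain with a catch-all branch, a label that is missing from the dictionary becomes NaN here rather than being counted as high, which makes unexpected values easier to spot.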

In [None]:
bh_ratio = []
for row in district_app_engage['pct_black/hispanic']:
        if row in('[0, 0.2[','[0.2, 0.4[')  :    bh_ratio.append('low')
        elif row == '[0.4, 0.6[':  bh_ratio.append('medium')
        
        else: bh_ratio.append('high')
            
district_app_engage['bh_ratio'] = bh_ratio

**Here are the top 10 apps used in districts with a high Black/Hispanic ratio**

In [None]:
TopProduct2 = pd.DataFrame(district_app_engage[district_app_engage['bh_ratio']=='high'].groupby(['Product Name']).agg({'engagement_index': ['mean','max']})).reset_index()
TopProduct2.columns = ['Product Name','Mean_engage','Max_engage']
TopProduct2 = TopProduct2.sort_values(by='Mean_engage', ascending=False)
TopProduct2.head(10)

From the result above, **Google Docs** is still the most popular learning tool in districts with a high Black/Hispanic ratio. The top 5 most engaged apps are the same as in the overall list, though their ranking differs slightly.

**Clever, Zoom** and **Seesaw** are on the top 10 list. **This may indicate that these districts rely on Zoom more than Google Meet for virtual classrooms.**

We will investigate the relationship between the **proportion of Black/Hispanic students** in the district and **per-pupil expenditure (NERD$)**.
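
One hedged way to quantify such a relationship is a rank correlation between the ordered ratio bins and the mean expenditure. Spearman's coefficient is an assumption chosen for this sketch, not something this notebook computes, and the numbers are made up.

```python
import pandas as pd

# Toy summary table: one row per ratio bin, values invented for illustration.
summary = pd.DataFrame({
    "ratio_bin": ["[0, 0.2[", "[0.2, 0.4[", "[0.4, 0.6[", "[0.6, 0.8[", "[0.8, 1["],
    "mean_expense": [11000.0, 10500.0, 12000.0, 11500.0, 12500.0],
})

# Use the lower bound of each bin as its ordinal position.
summary["bin_low"] = (
    summary["ratio_bin"].str.strip("[]").str.split(",").str[0].astype(float)
)

# Spearman correlation equals the Pearson correlation of the ranks.
spearman = summary["bin_low"].rank().corr(summary["mean_expense"].rank())
print(round(spearman, 3))
```

With only a handful of bins, a coefficient like this is a rough descriptive summary rather than evidence of a statistically significant effect.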

In [None]:
# Investigate whether the proportion of Black/Hispanic students
# in the district is related to educational expenditure
pctbh_PPtotal = pd.DataFrame(district_app_engage.groupby(['pct_black/hispanic']).agg({'pp_total_raw_numeric': ['mean']})).reset_index()
pctbh_PPtotal.columns = ['Black/hispanic Ratio','Mean_expense']
pctbh_PPtotal['Mean_expense'] = pctbh_PPtotal['Mean_expense'].astype(float)

In [None]:

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_ylabel('Expenditure')
ax.set_title('Education Expenditure by Black/hispanic Ratio')

ax.bar(pctbh_PPtotal['Black/hispanic Ratio'],pctbh_PPtotal['Mean_expense'])
plt.show()

The Black/Hispanic ratio does not appear to make a notable difference to NERD$ expenditure.

# Products that are declining / growing

I will compare each product's usage in January and December to see whether it increased or decreased. Only records with engagement_index > 5 are considered.
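
The comparison can be sketched on toy data. The product names and values are invented, and the `relative_change` column is an addition for illustration only; it is not part of the analysis in this notebook.

```python
import pandas as pd

# Toy engagement records for two months (1 = January, 12 = December).
toy = pd.DataFrame({
    "Product Name": ["A", "A", "A", "B", "B", "B"],
    "month": [1, 1, 12, 1, 12, 12],
    "engagement_index": [10.0, 20.0, 30.0, 40.0, 10.0, 30.0],
})

# Mean engagement per product and month, one column per month.
monthly = toy.pivot_table(values="engagement_index", index="Product Name",
                          columns="month", aggfunc="mean")
monthly = monthly.rename(columns={1: "Jan", 12: "Dec"})

# Absolute and relative change from January to December.
monthly["change"] = monthly["Dec"] - monthly["Jan"]
monthly["relative_change"] = monthly["change"] / monthly["Jan"]
print(monthly)
```

Renaming the month columns by key, rather than assigning a list of names, keeps the labels correct even if one of the two months is missing from the data.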

In [None]:
# Products that are declining / growing
grow_decline_app = district_app_engage[district_app_engage['engagement_index']>5]
grow_decline_app = grow_decline_app[grow_decline_app['month'].isin([1,12])]


**The top 10 products with increasing usage are listed below**

In [None]:
# Mean engagement per product for January and December
table = pd.pivot_table(grow_decline_app, values='engagement_index', index=['Product Name' ],
                    columns=['month'], aggfunc='mean')

table = table.reset_index()                
table.columns = ['Product Name','Jan','Dec']
table['+/- Ratio'] = table['Dec'] - table['Jan']
table.sort_values(by='+/- Ratio', ascending=False).head(10)

**The top 10 products with decreasing usage are listed below**

In [None]:
table.sort_values(by='+/- Ratio', ascending=True).head(10)

# Conclusion on Product Usage

Among the top 10 **products with decreasing usage**, the keyword "Read" appears several times.
Products that are purely readers in nature may not fulfill the needs of e-learning in the USA.

Kahoot! and Renaissance Learning, which are traditional learning websites, also show decreasing usage in 2020.

On the other hand, the products with the largest increases usually support more **interactive functions**. 

For example, Schoology supports interaction among teachers, students and parents. Google Docs involves collaboration among students.



With the general upward trend in COVID-19 cases, demand for interactive e-learning tools is likely to increase, while tools that do not support human interaction may decline.