# LearnPlatform - COVID-19 IMPACT ON DIGITAL LEARNING 📖 👨‍💻

by Rizki Amanullah Hakim (rizkiamanullah@student.telkomuniversity.ac.id)

This notebook was created using Google Colab

# Import libraries 📚

In [None]:
from IPython.display import clear_output
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import random
import glob
import plotly.express as px

# Understanding the Data 👓 & EDA 📊



## Products Dataset

### Desc

**Q: What does products_info.csv look like?**

A: Here are the first 5 rows of the dataset.


In [None]:
df_products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
df_products.head()



---



**Q: What is the statistical description from the dataset?**

In [None]:
df_products.describe(include='all')

A: The CSV file lists 372 distinct products, the ones most widely used during 2020.

Explanation of each column:
- **LP ID** = Unique ID of each product.
- **URL** = Website address of the product.
- **Product Name** = Name of the product.
- **Provider/Company Name** = Name of the company that provides the product.
- **Sector(s)** = Education sector(s) in which the product is used.
- **Primary Essential Function** = Main function of the product: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School and District Office Operations.



---



**Q: How many values are missing from the dataset?**

In [None]:
df_products.isna().sum()

The Provider/Company Name, Primary Essential Function, and Sector(s) columns contain missing values. They need to be filled in, but with a different approach for each column.

The 'Provider/Company Name' column has only 1 missing row, so it is filled with the placeholder "Missing".

In [None]:
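# Assign the result back; inplace=True on a selected column is deprecated in recent pandas
df_products['Provider/Company Name'] = df_products['Provider/Company Name'].fillna("Missing")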

For the Sector(s) and Primary Essential Function columns, missing values are filled with the mode, that is, the most frequent value in the column. To see which value that is, first inspect the distribution of each column.

Check the distribution of the Sector(s) column

In [None]:
df_products['Sector(s)'].value_counts()

Check the distribution of the Primary Essential Function column

In [None]:
df_products['Primary Essential Function'].value_counts()

Fill in the missing values

In [None]:
mode_sector = df_products['Sector(s)'].mode()
df_products['Sector(s)'] = df_products['Sector(s)'].fillna(mode_sector[0])

In [None]:
mode_primary = df_products['Primary Essential Function'].mode()
df_products['Primary Essential Function'] = df_products['Primary Essential Function'].fillna(mode_primary[0])
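
The mode-based fill can be checked on a small example. The sketch below uses made-up toy data, and `fill_with_mode` is a hypothetical helper name, not part of the notebook or of pandas. It also guards against a column in which every value is missing, where no mode exists.

```python
import pandas as pd

# Toy data standing in for df_products (values are made up for illustration).
toy = pd.DataFrame({
    "Sector(s)": ["PreK-12", "PreK-12", None, "Higher Ed", None],
})


def fill_with_mode(series: pd.Series) -> pd.Series:
    """Return a copy of `series` with missing values replaced by the mode."""
    modes = series.mode(dropna=True)
    if modes.empty:
        # Every value is missing, so there is nothing to impute with.
        return series.copy()
    # mode() can return several values on a tie; take the first one.
    return series.fillna(modes.iloc[0])


filled = fill_with_mode(toy["Sector(s)"])
print(filled.tolist())
```

Note that filling with the mode inflates the share of the most frequent category, which matters when the distribution is plotted afterwards.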



---



### Sector Column

**Q: How are the values distributed in the Sector(s) column?**

In [None]:
sector = df_products['Sector(s)'].value_counts().reset_index()
sector.columns = ['Sector(s)','percent']
sector['percent'] /= len(df_products)

fig = px.pie(sector, names='Sector(s)', values='percent',title='Distribution of Sector(s)', 
    width=700,height=500)
fig.show()
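
The share of each category can also be computed directly with `value_counts(normalize=True)`. The sketch below uses made-up toy values; the column label `share` is an assumption chosen for illustration. Naming the index and the value column explicitly avoids relying on the default column names of `reset_index`, which differ between pandas versions.

```python
import pandas as pd

# Toy column standing in for df_products['Sector(s)'] (made-up values).
toy = pd.Series(["PreK-12", "PreK-12", "Higher Ed", "Corporate"], name="Sector(s)")

# normalize=True returns fractions that sum to 1 instead of raw counts.
share = (
    toy.value_counts(normalize=True)
    .rename_axis("Sector(s)")
    .reset_index(name="share")
)
print(share)
```

The resulting frame has one row per category and can be passed to `px.pie` with `names='Sector(s)'` and `values='share'`.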

In [None]:
sns.displot(data=df_products, x='Sector(s)',aspect=2)

A: From the plots above, PreK-12 is the most common sector.



---



### Primary Essential Function Column

**Q: How are the values distributed in the Primary Essential Function column?**

In [None]:
df_products['Primary Essential Function'].value_counts()

In [None]:
sns.catplot(data=df_products, y='Primary Essential Function',kind='count',aspect=2)

A: From the plot above, most products fall under LC (Learning & Curriculum).



---



### Product Name Column

**Q: How is the distribution in the Product Name column?**

In [None]:
df_products['Product Name'].value_counts()



---



### Provider/Company Name Column

**Q: How is the distribution in the Provider/Company Name column?**

In [None]:
df_products['Provider/Company Name'].value_counts()



---



## District Dataset

### Desc

**Q: What does the district dataset look like?**

A: Here are the first 5 rows of the *district dataset*

In [None]:
df_district = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
df_district.head()



---



**Q: What are the descriptive statistics of the district dataset like?**

In [None]:
df_district.describe(include='all')

A: Many records contain NaN, possibly to protect the privacy of the school districts.

Explanation of each column:
- **district_id** = Unique ID of each district.
- **state** = State in which the district is located.
- **locale** = Locale type of the district (City, Suburb, Town, Rural).
- **pct_black/hispanic** = Percentage of students identified as Black or Hispanic.
- **pct_free/reduced** = Percentage of students eligible for free or reduced-price lunch.
- **county_connections_ratio** = Ratio of residential fixed high-speed connections (above 200 kbps) to households.
- **pp_total_raw** = Per-pupil total expenditure. The value is the median across the schools in a district.

In [None]:
df_district.isna().sum()

In [None]:
df_district[df_district['pp_total_raw'].isna()]

Drop rows that have fewer than 6 non-missing values

In [None]:
df_district.dropna(thresh=6, inplace=True)
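
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a row needs in order to be kept, not the number of missing values that is tolerated. With the seven columns listed above, `thresh=6` therefore keeps rows with at most one missing value. A minimal sketch on made-up toy data with three columns:

```python
import numpy as np
import pandas as pd

# Toy frame with three columns (values are made up for illustration).
toy = pd.DataFrame({
    "district_id": [1, 2, 3],
    "state": ["Utah", np.nan, "Ohio"],
    "locale": ["City", np.nan, np.nan],
})

# Keep rows that have at least 2 non-missing values.
kept = toy.dropna(thresh=2)
print(kept)
```

Row 2 of the toy frame has only one non-missing value and is dropped; row 3 still contains a NaN but is kept, so a later `isna().sum()` can remain non-zero.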

Check the remaining missing values in the district dataset

In [None]:
df_district.isna().sum()



---



### Locale Column

**Q: How are the values distributed in the locale column?**

In [None]:
df_district['locale'].value_counts()

In [None]:
locale = df_district['locale'].value_counts().reset_index()
locale.columns = ['locale','percent']
locale['percent'] /= len(df_district)

fig = px.pie(locale, names='locale', values='percent',
             title='Distribution Count of locale:',width=700,height=500)
fig.show()

In [None]:
sns.catplot(data=df_district, y='locale',kind='count',aspect=2)



---



### State Column

**Q: How are the values distributed in the state column?**

In [None]:
df_district['state'].value_counts()

In [None]:
state = df_district['state'].value_counts().reset_index()
state.columns = ['state','percent']
state['percent'] /= len(df_district)

fig = px.pie(state, names='state', values='percent',
             title='Distribution Count of states:',width=700,height=500)
fig.show()

In [None]:
sns.catplot(data=df_district, y='state',kind='count',aspect=2)



---



**Q: How are locales distributed across states?**

In [None]:
sns.displot(data=df_district, y='state', hue='locale', col='locale', aspect=0.6)

In [None]:
sns.displot(data=df_district, y='state',hue='locale',height=8)



---



### pct_black/hispanic Column

**Q: How are the values distributed in the pct_black/hispanic column?**

In [None]:
df_district['pct_black/hispanic'].value_counts()

Before the column can be used, each value must be converted from a range label to a single number. Here the midpoint of the range represents the percentage.

Split each pct_black/hispanic range into its lower and upper bound

In [None]:
pct_black_hispanic = df_district['pct_black/hispanic'].str.split(",",n=1,expand=True)

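Remove the bracket that remains in each part

In [None]:
# Despite their names, these two columns hold the lower and upper bound of the range.
# regex=False treats '[' as a literal character; as a regex pattern it is invalid.
df_district['pct_black'] = pct_black_hispanic[0].str.replace('[', '', regex=False)
df_district['pct_hispanic'] = pct_black_hispanic[1].str.replace('[', '', regex=False)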

Convert both columns to numeric using pandas

In [None]:
df_district['pct_black']=pd.to_numeric(df_district['pct_black'])
df_district['pct_hispanic']=pd.to_numeric(df_district['pct_hispanic'])

Replace each range with its midpoint (the mean of the two bounds)

In [None]:
df_district['pct_black/hispanic']=(df_district['pct_black'] + df_district['pct_hispanic'])/2
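
The split, strip, and average steps can also be expressed as one small function. The sketch below assumes the labels look like `[0, 0.2[`, which is inferred from the bracket handling in the cells above and not confirmed elsewhere in this notebook. `range_midpoint` is a hypothetical helper name.

```python
def range_midpoint(label):
    """Convert a range label such as '[0, 0.2[' to its midpoint.

    Returns None for missing values (anything that is not a string).
    """
    if not isinstance(label, str):
        return None
    # Drop the surrounding brackets and spaces, then split into the two bounds.
    low, high = label.strip("[] ").split(",")
    return (float(low) + float(high)) / 2


print(range_midpoint("[0, 0.2["))
print(range_midpoint(float("nan")))
```

With pandas, the function could be applied as `df_district['pct_black/hispanic'].map(range_midpoint)`, which avoids creating the intermediate bound columns.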

Then, we can continue...

In [None]:
df_district['pct_black/hispanic'].value_counts()

Distribution of pct_black/hispanic by locale

In [None]:
sns.displot(data=df_district, kind='hist', x='pct_black/hispanic', hue='locale', aspect=2)

Distribution of pct_black/hispanic by state

In [None]:
sns.displot(data=df_district, kind='hist', x='pct_black/hispanic', hue='state', aspect=2)



---



### pct_free/reduced Column

**Q: How are the values distributed in the pct_free/reduced column?**

As with pct_black/hispanic, the range labels must be converted to numbers first.

Split each pct_free/reduced range into its lower and upper bound

In [None]:
pct_free_reduced = df_district['pct_free/reduced'].str.split(",",n=1,expand=True)

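Remove the bracket that remains in each part

In [None]:
# Despite their names, these two columns hold the lower and upper bound of the range.
# regex=False treats '[' as a literal character; as a regex pattern it is invalid.
df_district['pct_free'] = pct_free_reduced[0].str.replace('[', '', regex=False)
df_district['pct_reduced'] = pct_free_reduced[1].str.replace('[', '', regex=False)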

Convert both columns to numeric using pandas

In [None]:
df_district['pct_free']=pd.to_numeric(df_district['pct_free'])
df_district['pct_reduced']=pd.to_numeric(df_district['pct_reduced'])

Replace each range with its midpoint (the mean of the two bounds)

In [None]:
df_district['pct_free/reduced']=(df_district['pct_free'] + df_district['pct_reduced'])/2

Then, we can continue...

In [None]:
df_district['pct_free/reduced'].value_counts()

Distribution of pct_free/reduced by locale

In [None]:
sns.displot(data=df_district, kind='hist', x='pct_free/reduced', hue='locale', aspect=1)

Distribution of pct_free/reduced by state

In [None]:
sns.displot(data=df_district, kind='hist', x='pct_free/reduced', hue='state', aspect=1)



---



## Engagement Dataset

### Desc

**Q: What does the engagement dataset look like?**

A: Here are the first 5 rows of the *engagement dataset*

In [None]:
import os

path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # The district id is the file name without its directory and extension.
    # (Splitting on "/" and taking index 3 returned the folder name instead.)
    district_id = os.path.splitext(os.path.basename(filename))[0]
    df["district_id"] = district_id
    li.append(df)

# Stack the per-district frames into one table with a fresh index
df_engagement = pd.concat(li, ignore_index=True)
df_engagement.head()
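
Because each engagement file is named after its district, the id can be recovered from the file name alone. The sketch below shows this with `pathlib`, under the assumption that every file name is an integer id followed by `.csv`; the file name `1000.csv` and the helper `district_id_from_path` are hypothetical.

```python
from pathlib import PurePosixPath


def district_id_from_path(filename: str) -> int:
    """Return the district id encoded in an engagement file name."""
    # .stem is the final path component without its extension.
    return int(PurePosixPath(filename).stem)


example = (
    "../input/learnplatform-covid19-impact-on-digital-learning/"
    "engagement_data/1000.csv"
)
print(district_id_from_path(example))
```

Taking the stem of the path does not depend on how many directories precede the file, so the same code keeps working if the input folder moves.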

**Q: What are the descriptive statistics of the engagement dataset like?**

In [None]:
df_engagement.describe(include='all')

A: Explanation of each column:
- **time** = Timestamp of the record.
- **lp_id** = Unique ID of each product.
- **pct_access** = Percentage of students in the district who have at least one page-load event for a given product on a given day.
- **engagement_index** = Total page-load events per one thousand students for a given product on a given day.
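
Read literally, the engagement_index definition is a simple rate. The sketch below works one example with made-up numbers. The function and its inputs are hypothetical: the engagement files contain only the resulting index, not the raw page-load or student counts.

```python
def engagement_index(page_loads: int, students: int) -> float:
    """Page-load events per 1,000 students for one product on one day."""
    if students <= 0:
        raise ValueError("students must be positive")
    return 1000 * page_loads / students


# 250 page loads in a district of 5,000 students (made-up numbers).
example = engagement_index(page_loads=250, students=5000)
print(example)
```

Scaling by district size is what makes the index comparable between small and large districts.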

Fill missing values in the lp_id column with 0 and cast the column to integer

In [None]:
df_engagement['lp_id']= df_engagement['lp_id'].fillna(0.0).astype(int)

Convert district_id into numeric

In [None]:
# Convert the engagement table's own column (not df_district's)
df_engagement['district_id'] = pd.to_numeric(df_engagement['district_id'])

Convert time into datetime

In [None]:
df_engagement['time'] = pd.to_datetime(df_engagement['time'])

Check for missing values in the engagement dataset

In [None]:
df_engagement.isna().sum()

Fill empty values in engagement_index using median

In [None]:
df_engagement['engagement_index'] = df_engagement['engagement_index'].fillna(df_engagement['engagement_index'].median())

Fill empty values in pct_access using median

In [None]:
df_engagement['pct_access'] = df_engagement['pct_access'].fillna(df_engagement['pct_access'].median())

In [None]:
df_engagement.isna().sum()

### df_district_engagement

Merge district with engagement data

In [None]:
df_district_engagement = pd.merge(df_district, df_engagement, left_on='district_id', right_on='district_id')
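
A merge only matches rows when the key columns hold comparable values, so ids parsed from file names (strings) have to be converted before joining them with numeric ids. The sketch below illustrates this on made-up toy frames. The `validate` argument is an optional safety check, added here as a suggestion, that raises an error if a district id appears more than once on the district side.

```python
import pandas as pd

# Toy frames standing in for df_district and df_engagement (made-up values).
districts = pd.DataFrame({"district_id": [1000, 2000], "state": ["Utah", "Ohio"]})
engagement = pd.DataFrame({
    "district_id": ["1000", "1000", "3000"],  # strings, as parsed from file names
    "engagement_index": [10.0, 20.0, 5.0],
})

# Align the key dtype before merging.
engagement["district_id"] = pd.to_numeric(engagement["district_id"])

merged = districts.merge(
    engagement, on="district_id", how="inner", validate="one_to_many"
)
print(merged)
```

With an inner join, districts without engagement rows (2000) and engagement rows without a district record (3000) are both dropped, so comparing row counts before and after the merge is a useful sanity check.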



---



**Q: What does df_district_engagement look like?**

A: Here are the first 5 rows of the dataset

In [None]:
df_district_engagement.head()

**Q: What are the descriptive statistics of the merged dataset like?**

In [None]:
df_district_engagement.describe(include='all')

**Q: What is the correlation between columns in df_district_engagement?**

In [None]:
sns.pairplot(df_district_engagement.iloc[:,1:])

Rebuild df_district_engagement with df_engagement on the left, joined on district_id

In [None]:
df_district_engagement = pd.merge(df_engagement,df_district,on=['district_id'])

In [None]:
df_district_engagement.head(10)

In [None]:
df_district_engagement['time'] = pd.to_datetime(df_district_engagement['time'])

In [None]:
df_district_engagement.describe(include='all').T

### df_products_district_engagement

In [None]:
df_products_district_engagement = pd.merge(df_products,df_district_engagement,left_on='LP ID', right_on='lp_id')

In [None]:
df_products_district_engagement.head()

In [None]:
df_products_district_engagement.describe(include='all')

In [None]:
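# numeric_only=True (pandas >= 1.5) skips text columns such as state or Product Name
corr = df_products_district_engagement.corr(method='pearson', numeric_only=True)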
plt.figure(figsize=(15,15))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')

In [None]:
df_products_district_engagement.corr(numeric_only=True)
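
Pearson correlation is only defined for numeric columns, and recent pandas versions raise an error when text columns are passed to `corr`. One way to make the selection explicit, sketched here on made-up toy data, is to filter the columns with `select_dtypes` first.

```python
import pandas as pd

# Toy frame mixing text and numeric columns (made-up values).
toy = pd.DataFrame({
    "state": ["Utah", "Ohio", "Utah", "Ohio"],
    "pct_access": [0.1, 0.2, 0.3, 0.4],
    "engagement_index": [10.0, 20.0, 30.0, 40.0],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr_toy = numeric.corr(method="pearson")
print(corr_toy)
```

Identifier columns such as ids are numeric as well, so they show up in the matrix even though their correlations carry no meaning; dropping them before plotting keeps the heatmap readable.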

In [None]:
df_products_district_engagement.groupby('Primary Essential Function')[['engagement_index']].median().plot(kind='bar', figsize=(15, 7), color=['blue'])

In [None]:
df_products_district_engagement.groupby('locale')[['engagement_index']].median().plot(kind='bar', figsize=(15, 7), color=['blue'])

In [None]:
df_products_district_engagement.groupby('state')[['engagement_index']].median().plot(kind='bar', figsize=(15, 7), color=['blue'])

In [None]:
top_product_name = df_products_district_engagement.groupby(by = 'Product Name', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Product Name', data = top_product_name[0:10])

In [None]:
top_sectors = df_products_district_engagement.groupby(by = 'Sector(s)', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Sector(s)', data = top_sectors[0:10])

In [None]:
# Use a separate name so the sector ranking above is not overwritten
top_functions = df_products_district_engagement.groupby(by = 'Primary Essential Function', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Primary Essential Function', data = top_functions[0:10])
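
The three rankings above follow the same pattern: group by a column, average engagement_index, sort in descending order, and keep the first rows. A possible way to factor this out is sketched below on made-up toy data; `top_n_by_mean` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def top_n_by_mean(df: pd.DataFrame, group_col: str, value_col: str, n: int = 10) -> pd.Series:
    """Return the n groups with the highest mean of `value_col`."""
    return (
        df.groupby(group_col)[value_col]
        .mean()
        .sort_values(ascending=False)
        .head(n)
    )


# Toy data standing in for df_products_district_engagement (made-up values).
toy = pd.DataFrame({
    "Product Name": ["A", "A", "B", "C", "C"],
    "engagement_index": [1.0, 3.0, 10.0, 4.0, 6.0],
})

top2 = top_n_by_mean(toy, "Product Name", "engagement_index", n=2)
print(top2)
```

The mean is sensitive to a few very active districts, so running the same ranking with the median (as in the bar charts earlier in this section) is a quick robustness check.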



---



# Conclusion

To be updated...

**Product**

**District**

**Trends Pattern**