# COVID-19 Impact on Digital Learning - LearnPlatform
This notebook is a complete solution for a Kaggle data analysis competition. The main goal is to understand the state of digital learning in 2020 and how it is affected by district demographics, broadband access, and other factors.
## Thought process


## Import Statements
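- pandas -> loading and manipulating the tabular datasets
- numpy -> numerical helpers
- os -> listing the engagement data files
- matplotlib -> plotting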

In [88]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

## Exploring the engagement datasets

In [24]:
columns = ['time', 'lp_id', 'pct_access', 'engagement_index', 'district_id']

In [47]:
all_districts_engagement = pd.DataFrame(columns=columns)

In [48]:
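engagement_data_path = 'dataset/engagement_data/'
all_districts_files = os.listdir(engagement_data_path)
district_frames = []
for filename in all_districts_files:
    district_df = pd.read_csv(os.path.join(engagement_data_path, filename))
    # Each file is named after its district id, so strip the extension to recover it
    district_df['district_id'] = int(os.path.splitext(filename)[0])
    district_frames.append(district_df)
# Concatenate once; calling pd.concat inside the loop copies the data on every iteration
all_districts_engagement = pd.concat(district_frames)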

In [59]:
n_rows = all_districts_engagement.shape[0]
for column in all_districts_engagement.columns:
    nans = all_districts_engagement[column].isna().sum()
    print(f'NaN values in {column.upper()} column are {nans}, which is {(nans/n_rows)*100:.2f}% of the total rows')

NaN values in TIME column are 0, which is 0.00% of the total rows
NaN values in LP_ID column are 541, which is 0.00% of the total rows
NaN values in PCT_ACCESS column are 13447, which is 0.06% of the total rows
NaN values in ENGAGEMENT_INDEX column are 5378409, which is 24.09% of the total rows
NaN values in DISTRICT_ID column are 0, which is 0.00% of the total rows
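
The same per-column summary can be produced without an explicit loop. The sketch below is one possible variant, not the notebook's own code: the helper name `nan_summary` and the toy values are assumptions made for illustration, and it relies on the fact that `isna().mean()` gives the fraction of missing rows per column.

```python
import numpy as np
import pandas as pd

def nan_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the NaN count and NaN percentage for every column of `df`."""
    counts = df.isna().sum()
    # The mean of a boolean mask is the fraction of True values
    percent = df.isna().mean() * 100
    return pd.DataFrame({"nan_count": counts, "nan_pct": percent.round(2)})

# Toy frame standing in for all_districts_engagement (values are made up)
toy = pd.DataFrame({
    "pct_access": [0.07, 0.0, np.nan, 0.01],
    "engagement_index": [10.81, np.nan, np.nan, 0.72],
})
summary = nan_summary(toy)
print(summary)
```

On the full dataset, `nan_summary(all_districts_engagement)` would return one row per column, which is easier to sort or plot than printed text.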


In [108]:
all_districts_engagement[all_districts_engagement["district_id"] == 9463]

|  | time | lp_id | pct_access | engagement_index | district_id |
|---|---|---|---|---|---|
| 0 | 2020-01-01 | 28504.0 | 0.07 | 10.81 | 9463 |
| 1 | 2020-01-01 | 94058.0 | 0.0 | NaN | 9463 |
| 2 | 2020-01-01 | 33562.0 | 0.02 | 1.44 | 9463 |
| 3 | 2020-01-01 | 55278.0 | 0.0 | NaN | 9463 |
| 4 | 2020-01-01 | 98001.0 | 0.01 | 0.72 | 9463 |
| ... | ... | ... | ... | ... | ... |
| 175969 | 2020-12-31 | 69429.0 | 0.0 | NaN | 9463 |
| 175970 | 2020-12-31 | 70740.0 | 0.02 | 2.6 | 9463 |
| 175971 | 2020-12-31 | 31218.0 | 0.01 | 6.33 | 9463 |
| 175972 | 2020-12-31 | 35348.0 | 0.0 | NaN | 9463 |
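
In the rows shown for district 9463, every missing `engagement_index` sits next to a `pct_access` of 0.0. Whether that holds for the whole dataset is an open question. The sketch below shows one way to check it, using only the first five preview rows as toy data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First five rows of the district 9463 preview, used here as toy data
preview = pd.DataFrame({
    "lp_id": [28504.0, 94058.0, 33562.0, 55278.0, 98001.0],
    "pct_access": [0.07, 0.0, 0.02, 0.0, 0.01],
    "engagement_index": [10.81, np.nan, 1.44, np.nan, 0.72],
})

missing = preview["engagement_index"].isna()
zero_access = preview["pct_access"] == 0

# Rows where engagement is missing and nobody accessed the product
missing_and_zero = int((missing & zero_access).sum())
# Rows where engagement is missing although pct_access is non-zero (or itself missing)
missing_not_zero = int((missing & ~zero_access).sum())

print(f"missing with zero access: {missing_and_zero}")
print(f"missing without zero access: {missing_not_zero}")
```

If `missing_not_zero` stays at or near zero on the full data, treating a missing `engagement_index` as "no engagement" becomes a defensible assumption. Otherwise the missing values need a different treatment.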


In [103]:
all_districts_engagement['lp_id'].value_counts(dropna=False).head(10)

95731.0    78295
99916.0    77304
26488.0    76214
28504.0    75814
33185.0    73778
72758.0    72800
32213.0    72321
13496.0    71122
69827.0    70979
69863.0    70566
Name: lp_id, dtype: int64
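
The product ids print as floats (`95731.0`), most likely because the `lp_id` column contains NaN values and NumPy integer columns cannot hold them. A minimal sketch of one workaround, assuming the ids are always whole numbers, is to cast to pandas' nullable `Int64` dtype. The toy series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_districts_engagement['lp_id']
lp_id = pd.Series([95731.0, 99916.0, np.nan, 95731.0], name="lp_id")

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>
lp_id_int = lp_id.astype("Int64")

print(lp_id_int.value_counts(dropna=False))
```

The cast raises an error if any id has a fractional part, which doubles as a sanity check on the assumption.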

In [99]:
all_districts_engagement.shape[0]

22324190

## Exploring the districts info dataset

In [109]:
districts_dataset = pd.read_csv("dataset/districts_info.csv")

In [112]:
districts_dataset.head(30)

|  | district_id | state | locale | pct_black/hispanic | pct_free/reduced | county_connections_ratio | pp_total_raw |
|---|---|---|---|---|---|---|---|
| 0 | 8815 | Illinois | Suburb | [0, 0.2[ | [0, 0.2[ | [0.18, 1[ | [14000, 16000[ |
| 1 | 2685 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 4921 | Utah | Suburb | [0, 0.2[ | [0.2, 0.4[ | [0.18, 1[ | [6000, 8000[ |
| 3 | 3188 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2238 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 5987 | Wisconsin | Suburb | [0, 0.2[ | [0, 0.2[ | [0.18, 1[ | [10000, 12000[ |
| 6 | 3710 | Utah | Suburb | [0, 0.2[ | [0.4, 0.6[ | [0.18, 1[ | [6000, 8000[ |
| 7 | 7177 | North Carolina | Suburb | [0.2, 0.4[ | [0.2, 0.4[ | [0.18, 1[ | [8000, 10000[ |
| 8 | 9812 | Utah | Suburb | [0, 0.2[ | [0.2, 0.4[ | [0.18, 1[ | [6000, 8000[ |
| 9 | 6584 | North Carolina | Rural | [0.4, 0.6[ | [0.6, 0.8[ | [0.18, 1[ | [8000, 10000[ |
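
The demographic and spending columns are stored as interval strings such as `[14000, 16000[`, which have to be converted to numbers before they can be aggregated or plotted. The sketch below shows one possible way to do that. The helper `parse_interval` and the column names `pp_total_low`, `pp_total_high`, and `pp_total_mid` are hypothetical, and using the midpoint as a representative value is an assumption.

```python
import pandas as pd

def parse_interval(text):
    """Parse '[low, high[' into (low, high); return NaNs for missing values."""
    if not isinstance(text, str):
        return (float("nan"), float("nan"))
    # Drop the surrounding brackets, then split on the comma
    low, high = text.strip("[]").split(",")
    return (float(low), float(high))

# Toy stand-in for districts_dataset['pp_total_raw']
spend = pd.Series(["[14000, 16000[", None, "[6000, 8000["], name="pp_total_raw")

bounds = pd.DataFrame(
    spend.map(parse_interval).tolist(),
    columns=["pp_total_low", "pp_total_high"],
)
# Midpoint of each bin as a single numeric value per district
bounds["pp_total_mid"] = (bounds["pp_total_low"] + bounds["pp_total_high"]) / 2
print(bounds)
```

The same helper applies to `pct_black/hispanic`, `pct_free/reduced`, and `county_connections_ratio`, since they use the same notation.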


In [None]:
# Interval notation used in districts_info.csv: "[14000, 16000[" means 14000 <= value < 16000