                                           Team 10
                                        5th SEM, D Division
                                    Department of Computer Science

                                          5DMACP03
                            Impact of Covid-19 on Student Learning

                      Name                                USN                      Roll Number
                Annapurneshwari S. Kattishettar       01FE20BCS420                     467
                Dhanalakshmi R Hiremath               01FE20BCS421                     466
                Swathi S                              01FE20BCS412                     463
                Pradyumn P Gurlhosur                  01FE20BCS407                     461


**Problem Statement** 

  The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most state and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about an exacerbated digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

In [None]:
#Importing Libraries
import glob
import warnings
import numpy as np 
from numpy import float64
import pandas as pd
import missingno as msno
import plotly as py
import seaborn as sns
import string  # not aliased to `str`, which would shadow the built-in str type
import math
import statistics as stat
import plotly.express as px
import plotly.graph_objs as go
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)
import matplotlib.pyplot as plt
%matplotlib inline

**Name : Districts_info.csv**

district_id : The unique identifier of the school district.

state : The state in which the district is located.

locale : NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural.

pct_black/hispanic : Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data.

pct_free/reduced : Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data.

county_connections_ratio : Ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households.

pp_total_raw : Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project.

In [None]:
#Reading the districts_info.csv
districts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

In [None]:
#Display the districts_info.csv
districts.head(10)

In [None]:
#Check the number of null values in districts_info.csv
districts.isna().sum()

In [None]:
#Check the percentage of null values in districts file
districts_nan = (districts.isnull().sum() / districts.shape[0]) * 100
districts_nan[districts_nan > 0]

**Name : Products_info.csv**

LP ID : The unique identifier of the product.

URL : Web Link to the specific product.

Product Name : Name of the specific product.

Provider/Company Name : Name of the product provider.

Sector(s) : Sector of education where the product is used.

Primary Essential Function : The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products are labeled.

In [None]:
#Reading the products_info.csv
product = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
#Display the products data
product.head(10)

In [None]:
#Check the number of null values in products_info.csv
product.isna().sum()

In [None]:
#Check the percentage of null values in product file
products_nan = (product.isnull().sum() / product.shape[0]) * 100
products_nan[products_nan > 0]

**Name : Engagement_data**

time : date in "YYYY-MM-DD"

lp_id : The unique identifier of the product.

pct_access : Percentage of students in the district who have at least one page-load event of a given product on a given day.

engagement_index : Total page-load events per one thousand students of a given product on a given day.
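
As a rough illustration of the two metrics defined above, the arithmetic can be sketched as follows. The formulas are an assumption read directly off the column descriptions (the dataset ships the values precomputed), and the function and argument names are hypothetical.

```python
def pct_access(students_with_page_load: int, students: int) -> float:
    """Percentage of students with at least one page-load event (assumed formula)."""
    return 100 * students_with_page_load / students


def engagement_index(page_loads: int, students: int) -> float:
    """Page-load events per one thousand students (assumed formula)."""
    return 1000 * page_loads / students


# Toy numbers, not taken from the dataset:
# 50 of 200 students opened the product, producing 250 page loads in one day.
example_pct = pct_access(50, 200)
example_index = engagement_index(250, 200)
```

Under these assumptions, a district of 200 students with 250 page loads in one day has an engagement index above 1000, because the index counts events rather than distinct students.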

In [None]:
path = "/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data"
engagement_data = glob.glob(path + "/*.csv")

li = []
for filename in engagement_data:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
    
engagement_df = pd.concat(li, axis=0, ignore_index=True)
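
The loop above concatenates the engagement files without recording which district each row came from. A minimal sketch of one way to keep that information is shown below, assuming each file is named after its district (for example `1000.csv`); this naming is an assumption about the data layout and should be checked against the actual folder. The helper `load_engagement` and the toy files are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd


def load_engagement(folder):
    """Read every CSV in `folder` and tag each row with the district id from the file name."""
    frames = []
    for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df = pd.read_csv(filename)
        # Assumption: the file name without its extension is the district id.
        df["district_id"] = int(os.path.splitext(os.path.basename(filename))[0])
        frames.append(df)
    return pd.concat(frames, axis=0, ignore_index=True)


# Toy demonstration with two tiny files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    row = {"time": ["2020-01-01"], "lp_id": [1.0], "pct_access": [0.5], "engagement_index": [10.0]}
    pd.DataFrame(row).to_csv(os.path.join(tmp, "1000.csv"), index=False)
    pd.DataFrame(row).to_csv(os.path.join(tmp, "2000.csv"), index=False)
    demo_engagement = load_engagement(tmp)
```

With a `district_id` column on every engagement row, the later joins can match rows to districts directly.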

In [None]:
print('\033[1m'"Shape of the Engagement File "'\033[0m',engagement_df.shape )
print('\033[1m'"Shape of the District file"'\033[0m', districts.shape)
print('\033[1m'"Shape of the Product File"'\033[0m',product.shape)

In [None]:
engagement_df

In [None]:
print('\033[1m'"Data types of each column in district data file\n"'\033[0m',districts.dtypes)
print('\033[1m'"Data types of each column in product data file\n"'\033[0m',product.dtypes)
print('\033[1m'"Data types of each column in engagement data file\n"'\033[0m',engagement_df.dtypes)

In [None]:
#Missing value count  
print('\033[1m'"Missing value count in each column of district data file\n"'\033[0m',districts.isna().sum())
print('\033[1m'"Missing value count in each column of product data file\n"'\033[0m',product.isna().sum())
print('\033[1m'"Missing value count in each column of engagement data file\n"'\033[0m',engagement_df.isna().sum())

In [None]:
msno.bar(districts,color='Green', sort="ascending", figsize=(10,5), fontsize=12)
plt.show()

In [None]:
msno.bar(product,color='Yellow', sort="ascending", figsize=(10,5), fontsize=12)
plt.show()

In [None]:
msno.bar(engagement_df,color='Pink', sort="ascending", figsize=(10,5), fontsize=12)
plt.show()

**Data Preprocessing**

In [None]:
# Drop the rows whose state is NaN (these make up less than 30% of the rows)
districts = districts[districts['state'].notna()].reset_index(drop=True)

In [None]:
# Count of districts per state in the dataframe
plt.figure(figsize=(10,12))
sns.countplot(y ='state',data = districts,order=districts['state'].value_counts().index)
plt.show()

In [None]:
#Replace the NaN values in pct_free/reduced, county_connections_ratio and pp_total_raw
#with the placeholder range '-1,-1', so that the mean computed for a missing value is -1
districts['pct_free/reduced']  = districts['pct_free/reduced'].fillna('-1,-1')
districts['county_connections_ratio'] = districts['county_connections_ratio'].fillna('-1,-1')
districts['pp_total_raw'] = districts['pp_total_raw'].fillna('-1,-1')

#After replacing with -1
districts

In [None]:
#After replacing the NaN values with the '-1,-1' placeholder, compute the mean of each value that was given as a range
#Adding columns with mean values for pct_black/hispanic, pct_free/reduced, county_connections_ratio and pp_total_raw
def splitValues(column):
    mean_val = []
    for val in column:
        # A value looks like '[0.2, 0.4[': strip the brackets and split into lower and upper bound
        low, high = val.strip('[]').split(',')
        value_1 = pd.to_numeric(low.strip(),errors='coerce')
        value_2 = pd.to_numeric(high.strip(),errors='coerce')
        mean_val.append((value_1 + value_2)/2)
    return mean_val

districts['black/hisp_pct_mean'] = splitValues(districts['pct_black/hispanic'])
districts['free/red_pct_mean'] = splitValues(districts['pct_free/reduced'])
districts['county_connec_ratio_mean'] = splitValues(districts['county_connections_ratio'])
districts['pp_total_raw_mean'] = splitValues(districts['pp_total_raw'])

#After adding mean columns
districts
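
To make the range-to-mean step concrete, here is a small self-contained sketch on toy values. It assumes the ranges are strings of the form `[low, high[`, as the bracket stripping in `splitValues` suggests; the helper name `interval_midpoint` and the toy series are hypothetical, and unlike `splitValues` this version returns NaN for missing entries instead of relying on a placeholder.

```python
import pandas as pd


def interval_midpoint(value):
    """Midpoint of a range string such as '[0.2, 0.4['; NaN when it cannot be parsed."""
    if not isinstance(value, str):
        return float("nan")
    parts = value.strip("[]").split(",")
    if len(parts) != 2:
        return float("nan")
    try:
        low, high = float(parts[0]), float(parts[1])
    except ValueError:
        return float("nan")
    return (low + high) / 2


# Toy values, not taken from the dataset.
toy_ranges = pd.Series(["[0, 0.2[", "[0.2, 0.4[", None])
toy_midpoints = toy_ranges.map(interval_midpoint)
```

Keeping missing entries as NaN means later aggregations such as `mean()` skip them, whereas a `-1` placeholder has to be filtered out explicitly.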

In [None]:
# Rechecking the number of missing values in each column of the districts dataframe

districts.isna().sum()

In [None]:
# Count of each locale in the dataframe
plt.figure(figsize=(10,12))
sns.countplot(x ='locale',data = districts,order=districts['locale'].value_counts().index)
plt.show()

In [None]:
#Plotting Graph
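# rename_axis/reset_index(name=...) yields the columns 'locale' and 'count' regardless of the pandas version
fig = px.pie(districts['locale'].value_counts().rename_axis('locale').reset_index(name = 'count'), values = 'count', names = 'locale', width = 700, height = 700)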

fig.update_traces(textposition = 'inside', 
                  textinfo = 'percent + label', 
                  hole = 0.7, 
                  marker = dict(colors = ['Blue','Green','Red','Orange'], line = dict(color = 'white', width = 2)))

fig.update_layout(annotations = [dict(text = ' The total count of <br>districts in different <br> areas', 
                                      x = 0.5, y = 0.5, font_size = 26, showarrow = False, 
                                      font_family = 'monospace',
                                      font_color = 'black')],
                  showlegend = False)
                  
fig.show()

**Products_info**

In [None]:
#Number of missing values in product_info.csv
print('\033[1m'"Missing value count in each column of product data file\n"'\033[0m',product.isna().sum())

In [None]:
plt.figure(figsize=(16, 10))
sns.countplot(y='Provider/Company Name', data=product, order=product["Provider/Company Name"].value_counts().index[:10],palette = 'cool')
plt.title("Top 10 Provider/Company Names",font="Serif", size=20)
plt.show()

In [None]:
product['Provider/Company Name']  = product['Provider/Company Name'].fillna('Google LLC')
product

In [None]:
plt.figure(figsize=(16, 10))
sns.countplot(y='Sector(s)', data=product, order=product["Sector(s)"].value_counts().index[:10],palette = 'cool')
plt.title("Top 10 Sector(s)",font="Serif", size=20)
plt.show()

In [None]:
product['Sector(s)']  = product['Sector(s)'].fillna('PreK-12')
product

In [None]:
plt.figure(figsize=(16, 10))
sns.countplot(y='Primary Essential Function', data=product, order=product["Primary Essential Function"].value_counts().index[:10],palette = 'cool')
plt.title("Top 10 Primary Essential Function",font="Serif", size=20)
plt.show()

In [None]:
product['Primary Essential Function']  = product['Primary Essential Function'].fillna('LC-Digital Learning Platforms')
product
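
The three fills above hard-code the replacement category. If the intent is to impute each column with its most frequent value (an assumption; the notebook does not state the rule), the fill value could instead be derived from the data. A minimal sketch on toy data, with a hypothetical helper `fill_with_mode`:

```python
import pandas as pd


def fill_with_mode(frame, column):
    """Return a copy of `frame` where NaN in `column` is replaced by the most frequent value."""
    most_frequent = frame[column].mode(dropna=True).iloc[0]
    filled = frame.copy()
    filled[column] = filled[column].fillna(most_frequent)
    return filled


# Toy data, not taken from products_info.csv.
toy_products = pd.DataFrame({"Sector(s)": ["PreK-12", "PreK-12", "Higher Ed", None]})
toy_filled = fill_with_mode(toy_products, "Sector(s)")
```

Deriving the value this way keeps the cell correct if the input file changes, at the cost of hiding which category was actually used unless it is printed.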

In [None]:
#Number of missing values in product_info.csv
print('\033[1m'"Missing value count in each column of product data file\n"'\033[0m',product.isna().sum())

In [None]:
#plotting graph
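# rename_axis/reset_index(name=...) yields the columns 'Sector(s)' and 'count' regardless of the pandas version
fig = px.pie(product['Sector(s)'].value_counts().rename_axis('Sector(s)').reset_index(name = 'count').head(15), values = 'count', names = 'Sector(s)', width = 700, height = 700)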

fig.update_traces(textposition = 'inside', 
                  textinfo = 'percent + label', 
                  hole = 0.7, 
                  marker = dict(colors = ['Red','Blue','Green','orange', 'yellow'], line = dict(color = 'white', width = 2)))

fig.update_layout(annotations = [dict(text = 'Sector of education <br>where the product is used', 
                                      x = 0.5, y = 0.5, font_size = 26, showarrow = False, 
                                      font_family = 'monospace',
                                      font_color = 'black')],
                  showlegend = False)
                  
fig.show()

In [None]:
plt.figure(figsize=(10,10))
# countplot(y=...) draws horizontal bars: categories on the y-axis, counts on the x-axis
sns.countplot(y=districts['pct_black/hispanic'],order=districts['pct_black/hispanic'].value_counts().index[:],palette=sns.cubehelix_palette(8, start=.75, rot=-.150))
plt.title('Pct_black/Hispanic',fontsize=40)
plt.xlabel('Count', fontsize=20)
plt.ylabel('Pct_black/hispanic', fontsize=20)

In [None]:
plt.figure(figsize=(10,10))
# countplot(y=...) draws horizontal bars: categories on the y-axis, counts on the x-axis
sns.countplot(y=districts['pct_free/reduced'],order=districts['pct_free/reduced'].value_counts().index[:],palette=sns.light_palette("Orange"))
plt.title('pct_free/reduced',fontsize=40)
plt.xlabel('Count', fontsize=20)
plt.ylabel('pct_free/reduced', fontsize=20)

In [None]:
plt.figure(figsize=(10,10))
# countplot(y=...) draws horizontal bars: categories on the y-axis, counts on the x-axis
sns.countplot(y=districts['pp_total_raw'],order=districts['pp_total_raw'].value_counts().index[:])
plt.title('pp_total_raw',fontsize=40)
plt.xlabel('Count', fontsize=20)
plt.ylabel('pp_total_raw', fontsize=20)

**Engagement_data**

In [None]:
#Missing value count for merged Engagement_data
print('\033[1m'"Missing value count in each column of engagement data file\n"'\033[0m',engagement_df.isna().sum())

In [None]:
# Fill with the numeric placeholder -1 (not the string '-1') so the columns keep a numeric dtype
engagement_df['lp_id']  = engagement_df['lp_id'].fillna(-1)
engagement_df['pct_access'] = engagement_df['pct_access'].fillna(-1)
engagement_df['engagement_index'] = engagement_df['engagement_index'].fillna(-1)

#After replacing with -1
engagement_df

In [None]:
districts1=districts[['district_id']]
districts1.head(10)

In [None]:
product['LP ID'] = product['LP ID'].astype(float64)
product.info()

In [None]:
engagement_df['lp_id'] = engagement_df['lp_id'].astype(float64)
engagement_df.info()


In [None]:
#merge engagement_data and products file
engagement_product = pd.merge(engagement_df, product, left_on='lp_id', right_on='LP ID' )
engagement_product

In [None]:
districts1=districts[['district_id']]
districts1.head(10)

In [None]:

product1=product[['LP ID']]
product1.head(10)

In [None]:
frames = [districts1, product1]

In [None]:
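# NOTE: concat with axis=1 places district_id and LP ID side by side by row position;
# the resulting pairs do not express any relationship between a district and a product
d1 = pd.concat([districts1, product1], axis = 1).T.drop_duplicates().T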

In [None]:
print(d1)

In [None]:
DP = pd.merge(engagement_product, d1, left_on='lp_id', right_on='LP ID' )
print(DP)

In [None]:
engagement_district_products = pd.merge(DP, districts, on="district_id")
print(engagement_district_products)
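
The merges above go through the helper frame `d1`. A possible alternative, sketched here on toy data, joins the three tables directly; it assumes every engagement row already carries its own `district_id` column, which the notebook as written does not add. All frames and values below are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three tables, not taken from the dataset.
toy_engagement = pd.DataFrame({
    "district_id": [1000, 1000, 2000],  # assumed to exist on each engagement row
    "lp_id": [10.0, 11.0, 10.0],
    "engagement_index": [5.0, 7.5, 1.2],
})
toy_products = pd.DataFrame({"LP ID": [10.0, 11.0], "Product Name": ["A", "B"]})
toy_districts = pd.DataFrame({"district_id": [1000, 2000], "state": ["Utah", "Illinois"]})

# validate="many_to_one" raises if a product or district id is duplicated on the
# right-hand side, which would otherwise silently multiply engagement rows.
toy_merged = (
    toy_engagement
    .merge(toy_products, left_on="lp_id", right_on="LP ID", how="left", validate="many_to_one")
    .merge(toy_districts, on="district_id", how="left", validate="many_to_one")
)
```

A left join keeps every engagement row, so the row count before and after the merge can be compared as a quick sanity check.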

In [None]:
engagement_district_products

In [None]:
result_df=engagement_district_products.drop('LP ID_y',axis=1)

In [None]:
result_df

In [None]:
#What is the picture of digital connectivity and engagement in 2020 ?
# .copy() makes Graph1 an independent dataframe, so columns can be added to it later
Graph1= result_df[['lp_id','district_id','time', 'pct_access', 'engagement_index']].copy()
Graph1.head(5)

In [None]:
# Split "YYYY-MM-DD" into its three parts (kept as strings)
date_parts = Graph1["time"].str.split("-", expand=True)
Graph1["year"] = date_parts[0]
Graph1["month"] = date_parts[1]
Graph1["day"] = date_parts[2]
Graph1.head()
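
The cell above keeps year, month and day as strings. If numeric date parts are preferred, for example to sort or group by month, one option is to parse the column with `pd.to_datetime`. A minimal sketch on toy data, assuming every value follows the "YYYY-MM-DD" format given in the column description:

```python
import pandas as pd

# Toy dates, not taken from the dataset.
toy_dates = pd.DataFrame({"time": ["2020-01-05", "2020-12-31"]})

# Parse once, then read the components through the .dt accessor.
parsed = pd.to_datetime(toy_dates["time"], format="%Y-%m-%d")
toy_dates["year"] = parsed.dt.year
toy_dates["month"] = parsed.dt.month
toy_dates["day"] = parsed.dt.day
```

The parsed column can also be used directly as a time axis in plots, which avoids reassembling the date from its parts.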