**Problem Statement**

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most state and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. Concerns about a widening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import pandas as pd
import numpy as np  
import seaborn as sns 
pal = sns.color_palette()
from wordcloud import WordCloud
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
from sklearn import preprocessing
import glob
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
import datetime as dt

## **Districts data**

District data includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab.

In [None]:
districts_data=pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")

In [None]:
districts_data.head()

**Missing data in district dataset**

In [None]:
# plotting missing data information

missing_district_data = pd.DataFrame(districts_data.isnull().sum()/len(districts_data)).reset_index().sort_values(by=0)

plt.figure(figsize=(8,6))
plt.bar(missing_district_data['index'],missing_district_data[0],alpha=0.75)
plt.xticks(rotation=90)

plt.rc('ytick', labelsize=10)
plt.title('Percentage of missing data')

**Missing values in district data**
* About 50% of the values in the pp_total_raw column are missing.
* More than 24% of the values are missing in each column other than district_id and pp_total_raw.
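
The shares above come from dividing the per-column null count by the number of rows. As a minimal sketch of that computation, the toy frame below (its values are invented for illustration and are not taken from `districts_info.csv`) uses `isnull().mean()`, which gives the same fractions in one step:

```python
import numpy as np
import pandas as pd

# Toy stand-in for districts_data; every value here is invented.
toy = pd.DataFrame({
    "district_id": [1, 2, 3, 4],
    "state": ["Utah", None, "Ohio", "Utah"],
    "pp_total_raw": [np.nan, np.nan, "[8000, 10000[", "[10000, 12000["],
})

# The mean of a boolean mask is the fraction of missing entries per column.
missing_share = toy.isnull().mean().sort_values()
print(missing_share)
```

Multiplying `missing_share` by 100 gives the percentages quoted in the bullets.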

**Unique values in district data**

To understand the district data, let us look at the unique values in each column.

In [None]:
# Print the unique values of each categorical column
object_columns = ['state', 'locale', 'pct_black/hispanic','pct_free/reduced', 'county_connections_ratio', 'pp_total_raw']

for col in object_columns:
    bold_start = '\033[1m'
    bold_end   = '\033[0m'
    print(bold_start,'\nUnique values for ' + col + ' :',bold_end)
    unique_values = districts_data[col].dropna().unique()
    print(unique_values)
    print('Number of unique values is:' + str(len(unique_values)))

In [None]:
state_codes = {
    'District of Columbia' : 'DC','Mississippi': 'MS', 'Oklahoma': 'OK', 
    'Delaware': 'DE', 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR', 
    'New Mexico': 'NM', 'Indiana': 'IN', 'Maryland': 'MD', 'Louisiana': 'LA', 
    'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona': 'AZ', 
    'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT', 
    'Virginia': 'VA', 'Oregon': 'OR', 'Connecticut': 'CT', 'Montana': 'MT', 
    'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV', 
    'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI',
    'Vermont': 'VT', 'Georgia': 'GA', 'North Dakota': 'ND', 
    'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY', 
    'Hawaii': 'HI', 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH', 
    'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD', 
    'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA', 
    'North Carolina': 'NC', 'New York': 'NY', 'Texas': 'TX', 
    'Nevada': 'NV', 'Maine': 'ME',np.nan:np.nan}

districts_data['state_code'] =districts_data['state'].map(state_codes)

map_plot = districts_data.dropna(subset=['state'])

**States and districts**

How many districts are listed per state?
Which states have the highest number of school districts in the data?

In [None]:
district_count = map_plot.groupby(['state','state_code'],as_index=False)['district_id'].count()

fig = go.Figure(data=go.Choropleth(
    locations=district_count['state_code'], # Spatial coordinates
    z = district_count['district_id'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    text = district_count['state'],
    colorscale = 'Reds',
    colorbar_title = "Number of school districts",
))

fig.update_layout(
    title_text = 'Number of school districts per state',
    geo_scope='usa',width=800, height=600 # limit map scope to USA
)

fig.show()

**What percentage of districts falls in each locality?**

The following plot shows the percentage of districts that fall in each locality.

There are four localities in the data:
* Suburb
* City
* Town 
* Rural

In [None]:
#Distribution of localities 

locale_count = map_plot.groupby(['locale'],as_index=False)['district_id'].count()
locale_count =locale_count.rename(columns={'district_id':'Number of districts'})

fig1 = px.pie(locale_count, values='Number of districts', names='locale', color='locale',hover_name='locale',
             color_discrete_map={'City':'lightcyan',
                                 'Rural':'cyan',
                                 'Suburb':'royalblue',
                                 'Town':'darkblue'},title='Distribution of districts across localities')
fig1.show()

**Observations from the plot**
* The majority of the districts in the data fall in the Suburb locality.
* The fewest districts fall in the Town locality.

**How is the Black and Hispanic population distributed across localities?**

Each district within a locality has a different percentage of Black and Hispanic population. Let us see what percentage of districts in each locality:
* has the highest Black and Hispanic population
* has the lowest Black and Hispanic population

In [None]:
# distribution of black and hispanic population as per locality

blc_hisp_count = map_plot.groupby(['locale','pct_black/hispanic'],as_index=False)['district_id'].count()
pct_blk_hisp = {'[0, 0.2[':'0-20%', '[0.2, 0.4[':'20-40%' ,'[0.4, 0.6[':'40-60%', '[0.6, 0.8[':'60-80%', '[0.8, 1[':'80-100%'}
blc_hisp_count['Percentage of black and hispanic population']=blc_hisp_count['pct_black/hispanic'].map(pct_blk_hisp)
blc_hisp_count = blc_hisp_count.rename(columns={'district_id':'Number of districts'})
# One frame per locale; each variable name matches the locale it filters on
city_b = blc_hisp_count[blc_hisp_count.locale=='City']
town_b = blc_hisp_count[blc_hisp_count.locale=='Town']
suburb_b = blc_hisp_count[blc_hisp_count.locale=='Suburb']
rural_b = blc_hisp_count[blc_hisp_count.locale=='Rural']
from plotly.subplots import make_subplots


fig3 = make_subplots(rows=2, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}],[{'type':'domain'},{'type':'domain'}]])

fig3.add_trace(go.Pie(labels=suburb_b['Percentage of black and hispanic population'], values=suburb_b['Number of districts'], name="Suburb"),
              1, 1)
fig3.add_trace(go.Pie(labels=city_b['Percentage of black and hispanic population'], values=city_b['Number of districts'], name="City"),
              1, 2)
fig3.add_trace(go.Pie(labels=rural_b['Percentage of black and hispanic population'], values=rural_b['Number of districts'], name="Rural"),
              2, 1)
fig3.add_trace(go.Pie(labels=town_b['Percentage of black and hispanic population'], values=town_b['Number of districts'], name="Town"),
              2, 2)
# Use `hole` to create a donut-like pie chart
fig3.update_traces(hole=.4, hoverinfo="label+percent+name")

fig3.update_layout(
    title_text="Distribution of black and hispanic population by locality",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Suburb', x=0.17, y=.8, font_size=20, showarrow=False),
                 dict(text='City', x=0.8, y=.8, font_size=20, showarrow=False),
                 dict(text='Rural', x=0.18, y=0.175, font_size=20, showarrow=False),
                dict(text='Town', x=0.82, y=0.175, font_size=20, showarrow=False)], width=900, height=800)

fig3.show()

**Observations from the plot**
* 93.9% of districts in the Rural locality have a Black and Hispanic population of 0-20%. That is, more than 90% of rural districts have a low share of Black and Hispanic population.
* Districts in the City locality are spread across the brackets: the proportions of districts in the different population brackets are roughly equal.
* The majority of districts in the Town and Suburb localities have a 0-20% Black and Hispanic population.
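
The percentages read off the pie charts can also be checked numerically. The sketch below is one possible way to do that with `pd.crosstab`; the rows are invented and only mimic the `locale` and `pct_black/hispanic` columns of the district data:

```python
import pandas as pd

# Invented rows standing in for map_plot; only the two columns used here.
toy = pd.DataFrame({
    "locale": ["Rural", "Rural", "Rural", "Rural", "City", "City"],
    "pct_black/hispanic": ["[0, 0.2[", "[0, 0.2[", "[0, 0.2[", "[0.2, 0.4[",
                           "[0, 0.2[", "[0.8, 1["],
})

# normalize="index" makes each row (locale) sum to 1; scale to percent.
share = pd.crosstab(toy["locale"], toy["pct_black/hispanic"], normalize="index") * 100
print(share.round(1))
```

Each row of `share` corresponds to one pie: the columns are the population brackets and the values are the percentage of that locale's districts in each bracket.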

# Engagement Dataset

In [None]:
import os

# Read each per-district engagement file and tag its rows with the district id
# taken from the file name (".../engagement_data/<district_id>.csv").
address = glob.glob('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/*.csv')
frames = []
for path in address[:233]:  # read at most 233 files
    data = pd.read_csv(path)
    data['district_id'] = os.path.basename(path).split('.')[0]
    frames.append(data)
# Concatenate once instead of growing the frame inside the loop
CSV_files = pd.concat(frames, axis=0)
CSV_files

**Missing values in engagement data**

In [None]:
# Percentage of missing values in each column
missing_engegement_data= pd.DataFrame(CSV_files.isnull().sum()*100/len(CSV_files)).reset_index()
missing_engegement_data = missing_engegement_data.rename(columns={'index':'Column Name', 0:'Percentage of missing data'})
missing_engegement_data

**Missing data in engagement dataframe**
* 24% of engagement_index values are missing.
* 0.06% of pct_access values are missing.
* 0.002% of lp_id values are missing.

In [None]:
# Parse the timestamp column once and derive the calendar fields from it
timestamps = pd.to_datetime(CSV_files['time'])
CSV_files['date'] = timestamps.dt.date
CSV_files['month'] = timestamps.dt.month_name()
CSV_files['weekday'] = timestamps.dt.day_name()
CSV_files

# Variation in percentage of students accessing the digital product over months
Pct_access gives the percentage of students in the district who have at least one page-load event for a given product on a given day.

Let's see whether the distribution of this variable changes over the months.
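
To make the definition concrete, here is a small sketch of how such a percentage could be derived from a raw page-load log. The dataset only ships the aggregated value, so the log, the student ids and the enrollment figure below are all hypothetical:

```python
# Hypothetical page-load log for one district, one product, one day:
# each entry is the id of a student who triggered a page load.
page_loads = ["s1", "s2", "s1", "s3", "s1"]
enrolled_students = 20  # assumed district enrollment

# Count each student once, however many pages they loaded.
active_students = len(set(page_loads))
pct_access = 100 * active_students / enrolled_students
print(pct_access)
```

The point of the sketch is that repeated page loads by the same student do not raise the value; only the number of distinct students matters.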

In [None]:
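sns.set(style="dark", rc={'figure.figsize':(15,8.27)})  # single call, so sns.set() does not reset the style
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
sns.boxplot(x=CSV_files['month'], y = CSV_files['pct_access'], order=month_order, linewidth=5).set_title('Variation in percentage of students accessing the digital product over months')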

**Observations from the plot:**
* In every month except June and July, there are some days when the percentage of students accessing the digital products is greater than 80%. The drop in June and July is likely due to the summer break, which falls in these months in most states.
* On the majority of days, the percentage of students accessing the digital products is less than 1%.
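
The second observation can be quantified rather than read off the boxplot. A minimal sketch, using invented `pct_access` values in place of the real column:

```python
import pandas as pd

# Invented values standing in for CSV_files['pct_access'].
pct_access = pd.Series([0.0, 0.2, 0.5, 0.9, 3.0, 45.0, None, 0.1])

# A comparison with NaN is False, so drop missing values first
# to keep them out of the denominator.
observed = pct_access.dropna()
share_below_one = (observed < 1).mean()
print(f"{share_below_one:.1%} of rows have pct_access below 1")
```

Applied to the real column, the same expression gives the share of district-product-day rows that fall below 1%.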

# Variation in percentage of students accessing the digital product over week days
Pct_access gives the percentage of students in the district who have at least one page-load event for a given product on a given day.

Let's see whether the distribution of this variable changes over the days of the week.

In [None]:
order = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
sns.set(style="dark", rc={'figure.figsize':(15,8.27)})  # single call, so sns.set() does not reset the style
sns.boxplot(x=CSV_files['weekday'], y = CSV_files['pct_access'],order=order, linewidth=5).set_title('Variation in percentage of students accessing the digital product over days of the week')

**Observations from the plot:**
* The percentage of students loading a page decreases on weekends.
* The pattern is similar across the working days.
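
A numeric summary can back up the weekend observation. The sketch below compares the median `pct_access` on weekends and working days; the dates and values are invented, and the `day_type` column is a helper introduced only for this example:

```python
import pandas as pd

# Invented sample: 2020-09-05 is a Saturday, 2020-09-06 a Sunday,
# 2020-09-07 a Monday and 2020-09-08 a Tuesday.
toy = pd.DataFrame({
    "time": ["2020-09-05", "2020-09-06", "2020-09-07", "2020-09-07", "2020-09-08"],
    "pct_access": [0.1, 0.2, 4.0, 6.0, 5.0],
})

toy["weekday"] = pd.to_datetime(toy["time"]).dt.day_name()
# Hypothetical helper column: label each row as weekend or weekday.
toy["day_type"] = toy["weekday"].isin(["Saturday", "Sunday"]).map(
    {True: "weekend", False: "weekday"}
)

median_by_type = toy.groupby("day_type")["pct_access"].median()
print(median_by_type)
```

The median is used instead of the mean because the boxplots show a heavily skewed distribution with many large outliers.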

# Engagement index and Pct_access
Let's see whether the variation in pct_access across districts and across days follows the same trend as the engagement index.

In [None]:
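# Select several columns with a list (double brackets); a bare tuple is rejected by recent pandas versions
engmnt_dist=CSV_files.groupby('district_id',as_index=False)[['engagement_index','pct_access']].mean()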

In [None]:
engmnt_dist['district_id'] = engmnt_dist['district_id'].astype('str')
engmnt_dist =engmnt_dist.sort_values(by='engagement_index')
plt.rc('ytick', labelsize=6)  # set before the figure is created so it takes effect
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(12,26),squeeze=True,)
# Left panel: mean engagement index, mirrored so the bars grow to the left
axes[0].barh(engmnt_dist['district_id'],engmnt_dist['engagement_index'], 
             color='red',align='center')
axes[0].invert_xaxis()
axes[0].set(title='Mean engagement index')
# Right panel: mean percent access for the same districts (shared y axis)
axes[1].barh( engmnt_dist['district_id'],engmnt_dist['pct_access'],
             color='blue',align='center')
axes[1].set(title='Mean percent access')
plt.subplots_adjust(top=.95)
fig.suptitle('Comparing engagement index and percent access by district')

**Observations from the plot**
* The general trend of engagement index and percent access is the same across the districts. Districts with a high engagement index tend to have a high percent access.
* There are some districts where the engagement index is low but percent access is high.
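
The visual impression that the two measures move together can be summarised with a correlation coefficient. This is a sketch under stated assumptions: the rows are invented, and Pearson correlation is one reasonable choice rather than the method used in the analysis above:

```python
import pandas as pd

# Invented rows standing in for the engagement data.
toy = pd.DataFrame({
    "district_id": ["a", "a", "b", "b", "c", "c"],
    "engagement_index": [10.0, 30.0, 100.0, 140.0, 300.0, 500.0],
    "pct_access": [0.1, 0.3, 0.5, 0.7, 1.0, 2.0],
})

# Same aggregation as the bar charts: one mean per district.
per_district = toy.groupby("district_id")[["engagement_index", "pct_access"]].mean()

# Pearson correlation between the two district-level means.
r = per_district["engagement_index"].corr(per_district["pct_access"])
print(round(r, 3))
```

A value close to 1 would support the first observation; districts that break the pattern pull the coefficient down.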

# Products Dataset

Products data includes information about the characteristics of the top 372 products with most users in 2020.

In [None]:
products_data = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")

**Missing values in products data**

In [None]:
# Percentage of missing values in each column
missing_products_data= pd.DataFrame(products_data.isnull().sum()*100/len(products_data)).reset_index()
missing_products_data = missing_products_data.rename(columns={'index':'Column Name', 0:'Percentage of missing data'})
missing_products_data

In [None]:
sector_group = products_data.groupby('Sector(s)')['LP ID'].count()
sector_group

**How many products lie under a particular product function?**

Primary Essential Function gives the basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled.

Let's see the distribution of products under each category.
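
The two label layers can be separated by splitting each label at its first hyphen. The sketch below shows the idea in plain Python; the label strings are illustrative examples of the "category - sub-category" shape and are not guaranteed to match `products_info.csv`, and `split_function` is a hypothetical helper:

```python
# Illustrative labels in the "<category> - <sub-category>" shape;
# the exact strings are assumptions.
labels = [
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction",
    "SDO - Data, Analytics & Reporting",
    "LC - Sites, Resources & Reference - Games & Simulations",
]

def split_function(label):
    """Split on the first hyphen only and trim surrounding spaces."""
    main, _, sub = label.partition("-")
    return main.strip(), sub.strip()

parsed = [split_function(label) for label in labels]
for main, sub in parsed:
    print(main, "|", sub)
```

Splitting only once matters for labels like the last one, where the sub-category itself contains a hyphen.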

In [None]:
products_data['Primary Essential Function'] = products_data['Primary Essential Function'].astype('str')
# Split on the first hyphen only; `n` must be passed by keyword in recent pandas versions
products_data[['Main Function','Sub Category']] = products_data['Primary Essential Function'].str.split('-', n=1, expand=True)
product_distribution = products_data.groupby(['Main Function','Sub Category'], as_index=False)['LP ID'].count()
product_distribution = product_distribution.rename(columns={'LP ID':'Number of products'})

In [None]:
fig5 = px.sunburst(product_distribution, path=['Main Function', 'Sub Category'], values='Number of products',
                  color='Number of products', 
                  color_continuous_scale='RdBu')
                 
fig5.update_layout(title='Number of products under each category of essential function')
fig5.show()

**Observations from the plot**
* 272 products fall under the Learning & Curriculum category.
* 30 products fall under School & District Operations.
* 34 products fall under Classroom Management.