### Goal of the competition:

What is the state of digital learning in 2020? And how does the engagement of digital learning relate to factors such as district demographics, broadband access, and state/national level policies and events?
This is an Analytics competition where your task is to create a Notebook that best addresses the Evaluation criteria below. Submissions should be shared directly with the host, as specified under Submission Instructions, and will be judged by the LearnPlatform team based on how well they address:

**Clarity (5 pts)**

- Did the author present a clear thread of questions or themes motivating their analysis?
- Did the author document why/what/how a set of methods was chosen and used for their analysis?
- Is the notebook documented in a way that is easily reproducible (e.g., code, additional data sources, citations)?
- Does the notebook contain clear data visualizations that help effectively communicate the author’s findings to both experts and non-experts?

**Accuracy (5 pts)**

- Did the author process the data (e.g., merging) and/or additional data sources accurately?
- Is the methodology used in the analysis appropriate and reasonable?
- Are the interpretations based on the analysis and visualization reasonable and convincing?

**Creativity (5 pts)**

- Does the notebook help the reader learn something new or challenge the reader to think in a new way?
- Does the notebook leverage novel methods and/or visualizations that help reveal insights from data and/or communicate findings?
- Did the author utilize additional public data sources in their analysis?

In [None]:
import numpy as np 
import pandas as pd 
import math
import glob
import os
import gc
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
%matplotlib inline


### Get the data in

The product file `products_info.csv` includes information about the characteristics of the 372 products with the most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

| Name | Description |
| :--- | :----------- |
| LP ID| The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |

In [None]:
products_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
print(products_df.shape)
products_df.head()


The district file `districts_info.csv` includes information about the characteristics of school districts, including data from [NCES](https://nces.ed.gov/) (2018-19), [FCC](https://www.fcc.gov/) (Dec 2018), and [Edunomics Lab](https://edunomicslab.org/). In this data set, we removed the identifiable information about the school districts. We also used an open source tool [ARX](https://arx.deidentifier.org/) [(Prasser et al. 2020)](https://onlinelibrary.wiley.com/doi/full/10.1002/spe.2812) to transform several data fields and reduce the risks of re-identification. For data generalization purposes, some data points are released as a range within which the actual value falls. Additionally, many values are missing and marked as 'NaN', indicating that the data were suppressed to maximize anonymization of the dataset.

| Name | Description |
| :--- | :----------- |
| district_id | The unique identifier of the school district |
| state | The state in which the district resides |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |
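
Because some fields are released as ranges rather than exact values, they need converting before they can be used numerically. Below is a minimal sketch that maps a range to its midpoint. It assumes the ranges are written as strings like `[0.2, 0.4[`; that format, the helper `range_midpoint`, and the toy frame are assumptions for illustration, not something stated in the description above.

```python
import math
import pandas as pd

def range_midpoint(value):
    """Return the midpoint of a range string such as '[0.2, 0.4[', or NaN."""
    if not isinstance(value, str):
        return math.nan  # suppressed values are not strings
    lower, upper = value.strip("[] ").split(",")
    return (float(lower) + float(upper)) / 2

# Toy frame standing in for districts_df; the values are made up.
toy_districts = pd.DataFrame({
    "district_id": [1, 2, 3],
    "pct_free/reduced": ["[0, 0.2[", "[0.4, 0.6[", None],
})
toy_districts["pct_free_reduced_mid"] = (
    toy_districts["pct_free/reduced"].map(range_midpoint)
)
print(toy_districts)
```

A midpoint is only one possible encoding; treating the ranges as ordered categories would avoid implying more precision than the anonymized data has.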

In [None]:
districts_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
districts_df.head()

The engagement data are aggregated at school district level, and each file in the folder `engagement_data` represents data from one school district. The 4-digit file name represents `district_id`, which can be used to link to district information in `districts_info.csv`. The `lp_id` can be used to link to product information in `products_info.csv`.

| Name | Description |
| :--- | :----------- |
| time | date in "YYYY-MM-DD" |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district with at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day |
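
To make the two metrics concrete, here is a small worked example based on my reading of the definitions above. The raw counts (enrolled students, page loads) are not part of the dataset, so the function names and numbers are purely hypothetical, and I am assuming `pct_access` is reported on a 0 to 100 scale.

```python
def pct_access(students_with_page_load, enrolled_students):
    """Share of students with at least one page load, as a percentage."""
    return 100 * students_with_page_load / enrolled_students

def engagement_index(page_loads, enrolled_students):
    """Total page loads per one thousand students."""
    return 1000 * page_loads / enrolled_students

# Hypothetical district-day: 2,000 students, 300 of whom generate
# 900 page loads of one product.
print(pct_access(300, 2000))        # 15.0
print(engagement_index(900, 2000))  # 450.0
```

The two numbers answer different questions: `pct_access` measures how widely a product is used, while `engagement_index` measures how intensively it is used.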


Since the engagement data are separated into one file per district, we read each file in and add its `district_id` as a column.

In [None]:
files = glob.glob('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/*.csv')
engagement_df= pd.concat([pd.read_csv(fp).assign(district_id=os.path.basename(fp).split('.')[0]) 
       for fp in files])
gc.collect()

In [None]:
engagement_df.head(10)
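
A possible variant of the loading step is sketched below. It parses `time` as a date, stores `district_id` as an integer up front, and resets the index so that row labels are not repeated across files. The helper name `load_engagement` is my own; it assumes every file name is the numeric district id, as described above.

```python
import glob
import os
import pandas as pd

def load_engagement(folder):
    """Concatenate per-district CSVs, tagging rows with an integer district_id."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # File name without extension is the district id, e.g. "1000.csv" -> 1000
        district_id = int(os.path.splitext(os.path.basename(path))[0])
        frame = pd.read_csv(path, parse_dates=["time"])
        frames.append(frame.assign(district_id=district_id))
    return pd.concat(frames, ignore_index=True)
```

With this variant the later `astype('int64')` casts on `district_id` would be unnecessary, and time-based grouping can use the `.dt` accessor directly.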

### Missing Data Analysis

In [None]:

msno.matrix(products_df)

In [None]:
products_df.isna().sum()/products_df.shape[0]

In [None]:
msno.matrix(districts_df)

In [None]:
districts_df.isna().sum()/districts_df.shape[0]

In [None]:
msno.matrix(engagement_df)

In [None]:
engagement_df.isna().sum()/engagement_df.shape[0]
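
The same missing-value ratio can be written more compactly as the mean of the `isna()` mask. The sketch below shows this on a made-up frame; `missing_share` is a hypothetical helper name.

```python
import pandas as pd

def missing_share(frame):
    """Fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

# Toy frame standing in for districts_df; the values are made up.
toy = pd.DataFrame({
    "district_id": [1, 2, 3, 4],
    "state": ["Utah", None, "Ohio", None],
    "locale": ["City", "Rural", None, "Town"],
})
print(missing_share(toy))
```

Sorting the result makes it easier to see at a glance which columns are most affected by suppression.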

In [None]:
gc.collect()

## Let's analyse the data in depth to derive insights



#### 1. Analyse the distribution of locale in the district dataset

In [None]:
locale_data=districts_df.groupby('locale')['district_id'].count().reset_index(name='totalcount')
fig = px.bar(locale_data, x='locale', y='totalcount')
fig.show()

In [None]:
districts_df

#### 2. Analyse the distribution of ethnicity per state

In [None]:
ethinicity_data=districts_df.groupby(['state','pct_black/hispanic'])['district_id'].count().reset_index(name='totalcount')
fig = px.bar(ethinicity_data, x='state', y='totalcount', color='pct_black/hispanic')
fig.show()

#### 3. Analyse the distribution of free/reduced lunch per state

In [None]:
lunch_data=districts_df.groupby(['state','pct_free/reduced'])['district_id'].count().reset_index(name='totalcount')
fig = px.bar(lunch_data, x='state', y='totalcount', color='pct_free/reduced')
fig.show()

#### 4. Analyse the distribution of high-speed connections per state

In [None]:
connection_data=districts_df.groupby(['state','county_connections_ratio'])['district_id'].count().reset_index(name='totalcount')
fig = px.bar(connection_data, x='state', y='totalcount', color='county_connections_ratio')
fig.show()

#### 5. What is the most used educational product across all the given districts as a function of time?

In [None]:
lp_id_performance=engagement_df[(~(engagement_df['pct_access'].isnull())
                                &(~(engagement_df['lp_id'].isnull())))].groupby(['time','lp_id'])['pct_access'].mean()
lp_id_performance=lp_id_performance.reset_index(name='average_access')
lp_id_performance['lp_id']=lp_id_performance['lp_id'].astype(int)

In [None]:
ww=lp_id_performance['lp_id'].unique().tolist()
wd=products_df['LP ID'].unique().tolist()
print("Products that are not present in the product df description",len(list(set(ww).difference(wd))))
print("Total no of distinct products",lp_id_performance['lp_id'].nunique())

There are about 8277 products that are not present in the product descriptions. Below, I evaluate the average access for the products that do have a description.
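
An alternative way to find the unmatched products is a left merge with `indicator=True`, which labels each row by where it was found. This is a minimal sketch on toy data; the ids and product names are invented.

```python
import pandas as pd

# Toy stand-ins for the distinct lp_id values and for products_df.
engagement_ids = pd.DataFrame({"lp_id": [101, 102, 103, 104]})
products = pd.DataFrame({
    "LP ID": [101, 103],
    "Product Name": ["Product A", "Product B"],
})

matched = engagement_ids.merge(
    products, how="left", left_on="lp_id", right_on="LP ID", indicator=True
)
# Rows marked "left_only" have an lp_id with no product description.
unmatched = matched.loc[matched["_merge"] == "left_only", "lp_id"]
print(len(unmatched), "of", len(engagement_ids), "ids have no product description")
```

Keeping the `_merge` column around also makes it easy to check how much engagement volume, not just how many ids, is lost by dropping the unmatched products.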

In [None]:
topproducts=lp_id_performance.groupby('lp_id')['average_access'].mean().reset_index(name='average_access')
topproducts=pd.merge(topproducts,products_df, how='left',
                                 left_on='lp_id', right_on=['LP ID'])
topproducts=topproducts[~(topproducts['Product Name'].isnull())]
topproductslist=topproducts.sort_values('average_access', ascending=False).head(10)['lp_id'].tolist()

In [None]:
lp_id_performance_filter=lp_id_performance[lp_id_performance['lp_id'].isin(topproductslist)]
lp_id_performance_filter=pd.merge(lp_id_performance_filter,products_df, how='left',
                                 left_on='lp_id', right_on=['LP ID'])
import plotly.io as pio
# Render Plotly figures inline in the notebook
pio.renderers.default = "notebook"
fig = px.line(lp_id_performance_filter, x="time", y="average_access", color="Product Name")
fig.update_layout(
    title_text="Average access for top performing educational products across timeline",
)
fig.update_xaxes(title_text="Month-Year")
fig.update_yaxes(title_text="Average access")
fig.show()

This is based only on products for which we have product information. As the code above shows, I have excluded all products that don't have a product name associated with them. From the graph we can see that Google Classroom and Google Docs were the most used products throughout the year. The dip in July and August is most likely due to the school holidays.
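
One way to check the summer dip numerically rather than by eye is to average the daily values per month. The sketch below uses invented numbers; it assumes `time` has been converted with `pd.to_datetime`, since the CSVs store it as text.

```python
import pandas as pd

# Toy daily series standing in for one product's average_access.
daily = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-05-04", "2020-05-05",
        "2020-07-06", "2020-07-07",
        "2020-09-08", "2020-09-09",
    ]),
    "average_access": [10.0, 12.0, 1.0, 3.0, 14.0, 16.0],
})

# Group by calendar month and take the mean of the daily values.
monthly = daily.groupby(daily["time"].dt.to_period("M"))["average_access"].mean()
print(monthly)
print("Lowest month:", monthly.idxmin())
```

Monthly means also smooth out the weekend drops that make the daily line chart look noisy.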

#### 6. Which educational sectors are prominent per state

In [None]:
product_engagement_merge=pd.merge(engagement_df,products_df, how='left',
                                 left_on='lp_id', right_on=['LP ID'])
product_engagement_merge=product_engagement_merge[~(product_engagement_merge['Sector(s)'].isnull())]
product_engagement_merge['district_id']=product_engagement_merge['district_id'].astype('int64')
product_state_data=pd.merge(product_engagement_merge,districts_df, how='left')
gc.collect()

In [None]:
product_state_data_percentage=(product_state_data.groupby('state')['Sector(s)'].value_counts()/\
product_state_data.groupby('state')['lp_id'].count()).reset_index(name='percentage_split')

In [None]:
fig = px.bar(product_state_data_percentage, x='state', y='percentage_split', color='Sector(s)')
fig.show()

We can see that both Arizona and North Dakota use a lot of educational products that fall under the combined sector PreK-12; Higher Ed; Corporate.
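
The per-state sector split can also be computed in one step with `value_counts(normalize=True)`, which divides each count by its group total. This is a sketch on toy data with made-up rows, not a replacement for the cell above.

```python
import pandas as pd

# Toy frame standing in for product_state_data; one row per engagement record.
toy = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Arizona", "Arizona", "Utah", "Utah"],
    "Sector(s)": [
        "PreK-12", "PreK-12", "PreK-12", "PreK-12; Higher Ed",
        "PreK-12", "PreK-12; Higher Ed",
    ],
})

sector_share = (
    toy.groupby("state")["Sector(s)"]
    .value_counts(normalize=True)   # share of rows per sector within each state
    .rename("percentage_split")
    .reset_index()
)
print(sector_share)
```

Note that these shares count engagement rows, so a sector's share reflects how many product-day records it has rather than how many students used it.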

#### 7. Which educational products are popular per state across the given timeline?

In [None]:
engagement_df['district_id']=engagement_df['district_id'].astype('int64')
engagement_district_df=pd.merge(engagement_df, districts_df, how='left')

In [None]:
lp_id_performance=engagement_district_df[(~(engagement_district_df['pct_access'].isnull())
                                &(~(engagement_district_df['lp_id'].isnull())))].groupby(['time','state','lp_id'])['pct_access'].mean()
lp_id_performance=lp_id_performance.reset_index(name='average_access')
lp_id_performance['lp_id']=lp_id_performance['lp_id'].astype(int)
gc.collect()

In [None]:
topproducts=lp_id_performance.groupby(['lp_id','state'])['average_access'].mean().reset_index(name='average_access')
topproducts=pd.merge(topproducts,products_df, how='left',
                                 left_on='lp_id', right_on=['LP ID'])
topproducts=topproducts[~(topproducts['Product Name'].isnull())]

In [None]:
# Keep the five products with the highest average access in each state
top_products_state=(topproducts.sort_values('average_access', ascending=False)
                    .groupby('state').head(5).reset_index(drop=True))

In [None]:
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
for state in top_products_state['state'].unique().tolist():
    state_list=lp_id_performance[(lp_id_performance['state']==state)&
                                (lp_id_performance['lp_id'].isin(
                                top_products_state[top_products_state['state']==state]['lp_id']
                                ))]
    state_list=pd.merge(state_list,products_df, how='left',
                                 left_on='lp_id',right_on='LP ID')
   
    fig = go.Figure()
    color=['#636EFA', '#EF553B',
           '#00CC96', '#AB63FA', '#FFA15A', '#19D3F3', '#FF6692', '#B6E880', '#FF97FF', '#FECB52']
    for i,lp_id in enumerate(state_list['lp_id'].unique().tolist()):
        fig.add_trace(go.Scatter(x=state_list[state_list['lp_id'] ==lp_id]['time'],
                                 y=state_list[state_list['lp_id'] ==lp_id]['average_access'], 
                                 name=state+'_'+str(state_list[state_list['lp_id'] ==lp_id]['Product Name'].iloc[0]),
                                 line=dict(color=color[i], width=2)))

    fig.update_layout(title='Top products by usage in '+ state,
                   xaxis_title='Month',
                   yaxis_title='Average product Access')
    fig.show()

    gc.collect()

We will have to analyze the North Dakota data to understand why we don't have enough data for a longer period.
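
A first step for that follow-up could be a coverage table per state: how many districts report, and over which dates. The sketch below uses an invented frame in place of `engagement_district_df` and assumes `time` has been parsed as a date.

```python
import pandas as pd

# Toy frame standing in for engagement_district_df; the values are made up.
toy = pd.DataFrame({
    "state": ["North Dakota", "North Dakota", "Utah", "Utah", "Utah"],
    "district_id": [1, 1, 2, 2, 3],
    "time": pd.to_datetime([
        "2020-01-02", "2020-01-03", "2020-01-02", "2020-06-01", "2020-12-30",
    ]),
})

coverage = toy.groupby("state").agg(
    districts=("district_id", "nunique"),   # number of reporting districts
    first_day=("time", "min"),
    last_day=("time", "max"),
    observed_days=("time", "nunique"),      # distinct dates with any record
)
print(coverage)
```

If a state has very few districts or a short date range in such a table, a truncated line in the per-state plots would be a property of the data rather than of the plotting code.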