# LearnPlatform Competition

# Introduction

The COVID-19 pandemic policies implemented in the spring of 2020 by state and local governments across the United States caused many educational institutions to increase their use of digital learning. The following analysis aims to identify factors that affect students' ability to engage in digital learning.

The data comes from LearnPlatform’s Student Chrome Extension analytics, the National Center for Education Statistics (NCES), the Federal Communications Commission (FCC), and Edunomics Lab.

The factors analyzed are listed in the table below:

| Name | Description |
| :--- | :----------- |
| time | Date in "YYYY-MM-DD" format |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district who have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product on a given day |
| district_id | The unique identifier of the school district |
| state | The state in which the district is located |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. |
| pct_black/hispanic | Percentage of students in the district identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the district eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county-level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |
| LP ID | The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple subcategories with which the products were labeled. |
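
As a rough illustration of the two engagement metrics defined in the table, the sketch below computes them from raw counts. The counts and variable names are hypothetical (the data set ships the metrics already computed), and the formulas are only a reading of the table's descriptions.

```python
# Hypothetical counts for one product in one district on one day (illustration only)
students_in_district = 2000
students_with_page_load = 150   # students with at least one page-load event
total_page_loads = 900          # all page-load events for the product

# pct_access: share of students with at least one page load, as a percentage
pct_access = 100 * students_with_page_load / students_in_district

# engagement_index: page-load events per one thousand students
engagement_index = 1000 * total_page_loads / students_in_district

print(pct_access)        # 7.5
print(engagement_index)  # 450.0
```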

# Prepare Data

The engagement data is split into one CSV file per district, with each file named after the district's identifier. To work with the data, these files must be merged into a single data set.

In [None]:
#Import necessary libraries

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
#Combine the data in the engagement_data folder into a single DataFrame

district_dfs = []

for dirname, _, filenames in os.walk('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data'):
    for filename in filenames:
        filepath = os.path.join(dirname, filename)
        district_df = pd.read_csv(filepath)
        #The file name (without its extension) is the district's identifier
        district_df['district_id'] = int(filename.split('.')[0])
        district_dfs.append(district_df)

#DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(district_dfs, ignore_index=True)

In [None]:
#Merge the rest of the data using unique identifiers

districts_info_df = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products_info_df = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
products_info_df = products_info_df.rename(columns=
                                           {'LP ID': 'lp_id', 
                                            'URL': 'url', 
                                            'Product Name': 'product_name',
                                            'Provider/Company Name': 'provider/company_name', 
                                            'Sector(s)': 'sector(s)', 
                                            'Primary Essential Function': 'primary_essential_function'})

df = pd.merge(df, districts_info_df, on='district_id')
df = pd.merge(df, products_info_df, on='lp_id')
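
`pd.merge` performs an inner join by default, so engagement rows whose identifier has no match in the lookup table are dropped without warning. The sketch below shows one way to count such rows before merging, using tiny stand-in tables with hypothetical values rather than the competition data.

```python
import pandas as pd

# Tiny stand-ins for the engagement and district tables (hypothetical values)
engagement = pd.DataFrame({'district_id': [1000, 1000, 2000, 3000],
                           'engagement_index': [10.0, 12.5, 3.0, 8.0]})
districts = pd.DataFrame({'district_id': [1000, 2000],
                          'state': ['Utah', 'Ohio']})

# A left join with indicator=True keeps every engagement row and
# records in the '_merge' column whether a matching district was found
checked = engagement.merge(districts, on='district_id', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

# The default inner join drops the unmatched rows instead
inner = engagement.merge(districts, on='district_id')

print(len(unmatched), len(inner))  # 1 3
```

If the number of unmatched rows is large, an inner join may change the averages reported later, so it is worth checking once.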

In [None]:
#Replace each range string with the mean of the numbers it contains

for category in ['pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw']:
    df[category] = df[category].str.extractall(r'([\d\.]+)').astype('float32').groupby(level=0).mean()
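
The range averaging above can be traced on a small example. The strings below are hypothetical and assume ranges written in a form such as `[0.2, 0.4[`. The sketch extracts every number from each string, then averages the numbers that came from the same row.

```python
import pandas as pd

# Hypothetical range strings; None stands for a missing value
ranges = pd.Series(['[0.2, 0.4[', '[14000, 16000[', None, '[0.18, 1['])

# extractall returns one row per match, indexed by (original row, match number)
numbers = ranges.str.extractall(r'([\d\.]+)')[0].astype('float64')

# Averaging over the first index level gives one midpoint per original row
midpoints = numbers.groupby(level=0).mean()

# Rows without any match are absent from midpoints; reindex restores them as NaN
result = midpoints.reindex(ranges.index)
print(result)
```

Note that the pattern only matches digits and decimal points, so it would not handle negative numbers or scientific notation.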

In [None]:
#Set the DataFrame index and data types

df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')

df = df.astype({'lp_id': 'string',
                'pct_access': 'float32',
                'engagement_index': 'float32',
                'district_id': 'string',
                'state': 'string',
                'locale': 'string',
                'pct_black/hispanic': 'float32',
                'pct_free/reduced': 'float32',
                'county_connections_ratio': 'float32',
                'pp_total_raw': 'float32',
                'url': 'string',
                'product_name': 'string',
                'provider/company_name': 'string',
                'sector(s)': 'string',
                'primary_essential_function': 'string'})

In [None]:
#Average the data by state for the choropleth map

us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District Of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}

df_choropleth = df.groupby('state')[['pct_access', 'engagement_index', 'pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw']].mean()
df_choropleth['code'] = df_choropleth.index.map(us_state_to_abbrev)

# Analyze Data

Data is easier to analyze when it is in the form of tables, bar charts, and choropleth maps.

The following was observed from the figures below:

1. Engagement with products has increased since schools initially closed in the spring of 2020
2. Engagement with products is higher on weekdays
3. Students are more likely to engage with certain types of products
4. Students are more likely to engage with specific products
5. Engagement varies greatly by state

In [None]:
df.describe().transpose()

In [None]:
df_choropleth

In [None]:
df_weekly = df.resample('W').agg({'engagement_index': 'mean'})

plt.figure(figsize=(8, 10))
sns.barplot(x='engagement_index', 
            y=df_weekly.index,
            data=df_weekly)

plt.title("Weekly Engagement Index Average")
plt.xlabel("Engagement Index")
plt.ylabel("Week")
plt.show()

In [None]:
df_daily = df.copy()
df_daily['day_of_week'] = df_daily.index.day_name()

sns.barplot(x='engagement_index', 
            y='day_of_week',
            data=df_daily,
            order=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])

plt.title("Daily Engagement Index Average")
plt.xlabel("Engagement Index")
plt.ylabel("Day of Week")
plt.show()


In [None]:
plt.figure(figsize=(8, 8))
sns.barplot(x='engagement_index', 
            y='primary_essential_function', 
            data=df,
            order=df.groupby('primary_essential_function')['engagement_index'].mean().nlargest(30).index)

plt.title("Primary Essential Function Engagement")
plt.xlabel("Engagement Index")
plt.ylabel("Primary Essential Function")
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
sns.barplot(x='engagement_index', 
            y='product_name', 
            data=df,
            order=df.groupby('product_name')['engagement_index'].mean().nlargest(30).index)

plt.title("Product Engagement")
plt.xlabel("Engagement Index")
plt.ylabel("Product Name")
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
sns.barplot(x='engagement_index', 
            y='state',
            data=df,
            order=df.groupby('state')['engagement_index'].mean().nlargest(30).index)

plt.title("State Engagement")
plt.xlabel("Engagement Index")
plt.ylabel("State")
plt.show()

In [None]:
sns.barplot(x='engagement_index', 
            y='locale',
            data=df,
            order=df.groupby('locale')['engagement_index'].mean().nlargest(30).index)

plt.title("Locale Engagement")
plt.xlabel("Engagement Index")
plt.ylabel("Locale")
plt.show()

In [None]:
fig = px.choropleth(df_choropleth, locations='code',
                    locationmode="USA-states", 
                    color='engagement_index',
                    scope="usa",
                    labels={'engagement_index':'Engagement Index'})

fig.update_layout(title_text = 'Engagement Choropleth Map')
  
fig.show()

# Identify Correlations

The graphs above indicate that engagement differs widely from product to product. Therefore, the data for each product needs to be compared separately. This can be done using a Pairplot: a grid of scatter plots, one for every pair of variables, with the univariate distribution of each variable on the diagonal.

The following figure is a Pairplot that compares the three products with the highest Engagement Index.

In [None]:
del df_daily
del df_weekly
del df_choropleth

top_products_df = pd.concat([df[df['product_name'] == product_name]  for product_name in ['Google Docs', 'Google Classroom', 'YouTube']])

sns.pairplot(
    data=top_products_df.reset_index(drop=True),
    hue='product_name',
    vars=['pct_access', 'engagement_index', 'pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw'],
    diag_kind='kde')

Pairplots are useful for visualizing correlations, but they do not quantify the strength of any one correlation. Pearson correlation coefficients can be used for that. The coefficient ranges from -1 to 1: a value of 1 or -1 indicates a perfect positive or negative linear correlation, and a value of 0 indicates no linear correlation.
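
To make the coefficient concrete, the sketch below computes it by hand for three small, made-up series and compares one result with pandas. The helper name `pearson` and the data are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1        # rises linearly with x
y_down = -x             # falls linearly with x
y_curve = (x - 3) ** 2  # depends on x, but not linearly

r_up = pearson(x, y_up)        # close to 1
r_down = pearson(x, y_down)    # close to -1
r_curve = pearson(x, y_curve)  # close to 0

# pandas computes the same quantity with Series.corr
r_up_pandas = x.corr(y_up)
print(r_up, r_down, r_curve, r_up_pandas)
```

The last series shows why a coefficient near zero should be read as "no linear relationship" rather than "no relationship".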

In [None]:
plt.figure(figsize=(8,8))

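#Only numeric columns can be correlated; recent pandas versions raise an error on text columns
sns.heatmap(df[df['product_name'] == 'Google Docs'].select_dtypes(include='number').corr(), annot=True, fmt=".2f")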
plt.title('Google Docs Pearson Correlation Matrix')
plt.show()

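Identifying correlations within a single product has its limitations, because it ignores the rest of the data set. To get around this, `pct_access` and `engagement_index` can be normalized by product: each value has its product's mean subtracted and is divided by its product's standard deviation, which puts all products on a common scale.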

In [None]:
normalized_df = df.copy()

normalized_df[['pct_access', 'engagement_index']] = df.groupby('product_name')[['pct_access', 'engagement_index']].transform(
    lambda x: (x - x.mean()) / x.std())
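
The per-product normalization can be checked on a toy table. The products `A` and `B` and their values are made up for illustration. Their raw scales differ by a factor of 100, yet both map to the same standardized values.

```python
import pandas as pd

# Two hypothetical products with very different raw engagement scales
toy = pd.DataFrame({
    'product_name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'engagement_index': [100.0, 200.0, 300.0, 1.0, 2.0, 3.0],
})

# Standardize within each product: subtract the group mean, divide by the group std
toy['z'] = toy.groupby('product_name')['engagement_index'].transform(
    lambda x: (x - x.mean()) / x.std())

print(toy)
```

A product with a single row, or with identical values in every row, yields NaN here because its standard deviation is undefined or zero.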

Normalizing the data allows for the entire dataset to be plotted.

In [None]:
sns.pairplot(
    data=normalized_df.reset_index(drop=True),
    vars=['pct_access', 'engagement_index', 'pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw'],
    diag_kind='kde')

The normalized data can also be used to calculate Pearson correlation coefficients.

In [None]:
plt.figure(figsize=(8,8))

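#Only numeric columns can be correlated; recent pandas versions raise an error on text columns
sns.heatmap(normalized_df.select_dtypes(include='number').corr(), annot=True, fmt=".2f")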
plt.title('Normalized Pearson Correlation Matrix')
plt.show()

# Results

The data suggests that school closures have contributed to increases in online learning. Beyond school closures, there is insufficient data to determine the impact of additional state interventions, practices, or policies. Data spanning multiple years would be needed to draw any precise conclusions.

The correlation matrix suggests there is a strong correlation between the Percent Access and the Engagement Index. The matrix also suggests there is a potential correlation between the percentage of students identified as Black or Hispanic and the percentage of students eligible for free or reduced-price lunch. The correlation matrix suggests there is no correlation between the percentage of students eligible for free or reduced-price lunch and the Engagement Index. 

# Conclusion

The data suggests that students across the United States have sufficient access to resources regardless of stimulus, school funding, or broadband access. The data also suggests that neither socioeconomic status nor demographics affect a student's ability to engage with digital learning. Ultimately, the factors that affect engagement with a particular product are the quality, necessity, and value of that product.

Despite this, the data fails to capture important details. Families of students eligible for free or reduced-price lunch have varying income. The majority of parents with children eligible for free or reduced-price lunch may be able to afford electronics for their children. However, many parents may not be able to.

The same argument is true for individuals who received stimulus payments. The majority of individuals who received stimulus payments may be able to afford electronics. However, many individuals may not be able to afford electronics with stimulus payments.

The solution to the issues addressed above is to provide help to students struggling to meet basic necessities, and to provide students with highly engaging products. Kaggle is a good example of a highly engaging product. Kaggle allows users to learn through competition and community.
