<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
    <h1 style="color: #3b3b3b">🤓📚 Exploring the impact of Covid-19 on Online Education</h1>
    <h2 style="color: #3b3b3b">🌎 Overview</h2>
    <h3>⁉️ What?</h3>
    <p>In this notebook we analyze <strong>usage data of online education tools 💻</strong>.<br>We want to explore the impact the Covid-19 global pandemic 🦠🌎 had on these usage numbers.</p>
    <p>The <strong>data 💽</strong> is from the <a href="https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning">Kaggle Covid-19 education challenge</a>.</p>
    <h3>🕵🏻‍♂️🕵️‍♂️ Who?</h3>
    <p>👥 This notebook was created in collaboration between <a href="https://www.kaggle.com/andruschenko">André Kovac</a> and <a href="https://www.kaggle.com/robindehde">Robin Dehde</a></p>
    <h3 style="color: #3b3b3b">🤔💭 Our research questions</h3>
    <ol>
        <li><strong style="color: #3449eb;">District demographics</strong>: Do <strong>race and/or economic background</strong> influence the effect Covid-19 had on online education?
        <p>
          To answer this question we:
          <ol>
              <li>📐 <strong>Created a metric</strong> to distinguish <strong>privileged</strong> from <strong>underserved</strong> districts</li>
              <li>🎓 Compared online tool engagement rates <strong>pre- and post-Covid-19</strong> in each group</li>
              <li>🦠🏘 Compared how differences in social background and Covid-19 timing influenced each other</li>
              <li>🤝🔦 Used <strong>hypothesis tests</strong> to test whether observed differences are not just due to chance</li>
          </ol>
        </p>
        </li>
        <li><strong style="color: #3449eb;">Engagement time series</strong>: How did engagement change over time and how does it differ state by state?</li>
        <li><strong style="color: #3449eb;">Popular products</strong>: How did the most popular online education products evolve over time in 2020?</li>
        <p>📐 We created <strong>running averages</strong> so that we can visually analyze the data</p>
    </ol>
</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
    <h3 style="color: #3b3b3b">🗓📌 A qualitative view of the <strong>data</strong></h3>
<p>Before we jump into the quantitative analysis, let's qualitatively assess the data at our disposal.</p>
</div>

<div style="background-color: #f0f0f0; padding: 5px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h4 style="color: #3b3b3b; margin-left: 5px;">💪 Engagement data</h4>
</div>

- How engaged students are given a tool and a district.
- Two measures of learning engagement are included (`pct_access` and `engagement_index`). They are aggregated over **234** US school districts.

| Name             | Description |
|------------------|-------------|
| time             | date in "YYYY-MM-DD" |
| lp_id            | The unique identifier of the product |
| pct_access       | Occurrence/absence of engagement. Ratio of engaged students |
| engagement_index | Engagement level |

##### A closer look at `pct_access` and `engagement_index`:

- `pct_access`: Percentage of students in the district who have at least one page-load event of a given product on a given day

    - **What?**: Indication of the number of students who engage with online education tools.
    - **Example**: 15% of students in a district in Utah engaged at least once with Google Docs on Monday April 14.
    - **Discussion**: Some values are greater than 1. This possibly indicates usage of multiple devices.

- `engagement_index`: Total page-load events per one thousand students of a given product and on a given day

    - **What?**: Indication of how actively a product is used (how many page loads it receives), not of how many students use it.
    - **Example**: In a district in Colorado, 341 Google Docs pages were loaded per 1000 students on Monday April 14.
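
To make the two definitions concrete, here is a minimal sketch of how both measures could be derived from a raw page-load log. The log, the `student_id` column and the enrollment figure are hypothetical; this reflects our reading of the definitions above, not LearnPlatform's actual pipeline.

```python
import pandas as pd

# Hypothetical page-load log for one product in one district on one day:
# one row per page load, identified by the student who triggered it.
events = pd.DataFrame({"student_id": [1, 1, 2, 3, 3, 3]})
enrolled_students = 20  # assumed district enrollment

# pct_access: share of students with at least one page load (in percent)
students_with_load = events["student_id"].nunique()
pct_access = 100 * students_with_load / enrolled_students

# engagement_index: total page loads per 1000 students
engagement_index = 1000 * len(events) / enrolled_students

print(pct_access, engagement_index)
```

Three students out of twenty loaded a page, and together they produced six page loads, so the two measures capture reach and intensity separately.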

<div style="background-color: #f0f0f0; padding: 5px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h4 style="color: #3b3b3b; margin-left: 5px;">🌆🏡 District data</h4>
</div>

Information about **234** school districts in which the students' data was collected.

| Name                   | Description |
|------------------------|-------------|
| district_id            | The unique identifier of the school district |
| state                  | The state where the district is located |
| locale                 | District classification: City, Suburban, Town or Rural |
| pct_black/hispanic     | Percentage of students in the districts identified as Black or Hispanic |
| pct_free/reduced       | Percentage of students in the districts eligible for free or reduced-price lunch  |
| county_connections_ratio | Ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households |
| pp_total_raw             | Per-pupil total expenditure paid by schools (sum of local and federal expenditure) as median over all schools in district |

    
##### Privileged vs. underserved districts

Districts have different demographics. We used the `pct_black/hispanic` and `pct_free/reduced` data to compute a metric of underserved districts to explore whether Covid-19 had a greater impact on either group.

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black">
<h2 style="color: #3b3b3b">🏁 Quantitative Analysis Preparation</h2>
<h3 style="color: #3b3b3b">🗓📌 Groundwork</h3>
<p>Let's import some <strong>libraries</strong> we'll use later.</p>
</div>

In [None]:
# import data science and plotting libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import plotly as py
import plotly.express as px

from pathlib import Path

# Define data paths
data_dir = Path("../input/learnplatform-covid19-impact-on-digital-learning")
output_dir = Path("./")


<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black">

<h3 style="color: #3b3b3b">💽 Engagement data for all districts: Load and concatenate</h3>

<p>Engagement data is spread across many files. We have to collect all the data and merge it into one data frame:</p>

</div>

In [None]:
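# pandas (pd) and Path were already imported above;
# read every per-district CSV and tag its rows with the district id taken from the file name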

all_file_names = list(data_dir.glob("engagement_data/*.csv"))
data_of_district = []

for filename in all_file_names:
    df = pd.read_csv(filename, index_col=None, header=0)
    district_id = filename.stem
    df["district_id"] = district_id
    data_of_district.append(df)

engagement = pd.concat(data_of_district)
engagement = engagement.reset_index(drop=True)
engagement.head()
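
The load-and-concatenate pattern can be tried out without the Kaggle data. The sketch below writes two tiny CSV files into a temporary folder (the file names and values are made up) and combines them the same way, taking the district id from each file name.

```python
import tempfile
from pathlib import Path

import pandas as pd

header = "time,lp_id,pct_access,engagement_index\n"

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two hypothetical district files, each named after its district id
    (folder / "1000.csv").write_text(header + "2020-01-01,1,0.5,10.0\n")
    (folder / "2000.csv").write_text(header + "2020-01-01,1,0.1,2.0\n")

    # Read each file and attach the district id from the file stem
    frames = [
        pd.read_csv(path).assign(district_id=int(path.stem))
        for path in sorted(folder.glob("*.csv"))
    ]

toy_engagement = pd.concat(frames, ignore_index=True)
print(toy_engagement)
```

Converting the stem to `int` right away is a design choice of this sketch; it avoids a separate `astype` step before merging on `district_id`.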

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black">

<h3 style="color: #3b3b3b">📍🏞 Districts: Load data + drop missing values</h3>

</div>

In [None]:
districts = pd.read_csv(data_dir /"districts_info.csv")
districts.dropna(inplace = True)
districts.head()

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black">

<h3 style="color: #3b3b3b">⚖️ Districts data: Average anonymized data ranges</h3>
    
<p>District data got heavily anonymized to conceal the identity of a district.</p>

<p>This led to the following <strong>side-effects</strong>:</p>
<ul>
  <li>Exact data points of several columns got replaced by equally spaced ranges</li>
  <li>Many missing values</li>
</ul>
<p><strong>Hence</strong>: Before running analyses on the data we <strong>replace each range by its mean value</strong>.</p>

</div>

In [None]:
# Average anonymized data ranges: [0.18, 1.[ --> 0.59, etc.
for col in ['pct_black/hispanic', 'pct_free/reduced', 'pp_total_raw', 'county_connections_ratio']:
    districts[col] = districts[col].apply(lambda val: np.mean([float(x) for x in val[1:-1].split(',')]))

districts.head()
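
The range-to-mean conversion can be checked in isolation. The helper below is a hypothetical stand-alone variant of the lambda above; unlike the lambda, it also passes missing values through, which would only matter if rows with missing values had not been dropped beforehand.

```python
import numpy as np
import pandas as pd


def range_midpoint(value):
    """Return the midpoint of a range string such as '[0.2, 0.4['; pass NaN through."""
    if pd.isna(value):
        return np.nan
    # Strip the two bracket characters, then split into lower and upper bound
    lower, upper = (float(part) for part in value[1:-1].split(","))
    return (lower + upper) / 2


toy_ranges = pd.Series(["[0.2, 0.4[", "[0.8, 1[", np.nan])
midpoints = toy_ranges.apply(range_midpoint)
print(midpoints)
```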

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h3 style="color: #3b3b3b">🍹 <strong>Merge</strong> district and engagement data</h3>
<p>Also add an integer representation of the date on which each engagement event was recorded.</p>
</div>

In [None]:
# merge engagement with district data by district id
engagement['district_id'] = engagement['district_id'].astype('int64')
merged_data = pd.merge(engagement, districts, on = 'district_id')
merged_data = merged_data.dropna()

# add new column with time as integer
timeAsInt = [int("".join(x.split("-"))) for x in merged_data['time']]
merged_data['time_as_int'] = timeAsInt

merged_data.head()

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h2 style="color: #3b3b3b">📆🦠 Privileged vs. underserved groups in times of Covid-19</h2>

<h3 style="color: #3b3b3b">✂︎ Differentiate between before and after Covid-19 onset</h3>    
<p>To differentiate between the time <strong>before</strong> the pandemic and the time <strong>since it started</strong> we split the data into two parts before and after March 11 2020.</p>

<p>On <strong>March 11 2020</strong> the WHO declared COVID-19 a global Pandemic (as can be seen in <a href="https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020">this timeline</a>)</p>
</div>

In [None]:
covid_date_WHO_declares_pandemic = 20200311 # March 11 2020

# Add column to differentiate between pre covid-19 and the time since it's announced to be a pandemic
merged_data['pre_covid'] = merged_data['time_as_int'] <= covid_date_WHO_declares_pandemic
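
The integer comparison works because `YYYYMMDD` integers sort in chronological order. As an alternative sketch (not what this notebook uses), the same flag can be built from real datetime values; the toy frame and the `date` column are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"time": ["2020-03-10", "2020-03-11", "2020-03-12"]})
toy["date"] = pd.to_datetime(toy["time"], format="%Y-%m-%d")

# Same cut-off rule as above: the declaration day itself counts as pre-Covid
pandemic_declared = pd.Timestamp("2020-03-11")
toy["pre_covid"] = toy["date"] <= pandemic_declared

print(toy)
```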

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">📐 <strong>Privileged</strong> vs. <strong>underserved</strong> districts: Defining a metric</h3>

<p><strong>Districts have different demographics</strong>. We used the <code>pct_black/hispanic</code> and <code>pct_free/reduced</code> data to compute a metric of underserved districts to explore whether Covid-19 had a greater impact on either group.</p>

<p><strong>Hypothesis</strong>: Our assumption is that students in districts with a high percentage of Black/Hispanic students and a high share of students eligible for free or reduced-price lunch may, on average, be disadvantaged and may struggle more in school.</p>

<p>As a measure we computed the sum of <code>pct_black/hispanic</code> and <code>pct_free/reduced</code>.<br>If this value is lower than its average over all districts, we labeled the district as <i>privileged</i>. If not, we labeled it as <i>underserved</i>.
</p>
</div>

In [None]:
# Compute metric
merged_data_underserved_sum = merged_data['pct_free/reduced'] + merged_data['pct_black/hispanic']
merged_data_underserved_sum_average = np.mean(merged_data_underserved_sum)

## Add 'underserved' column
merged_data['underserved'] = merged_data_underserved_sum > merged_data_underserved_sum_average

merged_data.head()
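
Note that the cell above takes the average over all merged engagement records, so districts with more records weigh more heavily in the threshold. A minimal sketch of the district-level variant described in the text, using a made-up district table, could look like this:

```python
import pandas as pd

# Hypothetical district table with already-averaged range values
toy_districts = pd.DataFrame({
    "district_id": [1, 2, 3, 4],
    "pct_black/hispanic": [0.1, 0.3, 0.7, 0.9],
    "pct_free/reduced": [0.1, 0.5, 0.5, 0.9],
})

# Sum of both percentages, compared against its mean over all districts
score = toy_districts["pct_black/hispanic"] + toy_districts["pct_free/reduced"]
toy_districts["underserved"] = score > score.mean()

print(toy_districts[["district_id", "underserved"]])
```

Each district counts once here, regardless of how many engagement records it contributes.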

In [None]:
# Split records by period and count how many come from underserved districts
pre_records = merged_data[merged_data['pre_covid']]
post_records = merged_data[~merged_data['pre_covid']]
pre_underserved_count = pre_records['underserved'].sum()
post_underserved_count = post_records['underserved'].sum()

# Pie plots: share of records from underserved vs. privileged districts per period
colors = sns.color_palette('pastel')[2:4]
plt.figure(figsize=(12,6))
plt.suptitle("Distribution of engagement records", fontsize=22, fontweight="bold")
plt.subplot(1,2,1)
plt.title('Pre Covid-19')
# the privileged slice is the remainder of the records of the same period
plt.pie([pre_underserved_count, len(pre_records) - pre_underserved_count], labels=['underserved', 'privileged'], colors = colors, autopct='%.0f%%')
plt.subplot(1,2,2)
plt.title('Since start of Covid-19')
plt.pie([post_underserved_count, len(post_records) - post_underserved_count], labels=['underserved', 'privileged'], colors = colors, autopct='%.0f%%')
plt.show()

# Records from underserved districts out of all records in each period
print("# of underserved records (pre Covid-19):           ", pre_underserved_count, "/", len(pre_records))
print("# of underserved records (since Covid-19 outbreak):", post_underserved_count, "/", len(post_records))

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h3 style="color: #3b3b3b">⚡️ Distribution of engagement data</h3>

<p>The pie charts show how many of the collected data points are from <code>underserved</code> districts vs. <code>privileged</code> districts.</p>
<p><strong>We see</strong>: There's much more data present for <code>privileged</code> districts.</p>

</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">👀 Comparing engagement data</h3>

<ul>
<li>We now plot <code>pct_access</code> against <code>engagement_index</code> in scatter plots.</li>
<li>Each point denotes recorded engagement data</li>
<li>Plotting <code>pct_access</code> and <code>engagement_index</code> gives us a good representation of total engagement: <p><strong>Percent Access</strong> <code>pct_access</code>: The average number of students engaging with a tool (percentage of at least one page load)<br><strong>Engagement Index</strong> <code>engagement_index</code>: The average intensity of their engagement (# of page loads)</p></li>
</ul>

</div>

In [None]:
# Plotting engagement data
g = sns.FacetGrid(merged_data, col="underserved", hue="pre_covid")
g.map(sns.scatterplot, "engagement_index", "pct_access").set(xlabel="Engagement per 1000 students", ylabel="% of min. 1 page load")
g.add_legend(title="Data recorded pre Covid-19?")

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">📈🗺 Analyzing the engagement plots</h4>
<p>Let's compare the two plots of <code>underserved</code> and <code>privileged</code> districts:</p>
<ul>
<li><strong>Fewer underserved districts</strong>: There are far fewer <code>underserved</code> districts than <code>privileged</code> districts and thus fewer data points (see also the pie charts above)</li>
<li><strong>Difference visible</strong>: The difference in engagement between pre-Covid-19 times and the time since the pandemic started seems to be larger in <code>underserved</code> districts.</li>
<li><strong>Significant difference not obvious from plots</strong>: Whether the difference is significant is hard to judge from the plots alone.</li>
</ul>
<p>For a further analysis, we will now compare mean values.</p>
</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">⚡️ Comparing means</h3>
<p>We analyze whether differences/similarities between <code>privileged</code> and <code>underserved</code> districts remain intact pre Covid-19 and after Covid-19 hit the US.</p>

</div>

In [None]:
# Create means
engagement_data_means = merged_data.groupby(['pre_covid', 'underserved']).agg({'engagement_index': 'mean', 'pct_access': 'mean' }).reset_index()

# Plot means in barplot
ax = sns.barplot(x="underserved", y="engagement_index", hue="pre_covid", hue_order=[True, False], data=engagement_data_means)

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🪄 Engagement Index: Surprise</h4>
<p>This set of mean values looks surprising (at least for us 😉)!</p>
<ul>
<li><strong>underserved</strong> districts: page loads per 1000 students <strong>decrease</strong> during Covid-19</li>
<li><strong>privileged</strong> districts: page loads per 1000 students <strong>increase</strong> during Covid-19</li>
</ul>
</div>

In [None]:
sns.barplot(x="underserved", y="pct_access", hue="pre_covid", hue_order=[True, False], data=engagement_data_means)

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🪄 Percent Access: Another Surprise!</h4>
<p><code>pct_access</code> is <strong>higher</strong> before Covid-19 hit the country although one might expect a higher value for the time following the onset of Covid-19.</p>
<p><strong>Same behavior in both groups</strong>: Interestingly, <code>pct_access</code> (i.e. the percentage of students visiting a certain product on a certain day at least once) across all products is higher before Covid-19 in both groups (<code>privileged</code> vs. <code>underserved</code>).</p>
<p><strong>More investigation needed</strong>: This suggests that a more detailed look is needed: Which products get more engagement and which ones get less since the onset of Covid-19?</p>
</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">🕵️‍♀️🕵️ Analyzing the surprising results</h3>

<h4 style="color: #3b3b3b">🐯🏝 Vacation time as an anomaly factor</h4>
<p>Vacation is a special time for online education tools and engagement is expected to be very different.</p>

<h4 style="color: #3b3b3b">📆🏝 Vacation Time</h4>
<p>US vacation time: Between <strong>June 01</strong> and <strong>August 31</strong></p>
</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">⏰ Time series</h3>

<p>Let's plot engagement data over time and highlight the summer months</p>

</div>

In [None]:
# US school vacation time
us_vacation_start = 20200601
us_vacation_end = 20200831

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h4 style="color: #3b3b3b">🕐 Time series of the engagement index</h4>
<p>Compute and plot a time series (created with a running average to smooth the curve) of student engagement data.</p>
</div>

In [None]:
engagement_index_mean_grouped = merged_data.groupby(['time']).agg({'engagement_index': 'mean'}).reset_index()

# compute running average
running_mean_engagement_index = np.convolve(engagement_index_mean_grouped['engagement_index'], np.ones(8)/8, mode="same")
engagement_index_mean_grouped['running_average_engagement_index'] = running_mean_engagement_index

# plot time line
fig = px.line(engagement_index_mean_grouped, x="time", y="running_average_engagement_index")
fig.update_layout(plot_bgcolor = 'white', title = 'Engagement index (running average)', title_x = 0.5)

# line to highlight date on which WHO declares Covid-19 to be a pandemic
fig.add_vline(x = '2020-03-11', line_width = 2, line_color="red")
fig.add_annotation(
    x='2020-03-11',
    y=2.7,
    text="WHO declares Covid-19 pandemic",
    arrowhead=2,
    arrowsize=2,
    arrowwidth=2,
    arrowcolor="red",
    ax=110,
    ay=30
)

# highlight vacation time
fig.add_vrect(x0="2020-06-01", x1="2020-08-31", fillcolor="green", opacity=0.1)
fig.add_annotation(
    x='2020-07-10',
    y=2.5,
    showarrow=False,
    text="Summer vacation",
)

fig.show()
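
One caveat of `np.convolve(..., mode="same")` is that it pads the series with zeros, which pulls the running average down near both ends of the time axis. The sketch below shows this on a constant toy series and contrasts it with a centered pandas rolling mean that only averages the points actually available. The rolling variant is an alternative we did not use above.

```python
import numpy as np
import pandas as pd

values = np.ones(10)  # constant toy series: a faithful average stays at 1
kernel_size = 4

# Zero padding in mode="same" drags down the values at both ends
convolved = np.convolve(values, np.ones(kernel_size) / kernel_size, mode="same")

# Centered rolling mean; min_periods=1 shrinks the window at the edges instead
rolled = pd.Series(values).rolling(kernel_size, center=True, min_periods=1).mean()

print(convolved)
print(rolled.to_numpy())
```

For the plots in this notebook the effect is limited to the first and last few days of the year.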

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🤝 <strong>Verdict</strong>: Vacation time is different</h4>

<p>Access to online education software largely stops or behaves unpredictably during vacation time (see also the plot further down in the second analysis).</p>
</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">📵 Removing vacation time</h4>
<p>If we remove vacation data from the analysis we observe something interesting:</p>

</div>

In [None]:
# Remove vacation time
before_vacation = merged_data[merged_data['time_as_int'] < us_vacation_start]
after_vacation = merged_data[merged_data['time_as_int'] > us_vacation_end]

merged_data_without_vacation = pd.concat([before_vacation, after_vacation])
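
The same filter can also be written with a single boolean mask. This is a minimal sketch on made-up dates; `Series.between` is inclusive on both ends by default, which matches the strict `<` and `>` comparisons above.

```python
import pandas as pd

toy = pd.DataFrame({"time_as_int": [20200531, 20200601, 20200715, 20200831, 20200901]})

# True for every record that falls inside the vacation window (bounds included)
in_vacation = toy["time_as_int"].between(20200601, 20200831)
toy_without_vacation = toy[~in_vacation]

print(toy_without_vacation)
```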

In [None]:
# Comparing means
engagement_data_means_without_vacation = merged_data_without_vacation.groupby(['pre_covid', 'underserved']).agg({'engagement_index': 'mean', 'pct_access': 'mean' }).reset_index()

sns.barplot(x="underserved", y="engagement_index", hue="pre_covid", hue_order=[True, False], data=engagement_data_means_without_vacation)

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🐒 No huge differences between groups anymore</h4>

<p>We no longer see big differences between pre-Covid and post-Covid data, apart from total engagement being lower in underserved communities. The relative differences seem comparable.</p>

</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🤨 PCT Access remains odd</h4>

<p>Even with vacation time removed, the <code>pct_access</code> value is higher before Covid-19 than during the pandemic.</p>

</div>

In [None]:
sns.barplot(x="underserved", y="pct_access", hue="pre_covid", hue_order=[True, False], data=engagement_data_means_without_vacation)

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">🧪 Hypothesis tests</h3>

<p>Let us see whether there really is a <strong>significant</strong> difference between the groups.</p>

<p>Since the data is anonymized, we cannot compare an individual student's engagement before and after the onset of Covid-19. Hence, we treat all groups as independent samples.</p>

</div>

In [None]:
# Create subsets of data
underserved = merged_data[merged_data['underserved'] == True]
priviledged = merged_data[merged_data['underserved'] == False]

pre_covid = merged_data[merged_data['pre_covid'] == True]
post_covid = merged_data[merged_data['pre_covid'] == False]

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h4 style="color: #3b3b3b">🧪 Tests within each group</h4>
</div>

In [None]:
# T-test within underserved group (pre Covid-19 vs. after onset of Covid-19 pandemic)
stats.ttest_ind(underserved[underserved['pre_covid'] == True]['engagement_index'], underserved[underserved['pre_covid'] == False]['engagement_index'])

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h5 style="color: #3b3b3b">🧪 <strong>T-Test</strong>: Difference within the underserved group</h5>

<p><strong>Question</strong>: Within the underserved group, does Covid-19 have an impact on engagement?<br><strong>Test result</strong>: Negative - we can't say that the observed difference is significant.</p>

</div>

In [None]:
# T-test within priviledged group (pre Covid-19 vs. after onset of Covid-19 pandemic)
stats.ttest_ind(priviledged[priviledged['pre_covid'] == True]['engagement_index'], priviledged[priviledged['pre_covid'] == False]['engagement_index'])

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h5 style="color: #3b3b3b">🧪 <strong>T-Test</strong>: Difference within the privileged group</h5>

<p><strong>Question</strong>: Within the privileged group, does Covid-19 have an impact on engagement?<br><strong>Test result</strong>: Positive - a very low p-value (p less than 0.01) indicates that there is a highly significant effect.</p>

</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h4 style="color: #3b3b3b">🧪 Tests between underserved and privileged groups</h4>
</div>

In [None]:
# T-test between groups (underserved vs. privileged) - Before Covid-19
stats.ttest_ind(pre_covid[pre_covid['underserved'] == True]['engagement_index'], pre_covid[pre_covid['underserved'] == False]['engagement_index'])

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h5 style="color: #3b3b3b">🧪 <strong>T-Test</strong>: Difference between the underserved and privileged group pre-Covid-19</h5>

<p><strong>Question</strong>: Pre Covid-19, is there a difference in measured engagement between the underserved and the privileged group?<br><strong>Test result</strong>: Positive - a very low p-value (p less than 0.01) indicates that the difference is highly significant.</p>

</div>

In [None]:
# T-test between groups (underserved vs. privileged) - Since Covid-19
stats.ttest_ind(post_covid[post_covid['underserved'] == True]['engagement_index'], post_covid[post_covid['underserved'] == False]['engagement_index'])

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h5 style="color: #3b3b3b">🧪 <strong>T-Test</strong>: Difference between the underserved and privileged group since the onset of Covid-19</h5>

<p><strong>Question</strong>: Since the onset of Covid-19, is there a difference in measured engagement between the underserved and the privileged group?<br><strong>Test result</strong>: Positive - a very low p-value (p less than 0.01) indicates that the difference is highly significant.</p>

</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🧪 General comment on test results</h4>

<p>Positive test results indicate that there are effects (as observed in the bar charts above) which are indeed significant and not solely due to randomness.</p>
<p><strong>However:</strong> Since the number of engagement records is very high (more than 1 million), even tiny differences easily turn out significant in a t-test. More rigorous hypothesis testing is left for future research.</p>
    
</div>
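
One possible complement to the p-values, which we did not compute in this notebook, is an effect size such as Cohen's d. It expresses the difference of two means in units of the pooled standard deviation and does not grow with the number of records. The helper `cohens_d` and the simulated values below are purely illustrative.

```python
import numpy as np


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


rng = np.random.default_rng(0)
# Simulated engagement values: very large samples with a tiny true difference
group_a = rng.normal(loc=100.0, scale=50.0, size=200_000)
group_b = rng.normal(loc=101.0, scale=50.0, size=200_000)

print(f"Cohen's d: {cohens_d(group_a, group_b):.3f}")
```

With samples this large a t-test would typically flag the difference as significant, while the effect size makes clear that the difference itself is small.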

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h2 style="color: #3b3b3b">🎁 Part 2: Product investigations</h2>
    
<h3 style="color: #3b3b3b">🧱 Product data preparation</h3>
    
</div>

<div style="background-color: #f0f0f0; padding: 5px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h4 style="color: #3b3b3b; margin-left: 5px;">🍊 Product data</h4>
</div>

The top **372** ed-tech products (out of **10000**) identified by the Chrome browser extension [learnplatform](https://learnplatform.com/).


| Name                       | Description            |
|----------------------------|------------------------|
| LP ID                      | Unique product identifier |
| URL                        | URL of specific product         |
| Product Name               | Name of the specific product  |
| Provider/Company Name      | Name of product's company  |
| Sector(s)                  | Sector of education where the product is used    |
| Primary Essential Function | LC = Learning & Curriculum, CM = Classroom Management or SDO = School & District Operations plus sub-categories |

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🎳 Products Data: Load & Extract</h4>

</div>

In [None]:
products = pd.read_csv(data_dir / "products_info.csv")
products.head()

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🧂 Merge products into existing data (consisting of engagement and district data)</h4>
<p>We now add product data to the mix for a short analysis of the most successful products among students.</p>

</div>

In [None]:
# merge in products data
merged_data = pd.merge(products, merged_data, left_on = 'LP ID', right_on = 'lp_id')
merged_data.drop(['URL', 'lp_id'], axis = 1, inplace = True)

# replace any remaining NaNs in the engagement index with 0.0
merged_data["engagement_index"] = merged_data["engagement_index"].fillna(0)

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">⚖️ Engagement per State over Time</h3>

<p><strong>Questions</strong>:<br>1) How does student online learning engagement change over time and<br>2) How does student online learning engagement differ by state?</p>

</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">
<h4 style="color: #3b3b3b">Generalize creation of running average in function</h4>
</div>

In [None]:
def create_running_average(df: pd.DataFrame, category: str, kernel_size: int=8, out_category_prefix="running_average_"):
    """
    Add a running average of `category`, computed separately for each state.

    The result is written in place to a new column of the passed data frame,
    named `out_category_prefix + category`.
    """
    runn_avg_colname = f"{out_category_prefix}{category}"
    df[runn_avg_colname] = 0.0
    # visit each state once (iterating over df.state would repeat the work for every row)
    for state in df["state"].unique():
        state_mask = df["state"] == state
        column = df.loc[state_mask, category].values
        running_mean = np.convolve(column, np.ones(kernel_size)/kernel_size, mode="same")
        df.loc[state_mask, runn_avg_colname] = running_mean

In [None]:
pct_access_mean_grouped = merged_data.groupby(['state', 'time']).agg({'pct_access': 'mean'}).reset_index()
engagement_index_mean_grouped = merged_data.groupby(['state', 'time']).agg({'engagement_index': 'mean'}).reset_index()

create_running_average(pct_access_mean_grouped, 'pct_access')
create_running_average(engagement_index_mean_grouped, 'engagement_index')
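
For comparison, a per-state running average can also be expressed with pandas' `groupby` and `rolling`. The sketch below uses a made-up two-state table and a trailing window of 2 days, so its numbers differ from the centered convolution used in `create_running_average`.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["Utah"] * 4 + ["Ohio"] * 4,
    "time": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"] * 2,
    "engagement_index": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Sort so that each state's rows are in chronological order
toy = toy.sort_values(["state", "time"])

# Rolling mean within each state; min_periods=1 keeps the first day of each state
toy["running_average_engagement_index"] = (
    toy.groupby("state")["engagement_index"]
    .transform(lambda s: s.rolling(2, min_periods=1).mean())
)

print(toy)
```

Because the window is applied within each group, values from one state never leak into the average of another.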

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">🎨 Create plot of engagement for every state</h3>

</div>

In [None]:
# plot time line
fig = px.line(engagement_index_mean_grouped, x="time", y="running_average_engagement_index", color="state", line_group="state")

fig.update_layout(plot_bgcolor = 'white', title = 'Engagement index per state (running average)', title_x = 0.5)

# line to highlight date on which WHO declares Covid-19 to be a pandemic
fig.add_vline(x = '2020-03-11', line_width = 2, line_color="red")
fig.add_annotation(
        x='2020-03-11',
        y=2.7,
        text="WHO declares Covid-19 pandemic",
        arrowhead=2,
        arrowsize=2,
        arrowwidth=2,
        arrowcolor="red",
        ax=110,
        ay=30
        )

fig.add_vrect(x0="2020-06-01", x1="2020-08-31", fillcolor="green", opacity=0.1, line_width=0)
fig.add_annotation(
        x='2020-07-10',
        y=2.5,
        text="Summer vacation",
        showarrow=False,
        )
fig.update_traces(line_width=1)

fig.show()

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h4 style="color: #3b3b3b">🪜 Analysis of the engagement time series split by state</h4>

<ul>
<li>Shortly after the WHO declared Covid-19 a pandemic, the engagement index changes drastically within each state.</li>
<li>While some states show a <strong>steep rise</strong> (New York, etc.), other states show a <strong>steep decline</strong> in engagement (North Carolina, etc.)</li>
<li>In the time around the summer holidays in the U.S., engagement drops very low compared to all other times (as seen above).</li>
<li>Further drops, similar to the one during the summer holidays, can be seen on other dates as well.</li>
</ul>
    
</div>

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px">

<h3 style="color: #3b3b3b">💻 Explore Product Usage</h3>

<p><strong>Question</strong>: Which are the most successful products over time?</p>
<p>Construct a pivot table for the analysis.</p>

</div>

In [None]:
# one row per day, one column per product, summed engagement index as values
pivoted_engagement = pd.pivot_table(merged_data, values='engagement_index', 
    index=["time"],
    columns=['Product Name'], aggfunc="sum")
pivoted_engagement.to_csv(output_dir / "product_pivoted_engagement_data.csv")

In [None]:
pivoted_engagement = pd.read_csv(output_dir / "product_pivoted_engagement_data.csv", index_col="time")
sorted_columns = pivoted_engagement.sum().sort_values(ascending=False).index
pivoted_engagement = pivoted_engagement[sorted_columns[:20]]
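
To illustrate what the pivot and the column ranking do, here is a minimal sketch on a made-up table with two products and two days. It keeps only the single top product instead of the top 20.

```python
import pandas as pd

toy = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02"],
    "Product Name": ["Google Docs", "Zoom", "Google Docs", "Google Docs", "Zoom"],
    "engagement_index": [5.0, 1.0, 2.0, 3.0, 4.0],
})

# One row per day, one column per product; duplicate (day, product) pairs are summed
pivot = pd.pivot_table(toy, values="engagement_index", index="time",
                       columns="Product Name", aggfunc="sum")

# Rank products by their total engagement over the whole period
top_products = pivot.sum().sort_values(ascending=False).index[:1]

print(pivot[top_products])
```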

In [None]:
# compute running averages: apply a 5-day moving average four times for extra smoothing
kernel_size = 5
for _ in range(4):
    for colname in pivoted_engagement.columns:
        col_vals = pivoted_engagement[colname].values
        # overwrite each product column with its smoothed version
        pivoted_engagement[colname] = np.convolve(col_vals, np.ones(kernel_size)/kernel_size, mode="same")

In [None]:
pivoted_engagement.plot(figsize=(23, 10))

<div style="background-color: #f0f0f0; padding: 10px; border: 2px solid #4a4a4a; border-radius: 5px; color: black;">

<h3 style="color: #3b3b3b">📈 Graph Interpretation</h3>
    
<h4 style="color: #3b3b3b">📈 What is the graph about?</h4>
    
<ul>
<li>Here, the engagement for the top 20 products can be seen, aggregated over all states and districts into a single engagement number per day and product.</li>
<li>x-axis = time</li>
<li>y-axis = engagement-index</li>
</ul>
<h4 style="color: #3b3b3b">📈 Analysis of the graph</h4>

<ul>
<li>Again, shortly after Covid-19 was declared a pandemic, the usage of some products increases sharply (Google Docs, Zoom, Epic!) while that of others drops (Lexia Core5 Reading)</li>
<li>During summer vacation time, the usage of purely school-related products drops to nearly 0, while the usage of tools that are not only related to school (YouTube, Google Docs) remains slightly above that</li>
</ul>

</div>