<div class="h1">Overview</div>
<br>
<div class="h3">Primary Objective</div>
<br>
<b>The primary objective of this analysis is to understand how the COVID-19 virus has impacted digital learning and student engagement over time and over geospatial and socioeconomic landscapes.</b>
<br>
<br>
<div class="h3">Goals and Questions</div>
<p>
Time Series Analysis
<li>What role does time of year play in digital engagement?</li>
<li>Does the number of COVID-19 cases over time impact digital engagement?</li> <br>
Geospatial Analysis <br>
<li>What does digital learning look like across a geospatial landscape?</li>
<li>Does the number of COVID-19 cases in each state impact digital engagement?</li> <br>
Socioeconomic Analysis <br>
<li>What socioeconomic factors impact digital engagement?</li>
<li>Moving forward, what tools/products were successful in creating student engagement for students at any socioeconomic status?</li>
</p>
<br>
<div class="h3">Code Reproducibility and Organization</div>
<p>
Below is a GitHub repository containing the hundreds of lines of code that preprocess the data and create the visualizations: <br>
    <a href=https://github.com/RaviShah1/COVID-19-Impact-On-Digital-Learning>https://github.com/RaviShah1/COVID-19-Impact-On-Digital-Learning</a> <br>
<br>
All the code from the hidden code cells in this notebook can be found in the GitHub repository, and the visible function calls show the results of my analysis. The repository should make reproducing the results straightforward.
<br><br>
After I preprocess the data, the analysis is broken down into 3 main parts: 
<ol>   
    <li>time series analysis</li>
    <li>geospatial analysis</li>
    <li>socioeconomic analysis</li>
</ol>
A summary of my findings from all 3 sections can be found in the conclusion portion of this notebook.
<br>
</p>
<br>
<div class="h3">Data and Sources</div>
<p>
All of the raw data I use can be found in the data folder of the GitHub repository above or at the original sources linked below:
<a href=https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/data>https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/data</a>
<br>
<a href=https://github.com/nytimes/covid-19-data>https://github.com/nytimes/covid-19-data</a>

<br>
</p>
<br>

<div class="h3">Additional Notes</div>
<p>
All visualizations are created using plotly and are interactive. <br>
Put your cursor over the graphs and charts to see exact numbers and more details.
</p>

<div class="h1">Preprocessing</div>

<br>
<div class="h3">Preprocessing Overview</div>
The code below sets up the analysis by performing the following steps: 
<ol>
  <li>Preprocessing the products dataframe by splitting the sectors column into one flag per sector and filling null values with "unknown"</li>   
  <li>Preprocessing the districts dataframe by formatting categorical variables and filling null values with "unknown"</li>
  <li>Linking each row in the districts dataframe to its engagement dataframe</li>
  <li>Transforming all string dates to datetimes</li>
  <li>Differencing the cumulative coronavirus cases dataframe to get daily new cases</li>
  <li>Generating python dictionaries that map places with different characteristics to data about their engagement and coronavirus cases</li>
  <li>Creating dataframes to analyze different educational products</li>
  <li>Creating dataframes that represent engagement and coronavirus cases for the entire U.S.</li>
</ol>
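<p>
Step 5 turns a cumulative case count into daily new cases by subtracting each day's total from the next. The sketch below shows that computation in plain Python on made-up numbers. The helper name <code>cumulative_to_daily</code> is hypothetical and not part of this notebook, which performs the same subtraction with pandas.
</p>

```python
def cumulative_to_daily(cumulative):
    """Return day-over-day differences of a cumulative series.

    The first day keeps its own value (the previous total is treated as 0),
    and abs() mirrors how the notebook handles downward revisions.
    """
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(abs(total - previous))
        previous = total
    return daily


# Illustrative values only, not real case counts.
cumulative_cases = [1, 1, 4, 9, 8, 15]
daily_cases = cumulative_to_daily(cumulative_cases)
print(daily_cases)  # [1, 0, 3, 5, 1, 7]
```

<p>
Note that taking the absolute value turns a downward correction in the cumulative count (9 to 8 above) into a positive daily value, which is worth keeping in mind when reading the daily plots.
</p>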

In [1]:
import pandas as pd
import numpy as np
import glob
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.offline as po
from plotly import tools
import plotly.graph_objs as pg

import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", 100)

In [2]:
comp_dir = '../input/learnplatform-covid19-impact-on-digital-learning' 

engage_files = glob.glob(comp_dir + "/engagement_data/*.csv")
engage_dfs = list()
for file in engage_files:
    df = pd.read_csv(file)
    engage_dfs.append(df)
    
products_info = pd.read_csv(comp_dir + "/products_info.csv")
districts_info = pd.read_csv(comp_dir + "/districts_info.csv")

nytimes_dir = '../input/ny-times-covid-19-tracker-data'
covid_case_df = pd.read_csv(nytimes_dir + "/nytimes_covid_cases_data")

In [3]:
""" 
The original data/input for all these functions can be located at the following sources:
1) https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/data
2) https://github.com/nytimes/covid-19-data
"""

def time_to_dt(time):
    return dt.datetime.strptime(time, "%Y-%m-%d")

""" Preprocessing Products DataFrame"""

def preprocess_products_df(df: pd.DataFrame):
    # Flag each sector a product serves; a missing value becomes the string "nan" and matches nothing.
    sectors = df["Sector(s)"].astype(str)
    df['PreK-12'] = sectors.str.contains("PreK-12", regex=False)
    df['HigherEd'] = sectors.str.contains("Higher Ed", regex=False)
    df['Corporate'] = sectors.str.contains("Corporate", regex=False)
    df = df.fillna("unknown")
    return df

""" Preprocess Districts DataFrame"""

def format_percent(val: str):
    if val is None or val=='unknown':
        return 'unknown'
    val = val[1:-1]
    val1 = str(float(val.split(', ')[0])*100)
    val2 = str(float(val.split(', ')[1])*100)
    return val1+'% - '+val2+'%'

def format_population(val: str):
    if val is None or val=='unknown':
        return 'unknown'
    val = val[1:-1]
    return val.split(', ')[0]+' - '+val.split(', ')[1]

def preprocess_districts_df(df: pd.DataFrame):
    df = df.fillna("unknown")
    df['pct_black/hispanic'] = df['pct_black/hispanic'].apply(format_percent)
    df['pct_free/reduced'] = df['pct_free/reduced'].apply(format_percent)
    df['county_connections_ratio'] = df['county_connections_ratio'].apply(format_percent)
    df['pp_total_raw'] = df['pp_total_raw'].apply(format_population)
    return df

""" Preprocess Engage DataFrames"""

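from pathlib import Path

def link_district_to_engage_df(loc_id: int, engage_files: list):
    # Compare against the file stem (the file name without ".csv") so that a
    # district id cannot match a substring elsewhere in the path.
    for i, path in enumerate(engage_files):
        if Path(path).stem == str(loc_id):
            return i
    return -1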

def preprocess_engage_dfs(engage_dfs: list, engage_files: list, districts_df: pd.DataFrame):
    for df in engage_dfs:
        df['dt'] = df['time'].apply(time_to_dt)
    districts_df['engage_file_id'] = districts_df['district_id'].apply(lambda x: link_district_to_engage_df(x, engage_files)) 

""" Preprocessing NY Times COVID cases DataFrame"""

def preprocess_nytime_cases_df(df: pd.DataFrame) :
    df['dt'] = df['date'].apply(time_to_dt)
    df = df[df.dt > dt.datetime(2020,1,1)]
    df = df[df.dt < dt.datetime(2020,12,31)]
    df = df.set_index('dt')
    return df  

""" Generate Maps That Connect Different Variables """

def generate_state_map(districts_df: pd.DataFrame, covid_case_df: pd.DataFrame, engage_dfs: list, engage_files: list):
    state_map = dict()
    for state in districts_df.state.value_counts().index:
        if(state!='unknown' and state!='District Of Columbia'):
            # Copy the slice so the new column is not written to a view of covid_case_df.
            case_df = covid_case_df[covid_case_df.state == state].copy()
            case_df['new_cases'] = case_df['cases'].transform(lambda s: s.sub(s.shift().fillna(0)).abs())

            ids = districts_df[districts_df.state == state].engage_file_id.values
            dfs = list()
            for i in ids:
                dfs.append(engage_dfs[i].groupby(by=['dt'])[['pct_access', 'engagement_index']].sum())
            engage_df = pd.concat(dfs)
            engage_df1 = engage_df.groupby(by=['dt'])[['pct_access', 'engagement_index']].mean()
            #engage_df2 = engage_df.groupby(by=['dt', 'lp_id'])[['pct_access', 'engagement_index']].mean()
            #engage_df2.reset_index().set_index('dt')

            state_map[state] = [case_df, engage_df1]
            
    return state_map

def generate_pct_black_hisp_map(districts_df: pd.DataFrame, engage_dfs: list):
    pct_black_hisp_map = dict()
    for val in np.unique(districts_df['pct_black/hispanic'].values):
        ids = districts_df[districts_df['pct_black/hispanic'] == val].engage_file_id.values
        dfs = list()
        for i in ids:
            dfs.append(engage_dfs[i].groupby(by=['dt'])[['pct_access', 'engagement_index']].max())
        engage_df = pd.concat(dfs)
        engage_df1 = engage_df.groupby(by=['dt'])[['pct_access', 'engagement_index']].mean()
        #engage_df2 = engage_df.groupby(by=['dt', 'lp_id'])[['pct_access', 'engagement_index']].mean()
        #engage_df2.reset_index().set_index('dt')

        #pct_black_hisp_map[val] = [engage_df1, engage_df2]
        pct_black_hisp_map[val] = engage_df1
    return pct_black_hisp_map
        
def generate_pct_free_reduced_lunch_map(districts_df: pd.DataFrame, engage_dfs: list):
    pct_free_reduced_lunch_map = dict()
    for val in np.unique(districts_df['pct_free/reduced'].values):
        ids = districts_df[districts_df['pct_free/reduced'] == val].engage_file_id.values
        dfs = list()
        for i in ids:
            dfs.append(engage_dfs[i].groupby(by=['dt'])[['pct_access', 'engagement_index']].max())
        engage_df = pd.concat(dfs)
        engage_df1 = engage_df.groupby(by=['dt'])[['pct_access', 'engagement_index']].mean()
        #engage_df2 = engage_df.groupby(by=['dt', 'lp_id'])[['pct_access', 'engagement_index']].mean()
        #engage_df2.reset_index().set_index('dt')

        #pct_free_reduced_lunch_map[val] = [engage_df1, engage_df2]
        pct_free_reduced_lunch_map[val] = engage_df1
    return pct_free_reduced_lunch_map

def product_dfs(engage_dfs: list, districts_df: pd.DataFrame):
    engage_df_us = pd.concat(engage_dfs)
    product_df_us = engage_df_us.groupby(by=['lp_id'])[['pct_access', 'engagement_index']].mean()
    
    prod_lunch_dfs = dict()
    for val in np.unique(districts_df['pct_free/reduced'].values):
        ids = districts_df[districts_df['pct_free/reduced'] == val].engage_file_id.values
        dfs = list()
        for i in ids:
            dfs.append(engage_dfs[i].groupby(by=['lp_id'])[['pct_access', 'engagement_index']].mean())
        engage_df = pd.concat(dfs)
        prod_lunch_dfs[val] = (engage_df.groupby(by=['lp_id'])[['pct_access', 'engagement_index']].mean())
    
    return product_df_us, prod_lunch_dfs


""" Generate Nation Wide DataFrames"""

def us_dfs(engage_dfs: list, covid_case_df: pd.DataFrame):
    engage_df_us = pd.concat(engage_dfs)
    engage_df_us = engage_df_us.groupby(by=['dt'])[['pct_access', 'engagement_index']].mean()
    
    covid_case_us = covid_case_df.groupby(by=['dt'])[['cases', 'deaths']].mean()
    covid_case_us['new_cases'] = covid_case_us['cases'].transform(lambda s: s.sub(s.shift().fillna(0)).abs())
    
    return engage_df_us, covid_case_us

In [4]:
products_df = preprocess_products_df(products_info)
districts_df = preprocess_districts_df(districts_info)
covid_case_df = preprocess_nytime_cases_df(covid_case_df)
preprocess_engage_dfs(engage_dfs, engage_files, districts_df)

state_map = generate_state_map(districts_df, covid_case_df, engage_dfs, engage_files)
pct_black_hisp_map = generate_pct_black_hisp_map(districts_df, engage_dfs)
pct_free_lunch_map = generate_pct_free_reduced_lunch_map(districts_df, engage_dfs)

engage_df_us, covid_case_us = us_dfs(engage_dfs, covid_case_df)
product_df_us, prod_lunch_dfs = product_dfs(engage_dfs, districts_df)

In [5]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

def get_state_ids(state_map: dict):
    st_ids = list()
    for st in list(state_map.keys()):
        st_ids.append(us_state_abbrev[st])
    return st_ids

st_ids = get_state_ids(state_map)

<div class="h3">Visualizing the Sample Size</div>
<br>
<p>
All visualizations and conclusions drawn from this data are based on this sample. As a result, the charts could vary if the dataset included data from different states, represented a different proportion of locales, or represented a different proportion of age groups.
</p>

In [6]:
long_teal = ['rgb(42, 86, 116)',
             'rgb(45, 95, 125)',
             'rgb(55, 105, 135)',
             'rgb(59, 115, 143)',
             'rgb(62, 124, 147)',
             'rgb(63, 128, 151)',
             'rgb(65, 133, 154)',
             'rgb(71, 138, 160)',
             'rgb(76, 142, 163)',
             'rgb(79, 144, 166)',
             'rgb(81, 151, 169)',
             'rgb(85, 157, 172)',
             'rgb(90, 162, 175)',
             'rgb(95, 167, 178)',
             'rgb(104, 171, 184)',
             'rgb(114, 182, 192)',
             'rgb(124, 188, 198)',
             'rgb(133, 196, 201)',
             'rgb(145, 205, 205)',
             'rgb(155, 213, 213)',
             'rgb(168, 219, 217)',
             'rgb(178, 225, 223)',
             'rgb(188, 232, 227)',
             'rgb(209, 238, 234)']

def plot_sample_size_details(districts_df: pd.DataFrame, products_df: pd.DataFrame):
    fig = make_subplots(rows=1, cols=3, specs=[[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}]],
                       column_titles=['States', 'Locale', 'Product Audience'])
    fig.add_trace(
        go.Pie(labels=districts_df.state.value_counts().index, values=districts_df.state.value_counts().values, marker=dict(colors=long_teal), hole=.5, textposition='inside', textinfo='label'),
        row=1, col=1)
    
    fig.add_trace(
        go.Pie(labels=districts_df.locale.value_counts().index, values=districts_df.locale.value_counts().values, marker=dict(colors=px.colors.sequential.Teal_r), hole=.5, textposition='inside', textinfo='label'),
        row=1, col=2)
    
    fig.add_trace(
        go.Pie(labels=['PreK-12', 'Higher Ed', 'Corporate'], values=[products_df['PreK-12'].sum(), products_df['HigherEd'].sum(), products_df['Corporate'].sum()], marker=dict(colors=px.colors.sequential.Teal_r), hole=.5, textposition='inside', textinfo='label'),
        row=1, col=3)
    
    fig.update_layout(showlegend=False)
    fig.show()

In [7]:
plot_sample_size_details(districts_df, products_df)

<div class="h3">What Is the Engagement Index?</div>
<br>
<p>
Throughout this analysis, I use 2 common measurements: Engagement Index and COVID-19 Cases. 
<br>
COVID-19 Cases is the number of reported coronavirus cases.
Engagement Index is a measure of student engagement defined by the equation below:
<br>
    <b>Engagement Index = total page-load events per 1,000 students for a given product on a given day</b>
</p>
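<p>
As a worked example of the definition above, the sketch below computes the index for a hypothetical district. The function name and the numbers are assumptions for illustration; the dataset already ships with the index precomputed.
</p>

```python
def engagement_index(page_loads: int, students: int) -> float:
    """Page-load events per 1,000 students for one product on one day."""
    if students <= 0:
        raise ValueError("students must be positive")
    return page_loads * 1000 / students


# Hypothetical district: 2,500 students generated 750 page loads on one product in a day.
value = engagement_index(page_loads=750, students=2500)
print(value)  # 300.0
```

<p>
Because the index is normalized per 1,000 students, districts of different sizes can be compared directly.
</p>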

<div class="h1">Time Series Analysis</div>

<br>
In the time series portion (and other parts of the analysis), I split the data into 4 segments:
<br>
<ol>
    <li>Before COVID Hit</li>
    <li>Spring 2020</li>
    <li>Summer 2020</li>
    <li>Fall 2020</li>
</ol>
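<p>
The four segments can be expressed as a simple lookup on the date. The sketch below assumes the cut-off dates used for the dashed lines and box plots in this notebook (21 January, 1 June, and 12 August 2020 as the last day of each of the first three segments). The function name <code>period_of</code> is hypothetical.
</p>

```python
import datetime as dt

# First day of each segment after "Before COVID Hit".
SPRING_START = dt.date(2020, 1, 22)
SUMMER_START = dt.date(2020, 6, 2)
FALL_START = dt.date(2020, 8, 13)


def period_of(day: dt.date) -> str:
    """Map a calendar date to one of the four segments of the analysis."""
    if day < SPRING_START:
        return "Before COVID Hit"
    if day < SUMMER_START:
        return "Spring 2020"
    if day < FALL_START:
        return "Summer 2020"
    return "Fall 2020"


print(period_of(dt.date(2020, 1, 21)))  # Before COVID Hit
print(period_of(dt.date(2020, 7, 4)))   # Summer 2020
```

<p>
Using half-open comparisons like this assigns every date to exactly one segment, with no gaps or overlaps at the boundaries.
</p>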

<div class="h3">Line Plots Over Time</div>
<br>

The line plots below investigate engagement (on the left y-axis) and COVID cases (on the right y-axis) over time. 
<br>
The COVID line in the first plot shows the cumulative number of COVID-19 cases, while the COVID line in the second plot shows the number of new cases on each day.
<br>
<p style="color:#36B6D2">The Engagement Index is represented by the blue line</p>
<p style="color:#E76a5d">The COVID-19 cases are represented by the red line</p>
<p style="color:#7f8781">The Gray Dashed Lines separate the 4 segments of the year mentioned above</p>


In [8]:
def resample(df: pd.DataFrame, sample: str, what: list=None):
    if what is not None:
        return df[what].resample(sample).mean()
    return df[['engagement_index', 'pct_access']].resample(sample).mean()

def add_secondary_y_plot(fig: go.Figure, 
                         row: int, col: int, 
                         x1: np.array, x2: np.array, 
                         y1: np.array, y2: np.array, 
                         color1: str, color2: str,
                         name1: str, name2: str):
    fig.add_scatter(x=x1, 
                    y=y1, 
                    mode='lines',
                    marker={'color': color1},
                    name=name1,
                    secondary_y=False,
                    row=row,
                    col=col)
    
    fig.add_scatter(x=x2, 
                    y=y2, 
                    mode='lines', 
                    marker={'color': color2},
                    name=name2,
                    secondary_y=True,
                    row=row,
                    col=col)
    
def plot_engage_to_cases(engage_df: pd.DataFrame, cases_df: pd.DataFrame):
    fig =  make_subplots(rows=2, cols=1, 
                         specs=[[{'secondary_y': True}], [{'secondary_y': True}]],
                         subplot_titles=('Engagement Index VS Cumulative Cases - Over Time (US)', 'Engagement Index VS New Cases Daily - Over Time (US)'))
    add_secondary_y_plot(fig, 1, 1, engage_df.index, cases_df.index, engage_df.engagement_index, cases_df.cases, '#36B6D2', '#E76a5d', 'Engagement Index', 'COVID Cases')
    add_secondary_y_plot(fig, 2, 1, engage_df.index, cases_df.index, engage_df.engagement_index, cases_df.new_cases, '#36B6D2', '#E76a5d', 'Engagement Index', 'COVID Cases')

        
    fig.add_vline(x=dt.datetime(2020,1,21), line_width=2, line_dash='dash', line_color='#9B9B9B')
    fig.add_vline(x=dt.datetime(2020,6,1), line_width=2, line_dash='dash', line_color='#9B9B9B')
    fig.add_vline(x=dt.datetime(2020,8,12), line_width=2, line_dash='dash', line_color='#9B9B9B')

    # Figure size, background, and title placement.
    fig.update_layout(autosize=False, width=1000, height=750,
                      plot_bgcolor='#E6EFF1',
                      title_x=.3, title_y=.92, title_font_size=30)
    fig.update_xaxes(showgrid=False, title_text='Date')
    fig.update_yaxes(showgrid=False, title_text='Engagement Index', secondary_y=False)
    fig.update_yaxes(showgrid=False, title_text='Cases', secondary_y=True)
    fig.update_traces(showlegend=False)
    
    fig.show()
    
def plot_resampled_engage_to_cases(engage_df: pd.DataFrame, cases_df: pd.DataFrame):
    fig =  make_subplots(rows=3, cols=1, 
                         specs=[[{'secondary_y': True}], [{'secondary_y': True}], [{'secondary_y': True}]],
                         subplot_titles=("Daily", "Weekly", "Monthly"))
    
    add_secondary_y_plot(fig, 1, 1, engage_df.index, cases_df.index, engage_df.engagement_index, cases_df.new_cases, '#36B6D2', '#E76a5d', 'Engagement Index', 'COVID Cases')
    
    engage_df_weekly = resample(engage_df, '7D')
    cases_df_weekly = resample(cases_df, '7D', ['cases', 'new_cases'])
    add_secondary_y_plot(fig, 2, 1, engage_df_weekly.index, cases_df_weekly.index, engage_df_weekly.engagement_index, cases_df_weekly.new_cases, '#36B6D2', '#E76a5d', 'Engagement Index', 'COVID Cases')
    
    engage_df_monthly = resample(engage_df, 'M')
    cases_df_monthly = resample(cases_df, 'M', ['cases', 'new_cases'])
    add_secondary_y_plot(fig, 3, 1, engage_df_monthly.index, cases_df_monthly.index, engage_df_monthly.engagement_index, cases_df_monthly.new_cases, '#36B6D2', '#E76a5d', 'Engagement Index', 'COVID Cases')
    
    fig.add_vline(x=dt.datetime(2020,1,21), line_width=2, line_dash='dash', line_color='#9B9B9B')
    fig.add_vline(x=dt.datetime(2020,6,1), line_width=2, line_dash='dash', line_color='#9B9B9B')
    fig.add_vline(x=dt.datetime(2020,8,12), line_width=2, line_dash='dash', line_color='#9B9B9B')

    # Figure size, background, and title placement.
    fig.update_layout(autosize=False, width=1000, height=750,
                      plot_bgcolor='#E6EFF1',
                      title_x=.3, title_y=.92, title_font_size=30)
    fig.update_xaxes(showgrid=False, title_text='Date')
    fig.update_yaxes(showgrid=False, title_text='Engagement Index', secondary_y=False)
    fig.update_yaxes(showgrid=False, title_text='Cases', secondary_y=True)
    fig.update_traces(showlegend=False)
    
    fig.show()

In [9]:
plot_engage_to_cases(engage_df_us, covid_case_us)

<div class="h3">Resampling For Generalizing the Trend</div>
<br>
<p>
In the plots above, there is an obvious, regularly recurring dip in engagement. These dips are the weekends, when students are far less active. Other, larger dips are holidays such as Christmas and other time off school such as summer break.
<br>
<br>
To better visualize the engagement trend, we can use resampling. Resampling is a time series technique that lets us aggregate the daily engagement and case values to a weekly or monthly basis.
</p>
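<p>
To make the idea concrete, the sketch below averages daily values in consecutive 7-day bins using only the standard library. It is an illustration of what weekly resampling computes, not a re-implementation of the pandas <code>resample(...).mean()</code> call used in this notebook. Anchoring the bins at the first date is an assumption, and the data are made up.
</p>

```python
import datetime as dt
from collections import defaultdict


def weekly_mean(daily: dict) -> dict:
    """Average daily values in consecutive 7-day bins starting at the first date."""
    first_day = min(daily)
    bins = defaultdict(list)
    for day, value in daily.items():
        week_number = (day - first_day).days // 7
        bins[week_number].append(value)
    return {
        first_day + dt.timedelta(days=7 * week): sum(values) / len(values)
        for week, values in sorted(bins.items())
    }


# Two made-up weeks starting on a Wednesday: weekdays are high,
# Saturday and Sunday (positions 3 and 4) are low.
start = dt.date(2020, 1, 1)
pattern = [100, 100, 100, 10, 10, 100, 100]
daily = {start + dt.timedelta(days=i): pattern[i % 7] for i in range(14)}
weekly = weekly_mean(daily)
print(weekly)
```

<p>
Each weekly value blends five high weekdays with two low weekend days, so the weekend dips disappear and only the slower trend remains.
</p>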

In [10]:
plot_resampled_engage_to_cases(engage_df_us, covid_case_us)

While time of year certainly seems to play a part in digital student engagement, the actual number of COVID-19 cases does not seem to have a significant impact on engagement.

<div class="h3">Visualizing With Box Plots</div>
<br>
<p>
By utilizing box plots, we can clearly see the change in engagement over the 4 time periods. 
<br>
It is clear that, with the arrival of COVID-19 in the U.S., there was an overall increase in digital engagement. As expected, summer had the lowest engagement since school was out of session. Fall had higher engagement scores than spring, likely because school systems had figured out how to utilize digital platforms more effectively.
</p>
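<p>
A box plot summarizes each period with its minimum, quartiles, and maximum. The sketch below computes such a summary with the standard library on made-up values. The quartile method chosen here is an assumption and may differ slightly from the one plotly uses when drawing the boxes.
</p>

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up engagement values for one period, not data from this analysis.
sample = [10, 12, 80, 95, 100, 110, 120, 130, 150]
summary = five_number_summary(sample)
print(summary)
```

<p>
The two low values in the sample play the role of weekend dips: they pull the minimum down while leaving the median largely unaffected, which is why the boxes are easier to compare across periods than the raw daily lines.
</p>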

In [11]:
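def split_by_time(engage_df: pd.DataFrame, start: dt.datetime, end: dt.datetime):
    # Inclusive bounds: the callers pass adjacent start/end dates, so strict
    # comparisons would drop the boundary days from every segment.
    engage_df = engage_df[(engage_df.index >= start) & (engage_df.index <= end)]
    return engage_df.engagement_index.values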

def plot_engage_box_plot(engage_df: pd.DataFrame):
    y1 = split_by_time(engage_df, dt.datetime(2019,12,31), dt.datetime(2020,1,21))
    y2 = split_by_time(engage_df, dt.datetime(2020,1,22), dt.datetime(2020,6,1))
    y3 = split_by_time(engage_df, dt.datetime(2020,6,2), dt.datetime(2020,8,12))
    y4 = split_by_time(engage_df, dt.datetime(2020,8,13), dt.datetime(2021,1,1))
    
    fig = go.Figure()
    fig.add_trace(go.Box(y=y1, marker_color=px.colors.sequential.Teal[2], name='Before COVID Hit'))
    fig.add_trace(go.Box(y=y2, marker_color=px.colors.sequential.Teal[3], name='Spring 2020'))
    fig.add_trace(go.Box(y=y3, marker_color=px.colors.sequential.Teal[1], name='Summer 2020'))
    fig.add_trace(go.Box(y=y4, marker_color=px.colors.sequential.Teal[4], name='Fall 2020'))
    fig.update_layout(height=400, width=1000, yaxis_title='engagement', title='Engagement Index Box Plots', title_x=.3) 
    fig.show()

In [12]:
plot_engage_box_plot(engage_df_us)

<div class="h1">Geospatial Analysis</div>

<br>
<div class="h3">Mapping Engagement and COVID-19 Cases</div>
<br>
<p>
The 8 maps below show the geospatial landscape of engagement and cases. The first column shows average total engagement, and the second column shows average new COVID-19 cases per day. The 4 rows show the segments of the year described in the time series portion.
<br>
There does not appear to be much correlation between location and engagement or cases.
</p>
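<p>
The "not much correlation" reading above is a visual judgment. One hedged way to put a number on it would be a Pearson coefficient between per-state engagement and per-state cases. The sketch below shows the calculation on made-up values; it is not a result of this analysis, and <code>pearson_r</code> is a hypothetical helper.
</p>

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-state averages, for illustration only.
engagement = [120.0, 95.0, 140.0, 110.0, 130.0]
daily_cases = [400.0, 900.0, 650.0, 300.0, 800.0]
r = pearson_r(engagement, daily_cases)
print(round(r, 3))
```

<p>
Values near 0 would support the visual impression, while values near 1 or -1 would contradict it. With only a handful of states in the sample, any such coefficient should be read cautiously.
</p>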

In [13]:
def add_map(fig: go.Figure, row: int, col: int, color: str, vals: list):
    fig.add_trace(
        go.Choropleth(
            locations=st_ids,
            locationmode = 'USA-states',
            z = vals,
            colorscale=color, 
        ),
        row=row, col=col
    )
    
def segment_data_by_time(state_map: dict, df_type: int, start_date: dt.datetime, end_date: dt.datetime):
    vals = list()
    for st in list(state_map.keys()):
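        df = state_map[st][df_type]
        # Inclusive bounds so the boundary days of each segment are not dropped.
        df = df[(df.index >= start_date) & (df.index <= end_date)]
        if(df_type==0):
            # Append NaN when a state has no case rows in this window so that
            # vals stays aligned with st_ids.
            vals.append(df['new_cases'].mean() if len(df)>0 else np.nan)
        else:
            vals.append(df['engagement_index'].mean())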
            
    return vals

def plot_engage_cases_maps_over_time(state_map: dict):
    rows = 4
    cols = 2
    fig = make_subplots(rows=rows, cols=cols, 
                        column_titles=['Engagement', 'Cases'], 
                        row_titles=['Before COVID Hit', 'Spring 2020', 'Summer 2020', 'Fall 2020'],
                        specs=[[{'type': 'choropleth'} for _ in range(cols)] for _ in range(rows)])


    add_map(fig, 1,1,'teal', segment_data_by_time(state_map, 1, dt.datetime(2019,12,31), dt.datetime(2020,1,21)))
    add_map(fig, 2,1,'teal', segment_data_by_time(state_map, 1, dt.datetime(2020,1,22), dt.datetime(2020,6,1)))
    add_map(fig, 3,1,'teal', segment_data_by_time(state_map, 1, dt.datetime(2020,6,2), dt.datetime(2020,8,12)))
    add_map(fig, 4,1,'teal', segment_data_by_time(state_map, 1, dt.datetime(2020,8,13), dt.datetime(2021,1,1)))

    add_map(fig, 1,2,'orrd', segment_data_by_time(state_map, 0, dt.datetime(2019,12,31), dt.datetime(2020,1,21)))
    add_map(fig, 2,2,'orrd', segment_data_by_time(state_map, 0, dt.datetime(2020,1,22), dt.datetime(2020,6,1)))
    add_map(fig, 3,2,'orrd', segment_data_by_time(state_map, 0, dt.datetime(2020,6,2), dt.datetime(2020,8,12)))
    add_map(fig, 4,2,'orrd', segment_data_by_time(state_map, 0, dt.datetime(2020,8,13), dt.datetime(2021,1,1)))

    layout = dict(geo = dict(scope='usa'), height=1000, width=1000, title_font_size=30)
    fig.update_traces(showscale=False)
    fig.update_layout(layout)
    fig.update_geos(scope='usa')

    fig.show()

In [14]:
plot_engage_cases_maps_over_time(state_map)

<div class="h1">Socioeconomic Analysis</div>

<br>
<div class="h3">Charting Engagement Across Different Cultural and Economic Groups</div>
<br>
<p>
The bar chart on the left compares engagement across districts with different percentages of Black and Hispanic students, and the bar chart on the right compares engagement across districts with different percentages of students on free and reduced lunch. In order to account for outliers, I only look at, for each day, the engagement level of the most used product in a given district.
<br>
<br>
Overall, there seems to be a general trend in which districts with greater percentages of Black and Hispanic students and greater percentages of free and reduced lunch students have lower engagement. There are some exceptions, which could be genuine or an artifact of the small sample size.
</p>
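<p>
The bar heights come from grouping districts by their percentage bin and averaging engagement within each bin. A minimal sketch of that grouping, on hypothetical values and with a hypothetical helper name, might look like this:
</p>

```python
from collections import defaultdict

# Hypothetical (bin label, engagement index) pairs, one per district.
districts = [
    ("0.0% - 20.0%", 210.0),
    ("0.0% - 20.0%", 190.0),
    ("20.0% - 40.0%", 180.0),
    ("80.0% - 100.0%", 120.0),
    ("80.0% - 100.0%", 100.0),
]


def mean_by_bin(rows):
    """Average the engagement values that share a bin label."""
    grouped = defaultdict(list)
    for label, value in rows:
        grouped[label].append(value)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


means = mean_by_bin(districts)
print(means)
```

<p>
A bin backed by a single district, like the middle one here, is exactly the kind of case where an apparent exception to the trend could be due to sample size.
</p>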

In [15]:
def add_bar(fig: go.Figure, row: int, col: int, x: list, y: list, x_label: str):
    # x_label is accepted for call compatibility but is not currently used.
    chart = go.Bar(x=x, y=y, marker_color=px.colors.sequential.Teal[2])
    fig.add_trace(chart, row=row, col=col)
    

def plot_socioeconomic_bar_charts(pct_black_hisp_map: dict, pct_free_lunch_map: dict):
    x = ['0.0% - 20.0%', '20.0% - 40.0%', '40.0% - 60.0%', '60.0% - 80.0%', '80.0% - 100.0%']
    y1, y2= list(), list()
    for val in x:
        y1.append(pct_black_hisp_map[val].engagement_index.mean())
        y2.append(pct_free_lunch_map[val].engagement_index.mean())
    
    fig = make_subplots(rows=1, cols=2, specs=[[{'type':'bar'}, {'type':'bar'}]], 
                        column_titles=['% black & hispanic', '% free & reduced lunch'])
    add_bar(fig, 1, 1, x, y1, '%')
    add_bar(fig, 1, 2, x, y2, 'teal')
    fig.update_layout(height=400, width=1000, yaxis_title='engagement')
    fig.update_traces(showlegend=False)
    
    fig.show()

In [16]:
plot_socioeconomic_bar_charts(pct_black_hisp_map, pct_free_lunch_map)

<div class="h3">Analyzing Successful Educational Products For All Students</div>
<br>
<p>
The first barpolar chart shows the most used products across the whole U.S., and the second chart shows the most used products in districts with 80 to 100 percent of their students on free and reduced lunch.
<br>
The charts show the 7 most used tools. The longer the bar, the more engagement there is. When you hover over a bar, the r value shows the average engagement index for that product.
<br>
<br>
By far, the most used product in both charts was Google Docs, followed by Google Classroom. This is likely because products such as Google Docs are free and easy to use. Other products such as Canvas were commonly used, likely because they can act as a hub where teachers post assignments. However, Canvas was not as commonly used in lower-income areas, likely because it costs more. In areas with more students on free and reduced lunch, more free products such as Kahoot were used.
</p>
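<p>
Selecting the most used tools amounts to ranking products by their mean engagement and keeping the top few. The sketch below shows that ranking with the standard library; the product names and values are placeholders, and <code>top_products</code> is a hypothetical helper rather than the function used in this notebook.
</p>

```python
import heapq

# Hypothetical mean engagement per product; names and values are placeholders.
mean_engagement = {
    "Product A": 950.0,
    "Product B": 410.0,
    "Product C": 1200.0,
    "Product D": 75.0,
}


def top_products(scores: dict, n: int):
    """Return the n (name, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


top_two = top_products(mean_engagement, 2)
print(top_two)  # [('Product C', 1200.0), ('Product A', 950.0)]
```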

In [17]:
def plot_product_barpolar(df: pd.DataFrame, products_df: pd.DataFrame, title:str):
    prod_data = df.nlargest(7, 'engagement_index')
    plots = list()
    for i in range(7):
        try:
            label = str(products_df[products_df['LP ID'] == int(prod_data.index[i])]["Product Name"].values[0])
        except Exception:
            label = "name not known"
        color = px.colors.sequential.Teal[6-i]
        plots.append(go.Barpolar(
            r=[prod_data['engagement_index'].iloc[i]],
            theta=[(i+1)*50],
            width=[28],
            marker_color=color,
            marker_line_color="#0541a1",
            marker_line_width=2,
            name=label,
            opacity=0.8))
    fig = go.Figure(plots)

    fig.update_layout(
        template=None,
        polar = dict(
            radialaxis = dict(showticklabels=False, ticks=''),
            angularaxis = dict(showticklabels=False, ticks='')
        ),
        showlegend=True,
        title=title, title_x=.45
    )

    fig.show()

In [18]:
plot_product_barpolar(product_df_us, products_df, 'Most Used Digital Products - US')

In [19]:
plot_product_barpolar(prod_lunch_dfs['80.0% - 100.0%'], products_df, 'Most Used Digital Products - 80-100% Free & Reduced Lunch')

<div class="h1">Conclusion</div>
<br>
<div class="h3">Time Series Findings</div>
<br>
Digital student engagement changed at different times of year. After COVID-19 hit, there was a general increase in online engagement. Summer had the lowest engagement since school was out of session, and fall had higher engagement scores than spring, likely because school systems had figured out how to utilize digital platforms more effectively.
<br>
While time of year impacted engagement, the actual number of COVID-19 cases did not seem to impact engagement.
<br>
<br>
<div class="h3">Geospatial Findings</div>
<br>
Overall, there did not seem to be much correlation between location and engagement. 
<br>
Furthermore, locations with more/less COVID-19 cases did not necessarily have more/less engagement.
<br>
<br>
<div class="h3">Socioeconomic Findings</div>
<br>
Socioeconomic factors did generally seem to have an impact on digital student engagement. Typically, districts with greater percentages of Black and Hispanic students and greater percentages of free and reduced lunch students had lower engagement.
<br>
Successful products for all students include several Google products (e.g. Google Docs and Google Classroom) because they are free or inexpensive. Other successful tools seemed to be YouTube, Meet, Canvas, Kahoot, and Schoology.
<br>

In [20]:
%%HTML
<style type="text/css">
div.h1 {
    font-size: 32px; 
    margin-bottom:2px;
}

div.h3 {
    color: #5ab8c7; 
    font-size: 20px; 
    margin-top: 4px; 
    margin-bottom:8px;
}

</style>