## <b>DIGITAL LEARNING</b>
    
<b>Problem Statement</b>

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about a worsening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.
    
<b>Business Need</b>
    
What is the state of digital learning in 2020? And how does the engagement of digital learning relate to factors such as district demographics, broadband access, and state/national level policies and events?

<b>Basic information</b>
    
    
<b>Engagement data</b>

The engagement data are aggregated at the school district level, and each file in the folder engagement_data represents data from one school district. The 4-digit file name is the district_id, which can be used to link to district information in districts_info.csv. The lp_id can be used to link to product information in products_info.csv.

<b>Name:</b> Description

<b>time:</b> Date in "YYYY-MM-DD" format

<b>lp_id:</b> The unique identifier of the product

<b>pct_access:</b> Percentage of students in the district who have at least one page-load event of a given product on a given day

<b>engagement_index:</b> Total page-load events per one thousand students of a given product on a given day
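
To make the two metrics concrete, here is a minimal worked example based on the definitions above. The raw counts are hypothetical and chosen only for illustration; the dataset itself ships the already aggregated values.

```python
# Hypothetical raw counts for one product in one district on one day
# (illustrative numbers, not taken from the dataset).
students_in_district = 2000      # enrolled students
students_with_page_load = 150    # students with at least one page-load event
total_page_loads = 900           # all page-load events that day

# pct_access: share of students with at least one page load, in percent
pct_access = 100 * students_with_page_load / students_in_district

# engagement_index: page-load events per 1,000 students
engagement_index = 1000 * total_page_loads / students_in_district

print(pct_access, engagement_index)
```

Note that engagement_index counts events, so a few very active students can raise it without changing pct_access.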
    
    
<b>District information data</b>

The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, we removed the identifiable information about the school districts. We also used an open source tool, ARX (Prasser et al. 2020), to transform several data fields and reduce the risks of re-identification. To generalize the data, some values are released as a range that contains the actual value. Additionally, many missing values are marked as 'NaN', indicating that the data were suppressed to maximize anonymization of the dataset.

<b>Name:</b> Description

<b>district_id:</b> The unique identifier of the school district

<b>state:</b> The state in which the district is located

<b>locale:</b> NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information.

<b>pct_black/hispanic:</b> Percentage of students in the district identified as Black or Hispanic, based on 2018-19 NCES data

<b>pct_free/reduced:</b> Percentage of students in the district eligible for free or reduced-price lunch, based on 2018-19 NCES data

<b>county_connections_ratio:</b> Ratio (residential fixed high-speed connections over 200 kbps in at least one direction / households) based on county-level data from FCC Form 477 (December 2018 version). See FCC data for more information.

<b>pp_total_raw:</b> Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district.
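
Because some district fields are released as ranges rather than exact values, they need to be converted before any numeric analysis. The sketch below assumes the ranges are stored as half-open interval strings such as `[0.2, 0.4[`; that format and the helper name `range_midpoint` are assumptions for illustration, so check the actual column values first.

```python
import numpy as np
import pandas as pd

def range_midpoint(value):
    """Return the midpoint of a range string such as '[0.2, 0.4[' (assumed format)."""
    if pd.isna(value):
        return np.nan          # suppressed values stay missing
    lower, upper = str(value).strip("[]").split(",")
    return (float(lower) + float(upper)) / 2

# Toy values, not taken from the dataset
sample = pd.Series(["[0, 0.2[", "[0.2, 0.4[", np.nan])
midpoints = sample.map(range_midpoint)
print(midpoints)
```

Using the midpoint is only one possible choice; keeping the ranges as ordered categories may be more faithful to the anonymized data.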
    
<b>Product information data</b>

The product file products_info.csv includes information about the characteristics of the top 372 products with the most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

<b>Name:</b>Description

<b>LP ID:</b>The unique identifier of the product

<b>URL:</b>Web Link to the specific product

<b>Product Name:</b>Name of the specific product

<b>Provider/Company Name:</b>Name of the product provider

<b>Sector(s):</b>Sector of education where the product is used

<b>Primary Essential Function:</b> The basic function of the product. There are two layers of labels here.
Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled.
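
The two label layers can be separated into their own columns before grouping. The sketch below assumes each label is written as `<category> - <sub-category>`; that separator and the sample labels are assumptions for illustration and should be checked against the real column.

```python
import pandas as pd

# Illustrative labels in the assumed "<category> - <sub-category>" format
labels = pd.Series([
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction",
    None,  # unlabeled product
])

# Split once on the first " - " so sub-category names may contain hyphens
parts = labels.str.split(" - ", n=1, expand=True)
main_category = parts[0]
sub_category = parts[1]
print(pd.DataFrame({"main": main_category, "sub": sub_category}))
```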
    
 <b>Objectives</b>
 
Uncover trends in digital learning

Visualize the trends of digital connectivity and engagement in 2020

Understand and measure the scope and impact of the pandemic on digital learning

How does student engagement with different types of education technology change over the course of the pandemic?

How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#!pip install glob2

In [None]:
import glob
import re
import datetime as dt
import warnings

import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly
import plotly.express as px
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import init_notebook_mode, plot, iplot
import folium
from folium.plugins import HeatMap, FastMarkerCluster
from geopy.geocoders import Nominatim
from scipy import stats
from sklearn.feature_selection import chi2
from wordcloud import WordCloud, STOPWORDS

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:



path = r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
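    # The file name without its extension is the district_id; this does not depend on path depth
    district_id = os.path.splitext(os.path.basename(filename))[0]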
    df['district_id'] = district_id
    li.append(df)

df_eng = pd.concat(li, axis=0, ignore_index=True)

In [None]:
#df_eng.to_csv('df_enga.csv')
df_eng

In [None]:
df_eng.dtypes

In [None]:
os.chdir("/kaggle/input/learnplatform-covid19-impact-on-digital-learning")
df_dist = pd.read_csv("districts_info.csv")

In [None]:
df_dist

In [None]:
df_dist.dtypes

In [None]:
df_prod = pd.read_csv("products_info.csv")
df_prod

In [None]:
df_prod.dtypes

## Data preprocessing

In [None]:
# how many missing values exist or better still what is the % of missing values in the dataset?
def percent_missing(df):

    # Calculate total number of cells in dataframe
    # (np.prod replaces np.product, which was removed in NumPy 2.0)
    totalCells = np.prod(df.shape)

    # Count number of missing values per column
    missingCount = df.isnull().sum()

    # Calculate total number of missing values
    totalMissing = missingCount.sum()

    # Calculate percentage of missing values
    print("The data contains", round(((totalMissing/totalCells) * 100), 2), "%", "missing values.")



In [None]:
percent_missing(df_eng)

In [None]:
percent_missing(df_prod)

In [None]:
percent_missing(df_dist)

# Which column(s) have missing values?


In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # dtype of missing values
    mis_val_dtype = df.dtypes

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent, mis_val_dtype], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values', 2: 'Dtype'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
    '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
        "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns


In [None]:
missing_values_table(df_eng)

In [None]:
missing_values_table(df_dist)

In [None]:
missing_values_table(df_prod)

## Clean data

The most important part of this step is deciding how to deal with missing data. Here we remove columns with more than 30% missing values and fill the remaining ones using methods such as **forward fill**, **backward fill**, or **filling with the mode or mean**.
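
The rule described above can be written as one reusable function. This is a minimal sketch, not the notebook's own implementation: the helper name `clean_missing`, the choice of the mean for numeric columns and forward/backward fill for the rest, and the toy data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def clean_missing(df, threshold=0.30):
    """Drop columns with more than `threshold` missing values; fill the rest."""
    missing_share = df.isna().mean()                  # fraction missing per column
    kept = df.loc[:, missing_share <= threshold].copy()
    for col in kept.columns:
        if kept[col].isna().any():
            if pd.api.types.is_numeric_dtype(kept[col]):
                # Numeric columns: fill with the column mean
                kept[col] = kept[col].fillna(kept[col].mean())
            else:
                # Other columns: forward fill, then backward fill leading gaps
                kept[col] = kept[col].ffill().bfill()
    return kept

# Toy frame, not taken from the dataset
toy = pd.DataFrame({
    "engagement_index": [10.0, np.nan, 30.0, 20.0],
    "state": ["Utah", None, "Utah", "Ohio"],
    "pp_total_raw": [np.nan, np.nan, np.nan, "[8000, 10000["],
})
cleaned = clean_missing(toy)
print(cleaned)
```

Forward and backward fill depend on row order, so on unsorted data they assign values from unrelated rows; this is worth keeping in mind when interpreting the filled columns.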


In [None]:
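# Fix the missing values
# (Series.ffill/bfill replace fillna(method=...), which is deprecated in recent pandas)
def fix_missing_ffill(df, col):
    # Forward fill: propagate the last valid value forward
    df[col] = df[col].ffill()
    return df[col]

def fix_missing_bfill(df, col):
    # Backward fill: use the next valid value
    df[col] = df[col].bfill()
    return df[col]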


In [None]:
clean_eng = df_eng.copy()

clean_eng['engagement_index'] = clean_eng['engagement_index'].fillna(clean_eng['engagement_index'].mean())
clean_eng['pct_access'] = clean_eng['pct_access'].fillna(clean_eng['pct_access'].mean())
clean_eng['lp_id'] = fix_missing_ffill(clean_eng, 'lp_id')




In [None]:
missing_values_table(clean_eng)

In [None]:
clean_dist = df_dist.copy()
# Remove columns with more than 30% missing values
clean_dist = clean_dist.drop(['pp_total_raw','pct_free/reduced','county_connections_ratio'],axis=1)
clean_dist['state'] = fix_missing_ffill(clean_dist, 'state')

clean_dist['locale'] = fix_missing_bfill(clean_dist, 'locale')
clean_dist['pct_black/hispanic'] = fix_missing_bfill(clean_dist, 'pct_black/hispanic')



In [None]:
missing_values_table(clean_dist)

In [None]:
clean_prod = df_prod.copy()
clean_prod['Sector(s)'] = fix_missing_ffill(clean_prod, 'Sector(s)')

clean_prod['Primary Essential Function'] = fix_missing_ffill(clean_prod, 'Primary Essential Function')
clean_prod['Provider/Company Name'] = fix_missing_ffill(clean_prod, 'Provider/Company Name')

In [None]:
missing_values_table(clean_prod)

### Merge data
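
The engagement, district and product tables are linked on `district_id` and `lp_id`. Since `pd.merge` defaults to an inner join, rows without a matching key are dropped silently. The sketch below, on hypothetical toy frames, uses a left join with `indicator=True` to make unmatched keys visible before committing to the inner join.

```python
import pandas as pd

# Toy frames with hypothetical IDs, not taken from the dataset
eng = pd.DataFrame({
    "district_id": [1000, 1000, 2000],
    "lp_id": [11.0, 22.0, 11.0],
    "engagement_index": [5.0, 7.0, 9.0],
})
dist = pd.DataFrame({"district_id": [1000], "state": ["Utah"]})

# validate raises if district_id is not unique in the district table;
# indicator adds a _merge column showing where each row found a match
checked = eng.merge(dist, on="district_id", how="left",
                    validate="many_to_one", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```
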

In [None]:
clean_eng['district_id'] =clean_eng['district_id'].astype(int)

clean_eng.dtypes

In [None]:
df_merge1 = pd.merge(clean_eng, clean_dist, on="district_id")

In [None]:
df_merge1

In [None]:
missing_values_table(df_merge1)

In [None]:
#Change the name of the column
clean_prod = clean_prod.rename(columns={'LP ID': 'lp_id'})
clean_prod['lp_id'] =clean_prod['lp_id'].astype('float')
clean_prod.head(3)

In [None]:
df_merge2 = pd.merge( clean_prod,df_merge1, on="lp_id")


In [None]:
missing_values_table(df_merge2)

In [None]:
df_merge2

## Data exploration

In [None]:
df_merge2.dtypes

In [None]:
df_merge2.describe()

In [None]:
mode1= df_merge2.mode()
print(mode1)

### **Univariate analysis**

In [None]:
def plot_count(df:pd.DataFrame, column:str) -> None:
    plt.figure(figsize=(12, 7))
    sns.countplot(data=df, x=column)
    plt.title(f'Distribution of {column}', size=20, fontweight='bold')
    plt.show()

In [None]:
plot_count(df_merge2, "Sector(s)")

In [None]:
df_merge2.columns

In [None]:
plot_count(df_merge2, "pct_black/hispanic")

In [None]:
# Split the time column into year, month and day_name features
def features_create(data): 
    data['year']=data['time'].dt.year
    data['month']=data['time'].dt.month
    data['day_name']=data['time'].dt.day_name()
    return data


In [None]:
df_merge2['time'] = pd.to_datetime(df_merge2['time'])


In [None]:
features_create(df_merge2)

In [None]:
plot_count(df_merge2, "day_name")

In [None]:
plot_count(df_merge2, "month")

In [None]:
plt.figure(figsize=(14,10))
for i, column in enumerate(df_merge2[['Sector(s)']]):
    data=df_merge2[column].value_counts().sort_values(ascending=False)
   # plt.subplot(1,2,i+1)
    sns.barplot(x=data, y=data.index);

In [None]:
plt.figure(figsize=(14,10))
for i, column in enumerate(df_merge2[['pct_black/hispanic']]):
    data=df_merge2[column].value_counts().sort_values(ascending=False)
   # plt.subplot(1,2,i+1)
    sns.barplot(x=data, y=data.index)

In [None]:
labels = list(df_merge2.state.value_counts().index)
values = df_merge2['state'].value_counts()
# colors = ['mediumslateblue', 'darkorange']
fig = go.Figure(data=[go.Pie(labels=labels,
                             values=values,hole=.3)])
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict( line=dict(color='#000000', width=3)))
fig.update_layout(title="State Distribution",
                  title_font={'size': 30},
                  )
fig.show()

**The plot above shows how the records are distributed across states. Connecticut, Utah and Massachusetts have the largest shares.**
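
The pie chart counts rows of the merged table, so a state's slice reflects how many records it contributes, not how many districts it has. A small sketch on toy data (hypothetical IDs) shows how the two counts can differ:

```python
import pandas as pd

# Toy data: Utah has one district with many records, Ohio has two districts
toy = pd.DataFrame({
    "state": ["Utah", "Utah", "Utah", "Ohio", "Ohio"],
    "district_id": [1000, 1000, 1000, 2000, 3000],
})

rows_per_state = toy["state"].value_counts()                        # what the pie chart uses
districts_per_state = toy.groupby("state")["district_id"].nunique() # distinct districts
print(rows_per_state)
print(districts_per_state)
```
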

In [None]:
plt.figure(figsize=(10,70))
for i, column in enumerate(df_merge2[['Provider/Company Name']]):
    data=df_merge2[column].value_counts().sort_values(ascending=False)
   # plt.subplot(1,2,i+1)
    sns.barplot(x=data, y=data.index)

**The plot above shows that Google LLC is by far the most frequent provider in the merged data.**

In [None]:
plt.figure(figsize=(10,25))
for i, column in enumerate(df_merge2[['Primary Essential Function']]):
    data=df_merge2[column].value_counts().sort_values(ascending=False)
   # plt.subplot(1,2,i+1)
    sns.barplot(x=data, y=data.index)

**The graph above shows that the LC Digital Learning Platform category appears most often in the merged data.**

In [None]:
def bar_p(df:pd.DataFrame, column:str,title:str):
    plt.figure(figsize=(15, 9))
    # Plot the 10 most frequent values of `column`, using the dataframe passed in
    sns.countplot(y=column, data=df, order=df[column].value_counts().head(10).index,color = "red")
    plt.title(title,font="Serif", size=15)
    plt.show()

In [None]:
bar_p(df_merge2,'Product Name','Distribution of the best Product Name')

**The plot above shows that Google Docs appears most often, and the top three places are all taken by Google services.**

In [None]:
plt.figure(figsize=(10,8))
for i, column in enumerate(df_merge2[['locale']]):
    data=df_merge2[column].value_counts().sort_values(ascending=False)
   # plt.subplot(1,2,i+1)
    sns.barplot(x=data, y=data.index)

In [None]:
def plot_hist(df:pd.DataFrame, column:str, color:str)->None:
    # displot is figure-level and creates its own figure, so no plt.figure() call is needed
    sns.displot(data=df, x=column, color=color, bins = 100, kde=True, height=7, aspect=2)
    plt.title(f'Distribution of {column}', size=20, fontweight='bold')
    plt.show()

def plot_correlation(df:pd.DataFrame, title:str) -> None:
    # Restrict to numeric columns so corr() and the tick labels line up
    numeric_df = df.select_dtypes(['number'])
    f = plt.figure(figsize=(19, 15))
    plt.matshow(numeric_df.corr(), fignum=f.number)
    plt.xticks(range(numeric_df.shape[1]), numeric_df.columns, fontsize=14, rotation=45)
    plt.yticks(range(numeric_df.shape[1]), numeric_df.columns, fontsize=14)
    cb = plt.colorbar()
    cb.ax.tick_params(labelsize=14)
    plt.title(title, fontsize=16)

In [None]:
plot_hist(df_merge2[df_merge2['pct_access'] <= df_merge2['pct_access'].quantile(0.95)], 'pct_access', 'blue')

In [None]:
plot_hist(df_merge2[df_merge2['engagement_index'] <= df_merge2['engagement_index'].quantile(0.95)], 'engagement_index', 'blue')

## **Bivariate analysis**

In [None]:
df_merge2.columns

**The first step of the analysis is to compute the mean engagement_index and pct_access per state, to see how they are distributed and which states have the highest means.**

In [None]:
data1= df_merge2[['state','pct_access','engagement_index']]
data2 = data1.groupby(['state']).mean()
data2.sort_values("engagement_index", ascending=False).head(6)
#data1.head(6)

In [None]:
most_engaged = data2.sort_values('engagement_index', ascending = False).head(10).reset_index()
plt.figure(figsize=(15,9))
plt.title("Top 10 states by mean engagement index")
sns.barplot(x = most_engaged['state'], y = most_engaged['engagement_index'])
plt.xticks(rotation=60)


**The table and chart above show that Arizona has the highest mean engagement index, followed by North Dakota, during the COVID-19 period in 2020.**

In [None]:
data2.sort_values("pct_access", ascending=False).head(6)

**The table above shows that North Dakota has the highest mean percentage of students with at least one page-load event (pct_access), followed by Arizona, during the COVID-19 period in 2020.**

In [None]:
data1= df_merge2[['locale','pct_access','engagement_index']]
data2 = data1.groupby(['locale']).mean()
data2.sort_values("engagement_index", ascending=False).head(6)

**Mean engagement is highest in districts in the Rural locale.**

In [None]:
data1= df_merge2[['Provider/Company Name','pct_access','engagement_index']]
data2 = data1.groupby(['Provider/Company Name']).mean()
data2.sort_values("engagement_index", ascending=False).head(6)

**Mean engagement is higher for products from Instructure, Inc. than for those from Google and other providers.**

In [None]:
data1= df_merge2[['Product Name','pct_access','engagement_index']]
data2 = data1.groupby(['Product Name']).mean()
data2.sort_values("engagement_index", ascending=False).head(6)

**By product, Google services have the highest mean engagement, with Google Docs at the top of the ranking.**

### **Mapping**

In [None]:
state_available = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
df_merge2['state_available'] = df_merge2['state'].map(state_available)

In [None]:
fig = go.Figure()
layout = dict(
    title_text = "Districts in the available States",
    title_font = dict(
            family = "monospace",
            size = 30,
            color = "black"
            ),
    geo_scope = 'usa'
)

# Count rows per state once; using .index/.values avoids relying on the
# column names produced by reset_index(), which differ across pandas versions
state_counts = df_merge2['state_available'].value_counts()

fig.add_trace(
    go.Choropleth(
        locations = state_counts.index,
        zmax = 1,
        z = state_counts.values,
        locationmode = 'USA-states',
        marker_line_color = 'white',
        geo = 'geo',
        colorscale = "cividis",
    )
)
            
fig.update_layout(layout)   
fig.show()