# Illinois Dashboard - Day 5

#### Description

During the first modules, the overall Wage Data was too large to plot in full, so you restricted all plots to a random sample. In this notebook, you will incorporate the entire data without increasing the run time by collapsing the underlying data into wage, county, year, and quarter buckets. The steps are:
- Write a SQL query that buckets the underlying data
- Alter the dashboard queries so they pull from the bucketed data
- Run the dashboard on the entire data and observe the run time

## Python Setup

In [None]:
# Package for database connection
from sqlalchemy import create_engine

# Packages for data manipulation
import pandas as pd
import numpy as np
import geopandas as gpd

# Packages for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings so that package notices do not clutter the notebook output
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Database connection
engine = create_engine('postgresql://@10.10.2.10/appliedda')

## SQL Exploration

Let's start by taking a look at the data we have at our disposal:

In [None]:
# Dashboard Data (random sample)
query = '''
SELECT *
FROM ada_18_uchi.dashboard_data_il_jobs_rs
LIMIT 5;
'''
df = pd.read_sql(query, engine)

In [None]:
df.head()

The dashboard still pulls from microdata: every observation corresponds to a single individual. The dashboard, however, displays county-level metrics. Pulling from millions of individual-level observations significantly increases the run time, as we saw in the first module. Is there a way to reduce the underlying data and, with it, the run time? **Discuss potential solutions with your team.**

One way to reduce the underlying data is to group it in advance by year, quarter, and county. The dashboard query then simply pulls the metrics for every county at the given year and quarter. Here is how we would modify the underlying data:

In [None]:
query = '''
SELECT cnty, year, qtr,
    count(*) as jobs, 
    avg(wage) as avg_wage
from ada_18_uchi.dashboard_data_il_jobs_rs
group by cnty, year, qtr
order by cnty, year, qtr
'''
df_grouped = pd.read_sql(query, engine)

In [None]:
df_grouped.head()

This is great if you are looking at metrics for the overall population. But the dashboard lets you restrict to subgroups of interest (by minimum and maximum earnings, for example), and this feature is lost when pulling from the above table.

Instead of collapsing the data all the way to the county level, let's keep separate earnings buckets so we can still filter the dashboard visualization to subgroups of interest. For example, let's group the data into buckets of \$1000.

In [None]:
query = '''
SELECT cnty, year, qtr,
    -- integer division floors each wage to its $1000 bucket (assumes wage is an integer column)
    (wage/1000)*1000 as wage_bucket,
    count(*) as jobs, 
    avg(wage) as avg_wage
from ada_18_uchi.dashboard_data_il_jobs_rs
group by year, qtr, cnty, (wage/1000)*1000
order by year, qtr, cnty, (wage/1000)*1000;
'''
df_grouped = pd.read_sql(query, engine)

In [None]:
df_grouped.head()

The underlying data is now much smaller than the microdata, but it still lets us subset to earnings groups of interest, in steps of \$1000. This should be perfect for our dashboard!

The entire data has been grouped in this way and saved as the table `ada_18_uchi.dashboard_data_il_buckets`.
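To see how filtering still works after bucketing, here is a minimal sketch on made-up records. It assumes an in-memory SQLite database as a stand-in for the PostgreSQL server, and the table names `toy_jobs` and `toy_buckets` are hypothetical. It is an illustration only, not the code that built `ada_18_uchi.dashboard_data_il_buckets`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PostgreSQL server (assumption)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE toy_jobs (cnty TEXT, year INTEGER, qtr INTEGER, wage INTEGER)')

# Made-up records for illustration only
conn.executemany(
    'INSERT INTO toy_jobs VALUES (?, ?, ?, ?)',
    [('001', 2015, 1, 500),
     ('001', 2015, 1, 1200),
     ('001', 2015, 1, 1900),
     ('031', 2015, 1, 2500),
     ('031', 2015, 1, 7300)],
)

# Same bucketing expression as in the query above;
# integer division floors each wage to a $1000 step
conn.execute('''
CREATE TABLE toy_buckets AS
SELECT cnty, year, qtr,
    (wage/1000)*1000 AS wage_bucket,
    count(*) AS jobs,
    avg(wage) AS avg_wage
FROM toy_jobs
GROUP BY cnty, year, qtr, (wage/1000)*1000
''')
n_buckets = conn.execute('SELECT count(*) FROM toy_buckets').fetchone()[0]

# Restrict to jobs earning at least $1000 and less than $3000
rows = conn.execute('''
SELECT cnty, sum(jobs) AS jobs
FROM toy_buckets
WHERE wage_bucket >= 1000 AND wage_bucket < 3000
GROUP BY cnty
ORDER BY cnty
''').fetchall()
print(rows)
```

The trade-off is that earnings filters can only cut at bucket edges: a minimum of \$1500 cannot be applied exactly to a table bucketed in \$1000 steps.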

## Incorporating in Dashboard

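Now that we have the bucketed table, let's rewrite the `group by` queries that generate the dashboard so they pull from it. Because each row now summarizes many jobs, job counts are summed and average wages are weighted by the number of jobs in each bucket.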
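The queries in this section recombine per-bucket averages with `sum(jobs*avg_wage)/sum(jobs)`. The sketch below, on made-up wages, checks that this job-weighted mean matches the mean over the original records, while a plain mean of bucket averages does not. All variable names here are illustrative.

```python
# Made-up wages for one county, split across two $1000 buckets
bucket_a = [1200, 1900]   # wages that fall in the $1000 bucket
bucket_b = [2500]         # wages that fall in the $2000 bucket

# What a bucketed table would store: (jobs, avg_wage) per bucket
buckets = [(len(b), sum(b) / len(b)) for b in (bucket_a, bucket_b)]

# Job-weighted mean, mirroring sum(jobs*avg_wage)/sum(jobs)
weighted = sum(n * avg for n, avg in buckets) / sum(n for n, _ in buckets)

# Plain mean of the bucket averages, which over-weights small buckets
unweighted = sum(avg for _, avg in buckets) / len(buckets)

# Mean over the original records, for comparison
all_wages = bucket_a + bucket_b
true_mean = sum(all_wages) / len(all_wages)

print(weighted, unweighted, true_mean)
```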

In [None]:
count_qry = """
select cnty, 
    sum(jobs) as jobs, 
    sum(jobs*avg_wage)/sum(jobs) as avg_wage
from ada_18_uchi.dashboard_data_il_buckets
where year = {y} and qtr = {q}
group by cnty
order by cnty
"""

In [None]:
change_qry = '''
-- full join: take cnty from whichever period has the county
select coalesce(a.cnty, b.cnty) as cnty,
    cast(b.jobs - a.jobs as decimal)/(a.jobs+1) as change_in_jobs_pct,
    cast(b.avg_wage - a.avg_wage as decimal)/(a.avg_wage+1) as change_in_avg_wage_pct
from(
    select cnty, 
        sum(jobs) as jobs, 
        sum(jobs*avg_wage)/sum(jobs) as avg_wage
    from ada_18_uchi.dashboard_data_il_buckets
    where year = {y0} and qtr = {q0} 
    group by cnty
) as a
full join (
    select cnty, 
        sum(jobs) as jobs, 
        sum(jobs*avg_wage)/sum(jobs) as avg_wage
    from ada_18_uchi.dashboard_data_il_buckets
    where year = {y1} and qtr = {q1}
    group by cnty
) as b
on a.cnty = b.cnty
order by cnty
'''
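The arithmetic in `change_qry` can be sketched in plain Python on made-up county totals. The `+1` in the denominator presumably guards against division by zero for counties with no jobs in the first period; that reading is an assumption, as are the helper name `relative_change` and the sample values.

```python
# Made-up county job counts for two quarters (illustration only)
jobs_start = {'001': 100, '031': 0}
jobs_end = {'001': 110, '031': 5, '043': 20}

def relative_change(start, end):
    """Change relative to (start + 1); None if either period is missing."""
    if start is None or end is None:
        return None
    return (end - start) / (start + 1)

# Union of counties from both periods, like a FULL JOIN on cnty
changes = {
    cnty: relative_change(jobs_start.get(cnty), jobs_end.get(cnty))
    for cnty in sorted(set(jobs_start) | set(jobs_end))
}
print(changes)
```

Note that the result is a fraction rather than a percentage, despite the `_pct` suffix in the column names, and that the offset slightly shrinks the change for counties with few jobs.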

In [None]:
# Import Dashboard Functions
from ui import DashUI

In [None]:
# Define metrics to plot
statefp = '17' # 17 is statefp for Illinois
list_of_metrics = {'Jobs': 'jobs'
                   , 'Average Quarterly Earnings': 'avg_wage'
                   
                   # Insert additional metric for New Jobs
                   
                  }

In [None]:
# Create Dashboard
dash = DashUI(statefp, list_of_metrics, count_qry, change_qry)
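How `DashUI` uses the query templates is not shown in this notebook. One plausible reading is that it fills the `{y}` and `{q}` placeholders with `str.format` before running the query. The sketch below makes that assumption and runs a copy of `count_qry` against a hypothetical `toy_buckets` table in an in-memory SQLite database, so you can see the shape of the result for one year and quarter.

```python
import sqlite3

# In-memory SQLite table standing in for the bucketed table (assumption)
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE toy_buckets '
    '(cnty TEXT, year INTEGER, qtr INTEGER, wage_bucket INTEGER, jobs INTEGER, avg_wage REAL)'
)
# Made-up bucket rows for illustration only
conn.executemany('INSERT INTO toy_buckets VALUES (?, ?, ?, ?, ?, ?)', [
    ('001', 2015, 1, 1000, 2, 1550.0),
    ('001', 2015, 1, 2000, 1, 2500.0),
    ('031', 2015, 1, 7000, 1, 7300.0),
    ('031', 2015, 2, 7000, 3, 7100.0),
])

# Copy of count_qry pointed at the toy table
toy_count_qry = """
select cnty,
    sum(jobs) as jobs,
    sum(jobs*avg_wage)/sum(jobs) as avg_wage
from toy_buckets
where year = {y} and qtr = {q}
group by cnty
order by cnty
"""

# Fill the placeholders for 2015 Q1 (assumed to mirror what the dashboard does)
result = conn.execute(toy_count_qry.format(y=2015, q=1)).fetchall()
print(result)
```

Each county comes back as a single row, however many wage buckets it spans, which is what keeps the dashboard query cheap.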

In [None]:
# Display the input panel and the output of the dashboard
display(dash.input_panel)
display(dash.output)
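To observe the run time outside the dashboard, one option is a small timing helper. This is a generic sketch: the name `timed` is hypothetical, and the workload below is a stand-in for a call such as `pd.read_sql(query, engine)`.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this could wrap a database query
total, seconds = timed(sum, range(1_000_000))
print(f'finished in {seconds:.4f} s')
```

Timing the same query against the random-sample table and the bucketed table gives a direct comparison of the two approaches.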