# Illinois Dashboard - Day 2: Exploration

#### Description

Before beginning to build the Illinois Dashboard, it is important to understand the underlying data and the assumptions behind it. In this notebook, you will visualize some summary statistics from the Illinois Wage Records data. This notebook will teach you to do the following:

- Query economic data from a database using SQL.
- Visualize data using bar charts, pie charts, and other plots.

## Python Setup

Before writing any of the code for queries or plotting, you'll need to import the necessary Python packages. Afterwards, you'll create a connection to the database from which you will query the data.

In [None]:
# Package for database connection
from sqlalchemy import create_engine

# Packages for data manipulation
import pandas as pd
import numpy as np
import geopandas as gpd

# Packages for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore warnings so that deprecation notices from packages do not clutter the output
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Database connection
engine = create_engine('postgresql://@10.10.2.10/appliedda')

## Summary Statistics

In the rest of the notebook, we will generate summary statistics for several economic metrics (wage levels, job geography, etc.). Because the full dataset is large, you will be working with a random sample (`ada_18_uchi.dashboard_data_il_jobs_rs` instead of `ada_18_uchi.dashboard_data_il_jobs`).

### Wage Distribution in a Quarter

We will start by looking at the distribution of wages in a given year and quarter.

In [None]:
query = '''
SELECT *
FROM ada_18_uchi.dashboard_data_il_jobs_rs
WHERE year = 2010 AND qtr = 1
'''
df = pd.read_sql(query, engine)
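
Since the same query is reused later for other quarters, the year and quarter can be passed as bound parameters instead of being typed into the SQL text. The sketch below is only an illustration: it uses an in-memory SQLite database and a hypothetical table `jobs_sample` in place of the real PostgreSQL server, and the placeholder syntax (`?` here) depends on the database driver.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the real PostgreSQL server (illustration only)
conn = sqlite3.connect(':memory:')
toy = pd.DataFrame({
    'year': [2010, 2010, 2010, 2005],
    'qtr':  [1, 1, 2, 1],
    'wage': [4200.0, 18750.0, 9900.0, 3100.0],
})
toy.to_sql('jobs_sample', conn, index=False)  # hypothetical table name

def load_quarter(connection, year, qtr):
    """Return all rows for one year and quarter using bound parameters."""
    # SQLite uses "?" placeholders; other drivers use a different placeholder style
    query = 'SELECT * FROM jobs_sample WHERE year = ? AND qtr = ?'
    return pd.read_sql(query, connection, params=(year, qtr))

df_2010q1 = load_quarter(conn, 2010, 1)
df_2005q1 = load_quarter(conn, 2005, 1)
print(len(df_2010q1), len(df_2005q1))
```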

In [None]:
# Make a simple histogram:
plt.hist(df['wage'])
plt.show()

The chart shows only one bar, which suggests that a few extreme values are stretching the x-axis. What does the distribution of our data look like?

In [None]:
df['wage'].describe(percentiles = [.01, .1, .25, .5, .75, .9, .99])
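
To see why a default histogram can collapse into a single bar, the sketch below builds synthetic, right-skewed wages (a log-normal sample with a few extreme values added; these are not the Illinois data) and counts how many observations land in the first of ten equal-width bins, before and after capping the range.

```python
import numpy as np

# Synthetic, right-skewed "wages" (log-normal); these are NOT the Illinois data
rng = np.random.default_rng(0)
wages = rng.lognormal(mean=8.5, sigma=1.0, size=10_000)
wages[:5] = 5_000_000  # a handful of extreme values stretch the range

# plt.hist uses 10 equal-width bins by default; np.histogram applies the same binning
counts, edges = np.histogram(wages, bins=10)
share_first_bin = counts[0] / counts.sum()
print(f'Share of observations in the first bin: {share_first_bin:.3f}')

# Capping the range spreads the observations across the bins again
capped = wages[wages <= 25000]
counts_capped, _ = np.histogram(capped, bins=10)
share_first_bin_capped = counts_capped[0] / counts_capped.sum()
print(f'Share in the first bin after capping: {share_first_bin_capped:.3f}')
```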

Since the distribution of wages is heavily right-skewed, let's limit our data to jobs paying $25,000 or less a quarter.

In [None]:
# Keep only jobs paying $25,000 or less in the quarter
df_lim = df[(df['wage'] <= 25000)]

# Histogram of the restricted data
plt.hist(df_lim['wage'])
plt.show()

In [None]:
## We can change options within the hist function (e.g. number of bins, color, transparency):
plt.hist(df_lim['wage'], bins=20, facecolor="purple", alpha=0.5)

## And we can set the plot options too:
plt.xlabel('Quarterly Job Wage')
plt.ylabel('Number of Jobs')
plt.title('Most Jobs Pay Under $15,000 per Quarter')

## And add the data source:
### xy is measured as a fraction of the axes length, from the bottom left of the graph:
plt.annotate('Source: IL Department of Employment Security', xy=(0.4,-0.2), xycoords="axes fraction")

## We use plt.show() to display the graph once we are done setting options:
plt.show()

An alternative is to manually define the wage buckets. This step can be done either in the SQL query or after the data has been read into a Python dataframe. For this example, we will define the buckets in SQL.

In [None]:
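# Raw string: keeps the backslashes as written, so the labels contain "\$" (a literal dollar sign for matplotlib)
query = r'''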
SELECT ssn, wage
    , case when wage = 0 or wage is null then '0. Missing wage'
        when wage < 10000 then '1. Under \$10,000'
        when wage < 20000 then '2. \$10,000 to \$19,999'
        when wage < 30000 then '3. \$20,000 to \$29,999'
        when wage < 40000 then '4. \$30,000 to \$39,999'
        when wage < 50000 then '5. \$40,000 to \$49,999'
        when wage >= 50000 then '6. \$50,000 and above'
        end as wage_bucket
FROM ada_18_uchi.dashboard_data_il_jobs_rs
WHERE year = 2010 AND qtr = 1
'''
df = pd.read_sql(query, engine)
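
For comparison, the same buckets could be built after the data is in a dataframe. The following is a minimal sketch using `pd.cut` on a toy dataframe (the values are made up); it mirrors the `CASE` expression, including the separate label for zero or missing wages. The labels here use plain dollar signs, so they would need escaping as in the query before being passed to matplotlib.

```python
import numpy as np
import pandas as pd

# Toy wages for illustration only (not drawn from the Illinois data)
toy = pd.DataFrame({'wage': [0, np.nan, 9999.99, 10000, 25000, 49999, 50000, 120000]})

# Bin edges and labels mirroring the CASE expression; right=False makes each bin [low, high)
edges = [-np.inf, 10000, 20000, 30000, 40000, 50000, np.inf]
labels = ['1. Under $10,000', '2. $10,000 to $19,999', '3. $20,000 to $29,999',
          '4. $30,000 to $39,999', '5. $40,000 to $49,999', '6. $50,000 and above']
bucket = pd.cut(toy['wage'], bins=edges, labels=labels, right=False).astype(object)

# Zero or missing wages get their own label, as in the SQL version
missing = toy['wage'].isna() | (toy['wage'] == 0)
toy['wage_bucket'] = bucket.where(~missing, '0. Missing wage')
print(toy)
```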

In [None]:
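# Count jobs per bucket; naming the columns explicitly keeps 'index' and 'wage_bucket' stable across pandas versions
freq = df['wage_bucket'].value_counts().rename_axis('index').reset_index(name='wage_bucket')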
freq

In [None]:
# Sort by the numeric prefix, then strip it (e.g. '0. ') from the labels
freq = freq.sort_values('index')
freq['index'] = freq['index'].str[3:]

In [None]:
sns.barplot(x = 'index', y = 'wage_bucket', data = freq)

plt.xticks(rotation=30)
plt.xlabel('Quarterly Job Wage')
plt.ylabel('Number of Jobs')
plt.title('Most Jobs Pay Under $10,000 per Quarter')
plt.annotate('Source: IL Department of Employment Security', xy=(0.4, -0.35), xycoords="axes fraction")

plt.show()

One can also plot this information as a pie chart.

In [None]:
plt.pie(freq['wage_bucket'], labels = freq['index'])

plt.axis('equal')
plt.title('Most Jobs Pay Under $10,000 per Quarter')
plt.annotate('Source: IL Department of Employment Security', xy=(0.4, -0.1), xycoords="axes fraction")

plt.show()
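
Both chart titles claim that most jobs fall in a single bucket. One way to check such a headline is to convert the counts into shares; the sketch below does this on made-up counts, with hypothetical column names `bucket` and `n_jobs`.

```python
import pandas as pd

# Made-up bucket counts for illustration; the real counts come from the query above
freq_toy = pd.DataFrame({
    'bucket': ['Missing wage', 'Under $10,000', '$10,000 to $19,999', '$20,000 and above'],
    'n_jobs': [50, 600, 250, 100],
})

# Share of all jobs in each bucket
freq_toy['share'] = freq_toy['n_jobs'] / freq_toy['n_jobs'].sum()

# A "most jobs" headline is only justified if one bucket holds more than half of the jobs
top = freq_toy.loc[freq_toy['share'].idxmax()]
headline_ok = top['share'] > 0.5
print(top['bucket'], round(top['share'], 2), headline_ok)
```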

### Wage Levels over Time

Now, let's plot the average job wage over time. Average worker earnings (combining a worker's several jobs) follow the same approach once wages are summed by worker.

In [None]:
# Average Job Wage over time:
query = '''
SELECT year, qtr, avg(wage) as avg_wage
FROM ada_18_uchi.dashboard_data_il_jobs_rs
GROUP BY year, qtr
ORDER BY year, qtr
'''
df = pd.read_sql(query, engine)

In [None]:
df.head()

In [None]:
df['year_qtr'] = df['year'].astype(str)+"-Q"+df['qtr'].astype(str)

In [None]:
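## Simple line chart (sns.tsplot has been removed from seaborn, so we use lineplot):
sns.lineplot(x='year_qtr', y='avg_wage', data=df, sort=False)

## Label every fourth quarter to keep the axis readable:
plt.xticks(range(0, len(df), 4), df['year_qtr'][::4], rotation=45)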

plt.xlabel('Time')
plt.ylabel('Average Job Wage')
plt.title('The average wage per job has increased over the last 10 years')
plt.annotate('Source: IL Department of Employment Security', xy=(0.4, -0.2), xycoords="axes fraction")

plt.show()
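
The cells above cover the average job wage. The worker-level counterpart can be sketched as follows on toy data, assuming that `ssn` identifies a worker and that a worker can hold several jobs in the same quarter: wages are first summed per worker, then averaged across workers.

```python
import pandas as pd

# Toy job-level records (hypothetical values); worker 'A' holds two jobs in 2010 Q1
jobs = pd.DataFrame({
    'ssn':  ['A', 'A', 'B', 'A', 'B', 'C'],
    'year': [2010, 2010, 2010, 2010, 2010, 2010],
    'qtr':  [1, 1, 1, 2, 2, 2],
    'wage': [3000.0, 5000.0, 10000.0, 4000.0, 9000.0, 2000.0],
})

# Average job wage: one observation per job
avg_job = jobs.groupby(['year', 'qtr'])['wage'].mean()

# Average worker earnings: first add up each worker's jobs, then average across workers
per_worker = jobs.groupby(['year', 'qtr', 'ssn'])['wage'].sum()
avg_worker = per_worker.groupby(level=['year', 'qtr']).mean()

summary = pd.DataFrame({'avg_job_wage': avg_job,
                        'avg_worker_earnings': avg_worker}).reset_index()
print(summary)
```

Whenever some workers hold more than one job in a quarter, average worker earnings exceed the average job wage, as in the first quarter of this toy example.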

Notice the strong seasonal pattern in earnings: Q1 and Q4 have systematically higher earnings than Q2 and Q3. Keep this in mind when you compare quarters across time.
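
One way to compare quarters despite the seasonal swings is to look at year-over-year changes, that is, each quarter against the same quarter one year earlier. The sketch below uses made-up quarterly averages with a built-in seasonal pattern and a 3% annual increase.

```python
import pandas as pd

# Toy quarterly averages with a built-in seasonal pattern (illustrative numbers only)
toy = pd.DataFrame({
    'year': [2009] * 4 + [2010] * 4,
    'qtr': [1, 2, 3, 4] * 2,
    'avg_wage': [11000.0, 9000.0, 9200.0, 11500.0, 11330.0, 9270.0, 9476.0, 11845.0],
})
toy = toy.sort_values(['year', 'qtr']).reset_index(drop=True)

# Quarter-over-quarter change mixes seasonality with the underlying trend
toy['qoq_change'] = toy['avg_wage'].pct_change()

# Year-over-year change compares each quarter with the same quarter one year earlier
toy['yoy_change'] = toy.groupby('qtr')['avg_wage'].pct_change()
print(toy.round(3))
```

In this toy series the quarter-over-quarter changes swing widely with the season, while the year-over-year changes isolate the steady annual increase.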

## County Information

Now, let's see how many observations have geographic information. Since the dashboard will plot jobs at the county level, let's look at how well the county variable (`cnty`) is populated.

In [None]:
# In Q1 of 2010:
query = '''
SELECT cnty, count(*) as count
FROM ada_18_uchi.dashboard_data_il_jobs_rs
WHERE year = 2010 AND qtr = 1
GROUP BY cnty
ORDER BY cnty
'''
df = pd.read_sql(query, engine)

Let's take a look at the dataframe:

In [None]:
df.head()

Looks good! Let's take a look at the last rows:

In [None]:
df.tail()

Unfortunately, not all county codes here are relevant: some observations have no county code (`None`), and a large number have a default code that will not be mapped to anything (`999`). These observations will not be included in the dashboard.
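
The share of records that can actually be mapped can be estimated by excluding these two cases. This is a minimal sketch on made-up counts; it assumes that county codes are stored as text and that `'999'` is the only placeholder code.

```python
import pandas as pd

# Toy per-county counts shaped like the query result (made-up numbers)
counts = pd.DataFrame({
    'cnty': ['001', '031', '197', '999', None],
    'count': [120, 5400, 610, 2300, 870],
})

# Assumption: county codes are stored as text and '999' is the only placeholder code
usable = counts['cnty'].notna() & (counts['cnty'] != '999')

mappable = counts[usable]
share_mappable = mappable['count'].sum() / counts['count'].sum()
print(f'{share_mappable:.1%} of records can be placed on the county map')
```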

Let's try another quarter of data (2005 Q1):

In [None]:
# In Q1 of 2005:
query = '''
SELECT cnty, count(*) as count
FROM ada_18_uchi.dashboard_data_il_jobs_rs
WHERE year = 2005 AND qtr = 1
GROUP BY cnty
ORDER BY cnty
'''
df = pd.read_sql(query, engine)

In [None]:
df.head()

In this case, none of the data has county information, so the dashboard will be completely empty for this quarter.
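
Rather than checking one quarter at a time, county coverage can be summarized for every quarter at once. The sketch below computes the share of records with a usable county code per year and quarter on toy data, under the same assumptions as before (text codes, `'999'` as the unmappable placeholder).

```python
import pandas as pd

# Toy job-level records (made-up); 2005 Q1 has no county codes at all
jobs = pd.DataFrame({
    'year': [2005, 2005, 2005, 2010, 2010, 2010, 2010],
    'qtr':  [1, 1, 1, 1, 1, 1, 1],
    'cnty': [None, None, None, '031', '197', '999', None],
})

# Assumption: codes are text and '999' is the placeholder that cannot be mapped
jobs['has_county'] = jobs['cnty'].notna() & (jobs['cnty'] != '999')

# Share of records with a usable county code in each quarter
coverage = (jobs.groupby(['year', 'qtr'])['has_county']
                .mean()
                .rename('share_with_county')
                .reset_index())
print(coverage)
```

Quarters with a share of zero would appear empty on the dashboard.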