## Examples discussed on the March 9 webinar

Two examples:
1. Join the establishment location addendum with the employer dataset
2. Join the UI wage record data with the employer data ([jump to example](#UI-wage-to-employer))

In [None]:
# libraries
import pandas as pd
from sqlalchemy import create_engine

### Establishment to employer

IDES provides an addendum to employer locations that offers better addresses for years 2012 - 2015. What follows is an example of how to join the two datasets.

The tables:
1. Establishment locations ("Illinois Department of Employment Security (IDES) establishment data addendum to employer records - 2012-2015" [in the ADRF Explorer](https://deepdish.adrf.info/detail/adrf-000035/))
2. Employer data ("Illinois Department of Employment Security (IDES) Employer records - 2005-2015" [in the ADRF Explorer](https://deepdish.adrf.info/detail/adrf-000034))

The fields to use to join the datasets:
1. Employer Identification Number ("ein" in both tables)
2. Reporting Unit Number ("rptunitno" in establishment data and "seinunit" in employer data)
3. UI Account Number ("uiacctno" in establishment data and "empr_no" in employer data)
4. Time fields "year" and "quarter" in both datasets
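
Because two of the ID fields are named differently in the two tables, it can help to write the column pairs down once and build the join condition from them. The following is only a sketch: the `join_fields` list and the aliases `a` (establishments) and `e` (employers) are illustrative choices, not part of the IDES data or the webinar code.

```python
# Pairs of (establishment column, employer column), taken from the list above.
join_fields = [
    ("ein", "ein"),
    ("rptunitno", "seinunit"),
    ("uiacctno", "empr_no"),
    ("year", "year"),
    ("quarter", "quarter"),
]

# "a" and "e" are assumed table aliases for the establishment and employer tables.
on_clause = "ON " + " AND ".join(
    "a.{} = e.{}".format(a_col, e_col) for a_col, e_col in join_fields
)
print(on_clause)
```

The printed clause contains the same five conditions as the hand-written `ON` portion of the query, so either form can be used.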

In [None]:
# Build the query piece by piece so each clause can be explained:

# first the SELECT clause: from establishments let's grab
# the geographic coordinates "x" (longitude), "y" (latitude), and "census_id" (Block code), and
# from employers we'll get the ID and time variables above as well as NAICS code, total wages, and legal name
query = "SELECT e.ein, e.seinunit AS reporting_unit, e.empr_no AS ui_account, e.year, e.quarter, "
# note we'll use a simple "+=" to append to the query string; this is still part of the SELECT clause
query += "e.naics, e.total_wages, e.name_legal, a.x AS longitude, a.y AS latitude, a.census_id " 
# note the space at the end, it is intentional so the resulting query will be correct

# the FROM clause, here we'll use the "FROM table_a a JOIN table_b b ON a.join_field = b.join_field" syntax
query += "FROM il_des_establishment a JOIN il_qcew_employers e "
# separating out the ON portion as we are using 5 fields for these tables, first the ID fields
query += "ON e.ein = a.ein AND a.rptunitno = e.seinunit AND e.empr_no = a.uiacctno "
# and then the time fields
query += "AND e.year = a.year AND e.quarter = a.quarter "

# it would be possible to run like this on the entire datasets, 
# however let's add a WHERE clause to limit our results to just a single quarter
# note that because we are joining on both these variables above with an "inner join", 
# we can simply include them from one dataset to limit our results
query += "WHERE e.year = 2013 AND e.quarter = 1 "

# and for kicks let's order by our 3 ID variables: 
# (you can use either the aliases or the original names, but if you use original names you'll have to specify
# from what table they come)
query += "ORDER BY ein, reporting_unit, ui_account "

# finally let's limit to just 100 rows for this example
query += "LIMIT 100;"

# and finally let's print out the query to see what it looks like all together:
print(query)

> At this point, you could simply copy and paste the query above into a pgAdmin SQL window if you prefer.
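
The comments in the query note that, with an inner join on year and quarter, filtering on one table's time fields is enough. A minimal way to convince yourself is to run a cut-down version of the query on toy data. The sketch below uses Python's built-in `sqlite3` with an in-memory database rather than the real PostgreSQL server, and every row in it is invented for illustration; only the table and column names come from the example above.

```python
import sqlite3

# In-memory stand-in for the real database; all rows below are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE il_qcew_employers (
    ein TEXT, seinunit TEXT, empr_no TEXT, year INT, quarter INT, name_legal TEXT);
CREATE TABLE il_des_establishment (
    ein TEXT, rptunitno TEXT, uiacctno TEXT, year INT, quarter INT, x REAL, y REAL);
INSERT INTO il_qcew_employers VALUES
    ('111', '0001', 'A1', 2013, 1, 'Toy Co'),
    ('111', '0001', 'A1', 2013, 2, 'Toy Co'),
    ('222', '0001', 'B2', 2013, 1, 'No Match LLC');
INSERT INTO il_des_establishment VALUES
    ('111', '0001', 'A1', 2013, 1, -87.6, 41.9),
    ('111', '0001', 'A1', 2013, 2, -87.6, 41.9);
""")

# Same join shape as above, but selecting the establishment's year and quarter
# so we can see that the WHERE on the employer side limits them too.
toy_query = (
    "SELECT e.ein, e.seinunit AS reporting_unit, e.empr_no AS ui_account, "
    "a.year AS a_year, a.quarter AS a_quarter, a.x AS longitude, a.y AS latitude "
    "FROM il_des_establishment a JOIN il_qcew_employers e "
    "ON e.ein = a.ein AND a.rptunitno = e.seinunit AND e.empr_no = a.uiacctno "
    "AND e.year = a.year AND e.quarter = a.quarter "
    "WHERE e.year = 2013 AND e.quarter = 1 "
    "ORDER BY e.ein, reporting_unit, ui_account;"
)
rows = con.execute(toy_query).fetchall()
print(rows)
```

Only the 2013 Q1 row with a match in both tables comes back: the employer without an establishment record is dropped by the inner join, and the establishment's own year and quarter are limited even though the `WHERE` clause only mentions the employer table. The `ORDER BY` also mixes a table-qualified original name (`e.ein`) with aliases, as described in the query comments.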

In [None]:
# establish the database connection:
db = "appliedda"
host = "10.10.2.10"
conn = create_engine("postgresql://{}/{}".format(host, db))
# string formatting is handy for building strings like the connection URL above

# and let's use our connection to check that our tables are both actually there
# we'll get a little fancy and use string formatting and an inline if statement to print out
# if we found the table or not
tbl1 = 'il_des_establishment'
print('{} {} {}'.format(db, 'has' if conn.has_table(tbl1) else 'does NOT have', tbl1))
tbl2 = 'il_qcew_employers'
print('{} {} {}'.format(db, 'has' if conn.has_table(tbl2) else 'does NOT have', tbl2))

> We have now:
1. constructed our query (named "query")
2. established our database connection
3. tested that the database does indeed have the tables we want

In [None]:
# let's get our data into a pandas dataframe:
employers = pd.read_sql(query, conn)

# and check its info:
employers.info()

In [None]:
# and let's check the descriptive stats of all the (numeric) columns:
employers.describe()

> As we expect, year and quarter only have one value

### UI wage to employer

-- [back to top](#Examples-discussed-on-the-March-9-webinar)

Very similar to above, but this time we will use:
1. Employer data ("Illinois Department of Employment Security (IDES) Employer records - 2005-2015" [in the ADRF Explorer](https://deepdish.adrf.info/detail/adrf-000034))
2. UI wage records ("Illinois Department of Employment Security (IDES) Unemployment Insurance (UI) wage records - 2005-2015" [in the ADRF Explorer](https://deepdish.adrf.info/detail/adrf-000003))

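The join columns are the same as in the first example, but the wage record data uses the same field names as the employer data (the establishment data used different names for two of them):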
1. Employer Identification Number ("ein" in both tables)
2. Reporting Unit Number ("seinunit" in both)
3. UI Account Number ("empr_no" in both)
4. Time fields "year" and "quarter" in both datasets

In [None]:
# Build the query piece by piece so each clause can be explained:

# first the SELECT clause: from wage records let's grab
# the hashed SSN ("ssn") and wages ("wage"), and
# from employers we'll get the ID and time variables above as well as NAICS code, total wages, and legal name
query = "SELECT e.ein, e.seinunit AS reporting_unit, e.empr_no AS ui_account, e.year, e.quarter, "
# note we'll use a simple "+=" to append to the query string; this is still part of the SELECT clause
query += "e.naics, e.total_wages, e.name_legal, w.ssn AS ssn_hash, w.wage AS wage " 
# note the space at the end, it is intentional so the resulting query will be correct

# the FROM clause, here we'll use the "FROM table_a a JOIN table_b b ON a.join_field = b.join_field" syntax
query += "FROM il_wage w JOIN il_qcew_employers e "
# separating out the ON portion as we are using 5 fields for these tables, first the ID fields
query += "ON e.ein = w.ein AND w.seinunit = e.seinunit AND e.empr_no = w.empr_no "
# and then the time fields
query += "AND e.year = w.year AND e.quarter = w.quarter "

# it would be possible to run like this on the entire datasets, 
# however let's add a WHERE clause to limit our results to just a single quarter
# note that because we are joining on both these variables above with an "inner join", 
# we can simply include them from one dataset to limit our results
query += "WHERE e.year = 2013 AND e.quarter = 1 "

# and for kicks let's order by our 3 ID variables: 
# (you can use either the aliases or the original names, but if you use original names you'll have to specify
# from what table they come)
query += "ORDER BY ein, reporting_unit, ui_account "

# finally let's limit to just 100 rows for this example
query += "LIMIT 100;"

# and finally let's print out the query to see what it looks like all together:
print(query)

In [None]:
# for consistency, check the new table exists in the database:
tbl3 = 'il_wage'
print('{} {} {}'.format(db, 'has' if conn.has_table(tbl3) else 'does NOT have', tbl3))

> Please note that without the LIMIT, this wage-record-to-employer join for 2013Q1 returns nearly 6.2 million records, so you should probably do further subsetting (e.g., by industry) or summarization (e.g., average wages, count of employees) before extracting the data for analysis.
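
One way to do that summarization is to let the database aggregate before anything reaches pandas. The sketch below is an assumption about how such a query might look, not something shown in the webinar: it counts distinct hashed SSNs and averages wages per employer for a single quarter. It runs against an in-memory `sqlite3` table with invented rows so it can be tried anywhere; only the `il_wage` table name and its column names come from the example above.

```python
import sqlite3

# In-memory stand-in for the wage table; all rows below are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE il_wage (
    ssn TEXT, ein TEXT, seinunit TEXT, empr_no TEXT,
    year INT, quarter INT, wage REAL);
INSERT INTO il_wage VALUES
    ('h1', '111', '0001', 'A1', 2013, 1, 1000.0),
    ('h2', '111', '0001', 'A1', 2013, 1, 3000.0),
    ('h3', '222', '0001', 'B2', 2013, 1, 5000.0),
    ('h1', '111', '0001', 'A1', 2013, 2, 1200.0);
""")

# One output row per employer instead of one row per worker.
summary_query = (
    "SELECT w.ein, w.seinunit, w.empr_no, "
    "COUNT(DISTINCT w.ssn) AS n_employees, AVG(w.wage) AS avg_wage "
    "FROM il_wage w "
    "WHERE w.year = 2013 AND w.quarter = 1 "
    "GROUP BY w.ein, w.seinunit, w.empr_no "
    "ORDER BY w.ein;"
)
summary = con.execute(summary_query).fetchall()
print(summary)
```

The same `GROUP BY` pattern should carry over to the PostgreSQL tables, where the summarized result could be joined to the employer data or read with `pd.read_sql` as in the cells above.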

In [None]:
# and, get into a dataframe:
workers = pd.read_sql(query, conn)

# the info
workers.info()

In [None]:
# and the first 10 records
workers.head(10)