<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, Ursula Kaczmarek, Benjamin Feder. 

_source to be updated when notebook added to GitHub_

# Dataset Exploration
----------
Basic dataset exploration

# Table of Contents

- [Introduction](#Introduction)
    - [Learning Objectives](#Learning-Objectives)
    - [Methods](#Methods)
    - [Python Setup](#Python-Setup)
    - [Load the Data](#Load-the-Data)
        - [Establish a Connection to the Database](#Establish-a-Connection-to-the-Database)
        - [Formulate Data Query](#Formulate-Data-Query)
        - [Pull Data from the Database](#Pull-Data-from-the-Database)
    - [Analysis: Using Python and SQL](#Analysis:-Using-Python-and-SQL)
        - [What is in the Database?](#What-is-in-the-Database?)
- [Summary Statistics: TANF Spells](#Summary-Statistics:-TANF-Spells)
- [Summary Statistics: Wages](#Summary-Statistics:-Wages)
- [Joining Employers and Wage Data](#Joining-Employers-and-Wage-Data)
- [Summary Statistics: Employers](#Summary-Statistics:-Employers)
- [Creating New Measures](#Creating-New-Measures)

# Introduction
- Back to [Table of Contents](#Table-of-Contents)

In an ideal world, we have all of the data we want with all of the desirable properties (no missing values, no errors, standard formats, and so on). 
However, that is hardly ever true, and we have to use our datasets to answer questions of interest as intelligently as possible. 

In this notebook, we will discover the datasets we have on the ADRF and use them to begin our sample case study: 

**What does the TANF experience look like? Where are TANF recipients finding employment? Is this different from where they are finding stable employment?**

## Learning Objectives
- Back to [Table of Contents](#Table-of-Contents)

This notebook will give you the opportunity to spend some hands-on time with the data. You will explore relevant datasets in the ADRF and discover different ways to analyze your data. This will be done using both SQL and `pandas` in Python. The `sqlalchemy` Python package will give you the opportunity to interact with the database using SQL to pull data into Python. Some additional manipulations will be handled by `pandas` by converting datasets into dataframes.

This notebook will provide an introduction and examples for: 

- How to create new tables from the larger tables in the database (sometimes called the "analytical frame")
- How to explore different variables of interest
- How to create aggregate metrics
- How to handle missing values
- How to join newly created tables

And the questions below will guide the content in this notebook: 

__What are different measures of the TANF experience? What are different measures of employment for TANF recipients? Can we find the companies that are common employers of TANF recipients?__

## Methods
- Back to [Table of Contents](#Table-of-Contents)

We will be using the `sqlalchemy` Python package to access tables on our class PostgreSQL database server. 

To read the results of our queries, we will be using `pandas`, which will read tabular data from SQL queries into a `pandas` DataFrame object. Within `pandas`, we will use various commands:

- `isin`
- `groupby`
- `nunique`

Within SQL, we will use various queries to:

- Select data subsets
- Sum over groups
- Create new tables
- Count distinct values of desired variables
- Order data by chosen variables
- Join datasets
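
As a preview of the three `pandas` commands listed above, here is a minimal sketch on a tiny invented wage table. The column names mirror those used later in this notebook, but the rows and values in `demo_wages` are made up for illustration.

```python
import pandas as pd

# Toy wage records; identifiers and amounts are invented for illustration.
demo_wages = pd.DataFrame({
    "ssn": ["p1", "p1", "p2", "p3", "p3"],
    "uiacct": ["e1", "e2", "e1", "e3", "e3"],
    "wages": [1000, 2500, 4000, 500, 700],
})

# nunique: number of distinct people in the data
n_people = demo_wages["ssn"].nunique()

# groupby: total wages and number of job records per person
per_person = demo_wages.groupby("ssn")["wages"].agg(["sum", "count"])

# isin: keep only the records for a chosen set of employers
subset = demo_wages[demo_wages["uiacct"].isin(["e1", "e3"])]

print(n_people)
print(per_person)
print(subset)
```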

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

In Python, we `import` packages. The `import` command allows us to use libraries created by others. You can think of importing a library as opening up a toolbox and pulling out a specific tool. Here are some of the most famous Python packages:
- `numpy` is short for "numerical Python". `numpy` is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object and a large suite of functions for doing numerical computing. 
- `pandas` is a Python library for data analysis built around the DataFrame object (modeled after R data frames, for those familiar with the language). A DataFrame is similar to a spreadsheet, but it lets you do your analysis programmatically rather than with the point-and-click methods of Microsoft Excel. It is a lynchpin of the PyData stack and is built on top of `numpy`.  
- `sqlalchemy` is a Python library for interfacing with relational databases, including PostgreSQL. 

In [None]:
# pandas-related imports
import pandas as pd

# database interaction imports
import sqlalchemy

__When in doubt, use shift + tab to read the documentation of a method.__

__The `help()` function provides information on what you can do with a function.__

In [None]:
# for example
help(sqlalchemy.create_engine)

## Load the Data

- Back to [Table of Contents](#Table-of-Contents)

We can execute SQL queries using Python to get the best of both worlds. For example, Python - and `pandas` in particular - makes it much easier to calculate descriptive statistics of the data. Additionally, as we will see in the Data Visualization exercises, it is relatively easy to create data visualizations using Python. 

`pandas` provides many ways to load data. It allows the user to read the data from a local CSV or Excel file, pull the data from a relational database, or read directly from a URL (when you have internet access). Since we are working with the PostgreSQL database `appliedda` in this course, we will demonstrate how to use `pandas` to pull data from a relational database. For examples of reading data from a CSV file, refer to the `pandas` documentation [Getting Data In/Out](https://pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function to run a SQL query and pull the data into a `pandas` dataframe (more to come) is `pd.read_sql()`. Just like running a SQL query in DBeaver, this function will ask for some information about the database and the query you would like to run. Let's walk through an example below.

### Establish a Connection to the Database
- Back to [Table of Contents](#Table-of-Contents)

The first parameter is the connection to the database. To create a connection, we will use the `SQLAlchemy` package and tell it which database we want to connect to, just like in DBeaver. Additional details on creating a connection to the database are provided in the [Databases](01_0_Database_Connections.ipynb) notebook.

__Parameter 1: Connection__

In [None]:
# we need to pass the name of the database and host of the database

host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = sqlalchemy.create_engine(connection_string)

> We can parameterize Python `string` objects using the built-in `.format()` method. We will use various formulations in the program notebooks (e.g. when building queries). Some examples are:
1. Empty brackets (shown above), which simply insert the variable into the string; when there is more than one set of brackets, Python inserts the variables in the order they are listed
2. Brackets with formatting, which can make print statements more readable (e.g. formatting a number with a comma and one decimal place: `'{:,.1f}'.format(number_value)` will print `123,456.7` instead of `123456.7123401`)
3. Named brackets, which let us use the same variables multiple times in a text block (we use this in more complicated queries, such as when creating "labels" and "features" for machine learning models)
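
The three bracket styles can be sketched side by side. The schema and table names below (`my_schema`, `my_table`) are placeholders rather than objects in the class database.

```python
# Illustrative values only; the schema and table names are placeholders.
schema = "my_schema"
table = "my_table"

# 1. Empty brackets: values are inserted in the order they are listed.
positional = "SELECT * FROM {}.{}".format(schema, table)

# 2. Brackets with formatting: thousands separator and one decimal place.
number_value = 123456.7123401
formatted = "{:,.1f}".format(number_value)

# 3. Named brackets: the same variable can be reused several times.
named = "SELECT '{tbl}' AS source, COUNT(*) FROM {sch}.{tbl}".format(sch=schema, tbl=table)

print(positional)
print(formatted)
print(named)
```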

### Formulate Data Query
- Back to [Table of Contents](#Table-of-Contents)

This part is similar to writing a SQL query in DBeaver. Depending on the data we are interested in, we can use queries to pull in different data. In this example, we will pull in 20 rows of individual TANF spells data.

__create a query as a `string` object in Python__

In [None]:
# 20 entries of the TANF data

query = '''
SELECT *
FROM il_dhs.ind_spells
LIMIT 20
'''

> A string wrapped in three quotation marks on each side is called a multi-line string. It is quite handy for writing SQL queries because each newline character becomes part of the string instead of breaking it.
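
To see that the line breaks really are stored in the string, the short sketch below compares a triple-quoted string with the same text written on one line using explicit `\n` characters. The table name `some_table` is a placeholder.

```python
# A triple-quoted string keeps its line breaks as '\n' characters.
multi_line = '''
SELECT *
FROM some_table
LIMIT 5
'''

# The same string built by hand with explicit newline characters.
single_line = "\nSELECT *\nFROM some_table\nLIMIT 5\n"

print(multi_line == single_line)
print(multi_line.count("\n"))
```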

In [None]:
# Now that we have defined a variable `query`, we can call it in the code
print(query)

Here, we use the `LIMIT` statement for two reasons. First, `LIMIT` helps users avoid running into memory issues in Python, as it caps the number of rows in the resulting dataframe. Second, it can also speed up some queries for the same reason. Generally, when performing exploratory data analysis using SQL commands, we recommend using `LIMIT` to look at a small sample of the data rather than wasting time and potentially creating memory issues by pulling the entire dataset. Sometimes, it may also be advantageous to write restrictive `WHERE` clauses that naturally limit the size of the output, such as restricting the resulting dataset to a specific year. For instance, if you were curious how some metric within the demographic data changed by year, you could start by restricting the dataset to just 2012 and then systematically change the year until you had a full sense of the trend (or lack thereof) in the dataset, instead of grouping by year from the start. You can also use PostgreSQL's `EXPLAIN` command to inspect how the database plans to execute a query.

> Note that `LIMIT` provides a simple way to get a "sample" of data. However, using `LIMIT` **does not provide a _random_ sample**; it is just based on what is fastest for the database to return.
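
If you do need a random sample, one common approach is to shuffle the rows before applying `LIMIT`. The sketch below uses an in-memory SQLite database as a stand-in, with a made-up table `demo_spells`, so it runs without access to the class database. PostgreSQL also provides a `random()` function that can be used in `ORDER BY`, but sorting an entire large table this way can be slow, so treat this as an illustration rather than a recommendation for the full tables.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the class database.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_spells (id INTEGER PRIMARY KEY, length_months INTEGER)")
demo_conn.executemany(
    "INSERT INTO demo_spells (id, length_months) VALUES (?, ?)",
    [(i, i % 24) for i in range(1, 1001)],
)

# LIMIT alone: whichever rows the database happens to return first.
first_rows = demo_conn.execute("SELECT id FROM demo_spells LIMIT 5").fetchall()

# Shuffle the rows before LIMIT to draw a simple random sample instead.
random_rows = demo_conn.execute(
    "SELECT id FROM demo_spells ORDER BY RANDOM() LIMIT 5"
).fetchall()

print("LIMIT only:       ", [r[0] for r in first_rows])
print("ORDER BY RANDOM():", [r[0] for r in random_rows])
demo_conn.close()
```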

### Pull Data from the Database
- Back to [Table of Contents](#Table-of-Contents)

Now that we have the two parameters (database connection and query), we can pass them to the `pd.read_sql()` function and obtain the data.

In [None]:
# here we pass the query and the connection to the pd.read_sql() function 
df = pd.read_sql(query, conn)

In [None]:
# first five rows of df
df.head()
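
If you want to try the same query-then-connection pattern without access to the class database, the sketch below swaps in an in-memory SQLite connection. The table `demo_spells` and its rows are invented, and the names `demo_conn`, `demo_query`, and `demo_df` are chosen so they do not overwrite the variables used above.

```python
import sqlite3
import pandas as pd

# Stand-in connection: an in-memory SQLite database (not the class database).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_spells (person_id TEXT, start_date TEXT, end_date TEXT)")
demo_conn.executemany(
    "INSERT INTO demo_spells VALUES (?, ?, ?)",
    [
        ("a1", "2014-01-01", "2014-10-31"),
        ("b2", "2013-06-01", "2014-12-15"),
        ("c3", "2014-09-01", "2014-11-30"),
    ],
)

demo_query = '''
SELECT *
FROM demo_spells
LIMIT 2
'''

# Same call pattern as above: query first, connection second.
demo_df = pd.read_sql(demo_query, demo_conn)
print(demo_df)
demo_conn.close()
```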

## Analysis: Using Python and SQL
- Back to [Table of Contents](#Table-of-Contents)

To explore the answers to our guiding questions, we will need to combine various tables together. If you are curious as to how we generated these specific permanent tables available in the `ada_tdc_2019` schema, you can refer to the [Table Creation](Permanent_Table_Creation.ipynb) notebook. Before delving into these tables, though, we will first look at the entire database and the TANF individual spells table.

### What is in the Database?
- Back to [Table of Contents](#Table-of-Contents)

As introduced in the [Databases](01_0_Database_Connections.ipynb) notebook, there are a few ways to connect and explore the data in the database.

__Schemas, Tables, and Columns in database__

Let's find the list of schema names in the database, the list of tables in these schemas and the list of columns in these tables.

In [None]:
# See all available schemas:
query = '''
SELECT schema_name 
FROM information_schema.schemata;
'''
pd.read_sql(query, conn)

> As a reminder, in this class you have access to the following schemas: 'public', 'il_des_kcmo', 'il_dhs', 'in_dwd', 'in_fssa', and 'ada_tdc_2019'.

In [None]:
# assign available schemas to variable schemas
schemas = """
'public', 'il_des_kcmo', 'il_dhs', 'in_dwd', 'in_fssa',  'ada_tdc_2019'
"""

In [None]:
# confirm our schemas exist with an updated version of the previous query
query = '''
SELECT schema_name 
FROM information_schema.schemata
WHERE schema_name IN ({})
'''.format(schemas)
pd.read_sql(query, conn)

In [None]:
# find all tables within desired schemas
query = '''
SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname IN ({})
'''.format(schemas)

tables = pd.read_sql(query, conn)
# print tables not in the public schema
print(tables.query("schemaname != 'public'"))

In [None]:
# list all the tables in the IL DHS schema:
sorted(tables[tables["schemaname"] == 'il_dhs']['tablename'])

> Note the two ways shown above to subset a `pandas.DataFrame`:
1. Use the built-in `.query()` function
2. Create an array of `True` and `False` values (done in this line: `tables["schemaname"] == 'il_dhs'`)
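
The two approaches return the same rows. Here is a small sketch on a toy stand-in for the `tables` DataFrame; the table names in `demo_tables` are made up.

```python
import pandas as pd

# Toy stand-in for the `tables` DataFrame; the table names are made up.
demo_tables = pd.DataFrame({
    "schemaname": ["public", "il_dhs", "il_dhs", "in_dwd"],
    "tablename": ["t0", "t1", "t2", "t3"],
})

# 1. The .query() method takes the condition as a string.
via_query = demo_tables.query("schemaname == 'il_dhs'")

# 2. A Boolean mask: one True/False value per row, used to index the DataFrame.
mask = demo_tables["schemaname"] == "il_dhs"
via_mask = demo_tables[mask]

print(mask.tolist())
print(via_query.equals(via_mask))
```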

In [None]:
# We can look at column names within tables

schema = 'il_dhs'
tbl = 'ind_spells'

query = '''
SELECT * 
FROM information_schema.columns 
WHERE table_schema = '{}' AND table_name = '{}'
'''.format(schema, tbl)

# read and print results
pd.read_sql(query, conn)

In [None]:
# assign 100 rows of tanf data to il_tanf
query = '''
SELECT *
FROM il_dhs.ind_spells
limit 100;
'''
il_tanf = pd.read_sql(query, conn)

In [None]:
# display first five rows in il_tanf
il_tanf.head()

Again, take some time to look at the documentation and understand what the different variables refer to.

# Summary Statistics: TANF Spells
- Back to [Table of Contents](#Table-of-Contents)

In this section, we'll explore TANF individual spell data by focusing on spells for primary TANF recipients in Illinois and Indiana that ended in 2014 Q4. A spell is the period during which an individual or household receives TANF aid.

Here are some questions we will use as a guide: 

- For how long had these individuals been receiving benefits? 
- Can we find the spell length distribution?

To find the answers to these questions, we will refer to the permanent table `ada_tdc_2019.q42014_hoh`, which contains social security numbers for every primary TANF recipient whose benefits ended in 2014 Q4 for both Indiana and Illinois, the state they were receiving the benefits in, and the start date of their spell.

In [None]:
# check out ada_tdc_2019.q42014_hoh
qry = '''
select *
from ada_tdc_2019.q42014_hoh
limit 10;
'''

pd.read_sql(qry, conn)

In [None]:
# find spell lengths in approximate months (days between start and end dates, divided by 30)
qry = '''
select count(*), round((end_date - start_date)/30,1) as length
from ada_tdc_2019.q42014_hoh
group by length
order by count desc
'''

df = pd.read_sql(qry, conn)

print(df)

In [None]:
df.describe(include="all")

In [None]:
# 25th percentile of spell lengths
query = """
SELECT percentile_cont(0.25)
    WITHIN GROUP (ORDER BY round((end_date - start_date)/30,1)) as percentile_25
FROM ada_tdc_2019.q42014_hoh
"""
# display result
pd.read_sql(query, conn)

> For the purposes of your final project outputs, you will be asked to report "fuzzy" medians rather than exact ones to maintain data confidentiality. For example, a fuzzy median can be reported as the average of the 45th and 55th percentiles.
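
The averaging of the 45th and 55th percentiles can be sketched in plain Python. The helper below is a hypothetical re-implementation of linear-interpolation percentiles (the same idea that `percentile_cont` uses), and the spell lengths are invented values, not results from the database.

```python
def percentile_cont(values, fraction):
    """Linearly interpolated percentile of a list of numbers."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Invented spell lengths in months, for illustration only.
spell_lengths = list(range(1, 102))  # 1, 2, ..., 101

p45 = percentile_cont(spell_lengths, 0.45)
p55 = percentile_cont(spell_lengths, 0.55)

# Report the average of the two percentiles instead of an exact value.
fuzzy_value = (p45 + p55) / 2

print(p45, p55, fuzzy_value)
```

In practice you would compute the two percentiles in SQL, as in the queries in this section, and only do the averaging in Python.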

In [None]:
# use a list of percentile values for spell lengths
query = """
SELECT percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
WITHIN GROUP (ORDER BY round((end_date - start_date)/30,1))
FROM ada_tdc_2019.q42014_hoh
"""
# display result
pd.read_sql(query, conn)

Luckily, we can use `unnest` to make this a bit easier to read.

In [None]:
# output from above with a reference column and column names for each percentile
query = """
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY round((end_date - start_date)/30,1))
    ) AS months,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM ada_tdc_2019.q42014_hoh
"""
# get the result
df = pd.read_sql(query, conn)
# view result
df

# Summary Statistics: Wages
- Back to [Table of Contents](#Table-of-Contents)

In this section, we'll dive into a subset of the wage data and analyze several measures to compare primary TANF recipients whose spells ended in 2014 Q4 with the entire population of employees in Illinois and Indiana. Here, we will use `ada_tdc_2019.all_wages`, which contains all Indiana and Illinois employee data for 2015 Q1, in tandem with `ada_tdc_2019.q42014_cohort_wage`, which contains employee data for each primary TANF recipient whose spell ended in 2014 Q4, to compare the two populations.

Let's explore some specific questions to better understand our data:
- How many total jobs did individuals hold in 2015 Q1? How many individuals were working those jobs?
- What did the wage distribution look like in 2015 Q1?
- Can we find out how much each individual was making and how many jobs they had in 2015 Q1?
- What was the proportion of TANF individuals with a stable job in 2015?
- How many TANF individuals do not have available wage data in our cohort?

> __Large tables__ can take a long time to process on shared databases. The 2015 Q1 Illinois and Indiana combined wage data has more than 9.7 million records, so keep this in mind when running complicated queries on this table.

In [None]:
# number of jobs in ada_tdc_2019.all_wages in 2015 Q1
qry = '''
select count(*)
from ada_tdc_2019.all_wages
where year = 2015 and quarter = 1
'''

pd.read_sql(qry, conn)

In [None]:
# number of people working those jobs in 2015 Q1
qry = '''
select count(distinct(ssn))
from ada_tdc_2019.all_wages
where year = 2015 and quarter = 1
'''

pd.read_sql(qry, conn)

In [None]:
# find count of total jobs in 2015 Q1 for primary TANF recipients whose spells ended in Q4 2014 for both states

qry = '''
select count(*)
from ada_tdc_2019.q42014_cohort_wage
where year = 2015 and quarter = 1
'''

pd.read_sql(qry, conn)

In [None]:
# number of people working those jobs
qry = '''
select count(distinct(ssn))
from ada_tdc_2019.q42014_cohort_wage
where year = 2015 and quarter = 1
'''

pd.read_sql(qry, conn)

In [None]:
# wages in 2015 Q1 for full Illinois and Indiana employee data
query = '''
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY wages)
    ) AS earnings_value,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM ada_tdc_2019.all_wages
WHERE year = 2015 and quarter = 1
'''
# get results
df_earnings = pd.read_sql(query, conn)

print(df_earnings)

In [None]:
# wages for 2014 Q4 TANF recipients in 2015 Q1

query = """
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY wages)
    ) AS earnings_value,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM ada_tdc_2019.q42014_cohort_wage
WHERE year = 2015 AND quarter = 1
"""
# get the result
df = pd.read_sql(query, conn)

# view result
df

In [None]:
# wage distribution for just Indiana employees in 2015 Q1, state is coded as 18
qry = '''
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY wages)
    ) AS earnings_value,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM ada_tdc_2019.all_wages
WHERE state = '18' and year = 2015 and quarter = 1
'''

pd.read_sql(qry, conn)

In [None]:
# wage distribution for Indiana TANF in 2015 Q1

qry = '''
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY wages)
    ) AS earnings_value,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM ada_tdc_2019.q42014_cohort_wage
WHERE year = 2015 AND quarter = 1 and state = 18
'''

pd.read_sql(qry, conn)

In [None]:
# wage distribution for just Illinois employees in 2015 Q1, state is coded as 17
qry = '''
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY wages)
    ) AS earnings_value,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM ada_tdc_2019.all_wages
WHERE state = '17' and year = 2015 and quarter = 1
'''

pd.read_sql(qry, conn)

In [None]:
# wage distribution for Illinois TANF in 2015 Q1

qry = '''
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY wages)
    ) AS earnings_value,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM ada_tdc_2019.q42014_cohort_wage
WHERE year = 2015 AND quarter = 1 and state = 17
'''

pd.read_sql(qry, conn)

In [None]:
# selected sample of total wages and amount of jobs per person in 2015 Q1
# limit is used to speed up query
query = '''
select sum(wages), count(*) as num_jobs, ssn
from ada_tdc_2019.all_wages
where year = 2015 and quarter = 1
group by ssn
order by ssn
limit 100;
'''

df_full = pd.read_sql(query, conn)

df_full.head()

In [None]:
# sample of total wages and number of jobs per 2014 Q4 TANF recipient in 2015 Q1
query = '''
select sum(wages), count(*) as num_jobs, ssn
from ada_tdc_2019.q42014_cohort_wage
where year = 2015 and quarter = 1
group by ssn
limit 100;
'''

df_tanf = pd.read_sql(query, conn)

df_tanf.head()

There are many ways you could define "stable employment." You could look at full year employment, full quarter employment, or employment over a certain wage. 

Here, we're going to define stable employment as full quarter employment, meaning that an individual had wage data for the same company for three straight quarters. To do so, we're going to select the same information from `job_yr_q` in `ada_tdc_2019.q42014_cohort_wage` three times and then find if each individual experienced stable employment sometime in 2015.

In [None]:
# Find the number of individuals with stable employment in at least one quarter who were a primary beneficiary of a TANF spell
# that ended in Q4 2014
qry = '''
select count(distinct(a.ssn))
from ada_tdc_2019.q42014_cohort_wage a, ada_tdc_2019.q42014_cohort_wage b, ada_tdc_2019.q42014_cohort_wage c
where a.ssn = b.ssn and a.uiacct=b.uiacct and a.state = b.state and a.state = c.state and 
a.ssn = c.ssn and a.uiacct = c.uiacct and a.job_yr_q = (b.job_yr_q - '3 month'::interval)::date and 
a.job_yr_q = (c.job_yr_q + '3 month'::interval)::date 
'''

pd.read_sql(qry, conn)
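
The logic of the three-way self-join can be easier to follow on a handful of rows. Below is a minimal plain-Python sketch of the same full-quarter idea, using invented records and ignoring the `state` match for brevity: a person counts as stably employed if, for some employer, they have wage records in a quarter and in the quarters immediately before and after it.

```python
# Toy wage records: (ssn, uiacct, year, quarter). All values are invented.
records = [
    ("p1", "e1", 2014, 4), ("p1", "e1", 2015, 1), ("p1", "e1", 2015, 2),
    ("p2", "e1", 2015, 1), ("p2", "e2", 2015, 2), ("p2", "e1", 2015, 3),
    ("p3", "e3", 2015, 1), ("p3", "e3", 2015, 2),
]

# Turn (year, quarter) into one running index so neighbouring quarters differ by 1.
jobs = {}
for ssn, uiacct, year, quarter in records:
    jobs.setdefault((ssn, uiacct), set()).add(year * 4 + (quarter - 1))

# A person-employer pair is "full quarter" in q when q-1, q and q+1 all appear.
stable_people = set()
for (ssn, uiacct), quarters in jobs.items():
    if any(q - 1 in quarters and q + 1 in quarters for q in quarters):
        stable_people.add(ssn)

print(sorted(stable_people))
```

Here only `p1` qualifies: `p2` switches employers in the middle quarter, and `p3` has only two consecutive quarters.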

Now that we have a better sense of the wage data and the differences between subsets of TANF and general employees, we're going to move to analyzing data about the employers, so we can find out the types of companies that are and are not hiring TANF recipients.

# Joining Employers and Wage Data

Previously, to compare the larger cohort of employees to primary recipients who stopped receiving TANF benefits in 2014 Q4, we created another table that filtered the wage data for our desired cohort of TANF recipients. Here, though, since we are analyzing employer data, we will compare all employers who had at least one registered employee in the wage dataset with the employers of our TANF cohort. To make sure each employer had at least one employee within our desired time period, we will join the employer and wage datasets into `ada_tdc_2019.combined_employers` and `ada_tdc_2019.tanf_employers`.

Since these two tables contain so many entries, we dropped some columns that were extraneous to our sample case study and the guiding questions for this notebook. The tables contain individual employment data with relevant employer statistics for 2015 Q1 for all employers (`ada_tdc_2019.combined_employers`) and for just those employing primary recipients whose TANF spells ended in 2014 Q4 (`ada_tdc_2019.tanf_employers`).

The code for creating these tables exists in the [Table Creation](Permanent_Table_Creation.ipynb) notebook. First, let's explore what's in these tables.

In [None]:
# first 1000 rows of tanf employers
qry = '''
select *
from ada_tdc_2019.tanf_employers
limit 1000
'''

df = pd.read_sql(qry, conn)

df.head()

In [None]:
# What naics codes are in our sample of 1000 jobs? How many employers are associated with each one?
df['naics'].value_counts()

In [None]:
# size distribution
df['size'].describe(percentiles=[0.1,0.25,0.5, 0.75, 0.9])

In [None]:
# Count the number of primary TANF recipients whose spells ended in Q4 2014 that are in the wage data
query ="""
select count(distinct(ssn_hash))
from ada_tdc_2019.q42014_hoh 
where ssn_hash in (select distinct ssn from ada_tdc_2019.q42014_cohort_wage)
"""

has_wages = pd.read_sql(query, conn)

has_wages

In [None]:
# here's how we can access the count in 'has_wages'
has_wages['count'][0]

In [None]:
# What percentage of total TANF recipients have wages? 
query ="""
SELECT {}./count(distinct(ssn_hash)) AS not_missing
FROM ada_tdc_2019.q42014_hoh;
""".format(has_wages['count'][0])

# read the query and print the result
print('{:.1f}% of TANF individuals have at least one wage record'\
.format(pd.read_sql(query, conn)['not_missing'][0]*100))

> **Discuss with your team:** What should we do about these missing values? We will revisit this discussion in the [Imputing Wages](impute_wages_example.ipynb) notebook.

# Summary Statistics: Employers

In this section, we'll start looking at aggregate statistics on the employers data in 2015 Q1 and how the total employer data compares to that of employers of primary recipients of TANF benefits that ended during 2014 Q4.

Here are some questions we will answer in this section:

- How old are these employers? How big are they?
- What industries are these employers in?

There are some categories available in the Illinois employers dataset that are not in the Indiana one, such as the age of the company. Here, we will analyze only the Illinois employers of the two cohorts.

In [None]:
# age distribution of all Illinois employers for 2015 Q1
query = """
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY 2015 - setup_date_year)
    ) AS years,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM il_des_kcmo.il_qcew_employers
WHERE quarter = 1 and year = 2015
"""

pd.read_sql(query, conn)

In [None]:
# age distribution of Illinois TANF employers for 2015 Q1
query = """
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9])
        WITHIN GROUP (ORDER BY 2015 - setup_date_year)
    ) AS years,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9]) AS percentile_value
FROM il_des_kcmo.il_qcew_employers
WHERE year = 2015 and quarter = 1 and empr_no in (select distinct uiacct from
ada_tdc_2019.tanf_employers where state = 17)
"""

pd.read_sql(query, conn)

The Indiana employers data also does not contain specific monthly counts of employees for each company in every given year and quarter, while the Illinois table does. Luckily, though, we can build a reasonably accurate proxy by counting the number of distinct social security numbers in the Indiana wage file for each employer and matching the counts to the Indiana employer data. We could do the same for the Illinois data, but since we were already provided with the monthly counts, we will use the maximum of these three counts for simplicity and speed.
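
The proxy described above, one distinct social security number per employee, can be sketched as a `COUNT(DISTINCT ...)` grouped by employer. The example runs against an in-memory SQLite stand-in with an invented `demo_wages` table. It is not the query used to build the permanent tables, which lives in the Table Creation notebook.

```python
import sqlite3

# In-memory SQLite stand-in with invented wage records.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_wages (ssn TEXT, uiacct TEXT, wages REAL)")
demo_conn.executemany(
    "INSERT INTO demo_wages VALUES (?, ?, ?)",
    [
        ("p1", "e1", 1000.0), ("p2", "e1", 4000.0), ("p3", "e1", 500.0),
        ("p1", "e2", 2500.0), ("p1", "e2", 300.0),  # same person twice at e2
    ],
)

# Employer size proxy: number of distinct people with a wage record per employer.
size_query = '''
SELECT uiacct, COUNT(DISTINCT ssn) AS size
FROM demo_wages
GROUP BY uiacct
ORDER BY uiacct
'''
employer_sizes = demo_conn.execute(size_query).fetchall()
print(employer_sizes)
demo_conn.close()
```

Counting distinct `ssn` values rather than rows keeps a person with two wage records at the same employer from being counted twice.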

In [None]:
# find size distribution of all employers
query = """
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
        WITHIN GROUP (ORDER BY size)
    ) AS total_size,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9, 0.99]) AS percentile_value
FROM ada_tdc_2019.all_employers
"""

pd.read_sql(query, conn)

In [None]:
# find size distribution of tanf employers
query = """
SELECT unnest(
        percentile_cont(array[0.1, 0.25, 0.5, 0.75, 0.9, .99])
        WITHIN GROUP (ORDER BY size)
    ) AS total_size_tanf,
    unnest(array[0.1, 0.25, 0.5, 0.75, 0.9, .99]) AS percentile_value
FROM ada_tdc_2019.tanf_employers
"""

pd.read_sql(query, conn)

In [None]:
# find top 10 of industry breakdown for all employers in 2015 Q1
query = """
select naics, count(distinct(uiacct))
from ada_tdc_2019.all_employers
group by naics
order by count(distinct(uiacct)) desc
limit 10
"""

pd.read_sql(query, conn)

In [None]:
# find top 10 of industry breakdown for tanf employers in 2015 Q1
query = """
select naics, count(distinct(uiacct))
from ada_tdc_2019.tanf_employers
group by naics
order by count(distinct(uiacct)) desc
limit 10
"""

pd.read_sql(query, conn)

In [None]:
# subset of total employers in retail or temporary help in 2015 Q1
query = '''
select count(distinct(uiacct))
from ada_tdc_2019.all_employers
where naics = '561' or naics = '451' or naics = '452' or naics = '453' or naics = '454'
'''

retail_temp = pd.read_sql(query, conn)

In [None]:
# share of distinct employers (with a known NAICS code) in retail or temporary help
query ="""
SELECT {}./count(distinct(uiacct)) AS not_missing
FROM ada_tdc_2019.all_employers
where naics is not null
""".format(retail_temp['count'][0])


# read the query and print the result
print('{:.1f}% of all employers are in retail or temporary help'\
.format(pd.read_sql(query, conn)['not_missing'][0]*100))

In [None]:
# subset of tanf employers in retail or temporary help
query = '''
select count(distinct(uiacct))
from ada_tdc_2019.tanf_employers
where (naics = '561' or naics = '451' or naics = '452' or naics = '453' or naics = '454')
'''

retail_temp = pd.read_sql(query, conn)
retail_temp

In [None]:
# share of distinct TANF employers (with a known NAICS code) in retail or temporary help
query ="""
SELECT {}./count(distinct(uiacct)) AS not_missing
FROM ada_tdc_2019.tanf_employers
where naics is not null
""".format(retail_temp['count'][0])


# read the query and print the result
print('{:.1f}% of TANF employers are in retail or temporary help'\
.format(pd.read_sql(query, conn)['not_missing'][0]*100))

# Creating New Measures
- Back to [Table of Contents](#Table-of-Contents)

To wrap up this notebook, you will go through examples of how to create new columns. Creating new columns often allows you to home in on whether a variable meets a certain threshold, and is also of great use in machine learning.

**Preliminary Examples**

As the notebooks progress we will dig into different aspects of the above questions, but for now we will use the `df_jobs` dataframe, which is subset to wage data for Indiana primary TANF recipients whose spells ended in 2014 Q4, to create two new measures:

- Find which TANF leavers experienced another measure of employment in 2015 Q1
- Find which employees worked in retail or temporary help sectors

In [None]:
# Select ssn, wages, quarter and year from sample of in_dwd.wage_by_employer
query = '''
select ssn, year, quarter, uiacct, wages, naics_3_digit
from in_dwd.wage_by_employer 
where year = 2015 and quarter = 1
and ssn in (select distinct ssn_hash from ada_tdc_2019.q42014_hoh where fips = 18)
limit 500;
'''

df_jobs = pd.read_sql(query, conn)

In [None]:
df_jobs.head()

> Since July 24, 2009, the minimum wage in Indiana has been \$7.25 per hour. Assuming a 35-hour work week and 12 weeks in a quarter, someone working an entire quarter at minimum wage would earn \$3,045 in the quarter (ignoring taxes).
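
The threshold is simply the product of the three figures in the note above, which you can check directly:

```python
# Figures taken from the note above; taxes are ignored.
hourly_minimum_wage = 7.25
hours_per_week = 35
weeks_per_quarter = 12

quarterly_threshold = hourly_minimum_wage * hours_per_week * weeks_per_quarter
print("{:,.2f}".format(quarterly_threshold))
```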

In [None]:
# number of people whose 2015 Q1 earnings are at or above the "full-time minimum wage" value of $3,045 in Indiana
df_jobs[df_jobs['wages']>=3045]['ssn'].nunique()

In [None]:
# create new column emp_1qtr_overMin flagging individuals who earned at least the full-time minimum wage value in 2015 Q1
df_jobs['emp_1qtr_overMin'] = df_jobs['ssn'].isin(df_jobs[df_jobs['wages']>=3045]['ssn'].unique())

# Find the percentage of emp_1qtr_overMin with True
df_jobs['emp_1qtr_overMin'].value_counts(normalize=True)

In [None]:
# confirm new column creation
df_jobs.columns

In [None]:
# check records for specific ssn value
ssn_val = 'INSERT SSN HASH'
df_jobs.query("ssn == '{}'".format(ssn_val))

In [None]:
# create new dummy column retail_or_temp for retail or temporary work employees
df_jobs['retail_or_temp'] = df_jobs['naics_3_digit'].isin(['451', '452', '453', '454', '561'])
df_jobs['retail_or_temp'].value_counts(normalize=True)

> How would you create a similar new column for 2014 Q4 TANF Indiana recipients who worked in the restaurant business?

__Separate example: Replicating the QWI Statistics__

The [QWI Statistics](../notebooks_additional/02_3_QWI_stats.ipynb) notebook demonstrates another example of feature creation: the Quarterly Workforce Indicators Census framework using IL wage records.