# Dataset Exploration
----------

## Introduction

In an ideal world, we would have all of the data we want, with all of the desirable properties (no missing values, no errors, standard formats, and so on). 
However, that is hardly ever true, and we have to use the datasets we have to answer questions of interest as intelligently as possible. 

In this notebook, we will explore our datasets to answer some questions of interest. 

### Learning Objectives

This notebook will give you the opportunity to spend some hands-on time with the data. 

This notebook will take you around the different ways you can analyze your data. This involves looking at basic metrics in the larger dataset, taking a random sample, creating derived variables, making sense of the missing values, and so on. 

This will be done using both SQL and `pandas` in Python. The `sqlite3` Python package lets you query the database with SQL and pull the results into Python. Some additional manipulations will be handled by `pandas` (by converting your datasets into DataFrames).

This notebook will provide an introduction and examples for: 

- How to create new tables from the larger tables in the database (sometimes called the "analytical frame")
- How to explore different variables of interest
- How to explore aggregate metrics
- How to handle missing values
- How to join newly created tables

### Methods

We will be using the `sqlite3` Python package to access tables in our SQLite database. 

To read the results of our queries, we will be using the `pandas` Python package, which has the ability to read tabular data from SQL queries into a pandas DataFrame object. Within `pandas`, we will use various commands:

- Subsetting data
- `groupby`
- `merge`

Within SQL, we will use various queries to:

- Select data subsets
- Sum over groups
- Create new tables
- Count distinct values of desired variables
- Order data by chosen variables
- Select a random sub-sample

## Python Setup

In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. Among the most famous Python packages:
- `numpy` is short for "numerical Python". `numpy` is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object, and a large suite of functions for doing numerical computing. 
- `pandas` is a library in Python for data analysis that uses the DataFrame object (modeled after R DataFrames, for those familiar with that language), which is similar to a spreadsheet but allows you to do your analysis programmatically rather than with the point-and-click of Excel. It is a lynchpin of the PyData stack and is built on top of `numpy`.  
- `sqlite3` is a library that helps us connect to a SQLite database.

In [1]:
# pandas-related imports
import pandas as pd

# database interaction imports
import sqlite3

__When in doubt, use shift + tab to read the documentation of a method.__

__The `help()` function provides information on what you can do with a function.__

In [2]:
# for example
help(sqlite3.connect)

Help on built-in function connect in module _sqlite3:

connect(...)
    connect(database[, timeout, detect_types, isolation_level,
            check_same_thread, factory, cached_statements, uri])
    
    Opens a connection to the SQLite database file *database*. You can use
    ":memory:" to open a database connection to a database that resides in
    RAM instead of on disk.



## Load the Data

We can execute SQL queries using Python to get the best of both worlds. For example, Python - and pandas in particular - make it much easier to calculate descriptive statistics of the data. Additionally, as we will see in the Data Visualization exercises, it is relatively easy to create data visualizations using Python. 

Pandas provides many ways to load data. It allows the user to read the data from a local CSV or Excel file, pull the data from a relational database, or read directly from a URL (when you have internet access). Since we are working with a SQLite database, we will demonstrate how to use pandas to read data from a relational database. For examples of reading data from a CSV file, refer to the pandas documentation [Getting Data In/Out](pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function to run a SQL query and pull the data into a pandas dataframe (more to come) is `pd.read_sql()`. Just like doing a SQL query from pgAdmin, this function will ask for some information about the database, and what query you would like to run. Let's walk through the example below.

### Establish a Connection to the Database

The first parameter is the connection to the database. To create a connection we will use the `sqlite3` package and tell it which database file we want to connect to. Additional details on creating a connection to the database are provided in the [Databases](02_1_Databases.ipynb) notebook.

__Parameter 1: Connection__

In [3]:
# to create a connection to the database, 
# we need to pass the name of the database 

DB = 'testing/ncdoc.db'

conn = sqlite3.connect(DB)
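
Once a connection exists, it can help to check which tables the database contains before writing queries. The sketch below is only an illustration: it uses an in-memory database with two hypothetical, empty tables in place of `testing/ncdoc.db`, and reads the table names from `sqlite_master`, SQLite's built-in catalog.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory database stands in for 'testing/ncdoc.db';
# the two toy tables below are placeholders, not the real schema.
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE inmate (INMATE_DOC_NUMBER TEXT)')
demo_conn.execute('CREATE TABLE offender (OFFENDER_NC_DOC_ID_NUMBER TEXT)')

# sqlite_master lists the tables, indexes, and views in a SQLite database
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    demo_conn,
)
print(tables)
```

Running the same query against `conn` should list the tables available in this notebook's database.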

### Formulate Data Query

Depending on what data we are interested in, we can use different queries to pull different data. In this example, we will pull all the content of the `inmate` table.

__Create a query as a `string` object in Python__

In [15]:
query = '''
SELECT *
FROM inmate
--LIMIT 20;
'''

Note:

- The three quotation marks surrounding the query body create a multi-line string. This is handy for writing SQL queries because newline characters are kept as part of the string instead of ending it.

In [16]:
# Now that we have defined a variable `query`, we can call it in the code
print(query)


SELECT *
FROM inmate
--LIMIT 20;



> Note that `LIMIT` (commented out with `--` in the query above) provides one simple way to get a "sample" of data; however, using `LIMIT` does **not provide a _random_** sample. You may get different samples of data than others using just the `LIMIT` clause, but it is just based on what is fastest for the database to return.
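
One common way to draw a random sample in SQLite is to shuffle the rows with `ORDER BY RANDOM()` before applying `LIMIT`. The following is a minimal sketch on a hypothetical toy table in an in-memory database, not on the real `inmate` table.

```python
import sqlite3
import pandas as pd

# Assumption: a toy table with 1,000 made-up ID numbers
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE inmate (INMATE_DOC_NUMBER INTEGER)')
demo_conn.executemany('INSERT INTO inmate VALUES (?)',
                      [(i,) for i in range(1000)])

sample_query = '''
SELECT *
FROM inmate
ORDER BY RANDOM()  -- put the rows in random order
LIMIT 20;          -- then keep the first 20
'''
sample = pd.read_sql(sample_query, demo_conn)
print(sample.shape)
```

Because the database has to order the whole table before it can return any rows, this approach can be slow on very large tables.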

### Pull Data from the Database

Now that we have the two parameters (database connection and query), we can pass them to the `pd.read_sql()` function, and obtain the data.

In [17]:
# here we pass the query and the connection to the pd.read_sql() function and assign the
# dataframe returned by the function to the variable `df`
df = pd.read_sql(query, conn)

In [18]:
df.shape

(461421, 67)

## Analysis: Using Python and SQL

__What are the characteristics of offenders in North Carolina?__

To explore possible metrics, we will need to combine offender and inmate data. 

__North Carolina Department of Corrections Data__:
- `inmate`: inmate data
- `offender`: offender data

In [12]:
query = '''
SELECT *
FROM offender
limit 100;
'''
offender = pd.read_sql(query, conn)

In [13]:
offender.head()

Unnamed: 0,OFFENDER_NC_DOC_ID_NUMBER,OFFENDER_BIRTH_DATE,OFFENDER_GENDER_CODE,OFFENDER_RACE_CODE,OFFENDER_HEIGHT_(IN_INCHES),OFFENDER_WEIGHT_(IN_LBS),OFFENDER_SKIN_COMPLEXION_CODE,OFFENDER_HAIR_COLOR_CODE,OFFENDER_EYE_COLOR_CODE,OFFENDER_BODY_BUILD_CODE,...,OFFENDER_ETHNIC_CODE,OFFENDER_PRIMARY_LANGUAGE_CODE,OFFENDER_SHIRT_SIZE,OFFENDER_PANTS_SIZE,OFFENDER_JACKET_SIZE,OFFENDER_SHOE_SIZE,OFFENDER_DRESS_SIZE,NEXT_PHOTO_YEAR,DATE_OF_LAST_UPDATE,TIME_OF_LAST_UPDATE
0,1,1974-04-04,FEMALE,BLACK,66,180,UNKNOWN,BLACK,BROWN,UNKNOWN,...,UNKNOWN,ENGLISH,0,0,0,0,0,0,2015-02-04,13:32:12
1,3,1955-07-24,MALE,WHITE,74,240,LIGHT,BROWN,BLUE,STOCKY,...,EUROPEAN/N.AM./AUSTR,ENGLISH,0,0,0,0,0,0,2015-05-05,17:20:06
2,4,1961-10-15,MALE,WHITE,70,150,UNKNOWN,BROWN,GREEN,UNKNOWN,...,UNKNOWN,ENGLISH,0,0,0,0,0,0,1995-06-25,00:00:00
3,5,1972-01-22,MALE,WHITE,70,145,UNKNOWN,BLONDE,BROWN,UNKNOWN,...,UNKNOWN,ENGLISH,0,0,0,0,0,0,2001-12-20,13:36:13
4,6,1951-07-17,MALE,WHITE,69,150,UNKNOWN,BROWN,BLUE,UNKNOWN,...,UNKNOWN,ENGLISH,0,0,0,0,0,0,1995-06-25,00:00:00


In [39]:
offender.columns

Index(['OFFENDER_NC_DOC_ID_NUMBER', 'OFFENDER_BIRTH_DATE',
       'OFFENDER_GENDER_CODE', 'OFFENDER_RACE_CODE',
       'OFFENDER_HEIGHT_(IN_INCHES)', 'OFFENDER_WEIGHT_(IN_LBS)',
       'OFFENDER_SKIN_COMPLEXION_CODE', 'OFFENDER_HAIR_COLOR_CODE',
       'OFFENDER_EYE_COLOR_CODE', 'OFFENDER_BODY_BUILD_CODE',
       'CITY_WHERE_OFFENDER_BORN', 'NC_COUNTY_WHERE_OFFENDER_BORN',
       'STATE_WHERE_OFFENDER_BORN', 'COUNTRY_WHERE_OFFENDER_BORN',
       'OFFENDER_CITIZENSHIP_CODE', 'OFFENDER_ETHNIC_CODE',
       'OFFENDER_PRIMARY_LANGUAGE_CODE', 'OFFENDER_SHIRT_SIZE',
       'OFFENDER_PANTS_SIZE', 'OFFENDER_JACKET_SIZE', 'OFFENDER_SHOE_SIZE',
       'OFFENDER_DRESS_SIZE', 'NEXT_PHOTO_YEAR', 'DATE_OF_LAST_UPDATE',
       'TIME_OF_LAST_UPDATE'],
      dtype='object')

In [31]:
query = '''
SELECT *
FROM inmate
limit 100;
'''
inmate = pd.read_sql(query, conn)

In [32]:
inmate.head()

Unnamed: 0,INMATE_DOC_NUMBER,INMATE_LAST_NAME,INMATE_FIRST_NAME,INMATE_MIDDLE_INITIAL,INMATE_NAME_SUFFIX,INMATE_NAME_SOUNDEX_CODE,INMATE_GENDER_CODE,INMATE_RACE_CODE,INMATE_BIRTH_DATE,INMATE_ETHNIC_AFFILIATION,...,CURRENT_PENDING_REVIEWS_FLAG,ESCAPE_HISTORY_FLAG,PRIOR_INCARCERATIONS_FLAG,NEXT_PAROLE_REVIEW_TYPE_CODE,TIME_OF_LAST_MOVEMENT,POPULATION/MANAGEMENT_UNIT,INMATE_POSITIVELY_IDENTIFIED,PAROLE_AND_TERMINATE_STATUS,INMATE_LABEL_STATUS_CODE,PRIMARY_OFFENSE_QUALIFIER
0,4,AARON,DAVID,C,,,MALE,WHITE,1961-10-15,UNKNOWN,...,N,N,Y,,00:09:00,,YES,,,
1,6,AARON,GERALD,,,,MALE,WHITE,1951-07-17,UNKNOWN,...,N,N,Y,,00:11:00,,YES,,,
2,8,AARON,JAMES,M,,,MALE,WHITE,1963-12-29,UNKNOWN,...,N,N,Y,,23:59:00,,YES,,FILE JACKET LABEL PRINTED,PRINCIPAL
3,10,AARON,KENNETH,T,,,MALE,BLACK,1953-05-18,UNKNOWN,...,N,N,Y,,00:13:00,,YES,,,
4,14,AARON,MOYER,,,,MALE,WHITE,1921-08-26,UNKNOWN,...,N,N,Y,,00:12:00,,YES,,,


Some values seem to be missing; we will count them further below. First, let's derive the admission year from the `INMATE_ADMISSION_DATE` column.

In [30]:
# set the SQL query
query ="""
SELECT strftime("%Y",INMATE_ADMISSION_DATE) as admit_year
FROM inmate
limit 10;
"""
# read the query into a DataFrame
year = pd.read_sql(query, conn)

# print the resulting DataFrame
year

Unnamed: 0,admit_year
0,1983
1,1973
2,1995
3,1977
4,1977
5,1992
6,1988
7,1977
8,1983
9,2000


In [38]:
# Missing values

df['INMATE_RACE_CODE'].value_counts()

BLACK         229101
WHITE         203682
OTHER          16904
INDIAN          9269
UNKNOWN         1559
ASIAN/ORTL       905
                   1
Name: INMATE_RACE_CODE, dtype: int64

In [40]:
offender['NC_COUNTY_WHERE_OFFENDER_BORN'].value_counts() # some are missing

OTHER          41
FORSYTH         9
                5
GUILFORD        4
BURKE           4
VANCE           3
WAKE            3
NEW HANOVER     3
CATAWBA         3
WILKES          2
PERSON          2
WARREN          2
CRAVEN          1
GRANVILLE       1
HENDERSON       1
ROCKINGHAM      1
ONSLOW          1
RUTHERFORD      1
SAMPSON         1
CARTERET        1
ROWAN           1
JOHNSTON        1
PITT            1
SURRY           1
GASTON          1
STOKES          1
CASWELL         1
MECKLENBURG     1
FRANKLIN        1
ROBESON         1
CUMBERLAND      1
Name: NC_COUNTY_WHERE_OFFENDER_BORN, dtype: int64
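
In the output above, missing counties show up as an empty string rather than as a true missing value, so pandas does not count them as missing. One possible way to handle this, sketched here on hypothetical toy data, is to convert empty strings to `NaN` so that `isna()` can find them.

```python
import numpy as np
import pandas as pd

# Assumption: toy data that mimics the empty-string pattern seen above
toy = pd.DataFrame(
    {'NC_COUNTY_WHERE_OFFENDER_BORN': ['WAKE', '', 'OTHER', '', 'BURKE']}
)

# replace empty strings with NaN so pandas treats them as missing
county = toy['NC_COUNTY_WHERE_OFFENDER_BORN'].replace('', np.nan)

n_missing = county.isna().sum()       # number of missing values
share_missing = county.isna().mean()  # fraction of missing values
print(n_missing, share_missing)

# dropna=False keeps the missing values in the frequency table
print(county.value_counts(dropna=False))
```

Whether codes such as `OTHER` or `UNKNOWN` should also be treated as missing depends on the question being asked.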

In [62]:
# Let's look at every inmate admitted in the 1980s

# set the SQL query
query ="""
SELECT *, CAST(strftime("%Y",INMATE_ADMISSION_DATE) as integer) as admit_year
FROM inmate
WHERE admit_year >= 1980 AND admit_year < 1990
"""

# print the query for reference
print(query)

# read the query and print the result

in80 = pd.read_sql(query, conn)


SELECT *, CAST(strftime("%Y",INMATE_ADMISSION_DATE) as integer) as admit_year
FROM inmate
WHERE admit_year >= 1980 AND admit_year < 1990



In [66]:
in80.shape

(60799, 68)
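
A subset like this can also be saved as a new table in the database, so that later queries start from the smaller table. The sketch below assumes an in-memory toy database; the table name `inmate_1980s` and the four rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# Assumption: toy inmate table with made-up admission dates
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE inmate (INMATE_DOC_NUMBER INTEGER, INMATE_ADMISSION_DATE TEXT)'
)
demo_conn.executemany(
    'INSERT INTO inmate VALUES (?, ?)',
    [(1, '1979-12-31'), (2, '1983-07-12'), (3, '1988-01-05'), (4, '1990-01-01')],
)

# CREATE TABLE ... AS SELECT stores the query result as a new table
demo_conn.execute('''
CREATE TABLE inmate_1980s AS
SELECT *, CAST(strftime('%Y', INMATE_ADMISSION_DATE) AS integer) AS admit_year
FROM inmate
WHERE CAST(strftime('%Y', INMATE_ADMISSION_DATE) AS integer) BETWEEN 1980 AND 1989
''')
demo_conn.commit()

in80_demo = pd.read_sql('SELECT * FROM inmate_1980s', demo_conn)
print(in80_demo)
```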

Also, some offenders are missing the NC County where they were born. Let's see how many.

In [74]:
# Some offenders have an empty string instead of an NC county of birth.
# Let's find how many.

# build the SQL query
query = '''
SELECT count(distinct OFFENDER_NC_DOC_ID_NUMBER)
FROM offender
WHERE NC_COUNTY_WHERE_OFFENDER_BORN = ''
'''
# read the query into a DataFrame
missing_county = pd.read_sql(query, conn)
# print the resulting DataFrame
missing_county

Unnamed: 0,count(distinct OFFENDER_NC_DOC_ID_NUMBER)
0,314241


In [27]:
# Check whether each inmate appears only once in the table:
# compare the number of distinct DOC numbers with the number of rows.

#generating read SQL
query = '''
SELECT count(distinct INMATE_DOC_NUMBER), count(*)
FROM inmate
'''
# read the query into a DataFrame
unique_offender = pd.read_sql(query, conn)
# print the resulting DataFrame
unique_offender

Unnamed: 0,count(distinct INMATE_DOC_NUMBER),count(*)
0,461421,461421


## Date Variables

In [16]:
admit_date = pd.to_datetime(df.INMATE_ADMISSION_DATE, yearfirst= True)

In [18]:
year = [k.year for k in admit_date]

In [21]:
year

[1983,
 1973,
 1995,
 1977,
 1977,
 1992,
 1988,
 1977,
 1983,
 2000,
 1972,
 1992,
 1995,
 1990,
 2001,
 2009,
 1981,
 1987,
 1994,
 1984]
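
The list comprehension above works, but pandas also offers the vectorized `.dt` accessor for datetime columns. The sketch below uses a few made-up date strings; passing `errors='coerce'` is an assumption about how unparseable values should be handled (they become `NaT` instead of raising an error).

```python
import pandas as pd

# Assumption: toy admission dates, including one empty value
dates = pd.Series(['1983-07-12', '1973-01-26', ''],
                  name='INMATE_ADMISSION_DATE')

# unparseable or empty values become NaT rather than raising an error
admit = pd.to_datetime(dates, errors='coerce')

# .dt.year extracts the year for the whole column at once
admit_year = admit.dt.year
print(admit_year)
```

The result stays a pandas Series, so it can be assigned directly as a new column, for example `df['admit_year'] = ...`.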

## Summary Statistics

In this section, let's start looking at aggregate statistics on the data. 

Let's explore a few specific, simple questions to better understand our data:
- How many people were admitted between 1980 and 1990?
- How many 

> Note: __Large tables__ can take a long time to process on shared databases. 

In [9]:
# additionally we'll use the Python time package
# to see how long different queries take to return

import time # import time package

In [12]:
# example: count of inmates admitted in the 1980s
# (there is no GROUP BY, so SQLite fills admit_year from an arbitrary matching row)

start_time = time.time() # get current time

qry = """
SELECT count(*), CAST(strftime("%Y",INMATE_ADMISSION_DATE) as integer) as admit_year
FROM inmate
WHERE admit_year >= 1980 AND admit_year < 1990
"""
# print results
print(pd.read_sql(qry, conn)) 
# print analysis time
print('Query returned in {:.1f} seconds'.format(time.time()-start_time))

   count(*)  admit_year
0     60799        1983
Query returned in 0.3 seconds


> A **question to consider**: This simple count is one measure of the number of people admitted in the 1980s. Since the `inmate` table has one row per inmate, what may we want to consider when interpreting `INMATE_ADMISSION_DATE` for people who were admitted more than once?
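
To break a count like this down by year, the query can group and order by the derived year. This is a minimal sketch on a hypothetical four-row table; the column alias `n_admitted` is made up.

```python
import sqlite3
import pandas as pd

# Assumption: toy inmate table with made-up admission dates
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE inmate (INMATE_DOC_NUMBER INTEGER, INMATE_ADMISSION_DATE TEXT)'
)
demo_conn.executemany(
    'INSERT INTO inmate VALUES (?, ?)',
    [(1, '1983-07-12'), (2, '1983-02-01'), (3, '1988-01-05'), (4, '1995-03-09')],
)

qry = '''
SELECT CAST(strftime('%Y', INMATE_ADMISSION_DATE) AS integer) AS admit_year,
       count(*) AS n_admitted
FROM inmate
WHERE admit_year >= 1980 AND admit_year < 1990
GROUP BY admit_year   -- one output row per year
ORDER BY admit_year   -- sorted from earliest to latest
'''
by_year = pd.read_sql(qry, demo_conn)
print(by_year)
```

With `GROUP BY`, each value of `admit_year` in the output describes exactly the rows that were counted for it.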

In [19]:
# we can get descriptive stats from the DataFrame:
df.describe(include='all')

Unnamed: 0,INMATE_DOC_NUMBER,INMATE_LAST_NAME,INMATE_FIRST_NAME,INMATE_MIDDLE_INITIAL,INMATE_NAME_SUFFIX,INMATE_NAME_SOUNDEX_CODE,INMATE_GENDER_CODE,INMATE_RACE_CODE,INMATE_BIRTH_DATE,INMATE_ETHNIC_AFFILIATION,...,CURRENT_PENDING_REVIEWS_FLAG,ESCAPE_HISTORY_FLAG,PRIOR_INCARCERATIONS_FLAG,NEXT_PAROLE_REVIEW_TYPE_CODE,TIME_OF_LAST_MOVEMENT,POPULATION/MANAGEMENT_UNIT,INMATE_POSITIVELY_IDENTIFIED,PAROLE_AND_TERMINATE_STATUS,INMATE_LABEL_STATUS_CODE,PRIMARY_OFFENSE_QUALIFIER
count,461421,461421,461421,461421.0,461421.0,461421.0,461421,461421,461421,461421,...,461421,461421,461421,461421.0,461421,461421.0,461421,461421.0,461421,461421
unique,461421,37085,31073,28.0,115.0,3593.0,2,7,30483,12,...,3,2,3,7.0,1438,2.0,4,3.0,4,12
top,261733,SMITH,JAMES,,,,MALE,BLACK,0001-01-01,UNKNOWN,...,N,N,Y,,23:59:00,,YES,,FILE JACKET LABEL PRINTED,PRINCIPAL
freq,1,6861,17043,118775.0,446314.0,253079.0,403315,229101,218,153415,...,261292,442599,447347,271441.0,86011,461271.0,433489,460582.0,209017,301624


In [30]:
df['TOTAL_SENTENCE_COUNT'].astype(int).describe()

count    461421.000000
mean          1.325347
std           1.621162
min           0.000000
25%           0.000000
50%           1.000000
75%           2.000000
max         125.000000
Name: TOTAL_SENTENCE_COUNT, dtype: float64

To describe inmates using the characteristics recorded in the `offender` table (such as height), we need to join the two tables on the DOC ID number.

In [23]:
# join the offender data to the inmate data on the DOC ID number
# to see how many records match

start_time = time.time()

qry = """
SELECT *
FROM offender
JOIN inmate
ON offender.OFFENDER_NC_DOC_ID_NUMBER = inmate.INMATE_DOC_NUMBER
"""
res = pd.read_sql(qry, conn)
print('query took {:.1f} seconds'.format(time.time()-start_time))

query took 17.1 seconds
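
The same join can be done in pandas with `merge`. The sketch below uses two small made-up DataFrames; an outer join with `indicator=True` adds a `_merge` column that shows whether each row was found in the left table, the right table, or both, which is a quick way to see how well the two sources match.

```python
import pandas as pd

# Assumption: toy data with made-up ID numbers
offender_demo = pd.DataFrame({
    'OFFENDER_NC_DOC_ID_NUMBER': [1, 3, 4],
    'OFFENDER_GENDER_CODE': ['FEMALE', 'MALE', 'MALE'],
})
inmate_demo = pd.DataFrame({
    'INMATE_DOC_NUMBER': [4, 6],
    'INMATE_LAST_NAME': ['AARON', 'AARON'],
})

# an outer join keeps unmatched rows from both sides
merged = offender_demo.merge(
    inmate_demo,
    how='outer',
    left_on='OFFENDER_NC_DOC_ID_NUMBER',
    right_on='INMATE_DOC_NUMBER',
    indicator=True,
)
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Using `how='inner'` instead would keep only the matched rows, like the plain SQL `JOIN` above.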


In [100]:
# what is the distribution of height in our joined data?
height = res['OFFENDER_HEIGHT_(IN_INCHES)'].astype(float)
res['OFFENDER_HEIGHT_(IN_INCHES)'] = height

In [102]:
res['OFFENDER_HEIGHT_(IN_INCHES)'].describe(percentiles=[0.1,0.25,0.5, 0.75, 0.9])

count    460900.000000
mean         68.775103
std           5.871367
min           0.000000
10%          64.000000
25%          67.000000
50%          69.000000
75%          72.000000
90%          73.000000
max          98.000000
Name: OFFENDER_HEIGHT_(IN_INCHES), dtype: float64

In [103]:
# and height by gender?
res.groupby('INMATE_GENDER_CODE')['OFFENDER_HEIGHT_(IN_INCHES)'].describe(percentiles=[0.1,0.25,0.5, 0.75, 0.9])

Unnamed: 0_level_0,count,mean,std,min,10%,25%,50%,75%,90%,max
INMATE_GENDER_CODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
FEMALE,58038.0,64.074589,5.93344,0.0,61.0,63.0,64.0,66.0,68.0,91.0
MALE,402862.0,69.452279,5.543087,0.0,66.0,68.0,70.0,72.0,74.0,98.0
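
The minimum height of 0 inches in both groups suggests that 0 may be standing in for a missing value; that is an assumption about how the data were coded, not something the table confirms. Under that assumption, one option is to turn the zeros into `NaN` before summarizing, as in this sketch on toy data.

```python
import numpy as np
import pandas as pd

# Assumption: toy data in which 0 marks an unrecorded height
toy = pd.DataFrame({
    'INMATE_GENDER_CODE': ['FEMALE', 'FEMALE', 'MALE', 'MALE', 'MALE'],
    'OFFENDER_HEIGHT_(IN_INCHES)': [64.0, 0.0, 70.0, 72.0, 0.0],
})

# treat zeros as missing so they do not pull the statistics down
height_clean = toy['OFFENDER_HEIGHT_(IN_INCHES)'].replace(0, np.nan)

# group the cleaned heights by gender; NaN values are skipped by mean()
mean_by_gender = height_clean.groupby(toy['INMATE_GENDER_CODE']).mean()
print(mean_by_gender)
```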


## Exploring Inmate data

Our questions:
- How many people who were admitted in 1980 

Since the TANF data have start and end dates, we will need to consider how our questions relate to dates (whereas with the wage record data we only know if people were paid during a given quarter).

Additionally, the TANF data are much more complex so here will focus on just two tables:
1. `ind_spells` - individual level spells on different benefits (we'll further focus on just the TANF data, coded as 'tanf46' in this data)
2. `member` - includes more information about the people, such as birthdate and gender

In [None]:
# How many spells end in Q4 of 2006?

start_time = time.time()

query="""
SELECT count(*) 
FROM il_dhs.ind_spells
WHERE end_date >= '2006-10-01'::date AND 
    end_date < '2007-01-01'::date
    AND benefit_type = 'tanf46'
"""

print('there are {:,.0f} TANF spells'.format(pd.read_sql(query, conn)['count'][0]))
print('query completed in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# let's pull the list's information in the member table
# in this query we grab just the recptno values for our
# cohort, then use that list to pull the info from "member"
start_time = time.time()
query = """
SELECT * FROM il_dhs.member
WHERE recptno IN ( SELECT recptno
    FROM il_dhs.ind_spells
    WHERE end_date >= '2006-10-01'::date AND 
        end_date < '2007-01-01'::date
        AND benefit_type = 'tanf46')
"""
df = pd.read_sql(query, conn)
print('query completed in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
df.info()

In [None]:
# there are many more results than just our list
# does the unique list of recptno match?
df['recptno'].nunique()

In [None]:
# how many unique values for the 'static' variables?
static_vars = ['recptno', 'ssn_hash', 'sex', 'rac', 'rootrace', 'foreignbn', 'birth_date']

df[static_vars].nunique()

In [None]:
df.groupby(static_vars)['update_id'].count()\
.reset_index()\
.rename(columns={'update_id': 'records'})\
.sort_values('records',ascending=False).head()

In [None]:
df.groupby(static_vars)['update_id'].count().shape

Instead of digging through how the member table is constructed, let's instead base our cohort selection on the `ind_spells` table

In [None]:
# columns from the member table
print(df.columns.tolist())

In [None]:
start_time = time.time()
query = """
SELECT DISTINCT ON (i.recptno) i.recptno, i.start_date, i.end_date, 
    m.birth_date, m.ssn_hash, sex, rac, rootrace
FROM il_dhs.ind_spells i
LEFT JOIN il_dhs.member m
ON i.recptno = m.recptno
WHERE end_date >= '2006-10-01'::date AND 
        end_date < '2007-01-01'::date
        AND benefit_type = 'tanf46'
"""
# read the data, and additionally parse the dates
df = pd.read_sql(query, conn, parse_dates=['start_date', 'end_date', 'birth_date'])
print('data read in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
df.info()

In [None]:
df.head()

In [None]:
# how many of our cohort are there in each of the categories we have?
print('count by sex')
print(df['sex'].value_counts())
print('')
print('count by rac')
print(df['rac'].value_counts())
print('')
print('count by rootrace')
print(df['rootrace'].value_counts())
print('')

### Employment and TANF

Now we'll explore how many of our cohort were **employed** before, during, or after Q4 of 2006 (note: we'll further need to use their start_date to say if the job was before or during enrollment in this TANF spell)

In [None]:
# rather than selecting the cohort in a sub-query, let's use 
# the values from the data frame:
cohort_ssns = df['ssn_hash'].unique()

# reformat the list as a long string of values, this will make it easier to use in the query
cohort_ssns = ','.join(["'"+ssn+"'" for ssn in cohort_ssns])

> This line of code may look complicated, so let's break it down step by step:
>
> 1. __`... for ssn in cohort_ssns ...`__ - Loop through every element `ssn` in the list `cohort_ssns`
> 2. __`"'"+ssn+"'" ...`__ - Wrap each SSN value in single quotes
> 3. __`','.join(...)`__ - Join all elements of the list with a comma between them
>
> _Additional Note: The formulation `[<action> for <item> in <iterable>]` is known as a "list comprehension"._ 
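
The steps above can be tried on a short made-up list. The second half of the sketch shows an alternative that avoids pasting values into the SQL text: `?` placeholders, which the `sqlite3` driver fills in from a list of parameters. The toy table and values are hypothetical, and other database drivers use different placeholder markers.

```python
import sqlite3

# Assumption: made-up hashed identifiers
cohort = ['a1', 'b2', 'c3']

# list comprehension + join, as in the cell above
quoted = ','.join(["'" + ssn + "'" for ssn in cohort])
print(quoted)

# alternative: one "?" placeholder per value, filled in by the driver
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE il_wage (ssn TEXT, year INTEGER)')
demo_conn.executemany(
    'INSERT INTO il_wage VALUES (?, ?)',
    [('a1', 2006), ('zz', 2006), ('c3', 2008)],
)
placeholders = ','.join('?' for _ in cohort)
query = 'SELECT count(*) FROM il_wage WHERE year < 2007 AND ssn IN ({})'.format(placeholders)
n_recs = demo_conn.execute(query, cohort).fetchone()[0]
print(n_recs)
```

Placeholders let the driver handle quoting, though SQLite limits how many can appear in a single statement, so very long lists may need a different approach such as a sub-query.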

In [None]:
start_time = time.time()
# start with before and during
# the wage record data starts in 2005, so the window below covers only 2 years
query = """
SELECT count(*) recs
FROM il_des_kcmo.il_wage
WHERE year < 2007
    AND ssn IN ({})
""".format(cohort_ssns)

print('there are {} records of "before" or "during" jobs'.format(pd.read_sql(query, conn)['recs'][0]))
print('query completed in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
start_time = time.time()
# how many after 2006 Q4? We'll only consider the following 2 years (2007 and 2008)
# the wage record data starts in 2005
query = """
SELECT count(*) recs
FROM il_des_kcmo.il_wage
WHERE year IN (2007, 2008)
    AND ssn IN ({})
""".format(cohort_ssns)

print('there are {} records of "after" jobs'.format(pd.read_sql(query, conn)['recs'][0]))
print('query completed in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# let's pull the "after" jobs for some further analysis

start_time = time.time()

query = """
SELECT ssn, ein, seinunit, empr_no, wage, year, quarter
FROM il_des_kcmo.il_wage
WHERE year IN (2007, 2008)
    AND ssn IN ({})
""".format(cohort_ssns)
df_jobs = pd.read_sql(query, conn)
print('query completed in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
df_jobs.info()

In [None]:
# how many individuals in our cohort had _any_ job in 2007 or 2008?
df_jobs['ssn'].nunique()

In [None]:
print('{:.1f}% of our cohort had _a_ job in 2007 and/or 2008'.format(\
100.*df_jobs['ssn'].nunique()/df.shape[0]))

## Creating New Measures

> **Questions to consider** (we will revisit similar questions of measurement frequently during the program)
0. What problem are we working to solve? How can we measure it?
1. How should we define that an individual "received a job"? For example, definitions could be 
  * Received greater than 0 pay at some point within 1 year
  * Received greater than minimum wage (assuming XyZ) in 6 of 8 quarters after exiting the TANF program
2. How narrowly can you define the unit of analysis? E.g., an individual who stopped receiving TANF by...
3. What additional information do we know about these individuals?

**Preliminary Examples**

As the notebooks progress we will dig into different aspects of the above questions, but for now we will show the example of using the `df` dataframe as our study cohort and the `df_jobs` dataframe to create a few example ways to define whether each individual received a job after leaving TANF.

In [None]:
# simple example of getting "any" job
df['ssn_hash'].isin(df_jobs['ssn'].unique()).value_counts()

In [None]:
# simple example of getting "any" job - add as column to `df` DataFrame
df['emp_any_job'] = df['ssn_hash'].isin(df_jobs['ssn'].unique())

In [None]:
# at least one quarter's earnings are over "full-time minimum wage" value of $2,730 in 2007
df_jobs[df_jobs['wage']>=2730]['ssn'].nunique()

In [None]:
df['emp_1qtr_overMin'] = df['ssn_hash'].isin(df_jobs[df_jobs['wage']>=2730]['ssn'].unique())
df['emp_1qtr_overMin'].value_counts(normalize=True)

In [None]:
# number of quarters with earnings over the "full-time minimum wage" value of $2,730 in 2007, per person (top 10)
df_jobs[df_jobs['wage']>=2730].groupby('ssn')['wage'].count().sort_values(ascending=False).head(10)

In [None]:
# check records for specific ssn value
ssn_val = '...'
df_jobs.query("ssn == '{}'".format(ssn_val))

In [None]:
# create temporary dataframe
temp_df = df_jobs[df_jobs['wage']>=2730].groupby('ssn')['wage'].count().reset_index().rename(columns={'wage':'count'})

# add outcome measure to df
df['emp_2qrts_overMin'] = df['ssn_hash'].isin(temp_df[temp_df['count']>=2]['ssn'])
df['emp_2qrts_overMin'].value_counts(normalize=True)

In [None]:
df.columns

In [None]:
outcome_list = ['emp_any_job', 'emp_1qtr_overMin','emp_2qrts_overMin']
df[outcome_list].sum() # number of True for each outcome metric
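
To see how these outcome flags behave, the sketch below rebuilds two of them on a tiny made-up cohort. The DataFrame names, identifiers, and wages are hypothetical; only the $2,730 threshold comes from the cells above. Taking the mean of a boolean column gives the share of `True` values, which complements the counts from `sum()`.

```python
import pandas as pd

# Assumption: toy cohort and toy job records
cohort_demo = pd.DataFrame({'ssn_hash': ['a1', 'b2', 'c3', 'd4']})
jobs_demo = pd.DataFrame({
    'ssn': ['a1', 'a1', 'c3', 'c3', 'c3'],
    'wage': [3000, 2800, 500, 2730, 100],
})
threshold = 2730  # quarterly "full-time minimum wage" value used above

# outcome 1: any job record at all
cohort_demo['emp_any_job'] = cohort_demo['ssn_hash'].isin(jobs_demo['ssn'])

# outcome 2: at least 2 quarters at or above the threshold
qtrs_over = jobs_demo[jobs_demo['wage'] >= threshold].groupby('ssn').size()
cohort_demo['emp_2qtrs_overMin'] = (
    cohort_demo['ssn_hash'].map(qtrs_over).fillna(0) >= 2
)

# share of the cohort with each outcome
rates = cohort_demo[['emp_any_job', 'emp_2qtrs_overMin']].mean()
print(rates)
```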