<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Brian Kim, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Avishek Kumar, Jonathan Morgan, Ursula Kaczmarek, Benjamin Feder, Ekaterina Levitskaya, Tian Lou, Lina Osorio-Copete

### **Employment outcomes of Ohio Technical Centers Trainees**

**Datasets we will explore in this notebook:**
- **Ohio Technical Center (OTC) data**: Ohio vocational training program enrollee information (demographic, course start and end month and year, credentials type and description, credential status).
- **Ohio Unemployment Insurance (UI) Wage data**: Ohio workers' quarterly earnings and employment.

## Notebook Setup

In [None]:
# pandas-related imports
import pandas as pd

# Numpy
import numpy as np

# database interaction imports
import sqlalchemy

__Database Connection__

In [None]:
# to create a connection to the database, 
# we need to pass the name of the database and host of the database

host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = sqlalchemy.create_engine(connection_string)

### Pull Data from the Database

Let's see what the `oh_otc` table looks like.

In [None]:
query = '''
SELECT *
FROM data_ohio_olda_2018.oh_otc 
LIMIT 10
'''

In [None]:
# here we pass the query and the connection to the pd.read_sql() function and assign the variable `df1`
# to the dataframe returned by the function
df1 = pd.read_sql(query, conn)

In [None]:
df1.head()

> Each row of the `oh_otc` table represents a student's course enrollment. There are multiple rows per student if the student is taking more than one course, and there are numerous columns for credentials because it is possible to receive multiple credentials for a given course.

> Note that `fiscal_year` is the year in which the data was reported. In a later section, we will define how to identify the school year in which a student enrolled in or graduated from an institution. Ohio Technical Centers do not have a standardized academic term schedule, so we will create a quarter column from the course month.

## Summary Statistics

__Motivating Question #1__: How many OTC students completed training during the 2012-2013 school year, by quarter? How does the number vary by subject and region?

As mentioned above, because the information is reported by month, we need to create an exit quarter from the course end date month (`course_end_date_m`) and filter the course end date year (`course_end_date_y`) to 2012-2013. You will also need to limit the sample to completers by using the student result (`student_result`), and select distinct `ssn_hash` values by quarter.

Following the same definition used in the [Data Exploration](01_2_Data_Exploration.ipynb) notebook that works with Ohio higher education records (Table `oh_hei_long`), we define the **2012-13 school year** as the Summer and Autumn semesters of 2012 and the Winter and Spring semesters of 2013. For the OTC data, this corresponds to quarters 3 and 4 of 2012 and quarters 1 and 2 of 2013, using the new quarter variable created from the course end date month (`course_end_date_m`).

Since we will use this table to further subset the data and calculate employment outcomes, we will save the results of the following SQL query in a temporary table.

In [None]:
# store 2012-2013 school year completers in a temporary table
# The following query selects each ssn_hash, the 2-digit subject code (taken from the 6-digit code), and the quarter;
# keeps course completers by filtering on student_result = 1; and keeps only the quarters
# that make up the 2012-13 school year.
qry = '''
create temp table otc_complet as
with by_quarter as (select ssn_hash,
                    left(hei_subject_code::text, 2) as sub_cod2,
                    course_end_date_y as year,
                    region,
                    case 
                    when course_end_date_m in (1,2,3) then 1
                    when course_end_date_m in (4,5,6) then 2
                    when course_end_date_m in (7,8,9) then 3
                    when course_end_date_m in (10,11,12) then 4
                    end as quarter
                    from data_ohio_olda_2018.oh_otc
                    where student_result = 1) -- Completer
select distinct ssn_hash, sub_cod2, year, region, quarter
from by_quarter
where (year = '2012' and (quarter = 3 or quarter = 4)) or 
    (year = '2013' and (quarter = 1 or quarter = 2));
'''
conn.execute(qry)
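
The month-to-quarter logic in the query above can also be sketched in pandas on a small made-up table. The toy rows and the helper name `in_school_year_2012_13` are illustrative assumptions and not part of the OTC data; only the column names follow the query.

```python
import pandas as pd

# Toy course records (made-up values); column names follow the query above
toy = pd.DataFrame({
    'ssn_hash': ['a', 'b', 'c', 'd'],
    'course_end_date_y': ['2012', '2012', '2013', '2013'],
    'course_end_date_m': [5, 9, 2, 11],
})

# Months 1-3 -> quarter 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
toy['quarter'] = (toy['course_end_date_m'] - 1) // 3 + 1

def in_school_year_2012_13(year, quarter):
    """Hypothetical helper: True for 2012 Q3-Q4 and 2013 Q1-Q2."""
    return (year == '2012' and quarter in (3, 4)) or \
           (year == '2013' and quarter in (1, 2))

# Keep only the rows whose end quarter falls in the 2012-13 school year
mask = [in_school_year_2012_13(y, q)
        for y, q in zip(toy['course_end_date_y'], toy['quarter'])]
completers = toy[mask]
print(completers[['ssn_hash', 'course_end_date_y', 'quarter']])
```

The integer division reproduces the four `case` branches in a single expression; the year is kept as text here because the query compares it against the strings `'2012'` and `'2013'`.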

In [None]:
qry = '''
select *
from otc_complet
'''
df1 = pd.read_sql(qry,conn)
df1.head()

Using the temporary table `otc_complet`, you can join it to the lookup table `oh_subject_codes_lkp` to get subject descriptions, as well as to `oh_region_lkp` to get region names.

In [None]:
# now create a temp table of 2012-13 OTC completers with the corresponding subject descriptions and
# region names
qry = '''
create temp table otc_comp_12_13 as
select a.ssn_hash, a.year, a.quarter, lkp.subject_desc, lkp2.region_name 
from otc_complet a
join data_ohio_olda_2018.oh_subject_codes_lkp lkp on a.sub_cod2::int = lkp.subject_code_2010::int
join data_ohio_olda_2018.oh_region_lkp lkp2 on a.region = lkp2.otc_region_code;
'''
conn.execute(qry)

At this point, you have properly subset the initial `oh_otc` table to include only completers from the 2012-13 academic year. From here, you can find the number of completers by running the code cells below.
> To get to `otc_comp_12_13`, it is possible to combine the above queries into one larger one. However, for instructional purposes, we felt it would be more beneficial to show these steps in smaller chunks.

In [None]:
# one way to find the completers count:
# count distinct ssn_hash values directly in the sql query
qry = '''
select count(distinct(ssn_hash))
from otc_comp_12_13
'''
pd.read_sql(qry, conn)

You've found the answer to the first part of this motivating question. Again, to find the subject breakdown of this completers subset you can work using Python and SQL commands.

In [None]:
# find subject breakdown of graduates in sql
qry = '''
select subject_desc, count(distinct(ssn_hash)) as num_students
from otc_comp_12_13
group by subject_desc
order by num_students desc
'''
pd.read_sql(qry, conn)
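
For comparison, the same breakdown can be sketched in pandas. The rows and the subject labels below are made up for illustration and only mimic the shape of `otc_comp_12_13`, where a student can appear on several rows.

```python
import pandas as pd

# Made-up rows shaped like otc_comp_12_13 (a student may appear more than once)
toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'b', 'c', 'c'],
    'subject_desc': ['Health', 'Health', 'Health', 'Business', 'Health'],
})

# Count distinct students per subject, largest group first
by_subject = (toy.groupby('subject_desc')['ssn_hash']
                 .nunique()
                 .sort_values(ascending=False))
print(by_subject)
```

`nunique()` plays the role of `count(distinct(ssn_hash))`: student `a` appears twice under the same subject but is counted once.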

You can also find the number of completers by region. Here we pull `ssn_hash` and `region_name` from the temporary table `otc_comp_12_13` and use the pandas `groupby` method, which works like SQL's `group by`.

In [None]:
# selecting ssn_hash and region_name from otc_comp_12_13 and assigning to df1
qry = '''
select ssn_hash, region_name
from otc_comp_12_13
'''
df1 = pd.read_sql(qry, conn)

In [None]:
df1.groupby(['region_name'])['ssn_hash'].nunique().sort_values(ascending=False)

__Motivating Question #2__: How many 2012-13 Ohio OTC completers are employed in Ohio one year after graduation? What are their employment patterns?

In this example, we will join `otc_comp_12_13` to the Ohio UI wage data. We will examine:

- How many people have positive earnings during all four quarters after graduation?
- What are the earning distributions of graduates who have positive earnings during the first year after graduation?

To answer the first question, you first need the data on OTC completers. For this exercise, that data is already subset in the table `otc_comp_12_13`. In the next query, we join it to the table `oh_ui_wage_by_quarter` to obtain wage information.

In [None]:
query = '''
with ui_quarter as (
                    select ssn_hash, year as year_ui, quarter as quarter_ui, sumwages, maxweeks
                    from data_ohio_olda_2018.oh_ui_wage_by_quarter
                    where (year = '2012' and quarter = 4) or
                          (year = '2013') or
                          (year = '2014' and quarter in (1,2)))
select distinct a.ssn_hash, a.year as otc_year, a.quarter as otc_quarter, year_ui, quarter_ui, sumwages, maxweeks
from otc_comp_12_13 as a
join ui_quarter as b
on a.ssn_hash = b.ssn_hash
'''
df12 = pd.read_sql(query, conn)

To find exactly one year of employment history for every completer, the code becomes a bit complicated, since a completer may have completed their course at any point in the year. To isolate exactly a year's worth of potential employment, you can select the following fiscal quarters, depending on the time of completion.

**How do we want to calculate earnings during the first year after graduation for 2012-13 graduates?**
```
   Course Completion     Earnings during the first year after graduation
   
    2012_Q3                $2012_Q4+ $2013_Q1+ $2013_Q2+ $2013_Q3
   
   
    2012_Q4                $2013_Q1+ $2013_Q2+ $2013_Q3+ $2013_Q4
   
   
    2013_Q1                $2013_Q2+ $2013_Q3+ $2013_Q4+ $2014_Q1
   
   
    2013_Q2                $2013_Q3+ $2013_Q4+ $2014_Q1+ $2014_Q2

```
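
The table above can be reproduced with a little quarter arithmetic. This is a minimal sketch; the function names `quarter_index` and `first_year_quarters` are hypothetical helpers introduced here for illustration.

```python
def quarter_index(year, quarter):
    """Map (year, quarter) to a running quarter count."""
    return int(year) * 4 + (int(quarter) - 1)

def first_year_quarters(year, quarter):
    """Return the four (year, quarter) pairs that follow a completion quarter."""
    start = quarter_index(year, quarter)
    window = []
    for idx in range(start + 1, start + 5):
        y, q0 = divmod(idx, 4)      # q0 is zero-based (0..3)
        window.append((y, q0 + 1))  # convert back to quarters 1..4
    return window

# Print the earnings window for each completion quarter of the 2012-13 school year
for completion in [(2012, 3), (2012, 4), (2013, 1), (2013, 2)]:
    print(completion, '->', first_year_quarters(*completion))
```

Subtracting two `quarter_index` values gives the number of quarters between a completion and a wage record, which is the quantity the following cells compute with a lookup dictionary.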

In [None]:
df12.head()

In [None]:
# adding new columns to compute the number of quarters between course completion and each quarter of UI wage records
df12['otc_yq'] = df12['otc_year'] + 'q' +df12['otc_quarter'].astype(str)
df12['ui_yq'] = df12['year_ui'] + 'q' + df12['quarter_ui'].astype(str)
# Sequence of quarters number starting on 2012 q3
qrt_dictionary = {'2012q3':1, '2012q4':2, '2013q1':3, '2013q2':4, '2013q3':5, '2013q4':6, '2014q1':7, '2014q2':8}
df12['otc_q_num'] = df12['otc_yq'].map(qrt_dictionary)
df12['ui_q_num'] = df12['ui_yq'].map(qrt_dictionary)
# Number of quarters after graduation
df12['num_emp_quarter'] = df12['ui_q_num'] - df12['otc_q_num']
df12

In [None]:
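# Select only earnings from one quarter after completion up to four quarters after completion
# (.copy() so that columns added later modify a new dataframe rather than a view of df12)
earn_1yr = df12[(df12.num_emp_quarter > 0) & (df12.num_emp_quarter < 5)].copy()
# insert a column of ones to flag quarters with a wage record
earn_1yr.insert(11,'emp', 1)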
earn_1yr

### Adjusting earnings by annual inflation

In [None]:
def cpi_adj(year,wage):
    """ Adjust annual earnings to 2017 dollars using
        end of period CPI:
    """
    ref = 247.847
    if year == '2007':
        return wage * ref/211.445
    elif year == '2008':
        return wage * ref/211.398
    elif year == '2009':
        return wage * ref/217.347
    elif year == '2010':
        return wage * ref/220.472
    elif year == '2011':
        return wage * ref/227.223
    elif year == '2012':
        return wage * ref/229.594
    elif year == '2013':
        return wage * ref/232.957
    elif year == '2014':
        return wage * ref/236.252
    elif year == '2015':
        return wage * ref/237.761
    elif year == '2016':
        return wage * ref/242.712
    elif year == '2017':
        return wage
    else:
        # year outside the CPI table: return a missing value so that numeric steps such as .round() still work
        return np.nan
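
The adjustment itself is a single ratio: the wage times the 2017 reference CPI divided by the CPI of the wage year. The sketch below applies it to a made-up dataframe with a dictionary lookup instead of a chain of `if` statements. The CPI values for 2012 to 2014 and the reference value are copied from the function above; the names `REF_CPI_2017` and `CPI_BY_YEAR` and the toy wages are assumptions for illustration.

```python
import numpy as np
import pandas as pd

REF_CPI_2017 = 247.847  # reference value used in cpi_adj above
CPI_BY_YEAR = {'2012': 229.594, '2013': 232.957, '2014': 236.252}

# Made-up wages; '1999' is deliberately missing from the lookup
toy = pd.DataFrame({'year_ui': ['2012', '2013', '2014', '1999'],
                    'sumwages': [1000.0, 1000.0, 1000.0, 1000.0]})

# Years missing from the lookup map to NaN, so their adjusted wage is NaN too
cpi = toy['year_ui'].map(CPI_BY_YEAR)
toy['sumwages_adj'] = toy['sumwages'] * REF_CPI_2017 / cpi
print(toy)
```

Earlier years have a lower CPI, so the same nominal wage is scaled up more the further back it was earned.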

In [None]:
earn_1yr.dtypes

In [None]:
earn_1yr['sumwages_adj'] = earn_1yr[['year_ui', 'sumwages']].apply(lambda x: cpi_adj(*x), axis=1).round()
earn_1yr

In [None]:
# Design of quarterly wages table

emp_outcomes = earn_1yr.loc[:,('ssn_hash', 'otc_yq','num_emp_quarter', 'emp')].drop_duplicates()

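# keep num_emp_quarter while de-duplicating so that two quarters with identical wages are both counted
emp_outcomes_wages = earn_1yr.loc[:,('ssn_hash', 'otc_yq', 'num_emp_quarter', 'sumwages_adj')].drop_duplicates().groupby(['ssn_hash', 'otc_yq'])['sumwages_adj'].sum()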

completers_emp_outcomes = emp_outcomes.pivot_table(index=['ssn_hash', 'otc_yq'], columns='num_emp_quarter', values='emp', fill_value = 0).sort_values('otc_yq')

result = pd.concat([completers_emp_outcomes, emp_outcomes_wages], axis=1).reindex(completers_emp_outcomes.index)

result.columns = ['q1', 'q2', 'q3', 'q4', 'sumwages_adj']

result

In [None]:
# Number of completers during school year 2012-2013 that have positive earnings during all four quarters after completion
full_emp = result.loc[((result['q1']==1) & (result['q2']==1) & (result['q3']==1) & (result['q4']==1))]
full_emp.shape[0]

In [None]:
# Completers by quarter: completers during school year 2012-2013 that have positive earnings during all four quarters after completion
full_emp.reset_index().groupby(['otc_yq'])['ssn_hash'].count()

### Distribution of annual earnings after OTC course completion

In [None]:
# distribution of wages per person one year out
full_emp['sumwages_adj'].describe().round(1)

### Stable employment 

Completers who entered stable employment are those who held a job with the same employer throughout the first year (four quarters) after course completion.

In [None]:
query = '''
with ui_employer as (
                    select ssn_hash, year as year_ui, quarter as quarter_ui, employer, wages
                    from data_ohio_olda_2018.oh_ui_wage_by_employer
                    where ((year = '2012' and quarter = 4) or
                           (year = '2013') or
                           (year = '2014' and quarter in (1,2))) and
                          employer_num = 1)
select distinct a.ssn_hash, a.year as otc_year, a.quarter as otc_quarter, year_ui, quarter_ui, employer, wages
from otc_comp_12_13 as a
join ui_employer as b
on a.ssn_hash = b.ssn_hash
'''
df12 = pd.read_sql(query, conn)

In [None]:
df12.head()

In [None]:
# add new columns to compute the number of quarters between course completion and each quarter of UI wage records
df12['otc_yq'] = df12['otc_year'] + 'q' + df12['otc_quarter'].astype(str)
df12['ui_yq'] = df12['year_ui'] + 'q' + df12['quarter_ui'].astype(str)
df12['otc_q_num'] = df12['otc_yq'].map(qrt_dictionary)
df12['ui_q_num'] = df12['ui_yq'].map(qrt_dictionary)
df12['num_emp_quarter'] = df12['ui_q_num'] - df12['otc_q_num']
df12

In [None]:
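# Select only records from one quarter after completion up to four quarters after completion
# (.copy() so that the inserted column modifies a new dataframe rather than a view of df12)
employers = df12[(df12.num_emp_quarter > 0) & (df12.num_emp_quarter < 5)].copy()
employers.insert(11,'emp', 1)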
employers

In [None]:
# Selecting employer code that pays max wage by quarter
res = employers.pivot_table(index=['ssn_hash', 'otc_yq'], columns='num_emp_quarter', values='employer', fill_value = 0).sort_values('otc_yq').round()
res.columns = ['emp_max_wage_q1', 'emp_max_wage_q2', 'emp_max_wage_q3', 'emp_max_wage_q4']
res

In [None]:
# Select only completers with same employer all four quarters after completion
stable_emp = res[res.apply(lambda x: min(x)==max(x), axis=1)]
stable_emp

In [None]:
stable_emp.shape[0]

#### Stable employment as retention from Q2 to Q4 after completion

- Number of completers who are employed two and four quarters after completion 

- Number of completers employed two quarters after completion who are employed with the same employer four quarters after completion
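
Both counts can be sketched on a made-up table of employer codes. This assumes, as with the `fill_value = 0` pivot used earlier in this notebook, that a code of 0 means no wage record in that quarter; the employer codes and index labels below are invented for illustration.

```python
import pandas as pd

# Made-up employer codes two and four quarters after completion; 0 = no wage record
toy = pd.DataFrame({
    'emp_max_wage_q2': [101, 101, 0, 0],
    'emp_max_wage_q4': [101, 202, 0, 303],
}, index=['a', 'b', 'c', 'd'])

# Employed in both quarters: a nonzero employer code in q2 and in q4
employed_q2_q4 = (toy['emp_max_wage_q2'] != 0) & (toy['emp_max_wage_q4'] != 0)

# Retained: employed in both quarters and with the same employer
same_employer = employed_q2_q4 & (toy['emp_max_wage_q2'] == toy['emp_max_wage_q4'])

print(int(employed_q2_q4.sum()))
print(int(same_employer.sum()))
```

Student `c` has no wage record in either quarter, so a bare equality test on the two columns would count `c` as retained; requiring a nonzero code avoids that.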

In [None]:
# Number of students with stable employment under the first definition

# Number of completers during school year 2012-2013 that have positive earnings 
# two and four quarters after course completion
result.loc[((result['q2']==1) & (result['q4']==1))].shape[0]

In [None]:
# Number of students with stable employment under the second definition

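# Number of completers during school year 2012-2013 that have the same employer
# two and four quarters after course completion
# (a code of 0 means no wage record in that quarter, so those rows are excluded)
res.loc[(res['emp_max_wage_q2']==res['emp_max_wage_q4']) & (res['emp_max_wage_q2']!=0)].shape[0]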