<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Rayid Ghani, Frauke Kreuter, Julia Lane, Brian Kim, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Avishek Kumar, Jonathan Morgan, Ursula Kaczmarek, Benjamin Feder, Ekaterina Levitskaya, Lina Osorio-Copete, Tian Lou.

# Dataset Exploration

## Introduction

In an ideal world, we would have all of the data we want with all of the desirable properties (no missing values, no errors, standard formats, and so on). We'd also have perfect data documentation, with summary statistics and appropriate aggregate measures of everything we'd want to investigate. However, that is hardly ever true, so we have to use the datasets we do have to answer questions of interest as intelligently as possible.

In this notebook, we will discover the datasets we have on the ADRF and we will use our datasets to answer some questions of interest. 

### Learning Objectives

This notebook will give you the opportunity to spend some hands-on time with the data. 

Throughout the notebook, you will work through various techniques for using SQL and Python to explore the datasets in the ADRF and better understand what you are working with. This will form the basis of all the other types of analyses you will do in this class and is a crucial first step for any data analysis workflow. As you work through the notebook, we will have checkpoints for you to try out your own code, but you can also think about how you might apply any of the techniques and code presented to other datasets as well.

We are going to show just a portion of what you might be interested in investigating, so don't feel restricted by the questions we've decided to try to answer.

**Datasets We Will Explore In This Notebook:**
- **Ohio Higher Education Information (HEI) data**: Ohio public college student information (student enrollment, degree earned, demographic, course credits, institution).
- **Ohio Unemployment Insurance (UI) Wage data**: Ohio workers' quarterly earnings and employment.
- **Indiana Unemployment Insurance (UI) Wage data**: Indiana workers' quarterly earnings and employment. 

You will explore these datasets using both SQL and `pandas` in Python. The `sqlalchemy` Python package will give you the opportunity to interact with the database using SQL to pull data into Python. Some additional manipulations will be handled by Pandas in Python (by converting your datasets into dataframes). We've also provided a [supplementary notebook](01_2_Dataset_Exploration_supplemental.ipynb) for you that walks you through how to do these same analyses on the OTC data.

**This notebook will provide an introduction and examples for:**

- How to create new tables from the larger tables in the database (sometimes called the "analytical frame")
- How to explore different variables of interest
- How to create aggregate metrics
- How to join tables
- How to generate descriptive statistics to describe a specific cohort

### Methods

We will be using the `sqlalchemy` Python package to access tables in our class database server - PostgreSQL. 

To read the results of our queries, we will be using the `pandas` Python package, which has the ability to read tabular data from SQL queries into a pandas DataFrame object. Within `pandas`, we will use various commands:

- `fillna`
- `groupby`
- `nunique`

Within SQL, we will use various queries to:

- select data subsets
- sum over groups
- create new tables
- count distinct values of desired variables

## Python Setup

In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. Among the most famous Python packages:
- **numpy** is short for "numerical Python". `numpy` is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object, and a large suite of functions for doing numerical computing. 
- **pandas** is a Python library for data analysis built around the DataFrame object (modeled after R data frames, for those familiar with that language), which is similar to a spreadsheet but lets you do your analysis programmatically rather than through the point-and-click of Excel. It is a lynchpin of the PyData stack and is built on top of `numpy`.
- **sqlalchemy** is a Python library for interfacing with SQL databases, including the PostgreSQL database used in this class.

In [None]:
# pandas-related imports
import pandas as pd

# Numpy
import numpy as np

# database interaction imports
import sqlalchemy

__When in doubt, place the cursor near the name of a method and press shift + tab to read its documentation.__

__The `help()` function provides information on what you can do with a function.__

In [None]:
# for example
help(pd.read_sql)

## Load the Data

We can execute SQL queries using Python to get the best of both worlds. For example, Python - and `pandas` in particular - makes it much easier to calculate descriptive statistics and perform more complicated analyses with the data. Additionally, as we will see in the Data Visualization exercises, it is relatively easy to create data visualizations using Python. 

Pandas provides many ways to load data. It allows the user to read the data from a local CSV or Excel file, pull the data from a relational database, or read directly from a URL (when you have internet access). Since we are working with the PostgreSQL database `appliedda` in this course, we will demonstrate how to use pandas to read data from a relational database. For examples on reading data from a CSV file, refer to the pandas documentation [Getting Data In/Out](https://pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function to run a SQL query and pull the data into a pandas dataframe (more to come) is `pd.read_sql()`. Just like doing a SQL query from DBeaver, this function will ask for some information about the database, and what query you would like to run. Let's walk through the example below.

### Establish a Connection to the Database

The first parameter is the connection to the database. To create a connection we will use the SQLAlchemy package and tell it which database we want to connect to.

__Database Connection__

In [None]:
# to create a connection to the database, 
# we need to pass the name of the database and host of the database

host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = sqlalchemy.create_engine(connection_string)

> Note we can parameterize Python `string` objects using the built-in `.format()` method. We will use various formulations in the program notebooks (e.g., when building queries). Some examples are:
1. Empty brackets (shown above), which simply insert the variable in the string; when there is more than one set of brackets, Python will insert the variables in the order they are listed
2. Brackets with formatting, which can be used to make print statements more readable (e.g., `'text with formatted number with comma and 1-digit decimal {:,.1f}'.format(number_value)` will print `123,456.7` instead of `123456.7123401`)
3. Named brackets to use the same variables multiple times in a text block
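
As a minimal sketch of the three styles listed above, the cell below builds one string with each. The host, database, schema, and table names are placeholders made up for illustration, not real ADRF values.

```python
# Placeholder values for illustration only (not real ADRF names)
host = 'db.example.org'
db_name = 'exampledb'

# 1. Empty brackets: variables are inserted in the order they are listed
positional = "postgresql://{}/{}".format(host, db_name)

# 2. Brackets with formatting: thousands separator and one decimal place
formatted = 'total: {:,.1f}'.format(123456.7123401)

# 3. Named brackets: the same variable can be reused several times
named = '''
SELECT '{tbl}' AS table_name, count(*)
FROM {schema}.{tbl}
'''.format(schema='example_schema', tbl='example_table')

print(positional)
print(formatted)
print(named)
```

Named brackets tend to be the easiest to read once a query refers to the same schema or table more than once.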

### Formulate Data Query

This part is similar to writing a SQL query in DBeaver. Depending on the data we are interested in, we can use different queries to pull different data. In this example, we will pull in academic content from the HEI data for post-secondary education attendees in the 2013 calendar year.

__create a query as a `string` object in Python__

In [None]:
query = '''
SELECT *
FROM data_ohio_olda_2018.oh_hei_long 
WHERE file_year = '2013'
LIMIT 5
'''

> The three quotation marks surrounding the query body create a multi-line string. This is quite handy for writing SQL queries because newline characters are treated as part of the string instead of ending it.

> Note that `file_year` is the year the data was reported. In a later section, we will learn how to identify the school year a student enrolled in/graduated from an institution.

In [None]:
# Now that we have defined a variable `query`, we can call it in the code
print(query)

> Note that the `LIMIT` provides one simple way to get a "sample" of data; however, using `LIMIT` does **not provide a _random_** sample. You may get different samples of data than others using just the `LIMIT` clause, but it is just based on what is fastest for the database to return.
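
If you do need a random sample, one common approach is to shuffle the rows before applying `LIMIT`. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module and a made-up table called `demo` so that it runs without access to the class database. `ORDER BY random()` is also valid in PostgreSQL, but it has to shuffle the whole table first and can be slow on large tables.

```python
import sqlite3

# in-memory SQLite database with a made-up table (not an ADRF table)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE demo (id INTEGER)')
demo_conn.executemany('INSERT INTO demo VALUES (?)', [(i,) for i in range(100)])

# LIMIT alone: whichever rows the database happens to return first
first_rows = demo_conn.execute('SELECT id FROM demo LIMIT 5').fetchall()

# random sample: shuffle the rows, then keep 5 of them
random_rows = demo_conn.execute(
    'SELECT id FROM demo ORDER BY random() LIMIT 5'
).fetchall()

print(first_rows)
print(random_rows)
```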

### Pull Data from the Database

Now that we have the two parameters (database connection and query), we can pass them to the `pd.read_sql()` function to obtain the data.

In [None]:
# here we pass the query and the connection to the pd.read_sql() function and assign the variable `df`
# to the dataframe returned by the function
df = pd.read_sql(query, conn)

In [None]:
df.head()

<font color=red><h3> Checkpoint 1: Read in the HEI table </h3></font> 

Read in and explore a subsample of the HEI long table (table name: `oh_hei_long`). 

Try to:
1. limit the sample to 20 observations
2. look at records of people enrolled in 2012 (e.g., `enroll_yr_num`='2012')
3. count the number of people enrolled in 2012 (e.g., use `count(distinct(ssn_hash))` in your query) 

In [None]:
query= '''


'''

pd.read_sql(query, conn)

## Analysis: Using Python and SQL

### What is in the Database?

There are a few different ways to connect and explore the data in the database. 

__Schemas, Tables, and Columns in database__

Let's pull the list of schema names in the database, the list of tables in these schemas, and the list of columns in these tables.

In [None]:
# See all available schemas:
query = '''
SELECT schema_name 
FROM information_schema.schemata;
'''
pd.read_sql(query, conn)

As a reminder, in this class you have access to the following schemas: `public`, `data_ohio_olda_2018`, `il_des_kcmo`, `kcmo_lehd`, `mo_dhe`,`in_dwd`, `in_che`, and `ada_20_osu`. You only have write access to the `ada_20_osu` schema.

In [None]:
schemas = """
'public', 'data_ohio_olda_2018', 'il_des_kcmo', 'kcmo_lehd', 'mo_dhe', 'in_dwd', 'in_data_2019', 'ada_20_osu'
"""

In [None]:
# confirm our schemas exist with 
# an updated version of the previous query
query = '''
SELECT schema_name 
FROM information_schema.schemata
WHERE schema_name IN ({})
'''.format(schemas)
pd.read_sql(query, conn)

In [None]:
query = '''
SELECT schemaname, tablename
FROM pg_tables
WHERE schemaname IN ({})
'''.format(schemas)

tables = pd.read_sql(query, conn)
# print tables not in the public schema
print(tables.query("schemaname != 'public'"))

In [None]:
# list all the tables in the OLDA schema:
sorted(tables[tables["schemaname"] == 'data_ohio_olda_2018']['tablename'])

> Note the two ways shown above to subset a `pandas.DataFrame`:
1. Use the built-in `.query()` method (done in this line: `tables.query("schemaname != 'public'")`)
2. Create an array of `True` and `False` values (done in this line: `tables["schemaname"] == 'data_ohio_olda_2018'`)
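
To see that the two approaches select the same rows, here is a small sketch on a made-up stand-in for the `tables` DataFrame. The schema and table names below are invented for illustration.

```python
import pandas as pd

# made-up stand-in for the `tables` DataFrame
demo_tables = pd.DataFrame({
    'schemaname': ['public', 'schema_a', 'schema_a', 'schema_b'],
    'tablename': ['t0', 't1', 't2', 't3'],
})

# 1. .query() takes the condition as a string
via_query = demo_tables.query("schemaname == 'schema_a'")

# 2. a boolean mask: one True/False value per row, used to index the DataFrame
mask = demo_tables['schemaname'] == 'schema_a'
via_mask = demo_tables[mask]

print(mask.tolist())
print(via_query.equals(via_mask))
```

The boolean mask is often more convenient when the condition is built from Python variables, while `.query()` can be easier to read for longer conditions.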

In [None]:
# We can look at column names within tables
# here we'll set the schema and table with variables

schema = 'data_ohio_olda_2018'
tbl = 'oh_hei_long'

query = '''
SELECT * 
FROM information_schema.columns 
WHERE table_schema = '{}' AND table_name = '{}'
'''.format(schema, tbl)

# read and print results
pd.read_sql(query, conn)

## Summary Statistics

In this section, you'll start looking at aggregate statistics on the data. As you work through this section, try to ask yourself some questions such as: 
- What variables are you interested in? 
- What variables do you need to identify the sample you are interested in?
- In which table(s) are these variables available? 
- Are there any missing values in these variables?

<font color=red> <h3> __Motivating Question # 1__: </h3> </font>

Assume you are eventually interested in looking at labor market outcomes of Ohio community college students who received their degrees during the 2012-13 academic year. You will need to combine education and employment data (in this case, Ohio HEI and UI wage records). First, though, you should take some steps to better understand your cohort, so you'll first focus on these questions:

**How many students got their degrees from Ohio community colleges during the 2012-13 academic year? How does the number vary by the regional location of the college and by degree field?**

*JobsOhio Region:* The state of Ohio divides the state into 6 regions for economic development purposes: Southeast, Southwest, Central, West, Northwest, and Northeast Ohio.

*2012-13 academic year:* According to the Ohio Department of Higher Education, the academic year is defined as the Summer and Autumn semesters of 2012 and the Winter and Spring semesters of 2013. To look at graduates' information, you can use variables that start with `degcert_`. In this case, you will need to limit the sample by using the semester (`degcert_term_earned`) and the year a person received their degree (`degcert_yr_earned`).

`degcert_term_earned`:
1= Autumn,
2= Winter,
3= Spring,
4= Summer
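
The rule above can be sketched as a small Python function. The helper name `academic_year_start` is hypothetical and only illustrates the logic that the SQL `where` clause in the next cell encodes; it is not part of the notebook's workflow.

```python
# term codes as documented for degcert_term_earned
AUTUMN, WINTER, SPRING, SUMMER = 1, 2, 3, 4

def academic_year_start(year, term):
    """Return the calendar year in which the academic year begins.

    Summer and Autumn open an academic year; Winter and Spring belong to
    the academic year that began in the previous calendar year.
    """
    if term in (SUMMER, AUTUMN):
        return year
    return year - 1

# all four terms of the 2012-13 academic year map to 2012
terms_2012_13 = [(2012, SUMMER), (2012, AUTUMN), (2013, WINTER), (2013, SPRING)]
print([academic_year_start(y, t) for y, t in terms_2012_13])
```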

In [None]:
#find all higher education graduates in school year 2012-13
qry = '''
select *
from data_ohio_olda_2018.oh_hei_long
where (degcert_yr_earned = '2012' and (degcert_term_earned = '4' or degcert_term_earned = '1')) or 
    (degcert_yr_earned = '2013' and (degcert_term_earned = '2' or degcert_term_earned = '3'))
'''
df = pd.read_sql(qry, conn)

In [None]:
df.head()

Since we will be using this table to further subset to community college graduates, we will save the result of the above SQL query in a temporary table. Temporary tables are similar to saved tables/dataframes in Python, as they store a table that we can use for future reference, but a temporary table exists only for the current database session, so it needs to be re-created every time you re-open or redo an analysis.

In general, it is best practice to test out your queries on a small subset of a table (i.e. with a limit) before creating the temporary table, otherwise you may have to delete and recreate your temporary tables if there were any issues during your initial creation.
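
The "preview first, then create" workflow might look like the sketch below. It is an illustration only: it uses Python's built-in `sqlite3` module with a made-up `records` table rather than the class PostgreSQL database, although the `CREATE TEMP TABLE ... AS` statement has the same shape in both.

```python
import sqlite3

# in-memory SQLite database with a made-up table (not an ADRF table)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE records (person TEXT, yr INTEGER)')
demo_conn.executemany('INSERT INTO records VALUES (?, ?)',
                      [('a', 2012), ('b', 2012), ('c', 2013)])

subset_sql = 'SELECT * FROM records WHERE yr = 2012'

# 1. test the query on a small subset first
preview = demo_conn.execute(subset_sql + ' LIMIT 1').fetchall()

# 2. once it looks right, store the full result in a temporary table;
#    dropping it first makes the cell safe to re-run
demo_conn.execute('DROP TABLE IF EXISTS recs_2012')
demo_conn.execute('CREATE TEMP TABLE recs_2012 AS ' + subset_sql)

n_rows = demo_conn.execute('SELECT count(*) FROM recs_2012').fetchone()[0]
print(preview, n_rows)
```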

> It is possible to perform the following merges in Python, but due to the relative sizes of the tables in question, you may run into memory issues using Python.

In [None]:
# store query to find 2012-13 academic year graduates in a temporary table
# use conn.execute instead of pd.read_sql because there is no output
qry = '''
create temp table all_grads as
select *
from data_ohio_olda_2018.oh_hei_long
where (degcert_yr_earned = '2012' and (degcert_term_earned = '4' or degcert_term_earned = '1')) or 
    (degcert_yr_earned = '2013' and (degcert_term_earned = '2' or degcert_term_earned = '3'))
'''
conn.execute(qry)

In [None]:
qry = '''
select count(*) from all_grads
'''
pd.read_sql(qry, conn)

Now, you need to identify the community college graduates in our sample. You can find them by looking at the type of institutions/campuses they graduated from. There are five types of institutions/campuses in the HEI data: University Main Campus (UM), University Branch Campus (UB), Community College (CC), State College (SC), and Technical College (TC). You need to limit your sample to students who graduated from community colleges (**CC**), state colleges (**SC**), and technical colleges (**TC**).

>  Note that in order to store information more efficiently, our main table, `oh_hei_long`, only has campus numbers (`degcert_campus`). If you want to know more details about the campus, such as campus type and campus location, you need to join the main table with the lookup table `oh_hei_campus_county_lkp`.

>  In the schema `data_ohio_olda_2018`, we added suffix `_lkp` to all the lookup tables. You will need to join your main tables with lookup tables to find more detailed information about the students (such as demographics) as well as the institutions/campuses (such as county, region, and campus type).

In [None]:
# see oh_hei_campus_county_lkp table
qry = '''
select *
from data_ohio_olda_2018.oh_hei_campus_county_lkp
limit 5
'''
pd.read_sql(qry, conn)

In [None]:
# now create temp table because this is our cohort of 2012-13 community college graduates
qry = '''
create temp table cc_grads as
select a.*, lkp.*
from all_grads a
left join data_ohio_olda_2018.oh_hei_campus_county_lkp lkp
on a.degcert_campus = lkp.campus_num
where lkp.campus_type_code in ('TC', 'SC', 'CC')
'''
conn.execute(qry)

In [None]:
#read cc_grads into python
qry = '''
select * from cc_grads
'''
df=pd.read_sql(qry, conn)

In [None]:
#Let's take a look at the dataframe
df.head()

For the purposes of understanding your cohort, do you suspect that there might be some individuals with multiple records? Let's see if that is the case.

In [None]:
#Check the total number of records in our 2012-13 community college graduate dataframe
df['ssn_hash'].count()

From here, you just need to find the unique count of `ssn_hash` values in `df`.

In [None]:
#Check the number of unique ssn_hash values (person identifiers) in our dataframe
df['ssn_hash'].nunique()

You've found the answer to the first part of this motivating question. Next, you will work through breaking down the data by JobsOhio region (`jobsohioregion`). You can use the `.groupby()` method in `pandas`. However, if you want a breakdown by demographic variables, such as students' county of residence, you cannot use your current DataFrame/temporary table since it does not contain much demographic information. You would need to go through one more step: joining `cc_grads` to the `oh_hei_demo` table in the `data_ohio_olda_2018` schema.

In [None]:
# find number of graduates by region
df.groupby(['jobsohioregion'])['ssn_hash'].nunique()

By default, the `.groupby()` method in `pandas` drops rows whose grouping value is missing. Thus, to make sure you don't forget to include missing values in your answer to this motivating question, you can use `fillna()` to fill missing values with something you know doesn't exist for the column(s) in question. Here, you can use -1.
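
The effect is easy to see on a small made-up DataFrame. In this sketch the regions are invented numeric codes and person `c` has no region recorded; the column names are placeholders, not the ADRF column names.

```python
import numpy as np
import pandas as pd

# made-up data: person 'c' has no region recorded
demo = pd.DataFrame({
    'person': ['a', 'b', 'c', 'd'],
    'region': [1, 1, np.nan, 2],
})

# rows with a missing region are left out of every group
without_fill = demo.groupby('region')['person'].nunique()

# after fillna(-1), the missing values form a group of their own
filled = demo.assign(region=demo['region'].fillna(-1))
with_fill = filled.groupby('region')['person'].nunique()

print(without_fill.sum())   # people counted when missing values are dropped
print(with_fill.sum())      # people counted once missing values form a group
```

Comparing the two totals against the overall number of unique people is a quick way to notice when a breakdown is silently missing part of the cohort.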

In [None]:
#Check if there are missing values in jobsohioregion
df['jobsohioregion'].isna().sum()

In [None]:
# in case there are any missing values, you can use fillna()
# (fillna() returns a new Series; assign it back to the column to keep the change)
df['jobsohioregion'].fillna(-1)

Another common practice is to look at the number of graduates by field. In Ohio's HEI data, you can use `degcert_subject` to find what field a student's degree focuses on. However, the subject codes in the data have six digits, making it hard to group related subjects. To avoid this issue, you can convert them to 2-digit codes so that you have fewer subject groups, making it easier to perform analyses.

The following queries will help you get the 2-digit subject code. In these queries, you will try out an alternative to the temporary table: the `with` clause in SQL, which defines a named subquery that exists only for the duration of a single query. You can also get the description of the subject code by joining `cc_grads` with the lookup table `oh_subject_codes_lkp`.
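
Here is a minimal sketch of the same pattern on made-up tables, using Python's built-in `sqlite3` module so it runs without the class database. The table names, codes, and descriptions are invented. SQLite has no `left()` function or `::varchar` cast, so the sketch uses `substr()` and `CAST(... AS TEXT)` in their place.

```python
import sqlite3

# in-memory SQLite database with made-up tables (not ADRF tables)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE grads (person TEXT, subject TEXT)')
demo_conn.executemany('INSERT INTO grads VALUES (?, ?)',
                      [('a', '513801'), ('b', '520201'), ('c', '513901')])
demo_conn.execute('CREATE TABLE subject_lkp (code INTEGER, description TEXT)')
demo_conn.executemany('INSERT INTO subject_lkp VALUES (?, ?)',
                      [(51, 'Field A'), (52, 'Field B')])

qry = '''
WITH subject AS (
    SELECT person, substr(subject, 1, 2) AS code FROM grads
)
SELECT subject.person, lkp.code, lkp.description
FROM subject
JOIN subject_lkp lkp
ON subject.code = CAST(lkp.code AS TEXT)
ORDER BY subject.person
'''
rows = demo_conn.execute(qry).fetchall()
print(rows)
```

Unlike a temporary table, the `subject` subquery cannot be referenced by a later query, which makes `with` a good fit for intermediate steps you only need once.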

In [None]:
df[['ssn_hash','degcert_subject']].head()

In [None]:
# first look at oh_subject_codes_lkp
qry = '''
select *
from data_ohio_olda_2018.oh_subject_codes_lkp
limit 5
'''
pd.read_sql(qry, conn)

In [None]:
# with creates a mini table
# In the last line of the query, we use ::varchar to convert 'subject_code_2010' in the lkp table from integer
# to text. This is because when we join tables, the variable types should be the same.
qry= '''
with subject as (select ssn_hash, left(degcert_subject,2) as code from cc_grads)
select subject.ssn_hash, lkp.subject_code_2010, lkp.subject_desc 
from subject
join data_ohio_olda_2018.oh_subject_codes_lkp lkp
on subject.code=lkp.subject_code_2010::varchar; 
'''

subject_df=pd.read_sql(qry,conn)

In [None]:
subject_df.head()

In [None]:
# Are there any missing values in subject_code?
subject_df['subject_code_2010'].isna().sum()

In [None]:
# don't need to worry about using fillna() and can sort using sort_values()
subject_df.groupby(['subject_code_2010', 'subject_desc'])['ssn_hash'].nunique().sort_values(ascending=False)

<font color=red><h3> Checkpoint 2: Explore school year 2012-13 community college graduates </h3></font> 

How does the number vary by region and by degree field? Use `jobsohioregion` and the 2-digit subject code.

For example, let's take a look at community college students who graduated in the Northeast region. Which degree fields have the most community college graduates?

Are there any NAs? If so, how many?

In [None]:
query= '''

'''

practice_df=pd.read_sql(query,conn)

Now that you have completed Motivating Question #1, you will move onto Motivating Question #2, where, as promised, you will begin to explore wage records data.

<font color=red><h3>__Motivating Question # 2__:</h3></font>
**How many 2012-13 Ohio community college graduates are employed in Ohio one year after graduation? How many of them have stable employment? How does the number vary by industry?**

In this example, we will examine:
- How many people have positive earnings during the first year after graduation?
- What is the distribution of first-year earnings among graduates who have positive earnings during the first year after graduation?
- How many people achieved stable employment within the first year after graduation? 
    - **Stable employment metric 1**: have positive earnings during ALL four quarters after graduation
    - **Stable employment metric 2**: work for the same employer during the second quarter and the fourth quarter after graduation
- How does the number of people who have stable employment vary by industry?

In this example, you will join the table `cc_grads` you created in the previous question with Ohio UI wage data. In the ADRF, there are two UI wage tables for Ohio:
- `oh_ui_wage_by_employer`: each row shows a worker's quarterly earnings, weeks worked, and industry of one employer. Note that some people may have more than one employer.
- `oh_ui_wage_by_quarter`: each row shows a worker's total quarterly earnings, maximum weeks worked during a quarter, and total number of employers during a quarter. 

>*Which table should you use when you look at post-graduation labor market outcomes?*
 It depends on how you define "employed" and the employment metrics you use.



Let's take a quick look at the Unemployment Insurance wage records tables so you can get an idea of the tables' structures first.

In [None]:
qry = '''
select * 
from data_ohio_olda_2018.oh_ui_wage_by_employer
limit 5
'''

pd.read_sql(qry, conn)

In [None]:
qry = '''
select * 
from data_ohio_olda_2018.oh_ui_wage_by_quarter
limit 5
'''
pd.read_sql(qry, conn)

> Note: __Large tables__ can take a long time to process on shared databases. For example, the Ohio UI wage data has more than 4.7M records per quarter.

Recall that the number of unique `ssn_hash` values is less than the number of total records in `cc_grads`, indicating that some people have more than one record. Depending on the type of analysis you want to conduct, you need to think about whether you should keep all the records or only keep one record for each person. Here, since you will be exploring graduates' employment outcomes one year later, it would make the most sense to have exactly one graduation record for each person.
    
So which record should you keep? Again, it depends on the results you want to look at and the assumptions you make about community college students. One may argue that the first degree is more important because it improves a person's educational attainment and makes a person more advantaged in the labor market. Others may argue that the most recent degree is more important because it updates a person's skillset and may have a greater impact on their job choices, such as occupation. Here, you will look at a person's most recent record.

Recall that `cc_grads` is a table that includes only records of 2012-13 community college graduates, with some `ssn_hash` values appearing multiple times. You will have to do a tiny bit of maneuvering to find the most recent graduation record within `cc_grads`. Here, you can assign each record the first day of the calendar quarter that corresponds to its term (e.g., Summer graduates are assigned a graduation date of July 1) so you can easily find the most recent graduation date for each `ssn_hash`.
> Reminder: This step could be broken into two steps using a temporary table.
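
For comparison, the same "one most recent record per person" logic can be sketched in `pandas` on a small made-up DataFrame. The column names `person`, `yr`, and `term` are placeholders, and sorting followed by `drop_duplicates()` stands in for SQL's `distinct on`. It illustrates the idea and is not the method used to build `cc_grads_recent`.

```python
import pandas as pd

# made-up graduation records: person 'a' has two records
demo = pd.DataFrame({
    'person': ['a', 'a', 'b'],
    'yr':     [2012, 2013, 2012],
    'term':   [4, 3, 1],          # 1=Autumn, 2=Winter, 3=Spring, 4=Summer
})

# first month of the quarter that matches each term
term_month = {4: 7, 1: 10, 2: 1, 3: 4}
demo['deg_date'] = pd.to_datetime(pd.DataFrame({
    'year': demo['yr'],
    'month': demo['term'].map(term_month),
    'day': 1,
}))

# newest record first within each person, then keep one row per person
most_recent = (demo.sort_values(['person', 'deg_date'], ascending=[True, False])
                   .drop_duplicates(subset='person', keep='first'))
print(most_recent)
```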

In [None]:
# Find most recent graduation within the span of 2012-13 academic year
qry = '''
create temp table cc_grads_recent as
select distinct on (ssn_hash) *
from (
SELECT *, 
    CASE WHEN degcert_term_earned = 4 THEN
        format('%%s-%%s-01', degcert_yr_earned, 7)::date 
    WHEN degcert_term_earned = 1 THEN
        format('%%s-%%s-01', degcert_yr_earned, 10)::date 
    WHEN degcert_term_earned = 2 THEN
        format('%%s-%%s-01', degcert_yr_earned, 1)::date 
    WHEN degcert_term_earned = 3 THEN
        format('%%s-%%s-01', degcert_yr_earned, 4)::date 
    END AS deg_date
    from cc_grads
) q
order by ssn_hash, deg_date DESC
'''
conn.execute(qry)

In [None]:
cc_recent_df=pd.read_sql('select * from cc_grads_recent',conn)

#Check if we only have one record for each person
#Whether the number of records is the same as the number of unique ssn_hash
if cc_recent_df['ssn_hash'].count()==cc_recent_df['ssn_hash'].nunique():
    print('Each person has one record.')
else:
    print('Some people have more than one record.')

However, because the `oh_ui_wage_by_quarter` table in the `data_ohio_olda_2018` schema is so big, you would run into memory issues if you tried to join `cc_grads_recent` directly with this table. To combat this issue and minimize the time you will have to wait for the following joins to run, you can take a subset of `oh_ui_wage_by_quarter` that contains employment data within the time frame of the motivating question. The `small_ohio_ui` table in the `ada_20_osu` schema contains wage data from 2012, 2013, and 2014. We also used the quarter/year combination to create a column, `job_date`, which is the first day of the quarter. This will make it easier to join `small_ohio_ui` to `cc_grads_recent`.

The following code was used to create `small_ohio_ui`:

    create table ada_20_osu.small_ohio_ui as
    select *, format('%%s-%%s-01', year, quarter*3-2)::date job_date 
    from data_ohio_olda_2018.oh_ui_wage_by_quarter
    where year in ('2012','2013','2014')

In [None]:
qry = '''
select * from ada_20_osu.small_ohio_ui limit 5
'''
pd.read_sql(qry, conn)

**How do we want to calculate earnings during the first year after graduation for 2012-13 graduates?**
```
   Graduation     Earnings during the first year after graduation
   
    2012_Q3        $2012_Q4+ $2013_Q1+ $2013_Q2+ $2013_Q3
   (summer)
   
    2012_Q4        $2013_Q1+ $2013_Q2+ $2013_Q3+ $2013_Q4
   (autumn)
   
    2013_Q1        $2013_Q2+ $2013_Q3+ $2013_Q4+ $2014_Q1
   (winter)
   
    2013_Q2        $2013_Q3+ $2013_Q4+ $2014_Q1+ $2014_Q2
   (spring)

```

To find exactly one year of employment history for every graduate, first you can join `cc_grads_recent` to `small_ohio_ui`, which would have a row for each community college graduate in the 2012-13 academic year that had a job in Ohio in a specific quarter from 2012-14. From there, you can use SQL's understanding of time intervals to subset to jobs for each graduate within one year post-graduation in the `where` clause.
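
The quarter arithmetic behind the diagram above can be sketched in plain Python. The helper names `quarter_start` and `next_four_quarters` are hypothetical; `quarter_start` mirrors the `quarter*3-2` month formula used to build `job_date`.

```python
from datetime import date

def quarter_start(year, quarter):
    """First day of a quarter, mirroring the quarter*3-2 month formula."""
    return date(year, quarter * 3 - 2, 1)

def next_four_quarters(year, quarter):
    """Return the (year, quarter) pairs for the four quarters after graduation."""
    result = []
    for _ in range(4):
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
        result.append((year, quarter))
    return result

# a Summer 2012 graduate (2012 Q3) is followed through 2012 Q4 - 2013 Q3
print(quarter_start(2012, 3))
print(next_four_quarters(2012, 3))
```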

Again, due to the size of the join, you will find yourself waiting around if you run the code yourself. Thus, we've already created the table, which is in the `ada_20_osu` schema and is titled `cohort_oh_jobs`. The code is available for your viewing pleasure below.

    create table ada_20_osu.cohort_oh_jobs as
    select a.ssn_hash, a.deg_date, b.job_date, b.sumwages, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join ada_20_osu.small_ohio_ui b
    on a.ssn_hash = b.ssn_hash
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

In [None]:
# subset to jobs within 1 year *after* graduation

qry = '''
select * from ada_20_osu.cohort_oh_jobs
'''

df_jobs = pd.read_sql(qry, conn)

In [None]:
df_jobs.head()

In [None]:
# confirm people don't have more than four quarters worth of earnings
# If a person doesn't show in Ohio UI data, we will check his/her employment in Indiana UI data
df_jobs.groupby(['ssn_hash']).count()['sumwages'].unique()

In [None]:
# how many people had wages for at least one quarter
df_jobs['ssn_hash'].nunique()

In [None]:
# Percentage of people who have positive earnings during the first year after graduation
df_jobs['ssn_hash'].nunique()/df['ssn_hash'].nunique()

In [None]:
# See distribution of wages per person one year out
df_jobs.groupby(['ssn_hash'])['sumwages'].agg('sum').describe()

**Stable Employment Metric 1**: had positive earnings during ALL four quarters after graduation. 
```
           Quarters after graduation
                  
           Q1     Q2     Q3      Q4
Earning    $X     $X     $X      $X       X>0
    
```

**Stable Employment Metric 2**: worked for the same employer during the second quarter and the fourth quarter after graduation.

```
           Quarters after graduation
                  
           Q1     Q2     Q3      Q4
Earning           $X             $X       X>0

Employer         1234         1234

```
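
On a small made-up table, the two metrics could be computed as in the sketch below. The column names and values are invented for illustration, and `quarter` here means "quarters after graduation" rather than a calendar quarter. Unlike a simple row count, this version filters to positive wages explicitly.

```python
import pandas as pd

# made-up wage records: one row per person and quarter after graduation
demo = pd.DataFrame({
    'person':   ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'quarter':  [1, 2, 3, 4, 2, 4, 1, 2, 3, 4],
    'employer': [10, 10, 10, 10, 20, 20, 30, 30, 31, 31],
    'wages':    [500, 800, 900, 900, 400, 700, 300, 0, 600, 650],
})
positive = demo[demo['wages'] > 0]

# Metric 1: positive earnings in all four quarters
quarters_worked = positive.groupby('person')['quarter'].nunique()
metric_1 = set(quarters_worked[quarters_worked == 4].index)

# Metric 2: same employer in the second and fourth quarters
q2_q4 = positive[positive['quarter'].isin([2, 4])]
per_person = q2_q4.groupby('person').agg(
    n_quarters=('quarter', 'nunique'),
    n_employers=('employer', 'nunique'),
)
metric_2 = set(per_person[(per_person['n_quarters'] == 2) &
                          (per_person['n_employers'] == 1)].index)

print(metric_1, metric_2)
```

In this toy data, person `b` meets metric 2 but not metric 1, which shows that the two definitions can classify the same person differently.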

In [None]:
#Stable employment metric 1:
# positive earnings during all four quarters
sum(df_jobs.groupby(['ssn_hash']).count()['sumwages'] == 4)

In [None]:
# Percentage of people who were employed during all four quarters after graduation
sum(df_jobs.groupby(['ssn_hash']).count()['sumwages'] == 4)/df['ssn_hash'].nunique()

To find the breakdown of jobs by industry, you need to use the other Ohio wage records table, `oh_ui_wage_by_employer`, since this table contains NAICS codes associated with each employer. You can use a similar process as above to match `cc_grads_recent` to `oh_ui_wage_by_employer`. The table, `cohort_oh_jobs_emp`, has already been created for you using the code below.
> The additional clause `employer_num = 1` is used to find the industry of the individual's primary employer.

The following code was used to create `small_ohio_ui_emp`:

    create table ada_20_osu.small_ohio_ui_emp as
    select *, format('%%s-%%s-01', year, quarter*3-2)::date job_date 
    from data_ohio_olda_2018.oh_ui_wage_by_employer
    where year in ('2012','2013','2014') and employer_num = 1

This table was then joined to `cc_grads_recent` to create `cohort_oh_jobs_emp`:

    create table ada_20_osu.cohort_oh_jobs_emp as
    select a.ssn_hash, a.deg_date, b.job_date, b.sumwages, b.naics_3_digit, b.employer, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join ada_20_osu.small_ohio_ui_emp b
    on a.ssn_hash = b.ssn_hash
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

In [None]:
qry = '''
select *
from ada_20_osu.cohort_oh_jobs_emp
'''
emp_df = pd.read_sql(qry, conn)

emp_df.head()

In [None]:
#Get the ssn_hash of people who have four quarters of records
# count quarters using job_date, a column selected when cohort_oh_jobs_emp was created
ssn_4q_df=emp_df.groupby(['ssn_hash'])['job_date'].agg(['count']).reset_index()
ssn_4q_df=ssn_4q_df[ssn_4q_df['count']==4]

#Merge this with emp_df to get industry code
emp_4q_df=ssn_4q_df.merge(emp_df,left_on='ssn_hash',right_on='ssn_hash')

#Keep the first quarter records only
emp_4q_df=emp_4q_df[emp_4q_df['time_after_grad']<=92]
emp_4q_df.shape

In [None]:
emp_4q_df.groupby(['naics_3_digit'])['ssn_hash'].count().sort_values(ascending=False)

In [None]:
# find top 10 industries
sort_ind = emp_4q_df.groupby(['naics_3_digit'])['ssn_hash'].count().sort_values(ascending=False)
sort_ind.iloc[0:10]

There's a lookup table, `oh_naics3_codes_lkp`, that you can join to if you want text descriptions of the 3-digit NAICS codes. We have looked up the top 10 industries with the most graduates who have stable employment:
- `622`: Hospitals
- `722`: Food services and drinking places
- `623`: Nursing and residential care facilities
- `621`: Ambulatory health care services
- `561`: Administrative and support services.
- `611`: Education services
- `624`: Social assistance
- `541`: Professional, scientific, and technical services
- `452`: General merchandise stores
- `445`: Food and beverage stores
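
As a rough sketch of how such a lookup join could work in pandas, the example below attaches text labels to industry counts with a left merge. The lookup dataframe here is made up, and its column names (`naics_3_digit`, `naics_title`) are assumptions; check the actual columns of `oh_naics3_codes_lkp` before adapting this.

```python
import pandas as pd

# Made-up industry counts (not ADRF results)
counts = pd.DataFrame({
    'naics_3_digit': ['622', '722', '999'],
    'n_grads': [3, 2, 1],
})

# Hypothetical lookup table; the real column names may differ
lookup = pd.DataFrame({
    'naics_3_digit': ['622', '722'],
    'naics_title': ['Hospitals', 'Food services and drinking places'],
})

# A left merge keeps every industry count, even when a code has no label
labeled = counts.merge(lookup, on='naics_3_digit', how='left')
print(labeled)
```

Using `how='left'` means codes missing from the lookup show up with a missing title instead of silently disappearing from the counts.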

Lastly, you will analyze stable employment by seeing if a member of this cohort has the same primary employer in the second and fourth quarters post-graduation. You can make use of the `time_after_grad` column you calculated in `cohort_oh_jobs_emp` by finding all jobs that are 2 and 4 quarters after graduation (approximately 182 and 365 days after graduation).

In [None]:
# what are possible values of time_after_grad
emp_df['time_after_grad'].sort_values().unique()

In [None]:
# get all jobs 2 and 4 quarters after graduation
qry = '''
select ssn_hash, deg_date, job_date, employer, naics_3_digit
from ada_20_osu.cohort_oh_jobs_emp
where time_after_grad between 180 and 185 or time_after_grad = 365
'''
stable_emp = pd.read_sql(qry, conn)

In [None]:
stable_emp.head()

In [None]:
# find the number of people who had only one employer and showed up in stable_emp twice
sum((stable_emp.groupby(['ssn_hash'])['employer'].nunique() == 1) & (stable_emp.groupby(['ssn_hash']).count()['employer'] == 2))
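
The one-line count above packs two conditions together. The sketch below spells them out on a made-up dataframe, computing both per-person statistics in a single `groupby`. The data and the `n_employers` and `n_records` names are invented for illustration.

```python
import pandas as pd

# Made-up records for the 2nd and 4th quarters after graduation
toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'b', 'b', 'c'],
    'job_date': ['2013-10-01', '2014-04-01', '2013-10-01', '2014-04-01', '2013-10-01'],
    'employer': ['e1', 'e1', 'e1', 'e2', 'e3'],
})

# Per person: number of distinct employers and number of records
per_person = toy.groupby('ssn_hash')['employer'].agg(
    n_employers='nunique', n_records='count'
)

# Stable = present in both quarters and with the same employer in each
is_stable = (per_person['n_employers'] == 1) & (per_person['n_records'] == 2)
n_stable = int(is_stable.sum())
print(per_person)
print(n_stable)
```

Person "b" changed employers and person "c" appears in only one of the two quarters, so only "a" counts as stably employed under this metric.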

In [None]:
#percentage of people who have stable employment based on metric 2
sum((stable_emp.groupby(['ssn_hash'])['employer'].nunique() == 1) & (stable_emp.groupby(['ssn_hash']).count()['employer'] == 2))/df['ssn_hash'].nunique()

In [None]:
#Get the dataframe of people who worked for the same employer during the 2nd and the 4th quarter after graduation
#size() counts the rows in each group; reset_index(name='count') stores that count in a column called 'count'
stable_df=stable_emp.groupby(['ssn_hash','employer','naics_3_digit']).size().reset_index(name='count')
stable_df=stable_df[stable_df['count']==2]

#breakdown the number by industry
sort_ind2=stable_df.groupby(['naics_3_digit'])['ssn_hash'].count().sort_values(ascending=False)
sort_ind2.iloc[0:10]

In [None]:
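#Compare the top industries under the two stable employment metrics
#keys= labels the two count columns, which would otherwise both be named 'ssn_hash'
compare_df=pd.concat([sort_ind.iloc[0:10],sort_ind2.iloc[0:10]],axis=1,keys=['four_quarters','same_employer']).reset_index()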
compare_df

<font color=red><h3> Checkpoint 3: Explore additional earning metrics </h3></font> 

How many people have positive earnings and earn more than $1,000 during all four quarters after graduation?

Hint:
1. Get a subsample of table `df_jobs` by restricting `sumwages`.
2. Count how many people have 4 records.

Afterwards, discuss with your group what additional metrics you could use to measure a person's labor market outcomes.

<font color=red><h3>__Motivating Question #3__:</h3></font>
**How many 2012-13 Ohio community college graduates found jobs in Indiana one year after graduation? How many of them have stable employment?**

In the ADRF, we have education and employment data from other states. This gives us the opportunity to examine Ohio graduates' employment across states.

In this example, we will focus on the flow of Ohio community college graduates to Indiana. We will use the table `wage_by_employer` in the `in_dwd` schema, which contains both quarterly employment and employer information. As in the analysis using Ohio wage records, we've already created smaller versions of the Indiana wage records table, `small_indiana_ui` and `small_indiana_ui_emp`, in the `ada_20_osu` schema. We've also created the permanent tables `cohort_in_jobs` and `cohort_in_jobs_emp` by joining each graduate's most recent graduation record with the Indiana wage records; the code used is shown below.

The analysis will answer the same questions as in Motivating Question #2, while also adding an analysis of cross-state employment flow.

In [None]:
qry = '''
select * from in_dwd.wage_by_employer limit 5
'''
pd.read_sql(qry, conn)

The following code was used to create `small_indiana_ui`:

    create table ada_20_osu.small_indiana_ui as
    select *, format('%%s-%%s-01', year, quarter*3-2)::date job_date 
    from in_dwd.wage_by_employer
    where year in (2012,2013,2014)

From there, the following code was used to create `cohort_in_jobs`:

    create table ada_20_osu.cohort_in_jobs as
    select a.ssn_hash, a.deg_date, b.job_date, b.wages, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join ada_20_osu.small_indiana_ui b
    on a.ssn_hash = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date
    
> Note: An individual may appear more than once per quarter because this table contains all jobs that person worked in the quarter.

In [None]:
# see cohort_in_jobs
qry = '''
select * from ada_20_osu.cohort_in_jobs
'''
df_in_jobs = pd.read_sql(qry, conn)

In [None]:
df_in_jobs.head()

In [None]:
# how many people had wages for at least one quarter
df_in_jobs['ssn_hash'].nunique()

In [None]:
#Stable employment metric 1:
# positive earnings during all four quarters
# first need to find aggregate earnings per quarter
jobs_agg = df_in_jobs.groupby(['ssn_hash', 'job_date'])['wages'].agg('sum').reset_index()

In [None]:
# now you can find the number of people with earnings in all four quarters
sum(jobs_agg.groupby(['ssn_hash']).count()['wages'] == 4)
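
The two cells above aggregate before counting because a person can hold several jobs in one quarter. The sketch below shows, on made-up data, how counting raw rows would overstate the number of quarters worked, and how summing by quarter first avoids that. All values are invented for illustration.

```python
import pandas as pd

# Made-up wage records: person "a" holds two jobs in the first quarter
toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'job_date': ['2013-07-01', '2013-07-01', '2013-10-01', '2014-01-01',
                 '2014-04-01', '2013-07-01', '2013-07-01'],
    'wages': [1000, 500, 1200, 1300, 1400, 800, 200],
})

# Counting raw rows mixes up "number of jobs" and "number of quarters"
naive = toy.groupby('ssn_hash')['wages'].count()

# Sum wages per person and quarter first, keeping only positive totals
quarterly = toy.groupby(['ssn_hash', 'job_date'], as_index=False)['wages'].sum()
quarterly = quarterly[quarterly['wages'] > 0]

# Then count distinct quarters per person
quarters_worked = quarterly.groupby('ssn_hash')['job_date'].nunique()
n_all_four = int((quarters_worked == 4).sum())
print(naive)
print(quarters_worked)
```

Person "a" has five rows but four quarters of earnings, while person "b" has two rows in a single quarter.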

In [None]:
# See distribution of wages per person one year out
jobs_agg.groupby(['ssn_hash'])['wages'].agg('sum').describe()

Similar to the analysis you did to answer Motivating Question #2, you will use `small_indiana_ui_emp` and `cohort_in_jobs_emp` to identify those who experienced stable employment, as defined in this notebook. To find the primary employer in Indiana, though, you need to find, in each quarter, the employer from which the individual had the highest earnings.

The following code was used to create `small_indiana_ui_emp`:

    create table ada_20_osu.small_indiana_ui_emp as
    select distinct on (ssn, year, quarter) *, format('%%s-%%s-01', year, quarter*3-2)::date job_date 
    from in_dwd.wage_by_employer 
    where year in (2012, 2013, 2014)
    order by ssn, year, quarter, wages desc

This table was then joined to `cc_grads_recent` to create `cohort_in_jobs_emp`:

    create table ada_20_osu.cohort_in_jobs_emp as
    select a.ssn_hash, a.deg_date, b.job_date, b.wages, b.naics_3_digit, b.fein, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join ada_20_osu.small_indiana_ui_emp b
    on a.ssn_hash = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date
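
The `distinct on (ssn, year, quarter) ... order by ..., wages desc` pattern used to build `small_indiana_ui_emp` keeps one row per person and quarter, namely the one with the highest wages. A rough pandas equivalent, shown here on made-up data, is to sort and then drop duplicates. This is a sketch of the idea, not the code used to build the table.

```python
import pandas as pd

# Made-up wage records: person "a" has two employers in 2013 Q3
toy = pd.DataFrame({
    'ssn': ['a', 'a', 'a', 'b'],
    'year': [2013, 2013, 2013, 2013],
    'quarter': [3, 3, 4, 3],
    'fein': ['f1', 'f2', 'f1', 'f3'],
    'wages': [500, 1500, 900, 700],
})

# Sort so the highest-paying job comes first within each person and quarter,
# then keep only that first row per (ssn, year, quarter)
primary = (
    toy.sort_values(['ssn', 'year', 'quarter', 'wages'],
                    ascending=[True, True, True, False])
       .drop_duplicates(subset=['ssn', 'year', 'quarter'], keep='first')
       .reset_index(drop=True)
)
print(primary)
```

For person "a" in 2013 Q3, employer `f2` is kept because it paid more than `f1` in that quarter.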

In [None]:
qry = '''
select * from ada_20_osu.cohort_in_jobs_emp
'''
emp_df_in = pd.read_sql(qry, conn)

In [None]:
# find top 10 industries
sort_ind = emp_df_in.groupby(['naics_3_digit'])['ssn_hash'].count().sort_values(ascending=False)
sort_ind.iloc[0:10]

In [None]:
# get all jobs 2 and 4 quarters after graduation
qry = '''
select ssn_hash, deg_date, job_date, fein, naics_3_digit
from ada_20_osu.cohort_in_jobs_emp
where time_after_grad between 180 and 185 or time_after_grad = 365
'''
stable_emp_in = pd.read_sql(qry, conn)

In [None]:
# find the number of people who had only one employer and showed up in stable_emp_in twice
sum((stable_emp_in.groupby(['ssn_hash'])['fein'].nunique() == 1) & (stable_emp_in.groupby(['ssn_hash']).count()['fein'] == 2))

Finally, to analyze cross-state employment patterns, you can union `cohort_in_jobs` with `cohort_oh_jobs`. From there, you can check whether a member of your cohort held more than one job across Indiana and Ohio in the same quarter by counting their records in that quarter.

In [None]:
# temp table of two job tables unioned
qry = '''
create temp table jobs_combined as 
select *, 'in' as state
from ada_20_osu.cohort_in_jobs
-- union all keeps every job row; a plain union would drop rows that are exact duplicates
union all
select *, 'oh' as state 
from ada_20_osu.cohort_oh_jobs
'''
conn.execute(qry)

In [None]:
qry = '''
select * from jobs_combined
'''
df_combined = pd.read_sql(qry, conn)
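
A SQL union matches columns by position, so the Ohio `sumwages` values end up under the column name of the first query, `wages`. The sketch below does the analogous stacking in pandas on made-up data, where the rename has to be explicit, and then counts distinct states per person and quarter. The dataframes are invented and much smaller than the real tables.

```python
import pandas as pd

# Made-up job records for each state (not ADRF data)
in_jobs = pd.DataFrame({
    'ssn_hash': ['a'], 'time_after_grad': [91], 'wages': [1000],
})
oh_jobs = pd.DataFrame({
    'ssn_hash': ['a', 'b'], 'time_after_grad': [91, 91], 'sumwages': [2000, 1500],
})

# Stack the two tables, aligning the wage columns by name and tagging the state
combined = pd.concat(
    [
        in_jobs.assign(state='in'),
        oh_jobs.rename(columns={'sumwages': 'wages'}).assign(state='oh'),
    ],
    ignore_index=True,
)

# Number of distinct states each person worked in during each quarter
states_per_quarter = combined.groupby(['ssn_hash', 'time_after_grad'])['state'].nunique()
print(states_per_quarter)
```

Counting distinct values of `state`, rather than rows, separates people who worked in both states in a quarter from people who held two jobs in the same state.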

In [None]:
# check the distinct values of time_after_grad before standardizing them
df_combined['time_after_grad'].unique()

In [None]:
#collapse the slightly different day counts into one value per quarter for later grouping
df_combined.loc[(df_combined['time_after_grad'] == 91) | (df_combined['time_after_grad'] == 92),'time_after_grad'] = 91
df_combined.loc[(df_combined['time_after_grad'] == 182) | (df_combined['time_after_grad'] == 183) | 
            (df_combined['time_after_grad'] == 184),'time_after_grad'] = 183
df_combined.loc[(df_combined['time_after_grad'] == 273) | (df_combined['time_after_grad'] == 274) | 
            (df_combined['time_after_grad'] == 275),'time_after_grad'] = 273
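
The cell above standardizes the day counts one quarter at a time. As a possible alternative, the sketch below converts a day count directly into a quarter number by dividing by the average quarter length (91.25 days) and rounding. The input values are made up, and this assumes every day count falls close to a quarter boundary, as it does when graduation dates and job dates are both quarter starts.

```python
import pandas as pd

# Made-up day counts of the kind produced by job_date - deg_date
days = pd.Series([91, 92, 182, 183, 184, 273, 274, 275, 365, 366])

# Divide by the average quarter length and round to the nearest whole quarter
quarter_num = (days / 91.25).round().astype(int)
print(quarter_num.tolist())
```

A quarter number from 1 to 4 can be easier to read as a column header than a day count, but it would not be appropriate if day counts could fall midway between quarters.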

In [None]:
df_combined.head()

In [None]:
df_combined.groupby(['ssn_hash', 'time_after_grad'])['wages'].count().unstack(['time_after_grad'])

In [None]:
df_tmp = df_combined.groupby(['ssn_hash', 'time_after_grad'])['wages'].count().unstack(['time_after_grad'])

In [None]:
# replace NaN with 0
df_tmp.fillna(0, inplace=True)

# and set values >1 to 2, so 2 means "two or more jobs in the quarter"
df_tmp[df_tmp>1] = 2

In [None]:
# make ID value a column instead of an index - then we can count it when we group by the time_after_grad columns
df_tmp.reset_index(inplace=True)
df_tmp.head()

In [None]:
# group by all columns to count number of people with the same pattern
df_tmp.groupby([91, 183, 273, 365])['ssn_hash'].count().reset_index().sort_values(by='ssn_hash', ascending = False)
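
To see the whole pattern-counting pipeline in one place, the sketch below runs the same steps on a small made-up dataframe with only two quarters. It uses `clip(upper=2)` to cap the counts, which is one possible way to express "two or more"; the data and the `quarter_num` column are invented for illustration.

```python
import pandas as pd

# Made-up job records: person "a" holds two jobs in quarter 1
toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'a', 'b', 'c', 'c'],
    'quarter_num': [1, 1, 2, 1, 1, 2],
    'wages': [100, 200, 300, 400, 500, 600],
})

# One row per person, one column per quarter, values = number of jobs (capped at 2)
pattern = (
    toy.groupby(['ssn_hash', 'quarter_num'])['wages'].count()
       .unstack('quarter_num')
       .fillna(0)
       .clip(upper=2)
       .astype(int)
)

# Count how many people share each quarter-by-quarter pattern
pattern_counts = pattern.groupby([1, 2]).size()
print(pattern)
print(pattern_counts)
```

Each row of `pattern_counts` is one employment pattern, for example `(1, 0)` for someone with one job in the first quarter and none in the second.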

To take this one step further, you could track which state each person worked in during the quarters in which they worked in only one state. Even without that step, this example shows the common employment patterns for this cohort: whether graduates found employment in each quarter, and whether they held more than one job at the same time.

<font color=red><h3> Checkpoint 4: Explore cross-state employment for 2012-13 Ohio community college graduates</h3></font> 

Are there any differences between students from Hamilton County colleges and students from colleges in other counties? The Cincinnati metropolitan area sits at the intersection of Ohio, Indiana, and Kentucky, so students who graduate in this area may find it easier to find jobs and commute across state lines.

You can find which county an institution is in by using the variable `countyname` in the table `oh_hei_campus_count_lkp`. What changes do you need to make to the above code?


In this notebook, you have covered how to identify the cohort that you are interested in from a database and save it as a dataframe in Python. You have also seen how to conduct descriptive analyses in Python, such as checking for missing values and breaking down the sample by the variables that you are interested in.

After you find interesting results, you may want to present them in the form of pictures, or visualizations. In the next notebook, which will cover [Data Visualization](02_1_Data_Visualization.ipynb), we will show you how to use `matplotlib` and `seaborn` in Python to display some of your findings.