# How to join O*NET codes to BLS data to get salary and title

##### Data sources used:
+ Linkup Raw Job Records.  We will be using a slice of raw data.
+ All data Excel file (`all_data_M_2018.xlsx`) from the Bureau of Labor Statistics.  https://www.bls.gov/oes/tables.htm
+ O*NET 2010 to 2018 Crosswalk.  https://www.onetcenter.org/crosswalks.html

##### Warning:  The Bureau of Labor Statistics only includes salary data for the United States.  The salary figures joined here are estimates based on United States data.

##### There are 2 approaches to joining these tables within Python:

1. The first approach uses the pandas library to join the tables.
2. The second approach uses SQL to join the tables.  I will be using the sqlite3 library to create a SQL structure in memory, but the query can be taken and used in any SQL database.

##### For this tutorial I am going to use a selection of BLS columns that are most common for our clients; feel free to look through the file and choose the data points most relevant to your use case.  The columns I will join are:
- occ_code:  This is the join key
- occ_title:  Human-readable occupation title
- h_mean & a_mean:  Hourly and annual mean income
- h_median & a_median:  Hourly and annual median income

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os
import sqlite3
import tarfile

# Display parameters for dataframes for tutorial display purposes
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
# Loading raw sample slice
with tarfile.open('../../../Feeds/raw-sample.tar.gz', "r:*") as tar:
    # Get path to the job_records file within the tarfile
    csv_path = tar.getnames()[1]
    # Load job records file into pandas dataframe
    job_records = pd.read_csv(tar.extractfile(csv_path),
                    parse_dates = ['created','delete_date','last_checked','last_updated'],
                    low_memory = False)
job_records.loc[:10,'hash':'company_name']

Unnamed: 0,hash,title,company_id,company_name
0,0f659e59f8967f986ab53b898e543095,Financial Solutions Advisor - Bilingual Mandar...,381,Bank of America Corporation
1,058e013f78e737baf7a6d9c36b33ba9e,"Teller - Bishop, CA (Parttime, 20hrs)",381,Bank of America Corporation
2,0234422f9c5eff7f6d1d008c3c31dae6,"UI Artist, Double Helix Games, Amazon Game Stu...",469,"Amazon.com, Inc."
3,0a5cf0310984a8c86bb42b1269640b28,Home Retention Specialist/Collector I - Russel...,381,Bank of America Corporation
4,0a7cd92ca3f7277442bbd157c018a968,South San Diego-Sales & Service Specialist-Jac...,381,Bank of America Corporation
5,03ffb7f37da9f17b3c9806fb032363c8,Personal Banker - Hialeah Gardens Banking Cent...,381,Bank of America Corporation
6,07faeba744c2ca2d0525e78b6b08e142,Mortgage Loan Officer-Palm Desert,381,Bank of America Corporation
7,0c392d37eb6dfb15b1ffc2a48b8d95e7,"Warehouse Team Member (Seasonal, Part Time, Fl...",469,"Amazon.com, Inc."
8,0ee6d7263a67121e8405af8ad8fdef2a,Responsable des opérations de contrôle d'inven...,469,"Amazon.com, Inc."
9,0aaf97d36a5279ac18a43d1e34344232,"Relationship Manager-Newton/Waltham, MA Area",381,Bank of America Corporation
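
The loading cell above picks the job records file by position (`tar.getnames()[1]`), which depends on the order of members in the archive. As a hedged alternative, the sketch below selects the member by file name instead. The archive contents here (`README.txt`, `job_records.csv`) are hypothetical and built in memory only so the example runs on its own; the real feed may use different names.

```python
import io
import tarfile

import pandas as pd

# Build a tiny in-memory archive so the sketch runs without the real feed.
# Member names and contents are hypothetical.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, text in [("README.txt", "notes\n"),
                       ("job_records.csv", "hash,title\nabc,Teller\n")]:
        data = text.encode("utf-8")
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r:*") as tar:
    # Select the CSV member by name rather than by its position in the archive
    csv_path = next(n for n in tar.getnames() if n.endswith(".csv"))
    sample = pd.read_csv(tar.extractfile(csv_path))

print(csv_path)
print(sample)
```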


In [3]:
# Load BLS data and keep only the national, cross-industry rows
BLS_data = pd.read_excel('../../../External/all_data_M_2018.xlsx')
BLS_data = BLS_data[(BLS_data.area_title == 'U.S.') &
                    (BLS_data.naics_title == 'Cross-industry')]

In [4]:
#Load Crosswalk
OnetCrosswalk = pd.read_excel('../../../External/2010_to_2018_SOC_Crosswalk.xls',
                            usecols = ['O*NET-SOC 2010 Code','2018 SOC Code'])
OnetCrosswalk.columns = ['O*NET-SOC 2010 Code','occ_code']

BLS_data = OnetCrosswalk.merge(BLS_data[['occ_code', 'occ_title', 'h_mean', 'a_mean', 'h_median', 'a_median']], 
                    how = 'left',on = 'occ_code')
BLS_data

Unnamed: 0,O*NET-SOC 2010 Code,occ_code,occ_title,h_mean,a_mean,h_median,a_median
0,11-1011.00,11-1011,Chief Executives,96.22,200140,91.15,189600
1,11-1011.03,11-1011,Chief Executives,96.22,200140,91.15,189600
2,11-1021.00,11-1021,General and Operations Managers,59.56,123880,48.52,100930
3,11-1031.00,11-1031,Legislators,*,47620,*,24670
4,11-2011.00,11-2011,Advertising and Promotions Managers,63.99,133090,56.31,117130
...,...,...,...,...,...,...,...
1165,55-3015.00,55-3015,,,,,
1166,55-3016.00,55-3016,,,,,
1167,55-3017.00,17-3029,"Engineering Technicians, Except Drafters, All ...",31.6,65720,30.38,63200
1168,55-3018.00,55-3018,,,,,
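
Note that the Legislators row above carries `*` in `h_mean` and `h_median`, and the military codes have no BLS values at all. The `*` appears to be a placeholder for an estimate that is not available, so the salary columns may load as text rather than numbers. A minimal sketch of one way to handle this, using a small stand-in table (`bls_sample` is a hypothetical name; the values are copied from the output above):

```python
import pandas as pd

# Toy stand-in for the merged BLS table; values copied from the output above.
bls_sample = pd.DataFrame({
    "occ_code": ["11-1011", "11-1031", "55-3015"],
    "h_mean": [96.22, "*", None],
    "a_mean": [200140, 47620, None],
})

salary_cols = ["h_mean", "a_mean"]
# errors="coerce" turns anything non-numeric (such as "*") into NaN
bls_sample[salary_cols] = bls_sample[salary_cols].apply(pd.to_numeric, errors="coerce")

print(bls_sample.dtypes)
print(bls_sample)
```

Coercing to NaN keeps the rows in the table, so a later left join still returns the occupation title even when a wage figure is missing.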


# Crosswalk Onet Codes in Job Records to 2018 Version

##### Here we are just adding the occ_code column to job records so that we have the 2018 SOC code that joins to the BLS dataset

In [5]:
job_records = job_records.merge(OnetCrosswalk, 
                              how = 'left',
                              left_on = 'onet_occupation_code', 
                              right_on = 'O*NET-SOC 2010 Code')
job_records = job_records.drop('O*NET-SOC 2010 Code', axis = 1)
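
Because this is a left join, any job record whose O*NET code is missing from the crosswalk silently ends up with an empty `occ_code`. A quick way to see how many records matched is the `indicator` option of `merge`. The sketch below uses hypothetical miniature tables (`job_sample`, `crosswalk_sample`) with the same column names as above; the code `99-9999.00` is invented to force a miss.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
job_sample = pd.DataFrame({
    "hash": ["a1", "b2", "c3"],
    "onet_occupation_code": ["11-1011.00", "41-3031.01", "99-9999.00"],
})
crosswalk_sample = pd.DataFrame({
    "O*NET-SOC 2010 Code": ["11-1011.00", "41-3031.01"],
    "occ_code": ["11-1011", "41-3031"],
})

checked = job_sample.merge(
    crosswalk_sample,
    how="left",
    left_on="onet_occupation_code",
    right_on="O*NET-SOC 2010 Code",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

unmatched = checked[checked["_merge"] == "left_only"]
match_rate = (checked["_merge"] == "both").mean()

print(f"matched {match_rate:.0%} of records")
print(unmatched[["hash", "onet_occupation_code"]])
```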

# Final Join to add salary data and normalize title

##### Use SQL query to join these tables.  I will be using the sqlite3 library to create a SQL structure in memory, but the query can be taken and used in any SQL database

In [6]:
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
job_records.to_sql('job_records', conn, index=False)
BLS_data.to_sql('BLS_data', conn, index=False)

qry = '''
    SELECT
        hash,
        job_records.occ_code,
        occ_title,
        company_name,
        created,
        h_mean,
        a_mean,
        h_median,
        a_median
    FROM job_records
    LEFT JOIN BLS_data
    ON job_records.occ_code = BLS_data.occ_code;
    '''

New_JobRecords = pd.read_sql_query(qry, conn)
New_JobRecords.head(3)


Unnamed: 0,hash,occ_code,occ_title,company_name,created,h_mean,a_mean,h_median,a_median
0,0f659e59f8967f986ab53b898e543095,41-3031,"Securities, Commodities, and Financial Service...",Bank of America Corporation,2015-04-16 12:25:01+00:00,47.49,98770,30.83,64120
1,0f659e59f8967f986ab53b898e543095,41-3031,"Securities, Commodities, and Financial Service...",Bank of America Corporation,2015-04-16 12:25:01+00:00,47.49,98770,30.83,64120
2,0f659e59f8967f986ab53b898e543095,41-3031,"Securities, Commodities, and Financial Service...",Bank of America Corporation,2015-04-16 12:25:01+00:00,47.49,98770,30.83,64120
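
The three rows above share one hash. A likely cause is that `BLS_data` was merged onto the crosswalk earlier, so it holds one row per O*NET code, and several O*NET codes can map to the same `occ_code` (for example, `11-1011` appears twice in the BLS output). Joining on `occ_code` alone then repeats each job record. The sketch below shows the same join done with pandas, the first approach mentioned at the top, with the BLS table reduced to one row per `occ_code` first. The table names and hashes are hypothetical; the salary values are copied from the BLS output above.

```python
import pandas as pd

# Hypothetical job records that already carry the 2018 SOC code
job_sample = pd.DataFrame({
    "hash": ["a1", "b2"],
    "occ_code": ["11-1011", "11-1021"],
})
# Crosswalked BLS table: two O*NET codes share the SOC code 11-1011
bls_sample = pd.DataFrame({
    "O*NET-SOC 2010 Code": ["11-1011.00", "11-1011.03", "11-1021.00"],
    "occ_code": ["11-1011", "11-1011", "11-1021"],
    "occ_title": ["Chief Executives", "Chief Executives",
                  "General and Operations Managers"],
    "a_mean": [200140, 200140, 123880],
})

# Joining directly repeats the first job record
naive = job_sample.merge(bls_sample, how="left", on="occ_code")

# Keep one BLS row per occ_code, then join; validate raises if duplicates remain
bls_unique = (bls_sample
              .drop(columns="O*NET-SOC 2010 Code")
              .drop_duplicates(subset="occ_code"))
deduped = job_sample.merge(bls_unique, how="left", on="occ_code",
                           validate="many_to_one")

print(len(naive), len(deduped))
print(deduped)
```

In the SQL version, the same effect could be had by joining against a subquery that selects distinct `occ_code` rows from `BLS_data`.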


#  Possible next steps

1. The same methodology can be applied at a finer geographic level by joining on both occupation code and state.

2. The joined data can be aggregated for modeling: summing salaries across open positions gives a rough estimate of financial investment by job type or geographic area.
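
Both next steps can be sketched together. The example below assumes the job records carry a `state` column holding full state names that line up with `area_title` in the state-level BLS rows; that column name, the hashes, and the salary figures are all hypothetical placeholders rather than real BLS values.

```python
import pandas as pd

# Hypothetical job records; "state" is an assumed column name
job_sample = pd.DataFrame({
    "hash": ["a1", "b2", "c3", "d4"],
    "occ_code": ["11-1011", "11-1011", "11-1021", "11-1021"],
    "state": ["California", "Texas", "California", "California"],
})
# Hypothetical state-level BLS rows; salary figures are made-up round numbers
bls_state = pd.DataFrame({
    "occ_code": ["11-1011", "11-1011", "11-1021"],
    "area_title": ["California", "Texas", "California"],
    "a_mean": [200000, 180000, 120000],
})

# Step 1: join on occupation code and state together
joined = job_sample.merge(
    bls_state,
    how="left",
    left_on=["occ_code", "state"],
    right_on=["occ_code", "area_title"],
)

# Step 2: count postings and sum their mean annual salaries per state
spend_by_state = joined.groupby("state")["a_mean"].agg(["count", "sum"])
print(spend_by_state)
```

Grouping by `occ_code` instead of `state` would give the same kind of total per job type.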