# PURPOSE OF THIS NOTEBOOK

This notebook takes the ONET classification dataset and reshapes it into the input format that BERT expects. This has two main advantages. First, it makes loading the data into BERT straightforward. Second, it keeps the data's original purpose and lets us label the titles, which in theory should make the model more accurate. (**SOURCE ONET DATA HERE**)

In [47]:
import pandas as pd
import numpy as np

In [48]:
# Start by loading in the data. 
onet_df = pd.read_csv("../Data/Updated_ONET_Alt_Titles.csv")

# Filter out any columns we don't need. 
onet_filtered = onet_df[['O*NET-SOC Code', 'Reported Job Title']]

# Show the filtered data
print(onet_filtered.shape)
onet_filtered.head()

(44545, 2)


Unnamed: 0,O*NET-SOC Code,Reported Job Title
0,11-1011.00,Chief Diversity Officer (CDO)
1,11-1011.00,Chief Executive Officer (CEO)
2,11-1011.00,Chief Financial Officer (CFO)
3,11-1011.00,Chief Nursing Officer
4,11-1011.00,Chief Operating Officer (COO)


In [49]:
# Dummify the data. 
dummy_onets = pd.get_dummies(onet_filtered['O*NET-SOC Code'])

dummy_onets['Reported Job'] = onet_filtered['Reported Job Title']

In [50]:
dummy_onets.head()

Unnamed: 0,11-1011.00,11-1011.03,11-1021.00,11-1031.00,11-2011.00,11-2021.00,11-2022.00,11-2032.00,11-2033.00,11-3012.00,...,55-2013.00,55-3011.00,55-3012.00,55-3013.00,55-3014.00,55-3015.00,55-3016.00,55-3018.00,55-3019.00,Reported Job
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Diversity Officer (CDO)
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Executive Officer (CEO)
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Financial Officer (CFO)
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Nursing Officer
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Operating Officer (COO)
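To make the one-hot step easier to follow, here is a minimal sketch on a toy frame. The three rows are illustrative stand-ins rather than real rows from the dataset, and `toy` and `toy_dummies` are hypothetical names. It also shows that passing `dtype=int` to `pd.get_dummies` produces 0/1 columns directly, which is one possible alternative to converting the boolean columns afterwards.

```python
import pandas as pd

# Toy stand-in for the two columns used above (values are illustrative only).
toy = pd.DataFrame({
    "O*NET-SOC Code": ["11-1011.00", "11-1011.00", "11-1021.00"],
    "Reported Job Title": [
        "Chief Executive Officer (CEO)",
        "Chief Financial Officer (CFO)",
        "General Manager",
    ],
})

# One column per distinct code; dtype=int gives 0/1 instead of True/False.
toy_dummies = pd.get_dummies(toy["O*NET-SOC Code"], dtype=int)

# Attach the job title next to its one-hot label, as in the cell above.
toy_dummies["Reported Job"] = toy["Reported Job Title"]

print(toy_dummies)
```

Each row ends up with a single 1 in the column of its own code and 0 everywhere else.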


In [51]:
# Separate the labels and the reported jobs
label_df = dummy_onets.loc[:,dummy_onets.columns != 'Reported Job']
reported_df = dummy_onets.loc[:, dummy_onets.columns == 'Reported Job']

In [52]:
# Check that we isolated just the labels
label_df.head()

Unnamed: 0,11-1011.00,11-1011.03,11-1021.00,11-1031.00,11-2011.00,11-2021.00,11-2022.00,11-2032.00,11-2033.00,11-3012.00,...,55-2012.00,55-2013.00,55-3011.00,55-3012.00,55-3013.00,55-3014.00,55-3015.00,55-3016.00,55-3018.00,55-3019.00
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [53]:
# Check that we isolated the reported jobs
reported_df.head()

Unnamed: 0,Reported Job
0,Chief Diversity Officer (CDO)
1,Chief Executive Officer (CEO)
2,Chief Financial Officer (CFO)
3,Chief Nursing Officer
4,Chief Operating Officer (COO)


In [54]:
# Lastly, convert the boolean labels to integers (0/1)
label_df = label_df.astype(int)

In [55]:
# Confirm that there are 0 null values in the label or reported jobs data 
print(label_df.isna().sum().sum())
print(reported_df.isna().sum().sum())

0
0
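Besides checking for nulls, it may be worth confirming that the labels are a valid one-hot encoding, meaning every row carries exactly one 1. The sketch below is one way to do that; `check_one_hot` is a hypothetical helper, not part of the original notebook, and it is shown on toy frames rather than on `label_df`.

```python
import pandas as pd

def check_one_hot(labels: pd.DataFrame) -> bool:
    """Return True if all values are 0/1 and every row sums to exactly 1."""
    only_binary = labels.isin([0, 1]).all().all()
    one_per_row = (labels.sum(axis=1) == 1).all()
    return bool(only_binary and one_per_row)

# A valid encoding: one label per row.
good = pd.DataFrame({"11-1011.00": [1, 0], "11-1021.00": [0, 1]})

# An invalid encoding: the first row has two labels, the second has none.
bad = pd.DataFrame({"11-1011.00": [1, 0], "11-1021.00": [1, 0]})

print(check_one_hot(good), check_one_hot(bad))
```

In the notebook, the same helper could be called on `label_df` before the job titles are attached.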


In [56]:
# Create the final dataframe to export
label_df['Reported_Jobs'] = reported_df['Reported Job']
final_onets = label_df

In [57]:
final_onets

Unnamed: 0,11-1011.00,11-1011.03,11-1021.00,11-1031.00,11-2011.00,11-2021.00,11-2022.00,11-2032.00,11-2033.00,11-3012.00,...,55-2013.00,55-3011.00,55-3012.00,55-3013.00,55-3014.00,55-3015.00,55-3016.00,55-3018.00,55-3019.00,Reported_Jobs
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Chief Diversity Officer (CDO)
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Chief Executive Officer (CEO)
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Chief Financial Officer (CFO)
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Chief Nursing Officer
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Chief Operating Officer (COO)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44540,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,Tactical Debriefer
44541,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,Tactical/Mobile (Tacmobile) Ashore Analysis Sy...
44542,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,Target Aircraft Technician
44543,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,Technical Surveillance Countermeasures (TSCM) ...


In [58]:
# Export the data 
final_onets.to_csv("../Data/ONET_Label_Job_Pairing.csv", index=False)
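When the exported file is read back for training, the titles and labels need to be separated again. The sketch below assumes the same column layout as the export (one column per code plus `Reported_Jobs`) but uses a small in-memory CSV with made-up rows instead of the real file. It also derives an integer class id per row, which some single-label classification setups expect instead of one-hot vectors; whether that is needed depends on the model code, which is not shown in this notebook.

```python
import io
import pandas as pd

# In-memory stand-in for the exported CSV (rows are illustrative only).
csv_text = """11-1011.00,11-1021.00,Reported_Jobs
1,0,Chief Executive Officer (CEO)
0,1,General Manager
1,0,Chief Financial Officer (CFO)
"""
df = pd.read_csv(io.StringIO(csv_text))

# Every column except the title column is a label column.
label_cols = [c for c in df.columns if c != "Reported_Jobs"]

texts = df["Reported_Jobs"].tolist()        # model inputs
label_matrix = df[label_cols].to_numpy()    # one-hot labels, shape (rows, codes)

# Position of the single 1 in each row gives an integer class id.
class_ids = label_matrix.argmax(axis=1)

# Mapping from class id back to the occupation code.
id_to_code = dict(enumerate(label_cols))

print(texts[0], class_ids[0], id_to_code[int(class_ids[0])])
```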

## Next Steps 

From here, we take the data prepared for the model and split it into train, validation, and test sets. We will then train and evaluate the model, build a script to extract job titles from resumes, build a second script that generates analytics on how likely an applicant is to be a good fit, and finally host all of this behind an API.
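The split described above could be sketched as follows. This is only an illustration under assumptions: the 80/10/10 proportions, the seed, and the `split_frame` helper are all hypothetical choices, and the toy frame stands in for the exported data.

```python
import pandas as pd

def split_frame(df, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the rows, then cut them into train, validation, and test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # whatever remains
    return train, val, test

# Toy stand-in for the exported label/title pairs.
toy = pd.DataFrame({
    "Reported_Jobs": [f"title {i}" for i in range(20)],
    "label": [i % 2 for i in range(20)],
})

train, val, test = split_frame(toy)
print(len(train), len(val), len(test))
```

A purely random split like this can leave rare occupation codes out of the validation or test set, so a stratified split may be worth considering for the real data.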