# PURPOSE OF THIS NOTEBOOK

This notebook takes the O*NET classification dataset and reshapes it into the input format expected by BERT. This has two main advantages. First, it makes loading the data into BERT straightforward. Second, it preserves the original purpose of the data (mapping reported job titles to O*NET-SOC codes), so every title carries a label, which in theory should make the model more accurate. (**SOURCE ONET DATA HERE**)

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Start by loading in the data. 
onet_df = pd.read_csv("../Data/Updated_ONET_Alt_Titles.csv")

# Filter out any columns we don't need. 
onet_filtered = onet_df[['O*NET-SOC Code', 'Reported Job Title']]

# Show the filtered data
print(onet_filtered.shape)
onet_filtered.head()

(44545, 2)


Unnamed: 0,O*NET-SOC Code,Reported Job Title
0,11-1011.00,Chief Diversity Officer (CDO)
1,11-1011.00,Chief Executive Officer (CEO)
2,11-1011.00,Chief Financial Officer (CFO)
3,11-1011.00,Chief Nursing Officer
4,11-1011.00,Chief Operating Officer (COO)


In [3]:
# One-hot encode the O*NET-SOC codes (one column per code).
dummy_onets = pd.get_dummies(onet_filtered['O*NET-SOC Code'])

dummy_onets['Reported Job'] = onet_filtered['Reported Job Title']

In [4]:
dummy_onets.head()

Unnamed: 0,11-1011.00,11-1011.03,11-1021.00,11-1031.00,11-2011.00,11-2021.00,11-2022.00,11-2032.00,11-2033.00,11-3012.00,...,55-2013.00,55-3011.00,55-3012.00,55-3013.00,55-3014.00,55-3015.00,55-3016.00,55-3018.00,55-3019.00,Reported Job
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Diversity Officer (CDO)
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Executive Officer (CEO)
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Financial Officer (CFO)
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Nursing Officer
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,Chief Operating Officer (COO)
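
To make the one-hot step concrete, here is a minimal sketch on a three-row toy frame. The toy rows (including the "General Manager" title) are illustrative assumptions, not rows quoted from the dataset. It shows that `pd.get_dummies` produces one column per distinct code and that each row has exactly one active column.

```python
import pandas as pd

# Toy stand-in for onet_filtered; the rows are illustrative assumptions.
toy = pd.DataFrame({
    "O*NET-SOC Code": ["11-1011.00", "11-1011.00", "11-1021.00"],
    "Reported Job Title": [
        "Chief Executive Officer (CEO)",
        "Chief Financial Officer (CFO)",
        "General Manager",
    ],
})

# One column per distinct code, one row per job title.
toy_dummies = pd.get_dummies(toy["O*NET-SOC Code"])

# Each title belongs to exactly one code, so every row sums to 1.
row_sums = toy_dummies.sum(axis=1)
print(toy_dummies.columns.tolist())
print(row_sums.tolist())
```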


In [5]:
# Separate the labels and the reported jobs
label_df = dummy_onets.loc[:,dummy_onets.columns != 'Reported Job']
reported_df = dummy_onets.loc[:, dummy_onets.columns == 'Reported Job']

In [6]:
# Check that we isolated just the labels
label_df.head()

Unnamed: 0,11-1011.00,11-1011.03,11-1021.00,11-1031.00,11-2011.00,11-2021.00,11-2022.00,11-2032.00,11-2033.00,11-3012.00,...,55-2012.00,55-2013.00,55-3011.00,55-3012.00,55-3013.00,55-3014.00,55-3015.00,55-3016.00,55-3018.00,55-3019.00
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [7]:
# Check that we isolated the reported jobs
reported_df.head()

Unnamed: 0,Reported Job
0,Chief Diversity Officer (CDO)
1,Chief Executive Officer (CEO)
2,Chief Financial Officer (CFO)
3,Chief Nursing Officer
4,Chief Operating Officer (COO)


In [8]:
# Lastly, convert the boolean labels to integers (0/1)
label_df = label_df.astype(int)

In [9]:
# Confirm that there are 0 null values in the label or reported jobs data 
print(label_df.isna().sum().sum())
print(reported_df.isna().sum().sum())

0
0


In [10]:
# Turn the labels into lists 
label_df['Label'] = label_df.values.tolist()
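
As a small illustration of what the `Label` column ends up holding, the sketch below builds label lists for a two-code toy frame and then maps each list back to its code. The toy frame and the `decoded` helper variable are assumptions made for the example and are not part of the notebook.

```python
import pandas as pd

# Toy one-hot frame with two codes (illustrative, not real data).
toy_labels = pd.DataFrame({"11-1011.00": [1, 0], "11-1021.00": [0, 1]})
code_columns = list(toy_labels.columns)

# Each row becomes a list of 0/1 values, one position per code column.
toy_labels["Label"] = toy_labels[code_columns].values.tolist()

# The position of the 1 in each list identifies the original code.
decoded = [code_columns[vec.index(1)] for vec in toy_labels["Label"]]
print(toy_labels["Label"].tolist())
print(decoded)
```

Keeping the column order (`code_columns`) around is what makes it possible to translate a predicted position back into an O*NET-SOC code later.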

In [14]:
# Do a per-code (stratified) train/test split on the data.
def tt_split(df):
    ''' 1. Find the one-hot label columns (one per O*NET-SOC code).
        2. Loop over the codes and do a 70/30 split of the rows belonging to each code.
        3. Collect the per-code train and test pieces.
        4. Concatenate the pieces and reset the index of the final dataframes.
        5. Return the final dataframes.
        STRETCH GOAL: replace the per-code dataframe operations with numpy array operations, which should be much faster. '''

    # Grab all the one-hot label columns (everything except the job title and the label list)
    label_list = df.drop(columns=['Reported_Jobs', 'Label']).columns.to_list()

    train_parts, test_parts = [], []
    for onet in label_list:
        # Rows that belong to this O*NET-SOC code
        filter_df = df[df[onet] == 1]
        # Sample 70% for training with a fixed seed; the remaining rows go to testing
        temp_train_df = filter_df.sample(frac=.7, random_state=150)
        temp_test_df = filter_df.drop(temp_train_df.index)

        train_parts.append(temp_train_df)
        test_parts.append(temp_test_df)

    # Concatenate once (cheaper than concatenating inside the loop) and keep only the model inputs
    train_df = pd.concat(train_parts, ignore_index=True)[['Reported_Jobs', 'Label']]
    test_df = pd.concat(test_parts, ignore_index=True)[['Reported_Jobs', 'Label']]

    return train_df, test_df
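
The stretch goal in the docstring asks for a faster split. One possible direction, sketched here as an assumption rather than as the notebook's method, is to recover each row's code with `idxmax` and let pandas sample within groups in a single call. The function name `tt_split_grouped` and the toy frame are hypothetical. Because the seed is applied differently, the rows selected will not match the loop version exactly, and the sketch assumes the frame has a unique index.

```python
import pandas as pd

def tt_split_grouped(df, frac=0.7, seed=150):
    """Hypothetical per-code split using groupby sampling instead of a loop."""
    keep = ["Reported_Jobs", "Label"]
    code_cols = [c for c in df.columns if c not in keep]
    # The column holding the 1 in each row is that row's O*NET-SOC code.
    codes = df[code_cols].idxmax(axis=1)
    # Sample `frac` of the rows within every code group.
    train = df.groupby(codes, sort=False).sample(frac=frac, random_state=seed)
    # Everything not sampled goes to the test set (needs a unique index).
    test = df.drop(train.index)
    return train[keep].reset_index(drop=True), test[keep].reset_index(drop=True)

# Toy frame: two codes with ten titles each (illustrative data only).
toy = pd.DataFrame({
    "11-1011.00": [1] * 10 + [0] * 10,
    "11-1021.00": [0] * 10 + [1] * 10,
})
toy["Label"] = toy[["11-1011.00", "11-1021.00"]].values.tolist()
toy["Reported_Jobs"] = [f"title {i}" for i in range(20)]

toy_train, toy_test = tt_split_grouped(toy)
print(toy_train.shape, toy_test.shape)
```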

In [12]:
# Create the final dataframe to export
label_df['Reported_Jobs'] = reported_df['Reported Job']


In [15]:
train_df, test_df = tt_split(label_df)

In [22]:
train_df.shape

(31166, 2)

In [23]:
test_df.shape

(13379, 2)

In [24]:
# Export the Training and Testing data. 
train_df.to_csv('../Data/Training_Data.csv', index=False)
test_df.to_csv('../Data/TestingData.csv', index=False)
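
One thing to watch when these files are loaded again: CSV has no list type, so the `Label` column is written as text such as `"[1, 0, 0]"` and comes back as a string. The sketch below shows the round trip on an in-memory toy frame and one way to restore the lists with `ast.literal_eval`. The toy data and variable names are assumptions for the example.

```python
import ast
import io
import pandas as pd

# Toy frame shaped like the exported data (illustrative values).
toy = pd.DataFrame({
    "Reported_Jobs": ["Chief Executive Officer (CEO)"],
    "Label": [[1, 0, 0]],
})

# to_csv with no path returns the CSV text, which stands in for the file on disk.
csv_text = toy.to_csv(index=False)

# A plain read gives the label back as a string, not a list.
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw["Label"].iloc[0]))

# Parsing the column with ast.literal_eval restores the list of ints.
restored = pd.read_csv(io.StringIO(csv_text), converters={"Label": ast.literal_eval})
print(restored["Label"].iloc[0])
```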

## Next Steps 

From here, we take the data prepared for the model and split it into train, test, and validation sets. We then train and evaluate the model, build a script that extracts job titles from resumes, build a second script that automatically generates analytics on how likely an applicant is to be a good fit, and finally host all of this behind an API.
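
As a rough sketch of the validation step mentioned above, one option is to hold out a fraction of the training rows. The function name `add_validation_split`, the 20% fraction, and the toy frame are all assumptions. This simple version samples across the whole frame, so unlike the train/test split it is not stratified by code.

```python
import pandas as pd

def add_validation_split(train_df, val_frac=0.2, seed=150):
    """Hypothetical helper: hold out a share of the training rows for validation."""
    val_df = train_df.sample(frac=val_frac, random_state=seed)
    remaining = train_df.drop(val_df.index)
    return remaining.reset_index(drop=True), val_df.reset_index(drop=True)

# Toy training frame with 20 rows (illustrative data only).
toy_train = pd.DataFrame({
    "Reported_Jobs": [f"title {i}" for i in range(20)],
    "Label": [[1, 0]] * 20,
})

toy_fit, toy_val = add_validation_split(toy_train)
print(toy_fit.shape, toy_val.shape)
```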