# Job Ad Analysis Toolkit 

## A Demonstration from Unstructured Job Ad Text to Data


### Tools: TitleMatch, FirmExtract, TaskMatch, and CREAM

Let's code 990 job ads with the following automated tools:

1. TitleMatch - Matches job titles to O*NET occupation codes.
2. FirmExtract - Extracts candidate firm names from job ad (or other) text.
3. TaskMatch - Matches job ad text to O*NET task IDs.
4. CREAM - Develops custom classifiers with embeddings by augmenting a human-curated list of 'rules'. The demo uses education requirements.

This project has received generous support from the National Labor Exchange, the Russell Sage Foundation, and the Washington Center for Equitable Growth.

Job ad data for this demo is from:

```

Zhou, Steven, John Aitken, Peter McEachern, and Renee McCauley. “Data from 990 Public Real-World Job Advertisements Organized by O*NET Categories.” Journal of Open Psychology Data 10 (November 21, 2022): 17. https://doi.org/10.5334/jopd.69.

```

In [1]:
import pandas as pd
from operator import itemgetter

data = pd.read_excel("data/demo/TitleMatch/JobAdsData2022_OSF.xlsx")
data['text'] = data['text'].str.strip()
data.text = data.text.replace(r'\n','. ', regex=True)
data.text = data.text.apply(lambda x: x.lower() if isinstance(x, str) else x)
# We will work with the columns 'job_title' and 'text' (a concatenation of all text)
data[['job_title','text']]

| | job_title | text |
|---|---|---|
| 0 | SBA Loan Associate | assist business development officer with sba 7... |
| 1 | Credit Analyst | the credit analyst is responsible for assistin... |
| 2 | Scientist, Lead Discovery | design and execution of biochemical assays for... |
| 3 | Maintenance Director | as a maintenance director, you’ll effectively ... |
| 4 | Engineer I | freese and nichols is seeking an engineer i to... |
| ... | ... | ... |
| 985 | $1500 Signing Bonus Home Health Aide | responsible to the r.n. and/or therapist who a... |
| 986 | Home Health Aide - Home Health - New River Val... | this position supports carilion's hallmarks of... |
| 987 | Home Health Aide | documents the various aspects of care on appro... |
| 988 | Home Health Aide | observes, reports and documents patient status... |


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id                     990 non-null    int64 
 1   onet_id                990 non-null    object
 2   job_title              990 non-null    object
 3   company                990 non-null    object
 4   job_location           986 non-null    object
 5   salary                 78 non-null     object
 6   text                   990 non-null    object
 7   responsibilities_text  978 non-null    object
 8   requirements_text      961 non-null    object
 9   preferred_text         655 non-null    object
 10  company_desc           609 non-null    object
dtypes: int64(1), object(10)
memory usage: 85.2+ KB


# Title Match: Match Job Titles to O*NET Occupation Codes

From:

'Operations Engineer'
'ESL Processing Operator'

To: 

[('Operations Engineer', '17-2112.00', 1.0)]
[('Processing Operator', '51-3091.00', 0.943)]

In [3]:
from JAAT import TitleMatch
TiM = TitleMatch()

data['TitleMatch'] = TiM.get_title(data.job_title.to_list())
data['TM_onet_id'] = data.TitleMatch.apply(itemgetter(1)).astype(str)




INIT
Loading data...
Preparing embeddings...


Batches: 100%|██████████| 1424/1424 [00:42<00:00, 33.28it/s]
Batches: 100%|██████████| 31/31 [00:01<00:00, 28.03it/s]


In [4]:
data[['job_title', 'TM_onet_id']]

| | job_title | TM_onet_id |
|---|---|---|
| 0 | SBA Loan Associate | 43-4131.00 |
| 1 | Credit Analyst | 13-2041.00 |
| 2 | Scientist, Lead Discovery | 47-5049.00 |
| 3 | Maintenance Director | 11-3013.00 |
| 4 | Engineer I | 15-1241.01 |
| ... | ... | ... |
| 985 | $1500 Signing Bonus Home Health Aide | 31-1121.00 |
| 986 | Home Health Aide - Home Health - New River Val... | 31-1121.00 |
| 987 | Home Health Aide | 31-1121.00 |
| 988 | Home Health Aide | 31-1121.00 |
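
Because the dataset ships with its own `onet_id` column, the matched codes can be checked against it. The following is a minimal sketch on a toy frame. The `TM_onet_id` values are taken from the output above, while the `onet_id` values are hypothetical placeholders. It also assumes that both columns use the same `NN-NNNN.NN` string format.

```python
import pandas as pd

# Toy frame: TM_onet_id values come from the output above;
# the onet_id values are hypothetical placeholders, not the dataset's labels.
toy = pd.DataFrame({
    "job_title": ["Credit Analyst", "Maintenance Director", "Home Health Aide"],
    "onet_id": ["13-2041.00", "11-9141.00", "31-1121.00"],
    "TM_onet_id": ["13-2041.00", "11-3013.00", "31-1121.00"],
})

# Exact agreement on the full occupation code.
toy["exact_match"] = toy["onet_id"] == toy["TM_onet_id"]

# Looser agreement on the two-digit major group (the part before the hyphen).
toy["major_match"] = toy["onet_id"].str[:2] == toy["TM_onet_id"].str[:2]

exact_rate = toy["exact_match"].mean()
major_rate = toy["major_match"].mean()
print(f"exact: {exact_rate:.2f}, major group: {major_rate:.2f}")
```

Reporting both rates separates near misses within the same occupational family from matches that land in an unrelated family.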


# Firm Extract: Extracts Candidate Firm Names from Job Ad (or other) Text

From:

First American Equipment Finance, an RBC/City National Company, is a growing, national leader providing equipment leasing and equipment finance services to commercial borrowers in all fifty states. In 2020, the company earned top honors among midsize companies for the Best Companies to Work for in New York for the 3rd consecutive year, and in 2021, added to their accolades by being recognized as the #1 top workplace among midsized companies in Rochester. With national headquarters in Rochester (Woodcliff Office Park, Fairport), First American has approximately 270 employees and manages a $2bn portfolio.

To: 

{'american', 'first american'}

From: 

As a key member of the Lead Discovery team, reporting to the Head of Lead Discovery, the Scientist, Lead Discovery will develop and implement biochemical assays for internal and external use and collaborate with a cross-functional team of scientists within the company and with external partners to support drug discovery efforts across the Accent portfolio.	

To: 
{'accent'}

In [5]:
from JAAT import FirmExtract
FE = FirmExtract()

data['FirmExtract'] = FE.get_firm_batch(data.text.to_list())

INIT


100%|██████████| 990/990 [02:00<00:00,  8.21it/s]


In [8]:
data[['company','FirmExtract']]

| | company | FirmExtract |
|---|---|---|
| 0 | Bankers Healthcare Group | {-, fund, ex solutions} |
| 1 | First American Equipment Finance | {american, first american} |
| 2 | Accent Therapeutics, Inc. | {accent} |
| 3 | CWS Apartment Homes LLC | |
| 4 | Freese and Nichols, Inc. | {freese and nichols, freese and} |
| ... | ... | ... |
| 985 | Cuyuna Regional Medical Center | |
| 986 | Carilion Clinic | {carilion clinic} |
| 987 | Ridgeview Medical Center | {ridgeview} |
| 988 | St. Claire HealthCare | |
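
One rough way to judge the extracted candidates is to test whether any of them occurs inside the known `company` name. The sketch below uses rows copied from the output above. The helper `candidate_hits` is hypothetical, and representing an empty cell as `None` is an assumption about how the column stores missing results.

```python
import pandas as pd

# Rows copied from the output above; None stands in for an empty FirmExtract cell.
toy = pd.DataFrame({
    "company": [
        "First American Equipment Finance",
        "Accent Therapeutics, Inc.",
        "CWS Apartment Homes LLC",
        "Carilion Clinic",
    ],
    "FirmExtract": [
        {"american", "first american"},
        {"accent"},
        None,
        {"carilion clinic"},
    ],
})

def candidate_hits(company, candidates):
    """Return the candidates that occur as substrings of the lowercased company name."""
    if not isinstance(candidates, (set, list, tuple)):
        return []  # missing value (None or NaN): nothing to compare
    name = company.lower()
    return sorted(c for c in candidates if c in name)

toy["hits"] = [candidate_hits(c, f) for c, f in zip(toy["company"], toy["FirmExtract"])]
toy["any_hit"] = toy["hits"].apply(bool)
print(toy[["company", "hits", "any_hit"]])
```

Substring matching is lenient: a fragment such as `american` counts as a hit, so this is better read as an upper bound on extraction quality than as a strict accuracy score.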


# Task Match: Match Job Ad Text to O*NET Task IDs

From text: 

```
Develops software solutions by studying information needs; conferring with users; studying systems flow, data usage, and work processes; investigating problem areas; following the software development lifecycle.
Determines operational feasibility by evaluating analysis, problem definition, requirements, solution development, and proposed solutions.
Documents and demonstrates solutions by developing documentation, flowcharts, layouts, diagrams, charts, code comments and clear code.
Supports and develops software developers by providing advice, coaching and educational opportunities.
Other duties as required.

```

To O*NET Task IDs:

```

[('16363', 'Identify operational requirements for new systems to inform selection of technological solutions.'),
 ('16987', 'Prepare documentation or presentations, including charts, photos, or graphs.'),
 ('9583', 'Assign duties to other staff and give instructions regarding work methods and routines.')]

```
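
A list of `(task_id, task_text)` pairs per ad is convenient to read but awkward to aggregate. The sketch below reshapes one toy row into a long table with one row per matched task. It assumes that each `TaskMatch` cell holds a list of pairs like the one above. The `id` value is hypothetical.

```python
import pandas as pd

# One toy row whose TaskMatch cell mirrors the (task_id, task_text) pairs shown above.
toy = pd.DataFrame({
    "id": [1],  # hypothetical ad id
    "TaskMatch": [[
        ("16363", "Identify operational requirements for new systems to inform selection of technological solutions."),
        ("16987", "Prepare documentation or presentations, including charts, photos, or graphs."),
        ("9583", "Assign duties to other staff and give instructions regarding work methods and routines."),
    ]],
})

# explode() yields one row per matched task; each cell is then a single tuple.
tasks_long = toy.explode("TaskMatch", ignore_index=True)

# Split the tuple into separate columns and drop the original cell.
tasks_long["task_id"] = [pair[0] for pair in tasks_long["TaskMatch"]]
tasks_long["task_text"] = [pair[1] for pair in tasks_long["TaskMatch"]]
tasks_long = tasks_long.drop(columns="TaskMatch")

print(tasks_long[["id", "task_id"]])
```

In long form, counting how often each task ID appears across ads is a single `value_counts()` call on `task_id`.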

In [9]:
from JAAT import TaskMatch
TM = TaskMatch()

data['TaskMatch'] = TM.get_tasks_batch(data.text.to_list())

INIT
Preparing embeddings...


Batches: 100%|██████████| 295/295 [00:35<00:00,  8.34it/s]

Setting up pipeline...
Finished.





AttributeError: Can't pickle local object 'pad_collate_fn.<locals>.inner'
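
The `AttributeError` above comes from Python's `pickle` module, which serializes functions by qualified name and so cannot handle a function defined inside another function. Errors of this shape typically surface when a data loader starts worker processes and has to send its collate function to them. Whether that is what happens inside `TaskMatch` is an assumption here. A minimal sketch that reproduces the same kind of failure, with hypothetical function names:

```python
import pickle

def make_collate():
    # A function defined inside another function is a "local object":
    # pickle records functions by qualified name and cannot look this one up.
    def inner(batch):
        return batch
    return inner

try:
    pickle.dumps(make_collate())
    local_pickled = True
except (AttributeError, pickle.PicklingError) as exc:
    # The exception class varies across Python versions, so both are caught.
    local_pickled = False
    print(f"{type(exc).__name__}: {exc}")
```

If the pipeline can be run with single-process data loading, the pickling step is usually avoided. The demo does not show such an option, so check the JAAT documentation before relying on it.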

In [None]:
data[['TaskMatch']]

# CREAM: Adapt a Classification Scheme or Build a Novel Lexicon

From initial rules:


And natural language in a corpus:



To labeled data:


In [11]:
import re
from operator import itemgetter  # used in __helper__ below

import pandas as pd
from sentence_transformers import SentenceTransformer, util

# sbert model
#model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model = SentenceTransformer("thenlper/gte-large")

# similarity function
# vectorize input text and use cosine similarity to compare to encoded rules
def get_sim(model, rule_map, encoded_rules, q):
    sim_scores = util.cos_sim(model.encode([q]), encoded_rules)
    return dict(zip(rule_map.keys(), sim_scores[0].tolist()))

# label via max score
def label_from_max(scores, rule_map):
    max_rule = max(scores, key=scores.get)
    label = rule_map[max_rule]
    return max_rule, label, scores[max_rule]

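# if a keyword is found, get a context window of n words on each side for similarity scoring
# note: keywords are matched as substrings of single words, so a multi-word keyword
# such as "high school" never matches, and "ged" also matches words like "managed"
def get_context(text, keywords):
    n = 4  # number of words kept on each side of a keyword hit
    text = text.lower()
    text = re.sub(r'[^a-z0-9]+', ' ', text)  # replace punctuation runs with a single space

    words = text.split()
    # positions of words that contain any keyword
    found_index = [i for i, w in enumerate(words) if any(k.strip() in w for k in keywords)]
    # clip each window to the bounds of the text
    context = [" ".join(words[max(0, idx-n):min(idx+n+1, len(words))]) for idx in found_index]

    # windows are joined with '|' and split again by the caller
    return '|'.join(context)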

## helper function to run CREAM on all data points
def __helper__(row):
    global keywords
    global model
    global rule_map
    global encoded_rules
    
    THRESHOLD = 0.9
        
    text = row["text"]
    context = get_context(text, keywords).split('|')
    
    if len(context) > 0 and context[0] != "":
        all_scores = []
        for c in context:
            scores = get_sim(model, rule_map, encoded_rules, c)
            all_scores.append(label_from_max(scores, rule_map))
        max_score = max(all_scores, key=itemgetter(2))
        if max_score[2] >= THRESHOLD:
            # return order matches the assigned columns: rule, label, confidence
            return max_score[0], max_score[1], max_score[2]
        else:
            return None, None, 0
    else:
        return None, None, None
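
To make the context-window step concrete, the sketch below runs a copy of `get_context` on a made-up sentence. It exposes `n` as a parameter for illustration only. The sample text and the two-keyword list are hypothetical.

```python
import re

def get_context(text, keywords, n=4):
    """Copy of the function above, with the window size exposed as a parameter."""
    text = re.sub(r'[^a-z0-9]+', ' ', text.lower())
    words = text.split()
    found_index = [i for i, w in enumerate(words) if any(k.strip() in w for k in keywords)]
    return '|'.join(
        " ".join(words[max(0, idx - n):min(idx + n + 1, len(words))]) for idx in found_index
    )

# Hypothetical job ad sentence.
sample = "Requirements: Bachelor's degree in accounting or finance and 3+ years of experience."
windows = get_context(sample, ["bachelor", "degree"]).split('|')

for w in windows:
    print(w)
# requirements bachelor s degree in accounting
# requirements bachelor s degree in accounting or finance
```

Each keyword hit produces its own window, so nearby hits yield overlapping snippets. Scoring every snippet and keeping the best one, as the helper does, makes that overlap harmless.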

In [14]:
# load sample keywords, rules, and data
keywords = [
    "bachelor",
    "master",
    "degree",
    "high school",
    "education",
    "diploma",
    "ged",
    "certification"
]

rules = pd.read_csv("data/demo/CREAM_ed_requirements_onet_coded.csv")
encoded_rules = model.encode(rules['rule'].tolist())
rule_map = dict(zip(rules['rule'].tolist(), rules['category'].tolist()))

data[['ed_inferred_rule', 'ed_inferred_label', 'ed_inferred_confidence']] = data.apply(__helper__, axis=1, result_type="expand")
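
The labeling step reduces to cosine similarity between a snippet embedding and each rule embedding, followed by a threshold on the best score. The sketch below shows that logic with NumPy on made-up three-dimensional vectors. The rules, categories, and vectors are all hypothetical, and real embeddings would come from `model.encode`.

```python
import numpy as np

# Hypothetical rules and categories with made-up 3-d "embeddings".
rule_map = {
    "bachelor s degree required": "Bachelor's",
    "high school diploma or ged": "High school",
}
encoded_rules = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

def cosine_scores(query_vec, rule_matrix):
    """Cosine similarity between one query vector and each row of rule_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    r = rule_matrix / np.linalg.norm(rule_matrix, axis=1, keepdims=True)
    return r @ q

def label_with_threshold(query_vec, threshold=0.9):
    """Return (rule, label, confidence), or (None, None, 0.0) if below the threshold."""
    scores = cosine_scores(query_vec, encoded_rules)
    best = int(np.argmax(scores))
    rule = list(rule_map)[best]
    if scores[best] >= threshold:
        return rule, rule_map[rule], float(scores[best])
    return None, None, 0.0

close = label_with_threshold(np.array([0.95, 0.05, 0.0]))  # nearly parallel to rule 0
far = label_with_threshold(np.array([0.6, 0.6, 0.5]))      # ambiguous between rules
print(close)
print(far)
```

A high threshold such as 0.9 trades coverage for precision: ambiguous snippets stay unlabeled rather than being forced into the nearest category.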

In [None]:
data.to_excel('data/demo/coded_output.xlsx', index=False)
data