# Job Ad Analysis Toolkit 

## A Demonstration from Unstructured Job Ad Text to Data


### Tools: TitleMatch, FirmExtract, TaskMatch, and CREAM

Let's code up 990 Job Ads with the following machine tools:

1. TitleMatch - Matches job titles to O*NET occupation codes.
2. FirmExtract - Extracts candidate firm names from job ad (or other) text.
3. TaskMatch - Matches job ad text to O*NET task IDs.
4. CREAM - Develops custom classifiers with embeddings by augmenting a human-curated list of 'rules'. The demo uses education requirements.

### Acknowledgements

This project has received generous support from the National Labor Exchange, the Russell Sage Foundation, and the Washington Center for Equitable Growth.

Job ad data for this demo is from:

```

Zhou, Steven, John Aitken, Peter McEachern, and Renee McCauley. “Data from 990 Public Real-World Job Advertisements Organized by O*NET Categories.” Journal of Open Psychology Data 10 (November 21, 2022): 17. https://doi.org/10.5334/jopd.69.

```

In [None]:
import pandas as pd
from operator import itemgetter

data = pd.read_excel("data/demo/TitleMatch/JobAdsData2022_OSF.xlsx")
data['text'] = data['text'].str.strip()
data.text = data.text.replace(r'\n','. ', regex=True)
data.text = data.text.apply(lambda x: x.lower() if isinstance(x, str) else x)
# We will work with the columns 'job_title' and 'text' (a concatenation of all text)
data[['job_title','text']]

In [None]:
data.info()

# Title Match: Match Job Titles to O*NET Occupation Codes

From:

'Operations Engineer'
'ESL Processing Operator'

To: 

[('Operations Engineer', '17-2112.00', 1.0)]
[('Processing Operator', '51-3091.00', 0.943)]

In [None]:
from JAAT import TitleMatch
TiM = TitleMatch()

data['TitleMatch'] = TiM.get_title(data.job_title.to_list())
data['TM_onet_id'] = data.TitleMatch.apply(itemgetter(1)).astype(str)


In [None]:
data[['job_title', 'TM_onet_id']]
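
The result tuples shown above can also be unpacked into separate columns. The following is a minimal sketch on toy data, assuming each result is a single `(matched_title, onet_code, score)` tuple as in the example output; the `TM_title`, `TM_score`, and `needs_review` column names and the `REVIEW_BELOW` cutoff are hypothetical and not part of JAAT.

```python
from operator import itemgetter

import pandas as pd

# Toy results in the (matched_title, onet_code, score) shape shown above.
demo = pd.DataFrame({
    "job_title": ["Operations Engineer", "ESL Processing Operator"],
    "TitleMatch": [
        ("Operations Engineer", "17-2112.00", 1.0),
        ("Processing Operator", "51-3091.00", 0.943),
    ],
})

# Split each tuple into its three parts.
demo["TM_title"] = demo["TitleMatch"].apply(itemgetter(0))
demo["TM_onet_id"] = demo["TitleMatch"].apply(itemgetter(1)).astype(str)
demo["TM_score"] = demo["TitleMatch"].apply(itemgetter(2)).astype(float)

# Hypothetical cutoff: flag weaker matches for manual review.
REVIEW_BELOW = 0.95
demo["needs_review"] = demo["TM_score"] < REVIEW_BELOW

print(demo[["job_title", "TM_onet_id", "TM_score", "needs_review"]])
```

Keeping the score next to the code makes it easy to spot titles that were only partially matched, such as 'ESL Processing Operator' here.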

# Firm Extract: Extracts Candidate Firm Names from Job Ad (or other) Text

From:

"First American Equipment Finance, an RBC/City National Company, is a growing, national leader providing equipment leasing and equipment finance services to commercial borrowers in all fifty states. In 2020, the company earned top honors among midsize companies for the Best Companies to Work for in New York for the 3rd consecutive year, and in 2021, added to their accolades by being recognized as the #1 top workplace among midsized companies in Rochester. With national headquarters in Rochester (Woodcliff Office Park, Fairport), First American has approximately 270 employees and manages a $2bn portfolio.  

To: 

{'american', 'first american'}

From: 

As a key member of the Lead Discovery team, reporting to the Head of Lead Discovery, the Scientist, Lead Discovery will develop and implement biochemical assays for internal and external use and collaborate with a cross-functional team of scientists within the company and with external partners to support drug discovery efforts across the Accent portfolio.	

To: 
{'accent'}

In [None]:
from JAAT import FirmExtract
FE = FirmExtract()

data['FirmExtract'] = FE.get_firm_batch(data.text.to_list())

In [None]:
data[['company','FirmExtract']]
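
One way to check the extracted candidates against the `company` column is a simple whole-word containment test. This is a rough sketch, not part of JAAT: the helper names `normalize` and `candidate_matches_company` are hypothetical, and it assumes the candidates arrive as a set of lowercase strings as in the examples above.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def candidate_matches_company(company: str, candidates: set) -> bool:
    """Return True if any candidate occurs as a whole-word phrase in the company name."""
    padded = f" {normalize(company)} "
    return any(
        f" {normalize(c)} " in padded
        for c in candidates
        if normalize(c)  # skip candidates that are empty after cleaning
    )


# Company name and candidates taken from the first example above.
hit = candidate_matches_company(
    "First American Equipment Finance", {"american", "first american"}
)
# "Example Holdings" is a made-up company name for the negative case.
miss = candidate_matches_company("Example Holdings", {"accent"})
print(hit, miss)
```

Padding with spaces avoids counting a candidate that only matches part of a longer word.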

# Task Match: Match Job Ad Text to O*NET Task IDs

From text: 

```
Develops software solutions by studying information needs; conferring with users; studying systems flow, data usage, and work processes; investigating problem areas; following the software development lifecycle.
Determines operational feasibility by evaluating analysis, problem definition, requirements, solution development, and proposed solutions.
Documents and demonstrates solutions by developing documentation, flowcharts, layouts, diagrams, charts, code comments and clear code.
Supports and develops software developers by providing advice, coaching and educational opportunities.
Other duties as required.

```

To O*NET Task IDs:

```

[('16363', 'Identify operational requirements for new systems to inform selection of technological solutions.'),
 ('16987', 'Prepare documentation or presentations, including charts, photos, or graphs.'),
 ('9583', 'Assign duties to other staff and give instructions regarding work methods and routines.')]

```

In [None]:
from JAAT import TaskMatch
TM = TaskMatch()

data['TaskMatch'] = TM.get_tasks_batch(data.text.to_list())

In [None]:
data[['TaskMatch']]
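
For analysis it is often easier to have one row per ad and task. The sketch below reshapes toy data into that long format, assuming each cell holds a list of `(task_id, description)` tuples as in the example output; the `ad_id` column and the variable names are hypothetical.

```python
import pandas as pd

# Toy data: the first ad has the three matches shown above, the second has none.
demo = pd.DataFrame({
    "ad_id": [0, 1],
    "TaskMatch": [
        [
            ("16363", "Identify operational requirements for new systems to inform selection of technological solutions."),
            ("16987", "Prepare documentation or presentations, including charts, photos, or graphs."),
            ("9583", "Assign duties to other staff and give instructions regarding work methods and routines."),
        ],
        [],
    ],
})

# One output row per (ad, task) pair; ads without matches contribute no rows.
rows = [
    {"ad_id": ad_id, "task_id": task_id, "task": task}
    for ad_id, matches in zip(demo["ad_id"], demo["TaskMatch"])
    for task_id, task in matches
]
long = pd.DataFrame(rows, columns=["ad_id", "task_id", "task"])

# Number of matched tasks per ad.
tasks_per_ad = long.groupby("ad_id").size()
print(long[["ad_id", "task_id"]])
print(tasks_per_ad)
```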

# CREAM: Adapt a Classification Scheme or Build a Novel Lexicon

From initial rules:


And natural language in a corpus:



To labeled data:


In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
import json
import re
from operator import itemgetter  # used by __helper__ below

# sbert model
#model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model = SentenceTransformer("thenlper/gte-large")

# similarity function
# vectorize input text and use cosine similarity to compare to encoded rules
def get_sim(model, rule_map, encoded_rules, q):
    sim_scores = util.cos_sim(model.encode([q]), encoded_rules)
    return dict(zip(rule_map.keys(), sim_scores[0].tolist()))

# label via max score
def label_from_max(scores, rule_map):
    max_rule = max(scores, key=scores.get)
    label = rule_map[max_rule]
    return max_rule, label, scores[max_rule]

# if keyword found, get context window for similarity scoring
def get_context(text, keywords):
    n = 4
    text = text.lower()
    text = re.sub(r'[^a-z0-9]+', ' ', text)

    words = text.split()
    found_index = [i for i, w in enumerate(words) if any(k.strip() in w for k in keywords)]
    context = [" ".join(words[max(0, idx-n):min(idx+n+1, len(words))]) for idx in found_index]

    return '|'.join(context)

## helper function to run CREAM on all data points
def __helper__(row):
    global keywords
    global model
    global rule_map
    global encoded_rules
    
    THRESHOLD = 0.9
        
    text = row["text"]
    context = get_context(text, keywords).split('|')
    
    if len(context) > 0 and context[0] != "":
        all_scores = []
        for c in context:
            scores = get_sim(model, rule_map, encoded_rules, c)
            all_scores.append(label_from_max(scores, rule_map))
        max_score = max(all_scores, key=itemgetter(2))
        if max_score[2] >= THRESHOLD:
            return max_score[1], max_score[2], max_score[0]
        else:
            return None, 0, None
    else:
        return None, None, None
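
To see what the context windows look like without loading the embedding model, the sketch below uses a lightly adapted copy of `get_context` that returns a list instead of a `|`-joined string. The sample sentences are made up for illustration.

```python
import re


def get_context(text, keywords, n=4):
    """Return windows of up to n words on each side of every keyword hit."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    words = text.split()
    # A token is a hit if any keyword occurs as a substring of it.
    hits = [i for i, w in enumerate(words) if any(k.strip() in w for k in keywords)]
    return [" ".join(words[max(0, i - n): i + n + 1]) for i in hits]


sample = "Requirements: Bachelor's degree in accounting or a related field; CPA preferred."
windows = get_context(sample, ["bachelor", "degree"])
for w in windows:
    print(w)

# Matching is done per token, so a multi-word keyword never matches on its own.
print(get_context("High school diploma required.", ["high school"]))
```

Because keywords are compared against single tokens, a phrase such as "high school" is only picked up when another keyword (for example "diploma") appears nearby.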

In [None]:
# load sample keywords, rules, and data
keywords = [
    "bachelor",
    "master",
    "degree",
    "high school",
    "education",
    "diploma",
    "ged",
    "certification"
]

rules = pd.read_csv("data/demo/CREAM_ed_requirements_onet_coded.csv")
encoded_rules = model.encode(rules['rule'].tolist())
rule_map = dict(zip(rules['rule'].tolist(), rules['category'].tolist()))

# __helper__ returns (label, confidence, rule), so the columns are named in that order
data[['ed_inferred_label', 'ed_inferred_confidence', 'ed_inferred_rule']] = data.apply(__helper__, axis=1, result_type="expand")
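
The labeling step (cosine similarity against the encoded rules, then a threshold on the best score) can be illustrated without the sentence-transformer model. This is a toy sketch: the two-dimensional vectors stand in for real embeddings, and the rules, categories, and function names are hypothetical rather than taken from the rules file.

```python
import numpy as np

# Hypothetical rules mapped to hypothetical categories.
rule_map = {
    "bachelor s degree required": "Bachelor's degree",
    "high school diploma or ged": "High school",
}
# Toy 2-D vectors standing in for model.encode(rules).
encoded_rules = np.array([[1.0, 0.0], [0.0, 1.0]])


def cos_sim(query, rules):
    """Cosine similarity between one query vector and each row of rules."""
    query = query / np.linalg.norm(query)
    rules = rules / np.linalg.norm(rules, axis=1, keepdims=True)
    return rules @ query


def label(query_vec, threshold=0.9):
    """Return (label, confidence, rule) for the best rule, or (None, 0, None) below threshold."""
    scores = cos_sim(query_vec, encoded_rules)
    best = int(np.argmax(scores))
    rule = list(rule_map)[best]
    if scores[best] >= threshold:
        return rule_map[rule], float(scores[best]), rule
    return None, 0, None


close = label(np.array([0.99, 0.05]))  # nearly parallel to the first rule
ambiguous = label(np.array([1.0, 1.0]))  # equally far from both rules
print(close)
print(ambiguous)
```

A high threshold such as 0.9 trades coverage for precision: ambiguous contexts are left unlabeled rather than forced into the nearest category.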

# JobTag

JobTag applies binary ML classifiers, built with CRAML and CREAM, that flag niche job ad features (telework, independent contractor, etc.).

Classes currently available (keys):
* wfh
* ind_contractor
* proflicenses
* driverslicense
* yesunion
* GovContract
* CitizenshipReq
* WorkAuthReq
* VisaInclude
* VisaExclude

See the [Zenodo repository](https://zenodo.org/records/7454652) and the following for details on the classifiers:

```
    Meisenbacher, Stephen, and Peter Norlander. 2023. “Transforming Unstructured Text into Data with Context Rule Assisted Machine Learning (CRAML).” arXiv. https://doi.org/10.48550/arXiv.2301.08549.
```

In [None]:

from JAAT import JobTag

# Initialize the JobTag class with the desired class name
class_name = "proflicenses"  # Replace with the actual class name you want to use
job_tagger = JobTag(class_name)

# Example text to classify
text = "Your example job description or text here."

# Get the tag for the single text
tag, confidence = job_tagger.get_tag(text)
print(f"Tag: {tag}, Confidence: {confidence}")

# Example list of texts to classify in batch
texts = [
    "Remote opportunity.",
    "Independent contractor.",
    "We do not sponsor H-1B visas.",
    "U.S. Citizens and Permanent Residents only."
]

# Get tags for the batch of texts
tags = job_tagger.get_tag_batch(texts)
print("Batch Tags:", tags)

In [None]:

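# Keys of the classes listed above
classes = [
    "wfh", "ind_contractor", "proflicenses", "driverslicense", "yesunion",
    "GovContract", "CitizenshipReq", "WorkAuthReq", "VisaInclude", "VisaExclude",
]

# Initialize JobTag for each class and apply to the data
for class_name in classes: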
    print(f"Processing {class_name}...")
    job_tagger = JobTag(class_name)
    
    # Use get_tag_batch to process all texts at once
    tags = job_tagger.get_tag_batch(data['text'].tolist())
    
    # Create a new column for each class
    data[f'JobTag_{class_name}'] = [tag[1] for tag in tags]  # tag[1] is the confidence score

print("JobTag processing complete.")

# Display the first few rows of the updated dataframe
print(data.head())

# Optional: Save the updated dataframe
# data.to_csv('updated_data_with_jobtags.csv', index=False)
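
Once the `JobTag_*` columns exist, they can be summarized as the share of ads flagged per class. The sketch below uses toy scores and assumes the stored values are confidence scores between 0 and 1; the `CUTOFF` of 0.5 is a hypothetical choice, not a JAAT default.

```python
import pandas as pd

# Toy confidence scores standing in for the JobTag_* columns created above.
demo = pd.DataFrame({
    "JobTag_wfh": [0.91, 0.12, 0.77, 0.05],
    "JobTag_yesunion": [0.02, 0.64, 0.08, 0.01],
})

CUTOFF = 0.5  # hypothetical decision threshold

# Select the JobTag columns and convert scores to boolean flags.
tag_cols = [c for c in demo.columns if c.startswith("JobTag_")]
flags = demo[tag_cols] >= CUTOFF

# Share of ads flagged for each class.
share = flags.mean()
print(share)
```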



In [None]:
data.to_excel('data/demo/coded_output.xlsx', index=False)
data