# JAAT Demo


A short demonstration of the Job Ad Analysis Toolkit (JAAT), which efficiently and accurately extracts information from job ad texts.

# Simple Startup

In [None]:
from pprint import pprint
import nltk
from tqdm.auto import tqdm
import time
import pandas as pd
import random
from pathlib import Path

import sys
#sys.path.insert(0, "/path/to/JAAT/") # TODO: edit!
sys.path.insert(0, "/home/sjmeis/JAAT/") # TODO: edit!

from JAAT import TaskMatch

In [2]:
# load TaskMatch - performs setup steps
TM = TaskMatch()

INIT
Preparing embeddings...


Batches:   0%|          | 0/295 [00:00<?, ?it/s]

Setting up pipeline...
Finished.


# Extracting Task IDs from Job Ad Texts

Let's start off with a sample job ad:

In [3]:
job = """
Junior Full Stack Software Developer

Description
Develops software solutions by studying information needs; conferring with users; studying systems flow, data usage, and work processes; investigating problem areas; following the software development lifecycle.
Determines operational feasibility by evaluating analysis, problem definition, requirements, solution development, and proposed solutions.
Documents and demonstrates solutions by developing documentation, flowcharts, layouts, diagrams, charts, code comments and clear code.
Supports and develops software developers by providing advice, coaching and educational opportunities.
Other duties as required.

About Us
We are a small team of dedicated professionals that work to support the business objectives of our company as well as developing innovative software solutions for other companies in our industry. If you join our team, you will have the opportunity to work with a wide variety of technologies in a fast-paced development environment that caters to innovation and efficiency as opposed to rigid processes and ingrained mentalities. If you like to code, can follow other people’s code, can work in a team, and for a team, we say come and talk to us (mention ‘verko’ in your cover letter). We believe our company is a nice place to work and grow your skills, where working smart is appreciated as much as working hard.
Requirements:

Experience Requirements
Bachelor’s degree in computer science, MIS, other related field or relevant experience.
Experience with the Microsoft .NET technology stack (C#, MVC, Web API, Web Forms, etc.)
Experience with JavaScript frameworks (ReactJS, Node.js preferred).
Experience with relational databases (MS SQL preferred)
Experience with code versioning tools, such as Git.
Experience with modern software design patterns, debugging and refactoring.
Familiarity with continuous integration and automated build products like Team City and Azure DevOps
Geographical Requirements
Applicants from Glastonbury/Hartford CT and the vicinity will be favored.
Applicants from outside of New England states will not be considered.
"""

To extract task IDs from the job, simply call:

In [4]:
tasks = TM.get_tasks(job)
pprint(tasks, width=120)

[('16363', 'Identify operational requirements for new systems to inform selection of technological solutions.'),
 ('16987', 'Prepare documentation or presentations, including charts, photos, or graphs.'),
 ('9583', 'Assign duties to other staff and give instructions regarding work methods and routines.')]


## Great! But let's look under the hood

Before matching to task IDs, TaskMatch first identifies candidate sentences. To do this, a classifier model identifies which segments of the job ad text are potentially task statements. Going back to the example:

In [5]:
candidates = ["({}) ".format(i+1)+x.strip() for i, x in enumerate(TM.get_candidates(job))]
pprint(candidates, width=150)

['(1) Junior Full Stack Software Developer\n'
 '\n'
 'Description\n'
 'Develops software solutions by studying information needs; conferring with users; studying systems flow, data usage, and work processes; '
 'investigating problem areas; following the software development lifecycle.',
 '(2) Determines operational feasibility by evaluating analysis, problem definition, requirements, solution development, and proposed solutions.',
 '(3) Documents and demonstrates solutions by developing documentation, flowcharts, layouts, diagrams, charts, code comments and clear code.',
 '(4) Supports and develops software developers by providing advice, coaching and educational opportunities.',
 '(5) Other duties as required.',
 '(6) Experience with modern software design patterns, debugging and refactoring.']


In [6]:
len(nltk.sent_tokenize(job))

16

So we see that six candidates are identified out of the 16 "sentences" in the job ad. From these we can narrow down to three matched tasks.
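The narrowing described above can be summarized as a simple funnel. The sketch below is illustrative only: `funnel_summary` is a hypothetical helper, not part of JAAT, and the counts are copied from the outputs above.

```python
def funnel_summary(n_sentences, n_candidates, n_tasks):
    """Return the share of items kept at each stage of the pipeline."""
    if n_sentences <= 0:
        raise ValueError("n_sentences must be positive")
    return {
        # Share of tokenized sentences the classifier kept as candidates.
        "candidate_rate": n_candidates / n_sentences,
        # Share of candidates that ended up matched to a task ID.
        "match_rate": n_tasks / n_candidates if n_candidates else 0.0,
    }


# Counts from the example above: 16 sentences, 6 candidates, 3 matched tasks.
summary = funnel_summary(16, 6, 3)
print(summary)
```

In the notebook, the three counts would come from `len(nltk.sent_tokenize(job))`, `len(TM.get_candidates(job))`, and `len(TM.get_tasks(job))`.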

## Batch Processing

What if we want to process many job ads at once? TaskMatch provides a batch processing function, `get_tasks_batch`.

Let's first load in a large file of job ad "sentences".

In [7]:
with open("./data/demo/TaskMatch/sample_job_ads_in_line.csv", 'r') as f:
    sentences = [x.strip() for x in f.readlines()]
len(sentences)

27710

Let's look at a couple samples.

In [8]:
sentences[1]

'Pack luggage for travel and move bags to the proper place.'

In [9]:
sentences[11]

'Play with dogs.'

In [10]:
sentences[111]

'Participate in the call back of service requests to monitor resident satisfaction.'

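Let's first see how long it would take to process these texts sequentially. To keep the demo short, we time only a small sample (about one thousandth of the sentences) and multiply by 1000, so the printed value is a rough estimate, in seconds, for the full file.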

In [11]:
start = time.time()
for t in tqdm(sentences[:int(len(sentences)/1000)]):
    res = TM.get_tasks(t)
end = time.time() - start
print(end*1000)

  0%|          | 0/27 [00:00<?, ?it/s]



2736.9091510772705
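The extrapolation used in this timing can be wrapped in a small helper that scales by the exact ratio of total to sampled items instead of a fixed factor of 1000. This is a minimal sketch: `estimate_total_seconds` is a hypothetical helper, and the stand-in workload (`str.upper`) is an assumption so the block runs without JAAT. In the notebook, `func` would be `TM.get_tasks`.

```python
import time


def estimate_total_seconds(func, items, sample_size):
    """Time func on the first sample_size items and scale to len(items)."""
    sample = items[:sample_size]
    if not sample:
        raise ValueError("sample is empty")
    start = time.perf_counter()
    for item in sample:
        func(item)
    elapsed = time.perf_counter() - start
    # Scale by the true ratio of total items to sampled items.
    return elapsed * len(items) / len(sample)


# Stand-in workload (assumption): replace str.upper with TM.get_tasks.
demo_items = ["Play with dogs."] * 1000
estimate = estimate_total_seconds(str.upper, demo_items, sample_size=10)
print(f"Estimated sequential time: {estimate:.6f} s")
```

A small sample keeps the demo quick, but the estimate is only as good as the sample is representative, so treat it as an order-of-magnitude figure.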


And now using our batch processing...

In [12]:
start = time.time()
res = TM.get_tasks_batch(sentences)
end = time.time() - start
print(end)

6.258026599884033


Much faster!
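To put a number on "much faster", the two timings printed above can be compared directly. This is plain arithmetic on the outputs shown, bearing in mind that the sequential figure is an extrapolation from 27 sentences rather than a full measurement.

```python
# Timings copied from the outputs above, both in seconds.
sequential_estimate_s = 2736.9091510772705  # extrapolated from a 27-sentence sample
batch_s = 6.258026599884033                 # measured on all 27,710 sentences

speedup = sequential_estimate_s / batch_s
print(f"Batch processing is roughly {speedup:.0f}x faster")
```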

Finally, here are the matched tasks for our three examples:

In [13]:
pprint(res[1], width=150)

[('3151', 'Load and unload baggage in baggage compartments.')]


In [14]:
pprint(res[11], width=150)

[('4321', 'Exercise animals or provide them with companionship.')]


In [15]:
pprint(res[111], width=150)

[]
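As the last example shows, a sentence for which no task is matched yields an empty list. Below is a small sketch of how one might separate matched from unmatched sentences, assuming (as in the outputs above) that the batch result is a list aligned with the input and that each entry is a list of `(task_id, description)` tuples. `split_matches` is a hypothetical helper, not part of JAAT.

```python
def split_matches(sentences, results):
    """Map matched sentences to their task IDs and collect unmatched ones."""
    if len(sentences) != len(results):
        raise ValueError("sentences and results must be the same length")
    matched, unmatched = {}, []
    for sentence, tasks in zip(sentences, results):
        if tasks:
            matched[sentence] = [task_id for task_id, _ in tasks]
        else:
            unmatched.append(sentence)
    return matched, unmatched


# The three sample sentences and their outputs, copied from above.
demo_sentences = [
    "Pack luggage for travel and move bags to the proper place.",
    "Play with dogs.",
    "Participate in the call back of service requests to monitor resident satisfaction.",
]
demo_results = [
    [("3151", "Load and unload baggage in baggage compartments.")],
    [("4321", "Exercise animals or provide them with companionship.")],
    [],
]
matched, unmatched = split_matches(demo_sentences, demo_results)
print(matched)
print(unmatched)
```

In the notebook, the same call would be `split_matches(sentences, res)`.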


# Matching Titles from Job Ad Titles

What if we want to match job titles found in job ads to those established in the O*NET framework? TitleMatch does exactly that.

In [16]:
from JAAT import TitleMatch
TiM = TitleMatch()

INIT
Loading data...
Preparing embeddings...


Batches:   0%|          | 0/1424 [00:00<?, ?it/s]

Now let's load a sample of titles.

In [17]:
data = pd.read_excel("data/demo/TitleMatch/JobAdsData2022_OSF.xlsx")
data.job_title

0                                     SBA Loan Associate
1                                         Credit Analyst
2                              Scientist, Lead Discovery
3                                   Maintenance Director
4                                             Engineer I
                             ...                        
985                 $1500 Signing Bonus Home Health Aide
986    Home Health Aide - Home Health - New River Val...
987                                     Home Health Aide
988                                     Home Health Aide
989                          Medical Lab Technician/UKHC
Name: job_title, Length: 990, dtype: object

In [18]:
data.job_title[42]

'Operations Engineer'

In [19]:
data.job_title[123]

'ESL Processing Operator'

We can use TitleMatch to match a single text:

In [20]:
TiM.get_title(data.job_title[42])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[('Operations Engineer', '17-2112.00', 1.0)]

In [21]:
TiM.get_title(data.job_title[123])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[('Processing Operator', '51-3091.00', 0.943)]

Use the same function to process a list of job titles all at once!

In [22]:
results = TiM.get_title(data.job_title.to_list())

Batches:   0%|          | 0/31 [00:00<?, ?it/s]

Looking at a random sample of 10 results...

In [23]:
d = dict(zip(data.job_title, results))
pprint({k:d[k] for k in random.sample(sorted(d), k=10)}, width=150)

{'DevSecOps Solutions Architect': ('Solutions Architect', '15-1241.00', 0.909),
 'Digital Technology Architect, Pharma': ('Informatics Pharmacist', '15-1211.01', 0.901),
 'Family Nurse Practitioner - Cherry Hill, NJ Area Locations': ('Family Health Nurse Practitioner', '29-1171.00', 0.885),
 'Firmware Engineer (Features / Embedded Systems / Product Software), Bachelors (Meraki)': ('Firmware Engineer', '15-1299.07', 0.932),
 'Loan Servicing Admin I': ('Loan Administrator', '13-2072.00', 0.923),
 'Salesforce Analyst/Developer': ('Salesforce Developer', '15-1211.00', 0.963),
 'Software Architect Fellow': ('Software Architect', '15-1252.00', 0.944),
 'Sr Accounting Specialist': ('Accounting Specialist', '43-3031.00', 0.949),
 'Visiting Assistant Professor in Sociology': ('Sociology Professor', '25-1067.00', 0.942),
 'Web Design Art Director': ('Digital Art Director', '27-1011.00', 0.936)}
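Each result appears to be a `(matched_title, code, score)` tuple, where the code looks like an O*NET occupation code and the last value like a similarity score between 0 and 1. Both readings are inferred from the outputs above, not from JAAT's documentation. Under that assumption, the sketch below flags matches that fall below a cutoff so they can be reviewed by hand. The helper `flag_low_scores` and the 0.9 cutoff are illustrative choices, not part of JAAT.

```python
def flag_low_scores(titles, results, threshold=0.9):
    """Return (title, matched_title, score) rows whose score is below threshold."""
    rows = []
    for title, (matched_title, _code, score) in zip(titles, results):
        if score < threshold:
            rows.append((title, matched_title, score))
    return rows


# A few title/result pairs copied from the sample output above.
demo_titles = [
    "Digital Technology Architect, Pharma",
    "Family Nurse Practitioner - Cherry Hill, NJ Area Locations",
    "Salesforce Analyst/Developer",
]
demo_results = [
    ("Informatics Pharmacist", "15-1211.01", 0.901),
    ("Family Health Nurse Practitioner", "29-1171.00", 0.885),
    ("Salesforce Developer", "15-1211.00", 0.963),
]
flagged = flag_low_scores(demo_titles, demo_results)
print(flagged)
```

Iterating over aligned lists, rather than a dict keyed by title, also keeps repeated titles such as 'Home Health Aide', which a dict would collapse into a single entry. In the notebook, the call would be `flag_low_scores(data.job_title, results)`.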


# Extracting Firm Names from Text Documents

Using FirmExtract, we can efficiently extract firm names from text documents, be it job ads or other documents.

In [24]:
from JAAT import FirmExtract
FE = FirmExtract()

INIT




Now let's load a sample of 10 franchise documents, found in our demo data folder.

In [25]:
texts = []
for file in Path("data/demo/FirmExtract/").glob("*"):
    with open(file, 'r') as f:
        texts.append(f.read().strip())
len(texts)

10

Let's focus on one document first:

In [26]:
texts[9]



In [27]:
FE.get_firm(texts[9])

{'federal trade commission', 'togo franchisor llc'}

Much like the other tools, you can also use the batch version of FirmExtract for much quicker processing:

In [28]:
FE.get_firm_batch(texts)

  0%|          | 0/10 [00:00<?, ?it/s]

[{'federal trade commission.'},
 {'buymg'},
 {'trade'},
 {'adkins carter carter', 'carter carter'},
 {'franchising', 'trade'},
 {'healthy'},
 {'ace', 'ace sushi franchise', 'franchise'},
 None,
 None,
 {'federal trade commission', 'togo franchisor llc'}]
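The batch output above mixes sets of names with `None` (presumably for documents where no firm name was extracted), and one name carries a trailing period. A minimal post-processing sketch under those observations follows. `normalize_firms` is a hypothetical helper, not part of JAAT.

```python
def normalize_firms(batch_results):
    """Replace None with an empty set and strip trailing periods and spaces."""
    cleaned = []
    for firms in batch_results:
        if firms is None:
            cleaned.append(set())
        else:
            # Note: this also trims a final period from abbreviations like "inc.".
            cleaned.append({name.strip().rstrip(".").strip() for name in firms})
    return cleaned


# Three entries copied from the batch output above.
demo = [
    {"federal trade commission."},
    None,
    {"federal trade commission", "togo franchisor llc"},
]
cleaned = normalize_firms(demo)
print(cleaned)
```

Returning an empty set instead of `None` lets downstream code iterate over every entry without a special case.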

# That's all! Start using JAAT today.