# JAAT Demo


A short demonstration of the Job Ad Analysis Toolkit (JAAT), which efficiently and accurately extracts information from job ad texts.

# Simple Startup

In [None]:
from pprint import pprint
import nltk
from tqdm.auto import tqdm
import time
import pandas as pd
import random
from pathlib import Path

from JAAT import TaskMatch

In [2]:
# load TaskMatch - performs setup steps
TM = TaskMatch()

INIT
Preparing embeddings...


Batches:   0%|          | 0/295 [00:00<?, ?it/s]

Setting up pipeline...
Finished.


# Extracting Task IDs from Job Ad Texts

Let's start off with a sample job ad:

In [3]:
job = """
Junior Full Stack Software Developer

Description
Develops software solutions by studying information needs; conferring with users; studying systems flow, data usage, and work processes; investigating problem areas; following the software development lifecycle.
Determines operational feasibility by evaluating analysis, problem definition, requirements, solution development, and proposed solutions.
Documents and demonstrates solutions by developing documentation, flowcharts, layouts, diagrams, charts, code comments and clear code.
Supports and develops software developers by providing advice, coaching and educational opportunities.
Other duties as required.

About Us
We are a small team of dedicated professionals that work to support the business objectives of our company as well as developing innovative software solutions for other companies in our industry. If you join our team, you will have the opportunity to work with a wide variety of technologies in a fast-paced development environment that caters to innovation and efficiency as opposed to rigid processes and ingrained mentalities. If you like to code, can follow other people’s code, can work in a team, and for a team, we say come and talk to us (mention ‘verko’ in your cover letter). We believe our company is a nice place to work and grow your skills, where working smart is appreciated as much as working hard.
Requirements:

Experience Requirements
Bachelor’s degree in computer science, MIS, other related field or relevant experience.
Experience with the Microsoft .NET technology stack (C#, MVC, Web API, Web Forms, etc.)
Experience with JavaScript frameworks (ReactJS, Node.js preferred).
Experience with relational databases (MS SQL preferred)
Experience with code versioning tools, such as Git.
Experience with modern software design patterns, debugging and refactoring.
Familiarity with continuous integration and automated build products like Team City and Azure DevOps
Geographical Requirements
Applicants from Glastonbury/Hartford CT and the vicinity will be favored.
Applicants from outside of New England states will not be considered.
"""

To extract task IDs from the job, simply call:

In [4]:
tasks = TM.get_tasks(job)
pprint(tasks, width=120)

[('16363', 'Identify operational requirements for new systems to inform selection of technological solutions.'),
 ('16987', 'Prepare documentation or presentations, including charts, photos, or graphs.'),
 ('9583', 'Assign duties to other staff and give instructions regarding work methods and routines.')]
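
As the output above shows, `get_tasks` returns a list of `(task_id, description)` tuples, so the result is easy to reshape with plain Python. The following is a minimal sketch that copies the printed output into a literal so it runs without JAAT installed; the variable names `task_ids` and `task_lookup` are illustrative and not part of JAAT.

```python
# Output of TM.get_tasks(job), copied from the cell above
tasks = [
    ("16363", "Identify operational requirements for new systems to inform selection of technological solutions."),
    ("16987", "Prepare documentation or presentations, including charts, photos, or graphs."),
    ("9583", "Assign duties to other staff and give instructions regarding work methods and routines."),
]

# Keep just the IDs, in the order they were returned
task_ids = [task_id for task_id, _ in tasks]

# Map each ID to its description for quick lookups
task_lookup = dict(tasks)

print(task_ids)
print(task_lookup["16987"])
```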


## Great! But let's look under the hood

Before matching to task IDs, TaskMatch first identifies candidate sentences. To do this, a classifier model identifies which segments of the job ad text are potentially task statements. Going back to the example:

In [5]:
candidates = [f"({i}) {x.strip()}" for i, x in enumerate(TM.get_candidates(job), start=1)]
pprint(candidates, width=150)

['(1) Junior Full Stack Software Developer\n'
 '\n'
 'Description\n'
 'Develops software solutions by studying information needs; conferring with users; studying systems flow, data usage, and work processes; '
 'investigating problem areas; following the software development lifecycle.',
 '(2) Determines operational feasibility by evaluating analysis, problem definition, requirements, solution development, and proposed solutions.',
 '(3) Documents and demonstrates solutions by developing documentation, flowcharts, layouts, diagrams, charts, code comments and clear code.',
 '(4) Supports and develops software developers by providing advice, coaching and educational opportunities.',
 '(5) Other duties as required.',
 '(6) Experience with modern software design patterns, debugging and refactoring.']


In [6]:
len(nltk.sent_tokenize(job))

16

So we see that six candidates are identified out of the 16 "sentences" in the job ad. From these we can narrow down to three matched tasks.

## Batch Processing

What if we want to process many job ads at once? Use our batch processing function.

Let's first load in a large file of job ad "sentences".

In [7]:
with open("./data/demo/TaskMatch/sample_job_ads_in_line.csv", 'r') as f:
    sentences = [x.strip() for x in f.readlines()]
len(sentences)

27710

Let's look at a couple samples.

In [8]:
sentences[1]

'Pack luggage for travel and move bags to the proper place.'

In [9]:
sentences[11]

'Play with dogs.'

In [10]:
sentences[111]

'Participate in the call back of service requests to monitor resident satisfaction.'

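Let's first estimate how long it would take to process these texts sequentially. To keep the demo short, we time only the first 27 sentences (1/1000 of the file) and multiply the elapsed time by 1000.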

In [11]:
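start = time.time()
# Time only the first 1/1000 of the sentences (27 here) to keep the demo short
for t in tqdm(sentences[:len(sentences) // 1000]):
    res = TM.get_tasks(t)
end = time.time() - start
# Scale up to estimate the sequential time (in seconds) for the full file
print(end*1000)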

  0%|          | 0/27 [00:00<?, ?it/s]

5382.537603378296


And now using our batch processing...

In [12]:
start = time.time()
res = TM.get_tasks_batch(sentences)
end = time.time() - start
print(end)

133.48082518577576


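Much faster! Batch processing handles all 27,710 sentences in about 133 seconds, compared with an estimated 5,400 seconds sequentially.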
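
To put a number on the difference, the two printed timings can be compared directly. This is a rough sketch using the figures printed above. The sequential figure comes from timing only 27 sentences, so it is an extrapolation and may be inflated by one-off warm-up costs. Scaling by the actual sentence count (27,710) rather than by 1000 gives a slightly larger estimate than the printed one.

```python
# Figures printed by the two timing cells above
sample_elapsed_s = 5382.537603378296 / 1000  # sequential time for the 27-sentence sample
batch_elapsed_s = 133.48082518577576         # batch time for all sentences

n_sample = 27      # sentences timed sequentially
n_total = 27710    # sentences in the file

# Extrapolate the sequential time to the whole file
per_sentence_s = sample_elapsed_s / n_sample
sequential_estimate_s = per_sentence_s * n_total

speedup = sequential_estimate_s / batch_elapsed_s
print(f"Estimated sequential time: {sequential_estimate_s:.0f} s")
print(f"Estimated speedup from batching: {speedup:.1f}x")
```

On these numbers the batch function is roughly 40 times faster, though the exact ratio will depend on hardware and on the texts being processed.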

Finally, the task matches for our three examples:

In [13]:
pprint(res[1], width=150)

[('3151', 'Load and unload baggage in baggage compartments.')]


In [14]:
pprint(res[11], width=150)

[('4321', 'Exercise animals or provide them with companionship.')]


In [15]:
pprint(res[111], width=150)

[]


# Matching Titles from Job Ad Titles

What if we want to match job titles found in job ads with those established in the O*NET framework? Use our tool!

In [16]:
from JAAT import TitleMatch
TiM = TitleMatch()

INIT
Loading data...
Preparing embeddings...


Batches:   0%|          | 0/1424 [00:00<?, ?it/s]

Now let's load a sample of titles.

In [17]:
data = pd.read_excel("data/demo/TitleMatch/JobAdsData2022_OSF.xlsx")
data.job_title

0                                     SBA Loan Associate
1                                         Credit Analyst
2                              Scientist, Lead Discovery
3                                   Maintenance Director
4                                             Engineer I
                             ...                        
985                 $1500 Signing Bonus Home Health Aide
986    Home Health Aide - Home Health - New River Val...
987                                     Home Health Aide
988                                     Home Health Aide
989                          Medical Lab Technician/UKHC
Name: job_title, Length: 990, dtype: object

In [18]:
data.job_title[42]

'Operations Engineer'

In [19]:
data.job_title[123]

'ESL Processing Operator'

We can use TitleMatch to match a single text:

In [20]:
TiM.get_title(data.job_title[42])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[('Operations Engineer', '17-2112.00', 1.0)]

In [21]:
TiM.get_title(data.job_title[123])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[('Processing Operator', '51-3091.00', 0.943)]

Use the same function to process a list of job titles all at once!

In [22]:
results = TiM.get_title(data.job_title.to_list())

Batches:   0%|          | 0/31 [00:00<?, ?it/s]

Looking at a random sample of 10 results...

In [23]:
d = dict(zip(data.job_title, results))
pprint({k:d[k] for k in random.sample(sorted(d), k=10)}, width=150)

{'Assistant/Associate Professor of Christian Theology': ('Theology Professor', '25-1126.00', 0.947),
 'Boiler Technician': ('Boiler Technician', '47-2011.00', 1.0),
 'Business Operations Analyst': ('Business Operations Analyst', '13-1111.00', 1.0),
 'CNA I or CNA II': ('Certified Nursing Assistant (CNA)', '29-9093.00', 0.898),
 'Distribution Systems Administrator/Support': ('Distribution System Operator', '51-8012.00', 0.93),
 'Manager - Annuity Product Applications\n': ('Pension Fund Manager', '11-3031.03', 0.895),
 'Mechanical Technician I': ('Mechanical Technician', '49-2092.00', 0.97),
 'Night Customer Service Representative': ('Customer Service Representative', '43-4041.00', 0.917),
 'Physician - Venice Family Clinic': ('Family Physician', '29-1215.00', 0.914),
 'Revenue Accountant': ('Revenue Accountant', '13-2011.00', 1.0)}
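
Each result above is a `(matched_title, code, score)` tuple. Assuming the third element is a match score (it is 1.0 whenever the input and matched titles are identical), one way to use it is to flag weaker matches for manual review. The sketch below copies a few results from the output above so it runs without JAAT; the 0.9 cutoff is a hypothetical choice, not a JAAT default.

```python
# A few (matched_title, code, score) tuples copied from the output above
matches = {
    "Boiler Technician": ("Boiler Technician", "47-2011.00", 1.0),
    "CNA I or CNA II": ("Certified Nursing Assistant (CNA)", "29-9093.00", 0.898),
    "Manager - Annuity Product Applications\n": ("Pension Fund Manager", "11-3031.03", 0.895),
    "Mechanical Technician I": ("Mechanical Technician", "49-2092.00", 0.97),
}

REVIEW_BELOW = 0.9  # hypothetical cutoff, not a JAAT default

# Collect matches whose score falls below the cutoff,
# stripping stray whitespace (such as the trailing newline) from the raw title
needs_review = {
    raw_title.strip(): match
    for raw_title, match in matches.items()
    if match[2] < REVIEW_BELOW
}

for raw_title, (matched, code, score) in needs_review.items():
    print(f"{raw_title!r} -> {matched!r} ({code}, score={score})")
```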


# Extracting Firm Names from Text Documents

Using FirmExtract, we can efficiently extract firm names from text documents, be it job ads or other documents.

In [24]:
from JAAT import FirmExtract
FE = FirmExtract()

INIT


Now let's load a sample of 10 franchise documents, found in our demo data folder.

In [25]:
texts = []
for file in Path("data/demo/FirmExtract/").glob("*"):
    with open(file, 'r') as f:
        texts.append(f.read().strip())
len(texts)

10

Let's focus on one document first:

In [26]:
texts[9]

'th place east suite minnesota a i a department of mn.gov commerce. fax. an equal opportunity employer march barry kurtz attorney at law oxnard street suite woodland hills california re ace sushi franchise corporation ace sushi franchise corporation dear mister kurtz the annual report has been reviewed and is in compliance with minnesota statute chapter society and minnesota rules chapter. this means that there continues to be an effective registration statement on file and that the franchisor may offer and sell the above referenced franchise in minnesota. the franchisor is not required to escrow franchise fees post a franchise surety bond or defer receipt of franchise fees during this registration period. as a reminder the next annual report is due within days after the franchisor fiscal year end which is december. sincerely mike rothman commissioner by daniel sexton commerce analyst supervisor registration division mr des dlw state of minnesota department of commerce registration div

In [27]:
FE.get_firm(texts[9])

{'ace', 'ace sushi franchise', 'franchise'}

Much like the other tools, you can also use the batch version of FirmExtract for much quicker processing:

In [28]:
FE.get_firm_batch(texts)

  0%|          | 0/10 [00:00<?, ?it/s]

[{'trade'},
 {'franchising', 'trade'},
 {'federal trade commission', 'togo franchisor llc'},
 None,
 {'adkins carter carter', 'carter carter'},
 {'buymg'},
 {'federal trade commission.'},
 None,
 {'healthy'},
 {'ace', 'ace sushi franchise', 'franchise'}]
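
Note that the batch output contains `None` for documents where no firm name was found. A small sketch of one way to normalize this before further processing, using the output above as a literal so it runs without JAAT (the variable names are illustrative):

```python
# Output of FE.get_firm_batch(texts), copied from the cell above
batch_output = [
    {"trade"},
    {"franchising", "trade"},
    {"federal trade commission", "togo franchisor llc"},
    None,
    {"adkins carter carter", "carter carter"},
    {"buymg"},
    {"federal trade commission."},
    None,
    {"healthy"},
    {"ace", "ace sushi franchise", "franchise"},
]

# Replace None with an empty set so every entry has the same type
firms_per_doc = [names if names is not None else set() for names in batch_output]

# Indices of documents where no firm name was extracted
no_firm = [i for i, names in enumerate(firms_per_doc) if not names]

print(f"{len(no_firm)} of {len(firms_per_doc)} documents had no extracted firm: {no_firm}")
```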

# CREAM - finding the needle in a haystack

Our CREAM tool lets you define concepts (classes) and search for them across any number of texts. The goal is that any class you can think of (e.g., educational requirements) can be defined and then extracted from texts using our method. See below for an example!

First, let's create the concept of "educational requirements". To do this, all we need is a list of keywords that are relevant to this class, as well as a number of "rules" which trigger when the class is "true". First the keywords:

In [29]:
with open("data/demo/CREAM/education_keywords.txt", 'r') as f:
    keywords = [x.strip() for x in f.readlines()]
keywords

['bachelor',
 'master',
 'degree',
 'high school',
 'education',
 'diploma',
 'ged',
 'certification']

This is just an example! Now for the rules:

In [30]:
rules = pd.read_csv("data/demo/CREAM/education_rules.csv")
rules

Unnamed: 0,rule,education
0,qualifications associate degree,1
1,accredited college or university,1
2,accredited college or university,1
3,at least a bachelor degree,1
4,ba bs degree in a related,1
...,...,...
81,at least a bachelor degree,1
82,must be a graduate of an accredited,1
83,needed ba bs degree,1
84,bachelors degree experience,1


One can define arbitrarily many rules. The more the merrier! For CREAM, we need to input the rules in a specific format:

In [31]:
rules = list(rules.itertuples(index=False, name=None))
rules

[('qualifications associate degree ', 1),
 ('accredited college or university ', 1),
 ('accredited college or university ', 1),
 ('at least a bachelor degree', 1),
 ('ba bs degree in a related', 1),
 ('bachelor degree and years work experience', 1),
 ('bachelor degree ba bs', 1),
 ('bachelor degree in a', 1),
 ('bachelor degree or equivalent', 1),
 ('bachelor degree skills', 1),
 ('bs ba degree in related field', 1),
 ('bs degree or equivalent', 1),
 ('bs or ms degree', 1),
 ('college degree programs', 1),
 ('degree required bachelor degree', 1),
 ('education and experience bachelor degree', 1),
 ('education bachelor degree', 1),
 ('education from accredited colleges', 1),
 ('education level bachelor degree', 1),
 ('experience bachelor degree ', 1),
 ('experience bachelor degree and years', 1),
 ('foreign colleges or universities', 1),
 ('from an accredited college or university', 1),
 ('minimum degree required bachelor degree', 1),
 ('must be a graduate of an accredited', 1),
 ('neede
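
The list above is a sequence of `(rule_text, label)` tuples. Some entries carry trailing spaces or appear twice. Whether CREAM is sensitive to either is not stated here, so the following tidy-up is optional and is only a sketch of how one might strip whitespace and drop exact duplicates while preserving order. It uses a handful of the rules above as a literal.

```python
# The first few (rule_text, label) tuples from the output above
raw_rules = [
    ("qualifications associate degree ", 1),
    ("accredited college or university ", 1),
    ("accredited college or university ", 1),
    ("at least a bachelor degree", 1),
]

seen = set()
clean_rules = []
for text, label in raw_rules:
    rule = (text.strip(), int(label))  # drop surrounding whitespace, keep plain ints
    if rule not in seen:               # skip exact duplicates, keep first occurrence
        seen.add(rule)
        clean_rules.append(rule)

print(clean_rules)
```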

Now, we can import CREAM!

In [32]:
from JAAT import CREAM
C = CREAM(keywords=keywords, rules=rules, n=4, threshold=0.9) 
# n and threshold --> optional parameters (these are the default values)

INIT
Finished.


Now let's load the same sample of job ads that we used for TitleMatch.

In [33]:
jobs = pd.read_excel("data/demo/TitleMatch/JobAdsData2022_OSF.xlsx")
jobs

Unnamed: 0,id,onet_id,job_title,company,job_location,salary,text,responsibilities_text,requirements_text,preferred_text,company_desc
0,1,43-4131.00,SBA Loan Associate,Bankers Healthcare Group,"Fort Lauderdale, FL",,Assist Business Development Officer with SBA 7...,Assist Business Development Officer with SBA 7...,A successful candidate is a motivated administ...,,Are you ready to join a growing team that puts...
1,2,43-4131.00,Credit Analyst,First American Equipment Finance,"Rochester, NY",,The Credit Analyst is responsible for assistin...,The Credit Analyst is responsible for assistin...,"Bachelor’s degree preferred, or equivalent com...",,"First American Equipment Finance, an RBC/City ..."
2,3,19-1042.00,"Scientist, Lead Discovery","Accent Therapeutics, Inc.","Lexington, MA",,Design and execution of biochemical assays for...,Design and execution of biochemical assays for...,"BA/BS in Biochemistry, Enzymology or related f...",,"As a key member of the Lead Discovery team, re..."
3,4,49-9071.00,Maintenance Director,CWS Apartment Homes LLC,"Austin, TX",,"As a Maintenance Director, you’ll effectively ...","As a Maintenance Director, you’ll effectively ...",Able to read service requests and schedules in...,"High school diploma or GED\nHVAC, Pool and oth...",Who We Are\nWe are honored to be recognized as...
4,5,17-2112.00,Engineer I,"Freese and Nichols, Inc.","Atlanta, GA",,Freese and Nichols is seeking an Engineer I to...,Freese and Nichols is seeking an Engineer I to...,Bachelor’s Degree in Civil or Environmental En...,Master’s in Environmental Engineering,"At Freese and Nichols, everyone on our team ge..."
...,...,...,...,...,...,...,...,...,...,...,...
985,986,31-1121.00,$1500 Signing Bonus Home Health Aide,Cuyuna Regional Medical Center,"Crosby, MN",,Responsible to the R.N. and/or therapist who a...,Responsible to the R.N. and/or therapist who a...,POSITION QUALIFICATIONS\nEducation and Experie...,,
986,987,31-1121.00,Home Health Aide - Home Health - New River Val...,Carilion Clinic,"Radford, VA",,This position supports Carilion's hallmarks of...,This position supports Carilion's hallmarks of...,"Education: Demonstrate ability to read, write ...",,This is Carilion Clinic ...\n\nAn organization...
987,988,31-1121.00,Home Health Aide,Ridgeview Medical Center,"Waconia, MN",,Documents the various aspects of care on appro...,Documents the various aspects of care on appro...,Minimum Education/Work Experience \nOn Minneso...,"Preferred Qualifications: Previous home care, ...",
988,989,31-1121.00,Home Health Aide,St. Claire HealthCare,"Morehead, KY",,"Observes, reports and documents patient status...","Observes, reports and documents patient status...","Skills/Abilities: Basic ability to read, write...",,


In [34]:
texts = jobs.text.to_list()

Now all we need to do is input these texts into CREAM, and let it run!

In [35]:
results = C.run(texts)

  0%|          | 0/990 [00:00<?, ?it/s]

In [36]:
results.inferred_label.sum()

311

From the above, we can see that 311 out of the 990 texts were matched with a rule from our rule set, thus receiving a label of 1 for the "educational requirement" class we defined. Let's look at some examples.
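
Because `inferred_label` is 0 or 1, summing the column counts the matched texts. As a quick worked check of the share this represents, using only the two numbers printed above:

```python
n_texts = 990     # texts passed to C.run
n_labeled = 311   # results.inferred_label.sum()

# Fraction of texts that received a label of 1
share = n_labeled / n_texts
print(f"{share:.1%} of texts matched an education rule")
```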

In [37]:
results[results.inferred_label == 1].iloc[11]

text                   A skilled position responsible for installatio...
inferred_rule                                degree required high school
inferred_label                                                         1
inferred_confidence                                             0.925455
Name: 35, dtype: object

In [38]:
results[results.inferred_label == 1].iloc[11].text

'A skilled position responsible for installation, set up, repair and ongoing maintenance of vending, cooling and/or fountain equipment at customer accounts. Diagnoses equipment problems, uses judgment to determine how to best repair or replace. Position works independently and has frequent customer contact.  May require lifting, carrying, pulling and/or moving between 20 and 45 pounds repeatedly over workday Requires kneeling, squatting, crouching, crawling and bending when making repairs, often in low places. Position may require moving vending machines weighing 800-1200 pounds. \nRepair and perform preventative maintenance on marketing equipment\nUnload and reload with products as necessary\nEducate customers on basic equipment repair and upkeep procedures\nInstall equipment by making holes and route lines to connect products to dispensing unit, connecting water and gas supply and finding drains for units with ice. For box syrup, build racks and connect lines\nFill installed equipmen

In [39]:
results[results.inferred_label == 1].iloc[123]

text                   Analyzes complex end user needs to determine o...
inferred_rule                                     needed bachelor degree
inferred_label                                                         1
inferred_confidence                                             0.937813
Name: 372, dtype: object

In [40]:
results[results.inferred_label == 1].iloc[123].text

'Analyzes complex end user needs to determine optimal means of meeting those needs.\nDetermines specific business application software requirements to address complex business needs.\nDevelops project plans and identifies and coordinates resources, involving those outside the unit.\nWorks with programming staff to ensure requirements will be incorporated into system design and testing.\nActs as a resource to users of the software to address questions/issues.\nMay provide direction and guidance to team members and serves as an expert for the team. Requires a BS/BA degree, 5-7 years of business analysis experience that includes knowledge of systems capabilities and business operations; or any combination of education and experienced, which would provide an equivalent background. Project management certification (PMP) preferred.\n\n•             EDI experience strongly preferred Your Talent. Our Vision.  At Beacon Health Options, a proud member of the Anthem, Inc. family of companies, it’

In [41]:
results[results.inferred_label == 1].iloc[-1]

text                   Observes, reports and documents patient status...
inferred_rule                                         high school degree
inferred_label                                                         1
inferred_confidence                                              0.92737
Name: 988, dtype: object

In [42]:
results[results.inferred_label == 1].iloc[-1].text

'Observes, reports and documents patient status and the care or services provided.\nObserves and suggests changes to patient/caregiver for internal factors in the home that often cause accidents or an unsafe environment for the patient/family/aide.\nMaintains a clean and safe environment in the patient’s place of residence and performs appropriate nutritional observations.\nUtilizes proper procedures for personal care.\nUtilizes proper technique for transfers and ambulation.\nDemonstrates proficiency and accuracy in other procedures performed in the home.\nDemonstrates appropriate practices based upon the age of the patient served. Skills/Abilities: Basic ability to read, write and follow verbal instructions. Must possess ability to operate a motor vehicle. Must have reliable transportation.\nEducation: High School graduate or GED or equivalent years of experience. \nExperience: Previous work in a hospital or nursing home/private duty setting preferred. \nLicensure/Certification: Certi

Looking into the text examples, we can see that CREAM did indeed identify the needles in the haystack!
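
When reviewing matches by hand, it can help to look at the least confident ones first. The sketch below uses the three examples shown above, copied as plain dictionaries so it runs without JAAT or pandas. That `inferred_confidence` is directly comparable to the `threshold=0.9` passed to `CREAM` is an assumption based on the values seen here, not something confirmed above.

```python
# The three labeled examples shown above (row index, matched rule, confidence)
examples = [
    {"row": 35, "inferred_rule": "degree required high school", "inferred_confidence": 0.925455},
    {"row": 372, "inferred_rule": "needed bachelor degree", "inferred_confidence": 0.937813},
    {"row": 988, "inferred_rule": "high school degree", "inferred_confidence": 0.92737},
]

THRESHOLD = 0.9  # the value passed to CREAM earlier

# Sort so the weakest matches come first for manual review
by_confidence = sorted(examples, key=lambda r: r["inferred_confidence"])

# Check whether every example clears the threshold (assumed comparable, see above)
all_above = all(r["inferred_confidence"] >= THRESHOLD for r in examples)

for r in by_confidence:
    print(r["row"], r["inferred_rule"], r["inferred_confidence"])
print("All above threshold:", all_above)
```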

# That's all! Start using JAAT today.