# Profession Taxonomy Mapping Pipeline

This notebook assembles an end-to-end workflow for mapping free-text profession titles to a master taxonomy. The code loads the taxonomy tables, engineers lookup vocabularies, builds spaCy matchers, and documents how the classifier consolidates matches into a single primary label. Use this notebook as a reference for the logic that will be implemented in the production system for taxonomy mappings and translations.

**Workflow outline**
- Configure file paths for the profession inputs and the input taxonomy
- Load datasets with robust CSV readers that handle encoding issues
- Generate normalized vocabularies, qualifier lists, and acronym lookups from the taxonomy
- Build phrase and token-level matchers that capture strong, suffix-qualified, and prefix-qualified occupations
- Run the classifier to assign a primary taxonomy label per input profession and persist the results in a CSV file


In [None]:
import re
import pandas as pd
from collections import Counter, defaultdict
from pathlib import Path
import spacy
from spacy.matcher import Matcher, PhraseMatcher
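try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # Model not installed: fall back to a blank English pipeline.
    # Only the tokenizer and LOWER-based matchers are used below, so this is sufficient.
    nlp = spacy.blank('en')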

## Load raw profession titles

Each configured CSV is read with `_read_csv_any`, which retries in Latin-1 if UTF-8 decoding fails. The code picks the appropriate column containing profession or title data, standardizes its name to `text`, concatenates all files, and drops empty records. Inspect the resulting DataFrame to ensure the sample looks as expected before moving on.
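
The encoding fallback described above can be illustrated without pandas. The following is a minimal sketch, assuming a standard-library stand-in for the reader: the helper name `read_rows_any` and the sample file contents are hypothetical and are not part of the notebook.

```python
import csv
import os
import tempfile

def read_rows_any(path):
    """Read CSV rows as dicts, trying UTF-8 first and Latin-1 second."""
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, newline="", encoding=encoding) as fh:
                return list(csv.DictReader(fh))
        except UnicodeDecodeError:
            continue  # try the next encoding
    raise ValueError(f"Could not decode {path}")

# Hypothetical sample: the accented character is written as a single Latin-1 byte,
# which is not valid UTF-8 in this position, so the first attempt fails.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("nm_profession\nInfirmi\u00e8re Auxiliaire\n".encode("latin-1"))
    sample_path = tmp.name

rows = read_rows_any(sample_path)
os.remove(sample_path)
print(rows[0]["nm_profession"])
```

Latin-1 maps every byte to a character, so the second attempt cannot raise a decode error. It can silently mis-decode files that use another encoding, which is one reason to inspect the loaded sample.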


In [2]:
INPUT_FILES = ["./test2.csv"]
TARGET_COLUMN = 'nm_profession'
STOPWORDS = set('''the and of in to for a on with by at from or as an amp ii iii iv i v vi vii viii ix x'''.split())
CLIENT_TAXONOMY_PATH = "./master_taxonomy.csv"
LABELS_PATH = "./labels.csv"

## File configuration

Set the `INPUT_FILES`, `TARGET_COLUMN`, and taxonomy file paths before running the notebook. The loader will search for profession-like columns if `TARGET_COLUMN` is missing, but explicit configuration is safer when new extracts are introduced. Update the stopword list when the input feed introduces additional filler words that should be ignored during matching.
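
The column search can be sketched in isolation as follows. This assumes the same candidate names the loader uses; the function name `pick_text_column` and the example column lists are hypothetical.

```python
def pick_text_column(columns, target="nm_profession"):
    """Return the configured column if present, else the first profession-like one."""
    if target in columns:
        return target
    known = {"nm_profession", "profession", "title", "job_title"}
    for col in columns:
        low = col.lower()
        if low in known or "prof" in low or "title" in low:
            return col
    return None  # caller should raise a clear error in this case

print(pick_text_column(["id", "nm_profession"]))       # configured column wins
print(pick_text_column(["id", "Job_Title", "state"]))  # fallback by name pattern
print(pick_text_column(["id", "state"]))               # nothing suitable
```

Because the fallback takes the first column whose name contains `prof` or `title`, an unrelated column such as a hypothetical `profile_id` would also qualify, which is why explicit configuration is the safer choice.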


In [3]:
def normalize_text(s: str):
    s = re.sub(r'[-/]', ' ', s)
    s = re.sub(r'[()\",;:.\[\]{}!?\u2013\u2014]', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s
def tokenize_lower(s: str):
    return [t.lower_ for t in nlp.make_doc(normalize_text(s))]
def ngrams(tokens, n):
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
def normalize_acronym(tok: str):
    base = tok.lower().replace('.', '').replace('-', '')
    for k, v in ACRONYMS.items():
        if base == k.replace('-', '').replace('.', ''):
            return v
    return None
def _read_csv_any(path: str) -> pd.DataFrame:
    try:
        return pd.read_csv(path)
    except UnicodeDecodeError:
        return pd.read_csv(path, encoding="latin-1")

In [4]:
frames = []
for p in INPUT_FILES:
    df = _read_csv_any(p)
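    profession_like = {'nm_profession', 'profession', 'title', 'job_title'}
    # Prefer the configured column; otherwise take the first profession-like column name.
    col = TARGET_COLUMN if TARGET_COLUMN in df.columns else next(
        (c for c in df.columns
         if c.lower() in profession_like or 'prof' in c.lower() or 'title' in c.lower()),
        None,
    )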
    if not col:
        raise ValueError(f'No profession/title column in {p}. Columns: {list(df.columns)}')
    frames.append(df[[col]].rename(columns={col:'text'}))
data = pd.concat(frames, ignore_index=True).dropna(subset=['text']).astype({'text': str})
len(data), data.head()

(223,
                                         text
 0  Advanced Palliative Hospice Social Worker
 1           Associate Clinical Social Worker
 2                    Associate Social Worker
 3                Baccalaureate Social Worker
 4            Bachelors Limited Social Worker)

In [5]:
client_tax = _read_csv_any(CLIENT_TAXONOMY_PATH)
labels_gold = _read_csv_any(LABELS_PATH) if Path(LABELS_PATH).exists() else None
print("Client taxonomy shape:", client_tax.shape)
print("Labels gold shape:", None if labels_gold is None else labels_gold.shape)
print("Client taxonomy columns:", list(client_tax.columns))
if labels_gold is not None:
    print("Labels columns:", list(labels_gold.columns))

Client taxonomy shape: (28, 9)
Labels gold shape: (28, 2)
Client taxonomy columns: ['Taxonomy Code', 'Taxonomy Description', 'Industry', 'Major Group', 'Minor Group', 'Broad Occupation', 'Detailed Occupation', 'Occupation Level', 'Notes / reasoning']
Labels columns: ['nm_profession', 'primary_category']


## Build taxonomy vocabularies and qualifiers

The input taxonomy is normalized to lowercase, whitespace-collapsed strings so we can match consistently. From these cleaned columns, the notebook derives:
- Base vocabularies for industry through occupation level
- Strong suffix heads sourced from detailed and multiword broad occupations
- Qualified suffix heads that need contextual qualifiers
- Qualifier phrases gathered from minor groups, industries, and prefixes detected in broad/detailed occupations
- Acronym expansions captured from the occupation level column

These derived sets are the backbone for the matcher rules that follow.
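
To make the derivation concrete, here is a small worked sketch on three hypothetical taxonomy phrases. It mirrors the tail and prefix logic described above, but the phrases and the reduced `GENERIC_TAILS` set are illustrative assumptions rather than values from the real taxonomy.

```python
import re

GENERIC_TAILS = {"nurse", "worker", "counselor"}  # reduced, illustrative subset

# Hypothetical taxonomy phrases, already lowercased.
phrases = ["clinical social worker", "school counselor", "registered nurse"]

def tails(phrase):
    """Return the last token and, if available, the last two tokens of a phrase."""
    toks = phrase.split()
    out = toks[-1:]
    if len(toks) >= 2:
        out.append(" ".join(toks[-2:]))
    return out

# Qualified heads: every tail ending in a generic word, plus the generic words themselves.
qualified_heads = set(GENERIC_TAILS)
for p in phrases:
    qualified_heads.update(t for t in tails(p) if t.split()[-1] in GENERIC_TAILS)

# Qualifiers: whatever precedes a qualified head inside a taxonomy phrase.
qualifiers = set()
for p in phrases:
    for head in qualified_heads:
        if p.endswith(head) and p != head:
            prefix = re.sub(r"\s+", " ", p[:-len(head)]).strip()
            if prefix:
                qualifiers.add(prefix)

print(sorted(qualified_heads))
print(sorted(qualifiers))
```

In this toy run, `clinical social worker` contributes both `clinical` and `clinical social` as qualifiers, because it ends with both the `social worker` head and the `worker` head.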


In [6]:

# --- Normalize helper ---
def _norm_series(s: pd.Series) -> pd.Series:
    s = s.fillna("").astype(str).str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
    s = s.replace({"nan": ""})
    return s

# --- Resolve expected columns (case-insensitive) ---
def _resolve(df: pd.DataFrame, keys):
    low = {c.lower(): c for c in df.columns}
    out = {}
    for k in keys:
        out[k] = low.get(k, None)
    return out

need_cols = ["industry","major group","minor group","broad occupation","detailed occupation","occupation level"]
colmap = _resolve(client_tax, need_cols)

IND = _norm_series(client_tax[colmap["industry"]]) if colmap["industry"] else pd.Series([], dtype=str)
MAJ = _norm_series(client_tax[colmap["major group"]]) if colmap["major group"] else pd.Series([], dtype=str)
MIN = _norm_series(client_tax[colmap["minor group"]]) if colmap["minor group"] else pd.Series([], dtype=str)
BRO = _norm_series(client_tax[colmap["broad occupation"]]) if colmap["broad occupation"] else pd.Series([], dtype=str)
DET = _norm_series(client_tax[colmap["detailed occupation"]]) if colmap["detailed occupation"] else pd.Series([], dtype=str)
LEV = _norm_series(client_tax[colmap["occupation level"]]) if colmap["occupation level"] else pd.Series([], dtype=str)

def _uniq_nonempty(s: pd.Series): 
    return sorted([x for x in s.drop_duplicates().tolist() if x])

VOCAB = {
    "Industry": set(_uniq_nonempty(IND)),
    "Major_Group": set(_uniq_nonempty(MAJ)),
    "Minor_Group": set(_uniq_nonempty(MIN)),
    "Broad_Occupation": set(_uniq_nonempty(BRO)),
    "Occupation_Level": set(_uniq_nonempty(LEV)),
}

# Heads strong: detailed + multiword broad
STRONG_SUFFIX_HEADS = set(_uniq_nonempty(DET))
STRONG_SUFFIX_HEADS |= {b for b in VOCAB["Broad_Occupation"] if len(b.split()) >= 2}

# Qualified heads: generic tails
GENERIC_TAILS = {"nurse","therapist","counselor","counsellor","specialist","coordinator","manager","worker","navigator","assistant","associate"}
def _tails(phr):
    toks = phr.split()
    outs = []
    if toks: outs.append(toks[-1])
    if len(toks)>=2: outs.append(" ".join(toks[-2:]))
    return outs
derived = set()
for p in list(VOCAB["Broad_Occupation"]) + list(STRONG_SUFFIX_HEADS):
    for t in _tails(p):
        if t.split()[-1] in GENERIC_TAILS:
            derived.add(t)
QUALIFIED_SUFFIX_HEADS = sorted(derived | GENERIC_TAILS)

# Qualifiers: from Minor + prefixes of BRO/DET before generic heads + industries
QUALIFIERS = set(VOCAB["Industry"] | VOCAB["Minor_Group"])
def _prefix_before_head(phrase, head):
    if phrase.endswith(head) and phrase != head:
        return re.sub(r"\s+", " ", phrase[:-len(head)]).strip()
    return ""
for p in list(VOCAB["Broad_Occupation"]) + list(STRONG_SUFFIX_HEADS):
    for head in QUALIFIED_SUFFIX_HEADS:
        if p.endswith(head):
            pref = _prefix_before_head(p, head)
            if pref: QUALIFIERS.add(pref)

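# Acronyms: extract uppercase tokens from the raw Occupation Level column; keep previous map if exists.
# The raw column is read here because VOCAB values are lowercased, so isupper() would never match them.
try:
    ACRONYMS = ACRONYMS.copy()
except NameError:
    ACRONYMS = {}
_lev_col = colmap["occupation level"]
_raw_levels = client_tax[_lev_col].dropna().astype(str) if _lev_col else pd.Series([], dtype=str)
for val in _raw_levels:
    for tok in re.split(r"[ \-/]", val):
        if tok.isupper() and tok.isalpha() and 2 <= len(tok) <= 6:
            ACRONYMS.setdefault(tok.lower(), tok)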

print("VOCAB sizes:", {k:len(v) for k,v in VOCAB.items()})
print("Strong heads:", len(STRONG_SUFFIX_HEADS), "| Qualified heads:", len(set(QUALIFIED_SUFFIX_HEADS)), "| Qualifiers:", len(QUALIFIERS), "| Acronyms:", len(ACRONYMS))


VOCAB sizes: {'Industry': 1, 'Major_Group': 1, 'Minor_Group': 2, 'Broad_Occupation': 2, 'Occupation_Level': 13}
Strong heads: 11 | Qualified heads: 12 | Qualifiers: 20 | Acronyms: 0


## Construct spaCy matchers

Phrase matchers cover direct lookups against the normalized vocabularies. The token matcher adds three rule families:
- **Strong occupations** where the detailed occupation appears anywhere in the text
- **Suffix-qualified patterns** that validate qualifying context before the head term
- **Prefix-qualified patterns** that look for qualifiers after the head term

Rebuilding these matchers is required whenever the taxonomy-derived vocabularies change.
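
The two qualified rule families can be approximated in plain Python to show what they are meant to accept. This is a simplified sketch, not the spaCy implementation: the helper `qualified_match` is hypothetical, and it inspects all text before or after the head, whereas the notebook checks the text outside each matched span, so edge cases may differ.

```python
def qualified_match(tokens, head, qualifiers):
    """Return 'suffix', 'prefix', or None for a head found in a token list.

    'suffix': a qualifier appears in the text before the head.
    'prefix': a qualifier appears in the text after the head.
    """
    head_toks = head.split()
    n = len(head_toks)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != head_toks:
            continue
        before = " ".join(tokens[:i])
        after = " ".join(tokens[i + n:])
        if before and any(q in before for q in qualifiers):
            return "suffix"
        if after and any(q in after for q in qualifiers):
            return "prefix"
    return None

qualifiers = {"clinical", "hospice"}  # hypothetical qualifier set
print(qualified_match("clinical social worker".split(), "worker", qualifiers))
print(qualified_match("worker hospice care".split(), "worker", qualifiers))
print(qualified_match("social worker".split(), "worker", qualifiers))
```

The third call returns `None` because `social` is not a known qualifier, which is the behavior that keeps a generic head such as `worker` from being labeled without supporting context.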


In [7]:
phrase_matchers = {}
for cat, phrases in VOCAB.items():
    pm = PhraseMatcher(nlp.vocab, attr="LOWER")
    pm.add(cat, [nlp.make_doc(p) for p in phrases if p])
    phrase_matchers[cat] = pm

matcher = Matcher(nlp.vocab)

# Strong (suffix-tolerant; head may appear anywhere)
for head in sorted(STRONG_SUFFIX_HEADS):
    toks = [{"LOWER": t} for t in head.split()]
    pattern = [{"OP": "*"}, *toks]      # .* HEAD
    matcher.add("Detailed_Occupation__strong__"+head, [pattern])

# Qualified: suffix (require some tokens before head) + prefix (head then tokens)
for head in sorted(set(QUALIFIED_SUFFIX_HEADS)):
    toks = [{"LOWER": t} for t in head.split()]
    # suffix-qualified: + HEAD  (then check qualifiers in prefix text)
    pattern_suffix = [{"OP": "+"}, *toks]
    matcher.add("Detailed_Occupation__qualified_suffix__"+head, [pattern_suffix])
    # prefix-qualified: HEAD +  (then check qualifiers in suffix text)
    pattern_prefix = [*toks, {"OP": "+"}]
    matcher.add("Detailed_Occupation__qualified_prefix__"+head, [pattern_prefix])

print("Matchers rebuilt with prefix+suffix qualified rules.")

Matchers rebuilt with prefix+suffix qualified rules.


In [8]:
# --- pick_primary_from_buckets respects your CAT_KEYS order (generic->specific) or the reverse ---
try:
    CAT_KEYS
except NameError:
    CAT_KEYS = ["Industry","Major Group","Minor Group","Broad Occupation","Detailed Occupation","Occupation Level"]

def pick_primary_from_buckets(buckets, prefer_specific=True):
    # Choose ONE primary category.
    order = list(reversed(CAT_KEYS)) if prefer_specific else CAT_KEYS
    for cat in order:
        key = cat.replace(" ", "_")
        vals = buckets.get(key) or buckets.get(cat)
        if vals:
            best = max(vals, key=lambda s: len(s.split()))
            return cat, best, ("prefer_specific" if prefer_specific else "prefer_generic")
    return None, None, "no_match"

## Classify a profession title

`classify_text_spacy` orchestrates the final labeling logic:
- Normalize the input text and run the phrase and token matchers
- Collect matches by category, including qualifier-aware detailed occupations and acronym-based occupation levels
- Deduplicate matches and pick a primary label, honoring `CAT_KEYS` order and the `prefer_specific` flag
- Return a structured record with the chosen label, the reasoning flag, and all supporting matches

This function is the main integration point for downstream services that need taxonomy assignments.
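
The effect of the `prefer_specific` flag is easiest to see on a toy set of matches. The sketch below is a compact restatement of the selection step under the default `CAT_KEYS` order. The function name `pick_primary` and the bucket contents, including the `healthcare` industry value, are hypothetical.

```python
CAT_KEYS = ["Industry", "Major Group", "Minor Group",
            "Broad Occupation", "Detailed Occupation", "Occupation Level"]

def pick_primary(buckets, prefer_specific=True):
    """Walk categories in priority order; return the longest label of the first hit."""
    order = reversed(CAT_KEYS) if prefer_specific else CAT_KEYS
    for cat in order:
        labels = buckets.get(cat.replace(" ", "_"))
        if labels:
            # Longest label by word count is treated as the most informative one.
            return cat, max(labels, key=lambda s: len(s.split()))
    return None, None

# Hypothetical matches for a title such as "Associate Clinical Social Worker".
buckets = {
    "Industry": ["healthcare"],
    "Detailed_Occupation": ["social worker", "clinical social worker"],
}
print(pick_primary(buckets))                         # most specific category first
print(pick_primary(buckets, prefer_specific=False))  # most generic category first
```

With `prefer_specific=True` the detailed occupation wins and the three-word label is chosen over the two-word one. With the flag off, the industry match is returned instead.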


In [9]:

# --- Override classify_text_spacy to use new prefix+suffix qualified logic ---
def classify_text_spacy(text: str, prefer_specific=True):
    from collections import defaultdict
    doc = nlp(normalize_text(text))
    buckets = defaultdict(list)

    # Phrase categories
    for cat, pm in phrase_matchers.items():
        for _, s, e in pm(doc):
            buckets[cat].append(doc[s:e].text.lower())

    # Detailed Occupation via matcher (strong + qualified suffix/prefix)
    for mid, s, e in matcher(doc):
        label = nlp.vocab.strings[mid]
        span = doc[s:e].text.lower()

        if label.startswith("Detailed_Occupation__qualified_suffix__"):
            prefix = doc[0:s].text.lower()
            if any(q in prefix for q in QUALIFIERS):
                buckets["Detailed_Occupation"].append(span)
        elif label.startswith("Detailed_Occupation__qualified_prefix__"):
            suffix = doc[e:].text.lower()
            if any(q in suffix for q in QUALIFIERS):
                buckets["Detailed_Occupation"].append(span)
        else:
            buckets["Detailed_Occupation"].append(span)

    # Occupation Level from acronyms (normalize)
    for t in doc:
        base = t.text.lower().replace(".", "").replace("-", "")
        if base in ACRONYMS:
            buckets["Occupation_Level"].append(ACRONYMS[base])

    # Deduplicate + sort
    for k in list(buckets.keys()):
        buckets[k] = sorted(set(buckets[k]))

    # Decide primary
    primary_category, primary_label, reason = pick_primary_from_buckets(buckets, prefer_specific=prefer_specific)

    return {
        "source_text": text,
        "category": primary_category,
        "label": primary_label,
        "reason": reason,
        "matches": dict(buckets)
    }


## Persist results

The notebook applies the classifier to every input profession, writes the structured outputs to `./outputs/taxonomy_mapping.csv`, and logs the destination path. Review this file to validate match quality or feed the results into subsequent QA steps.
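
If downstream QA needs to parse the nested `matches` dictionary back out of the CSV, one option is to serialize that field as JSON before writing. The following is a minimal standard-library sketch under that assumption: the records are hypothetical examples that only reuse the classifier's output keys, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import json

# Hypothetical classifier outputs with the same keys the classifier returns.
records = [
    {"source_text": "Associate Social Worker", "category": "Detailed Occupation",
     "label": "social worker", "reason": "prefer_specific",
     "matches": {"Detailed_Occupation": ["social worker"]}},
    {"source_text": "Unknown Title", "category": None,
     "label": None, "reason": "no_match", "matches": {}},
]

buffer = io.StringIO()  # stands in for the output file
fields = ["source_text", "category", "label", "reason", "matches"]
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
for rec in records:
    row = dict(rec)
    row["matches"] = json.dumps(rec["matches"], sort_keys=True)  # keep nested data parseable
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

Reading the file back, `json.loads` on the `matches` column restores the original dictionary, and unmatched titles appear with empty `category` and `label` cells.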


In [10]:
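results = data['text'].apply(classify_text_spacy)
# Expand the per-title dicts into columns so the CSV is structured rather than one column of dict strings.
results_df = pd.DataFrame(results.tolist())
OUT = Path('./outputs'); OUT.mkdir(parents=True, exist_ok=True)
results_df.to_csv(OUT/'taxonomy_mapping.csv', index=False)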
print('Saved to', OUT.resolve())

Saved to /Users/germandominguez/Documents/GitHub/propelus_ai/notebooks/outputs
