Module Description:
-------------------
Class to extract skills from text and align them to an existing taxonomy

Ownership:
----------
Project: Leveraging Artificial intelligence for Skills Extraction and Research (LAiSER)
Owner:  George Washington University Institute of Public Policy
        Program on Skills, Credentials and Workforce Policy
        Media and Public Affairs Building
        805 21st Street NW
        Washington, DC 20052
        PSCWP@gwu.edu
        https://gwipp.gwu.edu/program-skills-credentials-workforce-policy-pscwp

License:
--------
Copyright 2024 George Washington University Institute of Public Policy

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the “Software”), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Revision History:
-----------------
| Rev No. | Date | Author | Description |
|---------|------|--------|-------------|
| [1.0.0] | 06/05/2024 | Satya Phanindra K. | Created a standalone notebook for skill extraction |
| [1.0.1] | 06/11/2024 | Satya Phanindra K. | Added GPU support for processing |
| [1.0.1] | 06/20/2024 | Satya Phanindra K. | Added error handling and logging |
| [1.0.2] | 07/01/2024 | Satya Phanindra K. | Threshold update for similarity and AI model |
| [1.0.3] | 07/10/2024 | Satya Phanindra K. | Added a separate set of functions for LLM use cases |
| [1.0.4] | 07/13/2024 | Satya Phanindra K. | Add descriptions to each method |
| [1.0.5] | 07/18/2024 | Satya Phanindra K. | Added CONDITIONAL GPU support for LLM |
| [1.0.6] | 07/22/2024 | Satya Phanindra K. | Added support for SkillNer model for skill extraction, if GPU not available |
| [1.0.7] | 07/25/2024 | Satya Phanindra K. | Calculate cosine similarities in bulk for optimal performance. |
| [1.0.8] | 07/28/2024 | Satya Phanindra K. | Error handling for empty list outputs from extract_raw function |
| [1.0.9] | 11/24/2024 | Prudhvi Chekuri | Add functionality to extract skills from syllabi data. |
| [1.1.0] | 03/14/2025 | Deepika Reddygari | Import laiser as a python package. |
| [1.1.1] | 03/15/2025 | Bharat Khandelwal | Resolved all issues related to importing laiser as a python package. |
| [1.1.2] | 03/19/2025 | Satya Phanindra K. | Update installation with uv. |


## Install and import LAiSER

In [None]:
!pip install uv
!uv pip install dev-laiser -q

Collecting dev-laiser
  Downloading dev_laiser-0.2.7-py3-none-any.whl.metadata (5.6 kB)
Collecting skillNer (from dev-laiser)
  Downloading skillNer-1.0.3.tar.gz (24 kB)
  Preparing metadata (setup.py) ... done
Collecting bitsandbytes (from dev-laiser)
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting datasets (from dev-laiser)
  Downloading datasets-3.4.0-py3-none-any.whl.metadata (19 kB)
Collecting trl (from dev-laiser)
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting python-dotenv (from dev-laiser)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting vllm (from dev-laiser)
  Downloading vllm-0.7.3-cp38-abi3-manylinux1_x86_64.whl.metadata (25 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->dev-laiser)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->dev-laiser

In [None]:
!uv pip install --upgrade --force-reinstall numpy pandas torch scipy -q

In [None]:
import os
os.kill(os.getpid(), 9) # Restart the kernel
# After the restart, run the following cells to continue 
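
After the kernel restarts, it can help to confirm which versions of the reinstalled packages are actually present before importing LAiSER. The cell below is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and is not part of LAiSER.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the install cells above.
for name, version in installed_versions(
    ["dev-laiser", "numpy", "pandas", "torch", "scipy"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

A `not installed` entry here suggests re-running the install cells before continuing.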

In [2]:
from laiser.skill_extractor import Skill_Extractor
import pandas as pd
import torch

## Using the Skill Extractor

#### With Job Descriptions

In [3]:
# Import the dataset
nlx_sample = pd.read_csv('https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/jobs-data/nlx_job_data_50rows.csv')

nlx_sample = nlx_sample[0:1]
nlx_sample = nlx_sample[['description', 'job_id']]
print("Considering", len(nlx_sample), "rows for processing...")

Considering 1 rows for processing...


In [4]:
nlx_sample

                                   description    job_id
0  Req ID: 29534BR POSITION SUMMARY This po...  69322097


In [6]:
print('Initializing the Skill Extractor...')
se = Skill_Extractor(AI_MODEL_ID="marcsun13/gemma-2-9b-it-GPTQ", HF_TOKEN="<YOUR_HUGGING_FACE_API_TOKEN>", use_gpu=True)
print('The Skill Extractor has been initialized successfully!')



Initializing the Skill Extractor...
Found 'en_core_web_lg' model. Loading...
Downloading 'en_core_web_lg' model...
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
GPU is available. Using GPU for Large Language model initialization...
INFO 03-16 01:26:39 __init__.py:207] Automatically detected platform cuda.


config.json:   0%|          | 0.00/1.50k [00:00<?, ?B/s]

INFO 03-16 01:26:55 config.py:549] This model supports multiple tasks: {'generate', 'embed', 'reward', 'score', 'classify'}. Defaulting to 'generate'.
INFO 03-16 01:26:56 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='marcsun13/gemma-2-9b-it-GPTQ', speculative_config=None, tokenizer='marcsun13/gemma-2-9b-it-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=marcsun13/gemma-2-9b-it-GPTQ,

tokenizer_config.json:   0%|          | 0.00/40.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

INFO 03-16 01:27:00 cuda.py:178] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-16 01:27:00 cuda.py:226] Using XFormers backend.
INFO 03-16 01:27:01 model_runner.py:1110] Starting to load model marcsun13/gemma-2-9b-it-GPTQ...
INFO 03-16 01:27:01 weight_utils.py:254] Using model weights format ['*.safetensors']


model-00002-of-00002.safetensors:   0%|          | 0.00/1.19G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

INFO 03-16 01:27:40 weight_utils.py:270] Time spent downloading weights for marcsun13/gemma-2-9b-it-GPTQ: 38.420339 seconds


model.safetensors.index.json:   0%|          | 0.00/134k [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 03-16 01:28:01 model_runner.py:1115] Loading model weights took 5.7838 GB
INFO 03-16 01:28:15 worker.py:267] Memory profiling takes 14.14 seconds
INFO 03-16 01:28:15 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 03-16 01:28:15 worker.py:267] model weights take 5.78GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 5.05GiB.
INFO 03-16 01:28:16 executor_base.py:111] # cuda blocks: 985, # CPU blocks: 780
INFO 03-16 01:28:16 executor_base.py:116] Maximum concurrency for 8192 tokens per request: 1.92x
INFO 03-16 01:28:20 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utiliz

Capturing CUDA graph shapes: 100%|██████████| 35/35 [01:09<00:00,  2.00s/it]

INFO 03-16 01:29:30 model_runner.py:1562] Graph capturing finished in 70 secs, took 0.53 GiB
INFO 03-16 01:29:30 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 89.23 seconds
The Skill Extractor has been initialized successfully!
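
The initialization cell above hardcodes the Hugging Face token placeholder and sets `use_gpu=True` unconditionally. As a hedged alternative, the sketch below reads the token from an environment variable and only requests the GPU when PyTorch reports one. The helper `build_extractor_kwargs` and the variable name `HF_TOKEN` in the environment are assumptions for illustration; only the keyword names `AI_MODEL_ID`, `HF_TOKEN`, and `use_gpu` come from the cell above. The sketch does not check whether a given model can actually run without a GPU.

```python
import os


def build_extractor_kwargs(model_id, env=None):
    """Collect Skill_Extractor keyword arguments without hardcoding the token.

    `env` defaults to os.environ; pass a plain dict to test the helper.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")  # assumed environment variable name
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable first.")
    try:
        import torch
        use_gpu = torch.cuda.is_available()
    except ImportError:
        # Without PyTorch there is no way to detect CUDA; fall back to CPU.
        use_gpu = False
    return {"AI_MODEL_ID": model_id, "HF_TOKEN": token, "use_gpu": use_gpu}


# Possible usage in this notebook (left commented out because it loads the model):
# se = Skill_Extractor(**build_extractor_kwargs("marcsun13/gemma-2-9b-it-GPTQ"))
```

Keeping the token out of the notebook avoids committing it by accident when the notebook is shared.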






In [7]:
# Extract skills from the job descriptions and align them to the taxonomy database
output = se.extractor(nlx_sample, 'job_id', text_columns = ['description'])

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.94s/it, est. speed input: 152.84 toks/s, output: 25.68 toks/s]


0it [00:00, ?it/s]

  extracted = extracted._append(matches, ignore_index=True)


In [8]:
# save the extracted skills to a csv file
print(output)
output.to_csv('extracted_skills_for_sample_jobs.csv', index=False)

   Research ID                                        Description  \
0     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
1     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
2     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
3     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
4     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
5     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
6     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
7     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
8     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
9     69322097  Req ID: 29534BR POSITION SUMMARY This po...   
10    69322097  Req ID: 29534BR POSITION SUMMARY This po...   
11    69322097  Req ID: 29534BR POSITION SUMMARY This po...   
12    69322097  Req ID: 29534BR POSITION SUMMARY This po...   
13    69322097  Req ID: 29534BR POSITION SUMMARY This po...   
14    69322097  Req I
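
The printed output repeats the same `Research ID` on many rows, one row per extracted match. A quick way to sanity-check the saved CSV is to count rows per ID. The sketch below uses only the standard library; `rows_per_id` is a hypothetical helper, and the only column name it relies on is `Research ID`, which appears in the output above.

```python
import csv
from collections import Counter


def rows_per_id(csv_path, id_column="Research ID"):
    """Count how many output rows were written for each input record."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[id_column] for row in csv.DictReader(handle))


# Possible usage after the cell above has written the file:
# print(rows_per_id('extracted_skills_for_sample_jobs.csv'))
```

IDs are returned as strings because the `csv` module does not infer types.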

#### With syllabi

In [9]:
syllabi_sample = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/syllabi-data/preprocessed_50_opensyllabus_syllabi_data.csv")
syllabi_sample = syllabi_sample[0:1]
syllabi_sample = syllabi_sample[['id', 'description', 'learning_outcomes']]
print("Considering", len(syllabi_sample), "rows for processing...")

Considering 1 rows for processing...


In [10]:
syllabi_sample

              id                                        description                                  learning_outcomes
0  4904852663176  survey and analysis of cinema , including hist...  communications skills — to include effective w...


In [11]:
output = se.extractor(syllabi_sample, 'id', text_columns = ['description', 'learning_outcomes'], input_type = "syllabus")

Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.93s/it, est. speed input: 77.08 toks/s, output: 27.57 toks/s]


0it [00:00, ?it/s]

  extracted = extracted._append(matches, ignore_index=True)


In [12]:
# save the extracted skills to a csv file
print(output)
output.to_csv('extracted_skills_for_sample_syllabus.csv', index=False)

      Research ID                                        Description  \
0   4904852663176  survey and analysis of cinema , including hist...   
1   4904852663176  survey and analysis of cinema , including hist...   
2   4904852663176  survey and analysis of cinema , including hist...   
3   4904852663176  survey and analysis of cinema , including hist...   
4   4904852663176  survey and analysis of cinema , including hist...   
5   4904852663176  survey and analysis of cinema , including hist...   
6   4904852663176  survey and analysis of cinema , including hist...   
7   4904852663176  survey and analysis of cinema , including hist...   
8   4904852663176  survey and analysis of cinema , including hist...   
9   4904852663176  survey and analysis of cinema , including hist...   
10  4904852663176  survey and analysis of cinema , including hist...   
11  4904852663176  survey and analysis of cinema , including hist...   
12  4904852663176  survey and analysis of cinema , including his