Module Description:
-------------------
Adds a new LLM (Llama-3.2-1B-Instruct) to the existing module for extracting skills from text.

Ownership:
----------
Project: Leveraging Artificial intelligence for Skills Extraction and Research (LAiSER)

Owner:  

        George Washington University Institute of Public Policy
        Program on Skills, Credentials and Workforce Policy
        Media and Public Affairs Building
        805 21st Street NW
        Washington, DC 20052
        PSCWP@gwu.edu
        https://gwipp.gwu.edu/program-skills-credentials-workforce-policy-pscwp

License:
--------
Copyright 2024 George Washington University Institute of Public Policy

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files
(the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Revision History:
-----------------
Rev No. | Date       | Author          | Description
--------|------------|-----------------|----------------
[1.0.0] | 09/28/2024 | Prudhvi Chekuri | Initial Version

In [4]:
import os
import gc
import torch
import pandas as pd
import transformers

In [5]:
data = pd.read_csv('https://raw.githubusercontent.com/phanindra-max/LAiSER-datasets/master/nlx_tx_sample_data_gwu.csv')

data = data[['description', 'job_id']]
data.head()

   description                                        job_id
0  Req ID: 29534BR\n\nPOSITION SUMMARY\n\nThis po...  69322097
1  Enters data using computer applications. Assis...  70014023
2  Kforce has a client in Austin, Texas (TX) that...  70241308
3  *We believe that*, when done right, investing ...  70543388
4  **Description:** \nBaylor St. Luke’s Medical ...   70543468


In [6]:
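model_id = "meta-llama/Llama-3.2-1B-Instruct"
# Hugging Face access token; the Llama 3.2 weights are gated, so the token's
# account must have been granted access to the model.
access_token = "<Add your Hugging Face access token here>"

# Build a text-generation pipeline; bfloat16 halves memory use relative to float32.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",  # place the model on the GPU if one is available
    token=access_token,
)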
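
Pasting the access token into the notebook makes it easy to commit by accident. A minimal sketch of reading it from the environment instead, assuming the token has been exported as `HF_TOKEN` (the variable name that `huggingface_hub` also recognizes); the helper name `get_hf_token` is hypothetical and not part of the LAiSER module:

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face access token stored in an environment variable.

    `get_hf_token` is an illustrative helper, not part of the original notebook.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export it before creating the pipeline."
        )
    return token


# Possible usage in the cell above (left commented so this block has no side effects):
# access_token = get_hf_token()
```

The returned value can then be passed to `transformers.pipeline` through the same `token` argument used above.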

In [7]:
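def clear_memory():
    """Free unreferenced Python objects, then release cached GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()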

In [16]:
clear_memory()

# Pick a single job description to run through the model.
desc = data["description"][17]

messages = [
    {"role": "system", "content": "You are a highly capable assistant trained to extract skills from descriptions. Provide a clean, concise list of skills mentioned in the description without any additional commentary."},
    {"role": "user", "content": f"Extract the skills from the following description:\n\n{desc}"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
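# With chat-style input, "generated_text" holds the whole conversation;
# the last entry is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])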

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Here is the list of extracted skills:

1. Cisco enterprise level L3 switching
2. Nexus 7K/5K/2K
3. MDS
4. Fiber Channel platforms
5. Cisco Unified Computing Systems (UCS)
6. F5
7. VMware
8. Storage
9. Low Level design documents
10. High Level design documents
11. Router
12. Switch
13. Firewall
14. Hardware
15. Software
16. Peripheral devices
17. Network security
18. Technical support
19. Web development
20. Systems integration
21. Network security protocols
22. Security protocols
23. Inter-company routing
24. WAN/LAN
25. Transport technologies
26. Ethernet
27. Frame Relay
28. MPLS
29. IP communication
30. Routing
31. Inter-company routing
32. Microsoft Visio
33. Customer service
34. Problem-solving
35. Analytical thinking
36. Time management
37. Customer service orientation
38. Technical writing
39. Project management
40. Network control protocols
41. Network management protocols
42. Security protocols
43. Interpersonal communication
44. Written communication
45. Oral communication
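
The reply above is free text: it opens with an introductory sentence and repeats some entries (items 23 and 31, and items 22 and 42). One way to turn it into a clean Python list is sketched below. The function name `parse_skills` and the case-insensitive de-duplication rule are assumptions made for illustration; they are not part of the original notebook.

```python
import re
from typing import List

# Matches lines such as "12. Switch", "3) Routing" or "- VMware".
_ITEM_PATTERN = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def parse_skills(generated: str) -> List[str]:
    """Extract list items from the model reply, dropping repeated entries.

    Lines that are not numbered or bulleted (for example the introductory
    sentence) are ignored. Duplicates are detected case-insensitively and
    the first spelling encountered is kept.
    """
    skills: List[str] = []
    seen = set()
    for line in generated.splitlines():
        match = _ITEM_PATTERN.match(line)
        if match is None:
            continue
        skill = match.group(1)
        key = skill.casefold()
        if key not in seen:
            seen.add(key)
            skills.append(skill)
    return skills


sample = (
    "Here is the list of extracted skills:\n\n"
    "1. MPLS\n2. Routing\n3. mpls\n4. Inter-company routing\n"
)
print(parse_skills(sample))  # -> ['MPLS', 'Routing', 'Inter-company routing']
```

In the notebook, the same function could be applied to `outputs[0]["generated_text"][-1]["content"]` instead of `sample`.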

