Module Description:
-------------------
Compare the performance of the current model (Gemma 2B-it) with Llama 3.2-1B-Instruct (the latest lightweight model developed by Meta), and introduce a torch.compile setup for faster inference.

Ownership:
----------
Project: Leveraging Artificial intelligence for Skills Extraction and Research (LAiSER)

Owner:  

        George Washington University Institute of Public Policy
        Program on Skills, Credentials and Workforce Policy
        Media and Public Affairs Building
        805 21st Street NW
        Washington, DC 20052
        PSCWP@gwu.edu
        https://gwipp.gwu.edu/program-skills-credentials-workforce-policy-pscwp

License:
--------
Copyright 2024 George Washington University Institute of Public Policy

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files
(the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Revision History:
-----------------
Rev No. | Date | Author | Description
--- | --- | --- | ---
[1.0.0] | 10/05/2024 | Prudhvi Chekuri | Initial Version

# Install Packages

In [1]:
pip -q install triton bitsandbytes pynvml

# Import Libraries

In [2]:
import os
import gc
import re
import time
import torch
import random
import pynvml
import subprocess
import numpy as np
import pandas as pd
import transformers
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Get the HuggingFace Access Token and Data

To access the Gemma and Llama models from HuggingFace, create a HuggingFace account and request access to both models. Once access is granted, create an access token in your HuggingFace account; the token is required to download these models through the HuggingFace API.

In [None]:
access_token = "Your HuggingFace Access Token"
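
A minimal sketch of an alternative to pasting the token into the notebook: read it from an environment variable so it is not saved with the file. The helper name `read_hf_token` and the default variable name `HF_TOKEN` are assumptions made for illustration, not part of the original notebook.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in the given environment variable.

    `read_hf_token` and the default variable name are hypothetical choices
    for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your access token."
        )
    return token
```

With the variable set in the environment, `access_token = read_hf_token()` could replace the hard-coded string above.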

In [None]:
# Get data
data = pd.read_csv('https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/jobs-data/linkedin_jobs_sample_36rows.csv')

data = data[['description', 'job_id']]
data.head()

Unnamed: 0,description,job_id
0,Req ID: 29534BR\n\nPOSITION SUMMARY\n\nThis po...,69322097
1,Enters data using computer applications. Assis...,70014023
2,"Kforce has a client in Austin, Texas (TX) that...",70241308
3,"*We believe that*, when done right, investing ...",70543388
4,**Description:** \nBaylor St. Luke’s Medical ...,70543468


In [4]:
# test_sample = random.choice(data['description']) # uncomment to test on random samples.
test_sample = data.iloc[17]["description"] # This sample is a tech job description, used for easier illustration.
#print(test_sample)

NOTE: The Gemma extraction code below is taken from the main branch of the LAiSER extract-module repository (links below).

https://github.com/LAiSER-Software/extract-module/blob/main/laiser/llm_methods.py

https://github.com/LAiSER-Software/extract-module/blob/main/laiser/skill_extractor.py

In [5]:
def fetch_model_output(response):
    # Capture the text between the <eos> that follows the model start tag
    # and the final <eos> of the decoded response
    pattern = r'<start_of_turn>model\s*<eos>(.*?)<eos>\s*$'
    match = re.search(pattern, response, re.DOTALL)

    # Return an empty list (rather than None) when the pattern is not found
    if not match:
        return []

    content = match.group(1).strip()

    # Split the content by lines and filter out empty lines
    lines = [line.strip() for line in content.split('\n') if line.strip()]

    # Extract skills (lines starting with '-')
    skills = [line[1:].strip() for line in lines if line.startswith('-')]

    # Drop anything that follows a stray <eos> inside a skill
    return [s.split('<eos>')[0].strip() if ('<eos>' in s) else s for s in skills]


def get_completion(query: str, model, tokenizer) -> list:
    device = "cuda:0"

    prompt_template = """
    <start_of_turn>user
    Name all the skills present in the following description in a single list. Response should be in English and have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
    Below text is the Description:

    {query}
    <end_of_turn>\n<start_of_turn>model
    """
    prompt = prompt_template.format(query=query)

    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    model_inputs = encodeds.to(device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=False)
    response = decoded.strip()
    processed_response = fetch_model_output(response)
    return processed_response
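
To make the parsing step concrete, the sketch below applies the same regular expression as `fetch_model_output` to a hand-written string. The string is a stand-in for a decoded response, not real model output, and `parse_bulleted_skills` is a hypothetical name used only here. The `<eos>` right after the model tag is presumably there because the tokenizer is loaded with `add_eos_token=True`.

```python
import re

# Same pattern as fetch_model_output
PATTERN = r'<start_of_turn>model\s*<eos>(.*?)<eos>\s*$'


def parse_bulleted_skills(decoded: str) -> list:
    """Return the '-' bullet items found between the two <eos> markers."""
    match = re.search(PATTERN, decoded, re.DOTALL)
    if not match:
        return []
    lines = [line.strip() for line in match.group(1).split('\n')]
    return [line[1:].strip() for line in lines if line.startswith('-')]


# Hand-written stand-in for a decoded response (assumption, not model output)
sample_response = (
    "<bos><start_of_turn>user\n"
    "Name all the skills present in the following description.\n"
    "<end_of_turn>\n"
    "<start_of_turn>model\n"
    "<eos>- Python\n"
    "- Data Analysis\n"
    "<eos>"
)

print(parse_bulleted_skills(sample_response))  # ['Python', 'Data Analysis']
```

A response that lacks the expected tags yields an empty list, which is easier for callers to handle than `None`.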

# Load GEMMA Model and Tokenizer

In [6]:
model_id = "google/gemma-2b-it"

In [7]:
bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.bfloat16
            )
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
    token=access_token
            )
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True, padding_side='left', token=access_token)


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



# Inference with GEMMA

In [8]:
start = time.time()
gemma_extracted_skills = get_completion(test_sample, model, tokenizer)
end = time.time()
print("Time taken: ", end-start)
print("Extracted_skills: ", gemma_extracted_skills)

Time taken:  12.024119138717651
Extracted_skills:  ['Network Design and Planning', 'Cisco Routing and Switching', 'Network Security', 'Infrastructure Design & Deployment', 'Network Management', 'Configuration', 'Problem Solving', 'Communication', 'Research', 'Teamwork']


In [9]:
def model_size(model):
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()

    size_all_mb = (param_size + buffer_size) / 1024**2
    print(f'Estimated model size: {size_all_mb:.2f} MB')

    return size_all_mb

gemma_size = model_size(model)

Estimated model size: 1945.15 MB


In [10]:
def gpu_utilization():
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    memory = info.used//1024**2
    print(f"GPU memory occupied: {memory} MB.")

    return memory

gemma_memory = gpu_utilization()

GPU memory occupied: 3876 MB.


In [11]:
del model, tokenizer

In [12]:
torch.cuda.empty_cache()
gc.collect()

4

# Load LLAMA Model and Tokenizer

In [13]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, quantization_config=bnb_config, device_map = "cuda:0", token=access_token)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)


In [14]:
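def fetch_output(response):
    # Keep the text after the prompt's final cue; matching the cue without a
    # trailing space also covers a newline directly after the colon
    response = response.split('The skills are:')[-1].strip()
    # Drop the last line, which may be cut off by the generation length limit
    response = response.split('\n')[:-1]
    # Strip leading bullets, numbering, and whitespace from each line
    response = [re.sub(r'^\W*\d*\.?\s*', '', item).strip() for item in response]

    return list(set(response))  # set() removes duplicates but not in order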
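
The sketch below walks through the same parsing idea on a hand-written string (a stand-in, not real model output). It is a variation rather than the notebook's function: `parse_skill_lines` is a hypothetical name, and unlike `fetch_output` it drops empty items and keeps the first-seen order by deduplicating with `dict.fromkeys`.

```python
import re


def parse_skill_lines(decoded: str, cue: str = "The skills are:") -> list:
    """Parse one skill per line from the text that follows the cue."""
    generated = decoded.split(cue)[-1].strip()
    # The final line may be incomplete, so it is discarded
    lines = generated.split("\n")[:-1]
    # Remove leading bullets or numbering such as "1. " or "- "
    cleaned = [re.sub(r'^\W*\d*\.?\s*', '', line).strip() for line in lines]
    # Drop empty items and deduplicate while preserving order
    return list(dict.fromkeys(item for item in cleaned if item))


# Hand-written stand-in for a decoded response (assumption)
sample = (
    "Extract the skills from the following job description : (job text).\n\n"
    "    The skills are: \n"
    "1. Python\n"
    "- SQL\n"
    "\n"
    "3. Python\n"
    "4. Data anal"
)

print(parse_skill_lines(sample))  # ['Python', 'SQL']
```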

# Inference with LLAMA

In [15]:
start = time.time()
prompt = f"""You are a hiring manager, and your job is to look at a job description and extract all the skill keywords or phrases that have no more than four words.
    Extract the skills from the following job description : {test_sample}.

    The skills are:"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_length = len(inputs[0])+150, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]
llama_extracted_skills = fetch_output(response)

end = time.time()
print('The skills are: ')
print(llama_extracted_skills)
print("TIME TAKEN: ", end-start)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The skills are: 
['Fiber Channel', 'Analytical', 'Logical design', 'Interpersonal communication', 'Oral communication', 'Written communication', 'Logical design models', 'Network management', 'Security protocols', 'Cisco enterprise level L3 switching', 'MDS', 'Technical support', 'Network management protocols', 'Transport technologies', 'Nexus 7K/5K/2K', 'Storage', 'Teamwork', 'Problem-solving', 'Customer service', 'WAN/LAN', 'Design and deploy Datacenter networks', 'Inter-company routing', 'Network security', 'Prioritization']
TIME TAKEN:  8.167593002319336


In [16]:
llama_size = model_size(model)

Estimated model size: 965.13 MB


In [17]:
llama_memory = gpu_utilization()

GPU memory occupied: 2150 MB.


In [18]:
test_sample

'For additional information email your resume to: Ryan.Rhodes@RHT.com Design and deploy Datacenter networks utilizing industry best practices and Cisco hardware to include: Cisco enterprise level L3 switching, Nexus 7K/5K/2K, MDS and Fiber Channel platforms. Design and deploy managed LANs, WANs, including routers, switches, firewalls and other hardware. Design and Deploy Cisco Unified Computing Systems (UCS), F5, VMware, and storage. Monitor performance and troubleshoot problem areas as needed. Create Low Level and High Level design documents in accordance with customer and . standards Perform installation, configuration, maintenance, and troubleshooting of customer managed hardware, software, and peripheral devices. Conduct research on network and datacenter products, services, protocols, and standards to remain abreast of developments in the networking industry. Oversee/perform new and existing equipment, hardware, and software upgrades.\n\nWith more than 100 locations worldwide, Rob

# Conclusions

Output comparison

In [19]:
gemma_extracted_skills

['Network Design and Planning',
 'Cisco Routing and Switching',
 'Network Security',
 'Infrastructure Design & Deployment',
 'Network Management',
 'Configuration',
 'Problem Solving',
 'Communication',
 'Research',
 'Teamwork']

In [20]:
llama_extracted_skills

['Fiber Channel',
 'Analytical',
 'Logical design',
 'Interpersonal communication',
 'Oral communication',
 'Written communication',
 'Logical design models',
 'Network management',
 'Security protocols',
 'Cisco enterprise level L3 switching',
 'MDS',
 'Technical support',
 'Network management protocols',
 'Transport technologies',
 'Nexus 7K/5K/2K',
 'Storage',
 'Teamwork',
 'Problem-solving',
 'Customer service',
 'WAN/LAN',
 'Design and deploy Datacenter networks',
 'Inter-company routing',
 'Network security',
 'Prioritization']
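
One rough way to compare the two lists is to normalize each skill and look at the overlap. This is a sketch under simple assumptions: the lists are copied from the two outputs above, and the normalization (lowercasing and treating hyphens as spaces) is a choice made for this example, so near-matches such as 'Communication' versus 'Oral communication' are not counted.

```python
def normalize(skill: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(skill.lower().replace("-", " ").split())


gemma_skills = [
    'Network Design and Planning', 'Cisco Routing and Switching',
    'Network Security', 'Infrastructure Design & Deployment',
    'Network Management', 'Configuration', 'Problem Solving',
    'Communication', 'Research', 'Teamwork',
]
llama_skills = [
    'Fiber Channel', 'Analytical', 'Logical design',
    'Interpersonal communication', 'Oral communication',
    'Written communication', 'Logical design models', 'Network management',
    'Security protocols', 'Cisco enterprise level L3 switching', 'MDS',
    'Technical support', 'Network management protocols',
    'Transport technologies', 'Nexus 7K/5K/2K', 'Storage', 'Teamwork',
    'Problem-solving', 'Customer service', 'WAN/LAN',
    'Design and deploy Datacenter networks', 'Inter-company routing',
    'Network security', 'Prioritization',
]

gemma_norm = {normalize(s) for s in gemma_skills}
llama_norm = {normalize(s) for s in llama_skills}

shared = sorted(gemma_norm & llama_norm)
print("Shared:", shared)
print("Gemma only:", len(gemma_norm - llama_norm))
print("Llama only:", len(llama_norm - gemma_norm))
```

On these two lists, four skills are shared after normalization, six appear only in the Gemma output, and twenty appear only in the Llama output.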

Size comparison

In [21]:
print(f'Gemma model size: {gemma_size:.2f} MB')
print(f'Llama model size: {llama_size:.2f} MB')

Gemma model size: 1945.15 MB
Llama model size: 965.13 MB


Memory Comparison

In [22]:
print(f"GPU memory occupied by Gemma: {gemma_memory} MB.")
print(f"GPU memory occupied by Llama: {llama_memory} MB.")

GPU memory occupied by Gemma: 3876 MB.
GPU memory occupied by Llama: 2150 MB.


In [23]:
del model, tokenizer

torch.cuda.empty_cache()
gc.collect()

30

Observations:

*   On this sample, Llama extracts more skills than Gemma (24 versus 10).
*   The quantized Llama model is about half the size of Gemma (965 MB versus 1945 MB), so its GPU memory consumption is also lower (2150 MB versus 3876 MB). Because Llama has fewer parameters to train, it is a good candidate for fine-tuning on our data.
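
The ratios behind these observations can be checked with a few lines of arithmetic. The numbers are copied from the outputs recorded earlier in this notebook (timings rounded to three decimals); the dictionary layout is just a convenience for this sketch. The timing ratio should be read with caution, since the two runs use different prompts and generation settings (sampling with up to 1000 new tokens for Gemma, greedy decoding with up to 150 new tokens for Llama).

```python
# Values recorded in the cells above
gemma = {"size_mb": 1945.15, "gpu_mb": 3876, "seconds": 12.024}
llama = {"size_mb": 965.13, "gpu_mb": 2150, "seconds": 8.168}

# Llama-to-Gemma ratio for each metric (values below 1 favor Llama)
ratios = {metric: llama[metric] / gemma[metric] for metric in gemma}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}")
```

This prints roughly 0.50 for model size, 0.55 for GPU memory, and 0.68 for the single-sample inference time.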



# torch.compile

This section adds starter code that sets up torch.compile for faster inference. The code can be adapted to speed up inference for any model that supports torch.compile.

For more information: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

Reference: https://github.com/huggingface/huggingface-llama-recipes/blob/main/performance_optimization/torch_compile.py

In [24]:
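def supports_bfloat16():
    # Combine the CUDA compute capability into one number, e.g. 8.0 -> 80
    device_properties = torch.cuda.get_device_properties(0)
    compute_capability = device_properties.major * 10 + device_properties.minor

    # Treat compute capability 8.0 and above as bfloat16-capable
    return compute_capability >= 80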

torch_dtype = torch.bfloat16 if supports_bfloat16() else torch.float16
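
The dtype check above can be illustrated without a GPU. The sketch below applies the same threshold to two well-known compute capabilities (7.5 for a T4 and 8.0 for an A100); `pick_dtype_name` is a hypothetical helper that returns the dtype name as a string instead of a torch dtype.

```python
def pick_dtype_name(major: int, minor: int) -> str:
    """Mirror the threshold used by supports_bfloat16 (hypothetical helper)."""
    compute_capability = major * 10 + minor
    return "bfloat16" if compute_capability >= 80 else "float16"


# Compute capabilities for two common data-center GPUs
examples = {"T4": (7, 5), "A100": (8, 0)}

for gpu, (major, minor) in examples.items():
    print(gpu, "->", pick_dtype_name(major, minor))
```

Under this rule a T4 falls back to float16, while an A100 uses bfloat16.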

In [25]:
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch_dtype, token=access_token)
model.to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)

In [26]:
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length = len(inputs[0])+100, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [27]:
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.generation_config.cache_implementation = "static"

In [28]:
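# Start the timer for the first (compiling) call; without this line the elapsed
# time would be measured from the `start` set in an earlier cell
start = time.time()
outputs = model.generate(**inputs, max_length = len(inputs[0])+100, do_sample=False)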
response = tokenizer.batch_decode(outputs)[0]
end = time.time()
print("Time taken for FIRST CALL: ", end-start)

start = time.time()
outputs = model.generate(**inputs, max_length = len(inputs[0])+100, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]
end = time.time()
print("Time taken for SECOND CALL: ", end-start)

start = time.time()
outputs = model.generate(**inputs, max_length = len(inputs[0])+100, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]
end = time.time()
print("Time taken for THIRD CALL: ", end-start)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Time taken for FIRST CALL:  82.38701581954956


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Time taken for SECOND CALL:  2.162736177444458
Time taken for THIRD CALL:  1.6693203449249268
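
The repeated timing code can be folded into a small helper. This is a sketch, not the notebook's method: `time_calls` and `fake_generate` are hypothetical names, and the stand-in workload is plain Python so the example runs without a GPU. With a compiled model, the first duration would be expected to include compilation time, as in the output above. When timing GPU work that does not copy results back to the CPU, calling `torch.cuda.synchronize()` before reading the clock is a common precaution.

```python
import time


def time_calls(fn, repeats: int = 3) -> list:
    """Call fn `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_generate():
    """Stand-in for model.generate(...) (assumption for this sketch)."""
    return sum(i * i for i in range(100_000))


durations = time_calls(fake_generate)
for call_number, seconds in enumerate(durations, start=1):
    print(f"Call {call_number}: {seconds:.4f} s")
```

In the notebook, passing `lambda: model.generate(**inputs, max_length=len(inputs[0])+100, do_sample=False)` as `fn` would reproduce the three measurements with a fresh timer for every call.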


In [29]:
fetch_output(response)

['VMware',
 'F5',
 'Security protocols',
 'Cisco enterprise level L3 switching',
 'MDS and Fiber Channel platforms',
 'WAN/LAN',
 'Network control protocols',
 'Network management protocols',
 'Transport technologies',
 'Nexus 7K/5K/2K',
 'Inter-company routing',
 'Microsoft Visio',
 'Storage',
 'Cisco Unified Computing Systems (UCS)']