Module Description:
-------------------
Adds dynamic batching and vLLM support to the current version, then compares the performance of the current version with and without vLLM when extracting skills from job descriptions.

Ownership:
----------
Project: Leveraging Artificial intelligence for Skills Extraction and Research (LAiSER)

Owner:  

        George Washington University Institute of Public Policy
        Program on Skills, Credentials and Workforce Policy
        Media and Public Affairs Building
        805 21st Street NW
        Washington, DC 20052
        PSCWP@gwu.edu
        https://gwipp.gwu.edu/program-skills-credentials-workforce-policy-pscwp

License:
--------
Copyright 2024 George Washington University Institute of Public Policy

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files
(the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Revision History:
-----------------
Rev No. | Date | Author | Description
--- | --- | --- | ---
[1.0.0] | 11/28/2024 | Prudhvi Chekuri | Base version with dynamic batching and vLLM code additions.

NOTE: To reproduce the execution times in this notebook you need an NVIDIA A100 GPU.

# Without vLLM

# Setup Environment for running Gemma model

In [1]:
!pip install -U bitsandbytes -q

In [2]:
import os
import re
import gc
import sys
import spacy
import torch
import subprocess
import numpy as np
import pandas as pd
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [3]:
from google.colab import userdata
HF_API_KEY = userdata.get('HF_TOKEN')
model_id = "google/gemma-2b-it"

In [None]:
jobs_data = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/jobs-data/linkedin_jobs_sample_36rows.csv")
jobs_data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,system_job_id,file_name,file_id,source_state,expired,expired_date,date_compiled,created_date,...,application_method_1,application_method_2,application_method_3,application_method_4,application_method_text_1,application_method_text_2,application_method_text_3,application_method_text_4,description,job_id
0,0,0,5109722,TX_JCJobs.xml,,,1,2016-01-15 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,Req ID: 29534BR\n\nPOSITION SUMMARY\n\nThis po...,69322097
1,1,1,5688866,TX_StateJobs.xml,,,1,2016-01-06 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,Enters data using computer applications. Assis...,70014023
2,2,2,5974087,TX_JCJobs.xml,,,1,2016-02-17 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,"Kforce has a client in Austin, Texas (TX) that...",70241308
3,3,3,6230051,TX_JCJobs.xml,,,1,2016-02-02 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,"*We believe that*, when done right, investing ...",70543388
4,4,4,6230127,TX_JCJobs.xml,,,1,2016-01-12 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,**Description:** \nBaylor St. Luke’s Medical ...,70543468


# Load Gemma model and tokenizer

In [5]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    token=HF_API_KEY
)
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True, padding_side='left', token=HF_API_KEY)


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



# Inference

In [6]:
def fetch_model_output(response):
    """
    Format the model's output to extract the skill keywords from the get_completion() response

    Parameters
    ----------
    response : str
        The model's response after processing the prompt.
        Contains special tags to identify the start and end of the model's response.

    Returns
    -------
    list: List of extracted skills from text

    """
    # Find the content between the model start tag and the last <eos> tag
    pattern = r'<start_of_turn>model\s*<eos>(.*?)<eos>\s*$'
    match = re.search(pattern, response, re.DOTALL)

    if match:
        content = match.group(1).strip()

        # Split the content by lines and filter out empty lines
        lines = [line.strip() for line in content.split('\n') if line.strip()]

        # Extract skills (lines starting with '-')
        skills = [line[1:].strip() for line in lines if line.startswith('-')]

        return skills

    # No model turn found in the response: return an empty list instead of None
    return []

def get_completion_batch(queries: list, model, tokenizer, batch_size=2) -> list:
    """
    Get completions for a list of queries using the model

    Parameters
    ----------
    queries : list
        List of queries to get completions for using the model
    model : model
        The model to use for generating completions
    tokenizer : tokenizer
        The tokenizer to use for encoding the queries
    batch_size : int, optional
        Preferred batch size to use for generating completions

    Returns
    -------
    list: List of extracted skills from the text(s)

    """

    device = "cuda:0"
    results = []

    prompt_template = """
    <start_of_turn>user
    Name all the skills present in the following description in a single list. Response should be in English and have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
    Below text is the Description:

    {query}
    <end_of_turn>\n<start_of_turn>model
    """

    for i in tqdm(range(0, len(queries), batch_size)):
        batch = queries[i:i+batch_size]
        prompts = [prompt_template.format(query=query) for query in batch]

        encodeds = tokenizer(prompts, return_tensors="pt", add_special_tokens=True, padding=True, truncation=True)
        model_inputs = encodeds.to(device)

        with torch.no_grad():
            generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

        decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)

        for full_output in decoded:
            # Extract only the model's response
            response = full_output.split("<start_of_turn>model<eos>")[-1].strip()
            processed_response = fetch_model_output(response)
            results.append(processed_response)

        # Clear CUDA cache after each batch
        del generated_ids, decoded, model_inputs, batch, prompts, encodeds, response, processed_response
        gc.collect()
        torch.cuda.empty_cache()

        print(f"Processed batch {i//batch_size + 1}/{(len(queries)-1)//batch_size + 1}")

    return results

def get_gpu_memory_info(gpu_id=0):
    """
    Get the GPU memory usage and free memory for a specific GPU.

    Parameters
    ----------
    gpu_id : int, optional
        The ID of the GPU to query (default is 0)

    Returns
    -------
    dict
        A dictionary with keys 'memory_used' and 'memory_free' (in MiB)
    """
    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.used,memory.free",
                "--format=csv,noheader,nounits",
                "-i", str(gpu_id)
            ],
            stdout=subprocess.PIPE,
            text=True
        )
        output = result.stdout.strip()
        used, free = map(int, output.split(","))
        return {"memory_used": used, "memory_free": free}
    except Exception as e:
        print(f"Failed to query GPU memory: {e}")
        return {"memory_used": 0, "memory_free": 0}

gpu_memory_info = get_gpu_memory_info()
print(f"GPU Memory Used: {gpu_memory_info['memory_used']} MiB")
print(f"GPU Memory Free: {gpu_memory_info['memory_free']} MiB")

GPU Memory Used: 2479 MiB
GPU Memory Free: 38034 MiB
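
The parsing step inside `get_gpu_memory_info` can be tried without a GPU. The sketch below uses a hypothetical helper, `parse_memory_csv`, which is not part of the notebook. It assumes `nvidia-smi` prints a single `used, free` line in MiB, as requested by `--format=csv,noheader,nounits`.

```python
def parse_memory_csv(line: str) -> dict:
    """Parse one 'used, free' line (values in MiB) into a dictionary."""
    # Split on the comma and strip whitespace/newlines around each field
    used, free = (int(field.strip()) for field in line.strip().split(","))
    return {"memory_used": used, "memory_free": free}


# Sample line built from the values printed above
info = parse_memory_csv("2479, 38034\n")
print(info)
```

Keeping the parsing separate from the `subprocess.run` call makes it possible to test the format handling on machines without an NVIDIA driver.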


In [None]:
def get_completion_batch_dynamic(queries: list, model, tokenizer, initial_batch_size=16) -> list:
    """
    Dynamically adjust batch size based on GPU memory availability while processing queries.

    Parameters
    ----------
    queries : list
        List of queries to get completions for using the model
    model : model
        The model to use for generating completions
    tokenizer : tokenizer
        The tokenizer to use for encoding the queries
    initial_batch_size : int, optional
        Initial batch size to use for generating completions (default is 16)

    Returns
    -------
    list: List of extracted skills from the text(s)
    """

    device = "cuda:0"
    results = []
    batch_size = initial_batch_size
    i = 0

    prompt_template = """
    <start_of_turn>user
    Name all the skills present in the following description in a single list. Response should be in English and have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
    Below text is the Description:

    {query}
    <end_of_turn>\n<start_of_turn>model
    """

    with tqdm(total=len(queries), desc="Processing queries", unit="query") as pbar:
        while i < len(queries):
            try:
                # Create the batch
                batch = queries[i:i+batch_size]
                prompts = [prompt_template.format(query=query) for query in batch]

                # Tokenize the batch
                encodeds = tokenizer(prompts, return_tensors="pt", add_special_tokens=True, padding=True, truncation=True)
                model_inputs = encodeds.to(device)  # Tokenizer output is already on the CPU; move it to the GPU

                # Generate responses
                with torch.no_grad():
                    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

                # Decode and process results
                decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
                results.extend(fetch_model_output(output.split("<start_of_turn>model<eos>")[-1].strip()) for output in decoded)

                # Update progress bar
                pbar.update(len(batch))

                # Clear memory
                del generated_ids, decoded, model_inputs, encodeds, batch, prompts
                gc.collect()
                torch.cuda.empty_cache()

                # Monitor GPU memory usage
                gpu_memory_info = get_gpu_memory_info()
                memory_used = gpu_memory_info['memory_used']
                memory_free = gpu_memory_info['memory_free']

                # Advance past the batch that was just processed
                i += batch_size

                # If sufficient memory is free, double the batch size (capped at 64) for the next iteration
                if memory_free > memory_used + 1000 and batch_size < 64:
                    batch_size *= 2

                #print(f"Next batch size {batch_size}, Free Memory: {memory_free} MiB")

            except RuntimeError as e:
                if "CUDA out of memory" in str(e):
                    if batch_size <= 16:
                        raise  # Already at the minimum batch size; retrying would loop forever
                    print(f"OOM error encountered with batch size {batch_size}. Reducing batch size...")
                    batch_size = max(batch_size // 2, 16)  # Halve the batch size but ensure it's at least 16
                    torch.cuda.empty_cache()  # Clear cache and retry
                else:
                    raise  # Re-raise non-OOM exceptions

    return results
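
The grow-and-shrink behaviour of the batch size can be illustrated without a model. The sketch below is a simplified stand-in, not the notebook's implementation: `run_dynamic_batches` and `fake_process` are hypothetical names, a `MemoryError` plays the role of a CUDA out-of-memory error, and the batch size is doubled after every successful batch instead of checking free GPU memory.

```python
def run_dynamic_batches(items, process_batch, initial_batch_size=16,
                        min_batch_size=1, max_batch_size=64):
    """Process items in batches, halving the batch size on a simulated OOM
    and doubling it again (up to max_batch_size) after each success."""
    results, sizes_used = [], []
    batch_size = initial_batch_size
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(process_batch(batch))
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further, so surface the error
            batch_size = max(batch_size // 2, min_batch_size)
            continue  # retry the same position with a smaller batch
        sizes_used.append(len(batch))
        i += len(batch)  # advance only after a successful batch
        batch_size = min(batch_size * 2, max_batch_size)
    return results, sizes_used


def fake_process(batch):
    # Stand-in for model.generate(): any batch larger than 8 items "runs out of memory"
    if len(batch) > 8:
        raise MemoryError("simulated OOM")
    return [x * 2 for x in batch]


results, sizes_used = run_dynamic_batches(list(range(20)), fake_process, initial_batch_size=16)
print(sizes_used)
```

Advancing the index only after a successful batch means a failed batch is retried from the same position, so no query is skipped or processed twice.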

In [8]:
result = get_completion_batch_dynamic(queries=jobs_data['description'].tolist(), model=model, tokenizer=tokenizer, initial_batch_size=64)

Processing queries:   0%|          | 0/5000 [00:00<?, ?query/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Processing queries:   1%|▏         | 64/5000 [00:26<33:31,  2.45query/s]
Processing queries:   3%|▎         | 128/5000 [00:36<21:14,  3.82query/s]
Processing queries:   4%|▍         | 192/5000 [00:54<21:35,  3.71query/s]
...
OOM error encountered with batch size 64. Reducing batch size...
Processing queries:   8%|▊         | 416/5000 [02:00<23:00,  3.32query/s]
OOM error encountered with batch size 64. Reducing batch size...
Processing queries:   9%|▉         | 448/5000 [02:16<26:06,  2.91query/s]
Processing queries:  10%|█         | 512/5000 [02:48<30:08,  2.48query/s]
Processing queries:  12%|█▏        | 576/5000 [03:02<24:57,  2.95query/s]
Processing queries:  13%|█▎        | 640/5000 [03:16<21:43,  3.35query/s]
...
OOM error encountered with batch size 64. Reducing batch size...
Processing queries:  51%|█████     | 2528/5000 [12:01<11:47,  3.49query/s]
OOM error encountered with batch size 64. Reducing batch size...
Processing queries:  51%|█████     | 2560/5000 [12:16<13:09,  3.09query/s]
Processing queries:  52%|█████▏    | 2624/5000 [12:25<10:07,  3.91query/s]
Processing queries:  54%|█████▍    | 2688/5000 [12:37<08:54,  4.33query/s]
Processing queries:  55%|█████▌    | 2752/5000 [12:49<08:12,  4.56query/s]
...
OOM error encountered with batch size 64. Reducing batch size...
OOM error encountered with batch size 32. Reducing batch size...
Processing queries:  63%|██████▎   | 3152/5000 [14:30<07:53,  3.90query/s]
OOM error encountered with batch size 32. Reducing batch size...
Processing queries:  63%|██████▎   | 3168/5000 [14:45<10:18,  2.96query/s]
Processing queries:  64%|██████▍   | 3200/5000 [14:56<10:15,  2.92query/s]
Processing queries:  65%|██████▌   | 3264/5000 [15:10<08:12,  3.53query/s]
Processing queries:  67%|██████▋   | 3328/5000 [15:20<06:40,  4.18query/s]
...
OOM error encountered with batch size 64. Reducing batch size...
Processing queries:  70%|██████▉   | 3488/5000 [16:23<08:22,  3.01query/s]
OOM error encountered with batch size 64. Reducing batch size...
Processing queries:  70%|███████   | 3520/5000 [16:44<09:59,  2.47query/s]
Processing queries:  72%|███████▏  | 3584/5000 [17:01<08:13,  2.87query/s]
Processing queries:  73%|███████▎  | 3648/5000 [17:13<06:38,  3.39query/s]
Processing queries:  74%|███████▍  | 3712/5000 [17:30<06:04,  3.53query/s]
...

# Save Output and Clean Memory

In [9]:
jobs_data["output"] = result

In [10]:
jobs_data.to_csv("gemma_actual_A100_output.csv", index=False)

In [16]:
del model, tokenizer
gc.collect()
torch.cuda.empty_cache()

# With vLLM

Restart the session before running the cells below.

# Setup environment for vLLM

In [1]:
!pip install vllm triton -q

In [2]:
!pip install -U bitsandbytes -q

In [3]:
import re
import gc
import torch
import pandas as pd

import triton
from tqdm import tqdm
from vllm import LLM, SamplingParams

import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

In [None]:
jobs_data = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/jobs-data/linkedin_jobs_sample_36rows.csv")
jobs_data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,system_job_id,file_name,file_id,source_state,expired,expired_date,date_compiled,created_date,...,application_method_1,application_method_2,application_method_3,application_method_4,application_method_text_1,application_method_text_2,application_method_text_3,application_method_text_4,description,job_id
0,0,0,5109722,TX_JCJobs.xml,,,1,2016-01-15 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,Req ID: 29534BR\n\nPOSITION SUMMARY\n\nThis po...,69322097
1,1,1,5688866,TX_StateJobs.xml,,,1,2016-01-06 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,Enters data using computer applications. Assis...,70014023
2,2,2,5974087,TX_JCJobs.xml,,,1,2016-02-17 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,"Kforce has a client in Austin, Texas (TX) that...",70241308
3,3,3,6230051,TX_JCJobs.xml,,,1,2016-02-02 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,"*We believe that*, when done right, investing ...",70543388
4,4,4,6230127,TX_JCJobs.xml,,,1,2016-01-12 00:00:00+00:00,2016-01-01 00:00:00+00:00,2016-01-01 00:00:00+00:00,...,,,,,,,,,**Description:** \nBaylor St. Luke’s Medical ...,70543468


# Load Gemma model with vLLM

In [5]:
llm = LLM(model="google/gemma-2b-it", dtype="bfloat16", quantization='bitsandbytes', load_format='bitsandbytes')

INFO 11-27 06:18:47 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='google/gemma-2b-it', speculative_config=None, tokenizer='google/gemma-2b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=google/gemma-2b-it, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 11-27 06:18:52 model_runner.py:1077] Loading model weights took 1.9351 GB
INFO 11-27 06:18:58 worker.py:232] Memory profiling results: total_gpu_memory=39.56GiB initial_memory_usage=2.43GiB peak_torch_memory=4.29GiB memory_usage_post_profile=2.47GiB non_torch_memory=0.52GiB kv_cache_size=30.79GiB gpu_memory_utilization=0.90
INFO 11-27 06:18:58 gpu_executor.py:113] # GPU blocks: 112107, # CPU blocks: 14563
INFO 11-27 06:18:58 gpu_executor.py:117] Maximum concurrency for 8192 tokens per request: 218.96x
INFO 11-27 06:19:00 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-27 06:19:00 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
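
The figures in the memory profiling log lines can be cross-checked with simple arithmetic. The sketch below copies the values from the log and assumes 16 tokens per KV-cache block, which is an assumption about the block size and is not stated in the log. Under that assumption the numbers agree with the logged `kv_cache_size=30.79GiB` and the maximum concurrency of 218.96x.

```python
# Values copied from the vLLM log lines above (GiB)
total_gpu_memory = 39.56
gpu_memory_utilization = 0.90
peak_torch_memory = 4.29
non_torch_memory = 0.52

# Memory budget left for the KV cache after the profiling run
kv_cache_gib = (total_gpu_memory * gpu_memory_utilization
                - peak_torch_memory - non_torch_memory)

# Assumption: 16 tokens per KV-cache block
block_size_tokens = 16
num_gpu_blocks = 112107
max_seq_len = 8192
max_concurrency = num_gpu_blocks * block_size_tokens / max_seq_len

print(f"kv_cache_size ~ {kv_cache_gib:.2f} GiB")
print(f"max concurrency ~ {max_concurrency:.2f}x")
```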

# Inference

In [6]:
def vllm_batch_generate(queries, batch_size=32):
    """
    Generate completions for a list of job descriptions with vLLM.

    Parameters
    ----------
    queries : list
        Job description texts to extract skills from
    batch_size : int, optional
        Number of prompts submitted to llm.generate() per call (default is 32)

    Returns
    -------
    list: vLLM output objects, one per query
    """

    result = []

    sampling_params = SamplingParams(max_tokens=1000)

    prompt_template = """
    <start_of_turn>user
    Name all the skills present in the following job description in a **single list**. Response should be in English and have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
    Below text is the Description:

    {query}
    <end_of_turn>\n<start_of_turn>model
    """

    for start in range(0, len(queries), batch_size):
        # Build prompts from the queries argument instead of the global jobs_data
        prompts = [prompt_template.format(query=query) for query in queries[start:start+batch_size]]

        output = llm.generate(prompts, sampling_params=sampling_params)

        result.extend(output)

    return result

In [7]:
result = vllm_batch_generate(jobs_data["description"].tolist(), batch_size=4096)

Processed prompts: 100%|██████████| 4096/4096 [02:29<00:00, 27.42it/s, est. speed input: 22362.24 toks/s, output: 1621.09 toks/s]
Processed prompts: 100%|██████████| 904/904 [00:32<00:00, 27.43it/s, est. speed input: 23286.75 toks/s, output: 1585.91 toks/s]
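
The two progress bars above correspond to the two calls to `llm.generate()` that result from stepping through 5000 prompts with `batch_size=4096`. A minimal sketch of that split, using a hypothetical helper `batch_sizes` that is not part of the notebook:

```python
def batch_sizes(num_items: int, batch_size: int) -> list:
    """Return the size of each batch produced by stepping through
    num_items with range(0, num_items, batch_size)."""
    return [min(batch_size, num_items - start)
            for start in range(0, num_items, batch_size)]


print(batch_sizes(5000, 4096))
```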


# Save Output

In [12]:
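def fetch_vllm_model_output(text):
    """Extract skills from list items (-, +, * or numbered) in the generated text."""
    # Anchor to the start of each line so hyphens inside a skill name are not treated as bullets
    skills = re.findall(r'^\s*(?:[-+*]|\d+\.)\s*(.+)', text, flags=re.MULTILINE)
    # Strip Markdown bold/italic markers
    cleaned_skills = [re.sub(r'\*{1,2}', '', skill).strip() for skill in skills]
    return cleaned_skills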

In [13]:
output = []
for r in result:
    output.append(fetch_vllm_model_output(r.outputs[0].text))
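
To see what the list-item extraction does, it can be run on a small hand-written response. The sketch below uses a line-anchored pattern in the same spirit as `fetch_vllm_model_output`. The function name `extract_bulleted_skills` and the sample text are made up for illustration and are not model output.

```python
import re


def extract_bulleted_skills(text: str) -> list:
    """Collect items from lines that start with -, +, * or 'N.' and strip
    Markdown bold/italic markers."""
    items = re.findall(r'^\s*(?:[-+*]|\d+\.)\s*(.+)', text, flags=re.MULTILINE)
    return [re.sub(r'\*{1,2}', '', item).strip() for item in items]


# Hand-written sample mixing bullet styles, bold markers and a plain sentence
sample = "- Data entry\n* **Customer service**\n1. Problem-solving\nThanks for reading."
print(extract_bulleted_skills(sample))
```

The last line of the sample is ignored because it does not start with a list marker, and the hyphen inside "Problem-solving" is not treated as a bullet.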

In [15]:
jobs_data["output"] = output

In [16]:
jobs_data.to_csv("gemma_vllm_A100_output.csv", index=False)

# Conclusions

Task: Extract skills from 5000 job descriptions.
- GPU Used: A100
- Experiment1: Execution time for the current version with dynamic batching - 23 minutes.
- Experiment2: Execution time for the current version with the vLLM wrapper - 3 minutes.

Other observations on the same task and experiments.
- GPU Used: T4
- Experiment1: 7 Hrs
- Experiment2: 1 Hr
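
The speedups implied by these timings can be checked with a few lines of arithmetic. The sketch below uses only the rounded wall-clock times reported above, so the results are approximate.

```python
# Wall-clock times reported above, in minutes
timings = {
    "A100": {"dynamic_batching": 23, "vllm": 3},
    "T4": {"dynamic_batching": 7 * 60, "vllm": 1 * 60},
}

num_queries = 5000
speedups = {}
for gpu, t in timings.items():
    speedups[gpu] = t["dynamic_batching"] / t["vllm"]
    throughput = num_queries / (t["vllm"] * 60)  # queries per second with vLLM
    print(f"{gpu}: {speedups[gpu]:.1f}x speedup, {throughput:.1f} queries/s with vLLM")
```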


Based on these experiments, vLLM gives roughly a 7x speedup over the current version with dynamic batching on both GPUs (about 7.7x on the A100 and 7x on the T4).