# LLM 서빙 프레임워크

## 오프라인 서빙

**데이터셋 준비**

In [3]:
import torch
from datasets import load_dataset

In [3]:
def make_prompt(ddl, question, query=''):
    prompt = f"""당신은 SQL을 생성하는 SQL 봇입니다. DDL의 테이블을 활용한 Question을 해결할 수 있는 SQL 쿼리를 생성하세요.

### DDL:
{ddl}

### Question:
{question}

### SQL:
{query}"""
    return prompt

In [4]:
dataset = load_dataset("shangrilar/ko_text2sql", "origin")['test']
dataset = dataset.to_pandas()

In [5]:
for idx, row in dataset.iterrows():
    prompt = make_prompt(row['context'], row['question'])
    dataset.loc[idx, 'prompt'] = prompt

**모델과 토크나이저를 불러와 추론 파이프라인 준비**

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

2024-10-28 02:18:05.651003: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-28 02:18:05.657983: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-28 02:18:05.666003: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-28 02:18:05.668419: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-28 02:18:05.674948: I tensorflow/core/platform/cpu_feature_guar

In [7]:
model_id = "shangrilar/yi-ko-6b-text2sql"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/9.74k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/4.28M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/467 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


**배치 크기에 따른 추론 시간 확인**

In [5]:
import time

In [10]:
for batch_size in [1, 2, 4, 8, 16, 32]:
    start_time = time.time()
    hf_pipeline(dataset['prompt'].tolist(), max_new_tokens=128, batch_size=batch_size)
    print(f"{batch_size}: {time.time() - start_time}")

1: 63.74052119255066
2: 68.76405763626099
4: 46.17130756378174
8: 32.09809470176697
16: 23.121740341186523
32: 20.11572813987732


In [9]:
import gc

def cleanup():
    if 'model' in globals():
        del globals()['model']
    if 'dataset' in globals():
        del globals()['dataset']
    gc.collect()
    torch.cuda.empty_cache()

In [10]:
cleanup()

torch.cuda.empty_cache()

In [11]:
print(f"Allocated Memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")
print(f"Cached Memory: {torch.cuda.memory_reserved() / 1024 ** 2:.2f} MB")

Allocated Memory: 20889.91 MB
Cached Memory: 20964.00 MB


**vLLM 모델 불러오기**

In [6]:
from vllm import LLM, SamplingParams

In [7]:
model_id = "shangrilar/yi-ko-6b-text2sql"
llm = LLM(model=model_id, dtype=torch.float16, max_model_len=1024)

INFO 10-28 02:18:11 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='shangrilar/yi-ko-6b-text2sql', speculative_config=None, tokenizer='shangrilar/yi-ko-6b-text2sql', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 10-28 02:18:12 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 10-28 02:18:12 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 10-28 02:18:12 selector.py:33] Using XFormers backend.
INFO 10-28 02:18:18 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 10-28 02:18:21 model_runner.py:173] Loading model weights took 11.5127 GB
INFO 10-28 02:18:21 gpu_executor.py:119] # GPU blocks: 9049, # CPU blocks: 4096
INFO 10-28 02:18:22 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-28 02:18:22 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eag

**vLLM을 활용한 오프라인 추론 시간 측정**

In [8]:
for max_num_seqs in [1, 2, 4, 8, 16, 32]:
    start_time = time.time()
    llm.llm_engine.scheduler_config.max_num_seqs = max_num_seqs
    sampling_params = SamplingParams(temperature=1, top_p=1, max_tokens=128)
    output = llm.generate(dataset['prompt'].to_list(), sampling_params)
    print(f"{max_num_seqs}: {time.time() - start_time}")

NameError: name 'dataset' is not defined

## 온라인 서빙

**온라인 서빙을 위한 vLLM API 서버 실행**

In [3]:
!python3 -m vllm.entrypoints.openai.api_server \
--model shangrilar/yi-ko-6b-text2sql --host 127.0.0.1 --port 8890 --max-model-len 1024

INFO 10-28 02:21:07 api_server.py:151] vLLM API server version 0.4.1
INFO 10-28 02:21:07 api_server.py:152] args: Namespace(host='127.0.0.1', port=8890, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='shangrilar/yi-ko-6b-text2sql', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1024, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_ca

**백그라운드에서 vLLM API 서버 실행하기**

In [10]:
%%bash

nohup python3 -m vllm.entrypoints.openai.api_server \
--model shangrilar/yi-ko-6b-text2sql --host 127.0.0.1 --port 8890 --max-model-len 1024 &

INFO 10-28 02:24:19 api_server.py:151] vLLM API server version 0.4.1
INFO 10-28 02:24:19 api_server.py:152] args: Namespace(host='127.0.0.1', port=8890, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='shangrilar/yi-ko-6b-text2sql', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1024, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_ca

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 10-28 02:24:19 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 10-28 02:24:19 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 10-28 02:24:19 selector.py:33] Using XFormers backend.
INFO 10-28 02:24:20 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 10-28 02:24:22 model_runner.py:173] Loading model weights took 11.5127 GB
INFO 10-28 02:24:23 gpu_executor.py:119] # GPU blocks: 9049, # CPU blocks: 4096
INFO 10-28 02:24:23 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-28 02:24:23 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eag

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [1950]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8890 (Press CTRL+C to quit)


INFO 10-28 02:24:37 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 10-28 02:24:47 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
Process is interrupted.


INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1950]


**API 서버 실행 확인**

In [11]:
!curl http://localhost:8890/v1/models

/usr/bin/sh: 1: curl: not found


**API 요청**

In [None]:
import json

In [None]:
json_data = json.dumps(
    {"model": "shangrilar/yi-ko-6b-text2sql",
     "prompt": dataset.loc[0, "prompt"],
     "max_tokens": 128,
     "temperature": 1}
)

In [None]:
!curl http://localhost:8888/v1/completions \
    -H "Content-Type: application/json" \
    -d '{json_data}'

**OpenAI 클라이언트를 사용한 API 요청**

In [None]:
from openai import OpenAI

In [11]:
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8888/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="shangrilar/yi-ko-6b-text2sql", prompt=dataset.loc[0, 'prompt'], max_tokens=128)
print("생성 결과:", completion.choices[0].text)

SyntaxError: unterminated string literal (detected at line 8) (2570726381.py, line 8)