# Overview
I used the following great notebook as a reference.

[Infer 34B with vLLM](https://www.kaggle.com/code/cdeotte/infer-34b-with-vllm) (by [Chris Deotte](https://www.kaggle.com/cdeotte))

[vLLM](https://github.com/vllm-project/vllm) makes inference very fast!

Inference on the test data took about 30 minutes (GPU T4×2).

(Update 2024/10/26) Inference on the test data took about 15 minutes (GPU L4×4).

The model is [deepseek-math-7b-instruct](https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct), which was strong in [AIMO1](https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize).

# Question
Please comment.

Q1. When I used GPU L4×4, I got the following error when loading vLLM.

> The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

(2024/10/19)

A1. This can be solved by setting `swap_space=2`. According to the docs, this is the amount of CPU memory (in GiB) allocated as swap space for each available GPU. Since we are only allowed about 30 GB of RAM, a larger value leads to an OOM error.

Thank you, [pjmathematician](https://www.kaggle.com/pjmathematician).
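
The arithmetic behind this answer can be sketched as follows. This is only a rough budget, not a measurement: the 30 GiB RAM figure comes from the answer above, and the value 4 is assumed here to be the vLLM default for `swap_space` (please check the docs for your vLLM version). The helper `total_swap_gib` is a hypothetical name used only for illustration.

```python
def total_swap_gib(swap_space_gib: float, num_gpus: int) -> float:
    """CPU RAM (GiB) reserved for swap when every GPU gets `swap_space_gib`."""
    return swap_space_gib * num_gpus


RAM_GIB = 30   # assumption: host RAM quoted in the answer above
NUM_GPUS = 4   # L4×4

# 4 is assumed to be the default value; 2 is the value suggested in the answer.
for swap in (4, 2):
    reserved = total_swap_gib(swap, NUM_GPUS)
    left = RAM_GIB - reserved
    print(f"swap_space={swap}: {reserved:.0f} GiB reserved, {left:.0f} GiB left")
```

With four GPUs the reserved swap space grows four times as fast as with one, which is probably why the same notebook worked on T4×2 but not on L4×4.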

# Comment
I have not done any prompt engineering.
Prompt engineering alone would probably improve the score.

The outputs generated by the LLM are currently post-processed in a messy way. (There is much room for improvement!)
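
As one possible cleanup, here is a minimal sketch of a single parsing helper. It is not what this notebook currently does: `extract_boxed_answer` and its `fallback` parameter are hypothetical names, and the choice to read the last `\boxed{}` instead of the first is an assumption that the final box is the model's final answer.

```python
import re

# Matches \boxed{...} when the content has no nested braces.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")
# A non-negative integer, optionally written with thousands separators.
INT_RE = re.compile(r"\s*([0-9][0-9,]*)\s*")


def extract_boxed_answer(text: str, fallback: int = 0) -> int:
    """Return the last integer found in a \\boxed{} span, modulo 1000."""
    for content in reversed(BOXED_RE.findall(text)):
        match = INT_RE.fullmatch(content)
        if match:
            return int(match.group(1).replace(",", "")) % 1000
    # No parsable box was found.
    return fallback


print(extract_boxed_answer(r"First \boxed{12}, so the answer is \boxed{1,234}."))
```

Boxes that contain something other than a plain integer (for example a fraction) are skipped, so the function falls back to an earlier box or to `fallback`.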

The solutions in the previous competition were quite unique: they even had the model generate a program to solve the problem and then executed it.
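
To give a feeling for that idea, here is a rough sketch of the "generate a program, then execute it" step. It is not the actual code of any AIMO1 solution: `extract_code` and `run_code` are hypothetical helpers, and it assumes the model wraps its program in a Markdown code fence. Running model-generated code is risky, so a separate process with a timeout is the bare minimum.

```python
import re
import subprocess
import sys
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep this example readable
CODE_BLOCK_RE = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced code block in the model response, if any."""
    blocks = CODE_BLOCK_RE.findall(response)
    return blocks[-1] if blocks else None


def run_code(code: str, timeout_s: float = 10.0) -> Optional[str]:
    """Run the code in a fresh Python process and return its stdout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # the program ran too long
    if proc.returncode != 0:
        return None  # the program crashed
    return proc.stdout.strip()


response = "Let's compute it.\n" + FENCE + "python\nprint(6 * 7)\n" + FENCE
code = extract_code(response)
if code is not None:
    print(run_code(code))
```

The printed output would then go through the same answer parsing as a normal text response.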

Inference with vLLM is a basic building block for this competition.

Let's enjoy the competition together!


To be updated!! (I plan to add more hints if the number of votes increases.)

In [None]:
import os
import polars as pl
import kaggle_evaluation.aimo_2_inference_server

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose all four GPUs (L4×4)

In [None]:
%%time
!pip uninstall -y torch
!pip install -U --no-index --find-links=/kaggle/input/vllm-whl vllm
!pip install -U /kaggle/input/vllm-t4-fix/grpcio-1.62.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install -U /kaggle/input/vllm-t4-fix/ray-2.11.0-cp310-cp310-manylinux2014_x86_64.whl

In [None]:
# ====================================================
# Library
# ====================================================
import gc
import warnings
warnings.filterwarnings('ignore')
import random
import scipy as sp
import numpy as np
import pandas as pd
import math
from glob import glob
from pathlib import Path
import joblib
import pickle
import itertools
from tqdm.auto import tqdm
import re

import vllm

In [None]:
llm = vllm.LLM(
    "/kaggle/input/deepseek-math-7b-instruct/transformers/main/1", # "deepseek-ai/deepseek-math-7b-instruct"
    tensor_parallel_size=4,       # number of GPUs: 2 for T4×2, 4 for L4×4
    gpu_memory_utilization=0.95,  # fraction of each GPU's memory vLLM may use
    trust_remote_code=True,
    dtype="half",
    enforce_eager=True,
    swap_space=2,                 # CPU swap space (GiB) per GPU; avoids the OOM on L4×4
)
tokenizer = llm.get_tokenizer()

In [None]:
def generate_text_vllm(requests, tokenizer, model):
    """Generate one completion per prompt and return the texts as a list.

    `tokenizer` is currently unused; it is kept so existing calls still work.
    """
    sampling_params = vllm.SamplingParams(
        temperature=0.00,  # greedy decoding
        seed=42,
        max_tokens=1024
    )
    responses = model.generate(requests, sampling_params=sampling_params, use_tqdm=False)
    response_text_list = []
    for response in responses:
        # Keep only the first (and only) completion of each request.
        response_text_list.append(response.outputs[0].text)
    return response_text_list

In [None]:
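def naive_parse(answer):
    """Return the last contiguous run of digits in `answer` ('' if there is none)."""
    out = []
    start = False  # True once the first digit (from the right) has been seen
    end = False    # True once the digit run has ended
    for ch in reversed(list(answer)):
        if ch in '0123456789' and not end:
            start = True
            out.append(ch)
        else:
            if start:
                end = True

    # Digits were collected right to left, so restore the original order.
    out = reversed(out)
    return ''.join(out)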

In [None]:
tool_instruction = '\nPlease solve the problem above, and put your final answer within \\boxed{}.'

In [None]:
# Replace this function with your inference code.
# The function should return a single integer between 0 and 999, inclusive.
# Each prediction (except the very first) must be returned within 30 minutes of the question being provided.
def predict(id_: pl.Series, question: pl.Series) -> pl.DataFrame | pd.DataFrame:
    """Make a prediction."""
    # Each call receives a single-row Series; unpack the scalar values.
    id_ = id_.item(0)
    question = question.item(0)
    prompt = question + tool_instruction
    generate_text = generate_text_vllm([prompt], tokenizer, llm)[0]
    answer = 3  # fallback answer when nothing can be parsed
    try:
        # Look for an integer inside \boxed{...} in the model output.
        result_output = re.findall(r'\\boxed\{(\d+)\}', generate_text)
        if len(result_output) > 0:
            no = naive_parse(result_output[0])
            if len(no) > 0:
                answer = int(no) % 1000
        print(answer)
    except Exception as e:
        print('error', e)
    return pl.DataFrame({'id': id_, 'answer': answer})

In [None]:
inference_server = kaggle_evaluation.aimo_2_inference_server.AIMO2InferenceServer(predict)

if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    inference_server.run_local_gateway(
        (
            '/kaggle/input/ai-mathematical-olympiad-progress-prize-2/test.csv',
        )
    )