
### References

*   [https://www.kaggle.com/code/abdmental01/jigsaw-mpnet-base-v2-inference-cv-0-876](https://www.kaggle.com/code/abdmental01/jigsaw-mpnet-base-v2-inference-cv-0-876)
*   [https://www.kaggle.com/code/aerdem4/jigsaw-acrc-qwen7b-finetune-logits-processor-zoo](https://www.kaggle.com/code/aerdem4/jigsaw-acrc-qwen7b-finetune-logits-processor-zoo)
*   [https://www.guruguru.science/competitions/24/discussions/21027ff1-2074-4e21-a249-b2d4170bd516/](https://www.guruguru.science/competitions/24/discussions/21027ff1-2074-4e21-a249-b2d4170bd516/)
*   [https://www.kaggle.com/code/mks2192/jigsaw-llama3-1-8b-instruct-training-one-epoch](https://www.kaggle.com/code/mks2192/jigsaw-llama3-1-8b-instruct-training-one-epoch)
*   [https://www.kaggle.com/code/fuumin621/qwen2-5-lora-finetune-baseline-inference](https://www.kaggle.com/code/fuumin621/qwen2-5-lora-finetune-baseline-inference)
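*   [https://www.kaggle.com/code/neibyr/30-min-just-use-semantic-search-qwen3-emb-0-6b](https://www.kaggle.com/code/neibyr/30-min-just-use-semantic-search-qwen3-emb-0-6b)
*   [https://www.kaggle.com/code/bibanh/qwen3-embedding-qwen2-5-32b-llama3-1-8b](https://www.kaggle.com/code/bibanh/qwen3-embedding-qwen2-5-32b-llama3-1-8b)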

### Thanks to @neibyr for the interesting idea: [Retrieve by Qwen3Embedding](https://www.kaggle.com/code/neibyr/30-min-just-use-semantic-search-qwen3-emb-0-6b)

# diff
+ Offline library install with uv [Reason: toggling internet on and off to manage dependencies is tiresome]
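
Because the offline install can fail at dependency resolution, a quick import check before launching the scripts can surface missing packages early. The following is a minimal sketch, not part of the original notebook: `missing_modules` is a hypothetical helper, and the module list is an assumption based on the imports used in the scripts below.

```python
import importlib.util

# Import names (not pip names) that the scripts in this notebook rely on.
# This list is an assumption for illustration; adjust it to your own scripts.
REQUIRED_MODULES = ["vllm", "logits_processor_zoo", "cleantext", "peft", "sentence_transformers"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running this right after the install cell shows which packages the resolver failed to provide, before any GPU time is spent loading a model.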

# 1. Qwen2.5 32B GPTQ Int4 Inference

In [1]:
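# This is the offline-install version of the notebook: Kaggle disables internet access during scoring, so packages are installed from local wheels.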
!uv pip install -U --system --no-index --find-links='/kaggle/input/jigsaw-packages/whls/' 'vllm' 'logits-processor-zoo==0.1.10' 'triton==3.2.0' 'clean-text' 'bitsandbytes' 'peft' 'accelerate' 'datasets' 'emoji' 'setuptools>=40.8.0' 'numpy<2'

Using Python 3.11.13 environment at: /usr
  × No solution found when resolving dependencies:
  ╰─▶ Because torch==2.8.0 depends on triton{platform_machine == 'x86_64' and
      sys_platform == 'linux'}==3.4.0 and only the following versions of torch
      are available:
          torch==2.7.1
          torch==2.8.0
      we can conclude that torch>2.7.1 depends on triton==3.4.0. (1)

      Because there is no version of nvidia-cuda-nvrtc-cu12{platform_machine
      == 'x86_64' and sys_platform == 'linux'}==12.4.127 and
      torch==2.6.0+cu124 depends on nvidia-cuda-nvrtc-cu12{platform_machine
      == 'x86_64' and sys_platform == 'linux'}==12.4.127, we can conclude that
      torch==2.6.0+cu124 cannot be used.
      And because we know from (1) that torch>2.7.1 depends on triton==3.4.0,
      we can conclude that torch>2.7.1 depends o

In [2]:
! mkdir -p /tmp

In [3]:
%%writefile /tmp/infer_qwen.py

import os
import pandas as pd
from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor
import torch
import vllm
import numpy as np
from vllm.lora.request import LoRARequest
import argparse
from scipy.special import softmax
df = pd.read_csv("/kaggle/input/jigsaw-agile-community-rules/test.csv")

MODEL_NAME = "/kaggle/input/qwen2-5-32b-instruct-gptq-int4"
LORA_PATH = "/kaggle/input/qwen2-5-32b-gptq-int4-batch4-full"
if __name__=='__main__':
    os.environ["VLLM_USE_V1"] = "0"

    llm = vllm.LLM(
        MODEL_NAME,
        # quantization='awq',
        quantization='gptq',
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.95,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=4096,
        disable_log_stats=True,
        enable_prefix_caching=True,
        enable_lora=True,
    )
    tokenizer = llm.get_tokenizer()
    SYS_PROMPT = """
    You are given a comment on reddit. Your task is to classify if it violates the given rule. Only respond Yes/No.
    """
    
    prompts = []
    for i, row in df.iterrows():
        text = f"""
    r/{row.subreddit}
    Rule: {row.rule}
    
    1) {row.positive_example_1}
    Violation: Yes
    
    2) {row.positive_example_2}
    Violation: Yes
    
    3) {row.negative_example_1}
    Violation: No
    
    4) {row.negative_example_2}
    Violation: No
    
    5) {row.body}
    """
        
        messages = [
            {"role": "system", "content": SYS_PROMPT},
            {"role": "user", "content": text}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
        ) + "Answer:"
        prompts.append(prompt)
    
    df["prompt"] = prompts
    
    mclp = MultipleChoiceLogitsProcessor(tokenizer, choices=['Yes','No'])
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            skip_special_tokens=True,
            max_tokens=1,
            logits_processors=[mclp],
            logprobs=2,
        ),
        use_tqdm=True,
        lora_request=LoRARequest("default", 1, LORA_PATH)
    )
    logprobs = [
        {lp.decoded_token: lp.logprob for lp in out.outputs[0].logprobs[0].values()}
        for out in outputs
    ]
    logit_matrix = pd.DataFrame(logprobs)[['Yes','No']]
    df = pd.concat([df, logit_matrix], axis=1)
    
    df[['Yes',"No"]] = df[['Yes',"No"]].apply(lambda x: softmax(x.values), axis=1, result_type="expand")
    df["pred"] = df["Yes"]
    df['rule_violation'] = df["pred"]
    df[['row_id', 'rule_violation']].to_csv("submission_qwen.csv",index=False)
    pd.read_csv('submission_qwen.csv')

Writing /tmp/infer_qwen.py
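
The script above restricts generation to the tokens `Yes` and `No`, reads their log-probabilities, and applies a softmax so that the two values sum to 1; the `Yes` share becomes the prediction. A minimal standalone sketch of that last step, using only the standard library (`yes_probability` is a hypothetical helper name, not something from the notebook):

```python
import math


def yes_probability(logprob_yes, logprob_no):
    """Two-way softmax over log-probabilities; returns the share assigned to 'Yes'."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)


# Example: raw probabilities 0.6 (Yes) and 0.2 (No) renormalise to 0.75 for Yes.
p = yes_probability(math.log(0.6), math.log(0.2))
print(round(p, 4))
```

Since the inputs are already log-probabilities, the softmax simply renormalises the two probabilities so that any mass assigned to other tokens is ignored.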


In [4]:
%cd /tmp
!python infer_qwen.py

/tmp
Traceback (most recent call last):
  File "/tmp/infer_qwen.py", line 4, in <module>
    from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor
ModuleNotFoundError: No module named 'logits_processor_zoo'


# 2. Qwen3 0.6B Embedding

In [5]:
%%writefile constants.py

EMBDEDDING_MODEL_PATH = "/kaggle/input/qwen-3-embedding/transformers/0.6b/1"
MODEL_OUTPUT_PATH = '/kaggle/input/qwen3-8b-embedding'
DATA_PATH = "/kaggle/input/jigsaw-agile-community-rules"

# https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/blob/main/config_sentence_transformers.json
EMBEDDING_MODEL_QUERY = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"

CLEAN_TEXT = True
TOP_K = 2000
BATCH_SIZE = 128


Writing constants.py


In [6]:
%%writefile utils.py
import pandas as pd
import torch.distributed as dist
import os
from datasets import Dataset
from cleantext import clean
from tqdm.auto import tqdm

from constants import CLEAN_TEXT


def build_prompt(row):
    return f"""r/{row["subreddit"]}\nComment: {row["body"]}"""


def cleaner(text):
    return clean(
        text,
        fix_unicode=True,
        to_ascii=True,
        lower=False,
        no_line_breaks=False,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=False,
        no_punct=False,
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        lang="en",
    )



def get_dataframe_to_train(data_path):
    train_dataset = pd.read_csv(f"{data_path}/train.csv")
    test_dataset = pd.read_csv(f"{data_path}/test.csv")

    flatten = []
    flatten.append(train_dataset[["body", "rule", "subreddit", "rule_violation"]])
    
    for violation_type in ["positive", "negative"]:
        for i in range(1, 3):
            sub_dataset = test_dataset[[f"{violation_type}_example_{i}", "rule", "subreddit"]].copy()
            sub_dataset = sub_dataset.rename(columns={f"{violation_type}_example_{i}": "body"})
            sub_dataset["rule_violation"] = 1 if violation_type == "positive" else 0
            flatten.append(sub_dataset)

    dataframe = pd.concat(flatten, axis=0)    
    dataframe = dataframe.drop_duplicates(ignore_index=True)
    return dataframe


def prepare_dataframe(dataframe):
    dataframe["prompt"] = dataframe.apply(build_prompt, axis=1)

    
    if CLEAN_TEXT:
        tqdm.pandas(desc="cleaner")
        dataframe["prompt"] = dataframe["prompt"].progress_apply(cleaner)

    if "rule_violation" in dataframe.columns:
        dataframe["rule_violation"] = dataframe["rule_violation"].map(
            {
                1: 1,
                0: -1,
            }
        )

    return dataframe

Writing utils.py


In [7]:
%%writefile semantic.py
import pandas as pd
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search, dot_score
from tqdm.auto import tqdm
from peft import PeftModel, PeftConfig


from utils import get_dataframe_to_train, prepare_dataframe
from constants import DATA_PATH, EMBDEDDING_MODEL_PATH, EMBEDDING_MODEL_QUERY, TOP_K, BATCH_SIZE, MODEL_OUTPUT_PATH



def get_scores(test_dataframe):
    corpus_dataframe = get_dataframe_to_train(DATA_PATH)
    corpus_dataframe = prepare_dataframe(corpus_dataframe)
    
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(EMBDEDDING_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(EMBDEDDING_MODEL_PATH)
    
    # Load adapter configuration and model
    adapter_config = PeftConfig.from_pretrained(MODEL_OUTPUT_PATH)
    lora_model = PeftModel.from_pretrained(model, MODEL_OUTPUT_PATH, config=adapter_config)
    merged_model = lora_model.merge_and_unload()
    tokenizer.save_pretrained("Qwen3Emb_Finetuned")
    merged_model.save_pretrained("Qwen3Emb_Finetuned")

    # 4. Rebuild the SentenceTransformer from the merged encoder
    embedding_model = SentenceTransformer(model_name_or_path="Qwen3Emb_Finetuned", device="cuda")

    print('Done loading model!')

    result = []
    for rule in tqdm(test_dataframe["rule"].unique(), desc="Generate scores for each rule"):
        test_dataframe_part = test_dataframe.query("rule == @rule").reset_index(drop=True)
        corpus_dataframe_part = corpus_dataframe.query("rule == @rule").reset_index(drop=True)
        corpus_dataframe_part = corpus_dataframe_part.reset_index(names="row_id")
        
        query_embeddings = embedding_model.encode(
            sentences=test_dataframe_part["prompt"].tolist(),
            prompt=EMBEDDING_MODEL_QUERY,
            batch_size=BATCH_SIZE,
            show_progress_bar=True,
            convert_to_tensor=True,
            device="cuda",
            normalize_embeddings=True,
        )
        document_embeddings = embedding_model.encode(
            sentences=corpus_dataframe_part["prompt"].tolist(),
            batch_size=BATCH_SIZE,
            show_progress_bar=True,
            convert_to_tensor=True,
            device="cuda",
            normalize_embeddings=True,
        )
        test_dataframe_part["semantic"] = semantic_search(
            query_embeddings,
            document_embeddings,
            top_k=TOP_K,
            score_function=dot_score,
        )
        def get_score(semantic):
            semantic = pd.DataFrame(semantic)
            semantic = semantic.merge(
                corpus_dataframe_part[["row_id", "rule_violation"]],
                how="left",
                left_on="corpus_id",
                right_on="row_id",
            )
            semantic["score"] = semantic["score"]*semantic["rule_violation"]
            return semantic["score"].sum()
            
        tqdm.pandas(desc=f"Add label for {rule=}")
        test_dataframe_part["rule_violation"] = test_dataframe_part["semantic"].progress_apply(get_score)
        result.append(test_dataframe_part[["row_id", "rule_violation"]].copy())
        
    submission = pd.concat(result, axis=0)
    return submission


def generate_submission():
    test_dataframe = pd.read_csv(f"{DATA_PATH}/test.csv")
    test_dataframe = prepare_dataframe(test_dataframe)
    
    submission = get_scores(test_dataframe)
    submission = test_dataframe[["row_id"]].merge(submission, on="row_id", how="left")
    submission.to_csv("submission_qwen3.csv", index=False)


if __name__ == "__main__":
    generate_submission()



Writing semantic.py
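
The scoring rule in `get_score` can be summarised as: for each test comment, take the retrieved corpus hits, multiply each similarity by the hit's signed label (+1 for a violation, -1 otherwise, as mapped in `prepare_dataframe`), and sum. The sketch below reproduces that arithmetic on toy data in plain Python. The hit list uses the `corpus_id`/`score` dictionary shape that `semantic_search` returns; the function name and the numbers are illustrative assumptions.

```python
def signed_similarity_score(hits, labels):
    """Sum of similarity * signed label over the retrieved hits.

    hits:   list of {"corpus_id": int, "score": float} dictionaries
    labels: mapping corpus_id -> +1 (violation) or -1 (no violation)
    """
    return sum(hit["score"] * labels[hit["corpus_id"]] for hit in hits)


# Toy data: two similar violating comments and one similar non-violating comment.
hits = [
    {"corpus_id": 0, "score": 0.9},
    {"corpus_id": 2, "score": 0.5},
    {"corpus_id": 1, "score": 0.4},
]
labels = {0: 1, 1: -1, 2: 1}

score = signed_similarity_score(hits, labels)
print(score > 0)  # a positive total leans towards "violation"
```

The result is an unbounded score rather than a probability, which is presumably why the ensemble step works on ranks instead of raw values.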


In [8]:
!python semantic.py

2025-09-08 07:22:45.509336: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1757316165.672225      92 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1757316165.721885      92 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Traceback (most recent call last):
  File "/tmp/semantic.py", line 9, in <module>
    from utils import get_dataframe_to_train, prepare_dataframe
  File "/tmp/utils.py", line 5, in <module>
    from cleantext import clean
ModuleNotFoundError: No module named 'cleantext'


# 3. Llama 3.1 8B

In [9]:
%%writefile llama.py
import os, math, numpy as np
# Select which GPU indices the program may use
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
sys_prompt = '''You are given a comment on reddit and a rule. Your task is to classify whether the comment violates the rule. Only respond Yes/No.'''

# Convert a dataset of multi-turn conversations, in batch,
# into plain-text strings formatted the way the chat model expects.
def formatting(dataset):
    texts = []
    for i in range(len(dataset)):
        texts.append(tokenizer.apply_chat_template(dataset[i], tokenize=False, add_generation_prompt=False))
    return texts


# Prompt template:
# builds a structured, few-shot input that guides the LLM towards the specific task.
template = """
Subreddit: r/{subreddit}
Rule: {rule}
Examples:
1) {positive_example_1}
Violation: Yes

2) {negative_example_1}
Violation: No

3) {negative_example_2}
Violation: No

4) {positive_example_2}
Violation: Yes
Comment:
{body}
Violation: """

dataset = []
# Iterate over the rows of test; index is the row label and row is the row content
for index,row in test.iterrows():
    
    formatted_sample = [
        {
        "role": "system",
        "content": sys_prompt
    },
       {
           "role": "user",
           # .format() is a Python string method that fills the {}-delimited placeholders.
           "content": template.format(
               rule = row.rule,
               subreddit = row.subreddit,
               body = row.body,
               positive_example_1 = row.positive_example_1,
               negative_example_1 = row.negative_example_1,
               positive_example_2 = row.positive_example_2,
               negative_example_2 = row.negative_example_2
           )
       }]
    
    dataset.append( formatted_sample )

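# Turn each multi-turn conversation into a single string
all_prompts = formatting(dataset)

# Create the logits processor instance; the tokenizer is passed in as an argument.
logits_processors = [DigitLogitsProcessor(tokenizer)]
# Run the actual generation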
responses = llm.generate(
    all_prompts,
    vllm.SamplingParams(
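        n=1,  # Number of answers to return. To get several different responses to the same prompt, raise n; e.g. n=3 returns three independent sequences.
        top_p=0.9,  # Nucleus sampling: candidate next tokens are sorted by probability and accumulated from the top until their total reaches 0.9 (90%).
        temperature=0,  # temperature=0 means no randomness: the model picks the highest-probability token at every step (greedy decoding), so the result is deterministic.
        seed=777, # Seed for reproducibility
        skip_special_tokens=True,  # Omit special tokens from the final output.
        max_tokens=1,  # Maximum number of tokens to generate per output sequence.
        logits_processors=logits_processors,
        # Return the log-probability of each generated token, plus information on the 2 most likely tokens.
        logprobs = 2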
    ),
    use_tqdm = True
)

results = []
errors = 0
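# enumerate() yields both the index i (0, 1, 2, ...) and the response object while looping.
for i,response in enumerate(responses):
    try:
        # Top-2 candidates for the first generated token of the first output,
        # i.e. the Yes and No entries; after exp() their probabilities lie in [0, 1]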
        x = response.outputs[0].logprobs[0]
        logprobs = []
        for k in KEEP:
            if k in x:
                logprobs.append( math.exp(x[k].logprob) )
            else:
                logprobs.append( 0 )
                print(f"bad logits {i}")
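        # Convert to a NumPy array
        logprobs = np.array( logprobs )
        # Normalise the probabilities so they sum to 1
        logprobs /= logprobs.sum()
        results.append( logprobs )
    except Exception:
        #print(f"error {i}")
        # A trailing dot after a number makes it a float literal
        results.append( np.array([1/2., 1/2.]) )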
        errors += 1
        
print(f"There were {errors} inference errors out of {i+1} inferences")
# Stack the list of NumPy arrays into a single 2-D array (matrix)
results = np.vstack(results)

# Keep only the violation probability, i.e. the probability of Yes
probs = [x[1] for x in results]
sub['rule_violation'] = probs
sub.to_csv('submission_llama.csv')

Writing llama.py
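
The post-processing loop in `llama.py` converts the log-probabilities of the kept tokens into probabilities, treats a missing token as probability 0, and normalises the pair. A minimal sketch of that step with plain floats follows. The token ids in `KEEP_IDS` are hypothetical placeholders (the notebook's `KEEP` is not shown), and the explicit uniform fallback is a design choice in this sketch rather than the author's exact behaviour.

```python
import math


def normalized_probs(token_logprobs, keep_ids):
    """Probabilities over keep_ids, normalised to sum to 1.

    token_logprobs: mapping token_id -> log-probability (plain floats here)
    keep_ids:       token ids to keep, in output order
    """
    # A kept token that is absent from the returned candidates gets probability 0.
    raw = [math.exp(token_logprobs[t]) if t in token_logprobs else 0.0 for t in keep_ids]
    total = sum(raw)
    if total == 0.0:
        # Nothing usable came back: fall back to a uniform distribution.
        return [1.0 / len(keep_ids)] * len(keep_ids)
    return [p / total for p in raw]


KEEP_IDS = [101, 202]  # hypothetical ids standing in for the "No" and "Yes" tokens
probs = normalized_probs({101: math.log(0.1), 202: math.log(0.3)}, KEEP_IDS)
print([round(p, 4) for p in probs])
```

Which position holds the violation probability depends on the order of the kept ids, so the index used when building the submission should match that order.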


In [10]:
!python llama.py

Traceback (most recent call last):
  File "/tmp/llama.py", line 38, in <module>
    for index,row in test.iterrows():
                     ^^^^
NameError: name 'test' is not defined


# 4. ENSEMBLE RESULT

In [11]:
%%writefile main.py
import pandas as pd
import numpy as np

q = pd.read_csv('submission_qwen.csv')
l = pd.read_csv('submission_qwen3.csv')
m = pd.read_csv('submission_llama.csv')

rq = q['rule_violation'].rank(method='average') / (len(q)+1)
rl = l['rule_violation'].rank(method='average') / (len(l)+1)
rm = m['rule_violation'].rank(method='average') / (len(m)+1)

blend = 0.5*rq + 0.4*rl + 0.1*rm   # or tune the rank-weights with a tiny grid using OOF
q['rule_violation'] = blend
q.to_csv('/kaggle/working/submission.csv', index=False)

Writing main.py
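
`main.py` blends ranks rather than raw scores, which puts the three models on a common scale even though one of them produces unbounded similarity sums. The sketch below reimplements the same idea in plain Python for illustration: average ranks (ties share the mean rank, as with `rank(method='average')`), division by `n + 1`, then a weighted sum. The helper names and the toy inputs are assumptions.

```python
def average_rank(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_blend(columns, weights):
    """Weighted sum of rank-normalised columns; every column must have the same row order."""
    n = len(columns[0])
    normalized = [[r / (n + 1) for r in average_rank(col)] for col in columns]
    return [sum(w * col[i] for w, col in zip(weights, normalized)) for i in range(n)]


# Toy example with three rows and two models that disagree completely.
blend = rank_blend([[0.1, 0.5, 0.9], [0.9, 0.5, 0.1]], [0.5, 0.5])
print(blend)
```

As in `main.py`, the blend is positional: it only makes sense if all input files list the rows in the same order.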


In [12]:
!python main.py

Traceback (most recent call last):
  File "/tmp/main.py", line 4, in <module>
    q = pd.read_csv('submission_qwen.csv')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-package

In [13]:
import os
p = "/kaggle/input/qwen2-5-32b-gptq-int4-batch4-full"
print("exists:", os.path.exists(p))
print("top files:", os.listdir(p))
for name in ["README.md","README","LICENSE","license","config.json","adapter_config.json"]:
    fp = os.path.join(p, name)
    if os.path.exists(fp):
        print(f"--- {name} ---")
        print(open(fp, "r", encoding="utf-8", errors="ignore").read()[:2000])

exists: True
top files: ['adapter_model.safetensors', 'merges.txt', 'training_args.bin', 'adapter_config.json', 'README.md', 'tokenizer.json', 'vocab.json', 'tokenizer_config.json', 'chat_template.jinja', 'special_tokens_map.json', 'added_tokens.json']
--- README.md ---
---
base_model: Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4
- lora
- transformers
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->



## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->



- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information 