<a href="https://www.kaggle.com/code/mengaidev/qwen2-5-omni-caption?scriptVersionId=265999934" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Qwen2.5 Omni finetune

Goal: fine-tune Qwen2.5-Omni-3B on the Clotho dataset.

## Data preparation

In [None]:
!wget https://zenodo.org/records/4783391/files/clotho_audio_development.7z
!wget https://zenodo.org/records/4783391/files/clotho_captions_development.csv

In [None]:
!pip install py7zr

import py7zr

# Extract entire archive
with py7zr.SevenZipFile('clotho_audio_development.7z', mode='r') as z:
    z.extractall('../temp')

!rm clotho_audio_development.7z

Now let's generate `train.jsonl`.

In [None]:
import pandas as pd
import json

def csv_to_jsonl_conversation(csv_file_path, jsonl_file_path, base_path="/kaggle/temp/development"):
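    """
    Convert a CSV file into a conversation-format JSONL file.
    
    Args:
        csv_file_path: Path to the input CSV file.
        jsonl_file_path: Path to the output JSONL file.
        base_path: Base path prepended to each audio file name.
    """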
    
    df = pd.read_csv(csv_file_path)
    
    with open(jsonl_file_path, 'w', encoding='utf-8') as f:
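        # Iterate over each data row (read_csv has already consumed the header)
        for index, row in df.iterrows():
            # Build the full audio file path
            audio_filename = f"{base_path}/{row.iloc[0]}"
            
            # Create the conversation-format JSON object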
            json_obj = {
                "messages": [
                    {
                        "role": "user",
                        "content": "<audio>What did the audio say?"
                    },
                    {
                        "role": "assistant",
                        "content": row.iloc[1]  # caption from the second CSV column
                    }
                ],
                "audios": [audio_filename]
            }
            
            # Write the record as one line of the JSONL file
            f.write(json.dumps(json_obj, ensure_ascii=False) + '\n')
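
The function above keeps only the caption in the second CSV column, while Clotho ships several reference captions per clip. As a possible variation, the sketch below writes one conversation per (audio, caption) pair. It assumes the first column holds the file name and every remaining column holds a caption; `csv_to_jsonl_all_captions` and its `prompt` parameter are hypothetical names introduced here, not part of the original notebook.

```python
import csv
import json


def csv_to_jsonl_all_captions(csv_file_path, jsonl_file_path,
                              base_path="/kaggle/temp/development",
                              prompt="<audio>What did the audio say?"):
    """Write one conversation per (audio, caption) pair; return the number written."""
    written = 0
    with open(csv_file_path, newline="", encoding="utf-8") as src, \
         open(jsonl_file_path, "w", encoding="utf-8") as dst:
        reader = csv.reader(src)
        next(reader, None)  # skip the header row
        for row in reader:
            if not row:
                continue  # tolerate empty lines
            audio_path = f"{base_path}/{row[0]}"
            # Assumption: every column after the first is a caption
            for caption in row[1:]:
                if not caption.strip():
                    continue
                record = {
                    "messages": [
                        {"role": "user", "content": prompt},
                        {"role": "assistant", "content": caption},
                    ],
                    "audios": [audio_path],
                }
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
    return written
```

This multiplies the number of training samples by the number of captions per clip, so training time grows accordingly.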

In [None]:
csv_file = "clotho_captions_development.csv"
jsonl_file = "train.jsonl"
    
csv_to_jsonl_conversation(csv_file, jsonl_file)
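
Before training, it can help to confirm that every path listed under `audios` points to a real file, since a wrong `base_path` would otherwise only surface once the trainer starts loading audio. A minimal sketch, assuming the JSONL layout produced above; `find_missing_audio` is a hypothetical helper, not part of any library.

```python
import json
import os


def find_missing_audio(jsonl_path):
    """Return the audio paths referenced in the JSONL file that do not exist on disk."""
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            for path in record.get("audios", []):
                if not os.path.isfile(path):
                    missing.append(path)
    return missing
```

Calling `find_missing_audio("train.jsonl")` returns an empty list when every referenced file was found.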

## Finetune

### Installation

In [None]:
%%capture

!pip install --upgrade uv
!uv pip install ms-swift -U
!uv pip install --upgrade numpy scikit-learn --force-reinstall
!uv pip install --upgrade torch torchvision --force-reinstall
!uv pip install qwen-omni-utils

### Let's start training!

In [None]:
import torch
import gc

gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

In [None]:
%%capture
import os
os.environ["MAX_PIXELS"]="1003520"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"]="expandable_segments:True"

In [None]:
!swift sft \
    --model Qwen/Qwen2.5-Omni-3B \
    --dataset train.jsonl \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 4 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 10 \
    --max_length 1024 \
    --output_dir qwen_caption_output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4

## Unfortunately

We could not finish the training: the run ended after 1100+ steps.
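
To put the step count in context, the number of optimizer steps a full run would need can be estimated from the dataset size and the training flags used here (`--per_device_train_batch_size 1`, `--gradient_accumulation_steps 4`, `--num_train_epochs 2`). This is a rough sketch, assuming a single GPU and that every line of `train.jsonl` is used for training; a held-out validation split or dropped samples would lower the number, and the trainer's exact rounding may differ. `expected_steps` is a hypothetical helper.

```python
import math


def expected_steps(num_samples, per_device_batch_size=1,
                   grad_accum_steps=4, num_epochs=2, num_devices=1):
    """Estimate the total number of optimizer steps for a full training run."""
    # Samples consumed per optimizer step
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Estimate only: the trainer may round partial batches differently
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * num_epochs
```

For example, counting the lines with `sum(1 for _ in open("train.jsonl"))` and passing the result to `expected_steps` gives a target to compare against the step at which the run stopped.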