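Next, read the CSV file from the downloaded data into a Pandas DataFrame. Each row is one training example: the image file name, its labels as a space-separated string, and one 0/1 column per class. Because an image can carry several labels at once, these columns form a multi-hot encoding rather than a strict one-hot one.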

In [1]:
import pandas as pd

df = pd.read_csv("/root/autodl-tmp/multilabel_modified/multilabel_classification(2).csv")
df.head()

Unnamed: 0,Image_Name,"Classes(motorcycle, truck, boat, bus, cycle, person, desert, mountains, sea, sunset, trees, sitar, ektara, flutes, tabla, harmonium)",motorcycle,truck,boat,bus,cycle,person,desert,mountains,sea,sunset,trees,sitar,ektara,flutes,tabla,harmonium
0,image1.jpg,bus person,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0
1,image2.jpg,sitar,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,image3.jpg,flutes,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,image4.jpg,bus trees,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
4,image5.jpg,bus,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
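The `Classes(...)` column repeats each row's labels as a space-separated string, which makes it easy to cross-check the 0/1 columns. Below is a minimal sketch on a small, hand-copied excerpt of the rows above. The shortened header `Classes`, the reduced set of four label columns and the helper `multi_hot_from_string` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hand-copied excerpt of the CSV above; "Classes" header shortened and only
# four of the 16 label columns kept (illustrative data only).
csv_text = """Image_Name,Classes,bus,person,trees,sitar
image1.jpg,bus person,1,1,0,0
image2.jpg,sitar,0,0,0,1
image4.jpg,bus trees,1,0,1,0
image5.jpg,bus,1,0,0,0
"""
df_small = pd.read_csv(io.StringIO(csv_text))
label_cols = list(df_small.columns)[2:]  # same slicing rule as the notebook


def multi_hot_from_string(classes, label_names):
    """Turn a space-separated label string into a multi-hot list of 0/1."""
    present = set(classes.split())
    return [1 if name in present else 0 for name in label_names]


# Every row's 0/1 columns should agree with its "Classes" string.
for _, row in df_small.iterrows():
    expected = multi_hot_from_string(row["Classes"], label_cols)
    assert row[label_cols].tolist() == expected, row["Image_Name"]

print(multi_hot_from_string("bus person", label_cols))  # [1, 1, 0, 0]
```

Running the same loop over the full DataFrame (with the real `Classes(...)` header) would flag any row whose string and 0/1 columns disagree.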


Create an id2label dictionary that maps each integer class index to its label name.

In [2]:
labels = list(df.columns)[2:]
id2label = {id: label for id, label in enumerate(labels)}
print(id2label)

{0: 'motorcycle', 1: 'truck', 2: 'boat', 3: 'bus', 4: 'cycle', 5: 'person', 6: 'desert', 7: 'mountains', 8: 'sea', 9: 'sunset', 10: 'trees', 11: 'sitar', 12: 'ektara', 13: 'flutes', 14: 'tabla', 15: 'harmonium'}
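Hugging Face model configs usually store the inverse mapping, label2id, as well. The plain-Python sketch below (label order copied from the output above) builds it and uses id2label to decode a multi-hot row back into names; the helper `decode` is hypothetical.

```python
# Label order copied from the id2label output above.
label_names = ["motorcycle", "truck", "boat", "bus", "cycle", "person", "desert",
               "mountains", "sea", "sunset", "trees", "sitar", "ektara", "flutes",
               "tabla", "harmonium"]
id2label = {idx: label for idx, label in enumerate(label_names)}
label2id = {label: idx for idx, label in id2label.items()}


def decode(multi_hot):
    """Return the label names whose position in the vector is set to 1."""
    return [id2label[idx] for idx, flag in enumerate(multi_hot) if flag == 1]


row_image1 = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # image1.jpg above
print(decode(row_image1))  # ['bus', 'person']
print(label2id["trees"])   # 10
```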


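Next, load the model and the image processor from a local (offline) copy. Setting problem_type to "multi_label_classification" tells the model that this is a multi-label task, so it computes the loss with a per-label sigmoid and binary cross-entropy (BCEWithLogitsLoss) instead of softmax cross-entropy.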

In [4]:
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

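device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "/root/autodl-tmp/siglip-so400m-patch14-384"  # local copy of the checkpoint

processor = AutoImageProcessor.from_pretrained(model_id)  # not moved to a device; its output tensors are
label2id = {label: idx for idx, label in id2label.items()}
model = AutoModelForImageClassification.from_pretrained(
    model_id,
    problem_type="multi_label_classification",  # sigmoid + BCEWithLogitsLoss
    id2label=id2label,
    label2id=label2id,
)
model = model.to(device)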

Some weights of SiglipForImageClassification were not initialized from the model checkpoint at /root/autodl-tmp/siglip-so400m-patch14-384 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
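The warning shows that the `classifier` head is newly initialized, and `problem_type="multi_label_classification"` decides how its logits become a loss. The NumPy sketch below contrasts the two options: softmax makes the classes compete for one probability mass that sums to 1, while a per-label sigmoid scores each class independently, which is what binary cross-entropy on logits builds on. The logits and targets are made-up illustrative values.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits.

    max(x, 0) - x*y + log(1 + exp(-|x|)) is an overflow-safe rewrite of
    -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].
    """
    per_label = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return per_label.mean()


demo_logits = np.array([-3.0, 2.0, -1.0, 2.5])  # illustrative scores for 4 labels
demo_targets = np.array([0.0, 1.0, 0.0, 1.0])   # two labels at once, like "bus person"

print(sigmoid(demo_logits).round(3))  # independent probabilities; need not sum to 1
print(softmax(demo_logits).round(3))  # competing probabilities; always sum to 1
print(bce_with_logits(demo_logits, demo_targets))
```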


Create a Dataset class that reads each image and its labels and converts them into the tensors the model expects.

In [13]:
from torch.utils.data import Dataset
import torch
from PIL import Image
import os
import numpy as np

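class MultiLabelDataset(Dataset):
    """Returns (pixel_values, labels) for each row, or None if the image file is missing."""

    def __init__(self, root, df, transform):
        self.root = root            # folder containing the image files
        self.df = df                # columns from position 2 onward are the 0/1 labels
        self.transform = transform  # turns a PIL image into a normalized tensor

    def __getitem__(self, idx):
        item = self.df.iloc[idx]
        # get image
        image_path = os.path.join(self.root, item["Image_Name"])

        # missing files are filtered out later by collate_fn
        if not os.path.exists(image_path):
            return None

        image = Image.open(image_path).convert("RGB")

        # prepare image for the model
        pixel_values = self.transform(image)

        # get labels: every column from position 2 onward (positional slice)
        labels = item.iloc[2:].values.astype(np.float32)

        # turn into PyTorch tensor
        labels = torch.from_numpy(labels)

        return pixel_values, labels

    def __len__(self):
        return len(self.df)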
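Because `__getitem__` returns None when an image file is missing, batches can come out smaller than `batch_size`, or even empty. A minimal sketch of checking this up front; the helper `find_missing_images` is hypothetical, and the demo runs against a temporary folder rather than the real dataset.

```python
import os
import tempfile

import pandas as pd


def find_missing_images(root, df, column="Image_Name"):
    """Return the file names listed in df[column] that do not exist under root."""
    return [name for name in df[column] if not os.path.exists(os.path.join(root, name))]


# Demo: a temporary folder that contains only one of the three listed images.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "image1.jpg"), "wb").close()
    demo_df = pd.DataFrame({"Image_Name": ["image1.jpg", "image2.jpg", "image3.jpg"]})
    missing = find_missing_images(root, demo_df)

print(missing)  # ['image2.jpg', 'image3.jpg']
```

In the notebook, `find_missing_images("/root/autodl-tmp/multilabel_modified/images", df)` would list the affected rows, and `df[~df["Image_Name"].isin(missing)]` would drop them before the dataset is built, so `__getitem__` never has to return None.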

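To prepare the images we use Torchvision, whose transforms resize each image to the size the model expects (384 in this case)
and normalize the color channels with the mean and standard deviation specified by the image processor.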

In [14]:
from torchvision.transforms import Compose, Resize, ToTensor, Normalize

# get appropriate size, mean and std based on the image processor
size = processor.size["height"]
mean = processor.image_mean
std = processor.image_std

transform = Compose([
    Resize((size, size)),
    ToTensor(),
    Normalize(mean=mean, std=std),
])

train_dataset = MultiLabelDataset(root="/root/autodl-tmp/multilabel_modified/images",
                                  df=df, transform=transform)
len(train_dataset)

8968
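What `ToTensor()` followed by `Normalize(mean, std)` does to each pixel can be written out in NumPy. This sketch assumes a per-channel mean and std of 0.5, which SigLIP processors typically report; check `processor.image_mean` and `processor.image_std` for the actual values. With those values, pixel intensities 0..255 end up in [-1, 1].

```python
import numpy as np

# Assumption: mean = std = 0.5 per channel (verify against the processor).
assumed_mean = np.array([0.5, 0.5, 0.5])
assumed_std = np.array([0.5, 0.5, 0.5])


def to_tensor_and_normalize(image_hwc_uint8):
    """NumPy re-implementation of ToTensor() followed by Normalize(mean, std)."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # ToTensor: scale to [0, 1]
    x = np.transpose(x, (2, 0, 1))                  # ToTensor: HWC -> CHW
    return (x - assumed_mean[:, None, None]) / assumed_std[:, None, None]


pixel = np.array([[[0, 128, 255]]], dtype=np.uint8)  # one pixel, shape (1, 1, 3)
out = to_tensor_and_normalize(pixel)
print(out.shape)  # (3, 1, 1)
```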

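Next, create the corresponding PyTorch DataLoader to draw batches of training examples (neural networks are usually trained on mini-batches with stochastic gradient descent, SGD, or one of its variants).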

In [20]:
from torch.utils.data import DataLoader

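def collate_fn(batch):
    # drop examples whose image file was missing (the Dataset returned None)
    batch = [item for item in batch if item is not None]

    # if every example was dropped, return None so torch.stack never sees an empty list
    if len(batch) == 0:
        return None

    data = torch.stack([item[0] for item in batch])    # (B, 3, H, W)
    target = torch.stack([item[1] for item in batch])  # (B, num_labels)
    return data, target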

train_dataloader = DataLoader(train_dataset, collate_fn=collate_fn, batch_size=2, shuffle=True)
batch = next(iter(train_dataloader))
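The filtering in `collate_fn` can be checked without images or a GPU. This is a NumPy analogue (`np.stack` in place of `torch.stack`) of the function above, with stand-in arrays: a batch of two in which one image was missing comes out with a single example, and a batch in which all images were missing becomes None, which the training loop skips.

```python
import numpy as np


def collate_fn_np(batch):
    """NumPy analogue of collate_fn: drop None items, then stack the rest."""
    batch = [item for item in batch if item is not None]
    if len(batch) == 0:
        return None
    data = np.stack([item[0] for item in batch])
    target = np.stack([item[1] for item in batch])
    return data, target


fake_image = np.zeros((3, 4, 4), dtype=np.float32)  # stand-in for pixel_values
fake_label = np.zeros(16, dtype=np.float32)         # stand-in multi-hot label
fake_label[3] = 1.0                                 # "bus"

demo_data, demo_target = collate_fn_np([(fake_image, fake_label), None])
print(demo_data.shape, demo_target.shape)  # (1, 3, 4, 4) (1, 16)
print(collate_fn_np([None, None]))         # None
```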

Check the initial loss on one batch before training.

In [21]:
outputs = model(pixel_values=batch[0].to(device), labels=batch[1].to(device))
outputs.loss

tensor(0.0682, device='cuda:0',
       grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
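A reference point for this check: if a freshly initialized head outputs logits near zero, every sigmoid output is about 0.5 and each label costs ln 2 ≈ 0.693 regardless of its target, so the mean BCE loss is about ln 2. This is only a baseline under that assumption; a head whose initial logits are not near zero gives a different value. The pure-Python sketch below uses a batch of 2 and 16 labels, as in the notebook.

```python
import math


def bce_from_logit(logit, target):
    """Binary cross-entropy for a single label from a raw logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


num_labels, batch_size = 16, 2
targets = [[0] * num_labels for _ in range(batch_size)]
targets[0][3] = targets[0][5] = 1  # e.g. "bus person"

# All logits 0: each label contributes ln 2; the mean over 2 x 16 labels stays ln 2.
per_label_losses = [bce_from_logit(0.0, t) for row in targets for t in row]
mean_loss = sum(per_label_losses) / len(per_label_losses)
print(round(mean_loss, 4))  # 0.6931
```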

Time to train the model! Here we train in plain PyTorch, but feel free to switch to 🤗 Accelerate (very useful for distributed training with minimal code changes) or to the 🤗 Trainer class, which handles
much of the logic we write by hand here (such as creating data loaders). Things worth tuning include:
- learning rate
- number of epochs
- optimizer
- gradient accumulation and gradient checkpointing (to fit larger effective batches or models in memory), and Flash Attention (to speed up attention)
- mixed-precision training (bfloat16), etc.
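The batch size and optional gradient accumulation listed above determine how many forward/backward passes and optimizer updates one epoch takes. A small arithmetic sketch; the helper is hypothetical, and it assumes the optimizer also steps on a final, incomplete accumulation group.

```python
import math


def steps_per_epoch(num_examples, batch_size, accumulation_steps=1):
    """Return (forward/backward passes, optimizer updates) for one epoch."""
    batches = math.ceil(num_examples / batch_size)
    updates = math.ceil(batches / accumulation_steps)
    return batches, updates


# 8968 training images with batch_size=2, as in the cells above.
print(steps_per_epoch(8968, 2))                        # (4484, 4484)
print(steps_per_epoch(8968, 2, accumulation_steps=8))  # (4484, 561)
```

With 8968 examples and `batch_size=2` this gives the 4484 iterations per epoch shown by the progress bar; accumulating over 8 batches would give an effective batch size of 16 with 561 optimizer updates per epoch. When accumulating, the loss is usually divided by `accumulation_steps` before `backward()`, and `optimizer.step()` / `optimizer.zero_grad()` are called only every `accumulation_steps` batches.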

In [None]:
from torch.optim import AdamW
from tqdm.auto import tqdm


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


optimizer = AdamW(model.parameters(), lr=5e-5)

# created once before the epoch loop, so its average spans all epochs so far
losses = AverageMeter()

model.train()
for epoch in range(10):  # loop over the dataset multiple times
    for idx, batch in enumerate(tqdm(train_dataloader)):
        # skip batches in which every image was missing (collate_fn returned None)
        if batch is None:
            continue
        # get the inputs
        pixel_values, labels = batch

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass; since labels are passed, the model also returns the BCE loss
        outputs = model(
            pixel_values=pixel_values.to(device),
            labels=labels.to(device),
        )

        # calculate gradients
        loss = outputs.loss
        losses.update(loss.item(), pixel_values.size(0))
        loss.backward()

        # optimization step
        optimizer.step()

        if idx % 2000 == 0:
            # current batch loss, then the running average in parentheses
            print('Epoch: [{0}]\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'.format(
                   epoch, loss=losses))

  0%|          | 0/4484 [00:00<?, ?it/s]

Epoch: [0]	Loss 0.1064 (0.1064)	
Epoch: [0]	Loss 0.0090 (0.0550)	
Epoch: [0]	Loss 0.0169 (0.0626)	
Epoch: [0]	Loss 0.0126 (0.0621)	
Epoch: [0]	Loss 0.0663 (0.0611)	
Epoch: [0]	Loss 0.0018 (0.0600)	
Epoch: [0]	Loss 0.0054 (0.0596)	
Epoch: [0]	Loss 0.0179 (0.0600)	
Epoch: [0]	Loss 0.0215 (0.0582)	
Epoch: [0]	Loss 0.0723 (0.0583)	
Epoch: [0]	Loss 0.0682 (0.0591)	
Epoch: [0]	Loss 0.0049 (0.0585)	
Epoch: [0]	Loss 0.1924 (0.0588)	
Epoch: [0]	Loss 0.0254 (0.0583)	
Epoch: [0]	Loss 0.0205 (0.0587)	
Epoch: [0]	Loss 0.0268 (0.0595)	
Epoch: [0]	Loss 0.1002 (0.0591)	
Epoch: [0]	Loss 0.0089 (0.0584)	
Epoch: [0]	Loss 0.0022 (0.0582)	
Epoch: [0]	Loss 0.0266 (0.0574)	
Epoch: [0]	Loss 0.0098 (0.0575)	
Epoch: [0]	Loss 0.0328 (0.0577)	
Epoch: [0]	Loss 0.0478 (0.0571)	
Epoch: [0]	Loss 0.0020 (0.0570)	
Epoch: [0]	Loss 0.0092 (0.0564)	
Epoch: [0]	Loss 0.0048 (0.0575)	
Epoch: [0]	Loss 0.0026 (0.0575)	
Epoch: [0]	Loss 0.0026 (0.0580)	
Epoch: [0]	Loss 0.0586 (0.0577)	
Epoch: [0]	Loss 0.0716 (0.0578)	
Epoch: [0]

Epoch: [1]	Loss 0.0307 (0.0607)	
Epoch: [1]	Loss 0.0455 (0.0603)	
Epoch: [1]	Loss 0.0344 (0.0598)	
Epoch: [1]	Loss 0.0298 (0.0598)	
Epoch: [1]	Loss 0.0083 (0.0602)	
Epoch: [1]	Loss 0.0147 (0.0600)	
Epoch: [1]	Loss 0.0263 (0.0600)	
Epoch: [1]	Loss 0.0060 (0.0596)	
Epoch: [1]	Loss 0.0298 (0.0595)	
Epoch: [1]	Loss 0.0026 (0.0595)	
Epoch: [1]	Loss 0.0052 (0.0592)	
Epoch: [1]	Loss 0.0013 (0.0588)	
Epoch: [1]	Loss 0.0816 (0.0589)	
Epoch: [1]	Loss 0.0055 (0.0589)	
Epoch: [1]	Loss 0.0295 (0.0587)	
Epoch: [1]	Loss 0.0732 (0.0588)	
Epoch: [1]	Loss 0.0922 (0.0589)	
Epoch: [1]	Loss 0.0009 (0.0588)	
Epoch: [1]	Loss 0.0141 (0.0593)	
Epoch: [1]	Loss 0.2931 (0.0592)	
Epoch: [1]	Loss 0.0268 (0.0592)	
Epoch: [1]	Loss 0.0310 (0.0592)	
Epoch: [1]	Loss 0.0315 (0.0595)	
Epoch: [1]	Loss 0.0953 (0.0594)	
Epoch: [1]	Loss 0.0036 (0.0593)	
Epoch: [1]	Loss 0.0080 (0.0594)	
Epoch: [1]	Loss 0.1460 (0.0593)	
Epoch: [1]	Loss 0.0919 (0.0592)	
Epoch: [1]	Loss 0.0160 (0.0591)	
Epoch: [1]	Loss 0.0735 (0.0589)	
Epoch: [1]

Epoch: [2]	Loss 0.0017 (0.0594)	
Epoch: [2]	Loss 0.0542 (0.0592)	
Epoch: [2]	Loss 0.0030 (0.0591)	
Epoch: [2]	Loss 0.1584 (0.0590)	
Epoch: [2]	Loss 0.0018 (0.0590)	
Epoch: [2]	Loss 0.0417 (0.0590)	
Epoch: [2]	Loss 0.0112 (0.0588)	
Epoch: [2]	Loss 0.1308 (0.0587)	
Epoch: [2]	Loss 0.0051 (0.0588)	
Epoch: [2]	Loss 0.0810 (0.0587)	
Epoch: [2]	Loss 0.0768 (0.0587)	
Epoch: [2]	Loss 0.1360 (0.0589)	
Epoch: [2]	Loss 0.0614 (0.0588)	
Epoch: [2]	Loss 0.0005 (0.0587)	
Epoch: [2]	Loss 0.0491 (0.0586)	
Epoch: [2]	Loss 0.0357 (0.0585)	
Epoch: [2]	Loss 0.4642 (0.0584)	
Epoch: [2]	Loss 0.2217 (0.0583)	
Epoch: [2]	Loss 0.0920 (0.0582)	
Epoch: [2]	Loss 0.1208 (0.0581)	
Epoch: [2]	Loss 0.0692 (0.0581)	
Epoch: [2]	Loss 0.0492 (0.0582)	
Epoch: [2]	Loss 0.0213 (0.0583)	
Epoch: [2]	Loss 0.0018 (0.0582)	
Epoch: [2]	Loss 0.2407 (0.0581)	
Epoch: [2]	Loss 0.1662 (0.0581)	
Epoch: [2]	Loss 0.0023 (0.0579)	
Epoch: [2]	Loss 0.0109 (0.0581)	
Epoch: [2]	Loss 0.0626 (0.0581)	
Epoch: [2]	Loss 0.0066 (0.0581)	
Epoch: [2]

Epoch: [3]	Loss 0.0385 (0.0579)	
Epoch: [3]	Loss 0.3101 (0.0578)	
Epoch: [3]	Loss 0.0100 (0.0577)	
Epoch: [3]	Loss 0.0094 (0.0576)	
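In the log above, the first number is the loss of the current batch and the number in parentheses is `AverageMeter`'s running average, weighted by batch size. Because `losses` is created once before the epoch loop, that average keeps accumulating across epochs (epoch 1 starts at 0.0607 rather than at its first batch loss). If a per-epoch average is wanted, resetting the meter at the start of each epoch is enough. A pure-Python sketch with made-up loss values:

```python
class AverageMeter:
    """Running average weighted by the number of samples in each update."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val, self.avg, self.sum, self.count = 0.0, 0.0, 0.0, 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


meter = AverageMeter()
# Hypothetical (loss, batch size) pairs for two epochs.
epochs = [[(0.10, 2), (0.02, 2), (0.06, 1)], [(0.04, 2), (0.01, 2), (0.03, 1)]]
epoch_avgs = []
for epoch, batches in enumerate(epochs):
    meter.reset()  # per-epoch average; drop this line to average across all epochs
    for loss_value, n in batches:
        meter.update(loss_value, n)
    epoch_avgs.append(meter.avg)
    print(f"Epoch {epoch}: average loss {meter.avg:.4f}")
# Epoch 0: average loss 0.0600
# Epoch 1: average loss 0.0260
```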


Run inference on a test image

In [12]:
image = Image.open("/root/autodl-tmp/multilabel_modified/images/image6179.jpg").convert("RGB")
model.eval()

# prepare image for the model (same target size, mean and std as the training transform)
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# forward pass
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
    logits = outputs.logits

# Training used BCEWithLogitsLoss (a sigmoid is applied to the logits before the loss is computed),
# so a sigmoid is applied here as well; it turns each logit into an independent probability.
probs = torch.sigmoid(logits.squeeze(0).cpu())

# every label whose probability reaches the threshold (here 50%) counts as predicted;
# map those indices to label names
predicted_labels = [id2label[idx] for idx, p in enumerate(probs.tolist()) if p >= 0.5]
print(predicted_labels)

['bus', 'trees']
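The thresholding step above can be written with NumPy alone, which also makes it easy to try other thresholds. This sketch adds one optional, hypothetical rule: if no label reaches the threshold, fall back to the single most likely one (sensible only if every image is expected to carry at least one label). The logits are illustrative, not output of the fine-tuned model.

```python
import numpy as np

label_names = ["motorcycle", "truck", "boat", "bus", "cycle", "person", "desert",
               "mountains", "sea", "sunset", "trees", "sitar", "ektara", "flutes",
               "tabla", "harmonium"]


def predict_labels(logits, threshold=0.5, ensure_one=True):
    """Sigmoid + per-label threshold; optionally fall back to the top label."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    chosen = np.flatnonzero(probs >= threshold)
    if ensure_one and chosen.size == 0:
        chosen = np.array([int(np.argmax(probs))])
    return [label_names[i] for i in chosen]


demo_logits = np.full(16, -4.0)
demo_logits[3], demo_logits[10] = 1.2, 0.3  # "bus" and "trees" above 0, so prob > 0.5
print(predict_labels(demo_logits))  # ['bus', 'trees']

low_logits = np.full(16, -4.0)
low_logits[13] = -1.0  # "flutes" is most likely but still below 0.5
print(predict_labels(low_logits))  # ['flutes']
```

A probability threshold of 0.5 is the same as keeping logits ≥ 0; raising the threshold trades recall for precision per label.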


Save the model to disk

In [None]:
model.save_pretrained("./saved_model/")
processor.save_pretrained("./saved_model/")  # lets AutoImageProcessor reload from the same folder