<a href="https://colab.research.google.com/github/Pengyu-gis/RemoteCLIP/blob/main/open_clip_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Prepare the Data and Environment
This notebook uses the Flickr 8k Dataset.
Dataset URL: https://www.kaggle.com/datasets/adityajn105/flickr8k/code

In [None]:
from google.colab import files

# Upload kaggle.json
uploaded = files.upload()

# Confirm that kaggle.json was uploaded correctly
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Create the kaggle directory and move kaggle.json into it
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/

# Restrict permissions so that only the owner can read the API key
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
!kaggle datasets download -d adityajn105/flickr8k
!unzip flickr8k.zip

In [None]:
!pip install open_clip_torch

## Load the Model
Use the interface provided by open_clip to load a pretrained CLIP model. You can choose whichever model version suits your task.

In [None]:
import open_clip

# Load the pretrained model
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')

## Define the Data Loader
To load the JPEG images together with their text captions, define a custom torch.utils.data.Dataset. An example implementation follows:

In [None]:
import os
import pandas as pd
from PIL import Image
import torch
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader

class ImageTextDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None):
        # Read the annotations file with pandas, assuming comma-separated fields
        self.img_labels = pd.read_csv(annotations_file, delimiter=',')
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = Image.open(img_path).convert("RGB")  # Read the JPG file
        caption = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        return image, caption

# Set up the image transform. Reuse the preprocessing pipeline returned by
# open_clip so that resizing and normalization match the pretrained model.
transform = preprocess

# Create the dataset and the data loader
dataset = ImageTextDataset(annotations_file='/content/captions.txt',
                           img_dir='/content/Images',
                           transform=transform)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
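
Before building the dataset, it can help to check how the annotations file is parsed. The sketch below uses a made-up three-line excerpt rather than the real file. It assumes that `captions.txt` starts with an `image,caption` header row and that captions containing commas are wrapped in double quotes; the file names and captions shown are hypothetical.

```python
import io
import pandas as pd

# Hypothetical excerpt in the assumed captions.txt layout (header + one row per caption).
sample = io.StringIO(
    'image,caption\n'
    'img_001.jpg,A dog runs across the grass .\n'
    'img_001.jpg,"A brown dog , mid-stride , on a lawn ."\n'
    'img_002.jpg,Two children play on a beach .\n'
)

# Same call as in ImageTextDataset.__init__; quoted commas stay inside the caption.
df = pd.read_csv(sample, delimiter=',')

# Column 0 is the file name and column 1 is the caption, which is what
# __getitem__ relies on when it indexes with iloc[idx, 0] and iloc[idx, 1].
first_image, first_caption = df.iloc[0, 0], df.iloc[0, 1]

# One row per (image, caption) pair, so an image with several captions
# appears several times in the dataset.
captions_per_image = df.groupby('image')['caption'].count()
print(captions_per_image)
```

Because each row is one image-caption pair, `len(dataset)` counts captions, not distinct images.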


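## Fine-Tune the Model
Once the data loader is defined, you can start fine-tuning the model. This means iterating over the data loader, feeding each batch of images and text to the model, computing the loss, and updating the model weights. Below is a simplified example of the fine-tuning process: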

In [None]:
import torch
from torch import nn, optim
from open_clip import tokenize

# Select the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# If the model isn't automatically moved to the correct device, explicitly do so
model = model.to(device)


# Define the optimizer and the loss function
optimizer = optim.Adam(model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()

num_epochs = 3  # Example number of epochs

model.train()  # Make sure the model is in training mode

for epoch in range(num_epochs):
    running_loss = 0.0
    for images, captions in dataloader:
        # Use the device selected above instead of hard-coding "cuda"
        images = images.to(device)
        text_tokens = tokenize(list(captions)).to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # One forward pass; the model returns normalized image features,
        # normalized text features, and the exponentiated logit scale
        image_features, text_features, logit_scale = model(images, text_tokens)

        # Contrastive loss: the i-th image should match the i-th caption,
        # so the target for row i of the similarity matrix is class i
        logits_per_image = logit_scale * image_features @ text_features.t()
        logits_per_text = logits_per_image.t()
        labels = torch.arange(images.shape[0], device=images.device)
        loss = (criterion(logits_per_image, labels) +
                criterion(logits_per_text, labels)) / 2

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss/len(dataloader)}")
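
CLIP-style training usually scores a batch with a symmetric contrastive loss over the image-text similarity matrix, not by comparing the two feature tensors directly. The following is a minimal standalone sketch of that loss on toy tensors. The function name `clip_contrastive_loss` and the fixed `logit_scale=100.0` are illustrative assumptions, not part of open_clip.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric cross-entropy over the image-text similarity matrix (illustrative helper)."""
    # L2-normalize so that the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Entry (i, j) scores image i against caption j.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching caption for image i sits at index i of the batch.
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2

# Toy check: aligned pairs should score a lower loss than misaligned pairs.
torch.manual_seed(0)
feats = torch.randn(4, 8)
matched = clip_contrastive_loss(feats, feats, logit_scale=100.0)
mismatched = clip_contrastive_loss(feats, feats.flip(0), logit_scale=100.0)
print(matched.item(), mismatched.item())
```

One caveat for this dataset: an image with several captions can appear more than once in a shuffled batch, in which case some off-diagonal pairs are treated as negatives even though they match.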
