# CLIP (Contrastive Language-Image Pretraining)

In [1]:
# Install CLIP from its git repository
!pip install git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-0lo_vccp
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-0lo_vccp
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Collecting ftfy (from clip==1.0)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Downloading ftfy-6.3.1-py3-none-any.whl (44 kB)
   44.8/44.8 kB 2.6 MB/s eta 0:00:00
Building wheels for collected packages: clip
  Building wheel for clip (setup.py) ... done
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369490 sha256=bb7682374b8dff3714ffe12178a5e38662777bd3ee28aab1315aeb33886e2366
  Stored in directory: /tmp/pip-ephem-wheel-cache-9tzhzsej/wheels/35/3e/df/3d24cbfb3b6a06f17

In [2]:
# Load the model and its preprocessing function
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, processor = clip.load("ViT-B/32", device=device)

100%|████████████████████████████████████████| 338M/338M [00:03<00:00, 107MiB/s]


In [None]:
from PIL import Image

# Preprocess the image and build the caption options
image = processor(Image.open("test.jpg")).unsqueeze(0).to(device)
caption_options = [
    "a dog on the grass",
    "a cat on the grass",
    "a pug sitting",
    "a cat on the table"
]
captions = clip.tokenize(caption_options).to(device)

In [None]:
with torch.no_grad():
  image_features = model.encode_image(image)
  text_features = model.encode_text(captions)

  logits_per_image, _ = model(image, captions)
  probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(image_features)
print(text_features)
print("Best caption chosen by CLIP:", caption_options[probs.argmax()])

tensor([[ 1.5186e-01,  1.3696e-01,  2.1985e-01, -1.4417e-01,  2.3120e-01,
         -8.2568e-01,  4.1382e-01,  2.1313e-01, -4.0820e-01,  8.3618e-02,
         -1.3159e-01, -9.8389e-02,  3.8770e-01,  2.9590e-01,  6.3574e-01,
          4.9744e-02,  2.4548e-01, -1.4746e-01,  3.4619e-01, -4.4891e-02,
         -4.0161e-01, -1.9989e-02,  1.6296e-01, -2.5708e-01, -4.0161e-01,
         -8.8928e-02,  2.3157e-01, -2.5360e-02, -6.9763e-02, -7.2693e-02,
          1.5356e-01,  2.2693e-01, -4.1162e-01, -1.5918e-01,  1.1298e-01,
          5.8350e-01,  3.8452e-02,  3.6963e-01, -4.2529e-01,  1.0459e+00,
         -4.3018e-01, -8.0566e-02, -2.2180e-01, -2.3120e-01,  9.7595e-02,
         -1.0852e-01, -5.9570e-02,  6.1737e-02,  4.6387e-01,  1.3000e-01,
          2.6025e-01,  3.4985e-01, -1.0736e-01, -2.8003e-01,  3.9697e-01,
          1.6650e-01,  3.0472e-02,  2.7979e-01, -2.9102e-01,  4.5357e-03,
          1.2031e+00,  7.4890e-02,  4.7876e-01,  3.1396e-01, -4.2944e-01,
         -3.4521e-01, -2.7866e-03,  7.
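
In the CLIP reference implementation, the `logits_per_image` values used above are scaled cosine similarities between the image embedding and each text embedding, and the softmax turns them into a distribution over the captions. A minimal pure-Python sketch of that computation follows. The 3-dimensional vectors and the scale of 100.0 are made-up stand-ins: the real embeddings come from `encode_image` and `encode_text`, and the real scale from `model.logit_scale.exp()`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def caption_probs(image_vec, text_vecs, scale=100.0):
    """Scaled cosine similarity per caption, followed by a softmax."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return softmax(logits)


# Toy embeddings (assumed values, not real CLIP output).
toy_image = [0.9, 0.1, 0.2]
toy_texts = [
    [0.8, 0.2, 0.1],  # points in nearly the same direction as the image
    [0.1, 0.9, 0.3],
    [0.2, 0.1, 0.9],
]

toy_probs = caption_probs(toy_image, toy_texts)
best_index = toy_probs.index(max(toy_probs))
print(best_index, [round(p, 3) for p in toy_probs])
```

Because the scale is large, small differences in cosine similarity become sharp differences in probability, which is why one caption usually dominates the softmax output.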

# BLIP (Bootstrapping Language-Image Pretraining)

In [5]:
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



In [6]:
image = Image.open("test.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

output = model.generate(**inputs)

In [9]:
print(output)
print("Caption generated by BLIP:", processor.decode(output[0]))
print("Caption generated by BLIP:", processor.decode(output[0], skip_special_tokens=True))

tensor([[30522,  1037, 17022,  3564,  1999,  1996,  5568,  2007,  2049,  2677,
          2330,   102]])
Caption generated by BLIP: a puppy sitting in the grass with its mouth open [SEP]
Caption generated by BLIP: a puppy sitting in the grass with its mouth open


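# [Exercise] Image Caption Matching Quiz

1. Generate a suitable caption for the image with BLIP
2. Use the OpenAI API to generate five "options", including the caption from step 1
3. Let the user pick one option
4. Use CLIP to find the caption with the highest similarity to the image
5. Print the results
  - Whether the answer is correct
  - The BLIP-generated caption
  - The CLIP-matched caption and the similarity score of each option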
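
Steps 1 to 3 come down to mixing the BLIP caption in with four distractors and remembering where it ended up. The sketch below covers only that bookkeeping. `build_options` is a hypothetical helper, the distractor captions are hard-coded placeholders standing in for whatever the OpenAI API would return, and the fixed seed is there only to make the shuffle repeatable.

```python
import random


def build_options(blip_caption, distractors, seed=None):
    """Shuffle the BLIP caption in with the distractors.

    Returns (options, blip_index), where blip_index is the 0-based
    position of the BLIP caption in the shuffled list.
    """
    # Drop any distractor identical to the BLIP caption so it appears once.
    unique = [d for d in distractors if d != blip_caption]
    options = [blip_caption] + unique
    random.Random(seed).shuffle(options)
    return options, options.index(blip_caption)


blip_caption = "a puppy sitting in the grass with its mouth open"
# Placeholder distractors (assumed; in the exercise these come from the OpenAI API).
placeholder_distractors = [
    "a cat sleeping on a sofa",
    "a dog running on the beach",
    "a rabbit eating a carrot",
    "a puppy lying on a wooden floor",
]

options, blip_index = build_options(blip_caption, placeholder_distractors, seed=0)
for number, option in enumerate(options, start=1):
    print(f"{number}) {option}")
```

Keeping `blip_index` alongside the shuffled list makes it easy to label the BLIP caption later, when the scores are printed.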



In [None]:
"""
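**<< Sample output >>**

Options:
1) blah blah
2) blah blah
3) blah blah
4) blah blah
5) blah blah
(user input)

Correct🚀
BLIP-generated caption: 2) blah blah
CLIP-matched caption: 2) blah blah
Similarity scores:
1) blah blah 00 points (BLIP-generated caption)
2) blah blah 00 points (CLIP-matched caption)
3) blah blah 00 points
4) blah blah 00 points
5) blah blah 00 points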
"""
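
Step 5 can be sketched as a small formatting function that follows the sample output above. This is one possible reading of the exercise, not a reference solution. `format_result` is a hypothetical helper, the scores are made-up numbers standing in for CLIP's softmax probabilities scaled to 0-100, and the answer is assumed to count as correct when the user's choice equals the caption CLIP ranks highest.

```python
def format_result(options, scores, blip_index, user_choice):
    """Build the result text for the quiz.

    options: list of caption strings
    scores: one score per option (assumed to be on a 0-100 scale)
    blip_index, user_choice: 0-based positions in options
    """
    clip_index = max(range(len(scores)), key=scores.__getitem__)

    # Assumption: the answer is correct when it matches CLIP's top caption.
    lines = ["Correct🚀" if user_choice == clip_index else "Incorrect"]
    lines.append(f"BLIP-generated caption: {blip_index + 1}) {options[blip_index]}")
    lines.append(f"CLIP-matched caption: {clip_index + 1}) {options[clip_index]}")
    lines.append("Similarity scores:")

    for i, (option, score) in enumerate(zip(options, scores)):
        tags = []
        if i == blip_index:
            tags.append("BLIP-generated caption")
        if i == clip_index:
            tags.append("CLIP-matched caption")
        suffix = f" ({', '.join(tags)})" if tags else ""
        lines.append(f"{i + 1}) {option} {score:.0f} points{suffix}")

    return "\n".join(lines)


# Made-up options and scores for illustration only.
sample_options = [
    "a cat sleeping on a sofa",
    "a puppy sitting in the grass with its mouth open",
    "a dog running on the beach",
    "a rabbit eating a carrot",
    "a puppy lying on a wooden floor",
]
sample_scores = [3.0, 88.0, 5.0, 2.0, 2.0]

result_text = format_result(sample_options, sample_scores, blip_index=1, user_choice=1)
print(result_text)
```

Returning a string instead of printing inside the function keeps the formatting easy to check separately from the model calls.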