### Classifying Images with OpenAI CLIP

### What You Will Learn
 * Preparation and checks before the exercise
 * Complete code for classifying an image with OpenAI CLIP
 * Reverse Stable Diffusion: converting an image to text

### Preparation and Checks Before the Exercise
  * Install the openai package with pip install openai
  * Change the Colab notebook runtime type to GPU
  * Download the image and save it as image01.jpg
     * wget -O image01.jpg https://upload.wikimedia.org/wikipedia/commons/d/d0/STS086-371-015_-_STS-086_-_Various_views_of_STS-86_and_Mir_24_crewmembers_on_the_Mir_space_station_-_DPLA_-_92233a2e397bd089d70a7fcf922b34a4.jpg

In [1]:
!pip install openai

Collecting openai
  Downloading openai-0.27.10-py3-none-any.whl (76 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.5/76.5 kB 2.4 MB/s eta 0:00:00
Installing collected packages: openai
Successfully installed openai-0.27.10


In [2]:
!wget -O image01.jpg https://upload.wikimedia.org/wikipedia/commons/d/d0/STS086-371-015_-_STS-086_-_Various_views_of_STS-86_and_Mir_24_crewmembers_on_the_Mir_space_station_-_DPLA_-_92233a2e397bd089d70a7fcf922b34a4.jpg

--2023-08-31 07:00:40--  https://upload.wikimedia.org/wikipedia/commons/d/d0/STS086-371-015_-_STS-086_-_Various_views_of_STS-86_and_Mir_24_crewmembers_on_the_Mir_space_station_-_DPLA_-_92233a2e397bd089d70a7fcf922b34a4.jpg
Resolving upload.wikimedia.org (upload.wikimedia.org)... 185.15.59.240, 2a02:ec80:300:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|185.15.59.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1843795 (1.8M) [image/jpeg]
Saving to: ‘image01.jpg’


2023-08-31 07:00:40 (6.58 MB/s) - ‘image01.jpg’ saved [1843795/1843795]



#### Checking Whether CUDA Is Available

In [4]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


#### Loading the Model with clip.load()

In [5]:
!pip install git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-j3lszw1a
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-j3lszw1a
  Resolved https://github.com/openai/CLIP.git to commit a1d071733d7111c9c014f024669f959182114e33
  Preparing metadata (setup.py) ... done
Collecting ftfy (from clip==1.0)
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 2.2 MB/s eta 0:00:00
Building wheels for collected packages: clip
  Building wheel for clip (setup.py) ... done
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369497 sha256=644d0bc8369fb6814301a1a4ce29950411557ece02518c302ed1845fa1232543
  Stored in directory: /tmp/pip-ephem-wheel-cache-ax0zcjor/wheels/da/2b/4c/d6691fa9597aac8bb85d2ac13b112deb897d5b50f5ad9a37e4
Successfully built clip
Inst

In [6]:
import clip
model, preprocess = clip.load('ViT-B/32', device=device)
print(model, '\n preprocess :', preprocess)

100%|███████████████████████████████████████| 338M/338M [00:06<00:00, 57.1MiB/s]


CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
    (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): Sequential(
        (0): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=768, out_features=3072, bias=True)
            (gelu): QuickGELU()
            (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (1): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          

#### Encoding with the CLIP Model

In [7]:
import PIL.Image  # import the submodule explicitly; a bare "import PIL" does not load PIL.Image

In [8]:
# Load the image
image = PIL.Image.open("image01.jpg")

# Preprocess the image and add a batch dimension
image_input = preprocess(image).unsqueeze(0).to(device)

# Encode the image with the CLIP model
with torch.no_grad():
    image_features = model.encode_image(image_input)
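
To make the tensor shapes in this step concrete, the sketch below works through the arithmetic for ViT-B/32, starting from the `Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32))` layer in the model printout. The 224×224 input resolution and the 512-dimensional output embedding are assumptions based on the released ViT-B/32 checkpoint, and the variable names are illustrative only.

```python
# Shape arithmetic for CLIP ViT-B/32 (values marked "assumption" are not in the notebook output).
input_resolution = 224   # assumption: preprocess() resizes and center-crops to 224x224
patch_size = 32          # from Conv2d(..., kernel_size=(32, 32), stride=(32, 32))
width = 768              # transformer width, from the model printout
embed_dim = 512          # assumption: size of the joint image/text embedding

# preprocess(image) yields one tensor of shape [3, 224, 224];
# unsqueeze(0) adds the batch dimension that the model expects.
image_input_shape = (1, 3, input_resolution, input_resolution)

# A stride-32 convolution cuts the image into non-overlapping 32x32 patches.
patches_per_side = input_resolution // patch_size
num_patches = patches_per_side ** 2
sequence_length = num_patches + 1  # one extra position for the class token

# encode_image() returns one embedding vector per image in the batch.
image_features_shape = (image_input_shape[0], embed_dim)

print(image_input_shape)
print(num_patches, sequence_length, width)
print(image_features_shape)
```

In the notebook itself, `image_input.shape` and `image_features.shape` can be printed to confirm these values.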


#### Defining the List of Text Prompts

In [9]:
prompts = [
    "A large galaxy in the center of a cluster of galaxies located in the constellation Boötes",
    "MTA Long Island Bus has just left the Hempstead Bus Terminal on the N6",
    "STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block",
    "A view of the International Space Station (ISS) from the Soyuz TMA-19 spacecraft, as it approaches the station for docking",
    "A domesticated tiger in a cage with a tiger trainer in the background",
    "A mechanical engineer working on a car engine",
]

# Encode the text prompts with the CLIP model
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
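
`clip.tokenize` turns every prompt into a fixed-length row of token ids, which is why prompts of different lengths can be stacked into one batch. The following is a rough, library-free sketch of that padding step, not the library's implementation. The context length of 77 and the start/end token ids are assumptions taken from the released CLIP vocabulary, `pad_to_context` is a hypothetical helper, and the ids in `fake_ids` are made up.

```python
CONTEXT_LENGTH = 77   # assumption: CLIP's fixed text context length
SOT_TOKEN = 49406     # assumption: id of <|startoftext|> in the released vocabulary
EOT_TOKEN = 49407     # assumption: id of <|endoftext|> in the released vocabulary


def pad_to_context(bpe_ids, truncate=False):
    """Wrap BPE ids in start/end tokens and zero-pad them to CONTEXT_LENGTH."""
    tokens = [SOT_TOKEN] + list(bpe_ids) + [EOT_TOKEN]
    if len(tokens) > CONTEXT_LENGTH:
        if not truncate:
            raise RuntimeError("prompt is too long for the context length")
        # Cut to the context length and keep the end token in the last slot.
        tokens = tokens[:CONTEXT_LENGTH]
        tokens[-1] = EOT_TOKEN
    return tokens + [0] * (CONTEXT_LENGTH - len(tokens))


fake_ids = [320, 1929, 2533]  # hypothetical BPE ids standing in for a short prompt
row = pad_to_context(fake_ids)
print(len(row), row[:6])
```

With the real library, a prompt that exceeds the context length raises an error unless `clip.tokenize(prompts, truncate=True)` is used.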

#### Computing the Similarity Between the Image and Each Prompt

In [10]:
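# Compute the similarity between the image and each prompt.
# Note: the features are not L2-normalized here, so these are scaled raw dot products
# rather than cosine similarities, and the softmax comes out as a one-hot vector.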
similarity_scores = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(similarity_scores)


tensor([[0., 0., 1., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
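
The scores above are exactly 0 or 1 because the cell multiplies the raw features directly. CLIP's published usage example first scales each feature vector to unit length, so that the dot product becomes a cosine similarity before the factor of 100 and the softmax are applied. The sketch below illustrates the difference with plain Python lists. The three-dimensional vectors are made-up stand-ins for real CLIP features, and the helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up toy features: one image vector and three text vectors.
image = [3.0, 4.0, 0.0]
texts = [[4.0, 3.0, 0.0], [3.0, 4.0, 1.0], [0.0, 1.0, 5.0]]

# As in the cell above: scaled raw dot products.
raw_probs = softmax([100.0 * dot(image, t) for t in texts])

# With unit-length features: scaled cosine similarities.
image_n = l2_normalize(image)
cos_probs = softmax([100.0 * dot(image_n, l2_normalize(t)) for t in texts])

print([round(p, 4) for p in raw_probs])
print([round(p, 4) for p in cos_probs])
```

In the notebook, the equivalent change would be to divide `image_features` and `text_features` by their `norm(dim=-1, keepdim=True)` before the matrix product. Because raw dot products reward vectors with larger norms, normalizing can also change the ranking, not only the sharpness of the distribution.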


In [11]:
# Print the prompt with the highest similarity score
most_similar_prompt_index = similarity_scores.argmax().item()
most_similar_prompt = prompts[most_similar_prompt_index]
print("The image is most similar to the prompt: {} ".format(most_similar_prompt))


The image is most similar to the prompt: STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block 
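
`argmax` reports only the single best prompt. When several prompts are plausible, it can help to look at a ranked list instead. Here is a minimal sketch of that ranking in plain Python, with shortened stand-in prompts and made-up scores; `top_k` is a hypothetical helper, not part of CLIP.

```python
def top_k(prompts, scores, k=3):
    """Return the k (score, prompt) pairs with the highest scores, best first."""
    ranked = sorted(zip(scores, prompts), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


# Shortened stand-ins for the prompts above, with made-up scores.
example_prompts = ["galaxy", "bus", "space station crew", "tiger"]
example_scores = [0.02, 0.01, 0.90, 0.07]

for score, prompt in top_k(example_prompts, example_scores, k=2):
    print(f"{score:.2f}  {prompt}")
```

On the actual tensor, `similarity_scores[0].topk(k)` returns the top values and their indices in one call.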


### Complete Code for Classifying an Image with OpenAI CLIP

In [12]:
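# Import the required libraries
import torch
import clip
import PIL.Image  # import the submodule explicitly; a bare "import PIL" does not load PIL.Image

# Load the CLIP model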
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device)

# Load the image
image = PIL.Image.open("image01.jpg")

# Preprocess the image and add a batch dimension
image_input = preprocess(image).unsqueeze(0).to(device)

# Encode the image with the CLIP model
with torch.no_grad():
    image_features = model.encode_image(image_input)

# Define the list of text prompts
prompts = [
    "A large galaxy in the center of a cluster of galaxies located in the constellation Boötes",
    "MTA Long Island Bus has just left the Hempstead Bus Terminal on the N6",
    "STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block",
    "A view of the International Space Station (ISS) from the Soyuz TMA-19 spacecraft, as it approaches the station for docking",
    "A domesticated tiger in a cage with a tiger trainer in the background",
    "A mechanical engineer working on a car engine",
]

# Encode the text prompts with the CLIP model
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))

# Compute the similarity between the image and each text prompt
similarity_scores = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Print the prompt with the highest similarity score
most_similar_prompt_index = similarity_scores.argmax().item()
most_similar_prompt = prompts[most_similar_prompt_index]
print("The image is most similar to the prompt: {}".format(most_similar_prompt))


The image is most similar to the prompt: STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block
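
To try other images or prompt lists without repeating the whole cell, the steps can be wrapped in a function. The version below is one possible sketch, not the notebook's original code: `classify_image` is a hypothetical name, the imports sit inside the function so that defining it does not require the heavy dependencies, and it L2-normalizes the features before the softmax, which the cell above does not do.

```python
def classify_image(image_path, prompts, model_name="ViT-B/32"):
    """Return (best_prompt, probabilities) for one image and a list of prompts."""
    # Imported here so torch and clip are only needed when the function is called.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load(model_name, device=device)

    image_input = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text_input = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)

    # Assumption: normalize to unit length so the dot product is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]
    best_index = probs.argmax().item()
    return prompts[best_index], probs.tolist()
```

A call such as `best, probs = classify_image("image01.jpg", prompts)` would then replace the cell. Note that this sketch reloads the model on every call; for repeated use it would be better to load the model once and pass it in.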


### Reverse Stable Diffusion: Converting an Image to Text

In [13]:
!pip install clip-interrogator

Collecting clip-interrogator
  Downloading clip_interrogator-0.6.0-py3-none-any.whl (787 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 787.8/787.8 kB 8.9 MB/s eta 0:00:00
Collecting safetensors (from clip-interrogator)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 18.6 MB/s eta 0:00:00
Collecting open-clip-torch (from clip-interrogator)
  Downloading open_clip_torch-2.20.0-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 30.9 MB/s eta 0:00:00
Collecting accelerate (from clip-interrogator)
  Downloading accelerate-0.22.0-py3-none-any.whl (251 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 251.2/251.2 kB 22.9 MB/s eta 0:00:00
Collecting transformers>=4.27.1 (from clip-inte

#### Converting the Image to Text

In [14]:
from PIL import Image
from clip_interrogator import Config, Interrogator

# Specify the image path
image_path = 'image01.jpg'
image = Image.open(image_path).convert('RGB')

ci = Interrogator(Config(clip_model_name = "ViT-L-14/openai"))

print(ci.interrogate(image))

Loading caption model blip-large...


Downloading (…)lve/main/config.json:   0%|          | 0.00/4.60k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.88G [00:00<?, ?B/s]

Downloading (…)rocessor_config.json:   0%|          | 0.00/445 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/527 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Loading CLIP model ViT-L-14/openai...


100%|███████████████████████████████████████| 933M/933M [00:17<00:00, 54.3MiB/s]
ViT-L-14_openai_artists.safetensors: 100%|██████████| 16.2M/16.2M [00:00<00:00, 268MB/s]
ViT-L-14_openai_flavors.safetensors: 100%|██████████| 155M/155M [00:00<00:00, 315MB/s]
ViT-L-14_openai_mediums.safetensors: 100%|██████████| 146k/146k [00:00<00:00, 11.7MB/s]
ViT-L-14_openai_movements.safetensors: 100%|██████████| 307k/307k [00:00<00:00, 18.8MB/s]
ViT-L-14_openai_trendings.safetensors: 100%|██████████| 111k/111k [00:00<00:00, 12.7MB/s]
ViT-L-14_openai_negative.safetensors: 100%|██████████| 63.2k/63.2k [00:00<00:00, 56.5MB/s]


Loaded CLIP model and data in 29.97 seconds.


100%|██████████| 55/55 [00:00<00:00, 167.64it/s]
Flavor chain:  38%|███▊      | 12/32 [00:40<01:07,  3.40s/it]
100%|██████████| 55/55 [00:00<00:00, 171.85it/s]
100%|██████████| 6/6 [00:00<00:00, 136.60it/s]
100%|██████████| 50/50 [00:00<00:00, 185.57it/s]

three men in overalls standing in a space station with a monitor, macron with afro hair, without beard and mustache, inspired by Mikhail Evstafiev, in the 1986 vert contest, taking control while smiling, moustache, the backroom, technical vest, from 2001, camper, 2009, 2005
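
The generated text is a comma-separated prompt: a descriptive caption followed by a chain of style and content modifiers. The small sketch below separates the two parts. It assumes that the first comma-separated segment is the caption and that the caption itself contains no comma, which will not always hold; `split_prompt` is a hypothetical helper, not part of clip-interrogator.

```python
def split_prompt(generated):
    """Split a generated prompt into (caption, list of modifiers)."""
    parts = [part.strip() for part in generated.split(",") if part.strip()]
    if not parts:
        return "", []
    return parts[0], parts[1:]


# Shortened copy of the output above.
generated = (
    "three men in overalls standing in a space station with a monitor, "
    "macron with afro hair, without beard and mustache, "
    "inspired by Mikhail Evstafiev"
)

caption, modifiers = split_prompt(generated)
print(caption)
print(modifiers)
```

Keeping the caption apart from the modifiers makes it easier to judge the description on its own, since modifiers such as "macron with afro hair" or "2009" reflect the ranking vocabulary rather than what the photo actually shows.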



