### Image Classification with OpenAI CLIP

### What you will learn
 * Preparation and checks before the exercise
 * Full code for image classification with OpenAI CLIP
 * Reverse Stable Diffusion: converting an image to text

### Preparation and checks before the exercise
 * The Google Colab environment is reset after a period of time, so the following two steps must be repeated in every new session.
   * Create the chatgpt.env file.
     * Either edit the provided chatgpt.env and upload it, or look up your API_KEY and ORG_ID and create the file yourself.
   * Install the package with pip install openai.
     * The first run of the install printed the error below (as of December 2023). Running the install a second time made it disappear.
     ```
     ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
     llmx 0.0.15a0 requires cohere, which is not installed.
     llmx 0.0.15a0 requires tiktoken, which is not installed.
     ```
 * Change the Colab notebook runtime type to GPU.
 * Download the image.
   * `wget https://upload.wikimedia.org/wikipedia/commons/d/d0/STS086-371-015_-_STS-086_-_Various_views_of_STS-86_and_Mir_24_crewmembers_on_the_Mir_space_station_-_DPLA_-_92233a2e397bd089d70a7fcf922b34a4.jpg`
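
The notebook does not show the contents of chatgpt.env. As a minimal sketch, assuming the file holds plain `KEY=VALUE` lines using the names API_KEY and ORG_ID mentioned above, it could be read with the standard library alone. The helper name `load_env_file` is hypothetical, and the sample values are placeholders.

```python
import os
import tempfile

def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and lines without '='.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values

# Demo with a temporary file standing in for chatgpt.env (assumed layout, placeholder values).
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False, encoding="utf-8") as tmp:
    tmp.write("# assumed layout\nAPI_KEY=sk-placeholder\nORG_ID='org-placeholder'\n")
    demo_path = tmp.name

env = load_env_file(demo_path)
os.remove(demo_path)
print(sorted(env))
```

In Colab the same function would be called with the uploaded file's path, for example `load_env_file("chatgpt.env")`.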

In [2]:
!pip install openai



In [3]:
!wget -O image01.jpg https://upload.wikimedia.org/wikipedia/commons/d/d0/STS086-371-015_-_STS-086_-_Various_views_of_STS-86_and_Mir_24_crewmembers_on_the_Mir_space_station_-_DPLA_-_92233a2e397bd089d70a7fcf922b34a4.jpg

--2023-12-12 07:37:50--  https://upload.wikimedia.org/wikipedia/commons/d/d0/STS086-371-015_-_STS-086_-_Various_views_of_STS-86_and_Mir_24_crewmembers_on_the_Mir_space_station_-_DPLA_-_92233a2e397bd089d70a7fcf922b34a4.jpg
Resolving upload.wikimedia.org (upload.wikimedia.org)... 208.80.154.240, 2620:0:861:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|208.80.154.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1843795 (1.8M) [image/jpeg]
Saving to: ‘image01.jpg’


2023-12-12 07:37:50 (9.47 MB/s) - ‘image01.jpg’ saved [1843795/1843795]



#### Checking whether CUDA is available

In [4]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


#### Loading the model with clip.load()

In [5]:
!pip install git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-7r9jdq60
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-7r9jdq60
  Resolved https://github.com/openai/CLIP.git to commit a1d071733d7111c9c014f024669f959182114e33
  Preparing metadata (setup.py) ... done
Collecting ftfy (from clip==1.0)
  Downloading ftfy-6.1.3-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.4/53.4 kB 803.7 kB/s eta 0:00:00
Building wheels for collected packages: clip
  Building wheel for clip (setup.py) ... done
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369497 sha256=6ba95b661755ea01fd0427a0b4c6fa55dc3458d0d839770521c026d8c44ac246
  Stored in directory: /tmp/pip-ephem-wheel-cache-my8w24w3/wheels/da/2b/4c/d6691fa9597aac8bb85d2ac13b112deb897d5b50f5ad9a37e4
Successfully built clip

In [6]:
import clip
model, preprocess = clip.load('ViT-B/32', device=device)
print(model, '\n preprocess :', preprocess)

100%|████████████████████████████████████████| 338M/338M [00:02<00:00, 133MiB/s]


CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
    (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): Sequential(
        (0): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=768, out_features=3072, bias=True)
            (gelu): QuickGELU()
            (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (1): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          

#### Encoding with the CLIP model

In [9]:
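import PIL.Image

# Load the image
image = PIL.Image.open("image01.jpg")

# Preprocess the image and add a batch dimension
image_input = preprocess(image).unsqueeze(0).to(device)

# Check the shape of the preprocessed tensor
print(image_input.shape)

# Encode the image with the CLIP model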
with torch.no_grad():
    image_features = model.encode_image(image_input)


torch.Size([1, 3, 224, 224])
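
The shape `[1, 3, 224, 224]` reads as batch, channels, height, width. The model printout above shows the first layer as a `Conv2d` with `kernel_size=(32, 32)` and `stride=(32, 32)`, so each 224x224 image is cut into non-overlapping 32x32 patches. The arithmetic can be sketched as follows. The extra class token is an assumption based on the standard Vision Transformer design and is not visible in the truncated printout.

```python
def count_patches(image_size, patch_size):
    """Number of non-overlapping square patches per side and in total."""
    per_side = image_size // patch_size
    return per_side, per_side * per_side

per_side, total = count_patches(image_size=224, patch_size=32)
# Assumption: one learned class token is prepended, as in the standard ViT design.
sequence_length = total + 1
print(per_side, total, sequence_length)  # 7 49 50
```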


#### Defining the list of text prompts

In [10]:
prompts = [
    "A large galaxy in the center of a cluster of galaxies located in the constellation Bostes",
    "MTA Long Island Bus has just left the Hempstead Bus Terminal on the N6",
    "STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block",
    "A view of the International Space Station (ISS) from the Soyuz TMA-19 spacecraft, as it approaches the station for docking",
    "A domesticated tiger in a cage with a tiger trainer in the background",
    "A mechanical engineer working on a car engine",
]

# Encode the text prompts with the CLIP model
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))

#### Computing the similarity between the image and each prompt

In [11]:
# Compute the similarity between the image and each prompt
similarity_scores = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(similarity_scores)


tensor([[0., 0., 1., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
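
The printed scores are exactly one-hot. One likely reason is that the feature vectors are used without L2 normalization, so the logits passed to softmax are large and far apart. The zero-shot example in the CLIP repository's README divides each feature vector by its norm first, which turns the dot product into a cosine similarity. The pure-Python sketch below uses made-up 3-dimensional vectors (illustrative numbers, not real CLIP features) to show how the two variants differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Illustrative vectors only; real CLIP features are 512-dimensional for ViT-B/32.
image = [3.0, 4.0, 0.0]
texts = [[4.0, 3.0, 0.0], [3.0, 4.0, 1.0], [0.0, 1.0, 5.0]]

# Raw dot products, as in the notebook cell above.
raw = softmax([100.0 * dot(image, t) for t in texts])
# Cosine similarities: normalize both sides before the dot product.
cosine = softmax([100.0 * dot(normalize(image), normalize(t)) for t in texts])

print([round(p, 4) for p in raw])
print([round(p, 4) for p in cosine])
```

With these toy numbers the raw variant puts essentially all probability on one prompt, while the normalized variant still ranks the same prompt first but leaves visible probability on the runner-up. In the notebook, the equivalent change would be dividing `image_features` and `text_features` by their `norm(dim=-1, keepdim=True)` before the matrix product.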


In [12]:
# Print the prompt with the highest similarity score
most_similar_prompt_index = similarity_scores.argmax().item()
most_similar_prompt = prompts[most_similar_prompt_index]
print("The image is most similar to the prompt: {} ".format(most_similar_prompt))


The image is most similar to the prompt: STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block 
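
`argmax` reports only the winner. To see how the remaining prompts compare, the scores can be paired with the prompts and sorted. The sketch below uses the score values printed above together with shortened prompt labels. Both the labels and the `rank_prompts` helper are illustrative and not part of the notebook.

```python
def rank_prompts(prompts, scores, top_k=3):
    """Return the top_k (prompt, score) pairs, highest score first."""
    paired = sorted(zip(prompts, scores), key=lambda pair: pair[1], reverse=True)
    return paired[:top_k]

# Shortened stand-ins for the six prompts, in the same order.
labels = ["galaxy", "bus", "STS-86 crew", "ISS from Soyuz", "tiger", "engineer"]
scores = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # values from the tensor printed above

for label, score in rank_prompts(labels, scores):
    print(f"{score:.2f}  {label}")
```

With the real tensor, the scores could be obtained as a plain list via `similarity_scores[0].tolist()`.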


### OpenAI CLIP model exercise: full code

In [13]:
# Import the required libraries
import torch
import clip
import PIL.Image

# Load the CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device=device)

# Load the image
image = PIL.Image.open("image01.jpg")

# Preprocess the image
image_input = preprocess(image).unsqueeze(0).to(device)

# Encode the image with the CLIP model
with torch.no_grad():
    image_features = model.encode_image(image_input)

# Define the list of text prompts
prompts = [
    "A large galaxy in the center of a cluster of galaxies located in the constellation Bostes",
    "MTA Long Island Bus has just left the Hempstead Bus Terminal on the N6",
    "STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block",
    "A view of the International Space Station (ISS) from the Soyuz TMA-19 spacecraft, as it approaches the station for docking",
    "A domesticated tiger in a cage with a tiger trainer in the background",
    "A mechanical engineer working on a car engine",
]

# Encode the text prompts with the CLIP model
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))

# Compute the similarity between the image and each text prompt
similarity_scores = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Print the prompt with the highest similarity score
most_similar_prompt_index = similarity_scores.argmax().item()
most_similar_prompt = prompts[most_similar_prompt_index]
print("The image is most similar to the prompt: {}".format(most_similar_prompt))


The image is most similar to the prompt: STS-86 mission specialists Vladimir Titov and Jean-Loup Chretien pose for photos in the Base Block


### Reverse Stable Diffusion: converting an image to text

In [14]:
!pip install clip-interrogator

Collecting clip-interrogator
  Downloading clip_interrogator-0.6.0-py3-none-any.whl (787 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 787.8/787.8 kB 4.0 MB/s eta 0:00:00
Collecting open-clip-torch (from clip-interrogator)
  Downloading open_clip_torch-2.23.0-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 24.3 MB/s eta 0:00:00
Collecting accelerate (from clip-interrogator)
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265.7/265.7 kB 33.8 MB/s eta 0:00:00
Collecting sentencepiece (from open-clip-torch->clip-interrogator)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 55.8 MB/s eta 0:00:00
Collecting timm (from open-c

#### Converting the image to text

In [16]:
import PIL
import clip_interrogator
from PIL import Image
from clip_interrogator import Config, Interrogator

print(clip_interrogator.__version__)
print(PIL.__version__)

0.6.0
9.4.0


In [17]:
from PIL import Image
from clip_interrogator import Config, Interrogator

# Specify the image path
image_path = 'image01.jpg'
image = Image.open(image_path).convert('RGB')

ci = Interrogator(Config(clip_model_name = "ViT-L-14/openai"))

print(ci.interrogate(image))

Loading caption model blip-large...


config.json:   0%|          | 0.00/4.60k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.88G [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/445 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/527 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Loading CLIP model ViT-L-14/openai...


100%|███████████████████████████████████████| 933M/933M [00:09<00:00, 97.1MiB/s]
ViT-L-14_openai_artists.safetensors: 100%|██████████| 16.2M/16.2M [00:00<00:00, 104MB/s] 
ViT-L-14_openai_flavors.safetensors: 100%|██████████| 155M/155M [00:00<00:00, 281MB/s]
ViT-L-14_openai_mediums.safetensors: 100%|██████████| 146k/146k [00:00<00:00, 4.54MB/s]
ViT-L-14_openai_movements.safetensors: 100%|██████████| 307k/307k [00:00<00:00, 7.78MB/s]
ViT-L-14_openai_trendings.safetensors: 100%|██████████| 111k/111k [00:00<00:00, 3.57MB/s]
ViT-L-14_openai_negative.safetensors: 100%|██████████| 63.2k/63.2k [00:00<00:00, 3.63MB/s]


Loaded CLIP model and data in 22.73 seconds.


100%|██████████| 55/55 [00:00<00:00, 65.15it/s]
Flavor chain:  28%|██▊       | 9/32 [00:25<01:03,  2.78s/it]
100%|██████████| 55/55 [00:00<00:00, 112.49it/s]
100%|██████████| 6/6 [00:00<00:00, 80.75it/s]
100%|██████████| 50/50 [00:00<00:00, 112.17it/s]

three men in overalls standing in a space station with a monitor, macron with afro hair, without beard and mustache, inspired by Mikhail Evstafiev, in the 1986 vert contest, moustache, man in adidas tracksuit, file photo, it is the captain of a crew, 2 0 0 2 photo
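
The interrogator output is a single comma-separated string. Its first segment reads like a caption, and the remaining segments look like style or content modifiers. Assuming that layout holds, the string can be split for inspection. `split_prompt` is a hypothetical helper, and the sample string is a shortened copy of the output above.

```python
def split_prompt(prompt):
    """Split a comma-separated prompt into (caption, modifiers)."""
    parts = [p.strip() for p in prompt.split(",") if p.strip()]
    return parts[0], parts[1:]

# Shortened copy of the interrogator output shown above.
sample = (
    "three men in overalls standing in a space station with a monitor, "
    "macron with afro hair, without beard and mustache, "
    "inspired by Mikhail Evstafiev, file photo"
)
caption, modifiers = split_prompt(sample)
print(caption)
print(modifiers)
```

Separating the caption from the modifiers makes it easier to judge which parts describe the image and which are loosely matched style terms.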



