### Why image preprocessing is needed
- Extract the important objects from an image
- Improve image search performance

Reference: https://arxiv.org/pdf/2309.06581.pdf

---

In [1]:
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
import json
import matplotlib.pyplot as plt
import torch
import numpy as np
import os

Read the COCO dataset

In [2]:
annotation_file = '../data/coco/annotations/instances_val2017.json'
with open(annotation_file, 'r') as file:
    data = json.load(file)

In [3]:
data.keys()

dict_keys(['info', 'licenses', 'images', 'annotations', 'categories'])

In [4]:
data['categories']

[{'supercategory': 'person', 'id': 1, 'name': 'person'},
 {'supercategory': 'vehicle', 'id': 2, 'name': 'bicycle'},
 {'supercategory': 'vehicle', 'id': 3, 'name': 'car'},
 {'supercategory': 'vehicle', 'id': 4, 'name': 'motorcycle'},
 {'supercategory': 'vehicle', 'id': 5, 'name': 'airplane'},
 {'supercategory': 'vehicle', 'id': 6, 'name': 'bus'},
 {'supercategory': 'vehicle', 'id': 7, 'name': 'train'},
 {'supercategory': 'vehicle', 'id': 8, 'name': 'truck'},
 {'supercategory': 'vehicle', 'id': 9, 'name': 'boat'},
 {'supercategory': 'outdoor', 'id': 10, 'name': 'traffic light'},
 {'supercategory': 'outdoor', 'id': 11, 'name': 'fire hydrant'},
 {'supercategory': 'outdoor', 'id': 13, 'name': 'stop sign'},
 {'supercategory': 'outdoor', 'id': 14, 'name': 'parking meter'},
 {'supercategory': 'outdoor', 'id': 15, 'name': 'bench'},
 {'supercategory': 'animal', 'id': 16, 'name': 'bird'},
 {'supercategory': 'animal', 'id': 17, 'name': 'cat'},
 {'supercategory': 'animal', 'id': 18, 'name': 'dog'},

In [6]:
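# COCO category ids: 17 = cat, 18 = dog (see data['categories'] above)
# A set makes the membership test below O(1) instead of scanning a list
image_ids = {i['image_id'] for i in data['annotations'] if i['category_id'] in (17, 18)}
# Keep only images with license id 4, 5 or 6 (see data['licenses'] for their names)
image_paths = [os.path.join("../data", "coco", "val2017", i['file_name']) for i in data['images']
               if i['id'] in image_ids and i['license'] in (4, 5, 6)]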
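
The filtering step can be checked on a tiny hand-made dictionary before running it on the full annotation file. The sketch below is only an illustration: the `toy` data and the helper name `select_image_paths` are hypothetical and do not come from the notebook or from COCO itself.

```python
import os

# Hypothetical miniature of the COCO annotation layout (not real data)
toy = {
    "images": [
        {"id": 1, "file_name": "a.jpg", "license": 4},
        {"id": 2, "file_name": "b.jpg", "license": 1},
        {"id": 3, "file_name": "c.jpg", "license": 5},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 18},  # dog
        {"image_id": 1, "category_id": 1},   # person
        {"image_id": 2, "category_id": 17},  # cat
        {"image_id": 3, "category_id": 3},   # car
    ],
}


def select_image_paths(data, category_ids, license_ids, root):
    """Paths of images that have an annotation in category_ids and a license in license_ids."""
    wanted = {a["image_id"] for a in data["annotations"]
              if a["category_id"] in category_ids}
    return [os.path.join(root, img["file_name"]) for img in data["images"]
            if img["id"] in wanted and img["license"] in license_ids]


# Image 1 qualifies; image 2 has the wrong license; image 3 has no cat or dog
paths = select_image_paths(toy, {17, 18}, {4, 5, 6}, "val2017")
print(paths)
```

Building the id collection as a set keeps the second pass linear in the number of images, which matters once the real annotation file is used.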

### Looking for a dog

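- Real projects often need to find a specific part within an image
- Just as a text document is chunked and analyzed at the sentence or paragraph level,
- an image also has to be broken down into its 'components' before search can become more accurate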

In [8]:
dog = Image.open('../data/dog.jpg')
# Source: https://unsplash.com/photos/white-and-brown-long-coat-large-dog-U3aF7hgUSrk

- Use the CLIP model to find similar photos in the COCO image set

In [10]:
from utils import fetch_clip, extract_img_features, search_image, draw_images

In [11]:
clip, processor = fetch_clip()

In [12]:
coco_features = [extract_img_features(Image.open(i), processor, clip) for i in image_paths]

In [13]:
len(image_paths)

97

In [14]:
input_image = Image.open('../data/dog.jpg')
query_features = extract_img_features(input_image, processor, clip)

In [15]:
input_image.size

(4272, 2848)

In [16]:
most_similar_idx, distance = search_image(query_features, coco_features)
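
`search_image` comes from the local `utils` module, whose source is not shown here. As a rough idea of what such a helper might do, the sketch below ranks candidates by cosine similarity with NumPy. The function name `cosine_top_k`, the default `k=10`, and the assumption that each feature is a flat vector are all illustrative guesses, not the notebook's actual implementation.

```python
import numpy as np


def cosine_top_k(query, candidates, k=10):
    """Return (indices, similarities) of the k candidates most similar to query."""
    q = np.asarray(query, dtype=np.float32).ravel()
    c = np.stack([np.asarray(v, dtype=np.float32).ravel() for v in candidates])

    # L2-normalize so that a dot product equals the cosine similarity
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)

    sims = c @ q
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order, sims[order]


# Toy 2-D "features" standing in for CLIP embeddings
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
idx, sims = cosine_top_k(np.array([1.0, 0.1]), candidates, k=2)
print(idx, sims)
```

If the real helper works this way, the value the notebook calls `distance` is a similarity score, so larger means closer.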

In [17]:
similar_images = [Image.open(image_paths[i]) for i in most_similar_idx]

In [50]:
# input_image

In [51]:
# draw_images(similar_images[:5], distance[:5])

In [52]:
# draw_images(similar_images[5:], distance[5:])

- The dog's breed cannot be distinguished

## [Preprocessing] Object detection with YOLO

### Prepare model

In [22]:
import yolov5

# load pretrained model
model = yolov5.load('yolov5s.pt')

# set model parameters
model.conf = 0.25  # NMS confidence threshold
model.iou = 0.45  # NMS IoU threshold
model.agnostic = False  # NMS class-agnostic
model.multi_label = False  # NMS multiple labels per box
model.max_det = 1000  # maximum number of detections per image

YOLOv5  2024-5-17 Python-3.11.9 torch-2.3.0 CUDA:0 (NVIDIA GeForce GTX 1050 Ti, 4096MiB)

Downloading https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt to yolov5s.pt...

Fusing layers... 
YOLOv5s summary: 270 layers, 7235389 parameters, 0 gradients, 16.6 GFLOPs
Adding AutoShape... 
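
The `model.iou` setting above is the overlap threshold used during non-maximum suppression. To make the quantity concrete, here is a minimal sketch of intersection over union for two boxes in `(x1, y1, x2, y2)` format. The helper name `iou` is illustrative; this is not how YOLOv5 computes it internally.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes that share half of their area
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(overlap)
```

With a threshold of 0.45, two same-class boxes overlapping this much would both survive suppression, since their IoU is below the threshold.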


In [23]:
def tensor2np(tensor):
    # .cpu() is a no-op for tensors already on the CPU, so no branch is needed
    return tensor.detach().cpu().numpy()

def detect_objects(img_path):
    """Run YOLOv5 on one image and return the raw results plus a dict of arrays."""
    img = Image.open(img_path)
    # Inference at 1280 px with test-time augmentation
    results = model(img, size=1280, augment=True)

    pred_dict = dict()
    predictions = results.pred[0]  # detections for the first (only) image

    pred_dict['boxes'] = tensor2np(predictions[:, :4]) # x1, y1, x2, y2
    pred_dict['scores'] = tensor2np(predictions[:, 4])
    pred_dict['categories'] = tensor2np(predictions[:, 5])

    return results, pred_dict

In [24]:
dog = Image.open('../data/dog.jpg')
dog_result, dog_pred = detect_objects('../data/dog.jpg')

In [49]:
# dog_result.show()

In [26]:
predictions = list()
results = list()
print("Extracting objects from {} images".format(len(image_paths)))
for image_path in image_paths:
  result, pred = detect_objects(image_path)
  predictions.append(pred)
  results.append(result)

Extracting objects from 97 images


In [53]:
# results[79].show()

## Preprocessing
- Reference: [Zero-Shot Visual Classification with Guided Cropping](https://arxiv.org/pdf/2309.06581.pdf)

- Crop the image
- Resize
- Normalize pixel values

In [28]:
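def crop_bbox(pil_image, bbox):
    # bbox is (x_min, y_min, x_max, y_max) in pixel coordinates
    x_min, y_min, x_max, y_max = bbox
    return pil_image.crop((x_min, y_min, x_max, y_max))

def normalize_image(pil_image, target_size=(224, 224)):
    # Resize the crop to the target resolution (aspect ratio is not preserved)
    resized_image = pil_image.resize(target_size, Image.LANCZOS)

    # Scale pixel values to [0, 1], then convert back to uint8 for PIL.
    # A PIL image stores uint8 values, so this round trip leaves the pixels
    # unchanged; rounding instead of truncating guards against float32
    # arithmetic landing just below an integer.
    np_image = np.array(resized_image).astype('float32')
    np_image /= 255.0  # pixel values to [0, 1]
    normalized_image = Image.fromarray(np.rint(np_image * 255).astype('uint8'))
    return normalized_image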

In [29]:
cropped_images = list()

for img_path, preds in zip(image_paths, predictions):
  img = Image.open(img_path)
  cropped = [crop_bbox(img, bbox) for bbox in preds['boxes']]
  normalized = [normalize_image(c) for c in cropped]
  cropped_images.extend(normalized)
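
The loop above crops exactly at the detected box. One variation that is sometimes useful is to keep a little context around the object and make sure the box stays inside the image. The sketch below shows that idea; the helper name `expand_and_clamp` and the `margin` parameter are assumptions made for illustration and are not part of the notebook.

```python
def expand_and_clamp(bbox, image_size, margin=0.1):
    """Grow a (x1, y1, x2, y2) box by `margin` of its size per side, clipped to the image."""
    x1, y1, x2, y2 = bbox
    width, height = image_size  # same order as PIL's Image.size

    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin

    return (
        max(0, int(round(x1 - dx))),
        max(0, int(round(y1 - dy))),
        min(width, int(round(x2 + dx))),
        min(height, int(round(y2 + dy))),
    )


# A 100x200 box near the edge of a 115x300 image
box = expand_and_clamp((10.0, 20.0, 110.0, 220.0), image_size=(115, 300), margin=0.1)
print(box)
```

The returned tuple has integer coordinates, so it could be passed straight to `crop_bbox` in place of the raw detection.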

In [30]:
len(cropped_images)

478

In [46]:
# cropped_images[0]

In [47]:
# cropped_images[400]

In [48]:
# cropped_images[410]

### Search

In [34]:
dog = Image.open('../data/dog.jpg')
dog_result, dog_pred = detect_objects('../data/dog.jpg')

In [35]:
dog_cropped = normalize_image(crop_bbox(dog, dog_pred['boxes'][0]))
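
Taking `dog_pred['boxes'][0]` picks the first detection, which is not guaranteed to be a dog if another object scores higher. A hedged alternative is to select the best-scoring box of a specific class. In the sketch below, the helper `best_box_for_class` is hypothetical, and the class index 16 for 'dog' is an assumption about the model's 80-class COCO label order that should be confirmed against `model.names`.

```python
import numpy as np

DOG_CLASS = 16  # assumed index of 'dog' in the detector's label list; check model.names


def best_box_for_class(pred, class_id):
    """Return the highest-scoring box of class_id, or None if that class is absent."""
    mask = pred['categories'] == class_id
    if not mask.any():
        return None
    # Hide other classes by giving them a score of -inf before taking the argmax
    scores = np.where(mask, pred['scores'], -np.inf)
    return pred['boxes'][int(np.argmax(scores))]


# Toy prediction dict with the same keys that detect_objects returns
toy_pred = {
    'boxes': np.array([[0, 0, 50, 50], [10, 10, 90, 90], [20, 20, 60, 60]], dtype=np.float32),
    'scores': np.array([0.95, 0.80, 0.60], dtype=np.float32),
    'categories': np.array([0, 16, 16], dtype=np.float32),
}
box = best_box_for_class(toy_pred, DOG_CLASS)
print(box)
```

Here the top-scoring detection is a person (class 0), so the helper skips it and returns the stronger of the two dog boxes.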

In [54]:
# dog_cropped

In [37]:
# extract item features
item_features = [extract_img_features(i, processor, clip) for i in cropped_images]
# extract input item feature
dog_features = extract_img_features(dog_cropped, processor, clip)

In [38]:
most_similar_idx, distance = search_image(dog_features, item_features)

In [39]:
similar_items = [cropped_images[i] for i in most_similar_idx]

In [40]:
# cosine similarity increased
distance

array([    0.89113,     0.88722,     0.82846,      0.8035,     0.80085,      0.7968,     0.78888,     0.78307,     0.77808,      0.7729], dtype=float32)

In [55]:
# similar_items[0]

In [56]:
# similar_items[1]

In [57]:
# similar_items[2]

In [58]:
# similar_items[3]

In [59]:
# similar_items[4]

## Limitations
- Techniques that can 'extract' information from an image are essential
    - e.g. OCR, semantic segmentation, localized object detection, pose estimation