<a href="https://www.kaggle.com/code/aisuko/zero-shot-image-classification?scriptVersionId=164773113" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Zero-shot image classification is the task of classifying images into categories using a model that was not explicitly trained on labeled examples from those specific categories.

Traditionally, image classification requires training a model on a specific set of labeled images, and the model learns to "map" certain image features to labels. When such a model is needed for a classification task that introduces a new set of labels, fine-tuning is required to "recalibrate" it. In contrast, **zero-shot or open vocabulary image classification models are typically multi-modal models that have been trained on a large dataset of images and associated descriptions**. These models learn aligned vision-language representations that can be used for many downstream tasks, including zero-shot image classification. This approach is more flexible: models can generalize to new and unseen categories without additional training data, and users can query images with free-form text descriptions of their target objects.
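
To make "aligned vision-language representations" concrete, here is a minimal sketch of the scoring step: the image and each label text are embedded into the same vector space, compared by cosine similarity, and the similarities are turned into probabilities with a softmax. The embeddings below are toy 3-dimensional vectors invented for illustration, the helper names are hypothetical, and `scale=100.0` is an assumed temperature standing in for the scale a CLIP-style model learns during training.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(image_embedding, text_embeddings, scale=100.0):
    """Score each label by scaled cosine similarity to the image, then normalize."""
    logits = [scale * cosine_similarity(image_embedding, t) for t in text_embeddings]
    return softmax(logits)

# Toy embeddings, invented for illustration only (a real model uses hundreds of dimensions).
labels = ["fox", "bear", "owl"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
image_embedding = [0.1, 0.3, 0.8]  # points mostly in the "owl" direction

scores = zero_shot_scores(image_embedding, text_embeddings)
best_label = labels[scores.index(max(scores))]
print(best_label)
```

Because the labels are only ever seen as text embeddings, swapping in a new label set means embedding new strings, not retraining the model.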

In [None]:
%%capture
!pip install transformers==4.35.2

# Loading the Pipeline

In [None]:
import torch
from transformers import pipeline

model_checkpoint="openai/clip-vit-large-patch14"
# Run on the first GPU when one is available, otherwise fall back to the CPU (-1).
device=0 if torch.cuda.is_available() else -1
detector=pipeline(model=model_checkpoint, task="zero-shot-image-classification", device=device)

# Loading the Image

In [None]:
from PIL import Image
import requests

url = "https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640"
image=Image.open(requests.get(url, stream=True).raw)
image

In [None]:
predictions=detector(image, candidate_labels=["fox","bear","seagull","owl"])
predictions
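
The pipeline does not feed the bare labels to the text encoder. As far as I know, it wraps each label in a prompt controlled by the `hypothesis_template` argument, whose default in `transformers` is `"This is a photo of {}."`. The sketch below mimics that expansion in plain Python; `build_prompts` is a hypothetical helper for illustration, not part of the library.

```python
def build_prompts(candidate_labels, hypothesis_template="This is a photo of {}."):
    """Expand each candidate label into a full sentence for the text encoder."""
    return [hypothesis_template.format(label) for label in candidate_labels]

# Same labels as in the pipeline call above.
prompts = build_prompts(["fox", "bear", "seagull", "owl"])
for prompt in prompts:
    print(prompt)
```

If the default wording does not suit your images, a different template can be passed directly to the pipeline call, for example `detector(image, candidate_labels=[...], hypothesis_template="A photo of a {}, a type of animal.")`.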

# Zero-shot Image Classification

The pipeline hides several steps. Let's run zero-shot image classification by hand, using the model and its processor directly.

In [None]:
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

model=AutoModelForZeroShotImageClassification.from_pretrained(model_checkpoint)
processor=AutoProcessor.from_pretrained(model_checkpoint)

In [None]:
from PIL import Image
import requests

url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640"
image=Image.open(requests.get(url, stream=True).raw)
image

Here, we use the processor to prepare the inputs for the model. The processor combines an image processor, which prepares the image for the model by resizing and normalizing it, and a tokenizer, which takes care of the text inputs.
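
As a rough illustration of the "normalizing" step, the sketch below rescales one 8-bit RGB pixel to the range [0, 1] and then standardizes each channel. The mean and standard deviation are the values commonly published for CLIP checkpoints and are assumed here; the actual values for a loaded processor can be read from `processor.image_processor.image_mean` and `processor.image_processor.image_std`. `normalize_pixel` is a hypothetical helper, not a library function.

```python
# Per-channel statistics commonly published for CLIP checkpoints (assumed here).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Rescale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))

white = normalize_pixel((255, 255, 255))  # brightest possible pixel
black = normalize_pixel((0, 0, 0))        # darkest possible pixel
print(white)
print(black)
```

The real image processor applies the same arithmetic to every pixel after resizing the image, which is why the `pixel_values` tensor contains both negative and positive floats.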

In [None]:
candidate_labels=["tree","car","bike","cat"]
inputs=processor(images=image, text=candidate_labels, return_tensors="pt", padding=True)
print(inputs)

Pass the inputs through the model, and post-process the results:

In [None]:
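import torch

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs=model(**inputs)

# logits_per_image has one row per image; take the row for our single image.
logits=outputs.logits_per_image[0]
# Softmax turns the logits into probabilities; tolist() yields plain Python floats.
scores=logits.softmax(dim=-1).tolist()

# Pair each label with its score and sort from most to least likely.
result=[
    {"score": score, "label": candidate_label}
    for score, candidate_label in sorted(zip(scores,candidate_labels), key=lambda x:-x[0])
]

print(result)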