# Models
Transformers currently includes a large number of vision models for various tasks.

## Load a model

A model without pre-trained weights (i.e. with randomly initialized weights) can be instantiated in two steps: 1) instantiate a configuration, which defines the model architecture, and 2) create a model based on that configuration.

In [None]:
from transformers import ViTConfig, ViTForImageClassification

config = ViTConfig(num_hidden_layers=12, hidden_size=768)
model = ViTForImageClassification(config)

The configuration just stores the hyperparameters related to the architecture of the model.

In [None]:
print(config)
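To make the link between hyperparameters and architecture concrete, here is a small sketch of how two configuration values determine the input sequence length of a ViT. It assumes a square image of size 224 and a patch size of 16 (the values suggested by the checkpoint name `google/vit-base-patch16-224`), plus one extra classification token. The helper name is made up for illustration and is not part of the library.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens a ViT encoder sees for a square image (illustrative helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + 1  # +1 for the classification ([CLS]) token


print(vit_sequence_length(image_size=224, patch_size=16))  # 14 * 14 + 1 = 197
```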

Alternatively (and this is what most people do), you can load a model with pre-trained weights, so that it can easily be fine-tuned on a custom dataset.

Pre-trained checkpoints are hosted on the Hub: https://huggingface.co/models

In [None]:
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

In [None]:
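# pin an exact commit of the checkpoint via `revision`, so later updates to the repo don't change results
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224",
                                                  revision="db75733ce9ead4ed3dce26ab87a6ed2f6f565985")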

## Load a feature extractor

A feature extractor can be used to prepare images for the model.

It's a minimal object to prepare images for inference.

It typically does some very simple image transformations (like resizing to fixed size + normalizing).
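The transformations mentioned above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the library's implementation: it uses nearest-neighbour resizing for brevity, and the target size (224) and the mean/std of 0.5 are assumed values rather than ones read from a checkpoint. The function name `preprocess` is hypothetical.

```python
import numpy as np


def preprocess(image, size=224, mean=0.5, std=0.5):
    """Turn a uint8 image of shape (H, W, 3) into a float32 batch of shape (1, 3, size, size)."""
    h, w, _ = image.shape
    # nearest-neighbour resize: pick evenly spaced source rows and columns
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows][:, cols]                # (size, size, 3)
    scaled = resized.astype(np.float32) / 255.0   # rescale to [0, 1]
    normalized = (scaled - mean) / std            # normalize to roughly [-1, 1]
    chw = normalized.transpose(2, 0, 1)           # channels first: (3, size, size)
    return chw[np.newaxis, ...]                   # add the batch dimension


# random stand-in for a real 640x480 RGB image
dummy = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
pixel_values = preprocess(dummy)
print(pixel_values.shape)  # (1, 3, 224, 224)
```

The channels-first layout with a leading batch dimension is the shape the model expects for its `pixel_values` input.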

In [None]:
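from transformers import ViTImageProcessor

# option 1: default preprocessing settings
feature_extractor = ViTImageProcessor()
# option 2: the settings stored with the checkpoint (this overrides option 1)
feature_extractor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")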

## Predict on image

In [None]:
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image.save("cats.png")
image

In [None]:
# prepare for the model
inputs = feature_extractor(image, return_tensors="pt")
pixel_values = inputs.pixel_values
print(pixel_values.shape)

In [None]:
# forward pass
outputs = model(pixel_values)
logits = outputs.logits

In [None]:
# take argmax on logits' last dimension
predicted_class_idx = logits.argmax(-1).item()
# turn into actual class name
print(model.config.id2label[predicted_class_idx])
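The argmax only yields the single best class. If you also want scores, the logits can be turned into probabilities with a softmax and the top-k classes read off. The following is a plain-Python sketch of that idea; the logits and the `example_id2label` mapping are made-up values, not real model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# hypothetical logits for a 4-class model
example_logits = [1.2, 4.5, 0.3, 3.9]
example_id2label = {0: "dog", 1: "tabby cat", 2: "car", 3: "Egyptian cat"}

for label, prob in top_k(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```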

## Auto API

The Auto Classes automatically instantiate the appropriate class for you, based on the checkpoint identifier you provide.  
https://huggingface.co/docs/transformers/main/en/model_doc/auto

In [None]:
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")
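Conceptually, an Auto class is a lookup from the model type recorded in a checkpoint's configuration to a concrete model class. The toy registry below sketches that dispatch pattern. All names in it (`ToyConfig`, `MODEL_REGISTRY`, `auto_model_from_config`, and the two model classes) are hypothetical and much simpler than the real library code.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    model_type: str  # e.g. "vit" or "resnet", as stored with a checkpoint


class ToyViTClassifier:
    def __init__(self, config):
        self.config = config


class ToyResNetClassifier:
    def __init__(self, config):
        self.config = config


# hypothetical registry: model type -> class to instantiate
MODEL_REGISTRY = {
    "vit": ToyViTClassifier,
    "resnet": ToyResNetClassifier,
}


def auto_model_from_config(config):
    """Pick the model class that matches config.model_type and instantiate it."""
    try:
        model_cls = MODEL_REGISTRY[config.model_type]
    except KeyError:
        raise ValueError(f"Unknown model type: {config.model_type!r}") from None
    return model_cls(config)


toy_model = auto_model_from_config(ToyConfig(model_type="resnet"))
print(type(toy_model).__name__)  # ToyResNetClassifier
```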

## Image classification pipeline

Image classification is probably the simplest vision task: given an image, predict which class(es) it belongs to.

In [None]:
from transformers import pipeline

image_pipe = pipeline("image-classification")

In [None]:
image_pipe(image)

Note that you can also provide a custom model from the Hub:

In [None]:
image_pipe = pipeline("image-classification", 
                      model="microsoft/swin-tiny-patch4-window7-224")

In [None]:
image_pipe(image)

You can also pass an already instantiated model and feature extractor to the pipeline:

In [None]:
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/convnext-tiny-224")
model = AutoModelForImageClassification.from_pretrained("facebook/convnext-tiny-224")

image_pipe = pipeline("image-classification", 
                      model=model,
                      feature_extractor=feature_extractor)

In [None]:
image_pipe(image)

## Object detection pipeline

In [None]:
object_detection_pipe = pipeline("object-detection",
                                model="facebook/detr-resnet-50")

In [None]:
results = object_detection_pipe(image)
results
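Each detection is a dictionary with a `score`, a `label` and a `box` (with `xmin`, `ymin`, `xmax`, `ymax`), which makes post-processing straightforward. Below is a minimal sketch of filtering detections by confidence and computing box areas. The sample detections are invented for illustration, and both helper functions are hypothetical.

```python
def filter_detections(results, min_score=0.9, labels=None):
    """Keep detections scoring at least `min_score`, optionally restricted to `labels`."""
    return [
        r for r in results
        if r["score"] >= min_score and (labels is None or r["label"] in labels)
    ]


def box_area(box):
    """Area in pixels of a box given as xmin/ymin/xmax/ymax."""
    return (box["xmax"] - box["xmin"]) * (box["ymax"] - box["ymin"])


# made-up detections in the same format as the pipeline output
example_results = [
    {"score": 0.998, "label": "cat", "box": {"xmin": 10, "ymin": 50, "xmax": 310, "ymax": 470}},
    {"score": 0.620, "label": "remote", "box": {"xmin": 40, "ymin": 70, "xmax": 175, "ymax": 120}},
    {"score": 0.995, "label": "couch", "box": {"xmin": 0, "ymin": 0, "xmax": 640, "ymax": 475}},
]

for det in filter_detections(example_results, min_score=0.9):
    print(det["label"], round(det["score"], 3), box_area(det["box"]))
```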

Visualize the results:

In [None]:
import matplotlib.pyplot as plt

# colors for visualization
COLORS = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125], [0.494, 0.184, 0.556], [0.466, 0.674, 0.188]]

def plot_results(image, results):
    plt.figure(figsize=(16,10))
    plt.imshow(image)
    ax = plt.gca()
    colors = COLORS * 100
    for result, color in zip(results, colors):
        box = result['box']
        xmin, xmax, ymin, ymax = box['xmin'], box['xmax'], box['ymin'], box['ymax']
        label = result['label']
        prob = result['score']
        ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
                                   fill=False, color=color, linewidth=3))
        text = f'{label}: {prob:0.2f}'
        ax.text(xmin, ymin, text, fontsize=15,
                bbox=dict(facecolor='yellow', alpha=0.5))
    plt.axis('off')
    plt.show()

In [None]:
plot_results(image, results)

## Depth estimation pipeline

In [None]:
depth_estimation_pipe = pipeline("depth-estimation")

In [None]:
results = depth_estimation_pipe(image)
results

In [None]:
results['depth'].show()
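The pipeline already returns a displayable image under the `depth` key. If you work with raw depth values instead (assumed here to be available as a 2-D array), they need to be rescaled before they can be shown as a grayscale image. The helper below is a hypothetical sketch using min-max scaling, which is not necessarily the scaling the pipeline applies internally.

```python
import numpy as np


def depth_to_uint8(depth):
    """Min-max scale a 2-D depth map to the 0..255 range for visualization."""
    depth = np.asarray(depth, dtype=np.float32)
    span = depth.max() - depth.min()
    if span == 0:
        # a constant map carries no contrast; return all zeros
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - depth.min()) / span
    return (scaled * 255).round().astype(np.uint8)


# made-up 3x4 depth map with values between 0.5 and 10.0
fake_depth = np.linspace(0.5, 10.0, num=12, dtype=np.float32).reshape(3, 4)
gray = depth_to_uint8(fake_depth)
print(gray.dtype, gray.min(), gray.max())
```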

## Datasets

Load a dataset from the Hub:

In [None]:
from datasets import load_dataset 

dataset = load_dataset("cifar100")

## Image feature

You can view images directly in a notebook, since the images are decoded as PIL `Image` objects.

In [None]:
dataset['train'][0]['img']

In [None]:
dataset['train'].features

In [None]:
id2label = {id: label for id, label in enumerate(dataset['train'].features['fine_label'].names)}
print(id2label)
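When fine-tuning, the inverse mapping from label names to ids is usually needed as well. A small sketch of building both dictionaries follows. The class names are placeholders rather than the actual dataset labels, and the `example_` variables are hypothetical.

```python
# placeholder class names standing in for dataset['train'].features['fine_label'].names
example_names = ["apple", "bear", "castle"]

example_id2label = dict(enumerate(example_names))
example_label2id = {label: idx for idx, label in example_id2label.items()}

print(example_id2label)  # {0: 'apple', 1: 'bear', 2: 'castle'}
print(example_label2id)  # {'apple': 0, 'bear': 1, 'castle': 2}
```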

## MultiModal

Multi-modal models combine several modalities (e.g. language, vision, audio, ...):
* Combining modalities makes models more capable, much as humans perceive several modalities at the same time.
* In Transformers, a so-called `Processor` can be used to prepare the inputs for a multi-modal model. Internally, a processor combines a tokenizer (for the text modality) and a feature extractor (for the image/audio modality).
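The composition described in the last bullet can be sketched with toy classes. Everything here (`ToyTokenizer`, `ToyImageProcessor`, `ToyProcessor`) is hypothetical and greatly simplified. The point is only that a processor forwards each modality to the right component and merges the results into one encoding.

```python
class ToyTokenizer:
    """Whitespace tokenizer that assigns ids in order of first appearance."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        ids = [self.vocab.setdefault(word, len(self.vocab)) for word in text.lower().split()]
        return {"input_ids": ids}


class ToyImageProcessor:
    """Rescales pixel values from 0..255 to 0..1."""

    def __call__(self, pixels):
        return {"pixel_values": [[value / 255 for value in row] for row in pixels]}


class ToyProcessor:
    """Combines a tokenizer (text) and an image processor (image) into one callable."""

    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, image, text):
        encoding = {}
        encoding.update(self.tokenizer(text))
        encoding.update(self.image_processor(image))
        return encoding


toy_processor = ToyProcessor(ToyImageProcessor(), ToyTokenizer())
toy_encoding = toy_processor([[0, 255], [51, 102]], "how many cats are there?")
print(sorted(toy_encoding.keys()))  # ['input_ids', 'pixel_values']
```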

## Visual question answering (VQA)

In [None]:
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

In [None]:
question = "how many cats are there?"

encoding = processor(image, question, return_tensors="pt")
print(encoding.keys()) 

In [None]:
# forward pass
outputs = model(**encoding)
logits = outputs.logits

In [None]:
predicted_class_idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[predicted_class_idx])

## Vision Encoder-Decoder Model

The `VisionEncoderDecoderModel` class lets you combine any Transformer-based vision encoder (e.g. ViT, Swin, BEiT) with any language decoder (e.g. BERT, GPT-2).

In [None]:
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

repo_name = "ydshieh/vit-gpt2-coco-en"

# ViTImageProcessor replaces the deprecated ViTFeatureExtractor
feature_extractor = ViTImageProcessor.from_pretrained(repo_name)
tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = VisionEncoderDecoderModel.from_pretrained(repo_name)

In [None]:
pixel_values = feature_extractor(image, return_tensors="pt").pixel_values

# autoregressively generate text (using beam search or other decoding strategy)
generated_ids = model.generate(pixel_values, max_length=16, return_dict_in_generate=True)

In [None]:
# decode into text; index 0 of the generate output holds the generated token ids (its `sequences` field)
preds = tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)
preds = [pred.strip() for pred in preds]
print(preds)
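To illustrate what "autoregressive" means here, the sketch below implements greedy decoding, the simplest decoding strategy, in plain Python. It is not what `generate` does internally. The scoring function stands in for a real decoder, and the five-token vocabulary and all names are invented for illustration.

```python
def greedy_generate(next_token_scores, bos_id, eos_id, max_length=16):
    """Repeatedly append the highest-scoring next token until EOS or max_length."""
    tokens = [bos_id]
    while len(tokens) < max_length:
        scores = next_token_scores(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


# toy vocabulary and a hypothetical "model" that always prefers one fixed successor
VOCAB = ["<bos>", "<eos>", "two", "cats", "sleeping"]
SUCCESSOR = {0: 2, 2: 3, 3: 4, 4: 1}


def toy_scores(tokens):
    scores = [0.0] * len(VOCAB)
    scores[SUCCESSOR[tokens[-1]]] = 1.0
    return scores


ids = greedy_generate(toy_scores, bos_id=0, eos_id=1)
# drop special tokens when decoding, as skip_special_tokens=True does
caption = " ".join(VOCAB[i] for i in ids if i not in (0, 1))
print(ids, caption)  # [0, 2, 3, 4, 1] two cats sleeping
```

Beam search differs in that it keeps several candidate sequences at each step instead of only the single best one.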