# Image Captioning using Transformers
In this Jupyter notebook, I first build an image captioning model from pre-trained checkpoints, then fine-tune it on the **Instagram Images with Captions** dataset from Kaggle.

Each selected pre-trained model is fine-tuned twice. The first run updates only the last few layers, with all earlier layers frozen. The second run updates all of the model's parameters.

### Model Selection
1. An encoder-decoder model that has not been fine-tuned for image captioning: image_encoder_model = **"google/vit-base-patch16-224-in21k"** and text_decoder_model = **"gpt2"**.
2. An encoder-decoder model that has already been fine-tuned for image captioning: image_encoder_model = **"nlpconnect/vit-gpt2-image-captioning"** and text_decoder_model = **"nlpconnect/vit-gpt2-image-captioning"**.

References:
- https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/

In [2]:
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
from transformers import Trainer, TrainingArguments


from transformers import default_data_collator
from datasets import Dataset, DatasetDict, load_dataset

from PIL import Image, ImageDraw
import glob
import os

import pandas as pd

from tqdm.auto import tqdm

import torch

## Convert my custom dataset to Hugging Face's dataset format

In [2]:
df = pd.read_csv(
    "fine_tuning_dataset/instagram_data/captions_csv.csv",
    names=["id", "image_path", "caption"],
    nrows=5000,
    header=1,
)

# Drop rows where the 'caption' column is empty
df = df.dropna(subset=['caption'])

dataset = Dataset.from_pandas(df[['caption', 'image_path']])


In [4]:
# Check the valid rows of the dataset
dataset

Dataset({
    features: ['caption', 'image_path', '__index_level_0__'],
    num_rows: 4054
})

In [5]:
def load_image_with_caption(example):
    image_path = f"fine_tuning_dataset/instagram_data/{example['image_path']}.jpg"
    with Image.open(image_path) as image:
        image = image.convert("RGB")
    return {'image': image, 'caption': example['caption']}

# Map the function over the dataset
dataset = dataset.map(load_image_with_caption)

Map:   0%|          | 0/4054 [00:00<?, ? examples/s]

In [33]:
dataset

Dataset({
    features: ['caption', 'image_path', '__index_level_0__', 'image'],
    num_rows: 4054
})

In [34]:
# Save the dataset to disk
dataset.save_to_disk('raw_dataset')

Saving the dataset (0/9 shards):   0%|          | 0/4054 [00:00<?, ? examples/s]

In [36]:
# Check no null values in the caption column
# This step is CRITICAL for the future step to work
null_count = sum(1 for caption in dataset['caption'] if caption is None)
print(f"Number of nulls in 'caption': {null_count}")

Number of nulls in 'caption': 0


### The feature extractor and tokenizer are the same for all of the following fine-tuning runs, so I first convert the dataset into the format the model can read.

In [3]:
image_encoder_model = "google/vit-base-patch16-224-in21k"
text_decode_model = "gpt2"

# image feature extractor
feature_extractor = ViTImageProcessor.from_pretrained(image_encoder_model)
# text tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained(text_decode_model)
# GPT2 only has bos/eos tokens but not decoder_start/pad tokens
tokenizer.pad_token = tokenizer.eos_token

In [7]:
# text preprocessing step
def tokenization_fn(caption):
    """Run tokenization on captions."""
    labels = tokenizer(caption, 
                      max_length=128,
                      padding="max_length",
                      truncation=True).input_ids # Must explicitly enable truncation

    return labels

# image preprocessing step
def feature_extraction_fn(image):
    encoder_inputs = feature_extractor(image, return_tensors="np").pixel_values
    return encoder_inputs

def preprocess_fn(examples):
    """Run tokenization + image feature extraction"""
    image = examples['image']
    caption = examples['caption']
    
    model_inputs = {}

    model_inputs['labels'] = tokenization_fn(caption)
    model_inputs['pixel_values'] = feature_extraction_fn(image)
    return model_inputs
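A note on the labels produced above: because `pad_token` is set to `eos_token` and every caption is padded to 128 tokens, the label sequences consist mostly of pad ids, and as far as I can tell the loss is then computed on those positions too. A common remedy is to replace padded positions with `-100`, the index that PyTorch's `CrossEntropyLoss` ignores by default. The sketch below is a minimal illustration on plain lists; `mask_padding` is a hypothetical helper (not part of `transformers`), and the token ids in the toy batch are arbitrary apart from 50256, GPT-2's eos id.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by CrossEntropyLoss by default


def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Return a copy of input_ids where padded positions (mask == 0) become ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch (hypothetical ids): 50256 is GPT-2's eos id, reused as the pad id in this notebook.
batch_ids = [[40, 1842, 50256, 50256], [464, 3290, 318, 13]]
batch_mask = [[1, 1, 0, 0], [1, 1, 1, 1]]

masked = mask_padding(batch_ids, batch_mask)
print(masked)  # [[40, 1842, -100, -100], [464, 3290, 318, 13]]
```

In `tokenization_fn`, the tokenizer output also carries an `attention_mask`, so the same masking could be applied there before returning the labels. I use the attention mask rather than comparing against the pad id because pad and eos share the same id here.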

In [44]:
# Split the dataset into train and test
train_test_split = dataset.train_test_split(test_size=0.2)

processed_dataset = DatasetDict()

processed_dataset['train'] = train_test_split['train'].map(
    function=preprocess_fn,
    batched=True,
    remove_columns=train_test_split['train'].column_names
)

processed_dataset['test'] = train_test_split['test'].map(
    function=preprocess_fn,
    batched=True,
    remove_columns=train_test_split['test'].column_names
)

Map:   0%|          | 0/3243 [00:00<?, ? examples/s]

Map:   0%|          | 0/811 [00:00<?, ? examples/s]

In [46]:
# Save the dataset to disk
processed_dataset.save_to_disk('processed_dataset')

Saving the dataset (0/4 shards):   0%|          | 0/3243 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/811 [00:00<?, ? examples/s]

In [2]:
# processed_dataset = DatasetDict.load_from_disk('processed_dataset')

## First, try encoder-decoder model which has not been fine-tuned for image captioning
- image_encoder_model = "google/vit-base-patch16-224-in21k"
- text_decode_model = "gpt2"

In [4]:
print(torch.backends.mps.is_built())
# Use the Apple Silicon GPU (MPS) when it is actually available, otherwise fall back to the CPU.
# Note: is_built() only reports whether PyTorch was compiled with MPS support.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

True


In [8]:
image_encoder_model = "google/vit-base-patch16-224-in21k"
text_decode_model = "gpt2"

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    image_encoder_model, text_decode_model)

# Move the model to the GPU
model.to(device)

# update the model config
model.config.eos_token_id = tokenizer.eos_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.11.crossattention.c_proj.weight', 'h.8.crossattention.c_attn.weight', 'h.2.crossattention.c_proj.weight', 'h.0.crossattention.c_proj.weight', 'h.1.crossattention.q_attn.weight', 'h.9.crossattention.c_proj.weight', 'h.10.crossattention.c_attn.weight', 'h.8.crossattention.q_attn.bias', 'h.4.ln_cross_attn.bias', 'h.8.crossattention.c_attn.bias', 'h.6.ln_cross_attn.weight', 'h.11.crossattention.q_attn.weight', 'h.4.ln_cross_attn.weight', 'h.7.crossattention.c_attn.bias', 'h.6.crossattention.c_attn.weight', 'h.1.ln_cross_attn.weight', 'h.1.crossattention.c_proj.bias', 'h.8.crossattention.c_proj.weight', 'h.0.ln_cross_attn.weight', 'h.0.crossattention.c_proj.bias', 'h.10.crossattention.c_proj.bias', 'h.10.ln_cross_attn.bias', 'h.4.crossattention.c_proj.bias', 'h.5.crossattention.c_proj.weight', 'h.9.crossattention.q_attn.weight', 'h.5.crossattention.c_attn.bias', 'h.11.ln_cro

In [9]:
# Check the number of parameters of the model
total_params = sum(p.numel() for p in model.parameters())
print(f'{total_params:,} total parameters.')

239,195,904 total parameters.


## Generate caption on sample images without fine-tuning on IG dataset
- image_encoder_model = "google/vit-base-patch16-224-in21k"
- text_decode_model = "gpt2"

In [8]:
# Define an inference function
def generate_caption(image_folder):
    '''
    Generate one caption per image found in a folder.
    Use the ABSOLUTE path of the image folder.
    '''
    generated_texts = []

    image_paths = glob.glob(os.path.join(image_folder, '*'))

    for image_path in image_paths:
        with Image.open(image_path) as image:
            # The ViT processor expects 3-channel input, so convert grayscale/RGBA images
            image = image.convert("RGB")
            pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
            pixel_values = pixel_values.to(device)  # Move pixel values to the model's device

            # generate() returns token ids on the same device as the model
            generated_ids = model.generate(pixel_values, max_length=128)

            generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
            generated_texts.append(generated_text)

    return generated_texts

In [50]:
image_folder = "/Users/stoneman/Library/CloudStorage/OneDrive-Vanderbilt/Transformers/Transformers/Final-Project-Automatic-IG-Caption-Generator/test_images"
generate_caption(image_folder)


['\nThe first time I saw the new version of the game, I was so excited. I',
 '\nThe first time I saw the new version of the game, I was so excited. I',
 '\nThe U.S. Department of Justice has filed a lawsuit against the company that owns the',
 '\nThe first time I saw the new version of the game, I was so excited. I',
 '\nThe U.S. Department of Justice has filed a lawsuit against the company that owns the',
 '\nThe first time I saw the new version of the game, I was so excited. I',
 '\nThe U.S. Department of Justice has filed a lawsuit against the company that owns the',
 '\nThe first time I saw the new version of the game, I was so excited. I',
 '\nThe first time I saw the new version of the game, I was so excited. I',
 '\nThe first time I saw the new version of the game, I was so excited. I',
 '\nThe first time I saw the new version of the game, I was so excited. I']

## Fine-tune the model
- image_encoder_model = "google/vit-base-patch16-224-in21k"
- text_decode_model = "gpt2"


#### Fine-tune only the last several layers in the selected model (freeze all the prior layers)
- image_encoder_model = "google/vit-base-patch16-224-in21k"
- text_decode_model = "gpt2"

In [21]:
# First, freeze all parameters in both encoder and decoder
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last GPT2 block
for param in model.decoder.transformer.h[11].parameters():
    param.requires_grad = True

# The last layer norm before the LM head
for param in model.decoder.transformer.ln_f.parameters():
    param.requires_grad = True

# The LM head
for param in model.decoder.lm_head.parameters():
    param.requires_grad = True

In [23]:
# check the number of trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'{total_params:,} total parameters.')

48,050,688 total parameters.
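The trainable count above can be checked by hand. The sketch below assumes the GPT-2 small dimensions (hidden size 768, feed-forward size 3072, vocabulary 50257) and the parameter layout of a GPT-2 block with cross-attention as I understand it from the Hugging Face implementation; the helper names are my own.

```python
HIDDEN = 768      # GPT-2 small hidden size (assumed)
MLP = 4 * HIDDEN  # feed-forward inner size (3072)
VOCAB = 50257     # GPT-2 vocabulary size


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense projection."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(n):
    """Parameter count of a layer norm (weight + bias)."""
    return 2 * n


def gpt2_block_with_cross_attention():
    """One decoder block: self-attention, cross-attention, MLP, three layer norms."""
    self_attn = linear(HIDDEN, 3 * HIDDEN) + linear(HIDDEN, HIDDEN)
    cross_attn = linear(HIDDEN, 2 * HIDDEN) + linear(HIDDEN, HIDDEN) + linear(HIDDEN, HIDDEN)
    mlp = linear(HIDDEN, MLP) + linear(MLP, HIDDEN)
    norms = 3 * layer_norm(HIDDEN)
    return self_attn + cross_attn + mlp + norms


last_block = gpt2_block_with_cross_attention()
final_norm = layer_norm(HIDDEN)
lm_head = linear(HIDDEN, VOCAB, bias=False)  # the LM head has no bias

trainable = last_block + final_norm + lm_head
print(f"{last_block:,} + {final_norm:,} + {lm_head:,} = {trainable:,}")
# 9,451,776 + 1,536 + 38,597,376 = 48,050,688
```

Roughly 80% of the "last layers" being trained is the LM head projection. In GPT-2 this matrix is normally tied to the token embedding, so unfreezing it presumably also updates the input embeddings.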


In [26]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir="./vit-gpt2-last-block-paras"
)

# instantiate trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=processed_dataset['train'],
    eval_dataset=processed_dataset['test'],
    data_collator=default_data_collator
)
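Since only the batch sizes, the evaluation strategy and the output directory are set, everything else falls back to the `TrainingArguments` defaults. The sketch below works out what that implies for this dataset, assuming the defaults are 3 epochs, a learning rate of 5e-5 with linear decay and no warmup, logging every 500 steps, and a single device without gradient accumulation. The function names are hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps per epoch (last partial batch counts as a step)."""
    return math.ceil(num_examples / batch_size)


def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


per_epoch = steps_per_epoch(3243, 4)   # training split size / train batch size
train_steps = per_epoch * 3            # assumed default of 3 epochs
eval_steps = steps_per_epoch(811, 4)   # test split size / eval batch size
print(train_steps, eval_steps)         # 2433 203

# Epoch fraction and learning rate at each assumed logging step
for step in (500, 1000, 1500, 2000):
    print(step, round(step / per_epoch, 2), linear_lr(step, train_steps))
```

The step counts match the progress bars in the training output (2433 training steps, 203 evaluation steps), and the learning rates agree with the logged values (about 3.97e-05 at step 500, 8.90e-06 at step 2000), which suggests these defaults are the ones in effect.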

In [27]:
trainer.train()

  0%|          | 0/2433 [00:00<?, ?it/s]

{'loss': 1.0813, 'learning_rate': 3.972461981093301e-05, 'epoch': 0.62}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5653371810913086, 'eval_runtime': 69.6831, 'eval_samples_per_second': 11.638, 'eval_steps_per_second': 2.913, 'epoch': 1.0}
{'loss': 0.5994, 'learning_rate': 2.944923962186601e-05, 'epoch': 1.23}
{'loss': 0.5578, 'learning_rate': 1.9173859432799017e-05, 'epoch': 1.85}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5612289309501648, 'eval_runtime': 71.5821, 'eval_samples_per_second': 11.33, 'eval_steps_per_second': 2.836, 'epoch': 2.0}
{'loss': 0.5631, 'learning_rate': 8.898479243732018e-06, 'epoch': 2.47}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5618223547935486, 'eval_runtime': 70.324, 'eval_samples_per_second': 11.532, 'eval_steps_per_second': 2.887, 'epoch': 3.0}
{'train_runtime': 1647.229, 'train_samples_per_second': 5.906, 'train_steps_per_second': 1.477, 'train_loss': 0.6708380106884023, 'epoch': 3.0}


TrainOutput(global_step=2433, training_loss=0.6708380106884023, metrics={'train_runtime': 1647.229, 'train_samples_per_second': 5.906, 'train_steps_per_second': 1.477, 'train_loss': 0.6708380106884023, 'epoch': 3.0})

In [28]:
trainer.save_model("./vig-gpt2-model-finetuned-on-last-block-paras")

In [29]:
generate_caption(image_folder)


["I'm so excited to be launching my first app! ",
 "I'm so excited to finally be able to share my story with you guys. I'm so",
 "I'm so excited to finally be able to share my story with you guys. I'm so",
 "I'm so excited to finally be able to share my story with you guys. I'm so",
 "I'm so excited to be launching my first app! I'm so excited to be launching my",
 'I love my new lip kit. I love how easy it is to customize my lip kit.',
 "I'm so excited to finally be able to share my new collection with you guys! ",
 "I'm so excited to be launching my first app! ",
 "I'm so excited to be launching my first app! ",
 "I'm so excited to be launching my first app! ",
 "I'm so excited to finally be able to share my story with you guys. I'm so"]

#### Fine-tune all trainable parameters
- image_encoder_model = "google/vit-base-patch16-224-in21k"
- text_decode_model = "gpt2"
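One thing to watch when running the cells in order: the previous section set `requires_grad = False` on most parameters, and those flags stay on the model object. Before fine-tuning all parameters, either rebuild the model from the pre-trained checkpoints or re-enable gradients on every parameter. Below is a minimal sketch of the second option. `set_trainable` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a real model so the snippet runs without downloading weights.

```python
from types import SimpleNamespace


def set_trainable(model, trainable=True):
    """Set requires_grad on every parameter of `model`; return how many were touched."""
    count = 0
    for param in model.parameters():
        param.requires_grad = trainable
        count += 1
    return count


# Stand-in for a partially frozen model (hypothetical). A real torch.nn.Module
# exposes the same .parameters() / .requires_grad interface.
params = [SimpleNamespace(requires_grad=False), SimpleNamespace(requires_grad=True)]
toy_model = SimpleNamespace(parameters=lambda: iter(params))

touched = set_trainable(toy_model)
print(touched, all(p.requires_grad for p in toy_model.parameters()))  # 2 True
```

With the notebook's model this would be `set_trainable(model)` before creating the new `Trainer`; the trainable-parameter count should then equal the total reported earlier (239,195,904).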

In [51]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir="./vit-gpt2-all-paras"
)

# instantiate trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=processed_dataset['train'],
    eval_dataset=processed_dataset['test'],
    data_collator=default_data_collator
)

In [53]:
trainer.train()

  0%|          | 0/2433 [00:00<?, ?it/s]

{'loss': 0.6129, 'learning_rate': 3.972461981093301e-05, 'epoch': 0.62}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5214453339576721, 'eval_runtime': 72.5733, 'eval_samples_per_second': 11.175, 'eval_steps_per_second': 2.797, 'epoch': 1.0}
{'loss': 0.5173, 'learning_rate': 2.944923962186601e-05, 'epoch': 1.23}
{'loss': 0.4514, 'learning_rate': 1.9173859432799017e-05, 'epoch': 1.85}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5179643630981445, 'eval_runtime': 70.8282, 'eval_samples_per_second': 11.45, 'eval_steps_per_second': 2.866, 'epoch': 2.0}
{'loss': 0.423, 'learning_rate': 8.898479243732018e-06, 'epoch': 2.47}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.526144802570343, 'eval_runtime': 73.3089, 'eval_samples_per_second': 11.063, 'eval_steps_per_second': 2.769, 'epoch': 3.0}
{'train_runtime': 2606.2994, 'train_samples_per_second': 3.733, 'train_steps_per_second': 0.934, 'train_loss': 0.4822222376085337, 'epoch': 3.0}


TrainOutput(global_step=2433, training_loss=0.4822222376085337, metrics={'train_runtime': 2606.2994, 'train_samples_per_second': 3.733, 'train_steps_per_second': 0.934, 'train_loss': 0.4822222376085337, 'epoch': 3.0})

In [54]:
trainer.save_model("./vig-gpt2-model-finetuned-on-all-paras")

In [55]:
generate_caption(image_folder)



["I'm so excited to finally reveal my new collection with my new music video coming out tomorrow!",
 "I'm so happy I'm not a vampire ",
 "I'm so happy I'm not a vampire ",
 'I\'m a little nervous about this pic but I\'m not a big fan of the "I',
 "I'm so excited to share my first cover for the new issue of Cosmo! ",
 "I'm so happy I'm not a vampire ",
 "I'm so happy I got to spend my birthday with my bestie 💗 ",
 "I'm so excited to finally share my first collection with you guys! I've been working on",
 "I'm so excited to finally be apart of this family. I love you guys so much.",
 "I'm so happy I'm not a vampire ",
 "I'm so happy I'm not a vampire "]

## Generate caption on sample images without fine-tuning on IG dataset
- image_encoder_model = "nlpconnect/vit-gpt2-image-captioning"
- text_decode_model = "nlpconnect/vit-gpt2-image-captioning"

In [34]:
# load a fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
model.to(device)

tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# GPT2 only has bos/eos tokens but not decoder_start/pad tokens
tokenizer.pad_token = tokenizer.eos_token

# update the model config
model.config.eos_token_id = tokenizer.eos_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

In [35]:
total_params = sum(p.numel() for p in model.parameters())
print(f'{total_params:,} total parameters.')

239,195,904 total parameters.


In [36]:
generate_caption(image_folder)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


['a large brown and white giraffe standing in a field ',
 'a lake with a mountain range and a mountain range ',
 'a stuffed bear is in a bear hug ',
 'a man standing on a ledge near a river ',
 'a painting of a bird with a sky background ',
 'a man and woman wearing glasses and sunglasses ',
 'a dog wearing a green shirt and a green scarf ',
 'a woman with a mask on holding a large knife ',
 'a city street with a large building ',
 'a blurry photo of a car with a reflection of a water fountain ',
 'a large building with a large white building ']

## Fine-tune the model
- image_encoder_model = "nlpconnect/vit-gpt2-image-captioning"
- text_decode_model = "nlpconnect/vit-gpt2-image-captioning"


#### Fine-tune only the last several layers in the selected model (freeze all the prior layers)
- image_encoder_model = "nlpconnect/vit-gpt2-image-captioning"
- text_decode_model = "nlpconnect/vit-gpt2-image-captioning"

In [37]:
# First, freeze all parameters in both encoder and decoder
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last GPT2 block
for param in model.decoder.transformer.h[11].parameters():
    param.requires_grad = True

# The last layer norm before the LM head
for param in model.decoder.transformer.ln_f.parameters():
    param.requires_grad = True

# The LM head
for param in model.decoder.lm_head.parameters():
    param.requires_grad = True

In [38]:
# check the number of trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'{total_params:,} total parameters.')

48,050,688 total parameters.


In [39]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir="./nlpconnect-last-block-paras"
)

# instantiate trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=processed_dataset['train'],
    eval_dataset=processed_dataset['test'],
    data_collator=default_data_collator
)

In [40]:
trainer.train()

  0%|          | 0/2433 [00:00<?, ?it/s]

{'loss': 0.8086, 'learning_rate': 3.972461981093301e-05, 'epoch': 0.62}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5885778069496155, 'eval_runtime': 73.036, 'eval_samples_per_second': 11.104, 'eval_steps_per_second': 2.779, 'epoch': 1.0}
{'loss': 0.6327, 'learning_rate': 2.944923962186601e-05, 'epoch': 1.23}
{'loss': 0.5848, 'learning_rate': 1.9173859432799017e-05, 'epoch': 1.85}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5798221230506897, 'eval_runtime': 68.7028, 'eval_samples_per_second': 11.804, 'eval_steps_per_second': 2.955, 'epoch': 2.0}
{'loss': 0.5912, 'learning_rate': 8.898479243732018e-06, 'epoch': 2.47}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5791286826133728, 'eval_runtime': 69.0349, 'eval_samples_per_second': 11.748, 'eval_steps_per_second': 2.941, 'epoch': 3.0}
{'train_runtime': 1673.2949, 'train_samples_per_second': 5.814, 'train_steps_per_second': 1.454, 'train_loss': 0.6372706942042842, 'epoch': 3.0}


TrainOutput(global_step=2433, training_loss=0.6372706942042842, metrics={'train_runtime': 1673.2949, 'train_samples_per_second': 5.814, 'train_steps_per_second': 1.454, 'train_loss': 0.6372706942042842, 'epoch': 3.0})

In [41]:
trainer.save_model("./nlpconnect-model-finetuned-on-last-block-paras")

In [42]:
generate_caption(image_folder)



["I'm not sure what this is about. I'm just curious. ",
 "I can't wait to see what the next lake looks like. ",
 "I'm not sure what to think of this. I'm just so excited to see this.",
 "I'm not sure what I'm going to do with my life right now. I'm just",
 "I can't wait to see what you guys come up with! ",
 "I'm not sure if I'm going to get this one right or not. I'm just",
 'I love this little dog 💙💙 ',
 "I'm so excited to see my new favorite ",
 "I'm not sure if this is a good place to start my journey. ",
 "I can't see the reflection of the water ",
 "I can't see the picture above but I can see the building "]

#### Fine-tune all trainable parameters
- image_encoder_model = "nlpconnect/vit-gpt2-image-captioning"
- text_decode_model = "nlpconnect/vit-gpt2-image-captioning"

In [61]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir="./nlpconnect-all-paras"
)

# instantiate trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=processed_dataset['train'],
    eval_dataset=processed_dataset['test'],
    data_collator=default_data_collator
)

In [62]:
trainer.train()

  0%|          | 0/2433 [00:00<?, ?it/s]

{'loss': 0.6084, 'learning_rate': 3.972461981093301e-05, 'epoch': 0.62}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5267587304115295, 'eval_runtime': 72.5089, 'eval_samples_per_second': 11.185, 'eval_steps_per_second': 2.8, 'epoch': 1.0}
{'loss': 0.5231, 'learning_rate': 2.944923962186601e-05, 'epoch': 1.23}
{'loss': 0.4551, 'learning_rate': 1.9173859432799017e-05, 'epoch': 1.85}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5228403806686401, 'eval_runtime': 73.5217, 'eval_samples_per_second': 11.031, 'eval_steps_per_second': 2.761, 'epoch': 2.0}
{'loss': 0.4226, 'learning_rate': 8.898479243732018e-06, 'epoch': 2.47}


  0%|          | 0/203 [00:00<?, ?it/s]

{'eval_loss': 0.5313527584075928, 'eval_runtime': 75.8511, 'eval_samples_per_second': 10.692, 'eval_steps_per_second': 2.676, 'epoch': 3.0}
{'train_runtime': 2721.7336, 'train_samples_per_second': 3.575, 'train_steps_per_second': 0.894, 'train_loss': 0.48252051463599466, 'epoch': 3.0}


TrainOutput(global_step=2433, training_loss=0.48252051463599466, metrics={'train_runtime': 2721.7336, 'train_samples_per_second': 3.575, 'train_steps_per_second': 0.894, 'train_loss': 0.48252051463599466, 'epoch': 3.0})

In [63]:
trainer.save_model("./nlpconnect-model-finetuned-on-all-paras")

In [64]:
generate_caption(image_folder)



["I'm so excited to share this with you guys! I'm so excited to share this with",
 "I'm going to be back in a few days but this is the best I've had so",
 '💗 ',
 "I'm not sure what this means... I'm just saying that I'm not sure what this",
 "I'm so excited to finally share my first cover for the new Koko edition! I'm",
 "I'm not sure what that means... ",
 '💜 ',
 '💜 ',
 "I'm not sure what kind of city this is in but it's definitely not me. ",
 "I'm not sure what I'm looking at. ",
 '💗 ']

In [132]:
# Check the parameters in each layer
for name, param in model.named_parameters():
    print(name, param.shape)

encoder.embeddings.cls_token torch.Size([1, 1, 768])
encoder.embeddings.position_embeddings torch.Size([1, 197, 768])
encoder.embeddings.patch_embeddings.projection.weight torch.Size([768, 3, 16, 16])
encoder.embeddings.patch_embeddings.projection.bias torch.Size([768])
encoder.encoder.layer.0.attention.attention.query.weight torch.Size([768, 768])
encoder.encoder.layer.0.attention.attention.query.bias torch.Size([768])
encoder.encoder.layer.0.attention.attention.key.weight torch.Size([768, 768])
encoder.encoder.layer.0.attention.attention.key.bias torch.Size([768])
encoder.encoder.layer.0.attention.attention.value.weight torch.Size([768, 768])
encoder.encoder.layer.0.attention.attention.value.bias torch.Size([768])
encoder.encoder.layer.0.attention.output.dense.weight torch.Size([768, 768])
encoder.encoder.layer.0.attention.output.dense.bias torch.Size([768])
encoder.encoder.layer.0.intermediate.dense.weight torch.Size([3072, 768])
encoder.encoder.layer.0.intermediate.dense.bias torch

In [7]:
output_dir = "fine"
feature_extractor.save_pretrained(output_dir)

['fine/preprocessor_config.json']

In [65]:
# def show_examples(image_paths, generated_texts, size=(350, 350)):
#     w, h = size
#     grid_width = w * 3
#     grid_height = h * 3
#     grid = Image.new('RGB', size=(grid_width, grid_height))
#     draw = ImageDraw.Draw(grid)


#     for idx, (image_path, text) in enumerate(zip(image_paths, generated_texts)):
#         image = Image.open(image_path)
#         box = ((idx % 3) * w, (idx // 3) * h)
#         grid.paste(image.resize(size), box=box)
#         # Draw the label text
#         text_position = (box[0], box[1] + h - 20)  # Adjust text position as needed
#         draw.text(text_position, text, (255, 255, 255))  # Use the emoji-compatible font

#     return grid

# show_examples(image_paths, generated_texts)