Image Analysis:

    Object detection: Identify key objects in the image (e.g., "tree", "bird", "sun").
    Scene recognition: Understand the broader scene (e.g., "sunset over the ocean").
    Sentiment analysis: Derive the emotional tone of the image (e.g., "calm", "serene").

Prompter:

    Use a pre-trained GPT-2 or GPT-3 (or T5) model with prompt engineering to generate a haiku based on the extracted image features (objects, scene, sentiment).

Detector:

    Check whether the haiku generated adheres to the 5-7-5 syllable structure.
    Evaluate whether the emotional tone and meaning match the original image features.

Reinforcement Learning (RL):

    Use PPO (Proximal Policy Optimization) or any other RL algorithm to optimize the prompt and reward the haiku generation process based on structural and thematic correctness.

In [3]:
import os
import json
import requests
from io import BytesIO
from PIL import Image
from torchvision import transforms
from transformers import BlipProcessor, BlipForConditionalGeneration, CLIPProcessor, CLIPModel
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm
from sklearn.model_selection import train_test_split


### Preprocessing Data

In [None]:

# Load the tsv and tsv.meta files into pandas DataFrames
df_tsv = pd.read_csv('cvpr2019.tsv', sep='\t')  # file with scenes
df_meta = pd.read_csv('cvpr2019.tsv.meta', sep='\t')  # file with image_key and url

# Pivot the df_tsv to create a column for each scene
df_tsv_pivoted = df_tsv.groupby('IMAGE_KEY')['CAPTION'].apply(list).apply(lambda x: pd.Series(x)).reset_index()

# Rename columns for clarity, assuming there are exactly 5 scenes
df_tsv_pivoted.columns = ['IMAGE_KEY', 'scene1', 'scene2', 'scene3', 'scene4', 'scene5']

# Merge the two DataFrames on the 'image_key' column
merged_df = pd.merge(df_tsv_pivoted, df_meta, on='IMAGE_KEY', how='left')

# Save the merged DataFrame back to a TSV file
merged_df.to_csv('merged_file.tsv', sep='\t', index=False)

### Fine tuning CLIP for scene recognition

In [None]:


df = pd.read_csv('merged_file.tsv', sep='\t')
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

print("Training set:")
print(train_df.head())
print("\nValidation set:")
print(val_df.head())

Training set:
            IMAGE_KEY                                  scene1  \
29   00210e646b9a01f4          digital art selected for the #   
535  03484078db61e940  a view of the interior of the building   
695  04651a6159d60055          view of the pond in the garden   
557  0371cd0bb621e734         snowboarder jumping in the snow   
836  0579eb8fae88f403                     view of the kitchen   

                                                scene2  \
29                      natural history of the insects   
535  a view of the inside of the new terminal building   
695                  the pond in front of the building   
557        a snowboarder catches a big jump in the air   
836                  person in the kitchen of his home   

                                         scene3  \
29                an illustration from the book   
535                interior of a subway station   
695                   view from across the pond   
557  a snowboarder performs a jump in the 

In [133]:
import requests
from io import BytesIO
from PIL import Image

def download_image(image_url):
    response = requests.get(image_url)
    
    if response.status_code == 200:  # Ensure the request was successful
        try:
            # Try to open the image from the response content
            image = Image.open(BytesIO(response.content))
            image.verify()  # Verify the image integrity (optional)
            return image
        except (IOError, SyntaxError) as e:
            print(f"Error with image: {image_url}, Error: {e}")
            return None
    else:
        print(f"Failed to retrieve image from {image_url}, Status Code: {response.status_code}")
        return None

In [140]:
import requests
from io import BytesIO
from PIL import Image

class SceneDataset(Dataset):
    def __init__(self, dataframe, processor):
        self.dataframe = dataframe
        self.processor = processor

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        image_url = row['OriginalURL']  # Adjust based on your column name
        scene = row['scene1']  # Adjust based on your column name

        image = download_image(image_url)
        print(f"Downloaded image type: {type(image)}")

        # If you need to check its size
        if isinstance(image, Image.Image):  # Check if it's an instance of PIL.Image
            print(f"Image size: {image.size}")
        else:
            print("Invalid image")
        
        if image is None:  # Skip invalid images
            return None

        # Access the image dimensions
        width, height = image.size

        # Example: Perform a check on image size (optional)
        if width < 100 or height < 100:
            print(f"Skipping small image: {image_url}")
            return None

        # Process the image and scene text
        inputs = self.processor(images=image, text=scene, return_tensors="pt", padding=True)
        
        return inputs

In [141]:

# Initialize the processor (CLIP)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Create DataLoader for training and validation datasets
train_dataset = SceneDataset(train_df, processor)
val_dataset = SceneDataset(val_df, processor)

train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16)

In [142]:
from torch.optim import AdamW
from tqdm import tqdm

# Initialize the CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=5e-6)

# Training loop
model.train()
for epoch in range(3):  # Adjust the number of epochs
    loop = tqdm(train_dataloader, desc=f"Epoch {epoch+1}")
    for batch in loop:
        inputs = {key: value.to(model.device) for key, value in batch.items()}
        optimizer.zero_grad()

        # Forward pass
        outputs = model(**inputs)

        # Calculate loss (contrastive loss is used by default in CLIP)
        loss = outputs.loss
        loss.backward()

        optimizer.step()

        loop.set_postfix(loss=loss.item())

    # Validation step
    model.eval()
    total_loss = 0
    for batch in val_dataloader:
        with torch.no_grad():
            inputs = {key: value.to(model.device) for key, value in batch.items()}
            outputs = model(**inputs)
            total_loss += outputs.loss.item()

    print(f"Validation loss after epoch {epoch+1}: {total_loss / len(val_dataloader)}")

# Save the fine-tuned model
model.save_pretrained("fine_tuned_clip_model")
processor.save_pretrained("fine_tuned_clip_processor")

Epoch 1:   0%|          | 0/50 [00:00<?, ?it/s]

Failed to retrieve image from https://c1.staticflickr.com/9/8124/29750795646_11c16a54f1_o.jpg, Status Code: 403
Downloaded image type: <class 'NoneType'>
Invalid image


Epoch 1:   0%|          | 0/50 [00:01<?, ?it/s]

Downloaded image type: <class 'PIL.JpegImagePlugin.JpegImageFile'>
Image size: (6125, 2454)





TypeError: '>=' not supported between instances of 'JpegImageFile' and 'int'

In [25]:
from PIL import Image

# Load the fine-tuned model and processor
model = CLIPModel.from_pretrained("fine_tuned_clip_model")
processor = CLIPProcessor.from_pretrained("fine_tuned_clip_processor")

# Example image for inference
image = Image.open("path_to_image.jpg")

# Generate scene description
inputs = processor(images=image, return_tensors="pt")
outputs = model.get_text_features(**inputs)

# Your method to generate or select a scene description from the model's outputs
scene_description = generate_scene_description(outputs)  # Define your logic here
print(scene_description)

OSError: fine_tuned_clip_model is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

### Fine tuning Image-Poem model

In [7]:
model = BlipForConditionalGeneration.from_pretrained("fine_tuned_blip")
processor = BlipProcessor.from_pretrained("fine_tuned_blip")

# Generate poems from images
def generate_poem(image_path, model, processor):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=50)
    return processor.decode(outputs[0], skip_special_tokens=True)

# Example usage
image_path = "../images/00000001.jpg"
line = generate_poem(image_path, model, processor)
print(line)


You are using a model of type gpt2 to instantiate a model of type blip. This is not supported for all configurations of models and can yield errors.
Some weights of BlipForConditionalGeneration were not initialized from the model checkpoint at fine_tuned_blip and are newly initialized: ['text_decoder.bert.embeddings.LayerNorm.bias', 'text_decoder.bert.embeddings.LayerNorm.weight', 'text_decoder.bert.embeddings.position_embeddings.weight', 'text_decoder.bert.embeddings.word_embeddings.weight', 'text_decoder.bert.encoder.layer.0.attention.output.LayerNorm.bias', 'text_decoder.bert.encoder.layer.0.attention.output.LayerNorm.weight', 'text_decoder.bert.encoder.layer.0.attention.output.dense.bias', 'text_decoder.bert.encoder.layer.0.attention.output.dense.weight', 'text_decoder.bert.encoder.layer.0.attention.self.key.bias', 'text_decoder.bert.encoder.layer.0.attention.self.key.weight', 'text_decoder.bert.encoder.layer.0.attention.self.query.bias', 'text_decoder.bert.encoder.layer.0.attentio

##sphere refuses condemnation viewer vicious cheat dumont cheng trapping stunning elvis condemnation concealed cassette admitted offshore juniors condemnation 505 qualities trapping cheat trapping refuses investigations condemnation optimization technically rents qualities autonomous master vengeance relocated qualities investigations condemnation malaysian dharma 105 detectives cheng advice relocatedslin elvis agreed gregfinder


### Haiku Initial draft

In [9]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2_finetuned_haiku")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2_finetuned_haiku")

prompt = "the sun sets over the "

inputs = tokenizer.encode(prompt, return_tensors='pt')
outputs = model.generate(inputs, max_length=150, num_return_sequences=1, temperature=0.7, top_k=50)
    
haiku = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Haiku:", haiku)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Generated Haiku: the sun sets over the  /  horizon and the  /  moon shines through the  $


<h2> RL PPO </h2>

In [1]:
from gymnasium.envs.registration import register

register(
    id='HaikuEnv-v0',  # Unique ID for your environment
    entry_point='HaikuRefinerEnv_v0:HaikuEnvironment',  # Update with the actual module name
)

pygame 2.5.2 (SDL 2.28.3, Python 3.11.4)
Hello from the pygame community. https://www.pygame.org/contribute.html


In [11]:
from HaikuRefinerEnv_v0 import HaikuEnvironment

env = HaikuEnvironment(["mountain", "lake"], 0.5)

prompt = "Objects: mountain, lake; Sentiment: serene; Current Line: 'I'"

# Generate suggestions
suggestions = env.get_lm_suggestions(prompt)
print("Suggestions:", suggestions)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Suggestions: is


In [2]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create and vectorize your environment
env = make_vec_env('HaikuEnv-v0', n_envs=1)
env.reset()




# Define the PPO model
model = PPO("MultiInputPolicy", env, verbose=1, tensorboard_log="./haiku_tensorboard/")

  logger.warn(


TypeError: HaikuEnvironment.__init__() missing 2 required positional arguments: 'objects' and 'sentiment' was raised from the environment creator for HaikuEnv-v0 with kwargs ({})

In [4]:
model.learn(total_timesteps=10)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
