# **Image-to-Poem Generation Using Deep Generative Models**

### Student Correspondences:

1. Neha Anusooya Thimmarayi - nanusooy@depaul.edu
2. Rohan Shankar Patil - rpatil5@depaul.edu

#### Project Description:

This project explores the use of deep generative models to generate creative, emotionally resonant poetry from visual inputs. The core objective is to develop a machine learning system that accepts an image and generates a corresponding poem that captures the image’s mood, theme, or aesthetic, rather than providing literal descriptions.


----------------

### 1. Introduction to Libraries

This project utilizes a range of libraries for data processing, deep learning, and image-text modeling:

1. **NumPy** and **Pandas** are used for efficient numerical operations and structured data manipulation.
2. **PyTorch** is the core deep learning framework used to build and train models.
3. **Torchvision** provides pre-trained models and image transformation utilities.
4. **Transformers** (by Hugging Face) loads the pre-trained CLIP model for visual feature extraction.
5. **Pillow (PIL)** and **Requests** help load and process image data from URLs.
6. **tqdm** is used for progress bars during image processing and dataset iteration.

#### 1.1 Importing Required Libraries

In [41]:

import os
import json
import random
import pandas as pd
import numpy as np
from PIL import Image
import requests
from io import BytesIO
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms
from transformers import CLIPProcessor, CLIPModel
from torchvision.models import resnet18
from tqdm import tqdm


In [None]:
# Check if a GPU is available and print the result (useful for performance monitoring)
if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

GPU is not available


-----

### 2. Model Design and Implementation

This section describes the overall model pipeline, including how input data is preprocessed, how image features are extracted and projected into a textual embedding space, and how the resulting representations are used for poem generation using GPT-2.



#### 2.1 Data Preprocessing and Dataset Preparation

We begin by filtering and cleaning a multimodal dataset of images and their corresponding poems. The dataset is parsed from JSON, entries with an empty poem or image URL are dropped, and the result is saved in structured CSV and JSON formats. Images are then downloaded from their URLs for a random sample of 1,000 entries, and only pairs whose image downloads successfully are retained. This yielded 899 usable image-poem pairs.



In [None]:
# Load the raw multimodal poem dataset from JSON and inspect its structure
dataset_path = '/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/multim_poem.json'

with open(dataset_path, 'r') as f:
    data = json.load(f)

print(f"Total entries in dataset: {len(data)}")

# Display a sample entry to understand the data format
print("Example entry:")
print(json.dumps(data[0], indent=2))

Total entries in dataset: 8292
Example entry:
{
  "poem": "what is lovely never dies\nbut passes into other loveliness\nstar-dust or sea-foam flower or winged air",
  "image_url": "https://farm2.staticflickr.com/1086/1002051357_0e9162423e.jpg",
  "id": 0
}


In [None]:
# Filter out entries with missing or empty poems or image URLs
cleaned_data = []

for item in data:
    poem = item.get("poem")
    url = item.get("img") or item.get("image_url")

    if poem and url and poem.strip():
        cleaned_data.append({
            "image_url": url,
            "poem": poem.strip()
        })

print(f"Cleaned dataset size: {len(cleaned_data)} entries")

# Convert the cleaned data to a DataFrame for further processing
df = pd.DataFrame(cleaned_data)
df.head()

Cleaned dataset size: 8292 entries


| | image_url | poem |
|---|---|---|
| 0 | https://farm2.staticflickr.com/1086/1002051357... | what is lovely never dies\nbut passes into oth... |
| 1 | https://farm8.staticflickr.com/7434/1002469112... | sods on the dugout begin to be fledged\nwith f... |
| 2 | https://farm1.staticflickr.com/19/100255672_97... | one must have the mind of winter\nto regard th... |
| 3 | https://farm2.staticflickr.com/1034/1002997433... | to put meaning in one's life may end in madnes... |
| 4 | https://farm4.staticflickr.com/3741/1004000893... | of living pained branches\nmy garden's braided... |


In [None]:
# Save the cleaned dataset in both CSV and JSON formats for future use
os.makedirs("/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/processed", exist_ok=True)
df.to_csv("/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/processed/cleaned_poems.csv", index=False)
df.to_json("/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/processed/cleaned_poems.json", orient="records", indent=2)

print("Cleaned dataset saved.")

Cleaned dataset saved.


In [None]:
# Create a directory to store downloaded images (if it doesn't already exist)
image_dir = "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/images"
os.makedirs(image_dir, exist_ok=True)

In [None]:
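# Helper function to download and save an image from a URL
def download_image(url, save_path):
    """Download url, convert the image to RGB, and save it to save_path. Return True on success."""
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        img = Image.open(BytesIO(response.content)).convert("RGB")
        img.save(save_path)
        return True
    except Exception:
        # Network errors, HTTP error statuses, and unreadable images all count as failures
        return False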

# Sample 1000 image-poem pairs for downloading
sample_df = df.sample(n=1000, random_state=42).reset_index(drop=True)

success_count = 0

valid_data = []
image_dir = "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/images"
os.makedirs(image_dir, exist_ok=True)

img_id = 0  # counter for naming valid image files

# Attempt to download each image; store successful image-poem pairs
for _, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
    url = row['image_url']
    poem = row['poem']
    filename = f"{img_id}.jpg"
    img_path = os.path.join(image_dir, filename)

    if download_image(url, img_path):
        valid_data.append({
            "image_filename": filename,
            "poem": poem
        })
        img_id += 1

print(f"Successfully downloaded: {len(valid_data)} images")

# Save the final list of valid image-poem pairs to CSV
valid_df = pd.DataFrame(valid_data)
valid_df.to_csv("/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/processed/filtered_poem_data.csv", index=False)

100%|██████████| 1000/1000 [00:22<00:00, 44.07it/s]

Successfully downloaded: 899 images





In [None]:
# Keep only the rows of valid_df whose image file exists on disk. valid_df already records
# which poem belongs to each saved file. Files are numbered by a success counter, so indexing
# sample_df by file number would pair images with the wrong poems after any failed download.
exists_mask = valid_df["image_filename"].apply(lambda name: os.path.exists(os.path.join(image_dir, name)))

# Create a filtered DataFrame of valid image-poem pairs
filtered_df = valid_df[exists_mask].reset_index(drop=True)
print(f"Final usable pairs: {len(filtered_df)}")

# Save the aligned dataset to CSV
filtered_df.to_csv("/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/processed/filtered_poem_data.csv", index=False)

Final usable pairs: 899
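
Because `img_id` advances only after a successful download, a file's number matches its row position in `sample_df` only until the first failure. The sketch below uses made-up rows and a stand-in `fake_download` function (both hypothetical, not part of the project) to show how the numbering drifts, and why the filename-to-poem mapping recorded at download time is the one to rely on.

```python
# Hypothetical toy data: three rows, where the second download fails.
rows = [
    {"image_url": "u0", "poem": "poem A"},
    {"image_url": "u1", "poem": "poem B"},
    {"image_url": "u2", "poem": "poem C"},
]

def fake_download(url):
    """Stand-in for download_image: pretend that 'u1' is a dead link."""
    return url != "u1"

# Same naming scheme as the download loop: the counter advances only on success.
valid_pairs = []
img_id = 0
for row in rows:
    if fake_download(row["image_url"]):
        valid_pairs.append({"image_filename": f"{img_id}.jpg", "poem": row["poem"]})
        img_id += 1

# Mapping recorded at download time: 1.jpg holds the image for "poem C".
recorded = {pair["image_filename"]: pair["poem"] for pair in valid_pairs}

# Naive realignment by file number: 1.jpg gets matched to row 1, "poem B".
naive = {f"{i}.jpg": rows[i]["poem"] for i in range(len(valid_pairs))}

print(recorded["1.jpg"])  # poem C
print(naive["1.jpg"])     # poem B
```

In this toy case a single failed download is enough to shift every later pair by one row.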


#### 2.2 Designing the Image Encoder

To explore how different visual representations affect text generation, we implemented two image encoders:

- **CLIP** (Contrastive Language–Image Pre-training): A vision-language model pre-trained on 400 million image-text pairs. We used the image encoder of the `openai/clip-vit-base-patch32` variant to extract 512-dimensional image embeddings. `get_image_features` returns these embeddings unnormalized, so we L2-normalize them explicitly after extraction.

- **ResNet18**: A convolutional neural network pre-trained on ImageNet, used to extract 512-dimensional embeddings from images. We removed its final classification layer and normalized the output feature vectors.




#### 2.2.1  Designing the Image Encoder (CLIP - Contrastive Language-Image Pre-training)

We used the `openai/clip-vit-base-patch32` model from Hugging Face Transformers to extract image embeddings. Each image is passed through the CLIP image processor, which performs resizing, normalization, and tensor conversion. The processed image is then passed through the model's vision tower and its projection layer (via `get_image_features`) to obtain a 512-dimensional embedding, which we L2-normalize.

In [49]:
# Load pre-trained CLIP model and processor
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Ensure the model is on the same device as your PyTorch setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
clip_model.to(device)



CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05,

In [50]:
def extract_image_features(image_path):
    """
    Extract visual features from an image using CLIP.
    
    Args:
        image_path (str): Path to the image file.
    
    Returns:
        torch.Tensor: Feature vector of shape (1, 512) representing the image.
    """
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt").to(device)
    
    # Extract features from the model
    with torch.no_grad():
        image_features = clip_model.get_image_features(**inputs)
    
    # Normalize features (CLIP features are typically L2-normalized)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return image_features

In [51]:
# Test the feature extraction function
def test_image_feature_extraction():
    # Use a sample image from your dataset (e.g., the first valid image)
    sample_image_path = "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/images/0.jpg"
    if os.path.exists(sample_image_path):
        features = extract_image_features(sample_image_path)
        print(f"Extracted feature shape: {features.shape}")  # Expected: torch.Size([1, 512])
        print(f"Feature norm: {features.norm(dim=-1).item():.4f}")  # Should be close to 1.0 due to normalization
    else:
        print("Sample image not found. Please ensure images are downloaded.")

In [52]:
# Run the test
test_image_feature_extraction()

Extracted feature shape: torch.Size([1, 512])
Feature norm: 1.0000
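
The norm of 1.0000 reported above follows directly from dividing each feature vector by its Euclidean length. As a minimal sketch in plain Python (toy 2-d vectors instead of 512-d tensors; the helper names `l2_normalize` and `dot` are illustrative only), the same step looks like this. It also shows a convenient side effect: for unit-length vectors, cosine similarity reduces to a plain dot product.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])  # becomes [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # becomes [0.8, 0.6]

unit_norm = math.sqrt(dot(a, a))  # length of a after normalization
cosine = dot(a, b)                # cosine similarity, since both are unit length

print(round(unit_norm, 4), round(cosine, 4))  # 1.0 0.96
```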


In [53]:
# Extract CLIP features for every image listed in df and save them as a single tensor
def process_dataset_features(df, image_dir, output_path="/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/Feature Extraction/image_features.pt"):
    features_list = []
    for index, row in tqdm(df.iterrows(), total=len(df)):
        img_path = os.path.join(image_dir, row["image_filename"])
        if os.path.exists(img_path):
            features = extract_image_features(img_path)
            features_list.append(features.cpu())  # Move to CPU to save GPU memory
    # Create the output directory first; torch.save does not create missing directories
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    torch.save(torch.cat(features_list, dim=0), output_path)
    print(f"Saved {len(features_list)} image features to {output_path}")

In [None]:
# Ensure valid_df is defined by reloading or recreating it if necessary
if 'valid_df' not in globals():
    # Load the filtered dataset if it exists, or recreate it
    filtered_csv_path = "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/processed/filtered_poem_data.csv"
    if os.path.exists(filtered_csv_path):
        valid_df = pd.read_csv(filtered_csv_path)
        print(f"Loaded valid_df from {filtered_csv_path} with {len(valid_df)} entries")
    else:
        # The file-number-to-poem mapping cannot be rebuilt from the raw data alone:
        # failed downloads shift the numbering, so the download cell has to be rerun.
        raise FileNotFoundError(f"{filtered_csv_path} not found. Rerun the download cell in Section 2.1.")

# Process a subset of the dataset (e.g., first 10 images) for testing
subset_df = valid_df.head(10)
process_dataset_features(subset_df, "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/images")

  0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [00:01<00:00,  5.99it/s]

Saved 10 image features to /workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/Feature Extraction/image_features.pt





In [55]:
loaded_features = torch.load("/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/Feature Extraction/image_features.pt")
print(loaded_features.shape)  # Should be torch.Size([10, 512])

torch.Size([10, 512])


#### 2.2.2 Designing the Image Encoder (ResNet18 - Convolutional Visual Features)

ResNet18 was loaded from `torchvision.models`, and its final classification layer was removed. Images were preprocessed to match the model’s expected input format (224×224 resolution, normalized using the ImageNet mean and standard deviation), and the 512-d feature vector was taken from the global average pooling layer that precedes the removed classifier.
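
To make the normalization step concrete, here is a small sketch of the per-pixel arithmetic in plain Python. It assumes 8-bit RGB input scaled to [0, 1] (as `transforms.ToTensor` does) followed by per-channel standardization with the ImageNet statistics; the helper name `normalize_pixel` is illustrative, not part of torchvision.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale 8-bit channel values to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

white = normalize_pixel((255, 255, 255))  # largest possible values per channel
black = normalize_pixel((0, 0, 0))        # smallest possible values per channel

print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

The two extremes bound the range of values the network sees, roughly -2.1 to 2.6, centered near zero for typical natural images.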

In [None]:
# Load ResNet18 with ImageNet weights and drop the final classification layer.
# weights="IMAGENET1K_V1" replaces the deprecated pretrained=True (requires torchvision >= 0.13).
resnet_model = resnet18(weights="IMAGENET1K_V1")
resnet_model = torch.nn.Sequential(*list(resnet_model.children())[:-1])
resnet_model.eval().to(device)

# Define standard image preprocessing steps for ResNet input
resnet_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Extract and return a single 512-d ResNet feature vector from a given image
def extract_resnet_features(image_path):
    image = Image.open(image_path).convert("RGB")
    image = resnet_transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        # Output after global average pooling is (1, 512, 1, 1); squeeze to (512,)
        features = resnet_model(image).squeeze()
    return features.cpu()





In [57]:
# Test the first ResNet feature before batch processing
sample_path = os.path.join(image_dir, valid_df.loc[0, "image_filename"])
sample_feature = extract_resnet_features(sample_path)

print("Sample feature shape:", sample_feature.shape)
print("Norm before normalization:", torch.norm(sample_feature).item())
print("First 10 values:", sample_feature[:10])
 

Sample feature shape: torch.Size([512])
Norm before normalization: 31.320510864257812
First 10 values: tensor([1.8201, 0.4422, 1.9081, 3.1979, 1.6765, 0.2751, 0.2356, 1.7061, 2.0732,
        0.5250])


In [None]:
#Extract and normalize ResNet features for all images in the dataset
def process_resnet_dataset(df, image_dir, output_path):
    features_list = []
    for index, row in tqdm(df.iterrows(), total=len(df)):
        img_path = os.path.join(image_dir, row["image_filename"])
        if os.path.exists(img_path):
            features = extract_resnet_features(img_path)
            features = features / features.norm()  # L2-normalize
            features_list.append(features.cpu())

    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    
    torch.save(torch.stack(features_list), output_path)
    print(f"Saved {len(features_list)} normalized ResNet features to {output_path}")
    

In [None]:
# Process the full dataset and save all normalized ResNet18 features to disk
resnet_output_path = "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/features/resnet_features.pt"
process_resnet_dataset(valid_df, image_dir, resnet_output_path)
 

100%|██████████| 899/899 [00:58<00:00, 15.37it/s]


Saved 899 normalized ResNet features to /workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/features/resnet_features.pt


In [None]:
# Load the saved ResNet feature vectors and verify their shape and normalization
resnet_output_path = "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/features/resnet_features.pt"
resnet_features = torch.load(resnet_output_path)
print(f"Loaded ResNet features shape: {resnet_features.shape}")
print("Norm sample:", resnet_features[0].norm().item()) 

Loaded ResNet features shape: torch.Size([899, 512])
Norm sample: 1.0


#### 2.3 Building the Dense Projection Network

To bridge the gap between the 512-d image features and GPT-2’s 768-d input space, we use the same projection architecture for both encoders. A single linear layer followed by a Tanh activation maps each 512-d vector to 7,680 values, which are reshaped into a sequence of 10 GPT-2-compatible prefix embeddings of 768 dimensions each.

#### 2.3.1 Projection for CLIP Features

#### 2.3.2 Projection for ResNet18 Features

In [None]:
# Define a projection network to map 512-d image features to a sequence of GPT-2-compatible prefix embeddings
class ImageToTextProjection(nn.Module):
    def __init__(self, input_dim=512, gpt2_emb_dim=768, prefix_length=10):
        super(ImageToTextProjection, self).__init__()
        self.prefix_length = prefix_length
        self.gpt2_emb_dim = gpt2_emb_dim
        self.projector = nn.Sequential(
            nn.Linear(input_dim, gpt2_emb_dim * prefix_length),
            nn.Tanh()
        )

    def forward(self, x):
        out = self.projector(x)
        return out.view(-1, self.prefix_length, self.gpt2_emb_dim)

# Initialize the projection model
projector = ImageToTextProjection(input_dim=512, gpt2_emb_dim=768, prefix_length=10)

In [None]:
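# Project all ResNet features into GPT-2 embedding space as prefix tokens.
# The projector is still randomly initialized at this point, so this run only checks shapes;
# no_grad keeps the result free of autograd history before it is saved.
with torch.no_grad():
    projected_all = projector(resnet_features)
print("Projected shape:", projected_all.shape)  # Expected: (899, 10, 768)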

Projected shape: torch.Size([899, 10, 768])
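
The `(899, 10, 768)` shape follows from simple arithmetic: the linear layer outputs `768 * 10` values per image, and the reshape splits them into 10 rows of 768. The sketch below reproduces that bookkeeping in plain Python, with a hypothetical `reshape_to_prefix` helper standing in for `.view`, and also counts the layer's parameters.

```python
INPUT_DIM = 512
GPT2_EMB_DIM = 768
PREFIX_LENGTH = 10

flat_dim = GPT2_EMB_DIM * PREFIX_LENGTH       # width of the linear layer's output
num_params = INPUT_DIM * flat_dim + flat_dim  # weight matrix plus bias vector

def reshape_to_prefix(flat, prefix_length, emb_dim):
    """Split a flat list into prefix_length rows of emb_dim values (row-major order)."""
    assert len(flat) == prefix_length * emb_dim
    return [flat[i * emb_dim:(i + 1) * emb_dim] for i in range(prefix_length)]

# Use the integers 0..7679 as a stand-in for one projected feature vector.
prefix = reshape_to_prefix(list(range(flat_dim)), PREFIX_LENGTH, GPT2_EMB_DIM)

print(flat_dim, num_params, len(prefix), len(prefix[0]))  # 7680 3939840 10 768
```

With roughly 3.9 million parameters in this one layer and only 899 training pairs, regularization or a smaller prefix length may be worth considering once training begins.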


In [None]:
# Save the projected prefix embeddings to disk for use in GPT-2 conditioning
torch.save(projected_all, "/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/features/resnet_projected_prefix.pt")
print("Projected prefixes saved.")

Projected prefixes saved.


In [67]:
# Load the filtered dataset and iterate over all image-poem pairs
valid_df = pd.read_csv("/workspaces/Image-to-Poem-Generation-Using-Deep-Generative-Models/data/processed/filtered_poem_data.csv")
assert len(valid_df) == 899  

# Iterate through image-poem pairs: for each projected prefix, retrieve the corresponding poem text
for i in range(len(valid_df)):
    image_prefix = projected_all[i]  # (10, 768)
    poem = valid_df.loc[i, "poem"]

print(image_prefix.shape)    

torch.Size([10, 768])
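
The loop above relies on the i-th prefix and the i-th poem belonging together. One way to make that pairing explicit, and to fail early if the two sources ever fall out of step, is a small map-style container. This is only a sketch under assumptions: `PrefixPoemPairs` is a hypothetical helper, and the toy lists stand in for `projected_all` and the `poem` column of `valid_df`.

```python
class PrefixPoemPairs:
    """Map-style container pairing the i-th prefix with the i-th poem.

    Hypothetical helper: it implements __len__ and __getitem__, the two
    methods a map-style PyTorch dataset is expected to provide.
    """

    def __init__(self, prefixes, poems):
        if len(prefixes) != len(poems):
            raise ValueError(
                f"{len(prefixes)} prefixes but {len(poems)} poems; pairs would be misaligned"
            )
        self.prefixes = prefixes
        self.poems = poems

    def __len__(self):
        return len(self.poems)

    def __getitem__(self, idx):
        return self.prefixes[idx], self.poems[idx]

# Toy stand-ins for the projected prefixes and the poem texts.
toy_prefixes = [[0.1, 0.2], [0.3, 0.4]]
toy_poems = ["first poem", "second poem"]

pairs = PrefixPoemPairs(toy_prefixes, toy_poems)
prefix, poem = pairs[1]
print(len(pairs), prefix, poem)  # 2 [0.3, 0.4] second poem
```

The length check in the constructor plays the same role as the `assert len(valid_df) == 899` line, but it compares the two sources against each other rather than against a fixed number.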


#### 2.4 Configuring the GPT-2 Decoder

#### 2.5 Model Validation on Dummy Data

-----

### 3. Training Process


#### 3.1 Setting Up Training Configurations

#### 3.2 Training on a Small Batch

#### 3.3 Full Dataset Training

#### 3.4 Visualizing Training Progress

------

### 4. Evaluation Results


#### 4.1 Quantitative Evaluation with Automated Metrics

#### 4.2 Qualitative Evaluation via Human Judgment

#### 4.3 Error Analysis and Visualization

#### 4.4 Summary of Findings

-----