<a href="https://colab.research.google.com/github/MarMarhoun/freelance_work/blob/main/side_projects/NLP_projs/LLMs_with_Gradio/image_captioning_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web App for Image Captioning and Image Generation with Gradio and Hugging Face Models

To build a web application for image captioning and image generation with Gradio and Hugging Face models, we can break the task into several steps. The app captions an uploaded image, generates a new image from a caption, and scores how similar the generated image is to the original.
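
The flow described above can be sketched independently of any model. This is a minimal sketch, not the notebook's implementation: `run_round_trip` and its three callable parameters are hypothetical names, and the lambdas are stand-ins so the example runs without downloading anything.

```python
from typing import Any, Callable, Tuple


def run_round_trip(
    image: Any,
    captioner: Callable[[Any], str],
    generator: Callable[[str], Any],
    scorer: Callable[[Any, Any], float],
) -> Tuple[str, Any, float]:
    """Caption an image, generate a new image from the caption, and score the pair."""
    caption = captioner(image)          # image -> text
    generated = generator(caption)      # text -> image
    score = scorer(image, generated)    # (image, image) -> similarity
    return caption, generated, score


# Stand-in callables (assumptions for illustration only).
caption, generated, score = run_round_trip(
    image="original.png",
    captioner=lambda img: "a cat on a mat",
    generator=lambda text: f"generated from: {text}",
    scorer=lambda a, b: 0.5,
)
print(caption, "|", generated, "|", score)
```

In the notebook, the three roles are filled by the captioning, generation, and similarity helpers defined in the later steps.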



## Step 1: Set Up Your Environment

Make sure you have the necessary libraries installed. You can install them using pip:

In [None]:
!pip install gradio transformers torch torchvision pillow

In [None]:
!pip install --upgrade huggingface_hub diffusers transformers

In [21]:
import transformers
import diffusers

print("Transformers version:", transformers.__version__)
print("Diffusers version:", diffusers.__version__)

Transformers version: 4.51.0
Diffusers version: 0.32.2
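
If one of the packages failed to install, the `import` lines above raise an error before any version is printed. A minimal sketch of a gentler check, using only the standard library, is shown here; `installed_version` is a hypothetical helper name, and the package list simply mirrors the `pip install` commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report every dependency instead of stopping at the first missing one.
for name in ("gradio", "transformers", "diffusers", "torch", "pillow"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```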


## Step 2: Import Required Libraries & Load Models

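We will use the BLIP model for image captioning, Stable Diffusion for image generation, and CLIP for scoring image similarity. The variable name `dalle_pipeline` in the code is historical: it holds a Stable Diffusion pipeline, not DALL-E.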


In [22]:
import os
import io
import base64
import gradio as gr
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import pipeline
import torch
from transformers import CLIPProcessor, CLIPModel
from diffusers import StableDiffusionPipeline

# Load models
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load the Stable Diffusion pipeline on the GPU when CUDA is available, otherwise on the CPU.
# float16 is used only on the GPU; half precision is slow or unsupported for many CPU ops.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
if device == "cpu":
    print("Warning: CUDA is not available. Using CPU instead (generation will be slow).")
dalle_pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=dtype).to(device)

clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Alternative: the patch16 CLIP checkpoint can be swapped in here.
# clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
# clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")

## Step 3: Define Helper Functions

* Define functions for captioning and image generation.
* Define a similarity score function: embed both images with the pre-trained CLIP model and compare the two embeddings with cosine similarity, reported as a percentage.
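
Cosine similarity, the measure named in the list above, is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula on toy vectors; the notebook itself applies `torch.nn.functional.cosine_similarity` to CLIP embeddings, and the function name and inputs here are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector.")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # same direction -> 1.0
```

Only the direction of the vectors matters, not their magnitude, which is why an image compared with itself scores 100% once the value is scaled to a percentage.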



In [25]:
import io
import base64
from PIL import Image
import torch
import gradio as gr

# Define the Helper functions

def image_to_base64_str(pil_image):
    byte_arr = io.BytesIO()
    pil_image.save(byte_arr, format='PNG')
    byte_arr = byte_arr.getvalue()
    return str(base64.b64encode(byte_arr).decode('utf-8'))

def generate_caption(image):
    try:
        # Ensure the image is in the correct format
        if not isinstance(image, Image.Image):
            raise ValueError("Input is not a valid PIL image.")

        # Use the BLIP model to generate a caption
        inputs = blip_processor(images=image, return_tensors="pt")
        out = blip_model.generate(**inputs)
        caption = blip_processor.decode(out[0], skip_special_tokens=True)
        return caption
    except Exception as e:
        return f"Error generating caption: {str(e)}"

def generate_image_from_caption(caption):
    try:
        # Generate image from caption using Stable Diffusion
        generated_images = dalle_pipeline(caption).images
        generated_image = generated_images[0]  # Get the first generated image
        return generated_image  # Ensure this is a PIL image
    except Exception as e:
        return f"Error generating image: {str(e)}"

def calculate_similarity(original_image, generated_image):
    try:
        # Ensure both images are PIL images
        if not isinstance(original_image, Image.Image) or not isinstance(generated_image, Image.Image):
            raise ValueError("Both inputs must be PIL images.")

        # Process the images
        image_inputs = clip_processor(images=[original_image, generated_image], return_tensors="pt", padding=True)

        # Forward pass through the model
        with torch.no_grad():  # Disable gradient calculation for inference
            image_outputs = clip_model.get_image_features(**image_inputs)

        # Calculate cosine similarity for images
        image_similarity = torch.nn.functional.cosine_similarity(image_outputs[0], image_outputs[1], dim=0)

        # Convert similarity score to percentage
        similarity_percentage = image_similarity.item() * 100
        return f"Similarity Score between the Original & Generated Image is: {similarity_percentage:.3f}%"  # Format to 3 decimal places
    except Exception as e:
        return f"Error calculating similarity: {str(e)}"


## Step 4: Create Gradio Interface

Now, we can create a Gradio interface to bring everything together.
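
A Gradio callback wired to several output components returns one value per component, so a callback feeding an image and a textbox returns a pair. The sketch below shows that contract with plain Python stand-ins; `generate_and_score` and its `generate` and `score` parameters are hypothetical names, and the `None` check assumes the image input is empty when nothing has been uploaded.

```python
from typing import Any, Callable, Optional, Tuple, Union


def generate_and_score(
    original: Any,
    caption: str,
    generate: Callable[[str], Union[Any, str]],
    score: Callable[[Any, Any], str],
) -> Tuple[Optional[Any], str]:
    """Return (image_or_None, message): one value for each output component."""
    if original is None:
        return None, "Please upload an image first."
    if not caption or not caption.strip():
        return None, "Please enter a caption."
    result = generate(caption)
    if isinstance(result, str):
        # The helpers in this notebook report failures as plain strings.
        return None, result
    return result, score(original, result)


# Stand-in callables; the score text is a placeholder, not a measured value.
image, message = generate_and_score(
    {"id": "original"},
    "a cat on a mat",
    generate=lambda text: {"id": "generated"},
    score=lambda a, b: "Similarity: (placeholder)",
)
print(image, message)
```

Returning `None` for the image slot leaves the image component empty while the message still reaches the textbox.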

In [26]:

# Define the Gradio interface with two columns
with gr.Blocks() as iface:
    # Centered title using HTML
    gr.Markdown("<h1 style='text-align: center;'>Image Captioning and Generation with LLMs</h1>")
    gr.Markdown("<p style='text-align: center;'>Upload an image to generate a caption and create a similar image.</p>")

    with gr.Row():
        with gr.Column():
            image_input = gr.Image(type="pil", label="Upload Image")
            generate_caption_button = gr.Button("Generate Caption")
            caption_output = gr.Textbox(label="Generated Caption", interactive=False)

            # Set up the button action for generating caption
            generate_caption_button.click(
                fn=generate_caption,
                inputs=image_input,
                outputs=caption_output
            )
        with gr.Column():
            caption_input = gr.Textbox(label="Paste Caption Here", placeholder="Copy and paste the generated caption here...")
            generate_button = gr.Button("Generate Image")
            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            generated_image_output = gr.Image(label="Generated Image", interactive=False)

            # Set up the button action for generating image and calculating similarity
            def generate_and_calculate(original_image, caption):
                generated_image = generate_image_from_caption(caption)
                if isinstance(generated_image, str) and "Error" in generated_image:
                    return None, generated_image  # Return None for image and error message for similarity

                similarity_score = calculate_similarity(original_image, generated_image)
                return generated_image, similarity_score
            generate_button.click(
                fn=generate_and_calculate,
                inputs=[image_input, caption_input],
                outputs=[generated_image_output, similarity_output]
            )

iface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2442d87ce3b47ce52e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




# Additional functions

In [None]:
import gradio as gr
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from transformers import pipeline
import torch
from transformers import CLIPProcessor, CLIPModel
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler


In [None]:
# Load models (base BLIP checkpoint; "Salesforce/blip-image-captioning-large" is a larger alternative)
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load the Stable Diffusion pipeline on the GPU when CUDA is available, otherwise on the CPU.
# float16 is used only on the GPU; half precision is slow or unsupported for many CPU ops.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
if device == "cpu":
    print("Warning: CUDA is not available. Using CPU instead (generation will be slow).")
dalle_pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=dtype).to(device)

# Alternative: Stable Diffusion 2.1 base with a DPM-Solver multistep scheduler.
# dalle_pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base", torch_dtype=dtype)
# dalle_pipeline.scheduler = DPMSolverMultistepScheduler.from_config(dalle_pipeline.scheduler.config)  # Set custom scheduler
# dalle_pipeline = dalle_pipeline.to(device)

#dalle_pipeline = pipeline("text-to-image", model="dalle-mini/dalle-mini")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")



In [None]:
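import io

# Function to fetch an image from a URL
def fetch_image(image_url):
    # Download once, fail loudly on HTTP errors, then decode the bytes with PIL
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    return Image.open(io.BytesIO(response.content)).convert("RGB")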

def generate_caption(image):
    inputs = blip_processor(images=image, return_tensors="pt")
    out = blip_model.generate(**inputs)
    caption = blip_processor.decode(out[0], skip_special_tokens=True)
    return caption


def generate_image_from_caption(caption):
    # Generate image from caption using Stable Diffusion
    generated_images = dalle_pipeline(caption).images
    generated_image = generated_images[0]  # Get the first generated image
    return generated_image

# Alternative: request a smaller image to speed up generation, e.g.
# generated_images = dalle_pipeline(caption, width=256, height=256).images

In [None]:

def calculate_similarity(original_caption, generated_caption):
    # Tokenize both captions, padded to the same length
    inputs = clip_processor(text=[original_caption, generated_caption], return_tensors="pt", padding=True)

    # Use the text encoder only; calling clip_model(**inputs) would also require pixel_values
    with torch.no_grad():
        text_embeds = clip_model.get_text_features(**inputs)

    # Calculate cosine similarity
    similarity_score = torch.nn.functional.cosine_similarity(text_embeds[0], text_embeds[1], dim=0)
    return similarity_score.item()

In [None]:

def process_image(image):
    #image = fetch_image(image_url)
    caption = generate_caption(image)
    generated_image = generate_image_from_caption(caption)

    # Save the generated image locally
    generated_image_path = "generated_image.png"
    generated_image.save(generated_image_path)

    # Placeholder: the caption is compared with itself, so the score is always about 1.0
    similarity_score = calculate_similarity(caption, caption)
    return caption, generated_image, similarity_score, generated_image_path




# Define the Gradio interface with two columns
with gr.Blocks() as iface:
    # Centered title using HTML
    gr.Markdown("<h1 style='text-align: center;'>Image Captioning and Generation with LLMs</h1>")
    gr.Markdown("<p style='text-align: center;'>Upload an image to generate a caption and create a similar image.</p>")



    with gr.Row():
        with gr.Column():
            # First column: Uploaded image and generated caption
            #image_url_input = gr.Textbox(label="Image URL", placeholder="Enter image URL here...")

            image_input = gr.Image(type="pil", label="Upload Image")
            generate_caption_button = gr.Button("Generate Caption")
            caption_output = gr.Textbox(label="Generated Caption", interactive=False)

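            # Set up the button action for generating caption.
            # generate_caption returns a single string, matching the single output below;
            # process_image returns four values and would not fit one output component.
            generate_caption_button.click(
                fn=generate_caption,
                inputs=image_input,
                outputs=caption_output
            )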

        with gr.Column():
            # Second column: Text box for user input and button for generating image
            caption_input = gr.Textbox(label="Paste Caption Here", placeholder="Copy and paste the generated caption here...")
            generate_button = gr.Button("Generate Image")
            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            generated_image_output = gr.Image(label="Generated Image", interactive=False)

            # Set up the button action for generating image and calculating similarity
            def generate_and_calculate(caption):
                generated_image = generate_image_from_caption(caption)
                similarity_score = calculate_similarity(caption, caption)  # Placeholder: same caption on both sides, so always about 1.0
                return generated_image, similarity_score

            generate_button.click(
                fn=generate_and_calculate,
                inputs=caption_input,
                outputs=[generated_image_output, similarity_output]
            )

iface.launch()





In [7]:
# Define the Helper functions

def image_to_base64_str(pil_image):
    byte_arr = io.BytesIO()
    pil_image.save(byte_arr, format='PNG')
    byte_arr = byte_arr.getvalue()
    return str(base64.b64encode(byte_arr).decode('utf-8'))

def generate_caption(image):
    try:
        # Ensure the image is in the correct format
        if not isinstance(image, Image.Image):
            raise ValueError("Input is not a valid PIL image.")

        # Use the BLIP model to generate a caption
        inputs = blip_processor(images=image, return_tensors="pt")
        out = blip_model.generate(**inputs)
        caption = blip_processor.decode(out[0], skip_special_tokens=True)
        return caption
    except Exception as e:
        return f"Error generating caption: {str(e)}"

def generate_image_from_caption(caption):
    try:
        # Generate image from caption using Stable Diffusion
        generated_images = dalle_pipeline(caption).images
        generated_image = generated_images[0]  # Get the first generated image
        return generated_image
    except Exception as e:
        return f"Error generating image: {str(e)}"

def calculate_similarity(original_image, generated_image):
    # Process the images
    image_inputs = clip_processor(images=[original_image, generated_image], return_tensors="pt")

    # Use the image encoder only; calling clip_model(**image_inputs) would also require input_ids
    with torch.no_grad():
        image_embeds = clip_model.get_image_features(**image_inputs)

    # Calculate cosine similarity for images
    image_similarity = torch.nn.functional.cosine_similarity(image_embeds[0], image_embeds[1], dim=0)
    return image_similarity.item()


# Example usage
# original_image and generated_image should be PIL images
# result = calculate_similarity(original_image, generated_image)
# print(result)

def process_image(image):
    caption = generate_caption(image)
    if caption.startswith("Error"):  # generate_caption reports failures as "Error ..." strings
        return caption, None, None  # Return error message and None for image and similarity
    generated_image = generate_image_from_caption(caption)
    if isinstance(generated_image, str):  # Image generation failed; pass the message along
        return caption, None, generated_image

    # Calculate similarity score between the uploaded image and the generated image
    similarity_score = calculate_similarity(image, generated_image)
    return caption, generated_image, similarity_score


# Define the Gradio interface with two columns
with gr.Blocks() as iface:
    # Centered title using HTML
    gr.Markdown("<h1 style='text-align: center;'>Image Captioning and Generation with LLMs</h1>")
    gr.Markdown("<p style='text-align: center;'>Upload an image to generate a caption and create a similar image.</p>")

    with gr.Row():
        with gr.Column():
            # First column: Uploaded image and generated caption
            #image_url_input = gr.Textbox(label="Image URL", placeholder="Enter image URL here...")

            image_input = gr.Image(type="pil", label="Upload Image")
            generate_caption_button = gr.Button("Generate Caption")
            caption_output = gr.Textbox(label="Generated Caption", interactive=False)

            # Set up the button action for generating caption
            generate_caption_button.click(
                fn=generate_caption,
                inputs=image_input,
                outputs=caption_output
            )
        with gr.Column():
            # Second column: Text box for user input and button for generating image
            caption_input = gr.Textbox(label="Paste Caption Here", placeholder="Copy and paste the generated caption here...")
            generate_button = gr.Button("Generate Image")
            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            generated_image_output = gr.Image(label="Generated Image", interactive=False)

            # Set up the button action for generating image and calculating similarity
            def generate_and_calculate(original_image, caption):
                generated_image = generate_image_from_caption(caption)
                if isinstance(generated_image, str):  # Generation failed; show the message instead of a score
                    return None, generated_image
                similarity_score = calculate_similarity(original_image, generated_image)  # Compare the uploaded and generated images
                return generated_image, similarity_score

            generate_button.click(
                fn=generate_and_calculate,
                inputs=[image_input, caption_input],
                outputs=[generated_image_output, similarity_output]
            )

iface.launch()





In [16]:
from PIL import Image
import numpy as np

# Create a dummy image for testing
dummy_image = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))  # upper bound is exclusive, so 256 covers 0..255

# Test the similarity function with dummy images
similarity_score = calculate_similarity(dummy_image, dummy_image)
print(f"Similarity Score: {similarity_score}")

Similarity Score: 100.00%
