# Caption Generator

Captionify is an image captioning project that combines three models. BLIP (`Salesforce/blip-image-captioning-large`) produces a plain description of the image, and either Llama-2-7b-chat (run locally through `transformers`) or Llama3-8b-8192 (served through the Groq API) rewrites that description as a caption with emojis. A Gradio interface ties the pieces together, so developers and non-developers alike can caption an image by uploading it, without writing any code.


## Setup and Installation

In this step, we install the packages the notebook needs: `transformers` (from GitHub), `datasets`, `loralib`, `sentencepiece`, `bitsandbytes`, `accelerate`, `xformers`, `einops`, `langchain`, `langchain_community`, `langchain_groq`, `groq`, `gradio`, `torch`, `torchvision`, and `torchaudio`.


In [1]:
!pip install -q git+https://github.com/huggingface/transformers  # need to install from github
!pip install -q datasets loralib sentencepiece
!pip install -q bitsandbytes accelerate xformers einops
!pip install -q langchain
# Clean reinstall of the torch stack, in case the installs above pulled mismatched versions
!pip uninstall torch torchvision torchaudio -y
!pip install torch torchvision torchaudio
!pip install -q gradio
!pip install groq langchain_community
!pip install langchain_groq

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done

## GPU Status

We will check the status of the GPU using the `nvidia-smi` command. This command provides information about the NVIDIA GPU, including its usage, memory allocation, and other relevant details. Monitoring the GPU status helps ensure that our environment is correctly set up for running deep learning models efficiently.


In [None]:
!nvidia-smi

Sat Jul 27 10:01:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
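
The same check can be scripted so that later cells can react to the result. This is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_line` and `query_gpus` are hypothetical and not part of the notebook.

```python
import shutil
import subprocess


def parse_gpu_line(line):
    """Parse one CSV row ("name, used, total") from the nvidia-smi query below."""
    name, used, total = [field.strip() for field in line.split(",")]
    return {"name": name, "used_mib": int(used), "total_mib": int(total)}


def query_gpus():
    """Return one dict per visible GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for row in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(row))
        except ValueError:
            continue  # skip rows that do not have the expected three numeric fields
    return gpus


for gpu in query_gpus():
    free = gpu["total_mib"] - gpu["used_mib"]
    print(f"{gpu['name']}: {free} MiB free of {gpu['total_mib']} MiB")
```

On a machine without an NVIDIA driver the loop prints nothing, which is itself a useful signal before attempting to load a 7B model.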
                                                                    

## Main Content

In this section, we proceed with the core tasks using the installed packages. No model is trained here; all three models are used as pre-trained checkpoints. We will cover the following steps:

1. **Model Setup**: Configure and initialize the models.
2. **Prompt Construction**: Build the Llama 2 chat prompt used for caption generation.
3. **Inference**: Describe an image with BLIP and turn the description into a caption with an LLM.
4. **Interface**: Expose the pipeline through a Gradio app for interactive testing.


In [2]:
from huggingface_hub import notebook_login
notebook_login()


In [3]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

## Model Setup

Here, we configure and initialize the models using the `transformers` library and prepare them for inference.


### LLaMA2 7B Chat


In [None]:
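# Llama 2 is a gated model: `token=True` reuses the credentials stored by notebook_login().
# (`use_auth_token` is deprecated in recent transformers releases in favor of `token`.)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",          # place the layers on the available GPU automatically
    torch_dtype=torch.float16,  # half precision so the 7B weights fit on a 16 GB T4
    token=True,
    # load_in_8bit=True,
    # load_in_4bit=True,
)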

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
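# Wrap the loaded model in a text-generation pipeline for use with LangChain later
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,  # match the dtype the model was loaded with
    device_map="auto",
    max_new_tokens=512,         # upper bound on generated tokens
    do_sample=True,
    top_k=30,                   # sample from the 30 most likely tokens
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)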

In [None]:
!nvidia-smi

Sat Jul 27 10:04:48 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0              26W /  70W |  12957MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
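
The jump from 0 MiB to 12957 MiB is roughly what the half-precision weights alone account for. The following back-of-the-envelope sketch assumes about 6.74 billion parameters for Llama-2-7b (a figure not stated in this notebook) and 2 bytes per float16 parameter; `weight_mib` is a hypothetical helper.

```python
# Assumption: approximate parameter count of Llama-2-7b; not stated in the notebook.
NUM_PARAMS = 6.74e9
BYTES_PER_PARAM = {"float16": 2, "int8": 1, "int4": 0.5}
T4_TOTAL_MIB = 15360  # total memory reported by nvidia-smi above


def weight_mib(num_params, dtype):
    """Estimate the memory needed for the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


for dtype in BYTES_PER_PARAM:
    mib = weight_mib(NUM_PARAMS, dtype)
    print(f"{dtype}: ~{mib:,.0f} MiB of weights, ~{T4_TOTAL_MIB - mib:,.0f} MiB left on a T4")
```

The float16 estimate lands close to the observed usage, which leaves little headroom for activations on a T4. The commented-out `load_in_8bit` and `load_in_4bit` options in the loading cell would trade some precision for more free memory.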
                                                                    

## Caption Generation with Emojis

This section covers generating captions with emojis using a pre-trained language model.

1. **Imports**: Necessary libraries (`json`, `textwrap`).
2. **Constants**: Define system prompts and instruction markers.
3. **Prompt Creation**: `get_prompt` function combines system and user prompts.
4. **Text Utilities**: Functions to trim (`cut_off_text`) and clean (`remove_substring`) text.
5. **Caption Generation**: `generate` function creates prompts, generates text, and processes outputs.
6. **Text Parsing**: `parse_text` formats text for readability.

The following code implements these steps:


In [8]:
import json
import textwrap

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
Write caption and use emojis."""



def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):
    SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
    prompt_template =  B_INST + SYSTEM_PROMPT + instruction + E_INST
    return prompt_template

def cut_off_text(text, prompt):
    cutoff_phrase = prompt
    index = text.find(cutoff_phrase)
    if index != -1:
        return text[:index]
    else:
        return text

def remove_substring(string, substring):
    return string.replace(substring, "")



def generate(text):
    """Generate a caption for `text` with the Llama 2 model loaded above."""
    prompt = get_prompt(text)
    with torch.autocast('cuda', dtype=torch.bfloat16):
        inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
        outputs = model.generate(**inputs,
                                 max_new_tokens=512,
                                 eos_token_id=tokenizer.eos_token_id,
                                 pad_token_id=tokenizer.eos_token_id,
                                 )
        # Decode, then drop anything after the end-of-sequence marker and the echoed prompt
        final_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        final_outputs = cut_off_text(final_outputs, '</s>')
        final_outputs = remove_substring(final_outputs, prompt)

    return final_outputs

def parse_text(text):
    """Wrap `text` at 100 characters and print it."""
    wrapped_text = textwrap.fill(text, width=100)
    print(wrapped_text + '\n\n')
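
To see what the text utilities do without loading a model, the sketch below restates the helpers and applies them to a made-up decoded output. The description, the decoded string, and the resulting caption are hypothetical; only the helper logic mirrors the cell above.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = "Write caption and use emojis."


def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT):
    # Llama 2 chat layout: [INST]<<SYS>> system <</SYS>> user [/INST]
    return B_INST + B_SYS + new_system_prompt + E_SYS + instruction + E_INST


def cut_off_text(text, phrase):
    # Keep everything before the first occurrence of `phrase`, if it occurs at all
    index = text.find(phrase)
    return text[:index] if index != -1 else text


def remove_substring(string, substring):
    return string.replace(substring, "")


prompt = get_prompt("a dog running on a beach")
# Hypothetical decoded output: a causal LM echoes the prompt before its completion.
decoded = prompt + " Beach zoomies! 🐶🌊</s>"

caption = remove_substring(cut_off_text(decoded, "</s>"), prompt).strip()
print(caption)
```

Note that `remove_substring` only strips the prompt when the decoded text reproduces it character for character, so a tokenizer that normalizes whitespace could leave the echo in place.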


In [4]:
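# Import from the current module paths; the top-level `langchain` aliases are deprecated
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain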

In [None]:
llm = HuggingFacePipeline(pipeline = pipe, model_kwargs = {'temperature':0})



## Using Llama3-8b-8192 via the Groq API for Fast Processing

In this section, we use the Groq API to speed up caption generation. Groq hosts the model and runs inference on its LPU (Language Processing Unit) hardware, so this step needs no local GPU memory. We will demonstrate how to integrate the Groq API into our workflow.

The following code implements the necessary steps:


In [22]:
from langchain_groq import ChatGroq
import os

groq_api_key = ""  # fill in your Groq API key before running this cell
os.environ['GROQ_API_KEY']= groq_api_key

llm_groq = ChatGroq(groq_api_key=groq_api_key, model_name="Llama3-8b-8192", temperature=0.0)
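
An empty key only fails once the first request is sent. One way to fail earlier, and to keep the key out of the notebook, is to read it from the environment. This is a minimal sketch; `require_env` is a hypothetical helper, not part of `langchain_groq`.

```python
import os


def require_env(name):
    """Return an environment variable's value, failing early if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} before creating the client")
    return value


# Hypothetical usage in the cell above:
# groq_api_key = require_env("GROQ_API_KEY")
```

In Colab, the variable could be populated from the notebook's secrets panel before this helper is called.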


In [9]:
system_prompt = "Generate a creative and engaging caption for the given description of an image. Include appropriate emojis to enhance the caption."
instruction = "Write a caption, using emojis for added expression: \n\n {text}"
template = get_prompt(instruction, system_prompt)

print(template)
prompt = PromptTemplate(template=template, input_variables=["text"])

[INST]<<SYS>>
Generate a creative and engaging caption for the given description of an image. Include appropriate emojis to enhance the caption.
<</SYS>>

Write a caption, using emojis for added expression: 

 {text}[/INST]
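
The `{text}` placeholder is filled in when the chain runs. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate` so that it runs without LangChain installed; the image description is a made-up example.

```python
template = (
    "[INST]<<SYS>>\n"
    "Generate a creative and engaging caption for the given description of an image. "
    "Include appropriate emojis to enhance the caption.\n"
    "<</SYS>>\n\n"
    "Write a caption, using emojis for added expression: \n\n {text}[/INST]"
)

# Hypothetical BLIP description used as the chain input.
filled = template.format(text="a cat sleeping on a sofa")
print(filled)
```

Because the substitution uses brace syntax, a description that itself contains `{` or `}` would need escaping before it is formatted this way.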


In [11]:
llm_hug = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0})
# Pass the LLM defined on the line above (the original referenced an undefined `llm`)
llm_chain = LLMChain(prompt=prompt, llm=llm_hug)

In [12]:
llm_groq_chain = LLMChain(prompt=prompt, llm=llm_groq)



## Generating Captions

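The `generate` function defined earlier sends the prompt to the local Llama 2 model, generates the output, and processes the text. The function:
- Creates the prompt.
- Uses mixed precision for efficient GPU usage with `torch.autocast`.
- Generates text using the pre-trained model.
- Decodes the generated text without special tokens and removes the original prompt.

The cell below adds the image side of the pipeline: it loads BLIP, and `image_desc` turns an uploaded image into a short description, then passes that description to the Groq chain to produce the final caption.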


In [16]:
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
# Use a separate name so the Llama 2 `model` loaded earlier is not overwritten
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")


def image_desc(input_image):
    """Describe a numpy RGB image with BLIP, then turn the description into a caption."""
    raw_image = Image.fromarray(input_image.astype('uint8'), 'RGB')

    # Unconditional image captioning: no text prompt is passed to the processor
    inputs = processor(raw_image, return_tensors="pt")
    out = blip_model.generate(**inputs)
    output_text = processor.decode(out[0], skip_special_tokens=True)

    # Option 1: Hugging Face pipeline with Llama 2 7B
    # output = llm_chain.run(output_text)
    # output = output[output.find("[/INST]")+7:]

    # Option 2: Groq API with Llama 3 8B
    output = llm_groq_chain.run(output_text)

    return output
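
The commented-out slice `output[output.find("[/INST]")+7:]` assumes the marker is always present. When `find` returns `-1`, the slice starts at index 6 and silently drops the first six characters. A slightly more defensive sketch follows; `strip_prompt_echo` is a hypothetical helper, not part of the notebook.

```python
E_INST = "[/INST]"


def strip_prompt_echo(generated):
    """Return the text after the last [/INST] marker, or the input unchanged if absent."""
    _, marker, tail = generated.rpartition(E_INST)
    # rpartition yields an empty `marker` when the separator is not found
    return tail.strip() if marker else generated.strip()


print(strip_prompt_echo("[INST]describe the photo[/INST] Sunny day ☀️"))
print(strip_prompt_echo("Sunny day ☀️"))
```

Both calls print the same caption, so the helper is safe to apply whether or not the backend echoes the prompt.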



## Evaluation

No training is performed in this notebook, so we evaluate the pipeline qualitatively: a Gradio interface lets us upload an image and inspect the caption that `image_desc` returns.


In [23]:
import gradio as gr


# Create a Gradio interface with a submit button
iface = gr.Interface(
    fn=image_desc,
    inputs=gr.Image(type="numpy"),  # Gradio Image component takes input as numpy array
    outputs=gr.Textbox(),  # Gradio Textbox component for displaying text output
    live=False,
    title="🐸Captionify🐸",
    description="Add Image!!",
    allow_flagging="never",
    theme='Ajaxon6255/Emerald_Isle'

)

# Launch the Gradio interface
iface.launch(debug = True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://a4c3a60cedcd68afb7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1508, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 8

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://a4c3a60cedcd68afb7.gradio.live
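
The traceback above is truncated, so the underlying error is not visible here. One way to make such failures easier to diagnose is to catch exceptions at the handler boundary and return them as text. This is a minimal sketch under that assumption; `report_errors` is a hypothetical decorator, not a Gradio feature.

```python
import functools
import logging


def report_errors(fn):
    """Wrap a handler so failures are logged and returned as a readable message."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # broad on purpose: this is the UI boundary
            logging.exception("caption handler failed")
            return f"Could not generate a caption: {exc}"
    return wrapper


@report_errors
def failing_handler(image):
    # Stand-in for image_desc, used only to demonstrate the wrapper
    raise ValueError("missing API key")


print(failing_handler(None))
```

With this in place, the interface could be built with `fn=report_errors(image_desc)`, so the textbox shows the error message while the full stack trace still goes to the log.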


