<a href="https://colab.research.google.com/github/NSALHI1/Animal-Recognition/blob/main/Animals_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Animals Recognition**

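This first code cell installs the libraries the project depends on. They cover model loading and inference (`transformers`, `torch`, `timm`, `sentencepiece`), text and audio handling for speech synthesis (`inflect`, `phonemizer`, `py-espeak-ng`, `soundfile`, `scipy`), and the web interface (`gradio`).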

In [None]:
!pip install transformers
!pip install scipy
!pip install torch
!pip install gradio
!pip install sentencepiece
!pip install timm
!pip install inflect
!pip install phonemizer
!pip install py-espeak-ng
!pip install soundfile
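
After the installs finish, it can be useful to confirm that the packages resolve before loading any model. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and `py-espeak-ng` is left out of the list because its import name is not stated in this notebook.

```python
import importlib.util

# Import names for the packages installed above (py-espeak-ng is omitted
# because its import name is not stated in this notebook).
REQUIRED_MODULES = [
    "transformers", "scipy", "torch", "gradio", "sentencepiece",
    "timm", "inflect", "phonemizer", "soundfile",
]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing:", missing if missing else "none")
```

If the printed list is not empty, rerunning the install cell (or restarting the runtime) is usually the first thing to try.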



The next cell imports the required libraries and loads the models behind the interactive application:

* **Translation Model**: Translates English text into Arabic (`Helsinki-NLP/opus-mt-en-ar`).
* **Image Captioning Model (BLIP)**: Generates a descriptive caption for each uploaded image (`Salesforce/blip-image-captioning-base`).
* **Question Answering Model**: Extracts an answer about the recognized animal from a reference text (`deepset/roberta-base-squad2`).
* **Text-to-Speech Model**: Converts the caption into audio so users can listen to it (`kakao-enterprise/vits-ljs`).
* **User Interface**: Gradio, imported here, provides the web interface that is built in the last cell.




In [None]:
from transformers import pipeline, BlipProcessor, BlipForConditionalGeneration
import gradio as gr
import torch
import scipy.io.wavfile as wavfile

# Load translation model to translate English text to Arabic
translator = pipeline("translation_en_to_ar", model="Helsinki-NLP/opus-mt-en-ar")

# Load the BLIP image captioning model and processor
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load a question-answering model for retrieving information
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Load a text-to-speech model for generating audio from text
narrator = pipeline("text-to-speech", model="kakao-enterprise/vits-ljs")


This snippet creates a reference context that includes detailed descriptions of various animals.

It also includes a list of known animal names to facilitate identification and extraction of relevant information from generated captions.

In [None]:
# Combined context for known animals
combined_context = """
Tigers are the largest species among the Felidae and classified in the genus Panthera.
Elephants are the largest land animals on Earth. They are known for their large ears, tusks made of ivory, and their trunks.
Deer are herbivorous mammals forming the family Cervidae. Species include white-tailed deer, mule deer, elk, moose, and reindeer.
Lions are large carnivorous mammals known for their majestic manes and live in social groups called prides.
Penguins are flightless birds primarily found in the Southern Hemisphere, known for their distinctive black and white plumage.
Dogs are domesticated mammals known for their loyalty and intelligence.
Domestic cats are small mammals valued for their companionship and playfulness.
Giraffes are the tallest land animals, known for their long necks.
Zebras are known for their distinctive black and white stripes and live in herds.
Horses are domesticated mammals valued for their strength, speed, and companionship.
Crocodiles are large reptiles found in tropical regions, known for their powerful jaws and stealthy hunting techniques.
Pandas are large bears native to China, distinguished by their black and white coloration.
"""

# List of known animal names for extraction; every entry has a description in combined_context
animal_names = ["tiger", "elephant", "deer", "lion", "penguin", "dog", "cat", "giraffe", "zebra", "horse", "crocodile", "panda"]
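
Because the name list and the reference context are maintained by hand, they can drift apart: a name with no description leaves the question-answering model nothing relevant to extract. A small consistency check, sketched here with a hypothetical helper `names_without_description` and made-up sample data, can catch that early.

```python
import re

def names_without_description(names, context):
    """Return the names that never appear as a whole word (singular or plural) in context."""
    text = context.lower()
    return [
        name for name in names
        if not re.search(rf"\b{re.escape(name.lower())}s?\b", text)
    ]

# Hypothetical sample data, much shorter than the notebook's real context
sample_context = """
Tigers are the largest species among the Felidae.
Horses are domesticated mammals.
"""
sample_names = ["tiger", "horse", "dolphin"]

print(names_without_description(sample_names, sample_context))  # prints ['dolphin']
```

Calling the same helper with `animal_names` and `combined_context` lists any names that still need a description.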

The following functions enhance the application's functionality for animal recognition and information retrieval.

1. **Extracting Animal Names**: The `extract_animal_from_caption` function scans the generated caption for any known animal names. If it finds a match, it returns that animal's name, enabling further processing and information retrieval.

2. **Generating Audio**: The `generate_audio` function uses the text-to-speech model to convert text into audio and saves the result as a WAV file (`output.wav`).

3. **Recognizing Animals and Providing Information**: The `recognize_animal_and_get_info` function integrates several steps. It generates a caption for an uploaded image, extracts the animal's name from the caption, and uses a question-answering model to fetch relevant information based on a predefined context. Additionally, it translates the caption and information into Arabic and generates audio from the caption.
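
The flow described in item 3 can be exercised without downloading any model by passing each step in as a plain callable. The sketch below is an illustration under that assumption, not the notebook's implementation: `describe_image` and all the `fake_*` functions are hypothetical stand-ins, and the audio step is left out for brevity.

```python
def describe_image(image, caption_fn, find_animal_fn, answer_fn, translate_fn):
    """Run caption -> animal lookup -> answer -> translation with injected callables."""
    caption = caption_fn(image)
    animal = find_animal_fn(caption)
    if animal:
        info = answer_fn(f"Describe a {animal}?")
    else:
        info = "Sorry, I couldn't identify the animal in the image."
    return caption, info, translate_fn(caption), translate_fn(info)

# Hypothetical stand-ins for the real models, for illustration only
def fake_caption(image):
    return "a zebra standing in a field"

def fake_find_animal(caption):
    return "zebra" if "zebra" in caption else None

def fake_answer(question):
    return "black and white stripes"

def fake_translate(text):
    return f"[ar] {text}"

result = describe_image(object(), fake_caption, fake_find_animal, fake_answer, fake_translate)
print(result)
```

Keeping the models behind callables like this makes the branching logic (animal found or not) easy to check on its own.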


In [None]:
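import re

# Function to extract animal name from the generated caption
def extract_animal_from_caption(caption):
    caption = caption.lower()
    for animal in animal_names:
        # Match whole words (optionally plural) so "cat" does not match "cattle"
        if re.search(rf"\b{re.escape(animal)}s?\b", caption):
            return animal
    return None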

# Function to generate audio from text using the text-to-speech model
def generate_audio(text):
    # Generate the narrated text
    narrated_text = narrator(text)
    # Save the audio to a WAV file
    wavfile.write("output.wav", rate=narrated_text["sampling_rate"], data=narrated_text["audio"][0])
    return "output.wav"

# Function to recognize the animal in the image and provide relevant information
def recognize_animal_and_get_info(image):
    # Step 1: Generate a caption for the uploaded image using BLIP
    inputs = blip_processor(images=image, return_tensors="pt")
    caption_ids = blip_model.generate(**inputs)
    caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)

    # Step 2: Extract the animal name from the generated caption
    animal_name = extract_animal_from_caption(caption)

    # Step 3: Use the QA model to retrieve information based on the combined context
    if animal_name:
        question = f"Describe a {animal_name}?"
        answer = qa_model(question=question, context=combined_context)
        info = answer['answer']
    else:
        info = "Sorry, I couldn't identify the animal in the image."

    # Translate both the caption and the information to Arabic
    translated_caption = translator(caption)[0]['translation_text']
    translated_info = translator(info)[0]['translation_text']

    # Generate audio from the caption
    audio_file = generate_audio(caption)

    return caption, info, audio_file, translated_caption, translated_info  # Return all outputs
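
The question-answering pipeline returns a dictionary that carries a `score` alongside the `answer`. One possible refinement, sketched here as an assumption rather than as part of the notebook, is to fall back to a fixed message when the score is low. The helper name `answer_or_fallback`, the threshold of 0.1, and the sample dictionaries are all hypothetical.

```python
def answer_or_fallback(qa_result, min_score=0.1,
                       fallback="No reliable description was found."):
    """Return the extracted answer only if the model's score reaches min_score."""
    answer = qa_result.get("answer", "").strip()
    if answer and qa_result.get("score", 0.0) >= min_score:
        return answer
    return fallback

# Hand-written dictionaries shaped like the pipeline's output (values are made up)
confident = {"score": 0.82, "start": 0, "end": 16, "answer": "flightless birds"}
uncertain = {"score": 0.01, "start": 0, "end": 6, "answer": "Tigers"}

print(answer_or_fallback(confident))  # prints flightless birds
print(answer_or_fallback(uncertain))  # prints No reliable description was found.
```

In `recognize_animal_and_get_info`, such a helper would take the place of reading `answer['answer']` directly; a suitable threshold would need to be chosen by trying real images.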


This section builds the **Gradio** interface for the animal recognition application. The layout has one tab for the caption, the animal insight, and the audio, and a second tab for the Arabic translations. When a user uploads an image, the `change` event on the image input runs `recognize_animal_and_get_info` and fills in all five outputs.

In [None]:
# Define the Gradio interface with tabs for displaying results
with gr.Blocks(css=".gradio-container { background-color: beige; }") as iface:
    gr.Markdown("# Animal Recognition")
    gr.Markdown("Upload an animal image to generate a caption and insights about the identified animal.")

    with gr.Row():
        image_input = gr.Image(type="pil", label="Upload Image")

    with gr.Tab("Generated Results"):
        with gr.Row():
            output_caption = gr.Textbox(label="Caption", interactive=False)
            output_info = gr.Textbox(label="Animal Insight", interactive=False)
            output_audio = gr.Audio(label="Audio", interactive=False)

    with gr.Tab("Translations"):
        translated_caption_output = gr.Textbox(label="Translated Caption", interactive=False)
        translated_info_output = gr.Textbox(label="Translated Animal Insight", interactive=False)

    # Define the action to take when an image is uploaded
    image_input.change(
        fn=recognize_animal_and_get_info,
        inputs=image_input,
        outputs=[output_caption, output_info, output_audio, translated_caption_output, translated_info_output]
    )

# Launch the Gradio interface
iface.launch(share=True)




