<a href="https://colab.research.google.com/github/Amna-Javed2/Quarter3_Assignments/blob/main/Image_%26_Voice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Gemini 2.0 Flash**

In [None]:
# Installation
!pip install --upgrade --quiet google-genai

In [1]:
!pip install gTTS -q

In [3]:
!pip install playsound -q

In [None]:
# API Key
from google.colab import userdata
GOOGLE_API_KEY: str = userdata.get('GOOGLE_API_KEY')
if GOOGLE_API_KEY:
  print("Key found")
else:
  print("Key not found")

Key found
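
The cell above only prints a message when the key is missing, so later cells would still run and fail at the first API call. A minimal sketch of a fail-fast variant follows. It assumes a fallback to an environment variable is acceptable; `resolve_api_key` and its parameters are hypothetical names, not part of `google.colab` or `google-genai`.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    colab_value: Optional[str],
    env: Mapping[str, str] = os.environ,
    name: str = "GOOGLE_API_KEY",
) -> str:
    """Return the first non-empty key, or raise if none is available."""
    # Prefer the value read from Colab secrets, then the environment variable.
    key = colab_value or env.get(name)
    if not key:
        raise RuntimeError(f"{name} not found in Colab secrets or environment")
    return key
```

In this notebook it could be called as `resolve_api_key(userdata.get('GOOGLE_API_KEY'))`.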


In [None]:
# Client configuration
from google import genai
from google.genai import Client

client: Client = genai.Client(
    api_key=GOOGLE_API_KEY,
)

# Model Selection
model: str = 'gemini-2.0-flash-exp'

In [None]:
from google.genai.types import GenerateContentResponse
from IPython.display import display, Markdown, Video

response: GenerateContentResponse = client.models.generate_content(
    model=model,
    contents='How does AI work?'
)
display(Markdown(response.text))

That's a great question! It's a broad topic, so let's break down how AI works in a way that's hopefully easy to understand. Think of it like training a very smart dog, but instead of a dog, it's a computer program.

Here's a simplified explanation focusing on key concepts:

**1. The Goal: Learning From Data**

At its core, AI is about creating systems that can learn and make decisions without being explicitly programmed for every single situation. Instead, they learn from **data**. This data can be anything: images, text, numbers, sounds, etc.

**2. Machine Learning: The Engine of AI**

The most common way AI achieves this learning is through **Machine Learning (ML)**. Here's a basic breakdown of how it works:

*   **Algorithms:** ML uses various algorithms (mathematical formulas) to identify patterns in the data. Think of them as the "rules" the computer follows to learn.
*   **Training:** The algorithms are fed massive amounts of data. Through this process, they adjust their internal parameters (like the strength of connections in a neural network) to better recognize those patterns.
*   **Prediction/Decision:** Once trained, the model can be given new, unseen data and make predictions or decisions based on what it has learned.

**Analogy: Recognizing Cats**

Imagine teaching a child to recognize a cat:

1.  **Data:** You show them hundreds of pictures of cats, some with different colors, breeds, and poses.
2.  **Learning:** The child's brain (like an ML algorithm) starts to notice common features: pointy ears, whiskers, a tail, etc.
3.  **Prediction:** Now, when they see a new picture, they can say "That's a cat!"

**3. Different Types of Machine Learning**

There are several major approaches to Machine Learning:

*   **Supervised Learning:** Like the cat example. The model is given *labeled* data (pictures of cats labeled as "cat", pictures of dogs labeled as "dog") and learns to map inputs to outputs. It's like learning from a teacher. Common uses: spam detection, image recognition, medical diagnosis.
*   **Unsupervised Learning:** The model is given *unlabeled* data and must find patterns on its own. Think of finding hidden clusters in the data. Common uses: customer segmentation, anomaly detection, recommendation systems.
*   **Reinforcement Learning:** The model learns through trial and error, receiving rewards for good actions and penalties for bad ones. It's like training a dog with treats and scolding. Common uses: game playing (like AlphaGo), robotics control, autonomous driving.

**4. Deep Learning: A Powerful Subfield**

Deep Learning is a subset of machine learning that uses **artificial neural networks (ANNs)** with multiple layers (hence "deep"). These networks are inspired by the structure of the human brain and are very powerful at learning complex patterns.

*   **Neural Networks:** These networks are made up of interconnected nodes (neurons). Connections between these neurons are weighted, and these weights change as the network learns.
*   **Deep Architectures:** Having multiple layers allows deep learning models to learn hierarchical features: low-level features like edges in an image can be combined to recognize higher-level features like eyes, which then can be combined to recognize the entire face.

Deep learning is responsible for many recent AI breakthroughs, such as advanced image recognition, natural language processing, and speech recognition.

**5. The AI "Pipeline"**

Here's a rough outline of how an AI system is typically developed:

1.  **Data Collection:** Gathering the necessary data.
2.  **Data Preparation:** Cleaning and organizing the data.
3.  **Model Selection:** Choosing the appropriate algorithm for the task.
4.  **Model Training:** Feeding the data to the model to learn.
5.  **Model Evaluation:** Testing the model's performance.
6.  **Deployment:** Putting the trained model into practical use.
7.  **Maintenance:** Continuously monitoring and improving the model.

**Key Takeaways**

*   **AI learns from data:** It's not explicitly programmed for every situation.
*   **Machine Learning is the core:** Algorithms help find patterns in data.
*   **Different types of learning exist:** Supervised, unsupervised, and reinforcement.
*   **Deep Learning is a powerful technique:** Using neural networks with many layers.
*   **It's an iterative process:** AI systems are constantly refined and improved.

**Important Note:** This explanation is a simplification. The actual mathematics and algorithms are quite complex. However, hopefully, this gives you a good general idea of how AI works.

**Do you have any specific area of AI you'd like to explore further?** For example, are you interested in:

*   Image recognition?
*   Natural language processing?
*   AI in games?
*   Ethical considerations in AI?

Knowing what interests you most will help me provide a more focused and detailed explanation.


**Video Concept**

In [None]:
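# To fetch the video from a URL instead of uploading it,
# uncomment the next line and replace <VIDEO_URL> with the link:
# !wget <VIDEO_URL> -O video.mp4 -q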

from google.colab import files
uploaded = files.upload()

In [None]:
import time

def upload_video(video_file_name):
  """Upload a video to the Files API and wait until processing finishes."""
  # google-genai 1.x takes `file=`; pre-1.0 releases used `path=`.
  video_file = client.files.upload(file=video_file_name)
  while video_file.state == "PROCESSING":
    print('Waiting for video to be processed.')
    time.sleep(10)
    video_file = client.files.get(name=video_file.name or "")

  if video_file.state == "FAILED":
    raise ValueError(f'Video processing failed: {video_file.state}')
  print(f'Video processing complete: {video_file.uri or ""}')

  return video_file

my_video = upload_video('video.mp4')

Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/q7tjac8oszvi
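
The polling loop in `upload_video` waits indefinitely while the file is in the `PROCESSING` state. The sketch below shows the same wait with a deadline. It is an illustration only: `wait_until_ready`, `fetch`, and the default timeout are hypothetical and not part of the google-genai SDK. `fetch` stands in for any callable that returns the current state.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch() until it stops returning "PROCESSING" or the deadline passes."""
    deadline = clock() + timeout_s
    state = fetch()
    while state == "PROCESSING":
        if clock() >= deadline:
            raise TimeoutError(f"still PROCESSING after {timeout_s} seconds")
        sleep(interval_s)
        state = fetch()
    if state == "FAILED":
        raise ValueError("file processing failed")
    return state
```

In this notebook, `fetch` could be a lambda that calls `client.files.get(...)` and returns its `state`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.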


In [None]:

from google.genai.types import Content, Part
prompt = """ For each scene in this video,
            generate captions that describe the scene along with any spoken text placed in quotation marks.
            Place each caption into an object with the timecode of the caption in the video.
         """

video = my_video

response = client.models.generate_content(
    model=model,
    contents=[
        Content(
            role="user",
            parts=[
                Part.from_uri(
                    file_uri=video.uri or "",
                    mime_type=video.mime_type or ""),
                ]),
        prompt,
    ]
)

Markdown(response.text)

```json
[
  {
    "timecode": "00:00",
    "caption": "A baby wearing an orange shirt is sitting facing a golden dog with its nose touching the baby's head. They are both in a garden area with green foliage."
  },
  {
    "timecode": "00:01",
    "caption": "The baby and the golden dog are in the same position. The dog has its mouth slightly open as if licking the baby's forehead."
  },
    {
    "timecode": "00:02",
    "caption": "The baby and golden dog are still positioned close to each other with their heads close. The dog’s nose is again on the baby’s head."
  }
]
```
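
The model returns the captions as JSON wrapped in a Markdown code fence, so the reply needs unwrapping before it can be used as data. A minimal sketch, assuming the reply contains a single fenced block or bare JSON; `extract_json` is a hypothetical helper and the `reply` string is an abbreviated sample, not the full output above.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # built at runtime so this cell contains no literal nested fence
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> Any:
    """Parse JSON from a model reply, with or without a Markdown code fence."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Abbreviated sample shaped like the caption output.
reply = (
    FENCE + "json\n"
    '[\n'
    '  {"timecode": "00:00", "caption": "A baby sits facing a golden dog."},\n'
    '  {"timecode": "00:01", "caption": "The dog appears to lick the baby."}\n'
    ']\n' + FENCE
)

for item in extract_json(reply):
    print(item["timecode"], item["caption"])
```

With the real response, `extract_json(response.text)` would give a list of dictionaries, provided the model kept to the requested structure; `json.loads` raises `json.JSONDecodeError` otherwise.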

In [None]:

# Analyze the video (visual and audio components)
def analyze_video(video_file):
    """
    Analyzes both the visual and audio components of the uploaded video.
    """
    prompt = """
    Analyze the uploaded video and provide:
    1. A summary of the visual elements (e.g., scenes, actions, objects).
    2. A summary of the audio content, including spoken words, tone, and any significant background sounds.
    """
    try:
        response = client.models.generate_content(
            model=model,
            contents=[
                Content(
                    role="user",
                    parts=[
                        Part.from_uri(
                            file_uri=video_file.uri or "",
                            mime_type=video_file.mime_type or ""
                        )
                    ]
                ),
                prompt
            ]
        )
        print("Analysis Response:")
        display(Markdown(response.text))
        return response.text
    except Exception as e:
        print(f"Error analyzing video: {e}")
        return None


# Call the analysis function
analysis_result = analyze_video(my_video)

# Interact with the LLM based on the analysis
def interact_with_llm(analysis_text):
    """
    Interacts with the LLM by asking questions based on the video analysis.
    """
    question_prompt = """
    Based on the provided analysis, answer the following:
    1. What is the main message or theme of the video?
    2. How do the visual and audio components complement each other in conveying the message?
    3. Are there any notable emotional tones or themes in the spoken content?
    5. what he request and how to respond to it?
    """
    try:
        response = client.models.generate_content(
            model=model,
            contents=[
                analysis_text,
                question_prompt
            ]
        )
        print("LLM Interaction Response:")
        display(Markdown(response.text))
    except Exception as e:
        print(f"Error during LLM interaction: {e}")


# Proceed if analysis was successful
if analysis_result:
    interact_with_llm(analysis_result)


Analysis Response:


Okay, here's an analysis of the provided video frames:

**1. Summary of Visual Elements:**

The video depicts a heartwarming interaction between a baby and a dog in a sunlit outdoor setting.

* **Scene:** The scene takes place in a grassy area with various green plants and some flowers visible in the background. The lighting is bright, suggesting a sunny day.
* **Subjects:**
    * A baby is seated on the grass, facing a dog. The baby is wearing an orange t-shirt.
    * The dog is golden-brown with medium-length fur, sitting beside the baby.
* **Actions:**
    * In the first frame, the baby is facing the dog, reaching out with their hand. It looks as though the baby and dog are touching faces. 
    * In the second frame, the dog opens its mouth and seems to lick the baby's face.
    * In the third frame, the dog has returned to its original position and is resting its head against the baby's.
* **Other Visual Details:**
    * The camera angle is at eye-level with the subjects, providing an intimate perspective.
    * There's a watermark "hotshot.co" with a smiley face icon in the bottom right corner.

**2. Summary of Audio Content:**

The provided frames are static, meaning there is no audio present. Therefore, we can't analyze any spoken words, tone, or background sounds. The analysis is based solely on the visual aspects.

If you have any more video frames you'd like me to analyze, or if you gain access to an audio component, feel free to share!

LLM Interaction Response:


Okay, let's break down these questions based on the analysis we have.

**1. What is the main message or theme of the video?**

Based on the *visuals alone*, the main message or theme of the video is **a depiction of a gentle and affectionate interaction between a baby and a dog.** The visual cues suggest:

*   **Interspecies bonding:** The close proximity and physical contact between the baby and the dog highlight the potential for connection and affection between different species.
*   **Innocence and tenderness:** The baby's gentle reach and the dog's licking suggest a tender and innocent interaction. It evokes feelings of cuteness and harmlessness.
*   **Comfort and companionship:** The dog resting its head against the baby conveys a sense of comfort and companionship.
*   **Joy and happiness:** The sunlit outdoor setting, the baby's reaching and the dog's affectionate behavior all point towards a joyful and positive moment.

**2. How do the visual and audio components complement each other in conveying the message?**

This is where we run into a limitation. **There is no audio provided with the video frames.** Therefore, we can't discuss how the visual and audio components complement each other. We can only analyze the visual message as described above. 

*   **If we had audio**, it could potentially enhance the message by:
    *   Adding sounds of baby coos or laughter, further amplifying the innocence and joy.
    *   Featuring sounds of the dog's gentle breathing or soft whimpers, adding to the tenderness.
    *   Including any verbal cues or tones that could provide additional context and emotional resonance.
    *   If there was background music, that could further amplify the emotional tone.

**3. Are there any notable emotional tones or themes in the spoken content?**

**No, there is no spoken content in the provided frames.** We can't analyze any spoken word emotional tone or themes. 

**5. What is the request and how to respond to it?**

The request was:

> "Based on the provided analysis, answer the following:
> 1. What is the main message or theme of the video?
> 2. How do the visual and audio components complement each other in conveying the message?
> 3. Are there any notable emotional tones or themes in the spoken content?
> 5. what he request and how to respond to it?"

I have responded to the request by:

*   Providing an analysis of the visual message and theme of the video based on the available information.
*   Explicitly stating the limitation that the lack of audio prevents a discussion about how visual and audio components work together.
*   Clearly stating that the lack of audio makes it impossible to analyze any spoken content.
*   Identifying the request from the user and stating how I have responded to it.

**In summary:**

The key takeaway is that we can infer a message of interspecies affection, innocence, and companionship from the visuals. However, the lack of audio prevents us from fully understanding how the video might have conveyed its message through sound. If audio were available, we would be able to provide a much more complete and nuanced analysis.
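
The questions in `question_prompt` are numbered by hand, and the list jumps from 3 to 5; the model then echoes that numbering in its answer. A small sketch that numbers the questions automatically is shown below. `build_question_prompt` is a hypothetical helper, and only the first three questions are reused here.

```python
from typing import Sequence


def build_question_prompt(questions: Sequence[str]) -> str:
    """Return a prompt whose questions are numbered consecutively from 1."""
    header = "Based on the provided analysis, answer the following:"
    numbered = [f"{i}. {q}" for i, q in enumerate(questions, start=1)]
    return "\n".join([header, *numbered])


prompt_text = build_question_prompt([
    "What is the main message or theme of the video?",
    "How do the visual and audio components complement each other in conveying the message?",
    "Are there any notable emotional tones or themes in the spoken content?",
])
print(prompt_text)
```

The resulting string could be passed to `interact_with_llm` in place of the hand-written prompt, if that function were changed to accept the prompt as an argument.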


**Audio Analysis**

In [4]:
from gtts import gTTS
from IPython.display import Audio, display

tts = gTTS(text="Hello, my name is Amna and I am a student at PIAIC Lahore and I'm learning Agentic AI", lang='en')

# Stream the synthesized speech to an MP3 file chunk by chunk.
with open("output.mp3", "wb") as f:
  for chunk in tts.stream():
    f.write(chunk)

# Play the file inline in the notebook.
display(Audio("output.mp3", autoplay=True))

