# Introduction to Large Language Models (LLMs)

Welcome to this introductory notebook on Large Language Models (LLMs)! In this session, we will explore Google's Gemini models, focusing on the multimodal capabilities of `gemini-2.0-flash`.

## Learning Objectives
- Set up and configure an LLM API connection
- Make basic text-based queries to an LLM
- Explore multimodal capabilities (image and audio processing)
- Understand the impact of prompting on LLM responses

Let's start with a simple "Hello World" example to demonstrate the basic functionality of an LLM.

In [None]:
# Step 1: Import necessary libraries
from google import genai  # Google's Generative AI library
from dotenv import load_dotenv  # For loading environment variables
import os

# Configuration modules for the LLM
from google.genai import types

# Load environment variables from .env file
# This helps keep your API keys secure by storing them in a separate file not checked into version control
load_dotenv()

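# Read the API key from the environment and fail early if it is missing
# The API key is required to authenticate with Google's Generative AI services
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file.")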

# Default system prompt that will be used in all our interactions with the model
# System prompts help steer the model's behavior and style of responses
system_prompt_default = "You are an assistant that provides concise and accurate answers to user queries. Always say 'Thanks for your question!' at the end of your response."

def call_llm(prompt: str, system_prompt: str = system_prompt_default) -> str:
    """
    Calls the Google GenAI LLM with the provided prompt and returns the response.
    
    This function establishes a connection with Google's GenAI API, sends your prompt
    to the Gemini model, and returns the generated text response.
    
    Args:
        prompt (str): The input prompt or question to send to the LLM.
        system_prompt (str): Instructions that guide the model's behavior.
                            Defaults to the system_prompt_default defined above.
        
    Returns:
        str: The text response from the LLM.
    """
    # Initialize the client with your API key
    client = genai.Client(api_key=api_key)
    
    # Generate content using the specified model
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # Using Gemini 2.0 Flash - a fast, efficient model
        contents=prompt,           # Your input prompt/question
        
        # Configuration parameters to control the generation
        config=types.GenerateContentConfig(
            temperature=0.8,          # Controls randomness: lower = more deterministic outputs
            system_instruction=system_prompt,  # Provides overall guidance to the model
        ),
    )
    
    # Extract and return the text from the response
    return response.text.strip() if response.text else ""
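
To build some intuition for the `temperature` setting used above, here is a toy illustration. It is a minimal sketch and not how Gemini works internally: it assumes the common approach of dividing raw scores (logits) by the temperature before applying a softmax. The function name `softmax_with_temperature` and the scores are made up for this example.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, scaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability before exponentiating
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.5]

for t in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring option, while higher temperatures spread it out, which is why lower values give more deterministic outputs.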

In [None]:
# Example 1: Basic Text Prompt
# Let's send a simple factual question to test our LLM connection

print("Sending prompt: 'What is the capital of France?'")
result = call_llm("What is the capital of France?")
print("-" * 50)  # Separator for better readability
print(f"LLM Response:\n{result}")
print("-" * 50)

# Note: The model's response should include the fact that Paris is the capital of France,
# and end with "Thanks for your question!" as instructed in our system prompt.
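
The note above can be turned into a quick check. This is a minimal sketch: `ends_with_signoff` is a hypothetical helper, and the strings below are stand-ins so the check can be tried without calling the API.

```python
def ends_with_signoff(response: str, signoff: str = "Thanks for your question!") -> bool:
    """Return True if the response ends with the expected sign-off phrase."""
    return response.strip().endswith(signoff)


# Stand-in strings (not real model output) to show both outcomes
print(ends_with_signoff("The capital of France is Paris. Thanks for your question!"))
print(ends_with_signoff("The capital of France is Paris."))
```

In the notebook, you could pass `result` from the cell above instead of the stand-in strings. Keep in mind that this is a strict string match, so a response that varies the wording or punctuation of the sign-off would fail the check.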

## Exploring Multimodal Capabilities

One of the most powerful features of modern LLMs like Gemini (including gemini-2.0-flash, the model used here) is their ability to understand and process multiple modalities of information: not just text, but also images, audio, and more. This makes them useful for a much wider range of applications.

### Image Understanding

Let's explore how the model can analyze and describe images. Multimodal models can:
- Identify objects and people in images
- Describe scenes and settings
- Understand visual contexts
- Extract text from images
- Answer questions about visual content

Let's try with this image:
<br>
![Sample image 1](sample_image_1.png)

In [None]:
# Creating a function to handle image-based prompting

def call_llm_image(prompt: str, image_path: str, system_prompt: str = system_prompt_default) -> str:
    """
    Calls the Google GenAI LLM with both text prompt and image input, returning the response.
    
    This demonstrates how Gemini can process and understand images alongside text,
    enabling multimodal interactions.
    
    Args:
        prompt (str): The text prompt or question about the image.
        image_path (str): The path to the image file to analyze.
        system_prompt (str): Instructions for the model's behavior.
        
    Returns:
        str: The model's response about the image.
    """
    # Initialize the client with your API key
    client = genai.Client(api_key=api_key)
    
    # Upload the image file to the API
    uploaded_image = client.files.upload(
        file=image_path,
    )
    
    # Generate content using both the text prompt and image
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # Using Gemini 2.0 Flash 
        contents=[prompt, uploaded_image],  # Both text and image inputs
        config=types.GenerateContentConfig(
            temperature=0.2,
            system_instruction=system_prompt,
        ),
    )
    
    # Extract and return the text from the response
    return response.text.strip() if response.text else ""
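
Before uploading a file, it can help to confirm that it exists and looks like the kind of media you expect. The sketch below uses only the standard library; `check_media_file` is a hypothetical helper and is not part of the Google GenAI library. Note that `mimetypes.guess_type` relies on the file extension only and does not inspect the file's contents.

```python
import mimetypes
from pathlib import Path


def check_media_file(path: str, expected_kind: str) -> str:
    """Return the guessed MIME type if the file exists and matches expected_kind.

    expected_kind is the top-level MIME type, e.g. "image" or "audio".
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    # Guess the type from the file extension (the contents are not read)
    mime_type, _ = mimetypes.guess_type(file_path.name)
    if mime_type is None or not mime_type.startswith(expected_kind + "/"):
        raise ValueError(f"{path} does not look like {expected_kind} (guessed: {mime_type})")
    return mime_type


# Try it on the sample image; report the problem instead of crashing
try:
    print(check_media_file("sample_image_1.png", "image"))
except (FileNotFoundError, ValueError) as err:
    print(f"Skipping upload: {err}")
```

Calling a check like this before `call_llm_image` surfaces a wrong path or file type locally, before any request is sent.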

In [None]:
# Example 2: Image Analysis
# Let's ask the model to describe what it sees in our first image

print("Analyzing image: sample_image_1.png")
print("Prompt: 'What is in this image?'")

image_result = call_llm_image(
    "What is in this image?",
    "sample_image_1.png"
)

print("-" * 50)
print(f"LLM Image Response:\n{image_result}")
print("-" * 50)

# Note: The model will analyze the visual content and provide a description
# of what it sees in the image, identifying objects, scenery, etc.

### Testing with a More Complex Image

Let's challenge the model with a more complex image. Complex images can include:
- Multiple objects or subjects
- Detailed backgrounds
- Text within the image
- Abstract concepts or scenes
- Unusual perspectives

This allows us to better understand the model's visual processing capabilities and limitations.

<br>
<img src="sample_image_2.jpg" alt="Sample image 2" width="300"/>

In [None]:
# Example 3: Complex Image Analysis
# Testing the model's ability to handle more detailed or complex images

print("Analyzing image: sample_image_2.jpg")
print("Prompt: 'What is in this image?'")

image_result = call_llm_image(
    "What is in this image?",
    "sample_image_2.jpg"
)

print("-" * 50)
print(f"LLM Image Response:\n{image_result}")
print("-" * 50)

# Examining how well the model can identify and describe elements in a more complex image
# We're using the same prompt as before to compare how the model handles different levels of visual complexity

We have tried images; now let's try an audio snippet.

## Audio Processing Capabilities

Another impressive capability of multimodal LLMs is their ability to process and understand audio data.
This opens up possibilities for:

- Speech transcription
- Audio content analysis
- Voice command processing
- Language translation from spoken content
- Content summarization from audio

Let's see how Gemini can process an audio snippet and extract the spoken information.

In [None]:
# Creating a function to handle audio-based prompting

def call_llm_audio(prompt: str, audio_path: str, system_prompt: str = system_prompt_default) -> str:
    """
    Calls the Google GenAI LLM with both text prompt and audio input, returning the response.
    
    This demonstrates how Gemini can process and understand audio alongside text,
    enabling another dimension of multimodal interactions.
    
    Args:
        prompt (str): The text prompt or instruction about the audio.
        audio_path (str): The path to the audio file to process.
        system_prompt (str): Instructions for the model's behavior.
        
    Returns:
        str: The model's response about the audio content.
    """
    # Initialize the client with your API key
    client = genai.Client(api_key=api_key)
    
    # Upload the audio file to the API
    uploaded_audio = client.files.upload(
        file=audio_path,
    )
    
    # Generate content using both the text prompt and audio
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[prompt, uploaded_audio],  # Both text and audio inputs
        config=types.GenerateContentConfig(
            temperature=0.2,
            system_instruction=system_prompt,
        ),
    )
    
    # Extract and return the text from the response
    return response.text.strip() if response.text else ""
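
The text, image, and audio helpers in this notebook differ only in whether a file is uploaded first. One possible way to fold them into a single function is sketched below. `call_llm_media` is a hypothetical name, and the client and config are passed in as arguments (an assumption made here so the function can be exercised with a stand-in client, without network access).

```python
from typing import Any, Optional


def call_llm_media(
    client: Any,
    prompt: str,
    media_path: Optional[str] = None,
    config: Any = None,
    model: str = "gemini-2.0-flash",
) -> str:
    """Send a prompt, optionally with one uploaded file, and return the text.

    `client` is expected to behave like genai.Client: it needs
    client.files.upload(file=...) and client.models.generate_content(...).
    """
    contents: Any = prompt
    if media_path is not None:
        # Upload the file first, then send it alongside the text prompt
        uploaded = client.files.upload(file=media_path)
        contents = [prompt, uploaded]
    response = client.models.generate_content(
        model=model,
        contents=contents,
        config=config,
    )
    return response.text.strip() if response.text else ""
```

In this notebook it could be called as `call_llm_media(genai.Client(api_key=api_key), "Please transcribe the audio", "sample_audio.mp3", config=types.GenerateContentConfig(temperature=0.2, system_instruction=system_prompt_default))`. Leaving out `media_path` gives a text-only call.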

In [None]:
# Example 4: Basic Audio Processing
# First attempt at audio transcription with a general prompt

print("Processing audio: sample_audio.mp3")
print("Prompt: 'What is being said in this audio?'")

audio_result = call_llm_audio(
    "What is being said in this audio?",
    "sample_audio.mp3"
)

print("-" * 50)
print(f"LLM Audio Response:\n{audio_result}")
print("-" * 50)

# This example demonstrates how the model can listen to audio content
# and extract the spoken information using a general prompt

In [None]:
# Example 5: Improved Audio Processing with Better Prompting
# Demonstrating how prompt engineering can affect results

print("Processing audio: sample_audio.mp3")
print("Improved prompt: 'Please transcribe the audio'")

audio_result_better_prompt = call_llm_audio(
    "Please transcribe the audio",
    "sample_audio.mp3"
)

print("-" * 50)
print(f"LLM Audio Response with better prompt:\n{audio_result_better_prompt}")
print("-" * 50)

# This example demonstrates an important concept in working with LLMs:
# The quality and specificity of your prompt can significantly affect the results.
# By asking specifically for a transcription rather than a general description,
# we may get more accurate or detailed text from the audio content.
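
To make the comparison between the two prompts more concrete, you could diff the two responses. This is a minimal sketch using the standard library's `difflib`; `compare_responses` is a hypothetical helper, and the two strings are made-up stand-ins, not real model output.

```python
import difflib


def compare_responses(first: str, second: str) -> float:
    """Print a word-level diff of two responses and return their similarity (0.0 to 1.0)."""
    first_words = first.split()
    second_words = second.split()
    diff = difflib.unified_diff(
        first_words, second_words, fromfile="general", tofile="specific", lineterm=""
    )
    for line in diff:
        print(line)
    return difflib.SequenceMatcher(None, first_words, second_words).ratio()


# Made-up stand-ins; in the notebook, pass audio_result and audio_result_better_prompt
general = "The speaker greets the audience and introduces the topic."
specific = "Hello everyone, today we will talk about language models."
score = compare_responses(general, specific)
print(f"Similarity: {score:.2f}")
```

A low similarity score only says the two responses are worded differently; judging which one is the better transcription still requires listening to the audio.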

# Conclusion

In this notebook, we've explored the fundamental capabilities of Large Language Models, specifically Google's Gemini 2.0 Flash model:

1. **Text Processing**: We started with basic text queries, showing how the model can answer factual questions.

2. **Multimodal Capabilities**:
   - **Image Analysis**: We demonstrated how the model can interpret and describe both simple and complex images.
   - **Audio Processing**: We showed how the model can transcribe and interpret audio content.

3. **Prompt Engineering**: We demonstrated how the quality and specificity of prompts can significantly affect the results you get from an LLM.

## Key Takeaways

- Modern LLMs like Gemini can process multiple types of data (text, images, audio) in a single interaction
- The way you phrase your prompts has a meaningful impact on the quality of responses
- System prompts can help guide the model's behavior and response style

## Next Steps

To continue learning about LLMs, you might want to explore:
- More advanced prompt engineering techniques
- Fine-tuning models for specific use cases
- Implementing LLMs in real-world applications
- Exploring limitations and ethical considerations of LLM usage

### Additional Resources and References 

This notebook is based mostly on this YouTube video:
- [Gemini API Introduction](https://youtu.be/qfWpPEgea2A)