# Assignment: Prompt Latency and Performance Testing with Google Generative AI API



---

### Objective:
This assignment aims to provide hands-on experience in measuring and analyzing the **latency and performance characteristics** of Large Language Model (LLM) API calls. You will specifically use the Google Generative AI API (e.g., Gemini Pro) to understand how factors like prompt complexity, length, and concurrency impact response times and overall throughput. This knowledge is crucial for designing efficient and responsive AI applications.

---

### Instructions:
1.  **Google Cloud Project & API Key**: Ensure you have a Google Cloud Project with the Google Generative AI API enabled and an API Key generated. You can get started at [Google AI Studio](https://aistudio.google.com/).
2.  **Environment Setup**: Install the necessary Python libraries: `pip install google-generativeai matplotlib seaborn pandas numpy`.
3.  **Jupyter Notebook**: All your code, measured data, plots, observations, and analysis must be documented in this Jupyter Notebook.
4.  **Data Collection**: For each test, collect enough data points (e.g., 20-50 samples) that the mean, median, and spread are reasonably stable from run to run. Store results in lists or Pandas DataFrames for easy analysis.
5.  **Analysis and Visualization**: Use `matplotlib` and `seaborn` to visualize your latency data (e.g., histograms, box plots, line plots).
6.  **Critical Thinking**: Beyond just reporting numbers, analyze *why* certain results occur and discuss their implications for real-world applications.

**Important**: Be mindful of your API usage and potential costs, especially when running many requests or concurrent tests. Free-tier limits vary by model and can change over time, so check the current quotas for your key and monitor your usage.

---

## Part 1: Setup and Basic Latency Measurement
In this section, you'll configure your API access and measure the baseline latency for a simple prompt.

### Task 1.1: API Configuration
Configure the `google-generativeai` library with your API key. Select a model (e.g., `gemini-pro`).

In [None]:
import google.generativeai as genai
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

# --- YOUR API KEY HERE ---
# Prefer loading the key from the GOOGLE_API_KEY environment variable so it never appears in the notebook.
# The placeholder fallback is for local experiments only; if you paste a real key here,
# do not share or commit the notebook with the key in it.
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "YOUR_API_KEY_HERE")

genai.configure(api_key=GOOGLE_API_KEY)

# Choose a model (e.g., 'gemini-pro' for text generation).
# Available model names change over time; genai.list_models() shows what your key can access.
model = genai.GenerativeModel('gemini-pro')

print("API configured and model loaded!")

### Task 1.2: Basic Prompt Latency Test
Measure the time taken for a simple, short prompt. Repeat the call multiple times (e.g., 20-50 times) to get an average and understand variability.

In [None]:
num_runs = 30 # Number of times to run the test
simple_prompt = "Tell me a very short, one-sentence story."
latencies_simple = []
responses_simple = []

print(f"Running {num_runs} tests for simple prompt...")
for i in range(num_runs):
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    try:
        response = model.generate_content(simple_prompt)
        latency = time.perf_counter() - start_time
        text = response.text.strip()  # may raise if the response was blocked or empty
        # Append only after the text was read, so a failed run is recorded exactly once.
        latencies_simple.append(latency)
        responses_simple.append(text)
        print(f"Run {i+1}: Latency = {latency:.4f} seconds")
    except Exception as e:
        print(f"Run {i+1}: Error - {e}")
        latencies_simple.append(np.nan) # Mark as NaN for error

df_simple = pd.DataFrame({"latency": latencies_simple})

print("\n--- Results for Simple Prompt ---")
print(df_simple.describe())

plt.figure(figsize=(8, 5))
sns.histplot(df_simple['latency'].dropna(), kde=True)
plt.title('Distribution of Latency for Simple Prompt')
plt.xlabel('Latency (seconds)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Optionally print a few responses to check quality
print("\n--- Sample Responses ---")
for i, resp in enumerate(responses_simple[:5]): # Print first 5 responses
    print(f"Response {i+1}: {resp}")

### Analysis for Task 1.2:
* What are the mean, median, and standard deviation of latency for the simple prompt?
* Observe the histogram: Is the latency distribution approximately normal or skewed, and does it have outliers?
* Does the model consistently provide a one-sentence story as requested?
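
One way to approach these questions is to compute the summary statistics and a simple outlier check directly. The sketch below uses only the standard library and an invented list of latencies (`sample_latencies` is hypothetical, not measured data); in the notebook you would pass `df_simple['latency'].dropna().tolist()` instead. The 1.5 times IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics

def summarize_latencies(latencies):
    """Return summary statistics and IQR-based outliers for a list of latencies (seconds)."""
    data = sorted(latencies)
    # With n=4, statistics.quantiles returns the three quartile cut points (Q1, median, Q3).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
        "q1": q1,
        "q3": q3,
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

# Hypothetical values for illustration only (not real measurements).
sample_latencies = [0.8, 0.9, 0.85, 0.95, 1.0, 0.9, 0.88, 3.2]
summary = summarize_latencies(sample_latencies)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A mean noticeably above the median, as in this invented sample, is a typical sign of a right-skewed distribution caused by a few slow requests.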

---

## Part 2: Impact of Prompt Complexity and Length
Investigate how changes in prompt characteristics affect latency.

### Task 2.1: Varying Prompt Length
Create prompts of varying lengths (short, medium, long) and compare their latencies. Aim for approximate token counts (e.g., <20, 50-100, >200 input tokens).
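
Character counts are only a proxy for token counts. The sketch below applies a rough rule of thumb of about four characters per token for English text to check whether a prompt falls in the intended range. The helper names `estimate_tokens` and `length_bucket` and the four-characters-per-token ratio are assumptions made for illustration; for exact counts, the `google-generativeai` library offers `model.count_tokens(prompt)`, which requires an API call.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; the chars-per-token ratio is an assumed heuristic, not an exact count."""
    return max(1, round(len(text) / chars_per_token))

def length_bucket(token_count):
    """Map a token count onto the short/medium/long ranges suggested in this task."""
    if token_count < 20:
        return "short"
    if 50 <= token_count <= 100:
        return "medium"
    if token_count > 200:
        return "long"
    return "between buckets"

example_prompts = {
    "short": "Define AI.",
    "padded": "Summarize the following text. " + "word " * 300,  # hypothetical long prompt
}
for name, prompt in example_prompts.items():
    tokens = estimate_tokens(prompt)
    print(f"{name}: ~{tokens} tokens -> {length_bucket(tokens)}")
```

Running a check like this on your own prompts before testing helps confirm that each one really lands in the category its label claims.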

In [None]:
num_runs_length = 20 # Number of runs for each length category

prompts_by_length = {
    "short": "Define AI.",
    "medium": "Explain the concept of machine learning in simple terms, providing a real-world example.",
    "long": "Elaborate on the historical development of artificial intelligence, starting from its early conceptualization in the mid-20th century, discussing key milestones, influential figures, and significant breakthroughs that have shaped the field into what it is today. Also, touch upon the ethical implications of AI development and its potential future impact on society. Aim for about 200-300 words."
}

results_length = []

for length_type, prompt_text in prompts_by_length.items():
    print(f"\nRunning tests for {length_type} prompt ({len(prompt_text)} chars)...")
    for i in range(num_runs_length):
        start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        try:
            response = model.generate_content(prompt_text)
            end_time = time.perf_counter()
            latency = end_time - start_time
            results_length.append({
                "prompt_type": length_type,
                "latency": latency,
                "input_length_chars": len(prompt_text),
                "output_length_chars": len(response.text.strip())
            })
            # print(f"  Run {i+1}: Latency = {latency:.4f}s")
        except Exception as e:
            print(f"  Run {i+1}: Error - {e}")
            results_length.append({
                "prompt_type": length_type,
                "latency": np.nan,
                "input_length_chars": len(prompt_text),
                "output_length_chars": np.nan
            })

df_length = pd.DataFrame(results_length)
print("\n--- Results for Prompt Length Variation ---")
print(df_length.groupby('prompt_type')['latency'].describe())

plt.figure(figsize=(10, 6))
sns.boxplot(x='prompt_type', y='latency', data=df_length.dropna())
plt.title('Latency by Prompt Length')
plt.xlabel('Prompt Type')
plt.ylabel('Latency (seconds)')
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(x='input_length_chars', y='latency', hue='prompt_type', data=df_length.dropna())
plt.title('Latency vs. Input Prompt Length (Characters)')
plt.xlabel('Input Prompt Length (Characters)')
plt.ylabel('Latency (seconds)')
plt.grid(True)
plt.show()

### Analysis for Task 2.1:
* How does prompt length correlate with latency? Is it linear?
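* Does the output length also affect latency? The code already records `output_length_chars`, so plot it against latency. Note that the sample long prompt also requests a longer answer, so input and output length vary together in this test; consider how you would separate the two effects.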
* What are the implications for designing prompts for latency-sensitive applications?
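
To test whether latency scales linearly with output length, one option is an ordinary least-squares fit of latency against output characters. Under that model the intercept approximates the fixed per-request overhead and the slope approximates the generation time per character. The data points below are invented and constructed to lie exactly on a line; in the notebook, the `output_length_chars` and `latency` columns of `df_length` (after `dropna()`) would take their place.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x; also returns Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Hypothetical (output_length_chars, latency_seconds) pairs, not real measurements.
output_chars = [100, 400, 800, 1200, 1600]
latency_s = [0.7, 1.3, 2.1, 2.9, 3.7]

slope, intercept, r = fit_line(output_chars, latency_s)
print(f"fixed overhead ~ {intercept:.2f} s, "
      f"per-character cost ~ {slope * 1000:.2f} ms, r = {r:.3f}")
```

Real measurements will scatter around the line; a value of `r` well below 1 or a visibly curved scatter plot would suggest that the relationship is not simply linear.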

### Task 2.2: Varying Prompt Complexity
Design prompts that require different levels of reasoning or creativity (e.g., simple factual lookup, creative writing, complex problem-solving). Keep them roughly similar in length but vary complexity.

In [None]:
num_runs_complexity = 20

prompts_by_complexity = {
    "factual": "What is the capital of France?",
    "creative": "Write a 1-sentence poetic description of a sunset.",
    "reasoning": "If all A are B, and some B are C, can we conclude that some A are C? Explain your reasoning in one short paragraph."
}

results_complexity = []

for comp_type, prompt_text in prompts_by_complexity.items():
    print(f"\nRunning tests for {comp_type} prompt...")
    for i in range(num_runs_complexity):
        start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        try:
            response = model.generate_content(prompt_text)
            end_time = time.perf_counter()
            latency = end_time - start_time
            results_complexity.append({
                "prompt_type": comp_type,
                "latency": latency,
                "input_length_chars": len(prompt_text),
                "output_length_chars": len(response.text.strip())
            })
            # print(f"  Run {i+1}: Latency = {latency:.4f}s")
        except Exception as e:
            print(f"  Run {i+1}: Error - {e}")
            results_complexity.append({
                "prompt_type": comp_type,
                "latency": np.nan,
                "input_length_chars": len(prompt_text),
                "output_length_chars": np.nan
            })

df_complexity = pd.DataFrame(results_complexity)
print("\n--- Results for Prompt Complexity Variation ---")
print(df_complexity.groupby('prompt_type')['latency'].describe())

plt.figure(figsize=(10, 6))
sns.boxplot(x='prompt_type', y='latency', data=df_complexity.dropna())
plt.title('Latency by Prompt Complexity')
plt.xlabel('Prompt Type')
plt.ylabel('Latency (seconds)')
plt.grid(True)
plt.show()

### Analysis for Task 2.2:
* Does prompt complexity (even with similar length) impact latency? Why or why not?
* Were there any differences in the consistency or quality of responses across complexity levels?
* How might this inform prompt engineering for different task types?
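
Because complex prompts often produce longer answers, raw latency can conflate reasoning difficulty with output length. A possible way to separate the two is to compare latency per output character across prompt types. The records below are hypothetical and only mirror the fields stored in `results_complexity`; the helper name `latency_per_char` is likewise an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records shaped like the entries in results_complexity.
records = [
    {"prompt_type": "factual", "latency": 0.6, "output_length_chars": 30},
    {"prompt_type": "factual", "latency": 0.7, "output_length_chars": 35},
    {"prompt_type": "reasoning", "latency": 2.4, "output_length_chars": 600},
    {"prompt_type": "reasoning", "latency": 2.0, "output_length_chars": 500},
]

def latency_per_char(records):
    """Mean latency in milliseconds per output character, grouped by prompt type."""
    grouped = defaultdict(list)
    for rec in records:
        chars = rec["output_length_chars"]
        latency = rec["latency"]
        # NaN is never equal to itself, so these comparisons skip failed runs;
        # the truthiness check skips empty outputs and avoids division by zero.
        if chars and chars == chars and latency == latency:
            grouped[rec["prompt_type"]].append(1000 * latency / chars)
    return {ptype: mean(values) for ptype, values in grouped.items()}

normalized = latency_per_char(records)
for ptype, ms_per_char in normalized.items():
    print(f"{ptype}: {ms_per_char:.1f} ms per output character")
```

In this invented data the reasoning prompt is slower in absolute terms but cheaper per character, which would point to output length rather than reasoning difficulty as the main driver. Very short outputs are dominated by fixed overhead, so interpret their per-character figures with care.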

---

## Part 3: Concurrency and Rate Limits
Explore how sending multiple requests, either sequentially or concurrently, affects overall performance and observe rate limiting.

### Task 3.1: Sequential vs. Concurrent Requests
Compare the total time taken to process a batch of requests (e.g., 10-20 requests) when sent sequentially versus concurrently. For concurrency, you can use `asyncio` or `concurrent.futures.ThreadPoolExecutor`.
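
For the `asyncio` route, a minimal sketch follows. It uses a stand-in coroutine, `fake_api_call`, that only sleeps, so it runs without an API key; the sleep duration and the concurrency cap of 5 are arbitrary assumptions. With the real library you would await `model.generate_content_async(prompt)` in its place. Inside Jupyter, where an event loop is already running, use `await run_batch(prompts)` instead of `asyncio.run(...)`.

```python
import asyncio
import time

async def fake_api_call(prompt, delay=0.05):
    """Stand-in for an async API call (hypothetical); returns its own latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start

async def run_batch(prompts, max_concurrency=5):
    """Run all prompts concurrently with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited_call(prompt):
        async with semaphore:
            return await fake_api_call(prompt)

    return await asyncio.gather(*(limited_call(p) for p in prompts))

prompts = ["Generate a random fact."] * 10
start_total = time.perf_counter()
latencies = asyncio.run(run_batch(prompts))
total = time.perf_counter() - start_total
print(f"{len(latencies)} calls in {total:.2f} s "
      f"(sum of individual latencies: {sum(latencies):.2f} s)")
```

The semaphore plays the same role as `max_workers` in a thread pool: it caps how many requests are in flight at once, which matters when the API enforces rate limits.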

In [None]:
import asyncio
from concurrent.futures import ThreadPoolExecutor

num_batch_requests = 10 # Number of requests in a batch
prompt_for_batch = "Generate a random fact."

# --- Sequential Requests ---
print("\n--- Running Sequential Requests ---")
sequential_latencies = []
start_total_seq = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
for i in range(num_batch_requests):
    start_call = time.perf_counter()
    try:
        response = model.generate_content(prompt_for_batch)
        end_call = time.perf_counter()
        sequential_latencies.append(end_call - start_call)
    except Exception as e:
        print(f"Sequential request {i+1} error: {e}")
        sequential_latencies.append(np.nan)

end_total_seq = time.perf_counter()
total_time_sequential = end_total_seq - start_total_seq
print(f"Total time for {num_batch_requests} sequential requests: {total_time_sequential:.4f} seconds")

# --- Concurrent Requests (using ThreadPoolExecutor for simplicity) ---
print("\n--- Running Concurrent Requests (ThreadPoolExecutor) ---")
concurrent_latencies = []

def make_api_call(prompt):
    """Send one request and return its latency in seconds (NaN on error)."""
    start_call = time.perf_counter()
    try:
        response = model.generate_content(prompt)
        end_call = time.perf_counter()
        return end_call - start_call
    except Exception as e:
        print(f"Concurrent request error: {e}")
        return np.nan

start_total_conc = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor: # You can adjust max_workers
    futures = [executor.submit(make_api_call, prompt_for_batch) for _ in range(num_batch_requests)]
    for future in futures:
        concurrent_latencies.append(future.result())

end_total_conc = time.perf_counter()
total_time_concurrent = end_total_conc - start_total_conc
print(f"Total time for {num_batch_requests} concurrent requests: {total_time_concurrent:.4f} seconds")

print("\n--- Comparison ---")
print(f"Speedup (Sequential / Concurrent): {total_time_sequential / total_time_concurrent:.2f}x")


### Analysis for Task 3.1:
* How much faster are concurrent requests compared to sequential ones for the given batch size?
* What is the theoretical maximum speedup you might expect, and why might you not achieve it?
* Discuss the trade-offs of using concurrency (e.g., complexity, resource usage, rate limits).
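
For the theoretical-speedup question, a simple model helps: if every request took the same time, a pool of `W` workers would process `N` requests in `ceil(N / W)` waves, so the speedup could not exceed `N / ceil(N / W)`, which is at most `W`. The sketch below computes that bound and compares it with a measured speedup, using a sleep-based stand-in for the API call (`fake_api_call` and its delay are hypothetical).

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def ideal_speedup(num_requests, max_workers):
    """Upper bound on speedup, assuming identical per-request latency and no overhead."""
    waves = math.ceil(num_requests / max_workers)
    return num_requests / waves

def fake_api_call(delay=0.05):
    """Stand-in for a blocking API call (hypothetical)."""
    time.sleep(delay)

def measure(num_requests, max_workers):
    """Return (sequential_seconds, concurrent_seconds) for the stand-in call."""
    start = time.perf_counter()
    for _ in range(num_requests):
        fake_api_call()
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces all calls to complete before the timer stops.
        list(executor.map(lambda _: fake_api_call(), range(num_requests)))
    concurrent = time.perf_counter() - start
    return sequential, concurrent

seq_s, conc_s = measure(num_requests=10, max_workers=5)
print(f"ideal bound: {ideal_speedup(10, 5):.1f}x, measured: {seq_s / conc_s:.1f}x")
```

Against a real API the measured figure usually falls short of the bound, because request latencies vary, the slowest request in each wave sets the pace, and the server may queue or throttle concurrent calls.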

### Task 3.2: Observing Rate Limiting (Optional/Careful)
Deliberately approach the API's rate limits by sending a large number of requests in a short period. Observe the error messages and how the API responds once the limit is reached.

**Caution**: This might temporarily block your API key or incur costs if you exceed free tier limits. Proceed with care and be ready to stop the execution if necessary.

In [None]:
num_aggressive_requests = 100 # Try a large number, e.g., 100-200. Adjust based on your rate limit.
prompt_aggressive = "Generate a single random word."
aggressive_results = []

print(f"\n--- Attempting to hit Rate Limit with {num_aggressive_requests} requests ---")
for i in range(num_aggressive_requests):
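    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    try:
        response = model.generate_content(prompt_aggressive)
        end_time = time.perf_counter()
        latency = end_time - start_time
        # Include error_message on success too, so the column exists even if no request fails.
        aggressive_results.append({"status": "success", "latency": latency, "error_message": None})
    except Exception as e:
        end_time = time.perf_counter()
        latency = end_time - start_time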
        aggressive_results.append({"status": "error", "latency": latency, "error_message": str(e)})
        print(f"Request {i+1} failed: {e}")
        # Optional: Add a small delay if you hit rate limits frequently
        # time.sleep(0.5)

df_aggressive = pd.DataFrame(aggressive_results)
print("\n--- Aggressive Test Results ---")
print(df_aggressive['status'].value_counts())
print(df_aggressive.loc[df_aggressive['status'] == 'error', 'error_message'].value_counts())


### Analysis for Task 3.2:
* Did you encounter rate limiting? What specific error messages did you receive (e.g., `ResourceExhausted` or `TooManyRequests`)?
* If rate limiting occurred, after how many requests (or how much elapsed time) did it take effect?
* What strategies can you implement in your application to handle rate limits gracefully (e.g., backoff, retry mechanisms, token buckets)?
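
As one illustration of the backoff strategy, the sketch below retries a call with exponential backoff and random jitter. `call_with_backoff`, `RateLimitError`, and the delay parameters are hypothetical names and values chosen for this example; with the real API you would catch the library's rate-limit exception (such as `ResourceExhausted`) instead. The `sleep` argument is injectable so that the demo runs instantly.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the API's rate-limit exception."""

def call_with_backoff(func, max_retries=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait up to base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, delay] so clients do not retry in lockstep.
            sleep(random.uniform(0, delay))

# Demo: a fake call that is rate-limited twice and then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("429: quota exceeded")
    return "ok"

waits = []
result = call_with_backoff(flaky_call, sleep=waits.append)  # record waits instead of sleeping
print(result, attempts["count"], [round(w, 2) for w in waits])
```

Capping the delay and the number of retries keeps a persistent outage from stalling the application indefinitely; the specific limits are a design choice that depends on how long a user or batch job can reasonably wait.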

---

## Part 4: Conclusion and Reflection
In a markdown cell, provide a comprehensive summary of your findings and reflections based on this assignment.

* **Key Latency Factors**: Summarize the primary factors that influence LLM API latency based on your experiments.
* **Performance Bottlenecks**: Identify potential performance bottlenecks in integrating LLMs into applications.
* **Strategies for Optimization**: What are your top recommendations for optimizing LLM API usage for better latency and throughput?
* **Implications for AI Applications**: How does understanding latency affect the design of real-time vs. batch processing AI applications?
* **Ethical Considerations**: Are there any ethical implications related to performance (e.g., equitable access, environmental impact of computation)?

---

### Submission:
* Ensure all code cells have been executed and their outputs (including prints and plots) are visible.
* Ensure all analysis and reflections are clearly written in markdown cells.
* Save your Jupyter Notebook as `[YourName]_LLM_Latency_Assignment.ipynb`.