# Assignment: Build a Long-Running Summarization + Citation Agent Loop



---

### Objective:
This assignment challenges you to build a **long-running agent loop** that continuously processes new information, summarizes it, and, critically, **provides citations for the summarized content**. This simulates a real-world scenario in which an AI acts as a research assistant, continuously ingesting data and providing attributed insights. You'll focus on prompt engineering for summarization and citation, and on managing the loop for continuous processing.

---

### Instructions:
1.  **LLM Access**: You'll need access to an LLM API. **OpenAI's models (e.g., GPT-4o, GPT-4, GPT-3.5-turbo)** are recommended for their strong summarization and instruction-following capabilities. Configure your API key securely (e.g., via environment variables).
2.  **Environment Setup**: Install the required Python libraries: `pip install openai python-dotenv`. The code in this notebook uses the `openai` 1.x client interface (`openai.chat.completions`).
3.  **Simulated Data Source**: You'll simulate a stream of 'new' articles or documents, each with content and a source/URL. This will be a simple Python list of dictionaries.
4.  **Agent Loop**: The core of the assignment is a loop that:
    * Picks a new document from your simulated stream.
    * Sends its content to the LLM for summarization and citation extraction.
    * Stores the summary and citation.
    * Can be stopped by a specific command or after processing all available documents.
5.  **Prompt Engineering**: Design a robust prompt that guides the LLM to:
    * Summarize the provided text concisely.
    * Explicitly cite the source (URL/title) provided with the text.
    * Handle cases where information might be missing or incomplete.
6.  **Jupyter Notebook**: All your code, outputs, observations, and analysis must be documented in this Jupyter Notebook.
7.  **Analysis**: Explain your prompt design, the loop mechanism, and the challenges of ensuring accurate citations.

---

## Part 1: Setup and LLM Configuration
Begin by setting up your environment and configuring your LLM.

### Task 1.1: Install Libraries and Configure LLM
Install `openai` and `python-dotenv`. Set up your OpenAI API key from environment variables.

In [None]:
# Install necessary libraries (if not already installed)
# !pip install openai python-dotenv --quiet

import openai
import os
import asyncio
import time
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# --- IMPORTANT: Create a .env file in the same directory as this notebook with the line: ---
# OPENAI_API_KEY="YOUR_OPENAI_API_KEY_HERE"

# Configure OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

if not openai.api_key:
    print("WARNING: OPENAI_API_KEY not found in environment variables. Please set it in .env file.")
else:
    print("OpenAI API key loaded!")

# Define the LLM model to use
LLM_MODEL = "gpt-3.5-turbo" # You can use "gpt-4o", "gpt-4", etc. if you have access

print(f"Using LLM Model: {LLM_MODEL}")

### Task 1.2: Simulate a Stream of Articles
Create a list of dictionaries, where each dictionary represents an article with `id`, `title`, `url`, and `content`. This will act as your continuous data source.

In [None]:
article_stream = [
    {
        "id": 1,
        "title": "Quantum Computing Breakthrough",
        "url": "https://example.com/quantum-breakthrough",
        "content": "Scientists at Quantum Labs have announced a significant breakthrough in quantum entanglement, achieving stable qubits for over 10 minutes. This could pave the way for more powerful quantum computers, moving beyond the current limitations of decoherence. The research paper, published in 'Nature Physics', details their novel method of using cryogenic chambers and magnetic shielding to maintain qubit integrity. This advancement is crucial for error correction and scaling quantum systems."
    },
    {
        "id": 2,
        "title": "New AI Model for Drug Discovery",
        "url": "https://example.com/ai-drug-discovery",
        "content": "AlphaMolecule, a new AI model developed by BioGen AI, has shown promising results in accelerating drug discovery. The model can predict molecular interactions with high accuracy, drastically reducing the time and cost associated with traditional pharmaceutical research. Early trials indicate a 20% improvement in identifying potential drug candidates for various diseases, including a new treatment for Alzheimer's. The model was trained on a vast dataset of chemical compounds and biological targets."
    },
    {
        "id": 3,
        "title": "Impact of Climate Change on Global Food Supply",
        "url": "https://example.com/climate-food-supply",
        "content": "A recent report from the UN's IPCC warns of severe disruptions to global food supply chains due to accelerating climate change. Extreme weather events, prolonged droughts, and unpredictable rainfall patterns are leading to crop failures and reduced agricultural yields in key farming regions. The report emphasizes the urgent need for climate-resilient farming practices and diversified food sources to ensure future food security. It also highlights the disproportionate impact on developing nations."
    },
    {
        "id": 4,
        "title": "The Rise of Sustainable Urban Planning",
        "url": "https://example.com/sustainable-cities",
        "content": "Cities worldwide are increasingly adopting sustainable urban planning principles to combat climate change and improve livability. Initiatives include green infrastructure development, promoting public transportation, establishing car-free zones, and investing in renewable energy sources for urban consumption. Singapore, Copenhagen, and Curitiba are cited as leading examples, demonstrating how integrated planning can lead to reduced carbon footprints and enhanced public well-being. Community engagement is a critical component of these successful projects."
    }
]

processed_articles = [] # To store summaries and citations

print(f"Simulated stream of {len(article_stream)} articles created.")

---

## Part 2: Design the Summarization + Citation Agent
You'll create a function that takes an article and generates a summary with a citation using the LLM. This requires careful prompt engineering.

### Task 2.1: Implement `summarize_with_citation` Function
This asynchronous function will be the core of your agent. It takes an article dictionary and returns a structured summary including the citation.

* **Prompt Design**: Your prompt should instruct the LLM to:
    * Summarize the provided `content` concisely (e.g., 2-3 sentences).
    * Explicitly include a citation at the end of the summary in the format `(Source: [Title], [URL])`.
    * Simplify complex information where the source text is dense or technical.
    * Handle cases where content might be very short or irrelevant (though for this assignment, assume relevant content).
* **Output Format**: You might want the LLM to output a specific format, or you can parse its response. For simplicity, instruct the LLM to directly output the combined summary and citation string.
* **Error Handling**: Include `try-except` blocks for API errors.

In [None]:
async def summarize_with_citation(article: dict) -> dict:
    """
    Sends an article's content to the LLM for summarization and citation.
    Returns a dictionary with 'summary', 'citation_title', and 'citation_url'.
    """
    title = article.get("title", "Unknown Title")
    url = article.get("url", "No URL provided")
    content = article.get("content", "")

    if not content:
        return {
            "summary": "No content provided for summarization.",
            "citation_title": title,
            "citation_url": url
        }

    system_message = (
        "You are an expert summarizer. Your task is to provide a concise summary (2-3 sentences) of the given article content. "
        "Crucially, you must append a clear citation in the format: '(Source: [Article Title], [Article URL])'. "
        "Reproduce the title and URL exactly as given; do not alter or invent them. "
        "Ensure the summary accurately reflects the content and the citation is precise."
    )

    user_message = (
        f"Please summarize the following article and include its source:\n\n"
        f"Article Title: {title}\n"
        f"Article URL: {url}\n\n"
        f"Content:\n{content}"
    )

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message}
    ]

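    try:
        # In openai>=1.0 the module-level openai.chat.completions.create call is
        # synchronous and cannot be awaited, so use the async client instead.
        client = openai.AsyncOpenAI(api_key=openai.api_key)
        response = await client.chat.completions.create(
            model=LLM_MODEL,
            messages=messages,
            temperature=0.2, # Keep temperature low for factual accuracy
            max_tokens=200 # Sufficient tokens for summary and citation
        )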
        # message.content can be None, so fall back to an empty string
        full_response_content = response.choices[0].message.content or ""

        # Strip the trailing "(Source: ...)" citation from the summary text, if present.
        # The citation fields returned below always come from the input title and URL,
        # so a misformatted or altered citation from the LLM never reaches the stored result.
        # A production system would use a regex or structured output instead.
        summary_part = full_response_content.strip()
        citation_start = summary_part.rfind("(Source:")
        if citation_start != -1:
            summary_part = summary_part[:citation_start].strip()

        return {
            "summary": summary_part,
            "citation_title": title, # Use the original title from the input
            "citation_url": url      # Use the original URL from the input
        }
    except openai.APIError as e:
        print(f"OpenAI API error during summarization: {e}")
        return {
            "summary": f"Error summarizing article: {e}",
            "citation_title": title,
            "citation_url": url
        }
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return {
            "summary": f"An unexpected error occurred: {e}",
            "citation_title": title,
            "citation_url": url
        }

print("summarize_with_citation function defined!")
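
The comments in the function above mention that a real system would use a regex to parse the citation. A minimal sketch of that idea follows, assuming the LLM keeps the `(Source: [Title], [URL])` format and that URLs contain no spaces. The names `CITATION_RE` and `split_summary_and_citation` are hypothetical helpers, not part of the starter code.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper pattern: a trailing "(Source: <title>, <url>)".
# The greedy title group means the LAST comma separates title from URL,
# so titles that themselves contain commas are still handled.
CITATION_RE = re.compile(r"\(Source:\s*(?P<title>.+),\s*(?P<url>\S+)\)\s*$")


def split_summary_and_citation(text: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (summary, title, url); title and url are None when no citation is found."""
    cleaned = text.strip()
    match = CITATION_RE.search(cleaned)
    if match is None:
        return cleaned, None, None
    summary = cleaned[: match.start()].strip()
    return summary, match.group("title").strip(), match.group("url").strip()
```

Comparing the parsed title and URL against the values sent in the prompt is one way to notice when the model has altered or dropped the citation.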

---

## Part 3: Build the Long-Running Agent Loop
Now, you'll create the main loop that iterates through your article stream, calls the summarization agent, and stores the results. You'll also implement a way to manage the loop (e.g., stopping condition).

### Task 3.1: Implement `run_summarization_agent_loop`
Create an asynchronous function that orchestrates the entire process.

* **Loop Control**: The loop should run until all articles in `article_stream` are processed, or until a simulated 'stop' condition (e.g., an empty stream).
* **Processing Logic**: In each iteration:
    1.  Fetch the next article from `article_stream` (you can `pop(0)` or just iterate and keep track of index).
    2.  Call `summarize_with_citation` with the article.
    3.  Store the result in `processed_articles` list.
    4.  Print the current summary and citation for real-time observation.
    5.  Include a small delay (`await asyncio.sleep(X)`) to simulate a long-running process and avoid hitting API rate limits too quickly.
* **Reporting**: After the loop finishes, print a final summary of all processed articles or confirmation of completion.

In [None]:
async def run_summarization_agent_loop(articles_to_process: list, delay_seconds: int = 5):
    """
    Runs a long-running loop to summarize articles with citations.
    """
    global processed_articles # Ensure we're modifying the global list
    processed_articles = [] # Reset for a fresh run

    print("\n--- Starting Long-Running Summarization Agent Loop ---")
    print(f"Processing {len(articles_to_process)} articles with a delay of {delay_seconds} seconds between each.")

    for i, article in enumerate(articles_to_process):
        print(f"\n--- Processing Article {i+1}/{len(articles_to_process)}: '{article.get('title', 'N/A')}' ---")

        # Call the LLM to summarize the article and attach its citation
        summary_data = await summarize_with_citation(article)
        processed_articles.append(summary_data)

        print("Summary:")
        print(summary_data["summary"].strip())
        print(f"Citation: (Source: {summary_data['citation_title']}, {summary_data['citation_url']})")

        if i < len(articles_to_process) - 1: # Don't delay after the last article
            print(f"\nWaiting {delay_seconds} seconds before next article...")
            await asyncio.sleep(delay_seconds)

    print("\n--- Long-Running Summarization Agent Loop Finished ---")
    print(f"Processed {len(processed_articles)} articles (check each summary for error messages).")

print("run_summarization_agent_loop function defined!")
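
The loop above stops only when the article list is exhausted. The instructions also mention stopping on a specific command; one possible way to sketch that is with an `asyncio.Event` that another task (or a notebook cell) can set. The function name `run_until_stopped` and its `summarize` parameter are assumptions made for illustration; in the notebook, `summarize` would be `summarize_with_citation`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_until_stopped(
    articles: List[dict],
    summarize: Callable[[dict], Awaitable[dict]],
    stop_event: asyncio.Event,
    delay_seconds: float = 5.0,
) -> List[dict]:
    """Process articles in order until the list is exhausted or stop_event is set."""
    results: List[dict] = []
    for article in articles:
        if stop_event.is_set():
            break  # A stop was requested: finish cleanly without starting new work
        results.append(await summarize(article))
        # Sleep between articles, but wake up early if a stop is requested.
        try:
            await asyncio.wait_for(stop_event.wait(), timeout=delay_seconds)
        except asyncio.TimeoutError:
            pass  # Normal case: the delay elapsed and no stop was requested
    return results
```

Calling `stop_event.set()` from anywhere in the same event loop ends the run after the article currently being summarized.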

---

## Part 4: Execute and Observe
Run the agent loop and review its output.

### Task 4.1: Run the Agent Loop
Execute the `run_summarization_agent_loop` function with your `article_stream`.

In [None]:
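# Run the loop. Top-level `await` works in Jupyter; in a plain Python script,
# use asyncio.run(run_summarization_agent_loop(article_stream, delay_seconds=7)) instead.
await run_summarization_agent_loop(article_stream, delay_seconds=7)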

### Task 4.2: Review Processed Articles
Inspect the `processed_articles` list to see the collected summaries and citations.

In [None]:
print("\n--- All Processed Summaries with Citations ---")
for i, data in enumerate(processed_articles):
    print(f"\nArticle {i+1}:")
    print(f"Summary: {data['summary']}")
    print(f"Citation: (Source: {data['citation_title']}, {data['citation_url']})")
    print("-------------------------------------------")
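
The `processed_articles` list lives only in the notebook kernel's memory, so it is lost on restart. As a hedged sketch, results could also be appended to a JSON Lines file so that a later run can reload them. The helper names `append_results` and `load_results` and the file layout are assumptions, not part of the assignment's starter code.

```python
import json
from pathlib import Path
from typing import Iterable, List


def append_results(path: Path, results: Iterable[dict]) -> None:
    """Append each result as one JSON object per line (JSON Lines)."""
    with path.open("a", encoding="utf-8") as f:
        for record in results:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_results(path: Path) -> List[dict]:
    """Load all previously stored results; return an empty list if the file is missing."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `append_results(Path("processed_articles.jsonl"), processed_articles)` after a run, and `load_results(...)` at the start of the next one.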

---

## Part 5: Analysis and Reflection
Based on your implementation and observations, answer the following questions.

### Task 5.1: Prompt Engineering for Citation
* **Effectiveness**: How effective was your prompt in consistently making the LLM include the citation in the required format? Did it ever omit it or misformat it?
* **Robustness**: What challenges did you face in making the LLM reliably extract and include the *exact* provided source information (title and URL) in the summary's citation? Did you observe any instances where it hallucinated a URL or title, or modified the given ones?
* **Improvement**: If you needed to make the citation process more robust (e.g., handling missing URLs, different source types), how would you modify your prompt or add post-processing logic?
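
As one illustration of the post-processing mentioned above, the sketch below checks whether the model reproduced the provided title and URL verbatim, and otherwise replaces the model's citation with one built from the trusted input. The function names are hypothetical; the placeholder strings mirror the defaults used in `summarize_with_citation`.

```python
from typing import Optional


def format_citation(title: Optional[str], url: Optional[str]) -> str:
    """Build the citation from trusted input metadata, with explicit placeholders."""
    safe_title = title.strip() if title and title.strip() else "Unknown Title"
    safe_url = url.strip() if url and url.strip() else "No URL provided"
    return f"(Source: {safe_title}, {safe_url})"


def citation_is_faithful(llm_output: str, title: Optional[str], url: Optional[str]) -> bool:
    """True only if the output contains the expected citation exactly as provided."""
    return format_citation(title, url) in llm_output


def enforce_citation(llm_output: str, title: Optional[str], url: Optional[str]) -> str:
    """Drop whatever citation the model wrote and append the trusted one."""
    start = llm_output.rfind("(Source:")
    body = llm_output[:start] if start != -1 else llm_output
    return f"{body.strip()} {format_citation(title, url)}"
```

Logging every case where `citation_is_faithful` returns `False` gives a simple measure of how often the prompt alone is insufficient.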

### Task 5.2: Long-Running Agent Loop Design
* **Loop Mechanism**: Describe how your `run_summarization_agent_loop` function ensures continuous processing. How does it simulate new data arriving?
* **State Management**: How is the state (i.e., `processed_articles`) managed across loop iterations? Why is it important to store the results externally rather than just printing them?
* **Concurrency/Rate Limiting**: The current loop uses `asyncio.sleep`. In a real-world scenario with many articles, what strategies would you employ for managing API rate limits and potentially parallelizing API calls (e.g., using `asyncio.gather`, queues, or dedicated rate-limiting libraries)?
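
To make the concurrency question concrete, here is a minimal sketch of bounded parallelism using `asyncio.Semaphore` with `asyncio.gather`. It caps the number of in-flight calls but does not by itself enforce a requests-per-minute limit. The name `summarize_concurrently` and the `max_concurrent` parameter are assumptions; `summarize` stands in for `summarize_with_citation`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def summarize_concurrently(
    articles: List[dict],
    summarize: Callable[[dict], Awaitable[dict]],
    max_concurrent: int = 3,
) -> List[dict]:
    """Summarize all articles with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(article: dict) -> dict:
        # Each task waits here until one of the max_concurrent slots is free
        async with semaphore:
            return await summarize(article)

    # gather returns results in the same order as the input articles
    return list(await asyncio.gather(*(bounded(a) for a in articles)))
```

In the notebook this could be called as `await summarize_concurrently(article_stream, summarize_with_citation)`.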

### Task 5.3: Application and Scalability
* **Use Cases**: What real-world applications would benefit from a long-running summarization and citation agent like this (e.g., news monitoring, research assistance, content aggregation)?
* **Scalability Challenges**: Discuss the challenges of scaling this system to process thousands or millions of articles daily. Consider computational resources, API costs, data storage, and the need for more advanced error recovery.
* **Reliability**: How would you enhance the reliability of this agent to ensure it can run for days or weeks without human intervention, handling unexpected errors (e.g., API downtime, malformed content) gracefully?
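
One common building block for the reliability question is retrying transient failures with exponential backoff and jitter. The sketch below is a generic, hypothetical wrapper (`with_retries` is not part of the starter code). Note that `summarize_with_citation` as written catches exceptions and returns an error dictionary, so a wrapper like this would need to surround the raw API call rather than that function.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> T:
    """Await operation(), retrying on any exception with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return await operation()
        except Exception:
            if attempt >= max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            # Delay doubles each attempt (capped), plus random jitter so that
            # many clients do not retry at exactly the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
            attempt += 1
```

A real deployment would likely retry only on errors known to be transient (timeouts, rate limits) rather than on every exception.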

---

### Submission:
* Ensure all code cells have been executed and their outputs are visible.
* Write all analysis and reflections clearly in markdown cells.
* Make sure your `.env` file (or equivalent API key setup) is mentioned but **NOT** included in the submitted notebook for security reasons.
* Save your Jupyter Notebook as `[YourName]_Summarization_Citation_Agent_Assignment.ipynb`.