# Let AI do the reading: a research assistant with AI agents.

We’ll start by building a smart research assistant that helps you find, understand, and work with scientific papers more easily. Using tools like LangGraph and powerful language models, this assistant connects to academic websites and can read and summarize papers, answer your questions, and guide you through complex topics. It’s designed to make research easier for students, beginners, and anyone curious about science.

Reading and reviewing research papers takes up a huge chunk of time in R&D. By some estimates, researchers spend 30–50% of their time understanding, analyzing, and summarizing academic literature. This is a common struggle for students, scientists, and professionals alike.
While literature review is a vital part of driving innovation and discovery, the way we do it today is still slow, manual, and inefficient.

## What is LangGraph?

LangGraph is a framework for building applications as graph-based workflows. Each node represents a function or computational step, and edges define the flow between nodes, optionally based on conditions. Think of it as a flowchart that connects the parts of an AI system (planning, searching, reading, and summarizing) and lets them pass information to one another. With LangGraph you can build stateful AI agents: the agent keeps track of what it has done at each step and can make decisions based on the task, the tools, and the results so far. Instead of one big model call doing everything at once, LangGraph lets you break the task into smaller, manageable pieces, each handled by a different part of your agent.
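
To make the flowchart idea concrete, here is a minimal plain-Python sketch of a stateful graph: nodes are functions that return state updates, and a router picks the next node from the current state. It deliberately does not use LangGraph's API. The names `run_graph`, `decide`, `plan`, and `route_after_decide` are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict

State = Dict[str, object]
Node = Callable[[State], State]


def decide(state: State) -> State:
    # Toy rule (an assumption for this sketch): queries ending in "?" need research.
    return {"requires_research": str(state["query"]).endswith("?")}


def plan(state: State) -> State:
    return {"plan": f"search papers about: {state['query']}"}


def route_after_decide(state: State) -> str:
    # Conditional edge: choose the next node from the current state.
    return "plan" if state["requires_research"] else "END"


def run_graph(state: State) -> State:
    nodes: Dict[str, Node] = {"decide": decide, "plan": plan}
    routers = {"decide": route_after_decide, "plan": lambda s: "END"}
    current = "decide"
    while current != "END":
        state = {**state, **nodes[current](state)}  # merge the node's update into the state
        current = routers[current](state)
    return state


result = run_graph({"query": "What is attention in transformers?"})
print(result["plan"])
```

The point of the sketch is the shape: each node only returns the keys it changes, and routing decisions read the accumulated state rather than the raw input.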


## Key Components of Our Scientific Paper Agent

### 1. State-Driven Workflow Engine

LangGraph helps us organize our agent’s thinking using a **state machine** — a system where the agent knows exactly where it is in the process.

- **StateGraph Architecture**: A five-node structure that manages the full research journey step by step.
- **Decision-Making Node**: Understands your question and figures out what kind of research is needed.
- **Planning Node**: Creates a plan to find and analyze papers.
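- **Agent Node**: Follows the plan, decides which tools to call, and writes the answer.
- **Tool Execution Node**: Calls tools such as the **CORE API** search to find, retrieve, and read academic papers.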
- **Judge Node**: Checks the quality of the results and improves them if needed.


### 2. Paper Processing Integration

This is how our agent reads and understands actual research papers.

- **Source Integration**: Uses tools like the **CORE API** to access thousands of academic papers.
- **Document Processing**: Extracts and cleans PDF content while keeping the structure of sections, headings, and paragraphs intact.

### 3. Analysis Workflow

This is how we ensure the output is useful, reliable, and high-quality.

- **State-aware Pipeline**: Each step knows where it fits in the bigger picture.
- **Validation Gates**: We check the agent’s output at every step.
- **Improvement Cycles**: The agent can reflect and refine its answers.
- **Human-in-the-Loop**: You can jump in at key points to guide or validate the output.
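
The validation gate and improvement cycle listed above can be sketched as a small loop: generate an answer, judge it, and try again with the judge's feedback until it passes or a retry cap is reached. This is a simplified illustration, not the notebook's implementation; `refine_until_good`, `toy_generate`, and `toy_judge` are hypothetical stand-ins for the LLM calls.

```python
from typing import Callable, Optional, Tuple


def refine_until_good(
    generate: Callable[[Optional[str]], str],
    judge: Callable[[str], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate an answer, judge it, and retry with feedback for up to max_rounds rounds."""
    feedback: Optional[str] = None
    answer = ""
    for round_number in range(1, max_rounds + 1):
        answer = generate(feedback)
        is_good, feedback = judge(answer)
        if is_good:
            return answer, round_number
    return answer, max_rounds  # cap reached: return the last attempt


# Toy stand-ins for the LLM calls (assumptions for this sketch).
def toy_generate(feedback: Optional[str]) -> str:
    return "Claim [1]" if feedback else "Claim"


def toy_judge(answer: str) -> Tuple[bool, Optional[str]]:
    if "[" in answer:
        return True, None
    return False, "Add inline citations."


final_answer, rounds = refine_until_good(toy_generate, toy_judge)
print(final_answer, rounds)
```

The cap on rounds matters: without it, a judge that never approves would keep the agent looping indefinitely.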


An overview of the workflow is shown below:

![image](https://i.ibb.co/0BBzkcb/mermaid-diagram-2024-11-17-195744.png)

## Method Details

### 1. System Requirements

To run this scientific paper agent, you’ll need the following:

- **OpenAI API Key**  
  Used to access OpenAI chat models such as **GPT-4o** (the code in this notebook uses `gpt-4o-mini`).
  *(Note: You can also swap in any LLM that supports tool/function calling.)*

- **CORE API Key**  
  Used to retrieve research papers.  
  CORE is one of the largest repositories of scientific papers, with over **136 million documents**.  
  You can request a free personal API key [here](https://core.ac.uk/services/api#form).

### 2. Technical Architecture

The system is built using the following technologies:

- **LangGraph** – Manages the step-by-step stateful logic of the agent.  
- **PDFplumber** – Extracts text from PDF documents while preserving structure.  
- **Pydantic** – Ensures data is clean, well-structured, and validated across the workflow.


## Setup

This cell installs the required dependencies.

In [1]:
! pip install --upgrade --quiet langchain==0.2.16 langchain-community==0.2.16 langchain-openai==0.1.23 langgraph==0.2.18 langsmith==0.1.114 pdfplumber python-dotenv

This cell imports the required libraries and sets the environment variables.

In [2]:
# Import core libraries
import io  # Handles byte streams (e.g., working with file-like objects)
import json  # Allows working with JSON data (parsing and writing)
import os  # For accessing environment variables (like API keys)
import urllib3  # HTTP library used for making API calls
import time  # Used for time-related functions (like delays or timestamps)

# Import PDF processing library
import pdfplumber  # Used to extract and parse text from PDF files

# Load environment variables from a .env file
from dotenv import load_dotenv  # Loads variables from a .env file into your environment

# For displaying Markdown output in Jupyter Notebooks
from IPython.display import display, Markdown  # Lets you show formatted Markdown in notebook cells

# LangChain message types (used for chat history and reasoning)
from langchain_core.messages import BaseMessage, SystemMessage, ToolMessage, AIMessage

# LangChain base tool interface
from langchain_core.tools import BaseTool, tool  # Used to define tools the AI agent can call

# OpenAI chat model interface from LangChain
from langchain_openai import ChatOpenAI  # Lets you connect LangChain to OpenAI's GPT models

# LangGraph orchestration framework
from langgraph.graph import END, StateGraph  # END marks the end of the graph, StateGraph defines the workflow
from langgraph.graph.state import CompiledStateGraph  # Used for compiling a graph into executable form
from langgraph.graph.message import add_messages  # Helps attach messages to the graph's internal state

# Data validation and model structure
from pydantic import BaseModel, Field  # Pydantic is used to define structured data models and ensure correctness

# Typing support (used to make data types clearer and stricter)
from typing import Annotated, ClassVar, Sequence, TypedDict, Optional

# Suppress HTTPS warnings (not recommended for production but helpful during development)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Load environment variables from .env file into the runtime environment
load_dotenv()

# Fall back to placeholder keys only if they are not already set (for example via .env).
# Keeping real keys in .env is safer than hard-coding them in the notebook.
os.environ.setdefault("OPENAI_API_KEY", "sk-")  # Replace with your actual OpenAI API key
os.environ.setdefault("CORE_API_KEY", "...")  # Replace with your actual CORE API key
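
Because a forgotten placeholder key only fails later, deep inside an API call, it can help to check the keys up front. The following is a small sketch of such a check. `require_env`, `PLACEHOLDERS`, and `DEMO_API_KEY` are hypothetical names introduced here, and the placeholder values mirror the ones used in the cell above.

```python
import os

# Values that indicate the key was never filled in (taken from the setup cell).
PLACEHOLDERS = {"", "sk-", "..."}


def require_env(name: str) -> str:
    """Return an environment variable, or raise if it is missing or still a placeholder."""
    value = os.environ.get(name, "").strip()
    if value in PLACEHOLDERS:
        raise RuntimeError(f"{name} is not set. Add it to your .env file or environment.")
    return value


# Demonstration with a throwaway variable so the example runs anywhere.
os.environ["DEMO_API_KEY"] = "..."
try:
    require_env("DEMO_API_KEY")
except RuntimeError as error:
    print(error)
```

In the notebook you would call `require_env("OPENAI_API_KEY")` and `require_env("CORE_API_KEY")` once, right after loading the environment.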

## Prompts

This cell defines the prompts that will guide the AI agent’s behavior throughout the workflow. Prompts are like instructions that tell the language model what to do, how to behave, and how to use tools or data sources.

In [3]:
# Prompt for the initial decision-making step in the agent workflow
# This prompt tells the AI how to decide whether to answer the user directly or start doing research.
# This prompt teaches the AI agent when to act like a researcher (and dig into papers/tools) and when to just respond conversationally without doing any research.

decision_making_prompt = """
You are an experienced scientific researcher.
Your goal is to help the user with their scientific research.

Based on the user query, decide whether you need to perform research or can answer the question directly.
- You should perform research if the user query requires any supporting evidence or information.
- You should answer the question directly only for simple conversational questions, like "how are you?".
"""

# Prompt to guide the AI in creating a detailed step-by-step plan to answer a scientific query.

# This prompt is part of the planning step in the agent's workflow.
# It instructs the LLM to:
# - Act like a scientific researcher
# - Break down complex questions into smaller, evidence-based subtasks
# - Avoid making assumptions or guesses
# - Incorporate user feedback when available
# - Assign the correct external tool to each subtask

planning_prompt = """
# IDENTITY AND PURPOSE

You are an experienced scientific researcher.
Your goal is to make a new step-by-step plan to help the user with their scientific research.

Subtasks should not rely on any assumptions or guesses, but only on the information provided in the context or on additional information you look up.

If any feedback is provided about a previous answer, incorporate it in your new planning.

# TOOLS

For each subtask, indicate the external tool required to complete the subtask.
Tools can be one of the following:
{tools}
"""

# Prompt used by the agent to generate a final answer to the user's research query.

# This prompt is intended for the execution phase of the agent’s workflow — once a plan has been made.
#
# PURPOSE:
# - Guide the agent to behave like an experienced scientific researcher.
# - Ensure the agent follows the previously generated step-by-step plan to answer the query.
# - Encourage the agent to cite sources extensively to support any claims made.
#
# EXTERNAL TOOL ACCESS:
# - The agent has access to tools like the CORE API for retrieving scientific papers.
# - This prompt includes a reference table showing how to build complex search queries using CORE's syntax.
# - It also includes a sample structure of the paper object returned by the CORE API.
#
# QUERY EXAMPLES:
# - Several example queries are shown to help the model construct advanced search queries using fields like `yearPublished`, `authors`, `title`, etc.

agent_prompt = """
# IDENTITY AND PURPOSE

You are an experienced scientific researcher.
Your goal is to help the user with their scientific research. You have access to a set of external tools to complete your tasks.
Follow the plan you wrote to successfully complete the task.

Add extensive inline citations to support any claim made in the answer.


# EXTERNAL KNOWLEDGE

## CORE API

The CORE API has a specific query language that allows you to explore a vast papers collection and perform complex queries. See the following table for a list of available operators:

| Operator       | Accepted symbols         | Meaning                                                                                      |
|---------------|-------------------------|----------------------------------------------------------------------------------------------|
| And           | AND, +, space          | Logical binary and.                                                                           |
| Or            | OR                     | Logical binary or.                                                                            |
| Grouping      | (...)                  | Used to prioritise and group elements of the query.                                           |
| Field lookup  | field_name:value       | Used to support lookup of specific fields.                                                    |
| Range queries | fieldName(>, <,>=, <=) | For numeric and date fields, it allows to specify a range of valid values to return.         |
| Exists queries| _exists_:fieldName     | Allows for complex queries, it returns all the items where the field specified by fieldName is not empty. |

Use this table to formulate more complex queries filtering for specific papers, for example publication date/year.
Here are the relevant fields of a paper object you can use to filter the results:
{
  "authors": [{"name": "Last Name, First Name"}],
  "documentType": "presentation" or "research" or "thesis",
  "publishedDate": "2019-08-24T14:15:22Z",
  "title": "Title of the paper",
  "yearPublished": "2019"
}

Example queries:
- "machine learning AND yearPublished:2023"
- "maritime biology AND yearPublished>=2023 AND yearPublished<=2024"
- "cancer research AND authors:Vaswani, Ashish AND authors:Bello, Irwan"
- "title:Attention is all you need"
- "mathematics AND _exists_:abstract"
"""

# Prompt for the judging step in the agent's workflow.
#
# PURPOSE:
# - Used to evaluate the quality of the final answer generated by the agent.
# - The agent acts like an expert reviewer and self-assesses its output.
# - This step is part of a feedback loop (improvement cycle).
#
# WHAT THE MODEL IS INSTRUCTED TO DO:
# - Review the full conversation history between itself and the user.
# - Decide whether the final answer was good or needs revision.
# - Evaluate based on criteria like: relevance, completeness, feedback incorporation, and proper citation.
# - If the answer is unsatisfactory, it must return actionable feedback for improvement.

judge_prompt = """
You are an expert scientific researcher.
Your goal is to review the final answer you provided for a specific user query.

Look at the conversation history between you and the user. Based on it, you need to decide if the final answer is satisfactory or not.

A good final answer should:
- Directly answer the user query. For example, it does not answer a question about a different paper or area of research.
- Answer the user's request extensively.
- Take into account any feedback given through the conversation.
- Provide inline sources to support any claim made in the answer.

In case the answer is not good enough, provide clear and concise feedback on what needs to be improved to pass the evaluation.
"""
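
The query operators listed in `agent_prompt` (`AND`, range comparisons, `_exists_`) are plain strings, so the same queries can also be assembled programmatically. The sketch below rebuilds two of the example queries from the prompt. `build_core_query` is a hypothetical helper, and its output has not been checked against the live CORE API; it only mirrors the syntax shown in the prompt.

```python
from typing import Optional


def build_core_query(
    keywords: str,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    require_abstract: bool = False,
) -> str:
    """Combine keywords with optional filters, joined by the AND operator."""
    clauses = [keywords]
    if year_from is not None:
        clauses.append(f"yearPublished>={year_from}")  # range query, lower bound
    if year_to is not None:
        clauses.append(f"yearPublished<={year_to}")  # range query, upper bound
    if require_abstract:
        clauses.append("_exists_:abstract")  # only papers with a non-empty abstract
    return " AND ".join(clauses)


print(build_core_query("maritime biology", year_from=2023, year_to=2024))
print(build_core_query("mathematics", require_abstract=True))
```

In this notebook the LLM writes the query itself from the prompt, so a helper like this is mainly useful for testing the search tool by hand.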

## Utility classes and functions

This cell contains utility classes and helper functions that support the overall workflow. The key component is the `CoreAPIWrapper` class, which connects the agent to the CORE API. It retries failed requests up to five times with exponential backoff, so temporary problems such as rate limits or network errors do not break the workflow. The cell also defines Pydantic models that structure and validate the data passed between steps: `SearchPapersInput` for the search tool, `DecisionMakingOutput` for the decision-making node, and `JudgeOutput` for the judge node. Finally, it includes two helpers: `format_tools_description`, which renders the tool list for the planning prompt, and `print_stream`, which streams and prints the agent's messages as the workflow runs.

In [4]:
# Define a class to interact with the CORE API, using Pydantic's BaseModel for structured input/output
class CoreAPIWrapper(BaseModel):
    """Simple wrapper around the CORE API to search for academic papers."""

    # Base URL for CORE API (this won't change)
    base_url: ClassVar[str] = "https://api.core.ac.uk/v3"

    # API key fetched from environment variables (make sure it's set in .env or the environment)
    api_key: ClassVar[str] = os.environ["CORE_API_KEY"]

    # Number of top results to return per query (defaults to 1 if not set)
    top_k_results: int = Field(description="Top k results obtained by running a query on CORE", default=1)

    # Internal method to send a GET request to the CORE API and retrieve the search results
    def _get_search_response(self, query: str) -> dict:
        http = urllib3.PoolManager()  # Create a connection pool for HTTP requests

        # Define a retry mechanism in case of temporary errors (e.g., rate limits, connection issues)
        max_retries = 5
        for attempt in range(max_retries):
            # Make the GET request to the CORE API search endpoint
            response = http.request(
                'GET',
                f"{self.base_url}/search/outputs",  # Search endpoint for research outputs
                headers={"Authorization": f"Bearer {self.api_key}"},  # Auth header with API key
                fields={"q": query, "limit": self.top_k_results}  # Search query and result limit
            )

            # If response is successful (status code 2xx), return parsed JSON data
            if 200 <= response.status < 300:
                return response.json()
            # If request failed but retries remain, wait and try again (exponential backoff)
            elif attempt < max_retries - 1:
                time.sleep(2 ** (attempt + 2))  # e.g., 4s, 8s, 16s...
            else:
                # If all attempts fail, raise an exception with status info
                raise Exception(f"Got non 2xx response from CORE API: {response.status} {response.data}")

    # Public method to perform a search and return results as a formatted string
    def search(self, query: str) -> str:
        # Get the raw response from the internal method
        response = self._get_search_response(query)
        results = response.get("results", [])  # Extract "results" from the JSON response

        # If no results found, return a default message
        if not results:
            return "No relevant results were found"

        # Format each result into a readable string
        docs = []
        for result in results:
            # Prefer full 'publishedDate', fallback to 'yearPublished'
            published_date_str = result.get('publishedDate') or result.get('yearPublished', '')

            # Combine author names (if available) into one string
            authors_str = ' and '.join([item['name'] for item in result.get('authors', [])])

            # Create a formatted string for each result
            docs.append((
                f"* ID: {result.get('id', '')},\n"
                f"* Title: {result.get('title', '')},\n"
                f"* Published Date: {published_date_str},\n"
                f"* Authors: {authors_str},\n"
                f"* Abstract: {result.get('abstract', '')},\n"
                f"* Paper URLs: {result.get('sourceFulltextUrls') or result.get('downloadUrl', '')}"
            ))

        # Combine all results into a single string with separators between entries
        return "\n-----\n".join(docs)

# Input model for searching papers using the CORE API
class SearchPapersInput(BaseModel):
    """Input object to search papers with the CORE API."""
    query: str = Field(description="The query to search for on the selected archive.")
    # The search query entered by the user (e.g., "machine learning and healthcare")

    max_papers: int = Field(
        description="The maximum number of papers to return. It's default to 1, but you can increase it up to 10 in case you need to perform a more comprehensive search.",
        default=1,  # Default is 1 paper
        ge=1,       # Minimum value allowed: 1
        le=10       # Maximum value allowed: 10
    )

# Output model for the decision-making step
class DecisionMakingOutput(BaseModel):
    """Output object of the decision making node."""
    requires_research: bool = Field(
        description="Whether the user query requires research or not."
    )
    # True if research needs to be done, False for basic responses

    answer: Optional[str] = Field(
        default=None,
        description="The answer to the user query. It should be None if the user query requires research, otherwise it should be a direct answer to the user query."
    )
    # If no research is needed, the answer will be given directly here

# Output model for the judging step
class JudgeOutput(BaseModel):
    """Output object of the judge node."""
    is_good_answer: bool = Field(
        description="Whether the answer is good or not."
    )
    # True if the answer passes the quality check

    feedback: Optional[str] = Field(
        default=None,
        description="Detailed feedback about why the answer is not good. It should be None if the answer is good."
    )
    # Explanation of what needs improvement, if the answer wasn't satisfactory

# Utility function to format a list of tools into a readable string
def format_tools_description(tools: list[BaseTool]) -> str:
    # Each tool is shown with its name, description, and expected input arguments
    return "\n\n".join([
        f"- {tool.name}: {tool.description}\n Input arguments: {tool.args}"
        for tool in tools
    ])

# Function to stream and display the results from the agent step-by-step
async def print_stream(app: CompiledStateGraph, input: str) -> Optional[BaseMessage]:
    # Show that research is starting
    display(Markdown("## New research running"))
    display(Markdown(f"### Input:\n\n{input}\n\n"))
    display(Markdown("### Stream:\n\n"))

    all_messages = []  # Store all messages received during the run

    # Asynchronously stream the conversation updates
    async for chunk in app.astream({"messages": [input]}, stream_mode="updates"):
        for updates in chunk.values():
            if messages := updates.get("messages"):
                all_messages.extend(messages)  # Collect all messages
                for message in messages:
                    message.pretty_print()  # Nicely print each message
                    print("\n\n")

    # Return the last message if any were collected
    if not all_messages:
        return None
    return all_messages[-1]
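
The retry logic in `CoreAPIWrapper` waits `2 ** (attempt + 2)` seconds between attempts. The sketch below isolates that schedule in a reusable helper so it can be inspected and tested without network access. `backoff_delays` and `call_with_retries` are hypothetical names, and unlike the wrapper above, this version retries on exceptions rather than on non-2xx status codes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 5) -> list[int]:
    """Delays in seconds between attempts (one fewer than the number of attempts)."""
    return [2 ** (attempt + 2) for attempt in range(max_retries - 1)]


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(2 ** (attempt + 2))  # same schedule as the wrapper above
    raise ValueError("max_retries must be at least 1")


print(backoff_delays())  # [4, 8, 16, 32]
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a function that records the delays instead of actually waiting.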

## Agent state

This cell defines the agent state, which contains the following information:
- `requires_research`: Whether the user query requires research or not.
- `num_feedback_requests`: The number of times the LLM asked for feedback.
- `is_good_answer`: Whether the LLM's final answer is good or not.
- `messages`: The conversation history between the user and the LLM.

In [5]:
# Define the overall "state" that the agent keeps track of during its research process.
# This is like the agent's notepad — it remembers what it's doing, what it's already done, and what still needs to happen.
class AgentState(TypedDict):
    """The state of the agent during the paper research process."""

    # Note: TypedDict fields cannot declare default values (type checkers reject `= False`),
    # so the fields below are declared with types only.

    # Whether or not the agent decided that research is required for the current query
    requires_research: bool

    # How many times the agent has asked for feedback from the judge or the user
    # Useful to avoid endless improvement loops
    num_feedback_requests: int

    # Whether the most recent answer passed the judge’s quality check
    is_good_answer: bool

    # Conversation history (messages exchanged so far)
    # This includes user input, AI responses, tool calls, etc.
    # `Annotated` with `add_messages` means this will auto-track messages step-by-step
    messages: Annotated[Sequence[BaseMessage], add_messages]
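
The `Annotated[..., add_messages]` pattern attaches a reducer, a function that says how an update is merged into the existing value instead of overwriting it. The plain-Python sketch below imitates that idea with `operator.add` as a simplified stand-in. `ToyState` and `apply_update` are hypothetical names, and LangGraph's real `add_messages` does more than concatenate lists.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class ToyState(TypedDict):
    is_good_answer: bool
    messages: Annotated[list, operator.add]  # reducer: concatenate old and new lists


def apply_update(state: dict, update: dict) -> dict:
    """Merge an update into the state, using a reducer when the field declares one."""
    hints = get_type_hints(ToyState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # reducer: old + new
        else:
            merged[key] = value  # no reducer: overwrite
    return merged


state = {"is_good_answer": False, "messages": ["user question"]}
state = apply_update(state, {"messages": ["draft answer"], "is_good_answer": True})
print(state)
```

This is why nodes in the workflow can return only the new messages: the reducer appends them to the history, while plain fields such as `is_good_answer` are simply replaced.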

## Agent tools

This cell defines the tools that the AI agent can use to complete its tasks. The toolkit includes three main tools: one for searching scientific papers using the CORE API, one for downloading a paper from a given link, and another for asking a human for feedback when the AI isn't sure about its answer. To make the paper downloading tool more reliable, it includes a retry system (so it tries again if the first attempt fails) and a mock browser header to avoid being blocked by some websites — this helps prevent common errors like “403 Forbidden.” Together, these tools help the agent search, retrieve, and verify scientific content in a smooth and stable way.

In [7]:
# Register this function as a tool called "search-papers"
# The agent will be able to call this tool when it needs to search for papers
# `args_schema` tells the tool what input format it expects — in this case, a query string and max_papers value
@tool("search-papers", args_schema=SearchPapersInput)
def search_papers(query: str, max_papers: int = 1) -> str:
    """
    Search for scientific papers using the CORE API.

    Example input the agent might give:
    {"query": "Attention is all you need", "max_papers": 1}

    Returns:
        A formatted list of relevant papers with key information like title, authors, abstract, etc.
    """
    try:
        # Create an instance of the CoreAPIWrapper with the desired number of results
        # Then run the search using the provided query
        return CoreAPIWrapper(top_k_results=max_papers).search(query)
    except Exception as e:
        # If something goes wrong (like an API error), return a helpful error message
        return f"Error performing paper search: {e}"


# Register this function as a tool called "download-paper"
# The agent can use this tool to download and read the full text of a paper from a given PDF URL
@tool("download-paper")
def download_paper(url: str) -> str:
    """
    Download a specific scientific paper from a given URL.

    Example input:
    {"url": "https://sample.pdf"}

    Returns:
        The plain text content of the paper.
    """
    try:
        # Create an HTTP connection manager (turns off certificate checks for simplicity)
        http = urllib3.PoolManager(cert_reqs='CERT_NONE')

        # Define browser-like headers to avoid 403 Forbidden errors from some websites
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',  # "br" omitted: urllib3 decodes Brotli only if a Brotli package is installed
            'Connection': 'keep-alive',
        }

        # Try up to 5 times if the request fails
        max_retries = 5
        for attempt in range(max_retries):
            # Make the GET request to download the paper
            response = http.request('GET', url, headers=headers)

            # If download was successful (status code 2xx)
            if 200 <= response.status < 300:
                # Convert the PDF file content into a byte stream
                pdf_file = io.BytesIO(response.data)

                # Use pdfplumber to extract text from the PDF
                with pdfplumber.open(pdf_file) as pdf:
                    text = ""
                    for page in pdf.pages:
                        text += (page.extract_text() or "") + "\n"  # Add text from each page (empty if none was extracted)
                return text  # Return the full extracted paper text

            # If request failed, retry with exponential backoff (wait 4s, 8s, 16s, etc.)
            elif attempt < max_retries - 1:
                time.sleep(2 ** (attempt + 2))
            else:
                # Raise an error if all retry attempts fail
                raise Exception(f"Got non 2xx when downloading paper: {response.status} {response.data[:200]!r}")

    # Catch and return any errors that occur during the process
    except Exception as e:
        return f"Error downloading paper: {e}"

# Register a tool called "ask-human-feedback"
# This allows the agent to ask the user (a human) for help when something goes wrong or it needs confirmation
@tool("ask-human-feedback")
def ask_human_feedback(question: str) -> str:
    """
    Ask for human feedback. You should call this tool when encountering unexpected errors.

    The agent will ask the user a question (e.g., "Was this answer helpful?" or "What should I fix?")
    The user's typed response will be captured and used as feedback.
    """
    return input(question)  # Waits for user input from the terminal and returns it as a string

# Create a list of all tools that the agent can use
# These include:
# - search_papers: to look up research papers
# - download_paper: to download and extract text from a PDF
# - ask_human_feedback: to ask the user for help or clarification
tools = [search_papers, download_paper, ask_human_feedback]

# Create a dictionary where the tool name is the key and the tool function is the value
# This makes it easier to look up a tool by its name later on
tools_dict = {tool.name: tool for tool in tools}
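
The name-to-tool dictionary is what lets the workflow turn a tool call, which is just a name plus arguments, into an actual function call. The sketch below shows that lookup with plain functions instead of LangChain tools. `toy_search`, `toy_download`, `toy_tools`, and `run_tool_call` are hypothetical names, and the unknown-tool branch is an addition for illustration.

```python
from typing import Callable, Dict


def toy_search(query: str, max_papers: int = 1) -> str:
    return f"{max_papers} result(s) for '{query}'"


def toy_download(url: str) -> str:
    return f"text of {url}"


# Same shape as tools_dict: tool name -> callable.
toy_tools: Dict[str, Callable[..., str]] = {
    "search-papers": toy_search,
    "download-paper": toy_download,
}


def run_tool_call(tool_call: dict) -> str:
    """Look up a tool by name and call it with the provided arguments."""
    tool_fn = toy_tools.get(tool_call["name"])
    if tool_fn is None:
        return f"Error: unknown tool '{tool_call['name']}'"
    return tool_fn(**tool_call["args"])


print(run_tool_call({"name": "search-papers", "args": {"query": "graph neural networks", "max_papers": 2}}))
```

Returning an error string for an unknown tool, rather than raising, keeps the failure visible to the model so it can correct its next call.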

## Workflow nodes

This cell defines the nodes of the AI agent’s workflow — think of these as the individual steps or checkpoints the agent goes through while completing a task. Each node performs a specific function: one might decide what the user needs, another might plan how to do it, another might search for papers, and yet another might judge the quality of the final answer. These nodes are the building blocks of the full research process, and they are connected using a tool like LangGraph, which helps the agent move from one step to the next in a smart, structured way. Defining these nodes is essential because it allows the agent to work in an organized and flexible manner — like a to-do list it follows step-by-step.

In [8]:
# Create the base LLM (Large Language Model) we'll use for all tasks
# This uses OpenAI's GPT-4o-mini model, with temperature 0.0 for more predictable and stable responses
base_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# Create a specialized version of the LLM for the decision-making step
# This version expects structured output matching the DecisionMakingOutput format (true/false + optional answer)
decision_making_llm = base_llm.with_structured_output(DecisionMakingOutput)

# Create another version of the LLM that is aware of the tools it can use
# This will be used during the main agent step where it needs to perform tasks using search/download/etc.
agent_llm = base_llm.bind_tools(tools)

# Create a version of the LLM that can judge the final answer
# It also returns structured feedback using the JudgeOutput format (pass/fail + optional explanation)
judge_llm = base_llm.with_structured_output(JudgeOutput)


# Define the decision-making node — this is the starting point of the agent workflow
# It decides whether the user's query needs a full research process or can be answered immediately
def decision_making_node(state: AgentState):
    """
    Entry point of the workflow.
    Based on the user query, the model can either respond directly or perform a full research,
    routing the workflow to the planning node.
    """

    # Add the decision-making prompt to the beginning of the message history
    system_prompt = SystemMessage(content=decision_making_prompt)

    # Call the LLM to analyze the current user query (along with past conversation)
    # and return a structured response (True/False + optional direct answer)
    response: DecisionMakingOutput = decision_making_llm.invoke([system_prompt] + state["messages"])

    # Prepare the output dictionary for the next step in the workflow
    output = {
        "requires_research": response.requires_research  # True if deeper research is needed
    }

    # If the model gave a direct answer (for simple questions), include it in the messages
    if response.answer:
        output["messages"] = [AIMessage(content=response.answer)]

    # Return updated state — either it moves to planning next, or the answer is complete
    return output


# Task router function
def router(state: AgentState):
    """
    Router directing the user query to the appropriate branch of the workflow.

    If research is needed (based on decision_making_node), it sends the workflow to the 'planning' node.
    Otherwise, it ends the workflow (e.g., for simple questions already answered).
    """
    if state["requires_research"]:
        return "planning"  # Go to the planning step if research is needed
    else:
        return "end"       # End the process if the question was already answered


# Planning node
def planning_node(state: AgentState):
    """
    Planning node that creates a step-by-step plan to answer the user query.

    This step uses the base LLM to:
    - Analyze the user query
    - Come up with smaller subtasks
    - Decide which tools to use for each subtask
    """
    # Format the system prompt by including the list of available tools and how to use them
    system_prompt = SystemMessage(
        content=planning_prompt.format(tools=format_tools_description(tools))
    )

    # Call the base LLM (GPT-4o or GPT-4o-mini) with the planning prompt and user messages
    response = base_llm.invoke([system_prompt] + state["messages"])

    # Return the plan as a message so it can be used by the next node
    return {"messages": [response]}


# Tool call node
def tools_node(state: AgentState):
    """
    This node is responsible for executing any tool calls the agent requested in the last message.

    Based on the plan created earlier, the agent might decide to:
    - Search papers
    - Download a PDF
    - Ask for feedback

    This node loops through each requested tool call and runs it.
    """
    outputs = []  # Store the results of each tool call here

    # Go through each tool call in the agent's last message
    for tool_call in state["messages"][-1].tool_calls:
        # Use the name of the tool to fetch and run it with its input arguments
        tool_result = tools_dict[tool_call["name"]].invoke(tool_call["args"])

        # Format the result as a ToolMessage so the agent can read it as input in the next step
        outputs.append(
            ToolMessage(
                content=json.dumps(tool_result),       # Convert result to a JSON string
                name=tool_call["name"],                # Tool's name
                tool_call_id=tool_call["id"],          # ID that ties this result to its original request
            )
        )

    # Return all the tool outputs so they can be used by the agent again
    return {"messages": outputs}

# Agent call node
def agent_node(state: AgentState):
    """
    This node is where the agent uses the LLM — along with the tools — to reason through the task.

    It:
    - Reads the plan and tool outputs
    - Continues the conversation
    - Assembles a final response or decides the next action

    This is the LLM acting as the brain, calling tools and thinking step-by-step.
    """
    # Add the agent instruction prompt to guide its behavior
    system_prompt = SystemMessage(content=agent_prompt)

    # Invoke the LLM (that has access to tool usage) with all the messages so far
    response = agent_llm.invoke([system_prompt] + state["messages"])

    # Return the LLM's response, which may include follow-up reasoning or next tool calls
    return {"messages": [response]}

# Should continue function
def should_continue(state: AgentState):
    """
    Check if the agent should continue or end the workflow.

    This function looks at the last message from the agent to decide:
    - If the agent requested any tool calls → continue the workflow
    - If not → stop the process, because the agent is likely done
    """
    messages = state["messages"]              # Get the full conversation history
    last_message = messages[-1]               # Look at the most recent message from the agent

    # If the last message includes tool calls, continue processing
    if last_message.tool_calls:
        return "continue"
    else:
        return "end"                          # Otherwise, we're done — no more action needed


# Judge node
def judge_node(state: AgentState):
    """
    This node allows the LLM to judge the quality of its own final answer.

    It checks whether the answer is good enough to stop, or if improvements are needed.
    """

    # Get the current number of times feedback has been requested so far (default to 0)
    num_feedback_requests = state.get("num_feedback_requests", 0)

    # If the agent has already tried and failed to improve the answer twice, we stop here
    # We assume the answer is "good enough" to prevent getting stuck in an endless loop
    if num_feedback_requests >= 2:
        return {"is_good_answer": True}

    # Add the judging prompt that tells the model to evaluate its own final answer
    system_prompt = SystemMessage(content=judge_prompt)

    # Call the judge LLM to review the current messages and return structured feedback
    response: JudgeOutput = judge_llm.invoke([system_prompt] + state["messages"])

    # Prepare the updated state
    output = {
        "is_good_answer": response.is_good_answer,              # Did the answer pass?
        "num_feedback_requests": num_feedback_requests + 1      # Increment feedback attempts
    }

    # If feedback is provided (because the answer was not good), send it as a new message
    if response.feedback:
        output["messages"] = [AIMessage(content=response.feedback)]

    # Return updated state for the next node to use
    return output


# Final answer router function
def final_answer_router(state: AgentState):
    """
    This function decides what to do after the agent's answer has been judged.

    - If the answer was marked as good → end the workflow.
    - If not → go back to the planning step to try again with improvements.
    """
    if state["is_good_answer"]:
        return "end"        # Answer passed the quality check → we're done
    else:
        return "planning"   # Answer needs improvement → re-plan and try again
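
The retry cap in `judge_node` is easier to see in isolation. The sketch below is a plain-Python approximation, not the notebook's code: `LoopState`, `judge_step`, `run_until_accepted`, and `always_reject` are hypothetical names, and the LLM call is replaced by a stub that never approves. It assumes, as in the workflow, that each node returns a partial update that is merged into the state.

```python
from typing import Callable, TypedDict


class LoopState(TypedDict, total=False):
    # Trimmed-down stand-in for AgentState: only the fields the judge touches
    is_good_answer: bool
    num_feedback_requests: int


def judge_step(state: LoopState, judge: Callable[[], bool], max_requests: int = 2) -> LoopState:
    """Same control flow as judge_node, with the LLM call replaced by `judge`."""
    num_requests = state.get("num_feedback_requests", 0)
    if num_requests >= max_requests:
        # Cap reached: accept the answer without consulting the judge again
        return {"is_good_answer": True}
    return {
        "is_good_answer": judge(),
        "num_feedback_requests": num_requests + 1,
    }


def run_until_accepted(judge: Callable[[], bool]) -> int:
    """Repeat judge_step until the answer is accepted; return the number of node visits."""
    state: LoopState = {}
    visits = 0
    while not state.get("is_good_answer", False):
        state.update(judge_step(state, judge))  # merge the partial update into the state
        visits += 1
    return visits


judge_calls = 0


def always_reject() -> bool:
    """Hypothetical worst-case judge that never approves an answer."""
    global judge_calls
    judge_calls += 1
    return False


visits = run_until_accepted(always_reject)
print(visits, judge_calls)
```

With a judge that always rejects, the node is visited three times but the stubbed judge is only called twice: the third visit short-circuits and marks the answer as good, which is what keeps the planning loop from running forever.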

## Workflow definition

This cell defines the workflow using LangGraph.

In [9]:
# Step 1: Initialize the StateGraph with the AgentState structure
workflow = StateGraph(AgentState)

# Step 2: Add all the nodes (steps in the workflow) to the graph
workflow.add_node("decision_making", decision_making_node)  # Entry point: Should we do research or not?
workflow.add_node("planning", planning_node)                # Plan the research process step-by-step
workflow.add_node("tools", tools_node)                      # Run tools like paper search and download
workflow.add_node("agent", agent_node)                      # Agent thinks, responds, or decides next step
workflow.add_node("judge", judge_node)                      # Judge if the final answer is good or needs improvement

# Step 3: Set where the workflow starts
workflow.set_entry_point("decision_making")  # Always begin by checking what kind of query it is

# Step 4: Define conditional routing from the decision-making step
# Based on the router logic:
# - If research is needed, go to planning
# - If not, end the workflow (because the question was simple and already answered)
workflow.add_conditional_edges(
    "decision_making",
    router,
    {
        "planning": "planning",
        "end": END,
    }
)

# Step 5: Connect planning → agent (after plan is made, the agent executes it)
workflow.add_edge("planning", "agent")

# Step 6: If the agent calls tools, send the result back to the agent to continue
workflow.add_edge("tools", "agent")

# Step 7: From the agent step, decide whether to:
# - Use tools again (if tool calls were made)
# - Or go to the judge (if response is complete)
workflow.add_conditional_edges(
    "agent",
    should_continue,
    {
        "continue": "tools",
        "end": "judge",
    },
)

# Step 8: After judgment, either end the workflow (if answer is good),
# or go back to planning to improve the answer
workflow.add_conditional_edges(
    "judge",
    final_answer_router,
    {
        "planning": "planning",
        "end": END,
    }
)

# Step 9: Compile the workflow — this makes it ready to run
app = workflow.compile()
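
To check the wiring without calling any model, the edges above can be mirrored in a small transition table and walked by hand. This is a plain-Python sketch rather than LangGraph itself: `END` here is a local placeholder, and `trace_path` and its scripted router labels are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List

END = "__end__"  # local placeholder for this sketch, not imported from LangGraph

# Unconditional edges: planning and tools always hand control to the agent
EDGES: Dict[str, str] = {"planning": "agent", "tools": "agent"}

# Conditional edges: router label -> next node, copied from the workflow above
CONDITIONAL_EDGES: Dict[str, Dict[str, str]] = {
    "decision_making": {"planning": "planning", "end": END},
    "agent": {"continue": "tools", "end": "judge"},
    "judge": {"planning": "planning", "end": END},
}


def trace_path(router_labels: Dict[str, List[str]]) -> List[str]:
    """Walk the graph from the entry point and return the nodes visited.

    `router_labels` scripts what each router returns on successive visits,
    standing in for the LLM-driven decisions of the real nodes.
    """
    scripted = {node: iter(labels) for node, labels in router_labels.items()}
    node, path = "decision_making", []
    while node != END:
        path.append(node)
        if node in EDGES:
            node = EDGES[node]
        else:
            node = CONDITIONAL_EDGES[node][next(scripted[node])]
    return path


# A simple question: answered directly, so the workflow ends immediately
simple_path = trace_path({"decision_making": ["end"]})

# A research query: one tool call, one rejected answer, then an accepted one
research_path = trace_path({
    "decision_making": ["planning"],
    "agent": ["continue", "end", "end"],
    "judge": ["planning", "end"],
})
print(" -> ".join(research_path))
```

The research trace visits `planning` twice because the first answer is rejected by the judge, which sends the workflow back to re-plan before the agent tries again.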

## Example use case for PhD academic research

This cell tests the workflow with several example queries. These queries are designed to evaluate the agent on the following aspects:
- Completing tasks that are representative of the work a PhD researcher might need to perform.
- Addressing more specific tasks that require researching papers within a defined timeframe.
- Tackling tasks across multiple areas of research.
- Critically evaluating its own responses by sourcing specific information from the papers.

In [10]:
test_inputs = [
    "Download and summarize the findings of this paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf",

    "Can you find 8 papers on quantum machine learning?",

    """Find recent papers (2023-2024) about CRISPR applications in treating genetic disorders,
    focusing on clinical trials and safety protocols""",

    """Find and analyze papers from 2023-2024 about the application of transformer architectures in protein folding prediction,
    specifically looking for novel architectural modifications with experimental validation."""
]

# Run tests and store the results for later visualisation
outputs = []
for test_input in test_inputs:
    final_answer = await print_stream(app, test_input)
    outputs.append(final_answer.content)

## New research running

### Input:

Download and summarize the findings of this paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf



### Stream:




I will first download the paper from the provided URL and then summarize its findings. 

Let's proceed with the download. 

**Tool Required:** download-paper

Input:
{"url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf"}



Tool Calls:
  download-paper (call_VUNYczHd4CVYjf9eh2ZbIAk6)
 Call ID: call_VUNYczHd4CVYjf9eh2ZbIAk6
  Args:
    url: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf



Name: download-paper

"Discover Nano\nReview\nAdvances, limitations and perspectives in the use of celecoxib\u2011loaded\nnanocarriers in therapeutics of cancer\nMiguel de Jesus Oliveira Santos1,2 \u00b7 J\u00e9ssica Teles\u2011Souza1 \u00b7 Renata Freitas de Ara\u00fajo\u2011Calumby1,3 \u00b7\nRobert L. Copeland Jr4 \u00b7 Henrique Rodrigues Marcelino2,5 \u00b7 Deise Souza Vilas\u2011B\u00f4as1,3\nReceived: 13 March 2024 / Accepted: 22 July 2024\n\u00a9 The Author(s) 2024 OPEN\nAbstract\nCancer is highlighted as a major global 

## New research running

### Input:

Can you find 8 papers on quantum machine learning?



### Stream:




I will search for 8 papers on quantum machine learning. Please hold on for a moment. 

{"query": "quantum machine learning", "max_papers": 8}



Tool Calls:
  search-papers (call_C4m8MAcspEUbSc1QB1C4VQsf)
 Call ID: call_C4m8MAcspEUbSc1QB1C4VQsf
  Args:
    query: quantum machine learning
    max_papers: 8



Name: search-papers

"* ID: 151256742,\n* Title: Quantum Circuit Learning,\n* Published Date: 2019-04-23T01:00:00+01:00,\n* Authors: Mitarai, Kosuke and Negoro, Makoto and Kitagawa, Masahiro and Fujii, Keisuke,\n* Abstract: We propose a classical-quantum hybrid algorithm for machine learning on\nnear-term quantum processors, which we call quantum circuit learning. A quantum\ncircuit driven by our framework learns a given task by tuning parameters\nimplemented on it. The iterative optimization of the parameters allows us to\ncircumvent the high-depth circuit. Theoretical investigation shows that a\nquantum circuit can approximate nonlinear functions, which is further confirmed\nby 

## New research running

### Input:

Find recent papers (2023-2024) about CRISPR applications in treating genetic disorders, 
    focusing on clinical trials and safety protocols



### Stream:




Here’s a step-by-step plan to find recent papers (2023-2024) about CRISPR applications in treating genetic disorders, with a focus on clinical trials and safety protocols:

### Step 1: Define Search Parameters
- **Keywords**: CRISPR, genetic disorders, clinical trials, safety protocols
- **Date Range**: 2023-2024

### Step 2: Search for Relevant Papers
- Use the CORE API to search for scientific papers that match the defined parameters.

**Tool Required**: search-papers

### Step 3: Review Search Results
- Analyze the titles, abstracts, and authors of the papers retrieved in Step 2 to identify the most relevant studies.

### Step 4: Download Selected Papers
- Choose the most pertinent papers from the search results and download them for a detailed review.

**Tool Required**: download-paper

### Step 5: Summarize Findings
- Create a summary of the key findings from the downloaded papers, focusing on the applications of CRISPR in clinical trials and the safety protocols discussed.

### 

## New research running

### Input:

Find and analyze papers from 2023-2024 about the application of transformer architectures in protein folding prediction, 
    specifically looking for novel architectural modifications with experimental validation.



### Stream:




Here’s a step-by-step plan to help you find and analyze relevant papers on the application of transformer architectures in protein folding prediction, focusing on novel architectural modifications with experimental validation from 2023-2024.

### Step 1: Search for Relevant Papers
- **Subtask**: Use the CORE API to search for scientific papers published in 2023-2024 that discuss transformer architectures in protein folding prediction, specifically looking for those that mention novel architectural modifications and experimental validation.
- **Tool Required**: search-papers

### Step 2: Review Paper Abstracts
- **Subtask**: Once the papers are retrieved, review the abstracts to identify which papers specifically focus on novel architectural modifications and include experimental validation.
- **Tool Required**: None (manual review)

### Step 3: Download Selected Papers
- **Subtask**: Download the selected papers that meet the criteria from Step 2 for a more in-depth analysis.
- **Tool

# Display results

This cell displays each test query alongside its final answer, giving a more compact view than the streamed output above.


In [11]:
# Use test_input rather than input to avoid shadowing Python's built-in input()
for test_input, output in zip(test_inputs, outputs):
    display(Markdown(f"## Input:\n\n{test_input}\n\n"))
    display(Markdown(f"## Output:\n\n{output}\n\n"))

## Input:

Download and summarize the findings of this paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11379842/pdf/11671_2024_Article_4070.pdf



## Output:

The paper titled "Advances, limitations and perspectives in the use of celecoxib-loaded nanocarriers in therapeutics of cancer" provides a comprehensive review of the development and application of celecoxib (CXB) loaded into nanocarriers for cancer treatment. Here are the key findings and insights from the paper:

### Summary of Findings:

1. **Cancer and COX-2**: Cancer remains a significant global health challenge, with cyclooxygenase-2 (COX-2) identified as a crucial marker for tumor progression. Celecoxib, a selective COX-2 inhibitor, is used in adjuvant cancer therapy but requires high concentrations for efficacy, leading to potential side effects.

2. **Nanocarrier Development**: The paper reviews advancements in the formulation of CXB-loaded nanocarriers over the past decade. These nanocarriers, which can be polymeric or lipid-based, enhance the biopharmaceutical properties of CXB, improving its solubility, stability, and targeted delivery to tumor sites.

3. **Types of Nanocarriers**: Various types of nanocarriers have been developed, including:
   - **Polymeric Nanoparticles**: Commonly using materials like poly(lactic-co-glycolic acid) (PLGA), which is biocompatible and allows for controlled drug release.
   - **Lipid-based Systems**: Such as liposomes, which encapsulate CXB in a hydrophilic core, enhancing its delivery and efficacy.

4. **In Vivo Studies**: The review highlights that a significant portion of the studies (54%) involved in vivo experiments, indicating a growing interest in the practical application of CXB-loaded nanocarriers. However, most studies are still in pre-clinical phases, with no clinical trials reported yet.

5. **Synergistic Effects**: The combination of CXB with other anti-inflammatory drugs or apoptosis inducers shows promising pharmacological potential. CXB has been found to enhance the cytotoxic effects of established chemotherapeutic agents, suggesting a synergistic approach to cancer treatment.

6. **Future Perspectives**: The authors emphasize the need for further research to overcome the limitations of CXB, particularly its low solubility and potential cardiovascular side effects. They suggest that future developments should focus on:
   - Quality-by-design approaches for stability and targeting.
   - Co-loading CXB with other drugs to enhance therapeutic efficacy.
   - Exploring controlled release mechanisms and imaging capabilities for better clinical outcomes.

7. **Conclusion**: The review concludes that CXB-loaded nanocarriers represent a promising avenue for enhancing cancer therapy. The integration of nanotechnology with traditional pharmacological approaches could lead to more effective and less harmful cancer treatments.

### Implications:
The findings suggest that while CXB-loaded nanocarriers are still in the early stages of research, they hold significant promise for improving cancer treatment outcomes through enhanced drug delivery and reduced side effects. The review calls for continued innovation and clinical exploration in this emerging field.

This summary encapsulates the main findings and implications of the paper, highlighting the potential of CXB-loaded nanocarriers in cancer therapeutics.



## Input:

Can you find 8 papers on quantum machine learning?



## Output:

Here are 8 papers on quantum machine learning:

1. **Quantum Circuit Learning**
   - **Authors**: Mitarai, Kosuke; Negoro, Makoto; Kitagawa, Masahiro; Fujii, Keisuke
   - **Published Date**: April 23, 2019
   - **Abstract**: This paper proposes a classical-quantum hybrid algorithm for machine learning on near-term quantum processors, called quantum circuit learning. It shows that a quantum circuit can approximate nonlinear functions and discusses the framework's potential applications.
   - **URL**: [Read Paper](http://arxiv.org/abs/1803.00745)

2. **The Power of One Qubit in Machine Learning**
   - **Authors**: Ghobadi, Roohollah; Oberoi, Jaspreet S.; Zahedinejhad, Ehsan
   - **Published Date**: June 8, 2019
   - **Abstract**: This paper introduces a kernel-based quantum machine learning algorithm that can be implemented on near-term quantum devices, focusing on estimating classically intractable kernel functions.
   - **URL**: [Read Paper](http://arxiv.org/abs/1905.01390)

3. **Quantum Machine Learning over Infinite Dimensions**
   - **Authors**: Lau, Hoi-Kwan; Pooser, Raphael; Siopsis, George; Weedbrook, Christian
   - **Published Date**: November 14, 2016
   - **Abstract**: This work generalizes quantum machine learning to infinite-dimensional systems, presenting critical subroutines for quantum algorithms that achieve exponential speedup compared to classical counterparts.
   - **URL**: [Read Paper](http://arxiv.org/abs/1603.06222)

4. **Experimental Demonstration of Quantum Learning Speed-up with Classical Input Data**
   - **Authors**: Lee, Joong-Sung; Bang, Jeongho; Hong, Sunghyuk; Lee, Changhyoup; Seol, Kang Hee; Lee, Jinhyoung; Lee, Kwang-Geol
   - **Published Date**: November 22, 2018
   - **Abstract**: This paper discusses a quantum-classical hybrid machine learning approach, demonstrating a quantum learning speed-up of approximately 36% compared to classical machines.
   - **URL**: [Read Paper](http://arxiv.org/abs/1706.01561)

5. **Quantum Machine Learning: A Classical Perspective**
   - **Authors**: Ciliberto, Carlo; Herbster, Mark; Ialongo, Alessandro Davide; Pontil, Massimiliano; Rocchetto, Andrea; Severini, Simone; Wossnig, Leonard
   - **Published Date**: February 13, 2018
   - **Abstract**: This review discusses the intersection of quantum computing and classical machine learning, emphasizing the limitations and potential advantages of quantum algorithms.
   - **URL**: [Read Paper](http://arxiv.org/abs/1707.08561)

6. **An Efficient Quantum Algorithm for Generative Machine Learning**
   - **Authors**: Gao, Xun; Zhang, Zhengyu; Duan, Luming
   - **Published Date**: November 6, 2017
   - **Abstract**: This paper proposes a quantum algorithm for generative machine learning, demonstrating exponential speedup in training and inference compared to classical models.
   - **URL**: [Read Paper](http://arxiv.org/abs/1711.02038)

7. **Quantum Machine Learning**
   - **Authors**: Biamonte, Jacob; Wittek, Peter; Pancotti, Nicola; Rebentrost, Patrick; Wiebe, Nathan; Lloyd, Seth
   - **Published Date**: May 10, 2018
   - **Abstract**: This paper explores how quantum systems can outperform classical systems in machine learning tasks, discussing the challenges and potential solutions in quantum software development.
   - **URL**: [Read Paper](http://arxiv.org/abs/1611.09347)

8. **Quantum-Enhanced Machine Learning**
   - **Authors**: Dunjko, Vedran; Taylor, Jacob M.; Briegel, Hans J.
   - **Published Date**: October 26, 2016
   - **Abstract**: This work proposes a systematic approach to machine learning from the perspective of quantum information, addressing improvements in supervised, unsupervised, and reinforcement learning.
   - **URL**: [Read Paper](http://arxiv.org/abs/1610.08251)

These papers cover various aspects of quantum machine learning, including algorithms, theoretical frameworks, and experimental demonstrations.



## Input:

Find recent papers (2023-2024) about CRISPR applications in treating genetic disorders, 
    focusing on clinical trials and safety protocols



## Output:

Here are some recent papers (2023-2024) focusing on CRISPR applications in treating genetic disorders, particularly in the context of clinical trials and safety protocols:

1. **Title**: [Progress and harmonization of gene editing to treat human diseases: Proceeding of COST Action CA21113 GenE-HumDi](https://iris.unimore.it/bitstream/11380/1327197/2/1-s2.0-S2162253123002846-main.pdf)  
   **Authors**: Cavazza A., Hendel A., Bak R. O., et al.  
   **Published Date**: January 1, 2023  
   **Abstract**: This paper discusses the GenE-HumDi network, which aims to expedite the application of genome editing for therapeutic purposes. It highlights the importance of safety measures, clinical translation, and regulatory guidelines in gene editing technologies.  
   **Source**: COST Action CA21113.

2. **Title**: [Investigating the Potential of a Cell-Based Gene Editing Therapy for Inherited Metabolic Liver Disease](https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=4432&context=all_dissertations)  
   **Authors**: Ates, Ilayda  
   **Published Date**: August 1, 2023  
   **Abstract**: This study explores the use of CRISPR-Cas9 for inherited metabolic diseases affecting the liver, focusing on electroporation as a non-viral delivery method. It discusses safety and efficacy concerns associated with traditional viral delivery methods.  

3. **Title**: [Clinical applications of gene therapy for rare diseases: A review](https://discovery.ucl.ac.uk/10170190/1/Clinical%20applications%20of%20gene%20therapy%20for%20rare%20diseases%20%20A%20review.pdf)  
   **Authors**: Papaioannou, Ioannis; Owen, James S; Yáñez-Muñoz, Rafael J  
   **Published Date**: May 13, 2023  
   **Abstract**: This review discusses recent advances in gene therapy for rare diseases, including the use of CRISPR technologies. It highlights various viral and non-viral strategies and their clinical applications, emphasizing safety protocols in gene therapy.  

4. **Title**: [Future regenerative medicine developments and their therapeutic applications](https://www.iris.unina.it/bitstream/11588/905963/1/genes-1702228-peer-review-v2.pdf)  
   **Authors**: Altyar, Ahmed E., El-Sayed, Amr, et al.  
   **Published Date**: January 1, 2023  
   **Abstract**: This paper reviews advancements in regenerative medicine, including CRISPR applications. It discusses the potential of gene editing technologies in therapeutic contexts and the importance of safety measures in clinical applications.  

5. **Title**: [Recombinant DNA technology: revolutionizing regenerative medicine and empowering nanotechnology strategies](https://updatepublishing.com/journal/index.php/rrst/article/download/8732/6277/16342)  
   **Authors**: Lakshmi, Mallela, Dhanu, A. S., et al.  
   **Published Date**: December 28, 2023  
   **Abstract**: This article discusses the role of recombinant DNA technology, including CRISPR, in regenerative medicine and its implications for safety and efficacy in clinical applications.  

These papers provide a comprehensive overview of the current state of CRISPR applications in treating genetic disorders, focusing on clinical trials and safety protocols. If you need further details or specific papers downloaded, please let me know!



## Input:

Find and analyze papers from 2023-2024 about the application of transformer architectures in protein folding prediction, 
    specifically looking for novel architectural modifications with experimental validation.



## Output:

Here are the relevant papers from 2023-2024 that discuss transformer architectures in the context of protein folding prediction, focusing on novel architectural modifications and experimental validation:

### 1. **Endowing Protein Language Models with Structural Knowledge**
   - **Authors**: Chen, Dexiong et al.
   - **Published Date**: January 26, 2024
   - **Abstract**: This paper introduces a novel framework that enhances protein language models by integrating protein structural data. The authors refine the self-attention mechanisms of pretrained language transformers by incorporating structural information, resulting in a model termed Protein Structure Transformer (PST). Empirical evaluations demonstrate its superior parameter efficiency and performance in protein function prediction compared to existing models.
   - **Key Contributions**:
     - **Architectural Modifications**: The integration of structural information into the self-attention mechanisms of transformers.
     - **Experimental Validation**: Empirical evaluations show that PST outperforms the state-of-the-art model ESM-2 in protein function prediction.
   - **Link**: [Read the paper](http://arxiv.org/abs/2401.14819)

### 2. **AI Driven B-cell Immunotherapy Design**
   - **Authors**: da Silva, Bruna Moreira et al.
   - **Published Date**: September 3, 2023
   - **Abstract**: This review discusses advancements in AI and machine learning methods for predicting protein structures and their complexes, particularly in the context of immunotherapy design. While it does not focus solely on transformer architectures, it highlights the importance of computational approaches in protein structure prediction.
   - **Key Contributions**: Discusses various machine learning frameworks that could be relevant for protein structure prediction, including transformer-based models.
   - **Link**: [Read the paper](http://arxiv.org/abs/2309.01122)

### 3. **Understanding Biology in the Age of Artificial Intelligence**
   - **Authors**: Lawrence, Elsa et al.
   - **Published Date**: March 6, 2024
   - **Abstract**: This paper explores the interplay between machine learning models and scientific understanding in biology, with a focus on protein structure prediction. It discusses the implications of using ML models, including transformers, in advancing biological research.
   - **Key Contributions**: Provides insights into how ML models, including transformers, can enhance our understanding of protein folding and structure.
   - **Link**: [Read the paper](http://arxiv.org/abs/2403.04106)

### 4. **Machine and Deep Learning Methods for Predicting 3D Genome Organization**
   - **Authors**: Wall, Brydon P. G. et al.
   - **Published Date**: March 4, 2024
   - **Abstract**: This review discusses various machine learning methods for predicting 3D genome organization, which may include insights applicable to protein folding prediction. However, it does not focus specifically on transformer architectures.
   - **Key Contributions**: Discusses computational tools that could be adapted for protein folding prediction.
   - **Link**: [Read the paper](http://arxiv.org/abs/2403.03231)

### 5. **Accelerating Protein Design with Deep Learning**
   - **Authors**: McPartlon, Matthew Timothy
   - **Published Date**: August 23, 2023
   - **Abstract**: This thesis introduces machine-learning methods for accelerating protein design, focusing on modeling protein structure. It discusses a deep neural network approach that leverages backbone coordinates for robust designs and accurate quality estimates.
   - **Key Contributions**: Proposes a deep learning framework that could be relevant for protein folding tasks.
   - **Link**: [Read the paper](https://knowledge.uchicago.edu/record/7659/files/McPartlon_uchicago_0330D_16988.pdf)

### Next Steps
- **Review Abstracts**: You can review the abstracts of these papers to determine which ones are most relevant to your specific interests in novel architectural modifications and experimental validation.
- **Download Selected Papers**: If you find specific papers of interest, I can assist you in downloading them for a more in-depth analysis.

Would you like to proceed with downloading any specific papers, or do you need further assistance with the analysis?



---

An agent usually takes longer to produce a result than a single LLM call, but it returns significantly more detailed metadata, validated sources, and structured academic output.

---



## Limitations

1. Technical Limitations
    - API rate limits for paper access
    - Long processing time for large PDFs
    - Limited to publicly accessible papers
  
2. Functional Limitations
    - No support for image analysis in papers
    - Limited context window for very long papers
    - Cannot perform mathematical derivations
    - Language constraints for non-English papers
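
One common way to soften the API rate-limit constraint is to retry with exponential backoff. The sketch below is illustrative only: `RateLimitError`, `with_backoff`, and `flaky_search` are hypothetical names that are not part of the notebook or of any paper API client, and the `sleep` parameter is injectable so the demo runs without real waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error standing in for an HTTP 429 from a paper API."""


def with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, doubling the wait after each rate-limit error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


calls = 0


def flaky_search() -> str:
    """Hypothetical search call that is rate-limited on its first two attempts."""
    global calls
    calls += 1
    if calls <= 2:
        raise RateLimitError("slow down")
    return "search results"


delays: List[float] = []
result = with_backoff(flaky_search, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In the agent, a wrapper like this would sit around the search or download call inside a tool, so that a temporary rate limit delays the answer rather than failing the whole run.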


## Potential Improvements

1. Technical Improvements
    - Implement parallel processing for multiple papers
    - Add caching system for frequently accessed papers
    - Integrate multiple academic APIs for broader coverage
    - Implement batch processing for large datasets

2. Functional Improvements
    - Add support for figure and table extraction
    - Implement cross-referencing between papers
    - Add citation network analysis
    - Include domain-specific validation rules
        
3. User Experience
    - Add interactive feedback mechanisms
    - Implement progress tracking
    - Add customizable validation criteria
    - Include export options for research summaries
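
As an illustration of the caching idea, an in-memory cache keyed by URL can be built with the standard library alone. This is a minimal sketch under stated assumptions: `fetch_paper_text` is a hypothetical stand-in for the real download tool and returns fake text, and a production cache would more likely persist results to disk.

```python
from functools import lru_cache

download_count = 0  # counts real (uncached) downloads, for demonstration only


def fetch_paper_text(url: str) -> str:
    """Hypothetical stand-in for the real download tool; returns fake text."""
    global download_count
    download_count += 1
    return f"<full text of {url}>"


@lru_cache(maxsize=128)
def cached_paper_text(url: str) -> str:
    """Return the paper text, downloading only on the first request per URL."""
    return fetch_paper_text(url)


url = "https://example.org/paper.pdf"  # placeholder URL
first = cached_paper_text(url)
second = cached_paper_text(url)  # served from the cache, no second download
print(download_count)
```

`lru_cache` evicts the least recently used entry once `maxsize` is reached, so memory use stays bounded even when the agent is asked about many different papers.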
        
   
## Specific Use Cases

1. Academic research: literature review and paper analysis
    - Comprehensive search
    - Citation tracking
    - Cross-reference validation

2. Industry research: technical documentation and patent analysis
    - Focused search
    - Technical specification extraction
    - Competitive analysis

3. Education: student research assistance
    - Simplified explanations
    - Learning resource identification
    - Guided research process


---

## Conclusion

This implementation demonstrates how a state-driven architecture can support academic paper analysis. By combining LangGraph's orchestration with paper search and download tools, we've built a system that automates the key steps of paper processing: deciding whether research is needed, planning, calling tools, and judging the final answer. The judge step, capped at two feedback rounds, adds a quality check before an answer is returned, so the workflow streamlines literature review while keeping validation in the loop.
