# Papyrus Dataset Integration with LangChain Agent

This notebook demonstrates the integration of the Papyrus bioactivity dataset, accessed through QSPRpred, with a LangChain-powered AI agent for dataset retrieval.

## Table of Contents
1. Setup and Imports
2. BridgeDB API Wrapper
3. Custom LangChain Tool
4. AI Agent Configuration
5. Testing the Agent
6. Additional Utility Functions
7. Direct API Calls

## 1. Setup and Imports

First, let's set up our environment and import the necessary libraries.

In [1]:
# Import statements
import os
import requests
import pandas as pd
from dotenv import load_dotenv
from langchain_core.tools import BaseTool
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

from qsprpred.data.sources.papyrus import Papyrus
import qsprpred

# Load environment variables
load_dotenv()

# Set OpenAI API Key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OpenAI API Key is not set. Please check your .env file.")

os.environ["OPENAI_API_KEY"] = api_key

dataset_name = "PapyrusTutorialDataset"  # name of the file to be generated
papyrus_version = "latest"  # Papyrus database version
data_dir = "papyrus"  # directory to store the Papyrus data
output_dir = "data"  # directory to store the generated dataset

In [None]:
from papyrus_scripts.download import download_papyrus
download_papyrus(version='latest', structures=True, descriptors=['mold2', 'unirep'])

## 2. BridgeDB API Wrapper

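We'll create a wrapper class for the Papyrus data source in QSPRpred to handle dataset retrieval requests. The first cell below runs the retrieval directly; the following cells wrap it in a class and a LangChain tool.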

In [20]:
from qsprpred.data.sources.papyrus import Papyrus
import qsprpred

acc_keys = ["P29274"]
dataset_name = "PapyrusTutorialDataset"  # name of the file to be generated
quality = "high"  # choose minimum quality from {"high", "medium", "low"}
papyrus_version = "latest"  # Papyrus database version
data_dir = "papyrus"  # directory to store the Papyrus data
output_dir = "data"  # directory to store the generated dataset

# Create a Papyrus object, which specifies the version and the directory to store the Papyrus data
papyrus = Papyrus(
    data_dir=data_dir,
    version=papyrus_version,
    stereo=False,
    plus_only=True,
)

# Create a subset of the Papyrus data for the given accession keys; returns a MoleculeTable
mt = papyrus.getData(
    dataset_name,
    acc_keys,
    quality,
    output_dir=output_dir,
    use_existing=False,
    activity_types=["Ki", "IC50", "Kd"]
)
mt.getDF().head()

len(mt.getDF())

0it [00:00, ?it/s]

3785

In [2]:
# PapyrusAPI class
# This class wraps the QSPRpred Papyrus data source:
# - class attributes hold the dataset name, Papyrus version, and data/output directories
# - `fetch_data`: a method that returns a MoleculeTable for the given accession keys and quality

class PapyrusAPI:
    dataset_name = "PapyrusTutorialDataset"  # name of the file to be generated
    papyrus_version = "latest"  # Papyrus database version
    data_dir = "papyrus"  # directory to store the Papyrus data
    output_dir = "data"  # directory to store the generated dataset

    @staticmethod
    def fetch_data(acc_keys=("P29274",), quality="high"):
        # Accept a single accession key as well as a sequence of keys
        if isinstance(acc_keys, str):
            acc_keys = [acc_keys]
        papyrus = Papyrus(
            data_dir=PapyrusAPI.data_dir,
            version=PapyrusAPI.papyrus_version,
            stereo=False,
            plus_only=True,
        )
        # Create a subset for the given accession keys
        mt = papyrus.getData(
            PapyrusAPI.dataset_name,
            list(acc_keys),
            quality,
            output_dir=PapyrusAPI.output_dir,
            use_existing=False,
            activity_types=["Ki", "IC50", "Kd"],
        )
        # Fail loudly instead of silently returning None on an unexpected type
        if not isinstance(mt, qsprpred.data.tables.mol.MoleculeTable):
            raise TypeError(f"Fetched data is {type(mt)}, should be MoleculeTable")
        return mt

In [28]:
# Define the Papyrus dataset tool for the agent

class PapyrusDatasetCreationTool(BaseTool):
    name: str = "dataset_creation"
    description: str = (
        "Used for retrieving and filtering data from the Papyrus database. "
        "Input format: 'acc key, quality', e.g. 'P29274, high'. "
        "Returns the number of entries in the resulting dataset."
    )

    def _run(self, query: str) -> str:
        # Parse the query, which has exactly two parts: acc key, quality
        parts = query.split(",")
        if len(parts) != 2:
            return "Error: Query should be in the format 'acc key, quality'"
        acc_key, quality = [p.strip() for p in parts]

        # Attempt to retrieve papyrus_set and handle potential errors
        try:
            # fetch_data expects a list of accession keys
            papyrus_set = PapyrusAPI.fetch_data([acc_key], quality)
            # Return a string, as declared in the signature
            return str(len(papyrus_set))
        except Exception as e:
            return f"Error: {str(e)}"

    async def _arun(self, query: str) -> str:
        # Async implementation (not needed for this tool)
        raise NotImplementedError("This tool does not support async")

tools = [PapyrusDatasetCreationTool()]
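
The tool's query format ('acc key, quality') can also be parsed and validated in a small standalone function, which makes it easy to test without touching Papyrus. The sketch below is illustrative only: `parse_dataset_query` and `VALID_QUALITIES` are hypothetical names, and accepting several accession keys before the quality is an assumption that goes beyond the two-part format used by the tool above.

```python
# Quality levels as listed in the notebook: {"high", "medium", "low"}
VALID_QUALITIES = {"high", "medium", "low"}


def parse_dataset_query(query: str) -> tuple[list[str], str]:
    """Split 'key1, key2, ..., quality' into (accession keys, quality).

    Hypothetical helper: the last comma-separated field is taken as the
    quality, and every field before it is treated as an accession key.
    """
    parts = [p.strip() for p in query.split(",") if p.strip()]
    if len(parts) < 2:
        raise ValueError("Query should be in the format 'acc key, quality'")
    *acc_keys, quality = parts
    quality = quality.lower()
    if quality not in VALID_QUALITIES:
        raise ValueError(
            f"Unknown quality '{quality}', expected one of {sorted(VALID_QUALITIES)}"
        )
    return acc_keys, quality


print(parse_dataset_query("P29274, high"))  # (['P29274'], 'high')
```

Raising `ValueError` here keeps parsing separate from error reporting; a tool's `_run` method could catch it and return the message as a string to the agent.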

In [5]:
# Test the tool directly with a different accession key and quality
tool = PapyrusDatasetCreationTool()
tool._run("P29275, low")
print('hello')

0it [00:00, ?it/s]

hello


## 3. Custom LangChain Tool

The custom LangChain tool, `PapyrusDatasetCreationTool`, is defined in the cells above: it wraps `PapyrusAPI.fetch_data` and reports the size of the retrieved dataset.

## 4. AI Agent Configuration

Let's set up our AI agent using the custom tool we created.

In [29]:
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_core.messages import HumanMessage

def create_papyrus_agent():
    # Build a ReAct agent that can call the Papyrus dataset tool
    model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    tools = [PapyrusDatasetCreationTool()]
    # MemorySaver keeps the conversation state per thread_id
    memory = MemorySaver()
    agent_executor = create_react_agent(model, tools, checkpointer=memory)
    return agent_executor

papyrus_agent = create_papyrus_agent()

def run_agent_query(query):
    config = {"configurable": {"thread_id": "qspr_conversation"}}
    
    print(f"Query: {query}\n")
    for chunk in papyrus_agent.stream(
        {"messages": [HumanMessage(content=query)]},
        config
    ):
        if 'agent' in chunk and 'messages' in chunk['agent']:
            for message in chunk['agent']['messages']:
                if hasattr(message, 'content') and message.content:
                    print(message.content)
        print("----")
    print("\n")

# Test queries

run_agent_query("What is the length of the papyrus_set with acc key P29274 and quality high")

Query: What is the length of the papyrus_set with acc key P29274 and quality high

----
----
----


0it [00:00, ?it/s]

----
The length of the papyrus_set with the acc key P29274 and quality high is 3785.
----
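
The filtering done inside `run_agent_query` explains the output above: only chunks from the `agent` node with non-empty message content are printed, so tool-calling steps and tool results show up as bare `----` separators. The sketch below reproduces that filtering on stand-in data. It assumes each streamed chunk is a dict keyed by node name, as the loop implies; `extract_agent_text` and `fake_stream` are hypothetical, and `SimpleNamespace` objects stand in for real message objects.

```python
from types import SimpleNamespace


def extract_agent_text(chunks):
    """Collect non-empty message contents from 'agent' updates in a stream."""
    texts = []
    for chunk in chunks:
        # Chunks from other nodes (e.g. tool results) have no 'agent' key
        agent_update = chunk.get("agent", {})
        for message in agent_update.get("messages", []):
            content = getattr(message, "content", "")
            if content:
                texts.append(content)
    return texts


# Stand-in chunks: a tool-calling step with empty content,
# a tool result, and the final answer from the agent.
fake_stream = [
    {"agent": {"messages": [SimpleNamespace(content="")]}},
    {"tools": {"messages": [SimpleNamespace(content="3785")]}},
    {"agent": {"messages": [SimpleNamespace(content="The length is 3785.")]}},
]

print(extract_agent_text(fake_stream))  # ['The length is 3785.']
```

Returning the texts as a list, rather than printing inside the loop, makes the same logic reusable when the final answer needs to be stored or post-processed.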


