This notebook explores parsing plasmid data from Addgene using LangChain with OpenAI models.

As of now, the following attributes are desired for all plasmids:
- [X] name (str)
- [X] plasmid id # (int)
- [X] purpose (str)
- [X] publication (str)
- [X] sequence information (link to GenBank file)
- [ ] vector backbone (link)
- [ ] vector type (str[])
- [ ] tag/fusion protein (str[])
- [ ] bacterial resistance(s) (str[] for now)
- [ ] growth temp (str for now)
- [ ] growth strain(s) (str)
- [ ] copy number (str)
- [ ] gene/insert name (str)
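
One way to pin down the attribute list above is a small typed record. The sketch below is only an illustration: the class name `PlasmidRecord` and its field names are hypothetical, not something the scraper already defines. The types follow the list, with the still-unfinished attributes left optional or defaulting to empty lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlasmidRecord:
    """Hypothetical container mirroring the attribute checklist."""
    name: str
    plasmid_id: int
    purpose: str
    publication: str
    sequence_url: str                       # link to the GenBank file
    vector_backbone: Optional[str] = None   # link
    vector_type: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    bacterial_resistance: List[str] = field(default_factory=list)
    growth_temp: Optional[str] = None       # str for now
    growth_strain: Optional[str] = None
    copy_number: Optional[str] = None
    insert_name: Optional[str] = None


# Example values taken from the plasmid used later in this notebook.
record = PlasmidRecord(
    name="pBABE puro EGFP",
    plasmid_id=128041,
    purpose="(Empty Backbone) Retroviral vector with N terminal EGFP",
    publication="/browse/article/28203582/",
    sequence_url="/128041/sequences/",
)
```

Using `default_factory=list` gives each record its own empty list instead of sharing one mutable default across instances.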


In [None]:
from bs4 import BeautifulSoup, Tag
from requests import get

In [59]:
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate

#llm = OpenAI(model_name="davinci")
chat_model = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")



## Define System Messages

In [52]:
system_message = SystemMessagePromptTemplate.from_template(
    # Adjacent string literals are concatenated as-is, so each line needs an explicit "\n".
    "We are extracting the following keys from a webpage.\n"
    "- name\n"
    "- id\n"
    "- purpose\n"
    "- publication href\n"
    "- sequence href\n"
    "- vector backbone\n"
    "- vector type\n"
    "- tag / fusion protein(s)\n"
    "- bacterial resistance(s)\n"
    "- growth temp\n"
    "- copy number\n"
    "- gene/insert name"
)
system_message2 = SystemMessagePromptTemplate.from_template("The webpage contains metadata for biotech parts as strings or arrays of strings, which are derived from nearby elements. Return a JSON dict.")

## Get page

In [None]:
def addgene_url(plasmid_id: int) -> str:
    return f'https://www.addgene.org/{plasmid_id}/'

# url for "pBABE puro EGFP"
url = addgene_url(128041)

page = get(url)
page = BeautifulSoup(page.text, 'html.parser')
raw_plasmid_info = page.find('section', 'addgene-panel-catalog-item')

### Clean page

In [None]:
# Strip whitespace from each line and drop the empty ones.
# (Deleting by index inside a loop skips items, because later indices shift after each deletion.)
lines = (line.strip() for line in str(raw_plasmid_info).split('\n'))
cleaned = "".join(line for line in lines if line)

## Create message from page

In [None]:
page_message = HumanMessagePromptTemplate.from_template("Here is the raw webpage: {raw_page}")

## Combine prompts and add page data

In [58]:
chat_message = ChatPromptTemplate.from_messages([system_message, system_message2, page_message])
chat_model.predict_messages(chat_message.format_prompt(raw_page=str(raw_plasmid_info)).to_messages())


InvalidRequestError: The model `code-davinci-002` does not exist or you do not have access to it.

In [54]:
print(_53.content)

{
  "name": "pBABE puro EGFP",
  "id": "128041",
  "purpose": "(Empty Backbone) Retroviral vector with N terminal EGFP",
  "publication href": "/browse/article/28203582/",
  "sequence href": "/128041/sequences/",
  "vector backbone": "pBABE puro",
  "vector type": "Mammalian Expression, Retroviral",
  "tag / fusion protein(s)": ["EGFP (N terminal on backbone)"],
  "bacterial resistance(s)": ["Ampicillin, 100 μg/mL"],
  "growth temp": "37°C",
  "copy number": "High Copy",
  "gene/insert name": "None"
}
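
The reply above is a JSON string, so it still needs to be parsed before it can be compared with anything. A minimal sketch, assuming the reply is valid JSON with the keys shown: `parse_model_reply` is a hypothetical helper, not part of LangChain. It converts the `id` to an int (the model returned it as a string) and maps the literal string `"None"` to a real `None`.

```python
import json


def parse_model_reply(reply_text: str) -> dict:
    """Parse the model's JSON reply and normalise a few fields (hypothetical helper)."""
    record = json.loads(reply_text)
    # The model returned the id as a string; the attribute list asks for an int.
    record["id"] = int(record["id"])
    # The model wrote the literal string "None" when no insert was present.
    if record.get("gene/insert name") == "None":
        record["gene/insert name"] = None
    return record


# Shortened copy of the reply shown above.
reply = '{"name": "pBABE puro EGFP", "id": "128041", "gene/insert name": "None"}'
parsed = parse_model_reply(reply)
```

`json.loads` raises `json.JSONDecodeError` if the model wraps the JSON in extra prose, so a real pipeline would probably want to catch that case.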


# Using LangChain more idiomatically

In [60]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain, create_extraction_chain_pydantic
from langchain.prompts import ChatPromptTemplate

In [61]:
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

In [73]:
schema = {
    "properties": {
        "name": {"type": "string"},
        "id": {"type": "integer"},
        "purpose": {"type": "string"},
        "publication href": {"type": "string"},
        "sequence href": {"type": "string"},
        "vector backbone": {"type": "string"},
        "vector type": {"type": "string"},
        "tag / fusion protein": {"type": "string"},
        "bacterial resistance": {"type": "string"},
        "growth temp": {"type": "string"},
        "copy number": {"type": "string"},
        "gene/insert name": {"type": "string"},
    },
    "required": ["name", "id", "purpose"],
}
chain = create_extraction_chain(schema, llm)
chain.run(str(raw_plasmid_info))

[{'name': 'pBABE puro EGFP',
  'id': 128041,
  'purpose': 'Retroviral vector with N terminal EGFP',
  'publication href': '/browse/article/28203582/',
  'sequence href': '/128041/sequences/',
  'vector backbone': 'pBABE puro',
  'vector type': 'Mammalian Expression, Retroviral',
  'tag / fusion protein': 'EGFP (N terminal on backbone)',
  'bacterial resistance': 'Ampicillin, 100 μg/mL',
  'growth temp': '37°C',
  'copy number': 'High Copy',
  'gene/insert name': 'None'}]

Parsing with the OpenAI API worked well. However, it is overkill for routine fetching and is better suited to a QA role. This implementation will instead be used to spot-check random plasmid IDs (which can be fetched from the "Recently Uploaded" page) to ensure that the normal scraping function works as expected.
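
The QA check could be as simple as a field-by-field comparison between the scraper's record and the LLM-extracted one. The sketch below assumes both are plain dicts with matching keys. The names `normalise` and `find_mismatches` are hypothetical, and the sample values are made up to show one match and one mismatch.

```python
from typing import Any, Dict, List


def normalise(value: Any) -> str:
    """Collapse whitespace and case so cosmetic differences are ignored."""
    return " ".join(str(value).split()).lower()


def find_mismatches(scraped: Dict[str, Any], extracted: Dict[str, Any]) -> List[str]:
    """Return the keys whose normalised values differ between the two records."""
    mismatches = []
    for key in sorted(set(scraped) | set(extracted)):
        if normalise(scraped.get(key)) != normalise(extracted.get(key)):
            mismatches.append(key)
    return mismatches


# Illustrative records: the name differs only in spacing, the id only in type.
scraped = {"name": "pBABE puro EGFP", "id": 128041, "copy number": "High Copy"}
extracted = {"name": "pBABE  puro EGFP", "id": "128041", "copy number": "Low Copy"}
mismatches = find_mismatches(scraped, extracted)
```

Because values are compared as normalised strings, a missing key and the model's literal `"None"` are treated as equal, which seems reasonable for a spot check but may hide real gaps.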