In [1]:
import pickle
import os
from typing import List, Optional

from pydantic import BaseModel, Field

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.runnables import RunnablePassthrough

from os import path
from typing import List, Optional
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA


In [2]:
prompt = """

You are an expert chemist. The document describes the synthesis and characterization of Metal-Organic Frameworks (MOFs) and Coordinated Polymers (CPs) with crystal structures and other properties. MOFs or CPs are compound with very well-defined crystalline structure that consist of a transition metal node like Cu, Dy, Zn, etc. and an organic linker that is commonly referred to with short hand names like DME, or BTC. Use your chemistry knowledge to determine whether a compound is a metal, organic linker or a MOF.

## Instructions:
1) Compound names: MOFs and CPs can have different based on the author. They can be code names like: 'ZIF-8, HKUST-1, etc.' or Chemical formulas: 'Cu2(btc)3, etc.' Make sure to extract one of these formats after resolving any co-references such as 'Compound 1, 1a, crystal 1, network 1a, etc'. When outputting the final MOF name, concatenate the all co-references to the single MOF with "<|>". For Example, "Cu2(Btc)3 - HKUST-1, or compound 1, is a very good MOF". Should output a MOF name of "Cu2(BTC)3<|>HKUST-1<|>compound 1". The MOF names you are extracting MUST BE UNQIUE!

2) Make sure to fully unpack MOF generic formulas into the individual MOFs names. For example "X-DMF.2H2O (X = Cu, Zn, Mn)" should output MOFs 'Cu-DMF.2H2O', 'Zn-DMF.2H2O', 'Mn-DMF.2H2O'. Often you can encounter co-references their as well. For example, "Cu2(DMF)4.3(S) (S = iba (Cu-iba), ina (Cu-ina))" should output the following individual MOFs: 'Cu2(DMF)4.3(iba)<|>Cu-iba' and 'Cu2(DMF)4.3(ina)<|>Cu-ina'.

3) Justification: Always provide a justification for your output. The justification must be logical and should help the user understand how you went about answering their question. Never make up answers without justification from provided information. You can use chemistry knowledge to deduce Metal Nodes and Organic Linkers and chemical formula stoichiometry only!

4) Always include the following in the output: DOI, MOF Name from text, CSD Reference Code (if Provided). Do not use CCDC Reference Codes for the papers.

## Task:
I have the following CSD Reference codes for MOFs. Only use these Ref Codes:

{'OKAYUW': {'Space Group': 'P-1', 'Metal Nodes': 'Ca', 'Chemical Name': 'catena-[(μ-22,25-bis(4-carboxyphenyl)[11,21:24,31-terphenyl]-14,34-dicarboxylato)-calcium ethylene]', 'a': 5.1803, 'b': 10.6508, 'c': 15.2914, 'Molecular Formula': 'C34Ca1H20O8', 'Synonyms': "['SBMOF-2 ethylene']"}, 'OKAYIK': {'Space Group': 'P21/n', 'Metal Nodes': 'Ca', 'Chemical Name': "catena-[(μ-4,4'-sulfonyldibenzoato)-calcium ethane]", 'a': 11.6667, 'b': 5.5586, 'c': 22.935, 'Molecular Formula': 'C56Ca4H32O24S4', 'Synonyms': "['SBMOF-1 ethane']"}, 'OKAZAD': {'Space Group': 'P-1', 'Metal Nodes': 'Ca', 'Chemical Name': 'catena-[(μ-22,25-bis(4-carboxyphenyl)[11,21:24,31-terphenyl]-14,34-dicarboxylato)-calcium ethane]', 'a': 5.2195, 'b': 10.5691, 'c': 15.3604, 'Molecular Formula': 'C34Ca1H20O8', 'Synonyms': "['SBMOF-2 ethane']"}, 'OKAYEG': {'Space Group': 'P21/n', 'Metal Nodes': 'Ca', 'Chemical Name': "catena-[(μ-4,4'-sulfonyldibenzoato)-calcium ethylene]", 'a': 11.5955, 'b': 5.5581000000000005, 'c': 22.9548, 'Molecular Formula': 'C56Ca4H32O24S4', 'Synonyms': "['SBMOF-1 ethylene']"}, 'OKAYAC': {'Space Group': 'P21/n', 'Metal Nodes': 'Ca', 'Chemical Name': "catena-[(μ-4,4'-sulfonyldibenzoato)-calcium acetylene]", 'a': 11.6583, 'b': 5.567100000000001, 'c': 22.911, 'Molecular Formula': 'C56Ca4H32O24S4', 'Synonyms': "['SBMOF-1 acetylene']"}, 'OKAYOQ': {'Space Group': 'P-1', 'Metal Nodes': 'Ca', 'Chemical Name': 'catena-[(μ-22,25-bis(4-carboxyphenyl)[11,21:24,31-terphenyl]-14,34-dicarboxylato)-calcium acetylene]', 'a': 5.1634, 'b': 10.5518, 'c': 15.4849, 'Molecular Formula': 'C34Ca1H20O8', 'Synonyms': "['SBMOF-2 acetylene']"}}

You need to find which MOFs in the provided text have these CSD Reference codes. Use the features provided for each CSD reference code like Lattice Parameters (a, b, c), Metal node, Chemical Name, Space group, Molecular formula, and Synonyms to find the matching MOF from the paper. Do not hallucinate information not included in the paper.
EACH MOF CAN HAVE A SINGLE REF CODE AT MOST
YOU MUST USE THE FOLLOWING FORMAT FOR EACH MOF YOU FIND:
1.  -MOF name: name of the MOF along with the co-references concatenated with '<|>'. These names must be unique! Unpack generic formulas
    -CSD Ref Code: the CSD Ref Code for the specific MOF. Use "not provided" if you believe non of the MOFs match any CSD Ref Code. Don't over use 'not provided'
    -Justification: why do you believe this MOF has this CSD Ref Code. Explain whether Metal node, space group, molecular formula or organic linker match.



"""

In [3]:
from langchain import hub
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# See full prompt at https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")



In [4]:
retrieval_qa_chat_prompt

ChatPromptTemplate(input_variables=['context', 'input'], optional_variables=['chat_history'], input_types={'chat_history': typing.List[typing.Annotated[typing.Union[typing.Annotated[langchain_core.messages.ai.AIMessage, Tag(tag='ai')], typing.Annotated[langchain_core.messages.human.HumanMessage, Tag(tag='human')], typing.Annotated[langchain_core.messages.chat.ChatMessage, Tag(tag='chat')], typing.Annotated[langchain_core.messages.system.SystemMessage, Tag(tag='system')], typing.Annotated[langchain_core.messages.function.FunctionMessage, Tag(tag='function')], typing.Annotated[langchain_core.messages.tool.ToolMessage, Tag(tag='tool')], typing.Annotated[langchain_core.messages.ai.AIMessageChunk, Tag(tag='AIMessageChunk')], typing.Annotated[langchain_core.messages.human.HumanMessageChunk, Tag(tag='HumanMessageChunk')], typing.Annotated[langchain_core.messages.chat.ChatMessageChunk, Tag(tag='ChatMessageChunk')], typing.Annotated[langchain_core.messages.system.SystemMessageChunk, Tag(tag='Sy

In [5]:
vec = "/home/tom-pruyn/Documents/TDM Papers/blah/42"

In [6]:
from langchain.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

vs = FAISS.load_local(vec, OpenAIEmbeddings(model = 'text-embedding-ada-002'), allow_dangerous_deserialization = True)


In [7]:
retriever = vs.as_retriever(
                    search_type="mmr", search_kwargs={"k": 9, "fetch_k": 50}
                )

In [8]:

QA_PROMPT = (
    "Answer the user question using the information provided in the documents."
    "Don't make up answer!\n"
    "Documents:\n{context}\n\n Question:\n{input}"
)

In [9]:
qa_chat_prompt = ChatPromptTemplate.from_template(QA_PROMPT)

In [10]:
qa_chat_prompt

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template="Answer the user question using the information provided in the documents.Don't make up answer!\nDocuments:\n{context}\n\n Question:\n{input}"), additional_kwargs={})])

In [11]:
llm = ChatOpenAI(model="gpt-4o", temperature=0)

In [12]:
docs_chain = create_stuff_documents_chain(llm, qa_chat_prompt)


In [13]:
qa_chain = create_retrieval_chain(retriever, docs_chain)


In [14]:
result = qa_chain.invoke({"input": prompt})

In [15]:
result

{'input': '\n\nYou are an expert chemist. The document describes the synthesis and characterization of Metal-Organic Frameworks (MOFs) and Coordinated Polymers (CPs) with crystal structures and other properties. MOFs or CPs are compound with very well-defined crystalline structure that consist of a transition metal node like Cu, Dy, Zn, etc. and an organic linker that is commonly referred to with short hand names like DME, or BTC. Use your chemistry knowledge to determine whether a compound is a metal, organic linker or a MOF.\n\n## Instructions:\n1) Compound names: MOFs and CPs can have different based on the author. They can be code names like: \'ZIF-8, HKUST-1, etc.\' or Chemical formulas: \'Cu2(btc)3, etc.\' Make sure to extract one of these formats after resolving any co-references such as \'Compound 1, 1a, crystal 1, network 1a, etc\'. When outputting the final MOF name, concatenate the all co-references to the single MOF with "<|>". For Example, "Cu2(Btc)3 - HKUST-1, or compound

In [16]:
qa_chain_2 = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, memory=None, chain_type="stuff")

In [26]:
qa_chain_2.combine_documents_chain

StuffDocumentsChain(verbose=True, llm_chain=LLMChain(verbose=False, prompt=ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n{context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template='{question}'), additional_kwargs={})]), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x704d23fcfdd0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x704d23fcdd30>, root_client=<openai.OpenAI object at 0x704d2c106c90>, root_async_client=<openai.AsyncOpenAI object at 0x704d23fcfe30>, m

In [25]:
result_2 = qa_chain_2.invoke(prompt)

Error in StdOutCallbackHandler.on_chain_start callback: AttributeError("'NoneType' object has no attribute 'get'")



[1m> Finished chain.[0m


In [None]:
result_2

{'query': '\n\nYou are an expert chemist. The document describes the synthesis and characterization of Metal-Organic Frameworks (MOFs) and Coordinated Polymers (CPs) with crystal structures and other properties. MOFs or CPs are compound with very well-defined crystalline structure that consist of a transition metal node like Cu, Dy, Zn, etc. and an organic linker that is commonly referred to with short hand names like DME, or BTC. Use your chemistry knowledge to determine whether a compound is a metal, organic linker or a MOF.\n\n## Instructions:\n1) Compound names: MOFs and CPs can have different based on the author. They can be code names like: \'ZIF-8, HKUST-1, etc.\' or Chemical formulas: \'Cu2(btc)3, etc.\' Make sure to extract one of these formats after resolving any co-references such as \'Compound 1, 1a, crystal 1, network 1a, etc\'. When outputting the final MOF name, concatenate the all co-references to the single MOF with "<|>". For Example, "Cu2(Btc)3 - HKUST-1, or compound

In [19]:
docs2 = retriever.invoke(prompt)

In [20]:
docs2

[Document(metadata={'source': '/home/tom-pruyn/Documents/TDM Papers/blah/42/10.1021_acs.chemmater.5b03792.md'}, page_content='Acknowledgment\n\nSynthetic strategies for development of SBMOF-2, SCXRD and DSC-XRD characterization work and analysis of synchrotron data at Stony Brook by A.M.P., X.C., J.B.P. and W.R.W was supported by the National Science Foundation grant DMR1231586. Structures of SBMOF-1:C2H2, SBMOF-1:C2H4, SBMOF-2:C2H2 and SBMOF-2:C2H4 were determined using the Stony Brook University single-crystal diffractometer, obtained through the support of the NSF (grant number CHE-0840483).\n\nStructures of SBMOF-1:C2H6 and SBMOF-2:C2H6 were determined in ChemMatCars (Sector 15), Advanced Photon Source 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 (APS), principally supported by the National Science Foundation/Department of Energy (NSF/CHE-0822838). Use of AP

In [21]:
result

{'input': '\n\nYou are an expert chemist. The document describes the synthesis and characterization of Metal-Organic Frameworks (MOFs) and Coordinated Polymers (CPs) with crystal structures and other properties. MOFs or CPs are compound with very well-defined crystalline structure that consist of a transition metal node like Cu, Dy, Zn, etc. and an organic linker that is commonly referred to with short hand names like DME, or BTC. Use your chemistry knowledge to determine whether a compound is a metal, organic linker or a MOF.\n\n## Instructions:\n1) Compound names: MOFs and CPs can have different based on the author. They can be code names like: \'ZIF-8, HKUST-1, etc.\' or Chemical formulas: \'Cu2(btc)3, etc.\' Make sure to extract one of these formats after resolving any co-references such as \'Compound 1, 1a, crystal 1, network 1a, etc\'. When outputting the final MOF name, concatenate the all co-references to the single MOF with "<|>". For Example, "Cu2(Btc)3 - HKUST-1, or compound