## Use your API key here

In [1]:
api = 'sk-proj'  # placeholder; replace with your own key
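
Hard-coding the key in a notebook makes it easy to leak when the file is shared. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and not part of the original notebook:

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With the variable set, `api = load_api_key()` could replace the literal assignment above.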

## Basic Prompt Structure with input variables for data and chemical formula

In [2]:
system_message = """
You are an expert in NMR spectroscopy and organic chemistry.
"""

user_prompt = """
Here are the peaks from an NMR spectrum: {spectrum_data}.
The chemical formula is {chemical_formula}. What's the molecule? 
Think step-by-step, making extensive use of a scratchpad to record your thoughts. Consider finding ways to group related peaks 
together, and keep track of the stoichiometry and the amount of unassigned H atoms as you make provisional assignments.
Format the final answer like this - 
### Scratchpad ### <scratchpad> ### Scratchpad ###
### Start answer ### <prediction> ### End answer ###
The prediction should only contain the name of the molecule and no other text or characters.

"""

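The `{spectrum_data}` and `{chemical_formula}` placeholders are filled in when the prompt is invoked. As a rough illustration of that substitution using plain `str.format` rather than LangChain (the shortened template and the sample peak list below are made up for the example):

```python
# Shortened stand-in for user_prompt; the sample values are illustrative only.
template = (
    "Here are the peaks from an NMR spectrum: {spectrum_data}.\n"
    "The chemical formula is {chemical_formula}. What's the molecule?"
)

filled = template.format(
    spectrum_data="1.24 (t, 3H); 3.72 (q, 2H)",
    chemical_formula="C2H6O",
)
print(filled)
```

One consequence of this style of templating is that any literal curly braces in the prompt text would need to be doubled (`{{` and `}}`) so they are not read as placeholders.
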
## Call to OpenAI

In [3]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

def call_openAI(spec_data, chem_formula, system_message=system_message,
                user_prompt=user_prompt, api_key=api, model="gpt-4-0613"):
    """Fill the prompt with the spectrum and formula and return the model's reply as a string."""
    # Alternative model: "gpt-3.5-turbo"
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_message),
        ("user", user_prompt),
    ])
    llm = ChatOpenAI(api_key=api_key, model=model)
    parser = StrOutputParser()

    # Pipe prompt -> model -> string parser into a single runnable chain.
    chain = prompt | llm | parser

    return chain.invoke({"spectrum_data": spec_data, "chemical_formula": chem_formula})


In [4]:
import csv

with open("csvfiles_update/64-17-5_C2H6O_Ethanol_HNMR.csv", newline="") as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)  # skip the header row

    # One comma-separated line per peak, each terminated by a newline.
    specs_data = "".join(", ".join(row) + "\n" for row in csvreader)
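
The file name appears to follow a `<CAS>_<formula>_<name>_<experiment>.csv` pattern. Assuming that convention holds for the other files (it is inferred from this single example), a hypothetical helper such as `parse_spectrum_filename` could pull the formula from the name instead of typing it by hand:

```python
from pathlib import Path

def parse_spectrum_filename(path):
    """Split '<CAS>_<formula>_<name>_<experiment>.csv' into its four parts."""
    parts = Path(path).stem.split("_")
    if len(parts) != 4:
        # Names containing extra underscores do not fit the assumed pattern.
        raise ValueError(f"Unexpected file name: {path}")
    cas, formula, name, experiment = parts
    return {"cas": cas, "formula": formula, "name": name, "experiment": experiment}

info = parse_spectrum_filename("csvfiles_update/64-17-5_C2H6O_Ethanol_HNMR.csv")
print(info["formula"])  # C2H6O
```

The parsed name could also serve as the expected answer when checking the model's prediction.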

In [5]:
output_gpt4 = call_openAI(specs_data, 'C2H6O')
output_gpt4


'### Scratchpad ### \n\nThe given molecular formula is C2H6O, which corresponds to a molecule with 2 carbons, 6 hydrogens and 1 oxygen. \n\nThe possibilities for this formula are ethanol, dimethyl ether, and methoxyethane. \n\nThe NMR spectrum provides a list of chemical shifts, which can help us identify the structure of the molecule. \n\nPeaks in the range 3-4 ppm typically indicate the presence of protons bonded to a carbon atom adjacent to an oxygen atom. Peaks in the range of 1-2 ppm typically indicate the presence of protons bonded to a carbon atom that is not directly bonded to a heteroatom or in a functional group. \n\nLooking at the given data, there are three peaks in the range of 3-4 ppm (3.811, 3.73, 3.652) with a high intensity which suggests that there are three protons bonded to a carbon atom adjacent to an oxygen atom. \n\nThere are also multiple peaks in the range of 1-2 ppm (1.303, 1.286, 1.226, 1.207, 1.199, 1.146) with variable intensities. \n\nThe presence of these

In [12]:
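# Second run: a hand-summarized peak list is passed in the chemical-formula slot;
# the raw CSV peaks are still sent as spectrum_data.
output_gpt4 = call_openAI(specs_data, """Peak1: 1.24-1.25, triplet, integration = 3H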
Peak2: 1.4, singlet, integration = 1.7* - exchangeable hydrogen
Peak3: 3.70-3.74, quadruplet, integration = 2H
Chemical formula: C2H6O""")
output_gpt4


'### Scratchpad ### \n\nThe chemical formula for the compound is C2H6O. \n\nFrom the NMR data:\n\nPeak 1: 1.24-1.25, triplet, integration = 3H. This peak is a triplet, which indicates that it has two neighboring hydrogen atoms. The integration value of 3H means there are 3 hydrogen atoms in this group. This is likely a CH3 group.\n\nPeak 2: 1.4, singlet, integration = 1.7*. It is mentioned that this is an exchangeable hydrogen, which is characteristic of alcohols or acids. In this case, the chemical formula indicates it is an alcohol.\n\nPeak 3: 3.70-3.74, quadruplet, integration = 2H. This peak is a quadruplet, which indicates that it has three neighboring hydrogen atoms. The integration value of 2H means there are 2 hydrogen atoms in this group. This is likely a CH2 group.\n\nSo, based on these data, the molecule likely has a CH3 group, a CH2 group, and an OH group, which fits the formula C2H6O.\n\n### Scratchpad ###\n\n### Start answer ### Ethanol ### End answer ###'

## Output Parser

In [6]:
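def extract_answer(analysis_string):
    """Return the text between the answer markers, or None if either marker is missing."""
    start_marker = "### Start answer ###"
    end_marker = "### End answer ###"

    start_pos = analysis_string.find(start_marker)
    end_pos = analysis_string.find(end_marker, start_pos + 1)
    # str.find returns -1 on failure; without this check a missing marker gives a wrong slice.
    if start_pos == -1 or end_pos == -1:
        return None

    return analysis_string[start_pos + len(start_marker):end_pos].strip()

print("GPT4:", extract_answer(output_gpt4))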
#print("Claude:", extract_answer(output_claude))

GPT4: Ethanol
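
The prompt also asks for the reasoning between two `### Scratchpad ###` markers, so the same response can be split into both parts. A sketch using regular expressions, assuming the model follows the requested format exactly; `split_response` and the shortened `sample` string are hypothetical:

```python
import re

def split_response(text):
    """Return (scratchpad, answer) from a response in the requested format."""
    scratch = re.search(r"### Scratchpad ###(.*?)### Scratchpad ###", text, re.DOTALL)
    answer = re.search(r"### Start answer ###(.*?)### End answer ###", text, re.DOTALL)
    # Either part is None when its markers are absent, e.g. in a truncated reply.
    return (
        scratch.group(1).strip() if scratch else None,
        answer.group(1).strip() if answer else None,
    )

sample = (
    "### Scratchpad ### \n\nCH3, CH2 and OH groups.\n\n### Scratchpad ###\n\n"
    "### Start answer ### Ethanol ### End answer ###"
)
print(split_response(sample))  # ('CH3, CH2 and OH groups.', 'Ethanol')
```

Keeping the scratchpad alongside the answer makes it easier to inspect why a prediction went wrong.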
