# Developing RASPA agent

In [1]:
from student.agent.agent_raspa import RaspaAgent
agent = RaspaAgent(provider="anthropic", path="output/learning")

In [2]:
from latex_parsing import construct_tree, split_latex_sections

def parse_tex(filename):
    """Read a LaTeX file and return its section tree."""
    with open(filename) as f:
        latex_text = f.read()
    return construct_tree(split_latex_sections(latex_text, depth=0))

In [3]:
def save(agent, filename):
    agent.save_conversation(f"{filename}_conversation.txt")
    agent.get_memory_agent().save_memory(f"{filename}_memory.txt")
    agent.get_memory_agent().save_conversation(f"{filename}_conversation_memory.txt")

def load(agent, filename):
    agent.load_conversation(f"{filename}_conversation.txt")
    agent.get_memory_agent().load_memory(f"{filename}_memory.txt")
    agent.get_memory_agent().load_conversation(f"{filename}_conversation_memory.txt")
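Since `save` and `load` always work on the same three files per prefix, a quick existence check before loading can catch a mistyped prefix. The following is only a sketch: `missing_files` is a hypothetical helper, and it assumes the files are resolved relative to the current working directory, which may not hold if the agent prepends its own `path`.

```python
from pathlib import Path
import tempfile

# File suffixes used by save() and load() above.
SUFFIXES = ("_conversation.txt", "_memory.txt", "_conversation_memory.txt")


def missing_files(prefix):
    """Return the expected save files for `prefix` that do not exist."""
    expected = [f"{prefix}{suffix}" for suffix in SUFFIXES]
    return [name for name in expected if not Path(name).is_file()]


# Demonstration in a temporary directory that holds only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "demo")
    Path(f"{prefix}_memory.txt").write_text("placeholder")
    missing = [Path(name).name for name in missing_files(prefix)]
    print(missing)
```

An empty result would mean that all three files are present and `load` can be attempted.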

# MVP

In [10]:
mvp = RaspaAgent(provider="anthropic", path="output/mvp")

In [11]:
p1="""
I will teach you how to use RASPA to do classical molecular simulations.
Build up knowledge in your memory. The memory is currently empty.
Integrate the tools (as <tool name="{tool_name}">) into your knowledge for easy reference.

These are basic instructions on how to run a simulation:
<simulations>
1. Identify the molecules (gas/liquid) to simulate. Generate a molecular definition file and corresponding force field and pseudoatoms files using the "trappe" tool.
2. Identify the framework for the simulation. It can either be an empty box or a porous material (MOF, zeolite, ...). You can load the structure file for some MOFs with the "coremof" tool. If you cannot load a structure file, ask for it.
3. Build a simulation input file with the "input" tool. You must look at an example and tune the parameters depending on your simulation type and what you want to calculate.
4. Double-check with the "files" and "read" tools that everything is correct.
5. Run the simulation with the "raspa" tool.
6. Once it has finished, use the "output" tool to parse some relevant information from the output.
</simulations>

Learn the above. Afterwards I will provide you with examples of simulation input files, which you do not yet know how to generate.
"""


In [12]:
mvp.run(p1)

### Define examples

In [13]:
c1 = "# Monte Carlo of methane in a box\nSimulationType                MonteCarlo\nNumberOfCycles                10000\nNumberOfInitializationCycles  5000\nPrintEvery                    1000\n\nForcefield                    ExampleMoleculeForceField\n\n\nBox 0\nBoxLengths 30 30 30\nExternalTemperature 300.0\nMovies yes\nWriteMoviesEvery 100\n\nComponent 0 MoleculeName             methane\n            MoleculeDefinition       ExampleDefinitions\n            TranslationProbability   1.0\n            CreateNumberOfMolecules  100"
k1 = ["Monte Carlo of methane in a box", "Monte Carlo", "MC"]
a1 = r"""
A Monte Carlo run of 100 methane molecules in a $30\times30\times30$ \AA\ box.
After 5000 cycles of initialization the production run is started.
A movie is written and every 100th configuration is appended to the movie. 
The movie is stored in `Movies/System\_0',
and can be viewed with iRASPA or VMD.
"""
a1_full = a1 + r"""
In RASPA, a cycle is defined as max(20,$N$) steps, where $N$ is the number of molecules in the system. In every cycle, each of the molecules
has on average been used for a Monte Carlo move (accepted or rejected). There is a minimum of 20 steps so that low-density
systems are still sampled well. With this definition, a cycle is less dependent on the system size. The number of Monte Carlo steps
is roughly the number of cycles times the average number of molecules.

The output is written to the 'Output' directory (per system), and the temperature and pressure are appended to all output filenames.
In the output file, the simulation writes an important check to the file
\begin{tiny}
\begin{verbatim}
     Energy-drift status
     ===========================================================================
     Adsorbate/Adsorbate energy-drift:                                     1.05012e-10
         Adsorbate/Adsorbate VDW energy-drift:                               1.05012e-10
     ===========================================================================
     Total energy-drift: 1.05012e-10
\end{verbatim}
\end{tiny}
In Monte Carlo, only differences in energies are computed. These differences are continuously added to keep track of the current energies
(from which average energies etc. are computed). Obviously, the current energy that is kept track of during the simulation should
be equal to a full recalculation of the energies. A difference between the two signals an error. If the drift is higher than
say $1e-3$ or $1e-4$ the results of the simulation are in error. This could be due to an error in one of the Monte Carlo moves
or because the force field is ``wrong'' (a typical error is when one forgets to define required potentials).

The performance of Monte Carlo moves is monitored. Translation moves are usually scaled to achieve an acceptance rate of 50\%.
Here, the move reached its upper limit of 1 \AA\ because of the low density of the system.
\begin{tiny}
\begin{verbatim}
     Performance of the translation move:
     ======================================
     Component 0 [methane]
         total        332905.000000 333233.000000 333862.000000
         succesfull   283926.000000 284388.000000 284917.000000
         accepted   0.852874 0.853421 0.853398
         displacement 1.000000 1.000000 1.000000
\end{verbatim}
\end{tiny}

Averages are computed along with an error bar. The error is computed by dividing the simulation in 5 blocks and calculating the standard deviation.
The errors in RASPA are computed as the 95\% confidence interval.
\begin{tiny}
\begin{verbatim}
     Total energy:
     =============
         Block[ 0]       -18276.83475 [K]
         Block[ 1]       -18329.57756 [K]
         Block[ 2]       -18502.81990 [K]
         Block[ 3]       -18371.38298 [K]
         Block[ 4]       -19216.89509 [K]
         ------------------------------------------------------------------------------
         Average         -18539.50205 [K] +/-          481.43129 [K]
\end{verbatim}
\end{tiny}
"""

In [14]:
c2 =  "# Monte Carlo of CO2 in a box and N2 in another box (two independent simulations)\nSimulationType                MonteCarlo\nNumberOfCycles                10000\nNumberOfInitializationCycles  1000\nPrintEvery                    100\n\nForcefield                    ExampleMoleculeForceField\n\n\nBox 0\nBoxLengths 25 25 25\nExternalTemperature 300.0\nMovies yes\nWriteMoviesEvery 10\n\nBox 1\nBoxLengths 30 30 30\nBoxAngles 90 120 90\nExternalTemperature 500.0\nMovies yes\nWriteMoviesEvery 10\n\nComponent 0 MoleculeName             N2\n            MoleculeDefinition       ExampleDefinitions\n            TranslationProbability   1.0\n            RotationProbability      1.0\n            ReinsertionProbability   1.0\n            CreateNumberOfMolecules  50 25\n\nComponent 1 MoleculeName             CO2\n            MoleculeDefinition       ExampleDefinitions\n            TranslationProbability   1.0\n            RotationProbability      1.0\n            ReinsertionProbability   1.0\n            CreateNumberOfMolecules  25 50"
k2 = ["Monte Carlo of CO2 in a box and N2 in another box (two independent simulations)", "Monte Carlo"]
a2 = r"""
RASPA has a built-in structure for simulating several systems at the same time. This has applications in Gibbs-ensembles and (hyper) parallel tempering,
for example. However, this capability can also be used for independent systems. The first box is $25\times25\times25$ \AA\ with 90 $^\circ$ angles,
containing 50 N$_2$ and 25 CO$_2$ molecules, which are moved around by translation, rotation and reinsertion. The second box is monoclinic
and of size $30\times30\times30$ \AA\ with
$\beta=120^\circ,\alpha=\gamma=90^\circ$ containing 25 N$_2$ and 50 CO$_2$ molecules. The first system is at 300K, the second at 500K.
"""
a2_full = a2 + r"""
One thing to note is that system-dependent statements apply to the \emph{current} box, following `Box [int]'. The initialization
of the systems with molecules is done using `CreateNumberOfMolecules', which applies similarly to the \emph{current} component
specified using `Component [int]'. The list of integers represents the initial number of molecules for each system. Note that when the
`BoxAngles' line is omitted, $\alpha=\beta=\gamma=90^\circ$ is assumed as the default.

Note that we specify only relative probabilities of MC particle moves. They will be correctly rescaled as shown in the output-file:
\begin{tiny}
\begin{verbatim}
     Particle Moves:
          ProbabilityTranslationMove:                  33.333333
              TranslationDirection:      XYZ
          Percentage of rotation moves:                      33.333333
          Percentage of reinsertion moves:                   33.333333
\end{verbatim}
\end{tiny}
At every MC-step, each move will be randomly selected with 1/3 probability.
"""

In [15]:
c3 = "# Monte Carlo of a binary mixture in a box\nSimulationType                MonteCarlo\nNumberOfCycles                10000\nNumberOfInitializationCycles  2000\nPrintEvery                    100\n\nForcefield                    ExampleMoleculeForceField\n\n\nBox 0\nBoxLengths 30 30 30\nExternalTemperature 300.0\nMovies yes\nWriteMoviesEvery 10\n\nComponent 0 MoleculeName             propane\n            MoleculeDefinition       ExampleDefinitions\n            TranslationProbability   1.0\n            RotationProbability      1.0\n            ReinsertionProbability   1.0\n            CreateNumberOfMolecules  50\n\nComponent 1 MoleculeName             butane\n            MoleculeDefinition       ExampleDefinitions\n            TranslationProbability   1.0\n            RotationProbability      1.0\n            ReinsertionProbability   1.0\n            CreateNumberOfMolecules  50"
k3 = ["Monte Carlo of a binary mixture in a box", "Monte Carlo", "MC"]
a3 = r"""
A Monte Carlo run of 50 propane and 50 butane molecules in a $30\times30\times30$ \AA\ box. The MC moves are
translation, rotation, and full reinsertion.
After 2000 cycles of initialization the production run is started.
A movie is written and every 10th configuration is appended to the movie. 
The movie is stored in `Movies/System\_0',
and can be viewed with iRASPA or VMD.
"""

a3_full = a3 + r"""
The propane and butane molecules are modeled as flexible united-atom beads.
The intra-molecular force field contains bond, bend, and torsion terms
\begin{tiny}
\begin{verbatim}
     Average Adsorbate Bond stretch energy:
     ====================================
         Block[ 0]        37377.65243 [K]
         Block[ 1]        37822.77336 [K]
         Block[ 2]        37216.91024 [K]
         Block[ 3]        37033.87935 [K]
         Block[ 4]        37658.50987 [K]
         ------------------------------------------------------------------------------
         Average          37421.94505 [K] +/-          398.05476 [K]
     
     Average Adsorbate Bend angle energy:
     ====================================
         Block[ 0]        23136.71656 [K]
         Block[ 1]        22692.37638 [K]
         Block[ 2]        22046.60765 [K]
         Block[ 3]        22185.01877 [K]
         Block[ 4]        21419.84764 [K]
         ------------------------------------------------------------------------------
         Average          22296.11340 [K] +/-          810.78089 [K]
     
     Average Adsorbate Torsion energy:
     =================================
         Block[ 0]        13601.19894 [K]
         Block[ 1]        13749.89405 [K]
         Block[ 2]        13355.15893 [K]
         Block[ 3]        13339.11856 [K]
         Block[ 4]        13049.12955 [K]
         ------------------------------------------------------------------------------
         Average          13418.90000 [K] +/-          334.24478 [K]
\end{verbatim}
\end{tiny}
The translation and rotation moves leave the internal structure invariant.
The reinsertion-move regrows the molecule at a random position with a new internal structure.
\begin{tiny}
\begin{verbatim}
     Performance of the Reinsertion move:
     ====================================
     Component [propane] total tried: 333613.000000 succesfull growth: 333407.000000 (99.938252 [%]) accepted: 85599.000000 (25.658173 [%])
     Component [butane] total tried: 332088.000000 succesfull growth: 331383.000000 (99.787707 [%]) accepted: 46465.000000 (13.991773 [%])
\end{verbatim}
\end{tiny}
Here the acceptance percentages are high enough, but for dense systems the insertion acceptance ratios become too small.
In these cases, other moves (like partial-reinsertion or MC/MD hybrid moves) become essential to properly sample the internal structure of molecules.

"""

In [16]:
c4 = "# Monte Carlo of CO$_2$ and N$_2$ in two independent boxes\nSimulationType                MonteCarlo\nNumberOfCycles                10000\nNumberOfInitializationCycles  1000\nPrintEvery                    100\n\nForcefield                    ExampleMoleculeForceField\n\n\nBox 0\nBoxLengths 25 25 25\nExternalTemperature 300.0\nMovies yes\nWriteMoviesEvery 10\n\nBox 1\nBoxLengths 30 30 30\nBoxAngles 90 120 90\nExternalTemperature 500.0\nMovies yes\nWriteMoviesEvery 5\n\nComponent 0 MoleculeName             CO2\n            MoleculeDefinition       ExampleDefinitions\n            TranslationProbability   1.0\n            RotationProbability      1.0\n            ReinsertionProbability   1.0\n            CreateNumberOfMolecules  100 0\n\nComponent 1 MoleculeName             N2\n            MoleculeDefinition       ExampleDefinitions\n            TranslationProbability   1.0\n            RotationProbability      1.0\n            ReinsertionProbability   1.0\n            CreateNumberOfMolecules  0 100"
k4 = ["Monte Carlo of CO$_2$ and N$_2$ in two independent boxes", "Monte Carlo", "MC"]
a4 = """
An example of a binary mixture of CO$_2$ and N$_2$ in two independent boxes. Box one contains 100 CO$_2$ molecules
at 300 Kelvin, box two (monoclinic shape) contains 100 N$_2$ molecules at 500 Kelvin. The movies for box one are appended
every 10 cycles, the movie for box two every 5 cycles. Three types of Monte Carlo moves are used: translation, rotation, and
reinsertion. 
"""
a4_full = a4
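Several numbers in the output excerpts quoted in the annotations above can be reproduced by hand. The sketch below assumes that the error bar is the Student-t 95% confidence interval of the mean over the 5 blocks (t of about 2.776 for 4 degrees of freedom). That formula is an assumption that happens to match the printed values; it is not taken from the RASPA source. The helper names `block_average`, `mc_steps` and `normalise` are illustrative only.

```python
import math
import statistics

# Assumed two-sided 95% Student-t value for 4 degrees of freedom (5 blocks).
T_95_DOF4 = 2.776


def block_average(blocks, t_value=T_95_DOF4):
    """Return (mean, 95% confidence half-width) for a list of block values."""
    mean = statistics.fmean(blocks)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(blocks) / math.sqrt(len(blocks))
    return mean, t_value * std_err


def mc_steps(cycles, molecules):
    """Total MC steps when each cycle consists of max(20, N) steps."""
    return cycles * max(20, molecules)


def normalise(weights):
    """Rescale relative move weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one move needs a positive weight")
    return {move: weight / total for move, weight in weights.items()}


# Total-energy block values copied from the first example's output excerpt [K].
total_energy_blocks = [
    -18276.83475,
    -18329.57756,
    -18502.81990,
    -18371.38298,
    -19216.89509,
]
mean, error = block_average(total_energy_blocks)
print(f"Average {mean:.5f} [K] +/- {error:.5f} [K]")

# 10000 production cycles with 100 methane molecules (first example).
steps = mc_steps(10000, 100)
print(steps)

# Equal relative weights for the three moves (second example).
probabilities = normalise({"translation": 1.0, "rotation": 1.0, "reinsertion": 1.0})
print({move: round(100 * p, 6) for move, p in probabilities.items()})
```

The step count agrees with the translation-move statistics of the first example, where the three `total` entries (332905 + 333233 + 333862) add up to 1000000. The computed error bar should agree closely with the printed value of 481.43129 K.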

### Learn examples

In [17]:
def example(content, keywords, annotation):
    return f'<example keywords="{keywords}"><annotation>{annotation}</annotation><input>{content}</input></example>'

ex1 = example(c1, k1, a1)
ex2 = example(c2, k2, a2)
ex3 = example(c3, k3, a3)
ex4 = example(c4, k4, a4)
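The `example` helper interpolates the keyword list and the raw text directly into the tags, so the attribute contains the Python list representation and characters such as `<` or `&` are passed through unchanged. Whether the agent needs well-formed XML is not known here; if it does, a variant along the following lines could be used. `example_escaped` is a hypothetical name, and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def example_escaped(content, keywords, annotation):
    """Variant of `example` that joins keywords and escapes XML specials."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    keyword_attr = quoteattr(", ".join(keywords))
    return (
        f"<example keywords={keyword_attr}>"
        f"<annotation>{escape(annotation)}</annotation>"
        f"<input>{escape(content)}</input>"
        "</example>"
    )


rendered = example_escaped(
    "SimulationType MonteCarlo\nNumberOfCycles 10000",
    ["Monte Carlo", "MC"],
    "Acceptance < 50% & low density",
)

# The result parses as XML, and the original text is recovered on parsing.
root = ET.fromstring(rendered)
print(root.get("keywords"))
print(root.find("annotation").text)
```

Escaping leaves line breaks in the input file untouched, which matters because the prompts stress that RASPA is sensitive to formatting.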

In [18]:
p2=f"""
I will give you 4 annotated examples one by one for Monte Carlo simulations (abbreviated as MC). 
DO NOT ask for clarifications until you have seen example 4! Afterwards, ask for any clarifications you need.
The examples include keywords, the simulation input file and annotations. 
The annotations explain the simulation.
These examples should be integrated into the memory such that you can refer to them for generating your own simulation input files.
YOU MUST reproduce the input file formats correctly, since RASPA is really sensitive to formatting.

Example 1:
{ex1}
"""

In [19]:
mvp.run(p2)

In [20]:
p3=f"Example 2: \n{ex2}"
p4=f"Example 3: \n{ex3}"
p5=f"Example 4: \n{ex4}"
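Because each example pairs an input file with a prose annotation, it can help to check the numbers in the annotation against the input before sending the prompts. The parser below is a rough sketch, not the RASPA input grammar: it treats every non-comment line as `Key value...`, keeps the first occurrence of each key, and ignores the per-box and per-component structure. `parse_simulation_input` is a hypothetical helper.

```python
def parse_simulation_input(text):
    """Map each key to the rest of its line (first occurrence wins)."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and the leading "# ..." title line.
        if not stripped or stripped.startswith("#"):
            continue
        key, _, value = stripped.partition(" ")
        settings.setdefault(key, value.strip())
    return settings


# Shortened copy of the binary-mixture input defined earlier.
sample = (
    "# Monte Carlo of a binary mixture in a box\n"
    "SimulationType                MonteCarlo\n"
    "NumberOfCycles                10000\n"
    "NumberOfInitializationCycles  2000\n"
    "Box 0\n"
    "BoxLengths 30 30 30\n"
)

settings = parse_simulation_input(sample)
print(settings["NumberOfInitializationCycles"])
print(settings["BoxLengths"].split())
```

Comparing `settings["NumberOfInitializationCycles"]` with the cycle count mentioned in the annotation is one such check.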

In [21]:
mvp.run(p3)
mvp.run(p4)
mvp.run(p5)

In [25]:
mvp.render_conversation()

In [None]:
clarifications = """
Clarification: 
1. You are right. The number in the annotation is incorrect!
2. This is not important for the moment. 
3. Yes, these are placeholders. The molecule loader / framework loader tools automatically generate the actual files. You do not need to change the field in the input file.
4. No, but I don't think that you would use more than two for anything.
5. This is not important for the moment.
"""
#mvp.run(clarifications)

In [None]:
save(mvp, "mvp3")

## Load and test

In [9]:
loaded = RaspaAgent(provider="anthropic", path="output/test_mc")

In [11]:
load(loaded, "mvp2")
loaded.reset_chat()

In [12]:
loaded.auto_run=True
loaded.run("Generate a simulation input file for a Monte Carlo simulation of propane and methane in some box.")
loaded.run("Generate the molecule definition files and then run raspa")

In [13]:
loaded.render_chat_html()

# Full learning

## Knowledge

In [4]:
a='''
TODO:
- Latex parsing
- Annotation
    - Delete irrelevant parts
    - Annotate tool capabilities
'''

### RASPA theory

In [5]:
a='''
1.1 Design Philosophy
1.2. Units and conventions

2. Format of the Input Files
    2.1. Introduction
    2.2. Simulation input
    (2.3. Force field) -> automatically with trappe tool
    (2.4. Molecules) -> automatically with trappe tool
    (2.5. Framework) -> coremof or other tool (for zeolites, ... TODO)
'''

In [6]:
intro = "raw_knowledge/introduction.tex"
intro_parsed = parse_tex(intro)

philosophy = intro_parsed[0]
#prompt_philosophy = parse_node(philosophy)

units = intro_parsed[1].children[0]
#prompt_units = parse_node(units)

In [7]:
input_files = "raw_knowledge/input_files.tex"
input_files_parsed = parse_tex(input_files)

introduction = input_files_parsed[0]
sim_input = input_files_parsed[1]
# ff = input_files_parsed[2]
# molecules = input_files_parsed[3]
# framework = input_files_parsed[4]

In [8]:
def parse_node(node):
    """Wrap a node's title and content in a memorisation prompt."""
    title = node.title
    content = node.content
    # TODO: content = tex_to_markdown(content)
    prompt = f"Remember this information and integrate it into your existing memory: <title>{title}</title><content>{content}</content>"
    return prompt

def run_agent(prompt, agent):
    """Send a single prompt to the agent."""
    agent.run(prompt)


def learn(node, agent):
    """Teach the agent a node, then its children (depth-first, pre-order)."""
    prompt = parse_node(node)
    run_agent(prompt, agent)
    for child in node.children:
        learn(child, agent)
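The order in which `learn` feeds sections to the agent can be checked without calling a model. The sketch below uses a hypothetical `SketchNode` dataclass as a stand-in for the nodes returned by `construct_tree` (only `title`, `content` and `children` are assumed) and a stub agent that records prompts.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Hypothetical stand-in for a parsed LaTeX section."""
    title: str
    content: str
    children: list = field(default_factory=list)


class RecordingAgent:
    """Stub agent that stores prompts instead of calling a model."""

    def __init__(self):
        self.prompts = []

    def run(self, prompt):
        self.prompts.append(prompt)


def learn_preorder(node, agent):
    """Send a node to the agent, then recurse into its children."""
    agent.run(f"<title>{node.title}</title><content>{node.content}</content>")
    for child in node.children:
        learn_preorder(child, agent)


tree = SketchNode("Input files", "overview", [
    SketchNode("Simulation input", "keywords", [SketchNode("Boxes", "box settings")]),
    SketchNode("Force field", "potentials"),
])

recorder = RecordingAgent()
learn_preorder(tree, recorder)

# Extract the titles in the order in which they were sent.
titles = [p[len("<title>"):p.index("</title>")] for p in recorder.prompts]
print(titles)
```

Each parent section reaches the agent before its subsections, so later prompts can build on context that is already in memory.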

In [None]:
agent.run("""
I will teach you how to use the tool RASPA.
To do so, I will give you information piece by piece.
Starting from an empty memory, build up knowledge in your memory.
You have several tools to generate different files for the RASPA simulations.
Also integrate the knowledge about which tools are available into your memory (add the pattern <tool name="{tool_name}"> to the memory when relevant).
""")
agent.render_conversation()

In [None]:
learn(introduction, agent)

In [None]:
agent.render_conversation()

In [None]:
agent.get_memory_agent().memory.render_html()

True

In [None]:
save(agent, "raspa_learning1")

In [27]:
def filter_irrelevant():
    # TODO: remove irrelevant parts (renamed to avoid shadowing the built-in filter)
    pass

def annotate():
    # highlight tool capabilities
    pass

### RASPA input/output examples - annotated

In [28]:
a='''
4.2 Basic examples
partially 
    4.3 Non-basic examples
    4.4 Advanced examples
    4.5 Auxiliary examples
partially 4.6 
    Number of cycles and run-times
'''

In [29]:
examples = "raw_knowledge/examples.tex"
examples_parsed = parse_tex(examples)

general = examples_parsed[0].content[:223]
files = examples_parsed[0].children

sim = parse_node(files[0])
ff = parse_node(files[1])
mol = parse_node(files[2])
structure = parse_node(files[3])

In [None]:
from student.agent.memory import Memory
# from agent_input_file import init_input_file_memory
# examples_memory = init_input_file_memory()
examples_memory = Memory()
examples_memory.load("input_file_memory.json")
#examples_memory.recall(["Monte Carlo"])

## Questions / Tasks

### General questions

In [17]:
q1 = "How do you set up a RASPA simulation and how do you use your tools for it?"
q2 = "How do you find the critical temperature of a molecule?"
q3 = "What types of simulations can you run and which types of properties can you calculate?"
q4 = "How do you calculate the gas storage potential for a MOF?"
q5 = ""
q6 = ""
q7 = ""
q8 = ""

In [18]:
agent.run(q1)


Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



In [None]:
agent.render_conversation()

### Simulation tasks

In [31]:
s1 = "Provide simulation files for running an energy minimization for CO2 in NU-1000"
s2 = "Find me a structure that has Fm-3m topology and then run methane adsorption in it"
s3 = "Please tell me the performance of reinsertion and partial reinsertion moves for benzene adsorption in MFI-type zeolites"
s4 = "Run a ternary mixture adsorption of hexane and its isomers in MFI zeolite at 298K and pressures 0 - 100 Pa"
s5 = "Create simulation files for finding a minimum energy location of methane in MFI zeolite"
s6 = "Please provide simulation files for running an adsorption simulation of methane in IFMOF-1"
s7 = "Provide simulation input files to find the Henry law coefficient of methane in NU-1000"
s8 = "Provide input files for running GEMC for methane in NU-1000"
s9 = "Simulate the dynamic behaviour of hexane in MFI zeolite"
s10 = "Run simulation to find the critical temperature of heptane"

### Output analysis tasks

In [32]:
o1 = "What is the Henry coefficient?"
o2 = "Compare the simulations for heptane and pentane. Which one has a higher ...?"
o3 = ""
o4 = ""
o5 = ""
o6 = ""
o7 = ""
o8 = ""
o9 = ""
o10 = ""

### Tutorial Tasks

In [33]:
tutorial = "raw_knowledge/tutorial.tex"
tutorial_parsed = parse_tex(tutorial)

In [34]:
t1 = parse_node(tutorial_parsed[0])
t2 = parse_node(tutorial_parsed[1])
t3 = parse_node(tutorial_parsed[2])

In [35]:
# TODO: extract "exercises"

## Teaching Workflow

In [3]:
a='''Try RaspaAgent.run() vs explicit MemoryAgent.learn() -> maybe RaspaAgent.learn() ?

for knowledge:
    response = agent.run(knowledge)
    while response needs clarification:
        answer = "human response"
        response = agent.run(answer)
'''
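The workflow note above can be turned into a small loop. This is a sketch under several assumptions: `run` is assumed to return the agent's response as text, `needs_clarification` and `ask_human` are hypothetical callbacks, and `max_rounds` is an added safety limit that is not part of the note.

```python
def teach(agent, knowledge_items, needs_clarification, ask_human, max_rounds=3):
    """Feed items one by one, resolving follow-up questions after each."""
    for item in knowledge_items:
        response = agent.run(item)
        rounds = 0
        # Keep answering until the agent stops asking or the limit is hit.
        while needs_clarification(response) and rounds < max_rounds:
            response = agent.run(ask_human(response))
            rounds += 1


class EchoAgent:
    """Stub: asks one question per new item, then acknowledges the answer."""

    def __init__(self):
        self.log = []

    def run(self, prompt):
        self.log.append(prompt)
        return "ok" if prompt.startswith("answer:") else "question?"


echo = EchoAgent()
teach(
    echo,
    ["fact A", "fact B"],
    needs_clarification=lambda response: response.endswith("?"),
    ask_human=lambda response: "answer: see the manual",
)
print(echo.log)
```

In a real session `ask_human` would prompt the person teaching the agent, and `needs_clarification` would need a more robust test than a trailing question mark.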
