In [1]:
import re

import orcaparse as op

In the output below, the `My data` and `Another data` blocks are recognized as `BlockUnrecognizedWithHeader`, and the `My start of the message messag...` block is recognized as `BlockUnknown`.

In [2]:
orca_file = op.File("example.out")
orca_file.get_data()

--------------------
My data
--------------------
 
My value: 1.234 eV

                            ***************************************
                            *            Another data             *
                            ***************************************
Not my match
My value: 9.876 eV



Unnamed: 0,Type,Subtype,Element,Position,ReadableName,RawData,ExtractedData
7858779669156,Block,BlockFinalSinglePointEnergy,<orcaparse.elements.BlockFinalSinglePointEnerg...,"(17, 19)",FINAL SINGLE POINT ENERGY,------------------------- ------------------...,[Energy]
7858689244025,Block,BlockTerminatedNormally,<orcaparse.elements.BlockTerminatedNormally ob...,"(22, 22)",ORCA TERMINATED NORMALLY,****ORCA TERMINAT...,[Termination status]
7858689348092,Block,BlockTotalRunTime,<orcaparse.elements.BlockTotalRunTime object a...,"(23, 23)",TOTAL RUN TIME,TOTAL RUN TIME: 0 days 0 hours 0 minutes 26 se...,[Run Time]
7858689348086,Block,BlockUnrecognizedWithHeader,<orcaparse.elements.BlockUnrecognizedWithHeade...,"(5, 10)",My data,--------------------\nMy data\n---------------...,[raw data]
7858779679540,Block,BlockUnrecognizedWithHeader,<orcaparse.elements.BlockUnrecognizedWithHeade...,"(11, 16)",Another data,******************...,[raw data]
7858689347789,Block,BlockUnknown,<orcaparse.elements.BlockUnknown object at 0x7...,"(2, 3)",My start of the message messag...,My start of the message: message1\nmessage2,[raw data]
7858689347909,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be63...,,,,
7858689347795,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be63...,,,,
7858689347990,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be63...,,,\n,
7858689348425,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be63...,,,,


Let's start with the simple ways of introducing your own block; later we will discuss the structure of the search and use more advanced methods.

In [3]:
rs = op.RegexSettings(op.DEFAULT_ORCA_REGEX_FILE)

The `My data` and `Another data` blocks follow a fairly standard pattern: a single-line header.

Let's add `My data` to the blueprint for this type of pattern.
Name the class following the `BlockNameOfBlock` convention.

In [4]:
rs.items["TypeKnownBlocks"].items[
    "BlueprintBlockWithSingeLineHeader"].add_item(name="BlockMyData",
                                                  pattern_text="My data")

We will detect the first block as a paragraph that starts with 'My start of the message'.

In [5]:
rs.items["TypeKnownBlocks"].items["BlueprintParagraphStartsWith"].add_item(
    name="BlockMyStart", pattern_text="My start of the message")
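
To give a feeling for what a 'paragraph starts with' request looks for, here is a simplified, standard-library sketch. It is not the blueprint shipped with orcaparse: the function name `paragraph_starts_with` is hypothetical, and the stop conditions (a blank line or a line of 7+ separator characters) are only a reduced version of the full pattern that `print(rs)` displays later in this tutorial.

```python
import re


def paragraph_starts_with(text: str) -> "re.Pattern[str]":
    """Build a pattern for a paragraph whose first line starts with `text`.

    Hypothetical helper: the paragraph continues line by line until a blank
    line or a separator line (7+ of the characters - * # =) is reached.
    """
    # Lines that end the paragraph: blank, or made of separator characters.
    stop = r"[ \t]*$|[ \t]*[-*#=]{7,}[ \t]*$"
    return re.compile(
        rf"^([ \t]*{re.escape(text)}.*(?:\n(?!{stop}).*)*)",
        re.MULTILINE,
    )


sample = (
    "\n"
    "My start of the message: message1\n"
    "message2\n"
    "--------------------\n"
    "My data\n"
    "--------------------\n"
)

match = paragraph_starts_with("My start of the message").search(sample)
paragraph = match.group(1) if match else None
print(repr(paragraph))
```

The single capturing group holds the whole paragraph, which is the text a block would receive as its raw data.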

Let's look at the changes.

The new regex settings must be passed to the `File` object when it is created.

In [6]:
orca_file = op.File("example.out", regex_settings=rs)
orca_file.get_data()

My start of the message: message1
message2
--------------------
My data
--------------------
 
My value: 1.234 eV

                            ***************************************
                            *            Another data             *
                            ***************************************
Not my match
My value: 9.876 eV



Unnamed: 0,Type,Subtype,Element,Position,ReadableName,RawData,ExtractedData
7858689268577,Block,BlockMyStart,<orcaparse.elements.Block object at 0x725be61d...,"(2, 3)",My start of the message messag...,My start of the message: message1\nmessage2,[raw data]
7858779703281,Block,BlockMyData,<orcaparse.elements.Block object at 0x725c3c5c...,"(5, 10)",My data My value eV,--------------------\nMy data\n---------------...,[raw data]
7858689268724,Block,BlockFinalSinglePointEnergy,<orcaparse.elements.BlockFinalSinglePointEnerg...,"(17, 19)",FINAL SINGLE POINT ENERGY,------------------------- ------------------...,[Energy]
7858779702600,Block,BlockTerminatedNormally,<orcaparse.elements.BlockTerminatedNormally ob...,"(22, 22)",ORCA TERMINATED NORMALLY,****ORCA TERMINAT...,[Termination status]
7858689335194,Block,BlockTotalRunTime,<orcaparse.elements.BlockTotalRunTime object a...,"(23, 23)",TOTAL RUN TIME,TOTAL RUN TIME: 0 days 0 hours 0 minutes 26 se...,[Run Time]
7858689268727,Block,BlockUnrecognizedWithHeader,<orcaparse.elements.BlockUnrecognizedWithHeade...,"(11, 16)",Another data,******************...,[raw data]
7858689268634,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be61...,,,,
7858689268610,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be61...,,,,
7858779702510,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725c3c5...,,,\n,
7858689184838,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be60...,,,,


The blocks are now recognized as `BlockMyStart` and `BlockMyData`.

Now let's add the data recognition to `BlockMyData`

Note that I am using `BlockWithStandardHeader` instead of plain `Block`, because I know that this block has a standard header that can be easily separated. I could use `Block` instead, but then `ReadableName` would be 'My data My value eV' rather than 'My data'.

Data extraction takes place only when the data is requested, so you don't need to worry much about the performance of your code.

In [7]:
@op.elements.AvailableBlocks.register_block
class BlockMyData(op.elements.BlockWithStandardHeader):

    def data(self):
        # Capture the decimal number that follows "My value:"
        pattern = r"My value:\s*(\d+\.\d+)"
        match = re.search(pattern, self.raw_data)
        # Attach the unit only if the pattern matched; None * unit would raise
        value = (float(match.group(1)) * op.units_and_constants.ureg.eV
                 if match else None)
        return op.Data(
            data={"My value": value, "Another Value": 42},
            comment="Contains pint object of `My value`. The magnitude in eV can be extracted with property .magnitude\n`Another value` is 42.",
        )
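
The regular expression used in `data()` can be tried on its own before wiring it into a block class. The sketch below uses only the standard library; the helper name `extract_my_value` and the `RAW_DATA` string are illustrative stand-ins for `self.raw_data`, not part of orcaparse.

```python
import re
from typing import Optional

# Stand-in for the raw text of the `My data` block.
RAW_DATA = """--------------------
My data
--------------------

My value: 1.234 eV
"""


def extract_my_value(raw_data: str) -> Optional[float]:
    """Return the number after 'My value:', or None if the line is absent."""
    match = re.search(r"My value:\s*(\d+\.\d+)", raw_data)
    return float(match.group(1)) if match else None


print(extract_my_value(RAW_DATA))        # 1.234
print(extract_my_value("Not my match"))  # None
```

Returning `None` for a missing value keeps the extraction from failing on blocks that carry the header but not the expected line.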

Now let's set the `ReadableName` of `BlockMyStart`. Currently it is 'My start of the message messag...'.

In [8]:
@op.elements.AvailableBlocks.register_block
class BlockMyStart(op.elements.Block):

    def extract_name_header_and_body(self):
        return "My Start", None, self.raw_data

Do not forget to re-create the `File` object so that the newly registered classes are used.

In [9]:
orca_file = op.File("example.out", regex_settings=rs)
orca_file.get_data()

My start of the message: message1
message2
                            ***************************************
                            *            Another data             *
                            ***************************************
Not my match
My value: 9.876 eV



Unnamed: 0,Type,Subtype,Element,Position,ReadableName,RawData,ExtractedData
7858689185282,Block,BlockMyStart,<__main__.BlockMyStart object at 0x725be6096020>,"(2, 3)",My Start,My start of the message: message1\nmessage2,[raw data]
7858689185084,Block,BlockMyData,<__main__.BlockMyData object at 0x725be60953c0>,"(5, 10)",My data,--------------------\nMy data\n---------------...,"[My value, Another Value]"
7858689185078,Block,BlockFinalSinglePointEnergy,<orcaparse.elements.BlockFinalSinglePointEnerg...,"(17, 19)",FINAL SINGLE POINT ENERGY,------------------------- ------------------...,[Energy]
7858689185072,Block,BlockTerminatedNormally,<orcaparse.elements.BlockTerminatedNormally ob...,"(22, 22)",ORCA TERMINATED NORMALLY,****ORCA TERMINAT...,[Termination status]
7858689185234,Block,BlockTotalRunTime,<orcaparse.elements.BlockTotalRunTime object a...,"(23, 23)",TOTAL RUN TIME,TOTAL RUN TIME: 0 days 0 hours 0 minutes 26 se...,[Run Time]
7858689184865,Block,BlockUnrecognizedWithHeader,<orcaparse.elements.BlockUnrecognizedWithHeade...,"(11, 16)",Another data,******************...,[raw data]
7858689185294,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be60...,,,,
7858689185243,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be60...,,,,
7858689184925,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be60...,,,\n,
7858689185138,Spacer,Spacer,<orcaparse.elements.Spacer object at 0x725be60...,,,,


Now our data is ready to be extracted:

In [10]:
df = orca_file.get_data(element_type=BlockMyData)
assert len(df) == 1, f"Expected exactly 1 `BlockMyData`, found {len(df)}"
data = df.iloc[0].ExtractedData
print(data)
print()
print(f"{data['My value'].magnitude = }")
print(f"{data['Another Value'] = }")

OrcaData with items: `My value`, `Another Value`. Comment: Contains pint object of `My value`. The magnitude in eV can be extracted with property .magnitude
`Another value` is 42.

data['My value'].magnitude = 1.234
data['Another Value'] = 42


Let's look at the structure of the search algorithm.

`RegexSettings` is a tree-like ('directory') object that contains nested `RegexSettings`, `RegexBlueprint`s and `RegexRequest`s. A `RegexBlueprint` is a 'generator' object for `RegexRequest`s of the same type. Blueprints have `.items` that contain the generated `RegexRequest`s, as shown previously.
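
The tree idea can be pictured with a toy model. The classes below (`ToyGroup`, `ToyBlueprint`, `ToyRequest`) are hypothetical and only mimic the shape described here: a group holds named children, a blueprint generates same-shaped requests from a template, and a request is a leaf holding one pattern. They are not orcaparse's classes, and the template is a deliberately simple assumption.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class ToyRequest:
    """Leaf node: one pattern tagged with a block subtype."""
    p_subtype: str
    pattern: str

    def find(self, text: str) -> List[str]:
        return [m.group(1) for m in re.finditer(self.pattern, text, re.MULTILINE)]


@dataclass
class ToyBlueprint:
    """Generator node: builds requests of the same type from a template."""
    template: str  # must contain exactly one "{text}" placeholder
    items: Dict[str, ToyRequest] = field(default_factory=dict)

    def add_item(self, name: str, pattern_text: str) -> None:
        pattern = self.template.format(text=re.escape(pattern_text))
        self.items[name] = ToyRequest(p_subtype=name, pattern=pattern)


@dataclass
class ToyGroup:
    """Directory node: holds groups, blueprints and requests by name."""
    items: Dict[str, Union["ToyGroup", ToyBlueprint, ToyRequest]] = field(
        default_factory=dict
    )

    def add_item(self, name: str, item) -> None:
        self.items[name] = item


# Build a small tree that mirrors the nesting used in this tutorial.
root = ToyGroup()
known = ToyGroup()
root.add_item("TypeKnownBlocks", known)
starts_with = ToyBlueprint(template=r"^([ \t]*{text}.*)$")
known.add_item("BlueprintParagraphStartsWith", starts_with)
starts_with.add_item(name="BlockMyStart", pattern_text="My start of the message")

request = (
    root.items["TypeKnownBlocks"]
    .items["BlueprintParagraphStartsWith"]
    .items["BlockMyStart"]
)
found = request.find("intro\nMy start of the message: message1\n")
print(found)
```

Walking the tree through `.items` at every level is the same access style used when adding `BlockMyData` and `BlockMyStart` earlier.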

In [11]:
rs = op.RegexSettings(op.DEFAULT_ORCA_REGEX_FILE)
print(rs)

RegexGroup:
  TypeKnownBlocks:
    RegexGroup:
      BlockIcon: RegexRequest(p_type=Block, p_subtype=BlockIcon, pattern=^((?:[ \t]*#,[ \t]*\n[ \t..., flags=re.MULTILINE, comment=Searching for the fin of ...)
      BlockShark: RegexRequest(p_type=Block, p_subtype=BlockShark, pattern=^(([ \t]*-{50,}[ \t]*\n)(..., flags=re.MULTILINE, comment=Non-special line is defin...)
      BlueprintParagraphStartsWith:
        RegexBlueprint:
          BlockVersion: Pattern: ^([ \t]*Program Version.*?\n(?:(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*(?:\n(?=(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*))?)*)
          BlockContributions: Pattern: ^([ \t]*With contributions from.*?\n(?:(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*(?:\n(?=(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*))?)*)
          BlockAcknowledgement: Pattern: ^([ \t]*We gratefully acknowledge.*?\n(?:(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*(?:\n(?=(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*

You can create a new instance of `RegexSettings`, `RegexBlueprint` or `RegexRequest` and add it with `.add_item`.

`TypeKnownBlocks` holds the specific patterns for known blocks.

`TypeDefaultBlocks` holds the general patterns that find certain kinds of blocks; data extraction is not expected from the blocks in this section.

`BlockUnknown` is the `RegexRequest` that collects everything that was not previously recognized as a block and is not just whitespace.

`Spacer` collects the whitespace left in the document.
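
The priority idea behind these sections can be illustrated with a small standard-library sketch. It is an assumption about the general approach, not orcaparse's actual algorithm: the function `classify` is hypothetical, earlier requests claim their text first, and whatever is left over is labelled `BlockUnknown` if it contains visible characters or `Spacer` if it is only whitespace. For brevity, each leftover gap is labelled as a whole.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, subtype)


def classify(text: str, requests: List[Tuple[str, str]]) -> List[Span]:
    """Return spans covering `text` in document order (toy model)."""
    claimed: List[Span] = []
    for subtype, pattern in requests:  # earlier requests win
        for m in re.finditer(pattern, text, re.MULTILINE):
            start, end = m.span(1)
            # Accept the match only if it does not overlap a claimed span.
            if all(end <= s or start >= e for s, e, _ in claimed):
                claimed.append((start, end, subtype))
    claimed.sort()

    result: List[Span] = []
    cursor = 0
    # The empty-subtype sentinel flushes the gap after the last claimed span.
    for start, end, subtype in claimed + [(len(text), len(text), "")]:
        gap = text[cursor:start]
        if gap:
            kind = "BlockUnknown" if gap.strip() else "Spacer"
            result.append((cursor, start, kind))
        if subtype:
            result.append((start, end, subtype))
        cursor = end
    return result


text = "intro line\n\nFINAL ENERGY: -1.0\n\n"
requests = [("BlockEnergy", r"^(FINAL ENERGY:.*)$")]
kinds = [kind for _, _, kind in classify(text, requests)]
print(kinds)
```

Because unrecognized text is still collected, adding a new known pattern only changes how a region is labelled; no part of the document is dropped.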

In [12]:
pattern = op.regex_request.RegexRequest(
    p_type="Block",
    p_subtype="BlockDemonstration",
    pattern="^(aaa)$",
    flags=["MULTILINE"],
    comment="Patterns should always start with ^, have at least 1 capturing group and end with $",
)
pattern

RegexRequest(p_type=Block, p_subtype=BlockDemonstration, pattern=^(aaa)$, flags=re.MULTILINE, comment=Patterns should always st...)

Patterns should always start with `^`, have at least 1 capturing group and end with `$`. This capturing group will capture the `raw_data`
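
The effect of the `^ ... $` anchors together with the `MULTILINE` flag can be checked with Python's `re` module alone. The sample text below is made up for illustration: only a line that is exactly `aaa` matches, and the capturing group returns that text.

```python
import re

text = "header\naaa\nxaaa\naaa \nfooter\n"
pattern = re.compile(r"^(aaa)$", re.MULTILINE)

# With MULTILINE, ^ and $ anchor at every line boundary.
matches = [(m.group(1), m.span(1)) for m in pattern.finditer(text)]
print(matches)  # [('aaa', (7, 10))]

# Without MULTILINE, ^ and $ only anchor at the ends of the whole string.
print(re.search(r"^(aaa)$", text))  # None
```

The lines `xaaa` and `aaa ` (with a trailing space) are rejected because the anchors require the match to span the entire line.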

In [13]:
rs.items["TypeKnownBlocks"].add_item(name="BlockDemonstration", item=pattern)

The pattern was successfully added:

In [14]:
print(rs)

RegexGroup:
  TypeKnownBlocks:
    RegexGroup:
      BlockIcon: RegexRequest(p_type=Block, p_subtype=BlockIcon, pattern=^((?:[ \t]*#,[ \t]*\n[ \t..., flags=re.MULTILINE, comment=Searching for the fin of ...)
      BlockShark: RegexRequest(p_type=Block, p_subtype=BlockShark, pattern=^(([ \t]*-{50,}[ \t]*\n)(..., flags=re.MULTILINE, comment=Non-special line is defin...)
      BlueprintParagraphStartsWith:
        RegexBlueprint:
          BlockVersion: Pattern: ^([ \t]*Program Version.*?\n(?:(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*(?:\n(?=(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*))?)*)
          BlockContributions: Pattern: ^([ \t]*With contributions from.*?\n(?:(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*(?:\n(?=(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*))?)*)
          BlockAcknowledgement: Pattern: ^([ \t]*We gratefully acknowledge.*?\n(?:(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*#=]{7,}[ \t]*$).*(?:\n(?=(?!^.*<@%.*%@>.*$|^[ \t]*$|^[ \t]*[-*