In [1]:
import re

import chemparse as chp

With the default settings, the `My data` and `Another data` blocks are recognized only as `BlockOrcaUnrecognizedWithSingeLineHeader`, and the block that starts with `My start of the message` is recognized as `BlockUnknown`.

In [2]:
orca_file = chp.File("example.out")
orca_file.get_data()

--------------------
My data
--------------------
 
My value: 1.234 eV


                            ***************************************
                            *            Another data             *
                            ***************************************
Not my match
My value: 9.876 eV




Unnamed: 0,Type,Subtype,Element,CharPosition,LinePosition,ReadableName,RawData,ExtractedData
7764716181144,Block,BlockOrcaTotalRunTime,<chemparse.orca_elements.BlockOrcaTotalRunTime...,"(565, 625)","(24, 25)",TOTAL RUN TIME,TOTAL RUN TIME: 0 days 0 hours 0 minutes 26 se...,[Run Time]
7764716181282,Block,BlockOrcaTerminatedNormally,<chemparse.orca_elements.BlockOrcaTerminatedNo...,"(503, 564)","(23, 24)",ORCA TERMINATED NORMALLY,****ORCA TERMINAT...,[Termination status]
7764716181207,Block,BlockOrcaFinalSinglePointEnergy,<chemparse.orca_elements.BlockOrcaFinalSingleP...,"(354, 500)","(18, 21)",FINAL SINGLE POINT ENERGY,------------------------- ------------------...,[Energy]
7764716181393,Block,BlockOrcaUnrecognizedWithSingeLineHeader,<chemparse.orca_elements.BlockOrcaUnrecognized...,"(45, 116)","(6, 12)",My data,--------------------\nMy data\n---------------...,[raw data]
7764716181333,Block,BlockOrcaUnrecognizedWithSingeLineHeader,<chemparse.orca_elements.BlockOrcaUnrecognized...,"(117, 353)","(13, 19)",Another data,******************...,[raw data]
7764716181444,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd26...,"(0, 0)","(1, 2)",,\n,
7764716181102,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd26...,"(44, 44)","(6, 7)",,\n,
7764716181129,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd26...,"(501, 502)","(22, 24)",,\n\n,
7764716180988,Block,BlockUnknown,<chemparse.elements.BlockUnknown object at 0x7...,"(1, 43)","(3, 5)",My start of the message messag...,My start of the message: message1\nmessage2\n,[raw data]


Let's start with the simple ways of introducing your own block; later we will discuss the structure of the search and use more advanced methods.

In [3]:
rs = chp.RegexSettings(chp.DEFAULT_ORCA_REGEX_FILE)

The `My data` and `Another data` blocks follow a fairly standard pattern: a single-line header.

Let's add `My data` to the blueprint for this type of pattern.
Name the class in the form `BlockNameOfBlock`, for example `BlockOrcaMyData`.

In [4]:
rs.items["TypeKnownBlocks"].items["BlueprintBlockWithSingeLineHeader"].add_item(
    name="BlockOrcaMyData", pattern_text="My data"
)
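
To make the "single-line header" idea concrete, here is a stand-alone sketch that uses only `re`. The helper names `header_pattern` and `find_header` and the regex itself are illustrative assumptions; this is not the pattern that the chemparse blueprint actually generates.

```python
import re

# Two blocks copied from the example output, with shortened indentation
SAMPLE = (
    "--------------------\n"
    "My data\n"
    "--------------------\n"
    " \n"
    "My value: 1.234 eV\n"
    "\n"
    "    ***************************************\n"
    "    *            Another data             *\n"
    "    ***************************************\n"
    "Not my match\n"
    "My value: 9.876 eV\n"
)


def header_pattern(title):
    """Hypothetical helper: a rule line, the title line, and another rule line."""
    rule = r"[ \t]*[-*=]{5,}[ \t]*\n"
    title_line = r"[ \t]*\*?[ \t]*" + re.escape(title) + r"[ \t]*\*?[ \t]*\n"
    return re.compile("^(" + rule + title_line + rule + ")", re.MULTILINE)


def find_header(title, text):
    """Return the three header lines (stripped), or None if there is no match."""
    m = header_pattern(title).search(text)
    if m is None:
        return None
    return [line.strip() for line in m.group(1).splitlines()]


my_data_header = find_header("My data", SAMPLE)
another_header = find_header("Another data", SAMPLE)
missing_header = find_header("No such block", SAMPLE)
print(my_data_header[1])
print(missing_header)
```

The same shape (rule, title, rule) covers both the dashed and the starred header, which is why one blueprint can serve many blocks that differ only in their title text.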

We will detect the first block as a paragraph that starts with 'My start of the message'.

In [5]:
rs.items["TypeKnownBlocks"].items["BlueprintParagraphStartsWith"].add_item(
    name="BlockOrcaMyStart", pattern_text="My start of the message"
)
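
As a rough illustration of what "paragraph starts with" can mean in regex terms, the sketch below treats a paragraph as the line that begins with the given text plus every following non-blank line. The helper `paragraph_starts_with` and its regex are assumptions made for this example, not the pattern chemparse builds internally.

```python
import re

# The beginning of the example file, as shown in the output above
SAMPLE = (
    "\n"
    "My start of the message: message1\n"
    "message2\n"
    "\n"
    "--------------------\n"
    "My data\n"
    "--------------------\n"
)


def paragraph_starts_with(text):
    """Hypothetical helper: first line begins with `text`, the paragraph ends at a blank line."""
    first_line = r"[ \t]*" + re.escape(text) + r".*\n"
    following_lines = r"(?:[ \t]*\S.*\n)*"  # any number of non-blank lines
    return re.compile("^(" + first_line + following_lines + ")", re.MULTILINE)


m = paragraph_starts_with("My start of the message").search(SAMPLE)
paragraph = m.group(1)
print(repr(paragraph))
```

The match stops at the blank line, so the `My data` header that follows is left for other patterns.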

Let's look at the changes.

We have to pass our modified regex settings when creating the `File` object.

In [6]:
orca_file = chp.File("example.out", regex_settings=rs)
orca_file.get_data()

My start of the message: message1
message2

--------------------
My data
--------------------
 
My value: 1.234 eV


                            ***************************************
                            *            Another data             *
                            ***************************************
Not my match
My value: 9.876 eV




Unnamed: 0,Type,Subtype,Element,CharPosition,LinePosition,ReadableName,RawData,ExtractedData
7764716074321,Block,BlockOrcaTotalRunTime,<chemparse.orca_elements.BlockOrcaTotalRunTime...,"(565, 625)","(24, 25)",TOTAL RUN TIME,TOTAL RUN TIME: 0 days 0 hours 0 minutes 26 se...,[Run Time]
7764716074258,Block,BlockOrcaTerminatedNormally,<chemparse.orca_elements.BlockOrcaTerminatedNo...,"(503, 564)","(23, 24)",ORCA TERMINATED NORMALLY,****ORCA TERMINAT...,[Termination status]
7764716074342,Block,BlockOrcaFinalSinglePointEnergy,<chemparse.orca_elements.BlockOrcaFinalSingleP...,"(354, 500)","(18, 21)",FINAL SINGLE POINT ENERGY,------------------------- ------------------...,[Energy]
7764716181075,Block,BlockOrcaMyStart,<chemparse.elements.Block object at 0x70fdd265...,"(1, 43)","(3, 5)",My start of the message messag...,My start of the message: message1\nmessage2\n,[raw data]
7764716074441,Block,BlockOrcaMyData,<chemparse.elements.Block object at 0x70fdd24b...,"(45, 116)","(8, 14)",My data My value eV,--------------------\nMy data\n---------------...,[raw data]
7764716074231,Block,BlockOrcaUnrecognizedWithSingeLineHeader,<chemparse.orca_elements.BlockOrcaUnrecognized...,"(117, 353)","(15, 21)",Another data,******************...,[raw data]
7764716074408,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd24...,"(0, 0)","(1, 2)",,\n,
7764716074375,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd24...,"(44, 44)","(6, 7)",,\n,
7764716074459,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd24...,"(501, 502)","(22, 24)",,\n\n,


The blocks are now recognized as `BlockOrcaMyStart` and `BlockOrcaMyData`.

Now let's add data extraction to `BlockOrcaMyData`.

Note that the class inherits from `BlockOrcaWithStandardHeader` rather than plain `Block`, because this block has a standard header that is easy to separate from the body. Plain `Block` would also work, but then `ReadableName` would be 'My data My value eV' instead of 'My data'.

Data is extracted only when it is requested, so you don't need to worry much about the performance of your extraction code.
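
The "extract only on request" idea can be sketched without chemparse. The class `LazyBlock` below is a hypothetical stand-in, not a chemparse class: it stores the raw text when created and runs the regex only when `data()` is called.

```python
import re


class LazyBlock:
    """Hypothetical stand-in for a block: keeps raw text, parses only on request."""

    def __init__(self, raw_data):
        self.raw_data = raw_data
        self.parse_calls = 0  # counts how often the extraction code actually ran

    def data(self):
        self.parse_calls += 1
        match = re.search(r"My value:\s*(\d+\.\d+)", self.raw_data)
        return float(match.group(1)) if match else None


block = LazyBlock("--------------------\nMy data\n--------------------\n \nMy value: 1.234 eV\n")
calls_before = block.parse_calls  # creating the block did not parse anything
value = block.data()              # the regex runs here
calls_after = block.parse_calls
print(calls_before, value, calls_after)
```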

In [7]:
@chp.orca_elements.AvailableBlocksOrca.register_block
class BlockOrcaMyData(chp.orca_elements.BlockOrcaWithStandardHeader):

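    def data(self):
        # Capture the decimal number that follows "My value:"
        pattern = r"My value:\s*(\d+\.\d+)"
        match = re.search(pattern, self.raw_data)
        if match is None:
            raise ValueError("`My value` was not found in the block")
        # Attach the unit so that the result is a pint quantity in eV
        value = float(match.group(1)) * chp.units_and_constants.ureg.eV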
        return chp.Data(
            data={
                "My value": value,
                "Another Value": 42
            },
            comment=
            "Contains pint object of `My value`. The magnitude in eV can be extracted with property .magnitude\n`Another value` is 42.",
        )

Now let's set the `ReadableName` of `BlockOrcaMyStart`. At the moment it is 'My start of the message messag...'.

In [8]:
@chp.orca_elements.AvailableBlocksOrca.register_block
class BlockOrcaMyStart(chp.elements.Block):

    def extract_name_header_and_body(self):
        return "My Start", None, self.raw_data

Do not forget to re-create the `File` object so that the newly registered classes are used.

In [9]:
orca_file = chp.File("example.out", regex_settings=rs)
orca_file.get_data()

My start of the message: message1
message2

                            ***************************************
                            *            Another data             *
                            ***************************************
Not my match
My value: 9.876 eV




Unnamed: 0,Type,Subtype,Element,CharPosition,LinePosition,ReadableName,RawData,ExtractedData
7764716074519,Block,BlockOrcaTotalRunTime,<chemparse.orca_elements.BlockOrcaTotalRunTime...,"(565, 625)","(24, 25)",TOTAL RUN TIME,TOTAL RUN TIME: 0 days 0 hours 0 minutes 26 se...,[Run Time]
7764716074528,Block,BlockOrcaTerminatedNormally,<chemparse.orca_elements.BlockOrcaTerminatedNo...,"(503, 564)","(23, 24)",ORCA TERMINATED NORMALLY,****ORCA TERMINAT...,[Termination status]
7764716074840,Block,BlockOrcaFinalSinglePointEnergy,<chemparse.orca_elements.BlockOrcaFinalSingleP...,"(354, 500)","(18, 21)",FINAL SINGLE POINT ENERGY,------------------------- ------------------...,[Energy]
7764716074633,Block,BlockOrcaMyStart,<__main__.BlockOrcaMyStart object at 0x70fdd24...,"(1, 43)","(3, 5)",My Start,My start of the message: message1\nmessage2\n,[raw data]
7764716074738,Block,BlockOrcaMyData,<__main__.BlockOrcaMyData object at 0x70fdd24b...,"(45, 116)","(8, 14)",My data,--------------------\nMy data\n---------------...,"[My value, Another Value]"
7764716074780,Block,BlockOrcaUnrecognizedWithSingeLineHeader,<chemparse.orca_elements.BlockOrcaUnrecognized...,"(117, 353)","(15, 21)",Another data,******************...,[raw data]
7764716074636,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd24...,"(0, 0)","(1, 2)",,\n,
7764716074531,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd24...,"(44, 44)","(6, 7)",,\n,
7764716074624,Spacer,Spacer,<chemparse.elements.Spacer object at 0x70fdd24...,"(501, 502)","(22, 24)",,\n\n,


Now our data is ready to be extracted:

In [10]:
df = orca_file.get_data(element_type=BlockOrcaMyData)
display(df)
assert len(df) == 1, "Expected exactly one `BlockOrcaMyData` block"
data = df.iloc[0].ExtractedData
print(data)
print()
print(f"{data['My value'].magnitude = }")
print(f"{data['Another Value'] = }")

Unnamed: 0,Type,Subtype,Element,CharPosition,LinePosition,ReadableName,RawData,ExtractedData
7764716074738,Block,BlockOrcaMyData,<__main__.BlockOrcaMyData object at 0x70fdd24b...,"(45, 116)","(8, 14)",My data,--------------------\nMy data\n---------------...,"[My value, Another Value]"


Data with items: `My value`, `Another Value`. Comment: Contains pint object of `My value`. The magnitude in eV can be extracted with property .magnitude
`Another value` is 42.

data['My value'].magnitude = 1.234
data['Another Value'] = 42


Let's look at the structure of the search algorithm.

`RegexSettings` is a tree-like ('directory') object that can contain nested `RegexSettings`, `RegexBlueprint` and `RegexRequest` objects. A `RegexBlueprint` is a 'generator' of `RegexRequest`s of the same type. Like `RegexSettings`, it has `.items`, which holds its `RegexRequest`s, as shown earlier.

In [11]:
rs = chp.RegexSettings(chp.DEFAULT_ORCA_REGEX_FILE)
print(rs)

RegexGroup:
  TypeKnownBlocks:
    RegexGroup:
      BlockOrcaTotalRunTime: RegexRequest(p_type='Block', p_subtype='BlockOrcaTotalRunTime', pattern='^([ \t]*TOTAL RUN TIME...', flags=re.MULTILINE, comment='This pattern captures ...')
      BlockOrcaTerminatedNormally: RegexRequest(p_type='Block', p_subtype='BlockOrcaTerminatedNormally', pattern='^([ \t]*\*{4}ORCA TERM...', flags=re.MULTILINE, comment='This pattern captures ...')
      BlockOrcaFinalSinglePointEnergy: RegexRequest(p_type='Block', p_subtype='BlockOrcaFinalSinglePointEnergy', pattern='^((-{20,}\s+-{15,}\n)[...', flags=re.MULTILINE, comment='This pattern matches t...')
      BlockOrcaDipoleMoment: RegexRequest(p_type='Block', p_subtype='BlockOrcaDipoleMoment', pattern='^(([ \t]*-{10,}[ \t]*\...', flags=re.MULTILINE, comment='Equal signs around the...')
      BlockOrcaInputFile: RegexRequest(p_type='Block', p_subtype='BlockOrcaInputFile', pattern='^((?:[ \t]*={10,}[ \t]...', flags=re.MULTILINE, comment='Equal signs around t

You can create a new instance of `RegexSettings`, `RegexBlueprint` or `RegexRequest` and add it with `.add_item`.
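
The tree idea can be mimicked with three tiny classes. `Settings`, `Blueprint` and `Request` below are hypothetical stand-ins written for this sketch; they only imitate the `.items` / `.add_item` shape used above and say nothing about how chemparse implements it. The `TEXT` placeholder in the template is likewise an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Request:
    """Stand-in for a single regex request (a leaf of the tree)."""
    p_subtype: str
    pattern: str


@dataclass
class Blueprint:
    """Stand-in for a blueprint: builds requests of one kind from a template."""
    template: str
    items: dict = field(default_factory=dict)

    def add_item(self, name, pattern_text):
        # Substitute the escaped user text for the TEXT placeholder
        pattern = self.template.replace("TEXT", re.escape(pattern_text))
        self.items[name] = Request(p_subtype=name, pattern=pattern)

    def walk(self):
        yield from self.items.values()


@dataclass
class Settings:
    """Stand-in for a settings node: a named collection of child nodes."""
    items: dict = field(default_factory=dict)

    def add_item(self, name, item):
        self.items[name] = item

    def walk(self):
        # Depth-first traversal in insertion order
        for item in self.items.values():
            if isinstance(item, Request):
                yield item
            else:
                yield from item.walk()


root = Settings()
root.add_item("TypeKnownBlocks", Settings())
known = root.items["TypeKnownBlocks"]
known.add_item("BlueprintParagraphStartsWith", Blueprint(template=r"^([ \t]*TEXT.*\n)"))
known.items["BlueprintParagraphStartsWith"].add_item(
    name="BlockOrcaMyStart", pattern_text="My start of the message"
)
known.add_item("BlockOrcaDemonstration", Request("BlockOrcaDemonstration", "^(aaa)$"))

subtypes = [request.p_subtype for request in root.walk()]
print(subtypes)
```

Walking the tree flattens it into an ordered list of requests, which is one plausible way a nested settings object could drive a search.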

`TypeKnownBlocks` holds specific patterns for known blocks.

`TypeDefaultBlocks` holds general patterns that find certain kinds of blocks; data extraction is not expected from the blocks in this section.

`BlockOrcaUnknown` is the `RegexRequest` that collects everything that was not recognized earlier as a block and is not just whitespace.

`Spacer` collects the whitespace left in the document.
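
The order described above (specific patterns first, then a catch-all for leftover text, then whitespace) can be sketched as follows. This is a simplified model under stated assumptions, not chemparse's algorithm: `split_into_elements` and `classify_leftover` are hypothetical helpers, earlier requests are assumed to win over later ones, and leftover text is split only into one unknown chunk surrounded by whitespace.

```python
import re


def classify_leftover(gap):
    """Split unclaimed text into whitespace runs and a single unknown chunk."""
    for m in re.finditer(r"\s+|\S(?:.*\S)?", gap, re.DOTALL):
        chunk = m.group(0)
        yield ("Spacer" if chunk.isspace() else "BlockUnknown", chunk)


def split_into_elements(text, known_requests):
    """Return (subtype, raw_text) pairs that together cover `text` exactly."""
    claimed = []  # (start, end, subtype) spans taken by known patterns
    for subtype, pattern in known_requests:  # earlier requests take priority
        for m in re.finditer(pattern, text, re.MULTILINE):
            start, end = m.span(1)
            overlaps = any(start < e and s < end for s, e, _ in claimed)
            if not overlaps:
                claimed.append((start, end, subtype))
    claimed.sort()

    elements, pos = [], 0
    for start, end, subtype in claimed:
        elements.extend(classify_leftover(text[pos:start]))  # text before the block
        elements.append((subtype, text[start:end]))
        pos = end
    elements.extend(classify_leftover(text[pos:]))  # text after the last block
    return elements


text = "some unrecognized line\n\naaa\n"
elements = split_into_elements(text, [("BlockOrcaDemonstration", r"^(aaa)$")])
subtypes = [subtype for subtype, _ in elements]
print(subtypes)
```

Every character of the input ends up in exactly one element, which mirrors how the tables above list blocks and spacers with adjacent character positions.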

In [12]:
pattern = chp.regex_request.RegexRequest(
    p_type="Block",
    p_subtype="BlockOrcaDemonstration",
    pattern="^(aaa)$",
    flags=["MULTILINE"],
    comment=
    "Patterns should always start with ^, have at least 1 capturing group and end with $",
)
pattern

RegexRequest(p_type='Block', p_subtype='BlockOrcaDemonstration', pattern='^(aaa)$', flags=re.MULTILINE, comment='Patterns should always...')

Patterns should always start with `^`, have at least 1 capturing group and end with `$`. This capturing group will capture the `raw_data`
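
The effect of `^`, `$` and the `MULTILINE` flag on the demonstration pattern can be checked with plain `re`, independently of chemparse. The sample text is made up for this example.

```python
import re

text = "header\naaa\nfooter aaa\n"
pattern = r"^(aaa)$"

# With MULTILINE, ^ and $ anchor at line boundaries, so the middle line matches
multiline_match = re.search(pattern, text, flags=re.MULTILINE)
# Without it, ^ only anchors at the very start of the text, so nothing matches
plain_match = re.search(pattern, text)

raw_data = multiline_match.group(1)       # text captured by the first group
char_position = multiline_match.span(1)   # (start, end) of that group in the text
print(raw_data, char_position, plain_match)
```

Anchoring at both ends keeps the pattern from matching the `aaa` inside `footer aaa`, and the capturing group gives both the matched text and its position.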

In [13]:
rs.items["TypeKnownBlocks"].add_item(
    name="BlockOrcaDemonstration", item=pattern)

Pattern was successfully added:

In [14]:
print(rs)

RegexGroup:
  TypeKnownBlocks:
    RegexGroup:
      BlockOrcaTotalRunTime: RegexRequest(p_type='Block', p_subtype='BlockOrcaTotalRunTime', pattern='^([ \t]*TOTAL RUN TIME...', flags=re.MULTILINE, comment='This pattern captures ...')
      BlockOrcaTerminatedNormally: RegexRequest(p_type='Block', p_subtype='BlockOrcaTerminatedNormally', pattern='^([ \t]*\*{4}ORCA TERM...', flags=re.MULTILINE, comment='This pattern captures ...')
      BlockOrcaFinalSinglePointEnergy: RegexRequest(p_type='Block', p_subtype='BlockOrcaFinalSinglePointEnergy', pattern='^((-{20,}\s+-{15,}\n)[...', flags=re.MULTILINE, comment='This pattern matches t...')
      BlockOrcaDipoleMoment: RegexRequest(p_type='Block', p_subtype='BlockOrcaDipoleMoment', pattern='^(([ \t]*-{10,}[ \t]*\...', flags=re.MULTILINE, comment='Equal signs around the...')
      BlockOrcaInputFile: RegexRequest(p_type='Block', p_subtype='BlockOrcaInputFile', pattern='^((?:[ \t]*={10,}[ \t]...', flags=re.MULTILINE, comment='Equal signs around t