# Python lxml extension with an `all_text` property

These notes cover an attempt to extend lxml functionality to get the full text content of an element node (and all its children).

## Demo for how text truncation occurs

Given the following XML:

In [1]:
import lxml.etree
import pprint

str_xml_query_list = '''
<QueryList>

  <Query Id="0" Path="Security">

    <!-- 
      E.g. extra custom query that gets more security events if MSSQLSERVER is installed. Helps test:
      - Multi-line xpath parsing.
      - Handling embedded comments.
      - Handling more complex EventID range logic with >= and <= comparisons.
    -->
    <Select Path="Application">
      *[
        System[
          Provider[@Name='MSSQLSERVER'] and
          (
            (
              EventID&gt;=18453 and
              EventID&lt;=18455
            ) or
            <!-- 	Login failed -->
            EventID=18452 or
            EventID=18456
          )
        ]
      ]
    </Select>

  </Query>

</QueryList>
'''

lxml_select_list = lxml.etree.fromstring(str_xml_query_list).xpath('/QueryList/Query/Select')
lxml_select = lxml_select_list[0]
lxml_select.text


"\n      *[\n        System[\n          Provider[@Name='MSSQLSERVER'] and\n          (\n            (\n              EventID>=18453 and\n              EventID<=18455\n            ) or\n            "

Pretty-printing shows more clearly that the part of the embedded XPath after the embedded comment is not included in the `text` property:

In [2]:
pp = pprint.pprint
pp(lxml_select_list[0].text)

('\n'
 '      *[\n'
 '        System[\n'
 "          Provider[@Name='MSSQLSERVER'] and\n"
 '          (\n'
 '            (\n'
 '              EventID>=18453 and\n'
 '              EventID<=18455\n'
 '            ) or\n'
 '            ')


This happens because lxml treats a comment inside an element as a child node. An element's `text` property holds only the text before its first child; any text that follows a child, including a comment, is stored on that child's `tail` property instead.
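A minimal sketch of where lxml keeps the two halves of the text. The one-line `<a>` document is made up for illustration and is not part of the query list above:

```python
import lxml.etree

# Hypothetical one-line document: text, then a comment, then more text.
root = lxml.etree.fromstring('<a>before<!-- note -->after</a>')

comment = root[0]            # the comment is the first child node of <a>
print(repr(root.text))       # text before the first child
print(repr(comment.text))    # the comment's own content
print(repr(comment.tail))    # text that follows the comment
```

The second half of the text is reachable, but only through the comment node's `tail`.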

## Use XPath to select all text as a workaround

In [3]:
lxml_select_list = lxml.etree.fromstring(str_xml_query_list).xpath('/QueryList/Query/Select')
lxml_select = lxml_select_list[0]
lxml_select.xpath('text()')

["\n      *[\n        System[\n          Provider[@Name='MSSQLSERVER'] and\n          (\n            (\n              EventID>=18453 and\n              EventID<=18455\n            ) or\n            ",
 '\n            EventID=18452 or\n            EventID=18456\n          )\n        ]\n      ]\n    ']

Or more directly with a full XPath:

In [4]:
lxml_select_list_all_text_list = lxml.etree.fromstring(str_xml_query_list).xpath('/QueryList/Query/Select/text()')
lxml_select_list_all_text_list

["\n      *[\n        System[\n          Provider[@Name='MSSQLSERVER'] and\n          (\n            (\n              EventID>=18453 and\n              EventID<=18455\n            ) or\n            ",
 '\n            EventID=18452 or\n            EventID=18456\n          )\n        ]\n      ]\n    ']

The above shows there are two text nodes, one on each side of the comment.

Use a join function to recombine them:

In [5]:
lxml_select_list_all_text = ''.join(lxml_select_list_all_text_list)
pp(lxml_select_list_all_text)

('\n'
 '      *[\n'
 '        System[\n'
 "          Provider[@Name='MSSQLSERVER'] and\n"
 '          (\n'
 '            (\n'
 '              EventID>=18453 and\n'
 '              EventID<=18455\n'
 '            ) or\n'
 '            \n'
 '            EventID=18452 or\n'
 '            EventID=18456\n'
 '          )\n'
 '        ]\n'
 '      ]\n'
 '    ')
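As a variation, XPath can do the concatenation itself with the `string()` function, which returns the string value of a node. This is a sketch on a made-up document, not something run in the notebook above. Note that `string()` also includes text from nested child elements, whereas `text()` only selects the direct text nodes, so the two only agree when the element has no element children.

```python
import lxml.etree

# Hypothetical document: text split by a comment, plus a nested <b> element.
root = lxml.etree.fromstring('<a>x<!-- c -->y<b>z</b>w</a>')

direct_text = ''.join(root.xpath('text()'))   # direct text nodes only
string_value = root.xpath('string()')         # all descendant text nodes
print(repr(direct_text))
print(repr(string_value))
```

For the `Select` element in these notes, which has no element children, both forms should give the same joined text.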




## Use an extended XML element tree property to include all text as a workaround

Note that the element class needs to be extended in a special way: a subclass should not define `__init__`, because lxml elements are proxies to an underlying C tree with their own memory management. See: [Element initialization](https://lxml.de/element_classes.html#element-initialization).

By creating a custom element class and using the `itertext()` method, a workaround is possible:

In [6]:
class ExtendedElement(lxml.etree.ElementBase):

  @property
  def all_text(self):
    """
    Iterate over and join all the text within an element.
    Required because .text only returns the text before the first child node. An XML comment counts as a child node, so any text after a comment is missing from .text.
    """

    return ''.join(self.itertext())

lxml_select_extended = ExtendedElement(lxml_select)
lxml_select_extended.all_text

"\n      *[\n        System[\n          Provider[@Name='MSSQLSERVER'] and\n          (\n            (\n              EventID>=18453 and\n              EventID<=18455\n            ) or\n            \n            EventID=18452 or\n            EventID=18456\n          )\n        ]\n      ]\n    \n\n  "

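Pretty-printing shows more clearly that all the text is now joined into the `all_text` property. The trailing `'\n\n  '` is new compared with the XPath result; it is the whitespace that follows `</Select>` in the source document: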

In [7]:
pp(lxml_select_extended.all_text)

('\n'
 '      *[\n'
 '        System[\n'
 "          Provider[@Name='MSSQLSERVER'] and\n"
 '          (\n'
 '            (\n'
 '              EventID>=18453 and\n'
 '              EventID<=18455\n'
 '            ) or\n'
 '            \n'
 '            EventID=18452 or\n'
 '            EventID=18456\n'
 '          )\n'
 '        ]\n'
 '      ]\n'
 '    \n'
 '\n'
 '  ')
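Since `itertext()` is available on every lxml element, the same join should also work without a subclass. A sketch with a hypothetical helper function `all_text()` on a made-up document:

```python
import lxml.etree

def all_text(element):
    """Join every text fragment inside an element, skipping comment content."""
    return ''.join(element.itertext())

# Hypothetical document: <s> holds text split by a comment and is followed by whitespace.
root = lxml.etree.fromstring('<q><s>a<!-- c -->b</s>\n</q>')
select = root[0]

print(repr(select.text))       # only the text before the comment
print(repr(all_text(select)))  # text on both sides of the comment
```

Called directly on the element, the result does not include that element's own `tail`, so the whitespace after its closing tag is left out.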


## Extending the element class adds a parent element node

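Note that calling the class with an existing element as an argument does not convert that element. It creates a new element named after the class and appends the argument as its child, so the original `Select` element ends up wrapped in a new parent: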

In [8]:
print(lxml.etree.tostring(lxml_select_extended).decode())

<ExtendedElement><Select Path="Application">
      *[
        System[
          Provider[@Name='MSSQLSERVER'] and
          (
            (
              EventID&gt;=18453 and
              EventID&lt;=18455
            ) or
            <!-- 	Login failed -->
            EventID=18452 or
            EventID=18456
          )
        ]
      ]
    </Select>

  </ExtendedElement>


vs the original:

In [9]:
print(lxml.etree.tostring(lxml_select).decode())

<Select Path="Application">
      *[
        System[
          Provider[@Name='MSSQLSERVER'] and
          (
            (
              EventID&gt;=18453 and
              EventID&lt;=18455
            ) or
            <!-- 	Login failed -->
            EventID=18452 or
            EventID=18456
          )
        ]
      ]
    </Select>

  


## Property values are not inherited into the extended element

This approach has another limitation: the extended element does not take over the property values of the element passed to it, because it is a new parent element rather than a copy. E.g. `text` is `None` on the extended element, since it has no text of its own before its first child:

In [10]:
lxml_select_extended.text is None

True

vs the original:

In [11]:
lxml_select.text

"\n      *[\n        System[\n          Provider[@Name='MSSQLSERVER'] and\n          (\n            (\n              EventID>=18453 and\n              EventID<=18455\n            ) or\n            "

So the sub-element needs to be selected again as a child, e.g. via XPath, because extending the class injected a parent element:

In [12]:
lxml_select_extended.xpath('./Select')[0].text

"\n      *[\n        System[\n          Provider[@Name='MSSQLSERVER'] and\n          (\n            (\n              EventID>=18453 and\n              EventID<=18455\n            ) or\n            "

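Both limitations look avoidable by letting the parser create the custom class, which is the route the lxml documentation describes for custom element classes. The following is a sketch under that assumption, on a made-up document rather than the query list above; `ExtendedElement` mirrors the class defined earlier.

```python
import lxml.etree

class ExtendedElement(lxml.etree.ElementBase):
    @property
    def all_text(self):
        """Join all text within the element, including text after comments."""
        return ''.join(self.itertext())

# Ask the parser to instantiate ExtendedElement for every element it builds.
parser = lxml.etree.XMLParser()
parser.set_element_class_lookup(
    lxml.etree.ElementDefaultClassLookup(element=ExtendedElement)
)

# Hypothetical document with a comment splitting the text of <Select>.
root = lxml.etree.fromstring('<Query><Select>a<!-- c -->b</Select></Query>', parser)
select = root.xpath('/Query/Select')[0]

print(type(select).__name__)  # element class chosen by the lookup
print(select.tag)             # tag is unchanged, no wrapper element
print(repr(select.text))      # .text still behaves as before
print(repr(select.all_text))  # full text via the added property
```

With this setup the element keeps its tag, parent and `text`, and only gains the extra property.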
## Python `xml` instead of `lxml`

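Check whether the `text` truncation behaviour is the same with the standard Python `xml` module.

In [13]:
import xml.etree.ElementTree

# fromstring() returns the root <QueryList> element, so the path is relative to it.
xml_select_list = xml.etree.ElementTree.fromstring(str_xml_query_list).findall('./Query/Select')
xml_select = xml_select_list[0]
xml_select.text

Note that this is not a like-for-like comparison: the standard library parser discards comments by default, so no comment node splits the text of `Select` and `text` is expected to hold the full content. The truncation seen with `lxml` therefore comes from comments being kept in the tree as child nodes, and the same would apply to text after any other child element.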
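A small side-by-side check on a made-up document can make the comparison concrete. It assumes the default parser settings of both libraries, where the standard library parser discards comments and lxml keeps them as child nodes:

```python
import xml.etree.ElementTree
import lxml.etree

# Hypothetical document: text split by a comment.
doc = '<a>before<!-- note -->after</a>'

std_text = xml.etree.ElementTree.fromstring(doc).text
lxml_text = lxml.etree.fromstring(doc).text

print(repr(std_text))   # standard library: comment dropped, text kept in one piece
print(repr(lxml_text))  # lxml: text stops at the comment
```

If the two values differ as the comments suggest, the truncation depends on whether the parser keeps comments in the tree, not on the `text` property itself.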