Add new Jsoup wholeText() function to get unencoded, un-normalized text from element and children #3180

@thadguidry

Description

When parsing HTML, it is sometimes useful to get an element's unencoded, un-normalized text, keeping any newlines and whitespace found. This preserves a near-exact copy of the text within an element, which helps in cases such as multi-line chat or comment threads.

Proposed solution

Implement Jsoup's wholeText() function, introduced last year.

Example HTML

<div class="commentthread_comment_text" id="comment_content_2577697791650773248">
  Me : Make a 2nd game ?
 <br>Dev : Nah man , too much work.
 <br>Me : So what's it gonna be ?
 <br>Dev : REMASTER !!!!
 <br>
</div>

Then apply the new GREL function wholeText():

value.parseHtml().select("div.commentthread_comment_text")[0].wholeText()

The parsed output would stay consistent with the original, keeping any newlines and whitespace found:


  Me : Make a 2nd game ?
 Dev : Nah man , too much work.
 Me : So what's it gonna be ?
 Dev : REMASTER !!!!
 

This contrasts with the current GREL function htmlText(), which internally uses Jsoup's text(): whitespace is normalized and trimmed, and newlines are not kept, even though they would help to disambiguate the text in certain situations:

value.parseHtml().select("div.commentthread_comment_text")[0].htmlText()

which outputs:

Me : Make a 2nd game ? Dev : Nah man , too much work. Me : So what's it gonna be ? Dev : REMASTER !!!!
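
To make the difference concrete without pulling in Jsoup, here is a minimal Java sketch that starts from the un-normalized text shown above and collapses it the way the htmlText() output appears to be collapsed. The class name WholeTextSketch, the constant SAMPLE_WHOLE_TEXT, and the normalize() helper are hypothetical names made up for this illustration; the regex only approximates the observable effect and is not Jsoup's actual implementation. With Jsoup on the classpath, the two results would instead come from Element.wholeText() and Element.text().

```java
import java.util.regex.Pattern;

class WholeTextSketch {
    // Any run of whitespace: spaces, tabs, newlines, and so on.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Hypothetical sample: the un-normalized text from the example above,
    // with the source newlines and indentation still in place.
    static final String SAMPLE_WHOLE_TEXT =
            "\n  Me : Make a 2nd game ?\n"
            + " Dev : Nah man , too much work.\n"
            + " Me : So what's it gonna be ?\n"
            + " Dev : REMASTER !!!!\n"
            + " \n";

    // Approximates text()-style output: collapse each whitespace run
    // into a single space, then trim both ends.
    static String normalize(String wholeText) {
        return WHITESPACE_RUN.matcher(wholeText).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Brackets make leading and trailing whitespace visible.
        System.out.println("whole:      [" + SAMPLE_WHOLE_TEXT + "]");
        System.out.println("normalized: [" + normalize(SAMPLE_WHOLE_TEXT) + "]");
    }
}
```

The normalized string is a single line, so the per-speaker line breaks that separate "Me" from "Dev" are lost; that lost information is what wholeText() would retain.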

Alternatives considered

play chess? buy more Tesla stock?

Additional context

Docs: https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#wholeText()

Labels

Type: Feature Request
