I want to extract all the html information. #950

Closed
haolin96 opened this issue Mar 15, 2024 · 4 comments
Labels
stale From automation, when inactive for too long.

Comments

@haolin96

Hello,

I would like to extract all the HTML elements from the websites and store them, but I can only get their text content in the content field. I can't find where the elements are being removed. Can you please help me achieve my goal?

@ohtwadi
Contributor

ohtwadi commented Mar 15, 2024

One approach is to copy the HTML to a field before the HTML gets parsed (i.e. as a pre-parse handler). Something like this could do it (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <pattern field="doc_html">.*</pattern>
  </tagger>
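
One detail worth checking with a catch-all pattern like `.*`: in plain Java regex, `.` does not match line terminators unless DOTALL mode is enabled, so `.*` on its own stops at the first line break. Whether the tagger above already compiles its pattern with DOTALL is not stated in this thread, so the following is only a standalone sketch of the regex behavior, not of the Norconex API. The class and method names (`RawHtmlCapture`, `capture`) are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: shows how a catch-all
// regex behaves on multi-line HTML in plain java.util.regex.
final class RawHtmlCapture {

    // Returns the first match of the regex in the input, or null if none.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<p>Hi</p>\n</body>\n</html>";

        // Without DOTALL, '.' stops at the first line terminator.
        System.out.println(capture(".*", html));

        // The inline flag (?s) enables DOTALL, so the whole document matches.
        System.out.println(capture("(?s).*", html).equals(html));
    }
}
```

If the captured field ends up holding only the first line of the page, trying `(?s).*` as the pattern would be a reasonable first experiment.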

@hadupa

hadupa commented May 7, 2024

I am also curious, so I will bump this. I tested the approach suggested above, but it failed with several errors caused by deprecations. I am currently trying to implement a similar solution using the RegexTagger.

https://opensource.norconex.com/importer/v3/apidocs/com/norconex/importer/handler/tagger/impl/RegexTagger.html

@ohtwadi
Contributor

ohtwadi commented May 8, 2024

Perhaps the DOMPreserveTransformer will be helpful.


stale bot commented Jul 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale From automation, when inactive for too long. label Jul 9, 2024
@stale stale bot closed this as completed Jul 17, 2024