How to prevent duplicate adding posts using CSS Selector Bridge? #3687

Closed
Dean-Corso opened this issue Sep 22, 2023 · 13 comments

Comments

@Dean-Corso

Hi again,

Info: I first started posting about this problem in this issue.

Whenever I use "CSS Selector Bridge" to create an RSS feed for a website, I get duplicate posts added every day when I update the feed in my feed reader.

Example URL: Digital Foundry
Bridge URL: Import Atom format URL into your rss reader tool.
https://rss-bridge.org/bridge01/?action=display&bridge=CssSelectorBridge&home_page=https%3A%2F%2Fwww.eurogamer.net%2Ftopics%2Fdigital-foundry&url_selector=.link_overlay&content_selector=article&content_cleanup=script, div.breadcrumbs, div.headline_details, div.metadata, div.mypop-header-wrapper, div.social_share&title_cleanup=+%7C+Eurogamer.net&format=Atom

If you do, 10 posts are added to your reader with the current date and time (not the time of the posts themselves). When you update the feed again the next day, another 10 posts are added, which include any new posts as well as old posts that were already in your feed.

My question is how to prevent these duplicate posts in my custom RSS feed above. Other bridges work correctly, like the Rutube bridge (only new posts get added) or the ABC News Bridge. I think the problem could be the undeclared timestamp parameter. Maybe you can tell me how to handle that, or whether it's a general problem when using CSS Selector Bridge. Thanks.

PS: I'm using QuiteRSS, which also has an option to remove duplicate posts, but normally I keep it disabled and don't need it when the RSS feed is working correctly.

@Dean-Corso
Author

Just an add-on question about the CSS Selector Bridge: what can I do for websites that use JavaScript to load their content? I tried this out but I get an error message 400 (No results for URL selector). When I try those websites with politepol, the website loads blank at first and I need to enable the script option to get it to work. Is there also a script option for the CSS Selector Bridge (I don't see any field for that), or is it just not supported?

@dvikan
Contributor

dvikan commented Sep 23, 2023

@ORelio @LarsStegman

@ORelio
Contributor

ORelio commented Sep 23, 2023

Hi,

Duplicates may happen if the article URL changes over time, because the article URL is used by the RSS reader to tell old articles apart from new ones. If the site changes the URL (e.g. by editing the title), then the article will be considered new and appear as a duplicate. This might also happen if some random garbage is present in the URL, e.g. for tracking purposes.
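
To illustrate the point about tracking garbage, here is a minimal sketch of a reader that tells items apart by URL. It is not how any particular reader or RSS-Bridge works; the `example.com` URLs, the `TRACKING_PARAMS` list and both helper functions are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters; real sites use many others.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def normalize_url(url: str) -> str:
    """Drop tracking parameters and the fragment so equal articles compare equal."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


def new_items(seen: set, fetched: list) -> list:
    """Return the URLs the reader has not seen yet, and remember them."""
    fresh = [url for url in fetched if url not in seen]
    seen.update(fresh)
    return fresh


# The same article, linked with a different tracking parameter on each day.
day1 = ["https://example.com/article-1?utm_source=home"]
day2 = ["https://example.com/article-1?utm_source=newsletter"]

raw_seen = set()
new_items(raw_seen, day1)
raw_dupes = new_items(raw_seen, day2)  # the article is reported as new again

clean_seen = set()
new_items(clean_seen, [normalize_url(url) for url in day1])
clean_dupes = new_items(clean_seen, [normalize_url(url) for url in day2])  # nothing new
```

With raw URLs the second fetch reports the article again; once the tracking parameter is stripped, both fetches map to the same key.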

> in case of websites using JavaScript to load it

CSS Selector Bridge won't work for that kind of site. Since search engines can't index dynamic sites, such sites may provide a site map for search engines, so you can try out SitemapBridge.

@Dean-Corso
Author

Hi,

thanks for the info so far. I think this problem also happens without any changes, since I get it with every CSS Selector Bridge feed I made.

What about creating a CSS Selector Bridge version with only 3 parameters for entering the CSS elements? As I said before, the politepol website has a simple feature to select them visually (title, description & image). Something like that would be nice to have as a bridge too, with at least edit fields to enter the CSS element / chain defining all three elements. In some cases when using the "CSS Selector Bridge" I can't get all elements handled correctly, and the title doesn't show when I pick the article element, etc. It would help to define those few elements directly, to make things clear and build a custom bridge that displays correctly. Just an idea, you know.

OK, I want to try the Sitemap Bridge, but I have to ask what to enter for "Pattern for site URLs to take in feed". I see the same edit field in the "CSS Selector Bridge" too, where it is optional. Can you explain that or show any examples? Also, do you have some more examples for CSS Selector Bridge (advanced) & Sitemap Bridge?

I tried out the Sitemap Bridge (I entered the same value for "Home page with latest articles" and "Pattern for site URLs to take in feed") and as a result I just get a blank page back. Maybe you can post an example website with the parameters it uses for the pattern, so I know what to enter into that field. Thank you.

@dvikan
Contributor

dvikan commented Sep 23, 2023

i'm unable to reproduce this. is it possible you have added the feed twice to your reader?

alternative explanation: they very recently changed their url structure. do the duplicated feed items have identical urls?

@Dean-Corso
Author

No. I just tried another RSS tool and got the same problem. Just test the Digital Foundry feed bridge URL I posted above and you should see the problem: duplicate posts are added, or read posts are marked unread again with an updated timestamp in the RSS reader. Either way, I get the same issue in both RSS readers, QuiteRSS & RSS Guard. Other feeds that are not created with CSS Selector Bridge work correctly, like Rutube and your now-fixed Pornhub bridge. So what's the difference? Just try to create a CSS Selector Bridge URL with any Pornhub model and check what date and time you get in the RSS tool. If you get it working with the CSS Selector Bridge, then post its feed URL so I can see which queries you used to make it work.

@ORelio
Contributor

ORelio commented Sep 24, 2023

> What about creating a CSS Selector Bridge version with only 3 parameters for entering the CSS elements? As I said before, the politepol website has a simple feature to select them visually (title, description & image). Something like that would be nice to have as a bridge too, with at least edit fields to enter the CSS element / chain defining all three elements. In some cases when using the "CSS Selector Bridge" I can't get all elements handled correctly, and the title doesn't show when I pick the article element, etc. It would help to define those few elements directly, to make things clear and build a custom bridge that displays correctly. Just an idea, you know.

You can do that with CssSelectorComplexBridge. It lets you specify selectors for more fields such as author, etc. For the simpler CssSelectorBridge, I try to keep things simple but I'll look into auto-filling them using metadata provided for previews on social media sites when possible. Many blogs have them for SEO and preview purposes (see #3626 and #3602).

> OK, I want to try the Sitemap Bridge, but I have to ask what to enter for "Pattern for site URLs to take in feed". I see the same edit field in the "CSS Selector Bridge" too, where it is optional. Can you explain that or show any examples? Also, do you have some more examples for CSS Selector Bridge (advanced) & Sitemap Bridge?

Regarding advanced CSS Selector Bridge, it is maintained by @LarsStegman. See #3626 for some how-to explanations.
Regarding SitemapBridge, it works on the following principle:

  1. Analyze /robots.txt on the site to determine path to Sitemap.xml (or specify direct URL to Sitemap.xml)
  2. Sitemap.xml contains a set of URL+Timestamp: Filter them by URL to take the most recent ones matching desired pattern
  3. Extract content from URL using CSS Selector as in CssSelectorBridge.
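
The first two steps can be sketched roughly as follows. This is only an illustration of the principle, not SitemapBridge's actual code: the `robots.txt` and sitemap contents are made-up sample data standing in for HTTP responses, `find_sitemaps` and `recent_urls` are hypothetical helpers, and step 3 (content extraction) is left out.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample data standing in for the HTTP responses.
ROBOTS_TXT = """User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blogs/security-research/a</loc><lastmod>2023-09-20</lastmod></url>
  <url><loc>https://example.com/about</loc><lastmod>2023-09-22</lastmod></url>
  <url><loc>https://example.com/blogs/security-research/b</loc><lastmod>2023-09-23</lastmod></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_sitemaps(robots_txt: str) -> list:
    """Step 1: collect the URLs announced by 'Sitemap:' lines in robots.txt."""
    return [m.group(1) for m in re.finditer(r"(?im)^sitemap:\s*(\S+)", robots_txt)]


def recent_urls(sitemap_xml: str, pattern: str, limit: int) -> list:
    """Step 2: keep URLs matching the pattern, most recently modified first."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        if re.search(pattern, loc):
            entries.append((lastmod, loc))
    entries.sort(reverse=True)  # ISO dates sort correctly as plain strings
    return [loc for _, loc in entries[:limit]]


sitemaps = find_sitemaps(ROBOTS_TXT)
latest = recent_urls(SITEMAP_XML, r"blogs/security-research", limit=5)
```

Here `latest` holds the two blog URLs, newest first, while the `/about` page is filtered out by the pattern.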

Here is an example of site where SitemapBridge can work (article lists are loaded using javascript, breaking the regular CssSelectorBridge):

  • URL: https://www.zscaler.com/blogs/
  • URL pattern: blogs/security-research
  • Content selector: #post, .node-blog
  • Content cleanup: .light_hero_module_root__dky_w, div.hidden, .tile-container, .sidebar_root__y6e_1, #more-blogs, #subscribe-blog
  • Title cleanup: | Zscaler

Underlying Sitemap URL: https://www.zscaler.com/sitemap.xml (SitemapBridge will automatically recover it by loading https://www.zscaler.com/robots.txt then https://www.zscaler.com/sitemap_index.xml)

Query string:
?action=display&bridge=SitemapBridge&home_page=https%3A%2F%2Fwww.zscaler.com%2Fblogs%3Ftype%3Dsecurity-research&url_pattern=blogs%2Fsecurity-research&content_selector=%23post%2C%20.node-blog&content_cleanup=.light_hero_module_root__dky_w,%20div.hidden,%20.tile-container,%20.sidebar_root__y6e_1,%20%23more-blogs,%20%23subscribe-blog&title_cleanup=|+Zscaler&site_map=&limit=5&format=Atom
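
For reference, a query string like this can be assembled from the individual fields with a few lines of Python. This is just a sketch: the parameter names and values are copied from the query string above, and `urlencode` may percent-encode some characters differently (commas, for instance) while still decoding to the same values.

```python
from urllib.parse import parse_qs, urlencode

# Parameter names and values taken from the query string above.
params = {
    "action": "display",
    "bridge": "SitemapBridge",
    "home_page": "https://www.zscaler.com/blogs?type=security-research",
    "url_pattern": "blogs/security-research",
    "content_selector": "#post, .node-blog",
    "content_cleanup": (
        ".light_hero_module_root__dky_w, div.hidden, .tile-container, "
        ".sidebar_root__y6e_1, #more-blogs, #subscribe-blog"
    ),
    "title_cleanup": "| Zscaler",
    "site_map": "",
    "limit": "5",
    "format": "Atom",
}

# Encode the fields, then decode them again to check nothing was lost.
query = "?" + urlencode(params)
decoded = parse_qs(query[1:], keep_blank_values=True)
```

Appending `query` to the base URL of an RSS-Bridge instance gives the feed URL.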

@dvikan
Contributor

dvikan commented Sep 24, 2023

really good answer @ORelio thanks for helping out!

@dvikan
Contributor

dvikan commented Sep 24, 2023

perhaps one possible improvement here is to hardcode a feed item timestamp e.g. 2023-09-01 18:00:00.

this will make sure that both url and timestamp remain constant.
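
a rough sketch of why a constant timestamp could matter, assuming a reader that flags an item as updated when its timestamp differs between fetches (that is an assumption about reader behaviour, not something confirmed in this thread). `build_feed`, `touched` and the `example.com` urls are made up for the illustration.

```python
from datetime import datetime, timezone


def build_feed(urls, fetch_time, fixed_time=None):
    """Build feed items; without a fixed timestamp, every item gets the fetch time."""
    return [{"url": url, "updated": fixed_time or fetch_time} for url in urls]


def touched(previous, current):
    """URLs already known to the reader whose timestamp differs between two fetches."""
    old = {item["url"]: item["updated"] for item in previous}
    return [
        item["url"]
        for item in current
        if item["url"] in old and old[item["url"]] != item["updated"]
    ]


urls = ["https://example.com/article-1", "https://example.com/article-2"]
day1 = datetime(2023, 9, 22, 8, 0, tzinfo=timezone.utc)
day2 = datetime(2023, 9, 23, 8, 0, tzinfo=timezone.utc)
fixed = datetime(2023, 9, 1, 18, 0, tzinfo=timezone.utc)

# Timestamp follows the fetch time: every known item looks modified on day 2.
moving = touched(build_feed(urls, day1), build_feed(urls, day2))

# Timestamp is constant: nothing looks modified.
stable = touched(build_feed(urls, day1, fixed), build_feed(urls, day2, fixed))
```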

@ORelio
Contributor

ORelio commented Sep 24, 2023

If I take a random article from the Eurogamer feed:
https://www.eurogamer.net/digitalfoundry-2023-ai-and-the-future-of-graphics-how-nvidia-and-cdpr-pushed-out-rt-visuals-in-cyberpunk-2077-phantom-liberty
Its source contains:

<meta property="og:description" content="Alex Battaglia recently hosted a roundtable discussion on AI and the future of game graphics, sharing the stage with re…">
<meta property="og:site_name" content="Eurogamer.net">
<meta property="og:title" content="Inside DLSS 3.5 and Cyberpunk 2077 Phantom Liberty: discussing the future of PC graphics">
<meta property="og:type" content="article">
<meta property="og:url" content="https://www.eurogamer.net/digitalfoundry-2023-ai-and-the-future-of-graphics-how-nvidia-and-cdpr-pushed-out-rt-visuals-in-cyberpunk-2077-phantom-liberty"> 
<meta property="og:image" content="https://assetsio.reedpopcdn.com/cp2077-SITE_54aQBhW_wc2TQsO.jpg?width=1200&amp;height=630&amp;fit=crop&amp;enable=upscale&amp;auto=webp">
<meta property="article:published_time" content="2023-09-20T11:55:26.865857+00:00">
<meta property="article:modified_time" content="2023-09-23T12:10:47.525783+00:00">

My plan is to add support for those tags in CssSelectorBridge and SitemapBridge.
The RSS feed item would use canonical URL and published time, which hopefully would remove potential dynamic garbage from URL and use a constant, actual timestamp, when available.
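
As a rough illustration of the idea (not the bridge's actual implementation), the tags can be read with Python's standard HTML parser. `MetaCollector` is a hypothetical helper, the sample is a subset of the tags quoted above, and using `og:url` as the canonical URL is an assumption of this sketch.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta property="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property")
        if key and "content" in attributes:
            self.meta[key] = attributes["content"]


# Subset of the tags quoted above.
HEAD = """
<meta property="og:type" content="article">
<meta property="og:url" content="https://www.eurogamer.net/digitalfoundry-2023-ai-and-the-future-of-graphics-how-nvidia-and-cdpr-pushed-out-rt-visuals-in-cyberpunk-2077-phantom-liberty">
<meta property="article:published_time" content="2023-09-20T11:55:26.865857+00:00">
<meta property="article:modified_time" content="2023-09-23T12:10:47.525783+00:00">
"""

collector = MetaCollector()
collector.feed(HEAD)

# A stable URL and the real publication time for the feed item.
item_url = collector.meta["og:url"]
item_time = datetime.fromisoformat(collector.meta["article:published_time"])
```

Both values stay the same no matter when the feed is fetched, which is the property the feed item needs.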

Note that in theory, RSS readers should index items by canonical URL, not by timestamp. My guess about the issue discussed here is that Eurogamer frequently revises article titles, changing their canonical URL so they become duplicate items.

@ORelio
Contributor

ORelio commented Sep 24, 2023

Submitted PR #3706 that adds support for metadata. It might help for this issue as it automatically retrieves the canonical URL for each entry, reducing the risk of taking garbage in the entry URL (e.g. tracking parameters set by home page). It now also supports retrieval of published_time among other things.

(Of course, if blog authors change article titles / canonical URLs, the feed items will still appear as duplicates, but I see no way of detecting/fixing that, especially at bridge level).

dvikan pushed a commit that referenced this issue Sep 24, 2023
…3687) (#3706)

* [CssSelectorBridge] Metadata from social embed (#3602, #3687)

Implement the following metadata sources:
 - Facebook Open Graph
 - Twitter <meta> tags
 - Standard <meta> tags
 - JSON linked data (ld+json)

The following metadata is supported:
 - Canonical URL (may help removing garbage from URLs)
 - Article title
 - Truncated summary
 - Published/Updated timestamp
 - Enclosure/Thumbnail image
 - Author Name or Twitter handle

SitemapBridge will also automatically benefit from this commit.

* [php8backports] Add array_is_list()

Needed this function for ld+json implementation in CssSelectorBridge.

* [SitemapBridge] Add option to discard thumbnail

* [CssSelectorBridge] Fix linting issues

dvikan closed this as completed Sep 24, 2023

@Dean-Corso
Author

Thank you for that info so far @ORelio. I would like to see more examples of using the CssSelectorComplexBridge & Sitemap Bridge etc. Do you have a list of them stored anywhere that I could look at and analyze, to learn how to do it correctly and to test it out?

Otherwise, one more time about handling simpler custom elements to build our own working bridges. In most cases we would need title, description & image as basic elements, defined by path / chain, or better, a visual window to pick those elements directly with the mouse. In some cases it might be better to set those few elements separately and manually, to prevent display issues etc.

@ORelio
Contributor

ORelio commented Oct 6, 2023

As stated before, CssSelectorComplexBridge is maintained by @LarsStegman so I don't have specific examples, but it's very close to CssSelectorBridge. You can find some feed examples in #3687 and #3717.

Here is an attempt to write proper documentation for the CSS Selector Bridge.

CSS Selector bridge

Typical use case: Article links on home page

CssSelectorBridge allows you to make RSS Feeds by scraping information from web pages.
It works on the following principle:

  • First select links on home page (using a CSS Selector)
  • Then for each link, extract content from article page (using a CSS Selector)

Being familiar with HTML and CSS is highly recommended to use the bridge.

Let's use #3717 as example:

Let's assume we have a home page with a list of latest articles: https://edition.cnn.com/. The first goal is to find a selector to identify all links to recent articles. A typical way of doing that is through the web browser's dev tools, by highlighting a link and finding the most appropriate DOM element:

image

Here we see that links (<a> elements) for each article have the container__link class, so .container__link makes a valid CSS selector (see also: CSS Selector Reference).
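
To give an idea of what the bridge does with such a selector, here is a small sketch that collects the `href` of every `<a>` element carrying a given class. It only handles a single class name rather than full CSS selectors, `LinkCollector` is a hypothetical helper, and the sample markup is made up apart from the `container__link` class and the article path.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> elements that carry a given CSS class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and self.css_class in classes:
            # Resolve relative links against the home page URL.
            self.links.append(urljoin(self.base_url, href))


# Made-up home page fragment: one article link and one navigation link.
HOME = """
<a class="container__link container__link--type-article"
   href="/2023/10/05/world/ukraine-money-military-aid-intl-dg/index.html">Article</a>
<a class="nav__link" href="/world">World</a>
"""

collector = LinkCollector("https://edition.cnn.com/", "container__link")
collector.feed(HOME)
links = collector.links
```

Only the first link is kept, since the navigation link does not have the `container__link` class.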

However, the selector may catch some links to other pages that are not articles, and we don't want that. One handy feature of CSS Selector Bridge is that we can specify a pattern for URLs to keep in the feed, which filters out the others. This is optional.

Here is a sample URL to an article:

https://edition.cnn.com/2023/10/05/world/ukraine-money-military-aid-intl-dg/index.html

We see that the URL contains a date and ends with .html. So the pattern used in #3717 is:

/20.+\.html

This is a regular expression that will match all URLs containing "/20", then anything, then ".html", so only those links will be kept by the bridge. Regex101 is a handy tool to test regular expressions:

image

If you are not familiar with regular expressions and want to learn more, an introduction to regular expressions might be of help.
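
The pattern can also be tried out in a few lines of Python. This sketch assumes the bridge applies the pattern as an unanchored search anywhere in the URL; the two non-article URLs are made up for the example.

```python
import re

# Pattern from #3717: "/20", then one or more characters, then ".html".
URL_PATTERN = re.compile(r"/20.+\.html")

candidates = [
    "https://edition.cnn.com/2023/10/05/world/ukraine-money-military-aid-intl-dg/index.html",
    "https://edition.cnn.com/world",           # hypothetical section page
    "https://edition.cnn.com/videos/example",  # hypothetical non-article link
]

# Keep only the URLs in which the pattern occurs somewhere.
kept = [url for url in candidates if URL_PATTERN.search(url)]
```

Only the first URL contains "/20" followed by ".html", so it is the only one kept.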

So now we have a proper list of links to the latest articles. Great, but we have no content yet. Let's extract it using the Content Selector! Let's open the link and repeat the process:

image

Looks like .article__content makes a good content selector.

The content selector may catch some unwanted elements inside the article (ads, social share or other distracting elements) that you may want to remove. In that case, just make up selectors for them and input them as content cleanup selector:

image

Here we are removing .source__cite elements. Several selectors are allowed (comma-separated).
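
Conceptually, the cleanup step deletes every matching element from the extracted content. Here is a minimal sketch of that idea on a made-up, well-formed fragment; it only matches by class name, whereas the bridge accepts real CSS selectors, and `remove_by_class` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up article fragment containing one unwanted element.
ARTICLE = (
    '<div class="article__content">'
    "<p>First paragraph.</p>"
    '<span class="source__cite">Source</span>'
    "<p>Second paragraph.</p>"
    "</div>"
)


def remove_by_class(root, class_names):
    """Remove every descendant element carrying one of the given classes."""
    unwanted = set(class_names)
    for parent in list(root.iter()):
        for child in list(parent):
            if unwanted & set(child.get("class", "").split()):
                parent.remove(child)


root = ET.fromstring(ARTICLE)
remove_by_class(root, ["source__cite"])
cleaned = ET.tostring(root, encoding="unicode")
```

After the call, `cleaned` still contains both paragraphs but no longer the `source__cite` element.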

Finally, you may have some clutter in the article title:

image

You can remove it by filling in "Text to remove from expanded article title". Here we want to remove this:

| CNN
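
In effect this behaves like a plain text removal. A tiny sketch, assuming a simple substring removal followed by trimming (the bridge's exact behaviour may differ); the headline is made up and `clean_title` is a hypothetical helper.

```python
def clean_title(title: str, text_to_remove: str) -> str:
    """Remove the unwanted text from the title and trim leftover whitespace."""
    return title.replace(text_to_remove, "").strip()


title = clean_title("Some headline | CNN", "| CNN")
```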

Congratulations, you have generated a feed through CSS Selector Bridge!

Secondary use case: All articles on home page

Now, let's assume all articles are on the same page, taking #3537 as example.

In that case, the selector entered as URL selector should select the whole article element on the home page:

image

This will work as long as the first link (<a> element) for each article points to the article. Then, don't specify content selector. The element selected on home page will be kept as article content, and link title becomes article title.

CSS Selector Bridge automatically retrieves metadata from the article page when possible, such as article title, published date, author, thumbnail, etc. (#3706). Using a content selector is therefore recommended when possible, as the feed will contain less data without one. On the other hand, the feed will load faster without a content selector, as the bridge only processes the home page.

CSS Selector Complex Bridge

This bridge is very similar to CSS Selector Bridge, but instead of trying to automate things for you, it will let you specify settings for everything:

  • Site URL: Page with latest articles: Same as CSS Selector Bridge. This is the home page from which we start
  • [Optional] Cookie: Here you can set an HTTP cookie that will be sent with every request to the website.
  • [Optional] Text to remove from feed title: Same as text to remove from article title from CSS Selector Bridge, but for the feed title retrieved from the home page
  • Selector for article entry elements: Specify here a selector to article cards / whole article, like in CSS Selector Bridge's "Secondary use case: All articles on home page".
  • [Optional] Selector for link elements: Customize the "first link in article card is link to article" behavior: Specify the selector inside each article card to match the article link (in case your first selector is not the link itself).
  • [Optional] Pattern for site URLs to keep in feed: Same as CSS Selector Bridge. You can filter out unwanted pages from the feed using this setting.
  • Limit: Maximum number of articles to process.
  • Load article from page: Specify whether the following selectors come from article card on home page OR standalone article page (check to load standalone page and continue from there, uncheck to continue from inside the article card)
  • [Optional] Selector to select article element: Same as CSS Selector Bridge. Selector to article content. If not specified, the whole article card becomes the content.
  • [Optional] Content cleanup: selector for items to remove: Same as CSS Selector Bridge. Selector for elements inside article content that you want to remove.
  • [Optional] Selector for the article title: Specify selector for article title.
  • [Optional] Categories: Specify selector for article categories
  • [Optional] Author: Specify selector for author
  • [Optional] Time selector: Specify selector for article date
  • [Optional] Format string for parsing time: Format of the article date
  • [Optional] Remove styling: Remove all styling, colors, etc. from the article
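
To illustrate why the time selector comes with a format string: the text extracted by the selector is just a string, and the format says how to read it. The sketch below uses Python's `strptime` syntax purely as an analogy; RSS-Bridge is written in PHP, so the format string the bridge expects presumably follows PHP's date format conventions rather than the `%`-codes shown here. The sample date text is made up.

```python
from datetime import datetime

# Hypothetical text extracted by the time selector.
raw_date = "05/10/2023 14:30"

# Without a format, "05/10" could mean May 10 or October 5.
day_first = datetime.strptime(raw_date, "%d/%m/%Y %H:%M")
month_first = datetime.strptime(raw_date, "%m/%d/%Y %H:%M")
```

The same text yields two different dates depending on the format, which is why the setting has to match the way the site writes its dates.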
