How to prevent duplicate adding posts using CSS Selector Bridge? #3687
Just an add-on question about the CSS Selector Bridge: what can I do with websites that load their content via JavaScript? I tried it and got error 400 (No results for URL selector). When I try those websites with politepol, the page first loads blank and I need to enable the script option to get it working. Is there a script option for the CSS Selector Bridge as well (I don't see a field for it), or is this just not supported?
Hi, duplicates may happen if the article URL changes over time, because the RSS reader uses the article URL to tell old articles apart from new ones. If the site changes the URL (e.g. by editing the title), the article will be considered new and appear as a duplicate. This might also happen if some random garbage is present in the URL, e.g. for tracking purposes.
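To illustrate the point about tracking garbage, here is a minimal Python sketch of how a reader that keys items on the raw URL ends up with duplicates, and how stripping tracking parameters avoids that. The parameter names (`utm_*`, `fbclid`, `gclid`), the helper names and the `example.com` URLs are assumptions made for the example; this is not how RSS-Bridge or any particular reader is implemented.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters; real sites may use different ones.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Drop tracking query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES) and key not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def new_items(seen: set, urls: list) -> list:
    """Return the URLs not seen before, the way a URL-keyed reader would."""
    fresh = [url for url in urls if url not in seen]
    seen.update(fresh)
    return fresh


# Same article, fetched on two different days with different tracking values.
day1 = "https://example.com/article?utm_source=home-monday"
day2 = "https://example.com/article?utm_source=home-tuesday"

raw_seen = set()
new_items(raw_seen, [day1])
duplicates = new_items(raw_seen, [day2])            # reported as new again

clean_seen = set()
new_items(clean_seen, [normalize_url(day1)])
no_duplicates = new_items(clean_seen, [normalize_url(day2)])  # nothing new
```

With raw URLs the second fetch reports the article as new; with normalized URLs it does not.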
CSS Selector Bridge won't work for that kind of site. Since search engines often can't index dynamic sites, such sites may provide a sitemap for search engines, so you can try out SitemapBridge.
Hi, thanks for the info so far. I think this problem also happens without any URL changes, because I get it with every CSS Selector Bridge feed I have made.
What about creating a version of the CSS Selector Bridge with only three parameters for entering the CSS elements? As I said before, the politepol website has a simple feature to select them visually (title, description and image). Something like that would be nice to have as a bridge too, or at least edit fields to enter the CSS element / chain for each of the three elements. In some cases with the "CSS Selector Bridge" I can't get all elements handled correctly, and the title doesn't show when I pick the article element. Defining those few elements directly would make things clearer and help to build a custom bridge that displays correctly. Just an idea.
OK, I want to try the Sitemap Bridge, but what should I enter for "Pattern for site URLs to take in feed"? I see the same edit field in the "CSS Selector Bridge", where it is optional. Can you explain it or show some examples? Also, do you have more examples for CSS Selector Bridge (advanced) and Sitemap Bridge? I tried the Sitemap Bridge (entering the same value for "Home page with latest articles" and "Pattern for site URLs to take in feed") and just got a blank page back. Maybe you can post an example website together with the pattern it uses, so I know what to enter in that field. Thank you.
I'm unable to reproduce this. Is it possible that you added the feed twice to your reader? Alternative explanation: they very recently changed their URL structure. Do the duplicated feed items have identical URLs?
No. I just tried another RSS tool and got the same problem. Test the Digital Foundry bridge URL I posted and you should see it: posts are added again as duplicates, or already-read posts become unread with an updated timestamp in the RSS reader. Either way, I get the same issue in both reader tools, QuiteRSS and RSS Guard. Other feeds that are not created with CSS Selector Bridge work correctly, like Rutube and your now-fixed Pornhub bridge. So what's the difference? Try creating a CSS Selector Bridge URL for any Pornhub model and check what date and time you get in the RSS tool. If you get it working with the CSS Selector Bridge, please post the feed URL so I can see which queries you used to make it work.
You can do that with
Regarding advanced CSS Selector Bridge, it is maintained by @LarsStegman. See #3626 for some how-to explanations.
Here is an example of site where
Underlying Sitemap URL: https://www.zscaler.com/sitemap.xml ( Query string: |
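As a rough illustration of what a "Pattern for site URLs to take in feed" does with a sitemap, the Python sketch below parses a small inline sitemap and keeps only the entries whose URL matches a regular expression. The sample XML, the `example.com` URLs, the `/blog/` pattern and the function name are made up for the example, and treating the pattern as an unanchored regex search is an assumption; this is not SitemapBridge's actual code.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content standing in for a downloaded sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc><lastmod>2023-09-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/blog/second-post</loc><lastmod>2023-09-15</lastmod></url>
</urlset>"""


def filter_sitemap(xml_text: str, pattern: str) -> list:
    """Return (url, lastmod) pairs for sitemap entries whose URL matches pattern."""
    root = ET.fromstring(xml_text)
    regex = re.compile(pattern)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if regex.search(loc):
            entries.append((loc, lastmod))
    return entries


# Keep only blog posts; the "about" page is filtered out.
blog_entries = filter_sitemap(SAMPLE_SITEMAP, r"/blog/")
```

An empty result from a filter like this would be one possible reason for a blank feed: if the pattern does not match any URL listed in the sitemap, nothing is left to put in the feed.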
Really good answer @ORelio, thanks for helping out!
Perhaps one possible improvement here is to hardcode a feed item timestamp e.g. this will make sure that both URL and timestamp remain constant.
If I take a random article from the Eurogamer feed:
My plan is to add support for those tags in
Note that in theory, RSS readers should index items by canonical URL, not by timestamp. My guess about the issue discussed here is that Eurogamer frequently revises article titles, which changes their canonical URL, so the articles show up as duplicate items.
Submitted PR #3706 that adds support for metadata. It might help with this issue, as it automatically retrieves the canonical URL for each entry, reducing the risk of picking up garbage in the entry URL (e.g. tracking parameters set by the home page). It now also supports retrieval of published_time, among other things. (Of course, if blog authors change article titles / canonical URLs, the feed items will still appear as duplicates, but I see no way of detecting or fixing that, especially at bridge level.)
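For readers wondering what "retrieving the canonical URL" from an article page can look like, here is a minimal Python sketch that reads `<link rel="canonical">`, the Open Graph `og:url` tag and `article:published_time` from an HTML page. The class and function names, the sample HTML and the fallback order (canonical link first, then `og:url`) are assumptions for illustration; the actual PR is written in PHP and may work differently.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect a few metadata tags from the head of an article page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.og_url = None
        self.published = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "meta":
            name = attributes.get("property") or attributes.get("name")
            if name == "og:url":
                self.og_url = attributes.get("content")
            elif name == "article:published_time":
                self.published = attributes.get("content")


def extract_metadata(html: str) -> dict:
    """Return the preferred entry URL and the published time, if present."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    # Assumed preference: the canonical link wins over og:url.
    return {"url": parser.canonical or parser.og_url, "published": parser.published}


# Hypothetical article page: og:url carries a tracking parameter, canonical does not.
SAMPLE_PAGE = """<html><head>
<link rel="canonical" href="https://example.com/articles/some-title">
<meta property="og:url" content="https://example.com/articles/some-title?ref=home">
<meta property="article:published_time" content="2023-09-25T10:00:00+00:00">
</head><body></body></html>"""

metadata = extract_metadata(SAMPLE_PAGE)
```

Using the canonical URL as the entry URL keeps it stable across fetches, as long as the site itself does not change it.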
…3687) (#3706)
* [CssSelectorBridge] Metadata from social embed (#3602, #3687)
Implement the following metadata sources:
- Facebook Open Graph
- Twitter <meta> tags
- Standard <meta> tags
- JSON linked data (ld+json)
The following metadata is supported:
- Canonical URL (may help removing garbage from URLs)
- Article title
- Truncated summary
- Published/Updated timestamp
- Enclosure/Thumbnail image
- Author Name or Twitter handle
SitemapBridge will also automatically benefit from this commit.
* [php8backports] Add array_is_list()
Needed this function for ld+json implementation in CssSelectorBridge.
* [SitemapBridge] Add option to discard thumbnail
* [CssSelectorBridge] Fix linting issues
Thank you for the info so far, @ORelio. I would like to see more examples of using the CssSelectorComplexBridge, Sitemap Bridge, etc. Do you keep a list of them anywhere that I could study and test, to learn how to do it correctly? Also, once more about handling simple custom elements to build our own working bridges: in most cases we need the title, description and image as the basic elements, either by defining the path / chain for each or, better, by picking them with the mouse in a visual window. In some cases it may be better to set those few elements separately and manually, to prevent display issues.
As stated before,
Here is an attempt to write proper documentation for the CSS Selector Bridge.
CSS Selector Bridge
Typical use case: Article links on home page
Being familiar with HTML and CSS is highly recommended to use the bridge. Let's use #3717 as an example.
Let's assume we have a home page with a list of latest articles:
Here we see that links (
However, the selector may catch some links to other pages that are not articles, and we don't want that. One handy feature of CSS Selector Bridge is that we can specify a pattern for URLs to keep in the feed, which filters out the others. This is optional. Here is a sample URL to an article:
We see that the URL contains a date and ends with .html. So the pattern used in #3717 is:
This is a regular expression that will match all URLs containing "/20", then anything, then ".html", so only those links will be kept by the bridge. Regex101 is a handy tool to test regular expressions:
If you are not familiar with regular expressions and want to learn more, an introduction to regular expressions might be of help.
So now we have a proper list of links to the latest articles. Great, but we have no content yet. Let's extract it using the Content Selector! Let's open the link and repeat the process:
Looks like
The content selector may catch some unwanted elements inside the article (ads, social share or other distracting elements) that you may want to remove. In that case, just make up selectors for them and input them as content cleanup selector:
Here we are removing
Finally, you may have some clutter in the article title:
You can remove it by filling in "Text to remove from expanded article title". Here we want to remove this:
Congratulations, you have generated a feed through CSS Selector Bridge!
Secondary use case: All articles on home page
Now, let's assume all articles are on the same page, taking #3537 as an example. In that case, the content selector specified as URL selector should select the whole article from the home page:
This will work as long as the first link (
Since CSS Selector Bridge will automatically retrieve metadata from the article page when possible, such as article title, published date, author, thumbnail, etc. (#3706), using a content selector is recommended when possible, as the feed will contain less data without one. However, without a content selector the feed will load faster, as the bridge only processes the home page.
CSS Selector Complex Bridge
This bridge is very similar to CSS Selector Bridge, but instead of trying to automate things for you, it will let you specify settings for everything:
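As a small illustration of the URL pattern step described above, the following Python sketch filters a list of links with a regular expression that matches "/20", then anything, then ".html". The pattern is written from that verbal description only and is not necessarily the exact one used in #3717; the `example.com` links and the function name are made up for the example.

```python
import re

# Assumed pattern, reconstructed from the description: "/20", anything, ".html".
URL_PATTERN = re.compile(r"/20.*\.html")

# Hypothetical links as a URL selector might collect them from a home page.
links = [
    "https://example.com/2023/09/25/some-article.html",
    "https://example.com/about.html",
    "https://example.com/category/news",
    "https://example.com/2023/09/",
]


def keep_article_links(urls: list) -> list:
    """Keep only the URLs that match the article pattern anywhere in the string."""
    return [url for url in urls if URL_PATTERN.search(url)]


articles = keep_article_links(links)
```

Only the first link survives: the "about" page has no "/20" segment, and the two remaining links do not end in ".html" after a "/20" segment. Note that the dot before "html" is escaped, since an unescaped dot matches any character.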
Hi again,
Info: this is the problem I started to describe in this issue.
Whenever I use the "CSS Selector Bridge" to create an RSS feed for a website, I get duplicate posts every day when I update the feed in my feed reader.
Example URL: Digital Foundry
Bridge URL (import the Atom format URL into your RSS reader):
https://rss-bridge.org/bridge01/?action=display&bridge=CssSelectorBridge&home_page=https%3A%2F%2Fwww.eurogamer.net%2Ftopics%2Fdigital-foundry&url_selector=.link_overlay&content_selector=article&content_cleanup=script, div.breadcrumbs, div.headline_details, div.metadata, div.mypop-header-wrapper, div.social_share&title_cleanup=+%7C+Eurogamer.net&format=Atom
If you do, 10 posts are added to your reader with the current date and time (not the time of the posts themselves). When you update the feed again the next day, 10 posts are added again, which include any new posts plus old posts that were already in your feed.
My question is how to prevent duplicate posts from being added in my custom RSS feed above. Other bridges work correctly, like the Rutube bridge (only new posts get added) or the ABC News Bridge. I think the problem could be the undeclared timestamp parameter. Maybe you can tell me how to handle that, or whether it is a general problem when using CSS Selector Bridge. Thanks.
PS: I'm using QuiteRSS, which also has an option to remove duplicate posts, but I normally keep it disabled and don't need it when the RSS feed is working correctly.
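For reference, the bridge URL quoted above is just a set of form parameters encoded into a query string. The Python sketch below rebuilds an equivalent URL from the decoded parameter values, which can make it easier to see and edit the individual selectors. The parameter names and values are taken from the URL in this post; the helper name is made up, and the encoding of spaces and commas may differ slightly from what the RSS-Bridge form produces.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://rss-bridge.org/bridge01/"

# Decoded parameters from the Digital Foundry bridge URL in this post.
params = {
    "action": "display",
    "bridge": "CssSelectorBridge",
    "home_page": "https://www.eurogamer.net/topics/digital-foundry",
    "url_selector": ".link_overlay",
    "content_selector": "article",
    "content_cleanup": (
        "script, div.breadcrumbs, div.headline_details, div.metadata, "
        "div.mypop-header-wrapper, div.social_share"
    ),
    "title_cleanup": " | Eurogamer.net",
    "format": "Atom",
}


def build_bridge_url(base: str, query: dict) -> str:
    """Percent-encode the parameters and append them to the base URL."""
    return base + "?" + urlencode(query)


bridge_url = build_bridge_url(BASE_URL, params)

# Decoding the query string again gives back the original values.
decoded = {key: values[0] for key, values in parse_qs(urlsplit(bridge_url).query).items()}
```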