
Add limited IHAVE support #2826

Open · Safihre opened this issue Mar 18, 2024 · 26 comments
@Safihre (Member) commented Mar 18, 2024

Description

We have been asked whether we could implement IHAVE support.
I don't want to implement any support for posting articles in SABnzbd, so I have dismissed it so far.

However, we can help support this by making the right data available to post-processing scripts.

My proposal so far would be this:

  1. Add an "IHAVE supported" switch to a server.
  2. If enabled, and the server does not have an article, but another server does have it:
  3. Write the raw data of those articles to a file in the complete folder of the download. This could be in CSV or Pickle format (see the sketch below the list).
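To illustrate, a post-processing script consuming such a file could look roughly like this (a sketch only; the file name "missing_articles.csv" and its columns are assumptions, not a decided format):

import csv
import sys

def read_missing_articles(csv_path):
    """Yield one (message_id, server_missing_it, raw_article_path) row per article."""
    with open(csv_path, newline="") as f:
        for message_id, server, raw_path in csv.reader(f):
            yield message_id, server, raw_path

if __name__ == "__main__":
    # e.g. python upload_missing.py "/complete/Some.Job/missing_articles.csv"
    for message_id, server, raw_path in read_missing_articles(sys.argv[1]):
        print(f"{server} is missing {message_id}; raw article stored at {raw_path}")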
@sanderjo (Contributor)

And then? Should the user upload that article to the server that doesn't have it?

I'm confused.

If enabled, and the server does have an article, but another server does have it:

typo? Both have it? I don't understand the "but".

@Safihre (Member, Author) commented Mar 18, 2024

Typo fixed.
Indeed, the user can upload the article (using a post-processing script that others can make) to the server that doesn't have it.
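For reference, the upload step of such a post-processing script could be roughly this (a sketch using Python's nntplib, which has been dropped from the standard library in the newest Python versions; the host, credentials and the source of raw_lines are placeholders):

import nntplib

def offer_article(host, user, password, message_id, raw_lines):
    """Offer one article to a server that is missing it, via IHAVE.

    raw_lines: the complete raw article (headers, blank line, body) as an
    iterable of bytes lines, e.g. as previously fetched from another server.
    """
    with nntplib.NNTP_SSL(host, user=user, password=password) as server:
        try:
            # The server answers 335 (send it) or 435 (not wanted);
            # nntplib raises an NNTPError for refusals such as 435.
            response = server.ihave(message_id, raw_lines)
            print("accepted:", response)
        except nntplib.NNTPError as error:
            print("refused:", error)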

@mnightingale (Contributor)

If any “IHAVE supported” servers are enabled, the downloader would need to make “ARTICLE” requests, but that should not require many changes, if any – maybe skip the header section before searching for =y…

Write the raw data of those articles to a file in the complete folder of the download.

Are you thinking of something like a database file storing all the missing articles, or a file per article?
I only ask because repeatedly appending to and encoding a pickle file could be slow.
To reduce the CPU overhead it may be better to write raw responses straight to individual files; I'm presuming only the message-id and raw article are required anyway. So if they were just written to "completed"/articles/*.bin it'd be simple to consume them.
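Something like this for the writer side (a sketch only; the directory layout and using the message-id as the file name are assumptions):

import os

def save_raw_article(complete_dir, message_id, raw_response):
    """Dump one raw ARTICLE response to <complete_dir>/articles/<message-id>.bin."""
    articles_dir = os.path.join(complete_dir, "articles")
    os.makedirs(articles_dir, exist_ok=True)
    # Message-IDs contain characters such as <, > and / that are awkward in file names.
    safe_name = message_id.strip("<>").replace("/", "_")
    with open(os.path.join(articles_dir, safe_name + ".bin"), "wb") as f:
        f.write(raw_response)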

I wonder if providers supporting it would offer incentives to encourage users to set it up and enable posting on their accounts; otherwise I can't imagine many would go through the effort.

It concerns me a little, though, because I'm not sure what the distinction between posting and reposting would be. For instance, if someone asks the provider who uploaded message XYZ, would they know it had been reuploaded by a different user?

@sanderjo (Contributor) commented Mar 18, 2024

Why store the article at all? If it's clear the article is not on Server A, but is on Server B, the user's post-processing script could get it from B and post it on A. Getting the article is just 2-3 lines with nntplib.
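Roughly like this (a sketch; the server name, credentials and message-id are placeholders, and note nntplib has been dropped from the standard library in the newest Python versions):

import nntplib

with nntplib.NNTP_SSL("news.server-b.example", user="user", password="pass") as srv:
    # Fetch the full raw article (headers + body) by message-id from Server B
    response, info = srv.article("<part1of10.example@poster.local>")
    raw_lines = info.lines  # list of bytes lines, ready to repost to Server A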

And: based on the current SAB logging, a user could already make such a script.
To make their life / script easier, SAB could just write the article status in a database, but not the article itself. So just, per article, per news server: Available, Missing, NotChecked.
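For example, the bookkeeping could be as small as this (a sketch; the database, table and column names are made up):

import sqlite3

con = sqlite3.connect("article_status.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS article_status (
        message_id TEXT,
        server     TEXT,
        status     TEXT CHECK (status IN ('Available', 'Missing', 'NotChecked')),
        PRIMARY KEY (message_id, server)
    )
""")
con.execute(
    "INSERT OR REPLACE INTO article_status VALUES (?, ?, ?)",
    ("<x86Vz.909585$0z1.197129@fx20.ams1>", "news6.eweka.nl", "Missing"),
)
con.commit()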

And it's all a bit fuzzy: only if A is checked first (missing) and then B (available) can the post-processing script do something. But if B is checked first, it's never discovered that the article is missing on A, right? If so, you would have to put your worst news server first ... to get the most hits. And doing that is a bit weird.

@sanderjo (Contributor)

And do we have statistics on how useful this would be?

The log below (article name slightly altered) is not a case this would solve: the article is missing on all servers.

I must confess I have Eweka as my first server ... because it's the best server. I would need the reverse order to find articles that are not available on bad servers but are available on good servers.

2024-03-12 08:37:01,765::DEBUG::[downloader:748] Thread 9@news6.eweka.nl: Article x86Vz.909585$0z1.197129@fx20.ams1 missing (error=430)
2024-03-12 08:37:03,258::DEBUG::[downloader:748] Thread 14@news6.newshosting.com: Article x86Vz.909585$0z1.197129@fx20.ams1 missing (error=430)
2024-03-12 08:37:04,013::DEBUG::[downloader:748] Thread 2@news.usenetprime.com: Article x86Vz.909585$0z1.197129@fx20.ams1 missing (error=430)
2024-03-12 08:37:05,044::DEBUG::[downloader:748] Thread 4@europe.newsgroupdirect.com: Article x86Vz.909585$0z1.197129@fx20.ams1 missing (error=430)
2024-03-12 08:37:05,638::WARNING::[downloader:770] 6@usenet.premiumize.me: Received unknown status code 400 for article x86Vz.909585$0z1.197129@fx20.ams1
2024-03-12 08:37:05,639::INFO::[nzbstuff:258] Article x86Vz.909585$0z1.197129@fx20.ams1 unavailable on all servers, discarding

@sanderjo (Contributor)

cat sabnzbd.log* | grep Article | grep -i unavailable | awk '{ print $4 }' | sort -u > articles_unavailable_on_all_servers.txt
cat sabnzbd.log* | grep Article | grep -i missing  | awk '{ print $6 }' | sort -u > articles_missing_on_at_least_one_server.txt
$ wc -l articles_*
  2636 articles_missing_on_at_least_one_server.txt
   758 articles_unavailable_on_all_servers.txt

What is missing on one (or more) servers, but not on all servers:

$ diff articles_missing_on_at_least_one_server.txt articles_unavailable_on_all_servers.txt  | grep -E "^<" | wc -l
1879

OK, that looks hopeful, but when I check such an article, it turns out to be missing on all 4 of my servers. So why does SAB not say "unavailable on all servers"? Wrong logging, since I don't have any more servers defined?

2024-03-12 08:37:01,903::DEBUG::[downloader:748] Thread 7@news6.eweka.nl: Article part100of199.9iio1CWOsKpCrj2DvhjS@powerpostggggAAX.local missing (error=430)
2024-03-12 08:37:24,491::DEBUG::[downloader:748] Thread 12@news6.newshosting.com: Article part100of199.9iio1CWOsKpCrj2DvhjS@powerpostggggAAX.local missing (error=430)
2024-03-12 08:37:24,804::DEBUG::[downloader:748] Thread 3@news.usenetprime.com: Article part100of199.9iio1CWOsKpCrj2DvhjS@powerpostggggAAX.local missing (error=430)
2024-03-12 08:37:25,310::DEBUG::[downloader:748] Thread 2@europe.newsgroupdirect.com: Article part100of199.9iio1CWOsKpCrj2DvhjS@powerpostggggAAX.local missing (error=430)

@Safihre (Member, Author) commented Mar 18, 2024

Yes, we could just output the article ID. But to re-upload it to the server which has it missing, that would require downloading each article twice. It would be super-limited support, but it would at least be something.

@sanderjo (Contributor)

But to re-upload it to the server which has it missing, that would require downloading each article twice.

Yes. Not a problem if 5% is missing, but indeed a problem if a lot more is missing.

Instead of SABnzbd writing the article itself again somewhere: can SAB just leave it in place in incomplete/..../__ADMIN__/SABnzbd_article_... ?

@mnightingale (Contributor)

I don't think there would be sufficient benefit if the missing articles aren't written to disk; at that point I think it would be simpler to just write a purpose-built application that takes an nzb, calls STAT for each message-id against the IHAVE server, and for those missing tries other servers and reuploads.

As said, if it was part of the SAB logic, IHAVE servers would need to be higher priority, which wouldn't always be desirable (completion, slow servers, block accounts, etc.), and if a lot of articles are missing, overall performance would be slower - I believe the latency is the biggest killer.

Maybe an "IHAVE only" server switch, so a server isn't used for downloads, but if it has higher priority then use it for a STAT call – similar to the standalone example.
Potentially do all / some STAT calls upfront?
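The STAT check itself is cheap; roughly (a sketch, assuming an already-connected nntplib server object):

import nntplib

def article_is_missing(server, message_id):
    """Return True if this server answers 430 (no such article) to a STAT."""
    try:
        server.stat(message_id)  # "223 <number> <message-id>" when the article exists
        return False
    except nntplib.NNTPTemporaryError as error:
        return error.response.startswith("430")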

Also, SAB would need to do ARTICLE requests.

Placing files into completed/incomplete could get messy; maybe a separate config option for a location to store them all – whoever wants to IHAVE them could watch the directory (inotify etc.). It may get tricky if they'd like to know which server(s) each article was missing from.

If taking the post-processing script approach, I suppose it'd want to know the ids and the server(s) each article was missing from, and I'd expect it to be offloaded to another process rather than SAB having to wait for it.

@thezoggy (Contributor) commented Mar 18, 2024

If it's discovered during the incomplete stage, it would be lost after the job completes unless it's stored elsewhere or handled before the completion step happens.

  • People probably wouldn't want to slow down/delay their normal downloads, so it should be async.
  • I could see people wanting a VPN involved when they are uploading, which is another wrinkle.

In the scene world this falls more in line with nzb checker scripts, which see which articles are missing and mark them so that you can repost those. The closest part of our logic is the pre-check, which could maybe be extended in a similar way. That way people could turn it on and know what they are getting themselves into.

@jcfp (Member) commented Mar 19, 2024

Technical details aside: how useful would this be in practice? One would imagine the main reason for missing articles - apart from that one fake article per job inserted by indexers for tracking scrapers - is takedowns, in which case a Usenet provider would have to be rather stupid to restore the missing articles in the first place (considering the notice&takedown whack-a-mole is their legal cover).

@mnightingale (Contributor) commented Mar 19, 2024

I believe it's not takedowns that the operator(s?) supporting IHAVE are interested in; indeed, they'd return a "435 Article not wanted" response in that case.

Maybe @Safihre can confirm, but I think it's more that some 'new' operator might agree a deal with another operator for, say, 2000 days of access, which they could cache upon access. But they would like a way to encourage their users to backfill beyond 2000 days, which they themselves do not have access to.

I recall there was some discussion recently about the operator with the longest retention pricing out their resellers, creating a monopoly on older posts.

The indexers would possibly be the better people to approach, if for example they'd provide article ids for anything downloaded that is 2000-5000 days old. It need not be nzb files, it could just be an unsorted list of article ids, but I don't know if they'd go for that. Regardless, I suspect provider accounts used to source the missing articles would soon get shut down for downloading vast volumes of data, which might explain the request to make it a more passive part of SAB.

@jcfp (Member) commented Mar 19, 2024

Oh yeah, I forgot about the recent "no more block accounts on that one backbone" thing.

Are there any existing scripts/tools used for this kind of backfilling, and if so, what input do they take? The article ID, the server that has it, the server(s) that want it, or do they need the actual article already present?

@thezoggy (Contributor) commented Mar 20, 2024

Wonder if it would make more sense to perhaps give the data to something like nyuu to post (as it seems to already have IHAVE support)? ping @animetosho

@animetosho

wonder if it would make more sense to perhaps giving the data to something like nyuu to post

From the way I read the initial post, how the upload is performed would be up to the author of the post-processing script?
Nyuu does have a --input-raw-posts option for posting files as raw articles, so if the missing articles are written as such, the post-processing script could take this approach.

Though I wonder how much of a benefit this is over NZB checker scripts. If you're already downloading the file, integrating it with the downloader means you don't have to re-check every article, though you lose a little bit of efficiency by circulating the articles through disk. But that's the main benefit I can think of.

@puzzledsab (Contributor)

I looked into adding IHAVE when UsenetExpress first talked about it (https://www.usenetexpress.com/blog/post/20211015_ihave/).

I believe they are the only ones who've talked about it. I assumed it wouldn't be very popular with SABnzbd's sponsor, because they would usually be the source and they're not known for playing nice with the competition. In addition, it would be very experimental, so I was going to make a fork intended to be run from source. Unfortunately I couldn't find any articles that would be accepted for testing. I think I tried asking but didn't hear back or get a useful reply, so I dropped it.

I would have implemented it completely in SABnzbd, so that users who had UsenetExpress or a reseller could enable it with a switch. It would probably use the existing connections. Saving articles for later upload with a different program seems like something that would stop most users from using it. UE should be the primary server, and if it was missing an article then SABnzbd would request the full article with headers from the next servers and reupload it immediately if UE wanted it.

I don't think it's of much use for providers other than UE. They are the only ones, other than Omicron, that try to keep old articles.

The paranoid side of me wonders if they have even enabled it for everyone or if they're just saying it so that they can claim that this is the source if they want particular old articles from Omicron. It seems dangerous because I don't think they can verify that the uploaded article is identical to the original. Anyone could replace missing articles with different data to trick downloaders of popular nzbs.

@animetosho

It seems dangerous because I don't think they can verify that the uploaded article is identical to the original. Anyone could replace missing articles with different data to trick downloaders of popular nzbs.

That's always been possible though. Usenet downloading has always involved a fair degree of blind trust.
It's worked reasonably well so far I guess, so I don't see it being that much of a concern.

@nicpaesk

Hello! Any update on this?
I'm super interested in running this IHAVE setup to backfill the supported provider, either as a SABnzbd feature or as a post-processing script. If the feature is not planned at the moment, could someone point me to how to set up such a post-processing script?

@thezoggy (Contributor)

There are various nzb-checking / fill scripts which should accomplish what you want; you just may have to search and try which ones are still worth using. Or just use apps that let you post an nzb and have them run their built-in checkers.

@Safihre (Member, Author) commented May 25, 2024

@thezoggy that's not at all what this is about...

@nicpaesk

Could anyone who has sent missing articles to UsenetExpress explain a bit of their process to me over private message? I'd love to contribute by doing a lot of it myself, but I just don't know the scripts and what it takes :/

@DarrenPIngram

I will add publicly: would many people who use SAB actually be saving the individual posts that might somehow be usable for IHAVE, even if it is then only for one vendor?

I have looked at it a bit, and I genuinely wonder if there is that large a user base for the feature, especially within SAB's remit - large enough for the developers to expend time on it?

@thezoggy (Contributor)

@thezoggy that's not at all what this is about...

The normal method is just checking for missing articles, reposting just those articles to get new ids, and updating the nzb to reflect that, to self-heal. Which is what nzb posters' checkers do. This of course assumes you have the original data / are the poster.

The IHAVE case is, from the end user's point of view, finding an article on one server and backfilling it to another. Which can be used to circumvent takedowns, or to fill things missing from one server's retention from another. Which can also be abused by a lower-end USP to leverage its users' other USPs...

Is this not correct?

@nicpaesk

I have looked at it a bit, and I genuinely wonder if there is that large a user base for the feature, especially within SAB's remit - large enough for the developers to expend time on it?

Yes, most users of the indexers I am part of use SAB; some of them have two providers, and those who have UsenetExpress usually put it at priority 0 - as per UsenetExpress's request, since they use the algorithm to prune less popular articles. I am sure that if there were a simple "turn on" switch in SAB that would save the raw articles from other providers and upload them to UsenetExpress, they would turn it on.

Additionally, if this becomes a feature and people start using it, it's only a matter of time until other providers use it. It is not abuse in any way; it's a good and virtuous cycle if you ask me :)

Is this not correct?

Actually, you cannot use IHAVE to fill articles that have been taken down due to DMCA / NTD, if I understood correctly, and this makes sense.

Reuploading everything is not a great way of dealing with the issue, because it's wasteful in terms of space (some providers would end up storing duplicate information) and would require the nzbs to be changed in all indexers, which is not necessarily easy to do.

I see IHAVE as a very important feature and not related to abuse at all. For example, recently Omicron lost a TON of data from 2021 (we have proof, but support doesn't care). If we could have used IHAVE to fill up other providers, the files would not be dead now.

@thezoggy (Contributor)

Reuploading everything is not a great way of dealing with the issue, because it's wasteful in terms of space (some providers would end up storing duplicate information) and would require the nzbs to be changed in all indexers, which is not necessarily easy to do.

You don't re-up everything, just the articles that were taken down, which usually is not the whole job.

@DarrenPIngram

Additionally, if this becomes a feature and people start using it, it's only a matter of time until other providers use it. It is not abuse in any way; it's a good and virtuous cycle if you ask me :)

Well, it is not me who needs to do the programming :) All I ask is: IF this feature is added, please make it opt-in AND, if opted in, give it some "max space allocation". I download rather a lot, and keeping many TB of data I don't need and can't auto-trim might be a problem...

Now, if you can then make some super-intelligent compressed archive cache for these headers (and still give it a quota) and retrieve from there if necessary, maybe that means more can be stored to possibly help. I can't say if that's possible to program.

I still don't fully get "it": if I downloaded 1 TB of stuff over the past day or so, how much of those headers might suddenly disappear from my provider? And if they DID disappear from provider 2 - of which I am not a customer - how might they benefit from my archive? Even if I were a customer of both provider 1 and provider 2 - both of which I pay for - other than being a "nice guy", how might I benefit from helping commercial provider 2 be perhaps better than provider 1 by giving them the data they are missing?

I am not criticizing the idea, I just don't fully get it and don't need to implement it.
