Big rewrite of scraping solution #26
Conversation
- Distributed consumers for ingestion / scraping (scalable) - single producer written in C#.
- Changed from page scraping to RSS XML scraping.
- Includes RealDebridManager hashlist decoding (requires a GitHub read-only PAT, as requests must be authenticated) - this allows ingestion of 200k+ entries in a few hours.
- Simplifies a lot of torrentio's handling of new data.
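For context, the RSS approach means each crawler just fetches a feed and walks the items, instead of rendering and parsing HTML pages. A minimal sketch of the idea - the feed URL and the assumption that each item's `<link>` carries a magnet URI are illustrative only, not the actual crawler code:

```csharp
// Minimal RSS-scraping sketch. Feed URL and element layout are assumptions;
// each real crawler maps its own feed schema.
using System.Xml.Linq;

using var http = new HttpClient();
var xml = await http.GetStringAsync("https://example-tracker.org/rss");

foreach (var item in XDocument.Parse(xml).Descendants("item"))
{
    var title = item.Element("title")?.Value;
    var link = item.Element("link")?.Value; // e.g. magnet:?xt=urn:btih:...
    if (title is null || link is null) continue;

    // A real crawler would write this into the ingested_torrents table
    // with processed = false for the publisher job to pick up.
    Console.WriteLine($"{title} -> {link}");
}
```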
I've also pushed up iPromKnight/addon-jackett. I don't really have time to maintain it, though.
The scope of these changes is huge. You need to commit more frequently so I can read through the commit messages 😅 Is this a drop-in replacement? I imagine you're gonna need to talk everyone through it.
First, big thank you. I was not expecting this. I'd be more useful to this project in C#. I had other plans when it comes to sharing databases as well.
@purple-emily @iPromKnight
Haha - I know. Sorry about that; I wasn't going to actually share it - it was more of a backup solution for me. It started with a rarbg dump SQLite database I found a couple of days ago, which I've been working on in my local gitea repo. Main changes here in torrentio: the providers filter doesn't do anything, as I've stripped it out of config, and src/producer/configuration holds all the injected JSON config to control sync timers. It's basically a complete drop-in replacement. The idea is we'd (hopefully) only have to drop in changes to the moch's directory from upstream torrentio, as that's where all provider handling takes place.
Perfect - I think that makes sense then, as others can also maintain it.
Does this mean that it removes the scraping of older entries?
Yeah, but that's what the DMM scraper is for. They are all shared in git here. Here is the scraper in the producer that decodes these: https://github.com/Gabisonfire/torrentio-scraper-sh/blob/ee994fc8be02c28c962cd834046c12aff9e90071/src/producer/Crawlers/Sites/DebridMediaManagerCrawler.cs Basically, these shared hashlists are LZString-encoded collections of magnets. My count after letting the producer run for 1 hour was 280k messages in rabbit being consumed by the consumer 😄
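For the curious, the decode step is conceptually simple: decompress the LZString payload pulled from the shared page, then walk the resulting JSON. A rough sketch, assuming a community LZString port for .NET and a hypothetical { filename, hash } entry shape - the linked DebridMediaManagerCrawler.cs is the authoritative version:

```csharp
// Hedged sketch of DMM hashlist decoding. LZStringCSharp is a community port
// of lz-string; the method name and the payload shape are assumptions here.
using System.Text.Json;
using LZStringCSharp;

var compressed = File.ReadAllText("hashlist.txt"); // payload from the shared page
var json = LZString.DecompressFromEncodedURIComponent(compressed);

// Assumed shape: a JSON array of { "filename": ..., "hash": ... } objects.
using var doc = JsonDocument.Parse(json);
foreach (var entry in doc.RootElement.EnumerateArray())
{
    var name = entry.GetProperty("filename").GetString();
    var hash = entry.GetProperty("hash").GetString();
    Console.WriteLine($"{name} -> magnet:?xt=urn:btih:{hash}");
}
```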
Wow, ok, that's really nice. I did not understand initially. This takes care of the "database sharing" issue/feature.
This is fantastic, and really well put-together. I'm running it currently on the free public instance at https://torrentio.elfhosted.com, and it's going like a rocket!
Ooh, I too have found this rarbg dump, can you suggest how I might import it? :)
I've just pulled your fork @iPromKnight and it's working amazingly. I'd had trouble scraping the original way - it took me like a week to get around 20k. Running your fork for the past half an hour, I've got 120k ingested and around 1k consumed already, and I can see the consumed entries in Stremio, working great!! Just wanted to say thanks for your work on this, and as feedback for the PR: it's awesome! :)
My tests are also conclusive so far. @iPromKnight What in the world is that shutterstock background 😄 Gotta say I love the new slogan: "Selfhostio the Torrentio brings you much Funio". I'll work on adding a Torrent9 scraper as well, based on the ones you wrote.
Thanks a lot - glad it's working for folks
I used pgloader to get a postgres database from it, then "selected" the data into the ingested_torrents table with processed = false, and let the producer push them all into the rabbit queue :)
You're welcome 😄
> @iPromKnight What in the world is that shutterstock background 😄 Gotta say I love the new slogan: "Selfhostio the Torrentio brings you much Funio". I'll work on adding a Torrent9 scraper as well, based on the ones you wrote.

Haha - you know, I've no idea where that background came from. I removed the torrentio one and that showed up as the default; I just never looked at where it came from - I presumed it was the same as the logo from Stremio ^^ Extending the readme is probably a good idea.
Do we need to reimplement Flaresolverr?
So for a 1-to-1 merge, I'm seeing a few missing crawlers, like 1337x (which will obviously need Flaresolverr again). We need to know for a fact whether we can run this over a database running from the master branch with absolutely no breaking issues.
I can confirm that running the new scrapers against my existing and already-seeded database from the main branch has created no obvious issues.
@funkypenguin That's good to know. So we need somewhere we can put a bunch of todo tasks:
Did you directly pull this and run it over the top? Are you running the new code for the addon as well, or did you just replace the scrapers?
For now, I just replaced the scrapers, on the basis that I should be able to just "drop them in" (and nobody would notice if scraping broke for a while anyway!)
Yes - would you consider using the rabbit configuration files to expose settings for both consumer and producer? That would allow users to better tune them for lower-end hardware.
So, an additional observation: I thought I'd try the original l33tx scraper against my database, and it failed with:
So there's no going back to main without some work.
Yeah, unfortunately there were a few schema changes. It's worth noting I don't know how this will function against the upstream release of torrentio, as that doesn't have the provider DMM, which is what the DMM scraper stores the torrents as - so if you have just dropped this in as the scraper and not the addon, then I don't think any of the DMM sources will be listed in Stremio. That's why I removed the providers section from config - it's kinda pointless, I think; I've not really seen a case where people have asked to remove sources. I'm going to remove it from the UI and push back up to the PR branch.
Yeah, that's a good shout.
Yeah, I hear ya - it's been on my todo list; when I was rewriting the scraper I accidentally changed it.
I'm assuming the issues my Raspberry Pi is having are absolutely the producer adding too many scrapes to process at once. Here's my RabbitMQ message queue:
I disabled the producer about 2 hours ago. I can only imagine how big the queue was when it forced my Pi to a standstill. This definitely needs some limits added. I'm guessing what @danwilldev said might handle it: a configuration file for RabbitMQ. Then maybe some logic to get the producer to sleep whilst the queue goes down?
There is a separate task in the producer that is responsible for actually publishing to RabbitMQ. The scrapers themselves all ingest and write to a postgres table with processed = false. We can expand on this and make that configurable from the scrapers' config - then introduce a hard limit for the queue size, and have the producer's publisher job check the number of messages in the queue first; if the limit has been reached, defer the execution of the job till the next scheduled run.
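Something like the following could work for that guard - a minimal sketch with RabbitMQ.Client, where the queue name and limit are hypothetical; the real implementation lives in the producer:

```csharp
// Sketch of the deferred-publish idea: check the queue depth before sending a
// batch, and skip this run if the configured cap has been reached.
using System.Collections.Generic;
using System.Text;
using RabbitMQ.Client;

public static class PublisherSketch
{
    public static void PublishBatch(IModel channel, IReadOnlyList<string> batch,
                                    string queueName, uint maxQueueSize)
    {
        // MessageCount asks the broker for the current depth of the queue.
        if (maxQueueSize > 0 && channel.MessageCount(queueName) >= maxQueueSize)
        {
            // Defer: rows stay at processed = false, so the next scheduled
            // run retries once consumers have drained the queue a bit.
            return;
        }

        foreach (var message in batch)
        {
            channel.BasicPublish(exchange: "",
                                 routingKey: queueName,
                                 basicProperties: null,
                                 body: Encoding.UTF8.GetBytes(message));
        }
    }
}
```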
Also wraps in pm2, and introduces linting and dev watch.
I booted up a virtual machine and installed a minimal Ubuntu Server. I then spun up the producer and 60 consumers. The thing is insane. This is on a VM with 12 GB of RAM and 4 cores. The consumers cannot, in any way, keep up with the producer (this is before I have had a look at your new commits @iPromKnight). I imagine the producer will eventually reach a limit where it's processed the backlog of available crawls, so it will even out eventually. The databases are still hosted on my Raspberry Pi. I just opened some ports and stuck the VM on a bridged network.
MaxPublishBatchSize must be set, but MaxQueueSize can be set to 0 to disable the check on the RabbitMQ queue size.
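For reference, a sketch of what those settings could look like as a bound options class - only the two property names come from this thread; the class name and defaults are illustrative:

```csharp
// Publisher settings described above, bound from the producer's JSON config.
// Class name and default values here are assumptions, not the repo's actual ones.
public sealed class PublisherOptions
{
    // Maximum number of messages published to RabbitMQ per run.
    public int MaxPublishBatchSize { get; set; } = 500;

    // Queue-depth cap before publishing is deferred to the next run;
    // 0 disables the queue-size check entirely.
    public int MaxQueueSize { get; set; } = 0;
}
```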
Nice workaround! 60 consumers - haha - those logs must have been crazy 😄 Unfortunately you are at the mercy of the time it takes to extract the torrent data to find files etc. - that part of the consumer is still torrentio scraper code. Perhaps we can come up with something better? I've just committed the producer changes that allow setting size limits for the queue and publish batch, and configuration of the publishing window - that'll help if you want to keep the queue as small as possible. Previous to that, I'd implemented hardening of the services, added esbuild and dev watch etc. to all addons, and brought in the Jackett backend addon for torrentio, aptly named Jackettio 😸
Explain the process of using Jackett to me - I'm going to add some commits to update the documentation.
I have created a project and issues based on @purple-emily's input (thank you so much).
Here's what needs to be added to the compose for it:

```yaml
version: '3.8'
name: jackettio-selfhostio
services:
  jackett:
    image: linuxserver/jackett
    restart: unless-stopped
    ports:
      - "127.0.0.1:9117:9117"
    environment:
      - PUID=1001
      - PGID=1001
      - TZ=Europe/London
    volumes:
      - jackett-downloads:/downloads
      - jackett-config:/config
  addon:
    build:
      context: src/node/addon-jackett
      dockerfile: Dockerfile
    ports:
      - "7001:7001"
    environment:
      - TZ=Europe/London
      - DEBUG_MODE=false
      - JACKETT_API_KEY=hl7a62ujbwwut0zfqtr3hrm2izin5jf3
      - JACKETT_URI=http://jackett:9117
volumes:
  jackett-downloads:
  jackett-config:
```

There is a chicken-and-egg issue though: make sure to add some indexers in Jackett. That's all that's required for that - no mongo / databases / scrapers etc. You'll notice on the addon I cut out anything that wasn't supported - so it's purely debrid options.
cool thanks ^^
@Gabisonfire I'm currently going through and trying to get the documentation updated. I was pushing it against @iPromKnight's repo. Where do you want to be before merging this?
That's a sign-off from me now, I think.
I don't think we need to open the MongoDB and PostgreSQL ports in the docker-compose. If we remove the port references, then we close a security issue and need not offer advice on changing the database passwords.
@purple-emily I think we need to get the docs for deploying and integrating done first. Then I'd say we merge and unlink from the fork. @iPromKnight thanks for everything.
@Gabisonfire See here: https://github.com/iPromKnight/torrentio-scraper-sh/pull/1 - @iPromKnight will need to merge this into his fork when I'm done.
Anyone know what happens if you don't fill out the …
There is a check on the GitHub config when it loads.
Out of curiosity, has anyone had success with the rarbg dump? In one of my test environments I loaded it via pgloader into the same DB and did a select statement similar to: INSERT INTO ingested_torrents (name, source, category, info_hash, size, seeders, leechers, imdb, processed, "createdAt", "updatedAt") … but it didn't seem to push them to rabbit. What source did you set - does it have to be one defined in the add-on, or did you set it as RAR?
It will push them, but it'll look like it's not doing anything, because unless you are on today's version, which batches new ingestions, it'll bulk-send all 260k found in the SQLite file, which will take a few minutes. Took me 10 to publish to rabbit, I think. If you use today's version and set the max batch size to 500 or less, you'll see it making progress. You also have to make sure you set the categories correctly when you insert them.
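For anyone repeating this, here is a hedged sketch of the load step. The ingested_torrents column list comes from the comment above; everything on the dump side (table "rarbg_items", columns "title", "hash", "cat", "imdb") and the category strings are assumptions - check the crawlers for the exact values before running:

```csharp
// One-off loader sketch: copy rows from a pgloader-imported rarbg table into
// ingested_torrents with processed = false so the publisher job queues them.
// Requires the Npgsql NuGet package; all dump-side names are assumptions.
using Npgsql;

const string connString =
    "Host=localhost;Username=postgres;Password=postgres;Database=torrentio";

await using var conn = new NpgsqlConnection(connString);
await conn.OpenAsync();

// Category values must match what the consumer expects -- "movies"/"tv" here
// is an assumption, not taken from the repo.
const string sql = """
    INSERT INTO ingested_torrents
        (name, source, category, info_hash, size, seeders, leechers, imdb,
         processed, "createdAt", "updatedAt")
    SELECT title,
           'RARBG',
           CASE WHEN cat LIKE 'movies%' THEN 'movies' ELSE 'tv' END,
           hash, size, NULL, NULL, imdb,
           false, NOW(), NOW()
    FROM rarbg_items;
    """;

await using var cmd = new NpgsqlCommand(sql, conn);
var rows = await cmd.ExecuteNonQueryAsync();
Console.WriteLine($"Queued {rows} rows for ingestion.");
```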
Wouldn't it be easier to have Jackett as a separate deployment instead of it being included in the stack? This way, you can run Jackett separately, grab the API key, and then input it into the selfhostio docker compose file afterwards.
Oh yeah, absolutely - I was only giving a baseline example of how to use the other addon there.
I agree with @trulow - I think a lot of people who use Jackett will already have their own instance. Otherwise, it makes sense to have it as a "step 1".
Absolutely - I wouldn't want us to have to worry about having that in our chosen stack anyhow - the Jackett addon is more of an optional extra.
So this project has suddenly become a C# project? Could just as well have been a separate repo 😅
@sleeyax We have detached from the upstream as of today and will be renaming the project soon.
Haha, it's just one service that's C# right now (.NET ❤️ though 😋)
I love the ambition to build a better, open-source and community-maintained torrentio 🚀. I dislike the sudden change in direction though; anyone who was reading the OG forked code is in for a surprise on the next …
Good idea 😄 It was screaming for distributed pub/sub though, looking at how things were done.
@sleeyax I understand the feeling; I created a release of the code before the rewrite, so you can still access it.