Big rewrite of scraping solution #26

Merged 8 commits into knightcrawler-stremio:master on Feb 2, 2024

Conversation

iPromKnight
Collaborator

Distributed consumers for ingestion/scraping (scalable) - single producer written in C#.

Changed from page scraping to RSS XML scraping.
Includes Debrid Media Manager (DMM) hashlist decoding (requires a GitHub read-only PAT, as requests must be authenticated) - this allows ingestion of 200k+ entries in a few hours.
Simplifies a lot of torrentio to deal with the new data.
@iPromKnight
Collaborator Author

iPromKnight commented Feb 1, 2024

I've also pushed up iPromKnight/addon-jackett, which is an extremely lightweight version of torrentio with the backend fully changed to Jackett.
Might be worth bringing that into this repo and solution as a second addon to run side by side?
I've had them both running together with Stremio, and it does help with result augmentation while waiting for the main ingestion of DMM (Debrid Media Manager) hashsets.

I don't really have time to maintain it, though.

@purple-emily
Collaborator

The scope of these changes is huge. You need to commit more frequently so I can read through the commit messages 😅

Is this a drop-in replacement? I imagine you're gonna need to talk everyone through it.

@Gabisonfire
Collaborator

First, big thank you. I was not expecting this.
This will take some time to review and test properly.
One concern that comes to mind is that this removes the ability to sync with the upstream. At this point, if this is merged, we might as well just move it to its own project and distance it from Torrentio.

I'd be more useful to this project in C#.
I had started some planning to add Jackett, but if yours already works, I'd try to bring it into the project and have it optional, defaulted to false.

I had other plans when it comes to sharing databases as well.

@Gabisonfire
Collaborator

@purple-emily @iPromKnight
If you want to discuss, please add me on Discord, might be easier there. (same as my username here)

@iPromKnight
Collaborator Author

iPromKnight commented Feb 1, 2024

The scope of these changes is huge. You need to commit more frequently so I can read through the commit messages 😅

Is this a drop-in replacement? I imagine you're gonna need to talk everyone through it.

Haha - I know. Sorry about that; I wasn't actually going to share it - it was more of a backup solution for me, as it started with a rarbg dump SQLite database I found a couple of days ago that I've been working on in my local Gitea repo.
Then I found the Debrid Media Manager site and had a look into just how they were sharing hashsets - it's encoded LZStrings dumped into a React frontend. Decoding them is simple - they're just a bunch of torrent sets that are all RD-cached ^^
With all the traction this is getting on /r/StremioAddons though, I felt I should give back.
Was a little worried about sharing tbh - you know how much of a touchy subject this is 😄

The main change here in torrentio is that the providers filter doesn't do anything, as I've stripped it out of the config.
Scraping is all done via XML RSS feeds for each site now, without crawling individual pages.
It seems each site republishes its RSS feed every minute with new changes.

src/producer/configuration holds all the injected JSON config to control the sync timers.
Mounting this file over the top allows custom settings for ingestion etc.
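
For example, a compose override along these lines would mount a customised config over the defaults (the service name, host filename and in-container path here are illustrative assumptions - check where the image actually places the configuration folder):

services:
  producer:
    volumes:
      - ./my-scrapers.json:/app/configuration/scrapers.json:ro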

It's basically a drop-in replacement though.

Yeah, it's a complete drop-in replacement. The idea is we'd only (hopefully) have to drop in changes to the moch directory from upstream torrentio, as that's where all the provider handling takes place.

I had started some planning to add Jackett, but if yours already works, I'd try to bring it into the project and have it optional, defaulted to false.

Perfect - I think that makes sense then, as others can also maintain it.
It does work - but it needs some work around the jackettQueries found in the /jackett folder to expand on TV series searching.
I built it so you can return a list of queries which are processed as a batch of promises; once they're all done, the results get enumerated and parsed. This way, for TV seasons etc. it's possible to expand on what is there now and match things like the title plus season instead of just the title.

@Gabisonfire
Collaborator

Scraping is all done via XML RSS feeds for each site now, without crawling individual pages.
It seems each site republishes its RSS feed every minute with new changes.

Does this mean it removes the scraping of older entries?

@iPromKnight
Collaborator Author

Scraping is all done via XML RSS feeds for each site now, without crawling individual pages.
It seems each site republishes its RSS feed every minute with new changes.

Does this mean it removes the scraping of older entries?

Yeah, but that's what the DMM scraper is for.
You need to set a read-only GitHub access token in env/producer.env.
Then it'll crawl and ingest all the shared hashlists of DMM, like this one for example: https://hashlists.debridmediamanager.com/

They are all shared in git here

Here is the scraper in the producer that decodes these: https://github.com/Gabisonfire/torrentio-scraper-sh/blob/ee994fc8be02c28c962cd834046c12aff9e90071/src/producer/Crawlers/Sites/DebridMediaManagerCrawler.cs

Basically these shared hashlists are LZString-encoded collections of magnets. My count after letting the producer run for 1 hour was 280k messages in Rabbit being consumed by the consumers 😄
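
To make the decoding idea concrete, here's a minimal hedged sketch (not the actual crawler code): it assumes each hashlist payload is one LZString-compressed JSON array and leaves the decompression step injectable, since the exact lz-string variant/port isn't specified in this thread; the "hash" property name is a placeholder.

// Illustrative only - shows "LZString-compressed JSON in, info-hashes out".
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class HashlistDecoder
{
    // decompress: e.g. an lz-string DecompressFromEncodedURIComponent equivalent
    // from whichever .NET port of lz-string the producer uses.
    public static List<string> ExtractHashes(string compressedPayload, Func<string, string> decompress)
    {
        string json = decompress(compressedPayload);

        using JsonDocument doc = JsonDocument.Parse(json);
        var hashes = new List<string>();
        foreach (JsonElement entry in doc.RootElement.EnumerateArray())
        {
            // "hash" is a placeholder property name - the real hashlist fields
            // aren't documented in this thread.
            if (entry.TryGetProperty("hash", out JsonElement hash))
                hashes.Add(hash.GetString()!);
        }
        return hashes;
    }
}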

@Gabisonfire
Collaborator

Wow ok that's really nice. I did not understand initially. This takes care of the "database sharing" issue/feature.

@funkypenguin
Contributor

This is fantastic, and really well put-together. I'm running it currently on the free public instance at https://torrentio.elfhosted.com, and it's going like a rocket!

23:00:25 [Information] [Scraper.Crawlers.Sites.DebridMediaManagerCrawler] Ingestion Successful - Wrote 8373 new torrents
23:00:25 [Information] [Scraper.Crawlers.Sites.DebridMediaManagerCrawler] Successfully marked page as ingested
23:00:28 [Information] [Scraper.Crawlers.Sites.DebridMediaManagerCrawler] Ingestion Successful - Wrote 78 new torrents
23:00:28 [Information] [Scraper.Crawlers.Sites.DebridMediaManagerCrawler] Successfully marked page as ingested

@funkypenguin
Contributor

it was more of a backup solution for me, as it started with a rarbg dump SQLite database I found a couple of days ago that I've been working on in my local Gitea repo.

Ooh, I too have found this rarbg dump - can you suggest how I might import it? :)

@ash32152

ash32152 commented Feb 2, 2024

I've just pulled your fork @iPromKnight and it's working amazingly. I've had trouble scraping the original way - it took me like a week to get around 20k.

Running your fork for the past half an hour, I've got 120k ingested and around 1k consumed already... and I can see the consumed entries in Stremio, working great!!

Just wanted to say thanks for your work on this and feedback for the PR, it's awesome! :)

@Gabisonfire
Collaborator

Gabisonfire commented Feb 2, 2024

My tests are also conclusive so far.
I'm looking at merging tomorrow. I'll have to extend the readme to be more noob-friendly.

@iPromKnight What in the world is that shutterstock background 😄 Gotta say I love the new slogan: Selfhostio the Torrentio brings you much Funio

I'll work on adding a Torrent9 scraper as well based on the ones you wrote.
Thanks

@iPromKnight
Collaborator Author

iPromKnight commented Feb 2, 2024

This is fantastic, and really well put-together. I'm running it currently on the free public instance at https://torrentio.elfhosted.com, and it's going like a rocket!

Thanks a lot - glad it's working for folks.

Ooh, I too have found this rarbg dump, can you suggest how I might import it? :)

I used pgloader to get a Postgres database from it, then SELECTed the data into the ingested_torrents table with processed = false and let the producer push them all into the Rabbit queue :)
https://pgloader.readthedocs.io/en/latest/ref/sqlite.html
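
For anyone following the same route, the pgloader step is just source-to-target (the file name, credentials and database name below are examples, not the real ones):

pgloader ./rarbg_db.sqlite postgresql://postgres:postgres@localhost:5432/torrentio

After that it's an INSERT INTO ingested_torrents ... SELECT from the dump's table with processed = false; a concrete example appears further down this thread.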

Just wanted to say thanks for your work on this and feedback for the PR, it's awesome! :)

You're welcome 😄

My tests are also conclusive so far.
I'm looking at merging tomorrow. I'll have to extend the readme to be more noob-friendly.

@iPromKnight What in the world is that shutterstock background 😄 Gotta say I love the new slogan: Selfhostio the Torrentio brings you much Funio

I'll work on adding a Torrent9 scraper as well based on the ones you wrote.
Thanks

Haha - you know, I've no idea where that background came from. I removed the torrentio one and that showed up as the default; I just never looked at where it came from - I presumed it was the same as the logo, from Stremio ^^
Couldn't resist with the slogan lol.

Extending the readme is probably a good idea.
Might be worth going into detail about the env vars etc., especially the GitHub PAT for the DMM scraper.

@purple-emily
Collaborator

Do we need to reimplement FlareSolverr?

@purple-emily
Collaborator

So for a 1-to-1 merge, I'm seeing a few missing crawlers like 1337x (which will obviously need FlareSolverr again).

We need to know for a fact whether we can run this over a database running from the master branch with absolutely no breaking issues.

@funkypenguin
Contributor

I can confirm that running the new scrapers into my existing and already-seeded database from the main branch has created no obvious issues..

@purple-emily
Collaborator

purple-emily commented Feb 2, 2024

@funkypenguin That's good to know.

So we need somewhere we can put a bunch of todo tasks:

  • Create crawlers to get us 1-1 with master (1337x, ...)
  • Fix the mismatched .env var naming (DATABASE_URI and POSTGRES_DATABASE_URI in addon.env and consumer.env) -- perhaps combine this into one .env file with good comments and defaults.
  • Consider the change "the providers filter doesn't do anything, as I've stripped it out of the config" and what it means - does this functionality need replacing?
  • Consider if the project should commit to renaming to selfhostio or something equivalent and totally branching away from upstream.
  • Consider removing iPromKnight's compose anchors to make the compose file easier to read for casuals. An alternative is to comment the compose file thoroughly.
  • Need some documentation updates. Possibly a docs folder. I don't mind taking this task when I have time. I can go through how to get this running with a reverse proxy, etc.
  • Maybe some documentation for integrating with already existing databases instead of spinning up a new container and having everything self contained. (Have your own PostgreSQL? Change the following...)
  • Contributing guidelines. This can be a CONTRIBUTING.md and/or some linters/workhooks/pre-commit hooks. There's an open issue for this.
  • Issue templates. This can stop the need to chase up detailed logs etc.
  • Further testing on smaller systems. My Raspberry Pi froze and I lost SSH access during the initial run. Can we limit the resource usage in some way? I can't confirm which of the containers caused this, nor if it was the CPU maxing out or the RAM. (EDIT: It's the producer. It absolutely murders the RPi, CPU goes to 100% on all cores and the RAM maxes out. It could be a mixture of this and the data exploding in RabbitMQ)

@purple-emily
Collaborator

I can confirm that running the new scrapers into my existing and already-seeded database from the main branch has created no obvious issues..

Did you directly pull this and run it over the top? Are you running the new code for the addon as well or did you just replace the scrapers?

@funkypenguin
Contributor

Did you directly pull this and run it over the top? Are you running the new code for the addon as well or did you just replace the scrapers?

For now, I just replaced the scrapers, on the basis that I should be able to just "drop them in" (and nobody would notice if scraping broke for a while anyway!)

@danwilldev

  • Further testing on smaller systems. My Raspberry Pi froze and I lost SSH access during the initial run. Can we limit the resource usage in some way?

Yes - would you consider using the Rabbit configuration files to expose settings for both the consumer and the producer? That would allow users to better tune them for lower-end hardware.

@funkypenguin
Contributor

So an additional observation... I thought I'd try the original 1337x scraper against my database, and it failed with:

    sql: 'ALTER TABLE "torrents" ALTER COLUMN "trackers" DROP NOT NULL;ALTER TABLE "torrents" ALTER COLUMN "trackers" DROP DEFAULT;ALTER TABLE "torrents" ALTER COLUMN "trackers" TYPE VARCHAR(4096);',
    parameters: undefined

So there's no going back to main without some work..

@iPromKnight
Collaborator Author

iPromKnight commented Feb 2, 2024

Yeah, unfortunately there were a few schema changes.
The trackers column wasn't large enough to store the trackers pulled via HTTP.

It's worth noting that I don't know how this will function against the upstream release of torrentio, as that doesn't have the DMM provider, which is what the DMM scraper stores the torrents as. So if you have just dropped this in as the scraper and not the addon, then I don't think any of the DMM sources will be listed in Stremio.

That's why I removed the providers section from the config - it's kinda pointless I think anyhow; I've not really seen a case where people have asked to remove sources.
In Jackettio, the Jackett backend version I added, I removed everything but the debrid section, for instance.

I'm going to remove it from the UI and push back up to the PR branch - less confusing than having the redundant selector.

@iPromKnight
Collaborator Author

  • Further testing on smaller systems. My Raspberry Pi froze and I lost SSH access during the initial run. Can we limit the resource usage in some way?

Yes would you consider using the rabbit configuration files to expose settings for both consumer and producer? That would allow users to better tune them for lower end hardware

Yeah, that's a good shout.
On a Pi, I'd probably also change the number of deployed consumers to 1 instead of 3.
Mind you, the consumer isn't that resource-heavy - it sits at around 90 MB of RAM per instance.
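
Assuming the consumers are defined as a single compose service rather than three separate ones (adjust to however the compose file actually defines them), that could be as simple as:

docker compose up -d --scale consumer=1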

Fix the mismatched .env var naming (DATABASE_URI and POSTGRES_DATABASE_URI in addon.env and consumer.env) -- perhaps combine this into one .env file with good comments and defaults.

Yeah, I hear ya - it's been on my todo list; when I was rewriting the scraper I accidentally changed it.

@purple-emily
Collaborator

I'm assuming the issues my Raspberry Pi is having are absolutely down to the producer adding too many scrapes to process at once.

Here's my RabbitMQ messages queue:

root@4e8effe4d7df:/# rabbitmqadmin  list queues
+----------+----------+
|   name   | messages |
+----------+----------+
| ingested | 12428    |
+----------+----------+

I disabled the producer about 2 hours ago. I can only imagine how big the queue was when it brought my Pi to a standstill.

This definitely needs some limits adding.

I'm guessing what @danwilldev said might handle it: a configuration file for RabbitMQ. Then maybe some logic to get the producer to sleep whilst the queue goes down?

@iPromKnight
Collaborator Author

There is a separate task in the producer that is responsible for actually publishing to RabbitMQ.

The scrapers themselves all ingest and write to a Postgres table with processed = false.
Then every 10 seconds a job wakes up, pulls anything that hasn't been sent, and pushes it to the queue.

We can expand on this and make that configurable from the scraper config - then introduce a hard limit for the queue size and have the producer's publisher job check the number of messages in the queue first; if the limit has been reached, defer execution of the job until the next scheduled run.
This would give tighter control and introduce the concept of batches.
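
As a rough illustration of that throttling idea - a sketch only, assuming the RabbitMQ.Client 6.x API; the job/settings types and the fetch/mark delegates are hypothetical stand-ins rather than the actual producer code, and the MaxPublishBatchSize/MaxQueueSize names match the settings mentioned later in this thread:

using System;
using System.Collections.Generic;
using System.Text;
using RabbitMQ.Client;

public record PublisherSettings(int MaxPublishBatchSize, int MaxQueueSize, string QueueName);

public class PublisherJob
{
    private readonly IModel _channel;
    private readonly PublisherSettings _settings;

    public PublisherJob(IModel channel, PublisherSettings settings)
    {
        _channel = channel;
        _settings = settings;
    }

    // Called on a schedule (e.g. every 10 seconds).
    public void Execute(Func<int, IReadOnlyList<string>> fetchUnprocessedJson,
                        Action<IReadOnlyList<string>> markProcessed)
    {
        // MaxQueueSize = 0 disables the queue-depth check.
        if (_settings.MaxQueueSize > 0 &&
            _channel.MessageCount(_settings.QueueName) >= _settings.MaxQueueSize)
        {
            return; // queue is full enough - defer until the next scheduled run
        }

        // Pull at most one batch of rows with processed = false.
        var batch = fetchUnprocessedJson(_settings.MaxPublishBatchSize);

        var props = _channel.CreateBasicProperties();
        props.Persistent = true;

        foreach (var json in batch)
        {
            // Publish to the default exchange, routed by queue name.
            _channel.BasicPublish("", _settings.QueueName, props, Encoding.UTF8.GetBytes(json));
        }

        markProcessed(batch); // flip processed = true for the published rows
    }
}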

@purple-emily
Collaborator

purple-emily commented Feb 2, 2024

I booted up a virtual machine and installed minimal Ubuntu Server. I then spun up the producer and 60 consumers. The thing is insane.

This is on a VM with 12 GB of RAM and 4 cores.

The consumers cannot, in any way, keep up with the producer (this is before I've had a look at your new commits @iPromKnight). I imagine the producer will eventually reach a point where it has processed the backlog of available crawls, so it should even out.

The databases are still hosted on my Raspberry Pi. I just opened some ports and stuck the VM on a bridged network.

MaxPublishBatchSize must be set, but MaxQueueSize can be set to 0 to disable the check of the RabbitMQ queue size
@iPromKnight
Collaborator Author

iPromKnight commented Feb 2, 2024

I booted up a virtual machine and installed minimal Ubuntu Server. I then spun up the producer and 60 consumers. The thing is insane.

This is on a VM with 12 GB of RAM and 4 cores.

The consumers cannot, in any way, keep up with the producer (this is before I've had a look at your new commits @iPromKnight). I imagine the producer will eventually reach a point where it has processed the backlog of available crawls, so it should even out.

The databases are still hosted on my Raspberry Pi. I just opened some ports and stuck the VM on a bridged network.

Nice workaround

60 Consumers - haha - those logs must have been crazy 😄

Unfortunately you are at the mercy of the time it takes to extract the torrent data to find files etc. - that part of the consumer is still torrentio scraper code. Perhaps we can come up with something better?

I've just committed the producer changes that allow setting size limits for the queue and the publish batch, and configuration of the publishing window - that'll help if you want to keep the queue as small as possible.

Prior to that I implemented hardening of the services, added esbuild and dev watch etc. to all addons, and brought in the Jackett backend addon for torrentio, aptly named Jackettio 😸
PR all updated

@purple-emily
Collaborator

Explain the process of using Jackett to me - I'm going to add some commits to update the documentation

@Gabisonfire
Collaborator

I have created a project and issues based on @purple-emily's input (thank you so much).
It's public, let me know if it makes sense:
https://github.com/users/Gabisonfire/projects/1

@iPromKnight
Collaborator Author

iPromKnight commented Feb 2, 2024

Explain the process of using Jackett to me - I'm going to add some commits to update the documentation

Here is what needs to be added to the compose for it:

version: '3.8'
name: jackettio-selfhostio

services:
  jackett:
    image: linuxserver/jackett
    restart: unless-stopped
    ports:
      - "127.0.0.1:9117:9117"
    environment:
      - PUID=1001
      - PGID=1001
      - TZ=Europe/London
    volumes:
      - jackett-downloads:/downloads
      - jackett-config:/config

  addon:
    build:
      context: src/node/addon-jackett
      dockerfile: Dockerfile
    ports:
      - "7001:7001"
    environment:
      - TZ=Europe/London
      - DEBUG_MODE=false
      - JACKETT_API_KEY=hl7a62ujbwwut0zfqtr3hrm2izin5jf3
      - JACKETT_URI=http://jackett:9117
      
volumes:
  jackett-downloads:
  jackett-config:

There is a chicken-and-egg issue though.
You have to run docker compose up, then go to Jackett at http://localhost:9117, get the API key from the page, and add it as the env var where it's required on the addon.
Then run docker compose up again and it'll recreate the addon deployment with the new env var.
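
Concretely, using the service names from the compose above, the two-step bring-up could look like this:

docker compose up -d jackett
# grab the API key from http://localhost:9117 and set JACKETT_API_KEY for the addon
docker compose up -d addon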

Make sure to add some indexers in Jackett.

That's all that's required for that - no Mongo / databases / scrapers etc.

You'll notice on the addon I cut out anything that wasn't supported - so it's purely debrid options.

@iPromKnight
Collaborator Author

I have created a project and issues based on @purple-emily's input (thank you so much). It's public, let me know if it makes sense: https://github.com/users/Gabisonfire/projects/1

cool thanks ^^

@purple-emily
Collaborator

@Gabisonfire I'm currently going through and trying to get the documentation updated. I was pushing it against @iPromKnight's repo.

Where do you want to be before merging this?

@iPromKnight
Collaborator Author

@Gabisonfire I'm currently going through and trying to get the documentation updated. I was pushing it against @iPromKnight's repo.

Where do you want to be before merging this?

That's a sign-off from me now, I think.

@purple-emily
Collaborator

I don't think we need to open the MongoDB and PostgreSQL ports in the docker-compose. If we remove the port references then we close a security issue and need not offer advice on changing the database passwords.

@Gabisonfire
Collaborator

@purple-emily I think we need to get the docs for deploying and integrating done first. Then I'd say we merge and unlink from the fork.

@iPromKnight thanks for all of this.
I think we could move Jackett into a separate compose - most people who would use Jackett probably already have their own instance, but I could be wrong.

@purple-emily
Collaborator

@Gabisonfire See here: https://github.com/iPromKnight/torrentio-scraper-sh/pull/1

@iPromKnight will need to merge this into his fork when I'm done

@purple-emily
Collaborator

Anyone know what happens if you don't fill out the GithubSettings__PAT=<YOUR TOKEN HERE> var?

@Gabisonfire merged commit 898ab6e into knightcrawler-stremio:master on Feb 2, 2024
@iPromKnight
Collaborator Author

Anyone know what happens if you don't fill out the GithubSettings__PAT=<YOUR TOKEN HERE> var?

There is a check on the GitHub config when it loads.
If it's missing, then it just doesn't schedule the DMM crawler jobs.

@ash32152

ash32152 commented Feb 2, 2024

Out of curiosity, has anyone had success with the rarbg dump? In one of my test environments I loaded it via pgloader into the same DB and did a statement similar to:

INSERT INTO ingested_torrents (name, source, category, info_hash, size, seeders, leechers, imdb, processed, "createdAt", "updatedAt")
SELECT title, dt, cat, hash, size, NULL, NULL, imdb, false, current_timestamp, current_timestamp
FROM items;

but it didn't seem to push them to Rabbit. What source did you set - does it have to be one defined in the add-on, or did you set it as RAR?

@iPromKnight
Collaborator Author

iPromKnight commented Feb 2, 2024

It will push them, but it'll look like it's not doing anything: unless you are on today's version, which batches new ingestions, it'll bulk-send all 260k that are found in the SQLite file, which will take a few minutes. Took me about 10 to publish to Rabbit, I think.

If you use today's version and set the max batch size to 500 or less, you'll see it making progress.

You also have to make sure you set the categories correctly when you insert them.
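
For instance, building on the SELECT above, the category column could be mapped explicitly. This is a sketch only - the 'tv'/'movies' strings and the LIKE pattern are placeholders for whatever category values the dump uses and whatever strings the consumer actually expects:

-- placeholder category mapping; adjust both sides to the real values
INSERT INTO ingested_torrents (name, source, category, info_hash, size, seeders, leechers, imdb, processed, "createdAt", "updatedAt")
SELECT title, dt,
       CASE WHEN cat LIKE 'tv%' THEN 'tv' ELSE 'movies' END,
       hash, size, NULL, NULL, imdb, false, current_timestamp, current_timestamp
FROM items;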

@trulow
Contributor

trulow commented Feb 2, 2024

Explain the process of using Jackett to me - I'm going to add some commits to update the documentation

Here is what needs to be added to the compose for it […]

Wouldn't it be easier to have Jackett as a separate deployment instead of it being included in the stack? That way, you can run Jackett separately, grab the API key, and then input it into the selfhostio docker compose file afterwards.

@iPromKnight
Collaborator Author

Oh yeah, absolutely - I was only giving a baseline example of how to use the other addon there.
I wouldn't deploy it like that 😛

@Gabisonfire
Collaborator

I agree with @trulow - I think a lot of people who use Jackett will already have their own instance. Otherwise, it makes sense to have it as a "step 1".

@iPromKnight
Collaborator Author

iPromKnight commented Feb 2, 2024

I agree with @trulow - I think a lot of people who use Jackett will already have their own instance. Otherwise, it makes sense to have it as a "step 1".

Absolutely - I wouldn't want us to have to worry about having that in our chosen stack anyhow; the Jackett addon is more of an optional extra.
I've a plan to make that completely redundant anyway - see #45

@sleeyax
Collaborator

sleeyax commented Feb 2, 2024

So this project has suddenly become a C# project? It could just as well have been a separate repo 😅

@Gabisonfire
Collaborator

@sleeyax we have detached from the upstream as of today and will be renaming the project soon.

@iPromKnight
Collaborator Author

iPromKnight commented Feb 2, 2024

So this project has suddenly become a C# project? It could just as well have been a separate repo 😅

Haha, it's just one service right now that's C#.
It still has two Node apps (three if you count the lightweight Jackettio).

(.NET ❤️ though 😋)

@sleeyax
Collaborator

sleeyax commented Feb 4, 2024

I love the ambition to build a better, open-source, community-maintained torrentio 🚀.

I dislike the sudden change in direction though; anyone who was reading the OG forked code is in for a surprise on the next git pull. That's why I've created this copy of the original torrentio-scraper-sh project (before this PR was introduced) here: https://github.com/sleeyax/torrentio-scraper. I have no plans to maintain it; it solely exists for educational and archival purposes.

@iPromKnight
Collaborator Author

iPromKnight commented Feb 4, 2024

Good idea 😄
Aye, that's all on me lol.
I think the issues torrentio started having, and the droves of people posting on Reddit in r/StremioAddons, had a lot of devs looking into the code base lol.
I just didn't want to have yet another repo being worked on in parallel to others - so I decided to PR all my changes into this one, and it was quite a drastic change of direction, as you say ^^

It was screaming for distributed pub/sub though, looking at how things were done.

@Gabisonfire
Collaborator

@sleeyax I understand the feeling, and I created a release of the code before the rewrite so you can still access it.
