
Scheduled Threads for crawling #919

Open
haolin96 opened this issue Mar 7, 2024 · 5 comments
Labels
stale From automation, when inactive for too long.

Comments

@haolin96

haolin96 commented Mar 7, 2024

Hi,

I want to set up a service that listens for new URLs from other sources. The service should stay running, and every time a URL comes in, it should be crawled and committed.

Looking at the code, I have two problems to solve. First, I think I need to disable the crawler's normal stopping behavior, but the committers only run after all data extraction has stopped. Second, I assume importing a URL amounts to adding it to the queue, but I haven't found where in the code to do that. Can you give me some hints?

@ohtwadi
Contributor

ohtwadi commented Mar 8, 2024

Are you looking to start the crawler when a URL is added somewhere? This is currently not supported out of the box. The Crawler does, however, support reading start URLs dynamically at startup via IStartURLsProvider.

You can also generate a file from your source and feed it to urlsFile via a variable. That file, too, is read only at startup time.

Perhaps there is something else we can suggest if you share more details.
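
For illustration, here is a minimal sketch of the IStartURLsProvider approach, assuming Norconex HTTP Collector 3.x, where the interface declares a single provideStartURLs() method returning an Iterator<String> (check the Javadoc for your version). The file path is hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

import com.norconex.collector.http.crawler.IStartURLsProvider;

/**
 * Supplies start URLs from an external file when the crawler starts.
 * Note: the file is read once, at startup -- not continuously.
 */
public class FileStartURLsProvider implements IStartURLsProvider {

    // Hypothetical location; point this wherever your other sources write URLs.
    private static final Path URLS_FILE = Path.of("/tmp/pending-urls.txt");

    @Override
    public Iterator<String> provideStartURLs() {
        try {
            // One URL per line; blank lines are skipped.
            return Files.readAllLines(URLS_FILE).stream()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .iterator();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would then reference this class from the start URLs provider setting in your crawler configuration.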

@haolin96
Author


Yes, I want the crawling service to stay running so it can continuously receive URLs from other sources and then crawl, collect, and commit as new URLs arrive.

@ohtwadi
Contributor

ohtwadi commented Mar 12, 2024

This is not currently supported.

You will have to build your own Java application that uses the Crawler (Examples here).

Further helpful info can be found here.

Consider the following idea (a rough sketch follows below):

  • the application listens for incoming connections
  • when a new URL is submitted, it builds a Crawler config file and starts the Crawler with that config
  • (see my previous reply for other ways to pass URLs to the Crawler)

I strongly recommend setting an upper limit on the number of crawler instances this app can spawn.

If programming is not your forte, you could also script this.
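
To make that concrete, here is a rough sketch using only JDK classes. The wrapper script name is hypothetical (substitute however you launch your collector, with a config whose urlsFile points at the generated file), and the semaphore enforces the instance cap mentioned above:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class CrawlOnDemand {

    // Upper limit on concurrent crawler instances (strongly recommended).
    private static final Semaphore SLOTS = new Semaphore(3);
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/crawl", exchange -> {
            // The request body is expected to contain a single URL.
            String url = new String(exchange.getRequestBody().readAllBytes(),
                    StandardCharsets.UTF_8).trim();
            exchange.sendResponseHeaders(202, -1); // accepted, no response body
            exchange.close();
            POOL.submit(() -> crawl(url));
        });
        server.start();
    }

    private static void crawl(String url) {
        try {
            SLOTS.acquire(); // block while the instance cap is reached
            try {
                // Write a one-URL file the crawler config can point at.
                Path urlsFile = Files.createTempFile("start-urls", ".txt");
                Files.writeString(urlsFile, url);
                // Hypothetical wrapper script: it should start the collector
                // with a config whose urlsFile variable resolves to this file.
                new ProcessBuilder("/path/to/run-crawler.sh", urlsFile.toString())
                        .inheritIO().start().waitFor();
            } finally {
                SLOTS.release();
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

You could then submit a URL with, e.g., curl --data "https://example.com/" http://localhost:8080/crawl.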

@haolin96
Author


Thank you so much. I'm giving it a try now.


stale bot commented May 15, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on May 15, 2024