
Scheduled Threads for crawling #919

Open
haolin96 opened this issue Mar 7, 2024 · 5 comments
Labels
stale From automation, when inactive for too long.

Comments

@haolin96

haolin96 commented Mar 7, 2024

Hi,

I want to set up a service that listens for new URLs from other sources. The service should stay running, and every time a URL comes in, it should be crawled and committed.

Looking at the code, I have two problems to solve. First, I think I need to disable the crawler's normal stopping behavior, but the committers only run after all data extraction has stopped. Second, I assume importing a URL amounts to adding it to the queue, but I haven't found where in the code to do that. Can you give me some hints?

@ohtwadi
Contributor

ohtwadi commented Mar 8, 2024

Are you looking to start the crawler when a URL is added somewhere? This is currently not supported out of the box. The Crawler does, however, support reading start URLs dynamically at startup via IStartURLsProvider.

You can also generate a file from your source and feed it to urlsFile via a variable. That file, too, is read only at startup time.

Perhaps there is something else we can suggest if you share more details.
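
For illustration, here is a minimal sketch of the IStartURLsProvider approach, assuming Norconex HTTP Collector 3.x, where the interface declares a single provideStartURLs() method returning an Iterator<String> (check the Javadoc for your version). The file path is hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

import com.norconex.collector.http.crawler.IStartURLsProvider;

/**
 * Supplies start URLs from an external file when the crawler starts.
 * Note: the file is read once, at startup -- not continuously.
 */
public class FileStartURLsProvider implements IStartURLsProvider {

    // Hypothetical location; point this wherever your other sources write URLs.
    private static final Path URLS_FILE = Path.of("/tmp/pending-urls.txt");

    @Override
    public Iterator<String> provideStartURLs() {
        try {
            // One URL per line; blank lines are skipped.
            return Files.readAllLines(URLS_FILE).stream()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .iterator();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would then reference this class from the start URLs provider setting in your crawler configuration.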

@haolin96
Author


Yes, I want the crawling service to stay running so it can continuously receive URLs from other sources and then crawl, collect, and commit as new URLs arrive.

@ohtwadi
Contributor

ohtwadi commented Mar 12, 2024

This is not currently supported.

You will have to build your own Java application that uses the Crawler (Examples here).

Further helpful info can be found here.

Consider the following idea (a rough sketch follows below):

  • the application listens for incoming connections
  • when a new URL is submitted, it builds a Crawler config file and starts the Crawler with that config
  • (see my previous reply for other ways to pass URLs to the Crawler)

I strongly recommend setting an upper limit on the number of crawler instances this app can spawn.

If programming is not your forte, you could also script this.
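
To make that concrete, here is a rough sketch using only JDK classes. The wrapper script name is hypothetical (substitute however you launch your collector, with a config whose urlsFile points at the generated file), and the semaphore enforces the instance cap mentioned above:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class CrawlOnDemand {

    // Upper limit on concurrent crawler instances (strongly recommended).
    private static final Semaphore SLOTS = new Semaphore(3);
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/crawl", exchange -> {
            // The request body is expected to contain a single URL.
            String url = new String(exchange.getRequestBody().readAllBytes(),
                    StandardCharsets.UTF_8).trim();
            exchange.sendResponseHeaders(202, -1); // accepted, no response body
            exchange.close();
            POOL.submit(() -> crawl(url));
        });
        server.start();
    }

    private static void crawl(String url) {
        try {
            SLOTS.acquire(); // block while the instance cap is reached
            try {
                // Write a one-URL file the crawler config can point at.
                Path urlsFile = Files.createTempFile("start-urls", ".txt");
                Files.writeString(urlsFile, url);
                // Hypothetical wrapper script: it should start the collector
                // with a config whose urlsFile variable resolves to this file.
                new ProcessBuilder("/path/to/run-crawler.sh", urlsFile.toString())
                        .inheritIO().start().waitFor();
            } finally {
                SLOTS.release();
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

You could then submit a URL with, e.g., curl --data "https://example.com/" http://localhost:8080/crawl.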

@haolin96
Author


Thank you so much. I'm giving it a try now.


stale bot commented May 15, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on May 15, 2024