WebCrawler: improvements #422

eolivelli · 2023-09-15T15:27:22Z

Summary:
Add several improvements to webcrawler-source:

Take the robots.txt file into account
Download the sitemaps
Add a configuration parameter to limit the total number of URLs considered by the crawler
Add a configuration parameter to limit the crawl depth
Add a configuration parameter to disable scanning the documents for links
Remove the "idle-time" parameter
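
The robots.txt and sitemap handling listed above can be illustrated with a minimal Python sketch. It is not the webcrawler-source implementation; it only shows the general technique using the standard library's `urllib.robotparser`. The robots.txt content, the `example-crawler` user agent, and the `load_robots` helper are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://docs.langstream.ai/sitemap.xml
"""


def load_robots(text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


robots = load_robots(ROBOTS_TXT)

# A crawler that honors robots.txt checks every URL before requesting it.
allowed = robots.can_fetch("example-crawler", "https://docs.langstream.ai/index.html")
blocked = robots.can_fetch("example-crawler", "https://docs.langstream.ai/private/page")

# Sitemap URLs declared in robots.txt can seed additional pages (Python 3.8+).
sitemaps = robots.site_maps()
```

In this sketch `allowed` is true, `blocked` is false, and `sitemaps` lists the single declared sitemap URL. How the actual crawler stores and refreshes this state is not described in this pull request.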

The new parameters are:

max-urls: the maximum number of URLs visited by the Web Crawler (defaults to 1000)
max-depth: the maximum crawl depth (defaults to 10)
handle-robots-file: look for robots.txt, and then for sitemaps (defaults to true)
scan-html-documents: scan HTML documents to find links to other pages (defaults to true)
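
As a rough illustration of how the `max-urls`, `max-depth`, and `scan-html-documents` limits could interact, here is a minimal breadth-first sketch in Python. It is an assumption-laden model, not the crawler's actual code: the `crawl` function, the in-memory `links` graph standing in for fetched pages, and the choice to count seed URLs as depth 0 are all hypothetical.

```python
from collections import deque


def crawl(seed_urls, links, max_urls=1000, max_depth=10, scan_html_documents=True):
    """Breadth-first walk over `links`, a dict mapping a URL to its outgoing links.

    `links` replaces real HTTP fetching so the sketch is self-contained.
    Seeds are treated as depth 0 (an assumption, not confirmed by the source).
    """
    visited = []
    seen = set()
    queue = deque()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    # Stop as soon as the overall URL budget is used up.
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        # Do not follow links when scanning is disabled or the depth limit is reached.
        if not scan_html_documents or depth >= max_depth:
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site structure used only for this example.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"], "/b": []}

everything = crawl(["/"], site)
shallow = crawl(["/"], site, max_depth=1)
capped = crawl(["/"], site, max_urls=2)
no_scan = crawl(["/"], site, scan_html_documents=False)
```

With scanning disabled, this sketch visits only the seed URLs. In the real crawler, sitemaps found through robots.txt may still contribute pages, which this model leaves out.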

Full configuration:

  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls: ["https://docs.langstream.ai/"]
      allowed-domains: ["https://docs.langstream.ai"]
      forbidden-paths: []
      min-time-between-requests: 500
      reindex-interval-seconds: 3600
      max-error-count: 5
      max-urls: 1000
      max-depth: 10
      handle-robots-file: true
      user-agent: "" # this is computed automatically, but you can override it
      scan-html-documents: true
      http-timeout: 10000
      handle-cookies: true
      max-unflushed-pages: 100
      bucketName: "{{{secrets.s3.bucket-name}}}"
      endpoint: "{{{secrets.s3.endpoint}}}"
      access-key: "{{{secrets.s3.access-key}}}"
      secret-key: "{{{secrets.s3.secret}}}"
      region: "{{{secrets.s3.region}}}"

eolivelli added 5 commits September 15, 2023 17:00

392539f WebCrawler: improvements
8c2c20d Handle robots
e754f1b Fix tests
7a8ea9d docs
7953f2a Add test about reload

eolivelli marked this pull request as ready for review September 18, 2023 12:48

eolivelli merged commit 2757349 into main Sep 18, 2023
8 checks passed

eolivelli added the needs-doc label Sep 18, 2023

cbornet deleted the impl/webcrawler-improvements-again branch September 26, 2023 23:45

benfrank241 pushed a commit to vectorize-io/langstream that referenced this pull request May 2, 2024

WebCrawler: improvements (LangStream#422)

06a8bb3
