Url regex matching #16

Mews · 2024-06-16T10:33:44Z

This PR closes issue #13.

Changes

Added a url_regex argument to the Spider class.
URLs are then matched against this pattern inside the Spider.crawl method. URLs that don't match the pattern are skipped.
Also added a test case for this feature in test_url_regex.
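
For readers skimming this PR, here is a minimal sketch of the filtering idea described above. It is not the code from tiny_web_crawler/crawler.py: the helper name filter_urls is hypothetical, and the use of re.search (match anywhere in the URL) rather than re.match is an assumption.

```python
import re
from typing import Iterable, List, Optional


def filter_urls(urls: Iterable[str], url_regex: Optional[str] = None) -> List[str]:
    """Keep only the URLs that match url_regex (hypothetical helper).

    If url_regex is None, no filtering is applied and every URL is kept.
    """
    if url_regex is None:
        return list(urls)

    pattern = re.compile(url_regex)
    kept = []
    for url in urls:
        # Assumption: search semantics, so the pattern may match anywhere in the URL.
        if pattern.search(url) is None:
            continue  # non-matching URLs are skipped, as the PR describes
        kept.append(url)
    return kept
```

Called with a pattern such as r"/blog/", the helper keeps only URLs containing that path segment. Whether the real Spider anchors the pattern at the start of the URL depends on which re function it calls, so check crawler.py before relying on either behaviour.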

Please let me know if any changes need to be made to this PR.

tiny_web_crawler/crawler.py

indrajithi · 2024-06-16T11:17:59Z

@Mews Can you pull from master? The CI fails because of an outdated workflow in master.

Mews · 2024-06-16T11:20:20Z

I don't think that was the issue. It was the import re line, which should come before requests.

Mews added 5 commits June 16, 2024 11:29

Added url_regex argument to Spider class

613acba

Added test case for regex matching

bbf027f

Fix merge conflicts

88f484c

Fix merge conflicts

d6cdbfe

Merge branch 'master' into url-regex-matching

1632b99

indrajithi requested changes Jun 16, 2024

tiny_web_crawler/crawler.py (outdated)

Mews added 3 commits June 16, 2024 12:14

Change verbose_print message

6c28726

Merge branch 'url-regex-matching' of https://github.com/Mews/tiny-web…

640936d

…-crawler into url-regex-matching

Fix import order

56c0138

indrajithi merged commit 9912308 into DataCrawl-AI:master Jun 16, 2024
9 checks passed

This was referenced Jun 16, 2024

Feature: Support for regular expression pattern for url crawling #13

Closed

First Major Release v1.0.0 #24

Open
