Url regex matching #16

Merged
merged 8 commits into from
Jun 16, 2024

Conversation

@Mews (Collaborator) commented Jun 16, 2024

This PR closes issue #13.

Changes

  • Added a url_regex argument to the Spider class.
  • URLs are then matched against this pattern inside the Spider.crawl method. URLs that don't match the pattern are skipped.
  • Added a test case for this feature, in test_url_regex.
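
As a rough illustration of the filtering described in the list above (this is not the code from this PR): the sketch below assumes url_regex is an optional pattern string and that a URL is kept when the pattern is found anywhere in it. The helper name filter_urls and the example URLs are hypothetical, and the actual Spider.crawl logic may differ, for instance in whether it uses search or match semantics.

```python
import re
from typing import Iterable, List, Optional


def filter_urls(urls: Iterable[str], url_regex: Optional[str] = None) -> List[str]:
    """Hypothetical helper: keep only the URLs that match url_regex.

    When no pattern is given, every URL is kept.
    """
    if url_regex is None:
        return list(urls)
    pattern = re.compile(url_regex)
    # re.search finds the pattern anywhere in the URL; anchor the pattern
    # with ^ and $ if the whole URL has to match.
    return [url for url in urls if pattern.search(url)]


links = [
    "https://example.com/blog/post-1",
    "https://example.com/about",
    "https://other.org/blog/post-2",
]
blog_links = filter_urls(links, r"^https://example\.com/blog/")
```

With the pattern above, only the first link is kept; the other two would be skipped during a crawl.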

Please let me know if any changes need to be made to this PR.

Review comment on tiny_web_crawler/crawler.py (outdated, resolved)
@indrajithi (Collaborator)

@Mews Can you pull from master? The CI is failing because of an outdated workflow in master.

@Mews (Collaborator, Author) commented Jun 16, 2024

I don't think that was the issue. It was the import re line: it should come before the requests import.
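
The ordering rule mentioned above (standard-library imports such as re before third-party ones such as requests) can be sketched as a small check. Which linter the CI runs is not stated in this thread, so this is only an illustration of the convention, not the project's tooling. The function name stdlib_imports_first is hypothetical, and the check treats every module outside the standard library as third-party.

```python
import ast
import sys

# sys.stdlib_module_names exists on Python 3.10+; the small fallback set
# is an assumption that only covers the modules used in this example.
STDLIB = getattr(sys, "stdlib_module_names", frozenset({"re", "os", "sys"}))


def stdlib_imports_first(source: str) -> bool:
    """Return True if no standard-library import follows a non-stdlib import."""
    seen_non_stdlib = False
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue  # not a top-level absolute import
        for name in names:
            if name.split(".")[0] in STDLIB:
                if seen_non_stdlib:
                    return False
            else:
                seen_non_stdlib = True
    return True


good_order = stdlib_imports_first("import re\nimport requests\n")
bad_order = stdlib_imports_first("import requests\nimport re\n")
```

The source is only parsed, never executed, so requests does not need to be installed to run the check.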

@indrajithi indrajithi merged commit 9912308 into DataCrawl-AI:master Jun 16, 2024
9 checks passed