
Feature: Support flag to crawl only the root website. Do not hop to external links #11

Closed · Fixed by #40
indrajithi opened this issue Jun 14, 2024 · 10 comments
Labels: good first issue (Good for newcomers)

indrajithi (Collaborator) commented Jun 14, 2024

  • Very straightforward feature: add a flag so the crawler only crawls the root website and does not follow external links.

  • e.g.: If the root URL provided is https://github.com, it should crawl pages in this domain only. It should not crawl https://example.com (see the sketch below).

  • (optional) Can we also support an option to crawl only external links and no internal links? There could be some use cases for that.
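One way the internal-vs-external check could be implemented is by comparing the parsed domains of the root URL and each candidate link. A minimal sketch, not the library's actual code; the `is_internal_link` helper name is made up for illustration:

```python
from urllib.parse import urlparse

def is_internal_link(root_url: str, link: str) -> bool:
    """Return True if `link` points to the same domain as `root_url`."""
    root_netloc = urlparse(root_url).netloc
    link_netloc = urlparse(link).netloc
    # Relative links such as "/about" have an empty netloc and count as internal.
    return link_netloc in ("", root_netloc)

print(is_internal_link("https://github.com", "https://github.com/features"))  # True
print(is_internal_link("https://github.com", "https://example.com"))          # False
```

Note that a strict netloc comparison treats subdomains like gist.github.com as external to github.com; whether subdomains should count as internal is a design choice worth settling in this issue.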

Mews (Collaborator) commented Jun 15, 2024

I have a question: isn't this already achievable through max_links=0 in the Spider class?
And if not, does this mean adding an argument to Spider.__init__ which, when set to true, makes it crawl only the root website?

indrajithi (Collaborator, Author) commented

If we set max_links=0 it will crawl only the root_url once.

Say, for example, we pass the root_url as https://github.com. It will crawl only this page and fetch all the links on it. It will not crawl https://github.com/indrajithi/tiny-web-crawler and fetch the links on that page. max_links is the number of urls/links crawled.

What we want to achieve in this issue is that it should only crawl internal links, i.e. every link that has https://github.com/ in it, and not crawl external links.

This will be useful for creating a sitemap for a website. LMK if you have any more questions.
@Mews
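To make the desired behavior concrete, here is a rough sketch of a crawl loop that only follows internal links. The crawl_internal_only and fetch_links names and the toy link graph are made up for illustration; this is not the Spider implementation:

```python
from urllib.parse import urlparse

def crawl_internal_only(root_url, fetch_links, max_links=10):
    """Breadth-first crawl that skips any link pointing outside the root domain."""
    root_netloc = urlparse(root_url).netloc
    seen, queue = {root_url}, [root_url]
    while queue and len(seen) < max_links:
        page = queue.pop(0)
        for link in fetch_links(page):  # fetch_links stands in for the real fetch-and-parse step
            netloc = urlparse(link).netloc
            if link not in seen and netloc in ("", root_netloc):
                seen.add(link)
                queue.append(link)
    return seen

# Toy link graph standing in for real HTTP fetches.
graph = {
    "https://github.com": ["https://github.com/about", "https://example.com"],
    "https://github.com/about": [],
}
print(crawl_internal_only("https://github.com", lambda url: graph.get(url, [])))
# -> {'https://github.com', 'https://github.com/about'} (external link skipped)
```

A real implementation would also need to resolve relative links against the current page (e.g. with urllib.parse.urljoin) before comparing domains.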

Mews (Collaborator) commented Jun 15, 2024

Alright, makes sense. What should I call the argument then, something like crawl_external_links? And the default would be true?
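If the argument were added under that name, usage might look like this. The crawl_external_links name is only the proposal above, and the import path and spider.start() call are assumptions for illustration, not confirmed by this thread:

```python
from tiny_web_crawler import Spider  # import path assumed for illustration

# Crawl https://github.com without following links to other domains.
spider = Spider(root_url="https://github.com", crawl_external_links=False)
spider.start()
```

Defaulting the flag to true would keep the current crawl-everything behavior for existing users.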

Mews (Collaborator) commented Jun 15, 2024

Oh wait, there's already a PR open for this.

indrajithi (Collaborator, Author) commented

> Oh wait, there's already a PR open for this.

Would you like to pick this up? This is very similar to what we discussed.

Mews (Collaborator) commented Jun 16, 2024

Sure!

indrajithi (Collaborator, Author) commented

@devavinothm Are you working on this? #14

Mews (Collaborator) commented Jun 17, 2024

@indrajithi I can complete his PR if you want.

indrajithi (Collaborator, Author) commented

@Mews I have updated the description. Assigning to you. 🥇

Mews (Collaborator) commented Jun 17, 2024

Thanks, I'm going to sleep right now but I'll get to it tomorrow morning :)
