
Feature: Support flag to crawl only the root website. Do not hop to external links #11

Closed · Fixed by #40
indrajithi opened this issue Jun 14, 2024 · 10 comments
Labels: good first issue (Good for newcomers)

indrajithi (Collaborator) commented Jun 14, 2024

  • Very straightforward feature: add a flag so the crawler only crawls the root website and does not follow external links.

  • e.g.: If the root URL provided is https://github.com, it should crawl pages in this domain only. It should not crawl https://example.com (see the sketch below).

  • (optional) Can we also support an option to crawl only external links and no internal links? There could be some use cases for that.
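One way the internal-vs-external check could be implemented is by comparing the parsed domains of the root URL and each candidate link. A minimal sketch, not the library's actual code; the `is_internal_link` helper name is made up for illustration:

```python
from urllib.parse import urlparse

def is_internal_link(root_url: str, link: str) -> bool:
    """Return True if `link` points to the same domain as `root_url`."""
    root_netloc = urlparse(root_url).netloc
    link_netloc = urlparse(link).netloc
    # Relative links such as "/about" have an empty netloc and count as internal.
    return link_netloc in ("", root_netloc)

print(is_internal_link("https://github.com", "https://github.com/features"))  # True
print(is_internal_link("https://github.com", "https://example.com"))          # False
```

Note that a strict netloc comparison treats subdomains like gist.github.com as external to github.com; whether subdomains should count as internal is a design choice worth settling in this issue.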

Mews (Collaborator) commented Jun 15, 2024

I have a question: isn't this already achievable through max_links=0 in the Spider class?
And if not, does this mean adding an argument to Spider.__init__ which, when set to true, makes it crawl only the root website?

indrajithi (Collaborator, Author) commented

If we set max_links=0 it will crawl only the root_url once.

Say, for example, we pass the root_url as https://github.com. It will crawl only this page and fetch all the links on it. It will not crawl https://github.com/indrajithi/tiny-web-crawler and fetch the links on that page. max_links is the number of urls/links crawled.

What we want to achieve in this issue is that it should only crawl internal links, i.e. every link that has https://github.com/ in it, and not crawl external links.

This will be useful for creating a sitemap for a website. LMK if you have any more questions.
@Mews
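To make the desired behavior concrete, here is a rough sketch of a crawl loop that only follows internal links. The crawl_internal_only and fetch_links names and the toy link graph are made up for illustration; this is not the Spider implementation:

```python
from urllib.parse import urlparse

def crawl_internal_only(root_url, fetch_links, max_links=10):
    """Breadth-first crawl that skips any link pointing outside the root domain."""
    root_netloc = urlparse(root_url).netloc
    seen, queue = {root_url}, [root_url]
    while queue and len(seen) < max_links:
        page = queue.pop(0)
        for link in fetch_links(page):  # fetch_links stands in for the real fetch-and-parse step
            netloc = urlparse(link).netloc
            if link not in seen and netloc in ("", root_netloc):
                seen.add(link)
                queue.append(link)
    return seen

# Toy link graph standing in for real HTTP fetches.
graph = {
    "https://github.com": ["https://github.com/about", "https://example.com"],
    "https://github.com/about": [],
}
print(crawl_internal_only("https://github.com", lambda url: graph.get(url, [])))
# -> {'https://github.com', 'https://github.com/about'} (external link skipped)
```

A real implementation would also need to resolve relative links against the current page (e.g. with urllib.parse.urljoin) before comparing domains.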

Mews (Collaborator) commented Jun 15, 2024

Alright, makes sense. What should I call the argument then, something like crawl_external_links? And the default would be true?
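If the argument were added under that name, usage might look like this. The crawl_external_links name is only the proposal above, and the import path and spider.start() call are assumptions for illustration, not confirmed by this thread:

```python
from tiny_web_crawler import Spider  # import path assumed for illustration

# Crawl https://github.com without following links to other domains.
spider = Spider(root_url="https://github.com", crawl_external_links=False)
spider.start()
```

Defaulting the flag to true would keep the current crawl-everything behavior for existing users.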

Mews (Collaborator) commented Jun 15, 2024

Oh wait, there's already a PR open for this.

indrajithi (Collaborator, Author) commented

> Oh wait, there's already a PR open for this.

Would you like to pick this up? This is very similar to what we discussed.

Mews (Collaborator) commented Jun 16, 2024

Sure!

indrajithi (Collaborator, Author) commented

@devavinothm Are you working on this? #14

Mews (Collaborator) commented Jun 17, 2024

@indrajithi I can complete his PR if you want.

indrajithi (Collaborator, Author) commented

@Mews I have updated the description. Assigning to you. 🥇

Mews (Collaborator) commented Jun 17, 2024

Thanks, I'm going to sleep right now but I'll get to it tomorrow morning :)
