WEB CRAWLER

What is it?

A simple web crawler which, given a URL to crawl, outputs:

  • a site map showing the static assets for each page.

It is limited to one domain - so when crawling google.com it would:

  • crawl all pages within the google.com domain,
  • but not follow external links (see the sketch below).
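
As an illustration of the same-domain rule, the helper below is a hypothetical sketch (not the code in this repository) showing how a link found on a page could be classified as internal or external using the standard urlparse module, targeting the required Python 2.7:

# Hypothetical helper, for illustration only: decide whether a link
# found on a page stays within the domain of the start URL.
from urlparse import urlparse, urljoin

def is_internal(start_url, link):
    base = urlparse(start_url).netloc
    # Resolve relative links such as '/about' against the page URL.
    absolute = urljoin(start_url, link)
    return urlparse(absolute).netloc == base

print is_internal('http://www.google.com', '/intl/en/policies/')   # True
print is_internal('http://www.google.com', 'http://example.org/')  # False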

Requirements?

In order to run this task you should have the following installed:

  • Python 2.7

Design Decisions

Here is a description of the files and their purpose:

web_crawler
├── crawler.py    - contains the argument parsing and thread management logic for the crawler
├── models.py     - defines the classes that represent the site map, its nodes, the browser and others
├── README.md     - instructions about the task and how to run it
└── result.json   - where the site map result is written by default (the current content of the file is a crawl of the google.com domain with a 100-page limit)

models.py contains three important class definitions (a rough sketch follows the list):

  • Browser - impersonates a browser and performs the actions needed to fetch the web pages
  • Node - represents one page of the site map; it holds the location, the HTTP status and the assets the page contains
  • SiteMap - groups the nodes that represent the pages within a domain; it decides which nodes belong on the site map
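
The sketch below is an assumption about how these classes could look, based only on the descriptions above; the actual definitions in models.py may differ (Python 2.7):

# Illustrative skeletons only - the real models.py may be organised differently.
import urllib2

class Browser(object):
    """Impersonates a browser and fetches pages."""
    USER_AGENT = 'Mozilla/5.0 (compatible; web_crawler)'

    def get(self, url, timeout=10):
        request = urllib2.Request(url, headers={'User-Agent': self.USER_AGENT})
        response = urllib2.urlopen(request, timeout=timeout)
        return response.getcode(), response.read()

class Node(object):
    """One page of the site map: location, HTTP status and assets."""
    def __init__(self, location, status=None, assets=None):
        self.location = location
        self.status = status
        self.assets = assets or []   # e.g. images, scripts, stylesheets

class SiteMap(object):
    """Groups the nodes that represent the pages within one domain."""
    def __init__(self, domain):
        self.domain = domain
        self.nodes = {}              # location -> Node, avoids duplicates

    def add(self, node):
        if node.location not in self.nodes:
            self.nodes[node.location] = node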

crawler.py contains the thread management logic. A multithreaded approach was chosen for performance.
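
A minimal sketch of the kind of worker-thread pool crawler.py could use is shown below; the worker body and queue contents are assumptions for illustration, not the repository's actual implementation (Python 2.7 standard library only):

# Illustrative only: fetch a batch of URLs with a pool of worker threads.
import threading
import urllib2
from Queue import Queue

url_queue = Queue()
results = {}                         # url -> HTTP status code (or None on error)

def worker():
    while True:
        url = url_queue.get()
        try:
            results[url] = urllib2.urlopen(url, timeout=10).getcode()
            # a real crawler would also parse the page and queue new internal links
        except Exception:
            results[url] = None      # e.g. timeouts or redirect errors
        finally:
            url_queue.task_done()

def crawl(urls, num_threads=8):
    for _ in range(num_threads):
        thread = threading.Thread(target=worker)
        thread.daemon = True         # let the process exit once the queue drains
        thread.start()
    for url in urls:
        url_queue.put(url)
    url_queue.join()                 # block until every queued URL is handled
    return results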

How to run?

To run the crawler, use the following command from this folder: $ python crawler.py -u <DOMAIN> -f <FILE_OUTPUT> -l <LIMIT_PAGES> -s <SIMPLE_OUTPUT> -d <DEBUG>

<DOMAIN> - The site domain you want to crawl. Default: 'www.google.com'

<FILE_OUTPUT> - File to output the resulting site map to. Default: 'result.json'

<LIMIT_PAGES> - Limit on how many pages should be crawled. Default: 100. 0 means no limit

<SIMPLE_OUTPUT> - Defines the indentation used when the output is written to the file. If True (default), prints one node per line. If False, prints each node with indentation

<DEBUG> - Run in debug mode (more information is printed). Default: False. When this option is enabled, some expected exceptions will appear, such as HTTP 301 redirects and timeouts.
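
For illustration, these flags could be wired up with argparse (part of the required Python 2.7 standard library); the defaults below mirror the descriptions above, but crawler.py's actual argument handling may differ:

# Illustration only - not necessarily how crawler.py parses its arguments.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='Simple single-domain web crawler')
    parser.add_argument('-u', default='www.google.com',
                        help='domain to crawl')
    parser.add_argument('-f', default='result.json',
                        help='file to write the resulting site map to')
    parser.add_argument('-l', type=int, default=100,
                        help='maximum number of pages to crawl (0 = no limit)')
    parser.add_argument('-s', default='True',
                        help='one node per line if True, indented output if False')
    parser.add_argument('-d', default='False',
                        help='debug mode: print extra information, including expected errors')
    return parser.parse_args()

if __name__ == '__main__':
    print parse_args()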

No arguments are mandatory, but the following example command is recommended:

$ python crawler.py -u www.google.com -l 0 -d True
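
The effect of the <SIMPLE_OUTPUT> switch described above could be achieved with the standard json module, as sketched below; the node layout shown is made up for illustration and is not necessarily the format of result.json:

# Illustration only - the real layout of result.json may differ.
import json

nodes = [{'location': 'http://www.google.com/', 'status': 200,
          'assets': ['/static/logo.png']}]           # hypothetical node

with open('result.json', 'w') as handle:
    simple_output = True
    if simple_output:
        for node in nodes:                           # one node per line
            handle.write(json.dumps(node) + '\n')
    else:
        json.dump(nodes, handle, indent=4)           # indented site map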

Done by

This was developed by Tiago Pombeiro.
