Web-Crawler

A web crawler project that navigates the web and indexes pages. The application uses Jsoup (a Java HTML parsing library) and Maven for build and dependency management. For testing, the app crawls the Amazon Jobs website to a link depth of 2, extracts the Leadership Principles, links, and title, and saves them to a file.

First, pick a URL from the frontier.
Fetch the HTML of that URL.
Parse the HTML to extract the links to other URLs.
Check whether the URL has already been crawled and whether the same content has been seen before. If neither is the case, add the page to the index.
For each extracted URL, verify that the site allows it to be crawled (robots.txt, crawl frequency).
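
The steps above can be sketched as a small breadth-first loop. This is a minimal, dependency-free illustration and not the project's actual code: the names `CrawlSketch`, `crawl`, `fetcher`, and `allowed` are hypothetical, a naive regex stands in for Jsoup's HTML parsing, exact string equality stands in for a content fingerprint, and the `allowed` predicate stands in for the robots.txt and crawl-frequency checks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CrawlSketch {
    // Naive href matcher, for illustration only (assumption: double-quoted absolute URLs).
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // A frontier entry: a URL plus its link depth from the seed.
    private static final class Entry {
        final String url;
        final int depth;
        Entry(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // Step 3: get the links to other URLs by scanning the HTML.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the index as URL -> HTML, in the order pages were indexed.
    // The seed is assumed to be permitted; only extracted links are checked.
    static Map<String, String> crawl(String seed, int maxDepth,
                                     Function<String, String> fetcher,
                                     Predicate<String> allowed) {
        Map<String, String> index = new LinkedHashMap<>();
        Set<String> seenUrls = new HashSet<>();
        Set<String> seenContent = new HashSet<>();
        Deque<Entry> frontier = new ArrayDeque<>();
        frontier.add(new Entry(seed, 0));
        seenUrls.add(seed);

        while (!frontier.isEmpty()) {
            Entry current = frontier.poll();              // Step 1: pick a URL from the frontier
            String html = fetcher.apply(current.url);     // Step 2: fetch its HTML
            if (html == null) {
                continue;                                 // fetch failed or page unknown
            }
            if (!seenContent.add(html)) {
                continue;                                 // Step 4: same content seen before
            }
            index.put(current.url, html);                 // Step 4: new URL and new content
            if (current.depth >= maxDepth) {
                continue;                                 // do not follow links past the depth limit
            }
            for (String link : extractLinks(html)) {
                // Step 5: only enqueue links that may be crawled and were not seen before.
                if (allowed.test(link) && seenUrls.add(link)) {
                    frontier.add(new Entry(link, current.depth + 1));
                }
            }
        }
        return index;
    }
}
```

In this sketch, URL de-duplication happens when a link is enqueued and content de-duplication happens when a page is fetched, so each page is fetched at most once. A call such as `CrawlSketch.crawl(seedUrl, 2, fetcher, allowed)` would mirror the depth-2 test crawl described above, with `fetcher` wrapping whatever HTTP or Jsoup call retrieves the page.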

KELVI23/Java-Web-Crawler
