
Web Crawler Application

This application is used to crawl URLs from a given link. It crawls only URLs belonging to the same domain as the root link; e.g., https://www.testdomain.com would only visit URLs in that domain. The maximum number of requests it can make to a website during the crawl process is configurable.
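
Conceptually, the same-domain restriction comes down to comparing the host of each discovered URL against the host of the root link. The sketch below illustrates such a check using java.net.URI; it is illustrative only, not the project's actual matching code:

    import java.net.URI;

    public class SameDomainCheck {

        // True when the candidate URL's host matches the root link's host.
        // Illustrative only; the project's real matching logic may differ.
        static boolean sameDomain(String rootUrl, String candidateUrl) {
            String rootHost = URI.create(rootUrl).getHost();
            String candidateHost = URI.create(candidateUrl).getHost();
            return rootHost != null && rootHost.equalsIgnoreCase(candidateHost);
        }

        public static void main(String[] args) {
            String root = "https://www.testdomain.com";
            System.out.println(sameDomain(root, "https://www.testdomain.com/about")); // true
            System.out.println(sameDomain(root, "https://www.external.com/page"));    // false
        }
    }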

High-level design

(High-level design diagram: web-crawler.png)

Built with

  • Java 17 / Spring Boot
  • Maven
  • Cassandra
  • Redis
  • Docker
  • Swagger UI

Getting started

Starting up the project

First, unzip the project archive webcrawler.zip.

Starting up the project using Docker

Prerequisites

Docker - This project runs on Docker, which makes setup faster and means you don't have to download individual components onto your system. Download Docker Desktop here.
OR, if using Homebrew on a Mac, you can also install Docker by running brew install docker in the terminal.

After installing Docker:

  • Start up Docker Desktop
  • With your terminal tool, navigate to the project folder (in this case webcrawler) and run the following commands in order:
  1. docker compose build
  2. docker compose up
  • Make sure you run the docker compose commands from within the project folder.

This will download all the necessary dependencies within Docker and start up the project. The application.properties file has already been pre-configured with the Docker details, so it's easy to start up with Docker. You can then access the Swagger doc at localhost:8080/webcrawler/webcrawler-api-doc.html.

NOTE: While starting up the project with Docker, you will see errors like Cassandra is unavailable: DO NOT PANIC, CASSANDRA IS STILL STARTING UP. Webcrawler will start up as soon as cassandra is up - will retry, along with errors similar to the one in webcrawler_cassandra_error.png. This happens because the Cassandra DB is still starting up, even though we specified that the web crawler depends on Cassandra. Please wait a little while until Cassandra has finished starting up; you do not have to do anything, as the webcrawler container restarts automatically once Cassandra is up.

When the Spring Boot app starts up successfully, it shows the message in img.png.

If you wish to start up the project manually

Prerequisites

To set up the project we'll need the following. If you are using Homebrew on a MacBook, you can also install these components using Homebrew:

IDE - It is best to use a Java IDE such as IntelliJ (download IntelliJ here) or Eclipse (download Eclipse here), OR with Homebrew: brew install --cask intellij-idea

JDK 17 - This project runs on JDK 17, which can be downloaded here: JDK 17, OR with Homebrew: brew install openjdk@17

Maven - This is a Maven project and requires Maven. Some IDEs come with Maven, but if you need to install it yourself, you can get it here and install it following the Maven installation guide, or use Homebrew: brew install maven

Cassandra - Cassandra DB download and installation instructions can be found here, OR with Homebrew: brew install cassandra

Redis - Redis download and installation instructions can be found here, OR with Homebrew: brew install redis

After installing Cassandra and Redis, start them up and configure their connection properties under their respective keys in the application.properties file.
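
For reference, a minimal sketch of what those entries might look like, assuming common Spring Boot 3 property names and local defaults (the exact keys depend on this project's application.properties; on Spring Boot 2 the prefixes are spring.data.cassandra.* and spring.redis.* instead):

    # Hypothetical example - check application.properties for the actual keys.
    # Cassandra (local default port 9042)
    spring.cassandra.contact-points=127.0.0.1
    spring.cassandra.port=9042
    spring.cassandra.local-datacenter=datacenter1

    # Redis (local default port 6379)
    spring.data.redis.host=localhost
    spring.data.redis.port=6379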

After adding the configurations, open the project with IntelliJ and run mvn clean package. After that, run the WebcrawlerApplication.java class; it will start up, as it is a Spring Boot project. You can then access the project at localhost:8080/webcrawler/webcrawler-api-doc.html

API Documentation and Usage

After starting up the project with the steps above, it is ready to be tested.

This project uses Swagger UI for API documentation. You can access the project's Swagger UI at localhost:8080/webcrawler/webcrawler-api-doc.html. With the Swagger UI you will be able to see the available endpoints and how to use them, and you can also send requests by clicking the Try it out button. Alternatively, you can make requests with Postman; you can download Postman here if you need to.

Ensure you provide a valid URL.

In-App Configs

webcrawler-max-request is used to configure the maximum number of requests that can be made to a particular domain. Configure it in application.properties.
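
A minimal sketch of that entry (the key name comes from this README; the value is only an example):

    # Hypothetical value - the maximum number of requests per crawled domain.
    webcrawler-max-request=50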

Running the tests

This project has unit tests and integration tests. You can run the tests manually from the individual test classes.

unit tests

CoreWebCrawlerTest.java
UrlExtractorTest.java

integration tests

DockerIntegrationTest.java - runs with docker
LocalIntegrationTest.java - runs locally without docker

The tests also run while building the project using the mvn clean package command.
For a project that uses Docker, I need to test the behaviour on Docker as well, so DockerIntegrationTest.java needs Docker to be up and running before it can run successfully. You can use the @EnabledIf or @Testcontainers(disabledWithoutDocker = true) annotation to disable it if you do not have Docker.

If you do not have Docker, the integration tests run via LocalIntegrationTest.java, which has all of its components' functionality mocked using Mockito. Also disable the Docker integration test while running the project with Docker because, although it is possible to run Docker inside Docker, it might not be the best idea.
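
For illustration, here is a minimal sketch of how such a Docker guard can look with JUnit 5 and Testcontainers; the class name and the Redis container are hypothetical stand-ins, not the project's actual test code:

    import static org.junit.jupiter.api.Assertions.assertTrue;

    import org.junit.jupiter.api.Test;
    import org.testcontainers.containers.GenericContainer;
    import org.testcontainers.junit.jupiter.Container;
    import org.testcontainers.junit.jupiter.Testcontainers;

    // Skipped automatically when no Docker daemon is available.
    @Testcontainers(disabledWithoutDocker = true)
    class DockerGuardedIntegrationTest {

        // Hypothetical container; the real test would start whatever services it needs.
        @Container
        static final GenericContainer<?> redis =
                new GenericContainer<>("redis:7-alpine").withExposedPorts(6379);

        @Test
        void containerStarts() {
            // With Docker present, the container is started before this test runs.
            assertTrue(redis.isRunning());
        }
    }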

Building after making changes to the code

After making changes to the code, if you need to re-run with Docker, please run docker compose build before docker compose up. This ensures the project image is rebuilt with your changes.

Improvement - Improving the webcrawler to run in parallel

  • CoreWebCrawler.java was designed to run well in parallel, as each CoreWebCrawler has its own internal WebCrawlerComponents. So all we need to do is invoke CoreWebCrawler asynchronously, as sketched below.
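
A minimal sketch of that asynchronous invocation using CompletableFuture; the CoreWebCrawler constructor and crawl method shown here are assumptions about the class's shape, not its confirmed API:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelCrawlLauncher {

        public static void main(String[] args) {
            List<String> roots = List.of(
                    "https://www.testdomain.com",
                    "https://www.otherdomain.com");

            ExecutorService pool = Executors.newFixedThreadPool(roots.size());

            // One independent crawler per root URL; each instance owns its own
            // internal components, so the crawls can run side by side safely.
            List<CompletableFuture<Void>> crawls = roots.stream()
                    .map(root -> CompletableFuture.runAsync(
                            () -> new CoreWebCrawler().crawl(root), pool)) // hypothetical API
                    .toList();

            // Wait for every crawl to finish before shutting the pool down.
            CompletableFuture.allOf(crawls.toArray(CompletableFuture[]::new)).join();
            pool.shutdown();
        }
    }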
