
Web Crawler Application

This application is used to crawl URLs from a given link. It crawls only URLs belonging to the same domain as the root link; e.g., https://www.testdomain.com would only visit URLs in that domain. The maximum number of requests it can make to a website during the crawl process is configurable.
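
Conceptually, the same-domain restriction comes down to comparing the host of each discovered URL against the host of the root link. The sketch below illustrates such a check using java.net.URI; it is illustrative only, not the project's actual matching code:

    import java.net.URI;

    public class SameDomainCheck {

        // True when the candidate URL's host matches the root link's host.
        // Illustrative only; the project's real matching logic may differ.
        static boolean sameDomain(String rootUrl, String candidateUrl) {
            String rootHost = URI.create(rootUrl).getHost();
            String candidateHost = URI.create(candidateUrl).getHost();
            return rootHost != null && rootHost.equalsIgnoreCase(candidateHost);
        }

        public static void main(String[] args) {
            String root = "https://www.testdomain.com";
            System.out.println(sameDomain(root, "https://www.testdomain.com/about")); // true
            System.out.println(sameDomain(root, "https://www.external.com/page"));    // false
        }
    }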

High-level design

(High-level design diagram: web-crawler.png)

Built with

  • Java 17 / Spring Boot
  • Maven
  • Cassandra
  • Redis
  • Docker
  • Swagger UI

Getting started

Starting up the project

First, unzip the project archive webcrawler.zip.

Starting up the project using Docker

Prerequisites

Docker - This project runs on Docker, which makes setup faster and means you don't have to download individual components onto your system. Download Docker Desktop here.
OR, if using Homebrew on a Mac, you can also install Docker by running brew install docker in the terminal.

After installing Docker:

  • Start up Docker Desktop
  • With your terminal tool, navigate to the project folder (in this case webcrawler) and run the following commands in order:
  1. docker compose build
  2. docker compose up
  • Make sure you run the docker compose commands from within the project folder.

This will download all the necessary dependencies within Docker and start up the project. The application.properties file has already been pre-configured with the Docker details, so it's easy to start up with Docker. You can then access the Swagger doc at localhost:8080/webcrawler/webcrawler-api-doc.html.

NOTE: While starting up the project with Docker, you will see errors like Cassandra is unavailable: DO NOT PANIC, CASSANDRA IS STILL STARTING UP. Webcrawler will start up as soon as cassandra is up - will retry, along with errors similar to the one in webcrawler_cassandra_error.png. This happens because the Cassandra DB is still starting up, even though we specified that the web crawler depends on Cassandra. Please wait a little while until Cassandra has finished starting up; you do not have to do anything, as the webcrawler container restarts automatically once Cassandra is up.

When the Spring Boot app starts up successfully, it shows the message in img.png.

If you wish to start up the project manually

Prerequisites

To set up the project we'll need the following. If you are using Homebrew on a MacBook, you can also install these components using Homebrew:

IDE - It is best to use a Java IDE such as IntelliJ (download IntelliJ here) or Eclipse (download Eclipse here), OR with Homebrew: brew install --cask intellij-idea

JDK 17 - This project runs on JDK 17, which can be downloaded here: JDK 17, OR with Homebrew: brew install openjdk@17

Maven - This is a Maven project and requires Maven. Some IDEs come with Maven, but if you need to install it yourself, you can get it here and install it following the Maven installation guide, or use Homebrew: brew install maven

Cassandra - Cassandra DB download and installation instructions can be found here, OR with Homebrew: brew install cassandra

Redis - Redis download and installation instructions can be found here, OR with Homebrew: brew install redis

After installing Cassandra and Redis, start them up and configure their connection properties under their respective keys in the application.properties file.
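
For reference, a minimal sketch of what those entries might look like, assuming common Spring Boot 3 property names and local defaults (the exact keys depend on this project's application.properties; on Spring Boot 2 the prefixes are spring.data.cassandra.* and spring.redis.* instead):

    # Hypothetical example - check application.properties for the actual keys.
    # Cassandra (local default port 9042)
    spring.cassandra.contact-points=127.0.0.1
    spring.cassandra.port=9042
    spring.cassandra.local-datacenter=datacenter1

    # Redis (local default port 6379)
    spring.data.redis.host=localhost
    spring.data.redis.port=6379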

After adding the configurations, open the project with IntelliJ and run mvn clean package. After that, run the WebcrawlerApplication.java class; it will start up, as it is a Spring Boot project. You can then access the project at localhost:8080/webcrawler/webcrawler-api-doc.html

API Documentation and Usage

After starting up the project with the steps above, it is ready to be tested.

This project uses Swagger UI for API documentation. You can access the project's Swagger UI at localhost:8080/webcrawler/webcrawler-api-doc.html. With the Swagger UI you will be able to see the available endpoints and how to use them, and you can also send requests by clicking the Try it out button. Alternatively, you can make requests with Postman; you can download Postman here if you need to.

Ensure you provide a valid URL.

In-App Configs

webcrawler-max-request is used to configure the maximum number of requests that can be made to a particular domain. Configure it in application.properties.
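
A minimal sketch of that entry (the key name comes from this README; the value is only an example):

    # Hypothetical value - the maximum number of requests per crawled domain.
    webcrawler-max-request=50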

Running the tests

This project has unit tests and integration tests. You can run the tests manually from the individual test classes.

unit tests

CoreWebCrawlerTest.java
UrlExtractorTest.java

integration tests

DockerIntegrationTest.java - runs with docker
LocalIntegrationTest.java - runs locally without docker

The tests also run while building the project using the mvn clean package command.
For a project that uses Docker, I need to test the behaviour on Docker as well, so DockerIntegrationTest.java needs Docker to be up and running before it can run successfully. You can use the @EnabledIf or @Testcontainers(disabledWithoutDocker = true) annotation to disable it if you do not have Docker.

If you do not have Docker, the integration tests run via LocalIntegrationTest.java, which has all of its components' functionality mocked using Mockito. Also disable the Docker integration test while running the project with Docker because, although it is possible to run Docker inside Docker, it might not be the best idea.
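
For illustration, here is a minimal sketch of how such a Docker guard can look with JUnit 5 and Testcontainers; the class name and the Redis container are hypothetical stand-ins, not the project's actual test code:

    import static org.junit.jupiter.api.Assertions.assertTrue;

    import org.junit.jupiter.api.Test;
    import org.testcontainers.containers.GenericContainer;
    import org.testcontainers.junit.jupiter.Container;
    import org.testcontainers.junit.jupiter.Testcontainers;

    // Skipped automatically when no Docker daemon is available.
    @Testcontainers(disabledWithoutDocker = true)
    class DockerGuardedIntegrationTest {

        // Hypothetical container; the real test would start whatever services it needs.
        @Container
        static final GenericContainer<?> redis =
                new GenericContainer<>("redis:7-alpine").withExposedPorts(6379);

        @Test
        void containerStarts() {
            // With Docker present, the container is started before this test runs.
            assertTrue(redis.isRunning());
        }
    }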

Building after making changes to the code

After making changes to the code, if you need to re-run with Docker, please run docker compose build before docker compose up. This ensures the project image is rebuilt with your changes.

Improvement - Improving the webcrawler to run in parallel

  • CoreWebCrawler.java was designed to run well in parallel, as each CoreWebCrawler has its own internal WebCrawlerComponents. So all we need to do is invoke CoreWebCrawler asynchronously, as sketched below.
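
A minimal sketch of that asynchronous invocation using CompletableFuture; the CoreWebCrawler constructor and crawl method shown here are assumptions about the class's shape, not its confirmed API:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelCrawlLauncher {

        public static void main(String[] args) {
            List<String> roots = List.of(
                    "https://www.testdomain.com",
                    "https://www.otherdomain.com");

            ExecutorService pool = Executors.newFixedThreadPool(roots.size());

            // One independent crawler per root URL; each instance owns its own
            // internal components, so the crawls can run side by side safely.
            List<CompletableFuture<Void>> crawls = roots.stream()
                    .map(root -> CompletableFuture.runAsync(
                            () -> new CoreWebCrawler().crawl(root), pool)) // hypothetical API
                    .toList();

            // Wait for every crawl to finish before shutting the pool down.
            CompletableFuture.allOf(crawls.toArray(CompletableFuture[]::new)).join();
            pool.shutdown();
        }
    }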
