NATS Based User Rating Web Crawler

The goal of this project is to implement a proof of concept for a system that collects rating information from the Roku website using microservices and NATS.

Solutions

Two solutions were implemented: one uses chromedp to perform the crawling, and the other uses an API call. Both solutions share the same architecture; the only difference is the use of a chromedp container, as shown in the diagrams below.

The Reader Client parses a CSV file with URLs and sends them to the system using gRPC calls. The Request Listener receives the gRPC requests and puts them into a NATS queue. The main reason to use a queue is that the crawling process takes several seconds, so we need a way to keep a list of URLs that should be processed. Each URL is associated with a request-id generated by the Reader Client, so it is possible to trace a request to its final result.
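
As a rough illustration, the listener-side enqueue could look like the sketch below. The subject name and message shape are assumptions made for the example, not the repository's actual code:

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// crawlRequest pairs a URL with the request-id generated by the Reader Client
// so the final result can be traced back to the original request.
type crawlRequest struct {
	RequestID string `json:"request_id"`
	URL       string `json:"url"`
}

// enqueue publishes a crawl request to the NATS subject the crawlers consume.
// The subject name "crawler.requests" is illustrative.
func enqueue(nc *nats.Conn, requestID, url string) error {
	payload, err := json.Marshal(crawlRequest{RequestID: requestID, URL: url})
	if err != nil {
		return err
	}
	return nc.Publish("crawler.requests", payload)
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Placeholder URL; the real ones come from the CSV file.
	if err := enqueue(nc, "req-123", "https://example.com/some-channel"); err != nil {
		log.Fatal(err)
	}
}
```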

On the other hand, the use of a queue allows the system to scale by having multiple Web Crawlers (or API Crawlers) collecting the ratings and storing them in a Postgres DB.
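
A minimal sketch of that scaling mechanism, assuming the same subject and message shape as above: every crawler instance joins the same NATS queue group, so each URL is delivered to exactly one instance, and adding capacity is just a matter of starting more crawlers.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

type crawlRequest struct {
	RequestID string `json:"request_id"`
	URL       string `json:"url"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// All crawler instances subscribe with the same queue group name
	// ("crawlers" here), so NATS load-balances each message to exactly
	// one of them.
	_, err = nc.QueueSubscribe("crawler.requests", "crawlers", func(msg *nats.Msg) {
		var req crawlRequest
		if err := json.Unmarshal(msg.Data, &req); err != nil {
			log.Printf("bad message: %v", err)
			return
		}
		log.Printf("crawling %s (request %s)", req.URL, req.RequestID)
		// ...crawl the URL and store the rating in Postgres...
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // keep the subscriber running
}
```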

The crawler microservice uses the Fan-In Fan-Out pattern to collect data from the Roku website.
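
The repository's actual implementation is not reproduced here, but in Go the pattern looks roughly like this: a pool of workers consumes URLs from one channel (fan-out) and a single channel gathers their results (fan-in). fetchRating is a stand-in for the real chromedp/API scraping code.

```go
package main

import (
	"fmt"
	"sync"
)

type rating struct {
	URL   string
	Stars float64
}

// fetchRating is a placeholder for the chromedp or API call.
func fetchRating(url string) rating {
	return rating{URL: url, Stars: 4.5}
}

func crawl(urls []string, workers int) []rating {
	jobs := make(chan string)
	results := make(chan rating)

	// Fan-out: N workers read from a single jobs channel.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				results <- fetchRating(url)
			}
		}()
	}

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed the jobs channel, then close it so workers can exit.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Fan-in: collect everything from the single results channel.
	var out []rating
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(crawl([]string{"https://example.com/a", "https://example.com/b"}, 4))
}
```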

Chromedp Solution Diagram

[architecture diagram]

API Solution Diagram

[architecture diagram]

Proof Of Concept Results

The results, comments, and conclusions of the PoC are documented here.

Running this project

Dependencies

  • Go 1.18
  • nats.go v1.13.1
  • grpc v1.43.0
  • protobuf v1.5.2
  • cobra v1.4.0
  • viper v1.10.1

Development environment

  • Go 1.18
  • Goland or other IDE
  • Docker

Microservices

There are three microservices that should be executed to run the system:

  • The first one is the reader, which parses the CSV and sends gRPC requests with the URLs
  • The second one is the listener, which runs the gRPC server, listening for the URLs that should be crawled and sending them to the NATS queue
  • The third one is the crawler, which collects the ratings using chromedp or an API call (see the sketch below)
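
For the chromedp variant, the extraction step could look roughly like the following. The URL and CSS selector are placeholders, since the real ones depend on the Roku page markup:

```go
package main

import (
	"context"
	"log"

	"github.com/chromedp/chromedp"
)

func main() {
	// Create a headless browser context.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Navigate to a product page and read the text of a rating element.
	// Both the URL and the selector below are illustrative placeholders.
	var ratingText string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/some-channel"),
		chromedp.Text(`[data-testid="star-rating"]`, &ratingText, chromedp.ByQuery),
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("rating:", ratingText)
}
```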

The gRPC services and Protocol Buffers are defined at grpcapi/pb
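
As a purely illustrative sketch of the Reader-Client side, a call might look like this. CrawlerClient, SendURL, URLRequest, and the import path are invented names; check the generated code in grpcapi/pb for the real ones:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Hypothetical import path for the generated stubs in grpcapi/pb.
	pb "github.com/StevenRojas/natscrawler/grpcapi/pb"
)

func main() {
	// Connect to the listener's gRPC server (address is illustrative).
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Send one URL together with its request-id. These names are invented
	// for the sketch; the generated code defines the real ones.
	client := pb.NewCrawlerClient(conn)
	_, err = client.SendURL(ctx, &pb.URLRequest{
		RequestId: "req-123",
		Url:       "https://example.com/some-channel",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```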

Run microservices

The application CLI is implemented with the Cobra and Viper libraries, so it is possible to override the configuration with flags and environment variables. The precedence when resolving a configuration value is: flag -> environment variable -> configuration file value
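
A minimal sketch of that precedence wiring, assuming illustrative flag, environment variable, and key names (each command defines its own):

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
	"github.com/spf13/viper"
)

func main() {
	cmd := &cobra.Command{
		Use: "crawler",
		Run: func(cmd *cobra.Command, args []string) {
			// Viper resolves in order: flag, then env var, then config file.
			fmt.Println("nats url:", viper.GetString("nats.url"))
		},
	}
	cmd.Flags().String("nats-url", "", "NATS server URL")

	// Lowest precedence: the configuration file.
	viper.SetConfigFile("config/local.yaml")
	_ = viper.ReadInConfig()

	// Middle precedence: an environment variable (name is illustrative).
	_ = viper.BindEnv("nats.url", "CRAWLER_NATS_URL")

	// Highest precedence: the command-line flag.
	_ = viper.BindPFlag("nats.url", cmd.Flags().Lookup("nats-url"))

	if err := cmd.Execute(); err != nil {
		fmt.Println(err)
	}
}
```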

Here are some examples of how to run the application:

  • ./reader -f csv/target_urls_test.csv runs the reader microservice
  • ./listener runs the listener microservice
  • ./crawler runs the crawler microservice

To run with Docker, use docker-compose build and then docker-compose up --scale crawler=10

Configuration

Each microservice has its configuration file at config/local.yaml, already set up to be used with Docker. To use the API option in the crawler microservice, set the crawler.use_api flag to true
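
For instance, the crawler's choice between the two modes presumably reduces to something like this (illustrative; the real selection logic lives in the crawler service):

```go
package main

import (
	"log"

	"github.com/spf13/viper"
)

func main() {
	viper.SetConfigFile("config/local.yaml")
	if err := viper.ReadInConfig(); err != nil {
		log.Fatal(err)
	}

	// Branch on the crawler.use_api flag from the configuration file.
	if viper.GetBool("crawler.use_api") {
		log.Println("collecting ratings via the API")
	} else {
		log.Println("collecting ratings via chromedp")
	}
}
```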

TODOs

  • Check chromedp image configuration to improve performance
  • Add unit tests, especially mocks for DB, NATS and Browser
