This is an API service which provides web screenshots of any website.
To set up the whole stack using docker-compose,
just run the following command in the root of the project:
docker-compose up
The server will start listening on port 80 of the host machine.
To create screenshots, send a POST request to the api/new
endpoint with your URLs separated by semicolons (URL-encoded as %3B), like this:
curl -X POST -d 'urls=https://google.com%3Bhttps://github.com%3Bhttps://stackoverflow.com' http://localhost/api/new
The response would be something like this:
{
"results": [
"http://localhost/api/result/aHR0cHM6Ly9nb29nbGUuY29t",
"http://localhost/api/result/aHR0cHM6Ly9naXRodWIuY29t",
"http://localhost/api/result/aHR0cHM6Ly9zdGFja292ZXJmbG93LmNvbQ=="
]
}
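The same request body can be built programmatically instead of by hand. The following is a minimal Python sketch; `build_form_body` and `decode_result_id` are hypothetical helper names, not part of this project. It also relies on an observation from the sample response above rather than a documented guarantee: the last path segment of each result link appears to be the Base64 encoding of the submitted URL.

```python
import base64
from urllib.parse import parse_qs, urlencode


def build_form_body(urls):
    """Join the URLs with ';' and form-encode them under the 'urls' field."""
    # urlencode percent-encodes ';' as %3B (and also ':' and '/'),
    # which a form-decoding server turns back into the same value.
    return urlencode({"urls": ";".join(urls)}).encode("ascii")


def decode_result_id(result_url):
    """Recover the original URL from a result link.

    Assumption: the ID is standard Base64 of the URL, as the sample
    response suggests. The service may not guarantee this.
    """
    result_id = result_url.rstrip("/").rsplit("/", 1)[-1]
    return base64.b64decode(result_id).decode("utf-8")


if __name__ == "__main__":
    body = build_form_body(["https://google.com", "https://github.com"])
    print(parse_qs(body.decode("ascii"))["urls"][0])
    print(decode_result_id(
        "http://localhost/api/result/aHR0cHM6Ly9nb29nbGUuY29t"))
```

The body returned by `build_form_body` can be sent with any HTTP client as an `application/x-www-form-urlencoded` POST.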
Each URL is for retrieving the status of the corresponding screenshot. The API server may return one of the following HTTP status codes, depending on the status of the request.
200 OK
- The request is successfully finished and the resulting image url is returned
204 No Content
- The request has not yet been processed
404 Not Found
- The requested url is not found in our database
500 Internal Server Error
- An unexpected error occurred, you may try again
503 Service Unavailable
- The request has failed permanently for some reason, you may check the logs
422 Unprocessable Entity
- The supplied request is invalid
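A client can turn the status codes above into a simple polling loop. This is a rough Python sketch, not code from this project: the state names, the `poll` and `interpret` helpers, and the retry count and delay are all assumptions chosen for illustration. It retries only on 204 (not processed yet) and 500 (described above as worth trying again).

```python
import time
import urllib.error
import urllib.request

# Status code -> (state name, whether polling again makes sense).
# The state names are illustrative; only the codes come from the list above.
OUTCOMES = {
    200: ("done", False),
    204: ("pending", True),
    404: ("not_found", False),
    422: ("invalid", False),
    500: ("error", True),
    503: ("failed", False),
}


def interpret(code):
    """Map an HTTP status code to a (state, retry) pair."""
    return OUTCOMES.get(code, ("unknown", False))


def fetch_status(result_url):
    """Return the HTTP status code for a result link."""
    try:
        with urllib.request.urlopen(result_url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code


def poll(result_url, attempts=10, delay=2.0, fetch=fetch_status):
    """Poll a result link until it reaches a final state or attempts run out."""
    state = "unknown"
    for attempt in range(attempts):
        state, retry = interpret(fetch(result_url))
        if not retry:
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return state
```

Calling `poll("http://localhost/api/result/aHR0cHM6Ly9nb29nbGUuY29t")` against a running stack would return one of the state names; the `fetch` parameter exists so the loop can be exercised without a server.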
There are two types of tests in this project.
To run unit tests, no setup is required and you just need the helper make command:
make test
To run integration tests, a running RabbitMQ server and a MongoDB server are required. To run them without hassle, just issue the helper command:
make test-integration
This project is made of 5 main components. Since none of them depends on another directly and they depend solely on interfaces, each can be swapped out for a different implementation without affecting the other components.
A brief explanation of each component is provided below:
The API component is the outermost layer, responsible for receiving screenshot requests from the user and then displaying the response. It receives requests and, after validation, passes them to the next layer.
Dispatcher acts as a kind of coordinator. It receives requests from the upper layer (api), dispatches jobs to the job queue, and then receives the results back from it. It also stores requests using the Datastore interface.
The Datastore package provides persistence capabilities for the service. Dispatcher uses it to store requests and their statuses.
I've used MongoDB here to avoid the hassle of SQL queries, but it is easy enough to replace it with any other database implementation.
JobQ does what its name implies. It queues screenshot jobs and processes them sequentially to avoid overloading the system. It uses the webCapture interface to process jobs.
Screenshot is where everything happens. This is the innermost layer which is unaware of any other layers. I have used PhantomJS as an implementation of the Capture interface, but it can also use any other tool such as Chromium Headless or even a third-party provider.
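To illustrate how the layers above can depend only on interfaces, here is a small Python sketch. It is not this project's code: `Capture`, `FakeCapture`, and `SequentialJobQueue` are hypothetical names, and an in-memory deque stands in for the real message queue. The point is only that the queue knows nothing about how a screenshot is produced, so the capture backend can be swapped freely.

```python
from collections import deque
from typing import Dict, Protocol


class Capture(Protocol):
    """Anything that can turn a URL into image bytes."""

    def capture(self, url: str) -> bytes:
        ...


class FakeCapture:
    """Stand-in backend; a real one would drive a headless browser."""

    def capture(self, url: str) -> bytes:
        return f"image-of:{url}".encode("utf-8")


class SequentialJobQueue:
    """Processes queued URLs one at a time through a Capture backend."""

    def __init__(self, backend: Capture) -> None:
        self._backend = backend
        self._jobs: deque = deque()

    def enqueue(self, url: str) -> None:
        self._jobs.append(url)

    def run(self) -> Dict[str, bytes]:
        results: Dict[str, bytes] = {}
        # One job at a time, in arrival order, so the backend is never
        # asked to do more than a single capture concurrently.
        while self._jobs:
            url = self._jobs.popleft()
            results[url] = self._backend.capture(url)
        return results
```

Replacing `FakeCapture` with another class that has the same `capture` method requires no change to `SequentialJobQueue`, which is the property the component descriptions above rely on.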
Since this service uses a message queue, it is inherently scalable and since it's already containerized, it can be deployed to the cloud without much work. To scale it out, just increase the number of pods (containers) and you're good to go.
At the last minute I thought it'd be a good idea to have a cloud native solution for file storage as well (the files were previously stored on a mounted volume), so I added Minio as the file storage service. Minio is compatible with Amazon's S3 protocol, so it's a cinch to replace it with Amazon S3.
Note: Having a filestore is not mandatory for the system to work and it works perfectly without it. That's why I did not bring it up in the Architecture section above.
There is a config option that specifies how many workers should be created on each instance. By setting the WORKERS_PER_INSTANCE environment variable to a number between 1 and 10, you can utilize more system resources and thereby increase performance.
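As an illustration of how such a setting might be read, here is a hedged Python sketch. Only the variable name and the 1 to 10 range come from the text above; the `workers_per_instance` helper, the default of 1, and the choice to clamp out-of-range values (rather than reject them) are assumptions, and the service itself may behave differently.

```python
import os

MIN_WORKERS, MAX_WORKERS = 1, 10


def workers_per_instance(env=None, default=1):
    """Read WORKERS_PER_INSTANCE and keep it within the 1..10 range."""
    env = os.environ if env is None else env
    raw = env.get("WORKERS_PER_INSTANCE")
    if raw is None or not raw.strip():
        return default  # assumption: fall back to a single worker
    try:
        value = int(raw)
    except ValueError:
        return default  # assumption: ignore non-numeric values
    # Assumption: clamp instead of failing on out-of-range values.
    return max(MIN_WORKERS, min(MAX_WORKERS, value))


if __name__ == "__main__":
    print(workers_per_instance())
```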