Web Automation Project

This application scrapes data from various sources (mostly healthcare facility websites in the US) and stores it in a database. We provide a searchable interface where users can search for facilities by state, city, or facility name.

If a queried listing is not found, the user can submit a request and the system will fetch the requested information.

Users can also view the details of each facility, and they can upload images of these facilities for the benefit of other users.

Table of Contents

  • Application
  • Technologies Used
  • Architecture Overview
  • API Specifications
  • Deployment

Application

Live Demo: Link

The backend server is turned off for now. You will need to clone the repo and deploy the server in order to preview it.

[Website screenshot]

Technologies Used:

Frontend

  • Vite
  • React
  • Material UI

Backend Server

  • Node.js
  • Express
  • Sequelize
  • MySQL

Crawler

  • Python 3
  • Selenium

Deployment

  • Frontend deployed on Vercel
  • AWS RDS is used for the MySQL database
  • AWS S3 is used for image storage
  • AWS Lightsail is used for server deployment

Architecture Overview

Full Architecture [diagram]

Crawler [diagram]

API Specifications

We divide the backend APIs into services that serve different purposes. Here are the services used by the frontend.

Routes

Cities: We use a third-party API, POST https://countriesnow.space/api/v0.1/countries/state/cities, which returns the list of cities in a given state.
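
For example, fetching the cities of a state might look like the sketch below (Node 18+ global fetch; the request body fields follow the countriesnow.space documentation):

// Minimal sketch: look up cities for a US state via the third-party API.
async function fetchCities(state) {
  const res = await fetch(
    'https://countriesnow.space/api/v0.1/countries/state/cities',
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ country: 'United States', state }),
    }
  );
  const body = await res.json();
  return body.data; // array of city names for the given state
}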

Requests:

  • POST /v1/requests - adds a request to the db for scraping.
  • GET /v1/requests - get all requests

Listings

  • GET /v1/listings - get all listings
  • GET /v1/listings/:slug - get listing by slug
  • POST /v1/listings/images - upload images
  • POST /v1/listings/search - returns the facility names, states, or cities that partially match the user input (see the sketch after this list)
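
A search call from the front end might look like this sketch; the request body field name is an assumption, since the actual schema lives in the server code:

// Minimal sketch: query the search endpoint from the front end.
async function searchListings(query) {
  const res = await fetch('/v1/listings/search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query }), // field name is an assumption
  });
  return res.json(); // matching facility names, states, and cities
}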

Private Routes (Only accessible inside the network/VPC)

  • GET /private/requests/uncrawled - get one uncrawled request
  • PUT /private/requests/:id - update request by id
  • GET /private/requests/:id - get request by id
  • POST /private/listings/ - insert listing
  • POST /private/listings/multiple - insert multiple listings
  • PUT /private/listings/:id - update listing by govSiteId

API Design Considerations

  • Versioning: We version our APIs from the start so that introducing a breaking change in production does not break existing clients.

  • Private Routes: We have introduced a proxy which ensures that, from outside the instance, the application can only be accessed through two ports, 80 and 443; the proxy also automatically redirects HTTP to HTTPS. Because the /private routes are not exposed through the proxy, they are reachable only from inside the network/VPC, so services such as the crawler can access them and write securely to the db without any need for authentication (see the sketch below).
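
A minimal sketch of how the versioned public routes and the proxy-shielded private routes might be mounted in Express; the module paths are illustrative, not the project's actual file layout:

const express = require('express');
const app = express();
app.use(express.json());

// Public, versioned routes -- these are the only paths the proxy forwards.
app.use('/v1/listings', require('./routes/listings'));  // hypothetical paths
app.use('/v1/requests', require('./routes/requests'));

// Private routes -- never exposed through the proxy, so only services
// inside the instance/VPC (e.g. the crawler) can reach them.
app.use('/private', require('./routes/private'));

app.listen(3000);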

Deployment

We use Docker to deploy the whole application, since it makes it easy to keep dependencies consistent between local and production environments. The entire application is hosted on a single Lightsail instance.
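
For illustration, a docker-compose file for this setup might look roughly like the sketch below; the service names, ports, and paths are assumptions, not the project's actual configuration.

version: '3'
services:
  server:
    build: ./server            # Node/Express backend
    env_file: ./server/.env    # DB and S3 credentials
    expose:
      - "3000"                 # only reachable inside the compose network
  proxy:
    image: haproxy
    ports:                     # 80 and 443 are the only published ports
      - "80:80"
      - "443:443"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro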

SSL Configuration:

SSL was configured using certbot. Further details are provided here

Database Configuration:

We use a MySQL database for this project.

  • Hosted in AWS RDS (Free Tier)
  • For development, we use a local environment

More details here
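
A typical Sequelize connection for this setup might look like the sketch below; the environment variable names are assumptions (DB_HOST would point at localhost in development and at the RDS endpoint in production).

const { Sequelize } = require('sequelize');

// Connects to the local MySQL instance in development and to the
// AWS RDS endpoint in production, depending on DB_HOST.
const sequelize = new Sequelize(
  process.env.DB_NAME,
  process.env.DB_USER,
  process.env.DB_PASSWORD,
  {
    host: process.env.DB_HOST,
    dialect: 'mysql',
  }
);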

Proxy:

We use HAProxy as a reverse proxy to route requests from the client to our server. HAProxy also makes it easy to terminate SSL and enable HTTPS redirection in our application.
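
The relevant portion of an haproxy.cfg for this behaviour might look like the following sketch; the certificate path and backend address are assumptions.

frontend web
    bind *:80
    bind *:443 ssl crt /etc/ssl/private/site.pem   # assumed cert path
    # redirect plain HTTP to HTTPS
    http-request redirect scheme https unless { ssl_fc }
    # do not expose private routes outside the instance
    http-request deny if { path_beg /private }
    default_backend api

backend api
    server app 127.0.0.1:3000   # assumed backend address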

Storing Images in S3:

First, we create an S3 bucket for our project. On the server we use the aws-sdk for uploading images to S3. To do that, we first create a new IAM user, which allows our server to interact with AWS programmatically. We grant this IAM user very limited access (S3 read/write only). All the credentials are stored in the server's .env file:

BUCKET_NAME=
S3_ACCESS_KEY=
S3_SECRET=

From the front end, we make an API call using multipart/form-data that carries the binary files of all the images. On the server, we write our own custom middleware, which uses multer and aws-sdk to upload the images to S3.

Once the images are uploaded, the middleware returns a list of URLs (string[]), which are then stored in the database.
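
A sketch of such a middleware, assuming aws-sdk v2, multer's in-memory storage, and an images form field (the field and variable names are illustrative):

const AWS = require('aws-sdk');
const multer = require('multer');

const s3 = new AWS.S3({
  accessKeyId: process.env.S3_ACCESS_KEY,
  secretAccessKey: process.env.S3_SECRET,
});

// Parse the multipart/form-data body; files end up in memory as Buffers.
const parseImages = multer({ storage: multer.memoryStorage() }).array('images');

// Upload each file to S3 and collect the resulting URLs on req.imageUrls.
async function uploadToS3(req, res, next) {
  try {
    req.imageUrls = await Promise.all(
      (req.files || []).map(async (file) => {
        const result = await s3
          .upload({
            Bucket: process.env.BUCKET_NAME,
            Key: `${Date.now()}-${file.originalname}`,
            Body: file.buffer,
            ContentType: file.mimetype,
          })
          .promise();
        return result.Location; // public URL of the uploaded object
      })
    );
    next();
  } catch (err) {
    next(err);
  }
}

It could then be attached to the upload route, e.g. router.post('/v1/listings/images', parseImages, uploadToS3, saveListingImages), where saveListingImages is a hypothetical handler that persists req.imageUrls to the database.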
