This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
JobsCrawler aggregates job listings from a variety of sources, including job boards, custom RSS feeds, and traditional APIs. It uses a combination of Selenium, BeautifulSoup (bs4), custom RSS readers, and direct API calls with the `requests` library to scrape job postings and save them to a PostgreSQL database, either local or managed.

The project operates asynchronously: each tool has its own strategy implemented in a separate async file, and these strategies are orchestrated together in `main.py`.
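As a rough illustration, the orchestration in `main.py` might look like the sketch below. The module and function names here are assumptions for illustration only, not the repository's actual API:

```python
import asyncio

# Hypothetical imports: each scraper module is assumed to expose an
# async entry point. The real module and function names may differ.
from bs4_scraper import run_bs4_scrapers
from selenium_scraper import run_selenium_scrapers
from rss_reader import run_rss_readers
from api_scraper import run_api_scrapers


async def main() -> None:
    # Run every strategy concurrently; return_exceptions=True keeps one
    # failing scraper from aborting the others.
    results = await asyncio.gather(
        run_bs4_scrapers(),
        run_selenium_scrapers(),
        run_rss_readers(),
        run_api_scrapers(),
        return_exceptions=True,
    )
    for result in results:
        if isinstance(result, Exception):
            print(f"Scraper failed: {result}")


if __name__ == "__main__":
    asyncio.run(main())
```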
This project focuses on embedding the results from each module and offers improved modularity. Notably, this branch works out of the box for Retrieval-Augmented Generation (RAG).
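As a minimal sketch of what the embedding step could look like, assuming a sentence-transformers model is used to vectorize scraped postings (the repository's actual embedding pipeline may differ):

```python
from sentence_transformers import SentenceTransformer

# Assumed model choice for illustration; the project may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")


def embed_postings(postings: list[dict]) -> list[list[float]]:
    """Embed each posting's title and description for later RAG retrieval."""
    texts = [f"{p['title']}\n{p['description']}" for p in postings]
    # encode() returns one dense vector per input text.
    return model.encode(texts).tolist()
```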
Each module within JobsCrawler is configured with two JSON schemas: `prod` and `test`. These schemas define the parameters for the scraping process, such as CSS selectors, the strategy to use, and the number of pages to crawl. Here is an example JSON object for the site 4dayweek.io:
```json
{
  "name": "https://4dayweek.io",
  "url": "https://4dayweek.io/remote-jobs/fully-remote/?page=",
  "pages_to_crawl": 1,
  "start_point": 1,
  "strategy": "container",
  "follow_link": "yes",
  "inner_link_tag": ".row.job-content-wrapper .col-sm-8.cols.hero-left",
  "elements_path": [
    {
      "jobs_path": ".row.jobs-list",
      "title_path": ".row.job-tile-title",
      "link_path": ".row.job-tile-title h3 a",
      "location_path": ".job-tile-tags .remote-country",
      "description_path": ".job-tile-tags .tile-salary"
    }
  ]
}
```
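To show how such a schema might drive the `container` strategy, here is a hedged sketch using `requests` and BeautifulSoup. The selector keys come from the schema above, but the function name and control flow are illustrative, not the project's actual implementation:

```python
import requests
from bs4 import BeautifulSoup


def scrape_site(config: dict) -> list[dict]:
    """Crawl the pages described by one schema entry and extract job fields."""
    jobs = []
    paths = config["elements_path"][0]
    start = config["start_point"]
    for page in range(start, start + config["pages_to_crawl"]):
        response = requests.get(f"{config['url']}{page}", timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        container = soup.select_one(paths["jobs_path"])
        if container is None:
            continue
        # Pair titles with links positionally; the other *_path selectors
        # (location, description) would be handled the same way.
        titles = container.select(paths["title_path"])
        links = container.select(paths["link_path"])
        for title_el, link_el in zip(titles, links):
            jobs.append({
                "title": title_el.get_text(strip=True),
                "link": link_el.get("href"),
            })
    return jobs
```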
- To add a new website to the crawler, create a corresponding JSON object with the required parameters.
- For common tests, see `tests` to ensure the correct data is being scraped, and save the schema to the appropriate JSON file (e.g., `bs4_test.json`).
- Before running any tests, ensure that your environment variables are correctly set up.
- For debugging, enable logging; there are numerous log statements placed at common breakpoints (a minimal setup is sketched below).
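A minimal way to turn on that logging, assuming the scrapers use Python's standard `logging` module:

```python
import logging

# DEBUG surfaces the per-step log statements; INFO is quieter.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
```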
To get started with the main branch:
- Ensure you have Python and pip installed.
- Clone the repository and navigate to the project directory.
- Install the required dependencies using pip:
```bash
pip install -r requirements.txt
```

- Set up your `.env` file based on the `.env.example` provided (see the sketch below).
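As a sketch of how those variables might be consumed, assuming `python-dotenv` and a PostgreSQL connection string (the variable name below is hypothetical; check `.env.example` for the real keys):

```python
import os

from dotenv import load_dotenv  # requires python-dotenv

# Read key/value pairs from the .env file into the process environment.
load_dotenv()

# Hypothetical variable name; the actual key is defined in .env.example.
DATABASE_URL = os.getenv(
    "DATABASE_URL", "postgresql://localhost:5432/jobscrawler"
)
```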