This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
JobsCrawler aggregates job listings from a variety of sources, including job boards, custom RSS feeds, and traditional APIs. It uses a combination of Selenium, BeautifulSoup (bs4), custom RSS readers, and direct API calls with the `requests` library to scrape job postings and save them to a PostgreSQL database, either local or managed.

The project operates asynchronously: each tool has its own strategy implemented in a separate async file, and these strategies are orchestrated together in `main.py`.
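As a rough illustration, the orchestration in `main.py` might look like the sketch below. The module and function names here are assumptions for illustration only, not the repository's actual API:

```python
import asyncio

# Hypothetical imports: each scraper module is assumed to expose an
# async entry point. The real module and function names may differ.
from bs4_scraper import run_bs4_scrapers
from selenium_scraper import run_selenium_scrapers
from rss_reader import run_rss_readers
from api_scraper import run_api_scrapers


async def main() -> None:
    # Run every strategy concurrently; return_exceptions=True keeps one
    # failing scraper from aborting the others.
    results = await asyncio.gather(
        run_bs4_scrapers(),
        run_selenium_scrapers(),
        run_rss_readers(),
        run_api_scrapers(),
        return_exceptions=True,
    )
    for result in results:
        if isinstance(result, Exception):
            print(f"Scraper failed: {result}")


if __name__ == "__main__":
    asyncio.run(main())
```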
This project focuses on embedding the results from each module and offers improved modularity. Notably, this branch works out of the box for Retrieval-Augmented Generation (RAG).
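As a minimal sketch of what the embedding step could look like, assuming a sentence-transformers model is used to vectorize scraped postings (the repository's actual embedding pipeline may differ):

```python
from sentence_transformers import SentenceTransformer

# Assumed model choice for illustration; the project may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")


def embed_postings(postings: list[dict]) -> list[list[float]]:
    """Embed each posting's title and description for later RAG retrieval."""
    texts = [f"{p['title']}\n{p['description']}" for p in postings]
    # encode() returns one dense vector per input text.
    return model.encode(texts).tolist()
```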
Each module within JobsCrawler is configured with two JSON schemas: `prod` and `test`. These schemas define the parameters for the scraping process, such as CSS selectors, the strategy to use, and the number of pages to crawl. Here is an example JSON object for the site 4dayweek.io:
```json
{
  "name": "https://4dayweek.io",
  "url": "https://4dayweek.io/remote-jobs/fully-remote/?page=",
  "pages_to_crawl": 1,
  "start_point": 1,
  "strategy": "container",
  "follow_link": "yes",
  "inner_link_tag": ".row.job-content-wrapper .col-sm-8.cols.hero-left",
  "elements_path": [
    {
      "jobs_path": ".row.jobs-list",
      "title_path": ".row.job-tile-title",
      "link_path": ".row.job-tile-title h3 a",
      "location_path": ".job-tile-tags .remote-country",
      "description_path": ".job-tile-tags .tile-salary"
    }
  ]
}
```
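To show how such a schema might drive the `container` strategy, here is a hedged sketch using `requests` and BeautifulSoup. The selector keys come from the schema above, but the function name and control flow are illustrative, not the project's actual implementation:

```python
import requests
from bs4 import BeautifulSoup


def scrape_site(config: dict) -> list[dict]:
    """Crawl the pages described by one schema entry and extract job fields."""
    jobs = []
    paths = config["elements_path"][0]
    start = config["start_point"]
    for page in range(start, start + config["pages_to_crawl"]):
        response = requests.get(f"{config['url']}{page}", timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        container = soup.select_one(paths["jobs_path"])
        if container is None:
            continue
        # Pair titles with links positionally; the other *_path selectors
        # (location, description) would be handled the same way.
        titles = container.select(paths["title_path"])
        links = container.select(paths["link_path"])
        for title_el, link_el in zip(titles, links):
            jobs.append({
                "title": title_el.get_text(strip=True),
                "link": link_el.get("href"),
            })
    return jobs
```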
- To add a new website to the crawler, create a corresponding JSON object with the required parameters.
- For common tests, see `tests` to ensure the correct data is being scraped, and save the schema to the appropriate JSON file (e.g., `bs4_test.json`).
- Before running any tests, ensure that your environment variables are correctly set up.
- For debugging, enable logging; there are numerous log statements placed at common breakpoints (a minimal setup is sketched below).
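A minimal way to turn on that logging, assuming the scrapers use Python's standard `logging` module:

```python
import logging

# DEBUG surfaces the per-step log statements; INFO is quieter.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
```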
To get started with the main branch:
- Ensure you have Python and pip installed.
- Clone the repository and navigate to the project directory.
- Install the required dependencies using pip:
```bash
pip install -r requirements.txt
```

- Set up your `.env` file based on the `.env.example` provided (see the sketch below).
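As a sketch of how those variables might be consumed, assuming `python-dotenv` and a PostgreSQL connection string (the variable name below is hypothetical; check `.env.example` for the real keys):

```python
import os

from dotenv import load_dotenv  # requires python-dotenv

# Read key/value pairs from the .env file into the process environment.
load_dotenv()

# Hypothetical variable name; the actual key is defined in .env.example.
DATABASE_URL = os.getenv(
    "DATABASE_URL", "postgresql://localhost:5432/jobscrawler"
)
```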