YCombinator-Scraper

Ycombinator_Scraper logo

YCombinator-Scraper is a Python library and CLI tool for scraping company, job, and founder data from the Workatastartup website. The package uses Selenium and BeautifulSoup to navigate the pages and extract information.
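
For context, the underlying approach is the familiar Selenium-plus-BeautifulSoup pattern: render the page in a (headless) browser, then parse the resulting HTML. The snippet below is a generic illustration of that pattern, not the library's internal code; the URL is only a placeholder.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://www.workatastartup.com/companies/example-inc")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")  # parse the rendered HTML
print(soup.title.get_text(strip=True) if soup.title else "no <title> found")

driver.quit()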


Documentation: https://nneji123.github.io/ycombinator-scraper

Source Code: https://github.com/nneji123/ycombinator-scraper


Sponsor

Proxycurl APIs

Scrape public LinkedIn profile data at scale with Proxycurl APIs.

  • Scraping public profiles is battle-tested in court (hiQ vs. LinkedIn).
  • GDPR, CCPA, and SOC2 compliant.
  • High rate limit: 300 requests/minute.
  • Fast: APIs respond in ~2s.
  • Fresh data: 88% of data is scraped in real time; the remaining 12% is no older than 29 days.
  • High accuracy.
  • Tons of data points returned per profile.

Built for developers, by developers.

Features

  • Web Scraping Capabilities:

    • Extract detailed information about companies, including name, description, tags, images, job links, and social media links.
    • Scrape job-specific details such as title, salary range, tags, and description.
  • Founder and Company Data Extraction:

    • Obtain information about company founders, including name, image, description, LinkedIn profile, and optional email addresses.
  • Headless Mode:

    • Run the scraper in headless mode to perform web scraping without displaying a browser window.
  • Configurability:

    • Easily configure scraper settings such as login credentials, the logs directory, and automatic webdriver installation matched to your browser (via the webdriver-manager package), using environment variables or a configuration file (see the configuration sketch after this list).
  • Command-Line Interface (CLI):

    • Command-line tools to perform various scraping tasks interactively or in batch mode.
  • Data Output Formats:

    • Save scraped data in JSON or CSV format, providing flexibility for further analysis or integration with other tools.
  • Caching Mechanism:

    • Implement a caching feature to store function results for a specified duration, reducing redundant web requests and improving performance.
  • Docker Support:

    • Package the scraper as a Docker image for easy deployment and execution in containerized environments, or use the prebuilt image: docker pull nneji123/ycombinator_scraper.
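
A minimal configuration sketch, assuming settings are read from environment variables via pydantic-settings; the variable names below are hypothetical placeholders, so check the documentation for the exact names your installed version supports:

import os

# NOTE: these variable names are assumptions for illustration, not the library's documented names.
os.environ["YC_SCRAPER_USERNAME"] = "you@example.com"  # login credential (assumed name)
os.environ["YC_SCRAPER_PASSWORD"] = "your-password"    # login credential (assumed name)
os.environ["YC_SCRAPER_HEADLESS"] = "true"             # headless-mode toggle (assumed name)

from ycombinator_scraper import Scraper

scraper = Scraper()  # settings are expected to be picked up when the scraper is created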

Requirements

  • Python 3.9+
  • Chrome or Chromium browser installed.

Installation

$ pip install ycombinator-scraper
$ ycscraper --help

# Output
YCombinator-Scraper Version 0.7.0
Usage: python -m ycombinator_scraper [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  login
  scrape-company
  scrape-founders
  scrape-job
  version

With Docker

$ git clone https://github.com/Nneji123/ycombinator-scraper
$ cd ycombinator-scraper
$ docker build -t your_name/scraper_name . # e.g. docker build -t nneji123/ycombinator_scraper .
$ docker run nneji123/ycombinator_scraper python -m ycombinator_scraper --help
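
You can also run an individual scraping command inside the container; for example (the company URL is a placeholder, and a prior login may be required depending on the command):

$ docker run nneji123/ycombinator_scraper python -m ycombinator_scraper scrape-company --company-url https://www.workatastartup.com/companies/example-inc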

Dependencies

  • click: Enables the creation of a command-line interface for interacting with the scraper tool.
  • beautifulsoup4: Facilitates the parsing and extraction of data from HTML and XML in the web scraping process.
  • loguru: Provides a robust logging framework to track and manage log messages generated during the scraping process.
  • pandas: Utilized for the manipulation and organization of data, particularly in generating CSV files from scraped information.
  • pathlib: Offers an object-oriented approach to handle file system paths, contributing to better file management within the project.
  • pydantic: Used for data validation and structuring the models that represent various aspects of scraped data.
  • pydantic-settings: Extends Pydantic to enhance the management of settings in the project.
  • selenium: Employs browser automation for web scraping, allowing interaction with dynamic web pages and extraction of information.
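
As a rough sketch of how Pydantic fits in, each scraped record is validated against a typed model; the model below is hypothetical and for illustration only, since the library's actual models and field names may differ.

from typing import Optional

from pydantic import BaseModel, HttpUrl

class FounderSketch(BaseModel):  # hypothetical model, not the library's actual schema
    name: str
    description: Optional[str] = None
    linkedin_url: Optional[HttpUrl] = None  # invalid URLs raise a validation error

founder = FounderSketch(name="Jane Doe", linkedin_url="https://www.linkedin.com/in/janedoe")
print(founder.model_dump_json(indent=2))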

Usage

With CLI

ycscraper scrape-company --company-url https://www.workatastartup.com/companies/example-inc

This command will scrape data for the specified company and save it in the default output format (JSON).
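
The remaining commands (login, scrape-job, scrape-founders, version) follow the same pattern; their options are not listed here, so use --help on each subcommand to discover them:

$ ycscraper scrape-job --help
$ ycscraper scrape-founders --help
$ ycscraper login --help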

With Library

from ycombinator_scraper import Scraper

scraper = Scraper()
company_data = scraper.scrape_company_data("https://www.workatastartup.com/companies/example-inc")
print(company_data.model_dump_json(by_alias=True, indent=2))

Pydantic is used under the hood, so methods like model_dump_json are available on all scraped data models, as shown below.
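
Because the returned objects are Pydantic models, you can also convert them to plain dictionaries or write them straight to disk with standard Pydantic methods. A short sketch, reusing the company_data object from the example above:

from pathlib import Path

# Persist the scraped data as JSON using Pydantic's built-in serialization.
Path("company.json").write_text(company_data.model_dump_json(by_alias=True, indent=2))

# Or work with a plain Python dict if you want to post-process the data yourself.
company_dict = company_data.model_dump(by_alias=True)
print(list(company_dict))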

You can view more examples here: Examples

Contribution

We welcome contributions from the community! To contribute to this project, follow the steps below.

Setting Up Development Environment

Gitpod

You can use Gitpod, a free online VS Code-like environment, to quickly start contributing.

Open in Gitpod

Local Setup

  1. Clone the repository:

    git clone https://github.com/nneji123/ycombinator-scraper.git
    cd ycombinator-scraper
  2. Create a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Running Tests

Make sure to run tests before submitting a pull request.

pip install -r requirements-test.txt
pytest tests

Installing Documentation Requirements

If you make changes to documentation, install the necessary dependencies:

pip install -r requirements-docs.txt
mkdocs serve

Setting Up Pre-Commit Hooks

We use pre-commit to ensure code quality. Install it by running:

pip install pre-commit
pre-commit install

Now, pre-commit will run automatically before each commit to check for linting and other issues.

Submitting a Pull Request

  1. Fork the repository and create a new branch for your contribution:

    git checkout -b feature-or-fix-branch
  2. Make your changes and commit them:

    git add .
    git commit -am "Your meaningful commit message"
  3. Push the changes to your fork:

    git push origin feature-or-fix-branch
  4. Open a pull request on GitHub. Provide a clear title and description of your changes.

Documentation

The documentation is built with Material for MkDocs and hosted on GitHub Pages.

License

YCombinator-Scraper is distributed under the terms of the MIT license.