About

** CURRENTLY BROKEN XPATHS (TO BE FIXED) ** Scraping Finn housing/work ads with Python and requests. Work in progress.

The scraper covers different subdomains within Finn (see parameters.yml), e.g. housing ads, project ads and work ads. Each subdomain requires its own set of XPaths, though many XPaths are shared between them (see src/xpaths.py).
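One way to organise shared and subdomain-specific XPaths is sketched below. This is not the contents of src/xpaths.py: the keys, the XPath strings and the helper name `xpaths_for` are all hypothetical and only illustrate the idea of overlaying per-subdomain entries on a common set.

```python
# Hypothetical layout: keys and XPath strings are illustrative only,
# not taken from src/xpaths.py.
COMMON_XPATHS = {
    "title": "//h1/text()",
    "address": "//span[@data-testid='address']/text()",
}

SUBDOMAIN_XPATHS = {
    "housing": {"price": "//span[@data-testid='price']/text()"},
    "work": {
        "employer": "//dd[@data-testid='employer']/text()",
        # A subdomain may override a common entry.
        "title": "//h2/text()",
    },
}


def xpaths_for(subdomain: str) -> dict[str, str]:
    """Return the common XPaths overlaid with the subdomain-specific ones."""
    specific = SUBDOMAIN_XPATHS.get(subdomain, {})
    # Later keys win, so subdomain entries override common ones.
    return {**COMMON_XPATHS, **specific}
```

With this layout, fixing a broken common XPath only needs a change in one place, while an override stays local to its subdomain.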

Only tested on Python 3.11.

[Image: CSV example]

[Image: Log example]

Setup

mkdir scrapes
mkdir logs
pip install -r requirements.txt

Parameters

Adjust parameters in parameters.yml.
daily_scrape: If true, the scraper only scrapes the daily ads.
finn_sub_urls: Which parts of Finn to scrape. A separate CSV is created for each sub URL.
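As a rough illustration of "one CSV per sub URL", the sketch below maps each entry in finn_sub_urls to its own file under the scrapes folder. The parameter values, the file-naming scheme and the helper `csv_path_for` are assumptions made for the example, not the project's actual behaviour.

```python
from pathlib import Path

# Hypothetical parameter values; the real ones live in parameters.yml.
params = {
    "daily_scrape": True,
    "finn_sub_urls": ["realestate/homes", "job/fulltime"],
}


def csv_path_for(sub_url: str, folder: str = "scrapes") -> Path:
    """Build a CSV path for one sub URL (naming scheme is an assumption)."""
    name = sub_url.strip("/").replace("/", "_")
    return Path(folder) / f"{name}.csv"


# One output file per configured sub URL.
csv_paths = {sub_url: csv_path_for(sub_url) for sub_url in params["finn_sub_urls"]}
```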

To run

python src/finn_scraper.py

Checklist

  • Add detail to headers.
  • Add sleep timer, folder, etc. to parameters.yml.
  • Custom queries instead of the binary daily/not-daily scrape.
  • Reduce line length across the project.
  • Check that all requests yield status code 200.
  • Add a data-processing function for HTML -> text.
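For the HTML -> text item, a minimal sketch using only the standard library might look like the following. The function name `html_to_text` is hypothetical and nothing here is taken from the project; it simply drops tags, skips script/style content and collapses whitespace.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces, then collapse runs of whitespace.
        # Note: a word split by an inline tag would gain a space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```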