Atlas

Web-scraping toolkit for Swedish authorities. Automates data collection with pagination, PDF crawling, metadata extraction, and multi-format export.

Atlas is a data collection toolkit designed to support Swedish government authorities in organizing and accessing their own public information. It automates the gathering of reports, statistics, and publications from agency websites, creating searchable, structured datasets.



Overview

Atlas collects public data from Swedish authorities and organizes it into structured formats for easy analysis and search. It currently supports Tillväxtanalys (business analysis) and Trafikanalys (transport analysis) with custom-built scrapers, and its modular, extensible architecture is designed to eventually scale to 100+ authorities.

Current Capabilities

Supported Authorities

  • Tillväxtanalys - Business analysis reports and publications
  • Trafikanalys - Transport statistics, reports, and datasets

Features

  • Browser Automation - Handles JavaScript-rendered pages and cookie consent dialogs using Playwright
  • Pagination Support - Automatically navigates multi-page listings and "load more" buttons
  • PDF Crawler - Recursively finds and downloads PDF documents with section-based organization
  • RSS Feed Parser - Extracts metadata and downloads content from RSS feeds
  • Sitemap Processing - Parses XML sitemaps to discover all published documents
  • Metadata Extraction - Captures document IDs (Serienummer, Diarienummer), dates, titles, and descriptions
  • Duplicate Detection - Prevents redundant downloads using checksums and URL uniqueness (see the sketch after this list)
  • Multi-Format Export - Saves data to SQLite, Excel (XLSX), and CSV formats
  • Progress Tracking - Visual progress bars for long-running operations
  • Debug Tools - Saves HTML snapshots for troubleshooting parsing issues
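
For illustration, duplicate detection could look like the following minimal sketch, assuming checksums are computed over downloaded file content; the function and variable names here are hypothetical, not Atlas's actual API:

# Hypothetical sketch of duplicate detection: skip a download when its URL
# has been seen before, or when its content hashes to a known checksum.
import hashlib

seen_urls: set[str] = set()
seen_checksums: set[str] = set()

def is_duplicate(url: str, content: bytes) -> bool:
    """Return True if this URL or file content was already downloaded."""
    if url in seen_urls:
        return True
    checksum = hashlib.sha256(content).hexdigest()
    if checksum in seen_checksums:
        return True
    seen_urls.add(url)
    seen_checksums.add(checksum)
    return False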

Installation

Prerequisites

  • Python 3.11 or higher
  • pip package manager

Setup

  1. Clone the repository:

    git clone https://github.com/HelixCipher/Atlas.git
    cd Atlas
  2. Install dependencies:

    pip install -r requirements.txt
  3. Install Playwright browsers:

    playwright install chromium

Configuration

Before running, you need to configure which authorities to scrape:

  1. Copy the example configuration file:

    cp example_urls.py urls.py
  2. Edit urls.py and add the URLs of authorities you want to scrape:

    URLS = [
        "https://www.tillvaxtanalys.se",
        "https://www.trafa.se"
    ]

Note: urls.py is gitignored to keep your specific configuration private.

Usage

Run All Scrapers

python main.py

This will execute all configured scrapers in sequence, as sketched below:

  1. Tillväxtanalys web scraper (paginated listings)
  2. Trafikanalys RSS feed parser
  3. Trafikanalys sitemap crawler
  4. PDF crawler for document discovery
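
Based on the entry points shown under Individual Scrapers below, the orchestration amounts to something like the following sketch (run_all is an illustrative name; the real main.py may structure this differently):

# Sketch of main.py-style orchestration using the documented entry points.
from web_scraper import main as web_scraper
from trafa_rss_metadata import main as rss_scraper
from trafa_sitemap_metadata import main as sitemap_scraper
from pdf_crawler import main as pdf_crawler

def run_all() -> None:
    # Each scraper runs to completion before the next one starts.
    for scraper in (web_scraper, rss_scraper, sitemap_scraper, pdf_crawler):
        scraper()

if __name__ == "__main__":
    run_all()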

Individual Scrapers

Each scraper can be run independently:

# Web scraper for Tillväxtanalys
from web_scraper import main as web_scraper
web_scraper()

# RSS metadata extractor
from trafa_rss_metadata import main as rss_scraper
rss_scraper()

# Sitemap crawler
from trafa_sitemap_metadata import main as sitemap_scraper
sitemap_scraper()

# PDF crawler
from pdf_crawler import main as pdf_crawler
pdf_crawler()

Project Structure

Atlas/
├── chrome_path_helper.py       # Playwright Chromium path management
├── example_urls.py             # Example URL configuration (copy to urls.py)
├── main.py                     # Entry point - orchestrates all scrapers
├── pdf_crawler.py              # Recursive PDF discovery and download
├── requirements.txt            # Python dependencies
├── trafa_rss_metadata.py       # RSS feed parser
├── trafa_sitemap_metadata.py   # XML sitemap crawler
├── urls.py                     # URL configuration (gitignored)
├── utils.py                    # User agent strings and utilities
├── web_scraper.py              # Main web scraper with pagination
├── downloads/                  # Downloaded content (gitignored)
├── trafa_downloads/            # Trafikanalys downloads (gitignored)
├── trafa_rss_downloads/        # RSS feed snapshots (gitignored)
├── debug_html/                 # Debug HTML snapshots (gitignored)
├── reports.db                  # SQLite database (generated)
└── reports.xlsx                # Excel export (generated)

Data Output

Database Schema (SQLite)

The reports.db database contains:

CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    report_name TEXT,              -- Publication title
    diarienummer TEXT,             -- Diary number (reference ID)
    serienummer TEXT,              -- Series number (e.g., "Rapport 2024:17")
    description TEXT,              -- Publication description/summary
    date TEXT,                     -- Publication date
    url TEXT UNIQUE                -- Source URL (unique constraint)
);
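
Because url carries a UNIQUE constraint, inserts can lean on SQLite's INSERT OR IGNORE to silently skip rows whose URL is already stored. A minimal sketch, with placeholder column values:

# Sketch: the UNIQUE constraint on url makes re-inserting a known report a no-op.
import sqlite3

conn = sqlite3.connect("reports.db")
conn.execute(
    """INSERT OR IGNORE INTO reports
       (report_name, diarienummer, serienummer, description, date, url)
       VALUES (?, ?, ?, ?, ?, ?)""",
    ("Example title", "2024/123", "Rapport 2024:17",
     "Placeholder description", "17 december 2024",
     "https://example.org/report"),
)
conn.commit()
conn.close()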

Export Formats

  • SQLite (reports.db) - Full database with all metadata
  • Excel (reports.xlsx) - Human-readable spreadsheet format
  • CSV (trafa_rss_metadata.csv, trafa_sitemap_metadata.csv) - Raw metadata exports
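
The spreadsheet and CSV exports can be reproduced from the database with pandas; a minimal sketch, assuming the schema above (the output file names here are illustrative):

# Sketch: dump the reports table to Excel and CSV with pandas.
import sqlite3

import pandas as pd

conn = sqlite3.connect("reports.db")
df = pd.read_sql_query("SELECT * FROM reports", conn)
conn.close()

df.to_excel("reports.xlsx", index=False)  # needs an Excel engine such as openpyxl
df.to_csv("reports_export.csv", index=False)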

Downloaded Content

Organized by source and file type:

  • downloads/ - Tillväxtanalys content (HTML pages)
  • trafa_downloads/pdf/ - PDF documents organized by year
  • trafa_downloads/xlsx/ - Excel datasets organized by year
  • trafa_rss_downloads/ - RSS feed HTML snapshots by date

Scraper Types

  1. Web Scraper (web_scraper.py)

    • Uses Playwright for JavaScript-rendered pages
    • Handles pagination via "load more" buttons (see the sketch after this list)
    • Extracts structured metadata from HTML
  2. RSS Parser (trafa_rss_metadata.py)

    • Parses RSS/Atom feeds
    • Downloads linked content
    • Organizes by publication date
  3. Sitemap Crawler (trafa_sitemap_metadata.py)

    • Parses XML sitemaps
    • Discovers all published documents
    • Extracts file metadata (PDF, XLSX)
  4. PDF Crawler (pdf_crawler.py)

    • Recursive link following
    • Section-based organization
    • Duplicate prevention
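
As referenced in item 1, "load more" pagination with Playwright's sync API can be handled along these lines; the selector and URL below are placeholders, not the selectors Atlas actually uses:

# Hypothetical sketch: click a "load more" button until it disappears,
# then hand the fully rendered HTML to the metadata extractor.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.tillvaxtanalys.se")  # example URL from urls.py

    while True:
        button = page.locator("button.load-more")  # placeholder selector
        if button.count() == 0 or not button.first.is_visible():
            break
        button.first.click()
        page.wait_for_load_state("networkidle")

    html = page.content()  # fully expanded listing, ready for parsing
    browser.close()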

Sample Data

Recent scrapes have collected:

Tillväxtanalys (Business Analysis)

  • 577 reports in database
  • 12,132 PDF documents downloaded

Trafikanalys (Transport Analysis)

  • 2,917 PDF documents (transport statistics and reports)
  • 536 Excel datasets with transport data
  • 130 RSS feed snapshots
  • 289 RSS feed entries

Totals

  • 15,049 PDF documents (12,132 + 2,917)
  • 536 Excel datasets
  • 577 database reports
  • 130 RSS snapshots

Example queries on the data:

-- Find all reports from December 2024
SELECT * FROM reports WHERE date LIKE '%december 2024%';

-- Count reports by series
SELECT serienummer, COUNT(*) FROM reports GROUP BY serienummer;

-- Search by keyword
SELECT * FROM reports WHERE description LIKE '%företag%';

Acknowledgments

Built to support Swedish government authorities in making their public data more accessible and organized.

License & Attribution

This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

You are free to use, share, copy, modify, and redistribute this material for any purpose (including commercial use), provided that proper attribution is given.

Attribution requirements

Any reuse, redistribution, or derivative work must include:

  1. The creator’s name: HelixCipher
  2. A link to the original repository:
    https://github.com/HelixCipher/Atlas
  3. An indication of whether changes were made
  4. A reference to the license (CC BY 4.0)

Example Attribution

This work is based on Atlas by HelixCipher.
Original source: https://github.com/HelixCipher/Atlas
Licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).

You may place this attribution in a README, documentation, credits section, or other visible location appropriate to the medium.

Full license text: https://creativecommons.org/licenses/by/4.0/


Disclaimer

This project is provided “as-is”. The author accepts no responsibility for how this material is used. There is no warranty or guarantee that the scripts are safe, secure, or appropriate for any particular purpose. Use at your own risk.

See DISCLAIMER.md for the full terms.
