Government of Canada website data scraper (parser of registries and official lists)
This project is designed for automatic collection and structuring of data from various government registries and official sources.
Currently, the project parses one or several sources and saves the result into a convenient CSV file.
The main functionality is contained in a single file — main.py.
- Collecting data from a registry (likely a list of literature / licenses / organizations / individuals, etc.)
- Saving results to lit_list_basic.csv
- Basic data cleaning and processing
- Python 3.8+
- requests (for HTTP requests)
- BeautifulSoup4 / selenium (depending on implementation)
- pandas (optional — for convenient table handling)
(exact stack can be seen in main.py or by looking at the imports)
ScrapingGovSite/
├── main.py ← main script / entry point
git clone https://github.com/MaximusPro/ScrapingGovSite.git
cd ScrapingGovSite
python -m venv venv
source venv/bin/activate # Linux / macOS
venv\Scripts\activate # Windows
If a requirements.txt file appears later, use it; for now you can install the most common packages manually:
pip install requests beautifulsoup4 pandas tqdm
python main.py
After execution, the file lit_list_basic.csv will appear / be updated in the root folder.
What is currently being parsed (please specify more precisely if you know)
Most likely: list of literature, register of licenses, list of accredited organizations, unified register of inspections, register of individual entrepreneurs / legal entities, etc.
Source: https://www.ic.gc.ca/app/scr/tds/web/list-liste?lang=eng
- Add delays between requests (sleep 3–10 seconds)
- Use proper User-Agent headers
- Do not create excessive load on the server
- Whenever possible, prefer official open APIs (if they exist)
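The first two recommendations could look something like the sketch below. It is a minimal example under assumptions, not the project's actual request code: it uses `urllib.request` from the standard library (main.py may well use requests instead), and the `USER_AGENT` string and the helpers `fetch` and `fetch_all` are hypothetical names introduced for illustration.

```python
import random
import time
import urllib.request

# Hypothetical identification string; replace it with your own project name and contact.
USER_AGENT = "ScrapingGovSite/0.1 (+https://github.com/MaximusPro/ScrapingGovSite)"


def fetch(url, timeout=30):
    """Download one page, sending an explicit User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch_page=fetch, sleep=time.sleep, min_delay=3.0, max_delay=10.0):
    """Fetch URLs one at a time, pausing a random 3-10 s between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random pause before every later one.
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch_page(url))
    return pages
```

Calling `fetch_all([...])` with a list of page URLs returns the page bodies in order. Requests are issued sequentially on purpose, so the server never sees more than one request at a time from the script.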
- Add configuration via .env / config.yaml
- Support for multiple data sources
- Automatic checking for data updates
- Export to different formats (json, xlsx, sqlite)
- Error handling and logging
- Multithreaded / asynchronous scraping
- Documentation for each parser
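As one possible starting point for the export item, the sketch below converts CSV records to JSON and SQLite using only the standard library. Nothing here exists in the repository yet: the helpers `read_records`, `to_json`, and `to_sqlite`, the table name `records`, and the sample columns are all assumptions. xlsx export is left out because it needs a third-party package.

```python
import csv
import io
import json
import sqlite3


def read_records(csv_file):
    """Read a CSV file object into a list of dicts keyed by the header row."""
    return list(csv.DictReader(csv_file))


def to_json(records):
    """Serialize records as a JSON array, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_sqlite(records, connection, table="records"):
    """Insert records into an SQLite table, creating it from the first record's keys."""
    if not records:
        return 0
    columns = list(records[0])
    # Quote identifiers so column names with spaces or punctuation still work.
    quoted = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    connection.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
        [tuple(record[c] for c in columns) for record in records],
    )
    connection.commit()
    return len(records)


# Hypothetical sample standing in for the contents of lit_list_basic.csv.
SAMPLE_CSV = "Title,Year\nExample entry,2020\n"

records = read_records(io.StringIO(SAMPLE_CSV))
json_text = to_json(records)
connection = sqlite3.connect(":memory:")
inserted = to_sqlite(records, connection)
```

For real data, replace the in-memory sample with `open("lit_list_basic.csv", newline="", encoding="utf-8")` and pass a file path such as `"lit_list.sqlite"` (a hypothetical name) to `sqlite3.connect`.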
MIT License (or choose another one — specify if needed)
If you use this code in your projects — it would be nice if you mention the repository link 😊 Happy scraping!