ScrapingGovSite

Government of Canada website data scraper (parser of registries and official lists)

About the project

This project is designed for automatic collection and structuring of data from various government registries and official sources.

Currently, the project parses one or more sources and saves the result to a CSV file.

The main functionality is contained in a single file — main.py.

Current functionality

Collecting data from a registry (likely a list of literature / licenses / organizations / individuals, etc.)
Saving results to lit_list_basic.csv
Basic data cleaning and processing
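The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: it assumes the registry page exposes its data as an HTML table, and it uses the standard-library html.parser in place of BeautifulSoup4 so that it runs without extra packages. The names TableRowParser, clean_rows and rows_to_csv_text are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def clean_rows(rows):
    """Collapse whitespace in every cell and drop empty or duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        cells = tuple(" ".join(str(cell).split()) for cell in row)
        if not any(cells) or cells in seen:
            continue
        seen.add(cells)
        cleaned.append(list(cells))
    return cleaned


def rows_to_csv_text(header, rows):
    """Serialize a header and data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

To produce a file such as lit_list_basic.csv, the returned text could be written with open(..., "w", newline="", encoding="utf-8").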

Technologies

Python 3.8+
requests (for HTTP requests)
BeautifulSoup4 / selenium (depending on implementation)
pandas (optional — for convenient table handling)

(exact stack can be seen in main.py or by looking at the imports)
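As a quick way to check the imports without reading the whole file, a small helper like the one below could list the packages a script imports. It is only a sketch: the function name imported_packages is hypothetical, and it reports top-level module names, which do not always match the pip package name (for example, bs4 is installed as beautifulsoup4).

```python
import ast


def imported_packages(source):
    """Return the sorted top-level module names imported anywhere in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            names.add(node.module.split(".")[0])
    return sorted(names)
```

It could be applied to the script with imported_packages(open("main.py", encoding="utf-8").read()).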

Project structure

ScrapingGovSite/
├── main.py               ← main script / entry point

How to run

Clone the repository

git clone https://github.com/MaximusPro/ScrapingGovSite.git
cd ScrapingGovSite

Create a virtual environment

python -m venv venv
source venv/bin/activate     # Linux / macOS
venv\Scripts\activate        # Windows

Install dependencies

(if a requirements.txt file appears later — use it; for now you can install the most common packages manually):

pip install requests beautifulsoup4 pandas tqdm

Run the parser

python main.py

After execution, the file lit_list_basic.csv will appear / be updated in the root folder.

What is currently being parsed (please specify more precisely if you know)

Most likely: list of literature, register of licenses, list of accredited organizations, unified register of inspections, register of individual entrepreneurs / legal entities, etc.

Source: https://www.ic.gc.ca/app/scr/tds/web/list-liste?lang=eng

Important notes

⚠️ Comply with the law

Scraping government websites may be restricted by regulations, the site's terms of use, robots.txt rules, request rate limits, and applicable privacy and data-protection law.

Recommendations:

Add delays between requests (sleep 3–10 seconds)
Use proper User-Agent headers
Do not create excessive load on the server
Whenever possible, prefer official open APIs (if they exist)
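A minimal sketch of the first two recommendations, plus a robots.txt check, might look like this. It is an assumption-laden illustration rather than the project's actual code: it uses the standard library (urllib) instead of requests, and USER_AGENT, is_allowed and polite_get are hypothetical names. The same delay and header would apply when using requests.

```python
import random
import time
import urllib.request
import urllib.robotparser

# Hypothetical identifier; replace it with one that describes your scraper.
USER_AGENT = "ScrapingGovSiteBot/0.1 (+https://github.com/MaximusPro/ScrapingGovSite)"


def is_allowed(robots_lines, url, user_agent=USER_AGENT):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def polite_get(url, min_delay=3.0, max_delay=10.0, timeout=30):
    """Wait a random interval (3 to 10 seconds by default), then fetch the URL."""
    time.sleep(random.uniform(min_delay, max_delay))
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The robots.txt lines would normally be downloaded once from the target site and reused for every URL check.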

Future plans

Add configuration via .env / config.yaml
Support for multiple data sources
Automatic checking for data updates
Export to different formats (json, xlsx, sqlite)
Error handling and logging
Multithreaded / asynchronous scraping
Documentation for each parser
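As an illustration of the planned export step, the sketch below converts CSV text to JSON and to a SQLite table using only the standard library. Nothing here exists in the repository yet: the function names and the default table name records are hypothetical, and xlsx export is left out because it would need a third-party package.

```python
import csv
import io
import json
import sqlite3


def csv_to_records(csv_text):
    """Parse CSV text (with a header row) into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


def to_json(records):
    """Serialize records to a JSON string, keeping non-ASCII characters."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_sqlite(records, connection, table="records"):
    """Insert records into a SQLite table and return the number of rows written."""
    if not records:
        return 0
    columns = list(records[0])
    quoted = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    connection.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
        [tuple(record[name] for name in columns) for record in records],
    )
    connection.commit()
    return len(records)
```

The table and column names are interpolated into the SQL text, so they should come from trusted input such as the CSV header produced by the parser itself.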

License

MIT License (or choose another one — specify if needed)

If you use this code in your projects, a link back to the repository would be appreciated 😊

Happy scraping!
