Skip to content

MaximusPro/ScrapingGovSite

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ScrapingGovSite

Government of Canada website data scraper (parser of registries and official lists)

About the project

This project is designed for automatic collection and structuring of data from various government registries and official sources.

Currently, the project parses one or several sources and saves the result into a convenient CSV file.

The main functionality is contained in a single file — main.py.

Current functionality

  • Collecting data from a registry (likely a list of literature / licenses / organizations / individuals, etc.)
  • Saving results to lit_list_basic.csv
  • Basic data cleaning and processing

Technologies

  • Python 3.8+
  • requests (for HTTP requests)
  • BeautifulSoup4 / selenium (depending on implementation)
  • pandas (optional — for convenient table handling)

(exact stack can be seen in main.py or by looking at the imports)

Project structure

ScrapingGovSite/
├── main.py               ← main script / entry point

How to run

Clone the repository

git clone https://github.com/MaximusPro/ScrapingGovSite.git
cd ScrapingGovSite

Create a virtual environment

python -m venv venv
source venv/bin/activate     # Linux / macOS
venv\Scripts\activate        # Windows

Install dependencies

(if a requirements.txt file appears later — use it; for now you can install the most common packages manually):

pip install requests beautifulsoup4 pandas tqdm

Run the parser

python main.py

After execution, the file lit_list_basic.csv will appear / be updated in the root folder. What is currently being parsed (please specify more precisely if you know)

Most likely: list of literature, register of licenses, list of accredited organizations, unified register of inspections, register of individual entrepreneurs / legal entities, etc. Source: https://www.ic.gc.ca/app/scr/tds/web/list-liste?lang=eng

Important notes

⚠️ Comply with the law Scraping government websites may be restricted by regulations, robots.txt rules, request rate limits, Federal Law 149-FZ, 152-FZ, etc. Recommendations:

Add delays between requests (sleep 3–10 seconds) Use proper User-Agent headers Do not create excessive load on the server Whenever possible — prefer official open APIs (if they exist)

Future plans

Add configuration via .env / config.yaml Support for multiple data sources Automatic checking for data updates Export to different formats (json, xlsx, sqlite) Error handling and logging Multithreaded / asynchronous scraping Documentation for each parser

License

MIT License (or choose another one — specify if needed)

If you use this code in your projects — it would be nice if you mention the repository link 😊 Happy scraping!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages