Blog post: Kubestronaut Metrics: Top Countries Rankings
Here's a high-level explanation of the script:
- Environment Setup: Loads credentials for Google Sheets and defines variables for accessing a specific Google Sheet.
- Web Scraping with Selenium: Scrapes Kubestronaut data from a website (CNCF's Kubestronaut page), extracting information about regions and countries and counting the number of Kubestronauts.
- Asynchronous API Fetching: Uses
aiohttp
to asynchronously retrieve population data for each country from the REST Countries API. - Data Preparation: Combines scraped data with fetched population data and sorts it by the number of Kubestronauts.
- Batch the Google Sheets Update: Batch updates the Google Sheet to avoid Google API rate limits and ensure the sheet is updated efficiently.
- Google Sheets Update: Batch updates the Google Sheet with the Kubestronauts data and population information, ensuring the sheet is updated efficiently.
Here’s how to set up the environment on both Linux and MacOS
First, create and activate a virtual environment to keep dependencies isolated.
~$ python3 -m venv myenv
~$ source myenv/bin/activate
In the repo you can find the requirements.txt file and use it basically we use just 3 packages aiohttp
, gspread
and selenium
pip install -r requirements.txt
For MacOS:
Use Homebrew
to install ChromeDriver
brew install chromedriver
For Linux: You can install ChromeDriver manually or through a package manager, depending on your distribution.
For Ubuntu/Debian:
sudo apt-get install chromium-chromedriver
Ensure the chromedriver
executable is in your system's PATH, or specify its location when initializing the Selenium WebDriver.
Set up your Google Cloud project, enable the Google Sheets API, and create a Service Account. You can follow the gspread official doc. When you are done you can download the service account credentials JSON file.
Best way is to use secret manager but for simplicity we will use it in env variable.
- GSHEET_CREDS is the JSON content of the service account credentials file
- GSHEET_KEY is the Google Sheet ID it looks something like "1BxiMVs0A5nFMdKvBdBZjptlbs74OgvE2upms"
export GSHEET_CREDS="JSON content {...}"
export GSHEET_KEY="CHANGE-THAT-1BxiMVs0A5nFMdKvtlbs74OgvE2upms"
If you decide to put it in your shell Reload the shell configuration to apply the changes:
source ~/.bashrc # or ~/.zshrc
Ensure the virtual environment is active and run your Python script:
Run the script
$ python3 kubestronauts.py
Starting Selenium to scrape Kubestronaut data...
Scraped 5 regions and 79 countries.
Fetching populations asynchronously...
Starting population fetch for all countries...
...
...
Chris Abraham - #1 Just wanted to mention that instead of scraping the data from the web page, you could import the data from this file and filter on "category": "Kubestronaut". That is the central source for all the data on the Kubestronaut webpage.