### Website Carbon scraper

This notebook reads a list of URLs from the urls.txt file, cleans them, and submits each one to websitecarbon.com.
It then extracts the grams-of-carbon value from the response HTML and saves the results to the carbondata.csv file.

I started by attempting to use MS Excel's =WEBSERVICE() and =FILTERXML() cell functions, but these only work on Windows. I then considered learning Julia, but decided it would be quicker to just use Python and the command line. The first version of this notebook used wget to scrape the URLs, but in the end I needed to use requests.Session and POST each URL to the websitecarbon.com web form to retrieve the carbon data reliably.

To rerun this notebook, check that the list of URLs in the urls.txt file is up to date, clear out any cached HTML files in the ./html directory, and run all cells.
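Clearing the cache can also be done from Python rather than by hand. The following is a minimal sketch, assuming the cache lives in ./html as described above; the helper name `clear_html_cache` is hypothetical and not part of the notebook.

```python
import shutil
from pathlib import Path


def clear_html_cache(cache_dir="./html"):
    """Delete the cache directory and everything under it, then recreate it empty."""
    cache = Path(cache_dir)
    if cache.exists():
        shutil.rmtree(cache)  # removes all cached responses for every source
    cache.mkdir(parents=True, exist_ok=True)
    return cache
```

Calling `clear_html_cache()` before running all cells forces every URL to be scraped again instead of being read from a cached file.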

In [None]:
import re
import os
import csv
import requests
import json
import pandas as pd

In [None]:
# Clean the URL list a little bit
with open('urls.txt') as f:
    urls = f.readlines()
# Strip whitespace, trailing slashes, the scheme and 'www.' from each line
urls = [u.strip().rstrip('/').replace('https://', '').replace('http://', '').replace('www.', '') for u in urls]
urls = [u for u in urls if u]  # remove empties
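Note that `str.replace` removes `www.` and the scheme wherever they occur, including in the middle of a path. A slightly stricter variant, sketched here under the assumption that only a leading scheme and a leading `www.` should be dropped, could look like this (`clean_url` is a hypothetical helper name and the sample URLs are invented):

```python
def clean_url(raw):
    """Strip whitespace, the scheme, a leading 'www.' and any trailing slash."""
    url = raw.strip()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    return url.rstrip("/")


# Invented sample lines, as they might appear in urls.txt
sample = ["https://www.example.com/\n", "  http://example.org/about/ ", "\n"]
cleaned = [c for c in (clean_url(u) for u in sample) if c]
print(cleaned)  # ['example.com', 'example.org/about']
```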

In [None]:
def scrapeWebsitecarbon(url):
    '''
    Function takes a URL and scrapes the grams of carbon from the websitecarbon.com site.
    It saves the response HTML file and extracts the data from the HTML header
    Returns a quadruple (url, source, metric type, grams of CO2)
    '''
    source = 'websitecarbon.com'
    carbonurl = 'https://www.websitecarbon.com'
    cleanurl = url.replace('.', '-').replace('/', '-')
    headers = {'User-Agent': 'Mozilla/5.0'}
    payload = {'wgd-cc-url':url,
               'wgd-cc-retest': 'true'}
    
    scrapefile = os.path.join("./html/", source, cleanurl + ".html")
    directory = os.path.dirname(scrapefile)
    if not os.path.exists(directory):
        os.makedirs(directory)
    
    # Make the request and save the response
    if not os.path.exists(scrapefile) or os.path.getsize(scrapefile) == 0:
        print('scraping ' + url)
        session = requests.Session()
        sess = session.post(carbonurl, headers=headers, data=payload)
        with open(scrapefile, "w") as f:
            f.write(sess.text)  # the with block closes the file
    else:
        print('using cached html for ' + url)
        
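    # Load the cached response file and search for the data
    # (the ! shell syntax only works in IPython/Jupyter)
    grams = !grep -E 'grams": (.*),$' {scrapefile} | cut -d: -f2 | cut -d, -f1
    # Use the first match, or None when the response contained no value
    grams = grams[0].strip() if grams else None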
    
    return (url, source, 'grams of CO2', grams)
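The grep and cut pipeline above relies on IPython's `!` shell syntax, so it only runs inside a notebook on a system that has those tools. A roughly equivalent pure-Python sketch is shown below. The helper name `extract_grams` is hypothetical, and the sample text is invented to show the shape the pattern expects; it is not actual websitecarbon.com output.

```python
import re

# Same pattern as the grep call, applied line by line via MULTILINE
GRAMS_PATTERN = re.compile(r'grams": (.*),$', re.MULTILINE)


def extract_grams(html):
    """Return the first value following 'grams": ' on a line ending in a comma, or None."""
    match = GRAMS_PATTERN.search(html)
    if match is None:
        return None
    # Keep only the text before the first comma, mirroring `cut -d, -f1`
    return match.group(1).split(",")[0].strip()


# Invented fragment, for illustration only
sample = '<script>\n  "bytes": 12345,\n  "grams": 0.42,\n</script>\n'
print(extract_grams(sample))  # 0.42
```

Inside the function this could be used as `extract_grams(open(scrapefile).read())`. It returns `None` rather than an empty list when nothing matches.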

Websiteemissions.com is another website carbon calculator.

In [None]:
def scrapeWebsiteemissions(url):
    '''
    Function takes a URL and scrapes the grams of carbon from the https://websiteemissions.com site.
    Returns a quadruple (url, source, metric type, grams of CO2)
    '''
    source = 'websiteemissions.com'
    carbonurl = 'https://websiteemissions.com/'
    carbonajaxurl = 'https://websiteemissions.com/wp-admin/admin-ajax.php'
    cleanurl = url.replace('.', '-').replace('/', '-')

    headers = {'User-Agent': 'Mozilla/5.0'}
    # The payload needs the nonce added later
    payload = {'action': 'carbon_calculate',
               'weblink': 'https://' + url}  # this has to start with https://

    scrapefile = os.path.join("./html/", source, cleanurl + ".html")
    directory = os.path.dirname(scrapefile)
    if not os.path.exists(directory):
        os.makedirs(directory)

    # Make the request and save the response - so we can get the wordpress nonce value
    if not os.path.exists(scrapefile) or os.path.getsize(scrapefile) == 0:
        session = requests.Session()
        sess = session.get(carbonurl)

        # This is the line of javascript we need to get the wordpress nonce value from
        noncepattern = 'var carbon_calc_ajax.*?nonce":"(.*?)"'
        result = re.findall(noncepattern, sess.text)

        # Once we have the nonce, we can POST the webform to retrieve the response data
        if result:
            print('scraping ' + url)
            payload['carbonNonce'] = result[0]  # this is a required wordpress nonce value
            sess = session.post(carbonajaxurl, headers=headers, data=payload)
            with open(scrapefile, "w") as f:
                f.write(sess.text)  # the with block closes the file
    else:
        print('using cached html for ' + url)

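    # Read the cached file and parse the JSON data
    if not os.path.exists(scrapefile):
        # No nonce was found, so nothing was scraped or cached
        return (url, source, 'grams of CO2', None)
    with open(scrapefile, 'r') as f:
        data = json.load(f)

    return (url, source, 'grams of CO2', data['co2'])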
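The nonce lookup is the fragile step in this function, so it helps to see the pattern in isolation. The sketch below applies the same regular expression to a page fragment. The fragment is invented for illustration and may not match what websiteemissions.com actually serves, and `find_nonce` is a hypothetical helper name.

```python
import re

# Same pattern as in scrapeWebsiteemissions
NONCE_PATTERN = re.compile(r'var carbon_calc_ajax.*?nonce":"(.*?)"')


def find_nonce(page_html):
    """Return the first nonce found in the page source, or None if absent."""
    match = NONCE_PATTERN.search(page_html)
    return match.group(1) if match else None


# Hypothetical page fragment, invented for illustration only
sample = '<script>var carbon_calc_ajax = {"ajax_url":"/wp-admin/admin-ajax.php","nonce":"abc123"};</script>'
print(find_nonce(sample))  # abc123
```

If the site changes the variable name or the JSON layout, `find_nonce` returns `None` and no POST can be made, which is the case worth checking first when this scraper stops returning data.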

Scrape the URLs into a list of tuples

In [None]:
carbondata = [scrapeWebsitecarbon(u) for u in urls]

In [None]:
webemissions = [scrapeWebsiteemissions(u) for u in urls]

In [None]:
# Merge lists
carbondata.extend(webemissions)

In [None]:
# Write to CSV file
with open('carbondata.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(['url', 'source', 'type', 'value'])
    writer.writerows(carbondata)

Now you can multiply these values by the page view counts from your website analytics to estimate your website's carbon footprint.
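As a rough sketch of that arithmetic, assuming the scraped value is grams of CO2 per page view and that you have a monthly page view count from your analytics (the function name and the figures below are hypothetical):

```python
def annual_footprint_kg(grams_per_view, monthly_pageviews):
    """Estimate yearly CO2 in kilograms from grams per page view and monthly page views."""
    return grams_per_view * monthly_pageviews * 12 / 1000


# Hypothetical figures, for illustration only
print(annual_footprint_kg(0.5, 10_000))  # 60.0
```

The same calculation can be applied per row of carbondata.csv once a page view count is available for each URL.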

In [None]:
# Take a look at the data
carbondata = pd.read_csv('carbondata.csv')
carbondata