(C) Crown Copyright, Met Office. All rights reserved.

## esgf_vars_downloaded_20210204.ipynb

This notebook counts the downloads of CMIP6.HighResMIP variables from CEDA's ESGF node via HTTP between 25th March 2019 and 4th February 2021. The variables have been sorted by frequency and table name. The count for each variable is the number of distinct pairings of a unique dataset (i.e. a unique combination of institute, model, experiment, variant label, table and variable) with a unique IP address that downloaded it.

PRIMAVERA was given access to the access logs of the THREDDS Tomcat server on CEDA's ESGF node, to which all of the PRIMAVERA data has been published. Data can be downloaded over HTTP from THREDDS or via Globus, so these logs do not show all of the data that has been downloaded. Some of the Globus downloads will have been replication of the data to other ESGF nodes and so may not be representative of actual usage of the data. Because the data may have been replicated to other ESGF nodes, these logs show only downloads from CEDA's ESGF node and not the total global downloads.

The logs had been anonymised by replacing each IP address with a hash of that address. Some institutes operate a web proxy, so all users at such an institute appear to come from the same IP address. Because some datasets from some institutes have a single variable spread over several files, any number of file downloads from one dataset by one IP address hash has been counted as a single download of that variable.
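
The counting rule described above can be sketched on a few made-up records. The dataset identifier, file names and IP hashes below are hypothetical and do not come from the real logs. This only illustrates the rule and is not the notebook's processing code.

```python
from collections import Counter

# Hypothetical dataset id: institute/model/experiment/variant/table/variable
DATASET = 'INST/MODEL/EXPT/r1i1p1f1/Amon/tas'

# Hypothetical (dataset, file name, ip_hash) download records
records = [
    (DATASET, 'file_1950.nc', 'hash_a'),
    (DATASET, 'file_1951.nc', 'hash_a'),  # second file, same IP hash: not counted again
    (DATASET, 'file_1950.nc', 'hash_b'),  # different IP hash: counted
]

# A set of (dataset, ip_hash) pairs collapses repeat and multi-file downloads
unique_pairs = {(dataset, ip_hash) for dataset, _filename, ip_hash in records}

# Tally one download per pair under the variable_table code, e.g. 'tas_Amon'
counts = Counter()
for dataset, _ip_hash in unique_pairs:
    *_, table, variable = dataset.split('/')
    counts[f'{variable}_{table}'] += 1

print(counts)
```

Three file downloads here reduce to two counted downloads of the variable, one per IP address hash.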

In [1]:
import datetime
print(f'Last run {datetime.datetime.utcnow()}')

Last run 2021-04-19 14:04:06.611658


In [2]:
from collections import OrderedDict

In [3]:
# The path to the Tomcat log file
LOGFILE = 'HighResMIP.all.anonlog'

In [4]:
class VariableRequests:
    """
    A class to represent the variable requests and to keep track of how many
    times each of them has been requested
    """
    def __init__(self):
        """Create an empty dict of variable requests"""
        self._vreqs = {}

    def increment_vreq(self, vreq):
        """
        Increment the retrieval count for `vreq`, adding it to the dict if
        it doesn't already exist.

        :param str vreq: the variable request code
        """
        if vreq not in self._vreqs:
            self._vreqs[vreq] = 1
        else:
            self._vreqs[vreq] += 1

    def get_vreqs(self, order_by_count=False):
        """
        Get the list and count of variable requests. The default order of
        requests is frequency, then table and finally variable name. They
        can alternatively be returned in decreasing count order.

        :param bool order_by_count: if True then return in decreasing order
            of count.
        :returns: the list and count of variable requests.
        :rtype: str
        """
        return_strings = []
        if order_by_count:
            ordered = OrderedDict(sorted(self._vreqs.items(),
                                         key=lambda x: x[1],
                                         reverse=True))
        else:
            ordered = OrderedDict(
                sorted(self._vreqs.items(),
                       # sort order is frequency then table name then variable;
                       # the frequency is derived from the table name only
                       key=lambda x: (_guess_frequency(x[0].split('_')[1]),
                                      x[0].split('_')[1],
                                      x[0].split('_')[0]))
            )
        for vr in ordered:
            return_strings.append('{:<25} {:3}'.format(vr, self._vreqs[vr]))

        return '\n'.join(return_strings)
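
For comparison, the standard library's `collections.Counter` provides the same tallying and count ordering. The sketch below is an alternative shown for illustration, not the class used in this notebook, and the request codes in it are made up.

```python
from collections import Counter

tally = Counter()
for vreq in ['tas_Amon', 'pr_day', 'tas_Amon', 'pr_3hr', 'tas_Amon', 'pr_day']:
    tally[vreq] += 1  # missing keys start at zero

# most_common() returns (code, count) pairs in decreasing count order
lines = [f'{vreq:<25} {count:3}' for vreq, count in tally.most_common()]
print('\n'.join(lines))
```

The explicit class above keeps the custom frequency and table ordering in one place, which `Counter` does not provide on its own.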


In [5]:
def _guess_frequency(table_name):
    """
    Return an integer corresponding to the frequency of variables in the table.
    Higher frequency data (starting at 1hr) is given a lower integer and so
    sorts first.

    :param str table_name: a string containing the table name.
    :returns: an integer corresponding to the frequency of variables.
    :rtype: int
    :raises ValueError: if a valid frequency isn't found in the table name.
    """
    frequencies = {
        '1hr': 1,
        '3hr': 2,
        '6hr': 3,
        'day': 4,
        'mon': 5,
        'fx': 6
    }
    for freq, rank in frequencies.items():
        if freq in table_name:
            return rank

    raise ValueError(f'No frequency found for table name {table_name}')
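
The effect of this ranking on the sort order can be seen with a handful of request codes. The sketch below restates the lookup under a hypothetical name, `frequency_rank`, so that it runs on its own. The codes are examples only.

```python
# Same rank table as in the cell above
FREQUENCY_RANK = {'1hr': 1, '3hr': 2, '6hr': 3, 'day': 4, 'mon': 5, 'fx': 6}


def frequency_rank(table_name):
    """Hypothetical helper: rank a table by the first frequency substring found."""
    for freq, rank in FREQUENCY_RANK.items():
        if freq in table_name:
            return rank
    raise ValueError(f'No frequency found for table name {table_name}')


codes = ['tas_Amon', 'pr_day', 'pr_E1hr', 'tas_3hr', 'psl_E3hr']

# Sort by frequency rank, then table name, then variable name
ordered = sorted(codes, key=lambda c: (frequency_rank(c.split('_')[1]),
                                       c.split('_')[1],
                                       c.split('_')[0]))
print(ordered)
```

Within one frequency the tables sort by plain string comparison, which is why `3hr` comes before `CF3hr` and `E3hr` in the frequency-ordered output.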


In [6]:
dreqs = {}
vreqs = VariableRequests()
num_bad_request = 0
num_other_requests = 0
num_lines_done = 0

with open(LOGFILE) as fh:
    for line in fh:
        num_lines_done += 1
        cmpts = line.split()
        ip_hash = cmpts[0]
        url = cmpts[6]
        status = cmpts[-2]

        if status != '200':
            # not a successful (HTTP 200) response so ignore and move to next
            num_bad_request += 1
            continue

        if not url.startswith('/thredds/fileServer/esg_cmip6'):
            # not a file retrieval so ignore and move to next
            num_other_requests += 1
            continue

        url_parts = url.split('/')
        # The data request code is in the form:
        # institute_id/source_id/experiment_id/variant_label/table_name/cmor_name
        dreq = '/'.join(url_parts[6:12])
        # The variable request code is in the form:
        # cmor_name_table_name
        vreq = f'{url_parts[11]}_{url_parts[10]}'
        # Count only the first request for this data request from each IP
        # address hash; a set keeps the membership test fast
        ip_hashes = dreqs.setdefault(dreq, set())
        if ip_hash not in ip_hashes:
            ip_hashes.add(ip_hash)
            vreqs.increment_vreq(vreq)

print(f'{num_lines_done} lines processed')

4513950 lines processed
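
To make the indexing in the cell above easier to follow, the sketch below applies the same splits to a single log line. The line is hypothetical: it assumes the common log format and a CMIP6-style directory path, and the hash, timestamp, path components and size are all made up, because the real log layout is not shown in this notebook.

```python
# Hypothetical log line (assumed common log format); not taken from the real logs
line = ('0123abcd - - [01/Jan/2020:00:00:00 +0000] '
        '"GET /thredds/fileServer/esg_cmip6/CMIP6/HighResMIP/INST/MODEL/'
        'EXPT/r1i1p1f1/Amon/tas/gn/v20200101/tas_file.nc HTTP/1.1" 200 1024')

# Whitespace split: field 0 is the IP hash, field 6 the URL and the
# second-to-last field the HTTP status
cmpts = line.split()
ip_hash, url, status = cmpts[0], cmpts[6], cmpts[-2]

# The leading '/' gives an empty first element, so the institute is element 6
url_parts = url.split('/')
dreq = '/'.join(url_parts[6:12])
vreq = f'{url_parts[11]}_{url_parts[10]}'

print(ip_hash, status)
print(dreq)
print(vreq)
```

Under these assumptions `dreq` holds the six components from institute to variable and `vreq` is the variable name joined to the table name.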


Look at the variables downloaded in frequency and table order

In [7]:
print(vreqs.get_vreqs())

pr_E1hr                   112
prc_E1hr                   69
clt_3hr                     7
hfls_3hr                   19
hfss_3hr                   19
huss_3hr                   56
mrro_3hr                    4
mrsos_3hr                   5
pr_3hr                    203
prc_3hr                    14
prsn_3hr                   15
ps_3hr                     15
rlds_3hr                   24
rldscs_3hr                 13
rlus_3hr                   26
rsds_3hr                   37
rsdscs_3hr                 13
rsus_3hr                   21
rsuscs_3hr                 15
tas_3hr                   124
tos_3hr                    15
tslsi_3hr                   4
uas_3hr                   119
vas_3hr                   111
psl_CF3hr                   5
clivi_E3hr                  1
prcsh_E3hr                  2
prw_E3hr                   36
psl_E3hr                   45
rlut_E3hr                   9
rlutcs_E3hr                 3
rsdt_E3hr                   1
rsut_E3hr                   2
rsutcs_E3h

Look at the variables downloaded in popularity order

In [8]:
print(vreqs.get_vreqs(order_by_count=True))

thetao_Omon               3300
uo_Omon                   2700
vo_Omon                   2686
zos_Omon                  2292
tauuo_Omon                2204
tos_Omon                  2189
hfds_Omon                 2051
pr_Amon                   1957
pr_day                    1666
ts_Amon                   1624
psl_Amon                  1610
tauvo_Omon                1373
hfls_Amon                 1337
hfss_Amon                 1331
rsds_Amon                 1317
tas_Amon                  1311
rlds_Amon                 1298
tauu_Amon                 1290
rsus_Amon                 1287
rlus_Amon                 1286
tas_day                   809
tasmax_day                758
ua_Amon                   709
so_Omon                   693
tasmin_day                656
va_Amon                   616
ta_Amon                   510
hus_Amon                  473
ps_Amon                   472
zg_Amon                   450
uas_Amon                  446
vas_Amon                  439
evspsbl_Amon        