# Get the number of records from each contributor by zone and format

In [another notebook](get_contributors.ipynb) I saved a flat list of Trove's metadata contributors. This notebook uses that list to find out the number of records contributed by each organisation, aggregated by zone and format, and does a little exploration of the data.

The code in this notebook is used to create weekly harvests of data about Trove contributors which are saved [in this repository](https://github.com/wragge/trove-contributor-totals).

The method used is straightforward:

- loop through the list of contributors
- using the query pattern `nuc:"[contributor NUC]"` search all zones for items from that contributor
- add the `facet=format` parameter to the search to get facets by item format
- process the results by zone and facet to get totals from each contributor
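
As a rough sketch of the steps above, the request for a single contributor might be assembled like this. The helper name `build_contributor_params` and the placeholder key are assumptions for illustration only; the zone list and the `facet`, `encoding`, and `n` values mirror the parameters used later in this notebook.

```python
def build_contributor_params(nuc, api_key):
    """
    Build the query parameters for a search of all zones
    limited to a single contributor (hypothetical helper).
    """
    return {
        # Restrict the search to records from this contributor
        "q": f'nuc:"{nuc}"',
        # Search every zone in one request
        "zone": "article,book,collection,map,music,people,picture",
        # Ask for totals broken down by format
        "facet": "format",
        "encoding": "json",
        # We only want the totals, not the records themselves
        "n": 0,
        "key": api_key,
    }


params = build_contributor_params("ANL", "YOUR API KEY")
print(params["q"])
```

Setting `n` to `0` keeps the responses small, since only the zone totals and facet counts are needed.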

## Set things up

In [1]:
import datetime
import os
import time

import pandas as pd
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm

# Create a session that will automatically retry on server errors
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))

In [2]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [11]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## Define some functions

In [12]:
def get_children(facet_type, term):
    """
    Get child terms of the current facet term.
    """
    facets = []
    for child_term in term["term"]:
        facets += get_term(facet_type, child_term)
    return facets


def get_term(facet_type, term):
    """
    Get details of an individual facet term as well as any child terms.
    Returns a list of facet terms.
    """
    facets = []
    facets.append({facet_type: term["search"], "total": int(term["count"])})
    if "term" in term:
        facets += get_children(facet_type, term)
    return facets


def get_facet_values(zone, facet_type):
    """
    Get all of the terms for a particular facet from the given zone.
    """
    facet_values = []
    # If there are no facet values you might get either an empty string or no facets at all
    if "facets" in zone and zone["facets"]:
        # If there are multiple facets then facets will be a list, otherwise a dict
        if isinstance(zone["facets"]["facet"], list):
            facets = zone["facets"]["facet"]
        else:
            facets = [zone["facets"]["facet"]]
        for facet in facets:
            if facet["name"] == facet_type:
                for term in facet["term"]:
                    facet_values += get_term(facet_type, term)
    return facet_values


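def get_contrib_totals():
    """
    Get the number of records from each contributor by zone and by format.
    Returns two lists of dicts: zone totals and format totals.
    """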
    params = {
        "zone": "article,book,collection,map,music,people,picture",
        "facet": "format",
        "encoding": "json",
        "n": 0,
        "key": API_KEY,
    }
    contributors = pd.read_csv(
        "https://raw.githubusercontent.com/wragge/trove-contributor-totals/main/data/trove-contributors.csv"
    ).to_dict("records")
    totals = []
    formats = []
    for contrib in tqdm(contributors):
        # Skip contributors without a NUC (empty CSV cells load as NaN, which is truthy)
        if pd.notna(nuc := contrib["nuc"]):
            params["q"] = f'nuc:"{nuc}"'
            response = s.get("https://api.trove.nla.gov.au/v2/result", params=params)
            data = response.json()
            for zone in data["response"]["zone"]:
                totals.append(
                    {
                        "nuc": nuc,
                        "name": contrib["name"],
                        "zone": zone["name"],
                        "total": int(zone["records"]["total"]),
                    }
                )
                facets = get_facet_values(zone, "format")
                for facet in facets:
                    facet.update(
                        {"nuc": nuc, "name": contrib["name"], "zone": zone["name"]}
                    )
                    formats.append(facet)
            time.sleep(0.2)
    return totals, formats
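
To see what the facet-processing functions do, here is a minimal, self-contained sketch that flattens nested facet terms in the same way. The `sample_zone` dictionary is invented for illustration and only imitates the shape of a zone in the API response; `flatten_term` is a hypothetical condensed version of `get_term` and `get_children`.

```python
def flatten_term(facet_type, term):
    """
    Return a flat list containing this term and all of its descendants.
    """
    rows = [{facet_type: term["search"], "total": int(term["count"])}]
    # Child terms, if any, are nested under the "term" key
    for child in term.get("term", []):
        rows += flatten_term(facet_type, child)
    return rows


# Made-up data that imitates the structure of a zone in the API response
sample_zone = {
    "name": "article",
    "facets": {
        "facet": {
            "name": "format",
            "term": [
                {
                    "search": "Article",
                    "count": "2",
                    "term": [{"search": "Article/Report", "count": "1"}],
                },
                {"search": "Periodical", "count": "1"},
            ],
        }
    },
}

rows = []
for term in sample_zone["facets"]["facet"]["term"]:
    rows += flatten_term("format", term)

for row in rows:
    print(row)
```

Each parent and child term ends up as its own row, which is why the formats data includes both `Article` and `Article/Report` for the same contributor.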

Harvest the data.

In [None]:
totals, formats = get_contrib_totals()

Convert the harvested data to dataframes.

In [20]:
df_totals = pd.DataFrame(totals)
df_formats = pd.DataFrame(formats)

In [7]:
df_totals.head()

| | nuc | name | zone | total |
|---|---|---|---|---|
| 0 | VPWLH | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | people | 0 |
| 1 | VPWLH | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | map | 0 |
| 2 | VPWLH | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | music | 0 |
| 3 | VPWLH | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | article | 3 |
| 4 | VPWLH | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | collection | 22 |


In [8]:
df_formats.head()

| | name | nuc | zone | format | total |
|---|---|---|---|---|---|
| 0 | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | VPWLH | article | Article | 2 |
| 1 | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | VPWLH | article | Article/Other article | 1 |
| 2 | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | VPWLH | article | Article/Report | 1 |
| 3 | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | VPWLH | article | Periodical | 1 |
| 4 | 4th/19th Prince of Wales' Light Horse Regiment Unit. History Room. | VPWLH | article | Periodical/Journal, magazine, other | 1 |


Save the harvested data as CSV files.

In [30]:
df_totals[["nuc", "name", "zone", "total"]].to_csv(
    f"trove-contributors-zones-{datetime.datetime.now().strftime('%Y%m%d')}.csv",
    index=False,
)

df_formats[["nuc", "name", "zone", "format", "total"]].to_csv(
    f"trove-contributors-formats-{datetime.datetime.now().strftime('%Y%m%d')}.csv",
    index=False,
)

## Top contributors by zone

Let's do a little exploration of the data. First we'll set some Pandas defaults, and make sure the data is loaded.

In [2]:
pd.set_option("display.max_colwidth", None)
pd.set_option("styler.format.thousands", ",")

# If you haven't created a fresh harvest above, you can load the most recent weekly harvest
try:
    # Make sure totals are ints
    df_totals["total"] = df_totals["total"].astype("Int64")
    df_formats["total"] = df_formats["total"].astype("Int64")
except NameError:
    df_totals = pd.read_csv(
        "https://raw.githubusercontent.com/wragge/trove-contributor-totals/main/data/trove-contributors-zones.csv"
    )
    df_formats = pd.read_csv(
        "https://raw.githubusercontent.com/wragge/trove-contributor-totals/main/data/trove-contributors-formats.csv"
    )

By grouping the totals data, we can find who has contributed the most records by zone.

In [9]:
# Group by zone
df_totals.groupby("zone").apply(
    # This operates on each group's dataframe, reordering, sorting, and displaying the top 5
    lambda x: x[["nuc", "name", "total"]]
    .sort_values(by="total", ascending=False)
    .reset_index(drop=True)
    .head(5)
    # Display the totals as a bar chart
).style.format({"total": "{:,.0f}"}).bar(subset="total")

| zone | | nuc | name | total |
|---|---|---|---|---|
| article | 0 | ANL:DL | Trove Digital Library. | 1804128 |
| article | 1 | ANL | National Library of Australia. | 852604 |
| article | 2 | QU:IR | The University of Queensland. University of Queensland Library. University of Queensland: Institutional Repository. | 421583 |
| article | 3 | APAR:PR | Parliament of Australia. Parliamentary Library. Press Releases Database. | 380935 |
| article | 4 | VSL | State Library Victoria. | 273336 |
| book | 0 | ANL | National Library of Australia. | 3488973 |
| book | 1 | QU | The University of Queensland. University of Queensland Library. | 1557222 |
| book | 2 | VLU | La Trobe University. La Trobe University Library. | 1518937 |
| book | 3 | VSL | State Library Victoria. | 1503395 |
| book | 4 | VU | The University of Melbourne. The University of Melbourne Library. | 1459231 |
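
The same kind of "top n per group" result can also be produced by sorting first and then taking the head of each group. The sketch below uses a small invented dataframe rather than the harvested data, so the values are illustrative only.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame(
    {
        "nuc": ["A", "B", "C", "A", "B", "C"],
        "zone": ["book", "book", "book", "map", "map", "map"],
        "total": [5, 20, 10, 7, 1, 3],
    }
)

top_two = (
    # Sort by zone, then by total with the largest first
    df.sort_values(by=["zone", "total"], ascending=[True, False])
    # Keep the first two rows of each zone, preserving the sorted order
    .groupby("zone")
    .head(2)
    .reset_index(drop=True)
)

print(top_two)
```

This variant returns a flat dataframe rather than one indexed by zone and rank, which can be easier to save or filter further.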


## Explore an individual zone

To make it easier to look at the make-up of each zone, we can add a `%` column to the dataframe that shows the proportion each source has contributed. 

In [None]:
# This nifty one-liner to get percentages by group was found here: https://stackoverflow.com/a/57359372
df_totals["%"] = (
    100 * df_totals["total"] / df_totals.groupby("zone")["total"].transform("sum")
)
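
To show how the percentage calculation works, here is a small worked example on invented data. The key step is `transform("sum")`, which returns one value per row (the total for that row's zone) rather than one value per group, so it lines up with the original `total` column.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame(
    {
        "nuc": ["A", "B", "A", "B"],
        "zone": ["book", "book", "map", "map"],
        "total": [30, 10, 1, 3],
    }
)

# For every row, get the sum of all totals in that row's zone
zone_totals = df.groupby("zone")["total"].transform("sum")

# Each contributor's share of its zone, as a percentage
df["%"] = 100 * df["total"] / zone_totals

print(df)
```

In this toy example the `book` zone has 40 records in total, so contributor `A` accounts for 75% of it and contributor `B` for 25%.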

Let's have a look at the `collection` zone, which includes archives, manuscripts, and other unpublished materials.

How many different contributors provide records to the `collection` zone?

In [18]:
df_totals.loc[(df_totals["zone"] == "collection") & (df_totals["total"] > 0)][
    "nuc"
].nunique()

784

Let's list the top twenty contributors to the `collection` zone.

In [21]:
df_totals.loc[df_totals["zone"] == "collection"].sort_values(
    by="total", ascending=False
).head(20).style.format({"total": "{:,.0f}"}).bar(subset="%").hide(axis="index")

| nuc | name | zone | total | % |
|---|---|---|---|---|
| QSA | Queensland State Archives. | collection | 3043694 | 62.655373 |
| NSCA | City of Sydney Archives. | collection | 517023 | 10.643077 |
| NPAR:PR | Parliament of New South Wales. NSW Parliamentary Library Press Releases. | collection | 244083 | 5.024523 |
| VUMA | The University of Melbourne. University of Melbourne Archives. | collection | 182226 | 3.751178 |
| NSL | State Library of NSW. | collection | 144348 | 2.971448 |
| ANL:AJCP | Australian Joint Copying Project. | collection | 113185 | 2.329948 |
| ANL | National Library of Australia. | collection | 93073 | 1.915936 |
| VSL | State Library Victoria. | collection | 63248 | 1.301979 |
| ANL:DL | Trove Digital Library. | collection | 32218 | 0.663217 |
| VRHS | Royal Historical Society of Victoria. Royal Historical Society of Victoria. | collection | 30588 | 0.629663 |


## Theses from institutional repositories

We can use the formats dataset to drill down further. The NUC symbols of institutional repositories generally end in `:IR`, so we can filter on that suffix to see how many theses are being contributed by institutional repositories in Australian universities.

In [5]:
df_formats = df_formats[["name", "nuc", "zone", "format", "total"]]
df_formats.loc[
    # Filter by format 'Thesis'
    (df_formats["format"] == "Thesis")
    # Only universities
    & (df_formats["name"].str.contains("University"))
    # Only institutional repositories
    & (df_formats["nuc"].str.endswith(":IR"))
].sort_values(by="total", ascending=False).style.format({"total": "{:,.0f}"}).bar(
    subset="total"
).hide(
    axis="index"
)

| name | nuc | zone | format | total |
|---|---|---|---|---|
| The University of Queensland. University of Queensland Library. University of Queensland: Institutional Repository. | QU:IR | book | Thesis | 26037 |
| UNSW Sydney. UNSW Library. University of New South Wales: Institutional Repository. | NUN:IR | book | Thesis | 22851 |
| The University of Melbourne. The University of Melbourne Library. University of Melbourne: Institutional Repository. | VU:IR | book | Thesis | 18955 |
| Australian National University. Australian National University Library. Australian National University: Institutional Repository. | ANU:IR | book | Thesis | 15254 |
| The University of Sydney. University of Sydney Library. University of Sydney: Institutional Repository. | NU:IR | book | Thesis | 13159 |
| University of Adelaide. University of Adelaide: Institutional Repository. | SUA:IR | book | Thesis | 12625 |
| Queensland University of Technology. Queensland University of Technology: Institutional Repository. | QUT:IR | book | Thesis | 8097 |
| University of South Australia. University of South Australia Library. University of South Australia: Institutional Repository. | SUSA:IR | book | Thesis | 5091 |
| Macquarie University. Macquarie University Library. Macquarie University Library: Institutional Repository. | NMQU:IR | book | Thesis | 4827 |
| RMIT University. RMIT University Library. RMIT Research Repository. | VIT:IR | book | Thesis | 4807 |


----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge).