# Getting started with the PROV API

In [10]:
import requests
import pandas as pd
from IPython.display import JSON
# jinja2 is required by pandas' Styler; the explicit import is needed in JupyterLite
import jinja2

The PROV API provides data about PROV's archival holdings in machine readable form.

Making a request is easy. The base URL for API requests is `https://api.prov.vic.gov.au/search/query`. No authentication is required, and the only mandatory parameter is `q`, which carries your search query. Here's an example of a simple search.

In [12]:
api_url = "https://api.prov.vic.gov.au/search/query"
params = {
    "q": "ostrich",
}
response = requests.get(api_url, params=params)
print(response.url)
data = response.json()

https://api.prov.vic.gov.au/search/query?q=ostrich


The number of matching search results is contained in the `response` -> `numFound` field.

In [13]:
print(f"There are {data['response']['numFound']:,} results.")

There are 10 results.


The full search results can be found in `response` -> `docs`. Let's try listing all the titles.

In [14]:
for result in data["response"]["docs"]:
    print(result["title"])

D501 Head close-up of ostrich
D502 Head close-up of ostrich
110
1004/312 Marinda McOstrich Jaffray: Will; Grant of probate
1004/312 Marinda McOstrich Jaffray: Grant of probate
M OSTRICH
553055
M OSTRICH
MATT OSTRICHE
V332 [Daryl Somers and Ozzie Ostrich on the children's show 'Hey, Hey' It's Saturday']
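
The loop above assumes every document has a `title` field. A slightly more defensive sketch, assuming some records might lack a title (this isn't confirmed by anything in the API responses shown here), falls back to a placeholder instead of raising a `KeyError`. The `list_titles` helper is hypothetical and the `sample` data is made up for illustration.

```python
def list_titles(data, placeholder="[no title]"):
    """Return the title of each result, or a placeholder when it is missing."""
    docs = data.get("response", {}).get("docs", [])
    return [doc.get("title", placeholder) for doc in docs]


# Illustrative data shaped like the API response, not real results
sample = {
    "response": {
        "numFound": 2,
        "docs": [{"title": "M OSTRICH"}, {"category": "Item"}],
    }
}
print(list_titles(sample))
```

In the notebook you could call `list_titles(data)` on the parsed response instead of the sample.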


Here's the first result in full.

In [15]:
JSON(data["response"]["docs"][0])

<IPython.core.display.JSON object>
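
`JSON()` only renders as an interactive tree inside a Jupyter environment. Outside a notebook, one alternative is to pretty-print the record with the standard `json` module. This is a minimal sketch: the `pretty` helper is hypothetical and the `doc` below is a cut-down stand-in for a real record.

```python
import json


def pretty(doc):
    """Return a record as indented JSON text, keeping non-ASCII characters."""
    return json.dumps(doc, indent=2, ensure_ascii=False, sort_keys=True)


# Stand-in record; in the notebook you would pass data["response"]["docs"][0]
doc = {"title": "D501 Head close-up of ostrich", "category": "Item"}
print(pretty(doc))
```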

## But what are you searching?

It's not immediately obvious how much data the PROV API exposes, or what types. You can find out how many records there are by using a 'blank' query: setting `q` to `*:*` returns *everything*.

In [8]:
params = {
    "q": "*:*",
}
response = requests.get(api_url, params=params)
print(response.url)
data = response.json()

print(f"There are {data['response']['numFound']:,} results in total.")

https://api.prov.vic.gov.au/search/query?q=%2A%3A%2A
There are 10,137,566 results in total.
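
The printed URL shows the query percent-encoded: `*` becomes `%2A` and `:` becomes `%3A`. As a minimal sketch, the same encoding can be reproduced with the standard library's `urllib.parse.urlencode`, which is handy for checking a query string without sending a request. The `build_search_url` helper is hypothetical, not part of the PROV API.

```python
from urllib.parse import urlencode

API_URL = "https://api.prov.vic.gov.au/search/query"


def build_search_url(query, **extra_params):
    """Return the full request URL for a query (hypothetical helper)."""
    params = {"q": query, **extra_params}
    # urlencode percent-encodes reserved characters and turns spaces into '+'
    return f"{API_URL}?{urlencode(params)}"


print(build_search_url("*:*"))
print(build_search_url("series_id:14517 AND category:Series"))
```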


But what sorts of things are described by those 10+ million records? You can explore this by using facets to get a count of values in the `category` field. To retrieve the facet counts, set the `facet` parameter to `true`, and `facet.field` to `category`.

In [202]:
params = {
    "q": "*:*",
    "facet": "true",
    "facet.field": "category",
    "rows": 0
}
response = requests.get(api_url, params=params)
print(response.url)

data = response.json()
values = data["facet_counts"]["facet_fields"]["category"]
facets = [{"facet": values[i], "count": values[i+1]} for i in range(0, len(values), 2)]
pd.DataFrame(facets).style.format(thousands=",").bar()

https://api.prov.vic.gov.au/search/query?q=%2A%3A%2A&facet=true&facet.field=category&rows=0


|   | facet | count |
|---|---|---:|
| 0 | Item | 6329699 |
| 1 | Image | 3613751 |
| 2 | relatedEntity | 150066 |
| 3 | Consignment | 23610 |
| 4 | Series | 16930 |
| 5 | Agency | 3137 |
| 6 | Function | 321 |
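
The `category` facet comes back as a single flat list that alternates values and counts, which is why the list comprehension in the cell above steps through it two items at a time. A minimal alternative sketch pairs the two halves with `zip`. The `pair_facets` helper is hypothetical, and the `values` list is reconstructed from the first three counts shown above.

```python
def pair_facets(values):
    """Convert [value, count, value, count, ...] into a list of dicts."""
    # values[::2] holds the facet values, values[1::2] holds their counts
    return [
        {"facet": facet, "count": count}
        for facet, count in zip(values[::2], values[1::2])
    ]


values = ["Item", 6329699, "Image", 3613751, "relatedEntity", 150066]
facets = pair_facets(values)
for row in facets:
    print(f"{row['facet']}: {row['count']:,}")
```

The resulting list of dictionaries can be passed to `pd.DataFrame()` in the same way as before.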


*Items*, *Series*, *Agencies*, and *Functions* are the basic entities used to describe PROV's archival holdings. You can find out more about PROV's [Archival Control Model](https://prov.vic.gov.au/recordkeeping-government/a-z-topics/archival-control-model) on the PROV website.

> *Record Item*: a discrete element [of] records managed within a ‘Series’. An Item represents a part of a recordkeeping system or a logical or convenient grouping of records. It may represent one record or multiple records such as a group of folios fastened together to form a file, a group of electronic files aggregated in a folder, or a single volume.

> *Record Series*: a group of records which are recorded or maintained by the same agency (or agencies) and which:
> - are in the same numerical, alphabetical, chronological or other identifiable sequence;
> - or result from the same accumulation or filing process.

> The *agent entity* in PROV’s ACM represents a Victorian Government agency—an administrative unit which has or had responsibility for the provision of at least one aspect of government administration. This entity helps to provide a description of a record’s context, namely who created the records and for what purpose.

> The *function entity* in PROV’s ACM represents the major responsibilities of Victorian Government that may be managed by one or more agencies over time. Applying this entity helps to:
> - group together various records with the same administrative record context
> - links records to their provenance and complementary information.

If you look again at the full result displayed above, you'll see that the `category` field is set to 'Item'. This result describes a discrete element within a series. If you look at the `is_part_of_series.title` field you'll see that it belongs to a series titled 'Negatives of Photographs [Publications Branch]'. You can also see the government agency responsible for this record in the `resp_agency_title` field.
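
As a rough sketch of pulling those contextual fields out of a record: this assumes the dotted name `is_part_of_series.title` is a single flat key in the record, and that some fields may hold a list of values rather than a single string. Neither assumption is confirmed here, so check them against a real record. The `first_value` helper is hypothetical and the agency title in the stand-in record is a placeholder.

```python
def first_value(doc, field, default=None):
    """Return a field's value, taking the first entry if it is a list."""
    value = doc.get(field, default)
    if isinstance(value, list):
        return value[0] if value else default
    return value


# Stand-in record; the agency title is a placeholder, not real data
doc = {
    "category": "Item",
    "is_part_of_series.title": ["Negatives of Photographs [Publications Branch]"],
    "resp_agency_title": "Example Agency",
}
for field in ("category", "is_part_of_series.title", "resp_agency_title"):
    print(f"{field}: {first_value(doc, field)}")
```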

There are a number of other ways that entities within the Archival Control Model can be related to each other. See the full [ACM Policy document](https://prov.vic.gov.au/sites/default/files/files/Govt%20Services%20General/PROV_Archival_Control_Model_Policy.pdf) for a list of possible relationships. These connections between entities are documented in the `relatedEntity` category.

*Consignments* are created when parts of a series are transferred to PROV. See the [Archival description for records transfer projects](https://prov.vic.gov.au/recordkeeping-government/transferring-records/archival-description-records-transfer-projects) page on the PROV website:

> The entire contents of a series may not necessarily be transferred at the same time. A series may also be transferred to PROV in portions, known as consignments, over a number of years. A consignment comprises of record items belonging to the one series which are accessioned into the custody of PROV as part of the one transfer. A consignment may consist of the entirety of a series or only part of a series.

*Images* don't seem to be formally defined anywhere, but they appear to be individual photos or pages belonging to a digitised item.

In [9]:
# Look up the Series record whose series_id is 14517
params = {
    "q": "series_id:14517 AND category:Series",
}
response = requests.get(api_url, params=params)
print(response.url)

https://api.prov.vic.gov.au/search/query?q=series_id%3A14517+AND+category%3ASeries


## Searching within a specific category

To limit your search results to a specific category, set the `category` field in your query to one of the values described above. For example, if you only wanted government agencies, you'd add `category:Agency` to the `q` query string. To include multiple categories you can use the `OR` operator:

- just items: `category:Item`
- items and images: `category:Item OR category:Image`
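
The category filters listed above can also be assembled programmatically. The sketch below uses a hypothetical `category_query` helper. Wrapping the `OR` clauses in parentheses when they are combined with a search term assumes the API accepts standard Lucene/Solr grouping syntax, which hasn't been verified here.

```python
def category_query(categories, term=None):
    """Build a `q` value limited to one or more categories."""
    categories = list(categories)
    if not categories:
        # No category filter: fall back to the term, or match everything
        return term or "*:*"
    clause = " OR ".join(f"category:{name}" for name in categories)
    if term is None:
        return clause
    # Parentheses keep the OR clauses together when AND-ed with a term
    if len(categories) > 1:
        clause = f"({clause})"
    return f"{term} AND {clause}"


print(category_query(["Item"]))
print(category_query(["Item", "Image"]))
print(category_query(["Item", "Image"], term="ostrich"))
```

The returned string is intended to be used as the value of `q` in the `params` dictionary.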

Here's how to request items only.

In [203]:
params = {
    "q": "category:Item"
}
response = requests.get(api_url, params=params)
print(response.url)
data = response.json()

print(f"\nThere are {data['response']['numFound']:,} results.\n")

for result in data["response"]["docs"]:
    print(result["title"])

https://api.prov.vic.gov.au/search/query?q=category%3AItem

There are 6,329,699 results.

211/374 Leslie A Lamb: Will; Grant of probate
215/936 Ellen Cahill: Will; Grant of probate
215/981 Florence M Lovegrove: Will; Grant of probate
211/107 Amelia Hawking: Will; Grant of probate
215/980 William F Finchett: Will; Grant of probate
215/979 George Wilson: Will; Grant of probate
211/102 Bernard F Cragen: Will; Grant of probate
211/221 Jonathan Coulson: Will; Grant of probate
215/978 William E S Ockenden: Will; Grant of probate
215/959 Otto Holst: Will; Grant of probate
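
Only ten titles are listed even though millions of items match, so the API appears to return results one page at a time. The `rows` parameter is used earlier in this notebook; the `start` offset below is an assumption based on standard Solr paging and is not confirmed for the PROV API. The hypothetical `page_params` helper only builds parameter dictionaries and does not send any requests.

```python
def page_params(query, total, rows=100, max_pages=None):
    """Yield one params dict per page of results (assumes a Solr-style `start`)."""
    for page, start in enumerate(range(0, total, rows)):
        if max_pages is not None and page >= max_pages:
            break
        yield {"q": query, "rows": rows, "start": start}


pages = list(page_params("category:Item", total=250, rows=100))
for params in pages:
    print(params)
```

If the assumption holds, each dictionary could be passed to `requests.get(api_url, params=params)` in turn, with `total` taken from `numFound`. Setting `max_pages` keeps a test run from walking through millions of records.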


Compare the number of results to a request for items *or* images.

In [16]:
params = {
    "q": "category:Item OR category:Image"
}
response = requests.get(api_url, params=params)
print(response.url)
data = response.json()

print(f"\nThere are {data['response']['numFound']:,} results.\n")

https://api.prov.vic.gov.au/search/query?q=category%3AItem+OR+category%3AImage

There are 9,943,476 results.

