# Working with the South Australian Hansard API

This notebook provides some examples of how to interact with the API for the proceedings of the South Australian Parliament. This API lets you retrieve parts of the proceedings by date and house, and also allows you (with more effort) to identify all speeches, questions and answers by particular members.

The first part of this notebook shows some examples of interacting with the API and the returned data.

The second part of this notebook retrieves all speeches in the current Parliamentary session (since 2022) by the current South Australian Premier and counts the words used across his speeches.

## Reference Links:

- [South Australian Government Hansard API Documentation](https://hansardsearch.parliament.sa.gov.au/docs/api/index.html)
- [South Australian Government Hansard Web Search](https://hansardsearch.parliament.sa.gov.au/search)

## Our First API Request

This first example shows how to retrieve one of the handy pieces of information we can request from the API: the list of chambers.

This also shows an example of using the [JSON data format](https://en.wikipedia.org/wiki/JSON) returned by the API.

In [1]:
import requests

# We'll need to refer to this domain a lot, but will be changing the path at the end of it
# to work with different API 'endpoints'. This assigns the label api_url to a string
# (everything inside the quote ' characters) that holds the domain.
api_url = 'https://hansardsearch.parliament.sa.gov.au'

# First request - get current chambers:
# requests.get retrieves the URL, and we save it in the variable 'response'
# Note that in Python using + with strings joins them together - 'a' + 'b' = 'ab'
response = requests.get(api_url + '/api/chamber')

# print is a Python function to explicitly display output.
# response.json() parses the data returned by the API so we can manipulate it in Python.
print('Raw response:', response.json())
print()

# The raw response is a little hard to read
# But since it starts with a '[' character, we can recognise that it's a list of items
# and we can display it more nicely by looping through with a for loop.
for chamber in response.json():
    print(chamber)

# This response shows that there are six 'chambers' for the organisation of the business of the South Australian Parliament -
# the expected House of Assembly and Legislative Council, and four extra 'chambers' relating to estimates committees.

Raw response: [{'id': '2310c894d85048eb8051105ca3d348a8', 'name': 'Legislative Council'}, {'id': '8758cca8bb634275820b3047e39bfe8b', 'name': 'Estimates Committee A - Answers to Questions'}, {'id': 'c2ff907f907046a8bb61467a5e2ccc9f', 'name': 'Estimates Committee B'}, {'id': 'd35e3c30ea624513990bacfe6654dfbe', 'name': 'Estimates Committee B - Answers to Questions'}, {'id': 'b2669775c14346e7b73e1eaa25470874', 'name': 'Estimates Committee A'}, {'id': 'd6d71d2191444d70be863757822ef005', 'name': 'House of Assembly'}]

{'id': '2310c894d85048eb8051105ca3d348a8', 'name': 'Legislative Council'}
{'id': '8758cca8bb634275820b3047e39bfe8b', 'name': 'Estimates Committee A - Answers to Questions'}
{'id': 'c2ff907f907046a8bb61467a5e2ccc9f', 'name': 'Estimates Committee B'}
{'id': 'd35e3c30ea624513990bacfe6654dfbe', 'name': 'Estimates Committee B - Answers to Questions'}
{'id': 'b2669775c14346e7b73e1eaa25470874', 'name': 'Estimates Committee A'}
{'id': 'd6d71d2191444d70be863757822ef005', 'name': 'House of Assembly'}
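
Because each chamber is a dictionary with an `id` and a `name`, the list can be turned into a lookup from name to id. The following is a minimal sketch, assuming the response keeps the shape shown in the output; the two entries are copied from that output rather than fetched, and `chamber_ids` is a name introduced here for illustration.

```python
# Sample entries copied from the API output, so this runs without a network call.
chambers = [
    {'id': '2310c894d85048eb8051105ca3d348a8', 'name': 'Legislative Council'},
    {'id': 'd6d71d2191444d70be863757822ef005', 'name': 'House of Assembly'},
]

# Build a dictionary that maps each chamber name to its id.
# chamber_ids is a hypothetical helper name, not part of the API.
chamber_ids = {chamber['name']: chamber['id'] for chamber in chambers}

print(chamber_ids['House of Assembly'])
```

With a live response, `chambers = response.json()` would take the place of the hard-coded list.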

## Search for 'events' in a time period

Parliaments do not sit all the time, so we sometimes need to work out when a parliament actually sat before we can retrieve the right information. For example, if you're looking for parliamentary speeches about a particular event, you might want to find the first sitting days after that event.

In [2]:
# Let's use the events endpoint to see what happened in the first three months of 2015.
# This endpoint takes a start date and an end date as params.
# If you've seen a URL like www.example.com/path?x=1&y=2,
# the x=1&y=2 part holds the parameters, representing x = 1 and y = 2 respectively.
# The params argument to the following requests call handles turning
# the dates we have into the right URL.
response = requests.get(api_url + '/api/hansard/events/', params={'startDate': '2015-01-01', 'endDate': '2015-04-01'})

print(response.json())

{'events': [{'houseName': 'Legislative Council', 'houseCode': 'uh', 'date': '2015-03-26', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-03-26/pdf', 'subjectCount': 54, 'tocUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-03-26/toc'}, {'houseName': 'House of Assembly', 'houseCode': 'lh', 'date': '2015-03-26', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-26/pdf', 'subjectCount': 111, 'tocUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-26/toc'}, {'houseName': 'House of Assembly', 'houseCode': 'lh', 'date': '2015-03-25', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-25/pdf', 'subjectCount': 119, 'tocUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-25/toc'}, {'houseName': 'Legislative Council', 'houseCode': 'uh', 'date': '2015-03-25', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-03-25/pdf', 'subjectCount

In [3]:
# The raw output above is hard to read - there's a lot of it.
# The output starts with '{', indicating that the parsed response is a Python dictionary.
# A dictionary maps 'keys' to values, and dictionaries can be arbitrarily nested.
# We can access the 'events' key as below to print the data more readably.
for event in response.json()['events']:
    print(event)

# Now we can see there are lots of events in this time window - and the returned data indicates the house, date, and links to
# specific parts of the proceedings.

{'houseName': 'Legislative Council', 'houseCode': 'uh', 'date': '2015-03-26', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-03-26/pdf', 'subjectCount': 54, 'tocUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-03-26/toc'}
{'houseName': 'House of Assembly', 'houseCode': 'lh', 'date': '2015-03-26', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-26/pdf', 'subjectCount': 111, 'tocUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-26/toc'}
{'houseName': 'House of Assembly', 'houseCode': 'lh', 'date': '2015-03-25', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-25/pdf', 'subjectCount': 119, 'tocUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/lh/2015-03-25/toc'}
{'houseName': 'Legislative Council', 'houseCode': 'uh', 'date': '2015-03-25', 'pdfUrl': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-03-25/pdf', 'subjectCount': 66, 'tocUrl'
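
To find the first sitting day after a particular event, the `events` list can be filtered by house and date. This is a rough sketch under the assumption that every event carries `houseCode` and an ISO-formatted `date`, as in the output shown; the sample events are trimmed copies of that output, and `first_sitting_on_or_after` is a hypothetical helper, not part of the API.

```python
from datetime import date

# Trimmed copies of the events in the output (only the keys used here).
events = [
    {'houseName': 'Legislative Council', 'houseCode': 'uh', 'date': '2015-03-26'},
    {'houseName': 'House of Assembly', 'houseCode': 'lh', 'date': '2015-03-26'},
    {'houseName': 'House of Assembly', 'houseCode': 'lh', 'date': '2015-03-25'},
    {'houseName': 'Legislative Council', 'houseCode': 'uh', 'date': '2015-03-25'},
]


def first_sitting_on_or_after(events, house_code, target):
    """Return the earliest sitting date for a house on or after target, or None."""
    sitting_dates = [
        date.fromisoformat(event['date'])
        for event in events
        if event['houseCode'] == house_code
    ]
    candidates = [d for d in sitting_dates if d >= target]
    return min(candidates) if candidates else None


print(first_sitting_on_or_after(events, 'lh', date(2015, 3, 1)))
```

Taking the minimum rather than relying on the order of the list means the helper does not depend on how the API happens to sort its results.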

In [4]:
# get the table of contents for a date of the SA upper house - we could use this to retrieve everything said
# in a particular house on a particular day.
response = requests.get('https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/toc')

# We can inspect and pull apart this response to see all the entries and their associated links for each part of the 
# table of contents.
table_of_contents = response.json()
for entry in table_of_contents['proceedings']:
    print(entry)

{'name': 'Commencement', 'subjects': [{'type': 'subject', 'names': ['Commencement'], 'index': 1, 'links': {'html': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/1?contentType=text%2Fhtml', 'xml': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/1?contentType=text%2Fxml', 'json': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/1?contentType=text%2Fjson'}}]}
{'name': 'Parliamentary Procedure', 'subjects': [{'type': 'subject', 'names': ['Papers'], 'index': 2, 'links': {'html': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/2?contentType=text%2Fhtml', 'xml': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/2?contentType=text%2Fxml', 'json': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/2?contentType=text%2Fjson'}}]}
{'name': 'Ministerial Statement', 'subjects': [{'type': 'subject', 'names': ['Greyhound Racing'
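
The table of contents is nested: each proceeding holds a list of subjects, and each subject holds its own links. One way to make it easier to scan is to flatten it into rows. The sketch below assumes every subject has `names` and a `links` dictionary like the two entries shown in the output, which may not hold for every entry type; the sample data is copied from that output and `rows` is a name introduced here.

```python
# Two entries copied from the output (only the 'json' link is kept).
table_of_contents = {
    'proceedings': [
        {'name': 'Commencement', 'subjects': [
            {'type': 'subject', 'names': ['Commencement'], 'index': 1,
             'links': {'json': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/1?contentType=text%2Fjson'}},
        ]},
        {'name': 'Parliamentary Procedure', 'subjects': [
            {'type': 'subject', 'names': ['Papers'], 'index': 2,
             'links': {'json': 'https://hansardsearch.parliament.sa.gov.au/api/hansard/uh/2015-02-24/subject/2?contentType=text%2Fjson'}},
        ]},
    ]
}

# Flatten to one row per subject: (proceeding name, subject names, JSON link).
rows = []
for proceeding in table_of_contents['proceedings']:
    for subject in proceeding['subjects']:
        # .get() avoids a KeyError if a subject turns out to have no links.
        json_link = subject.get('links', {}).get('json')
        rows.append((proceeding['name'], ', '.join(subject['names']), json_link))

for row in rows:
    print(row)
```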

## All of the Current SA Premier's Speeches for the Current Parliament

This example shows a more complex data collection task: retrieving all of the speeches by the current Premier of South Australia in the current session of the SA Parliament.

It identifies every speech in which the Premier spoke and downloads each one to a separate directory.

In [5]:
import os
import time
from datetime import datetime

# Create the output directory; exist_ok means it is not an error if it already exists.
os.makedirs('data', exist_ok=True)

# All members and the speeches they were part of for the latest SA government
response = requests.get('https://hansardsearch.parliament.sa.gov.au/api/hansard/indicies/lh/55/1/members', params={'contentType': 'text/json'})

# Parse the JSON to turn the text into a Python data structure we can work with
member_index = response.json()
all_members = member_index['members']

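# Go through the members one at a time until we find the current Premier's speeches.
speeches = None
for member in all_members:
    if member['name'] == 'MALINAUSKAS, Peter Bryden':
        speeches = member['speeches']
        break

# Stop with a clear message if the name did not match any member.
if speeches is None:
    raise ValueError('Premier not found in the member index')
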
for speech in speeches[0]['topic']:
    # Get the information about this speech from this index data structure.
    # It is a complex nested structure, so we need to spend some time unpacking it.
    speech_locator = speech['date'][0]['fragment'][0]['uid']

    print('Retrieving:', speech_locator)

    # Pull apart the locator into house, date and subject index
    house, date, index = speech_locator.split('-')
    date_in_api_format = datetime.strptime(date, '%Y%m%d').strftime('%Y-%m-%d')

    speech_url = 'https://hansardsearch.parliament.sa.gov.au/api/hansard/{}/{}/subject/{}'.format(
        house, date_in_api_format, index
    )

    response = requests.get(speech_url, params={'contentType': 'text/html'})

    file_loc = os.path.join('data', f"{speech_locator}.html")
    with open(file_loc, 'w') as f:
        f.write(response.text)

    # Wait 2 seconds in between API calls - it's polite to space out your requests.
    time.sleep(2)

Retrieving: lh-20221201-51
Retrieving: lh-20230614-6
Retrieving: lh-20230706-6
Retrieving: lh-20230706-10
Retrieving: lh-20230614-8
Retrieving: lh-20230614-36
Retrieving: lh-20230530-6
Retrieving: lh-20230518-6
Retrieving: lh-20230503-18
Retrieving: lh-20230326-3
Retrieving: lh-20230223-37
Retrieving: lh-20230208-9
Retrieving: lh-20221101-10
Retrieving: lh-20221020-9
Retrieving: lh-20220928-9
Retrieving: lh-20220920-4
Retrieving: lh-20220906-3
Retrieving: lh-20220706-11
Retrieving: lh-20220531-34
Retrieving: lh-20220505-3
Retrieving: lh-20220505-4
Retrieving: lh-20220503-4
Retrieving: lh-20230913-46
Retrieving: lh-20231017-3
Retrieving: lh-20240220-8
Retrieving: lh-20240222-9
Retrieving: lh-20240409-12
Retrieving: lh-20240409-16
Retrieving: lh-20240627-10
Retrieving: lh-20240627-11
Retrieving: lh-20240627-14
Retrieving: lh-20241029-31
Retrieving: lh-20241126-13
Retrieving: lh-20241126-14
Retrieving: lh-20241127-52
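
The conversion from a locator such as `lh-20221201-51` to a subject URL can be pulled out into a small function and checked without calling the API. This is a sketch that mirrors the logic in the cell above; `API_URL` and `locator_to_url` are names introduced here for illustration, and the sketch assumes locators always have the three-part `house-YYYYMMDD-index` form seen in the output.

```python
from datetime import datetime

API_URL = 'https://hansardsearch.parliament.sa.gov.au'


def locator_to_url(speech_locator):
    """Turn a locator like 'lh-20221201-51' into the matching subject URL."""
    # The locator has three parts: house code, date (YYYYMMDD) and subject index.
    house, date, index = speech_locator.split('-')
    # The API path uses dates in YYYY-MM-DD form.
    date_in_api_format = datetime.strptime(date, '%Y%m%d').strftime('%Y-%m-%d')
    return f'{API_URL}/api/hansard/{house}/{date_in_api_format}/subject/{index}'


print(locator_to_url('lh-20221201-51'))
```

A locator that does not split into exactly three parts raises a `ValueError` at the unpacking step, which is a useful early signal that the index structure is not what the code expects.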


In [6]:
import glob

import inscriptis


# Identify all the html files in the folder
data_files = glob.glob(os.path.join('data', '*.html'))

# Extract the text from each of the HTML files as plaintext
# using the inscriptis library, then place that text into
# a list.
entry_text = []

for data_file in data_files:
    with open(data_file, 'r') as f:
        html_content = f.read()
        text = inscriptis.get_text(html_content)
        entry_text.append(text)

# Write all of the extracted text to a single file so we can look at the data in bulk.
with open('concatenated_speeches.txt', 'w') as w:
    for entry in entry_text:
        w.write(entry)

In [7]:
import collections
# We'll use the NLTK library to break up the text into words.
from nltk.tokenize import wordpunct_tokenize

tokenised = [wordpunct_tokenize(t.lower()) for t in entry_text]

# Now we'll count the number of times a word is used,
# and the number of speeches a word is used in.
word_counts = collections.Counter()
speech_counts = collections.Counter()

for tokens in tokenised:
    for token in tokens:
        word_counts[token] += 1

    for token in set(tokens):
        speech_counts[token] += 1

print('Total tokens:', sum(word_counts.values()))

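import csv

# Use the csv module so that tokens such as ',' or '"' are quoted properly
# instead of breaking the columns.
with open('word_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['token', 'word_count', 'speech_count'])
    for word, word_count in word_counts.most_common(1000):
        writer.writerow([word, word_count, speech_counts[word]])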

Total tokens: 152018
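
The difference between the two counters is easiest to see on a toy example. The sketch below uses three invented sentences instead of the downloaded speeches, and a simple regular expression in place of NLTK's `wordpunct_tokenize` so that it runs with the standard library only. Unlike `wordpunct_tokenize`, the pattern `\w+` drops punctuation rather than keeping it as separate tokens.

```python
import collections
import re

# Invented example texts standing in for the downloaded speeches.
speeches = [
    'Health is the priority. Health, health, health.',
    'The budget funds health and housing.',
    'Housing is the priority.',
]

# Lowercase each text and keep only runs of word characters.
tokenised = [re.findall(r'\w+', text.lower()) for text in speeches]

word_counts = collections.Counter()
speech_counts = collections.Counter()
for tokens in tokenised:
    # Every occurrence counts towards the word total...
    word_counts.update(tokens)
    # ...but each speech counts at most once per word.
    speech_counts.update(set(tokens))

print('health:', word_counts['health'], 'uses in', speech_counts['health'], 'speeches')
print('the:', word_counts['the'], 'uses in', speech_counts['the'], 'speeches')
```

Here 'health' is used more often than 'the' overall, but it appears in fewer speeches. Comparing the two numbers helps separate words that are repeated heavily in one speech from words that recur across many.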
