# Process newspaper data

This notebook adds place information to a harvest of newspaper titles from the SLV catalogue. It also compares the catalogue records with the digitised Victorian newspapers in Trove, linking catalogue records to Trove and adding Trove titles that are missing.

The result is a dataset with one row for each place/title combination. For example, a single title might be linked to 5 places, so there will be 5 rows related to that title. The dataset is used in the [load newspapers to SQLite](load_newspapers_to_sqlite.ipynb) notebook to add the newspaper data to Datasette/Spatialite.

In [1]:
import pandas as pd
from pathlib import Path
import json
from requests_cache import CachedSession
from tqdm.auto import tqdm
import re


sess = CachedSession(timeout=60, headers={"User-Agent": "GLAM Workbench notebook / glam-workbench.net / tim@timsherratt.au"})
tqdm.pandas()

## Load pre-harvested catalogue records

See the [download newspapers](download_newspapers.ipynb) notebook for harvesting details.

In [2]:
# Load pre-harvested newspapers data
df = pd.read_json("newspapers.ndjson", lines=True)

In [3]:
df.columns

Index(['source', 'type', 'language', 'title', 'format', 'creationdate',
       'publisher', 'mms', 'contributor', 'addtitle', 'frequency', 'genre',
       'coverage', 'place', 'version', 'lds03', 'lds04', 'lds19', 'lds41',
       'subject', 'contents', 'lds13', 'rights', 'lds14', 'lds17', 'edition',
       'relation', 'unititle', 'lds09', 'identifier', 'series', 'lds11',
       'lds15', 'lds08', 'description', 'vertitle', 'lds23', 'lds24',
       'creator', 'lds30', 'lds01', 'lds29', 'lds32', 'lds33', 'ispartof',
       'lds06', 'lds10', 'lds02', 'lds07'],
      dtype='object')

In [4]:
# Remove unnecessary fields
df = df[["mms", "title", "lds19", "publisher", "format", "genre", "lds03"]]

In [5]:
# Turn lists into pipe-separated strings
df = df.apply(lambda x: x.str.join(" | "))
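
As a minimal sketch of what this step does, the made-up records below hold lists of values that are joined into single strings. The column values are invented for illustration; only the use of `str.join` is taken from the cell above.

```python
import pandas as pd

# Made-up values standing in for harvested fields, which arrive as lists
sample = pd.DataFrame(
    {
        "title": [["Example times", "The Example times"], ["Sample news"]],
        "genre": [["Newspapers", "Periodicals"], None],
    }
)

# Join each list into one string; missing values stay missing
joined = sample.apply(lambda col: col.str.join(" | "))
titles = joined["title"].tolist()
genres = joined["genre"].tolist()
```

Missing values stay missing after the join, so a later filter on `genre` can still test for nulls with `isnull()`.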

In [6]:
# Remove things like zines
df = df.loc[(df["genre"].isnull()) | (df["genre"].str.contains("Newspapers"))]

In [791]:
# Save cleaned up dataset as CSV
df.to_csv("newspapers.csv", index=False)

## Add Trove links

Trove urls aren't in the harvested records, but can be found in a couple of different places. First we'll try getting them from the MARC record.

In [7]:
def get_marc_record(alma_id):
    """
    Gets a text representation of an item's MARC record.
    """
    response = sess.get(
        f"https://find.slv.vic.gov.au/primaws/rest/pub/sourceRecord?docId=alma{alma_id}&vid=61SLV_INST:SLV"
    )
    return response.text


def get_marc_value(marc, tag, subfield):
    """
    Gets the value of a tag/subfield from a text version of an item's MARC record.
    """
    try:
        tag = re.search(rf"^{tag}\t.+", marc, re.M).group(0)
        #print(tag)
        subfield = re.search(rf"\${subfield}([^\$]+)", tag).group(1)
    except AttributeError:
        return None
    return subfield.strip(" .,")

def get_trove_links(alma_id):
    marc = get_marc_record(alma_id)
    trove_note = get_marc_value(marc, 856, "z")
    trove_url = get_marc_value(marc, 856, "u")
    if trove_url and "nla.gov.au" in trove_url:
        return pd.Series([trove_note, trove_url])
    else:
        return pd.Series([None, None])
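
To show how the tag and subfield lookup behaves, here is a minimal sketch run against a made-up record. The layout of `SAMPLE_MARC` (tag, tab, indicators, then `$`-prefixed subfields) is an assumption based on the regular expressions above, not a copy of a real SLV record, and `marc_value` is a hypothetical stand-in for `get_marc_value`.

```python
import re

# Made-up record text for illustration; real records may be laid out differently
SAMPLE_MARC = (
    "245\t10 $aExample times.\n"
    "856\t40 $zDigitised in Trove $uhttps://nla.gov.au/nla.news-title123\n"
)


def marc_value(marc, tag, subfield):
    """Return the first value of tag/subfield, or None if it isn't present."""
    # Find the first line that starts with the tag followed by a tab
    line = re.search(rf"^{tag}\t.+", marc, re.M)
    if line is None:
        return None
    # Capture everything after $<subfield> up to the next $ or the end of line
    value = re.search(rf"\${subfield}([^\$]+)", line.group(0))
    return value.group(1).strip(" .,") if value else None


note = marc_value(SAMPLE_MARC, 856, "z")
url = marc_value(SAMPLE_MARC, 856, "u")
missing = marc_value(SAMPLE_MARC, 999, "a")
```

Because `re.search` stops at the first match, only the first occurrence of a repeated tag is examined.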

In [8]:
df[["trove_note", "trove_url"]] = df["mms"].progress_apply(get_trove_links)

  0%|          | 0/3834 [00:00<?, ?it/s]

In [9]:
df.loc[df["trove_url"].notnull()].shape

(409, 9)

In [10]:
df.loc[(df["trove_url"].notnull()) & (df["trove_url"].str.contains("nla.obj"))]

Unnamed: 0,mms,title,lds19,publisher,format,genre,lds03,trove_note,trove_url
10,9940656140907636,Lilydale Star Mail.,2021-,Healesville Vic. : Paul Thomas for Star News G...,"1 online resource : colour illustrations, colo...",Newspapers,Australia--Victoria--Healesville | Australia--...,National edeposit: Onsite at State Library Vic...,https://nla.gov.au/nla.obj-3197996265
75,9940870303507636,"Progresso (Montrose, Vic. : Online) | Il Progr...",2016-,Thornbury Victoria : Il Progresso,1 online resource : colour illustrations | tex...,Newspapers | Periodicals,Australia--Victoria | Australia--Victoria--Mon...,National edeposit: Onsite at State Library Vic...,https://nla.gov.au/nla.obj-3305088433
78,9940751007907636,"Horsham times (Horsham, Vic. : 2020 : Online) ...",2023-,"Warracknabeal, VIC : The Horsham Times Pty Ltd","1 online resource : colour illustrations, colo...",Newspapers,Australia--Victoria--Horsham | Australia--Vict...,National edeposit: Onsite at State Library Vic...,https://nla.gov.au/nla.obj-3260111775
79,9940751007707636,"Argus (Rainbow, Vic. : Online) | The Argus : J...",2023-,Warracknabeal Victoria : Warrnhill Publishing,"1 online resource : colour illustrations, colo...",Newspapers,Australia--Victoria--Rainbow | Australia--Vict...,National edeposit: Onsite at State Library Vic...,https://nla.gov.au/nla.obj-3260112447
80,9940727347907636,Nhill Free Press & Kaniva Times (Online) | Nhi...,2023-,Nhill Victoria : Nhill Free Press & Kaniva Times,1 online resource : illustrations (some colour...,Newspapers,Australia--Victoria--Wimmera | Australia--Vict...,National edeposit,https://nla.gov.au/nla.obj-3142639214
...,...,...,...,...,...,...,...,...,...
3774,9940655163207636,Ranges Trader star mail (Online) | Ranges Trad...,2021-,Healesville Victoria : Star News Group Pty Ltd,1 online resource : colour illustrations | tex...,Newspapers | Periodicals,Australia--Victoria--Healesville | Australia--...,National edeposit: Onsite at State Library Vic...,https://nla.gov.au/nla.obj-2991362579
3848,9940844719407636,Lakes Post (Online) | Lakes Post.,2024-,"Bairnsdale, Victoria : James Yeates & Sons Pty...","1 online resource : colour illustrations, colo...",Newspapers,Australia--Victoria--Lakes Entrance | Australi...,National edeposit: Onsite at State Library Vic...,https://nla.gov.au/nla.obj-3291889291
3851,9938712173607636,Mt Buller news (Online) | Mt Buller News.,2013-,Mansfield Victoria : Mansfield Newspapers,1 online resource : colour illustrations | tex...,Newspapers,Australia--Victoria--Mount Buller | Australia-...,National edeposit: Onsite at State Library Vic...,https://nla.gov.au/nla.obj-2812928288
3939,9941359998107636,"Guardian (Swan Hill, Vic. : Online) | The Guar...",2025-,Swan Hill Victoria : SA Today Pty Ltd,1 online resource : colour illustrations | tex...,Newspapers,Australia--Victoria--Swan Hill | Australia--Vi...,,https://nla.gov.au/nla.obj-3862619827-t


Some Trove links aren't in the MARC record, but can be retrieved via the 'edelivery' JSON file. This code will check the JSON for more Trove links.

In [11]:
def get_dig_link(alma_id):
    """
    Checks an item's edelivery JSON for a link to a Trove newspaper title.
    Returns the Trove title url, or None if no link is found.
    """
    trove_url = None
    response = sess.get(f"https://find.slv.vic.gov.au/primaws/rest/pub/edelivery/alma{alma_id}?vid=61SLV_INST:SLV&lang=en&googleScholar=false&lang=en")
    data = response.json()
    for service in data.get("electronicServices", []):
        # Guard against missing or null values by falling back to an empty string
        package = service.get("packageName") or ""
        note = service.get("publicNote") or ""
        if "Trove" in package or "Trove" in note:
            # Follow the service link to see where it ends up
            tresponse = sess.get(f"https://find.slv.vic.gov.au/{service['serviceUrl']}")
            match = re.search(r"newspaper\/title\/(\d+)", tresponse.url)
            if match:
                trove_url = f"http://nla.gov.au/nla.news-title{match.group(1)}"
                print(trove_url)
                break
    return trove_url
                


df_nt = df.copy().loc[df["trove_url"].isnull()]
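
The key step in `get_dig_link` is turning the address that a service link resolves to into a persistent Trove title url. A minimal sketch of that conversion, assuming the resolved address looks like the example below (the helper name `trove_title_url` is hypothetical):

```python
import re


def trove_title_url(resolved_url):
    """Build a Trove title url from a resolved address, or return None."""
    match = re.search(r"newspaper/title/(\d+)", resolved_url)
    if match is None:
        return None
    return f"http://nla.gov.au/nla.news-title{match.group(1)}"


# Assumed example of an address that a Trove service link might resolve to
title_link = trove_title_url("https://trove.nla.gov.au/newspaper/title/589")
other_link = trove_title_url("https://example.org/some/other/page")
```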

In [12]:
df_nt["trove_url"] = df_nt["mms"].progress_apply(get_dig_link)

  0%|          | 0/3425 [00:00<?, ?it/s]

http://nla.gov.au/nla.news-title589
http://nla.gov.au/nla.news-title103
http://nla.gov.au/nla.news-title72
http://nla.gov.au/nla.news-title558


In [13]:
# Merge new urls into existing dataset
df = pd.merge(df, df_nt.loc[df_nt["trove_url"].notnull()][["mms", "trove_url"]], how="left", on="mms")

In [14]:
# Combine trove_url columns
df["trove_url"] = df["trove_url_x"].combine_first(df["trove_url_y"])

In [15]:
# remove old columns
df.drop(columns=["trove_url_x", "trove_url_y"], inplace=True)
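
A minimal sketch of the merge-and-combine steps above, run on made-up records. It uses `combine_first`, which keeps the left-hand value unless it is missing (`None` or `NaN`). The identifiers and urls are invented for illustration.

```python
import pandas as pd

# Made-up records: one already has a url, one gains a url, one has none
records = pd.DataFrame(
    {
        "mms": ["1", "2", "3"],
        "trove_url": ["http://nla.gov.au/nla.obj-1", None, None],
    }
)
found = pd.DataFrame(
    {"mms": ["2"], "trove_url": ["http://nla.gov.au/nla.news-title72"]}
)

# Overlapping column names get _x (left) and _y (right) suffixes
merged = pd.merge(records, found, how="left", on="mms")
# Prefer the existing url, falling back to the newly found one
merged["trove_url"] = merged["trove_url_x"].combine_first(merged["trove_url_y"])
merged = merged.drop(columns=["trove_url_x", "trove_url_y"])
urls = merged["trove_url"].tolist()
```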

In [802]:
# Save updated dataset with Trove links
df.to_csv("newspapers.csv", index=False)

## Find Trove newspapers that aren't in the catalogue dataset

It seems that the current dataset doesn't include all of the Victorian newspapers available through Trove. Presumably there are records in the catalogue, but either the `lds03` field hasn't been used to link them to a place, or the Trove url hasn't been added to the catalogue record.

We'll get some data from Trove and cross-check.

In [17]:
# Load the most recent Trove newspaper harvest
dft = pd.read_csv("https://raw.githubusercontent.com/wragge/trove-newspaper-totals/refs/heads/master/data/total_articles_by_newspaper.csv")

In [18]:
# Filter by state
dfv = dft.loc[dft["state"] == "Victoria"]

In [19]:
dfv

Unnamed: 0,title_id,total,title,state,issn,start_date,end_date
27,1023,855,The Melbourne Weekly Courier (Vic. : 1844 - 1845),Victoria,14403684,1844-01-06,1845-03-28
28,1024,2071,The Melbourne Courier (Vic. : 1845 - 1846),Victoria,14403692,1845-06-16,1846-03-11
29,1025,1565,Melbourne Times (Vic. : 1842 - 1843),Victoria,1440219X,1842-04-09,1843-12-08
34,103,2143,The Australian News for Home Readers (Vic. : 1...,Victoria,18373542,1864-01-25,1867-06-28
49,1043,14,"Seamen's Strike Bulletin (Melbourne, Vic. : 1919)",Victoria,2205085X,,
...,...,...,...,...,...,...,...
1787,958,2204,The Melbourne Leader (Vic. : 1861),Victoria,22044949,1861-01-12,1861-12-28
1788,959,19039,Bell's Life in Victoria and Sporting Chronicle...,Victoria,22044868,1857-01-03,1868-01-04
1790,960,36609,The Snowy River Mail and Tambo and Croajingolo...,Victoria,22044906,1890-08-09,1911-08-31
1791,961,20225,"The Tocsin (Melbourne, Vic. : 1897 - 1906)",Victoria,22044944,1897-10-02,1906-10-25


Compare the Trove list to the data from the catalogue. If the Trove url isn't in the catalogue dataset, try to find the matching record by searching for the title.

These need to be checked manually by searching the `newspapers.csv` dataset using the titles and then checking other details such as dates. If it's a match, add to `newspaper_manual_additions.csv`.

In [None]:
df_new = df.copy()

counter = 0
for np in dfv.itertuples():
    tid = f"nla.news-title{np.title_id}"
    if df_new.loc[(df_new["trove_url"].notnull()) & (df_new["trove_url"].str.contains(tid))].empty:
        # Strip whitespace last, so that removing "The" doesn't leave a leading space
        title = re.search(r"([A-Za-z\s\-\,\.']+)\(", np.title).group(1).replace("The", "").replace(" and ", " (and|&) ").strip()
        try:
            place = re.search(r"\(([A-Za-z\s]+),", np.title).group(1)
        except AttributeError:
            place = ""
        #title = np.title.split(" (")[0].replace("The", "").strip()
        
        results = df_new.loc[(df_new["title"].str.contains(title, case=False, regex=True)) & ((df_new["title"].str.contains(place, regex=False)) | (df_new["lds03"].str.contains(place, regex=False)))]
        if not results.empty:
            print("\n" + np.title)
            print(f"http://nla.gov.au/nla.news-title{np.title_id}")
            print(title, place)
            print(results["title"].to_list())
        counter += 1
counter
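
To make the matching logic easier to follow, here is a minimal sketch of how a Trove title is turned into a search pattern and a place. The helper name `search_terms` is hypothetical, the third title is made up, and this version strips whitespace after "The" has been removed.

```python
import re


def search_terms(trove_title):
    """Return a (title pattern, place) pair for searching the catalogue data."""
    # The name is the run of letters and punctuation before the opening bracket
    name = re.search(r"([A-Za-z\s\-\,\.']+)\(", trove_title).group(1)
    # Drop "The", let "and" match "&", and strip whitespace last
    pattern = name.replace("The", "").replace(" and ", " (and|&) ").strip()
    # A place is only found when the bracketed text starts with "Place,"
    place = re.search(r"\(([A-Za-z\s]+),", trove_title)
    return pattern, place.group(1) if place else ""


tocsin = search_terms("The Tocsin (Melbourne, Vic. : 1897 - 1906)")
courier = search_terms("The Melbourne Courier (Vic. : 1845 - 1846)")
# A made-up title to show the "and"/"&" alternation
made_up = search_terms("The Mail and Gazette (Exampleton, Vic. : 1900)")
matches_ampersand = bool(re.search(made_up[0], "Mail & gazette.", re.I))
```

When no place is found, the empty string matches every row in a `str.contains` test, so the search falls back to matching on title alone.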

In [168]:
# Load the additions
df_added = pd.read_csv("newspaper_manual_additions.csv")

In [169]:
def update_url_from_additions(row):
    found = df_added.loc[df_added["mms"] == int(row["mms"])]
    if not found.empty:
        return found.iloc[0]["trove_url"]
    return row["trove_url"]

In [170]:
df_new["trove_url"] = df_new.apply(update_url_from_additions, axis=1)
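
The same update can be sketched without a row-by-row function by mapping identifiers to urls. This is an alternative under stated assumptions rather than the method used above: the records are made up, and unlike `update_url_from_additions` it keeps the existing url if a manual addition has no url.

```python
import pandas as pd

# Made-up catalogue rows (string ids) and manual additions (integer ids)
titles = pd.DataFrame(
    {
        "mms": ["101", "102", "103"],
        "trove_url": [None, "http://nla.gov.au/nla.news-title5", None],
    }
)
additions = pd.DataFrame(
    {"mms": [101], "trove_url": ["http://nla.gov.au/nla.news-title9"]}
)

# Build an id -> url lookup, then map it onto the catalogue ids
lookup = additions.drop_duplicates("mms").set_index("mms")["trove_url"]
manual = titles["mms"].astype("int64").map(lookup)
# Use the manual url where there is one, otherwise keep the existing value
titles["trove_url"] = manual.combine_first(titles["trove_url"])
updated = titles["trove_url"].tolist()
```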

## Get place information

Add geospatial data to the placenames in the `lds03` field by linking them to the VicNames gazetteer.

In [172]:
# Split the pipe-separated values in lds03 into a list
df_new["place"] = df_new["lds03"].str.split(" | ", regex=False)

In [173]:
# Explode the place values into separate rows
df_new = df_new.explode("place")

In [174]:
# Only use placenames in Victoria
df_new = df_new.loc[df_new["place"].str.startswith("Australia--Victoria")]

In [175]:
# Get the specific placename at the end of the heading and clean
def clean_place(place):
    placename = place.split("--")[-1]
    placename = placename.replace(".", "")
    placename = placename.replace(",", "").strip()
    return placename
    
df_new["placename"] = df_new["place"].apply(clean_place)
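
The split, explode, filter and clean steps can be seen end to end in this minimal sketch. The two records and their subject headings are made up for illustration.

```python
import pandas as pd

# Made-up records with pipe-separated geographic headings
sample = pd.DataFrame(
    {
        "mms": ["1", "2"],
        "lds03": [
            "Australia--Victoria--Horsham | Australia--Victoria--Wimmera.",
            "Australia--New South Wales--Albury | Australia--Victoria--Wodonga",
        ],
    }
)

# One list of headings per record, then one row per heading
sample["place"] = sample["lds03"].str.split(" | ", regex=False)
sample = sample.explode("place")

# Keep Victorian headings only
sample = sample.loc[sample["place"].str.startswith("Australia--Victoria")].copy()

# The placename is the last part of the heading, minus stray punctuation
sample["placename"] = sample["place"].apply(
    lambda p: p.split("--")[-1].replace(".", "").replace(",", "").strip()
)
placenames = sample["placename"].tolist()
ids = sample["mms"].tolist()
```

Each record is repeated once per Victorian place, which is what produces the one row per place/title combination in the final dataset.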

In [176]:
#df_new.to_csv("newspapers_places.csv", index=False)

In [177]:
df_new["placename"].nunique()

822

In [178]:
# Merge in manual corrections to lds03 values. The corrections are added to the list of unmatched places created below.
# Comment this out the first time you run the notebook, then re-run with it uncommented once you've made the corrections.
df_new = pd.merge(df_new, pd.read_csv("places_to_check.csv"), how="left", on="placename")
df_new["placename_corrected"] = df_new.apply(lambda x: x["corrected"] if not pd.isnull(x["corrected"]) else x["placename"], axis=1)
df_new["placename_upper"] = df_new["placename_corrected"].str.upper()

Load place data downloaded from the VicNames Gazetteer. This was manually downloaded from the web interface. I included localities, LGAs, counties, parishes and neighborhoods in the download.

In [181]:
df_places = pd.read_csv("places.csv")
df_places["placename_upper"] = df_places["Place Name"].str.upper()

Some placenames will match multiple place entries, e.g. a town and a parish might share a name. Here we'll put the different place types in a preferred order, then drop duplicates, leaving only the first value.

In [182]:
# order types by priority in case there are dupe names
code = {
    "LOCB": 1,
    "LGA": 2,
    "CNTY": 3,
    "PRSH": 4,
    "NBHD": 5
}
df_places["feature_order"] = df_places["Feature Type Code"].apply(lambda x: code[x])

In [183]:
df_places = df_places.sort_values(["placename_upper", "feature_order"])

In [184]:
df_places.drop_duplicates("placename_upper", keep="first", inplace=True)
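
A minimal sketch of the priority-based de-duplication, using made-up gazetteer rows in which a locality and a parish share a name:

```python
import pandas as pd

# Made-up gazetteer rows: EXAMPLE is both a parish and a locality
places = pd.DataFrame(
    {
        "Place Name": ["EXAMPLE", "EXAMPLE", "OTHER"],
        "Feature Type Code": ["PRSH", "LOCB", "CNTY"],
    }
)

# Lower numbers are preferred when names clash
priority = {"LOCB": 1, "LGA": 2, "CNTY": 3, "PRSH": 4, "NBHD": 5}
places["feature_order"] = places["Feature Type Code"].map(priority)

# Sort so the preferred type comes first, then keep one row per name
deduped = places.sort_values(["Place Name", "feature_order"]).drop_duplicates(
    "Place Name", keep="first"
)
kept = deduped["Feature Type Code"].tolist()
```

The sort has to happen before `drop_duplicates`, because `keep="first"` simply retains whichever row appears first.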

In [185]:
df_places

Unnamed: 0,State,Municipality,Name Id,Place Name,Place Name Status,Feature Type Code,Feature Type,Longitude,Latitude,Place Id,placename_upper,feature_order
0,VIC,MANSFIELD SHIRE,25,A1 MINE SETTLEMENT,REGISTERED,NBHD,NEIGHBOURHOOD,146.201260,-37.499848,9127,A1 MINE SETTLEMENT,5
1,VIC,ALPINE SHIRE,100117,ABBEYARD,REGISTERED,LOCB,LOCALITY,146.752408,-37.025339,100103,ABBEYARD,1
2,VIC,YARRA CITY,100118,ABBOTSFORD,REGISTERED,LOCB,LOCALITY,144.998711,-37.802505,100104,ABBOTSFORD,1
6,VIC,MOONEE VALLEY CITY,100119,ABERFELDIE,REGISTERED,LOCB,LOCALITY,144.897934,-37.759856,100105,ABERFELDIE,1
7,VIC,BAW BAW SHIRE,100120,ABERFELDY,REGISTERED,LOCB,LOCALITY,146.378349,-37.702066,100106,ABERFELDY,1
...,...,...,...,...,...,...,...,...,...,...,...,...
9536,VIC,SOUTHERN GRAMPIANS SHIRE,30779,YUPPECKIAR,REGISTERED,PRSH,PARISH OR HUNDRED,142.507923,-37.655959,9110,YUPPECKIAR,4
9538,VIC,HUME CITY,103506,YUROKE,REGISTERED,LOCB,LOCALITY,144.863798,-37.581629,103492,YUROKE,1
9539,VIC,COLAC OTWAY SHIRE,103507,YUULONG,REGISTERED,LOCB,LOCALITY,143.307785,-38.722737,103493,YUULONG,1
9551,VIC,GREATER SHEPPARTON CITY,103508,ZEERUST,REGISTERED,LOCB,LOCALITY,145.401341,-36.273084,103494,ZEERUST,1


Now we'll merge the catalogue data with the gazetteer data, linking on placename.

In [186]:
df_merged = pd.merge(df_new, df_places, how="left", on="placename_upper")

In [188]:
df_merged.drop_duplicates(["mms", "placename_upper"], inplace=True)

In [189]:
# Save details of places without matches so they can be manually assessed and corrected.
df_merged.loc[df_merged["Place Name"].isnull()][["place_x", "placename"]].drop_duplicates().to_csv("places_to_check_new.csv", index=False)
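
As a small illustration of how unmatched places surface after a left merge, the sketch below uses one real gazetteer row from the listing above and one invented placename. It uses the `indicator` option of `pd.merge` as an alternative to testing `Place Name` for nulls.

```python
import pandas as pd

# Made-up catalogue rows; the second placename has no gazetteer entry
titles = pd.DataFrame(
    {"mms": ["1", "2"], "placename_upper": ["ABBOTSFORD", "NOT A REAL PLACE"]}
)
gazetteer = pd.DataFrame(
    {
        "placename_upper": ["ABBOTSFORD"],
        "Longitude": [144.998711],
        "Latitude": [-37.802505],
    }
)

# A left merge keeps every catalogue row; _merge records where each row came from
merged = pd.merge(titles, gazetteer, how="left", on="placename_upper", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "placename_upper"].tolist()
```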

In [190]:
df_merged.to_csv("newspapers_with_locations.csv", index=False)

## Add any Trove newspapers from Victoria that aren't in the dataset

There are still some digitised Victorian newspapers in Trove that aren't in our dataset. Here we'll load details of Trove newspapers that I've harvested previously, and add any missing Victorian titles to our dataset. The `titles-2025.csv` dataset was created by getting details of titles added since I last updated the Trove Places app from my weekly harvests (before they stopped in January). I then used the [get places from newspapers](get_places_from_newspapers.ipynb) notebook to link the new Trove titles to places.

In [191]:
# Dataset from my Trove places app
df_trove_old = pd.read_csv("trove-newspaper-titles-locations.csv")
df_trove_old = df_trove_old.loc[df_trove_old["state"] == "VIC"]
# Additional titles since my last update of Trove places
df_trove = pd.concat([df_trove_old, pd.read_csv("titles-2025.csv")])

In [192]:
df_trove["trove_url"] = df_trove["title_id"].apply(lambda x: f"http://nla.gov.au/nla.news-title{x}")

In [193]:
df_trove

Unnamed: 0,title_id,newspaper_title,state,place_id,place,latitude,longitude,trove_url
14,295,"Advertiser (Footscray, Vic. : 1914 - 1918)",VIC,VIC101188,Footscray,-37.798395,144.899441,http://nla.gov.au/nla.news-title295
15,148,"Advertiser (Hurstbridge, Vic. : 1922 - 1939)",VIC,VIC101519,Hurstbridge,-37.640201,145.193968,http://nla.gov.au/nla.news-title148
18,792,"Advocate (Melbourne, Vic. : 1868 - 1954)",VIC,VIC102000,Melbourne,-37.824302,144.973988,http://nla.gov.au/nla.news-title792
22,156,"Alexandra and Yea Standard and Yarck, Gobur, T...",VIC,VIC100121,Acheron,-37.270525,145.704075,http://nla.gov.au/nla.news-title156
23,156,"Alexandra and Yea Standard and Yarck, Gobur, T...",VIC,VIC100137,Alexandra,-37.196576,145.741741,http://nla.gov.au/nla.news-title156
...,...,...,...,...,...,...,...,...
94,1933,Coburg and Moreland Courier (Vic. : 1932),,100791,Coburg,-37.743816,144.964502,http://nla.gov.au/nla.news-title1933
95,1933,Coburg and Moreland Courier (Vic. : 1932),,17681,Moreland,-37.751515,144.957923,http://nla.gov.au/nla.news-title1933
96,1934,"The Courier (Moreland, Vic. : 1932 - 1933)",,17681,Moreland,-37.751515,144.957923,http://nla.gov.au/nla.news-title1934
97,1935,Camberwell Free Press (Vic. : 1947),,100632,Camberwell,-37.839055,145.072955,http://nla.gov.au/nla.news-title1935


Filter the Trove records to only include those where the url isn't already in our dataset.

In [194]:
urls = list(df_merged["trove_url"].unique())
df_trove_extra = df_trove.loc[~df_trove["trove_url"].isin(urls)][["newspaper_title", "place_id", "place", "latitude", "longitude", "trove_url"]]
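
A minimal sketch of the filter, using two titles from the listing above and a made-up catalogue that already contains one of them:

```python
import pandas as pd

# Made-up catalogue data that already links to title 295
catalogue = pd.DataFrame({"trove_url": ["http://nla.gov.au/nla.news-title295", None]})

trove = pd.DataFrame(
    {
        "title_id": [295, 1935],
        "newspaper_title": [
            "Advertiser (Footscray, Vic. : 1914 - 1918)",
            "Camberwell Free Press (Vic. : 1947)",
        ],
    }
)
trove["trove_url"] = "http://nla.gov.au/nla.news-title" + trove["title_id"].astype(str)

# Keep only the Trove titles whose url isn't already in the catalogue data
known = catalogue["trove_url"].unique()
extra = trove.loc[~trove["trove_url"].isin(known)]
extra_ids = extra["title_id"].tolist()
```

The comparison is an exact string match, so a url recorded with a different scheme (`https://`) or a trailing slash would be treated as missing.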

Now we'll align the column names in the two datasets.

In [196]:
df_merged.columns

Index(['mms', 'title', 'lds19', 'publisher', 'format', 'genre', 'lds03',
       'trove_note', 'trove_url', 'place_x', 'placename', 'place_y',
       'corrected', 'possible', 'placename_corrected', 'placename_upper',
       'State', 'Municipality', 'Name Id', 'Place Name', 'Place Name Status',
       'Feature Type Code', 'Feature Type', 'Longitude', 'Latitude',
       'Place Id', 'feature_order'],
      dtype='object')

In [197]:
# Filter columns
df_merged = df_merged[["mms", "title", "lds19", "publisher", "format", "genre", "lds03",
       "placename_corrected", "Name Id", "Feature Type Code", "Longitude", "Latitude", "trove_note", "trove_url"]]

In [198]:
# Rename columns
df_merged = df_merged.rename(columns={"mms": "alma_id", "placename_corrected": "placename", "lds19": "date", "Name Id": "place_id", "Feature Type Code": "feature_type_code", "Latitude": "latitude", "Longitude": "longitude"})


In [199]:
df_trove_extra.rename(columns={"newspaper_title": "title", "place": "placename"}, inplace=True)

Finally we'll merge the Trove data with the catalogue data.

In [200]:
df_combined = pd.concat([df_merged, df_trove_extra])

In [202]:
df_combined.loc[(df_combined["trove_url"].notnull()) & (df_combined["trove_url"].str.contains("nla.news", regex=False))]["trove_url"].nunique()

468

In [203]:
df_combined["alma_id"] = df_combined["alma_id"].astype("Int64")

In [204]:
def create_id(row):
    if not pd.isnull(row["alma_id"]):
        return f"a{row['alma_id']}"
    else:
        title_id = re.search(r"title(\d+)", row["trove_url"]).group(1)
        return f"t{title_id}"

df_combined["id"] = df_combined.apply(create_id, axis=1)
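
A minimal sketch of the identifier scheme on two rows, one from the catalogue and one from Trove only. The helper `make_id` is a hypothetical variant of `create_id` that takes the two values directly instead of a row.

```python
import re

import pandas as pd

combined = pd.DataFrame(
    {
        # Nullable integers keep the ids whole even when some are missing
        "alma_id": pd.array([9940656140907636, None], dtype="Int64"),
        "trove_url": [
            "https://nla.gov.au/nla.obj-3197996265",
            "http://nla.gov.au/nla.news-title1933",
        ],
    }
)


def make_id(alma_id, trove_url):
    """Prefix catalogue ids with 'a' and Trove title ids with 't'."""
    if not pd.isnull(alma_id):
        return f"a{alma_id}"
    return "t" + re.search(r"title(\d+)", trove_url).group(1)


combined["id"] = [
    make_id(a, u) for a, u in zip(combined["alma_id"], combined["trove_url"])
]
new_ids = combined["id"].tolist()
```

The prefixes keep the two sets of identifiers from colliding, since catalogue and Trove numbers are drawn from unrelated sequences.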

In [205]:
df_combined.to_csv("newspapers_combined.csv", index=False)