# Finding non-English newspapers in Trove

There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.

## How not to do it...

My first thought was I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that catalogue metadata would include language information. For example, you can search for newspapers using `format:Periodical/Newspaper` in the books and libraries category (or the `article` API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Here's the [sort of results](https://trove.nla.gov.au/search/category/books?keyword=%22trove.nla.gov.au%22%20format%3APeriodical%2FNewspaper) you get. Unfortunately, you only get about 826 results and there are many more newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded.

My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.

``` python
params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}
newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
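    issn = newspaper.get('issn')
    if not issn:
        # Not all titles have an ISSN, so there's nothing to search on
        print('No ISSN')
        continue
    params['q'] = f'issn:{issn}'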
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        for work in works:
            print(work.get('language'))
    if not response.from_cache:
        time.sleep(0.2)
```

The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...

## How I actually did it

If I couldn't get language details from metadata, then I had to try and extract it from the resource itself. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found [pycld3](https://pypi.org/project/pycld3/) which installed with `pip`, and *just worked*.

My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To try and provide a big enough amount of text for the language detector to work with, I set the number of words parameter to return articles with between 100 and 1000 words. So the query parameters I used were:

``` python
params = {
    'zone': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}
```

Because some of the newspapers had short runs and the word count filter limits the results, I found that I wasn't always getting 100 results per newspaper. To work around this I found the likely language for each article, aggregated the counts, and then divided each count by the number of articles with a reliable detection. This gave me the proportion of articles in each language – a number I could compare across newspapers to find the non-English titles.
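The aggregation step can be sketched in plain Python. The function name `language_proportions` and the sample codes are made up for illustration; the notebook does the same thing inline in its harvesting loop.

```python
from collections import Counter


def language_proportions(langs):
    """Return {language_code: share of articles} for one newspaper's sample."""
    counts = Counter(langs)
    total = sum(counts.values())
    # An empty sample (no reliable detections) simply yields an empty result
    return {lang: count / total for lang, count in counts.items()}


# Hypothetical sample: four articles with a reliable detection
sample = ["de", "de", "de", "en"]
print(language_proportions(sample))
```

Dividing by the number of detections, rather than by 100, keeps the proportions comparable between newspapers that returned very different numbers of articles.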

In general this worked pretty well, and the result was a [list of 52 newspapers](non-english-newspapers.md) (also as a [Gist](https://gist.github.com/wragge/9aa385648cff5f0de0c7d4837896df97)) that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.

## Problems / limitations

* It's no surprise that the results of the language detection are affected by the quality of the OCR. 
* In filtering out what seems to be the product of dodgy OCR, it's possible that I might be excluding some non-English content. 
* I'm only detecting the predominant language for each article, so there might be articles containing a mix of languages that are being missed. 
* I'm just taking the first 100 results from a blank search in each newspaper. Larger, or more randomised samples might produce different results.
* Some dodgy detection results remain in the list of newspapers, but the point of this exercise was to find non-English newspapers. If you wanted to accurately determine the quantity of non-English content, you'd have to do a lot more fine-grained analysis.
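
On the mixed-language limitation: pycld3 also offers `cld3.get_frequent_languages()`, which returns several predictions for one text along with the proportion of the text each covers. The sketch below is not used anywhere in this notebook. The helper name, the `min_share` cut-off, and the stand-in detector are all assumptions made so the example runs without pycld3 installed.

```python
from collections import namedtuple


def detect_mixed_languages(text, detector, min_share=0.2, num_langs=3):
    """Return (language, share) pairs for reliable predictions covering at least min_share of the text."""
    predictions = detector(text, num_langs=num_langs)
    return [
        (p.language, p.proportion)
        for p in predictions
        if p.is_reliable and p.proportion >= min_share
    ]


# Stand-in detector with the same fields pycld3 reports on each prediction
FakePrediction = namedtuple("FakePrediction", "language probability is_reliable proportion")


def fake_detector(text, num_langs=3):
    return [
        FakePrediction("en", 0.99, True, 0.7),
        FakePrediction("it", 0.95, True, 0.3),
    ][:num_langs]


print(detect_mixed_languages("some article text", fake_detector))
```

With pycld3 available you would pass `cld3.get_frequent_languages` in place of `fake_detector`. Whether this copes any better with bad OCR is untested here.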

## Import what we need

In [1]:
import os
import re
import time
from collections import Counter
from pathlib import Path

import altair as alt
import cld3
import pandas as pd
import requests_cache
from IPython.display import display
from language_tags import tags
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

In [2]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [3]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## Harvest the data and run language detection on articles

In [4]:
def get_newspapers():
    """
    Get a list of newspapers in Trove.
    """
    response = s.get(
        "https://api.trove.nla.gov.au/v2/newspaper/titles",
        params={"encoding": "json", "key": API_KEY},
    )
    data = response.json()
    return data["response"]["records"]["newspaper"]

In [5]:
params = {
    "zone": "newspaper",
    "encoding": "json",
    # 'l-category': 'Article',
    "l-word": "100 - 1000 Words",
    "include": "articletext",
    "key": API_KEY,
    "q": " ",
    "n": 100,
}
newspaper_langs = []
newspapers = get_newspapers()
for newspaper in tqdm(newspapers):
    langs = []
    # print(f'\n{newspaper["title"]}')
    params["l-title"] = newspaper["id"]
    response = s.get("https://api.trove.nla.gov.au/v2/result", params=params)
    data = response.json()
    n = data["response"]["zone"][0]["records"]["n"]
    try:
        articles = data["response"]["zone"][0]["records"]["article"]
    except KeyError:
        # print('Not found')
        pass
    else:
        # Detect language for each article in results
        for article in articles:
            if "articleText" in article:
                # Clean up OCRd text by removing tags and extra whitespace
                text = article["articleText"]
                text = re.sub(r"<[^<]+?>", "", text)
                text = re.sub(r"\s\s+", " ", text)
                # Get the language
                ld = cld3.get_language(text)
                # If the language prediction is reliable, save it
                if ld.is_reliable:
                    langs.append(ld.language)
        # Find the count of each language detected in the sample of articles
        for lang, count in dict(Counter(langs)).items():
            # Calculate the language count as a proportion of the articles with a reliable detection
            prop = count / len(langs)
            newspaper_langs.append(
                {
                    "id": newspaper["id"],
                    "title": newspaper["title"],
                    "language": lang,
                    "proportion": prop,
                    "number": n,
                }
            )
    if not response.from_cache:
        time.sleep(0.2)

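The text clean-up inside the loop can be pulled out and tried on its own. This is a small sketch using the same two regular expressions as above; the function name `clean_ocr_text` and the sample string are hypothetical.

```python
import re


def clean_ocr_text(text):
    """Strip HTML-style tags, then collapse runs of whitespace into single spaces."""
    text = re.sub(r"<[^<]+?>", "", text)
    text = re.sub(r"\s\s+", " ", text)
    return text


sample = "<p><span>  Die   Zeitung </span></p> <p><span>erscheint  heute.</span></p>"
print(repr(clean_ocr_text(sample)))
```

One thing to be aware of: the tag pattern removes anything that looks like `<...>`, so angle brackets produced by bad OCR will also be stripped from the text.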

Convert the results into a dataframe.

In [6]:
df = pd.DataFrame(newspaper_langs)
df.head()

| | id | title | language | proportion | number |
|---|---|---|---|---|---|
| 0 | 166 | Canberra Community News (ACT : 1925 - 1927) | en | 1.0 | 100 |
| 1 | 165 | Canberra Illustrated: A Quarterly Magazine (AC... | en | 1.0 | 29 |
| 2 | 69 | Federal Capital Pioneer (Canberra, ACT : 1924 ... | en | 1.0 | 100 |
| 3 | 871 | Good Neighbour (ACT : 1950 - 1969) | en | 1.0 | 100 |
| 4 | 665 | Student Notes/Canberra University College Stud... | en | 1.0 | 100 |


## Add full language names

The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the [language-tags](https://github.com/OnroerendErfgoed/language-tags) package.

In [7]:
def get_full_language(lc):
    """
    Get full language names from codes
    """
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc


df["language_full"] = df["language"].apply(get_full_language)
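
Some of the codes the detector returns carry a script subtag as well as a language, for example `ru-Latn` (Russian written in Latin characters). Here's a minimal sketch for separating the two parts, assuming you want to treat such results differently. The function name is hypothetical and nothing in the notebook depends on it.

```python
def split_language_code(code):
    """Split a code such as 'ru-Latn' into (primary language, script or None)."""
    parts = code.split("-")
    primary = parts[0].lower()
    script = None
    # In BCP 47, a four-letter subtag directly after the language is a script code
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha():
        script = parts[1].title()
    return primary, script


for code in ["en", "ru-Latn", "bg-Latn"]:
    print(code, split_language_code(code))
```

A romanised-script result in a newspaper that is otherwise in English might be worth a closer look when checking for false positives.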

## Filtering the results

If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were four newspapers publishing articles in Igbo (the language of the Igbo people in south-eastern Nigeria) seems small. Obviously there are a considerable number of false positives here.

In [8]:
df["language_full"].value_counts()

English                  1680
Maltese                   177
Japanese                   28
Italian                    22
Somali                     18
German                     16
Welsh                      15
Catalan                    12
Portuguese                  9
Norwegian                   9
Chinese                     8
Estonian                    7
Danish                      7
Hindi                       6
French                      6
Western Frisian             6
Corsican                    6
Hawaiian                    4
Bulgarian                   4
Vietnamese                  4
Polish                      4
Igbo                        4
Indonesian                  4
Modern Greek (1453-)        4
Luxembourgish               3
Javanese                    3
Yiddish                     3
Dutch                       3
Scottish Gaelic             3
Swedish                     3
Czech                       2
Samoan                      2
Latin                       2
Kurdish   

Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.

In [9]:
df.loc[df["proportion"] == 1]["language_full"].value_counts()

English                 1422
German                     3
Italian                    3
Modern Greek (1453-)       2
Estonian                   1
Yiddish                    1
Name: language_full, dtype: int64

If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.

In [10]:
alt.Chart(df).mark_bar().encode(x=alt.X("proportion:Q", bin=True), y="count():Q")

If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100) we see that they're mostly less than 0.01 (or 1 article in 100). It seems likely that these are false positives.

In [11]:
alt.Chart(df.loc[df["proportion"] < 0.1]).mark_bar().encode(
    x=alt.X("proportion:Q", bin=True), y="count():Q"
)

Let's be fairly conservative and filter out languages that have a proportion (per newspaper) less than 0.05. This list seems a bit more in line with what we would expect, but there are still some surprises – 33 newspapers published articles in Maltese?

In [12]:
df.loc[df["proportion"] >= 0.05]["language_full"].value_counts()

English                  1670
Maltese                    33
Italian                    15
German                      9
Chinese                     8
Somali                      5
Modern Greek (1453-)        4
Japanese                    3
Portuguese                  3
Yiddish                     3
French                      3
Polish                      3
Western Frisian             2
Dutch                       2
Malay (macrolanguage)       1
Lithuanian                  1
Ukrainian                   1
Estonian                    1
Indonesian                  1
Vietnamese                  1
Danish                      1
Swedish                     1
Bosnian                     1
Russian                     1
Scottish Gaelic             1
Welsh                       1
Spanish                     1
Corsican                    1
Macedonian                  1
Bulgarian                   1
Name: language_full, dtype: int64
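
To make the effect of the threshold concrete, here's the same count written as a plain Python sketch. The records are invented and the function name `count_languages` is hypothetical; the notebook itself does this with pandas.

```python
from collections import Counter


def count_languages(records, min_proportion=0.05):
    """Count how many newspapers have each language at or above the threshold."""
    return Counter(
        r["language"] for r in records if r["proportion"] >= min_proportion
    )


# Invented rows in the same shape as the harvested results
records = [
    {"id": "1", "language": "en", "proportion": 0.96},
    {"id": "1", "language": "mt", "proportion": 0.04},
    {"id": "2", "language": "de", "proportion": 1.0},
    {"id": "3", "language": "en", "proportion": 0.9},
    {"id": "3", "language": "it", "proportion": 0.1},
]
print(count_languages(records))
```

Raising `min_proportion` trims more of the low-frequency detections, at the risk of dropping newspapers that really do carry a small amount of non-English content.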

If we focus in on the newspapers that supposedly have a significant proportion of articles in Maltese, we see some very strange results. I seriously doubt that more than three-quarters of the *Mildura Irrigationist* from 1892-3 is in Maltese. So what's going on?

In [13]:
df.loc[(df["proportion"] > 0.1) & (df["language_full"] == "Maltese")]

| | id | title | language | proportion | number | language_full |
|---|---|---|---|---|---|---|
| 203 | 1596 | L'Italo-Australiano = The Italo-Australian (Su... | mt | 0.206349 | 100 | Maltese |
| 270 | 389 | Reporter and Illawarra Journal (Kiama, NSW : 1... | mt | 0.105882 | 100 | Maltese |
| 286 | 418 | Southern Morning Herald (Goulburn, NSW : 1920 ... | mt | 0.146667 | 100 | Maltese |
| 289 | 623 | Sunday News (Sydney, NSW : 1919) | mt | 0.181818 | 100 | Maltese |
| 530 | 500 | The Richmond River Express and Casino Kyogle A... | mt | 0.126437 | 100 | Maltese |
| 654 | 810 | Upper Hunter Courier (Murrurundi, NSW : 1871) | mt | 0.142857 | 14 | Maltese |
| 812 | 892 | Warwick Daily News (Qld. : 1919 -1954) | mt | 0.111111 | 100 | Maltese |
| 928 | 34 | The Advertiser (Adelaide, SA : 1889 - 1931) | mt | 0.486111 | 100 | Maltese |
| 1205 | 543 | Cobden Times (Vic. : 1918) | mt | 0.10989 | 100 | Maltese |
| 1375 | 384 | North Melbourne Gazette (Vic. : 1894 - 1901) | mt | 0.189873 | 100 | Maltese |


If you look at results for the *Mildura Irrigationist* [in Trove](https://trove.nla.gov.au/search/advanced/category/newspapers?l-advtitle=1583&l-advWord=100%20-%201000%20Words) you'll see that many of the page images are blurry, and as a result the OCR is very, very bad. Here's a sample:

> ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lH*a ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im*4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiw*a afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^*— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f**' ""j •fria—lhati* tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aM*toclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af <M>t«l. i pwwiaf Mtan (tot jw. twy MwUI «*a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa

What happens when we feed this fragment of bad OCR to the language detector? Remarkably, the language detector is 96% sure that it's Maltese! To find out why this is the case, we'd probably have to dig into the way the language detection model was trained. But for our purposes it's enough to know that some of the languages detected seem to be the result of bad OCR.

In [14]:
ocr = """ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lH*a ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im*4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiw*a afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^*— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f**' ""j •fria—lhati* tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aM*toclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af <M>t«l. i pwwiaf Mtan (tot jw. twy MwUI «*a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa"""
cld3.get_language(ocr)

LanguagePrediction(language='mt', probability=0.960280179977417, is_reliable=True, proportion=1.0)
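
It's worth noting that a stricter confidence cut-off wouldn't have caught this one. The sketch below uses a stand-in tuple with the same fields as the prediction printed above (it is not pycld3's own class), and the `min_probability` value of 0.9 is an arbitrary assumption.

```python
from collections import namedtuple

# Stand-in with the same fields as the LanguagePrediction shown above
Prediction = namedtuple("Prediction", "language probability is_reliable proportion")


def passes_threshold(prediction, min_probability=0.9):
    """Keep a prediction only if it is flagged reliable and clears the cut-off."""
    return prediction.is_reliable and prediction.probability >= min_probability


# The values reported for the bad OCR sample
bad_ocr = Prediction("mt", 0.960280179977417, True, 1.0)
print(passes_threshold(bad_ocr))
```

The bad OCR sample sails through, which is why the filtering that follows relies on proportions and manual inspection rather than on the detector's own confidence.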

Of course there might actually be newspapers with articles in Maltese, so we don't want to filter them all out. So let's do some manual inspection of the newspapers that *seem* to have non-English content. First we'll filter our results to include only languages with proportions of at least 0.05, and then drop out newspapers that seem to be only in English. We end up with 89 different titles.

In [15]:
# The filter on the groupby drops out newspapers that only have articles in English.
filtered = (
    df.loc[df["proportion"] >= 0.05]
    .groupby(by=["title", "id"])
    .filter(lambda x: (x["language"] != "en").any())
)
papers = filtered.groupby(by=["title", "id"])
len(papers)

89
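
The logic of that filter, written out as a plain Python sketch for anyone not familiar with pandas `groupby().filter()`. The records and the function name `non_english_titles` are hypothetical.

```python
from collections import defaultdict


def non_english_titles(records, min_proportion=0.05):
    """Return ids of newspapers with at least one non-English language above the threshold."""
    languages = defaultdict(set)
    for r in records:
        if r["proportion"] >= min_proportion:
            languages[r["id"]].add(r["language"])
    # Drop newspapers whose only remaining language is English
    return {title_id for title_id, langs in languages.items() if langs != {"en"}}


# Invented rows in the same shape as the harvested results
records = [
    {"id": "1", "language": "en", "proportion": 0.96},
    {"id": "1", "language": "mt", "proportion": 0.04},
    {"id": "2", "language": "de", "proportion": 1.0},
    {"id": "3", "language": "en", "proportion": 0.9},
    {"id": "3", "language": "it", "proportion": 0.1},
]
print(sorted(non_english_titles(records)))
```

Newspaper `1` drops out because its Maltese share falls below the threshold, leaving English as its only language.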

Let's list those 89 newspapers. From the list below, I think it's pretty easy to pick out the results that are likely to be the product of bad OCR.

In [16]:
for name, group in papers:
    # Build the mask from the group itself rather than the full dataframe
    if not group.loc[
        (~group["language"].isin(["en"])) & (group["proportion"] >= 0.05)
    ].empty:
        print(f"\n{name[0]} ({name[1]})")
        display(
            group[["language_full", "language", "proportion"]]
            .loc[(group["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
        )


A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)


Unnamed: 0,language_full,language,proportion
8,Portuguese,pt,0.988889



Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)


Unnamed: 0,language_full,language,proportion
828,German,de,1.0



Auburn and District News (NSW : 1929) (1320)


Unnamed: 0,language_full,language,proportion
43,English,en,0.947368
44,Vietnamese,vi,0.052632



Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933) (1686)


Unnamed: 0,language_full,language,proportion
1158,Yiddish,yi,1.0



Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)


Unnamed: 0,language_full,language,proportion
832,German,de,1.0



Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)


Unnamed: 0,language_full,language,proportion
14,Malay (macrolanguage),ms,0.891304
15,Indonesian,id,0.108696



Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)


Unnamed: 0,language_full,language,proportion
83,Chinese,zh,0.928571



Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)


Unnamed: 0,language_full,language,proportion
1194,Chinese,zh,0.918367



Chronicle and North Coast Advertiser (Qld. : 1903 - 1922) (286)


Unnamed: 0,language_full,language,proportion
695,English,en,0.94898
696,Maltese,mt,0.05102



Chung Wah News (Perth, WA : 1981 - 1987) (1383)


Unnamed: 0,language_full,language,proportion
1694,English,en,0.566667
1693,Chinese,zh,0.388889



Cobden Times (Vic. : 1918) (543)


Unnamed: 0,language_full,language,proportion
1204,English,en,0.857143
1205,Maltese,mt,0.10989



Colac Reformer (Vic. : 1914 - 1918) (763)


Unnamed: 0,language_full,language,proportion
1214,English,en,0.947368
1215,Maltese,mt,0.052632



Daily Post (Hobart, Tas. : 1908 - 1918) (860)


Unnamed: 0,language_full,language,proportion
1011,English,en,0.719101
1012,Japanese,ja,0.11236



Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)


Unnamed: 0,language_full,language,proportion
1716,German,de,0.82
1717,English,en,0.18



Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)


Unnamed: 0,language_full,language,proportion
125,German,de,1.0



Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)


Unnamed: 0,language_full,language,proportion
844,German,de,0.9
843,English,en,0.1



Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)


Unnamed: 0,language_full,language,proportion
126,German,de,0.704082
127,English,en,0.295918



Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)


Unnamed: 0,language_full,language,proportion
845,German,de,0.989583



Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)


Unnamed: 0,language_full,language,proportion
131,Dutch,nl,0.969697



Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)


Unnamed: 0,language_full,language,proportion
134,Dutch,nl,0.919192
135,English,en,0.060606



Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)


Unnamed: 0,language_full,language,proportion
1721,Polish,pl,0.91
1722,English,en,0.09



Eco Italiano (Perth, WA : 1958 - 1959) (1387)


Unnamed: 0,language_full,language,proportion
1723,Italian,it,1.0



Emu Bay Times and North West and West Coast Advocate (Tas. : 1897 - 1899) (116)


Unnamed: 0,language_full,language,proportion
1027,English,en,0.933333
1028,Maltese,mt,0.066667



Evelyn Observer, and South and East Bourke Record (Vic. : 1882 - 1902) (145)


Unnamed: 0,language_full,language,proportion
1241,English,en,0.913978
1240,Maltese,mt,0.075269



Geraldton Advocate and Johnstone River Guardian (Qld. : 1895 - 1896) (1103)


Unnamed: 0,language_full,language,proportion
704,English,en,0.947917
705,Maltese,mt,0.052083



Geraldton Express and Murchison Goldfields News (WA : 1894 - 1896) (1623)


Unnamed: 0,language_full,language,proportion
1734,English,en,0.643836
1735,Maltese,mt,0.09589
1739,Japanese,ja,0.068493



Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)


Unnamed: 0,language_full,language,proportion
162,Chinese,zh,0.854167
165,Western Frisian,fy,0.0625



Hamilton Spectator and Grange District Advertiser (Vic. : 1860 - 1870) (927)


Unnamed: 0,language_full,language,proportion
1282,English,en,0.915789
1283,Maltese,mt,0.073684



Hellenic Echo (Perth, WA : 1967 - 1968) (1389)


Unnamed: 0,language_full,language,proportion
1771,Modern Greek (1453-),el,1.0



Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)


Unnamed: 0,language_full,language,proportion
1773,Italian,it,0.97



Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)


Unnamed: 0,language_full,language,proportion
175,Italian,it,0.91
176,English,en,0.09



Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)


Unnamed: 0,language_full,language,proportion
177,Italian,it,0.75
178,English,en,0.25



Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)


Unnamed: 0,language_full,language,proportion
188,English,en,0.833333
189,Italian,it,0.166667



Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)


Unnamed: 0,language_full,language,proportion
190,English,en,0.893617
191,Italian,it,0.106383



Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)


Unnamed: 0,language_full,language,proportion
192,Italian,it,0.97



Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)


Unnamed: 0,language_full,language,proportion
1777,Japanese,ja,0.9375



Kyabram Union (Vic. : 1886 - 1894) (196)


Unnamed: 0,language_full,language,proportion
1326,English,en,0.931818
1327,Maltese,mt,0.068182



L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)


Unnamed: 0,language_full,language,proportion
202,Italian,it,0.698413
203,Maltese,mt,0.206349



L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)


Unnamed: 0,language_full,language,proportion
208,Italian,it,0.97



La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) (1388)


Unnamed: 0,language_full,language,proportion
1796,Italian,it,0.98



Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)


Unnamed: 0,language_full,language,proportion
212,French,fr,0.76
213,English,en,0.24



Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)


Unnamed: 0,language_full,language,proportion
1815,Modern Greek (1453-),el,0.357143
1814,English,en,0.22449
1816,Portuguese,pt,0.153061
1809,French,fr,0.081633
1808,Spanish,es,0.061224



Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956) (280)


Unnamed: 0,language_full,language,proportion
221,Estonian,et,1.0



Murchison Times and Cue-Big Bell-Reedy Advocate (WA : 1937 - 1942) (1543)


Unnamed: 0,language_full,language,proportion
1838,English,en,0.892857
1839,Maltese,mt,0.071429



Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954) (1594)


Unnamed: 0,language_full,language,proportion
233,Lithuanian,lt,0.95



Nasza droga (Adelaide, SA : 1952 - 1954) (1323)


Unnamed: 0,language_full,language,proportion
869,Polish,pl,0.89
870,English,en,0.11



Norden (Melbourne, Vic. : 1914 - 1918) (797)


Unnamed: 0,language_full,language,proportion
1366,Danish,da,0.752809
1369,Swedish,sv,0.11236
1367,English,en,0.067416



North Melbourne Gazette (Vic. : 1894 - 1901) (384)


Unnamed: 0,language_full,language,proportion
1374,English,en,0.78481
1375,Maltese,mt,0.189873



Oceania (Sydney, NSW : 1913 - 1915) (1598)


Unnamed: 0,language_full,language,proportion
254,Italian,it,0.54
255,English,en,0.46



Reporter and Illawarra Journal (Kiama, NSW : 1887 - 1894) (389)


Unnamed: 0,language_full,language,proportion
269,English,en,0.894118
270,Maltese,mt,0.105882



Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874) (1604)


Unnamed: 0,language_full,language,proportion
271,French,fr,0.98



Ringwood and Croydon Chronicle (Vic. : 1914 - 1918) (329)


Unnamed: 0,language_full,language,proportion
1422,English,en,0.938144
1423,Maltese,mt,0.061856



Sandringham Southern Cross (Vic. : 1914 - 1918) (318)


Unnamed: 0,language_full,language,proportion
1430,English,en,0.731707
1431,Maltese,mt,0.243902



Seamen's Strike Bulletin (Melbourne, Vic. : 1919) (1043)


Unnamed: 0,language_full,language,proportion
1436,Polish,pl,0.4
1435,Bosnian,bs,0.2
1437,Russian,ru-Latn,0.2
1438,Western Frisian,fy,0.2



Southern Morning Herald (Goulburn, NSW : 1920 - 1923) (418)


Unnamed: 0,language_full,language,proportion
285,English,en,0.8
286,Maltese,mt,0.146667
287,Somali,so,0.053333



Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)


Unnamed: 0,language_full,language,proportion
1881,Italian,it,0.97



Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)


Unnamed: 0,language_full,language,proportion
924,German,de,0.888889
925,English,en,0.111111



Sunday News (Sydney, NSW : 1919) (623)


Unnamed: 0,language_full,language,proportion
290,English,en,0.779221
289,Maltese,mt,0.181818



Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)


Unnamed: 0,language_full,language,proportion
1888,Italian,it,1.0



Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)


Unnamed: 0,language_full,language,proportion
922,German,de,0.989691



The Advertiser (Adelaide, SA : 1889 - 1931) (34)


Unnamed: 0,language_full,language,proportion
927,English,en,0.513889
928,Maltese,mt,0.486111



The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)


Unnamed: 0,language_full,language,proportion
1473,English,en,0.810526
1475,Yiddish,yi,0.157895



The Castlereagh (Gilgandra, NSW : 1905 - 1907) (224)


Unnamed: 0,language_full,language,proportion
384,English,en,0.609195
385,Somali,so,0.310345
386,Maltese,mt,0.08046



The Chinese Advertiser (Ballarat, Vic. : 1856) (706)


Unnamed: 0,language_full,language,proportion
1504,Chinese,zh,0.5
1506,English,en,0.333333
1505,Scottish Gaelic,gd,0.166667



The Derby News (WA : 1887) (1617)


Unnamed: 0,language_full,language,proportion
1927,Maltese,mt,0.75
1928,Corsican,co,0.25



The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)


Unnamed: 0,language_full,language,proportion
1522,English,en,0.894737
1523,Chinese,zh,0.052632
1524,Maltese,mt,0.052632



The Hay Standard and Advertiser for Balranald, Wentworth, Maude...(Hay, NSW : 1871 - 1873; 1880 - 1881; 1890 - 1900) (725)


Unnamed: 0,language_full,language,proportion
441,English,en,0.947368
442,Maltese,mt,0.052632



The Herald of Tasmania (Hobart, Tas. : 1845) (1741)


Unnamed: 0,language_full,language,proportion
1083,English,en,0.857143
1085,Italian,it,0.095238



The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)


Unnamed: 0,language_full,language,proportion
1535,English,en,0.81
1536,Yiddish,yi,0.19



The Melbourne Advertiser (Vic. : 1838) (935)


Unnamed: 0,language_full,language,proportion
1550,English,en,0.666667
1551,Welsh,cy,0.333333



The Mildura Irrigationist (Vic. : 1892 - 1893) (1583)


Unnamed: 0,language_full,language,proportion
1565,Maltese,mt,0.7625
1564,English,en,0.125
1566,Somali,so,0.1125



The Mildura Irrigationist and Murray River Agricultural Times (Vic. : 1888) (1581)


Unnamed: 0,language_full,language,proportion
1568,Maltese,mt,0.626667
1569,English,en,0.24
1567,Somali,so,0.133333



The Mildura Irrigationist and Murray River Cultural Advocate (Vic. : 1891 - 1892) (1582)


Unnamed: 0,language_full,language,proportion
1570,English,en,0.746667
1571,Somali,so,0.146667
1572,Maltese,mt,0.093333



The Miner's Right (Boulder, WA : 1897) (1638)


Unnamed: 0,language_full,language,proportion
1984,English,en,0.908163
1986,Maltese,mt,0.061224



The Morwell Advocate and Boolara and Mirboo Chronicle (Vic. : 1886) (1733)


Unnamed: 0,language_full,language,proportion
1577,Maltese,mt,0.625
1578,English,en,0.375



The Morwell Advocate and Narracan, Boolara and Mirboo Chronicle (Vic. : 1886) (1734)


Unnamed: 0,language_full,language,proportion
1579,English,en,0.829268
1580,Maltese,mt,0.170732



The Reporter (Box Hill, Vic. : 1889 - 1925) (244)


Unnamed: 0,language_full,language,proportion
1594,English,en,0.904255
1593,Maltese,mt,0.085106



The Richmond River Express and Casino Kyogle Advertiser (NSW : 1904 - 1929) (500)


Unnamed: 0,language_full,language,proportion
532,English,en,0.827586
530,Maltese,mt,0.126437



The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)


Unnamed: 0,language_full,language,proportion
2064,Modern Greek (1453-),el,0.98



To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)


Unnamed: 0,language_full,language,proportion
626,Modern Greek (1453-),el,1.0



Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)


Unnamed: 0,language_full,language,proportion
632,Chinese,zh,0.94



Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)


Unnamed: 0,language_full,language,proportion
638,Chinese,zh,0.926316



Twofold Bay and Maneroo Observer (NSW : 1860) (394)


Unnamed: 0,language_full,language,proportion
645,English,en,0.886364
647,Maltese,mt,0.090909



Uniamoci (Sydney, NSW : 1903 - 1904) (1599)


Unnamed: 0,language_full,language,proportion
652,Italian,it,1.0



Upper Hunter Courier (Murrurundi, NSW : 1871) (810)


Unnamed: 0,language_full,language,proportion
653,English,en,0.857143
654,Maltese,mt,0.142857



Vesnik (Perth, WA : 1975 - 1994) (1382)


Unnamed: 0,language_full,language,proportion
2093,Macedonian,mk,0.408163
2092,English,en,0.357143
2094,Bulgarian,bg-Latn,0.22449



Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)


Unnamed: 0,language_full,language,proportion
655,Ukrainian,uk,0.82
656,English,en,0.18



Warwick Daily News (Qld. : 1919 -1954) (892)


Unnamed: 0,language_full,language,proportion
811,English,en,0.864198
812,Maltese,mt,0.111111



Williamstown Trade Circular (Vic. : 1855 - 1856) (213)


Unnamed: 0,language_full,language,proportion
1658,English,en,0.882353
1659,Portuguese,pt,0.117647


I went through the titles above and compiled a list of title identifiers that seem to be producing dodgy results. We can use this to filter these newspapers out of our results.

In [17]:
# Titles where dodgy OCR causes false positives in language detection
# This was manually created after scanning results
dodgy = [
    "1036",
    "1043",
    "1103",
    "116",
    "1207",
    "1265",
    "13",
    "1320",
    "1336",
    "140",
    "1400",
    "145",
    "1488",
    "1543",
    "1546",
    "1581",
    "1582",
    "1583",
    "1617",
    "1623",
    "1626",
    "1638",
    "1675",
    "1678",
    "171",
    "1733",
    "1734",
    "1741",
    "196",
    "213",
    "224",
    "244",
    "286",
    "292",
    "318",
    "329",
    "34",
    "384",
    "389",
    "394",
    "418",
    "430",
    "431",
    "452",
    "479",
    "499",
    "500",
    "543",
    "570",
    "623",
    "725",
    "763",
    "810",
    "860",
    "886",
    "892",
    "906",
    "92",
    "926",
    "927",
    "935",
    "937",
    "94",
    "946",
    "970",
    "986",
]
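
The list above was compiled by eye. As a rough aid to that sort of review, here's a minimal sketch that flags titles where English dominates and every other detected language is only a small share of the sample, which is the pattern many of the false positives above seem to follow. The `flag_for_review` function and its 0.8 and 0.2 thresholds are assumptions for illustration, not part of the original analysis, and the toy data just reuses three of the results shown above. Genuinely bilingual papers could be flagged too, so anything it returns would still need checking by hand.

``` python
import pandas as pd


def flag_for_review(df, min_english=0.8, max_other=0.2):
    """Return ids of titles where English dominates and other languages are minor.

    Hypothetical helper: the thresholds are guesses, not values from the analysis.
    """
    flagged = []
    for title_id, group in df.groupby("id"):
        english = group.loc[group["language"] == "en", "proportion"].sum()
        others = group.loc[group["language"] != "en", "proportion"]
        if english >= min_english and not others.empty and others.max() <= max_other:
            flagged.append(title_id)
    return flagged


# Toy data taken from three of the titles listed above
sample = pd.DataFrame(
    [
        {"id": "244", "language": "en", "proportion": 0.904255},
        {"id": "244", "language": "mt", "proportion": 0.085106},
        {"id": "1599", "language": "it", "proportion": 1.0},
        {"id": "1593", "language": "uk", "proportion": 0.82},
        {"id": "1593", "language": "en", "proportion": 0.18},
    ]
)
candidates = flag_for_review(sample)
print(candidates)
```

Only *The Reporter* (244) is flagged here, because it's the only title in the toy data that is mostly English with a small amount of something else.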

Here we'll exclude the dodgy title ids and ignore any language that makes up less than 5% of a title's sample. That seems to leave us with 52 newspapers that have significant amounts of non-English content.

In [18]:
# Drop the dodgy titles and any language below 5% of a title's sample,
# then remove titles where the only remaining language is English
filtered = (
    df.loc[(~df["id"].isin(dodgy)) & (df["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (x["language"].iloc[0] != "en"))
)
papers = filtered.groupby(by=["title", "id"])
len(papers)

52
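
To make the filter's behaviour a bit more concrete, here's a small worked example on made-up data. The paper names, ids, and proportions are all invented for illustration, and the single-language check is written with `.iloc[0]` so that the lambda returns a plain boolean.

``` python
import pandas as pd

toy = pd.DataFrame(
    [
        # English only: dropped by the filter
        {"title": "Paper A", "id": "1", "language": "en", "proportion": 1.0},
        # A single non-English language: kept
        {"title": "Paper B", "id": "2", "language": "it", "proportion": 1.0},
        # Two languages above the threshold: kept
        {"title": "Paper C", "id": "3", "language": "uk", "proportion": 0.82},
        {"title": "Paper C", "id": "3", "language": "en", "proportion": 0.18},
        # Second language below 5%: only English remains, so dropped
        {"title": "Paper D", "id": "4", "language": "en", "proportion": 0.97},
        {"title": "Paper D", "id": "4", "language": "mt", "proportion": 0.03},
        # Listed as dodgy: dropped whatever the proportions are
        {"title": "Paper E", "id": "5", "language": "en", "proportion": 0.9},
        {"title": "Paper E", "id": "5", "language": "mt", "proportion": 0.1},
    ]
)
toy_dodgy = ["5"]

kept = (
    toy.loc[(~toy["id"].isin(toy_dodgy)) & (toy["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (x["language"].iloc[0] != "en"))
)
kept_titles = sorted(kept["title"].unique())
print(kept_titles)
```

Only 'Paper B' and 'Paper C' survive. Note that a mostly English title with a second language just over the 5% threshold would also be kept, which is why the manually compiled `dodgy` list is needed.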

Let's list them.

In [19]:
for n, l in papers:
    print(n[0])

A Voz de Timor (Dili, East Timor : 1970 - 1975)
Adelaider Deutsche Zeitung (SA : 1851 - 1862)
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933)
Australische Zeitung (Adelaide, SA : 1875 - 1916)
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946)
Chinese Republic News (Sydney, NSW : 1914 - 1937)
Chinese Times (Melbourne, Vic. : 1902 - 1922)
Chung Wah News (Perth, WA : 1981 - 1987)
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952)
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906)
Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851)
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939)
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851)
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993)
Dutch Weekly (Sydney, NSW : 1993 - 2004)
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952)
Eco Italiano (Perth, 

That's looking pretty good. Let's save the results as a Markdown file to make it easy to explore. We'll include links into Trove. Here's the [list of all 52 newspapers](non-english-newspapers.md) (also as a [Gist](https://gist.github.com/wragge/9aa385648cff5f0de0c7d4837896df97)).

In [20]:
with open(Path("non-english-newspapers.md"), "w") as md_file:
    # Number the titles from 1; n is the (title, id) group key, l holds the group's rows
    for i, (n, l) in enumerate(papers, start=1):
        md_file.write(
            f"\n### {i}. [{n[0]}](http://nla.gov.au/nla.news-title{n[1]})\n\n"
        )
        md_file.write("| Language | Language code | Proportion of sample |\n")
        md_file.write("|---|---|---|\n")
        # List each language from the largest share of the sample to the smallest,
        # using the same 5% threshold as the filter above
        for row in (
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] >= 0.05)]
            .sort_values(by="proportion", ascending=False)
            .itertuples()
        ):
            md_file.write(
                f"| {row.language_full} | {row.language} | {row.proportion} |\n"
            )
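
To see what the loop above writes for a single title, here's a minimal sketch that builds one entry in memory rather than saving it to disk. The `format_entry` helper is hypothetical; the title, id, and proportions come from the *Vil'na Dumka* results shown earlier.

``` python
import pandas as pd


def format_entry(number, title, title_id, languages):
    """Build the Markdown for one newspaper (hypothetical helper)."""
    lines = [
        f"\n### {number}. [{title}](http://nla.gov.au/nla.news-title{title_id})\n",
        "| Language | Language code | Proportion of sample |",
        "|---|---|---|",
    ]
    # Largest share of the sample first, as in the loop above
    rows = languages.sort_values(by="proportion", ascending=False)
    for row in rows.itertuples():
        lines.append(f"| {row.language_full} | {row.language} | {row.proportion} |")
    return "\n".join(lines) + "\n"


languages = pd.DataFrame(
    [
        {"language_full": "English", "language": "en", "proportion": 0.18},
        {"language_full": "Ukrainian", "language": "uk", "proportion": 0.82},
    ]
)
entry = format_entry(
    1, "Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)", "1593", languages
)
print(entry)
```

Each entry is a numbered heading that links to the title in Trove, followed by a small table of detected languages.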

If you look at the Markdown file you'll see that there are still some dodgy results – for example, 16% of the *Chinese Advertiser* is detected as 'Scottish Gaelic'. But the point of this exercise was to find non-English newspapers, rather than to accurately measure the proportion of non-English content in each one, so I think we can live with it for now.

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  
Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb).