# Finding non-English newspapers in Trove

There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.

## How not to do it...

My first thought was I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that catalogue metadata would include language information. For example, you can search for newspapers using `format:Periodical/Newspaper` in the books and libraries category (or the `article` API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Here's the [sort of results](https://trove.nla.gov.au/search/category/books?keyword=%22trove.nla.gov.au%22%20format%3APeriodical%2FNewspaper) you get. Unfortunately, you only get about 826 results and there are many more newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded.

My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.

``` python
params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}
newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
    issn = newspaper.get('issn')
    # Not every title has an ISSN, so skip those that don't
    if not issn:
        print('No ISSN')
        continue
    params['q'] = f'issn:{issn}'
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        for work in works:
            print(work.get('language'))
    if not response.from_cache:
        time.sleep(0.2)
```

The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...

## How I actually did it

If I couldn't get language details from metadata, then I had to try and extract it from the resource itself. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found [pycld3](https://pypi.org/project/pycld3/) which installed with `pip`, and *just worked*.

My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To try and provide a big enough amount of text for the language detector to work with, I set the number of words parameter to return articles with between 100 and 1000 words. So the query parameters I used were:

``` python
params = {
    'category': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'n': 100,
}
```

Because some of the newspapers had short runs and the word count filter limits the results, I found that I wasn't always getting 100 results per newspaper. To work around this I found the likely language for each article, aggregated the counts, and then calculated the proportion of results for each language. This gave me the proportion of articles in each language – a number I could use across newspapers to find the non-English titles. 
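
Here's a minimal sketch of that aggregation step, using made-up detection results rather than real Trove data. The function name `language_proportions` and the sample values are just for illustration.

```python
from collections import Counter


def language_proportions(detected):
    """Return {language_code: share of articles} for one newspaper's sample."""
    total = len(detected)
    if total == 0:
        return {}
    counts = Counter(detected)
    # Dividing by the number of detections (not a fixed 100) keeps
    # newspapers with short runs comparable to those with full samples.
    return {lang: count / total for lang, count in counts.items()}


# Hypothetical sample: only 26 articles came back for this title
sample = ["en"] * 22 + ["la"] * 4
proportions = language_proportions(sample)
print(proportions)
```

Because the denominator is the size of the sample actually retrieved, a title with 26 articles and a title with 100 articles end up on the same 0 to 1 scale.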

In general this worked pretty well, and the result was a [list of 55 newspapers](non-english-newspapers.md) that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.

## Problems / limitations

* It's no surprise that the results of the language detection are affected by the quality of the OCR. 
* In filtering out what seems to be the product of dodgy OCR, it's possible that I might be excluding some non-English content. 
* I'm only detecting the predominant language for each article, so there might be articles containing a mix of languages that are being missed. 
* I'm just taking the first 100 results from a blank search in each newspaper. Larger or more randomised samples might produce different results.
* Some dodgy detection results remain in the list of newspapers, but the point of this exercise was to find non-English newspapers. If you wanted to accurately determine the quantity of non-English content, you'd have to do a lot more fine-grained analysis.

## Import what we need

In [51]:
import os
import re
from collections import Counter
from datetime import datetime, timedelta
from pathlib import Path

import altair as alt
import pandas as pd
import requests_cache
from dotenv import load_dotenv
from IPython.display import display
from language_tags import tags
from py3langid.langid import MODEL_FILE, LanguageIdentifier
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm

s = requests_cache.CachedSession(expire_after=timedelta(days=30))
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

load_dotenv()

True

In [52]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

headers = {"X-API-KEY": API_KEY}

## Harvest the data and run language detection on articles

In [53]:
def get_newspapers():
    """
    Get a list of newspapers in Trove.
    """
    response = s.get(
        "https://api.trove.nla.gov.au/v3/newspaper/titles",
        params={"encoding": "json"},
        headers=headers,
    )
    data = response.json()
    return data["newspaper"]

In [54]:
def find_languages(sample_size=None):
    params = {
        "category": "newspaper",
        "encoding": "json",
        # 'l-category': 'Article',
        "l-word": "100 - 1000 Words",
        "include": "articletext",
        "n": 100,
    }
    newspaper_langs = []
    newspapers = get_newspapers()
    identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
    for newspaper in tqdm(newspapers[:sample_size]):
        langs = []
        # print(f'\n{newspaper["title"]}')
        params["l-title"] = newspaper["id"]
        response = s.get(
            "https://api.trove.nla.gov.au/v3/result", params=params, headers=headers
        )
        data = response.json()
        n = data["category"][0]["records"]["n"]
        try:
            articles = data["category"][0]["records"]["article"]
        except KeyError:
            # print('Not found')
            pass
        else:
            # Detect language for each article in results
            for article in articles:
                if "articleText" in article:
                    # Clean up OCRd text by removing tags and extra whitespace
                    text = article["articleText"]
                    text = re.sub(r"<[^<]+?>", "", text)
                    text = re.sub(r"\s\s+", " ", text)
                    # Get the language
                    lang, prob = identifier.classify(text)
                    # If the language prediction is reliable, save it
                    # if ld.is_reliable:
                    if prob >= 0.95:
                        langs.append(lang)
            # Find the count of each language detected in the sample of articles
            for lang, count in dict(Counter(langs)).items():
                # Calculate the language count as a proportion of the articles
                # in the sample that had a reliable language prediction
                prop = count / len(langs)
                newspaper_langs.append(
                    {
                        "id": newspaper["id"],
                        "title": newspaper["title"],
                        "language": lang,
                        "proportion": prop,
                        "number": n,
                    }
                )
    return newspaper_langs
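
To see what the two clean-up regular expressions in `find_languages()` do to article text before it reaches the language detector, here's a small standalone sketch. The helper name `clean_ocr_text` and the sample strings are made up for illustration; the two patterns are the ones used in the function above.

```python
import re


def clean_ocr_text(text):
    """Strip HTML tags, then collapse runs of whitespace to a single space."""
    # Remove anything that looks like a tag, e.g. <p> or </span>
    text = re.sub(r"<[^<]+?>", "", text)
    # Replace two or more whitespace characters with one space
    text = re.sub(r"\s\s+", " ", text)
    return text


# Made-up sample in the general shape of tagged article text
sample = "<p>SHIPPING   NEWS.</p>\n\n<p>Arrived  this  morning.</p>"
cleaned = clean_ocr_text(sample)
print(cleaned)

# Tags are removed without inserting a space, so adjacent
# paragraphs with no whitespace between them get joined
joined = clean_ocr_text("<p>one</p><p>two</p>")
print(joined)
```

The second call shows a possible side effect: words either side of a tag boundary can run together if there's no whitespace between the tags. For language detection over a few hundred words that's unlikely to matter much, but it's worth knowing if you reuse the cleaned text elsewhere.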

Convert the results into a dataframe.

In [None]:
newspaper_langs = find_languages()
df = pd.DataFrame(newspaper_langs)
df.head()

## Add full language names

The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the [language-tags](https://github.com/OnroerendErfgoed/language-tags) package.

In [56]:
def get_full_language(lc):
    """
    Get full language names from codes
    """
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc


df["language_full"] = df["language"].apply(get_full_language)

## Filtering the results

If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were ten newspapers publishing articles in Igbo (the language of the Igbo people in south-eastern Nigeria) seems small. Obviously there are a considerable number of false positives here.

In [57]:
df["language_full"].value_counts()

language_full
English                    1786
Latin                       195
Luxembourgish                40
Aragonese                    31
Welsh                        22
Italian                      20
Lao                          20
Albanian                     16
German                       14
Swahili (macrolanguage)      14
Hebrew                       10
Chinese                       9
Northern Sami                 9
Tagalog                       8
Afrikaans                     8
Breton                        6
Portuguese                    5
Norwegian                     5
Quechua                       4
Armenian                      4
Faroese                       4
Modern Greek (1453-)          4
Japanese                      3
Bosnian                       3
French                        3
Polish                        2
Dutch                         2
Spanish                       2
Amharic                       2
Slovak                        2
Lithuanian                

Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.

In [58]:
df.loc[df["proportion"] == 1]["language_full"].value_counts()

language_full
English                 1481
German                     3
Chinese                    3
Hebrew                     3
Modern Greek (1453-)       2
Italian                    2
Estonian                   1
Name: count, dtype: int64

If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.

In [59]:
alt.Chart(df).mark_bar().encode(x=alt.X("proportion:Q", bin=True), y="count():Q")

If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100) we see that they're mostly less than 0.01 (or 1 article in 100). It seems likely that these are false positives.

In [60]:
alt.Chart(df.loc[df["proportion"] < 0.1]).mark_bar().encode(
    x=alt.X("proportion:Q", bin=True), y="count():Q"
)

Let's be fairly conservative and filter out languages that have a proportion (per newspaper) less than 0.05. This list seems a bit more in line with what we would expect, but there are still some surprises – 33 newspapers published articles in Latin?

In [61]:
df.loc[df["proportion"] >= 0.05]["language_full"].value_counts()

language_full
English                    1775
Latin                        33
Italian                      15
Chinese                       9
German                        9
Aragonese                     6
Lao                           5
Hebrew                        5
Luxembourgish                 4
Modern Greek (1453-)          4
Portuguese                    3
French                        3
Swahili (macrolanguage)       3
Welsh                         3
Lithuanian                    2
Dutch                         2
Norwegian                     2
Bosnian                       2
Polish                        2
Indonesian                    1
Tagalog                       1
Estonian                      1
Quechua                       1
Walloon                       1
Swedish                       1
Danish                        1
Ukrainian                     1
Albanian                      1
Esperanto                     1
Japanese                      1
Spanish                   

If we focus in on the newspapers that supposedly have a significant proportion of articles in Latin, we see some very strange results. I seriously doubt that about 20% of the *Mildura Irrigationist* from 1892-3 is in Latin. So what's going on?

In [62]:
df.loc[(df["proportion"] > 0.1) & (df["language_full"] == "Latin")]

Unnamed: 0,id,title,language,proportion,number,language_full
229,1596,L'Italo-Australiano = The Italo-Australian (Su...,la,0.148936,100,Latin
273,350,"Nepean Times (Penrith, NSW : 1882 - 1962)",la,0.11,100,Latin
748,190,Windsor Express and Richmond Advertiser (NSW :...,la,0.13,100,Latin
855,1207,The Coolangatta Chronicle (Qld. : 1926),la,0.153846,26,Latin
1023,34,"The Advertiser (Adelaide, SA : 1889 - 1931)",la,0.171717,100,Latin
1602,706,"The Chinese Advertiser (Ballarat, Vic. : 1856)",la,0.2,10,Latin
1619,685,The English and Chinese Advertiser (Vic. : 185...,la,0.227273,22,Latin
1672,1583,The Mildura Irrigationist (Vic. : 1892 - 1893),la,0.208333,100,Latin
1678,1581,The Mildura Irrigationist and Murray River Agr...,la,0.189474,100,Latin
1691,1733,The Morwell Advocate and Boolara and Mirboo Ch...,la,0.238095,21,Latin


If you look at results for the *Mildura Irrigationist* [in Trove](https://trove.nla.gov.au/search/advanced/category/newspapers?l-advtitle=1583&l-advWord=100%20-%201000%20Words) you'll see that many of the page images are blurry, and as a result the OCR is very, very bad. Here's a sample:

> 1KB JEWk'L CA8R.
Mr*. fWanw wiw latwjcht aft at llw
.PaliiMi Ckact» tiMlty ini anaavi|vh af
oMaioint wowf ^ bbrpmaaMMM. Mr
plitdf I pillf, a«4 araa mlrwnl fa
miMF atoailw |mml wrritadr. la thk
«saa» Mr*. Dakar— *w«ltor)pMl ariifc
.
baTiqt oMiiwil • Mini of mawj fratn
y Mi tot. Uptnk ami On. farJtanrfaiarkicth
»Wrad«l«- Iroai Major and Kit. liar
. gnai«i Mm. CMiwim* «a ako coat
aaillvd for I rial on tHurp >4 prjtiy,
alkynl in hi* lawti raoimitimiIwr
•u 'K<«. tW action for drfamatmn of «Imi«cirr
vhkli lamiflit afaii^t Major
! ami Mi*. H*ritnp«*r». in txme^mncr
of Uwr MMiini thai aim M «*ol««i
mww valuaUr (ran tWir m«-
ilfw. Ma}«r arid Mr*. Ilargreatw
apfwakd lo tW Itrndt tor un'.i'(jr.


What happens when we feed this fragment of bad OCR to the language detector? Remarkably, the language detector is sure that it's Latin! To find out why this is the case, we'd probably have to dig into the way the language detection model was trained. But for our purposes it's enough to know that some of the languages detected seem to be the result of bad OCR.

In [63]:
ocr = """1KB JEWk'L CA8R.
Mr*. fWanw wiw latwjcht aft at llw
.PaliiMi Ckact» tiMlty ini anaavi|vh af
oMaioint wowf ^ bbrpmaaMMM. Mr
plitdf I pillf, a«4 araa mlrwnl fa
miMF atoailw |mml wrritadr. la thk
«saa» Mr*. Dakar— *w«ltor)pMl ariifc
.
baTiqt oMiiwil • Mini of mawj fratn
y Mi tot. Uptnk ami On. farJtanrfaiarkicth
»Wrad«l«- Iroai Major and Kit. liar
. gnai«i Mm. CMiwim* «a ako coat
aaillvd for I rial on tHurp >4 prjtiy,
alkynl in hi* lawti raoimitimiIwr
•u 'K<«. tW action for drfamatmn of «Imi«cirr
vhkli lamiflit afaii^t Major
! ami Mi*. H*ritnp«*r». in txme^mncr
of Uwr MMiini thai aim M «*ol««i
mww valuaUr (ran tWir m«-
ilfw. Ma}«r arid Mr*. Ilargreatw
apfwakd lo tW Itrndt tor un'.i'(jr.
"""
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
identifier.classify(ocr)

('la', np.float32(1.0))

Of course there might actually be newspapers in unexpected languages, so we don't want to filter them all out. Instead let's do some manual inspection of the newspapers that *seem* to have non-English content. First we'll filter our results to include only languages with proportions of at least 0.05, and then drop out newspapers that seem to be only in English. We end up with 100 different titles.

In [64]:
# The filter on the groupby drops out newspapers that only have articles in English.
filtered = (
    df.loc[df["proportion"] >= 0.05]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (len(x) == 1 and x["language"] != "en"))
)
papers = filtered.groupby(by=["title", "id"])
len(papers)

100
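
If the `groupby` and `filter` combination is hard to follow, the same two-step logic can be sketched in plain Python. This isn't the code used to produce the results; it's an illustration on made-up rows, and the `rows`, `THRESHOLD` and `kept` names are hypothetical.

```python
from collections import defaultdict

# Made-up rows in the same general shape as the results dataframe
rows = [
    {"id": "1", "language": "en", "proportion": 1.0},
    {"id": "2", "language": "de", "proportion": 0.82},
    {"id": "2", "language": "en", "proportion": 0.18},
    {"id": "3", "language": "en", "proportion": 0.97},
    {"id": "3", "language": "la", "proportion": 0.03},
    {"id": "4", "language": "zh", "proportion": 1.0},
]

THRESHOLD = 0.05

# Step 1: drop languages that fall below the threshold for a title
by_title = defaultdict(list)
for row in rows:
    if row["proportion"] >= THRESHOLD:
        by_title[row["id"]].append(row["language"])

# Step 2: keep a title if more than one language remains,
# or if the single remaining language is not English
kept = sorted(
    title_id
    for title_id, langs in by_title.items()
    if len(langs) > 1 or langs[0] != "en"
)
print(kept)
```

Title 3 drops out because its only non-English language falls below the threshold, leaving English on its own. That's the behaviour we want for the many titles where a stray article or two was misidentified.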

Let's list those 100 newspapers. From the list below, I think it's pretty easy to pick out the results that are likely to be the product of bad OCR.

In [65]:
for n, l in papers:
    if not l.loc[(~l["language"].isin(["en"])) & (l["proportion"] >= 0.05)].empty:
        print(f"\n{n[0]} ({n[1]})")
        display(
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
        )


A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)


Unnamed: 0,language_full,language,proportion
9,Portuguese,pt,0.919192



Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)


Unnamed: 0,language_full,language,proportion
915,German,de,1.0



Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933) (1686)


Unnamed: 0,language_full,language,proportion
1256,Hebrew,he,1.0



Australijos Lietuvis = The Australian Lithuanian (SA : 1948 - 1956) (1876)


Unnamed: 0,language_full,language,proportion
920,Lithuanian,lt,0.97



Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)


Unnamed: 0,language_full,language,proportion
922,German,de,1.0



Bangkok Recorder (Thailand : 1865 - 1867) (1488)


Unnamed: 0,language_full,language,proportion
14,English,en,0.939394
15,Portuguese,pt,0.050505



Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)


Unnamed: 0,language_full,language,proportion
17,Indonesian,id,0.99



Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)


Unnamed: 0,language_full,language,proportion
100,Chinese,zh,0.97



Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)


Unnamed: 0,language_full,language,proportion
1293,Chinese,zh,1.0



Chung Wah News (Perth, WA : 1981 - 1987) (1383)


Unnamed: 0,language_full,language,proportion
1842,English,en,0.5
1841,Chinese,zh,0.49



Cobden Times (Vic. : 1918) (543)


Unnamed: 0,language_full,language,proportion
1297,English,en,0.91
1298,Latin,la,0.08



Daily Post (Hobart, Tas. : 1908 - 1918) (860)


Unnamed: 0,language_full,language,proportion
1113,English,en,0.77
1114,Aragonese,an,0.2



Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)


Unnamed: 0,language_full,language,proportion
1867,German,de,0.82
1868,English,en,0.18



Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)


Unnamed: 0,language_full,language,proportion
149,German,de,1.0



Deutsche Zeitung fur Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)


Unnamed: 0,language_full,language,proportion
935,German,de,0.9
934,English,en,0.1



Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)


Unnamed: 0,language_full,language,proportion
150,German,de,0.71
151,English,en,0.29



Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)


Unnamed: 0,language_full,language,proportion
936,German,de,0.99



Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)


Unnamed: 0,language_full,language,proportion
156,Dutch,nl,0.979592



Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)


Unnamed: 0,language_full,language,proportion
159,Dutch,nl,0.939394
160,English,en,0.060606



Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)


Unnamed: 0,language_full,language,proportion
1872,Polish,pl,0.88
1873,English,en,0.12



Eco Italiano (Perth, WA : 1958 - 1959) (1387)


Unnamed: 0,language_full,language,proportion
1874,Italian,it,0.979592



Geraldton Express and Murchison Goldfields News (WA : 1894 - 1896) (1623)


Unnamed: 0,language_full,language,proportion
1886,English,en,0.585859
1888,Welsh,cy,0.30303
1887,Latin,la,0.111111



Geraldton Murchison Telegraph (WA : 1892 - 1899) (1625)


Unnamed: 0,language_full,language,proportion
1893,English,en,0.92
1894,Welsh,cy,0.06



Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)


Unnamed: 0,language_full,language,proportion
186,Chinese,zh,1.0



Hellenic Echo (Perth, WA : 1967 - 1968) (1389)


Unnamed: 0,language_full,language,proportion
1913,Modern Greek (1453-),el,1.0



Hobart Town Advertiser : Weekly Edt. (Tas. : 1859 - 1865) (1739)


Unnamed: 0,language_full,language,proportion
1129,English,en,0.95



Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)


Unnamed: 0,language_full,language,proportion
1915,Italian,it,0.96



Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)


Unnamed: 0,language_full,language,proportion
197,Italian,it,0.91
198,English,en,0.09



Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)


Unnamed: 0,language_full,language,proportion
199,Italian,it,0.75
200,English,en,0.25



Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)


Unnamed: 0,language_full,language,proportion
211,English,en,0.8
212,Italian,it,0.15



Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)


Unnamed: 0,language_full,language,proportion
214,English,en,0.85
215,Italian,it,0.14



Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)


Unnamed: 0,language_full,language,proportion
217,Italian,it,0.97



Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)


Unnamed: 0,language_full,language,proportion
1918,Japanese,ja,0.96



Kookynie Advocate and Northern Goldfields News (WA : 1903 - 1904) (1455)


Unnamed: 0,language_full,language,proportion
1931,English,en,0.92
1932,Latin,la,0.07



Kyabram Union (Vic. : 1886 - 1894) (196)


Unnamed: 0,language_full,language,proportion
1418,English,en,0.93
1419,Latin,la,0.07



L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)


Unnamed: 0,language_full,language,proportion
227,Italian,it,0.702128
229,Latin,la,0.148936
230,Aragonese,an,0.06383
228,Quechua,qu,0.053191



L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)


Unnamed: 0,language_full,language,proportion
234,Italian,it,0.97



La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) (1388)


Unnamed: 0,language_full,language,proportion
1937,Italian,it,0.98



Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)


Unnamed: 0,language_full,language,proportion
238,French,fr,0.76
239,English,en,0.24



Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)


Unnamed: 0,language_full,language,proportion
1955,Modern Greek (1453-),el,0.333333
1954,English,en,0.232323
1956,Portuguese,pt,0.161616
1950,French,fr,0.080808
1949,Spanish,es,0.060606



Meie Kodu - Our Home (Sydney, NSW : 1949 - 1956) (280)


Unnamed: 0,language_full,language,proportion
248,Estonian,et,1.0



Menzies Weekly Times (WA : 1897 - 1898) (1636)


Unnamed: 0,language_full,language,proportion
1961,English,en,0.89



Moruya Examiner (NSW : 1881 - 1902) (1882)


Unnamed: 0,language_full,language,proportion
255,English,en,0.908163
257,Latin,la,0.061224



Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954) (1594)


Unnamed: 0,language_full,language,proportion
263,Lithuanian,lt,0.95



Nasza droga (Adelaide, SA : 1952 - 1954) (1323)


Unnamed: 0,language_full,language,proportion
964,Polish,pl,0.89
965,English,en,0.11



Nepean Times (Penrith, NSW : 1882 - 1962) (350)


Unnamed: 0,language_full,language,proportion
272,English,en,0.88
273,Latin,la,0.11



Norden (Melbourne, Vic. : 1914 - 1918) (797)


Unnamed: 0,language_full,language,proportion
1460,Danish,da,0.642857
1462,Norwegian,no,0.153061
1464,Swedish,sv,0.091837
1461,English,en,0.061224



North Melbourne Gazette (Vic. : 1894 - 1901) (384)


Unnamed: 0,language_full,language,proportion
1468,English,en,0.94



Oceania (Sydney, NSW : 1913 - 1915) (1598)


Unnamed: 0,language_full,language,proportion
282,Italian,it,0.54
283,English,en,0.46



Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874) (1604)


Unnamed: 0,language_full,language,proportion
302,French,fr,0.98



Sandringham Southern Cross (Vic. : 1914 - 1918) (318)


Unnamed: 0,language_full,language,proportion
1525,English,en,0.939394
1527,Latin,la,0.050505



Seamen's Strike Bulletin (Melbourne, Vic. : 1919) (1043)


Unnamed: 0,language_full,language,proportion
1529,Chinese,zh,0.2
1530,Lao,lo,0.2
1531,Norwegian,no,0.2
1532,Albanian,sq,0.2
1533,Bosnian,bs,0.2



South Sydney News (NSW : 1940) (1854)


Unnamed: 0,language_full,language,proportion
315,English,en,0.944444
316,Latin,la,0.055556



Southern Morning Herald (Goulburn, NSW : 1920 - 1923) (418)


Unnamed: 0,language_full,language,proportion
319,English,en,0.885417
320,Latin,la,0.083333



Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)


Unnamed: 0,language_full,language,proportion
2024,Italian,it,0.98



Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)


Unnamed: 0,language_full,language,proportion
1018,German,de,0.888889
1019,English,en,0.111111



Sunday News (Sydney, NSW : 1919) (623)


Unnamed: 0,language_full,language,proportion
323,English,en,0.878788
324,Latin,la,0.080808



Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)


Unnamed: 0,language_full,language,proportion
2030,Italian,it,1.0



Sydney General Trade List (NSW : 1834 - 1842) (694)


Unnamed: 0,language_full,language,proportion
334,English,en,0.908163
336,Latin,la,0.05102



Sydney General Trade List, Mercantile Chronicle and Advertiser (NSW : 1830) (696)


Unnamed: 0,language_full,language,proportion
340,English,en,0.888889
341,Tagalog,tl,0.111111



Sydney General Trade List, and Mercantile Advertiser (NSW : 1829 - 1830) (695)


Unnamed: 0,language_full,language,proportion
338,English,en,0.913043
339,Latin,la,0.086957



Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)


Unnamed: 0,language_full,language,proportion
1016,German,de,0.99



The Advertiser (Adelaide, SA : 1889 - 1931) (34)


Unnamed: 0,language_full,language,proportion
1021,English,en,0.59596
1022,Luxembourgish,lb,0.222222
1023,Latin,la,0.171717



The Advertiser (Hobart, Tas. :  1837 - 1840) (1736)


Unnamed: 0,language_full,language,proportion
1158,English,en,0.93
1160,Walloon,wa,0.06



The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)


Unnamed: 0,language_full,language,proportion
1572,English,en,0.77
1573,Hebrew,he,0.23



The Australian Jewish Post (St. Kilda, Vic. : 1966 - 1968) (1777)


Unnamed: 0,language_full,language,proportion
1574,Hebrew,he,1.0



The Bee of Australia (Sydney, NSW : 1844) (1011)


Unnamed: 0,language_full,language,proportion
386,English,en,0.923077
387,Italian,it,0.061538



The Broughton Creek Register, and Kangaroo Valley and South Coast Farmer (Berry, NSW : 1886 - 1890) (1888)


Unnamed: 0,language_full,language,proportion
427,English,en,0.94
428,Aragonese,an,0.06



The Brunswick and Coburg Leader (Vic. : 1914 - 1929) (293)


Unnamed: 0,language_full,language,proportion
1595,English,en,0.94



The Central Districts Advocate (Goomalling, WA : 1922 - 1924) (1402)


Unnamed: 0,language_full,language,proportion
2054,English,en,0.87
2055,Latin,la,0.11



The Chinese Advertiser (Ballarat, Vic. : 1856) (706)


Unnamed: 0,language_full,language,proportion
1601,Chinese,zh,0.8
1602,Latin,la,0.2



The Coolangatta Chronicle (Qld. : 1926) (1207)


Unnamed: 0,language_full,language,proportion
854,English,en,0.846154
855,Latin,la,0.153846



The Derby News (WA : 1887) (1617)


Unnamed: 0,language_full,language,proportion
2072,Luxembourgish,lb,0.5
2073,English,en,0.5



The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)


Unnamed: 0,language_full,language,proportion
1620,Chinese,zh,0.772727
1619,Latin,la,0.227273



The Gippsland Farmers' and Glengarry, Toongabbie and Cowwarr Journal (Traralgon, Vic. : 1922 - 1923) (1870)


Unnamed: 0,language_full,language,proportion
1628,English,en,0.94
1629,Aragonese,an,0.06



The Goldfields Observer (Kalgoorlie, WA : 1930 - 1939) (1626)


Unnamed: 0,language_full,language,proportion
2100,English,en,0.878788
2101,Latin,la,0.090909



The Herald of Tasmania (Hobart, Tas. : 1845) (1741)


Unnamed: 0,language_full,language,proportion
1179,English,en,0.9



The Hobart Town Daily Mercury (Tas. : 1858 - 1860) (33)


Unnamed: 0,language_full,language,proportion
1189,English,en,0.929293
1190,Latin,la,0.060606



The Jewish Post (Melbourne, Vic. : 1949 - 1966) (1776)


Unnamed: 0,language_full,language,proportion
1640,Hebrew,he,1.0



The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)


Unnamed: 0,language_full,language,proportion
1641,English,en,0.81
1642,Hebrew,he,0.19



The Mildura Irrigationist (Vic. : 1892 - 1893) (1583)


Unnamed: 0,language_full,language,proportion
1670,English,en,0.427083
1672,Latin,la,0.208333
1673,Luxembourgish,lb,0.1875
1671,Swahili (macrolanguage),sw,0.145833



The Mildura Irrigationist and Murray River Agricultural Times (Vic. : 1888) (1581)


Unnamed: 0,language_full,language,proportion
1676,English,en,0.473684
1678,Latin,la,0.189474
1679,Luxembourgish,lb,0.136842
1677,Swahili (macrolanguage),sw,0.094737
1680,Lao,lo,0.094737



The Mildura Irrigationist and Murray River Cultural Advocate (Vic. : 1891 - 1892) (1582)


Unnamed: 0,language_full,language,proportion
1682,English,en,0.87
1683,Swahili (macrolanguage),sw,0.07



The Morwell Advocate and Boolara and Mirboo Chronicle (Vic. : 1886) (1733)


Unnamed: 0,language_full,language,proportion
1692,English,en,0.714286
1691,Latin,la,0.238095



The Morwell Advocate and Narracan, Boolara and Mirboo Chronicle (Vic. : 1886) (1734)


Unnamed: 0,language_full,language,proportion
1694,English,en,0.917526
1696,Latin,la,0.051546



The Mount Ararat Advertiser (Vic. : 1857) (1883)


Unnamed: 0,language_full,language,proportion
1699,English,en,0.916667
1700,Latin,la,0.083333



The Reporter (Box Hill, Vic. : 1889 - 1925) (244)


Unnamed: 0,language_full,language,proportion
1718,English,en,0.938776
1717,Latin,la,0.05102



The Richmond River Express and Casino Kyogle Advertiser (NSW : 1904 - 1929) (500)


Unnamed: 0,language_full,language,proportion
604,English,en,0.846939
606,Lao,lo,0.05102



The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)


Unnamed: 0,language_full,language,proportion
2215,Modern Greek (1453-),el,0.98



The Yarrawonga Mercury and Mulwala (N.S.W.) News (Vic. : 1882 - 1892; 1894 - 1897) (1863)


Unnamed: 0,language_full,language,proportion
1762,English,en,0.89
1763,Latin,la,0.07



To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)


Unnamed: 0,language_full,language,proportion
706,Modern Greek (1453-),el,1.0



Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)


Unnamed: 0,language_full,language,proportion
713,Chinese,zh,1.0



Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)


Unnamed: 0,language_full,language,proportion
714,Chinese,zh,0.99



Uniamoci (Sydney, NSW : 1903 - 1904) (1599)


Unnamed: 0,language_full,language,proportion
725,Italian,it,1.0



Upper Hunter Courier (Murrurundi, NSW : 1871) (810)


Unnamed: 0,language_full,language,proportion
726,English,en,0.928571
727,Lao,lo,0.071429



Vesnik (Perth, WA : 1975 - 1994) (1382)


Unnamed: 0,language_full,language,proportion
2249,Macedonian,mk,0.412371
2248,English,en,0.340206
2251,Bosnian,bs,0.14433



Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)


Unnamed: 0,language_full,language,proportion
728,Ukrainian,uk,0.82
729,English,en,0.18



Warwick Daily News (Qld. : 1919 -1954) (892)


Unnamed: 0,language_full,language,proportion
898,English,en,0.887755
899,Latin,la,0.081633



Williamstown Trade Circular (Vic. : 1855 - 1856) (213)


Unnamed: 0,language_full,language,proportion
1805,English,en,0.888889
1806,Esperanto,eo,0.111111



Windsor Express and Richmond Advertiser (NSW : 1843 - 1844) (190)


Unnamed: 0,language_full,language,proportion
747,English,en,0.87
748,Latin,la,0.13


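I went through the titles above and compiled a list of title identifiers that seem to be producing dodgy results. These are mostly English-language papers where poor OCR has been detected as small amounts of Latin, Lao, Swahili, and the like. We can use this list to filter these newspapers out of our results.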

In [66]:
# Titles where dodgy OCR causes false positives in language detection
# This was manually created after scanning results
dodgy = [
    "1036",
    "1043",
    "1011",
    "1103",
    "116",
    "1207",
    "1265",
    "13",
    "1320",
    "1336",
    "140",
    "1400",
    "1402",
    "145",
    "1455",
    "1488",
    "1543",
    "1546",
    "1581",
    "1582",
    "1583",
    "1617",
    "1623",
    "1625",
    "1626",
    "1636",
    "1638",
    "1675",
    "1678",
    "171",
    "1733",
    "1734",
    "1739",
    "1741",
    "1736",
    "1882",
    "1883",
    "1888",
    "1854",
    "1858",
    "1863",
    "1870",
    "1886",
    "190",
    "196",
    "213",
    "224",
    "244",
    "286",
    "292",
    "293",
    "318",
    "329",
    "33",
    "34",
    "350",
    "384",
    "389",
    "394",
    "418",
    "430",
    "431",
    "452",
    "479",
    "499",
    "500",
    "543",
    "570",
    "623",
    "694",
    "695",
    "696",
    "725",
    "763",
    "810",
    "860",
    "886",
    "892",
    "906",
    "92",
    "926",
    "927",
    "935",
    "937",
    "94",
    "946",
    "970",
    "986",
]
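
The list above was compiled by hand, but the false positives do follow a pattern: English plus small amounts of a few improbable languages. As a rough sketch only (this isn't how the list was made), something like the following could flag candidates for manual review. The `UNLIKELY` set and the `looks_suspect` function are my own inventions for illustration, and the sample values are copied from the listing above.

```python
# Language codes that showed up as likely false positives in the listing above.
# This set is an assumption based on eyeballing the results.
UNLIKELY = {"la", "lo", "sw", "lb", "eo"}


def looks_suspect(rows, threshold=0.05):
    """Flag a title that has English plus only 'unlikely' other languages.

    rows is a list of (language_code, proportion) tuples for one title.
    """
    significant = [(lang, p) for lang, p in rows if p >= threshold]
    others = {lang for lang, p in significant if lang != "en"}
    has_english = any(lang == "en" for lang, p in significant)
    return has_english and bool(others) and others <= UNLIKELY


# Sample values copied from the listing above
mildura = [("en", 0.473684), ("la", 0.189474), ("lb", 0.136842), ("sw", 0.094737), ("lo", 0.094737)]
vilna_dumka = [("uk", 0.82), ("en", 0.18)]
chinese_advertiser = [("zh", 0.8), ("la", 0.2)]

print(looks_suspect(mildura))  # True
print(looks_suspect(vilna_dumka))  # False
print(looks_suspect(chinese_advertiser))  # False
```

A heuristic like this would only ever produce candidates. Titles such as The Chinese Advertiser also pick up spurious Latin, so you'd still want to check the results by eye.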

Let's list them again, excluding those in the 'dodgy' list.

In [67]:
for n, l in papers:
    # Show a title only if it has a non-English language at or above the
    # 5% threshold and it isn't in the 'dodgy' list
    if not l.loc[
        (~l["language"].isin(["en"]))
        & (l["proportion"] >= 0.05)
        & (~l["id"].isin(dodgy))
    ].empty:
        print(f"\n{n[0]} ({n[1]})")
        display(
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
        )


A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)


Unnamed: 0,language_full,language,proportion
9,Portuguese,pt,0.919192



Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)


Unnamed: 0,language_full,language,proportion
915,German,de,1.0



Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933) (1686)


Unnamed: 0,language_full,language,proportion
1256,Hebrew,he,1.0



Australijos Lietuvis = The Australian Lithuanian (SA : 1948 - 1956) (1876)


Unnamed: 0,language_full,language,proportion
920,Lithuanian,lt,0.97



Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)


Unnamed: 0,language_full,language,proportion
922,German,de,1.0



Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)


Unnamed: 0,language_full,language,proportion
17,Indonesian,id,0.99



Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)


Unnamed: 0,language_full,language,proportion
100,Chinese,zh,0.97



Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)


Unnamed: 0,language_full,language,proportion
1293,Chinese,zh,1.0



Chung Wah News (Perth, WA : 1981 - 1987) (1383)


Unnamed: 0,language_full,language,proportion
1842,English,en,0.5
1841,Chinese,zh,0.49



Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)


Unnamed: 0,language_full,language,proportion
1867,German,de,0.82
1868,English,en,0.18



Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)


Unnamed: 0,language_full,language,proportion
149,German,de,1.0



Deutsche Zeitung fur Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)


Unnamed: 0,language_full,language,proportion
935,German,de,0.9
934,English,en,0.1



Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)


Unnamed: 0,language_full,language,proportion
150,German,de,0.71
151,English,en,0.29



Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)


Unnamed: 0,language_full,language,proportion
936,German,de,0.99



Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)


Unnamed: 0,language_full,language,proportion
156,Dutch,nl,0.979592



Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)


Unnamed: 0,language_full,language,proportion
159,Dutch,nl,0.939394
160,English,en,0.060606



Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)


Unnamed: 0,language_full,language,proportion
1872,Polish,pl,0.88
1873,English,en,0.12



Eco Italiano (Perth, WA : 1958 - 1959) (1387)


Unnamed: 0,language_full,language,proportion
1874,Italian,it,0.979592



Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)


Unnamed: 0,language_full,language,proportion
186,Chinese,zh,1.0



Hellenic Echo (Perth, WA : 1967 - 1968) (1389)


Unnamed: 0,language_full,language,proportion
1913,Modern Greek (1453-),el,1.0



Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)


Unnamed: 0,language_full,language,proportion
1915,Italian,it,0.96



Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)


Unnamed: 0,language_full,language,proportion
197,Italian,it,0.91
198,English,en,0.09



Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)


Unnamed: 0,language_full,language,proportion
199,Italian,it,0.75
200,English,en,0.25



Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)


Unnamed: 0,language_full,language,proportion
211,English,en,0.8
212,Italian,it,0.15



Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)


Unnamed: 0,language_full,language,proportion
214,English,en,0.85
215,Italian,it,0.14



Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)


Unnamed: 0,language_full,language,proportion
217,Italian,it,0.97



Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)


Unnamed: 0,language_full,language,proportion
1918,Japanese,ja,0.96



L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)


Unnamed: 0,language_full,language,proportion
227,Italian,it,0.702128
229,Latin,la,0.148936
230,Aragonese,an,0.06383
228,Quechua,qu,0.053191



L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)


Unnamed: 0,language_full,language,proportion
234,Italian,it,0.97



La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) (1388)


Unnamed: 0,language_full,language,proportion
1937,Italian,it,0.98



Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)


Unnamed: 0,language_full,language,proportion
238,French,fr,0.76
239,English,en,0.24



Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)


Unnamed: 0,language_full,language,proportion
1955,Modern Greek (1453-),el,0.333333
1954,English,en,0.232323
1956,Portuguese,pt,0.161616
1950,French,fr,0.080808
1949,Spanish,es,0.060606



Meie Kodu - Our Home (Sydney, NSW : 1949 - 1956) (280)


Unnamed: 0,language_full,language,proportion
248,Estonian,et,1.0



Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954) (1594)


Unnamed: 0,language_full,language,proportion
263,Lithuanian,lt,0.95



Nasza droga (Adelaide, SA : 1952 - 1954) (1323)


Unnamed: 0,language_full,language,proportion
964,Polish,pl,0.89
965,English,en,0.11



Norden (Melbourne, Vic. : 1914 - 1918) (797)


Unnamed: 0,language_full,language,proportion
1460,Danish,da,0.642857
1462,Norwegian,no,0.153061
1464,Swedish,sv,0.091837
1461,English,en,0.061224



Oceania (Sydney, NSW : 1913 - 1915) (1598)


Unnamed: 0,language_full,language,proportion
282,Italian,it,0.54
283,English,en,0.46



Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874) (1604)


Unnamed: 0,language_full,language,proportion
302,French,fr,0.98



Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)


Unnamed: 0,language_full,language,proportion
2024,Italian,it,0.98



Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)


Unnamed: 0,language_full,language,proportion
1018,German,de,0.888889
1019,English,en,0.111111



Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)


Unnamed: 0,language_full,language,proportion
2030,Italian,it,1.0



Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)


Unnamed: 0,language_full,language,proportion
1016,German,de,0.99



The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)


Unnamed: 0,language_full,language,proportion
1572,English,en,0.77
1573,Hebrew,he,0.23



The Australian Jewish Post (St. Kilda, Vic. : 1966 - 1968) (1777)


Unnamed: 0,language_full,language,proportion
1574,Hebrew,he,1.0



The Chinese Advertiser (Ballarat, Vic. : 1856) (706)


Unnamed: 0,language_full,language,proportion
1601,Chinese,zh,0.8
1602,Latin,la,0.2



The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)


Unnamed: 0,language_full,language,proportion
1620,Chinese,zh,0.772727
1619,Latin,la,0.227273



The Jewish Post (Melbourne, Vic. : 1949 - 1966) (1776)


Unnamed: 0,language_full,language,proportion
1640,Hebrew,he,1.0



The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)


Unnamed: 0,language_full,language,proportion
1641,English,en,0.81
1642,Hebrew,he,0.19



The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)


Unnamed: 0,language_full,language,proportion
2215,Modern Greek (1453-),el,0.98



To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)


Unnamed: 0,language_full,language,proportion
706,Modern Greek (1453-),el,1.0



Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)


Unnamed: 0,language_full,language,proportion
713,Chinese,zh,1.0



Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)


Unnamed: 0,language_full,language,proportion
714,Chinese,zh,0.99



Uniamoci (Sydney, NSW : 1903 - 1904) (1599)


Unnamed: 0,language_full,language,proportion
725,Italian,it,1.0



Vesnik (Perth, WA : 1975 - 1994) (1382)


Unnamed: 0,language_full,language,proportion
2249,Macedonian,mk,0.412371
2248,English,en,0.340206
2251,Bosnian,bs,0.14433



Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)


Unnamed: 0,language_full,language,proportion
728,Ukrainian,uk,0.82
729,English,en,0.18


Here we'll add the dodgy title ids into our filter. It seems that we have 55 newspapers with significant amounts of non-English content.

In [68]:
# Keep titles that have more than one language at or above the threshold,
# or a single language that isn't English
filtered = (
    df.loc[(~df["id"].isin(dodgy)) & (df["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (len(x) == 1 and x["language"].iloc[0] != "en"))
)
papers = filtered.groupby(by=["title", "id"])
len(papers)

55
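
To make the filtering rule concrete, here's a plain-Python sketch of the same logic run over a handful of rows. It's an illustration under stated assumptions, not the notebook's method: the values for the three real titles are copied from the listings above, 'Example Gazette' is invented to show an English-only title being dropped, and `dodgy` here is just a one-item subset.

```python
from collections import defaultdict

dodgy = {"892"}  # a one-item subset of the list above, for illustration

rows = [
    {"id": "892", "title": "Warwick Daily News", "language": "en", "proportion": 0.887755},
    {"id": "892", "title": "Warwick Daily News", "language": "la", "proportion": 0.081633},
    {"id": "1593", "title": "Vil'na Dumka", "language": "uk", "proportion": 0.82},
    {"id": "1593", "title": "Vil'na Dumka", "language": "en", "proportion": 0.18},
    {"id": "1185", "title": "Tung Wah News", "language": "zh", "proportion": 1.0},
    # Invented English-only title, included to show what gets dropped
    {"id": "9999", "title": "Example Gazette", "language": "en", "proportion": 1.0},
]

# Step 1: drop dodgy titles and languages below the 5% threshold,
# then group the remaining language codes by title
groups = defaultdict(list)
for row in rows:
    if row["id"] not in dodgy and row["proportion"] >= 0.05:
        groups[(row["title"], row["id"])].append(row["language"])

# Step 2: keep titles with more than one language, or one non-English language
kept = {
    key: langs
    for key, langs in groups.items()
    if len(langs) > 1 or langs != ["en"]
}

for (title, title_id), langs in sorted(kept.items()):
    print(title, title_id, langs)
```

Only Tung Wah News and Vil'na Dumka survive: Warwick Daily News is excluded by the dodgy list, and the invented English-only title fails the second test.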

Let's list them.

In [69]:
for n, l in papers:
    print(n[0])

A Voz de Timor (Dili, East Timor : 1970 - 1975)
Adelaider Deutsche Zeitung (SA : 1851 - 1862)
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933)
Australijos Lietuvis = The Australian Lithuanian (SA : 1948 - 1956)
Australische Zeitung (Adelaide, SA : 1875 - 1916)
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946)
Chinese Republic News (Sydney, NSW : 1914 - 1937)
Chinese Times (Melbourne, Vic. : 1902 - 1922)
Chung Wah News (Perth, WA : 1981 - 1987)
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952)
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906)
Deutsche Zeitung fur Sud-Australien = German Times for South Australia (Tanunda, SA : 1851)
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939)
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851)
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993)
Dutch Weekly (Sydney, NSW : 1993 - 2004)
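Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952)
Eco Italiano (Perth, WA : 1958 - 1959)
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923)
Hellenic Echo (Perth, WA : 1967 - 1968)
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957)
Il Giornale Italiano (Sydney, NSW : 1932 - 1940)
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954)
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940)
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935)
Italo-Australian (Sydney, NSW : 1927 - 1940)
Japanese Perth Times (Subiaco, WA : 1989 - 1996)
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885)
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909)
La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984)
Le Courrier Australien (Sydney, NSW : 1892 - 2011)
Mediterranean Voice (Perth, WA : 1971 - 1972)
Meie Kodu - Our Home (Sydney, NSW : 1949 - 1956)
Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954)
Nasza droga (Adelaide, SA : 1952 - 1954)
Norden (Melbourne, Vic. : 1914 - 1918)
Oceania (Sydney, NSW : 1913 - 1915)
Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874)
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932)
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851)
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959)
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874)
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999)
The Australian Jewish Post (St. Kilda, Vic. : 1966 - 1968)
The Chinese Advertiser (Ballarat, Vic. : 1856)
The English and Chinese Advertiser (Vic. : 1856 - 1858)
The Jewish Post (Melbourne, Vic. : 1949 - 1966)
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935)
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957)
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954)
Tung Wah News (Sydney, NSW : 1898 - 1902)
Tung Wah Times (Sydney, NSW : 1901 - 1936)
Uniamoci (Sydney, NSW : 1903 - 1904)
Vesnik (Perth, WA : 1975 - 1994)
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)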

That's looking pretty good. Let's save the results as a Markdown file to make it easy to explore. We'll include links into Trove. Here's the [list of all 55 newspapers](non-english-newspapers.md) (also as a [Gist](https://gist.github.com/wragge/9aa385648cff5f0de0c7d4837896df97)).

In [70]:
# Titles include non-ASCII characters, so set the encoding explicitly
with open(Path("non-english-newspapers.md"), "w", encoding="utf-8") as md_file:
    for i, (n, l) in enumerate(papers, start=1):
        # Numbered heading that links to the newspaper's title page in Trove
        md_file.write(
            f"\n### {i}. [{n[0]}](http://nla.gov.au/nla.news-title{n[1]})\n\n"
        )
        md_file.write("| Language | Language code | Proportion of sample |\n")
        md_file.write("|---|---|---|\n")
        # One table row per language, largest proportion first
        for row in (
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
            .itertuples()
        ):
            md_file.write(
                f"| {row.language_full} | {row.language} | {row.proportion} |\n"
            )
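
If you'd like to check the Markdown before writing it to disk, one option is to build each section as a string first. This is just a sketch: `markdown_section` is a hypothetical helper, not part of the notebook, and it takes plain tuples rather than a dataframe. The link pattern and table layout are the same as in the cell above.

```python
def markdown_section(number, title, title_id, rows):
    """Build the Markdown for one newspaper.

    rows is a list of (language_full, language_code, proportion) tuples.
    """
    lines = [
        f"### {number}. [{title}](http://nla.gov.au/nla.news-title{title_id})",
        "",
        "| Language | Language code | Proportion of sample |",
        "|---|---|---|",
    ]
    # Largest proportion first, as in the saved file
    for language_full, code, proportion in sorted(
        rows, key=lambda r: r[2], reverse=True
    ):
        lines.append(f"| {language_full} | {code} | {proportion} |")
    return "\n".join(lines) + "\n"


# Values copied from the listing above
section = markdown_section(
    1,
    "Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)",
    "1593",
    [("English", "en", 0.18), ("Ukrainian", "uk", 0.82)],
)
print(section)
```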

Save the results as a CSV file.

In [71]:
filtered.to_csv(
    f"newspapers_non_english_{datetime.now().strftime('%Y%m%d')}.csv", index=False
)
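
Once you have the CSV, you can explore it without pandas. Here's a minimal sketch that counts the number of titles for each non-English language using the standard library. The column names are assumed from the dataframe above, and an in-memory sample with shortened titles stands in for the saved file, so swap in `open()` with your own filename to run it on the real thing.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the saved file: column names are assumed, titles are shortened
sample_csv = io.StringIO(
    "id,title,language,language_full,proportion\n"
    "1593,Vil'na Dumka = Free Thought,uk,Ukrainian,0.82\n"
    "1593,Vil'na Dumka = Free Thought,en,English,0.18\n"
    "1185,Tung Wah News,zh,Chinese,1.0\n"
    "1184,Tung Wah Times,zh,Chinese,0.99\n"
)

# Collect the set of titles in which each non-English language appears
titles_by_language = defaultdict(set)
for row in csv.DictReader(sample_csv):
    if row["language"] != "en":
        titles_by_language[row["language_full"]].add(row["title"])

for language, titles in sorted(titles_by_language.items()):
    print(f"{language}: {len(titles)}")
```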

In [None]:
# IGNORE THIS CELL -- FOR TESTING ONLY
if os.getenv("GW_STATUS") == "dev":
    newspaper_langs = find_languages(sample_size=5)
    df = pd.DataFrame(newspaper_langs)
    assert df.shape[0] >= 5
    assert list(df.columns) == ["id", "title", "language", "proportion", "number"]

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).  
Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge?o=esb).