# Philosophical Writing in Early New Zealand Newspapers
## A Case Study of Specialised Corpus Construction from Large Digitised Newspaper Datasets
### DH2021 - University of Canterbury | Te Whare Wānanga o Waitaha
Joshua Black  <br/>
New Zealand Institute of Language, Brain and Behaviour | Te Kāhui Roro Reo <br />
joshua.black@canterbury.ac.nz  <br />
black.joshuad@gmail.com  

GitHub repository: https://github.com/JoshuaDavidBlack/newspaper-philosophy-methods <br/>
Project dashboard: https://nz-newspaper-philosophy.herokuapp.com/

In [1]:
import html
import os
import pickle
import random
import re

from IPython.display import HTML
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

from jupyter_dash import JupyterDash
import dash_cytoscape as cyto
from dash import dcc
from dash import html as dash_html
from dash.dependencies import Input, Output, State

import numpy as np

import pandas as pd

random.seed(10)
JupyterDash.infer_jupyter_proxy_config()

# Overview

1. **Content Motivation:** telling a broader story about philosophy in New Zealand through newspaper content.
2. **Methodological Motivation:** overcoming reliance on keyword search in OCR-generated digital archives.
3. **Problem:** how to construct a specialised corpus from newspaper digitisations which can be a target for digital humanities investigation.
4. **Solution:** an 'iterative bootstrapping' process of candidate corpus exploration, article labelling, and text-classifier training.
5. **Result:** a corpus which reveals interesting features of philosophical discussion in NZ, including:
  1. debates over evolution and creation, and
  2. debates over whether the education system ought to be secular or not.

# Content Motivation

- History of (academic) philosophy in New Zealand largely absent before middle of 20th century:
 - 'many of those who had longstanding chairs published next to nothing' (Davies & Helgeby 2014, 24)
- Another gap in 'history of philosophy' in general: philosophy outside academic publications.
- Newspaper material offers an opportunity to address both gaps.
 - In early colonial NZ, newspapers, rather than monographs or journal articles, were 'the fundamental infrastructure for intellectual life' (Ballantyne 2012, 57)
 - A wider class of contributors (including journalists themselves, letters to the editor, and reports of public lectures).
- [Release](https://natlib.govt.nz/about-us/open-data/papers-past-metadata/papers-past-newspaper-open-data-pilot) of bulk of pre-1900 English-language newspaper content by The National Library of New Zealand | Te Puna Mātauranga o Aotearoa.


<img src="images/oo.png" style="width:100%; margin:auto"/>                                                   

## What is philosophy?

- Some 19th Century usage:
 - “Philosophy arises when, not content with the facts of existence (that is, of the world), men proceed to the inquiry into their reasons, and ultimately into their unconditioned reason i.e. their rationality.” (Erdmann 1890, 1)
 - "Philosophy --- we define to be --- the progressive rational system of the principles presupposed and ascertained by the particular sciences, in their relation to ultimate Reality" (Ladd 1890, 27).
- Philosophy is always more-or-less connected to _reason_.
 - We're supposed to _reason_ for philosophical conclusions rather than (directly) appealing to tradition or personal insight.
- For our purposes: philosophy occurs when claims about "ultimate reality" are used in rational argument.

- Need to mention something about contestedness? Laerke 2013, 9

# Methodological Motivation

- How should the methodology of historical research change with the introduction of large OCR-generated digital archives?
- Widespread dissatisfaction with keyword searching:
 - Only allows you to make existence proof arguments (Owens & Padilla 2021).
 - Where is the context? - we can always string anecdotes together (Putnam 2016).
 - OCR problems - an unclear collection of material is unsearchable (Hitchcock 2013)
 - Traditional citation practices are now deceptive. (Hitchcock 2013)

- The same literature calls for a different orientation:
 - "What can we do that we couldn't do before?" (Gibbs & Owens 2013; Nicholson 2013)
 - "Source as data" approach (Owens & Padilla 2021)
 - Use of text mining techniques to provide context (Owens & Padilla 2021; Putnam 2016).
 - Engagement with big data analysis to move from "piles of books to subtle maps of meaning" (Hitchcock 2013).
- This project is an experiment in this orientation.

# The Problem

- The Papers Past Newspaper Open Data Pilot release contains around 1.5 million pages of newspaper content (315GB compressed).
- Application of text mining to this dataset won't give insight into _philosophical discourse_ in NZ newspapers.
 - ...the 'philosophical' material is such a small fraction it wouldn't show up (final corpus: 0.4% of dataset).
- This is a general problem: creating specialist corpora from digital newspaper archives.
- We want to do this in a way which avoids the problems of keyword searching and which does not rely on accurate OCR.
- Additional aim: offer any solution in an accessible way for other researchers.
 - Publicly available [Jupyter notebooks](https://github.com/JoshuaDavidBlack/newspaper-philosophy-methods),
 - Publicly available ways to interact with the [resulting corpus.](https://nz-newspaper-philosophy.herokuapp.com/)

# A Solution

<img src="images/flow_diagram.png" style="margin:auto"/> 


## Preprocessing

- The data comes in [METS/ALTO](https://veridiansoftware.com/knowledge-base/metsalto/) XML format. This is widespread in newspaper digitisation.
 - It contains both physical and logical descriptions of original newspaper issues and pages (METS for issue structure, ALTO files for each page).
- Data released as title-year compressed files. For each of these we:
 1. iterate through each issue, collecting a list of articles and corresponding text blocks from the METS file,
   - **NB:** articles are distinguished from advertisements.
 2. iterate through the ALTO files, collecting text blocks for each article.
- We save the processed data in a series of compressed "slices" of the dataset (~7.6 million items, 8GB).
 - Each slice is a [Pandas](https://pandas.pydata.org/) dataframe
- The result: plain text for each item in the dataset.
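
The second preprocessing step can be sketched as follows. This is a minimal illustration rather than the project's own code: it assumes ALTO pages in which each `TextBlock` carries an `ID` attribute and words are stored in the `CONTENT` attribute of `String` elements, and it matches elements by local name so that it does not depend on a particular ALTO namespace version. The function names, the sample XML, and the block IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO page for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="P1_TB00001">
      <TextLine><String CONTENT="Philosophy"/><SP/><String CONTENT="and"/></TextLine>
      <TextLine><String CONTENT="evolution."/></TextLine>
    </TextBlock>
    <TextBlock ID="P1_TB00002">
      <TextLine><String CONTENT="Shipping"/><SP/><String CONTENT="news."/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(element):
    """Return the tag name without its XML namespace."""
    return element.tag.rsplit('}', 1)[-1]


def text_blocks_from_alto(alto_xml):
    """Map each TextBlock ID to the plain text of its String elements."""
    root = ET.fromstring(alto_xml)
    blocks = {}
    for element in root.iter():
        if local_name(element) != 'TextBlock':
            continue
        words = [
            child.attrib.get('CONTENT', '')
            for child in element.iter()
            if local_name(child) == 'String'
        ]
        blocks[element.attrib['ID']] = ' '.join(words)
    return blocks


# Block IDs listed for one article in the METS file (assumed here).
article_block_ids = ['P1_TB00001']
page_blocks = text_blocks_from_alto(SAMPLE_ALTO)
article_text = [page_blocks[block_id] for block_id in article_block_ids]
print(article_text)
```

Keeping the text as a list of blocks, rather than one long string, matches the way the `Text` column is consumed later in this notebook.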

## Corpus Exploration

- The corpus exploration stage starts with a _candidate corpus_. We evaluate whether this corpus is satisfactory.
- So how do we get our first one? Either:
 1. the whole dataset, or
 2. our old friend: keyword search (recommended).
- Many methods can be used here. In the [notebook](https://github.com/JoshuaDavidBlack/newspaper-philosophy-methods):
    - manual inspection,
    - concordancing,
    - collocations,
    - cooccurrence networks,
    - topic modelling.
- Each of these gives insight into the content of the candidate corpus.
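
As a rough sketch of the keyword-search starting point and of concordancing, the following uses only the standard library. The item IDs and texts are invented, and the helper names `keyword_candidates` and `concordance` are hypothetical; the notebooks in the repository work on dataframe slices rather than a plain dictionary.

```python
import re

# Hypothetical items: the IDs and texts are invented for illustration.
items = {
    'ODT_18900101_ARTICLE1': 'The lecturer spoke on the philosophy of evolution at length.',
    'ODT_18900102_ARTICLE7': 'Wool prices rose again at the weekly market.',
    'ESD_18850310_ARTICLE3': 'A letter on moral philosophy and the secular school.',
}


def keyword_candidates(items, pattern):
    """Return the IDs of items whose text matches the regular expression."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return [item_id for item_id, text in items.items() if regex.search(text)]


def concordance(items, item_ids, pattern, width=20):
    """Return keyword-in-context lines for each match in the chosen items."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    lines = []
    for item_id in item_ids:
        text = items[item_id]
        for match in regex.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append(f'{left:>{width}} [{match.group(0)}] {right}')
    return lines


# A first candidate corpus: every item mentioning 'philosophy', 'philosopher', etc.
candidate_ids = keyword_candidates(items, r'philosoph\w*')
for line in concordance(items, candidate_ids, r'philosoph\w*'):
    print(line)
```

A loose pattern such as `philosoph\w*` tolerates some variation in word endings, but it will still miss items where OCR has damaged the keyword itself, which is one reason the later classification stages are needed.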

### Aside: Cooccurrence networks

- A way to represent the cooccurrence of terms within an item.
- Statistically significant cooccurrences enable us to see something of how a word is being used.
- Implemented using [Dash](https://dash.plotly.com/).
- Examples available [here](https://nz-newspaper-philosophy.herokuapp.com/)
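
The networks shown in the results use the log Dice statistic. A minimal sketch of that score is given below, assuming cooccurrence is counted at the level of whole items (how many items contain both terms). The toy documents and the function names are hypothetical, and the dashboard's own implementation may differ.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical tokenised items (one set of terms per newspaper item).
documents = [
    {'evolution', 'darwin', 'species'},
    {'evolution', 'darwin', 'creation'},
    {'education', 'secular', 'school'},
    {'evolution', 'creation', 'school'},
]

# Number of items containing each term, and each unordered pair of terms.
term_counts = Counter(term for doc in documents for term in doc)
pair_counts = Counter(
    pair for doc in documents for pair in combinations(sorted(doc), 2)
)


def log_dice(term_a, term_b):
    """Return 14 + log2(2 * f_ab / (f_a + f_b)), or None if never together."""
    pair = tuple(sorted((term_a, term_b)))
    together = pair_counts[pair]
    if together == 0:
        return None
    total = term_counts[term_a] + term_counts[term_b]
    return 14 + math.log2(2 * together / total)


def top_cooccurrences(term, n=25):
    """Return up to n (term, score) pairs ranked by log Dice with `term`."""
    scores = {}
    for other in term_counts:
        score = log_dice(term, other) if other != term else None
        if score is not None:
            scores[other] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


print(top_cooccurrences('evolution', n=3))
```

The score has a maximum of 14, reached only when two terms always appear together, which makes values comparable across terms of very different frequency.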

- At end of this stage: is the corpus good enough?
 - Are there lots of unwanted items?
 - Do I have reason to think I'm missing something?
- If the corpus is satisfactory, we have what we were after.
- If not, we move on to the labelling stage.
 - Notes should be kept throughout on wanted/unwanted items.

## Labelling

- There's no pain-free way to do this!
- We need a _labelling scheme_. In this case it's vague:
  - Philosophy: is the majority of the article 'philosophical discourse'?
  - A broad definition: does it argue for or appeal to ideas of 'ultimate reality' or 'ultimate value'?
  - I also included sublabels for writing type and philosophy type. Even if not used for classification, these can be useful auxiliary variables for evaluating which kinds of material the model handles best.
- I've created a [Dash](https://dash.plotly.com/) dashboard to enable labelling.

In [2]:
with open('data/codes2names_web.pickle', 'rb') as fin:
    CODES2NAMES_WEB = pickle.load(fin)
with open('data/codes2names.pickle', 'rb') as fin:
    CODES2NAMES = pickle.load(fin)
    
labels = pd.DataFrame(
    columns=[
        'Text', 'Notes',  # See below for discussion of labels.
        'Philosophy', 'Philosophy Type', 'Readable', 'Writing Type', 
    ]
)

In [3]:
def add_title_and_year(df):
    """Add 'Newspaper' and 'Date' columns to the dataframe in place.
    Both are parsed from the index, which has the form CODE_YYYYMMDD..."""
    df['Newspaper'] = df.index.map(lambda x: x[0:x.find('_')])
    df['Date'] = df.index.map(lambda x: x[x.find('_')+1:x.find('_')+9])
    

def escape_markdown(string):
    """Escape characters which have functions in markdown strings.
    Return escaped string."""

    markdown_escape_chars = r"\`*_{}[]<>()#+-.!|"
    for escape_char in markdown_escape_chars:
        string = string.replace(escape_char, "\\"+escape_char)

    return string

def text_as_markdown(index, dataframe, boldface=None):
    """Render article corresponding to index in dataframe as markdown
    string. Any matches for boldface are rendered in bold.
    """

    date = index[index.find('_')+1:index.find('_')+9]
    newspaper = index[0:index.find('_')]

    title = (dataframe.loc[index, 'Title'])
    title = escape_markdown(title)

    web_prefix = "https://paperspast.natlib.govt.nz/newspapers/"
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    web_address = f"{web_prefix}{CODES2NAMES_WEB[newspaper]}/{year}/{month}/{day}"

    text_blocks = dataframe.loc[index, 'Text']
    text = ''
    for block in text_blocks:
        paragraph = escape_markdown(block)
        text += paragraph + '\n\n'

    if boldface:
        # Wrap every match in bold using its own matched text. A function
        # replacement also avoids backslashes being read as regex escapes.
        text = re.sub(boldface, lambda m: f'***{m.group(0)}***', text)

    markdown_text = f"""## {title}

*{CODES2NAMES[newspaper]}*

{day}/{month}/{year}

[View issue on Papers Past]({web_address})

{text}
"""

    return markdown_text

In [4]:
to_label = pd.read_pickle('data/sample_corpus.tar.gz')
add_title_and_year(to_label)
item_names_formatted = [
    {'label': f'{to_label["Title"].loc[i]} ({i})', 'value': i} 
    for i in to_label.index
]

In [5]:
app = JupyterDash(__name__, external_stylesheets=['https://codepen.io/chriddyp/pen/bWLwgP.css'])

#For readability, the control panel is defined before the full app layout.
control_panel = [
    dash_html.P('Readable?'),
    dcc.RadioItems(
        id='readable-radio',
        options=[
            {'label': 'True', 'value': True},
            {'label': 'False', 'value': False}
        ]
    ),
    dash_html.P('Philosophy?'),
    dcc.RadioItems(
        id='philosophy-radio',
        options=[
            {'label': 'True', 'value': True},
            {'label': 'False', 'value': False}
        ]
    ),
    dash_html.P('Philosophy Type?'),
    dcc.RadioItems(
        id='phil-type-radio',
        options=[
            {'label': 'Religion/Science', 'value': 'r'},
            {'label': 'Ethics/Politics', 'value': 'e'},
            {'label': 'Other', 'value': 'o'},
            {'label': 'N/A', 'value': None}
        ]
    ),
    dash_html.P('Writing Type?'),
    dcc.RadioItems(
        id='write-type-radio',
        options=[
            {'label': 'Report of public event', 'value': 'p'},
            {'label': 'Letter to editor', 'value': 'l'},
            {'label': 'First order', 'value': 'f'},
            {'label': 'N/A', 'value': None}
        ]
    ),
    dash_html.P('Notes:'),
    dcc.Textarea(
        id='notes-area',
        style={'width': '100%'}
    ),
    dash_html.Button('Update', id='submit-val', n_clicks=0, style={'margin':'5px'}),
    dash_html.P(id='update-message', style={'display':'none'}) # Hidden paragraph: gives the update callback an output to write to.
]

app.layout = dash_html.Div([
    dash_html.H2('Label Newspaper Items'),
    dash_html.P('Item'),
    dcc.Dropdown(
        id='item-selection',
        options=item_names_formatted,
        value=item_names_formatted[0]['value'],
        style={'width': '80%', 'margin': '10px'}
    ),
    dash_html.Div([
        dash_html.Div(
            dash_html.Div(
                dcc.Markdown(
                    id='article-display',
                    children=text_as_markdown(to_label.index[0], to_label),
                ),
            style={
                'width': '700px',
                'margin': 'auto'
                }    
            ),
        style={
                'width': '70%', 
                'display': 'inline-block',
                'padding': '15px',
                'margin': '10px'
            }
        ),
        dash_html.Div(
            control_panel,
            style={
                'width': '15%', 
                'display': 'inline-block', 
                'vertical-align': 'top', 
                'padding': '50px',
                'border': 'solid',
                #'position': 'fixed',
                'margin': '10px'
            }
        )
    ])    
])

# When new item chosen, load item text and any labels.
@app.callback(
    [Output(component_id='article-display', component_property='children'),
    Output(component_id='readable-radio', component_property='value'),
    Output(component_id='philosophy-radio', component_property='value'),
    Output(component_id='phil-type-radio', component_property='value'),
    Output(component_id='write-type-radio', component_property='value'),
    Output(component_id='notes-area', component_property='value')],
    [Input(component_id='item-selection', component_property='value')]
)
def load_new_markdown_and_labels(item_id):
    text = text_as_markdown(item_id, to_label)
    readable = philosophy = phil_type = write_type = notes =  None # default value.
    if item_id in labels.index:
        readable = labels.loc[item_id, 'Readable']
        philosophy = labels.loc[item_id, 'Philosophy']
        phil_type = labels.loc[item_id, 'Philosophy Type']
        write_type = labels.loc[item_id, 'Writing Type']
        notes = labels.loc[item_id, 'Notes']
    return text, readable, philosophy, phil_type, write_type, notes

# Update labels when 'update' button pressed.
@app.callback(
    Output(component_id='update-message', component_property='children'),
    [Input(component_id='submit-val', component_property='n_clicks')],
    [State(component_id='readable-radio', component_property='value'),
    State(component_id='philosophy-radio', component_property='value'),
    State(component_id='phil-type-radio', component_property='value'),
    State(component_id='write-type-radio', component_property='value'),
    State(component_id='item-selection', component_property='value'),
    State(component_id='notes-area', component_property='value')]
)
def update_labels(n_clicks, readable, philosophy, phil_type, write_type, item_id, notes):
    if n_clicks > 0:
        labels.loc[item_id, "Readable"] = readable
        labels.loc[item_id, "Philosophy"] = philosophy
        labels.loc[item_id, "Philosophy Type"] = phil_type
        labels.loc[item_id, "Writing Type"] = write_type
        labels.loc[item_id, "Text"] = to_label.loc[item_id, 'Text']
        labels.loc[item_id, "Notes"] = notes
        #labels.to_pickle(f'../Labels/labels_{ITERATION}.tar.gz')
    return 'Labels updated'

In [6]:
if __name__ == '__main__':
    app.run_server(debug=False) 



Dash app running on http://127.0.0.1:8050/



## Training and Applying a Model

- Once we have a set of labels, supervised learning is open to us.
- A simple bag-of-words representation of each item, by word frequency count or [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weighting, is the most we can rely on given the OCR quality.
 - This rules out more advanced methods which depend on accurate word sequences.
- A simple classification method is applied: Naive Bayes via [Scikit-Learn](https://scikit-learn.org/).
 - ...easy to train and performs well for text classification (Zhang 2004)
- Cross validation search used to select model parameters.
 - ..."do we include n-grams?", "how many words in our dictionary?", etc...
- Evaluation of models is both:
 - quantitative (accuracy, recall, and precision rates)
 - and qualitative (what are the false positives and negatives and are they edge cases?)
- We generate a new candidate corpus by applying the resulting classifier to the processed dataset.
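
The training step might look roughly like the sketch below. It assumes scikit-learn's `TfidfVectorizer`, `MultinomialNB`, `Pipeline`, and `GridSearchCV`; the toy texts, the parameter grid, and the choice of F1 as the selection score are illustrative assumptions and not the settings used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical labelled items; real training data comes from the labelling stage.
texts = [
    'the soul and ultimate reality are matters of reason',
    'evolution and creation debated by the professor',
    'reason and moral law in the secular school',
    'the philosophy of mind and the nature of truth',
    'wool prices rose at the weekly market',
    'the ship arrived in port with cargo',
    'cricket match played at the oval on saturday',
    'council meeting discussed drainage and roads',
]
is_philosophy = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vectorise', TfidfVectorizer()),
    ('classify', MultinomialNB()),
])

# Candidate settings echo the questions above: n-grams? vocabulary size?
param_grid = {
    'vectorise__ngram_range': [(1, 1), (1, 2)],
    'vectorise__max_features': [50, None],
    'classify__alpha': [0.1, 1.0],
}

# Cross-validated search over the grid (2 folds only because the toy set is tiny).
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='f1')
search.fit(texts, is_philosophy)

# Apply the best model to unlabelled items to form a new candidate corpus.
unlabelled = ['a lecture on reason and ultimate reality', 'market prices for wool']
predictions = search.predict(unlabelled)
print(search.best_params_, predictions)
```

Wrapping the vectoriser and classifier in one pipeline means the vocabulary is rebuilt inside each cross-validation fold, so held-out items do not leak into the features.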

## "Iterative Bootstrapping"

- The phrase: ‘pull yourself up by the bootstraps’:
    1. starting with nothing, we add articles to our labelled collection,
    2. having collected a number with much higher representation of philosophy than the general dataset,
    3. we train and apply a classifier,
    4. we use the articles classified as philosophy as a source of new articles to label.
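
The loop can be sketched in a few lines. Everything here is a stand-in: `toy_train` is a crude keyword learner used in place of the Naive Bayes classifier, `toy_labeller` plays the part of the human labeller, and the four-item corpus is invented. The point is only to show how each round's positive classifications feed the next round of labelling.

```python
import random


def bootstrap_iteration(corpus, labels, train, ask_label, sample_size, rng):
    """Run one round: train on current labels, classify, label a new sample.

    corpus: dict mapping item ID to text.
    labels: dict mapping item ID to True/False; updated in place.
    train: callable (corpus, labels) -> predicate over text.
    ask_label: callable (item_id, text) -> True/False, i.e. the human labeller.
    Returns the IDs newly classified as philosophy in this round.
    """
    classifier = train(corpus, labels)
    candidate_ids = [
        item_id for item_id, text in corpus.items()
        if item_id not in labels and classifier(text)
    ]
    to_label = rng.sample(candidate_ids, min(sample_size, len(candidate_ids)))
    for item_id in to_label:
        labels[item_id] = ask_label(item_id, corpus[item_id])
    return candidate_ids


def toy_train(corpus, labels):
    """Stand-in classifier: flag texts sharing a word seen only in positives."""
    positive, negative = set(), set()
    for item_id, label in labels.items():
        (positive if label else negative).update(corpus[item_id].split())
    vocabulary = positive - negative
    return lambda text: any(word in vocabulary for word in text.split())


def toy_labeller(item_id, text):
    """Stand-in for manual labelling."""
    return 'wool' not in text


corpus = {
    'a': 'philosophy of evolution',
    'b': 'evolution and creation',
    'c': 'creation and the secular school',
    'd': 'wool prices at market',
}
labels = {'a': True, 'd': False}  # the initial hand-labelled items
rng = random.Random(10)

first_round = bootstrap_iteration(corpus, labels, toy_train, toy_labeller, 5, rng)
second_round = bootstrap_iteration(corpus, labels, toy_train, toy_labeller, 5, rng)
print(first_round, second_round, labels)
```

In this toy run item `c` is only reached in the second round, after labelling `b` has widened what the classifier recognises, which is the behaviour the bootstrapping metaphor describes.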

# Results

- The results reported here come after two iterations of the corpus construction process.




## Labelling

<img src="images/label_counts.png" style="margin:auto"/> 

## Quantitative Model Metrics

<img src="confusion_matrix.png" style="margin:auto"/> 

- Accuracy: 89%
- Precision: 81%
- Recall: 80%
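
For reference, the three rates are computed from the confusion matrix as below. The counts in this sketch are hypothetical and are not the project's confusion matrix; they were chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(true_pos, false_pos, false_neg, true_neg):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = true_pos + false_pos + false_neg + true_neg
    return {
        # Share of all items classified correctly.
        'accuracy': (true_pos + true_neg) / total,
        # Share of items classified as philosophy that really are philosophy.
        'precision': true_pos / (true_pos + false_pos),
        # Share of philosophy items that the classifier found.
        'recall': true_pos / (true_pos + false_neg),
    }


# Hypothetical counts, NOT the figures behind the percentages above.
metrics = classification_metrics(true_pos=40, false_pos=10, false_neg=10, true_neg=140)
print(metrics)
```

Because philosophy is rare in the dataset, accuracy alone can look high even for a poor classifier, so precision and recall on the philosophy class are the more informative figures.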

## Qualitative Model Investigation

- The false negatives are mostly composite items, such as editorials, in which many topics are covered.
    - This represents the possible loss of a whole class of perspectives and must be taken into account when drawing conclusions from the corpus.
- We can also look at the terms which the model uses to pick out philosophy:

<img src="images/under_the_hood_2.png" style="margin:auto"/> 

## Corpus Metrics

<img src="images/corpus_counts.png" style="margin:auto"/> 

## Sample Cooccurrence Networks

- So what can we learn about English-language philosophical discussion from this corpus?
- An example: what is the context of philosophical discussion of evolution in NZ newspapers up to 1900?
    - Let's look at some cooccurrence networks.
    - (All have 25 primary cooccurrences, 5 secondary cooccurrences and use the log Dice statistic)

<img src="images/evolution_ld_25-5.png" style="margin:auto"/> 

<img src="images/salmond_ld_25-5.png" style="margin:auto"/> 

<img src="images/parker_ld_25-5.png" style="margin:auto"/> 

- From a general search term, we see some of the key ideas associated and some of the local figures prominent in this corpus (Salmond and Parker, both Otago professors).
- By looking at networks for Salmond and Parker we see something of their associations very quickly:
    - Salmond the philosopher and Presbyterian minister,
    - Parker the biologist and public lecturer.

<img src="images/education_ld_25-5.png" style="margin:auto"/> 

<img src="images/secular_ld_25-5.png" style="margin:auto"/> 

- This corpus suggests a very close link between the idea of secularity and education. Each appears in the other's network.

# Conclusion

- A quick overview of a method for constructing specialised corpora from a large digitised newspaper archive.
- An even quicker look at what the philosophy corpus can tell us after two iterations of the process.
- For a fuller account of the method, see the [GitHub repository](https://github.com/JoshuaDavidBlack/newspaper-philosophy-methods).
- How far have we got beyond keyword searching?
 - We have placed our queries about, say, evolution within a wider context of discussion.
 - However, we still need to be _very careful_ about what conclusions we draw.
 - Careful inspection of our model shows that we are likely to be missing material from editorial discussions.
 - We have made progress towards "networks of meaning" (cf. Hitchcock 2013).

# References

Ballantyne, Tony. (2012). "Reading the Newspaper in Colonial Otago". _The Journal of New Zealand Studies_ 12.

Davies, Martin & Stein Helgeby. (2014). "Idealist Origins: 1920s and Before". In: _History of Philosophy in Australia and New Zealand._ Graham Oppy & N. N. Trakakis (Eds). Dordrecht: Springer. 15–54.

Gibbs, F., & Owens, T. (2013). "The Hermeneutics of Data and Historical Writing". In: _Writing History in the Digital Age._ K. Nawrotzki & J. Dougherty (Eds). University of Michigan Press.

Hitchcock, Tim. (2013). "Confronting the Digital". _The Journal of the Social History Society_ 10(1). 9-23.

Nicholson, Bob. (2013). "The Digital Turn: Exploring the methodological possibilities of digital newspaper archives". _Media History_ 19(1). 57-73.

Putnam, L. (2016). "The transnational and the text-searchable: Digitized sources and the shadows they cast". _The American Historical Review_ 121(2). 377–402.

Zhang, Harry. (2004). "The Optimality of Naive Bayes". In: _Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2004._ Valerie Barr & Zdravko Markov (Eds). 562–567.