# Labelling Stage

Content Warning:

- an example item in the "Labelling Scheme" discusses suicide.
- random selections of 19th C. newspaper text content are also displayed with no ability to filter for content.

## Overview

This notebook sets out the labelling process for corpus items. This requires, broadly speaking, two sets of inputs. At any stage we have a `candidate_corpus` from the corpus exploration phase of the process. These items should be increasingly close to our desired corpus as the process of corpus construction continues. However, we will also require a broad sample of the kind of items that we _don't_ want. This is done by including a random sample of items from the processed dataset.

This notebook sets out:

1. the import of data from previous stages,
2. the construction of a labelling scheme, and
3. the creation and use of a labelling dashboard.

## Imports and Setup

In [1]:
import pickle
import random
import re  # Used by text_as_markdown for boldface matching.

import pandas as pd

from jupyter_dash import JupyterDash
from dash import dcc
from dash import html as dash_html
from dash.dependencies import Input, Output, State

# Provide a path to the location of the processed dataset.
dataset_path = '../Dataset/processed-data/'

# The following lines load a dictionary of correspondences needed
# to generate Papers Past URLs.
with open('../Pickles/codes2names_web.pickle', 'rb') as fin:
    CODES2NAMES_WEB = pickle.load(fin)
with open('../Pickles/codes2names.pickle', 'rb') as fin:
    CODES2NAMES = pickle.load(fin)

As in the corpus exploration notebook, we keep track of the iteration with a variable.

In [2]:
ITERATION = 1

This notebook will predominantly cover the first iteration of the process. In the first iteration, nothing will have been labelled.

The following code will load the required labels. If you are part way through an iteration (labelling can take a while!), then set `in_progress` to `True`. If there are no labels yet in the current iteration, an empty set of labels will be created.

In [41]:
in_progress = False

if in_progress:
    desired_iteration = ITERATION
else:
    desired_iteration = ITERATION - 1

try:
    labels = pd.read_pickle(f'../Labels/labels_{desired_iteration}.tar.gz')
except FileNotFoundError:
    labels = pd.DataFrame(
        columns=[
            'Text', 'Notes',  # See below for discussion of labels.
            'Philosophy', 'Philosophy Type', 'Readable', 'Writing Type', 
        ]
    )
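
The loading logic above can be wrapped in a small helper so that both branches are easy to check. This is a minimal sketch rather than part of the original notebook: `load_labels`, `LABEL_COLUMNS`, and the temporary file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Column names copied from the cell above.
LABEL_COLUMNS = [
    'Text', 'Notes',
    'Philosophy', 'Philosophy Type', 'Readable', 'Writing Type',
]

def load_labels(path, columns=LABEL_COLUMNS):
    """Return pickled labels from path, or an empty frame if the file is missing."""
    try:
        return pd.read_pickle(path)
    except FileNotFoundError:
        return pd.DataFrame(columns=columns)

with tempfile.TemporaryDirectory() as tmp:
    # No file yet: an empty frame with the expected columns comes back.
    missing = load_labels(Path(tmp) / 'labels_0.pkl')

    # File exists: the saved labels come back unchanged.
    saved_path = Path(tmp) / 'labels_1.pkl'
    pd.DataFrame(
        {'Philosophy': [True]}, index=['X_18800101_ARTICLE1']
    ).to_pickle(saved_path)
    reloaded = load_labels(saved_path)
```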

We load the candidate corpus.

In [4]:
candidate_corpus = pd.read_pickle(f'../Corpora/candidate_corpus_{ITERATION-1}.tar.gz')

We then take a random sample of the processed dataset. By default, we take a sample of 500. ***NB: the `random_state` argument to the `sample` method means that the same sample is drawn each time this cell is run. I have included it for reproducibility, but the same state should not be used for two distinct iterations.***

In [6]:
sample_size = 500
# Load our list of all articles with the file they are contained in.
items_by_slice = pd.read_pickle('../Dataset/processed-data/meta/items_by_slice.tar.gz')
sample = items_by_slice.sample(sample_size, random_state=100).sort_values(by="Slice")
sample

Unnamed: 0,Title,Slice
FS_18891130_ARTICLE21,Wellington Land Board,0
WEST_18710601_ARTICLE12,ENGLISH ITEMS.,0
FS_18840306_ARTICLE5,Local & General News.,0
WT_18840814_ARTICLE3,ABSTRACT OF SALES BY AUCTION. This Day.,0
FS_18931026_ARTICLE1,BIRTH.,0
...,...,...
TS_18830903_ARTICLE1,LYTTELTON.,25
NEM_18891104_ARTICLE18,""" The Spectator"" on the Dominion of Australia.",25
LT_18910408_ARTICLE30,PUBLIC SERVICE ASSOCIATION.,25
TC_18980421_ARTICLE33,Marine Engineers.,25
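
One way to follow the note about `random_state` is to derive the seed from the iteration number, so that each iteration is reproducible but no two iterations share a state. The sketch below is an assumption about how this could be done, not the notebook's own method: `sample_for_iteration`, `BASE_SEED`, and the toy dataframe are hypothetical.

```python
import pandas as pd

BASE_SEED = 100  # hypothetical base value; the cell above passes 100 directly

def sample_for_iteration(items, sample_size, iteration, base_seed=BASE_SEED):
    """Draw a reproducible sample whose seed changes with the iteration."""
    sampled = items.sample(sample_size, random_state=base_seed + iteration)
    return sampled.sort_values(by='Slice')

# Toy stand-in for items_by_slice: 30 invented article codes over 3 slices.
toy_items = pd.DataFrame(
    {'Slice': [i % 3 for i in range(30)]},
    index=[f'TOY_188001{i + 1:02d}_ARTICLE1' for i in range(30)],
)

first_run = sample_for_iteration(toy_items, 5, iteration=1)
repeat_run = sample_for_iteration(toy_items, 5, iteration=1)   # same seed as first_run
next_iteration = sample_for_iteration(toy_items, 5, iteration=2)  # different seed
```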


We now iterate through the required slices, returning the desired items. We first define a function which, given the desired indices, returns a dataframe containing the required articles from our processed corpus. This can be used both for the random sample and for collecting any articles which we have reason to want to include but which are not in our candidate corpus.

In [37]:
def collect_articles_by_indices(indices, slices=range(26)): # Slices are numbered 0 to 25, so range(26) covers all of them.
    filtered_dfs = []
    for i in slices:
        print(f"Loading corpus slice {i}")
        df = pd.read_pickle(f'../Dataset/processed-data/corpus_df_{i}.tar.gz')
        filtered_df = df.filter(indices, axis=0)
        filtered_dfs.append(filtered_df)
        del df
    
    desired_items = pd.concat(filtered_dfs)
    return desired_items
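
The function relies on `DataFrame.filter(items, axis=0)`, which keeps only the rows whose index label appears in the requested list and ignores requested labels that are absent from the slice. A toy illustration, with invented article codes and titles:

```python
import pandas as pd

# Hypothetical stand-in for one corpus slice.
toy_slice = pd.DataFrame(
    {'Title': ['BIRTH.', 'SHIPPING.', 'REVIEW.']},
    index=['AA_18800101_ARTICLE1', 'BB_18800102_ARTICLE2', 'CC_18800103_ARTICLE3'],
)

# Two codes are present in the slice; the 'ZZ' code is not.
wanted = ['CC_18800103_ARTICLE3', 'AA_18800101_ARTICLE1', 'ZZ_18800104_ARTICLE4']

kept = toy_slice.filter(wanted, axis=0)
print(sorted(kept.index))
```

This is why the same list of indices can be passed to every slice: each slice contributes only the rows it actually holds.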

In [38]:
needed_slices = sample['Slice'].unique()
needed_slices
    
random_items = collect_articles_by_indices(sample.index, slices=needed_slices)
random_items

Loading corpus slice 0
Loading corpus slice 1
Loading corpus slice 2
Loading corpus slice 3
Loading corpus slice 4
Loading corpus slice 5
Loading corpus slice 6
Loading corpus slice 7
Loading corpus slice 8
Loading corpus slice 9
Loading corpus slice 10
Loading corpus slice 11
Loading corpus slice 12
Loading corpus slice 13
Loading corpus slice 14
Loading corpus slice 15
Loading corpus slice 16
Loading corpus slice 17
Loading corpus slice 18
Loading corpus slice 19
Loading corpus slice 20
Loading corpus slice 21
Loading corpus slice 22
Loading corpus slice 23
Loading corpus slice 24
Loading corpus slice 25


Unnamed: 0,Title,Text
FS_18891130_ARTICLE21,Wellington Land Board,[The usual monthly meeting of the Land Board w...
WEST_18710601_ARTICLE12,ENGLISH ITEMS.,"[The Rev. J. C. Reichardt says:— '""According t..."
FS_18840306_ARTICLE5,Local & General News.,[The Borough Council will meet this evening at...
WT_18840814_ARTICLE3,ABSTRACT OF SALES BY AUCTION. This Day.,"[Mb J. S. Buokland, at the Cambridge Yards, at..."
FS_18931026_ARTICLE1,BIRTH.,"[Bkooks.— At Feilding, ou October 17tb, tho wi..."
...,...,...
TS_18830903_ARTICLE1,LYTTELTON.,"[ABBTTBB. Baft. S— Wanoko, «.e., a7B tons, Nev..."
NEM_18891104_ARTICLE18,""" The Spectator"" on the Dominion of Australia.","[London, November 3. The Spectator considers t..."
LT_18910408_ARTICLE30,PUBLIC SERVICE ASSOCIATION.,[[Per Press Association.] WELLINGTON. April 7....
TC_18980421_ARTICLE33,Marine Engineers.,"[Presentation., Wellington April 19. The Hon M..."


### Special case: loading articles not in candidate corpus

If you want to load additional items, enter their article codes in the list below and run the following cell. Codes take the form `NEWSPAPER-CODE_DATE_ARTICLE-NUMBER`, for example `LT_18990211_ARTICLE29`.

In [39]:
desired_indices = [ # Here is a sample list of articles of interest for labelling.
    'LT_18990211_ARTICLE29', 
    'DSC_18601207_ARTICLE27',
    'OW_18511018_ARTICLE10', 
    'DSC_18520629_ARTICLE8',
    'DSC_18691229_ARTICLE17',
    'OW_18861112_ARTICLE129',
    'ESD_18760108_ARTICLE30',
    'OW_18601006_ARTICLE1',
    'LT_18690930_ARTICLE17',
    'NZTIM_18891129_ARTICLE33',
    'WEST_18720827_ARTICLE2',
    'CHP_18680716_ARTICLE12',
    'AS_18821014_ARTICLE10',
    'WT_18900422_ARTICLE8',
    'CHP_18940423_ARTICLE56',
    'AG_18840116_ARTICLE5',
    'MS_18830117_ARTICLE19',
    'GRA_18970305_ARTICLE3',
    'LT_18800611_ARTICLE5',
    'AG_18910526_ARTICLE8'
]
additional_items = collect_articles_by_indices(desired_indices)

Loading corpus slice 0
Loading corpus slice 1
Loading corpus slice 2
Loading corpus slice 3
Loading corpus slice 4
Loading corpus slice 5
Loading corpus slice 6
Loading corpus slice 7
Loading corpus slice 8
Loading corpus slice 9
Loading corpus slice 10
Loading corpus slice 11
Loading corpus slice 12
Loading corpus slice 13
Loading corpus slice 14
Loading corpus slice 15
Loading corpus slice 16
Loading corpus slice 17
Loading corpus slice 18
Loading corpus slice 19
Loading corpus slice 20
Loading corpus slice 21
Loading corpus slice 22
Loading corpus slice 23
Loading corpus slice 24
Loading corpus slice 25


In [40]:
additional_items

Unnamed: 0,Title,Text
WEST_18720827_ARTICLE2,VESSELS ANNOUNCED TO LEAVE WESTPORT.,"[Chalks Edward, on Thursday next, for Nelson, ..."
WT_18900422_ARTICLE8,Death.,"[Walwouth.—On April 18fch, at Tauwhare, Jane, ..."
LT_18990211_ARTICLE29,THE ROMANCE OF MOTHER EARTH.,[Lv his “ Romance of the Earth ” Professor Bic...
LT_18690930_ARTICLE17,MR HUXLEY AND SCIENTIFIC EDUCATION.,[{Lancet.) la the July number of Macmillan's M...
LT_18800611_ARTICLE5,SHIPPING.,"[LYTTELTON., arrived. ~ , T . in_Wallinffton. ..."
GRA_18970305_ARTICLE3,MAILS CLOSE,"[This day— For Dunedln, per Herald, at 11 am. ..."
ESD_18760108_ARTICLE30,REVIEW.,"[The New Zealand Magazine, a quarterly Journal..."
NZTIM_18891129_ARTICLE33,THE RELIGION OF SELF-RESPECT.,[Self-respejt is eminently a masculine quality...
AS_18821014_ARTICLE10,"Meetings, Entertainments, Etc.","[OCTOBBR 11, 1882., Abbott's Opera House— Speo..."
DSC_18520629_ARTICLE8,"Lecture on "" Secular Education.""",[(From the 'ScoUman.') Mr. George Combe delive...


## Labelling Scheme

At this point, we apply a labelling scheme to the articles. The labelling scheme used in the course of picking out a philosophy corpus is as follows:

- Readable: True / false.
- Philosophy: True / false.
- Philosophy type: Religion and science / ethics and politics / Other.
- Writing Type: report of public event / letter to the editor / first-order writing / other.

The question of what counts as philosophy is, as widely noted, itself a contentious philosophical question (see, e.g., Midgley 2018; Agamben 2017; Priest 2006; Deleuze and Guattari 1991; Heidegger 1958; Jevons 1914). It is one of the shortcomings of this project that a more clearly specialised topic was not selected at the outset. _However_, the fact that a useful corpus comes out at the other end suggests that _even in the case of vague labelling_ the method is acceptable.

In any case, for our purposes and in order to count as 'philosophy' an article must either appeal to or argue about 'ultimate values' or 'ultimate reality'. This is in line with conceptions of philosophy which were popular at the time.

If there is a portion of an article which is desired, we will label the whole article as 'philosophy'. Unfortunately, this method is not good at dealing with composite articles in which, say, a series of topics are covered only one of which is philosophy. See, for instance, the philosophical discussion of suicide contained in this piece which is surrounded by other non-philosophical matter (https://paperspast.natlib.govt.nz/newspapers/AS18860821.2.33).
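
The scheme listed above can also be written down as a small data structure, which makes it possible to check a label before it is stored. This is only a sketch: `LABELLING_SCHEME` and `invalid_fields` are hypothetical names, and allowing `None` for the two type fields (for items where the question does not apply) is an assumption.

```python
# Allowed values per field, following the scheme listed above.
# None is an assumed "not applicable" value for the two type fields.
LABELLING_SCHEME = {
    'Readable': {True, False},
    'Philosophy': {True, False},
    'Philosophy Type': {'religion and science', 'ethics and politics', 'other', None},
    'Writing Type': {
        'report of public event', 'letter to the editor',
        'first-order writing', 'other', None,
    },
}

def invalid_fields(label):
    """Return the names of fields whose value is not allowed by the scheme."""
    return [
        field for field, allowed in LABELLING_SCHEME.items()
        if label.get(field) not in allowed
    ]

good_label = {
    'Readable': True, 'Philosophy': True,
    'Philosophy Type': 'ethics and politics',
    'Writing Type': 'letter to the editor',
}
bad_label = {
    'Readable': True, 'Philosophy': 'maybe',
    'Philosophy Type': None, 'Writing Type': 'poem',
}

print(invalid_fields(good_label))  # → []
print(invalid_fields(bad_label))   # → ['Philosophy', 'Writing Type']
```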

## Labelling Dashboard

The labelling scheme above will be implemented using a `Dash` dashboard.

We want to have the date and newspaper more obviously displayed in our dashboard. We define a function to do this and apply it.

In [30]:
def add_title_and_year(df):
    """Add 'Newspaper' and 'Date' columns, parsed from the article
    codes in the index, to the dataframe in place."""
    df['Newspaper'] = df.index.map(lambda x: x[0:x.find('_')])
    df['Date'] = df.index.map(lambda x: x[x.find('_')+1:x.find('_')+9])
    
add_title_and_year(candidate_corpus)
add_title_and_year(random_items)
add_title_and_year(additional_items)
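
To make the index parsing concrete, here is how a single article code breaks down. The sketch uses `str.split` as an alternative to the `find`-based slicing above; `split_article_code` is a hypothetical helper, and it assumes codes always contain exactly two underscores.

```python
def split_article_code(code):
    """Split 'NEWSPAPER_YYYYMMDD_ARTICLEn' into its three parts."""
    newspaper, date, article = code.split('_')
    return newspaper, date, article

code = 'FS_18891130_ARTICLE21'
newspaper, date, article = split_article_code(code)
year, month, day = date[0:4], date[4:6], date[6:8]

# The find-based slicing used in add_title_and_year gives the same result.
sliced_newspaper = code[0:code.find('_')]
sliced_date = code[code.find('_') + 1:code.find('_') + 9]

print(newspaper, date, article)  # → FS 18891130 ARTICLE21
print(year, month, day)          # → 1889 11 30
```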

### Setting up labelling dashboard

#### Helper functions

We will need to display formatted article text.

In [42]:
def escape_markdown(string):
    """Escape characters which have functions in markdown strings.
    Return escaped string."""

    markdown_escape_chars = r"\`*_{}[]<>()#+-.!|"
    for escape_char in markdown_escape_chars:
        string = string.replace(escape_char, "\\"+escape_char)

    return string

def text_as_markdown(index, dataframe, boldface=None):
    """Render article corresponding to index in dataframe as markdown
    string. Any matches for boldface are rendered in bold.
    """

    date = index[index.find('_')+1:index.find('_')+9]
    newspaper = index[0:index.find('_')]

    title = (dataframe.loc[index, 'Title'])
    title = escape_markdown(title)

    web_prefix = "https://paperspast.natlib.govt.nz/newspapers/"
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    web_address = f"{web_prefix}{CODES2NAMES_WEB[newspaper]}/{year}/{month}/{day}"

    text_blocks = dataframe.loc[index, 'Text']
    text = ''
    for block in text_blocks:
        paragraph = escape_markdown(block)
        text += paragraph + '\n\n'

    if boldface:
        # Wrap every match in bold-italic markers, keeping each match's own text.
        text = re.sub(boldface, lambda m: f'***{m.group(0)}***', text)

    markdown_text = f"""## {title}

*{CODES2NAMES[newspaper]}*

{day}/{month}/{year}

[View issue on Papers Past]({web_address})

{text}
"""

    return markdown_text
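
A short illustration of the two helper behaviours may be useful. `escape_markdown` is repeated from the cell above so that the example runs on its own; the sample strings are invented. Note that in `text_as_markdown` the `boldface` pattern is searched for in the already escaped text, so a phrase containing characters such as `-` or `.` needs to be escaped in the same way before it will match. The `re.escape(escape_markdown(...))` combination shown here is one possible way to do that, not something the notebook itself does.

```python
import re

def escape_markdown(string):
    """Escape characters which have functions in markdown strings."""
    markdown_escape_chars = r"\`*_{}[]<>()#+-.!|"
    for escape_char in markdown_escape_chars:
        string = string.replace(escape_char, "\\" + escape_char)
    return string

# The backslash is handled first, so backslashes added later are not doubled.
escaped = escape_markdown("*Note* (1).")

# Wrapping every match of a simple pattern in bold-italic markers.
bolded = re.sub("truth", lambda m: f"***{m.group(0)}***", "the truth about truth")

# Matching a phrase with a hyphen: escape the phrase the same way as the text.
text = escape_markdown("Self-respect matters.")
needle = re.escape(escape_markdown("Self-respect"))
bolded_phrase = re.sub(needle, lambda m: f"***{m.group(0)}***", text)

print(escaped)        # → \*Note\* \(1\)\.
print(bolded)         # → the ***truth*** about ***truth***
print(bolded_phrase)  # → ***Self\-respect*** matters\.
```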

### Selecting articles for labelling

I've set up this notebook such that we label from one of the sources listed above at a time. That is, we can label the random sample of articles, named articles from our candidate corpus, all articles from the candidate corpus, or any additional articles which we might have loaded from the processed dataset.

Enter any list of items from the candidate corpus selected for labelling during the corpus exploration stage:

In [None]:
items_to_label = []

The initial value here sets up the random items to be labelled. Alternatives are listed in the comment.

In [None]:
to_label = random_items # alternative options: candidate_corpus, candidate_corpus.loc[items_to_label, ], additional_items.

We reformat the article names for use in the dashboard.

In [66]:
item_names_formatted = [
    {'label': f'{to_label["Title"].loc[i]} ({i})', 'value': i} 
    for i in to_label.index
]

In [67]:
item_names_formatted

[{'label': 'Wellington Land Board (FS_18891130_ARTICLE21)',
  'value': 'FS_18891130_ARTICLE21'},
 {'label': 'ENGLISH ITEMS. (WEST_18710601_ARTICLE12)',
  'value': 'WEST_18710601_ARTICLE12'},
 {'label': 'Local & General News. (FS_18840306_ARTICLE5)',
  'value': 'FS_18840306_ARTICLE5'},
 {'label': 'ABSTRACT OF SALES BY AUCTION.  This Day. (WT_18840814_ARTICLE3)',
  'value': 'WT_18840814_ARTICLE3'},
 {'label': 'BIRTH. (FS_18931026_ARTICLE1)', 'value': 'FS_18931026_ARTICLE1'},
 {'label': 'COMMERCIAL. (FS_18830131_ARTICLE2)',
  'value': 'FS_18830131_ARTICLE2'},
 {'label': 'AUSTRALIAN NEWS. (FS_18880920_ARTICLE7)',
  'value': 'FS_18880920_ARTICLE7'},
 {'label': 'THE HOUSEHOLD OF McNEIL. (DTN_18900930_ARTICLE39)',
  'value': 'DTN_18900930_ARTICLE39'},
 {'label': 'The Daily Telegraph. TUESDAY DECEMBER 2, 1890 (DTN_18901202_ARTICLE5)',
  'value': 'DTN_18901202_ARTICLE5'},
 {'label': 'DEATH OF CARDINAL MANNING. (WT_18920116_ARTICLE16)',
  'value': 'WT_18920116_ARTICLE16'},
 {'label': "AUSTRALIAN

The following code sets up and runs the dashboard.

In [63]:
app = JupyterDash(__name__, external_stylesheets=['https://codepen.io/chriddyp/pen/bWLwgP.css'])

# For readability, the control panel is defined before the full app layout.
control_panel = [
    dash_html.P('Readable?'),
    dcc.RadioItems(
        id='readable-radio',
        options=[
            {'label': 'True', 'value': True},
            {'label': 'False', 'value': False}
        ]
    ),
    dash_html.P('Philosophy?'),
    dcc.RadioItems(
        id='philosophy-radio',
        options=[
            {'label': 'True', 'value': True},
            {'label': 'False', 'value': False}
        ]
    ),
    dash_html.P('Philosophy Type?'),
    dcc.RadioItems(
        id='phil-type-radio',
        options=[
            {'label': 'Religion/Science', 'value': 'r'},
            {'label': 'Ethics/Politics', 'value': 'e'},
            {'label': 'Other', 'value': 'o'},
            {'label': 'N/A', 'value': None}
        ]
    ),
    dash_html.P('Writing Type?'),
    dcc.RadioItems(
        id='write-type-radio',
        options=[
            {'label': 'Report of public event', 'value': 'p'},
            {'label': 'Letter to editor', 'value': 'l'},
            {'label': 'First order', 'value': 'f'},
            {'label': 'N/A', 'value': None}
        ]
    ),
    dash_html.P('Notes:'),
    dcc.Textarea(
        id='notes-area',
        style={'width': '100%'}
    ),
    dash_html.Button('Update', id='submit-val', n_clicks=0, style={'margin':'5px'}),
    dash_html.P(id='update-message', style={'display':'none'}) # Hidden element that gives the update callback an output to write to.
]

app.layout = dash_html.Div([
    dash_html.H2('Label Newspaper Items'),
    dash_html.P('Item'),
    dcc.Dropdown(
        id='item-selection',
        options=item_names_formatted,
        value=item_names_formatted[0]['value'],
        style={'width': '80%', 'margin': '10px'}
    ),
    dash_html.Div([
        dash_html.Div(
            dash_html.Div(
                dcc.Markdown(
                    id='article-display',
                    children=text_as_markdown(to_label.index[0], to_label),
                ),
            style={
                'width': '700px',
                'margin': 'auto'
                }    
            ),
        style={
                'width': '70%', 
                'display': 'inline-block',
                'padding': '15px',
                'margin': '10px'
            }
        ),
        dash_html.Div(
            control_panel,
            style={
                'width': '15%', 
                'display': 'inline-block', 
                'vertical-align': 'top', 
                'padding': '50px',
                'border': 'solid',
                #'position': 'fixed',
                'margin': '10px'
            }
        )
    ])    
])

# When new item chosen, load item text and any labels.
@app.callback(
    [Output(component_id='article-display', component_property='children'),
    Output(component_id='readable-radio', component_property='value'),
    Output(component_id='philosophy-radio', component_property='value'),
    Output(component_id='phil-type-radio', component_property='value'),
    Output(component_id='write-type-radio', component_property='value'),
    Output(component_id='notes-area', component_property='value')],
    [Input(component_id='item-selection', component_property='value')]
)
def load_new_markdown_and_labels(item_id):
    text = text_as_markdown(item_id, to_label)
    readable = philosophy = phil_type = write_type = notes =  None # default value.
    if item_id in labels.index:
        readable = labels.loc[item_id, 'Readable']
        philosophy = labels.loc[item_id, 'Philosophy']
        phil_type = labels.loc[item_id, 'Philosophy Type']
        write_type = labels.loc[item_id, 'Writing Type']
        notes = labels.loc[item_id, 'Notes']
    return text, readable, philosophy, phil_type, write_type, notes

# Update labels when 'update' button pressed.
@app.callback(
    Output(component_id='update-message', component_property='children'),
    [Input(component_id='submit-val', component_property='n_clicks')],
    [State(component_id='readable-radio', component_property='value'),
    State(component_id='philosophy-radio', component_property='value'),
    State(component_id='phil-type-radio', component_property='value'),
    State(component_id='write-type-radio', component_property='value'),
    State(component_id='item-selection', component_property='value'),
    State(component_id='notes-area', component_property='value')]
)
def update_labels(n_clicks, readable, philosophy, phil_type, write_type, item_id, notes):
    if n_clicks > 0:
        labels.loc[item_id, "Readable"] = readable
        labels.loc[item_id, "Philosophy"] = philosophy
        labels.loc[item_id, "Philosophy Type"] = phil_type
        labels.loc[item_id, "Writing Type"] = write_type
        labels.loc[item_id, "Text"] = to_label.loc[item_id, 'Text']
        labels.loc[item_id, "Notes"] = notes
        labels.to_pickle(f'../Labels/labels_{ITERATION}.tar.gz')
        return 'Labels updated'
    return ''  # Nothing is saved on the initial page load.

if __name__ == '__main__':
    app.run_server(debug=False) # Debug changed to avoid https://github.com/plotly/jupyter-dash/issues/15


The 'environ['werkzeug.server.shutdown']' function is deprecated and will be removed in Werkzeug 2.1.

127.0.0.1 - - [19/Nov/2021 16:25:32] "GET /_shutdown_d0e76599-6b80-473c-a39b-16a42794875a HTTP/1.1" 200 -
 * Running on http://127.0.0.1:8050/ (Press CTRL+C to quit)
127.0.0.1 - - [19/Nov/2021 16:25:32] "GET /_alive_d0e76599-6b80-473c-a39b-16a42794875a HTTP/1.1" 200 -


Dash app running on http://127.0.0.1:8050/


127.0.0.1 - - [19/Nov/2021 16:25:34] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [19/Nov/2021 16:25:35] "GET /_dash-layout HTTP/1.1" 200 -
127.0.0.1 - - [19/Nov/2021 16:25:35] "GET /_dash-dependencies HTTP/1.1" 200 -
127.0.0.1 - - [19/Nov/2021 16:25:35] "GET /_dash-component-suites/dash/dcc/async-dropdown.js HTTP/1.1" 304 -
127.0.0.1 - - [19/Nov/2021 16:25:35] "GET /_dash-component-suites/dash/dcc/async-markdown.js HTTP/1.1" 304 -
127.0.0.1 - - [19/Nov/2021 16:25:35] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [19/Nov/2021 16:25:35] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [19/Nov/2021 16:25:35] "GET /_dash-component-suites/dash/dcc/async-highlight.js HTTP/1.1" 304 -
127.0.0.1 - - [19/Nov/2021 16:25:45] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [19/Nov/2021 16:25:53] "POST /_dash-update-component HTTP/1.1" 200 -
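
The update callback works by assigning to `labels.loc[item_id, column]`, which adds a new row the first time an item is labelled and overwrites the existing row on later updates. The sketch below reproduces that behaviour outside the dashboard on a toy frame. `record_label` and the item codes are hypothetical, and the `Text` column and the pickling step are left out for brevity.

```python
import pandas as pd

COLUMNS = ['Text', 'Notes', 'Philosophy', 'Philosophy Type', 'Readable', 'Writing Type']

def record_label(labels, item_id, readable, philosophy, phil_type, write_type, notes):
    """Write one item's labels into the frame, adding the row if needed."""
    labels.loc[item_id, 'Readable'] = readable
    labels.loc[item_id, 'Philosophy'] = philosophy
    labels.loc[item_id, 'Philosophy Type'] = phil_type
    labels.loc[item_id, 'Writing Type'] = write_type
    labels.loc[item_id, 'Notes'] = notes
    return labels

toy_labels = pd.DataFrame(columns=COLUMNS)

# Two new items, using the short codes from the dashboard radio buttons.
record_label(toy_labels, 'AA_18800101_ARTICLE1', True, True, 'e', 'l', 'clear argument')
record_label(toy_labels, 'BB_18800102_ARTICLE2', True, False, None, 'p', '')

# Relabelling an existing item overwrites its row rather than adding a new one.
record_label(toy_labels, 'AA_18800101_ARTICLE1', True, True, 'r', 'l', 'relabelled')

# A simple progress check: how many items are marked as philosophy so far.
philosophy_count = int((toy_labels['Philosophy'] == True).sum())
```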


## Next Step

Having labelled articles, the next step is to fit models using the 'Model Fit and Application Stages' notebook.