In [52]:
import textwrap
import pickle

import pandas as pd

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

from IPython.display import IFrame, HTML, clear_output

import NL_helpers
import NL_topicmodels

In [41]:
# Run cell to reload NL_helpers after it has been changed (NL_topicmodels can be reloaded the same way).
from importlib import reload
reload(NL_helpers)

<module 'NL_helpers' from '/home/joshua/Documents/Academic/MADS/DATA601/NL_helpers.py'>

### What are my criteria?
#### Original:
- Readable: 
    - Options: y/n
    - Criterion: Whether, on reading just the OCR, I could decipher what was meant (even if this required a lot of effort). No comparison with original images made.
- Philosophy:
    - Options: y/n
    - Criterion: somewhat messy. I shifted between 'do I want this to come out of the classifier?', 'am I interested in this?', and 'does this mention someone whom I already have reason to think of as engaged in philosophical activity?' (e.g. reports of Josiah Royce in Australia, which I wouldn't otherwise have added). Lots of material on the relationship between religion and science/metaphysical speculations. Politics largely avoided unless argued from something like first principles, as I didn't want a corpus full of political speeches (which seem to be pretty well represented in the PPNODP dataset (eek, new acronym)).
    - Issues: This is too idiosyncratic. Will suggest a modification in the next section.
- Philosophy Type:
    - Options: 'r': religion-science; 'e': ethics, including some politics; 'm': epistemology-metaphysics; 'o': other. 
    - Criteria:
        - 'r': (139 total). Does the piece reason about, say, the relationship between ideas of creation and evolution, or whether religious belief can be sustained against 'modern thought' in general? This will include reports of sermons that touch on these issues, public debates, and lectures by church or 'free thought' activists. I've also included, e.g., material by theosophists.
        - 'm': (13 total) Does the piece touch on abstract metaphysical or epistemological issues without reference to the relationship between religion and modern science/history. Includes
        - 'e': (94 total). Broadly ethics and politics. Discussions of whether certain things are virtues or vices (e.g. curiosity). Conceptual discussions of politics (although not more straightforward arguments between liberals and others which do not go into the foundations of their principles). For instance, an account of a papal CST statement (look up which), which argues that the pope at the time is confused about what 'socialism' means when denouncing it, as the rest of the statement actually defends a view which would be called 'socialist' in the context of the discussions happening in NZ (and perhaps the anglosphere more broadly?).
        - 'o': (51 total) I didn't think others applied. For instance: report of sitting in on philosophy lectures at Harvard. Discussions of the nature of the university etc.
    - Issues: 
        - Most of my 'false negatives' come from the 'o' category.
        - 'm' not widely used. Included Frankland on 'mind-stuff'. It's arguable that 19th century idealism is, broadly speaking, about reconciling 'religious' principles with new scientific and historical knowledge. But that's not at all explicit in the kind of thing Frankland does in his 'mind-stuff' material. Some theosophical speculations carried out without explicit reference to 'modern developments' are also included. A bit of a hodge-podge.
        - 'e' crosses over with 'm' and 'r'. It's more common in this kind of material than in, say, contemporary academic philosophy, to go straight for an ethical or political conclusion from a set of metaphysical claims. The 'piecemeal' approach favoured by analytic philosophy has definitely not taken root here (unsurprisingly: it's a much later development).
- Writing Type:
    - Options:
        - 'p': A report of a public event or printing of an address delivered to an audience. Includes sermons, debates, lectures, etc.
        - 'l': A letter to the editor.
        - 'f': a first-order piece of philosophy by an author for the newspaper or printed from somewhere else.
        - 'r': book review.
    - Issues: 
        - I haven't actually used this for any kind of classifier.
- NZ author?
    - Options: y/n
    - Criterion: Does it look to me like the author was based in NZ or is a known NZ figure? (e.g. Stout writing from Sydney is fine). No if, e.g., it is clearly a republishing of something from elsewhere.
    - Issue: almost everything is NZ. This is not useable for a classification scheme.
        

#### Second attempt:
- Readable:
    - Changes: will simply double check.
- Philosophy:
    - Changes: I should include more political discussions and let the more fine-grained distinctions regarding philosophy type do more work for me.
    - Be more careful with large pieces in which only a bit is 'philosophy'.
    - New criterion: 'Philosophical', for the purpose of constructing this corpus, is reflection on fundamentals. Philosophical discourse will make some appeal to first principles or argue that there are no such principles. It will not merely deal with current political scandals or just make a series of claims from within a (e.g.) theological framework.
- Philosophy Type:
    - 'r': Same as before, but more explicitly religion in light of modern science (esp. evolution) and history (esp. 'higher criticism'). Include theosophical works here and 'free thought' lectures. Include metaphysical work.
    - 'e': ethics and politics. Include disputes over nature of education system here.
    - 'o': Anything which doesn't fit in those two.
- Writing Type and NZ/non-NZ:
    - I think these are mostly fine (but double check)
    - Writing type: add to non-philosophical writing.
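
The second-attempt codes can be written down as a small lookup so that a relabelled row can be sanity-checked before it is saved. This is only a sketch: the column names follow the labelled dataframe, but `ALLOWED` and `invalid_fields` are hypothetical names (not part of `NL_helpers`), and it assumes every field is filled in, which may not hold for non-philosophy rows.

```python
# Allowed values for each label column under the second-attempt scheme.
# ALLOWED and invalid_fields are illustrative names, not NL_helpers functions.
ALLOWED = {
    'Readable': {True, False},
    'Philosophy': {True, False},
    'Philosophy Type': {'r', 'e', 'o'},
    'Writing Type': {'p', 'l', 'f', 'r'},
    'NZ': {True, False},
}


def invalid_fields(label_row):
    """Return the names of fields whose value is not an allowed code."""
    return [
        field for field, allowed in ALLOWED.items()
        if label_row.get(field) not in allowed
    ]
```

A row still carrying the old 'm' code would be flagged under 'Philosophy Type', which is a cheap way to find rows that need another look.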
    


### Other issues:
- The 'philosophy' classification tends to be applied to very short articles (and even empty ones), to articles in te reo, and to articles unreadable because of bad OCR.
    - Solution: Filter out unreadable articles by making a readable/unreadable classifier first. Ignore issue of te reo articles as this can be dealt with later.
- Misclassified articles are often articles which *contain* philosophical writing within a much larger piece (often these are very long articles by the editor of the newspaper touching on many topics).
    - Solution: Exclude these from the labelled (philosophy) dataset unless the majority of the article is 'philosophy'.
- I've imagined this as a hierarchical system, but it's possible that the, e.g. public event vs non public-event classifier would work better if I also labelled non-philosophy articles with writing types.

### Updates to use of NB classifiers
1. Readable/Non-Readable model
2. Philosophy/Non-Philosophy model
    * feed the resulting 'philosophy' corpus into topic model.
3. Religion-Science/Other model.
    * feed the resulting 'religion-science' corpus into topic model.
4. Do anything with 'public lecture?'... could do a 'letter classifier'?
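
The ordering of steps 1 to 3 can be sketched as a cascade in which each stage only sees what the previous stage kept. This is a rough outline rather than the actual pipeline: `run_cascade` is a hypothetical helper, and the three `is_*` arguments are stand-ins for the trained NB models' predictions (here just toy keyword rules).

```python
def run_cascade(articles, is_readable, is_philosophy, is_religion_science):
    """Apply the three filters in order, returning the corpus kept at each stage.

    Each is_* argument is a stand-in for a trained model's prediction:
    any callable that takes an article's text and returns True or False.
    """
    readable = {k: t for k, t in articles.items() if is_readable(t)}
    philosophy = {k: t for k, t in readable.items() if is_philosophy(t)}
    religion_science = {
        k: t for k, t in philosophy.items() if is_religion_science(t)
    }
    return readable, philosophy, religion_science


# Toy data and toy rules, purely for illustration.
toy = {
    'a': 'evolution and creation debated from first principles',
    'b': 'xx',
    'c': 'a lecture on the first principles of ethics',
}
readable, philosophy, religion_science = run_cascade(
    toy,
    is_readable=lambda text: len(text.split()) > 3,
    is_philosophy=lambda text: 'principles' in text,
    is_religion_science=lambda text: 'evolution' in text,
)
```

The `philosophy` and `religion_science` results are the two corpora that would be handed on to the topic model.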

In [2]:
labelled_with_text = pd.read_pickle('pickles/classified_with_text_as_list_df.pickle')

In [3]:
labelled = pd.read_pickle('pickles/classified_df.pickle')

In [19]:
def print_text_index_only(index, dataframe):
    """
    Given index, print a string containing heading and body text.
    Assumes dataframe contains a 'Text' column whose entries are lists
    of strings. Works out newspaper and date from the index, which is
    expected to look like 'WI_18661215_ARTICLE14'. The title is not
    looked up, so a placeholder is printed instead.
    """
    first_underscore = index.find('_')
    newspaper = index[:first_underscore]
    date = index[first_underscore + 1:first_underscore + 9]  # YYYYMMDD
    title = "Not given by labelled df."
    text_blocks = dataframe.loc[index, 'Text']
    # Wrap each block to 80 columns and separate blocks with blank lines.
    wrapped_blocks = [textwrap.fill(block, width=80) for block in text_blocks]
    text = '\n\n'.join(wrapped_blocks)
    article_string = f'{title}\n{newspaper} - {date}\n\n{text}'

    print(article_string)
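
The index parsing above could also be pulled out into its own helper. The sketch below assumes every index has the form `NEWSPAPER_YYYYMMDD_ARTICLEn` (as in `WI_18661215_ARTICLE14`); `parse_index` is a hypothetical name, not something defined in `NL_helpers`.

```python
from datetime import datetime


def parse_index(index):
    """Split an index such as 'WI_18661215_ARTICLE14' into its parts.

    Returns the newspaper code, the publication date as a date object,
    and the article identifier.
    """
    newspaper, date_string, article = index.split('_', 2)
    date = datetime.strptime(date_string, '%Y%m%d').date()
    return newspaper, date, article
```

Parsing the date properly means a malformed index raises a `ValueError` instead of silently printing the wrong eight characters.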

In [43]:
checked = []

In [42]:
# Drop the last two entries from checked so those articles are shown again.
checked = checked[0:-2]

In [50]:
# Step through the labelled articles, offering to relabel each unchecked one.
# Interrupt the kernel to stop; progress is kept in `checked`.
for index in labelled_with_text.index:
    if index in checked:
        continue
    print('?') #For some reason this statement is required for the next print statements to function.
    print(f"Current Classification = \n{labelled.loc[index]}")
    print_text_index_only(index, labelled_with_text)

    change_decision = input('Change? (y/n) >')
    if change_decision == 'y':
        labelled.loc[index] = NL_helpers.classify_text_v2()

    checked.append(index)
    clear_output()

?
Current Classification = 
Readable                                                        True
Philosophy                                                      True
Philosophy Type                                                    o
Writing Type                                                       p
NZ                                                              True
Notes              contains report of lecture on 'philosophy of d...
Name: WI_18661215_ARTICLE14, dtype: object
Not given by labelled df.
WI - 18661215

Ancient Okpeb of Remiabitks.— A Tent of the above Order, was opened pursuant to
adver tisement nt the Temperance Hall, on Thursday evening, by Brother
R.Johnson, P.C.R., of tho Hope of Auckland Tent. After the usual for malities
had been gone through, thirteen mem bers wore initiated, and the following;
office bearers were elected. C.R., Brother Fraser ; D.C.R., Brother Luwes ;
Secretary, Brother Levy ; Levite, Brother Jnnsen ; Guardian Brother, Watson ;
Treasurer, Brot

KeyboardInterrupt: 

In [54]:
len(checked)

90

In [53]:
with open('pickles/labelled_v2_df_in_progress.pickle', 'wb') as outfile:
    pickle.dump((checked, labelled), outfile)
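
To pick the checking session back up later, the saved tuple can be read back in the same order it was dumped. A minimal sketch, assuming the file holds the `(checked, labelled)` tuple written above; `load_progress` is a hypothetical helper.

```python
import os
import pickle


def load_progress(path):
    """Return (checked, labelled) from an in-progress pickle, or None if absent."""
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as infile:
        checked, labelled = pickle.load(infile)
    return checked, labelled
```

Calling `load_progress('pickles/labelled_v2_df_in_progress.pickle')` at the start of a session would restore both objects, or return `None` if nothing has been saved yet.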

In [6]:
philoso_df = pd.read_pickle('pickles/nb2_philoso_df.tar.gz')

In [10]:
NL_helpers.add_title_and_date(philoso_df)
philoso_df = NL_helpers.remove_duplicates(philoso_df)

In [13]:
index_subset = philoso_df.index.to_series().sample(n=500)

In [None]:
interact(NL_helpers.html_text, index=index_subset, dataframe=fixed(philoso_df), boldface=fixed(None))