# Part 1: Introduction

This workshop is part of a [series](https://leddy.uwindsor.ca/rdm-tdm-jupyterhub-newspapers) on _Research Data Management_ and _Text Data Mining_. This particular session brings together textual analysis and digitized newspaper content within a [JupyterHub](https://jupyter.org/hub) environment. The session does not require background in any of these areas, but for those who may be attending this session without the benefit of the preceding _Introduction to JupyterHub_ workshop, it is useful to understand some conventions of [Jupyter](https://jupyter.org/) notebooks.

JupyterHub is a collaborative environment for the use of Jupyter notebooks, and thanks to the work of the [Digital Alliance](https://alliancecan.ca/) and others, including [Sharcnet/Compute Ontario](https://www.sharcnet.ca/), JupyterHub is increasingly becoming an entry point to research computing in Canada. The notebooks made available in JupyterHub are digital documents, analogous to the physical notebooks used for research in labs and studies. A Jupyter notebook consists of "cells", which typically contain descriptive text (like what you are reading right now), or code. Today's session will use [python](https://www.python.org/doc/essays/blurb/) coding statements, but Jupyter notebooks can use other programming languages as well. There is information at the end of this notebook if you are interested in learning more about Jupyter, but we will start with two simple exercises to illustrate how Jupyter works.

To start with, you will be asked to double-click on the next paragraph (but finish reading this one first). Your challenge will be to put an underscore before and after the word _Jupyter_, like this: \_Jupyter\_. After you have done so, select the "Run" button at the top of the screen, or use Ctrl-Enter (the keyboard shortcuts may not always be completely consistent across Jupyter environments, but the "Run" button should always work). Go ahead, give it a try.

Hi there, my name is Jupyter.

You probably noticed that the paragraph was contained within what looks like a box. That is a Jupyter "cell", a way to organize text and code. The underscore characters you added are an example of [Markdown](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html), a sort of simplified HTML scheme that allows some formatting options without requiring a lot of extra steps. There are many tools for Markdown, and options to help with more elaborate text challenges, like tables, but the main goal of Markdown is to be simple and not get in the way of explaining and describing what the code provided is trying to achieve.

And that brings us to modifying a cell with actual code. Double-click the next cell and change the word "World" to your name, or something else. Use the "Run" button again and you will hopefully see a change in the text below the cell. You are actually running code in this exercise: your notebook lets you alter and execute program statements.

In [None]:
print("Hello World! My name is Jupyter.")

Congratulations! By editing the two cells in the exercises above, you are most of the way to making use of Jupyter notebooks. Jupyter brings together text and code to provide an environment for explanatory documents. You don't necessarily have to understand the internals of Jupyter to benefit from notebooks, but hopefully this short introduction gives you a sense of how they work.


### Key takeaways:
* Jupyter notebooks are interactive.
* Jupyter notebooks support explanations.
* JupyterHub can be your gateway to resources that are otherwise unavailable.

# Part 2: Using newspapers as a source for text data mining


This workshop will be using an historical newspaper from Essex County, Ontario called [_The Amherstburg Echo_](http://ink.scholarsportal.info/echo). The _Echo_ was purposely selected both because it illustrates many of the challenges associated with newspaper digitization, especially for community newspapers, and because it was a consistently high-quality title from its beginnings in 1874 until it closed in 2012.

The challenges of digitization probably pale in comparison to those of keeping newspapers afloat financially, and Ontario has seen many newspapers shut down in recent years, but long before the internet, newspapers were a key source of information for communities. Historical newspaper digitization is a huge topic and is mostly out of scope for the workshop, but here are a [few slides](https://docs.google.com/presentation/d/1xHqzNIjqIV8rMLRbysLI8lK07H352fSEbkQOPVTSd0E/edit?usp=sharing) to help put newspapers in context for text processing.

In addition to trying to understand the source of the text, it is also worth flagging the sheer amount of text that historical newspapers can represent and some of the difficulties introduced by the use of [Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR). Here are some metrics in spreadsheets from the [New York Times](https://docs.google.com/spreadsheets/d/1GQM_7qszC2VBbsPSqDPr7rJXHZ9rmGz4slIO9dWAL8Q/edit#gid=0) (courtesy of the [Internet Archive](https://archive.org/)) and a comparable example from the [Echo](https://docs.google.com/spreadsheets/d/1rCGnQF5cNqSRWS3QxQmBSWRaGdSmC4zOyA_QqjbPVJM/edit#gid=0).

# Part 3: Text Mining up close - Spanish Flu example

For this example, we will build on the great work done by the University of Arizona's [Newspapers as Data](https://libguides.library.arizona.edu/newspapers-as-data) project. In order to make sure all of the building blocks we will need are in place in our environment, we will gather all of our _import_ statements together in one cell. In Python, the _import_ statement is used to specify program libraries that will be called on for specific tasks. By importing the libraries all at once, we will know immediately whether we have everything we need to run other cells.
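
If an import fails, it can help to see at a glance which packages are missing. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example, and the list covers just the top-level third-party packages imported in this notebook (the names used to install them may differ slightly in some environments).

```python
import importlib.util

# Top-level third-party packages imported in this notebook
required = ["matplotlib", "pandas", "nltk", "plotly", "whoosh"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
print("=> missing packages:", missing if missing else "none")
```

An empty result suggests the import cell should run cleanly; anything listed would need to be installed first.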

In [None]:
# some standard libraries
import os,re,sys
from datetime import datetime

# libraries for data mining and text analysis
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import RegexpTokenizer
import plotly.express as px

# whoosh libraries - for creating and searching indexes
from whoosh import scoring
from whoosh.fields import Schema, DATETIME, ID, TEXT 
import whoosh.highlight as highlight
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

print("=> libraries loaded and ready to go...")

It is not necessary to understand all of the Python syntax, but notice that the cell above ends with a print statement. This gives some additional indication that the process has completed, which is sometimes easy to miss in a notebook. One Python library that we will use immediately is the [Natural Language Toolkit](https://www.nltk.org/) or _NLTK_. The NLTK has support for downloading data that is useful for text processing, and like the Python libraries above, it is worth establishing at the outset that the data can be accessed and stored.

In [None]:
"""
Download the stopwords for several languages & VADER lexicon.
Note that some jupyter environments need custom paths, which can be
achieved with statements like:
   nltk_path = os.sep + "util" + os.sep + "odw" + os.sep + "nltk_data"
   nltk.data.path.append(nltk_path)
   nltk.download('stopwords',download_dir=nltk_path)
"""
nltk.download('stopwords')
nltk.download('vader_lexicon')
print("=> downloading complete...")

At this point, we can start providing configuration values for the newspaper we are working with. Here are some values which will not be changed again in the workshop, but could be altered for another newspaper title, or customized for a different project.

In [None]:
"""
These values align the newspaper title and OCR directory (leave these alone for workshop since we are using one title)
"""

news_title = "The Amherstburg Echo" # newspaper title for indexing
news_code = "echo" # used for location of OCR text
url_base = "https://collections.uwindsor.ca/workshop/" + news_code # url for viewing PDFs
print("=> newspaper title and OCR directory set...")

We will try a slight variation on an exercise in the University of Arizona's [Introduction to text mining](https://github.com/jcoliver/dig-coll-borderlands/blob/main/Text-Mining-Short.ipynb) notebook. The basic idea is to calculate the relative frequency of terms associated with the [Spanish Flu](https://www.theglobeandmail.com/canada/article-mandatory-masks-shuttered-theatres-and-confusing-rules-the-1918/) for the years 1917 to 1919, which was the height of this affliction in Ontario.
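
To make "relative frequency" concrete: it is the number of times the terms of interest appear, divided by the total number of words in the text. Here is a small standard-library sketch of that arithmetic. It uses `collections.Counter` rather than the pandas approach used later in this notebook, and the function name and sample sentence are invented for illustration.

```python
import re
from collections import Counter

def relative_frequency(text, topics):
    """Share of all word tokens in `text` that match any term in `topics`."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[term] for term in topics) / len(words)

# made-up sentence, not taken from the newspaper
sample_text = "The influenza epidemic closed the schools. Flu cases rose."
print(relative_frequency(sample_text, ["influenza", "flu"]))
```

Because the count is divided by the length of the text, a long page and a short page can be compared on the same scale.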

In [None]:
"""
These values define which terms and which years will be examined.
"""

news_topics = ["influenza","flu"]
news_range = "[1917 to 1919]" # we use whoosh layout for date range
print("=> relative frequency configuration set...")

Now for some Python code that goes through the full text of the newspapers for the years specified.

In [None]:
"""
The folders containing OCR text are organized by year. For the years identified
above, calculate relative frequencies (specified term compared to other terms),
and then graph the results.
"""
dates = [] # dates and frequencies will be collected here

# named year_range to avoid shadowing Python's built-in range()
year_range = re.findall(r'\d+', news_range)
# the tokenizer can be reused, so create it once
tokenizer = RegexpTokenizer(r'\w+')
folder_list = sorted(os.listdir(news_code))
folder_paths = [os.path.join(news_code,i) for i in folder_list]
for folder_path in folder_paths:
    folder = folder_path.split(os.sep)[1]
    if int(year_range[0]) <= int(folder) <= int(year_range[1]):
        file_list = sorted(os.listdir(folder_path))
        file_paths = [os.path.join(folder_path,j) for j in file_list]
        for file in file_paths:
            if file.endswith('.txt'):
                # the with statement closes the file even if reading fails
                with open(file,'r', encoding='utf8') as fp:
                    text = fp.read()
                text = " ".join(text.split())
                word_list = tokenizer.tokenize(text.lower())
                word_table = pd.Series(word_list,dtype='string')
                # Calculate relative frequencies of all words on the page
                word_freqs = word_table.value_counts(normalize = True)
                # Pull out only values that match words of interest
                my_freqs = word_freqs.filter(news_topics)
                # Get the total frequency for words of interest
                total_my_freq = my_freqs.sum()
                # The file names are used to identify dates
                skip = len(news_code) + 6
                dates.append([file[skip:skip + 10],total_my_freq])
            
# Add those dates to a data frame
results_table = pd.DataFrame(dates, columns = ['Date','Frequency']) 
# Analyses are all done, plot the figure
my_figure = px.line(results_table, x = 'Date', y = 'Frequency').update_layout(yaxis_title="Relative Freq.")
print("=> pages examined:", len(dates))
# Show figure
my_figure.show()
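
The date slicing in the cell above depends on how the files are named and where they sit in the folder structure. As a sketch, assuming page files are named like `1918-12-06-0006.txt` (a date followed by a page number, the pattern used elsewhere in this notebook), the date can also be read from the file name alone. The function name and the example path are hypothetical.

```python
import os
from datetime import datetime

def page_date(file_path):
    """Return the publication date encoded at the start of a page file name."""
    file_name = os.path.basename(file_path)
    return datetime.strptime(file_name[:10], "%Y-%m-%d").date()

# hypothetical path following the year-folder layout described above
example_path = os.path.join("echo", "1918", "1918-12-06-0006.txt")
print(page_date(example_path))
```

Parsing the date this way also raises an error when a file name does not follow the expected pattern, which can be a useful early warning.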

# Part 4: Sentiment Analysis of historical newspapers

[Sentiment Analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) typically involves the calculation of polarity scores, a method which will give us numbers for the following categories:

* Positive
* Negative
* Neutral
* Compound

The compound score is computed by summing the valence scores of the individual words, adjusted according to a set of rules, and then normalizing the result to fall between -1 (most extreme negative) and +1 (most extreme positive). The calculation is often based on a lexicon (a dictionary of terms with weight assignments). A popular lexicon is [VADER](https://github.com/cjhutto/vaderSentiment) (Valence Aware Dictionary and sEntiment Reasoner), a general-purpose system. VADER was designed with social media text in mind, but has been used for [historical text](https://programminghistorian.org/en/lessons/sentiment-analysis) as well.
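
To make the normalization step concrete, here is a minimal sketch of the formula used in the VADER reference implementation, where a raw score x (before normalization) is mapped to x / sqrt(x² + α) with α = 15. The sample inputs are arbitrary numbers chosen for illustration, not scores from any real text.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded raw score into the range -1 to +1."""
    norm_score = score / math.sqrt(score * score + alpha)
    # guard against floating-point overshoot at the extremes
    return max(-1.0, min(1.0, norm_score))

# arbitrary raw scores, chosen only to show the shape of the curve
for raw in (-4.0, 0.0, 1.9, 4.0, 20.0):
    print(raw, "->", round(normalize(raw), 4))
```

Small raw scores stay close to zero, while large ones approach but never pass -1 or +1, which is why compound scores from texts of different lengths can be compared.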

In [None]:
"""
Examine polarity scores for sample sentences.
"""

sample_sentence = "My cat is a good cat."
sentiment = SentimentIntensityAnalyzer() # this analyzer is specific to VADER
print(sentiment.polarity_scores(sample_sentence))

Once again, we will put our configuration options together in one cell. The approach here is to use an _index_ of the OCRed text of the newspaper to collect highlights or snippets of text surrounding term(s) of interest. The indexing is carried out by a Python search library called [Whoosh](https://whoosh.readthedocs.io/), which is very similar to the popular [Lucene](https://lucene.apache.org/) search engine. Like Lucene, Whoosh can retrieve snippets surrounding search terms. The idea is that the snippet will provide enough sentence structure to hopefully calculate a meaningful compound polarity score.
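
To give a rough sense of what a snippet is, the following standard-library sketch pulls a fixed number of characters from either side of each match. This is not Whoosh's highlighting algorithm, only an illustration of the idea; the function name `simple_snippets`, its parameters, and the sample sentence are made up for this example.

```python
import re

def simple_snippets(text, term, surround=10):
    """Return the text around each word starting with `term` (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\w*", re.IGNORECASE)
    return [
        text[max(0, match.start() - surround): match.end() + surround]
        for match in pattern.finditer(text)
    ]

# invented sentence, not taken from the newspaper
sample_text = "The council resolved that smoking be prohibited in the town hall."
for snippet in simple_snippets(sample_text, "smok", surround=15):
    print(snippet)
```

A wider `surround` value keeps more of the sentence, which gives a sentiment analyzer more to work with but also lets in more unrelated text.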

In [None]:
""" 
Configuration options for Whoosh searching 
"""

index_dir = "whoosh_index" # directory for index
snippet_limit = 200 # limit for number of snippets to work with
print("=> configuration set...")

The next cell contains the Python logic to build the Whoosh index and a few related functions. Note that the index could be built outside of the notebook, and probably should be for large newspaper sets. GitHub has limits on file size that make providing a prebuilt index problematic, a limitation that can carry over to _Binder_, so building the index when the notebook is run may be the better approach there. On the other hand, some _JupyterHub_ environments have ample space for providing prebuilt data.

In [None]:
"""
Classes and functions are here in one place.
"""

class MinimalFormatter(highlight.Formatter):

    def format_token(self, text, token, replace=False):
        tokentext = highlight.get_text(text, token, replace)

        # this could be elaborate as shown 
        # return "[%s]" % tokentext

        # but just return the token here
        return tokentext

def createUrl(np_page):
    # utility function for later
    year = np_page[:4]
    month = np_page[5:7]
    day = np_page[8:10]
    seq = np_page[11:]
    print(url_base + "/" + year + "/" + year + "_" + month + "/" + np_page + ".pdf")
    
def createSearchableData(root,indexdir):   
 
    # Note that we need content to be stored for highlighting to work
    schema = Schema(title=TEXT(stored=True),
              path=ID(stored=True),
              content=TEXT(stored=True),
              pubdate=DATETIME(stored=True))

    # this is how a whoosh index can be created
    # ideally, this would be done outside of the notebook
    # for a large set
    if not os.path.exists(indexdir):
        os.mkdir(indexdir)
 
        # Creating an index writer to add documents
        ix = create_in(indexdir,schema)
        writer = ix.writer()
 
        # Assume file text is local
        folder_list = sorted(os.listdir(root))
        folder_paths = [os.path.join(root,i) for i in folder_list]
        for folder_path in folder_paths:
            print(folder_path)
            file_list = sorted(os.listdir(folder_path))
            file_paths = [os.path.join(folder_path,j) for j in file_list]
            for file_path in file_paths:
                if file_path.endswith('.txt'):
                    fp = open(file_path,'r', encoding='utf8')
                    file_bits = file_path.split(os.sep)
                    page_id = file_bits[len(file_bits) - 1]
                    page_id = page_id.replace(".txt","")
                    date_str = page_id[:10]
                    date_object = datetime.strptime(date_str,"%Y-%m-%d")
                    page_num = int(page_id[11:])
                    ntitle = date_object.strftime("%B %d, %Y") + "- pg. " + str(page_num)
                    text = fp.read()
                    text = " ".join(text.split())
                    writer.add_document(title = news_title + ". " + ntitle,
                        path=page_id, content=text, pubdate = date_object)
                    fp.close()
                
        print("committing...") # this can be the slowest step
        writer.commit() 
    if os.path.exists(indexdir):
        print("=> index directory exists...")
        
print("=> classes & functions in place...")

An index only has to be created once (if the data has not changed). This next cell can be skipped if the index is already there. The GitHub set of newspaper text will take a few minutes to be indexed by Whoosh.

In [None]:
"""
This is where we build the index. This could take some time,
depending on the JupyterHub environment.
"""

createSearchableData(news_code,index_dir) # a great time to look at folder structure

Now we use the index to get highlights/snippets. Whoosh has the plumbing for more elaborate searching, but we will keep it simple for now.

In [None]:
"""
Whoosh steps in - this is where to search the index
"""

# follow whoosh conventions for terms, very similar to lucene, e.g "wom?n OR female*"
news_query = "smok*"
index_range = "[1917 to 1918]" # follow whoosh conventions for dates, e.g "1975", "[1970 to 1980]", "[19000101 to 19000431]"

# the index directory contains the index
ix = open_dir(index_dir)
 
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(news_query)
allow_q = qp.parse("pubdate:" + index_range)

with ix.searcher() as s:
    results = s.search(q,filter=allow_q,limit=snippet_limit) 
    # Set the maximum fragment size (in characters)
    results.fragmenter.maxchars = 100

    # Set how much context (in characters) to show before and after
    results.fragmenter.surround = 10

    # Use the class defined above to strip HTML tags around terms
    minf = MinimalFormatter()
    results.formatter = minf

    snippets = []
    for hit in results:
        # clean up the spaces in the result
        snippet = " ".join(hit.highlights("content").split())
        snippets.append([snippet,hit["path"][:4],hit["path"]])
print("=> # of snippets gathered: ", len(snippets))
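
The wildcard patterns mentioned in the cell above (`*` for any run of characters, `?` for a single character) can be tried out on a small word list with Python's standard `fnmatch` module. This only approximates the wildcard behaviour for illustration, since Whoosh does its matching against the index; the word list is invented.

```python
from fnmatch import fnmatchcase

# invented word list for trying out patterns
words = ["smoke", "smoking", "smokers", "smock", "woman", "women", "female"]

def matching(pattern, candidates):
    """Return the candidate words that match a wildcard pattern."""
    return [word for word in candidates if fnmatchcase(word, pattern)]

print(matching("smok*", words))
print(matching("wom?n", words))
```

Trying a pattern this way before searching can show whether it is likely to pull in unintended words.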

At this point, the snippets/highlights are collected. Now we hand over the results to the powerful [Pandas](https://pandas.pydata.org/) library. Pandas has a data structure called a _DataFrame_, sort of like an internal spreadsheet or database table, which is extremely valuable for data processing.

In [None]:
"""
Take a look at a few snippets
"""

for snippet in snippets[:10]:
    print("snippet", snippet)

# Part 5: From searching to data analysis

The text from the index is not quite ready for data processing. The following cell will convert the results to a format that can be handled by the Sentiment Analysis tools.

In [None]:
"""
The Whoosh results are set up for a DataFrame.
"""

df = pd.DataFrame(snippets,columns=['snippet','year','page'])
df["row_id"] = df.index + 1

# replace every character that is not a letter (or #) with a space
df['snippet'] = df['snippet'].str.replace("[^a-zA-Z#]", " ", regex=True)
# convert to lower-case
df['snippet'] = df['snippet'].str.casefold()

print("=> the handover from whoosh to pandas is complete...")
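
To see what the two cleaning steps do to a single snippet, here is a standard-library sketch that mirrors them with `re.sub` and `str.casefold`. The function name and the sample text are invented for illustration.

```python
import re

def clean_snippet(snippet):
    """Replace non-letters (except #) with spaces, then case-fold the result."""
    return re.sub(r"[^a-zA-Z#]", " ", snippet).casefold()

# invented snippet, not taken from the newspaper
sample = "Smoking PROHIBITED, by order of the Board (1918)!"
print(repr(clean_snippet(sample)))
```

Each removed character is replaced by a space, so the cleaned text keeps its original length but may contain runs of several spaces where digits and punctuation used to be.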

The following is based on the redgate tutorial at [Sentiment Analysis with Python](https://www.red-gate.com/simple-talk/development/data-science-development/sentiment-analysis-python/).

In [None]:
"""
This is where the polarity scores are calculated.
"""

tmp = []
df_output = None

sid = SentimentIntensityAnalyzer()
for index, row in df.iterrows():
    # use column labels, positional access like row[0] is deprecated in pandas
    scores = sid.polarity_scores(row['snippet'])
    for key, value in scores.items():
        # row_id is the last column
        tmp.append([row['row_id'],key,value])

if len(tmp) > 0:
    # this is a slight variation, the original append method has been deprecated
    t_df=pd.DataFrame(tmp,columns=['row_id','sentiment_type','sentiment_score'])
    # remove duplicates if any exist
    t_df_cleaned = t_df.drop_duplicates()
    # only keep rows where sentiment_type = compound
    t_df_cleaned = t_df_cleaned[t_df_cleaned.sentiment_type == 'compound']
    # merge dataframes - this unites the snippets with scores
    df_output = pd.merge(df, t_df_cleaned, on='row_id', how='inner')

print("=> we have a DataFrame with the scores...")

The DataFrame can now be examined with some nifty built-in functions.

In [None]:
df_output[["sentiment_score"]].describe()

We can look more closely at some of the scores and the underlying snippets. This will help give a sense of how the numbers match the text.

In [None]:
"""
Use some built-in Pandas functions to look at scoring.
"""

# take a look at first few entries for negative scores
df_belowzero = df_output[df_output.sentiment_score < 0.0]
df_snippets = df_belowzero[["sentiment_score","snippet","page"]]
print("=== negative scores: %d out of %d ===" % (len(df_snippets),len(df_output)))
print(df_snippets.head(10))

df_abovezero = df_output[df_output.sentiment_score > 0.0]
df_snippets = df_abovezero[["sentiment_score","snippet","page"]]
print("=== positive scores: %d out of %d ===" % (len(df_snippets),len(df_output)))
print(df_snippets.head(10))

df_zero = df_output[df_output.sentiment_score == 0.0]
df_snippets = df_zero[["sentiment_score","snippet","page"]]
print("=== zero scores: %d out of %d ===" % (len(df_snippets),len(df_output)))
print(df_snippets.head(10))
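
The cell above splits the scores at exactly zero. The VADER documentation suggests a slightly wider neutral band, treating compound scores of 0.05 or more as positive and -0.05 or less as negative. A minimal sketch of that convention follows; the function name and the sample scores are illustrative only.

```python
def label_compound(score, threshold=0.05):
    """Label a compound score using a symmetric neutral band around zero."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# made-up compound scores
for score in (0.6249, 0.02, 0.0, -0.3):
    print(score, label_compound(score))
```

With noisy OCR text, a neutral band like this can keep snippets with barely non-zero scores from being counted as clearly positive or negative.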

If we want to see the content associated with the snippet, we can use a convenience function to look at a PDF with embedded OCR.

In [None]:
"""
Copy and paste from above for snippets of interest.
"""

createUrl("1918-12-06-0006")
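
The convenience function prints the link. If the link is needed as a value, for example to store it alongside each snippet, a variant that returns the string may be handier. This is a sketch that follows the same URL layout; the function name `build_page_url` is hypothetical, and the base URL is passed in rather than read from the notebook's configuration.

```python
def build_page_url(base_url, page_id):
    """Build the PDF link for a page id such as 1918-12-06-0006."""
    year = page_id[:4]
    month = page_id[5:7]
    return f"{base_url}/{year}/{year}_{month}/{page_id}.pdf"

print(build_page_url("https://collections.uwindsor.ca/workshop/echo", "1918-12-06-0006"))
```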

Graphing the results by year is one approach to tracking sentiment over time. Dataframes make creating graphs a fairly simple process.

In [None]:
"""
Graph DataFrame by year
"""

# generate mean of sentiment_score by year
dfg = df_output.groupby(['year'])['sentiment_score'].mean()
# create a bar plot

dfg.plot(kind='bar', title='Sentiment Score', ylabel='Mean Sentiment Score',
         xlabel='Year', figsize=(6, 5))
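
For anyone curious about what the grouping step does behind the scenes, here is a standard-library sketch of the same idea: collect the scores for each year, then average them. It is a conceptual illustration rather than how pandas implements `groupby`, and the scores are invented.

```python
from collections import defaultdict
from statistics import mean

def mean_by_year(rows):
    """Average the scores for each year; rows are (year, score) pairs."""
    grouped = defaultdict(list)
    for year, score in rows:
        grouped[year].append(score)
    return {year: mean(scores) for year, scores in sorted(grouped.items())}

# invented scores, not results from the newspaper
sample_rows = [("1917", 0.5), ("1917", -0.25), ("1918", 0.25)]
print(mean_by_year(sample_rows))
```

Keep in mind that a mean can hide a wide spread of scores, so it is worth looking at the number of snippets behind each bar as well.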

# Part 6: Where to next?

This workshop has hopefully shown a bit of what is possible with Jupyter and historical newspapers. Here are some resources for further exploration of what has been presented here.

### Jupyter/JupyterHub

* [Introduction to Jupyter Notebooks](https://programminghistorian.org/en/lessons/jupyter-notebooks) - from the [Programming Historian](https://programminghistorian.org/).

* [JupyterHub](https://jupyterhub.readthedocs.io/) - the official docs for JupyterHub. 

* [Anaconda](https://www.anaconda.com/products/distribution) - an easy option for installing Jupyter across platforms.

### Text Analysis/Data Processing

* [Introduction to Pandas](https://towardsdatascience.com/introduction-to-pandas-hands-on-tutorial-part-one-2e74f35ab166) - a short and to the point tutorial that uses _Anaconda_.

* [Introduction to Natural Language Processing using NLTK](https://blog.paperspace.com/introduction-to-natural-language-processing-using-nltk/) - a quick run-through of _NLTK_ functions.

### Newspaper Digitization/Newspapers as Data

* [Newspapers as Data](https://libguides.library.arizona.edu/newspapers-as-data) - the starting point for the University of Arizona's efforts to support student data literacy with newspapers.

* [Collections As Data](https://collectionsasdata.github.io/part2whole/) - this Mellon-funded initiative provided support for Arizona's great work. The site includes the grant justification and information about the concluding international summit, [Collections as Data: State of the Field and Future Directions](https://collectionsasdata.github.io/part2whole/future/), scheduled for April 25-26, 2023 at [Internet Archive Canada](https://internetarchivecanada.org/).

* [Macrophotography from Microfilm](https://ourdigitalworld.net/2020/07/20/macrophotography-from-microfilm/) - no organization has digitized more newspapers with fewer resources than [OurDigitalWorld](https://ourdigitalworld.net/) (ODW). This is one of many initiatives ODW has supported to bring costs down and open the possibilities of newspaper digitization to organizations of all sizes.