# Introduction

Text collections and textual analysis offer a unique and largely untapped combination for applying data mining techniques and computational analysis to generate new insights into the past. This is particularly true for newspaper collections. Newspaper pages typically have eight times the amount of text that appears on a book page, and a modest weekly newspaper can represent many thousands of pages of local content for even the smallest of communities. The [Leddy Library](https://leddy.uwindsor.ca/) at the [University of Windsor](https://www.uwindsor.ca) has been digitizing local newspapers with partners for over a decade, including [The Amherstburg Echo](http://ink.scholarsportal.info/echo), [The Essex Free Press](http://ink.scholarsportal.info/efp), and [The Border Cities Star](http://ink.scholarsportal.info/bcs). With the help of the [Essex County Library System](https://www.countyofessex.ca/en/resident-services/library.aspx) and [Hackforge](https://hackf.org/), the [Academic Data Centre](https://leddy.uwindsor.ca/key-service-areas/academic-data-centre) is designating the month of March 2023 as an opportunity to encourage the use of local digitized historical newspapers for data mining and text analysis with 10 prizes of $50 Amazon gift cards.

This opportunity arose from a [workshop series](https://leddy.uwindsor.ca/rdm-tdm-jupyterhub-newspapers) on _Research Data Management_ and _Text Data Mining_ supported by [SHARCNET/Compute Ontario](https://www.sharcnet.ca/). Although access to the newspapers is demonstrated here through a [Jupyter](https://jupyter.org/) notebook, you are not required to use Jupyter. Simply send your code, code snippets, URLs, or even ideas to [libdata@uwindsor.ca](mailto:libdata@uwindsor.ca). We are casting a wide net in the hope of fostering ideas on the use of newspapers for research.

The datasets consist of the [Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) text for 5 newspaper titles, as shown in the table below. We have selected 3 titles from [Essex County](https://en.wikipedia.org/wiki/Essex_County,_Ontario) and 2 titles from [Chatham-Kent](https://en.wikipedia.org/wiki/Chatham-Kent). These 2 neighbouring counties in Southern Ontario both have rich newspaper histories, and the titles are represented from the start of the newspapers through to 1950. Please note the gaps in coverage shown in the table: it was not always possible to digitize every year of a newspaper's operation, and in some cases no record exists for the time period.

| Newspaper Title                         | Coverage                                     |
| :-------------------------------------- | :------------------------------------------- |
| The Comber Herald (Chatham-Kent County) | 1892,1894-1902,1906-1908,1912-1914,1920-1949 |
| The Amherstburg Echo (Essex County)     | 1874-1936,1943-1946                          |
| The Essex Free Press (Essex County)     | 1895-1908,1911-1922,1924,1927-1949           |
| The Harrow News (Essex County)          | 1931,1933-1935,1938-1949                     |
| The Tilbury Times (Chatham-Kent County) | 1898-1909,1917-1949                          |
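
Before choosing a range of years, it can help to check it against the gaps in the table. The following is a minimal sketch, not part of the original notebook: the helper name `expand_coverage` and the range of interest are illustrative assumptions, and the coverage string is copied from the table row for The Essex Free Press.

```python
def expand_coverage(coverage):
    """Expand a coverage string such as '1924,1927-1949' into a set of years."""
    years = set()
    for part in coverage.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A span such as 1895-1908 includes both end years
            start, end = (int(p) for p in part.split("-"))
            years.update(range(start, end + 1))
        else:
            years.add(int(part))
    return years

# Coverage string copied from the table row for The Essex Free Press
efp_years = expand_coverage("1895-1908,1911-1922,1924,1927-1949")

# Hypothetical range of interest: 1915 to 1920 inclusive
wanted = range(1915, 1921)
missing = [year for year in wanted if year not in efp_years]

print("years covered:", len(efp_years))
print("years missing from 1915-1920:", missing)
```

An empty `missing` list means every year in the chosen range has at least some digitized content for that title.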

This session brings together textual analysis and digitized newspaper content within a [JupyterHub](https://jupyter.org/hub) environment. It does not require a background in any of these areas, but for those attending without the benefit of the preceding _Introduction to JupyterHub_ workshop, it is useful to understand some conventions of [Jupyter](https://jupyter.org/) notebooks.

The layout closely follows the format used by the [Newspapers as Data](https://libguides.library.arizona.edu/newspapers-as-data) project. Each newspaper has a zipped file, available at the [Internet Archive](https://archive.org/), which consists of a series of folders. The top-level folders are organized by year of publication. Within each year, there is a _pages_ folder, which contains one text file with the OCR for each page of a newspaper issue, named by date and sequence number, e.g. _1907-08-23-0001.txt_. There is also a _volumes_ folder, which puts the OCR for an entire issue into one file, e.g. _1907-08-23.txt_. One may be preferable over the other, depending on what you are trying to do with the text. You can also view the [PDF](https://en.wikipedia.org/wiki/PDF) of a page using the method shown below. The page will contain embedded OCR. Newspapers, and in particular historic newspapers from microfilm, can be extremely challenging for image and OCR quality.
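
The file naming convention described above can be unpacked with a few lines of Python. This is a sketch only: the function name `parse_ocr_filename` is hypothetical, and the pattern is inferred from the two example file names, so other files in the archives may not match it.

```python
import re
from datetime import date

# Assumed pattern: YYYY-MM-DD, optionally followed by -NNNN for page files
FILE_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(?:-(\d+))?\.txt$")

def parse_ocr_filename(filename):
    """Return (issue_date, page_number) for an OCR file name.

    page_number is None for whole-issue files from a volumes folder.
    """
    match = FILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    year, month, day, page = match.groups()
    issue_date = date(int(year), int(month), int(day))
    page_number = int(page) if page is not None else None
    return issue_date, page_number

# A page file and a whole-issue file, using the examples from the text
print(parse_ocr_filename("1907-08-23-0001.txt"))
print(parse_ocr_filename("1907-08-23.txt"))
```

Working with real `date` objects rather than strings makes it easier to group issues by month or to look up the day of the week on which a paper was published.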

# Working with the data

The code sample below shows one method of accessing the OCR files from within a Jupyter notebook, but you can use the links to the datasets directly as well:

* [The Comber Herald](https://archive.org/download/comber_ocr) - link to [zipped OCR](https://archive.org/download/comber_ocr/comber.zip) (~188 MB)
* [The Amherstburg Echo](https://archive.org/download/echo_ocr) - link to [zipped OCR](https://archive.org/download/echo_ocr/echo.zip) (~420 MB)
* [The Essex Free Press](https://archive.org/download/efp_ocr) - link to [zipped OCR](https://archive.org/download/efp_ocr/efp.zip) (~162 MB)
* [The Harrow News](https://archive.org/download/harrow_ocr) - link to [zipped OCR](https://archive.org/download/harrow_ocr/harrow.zip) (~6 MB)
* [The Tilbury Times](https://archive.org/download/tilbury_ocr) - link to [zipped OCR](https://archive.org/download/tilbury_ocr/tilbury.zip) (~145 MB)
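
The zipped OCR links listed above all share one URL pattern, so switching titles only requires changing a short name. The sketch below captures that pattern; the dictionary `NEWSPAPER_IDS` and the function `ocr_zip_url` are illustrative names, not part of the original notebook.

```python
# Short names taken from the dataset links listed above
NEWSPAPER_IDS = {
    "The Comber Herald": "comber",
    "The Amherstburg Echo": "echo",
    "The Essex Free Press": "efp",
    "The Harrow News": "harrow",
    "The Tilbury Times": "tilbury",
}

def ocr_zip_url(title):
    """Build the Internet Archive download URL for a title's zipped OCR."""
    short_name = NEWSPAPER_IDS[title]
    return f"https://archive.org/download/{short_name}_ocr/{short_name}.zip"

print(ocr_zip_url("The Essex Free Press"))
```

The returned string can stand in for a hard-coded download URL when you want to run the same analysis against a different title.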



In [26]:
"""
Retrieve newspaper OCR from the Internet Archive
and extract in a temporary folder. This can take
a few minutes for larger sets.
"""

from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import os
import re
import tempfile

# Use the url for The Essex Free Press
efp_url = "https://archive.org/download/efp_ocr/efp.zip"

newspaper_folder = tempfile.TemporaryDirectory()

with urlopen(efp_url) as zip_resp:
    with ZipFile(BytesIO(zip_resp.read())) as zfile:
        zfile.extractall(newspaper_folder.name)
        
year_list = sorted(os.listdir(newspaper_folder.name))
print("=> years extracted:", len(year_list))

=> years extracted: 50


If the cell above runs without errors (it might take a few minutes), we can go ahead and set some values. In this case, we will look for occurrences of the terms _influenza_ and _flu_ for the years 1915 to 1920. This covers the lead-up to the [Spanish Flu](https://www.theglobeandmail.com/canada/article-mandatory-masks-shuttered-theatres-and-confusing-rules-the-1918/), which took hold in Ontario in late 1918 and persisted through to the last wave in 1920.

In [27]:
news_topics = ["influenza","flu"]
news_range = "[1915 to 1920]"
print("=> values set!")

=> values set!


We will try a slight variation on an exercise in the University of Arizona's [Introduction to text mining notebook](https://github.com/jcoliver/dig-coll-borderlands/blob/main/Text-Mining-Short.ipynb). The basic idea is to calculate the relative frequency of terms we are interested in.
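
To make the calculation concrete, here is a minimal sketch that uses only the Python standard library. The function name `relative_frequency` and the sample sentence are invented for illustration; the notebook cell that follows applies the same idea to each newspaper issue with pandas and NLTK.

```python
import re
from collections import Counter

def relative_frequency(text, topics):
    """Share of all word tokens in text that match any of the topic words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear
    return sum(counts[topic] for topic in topics) / len(tokens)

# Invented sample text: 12 tokens, 3 of which are topic words
sample = "Influenza cases rise. The flu closes schools and the flu closes theatres."
print(relative_frequency(sample, ["influenza", "flu"]))  # 0.25
```

Dividing by the total number of tokens keeps long and short issues comparable, which a raw count of matches would not.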

In [28]:
# libraries for data mining and text analysis
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
import plotly.express as px

dates = [] # dates and frequencies will be collected here

# Pull the start and end years out of news_range
# (avoid the name "range", which would shadow the Python built-in)
start_year, end_year = (int(y) for y in re.findall(r'\d+', news_range))
# Create the tokenizer once and reuse it for every issue
tokenizer = RegexpTokenizer(r'\w+')
for folder in year_list:
    if start_year <= int(folder) <= end_year:
        # Use the volumes folder since the date is used
        year_path = os.path.join(newspaper_folder.name, folder, "volumes")
        file_list = sorted(os.listdir(year_path))
        for file in file_list:
            if file.endswith('.txt'):
                # The with block closes the file even if reading fails
                with open(os.path.join(year_path, file), 'r', encoding='utf8') as fp:
                    text = fp.read()
                word_list = tokenizer.tokenize(text.lower())
                word_table = pd.Series(word_list,dtype='string')
                # Calculate relative frequencies of all words in the issue
                word_freqs = word_table.value_counts(normalize = True)
                # Pull out only values that match words of interest
                my_freqs = word_freqs.filter(news_topics)
                # Get the total frequency for words of interest
                total_my_freq = my_freqs.sum()
                # The file names are used to identify dates
                dates.append([file[:10],total_my_freq])

# Add those dates to a data frame
results_table = pd.DataFrame(dates, columns = ['Date','Frequency']) 
# Analyses are all done, plot the figure
my_figure = px.line(results_table, x = 'Date', y = 'Frequency').update_layout(yaxis_title="Relative Freq.")
# Each file in the volumes folder holds one issue, so this counts issues
print("=> issues examined:", len(dates))
# Show figure
my_figure.show()

=> issues examined: 312


This is just one example. The quality of the OCR varies across the newspaper titles, and some combinations will work better than others.

In [29]:
"""
Clean up the temporary folder
"""
newspaper_folder.cleanup()

print("=> Newspaper folder has been removed.")

=> Newspaper folder has been removed.


# More information

### Jupyter/JupyterHub

* [Introduction to Jupyter Notebooks](https://programminghistorian.org/en/lessons/jupyter-notebooks) - from the [Programming Historian](https://programminghistorian.org/).

* [JupyterHub](https://jupyterhub.readthedocs.io/) - the official docs for JupyterHub. 

* [Anaconda](https://www.anaconda.com/products/distribution) - an easy option for installing Jupyter across platforms.

### Text Analysis/Data Processing

* [Introduction to Pandas](https://towardsdatascience.com/introduction-to-pandas-hands-on-tutorial-part-one-2e74f35ab166) - a short, to-the-point tutorial that uses _Anaconda_.

* [Introduction to Natural Language Processing using NLTK](https://blog.paperspace.com/introduction-to-natural-language-processing-using-nltk/) - a quick run-through of _NLTK_ functions.

* [COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge) - a Kaggle challenge using the [CORD-19 open research dataset](https://blog.allenai.org/sunsetting-cord-19-239fb2f9ff4a). CORD-19 became available in March 2020, when the White House and a coalition of leading research groups created a freely available dataset of over 1M scholarly articles to encourage data mining and other text-based approaches to help in the fight against COVID-19. Kaggle, which is backed by Google, has been called an "AirBnB for data scientists". If you are interested in what Jupyter can offer for analysing text content, there are a lot of intriguing ideas here.

### Newspaper Digitization/Newspapers as Data

* [Newspapers as Data](https://libguides.library.arizona.edu/newspapers-as-data) - the starting point for the University of Arizona's efforts to support student data literacy with newspapers. 

* [Text Data Mining of Newspapers in JupyterHub](https://github.com/ADC-RDM/TDMnewspapers) - materials from the workshop series [RDM & TDM in JupyterHub with Newspapers](https://leddy.uwindsor.ca/rdm-tdm-jupyterhub-newspapers).
