# Data/Culture Workshop, Lancaster
## Content Analysis of Historical Newspapers


In [None]:
# let's just turn off warnings to keep the output tidy
import warnings
warnings.filterwarnings('ignore')

# OCR quality

In this notebook, we have a closer look at exploring newspaper content. But before we do that, let's have a look at the quality of the text data.

A major hurdle for analysing the historical press is the sometimes awful quality of the automatic text transcription produced by Optical Character Recognition (OCR) software, which converts images to machine-readable text.

And here ```m4n y th**in^gs can go wrong!```

So before analysing/reading our sources, we should determine what is readable and how data quality might impact our findings.

In this notebook, we investigate whether OCR errors are truly randomly distributed or skewed towards certain categories of newspapers. This could inform how we read our findings later on.

In [None]:
# We need to import the pandas library for working with tabular (spreadsheet-like) data
import pandas as pd
import re # another library for matching patterns in text
import plotly.express as px
import numpy as np
import seaborn as sns # import seaborn for making plots a bit prettier
import matplotlib.pyplot as plt
sns.set()

In [None]:
# load the dataframe from github
df = pd.read_csv("https://raw.githubusercontent.com/kasparvonbeelen/lancaster-newspaper-workshop/wc/data/subsample500mixedocr-selected_mitch.csv")
# for convenience we drop rows that contain NaN ('not a number', i.e. missing) values
# otherwise some of the scripts and operations might crash
df.dropna(inplace=True)

We can print the first n rows to get a sense of the information available to us.

In [None]:
df.head(3)

### Scatter plots

The most 'direct' way to interrogate data is to look at scatterplots.

In [None]:
fig = px.scatter(df,
                 x="word_count",
                 y="ocrquality",
                 color="political_leaning_label",
                 hover_data=['date','newspaper_title',"political_leaning_label", "price_label"],
                 trendline_scope="overall",
                 trendline="ols",
                 width=1000, height=500,
                 )
fig.update_layout(showlegend=True)
fig.show()

### Log-scale


A common technique to declutter the visualisation is to use a log scale. On a log scale, equal ratios take up equal distances, so differences between small values are stretched out and differences between large values are compressed.

In [None]:
print(np.log([1, 5]))
print(np.log([100,1000]))

In [None]:
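# keep only rows with a positive OCR quality and word count
# (the logarithm is only defined for positive numbers)
df = df[(df.ocrquality > 0) & (df.word_count > 0)].copy()
df['word_count_log'] = np.log(df['word_count'])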
fig = px.scatter(df,
                 x="word_count_log",
                 y="ocrquality",
                 color="political_leaning_label",
                 hover_data=['date','newspaper_title',"political_leaning_label", "price_label"],
                 trendline_scope="overall",
                 trendline="ols",
                 width=1000, height=500,
                 )
fig.update_layout(showlegend=True)
fig.show()

### Exercise

Plot the OCR quality over time using a scatter plot.

In [None]:
# enter code here

### Other plotting options

We can visualize distributions as histograms or density plots.

In [None]:
df[df.political_leaning_label.isin(['conservative','liberal'])].groupby(['political_leaning_label'])['ocrquality'].plot(kind='hist', bins=100, alpha=.6)

In [None]:
df[df.political_leaning_label.isin(['conservative','liberal'])].groupby(['political_leaning_label'])['ocrquality'].plot(kind='density')

### Exercise

Is the OCR of the halfpenny press (½ d) worse than the papers priced at 1d? For the exercise, ignore all other newspapers outside of these price points.

In [None]:
df.price_label.value_counts()

In [None]:
# enter your answer here, adapt the previous line of code df[df.political_leaning_label.


# Content Analysis

##  Counting Words with Regular Expressions

Regular expressions offer a convenient tool to explore content by searching and investigating the occurrence of specific patterns in the corpus.

Below we construct a regular expression in which we aim to capture multiple words (and variants) at once.

In abstract terms, the regex follows the format:
`"\b(query_1|query_2|...|query_n)\b"`

- `\b` indicates a word boundary, i.e. the position between a word character and a non-word character such as white space or a punctuation mark
- `|` indicates OR, i.e. we want to find any of the queried items
- `s?` ensures we include plural forms

In Python, we first formulate the regex as a 'raw' string (a string prefixed by `r`) and then compile it. When compiling, we can add extra flags; in this case `re.I`, which ignores the difference between upper and lower case.

In [None]:
# define the regular expression
regex = r"\b(trains?|rails?)\b"
# compile the regex using the ignore-case flag
# i.e. we will ignore the difference between upper and lower case
pattern = re.compile(regex, re.I)

In [None]:
# test the regex on a particular example
example_text = 'I took to trAin from Euston to Lancaster, but thetrain was delayed because there were leaves on the rails!'
pattern.findall(example_text)
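
To see what the word boundaries contribute, here is a minimal sketch on a made-up sentence (the sentence and the variable names are illustrative assumptions, not part of the workshop data). It runs the same query with and without `\b`.

```python
import re

# made-up example sentence (not taken from the newspaper corpus)
sentence = "Trains were delayed; the trainer restrained a dog near the rail."

# with word boundaries: only the whole words 'train(s)' or 'rail(s)' match
with_boundaries = re.compile(r"\b(trains?|rails?)\b", re.I)
# without word boundaries: the same letters inside longer words match too
without_boundaries = re.compile(r"(trains?|rails?)", re.I)

print(with_boundaries.findall(sentence))     # ['Trains', 'rail']
print(without_boundaries.findall(sentence))  # ['Trains', 'train', 'train', 'rail']
```

Without `\b`, the query also picks up 'train' inside 'trainer' and 'restrained', which is usually not what we want when counting mentions of a topic.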

Instead of applying the regex to one example, we can apply it to all items in the `text` column of our dataframe. For this, we need to apply the `.apply` method (what's in a name!) to the text column.

What does this operation return? For each row, it will return words that match our query regex, or return an empty list (or `[]`) in case we do not find anything!

In the code cell below, we apply the regex to all items in our dataframe.

In [None]:
df['text'].apply(pattern.findall)
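
Because the output on the full dataframe is long, it may help to look at the same operation on a tiny example first. The sketch below uses three made-up 'articles' (an assumption for illustration only) and shows that `.apply` returns one list per row, which is empty when nothing matches.

```python
import re
import pandas as pd

pattern = re.compile(r"\b(trains?|rails?)\b", re.I)

# three made-up 'articles' standing in for the text column
toy_text = pd.Series([
    "The train left Lancaster at noon.",
    "Corn prices rose again this week.",
    "New rails were laid and two trains collided.",
])

# .apply calls pattern.findall once per row and collects the results
toy_hits = toy_text.apply(pattern.findall)
print(toy_hits.tolist())  # [['train'], [], ['rails', 'trains']]
```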

Of course, we want to store the result of the `pattern.findall` operation and add the query results as a new column to the dataframe. In `pandas` this is relatively straightforward and resembles the variable assignment operation.

After saving the results in a new column, we can keep track of the number of matched items in the text (and the corpus). These results are stored in the `num_hits` column.

In [None]:
df['hits'] = df['text'].apply(pattern.findall) # save the query results in a new column
df['num_hits'] = df['hits'].apply(len) # count the number of items found

In [None]:
df['num_hits'].value_counts() # get the distribution of hits

In [None]:
df['num_hits'].value_counts().plot(kind='bar') # plot the distribution as a bar chart

We can inspect the result of the `findall` operation more closely, and zoom in on the examples where we encounter at least one hit. We use `df.num_hits > 0` as a filter to select only the rows which contain at least one mention of 'train' or 'rail' (singular or plural).

In [None]:
df_with_hits = df[df.num_hits > 0].reset_index()
df_with_hits[['hits','text']]
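
The filtering step can be unpacked on a small made-up table (the values below are placeholders, not workshop data). The comparison produces a column of `True`/`False` values, and indexing the dataframe with it keeps only the `True` rows.

```python
import pandas as pd

# made-up table with a precomputed number of hits per article
toy = pd.DataFrame({
    "text": ["a train", "no match here", "rails and trains", "nothing"],
    "num_hits": [1, 0, 2, 0],
})

mask = toy["num_hits"] > 0   # boolean Series: True where the row has a hit
toy_with_hits = toy[mask]    # keeps rows 0 and 2, with their old index labels
toy_reset = toy_with_hits.reset_index(drop=True)  # renumber the rows from 0

print(mask.tolist())                 # [True, False, True, False]
print(toy_with_hits.index.tolist())  # [0, 2]
print(toy_reset.index.tolist())      # [0, 1]
```

Here `drop=True` discards the old row labels; calling `reset_index()` without it, as in the cell above, keeps them in an extra column named `index`.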

We print the full content of the text at position 4 (the fifth row, since Python starts counting at 0).

In [None]:
print(df_with_hits.iloc[4][['hits','text']].text)

### Exercise

Print the content of the 7th text.

In [None]:
# enter code here

### Exercise

Let's explore a somewhat larger collection of texts, where we have 10,000 articles to play with. Then, search the newspaper dataframe for two (or more!) words of your choice.

In [5]:
# we download a larger sample of newspaper data
# with approx 10_000 articles per year
!wget -q --show-progress https://github.com/kasparvonbeelen/lancaster-newspaper-workshop/raw/wc/data/sample_lwm_hmd_mt90_10000.csv.zip
# unzip the downloaded sample
!unzip -o sample_lwm_hmd_mt90_10000.csv.zip
!rm -r __MACOSX

In [6]:
import pandas as pd
df = pd.read_csv('sample_lwm_hmd_mt90_10000.csv')
df.shape

#### Easy version

Select query terms and see how often these appear in the corpus.

In [None]:
query_1 = '' # add a query term between the quotation marks
query_2 = '' # add a query term between the quotation marks

regex = rf"\b({query_1}|{query_2})\b"
# compile the regex using the ignore-case flag
# i.e. we will ignore the difference between upper and lower case
pattern = re.compile(regex, re.I) # compile
df['hits'] = df['text'].apply(pattern.findall) # save the query results in a new column
df['num_hits'] = df['hits'].apply(len) # count the number of items found
df['num_hits'].value_counts()
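
One thing to watch out for: if a query term is left empty, the pattern contains an empty alternative, which matches at every word boundary and inflates the counts. Query terms containing characters such as `.` or `?` are also read as regex syntax. The sketch below shows one possible way to guard against both; `build_pattern` is a hypothetical helper written for this illustration, not part of the workshop code.

```python
import re

def build_pattern(terms):
    """Compile a case-insensitive whole-word pattern for a list of query terms."""
    # drop empty strings: an empty alternative would match at every word boundary
    cleaned = [t.strip() for t in terms if t.strip()]
    if not cleaned:
        raise ValueError("please provide at least one non-empty query term")
    # re.escape makes characters such as '.' or '?' match literally
    alternatives = "|".join(re.escape(t) for t in cleaned)
    return re.compile(rf"\b({alternatives})\b", re.I)

# the empty third term is silently ignored
pattern_demo = build_pattern(["steam", "engine", ""])
print(pattern_demo.findall("The Steam engine and the steamship."))  # ['Steam', 'engine']
```

The resulting pattern can be used exactly like the one above, for example with `df['text'].apply(pattern_demo.findall)`.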

#### Advanced version

- Define a new regular expression that queries the corpus for at least 2 words.
- Look at the previous examples and adapt the code to plot the distribution of the hits.


In [None]:
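regex = r'' # define your regular expression between the quotation marks
pattern = re.compile(regex, re.I) # compile
df['hits'] = df['text'].apply(...) # replace ... with the function that finds all matches
df['num_hits'] = df['hits'].apply(len) # count the number of items found
# plot the distribution of hits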

# Text and Metadata

Simply counting how often certain items appear is not that interesting. To use newspaper archives for making historical arguments, we often rely on metadata. More precisely, studying the relation between metadata and full-text content is where things get interesting historically.

Below we have a closer (and practical) look at some examples.


The code below repeats the regex-based search operations we discussed previously.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/kasparvonbeelen/lancaster-newspaper-workshop/wc/data/subsample500mixedocr-selected_mitch.csv")
df.dropna(inplace=True)

# define the regular expression
regex = r"\b(trains?|rails?)\b"
# compile the regex using the ignore-case flag
# i.e. we will ignore the difference between upper and lower case
pattern = re.compile(regex, re.I)
df['hits'] = df['text'].apply(pattern.findall) # save the query results in a new column
df['num_hits'] = df['hits'].apply(len) # count the number of items found


## Timelines

We group the results by year and count how often we encounter 'trains' in historical newspapers over the nineteenth century.

### Questions

- What is shown in the timeline below?
- And how could it be misleading?

In [None]:
df.groupby('year')['num_hits'].count().plot()

To investigate change over time, we need to 'normalize' results by year, to make the results comparable. One way of doing this is to divide the number of hits by the total number of words.

As seen previously, we can use `.split()` to divide the string by white spaces, and then count the number of 'words'*

*or a proxy to the number of words.

In [None]:
an_example_text = "This sentence has 5 words."
words = an_example_text.split()
print(words)
num_words = len(words)
print(num_words)

In [None]:
sentence = "This sentence has 5 words."
len(sentence.split())

Below we inspect the distribution of the document lengths using a histogram...

In [None]:
df['num_words'] = df['text'].apply(lambda x: len(x.split()))
df['num_words'].plot(kind='hist')

... Or plot the number of words by year.

In [None]:
df.groupby('year')['num_words'].sum().plot()

We can use these total counts to plot a timeline that shows the prevalence of a topic while accounting for the changes in corpus size.

To do this, we sum the number of hits and divide this by the total number of words for each year.

In [None]:
df_grouped = df.groupby('year').apply(lambda x: x['num_hits'].sum() / x['num_words'].sum())
df_grouped.plot()
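
The same normalisation can also be written without `.apply`, by first summing both columns per year and then dividing the yearly totals. The sketch below does this on made-up numbers (two invented articles per year), so that the arithmetic is easy to check by hand.

```python
import pandas as pd

# made-up counts: two articles per year
toy = pd.DataFrame({
    "year":      [1850, 1850, 1851, 1851],
    "num_hits":  [1,    3,    2,    0],
    "num_words": [100,  300,  150,  50],
})

# sum hits and words per year, then divide the two yearly totals
totals = toy.groupby("year")[["num_hits", "num_words"]].sum()
relative_freq = totals["num_hits"] / totals["num_words"]
print(totals)
print(relative_freq)
```

In this toy example the raw number of hits halves from 4 in 1850 to 2 in 1851, but so does the number of words (400 versus 200), so the relative frequency is 0.01 in both years.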

What might be confusing about the plots is that they (kind of) point in different directions. Should we believe the 'raw' counts or the normalized frequencies?

The short answer is that neither is reliable. Even though we imagine observing trends, we don't have enough data in this case to make any claims about historical change.

Why do I think this is the case?

We can plot the relative number of hits with confidence intervals (using the `seaborn` library).

In [None]:
df['ratio'] = df['num_hits'] /  df['num_words']
sns.lineplot(x='year',y='ratio', data=df)
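
With seaborn's default settings, `sns.lineplot` averages all rows that share the same x value and shades a 95% confidence interval around that mean, estimated by bootstrapping. The sketch below imitates this idea with `numpy` for a single year. The ratios are made-up placeholder values and the loop is a simplified illustration, not seaborn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# made-up per-article ratios for a single year (placeholder values)
ratios = np.array([0.0, 0.0, 0.002, 0.0, 0.01, 0.0, 0.004, 0.0])

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # resample the articles with replacement and record the mean
    resample = rng.choice(ratios, size=len(ratios), replace=True)
    boot_means[i] = resample.mean()

# the middle 95% of the bootstrapped means forms the interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={ratios.mean():.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

With only a handful of articles per year, the interval is wide compared to the mean itself, which is why the shaded bands in the plot above overlap so much.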

## Intermezzo: Understanding confidence intervals

In [None]:
from random import shuffle, random
scores = [random() for _ in range(100)]
scores[:3]

In [None]:
print(np.mean(scores))

In [None]:
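size = 20        # number of scores in each random subsample
n_trials = 100   # number of subsamples to draw
means = []
for _ in range(n_trials):
  shuffle(scores)  # reorder the scores at random (in place)
  means.append(sum(scores[:size]) / size)  # mean of the first `size` shuffled scores
# plot the distribution of the subsample means
ax = pd.Series(means).plot(kind='density')
# mark the mean of all 100 scores with a dashed black line
ax.axvline(sum(scores)/len(scores), color="black", linestyle="dashed")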

In [None]:
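# the 5th and 95th percentiles bound the middle 90% of the subsample means
percentiles = np.percentile(means,q=[5.,95.])
ax = pd.Series(means).plot(kind='density')
# dashed black line: mean of all scores; dashed red lines: the two percentiles
ax.axvline(sum(scores)/len(scores), color="black", linestyle="dashed")
ax.axvline(percentiles[0], color="red", linestyle="dashed")
ax.axvline(percentiles[1], color="red", linestyle="dashed")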

## Politics



In [None]:
df.political_leaning_label.unique()

## Guided Exercise: Politics and Language

- Create a simplified schema of these political labels that map each of the categories to either 'left', 'right' or 'non-aligned'.
- Save the simplified labels in a new column `political_labels_simplified`.
- Print the number of hits by political party using the simplified schema.
- Visualize the results using a barplot.
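
If the mapping step is unfamiliar, the sketch below shows the general mechanics on hypothetical labels that are deliberately unrelated to the political categories. A dictionary translates each detailed label into a simplified one, and `.replace` applies it to the whole column.

```python
import pandas as pd

# hypothetical labels, deliberately unrelated to the political categories
toy_labels = pd.Series(["apple", "carrot", "pear", "leek", "apple"])

# dictionary from detailed label to simplified label
toy_mapping = {"apple": "fruit", "pear": "fruit",
               "carrot": "vegetable", "leek": "vegetable"}

toy_simplified = toy_labels.replace(toy_mapping)
print(toy_simplified.value_counts())
```

Labels that do not appear in the dictionary are left unchanged by `.replace`, so it is worth checking the result with `value_counts()` for categories you forgot to map.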

In [None]:
## Enter code here

## create a mapping

## mapping = {'liberal': ...}

## apply mapping df[''].replace

## use value_counts() to see the number of hits by party
sns.barplot(x='political_labels_simplified',y='ratio', data=df)
plt.xticks(rotation=70)

#### Solution

Uncomment the code below.

In [None]:
# mapping = {'liberal': 'left',
#            'independent': 'non-aligned',
#            'neutral': 'non-aligned',
#            'constitutional': 'right',
#             'liberal; conservative': 'non-aligned',
#             'unionist':'right',
#             'independent; conservative': 'right',
#             'conservative':'right'}

In [None]:
# df['political_labels_simplified'] = df['political_leaning_label'].replace(mapping)
# df['political_labels_simplified'].value_counts()

In [None]:
# sns.barplot(x='political_labels_simplified',y='ratio', data=df)
# plt.xticks(rotation=70)

political_labels_simplified
non-aligned    163
left           151
right          113
Name: count, dtype: int64

## Guided Example: Exploring the periodicity in newspapers

Let's now play with a larger dataset and tie together everything we've seen so far. Instead of looking at change over time, we will inspect periodicities in historical newspapers.

In [None]:
# we download a larger sample of newspaper data
# with approx 10_000 articles per year
!wget -q --show-progress https://github.com/kasparvonbeelen/lancaster-newspaper-workshop/raw/wc/data/sample_lwm_hmd_mt90_10000.csv.zip

In [None]:
# unzip the downloaded sample
!unzip -o sample_lwm_hmd_mt90_10000.csv.zip
!rm -r __MACOSX

In [None]:
# import required libraries
import seaborn as sns
import pandas as pd
from tqdm import tqdm
import re
sns.set()

In [None]:
# read the csv file
df_large = pd.read_csv('sample_lwm_hmd_mt90_10000.csv')

In [None]:
# plot the OCR quality by year
sns.lineplot(x='year',y='ocrquality',data=df_large)

In [None]:
# plot the OCR quality by month
sns.lineplot(x='month',y='ocrquality',data=df_large)

## Question

Is there a significant change in OCR quality over the nineteenth century, but not from month to month?

In [None]:
# compute the number of words for each document
df_large['num_words'] = df_large.text.apply(lambda x: len(x.split()))

In [None]:
# plot the average document length
sns.lineplot(x='year',y='num_words',data=df_large)

In [None]:
# plot the average document length by month
sns.lineplot(x='month',y='num_words',data=df_large)

In [None]:
# search the corpus using a particular regular expression
tqdm.pandas() # use tqdm to print a progress bar
#pattern = re.compile(r'\btoo cold\b', re.I)
#pattern = re.compile(r'\btoo hot\b', re.I)
pattern = re.compile(r'\bcricket\b', re.I) # create and compile a regex pattern
df_large['matches'] = df_large.text.progress_apply(lambda x: pattern.findall(x)) # apply the compiled regular expression
df_large['num_matches'] = df_large.matches.apply(len) # count number of hits for each document
df_large['matches_ratio'] = df_large['num_matches'] / df_large['num_words'] # compute the ratio of hits
sns.barplot(x='month',y='matches_ratio',data=df_large) # plot the results with error bars

# Fin.