The British Library Newspaper Titles: Some Basic Numbers

How many titles does the British Library hold? When were they published? This document is intended to answer a few basic questions about the Library's newspaper titles using the title list, and to illustrate how the list can be queried using the programming language R.

The best way to start exploring what you might do with the title list and a programming language like R (or just a spreadsheet program) is to come up with some basic questions it might help to answer. Let's start with:

  • What year had the largest number of new titles?
  • How many titles do we hold from each of England, Wales, Ireland and Scotland?
  • For which county do we hold the largest number of titles in the nineteenth century?
  • Which city (outside London) has the biggest percentage change in number of titles from 1800 to 1900?

What is this document?

This is a markdown file, made from a Jupyter notebook. A Jupyter notebook is usually an interactive document which can contain code, images and text, and a markdown file is a static version of that. Each bit of code runs in a cell, and the output is displayed directly below.

This document isn't interactive, but the code is reproducible if you put the .csv file in the right folder and copy the bits of code into R using RStudio. You can then change any of the variables to get different results - to look at another time period, for example.

The code I've used is R, a language particularly good for data analysis, though another language, Python, is probably used in Jupyter more frequently. If you're going to work in R, I would recommend downloading RStudio to do all your work, which can then be copied and pasted over to a Jupyter notebook if needed, as I've done here.

There are tonnes of places to get started working with R, Python, Jupyter notebooks and so forth, and we would recommend looking here in the first instance:

https://programminghistorian.org/

https://software-carpentry.org/

First we need to load some libraries which we'll use. A library is just a bunch of functions* grouped together, usually with a particular overall purpose or theme.

'tidyverse' is actually a number of libraries with hundreds of useful functions to make lots of data analysis easier. It includes a very powerful plotting library called 'ggplot2'.

'readxl' is a library which... reads Excel files.

The tidyverse is usually the first thing I load, before even deciding what I'm going to do with my data.

Lots of this code uses something called piping. This is a function in one of our tidyverse libraries which allows you to do something to your data, and then pass it along to another function using this symbol: %>%

It allows you to string lots of changes to your data together in one block of code, so you might filter it, then pass the filtered data to another function which summarises it, and pass it on to another function which plots it as a graph.
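For instance, once the tidyverse is loaded (below), a minimal sketch of a pipe chain using R's built-in mtcars data might look like this:

mtcars %>%
  filter(cyl == 6) %>%                 # keep only the six-cylinder cars
  summarise(mean_mpg = mean(mpg))      # then summarise what's left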

* You might say a function is a pre-made block of code which does something to some data. It has a name and often one or more arguments. The first argument is often a space for you to specify the thing you want to do the function on, and subsequent arguments might be additional parameters.
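For instance, round() is a function: its first argument is the number you want to round, and digits is an additional parameter.

round(3.14159, digits = 2)   # the number is the data; digits = 2 is an extra parameter
# returns 3.14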

library(tidyverse)
library(readxl)

The first thing to do is download a copy of the British and Irish Newspapers title list, available here. This uses the version in the .zip file, which contains a .csv and a readme - the column names in the Excel sheet are slightly different.

Load the whole title list into a variable called 'working_list':

working_list <- read_csv(
  "BritishAndIrishNewspapersTitleList_20191118.csv", 
  locale = locale(encoding = "latin1"))
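If you downloaded the Excel version instead, readxl's read_excel() can load it, and this is where you would specify the sheet. The file and sheet names below are assumptions, so check them against your copy (and remember the column names differ slightly from the .csv):

working_list <- read_excel(
  "BritishAndIrishNewspapersTitleList_20191118.xlsx",  # assumed file name - check your download
  sheet = 1)                                           # or the sheet name in your copy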

glimpse() allows us to take a look at all the fields:

glimpse(working_list)
Observations: 25,150
Variables: 23
$ Title.ID                                           <chr> "016901050", "01...
$ NID                                                <chr> "0100823", "0033...
$ NLP                                                <chr> NA, NA, NA, NA, ...
$ `Publication title`                                <chr> "Ealing gazette ...
$ Edition                                            <chr> NA, "Clapham, Ba...
$ `Preceding titles`                                 <chr> "Continues: Eali...
$ `Succeeding titles`                                <chr> NA, "Continued b...
$ Place.of.publication                               <chr> "Guildford, Surr...
$ Country.of.publication                             <chr> "England", "Engl...
$ General.area.of.coverage                           <chr> "Acton (London, ...
$ Coverage..City                                     <chr> "Ealing (London,...
$ First.geographical.subject.heading                 <chr> "Ealing (London,...
$ Subsequent.geographical.subject.headings           <chr> "Acton (London, ...
$ First.date.held                                    <dbl> 2014, 2007, NA, ...
$ Last.date.held                                     <chr> "Continuing", "2...
$ Publication.date.one                               <dbl> 2014, 2007, NA, ...
$ Publication.date.two                               <dbl> NA, 2008, NA, 20...
$ Current.publication.frequency                      <chr> "Weekly", "Weekl...
$ Publisher                                          <chr> "Trinity Mirror"...
$ Holdings..more.information                         <chr> NA, NA, NA, NA, ...
$ `Free text information about dates of publication` <chr> "2014 July 25 -"...
$ Online.status                                      <chr> NA, NA, NA, NA, ...
$ Link.to.BNA.digitised.resource                     <chr> NA, NA, NA, NA, ...
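One thing to note: the code below refers to columns with lower-case snake_case names (first_date_held, coverage_city and so on) rather than the names glimpse() prints above (First.date.held, Coverage..City). One simple way to get names like that, offered as a sketch rather than the exact step used here, is clean_names() from the janitor package:

library(janitor)

working_list <- working_list %>%
  clean_names()   # e.g. "First.date.held" becomes "first_date_held", "Coverage..City" becomes "coverage_city"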

Let's just look at the nineteenth century - we'll use the filter() function from dplyr, which is one of the libraries in the tidyverse. We then use %>% to pipe the filtered data to a function called head(), which displays a set number of rows of your data frame - useful for taking a peek at the structure.

working_list %>% 
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
head(2)
Title.ID    Publication title   Country.of.publication   General.area.of.coverage   First.date.held   Last.date.held
013895880   Y Gwron Cymreig     Wales                     Mid Glamorgan              1852              1856
013895881   Y Gwron             Wales                     Mid Glamorgan              1856              1860
(selected columns from the first two rows)

Using ggplot2 (which we loaded when we loaded the tidyverse) and piping with %>%, we can quickly make some simple plots. First we filter the data we want, group it by year, and count the totals per year.

Then we pass this through the function ggplot(), followed by something called a 'geom', which tells ggplot what kind of visualisation you want to draw. A geom (in this case geom_line) has a set of 'aesthetic' variables which you need to fill in. Here we tell the geom_line that we want the year on the x axis and the calculated total for that year on the y axis.

# First a line charting the number of new titles per year in the 19th century:

working_list %>% 
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(first_date_held) %>% 
  tally() %>%
  ggplot() + geom_line(aes(x = first_date_held, y = n ))

[Plot: line chart of new titles per year in the nineteenth century]

Easy, and we've answered our first question. You can do endless customisations of this plot, but this will do for now. As you can see, there's a huge spike in titles with their 'first date held' in 1855. This actually makes sense: that was the year stamp duty was abolished, which allowed a large number of new titles to spring up - though you can see it went back to a gentler upward trend a few years later. What might the second spike be?
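If you'd rather read the answer off as a number than from the peak of the line, one way (a variation on the same pipeline) is to sort the yearly tallies:

working_list %>% 
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(first_date_held) %>% 
  tally() %>%
  arrange(desc(n)) %>%   # years with the most new titles first
  head(5)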

Next we can do a bar chart of titles, for each country in the dataset.

working_list %>% 
  group_by(country_of_publication) %>% 
  tally() %>% ggplot() + geom_bar(aes(x = country_of_publication, y = n), stat = 'identity')

[Plot: bar chart of titles per country of publication]

There are lots of places mentioned just a handful of times - this is primarily a dataset of UK and Irish titles. It will be more readable if we filter for just these countries.

working_list %>% 
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  group_by(country_of_publication) %>% 
  tally() %>% ggplot() + 
geom_bar(aes(x = country_of_publication, y = n), stat = 'identity')

[Plot: bar chart of titles for England, Wales, Scotland, Ireland and Northern Ireland]

That's the second question answered.
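If you want the exact counts behind the bar chart, the same pipeline can simply be printed instead of plotted:

working_list %>% 
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  group_by(country_of_publication) %>% 
  tally() %>%
  arrange(desc(n))   # countries with the most titles first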

We can count by county. This time we add back in the filters for last date held after 1799 and first date held before 1900, and keep the filter for the five primary countries.

working_list %>% 
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(general_area_of_coverage) %>% 
  tally() %>% ggplot() + 
geom_bar(aes(x = general_area_of_coverage, y = n), stat = 'identity')

[Plot: bar chart of titles per county - crowded and hard to read]

That's not very readable. Let's make a few adjustments.

# reorder by number of titles using reorder()

working_list %>% 
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(general_area_of_coverage) %>% 
  tally() %>% ggplot() + 
geom_bar(aes(x = reorder(general_area_of_coverage,n), y = n), stat = 'identity')

[Plot: bar chart of titles per county, reordered by number of titles]

# Make it vertical with coord_flip():

working_list %>% 
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(general_area_of_coverage) %>% 
  tally() %>% ggplot() + 
geom_bar(aes(x = reorder(general_area_of_coverage,n), y = n), stat = 'identity') + coord_flip()

[Plot: horizontal bar chart of titles per county]

# Remove NA values:

working_list %>% 
  filter(!is.na(general_area_of_coverage)) %>%
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(general_area_of_coverage) %>% 
  tally() %>% ggplot() + 
geom_bar(aes(x = reorder(general_area_of_coverage,n), y = n), stat = 'identity') + coord_flip()

[Plot: horizontal bar chart of titles per county, with NA values removed]

# Get the top 10:

working_list %>% 
  filter(!is.na(general_area_of_coverage)) %>%
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(general_area_of_coverage) %>% 
  tally() %>%
  top_n(10) %>%
ggplot() + 
geom_bar(aes(x = reorder(general_area_of_coverage,n), y = n), stat = 'identity') + coord_flip()
Selecting by n

[Plot: horizontal bar chart of the ten counties with the most titles]

That's it: after London (obviously), the county with the largest number of holdings is Strathclyde.
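As with the earlier plots, the underlying numbers can be printed rather than drawn; this is the same pipeline again with arrange() in place of ggplot():

working_list %>% 
  filter(!is.na(general_area_of_coverage)) %>%
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(general_area_of_coverage) %>% 
  tally() %>%
  arrange(desc(n)) %>%   # counties with the most titles first
  head(10)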

A quick example of how fancy we can make things:

# Add a palette of official British Library colours: 
libraryPalette = c("#00B3C9", "#CBDB2A", "#803F92", "#FAA61A", "#FAA61A", "#E1058C")
# Use the colorRampPalette function to turn this set of six colours into more:

mycolors <- colorRampPalette(libraryPalette)(10)
mycolors
  1. '#00B3C9'
  2. '#70C970'
  3. '#C2C935'
  4. '#99736F'
  5. '#9B5577'
  6. '#DE8F34'
  7. '#FAA61A'
  8. '#FAA61A'
  9. '#EE5E4C'
  10. '#E1058C'
# Plot this, also adding some transparency and changing the theme slightly, and give it a title and a few other tweaks

working_list %>% 
  filter(!is.na(general_area_of_coverage)) %>%
  filter(country_of_publication %in% c('Ireland', 
                                       'Northern Ireland', 
                                       'England', 
                                       'Wales', 
                                       'Scotland')) %>%
  filter(last_date_held>1799) %>%
  filter(first_date_held<1900) %>% 
  group_by(general_area_of_coverage) %>% 
  tally() %>%
  top_n(10) %>%
ggplot() + 
geom_bar(aes(x = reorder(general_area_of_coverage,n),
             y = n, 
             fill = general_area_of_coverage), 
         stat = 'identity',
        color = 'black',
        alpha = .7) + coord_flip() + 
  scale_fill_manual(values = mycolors) + 
theme_minimal() + 
ggtitle("Ten Counties With Most Newspaper Titles, Nineteenth Century") + 
theme(axis.title = element_blank(), legend.position = 'none')
Selecting by n

[Plot: "Ten Counties With Most Newspaper Titles, Nineteenth Century", coloured with the British Library palette]

These are modern counties, as that's the way the newspapers have been catalogued. We're working on an expanded version with the historical counties - watch this space!

The last question is slightly more complicated.

Let's get the total titles per city with a first date held before 1800, then the totals with a first date held before 1900, and finally join the pre-1800 counts to the counts for 1800-1900 so we can express the later figure as a percentage of the earlier one.

working_list %>% 
filter(first_date_held < 1800) %>% 
group_by(coverage_city) %>% 
tally() %>%
arrange(desc(n))
Coverage..City   n
London 437
Dublin 142
Bath 26
Edinburgh 21
Exeter 17
Newcastle upon Tyne 13
Bristol 12
Manchester 12
Worcester 12
Glasgow 11
Canterbury 10
Cork 9
Salisbury 9
York 9
Chester 8
Limerick 8
Liverpool 8
Reading 8
Sheffield 8
Oxford 7
Belfast 6
Derby 6
Lewes 6
Stamford 6
Bury Saint Edmunds 5
Norwich 5
Nottingham 5
Sherborne 5
Aberdeen 4
Amsterdam 4
......
Nassau 2
Plymouth 2
Stafford 2
Winchester 2
Charles-town|Charleston, South Carolina 1
Cirencester 1
Clonmel 1
Colchester 1
Cork|London 1
Darlington 1
Derby|Derbyshire 1
Dorchester 1
Drogheda 1
Haarlem|London 1
Haerlem 1
Halifax 1
Kilkenny 1
London|Rotterdam 1
Ludlow 1
Maidstone 1
Middlewich 1
Newark-on-Trent 1
Nottingham|Nottinghamshire 1
Philadelphia 1
Saint Eustatius 1
Saint Helier 1
The Hague 1
Warwick 1
Wokingham 1
Wolverhampton 1
working_list %>% 
filter(first_date_held < 1900) %>% 
group_by(coverage_city) %>% 
tally() %>%
arrange(desc(n))
Coverage..City   n
London 2486
Dublin 305
Glasgow 186
Liverpool 133
Manchester 131
Birmingham 128
Edinburgh 124
Belfast 84
Bristol 77
Newcastle upon Tyne 66
Hull 64
Sheffield 61
Bath 60
Dundee 60
Norwich 58
Brighton 57
Aberdeen 51
Nottingham 51
Leeds 50
Canterbury 49
Cork 49
Exeter 49
Plymouth 49
Waterford 49
Portsmouth 46
Bradford 41
Leicester 40
Derby 39
Limerick 39
Oxford 39
......
Stonehaven 1
Swaffham 1
Tavistock 1
Tenbury Wells 1
The Hague 1
Thurles 1
Tipperary 1
Tipton 1
Trim 1
Tywyn 1
Uppingham 1
Upton upon Severn 1
Urmston 1
Vatican City 1
Wantage 1
Welshpool 1
Westgate on Sea 1
Westport 1
Whitchurch 1
Whitley Bay 1
Whittlesey 1
Willenhall 1
Wilmslow 1
Windermere 1
Withernsea 1
Wiveliscombe 1
Wokingham 1
Woodstock 1
Wymondham 1
Yeadon 1
working_list %>% 
filter(first_date_held < 1800) %>% 
group_by(coverage_city) %>% 
tally() %>% 
left_join(working_list %>% 
          filter(first_date_held < 1900 & first_date_held > 1800) %>% 
          group_by(coverage_city) %>% 
          tally(), by = 'coverage_city') %>%
mutate(percent = n.y/n.x*100) %>% 
arrange(desc(percent)) %>% top_n(20,wt = percent) %>% ggplot() + 
geom_bar(aes(x = reorder(coverage_city, desc(percent)),
             y = percent,
            fill = percent), 
         stat = 'identity',
        alpha = .8,
        color = 'black')+ 
theme_minimal() + 
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + 
ggtitle("Biggest percentage change in titles, from 1800 to 1900") + 
theme(axis.title.x = element_blank())

[Plot: "Biggest percentage change in titles, from 1800 to 1900", bar chart of the top 20 cities]
