---
abstract: Using the British Library's collection of UK PhD thesis metadata, we explore the evolution of research topics and the relationship between disciplines over time.
author:
- name: Ryan Chan
  url: https://www.turing.ac.uk/people/research-engineering/ryan-chan
  affiliation: 
  - id: ati
    name: The Alan Turing Institute
    city: London
    url: https://www.turing.ac.uk
- name: Isabel Fenton
  url: https://www.turing.ac.uk/people/researchers/isabel-fenton
  affiliation: 
  - ref: ati
- name: Katriona Goldmann
  url: https://www.turing.ac.uk/people/research-engineering/kat-goldmann
  affiliation: 
  - ref: ati
date: 'TBC'
keywords:
- data exploration
- data wrangling
- data visualisation
- natural language processing
license: CC BY
reviewers:
- name: TBC
- name: TBC
title: "Beyond Titles: A Data-Driven Odyssey into UK PhD Theses"
---

# Introduction 

The British Library, as the national library of the UK, holds a vast collection of books, manuscripts, and digital resources, acting as a crucial research hub and guardian of the country's cultural and intellectual heritage. 
One noteworthy digital asset, the [Electronic Theses Online Service](https://ethos.bl.uk) dataset (EThOS), serves as a digital repository, housing comprehensive metadata of PhD theses from UK institutions since the 1700s. 
It provides a wealth of information, including names, dates, titles, abstracts, and subjects.

In this Turing Data Story we delve into the British Library's EThOS dataset, tracing the evolution of disciplines and exploring the ever-changing landscape of academic pursuit. We also perform a natural language processing (NLP) analysis of thesis abstracts, uncovering the most common words and topics.

## Table of Contents

* [Data description](#data-description)
* [Historical Trends](#historical-trends)
* [Topic modelling](#topic-modelling)
* [Conclusions](#conclusions)

# Data description

Before we begin, the cell below imports all the Python modules we will need in this story and sets a few global plotting options.
To install these packages locally run `pip install -r requirements.txt`.  

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For pretty and exportable matplotlib plots.
%matplotlib inline

# Set a consistent plotting style across the notebook using Seaborn.
sns.set_style("darkgrid")
sns.set_context("notebook")

The data is available for download from the [British Library](https://data.bl.uk/ethos/).

Quoting their website,

>The EThOS dataset lists virtually all UK doctoral theses ever awarded, some 500,000 dating back to 1787. All UK HE institutions are included, but we estimate records are missing for around 10,000 titles (2%).

At the time of writing, the latest version of the dataset is from April 2023. The data includes the title, year, author, and institution for each thesis, as well as a link to a full record which may or may not include things like keywords or access to full texts.

For the purposes of this story, we have preprocessed the data in another notebook, which you can find [here](). 
The preprocessed data is available in the `Data` folder of this repository.

We will start by loading this into a pandas DataFrame and taking a look at the first few rows.

In [None]:
datafile = './Data/cleaned_EThOS_CSV_202304.csv'
df = pd.read_csv(datafile)

df.head()

The first observation to make is that the data is remarkably clean. There are a few NaNs that we need to drop:

In [None]:
print("Number of rows with NaNs: {}".format(df.isnull().any(axis=1).sum()))
df = df.dropna()
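
Before dropping rows, it can help to see which columns the missing values come from. The following is a minimal sketch on a toy DataFrame; the column names and values are made up for illustration and are not taken from the EThOS data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the thesis metadata; values are illustrative only.
toy = pd.DataFrame({
    "Title": ["A", "B", None, "D"],
    "Institution": ["X", None, "Z", "W"],
    "Date": [1990, 2001, 2010, np.nan],
})

# Number of missing values in each column.
nan_per_column = toy.isnull().sum()
print(nan_per_column)

# Number of rows that contain at least one missing value.
n_rows_with_nan = int(toy.isnull().any(axis=1).sum())
print("Rows with NaNs:", n_rows_with_nan)

# Dropping those rows leaves only the complete records.
toy_clean = toy.dropna()
print("Rows remaining:", len(toy_clean))
```

Running the same per-column count on `df` would show whether the missing values are concentrated in one field, in which case dropping whole rows may discard more than necessary.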

This repository is designed to hold doctoral theses, so we can inspect the `Qualification` column to see the different types of degree and whether any other qualifications appear in the dataset.

In [None]:
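# Count how many theses carry each qualification type.
print(df['Qualification'].value_counts())

# Select the rows whose qualification does not contain 'D.' (in either case),
# i.e. anything that does not look like a doctorate. The dot is escaped so
# that it matches a literal full stop rather than any character.
is_doctorate = df['Qualification'].str.contains(r'd\.', case=False)
df2 = df[~is_doctorate]
df2['Qualification'].value_counts()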
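
One thing to watch when filtering with `str.contains` is that pandas treats the pattern as a regular expression by default, so an unescaped `.` matches any character. The sketch below compares an unescaped and an escaped pattern on a handful of made-up qualification strings; these are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical qualification strings, for illustration only.
quals = pd.Series(
    ["Ph.D.", "D.Phil.", "M.Phil.", "Diploma", "Master of Education"]
)

# Unescaped dot: matches 'D' or 'd' followed by ANY character,
# so "Diploma" ("Di") and "Education" ("du") are matched too.
loose = quals.str.contains('D.|d.')

# Escaped dot with case-insensitive matching: requires a literal "d." or "D.".
strict = quals.str.contains(r'd\.', case=False)

comparison = pd.DataFrame(
    {"qualification": quals, "loose": loose, "strict": strict}
)
print(comparison)
```

Even the escaped pattern would still match a string such as `M.Ed.`, so it is worth checking the resulting value counts by eye rather than relying on the pattern alone.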

# Historical Trends

# Topic modelling

# Conclusions