# Unistats – Data Download

Current and previous releases of the Unistats datasets can be downloaded from links on the [Unistats dataset](https://www.hesa.ac.uk/support/tools-and-downloads/unistats) page on the HESA website.

The data is supplied as a zipped file archive containing the data in two formats: a single XML file, and a set of (equivalent) CSV files.

## Download the data

We can retrieve the data file in a Linux environment using the `wget` command-line command. The `-O` switch forces the file to be saved with a particular filename, in this case `unistats-latest.zip`.

The `!` prefix tells the notebook to run the command in the code cell on the command line, rather than passing it to the programming language kernel the notebook is connected to.

In [2]:
! wget -O unistats-latest.zip https://unistatsdataset.hesa.ac.uk/api/UnistatsDatasetDownload

--2021-09-01 10:01:07--  https://unistatsdataset.hesa.ac.uk/api/UnistatsDatasetDownload
Resolving unistatsdataset.hesa.ac.uk (unistatsdataset.hesa.ac.uk)... 2606:4700::6813:aa27, 2606:4700::6813:ab27, 104.19.171.39, ...
Connecting to unistatsdataset.hesa.ac.uk (unistatsdataset.hesa.ac.uk)|2606:4700::6813:aa27|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://hesacdn.blob.core.windows.net/unistats/unistats20_2021_08_11_07_24_51.zip?sv=2018-03-28&sr=b&sig=%2FTh1oNd59PkybgPc295t91s1sAFsrMeT0VlxxLIx%2B0A%3D&se=2021-09-01T09%3A02%3A07Z&sp=r&rscd=attachment%3Bfilename%3D%22unistats_latest.zip%22 [following]
--2021-09-01 10:01:07--  https://hesacdn.blob.core.windows.net/unistats/unistats20_2021_08_11_07_24_51.zip?sv=2018-03-28&sr=b&sig=%2FTh1oNd59PkybgPc295t91s1sAFsrMeT0VlxxLIx%2B0A%3D&se=2021-09-01T09%3A02%3A07Z&sp=r&rscd=attachment%3Bfilename%3D%22unistats_latest.zip%22
Resolving hesacdn.blob.core.windows.net (hesacdn.blob.core.windows.net)... 52.239.137
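If `wget` is not available (for example, on some Windows machines), a similar download can be sketched in Python using only the standard library. The function name `download_file` is a hypothetical helper introduced here for illustration, not something provided by HESA or by the notebook. `urllib.request.urlopen` follows HTTP redirects such as the `302 Found` response shown in the log above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Stream the resource at `url` to the local path `dest` and return that path.

    Hypothetical helper: a minimal sketch with no retry or error handling.
    """
    dest = Path(dest)
    # urlopen follows redirects by default; copyfileobj streams in chunks
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (commented out so that nothing is downloaded on import):
# download_file(
#     "https://unistatsdataset.hesa.ac.uk/api/UnistatsDatasetDownload",
#     "unistats-latest.zip",
# )
```

As with the `-O` switch, the second argument fixes the local filename regardless of the name suggested by the server.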

We can use the `ls` command to list the contents of the current directory and check that the file has been downloaded as required. The `-l` flag is used to include the file size as part of the file listing:

In [10]:
! ls -l

total 72224
-rw-r--r--@ 1 tonyhirst  staff      1078  1 Sep 09:35 LICENSE
-rw-r--r--  1 tonyhirst  staff        44  1 Sep 09:35 README.md
-rw-r--r--  1 tonyhirst  staff      4563  1 Sep 10:03 unistats - download.ipynb
-rw-r--r--  1 tonyhirst  staff  36079529 11 Aug 07:24 unistats-latest.zip
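The same check can be sketched in Python, which also lets us confirm that the file really is a zip archive rather than, say, a saved error page. The helper name `describe_download` is an assumption made for this example; `Path.stat()` and `zipfile.is_zipfile()` are standard library calls.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Return (size_in_bytes, looks_like_zip) for `path`, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    p = Path(path)
    if not p.is_file():
        return None
    # is_zipfile() inspects the file contents, not just the file suffix
    return p.stat().st_size, zipfile.is_zipfile(p)


# Example call, assuming the download step above has been run:
# describe_download("unistats-latest.zip")
```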


We can unzip the downloaded file using the Linux `unzip` command:

In [11]:
! unzip unistats-latest.zip

Archive:  unistats-latest.zip
  inflating: on_2021_08_11_07_24_51/ACCREDITATION.csv  
  inflating: on_2021_08_11_07_24_51/AccreditationByHep.csv  
  inflating: on_2021_08_11_07_24_51/ACCREDITATIONTABLE.csv  
  inflating: on_2021_08_11_07_24_51/COMMON.csv  
  inflating: on_2021_08_11_07_24_51/CONTINUATION.csv  
  inflating: on_2021_08_11_07_24_51/COURSELOCATION.csv  
  inflating: on_2021_08_11_07_24_51/EMPLOYMENT.csv  
  inflating: on_2021_08_11_07_24_51/ENTRY.csv  
  inflating: on_2021_08_11_07_24_51/GOSALARY.csv  
  inflating: on_2021_08_11_07_24_51/GOSECSAL.csv  
  inflating: on_2021_08_11_07_24_51/GOVOICEWORK.csv  
  inflating: on_2021_08_11_07_24_51/INSTITUTION.csv  
  inflating: on_2021_08_11_07_24_51/JOBLIST.csv  
  inflating: on_2021_08_11_07_24_51/JOBTYPE.csv  
  inflating: on_2021_08_11_07_24_51/kis20210811071617.xml  
  inflating: on_2021_08_11_07_24_51/KISAIM.csv  
  inflating: on_2021_08_11_07_24_51/KISCOURSE.csv  
  inflating: on_2021_08_11_07_24_51/LEO3.csv  
  inflating:
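Where the `unzip` command is not available, the standard library `zipfile` module offers a rough equivalent. The sketch below extracts only the CSV members of an archive; the function name `extract_csv_members` is hypothetical, and the choice to skip the XML file is an assumption made for illustration.

```python
import zipfile


def extract_csv_members(zip_path, dest_dir="."):
    """Extract only the .csv members of the archive and return their names.

    Hypothetical helper: a sketch, not a replacement for `unzip`.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # namelist() returns every member path, including any directory prefix
        csv_names = [name for name in archive.namelist()
                     if name.lower().endswith(".csv")]
        archive.extractall(dest_dir, members=csv_names)
    return csv_names


# Example call, assuming the archive downloaded above:
# extract_csv_members("unistats-latest.zip")
```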

For the latest data release at least, the files are unzipped into a directory with a name corresponding to the release date of the dataset, in this case `on_2021_08_11_07_24_51`:

In [12]:
! ls

LICENSE                    unistats - download.ipynb
README.md                  unistats-latest.zip
on_2021_08_11_07_24_51/


## Using the Data

At this point, we need to make a decision – do we want to work with the hierarchical XML data:

*The `head` command-line command shows the first few lines of a file. The `-n` switch allows you to specify how many lines are displayed. For example, `head -n 5 FILENAME` will display just the first 5 lines.*

*The `tail` command works in a similar way, but displays lines counting from the end, rather than the start, of the file.*

In [13]:
! head on_2021_08_11_07_24_51/kis20210811071617.xml 

﻿<?xml version="1.0" encoding="utf-8"?>
<KIS>
  <COLLECTION>C20061</COLLECTION>
  <LOCATION>
    <UKPRN>10000055</UKPRN>
    <ACCOMURL>https://www.brookes.ac.uk/studying-at-brookes/accommodation/</ACCOMURL>
    <LOCID>AB</LOCID>
    <LOCNAME>Abingdon &amp;amp; Witney College (Abingdon Campus)</LOCNAME>
    <LATITUDE>51.680769</LATITUDE>
    <LONGITUDE>-1.286935</LONGITUDE>


Or would we rather use the data that has been "flattened" into the separate CSV files?

In [17]:
! head -n 3 on_2021_08_11_07_24_51/KISCOURSE.csv 

PUBUKPRN,UKPRN,ASSURL,ASSURLW,CRSECSTURL,CRSECSTURLW,CRSEURL,CRSEURLW,DISTANCE,EMPLOYURL,EMPLOYURLW,FOUNDATION,HONOURS,HECOS,HECOS,HECOS,HECOS,HECOS,KISCOURSEID,KISMODE,LDCS,LDCS,LDCS,LOCCHNGE,LTURL,LTURLW,NHS,NUMSTAGE,SANDWICH,SUPPORTURL,SUPPORTURLW,TITLE,TITLEW,UCASPROGID,UKPRNAPPLY,YEARABROAD,KISAIMCODE,KISLEVEL
10000047,10001143,https://www.canterbury.ac.uk/study-here/courses/undergraduate/ophthalmic-dispensing-19-20.aspx,,https://www.canterbury.ac.uk/study-here/courses/undergraduate/ophthalmic-dispensing-19-20.aspx#collapseSix,,https://www.canterbury.ac.uk/study-here/courses/undergraduate/ophthalmic-dispensing-19-20.aspx,,0,https://www.canterbury.ac.uk/study-here/courses/undergraduate/ophthalmic-dispensing-19-20.aspx,,0,0,,,,,,PSSFDOPTDIS,1,,,,0,https://www.canterbury.ac.uk/study-here/courses/undergraduate/ophthalmic-dispensing-19-20.aspx,,,2,0,http://www.canterbury.ac.uk/study-here/fees-and-funding/undergraduate-fees-funding/scholarships-bursaries-and-financial-support.aspx,,Op

If you have previously used tools such as Excel or Tableau to work with datasets, you will probably find the "flat" tabular layout more familiar. This does add the overhead of potentially having to cross-reference data from several files to get just the data you need, but it is much easier to work with if all the data we require is in one of the pre-generated CSV files.

So let's work with the CSV data for now, and leave working with the equivalent XML dataset to a later date.

### Viewing the CSV Data

The *pandas* Python package is a full-featured package for working with tabular datasets. It includes support for opening a wide range of data file types, including simple text-based comma-separated values (CSV) files, which typically have a `.csv` file suffix, other text file formats (such as tab-separated `.tsv` files), and Excel `.xls` and `.xlsx` spreadsheet file types.

*As well as accessing data from spreadsheet files using the Python _pandas_ package, data analysis packages in other languages such as R or Julia also provide tools for accessing data from a wide range of data file types.*

In order to use the *pandas* package, we first need to import it using the `import pandas` command, further giving it the shorter, conventional alias `pd`:

In [19]:
import pandas as pd

We can now open a CSV file given its filename (or filepath and filename) using the `pd.read_csv(FILENAME)` function:

In [84]:
pd.read_csv("on_2021_08_11_07_24_51/KISCOURSE.csv")



Unnamed: 0,PUBUKPRN,UKPRN,ASSURL,ASSURLW,CRSECSTURL,CRSECSTURLW,CRSEURL,CRSEURLW,DISTANCE,EMPLOYURL,...,SANDWICH,SUPPORTURL,SUPPORTURLW,TITLE,TITLEW,UCASPROGID,UKPRNAPPLY,YEARABROAD,KISAIMCODE,KISLEVEL
0,10000047,10001143,https://www.canterbury.ac.uk/study-here/course...,,https://www.canterbury.ac.uk/study-here/course...,,https://www.canterbury.ac.uk/study-here/course...,,0,https://www.canterbury.ac.uk/study-here/course...,...,0,http://www.canterbury.ac.uk/study-here/fees-an...,,Ophthalmic Dispensing,,A19-H32,,0,30,4
1,10000055,10000055,https://www.abingdon-witney.ac.uk/courses/anim...,,https://www.brookes.ac.uk/courses/undergraduat...,,https://www.abingdon-witney.ac.uk/courses/anim...,,0,https://www.abingdon-witney.ac.uk/courses/anim...,...,0,https://www.brookes.ac.uk/studying-at-brookes/...,,Animal Behaviour and Welfare,,,10004930.0,0,36,4
2,10000055,10000055,https://www.abingdon-witney.ac.uk/courses/mech...,,https://www.brookes.ac.uk/courses/undergraduat...,,https://www.abingdon-witney.ac.uk/courses/mech...,,0,https://www.abingdon-witney.ac.uk/courses/mech...,...,0,https://www.brookes.ac.uk/studying-at-brookes/...,,Mechanical Engineering,,,10004930.0,0,34,4
3,10000055,10000055,https://www.abingdon-witney.ac.uk/courses/foun...,,https://www.brookes.ac.uk/courses/undergraduat...,,https://www.abingdon-witney.ac.uk/courses/foun...,,0,https://www.abingdon-witney.ac.uk/courses/foun...,...,0,https://www.brookes.ac.uk/studying-at-brookes/...,,Early Years,,,10004930.0,0,31,4
4,10000055,10000055,https://www.abingdon-witney.ac.uk/courses/anim...,,https://www.brookes.ac.uk/courses/undergraduat...,,https://www.abingdon-witney.ac.uk/courses/anim...,,0,https://www.abingdon-witney.ac.uk/courses/anim...,...,0,https://www.brookes.ac.uk/studying-at-brookes/...,,Animal Therapy and Rehabilition,,,10004930.0,0,36,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35140,99999997,99999997,https://www.gre.ac.uk/undergraduate-courses/en...,,https://www.gre.ac.uk/undergraduate-courses/en...,,https://www.gre.ac.uk/undergraduate-courses/en...,,0,https://www.gre.ac.uk/undergraduate-courses/en...,...,0,https://www.gre.ac.uk/finance,,Pharmacy,,A15-F87,,0,73,3
35141,99999997,99999997,https://www.gre.ac.uk/undergraduate-courses/en...,,https://www.gre.ac.uk/undergraduate-courses/en...,,https://www.gre.ac.uk/undergraduate-courses/en...,,0,https://www.gre.ac.uk/undergraduate-courses/en...,...,0,https://www.gre.ac.uk/finance,,Pharmacy with foundation year,,A51-G00,,0,73,3
35142,99999998,99999998,https://www.hyms.ac.uk/gateway-year#assessment...,,https://www.hyms.ac.uk/gateway-year#feesandfun...,,https://www.hyms.ac.uk/gateway-year,,0,https://www.hyms.ac.uk/gateway-year,...,0,https://www.hyms.ac.uk/gateway-year#feesandfun...,,Medicine with a Gateway Year,,A49-J86,,0,43,3
35143,99999998,99999998,http://www.hyms.ac.uk/undergraduate/medicine-a...,,http://www.hyms.ac.uk/undergraduate/for-succes...,,http://www.hyms.ac.uk/undergraduate/medicine-a...,,0,http://www.hyms.ac.uk/undergraduate/medicine-a...,...,0,http://www.hyms.ac.uk/undergraduate/for-succes...,,Medicine,,A19-A72,,0,43,3


*Note that not all the rows are previewed! Also note that on occasion, a warning may be presented that suggests upcoming changes (deprecation notices) to how files are handled, or that recommends alternative or more robust ways of opening the file. Oftentimes, these warnings can be ignored.*

*The _pandas_ `.read_csv()` function also offers a large number of options for customising how the contents of the opened file are imported into a pandas `DataFrame`. See the [`pd.read_csv()` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for more details.*
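As a small illustration of those options, the sketch below reads an in-memory CSV using three documented `read_csv` parameters: `usecols` to load only selected columns, `dtype` to control how a column is typed, and `nrows` to limit the number of rows read. The column names are taken from the `KISCOURSE.csv` header shown earlier, but the rows themselves are invented for this example.

```python
import io

import pandas as pd

# Invented sample rows; only the column names come from KISCOURSE.csv
sample_csv = io.StringIO(
    "PUBUKPRN,UKPRN,KISCOURSEID,KISMODE,TITLE\n"
    "10000001,10000001,ABC1,1,Example Course One\n"
    "10000001,10000001,ABC2,2,Example Course Two\n"
    "10000002,10000002,XYZ9,1,Example Course Three\n"
)

sample_df = pd.read_csv(
    sample_csv,
    usecols=["UKPRN", "KISCOURSEID", "TITLE"],  # load only these columns
    dtype={"UKPRN": str},  # keep the identifier as text rather than a number
    nrows=2,  # read just the first two data rows
)

print(sample_df)
```

Reading identifier columns such as `UKPRN` as text is a design choice rather than a requirement; it can avoid surprises such as identifiers being displayed as floating point numbers when some values are missing.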

We can also create a reference to the contents of the opened file, which are represented as a *pandas* `DataFrame`, so that we can refer to it and manipulate it by name. For example, from the file reference, we can preview the "head" (that is, the first few rows) of the dataframe:

In [86]:
kis_course_df = pd.read_csv("on_2021_08_11_07_24_51/KISCOURSE.csv")

kis_course_df.head(3)



Unnamed: 0,PUBUKPRN,UKPRN,ASSURL,ASSURLW,CRSECSTURL,CRSECSTURLW,CRSEURL,CRSEURLW,DISTANCE,EMPLOYURL,...,SANDWICH,SUPPORTURL,SUPPORTURLW,TITLE,TITLEW,UCASPROGID,UKPRNAPPLY,YEARABROAD,KISAIMCODE,KISLEVEL
0,10000047,10001143,https://www.canterbury.ac.uk/study-here/course...,,https://www.canterbury.ac.uk/study-here/course...,,https://www.canterbury.ac.uk/study-here/course...,,0,https://www.canterbury.ac.uk/study-here/course...,...,0,http://www.canterbury.ac.uk/study-here/fees-an...,,Ophthalmic Dispensing,,A19-H32,,0,30,4
1,10000055,10000055,https://www.abingdon-witney.ac.uk/courses/anim...,,https://www.brookes.ac.uk/courses/undergraduat...,,https://www.abingdon-witney.ac.uk/courses/anim...,,0,https://www.abingdon-witney.ac.uk/courses/anim...,...,0,https://www.brookes.ac.uk/studying-at-brookes/...,,Animal Behaviour and Welfare,,,10004930.0,0,36,4
2,10000055,10000055,https://www.abingdon-witney.ac.uk/courses/mech...,,https://www.brookes.ac.uk/courses/undergraduat...,,https://www.abingdon-witney.ac.uk/courses/mech...,,0,https://www.abingdon-witney.ac.uk/courses/mech...,...,0,https://www.brookes.ac.uk/studying-at-brookes/...,,Mechanical Engineering,,,10004930.0,0,34,4


## Supporting Data

As well as the Unistats KIS data, several supporting data files are also provided on the [Unistats dataset](https://www.hesa.ac.uk/support/tools-and-downloads/unistats) page on the HESA website:

- *UKPRN codes*: a lookup table for UKPRNs (*UK Provider Reference Number* as published by the [UK Register of Learning Providers](https://www.ukrlp.co.uk/))
- *Subject codes*: a lookup table for CAH ([Common Aggregation Hierarchy](https://www.hesa.ac.uk/support/documentation/hecos/cah-about)) 1.3.3 codes (applicable to latest data; lookup tables for previous data releases are also available)

### UKPRN codes

Let's download the UKPRN look-up table, which is provided as an Excel spreadsheet:

In [18]:
! wget https://www.hesa.ac.uk/files/UNISTATS_UKPRN_lookup_20160901.xlsx

--2021-09-01 10:31:02--  https://www.hesa.ac.uk/files/UNISTATS_UKPRN_lookup_20160901.xlsx
Resolving www.hesa.ac.uk (www.hesa.ac.uk)... 2606:4700::6813:ab27, 2606:4700::6813:aa27, 104.19.171.39, ...
Connecting to www.hesa.ac.uk (www.hesa.ac.uk)|2606:4700::6813:ab27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35153 (34K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘UNISTATS_UKPRN_lookup_20160901.xlsx’


2021-09-01 10:31:02 (1.56 MB/s) - ‘UNISTATS_UKPRN_lookup_20160901.xlsx’ saved [35153/35153]



*If you read the output log, you should see the filename the file has been saved as: in this case, `UNISTATS_UKPRN_lookup_20160901.xlsx`.*

Although you might typically expect to open a file with an `.xlsx` file suffix in a spreadsheet application, we can also open it in the context of our programming environment using another part of the *pandas* package, the `pd.ExcelFile(FILEPATH)` class.

As we don't yet know how many sheets there are in the spreadsheet, or what they are called, we can create a reference to the file (`ukprns_xlsx`) and then inspect the sheet names contained in that file:

In [57]:
# The pd.ExcelFile(FILENAME) function opens the specified Excel file
# We then assign this to the reference: ukprns_xlsx
ukprns_xlsx = pd.ExcelFile("UNISTATS_UKPRN_lookup_20160901.xlsx")

# We can inspect the sheet names contained in the referenced file
ukprns_xlsx.sheet_names

['Information', 'Lookup']

We can then load in the sheet we want by name (`pd.read_excel(XLSX, SHEETNAME)`) into a *pandas* dataframe and preview the first few lines of it (`.head()`):

*A _pandas_ dataframe is a bit like a single sheet of a spreadsheet, a two dimensional data table where the data is organised in rows and columns.*

In [58]:
ukprns = pd.read_excel(ukprns_xlsx, "Lookup")

ukprns.head()

Unnamed: 0,UKPRN,NAME
0,10000291,Anglia Ruskin University
1,10000385,The Arts University Bournemouth
2,10000571,Bath Spa University
3,10000712,University College Birmingham
4,10000824,Bournemouth University
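A lookup table like this is typically used to cross-reference codes that appear in other files. As a minimal sketch, the example below attaches provider names to a small table of courses using the `DataFrame.merge()` method. The two provider rows are copied from the preview above; the course rows, including the unmatched `99999999` code, are invented for illustration.

```python
import pandas as pd

# Two rows copied from the UKPRN lookup preview above
ukprn_lookup = pd.DataFrame({
    "UKPRN": [10000291, 10000571],
    "NAME": ["Anglia Ruskin University", "Bath Spa University"],
})

# Invented course rows; the last UKPRN deliberately has no match in the lookup
courses = pd.DataFrame({
    "UKPRN": [10000571, 10000291, 99999999],
    "TITLE": ["Course A", "Course B", "Course C"],
})

# how="left" keeps every course row, leaving NAME empty where there is no match
named_courses = courses.merge(ukprn_lookup, on="UKPRN", how="left")

print(named_courses)
```

Using a left merge means that courses with an unrecognised code are kept rather than silently dropped, which makes gaps in the lookup table easier to spot.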


### Subject Codes

Let's download the subject codes lookup table for the most recent Unistats dataset, again as an Excel file:

In [36]:
! wget https://www.hesa.ac.uk/files/HECoS_CAH_Version_1.3.3.xlsx

--2021-09-01 10:47:14--  https://www.hesa.ac.uk/files/HECoS_CAH_Version_1.3.3.xlsx
Resolving www.hesa.ac.uk (www.hesa.ac.uk)... 2606:4700::6813:aa27, 2606:4700::6813:ab27, 104.19.171.39, ...
Connecting to www.hesa.ac.uk (www.hesa.ac.uk)|2606:4700::6813:aa27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 219205 (214K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘HECoS_CAH_Version_1.3.3.xlsx’


2021-09-01 10:47:14 (1.92 MB/s) - ‘HECoS_CAH_Version_1.3.3.xlsx’ saved [219205/219205]



Using the recipe we used to inspect the contents of the UKPRN spreadsheet, and then load in and preview a particular sheet, you should also be able to extract the subject code data from the `HECoS_CAH` spreadsheet:

In [59]:
cah_xlsx = pd.ExcelFile("HECoS_CAH_Version_1.3.3.xlsx")
cah_xlsx.sheet_names

['Index',
 'CAH (V1.3.3)',
 'HECoS_CAH_Mapping (V1.3.3)',
 'JACS_CAH_Mapping (V1.3.3)',
 'Changes']

In this case we note there appear to be several data sheets.

If we view the index, which is presented as a table reflecting the original sheet, we see that using spreadsheets to communicate generic text information is perhaps not the most useful way of presenting the information...

In [60]:
pd.read_excel(cah_xlsx, "Index")

Unnamed: 0,The Common Aggregation Hierarchy (CAH) HECoS and JACS mapping,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,Version 1.3.3,,,,,,,,,
1,This document contains the Common Aggregation ...,,,,,,,,,
2,This is designed to act as a bridge between JA...,,,,,,,,,
3,,,,,,,,,,
4,Worksheet index,,,,,,,,,
5,CAH,Contains the basic CAH labels and codes at lev...,,,,,,,,
6,HECoS_CAH_Mapping,Contains a mapping between HECoS and CAH level...,,,,,,,,
7,JACS_CAH_Mapping,"Contains a mapping between JACS subject area, ...",,,,,,,,
8,,,,,,,,,,
9,Revision History,,,,,,,,,


Let's look at some of the actual code tables to familiarise ourselves with what they contain.

#### CAH Codes

First, let's look at the Common Aggregation Hierarchy (CAH) codes themselves:

In [45]:
pd.read_excel(cah_xlsx, "CAH (V1.3.3)")

Unnamed: 0,CAH1,CAH2,CAH3,CAH1 (Code only),CAH2 (Code only),CAH3 (Code only)
0,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-01) medical sciences (non-specific),CAH01,CAH01-01,CAH01-01-01
1,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-02) medicine (non-specific),CAH01,CAH01-01,CAH01-01-02
2,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-03) medicine by specialism,CAH01,CAH01-01,CAH01-01-03
3,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-04) dentistry,CAH01,CAH01-01,CAH01-01-04
4,(CAH02) subjects allied to medicine,"(CAH02-02) pharmacology, toxicology and pharmacy",(CAH02-02-01) pharmacology,CAH02,CAH02-02,CAH02-02-01
...,...,...,...,...,...,...
161,(CAH22) education and teaching,(CAH22-01) education and teaching,(CAH22-01-02) teacher training,CAH22,CAH22-01,CAH22-01-02
162,(CAH23) combined and general studies,(CAH23-01) combined and general studies,"(CAH23-01-01) combined, general or negotiated ...",CAH23,CAH23-01,CAH23-01-01
163,(CAH23) combined and general studies,(CAH23-01) combined and general studies,(CAH23-01-02) personal development,CAH23,CAH23-01,CAH23-01-02
164,(CAH23) combined and general studies,(CAH23-01) combined and general studies,(CAH23-01-03) humanities (non-specific),CAH23,CAH23-01,CAH23-01-03


We see that there is redundant information in the `CAH1`, `CAH2` and `CAH3` columns in the form of the actual codes. We don't really need this information in those columns because we can access it from the *"Code only"* columns. So let's *clean* the data and remove the codes.

*__Cleaning data__ refers to tidying up data that is "messy" and getting each data item into a consistent form or a form that is better suited to our purposes. In this case, we want to clean out the CAH codes from the CAH labels, for example cleaning `(CAH22-01-02) teacher training` to the neater label `teacher training`.*

*Data cleaning also often includes removing things we can't necessarily see, such as extraneous white space at the end of a label.*

There are often several ways in which we can clean a dataset and the method you choose is often based on personal preference or experience.

One way of cleaning the data where there is already some readily identifiable, consistent structure we can work with is to parse the structure from each item and extract just the item we need. If you look at the labels you will see they all have the form `(CODE) LABEL`.

The Python [`parse`](https://github.com/r1chardj0n3s/parse) package provides a powerful way for matching patterns and extracting the individual components of each matched pattern.

*A basic Python programming environment only contains a core set of Python packages. To use a package that is not part of the core distribution requires it to be: a) `install`ed into the base Python environment; b) `import`ed into the current working Python environment.*

*Installation of the package is only required once. Packages need to be imported once in each working environment (for example, each notebook) where you call on them.*

In [50]:
# Install the package if it isn't already installed
# This only needs to be done once for any given Python installation
# The %pip invocation makes sure the package is installed into the correct Python environment
%pip install parse

You should consider upgrading via the '/usr/local/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [51]:
# Import the package
# Packages need to be imported into each working environment, for example, each notebook 
from parse import parse

Consider the following example which demonstrates how the `parse` function can extract the two parts of our code-and-label columns:

In [49]:
parse("({}) {}", "(CAH22-01-02) teacher training")

<Result ('CAH22-01-02', 'teacher training') {}>

Each `{}` element in the pattern defines a "capture group" that captures whatever text appears at that position. In this case, the first `{}` captures the content inside the pair of brackets at the start of the text string, and the second captures everything after the space that follows the bracketed text.

The `Result` object that is returned can be indexed in the same way as a Python `list`. Lists are ordered structures that contain, as you might expect, a list of items in a fixed order:

In [53]:
my_list = ["one", "two", "three", "next to last", "last"]
my_list

['one', 'two', 'three', 'next to last', 'last']

In Python, we can access a particular item in a list from its list index number. The *first* item in a list is given the index number `0`, the *second* item has index `1` and so on. The index value is passed via square brackets (`LIST_NAME[INDEX_VALUE]`) immediately following the list name:

In [54]:
my_list[0]

'one'

We can also count back from the end of a list; for example, the last item in the list has index `-1`, the last but one item has index `-2`, and so on:

In [55]:
my_list[-2]

'next to last'

Using our matched pattern list, we can access each item separately. For example, the label is the second (index `1`) item in our pattern matching list (in this case, we could also reference it as the *last* (index `-1`) item):

In [52]:
matches = parse("({}) {}", "(CAH22-01-02) teacher training")

matches[1]

'teacher training'
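If installing the `parse` package is not an option, the same `(CODE) LABEL` split can be sketched with a regular expression from the standard library `re` module. This is a swapped-in technique rather than the approach used in this notebook, and the names `LABEL_PATTERN` and `split_code_label` are hypothetical.

```python
import re

# An opening bracket, the code (anything up to the first closing bracket),
# a closing bracket, some whitespace, and then the label
LABEL_PATTERN = re.compile(r"^\((?P<code>[^)]+)\)\s+(?P<label>.+)$")


def split_code_label(text):
    """Return a (code, label) tuple, or None if the text does not match."""
    match = LABEL_PATTERN.match(text)
    if match is None:
        return None
    return match.group("code"), match.group("label")


print(split_code_label("(CAH22-01-02) teacher training"))
```

The returned tuple can be indexed in the same way as the list examples above, so the label is again the item at index `1`.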

If you have worked with spreadsheets, you will know how you can select particular columns and perform various operations over them using formulas, for example.

With *pandas* dataframes, we can also run operations over the dataframe to create new columns or replace the content of existing columns.

Let's load in the CAH sheet data to something we can reference by name:

In [61]:
# Create a reference to the contents of a particular sheet in the CAH Excel file
cah_df = pd.read_excel(cah_xlsx, "CAH (V1.3.3)")

cah_df.head()

Unnamed: 0,CAH1,CAH2,CAH3,CAH1 (Code only),CAH2 (Code only),CAH3 (Code only)
0,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-01) medical sciences (non-specific),CAH01,CAH01-01,CAH01-01-01
1,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-02) medicine (non-specific),CAH01,CAH01-01,CAH01-01-02
2,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-03) medicine by specialism,CAH01,CAH01-01,CAH01-01-03
3,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-04) dentistry,CAH01,CAH01-01,CAH01-01-04
4,(CAH02) subjects allied to medicine,"(CAH02-02) pharmacology, toxicology and pharmacy",(CAH02-02-01) pharmacology,CAH02,CAH02-02,CAH02-02-01


If you look at the structure of a *pandas* dataframe, it can be thought of as a list of columns. We can index the columns by column name:

In [62]:
cah_df["CAH1"]

0            (CAH01) medicine and dentistry
1            (CAH01) medicine and dentistry
2            (CAH01) medicine and dentistry
3            (CAH01) medicine and dentistry
4       (CAH02) subjects allied to medicine
                       ...                 
161          (CAH22) education and teaching
162    (CAH23) combined and general studies
163    (CAH23) combined and general studies
164    (CAH23) combined and general studies
165    (CAH23) combined and general studies
Name: CAH1, Length: 166, dtype: object

We can apply "formulas" to this column using the `.apply()` method, which runs a formula against each item in turn. Here, the formula parses out the code and label, accesses the label, and then uses the `.strip()` method to clear any whitespace that has sneaked in at the start or the end of the string:

In [82]:
# Define the name of a function or "formula" we want to apply to each row
# In particular we want to:
# Parse the value into a list: parse("({}) {}", x)
# Then get the second item in the list: [1]
# The .strip() function removes any whitespace at the start and the end of a text string
cleanit = lambda x: parse("({}) {}", x)[1].strip()

# Apply this formula to each item in the column
cah_df["CAH1"].apply(cleanit)

0            medicine and dentistry
1            medicine and dentistry
2            medicine and dentistry
3            medicine and dentistry
4       subjects allied to medicine
                   ...             
161          education and teaching
162    combined and general studies
163    combined and general studies
164    combined and general studies
165    combined and general studies
Name: CAH1, Length: 166, dtype: object
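One thing to be aware of is that `parse()` returns `None` when a value does not match the pattern, so indexing the result with `[1]` would raise an error for any unexpected value. The sketch below shows a more defensive cleaning function that could be passed to `.apply()` in the same way. It uses the built-in string method `.partition()` rather than the `parse` package, and the name `clean_label` is hypothetical.

```python
def clean_label(value):
    """Strip a leading "(CODE) " prefix from a label, if there is one.

    Values that are not text (for example, missing values) and text without
    the expected prefix are returned without raising an error.
    """
    if not isinstance(value, str):
        return value
    # partition() splits on the first occurrence of ") " only
    code, separator, label = value.partition(") ")
    if not separator or not code.startswith("("):
        return value.strip()
    return label.strip()


print(clean_label("(CAH22-01-02) teacher training"))
print(clean_label("(CAH01-01-01) medical sciences (non-specific)"))
```

Because the split happens at the first `") "` only, labels that themselves contain brackets, such as `medical sciences (non-specific)`, are left intact.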

We can create a new column (or overwrite an existing one) by assigning the result to a column name:

In [66]:
cah_df["CAH1_clean"] = cah_df["CAH1"].apply(cleanit)

cah_df.head()

Unnamed: 0,CAH1,CAH2,CAH3,CAH1 (Code only),CAH2 (Code only),CAH3 (Code only),CAH1_clean
0,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-01) medical sciences (non-specific),CAH01,CAH01-01,CAH01-01-01,medicine and dentistry
1,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-02) medicine (non-specific),CAH01,CAH01-01,CAH01-01-02,medicine and dentistry
2,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-03) medicine by specialism,CAH01,CAH01-01,CAH01-01-03,medicine and dentistry
3,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-04) dentistry,CAH01,CAH01-01,CAH01-01-04,medicine and dentistry
4,(CAH02) subjects allied to medicine,"(CAH02-02) pharmacology, toxicology and pharmacy",(CAH02-02-01) pharmacology,CAH02,CAH02-02,CAH02-02-01,subjects allied to medicine


We can use the same approach to create clean forms of the other columns:

In [67]:
cah_df["CAH2_clean"] = cah_df["CAH2"].apply(cleanit)
cah_df["CAH3_clean"] = cah_df["CAH3"].apply(cleanit)

cah_df.head()

Unnamed: 0,CAH1,CAH2,CAH3,CAH1 (Code only),CAH2 (Code only),CAH3 (Code only),CAH1_clean,CAH2_clean,CAH3_clean
0,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-01) medical sciences (non-specific),CAH01,CAH01-01,CAH01-01-01,medicine and dentistry,medicine and dentistry,medical sciences (non-specific)
1,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-02) medicine (non-specific),CAH01,CAH01-01,CAH01-01-02,medicine and dentistry,medicine and dentistry,medicine (non-specific)
2,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-03) medicine by specialism,CAH01,CAH01-01,CAH01-01-03,medicine and dentistry,medicine and dentistry,medicine by specialism
3,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-04) dentistry,CAH01,CAH01-01,CAH01-01-04,medicine and dentistry,medicine and dentistry,dentistry
4,(CAH02) subjects allied to medicine,"(CAH02-02) pharmacology, toxicology and pharmacy",(CAH02-02-01) pharmacology,CAH02,CAH02-02,CAH02-02-01,subjects allied to medicine,"pharmacology, toxicology and pharmacy",pharmacology
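When several columns need the same treatment, a loop can save some repetition. The sketch below is an alternative that uses the *pandas* string method `.str.extract()` with a regular expression instead of `.apply()` and the `parse` package. The two rows of data are copied from the preview above; the pattern and the variable names are assumptions made for this example.

```python
import pandas as pd

# Two rows copied from the CAH sheet preview above
demo_df = pd.DataFrame({
    "CAH1": ["(CAH01) medicine and dentistry",
             "(CAH02) subjects allied to medicine"],
    "CAH2": ["(CAH01-01) medicine and dentistry",
             "(CAH02-02) pharmacology, toxicology and pharmacy"],
    "CAH3": ["(CAH01-01-01) medical sciences (non-specific)",
             "(CAH02-02-01) pharmacology"],
})

# Skip the bracketed code at the start and capture everything after it
label_pattern = r"^\([^)]+\)\s*(.*)$"

for col in ["CAH1", "CAH2", "CAH3"]:
    # expand=False returns a single column (Series) for a single capture group
    demo_df[f"{col}_clean"] = demo_df[col].str.extract(label_pattern, expand=False).str.strip()

print(demo_df[["CAH1_clean", "CAH2_clean", "CAH3_clean"]])
```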


The column names are perhaps a little unwieldy, so let's clean those too by renaming them. The `inplace=True` parameter tells the dataframe to make the changes directly to itself, rather than returning a modified copy:

In [74]:
cah_df.rename(columns={"CAH1 (Code only)": "CAH1_code",
                       "CAH2 (Code only)": "CAH2_code",
                       "CAH3 (Code only)": "CAH3_code"},
              inplace=True)

cah_df

Unnamed: 0,CAH1,CAH2,CAH3,CAH1_code,CAH2_code,CAH3_code,CAH1_clean,CAH2_clean,CAH3_clean
0,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-01) medical sciences (non-specific),CAH01,CAH01-01,CAH01-01-01,medicine and dentistry,medicine and dentistry,medical sciences (non-specific)
1,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-02) medicine (non-specific),CAH01,CAH01-01,CAH01-01-02,medicine and dentistry,medicine and dentistry,medicine (non-specific)
2,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-03) medicine by specialism,CAH01,CAH01-01,CAH01-01-03,medicine and dentistry,medicine and dentistry,medicine by specialism
3,(CAH01) medicine and dentistry,(CAH01-01) medicine and dentistry,(CAH01-01-04) dentistry,CAH01,CAH01-01,CAH01-01-04,medicine and dentistry,medicine and dentistry,dentistry
4,(CAH02) subjects allied to medicine,"(CAH02-02) pharmacology, toxicology and pharmacy",(CAH02-02-01) pharmacology,CAH02,CAH02-02,CAH02-02-01,subjects allied to medicine,"pharmacology, toxicology and pharmacy",pharmacology
...,...,...,...,...,...,...,...,...,...
161,(CAH22) education and teaching,(CAH22-01) education and teaching,(CAH22-01-02) teacher training,CAH22,CAH22-01,CAH22-01-02,education and teaching,education and teaching,teacher training
162,(CAH23) combined and general studies,(CAH23-01) combined and general studies,"(CAH23-01-01) combined, general or negotiated ...",CAH23,CAH23-01,CAH23-01-01,combined and general studies,combined and general studies,"combined, general or negotiated studies"
163,(CAH23) combined and general studies,(CAH23-01) combined and general studies,(CAH23-01-02) personal development,CAH23,CAH23-01,CAH23-01-02,combined and general studies,combined and general studies,personal development
164,(CAH23) combined and general studies,(CAH23-01) combined and general studies,(CAH23-01-03) humanities (non-specific),CAH23,CAH23-01,CAH23-01-03,combined and general studies,combined and general studies,humanities (non-specific)


The display of this dataframe is perhaps a little cluttered. We can simplify the display by providing a list of column names that we want to display, in the order we want them displayed:

In [76]:
col_names = ["CAH1_clean", "CAH1_code", "CAH2_clean",
             "CAH2_code", "CAH3_clean", "CAH3_code"]

cah_df[col_names]

Unnamed: 0,CAH1_clean,CAH1_code,CAH2_clean,CAH2_code,CAH3_clean,CAH3_code
0,medicine and dentistry,CAH01,medicine and dentistry,CAH01-01,medical sciences (non-specific),CAH01-01-01
1,medicine and dentistry,CAH01,medicine and dentistry,CAH01-01,medicine (non-specific),CAH01-01-02
2,medicine and dentistry,CAH01,medicine and dentistry,CAH01-01,medicine by specialism,CAH01-01-03
3,medicine and dentistry,CAH01,medicine and dentistry,CAH01-01,dentistry,CAH01-01-04
4,subjects allied to medicine,CAH02,"pharmacology, toxicology and pharmacy",CAH02-02,pharmacology,CAH02-02-01
...,...,...,...,...,...,...
161,education and teaching,CAH22,education and teaching,CAH22-01,teacher training,CAH22-01-02
162,combined and general studies,CAH23,combined and general studies,CAH23-01,"combined, general or negotiated studies",CAH23-01-01
163,combined and general studies,CAH23,combined and general studies,CAH23-01,personal development,CAH23-01-02
164,combined and general studies,CAH23,combined and general studies,CAH23-01,humanities (non-specific),CAH23-01-03


As well as obtaining the contents of a dataframe column by reference to its name, we can also create a list of `True`/`False` index values that determine which *rows* we want to display from a dataframe.

For example, we can test whether each value in the *CAH3 (Code only)* column matches a particular value:

In [70]:
cah_df["CAH3 (Code only)"]=="CAH02-02-01"

0      False
1      False
2      False
3      False
4       True
       ...  
161    False
162    False
163    False
164    False
165    False
Name: CAH3 (Code only), Length: 166, dtype: bool

Whilst this list of truth values may not be of much use to us, if we use it as the index value to a *pandas* dataframe, the dataframe can use the values to decide which *rows* to display:

In [71]:
# Create a row_matches reference to the list of code matching truth values
row_matches = cah_df["CAH3 (Code only)"]=="CAH02-02-01"

# Use this list of truth values to select particular rows of the dataframe
# pandas is clever enough to realise that you are referencing the rows rather than column names
cah_df[row_matches]

Unnamed: 0,CAH1,CAH2,CAH3,CAH1 (Code only),CAH2 (Code only),CAH3 (Code only),CAH1_clean,CAH2_clean,CAH3_clean
4,(CAH02) subjects allied to medicine,"(CAH02-02) pharmacology, toxicology and pharmacy",(CAH02-02-01) pharmacology,CAH02,CAH02-02,CAH02-02-01,subjects allied to medicine,"pharmacology, toxicology and pharmacy",pharmacology
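
The same idea extends to more selective filters. As a minimal sketch, assuming a small made-up dataframe (`toy_df`, a hypothetical stand-in for `cah_df` that just borrows two of its column names), we can test membership of several values with `.isin()` and combine lists of truth values using `&` ("and"), `|` ("or") and `~` ("not"):

```python
import pandas as pd

# Hypothetical stand-in for cah_df, using the same column naming style
toy_df = pd.DataFrame({
    "CAH1 (Code only)": ["CAH01", "CAH01", "CAH02", "CAH23"],
    "CAH3 (Code only)": ["CAH01-01-01", "CAH01-01-04",
                         "CAH02-02-01", "CAH23-01-02"],
})

# A single equality test gives one True/False value per row
is_pharmacology = toy_df["CAH3 (Code only)"] == "CAH02-02-01"

# .isin() tests each value for membership of a list of candidate values
in_selected = toy_df["CAH3 (Code only)"].isin(["CAH01-01-04", "CAH02-02-01"])

# Truth values combine element-wise; wrap each comparison in brackets
combined = in_selected & (toy_df["CAH1 (Code only)"] == "CAH01")

# Only the rows where the combined test is True are returned
selected_rows = toy_df[combined]
print(selected_rows)
```

The brackets around each comparison matter because `&` binds more tightly than `==` in Python.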


#### HECoS_CAH_Mapping

Second, let's consider the HECoS_CAH_Mapping that maps CAH codes to [Higher Education Classification of Subjects (HECoS)](https://www.hesa.ac.uk/support/documentation/hecos) codes:

In [79]:
cah2hecos_df = pd.read_excel(cah_xlsx, "HECoS_CAH_Mapping (V1.3.3)")
cah2hecos_df

Unnamed: 0,HECoS,CAH3,CAH2,CAH1,HECoS (Code only),CAH3 (Code only),CAH2 (Code only),CAH1 (Code only)
0,(100270) medical sciences,(CAH01-01-01) medical sciences (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100270.0,CAH01-01-01,CAH01-01,CAH01
1,(100267) clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100267.0,CAH01-01-02,CAH01-01,CAH01
2,(100271) medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100271.0,CAH01-01-02,CAH01-01,CAH01
3,(100276) pre-clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100276.0,CAH01-01-02,CAH01-01,CAH01
4,(101334) allergy,(CAH01-01-03) medicine by specialism,(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,101334.0,CAH01-01-03,CAH01-01,CAH01
...,...,...,...,...,...,...,...,...
1088,(101276) work placement experience (personal l...,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101276.0,CAH23-01-02,CAH23-01,CAH23
1089,(101277) work-based learning,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101277.0,CAH23-01-02,CAH23-01,CAH23
1090,(100314) humanities,(CAH23-01-03) humanities (non-specific),(CAH23-01) combined and general studies,(CAH23) combined and general studies,100314.0,CAH23-01-03,CAH23-01,CAH23
1091,(100065) liberal arts,(CAH23-01-04) liberal arts (non-specific),(CAH23-01) combined and general studies,(CAH23) combined and general studies,100065.0,CAH23-01-04,CAH23-01,CAH23


Inspecting this dataset, we note a couple of things. First, the `HECoS (Code only)` column has codes that look like they should be integer values but they are represented as decimals. Second, there is an odd row at the end of the dataframe containing `NaN` ("not a number") items. These are null values and we could get rid of this row using the `.dropna()` dataframe method; the `how="all"` parameter says that we only want to drop rows where *all* the values are null values:

In [80]:
# .dropna() returns a cleaned copy; the original dataframe is left unchanged
# To keep the changes, assign the cleaned view to a new (or the same) dataframe reference
# For example: cah2hecos_df = cah2hecos_df.dropna(how="all")
# Alternatively, use the inplace=True parameter to force the changes to the dataframe
cah2hecos_df.dropna(how="all")

Unnamed: 0,HECoS,CAH3,CAH2,CAH1,HECoS (Code only),CAH3 (Code only),CAH2 (Code only),CAH1 (Code only)
0,(100270) medical sciences,(CAH01-01-01) medical sciences (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100270.0,CAH01-01-01,CAH01-01,CAH01
1,(100267) clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100267.0,CAH01-01-02,CAH01-01,CAH01
2,(100271) medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100271.0,CAH01-01-02,CAH01-01,CAH01
3,(100276) pre-clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100276.0,CAH01-01-02,CAH01-01,CAH01
4,(101334) allergy,(CAH01-01-03) medicine by specialism,(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,101334.0,CAH01-01-03,CAH01-01,CAH01
...,...,...,...,...,...,...,...,...
1087,(101090) study skills,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101090.0,CAH23-01-02,CAH23-01,CAH23
1088,(101276) work placement experience (personal l...,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101276.0,CAH23-01-02,CAH23-01,CAH23
1089,(101277) work-based learning,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101277.0,CAH23-01-02,CAH23-01,CAH23
1090,(100314) humanities,(CAH23-01-03) humanities (non-specific),(CAH23-01) combined and general studies,(CAH23) combined and general studies,100314.0,CAH23-01-03,CAH23-01,CAH23
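
To see how the `how` parameter changes what gets dropped, here is a minimal sketch using a small made-up dataframe (`toy_df` and its values are illustrative assumptions, not taken from the mapping file). One row is partly null and one row is completely null:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one null value, row 2 is null throughout
toy_df = pd.DataFrame({
    "label": ["medicine", None, None],
    "code": [100271.0, 100314.0, np.nan],
})

# how="all" only drops rows in which every value is null (row 2)
all_null_dropped = toy_df.dropna(how="all")

# how="any" (the default) drops rows containing at least one null (rows 1 and 2)
any_null_dropped = toy_df.dropna(how="any")

# Neither call changes toy_df itself; each returns a new dataframe
print(len(toy_df), len(all_null_dropped), len(any_null_dropped))
```

Using `how="all"` is the more conservative choice here, because it keeps rows that are only partly incomplete.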


One problem with this approach is that we are still faced with the "dirty" HECoS integer codes. These were represented as decimals because the null value in the column forces *pandas* to use a floating point datatype: the default integer datatype has no way of representing a null value. However, noting that the null line was the last line in the dataset, we can load the data in again but this time ignoring the last line using the `skipfooter=IGNORE_LAST_N_LINES` parameter. In this case, we want to ignore just the last `1` line:

In [81]:
cah2hecos_df = pd.read_excel(cah_xlsx, "HECoS_CAH_Mapping (V1.3.3)", skipfooter=1)
cah2hecos_df

Unnamed: 0,HECoS,CAH3,CAH2,CAH1,HECoS (Code only),CAH3 (Code only),CAH2 (Code only),CAH1 (Code only)
0,(100270) medical sciences,(CAH01-01-01) medical sciences (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100270,CAH01-01-01,CAH01-01,CAH01
1,(100267) clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100267,CAH01-01-02,CAH01-01,CAH01
2,(100271) medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100271,CAH01-01-02,CAH01-01,CAH01
3,(100276) pre-clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100276,CAH01-01-02,CAH01-01,CAH01
4,(101334) allergy,(CAH01-01-03) medicine by specialism,(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,101334,CAH01-01-03,CAH01-01,CAH01
...,...,...,...,...,...,...,...,...
1087,(101090) study skills,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101090,CAH23-01-02,CAH23-01,CAH23
1088,(101276) work placement experience (personal l...,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101276,CAH23-01-02,CAH23-01,CAH23
1089,(101277) work-based learning,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101277,CAH23-01-02,CAH23-01,CAH23
1090,(100314) humanities,(CAH23-01-03) humanities (non-specific),(CAH23-01) combined and general studies,(CAH23) combined and general studies,100314,CAH23-01-03,CAH23-01,CAH23
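
The link between null values and decimal codes can be checked on a small example. The following is a sketch using a made-up `codes` series rather than the real column. It also shows an alternative to reloading the data that the approach above does not use: the *pandas* nullable integer datatype, `"Int64"` (note the capital `I`), which can hold whole numbers alongside a null marker:

```python
import numpy as np
import pandas as pd

# Hypothetical codes: two whole numbers and one null value
codes = pd.Series([100270, 100267, np.nan])

# The null value forces the whole column to a floating point datatype
print(codes.dtype)

# Option 1: the nullable integer datatype keeps the null as <NA>
nullable_codes = codes.astype("Int64")
print(nullable_codes.dtype)

# Option 2: once the nulls are removed, a plain integer cast is safe
complete_codes = codes.dropna().astype(int)
print(complete_codes.tolist())
```

Reloading with `skipfooter` works well when, as here, the null row is known to be the last one; the nullable datatype may be more convenient when nulls are scattered through a column.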


As before, we might want to tidy up the `(CODE) LABEL` columns so that they just contain the clean labels, and also rename the columns.

*Tidying up the columns and column names is left as an exercise for the reader.*
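
As one possible starting point, the following sketch strips the leading `(CODE) ` prefix from label columns using a regular expression, and renames a code column. It works on a small made-up dataframe (`toy_df` is hypothetical), and the pattern and the `_clean`/`_code` naming are assumptions that may differ from the recipe used earlier in this notebook:

```python
import pandas as pd

# Hypothetical sample in the same "(CODE) LABEL" style as the mapping sheet
toy_df = pd.DataFrame({
    "HECoS": ["(100270) medical sciences", "(100267) clinical medicine"],
    "CAH3": ["(CAH01-01-01) medical sciences (non-specific)",
             "(CAH01-01-02) medicine (non-specific)"],
    "CAH3 (Code only)": ["CAH01-01-01", "CAH01-01-02"],
})

for col in ["HECoS", "CAH3"]:
    # Remove only the bracketed code at the start of the value;
    # brackets later in the label, e.g. "(non-specific)", are kept
    toy_df[f"{col}_clean"] = toy_df[col].str.replace(
        r"^\([^)]*\)\s*", "", regex=True
    )

# Give the code column a simpler name
toy_df = toy_df.rename(columns={"CAH3 (Code only)": "CAH3_code"})
print(toy_df[["HECoS_clean", "CAH3_clean", "CAH3_code"]])
```

Anchoring the pattern to the start of the string with `^` is what protects bracketed text that forms part of the label itself.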

#### JACS_CAH_Mapping

Finally, let's consider the JACS_CAH_Mapping that maps CAH codes to [Joint Academic Coding System (JACS)](https://www.hesa.ac.uk/support/documentation/jacs) codes:

In [83]:
cah2jacs_df = pd.read_excel(cah_xlsx, "JACS_CAH_Mapping (V1.3.3)")
cah2jacs_df

Unnamed: 0,JACS Subject Area,JACS Principal Subject,JACS_4Digit,CAH3,CAH2,CAH1,JACS_4Digit (Code only),CAH3 (Code only),CAH2 (Code only),CAH1 (Code only)
0,(1) Medicine & dentistry,(A0) Broadly-based programmes within medicine ...,(A000) Medicine & dentistry,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,A000,CAH01-01-02,CAH01-01,CAH01
1,(1) Medicine & dentistry,(A1) Pre-clinical medicine,(A100) Pre-clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,A100,CAH01-01-02,CAH01-01,CAH01
2,(1) Medicine & dentistry,(A2) Pre-clinical dentistry,(A200) Pre-clinical dentistry,(CAH01-01-04) dentistry,(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,A200,CAH01-01-04,CAH01-01,CAH01
3,(1) Medicine & dentistry,(A3) Clinical medicine,(A300) Clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,A300,CAH01-01-02,CAH01-01,CAH01
4,(1) Medicine & dentistry,(A4) Clinical dentistry,(A400) Clinical dentistry,(CAH01-01-04) dentistry,(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,A400,CAH01-01-04,CAH01-01,CAH01
...,...,...,...,...,...,...,...,...,...,...
1566,(I) Education,(X3) Academic studies in education,(X390) Academic studies in education not elsew...,(CAH22-01-01) education,(CAH22-01) education and teaching,(CAH22) education and teaching,X390,CAH22-01-01,CAH22-01,CAH22
1567,(I) Education,(X9) Others in education,(X900) Others in education,(CAH22-01-01) education,(CAH22-01) education and teaching,(CAH22) education and teaching,X900,CAH22-01-01,CAH22-01,CAH22
1568,(I) Education,(X9) Others in education,(X990) Education not elsewhere classified,(CAH22-01-01) education,(CAH22-01) education and teaching,(CAH22) education and teaching,X990,CAH22-01-01,CAH22-01,CAH22
1569,(J) Combined,(Y0) Combined,(Y000) Combined/general subject unspecified,"(CAH23-01-01) combined, general or negotiated ...",(CAH23-01) combined and general studies,(CAH23) combined and general studies,Y000,CAH23-01-01,CAH23-01,CAH23


As before, we notice an empty line at the end of the dataframe, cluttered `(CODE) LABEL` column values and untidy column names.

Using recipes described above, we can tidy up the dataframe to make it a little more presentable and easier to work with.

*Once again, tidying up the dataframe by loading in the data without the final empty row, and tidying the dirty columns and column names, is left as an exercise for the reader.*