# Simple Sorting and Filtering

One of the simplest, yet most powerful, ways of looking at datasets is simply to sort them and then look at the first or last few entries (such is the making of top 10 league tables!). So in this notebook, let's have a look at one of the datafiles that has some numbers we can sort, and see what sorts of sorting torture we can apply to the dataset as a result.

As ever, let's start by loading in the *pandas* package:

In [1]:
import pandas as pd

## Exploring the `ENTRY` Dataset

The data we'll use to demonstrate some simple sorting and filtering operations is the `ENTRY` dataset that records course populations and entry levels as a proportion of population.

*See an earlier conversation for how we can decode all the column names in a dataset.*

In [2]:
entry_df = pd.read_csv("on_2021_08_11_07_24_51/ENTRY.csv")
entry_df.head()

Unnamed: 0,PUBUKPRN,UKPRN,KISCOURSEID,KISMODE,ENTUNAVAILREASON,ENTPOP,ENTAGG,ENTSBJ,ACCESS,ALEVEL,BACC,DEGREE,FOUNDTN,NOQUALS,OTHER,OTHERHE
0,10000047,10001143,PSSFDOPTDIS,1,0,30.0,14.0,,0.0,70.0,0.0,15.0,0.0,5.0,10.0,0.0
1,10000055,10000055,AB20,1,0,20.0,14.0,,0.0,80.0,0.0,0.0,0.0,5.0,10.0,5.0
2,10000055,10000055,AB29,1,0,10.0,24.0,,0.0,80.0,0.0,10.0,0.0,10.0,0.0,0.0
3,10000055,10000055,AB33,1,0,20.0,14.0,,0.0,90.0,0.0,5.0,0.0,0.0,5.0,0.0
4,10000055,10000055,AB35,1,0,25.0,13.0,CAH06-01-01,0.0,75.0,0.0,0.0,0.0,5.0,10.0,10.0


To make it easier to make sense of the results, we should merge in a couple of other columns:
    
- the institution name (because it's too hard trying to remember offhand which UKPRN applies to which institution)
- the course name (because only the keenest datageek will offhand remember which `KISCOURSEID` relates to which course).

### Annotating the Data

Let's start by merging in the provider names from the `UNISTATS_UKPRN_lookup_20160901.xlsx` additional data file:

In [3]:
ukprns = pd.read_excel("UNISTATS_UKPRN_lookup_20160901.xlsx", "Lookup")

entry_df = pd.merge(entry_df, ukprns, on="UKPRN")

entry_df.head()

Unnamed: 0,PUBUKPRN,UKPRN,KISCOURSEID,KISMODE,ENTUNAVAILREASON,ENTPOP,ENTAGG,ENTSBJ,ACCESS,ALEVEL,BACC,DEGREE,FOUNDTN,NOQUALS,OTHER,OTHERHE,NAME
0,10000047,10001143,PSSFDOPTDIS,1,0,30.0,14.0,,0.0,70.0,0.0,15.0,0.0,5.0,10.0,0.0,Canterbury Christ Church University
1,10001143,10001143,PECFDECEDCA,1,0,20.0,14.0,,0.0,85.0,0.0,5.0,0.0,0.0,0.0,10.0,Canterbury Christ Church University
2,10001143,10001143,PFDDGAC4,1,1,145.0,13.0,CAH15-01-02,4.0,92.0,1.0,0.0,0.0,1.0,1.0,1.0,Canterbury Christ Church University
3,10001143,10001143,PFDDGAC4,2,1,,,,,,,,,,,,Canterbury Christ Church University
4,10001143,10001143,PFDDGAD4,1,1,,,,,,,,,,,,Canterbury Christ Church University
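
One thing to watch with the merge above: `pd.merge()` performs an inner join by default, so any row whose `UKPRN` has no match in the lookup table would quietly disappear. The following is a minimal sketch of how that could be checked, using made-up toy tables (the identifiers and names are invented for illustration, not taken from the real files):

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are made up.
entries = pd.DataFrame({
    "UKPRN": [1, 2, 3],
    "KISCOURSEID": ["A1", "B1", "C1"],
})
lookup = pd.DataFrame({
    "UKPRN": [1, 2],
    "NAME": ["Example Provider One", "Example Provider Two"],
})

# The default inner join silently drops the row with no lookup entry.
inner = pd.merge(entries, lookup, on="UKPRN")

# A left join keeps every entry row; indicator=True adds a `_merge`
# column recording whether a match was found.
checked = pd.merge(entries, lookup, on="UKPRN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(entries), len(inner), len(unmatched))
```

If `unmatched` is empty for the real data, the inner join loses nothing.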


And then the course names, which can be found in the `KISCOURSE.csv` datafile. The course datafile has quite a lot of columns, so we should be selective about the ones we want.

In [4]:
course_df = pd.read_csv("on_2021_08_11_07_24_51/KISCOURSE.csv")
course_df.columns


Index(['PUBUKPRN', 'UKPRN', 'ASSURL', 'ASSURLW', 'CRSECSTURL', 'CRSECSTURLW',
       'CRSEURL', 'CRSEURLW', 'DISTANCE', 'EMPLOYURL', 'EMPLOYURLW',
       'FOUNDATION', 'HONOURS', 'HECOS', 'HECOS.1', 'HECOS.2', 'HECOS.3',
       'HECOS.4', 'KISCOURSEID', 'KISMODE', 'LDCS', 'LDCS.1', 'LDCS.2',
       'LOCCHNGE', 'LTURL', 'LTURLW', 'NHS', 'NUMSTAGE', 'SANDWICH',
       'SUPPORTURL', 'SUPPORTURLW', 'TITLE', 'TITLEW', 'UCASPROGID',
       'UKPRNAPPLY', 'YEARABROAD', 'KISAIMCODE', 'KISLEVEL'],
      dtype='object')

*One of the things we might notice about the columns in this dataset is that there appear to be some repeated column names, such as `'HECOS', 'HECOS.1', 'HECOS.2', 'HECOS.3', 'HECOS.4'` and `'LDCS', 'LDCS.1', 'LDCS.2'`. These columns represent data items where multiple values may have been associated with one sort of thing (such as a course with multiple `HECOS` or `LDCS` codes), and where the multiple values have been unpacked across several columns.*

*In many cases, courses will have just a single code associated with them, which means that the extra columns will contain null values. This is a horrible way of representing data, but one that we see all too often in data that is forced, often naively, into spreadsheets or simple tabular datasets. We will have a chat with that data about improving matters in a later notebook...*
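
As a rough sketch of one possible tidier representation, the repeated columns can be "melted" into one row per course and code pair. The toy table and the `HECOS_CODE` column name below are invented for illustration; this is not necessarily the approach a fuller treatment would take:

```python
import pandas as pd

# Toy wide-format table; the codes are invented for illustration.
wide = pd.DataFrame({
    "KISCOURSEID": ["C1", "C2"],
    "HECOS": ["100001", "100002"],
    "HECOS.1": ["100003", None],
    "HECOS.2": [None, None],
})

hecos_cols = [c for c in wide.columns if c.startswith("HECOS")]

# One row per (course, code) pair; the empty slots are dropped.
long = (
    wide.melt(id_vars="KISCOURSEID", value_vars=hecos_cols,
              value_name="HECOS_CODE")
        .dropna(subset=["HECOS_CODE"])
        .drop(columns="variable")
        .sort_values(["KISCOURSEID", "HECOS_CODE"])
        .reset_index(drop=True)
)
print(long)
```

In the long form, a course with three codes simply has three rows, and no column is left mostly empty.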

Let's review what the columns relate to:

In [5]:
colname_metadata = pd.read_csv("colnames_metadata.csv")

for colname in course_df.columns:
    # This filter generates a dataframe which should have one entry
    txt = colname_metadata[colname_metadata["colname"]==colname]
    # If the dataframe is not empty, we can work with it...
    if not txt.empty:
        txt = txt["Description"].values[0]
        print(f"{colname}: {txt}")
    else:
        print(f"Can't find a lookup for column {colname}")

PUBUKPRN: Publication UK provider reference number for where the course is primarily taught
UKPRN: UK provider reference number, which is the unique identifier allocated to providers by the UK Register of Learning Providers (UKRLP)
ASSURL: URL explaining assessment methods of the course
ASSURLW: URL explaining assessment methods of the course in Welsh
CRSECSTURL: The URL for the course cost page
CRSECSTURLW: The URL for the course cost page in Welsh
CRSEURL: The URL for the course page
CRSEURLW: The URL for the course page in Welsh
DISTANCE: Whether the course is offered wholly through distance learning
EMPLOYURL: URL for further details on employment opportunities
EMPLOYURLW: URL for further details on employment opportunities in Welsh
FOUNDATION: Foundation year availability
HONOURS: Honours degree availability
HECOS: HECOS code
Can't find a lookup for column HECOS.1
Can't find a lookup for column HECOS.2
Can't find a lookup for column HECOS.3
Can't find a lookup for column HECOS.4
K

In [6]:
course_names = course_df[["UKPRN", "KISCOURSEID", "TITLE"]].drop_duplicates()

In [7]:
entry_df = pd.merge(entry_df, course_names, on=["UKPRN", "KISCOURSEID"])

entry_df.head()

Unnamed: 0,PUBUKPRN,UKPRN,KISCOURSEID,KISMODE,ENTUNAVAILREASON,ENTPOP,ENTAGG,ENTSBJ,ACCESS,ALEVEL,BACC,DEGREE,FOUNDTN,NOQUALS,OTHER,OTHERHE,NAME,TITLE
0,10000047,10001143,PSSFDOPTDIS,1,0,30.0,14.0,,0.0,70.0,0.0,15.0,0.0,5.0,10.0,0.0,Canterbury Christ Church University,Ophthalmic Dispensing
1,10001143,10001143,PECFDECEDCA,1,0,20.0,14.0,,0.0,85.0,0.0,5.0,0.0,0.0,0.0,10.0,Canterbury Christ Church University,Early Childhood Education And Care
2,10001143,10001143,PFDDGAC4,1,1,145.0,13.0,CAH15-01-02,4.0,92.0,1.0,0.0,0.0,1.0,1.0,1.0,Canterbury Christ Church University,Applied Criminology
3,10001143,10001143,PFDDGAC4,2,1,,,,,,,,,,,,Canterbury Christ Church University,Applied Criminology
4,10001143,10001143,PFDDGAD4,1,1,,,,,,,,,,,,Canterbury Christ Church University,Mechanical Engineering (Advance Manufacture)
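
Note that `drop_duplicates()` only guarantees unique *rows*, not unique *keys*: if the same `UKPRN` and `KISCOURSEID` pair ever appeared with two different titles, the merge would duplicate entry rows. Whether that happens in the real data is not something checked here; the sketch below, on invented toy data, shows how the `validate` argument of `pd.merge()` can be used to test for it:

```python
import pandas as pd

# Toy tables; all values are made up for illustration.
entries = pd.DataFrame({
    "UKPRN": [1, 1],
    "KISCOURSEID": ["A", "B"],
    "ENTPOP": [30.0, 20.0],
})
# Course "A" appears with two different titles, so drop_duplicates()
# keeps both rows.
titles = pd.DataFrame({
    "UKPRN": [1, 1, 1],
    "KISCOURSEID": ["A", "A", "B"],
    "TITLE": ["Title A", "Title A (part time)", "Title B"],
}).drop_duplicates()

keys = ["UKPRN", "KISCOURSEID"]

# Without validation, the merge quietly produces an extra row for course "A".
merged = pd.merge(entries, titles, on=keys)

# validate="many_to_one" raises MergeError if the right-hand keys repeat.
try:
    pd.merge(entries, titles, on=keys, validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

print(len(entries), len(merged), keys_unique)
```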


*In passing, we note that if we hadn't handled the case of the repeated column names (which have no entry in the lookup table), our simple print routine would have raised an error.*

*Hacking extra columns into a spreadsheet to cope with occasional rows that misbehave in a particular column (for example, by trying to force multiple values into a column that expects a single value) is __not advised__. It may give you a temporary fix to a particular problem, but it can cause all sorts of issues when you try to work with the dataset, and force you into ever more elaborate, and hacky, ways of trying to work around the earlier hack.*
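
To make the failure mode concrete, here is a small sketch on a toy metadata table (a single row, with the description text copied from the output above). It shows the error that the emptiness check guards against, and one alternative way of doing the lookup with a plain dictionary and a default value:

```python
import pandas as pd

# Toy metadata table with a single known column name.
colname_metadata = pd.DataFrame({
    "colname": ["HECOS"],
    "Description": ["HECOS code"],
})

# Filtering on a missing column name returns an empty frame, and taking
# the first value of an empty array raises IndexError.
missing = colname_metadata[colname_metadata["colname"] == "HECOS.1"]
try:
    missing["Description"].values[0]
    failed = False
except IndexError:
    failed = True

# A dict lookup with a default avoids the explicit emptiness check.
descriptions = colname_metadata.set_index("colname")["Description"].to_dict()
for colname in ["HECOS", "HECOS.1"]:
    print(f"{colname}: {descriptions.get(colname, 'no lookup found')}")
```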

## Sorting the Data

The `ENTPOP` column gives the *Number of students in the population from which the entry qualification data is derived for the course* so we can use this naively to try to find the largest courses (which is to say, qualifications) by population.

To limit the amount of data displayed, we can select just a few columns to display.

We can then sort the dataframe on a particular column using the `.sort_values()` method.

Applying the `.head(N)` method lets us limit the results preview to the top $N$ courses.

In [8]:
entry_course_cols = ["NAME", "TITLE", "KISMODE", "ENTPOP"]

entry_df[entry_course_cols].sort_values("ENTPOP").head()

Unnamed: 0,NAME,TITLE,KISMODE,ENTPOP
14409,The University of Wolverhampton,Illustration,2,10.0
23461,The University of Manchester,Geography with International Study,1,10.0
4649,Liverpool Hope University,Music and Tourism (with Foundation Year),1,10.0
33163,The University of Leeds,Music with Enterprise,1,10.0
40809,Aberystwyth University,Sport and Exercise Science (with year in indus...,1,10.0


By default, the sort is applied in an *ascending* order, so to get the *top* results we need a *descending* sort order by setting the `ascending=False` parameter.

In [9]:
entry_df[entry_course_cols].sort_values("ENTPOP", ascending=False).head()

Unnamed: 0,NAME,TITLE,KISMODE,ENTPOP
22246,"University of the Arts, London",Sound Arts,1,3415.0
26171,The Open University,Combined Professional Studies,2,3240.0
26160,The Open University,Business Management and Languages,2,3015.0
26135,The Open University,Business Management,2,2940.0
26142,The Open University,Open,2,2660.0


That OU courses have large populations is no surprise, but the University of the Arts *Sound Arts* course? Is that a valid number or an error? How many courses does the University of the Arts offer? If it just offers "Sound Arts" and "Visual Arts", maybe it is a legitimate number??? Or does it perhaps rival the OU with part-time distance education provision?!

*Another way to assess the size of a course might be relative within a particular provider, eg capturing the size of a course as a percentage of the total population of the provider's courses. (There is a risk that this sort of analysis can be used to highlight low population courses that don't contribute to the bottom line...) We'll save that sort of analysis for when we have a conversation with the data about grouping.*
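
A rough sketch of that relative-size calculation, on made-up numbers, might look like the following. The `PCT_OF_PROVIDER` column name is hypothetical, and the calculation assumes that each `ENTPOP` value really is a per-course count:

```python
import pandas as pd

# Toy data; provider names and populations are invented.
courses = pd.DataFrame({
    "NAME": ["Provider A", "Provider A", "Provider B"],
    "TITLE": ["Law", "Music", "Law"],
    "ENTPOP": [30.0, 10.0, 50.0],
})

# Total population per provider, broadcast back onto each course row.
provider_total = courses.groupby("NAME")["ENTPOP"].transform("sum")
courses["PCT_OF_PROVIDER"] = 100 * courses["ENTPOP"] / provider_total

print(courses.sort_values("PCT_OF_PROVIDER", ascending=False))
```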

In passing, it's also worth noting that, by default, the `.sort_values()` method puts `NaN` values at the end of the dataframe whichever way the sort runs, so let's just check the tail of the data to see if there are `NaN` values there:

In [10]:
entry_df[entry_course_cols].sort_values("ENTPOP", ascending=False).tail()

Unnamed: 0,NAME,TITLE,KISMODE,ENTPOP
42151,The Cambridge Theological Federation,Theology Ministry and Mission,2,
42174,South Gloucestershire and Stroud College,Public Services,1,
42178,South Gloucestershire and Stroud College,Business,1,
42204,University of St Mark and St John,Performing Arts Education,1,
42226,University of St Mark and St John,Early Years,1,


Okay... so there are null values there. Since we're interested in entry data where there are some people on the course, we can dump the rows where there aren't any:

*Data is often a very literal and pedantic conversant. To get your question answered you often have to qualify the question in all sorts of ways...*

In [11]:
entry_df.dropna(subset=["ENTPOP"], inplace=True)

The `tail()` should now report rather more useful data:

In [12]:
entry_df[entry_course_cols].sort_values("ENTPOP", ascending=False).tail()

Unnamed: 0,NAME,TITLE,KISMODE,ENTPOP
30335,The University of Essex,Social Anthropology,1,10.0
4668,Liverpool Hope University,Nutrition and Tourism,1,10.0
30338,The University of Essex,Sociology and Criminology,1,10.0
4667,Liverpool Hope University,Nutrition and Tourism,1,10.0
7320,The University of Central Lancashire,Chemistry,1,10.0
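
The null handling used in this section can be tried out on a tiny, made-up dataframe. The sketch below shows where `NaN` rows land when sorting, how the `na_position` parameter changes that, and a non-mutating alternative to `dropna(..., inplace=True)`:

```python
import numpy as np
import pandas as pd

# Toy data with one missing population value.
df = pd.DataFrame({"TITLE": ["a", "b", "c"], "ENTPOP": [20.0, np.nan, 10.0]})

# NaN rows sort last by default, whatever the sort direction.
desc = df.sort_values("ENTPOP", ascending=False)

# na_position="first" moves them to the top instead.
nan_first = df.sort_values("ENTPOP", ascending=False, na_position="first")

# Returns a new dataframe and leaves the original untouched.
clean = df.dropna(subset=["ENTPOP"])

print(desc["TITLE"].tolist(), nan_first["TITLE"].tolist(), len(clean))
```

Assigning the result of `dropna()` to a new name keeps the unfiltered data around, which can be handy if we later want to ask how many rows were dropped.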


Let's go back to trying to make sense of the University of the Arts data that leads the table and also see if there's a quick way to handle the Open University data.

Assuming that the `KISMODE` encodings are `1: full-time, 2: part-time, 3: both` (I didn't spot where the lookup for that encoding was?) let's see which are the largest full-time courses:

In [13]:
entry_full_time_filter = entry_df["KISMODE"]==1

entry_df[entry_full_time_filter][entry_course_cols].sort_values("ENTPOP", ascending=False).head()

Unnamed: 0,NAME,TITLE,KISMODE,ENTPOP
22246,"University of the Arts, London",Sound Arts,1,3415.0
22206,"University of the Arts, London",Product and Furniture Design,1,2455.0
22260,"University of the Arts, London",Fashion Media Practice and Criticism,1,2455.0
22223,"University of the Arts, London",Product and Industrial Design,1,2455.0
10840,The Manchester Metropolitan University,Sports Business Management,1,2250.0


Hmmm... Is the University of the Arts reporting the *total* population rather than course populations?! And is the Manchester Met result also an error?

So what am I doing wrong looking at what I thought were course numbers? Or is the data really that bad?!

After chatting to someone who knows about these things (thanks, Dave Kernohan), it seems as if the detail is all in the aggregation levels, which are used to aggregate data over small population courses. So let's see what the `ENTAGG` (_"Aggregation level applied to the entry data for the course"_) looks like...

In [16]:
entry_course_cols2 = entry_course_cols + ["ENTAGG"]

entry_df[entry_course_cols2][entry_full_time_filter].sort_values("ENTPOP", ascending=False).head()

Unnamed: 0,NAME,TITLE,KISMODE,ENTPOP,ENTAGG
22246,"University of the Arts, London",Sound Arts,1,3415.0,12.0
22206,"University of the Arts, London",Product and Furniture Design,1,2455.0,13.0
22260,"University of the Arts, London",Fashion Media Practice and Criticism,1,2455.0,13.0
22223,"University of the Arts, London",Product and Industrial Design,1,2455.0,13.0
10840,The Manchester Metropolitan University,Sports Business Management,1,2250.0,12.0


The aggregation level codes seem to be described in the [*Rounding and aggregation* section](https://www.hesa.ac.uk/collection/c20061/unistats_dataset_file_structure) of the *Unistats dataset file structure* documentation.

```
KISCOURSE level - most recent year's data (AGG=14)
KISCOURSE level - most recent two years' data aggregated (AGG=24)
CAH level 3 (e.g. aerospace engineering) - most recent year's data (AGG=13)
CAH level 3 - most recent two years' data aggregated (AGG=23)
CAH level 2 (e.g. mechanically based engineering) - most recent year's data (AGG=12)
CAH level 2 - most recent two years' data aggregated (AGG=22)
CAH level 1 (e.g. engineering and technology) - most recent year's data (AGG=11)
CAH level 1 - most recent two years' data aggregated (AGG=21)
```

So for a single course, it looks like we want the single year aggregation levels at the course level, aggregation code `14`.

The other current year aggregations (`11`, `12`, `13`) presumably represent subject hierarchical groupings of the data (?).
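
Reading the list above, the codes look as if they are built from two digits: the tens digit for the number of years aggregated and the units digit for the level. On that assumption (which is an interpretation of the list, not something the documentation is quoted as saying), a hypothetical helper such as `describe_agg()` could turn the codes into readable labels:

```python
import pandas as pd

# Descriptions taken from the list above, split by digit (an assumption).
YEARS = {1: "most recent year", 2: "most recent two years aggregated"}
LEVELS = {4: "KISCOURSE", 3: "CAH level 3", 2: "CAH level 2", 1: "CAH level 1"}

def describe_agg(code):
    """Return a readable label for an aggregation code such as 14.0."""
    if pd.isna(code):
        return None
    tens, units = divmod(int(code), 10)
    # Codes outside the documented list raise KeyError rather than guess.
    return f"{LEVELS[units]} ({YEARS[tens]})"

agg = pd.Series([14.0, 24.0, 13.0, None])
labels = agg.map(describe_agg)
print(labels.tolist())
```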

We can add another level of filtering to limit the data to an appropriate aggregation level:

In [27]:
entry_agg14_filter = entry_df["ENTAGG"]==14

entry_df[entry_full_time_filter & entry_agg14_filter][entry_course_cols] \
                    .sort_values("ENTPOP", ascending=False).head(10)

Unnamed: 0,NAME,TITLE,KISMODE,ENTPOP
9713,London School of Business and Management Limited,Business Management including Foundation Year,1,805.0
33994,The University of the West of Scotland,Adult Nursing,1,605.0
29333,The University of Cambridge,Natural Sciences,1,565.0
16320,The University of Liverpool,Law,1,465.0
24004,The University of Manchester,Medicine,1,455.0
5166,The City University,Law,1,445.0
10407,The Manchester Metropolitan University,Law,1,435.0
33748,The University of Newcastle-upon-Tyne,Medicine and Surgery,1,415.0
17813,Birmingham City University,Nursing - Adult (January intake),1,415.0
17816,Birmingham City University,Nursing - Adult (September intake),1,415.0


That looks a bit more realistic. There may be a bit more fettling we could do to try to identify (and filter out) courses with foundation year components, but those numbers look plausible...
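
For reference, the combined filter pattern used above can be reproduced on a small, invented dataframe. The sketch also shows two equivalent spellings: selecting rows and columns in a single `.loc[]` step, and expressing the same conditions as a `.query()` string:

```python
import pandas as pd

# Toy data; all values are made up for illustration.
df = pd.DataFrame({
    "TITLE": ["a", "b", "c", "d"],
    "KISMODE": [1, 1, 2, 1],
    "ENTAGG": [14.0, 12.0, 14.0, 14.0],
    "ENTPOP": [30.0, 2000.0, 40.0, 50.0],
})

full_time = df["KISMODE"] == 1
course_level = df["ENTAGG"] == 14

# Boolean masks combine with & (and), | (or) and ~ (not). If the
# comparisons are written inline, each needs its own parentheses,
# because & binds more tightly than ==.
top = (
    df.loc[full_time & course_level, ["TITLE", "ENTPOP"]]
      .sort_values("ENTPOP", ascending=False)
)

# The same row selection, written as a query string.
same = df.query("KISMODE == 1 and ENTAGG == 14")

print(top)
```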

## Exploring the `CONTINUATION` Dataset

Another way of looking at course sizes is to consider the *continuation* rather than *entry* data.

I'm not really sure what the continuation data actually is. The `CONTPOP` is the _"Population of students with continuation information"_, for example, but what is "continuation"? The `UCONT` is the _"Proportion of students who continued on their course at the HE provider in the year after starting the course"_ so maybe continuation data is based on students who make it into the second year? Note that we also have `UGAINED` as the _"Proportion of students who gained their intended award (or higher) the year after they entered HE"_: does this help measure two year completion rates?

In [42]:
cont_df = pd.read_csv("on_2021_08_11_07_24_51/CONTINUATION.csv")

cont_df.head()

Unnamed: 0,PUBUKPRN,UKPRN,KISCOURSEID,KISMODE,CONTUNAVAILREASON,CONTPOP,CONTAGG,CONTSBJ,UCONT,UDORMANT,UGAINED,ULEFT,ULOWER
0,10000047,10001143,PSSFDOPTDIS,1,0,30.0,14.0,,75.0,0.0,0.0,15.0,15.0
1,10000055,10000055,AB20,1,0,25.0,24.0,,80.0,0.0,10.0,10.0,0.0
2,10000055,10000055,AB29,1,0,10.0,14.0,,80.0,0.0,0.0,20.0,0.0
3,10000055,10000055,AB33,1,0,15.0,14.0,,80.0,0.0,0.0,10.0,5.0
4,10000055,10000055,AB35,1,1,25.0,23.0,CAH06-01-01,75.0,0.0,10.0,15.0,0.0


Based on what we've already learned when chatting to the `ENTRY` dataset, we can preemptively remove any null values and filter the data to the appropriate aggregation level:

In [43]:
# Drop null values
cont_df.dropna(subset=["CONTPOP"], inplace=True)

# Filter to appropriate aggregation level
cont_agg14_filter = cont_df["CONTAGG"]==14

cont_df = cont_df[cont_agg14_filter]
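
Since the same two steps have now been applied to both the `ENTRY` and the `CONTINUATION` data, they could be wrapped up in a small function. The helper below, `course_level_rows()`, is a hypothetical name and not part of the original notebook; it is a minimal sketch that returns a filtered copy rather than modifying the dataframe in place:

```python
import pandas as pd

def course_level_rows(df, pop_col, agg_col, agg_code=14):
    """Keep rows with a known population at the given aggregation level.

    Hypothetical helper, not part of the original notebook.
    """
    has_pop = df[pop_col].notna()
    at_level = df[agg_col] == agg_code
    return df[has_pop & at_level].copy()

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "CONTPOP": [30.0, None, 25.0],
    "CONTAGG": [14.0, None, 24.0],
})
filtered = course_level_rows(toy, "CONTPOP", "CONTAGG")
print(filtered)
```

The same call with `"ENTPOP"` and `"ENTAGG"` would apply the equivalent filter to the entry data.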

We can annotate the `CONTINUATION` data in much the same way as we did the `ENTRY` data:

In [44]:
cont_df = pd.merge(cont_df, ukprns, on="UKPRN")
cont_df = pd.merge(cont_df, course_names, on=["UKPRN", "KISCOURSEID"])

cont_course_cols = ["NAME", "TITLE", "KISMODE", "CONTPOP"]

cont_df[cont_course_cols].head()

Unnamed: 0,NAME,TITLE,KISMODE,CONTPOP
0,Canterbury Christ Church University,Ophthalmic Dispensing,1,30.0
1,Canterbury Christ Church University,Early Childhood Education And Care,1,35.0
2,Canterbury Christ Church University,Digital Media,1,10.0
3,Canterbury Christ Church University,Early Childhood Studies,1,20.0
4,Canterbury Christ Church University,Education Studies,1,15.0


How do the numbers look now, for example, on full-time courses?

In [45]:
cont_full_time_filter = cont_df["KISMODE"]==1

cont_df[cont_full_time_filter][cont_course_cols].sort_values("CONTPOP", ascending=False).head(5)

Unnamed: 0,NAME,TITLE,KISMODE,CONTPOP
7707,The University of Cambridge,Natural Sciences,1,580.0
8637,The University of the West of Scotland,Adult Nursing,1,570.0
2404,London School of Business and Management Limited,Business Management including Foundation Year,1,550.0
4110,The University of Liverpool,Law,1,530.0
1321,De Montfort University,Business and Management,1,480.0
