# Making Sense of Datafile Contents

When we load a datafile, it is not always clear what its contents relate to, because the column names used within the datafiles are rather opaque.

*Experienced users of the data will be familiar with decoding the column names, but the rest of us need all the help we can get.*

Let's import the *pandas* package:

In [3]:
import pandas as pd

And preview one of the datafiles:

In [4]:
entry_df = pd.read_csv("on_2021_08_11_07_24_51/ENTRY.csv")
entry_df.head()

Unnamed: 0,PUBUKPRN,UKPRN,KISCOURSEID,KISMODE,ENTUNAVAILREASON,ENTPOP,ENTAGG,ENTSBJ,ACCESS,ALEVEL,BACC,DEGREE,FOUNDTN,NOQUALS,OTHER,OTHERHE
0,10000047,10001143,PSSFDOPTDIS,1,0,30.0,14.0,,0.0,70.0,0.0,15.0,0.0,5.0,10.0,0.0
1,10000055,10000055,AB20,1,0,20.0,14.0,,0.0,80.0,0.0,0.0,0.0,5.0,10.0,5.0
2,10000055,10000055,AB29,1,0,10.0,24.0,,0.0,80.0,0.0,10.0,0.0,10.0,0.0,0.0
3,10000055,10000055,AB33,1,0,20.0,14.0,,0.0,90.0,0.0,5.0,0.0,0.0,5.0,0.0
4,10000055,10000055,AB35,1,0,25.0,13.0,CAH06-01-01,0.0,75.0,0.0,0.0,0.0,5.0,10.0,10.0
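
The `Unnamed: 0` column in the preview is consistent with a CSV file that was saved with its index and then read back without `index_col`. The following is a minimal sketch of that behaviour, using a small made-up CSV string (not the real `ENTRY.csv`) and assuming the first column really is a saved index:

```python
import io

import pandas as pd

# Toy CSV text standing in for a file saved with its index (illustrative data only)
csv_text = ",KISCOURSEID,ENTPOP\n0,AB20,20.0\n1,AB29,10.0\n"

# Default read: the unnamed first column becomes a data column called "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead of as data
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

If the assumption holds for the real files, passing `index_col=0` to `pd.read_csv()` keeps the dataframe columns limited to the documented fields.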


It's not obvious what each of those columns relates to, so let's load in the metadata we scraped, which provides a simple description of each column:

In [5]:
colname_metadata = pd.read_csv("colnames_metadata.csv")
colname_metadata.head()

Unnamed: 0,Field name,Description,Min/Max occurs,Field Length,csv_file,colname
0,KIS.Collection,HESA Collection reference identifier,0/1,6,KIS,Collection
1,Institution.UKPRN,"UK provider reference number, which is the uni...",1/1,8,Institution,UKPRN
2,Institution.PUBUKPRN,Publication UK provider reference number for w...,1/1,8,Institution,PUBUKPRN
3,Institution.COUNTRY,"Country of provider (England, Wales, Northern ...",1/1,2,Institution,COUNTRY
4,Institution.PUBUKPRNCOUNTRY,"Country of publication provider (England, Wale...",1/1,2,Institution,PUBUKPRNCOUNTRY
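
Because the metadata records both a `csv_file` and a `colname` for each field, the same short column name may appear under more than one file. A quick way to check for that, sketched here on a toy table (the rows, including the `Course` file name, are made up for illustration), is to look for repeated `colname` values:

```python
import pandas as pd

# Toy metadata rows (illustrative only; "Course" is an assumed csv_file value)
toy_metadata = pd.DataFrame(
    {
        "csv_file": ["Institution", "Institution", "Course"],
        "colname": ["UKPRN", "COUNTRY", "UKPRN"],
        "Description": [
            "Provider reference",
            "Country of provider",
            "Provider reference",
        ],
    }
)

# Keep every row whose short column name occurs more than once
repeated = toy_metadata[toy_metadata["colname"].duplicated(keep=False)]
print(repeated[["csv_file", "colname"]])
```

Where repeats exist, a lookup on `colname` alone returns several rows, so it may be worth filtering on `csv_file` as well.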


We can use the following recipe to print the description of a column given the column name:

In [6]:
# Select the metadata rows for this column name and take the first description
txt = colname_metadata.loc[colname_metadata["colname"] == "ENTPOP", "Description"].values[0]

print(txt)

Number of students in the population from which the entry qualification data is derived for the course
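
The recipe above raises an `IndexError` if a column name has no entry in the metadata. One way to make it reusable is to wrap it in a small helper that falls back to a default string. This is a sketch only: `describe_column` and its `default` argument are hypothetical names introduced here, and the toy table reuses two descriptions from the metadata.

```python
import pandas as pd


def describe_column(metadata, colname, default="(no description found)"):
    """Return the first matching description for colname, or a default."""
    matches = metadata.loc[metadata["colname"] == colname, "Description"]
    return matches.iloc[0] if not matches.empty else default


# Toy stand-in for colname_metadata
toy_metadata = pd.DataFrame(
    {
        "colname": ["ENTPOP", "ENTAGG"],
        "Description": [
            "Number of students in the population from which the entry "
            "qualification data is derived for the course",
            "Aggregation level applied to the entry data for the course",
        ],
    }
)

print(describe_column(toy_metadata, "ENTAGG"))
print(describe_column(toy_metadata, "Unnamed: 0"))
```

With the real data, the call would be `describe_column(colname_metadata, "ENTPOP")`.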


We can now generate a report describing each column in a dataframe loaded from a particular datafile:

In [9]:
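for colname in entry_df.columns:
    # Look up the description for this column in the scraped metadata
    matches = colname_metadata.loc[colname_metadata["colname"] == colname, "Description"]
    # Skip columns with no metadata entry rather than raising an IndexError
    if not matches.empty:
        print(f"{colname}: {matches.iloc[0]}")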

PUBUKPRN: Publication UK provider reference number for where the course is primarily taught
UKPRN: UK provider reference number, which is the unique identifier allocated to providers by the UK Register of Learning Providers (UKRLP)
KISCOURSEID: An identifier which uniquely identifies a course within a provider
KISMODE: The mode of the KIS course (full-time, part-time, both)
ENTUNAVAILREASON: Indicator of the reason why data for a course may not be available
ENTPOP: Number of students in the population from which the entry qualification data is derived for the course
ENTAGG: Aggregation level applied to the entry data for the course
ENTSBJ: CAH Level subject code
ACCESS: Proportion of students whose highest qualification on entry is an access course
ALEVEL: Proportion of students whose highest qualification on entry is A level or (Scottish) Highers
BACC: Proportion of students whose highest qualification on entry is an International Baccalaureate
DEGREE: Proportion of students whose hig