# Simple Data Reshaping

In an earlier conversation, we noticed that one of the columns in the dataframe created from the `KISCOURSE` dataset appeared to be replicated multiple times. Specifically, the `HECOS` item appeared in several forms (`HECOS`, `HECOS.1`, `HECOS.2`, `HECOS.3`, `HECOS.4`) to capture the fact that several `HECOS` codes might be applied to the same course.

Checking back at the original file, we see that it contained several columns *with the same name*:

In [29]:
!head -n 1 on_2021_08_11_07_24_51/KISCOURSE.csv

PUBUKPRN,UKPRN,ASSURL,ASSURLW,CRSECSTURL,CRSECSTURLW,CRSEURL,CRSEURLW,DISTANCE,EMPLOYURL,EMPLOYURLW,FOUNDATION,HONOURS,HECOS,HECOS,HECOS,HECOS,HECOS,KISCOURSEID,KISMODE,LDCS,LDCS,LDCS,LOCCHNGE,LTURL,LTURLW,NHS,NUMSTAGE,SANDWICH,SUPPORTURL,SUPPORTURLW,TITLE,TITLEW,UCASPROGID,UKPRNAPPLY,YEARABROAD,KISAIMCODE,KISLEVEL


On importing the data, *pandas*, which requires unique column names, appended an incrementing numeric suffix (`.1`, `.2`, and so on) to the duplicated column names to uniquely identify them:

In [28]:
import pandas as pd

course_df = pd.read_csv("on_2021_08_11_07_24_51/KISCOURSE.csv")
course_df.columns

Index(['PUBUKPRN', 'UKPRN', 'ASSURL', 'ASSURLW', 'CRSECSTURL', 'CRSECSTURLW',
       'CRSEURL', 'CRSEURLW', 'DISTANCE', 'EMPLOYURL', 'EMPLOYURLW',
       'FOUNDATION', 'HONOURS', 'HECOS', 'HECOS.1', 'HECOS.2', 'HECOS.3',
       'HECOS.4', 'KISCOURSEID', 'KISMODE', 'LDCS', 'LDCS.1', 'LDCS.2',
       'LOCCHNGE', 'LTURL', 'LTURLW', 'NHS', 'NUMSTAGE', 'SANDWICH',
       'SUPPORTURL', 'SUPPORTURLW', 'TITLE', 'TITLEW', 'UCASPROGID',
       'UKPRNAPPLY', 'YEARABROAD', 'KISAIMCODE', 'KISLEVEL'],
      dtype='object')
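
The renaming behaviour can be reproduced on a tiny, made-up CSV. This is a minimal sketch: the header and the single data row are illustrative, not taken from the real file.

```python
import io

import pandas as pd

# A made-up CSV whose header repeats the column name "HECOS" three times
csv_text = "KISCOURSEID,HECOS,HECOS,HECOS\nAB35,100936,,\n"

# read_csv de-duplicates repeated header names by appending .1, .2, ...
toy_df = pd.read_csv(io.StringIO(csv_text))
column_names = list(toy_df.columns)

print(column_names)
```

The first occurrence keeps its original name; only the repeats receive a suffix.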

Both cases — the duplicated names and the repeated-but-indexed column names — are horrible. So what can we do about it?

One way is to have a single `HECOS` column containing a list of zero or more codes. No matter how many codes apply to a particular course, we still need only one row and one column per course.
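
As a rough sketch of the list-valued option, assuming a toy two-course table with only two code columns (the column name `HECOS_codes` is made up for illustration):

```python
import pandas as pd

# Toy stand-in for the wide course table (structure is illustrative)
wide_df = pd.DataFrame({
    "KISCOURSEID": ["AB35", "RTH2"],
    "HECOS": [100936, 100327],
    "HECOS.1": [None, 100326],
})
code_cols = ["HECOS", "HECOS.1"]

# Build one list per row, skipping the empty (NaN) slots
code_lists = [
    [int(code) for code in row if pd.notna(code)]
    for row in wide_df[code_cols].to_numpy()
]

# "HECOS_codes" is a hypothetical name for the new list-valued column
wide_df["HECOS_codes"] = pd.Series(code_lists, index=wide_df.index)
list_df = wide_df.drop(columns=code_cols)

print(list_df)
```

List-valued cells keep one row per course, but they are awkward to filter and join on, which is one reason to prefer the long form described next.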

Another approach is to "melt" the "wide" data set to a "long" form, where a course with multiple `HECOS` codes is represented multiple times in a table with just a single `HECOS` column, and each occurrence of the course is associated with a single, different `HECOS` code.

When we melt the columns, creating multiple rows from each original row (one per `HECOS` column), we need a set of identifier columns that together uniquely identify each group of melted rows. The 3-tuple of `(UKPRN, KISCOURSEID, KISMODE)` values makes up a unique key within the `KISCOURSE` data, as the following counts of those grouped values show:

In [26]:
course_df[["UKPRN", "KISCOURSEID", "KISMODE"]].value_counts()

UKPRN     KISCOURSEID        KISMODE
10000055  AB20               1          1
10007786  1FREN005UU-201920  1          1
          1ENGL005UU-201920  1          1
          1ENGL006UU-201920  1          1
          1ENGL008DU-201920  2          1
                                       ..
10007138  CBSCESPORFF        2          1
                             1          1
          CBSCESPOR          2          1
                             1          1
99999999  N1EJ001FUU         1          1
Length: 35145, dtype: int64
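
Rather than scanning the counts by eye, uniqueness can also be tested directly with `.duplicated()`. The following is a sketch on a small toy frame, not on the real data:

```python
import pandas as pd

# Toy frame: the same provider/course pair offered in two study modes
toy_df = pd.DataFrame({
    "UKPRN": [10000055, 10007138, 10007138],
    "KISCOURSEID": ["AB20", "CBSCESPOR", "CBSCESPOR"],
    "KISMODE": [1, 1, 2],
})

key_cols = ["UKPRN", "KISCOURSEID", "KISMODE"]

# True only if no combination of key values appears more than once
full_key_is_unique = not toy_df.duplicated(subset=key_cols).any()

# Dropping KISMODE from the key leaves duplicates in this toy frame
pair_is_unique = not toy_df.duplicated(subset=["UKPRN", "KISCOURSEID"]).any()

print(full_key_is_unique, pair_is_unique)
```

Applied to `course_df`, the same test would confirm (or refute) the key without needing to inspect the truncated `value_counts()` output.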

To melt the data, we can use the *pandas* `.melt()` function. This takes the dataframe, the columns we want to use to uniquely identify each group of melted values (`id_vars`), and the names of the value columns we want to melt from wide to long form (`value_vars`). We can also specify the names of the newly derived columns.

Finally, we tidy up the dataset by removing rows where there is no code value.

In [96]:
course_long_df = pd.melt(course_df,
                         # Identifier columns carried through unchanged
                         # (note that KISMODE is not included here)
                         id_vars=["UKPRN", "KISCOURSEID"],
                         value_vars=["HECOS", "HECOS.1", "HECOS.2", "HECOS.3", "HECOS.4"],
                         # Name the new variable and value columns
                         var_name="old_col", value_name="HECOS_",
                        ).dropna(subset=["HECOS_"])

course_long_df

Unnamed: 0,UKPRN,KISCOURSEID,old_col,HECOS_
4,10000055,AB35,HECOS,100936.0
6,10000055,AB39,HECOS,100078.0
8,10000163,BSPF-C800,HECOS,100497.0
9,10000163,BSRDIF-B821,HECOS,100129.0
10,10000163,BSRROF-B822,HECOS,100132.0
...,...,...,...,...
164448,10007789,UNU1F71P401,HECOS.4,100351.0
167378,10007795,RT51,HECOS.4,101164.0
167383,10007795,RTH1,HECOS.4,100326.0
167384,10007795,RTH2,HECOS.4,101168.0
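
The cell above uses only `UKPRN` and `KISCOURSEID` as `id_vars`, although the text identifies `(UKPRN, KISCOURSEID, KISMODE)` as the unique key. As a sketch of what melting on the full key might look like, here is the same call on a toy frame (all values are made up):

```python
import pandas as pd

# Toy wide frame: one course offered in two modes, with up to two codes
wide_df = pd.DataFrame({
    "UKPRN": [1, 1],
    "KISCOURSEID": ["X", "X"],
    "KISMODE": [1, 2],
    "HECOS": [100.0, 100.0],
    "HECOS.1": [200.0, None],
})

long_df = pd.melt(
    wide_df,
    # Carry the full three-part key through the melt
    id_vars=["UKPRN", "KISCOURSEID", "KISMODE"],
    value_vars=["HECOS", "HECOS.1"],
    var_name="old_col",
    value_name="HECOS_",
).dropna(subset=["HECOS_"])  # Discard the empty code slots

print(long_df)
```

Keeping `KISMODE` in the long table means each melted row can still be traced back to exactly one row of the original data.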


We can tidy the dataframe a little by removing the `old_col` column, renaming the `HECOS_` column to `HECOS` (if we had tried to create the new column with the same name as one of the melted columns, *pandas* would have complained, with a warning or an error depending on the version), and casting the `HECOS` values to integers:

In [97]:
course_long_df.drop(columns=["old_col"], inplace=True)
course_long_df.rename(columns={"HECOS_": "HECOS"}, inplace=True)

course_long_df["HECOS"] = course_long_df["HECOS"].astype(int)

course_long_df.head(3)

Unnamed: 0,UKPRN,KISCOURSEID,HECOS
4,10000055,AB35,100936
6,10000055,AB39,100078
8,10000163,BSPF-C800,100497
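
The plain `astype(int)` cast works here only because the rows with missing codes have already been dropped. If the missing values needed to be kept, one option is the nullable `Int64` dtype, sketched below on a made-up series:

```python
import pandas as pd

# Codes read as floats because of the missing value (illustrative data)
codes = pd.Series([100936.0, None, 100078.0])

# The nullable integer dtype keeps the gap as <NA> instead of failing
nullable_codes = codes.astype("Int64")

print(nullable_codes)
```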


Picking one of the courses with multiple codes:

In [98]:
course_long_df[course_long_df["KISCOURSEID"]=="RTH2"]

Unnamed: 0,UKPRN,KISCOURSEID,HECOS
26804,10007795,RTH2,101169
61949,10007795,RTH2,101271
97094,10007795,RTH2,100327
132239,10007795,RTH2,100326
167384,10007795,RTH2,101168

The numeric codes are not very informative by themselves. The HECoS-to-CAH mapping spreadsheet pairs each code with a human-readable label, so we can load that:


In [99]:
cah_xlsx = pd.ExcelFile("HECoS_CAH_Version_1.3.3.xlsx")

cah2hecos_df = pd.read_excel(cah_xlsx, "HECoS_CAH_Mapping (V1.3.3)", skipfooter=1)
cah2hecos_df

Unnamed: 0,HECoS,CAH3,CAH2,CAH1,HECoS (Code only),CAH3 (Code only),CAH2 (Code only),CAH1 (Code only)
0,(100270) medical sciences,(CAH01-01-01) medical sciences (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100270,CAH01-01-01,CAH01-01,CAH01
1,(100267) clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100267,CAH01-01-02,CAH01-01,CAH01
2,(100271) medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100271,CAH01-01-02,CAH01-01,CAH01
3,(100276) pre-clinical medicine,(CAH01-01-02) medicine (non-specific),(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,100276,CAH01-01-02,CAH01-01,CAH01
4,(101334) allergy,(CAH01-01-03) medicine by specialism,(CAH01-01) medicine and dentistry,(CAH01) medicine and dentistry,101334,CAH01-01-03,CAH01-01,CAH01
...,...,...,...,...,...,...,...,...
1087,(101090) study skills,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101090,CAH23-01-02,CAH23-01,CAH23
1088,(101276) work placement experience (personal l...,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101276,CAH23-01-02,CAH23-01,CAH23
1089,(101277) work-based learning,(CAH23-01-02) personal development,(CAH23-01) combined and general studies,(CAH23) combined and general studies,101277,CAH23-01-02,CAH23-01,CAH23
1090,(100314) humanities,(CAH23-01-03) humanities (non-specific),(CAH23-01) combined and general studies,(CAH23) combined and general studies,100314,CAH23-01-03,CAH23-01,CAH23

Each entry in the `HECoS` column combines a code and its label in a single string of the form `(code) label`. We can use the `parse` package to split the two apart:


In [89]:
from parse import parse

# The function returns a 2-tuple: (CODE, LABEL)
cleanit = lambda x: parse("({}) {}", x)[:]

# Apply this function to each item in the column
# Then split the 2-tuple into two Series (i.e. two columns)
hecos_df = cah2hecos_df["HECoS"].apply(cleanit).apply(pd.Series)

hecos_df.head()

Unnamed: 0,0,1
0,100270,medical sciences
1,100267,clinical medicine
2,100271,medicine
3,100276,pre-clinical medicine
4,101334,allergy
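
If the `parse` package is not available, a similar split can be sketched with the built-in *pandas* string methods. This is an alternative to the approach above rather than a replacement for it, and it assumes every entry follows the `(digits) label` pattern:

```python
import pandas as pd

# A couple of entries in the same "(code) label" format as the HECoS column
raw_labels = pd.Series([
    "(100270) medical sciences",
    "(100267) clinical medicine",
])

# Capture the digits inside the brackets and the text that follows them
parts_df = raw_labels.str.extract(r"^\((\d+)\)\s+(.*)$")
parts_df.columns = ["HECOS", "Label"]
parts_df["HECOS"] = parts_df["HECOS"].astype(int)

print(parts_df)
```

Entries that do not match the pattern would come back as missing values, so the integer cast would need guarding on messier input.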


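Rename the columns, and cast the codes to integers so that they match the type of the `HECOS` column in the courses dataframe: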

In [100]:
hecos_df.columns = ["HECOS", "Label"]
hecos_df["HECOS"] = hecos_df["HECOS"].astype(int)

hecos_df.head()

Unnamed: 0,HECOS,Label
0,100270,medical sciences
1,100267,clinical medicine
2,100271,medicine
3,100276,pre-clinical medicine
4,101334,allergy


And annotate the courses dataframe with meaningful `HECOS` code labels:

In [103]:
course_long_df = pd.merge(course_long_df, hecos_df, on="HECOS")

course_long_df.head()

Unnamed: 0,UKPRN,KISCOURSEID,HECOS,Label
0,10000055,AB35,100936,animal health
1,10000721,FDAAHW,100936,animal health
2,10000721,FDAAHW,100936,animal health
3,10000721,FDAAHW2,100936,animal health
4,10000721,FDAAHW2,100936,animal health


So what's an example of multiple tags used for a course?

In [104]:
course_long_df[course_long_df["KISCOURSEID"]=="RTH2"]

Unnamed: 0,UKPRN,KISCOURSEID,HECOS,Label
15015,10007795,RTH2,100327,Italian studies
15534,10007795,RTH2,100326,Italian language
15637,10007795,RTH2,101271,East Asian studies
15851,10007795,RTH2,101169,Japanese languages
15961,10007795,RTH2,101168,Japanese studies


Let's add the course names back in...

In [105]:
course_long_df = pd.merge(course_long_df,
                          course_df[["UKPRN", "KISCOURSEID", "TITLE"]],
                          on=["UKPRN", "KISCOURSEID"])

course_long_df.head()

Unnamed: 0,UKPRN,KISCOURSEID,HECOS,Label,TITLE
0,10000055,AB35,100936,animal health,Animal Therapy and Rehabilition
1,10000721,FDAAHW,100936,animal health,Applied Animal Health and Welfare
2,10000721,FDAAHW,100936,animal health,Applied Animal Health and Welfare
3,10000721,FDAAHW,100936,animal health,Applied Animal Health and Welfare
4,10000721,FDAAHW,100936,animal health,Applied Animal Health and Welfare
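
The repeated `FDAAHW` rows in the output above suggest that a `(UKPRN, KISCOURSEID)` pair can match more than one row on each side of the merge, for example one row per `KISMODE`. The `validate` argument of `pd.merge()` can flag this. The sketch below uses made-up toy frames and one possible workaround (de-duplicating the titles first):

```python
import pandas as pd

# Toy long table of codes and toy course table (all values are made up)
codes_df = pd.DataFrame({
    "UKPRN": [1, 1],
    "KISCOURSEID": ["X", "X"],
    "HECOS": [100, 200],
})
courses_df = pd.DataFrame({
    "UKPRN": [1, 1],
    "KISCOURSEID": ["X", "X"],
    "KISMODE": [1, 2],
    "TITLE": ["Example Course", "Example Course"],
})
keys = ["UKPRN", "KISCOURSEID"]

# validate="many_to_one" raises if the right-hand keys are not unique
try:
    pd.merge(codes_df, courses_df[keys + ["TITLE"]], on=keys,
             validate="many_to_one")
    right_keys_duplicated = False
except pd.errors.MergeError:
    right_keys_duplicated = True

# One workaround: keep a single title row per key before merging
titles_df = courses_df[keys + ["TITLE"]].drop_duplicates()
merged_df = pd.merge(codes_df, titles_df, on=keys, validate="many_to_one")

print(right_keys_duplicated, len(merged_df))
```

De-duplicating the titles assumes that a course has the same title in every mode. Where that does not hold, merging on the full three-part key would be the safer choice.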


So what course is tagged as both Italian and Japanese?!

In [106]:
course_long_df[course_long_df["KISCOURSEID"]=="RTH2"]

Unnamed: 0,UKPRN,KISCOURSEID,HECOS,Label,TITLE
19647,10007795,RTH2,100327,Italian studies,Italian B and Japanese
19648,10007795,RTH2,100326,Italian language,Italian B and Japanese
19649,10007795,RTH2,101271,East Asian studies,Italian B and Japanese
19650,10007795,RTH2,101169,Japanese languages,Italian B and Japanese
19651,10007795,RTH2,101168,Japanese studies,Italian B and Japanese


Silly question!
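
To see all of the labels for a course on a single row, the long table can be grouped back up again. A minimal sketch, using a few rows shaped like the output above:

```python
import pandas as pd

# A few rows in the same shape as the labelled long table
toy_long_df = pd.DataFrame({
    "UKPRN": [10000055, 10007795, 10007795],
    "KISCOURSEID": ["AB35", "RTH2", "RTH2"],
    "Label": ["animal health", "Italian studies", "Japanese studies"],
})

# Collect the labels for each course into a list
labels_per_course = (
    toy_long_df.groupby(["UKPRN", "KISCOURSEID"])["Label"]
    .agg(list)
    .reset_index()
)

print(labels_per_course)
```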