# Initial HESA HE Enrolment Data Review

In [1]:
import pandas as pd

## HE enrolment by Subject

In [3]:
enrol_sub_df = pd.read_csv("HE enrolment by Subject.csv")
enrol_sub_df.head()

Unnamed: 0,CAH level marker,CAH level subject,Level of study,Mode of study,Academic Year,Category marker,Category,Number
0,CAH level 1,01 Medicine and dentistry,All,Full-time,2020/21,Sex,Female,39790
1,CAH level 1,01 Medicine and dentistry,All,Full-time,2020/21,Sex,Male,24740
2,CAH level 1,01 Medicine and dentistry,All,Full-time,2020/21,Sex,Other,110
3,CAH level 1,01 Medicine and dentistry,All,Full-time,2020/21,Domicile,Not known,0
4,CAH level 1,01 Medicine and dentistry,All,Full-time,2020/21,Domicile,England,43530


In [4]:
enrol_sub_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119175 entries, 0 to 119174
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   CAH level marker   119175 non-null  object
 1   CAH level subject  119175 non-null  object
 2   Level of study     119175 non-null  object
 3   Mode of study      119175 non-null  object
 4   Academic Year      119175 non-null  object
 5   Category marker    119175 non-null  object
 6   Category           119175 non-null  object
 7   Number             119175 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 7.3+ MB


The dataset contains 119,175 rows and 8 columns. One column (Number) has an integer data type and the remaining seven columns are objects. An index column has been created for the dataframe. There are no nulls in the dataset.

In [5]:
enrol_sub_df.describe()

Unnamed: 0,Number
count,119175.0
mean,6384.784
std,59145.15
min,0.0
25%,0.0
50%,55.0
75%,675.0
max,2751865.0


Summary statistics for the Number column of the enrolment data.

**CAH level marker**

CAH is the Common Aggregation Hierarchy used to code subjects.

In [7]:
enrol_sub_df["CAH level marker"].describe()

count          119175
unique              2
top       CAH level 3
freq           103425
Name: CAH level marker, dtype: object

The data in this column has 2 unique values and the data type is object.

In [8]:
enrol_sub_df["CAH level marker"].unique()

array(['CAH level 1', 'CAH level 3'], dtype=object)

CAH level 1 is the high-level subject group, e.g. CAH01 Medicine and dentistry, and CAH level 3 is the detailed subject, e.g. CAH01-01-01 Medical sciences (non-specific). This marker may not be relevant to include in our analysis, as we want to look at the detailed subject level to investigate what women are studying.


In [11]:
# look at removing/dropping CAH level marker
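
One way this step might look is sketched below. It assumes we keep only the 'CAH level 3' rows and then drop the marker column, which would be constant at that point. `toy_df` and `detailed_df` are hypothetical names, and the rows and numbers are illustrative stand-ins rather than values from the file.

```python
import pandas as pd

# Toy stand-in for enrol_sub_df (illustrative rows and numbers only)
toy_df = pd.DataFrame({
    "CAH level marker": ["CAH level 1", "CAH level 3", "CAH level 3"],
    "CAH level subject": [
        "01 Medicine and dentistry",
        "01-01-01 Medical sciences (non-specific)",
        "01-01-01 Medical sciences (non-specific)",
    ],
    "Category": ["Female", "Female", "Male"],
    "Number": [10, 4, 6],
})

# Keep only the detailed (level 3) subjects, then drop the now-constant marker
detailed_df = (
    toy_df[toy_df["CAH level marker"] == "CAH level 3"]
    .drop(columns="CAH level marker")
    .reset_index(drop=True)
)
print(detailed_df)
```

The same pattern should apply to `enrol_sub_df` once we have confirmed that the level 1 rows are not needed.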

**CAH level subject**

In [9]:
enrol_sub_df["CAH level subject"].describe()

count     119175
unique       193
top        Total
freq        1260
Name: CAH level subject, dtype: object

The data in this column has 193 unique values and the data type is object.

In [10]:
enrol_sub_df["CAH level subject"].unique()

array(['01 Medicine and dentistry', '02 Subjects allied to medicine',
       '03 Biological and sport sciences', '04 Psychology',
       '05 Veterinary sciences',
       '06 Agriculture, food and related studies', '07 Physical sciences',
       '09 Mathematical sciences', '10 Engineering and technology',
       '11 Computing', '13 Architecture, building and planning',
       '26 Geography, earth and environmental studies (natural sciences)',
       'Total science CAH level 1', '15 Social sciences', '16 Law',
       '17 Business and management',
       '24 Media, journalism and communications',
       '19 Language and area studies',
       '20 Historical, philosophical and religious studies',
       '25 Design, and creative and performing arts',
       '22 Education and teaching', '23 Combined and general studies',
       '26 Geography, earth and environmental studies (social sciences)',
       'Total non-science CAH level 1', 'Total',
       '01-01-01 Medical sciences (non-specific)',


**Level of study**

In [12]:
enrol_sub_df["Level of study"].describe()

count     119175
unique         7
top          All
freq       17400
Name: Level of study, dtype: object

The data in this column has 7 unique values and the data type is object.

In [13]:
enrol_sub_df["Level of study"].unique()

array(['All', 'Postgraduate (research)', 'Postgraduate (taught)',
       'All postgraduate', 'First degree', 'Other undergraduate',
       'All undergraduate'], dtype=object)

Looking at the values in this column, it would be interesting to see the number and proportion of females studying different subjects at different levels.

To get the breakdown, we will remove rows with a value of 'All'. We will need to begin analysing the data to review the merits of retaining the 'All undergraduate' and 'All postgraduate' values.

'All postgraduate' combines the 'Postgraduate (taught)' and 'Postgraduate (research)' students, whereas I think we would be better splitting these out.

'All undergraduate' combines 'First degree' and 'Other undergraduate'. We may not want to include 'Other undergraduate' in our analysis, as it is a broad category where we cannot specify the degree achieved as an outcome.

We will need to be consistent with the approach to the Student data analysis.

In [14]:
# look to remove rows with a value of 'All'
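
A minimal sketch of this filter follows, using a toy frame with illustrative numbers. The second filter shows what dropping every aggregate label would look like if we decide against keeping 'All undergraduate' and 'All postgraduate'; that decision is still open. `toy_df`, `level_df`, `aggregate_levels` and `breakdown_df` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for enrol_sub_df (illustrative numbers only)
toy_df = pd.DataFrame({
    "Level of study": [
        "All", "First degree", "Postgraduate (taught)", "All undergraduate",
    ],
    "Number": [100, 60, 40, 60],
})

# Remove only the overall 'All' rows
level_df = toy_df[toy_df["Level of study"] != "All"].reset_index(drop=True)

# Alternative: remove every aggregate label, leaving only the detailed levels
aggregate_levels = ["All", "All undergraduate", "All postgraduate"]
breakdown_df = toy_df[~toy_df["Level of study"].isin(aggregate_levels)]
print(breakdown_df)
```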

**Mode of study**

In [15]:
enrol_sub_df["Mode of study"].describe()

count     119175
unique         3
top          All
freq       40140
Name: Mode of study, dtype: object

The data in this column has 3 unique values and the data type is object.

In [16]:
enrol_sub_df["Mode of study"].unique()

array(['Full-time', 'Part-time', 'All'], dtype=object)

This column records whether students are studying full-time or part-time. Here I think it will be useful to retain all values initially. We might want to use the 'All' rows to get the general trends, and then look at the full-time and part-time rows split by gender to see which subjects women are studying full-time or part-time.

**Academic Year**

In [17]:
enrol_sub_df["Academic Year"].describe()

count      119175
unique          2
top       2019/20
freq        59670
Name: Academic Year, dtype: object

The data in this column has 2 unique values and the data type is object.

In [18]:
enrol_sub_df["Academic Year"].unique()

array(['2020/21', '2019/20'], dtype=object)

We only have two years' worth of data, 2019/20 and 2020/21, which is insufficient to undertake a time series analysis. The reason for this is that a change to subject coding was implemented ahead of 2019/20, so earlier data is not suitable for comparison. It will still be interesting to discover what subjects women are studying, and this is something that can be investigated further as more data becomes available.

**Category marker**

In [19]:
enrol_sub_df["Category marker"].describe()

count       119175
unique           4
top       Domicile
freq         79450
Name: Category marker, dtype: object

The data in this column has 4 unique values and the data type is object.

In [20]:
enrol_sub_df["Category marker"].unique()

array(['Sex', 'Domicile', 'Error', 'Total'], dtype=object)

An immediate flag on this dataset is the inclusion of 'Error' as a value. The reason behind this value will need to be investigated to see what is included in this grouping. Our project focuses on gender, so we will retain only rows where the value is 'Sex'. We will need to confirm whether a value for sex is available for all rows in the dataset.

In [21]:
# look to remove rows where the value is not equal to sex.
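
A minimal sketch of this filter is shown below, using the five rows displayed by `head()` above as a small stand-in for the full dataframe. `toy_df` and `sex_df` are hypothetical names.

```python
import pandas as pd

# The five rows shown by head() above, reduced to the relevant columns
toy_df = pd.DataFrame({
    "Category marker": ["Sex", "Sex", "Sex", "Domicile", "Domicile"],
    "Category": ["Female", "Male", "Other", "Not known", "England"],
    "Number": [39790, 24740, 110, 0, 43530],
})

# Keep only the rows that describe a breakdown by sex
sex_df = toy_df[toy_df["Category marker"] == "Sex"].reset_index(drop=True)
print(sex_df)
```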

**Category**

In [22]:
enrol_sub_df["Category"].describe()

count     119175
unique        15
top       Female
freq        7945
Name: Category, dtype: object

The data in this column has 15 unique values and the data type is object.

In [23]:
enrol_sub_df["Category"].unique()

array(['Female', 'Male', 'Other', 'Not known', 'England', 'Wales',
       'Scotland', 'Northern Ireland', 'Other UK', 'Total UK',
       'European Union', 'Non-European Union', 'Total Non-UK',
       'Not known ', 'Total'], dtype=object)

We can see from the values in the above array that those which map to the category marker 'Sex' are 'Female', 'Male' and 'Other'. There are no error values in this column, but we will need to check the Category column against the Category marker column for inconsistencies. Note also that 'Not known' appears twice in the array, once with a trailing space, so the two are counted as separate values.

We will retain only the 'Sex' values for our analysis.

In [24]:
# look to remove rows where the value is not a gender value
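
The consistency check could be sketched as follows. The toy rows are illustrative, and pairing 'Not known ' with 'Domicile' is an assumption made only for the example. The sketch strips stray whitespace, cross-tabulates marker against category, and lists any rows where a sex value sits under a non-'Sex' marker or the reverse. `toy_df`, `pairs`, `sex_categories` and `mismatched` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for enrol_sub_df (pairings are illustrative assumptions)
toy_df = pd.DataFrame({
    "Category marker": ["Sex", "Sex", "Domicile", "Domicile", "Total"],
    "Category": ["Female", "Male", "Not known", "Not known ", "Total"],
})

# Normalise stray whitespace so 'Not known ' and 'Not known' become one label
toy_df["Category"] = toy_df["Category"].str.strip()

# Cross-tabulate marker against category to spot unexpected pairings
pairs = pd.crosstab(toy_df["Category marker"], toy_df["Category"])
print(pairs)

# Rows where the marker and the category disagree about being a sex value
sex_categories = ["Female", "Male", "Other"]
is_sex_marker = toy_df["Category marker"] == "Sex"
is_sex_value = toy_df["Category"].isin(sex_categories)
mismatched = toy_df[is_sex_marker != is_sex_value]
print(mismatched)
```

An empty `mismatched` frame would suggest the two columns agree, at which point filtering on either column should give the same rows.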

## HE students by CAH level 1 and sex

In [25]:
enrol_CAH1_df = pd.read_csv("HE students by CAH level 1 and sex.csv")
enrol_CAH1_df.head()

Unnamed: 0,CAH level 1,Sex,Academic Year,Number
0,01 Medicine and dentistry,Female,2019/20,42610
1,01 Medicine and dentistry,Male,2019/20,27605
2,01 Medicine and dentistry,Female,2020/21,47435
3,01 Medicine and dentistry,Male,2020/21,29445
4,02 Subjects allied to medicine,Female,2019/20,233810


In [27]:
enrol_CAH1_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88 entries, 0 to 87
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   CAH level 1    88 non-null     object
 1   Sex            88 non-null     object
 2   Academic Year  88 non-null     object
 3   Number         88 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 2.9+ KB


There are 88 rows and 4 columns in the dataset, which includes data on the level 1 CAH grouping and sex. One column has a data type of integer and the remaining columns are objects.

In [26]:
enrol_CAH1_df.describe()

Unnamed: 0,Number
count,88.0
mean,59941.193182
std,61212.355741
min,1945.0
25%,21277.5
50%,36875.0
75%,71757.5
max,269305.0


Summary statistics for the number column

**CAH level 1**

In [28]:
enrol_CAH1_df["CAH level 1"].describe()

count                            88
unique                           22
top       01 Medicine and dentistry
freq                              4
Name: CAH level 1, dtype: object

The data in this column has 22 unique values and the data type is object.

In [29]:
enrol_CAH1_df["CAH level 1"].unique()

array(['01 Medicine and dentistry', '02 Subjects allied to medicine',
       '03 Biological and sport sciences', '04 Psychology',
       '05 Veterinary sciences',
       '06 Agriculture, food and related studies', '07 Physical sciences',
       '09 Mathematical sciences', '10 Engineering and technology',
       '11 Computing', '13 Architecture, building and planning',
       '26 Geography, earth and environmental studies (natural sciences)',
       '15 Social sciences', '16 Law', '17 Business and management',
       '19 Language and area studies',
       '20 Historical, philosophical and religious studies',
       '22 Education and teaching', '23 Combined and general studies',
       '24 Media, journalism and communications',
       '25 Design, and creative and performing arts',
       '26 Geography, earth and environmental studies (social sciences)'],
      dtype=object)

This is the level 1 CAH grouping for subject. We will include all values here as we wish to investigate the subjects being studied by women.

**Sex**

In [30]:
enrol_CAH1_df["Sex"].describe()

count         88
unique         2
top       Female
freq          44
Name: Sex, dtype: object

The data in this column has 2 unique values and the data type is object.

In [31]:
enrol_CAH1_df["Sex"].unique()

array(['Female', 'Male'], dtype=object)

There are only two values recorded for sex: male and female. When undertaking our analysis, we will need to be aware that our student and graduate datasets have the additional value of 'Other'.

**Academic Year**

In [32]:
enrol_CAH1_df["Academic Year"].describe()

count          88
unique          2
top       2019/20
freq           44
Name: Academic Year, dtype: object

The data in this column has 2 unique values and the data type is object.

In [33]:
enrol_CAH1_df["Academic Year"].unique()

array(['2019/20', '2020/21'], dtype=object)

There are only two academic years' worth of data for us to analyse due to the changes in subject coding introduced ahead of 2019/20. Whilst this is insufficient to undertake a time series analysis, it will be interesting to see which subjects are being studied by women in each year. This is also something that can be investigated further as more data becomes available.
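
As a rough sketch of how the per-year comparison might be set up, the example below pivots the four Medicine and dentistry rows shown by `head()` above and computes the female share of enrolments for each year. `toy_df`, `counts` and the 'Female share' column are hypothetical names, and the approach is one option rather than a settled method.

```python
import pandas as pd

# The four '01 Medicine and dentistry' rows shown by head() above
toy_df = pd.DataFrame({
    "CAH level 1": ["01 Medicine and dentistry"] * 4,
    "Sex": ["Female", "Male", "Female", "Male"],
    "Academic Year": ["2019/20", "2019/20", "2020/21", "2020/21"],
    "Number": [42610, 27605, 47435, 29445],
})

# One row per subject and year, one column per sex
counts = toy_df.pivot_table(
    index=["CAH level 1", "Academic Year"],
    columns="Sex",
    values="Number",
    aggfunc="sum",
)

# Female enrolments as a share of female plus male enrolments
counts["Female share"] = counts["Female"] / (counts["Female"] + counts["Male"])
print(counts)
```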

## Percentage of HE Student Enrolments in Science Subjects by Personal Characteristics

In [34]:
enrol_science_df = pd.read_csv("Percentage of HE student enrolments in science subjects by personal characteristics.csv")
enrol_science_df.head()

Unnamed: 0,Category Marker,Category,First year marker,Level of study,Mode of study,Country of HE provider,Academic Year,Percentage
0,Sex,Female,All,All,All,All,2019/20,41%
1,Sex,Female,All,All,All,All,2020/21,42%
2,Sex,Female,All,All,All,England,2019/20,40%
3,Sex,Female,All,All,All,England,2020/21,41%
4,Sex,Female,All,All,All,Northern Ireland,2019/20,51%


In [35]:
enrol_science_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7232 entries, 0 to 7231
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Category Marker         7232 non-null   object
 1   Category                7232 non-null   object
 2   First year marker       7232 non-null   object
 3   Level of study          7232 non-null   object
 4   Mode of study           7232 non-null   object
 5   Country of HE provider  7232 non-null   object
 6   Academic Year           7232 non-null   object
 7   Percentage              6824 non-null   object
dtypes: object(8)
memory usage: 452.1+ KB


There are 7,232 rows and 8 columns in the dataset. All columns have the object data type, including Percentage, which is stored as text such as '41%'. The Percentage column has 6,824 non-null values, so 408 rows are null.
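
A minimal sketch of how the Percentage column might be checked and converted is given below. The first three values come from the `head()` output above, and the missing value is a stand-in for the null rows. `toy_df`, `n_missing` and the 'Percentage (numeric)' column are hypothetical names.

```python
import pandas as pd

# Toy stand-in for enrol_science_df; None represents a null Percentage
toy_df = pd.DataFrame({
    "Category": ["Female", "Female", "Female", "Female"],
    "Percentage": ["41%", "42%", "40%", None],
})

# Count the missing values before converting
n_missing = toy_df["Percentage"].isna().sum()
print(n_missing)

# Strip the trailing '%' and convert to a number; nulls stay as NaN
toy_df["Percentage (numeric)"] = pd.to_numeric(
    toy_df["Percentage"].str.rstrip("%"), errors="coerce"
)
print(toy_df)
```

Using `errors="coerce"` turns any unexpected text into NaN, so it is worth comparing the null count before and after conversion.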

In [36]:
enrol_science_df.describe()

Unnamed: 0,Category Marker,Category,First year marker,Level of study,Mode of study,Country of HE provider,Academic Year,Percentage
count,7232,7232,7232,7232,7232,7232,7232,6824
unique,7,26,3,3,3,5,2,89
top,Religious Belief,Other,All,All,All,All,2019/20,44%
freq,2430,534,2418,2418,2420,1458,3616,426


Summary statistics for science enrolments

**Category Marker**

In [38]:
enrol_science_df["Category Marker"].describe()

count                 7232
unique                   7
top       Religious Belief
freq                  2430
Name: Category Marker, dtype: object

The data in this column has 7 unique values and the data type is object.

In [39]:
enrol_science_df["Category Marker"].unique()

array(['Sex', 'Age Group', 'Disability Status', 'Religious Belief',
       'Ethnicity', 'Total UK domiciled students', 'Total'], dtype=object)

As our project focuses on gender, we will only retain rows where the value is 'Sex'.

In [40]:
# remove rows where the value is not equal to sex.

**Category**

In [41]:
enrol_science_df["Category"].describe()

count      7232
unique       26
top       Other
freq        534
Name: Category, dtype: object

The data in this column has 26 unique values and the data type is object.

In [42]:
enrol_science_df["Category"].unique()

array(['Female', 'Male', 'Other', '20 and under', '21-24 years',
       '25-29 years', '30 years and over', 'Age unknown',
       'Known disability', 'No known disability', 'No religion',
       'Buddhist', 'Christian', 'Hindu', 'Jewish', 'Muslim', 'Sikh',
       'Spiritual', 'Any other religion or belief', 'White', 'Black',
       'Asian', 'Mixed', 'Not known', 'Total UK domiciled students',
       'Total'], dtype=object)

As we are investigating gender, we will only retain the values of female, male and other.

In [43]:
# remove rows where the value is not a gender value.

**First year marker**

In [44]:
enrol_science_df["First year marker"].describe()

count     7232
unique       3
top        All
freq      2418
Name: First year marker, dtype: object

The data in this column has 3 unique values and the data type is object.

In [45]:
enrol_science_df["First year marker"].unique()

array(['All', 'First year', 'Other years'], dtype=object)

First year marker indicates whether a student is in the first year of study for their degree or in a later year. For the higher education data in our project, we are looking at first year students and then graduates to get an idea of completion, i.e. who starts vs who finishes.

In the First year marker column, we only want to retain the value of 'First year'. We will want to drop rows with the 'All' and 'Other years' values.

In [48]:
# remove rows where the value is not equal to first year.

**Level of study**

In [46]:
enrol_science_df["Level of study"].describe()

count     7232
unique       3
top        All
freq      2418
Name: Level of study, dtype: object

The data in this column has 3 unique values and the data type is object.

In [47]:
enrol_science_df["Level of study"].unique()

array(['All', 'Postgraduate', 'Undergraduate'], dtype=object)

Looking at the values in this column, it would be interesting to see the number and proportion of females starting different levels of degree.

To get the breakdown, we will remove rows with a value of 'All'. 


In [49]:
# look to remove rows where the value is all.

**Mode of study**

In [50]:
enrol_science_df["Mode of study"].describe()

count     7232
unique       3
top        All
freq      2420
Name: Mode of study, dtype: object

The data in this column has 3 unique values and the data type is object.

In [51]:
enrol_science_df["Mode of study"].unique()

array(['All', 'Full-time', 'Part-time'], dtype=object)

This column records whether students are studying full-time or part-time. Here I think it will be useful to retain all values initially. We might want to use the 'All' rows to get the general trends, and then look at the full-time and part-time rows split by gender to see if women are more or less likely to study full-time than men.

**Country of HE provider**

In [52]:
enrol_science_df["Country of HE provider"].describe()

count     7232
unique       5
top        All
freq      1458
Name: Country of HE provider, dtype: object

The data in this column has 5 unique values and the data type is object.

In [53]:
enrol_science_df["Country of HE provider"].unique()

array(['All', 'England', 'Northern Ireland', 'Scotland', 'Wales'],
      dtype=object)

Data is available for the four UK nations. As the secondary level education data we have is only available for England, we should only retain rows with a value of 'England' for our project.

In [54]:
# look to remove rows where value is not equal to England.
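
The filters noted so far for this dataset could be combined into a single boolean mask, as sketched below. The toy rows are illustrative, and the choice of filters (marker 'Sex', 'First year' only, no 'All' level of study, England only) follows the notes in this section rather than a final decision. `toy_df`, `keep` and `science_subset` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for enrol_science_df (illustrative rows only)
toy_df = pd.DataFrame({
    "Category Marker": ["Sex", "Sex", "Age Group", "Sex"],
    "Category": ["Female", "Male", "20 and under", "Female"],
    "First year marker": ["First year", "First year", "First year", "All"],
    "Level of study": ["Undergraduate", "Undergraduate", "Undergraduate", "All"],
    "Mode of study": ["All", "All", "All", "All"],
    "Country of HE provider": ["England", "Scotland", "England", "England"],
    "Academic Year": ["2019/20"] * 4,
})

# Combine the individual conditions into one mask
keep = (
    (toy_df["Category Marker"] == "Sex")
    & (toy_df["First year marker"] == "First year")
    & (toy_df["Level of study"] != "All")
    & (toy_df["Country of HE provider"] == "England")
)
science_subset = toy_df[keep].reset_index(drop=True)
print(science_subset)
```

Mode of study is left unfiltered here because the notes above suggest retaining all of its values initially.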

**Academic Year**

In [55]:
enrol_science_df["Academic Year"].describe()

count        7232
unique          2
top       2019/20
freq         3616
Name: Academic Year, dtype: object

The data in this column has 2 unique values and the data type is object.

In [56]:
enrol_science_df["Academic Year"].unique()

array(['2019/20', '2020/21'], dtype=object)

There are only two academic years' worth of data for us to analyse due to the changes in subject coding introduced ahead of 2019/20. Whilst this is insufficient to undertake a time series analysis, it will be interesting to see which science subjects are being studied by women in each year. This is also something that can be investigated further as more data becomes available.