# Initial HESA Graduate Data Review

In [1]:
import pandas as pd

## 2017/8 Dataset

In [3]:
graduate_1718_df = pd.read_csv("Graduate Outcomes by activity and personal characteristics - 2017-18.csv")
graduate_1718_df.head()

Unnamed: 0,Personal characteristic category filter,Personal characteristic category,Activity,Country of provider,Domicile,Provider type,Level of qualification obtained,Mode of former study,Interim study marker,Academic year,Number,Percent,95% confidence interval
0,Sex,Female,Employment and further study,All,All,All,All,All,Exclude significant interim study,2017/18,19695,10%,(9%-10%)
1,Sex,Female,Full-time employment,All,All,All,All,All,Exclude significant interim study,2017/18,120700,59%,(59%-59%)
2,Sex,Female,Full-time further study,All,All,All,All,All,Exclude significant interim study,2017/18,14930,7%,(7%-8%)
3,Sex,Female,Non-respondents,All,All,All,All,All,Exclude significant interim study,2017/18,216065,,
4,Sex,Female,"Other including travel, caring for someone or ...",All,All,All,All,All,Exclude significant interim study,2017/18,11730,6%,(6%-6%)


In [33]:
# NaN values in row 3
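
One way to locate the missing values noted above is sketched below. The small `sample` frame is a stand-in built from three of the rows displayed by `head()`; in the notebook the same calls would be made on `graduate_1718_df`.

```python
import pandas as pd

# Stand-in for graduate_1718_df, using three of the rows shown by head() above.
sample = pd.DataFrame({
    "Activity": ["Full-time employment", "Full-time further study", "Non-respondents"],
    "Number": [120700, 14930, 216065],
    "Percent": ["59%", "7%", None],
    "95% confidence interval": ["(59%-59%)", "(7%-8%)", None],
})

# Count the missing values in each column.
missing_per_column = sample.isna().sum()

# List the activities of the rows where Percent is missing.
missing_activities = sample.loc[sample["Percent"].isna(), "Activity"].unique()

print(missing_per_column)
print(missing_activities)
```

In this sample the only row without a percentage is the 'Non-respondents' row. Whether that holds for every missing value in the full dataset would need to be checked by running the same lines on the real dataframe.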

In [4]:
graduate_1718_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427620 entries, 0 to 427619
Data columns (total 13 columns):
 #   Column                                   Non-Null Count   Dtype 
---  ------                                   --------------   ----- 
 0   Personal characteristic category filter  427620 non-null  object
 1   Personal characteristic category         427620 non-null  object
 2   Activity                                 427620 non-null  object
 3   Country of provider                      427620 non-null  object
 4   Domicile                                 427620 non-null  object
 5   Provider type                            427620 non-null  object
 6   Level of qualification obtained          427620 non-null  object
 7   Mode of former study                     427620 non-null  object
 8   Interim study marker                     427620 non-null  object
 9   Academic year                            427620 non-null  object
 10  Number                                   427

The 2017-18 graduate dataset contains 427620 rows and 13 columns. There is one column with a data type of integer and the remaining columns are objects. An index column has been created for the dataframe.

In [None]:
# note percentage has 227999 in the non-null column above.
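
Because Percent is stored as text such as `10%`, it would need converting before any numeric work. A minimal sketch, assuming every non-missing value is a whole number followed by `%` as in the rows shown by `head()`; the `sample` frame and the new column name `Percent (numeric)` are illustrative only.

```python
import pandas as pd

# Stand-in rows taken from the head() output above.
sample = pd.DataFrame({
    "Number": [19695, 120700, 216065],
    "Percent": ["10%", "59%", None],
})

# Remove the trailing "%" and convert to numbers; missing entries stay as NaN.
sample["Percent (numeric)"] = pd.to_numeric(
    sample["Percent"].str.rstrip("%"), errors="coerce"
)

print(sample)
```

With `errors="coerce"`, any value that does not follow the assumed pattern also becomes NaN rather than raising an error, so it is worth comparing the non-null counts before and after the conversion.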

In [5]:
graduate_1718_df.describe()

Unnamed: 0,Number
count,427620.0
mean,1897.995978
std,15487.864387
min,0.0
25%,0.0
50%,5.0
75%,85.0
max,769210.0


Statistical summary data for graduate numbers

**Personal characteristic category filter**

In [6]:
graduate_1718_df["Personal characteristic category filter"].describe()

count        427620
unique            5
top       Age Group
freq         130500
Name: Personal characteristic category filter, dtype: object

The data in this column has 5 unique values and the data type is object.

In [7]:
graduate_1718_df["Personal characteristic category filter"].unique()

array(['Sex', 'Ethnicity', 'Age Group', 'Disability', 'Total'],
      dtype=object)

Our project focuses on gender in relation to education outcomes. For that reason, we can remove rows in the dataset whose Personal characteristic category filter value is not 'Sex'.

In [8]:
# look at dropping rows with a value not equal to sex.
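
The row filter described above could look something like this. The `sample` frame and its Number values are made up for illustration; in the notebook the filter would be applied to `graduate_1718_df`.

```python
import pandas as pd

filter_col = "Personal characteristic category filter"

# Illustrative rows only; the Number values are not taken from the HESA data.
sample = pd.DataFrame({
    filter_col: ["Sex", "Sex", "Ethnicity", "Age Group", "Total"],
    "Personal characteristic category": ["Female", "Male", "White", "21-24 years", "Total"],
    "Number": [10, 20, 30, 40, 50],
})

# Keep only the rows where the filter is 'Sex', then renumber the index.
sex_only = sample[sample[filter_col] == "Sex"].reset_index(drop=True)

print(sex_only)
```

Selecting the rows to keep, rather than dropping the rows to remove, leaves the original dataframe untouched, which makes it easier to re-run the cell.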

**Personal characteristic category**

In [9]:
graduate_1718_df["Personal characteristic category"].describe()

count                  427620
unique                     17
top       No known disability
freq                    36660
Name: Personal characteristic category, dtype: object

The data in this column has 17 unique values and the data type is object.

In [10]:
graduate_1718_df["Personal characteristic category"].unique()

array(['Female', 'Male', 'Other', '20 and under', '21-24 years',
       '25-29 years', '30 years and over', 'Age unknown',
       'Known disability', 'No known disability', 'White', 'Black',
       'Asian', 'Mixed', 'Not known', 'Total UK domiciled', 'Total'],
      dtype=object)

Our project focuses on gender, so we only want to include values that relate to gender. In this dataset, these are 'Female', 'Male' and 'Other'. Whilst the 'Other' category should be included, we need to be aware when undertaking our analysis that the student and secondary level data only provide a male/female split.
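
A possible way to keep 'Other' while still having a male/female subset for comparison with the other data sources is sketched here. The frame and its numbers are illustrative, and the names `all_gender` and `binary_only` are hypothetical.

```python
import pandas as pd

category_col = "Personal characteristic category"

# Illustrative rows only; the Number values are made up.
sample = pd.DataFrame({
    category_col: ["Female", "Male", "Other", "White", "Total"],
    "Number": [10, 20, 5, 30, 65],
})

# All three gender values, for analysis within the graduate data.
all_gender = sample[sample[category_col].isin(["Female", "Male", "Other"])]

# Male/female only, for comparison with the student and secondary level data.
binary_only = all_gender[all_gender[category_col] != "Other"]

print(all_gender)
print(binary_only)
```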

**Activity**

In [11]:
graduate_1718_df["Activity"].describe()

count                           427620
unique                              15
top       Employment and further study
freq                             28508
Name: Activity, dtype: object

The data in this column has 15 unique values and the data type is object.

In [12]:
graduate_1718_df["Activity"].unique()

array(['Employment and further study', 'Full-time employment',
       'Full-time further study', 'Non-respondents',
       'Other including travel, caring for someone or retired',
       'Part-time employment', 'Part-time further study', 'Total',
       'Total with known outcomes', 'Unemployed',
       'Unemployed and due to start further study',
       'Unemployed and due to start work',
       'Unknown pattern of employment',
       'Unknown pattern of further study', 'Voluntary or unpaid work'],
      dtype=object)

The values in this column relate to what the graduate goes on to do after completing their degree. For our analysis, all values here should be included so that we have a complete analysis of the higher education outcome.

**Country of provider**

In [13]:
graduate_1718_df["Country of provider"].describe()

count     427620
unique         5
top          All
freq      107760
Name: Country of provider, dtype: object

The data in this column has 5 unique values and the data type is object.

In [14]:
graduate_1718_df["Country of provider"].unique()

array(['All', 'England', 'Northern Ireland', 'Scotland', 'Wales'],
      dtype=object)

Data is available for graduates from providers in all four UK nations. However, as the secondary level data we have is only for England, we will retain only 'England' in the Country of provider column so that we have a consistent approach.

In [15]:
# look at dropping rows with a value not equal to England
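
The England filter could be written with `DataFrame.query`, as in this sketch. The numbers are illustrative, not HESA figures.

```python
import pandas as pd

# Illustrative rows only; the Number values are made up.
sample = pd.DataFrame({
    "Country of provider": ["All", "England", "Northern Ireland", "Scotland", "Wales"],
    "Number": [100, 70, 5, 15, 10],
})

# Backticks let query() refer to a column name that contains spaces.
england_only = sample.query("`Country of provider` == 'England'")

print(england_only)
```

A boolean mask (`sample[sample["Country of provider"] == "England"]`) gives the same result; `query` is used here only because it reads well when several conditions are chained.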

**Domicile**

In [16]:
graduate_1718_df["Domicile"].describe()

count     427620
unique        11
top       All UK
freq       57360
Name: Domicile, dtype: object

The data in this column has 11 unique values and the data type is object.

In [17]:
graduate_1718_df["Domicile"].unique()

array(['All', 'All UK', 'All non-UK', 'England', 'Non-European Union',
       'Northern Ireland', 'Not known', 'Other European Union',
       'Other UK', 'Scotland', 'Wales'], dtype=object)

Domicile is where graduates come from. Whilst this is an interesting metric, our project is focusing on gender in relation to education outcomes. For that reason, we will exclude the Domicile column from our analysis. A useful follow-on from our project would be to see whether domicile also has an impact.

In [20]:
# look at removing/dropping domicile column.
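
One point to watch when removing Domicile: dropping the column on its own would leave one row per domicile value, so the same graduates would be counted several times. The sketch below assumes we keep the aggregate 'All' rows first and then drop the column; that choice is an assumption rather than something decided in this review, and the numbers are illustrative.

```python
import pandas as pd

# Illustrative rows only; the Number values are made up.
sample = pd.DataFrame({
    "Personal characteristic category": ["Female", "Female", "Female"],
    "Domicile": ["All", "All UK", "All non-UK"],
    "Number": [100, 80, 20],
})

# Assumption: keep the aggregate 'All' rows so each group is counted once.
all_domiciles = sample[sample["Domicile"] == "All"]

# The column now holds a single value, so it can be dropped safely.
without_domicile = all_domiciles.drop(columns=["Domicile"])

print(without_domicile)
```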

**Provider type**

In [18]:
graduate_1718_df["Provider type"].describe()

count     427620
unique         3
top          All
freq      173850
Name: Provider type, dtype: object

The data in this column has 3 unique values and the data type is object.

In [19]:
graduate_1718_df["Provider type"].unique()

array(['All', 'Further education colleges (FECs)',
       'Higher education providers (HEPs)'], dtype=object)

The provider type has an 'All' category which we can exclude from our analysis. We can then look at whether outcomes differ between further education colleges and higher education providers.

In [21]:
# look at removing rows with a value of 'All'
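
Removing the 'All' rows and comparing the two provider types might look like this. It is a sketch on made-up numbers, and the grouped sum is just one simple way to put the two types side by side.

```python
import pandas as pd

# Illustrative rows only; the Number values are made up.
sample = pd.DataFrame({
    "Provider type": [
        "All",
        "Further education colleges (FECs)",
        "Higher education providers (HEPs)",
        "Higher education providers (HEPs)",
    ],
    "Activity": ["Full-time employment", "Full-time employment",
                 "Full-time employment", "Unemployed"],
    "Number": [100, 10, 90, 8],
})

# Remove the aggregate rows.
by_provider = sample[sample["Provider type"] != "All"]

# Total graduates per provider type across the remaining activities.
totals = by_provider.groupby("Provider type")["Number"].sum()

print(totals)
```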

**Level of qualification obtained**

In [22]:
graduate_1718_df["Level of qualification obtained"].describe()

count     427620
unique         3
top          All
freq      153570
Name: Level of qualification obtained, dtype: object

The data in this column has 3 unique values and the data type is object.

In [23]:
graduate_1718_df["Level of qualification obtained"].unique()

array(['All', 'Postgraduate', 'Undergraduate'], dtype=object)

Looking at the values in this column, it would be interesting to see the number and proportion of females completing different levels of degree.

We will remove rows with a value of 'All' and look at the undergraduate/postgraduate split to see if there is a difference in the level of study females complete.
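
The undergraduate/postgraduate comparison could be sketched as a pivot table giving the female share at each level. The numbers below are invented purely to show the calculation.

```python
import pandas as pd

level_col = "Level of qualification obtained"
sex_col = "Personal characteristic category"

# Illustrative rows only; the Number values are made up.
sample = pd.DataFrame({
    level_col: ["All", "All", "Undergraduate", "Undergraduate",
                "Postgraduate", "Postgraduate"],
    sex_col: ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Number": [90, 70, 60, 40, 30, 30],
})

# Remove the aggregate rows, then arrange levels as rows and sexes as columns.
levels = sample[sample[level_col] != "All"]
counts = levels.pivot_table(index=level_col, columns=sex_col,
                            values="Number", aggfunc="sum")

# Proportion of graduates at each level who are female.
female_share = counts["Female"] / counts.sum(axis=1)

print(counts)
print(female_share)
```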

**Mode of former study**

In [24]:
graduate_1718_df["Mode of former study"].describe()

count     427620
unique         3
top          All
freq      149280
Name: Mode of former study, dtype: object

The data in this column has 3 unique values and the data type is object.

In [25]:
graduate_1718_df["Mode of former study"].unique()

array(['All', 'Full-time', 'Part-time'], dtype=object)

This column records whether graduates studied full-time or part-time. It will be useful to retain all values initially. We can use 'All' to get the general trends, then look at the full-time and part-time split by gender to see whether women are more or less likely than men to complete after studying full-time.

**Interim study marker**

In [27]:
graduate_1718_df["Interim study marker"].describe()

count                                427620
unique                                    2
top       Include significant interim study
freq                                 213840
Name: Interim study marker, dtype: object

The data in this column has 2 unique values and the data type is object.

In [28]:
graduate_1718_df["Interim study marker"].unique()

array(['Exclude significant interim study',
       'Include significant interim study'], dtype=object)

The interim study marker is not directly relevant to our analysis and the column can be removed from the dataframe.

In [29]:
# look at removing/dropping interim study marker
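
As with Domicile, every combination appears once per marker value, so dropping the column alone would leave two rows for each group. The sketch below shows that effect and one possible fix. Keeping the 'Exclude significant interim study' rows is an assumption made for illustration; which marker suits the project still needs deciding. The numbers are made up.

```python
import pandas as pd

marker_col = "Interim study marker"

# Illustrative rows only; the Number values are made up.
sample = pd.DataFrame({
    "Activity": ["Unemployed", "Unemployed",
                 "Full-time employment", "Full-time employment"],
    marker_col: ["Exclude significant interim study",
                 "Include significant interim study"] * 2,
    "Number": [10, 12, 100, 95],
})

# Dropping the column alone leaves two rows per activity.
dropped_only = sample.drop(columns=[marker_col])
rows_per_activity = dropped_only.groupby("Activity").size()

# Assumption: keep one marker value first, then drop the column.
one_marker = sample[sample[marker_col] == "Exclude significant interim study"]
cleaned = one_marker.drop(columns=[marker_col])

print(rows_per_activity)
print(cleaned)
```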

**Academic year**

In [30]:
graduate_1718_df["Academic year"].describe()

count      427620
unique          1
top       2017/18
freq       427620
Name: Academic year, dtype: object

The data in this column has 1 unique value and the data type is object.

In [31]:
graduate_1718_df["Academic year"].unique()

array(['2017/18'], dtype=object)

This dataset is for academic year 2017/8. We also have datasets for 2018/9 and 2019/20, which we will look to combine to create a time series analysis.

## 2018/9 Dataset

In [32]:
graduate_1819_df = pd.read_csv("Graduate Outcomes by activity and personal characteristics- 2018-19.csv")
graduate_1819_df.head()

Unnamed: 0,Personal characteristic category filter,Personal characteristic category,Activity,Country of provider,Domicile,Provider type,Level of qualification obtained,Mode of former study,Interim study marker,Academic year,Number,Percent,95% confidence interval
0,Sex,Female,Employment and further study,All,All,All,All,All,Exclude significant interim study,2018/19,23305,11%,(10%-11%)
1,Sex,Female,Full-time employment,All,All,All,All,All,Exclude significant interim study,2018/19,125470,57%,(56%-57%)
2,Sex,Female,Full-time further study,All,All,All,All,All,Exclude significant interim study,2018/19,15680,7%,(7%-7%)
3,Sex,Female,Non-respondents,All,All,All,All,All,Exclude significant interim study,2018/19,220640,,
4,Sex,Female,"Other including travel, caring for someone or ...",All,All,All,All,All,Exclude significant interim study,2018/19,12215,6%,(5%-6%)


In [34]:
graduate_1819_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427620 entries, 0 to 427619
Data columns (total 13 columns):
 #   Column                                   Non-Null Count   Dtype 
---  ------                                   --------------   ----- 
 0   Personal characteristic category filter  427620 non-null  object
 1   Personal characteristic category         427620 non-null  object
 2   Activity                                 427620 non-null  object
 3   Country of provider                      427620 non-null  object
 4   Domicile                                 427620 non-null  object
 5   Provider type                            427620 non-null  object
 6   Level of qualification obtained          427620 non-null  object
 7   Mode of former study                     427620 non-null  object
 8   Interim study marker                     427620 non-null  object
 9   Academic year                            427620 non-null  object
 10  Number                                   427

The 2018-19 graduate dataset contains 427620 rows and 13 columns. There is one column with a data type of integer and the remaining columns are objects. An index column has been created for the dataframe.

In [35]:
graduate_1819_df.describe()

Unnamed: 0,Number
count,427620.0
mean,1969.22244
std,15805.372415
min,0.0
25%,0.0
50%,5.0
75%,95.0
max,793445.0


Statistical summary data for graduate numbers

**Personal characteristic category filter**

In [36]:
graduate_1819_df["Personal characteristic category filter"].describe()

count        427620
unique            5
top       Age Group
freq         130755
Name: Personal characteristic category filter, dtype: object

The data in this column has 5 unique values and the data type is object.

In [37]:
graduate_1819_df["Personal characteristic category filter"].unique()

array(['Sex', 'Ethnicity', 'Age Group', 'Disability', 'Total'],
      dtype=object)

Our project focuses on gender in relation to education outcomes. For that reason, we can remove rows in the dataset whose Personal characteristic category filter value is not 'Sex'. The values in this column match those in the 2017/8 dataset.

In [53]:
# look to remove rows with a value not equal to sex.

**Personal characteristic category**

In [38]:
graduate_1819_df["Personal characteristic category"].describe()

count                  427620
unique                     17
top       No known disability
freq                    35160
Name: Personal characteristic category, dtype: object

The data in this column has 17 unique values and the data type is object.

In [39]:
graduate_1819_df["Personal characteristic category"].unique()

array(['Female', 'Male', 'Other', '20 and under', '21-24 years',
       '25-29 years', '30 years and over', 'Age unknown',
       'Known disability', 'No known disability', 'White', 'Black',
       'Asian', 'Mixed', 'Not known', 'Total UK domiciled', 'Total'],
      dtype=object)

Our project focuses on gender, so we only want to include values that relate to gender. In this dataset, these are 'Female', 'Male' and 'Other'. Whilst the 'Other' category should be included, we need to be aware when undertaking our analysis that the student and secondary level data only provide a male/female split. The values will then match the 2017/8 dataset.

In [None]:
# look to remove rows where value is not equal to a gender value

**Activity**

In [40]:
graduate_1819_df["Activity"].describe()

count                           427620
unique                              15
top       Employment and further study
freq                             28508
Name: Activity, dtype: object

The data in this column has 15 unique values and the data type is object.

In [41]:
graduate_1819_df["Activity"].unique()

array(['Employment and further study', 'Full-time employment',
       'Full-time further study', 'Non-respondents',
       'Other including travel, caring for someone or retired',
       'Part-time employment', 'Part-time further study', 'Total',
       'Total with known outcomes', 'Unemployed',
       'Unemployed and due to start further study',
       'Unemployed and due to start work',
       'Unknown pattern of employment',
       'Unknown pattern of further study', 'Voluntary or unpaid work'],
      dtype=object)

The values in this column relate to what the graduate goes on to do after completing their degree. For our analysis, all values here should be included so that we have a complete analysis of the higher education outcome. The values match the 2017/8 dataset.

**Country of provider**

In [42]:
graduate_1819_df["Country of provider"].describe()

count     427620
unique         5
top          All
freq      108450
Name: Country of provider, dtype: object

The data in this column has 5 unique values and the data type is object.

In [43]:
graduate_1819_df["Country of provider"].unique()

array(['All', 'England', 'Northern Ireland', 'Scotland', 'Wales'],
      dtype=object)

Data is available for graduates from providers in all four UK nations. However, as the secondary level data we have is only for England, we will retain only 'England' in the Country of provider column so that we have a consistent approach. Our values match the 2017/8 dataset.

In [None]:
# look to remove rows with a value not equal to England

**Domicile**

In [44]:
graduate_1819_df["Domicile"].describe()

count     427620
unique        11
top       All UK
freq       58440
Name: Domicile, dtype: object

The data in this column has 11 unique values and the data type is object.

In [45]:
graduate_1819_df["Domicile"].unique()

array(['All', 'All UK', 'All non-UK', 'England', 'Non-European Union',
       'Northern Ireland', 'Not known', 'Other European Union',
       'Other UK', 'Scotland', 'Wales'], dtype=object)

Domicile is where graduates come from. Whilst this is an interesting metric, our project is focusing on gender in relation to education outcomes. For that reason, we will exclude the Domicile column from our analysis. A useful follow-on from our project would be to see whether domicile also has an impact. By removing the column, our data will match that for 2017/8.

In [52]:
# look to remove/drop domicile column

**Provider type**

In [46]:
graduate_1819_df["Provider type"].describe()

count     427620
unique         3
top          All
freq      175620
Name: Provider type, dtype: object

The data in this column has 3 unique values and the data type is object.

In [48]:
graduate_1819_df["Provider type"].unique()

array(['All', 'Further education colleges (FECs)',
       'Higher education providers (HEPs)'], dtype=object)

The provider type has an 'All' category which we can exclude from our analysis. We can then look at whether outcomes differ between further education colleges and higher education providers. Our data will then align with the 2017/8 dataset.

In [51]:
# look to remove rows with a value of 'All'

**Level of qualification obtained**

In [49]:
graduate_1819_df["Level of qualification obtained"].describe()

count     427620
unique         3
top          All
freq      154800
Name: Level of qualification obtained, dtype: object

The data in this column has 3 unique values and the data type is object.

In [50]:
graduate_1819_df["Level of qualification obtained"].unique()

array(['All', 'Postgraduate', 'Undergraduate'], dtype=object)

The column has a value for 'All' which will combine undergraduate and postgraduate qualifications. It would be interesting to see if there is a difference in outcomes for females for UG versus PG. When we come to analyse our data we can consider whether to retain the 'All' value.

**Mode of former study**

In [54]:
graduate_1819_df["Mode of former study"].describe()

count     427620
unique         3
top          All
freq      148830
Name: Mode of former study, dtype: object

The data in this column has 3 unique values and the data type is object.

In [55]:
graduate_1819_df["Mode of former study"].unique()

array(['All', 'Full-time', 'Part-time'], dtype=object)

This column records whether graduates studied full-time or part-time. It will be useful to retain all values initially. We can use 'All' to get the general trends, then look at the full-time and part-time split by gender to see whether women are more or less likely than men to complete after studying full-time. This will then match our 2017/8 dataset.

**Interim study marker**

In [56]:
graduate_1819_df["Interim study marker"].describe()

count                                427620
unique                                    2
top       Include significant interim study
freq                                 214305
Name: Interim study marker, dtype: object

The data in this column has 2 unique values and the data type is object.

In [57]:
graduate_1819_df["Interim study marker"].unique()

array(['Exclude significant interim study',
       'Include significant interim study'], dtype=object)

The interim study marker is not directly relevant to our analysis and the column can be removed from the dataframe. This will then match the 2017/8 dataset.

**Academic year**

In [58]:
graduate_1819_df["Academic year"].describe()

count      427620
unique          1
top       2018/19
freq       427620
Name: Academic year, dtype: object

The data in this column has 1 unique value and the data type is object.

In [59]:
graduate_1819_df["Academic year"].unique()

array(['2018/19'], dtype=object)

This dataset is for academic year 2018/9. We also have datasets for 2017/8 and 2019/20, which we will look to combine to create a time series analysis.
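
Since the 2018/9 data calls for the same deletions as 2017/8, the row filters could be wrapped in one function and applied to each year. This is a sketch only: `apply_noted_filters` is a hypothetical helper, it covers just the 'Sex' and 'England' filters, and the sample rows are invented.

```python
import pandas as pd


def apply_noted_filters(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the 'Sex' rows for providers in England."""
    keep = (
        (df["Personal characteristic category filter"] == "Sex")
        & (df["Country of provider"] == "England")
    )
    return df.loc[keep].reset_index(drop=True)


# Illustrative rows only; the Number values are made up.
sample_1819 = pd.DataFrame({
    "Personal characteristic category filter": ["Sex", "Sex", "Ethnicity"],
    "Personal characteristic category": ["Female", "Female", "White"],
    "Country of provider": ["England", "Wales", "England"],
    "Number": [10, 2, 30],
})

filtered_1819 = apply_noted_filters(sample_1819)
print(filtered_1819)
```

Using one function for every year keeps the three datasets aligned, because any change to the filters is made in a single place.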

## 2019/20 Dataset

In [60]:
graduate_1920_df = pd.read_csv("Graduate Outcomes by activity and personal characteristics- 2019-20.csv")
graduate_1920_df.head()

Unnamed: 0,Personal characteristic category filter,Personal characteristic category,Activity,Country of provider,Domicile,Provider type,Level of qualification obtained,Mode of former study,Interim study marker,Academic year,Number,Percent,95% confidence interval
0,Sex,Female,Employment and further study,All,All,All,All,All,Exclude significant interim study,2019/20,22545,10%,(10%-11%)
1,Sex,Female,Full-time employment,All,All,All,All,All,Exclude significant interim study,2019/20,123800,57%,(57%-58%)
2,Sex,Female,Full-time further study,All,All,All,All,All,Exclude significant interim study,2019/20,16615,8%,(7%-8%)
3,Sex,Female,Non-respondents,All,All,All,All,All,Exclude significant interim study,2019/20,214300,,
4,Sex,Female,"Other including travel, caring for someone or ...",All,All,All,All,All,Exclude significant interim study,2019/20,10905,5%,(5%-5%)


In [61]:
graduate_1920_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421290 entries, 0 to 421289
Data columns (total 13 columns):
 #   Column                                   Non-Null Count   Dtype 
---  ------                                   --------------   ----- 
 0   Personal characteristic category filter  421290 non-null  object
 1   Personal characteristic category         421290 non-null  object
 2   Activity                                 421290 non-null  object
 3   Country of provider                      421290 non-null  object
 4   Domicile                                 421290 non-null  object
 5   Provider type                            421290 non-null  object
 6   Level of qualification obtained          421290 non-null  object
 7   Mode of former study                     421290 non-null  object
 8   Interim study marker                     421290 non-null  object
 9   Academic year                            421290 non-null  object
 10  Number                                   421

The dataset contains 421290 rows and 13 columns. There is one column with integer data type and the remaining columns are objects.

In [62]:
graduate_1920_df.describe()

Unnamed: 0,Number
count,421290.0
mean,1944.942261
std,15519.048911
min,0.0
25%,0.0
50%,5.0
75%,95.0
max,774715.0


Summary statistics for the 2019/20 graduate numbers

**Personal characteristic category filter**

In [63]:
graduate_1920_df["Personal characteristic category filter"].describe()

count        421290
unique            5
top       Age Group
freq         127590
Name: Personal characteristic category filter, dtype: object

The data in this column has 5 unique values and the data type is object.

In [64]:
graduate_1920_df["Personal characteristic category filter"].unique()

array(['Sex', 'Ethnicity', 'Age Group', 'Disability', 'Total'],
      dtype=object)

Our project focuses on gender in relation to education outcomes. For that reason, we can remove rows in the dataset whose Personal characteristic category filter value is not 'Sex'. The values in this column match those in the 2017/8 and 2018/9 datasets.

In [None]:
# remove rows with values not equal to sex

**Personal characteristic category**

In [65]:
graduate_1920_df["Personal characteristic category"].describe()

count                  421290
unique                     16
top       No known disability
freq                    36000
Name: Personal characteristic category, dtype: object

The data in this column has 16 unique values and the data type is object.

In [66]:
graduate_1920_df["Personal characteristic category"].unique()

array(['Female', 'Male', 'Other', '20 and under', '21-24 years',
       '25-29 years', '30 years and over', 'Known disability',
       'No known disability', 'White', 'Black', 'Asian', 'Mixed',
       'Not known', 'Total UK domiciled', 'Total'], dtype=object)

Our project focuses on gender, so we only want to include values that relate to gender. In this dataset, these are 'Female', 'Male' and 'Other'. Whilst the 'Other' category should be included, we need to be aware when undertaking our analysis that the student and secondary level data only provide a male/female split. The values will then match the 2017/8 and 2018/9 datasets.

In [67]:
# remove rows with values not equal to gender values.
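
The 2019/20 column has 16 unique values against 17 in the earlier years. A set difference shows which value is missing. The sets below are copied from the `unique()` outputs printed in this notebook; on the real dataframes they could be built with `set(df["Personal characteristic category"].unique())`.

```python
# Values printed by unique() for 2017/8 (the 2018/9 output is identical).
categories_1718 = {
    "Female", "Male", "Other", "20 and under", "21-24 years",
    "25-29 years", "30 years and over", "Age unknown",
    "Known disability", "No known disability", "White", "Black",
    "Asian", "Mixed", "Not known", "Total UK domiciled", "Total",
}

# Values printed by unique() for 2019/20.
categories_1920 = {
    "Female", "Male", "Other", "20 and under", "21-24 years",
    "25-29 years", "30 years and over", "Known disability",
    "No known disability", "White", "Black", "Asian", "Mixed",
    "Not known", "Total UK domiciled", "Total",
}

# Values present in one year but not the other.
only_in_1718 = categories_1718 - categories_1920
only_in_1920 = categories_1920 - categories_1718

print(only_in_1718)  # {'Age unknown'}
print(only_in_1920)  # set()
```

The gender values are the same in both sets, so the difference does not affect the rows we plan to keep.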

**Activity**

In [68]:
graduate_1920_df["Activity"].describe()

count                           421290
unique                              15
top       Employment and further study
freq                             28086
Name: Activity, dtype: object

The data in this column has 15 unique values and the data type is object.

In [69]:
graduate_1920_df["Activity"].unique()

array(['Employment and further study', 'Full-time employment',
       'Full-time further study', 'Non-respondents',
       'Other including travel, caring for someone or retired',
       'Part-time employment', 'Part-time further study', 'Total',
       'Total with known outcomes', 'Unemployed',
       'Unemployed and due to start further study',
       'Unemployed and due to start work',
       'Unknown pattern of employment',
       'Unknown pattern of further study', 'Voluntary or unpaid work'],
      dtype=object)

The values in this column relate to what the graduate goes on to do after completing their degree. For our analysis, all values here should be included so that we have a complete analysis of the higher education outcome. The values match the 2017/8 and 2018/9 datasets.

**Country of provider**

In [70]:
graduate_1920_df["Country of provider"].describe()

count     421290
unique         5
top          All
freq      104400
Name: Country of provider, dtype: object

The data in this column has 5 unique values and the data type is object.

In [71]:
graduate_1920_df["Country of provider"].unique()

array(['All', 'England', 'Northern Ireland', 'Scotland', 'Wales'],
      dtype=object)

Data is available for graduates from providers in all four UK nations. However, as the secondary level data we have is only for England, we will retain only 'England' in the Country of provider column so that we have a consistent approach. Our values match the 2017/8 and 2018/9 datasets.

In [None]:
# look to remove rows where the value is not equal to England


**Domicile**

In [72]:
graduate_1920_df["Domicile"].describe()

count     421290
unique        11
top       All UK
freq       56220
Name: Domicile, dtype: object

The data in this column has 11 unique values and the data type is object.

In [73]:
graduate_1920_df["Domicile"].unique()

array(['All', 'All UK', 'All non-UK', 'England', 'Non-European Union',
       'Northern Ireland', 'Not known', 'Other European Union',
       'Other UK', 'Scotland', 'Wales'], dtype=object)

Domicile is where graduates come from. Whilst this is an interesting metric, our project is focusing on gender in relation to education outcomes. For that reason, we will exclude the Domicile column from our analysis. A useful follow-on from our project would be to see whether domicile also has an impact. By removing the column, our data will match that for 2017/8 and 2018/9.

In [74]:
# look to remove/drop domicile column

**Provider type**

In [75]:
graduate_1920_df["Provider type"].describe()

count     421290
unique         3
top          All
freq      171810
Name: Provider type, dtype: object

The data in this column has 3 unique values and the data type is object.

In [76]:
graduate_1920_df["Provider type"].unique()

array(['All', 'Further education colleges (FECs)',
       'Higher education providers (HEPs)'], dtype=object)

The provider type has an 'All' category which we can exclude from our analysis. We can then look at whether outcomes differ between further education colleges and higher education providers. Our data will then align with the 2017/8 and 2018/9 datasets.

In [77]:
# look to remove rows with a value of 'All'

**Level of qualification obtained**

In [78]:
graduate_1920_df["Level of qualification obtained"].describe()

count     421290
unique         3
top          All
freq      150840
Name: Level of qualification obtained, dtype: object

The data in this column has 3 unique values and the data type is object.

In [79]:
graduate_1920_df["Level of qualification obtained"].unique()

array(['All', 'Postgraduate', 'Undergraduate'], dtype=object)

The column has a value for 'All' which will combine undergraduate and postgraduate qualifications. It would be interesting to see if there is a difference in outcomes for females for UG versus PG. When we come to analyse our data we can consider whether to retain the 'All' value. The dataset matches the 2017/8 and 2018/9 datasets.

**Mode of former study**

In [80]:
graduate_1920_df["Mode of former study"].describe()

count     421290
unique         3
top          All
freq      146820
Name: Mode of former study, dtype: object

The data in this column has 3 unique values and the data type is object.

In [81]:
graduate_1920_df["Mode of former study"].unique()

array(['All', 'Full-time', 'Part-time'], dtype=object)

This column records whether graduates studied full-time or part-time. It will be useful to retain all values initially. We can use 'All' to get the general trends, then look at the full-time and part-time split by gender to see whether women are more or less likely than men to complete after studying full-time.

**Interim study marker**

In [82]:
graduate_1920_df["Interim study marker"].describe()

count                                421290
unique                                    2
top       Include significant interim study
freq                                 211065
Name: Interim study marker, dtype: object

The data in this column has 2 unique values and the data type is object.

In [83]:
graduate_1920_df["Interim study marker"].unique()

array(['Exclude significant interim study',
       'Include significant interim study'], dtype=object)

The interim study marker is not directly relevant to our analysis and the column can be removed from the dataframe. This will then match the 2017/8 and 2018/9 datasets.

**Academic year**

In [84]:
graduate_1920_df["Academic year"].describe()

count      421290
unique          1
top       2019/20
freq       421290
Name: Academic year, dtype: object

The data in this column has 1 unique value and the data type is object.

In [85]:
graduate_1920_df["Academic year"].unique()

array(['2019/20'], dtype=object)

This dataset is for academic year 2019/20. We also have datasets for 2017/8 and 2018/9, which we will look to combine to create a time series analysis.
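
For a quick look at the time series in pandas, the three years could be stacked with `pd.concat`. The tiny frames and their numbers below are illustrative stand-ins for the three dataframes loaded above.

```python
import pandas as pd

sex_col = "Personal characteristic category"

# Illustrative stand-ins for the three yearly dataframes; numbers are made up.
frames = [
    pd.DataFrame({"Academic year": ["2017/18"] * 2,
                  sex_col: ["Female", "Male"], "Number": [10, 20]}),
    pd.DataFrame({"Academic year": ["2018/19"] * 2,
                  sex_col: ["Female", "Male"], "Number": [11, 21]}),
    pd.DataFrame({"Academic year": ["2019/20"] * 2,
                  sex_col: ["Female", "Male"], "Number": [12, 22]}),
]

# Stack the years into one long dataframe with a fresh index.
combined = pd.concat(frames, ignore_index=True)

# One row per academic year, one column per sex.
yearly = combined.groupby(["Academic year", sex_col])["Number"].sum().unstack()

print(yearly)
```

The Academic year column already identifies each source, so no extra key is needed when stacking.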

**Summary**

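The datasets are consistent with each other in terms of the columns recorded, and the values match except that 'Age unknown' does not appear in the Personal characteristic category column for 2019/20. As we only retain the rows relating to sex, this does not affect our analysis, and the datasets are appropriate to combine into one large dataset. We will look to combine the datasets in MySQL before performing the noted deletions.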
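
Before loading into MySQL, the column match could be confirmed in pandas with a short check along these lines. `same_columns` is a hypothetical helper and the small frames are illustrative; on the real data it would be called with the three yearly dataframes.

```python
import pandas as pd


def same_columns(frames):
    """Hypothetical helper: True if every frame has the same columns in the same order."""
    first = list(frames[0].columns)
    return all(list(frame.columns) == first for frame in frames[1:])


# Illustrative stand-ins for the yearly dataframes; the values are made up.
df_a = pd.DataFrame({"Activity": ["Unemployed"], "Academic year": ["2017/18"], "Number": [10]})
df_b = pd.DataFrame({"Activity": ["Unemployed"], "Academic year": ["2018/19"], "Number": [11]})
df_c = df_b.rename(columns={"Number": "Count"})

print(same_columns([df_a, df_b]))  # True
print(same_columns([df_a, df_c]))  # False
```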