What are the regional patterns of health graduate employment by gender in Ireland, and which HSE regions show the greatest gender disparities? 

Focus: Compare male vs. female distribution across 6 HSE regions for nursing, medicine, and social care graduates. 

In [40]:
# %pip install plotly
# %pip install --upgrade nbformat

# Imports
import pandas as pd
import numpy as np


In [7]:
# Load in the CSV for graduate regions by gender 
file_path = 'csv_version/graduate_regions_by_gender.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),Graduation Year,C04477V05260,Field of Study,C03919V04671,Gender,C04300V05079,HSE Health Regions,UNIT,VALUE
0,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,10,Male,-,All HSE Regions,Number,54
1,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,10,Male,20,HSE Dublin and Midlands,Number,9
2,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,10,Male,10,HSE Dublin and North East,Number,12
3,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,10,Male,30,HSE Dublin and South East,Number,9
4,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,10,Male,50,HSE Midwest,Number,3


In [12]:
# Initial inspection of the data
print("shape of dataset", df.shape)
print("column names:")
print(df.columns.tolist())

shape of dataset (756, 12)
column names:
['STATISTIC', 'Statistic Label', 'TLIST(A1)', 'Graduation Year', 'C04477V05260', 'Field of Study', 'C03919V04671', 'Gender', 'C04300V05079', 'HSE Health Regions', 'UNIT', 'VALUE']


In [13]:
# are there any missing values?
print(df.isna().sum())

STATISTIC             0
Statistic Label       0
TLIST(A1)             0
Graduation Year       0
C04477V05260          0
Field of Study        0
C03919V04671          0
Gender                0
C04300V05079          0
HSE Health Regions    0
UNIT                  0
VALUE                 0
dtype: int64


No values are missing according to isna(), but the head shows a dash (-) in one of the code columns (C04300V05079). This could be a placeholder for missing data, so I need the data dictionary for this dataset.

https://db.nomics.world/CSO/HEO04/HEO04.-.-.-.IEZ999?tab=dimensions
states that C03919V04671 encodes gender: 10 for male, 20 for female, and - for both genders.
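
The code-to-label relationship described above can be checked directly rather than taken on trust. The following is a minimal sketch, assuming the codes are read as strings (because of the `-` entry) and that the `-` code corresponds to the `All genders` label seen in the data. The `GENDER_CODES` dictionary and the small `sample` frame are illustrative stand-ins, not part of the original notebook.

```python
import pandas as pd

# Assumed mapping, based on the linked dimensions page: 10 = male, 20 = female, - = both.
GENDER_CODES = {"10": "Male", "20": "Female", "-": "All genders"}

# Small stand-in for the two relevant columns of the full dataframe.
sample = pd.DataFrame({
    "C03919V04671": ["10", "20", "-"],
    "Gender": ["Male", "Female", "All genders"],
})

# Translate each code and compare it with the label already present in the row.
expected_labels = sample["C03919V04671"].map(GENDER_CODES)
codes_match = bool((expected_labels == sample["Gender"]).all())
print("Gender codes agree with labels:", codes_match)
```

If the same check passes on the full `df`, the dash is an aggregate marker rather than missing data, and the code column carries no information beyond the `Gender` label.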

In [14]:
print("Unique values per column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} → {df[col].unique()[:10]}")

Unique values per column:
STATISTIC: 1 → ['HGO22C01']
Statistic Label: 1 → ['Number of Health Graduates']
TLIST(A1): 12 → [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019]
Graduation Year: 12 → [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019]
C04477V05260: 3 → [ 30 202 702]
Field of Study: 3 → ['Nursing and Midwifery' 'Medicine' 'Social Care']
C03919V04671: 3 → ['10' '20' '-']
Gender: 3 → ['Male' 'Female' 'All genders']
C04300V05079: 7 → ['-' '20' '10' '30' '50' '60' '40']
HSE Health Regions: 7 → ['All HSE Regions' 'HSE Dublin and Midlands' 'HSE Dublin and North East'
 'HSE Dublin and South East' 'HSE Midwest' 'HSE West and North West'
 'HSE South West']
UNIT: 1 → ['Number']
VALUE: 132 → [ 54   9  12   3 786 153 150 120  78 132]


C04300V05079 appears to be the code for the HSE region, so this column can likely be dropped.

Columns to drop because their values are not needed:

| Column             | Keep or drop |
|--------------------|--------------|
| STATISTIC          | drop         |
| Statistic Label    | drop         |
| TLIST(A1)          | drop         |
| Graduation Year    | keep         |
| C04477V05260       | drop         |
| Field of Study     | keep         |
| C03919V04671       | drop         |
| Gender             | keep         |
| C04300V05079       | drop         |
| HSE Health Regions | keep         |
| UNIT               | drop         |
| VALUE              | keep         |


TLIST(A1) -> duplicates Graduation Year  
C04477V05260 -> code for Field of Study  
C03919V04671 -> code for Gender  
C04300V05079 -> code for HSE Health Regions  
UNIT -> has a single value, so it is not needed  
STATISTIC and Statistic Label -> also a single value each, identifying the dataset  
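
Before dropping these columns, the redundancy claims above can be verified. The following is a minimal sketch using a few illustrative rows shaped like the dataset. The helper `is_one_to_one` is a hypothetical name introduced here, not something from the original notebook.

```python
import pandas as pd

def is_one_to_one(frame, code_col, label_col):
    """Return True if each code has exactly one label and each label exactly one code."""
    codes_per_label = frame.groupby(label_col)[code_col].nunique()
    labels_per_code = frame.groupby(code_col)[label_col].nunique()
    return bool((codes_per_label == 1).all() and (labels_per_code == 1).all())

# Illustrative rows only; on the real data, pass df instead of sample.
sample = pd.DataFrame({
    "TLIST(A1)": [2010, 2010, 2011],
    "Graduation Year": [2010, 2010, 2011],
    "C04477V05260": [30, 202, 30],
    "Field of Study": ["Nursing and Midwifery", "Medicine", "Nursing and Midwifery"],
    "UNIT": ["Number", "Number", "Number"],
})

# TLIST(A1) is safe to drop if it is an exact copy of Graduation Year.
year_is_duplicate = sample["TLIST(A1)"].equals(sample["Graduation Year"])

# A code column is safe to drop if it maps one-to-one onto its label column.
field_code_redundant = is_one_to_one(sample, "C04477V05260", "Field of Study")

# Columns with a single distinct value carry no information for the analysis.
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]

print(year_is_duplicate, field_code_redundant, constant_cols)
```

Running the same three checks on `df` would confirm that nothing is lost by keeping only the label columns and `VALUE`.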


It would also be interesting to split off a separate data frame containing only the "All HSE Regions" rows.


In [18]:
# list of columns to drop
cols_to_drop = ['STATISTIC', 'Statistic Label', 'TLIST(A1)', 'C04477V05260','C03919V04671', 'C04300V05079', 'UNIT']

# create a clean data frame
df_clean = df.drop(columns = cols_to_drop)
df_clean.head()

Unnamed: 0,Graduation Year,Field of Study,Gender,HSE Health Regions,VALUE
0,2010,Nursing and Midwifery,Male,All HSE Regions,54
1,2010,Nursing and Midwifery,Male,HSE Dublin and Midlands,9
2,2010,Nursing and Midwifery,Male,HSE Dublin and North East,12
3,2010,Nursing and Midwifery,Male,HSE Dublin and South East,9
4,2010,Nursing and Midwifery,Male,HSE Midwest,3


In [20]:
# remove the 'All HSE Regions' rows and the 'All genders' rows
# these aggregates duplicate counts already present in the detailed rows, so keeping them would skew the results
# they are brought back later as their own dataframes for exploration
df_clean = df_clean[
    (df_clean['Gender'] != 'All genders') &
    (df_clean['HSE Health Regions'] != 'All HSE Regions')
].copy()

df_clean.head()


Unnamed: 0,Graduation Year,Field of Study,Gender,HSE Health Regions,VALUE
1,2010,Nursing and Midwifery,Male,HSE Dublin and Midlands,9
2,2010,Nursing and Midwifery,Male,HSE Dublin and North East,12
3,2010,Nursing and Midwifery,Male,HSE Dublin and South East,9
4,2010,Nursing and Midwifery,Male,HSE Midwest,3
5,2010,Nursing and Midwifery,Male,HSE West and North West,9


In [22]:
# shape of the clean df
df_clean.shape

(432, 5)

In [25]:
# summary of the clean dataset

dataset_summary = {
    'Rows': len(df_clean),
    'Years': f"{df_clean['Graduation Year'].min()}–{df_clean['Graduation Year'].max()}",
    'Fields of Study': df_clean['Field of Study'].nunique(),
    'Genders': df_clean['Gender'].unique().tolist(),
    'Regions': df_clean['HSE Health Regions'].nunique(),
    'Total Graduates': df_clean['VALUE'].sum()
}

print(dataset_summary)
# this summary needs to be tidied up in a readable format later 

{'Rows': 432, 'Years': '2010–2021', 'Fields of Study': 3, 'Genders': ['Male', 'Female'], 'Regions': 6, 'Total Graduates': np.int64(22626)}
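
The summary above could be tidied into a readable report along these lines. This is a sketch only: `format_summary` is a hypothetical helper, and the dictionary simply repeats the values printed above, with the NumPy integer cast to a plain `int` so that thousands separators can be applied.

```python
def format_summary(summary):
    """Render a summary dict as aligned 'label : value' lines."""
    width = max(len(label) for label in summary)
    lines = []
    for label, value in summary.items():
        if isinstance(value, (list, tuple)):
            text = ", ".join(str(item) for item in value)
        elif isinstance(value, int):
            text = f"{value:,}"  # thousands separator for counts
        else:
            text = str(value)
        lines.append(f"{label:<{width}} : {text}")
    return "\n".join(lines)

# Values copied from the printed summary; wrap NumPy integers in int() first.
dataset_summary = {
    "Rows": 432,
    "Years": "2010–2021",
    "Fields of Study": 3,
    "Genders": ["Male", "Female"],
    "Regions": 6,
    "Total Graduates": int(22626),
}

report = format_summary(dataset_summary)
print(report)
```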


In [28]:
# Build two partially aggregated dataframes: all regions (by gender) and all genders (by region)

df_all_regions = df[
    (df['HSE Health Regions'] == 'All HSE Regions') &
    (df['Gender'] != 'All genders')
].copy()


df_all_genders = df[
    (df['Gender'] == 'All genders') &
    (df['HSE Health Regions'] != 'All HSE Regions')
].copy()


In [29]:
df_all_regions.head()

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),Graduation Year,C04477V05260,Field of Study,C03919V04671,Gender,C04300V05079,HSE Health Regions,UNIT,VALUE
0,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,10,Male,-,All HSE Regions,Number,54
7,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,20,Female,-,All HSE Regions,Number,786
21,HGO22C01,Number of Health Graduates,2010,2010,202,Medicine,10,Male,-,All HSE Regions,Number,51
28,HGO22C01,Number of Health Graduates,2010,2010,202,Medicine,20,Female,-,All HSE Regions,Number,96
42,HGO22C01,Number of Health Graduates,2010,2010,702,Social Care,10,Male,-,All HSE Regions,Number,48


In [30]:
df_all_genders.head()

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),Graduation Year,C04477V05260,Field of Study,C03919V04671,Gender,C04300V05079,HSE Health Regions,UNIT,VALUE
15,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,-,All genders,20,HSE Dublin and Midlands,Number,162
16,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,-,All genders,10,HSE Dublin and North East,Number,162
17,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,-,All genders,30,HSE Dublin and South East,Number,129
18,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,-,All genders,50,HSE Midwest,Number,81
19,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,-,All genders,60,HSE West and North West,Number,162
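
The aggregate rows offer a useful consistency check: for each region, the male and female counts should add up to the published "All genders" figure. The sketch below uses the 2010 Nursing and Midwifery values shown in the outputs above for four regions. The variable names are illustrative, and if the source applies rounding or suppression, small differences could appear on the full dataset.

```python
import pandas as pd

# 2010 Nursing and Midwifery counts copied from the notebook outputs above.
rows = [
    ("HSE Dublin and Midlands", "Male", 9),
    ("HSE Dublin and Midlands", "Female", 153),
    ("HSE Dublin and Midlands", "All genders", 162),
    ("HSE Dublin and North East", "Male", 12),
    ("HSE Dublin and North East", "Female", 150),
    ("HSE Dublin and North East", "All genders", 162),
    ("HSE Dublin and South East", "Male", 9),
    ("HSE Dublin and South East", "Female", 120),
    ("HSE Dublin and South East", "All genders", 129),
    ("HSE Midwest", "Male", 3),
    ("HSE Midwest", "Female", 78),
    ("HSE Midwest", "All genders", 81),
]
sample = pd.DataFrame(rows, columns=["HSE Health Regions", "Gender", "VALUE"])

# Sum the gender-specific rows per region.
is_aggregate = sample["Gender"] == "All genders"
summed = sample[~is_aggregate].groupby("HSE Health Regions")["VALUE"].sum()

# Published totals per region, indexed the same way so the subtraction aligns.
published = sample[is_aggregate].set_index("HSE Health Regions")["VALUE"]

# Any non-zero difference flags a region where the parts do not match the total.
difference = summed - published
mismatches = difference[difference != 0]
print("Regions with mismatched totals:", list(mismatches.index))
```

On the full data, the same comparison would need `Graduation Year` and `Field of Study` added to the grouping keys.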


In [33]:
# fully aggregated rows (all genders and all regions) are also worth keeping
df_fully_aggregated = df[
    (df['Gender'] == 'All genders') &
    (df['HSE Health Regions'] == 'All HSE Regions')
].copy()
df_fully_aggregated

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),Graduation Year,C04477V05260,Field of Study,C03919V04671,Gender,C04300V05079,HSE Health Regions,UNIT,VALUE
14,HGO22C01,Number of Health Graduates,2010,2010,30,Nursing and Midwifery,-,All genders,-,All HSE Regions,Number,840
35,HGO22C01,Number of Health Graduates,2010,2010,202,Medicine,-,All genders,-,All HSE Regions,Number,147
56,HGO22C01,Number of Health Graduates,2010,2010,702,Social Care,-,All genders,-,All HSE Regions,Number,486
77,HGO22C01,Number of Health Graduates,2011,2011,30,Nursing and Midwifery,-,All genders,-,All HSE Regions,Number,1026
98,HGO22C01,Number of Health Graduates,2011,2011,202,Medicine,-,All genders,-,All HSE Regions,Number,177
119,HGO22C01,Number of Health Graduates,2011,2011,702,Social Care,-,All genders,-,All HSE Regions,Number,444
140,HGO22C01,Number of Health Graduates,2012,2012,30,Nursing and Midwifery,-,All genders,-,All HSE Regions,Number,879
161,HGO22C01,Number of Health Graduates,2012,2012,202,Medicine,-,All genders,-,All HSE Regions,Number,183
182,HGO22C01,Number of Health Graduates,2012,2012,702,Social Care,-,All genders,-,All HSE Regions,Number,579
203,HGO22C01,Number of Health Graduates,2013,2013,30,Nursing and Midwifery,-,All genders,-,All HSE Regions,Number,876


In [34]:
# pivot so each gender becomes a column, then compute gender split metrics
pivot_df = df_clean.pivot_table(
    index=['Graduation Year', 'Field of Study', 'HSE Health Regions'],
    columns='Gender',
    values='VALUE',
    aggfunc='sum',
    fill_value=0
).reset_index()

pivot_df['Total Graduates'] = pivot_df['Female'] + pivot_df['Male']
pivot_df['Female %'] = (pivot_df['Female'] / pivot_df['Total Graduates']) * 100
pivot_df['Male %'] = (pivot_df['Male'] / pivot_df['Total Graduates']) * 100
pivot_df['Gender Gap (%)'] = pivot_df['Female %'] - pivot_df['Male %']
# note: a row with zero male graduates yields inf here (or NaN for 0/0)
pivot_df['Gender Ratio (F/M)'] = pivot_df['Female'] / pivot_df['Male']


In [36]:
pivot_df.head(10)

Gender,Graduation Year,Field of Study,HSE Health Regions,Female,Male,Total Graduates,Female %,Male %,Gender Gap (%),Gender Ratio (F/M)
0,2010,Medicine,HSE Dublin and Midlands,18,6,24,75.0,25.0,50.0,3.0
1,2010,Medicine,HSE Dublin and North East,27,12,39,69.230769,30.769231,38.461538,2.25
2,2010,Medicine,HSE Dublin and South East,21,18,39,53.846154,46.153846,7.692308,1.166667
3,2010,Medicine,HSE Midwest,3,3,6,50.0,50.0,0.0,1.0
4,2010,Medicine,HSE South West,15,6,21,71.428571,28.571429,42.857143,2.5
5,2010,Medicine,HSE West and North West,9,6,15,60.0,40.0,20.0,1.5
6,2010,Nursing and Midwifery,HSE Dublin and Midlands,153,9,162,94.444444,5.555556,88.888889,17.0
7,2010,Nursing and Midwifery,HSE Dublin and North East,150,12,162,92.592593,7.407407,85.185185,12.5
8,2010,Nursing and Midwifery,HSE Dublin and South East,120,9,129,93.023256,6.976744,86.046512,13.333333
9,2010,Nursing and Midwifery,HSE Midwest,78,3,81,96.296296,3.703704,92.592593,26.0
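
To address the question of which regions show the greatest disparity, the gap can be ranked per region. The sketch below uses the 2010 Medicine counts from the output above and guards against division by zero by masking zero denominators with `where`, which is one possible approach rather than the notebook's own method. On the full `pivot_df`, a `groupby` over regions would be needed before ranking.

```python
import pandas as pd

# 2010 Medicine counts copied from the pivot output above.
sample = pd.DataFrame({
    "HSE Health Regions": [
        "HSE Dublin and Midlands",
        "HSE Dublin and North East",
        "HSE Dublin and South East",
        "HSE Midwest",
        "HSE South West",
        "HSE West and North West",
    ],
    "Female": [18, 27, 21, 3, 15, 9],
    "Male": [6, 12, 18, 3, 6, 6],
})

total = sample["Female"] + sample["Male"]

# where() replaces zero denominators with NaN, so no inf values are produced.
safe_total = total.where(total > 0)
safe_male = sample["Male"].where(sample["Male"] > 0)

# Female % minus Male % simplifies to (Female - Male) / Total * 100.
sample["Gender Gap (%)"] = (sample["Female"] - sample["Male"]) / safe_total * 100
sample["Gender Ratio (F/M)"] = sample["Female"] / safe_male

# Rank regions from the largest to the smallest gap.
ranked = sample.sort_values("Gender Gap (%)", ascending=False).reset_index(drop=True)
top_region = ranked.loc[0, "HSE Health Regions"]
top_gap = float(ranked.loc[0, "Gender Gap (%)"])
print(top_region, top_gap)
```

For this single year and field, the largest gap is in HSE Dublin and Midlands (50 percentage points), matching the first row of the pivot output.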


In [41]:
import plotly.express as px
import ipywidgets as widgets
from IPython.display import display

# Widgets
year_slider = widgets.IntSlider(
    value=pivot_df['Graduation Year'].min(),
    min=pivot_df['Graduation Year'].min(),
    max=pivot_df['Graduation Year'].max(),
    step=1,
    description='Year:',
    continuous_update=False
)

field_dropdown = widgets.Dropdown(
    options=sorted(pivot_df['Field of Study'].unique()),
    value='Medicine',
    description='Field:',
)

# Plotting function
def update_bar_chart(selected_year, selected_field):
    filtered = pivot_df[
        (pivot_df['Graduation Year'] == selected_year) &
        (pivot_df['Field of Study'] == selected_field)
    ]

    fig = px.bar(
        filtered,
        x='HSE Health Regions',
        y=['Female', 'Male'],
        barmode='group',
        title=f"Male vs Female Graduates by Region ({selected_field}, {selected_year})",
        labels={'value': 'Graduate Count', 'HSE Health Regions': 'Region', 'variable': 'Gender'},
        height=500
    )
    fig.update_layout(xaxis_title='Region', yaxis_title='Graduates', legend_title='Gender')
    fig.show()

# Connect widgets
widgets.interact(update_bar_chart, selected_year=year_slider, selected_field=field_dropdown)

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

interactive(children=(IntSlider(value=2010, continuous_update=False, description='Year:', max=2021, min=2010),…

<function __main__.update_bar_chart(selected_year, selected_field)>
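
The `ValueError` above indicates that `nbformat` was not visible to the running kernel when `fig.show()` was called. A small check like the one below can confirm whether the package is installed in the active environment. It is a sketch using only the standard library, and `installed_version` is a hypothetical helper name.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

nbformat_version = installed_version("nbformat")
if nbformat_version is None:
    print("nbformat is not installed in this environment")
else:
    print("nbformat version:", nbformat_version)
```

If the package was installed from inside the notebook, the kernel may need to be restarted before the chart will render.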