Author: Niamh Hogan

# **Suicide Mortality in Ireland: Demographic Trends and EU Comparison (2012–2019)**

## **Project Overview**

In [1]:
# Imports

import pandas as pd
import numpy as np

## **Data Loading**  

In this section, I load the datasets required for the analysis. This includes Irish mortality data by age, sex, and county, as well as EU mortality and population data. I perform initial checks to inspect the structure and contents of each dataset, ensuring that the data has loaded correctly and is ready for cleaning and analysis.

**Irish Deaths by Year, Age, and Sex**

Here, I load Irish mortality data published by the Central Statistics Office (CSO). The dataset contains annual death counts in Ireland, broken down by year, sex, cause of death, and age group at death. The *UNIT* column records the unit of measurement, and the *VALUE* column holds the number of deaths. This dataset (VSA35) was downloaded as a CSV file from the [Central Statistics Office data portal](https://data.cso.ie/#) and allows me to analyse mortality patterns across different age groups and sexes over time.  

I load the CSV file *irishdata_year_age_sex_cso.csv* into a pandas DataFrame called *irish_age_sex_df* ([Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).
I then view the first three rows to inspect the data and ensure it has loaded correctly:

In [2]:
irish_age_sex_df = pd.read_csv(
    "./data/irishdata_year_age_sex_cso.csv"
)

irish_age_sex_df.head(3)

Unnamed: 0,Statistic Label,Year,Sex,Cause of Death,Age Group at Death,UNIT,VALUE
0,Revised Deaths Occurring,2007,Both sexes,X60-X84 Intentional self-harm,Under 1 year,Number,
1,Revised Deaths Occurring,2007,Both sexes,X60-X84 Intentional self-harm,1 - 4 years,Number,
2,Revised Deaths Occurring,2007,Both sexes,X60-X84 Intentional self-harm,5 - 9 years,Number,
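
The "initial checks" described above can be made a little more explicit. The following is a minimal sketch, not part of the original notebook: it reads three rows copied from the preview above out of an in-memory string (a stand-in for the real CSV file) and confirms that the expected columns are present and how many values are missing. The names `sample_csv`, `sample_df`, `expected_columns`, and `missing_columns` are illustrative.

```python
import io

import pandas as pd

# Stand-in for the CSV file: three rows copied from the preview above.
sample_csv = io.StringIO(
    "Statistic Label,Year,Sex,Cause of Death,Age Group at Death,UNIT,VALUE\n"
    "Revised Deaths Occurring,2007,Both sexes,X60-X84 Intentional self-harm,Under 1 year,Number,\n"
    "Revised Deaths Occurring,2007,Both sexes,X60-X84 Intentional self-harm,1 - 4 years,Number,\n"
    "Revised Deaths Occurring,2007,Both sexes,X60-X84 Intentional self-harm,5 - 9 years,Number,\n"
)
sample_df = pd.read_csv(sample_csv)

# Columns the later cleaning steps rely on.
expected_columns = {
    "Statistic Label", "Year", "Sex", "Cause of Death",
    "Age Group at Death", "UNIT", "VALUE",
}
missing_columns = expected_columns - set(sample_df.columns)

print("Missing columns:", missing_columns)
print("Missing values per column:")
print(sample_df.isna().sum())
```

A check like this fails loudly if a column is renamed in a later download, and the missing-value count shows early on that *VALUE* is empty for the youngest age groups.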


**Irish Deaths by County and Sex**

In this dataset, I load Irish mortality data published by the Central Statistics Office (CSO), containing annual death counts broken down by year, sex, county, and cause of death. The values represent overall counts of deaths, allowing for comparison of mortality patterns across different counties and between the sexes. This dataset (VSA112) was downloaded as a CSV file from the [Central Statistics Office data portal](https://data.cso.ie/#) and supports geographic analysis of mortality trends within Ireland.  

I load the CSV file *irishdata_year_counties_sex_cso.csv* into a pandas DataFrame called *irish_counties_df* ([Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).
I then view the first three rows to inspect the data and ensure it has loaded correctly:

In [3]:
irish_counties_df = pd.read_csv(
    "./data/irishdata_year_counties_sex_cso.csv"
)

irish_counties_df.head(3)

Unnamed: 0,Statistic Label,Year,Sex,County,Cause of Death,UNIT,VALUE
0,Deaths Occuring,2015,Both sexes,Ireland,Intentional self-harm (X60-X84),Number,500.0
1,Deaths Occuring,2015,Both sexes,Carlow County Council,Intentional self-harm (X60-X84),Number,7.0
2,Deaths Occuring,2015,Both sexes,Dublin City Council,Intentional self-harm (X60-X84),Number,54.0


**EU Deaths by Country and Sex**  

Here, I load European mortality data from the World Health Organization (WHO), containing annual counts of deaths from suicide and intentional self-harm (1969–2022) by country and sex. The values are counts of deaths, allowing me to compare mortality patterns across different European countries and between the sexes. This dataset was downloaded as a CSV file from the [WHO data portal](https://gateway.euro.who.int/en/indicators/hfamdb_761-deaths-suicide-and-intentional-self-harm/#id=31291) and allows me to analyse trends in suicide mortality across Europe.  

I load the WHO EU deaths CSV file into a pandas DataFrame called *eu_deaths_df* ([Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)). I skip the first 30 rows because they contain metadata and notes rather than the actual data. I also set *low_memory=False* so that pandas reads the file in one pass instead of in chunks and infers each column's data type from the whole column, which prevents mixed-type warnings ([Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)). I then view the first three rows to check the structure and confirm the data loaded correctly for analysis:

In [4]:
eu_deaths_df = pd.read_csv(
    "./data/who_eu_deaths.csv",
    skiprows=30,
    low_memory=False,
)

eu_deaths_df.head(3)

Unnamed: 0,COUNTRY,COUNTRY_GRP,AGE_GRP_LIST,SEX,SUBNATIONAL_MDB,YEAR,VALUE
0,ALB,,TOTAL,FEMALE,,1987.0,25.0
1,ALB,,TOTAL,FEMALE,,1988.0,22.0
2,ALB,,TOTAL,FEMALE,,1989.0,15.0
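
To illustrate what `skiprows` does, here is a minimal sketch on an in-memory file. The two note lines and the `skiprows=2` value are hypothetical stand-ins for the 30 metadata rows in the real WHO export, and `raw_text` and `demo_df` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical miniature of the WHO export: note lines first, then the real header.
raw_text = (
    "Hypothetical note line 1\n"
    "Hypothetical note line 2\n"
    "COUNTRY,SEX,YEAR,VALUE\n"
    "ALB,FEMALE,1987,25\n"
    "ALB,FEMALE,1988,22\n"
)

# skiprows=2 discards the note lines, so the third line becomes the header.
demo_df = pd.read_csv(io.StringIO(raw_text), skiprows=2, low_memory=False)

print(demo_df.columns.tolist())
print(demo_df)
```

If the number of skipped rows is too small, the note lines end up as a malformed header, so it is worth confirming the column names right after loading.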


**EU Population**  

Here, I load European population data for 2012–2022, containing total population counts by country, year, age group, and sex. I include this dataset to standardize EU death counts, which allows me to compare mortality across countries of different population sizes. Dividing deaths by population gives rates, so that countries with larger populations do not appear to have higher mortality simply because more people live there ([Health Knowledge](https://www.healthknowledge.org.uk/e-learning/epidemiology/specialists/standardisation)). This dataset was downloaded as a CSV file from [Eurostat](https://ec.europa.eu/eurostat/databrowser/view/demo_pjan/default/table) and supports fair cross-country comparisons of mortality trends.  

Below, I load the CSV file *eu_pop_2012_2022.csv* into a pandas DataFrame called *eu_pop_df* ([Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).
I then view the first three rows to inspect the data and ensure it has loaded correctly:

In [5]:
eu_pop_df = pd.read_csv(
    "./data/eu_pop_2012_2022.csv"
)

eu_pop_df.head(3)

Unnamed: 0,Time,geo,Value,age,sex,unit
0,2012,AT,8408121,TOTAL,T,NR
1,2012,BE,11075889,TOTAL,T,NR
2,2012,BG,7327224,TOTAL,T,NR
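
As a rough illustration of why population data matters, the sketch below computes a crude rate per 100,000 people (deaths × 100,000 ÷ population) for two made-up countries. All figures and the names `demo` and `rate_per_100k` are hypothetical; this is not the notebook's own calculation.

```python
import pandas as pd

# Hypothetical figures: country A has ten times the deaths of country B,
# but also a much larger population.
demo = pd.DataFrame(
    {
        "country": ["A", "B"],
        "deaths": [500, 50],
        "population": [5_000_000, 250_000],
    }
)

# Crude rate per 100,000 population.
demo["rate_per_100k"] = demo["deaths"] * 100_000 / demo["population"]

print(demo)
```

Country A records more deaths in absolute terms, yet country B has the higher rate, which is exactly the distortion that raw counts would hide.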


## **Data Cleansing**

**Cleaning irish_age_sex_df**

In [6]:
# drop unnecessary columns for irish_age_sex_df
drop_col_list1 = ["Statistic Label", "Cause of Death", "UNIT"]

irish_age_sex_df.drop(columns=drop_col_list1, inplace=True)

# sanity check
print(irish_age_sex_df.head(3))

   Year         Sex Age Group at Death  VALUE
0  2007  Both sexes       Under 1 year    NaN
1  2007  Both sexes        1 - 4 years    NaN
2  2007  Both sexes        5 - 9 years    NaN


In [7]:
#irish_age_sex_df
print(irish_age_sex_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Year                960 non-null    int64  
 1   Sex                 960 non-null    object 
 2   Age Group at Death  960 non-null    object 
 3   VALUE               782 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 30.1+ KB
None


In [8]:
# Convert VALUE to pandas' nullable integer dtype (Int64) so missing values stay as <NA>
irish_age_sex_df["VALUE"] = irish_age_sex_df["VALUE"].astype("Int64")
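
The capital "I" in `"Int64"` matters here. As a small sketch on made-up values (the series `values` is illustrative), the nullable `Int64` dtype keeps missing entries as `<NA>`, whereas NumPy's plain `int64` cannot represent them and raises an error:

```python
import numpy as np
import pandas as pd

# Illustrative float column with one missing value, like VALUE after loading.
values = pd.Series([500.0, 7.0, np.nan])

# Nullable integer dtype: whole numbers become integers, NaN becomes <NA>.
as_nullable = values.astype("Int64")
print(as_nullable)

# Plain NumPy int64 has no way to store a missing value, so the cast fails.
try:
    values.astype("int64")
    plain_cast_failed = False
except ValueError as exc:
    plain_cast_failed = True
    print("Plain int64 cast failed:", type(exc).__name__)
```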

In [9]:
# Print Age Group at Death values
print(irish_age_sex_df["Age Group at Death"].unique())

['Under 1 year' '1 - 4 years' '5 - 9 years' '10 - 14 years'
 '15 - 19 years' '20 - 24 years' '25 - 29 years' '30 - 34 years'
 '35 - 39 years' '40 - 44 years' '45 - 49 years' '50 - 54 years'
 '55 - 59 years' '60 - 64 years' '65 - 69 years' '70 - 74 years'
 '75 - 79 years' '80 - 84 years' '85 years and over' 'All ages']


In [10]:
# Convert Age Group at Death labels to the integer lower bound of each band
def age_to_int(age_str):
    if age_str == 'All ages':
        return np.nan  
    if age_str == 'Under 1 year':
        return 0
    if 'and over' in age_str: 
        return int(age_str.split()[0])
    return int(age_str.split(' - ')[0])

# Apply to the column
irish_age_sex_df["Age Group at Death"] = irish_age_sex_df["Age Group at Death"].apply(age_to_int).astype('Int64')
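
To make the mapping concrete, the sketch below repeats the function from the cell above so that it runs on its own, then applies it to four of the labels printed earlier. The series `labels` is illustrative.

```python
import numpy as np
import pandas as pd


def age_to_int(age_str):
    """Return the lower bound of an age band, or NaN for the 'All ages' total."""
    if age_str == "All ages":
        return np.nan
    if age_str == "Under 1 year":
        return 0
    if "and over" in age_str:
        return int(age_str.split()[0])
    return int(age_str.split(" - ")[0])


labels = pd.Series(
    ["Under 1 year", "1 - 4 years", "85 years and over", "All ages"]
)
encoded = labels.apply(age_to_int).astype("Int64")

# Show each label next to its encoded value.
print(pd.DataFrame({"label": labels, "encoded": encoded}))
```

Each band is represented by its starting age, and the "All ages" total has no starting age, so it comes out as `<NA>` at this stage.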


In [11]:
# Converting 'All ages' to a midpoint value
all_ages_midpoint = 42

# Fill <NA> values with the midpoint
irish_age_sex_df["Age Group at Death"] = irish_age_sex_df["Age Group at Death"].fillna(all_ages_midpoint)

# Check result
print(irish_age_sex_df["Age Group at Death"].unique())

<IntegerArray>
[0, 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 42]
Length: 20, dtype: Int64
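
Because 42 does not coincide with any band start in the printed list, the "All ages" rows stay identifiable after this step. The following is a minimal sketch, on made-up counts, of how those total rows could be separated from the individual age bands before summing or plotting, so that totals are not counted twice. The names `ALL_AGES_CODE`, `encoded_df`, `age_bands`, and `totals` are hypothetical.

```python
import pandas as pd

# Code used above to stand in for the 'All ages' total rows.
ALL_AGES_CODE = 42

# Illustrative frame: two age bands plus their total (counts are made up).
encoded_df = pd.DataFrame(
    {
        "Age Group at Death": pd.array([15, 20, ALL_AGES_CODE], dtype="Int64"),
        "VALUE": pd.array([10, 20, 30], dtype="Int64"),
    }
)

# Split the total rows from the individual age bands.
is_total = encoded_df["Age Group at Death"] == ALL_AGES_CODE
age_bands = encoded_df[~is_total]
totals = encoded_df[is_total]

print(age_bands)
print(totals)
```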


In [12]:
print(irish_age_sex_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Year                960 non-null    int64 
 1   Sex                 960 non-null    object
 2   Age Group at Death  960 non-null    Int64 
 3   VALUE               782 non-null    Int64 
dtypes: Int64(2), int64(1), object(1)
memory usage: 32.0+ KB
None


**Cleaning irish_counties_df**

In [13]:
# drop unnecessary columns for irish_counties_df
drop_col_list2 = ["Statistic Label", "Cause of Death", "UNIT"]

irish_counties_df.drop(columns=drop_col_list2, inplace=True)

# sanity check
print(irish_counties_df.head(3))

   Year         Sex                 County  VALUE
0  2015  Both sexes                Ireland  500.0
1  2015  Both sexes  Carlow County Council    7.0
2  2015  Both sexes    Dublin City Council   54.0


In [14]:
#irish_counties_df
print(irish_counties_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Year    768 non-null    int64  
 1   Sex     768 non-null    object 
 2   County  768 non-null    object 
 3   VALUE   745 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 24.1+ KB
None


**Cleaning eu_deaths_df**

In [15]:
# drop unnecessary columns for eu_deaths_df
drop_col_list3 = ["COUNTRY_GRP", "AGE_GRP_LIST", "SUBNATIONAL_MDB"]

eu_deaths_df.drop(columns=drop_col_list3, inplace=True)

# sanity check
print(eu_deaths_df.head(3))

  COUNTRY     SEX    YEAR  VALUE
0     ALB  FEMALE  1987.0   25.0
1     ALB  FEMALE  1988.0   22.0
2     ALB  FEMALE  1989.0   15.0


In [16]:
#eu_deaths_df

# EU member state variable
eu_members = [
    "AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST", "FIN", "FRA",
    "DEU", "GRC", "HUN", "IRL", "ITA", "LVA", "LTU", "LUX", "MLT", "NLD",
    "POL", "PRT", "ROU", "SVK", "SVN", "ESP", "SWE"
]

# Drop non-EU member states 
eu_deaths_df = eu_deaths_df[eu_deaths_df["COUNTRY"].isin(eu_members)]

# Print countries alphabetically
countries = sorted(eu_deaths_df["COUNTRY"].unique())
print(countries)

['AUT', 'BEL', 'BGR', 'CYP', 'CZE', 'DEU', 'DNK', 'ESP', 'EST', 'FIN', 'FRA', 'GRC', 'HRV', 'HUN', 'IRL', 'ITA', 'LTU', 'LUX', 'LVA', 'MLT', 'NLD', 'POL', 'PRT', 'ROU', 'SVK', 'SVN', 'SWE']
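
The printed list contains all 27 member-state codes. A quick way to confirm this programmatically is a set difference between the expected codes and the codes left after filtering. The sketch below uses a shortened, hypothetical member list and made-up rows, so its names (`expected`, `demo`, `filtered`, `missing`) and values are illustrative only.

```python
import pandas as pd

# Shortened, hypothetical list of expected country codes.
expected = ["AUT", "IRL", "SWE"]

# Made-up rows: two expected countries and two that should be dropped.
demo = pd.DataFrame(
    {
        "COUNTRY": ["ALB", "AUT", "IRL", "NOR"],
        "VALUE": [1.0, 2.0, 3.0, 4.0],
    }
)

# Keep only the expected countries, as in the cell above.
filtered = demo[demo["COUNTRY"].isin(expected)]

# Any expected code with no rows in the data shows up here.
missing = sorted(set(expected) - set(filtered["COUNTRY"]))
print("Expected but not found:", missing)
```

An empty result means every expected country has at least one row; a non-empty result would point to a typo in a code or a country absent from the source file.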


In [17]:
#eu_deaths_df
print(eu_deaths_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 3639 entries, 67 to 5881
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   COUNTRY  3639 non-null   object 
 1   SEX      3639 non-null   object 
 2   YEAR     3639 non-null   float64
 3   VALUE    3639 non-null   float64
dtypes: float64(2), object(2)
memory usage: 142.1+ KB
None


**Cleaning eu_pop_df**

In [18]:
# drop unnecessary columns for eu_pop_df
drop_col_list4 = ["age", "sex", "unit"]

eu_pop_df.drop(columns=drop_col_list4, inplace=True)

# sanity check
print(eu_pop_df.head(3)) 

   Time geo     Value
0  2012  AT   8408121
1  2012  BE  11075889
2  2012  BG   7327224


In [19]:
#eu_deaths_df

# Drop years not 2012-2019
eu_deaths_df = eu_deaths_df[(eu_deaths_df["YEAR"] >= 2012) & (eu_deaths_df["YEAR"] <= 2019)]

print(eu_deaths_df.head(3)) 

    COUNTRY     SEX    YEAR  VALUE
110     AUT  FEMALE  2012.0  289.0
111     AUT  FEMALE  2013.0  324.0
112     AUT  FEMALE  2014.0  324.0
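
The same year filter can be written with `Series.between`, which includes both endpoints by default. This is an equivalent sketch on made-up rows rather than a change to the cell above; `demo` and `in_range` are illustrative names.

```python
import pandas as pd

# Made-up rows spanning the boundary years of the study period.
demo = pd.DataFrame(
    {
        "YEAR": [2011.0, 2012.0, 2019.0, 2020.0],
        "VALUE": [1.0, 2.0, 3.0, 4.0],
    }
)

# between() keeps 2012 and 2019 themselves; .copy() gives an independent
# frame so later column assignments do not trigger chained-assignment warnings.
in_range = demo[demo["YEAR"].between(2012, 2019)].copy()

print(in_range)
```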


In [20]:
#eu_pop_df
print(eu_pop_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Time    297 non-null    int64 
 1   geo     297 non-null    object
 2   Value   297 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 7.1+ KB
None
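
One detail to keep in mind before combining the two EU datasets: the deaths data uses three-letter country codes (for example `AUT`) and a float `YEAR` column, while the population data uses Eurostat's two-letter `geo` codes (for example `AT`) and an integer `Time` column. The sketch below shows one possible way to align them, using rows taken from the previews above. The mapping `geo_to_iso3` is a partial, illustrative assumption (Eurostat writes Greece as `EL`), and the column name `POPULATION` is hypothetical.

```python
import pandas as pd

# Partial, illustrative mapping from Eurostat geo codes to three-letter codes.
geo_to_iso3 = {"AT": "AUT", "BE": "BEL", "EL": "GRC"}

# Rows taken from the previews above.
pop = pd.DataFrame(
    {"Time": [2012, 2012], "geo": ["AT", "BE"], "Value": [8408121, 11075889]}
)
deaths = pd.DataFrame(
    {"COUNTRY": ["AUT"], "SEX": ["FEMALE"], "YEAR": [2012.0], "VALUE": [289.0]}
)

# Align the keys: same country coding, same integer year type.
pop_keyed = pop.assign(COUNTRY=pop["geo"].map(geo_to_iso3)).rename(
    columns={"Time": "YEAR", "Value": "POPULATION"}
)[["COUNTRY", "YEAR", "POPULATION"]]
deaths_keyed = deaths.assign(YEAR=deaths["YEAR"].astype("int64"))

# Left join keeps every deaths row, even where no population match exists.
merged = deaths_keyed.merge(pop_keyed, on=["COUNTRY", "YEAR"], how="left")
print(merged)
```

Any `geo` code missing from the mapping would come out as `NaN` in `COUNTRY` and fail to match, so checking for unmatched rows after the merge is a sensible follow-up. Note also that a rate for one sex would need a population figure for the same sex, not the total used here.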


# END