# Mortality Rates in Ireland – Data Analysis & Prediction

**Author:** Loic Soares Bagnoud  
**Data Source:** Central Statistics Office (CSO) – Ireland  
**Notebook Purpose:**  
- Clean and explore mortality rate data for Ireland  
- Identify trends and correlations across years, regions, causes of death, and demographics  
- Build a simple machine learning model to predict mortality rates

## 1. Cleaning up Data 

### 1.1 Importing libraries, loading the data and understanding the columns

The first step is to make sure that the raw data loads correctly and is free of obvious issues.

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [18]:
csv_path = "raw_data/MORT02.20251120T191125.csv"
df = pd.read_csv(csv_path)
df.head(5)

Unnamed: 0,Statistic Label,Year,Sex,County,Age Group,Cause of Death,UNIT,VALUE
0,Age-standardised mortality rate,2015,Both sexes,Ireland,0 - 64 years,All causes of death,Rate,161.63
1,Age-standardised mortality rate,2015,Both sexes,Ireland,0 - 64 years,Infectious and parasitic diseases,Rate,1.75
2,Age-standardised mortality rate,2015,Both sexes,Ireland,0 - 64 years,Tuberculosis,Rate,0.15
3,Age-standardised mortality rate,2015,Both sexes,Ireland,0 - 64 years,Meningococcal infection,Rate,0.04
4,Age-standardised mortality rate,2015,Both sexes,Ireland,0 - 64 years,Aids (HIV disease),Rate,0.19


The preview above shows the columns we are working with. Two of them won't be needed here: __*UNIT*__ and __*Statistic Label*__, which both repeat the same value in every row of the preview.
Everything else will be necessary:

- The __*Year*__ will allow us to check trends by year across the country.
- __*Sex*__ will let us see if there is anything interesting to be gained from comparing the sexes.
- __*County*__ will let us compare regions.
- The __*Age Group*__ will allow us to see if there are any problematic age groups.
- The __*Cause of Death*__ will give us insight into the actual causes behind the mortality rates.
- And finally, __*VALUE*__ is the age-standardised mortality rate itself.
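
Before relying on these columns, it can help to look at how many distinct values each one holds. The following is a minimal sketch on a small hypothetical sample that only mirrors the column layout above. The rows, the `Male`/`Female` labels and the numbers are illustrative assumptions, not values taken from the CSO file.

```python
import pandas as pd

# Hypothetical sample mirroring the column layout; values are illustrative only.
sample = pd.DataFrame(
    {
        "Year": [2015, 2015, 2016],
        "Sex": ["Both sexes", "Male", "Female"],
        "County": ["Ireland", "Ireland", "Ireland"],
        "Age Group": ["0 - 64 years", "0 - 64 years", "0 - 74 years"],
        "Cause of Death": ["All causes of death", "Tuberculosis", "Tuberculosis"],
        "VALUE": [161.63, 0.15, 0.20],
    }
)

# Number of distinct values in each column.
distinct_counts = sample.nunique()
print(distinct_counts)

# Distinct labels of each text column, to see which categories exist.
for column in ["Sex", "County", "Age Group", "Cause of Death"]:
    print(column, "->", sorted(sample[column].unique()))
```

On the real dataframe, the same two calls (`nunique()` and `unique()`) would be run on `df` instead of `sample`.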

In [19]:
# This allows us to drop the columns that we don't need.
drop_col_list = ['Statistic Label', 'UNIT']
df.drop(columns=drop_col_list, inplace=True)
df.head(5)

Unnamed: 0,Year,Sex,County,Age Group,Cause of Death,VALUE
0,2015,Both sexes,Ireland,0 - 64 years,All causes of death,161.63
1,2015,Both sexes,Ireland,0 - 64 years,Infectious and parasitic diseases,1.75
2,2015,Both sexes,Ireland,0 - 64 years,Tuberculosis,0.15
3,2015,Both sexes,Ireland,0 - 64 years,Meningococcal infection,0.04
4,2015,Both sexes,Ireland,0 - 64 years,Aids (HIV disease),0.19
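
One thing to keep in mind: because the drop is done in place, running the same cell a second time raises a `KeyError`, since the columns are already gone. A possible alternative, sketched here on a hypothetical sample rather than the real data, is to pass `errors="ignore"` and assign the result to a new variable so the cell can be re-run safely.

```python
import pandas as pd

# Hypothetical sample with the two columns we want to remove; values are illustrative.
sample = pd.DataFrame(
    {
        "Statistic Label": ["Age-standardised mortality rate"] * 2,
        "Year": [2015, 2016],
        "UNIT": ["Rate"] * 2,
        "VALUE": [161.63, 158.0],
    }
)

drop_col_list = ["Statistic Label", "UNIT"]

# errors="ignore" skips labels that no longer exist instead of raising KeyError.
trimmed = sample.drop(columns=drop_col_list, errors="ignore")

# Dropping again is harmless, which makes the cell safe to re-run.
trimmed_again = trimmed.drop(columns=drop_col_list, errors="ignore")
print(list(trimmed_again.columns))
```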


### 1.2 Data quality checks

With our columns dropped, we can now check whether any data is missing from our dataframe. For this, we can use the __*.isna()*__ method.

In [20]:
# We use this to check if there's any missing values in each column and we sum it up.
missing_counts = df.isna().sum()
missing_counts

Year              0
Sex               0
County            0
Age Group         0
Cause of Death    0
VALUE             0
dtype: int64

No missing values, which is great. Let's also try and see if all values have the correct datatype.

In [None]:
# This gets us the datatype of each column.
data_types = df.dtypes
data_types

Year                int64
Sex                object
County             object
Age Group          object
Cause of Death     object
VALUE             float64
dtype: object

We have ints for __*Year*__ and floats for __*VALUE*__ (which is expected, since the values are rates rather than counts). Everything else is an object, which makes sense given that it's all text.
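
Since the text columns repeat a small set of labels, one optional step is to convert them to the pandas `category` dtype, which uses less memory and makes grouping a bit more explicit. This is only a sketch on a hypothetical sample, and the notebook itself keeps the columns as `object`.

```python
import pandas as pd

# Hypothetical sample mirroring the cleaned dataframe; values are illustrative.
sample = pd.DataFrame(
    {
        "Year": [2015, 2016],
        "Sex": ["Both sexes", "Both sexes"],
        "County": ["Ireland", "Ireland"],
        "Age Group": ["0 - 64 years", "0 - 74 years"],
        "Cause of Death": ["Tuberculosis", "Tuberculosis"],
        "VALUE": [0.15, 0.20],
    }
)

text_columns = ["Sex", "County", "Age Group", "Cause of Death"]

# astype accepts a {column: dtype} mapping and leaves the other columns untouched.
typed = sample.astype({column: "category" for column in text_columns})
print(typed.dtypes)
```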

In [None]:
# This shows which rows, if any, are exact duplicates of an earlier row in the dataframe.
duplicates = df[df.duplicated()]
duplicates

Unnamed: 0,Year,Sex,County,Age Group,Cause of Death,VALUE


The empty output above looked odd at first, but it means that no duplicated rows were found, which I confirmed with ChatGPT [3].

In [25]:
# Count the duplicated rows as a second check. A result of 0 means there are no duplicates.
duplicates_making_sure = df.duplicated().sum()
duplicates_making_sure

0
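
Note that `duplicated()` only flags rows that match on every column, including __*VALUE*__. A stricter check, sketched below on a hypothetical sample, is to look for rows that share the same identifying columns but carry different values. The choice of `key_columns` is my assumption about what identifies a row.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 share every key but differ in VALUE.
sample = pd.DataFrame(
    {
        "Year": [2015, 2015, 2016],
        "Sex": ["Both sexes"] * 3,
        "County": ["Ireland"] * 3,
        "Age Group": ["0 - 64 years"] * 3,
        "Cause of Death": ["Tuberculosis"] * 3,
        "VALUE": [0.15, 0.16, 0.15],
    }
)

# Full-row check, as in the notebook: an empty result means no exact duplicates.
full_row_duplicates = sample[sample.duplicated()]
print(full_row_duplicates.empty)

# Key-only check: keep=False marks every row of a conflicting group.
key_columns = ["Year", "Sex", "County", "Age Group", "Cause of Death"]
conflicting = sample[sample.duplicated(subset=key_columns, keep=False)]
print(len(conflicting))
```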

The next thing to check is whether there are any 0 values. It would be incredibly weird if any value at all were 0, be it for a sex, a year, a cause of death, etc., so we need to catch those and then decide what to do with them. As a first step, the following function from Stack Overflow [4] reports whether every value in the dataframe is 0.

In [27]:
# The following function allows us to check the dataframe and will return True if all values in a df are 0 and False if not.

# Source - https://stackoverflow.com/a
# Posted by kevin41
# Retrieved 2025-12-04, License - CC BY-SA 4.0
def is_zero(df):
    vals = df.to_numpy()
    return (0 == vals).all()

is_zero(df)


False
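
A result of `False` here only tells us that the dataframe is not made up entirely of zeros. It does not rule out individual zeros. To find those, one option is to count the zeros in the __*VALUE*__ column directly. The sketch below uses a hypothetical sample with one zero planted in it, so the result says nothing about the real data.

```python
import pandas as pd

# Hypothetical sample with a single zero rate planted for illustration.
sample = pd.DataFrame(
    {
        "Year": [2015, 2015, 2016],
        "Cause of Death": ["All causes of death", "Tuberculosis", "Tuberculosis"],
        "VALUE": [161.63, 0.0, 0.15],
    }
)

# Boolean mask that is True wherever the rate is exactly 0.
zero_mask = sample["VALUE"] == 0

# How many zero rates there are, and which rows they belong to.
zero_count = int(zero_mask.sum())
zero_rows = sample[zero_mask]
print(zero_count)
print(zero_rows)
```

On the real dataframe the same mask would be built from `df["VALUE"]`. Inspecting `zero_rows` then shows which year, cause and group each zero belongs to before deciding how to handle it.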

With all sanity checks done, we come to the difficult question: what to do with the age groups? It's problematic because we don't have individual ages like 1, 2, 3, etc., but two big age groups:

- 0–64 years
- 0–74 years

After some research, I found that this is related to the **cumulative incidence rate** [5], but applied to mortality, which means that I do need to keep both groups. As for why the cut-offs are 64 and 74: both bands are used to study premature mortality, but the WHO/EU tends to go up to 74/75.
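
One practical consequence is that the two bands overlap (everyone in the 0–64 band is also in the 0–74 band), so their rates should be compared side by side and never added together. A minimal sketch of how that could look, using a hypothetical sample with made-up rates:

```python
import pandas as pd

# Hypothetical sample: one rate per year and age band; numbers are made up.
sample = pd.DataFrame(
    {
        "Year": [2015, 2015, 2016, 2016],
        "Age Group": ["0 - 64 years", "0 - 74 years", "0 - 64 years", "0 - 74 years"],
        "VALUE": [160.0, 300.0, 150.0, 290.0],
    }
)

# One column per age band, so the bands are compared rather than summed.
by_age_group = sample.pivot(index="Year", columns="Age Group", values="VALUE")
print(by_age_group)

# For any single analysis, filter down to one band first.
under_65 = sample[sample["Age Group"] == "0 - 64 years"]
print(under_65)
```

Note that `pivot` expects a single row per year and band, so on the real dataframe the other columns (sex, county, cause of death) would need to be filtered to one value each first.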

## 2. Exploratory Data Analysis (EDA): Trends & Correlations

### 2.1 Overall distribution of mortality rates

### 2.2 Trends over time (national level)

### 2.3 Regional patterns

### 2.4 Age and sex patterns

### 2.5 Causes of death

### 2.6 Correlations and relationships

## 3. Predictive Modeling: Simple Machine Learning Model

### References:

1. https://gist.github.com/ericmjl/27e50331f24db3e8f957d1fe7bbbe510 - Ideas on organising the repository (used at the beginning).
2. https://www.geeksforgeeks.org/data-analysis/working-with-missing-data-in-pandas/ - On checking if there are any missing values.
3. https://chatgpt.com/share/6931d4c6-1c98-800b-9281-3ad3f9c30654 - Clarifying whether an empty dataframe meant that no duplicates were found.
4. https://stackoverflow.com/questions/77489899/how-to-check-that-a-dataframe-consists-of-all-0-entries - On checking whether a dataframe consists entirely of 0 values.
5. https://www.britannica.com/science/cumulative-incidence - On cumulative incidence rate and why it matters.
6. https://www.cso.ie/en/releasesandpublications/ep/p-vsar/vitalstatisticsannualreport2022/deaths2022/ - Ideas on plotting.