## Population in Ireland: Where are we heading?

- Observe the population growth/decline in Ireland from 1960 to 2022.
- Analyse the correlation between the number of births, deaths and marriages, which are determining factors in population growth/decline.
- Find a suitable machine learning model for projecting Ireland’s future population.
- Analyse our results in order to draw conclusions about Ireland’s population stability.


#### Data source
The dataset for this analysis comes from the Central Statistics Office (CSO), Ireland. The CSO is responsible for the collection, compilation, extraction and dissemination of high-quality data used to analyse economically and socially important topics.


In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('Total Births, Deaths and Marriages Registered.csv')
df.head(-5)

Unnamed: 0,Statistic Label,Quarter,State,UNIT,VALUE
0,Marriages Registered,1960Q1,State,Number,2987.0
1,Marriages Registered,1960Q2,State,Number,3501.0
2,Marriages Registered,1960Q3,State,Number,5900.0
3,Marriages Registered,1960Q4,State,Number,3106.0
4,Marriages Registered,1961Q1,State,Number,2179.0
...,...,...,...,...,...
4544,Average age of mothers giving birth,2020Q4,State,Number,
4545,Average age of mothers giving birth,2021Q1,State,Number,
4546,Average age of mothers giving birth,2021Q2,State,Number,
4547,Average age of mothers giving birth,2021Q3,State,Number,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4554 entries, 0 to 4553
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Statistic Label  4554 non-null   object 
 1   Quarter          4554 non-null   object 
 2   State            4554 non-null   object 
 3   UNIT             4554 non-null   object 
 4   VALUE            1826 non-null   float64
dtypes: float64(1), object(4)
memory usage: 178.0+ KB


In [4]:
df.describe()

Unnamed: 0,VALUE
count,1826.0
mean,3977.795016
std,5587.5585
min,0.0
25%,7.825
50%,20.05
75%,7348.25
max,19950.0
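
The summary above blends counts, rates and percentages into a single `VALUE` column, so the overall mean and quartiles are hard to interpret. A minimal sketch of a per-statistic summary follows. It uses a small stand-in frame rather than the CSO file: the marriage counts are the 1960 rows shown earlier, while the second label and its values are invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the CSO file. The marriage counts are the 1960 rows shown
# above; the "Example rate" label and its values are made up for illustration.
toy = pd.DataFrame({
    "Statistic Label": ["Marriages Registered"] * 4 + ["Example rate (made up)"] * 4,
    "VALUE": [2987.0, 3501.0, 5900.0, 3106.0, 5.1, 5.4, 6.0, 5.2],
})

# One summary row per statistic instead of a single blended summary.
per_label = toy.groupby("Statistic Label")["VALUE"].describe()
print(per_label[["count", "mean", "min", "max"]])
```

On the real frame, the same `groupby(...).describe()` call would give one row per entry of `Statistic Label`.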


In [5]:
df.isnull().mean()*100 # Percentage of missing data

Statistic Label     0.000000
Quarter             0.000000
State               0.000000
UNIT                0.000000
VALUE              59.903382
dtype: float64
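
Roughly 60% of `VALUE` is missing overall, but that figure does not say which statistics are affected. One way to break it down by label is sketched below, on a tiny stand-in frame (the real file is not loaded here; the two labels and the missing pattern mirror rows shown above).

```python
import numpy as np
import pandas as pd

# Toy stand-in: marriage counts are present, average age of mothers is missing,
# as in the rows printed earlier.
toy = pd.DataFrame({
    "Statistic Label": ["Marriages Registered"] * 2
    + ["Average age of mothers giving birth"] * 2,
    "VALUE": [2987.0, 3501.0, np.nan, np.nan],
})

# Share of missing values within each statistic, in percent.
missing_by_label = toy["VALUE"].isna().groupby(toy["Statistic Label"]).mean() * 100
print(missing_by_label.sort_values(ascending=False))
```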

In [6]:
df['Statistic Label'].unique()

array(['Marriages Registered',
       'Marriage Rate Registered per 1000 Estimated Population',
       'Deaths Registered',
       'Death Rate Registered per 1000 Estimated Population',
       'Deaths of Infants under 1 Year Registered per 1000 Births',
       'Births Registered',
       'Birth Rate Registered per 1000 Estimated Population',
       'Opposite sex marriages', 'Same sex marriages',
       'Civil partnerships', 'Births registered that were first births',
       'Births registered outside marriage', 'Opposite sex marriage rate',
       'Same sex marriage rate', 'Civil partnership marriage rate',
       'Percentage of first births registered',
       'Percentage of births registered outside marriage',
       'Average age of mothers giving birth'], dtype=object)

In [7]:
df['Quarter'].unique()

array(['1960Q1', '1960Q2', '1960Q3', '1960Q4', '1961Q1', '1961Q2',
       '1961Q3', '1961Q4', '1962Q1', '1962Q2', '1962Q3', '1962Q4',
       '1963Q1', '1963Q2', '1963Q3', '1963Q4', '1964Q1', '1964Q2',
       '1964Q3', '1964Q4', '1965Q1', '1965Q2', '1965Q3', '1965Q4',
       '1966Q1', '1966Q2', '1966Q3', '1966Q4', '1967Q1', '1967Q2',
       '1967Q3', '1967Q4', '1968Q1', '1968Q2', '1968Q3', '1968Q4',
       '1969Q1', '1969Q2', '1969Q3', '1969Q4', '1970Q1', '1970Q2',
       '1970Q3', '1970Q4', '1971Q1', '1971Q2', '1971Q3', '1971Q4',
       '1972Q1', '1972Q2', '1972Q3', '1972Q4', '1973Q1', '1973Q2',
       '1973Q3', '1973Q4', '1974Q1', '1974Q2', '1974Q3', '1974Q4',
       '1975Q1', '1975Q2', '1975Q3', '1975Q4', '1976Q1', '1976Q2',
       '1976Q3', '1976Q4', '1977Q1', '1977Q2', '1977Q3', '1977Q4',
       '1978Q1', '1978Q2', '1978Q3', '1978Q4', '1979Q1', '1979Q2',
       '1979Q3', '1979Q4', '1980Q1', '1980Q2', '1980Q3', '1980Q4',
       '1981Q1', '1981Q2', '1981Q3', '1981Q4', '1982Q1', '1982

In [8]:
df['State'].unique()

array(['State'], dtype=object)
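
A column with a single distinct value, like `State` here, carries no information for the analysis. Instead of inspecting each column by hand, such columns can be listed in one pass. The sketch below assumes a small stand-in frame with the same column names.

```python
import pandas as pd

# Toy stand-in with the same column names as the CSO file.
toy = pd.DataFrame({
    "Statistic Label": ["Marriages Registered", "Percentage of first births registered"],
    "State": ["State", "State"],
    "UNIT": ["Number", "%"],
})

# Columns whose values never vary (NaN counted as a value of its own).
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```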

In [9]:
df.drop(columns='State', inplace=True)
df.head()

Unnamed: 0,Statistic Label,Quarter,UNIT,VALUE
0,Marriages Registered,1960Q1,Number,2987.0
1,Marriages Registered,1960Q2,Number,3501.0
2,Marriages Registered,1960Q3,Number,5900.0
3,Marriages Registered,1960Q4,Number,3106.0
4,Marriages Registered,1961Q1,Number,2179.0


In [10]:
df['UNIT'].unique()

array(['Number', '%'], dtype=object)

In [11]:
df = df[df['UNIT'] != '%'].copy() # Keep only the 'Number' rows; .copy() avoids SettingWithCopyWarning in later in-place edits
df['UNIT'].unique()

array(['Number'], dtype=object)

In [12]:
df.drop(columns='UNIT', inplace=True)
df.head()

Unnamed: 0,Statistic Label,Quarter,VALUE
0,Marriages Registered,1960Q1,2987.0
1,Marriages Registered,1960Q2,3501.0
2,Marriages Registered,1960Q3,5900.0
3,Marriages Registered,1960Q4,3106.0
4,Marriages Registered,1961Q1,2179.0


In [13]:
df['label_lc'] = df['Statistic Label'].str.lower().str.split().str.join(' ') # Lower-case the labels and collapse repeated whitespace
df[['label_lc']].head()

Unnamed: 0,label_lc
0,marriages registered
1,marriages registered
2,marriages registered
3,marriages registered
4,marriages registered


In [14]:
df.rename(columns={"Quarter": "year", "VALUE": "value"}, inplace=True)

In [15]:
# Since we are analysing population change per year, we do not need the quarterly values. We therefore strip the
# quarter suffix (Q1, Q2, Q3 or Q4) from the column, which then lets us cast it to an integer.

df['year'] = df['year'].str[:-2] # Remove the 2 last characters
df['year']

0       1960
1       1960
2       1960
3       1960
4       1961
        ... 
4549    2022
4550    2022
4551    2022
4552    2022
4553    2023
Name: year, Length: 3289, dtype: object

In [16]:
df['year'] = df['year'].astype(int) # Cast the year column to integer
df.dtypes

Statistic Label     object
year                 int32
value              float64
label_lc            object
dtype: object
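
Slicing off the last two characters works because every label has the form `YYYYQn`. As an alternative sketch (not the approach used in this notebook), pandas can parse such strings as quarterly periods, which also keeps the quarter number available if it is needed later.

```python
import pandas as pd

# A few quarter labels in the same format as the CSO file.
quarters = ["1960Q1", "1960Q4", "1961Q1", "2023Q1"]

# Parse the strings as quarterly periods and read off year and quarter.
periods = pd.PeriodIndex(quarters, freq="Q")
parsed = pd.DataFrame({"year": periods.year, "quarter": periods.quarter})
print(parsed)
```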

In [17]:
df_group = df.groupby(['year','label_lc'])['value'].sum()
df_group.head(20)

year  label_lc                                                 
1960  average age of mothers giving birth                              0.0
      birth rate registered per 1000 estimated population             85.7
      births registered                                            60730.0
      births registered outside marriage                               0.0
      births registered that were first births                         0.0
      civil partnerships                                               0.0
      death rate registered per 1000 estimated population             45.9
      deaths of infants under 1 year registered per 1000 births      117.5
      deaths registered                                            32658.0
      marriage rate registered per 1000 estimated population          21.8
      marriages registered                                         15494.0
      opposite sex marriages                                           0.0
      same sex marriages            

In [18]:
# Keep only the three count series used in the rest of the analysis.
keep_labels = ['marriages registered', 'deaths registered', 'births registered']
df = df[df['label_lc'].isin(keep_labels)].copy()

df['label_lc'].unique()

array(['marriages registered', 'deaths registered', 'births registered'],
      dtype=object)

In [19]:
df['value'] = df['value'].astype(int) # Cast the value column to integer, since counts of births, deaths and marriages are whole numbers.
df.dtypes

Statistic Label    object
year                int32
value               int32
label_lc           object
dtype: object

In [20]:
# 2023 appears to contain only Q1 so far, so drop it to compare complete years only.
df = df.query('year != 2023')
df['year'].unique()

array([1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
       1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
       1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
       1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
       2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
       2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022])
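
Dropping 2023 keeps the yearly totals comparable, since a year with fewer than four quarters would look like a sharp decline once the quarters are summed. A hedged sketch of how incomplete years could be detected automatically is shown below, using a toy frame in place of `df`.

```python
import pandas as pd

# Toy stand-in for df: four quarterly rows for 2022, one for 2023.
toy = pd.DataFrame({
    "year": [2022, 2022, 2022, 2022, 2023],
    "label_lc": ["births registered"] * 5,
})

# Count the quarterly rows per statistic and year; fewer than 4 means incomplete.
quarters_per_year = toy.groupby(["label_lc", "year"]).size()
incomplete = quarters_per_year[quarters_per_year < 4]
incomplete_years = sorted(int(y) for y in incomplete.index.get_level_values("year").unique())
print(incomplete_years)
```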

In [21]:
df.head()

Unnamed: 0,Statistic Label,year,value,label_lc
0,Marriages Registered,1960,2987,marriages registered
1,Marriages Registered,1960,3501,marriages registered
2,Marriages Registered,1960,5900,marriages registered
3,Marriages Registered,1960,3106,marriages registered
4,Marriages Registered,1961,2179,marriages registered


In [22]:
df.drop(columns='Statistic Label', inplace=True)
df.head()

Unnamed: 0,year,value,label_lc
0,1960,2987,marriages registered
1,1960,3501,marriages registered
2,1960,5900,marriages registered
3,1960,3106,marriages registered
4,1961,2179,marriages registered


In [23]:
df_group = df.groupby(['year','label_lc'], as_index=False)['value'].sum()
df_group.head(20)

Unnamed: 0,year,label_lc,value
0,1960,births registered,60730
1,1960,deaths registered,32658
2,1960,marriages registered,15494
3,1961,births registered,59826
4,1961,deaths registered,34548
5,1961,marriages registered,15140
6,1962,births registered,61611
7,1962,deaths registered,33643
8,1962,marriages registered,15409
9,1963,births registered,62966
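
As a quick sanity check on the aggregation, the four quarterly marriage counts for 1960 shown earlier should add up to the annual figure of 15494. The sketch below reproduces that single group on its own.

```python
import pandas as pd

# The four 1960 quarterly marriage counts from the earlier df.head() output.
quarterly = pd.DataFrame({
    "year": [1960] * 4,
    "label_lc": ["marriages registered"] * 4,
    "value": [2987, 3501, 5900, 3106],
})

# Same aggregation as in the notebook: one row per (year, label).
annual = quarterly.groupby(["year", "label_lc"], as_index=False)["value"].sum()
print(annual)
```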


In [24]:
df_group['label_lc'] = df_group['label_lc'].str.replace(' ', '_') # e.g. 'births registered' -> 'births_registered'; assigning back avoids chained in-place replace
df_group.head()

Unnamed: 0,year,label_lc,value
0,1960,births_registered,60730
1,1960,deaths_registered,32658
2,1960,marriages_registered,15494
3,1961,births_registered,59826
4,1961,deaths_registered,34548


In [25]:
df_group.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   year      189 non-null    int32 
 1   label_lc  189 non-null    object
 2   value     189 non-null    int32 
dtypes: int32(2), object(1)
memory usage: 3.1+ KB


In [26]:
df_group.isnull().mean()*100 # Percentage of missing data

year        0.0
label_lc    0.0
value       0.0
dtype: float64
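
With a clean long-format table, a natural next step towards the correlation objective is to reshape it so that each statistic becomes its own column. The sketch below is one possible way to do that, not necessarily the approach taken later in the notebook. It uses the 1960 to 1962 totals from the `df_group` preview above; the `natural_increase` column (births minus deaths) is a name introduced here for illustration and ignores migration.

```python
import pandas as pd

# Annual totals copied from the df_group preview above (1960-1962).
long_df = pd.DataFrame({
    "year": [1960] * 3 + [1961] * 3 + [1962] * 3,
    "label_lc": ["births_registered", "deaths_registered", "marriages_registered"] * 3,
    "value": [60730, 32658, 15494, 59826, 34548, 15140, 61611, 33643, 15409],
})

# One row per year, one column per statistic.
wide = long_df.pivot(index="year", columns="label_lc", values="value")

# Pairwise Pearson correlation between the three series.
corr = wide[["births_registered", "deaths_registered", "marriages_registered"]].corr()

# Illustrative derived column: births minus deaths (migration is not included).
wide["natural_increase"] = wide["births_registered"] - wide["deaths_registered"]

print(wide)
print(corr)
```

With only three years the correlation values are not meaningful; on the full 1960 to 2022 table the same calls would apply unchanged.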