# Birth by Age of Parents NS Project

This is a brief study of birth rates in Nova Scotia.

Here we explore the dataset provided by the Government of Nova Scotia (https://open.canada.ca/data/en/dataset/a958d7f7-9317-d43f-8acd-77cd2722c6f5), which records the number of births per year by age of the parents since 2016.

### Imports

In [1]:
import pandas as pd
import numpy as np
from plotly import __version__
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
import plotly.graph_objects as go
%matplotlib inline

init_notebook_mode(connected=True)

cf.go_offline()

### Importing the DataFrame

In [2]:
df = pd.read_csv('NS_Births_by_Age_of_Parents.csv')
df.head()

Unnamed: 0,AGE OF MOTHER,AGE OF FATHER,YEAR,COUNT
0,12,<20,2016,0
1,12,20-24,2016,0
2,12,25-29,2016,0
3,12,30-34,2016,0
4,12,35-39,2016,0


## First Impressions & Data Cleaning

At first glance, this dataset seems pretty straightforward. Let's take a look at each column.

AGE OF MOTHER (str) --> ranges over [12, 59].

AGE OF FATHER (str) --> unlike the mother's age, the father's age is not an exact number but an age range (for example, 20-24), starting at <20 and going up to 65+.

YEAR (numpy int) --> ranges over [2016, 2022].

COUNT (numpy int) --> the number of babies born in that specific year for the particular combination of mother's and father's age.

If you take a good look, the 'COUNT' column has some zero values, which are irrelevant for this part of the study. The 'Not Stated' ages are going to be discarded as well.

In [3]:
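# Assigning a filtered Series back to its column leaves NaN wherever the
# condition fails (zero counts, 'Not Stated' ages); those rows are dropped
# in the next cell. The NaN values are also why COUNT becomes a float.
df['COUNT'] = df['COUNT'][df['COUNT'] != 0]
df['AGE OF FATHER'] = df['AGE OF FATHER'][df['AGE OF FATHER'] != 'Not Stated']
df['AGE OF MOTHER'] = df['AGE OF MOTHER'][df['AGE OF MOTHER'] != 'Not Stated']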

In [4]:
# Dropping the rows marked as NaN above and resetting the index column
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,AGE OF MOTHER,AGE OF FATHER,YEAR,COUNT
0,15,<20,2016,3.0
1,16,<20,2016,10.0
2,16,20-24,2016,2.0
3,17,<20,2016,15.0
4,17,20-24,2016,9.0


That's a better dataframe to work with. Let's explore it a bit.
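As an aside, the same cleaning can be sketched as a single boolean filter, which avoids the intermediate NaN values and keeps `COUNT` as an integer. This is a minimal sketch on made-up toy rows that only borrow the column names of the real file; the names `toy`, `keep` and `clean` are introduced here for illustration.

```python
import pandas as pd

# Toy rows with the same column names as the real CSV (values are made up)
toy = pd.DataFrame({
    'AGE OF MOTHER': ['12', '15', '16', 'Not Stated', '17'],
    'AGE OF FATHER': ['<20', '<20', 'Not Stated', '20-24', '20-24'],
    'YEAR': [2016, 2016, 2016, 2016, 2016],
    'COUNT': [0, 3, 5, 2, 9],
})

# Keep only rows with a non-zero count and both ages stated
keep = (
    (toy['COUNT'] != 0)
    & (toy['AGE OF FATHER'] != 'Not Stated')
    & (toy['AGE OF MOTHER'] != 'Not Stated')
)
clean = toy[keep].reset_index(drop=True)
print(clean)
```

Because no NaN is ever written into the column, `COUNT` stays an integer column instead of being converted to float.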

## Births per Year


In [5]:
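byYear = df.groupby('YEAR')
# Select COUNT explicitly: the first positional argument of sum() is numeric_only, not a column name
byYear_count = byYear[['COUNT']].sum()
byYear_count.iplot(kind='scatter', xTitle='Year', yTitle='Total Births', title='Births per Year in NS', mode='markers')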

In [6]:
# Percentage decrease from 2016 to 2022
# (1 - byYear_count['COUNT'][2022] / byYear_count['COUNT'][2016]).round(2)
# 0.11

The plot above shows that the number of babies born in Nova Scotia has been decreasing since 2016, going from 8275 in 2016 to 7406 in 2022, a drop of about 11%. The only exception occurred from 2018 to 2019, when the number of births went from 7611 to 7857.
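The percentages behind those figures can be reproduced with a few lines. This is a minimal sketch that uses only the four yearly totals quoted above (the other years are deliberately left out); the helper `pct_change_between` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Yearly totals quoted in the text (other years intentionally omitted)
births = pd.Series({2016: 8275, 2018: 7611, 2019: 7857, 2022: 7406}, name='COUNT')

def pct_change_between(series, start, end):
    """Percentage change from the value at `start` to the value at `end`."""
    return (series.loc[end] / series.loc[start] - 1) * 100

overall = pct_change_between(births, 2016, 2022)
rebound = pct_change_between(births, 2018, 2019)

print(f"2016 -> 2022: {overall:+.1f}%")
print(f"2018 -> 2019: {rebound:+.1f}%")
```

The overall change comes out at roughly -10.5%, which rounds to the 11% quoted above, while the 2018 to 2019 rebound is a bit over 3%.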

## Mother focused study

An interesting question to ask regarding the dataset is: 'What is the distribution of the number of births by the mother's age?'. With this dataset we can easily identify a distribution that resembles a Gaussian, like many other examples in nature.

For that, let's take a look at the distribution across every record in the dataset.


In [7]:
# Select COUNT explicitly so that only the birth counts are summed
df_mother = df.groupby('AGE OF MOTHER')[['COUNT']].sum()
df_mother.iplot(kind='bar', y='COUNT', title="Distribution of Number of Births by Mother's Age from 2016 to 2022", xTitle="Mother's Age", yTitle="Number of Births")

It is striking how the same patterns show up in different parts of nature. In this case, the distribution looks pretty close to a Gaussian. The most common age to have a child is 31, with a total of 3975 births over the last 7 years.
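One rough way to put numbers on the "close to a Gaussian" impression is to compute the count-weighted mean and standard deviation of the mother's age. The sketch below assumes a small, made-up, symmetric set of counts rather than the real totals; `ages` and `counts` are illustrative names only.

```python
import numpy as np

# Made-up ages and birth counts, symmetric around 31 (not the real data)
ages = np.array([29, 30, 31, 32, 33])
counts = np.array([1, 2, 4, 2, 1])

# Each age is weighted by the number of births recorded for it
mean_age = np.average(ages, weights=counts)
variance = np.average((ages - mean_age) ** 2, weights=counts)
std_age = np.sqrt(variance)

print(f"weighted mean age: {mean_age:.2f}")
print(f"weighted std dev:  {std_age:.2f}")
```

With the real data, `ages` would come from the index of `df_mother` converted to integers and `counts` from its `COUNT` column; for a roughly Gaussian shape, the mean should sit near the peak of the bar chart.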

### Min and Max

It's important to notice that the youngest mother was a 13-year-old girl, in 2022, and, funny enough (not funny at all), the father's age was not stated. The oldest mother was 58, in 2018.

### Teenager Pregnancy Rates

Looking at the number of teenage mothers over the last 7 years, we can conclude that only 1.37% of the births were to teenagers from 13 to 18 years old. Honestly, it is a lower rate than I expected, perhaps Canada 

In [None]:
# Min and Max

#df[df['AGE OF MOTHER'] == df['AGE OF MOTHER'].min()]
#df[df['AGE OF MOTHER'] == df['AGE OF MOTHER'].max()]


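# Teenage Pregnancy

total = df_mother['COUNT'].sum()

# Mother's ages are stored as strings; reindex with fill_value=0 so that
# ages with no recorded births count as zero instead of raising a KeyError
teen_ages = [str(age) for age in range(13, 19)]
teen_total = df_mother['COUNT'].reindex(teen_ages, fill_value=0).sum()

teen_total / total * 100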
    

## Father's Focused Study

To make the data easier to work with, I'll group it by the father's age range.

In [21]:
df_father = df.groupby('AGE OF FATHER')
# Select COUNT explicitly so that only the birth counts are summed
df_father[['COUNT']].sum().iplot(kind='bar', y='COUNT', title="Distribution of Number of Births by Father's Age from 2016 to 2022",
                                 xTitle="Father's Age", yTitle="Number of Births")

## Findings

The most common age interval for a man to have children is from 30 to 34 years old, which matches the most common age for the mothers, so couples in their early thirties are the ones who have had the most children since 2016.

To perceive a Gaussian distribution here, the data would need to record the father's exact age rather than an age interval.

The bar for fathers younger than 20 is misplaced in the plot: the father's age is a string, and '<20' sorts after the numeric ranges because the character '<' compares greater than any digit.
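One possible way to fix that ordering is to turn the father's age into an ordered categorical before grouping. The sketch below runs on made-up rows; the list `age_order` is an assumption, since only some of the intervals ('<20', '20-24', '25-29', '30-34', '35-39', '45-49', '65+') actually appear in this notebook, and it should be checked against the real file.

```python
import pandas as pd

# Assumed full list of age groups, youngest first (verify against the CSV)
age_order = ['<20', '20-24', '25-29', '30-34', '35-39', '40-44',
             '45-49', '50-54', '55-59', '60-64', '65+']

# Made-up rows that only borrow the real column names
toy = pd.DataFrame({
    'AGE OF FATHER': ['30-34', '<20', '65+', '20-24', '<20'],
    'COUNT': [5, 1, 2, 3, 4],
})

# An ordered categorical sorts by position in age_order, not alphabetically
toy['AGE OF FATHER'] = pd.Categorical(
    toy['AGE OF FATHER'], categories=age_order, ordered=True
)

# observed=True keeps only the age groups that actually occur in the data
by_father = toy.groupby('AGE OF FATHER', observed=True)['COUNT'].sum()
print(by_father)
```

With this conversion, '<20' comes first in the grouped result, so a bar chart built from it would show the age groups in their natural order.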


## Mother's count depending on the Father's age

We do know that the age gap in a couple says a lot about the relationship. Therefore, it's interesting to analyze how the mother's age is distributed depending on the father's. In other words, let's inspect the count of mothers by age for each father's age group.

In [24]:
father_groups = ['25-29', '35-39', '45-49']
fig = make_subplots(rows=3, cols=1, subplot_titles=[f"Father from {group}" for group in father_groups])

for row, group in enumerate(father_groups, start=1):
    # Total births per mother's age within this father's age group
    by_mother = df_father.get_group(group).groupby('AGE OF MOTHER')['COUNT'].sum().reset_index()

    # add_trace replaces the deprecated append_trace
    fig.add_trace(go.Bar(
        x=by_mother['AGE OF MOTHER'],
        y=by_mother['COUNT'],
        name=f'Father from {group}'
    ), row=row, col=1)

    fig.update_xaxes(title_text="Mother's Age", row=row, col=1)
    fig.update_yaxes(title_text="Count", row=row, col=1)

fig.update_layout(height=800, width=950, title_text="Quantity of Mothers for different Father's ages")


It is noticeable that the father's age is usually close to the mother's. However, the data show that older fathers tend to have children with younger women. For example, fathers from 25-29 usually have kids with a woman around their own age, whereas older fathers, like those from 45-49, most often have kids with women from 35 to 39, a difference of about 10 years.

Nevertheless, for every father's age group we can find young mothers in their early 20s (or even younger). What does this say about our society? Why are men getting involved with younger women?
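The age gap described above could also be summarized numerically, for instance as the count-weighted mean of the mother's age within each father's age group. This is a minimal sketch on made-up rows (not the real counts); `toy`, `weighted`, `totals` and `mean_mother_age` are illustrative names.

```python
import pandas as pd

# Made-up rows that only borrow the real column names
toy = pd.DataFrame({
    'AGE OF MOTHER': ['26', '28', '36', '40'],
    'AGE OF FATHER': ['25-29', '25-29', '45-49', '45-49'],
    'COUNT': [3, 1, 2, 2],
})

# The mother's age is stored as a string, so convert it before doing arithmetic
toy['AGE OF MOTHER'] = toy['AGE OF MOTHER'].astype(int)

# Sum of (age * births) and sum of births per father's age group
weighted = (toy['AGE OF MOTHER'] * toy['COUNT']).groupby(toy['AGE OF FATHER']).sum()
totals = toy.groupby('AGE OF FATHER')['COUNT'].sum()

mean_mother_age = weighted / totals
print(mean_mother_age)
```

Comparing each mean with the father's own interval gives a single number per group for the age gap, which complements the visual reading of the subplots.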