# **IE0005 Mini-Project**

Dataset from Data.gov.sg : **"Birth and Fertility in Singapore"**

Source: https://data.gov.sg/dataset/births-and-fertility-annual

Done By: ***Marcus Lim***, ***Lau Nan Feng***, ***Dorelle Lua***, ***Chua Zong Lin***

---

**Problem**

**1)** *Predicting the trend of Singapore's birth count*

**2)** *With the declining trend and birthrate, will it ever hit 0 and below?*

---

## **Essential Libraries**

>**NumPy** : *Library for Numeric Computations in Python*  
>**Pandas** : *Library for Data Acquisition and Preparation*  
>**Matplotlib** : *Low-level library for Data Visualization*  
>**Seaborn** : *Higher-level library for Data Visualization*  
>**DecisionTreeClassifier** : *Non-parametric supervised learning method for Classification*  
>**plot_tree** : *Plot and Visualise Decision Trees* 


In [None]:
# Importing the libraries

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

## **Data Wrangling**

As a first step, we load each dataset into its own DataFrame and exclude the rows that are not relevant to our problem (for example, the rows in df5 that are not needed when modelling the Total Fertility Rate).

In [None]:
# Importing the datasets

live_birth_dataset = pd.read_csv('live-births.csv')
crude_birth_dataset = pd.read_csv('crude-birth-rate.csv')
age_f_rate_dataset = pd.read_csv('age-specific-fertility-rate.csv')
e_fert_rate_dataset = pd.read_csv('total-fertility-rate-by-ethnic-group.csv')
fert_and_rep_rate_dataset = pd.read_csv('total-fertility-rate-and-reproduction-rate.csv')

**Live Birth Dataset**

Live Birth refers to the number of babies born alive in Singapore each year, recorded since the 1960s.

In [None]:
# Dataset 1

live_birth_dataset.head()

**Crude Birth Dataset**

Crude Birth refers to the number of live births per 1000 population in a given year.
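
As a quick illustration of this definition, the crude birth rate can be computed from a birth count and a population figure. The numbers below are made-up round values, not taken from the dataset, and `crude_birth_rate` is a hypothetical helper named here only for the example.

```python
def crude_birth_rate(live_births, mid_year_population):
    """Live births per 1,000 population in a given year."""
    return live_births / mid_year_population * 1000

# Hypothetical figures for illustration only (not from the dataset)
rate = crude_birth_rate(live_births=40_000, mid_year_population=4_000_000)
print(rate)
```

A rate expressed per 1,000 population makes years with different population sizes comparable, which a raw birth count does not.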

In [None]:
# Dataset 2

crude_birth_dataset.head()

**Age-Specific Fertility Rate Dataset**

Age-Specific Fertility Rate refers to the number of live births per 1,000 females in a particular age group in a given year.

In [None]:
# Dataset 3

age_f_rate_dataset.head()

**Fertility Rate by Ethnic Group Dataset**

Total Fertility Rate by Ethnic Group refers to the average number of live births that a female from each ethnic group would have during her reproductive years.



In [None]:
# Dataset 4

e_fert_rate_dataset.head()

**Fertility and Reproduction Rate Dataset**

Total Fertility Rate refers to the average number of live births that a female would have during her reproductive years.
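
To make the definition concrete, here is a minimal sketch of how a total fertility rate is conventionally built up from age-specific fertility rates, assuming five-year age groups and rates expressed per 1,000 females. The rates below are hypothetical, and this notebook does not state how the published figures were actually derived.

```python
# Hypothetical age-specific fertility rates (births per 1,000 females per year)
asfr_per_1000 = {
    "15-19": 3.0, "20-24": 15.0, "25-29": 60.0, "30-34": 90.0,
    "35-39": 45.0, "40-44": 8.0, "45-49": 1.0,
}

AGE_GROUP_WIDTH = 5  # each female spends 5 years in each age group

# Sum the yearly rates over a lifetime, then convert from per-1,000 to per-female
tfr = AGE_GROUP_WIDTH * sum(asfr_per_1000.values()) / 1000
print(round(tfr, 2))
```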

In [None]:
# Dataset 5

fert_and_rep_rate_dataset.head()

**Create DataFrame and Renaming of Columns**

Assign the Live Birth Dataset to DataFrame1, rename the columns so that they describe the data, and drop the 'Resident Live-births' rows.

In [None]:
df1 = pd.DataFrame(live_birth_dataset)

df1.rename(columns = {'year': 'Year', 'level_1': 'Type of Live Birth', 'value': 'Number'}, inplace = True)

# Keep everything except resident live-births, then renumber the rows
df1 = df1[df1['Type of Live Birth'] != 'Resident Live-births'].reset_index(drop=True)

df1.head()

Assign the Crude Birth Dataset to DataFrame2 and rename the columns so that they describe the data.

In [None]:
df2 = pd.DataFrame(crude_birth_dataset)

df2.rename(columns = {'year': 'Year', 'level_1': 'Fertility Rate', 'value': 'Number'}, inplace = True)

df2.head()

Assign the Age-Specific Fertility Rate Dataset to DataFrame3 and rename the columns so that they describe the data.

In [None]:
df3 = pd.DataFrame(age_f_rate_dataset)

df3.rename(columns = {'year': 'Year', 'level_1': 'Fertility Rate', 'level_2': 'Age Range','value': 'Number'}, inplace = True)

df3.head()

Assign the Fertility Rate by Ethnic Group Dataset to DataFrame4 and rename the columns so that they describe the data.

In [None]:
df4 = pd.DataFrame(e_fert_rate_dataset)

df4.rename(columns = {'year': 'Year', 'level_1': 'Type of Live Birth', 'level_2': 'Ethnic Group', 'value': 'Number'}, inplace = True)

df4.head()

Assign the Fertility and Reproduction Rate Dataset to DataFrame5 and rename the columns so that they describe the data.

In [None]:
df5 = pd.DataFrame(fert_and_rep_rate_dataset)

df5.rename(columns = {'year': 'Year', 'level_1': 'Total Fertility Rate', 'value': 'Number'}, inplace = True)

df5.head(10)

**Conversion of Data Types**

Convert Object to Numeric (Integer) Data Types for Number

In [None]:
df1.dtypes

In [None]:
df1['Number'] = pd.to_numeric(df1['Number'])

df1.dtypes

In [None]:
df2.dtypes

Convert Object to Numeric (Float) Data Types for Number and drop any null values

In [None]:
df3.dtypes

In [None]:
# Coerce first (unparseable entries become NaN), then drop the incomplete rows
df3['Number'] = pd.to_numeric(df3['Number'], errors='coerce')
df3 = df3.dropna()
df3.dtypes
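
The order of the two steps matters: `errors='coerce'` turns any unparseable entry into NaN, so rows should be dropped after the conversion rather than before it. A minimal sketch on a made-up column (the `"na"` placeholder is an assumption for illustration, not necessarily what the dataset contains):

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings, with one placeholder entry
raw = pd.DataFrame({"Year": [1960, 1961, 1962], "Number": ["5.76", "na", "5.21"]})

clean = raw.copy()
clean["Number"] = pd.to_numeric(clean["Number"], errors="coerce")  # "na" becomes NaN
clean = clean.dropna(subset=["Number"]).reset_index(drop=True)     # drop the NaN row

print(clean.dtypes)
print(clean)
```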


In [None]:
df4.dtypes

In [None]:
df5.dtypes

## **Data Exploration and Analysis**



**Time Series Graph for DataFrame1**

The number of babies has been decreasing steadily since the 1960s. However, there is a sudden spike in the number of babies in 1987. This coincides with the Singapore government starting to introduce policies to raise fertility rates in 1987. The policies fall into 3 main categories: financial incentives, support for parents to combine work and family, and marriage encouragement.

Financial incentives: The government began offering cash payments to parents in the 2000s. It also offers tax rebates for working mothers, medical insurance for their children and various housing subsidy schemes.

Support for parents to combine work and family: The government increased paid maternity leave from 8 to 12 weeks in 2004 and from 12 to 16 weeks in 2008. It also subsidises centre-based childcare to help working mothers who rely on childcare while they are at work.

Marriage encouragement: The government seeks to promote marriage through housing policies that offer various inducements to Singaporeans who plan to marry.

In [None]:
# Time series plot of df1

plt.figure(figsize=(10,10))

plt.plot(df1.Year, df1.Number)

plt.xlabel('Year')
plt.ylabel('Number of babies')

**Time Series Graph for DataFrame 4**

Fertility rates for all ethnic groups have generally been declining over the past few decades. In 2018, the fertility rate for Malays was 1.85, while the fertility rates for Chinese and Indians were 0.98 and 1.00 respectively. Overall, the fertility rate for Malays was approximately 85% to 89% higher than that of Chinese and Indians.

This may be because Malay family members are often willing to take care of the children while the mothers are working in the day, and because the Malay community is very close-knit compared to the Chinese and Indian communities.
Malays typically place great value and emphasis on the family and on having families. They do not dismiss the importance of career progression and financial security, but they typically value their families much more.
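
The percentage comparison above can be checked with a few lines of arithmetic. The three rates are the 2018 figures quoted in the text; `pct_higher` is a hypothetical helper introduced only for this check.

```python
# 2018 total fertility rates as quoted in the text above
tfr_2018 = {"Malays": 1.85, "Chinese": 0.98, "Indians": 1.00}

def pct_higher(a, b):
    """How much higher a is than b, in percent."""
    return (a / b - 1) * 100

for group in ("Chinese", "Indians"):
    diff = pct_higher(tfr_2018["Malays"], tfr_2018[group])
    print(f"Malays vs {group}: {diff:.1f}% higher")
```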

In [None]:
# Time series plot of df4

chinese = df4[(df4['Ethnic Group'] == 'Chinese')]
indian = df4[(df4['Ethnic Group'] == 'Indians')]
malay = df4[(df4['Ethnic Group'] == 'Malays')]

plt.figure(figsize=(10,10))

plt.plot(chinese.Year, chinese.Number)
plt.plot(indian.Year, indian.Number, color='green')
plt.plot(malay.Year, malay.Number, color='red')

plt.legend(['Chinese', 'Indians', 'Malays'])

plt.xlabel('Year')
plt.ylabel('Total fertility rate')

**Time Series Graph for DataFrame 5**

Fertility rates have generally been declining, from 5.76 to 1.14, over the past few decades. This is largely due to the rising proportion of singles, later marriages and married couples having fewer children.

Over the past 10 years, the proportion of singles has increased across all age groups, especially among those aged 25 to 34. The proportion of singles among men and women aged 25 to 29 went up from 74.6% to 81.6% and from 54% to 69% respectively. The proportion of singles among men and women aged 30 to 34 went up from 37.1% to 41.9% and from 25.1% to 32.8% respectively.
Therefore, many Singaporeans are marrying at a later age.

Married couples are also having fewer children compared to 10 years ago. The proportion of women aged 30 to 39 who have 2 children went down from 36.2% to 33.6%, while the proportion of women aged 30 to 39 who have only 1 child went up from 29.4% to 29.7%.
The main reasons why married couples are having fewer children are the high cost of raising a child in Singapore and the desire to defer marriage and parenthood to focus on career progression.

In [None]:
# Time series plot of df5

gross = df5[(df5['Total Fertility Rate'] == 'Gross Reproduction Rate')]
net = df5[(df5['Total Fertility Rate'] == 'Net Reproduction Rate')]
total = df5[(df5['Total Fertility Rate'] == 'Total Fertility Rate')]

plt.figure(figsize=(10,10))

plt.plot(gross.Year, gross.Number)
plt.plot(net.Year, net.Number, color='green')
plt.plot(total.Year, total.Number, color='red')

plt.legend(['Gross Reproduction Rate', 'Net Reproduction Rate', 'Total Fertility Rate'])

plt.xlabel('Year')
plt.ylabel('Rate')

**Swarm Plot for DataFrame 3**

There has been a rising trend in the fertility rate of women aged 30 to 34 since the 2000s.
This is because more and more people are marrying and giving birth at a later age.

In [None]:
f = plt.figure(figsize=(30, 6))
sb.swarmplot(data=df3, x="Year", y="Number", hue="Age Range")

**Bar Plot for DataFrame 3**

There has been a rising trend in the fertility rate of women aged 30 to 34 since the 2000s. This is because more and more people are marrying and giving birth at a later age.

In [None]:
f = plt.figure(figsize=(30, 6))
sb.barplot(data=df3, x="Year", y="Number", hue="Age Range")

**Swarm Plot for DataFrame 4**

The fertility rate for Malays is generally higher than that of the Chinese and Indians. This may be because Malays tend to prioritize family over career progression and financial security.

In [None]:
f = plt.figure(figsize=(30, 6))
sb.swarmplot(data=df4, x="Year", y="Number", hue="Ethnic Group")

**Bar Plot for DataFrame 4**

The fertility rate for Malays is generally higher than that of the Chinese and Indians. This may be because Malays tend to prioritize family over career progression and financial security.

In [None]:
f = plt.figure(figsize=(30, 6))
sb.barplot(data=df4, x="Year", y="Number", hue="Ethnic Group")

**Swarm Plot for DataFrame 5**

The Total Fertility Rate has been decreasing rapidly over the years.

The Gross and Net Reproduction Rates have decreased slightly and remain fairly constant at around 0.56.

In [None]:
f = plt.figure(figsize=(30, 6))
sb.swarmplot(data=df5, x="Year", y="Number", hue="Total Fertility Rate")

**Bar Plot for DataFrame5**

The Total Fertility Rate has been decreasing rapidly over the years.

The Gross and Net Reproduction Rates have decreased slightly and remain fairly constant at around 0.56.

In [None]:
f = plt.figure(figsize=(30, 6))
sb.barplot(data=df5, x="Year", y="Number", hue="Total Fertility Rate")

In [None]:
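# Keep only the Total Fertility Rate rows; .copy() lets us add columns later without a SettingWithCopyWarning
totalfertilityrate = df5[df5["Total Fertility Rate"] == 'Total Fertility Rate'].copy()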

totalfertilityrate.head()

## **Linear Regression**

### **Uni-Variate Linear Regression**

Response: ***Number of Live Births***  
Predictor: ***Year***


The explained variance (R^2) for the train data is approximately 0.34 (the exact value changes from run to run because the train/test split is random).

Since the explained variance (R^2) is far from 1, the linear model explains only about a third of the variation in the response and is not very accurate in predicting it.

Therefore, the predictor variable (Year) alone does not do a good job of explaining the variation of the response variable (Number of Live Births).
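
For readers unfamiliar with R^2, the sketch below computes it by hand as 1 minus the ratio of unexplained to total variation. It uses synthetic data generated for illustration only, not the notebook's dataset, and `np.polyfit` stands in for the least-squares fit.

```python
import numpy as np

# Synthetic data for illustration only: a noisy downward trend
rng = np.random.default_rng(0)
x = np.arange(1960, 2019)
y = 60000 - 300 * (x - 1960) + rng.normal(0, 5000, size=x.size)

# Ordinary least-squares line through the points
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x

ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

An R^2 of 0.34 therefore means roughly two thirds of the variation in the response is left unexplained by the straight line.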

In [None]:
# Import essential models and functions from sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Predictor (Year) and response (number of live births)
X = pd.DataFrame(df1['Year'])
y = pd.DataFrame(df1['Number'])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Linear Regression using Train Data
linreg = LinearRegression()         # create the linear regression object
linreg.fit(X_train, y_train)        # train the linear regression model

# Predict the number of live births for the train and test years
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)

# Formula for the Regression line
regline_x = X_train
regline_y = linreg.intercept_ + linreg.coef_ * X_train

# Explained Variance (R^2)
print("Explained Variance (R^2) \t:", linreg.score(X_train, y_train))

# Plot the Predictions vs the True values
plt.figure(figsize=(10,10))
plt.scatter(X_train, y_train, color = "blue")
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth = 1)
plt.show()

The explained variance (R^2) for the test data is approximately 0.47.

Since the explained variance (R^2) is not close to 1, the model is not very accurate in predicting the response on unseen data either.

Therefore, the predictor variable (Year) alone does not do a good job of explaining the variation of the response variable (number of live births).

In [None]:
# Explained Variance (R^2)
print("Explained Variance (R^2) \t:", linreg.score(X_test, y_test))

# Plot the Predictions
f = plt.figure(figsize=(10, 10))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()
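
Because `train_test_split` shuffles the rows randomly, the R^2 values quoted in this section will differ somewhat on every run. Below is a minimal sketch of how a fixed `random_state` makes the split repeatable. It uses synthetic data rather than the real dataset, and `holdout_r2` and the seed values are illustrative choices that are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a (Year, Number) series; not the real dataset
rng = np.random.default_rng(1)
X = np.arange(1960, 2019).reshape(-1, 1)
y = 60000 - 300 * (X.ravel() - 1960) + rng.normal(0, 5000, size=len(X))

def holdout_r2(seed):
    """Fit on a 75/25 split drawn with the given seed and return the test R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

print([round(holdout_r2(s), 3) for s in range(5)])  # changes from split to split
print(holdout_r2(42) == holdout_r2(42))             # same seed, same split, same score
```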

Response: ***Crude Birth Rates***  
Predictor: ***Year***

The explained variance (R^2) for the train data is approximately 0.83.

Since the explained variance (R^2) is fairly close to 1, the model fits the response well.

Therefore, the model does a good job of using the predictor variable (Year) to explain the variation of the response variable (Crude Birth Rates).

In [None]:
# Import essential models and functions from sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Predictor (Year) and response (crude birth rate)
X = pd.DataFrame(df2['Year'])
y = pd.DataFrame(df2['Number'])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Linear Regression using Train Data
linreg2 = LinearRegression()         # create the linear regression object
linreg2.fit(X_train, y_train)        # train the linear regression model

# Predict the crude birth rate for the train and test years
y_train_pred = linreg2.predict(X_train)
y_test_pred = linreg2.predict(X_test)

# Formula for the Regression line
regline_x = X_train
regline_y = linreg2.intercept_ + linreg2.coef_ * X_train

# Explained Variance (R^2)
print("Explained Variance (R^2) \t:", linreg2.score(X_train, y_train))

# Plot the Predictions vs the True values
plt.figure(figsize=(10,10))
plt.scatter(X_train, y_train, color = "blue")
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth = 1)
plt.show()

The explained variance (R^2) for the test data is approximately 0.81.

Since the explained variance (R^2) is fairly close to 1, the model also predicts the response well on unseen data.

Therefore, the model does a good job of using the predictor variable (Year) to explain the variation of the response variable (Crude Birth Rates).

In [None]:
# Plot the Predictions
f = plt.figure(figsize=(10, 10))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()

In [None]:
# Explained Variance (R^2)
print("Explained Variance (R^2) \t:", linreg2.score(X_test, y_test))

Response: ***Total Fertility Rate***  
Predictor: ***Year***

The explained variance (R^2) for the train data is approximately 0.69.

Since the explained variance (R^2) is only moderately close to 1, the model is moderately accurate in predicting the response.

Therefore, the model does a reasonable job of using the predictor variable (Year) to explain the variation of the response variable (Total Fertility Rate).

In [None]:
# Import essential models and functions from sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Predictor (Year) and response (total fertility rate)
X = pd.DataFrame(totalfertilityrate['Year'])
y = pd.DataFrame(totalfertilityrate['Number'])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Linear Regression using Train Data
linreg3 = LinearRegression()         # create the linear regression object
linreg3.fit(X_train, y_train)        # train the linear regression model

# Predict the total fertility rate for the train and test years
y_train_pred = linreg3.predict(X_train)
y_test_pred = linreg3.predict(X_test)

# Formula for the Regression line
regline_x = X_train
regline_y = linreg3.intercept_ + linreg3.coef_ * X_train

# Explained Variance (R^2)
print("Explained Variance (R^2) \t:", linreg3.score(X_train, y_train))

# Plot the Predictions vs the True values
plt.figure(figsize=(10,10))
plt.scatter(X_train, y_train, color = "blue")
plt.plot(regline_x.to_numpy(), regline_y.to_numpy(), 'r-', linewidth = 1)
plt.show()

The explained variance (R^2) for the test data is approximately 0.65.

Since the explained variance (R^2) is only moderately close to 1, the model is moderately accurate in predicting the response on unseen data.

Therefore, the model does a reasonable job of using the predictor variable (Year) to explain the variation of the response variable (Total Fertility Rate).

In [None]:
# Plot the Predictions
f = plt.figure(figsize=(10, 10))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()

In [None]:
# Explained Variance (R^2)
print("Explained Variance (R^2) \t:", linreg3.score(X_test, y_test))

## **Forecast**
### **Using Double Exponential Smoothing to forecast**

Double Exponential Smoothing is a forecasting technique that tracks both a level and a trend, using alpha and gamma as the smoothing weights applied to the previous data. As our data only runs until 2018, we extend the series with predictions from the LinearRegression model, linreg, and feed these to the algorithm in order to produce the forecast.
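
For reference, the textbook level and trend updates can be sketched as a small standalone function. This is a sketch of the standard method under stated assumptions, not the exact procedure used in this notebook: it extrapolates from the final level and trend instead of feeding regression predictions back in, and `double_exp_smoothing` and the sample series are hypothetical.

```python
def double_exp_smoothing(series, alpha, gamma, horizon):
    """Return one-step-ahead fitted values and `horizon` future forecasts."""
    level = series[0]
    trend = series[1] - series[0]
    fitted = [series[0]]  # no forecast exists for the first point; reuse the observation
    for t in range(1, len(series)):
        fitted.append(level + trend)  # forecast for time t, made at time t-1
        new_level = alpha * series[t] + (1 - alpha) * (level + trend)
        trend = gamma * (new_level - level) + (1 - gamma) * trend
        level = new_level
    # h-step-ahead forecast: last level plus h times the last trend
    forecasts = [level + h * trend for h in range(1, horizon + 1)]
    return fitted, forecasts

# Hypothetical series with a perfectly linear decline of 2 per step
fitted, forecasts = double_exp_smoothing([100, 98, 96, 94, 92], alpha=0.8, gamma=0.2, horizon=3)
print(forecasts)
```

On a perfectly linear series the method reproduces the line, which is a convenient sanity check before applying it to noisy data.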

In [None]:
# Double exponential smoothing is implemented by hand below, so no extra libraries are needed

y = list(df1['Number'])

alpha = 1
gamma = 0.05
period = 6

ten_pred_year = []
ten_year = []
exp_pred = []
pct_var = []

first_pred_year = 2018

for years in range(period):
  first_pred_year += 1
  ten_pred_year.append(first_pred_year)
  x_pred = [[first_pred_year]]
  y_pred = linreg.predict(x_pred).item()  # single predicted value as a Python float
  ten_year.append(y_pred)

pt = y[0]

bt = y[1] - y[0]

forecast = [pt]

for i in range(1, len(y)):
  temp_pt = alpha * y[i] + (1 - alpha) * (pt + bt)
  bt = gamma * (temp_pt - pt) + (1 - gamma) * bt
  pt = temp_pt

  forecast.append(pt + (1 * bt))

forecast = list(forecast)

for i in range(1, period):
  temp_pt = alpha * ten_year[i] + (1 - alpha) * (pt + bt)
  bt = gamma * (temp_pt - pt) + (1 - gamma) * bt
  pt = temp_pt

  forecast.append(pt + (1 * bt))

plt.plot(y, color = 'blue', label='Original')
plt.plot(forecast, 'r--', label='Forecast')
plt.plot(len(forecast) - 1, forecast[-1], '+', label='2024 Baby Prediction', color='black')  # mark the last forecast point
plt.title('2024 Baby Prediction Forecast')
plt.xlabel('Year Intervals')
plt.ylabel('Number of babies born')
plt.legend()

print(forecast[len(forecast) - 1])

## **Mean percentage error against predicted values = 1.93%**

In [None]:
# Keep only the forecasts that line up with the observed years in df1
df1['Predicted Values'] = forecast[:len(df1)]

df1['Percentage Variance in %'] = ((df1['Number'] - df1['Predicted Values']) / df1['Number']) * 100 

df1.tail(10)

In [None]:
df1.describe()
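
One common way to summarise a percentage-error column in a single number is the mean absolute percentage error. Whether the figure in the heading above was computed this way is not stated in the notebook, so treat the sketch below as an assumption. The values are made up, and `mean_abs_pct_error` is a hypothetical helper.

```python
def mean_abs_pct_error(actual, predicted):
    """Mean of |actual - predicted| / |actual|, expressed in percent."""
    errors = [abs(a - p) / abs(a) * 100 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical values for illustration only
actual = [40000, 38000]
predicted = [39000, 39900]
mape = mean_abs_pct_error(actual, predicted)
print(round(mape, 2))
```

Taking absolute values stops positive and negative errors from cancelling, which a plain mean of signed percentages would allow.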

## **Prediction by decaying exponential curve**


In [None]:
# Fit a decaying exponential curve to the total fertility rate
from numpy import arange
from pandas import read_csv
from scipy.optimize import curve_fit
from matplotlib import pyplot
 
# define the model to fit: exponential decay towards the asymptote a
def objective(x, a, b, c):
    return b * np.exp(-x/c) + a

# Year intervals 1, 2, ..., n (one per row of the Total Fertility Rate series)
test = list(range(1, len(totalfertilityrate['Year']) + 1))

# choose the input and output variables
x, y = test, totalfertilityrate['Number']
# curve fit
popt, _ = curve_fit(objective, x, y)
# summarize the parameter values
a, b, c = popt

print('y = %.5f * e ^ - x / %.5f + %.5f' % (b, c, a))
print('Asymptote: %.5f' % (a))

# plot input vs output
pyplot.plot(x, y, 'b-', label='Original Values')
# define a sequence of inputs between the smallest and largest known inputs
x_line = arange(min(x), max(x), 1)
# calculate the output for the range
y_line = objective(x_line, a, b, c)
# create a line plot for the mapping function
pyplot.plot(x_line, y_line, 'r--', label='Best Fit Curve')
pyplot.axhline(y = a, color='black', label='Asymptote')
plt.xlabel('Year Intervals')
plt.ylabel('Fertility Rate')
plt.title('Fertility Rate vs Year Intervals')
plt.legend()
pyplot.show()
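
This fit bears directly on Problem 2. With a positive asymptote `a`, a curve of the form `b * exp(-x / c) + a` levels off at `a` and never reaches zero. The sketch below evaluates the curve far into the future using the parameter values hard-coded in the Extras cell of this notebook. It describes the extrapolated model only and is not a demographic guarantee. `fitted_tfr` is a hypothetical helper.

```python
import math

# Parameter values as hard-coded in the Extras cell of this notebook
A, B, C = 1.25824, 5.35215, 10.52950

def fitted_tfr(x):
    """Decaying exponential b * exp(-x / c) + a, with x in year intervals."""
    return B * math.exp(-x / C) + A

# The fitted value keeps falling but flattens out towards the asymptote A
for x in (1, 60, 100, 200, 500):
    print(x, round(fitted_tfr(x), 5))
```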

## **Extras**

In [None]:
# Evaluate the curve with hard-coded parameters at the year intervals 1..n used for the fit
new_list = []

for xx in range(1, len(totalfertilityrate['Year']) + 1):
  yy = 5.35215 * np.exp(-xx / 10.52950) + 1.25824
  new_list.append(yy)

print(new_list)

In [None]:
totalfertilityrate['Exponential Fitted Values'] = new_list

totalfertilityrate['Percentage Variance in %'] = ((totalfertilityrate['Number'] - totalfertilityrate['Exponential Fitted Values']) / totalfertilityrate['Number']) * 100 

totalfertilityrate.drop(columns=['Total Fertility Rate'])

In [None]:
totalfertilityrate.describe().round(5)