# CE 93: Engineering Data Analysis 
# Fertility Rate and Life Expectancy Among Countries

##### By: *Seobin Yun* and *Connor Clark*

## Libraries Required for Exploratory Data Analysis (EDA)

The code that follows loads the libraries needed for the analysis.

In [1]:
# import Python libraries / packages
import numpy as np                                 # ndarrays for gridded data
import pandas as pd                                # DataFrames for tabular data
import matplotlib.pyplot as plt                    # plotting
import random                                      # random sampling
from scipy.stats import *                          # common distributions, t-test
import statistics as stats                         # statistics like mode
from sklearn.linear_model import LinearRegression  # linear regression
import statsmodels.api as sm                       # linear regression

## Introduction

In recent years, roughly 140 million babies have been born worldwide each year. Without context, it is hard to judge whether that number is large or small. To put it in perspective, we study births through the **fertility rate** and relate it to a second variable, **life expectancy**. In this report, we analyze the potential relationship between life expectancy and fertility rate across countries.


First, let's define our variables. Fertility rate is defined as the number of births per woman, while life expectancy is defined as the number of years a newborn is expected to live. To examine the relationship, we selected two data sets, both from the CE93 Data Summaries spreadsheet: (1) Fertility Rate (births) and (2) Average Life Expectancy (yr). The data can be viewed or downloaded as a CSV file from the gapminder website (https://www.gapminder.org/data/). In both data sets, the years run along the x-axis as columns and the countries run along the y-axis as rows, with the corresponding value in each cell. The site also lists the sources from which each data set was compiled.


The table below sketches the layout of the data sets. The common keys between the two data sets are the country and the year, which is where we will perform our analysis.


|Country|1960|1961|1962|...|2021|
|:-|:-|:-|:-|:-|:-|
|Aruba|(fertility rate / life expectancy)|(fertility rate / life expectancy)|(fertility rate / life expectancy)|...|(fertility rate / life expectancy)|
|Afghanistan|(fertility rate / life expectancy)|(fertility rate / life expectancy)|(fertility rate / life expectancy)|...|(fertility rate / life expectancy)|
|Angola|(fertility rate / life expectancy)|(fertility rate / life expectancy)|(fertility rate / life expectancy)|...|(fertility rate / life expectancy)|
|...|...|...|...|...|...|


### Fertility Rate
Below, we show the actual data from the CSV file. To make coding and analysis easier, we have named the CSV file for the fertility rate *fertility*. At a glance, the years along the x-axis range from 1960 to 2021, while the countries along the y-axis include every country in the world. In another cell, we report the shape of the data set, which is a tuple of two numbers: the first is the number of rows and the second is the number of columns.

### Life Expectancy
Similarly, we show the actual data for the life expectancy CSV file below, and have named the variable *life*. Again, the years range from 1960 to 2021 along the x-axis, and each country occupies a single row along the y-axis. However, upon inspection, the two data sets do not perfectly align. While the number of columns matches, the number of rows does not: the fertility rate data set has 213 rows and the life expectancy data set has 212 rows. Looking through the CSV files on gapminder, we see that Andorra is the cause: it has data for only a limited number of years in the fertility rate data set and is absent from the life expectancy data set. For our analysis, this will not affect our results.
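The row mismatch described above can also be checked programmatically. The following is a minimal sketch, assuming two wide tables shaped like the gapminder files. The frames `fertility` and `life` and every value in them are made-up placeholders for illustration, not the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two wide tables (placeholder values).
fertility = pd.DataFrame(
    {
        "country": ["Afghanistan", "Andorra", "Angola"],
        "1960": [7.0, None, 6.5],
        "1961": [7.0, None, 6.6],
    }
)
life = pd.DataFrame(
    {
        "country": ["Afghanistan", "Angola"],
        "1960": [40.0, 42.0],
        "1961": [41.0, 43.0],
    }
)

# Countries present in one table but missing from the other.
only_in_fertility = sorted(set(fertility["country"]) - set(life["country"]))
only_in_life = sorted(set(life["country"]) - set(fertility["country"]))
print(only_in_fertility)  # ['Andorra']
print(only_in_life)       # []

# An inner merge on "country" keeps only the countries found in both tables.
merged = fertility.merge(life, on="country", suffixes=("_fertility", "_life"))
print(merged.shape)  # (2, 5)
```

An inner merge silently drops unmatched countries, so comparing the two country sets first makes it clear which rows are lost.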

In [11]:
# read a .csv file in as a DataFrame
data = pd.read_csv('CE93_08_Fertility_Expectancy.csv')

# return the first 5 rows of the combined data set
data.head()


| |country|Fertility_rate(births)|Life_expectancy(yr)|
|:-|:-|:-|:-|
|0|Afghanistan|4.04|63.4|
|1|Angola|5.41|65.2|
|2|Albania|1.7|77.9|
|3|United Arab Emirates|1.69|74.0|
|4|Argentina|2.23|74.6|


Next, we check the shape of the data file.

In [13]:
# get the shape (rows, columns)
rows, columns = data.shape
# Print the number of rows and columns
print(f'\nThe Data set has {rows} rows of countries and {columns} columns.')



The Data set has 185 rows of countries and 3 columns.


In [14]:
# data.info()
data.describe()

| |Fertility_rate(births)|Life_expectancy(yr)|
|:-|:-|:-|
|count|185.0|185.0|
|mean|2.661676|72.394595|
|std|1.234484|6.734505|
|min|1.23|52.0|
|25%|1.73|67.3|
|50%|2.2|73.2|
|75%|3.49|77.0|
|max|7.0|84.9|


### Summary Statistics

Now, we compute measures of central tendency and variability for each data set.

##### Central tendency
- Mean
- Median

##### Variability
- Variance
- Standard deviation
- Range
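
For reference, the measures listed above can be written as follows for a sample $x_1, \dots, x_n$, where $x_{(k)}$ denotes the $k$-th smallest value. This is a standard formulation rather than something taken from the course material. The $n-1$ denominator in the variance matches the default (`ddof=1`) of the pandas `.var()` and `.std()` methods.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i && \text{(mean)} \\
\tilde{x} &= \begin{cases}
  x_{((n+1)/2)} & n \text{ odd} \\
  \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & n \text{ even}
\end{cases} && \text{(median)} \\
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 && \text{(variance)} \\
s &= \sqrt{s^2} && \text{(standard deviation)} \\
R &= x_{(n)} - x_{(1)} && \text{(range)}
\end{aligned}
```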

### Fertility Rate


First, we calculate the mean using `.mean()`.

On a wide country-by-year table, setting the parameter `axis` to 1 gives one mean per country, while setting it to 0 gives one mean per year.

For the total mean, we set `axis` to `None`. Because the file loaded above holds a single value per country, we instead select the fertility rate column and average it across all countries.
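
To make the role of the `axis` parameter concrete, here is a small sketch on a toy country-by-year table. The frame `wide`, its labels, and its values are made up for illustration. The overall mean is taken on the underlying NumPy array, which avoids relying on how a particular pandas version treats `axis=None`.

```python
import pandas as pd

# Toy wide table (made-up values): rows are countries, columns are years.
wide = pd.DataFrame(
    {"1960": [2.0, 4.0], "1961": [4.0, 6.0]},
    index=["CountryA", "CountryB"],
)

per_country = wide.mean(axis=1)   # one mean per row (country)
per_year = wide.mean(axis=0)      # one mean per column (year)
overall = wide.to_numpy().mean()  # single mean over every cell

print(per_country.to_dict())  # {'CountryA': 3.0, 'CountryB': 5.0}
print(per_year.to_dict())     # {'1960': 3.0, '1961': 5.0}
print(overall)                # 4.0
```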

In [32]:

fertility_mean = data.iloc[:, 1].mean()
print(f'Mean of Fertility rate: {fertility_mean.round(2)} births per woman')

Mean of Fertility rate: 2.66 births per woman


Next, we calculate the median across countries, in the same way as the mean.

In [33]:

fertility_median = data.iloc[:, 1].median()
print(f'Median of Fertility rate: {fertility_median.round(2)} births per woman')

Median of Fertility rate: 2.2 births per woman


Calculate the variance across countries.

In [34]:

fertility_var = data.iloc[:, 1].var()
print(f'Variance of Fertility rate: {fertility_var.round(2)} (births per woman)**2')

Variance of Fertility rate: 1.52 (births per woman)**2


Calculate the standard deviation across countries.

In [21]:

fertility_stdev = data.iloc[:, 1].std()
print(f'Standard deviation of Fertility rate: {fertility_stdev.round(2)}')

Standard deviation of Fertility rate: 1.23
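
The variance and standard deviation above rely on the pandas defaults, which divide by n - 1 (`ddof=1`), whereas NumPy's `np.var` divides by n unless told otherwise. The sketch below illustrates the difference. The name `sample` is illustrative, and its values are the five fertility rates shown in the `data.head()` output, so the results do not match the full 185-country figures.

```python
import numpy as np
import pandas as pd

# Fertility rates of the five countries shown in the data.head() output.
sample = pd.Series([4.04, 5.41, 1.70, 1.69, 2.23])
n = len(sample)

sample_var = sample.var()                  # pandas default: ddof=1, divide by n - 1
population_var = sample.var(ddof=0)        # divide by n instead
numpy_default = np.var(sample.to_numpy())  # NumPy default: ddof=0

print(round(sample_var, 3))
print(round(population_var, 3))

# The two conventions differ only by the factor n / (n - 1).
print(np.isclose(sample_var, population_var * n / (n - 1)))  # True
print(np.isclose(numpy_default, population_var))             # True
```

With 185 countries the factor n / (n - 1) is close to 1, so the choice of convention barely changes the reported values.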


Calculate the range.

In [23]:
# use a descriptive name so the built-in range() is not shadowed
fertility_range = data.iloc[:, 1].max() - data.iloc[:, 1].min()

print(f'Range of Fertility: {fertility_range.round(2)}')

Range of Fertility: 5.77


### Life Expectancy

In [35]:

life_mean = data.iloc[:, 2].mean()
print(f'Mean of life expectancy: {life_mean.round(2)} years')

Mean of life expectancy: 72.39 years


In [36]:

life_median = data.iloc[:, 2].median()
print(f'Median of life expectancy: {life_median.round(2)} years')

Median of life expectancy: 73.2 years


In [40]:

life_var = data.iloc[:, 2].var()
print(f'Variance of life expectancy: {life_var.round(2)} years**2')

Variance of life expectancy: 45.35 years**2


In [43]:

life_stdev = data.iloc[:, 2].std()
print(f'Standard deviation of life expectancy: {life_stdev.round(2)} years')

Standard deviation of life expectancy: 6.73 years


In [44]:
life_range = data.iloc[:, 2].max() - data.iloc[:, 2].min()

print(f'Range of life expectancy: {life_range.round(2)}')

Range of life expectancy: 32.9
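
The cell-by-cell calculations above can also be gathered into a single table. The following is a minimal sketch using `DataFrame.agg`. The frame `sample` holds only the five rows shown in the `data.head()` output, so its numbers differ from the full-data results reported above. The names `numeric` and `summary` are illustrative.

```python
import pandas as pd

# First five rows of the data set, copied from the data.head() output.
sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Angola", "Albania",
                    "United Arab Emirates", "Argentina"],
        "Fertility_rate(births)": [4.04, 5.41, 1.70, 1.69, 2.23],
        "Life_expectancy(yr)": [63.4, 65.2, 77.9, 74.0, 74.6],
    }
)

# Keep only the numeric columns (drops the country names).
numeric = sample.select_dtypes("number")

# Mean, median, variance, and standard deviation for every numeric column.
summary = numeric.agg(["mean", "median", "var", "std"])

# Range is not a built-in aggregation name, so add it as an extra row.
summary.loc["range"] = numeric.max() - numeric.min()

print(summary.round(2))
```

Selecting columns by name or dtype, rather than by position with `iloc`, keeps the code correct even if the column order in the CSV file changes.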
