# Project: EDA of x

## Introduction
This project explores a dataset of x. The original data can be found at https://www.gapminder.org/data/. The indicators considered are:

- Child Mortality Rate - Deaths of children under five years of age per 1,000 live births.
- Children per Woman (total fertility) - Total fertility rate: the number of children that would be born to each woman given prevailing age-specific fertility rates.
- Life Expectancy


### Questions for Analysis

- Is there a correlation between death rates and birth rates? That is, as death rates decline, do birth rates also decline because the children who are born are less likely to die?
- Is there a relationship between children born and life expectancy?
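
One possible way to approach the first question is to reshape two indicator tables from wide format (one column per year) to long format, merge them on country and year, and compute a correlation. The sketch below is only an illustration: the two small tables, their values, and the helper `to_long` are made up for the example and are not the Gapminder data.

```python
import pandas as pd

# Made-up stand-in tables in the same wide layout as the source files
# (one row per country, one column per year). Not real data.
fertility = pd.DataFrame(
    {"country": ["A", "B"], "1800": [6.0, 5.0], "1900": [4.0, 3.0]}
)
mortality = pd.DataFrame(
    {"country": ["A", "B"], "1800": [400.0, 350.0], "1900": [250.0, 200.0]}
)


def to_long(wide, value_name):
    # melt turns the year columns into (country, year, value) rows.
    return wide.melt(id_vars="country", var_name="year", value_name=value_name)


# One row per country-year pair, with both indicators side by side.
merged = to_long(fertility, "fertility").merge(
    to_long(mortality, "child_mortality"), on=["country", "year"]
)

# Pearson correlation across all country-year pairs.
r = merged["fertility"].corr(merged["child_mortality"])
print(r)
```

A correlation computed this way pools all countries and years together, so it describes association only and says nothing about cause.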

In [2]:
#First, we import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime #We will need this for later on
import matplotlib.pyplot as plt
%matplotlib inline

## Data Wrangling
The overall goal of the data wrangling step is to make the data as clean as possible for the exploration stage. We start by loading the data and then do the initial cleaning of addressing null values, duplicates, and incorrect data types.

In [13]:
#Data is loaded and the head viewed.
df = pd.read_csv('children_per_woman_total_fertility.csv')
df.head(1)

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Aruba,5.64,5.64,5.64,5.64,5.64,5.64,5.64,5.64,5.64,...,1.82,1.82,1.82,1.82,1.82,1.82,1.82,1.82,1.83,1.83


In [22]:
df[df['country']=='USA']

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
190,USA,7.03,7.01,6.99,6.96,6.94,6.92,6.9,6.87,6.85,...,1.92,1.92,1.92,1.92,1.92,1.92,1.92,1.92,1.92,1.92


In [32]:
df.columns.tolist().index('2091')

292

In [33]:
df.iloc[190,292]

1.92
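
The positional lookup above depends on the row and column order staying the same. A label-based lookup may be easier to read and is not affected by reordering. This is a minimal sketch on a small stand-in frame whose values are copied from the Aruba and USA rows shown above; the names `usa_2091` and `by_country` are introduced only for the example.

```python
import pandas as pd

# Small stand-in frame; values are copied from the outputs above.
df = pd.DataFrame(
    {"country": ["Aruba", "USA"], "2091": [1.82, 1.92], "2100": [1.83, 1.92]}
)

# A boolean mask selects the row(s); the column label selects the year.
usa_2091 = df.loc[df["country"] == "USA", "2091"]
print(usa_2091.iloc[0])

# Alternative: index by country once, then look up by (row label, column label).
by_country = df.set_index("country")
print(by_country.at["USA", "2091"])
```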

In [9]:
#Investigating data for data types, null values, etc.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Columns: 302 entries, country to 2100
dtypes: float64(301), object(1)
memory usage: 476.7+ KB


In [10]:
df.head(20)

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,USA,5.64,5.64,5.64,5.64,5.64,5.64,5.64,5.64,5.64,...,1.82,1.82,1.82,1.82,1.82,1.82,1.82,1.82,1.83,1.83
1,USA,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,...,1.74,1.74,1.74,1.74,1.74,1.74,1.74,1.74,1.74,1.74
2,USA,6.93,6.93,6.93,6.93,6.93,6.93,6.93,6.94,6.94,...,2.54,2.52,2.5,2.48,2.47,2.45,2.43,2.42,2.4,2.4
3,USA,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.6,...,1.78,1.78,1.78,1.79,1.79,1.79,1.79,1.79,1.79,1.79
4,USA,5.8,5.8,5.8,5.8,5.8,5.8,5.8,5.8,5.8,...,2.0,2.0,2.01,2.01,2.01,2.01,2.01,2.02,2.02,2.02
5,USA,6.94,6.94,6.94,6.94,6.94,6.94,6.94,6.94,6.94,...,1.76,1.76,1.76,1.77,1.77,1.77,1.77,1.77,1.77,1.77
6,USA,6.8,6.8,6.8,6.8,6.8,6.8,6.8,6.8,6.8,...,1.82,1.82,1.82,1.82,1.82,1.82,1.82,1.82,1.82,1.82
7,USA,7.8,7.8,7.81,7.81,7.81,7.82,7.82,7.82,7.83,...,1.77,1.77,1.77,1.77,1.78,1.78,1.78,1.78,1.78,1.78
8,USA,5.0,5.0,4.99,4.99,4.99,4.98,4.98,4.97,4.97,...,1.81,1.81,1.81,1.81,1.81,1.81,1.81,1.82,1.82,1.82
9,USA,6.5,6.48,6.46,6.44,6.42,6.4,6.38,6.36,6.34,...,1.8,1.8,1.8,1.8,1.8,1.8,1.8,1.8,1.81,1.81


# Notes:
- Countries are not in alphabetical order
- How do I show floats to only 2 decimal places rather than 4?
- Do I want to keep missing values as NaN? (Remember that NaN is not the same as None. They represent the same concept, but NaN is numerical and is treated differently for efficiency reasons.)
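
The last two notes can be explored with a short experiment. The sketch below uses a small stand-in Series (two values taken from the `1965` output in this notebook, plus one missing value) and shows two ways to get two decimal places, along with how NaN and None behave in pandas. It is one reasonable approach, not the only one.

```python
import numpy as np
import pandas as pd

s = pd.Series([11.5, 0.0445, np.nan])

# Option 1: round the data itself (this changes the stored values).
rounded = s.round(2)
print(rounded)

# Option 2: change only how floats are displayed; stored values are untouched.
pd.set_option("display.float_format", "{:.2f}".format)
print(s)
pd.reset_option("display.float_format")

# NaN is a float, so it fits in a float64 column; None is a Python object.
print(type(np.nan), type(None))

# pandas converts None to NaN when the other values are numeric.
converted = pd.Series([1.0, None])
print(converted.dtype)

# NaN never compares equal to itself, so use isna()/isnull() to detect it.
print(np.nan == np.nan, s.isna().sum())
```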

# Potential Questions
- Is there a correlation between country GDP per capita and birth rates?
- Correlation between religion and birth rates?
- Correlation between death rate and birth rates, meaning do people have more kids with the expectation that some of them will most likely die?  Or a child has died and so they try again?

In [11]:
#double checking by column name for null values
def total_missing(df,column_name):
    is_null = df[column_name].isnull().sum()
    return int(is_null)

In [13]:
#Checking for null values
total_missing(df, '1965')

14
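
Checking one column at a time works, but the same information can be gathered for every column in a single call. The sketch below assumes a small stand-in frame with made-up country names and values; `missing_per_column` and `rows_with_missing` are names chosen for the example.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in frame (country names and values are made up).
df = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "1964": [2.5, np.nan, 3.1],
        "1965": [np.nan, np.nan, 3.0],
    }
)

# Missing values per column, largest count first.
missing_per_column = df.isnull().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing["country"].tolist())
```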

In [25]:
#value_counts counts how often each distinct value occurs (NaN is excluded by default)
#sort_index(ascending=False) orders the result by value, largest first
df['1965'].value_counts().sort_index(ascending=False)

1965
11.5000    1
4.0400     1
2.7600     1
2.7400     1
2.5200     1
          ..
0.0445     1
0.0411     1
0.0392     1
0.0253     1
0.0151     1
Name: count, Length: 63, dtype: int64

In [26]:
#Checking for duplicates.
df.duplicated().sum()

0

It is apparent from the above cells that the following cleaning needs to take place:

- Countries need to be alphabetized
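
Alphabetizing the countries could be done with `sort_values`. This is a minimal sketch on a stand-in frame (values copied from the Aruba and USA rows above); the variable names are chosen for the example, and the case-insensitive variant assumes pandas 1.1 or newer, where `sort_values` accepts a `key` argument.

```python
import pandas as pd

# Stand-in frame; values are copied from the Aruba and USA rows above.
df = pd.DataFrame({"country": ["USA", "Aruba"], "1800": [7.03, 5.64]})

# sort_values returns a new frame; reset_index(drop=True) renumbers the rows.
df_sorted = df.sort_values("country").reset_index(drop=True)
print(df_sorted)

# Case-insensitive variant, in case some names start with a lowercase letter.
df_sorted_ci = df.sort_values(
    "country", key=lambda col: col.str.lower()
).reset_index(drop=True)
print(df_sorted_ci)
```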

Examples
The data types for PatientID, ScheduledDay, and AppointmentDay need to be updated to string, datetime and datetime.
The neighbourhoods need to be formatted with capitalization.
The null check above found 14 missing values in the 1965 column, so missing values will need to be handled during cleaning. There aren't any duplicate rows, so there is no need to delete duplicates.

We will most likely find additional necessary changes as we proceed with cleaning.