# Project: Gapminder Life Expectancy, Income, and Energy Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, we will analyze the life expectancy, income, and CO2 emissions datasets from [Gapminder.org](https://www.gapminder.org/data/).

In particular, we are interested in how income relates to each country's life expectancy and to its CO2 emissions.

In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### General Properties

#### Life Expectancy Dataset

First we will be looking at the life expectancy dataset to build some intuition around what's included.

From Gapminder's description, life expectancy is the average number of years a newborn child would live if current mortality patterns were to stay the same.

In [None]:
# load data csv files
df_le = pd.read_csv('data/life_expectancy_years.csv')

In [None]:
# how many total rows and columns do we have?
df_le.shape

There are 187 rows and 220 columns.

In [None]:
# what does a sample of the data look like?
df_le.head(3)

From the sample data, we can infer there are 187 rows of countries.  The first column is the country names and columns 2-220 correspond with the years 1800-2018.

In [None]:
# what are some general statistics of the data?
df_le.describe()

With the describe function, we can see at a glance that the mean life expectancy generally rises over time and that the spread between countries (the standard deviation) seems to grow as well.
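To look at that trend directly rather than reading it off the `describe` table, one option is to compute the mean and standard deviation of each year column. The sketch below uses a small made-up table (`toy_le`, a hypothetical stand-in for `df_le`), so the numbers are for illustration only.

```python
import pandas as pd

# Toy stand-in for df_le: one row per country, one column per year.
# These values are invented for illustration, not taken from Gapminder.
toy_le = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    '1800': [30.0, 32.0, 31.0],
    '1900': [35.0, 45.0, 40.0],
    '2000': [50.0, 80.0, 65.0],
})

# Mean and standard deviation across countries, one row per year.
yearly = toy_le.drop(columns='country').agg(['mean', 'std']).T
print(yearly)
```

On the real data, plotting `yearly['mean']` and `yearly['std']` against the year would show whether both actually rise over the full period.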

Is there any duplicate data?

In [None]:
# how many duplicate rows do we have?
df_le.duplicated().sum()

Lucky us, no duplicates!  What about missing values?

In [None]:
# is there any missing data?
df_le[df_le['1800'].isnull()]
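The check above only looks at the `1800` column. As a possible complement, the sketch below counts missing values per country across every year column. It runs on a made-up table (`toy` is a hypothetical stand-in for `df_le`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_le with a few gaps; values are invented.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    '1800': [np.nan, 32.0, 31.0],
    '1801': [np.nan, 32.1, 31.2],
    '1802': [30.5, np.nan, 31.4],
})

# Number of missing values in each row, indexed by country name.
missing_per_country = toy.set_index('country').isnull().sum(axis=1)

# Show only the countries that have at least one gap.
print(missing_per_country[missing_per_country > 0])
```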

Since there are only a few values missing, we will do a linear interpolation to fill in these values.

In [None]:
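# fill in null values with linear interpolation along each row (across years);
# the default axis=0 would interpolate between neighboring countries instead
df_le.loc[:,'1800':] = df_le.loc[:,'1800':].interpolate(axis=1, limit_direction='both')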
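To make the effect of linear interpolation concrete, here is a minimal sketch on made-up data. It assumes the gaps should be filled along each row, that is, across years within one country; `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy table with gaps; values are invented for illustration.
toy = pd.DataFrame({
    'country': ['A', 'B'],
    '1800': [np.nan, 30.0],
    '1801': [31.0, np.nan],
    '1802': [np.nan, 34.0],
    '1803': [33.0, 35.0],
})

# Interpolate only the numeric year columns, row by row.
years = toy.columns[1:]
filled = toy[years].interpolate(axis=1, limit_direction='both')
print(filled)
```

Interior gaps are replaced by values on the straight line between their neighbors, and `limit_direction='both'` also fills gaps at the start of a row, which a forward-only interpolation would leave empty.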

In [None]:
# check to see if additional data is missing
df_le[df_le['1801'].isnull()]

Otherwise, it looks like this data is fairly clean!  Let's go ahead and do the same things for the next two datasets.

#### Income Per Person

From Gapminder's description, this dataset is the gross domestic product per person, adjusted for inflation and for differences in purchasing power.

In [None]:
df_inc = pd.read_csv('data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv')

In [None]:
# how many rows and columns do we have?
df_inc.shape

In [None]:
# what does a sample of the data look like?
df_inc.head(3)

From the data shape and sample we can see there are additional countries in this dataset compared to the life expectancy dataset.  Also, it appears that the income dataset is projected out to 2040 whereas the life expectancy dataset stops at 2018.
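The difference in coverage can also be checked programmatically instead of by eye. The sketch below compares year columns and country names between two made-up tables (`toy_le` and `toy_inc` are hypothetical stand-ins for `df_le` and `df_inc`).

```python
import pandas as pd

# Toy stand-ins; values are invented for illustration.
toy_le = pd.DataFrame({
    'country': ['A', 'B'],
    '2017': [70.0, 71.0],
    '2018': [70.5, 71.2],
})
toy_inc = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    '2017': [1000, 2000, 3000],
    '2018': [1100, 2100, 3100],
    '2019': [1200, 2200, 3200],
})

# Year columns that exist only in the income table.
extra_years = toy_inc.columns.difference(toy_le.columns)

# Year columns that both tables share.
shared_years = toy_inc.columns.intersection(toy_le.columns).drop('country')

# Countries that exist only in the income table.
extra_countries = set(toy_inc['country']) - set(toy_le['country'])

print(list(extra_years), list(shared_years), extra_countries)
```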

In [None]:
# what are some general statistics of the data?
df_inc.describe()

As expected, the average income per person increases over time.  The level of inequality between countries also seems to increase over time.

In [None]:
# how many duplicate rows do we have?
df_inc.duplicated().sum()

In [None]:
# is there any missing data?
df_inc.isnull().sum()

There are no duplicated rows and there is no missing data for the income per person dataset.

#### Carbon Dioxide Emissions Per Person

From Gapminder's description, this is the amount of carbon dioxide emissions from the burning of fossil fuels in metric tonnes of CO2 per person.

In [None]:
# load data csv files
df_co2 = pd.read_csv('data/co2_emissions_tonnes_per_person.csv')

In [None]:
# how many rows and columns do we have?
df_co2.shape

This dataset also has a different number of rows and columns from the previous two datasets.

In [None]:
# what does a sample of the data look like?
df_co2.head(3)

In [None]:
# what are some general statistics of the data?
df_co2.describe()

In [None]:
# how many duplicate rows do we have?
df_co2.duplicated().sum()

In [None]:
# is there any missing data?
df_co2[df_co2['1800'].isnull()]

In [None]:
# fill in the null values along each row
df_co2.loc[:,'1800':] = df_co2.loc[:,'1800':].interpolate(axis=1, limit_direction='both')
df_co2.head()

In [None]:
# check for null values
df_co2[df_co2['1800'].isnull()]

Now that all of our data is imported and we have some intuition for it, let's combine the three datasets into one DataFrame.
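One possible way to combine the three wide tables, sketched here under assumptions, is to reshape each into long form (one row per country and year) and merge on those two keys. The helper `to_long`, the toy tables, and the value column names are all hypothetical and not part of the original notebook.

```python
import pandas as pd

def to_long(df, value_name):
    """Reshape a wide country-by-year table into (country, year, value) rows."""
    long_df = df.melt(id_vars='country', var_name='year', value_name=value_name)
    long_df['year'] = long_df['year'].astype(int)
    return long_df

# Toy stand-ins for df_le, df_inc, and df_co2; values are invented.
toy_le = pd.DataFrame({'country': ['A', 'B'],
                       '2017': [70.0, 71.0], '2018': [70.5, 71.2]})
toy_inc = pd.DataFrame({'country': ['A', 'B', 'C'],
                        '2017': [1000, 2000, 3000],
                        '2018': [1100, 2100, 3100],
                        '2019': [1200, 2200, 3200]})
toy_co2 = pd.DataFrame({'country': ['A', 'B'],
                        '2017': [0.5, 4.0], '2018': [0.6, 4.1]})

# Inner merges keep only the country-year pairs present in all three tables.
combined = (
    to_long(toy_le, 'life_expectancy')
    .merge(to_long(toy_inc, 'income'), on=['country', 'year'], how='inner')
    .merge(to_long(toy_co2, 'co2'), on=['country', 'year'], how='inner')
)
print(combined)
```

With inner merges, countries and years that are missing from any one table drop out automatically, so the result depends on how the extra countries and projected years are meant to be handled.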

### Data Cleaning

#### Dropping Extra Countries

As noted earlier, each dataset has a different number of rows and columns.  Let's find the rows that are missing from each dataset.

In [None]:
# create a list of countries from each dataset
le_countries = df_le['country'].values
inc_countries = df_inc['country'].values
co2_countries = df_co2['country'].values

In [None]:
# how many rows do we have for each dataset?
le_countries.shape, inc_countries.shape, co2_countries.shape

It looks like the income dataset has 6 more countries than the life expectancy dataset.  The CO2 emissions dataset is only missing one country from the income dataset.

Let's find the names of the countries that are missing.

In [None]:
# create list of different countries
different = []

# append the list when a country in the income dataset is not found in the other two datasets
for country in inc_countries:
    if (country not in le_countries) or (country not in co2_countries):
        different.append(country)

# remove any duplicates by converting the list into a dictionary and back into a list.
list(dict.fromkeys(different))
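The same comparison can also be written with set operations, which avoids the explicit loop and the deduplication step. This is a sketch on made-up country lists; the three list names are hypothetical stand-ins for the arrays built above.

```python
# Toy country lists; invented for illustration.
toy_inc_countries = ['A', 'B', 'C', 'D']
toy_le_countries = ['A', 'B']
toy_co2_countries = ['A', 'B', 'C']

# Countries present in all of the other datasets.
in_both_others = set(toy_le_countries) & set(toy_co2_countries)

# Income countries that are missing from at least one other dataset.
missing = sorted(set(toy_inc_countries) - in_both_others)
print(missing)
```

A country lands in `missing` when it is absent from the life expectancy list or from the CO2 list, which matches the `or` condition in the loop.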

As expected, we have a list of 6 countries that 

<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!