# Investigating Data Sets

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> This analysis looks for possible correlations between three data sets: **Income Per Person (Per Country)**, **Life Expectancy (Per Country)**, and **Energy Use Per Person (Per Country)**. All data was collected by [Gapminder World](https://www.gapminder.org/data/). The three data sets are included in CSV format in [this GitHub repository](https://github.com/TrikerDev/Investigating-Data-Sets) for further viewing. The findings here are tentative, and **correlation does not equal causation**.

## Questions to Answer

>* Does increased income per person equate to higher life expectancy?
>* Does increased energy usage per person equate to higher life expectancy?
>* Does increased income per person equate to higher energy usage per person?

### My Predictions
> I predict that all three data sets will be positively correlated with one another: higher income will go with higher life expectancy, higher life expectancy with higher energy usage, and so on. This investigation checks whether those predicted correlations actually appear in the data.

In [12]:
# Importing packages to be used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties


In [13]:
# Loading in the CSV data
income = pd.read_csv('income_per_person.csv')
life = pd.read_csv('life_expectancy_years.csv')
energy = pd.read_csv('energy_use_per_person.csv')

In [14]:
# Reading a few lines from the Life Expectancy table
life.head()

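| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 76.5 | 76.6 | 76.7 | 76.9 | 77.0 | 77.1 | 77.3 | 77.4 | 77.5 | 77.7 |
| 1 | Albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 87.4 | 87.5 | 87.6 | 87.7 | 87.8 | 87.9 | 88.0 | 88.1 | 88.2 | 88.3 |
| 2 | Algeria | 28.8 | 28.8 | 28.8 | 28.8 | 28.8 | 28.8 | 28.8 | 28.8 | 28.8 | ... | 88.3 | 88.4 | 88.5 | 88.6 | 88.7 | 88.8 | 88.9 | 89.0 | 89.1 | 89.2 |
| 3 | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 78.7 | 78.9 | 79.0 | 79.1 | 79.3 | 79.4 | 79.5 | 79.7 | 79.8 | 79.9 |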


In [15]:
# Reading a few lines from the Income per Person table
income.head()

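| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2031 | 2032 | 2033 | 2034 | 2035 | 2036 | 2037 | 2038 | 2039 | 2040 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 603 | 603 | 603 | 603 | 603 | 603 | 603 | 603 | 603 | ... | 2550 | 2600 | 2660 | 2710 | 2770 | 2820 | 2880 | 2940 | 3000 | 3060 |
| 1 | Albania | 667 | 667 | 667 | 667 | 667 | 668 | 668 | 668 | 668 | ... | 19400 | 19800 | 20200 | 20600 | 21000 | 21500 | 21900 | 22300 | 22800 | 23300 |
| 2 | Algeria | 715 | 716 | 717 | 718 | 719 | 720 | 721 | 722 | 723 | ... | 14300 | 14600 | 14900 | 15200 | 15500 | 15800 | 16100 | 16500 | 16800 | 17100 |
| 3 | Andorra | 1200 | 1200 | 1200 | 1200 | 1210 | 1210 | 1210 | 1210 | 1220 | ... | 73600 | 75100 | 76700 | 78300 | 79900 | 81500 | 83100 | 84800 | 86500 | 88300 |
| 4 | Angola | 618 | 620 | 623 | 626 | 628 | 631 | 634 | 637 | 640 | ... | 6110 | 6230 | 6350 | 6480 | 6610 | 6750 | 6880 | 7020 | 7170 | 7310 |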


In [16]:
# Reading a few lines from the Energy Use per Person table
energy.head()

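| | country | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 707.0 | 680.0 | 711.0 | 732.0 | 729.0 | 765.0 | 688.0 | 801.0 | 808.0 | NaN |
| 1 | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1030.0 | 1080.0 | 1070.0 | 1150.0 | 1110.0 | 1140.0 | 1230.0 | 1250.0 | 1330.0 | NaN |
| 2 | Angola | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 459.0 | 472.0 | 492.0 | 515.0 | 521.0 | 522.0 | 552.0 | 534.0 | 545.0 | NaN |
| 3 | Antigua and Barbuda | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1730.0 | 1740.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Argentina | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1850.0 | 1860.0 | 1940.0 | 1870.0 | 1930.0 | 1950.0 | 1940.0 | 1970.0 | 2030.0 | NaN |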


> There are **a lot** of data points here, and as you can see (especially in the Energy Use table) many values are missing entirely. These will be narrowed down and cleaned up later. First, we will look at the uncleaned data in a few different ways to get a few different views of it.
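To put a number on the missing values, one option is to count them per country. The sketch below is only an illustration: `energy_sample` is a hand-built stand-in holding three countries and three year columns copied from the `energy.head()` output above, and `missing_per_country` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Energy Use table. The values are copied from the
# energy.head() output above; NaN marks a missing value.
energy_sample = pd.DataFrame({
    "country": ["Albania", "Antigua and Barbuda", "Argentina"],
    "2007": [680.0, 1740.0, 1860.0],
    "2008": [711.0, np.nan, 1940.0],
    "2015": [np.nan, np.nan, np.nan],
})


def missing_per_country(df):
    """Count how many year columns are missing for each country."""
    years = df.drop(columns="country")
    counts = years.isna().sum(axis=1).to_numpy()
    return pd.Series(counts, index=df["country"], name="missing_years")


missing = missing_per_country(energy_sample)
print(missing)
```

The same helper could be applied to the full `energy` table, assuming it keeps the `country` column shown in the output above.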

In [19]:
# Summary statistics (count, mean, standard deviation, min, max, etc.) for the Life Expectancy table
life.describe()

| | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | ... | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 | 184.0 |
| mean | 31.502717 | 31.461957 | 31.478804 | 31.383152 | 31.459239 | 31.586413 | 31.644565 | 31.59837 | 31.383152 | 31.310326 | ... | 83.758152 | 83.87663 | 83.996196 | 84.119022 | 84.236957 | 84.358152 | 84.478804 | 84.593478 | 84.71087 | 84.829891 |
| std | 3.814689 | 3.806303 | 3.938674 | 3.962376 | 3.934674 | 4.010884 | 4.110598 | 3.981247 | 4.087872 | 4.04058 | ... | 5.600794 | 5.59444 | 5.589074 | 5.577601 | 5.57085 | 5.56606 | 5.556903 | 5.550234 | 5.54055 | 5.532609 |
| min | 23.4 | 23.4 | 23.4 | 19.6 | 23.4 | 23.4 | 23.4 | 23.4 | 12.5 | 13.4 | ... | 67.1 | 67.3 | 67.4 | 67.5 | 67.6 | 67.7 | 67.8 | 67.9 | 68.0 | 68.1 |
| 25% | 29.075 | 28.975 | 28.9 | 28.9 | 28.975 | 29.075 | 29.075 | 29.075 | 28.975 | 28.875 | ... | 79.5 | 79.7 | 79.8 | 79.9 | 80.075 | 80.2 | 80.375 | 80.475 | 80.575 | 80.775 |
| 50% | 31.75 | 31.65 | 31.55 | 31.5 | 31.55 | 31.65 | 31.75 | 31.75 | 31.55 | 31.5 | ... | 84.2 | 84.35 | 84.45 | 84.55 | 84.65 | 84.75 | 84.85 | 85.0 | 85.15 | 85.25 |
| 75% | 33.825 | 33.9 | 33.825 | 33.625 | 33.725 | 33.825 | 33.925 | 33.925 | 33.725 | 33.625 | ... | 88.125 | 88.225 | 88.325 | 88.5 | 88.6 | 88.7 | 88.8 | 88.9 | 89.0 | 89.1 |
| max | 42.9 | 40.3 | 44.4 | 44.8 | 42.8 | 44.3 | 45.8 | 43.6 | 43.5 | 41.7 | ... | 93.7 | 93.9 | 94.0 | 94.1 | 94.2 | 94.3 | 94.4 | 94.5 | 94.7 | 94.8 |


In [20]:
# Summary statistics (count, mean, standard deviation, min, max, etc.) for the Income table
income.describe()

| | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | ... | 2031 | 2032 | 2033 | 2034 | 2035 | 2036 | 2037 | 2038 | 2039 | 2040 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | ... | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 | 193.0 |
| mean | 978.523316 | 978.948187 | 980.725389 | 980.92228 | 981.911917 | 982.502591 | 982.829016 | 985.419689 | 980.937824 | 982.393782 | ... | 23142.378238 | 23613.119171 | 24083.46114 | 24577.430052 | 25077.678756 | 25576.476684 | 26107.564767 | 26635.953368 | 27180.512953 | 27730.725389 |
| std | 579.633227 | 579.915248 | 582.565512 | 582.032626 | 583.963199 | 584.043985 | 584.09785 | 590.514505 | 578.200194 | 581.878397 | ... | 23670.673835 | 24162.379036 | 24635.072766 | 25136.440969 | 25646.47526 | 26138.360102 | 26707.571366 | 27233.418469 | 27813.430077 | 28356.57083 |
| min | 250.0 | 250.0 | 249.0 | 249.0 | 249.0 | 249.0 | 248.0 | 248.0 | 248.0 | 248.0 | ... | 557.0 | 566.0 | 577.0 | 588.0 | 600.0 | 612.0 | 625.0 | 637.0 | 650.0 | 664.0 |
| 25% | 592.0 | 592.0 | 592.0 | 592.0 | 592.0 | 593.0 | 593.0 | 593.0 | 593.0 | 593.0 | ... | 5180.0 | 5280.0 | 5380.0 | 5490.0 | 5600.0 | 5710.0 | 5830.0 | 5950.0 | 6070.0 | 6190.0 |
| 50% | 817.0 | 822.0 | 826.0 | 831.0 | 836.0 | 836.0 | 836.0 | 836.0 | 836.0 | 836.0 | ... | 15400.0 | 15700.0 | 16000.0 | 16400.0 | 16700.0 | 17000.0 | 17400.0 | 17700.0 | 18100.0 | 18500.0 |
| 75% | 1160.0 | 1170.0 | 1170.0 | 1170.0 | 1170.0 | 1170.0 | 1170.0 | 1170.0 | 1160.0 | 1170.0 | ... | 34200.0 | 34800.0 | 35500.0 | 36200.0 | 37000.0 | 37700.0 | 38500.0 | 39300.0 | 40100.0 | 40900.0 |
| max | 3840.0 | 3840.0 | 3840.0 | 3840.0 | 3840.0 | 3840.0 | 3840.0 | 3840.0 | 3840.0 | 3840.0 | ... | 149000.0 | 153000.0 | 156000.0 | 159000.0 | 162000.0 | 165000.0 | 169000.0 | 172000.0 | 176000.0 | 179000.0 |


In [21]:
# Summary statistics (count, mean, standard deviation, min, max, etc.) for the Energy Use table
energy.describe()

| | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 26.0 | 26.0 | 26.0 | 26.0 | 26.0 | ... | 167.0 | 167.0 | 137.0 | 137.0 | 137.0 | 137.0 | 137.0 | 137.0 | 131.0 | 34.0 |
| mean | 2291.8 | 2325.8 | 2421.96 | 2554.36 | 2665.28 | 2700.038462 | 2772.230769 | 2844.230769 | 3029.615385 | 3212.230769 | ... | 2219.385928 | 2228.555449 | 2581.620438 | 2470.408759 | 2556.321168 | 2540.919708 | 2552.136496 | 2529.178102 | 2580.040458 | 4180.294118 |
| std | 2111.839561 | 2096.769718 | 2083.526247 | 2109.977624 | 2222.873807 | 2161.118812 | 2115.409451 | 2111.165409 | 2236.114972 | 2375.453899 | ... | 2823.620956 | 2834.985797 | 2984.181246 | 2867.183873 | 2947.257429 | 2980.009828 | 3008.272882 | 2994.200788 | 3037.373448 | 2819.526594 |
| min | 289.0 | 322.0 | 350.0 | 368.0 | 410.0 | 441.0 | 452.0 | 484.0 | 497.0 | 512.0 | ... | 9.55 | 9.56 | 128.0 | 130.0 | 135.0 | 113.0 | 63.7 | 65.4 | 66.3 | 1540.0 |
| 25% | 1320.0 | 1400.0 | 1410.0 | 1450.0 | 1520.0 | 1450.0 | 1517.5 | 1732.5 | 1895.0 | 2022.5 | ... | 488.0 | 511.0 | 594.0 | 609.0 | 670.0 | 661.0 | 685.0 | 660.0 | 687.0 | 2617.5 |
| 50% | 1830.0 | 1890.0 | 2050.0 | 2180.0 | 2320.0 | 2310.0 | 2340.0 | 2320.0 | 2460.0 | 2655.0 | ... | 1030.0 | 1060.0 | 1460.0 | 1380.0 | 1450.0 | 1550.0 | 1580.0 | 1490.0 | 1560.0 | 3560.0 |
| 75% | 2700.0 | 2740.0 | 2890.0 | 3080.0 | 3210.0 | 3277.5 | 3250.0 | 3307.5 | 3545.0 | 3692.5 | ... | 2875.0 | 2850.0 | 3360.0 | 3150.0 | 3310.0 | 3100.0 | 3080.0 | 3070.0 | 3030.0 | 4997.5 |
| max | 10500.0 | 10500.0 | 10400.0 | 10500.0 | 11200.0 | 10900.0 | 10500.0 | 10400.0 | 11100.0 | 12000.0 | ... | 19200.0 | 18200.0 | 16400.0 | 16900.0 | 17000.0 | 18200.0 | 17600.0 | 18200.0 | 17900.0 | 17500.0 |
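
The `count` row of the Energy Use summary varies a lot from year to year (25 values in 1960, 167 in 2006, 34 in 2015), so it may be worth keeping only the years where most countries have a value. The sketch below shows one possible way to do that. The table `toy_energy` is made up for illustration and is not Gapminder data, and the helper name `well_covered_years` and the 75% threshold are assumptions, not choices made in the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up toy table, NOT Gapminder data: four hypothetical countries.
toy_energy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "1960": [np.nan, np.nan, np.nan, 900.0],
    "2007": [500.0, 600.0, 700.0, 950.0],
    "2015": [np.nan, np.nan, 720.0, 980.0],
})


def well_covered_years(df, min_share=0.75):
    """Return the year columns where at least min_share of countries have a value."""
    coverage = df.drop(columns="country").notna().mean()
    return coverage[coverage >= min_share].index.tolist()


years = well_covered_years(toy_energy)
print(years)
```

Lowering `min_share` keeps more years at the cost of comparing fewer countries in each of them.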


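### Data Cleaning

> The three tables do not line up yet. The Life Expectancy columns run from 1800 to 2100, the Income columns from 1800 to 2040, and the Energy Use columns only from 1960 to 2015. The `count` rows above differ as well: 184 countries have life expectancy values, 193 have income values, and the Energy Use table has as few as 25 values in a single year (1960) and at most 167 in the columns shown. Before the tables can be compared, they need to be restricted to the countries and years they share, and rows with missing values need to be dropped.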

In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.
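
One possible cleaning approach, sketched here under assumptions, is to reshape each wide table into long form (one row per country and year), merge the three on `country` and `year`, and drop any row with a missing value. The toy tables below are made up and are not Gapminder data. The helper names `to_long` and `combine` and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def to_long(df, value_name):
    """Reshape a wide table (one column per year) into country/year/value rows."""
    long_df = df.melt(id_vars="country", var_name="year", value_name=value_name)
    long_df["year"] = long_df["year"].astype(int)
    return long_df


def combine(income, life, energy):
    """Inner-join the three long tables and drop rows with missing values."""
    merged = (
        to_long(income, "income")
        .merge(to_long(life, "life_expectancy"), on=["country", "year"])
        .merge(to_long(energy, "energy_use"), on=["country", "year"])
    )
    return merged.dropna().reset_index(drop=True)


# Made-up toy tables, NOT Gapminder data.
toy_income = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "2006": [1000, 2000, 3000],
    "2007": [1100, 2100, 3100],
})
toy_life = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "2006": [60.0, 70.0],
    "2007": [61.0, np.nan],
})
toy_energy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "2007": [500.0, 600.0, 700.0],
    "2008": [510.0, 610.0, 710.0],
})

clean = combine(toy_income, toy_life, toy_energy)
print(clean)
```

The inner join keeps only country and year pairs present in all three tables, so in the toy data only Country A in 2007 survives. Country B is dropped for its missing life expectancy, and Country C because it has no life expectancy row at all.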


<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1: Does increased income per person equate to higher life expectancy?

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.
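
For the income versus life expectancy question from the introduction, a simple starting point is the Pearson correlation coefficient between the two variables for a single year. The sketch below is only illustrative: `toy` holds made-up values rather than Gapminder data, and `pearson` is a hypothetical helper that wraps the pandas `Series.corr` method.

```python
import pandas as pd

# Made-up values for a single year, NOT Gapminder data: one row per country.
toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "income": [1000, 2000, 3000, 4000],
    "life_expectancy": [50.0, 60.0, 65.0, 75.0],
})


def pearson(df, x, y):
    """Pearson correlation coefficient between two columns (between -1 and 1)."""
    return df[x].corr(df[y])


r = pearson(toy, "income", "life_expectancy")
print(f"Pearson r = {r:.3f}")
```

A value near 1 indicates a strong positive linear relationship in the sample. It says nothing about whether one variable causes the other.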


### Research Question 2: Does increased energy usage per person equate to higher life expectancy?

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.
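
The remaining questions can be looked at together with a pairwise correlation matrix over all three variables. This is a sketch on made-up numbers, not Gapminder data. The column names `income`, `life_expectancy`, and `energy_use` are assumptions about how a cleaned, combined table might be laid out.

```python
import pandas as pd

# Made-up values for a single year, NOT Gapminder data.
toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "income": [1000, 2000, 3000, 4000],
    "life_expectancy": [50.0, 60.0, 65.0, 75.0],
    "energy_use": [400.0, 900.0, 1100.0, 2000.0],
})

# Pairwise Pearson correlations between the three numeric columns.
variables = ["income", "life_expectancy", "energy_use"]
corr_matrix = toy[variables].corr()
print(corr_matrix.round(3))
```

Each off-diagonal entry corresponds to one of the three questions posed in the introduction, and the diagonal is always 1 because every variable is perfectly correlated with itself.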


<a id='conclusions'></a>
## Conclusions

> No statistical tests have been performed in this notebook, so no statistical conclusions are drawn here. Any relationship found between these data sets is a correlation only and does not imply causation.

