<a href="https://colab.research.google.com/github/DonnaVakalis/Livability/blob/master/Gapminder1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Gapminder exploration

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

Which measures of quality of life, if any, are generally correlated with economic equality, as measured by the GINI metric?  
e.g., traffice deaths, ... cellphones.... air quality...minimum wage? 


In [3]:
# Install pycountry
!pip install pycountry

# Imports 
import pandas as pd
import matplotlib.pyplot as plt
import os
from google.colab import drive
import pycountry
from functools import reduce #for merging dataframes

# Settings
%matplotlib inline 
pd.options.display.float_format = '{:,.2f}'.format # display numbers with two decimal places





In [None]:
# Mount Google Drive

drive.mount('/content/gdrive')
os.chdir("/content/gdrive/My Drive/")


Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### LOAD DATA


In [None]:
# Load the data

base_dir = "/content/gdrive/My Drive/Colab Notebooks/project_gapminder/"

# 1. Read GINI --> CSV format
file = base_dir + 'gini.csv' # from https://www.gapminder.org/data/
df_gini = pd.read_csv(file)
df_gini.head()

# 2. Read Income per Person --> CSV format
file = base_dir + 'income_per_person_gdppercapita_ppp_inflation_adjusted.csv'  # from https://www.gapminder.org/data/
df_incm = pd.read_csv(file)
df_incm.head()

# 3. Read CO2 emissions --> CSV format
file = base_dir + 'co2_emissions_tonnes_per_person.csv' # from https://www.gapminder.org/data/
df_crbn = pd.read_csv(file)
df_crbn.head()

# 4. Read Cellphone per 100 people --> CSV format 
file = base_dir + 'cell_phones_per_100_people.csv' # from https://www.gapminder.org/data/
df_phon= pd.read_csv(file)
df_phon.head()


Unnamed: 0,country,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Afghanistan,0.0,,,,,0.0,,,,,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111,0.845,2.43,4.68,9.53,17.2,28.5,37.0,35.0,45.8,49.2,52.1,55.2,57.3,61.1,65.9,59.1
1,Albania,0.0,,,,,0.0,,,,,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0742,0.106,0.18,0.353,0.952,12.5,27.2,35.3,40.6,49.6,62.4,76.5,61.9,82.9,91.3,106.0,120.0,127.0,116.0,118.0,117.0,126.0,94.2
2,Algeria,0.0,,,,,0.0,,,,,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00182,0.0181,0.0177,0.0173,0.00478,0.0163,0.04,0.0585,0.0596,0.235,0.277,0.318,1.41,4.48,14.9,41.2,62.4,80.7,77.8,92.6,91.1,97.1,100.0,104.0,111.0,109.0,116.0,111.0,112.0
3,Andorra,0.0,,,,,0.0,,,,,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.31,1.28,1.25,4.42,8.53,13.4,22.0,32.0,36.0,43.7,46.8,70.9,76.6,81.9,85.2,76.8,76.6,76.4,77.6,77.7,77.5,79.1,83.6,91.4,98.5,104.0,107.0
4,Angola,0.0,,,,,0.0,,,,,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00841,0.0135,0.0143,0.0229,0.0474,0.0639,0.151,0.157,0.443,0.799,1.93,3.94,8.29,15.2,23.7,31.2,36.0,40.3,49.8,50.9,51.1,52.2,49.8,45.1,44.7,43.1


In [None]:
# types and look for shape, types and instances of missing or possibly errant data 
df_gini.info()
df_gini

comments about df_gini:
- many more years than other datasets
- consider limiting scope of question to "last 20 years"

In [None]:
# types and look for shape, types and instances of missing or possibly errant data 
df_incm.info()
df_incm

comments about df_incm:
- same countries as above, making comparison easier

In [None]:
# types and look for shape, types and instances of missing or possibly errant data 
df_crbn.info()
df_crbn

comments about df_crbn:
- most recent year is 2014, so limit question to 2014 for all

In [None]:
# types and look for shape, types and instances of missing or possibly errant data 
df_phon.info()
df_phon

comments about df_phon
- up to 2018

 
### Data Cleaning

Merging variables into one large dataframe by country, truncating to years of interest (1996-2014), Getting 3-letter country codes 

In [None]:
# Truncate and Merge dataframes by country
data_frames = [df_gini, df_incm, df_crbn, df_phon]
df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['country'],
                                            how='outer'), data_frames)




In [None]:
df_merged

Unnamed: 0,country_x,1800_x,1801_x,1802_x,1803_x,1804_x,1805_x,1806_x,1807_x,1808_x,1809_x,1810_x,1811_x,1812_x,1813_x,1814_x,1815_x,1816_x,1817_x,1818_x,1819_x,1820_x,1821_x,1822_x,1823_x,1824_x,1825_x,1826_x,1827_x,1828_x,1829_x,1830_x,1831_x,1832_x,1833_x,1834_x,1835_x,1836_x,1837_x,1838_x,...,1978_y,1979_y,1980_y,1981_y,1982_y,1983_y,1984_y,1985_y,1986_y,1987_y,1988_y,1989_y,1990_y,1991_y,1992_y,1993_y,1994_y,1995_y,1996_y,1997_y,1998_y,1999_y,2001_y,2002_y,2003_y,2004_y,2005_y,2006_y,2007_y,2008_y,2009_y,2010_y,2011_y,2012_y,2013_y,2014_y,2015,2016,2017,2018
0,Afghanistan,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,30.5,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Albania,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,38.9,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Bosnia and Herzegovina,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,Montenegro,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,30.2,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Algeria,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.2,56.3,56.4,56.5,56.6,56.7,56.8,56.9,57.0,57.2,57.4,57.5,57.7,57.9,58.1,58.2,58.4,58.6,58.8,58.9,59.1,59.3,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
765,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0000,0.0000,0.0000,0.00000,0.00229,0.00404,0.0164,0.0410,0.0729,0.1120,0.165,0.510,0.735,1.25,2.09,2.72,9.44,20.9,44.8,58.5,73.5,87.8,68.8,71.8,71.1,70.4,74.0,75.9,71.5
766,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0000,0.0000,0.0000,0.00000,0.00000,0.03900,0.0720,0.0897,0.1180,0.1240,0.166,0.185,2.530,3.92,5.15,6.06,7.00,11.8,16.0,57.2,71.9,56.4,58.6,49.6,59.1,64.5,78.5,79.9,85.9
767,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00963,0.0192,0.0378,0.0826,0.38200,0.86900,1.49000,1.8400,2.6000,4.6900,8.6300,15.900,26.300,26.100,27.50,32.40,47.30,70.00,87.4,99.2,100.0,98.0,99.6,104.0,104.0,102.0,96.7,92.5,83.3,71.8
768,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0000,0.0000,0.0000,0.00113,0.00562,0.01700,0.0314,0.0906,0.2080,0.2850,0.416,1.550,2.330,3.33,5.97,11.40,22.30,52.7,86.8,113.0,127.0,143.0,147.0,136.0,148.0,130.0,129.0,127.0,147.0


In [None]:
# Get 3-letter country codes 

list_countries = df_mw['country_name'].unique().tolist()
# print(list_countries) # Uncomment to see list of countries
d_country_code = {}  # To hold the country names and their ISO
for country in list_countries:
    try:
        country_data = pycountry.countries.search_fuzzy(country)
        # country_data is a list of objects of class pycountry.db.Country
        # The first item  ie at index 0 of list is best fit
        # object of class Country have an alpha_3 attribute
        country_code = country_data[0].alpha_3
        d_country_code.update({country: country_code})
    except:
        print('could not add ISO 3 code for ->', country)
        # If could not find country, make ISO code ' '
        d_country_code.update({country: ' '})
        
 # create a new column iso_alpha in the df
# and fill it with appropriate iso 3 code
for k, v in d_country_code.items():
    df_mw.loc[(df_mw.country_name == k), 'iso_alpha'] = v



<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it is satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.

## Submitting your Project 

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])