<a href="https://colab.research.google.com/github/DonnaVakalis/Livability/blob/master/Gapminder2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Which other metrics track the GINI coefficient, using data from Gapminder?

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

**Questions posed:**
### 1) Which countries have the highest and lowest GINI coefficients, averaged over the last ten years?
### 2) In which direction have GINI coefficients moved worldwide over the last 50 years, and in which countries are they rising or falling (on average, over the last 10 years)?
### 3) Within the U.S., how do state-level GINI coefficients compare with the country as a whole?


**Datasets:** 

GINI world downloaded from https://www.gapminder.org/data/

USA downloaded from https://en.wikipedia.org/wiki/List_of_U.S._states_by_Gini_coefficient 


<a id='wrangling'></a>
## Data Wrangling


### LOAD DATA


In [1]:
# Install pycountry
!pip install pycountry

# Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from google.colab import drive
import pycountry
from functools import reduce #for merging dataframes

# Settings
%matplotlib inline 
pd.options.display.float_format = '{:,.2f}'.format # display numbers with two decimal places

Collecting pycountry
  Downloading https://files.pythonhosted.org/packages/76/73/6f1a412f14f68c273feea29a6ea9b9f1e268177d32e0e69ad6790d306312/pycountry-20.7.3.tar.gz (10.1MB)
Building wheels for collected packages: pycountry
  Building wheel for pycountry (setup.py) ... done
  Created wheel for pycountry: filename=pycountry-20.7.3-py2.py3-none-any.whl size=10746863 sha256=b3822c8b253d3f3877fccdc43a8e607e669b9e691da2d8cb3278b2f385b97f6b
  Stored in directory: /root/.cache/pip/wheels/33/4e/a6/be297e6b83567e537bed9df4a93f8590ec01c1acfbcd405348
Successfully built pycountry
Installing collected packages: pycountry
Successfully installed pycountry-20.7.3


In [2]:
# Mount Google Drive
drive.mount('/content/gdrive')
os.chdir("/content/gdrive/My Drive/")

Mounted at /content/gdrive


In [4]:
# Load the data
base_dir = "/content/gdrive/My Drive/Colab Notebooks/project_gapminder/"

In [5]:

# Global GINI coefficients
file = base_dir + 'gini.csv' # from https://www.gapminder.org/data/
df_gini = pd.read_csv(file) # read the global GINI data (CSV format)

# U.S. GINI coefficients by state
file = base_dir + 'Gini_state_by_state.csv' # from https://en.wikipedia.org/wiki/List_of_U.S._states_by_Gini_coefficient
df_gini_USA = pd.read_csv(file) # read the state-level GINI data (CSV format)

In [None]:
# Check shape, dtypes, and instances of missing or possibly errant data
df_gini.head()
df_gini.info()
list_countries = df_gini['country'].unique().tolist()
len(list_countries)

Comments about df_gini:
- Years range from 1800 to 2040, which is puzzling because it extends past the present; the later years are presumably projections rather than observations.
- There are 195 countries.
- The dataset is complete: no missing values for the countries and years given.
- It covers many more years than the other Gapminder datasets explored.
- Consider limiting the scope of the questions to the last 20 years, and read more about how the GINI coefficient is calculated.
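
The observations above can be checked programmatically rather than by eye. What follows is a minimal sketch, assuming the wide Gapminder layout (a `country` column followed by one column per year). The helper name `summarize_wide` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def summarize_wide(df, id_col="country"):
    """Summarize a wide table: one id column plus one column per year."""
    year_cols = [c for c in df.columns if c != id_col]
    years = [int(c) for c in year_cols]
    return {
        "first_year": min(years),
        "last_year": max(years),
        "n_countries": df[id_col].nunique(),
        "n_missing": int(df[year_cols].isna().sum().sum()),
    }

# Toy stand-in for df_gini (values are made up for illustration only)
toy = pd.DataFrame({
    "country": ["A", "B"],
    "1800": [30.0, 40.0],
    "1801": [30.5, None],
})
summary = summarize_wide(toy)
print(summary)
```

In the notebook, `summarize_wide(df_gini)` would be the equivalent call.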

In [29]:
# Check shape, dtypes, and instances of missing or possibly errant data in the state-level table
df_gini_USA.head()
df_gini_USA.info()
 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   State   52 non-null     object 
 1   GINI    52 non-null     float64
dtypes: float64(1), object(1)
memory usage: 960.0+ bytes



Comments about df_gini_USA:

- The data is for a single year, 2010.
- All 52 rows are present with no missing values. That is more than the 50 states, so the table presumably includes some non-state entries.
- Will need to add 3-letter state codes.

 
### Data Cleaning
---
For global data:
- truncate to the years of interest (1970-2020)
- create a column with the average
- create a column with the delta over the last 50 years
- get 3-letter country codes
- get the U.S. GINI from 2010

For U.S. data:
- get 3-letter state codes
- create a column with the delta from the country as a whole in 2010 (from the global data)
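
The state-level step is not implemented in this notebook, so here is a hedged sketch of what it might look like. The helper `add_delta_from_national` and the toy values are hypothetical. It also assumes the state values and the national value are on the same scale, which should be checked because the Gapminder table uses a 0-100 scale.

```python
import pandas as pd

def add_delta_from_national(df_states, national_gini, col="GINI"):
    """Return a copy with each state's GINI minus the national value."""
    return df_states.assign(delta_from_national=df_states[col] - national_gini)

# Toy stand-in for df_gini_USA; values are made up and share one scale
toy_states = pd.DataFrame({"State": ["X", "Y"], "GINI": [0.45, 0.50]})
result = add_delta_from_national(toy_states, national_gini=0.47)
print(result)
```

In the notebook, the national value could be read from the global table with something like `df_gini.loc[df_gini['country'] == 'United States', '2010'].iloc[0]`, assuming that is how the country is labelled in the Gapminder file.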

In [6]:
# Select "country" and years "1970" to "2020" within gini dataframe
cols_to_keep = np.r_[0, 171:222] # 242 columns: 'country', then 1800 through 2040; positions 171-221 are 1970-2020
df_gini = df_gini.iloc[:,cols_to_keep]
df_gini.head()

Unnamed: 0,country,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,30.5,30.8,31.2,31.6,31.9,32.1,32.2,32.2,32.2,32.2,32.2,32.1,32.0,32.1,32.5,33.0,33.7,34.7,35.4,36.0,36.4,36.7,36.7,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8,36.8
1,Albania,26.8,26.8,26.8,26.8,26.8,26.8,26.8,26.8,26.8,26.8,26.9,26.9,26.9,26.9,26.9,26.9,26.9,26.9,26.9,26.9,27.0,27.0,27.0,27.0,27.0,27.2,27.5,28.0,28.6,29.4,30.2,30.7,31.0,31.1,31.0,30.7,30.4,30.2,30.0,29.7,29.5,29.3,29.1,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0
2,Algeria,39.9,39.9,39.9,40.0,40.0,40.0,40.0,40.0,40.0,40.1,40.1,40.1,40.1,40.1,40.1,40.1,40.2,40.0,39.8,39.4,38.8,38.1,37.4,36.7,36.1,35.5,34.9,34.4,34.0,33.5,33.1,32.6,32.2,31.7,31.2,30.8,30.3,29.9,29.4,29.0,28.5,28.2,27.9,27.7,27.6,27.6,27.6,27.6,27.6,27.6,27.6
3,Andorra,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0,40.0
4,Angola,54.8,54.7,54.6,54.5,54.4,54.4,54.3,54.2,54.1,54.0,53.9,53.8,53.7,53.6,53.5,53.4,53.3,53.2,53.1,53.0,52.9,52.8,52.8,52.7,52.6,52.5,52.4,52.3,52.2,52.1,51.8,51.3,50.6,49.7,48.5,47.3,46.2,45.0,44.1,43.4,42.9,42.7,42.6,42.6,42.6,42.6,42.6,42.6,42.6,42.6,42.6
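
The positional slice `np.r_[0, 171:222]` depends on the exact column order of the file. A label-based alternative is sketched below, assuming the year columns are strings such as `'1970'` (as the later `df_gini['1970']` lookup suggests). The helper name `select_years` is hypothetical.

```python
import pandas as pd

def select_years(df, start, end, id_col="country"):
    """Keep id_col plus the year columns from start to end inclusive, by label."""
    wanted = [str(y) for y in range(start, end + 1)]
    missing = [y for y in wanted if y not in df.columns]
    if missing:
        raise KeyError(f"year columns not found: {missing}")
    return df.loc[:, [id_col] + wanted]

# Toy stand-in for df_gini (values are made up for illustration only)
toy = pd.DataFrame({
    "country": ["A"],
    "1969": [1.0],
    "1970": [2.0],
    "1971": [3.0],
    "1972": [4.0],
})
subset = select_years(toy, 1970, 1971)
print(subset)
```

Selecting by label fails loudly if a year is absent, whereas a positional slice would silently pick the wrong columns if the file layout changed.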


In [16]:
# Create new variables: recent average and overall change

df_ = df_gini.assign(
    recent_mean = df_gini.iloc[:,-11:].mean(axis=1, numeric_only=True), # average over 2010-2020 inclusive (11 columns)
    delta = df_gini['1970'] - df_gini['2020'] # 1970 minus 2020: a positive delta means the GINI fell since 1970
)

Unnamed: 0,country,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,recent_mean,delta
183,Ukraine,31.4,31.3,31.2,31.1,30.9,30.8,30.7,30.6,30.5,30.2,29.8,29.2,28.5,27.7,26.8,25.9,25.1,24.7,24.8,25.4,26.5,27.6,29.5,32.0,33.4,34.1,34.5,33.5,31.4,30.2,29.4,28.9,28.9,28.9,29.1,28.7,28.3,27.6,26.7,25.7,25.2,24.8,24.5,24.7,24.8,24.8,24.9,25.1,25.0,25.0,25.0,24.89,6.4
154,Slovenia,33.8,33.5,33.1,32.5,31.8,31.2,30.6,29.9,29.3,28.7,28.0,27.4,26.8,26.1,25.5,24.9,24.5,24.5,24.8,25.5,26.4,27.3,28.0,28.5,28.8,28.9,28.7,28.6,28.6,28.7,28.8,29.3,28.5,27.7,26.8,25.8,24.4,24.4,24.4,24.5,24.8,25.3,25.5,25.6,25.7,25.6,25.5,25.4,25.4,25.4,25.4,25.42,8.4
45,Czech Republic,18.8,18.7,18.7,18.7,18.7,18.6,18.6,18.6,18.5,18.5,18.5,18.4,18.4,18.4,18.4,18.3,18.5,19.0,19.8,20.9,22.3,23.7,24.8,25.6,26.0,26.2,26.1,26.1,26.2,26.5,26.7,26.9,27.1,27.2,27.1,26.9,26.7,26.4,26.4,26.3,26.3,26.3,26.3,26.2,26.0,26.0,25.9,25.9,25.9,25.9,25.9,26.05,-7.1
153,Slovak Republic,23.0,22.7,22.4,22.1,21.8,21.5,21.2,21.0,20.7,20.4,20.1,19.8,19.5,19.2,18.9,18.7,18.7,18.8,19.0,19.2,19.5,19.8,20.4,21.4,22.7,24.0,25.0,25.8,26.4,26.6,26.9,27.2,27.5,28.0,27.9,27.7,27.4,27.1,26.7,26.7,26.6,26.9,26.8,26.7,26.7,26.7,26.4,26.5,26.5,26.5,26.5,26.62,-3.5
128,Norway,26.3,26.1,25.9,25.7,25.6,25.6,25.7,25.9,26.0,26.2,26.3,26.4,26.4,26.4,26.3,26.3,26.4,26.5,26.6,26.9,27.1,27.3,27.6,27.8,28.1,28.3,28.5,28.7,28.7,28.6,28.4,28.2,28.7,29.1,28.8,28.6,28.5,27.4,26.5,26.3,26.0,25.9,26.0,26.3,26.8,27.1,27.4,27.5,27.5,27.5,27.5,26.86,-1.2


In [None]:
# Get 3-letter country codes and store them in column "iso_alpha"

# print(list_countries) # Uncomment to see list of countries
d_country_code = {}  # Maps each country name to its ISO 3166-1 alpha-3 code
for country in list_countries:
    try:
        # search_fuzzy returns a list of pycountry Country objects, best match first;
        # each Country object has an alpha_3 attribute
        country_data = pycountry.countries.search_fuzzy(country)
        d_country_code[country] = country_data[0].alpha_3
    except LookupError:
        print('could not add ISO 3 code for ->', country)
        # If the country could not be found, use ' ' as its ISO code
        d_country_code[country] = ' '

# Create a new column iso_alpha in the df and fill it with the matching ISO 3 code
df_['iso_alpha'] = df_['country'].map(d_country_code)

df_.head()

# ideally would manually add ISO 3 code for North Korea, South Korea, Swaziland, Congo, Dem. Rep. and Congo, Rep. ...
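
The manual step mentioned in the comment above could be sketched as follows. The codes are the ISO 3166-1 alpha-3 codes for those five countries. The helper `apply_manual_codes` is hypothetical, and the dictionary keys are assumed to match the spelling in the `country` column exactly.

```python
import pandas as pd

# Manual ISO 3166-1 alpha-3 codes for names the fuzzy search may miss.
# Keys are assumed to match the 'country' column spelling exactly.
manual_iso = {
    "North Korea": "PRK",
    "South Korea": "KOR",
    "Swaziland": "SWZ",
    "Congo, Dem. Rep.": "COD",
    "Congo, Rep.": "COG",
}

def apply_manual_codes(df, overrides, name_col="country", code_col="iso_alpha"):
    """Return a copy where overrides replace the existing code for matching names."""
    out = df.copy()
    manual = out[name_col].map(overrides)          # NaN where no override exists
    out[code_col] = manual.fillna(out[code_col])   # keep the existing code otherwise
    return out

# Toy stand-in for df_ after the fuzzy lookup
toy = pd.DataFrame({"country": ["North Korea", "Norway"], "iso_alpha": [" ", "NOR"]})
fixed = apply_manual_codes(toy, manual_iso)
print(fixed)
```

In the notebook this would be applied as `df_ = apply_manual_codes(df_, manual_iso)`.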

<a id='eda'></a>
## Exploratory Data Analysis

This section computes statistics and creates visualizations to address the research questions posed in the Introduction. The approach is systematic: look at one variable at a time, then at relationships between variables.

### What does the GINI coefficient look like across the globe today?
### What does the GINI coefficient look like over time for a sample of countries at the quartiles?
### Which countries are going up versus down over the last X years?
### Delve into one country: what are the GINI coefficients at the state level?

In [None]:
# Sort countries by recent mean GINI, lowest first
df_.sort_values(by='recent_mean',ascending=True, inplace=True)
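
With `recent_mean` in place, the first research question (highest and lowest recent GINI) can be answered without sorting the whole frame. The sketch below uses pandas `nsmallest` and `nlargest`. The helper name `extremes` is hypothetical, and the toy rows reuse four `recent_mean` values shown in the output earlier in this notebook.

```python
import pandas as pd

def extremes(df, col="recent_mean", n=5, name_col="country"):
    """Return (lowest, highest) n rows by col, keeping only name and value."""
    lowest = df.nsmallest(n, col)[[name_col, col]]
    highest = df.nlargest(n, col)[[name_col, col]]
    return lowest, highest

# Small stand-in for df_ using values from the output shown earlier
toy = pd.DataFrame({
    "country": ["Norway", "Ukraine", "Czech Republic", "Slovenia"],
    "recent_mean": [26.86, 24.89, 26.05, 25.42],
})
lowest, highest = extremes(toy, n=2)
print(lowest)
print(highest)
```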

 

### Research Question 2: In which direction are GINI coefficients moving?

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.
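
One possible starting point for the direction question is to label each country by the sign of `delta`. Because `delta` was computed as 1970 minus 2020, a positive value means the GINI went down. The helper `label_direction` is hypothetical, and the toy rows reuse `delta` values shown in the output earlier in this notebook.

```python
import pandas as pd

def label_direction(df, delta_col="delta"):
    """Label each row 'down', 'up', or 'flat' given delta = 1970 value minus 2020 value."""
    def direction(d):
        if d > 0:
            return "down"   # 1970 value was higher, so the GINI fell
        if d < 0:
            return "up"     # 2020 value is higher, so the GINI rose
        return "flat"
    return df.assign(direction=df[delta_col].apply(direction))

# Small stand-in for df_ using values from the output shown earlier
toy = pd.DataFrame({
    "country": ["Ukraine", "Slovenia", "Czech Republic", "Slovak Republic", "Norway"],
    "delta": [6.4, 8.4, -7.1, -3.5, -1.2],
})
labeled = label_direction(toy)
counts = labeled["direction"].value_counts().to_dict()
print(labeled)
print(counts)
```

This only compares the two endpoint years, so it says nothing about the path in between.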


<a id='conclusions'></a>
## Conclusions

