# Week 2 Assignment
## Overview
The following is a simple exploratory data analysis of the GapMinder data set as part of the Coursera Data Management and Visualization course.  In this analysis I would like to examine the relationship between the economic well-being of a society and the level of democratization.

## About the Data
The data for this analysis comes from a subset of the GapMinder project data.  In this section I will examine the variables I am interested in, in more detail.

### Income per Person
To measure economic well-being I will be using GDP per capita data, which originally came from the World Bank.  It is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources.  The data are in constant 2000 US dollars.  The GapMinder data set that I will be analyzing contains GDP per capita for 2010.

### Democracy Score
The democracy score comes from the Polity IV project.  It is a summary measure of a country's democratic and free nature, ranging from -10 (the lowest value) to 10 (the highest).  The GapMinder data set that I will be analyzing contains the polity score for 2009.  To get a feel for this data, take a look at the following figure provided by the Polity IV project authors:

![polity categories](http://www.systemicpeace.org/polity/demmap13.jpg)

### Democracy Categories
Since the democracy score ranges from -10 to 10, it can take 21 distinct values.  That is a bit unwieldy, so I will create a new variable using the democracy categories detailed in the map legend above.

## Preprocessing the Data
I begin by importing the libraries needed for the analysis:

In [1]:
# Import libraries needed
import pandas as pd
import numpy as np

Now I have Python parse the CSV file and print out some basic statistics about the data frame (df):

In [2]:
# Read in the Data
df = pd.read_csv('gapminder.csv', low_memory=False)

# Print some basic statistics
n = str(len(df))
cols = str(len(df.columns))
print('Number of observations: '+ n +' (rows)')
print('Number of variables: '+ cols +' (columns)')

Number of observations: 213 (rows)
Number of variables: 16 (columns)


There are 213 observations with 16 variables in the data frame.  I need to clean up the raw data prior to the analysis.  I will first change the variable types for the variables of interest:

In [3]:
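# Change the data type for variables of interest.
# convert_objects() was removed in pandas 1.0; pd.to_numeric() is the replacement.
# errors='coerce' turns blank or non-numeric cells into NaN.
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')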
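
To illustrate what numeric coercion does to a raw text column, here is a minimal sketch on a made-up series (the values are hypothetical, not taken from the GapMinder file). It assumes a pandas version that provides `pd.to_numeric`:

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, plus one blank cell.
raw = pd.Series(['10', '-7', ' ', '3'])

# errors='coerce' converts anything that cannot be parsed into NaN.
scores = pd.to_numeric(raw, errors='coerce')

print(scores.dtype)         # numeric dtype instead of object
print(scores.isna().sum())  # number of cells that became NaN
```

The blank cell becomes a missing value, which is why the next step counts missing observations.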

I want to see how many missing observations there are in the variables of interest:

In [4]:
print ('Countries with a Democracy Score: ' + str(df['polityscore'].count()) + ' out of ' + str(len(df)) + ' (' + str(len(df) - df['polityscore'].count()) + ' missing)')
print ('Countries with a GDP Per Capita: ' + str(df['incomeperperson'].count()) + ' out of ' + str(len(df)) + ' (' + str(len(df) - df['incomeperperson'].count()) + ' missing)')

Countries with a Democracy Score: 161 out of 213 (52 missing)
Countries with a GDP Per Capita: 190 out of 213 (23 missing)
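
The same missing-value counts can also be read off directly with `isna().sum()`. The following is a small sketch on a hypothetical toy frame that reuses the two column names from above; the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names match the real data.
toy = pd.DataFrame({
    'polityscore': [10, np.nan, -7, np.nan],
    'incomeperperson': [39000.0, 550.0, np.nan, 1200.0],
})

# isna() marks missing cells; sum() counts them per column.
missing = toy[['polityscore', 'incomeperperson']].isna().sum()
print(missing)
```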


I need to have a set of data with both variables of interest so I will subset the data frame:

In [5]:
# Get the rows not missing a value
subset = df[np.isfinite(df['polityscore'])]
subset = subset[np.isfinite(subset['incomeperperson'])]
print('Number of observations: '+ str(len(subset)) +' (rows)')

Number of observations: 155 (rows)
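
An alternative way to keep only complete rows, sketched here on hypothetical toy data, is `DataFrame.dropna` with the `subset` argument. It should select the same rows as the two `np.isfinite` filters whenever the columns contain only finite numbers and NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; country names and values are made up.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'polityscore': [10, np.nan, -7, 4],
    'incomeperperson': [39000.0, 550.0, np.nan, 1200.0],
})

# Drop any row that is missing a value in either column of interest.
complete = toy.dropna(subset=['polityscore', 'incomeperperson'])
print('Number of observations: ' + str(len(complete)) + ' (rows)')
```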


155 of the 213 records have complete data.  Now I will create my new variable for the democracy score categories.  I will do this by defining a function and then using that function to create the new variable:

In [6]:
# This function converts the polity score to a category
def convert_polityscore_to_category(score):
    if score == 10:
        return('1 - Full Democracy')
    elif score > 5:
        return('2 - Democracy')
    elif score > 0:
        return ('3 - Open Anocracy')
    elif score > -6:
        return ('4 - Closed Anocracy')
    else:
        return('5 - Autocracy')

# Now we can use the function to create the new variable
subset['democracy'] = subset['polityscore'].apply(convert_polityscore_to_category)
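
As a cross-check on the thresholds in the function above, the same five categories can be sketched with `pd.cut` and explicit bin edges. This assumes the scores are whole numbers between -10 and 10; the `edges` list is my own choice of right-closed intervals, not something taken from the Polity IV documentation:

```python
import pandas as pd

labels = ['5 - Autocracy', '4 - Closed Anocracy', '3 - Open Anocracy',
          '2 - Democracy', '1 - Full Democracy']

# Right-closed intervals: (-11, -6], (-6, 0], (0, 5], (5, 9], (9, 10]
edges = [-11, -6, 0, 5, 9, 10]

# Every possible integer score from -10 to 10.
scores = pd.Series(range(-10, 11))
categories = pd.cut(scores, bins=edges, labels=labels)

print(pd.DataFrame({'score': scores, 'category': categories}))
```

Listing every possible score next to its category makes it easy to confirm that the boundaries (for example -6 versus -5, or 9 versus 10) fall where intended.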

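I also need to make a change to the GDP per capita.  Since GDP per person is a continuous variable, I need to create a discrete one. For this assignment I will split it into five groups, which I label as quintiles. Note that `pd.cut` with an integer argument creates five equal-width bins over the income range rather than five groups with equal numbers of countries, so these are not quintiles in the strict statistical sense: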

In [7]:
subset['incomequintiles'] = pd.cut(subset['incomeperperson'], 5, labels=['Lowest','Second','Middle','Fourth','Highest'])
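
The difference between equal-width bins (`pd.cut`) and equal-count bins (`pd.qcut`) matters for skewed data such as income. Here is a minimal sketch on ten made-up income values (hypothetical, not from the GapMinder file):

```python
import pandas as pd

labels = ['Lowest', 'Second', 'Middle', 'Fourth', 'Highest']

# Hypothetical right-skewed incomes: many low values and one very high one.
income = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 50000])

# pd.cut with an integer: five bins of equal width across the value range.
equal_width = pd.cut(income, 5, labels=labels)

# pd.qcut: five bins holding (roughly) the same number of observations.
equal_count = pd.qcut(income, 5, labels=labels)

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

With skewed values, the equal-width bins put almost everything in the lowest group, while `pd.qcut` spreads the observations evenly. The analysis here keeps `pd.cut`, so its groups should be read as income ranges rather than true quintiles.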

## Exploratory Data Analysis
### 2009 Democracy Score
The first variable of interest is the polityscore variable.  This variable is a measure of the level of openness of the country.

In [8]:
print('Countries by Democracy Score (-10=autocracy & 10=full democracy)')
polity_counts = subset.groupby('polityscore').size()
print(polity_counts)

Countries by Democracy Score (-10=autocracy & 10=full democracy)
polityscore
-10     2
-9      3
-8      2
-7     11
-6      2
-5      2
-4      6
-3      6
-2      5
-1      4
 0      4
 1      3
 2      3
 3      2
 4      4
 5      7
 6     10
 7     13
 8     19
 9     15
 10    32
dtype: int64


In [9]:
print('Percent of Countries by Democracy Score')
polity_percents = polity_counts * 100 / len(subset)
print(polity_percents)

Percent of Countries by Democracy Score
polityscore
-10     1.290323
-9      1.935484
-8      1.290323
-7      7.096774
-6      1.290323
-5      1.290323
-4      3.870968
-3      3.870968
-2      3.225806
-1      2.580645
 0      2.580645
 1      1.935484
 2      1.935484
 3      1.290323
 4      2.580645
 5      4.516129
 6      6.451613
 7      8.387097
 8     12.258065
 9      9.677419
 10    20.645161
dtype: float64
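
The counts and percentages above can also be produced with `value_counts`, whose `normalize=True` option returns proportions directly. This is a small sketch on hypothetical scores, not the real data:

```python
import pandas as pd

# Hypothetical polity scores for eight countries.
scores = pd.Series([10, 10, 8, 8, 8, -7, 0, 10])

# Frequency of each score, ordered by score rather than by frequency.
counts = scores.value_counts().sort_index()

# normalize=True gives proportions; multiply by 100 for percentages.
percents = scores.value_counts(normalize=True).sort_index() * 100

print(counts)
print(percents)
```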


There are 32 countries that are full democracies (have a polity score of 10).  This is roughly 21% of all the data.  Only 2 observations have the lowest possible score of -10. Most of the countries appear to have a score greater than zero.  To see how many, I will compute a quick percentage:

In [10]:
greater_than_zero = subset[subset['polityscore'] > 0]
# Floor division keeps a whole-number percentage on both Python 2 and Python 3
greater_than_zero_percent = len(greater_than_zero) * 100 // len(subset)
print('Number of countries with a Polity score greater than zero: ' + str(len(greater_than_zero)))
print('Percent of countries with a Polity score greater than zero: ' + str(greater_than_zero_percent) + '%')

Number of countries with a Polity score greater than zero: 108
Percent of countries with a Polity score greater than zero: 69%
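
A compact way to get the same kind of share is to take the mean of a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses hypothetical scores. For the real counts, 108 × 100 / 155 ≈ 69.7, which the printed output above truncates to 69%.

```python
import pandas as pd

# Hypothetical polity scores for eight countries.
scores = pd.Series([10, 8, -7, 0, 3, -2, 6, 9])

# Boolean mask: True where the score is strictly positive.
positive = scores > 0

n_positive = int(positive.sum())    # number of True values
pct_positive = positive.mean() * 100  # share of True values, in percent

print('{} of {} countries ({:.1f}%)'.format(n_positive, len(scores), pct_positive))
```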


108 of the 155 countries in this data set have some degree of openness.  That makes up over half of the data (69%).

### 2010 GDP Per Person Quintiles
The second variable of interest is the measure of economic well-being.  Per capita GDP is the most common measure of economic well-being, although it is not a perfect one.  As previously noted, it is the 2010 per capita GDP expressed in constant 2000 US dollars.

In [11]:
print('Countries by Per Capita GDP Quintiles')
incomequintiles_counts = subset.groupby('incomequintiles').size()
print(incomequintiles_counts)

Countries by Per Capita GDP Quintiles
incomequintiles
Lowest     119
Second      14
Middle       5
Fourth      10
Highest      7
dtype: int64


In [12]:
print('Percent of Countries by Per Capita Quintiles')
incomequintiles_percents = incomequintiles_counts * 100 / len(subset)
print(incomequintiles_percents)

Percent of Countries by Per Capita Quintiles
incomequintiles
Lowest     76.774194
Second      9.032258
Middle      3.225806
Fourth      6.451613
Highest     4.516129
dtype: float64


The data are clustered at the lower end.  119 (roughly 77%) of the observations fall in the lowest group, and only seven are in the highest.  The groups are this uneven because the bins have equal width across the income range, not equal numbers of countries.

### 2009 Democracy Categories
The final variable that I will examine is the one that I created by summarizing the 21 Polity IV score values into five categories:

In [13]:
print('Countries by Democracy Category')
democracy_counts = subset.groupby('democracy').size()
print(democracy_counts)

Countries by Democracy Category
democracy
1 - Full Democracy     32
2 - Democracy          57
3 - Open Anocracy      19
4 - Closed Anocracy    27
5 - Autocracy          20
dtype: int64


In [14]:
print('Percent of Countries by Democracy Category')
democracy_percents = democracy_counts * 100 / len(subset)
print(democracy_percents)

Percent of Countries by Democracy Category
democracy
1 - Full Democracy     20.645161
2 - Democracy          36.774194
3 - Open Anocracy      12.258065
4 - Closed Anocracy    17.419355
5 - Autocracy          12.903226
dtype: float64


The largest group of countries in my data set falls into the "Democracy" category (57, about 37%), followed by "Full Democracy" (32, about 21%).  47 of the 155 countries (about 30%) are closed anocracies or autocracies.