# Week 2 Assignment
## Overview
The following is a simple exploratory data analysis of the GapMinder data set as part of the Coursera Regression Modeling in Practice course.  In this analysis I would like to examine the relationship between the economic well-being of a society and the level of democratization.

## About the Data
The data for this analysis come from a subset of the GapMinder project data.  In this section I examine the variables of interest in more detail.

### Income per Person
In order to measure the economic well-being I will be using GDP per capita data.  This originally came from the World Bank.  It is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources.  The data are in constant 2000 US Dollars.  The GapMinder data set that I will be analyzing is the 2010 GDP per capita.

### Democracy Score
The democracy score comes from the Polity IV project.  It is a summary measure of a country's democratic and free nature, ranging from -10 (the lowest value) to 10 (the highest).  The GapMinder data set that I will be analyzing contains the polity score for 2009.  To get a feel for this data, take a look at the following figure provided by the Polity IV project authors:

![polity categories](http://www.systemicpeace.org/polity/demmap13.jpg)

### Democracy Categories
Since the democracy score takes integer values from -10 to 10, there are 21 possible values.  This would be unwieldy to work with, so I will create a new variable using the democracy categories detailed in the map legend above.
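
As a rough sketch of how such a category variable could be built, `pd.cut` can bin the scores.  The cut points below are my reading of the Polity IV regime categories (autocracy -10 to -6, closed anocracy -5 to 0, open anocracy 1 to 5, democracy 6 to 9, full democracy 10) and should be checked against the map legend.  The name `polity_category` and the sample scores are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of polity scores (not taken from the GapMinder file)
scores = pd.Series([-10, -6, -5, 0, 1, 5, 6, 9, 10])

# Assumed cut points; pd.cut uses right-closed intervals by default,
# so the bins are (-11, -6], (-6, 0], (0, 5], (5, 9], (9, 10]
bins = [-11, -6, 0, 5, 9, 10]
labels = ['Autocracy', 'Closed Anocracy', 'Open Anocracy',
          'Democracy', 'Full Democracy']

polity_category = pd.cut(scores, bins=bins, labels=labels)
print(polity_category.tolist())
```

The result is an ordered categorical, which keeps the regime types in a sensible order when plotting.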

## Preprocessing the Data
I begin by importing the libraries needed for the analysis:

In [1]:
%matplotlib inline
# Import libraries needed
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt



Now I have Python parse the CSV file and print out some basic statistics about the data frame (df):

In [2]:
# Read in the Data
df = pd.read_csv('gapminder.csv', low_memory=False)

# Print some basic statistics
n = str(len(df))
cols = str(len(df.columns))
print('Number of observations: '+ n +' (rows)')
print('Number of variables: '+ cols +' (columns)')

Number of observations: 213 (rows)
Number of variables: 16 (columns)


There are 213 observations with 16 variables in the data frame.  I need to clean up the raw data prior to the analysis.  I will first change the variable types for the variables of interest:

In [3]:
# Change the data type for variables of interest
df['incomeperperson'] = pd.to_numeric(df['incomeperperson'], errors='coerce')
df['polityscore'] = pd.to_numeric(df['polityscore'], errors='coerce')
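
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up column.  The values and the `'not available'` placeholder are hypothetical and are not taken from the GapMinder file.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, plus one unparseable entry
raw = pd.Series(['8740.97', 'not available', '1036.83'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
clean = pd.to_numeric(raw, errors='coerce')
print(clean)
```

The unparseable entry becomes `NaN`, so it is counted as missing in the steps that follow rather than raising an error.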

I want to see how many missing observations there are in the variables of interest:

In [4]:
n_income = df['incomeperperson'].count()
n_polity = df['polityscore'].count()
print(f'Countries with a GDP Per Capita: {n_income} out of {len(df)} ({len(df) - n_income} missing)')
print(f'Countries with a Democracy Score: {n_polity} out of {len(df)} ({len(df) - n_polity} missing)')

Countries with a GDP Per Capita: 190 out of 213 (23 missing)
Countries with a Democracy Score: 161 out of 213 (52 missing)


I need to have a set of data with both variables of interest so I will subset the data frame:

In [5]:
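# Get the rows not missing a value; .copy() makes subset an independent
# data frame so that adding columns later does not raise SettingWithCopyWarning
subset = df[np.isfinite(df['incomeperperson'])]
subset = subset[np.isfinite(subset['polityscore'])].copy()
print('Number of observations: '+ str(len(subset)) +' (rows)')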

Number of observations: 155 (rows)
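
For data with no infinite values, `DataFrame.dropna` with the `subset` argument should select the same rows in one step.  The sketch below uses a hypothetical toy frame rather than the GapMinder data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both columns
toy = pd.DataFrame({
    'incomeperperson': [1200.0, np.nan, 560.0, 30000.0],
    'polityscore': [10.0, 7.0, np.nan, -3.0],
})

# Keep only rows where both variables are present; .copy() returns an
# independent frame rather than a slice tied to toy
complete = toy.dropna(subset=['incomeperperson', 'polityscore']).copy()
print(len(complete))
```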


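155 of the 213 records have complete data.  Now I will create my new democracy category variable, a flag for whether a country is a full democracy (polity score of 10).  I will do this by defining two functions, one returning a 0/1 flag and one returning a text label, and then using them to create the new variables: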

In [6]:
# These functions convert the polity score to a binary full-democracy flag
def is_full_democracy(score):
    """Return 1 if the polity score is 10 (full democracy), otherwise 0."""
    return 1 if score == 10 else 0

def is_full_democracy_text(score):
    """Return 'Yes' if the polity score is 10 (full democracy), otherwise 'No'."""
    return 'Yes' if score == 10 else 'No'

# Now we can use the functions to create the new variables
subset['is_full_democracy'] = subset['polityscore'].apply(is_full_democracy)
subset['is_full_democracy_text'] = subset['polityscore'].apply(is_full_democracy_text).astype('category')

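
The same flags can also be built without `apply`, using a vectorized comparison.  This is only a sketch on a hypothetical toy frame; the column names mirror the ones above.

```python
import numpy as np
import pandas as pd

# Hypothetical polity scores for four countries
toy = pd.DataFrame({'polityscore': [10.0, 7.0, -3.0, 10.0]})

# The comparison yields a boolean Series; astype(int) maps True/False to 1/0
toy['is_full_democracy'] = (toy['polityscore'] == 10).astype(int)

# np.where picks 'Yes' where the condition holds and 'No' elsewhere
toy['is_full_democracy_text'] = pd.Series(
    np.where(toy['polityscore'] == 10, 'Yes', 'No'), index=toy.index
).astype('category')
print(toy)
```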

In [None]:
# Visualize data using a boxplot
sns.set_context('poster')
plt.figure(figsize=(14, 7))
sns.boxplot(x='is_full_democracy_text', y='incomeperperson', data=subset)
plt.ylabel('Economic Well-Being (GDP Per Person)')
plt.xlabel('Is a Full Democracy')
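
A numeric summary by group can complement the boxplot.  The sketch below shows one way to produce it with `groupby`; the income values are invented purely to show the shape of the output and are not results from the GapMinder data.

```python
import pandas as pd

# Hypothetical values, invented only for illustration
toy = pd.DataFrame({
    'is_full_democracy_text': ['Yes', 'Yes', 'No', 'No', 'No'],
    'incomeperperson': [25000.0, 35000.0, 1000.0, 2000.0, 6000.0],
})

# Count, median, and mean of income within each group
summary = (
    toy.groupby('is_full_democracy_text')['incomeperperson']
       .agg(['count', 'median', 'mean'])
)
print(summary)
```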

In [None]:
model = smf.ols('incomeperperson ~ is_full_democracy', data=subset).fit()
print(model.summary())
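
When reading the summary, it may help to recall that with a single 0/1 predictor the OLS intercept is the mean of the 0 group and the slope is the difference between the two group means.  The sketch below checks this on invented numbers, using plain NumPy least squares in place of `smf.ols` so that it has no dependency on statsmodels.

```python
import numpy as np

# Hypothetical incomes and a 0/1 full-democracy flag, invented for illustration
y = np.array([1000.0, 2000.0, 6000.0, 25000.0, 35000.0])
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# Design matrix with an intercept column, which the formula interface
# adds by default
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

# Group means computed directly for comparison
mean_no = y[x == 0].mean()
mean_yes = y[x == 1].mean()
print(intercept, slope)
print(mean_no, mean_yes - mean_no)
```

In the model above, the coefficient on `is_full_democracy` can therefore be read as the estimated gap in GDP per person between full democracies and all other countries.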