# Life Expectancy and GDP
Written by TheJJSerg, Sug900, Fernando, and CalvinTheMechanic

This is a Codecademy Portfolio Project that uses data visualization to analyze data from the World Health Organization and the World Bank, with the aim of identifying the relationship between GDP and life expectancy in six countries.

We will analyze, prepare, and plot data in order to answer questions in a meaningful way. After our analysis, we will be creating a blog post to share our findings on the World Health Organization website.

## Project Objectives
- Complete a project to add to our portfolio
- Use `seaborn` and `Matplotlib` to create visualizations
- Become familiar with presenting and sharing data visualizations
- Preprocess, explore, and analyze data

## Overview of the Data
The dataset, `all_data.csv`, contains the following columns:
- **Country**: nation for a specific observation
- **Year**: the year for the observation
- **Life expectancy at birth (years)**: the life expectancy value in years
- **GDP**: Gross Domestic Product in U.S. dollars

## Method and Analysis
1. Data Loading and Tidying
2. Data Analysis
3. Data Visualization
4. Others

1. Data Loading and Tidying

In [None]:
#import libraries for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#read csv file into a dataframe
gdp_data = pd.read_csv('all_data.csv')

#print the first 5 rows
print('HEAD OF THE DATAFRAME:')
print('======================')
print(gdp_data.head(),"\n")

#print the info of the df and main stats of the variables
print('INFORMATION ABOUT THE DATAFRAME VARIABLE TYPES & NON-NULL COUNTS:')
print('=================================================================')
#info() prints its report directly and returns None, so call it without print()
gdp_data.info()
print()
print('DESCRIPTION ABOUT THE DATA:')
print('===========================')
print(gdp_data.describe(include='all'), "\n")

#shorten the life expectancy column name
gdp_data.rename(columns={'Life expectancy at birth (years)': 'Life'}, inplace=True)
print('DATAFRAME RENAMED:')
print('===========================')
print(gdp_data.head())

* The data contains 96 entries with no null values
* The data types are correct as the `Country` variable is a string while the `Life expectancy at birth (years)` and `GDP` are float. The `Year` variable is an integer
* The data shows no issues with missing data or wrong entries 
* The `Life expectancy at birth (years)` variable was renamed to `Life`
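
The checks listed above can also be expressed as explicit tests rather than read off the printed output. The following is a minimal sketch, assuming a small made-up DataFrame named `sample` in place of `all_data.csv`; the values are illustrative and not taken from the real dataset.

```python
import pandas as pd

# Made-up stand-in rows; the notebook itself reads all_data.csv instead
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "Zimbabwe", "Zimbabwe"],
    "Year": [2000, 2001, 2000, 2001],
    "Life": [77.0, 77.5, 46.0, 45.5],
    "GDP": [8.0e10, 7.0e10, 7.0e9, 6.5e9],
})

# Total number of missing values across all columns
null_total = int(sample.isnull().sum().sum())

# Rows repeating the same Country/Year pair would suggest duplicate entries
duplicate_rows = int(sample.duplicated(subset=["Country", "Year"]).sum())

# Type checks that do not depend on how pandas labels string columns
year_is_int = pd.api.types.is_integer_dtype(sample["Year"])
life_is_float = pd.api.types.is_float_dtype(sample["Life"])
gdp_is_float = pd.api.types.is_float_dtype(sample["GDP"])

print(null_total, duplicate_rows, year_is_int, life_is_float, gdp_is_float)
```

The same expressions could be run on `gdp_data` after loading; the duplicate check on `Country` and `Year` is an extra step that the original analysis does not perform.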

Since there are six different countries, we will create six different DataFrames to analyze specific countries more efficiently.

In [None]:
# Create DataFrames for Each Country
chile = gdp_data[gdp_data["Country"] == "Chile"]
china = gdp_data[gdp_data["Country"] == "China"]
germany = gdp_data[gdp_data["Country"] == "Germany"]
mexico = gdp_data[gdp_data["Country"] == "Mexico"]
usa = gdp_data[gdp_data["Country"] == "United States of America"]
zimbabwe = gdp_data[gdp_data["Country"] == "Zimbabwe"]
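
As an alternative to six separate filter statements, the per-country split can be built in one pass with `groupby`. This is a sketch only, using a made-up `sample` DataFrame and a hypothetical dictionary name `by_country`; it is not how the notebook above does it.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "China", "Zimbabwe"],
    "Year": [2000, 2001, 2000, 2000],
    "Life": [77.0, 77.5, 71.0, 46.0],
})

# Map each country name to its own sub-DataFrame
by_country = {name: group for name, group in sample.groupby("Country")}

print(sorted(by_country))
print(len(by_country["Chile"]))
```

A dictionary keeps the code unchanged if countries are added or removed, at the cost of writing `by_country["Chile"]` instead of `chile`.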

Next, we plot a histogram of `Life` for each country to compare how life expectancy is distributed within each one.

In [None]:
# Create a Figure
fig = plt.figure(figsize = (10, 10))
plt.subplots_adjust(hspace = .5)

# Create the First Subplot, Chile
ax1 = plt.subplot(3, 2, 1) # 3 Rows, 2 Columns, 1st Subplot
plt.hist(chile["Life"])
plt.title("Chile's Life Expectancy")
plt.xlabel("Years")
plt.ylabel("Frequency")

# Create the Second Subplot, China
ax2 = plt.subplot(3, 2, 2)
plt.hist(china["Life"])
plt.title("China's Life Expectancy")
plt.xlabel("Years")
plt.ylabel("Frequency")

# Create the Third Subplot, Germany
ax3 = plt.subplot(3, 2, 3)
plt.hist(germany["Life"])
plt.title("Germany's Life Expectancy")
plt.xlabel("Years")
plt.ylabel("Frequency")

# Create the Fourth Subplot, Mexico
ax4 = plt.subplot(3, 2, 4)
plt.hist(mexico["Life"])
plt.title("Mexico's Life Expectancy")
plt.xlabel("Years")
plt.ylabel("Frequency")

# Create the Fifth Subplot, USA
ax5 = plt.subplot(3, 2, 5)
plt.hist(usa["Life"])
plt.title("USA's Life Expectancy")
plt.xlabel("Years")
plt.ylabel("Frequency")

# Create the Sixth Subplot, Zimbabwe
ax6 = plt.subplot(3, 2, 6)
plt.hist(zimbabwe["Life"])
plt.title("Zimbabwe's Life Expectancy")
plt.xlabel("Years")
plt.ylabel("Frequency")

# Show and Close the Plot
plt.show()
plt.close()

##########
# Create a histogram of All Countries' Life Expectancy
plt.hist(gdp_data["Life"])

# Create labels for the plot
plt.title("All Countries' Life Expectancy")
plt.xlabel("Years")
plt.ylabel("Frequency")

# Show and close the plot
plt.show()
plt.close()
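
The six subplot blocks above differ only in the country they plot, so they could be generated in a loop. The sketch below assumes a made-up `sample` DataFrame with three countries and uses `plt.subplots` to create the grid; the `Agg` backend line is there only so the sketch runs as a plain script and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for three countries
sample = pd.DataFrame({
    "Country": ["Chile"] * 3 + ["China"] * 3 + ["Zimbabwe"] * 3,
    "Life": [77.0, 78.0, 79.0, 72.0, 73.0, 74.0, 45.0, 50.0, 58.0],
})

countries = sample["Country"].unique()

# One row of axes per country; squeeze=False always returns a 2-D array of axes
fig, axes = plt.subplots(len(countries), 1, figsize=(6, 8), squeeze=False)

for ax, country in zip(axes.ravel(), countries):
    # Histogram of this country's life expectancy values
    ax.hist(sample.loc[sample["Country"] == country, "Life"])
    ax.set_title(f"{country}'s Life Expectancy")
    ax.set_xlabel("Years")
    ax.set_ylabel("Frequency")

fig.tight_layout()
plt.close(fig)
```

For the full dataset, a grid of 3 rows by 2 columns would reproduce the layout used above.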

The per-country distributions of `Life Expectancy` do not appear to share any common pattern. However, `Zimbabwe`'s `Life Expectancy` values are markedly lower than those of the other countries. In the combined **All Countries' Life Expectancy** histogram, `Zimbabwe` looks like an outlier relative to the other five countries, which have a relatively higher `Life Expectancy`.
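
A per-country summary table is one way to put numbers behind a visual impression like this. The sketch below uses made-up values in a hypothetical `sample` DataFrame, so the printed figures say nothing about the real data; only the `groupby` and `agg` pattern carries over.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "Germany", "Germany", "Zimbabwe", "Zimbabwe"],
    "Life": [77.0, 80.0, 78.0, 81.0, 45.0, 60.0],
})

# Minimum, median, and maximum life expectancy per country
summary = sample.groupby("Country")["Life"].agg(["min", "median", "max"])

# Spread of each country's values
summary["range"] = summary["max"] - summary["min"]

print(summary)
```

Applied to `gdp_data`, the same table would show how far each country's median sits from the others and how wide its range is.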

In [None]:
# Create boxplots of life expectancy by country and keep the returned axes object
ax = sns.boxplot(x = "Country", y = "Life", data = gdp_data)

# Set the x-tick positions, one per unique country
ax.set_xticks(range(len(gdp_data["Country"].unique())))
# Set the x-tick labels and rotate them by 30 degrees
# (the labels assume the countries appear in this order in the data)
ax.set_xticklabels(["Chile", "China", "Germany", "Mexico", "USA", "Zimbabwe"],
                   rotation = 30)

# Create labels for the plot
plt.title("Boxplots of Life Expectancy by Country")

# Show and close the plot
plt.show()
plt.close()

In [None]:
# Create boxplots of life expectancy by year and keep the returned axes object
ax = sns.boxplot(x = "Year", y = "Life", data = gdp_data)

# Set the x-tick positions, one per unique year
ax.set_xticks(range(len(gdp_data["Year"].unique())))
# Set the x-tick labels in ascending order and rotate them by 30 degrees
ax.set_xticklabels(sorted(gdp_data["Year"].unique()), rotation = 30)

# Create labels for the plot
plt.title("Boxplots of Life Expectancy by Year")

# Show and close the plot
plt.show()
plt.close()

In [None]:
# Create a scatterplot Year vs Life Expectancy
sns.scatterplot(x = "Year", y = "Life", hue = "Country", 
                palette = "bright", data = gdp_data)

# Create labels for the plot
plt.title("Scatterplot of Year vs Life")
plt.legend()

# Show and close the plot
plt.show()
plt.close()

In [None]:
# Create a scatterplot GDP vs Life Expectancy
sns.scatterplot(x = "GDP", y = "Life", hue = "Country", 
                palette = "bright", data = gdp_data)

# Create labels for the plot
plt.title("Scatterplot of GDP vs Life")
plt.legend()

# Show and close the plot
plt.show()
plt.close()
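
A correlation coefficient can complement the GDP vs Life scatterplot. The sketch below computes the Pearson correlation both across all rows and within each country. The countries `A` and `B` and their values are invented to show that the pooled figure and the per-country figures can disagree; they make no claim about the real dataset.

```python
import pandas as pd

# Invented data: within A, Life rises with GDP; within B, it falls
sample = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "GDP": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "Life": [70.0, 71.0, 72.0, 73.0, 60.0, 59.0, 58.0, 57.0],
})

# Pearson correlation over all rows pooled together
overall = sample["GDP"].corr(sample["Life"])

# Pearson correlation computed separately for each country
per_country = {
    name: group["GDP"].corr(group["Life"])
    for name, group in sample.groupby("Country")
}

print(overall)
print(per_country)
```

Because the six countries differ greatly in GDP, looking at the per-country values alongside the pooled value may give a clearer picture than either alone.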

In [None]:
# Create a scatterplot Year vs GDP
sns.scatterplot(x = "Year", y = "GDP", hue = "Country", 
                data = gdp_data)

# Create labels for the plot
plt.title("Scatterplot of Year vs GDP by Country")
plt.legend()

# Show and close the plot
plt.show()
plt.close()

In [None]:
# Create a histogram of gdp
plt.hist(gdp_data["GDP"])

# Create labels for the plot
plt.title("Histogram of GDP")

# Show and close the plot
plt.show()
plt.close()
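
If the GDP histogram turns out to be heavily right-skewed, which can happen when economies of very different sizes share one axis, a base-10 log transform is one option for making the spread easier to read. This is a sketch on invented values, not a step taken in the analysis above.

```python
import numpy as np
import pandas as pd

# Invented GDP values spanning several orders of magnitude
gdp = pd.Series([1e9, 1e10, 1e11, 1e12, 1e13], name="GDP")

# Base-10 logarithm: each unit step corresponds to a tenfold change in GDP
log_gdp = np.log10(gdp)

# Sample skewness before and after the transform
raw_skew = gdp.skew()
log_skew = log_gdp.skew()

print(log_gdp.tolist())
print(raw_skew, log_skew)
```

The transformed values could then be passed to `plt.hist` in place of the raw column, with the axis label updated to say that the scale is logarithmic.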