# Project Goals 
## Single Variable Association
- Assess the overall distribution of the data

## Double Variable Association
Separation by Country
- Distribution of GDP
- Distribution of Life Expectancy

Accounting for Year
- Distribution and progression of GDP
- Distribution and progression of Life Expectancy

## Triple Variable Association
- Distribution and trend of GDP vs Life Expectancy data, separated visually by country
- Distribution of GDP/Life Expectancy Ratio over time

# Major Questions
- Is there a relationship between GDP and Life Expectancy?
- Does year or country have a greater effect on GDP/Life Expectancy?

# Load the Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import numpy as np

%matplotlib notebook

In [2]:
df = pd.read_csv('all_data.csv')

In [3]:
df[df.Country == 'USA'].head()

Unnamed: 0,Country,Year,Life expectancy at birth (years),GDP


# Explore and Explain Data


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Country                           96 non-null     object 
 1   Year                              96 non-null     int64  
 2   Life expectancy at birth (years)  96 non-null     float64
 3   GDP                               96 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 3.1+ KB


In [5]:
df.Year = df.Year.astype('object')

In [6]:
for col in df.columns:
    print(df[col].describe(), '\n')

count        96
unique        6
top       Chile
freq         16
Name: Country, dtype: object 

count       96
unique      16
top       2000
freq         6
Name: Year, dtype: int64 

count    96.000000
mean     72.789583
std      10.672882
min      44.300000
25%      74.475000
50%      76.750000
75%      78.900000
max      81.000000
Name: Life expectancy at birth (years), dtype: float64 

count    9.600000e+01
mean     3.880499e+12
std      5.197561e+12
min      4.415703e+09
25%      1.733018e+11
50%      1.280220e+12
75%      4.067510e+12
max      1.810000e+13
Name: GDP, dtype: float64 



In [7]:
df.loc[df.Country == 'United States of America', 'Country'] = 'USA'
df.Country.unique()

array(['Chile', 'China', 'Germany', 'Mexico', 'USA', 'Zimbabwe'],
      dtype=object)

### Basic Findings
**Country**
- Unique:
    - Chile
    - China
    - Germany
    - Mexico
    - USA
    - Zimbabwe

**Year**
- Min:&emsp;&nbsp;2000
- Max:&emsp;&nbsp;2015

**Life Expectancy**
- Range:&emsp;44.3 - 81.0
- Mean:&emsp; 72.79
- Median:&ensp;76.75

**GDP**
- Range:&emsp;4.4E9 - 1.8E13
- Mean:&emsp;&nbsp;3.88E12
- Median:&ensp;1.28E12

## Single Variable Visualisation

In [8]:
def get_basic_dist(fignum, outliers=True):
    # Plot life expectancy and GDP, either in full or restricted to the interquartile range
    data_cols = [df['Life expectancy at birth (years)'], df['GDP']]
    if outliers:
        # Full data: histogram (left) and boxplot (right) for each variable
        fig, axes = plt.subplots(2, 2)
        plt.suptitle(f'Fig {fignum} - Basic Data Distribution (w/ Outliers)')
        for index, col in enumerate(data_cols):
            sns.histplot(ax=axes[index, 0], data=col)
            sns.boxplot(ax=axes[index, 1], data=col, color='pink')
    else:
        # "Without outliers" here means values between the 25th and 75th percentiles
        data_quantiles = [np.quantile(col, [0.25, 0.75]) for col in data_cols]
        data_within_quantiles = [col[(col >= q[0]) & (col <= q[1])] for col, q in zip(data_cols, data_quantiles)]
        fig, axes = plt.subplots(1, 2)
        plt.suptitle(f'Fig {fignum} - Basic Data Distribution (w/o Outliers)')
        sns.histplot(ax=axes[0], data=data_within_quantiles[0], bins=10)
        sns.histplot(ax=axes[1], data=data_within_quantiles[1], bins=10, color='pink')
    plt.subplots_adjust(hspace=0.5, wspace=0.3)
    plt.show()

In [9]:
# With Outliers
get_basic_dist(1)


In [10]:
# Without Outliers
get_basic_dist(2, outliers=False)


### Single Variable Findings
**Fig 1 & 2**
- In general, life expectancy in the sampled countries falls between 74 and 79 years (the interquartile range)
- In general, GDP in the sampled countries falls between 1.73E11 and 4.07E12 (the interquartile range)
- There are plenty of outliers in both measures
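
The `outliers=False` option above keeps only values inside the interquartile range, which is stricter than the usual boxplot convention. As a rough sketch of how the outlier observation could be quantified, the helper below counts points lying more than 1.5 × IQR beyond the quartiles (Tukey's rule, which matches seaborn's default whisker length). The name `count_iqr_outliers` and the sample values are hypothetical and are not taken from `all_data.csv`.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


# Hypothetical life-expectancy-like values, for illustration only
sample = pd.Series([44.3, 74.5, 75.2, 75.8, 76.8, 77.9, 78.9, 79.3, 81.0])
n_outliers = count_iqr_outliers(sample)
print(n_outliers)
```

In the notebook this could be applied as `count_iqr_outliers(df['GDP'])`, assuming `df` has been loaded as in the earlier cells.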

## Double Variable Visualisation

In [11]:
def double_variable_vis(fignum, accounting_for):
    fig, axes = plt.subplots(1, 2, figsize=(10,5))
    plt.suptitle(f'Fig {fignum} - GDP and Life Expectancy per {accounting_for.title()}')
    sns.boxplot(ax=axes[0], data=df, y='GDP', x=accounting_for, palette='pastel')
    axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45)
    axes[0].set_yscale('log')
    sns.boxplot(ax=axes[1], data=df, y='Life expectancy at birth (years)', x=accounting_for, palette='pastel')
    axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45)
    plt.tight_layout()
    plt.show()

In [12]:
df.groupby(['Country'])[['GDP', 'Life expectancy at birth (years)']].describe()

**GDP**

| Country | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Chile | 16.0 | 169788800000.0 | 76878840000.0 | 69736810000.0 | 93873030000.0 | 172997500000.0 | 244951500000.0 | 278384000000.0 |
| China | 16.0 | 4957714000000.0 | 3501096000000.0 | 1211350000000.0 | 1881585000000.0 | 4075195000000.0 | 7819550000000.0 | 11064700000000.0 |
| Germany | 16.0 | 3094776000000.0 | 667486200000.0 | 1949950000000.0 | 2740870000000.0 | 3396350000000.0 | 3596078000000.0 | 3890610000000.0 |
| Mexico | 16.0 | 976650600000.0 | 209571600000.0 | 683648000000.0 | 763091000000.0 | 1004376000000.0 | 1156992000000.0 | 1298460000000.0 |
| USA | 16.0 | 14075000000000.0 | 2432694000000.0 | 10300000000000.0 | 12100000000000.0 | 14450000000000.0 | 15675000000000.0 | 18100000000000.0 |
| Zimbabwe | 16.0 | 9062580000.0 | 4298310000.0 | 4415703000.0 | 5748309000.0 | 6733671000.0 | 12634460000.0 | 16304670000.0 |

**Life expectancy at birth (years)**

| Country | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Chile | 16.0 | 78.94375 | 1.058911 | 77.3 | 77.975 | 79.0 | 79.825 | 80.5 |
| China | 16.0 | 74.2625 | 1.318016 | 71.7 | 73.4 | 74.45 | 75.25 | 76.1 |
| Germany | 16.0 | 79.65625 | 0.975 | 78.0 | 78.95 | 79.85 | 80.525 | 81.0 |
| Mexico | 16.0 | 75.71875 | 0.620987 | 74.8 | 75.225 | 75.65 | 76.15 | 76.7 |
| USA | 16.0 | 78.0625 | 0.832566 | 76.8 | 77.425 | 78.15 | 78.725 | 79.3 |
| Zimbabwe | 16.0 | 50.09375 | 5.940311 | 44.3 | 45.175 | 47.4 | 55.325 | 60.7 |


In [13]:
double_variable_vis(3, 'Country')


In [14]:
df.groupby(['Year'])[['GDP', 'Life expectancy at birth (years)']].describe()

**GDP**

| Year | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 2000 | 6.0 | 2371583000000.0 | 3951878000000.0 | 6689958000.0 | 229307700000.0 | 947499000000.0 | 1765300000000.0 | 10300000000000.0 |
| 2001 | 6.0 | 2448752000000.0 | 4062290000000.0 | 6777385000.0 | 234410900000.0 | 1032052000000.0 | 1797838000000.0 | 10600000000000.0 |
| 2002 | 6.0 | 2561221000000.0 | 4211437000000.0 | 6342116000.0 | 237692600000.0 | 1106055000000.0 | 1926992000000.0 | 11000000000000.0 |
| 2003 | 6.0 | 2743446000000.0 | 4396380000000.0 | 5727592000.0 | 235053600000.0 | 1186787000000.0 | 2294370000000.0 | 11500000000000.0 |
| 2004 | 6.0 | 2991647000000.0 | 4689670000000.0 | 5805598000.0 | 266974800000.0 | 1362809000000.0 | 2603275000000.0 | 12300000000000.0 |
| 2005 | 6.0 | 3207074000000.0 | 4981507000000.0 | 5755215000.0 | 308810200000.0 | 1576158000000.0 | 2717550000000.0 | 13100000000000.0 |
| 2006 | 6.0 | 3463349000000.0 | 5268510000000.0 | 5443896000.0 | 357411200000.0 | 1858706000000.0 | 2939870000000.0 | 13900000000000.0 |
| 2007 | 6.0 | 3785750000000.0 | 5474100000000.0 | 5291950000.0 | 391072000000.0 | 2241710000000.0 | 3524122000000.0 | 14500000000000.0 |
| 2008 | 6.0 | 4055986000000.0 | 5547122000000.0 | 4415703000.0 | 410048500000.0 | 2426825000000.0 | 4386750000000.0 | 14700000000000.0 |
| 2009 | 6.0 | 4000653000000.0 | 5476381000000.0 | 8621574000.0 | 353029000000.0 | 2156480000000.0 | 4686965000000.0 | 14400000000000.0 |

**Life expectancy at birth (years)**

| Year | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 2000 | 6.0 | 70.766667 | 12.344499 | 46.0 | 72.475 | 75.8 | 77.175 | 78.0 |
| 2001 | 6.0 | 70.833333 | 12.692938 | 45.3 | 72.9 | 75.95 | 77.2 | 78.3 |
| 2002 | 6.0 | 70.95 | 12.97933 | 44.8 | 73.275 | 76.0 | 77.6 | 78.4 |
| 2003 | 6.0 | 71.033333 | 13.152592 | 44.5 | 73.575 | 76.1 | 77.725 | 78.5 |
| 2004 | 6.0 | 71.3 | 13.377743 | 44.3 | 73.975 | 76.45 | 77.875 | 79.1 |
| 2005 | 6.0 | 71.483333 | 13.316969 | 44.6 | 74.25 | 76.4 | 78.175 | 79.2 |
| 2006 | 6.0 | 71.95 | 13.159293 | 45.4 | 74.6 | 76.8 | 78.625 | 79.6 |
| 2007 | 6.0 | 72.3 | 12.74394 | 46.6 | 74.8 | 77.05 | 78.7 | 79.8 |
| 2008 | 6.0 | 72.666667 | 12.178615 | 48.2 | 74.775 | 76.9 | 79.25 | 79.9 |
| 2009 | 6.0 | 73.066667 | 11.478792 | 50.0 | 75.1 | 77.1 | 79.1 | 80.0 |


In [15]:
def timed_double_variable_vis(fignum, accounting_for):
    fig, axes = plt.subplots(2, 2, figsize=(9,9))
    plt.suptitle(f'Fig {fignum} - GDP and Life Expectancy per {accounting_for.title()}')
    sns.boxplot(ax=axes[0,0], data=df, y='GDP', x=accounting_for, palette='pastel')
    axes[0,0].set_xticklabels(axes[0,0].get_xticklabels(), rotation=45)
    axes[0,0].set_yscale('log')
    sns.boxplot(ax=axes[0,1], data=df, y='Life expectancy at birth (years)', x=accounting_for, palette='pastel')
    axes[0,1].set_xticklabels(axes[0,1].get_xticklabels(), rotation=45)
    sns.lineplot(ax=axes[1,0], data=df, y='GDP', x=accounting_for)
    axes[1,0].set_yscale('log')
    sns.lineplot(ax=axes[1,1], data=df, y='Life expectancy at birth (years)', x=accounting_for)
    plt.tight_layout()
    plt.show()

In [16]:
timed_double_variable_vis(4, 'Year')


### Double Variable Findings
**Fig 3**
- The USA has a markedly higher *GDP* than the other countries in the dataset
- Chile and Zimbabwe have markedly lower *GDP* than the other countries; Mexico is also lower, although the difference is less clear-cut
- Zimbabwe has a markedly lower *Life Expectancy* than the other countries
- The USA's *Life Expectancy* is not higher in proportion to its much higher *GDP*
- In a general sense, there is a similar pattern shown between the countries in *GDP* (log scale) and *Life Expectancy*

**Fig 4**
- In general, the lower quartile of *GDP* does not tend to decrease over time, whereas the upper quartile increases a fair amount over time
- In general, there is a marginal increase in *Life Expectancy* over time within the interquartile range
- The outliers in *Life Expectancy* increase dramatically after ~2006. These are most likely the Zimbabwe data points, as suggested by Figure 3
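
The quartile movements read off Figure 4 could also be checked numerically. A minimal sketch, assuming a frame with `Year` and `GDP` columns like the notebook's `df`; the `toy` data and the `q25`/`q75`/`iqr` column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-dataset; column names mirror the notebook's df
toy = pd.DataFrame({
    'Year': [2000] * 4 + [2001] * 4,
    'GDP': [1.0, 2.0, 3.0, 4.0, 1.0, 3.0, 5.0, 7.0],
})

# One row per year, with the lower and upper quartiles as columns
quartiles = toy.groupby('Year')['GDP'].quantile([0.25, 0.75]).unstack()
quartiles.columns = ['q25', 'q75']
quartiles['iqr'] = quartiles['q75'] - quartiles['q25']
print(quartiles)
```

Comparing `q25` and `q75` across years makes it easier to see whether the spread is growing from the top, the bottom, or both.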

## Triple Variable Visualisation

In [17]:
def triple_variable_vis(fignum, x='GDP', y='Life expectancy at birth (years)'):
    plt.figure()
    plt.suptitle(f'Fig {fignum} - {x} vs {y.title()} Per Country')
    sns.scatterplot(data=df, x=x, y=y, hue='Country', palette='colorblind')
    plt.xscale('log')
    plt.show()

In [18]:
triple_variable_vis(5)


In [19]:
# Create col with ratio of GDP to Life Expectancy
df['GDP_LE_Ratio'] = df.GDP / df['Life expectancy at birth (years)']

In [20]:
def timed_triple_variable_vis(fignum, x, y, ylab, title):
    plt.figure()
    plt.suptitle(f'Fig {fignum} - {title}')
    sns.lineplot(data=df, x=x, y=y, hue='Country', palette='colorblind')
    plt.yscale('log')
    # Apply the y-axis label passed in by the caller (previously unused)
    plt.ylabel(ylab)
    plt.legend(loc='lower right')
    plt.show()

In [21]:
timed_triple_variable_vis(6, x='Year', y='GDP_LE_Ratio', ylab='GDP / Life Expectancy', title='GDP/Life Expectancy Ratio Per Country Over Time')


### Triple Variable Findings
**Fig 5**
- For most countries, there was a clear linear relationship between *GDP* (plotted on a log scale) and *Life Expectancy*.
- Zimbabwe has a far steeper incline than other countries, although with a significantly lower *GDP* and *Life Expectancy* than other countries at all datapoints

**Fig 6**
- A decline in the line indicates that *Life Expectancy* increased proportionally more than *GDP*, and vice versa
- The greater the increase in *GDP*, the lesser the incline for the *GDP/LE* ratio. This indicates that, while there is a seemingly linear relationship between *GDP* and *Life Expectancy*, there are apparent compounding returns to the effect that increasing GDP can have on *Life Expectancy*.
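
The reading of Figure 6 rests on a simple identity: the ratio GDP/LE rises from one year to the next exactly when GDP grows by a larger percentage than Life Expectancy. A small sketch of that check, using invented numbers and a hypothetical shorthand column `LE` for life expectancy:

```python
import pandas as pd

# Hypothetical values: country A has fast GDP growth, country B has fast LE growth
toy = pd.DataFrame({
    'Country': ['A'] * 3 + ['B'] * 3,
    'Year': [2000, 2001, 2002] * 2,
    'GDP': [100.0, 110.0, 121.0, 50.0, 50.5, 51.0],
    'LE': [70.0, 70.7, 71.4, 40.0, 42.0, 44.0],
}).sort_values(['Country', 'Year'])

# Year-over-year percentage growth within each country
growth = toy.groupby('Country')[['GDP', 'LE']].pct_change()
toy['gdp_growth'] = growth['GDP']
toy['le_growth'] = growth['LE']

# The ratio rises only where GDP growth outpaces LE growth
toy['ratio'] = toy['GDP'] / toy['LE']
toy['ratio_rising'] = toy.groupby('Country')['ratio'].diff() > 0
print(toy[['Country', 'Year', 'gdp_growth', 'le_growth', 'ratio_rising']])
```

In this toy data the ratio rises for country A and falls for country B, even though both variables increase in both countries.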

# Conclusions
The purpose of this project was to investigate the provided dataset, come up with relevant questions that can be answered using Python visualisation techniques, and answer these questions accordingly. From the generated figures, the answers are seemingly apparent. I would like to validate these answers using relevant statistical analysis techniques; however, this is not the purpose of this project.

**Does year or country have a greater effect on Life Expectancy?**
- The answer to this question can be seen in figures 3 & 4
- Figure 3 shows that, when separated by *Country*, *GDP* and *Life Expectancy* data is far more distinct than when separated by *Year* (Figure 4)
- This indicates that country has a far greater effect on *Life Expectancy* than year does
- It is interesting to note that, while this is the case, there were general positive trends in both *GDP* and *Life Expectancy* when separated by *Year*
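
One way to put a rough number on "more distinct" is the share of total variance that is explained by group means (often called eta squared). The sketch below is descriptive only and is not a significance test; the function name `between_group_share`, the `toy` data and the shorthand column `LE` are assumptions made for illustration.

```python
import pandas as pd


def between_group_share(frame: pd.DataFrame, group_col: str, value_col: str) -> float:
    """Fraction of the variance in value_col explained by group_col means."""
    overall_mean = frame[value_col].mean()
    group_means = frame.groupby(group_col)[value_col].transform('mean')
    ss_between = ((group_means - overall_mean) ** 2).sum()
    ss_total = ((frame[value_col] - overall_mean) ** 2).sum()
    return ss_between / ss_total


# Hypothetical data: large gap between countries, small change between years
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2000, 2001, 2000, 2001],
    'LE': [50.0, 52.0, 78.0, 80.0],
})

by_country = between_group_share(toy, 'Country', 'LE')
by_year = between_group_share(toy, 'Year', 'LE')
print(round(by_country, 3), round(by_year, 3))
```

A value near 1 for `Country` and near 0 for `Year` would be consistent with the visual impression from Figures 3 and 4.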


**Is there a relationship between GDP and Life Expectancy?**
- It is clear from Figure 5 that there is a positive association between *GDP* and *Life Expectancy*. Even if there was no hue separation between different countries, all data can clearly be seen to have a linear incline.
- Despite the fact that different countries have different levels of *GDP*, increases in *GDP* made throughout the years seemingly correlate with an increase in *Life Expectancy*.
- Figure 6 indicates the level by which *GDP* affects *Life Expectancy* over time. The ratio of *GDP:LE* tends to increase in all countries until ~2008, at which point it starts to level out. This means that, prior to 2008, increases in *GDP* conferred greater benefit to *Life Expectancy* than they did after 2008.
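
As a first step towards the statistical validation mentioned above, the association in Figure 5 could be summarised with a correlation coefficient. The sketch below computes the Pearson correlation between log10(GDP) and life expectancy, pooled and per country, because pooling across countries can hide or exaggerate within-country trends. The data and the shorthand column `LE` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two countries at very different GDP levels
toy = pd.DataFrame({
    'Country': ['A'] * 3 + ['B'] * 3,
    'GDP': [1e9, 2e9, 4e9, 1e12, 2e12, 4e12],
    'LE': [45.0, 50.0, 55.0, 78.0, 79.0, 80.0],
})

# GDP spans several orders of magnitude, so correlate on a log scale
toy['log_gdp'] = np.log10(toy['GDP'])

pooled = toy['log_gdp'].corr(toy['LE'])
per_country = {
    name: group['log_gdp'].corr(group['LE'])
    for name, group in toy.groupby('Country')
}
print(pooled, per_country)
```

A correlation coefficient describes association only; it does not show that *GDP* causes changes in *Life Expectancy*.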

These findings are to be taken with a grain of salt until statistical analysis can be performed. The greater benefit of this project has been the demonstration of data visualisation as an exploratory data analysis tool for finding potential associations in the data, which can then be dissected with greater care.