# Goals and scope of project
1. Get a clear understanding of how life expectancy and GDP evolve over time
2. Understand the distribution of both factors across countries
3. Understand how both factors are associated with each other

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

country_data = pd.read_csv('./all_data.csv')
country_data.head()

Unnamed: 0,Country,Year,Life expectancy at birth (years),GDP
0,Chile,2000,77.3,77860930000.0
1,Chile,2001,77.3,70979920000.0
2,Chile,2002,77.8,69736810000.0
3,Chile,2003,77.9,75643460000.0
4,Chile,2004,78.0,99210390000.0


In [2]:
print(country_data.info())
print(country_data['Country'].unique())
print(country_data['Year'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Country                           96 non-null     object 
 1   Year                              96 non-null     int64  
 2   Life expectancy at birth (years)  96 non-null     float64
 3   GDP                               96 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 3.1+ KB
None
['Chile' 'China' 'Germany' 'Mexico' 'United States of America' 'Zimbabwe']
[2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
 2014 2015]


# Conclusion after first look
After loading and inspecting the data, we can see that it contains the life expectancy and the GDP of several countries per year.\
Data about the following countries is included: 'Chile', 'China', 'Germany', 'Mexico', 'United States of America' and 'Zimbabwe'.

The data covers the years 2000 to 2015.
The data types inferred by pandas are appropriate and the dataframe does not contain missing values.\
For now, no further cleaning is needed.
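
The checks described above can also be expressed as code. The following is a minimal sketch, not part of the original analysis: `basic_checks` is a hypothetical helper, and the sample frame only contains the five Chile rows shown in the `head()` output, so the full file would report the range 2000 to 2015 instead.

```python
import pandas as pd

# Five rows copied from the head() output above; the full file has 96 rows.
sample = pd.DataFrame({
    'Country': ['Chile'] * 5,
    'Year': [2000, 2001, 2002, 2003, 2004],
    'Life expectancy at birth (years)': [77.3, 77.3, 77.8, 77.9, 78.0],
    'GDP': [77860930000.0, 70979920000.0, 69736810000.0,
            75643460000.0, 99210390000.0],
})


def basic_checks(df):
    """Return simple data-quality indicators (hypothetical helper)."""
    return {
        # Total number of missing cells across all columns
        'missing_values': int(df.isna().sum().sum()),
        # Each country should appear at most once per year
        'duplicate_country_years': int(df.duplicated(subset=['Country', 'Year']).sum()),
        # First and last year present in the data
        'year_range': (int(df['Year'].min()), int(df['Year'].max())),
    }


checks = basic_checks(sample)
print(checks)
```

Calling `basic_checks(country_data)` on the full dataframe would run the same checks on all 96 rows.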

# Research questions
* How did life expectancy develop between 2000 and 2015?
* How did GDP develop between 2000 and 2015?
* How is life expectancy distributed across countries?
* How is GDP distributed across countries?
* What is the relationship between GDP and life expectancy per country?
* How are both values developing over time?

In [4]:
# The GDP column holds plain dollar values, so the axis label must not claim trillions
fig = px.scatter(country_data, x='Life expectancy at birth (years)', y='GDP', title='Life expectancy vs GDP', width=1200, height=1000,
                 labels={'GDP': 'GDP in Dollars'}, hover_name='Year', color='Country')
fig.update_layout(title={'y':0.95, 'x':0.5})
fig.show()

As expected, GDP and life expectancy appear to be correlated.\
However, although China and the US have a much higher GDP, life expectancy is higher in Chile than in either of them.\
A possible explanation is their large populations, so GDP per capita is probably a better indicator.
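
The visual impression of a correlation can be backed by a number. The sketch below computes the Pearson correlation per country. It is an illustration under assumptions: `correlation_by_country` is a hypothetical helper, and the sample only holds the five Chile rows from the `head()` output, so the value it prints says little about the full dataset.

```python
import pandas as pd

LIFE_EXP = 'Life expectancy at birth (years)'

# Five rows copied from the head() output above (Chile, 2000 to 2004).
sample = pd.DataFrame({
    'Country': ['Chile'] * 5,
    'Year': [2000, 2001, 2002, 2003, 2004],
    LIFE_EXP: [77.3, 77.3, 77.8, 77.9, 78.0],
    'GDP': [77860930000.0, 70979920000.0, 69736810000.0,
            75643460000.0, 99210390000.0],
})


def correlation_by_country(df):
    """Pearson correlation between life expectancy and GDP for each country."""
    return {
        country: group[LIFE_EXP].corr(group['GDP'])
        for country, group in df.groupby('Country')
    }


correlations = correlation_by_country(sample)
for country, r in correlations.items():
    print(f"{country}: r = {r:.2f}")
```

Passing `country_data` instead of `sample` would give one coefficient per country for the full period.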

In [5]:
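# Approximate population per country in millions.
# Note: one fixed value per country is used for all years, so GDP per capita is only a rough estimate.
inhabitants = pd.DataFrame({'Country':['Chile', 'China', 'Germany', 'Mexico', 'United States of America', 'Zimbabwe'],
                           'Population in M':[19.46, 1412.78, 83.02, 123.76, 331.2, 16]})

country_data_extended = pd.merge(country_data, inhabitants, on='Country', how='left')
# Convert the population from millions to individuals before dividing
country_data_extended['GDP per capita'] = country_data_extended['GDP'] / (country_data_extended['Population in M'] * 1_000_000)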

fig = px.scatter(country_data_extended, x='Life expectancy at birth (years)', y='GDP per capita', title='Life expectancy vs GDP per capita', width=1200, height=1000
                , hover_name='Year', color='Country')
fig.update_layout(title={'y':0.95, 'x':0.5})
fig.show()

The picture now becomes clearer: Chile also has a higher GDP per capita than China and Mexico.\
However, it is interesting that Chile still has a higher life expectancy than the US, whose GDP per capita is roughly 4x that of Chile.\
In the next step we want to check whether that difference is also statistically significant.

In [6]:
import scipy.stats as stats

# One-sided two-sample t-test. H1: mean life expectancy in Chile is greater than in the US.
tstat, pval = stats.ttest_ind(country_data_extended['Life expectancy at birth (years)'][country_data_extended['Country'] == 'Chile'],
                              country_data_extended['Life expectancy at birth (years)'][country_data_extended['Country'] == 'United States of America'], alternative='greater')
print(pval)

0.006882476157246373


Our hypothesis test shows that, at the usual 5% significance level, we can reject the null hypothesis that life expectancy in Chile is not higher than in the US (p ≈ 0.007).\
Hence, the data support the alternative hypothesis that life expectancy in Chile is higher.
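
To make the test less of a black box, here is a minimal sketch of the statistic that `stats.ttest_ind` computes with its default equal-variance setting. `pooled_t_statistic` is a hypothetical helper written for illustration. The first group holds the Chile values from the `head()` output, while the second group consists of placeholder numbers that are not taken from the dataset. The test also assumes that the two samples are independent.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances (equal-variance assumption)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / dof
    # Standard error of the difference between the two sample means
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, dof


group_a = [77.3, 77.3, 77.8, 77.9, 78.0]  # Chile, 2000 to 2004, from head()
group_b = [76.8, 76.9, 77.0, 77.2, 77.5]  # hypothetical placeholder values, NOT from the dataset

t_stat, dof = pooled_t_statistic(group_a, group_b)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

A large positive statistic speaks for `alternative='greater'`; the p-value is then the upper-tail probability of a t distribution with `dof` degrees of freedom.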

In [7]:
tstat, pval = stats.ttest_ind(country_data_extended['Life expectancy at birth (years)'][country_data_extended['Country'] == 'Chile'],
                              country_data_extended['Life expectancy at birth (years)'][country_data_extended['Country'] == 'Germany'], alternative='less')
print(pval)

0.02847224197415846


Additionally, life expectancy in Germany is significantly higher than in Chile (p ≈ 0.028), and Germany also has a markedly higher GDP per capita.

In [8]:
tstat, pval = stats.ttest_ind(country_data_extended['Life expectancy at birth (years)'][country_data_extended['Country'] == 'United States of America'],
                              country_data_extended['Life expectancy at birth (years)'][country_data_extended['Country'] == 'Mexico'], alternative='greater')
print(pval)

2.3513070474435397e-10
