<a id="index"></a>
## Table of Contents
<ul>
    <li><a href="#one">CO2 emissions (metric tons per capita) and GDP per cap. Year 1962</a></li>
    <li><a href="#two">In what year is the correlation between CO2 emissions (metric tons per capita) and gdpPercap the strongest?</a></li>
    <li><a href="#three">What is the relationship between continent and 'Energy use (kg of oil equivalent per capita)'?</a></li>
    <li><a href="#four">Is there a significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' in the years after 1990?</a></li>
    <li><a href="#five">What is the country (or countries) that has the highest 'Population density (people per sq. km of land area)' across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)</a></li>
    <li><a href="#six">What country (or countries) has shown the greatest increase in 'Life expectancy at birth, total (years)' since 1962?</a></li>
</ul>

In [25]:
import pandas as pd
import seaborn as sns
import numpy as np
import plotly.express as px

# Import data
df = pd.read_csv("gapminder_clean.csv")
# Drop rows that are missing either variable used in the correlation analysis
df.dropna(subset=["CO2 emissions (metric tons per capita)", "gdpPercap"], inplace=True)


<a id="one"></a>
<h3>CO2 emissions (metric tons per capita) and GDP per cap. Year 1962</h3>
<a href="#index">Go to the table of contents</a>

In [26]:
filtered_data = df[(df["Year"] == 1962) & (df["Country Name"] != "Kuwait")]

# Scatter plot
fig = px.scatter(filtered_data, x='CO2 emissions (metric tons per capita)', 
                 y="gdpPercap", color='Country Name', size='pop', hover_data=['Country Name'], 
                 title="CO2 emissions (metric tons per capita) and GDP per cap. Year 1962")
fig.show()

In [27]:
# Pearson's r
from scipy.stats import pearsonr

corr, p_value = pearsonr(filtered_data["CO2 emissions (metric tons per capita)"], filtered_data["gdpPercap"])
print("\n Pearson correlation of CO2 emissions (metric tons per capita) and gdpPercap, year 1962: \n",
      "Correlation value: ",corr,"p-value: ",p_value)


 Pearson correlation of CO2 emissions (metric tons per capita) and gdpPercap, year 1962: 
 Correlation value:  0.8063294717615218 p-value:  1.0822253072448907e-25


<a id="two"></a>
### In what year is the correlation between CO2 emissions (metric tons per capita) and gdpPercap the strongest?
<a href="#index">Go to the table of contents</a>

In [28]:
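# Keep every year (1962 included) and exclude Kuwait, as in the 1962 analysis
all_years_data = df[df["Country Name"] != "Kuwait"]
# Pearson correlation between CO2 emissions and gdpPercap within each year; keep the strongest
(all_years_data.groupby(by=["Year"])
    .corrwith(other=df["CO2 emissions (metric tons per capita)"])
    .sort_values("gdpPercap", ascending=False)["gdpPercap"]
    .head(1))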

Year
1972    0.859351
Name: gdpPercap, dtype: float64
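
The year can also be picked programmatically instead of being read off the output. The sketch below is a minimal illustration on hypothetical toy data (the `co2` column name and the noise levels are made up, not taken from `gapminder_clean.csv`): it computes Pearson's r per year and selects the year with the largest absolute value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
rng = np.random.default_rng(0)
frames = []
for year, noise in [(1962, 3.0), (1967, 0.5), (1972, 1.5)]:
    co2 = rng.uniform(0, 10, size=50)
    gdp = 1000 * co2 + rng.normal(0, 1000 * noise, size=50)
    frames.append(pd.DataFrame({"Year": year, "co2": co2, "gdpPercap": gdp}))
toy = pd.concat(frames, ignore_index=True)

# Pearson r between the two columns, computed separately for each year
corr_by_year = pd.Series(
    {year: group["co2"].corr(group["gdpPercap"]) for year, group in toy.groupby("Year")}
).sort_index()

# Year with the largest absolute correlation
strongest_year = corr_by_year.abs().idxmax()
print(corr_by_year)
print("Strongest year:", strongest_year)
```

Storing the result in a variable such as `strongest_year` lets the follow-up plot filter on it, so the plotted year cannot drift from the computed one.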

In [29]:
# Plot the year with the strongest correlation found above (1972)
new_filtered_data = df[(df["Year"] == 1972) & (df["Country Name"] != "Kuwait")]

fig = px.scatter(new_filtered_data, x='CO2 emissions (metric tons per capita)', y="gdpPercap", color="continent",
                 size='pop', hover_data=['Country Name'], 
                 title="CO2 emissions (metric tons per capita) and GDP per cap. Year 1972")
fig.show()

<a id="three"></a>
<h3>What is the relationship between continent and 'Energy use (kg of oil equivalent per capita)'?</h3>
<a href="#index">Go to the table of contents</a>

In [30]:
df.groupby("continent")["Energy use (kg of oil equivalent per capita)"].describe()

continent,count,mean,std,min,25%,50%,75%,max
Africa,198.0,700.642721,628.227685,9.71541,377.73468,451.382174,746.247275,3071.774832
Americas,188.0,1703.620453,2377.181918,219.075497,556.033108,749.029108,1384.585146,14608.009868
Asia,185.0,1867.280336,2590.043514,86.903767,345.370792,760.140852,1987.087308,12122.050603
Europe,239.0,3110.604287,1768.370162,350.101258,2045.782889,2954.266739,3853.373983,14746.031339
Oceania,20.0,3980.31442,1123.410756,1791.461322,3143.50142,4044.850674,4783.65023,5868.347097


In [31]:
fig = px.box(df, x="Energy use (kg of oil equivalent per capita)", y="continent", hover_data=['Country Name'])
fig.show()

Before choosing a test, I need to check whether the data satisfy the assumptions of parametric tests:
1. Normality: the populations from which the samples are drawn should be normally distributed. -> Shapiro-Wilk test.
2. Independence of cases: the sample cases should be independent of each other. -> I assume that this condition is satisfied.
3. Homogeneity of variance: the variance should be approximately equal across groups. -> Levene test.

First, I split the "Energy use" data by continent (five series, one per continent) and drop missing values so that the tests do not fail.

In [32]:
americas_energy = df[df["continent"] == "Americas"]["Energy use (kg of oil equivalent per capita)"].dropna()
oceania_energy = df[df["continent"] == "Oceania"]["Energy use (kg of oil equivalent per capita)"].dropna()
africa_energy = df[df["continent"] == "Africa"]["Energy use (kg of oil equivalent per capita)"].dropna()
europe_energy = df[df["continent"] == "Europe"]["Energy use (kg of oil equivalent per capita)"].dropna()
asia_energy = df[df["continent"] == "Asia"]["Energy use (kg of oil equivalent per capita)"].dropna()

#### Shapiro-Wilk test
The Shapiro-Wilk test tests the null hypothesis that the data were drawn from a normal distribution (the first requirement for parametric tests).

In [33]:
import scipy.stats as stats
print("\n",
"Americas: ",stats.shapiro(americas_energy), " Reject null hypothesis","\n",
"Oceania: ",stats.shapiro(oceania_energy),"Cannot reject null hypothesis","\n",
"Africa: ",stats.shapiro(africa_energy)," Reject null hypothesis","\n",
"Europe: ",stats.shapiro(europe_energy)," Reject null hypothesis","\n",
"Asia: ",stats.shapiro(asia_energy)," Reject null hypothesis")


 Americas:  ShapiroResult(statistic=0.5632225871086121, pvalue=1.5868403861741054e-21)  Reject null hypothesis 
 Oceania:  ShapiroResult(statistic=0.9818098545074463, pvalue=0.9552662372589111) Cannot reject null hypothesis 
 Africa:  ShapiroResult(statistic=0.6753993034362793, pvalue=2.724574480963581e-19)  Reject null hypothesis 
 Europe:  ShapiroResult(statistic=0.889901876449585, pvalue=3.4700315537650184e-12)  Reject null hypothesis 
 Asia:  ShapiroResult(statistic=0.6609910130500793, pvalue=4.943039538112358e-19)  Reject null hypothesis


Shapiro-Wilk tests: 4 of 5 null hypotheses rejected (all continents except Oceania). The data are not normally distributed.

#### Levene test

The Levene test tests the null hypothesis that all input samples are from populations with equal variances (the third requirement for parametric tests). 

In [34]:
# Levene's test centered on the median (the Brown-Forsythe variant)
stats.levene(americas_energy, oceania_energy, africa_energy, europe_energy, asia_energy,
             center='median')

LeveneResult(statistic=12.113489871462216, pvalue=1.4250677574810339e-09)

<p>Levene test: rejected null hypothesis. </p>
<p>Both the Shapiro-Wilk and the Levene tests rejected their null hypotheses, so the data do not satisfy the requirements for parametric tests. I need to use non-parametric tests instead.</p>
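
The decision rule used above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the function name `parametric_assumptions_hold`, the 0.05 threshold, and the synthetic samples are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

def parametric_assumptions_hold(groups, alpha=0.05):
    """Return True only if every group passes Shapiro-Wilk and Levene does not reject."""
    all_normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    equal_variance = stats.levene(*groups, center="median").pvalue > alpha
    return bool(all_normal and equal_variance)

# Hypothetical samples: two roughly normal groups and one strongly skewed group
rng = np.random.default_rng(42)
normal_a = rng.normal(loc=10, scale=2, size=200)
normal_b = rng.normal(loc=12, scale=2, size=200)
skewed = rng.lognormal(mean=0, sigma=1.5, size=200)
groups = [normal_a, normal_b, skewed]

if parametric_assumptions_hold(groups):
    result = stats.f_oneway(*groups)  # one-way ANOVA
else:
    result = stats.kruskal(*groups)   # rank-based alternative
print(type(result).__name__, result.pvalue)
```

The helper treats a single failed normality check as enough to fall back to the rank-based test, which mirrors the conservative choice made in this analysis.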

#### Kruskal-Wallis H-test
<p>The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all groups are equal. </p>

In [35]:
stats.kruskal(*[americas_energy, oceania_energy, africa_energy,europe_energy, asia_energy])


KruskalResult(statistic=302.0114932359461, pvalue=3.989307514095183e-64)

<p>Kruskal-Wallis H-test: rejected null hypothesis.</p>
<p>At least one continent differs from the others, so I need pairwise comparisons to find which continents differ and which are similar.</p>

#### Dunn’s test
<p>Dunn's test is a post hoc pairwise test for multiple comparisons of mean rank sums. 
It is run after the Kruskal-Wallis one-way analysis of variance by ranks to make the pairwise comparisons.</p>

In [36]:
import scikit_posthocs as sp

dunn_test = sp.posthoc_dunn([americas_energy, oceania_energy, africa_energy,europe_energy, asia_energy])
dunn_test.columns =["America","Oceania","Africa","Europe","Asia"]
dunn_test.index =["America","Oceania","Africa","Europe","Asia"]
dunn_test

,America,Oceania,Africa,Europe,Asia
America,1.0,5.358423e-08,6.438289e-10,1.025511e-20,0.1858753
Oceania,5.358423e-08,1.0,4.148905e-16,0.112535,1.779352e-09
Africa,6.438289e-10,4.148905e-16,1.0,9.844921e-58,1.479379e-06
Europe,1.025511e-20,0.112535,9.844921e-58,1.0,1.130215e-26
Asia,0.1858753,1.779352e-09,1.479379e-06,1.130215e-26,1.0


In [37]:
fig = px.imshow(dunn_test, title="Dunn's test p-values for pairwise comparisons of energy use between continents")
fig.show()

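<p>The energy use of Asia and the Americas does not differ significantly (p = 0.19).</p>
<p>The energy use of Oceania and Europe does not differ significantly (p = 0.11).</p>
<p>Every other pair of continents differs significantly.</p>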
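
Reading the pairs off the matrix can also be done in code. The sketch below rebuilds the matrix from the p-values printed in the Dunn's test output and lists the pairs that are not significant; the 0.05 threshold is an assumption, since no significance level is stated in this analysis.

```python
import pandas as pd

labels = ["America", "Oceania", "Africa", "Europe", "Asia"]
# p-values copied from the Dunn's test output
p_values = pd.DataFrame(
    [
        [1.0, 5.358423e-08, 6.438289e-10, 1.025511e-20, 0.1858753],
        [5.358423e-08, 1.0, 4.148905e-16, 0.112535, 1.779352e-09],
        [6.438289e-10, 4.148905e-16, 1.0, 9.844921e-58, 1.479379e-06],
        [1.025511e-20, 0.112535, 9.844921e-58, 1.0, 1.130215e-26],
        [0.1858753, 1.779352e-09, 1.479379e-06, 1.130215e-26, 1.0],
    ],
    index=labels,
    columns=labels,
)

alpha = 0.05  # assumed significance level

# Walk the upper triangle so that each pair is reported once
not_significant = [
    (a, b, p_values.loc[a, b])
    for i, a in enumerate(labels)
    for b in labels[i + 1:]
    if p_values.loc[a, b] >= alpha
]
for a, b, p in not_significant:
    print(f"{a} vs {b}: p = {p:.3f}")
```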

<a id="four"></a>
### Is there a significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' in the years after 1990?
<a href="#index">Go to the table of contents</a>

In [38]:
europe_and_asia_after_1990 = df[
    df["continent"].isin(["Europe", "Asia"])
    & (df["Year"] > 1990)
    & (df['Imports of goods and services (% of GDP)'] < 97)  # Outliers removed (Singapore is an exception)
]
europe_and_asia_after_1990.groupby("continent")["Imports of goods and services (% of GDP)"].describe()

continent,count,mean,std,min,25%,50%,75%,max
Asia,93.0,41.713928,23.454248,0.079506,25.393531,38.83129,58.350047,96.742045
Europe,111.0,41.761071,16.818978,17.34513,28.462519,37.691245,51.129504,88.512248


In [39]:
fig = px.box(europe_and_asia_after_1990, x="Imports of goods and services (% of GDP)", y="continent", hover_data=['Country Name', "Year"])
fig.show()

In [40]:
imports_col = "Imports of goods and services (% of GDP)"
europe_imports = europe_and_asia_after_1990.loc[europe_and_asia_after_1990["continent"] == "Europe", imports_col].dropna()
asia_imports = europe_and_asia_after_1990.loc[europe_and_asia_after_1990["continent"] == "Asia", imports_col].dropna()
# Normality check per continent (Europe first, then Asia)
print("\n", stats.shapiro(europe_imports), "\n", stats.shapiro(asia_imports))
# Homogeneity of variance between the two groups
stats.levene(europe_imports, asia_imports, center='median')



 ShapiroResult(statistic=0.9278804063796997, pvalue=1.486705423303647e-05) 
 ShapiroResult(statistic=0.9704142212867737, pvalue=0.032892148941755295)


LeveneResult(statistic=9.881593643140366, pvalue=0.0019207769451392669)

<ul>
    <li>Shapiro-Wilk tests: both null hypotheses rejected at the 0.05 level (Europe p = 1.5e-05, Asia p = 0.033). The data are not normally distributed.</li>
    <li>Levene test: rejected null hypothesis.</li>
</ul>
<p>As in the previous case, the requirements for parametric tests are not satisfied, so I need to compare the two groups using a non-parametric test.</p>

#### Mann-Whitney U test

<p>The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. 
</p>

In [41]:
stats.mannwhitneyu(x=europe_imports, y=asia_imports)

MannwhitneyuResult(statistic=5034.0, pvalue=0.3811649093378452)

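Cannot reject the null hypothesis of identical distributions (p = 0.38): there is no significant difference between Europe and Asia in 'Imports of goods and services (% of GDP)' in the years after 1990.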
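
For reference, the same kind of comparison can be written with the alternative hypothesis spelled out. This is a minimal sketch on hypothetical samples (drawn from one skewed distribution, with group sizes chosen only to resemble the ones above) and an assumed 0.05 threshold; it does not reproduce the result of this analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical samples drawn from the same skewed distribution
rng = np.random.default_rng(1)
sample_a = rng.lognormal(mean=3.5, sigma=0.4, size=111)
sample_b = rng.lognormal(mean=3.5, sigma=0.4, size=93)

alpha = 0.05  # assumed significance level
# State the alternative explicitly instead of relying on the library default
result = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
decision = "reject" if result.pvalue < alpha else "cannot reject"
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.3f} -> {decision} the null hypothesis")
```

Passing `alternative="two-sided"` makes the reported p-value unambiguous regardless of the SciPy version in use.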

<a id="five"></a>
### What is the country (or countries) that has the highest 'Population density (people per sq. km of land area)' across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)
<a href="#index">Go to the table of contents</a>

<p>I group the records by country name and compute each country's mean population density across all years.</p>

In [42]:
df.groupby("Country Name")["Population density (people per sq. km of land area)"].mean().sort_values(ascending=False).head()

Country Name
Singapore             4361.499928
Bangladesh             810.312621
Bahrain                648.618703
West Bank and Gaza     513.642691
Mauritius              490.801823
Name: Population density (people per sq. km of land area), dtype: float64
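
The question is phrased in terms of average ranking, which is not always the same as the mean density: a single extreme year can lift a country's mean without changing its rank much. The sketch below shows the ranking variant on hypothetical toy data (the `density` column and the countries A, B, C are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data; country C has one extreme year (1967)
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"] * 3,
    "Year": [1962] * 3 + [1967] * 3 + [1972] * 3,
    "density": [100, 300, 200, 150, 320, 1000, 180, 350, 330],
})

# Approach used above: mean density per country
mean_density = toy.groupby("Country Name")["density"].mean().sort_values(ascending=False)

# Ranking approach: rank countries within each year (1 = densest),
# then average those ranks per country
toy["rank"] = toy.groupby("Year")["density"].rank(ascending=False, method="min")
average_rank = toy.groupby("Country Name")["rank"].mean().sort_values()

print(mean_density)
print(average_rank)
```

In this toy example the mean puts C first because of its outlier year, while the average rank puts B first because B is densest in two of the three years.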

In [43]:
fig = px.line(df, x="Year", y="Population density (people per sq. km of land area)", color="Country Name",
              line_group="Country Name", hover_name="Country Name", 
              title="Population density (people per sq. km of land area) across all years")
fig.show()

<a id="six"></a>
### What country (or countries) has shown the greatest increase in 'Life expectancy at birth, total (years)' since 1962?
<a href="#index">Go to the table of contents</a>

<p>For each country, I take the first and the last record that contains a "Life expectancy at birth" value (not the minimum and maximum values), subtract the first value from the last, and express the difference as a relative increment, in percent:</p>
<code>relative increment (%) = (last record - first record) / first record * 100</code>

In [44]:
life_expectancy_by_country = df.groupby('Country Name')['Life expectancy at birth, total (years)'].agg(['last','first'])
life_expectancy_by_country['diff'] = life_expectancy_by_country['last'] - life_expectancy_by_country['first']
life_expectancy_by_country['percentage'] = life_expectancy_by_country['diff'] / life_expectancy_by_country['first'] * 100
life_expectancy_by_country["Country Name"] = life_expectancy_by_country.index
life_expectancy_by_country.sort_values(by="percentage", ascending=False).head()

Country Name,last,first,diff,percentage,Country Name
Mali,54.261927,28.548463,25.713463,90.069518,Mali
Nepal,66.551927,35.952293,30.599634,85.111774,Nepal
Afghanistan,57.833829,33.219902,24.613927,74.093917,Afghanistan
Tunisia,74.202439,43.341683,30.860756,71.20341,Tunisia
China,74.340439,44.398341,29.942098,67.439676,China
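
The `first` and `last` aggregations follow row order, so they only mean "earliest" and "latest" if the rows are in chronological order within each country. The sketch below makes that explicit by sorting on `Year` first. It uses hypothetical toy data (the `life_exp` column and the countries X and Y are made up) and is an illustration of the idea, not a change to the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records, deliberately out of chronological order
toy = pd.DataFrame({
    "Country Name": ["X", "X", "X", "Y", "Y", "Y"],
    "Year": [2007, 1962, 1987, 1962, 2007, 1987],
    "life_exp": [60.0, 40.0, 50.0, np.nan, 77.0, 70.0],
})

# Sort by year so that 'first' and 'last' are the earliest and latest records;
# both aggregations skip missing values within each group
summary = (
    toy.sort_values("Year")
       .groupby("Country Name")["life_exp"]
       .agg(["first", "last"])
)
summary["diff"] = summary["last"] - summary["first"]
summary["percentage"] = summary["diff"] / summary["first"] * 100
print(summary)
```

Country Y has no value for 1962, so its first record is the 1987 one; this matches the "first record that contains a value" rule described above.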


In [45]:
life_expectancy_by_country_values_greater_than_zero = life_expectancy_by_country[life_expectancy_by_country["percentage"] > 0]  # Keep positive values only: negative values cause an error in the marker size
fig = px.scatter(life_expectancy_by_country_values_greater_than_zero, x='percentage', y="last", color="percentage", 
                 title="Changes in life expectancy at birth, relative value (%)",
                 size='percentage', hover_data=['Country Name'],
                labels={
                     "last": "Life expectancy at birth, total (years). Last record.",
                     "percentage": "Difference between first and last record, (percentage)"
                 },)
fig.show()

#### Changes in life expectancy at birth, absolute value (Years)
<p>I can also compare the absolute increments, which gives a different ranking:</p>
<code>absolute increment (years) = last record - first record</code>

In [46]:
life_expectancy_by_country.sort_values(by="diff", ascending=False).head()

Country Name,last,first,diff,percentage,Country Name
Tunisia,74.202439,43.341683,30.860756,71.20341,Tunisia
Nepal,66.551927,35.952293,30.599634,85.111774,Nepal
China,74.340439,44.398341,29.942098,67.439676,China
Oman,75.12361,48.107073,27.016537,56.159177,Oman
Saudi Arabia,73.345073,46.694512,26.650561,57.074289,Saudi Arabia


In [47]:
life_expectancy_by_country_values_greater_than_zero = life_expectancy_by_country[life_expectancy_by_country["diff"] > 0]  # Keep positive values only: negative values cause an error in the marker size
fig = px.scatter(life_expectancy_by_country_values_greater_than_zero, x='diff', y="last", color="diff", 
                 title="Changes in life expectancy at birth, absolute value (Years)",
                 size='diff', hover_data=['Country Name'],
                labels={
                     "last": "Life expectancy at birth, total (years). Last record.",
                     "diff": "Difference between first and last record, (years)"
                 },)
fig.show()