# Project: Gapminder Dataset Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

>Three datasets were collected from Gapminder to serve as samples for this analysis: __population_total.csv__, __gnicap_atm_con.csv__, and __life_expectancy_years.csv__.

>The first question we'll be exploring is: __Is life expectancy affected by the population size?__

>The second question we'll be exploring is: __Is there a correlation between GNI per capita (income) and life expectancy?__

> For ease throughout the analysis, I will load gnicap_atm_con into a DataFrame named __df_inc__.

In [1]:
#Importing library packages to be used throughout project

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [2]:
# Load data
df_pop = pd.read_csv('population_total.csv')
df_lyf = pd.read_csv('life_expectancy_years.csv')
df_inc = pd.read_csv('gnicap_atm_con.csv')

### Exploring the 3 different datasets

In [3]:
#confirming right data loaded(population)
print(df_pop.shape)
df_pop.head()

(197, 302)


Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Afghanistan,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,...,76.6M,76.4M,76.3M,76.1M,76M,75.8M,75.6M,75.4M,75.2M,74.9M
1,Angola,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,...,168M,170M,172M,175M,177M,179M,182M,184M,186M,188M
2,Albania,400k,402k,404k,405k,407k,409k,411k,413k,414k,...,1.33M,1.3M,1.27M,1.25M,1.22M,1.19M,1.17M,1.14M,1.11M,1.09M
3,Andorra,2650,2650,2650,2650,2650,2650,2650,2650,2650,...,63k,62.9k,62.9k,62.8k,62.7k,62.7k,62.6k,62.5k,62.5k,62.4k
4,United Arab Emirates,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,...,12.3M,12.4M,12.5M,12.5M,12.6M,12.7M,12.7M,12.8M,12.8M,12.9M


In [4]:
#confirming right data loaded(life expectancy)
print(df_lyf.shape)
df_lyf.head()

(195, 302)


Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Afghanistan,28.2,28.2,28.2,28.2,28.2,28.2,28.1,28.1,28.1,...,75.5,75.7,75.8,76.0,76.1,76.2,76.4,76.5,76.6,76.8
1,Angola,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0,...,78.8,79.0,79.1,79.2,79.3,79.5,79.6,79.7,79.9,80.0
2,Albania,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,87.4,87.5,87.6,87.7,87.8,87.9,88.0,88.2,88.3,88.4
3,Andorra,,,,,,,,,,...,,,,,,,,,,
4,United Arab Emirates,30.7,30.7,30.7,30.7,30.7,30.7,30.7,30.7,30.7,...,82.4,82.5,82.6,82.7,82.8,82.9,83.0,83.1,83.2,83.3


>I noticed a lot of null values in the life expectancy dataset (for example, the Andorra row is empty in the preview above).
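
> To put a number on that observation, a small sketch like the one below counts missing cells per country. The frame `toy_lyf` and the names `nulls_per_country` and `countries_with_nulls` are illustrative stand-ins I made up for this example, not part of the analysis; the same two lines could be pointed at `df_lyf`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide life-expectancy table (hypothetical, two year columns only).
toy_lyf = pd.DataFrame({
    "country": ["Afghanistan", "Andorra", "Albania"],
    "1800": [28.2, np.nan, 35.4],
    "1801": [28.2, np.nan, 35.4],
})

# Number of missing cells per row, indexed by country.
nulls_per_country = toy_lyf.set_index("country").isnull().sum(axis=1)

# Countries that dropna() would remove (at least one missing value).
countries_with_nulls = nulls_per_country[nulls_per_country > 0].index.tolist()
print(countries_with_nulls)
```

> Listing the affected countries before dropping them makes it easier to justify the cleaning decision later on.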

In [5]:
#confirming right data loaded(GNIperCap)
print(df_inc.shape)
df_inc.head()

(191, 252)


Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2041,2042,2043,2044,2045,2046,2047,2048,2049,2050
0,Afghanistan,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,...,751,767,783,800,817,834,852,870,888,907
1,Angola,517.0,519.0,522.0,524.0,525.0,528.0,531.0,533.0,536.0,...,2770,2830,2890,2950,3010,3080,3140,3210,3280,3340
2,Albania,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,...,9610,9820,10k,10.2k,10.5k,10.7k,10.9k,11.1k,11.4k,11.6k
3,United Arab Emirates,738.0,740.0,743.0,746.0,749.0,751.0,754.0,757.0,760.0,...,47.9k,48.9k,50k,51k,52.1k,53.2k,54.3k,55.5k,56.7k,57.9k
4,Argentina,794.0,797.0,799.0,802.0,805.0,808.0,810.0,813.0,816.0,...,12.8k,13.1k,13.4k,13.6k,13.9k,14.2k,14.5k,14.8k,15.2k,15.5k


#### Exploring for any duplicates

In [6]:
df_pop.duplicated().sum(), df_inc.duplicated().sum(), df_lyf.duplicated().sum()

(0, 0, 0)

>No duplicated data was found.

#### Exploring for different datatypes in each dataset

In [7]:
df_pop.dtypes.unique()

array([dtype('O')], dtype=object)

In [8]:
df_lyf.dtypes.unique()

array([dtype('O'), dtype('float64')], dtype=object)

In [9]:
df_inc.dtypes.unique()

array([dtype('O'), dtype('float64')], dtype=object)

> We see that the 3 datasets share the same layout: one row per country and one column per year. Some of the datasets contain many null values and will have to be cleaned. I will drop the rows with missing values to avoid errors in later computations, accepting that each dropped row removes that country from the dataset entirely.

> Also, working with three separate datasets at once would be cumbersome, so I'll reshape each one into a long table with 3 columns (country, year, value) and then merge them, since they share the country and year keys.
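
> As a rough illustration of that plan, the sketch below reshapes two tiny wide tables and merges them. All names (`wide_pop`, `wide_lyf`, `audit`, `dropped`) and all values are made up for the example. It also uses an outer join with `indicator=True`, which is one possible way to see which country-year pairs an inner merge silently discards.

```python
import pandas as pd

# Two tiny wide tables in the same layout as the Gapminder files (hypothetical values).
wide_pop = pd.DataFrame({"country": ["A", "B"], "1800": [100, 200], "1801": [110, 210]})
wide_lyf = pd.DataFrame({"country": ["A", "C"], "1800": [30.0, 40.0], "1801": [30.5, 40.5]})

# Wide -> long: one row per (country, year).
long_pop = wide_pop.melt(id_vars=["country"], var_name="year", value_name="pop")
long_lyf = wide_lyf.melt(id_vars=["country"], var_name="year", value_name="life_exp")

# Inner join (the default) keeps only country-year pairs present in both tables.
merged = long_pop.merge(long_lyf, on=["country", "year"])

# An outer join with indicator=True shows which rows the inner join leaves out.
audit = long_pop.merge(long_lyf, on=["country", "year"], how="outer", indicator=True)
dropped = audit[audit["_merge"] != "both"]

print(merged)
print(dropped[["country", "year", "_merge"]])
```

> Here only country A survives the inner merge; B and C each appear in just one table and are dropped.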

### Data Cleaning
>Now I'll take care of the null values using __dropna__. From there I will transform each dataset into a long table and finally merge all three tables together!

#### Cleaning up null values

In [10]:
df_pop.isnull().sum()

country    0
1800       0
1801       0
1802       0
1803       0
          ..
2096       0
2097       0
2098       0
2099       0
2100       0
Length: 302, dtype: int64

In [11]:
#Dropping all null value rows found in the life expectancy dataset
df_lyf = df_lyf.dropna()
df_lyf

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Afghanistan,28.2,28.2,28.2,28.2,28.2,28.2,28.1,28.1,28.1,...,75.5,75.7,75.8,76.0,76.1,76.2,76.4,76.5,76.6,76.8
1,Angola,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0,27.0,...,78.8,79.0,79.1,79.2,79.3,79.5,79.6,79.7,79.9,80.0
2,Albania,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,35.4,...,87.4,87.5,87.6,87.7,87.8,87.9,88.0,88.2,88.3,88.4
4,United Arab Emirates,30.7,30.7,30.7,30.7,30.7,30.7,30.7,30.7,30.7,...,82.4,82.5,82.6,82.7,82.8,82.9,83.0,83.1,83.2,83.3
5,Argentina,33.2,33.2,33.2,33.2,33.2,33.2,33.2,33.2,33.2,...,86.2,86.3,86.5,86.5,86.7,86.8,86.9,87.0,87.1,87.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,Samoa,25.4,25.4,25.4,25.4,25.4,25.4,25.4,25.4,25.4,...,79.8,79.9,80.0,80.1,80.3,80.4,80.5,80.6,80.7,80.8
191,Yemen,23.4,23.4,23.4,23.4,23.4,23.4,23.4,23.4,23.4,...,76.9,77.0,77.1,77.3,77.4,77.5,77.6,77.8,77.9,78.0
192,South Africa,33.5,33.5,33.5,33.5,33.5,33.5,33.5,33.5,33.5,...,76.4,76.5,76.7,76.8,77.0,77.1,77.3,77.4,77.5,77.7
193,Zambia,32.6,32.6,32.6,32.6,32.6,32.6,32.6,32.6,32.6,...,75.8,76.0,76.1,76.3,76.4,76.5,76.7,76.8,77.0,77.1


In [12]:
df_lyf.isnull().sum()

country    0
1800       0
1801       0
1802       0
1803       0
          ..
2096       0
2097       0
2098       0
2099       0
2100       0
Length: 302, dtype: int64

In [13]:
#Making sure there are no null values in the df_inc rows
df_inc = df_inc.dropna()
df_inc

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2041,2042,2043,2044,2045,2046,2047,2048,2049,2050
0,Afghanistan,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,...,751,767,783,800,817,834,852,870,888,907
1,Angola,517.0,519.0,522.0,524.0,525.0,528.0,531.0,533.0,536.0,...,2770,2830,2890,2950,3010,3080,3140,3210,3280,3340
2,Albania,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,207.0,...,9610,9820,10k,10.2k,10.5k,10.7k,10.9k,11.1k,11.4k,11.6k
3,United Arab Emirates,738.0,740.0,743.0,746.0,749.0,751.0,754.0,757.0,760.0,...,47.9k,48.9k,50k,51k,52.1k,53.2k,54.3k,55.5k,56.7k,57.9k
4,Argentina,794.0,797.0,799.0,802.0,805.0,808.0,810.0,813.0,816.0,...,12.8k,13.1k,13.4k,13.6k,13.9k,14.2k,14.5k,14.8k,15.2k,15.5k
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186,Samoa,373.0,373.0,373.0,373.0,373.0,373.0,373.0,374.0,374.0,...,5330,5440,5560,5670,5790,5920,6040,6170,6300,6440
187,Yemen,197.0,198.0,198.0,199.0,199.0,200.0,200.0,201.0,202.0,...,1440,1470,1500,1530,1560,1590,1630,1660,1700,1730
188,South Africa,800.0,791.0,782.0,773.0,765.0,724.0,724.0,786.0,687.0,...,7630,7790,7960,8130,8300,8480,8660,8840,9030,9220
189,Zambia,213.0,214.0,215.0,215.0,215.0,216.0,216.0,217.0,217.0,...,1260,1290,1320,1340,1370,1400,1430,1460,1490,1520


In [14]:
df_inc.isnull().sum()

country    0
1800       0
1801       0
1802       0
1803       0
          ..
2046       0
2047       0
2048       0
2049       0
2050       0
Length: 252, dtype: int64

### Transforming Datasets into Tables of 3 columns respectively

In [15]:
df_pop = df_pop.melt(id_vars=["country"], 
        var_name="year", 
        value_name="pop")
df_pop

Unnamed: 0,country,year,pop
0,Afghanistan,1800,3.28M
1,Angola,1800,1.57M
2,Albania,1800,400k
3,Andorra,1800,2650
4,United Arab Emirates,1800,40.2k
...,...,...,...
59292,Samoa,2100,310k
59293,Yemen,2100,53.2M
59294,South Africa,2100,79.2M
59295,Zambia,2100,81.5M


In [16]:
df_inc = df_inc.melt(id_vars=["country"], 
        var_name="year", 
        value_name="income")
df_inc

Unnamed: 0,country,year,income
0,Afghanistan,1800,207.0
1,Angola,1800,517.0
2,Albania,1800,207.0
3,United Arab Emirates,1800,738.0
4,Argentina,1800,794.0
...,...,...,...
47685,Samoa,2050,6440
47686,Yemen,2050,1730
47687,South Africa,2050,9220
47688,Zambia,2050,1520


In [17]:
df_lyf = df_lyf.melt(id_vars=["country"], 
        var_name="year", 
        value_name="life_exp")
df_lyf

Unnamed: 0,country,year,life_exp
0,Afghanistan,1800,28.2
1,Angola,1800,27.0
2,Albania,1800,35.4
3,United Arab Emirates,1800,30.7
4,Argentina,1800,33.2
...,...,...,...
55981,Samoa,2100,80.8
55982,Yemen,2100,78.0
55983,South Africa,2100,77.7
55984,Zambia,2100,77.1


#### Merging the 3 datasets into one (df1)

In [18]:
df1 = df_pop.merge(df_inc,on=['country','year']).merge(df_lyf,on=['country','year'])
print(df1)

                    country  year    pop income  life_exp
0               Afghanistan  1800  3.28M  207.0      28.2
1                    Angola  1800  1.57M  517.0      27.0
2                   Albania  1800   400k  207.0      35.4
3      United Arab Emirates  1800  40.2k  738.0      30.7
4                 Argentina  1800   534k  794.0      33.2
...                     ...   ...    ...    ...       ...
46179                 Samoa  2050   267k   6440      74.3
46180                 Yemen  2050  48.1M   1730      72.2
46181          South Africa  2050  75.5M   9220      70.9
46182                Zambia  2050  39.1M   1520      69.8
46183              Zimbabwe  2050  23.9M   1880      67.6

[46184 rows x 5 columns]


In [19]:
df1['year']=df1['year'].astype(int)
df1['life_exp']=df1['life_exp'].astype(float)

In [None]:
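#Convert suffixed strings such as '10.2k' or '3.28M' to integers; cast to str first because the columns mix floats and strings
df1['income'] = df1['income'].astype(str).replace({'k': '*1e3', 'M': '*1e6'}, regex=True).map(pd.eval).astype(int)
df1['pop'] = df1['pop'].astype(str).replace({'k': '*1e3', 'M': '*1e6', 'B': '*1e9'}, regex=True).map(pd.eval).astype(int)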
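
> An alternative to evaluating rewritten strings is a small explicit parser. The sketch below is only one possible approach; `SUFFIXES` and `parse_suffixed` are hypothetical names introduced here, and the suffix set (k, M, B) is assumed from the values visible in the previews above.

```python
# Multipliers for the suffixes seen in the Gapminder previews (assumed to be complete).
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9}


def parse_suffixed(value):
    """Convert values such as '3.28M', '40.2k', '2650' or 207.0 to a float."""
    text = str(value).strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


# round() avoids truncating results that land just below a whole number.
print(round(parse_suffixed("40.2k")))
print(round(parse_suffixed("3.28M")))
print(parse_suffixed(207.0))
```

> The function could be applied column-wise with something like `df1['pop'].map(parse_suffixed)`. It avoids evaluating strings as code and makes the accepted suffixes explicit.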

#### Exploring dataset for datatypes, duplicates, errors and null values

In [None]:
df1.dtypes

In [None]:
df1.duplicated().sum()

In [None]:
df1.isnull().sum()

In [None]:
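#trim data, working with data from year:1980-2020
#filter on the year column instead of hard-coded row positions, which need not line up with year boundaries
df1 = df1[df1['year'].between(1980, 2020)]
df1.head()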

<a id='eda'></a>
## Exploratory Data Analysis

>Now that the data is cleaned, trimmed and set, we can move on to the analysis. Let's get it!!

### Research Question 1 : Is there any correlation between life expectancy and standard of living through the years?
>The idea is to find out whether the life expectancy of less developed countries differs significantly from that of developed countries. We will also check whether population size has any impact on life expectancy.

#### Case Study: Cameroon, France
>H0: Life expectancy in developed countries = life expectancy in less developed countries

>H1: Life expectancy in developed countries != life expectancy in less developed countries
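
> One way H0 could be checked numerically is a permutation test on the difference in means. The sketch below is an illustration under assumptions, not the method used in this notebook: `permutation_test` is a hypothetical helper, and the two lists are made-up values loosely in the ranges discussed for the two countries. Yearly values of one country are not independent of each other, so any p-value obtained this way should be read as indicative only.

```python
import random
from statistics import mean


def permutation_test(sample_a, sample_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_perm):
        # Reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)


# Made-up yearly life expectancy values (hypothetical, for illustration only).
cameroon_like = [54.0, 55.1, 56.3, 57.0, 57.8, 58.5, 59.9, 61.2, 62.4, 63.0]
france_like = [75.0, 76.2, 77.1, 78.0, 79.3, 80.1, 81.0, 81.9, 82.5, 83.0]

p_value = permutation_test(cameroon_like, france_like)
print(p_value)
```

> A small p-value would count as evidence against H0. With the real data, the two samples would be `df_cm['life_exp']` and `df_fr['life_exp']`.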

In [None]:
#query data with country = Cameroon
df_cm = df1.query('country == "Cameroon"')
df_cm.head()

In [None]:
#Helper that applies one shared layout to every figure, to avoid repeating the same settings
def plotter(fig):
    """Apply the shared layout (grouped bars, angled x ticks, centred title) and return the figure."""
    return fig.update_layout(
        barmode='group',
        xaxis_tickangle=-45,
        legend={'bgcolor': 'red'},
        title={'y': 0.9, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    )

In [None]:
#Histogram of life expectancy in Cameroon
fig = px.histogram(df_cm, x='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Life Expectancy Histogram - Cameroon')

plotter(fig)

> The histogram above shows that life expectancy in Cameroon from 1980-2020 ranges from 54 to 63 years, with the highest frequency between 57 and 59 years. Most of the mass sits at the lower end of the range: in most years life expectancy is between 54 and 59 years.

In [None]:
#query data with country = France
df_fr = df1.query('country == "France"')
df_fr.head()

In [None]:
#df_fr['life_exp'].hist(figsize=(15,5));
fig = px.histogram(df_fr, x='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Life Expectancy Histogram - France')

plotter(fig)

> The histogram above shows that life expectancy in France from 1980-2020 ranges from 75 to 83 years, with the highest frequency above 82 years. The whole range is __markedly higher__ than that of Cameroon (54-63 years).

In [None]:
#plotting relationship between life expectancy and years
#fig1=px.bar(df_cm, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'})
#fig1.update_layout(barmode='group', xaxis_tickangle=-45)
fig = px.bar(df_cm, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - Cameroon')

plotter(fig)

In [None]:
#Scatter plot for Cameroon only; the France scatter plot is drawn in its own cell further down
fig = px.scatter(df_cm, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - Cameroon')

plotter(fig)


> From the bar chart and scatter plot above, we observe a steady rise from 2002, with life expectancy in Cameroon growing from 54 to 63+ years over roughly the last 20 years of the period.

In [None]:
fig = px.bar(df_fr, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - France')

plotter(fig)

In [None]:
fig = px.scatter(df_fr, x='year', y='life_exp', height=320, labels={'life_exp':'Life Expectancy'}, title='Relationship between Life Expectancy and Years - France')

plotter(fig)

> From the bar chart and the scatter plot above, we can see a steady rise throughout the years: life expectancy in France has grown from 75 to 82+ years over the 1980-2020 period.

In [None]:
fig = px.bar(df_cm, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population Cameroon'}, title='Life Expectancy with Respect to Pop Growth per Year - Cameroon')

plotter(fig)

In [None]:
fig=px.scatter(df_cm, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population Cameroon'}, title='Life Expectancy with Respect to Pop Growth per Year - Cameroon')

plotter(fig)

In [None]:
fig=px.bar(df_fr, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population France'}, title='Life Expectancy with Respect to Pop Growth per Year - France')

plotter(fig)

In [None]:
fig=px.scatter(df_fr, x='year', y='pop', color='life_exp', height=320, labels={'pop':'Population France'}, title='Life Expectancy with Respect to Pop Growth per Year - France')

plotter(fig)

> The graphs above show a steady rise in life expectancy in both countries, with France consistently higher than Cameroon. France also has the larger population (64.7M people versus 26.5M for Cameroon). By 2019, life expectancy in France was already at 82+ years, far higher than Cameroon's (62+ years in 2020).

### Research Question 2 :  Is there any correlation between Life Expectancy and Income per Person?

In [None]:
#compute to get descriptive statistics
df_cm.describe()

In [None]:
#compute to get descriptive statistics
df_fr.describe()

In [None]:
#df_cm['income'].hist(figsize=(15,5));
fig = px.histogram(df_cm, x='income', height=320, labels={'income':'Income'}, title='Income Histogram - Cameroon');

plotter(fig)

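>The income histogram has most of its mass at the lower end, with the highest frequency between 1500 and 1700 USD. Each observation is one year of GNI per capita, so this means very few years have a value above 1750 USD.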

In [None]:
#Relationship between life expectancy and income with respect to the population growth in Cameroon on bar chart
fig=px.bar(df_cm, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - Cameroon')

plotter(fig)

In [None]:
#Scatter plot for relationship between life expectancy and income with respect to the population growth in Cameroon on Scatter chart
fig=px.scatter(df_cm, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - Cameroon')

plotter(fig)

> In most years, GNI per capita in Cameroon is below 2000 USD. In the years where it is above 2000 USD, life expectancy does not reach 59 years.

In [None]:
#df_fr['income'].hist(figsize=(15,5));
fig = px.histogram(df_fr, x='income', height=320, labels={'income':'Income'}, title='Income Histogram - France');

plotter(fig)

In [None]:
#Relationship between life expectancy and income with respect to the population growth in France on bar chart
fig= px.bar(df_fr, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - France')

plotter(fig)

In [None]:
#Scatter plot for relationship between life expectancy and income with respect to the population growth in France on scatter chart
fig= px.scatter(df_fr, x='pop', y='income', height=320, labels={'pop':'Population'}, color='life_exp', title='Life Expectancy And Income Per Year - France')

plotter(fig)

> Comparing the plots for Cameroon and France, we see a clear difference. In Cameroon, life expectancy is lower in the years with higher GNI per capita. In contrast, in France life expectancy rises together with GNI per capita and population.
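
> These visual impressions could be backed by a correlation coefficient per country. The sketch below shows the idea on made-up rows shaped like `df1`. The frame `toy` and the dictionary `corr_by_country` are hypothetical names, and the values are invented so that one country has a falling trend and the other a rising one.

```python
import pandas as pd

# Made-up rows in the same shape as df1 (country, year, income, life_exp).
toy = pd.DataFrame({
    "country": ["X"] * 4 + ["Y"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "income": [1000, 1100, 1200, 1300, 30000, 31000, 32000, 33000],
    "life_exp": [60.0, 59.5, 59.0, 58.5, 80.0, 80.5, 81.0, 81.5],
})

# Pearson correlation between income and life expectancy, one value per country.
corr_by_country = {
    name: group["income"].corr(group["life_exp"])
    for name, group in toy.groupby("country")
}
print(corr_by_country)
```

> A coefficient near -1 or +1 inside a single country's time series mostly reflects shared trends over time, so it describes association and says nothing about cause.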

<a id='conclusions'></a>
## Conclusions

### Results:
From the analysis carried out:

> Using the case studies, we found that France has a higher life expectancy range (up to __82+ Years__) than Cameroon (up to __63+ Years__).

> We see that life expectancy rose steadily for 20+ years in both Cameroon and France while their populations grew, which suggests that the increase in population has had no negative effect on life expectancy.

> The mean GNI per capita of Cameroon is 1672.9 USD while that of France is 40265 USD, which also points to a higher standard of living in the developed country.

> For Cameroon, life expectancy is lower in years with GNI > 2000 USD and higher in years with GNI < 2000 USD; for France, life expectancy increases as GNI increases.

> In conclusion, population size does appear to be related to life expectancy: as the bar and scatter plots show, life expectancy gets higher as the population size increases. Why this happens remains an open question.

> Finally, income is also related to life expectancy, but the direction depends on the population in question: it can be negative (as in Cameroon, where an increase in GNI coincides with a lower life expectancy) or positive (as in France, where an increase in GNI coincides with a higher life expectancy).

### Limitations:
There are a few limitations with our data:

> The statistics are mostly descriptive, with only a little hypothesis testing, so we did not attempt any inferential or causal analysis.

> We worked with a limited amount of data: a good number of null values prevented us from widening the scope of the analysis.