Import the packages required for the analysis:

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Load the data to analyse:

In [2]:
df_raw = pd.read_csv("all_data.csv")

Explore the data and start basic cleaning:

In [3]:
print(df_raw.head(20))

   Country  Year  Life expectancy at birth (years)           GDP
0    Chile  2000                              77.3  7.786093e+10
1    Chile  2001                              77.3  7.097992e+10
2    Chile  2002                              77.8  6.973681e+10
3    Chile  2003                              77.9  7.564346e+10
4    Chile  2004                              78.0  9.921039e+10
5    Chile  2005                              78.4  1.229650e+11
6    Chile  2006                              78.9  1.547880e+11
7    Chile  2007                              78.9  1.736060e+11
8    Chile  2008                              79.6  1.796380e+11
9    Chile  2009                              79.3  1.723890e+11
10   Chile  2010                              79.1  2.185380e+11
11   Chile  2011                              79.8  2.522520e+11
12   Chile  2012                              79.9  2.671220e+11
13   Chile  2013                              80.1  2.783840e+11
14   Chile  2014         

In [4]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Country                           96 non-null     object 
 1   Year                              96 non-null     int64  
 2   Life expectancy at birth (years)  96 non-null     float64
 3   GDP                               96 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 3.1+ KB


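Judging from the first rows of the DataFrame, each observation is identified by its Country and Year together, so a duplicate can only be a repeated (Country, Year) pair. Any such duplicates are removed here, keeping the last occurrence of each pair.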

In [5]:
df_new = df_raw.drop_duplicates(
  subset = ['Country', 'Year'],
  keep = 'last').reset_index(drop = True)
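
Before dropping anything, it can help to count how many rows are actually affected. The following is a minimal sketch on a toy DataFrame (the values and the names `toy`, `dup_mask`, `deduped` are illustrative assumptions, not taken from all_data.csv); it uses `DataFrame.duplicated` with the same `subset` as the cell above.

```python
import pandas as pd

# Toy frame standing in for df_raw; values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "Year": [2000, 2000, 2001, 2000],
    "GDP": [1.0, 1.5, 2.0, 3.0],
})

# Flag every row whose (Country, Year) pair appears more than once.
dup_mask = toy.duplicated(subset=["Country", "Year"], keep=False)
n_dup_rows = int(dup_mask.sum())

# Same call as in the notebook: keep the last row of each duplicated pair.
deduped = toy.drop_duplicates(
    subset=["Country", "Year"], keep="last"
).reset_index(drop=True)
n_removed = len(toy) - len(deduped)

print(n_dup_rows, n_removed)
```

If `n_removed` is 0 on the real data, the deduplication step is a no-op and the row count is unchanged.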

In [6]:
print(df_new.nunique())

Country                              6
Year                                16
Life expectancy at birth (years)    69
GDP                                 96
dtype: int64


There should be 16 rows for each country, one per year. If any are missing, that gap in the data needs to be dealt with:

In [7]:
Country = pd.unique(df_new['Country'])
print(Country)

['Chile' 'China' 'Germany' 'Mexico' 'United States of America' 'Zimbabwe']


There are 6 countries and 96 unique GDP values, so the DataFrame still has 96 rows. Dividing 96 by 6 gives 16 rows per country, which suggests that all 16 years are present for each. But just to be sure:

In [8]:
# Keep only the Country and Year columns, then list the years recorded for each country
df_CY = df_new[['Country', 'Year']]
for variable in Country:
    print(df_CY[df_CY.Country == variable].head(16))

   Country  Year
0    Chile  2000
1    Chile  2001
2    Chile  2002
3    Chile  2003
4    Chile  2004
5    Chile  2005
6    Chile  2006
7    Chile  2007
8    Chile  2008
9    Chile  2009
10   Chile  2010
11   Chile  2011
12   Chile  2012
13   Chile  2013
14   Chile  2014
15   Chile  2015
   Country  Year
16   China  2000
17   China  2001
18   China  2002
19   China  2003
20   China  2004
21   China  2005
22   China  2006
23   China  2007
24   China  2008
25   China  2009
26   China  2010
27   China  2011
28   China  2012
29   China  2013
30   China  2014
31   China  2015
    Country  Year
32  Germany  2000
33  Germany  2001
34  Germany  2002
35  Germany  2003
36  Germany  2004
37  Germany  2005
38  Germany  2006
39  Germany  2007
40  Germany  2008
41  Germany  2009
42  Germany  2010
43  Germany  2011
44  Germany  2012
45  Germany  2013
46  Germany  2014
47  Germany  2015
   Country  Year
48  Mexico  2000
49  Mexico  2001
50  Mexico  2002
51  Mexico  2003
52  Mexico  2004
53  Mexico  20
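
Printing every row works for six countries, but the same check can be done more compactly. The sketch below, on a toy DataFrame with made-up values, counts the distinct years per country and lists any expected years that are absent. The names `toy`, `expected_years` and `missing_years` are assumptions for illustration; for the real data the expected range would be 2000 to 2015.

```python
import pandas as pd

# Toy frame standing in for df_new; "B" is deliberately missing the year 2001.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2002],
})

# Years every country is expected to have (range(2000, 2016) for the real data).
expected_years = set(range(2000, 2003))

# Number of distinct years recorded for each country.
years_per_country = toy.groupby("Country")["Year"].nunique()

# For each country, the expected years that do not appear in its rows.
missing_years = {
    country: sorted(expected_years - set(group["Year"]))
    for country, group in toy.groupby("Country")
}

print(years_per_country)
print(missing_years)
```

An empty list for every country means no year is missing.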

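There are only 69 unique life expectancy values across the 96 rows. This may simply mean that some values repeat (Chile, for example, has 77.3 in both 2000 and 2001), but it is worth confirming that no values are missing: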

In [9]:
# Print the rows with a missing value in each column (an empty DataFrame means none are missing)
columns = ['Country', 'Year', 'GDP', 'Life expectancy at birth (years)']
for variable in columns:
    print(df_new[df_new[variable].isna()])


Empty DataFrame
Columns: [Country, Year, Life expectancy at birth (years), GDP]
Index: []
Empty DataFrame
Columns: [Country, Year, Life expectancy at birth (years), GDP]
Index: []
Empty DataFrame
Columns: [Country, Year, Life expectancy at birth (years), GDP]
Index: []
Empty DataFrame
Columns: [Country, Year, Life expectancy at birth (years), GDP]
Index: []
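
The four empty DataFrames show that no column contains missing values, so the lower unique count comes from repeated values. A shorter way to see both facts is sketched below on a toy DataFrame (made-up values, with one NaN added on purpose; `toy`, `col` and `repeated` are illustrative names). It relies on `isna().sum()` for missing counts and on `nunique()`, which ignores NaN by default.

```python
import numpy as np
import pandas as pd

col = "Life expectancy at birth (years)"

# Toy frame standing in for df_new; the NaN is inserted on purpose.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "Year": [2000, 2001, 2002, 2000],
    col: [77.3, 77.3, 77.8, np.nan],
})

# Missing values per column, in a single call.
missing_per_column = toy.isna().sum()

# Unique values (NaN excluded) versus non-null values in the column.
n_unique = toy[col].nunique()
n_non_null = int(toy[col].count())

# Values that occur more than once explain the gap between the two counts.
counts = toy[col].value_counts()
repeated = counts[counts > 1]

print(missing_per_column)
print(n_unique, n_non_null)
print(repeated)
```

When `missing_per_column` is all zeros, as the output above implies for the real data, any difference between the row count and `nunique()` is due to repeats alone.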
