<img src=https://i.imgur.com/QnKVI6k.jpg>


---

# Introduction

The datasets were selected according to how strongly they relate to migration. We have grouped them into the following categories: <br>
* Economy
* People
* Environment
* Poverty
* States 
* Demographic <br>

In this notebook, you will find the exploratory data analysis for each of the selected datasets, along with a preliminary conclusion for each one. 

---

# Exploratory Data Analysis

We begin by importing all the necessary libraries for the analysis: 

In [71]:
# Libraries used to work with the datasets:
import pandas as pd
import numpy as np
import wbgapi as wb  # World Bank API client; provides wb.data.DataFrame used below
import datetime
import datapackage

In [3]:
today = datetime.date.today()
year = today.year
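The loading cells further down pass `time=range(2000, year)`. Python's `range` excludes its end value, so the current calendar year is never requested. A minimal sketch of that behaviour follows. The run date in 2022 is an assumption, chosen because the outputs in this notebook end at 2021, and `requested_years` is a hypothetical helper:

```python
import datetime


def requested_years(start: int, today: datetime.date) -> list:
    """Return the years covered by range(start, today.year)."""
    # range() stops before today.year, so the current year is left out.
    return list(range(start, today.year))


# Assumed run date, for illustration only.
years = requested_years(2000, datetime.date(2022, 6, 1))
print(years[0], years[-1], len(years))
```

If the current year should be included as well, the end of the range would need to be `year + 1`.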

We've decided to add a prefix to each dataset name, indicating which category it belongs to.
* References:<br>
**eco** = economy-related dataset <br>
**peo** = people-related dataset<br>
**env** = environment-related dataset<br>
**pov** = poverty-related dataset<br> 
**sta** = states-related dataset<br>
**mig** = demographic-related dataset<br>

---

## ⟐ Economy: 
•	Gross Domestic Product <br>
•	GDP Growth<br>
•	Final Consumption Expenditure<br>
•	GNI Per Capita<br>
•	Gross Savings <br>
•	Consumer Price<br>
•	Total reserves (gold + US$)<br>
•	Foreign direct investment, net inflows (BoP, current US$)<br>
•	Total tax and contribution rate (PCT of profit)<br>
•	Time required to start a business (days)<br>
•	Exports of goods and services (PCT of GDP)<br>
•	Imports of goods and services (PCT of GDP)

We load the databases into variables:

In [3]:
# Gross Domestic Product
eco_gdp = wb.data.DataFrame('NY.GDP.MKTP.CD', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_gdp.rename(columns={'NY.GDP.MKTP.CD': 'Gross Domestic Product'}, inplace=True)

# GDP Growth
eco_gdp_growth = wb.data.DataFrame('NY.GDP.MKTP.KD.ZG', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_gdp_growth.rename(columns={'NY.GDP.MKTP.KD.ZG': 'GDP Growth'}, inplace=True)

# Final Consumption Expenditure
eco_cons_expen = wb.data.DataFrame('NE.CON.TOTL.KD.ZG', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_cons_expen.rename(columns={'NE.CON.TOTL.KD.ZG': 'Final Consumption Expenditure'}, inplace=True)

# GNI_Per_Capita
eco_gni_capita = wb.data.DataFrame('NY.GNP.PCAP.CD', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_gni_capita.rename(columns={'NY.GNP.PCAP.CD': 'GNI Per Capita'}, inplace=True)

# Gross Savings
eco_gross_savings = wb.data.DataFrame('NY.GNS.ICTR.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_gross_savings.rename(columns={'NY.GNS.ICTR.ZS': 'Gross Savings'}, inplace=True)

# Consumer Price
eco_consumer_price = wb.data.DataFrame('FP.CPI.TOTL', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_consumer_price.rename(columns={'FP.CPI.TOTL': 'Consumer Price'}, inplace=True)

# Total reserves (gold + US$)
eco_total_reserves = wb.data.DataFrame('FI.RES.TOTL.CD', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_total_reserves.rename(columns={'FI.RES.TOTL.CD': 'Total reserves (gold + US$)'}, inplace=True)

# Foreign direct investment, net inflows (BoP, current US$)
eco_foreign_invest = wb.data.DataFrame('BX.KLT.DINV.CD.WD', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_foreign_invest.rename(columns={'BX.KLT.DINV.CD.WD': 'Foreign direct investment, net inflows (BoP, current US$)'}, inplace=True)

# Total tax and contribution rate (PCT of profit)
eco_total_tax = wb.data.DataFrame('IC.TAX.TOTL.CP.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_total_tax.rename(columns={'IC.TAX.TOTL.CP.ZS': 'Total tax and contribution rate (PCT of profit)'}, inplace=True)

# Time required to start a business (days)
eco_start_business = wb.data.DataFrame('IC.REG.DURS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_start_business.rename(columns={'IC.REG.DURS': 'Time required to start a business (days)'}, inplace=True)

# Exports of goods and services (PCT of GDP)
eco_exports = wb.data.DataFrame('NE.EXP.GNFS.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_exports.rename(columns={'NE.EXP.GNFS.ZS': 'Exports of goods and services (PCT of GDP)'}, inplace=True)

# Imports of goods and services (PCT of GDP)
eco_imports = wb.data.DataFrame('NE.IMP.GNFS.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
eco_imports.rename(columns={'NE.IMP.GNFS.ZS': 'Imports of goods and services (PCT of GDP)'}, inplace=True)
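The cells above repeat the same two lines for every indicator. One possible way to reduce the repetition is a dictionary of codes and labels plus a small loader. This is only a sketch: `load_indicator` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in with made-up values so the example runs without network access.

```python
import pandas as pd

# Indicator codes and labels copied from the cells above (subset shown).
ECO_INDICATORS = {
    "NY.GDP.MKTP.CD": "Gross Domestic Product",
    "NY.GDP.MKTP.KD.ZG": "GDP Growth",
    "NY.GNP.PCAP.CD": "GNI Per Capita",
}


def load_indicator(code, label, fetch):
    """Fetch one indicator and rename its value column to a readable label."""
    df = fetch(code).reset_index()
    return df.rename(columns={code: label})


def fake_fetch(code):
    """Hypothetical stand-in for the wbgapi call; returns made-up values."""
    index = pd.MultiIndex.from_tuples(
        [("ZWE", "YR2021"), ("ZWE", "YR2020")], names=["economy", "time"]
    )
    return pd.DataFrame(
        {"Country": ["Zimbabwe", "Zimbabwe"], "Time": ["2021", "2020"], code: [1.0, 2.0]},
        index=index,
    )


# With wbgapi available, the fetcher could wrap the same call used above, e.g.
# fetch = lambda code: wb.data.DataFrame(code, labels=True, time=range(2000, year),
#                                        skipBlanks=True, columns='series')
eco = {
    label: load_indicator(code, label, fake_fetch)
    for code, label in ECO_INDICATORS.items()
}
print(list(eco["GDP Growth"].columns))
```

Keeping the frames in a dictionary also makes it straightforward to run the same check over every dataset in a loop.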

---

We start by checking the basics of each one: 

### Gross Domestic Product <br>
This dataset shows the total gross value added by all resident producers in each country's economy, in current US$. 

In [4]:
eco_gdp.head()

Unnamed: 0,economy,time,Country,Time,Gross Domestic Product
0,ZWE,YR2021,Zimbabwe,2021,28371240000.0
1,ZWE,YR2020,Zimbabwe,2020,21509700000.0
2,ZWE,YR2019,Zimbabwe,2019,21832230000.0
3,ZWE,YR2018,Zimbabwe,2018,34156070000.0
4,ZWE,YR2017,Zimbabwe,2017,17584890000.0


We check the size of the dataset: 

In [5]:
eco_gdp.shape

(5609, 5)

We review the number of null values and the data type of each column: 

In [6]:
eco_gdp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5609 entries, 0 to 5608
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   economy                 5609 non-null   object 
 1   time                    5609 non-null   object 
 2   Country                 5609 non-null   object 
 3   Time                    5609 non-null   object 
 4   Gross Domestic Product  5609 non-null   float64
dtypes: float64(1), object(4)
memory usage: 219.2+ KB


In [7]:
eco_gdp.describe()

Unnamed: 0,Gross Domestic Product
count,5609.0
mean,2054997000000.0
std,7473989000000.0
min,13965040.0
25%,6050876000.0
50%,38587020000.0
75%,446258400000.0
max,96513080000000.0


We check for duplicates: 

In [8]:
eco_gdp[eco_gdp.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Gross Domestic Product
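An empty result here means no (Country, Time) pair occurs more than once. As a reminder of how `duplicated` behaves, the sketch below uses made-up rows: by default only the second and later copies are flagged, while `keep=False` flags every copy.

```python
import pandas as pd

# Made-up rows: country "A" appears twice for 2020.
df = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Time": ["2020", "2020", "2020"],
    "value": [1.0, 1.5, 2.0],
})

# Default keep="first": only the second and later copies are flagged.
later_copies = df[df.duplicated(subset=["Country", "Time"])]

# keep=False flags every row that shares its key with another row.
all_copies = df[df.duplicated(subset=["Country", "Time"], keep=False)]

print(len(later_copies), len(all_copies))
```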


We check it has the years needed:

In [9]:
eco_gdp['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)
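Reading the array by eye works for 22 values. A set comparison is a possible alternative that reports any gap directly. The helper name `missing_years` and the sample rows are assumptions for illustration:

```python
import pandas as pd


def missing_years(df, start, end, col="Time"):
    """Return the years in [start, end] that never appear in df[col]."""
    present = set(df[col].astype(int).tolist())
    return sorted(set(range(start, end + 1)) - present)


# Made-up sample in which 2001 is absent.
sample = pd.DataFrame({"Time": ["2003", "2002", "2000"]})
print(missing_years(sample, 2000, 2003))
```

This checks the dataset as a whole; a single country can still lack some years even when the overall list is complete.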

We check the number of countries and which ones it has:

In [10]:
eco_gdp['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Tuvalu',
       'Turks and Caicos Islands', 'Turkmenistan', 'Turkiye', 'Tunisia',
       'Trinidad and Tobago', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Vincent and the Grenadines',
       'St. Martin (French part)', 'St. Lucia', 'St. Kitts and Nevis',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa', 'Somalia',
       'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Singapore', 'Sierra Leone',
       'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'San Marino', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Qatar', 'Puerto

In [11]:
eco_gdp['Country'].nunique()

262
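This count is higher than the number of countries the analysis needs. One way to narrow a dataset down is to keep only the rows whose code is in a chosen list. The sketch below uses made-up rows and a hypothetical `wanted_codes` list; the `WLD` row is an assumption about what a non-country entry may look like.

```python
import pandas as pd

# Made-up rows; "WLD" stands in for an entry that is not a single country.
df = pd.DataFrame({
    "economy": ["ZWE", "ZMB", "WLD"],
    "Country": ["Zimbabwe", "Zambia", "World"],
    "value": [1.0, 2.0, 3.0],
})

wanted_codes = ["ZWE", "ZMB"]  # hypothetical selection

# Keep only rows whose code is in the selection.
filtered = df[df["economy"].isin(wanted_codes)].reset_index(drop=True)
print(filtered["Country"].nunique())
```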

---

🔹 Conclusions on: **Gross Domestic Product** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 
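The renaming and type changes listed above could look roughly like the sketch below. The new name for 'Time' is not stated in this notebook, so `Year` is an assumption, as is the helper name `tidy_indicator`. The two sample rows are taken from the `head()` output above.

```python
import pandas as pd


def tidy_indicator(df, year_col="Year"):
    """Drop the redundant 'time' column, rename the rest, cast the year to int."""
    out = df.drop(columns=["time"])
    out = out.rename(columns={
        "economy": "Country Code",
        "Country": "Country Name",
        "Time": year_col,  # assumed target name
    })
    out[year_col] = out[year_col].astype(int)
    return out


sample = pd.DataFrame({
    "economy": ["ZWE", "ZWE"],
    "time": ["YR2021", "YR2020"],
    "Country": ["Zimbabwe", "Zimbabwe"],
    "Time": ["2021", "2020"],
    "Gross Domestic Product": [28371240000.0, 21509700000.0],
})

clean = tidy_indicator(sample)
print(clean.dtypes)
```

Since the same conclusions recur for the other economy datasets, a single helper like this could be applied to each of them.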

---

### GDP growth (annual %) <br>
This dataset shows the annual percentage growth rate of GDP.

In [12]:
eco_gdp_growth.head()

Unnamed: 0,economy,time,Country,Time,GDP Growth
0,ZWE,YR2021,Zimbabwe,2021,8.468017
1,ZWE,YR2020,Zimbabwe,2020,-7.816951
2,ZWE,YR2019,Zimbabwe,2019,-6.332446
3,ZWE,YR2018,Zimbabwe,2018,5.009867
4,ZWE,YR2017,Zimbabwe,2017,4.080264


We check the size of the dataset:

In [13]:
eco_gdp_growth.shape

(5529, 5)

We review the number of null values and the data type of each column: 

In [14]:
eco_gdp_growth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5529 entries, 0 to 5528
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   economy     5529 non-null   object 
 1   time        5529 non-null   object 
 2   Country     5529 non-null   object 
 3   Time        5529 non-null   object 
 4   GDP Growth  5529 non-null   float64
dtypes: float64(1), object(4)
memory usage: 216.1+ KB


In [15]:
eco_gdp_growth.describe()

Unnamed: 0,GDP Growth
count,5529.0
mean,3.399256
std,5.261056
min,-54.2359
25%,1.439615
50%,3.672372
75%,5.852892
max,86.826748


We check for duplicates:

In [16]:
eco_gdp_growth[eco_gdp_growth.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,GDP Growth


We check it has the years needed:

In [17]:
eco_gdp_growth['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [18]:
eco_gdp_growth['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Tuvalu',
       'Turks and Caicos Islands', 'Turkmenistan', 'Turkiye', 'Tunisia',
       'Trinidad and Tobago', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Vincent and the Grenadines',
       'St. Lucia', 'St. Kitts and Nevis', 'Sri Lanka', 'Spain',
       'South Sudan', 'South Africa', 'Somalia', 'Solomon Islands',
       'Slovenia', 'Slovak Republic', 'Sint Maarten (Dutch part)',
       'Singapore', 'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal',
       'Saudi Arabia', 'Sao Tome and Principe', 'San Marino', 'Samoa',
       'Rwanda', 'Russian Federation', 'Romania', 'Qatar', 'Puerto Rico',
       'Portugal', 'Poland'

In [19]:
eco_gdp_growth['Country'].nunique()

260

---

🔹 Conclusions on: **GDP growth (annual %)** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 

---

### Final consumption expenditure (annual % growth) <br>
This dataset shows the annual growth rate of final consumption expenditure (series `NE.CON.TOTL.KD.ZG`), which is the sum of household final consumption expenditure (private consumption) and general government final consumption expenditure.

In [20]:
eco_cons_expen.head()

Unnamed: 0,economy,time,Country,Time,Final Consumption Expenditure
0,ZWE,YR2021,Zimbabwe,2021,14.261391
1,ZWE,YR2020,Zimbabwe,2020,-4.540536
2,ZWE,YR2019,Zimbabwe,2019,-10.119249
3,ZWE,YR2018,Zimbabwe,2018,-0.462873
4,ZWE,YR2017,Zimbabwe,2017,3.920652


We check the size of the dataset:

In [21]:
eco_cons_expen.shape

(4400, 5)

We review the number of null values and the data type of each column: 

In [22]:
eco_cons_expen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4400 entries, 0 to 4399
Data columns (total 5 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   economy                        4400 non-null   object 
 1   time                           4400 non-null   object 
 2   Country                        4400 non-null   object 
 3   Time                           4400 non-null   object 
 4   Final Consumption Expenditure  4400 non-null   float64
dtypes: float64(1), object(4)
memory usage: 172.0+ KB


In [23]:
eco_cons_expen.describe()

Unnamed: 0,Final Consumption Expenditure
count,4400.0
mean,3.776654
std,7.947723
min,-40.868215
25%,1.522064
50%,3.487895
75%,5.749913
max,393.570492


We check for duplicates:

In [24]:
eco_cons_expen[eco_cons_expen.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Final Consumption Expenditure


We check it has the years needed:

In [25]:
eco_cons_expen['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [26]:
eco_cons_expen['Country'].unique()

array(['Zimbabwe', 'West Bank and Gaza', 'Virgin Islands (U.S.)',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Turkiye', 'Tunisia', 'Tonga', 'Togo',
       'Timor-Leste', 'Thailand', 'Tanzania', 'Tajikistan',
       'Syrian Arab Republic', 'Switzerland', 'Sweden', 'Sudan',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa',
       'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Sierra Leone', 'Seychelles',
       'Serbia', 'Senegal', 'Saudi Arabia', 'San Marino', 'Samoa',
       'Rwanda', 'Russian Federation', 'Romania', 'Qatar', 'Puerto Rico',
       'Portugal', 'Poland', 'Philippines', 'Peru', 'Paraguay',
       'Papua New Guinea', 'Panama', 'Pakistan', 'Oman', 'Norway',
       'North Macedonia', 'Nigeria', 'Niger', 'Nicaragua', 'New Zealand',
       'Netherlands', 'Nepal', 'Namibia', 'Myanmar', 'Mozambique',
       'M

In [27]:
eco_cons_expen['Country'].nunique()

219

---

🔹 Conclusions on: **Final consumption expenditure (annual % growth)** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed, but fewer countries than the other datasets. 
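To see which countries are absent from one dataset but present in another, a set difference on the country names is one option. The sketch below uses made-up miniature frames; the names were picked from the lists printed above, where 'Zambia' appears for GDP but not for final consumption expenditure.

```python
import pandas as pd

# Made-up miniature versions of two datasets.
gdp = pd.DataFrame({"Country": ["Zimbabwe", "Zambia", "Vietnam"]})
cons = pd.DataFrame({"Country": ["Zimbabwe", "Vietnam"]})

# Countries covered by the first frame but not by the second.
missing = sorted(set(gdp["Country"]) - set(cons["Country"]))
print(missing)
```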

---

### GNI per capita, Atlas method (US$) <br>
This dataset shows gross national income (converted to US dollars using the World Bank Atlas method) divided by the midyear population.

In [28]:
eco_gni_capita.head()

Unnamed: 0,economy,time,Country,Time,GNI Per Capita
0,ZWE,YR2021,Zimbabwe,2021,1530.0
1,ZWE,YR2020,Zimbabwe,2020,1460.0
2,ZWE,YR2019,Zimbabwe,2019,1450.0
3,ZWE,YR2018,Zimbabwe,2018,1550.0
4,ZWE,YR2017,Zimbabwe,2017,1170.0


We check the size of the dataset: 

In [29]:
eco_gni_capita.shape

(5285, 5)

We review the number of null values and the data type of each column: 

In [30]:
eco_gni_capita.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5285 entries, 0 to 5284
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   economy         5285 non-null   object 
 1   time            5285 non-null   object 
 2   Country         5285 non-null   object 
 3   Time            5285 non-null   object 
 4   GNI Per Capita  5285 non-null   float64
dtypes: float64(1), object(4)
memory usage: 206.6+ KB


In [31]:
eco_gni_capita.describe()

Unnamed: 0,GNI Per Capita
count,5285.0
mean,11772.568904
std,17359.500527
min,110.0
25%,1390.0
50%,4300.0
75%,13210.0
max,122470.0


We check for duplicates:

In [32]:
eco_gni_capita[eco_gni_capita.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,GNI Per Capita


We check it has the years needed:

In [33]:
eco_gni_capita['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [34]:
eco_gni_capita['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Tuvalu', 'Turks and Caicos Islands',
       'Turkmenistan', 'Turkiye', 'Tunisia', 'Trinidad and Tobago',
       'Tonga', 'Togo', 'Timor-Leste', 'Thailand', 'Tanzania',
       'Tajikistan', 'Syrian Arab Republic', 'Switzerland', 'Sweden',
       'Suriname', 'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Somalia', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Sint Maarten (Dutch part)', 'Singapore',
       'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'San Marino', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Qatar', 'Puerto Rico',
       'Portugal', 'Poland', 'Philippines', 'Peru', 

In [35]:
eco_gni_capita['Country'].nunique()

255

---

🔹 Conclusions on: **GNI per capita, Atlas method** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 

---

### Gross savings (% of GDP) <br>
This dataset shows gross savings, calculated as gross national income minus total consumption and measured as a percentage of GDP.

In [36]:
eco_gross_savings.head()

Unnamed: 0,economy,time,Country,Time,Gross Savings
0,ZWE,YR2020,Zimbabwe,2020,16.452763
1,ZWE,YR2019,Zimbabwe,2019,20.231198
2,ZWE,YR2018,Zimbabwe,2018,13.923906
3,ZWE,YR2017,Zimbabwe,2017,-2.682417
4,ZWE,YR2016,Zimbabwe,2016,-4.967818


We check the size of the dataset:

In [37]:
eco_gross_savings.shape

(4337, 5)

We review the number of null values and the data type of each column:

In [38]:
eco_gross_savings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4337 entries, 0 to 4336
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   economy        4337 non-null   object 
 1   time           4337 non-null   object 
 2   Country        4337 non-null   object 
 3   Time           4337 non-null   object 
 4   Gross Savings  4337 non-null   float64
dtypes: float64(1), object(4)
memory usage: 169.5+ KB


In [39]:
eco_gross_savings.describe()

Unnamed: 0,Gross Savings
count,4337.0
mean,23.850339
std,14.808126
min,-19.902974
25%,16.756782
50%,22.638722
75%,29.150342
max,372.988151


We check for duplicates: 

In [40]:
eco_gross_savings[eco_gross_savings.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Gross Savings


We check it has the years needed:

In [41]:
eco_gross_savings['Time'].unique()

array(['2020', '2019', '2018', '2017', '2016', '2015', '2014', '2013',
       '2012', '2011', '2010', '2009', '2021', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)
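Note that 2021 shows up between 2009 and 2008 in this output: `unique()` returns values in order of first appearance, not sorted. A small sketch with made-up values shows that sorting the result makes the check easier to read:

```python
import pandas as pd

# Made-up values in a non-chronological order, with one repeat.
times = pd.Series(["2020", "2009", "2021", "2008", "2020"])

seen = times.unique().tolist()  # order of first appearance
ordered = sorted(seen)          # chronological, since all values have four digits

print(seen)
print(ordered)
```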

We check the number of countries and which ones it has:

In [42]:
eco_gross_savings['Country'].unique()

array(['Zimbabwe', 'Zambia', 'West Bank and Gaza', 'Vietnam',
       'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'Ukraine', 'Uganda', 'Turkiye',
       'Tunisia', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand', 'Tanzania',
       'Tajikistan', 'Syrian Arab Republic', 'Switzerland', 'Sweden',
       'Suriname', 'Sudan', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Singapore', 'Sierra Leone',
       'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia', 'Samoa',
       'Rwanda', 'Russian Federation', 'Romania', 'Qatar', 'Portugal',
       'Poland', 'Philippines', 'Peru', 'Paraguay', 'Papua New Guinea',
       'Panama', 'Pakistan', 'Oman', 'Norway', 'North Macedonia',
       'Nigeria', 'Niger', 'Nicaragua', 'New Zealand', 'Netherlands',
       'Nepal', 'Namibia', 'Myanmar', 'Mozambique', 'Morocco',
       'Montenegro', 'Mongolia', 'Moldova', 'Mex

In [43]:
eco_gross_savings['Country'].nunique()

223

---

🔹 Conclusions on: **Gross savings (% of GDP)** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 

---

### Consumer price index <br>
This dataset reflects changes in the cost to the average consumer of acquiring a basket of goods and services that may be fixed or changed at specified intervals, such as yearly.

In [44]:
eco_consumer_price.head()

Unnamed: 0,economy,time,Country,Time,Consumer Price
0,ZWE,YR2021,Zimbabwe,2021,5411.002445
1,ZWE,YR2020,Zimbabwe,2020,2725.312815
2,ZWE,YR2019,Zimbabwe,2019,414.684309
3,ZWE,YR2018,Zimbabwe,2018,116.712211
4,ZWE,YR2017,Zimbabwe,2017,105.508414


We check the size of the dataset: 

In [45]:
eco_consumer_price.shape

(3966, 5)

We review the number of null values and the data type of each column: 

In [46]:
eco_consumer_price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3966 entries, 0 to 3965
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   economy         3966 non-null   object 
 1   time            3966 non-null   object 
 2   Country         3966 non-null   object 
 3   Time            3966 non-null   object 
 4   Consumer Price  3966 non-null   float64
dtypes: float64(1), object(4)
memory usage: 155.0+ KB


In [47]:
eco_consumer_price.describe()

Unnamed: 0,Consumer Price
count,3966.0
mean,132.766215
std,630.542086
min,2.909082
25%,82.989634
50%,100.466078
75%,115.800377
max,22570.711031


We check for duplicates:

In [48]:
eco_consumer_price[eco_consumer_price.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Consumer Price


We check it has the years needed:

In [49]:
eco_consumer_price['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [50]:
eco_consumer_price['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Tuvalu', 'Turkiye', 'Tunisia',
       'Trinidad and Tobago', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Vincent and the Grenadines',
       'St. Lucia', 'St. Kitts and Nevis', 'Sri Lanka', 'Spain',
       'South Sudan', 'South Africa', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Sint Maarten (Dutch part)', 'Singapore',
       'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'San Marino', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Qatar', 'Portugal', 'Poland',
       'Philippines', 'Peru', 'Paraguay', 'Papua New Guinea', 'Panama',
       'Palau', 'Pakistan', 'Oman',

In [51]:
eco_consumer_price['Country'].nunique()

192

---

🔹 Conclusions on: **Consumer price index** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.

---

### Total reserves (gold + US$)
This dataset shows total reserves, which comprise holdings of monetary gold, special drawing rights, reserves of IMF members held by the IMF, and holdings of foreign exchange under the control of monetary authorities, in current US$.

In [52]:
eco_total_reserves.head()

Unnamed: 0,economy,time,Country,Time,Total reserves (gold + US$)
0,ZWE,YR2021,Zimbabwe,2021,838780200.0
1,ZWE,YR2020,Zimbabwe,2020,33405020.0
2,ZWE,YR2019,Zimbabwe,2019,151240500.0
3,ZWE,YR2018,Zimbabwe,2018,86951090.0
4,ZWE,YR2017,Zimbabwe,2017,292621200.0


We check the size of the dataset:

In [53]:
eco_total_reserves.shape

(3801, 5)

We review the number of null values and the data type of each column:

In [54]:
eco_total_reserves.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3801 entries, 0 to 3800
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   economy                      3801 non-null   object 
 1   time                         3801 non-null   object 
 2   Country                      3801 non-null   object 
 3   Time                         3801 non-null   object 
 4   Total reserves (gold + US$)  3801 non-null   float64
dtypes: float64(1), object(4)
memory usage: 148.6+ KB


In [55]:
eco_total_reserves.describe()

Unnamed: 0,Total reserves (gold + US$)
count,3801.0
mean,56002010000.0
std,234458400000.0
min,267707.4
25%,658352100.0
50%,3647605000.0
75%,26946180000.0
max,3900039000000.0


We check for duplicates: 

In [56]:
# Build the duplicate mask from the same DataFrame that is being filtered
eco_total_reserves[eco_total_reserves.duplicated(subset=['Country', 'Time'])]


Unnamed: 0,economy,time,Country,Time,Total reserves (gold + US$)
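Because the same duplicate check is repeated for every dataset, it is easy to filter one DataFrame with a mask built from another. A sketch of a small helper (hypothetical name `find_duplicates`) that always builds the mask from the frame it filters, run here over made-up frames:

```python
import pandas as pd


def find_duplicates(df, keys=("Country", "Time")):
    """Return the rows of df that repeat an earlier (Country, Time) pair."""
    mask = df.duplicated(subset=list(keys))  # mask comes from df itself
    return df[mask]


# Made-up frames; the second one contains a repeated key.
frames = {
    "reserves": pd.DataFrame(
        {"Country": ["A", "B"], "Time": ["2020", "2020"], "value": [1.0, 2.0]}
    ),
    "prices": pd.DataFrame(
        {"Country": ["A", "A", "B"], "Time": ["2020", "2020", "2020"], "value": [1.0, 1.0, 2.0]}
    ),
}

duplicate_counts = {name: len(find_duplicates(df)) for name, df in frames.items()}
print(duplicate_counts)
```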


We check it has the years needed:

In [57]:
eco_total_reserves['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [58]:
eco_total_reserves['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Turkiye', 'Tunisia', 'Trinidad and Tobago',
       'Tonga', 'Timor-Leste', 'Thailand', 'Tanzania', 'Tajikistan',
       'Syrian Arab Republic', 'Switzerland', 'Sweden', 'Suriname',
       'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Singapore', 'Sierra Leone', 'Seychelles', 'Serbia',
       'Saudi Arabia', 'Sao Tome and Principe', 'San Marino', 'Samoa',
       'Rwanda', 'Russian Federation', 'Romania', 'Qatar', 'Portugal',
       'Poland', 'Philippines', 'Peru', 'Paraguay', 'Papua New Guinea',
       'Panama', 'Pakistan', 'Oman', 'Norway', 'North Macedonia',
       'Nigeria', 'Nicaragua', 'New Zealand'

In [59]:
eco_total_reserves['Country'].nunique()

179

---

🔹 Conclusions on: **Total reserves (gold + US$)** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.

---

### Foreign direct investment, net inflows (BoP, current US$)
This dataset shows net inflows of investment to acquire a lasting management interest (10% or more of voting stock) in an enterprise operating in an economy other than that of the investor. It is the sum of equity capital, reinvestment of earnings, other long-term capital, and short-term capital as shown in the balance of payments. This series shows net inflows (new investment inflows less disinvestment) in the reporting economy from foreign investors, in current US$.

In [60]:
eco_foreign_invest.head()

Unnamed: 0,economy,time,Country,Time,"Foreign direct investment, net inflows (BoP, current US$)"
0,ZWE,YR2021,Zimbabwe,2021,166000000.0
1,ZWE,YR2020,Zimbabwe,2020,150360000.0
2,ZWE,YR2019,Zimbabwe,2019,249500000.0
3,ZWE,YR2018,Zimbabwe,2018,717865300.0
4,ZWE,YR2017,Zimbabwe,2017,307187700.0


We check the size of the dataset: 

In [61]:
eco_foreign_invest.shape

(5369, 5)

We review the number of null values and the data type of each column: 

In [62]:
eco_foreign_invest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5369 entries, 0 to 5368
Data columns (total 5 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   economy                                                    5369 non-null   object 
 1   time                                                       5369 non-null   object 
 2   Country                                                    5369 non-null   object 
 3   Time                                                       5369 non-null   object 
 4   Foreign direct investment, net inflows (BoP, current US$)  5369 non-null   float64
dtypes: float64(1), object(4)
memory usage: 209.9+ KB


In [63]:
eco_foreign_invest.describe()

Unnamed: 0,"Foreign direct investment, net inflows (BoP, current US$)"
count,5369.0
mean,59014500000.0
std,220679500000.0
min,-344375400000.0
25%,141000000.0
50%,1333856000.0
75%,15459980000.0
max,3133823000000.0


We check for duplicates: 

In [65]:
eco_foreign_invest[eco_foreign_invest.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,"Foreign direct investment, net inflows (BoP, current US$)"


We check it has the years needed:

In [66]:
eco_foreign_invest['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [67]:
eco_foreign_invest['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Tuvalu', 'Turks and Caicos Islands',
       'Turkmenistan', 'Turkiye', 'Tunisia', 'Trinidad and Tobago',
       'Tonga', 'Togo', 'Timor-Leste', 'Thailand', 'Tanzania',
       'Tajikistan', 'Syrian Arab Republic', 'Switzerland', 'Sweden',
       'Suriname', 'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Somalia', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Sint Maarten (Dutch part)', 'Singapore',
       'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'Samoa', 'Rwanda', 'Russian Federation',
       'Romania', 'Qatar', 'Portugal', 'Poland', 'Philippines', 'Peru',
       'Paraguay', 'Papua New Guinea

In [68]:
eco_foreign_invest['Country'].nunique()

249

---

🔹 Conclusions on: **Foreign direct investment, net inflows (BoP, current US$)** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 
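The clean-up steps listed above could look roughly like the following. This is a minimal sketch, not the final cleaning code: the helper name `tidy_indicator` and the target column name `'Year'` are assumptions, and the small `sample` frame only mimics the layout of `eco_foreign_invest` using the rows shown in the `head()` output.

```python
import pandas as pd

def tidy_indicator(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the planned clean-up steps to one indicator frame (hypothetical helper)."""
    return (
        df.drop(columns=["time"])                       # redundant 'YR2019'-style column
          .rename(columns={"economy": "Country Code",
                           "Country": "Country Name",
                           "Time": "Year"})             # 'Year' is an assumed target name
          .astype({"Year": int})                        # '2019' (str) becomes 2019 (int)
    )

# Stand-in with the same columns as eco_foreign_invest.
sample = pd.DataFrame({
    "economy": ["ZWE", "ZWE", "ZWE"],
    "time": ["YR2019", "YR2018", "YR2017"],
    "Country": ["Zimbabwe", "Zimbabwe", "Zimbabwe"],
    "Time": ["2019", "2018", "2017"],
    "Foreign direct investment, net inflows (BoP, current US$)": [
        249500000.0, 717865300.0, 307187700.0,
    ],
})

tidy = tidy_indicator(sample)
print(tidy.dtypes)
```

Because every indicator frame shares the same four leading columns, the same helper should apply to the other datasets unchanged.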

---

### Total tax and contribution rate (PCT of profit)
    This dataset shows the amount of taxes and mandatory contributions payable by businesses, after deductions and exemptions, as a share of commercial profits. Taxes withheld (such as personal income tax) or collected and remitted to tax authorities (such as value added taxes, sales taxes or goods and service taxes) are excluded (PCT of profit).

In [69]:
eco_total_tax.head()

Unnamed: 0,economy,time,Country,Time,Total tax and contribution rate (PCT of profit)
0,ZWE,YR2019,Zimbabwe,2019,31.6
1,ZWE,YR2018,Zimbabwe,2018,31.6
2,ZWE,YR2017,Zimbabwe,2017,31.6
3,ZWE,YR2016,Zimbabwe,2016,31.6
4,ZWE,YR2015,Zimbabwe,2015,31.6


We check the size of the dataset:

In [70]:
eco_total_tax.shape

(3397, 5)

We review the amount of null values and the type of data each column has: 

In [71]:
eco_total_tax.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3397 entries, 0 to 3396
Data columns (total 5 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   economy                                          3397 non-null   object 
 1   time                                             3397 non-null   object 
 2   Country                                          3397 non-null   object 
 3   Time                                             3397 non-null   object 
 4   Total tax and contribution rate (PCT of profit)  3397 non-null   float64
dtypes: float64(1), object(4)
memory usage: 132.8+ KB


In [72]:
eco_total_tax.describe()

Unnamed: 0,Total tax and contribution rate (PCT of profit)
count,3397.0
mean,44.989162
std,30.21871
min,7.4
25%,32.8
50%,40.6
75%,49.0
max,339.1


We check for duplicates:

In [73]:
eco_total_tax[eco_total_tax.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Total tax and contribution rate (PCT of profit)


We check it has the years needed:

In [74]:
eco_total_tax['Time'].unique()

array(['2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012',
       '2011', '2010', '2009', '2008', '2007', '2006', '2005'],
      dtype=object)

We check the number of countries and which ones it has:

In [75]:
eco_total_tax['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Turkiye', 'Tunisia', 'Trinidad and Tobago',
       'Tonga', 'Togo', 'Timor-Leste', 'Thailand', 'Tanzania',
       'Tajikistan', 'Syrian Arab Republic', 'Switzerland', 'Sweden',
       'Suriname', 'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Singapore', 'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal',
       'Saudi Arabia', 'Sao Tome and Principe', 'San Marino', 'Samoa',
       'Rwanda', 'Russian Federation', 'Romania', 'Qatar', 'Puerto Rico',
       'Portugal', 'Poland', 'Philippines', 'Peru', 'Paraguay',
       'Papua New Guinea', 'Panama', 'Palau', 'Pakistan', 'Oman',
       'Norway', 'North Macedo

In [76]:
eco_total_tax['Country'].nunique()

237

🔹 Conclusions on: **Total tax and contribution rate (PCT of profit)** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has most of the years needed (2005 to 2019 only). 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 
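The year check can be made explicit by comparing the available years with the range we want. A minimal sketch, assuming the target range is 2000 through 2021 (the range the other economy datasets return); `available_years` is typed in by hand from the `unique()` output, and in the notebook it could be built with `set(eco_total_tax['Time'])`.

```python
# Years we would like every dataset to cover (assumption: 2000 through 2021).
needed_years = {str(y) for y in range(2000, 2022)}

# Years reported by eco_total_tax['Time'].unique(): 2005 through 2019.
available_years = {str(y) for y in range(2005, 2020)}

# Set difference leaves the years that are absent from the dataset.
missing_years = sorted(needed_years - available_years)
print(missing_years)
# ['2000', '2001', '2002', '2003', '2004', '2020', '2021']
```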

---

### Time required to start a business (days)
    This dataset shows the number of calendar days needed to complete the procedures to legally operate a business. If a procedure can be sped up at additional cost, the fastest procedure is chosen (days).

In [77]:
eco_start_business.head()

Unnamed: 0,economy,time,Country,Time,Time required to start a business (days)
0,ZWE,YR2019,Zimbabwe,2019,27.0
1,ZWE,YR2018,Zimbabwe,2018,32.0
2,ZWE,YR2017,Zimbabwe,2017,61.0
3,ZWE,YR2016,Zimbabwe,2016,91.0
4,ZWE,YR2015,Zimbabwe,2015,91.0


We check the size of the dataset:

In [78]:
eco_start_business.shape

(3774, 5)

We review the amount of null values and the type of data each column has: 


In [79]:
eco_start_business.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3774 entries, 0 to 3773
Data columns (total 5 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   economy                                   3774 non-null   object 
 1   time                                      3774 non-null   object 
 2   Country                                   3774 non-null   object 
 3   Time                                      3774 non-null   object 
 4   Time required to start a business (days)  3774 non-null   float64
dtypes: float64(1), object(4)
memory usage: 147.5+ KB


In [80]:
eco_start_business.describe()

Unnamed: 0,Time required to start a business (days)
count,3774.0
mean,33.489292
std,42.958541
min,0.5
25%,12.0
50%,24.0
75%,41.0
max,697.0


We check for duplicates: 

In [81]:
eco_start_business[eco_start_business.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Time required to start a business (days)


We check it has the years needed:

In [82]:
eco_start_business['Time'].unique()

array(['2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012',
       '2011', '2010', '2009', '2008', '2007', '2006', '2005', '2004',
       '2003'], dtype=object)

We check the number of countries and which ones it has:

In [83]:
eco_start_business['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Turkiye', 'Tunisia', 'Trinidad and Tobago',
       'Tonga', 'Togo', 'Timor-Leste', 'Thailand', 'Tanzania',
       'Tajikistan', 'Syrian Arab Republic', 'Switzerland', 'Sweden',
       'Suriname', 'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Somalia', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Singapore', 'Sierra Leone', 'Seychelles',
       'Serbia', 'Senegal', 'Saudi Arabia', 'Sao Tome and Principe',
       'San Marino', 'Samoa', 'Rwanda', 'Russian Federation', 'Romania',
       'Qatar', 'Puerto Rico', 'Portugal', 'Poland', 'Philippines',
       'Peru', 'Paraguay', 'Papua New Guinea', 'Panama', 'Palau',
       'Pakistan', 'Oman', 'Norway', 'N

In [84]:
eco_start_business['Country'].nunique()

238

---

🔹 Conclusions on: **Time required to start a business (days)** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has most of the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 
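One way to trim the extra entries in 'Country' is to keep only the rows whose code appears in a list of economies of interest. The sketch below is only illustrative: `wanted_codes` is a hypothetical list, and apart from the Zimbabwe row the values in `sample` are made up.

```python
import pandas as pd

# Hypothetical list of ISO3 codes to keep; the real list depends on the project scope.
wanted_codes = ["ZWE", "ZMB"]

# Stand-in frame; 'WLD' (World) represents an entry we do not need.
sample = pd.DataFrame({
    "economy": ["ZWE", "ZMB", "WLD"],
    "Country": ["Zimbabwe", "Zambia", "World"],
    "Time": ["2019", "2019", "2019"],
    "Time required to start a business (days)": [27.0, 8.5, 20.0],  # illustrative values
})

# Keep only the rows whose code is in the wanted list.
filtered = sample[sample["economy"].isin(wanted_codes)].reset_index(drop=True)
print(filtered["Country"].nunique())
```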

---

### Exports of goods and services (PCT of GDP)
    This dataset shows the value of all goods and other market services provided to the rest of the world, as a share of GDP. Compensation of employees, investment income, and transfer payments are excluded.

In [85]:
eco_exports.head()

Unnamed: 0,economy,time,Country,Time,Exports of goods and services (PCT of GDP)
0,ZWE,YR2021,Zimbabwe,2021,25.411446
1,ZWE,YR2020,Zimbabwe,2020,25.917014
2,ZWE,YR2019,Zimbabwe,2019,27.163459
3,ZWE,YR2018,Zimbabwe,2018,26.163973
4,ZWE,YR2017,Zimbabwe,2017,19.658905


We check the size of the dataset: 

In [86]:
eco_exports.shape

(5016, 5)

We review the amount of null values and the type of data each column has: 

In [87]:
eco_exports.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5016 entries, 0 to 5015
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   economy                                     5016 non-null   object 
 1   time                                        5016 non-null   object 
 2   Country                                     5016 non-null   object 
 3   Time                                        5016 non-null   object 
 4   Exports of goods and services (PCT of GDP)  5016 non-null   float64
dtypes: float64(1), object(4)
memory usage: 196.1+ KB


In [88]:
eco_exports.describe()

Unnamed: 0,Exports of goods and services (PCT of GDP)
count,5016.0
mean,40.169382
std,29.618884
min,0.459601
25%,23.235476
50%,32.113721
75%,48.868684
max,433.836004


We check for duplicates: 

In [89]:
eco_exports[eco_exports.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Exports of goods and services (PCT of GDP)


We check it has the years needed:

In [90]:
eco_exports['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [91]:
eco_exports['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Turkmenistan',
       'Turkiye', 'Tunisia', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Somalia', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Sint Maarten (Dutch part)', 'Singapore',
       'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'San Marino', 'Samoa', 'Rwanda', 'Russian Federation', 'Romania',
       'Qatar', 'Puerto Rico', 'Portugal', 'Poland', 'Philippines',
       'Peru', 'Paraguay', 'Papua New Guinea', 'Panama', 'Pakistan',
       'Oman', 'Norway', 'Northern Mariana Islands', 'North Macedonia',
       'Nigeria', '

In [92]:
eco_exports['Country'].nunique()

243

---

🔹 Conclusions on: **Exports of goods and services (PCT of GDP)** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 

---

### Imports of goods and services (PCT of GDP)
    This dataset shows the value of all goods and other market services received from the rest of the world, as a share of GDP. Compensation of employees, investment income, and transfer payments are excluded.

In [93]:
eco_exports.head()

Unnamed: 0,economy,time,Country,Time,Exports of goods and services (PCT of GDP)
0,ZWE,YR2021,Zimbabwe,2021,25.411446
1,ZWE,YR2020,Zimbabwe,2020,25.917014
2,ZWE,YR2019,Zimbabwe,2019,27.163459
3,ZWE,YR2018,Zimbabwe,2018,26.163973
4,ZWE,YR2017,Zimbabwe,2017,19.658905


We check the size of the dataset: 

In [94]:
eco_exports.shape

(5016, 5)

We review the amount of null values and the type of data each column has:

In [95]:
eco_exports.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5016 entries, 0 to 5015
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   economy                                     5016 non-null   object 
 1   time                                        5016 non-null   object 
 2   Country                                     5016 non-null   object 
 3   Time                                        5016 non-null   object 
 4   Exports of goods and services (PCT of GDP)  5016 non-null   float64
dtypes: float64(1), object(4)
memory usage: 196.1+ KB


In [96]:
eco_exports.describe()

Unnamed: 0,Exports of goods and services (PCT of GDP)
count,5016.0
mean,40.169382
std,29.618884
min,0.459601
25%,23.235476
50%,32.113721
75%,48.868684
max,433.836004


We check for duplicates: 

In [97]:
eco_exports[eco_exports.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Exports of goods and services (PCT of GDP)


We check it has the years needed:

In [98]:
eco_exports['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [99]:
eco_exports['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Turkmenistan',
       'Turkiye', 'Tunisia', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Somalia', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Sint Maarten (Dutch part)', 'Singapore',
       'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'San Marino', 'Samoa', 'Rwanda', 'Russian Federation', 'Romania',
       'Qatar', 'Puerto Rico', 'Portugal', 'Poland', 'Philippines',
       'Peru', 'Paraguay', 'Papua New Guinea', 'Panama', 'Pakistan',
       'Oman', 'Norway', 'Northern Mariana Islands', 'North Macedonia',
       'Nigeria', '

In [100]:
eco_exports['Country'].nunique()

243

---

🔹 Conclusions on: **Imports of goods and services (PCT of GDP)** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed.
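Since every indicator frame has the same first four columns, a quick guard on the value column can confirm that a cell is inspecting the intended dataset. The helper below is a hypothetical sketch; it assumes the indicator values always sit in the last column, as they do in the frames shown in this notebook.

```python
import pandas as pd

def expect_value_column(df: pd.DataFrame, expected_name: str) -> pd.DataFrame:
    """Raise if the last column is not the indicator we think we are inspecting."""
    actual_name = df.columns[-1]  # assumption: the value column is always last
    if actual_name != expected_name:
        raise ValueError(
            f"Expected value column {expected_name!r}, found {actual_name!r}"
        )
    return df

EXPORTS = "Exports of goods and services (PCT of GDP)"
IMPORTS = "Imports of goods and services (PCT of GDP)"

# Stand-in with the layout of eco_exports (rows taken from its head() output).
sample = pd.DataFrame({
    "economy": ["ZWE", "ZWE"],
    "time": ["YR2021", "YR2020"],
    "Country": ["Zimbabwe", "Zimbabwe"],
    "Time": ["2021", "2020"],
    EXPORTS: [25.411446, 25.917014],
})

expect_value_column(sample, EXPORTS)       # passes silently
try:
    expect_value_column(sample, IMPORTS)   # wrong frame for this indicator
except ValueError as err:
    print(err)
```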

---

## General conclusions on Economy: 
    All of the economy-related datasets share the same structure. 
    They are clean enough to start working on them. 
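Because the datasets share one layout, the checks repeated in each section (size, null values, duplicates, years, countries) could be collected in a single helper. This is a sketch under that assumption; `quick_checks` is a hypothetical name, and `sample` is a small stand-in built from the `eco_exports` rows shown earlier.

```python
import pandas as pd

def quick_checks(df: pd.DataFrame) -> dict:
    """Collect the basic diagnostics for one indicator frame (hypothetical helper)."""
    return {
        "shape": df.shape,
        "null_values": int(df.isna().sum().sum()),
        # Rows that repeat an earlier (Country, Time) pair.
        "duplicate_country_years": int(
            df.duplicated(subset=["Country", "Time"]).sum()
        ),
        "years": sorted(df["Time"].unique().tolist()),
        "n_countries": int(df["Country"].nunique()),
    }

sample = pd.DataFrame({
    "economy": ["ZWE", "ZWE", "ZWE"],
    "time": ["YR2021", "YR2020", "YR2019"],
    "Country": ["Zimbabwe", "Zimbabwe", "Zimbabwe"],
    "Time": ["2021", "2020", "2019"],
    "Exports of goods and services (PCT of GDP)": [25.411446, 25.917014, 27.163459],
})

summary = quick_checks(sample)
print(summary)
```

In the notebook, the same call could be run on each frame in turn, for example `quick_checks(eco_exports)`.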

---

## ⟐ People:
•	Government expenditure on education, total (% of GDP)<br>
•	Unemployment, total (% of total labor force)<br>
•	Primary completion rate, total (PCT of relevant age group)<br>
•	Intentional homicides (per 100,000 people)<br>

We load the databases into variables:

In [4]:
# Government expenditure on education
peo_gover_exp = wb.data.DataFrame('SE.XPD.TOTL.GD.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
peo_gover_exp.rename(columns={'SE.XPD.TOTL.GD.ZS': 'Expenditure Education'}, inplace=True)

# Unemployment, total (% of total labor force)
peo_unemploy = wb.data.DataFrame('SL.UEM.TOTL.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
peo_unemploy.rename(columns={'SL.UEM.TOTL.ZS': 'Unemployment'}, inplace=True)

# Primary completion rate, total (PCT of relevant age group)
peo_primary = wb.data.DataFrame('SE.PRM.CMPT.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
peo_primary.rename(columns={'SE.PRM.CMPT.ZS': 'Primary completion rate, total (PCT of relevant age group)'}, inplace=True)

# Intentional homicides (per 100,000 people)
peo_homicides = wb.data.DataFrame('VC.IHR.PSRC.P5', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
peo_homicides.rename(columns={'VC.IHR.PSRC.P5': 'Intentional homicides (per 100,000 people)'}, inplace=True)

---

We start checking the basics of each one of them: 

### Government expenditure on education, total (% of GDP)
    This dataset shows general government expenditure on education, expressed as a percentage of GDP.

In [102]:
peo_gover_exp.head()

Unnamed: 0,economy,time,Country,Time,Expenditure Education
0,ZWE,YR2018,Zimbabwe,2018,3.86611
1,ZWE,YR2017,Zimbabwe,2017,5.81878
2,ZWE,YR2016,Zimbabwe,2016,5.47262
3,ZWE,YR2015,Zimbabwe,2015,5.81279
4,ZWE,YR2014,Zimbabwe,2014,6.13835


We check the size of the dataset:

In [103]:
peo_gover_exp.shape

(4013, 5)

We review the amount of null values and the type of data each column has: 

In [104]:
peo_gover_exp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4013 entries, 0 to 4012
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   economy                4013 non-null   object 
 1   time                   4013 non-null   object 
 2   Country                4013 non-null   object 
 3   Time                   4013 non-null   object 
 4   Expenditure Education  4013 non-null   float64
dtypes: float64(1), object(4)
memory usage: 156.9+ KB


In [105]:
peo_gover_exp.describe()

Unnamed: 0,Expenditure Education
count,4013.0
mean,4.389899
std,1.892848
min,8e-06
25%,3.19979
50%,4.1088
75%,5.160557
max,23.27


We check for duplicates:

In [106]:
peo_gover_exp[peo_gover_exp.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Expenditure Education


We check it has the years needed:

In [107]:
peo_gover_exp['Time'].unique()

array(['2018', '2017', '2016', '2015', '2014', '2013', '2012', '2010',
       '2020', '2019', '2011', '2009', '2008', '2007', '2005', '2004',
       '2000', '2001', '2006', '2003', '2002', '2021'], dtype=object)

We check the number of countries and which ones it has:

In [108]:
peo_gover_exp['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Turks and Caicos Islands', 'Turkmenistan',
       'Turkiye', 'Tunisia', 'Trinidad and Tobago', 'Tonga', 'Togo',
       'Timor-Leste', 'Thailand', 'Tanzania', 'Tajikistan',
       'Syrian Arab Republic', 'Switzerland', 'Sweden', 'Suriname',
       'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Somalia', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Singapore', 'Sierra Leone', 'Seychelles',
       'Serbia', 'Senegal', 'Saudi Arabia', 'Sao Tome and Principe',
       'San Marino', 'Samoa', 'Rwanda', 'Russian Federation', 'Romania',
       'Qatar', 'Puerto Rico', 'Portugal', 'Poland', 'Philippines',
       'Peru', 'Paraguay', 'Papua New Guinea', 'Panama

In [109]:
peo_gover_exp['Country'].nunique()

247

---

🔹 Conclusions on: **Government expenditure on education** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.

---

### Unemployment, total
    This dataset shows the share of the labor force that is without work but available for and seeking employment (% of total labor force).

In [110]:
peo_unemploy.head()

Unnamed: 0,economy,time,Country,Time,Unemployment
0,ZWE,YR2021,Zimbabwe,2021,5.174
1,ZWE,YR2020,Zimbabwe,2020,5.351
2,ZWE,YR2019,Zimbabwe,2019,4.833
3,ZWE,YR2018,Zimbabwe,2018,4.796
4,ZWE,YR2017,Zimbabwe,2017,4.785


We check the size of the dataset: 

In [111]:
peo_unemploy.shape

(5170, 5)

We review the amount of null values and the type of data each column has: 

In [112]:
peo_unemploy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170 entries, 0 to 5169
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   economy       5170 non-null   object 
 1   time          5170 non-null   object 
 2   Country       5170 non-null   object 
 3   Time          5170 non-null   object 
 4   Unemployment  5170 non-null   float64
dtypes: float64(1), object(4)
memory usage: 202.1+ KB


In [113]:
peo_unemploy.describe()

Unnamed: 0,Unemployment
count,5170.0
mean,7.942371
std,5.685855
min,0.1
25%,4.23175
50%,6.286844
75%,10.23628
max,37.25


We check for duplicates:

In [114]:
peo_unemploy[peo_unemploy.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Unemployment


We check it has the years needed:

In [115]:
peo_unemploy['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [116]:
peo_unemploy['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Turkmenistan',
       'Turkiye', 'Tunisia', 'Trinidad and Tobago', 'Tonga', 'Togo',
       'Timor-Leste', 'Thailand', 'Tanzania', 'Tajikistan',
       'Syrian Arab Republic', 'Switzerland', 'Sweden', 'Suriname',
       'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa', 'Somalia',
       'Solomon Islands', 'Slovenia', 'Slovak Republic', 'Singapore',
       'Sierra Leone', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'Samoa', 'Rwanda', 'Russian Federation',
       'Romania', 'Qatar', 'Puerto Rico', 'Portugal', 'Poland',
       'Philippines', 'Peru', 'Paraguay', 'Papua New Guinea', 'Panama',
       'Pakistan', 'Oman', 'Norway', 'North Macedonia', 'N

In [117]:
peo_unemploy['Country'].nunique()

235

---

🔹 Conclusions on: **Unemployment total** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.

---

### Primary completion rate, total (PCT of relevant age group)
    This dataset shows the number of new entrants (enrollments minus repeaters) in the last grade of primary education, regardless of age, divided by the population at the entrance age for the last grade of primary education (PCT of relevant age group).

In [118]:
peo_primary.head()

Unnamed: 0,economy,time,Country,Time,"Primary completion rate, total (PCT of relevant age group)"
0,ZWE,YR2021,Zimbabwe,2021,84.818619
1,ZWE,YR2020,Zimbabwe,2020,90.017349
2,ZWE,YR2019,Zimbabwe,2019,88.508812
3,ZWE,YR2018,Zimbabwe,2018,92.195152
4,ZWE,YR2017,Zimbabwe,2017,95.476372


We check the size of the dataset: 

In [119]:
peo_primary.shape

(3813, 5)

We review the amount of null values and the type of data each column has:

In [120]:
peo_primary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3813 entries, 0 to 3812
Data columns (total 5 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   economy                                                     3813 non-null   object 
 1   time                                                        3813 non-null   object 
 2   Country                                                     3813 non-null   object 
 3   Time                                                        3813 non-null   object 
 4   Primary completion rate, total (PCT of relevant age group)  3813 non-null   float64
dtypes: float64(1), object(4)
memory usage: 149.1+ KB


In [121]:
peo_primary.describe()

Unnamed: 0,"Primary completion rate, total (PCT of relevant age group)"
count,3813.0
mean,88.132224
std,17.701888
min,16.57523
25%,80.376633
50%,94.53154
75%,99.33078
max,134.542511


We check for duplicates: 

In [122]:
peo_primary[peo_primary.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,"Primary completion rate, total (PCT of relevant age group)"


We check it has the years needed:

In [123]:
peo_primary['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2003', '2002', '2001', '2010', '2009', '2008',
       '2007', '2006', '2005', '2004', '2000', '2011'], dtype=object)

We check the number of countries and which ones it has:


In [124]:
peo_primary['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Tuvalu', 'Turks and Caicos Islands',
       'Turkiye', 'Tunisia', 'Trinidad and Tobago', 'Tonga', 'Togo',
       'Timor-Leste', 'Thailand', 'Tanzania', 'Tajikistan',
       'Syrian Arab Republic', 'Switzerland', 'Sweden', 'Suriname',
       'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Singapore', 'Sierra Leone',
       'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'San Marino', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Qatar', 'Portugal', 'Poland',
       'Philippines', 'Peru', 'Paraguay', 'Papua New Guinea', 'Panama',


In [125]:
peo_primary['Country'].nunique()

239

---

🔹 Conclusions on: **Primary completion rate, total (PCT of relevant age group)** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 

---

### Intentional homicides (per 100,000 people)
    This dataset shows unlawful homicides purposely inflicted as a result of domestic disputes, interpersonal violence, violent conflicts over land resources, intergang violence over turf or control, and predatory violence and killing by armed groups (per 100,000 people).

In [126]:
peo_homicides.head()

Unnamed: 0,economy,time,Country,Time,"Intentional homicides (per 100,000 people)"
0,ZWE,YR2012,Zimbabwe,2012,7.4799
1,ZWE,YR2010,Zimbabwe,2010,5.599427
2,ZWE,YR2006,Zimbabwe,2006,8.819056
3,ZWE,YR2005,Zimbabwe,2005,11.178553
4,ZWE,YR2004,Zimbabwe,2004,11.281282


We check the size of the dataset: 

In [127]:
peo_homicides.shape

(2909, 5)

We review the amount of null values and the type of data each column has: 

In [128]:
peo_homicides.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2909 entries, 0 to 2908
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   economy                                     2909 non-null   object 
 1   time                                        2909 non-null   object 
 2   Country                                     2909 non-null   object 
 3   Time                                        2909 non-null   object 
 4   Intentional homicides (per 100,000 people)  2909 non-null   float64
dtypes: float64(1), object(4)
memory usage: 113.8+ KB


In [129]:
peo_homicides.describe()

Unnamed: 0,"Intentional homicides (per 100,000 people)"
count,2909.0
mean,8.124007
std,11.924528
min,0.0
25%,1.287639
50%,3.334389
75%,9.177223
max,105.231188


We check for duplicates:

In [5]:
peo_homicides[peo_homicides.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,"Intentional homicides (per 100,000 people)"


We check it has the years needed:

In [6]:
peo_homicides['Time'].unique()

array(['2012', '2010', '2006', '2005', '2004', '2003', '2002', '2001',
       '2015', '2014', '2013', '2011', '2009', '2008', '2000', '2020',
       '2019', '2018', '2017', '2016', '2007'], dtype=object)

We check the number of countries and which ones it has:

In [7]:
peo_homicides['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Tuvalu',
       'Turks and Caicos Islands', 'Turkmenistan', 'Turkiye', 'Tunisia',
       'Trinidad and Tobago', 'Tonga', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Vincent and the Grenadines',
       'St. Martin (French part)', 'St. Lucia', 'St. Kitts and Nevis',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa',
       'Solomon Islands', 'Slovenia', 'Slovak Republic', 'Singapore',
       'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'San Marino', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Qatar', 'Puerto Rico',
       'Portugal', 'Poland', 'Philippines', 'Pe

In [8]:
peo_homicides['Country'].nunique()

241

---

## General conclusions on People: 
    All of the people-related datasets share the same structure. 
    They are clean enough to start working on them. 

---

## ⟐ Environment:
•	Access to electricity (% of population)<br>
•	People using at least basic sanitation services in urban areas <br>
•	Population Density (people per sq. km of land area)<br>

We load the databases into variables:

In [9]:
# Access to electricity (% of population)
env_elect = wb.data.DataFrame('EG.ELC.ACCS.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
env_elect.rename(columns={'EG.ELC.ACCS.ZS': 'Access Elect.'}, inplace=True)

# People using at least basic sanitation services in urban areas
env_basic_sanitation = wb.data.DataFrame('SH.STA.BASS.UR.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
env_basic_sanitation.rename(columns={'SH.STA.BASS.UR.ZS': 'Basic Sanitation'}, inplace=True)

# Population Density
data_url = 'https://datahub.io/world-bank/en.pop.dnst/datapackage.json'
package = datapackage.Package(data_url)
resources = package.resources
for resource in resources:
    if resource.tabular:
        env_pop_density = pd.read_csv(resource.descriptor['path'])

---

We start checking the basics of each one of them: 

### Access to electricity
    This dataset shows the percentage of the population with access to electricity. Electrification data is collected from industry, national surveys, and international sources.

In [10]:
env_elect.head()

Unnamed: 0,economy,time,Country,Time,Access Elect.
0,ZWE,YR2020,Zimbabwe,2020,52.747669
1,ZWE,YR2019,Zimbabwe,2019,46.781475
2,ZWE,YR2018,Zimbabwe,2018,45.572647
3,ZWE,YR2017,Zimbabwe,2017,44.178635
4,ZWE,YR2016,Zimbabwe,2016,42.561729


We check the size of the dataset:

In [11]:
env_elect.shape

(5507, 5)

We review the amount of null values and the type of data each column has: 

In [12]:
env_elect.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5507 entries, 0 to 5506
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   economy        5507 non-null   object 
 1   time           5507 non-null   object 
 2   Country        5507 non-null   object 
 3   Time           5507 non-null   object 
 4   Access Elect.  5507 non-null   float64
dtypes: float64(1), object(4)
memory usage: 215.2+ KB


In [13]:
env_elect.describe()

Unnamed: 0,Access Elect.
count,5507.0
mean,79.986053
std,28.787049
min,0.643132
25%,64.452004
50%,97.722672
75%,100.0
max,100.0


We check for duplicates: 

In [14]:
env_elect[env_elect.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Access Elect.


We check it has the years needed:

In [16]:
env_elect['Time'].unique()

array(['2020', '2019', '2018', '2017', '2016', '2015', '2014', '2013',
       '2012', '2011', '2010', '2009', '2008', '2007', '2006', '2005',
       '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [17]:
env_elect['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Tuvalu',
       'Turks and Caicos Islands', 'Turkmenistan', 'Turkiye', 'Tunisia',
       'Trinidad and Tobago', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Vincent and the Grenadines',
       'St. Martin (French part)', 'St. Lucia', 'St. Kitts and Nevis',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa', 'Somalia',
       'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Singapore', 'Sierra Leone',
       'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'San Marino', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Qatar', 'Puerto

In [18]:
env_elect['Country'].nunique()

264

---

🔹 Conclusions on: **Access to electricity** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
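
The cleanup steps listed above could be sketched as follows. This is only a sketch on a two-row sample copied from the `head()` output, not the final implementation: the function name `tidy_wb_frame` is hypothetical, and 'Year' as the new name for 'Time' is our assumption (chosen because the datahub datasets already use it).

```python
import pandas as pd

def tidy_wb_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the redundant 'time' column, rename the identifier columns and cast the year to int."""
    return (
        df.drop(columns=["time"])
          .rename(columns={
              "economy": "Country Code",
              "Country": "Country Name",
              "Time": "Year",  # assumed target name
          })
          .astype({"Year": int})
    )

# Two rows copied from env_elect.head() above.
sample = pd.DataFrame({
    "economy": ["ZWE", "ZWE"],
    "time": ["YR2020", "YR2019"],
    "Country": ["Zimbabwe", "Zimbabwe"],
    "Time": ["2020", "2019"],
    "Access Elect.": [52.747669, 46.781475],
})

tidy = tidy_wb_frame(sample)
print(tidy.dtypes)
```

Because the function does not depend on the indicator column, the same call should work for any dataset loaded through `wb.data.DataFrame` with the arguments used in this notebook.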

---

### Basic sanitation in Urban areas
    This dataset shows the percentage of the urban population using improved sanitation facilities that are not shared with other households. The indicator covers both people using basic sanitation services and those using safely managed sanitation services.

In [19]:
env_basic_sanitation.head()

Unnamed: 0,economy,time,Country,Time,Basic Sanitation
0,ZWE,YR2020,Zimbabwe,2020,41.829436
1,ZWE,YR2019,Zimbabwe,2019,43.223952
2,ZWE,YR2018,Zimbabwe,2018,44.613054
3,ZWE,YR2017,Zimbabwe,2017,45.996743
4,ZWE,YR2016,Zimbabwe,2016,47.375018


We check the size of the dataset:

In [20]:
env_basic_sanitation.shape

(4659, 5)

We review the amount of null values and the type of data each column has: 

In [21]:
env_basic_sanitation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4659 entries, 0 to 4658
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   economy           4659 non-null   object 
 1   time              4659 non-null   object 
 2   Country           4659 non-null   object 
 3   Time              4659 non-null   object 
 4   Basic Sanitation  4659 non-null   float64
dtypes: float64(1), object(4)
memory usage: 182.1+ KB


In [22]:
env_basic_sanitation.describe()

Unnamed: 0,Basic Sanitation
count,4659.0
mean,76.552443
std,24.638195
min,9.568371
25%,57.140528
50%,87.068263
75%,97.5
max,100.0


We check for duplicates: 

In [23]:
env_basic_sanitation[env_basic_sanitation.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Basic Sanitation


We check it has the years needed:

In [24]:
env_basic_sanitation['Time'].unique()

array(['2020', '2019', '2018', '2017', '2016', '2015', '2014', '2013',
       '2012', '2011', '2010', '2009', '2008', '2007', '2006', '2005',
       '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [25]:
env_basic_sanitation['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Vanuatu', 'Uzbekistan', 'Uruguay', 'United States',
       'United Kingdom', 'Ukraine', 'Uganda', 'Tuvalu', 'Turkmenistan',
       'Turkiye', 'Tunisia', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Martin (French part)',
       'St. Lucia', 'Sri Lanka', 'Spain', 'South Sudan', 'South Africa',
       'Somalia', 'Solomon Islands', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Singapore', 'Sierra Leone', 'Serbia',
       'Senegal', 'Sao Tome and Principe', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Portugal', 'Poland',
       'Philippines', 'Peru', 'Paraguay', 'Papua New Guinea', 'Panama',
       'Palau', 'Pakistan', 'Oman', 'Norway', 'North Macedonia',
       'Nigeria', 'Niger', 'Nicaragua', 'New Zealand', 'Netherlands',
       'Nepal', 'Nauru', 'Namibia', 'Myanmar'

In [26]:
env_basic_sanitation['Country'].nunique()

224

---

🔹 Conclusions on: **Basic sanitation in Urban areas** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.

---

### Population Density
    This dataset shows the number of people per square kilometer of land area (people per sq. km of land area).

In [27]:
env_pop_density.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1961,6.976239
1,Arab World,ARB,1962,7.169853
2,Arab World,ARB,1963,7.370144
3,Arab World,ARB,1964,7.577779
4,Arab World,ARB,1965,7.793214


We check the size of the dataset:

In [28]:
env_pop_density.shape

(14339, 4)

We review the amount of null values and the type of data each column has: 

In [29]:
env_pop_density.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14339 entries, 0 to 14338
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  14339 non-null  object 
 1   Country Code  14339 non-null  object 
 2   Year          14339 non-null  int64  
 3   Value         14339 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 448.2+ KB


In [30]:
env_pop_density.describe()

Unnamed: 0,Year,Value
count,14339.0,14339.0
mean,1988.751098,276.128503
std,16.182033,1432.188184
min,1961.0,0.098625
25%,1975.0,18.969471
50%,1989.0,50.23171
75%,2003.0,124.867467
max,2016.0,21398.95


We check for duplicates: 

In [31]:
env_pop_density[env_pop_density.duplicated(subset=['Country Code', 'Year'])]

Unnamed: 0,Country Name,Country Code,Year,Value


We check it has the years needed:

In [32]:
env_pop_density['Year'].unique()

array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982,
       1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
       2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015,
       2016], dtype=int64)

We check the number of countries and which ones it has:

In [34]:
env_pop_density['Country Name'].unique()

array(['Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)',
       'Europe & Central Asia (IDA & IBRD countries)', 'European Union',
       'Fragile and conflict affected situations',
       'Heavily indebted poor countries (HIPC)', 'High income',
       'IBRD only', 'IDA & IBRD total', 'IDA blend', 'IDA only',
       'IDA total', 'Late-demographic dividend',
       'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Latin America & the Caribbean (IDA & IBRD countries)',
       'Least developed countries: UN classification',
       'Low & middle income', 'Low income', 'Lower middle income',
       'Middle East & North Africa',
       'Middle East & No

In [36]:
env_pop_density['Country Name'].nunique()

262

---

🔹 Conclusions on: **Population Density** <br>
* Does not contain duplicates. 
* Does not contain null values. 
* It covers only part of the period we need: the data ends in 2016. 
* It has years we do not need (1961 to 1999), which will be dropped. 
* The column 'Value' needs to be renamed for better handling.
* We will consider reordering the columns to match the order in the previous datasets.
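
A possible way to apply these steps is sketched below. The sample is illustrative: the 1961 value comes from the `head()` output, while the 2000 and 2016 values are placeholders and not real data. The new column name 'Population Density' and the constant `FIRST_YEAR` are also our assumptions.

```python
import pandas as pd

FIRST_YEAR = 2000  # assumed start of the study period, matching range(2000, year) used elsewhere

sample = pd.DataFrame({
    "Country Name": ["Arab World", "Arab World", "Arab World"],
    "Country Code": ["ARB", "ARB", "ARB"],
    "Year": [1961, 2000, 2016],
    "Value": [6.976239, 10.0, 12.0],  # last two values are placeholders
})

clean = (
    sample[sample["Year"] >= FIRST_YEAR]               # drop the years before the study period
    .rename(columns={"Value": "Population Density"})   # give the value column a descriptive name
    [["Country Code", "Country Name", "Year", "Population Density"]]  # code first, as in the wbgapi frames
    .reset_index(drop=True)
)
print(clean)
```

The filter cannot add the years after 2016, so this dataset would still end earlier than the ones loaded through wbgapi.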

---

## General conclusions on Environment: 
    Two out of the three datasets for this category need the same treatment. 
    The third dataset needs different treatment. 
    We have enough data for the years we need. 
    They are clean enough to start working on them. 

---

## ⟐ Poverty:
•	Maternal mortality ratio<br>
•	Refugee population by country or territory of origin<br>
•	Refugee population by country or territory of asylum<br>

We load the databases into variables:

In [38]:
# Maternal mortality ratio
data_url = 'https://datahub.io/world-bank/sh.sta.mmrt/datapackage.json'     # URL of the Data Package descriptor
package = datapackage.Package(data_url)                                     # Load the Data Package descriptor
resources = package.resources
for resource in resources:
    if resource.tabular:                                                    # Keep only tabular resources
        pov_maternal_mortality = pd.read_csv(resource.descriptor['path'])   # Read the CSV into a DataFrame

# Refugee population by country or territory of origin
pov_refugee_origin = wb.data.DataFrame('SM.POP.REFG.OR', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
pov_refugee_origin.rename(columns={'SM.POP.REFG.OR': 'Refugee population by country or territory of origin'}, inplace=True)

# Refugee population by country or territory of asylum
pov_refugee_asylum = wb.data.DataFrame('SM.POP.REFG', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
pov_refugee_asylum.rename(columns={'SM.POP.REFG': 'Refugee population by country or territory of asylum'}, inplace=True)

---

We start checking the basics of each one of them: 

### Maternal mortality ratio
    This dataset shows the number of women who die from pregnancy-related causes while pregnant or within 42 days of pregnancy termination, per 100,000 live births. The data are estimated with a regression model using information on the proportion of maternal deaths among non-AIDS deaths in women ages 15-49, fertility, birth attendants, and GDP.

In [39]:
pov_maternal_mortality.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1990,289
1,Arab World,ARB,1991,285
2,Arab World,ARB,1992,281
3,Arab World,ARB,1993,278
4,Arab World,ARB,1994,274


We check the size of the dataset:

In [40]:
pov_maternal_mortality.shape

(5954, 4)

We review the amount of null values and the type of data each column has: 

In [41]:
pov_maternal_mortality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5954 entries, 0 to 5953
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Country Name  5954 non-null   object
 1   Country Code  5954 non-null   object
 2   Year          5954 non-null   int64 
 3   Value         5954 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 186.2+ KB


In [42]:
pov_maternal_mortality.describe()

Unnamed: 0,Year,Value
count,5954.0,5954.0
mean,2002.5,259.020658
std,7.50063,343.795511
min,1990.0,3.0
25%,1996.0,25.0
50%,2002.5,92.0
75%,2009.0,408.0
max,2015.0,2900.0


We check for duplicates: 

In [43]:
pov_maternal_mortality[pov_maternal_mortality.duplicated(subset=['Country Code', 'Year'])]

Unnamed: 0,Country Name,Country Code,Year,Value


We check it has the years needed:

In [44]:
pov_maternal_mortality['Year'].unique()

array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015], dtype=int64)

We check the number of countries and which ones it has:

In [46]:
pov_maternal_mortality['Country Name'].unique()

array(['Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)',
       'Europe & Central Asia (IDA & IBRD countries)', 'European Union',
       'Fragile and conflict affected situations',
       'Heavily indebted poor countries (HIPC)', 'High income',
       'IBRD only', 'IDA & IBRD total', 'IDA blend', 'IDA only',
       'IDA total', 'Late-demographic dividend',
       'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Latin America & the Caribbean (IDA & IBRD countries)',
       'Least developed countries: UN classification',
       'Low & middle income', 'Low income', 'Lower middle income',
       'Middle East & North Africa',
       'Middle East & No

In [47]:
pov_maternal_mortality['Country Name'].nunique()

229

---

🔹 Conclusions on: **Maternal mortality ratio** <br>
* Does not contain duplicates. 
* Does not contain null values.
* We will have to remove the rows which refer to a year prior to 2000. 
* The data only goes up to the year 2015.

---

### Refugee population by country or territory of origin
    Country of origin generally refers to the nationality or country of citizenship of a claimant.

In [48]:
pov_refugee_origin.head()

Unnamed: 0,economy,time,Country,Time,Refugee population by country or territory of origin
0,ZWE,YR2021,Zimbabwe,2021,8115.0
1,ZWE,YR2020,Zimbabwe,2020,8575.0
2,ZWE,YR2019,Zimbabwe,2019,10045.0
3,ZWE,YR2018,Zimbabwe,2018,15618.0
4,ZWE,YR2017,Zimbabwe,2017,17420.0


We check the size of the dataset:

In [49]:
pov_refugee_origin.shape

(5037, 5)

We review the amount of null values and the type of data each column has:

In [50]:
pov_refugee_origin.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5037 entries, 0 to 5036
Data columns (total 5 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   economy                                               5037 non-null   object 
 1   time                                                  5037 non-null   object 
 2   Country                                               5037 non-null   object 
 3   Time                                                  5037 non-null   object 
 4   Refugee population by country or territory of origin  5037 non-null   float64
dtypes: float64(1), object(4)
memory usage: 196.9+ KB


In [51]:
pov_refugee_origin.describe()

Unnamed: 0,Refugee population by country or territory of origin
count,5037.0
mean,789034.2
std,2570647.0
min,5.0
25%,191.0
50%,3376.0
75%,117687.0
max,27119820.0


We check for duplicates:

In [52]:
pov_refugee_origin[pov_refugee_origin.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Refugee population by country or territory of origin


We check it has the years needed:

In [53]:
pov_refugee_origin['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [54]:
pov_refugee_origin['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Tuvalu', 'Turks and Caicos Islands',
       'Turkmenistan', 'Turkiye', 'Tunisia', 'Trinidad and Tobago',
       'Tonga', 'Togo', 'Timor-Leste', 'Thailand', 'Tanzania',
       'Tajikistan', 'Syrian Arab Republic', 'Switzerland', 'Sweden',
       'Suriname', 'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'St. Kitts and Nevis', 'Sri Lanka', 'Spain', 'South Sudan',
       'South Africa', 'Somalia', 'Solomon Islands', 'Slovenia',
       'Slovak Republic', 'Singapore', 'Sierra Leone', 'Seychelles',
       'Serbia', 'Senegal', 'Saudi Arabia', 'Sao Tome and Principe',
       'San Marino', 'Samoa', 'Rwanda', 'Russian Federation', 'Romania',
       'Qatar', 'Portugal', 'Poland', 'Philippines', 'Peru', 'Paraguay',
       'Papua New Guinea', 'Panama', 'Pakistan', '

In [55]:
pov_refugee_origin['Country'].nunique()

244

---

🔹 Conclusions on: **Refugee population by country or territory of origin** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 
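
One plausible source of the surplus entries is that the World Bank data also includes regional and income-group aggregates next to individual countries. Assuming that is the case here, they could be filtered out by code, as in the sketch below. The set `AGGREGATE_CODES` is a hypothetical and deliberately incomplete list, and the sample rows are illustrative only.

```python
import pandas as pd

# Hypothetical, incomplete set of aggregate codes (Arab World, European Union, World).
AGGREGATE_CODES = {"ARB", "EUU", "WLD"}

# Illustrative rows shaped like the wbgapi output.
sample = pd.DataFrame({
    "economy": ["ZWE", "ARB", "WLD", "ZMB"],
    "Country": ["Zimbabwe", "Arab World", "World", "Zambia"],
    "Time": ["2021", "2021", "2021", "2021"],
})

# Keep only the rows whose code is not a known aggregate.
countries_only = sample[~sample["economy"].isin(AGGREGATE_CODES)]
print(countries_only["Country"].nunique())
```

A complete list of aggregate codes would have to come from the World Bank metadata; wbgapi's economy metadata may provide it, but that should be checked against its documentation before relying on it.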

---

### Refugee population by country or territory of asylum
    Country of asylum is the country where an asylum claim was filed and granted.

In [56]:
pov_refugee_asylum.head()

Unnamed: 0,economy,time,Country,Time,Refugee population by country or territory of asylum
0,ZWE,YR2021,Zimbabwe,2021,9483.0
1,ZWE,YR2020,Zimbabwe,2020,9261.0
2,ZWE,YR2019,Zimbabwe,2019,8956.0
3,ZWE,YR2018,Zimbabwe,2018,7795.0
4,ZWE,YR2017,Zimbabwe,2017,7566.0


We check the size of the dataset: 

In [57]:
pov_refugee_asylum.shape

(4505, 5)

We review the amount of null values and the type of data each column has:

In [58]:
pov_refugee_asylum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4505 entries, 0 to 4504
Data columns (total 5 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   economy                                               4505 non-null   object 
 1   time                                                  4505 non-null   object 
 2   Country                                               4505 non-null   object 
 3   Time                                                  4505 non-null   object 
 4   Refugee population by country or territory of asylum  4505 non-null   float64
dtypes: float64(1), object(4)
memory usage: 176.1+ KB


In [59]:
pov_refugee_asylum.describe()

Unnamed: 0,Refugee population by country or territory of asylum
count,4505.0
mean,1037694.0
std,2785809.0
min,5.0
25%,1076.0
50%,17883.0
75%,366491.0
max,27119820.0


We check for duplicates:

In [60]:
pov_refugee_asylum[pov_refugee_asylum.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Refugee population by country or territory of asylum


We check it has the years needed:

In [61]:
pov_refugee_asylum['Time'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2001', '2000'], dtype=object)

We check the number of countries and which ones it has:

In [62]:
pov_refugee_asylum['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Vietnam', 'Venezuela, RB', 'Vanuatu', 'Uzbekistan', 'Uruguay',
       'United States', 'United Kingdom', 'United Arab Emirates',
       'Ukraine', 'Uganda', 'Turks and Caicos Islands', 'Turkmenistan',
       'Turkiye', 'Tunisia', 'Trinidad and Tobago', 'Togo', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Lucia', 'St. Kitts and Nevis',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa', 'Somalia',
       'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Singapore', 'Sierra Leone', 'Serbia',
       'Senegal', 'Saudi Arabia', 'Samoa', 'Rwanda', 'Russian Federation',
       'Romania', 'Qatar', 'Portugal', 'Poland', 'Philippines', 'Peru',
       'Paraguay', 'Papua New Guinea', 'Panama', 'Palau', 'Pakistan',
       'Oman', 'Norway', 'North Macedonia', 'Nigeria', 'Niger',
       'Nicaragua', 'Ne

In [63]:
pov_refugee_asylum['Country'].nunique()

228

---

🔹 Conclusions on: **Refugee population by country or territory of asylum** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed. 

---

## General conclusions on Poverty: 
    Two out of the three datasets for this category need the same treatment. 
    The third dataset needs different treatment. 
    We have enough data for the years we need. 
    They are clean enough to start working on them.  

---

## ⟐ States:
• Mobile cellular subscriptions <br>
• Population Total <br>
• Research and development expenditure (PCT of GDP) <br>
• Labour force, total <br>
• GDP per capita

We load the databases into variables:

In [64]:
# Mobile Subscription
sta_mobile = wb.data.DataFrame('IT.CEL.SETS.P2', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
sta_mobile.rename(columns={'IT.CEL.SETS.P2': 'Mobile Subs.'}, inplace=True)

# Population Total
sta_population = wb.data.DataFrame('SP.POP.TOTL', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
sta_population.rename(columns={'SP.POP.TOTL': 'Population Total'}, inplace=True)

# Research and development expenditure (PCT of GDP)
sta_research = wb.data.DataFrame('GB.XPD.RSDV.GD.ZS', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
sta_research.rename(columns={'GB.XPD.RSDV.GD.ZS': 'Research and development expenditure (PCT of GDP)'}, inplace=True)

# Labour force, total
sta_labour = wb.data.DataFrame('SL.TLF.TOTL.IN', labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
sta_labour.rename(columns={'SL.TLF.TOTL.IN': 'Labour force, total'}, inplace=True)

# GDP per capita
data_url = 'https://datahub.io/world-bank/ny.gdp.pcap.pp.cd/datapackage.json'   # URL of the Data Package descriptor
package = datapackage.Package(data_url)                                         # Load the Data Package descriptor
resources = package.resources
for resource in resources:
    if resource.tabular:                                                        # Keep only tabular resources
        sta_gdp_percapita = pd.read_csv(resource.descriptor['path'])            # Read the CSV into a DataFrame

---

We start checking the basics of each one of them:

### Mobile Subscription
     This dataset refers to subscriptions to a public mobile telephone service that provide access to the PSTN using cellular technology. The indicator includes (and is split into) the number of postpaid subscriptions, and the number of active prepaid accounts (i.e. that have been used during the last three months). The indicator applies to all mobile cellular subscriptions that offer voice communications. It excludes subscriptions via data cards or USB modems, subscriptions to public mobile data services, private trunked mobile radio, telepoint, radio paging and telemetry services. 

In [65]:
sta_mobile.head()

Unnamed: 0,economy,time,Country,Time,Mobile Subs.
0,ZWE,YR2021,Zimbabwe,2021,89.146019
1,ZWE,YR2020,Zimbabwe,2020,84.186274
2,ZWE,YR2019,Zimbabwe,2019,85.940989
3,ZWE,YR2018,Zimbabwe,2018,85.761588
4,ZWE,YR2017,Zimbabwe,2017,95.532557


We check the size of the dataset:

In [66]:
sta_mobile.shape

(5491, 5)


We review the amount of null values and the type of data each column has: 


In [67]:
sta_mobile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5491 entries, 0 to 5490
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   economy       5491 non-null   object 
 1   time          5491 non-null   object 
 2   Country       5491 non-null   object 
 3   Time          5491 non-null   object 
 4   Mobile Subs.  5491 non-null   float64
dtypes: float64(1), object(4)
memory usage: 214.6+ KB


In [68]:
sta_mobile.describe()

Unnamed: 0,Mobile Subs.
count,5491.0
mean,75.106348
std,49.260799
min,0.0
25%,31.792328
50%,79.196269
75%,112.055832
max,420.853098



We check for duplicates: 


In [69]:
sta_mobile[sta_mobile.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Mobile Subs.


We check it has the years needed:

In [72]:
sorted_years = np.sort(sta_mobile['Time'].unique())
print(sorted_years)

['2000' '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009'
 '2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017' '2018' '2019'
 '2020' '2021']


We check the number of countries and which ones it has:

In [73]:
sta_mobile['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Tuvalu',
       'Turks and Caicos Islands', 'Turkmenistan', 'Turkiye', 'Tunisia',
       'Trinidad and Tobago', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Vincent and the Grenadines',
       'St. Lucia', 'St. Kitts and Nevis', 'Sri Lanka', 'Spain',
       'South Sudan', 'South Africa', 'Somalia', 'Solomon Islands',
       'Slovenia', 'Slovak Republic', 'Sint Maarten (Dutch part)',
       'Singapore', 'Sierra Leone', 'Seychelles', 'Serbia', 'Senegal',
       'Saudi Arabia', 'Sao Tome and Principe', 'San Marino', 'Samoa',
       'Rwanda', 'Russian Federation', 'Romania', 'Qatar', 'Puerto Rico',
       'Portugal', 'Poland'

In [74]:
sta_mobile['Country'].nunique()

262

---

🔹 Conclusions on: **Mobile Subscription** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed (2000 to 2021). 
* Too small, will not be used. 

---

### Population Total
    This dataset is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship.

In [75]:
sta_population.head()

Unnamed: 0,economy,time,Country,Time,Population Total
0,ZWE,YR2021,Zimbabwe,2021,15993524.0
1,ZWE,YR2020,Zimbabwe,2020,15669666.0
2,ZWE,YR2019,Zimbabwe,2019,15354608.0
3,ZWE,YR2018,Zimbabwe,2018,15052184.0
4,ZWE,YR2017,Zimbabwe,2017,14751101.0


We check the size of the dataset:


In [76]:
sta_population.shape

(5830, 5)


We review the amount of null values and the type of data each column has: 


In [77]:
sta_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5830 entries, 0 to 5829
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   economy           5830 non-null   object 
 1   time              5830 non-null   object 
 2   Country           5830 non-null   object 
 3   Time              5830 non-null   object 
 4   Population Total  5830 non-null   float64
dtypes: float64(1), object(4)
memory usage: 227.9+ KB


In [78]:
sta_population.describe()

Unnamed: 0,Population Total
count,5830.0
mean,283639600.0
std,893110600.0
min,9609.0
25%,1450669.0
50%,9511978.0
75%,59237980.0
max,7888409000.0


We check for duplicates: 


In [79]:
sta_population[sta_population.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Population Total


We check it has the years needed:

In [80]:
sorted_years = np.sort(sta_population['Time'].unique())
print(sorted_years)

['2000' '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009'
 '2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017' '2018' '2019'
 '2020' '2021']


We check the number of countries and which ones it has:

In [81]:
sta_population['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Tuvalu',
       'Turks and Caicos Islands', 'Turkmenistan', 'Turkiye', 'Tunisia',
       'Trinidad and Tobago', 'Tonga', 'Togo', 'Timor-Leste', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Suriname', 'Sudan', 'St. Vincent and the Grenadines',
       'St. Martin (French part)', 'St. Lucia', 'St. Kitts and Nevis',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa', 'Somalia',
       'Solomon Islands', 'Slovenia', 'Slovak Republic',
       'Sint Maarten (Dutch part)', 'Singapore', 'Sierra Leone',
       'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'San Marino', 'Samoa', 'Rwanda',
       'Russian Federation', 'Romania', 'Qatar', 'Puerto

In [82]:
sta_population['Country'].nunique()

265

---

🔹 Conclusions on: **Population Total** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
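
The cleanup steps listed in these conclusions can be sketched as a small helper. This is a minimal sketch, not the notebook's actual cleaning code: the function name `clean_wb_frame` and the new column name `Year` are assumptions (the text only says 'Time' will be renamed), and the toy frame stands in for `sta_population`.

```python
import pandas as pd

def clean_wb_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the column drop, renames and type change described above."""
    out = df.drop(columns=["time"])  # redundant 'YR2017'-style column
    out = out.rename(columns={
        "economy": "Country Code",
        "Country": "Country Name",
        "Time": "Year",  # assumed target name
    })
    out["Year"] = out["Year"].astype(int)  # '2017' (str) -> 2017 (int)
    return out

# Toy rows shaped like sta_population.head(); the 2016 value is a placeholder
sample = pd.DataFrame({
    "economy": ["ZWE", "ZWE"],
    "time": ["YR2017", "YR2016"],
    "Country": ["Zimbabwe", "Zimbabwe"],
    "Time": ["2017", "2016"],
    "Population Total": [14751101.0, 1.0],
})
cleaned = clean_wb_frame(sample)
print(cleaned.dtypes)
```

Because the same three renames apply to several datasets, a single helper of this kind could be reused for each of them.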

---

### Research and development expenditure (PCT of GDP)
    This dataset shows gross domestic expenditure on research and development as a percentage of GDP. It includes capital and current expenditure in four main sectors: business enterprise, government, higher education and private non-profit. It covers basic research, applied research, and experimental development.

In [83]:
sta_research.head()

Unnamed: 0,economy,time,Country,Time,Research and development expenditure (PCT of GDP)
0,ZMB,YR2008,Zambia,2008,0.27819
1,ZMB,YR2005,Zambia,2005,0.02493
2,ZMB,YR2004,Zambia,2004,0.02223
3,ZMB,YR2003,Zambia,2003,0.00847
4,ZMB,YR2002,Zambia,2002,0.00544


We check the size of the dataset:


In [84]:
sta_research.shape

(2449, 5)

We review the number of null values and the data type of each column: 


In [85]:
sta_research.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2449 entries, 0 to 2448
Data columns (total 5 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   economy                                            2449 non-null   object 
 1   time                                               2449 non-null   object 
 2   Country                                            2449 non-null   object 
 3   Time                                               2449 non-null   object 
 4   Research and development expenditure (PCT of GDP)  2449 non-null   float64
dtypes: float64(1), object(4)
memory usage: 95.8+ KB


In [86]:
sta_research.describe()

Unnamed: 0,Research and development expenditure (PCT of GDP)
count,2449.0
mean,1.048454
std,0.934187
min,0.00544
25%,0.31312
50%,0.73532
75%,1.61479
max,5.43562


We check for duplicates: 


In [88]:
sta_research[sta_research.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,Research and development expenditure (PCT of GDP)


We check it has the years needed:

In [92]:
sorted_years = np.sort(sta_research['Time'].unique())
print(sorted_years)

['2000' '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009'
 '2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017' '2018' '2019'
 '2020' '2021']


We check the number of countries and which ones it has:

In [94]:
sta_research['Country'].unique()

array(['Zambia', 'West Bank and Gaza', 'Virgin Islands (U.S.)', 'Vietnam',
       'Venezuela, RB', 'Uzbekistan', 'Uruguay', 'United States',
       'United Kingdom', 'United Arab Emirates', 'Ukraine', 'Uganda',
       'Turkiye', 'Tunisia', 'Trinidad and Tobago', 'Togo', 'Thailand',
       'Tanzania', 'Tajikistan', 'Syrian Arab Republic', 'Switzerland',
       'Sweden', 'Sudan', 'St. Vincent and the Grenadines', 'Sri Lanka',
       'Spain', 'South Africa', 'Slovenia', 'Slovak Republic',
       'Singapore', 'Seychelles', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Rwanda', 'Russian Federation', 'Romania', 'Qatar', 'Puerto Rico',
       'Portugal', 'Poland', 'Philippines', 'Peru', 'Paraguay',
       'Papua New Guinea', 'Panama', 'Pakistan', 'Oman', 'Norway',
       'North Macedonia', 'Nigeria', 'Nicaragua', 'New Zealand',
       'Netherlands', 'Nepal', 'Namibia', 'Myanmar', 'Mozambique',
       'Morocco', 'Montenegro', 'Mongolia', 'Monaco', 'Moldova', 'Mexico',
       'Mauritius', 'Mauri

In [93]:
sta_research['Country'].nunique()

187

---

🔹 Conclusions on: **Research and development expenditure (PCT of GDP)** <br>

* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
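
The dataset as a whole spans 2000 to 2021, but 2,449 rows across 187 countries is fewer than 187 × 22 = 4,114, so not every country has a value for every year. A minimal sketch of how the coverage per country could be measured; the helper name `years_per_country` and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd

def years_per_country(df, country_col="Country", year_col="Time"):
    """Count how many distinct years each country has, fewest first."""
    return df.groupby(country_col)[year_col].nunique().sort_values()

# Toy frame shaped like sta_research (only the two relevant columns)
sample = pd.DataFrame({
    "Country": ["Zambia", "Zambia", "Zambia", "Spain"],
    "Time": ["2008", "2005", "2004", "2008"],
})
coverage = years_per_country(sample)
print(coverage)
```

Countries at the top of the result have the sparsest coverage and may need special handling when the datasets are combined.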

---

### Labour force, total
    This dataset shows the total number of people available to work.

In [95]:
sta_labour.head()

Unnamed: 0,economy,time,Country,Time,"Labour force, total"
0,ZWE,YR2021,Zimbabwe,2021,7915768.0
1,ZWE,YR2020,Zimbabwe,2020,7693983.0
2,ZWE,YR2019,Zimbabwe,2019,7591946.0
3,ZWE,YR2018,Zimbabwe,2018,7403981.0
4,ZWE,YR2017,Zimbabwe,2017,7214627.0


We check the size of the dataset:


In [96]:
sta_labour.shape

(5170, 5)


We review the number of null values and the data type of each column: 


In [97]:
sta_labour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170 entries, 0 to 5169
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   economy              5170 non-null   object 
 1   time                 5170 non-null   object 
 2   Country              5170 non-null   object 
 3   Time                 5170 non-null   object 
 4   Labour force, total  5170 non-null   float64
dtypes: float64(1), object(4)
memory usage: 202.1+ KB


In [98]:
sta_labour.describe()

Unnamed: 0,"Labour force, total"
count,5170.0
mean,141911100.0
std,424296000.0
min,31485.0
25%,1435772.0
50%,5180552.0
75%,38243190.0
max,3467593000.0



We check for duplicates: 


In [100]:
sta_labour[sta_labour.duplicated(subset=['Country', 'Time'])]

Unnamed: 0,economy,time,Country,Time,"Labour force, total"


We check it has the years needed:

In [101]:
sorted_years = np.sort(sta_labour['Time'].unique())
print(sorted_years)

['2000' '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009'
 '2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017' '2018' '2019'
 '2020' '2021']


We check the number of countries and which ones it has:

In [102]:
sta_labour['Country'].unique()

array(['Zimbabwe', 'Zambia', 'Yemen, Rep.', 'West Bank and Gaza',
       'Virgin Islands (U.S.)', 'Vietnam', 'Venezuela, RB', 'Vanuatu',
       'Uzbekistan', 'Uruguay', 'United States', 'United Kingdom',
       'United Arab Emirates', 'Ukraine', 'Uganda', 'Turkmenistan',
       'Turkiye', 'Tunisia', 'Trinidad and Tobago', 'Tonga', 'Togo',
       'Timor-Leste', 'Thailand', 'Tanzania', 'Tajikistan',
       'Syrian Arab Republic', 'Switzerland', 'Sweden', 'Suriname',
       'Sudan', 'St. Vincent and the Grenadines', 'St. Lucia',
       'Sri Lanka', 'Spain', 'South Sudan', 'South Africa', 'Somalia',
       'Solomon Islands', 'Slovenia', 'Slovak Republic', 'Singapore',
       'Sierra Leone', 'Serbia', 'Senegal', 'Saudi Arabia',
       'Sao Tome and Principe', 'Samoa', 'Rwanda', 'Russian Federation',
       'Romania', 'Qatar', 'Puerto Rico', 'Portugal', 'Poland',
       'Philippines', 'Peru', 'Paraguay', 'Papua New Guinea', 'Panama',
       'Pakistan', 'Oman', 'Norway', 'North Macedonia', 'N

In [103]:
sta_labour['Country'].nunique()

235

---

🔹 Conclusions on: **Labour force, total** <br>
* Does not contain duplicates. 
* Does not contain null values.  
* It has all the years needed. 
* It has an extra column, called 'time', that will be removed. 
* The column named 'Time' will be renamed and its data type will be changed to int. 
* The column 'economy' will be renamed to 'Country Code'.
* The column 'Country' will be renamed to 'Country Name'.
* It has more unique values in 'Country' than needed.
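
One way to trim the surplus 'Country' values, assuming they are aggregates such as regions or income groups rather than individual countries. This is a sketch only: the helper `drop_aggregates` and the hand-written set of codes are hypothetical, and a real list of aggregate codes would have to be obtained separately.

```python
import pandas as pd

def drop_aggregates(df, aggregate_codes, code_col="economy"):
    """Keep only the rows whose code is not in the aggregate set."""
    mask = ~df[code_col].isin(aggregate_codes)
    return df[mask].reset_index(drop=True)

# Hypothetical set of aggregate codes, for illustration only
AGGREGATE_CODES = {"ARB", "WLD"}

# Toy frame shaped like sta_labour (only the two relevant columns)
sample = pd.DataFrame({
    "economy": ["ZWE", "ARB", "WLD", "ZMB"],
    "Country": ["Zimbabwe", "Arab World", "World", "Zambia"],
})
countries_only = drop_aggregates(sample, AGGREGATE_CODES)
print(countries_only["Country"].tolist())
```

Filtering on the code column rather than the name avoids problems with spelling variants of country names.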

---

### GDP per Capita
    This dataset shows the total gross value added by all resident producers in each country's economy, divided by its population, in USD.

In [104]:
sta_gdp_percapita.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1990,6759.785391
1,Arab World,ARB,1991,6821.770961
2,Arab World,ARB,1992,7193.242012
3,Arab World,ARB,1993,7394.499977
4,Arab World,ARB,1994,7583.281922


We check the size of the dataset:

In [105]:
sta_gdp_percapita.shape

(6194, 4)

We review the number of null values and the data type of each column: 

In [106]:
sta_gdp_percapita.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  6194 non-null   object 
 1   Country Code  6194 non-null   object 
 2   Year          6194 non-null   int64  
 3   Value         6194 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 193.7+ KB


In [107]:
sta_gdp_percapita.describe()

Unnamed: 0,Year,Value
count,6194.0,6194.0
mean,2003.270746,12780.054466
std,7.704633,16032.009169
min,1990.0,242.001214
25%,1997.0,2459.438806
50%,2003.0,6641.198256
75%,2010.0,16734.31882
max,2016.0,140037.223791


We check for duplicates: 

In [109]:
sta_gdp_percapita[sta_gdp_percapita.duplicated(subset=['Country Name', 'Year'])]

Unnamed: 0,Country Name,Country Code,Year,Value


We check it has the years needed:

In [110]:
sta_gdp_percapita['Year'].unique()

array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016], dtype=int64)

We check the number of countries and which ones it has:

In [111]:
sta_gdp_percapita['Country Name'].unique()

array(['Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)',
       'Europe & Central Asia (IDA & IBRD countries)', 'European Union',
       'Fragile and conflict affected situations',
       'Heavily indebted poor countries (HIPC)', 'High income',
       'IBRD only', 'IDA & IBRD total', 'IDA blend', 'IDA only',
       'IDA total', 'Late-demographic dividend',
       'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Latin America & the Caribbean (IDA & IBRD countries)',
       'Least developed countries: UN classification',
       'Low & middle income', 'Low income', 'Lower middle income',
       'Middle East & North Africa',
       'Middle East & No

In [112]:
sta_gdp_percapita['Country Name'].nunique()

241

---

🔹 Conclusions on: **GDP per Capita** <br>

* Does not contain duplicates. 
* Does not contain null values.  
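* It covers the years 1990 to 2016, so it includes 2000 to 2016 but not 2017 to 2021, which the other States datasets have.
    * We will remove the years prior to 2000.
* The column named 'Value' will be renamed.
* It has more unique values in 'Country Name' than needed. 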
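
The year filter and the rename mentioned in these conclusions could look like the following. This is a hedged sketch: the helper name `prepare_gdp_percapita` and the new column name `GDP per Capita` are assumptions, since the text does not say what 'Value' will be renamed to.

```python
import pandas as pd

def prepare_gdp_percapita(df, first_year=2000, value_name="GDP per Capita"):
    """Drop the years before first_year and give 'Value' a descriptive name."""
    out = df[df["Year"] >= first_year].copy()
    out = out.rename(columns={"Value": value_name})
    return out.reset_index(drop=True)

# Toy frame shaped like sta_gdp_percapita; the last two values are placeholders
sample = pd.DataFrame({
    "Country Name": ["Arab World"] * 3,
    "Country Code": ["ARB"] * 3,
    "Year": [1990, 2000, 2016],
    "Value": [6759.785391, 1.0, 2.0],
})
prepared = prepare_gdp_percapita(sample)
print(prepared)
```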

---

## General conclusions on States: 
    Three out of the four datasets for this category need the same treatment. 
    The fourth dataset needs different treatment. 
    We have enough data for the years we need. 
    They are clean enough to start working on them.  

---

## ⟐ Migration:
•	Country Migration<br>
•	Skill migration<br>
•	Demographic Indicators

We load the databases into variables:

In [113]:
mig_country = pd.read_csv("migration-country_migration.csv")

mig_skill = pd.read_csv("migration-skill_migration.csv")

# Store the link in a variable to keep the code cleaner.
mig_demo_url = 'https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/EXCEL_FILES/1_General/WPP2022_GEN_F01_DEMOGRAPHIC_INDICATORS_REV1.xlsx'
# Import the Excel file: skip the first 15 rows, use the second remaining row as the header (header=1), and do not use any column as the index.
mig_demo = pd.read_excel(mig_demo_url, skiprows=15, header=1, index_col=False)

---

We start checking the basics of each one of them: 

### Country Migration
    This dataset shows, for each pair of base and target countries, the net migration value (per 10K) for each year from 2015 to 2019.

In [115]:
mig_country.head()

Unnamed: 0,base_country_code,base_country_name,base_lat,base_long,base_country_wb_income,base_country_wb_region,target_country_code,target_country_name,target_lat,target_long,...,net_per_10K_2019,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,af,Afghanistan,33.93911,67.709953,...,-0.02,,,,,,,,,
1,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,dz,Algeria,28.033886,1.659626,...,0.78,,,,,,,,,
2,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ao,Angola,-11.202692,17.873887,...,-0.06,,,,,,,,,
3,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ar,Argentina,-38.416097,-63.616672,...,0.23,,,,,,,,,
4,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,am,Armenia,40.069099,45.038189,...,0.02,,,,,,,,,


We check the size of the dataset:


In [116]:
mig_country.shape

(4148, 26)


We review the number of null values and the data type of each column: 


In [117]:
mig_country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4148 entries, 0 to 4147
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   base_country_code         4148 non-null   object 
 1   base_country_name         4148 non-null   object 
 2   base_lat                  4148 non-null   float64
 3   base_long                 4148 non-null   float64
 4   base_country_wb_income    4148 non-null   object 
 5   base_country_wb_region    4148 non-null   object 
 6   target_country_code       4148 non-null   object 
 7   target_country_name       4148 non-null   object 
 8   target_lat                4148 non-null   float64
 9   target_long               4148 non-null   float64
 10  target_country_wb_income  4148 non-null   object 
 11  target_country_wb_region  4148 non-null   object 
 12  net_per_10K_2015          4148 non-null   float64
 13  net_per_10K_2016          4148 non-null   float64
 14  net_per_

In [118]:
mig_country.describe()

Unnamed: 0,base_lat,base_long,target_lat,target_long,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
count,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,28.418022,21.698305,28.418022,21.698305,0.461757,0.150248,-0.080272,-0.040591,-0.022743,,,,,,,,,
std,25.086012,61.937381,25.086012,61.937381,5.00653,4.201118,3.203092,3.593876,3.633247,,,,,,,,,
min,-40.900557,-106.346771,-40.900557,-106.346771,-37.01,-40.89,-43.66,-56.22,-50.33,,,,,,,,,
25%,14.058324,-3.435973,14.058324,-3.435973,-0.15,-0.19,-0.21,-0.21,-0.21,,,,,,,,,
50%,35.86166,19.145136,35.86166,19.145136,0.0,0.0,0.0,0.0,0.0,,,,,,,,,
75%,47.516231,53.688046,47.516231,53.688046,0.24,0.22,0.16,0.17,0.18,,,,,,,,,
max,64.963051,179.414413,64.963051,179.414413,150.68,124.48,87.0,91.41,87.71,,,,,,,,,


We check the columns to see if it has the years we need:

In [119]:
mig_country.columns

Index(['base_country_code', 'base_country_name', 'base_lat', 'base_long',
       'base_country_wb_income', 'base_country_wb_region',
       'target_country_code', 'target_country_name', 'target_lat',
       'target_long', 'target_country_wb_income', 'target_country_wb_region',
       'net_per_10K_2015', 'net_per_10K_2016', 'net_per_10K_2017',
       'net_per_10K_2018', 'net_per_10K_2019', 'Unnamed: 17', 'Unnamed: 18',
       'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22',
       'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25'],
      dtype='object')

---

🔹 Conclusions on: **Country Migration** <br>
* Does not contain null values in the named columns.
* Very difficult to check for duplicates. 
* Some columns will be deleted since they do not have information.
* Will be used separately from the other datasets. 
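
One possible way to approach both the empty columns and the duplicate check, assuming each row is meant to describe a single base/target country pair. The toy frame below is illustrative and only mimics a few columns of `mig_country`; the repeated pair in it is deliberately made up.

```python
import numpy as np
import pandas as pd

# Toy frame: the third row deliberately repeats the first base/target pair
sample = pd.DataFrame({
    "base_country_code": ["ae", "ae", "ae"],
    "target_country_code": ["af", "dz", "af"],
    "net_per_10K_2019": [-0.02, 0.78, -0.02],
    "Unnamed: 17": [np.nan, np.nan, np.nan],
})

# Drop the columns that hold no information at all
trimmed = sample.dropna(axis=1, how="all")

# Flag every row whose base/target pair occurs more than once
pair_cols = ["base_country_code", "target_country_code"]
repeated_pairs = trimmed[trimmed.duplicated(subset=pair_cols, keep=False)]
print(repeated_pairs)
```

With `keep=False` all occurrences of a repeated pair are shown, which makes it easier to compare them side by side.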

---

### Skill migration
    This dataset refers to the migration of people with some level of higher education, broken down by skill group.

In [120]:
mig_skill.head()

Unnamed: 0,country_code,country_name,wb_income,wb_region,skill_group_id,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,...,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28
0,af,Afghanistan,Low income,South Asia,2549.0,Tech Skills,Information Management,-791.59,-705.88,-550.04,...,,,,,,,,,,
1,af,Afghanistan,Low income,South Asia,2608.0,Business Skills,Operational Efficiency,-1610.25,-933.55,-776.06,...,,,,,,,,,,
2,af,Afghanistan,Low income,South Asia,3806.0,Specialized Industry Skills,National Security,-1731.45,-769.68,-756.59,...,,,,,,,,,,
3,af,Afghanistan,Low income,South Asia,50321.0,Tech Skills,Software Testing,-957.5,-828.54,-964.73,...,,,,,,,,,,
4,af,Afghanistan,Low income,South Asia,1606.0,Specialized Industry Skills,Navy,-1510.71,-841.17,-842.32,...,,,,,,,,,,


We check the size of the dataset:

In [121]:
mig_skill.shape

(20647, 29)

We review the number of null values and the data type of each column: 

In [122]:
mig_skill.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20647 entries, 0 to 20646
Data columns (total 29 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   country_code          17617 non-null  object 
 1   country_name          17617 non-null  object 
 2   wb_income             17617 non-null  object 
 3   wb_region             17617 non-null  object 
 4   skill_group_id        17617 non-null  float64
 5   skill_group_category  17617 non-null  object 
 6   skill_group_name      17617 non-null  object 
 7   net_per_10K_2015      17617 non-null  float64
 8   net_per_10K_2016      17617 non-null  float64
 9   net_per_10K_2017      17617 non-null  float64
 10  net_per_10K_2018      17617 non-null  float64
 11  net_per_10K_2019      17617 non-null  float64
 12  Unnamed: 12           0 non-null      float64
 13  Unnamed: 13           0 non-null      float64
 14  Unnamed: 14           0 non-null      float64
 15  Unnamed: 15        

In [127]:
mig_skill.describe()

Unnamed: 0,skill_group_id,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,...,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28
count,17617.0,17617.0,17617.0,17617.0,17617.0,17617.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,6768.889368,-14.600256,-39.20081,-54.772238,-36.081342,-34.782431,,,,,...,,,,,,,,,,
std,11609.744113,255.690007,252.33377,256.931197,265.209723,239.798934,,,,,...,,,,,,,,,,
min,44.0,-3037.38,-2435.26,-6604.67,-3629.02,-4022.04,,,,,...,,,,,,,,,,
25%,618.0,-106.16,-119.24,-121.46,-111.44,-111.74,,,,,...,,,,,,,,,,
50%,2091.0,-11.5,-27.04,-31.81,-15.17,-17.43,,,,,...,,,,,,,,,,
75%,6189.0,72.12,58.85,47.48,65.94,65.92,,,,,...,,,,,,,,,,
max,50415.0,2824.97,1796.89,1906.14,1515.79,1901.99,,,,,...,,,,,,,,,,


---

🔹 Conclusions on: **Skill Migration** <br>
* Contains null values: each named column has 17,617 non-null values out of 20,647 rows.
* Very difficult to check for duplicates. 
* Some columns will be deleted since they do not have information.
* Will be used separately from the other datasets. 
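
A minimal sketch of how the empty columns, and any rows with no values at all, could be removed. The helper name `drop_empty` is an assumption, and the toy frame only mimics a few columns of `mig_skill`; whether the null rows in the real data are completely empty would need to be confirmed first.

```python
import numpy as np
import pandas as pd

def drop_empty(df):
    """Remove columns, then rows, that contain no values at all."""
    out = df.dropna(axis=1, how="all")
    out = out.dropna(axis=0, how="all")
    return out.reset_index(drop=True)

# Toy frame: one fully empty row and one fully empty column
sample = pd.DataFrame({
    "country_code": ["af", None, "af"],
    "skill_group_name": ["Navy", None, "Software Testing"],
    "net_per_10K_2015": [-1510.71, np.nan, -957.5],
    "Unnamed: 12": [np.nan, np.nan, np.nan],
})
cleaned = drop_empty(sample)
print(cleaned.shape)
```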

---

### Demographic Indicator
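    This dataset contains a wide range of demographic indicators (65 columns), including mortality rates, the net number of migrants and the net migration rate, by region, country or area and by year.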

In [123]:
mig_demo.head()

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,Location code,ISO3 Alpha-code,ISO2 Alpha-code,SDMX code**,Type,Parent code,...,"Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)","Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)","Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)","Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)","Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)","Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)","Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)","Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)",Net Number of Migrants (thousands),"Net Migration Rate (per 1,000 population)"
0,1,Estimates,WORLD,,900,,,1.0,World,0,...,580.75,498.04,240.316,271.625,208.192,378.697,430.259,324.931,0,0
1,2,Estimates,WORLD,,900,,,1.0,World,0,...,566.728,490.199,231.177,258.09,203.78,368.319,415.836,319.336,0,0
2,3,Estimates,WORLD,,900,,,1.0,World,0,...,546.317,477.264,218.674,240.034,197.142,353.055,395.533,309.91,0,0
3,4,Estimates,WORLD,,900,,,1.0,World,0,...,535.829,469.532,212.872,232.602,193.049,345.083,385.843,303.905,0,0
4,5,Estimates,WORLD,,900,,,1.0,World,0,...,523.124,458.484,205.762,224.05,187.444,335.442,374.658,295.994,0,0


In [124]:
mig_demo.shape

(20596, 65)

In [125]:
mig_demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20596 entries, 0 to 20595
Data columns (total 65 columns):
 #   Column                                                                                          Non-Null Count  Dtype  
---  ------                                                                                          --------------  -----  
 0   Index                                                                                           20596 non-null  int64  
 1   Variant                                                                                         20596 non-null  object 
 2   Region, subregion, country or area *                                                            20596 non-null  object 
 3   Notes                                                                                           5475 non-null   object 
 4   Location code                                                                                   20596 non-null  int64  
 5   ISO3 Alpha-

In [126]:
mig_demo.describe()

Unnamed: 0,Index,Location code,SDMX code**,Parent code,Year
count,20596.0,20596.0,20304.0,20596.0,20592.0
mean,10298.5,597.600748,410.088652,1217.669062,1985.5
std,5945.697408,565.832272,268.84186,1004.610463,20.78311
min,1.0,4.0,1.0,0.0,1950.0
25%,5149.75,266.0,158.0,913.0,1967.75
50%,10298.5,531.0,415.5,922.0,1985.5
75%,15447.25,792.0,643.0,931.0,2003.25
max,20596.0,5501.0,914.0,5501.0,2021.0


We check if the columns we want have null values: 

In [128]:
check_nulls = mig_demo[['Natural Change, Births minus Deaths (thousands)', 'Rate of Natural Change (per 1,000 population)', 'Population Growth Rate (percentage)', 'Crude Birth Rate (births per 1,000 population)', 
                        'Median Age, as of 1 July (years)', 'Life Expectancy at Birth, both sexes (years)', 'Net Number of Migrants (thousands)', 'Net Migration Rate (per 1,000 population)',
                        'Infant Mortality Rate (infant deaths per 1,000 live births)', 'Infant Deaths, under age 1 (thousands)']].isnull().sum()
print(check_nulls)

Natural Change, Births minus Deaths (thousands)                0
Rate of Natural Change (per 1,000 population)                  0
Population Growth Rate (percentage)                            0
Crude Birth Rate (births per 1,000 population)                 0
Median Age, as of 1 July (years)                               0
Life Expectancy at Birth, both sexes (years)                   0
Net Number of Migrants (thousands)                             0
Net Migration Rate (per 1,000 population)                      0
Infant Mortality Rate (infant deaths per 1,000 live births)    0
Infant Deaths, under age 1 (thousands)                         0
dtype: int64


We check it has the years needed:

In [129]:
mig_demo['Year'].unique()

array([1950., 1951., 1952., 1953., 1954., 1955., 1956., 1957., 1958.,
       1959., 1960., 1961., 1962., 1963., 1964., 1965., 1966., 1967.,
       1968., 1969., 1970., 1971., 1972., 1973., 1974., 1975., 1976.,
       1977., 1978., 1979., 1980., 1981., 1982., 1983., 1984., 1985.,
       1986., 1987., 1988., 1989., 1990., 1991., 1992., 1993., 1994.,
       1995., 1996., 1997., 1998., 1999., 2000., 2001., 2002., 2003.,
       2004., 2005., 2006., 2007., 2008., 2009., 2010., 2011., 2012.,
       2013., 2014., 2015., 2016., 2017., 2018., 2019., 2020., 2021.,
         nan])

We check the number of countries and which ones it has:

In [131]:
mig_demo['Region, subregion, country or area *'].unique()

array(['WORLD', 'Sustainable Development Goal (SDG) regions',
       'Sub-Saharan Africa', 'Northern Africa and Western Asia',
       'Central and Southern Asia', 'Eastern and South-Eastern Asia',
       'Latin America and the Caribbean',
       'Oceania (excluding Australia and New Zealand)',
       'Australia/New Zealand', 'Europe and Northern America',
       'UN development groups', 'More developed regions',
       'Less developed regions', 'Least developed countries',
       'Less developed regions, excluding least developed countries',
       'Less developed regions, excluding China',
       'Land-locked Developing Countries (LLDC)',
       'Small Island Developing States (SIDS)',
       'World Bank income groups', 'High-income countries',
       'Middle-income countries', 'Upper-middle-income countries',
       'Lower-middle-income countries', 'Low-income countries',
       'No income group available', 'Geographic regions', 'AFRICA',
       'Eastern Africa', 'Burundi', 'Comoros'

In [130]:
mig_demo['Region, subregion, country or area *'].nunique()

289

---

🔹 Conclusions on: **Demographic Indicator** <br>

* Does not contain duplicates.
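* The columns we want do not contain null values.
* Has all the years needed.
    * We will have to remove the years prior to 2000. 
    * 'Year' is stored as a float and has 4 null values, which will need handling before it is converted to int to remove the trailing dot.
* It has more unique values in 'Region, subregion, country or area *' than needed. 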
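
The treatment of 'Year' described in these conclusions can be sketched as follows. This is an illustrative sketch under the assumption that rows without a year can simply be dropped; the helper name `prepare_years` is hypothetical.

```python
import numpy as np
import pandas as pd

def prepare_years(df, first_year=2000, year_col="Year"):
    """Drop rows without a year, cast the year to int, keep recent years."""
    out = df.dropna(subset=[year_col]).copy()
    out[year_col] = out[year_col].astype(int)  # 2000.0 -> 2000
    return out[out[year_col] >= first_year].reset_index(drop=True)

# Toy frame shaped like mig_demo (only the two relevant columns)
sample = pd.DataFrame({
    "Region, subregion, country or area *": ["WORLD"] * 4,
    "Year": [1950.0, 2000.0, 2021.0, np.nan],
})
recent = prepare_years(sample)
print(recent["Year"].tolist())
```

Dropping the null years before the cast matters, because a float column containing NaN cannot be converted to a plain int type.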

---


## General conclusions on Migration: 
    One of them will be merged with the rest of the datasets. 
    The other two will probably be used separately.

---

# Final conclusions: 

The datasets can be split into three groups, so working with them will be simple.  <br>
Normalizing is a must: the datasets differ considerably between groups, so we need to bring them into the same format. <br>
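
Once the datasets share a format, combining them could look like the sketch below. It assumes that every cleaned frame ends up with the key columns `Country Code` and `Year`; those key names and the helper `merge_indicators` are assumptions rather than the notebook's actual merge step.

```python
from functools import reduce

import pandas as pd

KEYS = ["Country Code", "Year"]  # assumed common keys after cleaning

def merge_indicators(frames, keys=KEYS):
    """Outer-join a list of indicator frames on the shared keys."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=keys, how="outer"),
        frames,
    )

# Toy frames with one row each, as they might look after cleaning
population = pd.DataFrame({
    "Country Code": ["ZWE"], "Year": [2017],
    "Population Total": [14751101.0],
})
labour = pd.DataFrame({
    "Country Code": ["ZWE"], "Year": [2017],
    "Labour force, total": [7214627.0],
})
merged = merge_indicators([population, labour])
print(merged)
```

An outer join keeps country-year pairs that are missing from some indicators, so gaps show up as nulls instead of disappearing silently.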

---

Version: 2023/02/28