<img src=https://i.imgur.com/QnKVI6k.jpg>

---

# Introduction

This notebook contains all the ETL processes, implemented as pipelines, so that the data transformations can be rerun automatically in the future.

---

# ETL

🔹 We begin by importing all the necessary libraries:

In [137]:
#Libraries to work on the databases:
import wbgapi as wb  # World Bank API client; `wb` refers to this module throughout the notebook
import datetime
import datapackage

#Libraries to work on the pipeline:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Importing the databases
The data comes from different sources, so different loading methods are required.
All of this data is fetched from the web, so each run picks up any updates published at the source.

For some dataframes, we need to establish today's date so that the requested range of years extends automatically in the future:

In [138]:
today = datetime.date.today()
year = today.year
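
As a small aside, `year` is used later as the upper bound of `range(2000, year)`, and Python's `range` excludes its upper bound. The sketch below uses only the standard library to show which years are actually covered. The names `years`, `first_year` and `last_year` are illustrative and not part of the notebook.

```python
import datetime

today = datetime.date.today()
year = today.year

# range() stops one step before its upper bound,
# so the current (still incomplete) year is not requested.
years = range(2000, year)
first_year, last_year = years[0], years[-1]

print(first_year, last_year)
```

If the current year should be included as well, the upper bound would need to be `year + 1`.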

## Economy
All of the databases come from the same source, therefore we only need to use one block of code.

In [139]:
#Economy series
economy_worldbank_series = ['NY.GDP.MKTP.CD', 'NY.GDP.MKTP.KD.ZG', "NE.CON.TOTL.KD.ZG", "NY.GNP.PCAP.CD", "NY.GNS.ICTR.ZS", "FP.CPI.TOTL"]
    # We indicate which data from the economy series we want.

#Economy DF load
economy = wb.data.DataFrame((economy_worldbank_series), labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
    # We indicate the range of years we want and various details to have a clean database.

#Economy rename columns 
economy.rename(columns={'NY.GDP.MKTP.CD': 'GDP', 'NY.GDP.MKTP.KD.ZG': 'GDP_GROWTH', 'NE.CON.TOTL.KD.ZG': 'CONS_EXPEND',
'NY.GNP.PCAP.CD': 'GNI_CAPITA', 'NY.GNS.ICTR.ZS': 'GROSS_SAVINGS', 'FP.CPI.TOTL': 'CONSUMER_PRICE' }, inplace=True)
    # In this stage, we can rename the columns with what those values represent.
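
The series list and the rename mapping above repeat the same six codes. One possible way to keep them in sync, shown here as a sketch only, is to derive the list from the mapping. The name `economy_columns` is hypothetical and does not appear elsewhere in this notebook.

```python
# Map each World Bank series code to the column name used in this notebook.
economy_columns = {
    'NY.GDP.MKTP.CD': 'GDP',
    'NY.GDP.MKTP.KD.ZG': 'GDP_GROWTH',
    'NE.CON.TOTL.KD.ZG': 'CONS_EXPEND',
    'NY.GNP.PCAP.CD': 'GNI_CAPITA',
    'NY.GNS.ICTR.ZS': 'GROSS_SAVINGS',
    'FP.CPI.TOTL': 'CONSUMER_PRICE',
}

# The list of series to request is derived from the mapping,
# so a code cannot be requested without also being renamed.
economy_worldbank_series = list(economy_columns)

print(economy_worldbank_series)
```

The same dictionary could then be passed to `economy.rename(columns=economy_columns)`.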

We check everything was loaded properly and see some info:

In [140]:
economy.head()

Unnamed: 0,economy,time,Country,Time,CONSUMER_PRICE,CONS_EXPEND,GDP,GDP_GROWTH,GNI_CAPITA,GROSS_SAVINGS
0,ZWE,YR2021,Zimbabwe,2021,5411.002445,14.261391,28371240000.0,8.468017,1530.0,
1,ZWE,YR2020,Zimbabwe,2020,2725.312815,-4.540536,21509700000.0,-7.816951,1460.0,16.452763
2,ZWE,YR2019,Zimbabwe,2019,414.684309,-10.119249,21832230000.0,-6.332446,1450.0,20.231198
3,ZWE,YR2018,Zimbabwe,2018,116.712211,-0.462873,34156070000.0,5.009867,1550.0,13.923906
4,ZWE,YR2017,Zimbabwe,2017,105.508414,3.920652,17584890000.0,4.080264,1170.0,-2.682417


In [141]:
economy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5648 entries, 0 to 5647
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   economy         5648 non-null   object 
 1   time            5648 non-null   object 
 2   Country         5648 non-null   object 
 3   Time            5648 non-null   object 
 4   CONSUMER_PRICE  3966 non-null   float64
 5   CONS_EXPEND     4400 non-null   float64
 6   GDP             5609 non-null   float64
 7   GDP_GROWTH      5529 non-null   float64
 8   GNI_CAPITA      5285 non-null   float64
 9   GROSS_SAVINGS   4337 non-null   float64
dtypes: float64(6), object(4)
memory usage: 441.4+ KB


---

## People
All of the databases come from the same source, therefore we only need to use one block of code.

In [142]:
#People series
people_worldbank_series = ['SE.XPD.TOTL.GD.ZS', 'SL.UEM.TOTL.ZS']
    # We indicate which data from the people series we want.

#People DF load
people = wb.data.DataFrame((people_worldbank_series), labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
    # We indicate the range of years we want and various details to have a clean database.

#People rename columns
people.rename(columns={'SE.XPD.TOTL.GD.ZS': 'gover_exp', 'SL.UEM.TOTL.ZS': 'unemploy'}, inplace=True)
    # In this stage, we can rename the columns with what those values represent.

We check everything was loaded properly and see some info:

In [143]:
people.head()

Unnamed: 0,economy,time,Country,Time,gover_exp,unemploy
0,ZWE,YR2018,Zimbabwe,2018,3.86611,4.796
1,ZWE,YR2017,Zimbabwe,2017,5.81878,4.785
2,ZWE,YR2016,Zimbabwe,2016,5.47262,4.788
3,ZWE,YR2015,Zimbabwe,2015,5.81279,4.778
4,ZWE,YR2014,Zimbabwe,2014,6.13835,4.77


In [144]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5406 entries, 0 to 5405
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   economy    5406 non-null   object 
 1   time       5406 non-null   object 
 2   Country    5406 non-null   object 
 3   Time       5406 non-null   object 
 4   gover_exp  4013 non-null   float64
 5   unemploy   5170 non-null   float64
dtypes: float64(2), object(4)
memory usage: 253.5+ KB


---

## Environment
These datasets come from different sources, therefore two blocks of code are used.

🔹 Access to electricity (% of population) & People using at least basic sanitation services in urban areas.

In [145]:
#Environment series
enviroment_worldbank_series = ['EG.ELC.ACCS.ZS', 'SH.STA.BASS.UR.ZS']
    # We indicate which data from the environment series we want.

#Environment DF load
enviroment = wb.data.DataFrame((enviroment_worldbank_series), labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
    # We indicate the range of years we want and various details to have a clean database.

#Environment rename columns
enviroment.rename(columns={'EG.ELC.ACCS.ZS': 'elect', 'SH.STA.BASS.UR.ZS': 'basic_sanitation'}, inplace=True)
    # In this stage, we can rename the columns with what those values represent.

We check everything was loaded properly and see some info:

In [146]:
enviroment.head()

Unnamed: 0,economy,time,Country,Time,elect,basic_sanitation
0,ZWE,YR2020,Zimbabwe,2020,52.747669,41.829436
1,ZWE,YR2019,Zimbabwe,2019,46.781475,43.223952
2,ZWE,YR2018,Zimbabwe,2018,45.572647,44.613054
3,ZWE,YR2017,Zimbabwe,2017,44.178635,45.996743
4,ZWE,YR2016,Zimbabwe,2016,42.561729,47.375018


In [147]:
enviroment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5523 entries, 0 to 5522
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   economy           5523 non-null   object 
 1   time              5523 non-null   object 
 2   Country           5523 non-null   object 
 3   Time              5523 non-null   object 
 4   elect             5507 non-null   float64
 5   basic_sanitation  4659 non-null   float64
dtypes: float64(2), object(4)
memory usage: 259.0+ KB


🔹 Population Density

In [148]:
data_url = 'https://datahub.io/world-bank/en.pop.dnst/datapackage.json'
    # Storing the dataset into a generic variable:

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        pop_density = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [149]:
pop_density.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1961,6.976239
1,Arab World,ARB,1962,7.169853
2,Arab World,ARB,1963,7.370144
3,Arab World,ARB,1964,7.577779
4,Arab World,ARB,1965,7.793214


In [150]:
pop_density.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14339 entries, 0 to 14338
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  14339 non-null  object 
 1   Country Code  14339 non-null  object 
 2   Year          14339 non-null  int64  
 3   Value         14339 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 448.2+ KB


---

## Poverty
All of these datasets come from the same source ('Datahub'); however, one block of code per dataset is needed.
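
Since every Datahub block follows the same steps, the repetition could be factored into a helper. The following is a minimal sketch under stated assumptions: `last_tabular_path` and `load_datahub` are hypothetical names, and the stand-in resources and their file names are made up for an offline illustration. The selection logic mirrors the loop used in this notebook, where each tabular resource overwrites the previous one, so the last tabular resource is the one kept.

```python
from types import SimpleNamespace


def last_tabular_path(resources):
    """Return the 'path' of the last tabular resource, as the notebook's loop does."""
    path = None
    for resource in resources:
        if resource.tabular:
            path = resource.descriptor['path']
    if path is None:
        raise ValueError("no tabular resource found")
    return path


def load_datahub(data_url):
    """Hypothetical helper: load one Datahub data package into a DataFrame."""
    # Imported here so the selection logic above can be tried without these packages.
    import datapackage
    import pandas as pd

    package = datapackage.Package(data_url)
    return pd.read_csv(last_tabular_path(package.resources))


# Offline illustration with stand-in objects (not real Datahub resources).
fake_resources = [
    SimpleNamespace(tabular=False, descriptor={'path': 'archive.zip'}),
    SimpleNamespace(tabular=True, descriptor={'path': 'data_csv.csv'}),
    SimpleNamespace(tabular=True, descriptor={'path': 'data_validated.csv'}),
]
selected = last_tabular_path(fake_resources)
print(selected)
```

With such a helper, each dataset would need a single line, for example `population_below = load_datahub(data_url)`.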

🔹 Population below $1.90 a day:

In [151]:
data_url = 'https://datahub.io/world-bank/si.pov.dday/datapackage.json'
    # Storing the dataset into a generic variable:

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        population_below = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [152]:
population_below.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,East Asia & Pacific,EAS,1981,80.8
1,East Asia & Pacific,EAS,1984,70.5
2,East Asia & Pacific,EAS,1987,59.5
3,East Asia & Pacific,EAS,1990,61.6
4,East Asia & Pacific,EAS,1993,54.0


In [153]:
population_below.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  1611 non-null   object 
 1   Country Code  1611 non-null   object 
 2   Year          1611 non-null   int64  
 3   Value         1611 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 50.5+ KB


🔹 Maternal mortality ratio:

In [154]:
data_url = 'https://datahub.io/world-bank/sh.sta.mmrt/datapackage.json'
    # Storing the dataset into a generic variable

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        maternal_mortality = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [155]:
maternal_mortality.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1990,289
1,Arab World,ARB,1991,285
2,Arab World,ARB,1992,281
3,Arab World,ARB,1993,278
4,Arab World,ARB,1994,274


In [156]:
maternal_mortality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5954 entries, 0 to 5953
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Country Name  5954 non-null   object
 1   Country Code  5954 non-null   object
 2   Year          5954 non-null   int64 
 3   Value         5954 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 186.2+ KB


🔹 Incidence of tuberculosis

In [157]:
data_url = 'https://datahub.io/world-bank/sh.tbs.incd/datapackage.json'
    # Storing the dataset into a generic variable

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        tuberculosis = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [158]:
tuberculosis.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,East Asia & Pacific,EAS,2000,178.0
1,East Asia & Pacific,EAS,2001,176.0
2,East Asia & Pacific,EAS,2002,173.0
3,East Asia & Pacific,EAS,2003,171.0
4,East Asia & Pacific,EAS,2004,168.0


In [159]:
tuberculosis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3850 entries, 0 to 3849
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  3850 non-null   object 
 1   Country Code  3850 non-null   object 
 2   Year          3850 non-null   int64  
 3   Value         3850 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 120.4+ KB


🔹 Contributing family workers and own-account workers, female

In [160]:
data_url = 'https://datahub.io/world-bank/sl.fam.work.fe.zs/datapackage.json'
    # Storing the dataset into a generic variable

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        contributing_f = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [161]:
contributing_f.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1991,23.202541
1,Arab World,ARB,1992,24.240175
2,Arab World,ARB,1993,23.876738
3,Arab World,ARB,1994,23.389433
4,Arab World,ARB,1995,26.585767


In [162]:
contributing_f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6291 entries, 0 to 6290
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  6291 non-null   object 
 1   Country Code  6291 non-null   object 
 2   Year          6291 non-null   int64  
 3   Value         6291 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 196.7+ KB


🔹 Contributing family workers and own-account workers, male

In [163]:
data_url = 'https://datahub.io/world-bank/sl.fam.work.ma.zs/datapackage.json'
    # Storing the dataset into a generic variable

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        contributing_m = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [164]:
contributing_m.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1991,8.541916
1,Arab World,ARB,1992,8.650984
2,Arab World,ARB,1993,8.591657
3,Arab World,ARB,1994,8.023385
4,Arab World,ARB,1995,8.329218


In [165]:
contributing_m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6291 entries, 0 to 6290
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  6291 non-null   object 
 1   Country Code  6291 non-null   object 
 2   Year          6291 non-null   int64  
 3   Value         6291 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 196.7+ KB


---

## States
These datasets come from two different sources ('World Bank' and 'Datahub'), so each source is loaded with its own kind of code block.

🔹 Profit tax & Mobile cellular subscriptions

In [166]:
#States series
state_worldbank_series = ['IC.TAX.PRFT.CP.ZS', 'IT.CEL.SETS.P2']
#State DF load
state = wb.data.DataFrame((state_worldbank_series), labels = True, time=range(2000, year), skipBlanks=True, columns='series').reset_index()
#State rename columns 
state.rename(columns={'IC.TAX.PRFT.CP.ZS': 'profit_tax', 'IT.CEL.SETS.P2': 'mobilesubs'}, inplace=True)

We check everything was loaded properly and see some info:

In [167]:
state.head()

Unnamed: 0,economy,time,Country,Time,profit_tax,mobilesubs
0,ZWE,YR2019,Zimbabwe,2019,17.6,85.940989
1,ZWE,YR2018,Zimbabwe,2018,17.6,85.761588
2,ZWE,YR2017,Zimbabwe,2017,17.6,95.532557
3,ZWE,YR2016,Zimbabwe,2016,17.6,89.11084
4,ZWE,YR2015,Zimbabwe,2015,17.6,90.126929


In [168]:
state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5529 entries, 0 to 5528
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   economy     5529 non-null   object 
 1   time        5529 non-null   object 
 2   Country     5529 non-null   object 
 3   Time        5529 non-null   object 
 4   profit_tax  3397 non-null   float64
 5   mobilesubs  5491 non-null   float64
dtypes: float64(2), object(4)
memory usage: 259.3+ KB


🔹 GDP per capita (PPP)

In [169]:
data_url = 'https://datahub.io/world-bank/ny.gdp.pcap.pp.cd/datapackage.json'
    # Storing the dataset into a generic variable

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        gdp_percapita = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [170]:
gdp_percapita.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Arab World,ARB,1990,6759.785391
1,Arab World,ARB,1991,6821.770961
2,Arab World,ARB,1992,7193.242012
3,Arab World,ARB,1993,7394.499977
4,Arab World,ARB,1994,7583.281922


In [171]:
gdp_percapita.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  6194 non-null   object 
 1   Country Code  6194 non-null   object 
 2   Year          6194 non-null   int64  
 3   Value         6194 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 193.7+ KB


🔹 Poverty Gap

In [172]:
data_url ='https://datahub.io/world-bank/si.pov.urgp/datapackage.json'
    # Storing the dataset into a generic variable

package = datapackage.Package(data_url)
    # Loading Data Package into storage

resources = package.resources
for resource in resources:
    if resource.tabular:
        povertygap = pd.read_csv(resource.descriptor['path'])
    # Loading only tabular data

We check everything was loaded properly and see some info:

In [173]:
povertygap.head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Afghanistan,AFG,2007,6.1
1,Afghanistan,AFG,2011,5.6
2,Albania,ALB,2002,4.5
3,Albania,ALB,2005,2.3
4,Albania,ALB,2008,1.9


In [174]:
povertygap.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  242 non-null    object 
 1   Country Code  242 non-null    object 
 2   Year          242 non-null    int64  
 3   Value         242 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 7.7+ KB


---

# ETL Pipeline
We have designed two pipelines, one for each source ('World Bank' and 'Datahub'). The sources provide the data in different formats, so the transformation steps differ. The transformations were chosen after the exploratory data analysis, based on our needs for the project.

## World Bank

We create a class for each process the databases coming from 'World Bank' will have to go through.

In [175]:
# Renaming a column (used below for 'economy' -> 'id_country' and 'Time' -> 'Year')
class ColumnRenamer(BaseEstimator, TransformerMixin):
    def __init__(self, old_name, new_name):
        self.old_name = old_name
        self.new_name = new_name
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.rename(columns={self.old_name: self.new_name})
        return X

# Filling null values.
class FillNAs(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Forward fill: each gap takes the value from the row above it.
        X = X.ffill()
        # Gaps at the top of a column have no row above them, so they are
        # back filled from the next valid value below. Only the missing cells
        # change; the identifier columns of the first row are left untouched.
        X = X.bfill()
        return X

# Dropping the extra column named 'time'
class DropTimeColumn(BaseEstimator, TransformerMixin):
    def __init__(self, columns=['time']):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.drop(columns=self.columns)
        return X

#Defining the pipeline:
processes_wb = [
    ('rename_country_column', ColumnRenamer(old_name='economy', new_name='id_country')),
    ('rename_time_column', ColumnRenamer(old_name='Time', new_name='Year')),
    ('fill_nas', FillNAs()),
    ('drop_columns', DropTimeColumn(columns=['time']))
                ]

pipeline_wb = Pipeline(processes_wb)
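
Because the World Bank frames stack all countries in one table, a plain forward fill can carry a value from the last row of one country into the first rows of the next. The toy example below illustrates the effect and sketches a per-country alternative. It is not what `pipeline_wb` does; the country codes `AAA` and `BBB` and all values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'id_country': ['AAA', 'AAA', 'BBB', 'BBB'],
    'Year': ['2021', '2020', '2021', '2020'],
    'GDP': [10.0, 9.0, None, 4.0],
})

# Plain forward fill: the gap in BBB/2021 inherits AAA/2020's value.
global_fill = toy.ffill()

# Per-country fill: gaps are filled only from rows of the same country.
by_country = toy.copy()
by_country['GDP'] = (
    toy.groupby('id_country')['GDP']
       .transform(lambda s: s.ffill().bfill())
)

print(global_fill['GDP'].tolist())  # [10.0, 9.0, 9.0, 4.0]
print(by_country['GDP'].tolist())   # [10.0, 9.0, 4.0, 4.0]
```

Whether the per-country variant is preferable depends on how the filled values are used downstream; a country with no valid value at all would still keep its gaps.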

## Datahub

We create a class for each process the databases coming from 'Datahub' will have to go through.

In [176]:
# Drop rows whose year is prior to 2000:
class DropRowsBefore2000(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X[X['Year'] >= 2000]
        return X
    
# Drop rows with a duplicated country and year:
class DropDuplicates(BaseEstimator, TransformerMixin):
    def __init__(self, columns=["Country Code", "Year"]):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X.drop_duplicates(subset=self.columns)
        return X

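# Fill null values with previous values.
# Note: this redefines the name FillNAs; pipeline_wb is unaffected because it
# already holds an instance of the earlier class.
class FillNAs(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.ffill()
        return X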

# Put the columns in a fixed order ('Value' is kept as-is):
class ColumnOrganizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X = X[['Country Code', 'Country Name', 'Year', 'Value']]
        return X

#Defining the pipeline:
known_columns = ['Country Code', 'Country Name', 'Year']

processes_dh = [        
    ('drop_rows_before_2000', DropRowsBefore2000()),    
    ('drop_duplicates', DropDuplicates(columns=["Country Code", "Year"])),
    ('fill_nas', FillNAs()),
    ('organize_columns', ColumnOrganizer())
]

pipeline_dh = Pipeline(processes_dh)
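
Filtering rows keeps their original index labels, which is why the `info()` outputs of the Datahub frames later in this notebook report index ranges such as `39 to 14338` rather than `0 to n-1`. The sketch below reproduces this on a toy frame and shows one optional way to renumber the rows. `ResetIndex` is a hypothetical extra step that is not part of `pipeline_dh`, and the values are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropRowsBefore2000(TransformerMixin, BaseEstimator):
    """Same filter as in the notebook: keep rows with Year >= 2000."""
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[X['Year'] >= 2000]


class ResetIndex(TransformerMixin, BaseEstimator):
    """Hypothetical extra step: renumber rows 0..n-1 after filtering."""
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.reset_index(drop=True)


toy = pd.DataFrame({
    'Country Code': ['ARB', 'ARB', 'ARB'],
    'Country Name': ['Arab World'] * 3,
    'Year': [1999, 2000, 2001],
    'Value': [1.0, 2.0, 3.0],
})

filtered = Pipeline([
    ('drop_rows_before_2000', DropRowsBefore2000()),
]).fit_transform(toy)

renumbered = Pipeline([
    ('drop_rows_before_2000', DropRowsBefore2000()),
    ('reset_index', ResetIndex()),
]).fit_transform(toy)

print(list(filtered.index))    # [1, 2]
print(list(renumbered.index))  # [0, 1]
```

Keeping the original labels is harmless for most uses; renumbering mainly matters when rows are later accessed by position-like labels or concatenated.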

## Applying pipelines:

### Economy

In [177]:
economy = pipeline_wb.fit_transform(economy)

We check the pipeline got applied correctly:

In [178]:
economy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5648 entries, 0 to 5647
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id_country      5648 non-null   object 
 1   Country         5648 non-null   object 
 2   Year            5648 non-null   object 
 3   CONSUMER_PRICE  5648 non-null   float64
 4   CONS_EXPEND     5648 non-null   float64
 5   GDP             5648 non-null   float64
 6   GDP_GROWTH      5648 non-null   float64
 7   GNI_CAPITA      5648 non-null   float64
 8   GROSS_SAVINGS   5648 non-null   float64
dtypes: float64(6), object(3)
memory usage: 397.2+ KB
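
The output above shows that `Year` is still stored as `object` in the World Bank frames, whereas the Datahub frames store `Year` as `int64`. If the two kinds of frames need to be compared or joined on the year, a cast along the following lines may help. This is a sketch that assumes the World Bank year labels are strings such as `'2021'`; `wb_like` is a made-up stand-in frame.

```python
import pandas as pd

# Stand-in for a World Bank frame after pipeline_wb (assumed string year labels).
wb_like = pd.DataFrame({
    'id_country': ['ZWE', 'ZWE'],
    'Year': ['2021', '2020'],
})

# Cast the year labels to integers so they match Datahub's integer years.
wb_like['Year'] = wb_like['Year'].astype('int64')

year_dtype = str(wb_like['Year'].dtype)
print(year_dtype)  # int64
```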


### People

In [179]:
people = pipeline_wb.fit_transform(people)

We check the pipeline got applied correctly:

In [180]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5406 entries, 0 to 5405
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id_country  5406 non-null   object 
 1   Country     5406 non-null   object 
 2   Year        5406 non-null   object 
 3   gover_exp   5406 non-null   float64
 4   unemploy    5406 non-null   float64
dtypes: float64(2), object(3)
memory usage: 211.3+ KB


### Environment

Access to electricity (% of population), people using at least basic sanitation services in urban areas, and population density.

In [181]:
enviroment = pipeline_wb.fit_transform(enviroment)
pop_density = pipeline_dh.fit_transform(pop_density)

We check the pipeline got applied correctly:

In [182]:
enviroment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5523 entries, 0 to 5522
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id_country        5523 non-null   object 
 1   Country           5523 non-null   object 
 2   Year              5523 non-null   object 
 3   elect             5523 non-null   float64
 4   basic_sanitation  5523 non-null   float64
dtypes: float64(2), object(3)
memory usage: 215.9+ KB


In [183]:
pop_density.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4449 entries, 39 to 14338
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  4449 non-null   object 
 1   Country Name  4449 non-null   object 
 2   Year          4449 non-null   int64  
 3   Value         4449 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 173.8+ KB


### Poverty

In [184]:
population_below = pipeline_dh.fit_transform(population_below)
maternal_mortality = pipeline_dh.fit_transform(maternal_mortality)
tuberculosis = pipeline_dh.fit_transform(tuberculosis)
contributing_f = pipeline_dh.fit_transform(contributing_f)
contributing_m = pipeline_dh.fit_transform(contributing_m)

We check the pipeline got applied correctly:

In [185]:
population_below.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1188 entries, 7 to 1610
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  1188 non-null   object 
 1   Country Name  1188 non-null   object 
 2   Year          1188 non-null   int64  
 3   Value         1188 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 46.4+ KB


In [186]:
maternal_mortality.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3664 entries, 10 to 5953
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Country Code  3664 non-null   object
 1   Country Name  3664 non-null   object
 2   Year          3664 non-null   int64 
 3   Value         3664 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 143.1+ KB


In [187]:
tuberculosis.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3850 entries, 0 to 3849
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  3850 non-null   object 
 1   Country Name  3850 non-null   object 
 2   Year          3850 non-null   int64  
 3   Value         3850 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 150.4+ KB


In [188]:
contributing_f.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4194 entries, 9 to 6290
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  4194 non-null   object 
 1   Country Name  4194 non-null   object 
 2   Year          4194 non-null   int64  
 3   Value         4194 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 163.8+ KB


In [189]:
contributing_m.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4194 entries, 9 to 6290
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  4194 non-null   object 
 1   Country Name  4194 non-null   object 
 2   Year          4194 non-null   int64  
 3   Value         4194 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 163.8+ KB


### States

In [190]:
state = pipeline_wb.fit_transform(state)
gdp_percapita = pipeline_dh.fit_transform(gdp_percapita)
povertygap = pipeline_dh.fit_transform(povertygap)

We check the pipeline got applied correctly:

In [191]:
state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5529 entries, 0 to 5528
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id_country  5529 non-null   object 
 1   Country     5529 non-null   object 
 2   Year        5529 non-null   object 
 3   profit_tax  5529 non-null   float64
 4   mobilesubs  5529 non-null   float64
dtypes: float64(2), object(3)
memory usage: 216.1+ KB


In [192]:
gdp_percapita.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4011 entries, 10 to 6193
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  4011 non-null   object 
 1   Country Name  4011 non-null   object 
 2   Year          4011 non-null   int64  
 3   Value         4011 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 156.7+ KB


In [193]:
povertygap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 230 entries, 0 to 241
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Code  230 non-null    object 
 1   Country Name  230 non-null    object 
 2   Year          230 non-null    int64  
 3   Value         230 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 9.0+ KB


---

# Final Conclusion

These two pipelines leave the datasets clean, with no null values in the outputs shown above, and ready to be used in machine learning algorithms and Power BI dashboards.

---