ETL. Extracción y transformación
EDA. Análisis preliminar de calidad de datos

In [1]:
import pandas as pd
import requests
import json

## WORLD BANK

Como el objetivo principal del trabajo es analizar cúales con los factores socioeconómicos que más inciden en la Eperanza de Vida de los (35) países de que integran la OEA (Organización de Estados Americanos), nos vamos vamos a focalizar en buscar indicadores que tengan que ver con el desarrollo económico y el crecimiento de los países. 
Como fuente de datos, nos vamos a nutrir de las bases de datos del World Bank y de base de datos externas de otros organismos internacionales especializados en desarrollo.
Vamos a poceder a extrer los datos de la base del Banco Mundial utilizando la API que provee el WB.
Creamos un código para poder extraer los distintos indicadores potenciales para ser analizados, de las distintas bases de datos que se elojan en el sitio del organismo internacional. Las principales bases de datos utilizadas del BM son World Economic Indicators(WEI) y Health & Nutrition indicators.
El análisis preliminar nos permite hacer un corte en los datos, que se tomarán datos a partir de 1990, debido a al aumento de la disponibilidad de la data del WB y la externa al WB (aún a definir?). 
Elegimos a priori aproximadamente 50 indicadores, según los grandes temas o tópicos de las bases del BM que tienen que ver con factores socioeconómicos que más inciden en la esperanza de vida, y luego apoyados en base a literatura especializada en el tema. (Fuente)
Al analizar los indicadores prelimminarmente, advertimos que hay variables que podrían enriquecer nuestros análisis, pero que si bien existen en la base de datos del BM, hay muchos nulos y la calidad por tanto no es buena. Dedicimos usar bases de datos externas para aumentar la calidad de los datos, y contar con indicadores confiables en hábitos como consumo de Tabaco, Obesidad, y gasto publico social de los gobiernos.
Bases de datos externas utilizadas:  Cepal, Naciones Unidas


In [2]:
#Código para extraer los datos con la API del WB
#particionamos la información de los países e indicadores para realizar la extracción
# Países de la OEA
countries = ["USA",
    "ATG",
    "ARG",
    "BHS",
    "BRB",
    "BLZ",
    
]


# Indicadores elegidos luego de consultar las distintas base de datos en desarrollo y crecimiento del WB

indicators = ["SH.STA.TRAF.P5",
"SH.DYN.NCOM.ZS",
"SH.STA.WASH.P5",
    "FX.OWN.TOTL.ZS",
    "SP.DYN.CBRT.IN",
    "SH.XPD.KHEX.GD.ZS",
    "EN.ATM.CO2E.PP.GD",
    "SE.COM.DURS",
    "CC.EST",
    "SH.XPD.CHEX.PP.CD",
    "SP.DYN.CDRT.IN",
    "SH.XPD.GHED.GD.ZS",
    "SH.XPD.GHED.PP.CD",
    "SH.XPD.PVTD.CH.ZS",
    "SE.TER.CUAT.BA.ZS",
    "SE.SEC.CUAT.LO.ZS",
    "SP.DYN.TFRT.IN",
    "NY.GDP.MKTP.PP.KD", "NY.GDP.MKTP.KD.ZG",
    "NY.GDP.PCAP.PP.KD",
    "SI.POV.GINI",
    "SH.STA.IYCF.ZS",    
    "SP.DYN.LE00.FE.IN",
    "SP.DYN.LE00.IN",
    "SE.ADT.LITR.ZS",
    "SP.DYN.AMRT.FE",
    "SP.DYN.AMRT.MA",
    "SP.DYN.IMRT.IN",
    "SH.DTH.NMRT",
    "SH.STA.BASS.ZS",
    "SH.H2O.SMDW.ZS",
    "EN.ATM.PM25.MC.ZS",
    "PV.EST",
    "SP.POP.80UP.FE",
    "SP.POP.80UP.MA",
    "SP.POP.TOTL.FE.IN",
    "SP.POP.TOTL.FE.ZS",
    "SP.POP.TOTL.MA.IN",
    "SP.POP.TOTL.MA.ZS",
    "SP.POP.TOTL",
    "SI.POV.UMIC.GP",
    "SH.STA.OWAD.ZS",
    "SH.STA.OWGH.ZS",
    "SE.XPD.TOTL.GD.ZS",
    "SP.RUR.TOTL.ZS",
    "SL.UEM.TOTL.ZS",
    "SP.URB.TOTL",
    "SP.URB.TOTL.IN.ZS",
      "EN.POP.DNST",
"SP.DYN.LE00.MA.IN",
"SH.STA.ANVC.ZS",
"SN.ITK.DEFC.ZS",
"SH.STA.DIAB.ZS",
"SH.ALC.PCAP.LI"
]

# Años deseados para extraer los datos
#En base analisis preliminares y a la disponibildiad de las series de tiempo de lo indicadores, 
# optamos por tomar la data a partir de 1990, cuando comienza a aumentar la cantidad de datos 
# relacionados con el desarrollo
#Vamos a bajar los datos en distintos dataframes que luego se van a concatenar para unificarlo en un dataset
start_year = "1990"
end_year = "2022"

# Lista para almacenar DataFrames individuales
data_frames = []

# URL base de la API del Banco Mundial
base_url = "http://api.worldbank.org/v2/country"

# Realiza las consultas para cada país, indicador y año
for country_code in countries:
    for indicator in indicators:
        # Construye la URL de la consulta
        url = f"{base_url}/{country_code}/indicator/{indicator}?date={start_year}:{end_year}&format=json"
        
        # Realiza la solicitud GET a la API del Banco Mundial
        response = requests.get(url)
        
        # Verifica si la solicitud fue exitosa
        if response.status_code == 200:
            data = response.json()
            # Los datos se encuentran en data[1]
            for entry in data[1]:
                year = entry['date']
                value = entry['value']
                indicator_name = entry['indicator']['value'] 
                country_name = entry['country']['value']
                data_frames.append(pd.DataFrame({"País": [country_name], "Indicador": [indicator_name], "Año": [year], "Valor": [value]}))

# Concatenar todos los DataFrames individuales en uno
data_df0 = pd.concat(data_frames, ignore_index=True)

# Mostrar los datos en una tabla
data_df0

  data_df0 = pd.concat(data_frames, ignore_index=True)


Unnamed: 0,País,Indicador,Año,Valor
0,United States,Mortality caused by road traffic injury (per 1...,2022,
1,United States,Mortality caused by road traffic injury (per 1...,2021,
2,United States,Mortality caused by road traffic injury (per 1...,2020,
3,United States,Mortality caused by road traffic injury (per 1...,2019,12.7
4,United States,Mortality caused by road traffic injury (per 1...,2018,12.6
...,...,...,...,...
10687,Belize,Total alcohol consumption per capita (liters o...,1994,
10688,Belize,Total alcohol consumption per capita (liters o...,1993,
10689,Belize,Total alcohol consumption per capita (liters o...,1992,
10690,Belize,Total alcohol consumption per capita (liters o...,1991,


In [3]:
#Continuamos con la extraccion extracción de datos de los países
countries = ["CAN",
    "CHL",
    "COL",
    "CRI",
    "CUB",
    "DOM",
    "ECU"]
    
 

# Indicadores que deseas consultar
indicators = ["SH.STA.TRAF.P5",
"SH.DYN.NCOM.ZS",
"SH.STA.WASH.P5",
    "FX.OWN.TOTL.ZS",
    "SP.DYN.CBRT.IN",
    "SH.XPD.KHEX.GD.ZS",
    "EN.ATM.CO2E.PP.GD",
    "SE.COM.DURS",
    "CC.EST",
    "SH.XPD.CHEX.PP.CD",
    "SP.DYN.CDRT.IN",
    "SH.XPD.GHED.GD.ZS",
    "SH.XPD.GHED.PP.CD",
    "SH.XPD.PVTD.CH.ZS",
    "SE.TER.CUAT.BA.ZS",
    "SE.SEC.CUAT.LO.ZS",
    "SP.DYN.TFRT.IN",
    "NY.GDP.MKTP.PP.KD",
    "NY.GDP.PCAP.PP.KD", "NY.GDP.MKTP.KD.ZG", 
    "SI.POV.GINI",
    "SH.STA.IYCF.ZS",
    "SP.DYN.LE00.FE.IN",
    "SP.DYN.LE00.IN",
    "SE.ADT.LITR.ZS",
    "SP.DYN.AMRT.FE",
    "SP.DYN.AMRT.MA",
    "SP.DYN.IMRT.IN",
    "SH.DTH.NMRT",
    "SH.STA.BASS.ZS",
    "SH.H2O.SMDW.ZS",
    "EN.ATM.PM25.MC.ZS",
    "PV.EST",
    "SP.POP.80UP.FE",
    "SP.POP.80UP.MA",
    "SP.POP.TOTL.FE.IN",
    "SP.POP.TOTL.FE.ZS",
    "SP.POP.TOTL.MA.IN",
    "SP.POP.TOTL.MA.ZS",
    "SP.POP.TOTL",
    "SI.POV.UMIC.GP",
    "SH.STA.OWAD.ZS",
    "SH.STA.OWGH.ZS",
    "SE.XPD.TOTL.GD.ZS",
    "SP.RUR.TOTL.ZS",
    "SL.UEM.TOTL.ZS",
    "SP.URB.TOTL",
    "SP.URB.TOTL.IN.ZS",
      "EN.POP.DNST",
"SP.DYN.LE00.MA.IN",
"SH.STA.ANVC.ZS",
"SN.ITK.DEFC.ZS",
"SH.STA.DIAB.ZS",
"SH.ALC.PCAP.LI"
]




# Años elegidos
start_year = "1990"
end_year = "2022"

# Lista para almacenar DataFrames individuales
data_frames = []

# URL base de la API del Banco Mundial
base_url = "http://api.worldbank.org/v2/country"

# Realiza las consultas para cada país, indicador y año
for country_code in countries:
    for indicator in indicators:
        # Construye la URL de la consulta
        url = f"{base_url}/{country_code}/indicator/{indicator}?date={start_year}:{end_year}&format=json"
        
        # Realiza la solicitud GET a la API del Banco Mundial
        response = requests.get(url)
        
        # Verifica si la solicitud fue exitosa
        if response.status_code == 200:
            data = response.json()
            # Los datos se encuentran en data[1]
            for entry in data[1]:
                year = entry['date']
                value = entry['value']
                indicator_name = entry['indicator']['value'] 
                country_name = entry['country']['value']
                data_frames.append(pd.DataFrame({"País": [country_name], "Indicador": [indicator_name], "Año": [year], "Valor": [value]}))

# Concatenar todos los DataFrames individuales en uno
data_df = pd.concat(data_frames, ignore_index=True)

# Mostrar los datos en una tabla
data_df

  data_df = pd.concat(data_frames, ignore_index=True)


Unnamed: 0,País,Indicador,Año,Valor
0,Canada,Mortality caused by road traffic injury (per 1...,2022,
1,Canada,Mortality caused by road traffic injury (per 1...,2021,
2,Canada,Mortality caused by road traffic injury (per 1...,2020,
3,Canada,Mortality caused by road traffic injury (per 1...,2019,5.3
4,Canada,Mortality caused by road traffic injury (per 1...,2018,5.4
...,...,...,...,...
12469,Ecuador,Total alcohol consumption per capita (liters o...,1994,
12470,Ecuador,Total alcohol consumption per capita (liters o...,1993,
12471,Ecuador,Total alcohol consumption per capita (liters o...,1992,
12472,Ecuador,Total alcohol consumption per capita (liters o...,1991,


In [4]:
#Continuamos con el mismo proceso 
countries = [
    "GRD",
    "GTM",
    "GUY",
    "HTI",
    "HND",
    "JAM" ]
# Indicadores que deseas consultar
indicators = ["SH.STA.TRAF.P5",
"SH.DYN.NCOM.ZS",
"SH.STA.WASH.P5",
    "FX.OWN.TOTL.ZS",
    "SP.DYN.CBRT.IN",
    "SH.XPD.KHEX.GD.ZS",
    "EN.ATM.CO2E.PP.GD",
    "SE.COM.DURS",
    "CC.EST",
    "SH.XPD.CHEX.PP.CD",
    "SP.DYN.CDRT.IN",
    "SH.XPD.GHED.GD.ZS",
    "SH.XPD.GHED.PP.CD",
    "SH.XPD.PVTD.CH.ZS",
    "SE.TER.CUAT.BA.ZS",
    "SE.SEC.CUAT.LO.ZS",
    "SP.DYN.TFRT.IN",
    "NY.GDP.MKTP.PP.KD",
    "NY.GDP.PCAP.PP.KD", "NY.GDP.MKTP.KD.ZG",
    "SI.POV.GINI",
    "SH.STA.IYCF.ZS",
    "SP.DYN.LE00.FE.IN",
    "SP.DYN.LE00.IN",
    "SE.ADT.LITR.ZS",
    "SP.DYN.AMRT.FE",
    "SP.DYN.AMRT.MA",
    "SP.DYN.IMRT.IN",
    "SH.DTH.NMRT",
    "SH.STA.BASS.ZS",
    "SH.H2O.SMDW.ZS",
    "EN.ATM.PM25.MC.ZS",
    "PV.EST",
    "SP.POP.80UP.FE",
    "SP.POP.80UP.MA",
    "SP.POP.TOTL.FE.IN",
    "SP.POP.TOTL.FE.ZS",
    "SP.POP.TOTL.MA.IN",
    "SP.POP.TOTL.MA.ZS",
    "SP.POP.TOTL",
    "SI.POV.UMIC.GP",
    "SH.STA.OWAD.ZS",
    "SH.STA.OWGH.ZS",
    "SE.XPD.TOTL.GD.ZS",
    "SP.RUR.TOTL.ZS",
    "SL.UEM.TOTL.ZS",
    "SP.URB.TOTL",
    "SP.URB.TOTL.IN.ZS",
      "EN.POP.DNST",
"SP.DYN.LE00.MA.IN",
"SH.STA.ANVC.ZS",
"SN.ITK.DEFC.ZS",
"SH.STA.DIAB.ZS",
"SH.ALC.PCAP.LI"
]




# Años elegidos
start_year = "1990"
end_year = "2022"

# Lista para almacenar DataFrames individuales
data_frames = []

# URL base de la API del Banco Mundial
base_url = "http://api.worldbank.org/v2/country"

# Realiza las consultas para cada país, indicador y año
for country_code in countries:
    for indicator in indicators:
        # Construye la URL de la consulta
        url = f"{base_url}/{country_code}/indicator/{indicator}?date={start_year}:{end_year}&format=json"
        
        # Realiza la solicitud GET a la API del Banco Mundial
        response = requests.get(url)
        
        # Verifica si la solicitud fue exitosa
        if response.status_code == 200:
            data = response.json()
            # Los datos se encuentran en data[1]
            for entry in data[1]:
                year = entry['date']
                value = entry['value']
                indicator_name = entry['indicator']['value'] 
                country_name = entry['country']['value']
                data_frames.append(pd.DataFrame({"País": [country_name], "Indicador": [indicator_name], "Año": [year], "Valor": [value]}))

# Concatenar todos los DataFrames individuales en uno
data_df_a = pd.concat(data_frames, ignore_index=True)

# Mostrar los datos en una tabla
data_df_a

  data_df_a = pd.concat(data_frames, ignore_index=True)


Unnamed: 0,País,Indicador,Año,Valor
0,Grenada,Mortality caused by road traffic injury (per 1...,2022,
1,Grenada,Mortality caused by road traffic injury (per 1...,2021,
2,Grenada,Mortality caused by road traffic injury (per 1...,2020,
3,Grenada,Mortality caused by road traffic injury (per 1...,2019,8.0
4,Grenada,Mortality caused by road traffic injury (per 1...,2018,7.8
...,...,...,...,...
10687,Jamaica,Total alcohol consumption per capita (liters o...,1994,
10688,Jamaica,Total alcohol consumption per capita (liters o...,1993,
10689,Jamaica,Total alcohol consumption per capita (liters o...,1992,
10690,Jamaica,Total alcohol consumption per capita (liters o...,1991,


In [5]:
#Sigue la lista de países para exraer la data ...
countries = [ 
             "BOL", "BRA", "PER", "DOM", "KNA", "SLV"
             ]
# Indicadores que deseas consultar
indicators = ["SH.STA.TRAF.P5",
"SH.DYN.NCOM.ZS",
"SH.STA.WASH.P5",
    "FX.OWN.TOTL.ZS",
    "SP.DYN.CBRT.IN",
    "SH.XPD.KHEX.GD.ZS",
    "EN.ATM.CO2E.PP.GD",
    "SE.COM.DURS",
    "CC.EST",
    "SH.XPD.CHEX.PP.CD",
    "SP.DYN.CDRT.IN",
    "SH.XPD.GHED.GD.ZS",
    "SH.XPD.GHED.PP.CD",
    "SH.XPD.PVTD.CH.ZS",
    "SE.TER.CUAT.BA.ZS",
    "SE.SEC.CUAT.LO.ZS",
    "SP.DYN.TFRT.IN",
    "NY.GDP.MKTP.PP.KD",
    "NY.GDP.PCAP.PP.KD", "NY.GDP.MKTP.KD.ZG",
    "SI.POV.GINI",
    "SH.STA.IYCF.ZS",
    "SP.DYN.LE00.FE.IN",
    "SP.DYN.LE00.IN",
    "SE.ADT.LITR.ZS",
    "SP.DYN.AMRT.FE",
    "SP.DYN.AMRT.MA",
    "SP.DYN.IMRT.IN",
    "SH.DTH.NMRT",
    "SH.STA.BASS.ZS",
    "SH.H2O.SMDW.ZS",
    "EN.ATM.PM25.MC.ZS",
    "PV.EST",
    "SP.POP.80UP.FE",
    "SP.POP.80UP.MA",
    "SP.POP.TOTL.FE.IN",
    "SP.POP.TOTL.FE.ZS",
    "SP.POP.TOTL.MA.IN",
    "SP.POP.TOTL.MA.ZS",
    "SP.POP.TOTL",
    "SI.POV.UMIC.GP",
    "SH.STA.OWAD.ZS",
    "SH.STA.OWGH.ZS",
    "SE.XPD.TOTL.GD.ZS",
    "SP.RUR.TOTL.ZS",
    "SL.UEM.TOTL.ZS",
    "SP.URB.TOTL",
    "SP.URB.TOTL.IN.ZS",
      "EN.POP.DNST",
"SP.DYN.LE00.MA.IN",
"SH.STA.ANVC.ZS",
"SN.ITK.DEFC.ZS",
"SH.STA.DIAB.ZS",
"SH.ALC.PCAP.LI"
]

# Años elegidos 
start_year = "1990"
end_year = "2022"

# Lista para almacenar DataFrames individuales
data_frames = []

# URL base de la API del Banco Mundial
base_url = "http://api.worldbank.org/v2/country"

# Realiza las consultas para cada país, indicador y año
for country_code in countries:
    for indicator in indicators:
        # Construye la URL de la consulta
        url = f"{base_url}/{country_code}/indicator/{indicator}?date={start_year}:{end_year}&format=json"
        
        # Realiza la solicitud GET a la API del Banco Mundial
        response = requests.get(url)
        
        # Verifica si la solicitud fue exitosa
        if response.status_code == 200:
            data = response.json()
            # Los datos se encuentran en data[1]
            for entry in data[1]:
                year = entry['date']
                value = entry['value']
                indicator_name = entry['indicator']['value'] 
                country_name = entry['country']['value']
                data_frames.append(pd.DataFrame({"País": [country_name], "Indicador": [indicator_name], "Año": [year], "Valor": [value]}))

# Concatenar todos los DataFrames individuales en uno
data_df1 = pd.concat(data_frames, ignore_index=True)

# Mostrar los datos en una tabla
data_df1

  data_df1 = pd.concat(data_frames, ignore_index=True)


Unnamed: 0,País,Indicador,Año,Valor
0,Bolivia,Mortality caused by road traffic injury (per 1...,2022,
1,Bolivia,Mortality caused by road traffic injury (per 1...,2021,
2,Bolivia,Mortality caused by road traffic injury (per 1...,2020,
3,Bolivia,Mortality caused by road traffic injury (per 1...,2019,21.1
4,Bolivia,Mortality caused by road traffic injury (per 1...,2018,20.7
...,...,...,...,...
10687,El Salvador,Total alcohol consumption per capita (liters o...,1994,
10688,El Salvador,Total alcohol consumption per capita (liters o...,1993,
10689,El Salvador,Total alcohol consumption per capita (liters o...,1992,
10690,El Salvador,Total alcohol consumption per capita (liters o...,1991,


In [6]:
#Sigue la lista de países para exraer la data ...
countries = [
    "MEX", "NIC", "PAN", "PRY" 
]


# Indicadores que deseas consultar
indicators = ["SH.STA.TRAF.P5",
"SH.DYN.NCOM.ZS",
"SH.STA.WASH.P5",
    "FX.OWN.TOTL.ZS",
    "SP.DYN.CBRT.IN",
    "SH.XPD.KHEX.GD.ZS",
    "EN.ATM.CO2E.PP.GD",
    "SE.COM.DURS",
    "CC.EST",
    "SH.XPD.CHEX.PP.CD",
    "SP.DYN.CDRT.IN",
    "SH.XPD.GHED.GD.ZS",
    "SH.XPD.GHED.PP.CD",
    "SH.XPD.PVTD.CH.ZS",
    "SE.TER.CUAT.BA.ZS",
    "SE.SEC.CUAT.LO.ZS",
    "SP.DYN.TFRT.IN",
    "NY.GDP.MKTP.PP.KD",
    "NY.GDP.PCAP.PP.KD", "NY.GDP.MKTP.KD.ZG",
    "SI.POV.GINI",
    "SH.STA.IYCF.ZS",
    "SP.DYN.LE00.FE.IN",
    "SP.DYN.LE00.IN",
    "SE.ADT.LITR.ZS",
    "SP.DYN.AMRT.FE",
    "SP.DYN.AMRT.MA",
    "SP.DYN.IMRT.IN",
    "SH.DTH.NMRT",
    "SH.STA.BASS.ZS",
    "SH.H2O.SMDW.ZS",
    "EN.ATM.PM25.MC.ZS",
    "PV.EST",
    "SP.POP.80UP.FE",
    "SP.POP.80UP.MA",
    "SP.POP.TOTL.FE.IN",
    "SP.POP.TOTL.FE.ZS",
    "SP.POP.TOTL.MA.IN",
    "SP.POP.TOTL.MA.ZS",
    "SP.POP.TOTL",
    "SI.POV.UMIC.GP",
    "SH.STA.OWAD.ZS",
    "SH.STA.OWGH.ZS",
    "SE.XPD.TOTL.GD.ZS",
    "SP.RUR.TOTL.ZS",
    "SL.UEM.TOTL.ZS",
    "SP.URB.TOTL",
    "SP.URB.TOTL.IN.ZS",
      "EN.POP.DNST",
"SP.DYN.LE00.MA.IN",
"SH.STA.ANVC.ZS",
"SN.ITK.DEFC.ZS",
"SH.STA.DIAB.ZS",
"SH.ALC.PCAP.LI"
]

# Años elegidos 
start_year = "1990"
end_year = "2022"

# Lista para almacenar DataFrames individuales
data_frames = []

# URL base de la API del Banco Mundial
base_url = "http://api.worldbank.org/v2/country"

# Realiza las consultas para cada país, indicador y año
for country_code in countries:
    for indicator in indicators:
        # Construye la URL de la consulta
        url = f"{base_url}/{country_code}/indicator/{indicator}?date={start_year}:{end_year}&format=json"
        
        # Realiza la solicitud GET a la API del Banco Mundial
        response = requests.get(url)
        
        # Verifica si la solicitud fue exitosa
        if response.status_code == 200:
            data = response.json()
            # Los datos se encuentran en data[1]
            for entry in data[1]:
                year = entry['date']
                value = entry['value']
                indicator_name = entry['indicator']['value'] 
                country_name = entry['country']['value']
                data_frames.append(pd.DataFrame({"País": [country_name], "Indicador": [indicator_name], "Año": [year], "Valor": [value]}))

# Concatenar todos los DataFrames individuales en uno
data_df2 = pd.concat(data_frames, ignore_index=True)

# Mostrar los datos en una tabla
data_df2

  data_df2 = pd.concat(data_frames, ignore_index=True)


Unnamed: 0,País,Indicador,Año,Valor
0,Mexico,Mortality caused by road traffic injury (per 1...,2022,
1,Mexico,Mortality caused by road traffic injury (per 1...,2021,
2,Mexico,Mortality caused by road traffic injury (per 1...,2020,
3,Mexico,Mortality caused by road traffic injury (per 1...,2019,12.8
4,Mexico,Mortality caused by road traffic injury (per 1...,2018,13.0
...,...,...,...,...
7123,Paraguay,Total alcohol consumption per capita (liters o...,1994,
7124,Paraguay,Total alcohol consumption per capita (liters o...,1993,
7125,Paraguay,Total alcohol consumption per capita (liters o...,1992,
7126,Paraguay,Total alcohol consumption per capita (liters o...,1991,


In [7]:
#Ultima extracción de data para analizar los países que integran la OEA
countries =[ "LCA", "VCT", "SUR", "TTO", "URY", "VEN"  ]

# Indicadores que deseas consultar
indicators = ["SH.STA.TRAF.P5",
"SH.DYN.NCOM.ZS",
"SH.STA.WASH.P5",
    "FX.OWN.TOTL.ZS",
    "SP.DYN.CBRT.IN",
    "SH.XPD.KHEX.GD.ZS",
    "EN.ATM.CO2E.PP.GD",
    "SE.COM.DURS",
    "CC.EST",
    "SH.XPD.CHEX.PP.CD",
    "SP.DYN.CDRT.IN",
    "SH.XPD.GHED.GD.ZS",
    "SH.XPD.GHED.PP.CD",
    "SH.XPD.PVTD.CH.ZS",
    "SE.TER.CUAT.BA.ZS",
    "SE.SEC.CUAT.LO.ZS",
    "SP.DYN.TFRT.IN",
    "NY.GDP.MKTP.PP.KD",
    "NY.GDP.PCAP.PP.KD", "NY.GDP.MKTP.KD.ZG",
    "SI.POV.GINI",
    "SH.STA.IYCF.ZS",
    "SP.DYN.LE00.FE.IN",
    "SP.DYN.LE00.IN",
    "SE.ADT.LITR.ZS",
    "SP.DYN.AMRT.FE",
    "SP.DYN.AMRT.MA",
    "SP.DYN.IMRT.IN",
    "SH.DTH.NMRT",
    "SH.STA.BASS.ZS",
    "SH.H2O.SMDW.ZS",
    "EN.ATM.PM25.MC.ZS",
    "PV.EST",
    "SP.POP.80UP.FE",
    "SP.POP.80UP.MA",
    "SP.POP.TOTL.FE.IN",
    "SP.POP.TOTL.FE.ZS",
    "SP.POP.TOTL.MA.IN",
    "SP.POP.TOTL.MA.ZS",
    "SP.POP.TOTL",
    "SI.POV.UMIC.GP",
    "SH.STA.OWAD.ZS",
    "SH.STA.OWGH.ZS",
    "SE.XPD.TOTL.GD.ZS",
    "SP.RUR.TOTL.ZS",
    "SL.UEM.TOTL.ZS",
    "SP.URB.TOTL",
    "SP.URB.TOTL.IN.ZS",
      "EN.POP.DNST",
"SP.DYN.LE00.MA.IN",
"SH.STA.ANVC.ZS",
"SN.ITK.DEFC.ZS",
"SH.STA.DIAB.ZS",
"SH.ALC.PCAP.LI"
]




# Años elegidos
start_year = "1990"
end_year = "2022"

# Lista para almacenar DataFrames individuales
data_frames = []

# URL base de la API del Banco Mundial
base_url = "http://api.worldbank.org/v2/country"

# Realiza las consultas para cada país, indicador y año
for country_code in countries:
    for indicator in indicators:
        # Construye la URL de la consulta
        url = f"{base_url}/{country_code}/indicator/{indicator}?date={start_year}:{end_year}&format=json"
        
        # Realiza la solicitud GET a la API del Banco Mundial
        response = requests.get(url)
        
        # Verifica si la solicitud fue exitosa
        if response.status_code == 200:
            data = response.json()
            # Los datos se encuentran en data[1]
            for entry in data[1]:
                year = entry['date']
                value = entry['value']
                indicator_name = entry['indicator']['value'] 
                country_name = entry['country']['value']
                data_frames.append(pd.DataFrame({"País": [country_name], "Indicador": [indicator_name], "Año": [year], "Valor": [value]}))

# Concatenar todos los DataFrames individuales en uno
data_df3 = pd.concat(data_frames, ignore_index=True)

# Mostrar los datos en una tabla
data_df3

  data_df3 = pd.concat(data_frames, ignore_index=True)


Unnamed: 0,País,Indicador,Año,Valor
0,St. Lucia,Mortality caused by road traffic injury (per 1...,2022,
1,St. Lucia,Mortality caused by road traffic injury (per 1...,2021,
2,St. Lucia,Mortality caused by road traffic injury (per 1...,2020,
3,St. Lucia,Mortality caused by road traffic injury (per 1...,2019,29.8
4,St. Lucia,Mortality caused by road traffic injury (per 1...,2018,26.7
...,...,...,...,...
10687,"Venezuela, RB",Total alcohol consumption per capita (liters o...,1994,
10688,"Venezuela, RB",Total alcohol consumption per capita (liters o...,1993,
10689,"Venezuela, RB",Total alcohol consumption per capita (liters o...,1992,
10690,"Venezuela, RB",Total alcohol consumption per capita (liters o...,1991,


In [8]:
# Unificamos los dataframes
concatenated_df2 = pd.concat([data_df0, data_df, data_df1, data_df3, data_df2, data_df_a], ignore_index=True)

# El argumento 'ignore_index=True' restablece los índices de fila para que sean secuenciales.

# Ahora 'concatenated_df' contiene todos los datos de los tres DataFrames.


In [9]:
concatenated_df2

Unnamed: 0,País,Indicador,Año,Valor
0,United States,Mortality caused by road traffic injury (per 1...,2022,
1,United States,Mortality caused by road traffic injury (per 1...,2021,
2,United States,Mortality caused by road traffic injury (per 1...,2020,
3,United States,Mortality caused by road traffic injury (per 1...,2019,12.7
4,United States,Mortality caused by road traffic injury (per 1...,2018,12.6
...,...,...,...,...
62365,Jamaica,Total alcohol consumption per capita (liters o...,1994,
62366,Jamaica,Total alcohol consumption per capita (liters o...,1993,
62367,Jamaica,Total alcohol consumption per capita (liters o...,1992,
62368,Jamaica,Total alcohol consumption per capita (liters o...,1991,


In [10]:
#Comprobamos que los indicadores se expongan de manera correcta luego de la concatenación
import pandas as pd

# Supongamos que tu DataFrame se llama concatenated_df2

# Convierte la columna 'año' a enteros
concatenated_df2['Año'] = concatenated_df2['Año'].astype(int)

# Filtra las filas para Argentina y el indicador "Population, total" en los años 1990 a 2000
filtro = (concatenated_df2['País'] == 'Argentina') & (concatenated_df2['Indicador'] == 'Population, total') & (concatenated_df2['Año'] >= 1990) & (concatenated_df2['Año'] <= 2000)
resultado = concatenated_df2[filtro]

# Muestra el DataFrame resultante
print(resultado)


           País          Indicador   Año       Valor
4873  Argentina  Population, total  2000  37070774.0
4874  Argentina  Population, total  1999  36653031.0
4875  Argentina  Population, total  1998  36233195.0
4876  Argentina  Population, total  1997  35815971.0
4877  Argentina  Population, total  1996  35389362.0
4878  Argentina  Population, total  1995  34946110.0
4879  Argentina  Population, total  1994  34488696.0
4880  Argentina  Population, total  1993  34027240.0
4881  Argentina  Population, total  1992  33568285.0
4882  Argentina  Population, total  1991  33105763.0
4883  Argentina  Population, total  1990  32637657.0


In [11]:
concatenated_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62370 entries, 0 to 62369
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   País       62370 non-null  object 
 1   Indicador  62370 non-null  object 
 2   Año        62370 non-null  int32  
 3   Valor      43065 non-null  float64
dtypes: float64(1), int32(1), object(2)
memory usage: 1.7+ MB


## ORGANIZACION MUNDIAL DE LA SALUD OMS

In [12]:
#Bases de datos externas del WHO


# Lista de códigos de indicadores para consultar
indicator_codes = [ "CC_1", "AIR_12", "AIR_6", "CC_3", "CC_5", "CC_6", "AIR_4", "AIR_11",
    "AIR_41", "OCC_11", "OCC_1", "OCC_19", "OCC_21", "OCC_3", "OCC_17",
    "OCC_7", "OCC_9", "OCC_15", "OCC_5", "TOTENV_7", "TOTENV_8", "UV_4",
    "UV_2", "SHS_8", "SHS_9", "SHS_4", "SHS_6", "WSH_2", "WSH_3", "WSH_4",
    "LEAD_5", "AIR_72", "LEAD_7", "AIR_74"]  

# Rango de años
years = "1989:2022"

# Códigos ISO de los países de América
america_countries = ["ATG","ARG","BHS","BRB","BLZ","BOL","BRA","CAN","CHL","COL","CRI","CUB","DOM","ECU","GRD","GTM",
             "GUY","HTI","HND","JAM","SLV","USA","MEX","NIC","PAN","PRY","PER","DOM","KNA","LCA","VCT","SUR",
             "TTO","URY","VEN"]  


# URL base de la API de la OMS
base_url = "https://ghoapi.azureedge.net/api"

# Diccionario para almacenar la correspondencia entre códigos de indicadores y sus nombres
indicador_nombres = {}

# Obtener nombres de indicadores
for code in indicator_codes:
    indicator_url = f"{base_url}/Indicator?$filter=IndicatorCode eq '{code}'"
    response = requests.get(indicator_url)
    if response.status_code == 200:
        indicator_data = response.json().get('value', [])
        if indicator_data:
            indicador_nombres[code] = indicator_data[0]['IndicatorName']

# Obtener datos de la API
data_frames = []
for indicator_code in indicator_codes:
    for country_code in america_countries:
        params = {
            "filter": f"Year ge {years.split(':')[0]} and Year le {years.split(':')[1]} and CountryCode eq '{country_code}' and IndicatorCode eq '{indicator_code}'",
            "format": "json",
        }

        response = requests.get(f"{base_url}/{indicator_code}", params=params)

        if response.status_code == 200:
            data = response.json().get('value', [])
            if data:
                filtered_data = [item for item in data if item['SpatialDim'] == country_code]
                if filtered_data:
                    for item in filtered_data:
                        item['IndicatorName'] = indicador_nombres[indicator_code]
                    data_frames.extend(filtered_data)

# Convertir a DataFrame
df = pd.DataFrame(data_frames, columns=["SpatialDim", "IndicatorName",  'TimeDim', 'NumericValue'])

# Cambiar nombres de las columnas
df = df.rename(columns={'SpatialDim': 'País', 'IndicatorName': 'Indicador', 'TimeDim': 'Año', 'NumericValue': 'Valor'})

df

Unnamed: 0,País,Indicador,Año,Valor
0,ATG,Household air pollution attributable deaths in...,2019,0.00000
1,ATG,Household air pollution attributable deaths in...,2019,0.00000
2,ATG,Household air pollution attributable deaths in...,2019,0.00000
3,ARG,Household air pollution attributable deaths in...,2019,2.30901
4,ARG,Household air pollution attributable deaths in...,2019,0.98960
...,...,...,...,...
1901,URY,Household air pollution attributable deaths in...,2019,0.00000
1902,URY,Household air pollution attributable deaths in...,2019,0.00000
1903,VEN,Household air pollution attributable deaths in...,2019,60.43387
1904,VEN,Household air pollution attributable deaths in...,2019,26.39030


In [41]:
df["Año"].unique()

array([2019, 2002, 2004], dtype=int64)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1906 entries, 0 to 1905
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   País       1906 non-null   object 
 1   Indicador  1906 non-null   object 
 2   Año        1906 non-null   int64  
 3   Valor      1906 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 59.7+ KB


In [14]:
#Prueba
# Filtrardesde 1990 hasta 2002
filtered_df = df.loc[
    (df['Año'] >= 1990) & (df['Año'] <= 2003)
]

# El DataFrame 'filtered_df' contendrá los datos de los países ARG y VEN desde 1990 hasta 2002.


In [15]:
filtered_df.head()

Unnamed: 0,País,Indicador,Año,Valor
1530,ATG,UV radiation attributable deaths per 100'000 c...,2002,0.15
1531,ARG,UV radiation attributable deaths per 100'000 c...,2002,2.12
1532,BHS,UV radiation attributable deaths per 100'000 c...,2002,0.55
1533,BRB,UV radiation attributable deaths per 100'000 c...,2002,1.19
1534,BLZ,UV radiation attributable deaths per 100'000 c...,2002,0.48


In [16]:
#Exportamos la baase de datos externa


df.to_csv('crudo_externa_WHO.csv', index=False)



Unificamos la base da datos del WB con base de datos externa
Hasta aquí vamos ok. 
Veo que las columunaa año de casda datafram tienen distinta estructura , ylos voy a pasar a ambos a integrer.
Tambien veo que los paises  del df bajaron solo con siglas, por lo que los tengo que passar a nombre como la base del BM

In [17]:
import pandas as pd

# Suponiendo que el archivo CSV 'crudo_externa_WHO.csv' se encuentra en el mismo directorio que tu script o notebook.
df_externa = pd.read_csv('crudo_externa_WHO.csv')


In [18]:
import pandas as pd

# Supongamos que tienes dos DataFrames llamados concatenated_df2 y df.

# Verificar si los tipos de datos de las columnas 'Valor', 'Indicador' y 'País' son iguales en ambos DataFrames
columns_to_check = ['Valor', 'Indicador', 'País', 'Año']

for column in columns_to_check:
    if concatenated_df2[column].dtype == df[column].dtype:
        print(f"Los tipos de datos de la columna '{column}' son iguales en ambos DataFrames.")
    else:
        print(f"Los tipos de datos de la columna '{column}' son diferentes en ambos DataFrames.")

    # Si los tipos de datos son iguales, puedes verificar si los valores son iguales
    if concatenated_df2[column].equals(df[column]):
        print(f"Los valores en la columna '{column}' son iguales en ambos DataFrames.")
    else:
        print(f"Los valores en la columna '{column}' son diferentes en ambos DataFrames.")


Los tipos de datos de la columna 'Valor' son iguales en ambos DataFrames.
Los valores en la columna 'Valor' son diferentes en ambos DataFrames.
Los tipos de datos de la columna 'Indicador' son iguales en ambos DataFrames.
Los valores en la columna 'Indicador' son diferentes en ambos DataFrames.
Los tipos de datos de la columna 'País' son iguales en ambos DataFrames.
Los valores en la columna 'País' son diferentes en ambos DataFrames.
Los tipos de datos de la columna 'Año' son diferentes en ambos DataFrames.
Los valores en la columna 'Año' son diferentes en ambos DataFrames.


In [19]:

# Convertir la columna 'Año' en concatenated_df2 a int64
concatenated_df2['Año'] = concatenated_df2['Año'].astype('int64')

In [20]:
concatenated_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62370 entries, 0 to 62369
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   País       62370 non-null  object 
 1   Indicador  62370 non-null  object 
 2   Año        62370 non-null  int64  
 3   Valor      43065 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 1.9+ MB


In [21]:
df.head()

Unnamed: 0,País,Indicador,Año,Valor
0,ATG,Household air pollution attributable deaths in...,2019,0.0
1,ATG,Household air pollution attributable deaths in...,2019,0.0
2,ATG,Household air pollution attributable deaths in...,2019,0.0
3,ARG,Household air pollution attributable deaths in...,2019,2.30901
4,ARG,Household air pollution attributable deaths in...,2019,0.9896


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1906 entries, 0 to 1905
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   País       1906 non-null   object 
 1   Indicador  1906 non-null   object 
 2   Año        1906 non-null   int64  
 3   Valor      1906 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 59.7+ KB


In [23]:
df["Indicador"].head(300)

0      Household air pollution attributable deaths in...
1      Household air pollution attributable deaths in...
2      Household air pollution attributable deaths in...
3      Household air pollution attributable deaths in...
4      Household air pollution attributable deaths in...
                             ...                        
295    Ambient air pollution attributable deaths in c...
296    Ambient air pollution attributable deaths in c...
297    Ambient air pollution attributable deaths in c...
298    Ambient air pollution attributable deaths in c...
299    Ambient air pollution attributable deaths in c...
Name: Indicador, Length: 300, dtype: object

In [24]:

# Ahora, la columna 'País' en df contendrá nombres completos en lugar de códigos de país.
# Diccionario de códigos de país a nombres completos
pais_codigo_a_nombre = {"ATG": "Antigua y Barbuda", "BHS":"Bahamas",
    'ARG': 'Argentina',
    'VEN': 'Venezuela, RB',
    'BOL': 'Bolivia',
    'BRA': 'Brasil',
    'CHL': 'Chile',
    'COL': 'Colombia',
    'CRI': 'Costa Rica',
    'CUB': 'Cuba',
    'DMA': 'Dominica',
    'ECU': 'Ecuador',
    'SLV': 'El Salvador',
    'GRD': 'Granada',
    'GTM': 'Guatemala',
    'GUY': 'Guyana',
    'HTI': 'Haití',
    'HND': 'Honduras',
    'JAM': 'Jamaica',
    'MEX': 'México',
    'NIC': 'Nicaragua',
    'PAN': 'Panamá',
    'PRY': 'Paraguay',
    'PER': 'Perú',
    'KNA': 'San Cristóbal y Nieves',
    'LCA': 'St. Lucía',
    'VCT': 'San Vicente y las Granadinas',
    'SUR': 'Surinam',
    'TTO': 'Trinidad y Tobago',
    'USA': 'Estados Unidos',
    'URY': 'Uruguay',  
    'BLZ': 'Belice',
    'CAN': 'Canadá',
    'BRB': 'Barbados',
    'DOM': 'República Dominicana',
    'HND': 'Honduras'
    # Añade los demás países y sus nombres completos según la lista de la OEA.
}

# Asignar nombres completos a los países
df['País'] = df['País'].map(pais_codigo_a_nombre)


In [25]:
df.head()

Unnamed: 0,País,Indicador,Año,Valor
0,Antigua y Barbuda,Household air pollution attributable deaths in...,2019,0.0
1,Antigua y Barbuda,Household air pollution attributable deaths in...,2019,0.0
2,Antigua y Barbuda,Household air pollution attributable deaths in...,2019,0.0
3,Argentina,Household air pollution attributable deaths in...,2019,2.30901
4,Argentina,Household air pollution attributable deaths in...,2019,0.9896


In [26]:
paises_diferentes = df['País'].unique()
print("Lista de países diferentes en la columna 'País':")
for pais in paises_diferentes:
    print(pais)


Lista de países diferentes en la columna 'País':
Antigua y Barbuda
Argentina
Bahamas
Barbados
Belice
Bolivia
Brasil
Canadá
Chile
Colombia
Costa Rica
Cuba
República Dominicana
Ecuador
Granada
Guatemala
Guyana
Haití
Honduras
Jamaica
El Salvador
Estados Unidos
México
Nicaragua
Panamá
Paraguay
Perú
St. Lucía
San Vicente y las Granadinas
Surinam
Trinidad y Tobago
Uruguay
Venezuela, RB
San Cristóbal y Nieves


In [27]:
import pandas as pd

# Concatenar los dos DataFrames verticalmente
resulting_df = pd.concat([concatenated_df2, df], ignore_index=True)

# Verificar la forma del DataFrame resultante
print(resulting_df.shape)

# Esto mostrará la forma del DataFrame concatenado, que debería incluir todas las filas de ambos DataFrames.


(64276, 4)


In [28]:
resulting_df["Año"].unique()

array([2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012,
       2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001,
       2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990],
      dtype=int64)

In [29]:
# Supongamos que tienes un DataFrame llamado 'df'
# Verifica si hay registros duplicados
duplicates = resulting_df.duplicated(subset=['País', 'Año', 'Indicador'], keep=False)

# Muestra las filas duplicadas (si las hay)
duplicated_rows = resulting_df[duplicates]

if duplicated_rows.empty:
    print("No se encontraron registros duplicados.")
else:
    print("Registros duplicados encontrados:")
    print(duplicated_rows)


Registros duplicados encontrados:
                     País                                          Indicador  \
19602  Dominican Republic  Mortality caused by road traffic injury (per 1...   
19603  Dominican Republic  Mortality caused by road traffic injury (per 1...   
19604  Dominican Republic  Mortality caused by road traffic injury (per 1...   
19605  Dominican Republic  Mortality caused by road traffic injury (per 1...   
19606  Dominican Republic  Mortality caused by road traffic injury (per 1...   
...                   ...                                                ...   
64271             Uruguay  Household air pollution attributable deaths in...   
64272             Uruguay  Household air pollution attributable deaths in...   
64273       Venezuela, RB  Household air pollution attributable deaths in...   
64274       Venezuela, RB  Household air pollution attributable deaths in...   
64275       Venezuela, RB  Household air pollution attributable deaths in...   

     

In [30]:
#Quito duplicado y los guardo aparte 
duplicates_mask = resulting_df.duplicated(subset=['País', 'Año', 'Indicador'], keep=False)

# Crear un DataFrame con los registros duplicados
duplicates_df = resulting_df[duplicates_mask]

# Crear un nuevo DataFrame sin duplicados
resulting_df_ok = resulting_df.drop_duplicates(subset=['País', 'Año', 'Indicador'], keep='first')

# Ahora 'df' contiene los registros no duplicados, y 'duplicates_df' contiene los registros duplicados


In [31]:
# Supongamos que tienes un DataFrame llamado 'df' con la estructura proporcionada

# Utiliza la función pivot para pivotear el DataFrame
pivoted_df = resulting_df_ok.pivot(index=['País', 'Año'], columns='Indicador', values='Valor').reset_index()

# Esto creará un DataFrame en el que los indicadores se presentan como columnas separadas


In [32]:
pivoted_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171 entries, 0 to 1170
Data columns (total 68 columns):
 #   Column                                                                                                           Non-Null Count  Dtype  
---  ------                                                                                                           --------------  -----  
 0   País                                                                                                             1171 non-null   object 
 1   Año                                                                                                              1171 non-null   int64  
 2   Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)  88 non-null     float64
 3   Ambient air pollution  attributable deaths per 100'000 children under 5 years                                    33 non-null     float64
 4   Ambient air pollution attributable d

In [33]:
pivoted_df.isnull().sum()

Indicador
País                                                                                                                  0
Año                                                                                                                   0
Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)    1083
Ambient air pollution  attributable deaths per 100'000 children under 5 years                                      1138
Ambient air pollution attributable deaths                                                                          1138
                                                                                                                   ... 
Urban population                                                                                                     49
Urban population (% of total population)                                                                             49
Water, sanitation and hygiene 

In [34]:

nulos_por_columna = pivoted_df.isnull().sum()
# Encuentra las columnas con más de 1000 valores nulos
columnas_con_nulos = nulos_por_columna[nulos_por_columna > 1000].index

# Elimina las columnas con más de 1000 valores nulos
pivoted_df = pivoted_df.drop(columns=columnas_con_nulos)

In [35]:
pivoted_df.head()

Indicador,País,Año,"Birth rate, crude (per 1,000 people)",CO2 emissions (kg per PPP $ of GDP),Capital health expenditure (% of GDP),"Compulsory education, duration (years)",Control of Corruption: Estimate,"Current health expenditure per capita, PPP (current international $)","Death rate, crude (per 1,000 people)",Domestic general government health expenditure (% of GDP),...,"Population, total",Poverty gap at $6.85 a day (2017 PPP) (%),Pregnant women receiving prenatal care (%),Prevalence of overweight (% of adults),"Prevalence of overweight, weight for height (% of children under 5)",Prevalence of undernourishment (% of population),Rural population (% of total population),"Unemployment, total (% of total labor force) (modeled ILO estimate)",Urban population,Urban population (% of total population)
0,Antigua and Barbuda,1990,20.916,0.30129,,,,,7.666,,...,63328.0,,,33.7,,,64.574,,22435.0,35.426
1,Antigua and Barbuda,1991,18.504,0.289409,,,,,7.753,,...,63634.0,,,34.4,,,64.535,,22568.0,35.465
2,Antigua and Barbuda,1992,19.228,0.276056,,,,,7.745,,...,64659.0,,,34.9,,,64.915,,22686.0,35.085
3,Antigua and Barbuda,1993,18.597,0.260243,,,,,7.666,,...,65834.0,,,35.5,,,65.291,,22850.0,34.709
4,Antigua and Barbuda,1994,18.99,0.242297,,,,,7.533,,...,67072.0,,,36.1,,,65.666,,23029.0,34.334


In [36]:
pivoted_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171 entries, 0 to 1170
Data columns (total 50 columns):
 #   Column                                                                                              Non-Null Count  Dtype  
---  ------                                                                                              --------------  -----  
 0   País                                                                                                1171 non-null   object 
 1   Año                                                                                                 1171 non-null   int64  
 2   Birth rate, crude (per 1,000 people)                                                                1088 non-null   float64
 3   CO2 emissions (kg per PPP $ of GDP)                                                                 1014 non-null   float64
 4   Capital health expenditure (% of GDP)                                                               630 non-null  

In [37]:
#La base de datos mantiene aún muchos nulos en datos de países de las islas del Caribe.
#Para aumentar la calidad de los datos, vamos a quitar del análisis los países con una población menor a 2 millones de personas, que básicamente son las islas mencionadas más pequeñas.
#Umbral = 2000000

# Define un umbral de población (2 millones en este caso)
umbral_población = 2000000

# Filtra los países con población mayor al umbral
df_Sin_Caribe = pivoted_df.groupby('País').filter(lambda x: x['Population, total'].max() > umbral_población)




In [38]:
df_Sin_Caribe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 759 entries, 36 to 1170
Data columns (total 50 columns):
 #   Column                                                                                              Non-Null Count  Dtype  
---  ------                                                                                              --------------  -----  
 0   País                                                                                                759 non-null    object 
 1   Año                                                                                                 759 non-null    int64  
 2   Birth rate, crude (per 1,000 people)                                                                736 non-null    float64
 3   CO2 emissions (kg per PPP $ of GDP)                                                                 673 non-null    float64
 4   Capital health expenditure (% of GDP)                                                               431 non-null    flo

In [39]:
df_Sin_Caribe.head()

Indicador,País,Año,"Birth rate, crude (per 1,000 people)",CO2 emissions (kg per PPP $ of GDP),Capital health expenditure (% of GDP),"Compulsory education, duration (years)",Control of Corruption: Estimate,"Current health expenditure per capita, PPP (current international $)","Death rate, crude (per 1,000 people)",Domestic general government health expenditure (% of GDP),...,"Population, total",Poverty gap at $6.85 a day (2017 PPP) (%),Pregnant women receiving prenatal care (%),Prevalence of overweight (% of adults),"Prevalence of overweight, weight for height (% of children under 5)",Prevalence of undernourishment (% of population),Rural population (% of total population),"Unemployment, total (% of total labor force) (modeled ILO estimate)",Urban population,Urban population (% of total population)
36,Argentina,1990,21.989,0.427859,,,,,7.743,,...,32637657.0,,,48.7,,,13.016,,28389540.0,86.984
37,Argentina,1991,21.844,0.400371,,,,,7.536,,...,33105763.0,4.5,,49.3,,,12.672,5.44,28910601.0,87.328
38,Argentina,1992,21.683,0.369563,,,,,7.595,,...,33568285.0,5.0,,49.9,,,12.458,6.36,29386348.0,87.542
39,Argentina,1993,21.57,0.340845,,,,,7.631,,...,34027240.0,5.5,95.0,50.5,,,12.248,10.1,29859584.0,87.752
40,Argentina,1994,21.419,0.320044,,,,,7.403,,...,34488696.0,5.7,,51.1,11.1,,12.04,11.76,30336257.0,87.96


In [40]:
#Exporta el DataFrame a un archivo CSV sin incluir los índices
df_Sin_Caribe.to_csv('tabla_sin_caribe_ok.csv', index=False)