# Data-Pipelines

## Libraries


In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import re
import numpy as np
import squarify 
import calendar
import requests
from bs4 import BeautifulSoup
import pycountry

### Functions

In [2]:
import scraper as scr
import api_extract as api

## Context and hypotheses

Other possible variables to consider, not included in this analysis: unemployment rate, gender (in)equality

## Data Collection Process

### Original Dataset from Kaggle

In [3]:
## Goal: download a dataset and enrich it with an API and web scraping

World Happiness Report 2015-2022 (TODO: add the authors, a short description of the report and dataset, the chapter in which the dataset is used, and possibly the original graphs from the report).
--> Study how the happiest and least happy countries relate to pollution around the world.
Web scraping of https://worldpopulationreview.com/country-rankings/most-polluted-countries

In [4]:
df = pd.read_csv("../input/world-happiness-report-2015-2022-cleaned.csv")

In [5]:
df

Unnamed: 0.1,Unnamed: 0,Happiness Rank,Country,Region,Happiness Score,Economy (GDP per Capita),Family (Social Support),Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Year
0,0,1,Switzerland,Western Europe,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2015
1,1,2,Iceland,Western Europe,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2015
2,2,3,Denmark,Western Europe,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2015
3,3,4,Norway,Western Europe,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2015
4,4,5,Canada,North America,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2015
...,...,...,...,...,...,...,...,...,...,...,...,...
1224,141,142,Botswana*,-,3471,1503,0815,0280,0571,0102,0012,2022
1225,142,143,Rwanda*,-,3268,0785,0133,0462,0621,0544,0187,2022
1226,143,144,Zimbabwe,Sub-Saharan Africa,2995,0947,0690,0270,0329,0105,0106,2022
1227,144,145,Lebanon,Middle East and Northern Africa,2955,1392,0498,0631,0103,0034,0082,2022


In [6]:
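# strip the "*" marker from country names; regex=False treats "*" as a literal character
df['Country'] = df['Country'].str.replace("*", "", regex=False)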


### IQ Air Web scraping

In [7]:
## Web scraping to obtain air pollution by country
# Website = https://www.iqair.com/world-most-polluted-countries
# table = Most polluted country and region ranking based on annual average PM2.5 concentration (μg/m³)

Columns to extract from the website:
- Rank
- Country/Region
- 2018
- 2019
- 2020
- 2021
- Population
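The internals of `scr.scrape_iqair()` are not shown in this notebook. As a rough sketch of the reshaping step such a scraper would need, assuming the ranking table has already been parsed into one row per country with one column per year, `pandas.melt` can produce the long `Country, Population, Year, Pollution` layout shown in the output below. The `to_long` helper and the sample rows (placeholder names and made-up values) are illustrative assumptions, not the author's code.

```python
import pandas as pd

YEAR_COLUMNS = ["2018", "2019", "2020", "2021"]


def to_long(wide):
    """Reshape a wide ranking table (one column per year) into long format."""
    wide = wide.rename(columns={"Country/Region": "Country"})
    long_df = wide.melt(
        id_vars=["Country", "Population"],
        value_vars=YEAR_COLUMNS,
        var_name="Year",
        value_name="Pollution",
    )
    long_df["Year"] = long_df["Year"].astype(int)
    # Missing measurements appear as "-" on the site; coerce them to NaN
    long_df["Pollution"] = pd.to_numeric(long_df["Pollution"], errors="coerce")
    return long_df


# Hypothetical sample rows, only to exercise the helper
wide = pd.DataFrame(
    {
        "Rank": [1, 2],
        "Country/Region": ["Country A", "Country B"],
        "2018": ["10.0", "-"],
        "2019": ["11.0", "21.0"],
        "2020": ["12.0", "22.0"],
        "2021": ["13.0", "23.0"],
        "Population": [1000, 2000],
    }
)

long_df = to_long(wide)
print(long_df)
```

Converting `-` to NaN at this stage keeps `Pollution` numeric, which avoids string comparisons later in the analysis.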

In [14]:
pollution = scr.scrape_iqair()

In [15]:
pollution

Unnamed: 0,Country,Population,Year,Pollution
0,Bangladesh,164689383,2018,97.1
1,Chad,16425859,2018,-
2,Pakistan,220892331,2018,74.3
3,Tajikistan,9537642,2018,-
4,India,1380004385,2018,72.5
...,...,...,...,...
463,"Bonaire, Saint Eustatius and Saba",26221,2021,5.1
464,Cape Verde,555988,2021,5.1
465,Puerto Rico,2860840,2021,4.8
466,U.S. Virgin Islands,104423,2021,4.5


### World Bank Indicators API

The World Bank Indicators API provides access to nearly 16,000 time series indicators. Most of these indicators are available online through tools such as Databank and the Open Data website. The API provides programmatic access to this same data. Many data series date back over 50 years, and can be used to create interesting applications.
The Indicators API provides access to over 45 databases and no authentication method is required to access the API.

**URL usage**

The base URL for v2 of the World Bank Indicators API is `http://api.worldbank.org/v2/country/all/indicator/indicator_code`, where `indicator_code` must be replaced by the id of the indicator to extract.

The API supports query-string parameters on the URL. The following ones are used here:
- format: output format of the response. JSON is chosen.
- date: date range of the request, from 2015 to 2022 as in the source data.
- per_page: number of results per page, set to the length of the longest response so that each request returns all results in a single page.
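The code of `api.worldbank_indicator` is not included here, so the following is only a minimal sketch of how the URL and parameters above could be combined and how the response could be flattened. It assumes the v2 JSON response is a two-element list (paging metadata, then the records) and that each record carries `country`, `date` and `value` fields. The helper names `build_url` and `parse_payload` and the `per_page` default of 20000 are hypothetical. The sample payload is trimmed by hand and reuses the two Zimbabwe rows shown in the schooling output below.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.worldbank.org/v2/country/all/indicator/{code}"


def build_url(indicator_code, start_year=2015, end_year=2022, per_page=20000):
    """Build the request URL with the format, date and per_page parameters."""
    query = urlencode(
        {
            "format": "json",
            "date": f"{start_year}:{end_year}",
            "per_page": per_page,  # assumed value, large enough for one page
        }
    )
    return f"{BASE_URL.format(code=indicator_code)}?{query}"


def parse_payload(payload, indicator_name):
    """Flatten a decoded [metadata, records] response into a list of rows."""
    _metadata, records = payload
    return [
        {
            "Country code": rec["country"]["id"],
            "Country": rec["country"]["value"],
            "Year": int(rec["date"]),
            indicator_name: rec["value"],
        }
        for rec in (records or [])
    ]


url = build_url("UIS.EA.MEAN.1T6.AG25T99")

# Hand-trimmed payload in the assumed response shape
sample_payload = [
    {"page": 1, "pages": 1, "per_page": 20000, "total": 2},
    [
        {"country": {"id": "ZW", "value": "Zimbabwe"}, "date": "2017", "value": 8.4668},
        {"country": {"id": "ZW", "value": "Zimbabwe"}, "date": "2016", "value": None},
    ],
]
rows = parse_payload(sample_payload, "Avg. Schooling Years")
print(url)
print(rows)
```

In practice the payload would come from something like `requests.get(url, timeout=30).json()`, and `pd.DataFrame(rows)` would give a frame with the same columns as the outputs below.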

In [9]:
#Education: Mean years of schooling (ISCED 1 or higher), population 25+ years, both sexes
schooling_years = api.worldbank_indicator(indicator_name='Avg. Schooling Years', indicator_code='UIS.EA.MEAN.1T6.AG25T99')

In [10]:
schooling_years

Unnamed: 0,Country code,Country,Year,Avg. Schooling Years
0,,Global Partnership for Education,2100,
1,,Global Partnership for Education,2095,
2,,Global Partnership for Education,2090,
3,,Global Partnership for Education,2085,
4,,Global Partnership for Education,2080,
...,...,...,...,...
5891,ZW,Zimbabwe,2019,
5892,ZW,Zimbabwe,2018,
5893,ZW,Zimbabwe,2017,8.4668
5894,ZW,Zimbabwe,2016,


In [11]:
#Literacy rate, adult total (% of people ages 15 and above) - SE.ADT.LITR.ZS
literacy = api.worldbank_indicator(indicator_name='Literacy Rate', indicator_code='SE.ADT.LITR.ZS')

In [12]:
literacy

Unnamed: 0,Country code,Country,Year,Literacy Rate
0,ZH,Africa Eastern and Southern,2021,
1,ZH,Africa Eastern and Southern,2020,
2,ZH,Africa Eastern and Southern,2019,
3,ZH,Africa Eastern and Southern,2018,
4,ZH,Africa Eastern and Southern,2017,
...,...,...,...,...
1857,ZW,Zimbabwe,2019,
1858,ZW,Zimbabwe,2018,
1859,ZW,Zimbabwe,2017,
1860,ZW,Zimbabwe,2016,


### Merging datasets

In [None]:
# They will all be left joins 

In [None]:
# df - pollution - schooling_years - literacy

In [None]:
# cast 'Year' to int in every frame so that the merge keys share the same dtype
df['Year'] = df['Year'].astype(int)

In [None]:
pollution['Year'] = pollution['Year'].astype(int) 

In [None]:
literacy['Year'] = literacy['Year'].astype(int) 

In [None]:
schooling_years['Year'] = schooling_years['Year'].astype(int) 

In [None]:
schooling_years.dtypes

In [None]:
list(df.Country.unique())

In [None]:
pollution[pollution.Country=='Spain']

In [None]:
pd.merge(df, pollution, on=["Country", "Year"], how="left")
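Because every merge here is a left join, rows of `df` whose country name has no exact match in the other frame are kept and simply receive NaN. One way to list the names that fail to match, sketched here with small made-up frames rather than the real data, is the `indicator=True` option of `DataFrame.merge`. The `unmatched_keys` helper is a hypothetical name.

```python
import pandas as pd


def unmatched_keys(left, right, keys):
    """Return the distinct key combinations in `left` with no match in `right`."""
    merged = left.merge(
        right[keys].drop_duplicates(),
        on=keys,
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    mask = merged["_merge"] == "left_only"
    return merged.loc[mask, keys].drop_duplicates().reset_index(drop=True)


# Made-up frames with placeholder country names
happiness = pd.DataFrame(
    {
        "Country": ["Alpha", "Beta Republic"],
        "Year": [2020, 2020],
        "Happiness Score": [7.0, 5.0],
    }
)
air = pd.DataFrame(
    {
        "Country": ["Alpha", "Beta"],
        "Year": [2020, 2020],
        "Pollution": [10.0, 20.0],
    }
)

missing = unmatched_keys(happiness, air, ["Country", "Year"])
print(missing)
```

Applied to the real frames, for example `unmatched_keys(df, pollution, ["Country", "Year"])`, this would show which spellings need harmonising before the join, and also which years are simply not covered by the other source.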

**Try to merge df with API**

In [None]:
test = pd.merge(df, literacy, on=["Country", "Year"], how="left")

In [None]:
test[test["Literacy Rate"].notna()]

In [None]:
(pd.merge(df, literacy, on=["Country", "Year"], how="left"))['Literacy Rate'].isna().sum()

In [None]:
# requires df["Country code"], which is created by the fuzzy-search cell further below
(pd.merge(df, literacy, on=["Country code", "Year"], how="left"))['Literacy Rate'].isna().sum()

In [None]:
countries

In [None]:
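def do_fuzzy_search(country):
    """Return the ISO 3166-1 alpha-2 code of the best fuzzy match, or NaN if none is found."""
    try:
        result = pycountry.countries.search_fuzzy(country)
        return result[0].alpha_2
    except LookupError:
        # pycountry raises LookupError when no country matches the name
        return np.nan

df["Country code"] = df["Country"].apply(do_fuzzy_search)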

In [None]:
df['Country code'].isna().sum()