# Data Engineering: Create your own Dataset
## Extract
The Extract part is the first step of the pipeline. In this step, the data is pulled from its different sources and cached. In this project, I extract data from the following three sources:

1. Wikipedia -> Scrape the World Happiness Report table.

2. RapidAPI -> Get the population data for each country in the Wikipedia dataset.

3. Worlddata website -> Get data about the median age per country.
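The cells in this notebook download each source directly and do not cache anything. A minimal sketch of the caching idea could look like the following. The helper name `fetch_cached`, the `cache` directory and the injected `fetch` callable are assumptions made for illustration; they are not part of the notebook.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fetch_cached(url: str, fetch: Callable[[str], str], cache_dir: str = "cache") -> str:
    """Return the body for `url`, calling `fetch` only if no cached copy exists."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Hash the URL to get a filesystem-safe file name.
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = directory / file_name

    if cache_file.exists():
        # Cache hit: reuse the stored copy and skip the download.
        return cache_file.read_text(encoding="utf-8")

    # Cache miss: download once and store the result for later runs.
    body = fetch(url)
    cache_file.write_text(body, encoding="utf-8")
    return body
```

With `requests`, the callable could be something like `lambda u: requests.get(u, timeout=30).text`. Passing the download function in keeps the helper easy to try out without network access.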

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
import numpy as np

# get html data first
html_data = requests.get("https://en.wikipedia.org/wiki/World_Happiness_Report")

# check if status is 200 -> shows that the request succeeded (it does not say whether scraping is permitted)
print(html_data.status_code)

# parse html data now using BeautifulSoup
soup = BeautifulSoup(html_data.text, "html.parser")

# get all tables from wikipedia page
tables = soup.find_all('table',{'class':"wikitable"})

# store target table
table = tables[2]

# convert table html code to pandas df
# (newer pandas versions expect a file-like object rather than a raw HTML string)
from io import StringIO
df_happiness = pd.read_html(StringIO(str(table)))[0]

# show head
df_happiness.head(10)


200


Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.809,1.285,1.5,0.961,0.662,0.16,0.478
1,2,Denmark,7.646,1.327,1.503,0.979,0.665,0.243,0.495
2,3,Switzerland,7.56,1.391,1.472,1.041,0.629,0.269,0.408
3,4,Iceland,7.504,1.327,1.548,1.001,0.662,0.362,0.145
4,5,Norway,7.488,1.424,1.495,1.008,0.67,0.288,0.434
5,6,Netherlands,7.449,1.339,1.464,0.976,0.614,0.336,0.369
6,7,Sweden,7.353,1.322,1.433,0.986,0.65,0.273,0.442
7,8,New Zealand,7.3,1.242,1.487,1.008,0.647,0.326,0.461
8,9,Austria,7.294,1.317,1.437,1.001,0.603,0.256,0.281
9,10,Luxembourg,7.238,1.537,1.388,0.986,0.61,0.196,0.367
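A status code of 200 only tells us that the request succeeded. Whether a crawler may fetch a page is usually stated in the site's `robots.txt` and its terms of use. As a rough sketch, the standard library can evaluate such rules. The rules, user agent and URLs below are made-up examples, not the real Wikipedia configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_page = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.org/wiki/Some_Page")
blocked_page = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.org/private/data")
print(allowed_page, blocked_page)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the real file instead of parsing a string.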


In [2]:
# rename some countries to later match the country names from RapidAPI
df_happiness["Country or region"] = df_happiness["Country or region"].replace({
    "Congo (Kinshasa)": "DR Congo",
    "Congo (Brazzaville)": "Congo",
    "Ivory Coast": "Côte d'Ivoire",
})
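To find out which names need renaming in the first place, it can help to compare the two name lists directly. The following is a small sketch; the function name `unmatched_names` and the sample lists are illustrative assumptions, and in practice the second list would come from the API.

```python
def unmatched_names(left, right):
    """Return the names from `left` that have no exact match in `right`, sorted."""
    right_set = set(right)
    return sorted(name for name in set(left) if name not in right_set)


# Illustrative sample values, not the full datasets.
happiness_names = ["Finland", "Congo (Kinshasa)", "Ivory Coast"]
api_names = ["Finland", "DR Congo", "Côte d'Ivoire"]

missing = unmatched_names(happiness_names, api_names)
print(missing)
```

Every name this reports is a candidate for the renaming step or for a missing value later on.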

In [3]:
# print number of countries in the dataset
list_countries = df_happiness["Country or region"].to_list()
print(f"Number of countries in Wikipedia dataset: {len(list_countries)}")

Number of countries in Wikipedia dataset: 153


In [4]:
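import os
import requests

# NOTE: this is the weather endpoint on RapidAPI; no query parameters are sent in this cell
url = "https://weatherapi-com.p.rapidapi.com/current.json"

headers = {
    # read the key from an environment variable (the name RAPIDAPI_KEY is an assumption) instead of hard-coding it
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
    "X-RapidAPI-Host": "weatherapi-com.p.rapidapi.com",
}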

try:
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Raise an exception if the request was not successful

    data = response.json()
    print(data)

except requests.exceptions.RequestException as error:
    print("An error occurred:", error)


An error occurred: 400 Client Error: Bad Request for url: https://weatherapi-com.p.rapidapi.com/current.json


## Fetch Population Data for each Country from RapidAPI

In [5]:
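import json
import os
import requests
import numpy as np
from tqdm import tqdm

# Create URL and headers for API call
# NOTE: this URL is the weather endpoint also used in the other cells; its response has no "ok" or "population" fields
url = "https://weatherapi-com.p.rapidapi.com/current.json"

headers = {
    # read the key from an environment variable (the name RAPIDAPI_KEY is an assumption) instead of hard-coding it
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
    "X-RapidAPI-Host": "weatherapi-com.p.rapidapi.com",
}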

# Add population column to DataFrame and set all values to NaN
df_happiness["Population"] = np.nan

# Loop over countries and get population
for country in tqdm(df_happiness["Country or region"].to_list()):
    # Create query string for API call
    querystring = {"name" : country}
    
    try:
        # Make request and fetch response
        response = requests.request("GET", url, headers=headers, params=querystring)
        response_dict = json.loads(response.text)

        # Check if response is okay and update population
        if response_dict["ok"] == True:
            population = response_dict["body"]["population"]
            df_happiness.loc[df_happiness["Country or region"] == country, "Population"] = population

    except json.JSONDecodeError:
        print(f"Error: Failed to decode JSON response for {country}. Skipping...")
        print(f"Response Text: {response.text}")
        continue


  0%|                                                                                          | 0/153 [00:00<?, ?it/s]


KeyError: 'ok'
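The `KeyError` is raised because the loop indexes `response_dict["ok"]` directly, and the JSON returned by this endpoint has no such key. One way to make the lookup tolerant of unexpected payloads is sketched below. The helper name `extract_population` and the sample payloads are assumptions for illustration; the `ok` / `body` / `population` layout is taken from the loop above.

```python
from typing import Optional


def extract_population(payload) -> Optional[int]:
    """Return payload["body"]["population"] if the payload reports success, else None."""
    if not isinstance(payload, dict) or not payload.get("ok"):
        return None
    body = payload.get("body")
    if not isinstance(body, dict):
        return None
    return body.get("population")


# Illustrative payloads with made-up values.
population_like = {"ok": True, "body": {"population": 1000}}
weather_like = {"location": {"name": "Boston"}, "current": {"temp_c": 13.0}}

print(extract_population(population_like))
print(extract_population(weather_like))
```

Returning `None` lets the loop skip a country and leave its population as NaN instead of aborting on the first unexpected response.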

In [6]:
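import os
import requests

url = "https://weatherapi-com.p.rapidapi.com/current.json"

# latitude and longitude of the location to query
querystring = {"q": "53.1,-0.13"}

headers = {
    # read the key from an environment variable (the name RAPIDAPI_KEY is an assumption) instead of hard-coding it
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
    "X-RapidAPI-Host": "weatherapi-com.p.rapidapi.com",
}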

response = requests.get(url, headers=headers, params=querystring)

print(response.json())

{'location': {'name': 'Boston', 'region': 'Lincolnshire', 'country': 'United Kingdom', 'lat': 53.1, 'lon': -0.13, 'tz_id': 'Europe/London', 'localtime_epoch': 1686722667, 'localtime': '2023-06-14 7:04'}, 'current': {'last_updated_epoch': 1686722400, 'last_updated': '2023-06-14 07:00', 'temp_c': 13.0, 'temp_f': 55.4, 'is_day': 1, 'condition': {'text': 'Sunny', 'icon': '//cdn.weatherapi.com/weather/64x64/day/113.png', 'code': 1000}, 'wind_mph': 8.1, 'wind_kph': 13.0, 'wind_degree': 10, 'wind_dir': 'N', 'pressure_mb': 1018.0, 'pressure_in': 30.06, 'precip_mm': 0.0, 'precip_in': 0.0, 'humidity': 82, 'cloud': 0, 'feelslike_c': 11.2, 'feelslike_f': 52.2, 'vis_km': 10.0, 'vis_miles': 6.0, 'uv': 5.0, 'gust_mph': 15.9, 'gust_kph': 25.6}}


In [None]:
df_happiness.head()

In [None]:
df_happiness.info()

## Scrape the Average Age per Country from Worlddata

In [None]:
html_data = requests.get("https://www.worlddata.info/average-age.php")

# check if status is 200 -> shows that the request succeeded (it does not say whether scraping is permitted)
print(html_data.status_code)

In [None]:
# parse html data now using BeautifulSoup
soup = BeautifulSoup(html_data.text, "html.parser")

# get all matching tables from the worlddata page
tables = soup.find_all('table',{'class':"std100 hover"})

# store target table
table = tables[0]

# convert table html code to pandas df
# (newer pandas versions expect a file-like object rather than a raw HTML string)
from io import StringIO
df_average_age = pd.read_html(StringIO(str(table)))[0]

# show head
df_average_age.head()

In [None]:
# print number of countries in the dataset
list_countries = df_average_age["Country"].to_list()
print(f"Number of countries in the average age dataset: {len(list_countries)}")

There are 127 countries in this dataset, but the happiness dataset contains 153, so the average-age data is missing for some countries.
Let's now add these columns to the happiness dataset.

In [None]:
# let's use pandas join functionality for joining these tables together
df_final = df_happiness.set_index("Country or region").join(df_average_age.set_index("Country")).reset_index()
df_final.head()
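`DataFrame.join` performs a left join by default, so every happiness row is kept and countries without a match receive NaN in the joined columns. To see which countries fall through, a merge with `indicator=True` can be used, as in this sketch. The third country, the `Median age` column name and all age values are made up for illustration.

```python
import pandas as pd

# Small illustrative frames; "Atlantis" and the age values are invented.
left = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "Atlantis"],
    "Score": [7.809, 7.646, 5.0],
})
right = pd.DataFrame({
    "Country": ["Finland", "Denmark"],
    "Median age": [40.0, 41.0],
})

# indicator=True adds a "_merge" column that says where each row came from.
merged = left.merge(
    right,
    how="left",
    left_on="Country or region",
    right_on="Country",
    indicator=True,
)

# Rows marked "left_only" had no partner in the right-hand table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Country or region"].tolist()
print(unmatched)
```

The resulting list shows whether a gap comes from genuinely missing data or from a spelling difference that could be renamed.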

In [None]:
df_final.info()

## Transform
Let's now start with the transform part. Here the data is transformed so that it is in the expected format. Most columns already have the expected data type. Let's first add the total GDP per country. This is easy as long as the population is given, because we already have the GDP per capita. Note that the values in the GDP per capita column (for example 1.285 for Finland) are not currency amounts, so the product is a relative indicator rather than a GDP in dollars. The total GDP can be computed this way:

##### GDP = GDP per capita × Population


As the second transform step, let's convert the Population under 20 years old column to type float.

In [None]:
df_final["GDP"] = df_final["GDP per capita"] * df_final["Population"]
df_final.head()
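Because the multiplication is element-wise, any country without a population value ends up with a missing GDP as well. A tiny sketch with made-up population figures shows the effect; only the two GDP per capita values are taken from the happiness table.

```python
import pandas as pd

# Population values are invented; the second one is deliberately missing.
sample = pd.DataFrame({
    "GDP per capita": [1.285, 1.327],
    "Population": [1000.0, float("nan")],
})

# NaN in either factor propagates into the product.
sample["GDP"] = sample["GDP per capita"] * sample["Population"]

missing_gdp = int(sample["GDP"].isna().sum())
print(sample)
print(f"Rows without GDP: {missing_gdp}")
```

Counting the NaN values after this step is a quick check on how many countries the population lookup failed for.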

In [None]:
# let's now remove the % sign of the Population under 20 years old column and convert it to type float
def transform_col(col_val):
    try:
        return float(col_val.replace(" %", ""))
    except (AttributeError, ValueError):  # value is NaN (a float without .replace) or not a number
        return col_val

df_final["Population under 20 years old in %"] = df_final["Population under20 years old"].apply(transform_col)
df_final = df_final.drop(columns=["Population under20 years old"])
df_final.head()

In [None]:
df_final.info()

## Load

In [None]:
def load(dataset):
    """Write the final dataset to a CSV file without the index column."""
    dataset.to_csv("final_dataset.csv", index=False)

load(df_final)