# ETL Practice Project
**Project Scenario** 
<br>
An international firm looking to expand its business into countries across the world has hired you as a junior Data Engineer. You are tasked with:

1. Creating an automated script that can extract the list of all countries:
- In order of their GDP, in billion USD
- Rounded to 2 decimal places
- As logged by the International Monetary Fund (IMF). Because the IMF releases this evaluation twice a year, the organization will rerun this code to extract the information as it is updated. The required data appears to be available at this URL: <a href=https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29>List of countries by GDP (nominal)</a>

2. The required information needs to be made accessible as:
- A CSV file Countries_by_GDP.csv
- A table Countries_by_GDP in a database file World_Economies.db with attributes Country and GDP_USD_billion

3. Your boss wants you to demonstrate the success of this code by:
- Running a query on the database table to display only the entries with an economy larger than 100 billion USD
- Logging the entire process of execution in a file named etl_project_log.txt

**You must create a Python script, 'etl_project_gdp.py', that performs all the required tasks.**


**Project Objectives**
<br>
You have to complete the following tasks for this project:
1. Write a data extraction function to retrieve the relevant information from the required URL.
1. Transform the available GDP information from 'Million USD' to 'Billion USD'.
1. Load the transformed information into the required CSV file and into a table in the database file.
1. Run the required query on the database.
1. Log the progress of the code with appropriate timestamps.

<hr>

**Setup**

In [109]:
# import required libraries
import sqlite3
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import numpy as np


In [122]:
# define output targets (names follow the project requirements above)
db_name = 'World_Economies.db'
table_name = 'Countries_by_GDP'
csv_path = './Countries_by_GDP.csv'

In [111]:
# define source and source structure
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
table_attribs = ['Country', 'GDP (Millions, USD)']
column_indices = [0, 2]  # target columns from which data will be extracted


**Testing Extract**

In [112]:
# parse HTML from url
html_content = requests.get(url).content # raw page bytes, HTML tags included
data = BeautifulSoup(html_content, 'html.parser') # parse into a searchable tree of tags
table = data.find_all('table')[2] # get content related to table at index 2
rows = table.find_all('tr') # get all rows
rows = rows[2:] # remove first two header rows

In [117]:
# filter cells from the specified columns into a list
filtered_cells = []
for row in rows:
    cells = row.find_all('td')
    if len(cells) == 0: # remember to check number of cells in a row - many web tables have invisible merged rows
        continue
    if row.find('a') is None: # rows not hyperlinked to a country, namely World, are excluded
        continue
    selected_cells = [cells[i] for i in column_indices if i < len(cells)]
    filtered_cells.append(selected_cells)
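
The two `continue` checks above can be tried offline on a small table. The snippet below is a sketch only: the HTML string is a hypothetical miniature of the real page, and the `World` figure in it is a made-up placeholder. It shows that a header row (only `<th>` cells), an empty row, and a row without a hyperlink are all skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the source table. The country figures reuse
# values shown elsewhere in this notebook; the World figure is a placeholder.
html = """
<table>
  <tr><th>Country</th><th>Region</th><th>IMF estimate</th></tr>
  <tr><td>World</td><td>n/a</td><td>999</td></tr>
  <tr><td><a href="/wiki/United_States">United States</a></td><td>Americas</td><td>26,854,599</td></tr>
  <tr></tr>
  <tr><td><a href="/wiki/Nauru">Nauru</a></td><td>Oceania</td><td>151</td></tr>
</table>
"""

column_indices = [0, 2]  # same target columns as in the notebook
rows = BeautifulSoup(html, 'html.parser').find_all('tr')

kept = []
for row in rows:
    cells = row.find_all('td')
    if len(cells) == 0:        # header row and empty row have no <td> cells
        continue
    if row.find('a') is None:  # the World row is not hyperlinked
        continue
    kept.append([cells[i].text.strip() for i in column_indices if i < len(cells)])

print(kept)
```

Only the two hyperlinked country rows should remain in `kept`.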

In [131]:
# extract data from filtered cells into a pandas dataframe
# collect one record per row, then build the dataframe in a single step
records = [
    {table_attribs[0]: row[0].text.strip(), table_attribs[1]: row[1].text.strip()}
    for row in filtered_cells
]
df = pd.DataFrame(records, columns=table_attribs)

In [138]:
# replace special character in GDP column with 0
df[table_attribs[1]] = df[table_attribs[1]].replace('—', 0)

In [139]:
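def extract(url, table_attribs, column_indices):
    ''' This function extracts the country and GDP columns from the
    table at the given URL and returns them as a dataframe.'''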
    # parse HTML from url
    html_content = requests.get(url).content # raw page bytes, HTML tags included
    data = BeautifulSoup(html_content, 'html.parser') # parse into a searchable tree of tags
    table = data.find_all('table')[2] # get content related to table at index 2
    rows = table.find_all('tr') # get all rows
    rows = rows[2:] # remove first two header rows

    # filter cells from the specified columns into a list
    filtered_cells = []
    for row in rows:
        cells = row.find_all('td')
        if len(cells) == 0: # remember to check number of cells in a row - many web tables have invisible merged rows
            continue
        if row.find('a') is None: # rows not hyperlinked to a country, namely World, are excluded
            continue
        selected_cells = [cells[i] for i in column_indices if i < len(cells)]
        filtered_cells.append(selected_cells)

    # extract data from filtered cells into a pandas dataframe
    # (one record per row, dataframe built in a single step)
    records = [
        {table_attribs[0]: row[0].text.strip(), table_attribs[1]: row[1].text.strip()}
        for row in filtered_cells
    ]
    df = pd.DataFrame(records, columns=table_attribs)
        
    # replace special character in GDP column with 0
    df[table_attribs[1]] = df[table_attribs[1]].replace('—', 0)
    return df

In [140]:
extract(url,table_attribs,column_indices)

| | Country | GDP (Millions, USD) |
|---|---|---|
| 0 | United States | 26854599 |
| 1 | China | 19373586 |
| 2 | Japan | 4409738 |
| 3 | Germany | 4308854 |
| 4 | India | 3736882 |
| ... | ... | ... |
| 208 | Anguilla | 0 |
| 209 | Kiribati | 248 |
| 210 | Nauru | 151 |
| 211 | Montserrat | 0 |


**Testing Transform**
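
A minimal sketch of what the transform step could look like. It assumes the extracted GDP values arrive either as comma-formatted strings (for example `'26,854,599'`) or as the integer `0` that `extract` substitutes for missing values. The output column name `GDP_USD_billion` follows the project requirements, and the three sample rows reuse figures from the extract output above.

```python
import numpy as np
import pandas as pd

def transform(df):
    ''' Sketch: convert GDP from USD (Millions) to USD (Billions),
    rounded to 2 decimal places, and rename the column.'''
    out = df.copy()
    gdp_millions = (
        out['GDP (Millions, USD)']
        .astype(str)                        # the column may mix strings and the integer 0
        .str.replace(',', '', regex=False)  # assumption: commas as thousands separators
        .astype(float)
    )
    out['GDP_USD_billion'] = np.round(gdp_millions / 1000, 2)
    return out.drop(columns=['GDP (Millions, USD)'])

# Sample rows taken from the extract output; formatting is assumed.
sample = pd.DataFrame({
    'Country': ['United States', 'Nauru', 'Anguilla'],
    'GDP (Millions, USD)': ['26,854,599', '151', 0],
})
transformed = transform(sample)
print(transformed)
```

Working on a copy keeps the extracted dataframe unchanged, which makes it easier to rerun this cell while testing.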

**Testing Load**
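
One possible way to test the two load steps without touching the real output files. This is a sketch under assumptions: it writes the CSV into a temporary directory and uses an in-memory SQLite database, both chosen here only for the demonstration. Omitting the index (`index=False`) and replacing an existing table (`if_exists='replace'`) are design choices of this sketch, not requirements stated in the project.

```python
import os
import sqlite3
import tempfile

import pandas as pd

def load_to_csv(df, csv_path):
    ''' Sketch: save the dataframe as a CSV file at the given path.'''
    df.to_csv(csv_path, index=False)

def load_to_db(df, sql_connection, table_name):
    ''' Sketch: save the dataframe as a table, replacing any existing one.'''
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)

# Small stand-in for the transformed dataframe.
sample = pd.DataFrame({
    'Country': ['United States', 'Nauru'],
    'GDP_USD_billion': [26854.6, 0.15],
})

# Demo targets: a temporary CSV path and an in-memory database.
demo_csv_path = os.path.join(tempfile.mkdtemp(), 'Countries_by_GDP.csv')
load_to_csv(sample, demo_csv_path)
csv_round_trip = pd.read_csv(demo_csv_path)

conn = sqlite3.connect(':memory:')
load_to_db(sample, conn, 'Countries_by_GDP')
row_count = conn.execute('SELECT COUNT(*) FROM Countries_by_GDP').fetchone()[0]
conn.close()

print(csv_round_trip)
print(row_count)
```

In the actual script the connection would instead be opened on the database file, for example with `sqlite3.connect(db_name)`.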

**Testing Query**
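
A sketch of the required query, run against an in-memory table so that it can be tried in isolation. The three rows are a small stand-in built from figures in the extract output (converted to billions). Note that this version of `run_query` also returns the result so it can be inspected, whereas the stub in this notebook is documented as returning nothing.

```python
import sqlite3

import pandas as pd

def run_query(query_statement, sql_connection):
    ''' Sketch: run the query, print the statement and its output,
    and return the output dataframe (an addition for testing).'''
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
    return query_output

# Build a tiny stand-in table in memory.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billion REAL)')
conn.executemany(
    'INSERT INTO Countries_by_GDP VALUES (?, ?)',
    [('United States', 26854.6), ('Germany', 4308.85), ('Nauru', 0.15)],
)

# Only economies larger than 100 billion USD should be returned.
query = 'SELECT * FROM Countries_by_GDP WHERE GDP_USD_billion > 100'
large_economies = run_query(query, conn)
conn.close()
```

The strict `>` comparison matches the wording "more than 100 billion USD"; `>=` would also include an economy of exactly 100 billion.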

In [None]:
# Code for ETL operations on Country-GDP data

def transform(df):
    ''' This function converts the GDP information from Currency
    format to float value, transforms the information of GDP from
    USD (Millions) to USD (Billions) rounding to 2 decimal places.
    The function returns the transformed dataframe.'''

    return df

def load_to_csv(df, csv_path):
    ''' This function saves the final dataframe as a `CSV` file 
    in the provided path. Function returns nothing.'''

def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final dataframe as a database table
    with the provided name. Function returns nothing.'''

def run_query(query_statement, sql_connection):
    ''' This function runs the stated query on the database table and
    prints the output on the terminal. Function returns nothing. '''

def log_progress(message):
    ''' This function logs the mentioned message at a given stage of
    the code execution to a log file. Function returns nothing.'''

''' Here, you define the required entities and call the relevant 
functions in the correct order to complete the project. Note that this
portion is not inside any function.'''
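
The `log_progress` stub above could be filled in along the following lines. This is a sketch under assumptions: the timestamp format, the ` : ` separator, and the optional `log_file` parameter are illustrative choices not specified in the project, and the demo writes to a temporary directory rather than to `etl_project_log.txt` in the working directory. The two messages are examples only.

```python
import os
import tempfile
from datetime import datetime

# Demo location for the log; the project itself asks for etl_project_log.txt.
LOG_FILE = os.path.join(tempfile.mkdtemp(), 'etl_project_log.txt')

def log_progress(message, log_file=LOG_FILE):
    ''' Sketch: append a timestamped message to the log file.
    The log_file parameter is a hypothetical addition for testing.'''
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with open(log_file, 'a') as f:
        f.write(f'{timestamp} : {message}\n')

# Example calls at two stages of the pipeline.
log_progress('Preliminaries complete. Initiating ETL process')
log_progress('Data extraction complete. Initiating transformation process')

with open(LOG_FILE) as f:
    log_lines = f.read().splitlines()
print(log_lines)
```

Opening the file in append mode keeps entries from earlier runs, which suits a script that is rerun each time the source data is updated.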