# Project Scenario

An international firm looking to expand its business across the world has hired you as a junior Data Engineer. You are tasked with creating an automated script that extracts the list of all countries in order of their GDPs in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). Since the IMF releases this evaluation twice a year, the organization will rerun this code to extract the information whenever it is updated.

You can find the required data on this webpage.

- The required information needs to be made accessible as a JSON file 'Countries_by_GDP.json' as well as a table 'Countries_by_GDP' in a database file 'World_Economies.db' with attributes 'Country' and 'GDP_USD_billion.'

- Your boss wants you to demonstrate the success of this code by running a query on the database table that displays only the entries with an economy of more than 100 billion USD. Also, log the entire execution process in a file named 'etl_project_log.txt'.

- You must create a Python script, 'etl_project_gdp.py', that performs all the required tasks.

## Initial setup
Before you start building the code, make sure the required libraries are available. `sqlite3` and `datetime` ship with Python's standard library; the others must be installed.
The libraries needed for the code are as follows:
1. **requests** - The library used for accessing the information from the URL.
2. **bs4** - The library containing the BeautifulSoup class used for web scraping.
3. **pandas** - The library used for processing the extracted data, storing it in the required formats, and communicating with the database.
4. **sqlite3** - The library required to create a connection to the SQLite database file.
5. **numpy** - The library required for the mathematical rounding operation stated in the objectives.
6. **datetime** - The library containing the datetime class used for extracting the timestamp for logging purposes.
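
A quick way to confirm the environment is ready is to check which of these modules can be imported. The sketch below is only an illustration; the helper name `find_missing` is hypothetical and not part of the project.

```python
from importlib.util import find_spec


def find_missing(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Module names taken from the list above.
required = ["requests", "bs4", "pandas", "numpy", "sqlite3", "datetime"]
missing = find_missing(required)
if missing:
    print("Install these before running the script:", ", ".join(missing))
else:
    print("All required modules are available.")
```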

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime 
print('completed import libs')

completed import libs


Next, initialize all the known entities:

- **url**: 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
- **table_attribs**: The attributes or column names for the dataframe, stored as a list. Since the data available on the website is in millions of USD, the attributes should initially be **'Country'** and **'GDP_USD_millions'**. This will be modified in the transform function later.
- **db_name**: As mentioned in the Project scenario, 'World_Economies.db'
- **table_name**: As mentioned in the Project scenario, 'Countries_by_GDP'
- **csv_path**: The path of the output CSV file, './Countries_by_GDP.csv'

In [57]:
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
table_attribs = ["Country", "GDP_USD_millions"]
db_name = 'World_Economies.db'
table_name = 'Countries_by_GDP'
csv_path = './Countries_by_GDP.csv'

### Task 1: Extracting information

In [56]:
def extract(url, table_attribs):
    # Download the page and parse its HTML
    page = requests.get(url).text
    data = BeautifulSoup(page, 'html.parser')
    df = pd.DataFrame(columns=table_attribs)
    # Each 'tbody' tag holds the body of one table on the page
    tables = data.find_all('tbody')
    # 'tr' tags are table rows and 'td' tags are table data cells;
    # index 2 selects the third table on the page
    rows = tables[2].find_all('tr')
    for row in rows:
        col = row.find_all('td')
        if len(col) != 0:
            # Keep rows whose first cell has a hyperlink and whose
            # third cell is not the '—' placeholder
            if col[0].find('a') is not None and '—' not in col[2]:
                data_dict = {"Country": col[0].a.contents[0],
                             "GDP_USD_millions": col[2].contents[0]}
                df1 = pd.DataFrame(data_dict, index=[0])
                df = pd.concat([df, df1], ignore_index=True)
    return df
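
To see how the row filter behaves without fetching the live page, the sketch below applies the same checks to a small hand-written HTML fragment. The fragment, the `Exampleland` row, and the helper name `extract_rows` are hypothetical; only the filtering conditions mirror the `extract` function, and plain dictionaries stand in for the dataframe.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the shape of the GDP table (not real page content).
SAMPLE_HTML = """
<table><tbody>
<tr><th>Country</th><th>Region</th><th>IMF estimate</th></tr>
<tr><td><a href="#">United States</a></td><td>Americas</td><td>26,854,599</td></tr>
<tr><td>World</td><td>-</td><td>105,568,776</td></tr>
<tr><td><a href="#">Exampleland</a></td><td>Asia</td><td>—</td></tr>
</tbody></table>
"""


def extract_rows(html):
    """Return one dict per table row that passes the same checks as extract()."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tbody")[0].find_all("tr")
    records = []
    for row in rows:
        col = row.find_all("td")
        if len(col) == 0:
            continue  # header row: only <th> cells, no <td> cells
        if col[0].find("a") is None:
            continue  # first cell has no hyperlink
        if "—" in col[2]:
            continue  # third cell holds only the placeholder dash
        records.append({"Country": str(col[0].a.contents[0]),
                        "GDP_USD_millions": str(col[2].contents[0])})
    return records


records = extract_rows(SAMPLE_HTML)
print(records)
```

Only the first data row survives: the header row has no `td` cells, the `World` row has no link in its first cell, and the last row has a dash instead of a value.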

### Task 2: Transform information

In [50]:
# Convert the GDP values from millions to billions of USD
def transform(df):
    # Convert the GDP_USD_millions column to a list so each value can be processed
    GDP_list = df["GDP_USD_millions"].tolist()
    # Remove the thousands separators: split each value on ',' and join the pieces
    GDP_list = [float("".join(x.split(','))) for x in GDP_list]
    # Divide by 1000 and round to 2 decimal places
    GDP_list = [np.round(x/1000, 2) for x in GDP_list]
    df["GDP_USD_millions"] = GDP_list
    df = df.rename(columns={"GDP_USD_millions": "GDP_USD_billions"})
    return df
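
The conversion applied to each value can be checked in isolation. The sketch below is a simplified stand-in, not the project code: it uses Python's built-in `round` instead of `np.round`, the helper name `millions_to_billions` is hypothetical, and the input strings are illustrative values.

```python
def millions_to_billions(value):
    """Convert a string such as '26,854,599' (USD millions) to USD billions."""
    millions = float(value.replace(",", ""))  # drop the thousands separators
    return round(millions / 1000, 2)          # scale and round to 2 decimals


print(millions_to_billions("26,854,599"))  # 26854.6
print(millions_to_billions("104,902"))     # 104.9
```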

In [62]:
def load_to_csv(df, csv_path):
    df.to_csv(csv_path)

### Task 3: Loading information

In [51]:
def load_to_db(df, sql_connection, table_name):
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)

### Task 4: Querying the database table

In [52]:
def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
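
The load and query steps can be tried out without pandas or a file on disk. The following sketch uses only `sqlite3` with an in-memory database; the `Exampleland` row is hypothetical, and the explicit `CREATE TABLE`, the parameterized threshold, and the `ORDER BY` clause are assumptions added for the illustration rather than part of the project code.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billions REAL)")

# Two rows with values from the query output in this document, plus one
# hypothetical row that falls below the threshold.
rows = [("United States", 26854.60), ("Bulgaria", 100.64), ("Exampleland", 42.0)]
conn.executemany("INSERT INTO Countries_by_GDP VALUES (?, ?)", rows)

threshold = 100
result = conn.execute(
    "SELECT Country, GDP_USD_billions FROM Countries_by_GDP "
    "WHERE GDP_USD_billions >= ? ORDER BY GDP_USD_billions DESC",
    (threshold,),
).fetchall()
conn.close()

print(result)  # [('United States', 26854.6), ('Bulgaria', 100.64)]
```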

### Task 5: Logging progress

In [53]:
def log_progress(message):
    # Year-Monthname-Day-Hour:Minute:Second; %b is the abbreviated month name
    timestamp_format = '%Y-%b-%d-%H:%M:%S'
    now = datetime.now()  # get the current timestamp
    timestamp = now.strftime(timestamp_format)
    # Append, so that earlier log entries are preserved
    with open("./etl_project_log.txt", "a") as f:
        f.write(timestamp + ' : ' + message + '\n')
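
To illustrate what a log entry looks like, the sketch below formats a line for a fixed timestamp instead of the current time and does not write to a file. The helper name `format_log_line` and the chosen date are hypothetical, and the format string uses `%b` for the abbreviated month name, whose spelling depends on the active locale.

```python
from datetime import datetime


def format_log_line(message, now):
    """Build one log line in the 'timestamp : message' layout."""
    timestamp = now.strftime("%Y-%b-%d-%H:%M:%S")
    return timestamp + " : " + message + "\n"


# Fixed timestamp so the result is reproducible.
line = format_log_line("Process Complete.", datetime(2023, 9, 2, 18, 53, 26))
print(line, end="")  # with the default C locale: 2023-Sep-02-18:53:26 : Process Complete.
```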

## FINAL CALL

In [60]:
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
table_attribs = ["Country", "GDP_USD_millions"]
db_name = 'World_Economies.db'
table_name = 'Countries_by_GDP'
csv_path = './Countries_by_GDP.csv'

In [65]:
# Start of the process
log_progress('----------------------------------------------------------------')
log_progress('Preliminaries complete. Initiating ETL process')

# Task 1: extract
df = extract(url, table_attribs)
log_progress('Data extraction complete. Initiating Transformation process')

# Task 2: transform
df = transform(df)
log_progress('Data transformation complete. Initiating loading process')

# Task 3: load to CSV
load_to_csv(df, csv_path)
log_progress('Data saved to CSV file')

# Task 3: connect to the database (db_name is 'World_Economies.db')
sql_connection = sqlite3.connect(db_name)
log_progress('SQL Connection initiated.')

# Task 3: load the data into the table
load_to_db(df, sql_connection, table_name)
log_progress('Data loaded to Database as table. Running the query')

# Task 4: run the query
query_statement = f"SELECT * from {table_name} WHERE GDP_USD_billions >= 100"
run_query(query_statement, sql_connection)

# Process complete
log_progress('Process Complete.')
sql_connection.close()

SELECT * from Countries_by_GDP WHERE GDP_USD_billions >= 100
          Country  GDP_USD_billions
0   United States          26854.60
1           China          19373.59
2           Japan           4409.74
3         Germany           4308.85
4           India           3736.88
..            ...               ...
64          Kenya            118.13
65         Angola            117.88
66           Oman            104.90
67      Guatemala            102.31
68       Bulgaria            100.64

[69 rows x 2 columns]
