# Initial setup

The libraries needed for the solution are as follows:

- requests - The library used for accessing the information from the URL.

- bs4 - The library containing the BeautifulSoup class used for web scraping.

- pandas - The library used for processing the extracted data, storing it to required formats and communicating with the databases.

- sqlite3 - The library required to create a connection to a local SQLite database file.

- numpy - The library required for the mathematical rounding operation as required in the objectives.

- datetime - The library containing the datetime class used for extracting the timestamp for logging purposes.

In [1]:
# !pip install pandas
# !pip install numpy
# !pip install bs4

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime 

In [3]:
# initialize all the known entities

url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
table_attribs = ["Country", "GDP_USD_millions"]
db_name = 'World_Economies.db'
table_name = 'Countries_by_GDP'
csv_path = './Countries_by_GDP.csv'

# Task 1: Extracting information

Information is extracted from the web page by web scraping. I first inspected the page and noted the position of the table to decide how to get the required information. The following points are worth observing:

- Images with captions are also stored in tabular format on the page, so the required table is the third one, at index 2. From this table, only the entries under 'Country/Territory' and 'IMF -> Estimate' are required.

- There are a few entries in which the IMF estimate is shown as '—'. There is also an entry at the top named 'World', which is not required. This entry has no hyperlink, while all other entries in the table do, so I can exclude it by accessing only the rows in which the entry under 'Country/Territory' has a hyperlink associated with it.

- '—' is a special character and not a general hyphen, '-'.
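
The filtering rules above can be sketched without any HTML parsing. This is a minimal illustration only: the rows are hypothetical placeholders (not figures from the page), and `keep_row` and `EM_DASH` are names introduced here for the example.

```python
# Hypothetical parsed rows: (country text, has hyperlink?, IMF estimate text).
# The values are placeholders, not figures taken from the page.
rows = [
    ("World", False, "1,000"),     # no hyperlink, so it is excluded
    ("Country A", True, "500"),    # hyperlink and a real estimate, so it is kept
    ("Country B", True, "\u2014"), # estimate is the em dash, so it is excluded
]

EM_DASH = "\u2014"  # the '—' character used on the page for missing estimates


def keep_row(has_link, estimate):
    """Keep rows that have a hyperlink and whose estimate is not the em dash."""
    return has_link and EM_DASH not in estimate


kept = [(country, estimate)
        for country, has_link, estimate in rows
        if keep_row(has_link, estimate)]
print(kept)

# The em dash and the plain hyphen are different characters.
print(hex(ord(EM_DASH)), hex(ord("-")))
print(EM_DASH == "-")
```

Comparing against a plain hyphen would never match the placeholder on the page, which is why the check has to use the em dash itself.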

In [4]:
def extract(url, table_attribs):
    ''' This function extracts the required
    information from the website and saves it to a dataframe. The
    function returns the dataframe for further processing. '''

    page = requests.get(url).text  # Fetch the web page as text
    data = BeautifulSoup(page, 'html.parser')  # Parse the text into an HTML object
    df = pd.DataFrame(columns=table_attribs)  # Empty dataframe with table_attribs as columns
    tables = data.find_all('tbody')  # All 'tbody' elements; the required table is at index 2
    rows = tables[2].find_all('tr')  # All rows ('tr' elements) of that table
    for row in rows:
        col = row.find_all('td')  # Data cells ('td' elements) of the current row
        if len(col) != 0:  # Skip rows without data cells, such as header rows
            # Keep the row only if the first cell contains a hyperlink
            # and the third cell (the IMF estimate) is not '—'
            if col[0].find('a') is not None and '—' not in col[2]:
                data_dict = {"Country": col[0].a.contents[0],
                             "GDP_USD_millions": col[2].contents[0]}
                df1 = pd.DataFrame(data_dict, index=[0])
                df = pd.concat([df, df1], ignore_index=True)  # Append the matching row to df
    return df

# Task 2: Transform information
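The transform function modifies the 'GDP_USD_millions' column. It converts the values from currency-formatted strings to floating-point numbers, divides them by 1000 to express them in billions of USD rounded to 2 decimal places, and renames the column to 'GDP_USD_billions'.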

In [5]:
def transform(df):
    ''' This function converts the GDP information from Currency
    format to float value, transforms the information of GDP from
    USD (Millions) to USD (Billions) rounding to 2 decimal places.
    The function returns the transformed dataframe.'''
    
    GDP_list = df["GDP_USD_millions"].tolist()
    GDP_list = [float("".join(x.split(','))) for x in GDP_list] #Convert the contents of the 'GDP_USD_millions' column of df dataframe from currency format to floating numbers.
    GDP_list = [np.round(x/1000,2) for x in GDP_list] #Divide all these values by 1000 and round it to 2 decimal places.
    df["GDP_USD_millions"] = GDP_list
    df=df.rename(columns = {"GDP_USD_millions": "GDP_USD_billions"}) #Modify the name of the column from 'GDP_USD_millions' to 'GDP_USD_billions'.
    return df
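
As a worked example of the conversion applied to each value, the sketch below handles a single string. The helper name `millions_to_billions` and the input values are hypothetical, and Python's built-in `round` stands in for `np.round` so that the snippet needs no third-party packages.

```python
def millions_to_billions(text):
    """Convert a string such as '1,234,567' (USD millions) to USD billions."""
    millions = float(text.replace(",", ""))  # strip thousands separators
    return round(millions / 1000, 2)         # millions -> billions, 2 decimals


print(millions_to_billions("1,234,567"))  # 1234.57
print(millions_to_billions("500"))        # 0.5
```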

# Task 3: Loading information

In [6]:
def load_to_csv(df, csv_path):
    df.to_csv(csv_path)

In [7]:
#save the transformed dataframe as a table in the database.

def load_to_db(df, sql_connection, table_name):
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)

# Task 4: Querying the database table

In [8]:
def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
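
The load-and-query step can be tried in isolation with an in-memory SQLite database. This is only a sketch: the sample rows are hypothetical, and plain `sqlite3` calls are used in place of `df.to_sql` and `pd.read_sql` so the snippet runs without pandas.

```python
import sqlite3

# Hypothetical rows standing in for the transformed dataframe.
sample_rows = [("Country A", 2500.0), ("Country B", 100.0), ("Country C", 99.99)]

table_name = "Countries_by_GDP"
conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
conn.execute(f"CREATE TABLE {table_name} (Country TEXT, GDP_USD_billions REAL)")
conn.executemany(f"INSERT INTO {table_name} VALUES (?, ?)", sample_rows)

# Same query shape as the one used in this project.
query_statement = f"SELECT * from {table_name} WHERE GDP_USD_billions >= 100"
result = conn.execute(query_statement).fetchall()
print(result)

conn.close()
```

Only the rows with a GDP of at least 100 billion USD are returned, so the hypothetical 'Country C' is filtered out.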

# Task 5: Logging progress

In [9]:
def log_progress(message): 
    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the documented abbreviated month name)
    now = datetime.now() # get current timestamp 
    timestamp = now.strftime(timestamp_format) 
    with open("./etl_project_log.txt","a") as f: 
        f.write(timestamp + ' : ' + message + '\n')
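
To see what a log line looks like without writing to a file, the formatting step can be separated out as below. The helper `format_log_line` and the fixed date are assumptions made for this illustration. It uses `%b`, the documented directive for the abbreviated month name, which on most platforms gives the same result as `%h`.

```python
from datetime import datetime


def format_log_line(message, now):
    """Build one log line in the form '<timestamp> : <message>'."""
    timestamp = now.strftime('%Y-%b-%d-%H:%M:%S')  # Year-Monthname-Day-Hour:Minute:Second
    return timestamp + ' : ' + message


# A fixed, arbitrary timestamp makes the result reproducible.
line = format_log_line('Data saved to CSV file', datetime(2023, 9, 2, 18, 53, 26))
print(line)
```

The month name depends on the active locale, so the exact text of the timestamp can differ between systems.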

# Function calls
The function calls are set up in the sequence below. Each task is followed by the log message written on its completion.

| Task | Log message on completion |
| --- | --- |
| Declaring known values | Preliminaries complete. Initiating ETL process. |
| Call extract() function | Data extraction complete. Initiating Transformation process. |
| Call transform() function | Data transformation complete. Initiating loading process. |
| Call load_to_csv() | Data saved to CSV file. |
| Initiate SQLite3 connection | SQL Connection initiated. |
| Call load_to_db() | Data loaded to Database as table. Running the query. |
| Call run_query() \* | Process Complete. |
| Close SQLite3 connection | - |

\* The query statement to be executed is:

f"SELECT * from {table_name} WHERE GDP_USD_billions >= 100"

In [10]:
log_progress('Preliminaries complete. Initiating ETL process')

df = extract(url, table_attribs)

log_progress('Data extraction complete. Initiating Transformation process')

df = transform(df)

log_progress('Data transformation complete. Initiating loading process')

load_to_csv(df, csv_path)

log_progress('Data saved to CSV file')

sql_connection = sqlite3.connect(db_name)

log_progress('SQL Connection initiated.')

load_to_db(df, sql_connection, table_name)

log_progress('Data loaded to Database as table. Running the query')

query_statement = f"SELECT * from {table_name} WHERE GDP_USD_billions >= 100"
run_query(query_statement, sql_connection)

log_progress('Process Complete.')

sql_connection.close()

SELECT * from Countries_by_GDP WHERE GDP_USD_billions >= 100
          Country  GDP_USD_billions
0   United States          26854.60
1           China          19373.59
2           Japan           4409.74
3         Germany           4308.85
4           India           3736.88
..            ...               ...
64          Kenya            118.13
65         Angola            117.88
66           Oman            104.90
67      Guatemala            102.31
68       Bulgaria            100.64

[69 rows x 2 columns]


# Code Execution and expected output

Run all cells in order using **Run All**. The output of the final cell should match the query result shown above.

# Conclusion

In this project, I performed Extract, Transform, and Load (ETL) operations on real-world data. I was able to:

- Extract relevant information from a website (https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)) using web scraping with the requests and BeautifulSoup libraries.
- Transform the data to a required format.
- Load the processed data to a local file or as a database table.
- Query the database table using Python.
- Create detailed logs of all operations conducted.