An international firm looking to expand its business across different countries has hired you as a junior Data Engineer. You are tasked with creating an automated script that extracts the list of all countries in order of their GDP, in billions of USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). Since the IMF releases this evaluation twice a year, the organization will rerun this code to extract the information whenever it is updated.

You can find the required data on this webpage.
https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29


The required information must be made accessible as a JSON file 'Countries_by_GDP.json' and as a table 'Countries_by_GDP' in a database file 'World_Economies.db', with the attributes 'Country' and 'GDP_USD_billion'.
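The two output targets described above could be produced along these lines. This is a minimal sketch, not the finished script: the helper names `load_to_json` and `load_to_db` are hypothetical, the two rows are illustrative sample data, and the files are written to a temporary directory so the sketch does not overwrite real outputs.

```python
import json
import os
import sqlite3
import tempfile

import pandas as pd

# Illustrative sample rows only (assumption), already expressed in billions of USD
sample = pd.DataFrame({
    "Country": ["United States", "Japan"],
    "GDP_USD_billion": [26854.60, 4409.74],
})


def load_to_json(df, json_path):
    """Write the DataFrame as a JSON array of records (hypothetical helper)."""
    df.to_json(json_path, orient="records", indent=2)


def load_to_db(df, db_path, table_name):
    """Write the DataFrame to a SQLite table, replacing any existing one (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table_name, conn, if_exists="replace", index=False)
    finally:
        conn.close()


# Temporary directory used here so that no real output file is overwritten
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "Countries_by_GDP.json")
db_path = os.path.join(out_dir, "World_Economies.db")

load_to_json(sample, json_path)
load_to_db(sample, db_path, "Countries_by_GDP")
```

Using `if_exists="replace"` reflects the requirement that the script be rerun whenever the IMF figures are updated; appending would duplicate rows on every run.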

Your boss wants you to demonstrate the success of this code by running a query on the database table that displays only the entries with an economy larger than 100 billion USD. Also, log the entire execution process in a file named 'etl_project_log.txt'.
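The query and logging requirements could be sketched as follows. The helper names `log_progress` and `run_query`, the timestamp format, and the sample rows are assumptions made for illustration; an in-memory database and a temporary log file stand in for 'World_Economies.db' and 'etl_project_log.txt'.

```python
import os
import sqlite3
import tempfile
from datetime import datetime

import pandas as pd


def log_progress(message, log_file):
    """Append a timestamped message to the log file (hypothetical helper)."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a") as f:
        f.write(f"{timestamp} : {message}\n")


def run_query(query, conn):
    """Run a SQL query and return the result as a DataFrame (hypothetical helper)."""
    return pd.read_sql_query(query, conn)


# Temporary log file and in-memory database, used only for this sketch
work_dir = tempfile.mkdtemp()
log_file = os.path.join(work_dir, "etl_project_log.txt")
conn = sqlite3.connect(":memory:")

# Illustrative sample rows (assumption), in billions of USD
pd.DataFrame({
    "Country": ["Japan", "Kiribati"],
    "GDP_USD_billion": [4409.74, 0.25],
}).to_sql("Countries_by_GDP", conn, index=False)
log_progress("Sample table loaded", log_file)

# Keep only economies strictly larger than 100 billion USD
result = run_query(
    "SELECT * FROM Countries_by_GDP WHERE GDP_USD_billion > 100", conn
)
log_progress("Query executed", log_file)
conn.close()
print(result)
```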

You must create a Python script 'etl_project_gdp.py' that performs all the required tasks.

## Web scraping and Extracting Data

In [23]:
#libs
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup

In [30]:
#connections

url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
database = 'World_Economies.db'
table_name = 'Countries_by_GDP'
data_path = 'GDPProj/top_DGP_Countries'
df = pd.DataFrame(columns=['Country','GDP_USD_billion'])
count = 0

In [31]:
html_page = requests.get(url).text
soup = BeautifulSoup(html_page, 'html.parser')

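# Try to locate the table by its class string. This form only matches when the
# class attribute is exactly "wikitable sortable", so it can return None here.
table = soup.find("table", {"class": "wikitable sortable"})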

In [32]:
# Print class names of all tables to identify the target table
tables = soup.find_all("table")
for i, t in enumerate(tables):
    print(f"Table {i}: Classes - {t.get('class')}")

# Select the target table by index. In the printed output, index 2 is the
# 'wikitable sortable' table that holds the GDP figures.
table = tables[2]  # Adjust the index if the page layout changes

# Verify table selection
print("First few rows of the table:")
for row in table.find("tbody").find_all("tr")[:5]:  # Print first 5 rows to confirm
    print(row)


Table 0: Classes - None
Table 1: Classes - None
Table 2: Classes - ['wikitable', 'sortable', 'static-row-numbers', 'plainrowheaders', 'srn-white-background']
Table 3: Classes - ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner']
Table 4: Classes - ['nowraplinks', 'mw-collapsible', 'uncollapsed', 'navbox-inner']
Table 5: Classes - ['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner']
Table 6: Classes - ['nowraplinks', 'navbox-subgroup']
First few rows of the table:
<tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
<th rowspan="2">Country/Territory
</th>
<th rowspan="2"><a href="/web/20230902185326/https://en.wikipedia.org/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN region</a>
</th>
<th colspan="2"><a href="/web/20230902185326/https://en.wikipedia.org/wiki/International_Monetary_Fund" title="International Monetary Fund">IMF</a><sup class="reference" id="cite_ref-GDP_IMF_2-2"><a href="#cite_note-GDP_IM
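Selecting the table by position works for this snapshot, but the index would shift if the page gained or lost a table. A CSS selector that requires both classes is one possible alternative. The sketch below runs against a small hand-written HTML fragment (an assumption, not the real page) to show why an exact class-string lookup misses a table that carries extra classes while `select_one` still finds it.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the class layout printed above (assumption)
html = """
<table><tr><td>layout</td></tr></table>
<table class="wikitable sortable static-row-numbers">
  <tr><td>United States</td></tr>
</table>
<table class="nowraplinks navbox-inner"><tr><td>nav</td></tr></table>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# A string containing several classes is compared with the full class attribute,
# so the extra 'static-row-numbers' class prevents a match
exact = sample_soup.find("table", {"class": "wikitable sortable"})

# A CSS selector matches any table that has both classes, whatever else it has
by_selector = sample_soup.select_one("table.wikitable.sortable")

print(exact)
print(by_selector.get_text(strip=True))
```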

In [33]:
if table is None:
    print("Table not found on the page.")
else:
    # Initialize an empty DataFrame
    df = pd.DataFrame(columns=["Country", "GDP_USD_billion"])

    # Iterate through each row in the table body
    for row in table.find("tbody").find_all("tr"):
        col = row.find_all('td')  # Find all columns in the row

        # Ensure that the row has enough columns to avoid empty or irrelevant rows
        if len(col) >= 3:
            # Extract the country name from the first column
            country = col[0].get_text(strip=True)
            
            # Extract the IMF estimate from the third column, removing thousands
            # separators. The page reports these values in millions of USD.
            gdp_text = col[2].get_text(strip=True).replace(',', '')

            # Convert GDP to a numeric value if possible
            try:
                gdp = float(gdp_text) if '.' in gdp_text else int(gdp_text)
            except ValueError:
                gdp = None  # Set to None if conversion fails

            # Append the data to the DataFrame
            df = pd.concat([df, pd.DataFrame({'Country': [country], 'GDP_USD_billion': [gdp]})], ignore_index=True)

# Display the resulting DataFrame
print(df)

           Country GDP_USD_billion
0            World       105568776
1    United States        26854599
2            China        19373586
3            Japan         4409738
4          Germany         4308854
..             ...             ...
209       Anguilla            None
210       Kiribati             248
211          Nauru             151
212     Montserrat            None
213         Tuvalu              65

[214 rows x 2 columns]
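The values above are still in millions of USD, as published on the page, while the task asks for billions rounded to 2 decimal places. A transform step along the following lines could close that gap. The function name `transform` is hypothetical, and the three rows are copied from the output above only to keep the sketch self-contained.

```python
import pandas as pd

# Three rows taken from the extracted output above, still in millions of USD
raw = pd.DataFrame({
    "Country": ["United States", "Anguilla", "Kiribati"],
    "GDP_USD_billion": [26854599, None, 248],
}, dtype=object)


def transform(df):
    """Convert GDP from millions to billions of USD, rounded to 2 decimals (hypothetical helper)."""
    out = df.copy()
    # Coerce to numeric so that missing entries become NaN rather than raising
    millions = pd.to_numeric(out["GDP_USD_billion"], errors="coerce")
    out["GDP_USD_billion"] = (millions / 1000).round(2)
    return out


clean = transform(raw)
print(clean)
```

Rows without an IMF estimate stay as NaN here; whether to drop them before loading is a separate decision that the task description does not settle.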


In [34]:
df.to_csv(data_path)  # Save to the path defined in the connections cell