Project Scenario:

An international firm is looking to expand its business into different countries across the world. I have been hired as a junior Data Engineer and tasked with creating a script that extracts the list of the top 10 largest economies of the world, in descending order of their GDPs in Billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data is available at the URL below:

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29


Objectives

    Use web scraping to extract the required information from a website.
    Use Pandas to load and process the tabular data as a DataFrame.
    Use NumPy to manipulate the information contained in the DataFrame.
    Save the updated DataFrame to a CSV file.


Importing Required Libraries

In [4]:
import numpy as np
import pandas as pd

# suppress warnings generated by the code:
def warn(*args, **kwargs):
  pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

Extract the required GDP data from the given URL using web scraping

In [5]:
URL="https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

In [18]:
# Extract tables from the webpage using Pandas. Retain table number 3 as the required dataframe
tables = pd.read_html(URL)
df = tables[3]

# Replace the column headers with column numbers
df.columns = range(df.shape[1])

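# Retain columns with index 0 and 2 (name of country and value of GDP quoted by IMF).
# .copy() makes df an independent DataFrame, so later column assignments do not act on a slice.
df = df[[0, 2]].copy()

# Rename the columns to "Country" and "GDP (Million USD)"
df.columns = ['Country', 'GDP (Million USD)']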
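
The renumber-then-select step above can be tried without network access. The following is a minimal sketch on a toy table: the two-level header layout is only an assumption about what the scraped table looks like, and the country names and figures are made up for illustration.

```python
import pandas as pd

# Toy stand-in for one scraped table. Header layout, names and numbers
# are hypothetical and only mimic a table with a two-level header.
raw = pd.DataFrame(
    [["Country A", "Region X", "1,234,567", "2023"],
     ["Country B", "Region Y", "987,654", "2023"]],
    columns=pd.MultiIndex.from_tuples([
        ("Country/Territory", "Country/Territory"),
        ("UN region", "UN region"),
        ("IMF", "Estimate"),
        ("IMF", "Year"),
    ]),
)

# Replace the two-level header with positional labels 0..n-1.
raw.columns = range(raw.shape[1])

# Keep the country name and the IMF estimate; .copy() detaches the result from raw.
subset = raw[[0, 2]].copy()
subset.columns = ["Country", "GDP (Million USD)"]
print(subset)
```

Numbering the columns first sidesteps the awkward multi-level header names, at the cost of relying on column positions staying the same in the source page.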

Modify the GDP column of the DataFrame by converting the values from Million USD to Billion USD. Use the round() function of the NumPy library to round the values to 2 decimal places, then change the column header to GDP (Billion USD).

In [20]:
import numpy as np
import pandas as pd

# Step 1: Replace non-numeric values like '—' with NaN and remove commas
df['GDP (Million USD)'] = df['GDP (Million USD)'].replace({',': '', '—': np.nan}, regex=True)

# Step 2: Drop rows where GDP values are NaN (optional, depending on how you want to handle missing values)
df.dropna(subset=['GDP (Million USD)'], inplace=True)

# Step 3: Convert the cleaned GDP column to float and then to integer
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(float).astype(int)

# Step 4: Convert GDP from Million USD to Billion USD
df['GDP (Million USD)'] = df['GDP (Million USD)'] / 1000

# Step 5: Round the values to 2 decimal places
df['GDP (Million USD)'] = np.round(df['GDP (Million USD)'], 2)

# Step 6: Rename the column header to 'GDP (Billion USD)'
df.rename(columns={'GDP (Million USD)': 'GDP (Billion USD)'}, inplace=True)

# Verify the result
print(df.head())


         Country  GDP (Billion USD)
0          World          105568.78
1  United States           26854.60
2          China           19373.59
3          Japan            4409.74
4        Germany            4308.85
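
The scenario asks for the ten largest economies, while the frame printed above still begins with the aggregate World row. Below is a minimal sketch of one way to narrow it down, assuming the aggregate row is identified by the name "World". The helper top_economies is hypothetical, and the sample rows are copied from the output above.

```python
import pandas as pd

# Sample rows taken from the printed output above.
sample = pd.DataFrame({
    "Country": ["World", "United States", "China", "Japan", "Germany"],
    "GDP (Billion USD)": [105568.78, 26854.60, 19373.59, 4409.74, 4308.85],
})

def top_economies(frame, n=10):
    """Return the n largest economies, excluding the aggregate 'World' row."""
    countries = frame[frame["Country"] != "World"]
    # nlargest sorts in descending order of the given column.
    return countries.nlargest(n, "GDP (Billion USD)").reset_index(drop=True)

top3 = top_economies(sample, n=3)
print(top3)
```

On the full frame, calling top_economies(df) with the default n=10 would give the ten rows the scenario asks for, provided the column is already numeric.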


Saving the DataFrame to a CSV file named "Largest_economies.csv"

In [21]:
# Save the DataFrame to the CSV file named "Largest_economies.csv"
df.to_csv('./Largest_economies.csv')
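
By default, to_csv also writes the row index as an unnamed first column. If that column is not wanted, a sketch along these lines may help. It writes to a temporary directory so that it does not touch the real output file, and the sample rows are copied from the output shown earlier.

```python
import os
import tempfile

import pandas as pd

# Two sample rows taken from the printed output earlier in this notebook.
sample = pd.DataFrame({
    "Country": ["United States", "China"],
    "GDP (Billion USD)": [26854.60, 19373.59],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Largest_economies.csv")
    # index=False leaves the row index out of the file.
    sample.to_csv(path, index=False)
    # Read the file back to check what was actually written.
    restored = pd.read_csv(path)

print(restored)
```

Reading the file back is a cheap way to confirm that the column headers and values survived the round trip.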

