# Project Scenario

An international firm that is looking to expand its business in different countries across the world has recruited you. You have been hired as a junior Data Engineer and are tasked with creating a script that can extract the list of the top 10 largest economies of the world in descending order of their GDPs in Billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data seems to be available on the URL mentioned below:

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29


# Install and import required libraries

In [7]:
!pip install lxml

Collecting lxml
  Downloading lxml-5.3.2-cp312-cp312-macosx_10_9_universal2.whl.metadata (3.6 kB)
Downloading lxml-5.3.2-cp312-cp312-macosx_10_9_universal2.whl (8.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.2/8.2 MB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: lxml
Successfully installed lxml-5.3.2

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [8]:
import numpy as np
import requests
import pandas as pd


# Extract the data using web scraping

In [4]:
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"


In [13]:
tables = pd.read_html(URL)
tables

[      0     1     2
 0   Aug   SEP   Oct
 1   NaN    02   NaN
 2  2022  2023  2024,
                                                    0
 0  Largest economies in the world by GDP (nominal...,
                                                    0  \
 0  > $20 trillion $10–20 trillion $5–10 trillion ...   
 
                                                    1  \
 0  $750 billion – $1 trillion $500–750 billion $2...   
 
                                                    2  
 0  $50–100 billion $25–50 billion $5–25 billion <...  ,
     Country/Territory UN region IMF[1][13]            World Bank[14]  \
     Country/Territory UN region   Estimate       Year       Estimate   
 0               World         —  105568776       2023      100562011   
 1       United States  Americas   26854599       2023       25462700   
 2               China      Asia   19373586  [n 1]2023       17963171   
 3               Japan      Asia    4409738       2023        4231141   
 4             Germany 

# We need the 3rd table which tells about GDP

In [15]:
data = tables[3]
data

Unnamed: 0_level_0,Country/Territory,UN region,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,UN region,Estimate,Year,Estimate,Year,Estimate,Year
0,World,—,105568776,2023,100562011,2022,96698005,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935,2021
...,...,...,...,...,...,...,...,...
209,Anguilla,Americas,—,—,—,—,303,2021
210,Kiribati,Oceania,248,2023,223,2022,227,2021
211,Nauru,Oceania,151,2023,151,2022,155,2021
212,Montserrat,Americas,—,—,—,—,72,2021


# We need IMF values of countries

In [49]:
data.columns
 #Replace it with numbers for easy access
data.columns = range(data.shape[1])
df = data[[0,2]]
# name the columns
df.columns =['Country','GDP(in Million USD)']
df

Unnamed: 0,Country,GDP(in Million USD)
0,World,105568776
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
...,...,...
209,Anguilla,—
210,Kiribati,248
211,Nauru,151
212,Montserrat,—


In [54]:
# Retain 1st 10 rows since we only need 10
d_f = df.iloc[1:11]
d_f
#CONVERT gdp COLUMN TYPE TO INT
d_f.loc[:, 'GDP(in Million USD)'] = d_f['GDP(in Million USD)'].astype(int)


In [60]:
# Convert GDP in million to Billion
d_f['GDP(in Million USD)'] = d_f['GDP(in Million USD)']/ 1000
# round it up
d_f['GDP(in Million USD)'] = np.round(d_f['GDP(in Million USD)'])
#Rename column to billion USD
d_f.rename(columns= {'GDP(in Million USD)':'GDP(in Billion USD)'})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  d_f['GDP(in Million USD)'] = d_f['GDP(in Million USD)']/ 1000
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  d_f['GDP(in Million USD)'] = np.round(d_f['GDP(in Million USD)'])


Unnamed: 0,Country,GDP(in Billion USD)
1,United States,2.7e-05
2,China,1.9e-05
3,Japan,4e-06
4,Germany,4e-06
5,India,4e-06
6,United Kingdom,3e-06
7,France,3e-06
8,Italy,2e-06
9,Canada,2e-06
10,Brazil,2e-06


# Loading data into local files

In [61]:
d_f.to_csv('./Largest economies.csv')