# **Practice Project: GDP Data Extraction and Processing**

## Introduction

In this practice project, you will apply the skills acquired in this course. You will extract data from a website using web scraping and request APIs, then process it using the Pandas and NumPy libraries.

## Project Scenario:

An international firm that is looking to expand its business into different countries across the world has recruited you. You have been hired as a junior Data Engineer and are tasked with creating a script that extracts the list of the world's 10 largest economies, in descending order of their GDPs in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data seems to be available on the URL mentioned below:

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29


## Objectives

After completing this lab, you will be able to:

* Use web scraping to extract required information from a website
* Use Pandas to load and process the tabular data as a DataFrame
* Use NumPy to manipulate the information contained in the DataFrame
* Load the updated DataFrame to a CSV file

---

### Disclaimer

If you are using a downloaded version of this notebook on your local machine, you may encounter a warning message as shown in the screenshot below:

<p style="text-align:center">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/mod_5/practice_project_disclaimer.png" width="700" alt="warning message">
</p>

This does not affect the execution of the code in any way and can simply be ignored.

---

### Setup

For this lab, we will be using the following libraries:

* `Pandas` for managing the data
* `NumPy` for mathematical operations

In [1]:
# Install the required packages

!pip install pandas numpy
!pip install lxml


### Importing Required Libraries

*We recommend you import all required libraries in one place (here):*

In [60]:
import numpy as np
import pandas as pd

# You can also use this section to suppress warnings generated by your code:
import warnings

def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings('ignore')
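
The cell above silences every warning for the rest of the session. If that is broader than you need, a minimal sketch of a scoped alternative is shown here, using the standard library's `warnings.catch_warnings()` context manager. The function `noisy_step` is a hypothetical stand-in for any code that emits a warning.

```python
import warnings


def noisy_step():
    # Hypothetical function that emits a warning and then returns a value
    warnings.warn("this is only a demonstration warning", UserWarning)
    return 42


# Warnings are ignored only inside this block; the previous
# filter settings are restored when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

print(result)
```

Scoping the filter this way keeps warnings from other cells visible, which can help when debugging.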

### Exercises

#### Exercise 1

Extract the required GDP data from the given URL using web scraping.

In [61]:
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

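You can use the Pandas library to extract the required table directly as a DataFrame. Note that the required table is the third one on the website, as shown in the image below. `pd.read_html()` returns a zero-indexed list of every table it parses, so a table's position in that list need not match the visual count on the page; here the required table is at index 3.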

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/images/pandas_wbs_3.png">
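
To see how `pd.read_html()` behaves without depending on the network, here is a minimal offline sketch. The HTML snippet, country names, and values are made up for illustration, and the example assumes a parser such as `lxml` is installed, as in the Setup section.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML page containing two tables (not the Wikipedia page)
html = """
<table>
  <thead><tr><th>Region</th><th>Count</th></tr></thead>
  <tbody><tr><td>North</td><td>1</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Country</th><th>GDP</th></tr></thead>
  <tbody>
    <tr><td>Country A</td><td>300</td></tr>
    <tr><td>Country B</td><td>200</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list with one DataFrame per <table> element
tables = pd.read_html(StringIO(html))
print(len(tables))

# The list is zero-indexed, so the second table is at index 1
second = tables[1]
print(second)
```

The same indexing idea applies to the real page: inspect `len(tables)` and a few entries to confirm which index holds the table you need.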

In [62]:
# Extract all tables from the webpage using Pandas. read_html() returns a list of DataFrames
tables = pd.read_html(URL)

# Retain the table at index 3 as the required DataFrame
df = tables[3]

# Replace the column headers with column numbers
df.columns = range(df.shape[1])

# Retain columns with index 0 and 2 (name of country and value of GDP quoted by IMF)
country_gdp = df[[0, 2]]

# Retain the rows with index 1 to 10, indicating the top 10 economies in the world.
# copy() creates an independent DataFrame, so later column assignments
# do not trigger a SettingWithCopyWarning
top_ten_df = country_gdp.iloc[1:11].copy()

# Assign column names as "Country" and "GDP (Million USD)"
top_ten_df.columns = ["Country", "GDP (Million USD)"]

# Print and verify the DataFrame is correct
top_ten_df.head(10)

| | Country | GDP (Million USD) |
|---|---|---|
| 1 | United States | 26854599 |
| 2 | China | 19373586 |
| 3 | Japan | 4409738 |
| 4 | Germany | 4308854 |
| 5 | India | 3736882 |
| 6 | United Kingdom | 3158938 |
| 7 | France | 2923489 |
| 8 | Italy | 2169745 |
| 9 | Canada | 2089672 |
| 10 | Brazil | 2081235 |
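
The solution above relies on the source table already being sorted by GDP. If you cannot assume that, a minimal sketch of sorting explicitly before taking the top rows is shown here. The DataFrame `sample` and its values are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical, unsorted data (not taken from the IMF table)
sample = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "GDP (Million USD)": [1500, 4200, 900, 3100],
})

# Sort in descending order of GDP, keep the first 2 rows,
# and renumber the index from 0
top_two = (
    sample.sort_values("GDP (Million USD)", ascending=False)
    .head(2)
    .reset_index(drop=True)
)
print(top_two)
```

For the real data you would replace `head(2)` with `head(10)`, after converting the GDP column to a numeric type.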


#### Exercise 2

Modify the GDP column of the DataFrame, converting the values from Million USD to Billion USD. Use the `round()` function of the NumPy library to round the values to 2 decimal places. Change the header of the column to `GDP (Billion USD)`.
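
As a warm-up, the unit conversion and rounding can be sketched on a plain NumPy array. The figures below are hypothetical and are not taken from the IMF table.

```python
import numpy as np

# Hypothetical GDP figures in million USD
gdp_million = np.array([1234567, 987654, 45678])

# 1 billion = 1000 million, so divide by 1000 to change the unit
gdp_billion = gdp_million / 1000

# Round each value to 2 decimal places
gdp_billion_rounded = np.round(gdp_billion, 2)
print(gdp_billion_rounded)
```

`np.round()` also accepts a Pandas Series, so the same call can be applied to a DataFrame column.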

In [63]:
# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method
top_ten_df['GDP (Million USD)'] = top_ten_df['GDP (Million USD)'].astype(int)

# Convert the GDP value in Million USD to Billion USD
top_ten_df['GDP (Million USD)'] = top_ten_df['GDP (Million USD)']/1000

# Use numpy.round() to round the value to 2 decimal places
top_ten_df['GDP (Million USD)'] = np.round(top_ten_df['GDP (Million USD)'], 2)

# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'
top_ten_df = top_ten_df.rename(columns={'GDP (Million USD)' : 'GDP (Billion USD)'})

# Print results to verify
top_ten_df.head(11)

| | Country | GDP (Billion USD) |
|---|---|---|
| 1 | United States | 26854.60 |
| 2 | China | 19373.59 |
| 3 | Japan | 4409.74 |
| 4 | Germany | 4308.85 |
| 5 | India | 3736.88 |
| 6 | United Kingdom | 3158.94 |
| 7 | France | 2923.49 |
| 8 | Italy | 2169.74 |
| 9 | Canada | 2089.67 |
| 10 | Brazil | 2081.24 |


#### Exercise 3

Load the DataFrame to a CSV file named "Largest_economies.csv".

In [64]:
# Load the DataFrame to the CSV file named "Largest_economies.csv"
top_ten_df.to_csv("Largest_economies.csv", index = False)
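
To check that a file was written as expected, you can read it back. The sketch below does this with a small hypothetical DataFrame and a temporary directory, so it does not touch "Largest_economies.csv"; the file name `sample_economies.csv` is an assumption made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for the real top-ten DataFrame
sample = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "GDP (Billion USD)": [4.2, 3.1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_economies.csv")

    # index=False leaves the row index out of the file
    sample.to_csv(path, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(path)

print(reloaded)
```

Because `index=False` was used, the reloaded DataFrame has only the two data columns and no extra `Unnamed: 0` column.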