# Practice Project: GDP Data extraction and processing

Estimated time needed: **30** minutes

## Introduction

In this practice project, you will put the skills acquired through the course to use. You will extract data from a website using web scraping and request APIs, and process it using the Pandas and NumPy libraries.


## Project Scenario:

An international firm that is looking to expand its business in different countries across the world has hired you as a junior Data Engineer. You are tasked with creating a script that extracts the list of the top 20 largest economies of the world, in descending order of their GDPs in Billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data seems to be available on the URL mentioned below:

https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

# Exercises
### Exercise 1
Extract the required GDP data (table 3) from the given URL using Web Scraping.

In [8]:
#Importing the necessary libraries
import numpy as np
import pandas as pd

#Extract tables from webpage using Pandas
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
df = tables[2]
df.columns = range(df.shape[1])
df = df[[0,2]]
df = df.iloc[1:21,:]
df.columns = ['Country','GDP (Million USD)']
df

| | Country | GDP (Million USD) |
|---|---|---|
| 1 | United States | 28781083.0 |
| 2 | China | 18532633.0 |
| 3 | Germany | 4591100.0 |
| 4 | Japan | 4110452.0 |
| 5 | India | 3937011.0 |
| 6 | United Kingdom | 3495261.0 |
| 7 | France | 3130014.0 |
| 8 | Brazil | 2331391.0 |
| 9 | Italy | 2328028.0 |
| 10 | Canada | 2242182.0 |
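
The column-selection steps in the cell above can be tried without a network connection. The sketch below uses a small hand-built DataFrame as a stand-in for `tables[2]`; the column names and the leading `World` row are assumptions about the page layout, and only the three GDP values are taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for tables[2]. The column layout is an assumption;
# only the three GDP values come from the output shown above.
raw = pd.DataFrame({
    'Country/Territory': ['World', 'United States', 'China', 'Germany'],
    'Region': [None, 'Americas', 'Asia', 'Europe'],
    'IMF estimate': [None, 28781083.0, 18532633.0, 4591100.0],
})

raw.columns = range(raw.shape[1])   # replace header labels with 0, 1, 2
subset = raw[[0, 2]]                # keep the country name and the GDP estimate
subset = subset.iloc[1:21, :]       # skip the aggregate first row, keep at most 20 rows
subset.columns = ['Country', 'GDP (Million USD)']
print(subset)
```

Replacing the header with positional labels first means the selection does not depend on the exact header text of the scraped table, which may change when the page is edited.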


### Exercise 2
Modify the GDP column of the DataFrame, converting the values from Million USD to Billion USD. Use the `round()` method of the NumPy library to round the values to 2 decimal places. Rename the column header to `GDP (Billion USD)`.


In [9]:
# Importing the necessary libraries
import numpy as np
import pandas as pd

# Extract tables from webpage using Pandas
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
df = tables[2]
df.columns = range(df.shape[1])
df = df[[0, 2]]
df = df.iloc[1:21, :].copy()  # copy so the later in-place edits do not act on a slice
df.columns = ['Country', 'GDP (Million USD)']

# Rename the column
df.rename(columns={'GDP (Million USD)': 'GDP (Billion USD)'}, inplace=True)

# Convert 'GDP (Billion USD)' column to float
df['GDP (Billion USD)'] = df['GDP (Billion USD)'].astype(float)

# Convert to billion USD and round to 2 decimal places
df['GDP (Billion USD)'] /= 1000
df['GDP (Billion USD)'] = df['GDP (Billion USD)'].round(2)

#Setting display format with commas for easier reading
pd.options.display.float_format = '{:,.2f}'.format
df

| | Country | GDP (Billion USD) |
|---|---|---|
| 1 | United States | 28781.08 |
| 2 | China | 18532.63 |
| 3 | Germany | 4591.1 |
| 4 | Japan | 4110.45 |
| 5 | India | 3937.01 |
| 6 | United Kingdom | 3495.26 |
| 7 | France | 3130.01 |
| 8 | Brazil | 2331.39 |
| 9 | Italy | 2328.03 |
| 10 | Canada | 2242.18 |
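
The exercise asks for NumPy's `round()`, while the cell above calls the pandas `Series.round` method. Both give the same result here. As a minimal sketch, the conversion with `np.round` could look like this; the three sample values are taken from the Exercise 1 output, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample values in Million USD, taken from the Exercise 1 output.
gdp_million = pd.Series([28781083.0, 18532633.0, 4591100.0])

# Divide by 1000 to get Billion USD, then round to 2 decimal places with NumPy.
gdp_billion = np.round(gdp_million / 1000, 2)
print(gdp_billion.tolist())
```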


#### This is almost complete, but there is a problem with Turkey: its GDP appears as 1113.561 instead of 1,113,561 in the first DataFrame, so the second DataFrame shows 1.11 for it.

#### Let's fix it!

In [10]:
#Importing the necessary libraries
import numpy as np
import pandas as pd

#Extract tables from webpage using Pandas
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')
df = tables[2]
df.columns = range(df.shape[1])
df = df[[0,2]]
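df = df.iloc[1:21,:].copy()  # copy so the later in-place edits do not act on a slice
df.columns = ['Country','GDP (Million USD)']

# Fixing formatting issue for Turkey's GDP: remove ',' and '.' used as thousands separators.
# regex=False treats '.' as a literal period; the .str accessor requires the column to hold strings.
df['GDP (Million USD)'] = df['GDP (Million USD)'].str.replace(',', '', regex=False).str.replace('.', '', regex=False)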
df

#Now, we simply write the rest of the code:

# Rename the column
df.rename(columns={'GDP (Million USD)': 'GDP (Billion USD)'}, inplace=True)

# Convert 'GDP (Billion USD)' column to float
df['GDP (Billion USD)'] = df['GDP (Billion USD)'].astype(float)

# Convert to billion USD and round to 2 decimal places
df['GDP (Billion USD)'] /= 1000
df['GDP (Billion USD)'] = df['GDP (Billion USD)'].round(2)

#Setting display format with commas for easier reading
pd.options.display.float_format = '{:,.2f}'.format
df

| | Country | GDP (Billion USD) |
|---|---|---|
| 1 | United States | 28781.08 |
| 2 | China | 18532.63 |
| 3 | Germany | 4591.1 |
| 4 | Japan | 4110.45 |
| 5 | India | 3937.01 |
| 6 | United Kingdom | 3495.26 |
| 7 | France | 3130.01 |
| 8 | Brazil | 2331.39 |
| 9 | Italy | 2328.03 |
| 10 | Canada | 2242.18 |
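
The `.str` accessor used in the fix only works when the column holds strings. If `read_html` has already parsed the column as floats (as the `28781083.0` values in the first output suggest), one possible alternative is to normalize each value individually. The helper below, `to_millions`, is a hypothetical name introduced here for illustration and is not part of pandas. It is a minimal sketch that assumes a period in these values is always a misplaced thousands separator.

```python
import pandas as pd

def to_millions(value):
    """Hypothetical helper: treat ',' and '.' as thousands separators."""
    text = str(value).replace(',', '')
    if text.endswith('.0'):
        # A whole number rendered as a float, e.g. '28781083.0'.
        text = text[:-2]
    return float(text.replace('.', ''))

# Values as they might appear after parsing: floats or strings.
sample = pd.Series([28781083.0, 1113.561, '1,113,561'])
print(sample.map(to_millions).tolist())
```

This approach has a limit: a value whose last digits are zeros (for example `1113.560`, stored as the float `1113.56`) cannot be recovered after parsing, so checking the result against the source table is still advisable.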


### Exercise 3
Save the DataFrame to a CSV file named "Largest_economies.csv".

In [11]:
df.to_csv('Largest_economies.csv')
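
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the default with `index=False` on a two-row sample; the sample frame and the temporary directory are assumptions made so the example does not overwrite the real output file.

```python
import os
import tempfile
import pandas as pd

# Two-row sample using values from the output above.
sample = pd.DataFrame(
    {'Country': ['United States', 'China'],
     'GDP (Billion USD)': [28781.08, 18532.63]},
    index=[1, 2],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Largest_economies.csv')

    sample.to_csv(path)                 # default: index written as first column
    with_index = pd.read_csv(path)

    sample.to_csv(path, index=False)    # index omitted from the file
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index depends on whether the rank of each economy should be stored in the file.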