<a href="https://colab.research.google.com/github/CO-CO-LAB/CO-CO-LAB/blob/main/Data_Extraction_(webscraping).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data extraction and processing
Extracting data from a website using web scraping and request APIs, then processing it with the Pandas and NumPy libraries.

I want to extract the list of the top 10 largest economies of the world in descending order of their GDPs in Billion USD (rounded to 2 decimal places).

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29


In [None]:
#Install required packages
!pip install pandas numpy
!pip install lxml



In [None]:
import numpy as np
import pandas as pd

# suppress warnings generated:
def warn(*args, **kwargs):
    pass
import warnings

warnings.warn = warn
warnings.filterwarnings('ignore')

Extracting data from the following URL using Web Scraping.



In [None]:
URL="https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

Using the Pandas library to extract the required table directly as a DataFrame. The required table is the third one shown on the page, as in the image below; in the list returned by `pd.read_html` it is at index 3, which is the index the code uses.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/images/pandas_wbs_3.png">

In [None]:
# Extracting all tables from the webpage as a list of DataFrames
tables = pd.read_html(URL)
df = tables[3]

# Replacing the column headers with column numbers
df.columns = range(df.shape[1])

# Retaining columns with index 0 and 2 (country and value of GDP quoted by IMF)
df = df[[0, 2]]

# Retaining the rows with index 1 to 10; .copy() gives an independent DataFrame,
# so later column assignments do not act on a slice of the original table
df = df.iloc[1:11, :].copy()

# Assigning column names as "Country" and "GDP (Million USD)"
df.columns = ["Country", "GDP (Million USD)"]
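Selecting a table by its position in the list is brittle if the page layout changes. As a hedged alternative, the sketch below uses the `match` argument of `pd.read_html` to keep only tables whose text contains a given string. The inline HTML, the country names, and the figures are made up for illustration, and the example assumes `lxml` is installed as in the install cell.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page; names and figures are invented for illustration
html = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>Navigation box</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Forecast</th></tr>
  <tr><td>Freedonia</td><td>1,500</td></tr>
  <tr><td>Sylvania</td><td>250</td></tr>
</table>
"""

# match keeps only the tables whose text contains the given string or regex;
# wrapping the markup in StringIO lets read_html treat it as a file-like object
tables = pd.read_html(StringIO(html), match="Forecast")
sample = tables[0]

print(len(tables))
print(sample)
```

Whether a stable text marker exists on the real page is something to check by hand; when none does, indexing by position as in the cell above is still the simplest option.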

Next, modify the GDP column of the DataFrame: convert the values from Million USD to Billion USD, round them to 2 decimal places with NumPy's `round()` function, and rename the column header to `GDP (Billion USD)`.

In [None]:
# Changing the data type of the 'GDP (Million USD)' column to int by astype()
df['GDP (Million USD)']= df['GDP (Million USD)'].astype(int)

# Converting the GDP value in Million USD to Billion USD
df['GDP (Million USD)']= df['GDP (Million USD)']/1000

# Using round() to round the value to 2 decimal places
df['GDP (Million USD)']= np.round(df['GDP (Million USD)'],2)

# Renaming the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'
df.columns = ['Country', 'GDP (Billion USD)']
df

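|    | Country        | GDP (Billion USD) |
|----|----------------|-------------------|
| 1  | United States  | 26854.6           |
| 2  | China          | 19373.59          |
| 3  | Japan          | 4409.74           |
| 4  | Germany        | 4308.85           |
| 5  | India          | 3736.88           |
| 6  | United Kingdom | 3158.94           |
| 7  | France         | 2923.49           |
| 8  | Italy          | 2169.74           |
| 9  | Canada         | 2089.67           |
| 10 | Brazil         | 2081.24           |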
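`astype(int)` raises an error as soon as a cell holds anything other than a clean integer, for example a thousands separator or a placeholder for a missing value. A minimal sketch of a more tolerant conversion follows, assuming a hypothetical column of raw strings; the frame `raw`, its country labels, and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values in Million USD, held as strings
raw = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "GDP (Million USD)": ["26,854,599", "4,409,738", "N/A"],
})

# Strip thousands separators, then turn anything non-numeric into NaN
cleaned = raw["GDP (Million USD)"].str.replace(",", "", regex=False)
millions = pd.to_numeric(cleaned, errors="coerce")

# Convert Million USD to Billion USD and round to 2 decimal places
raw["GDP (Billion USD)"] = np.round(millions / 1000, 2)
result = raw[["Country", "GDP (Billion USD)"]]

print(result)
```

With `errors="coerce"` the unparseable entry becomes `NaN` instead of stopping the cell, which makes missing values visible in the result rather than hiding them.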


Saving the DataFrame to a CSV file named "Largest_economies.csv".

In [None]:
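# Writing the DataFrame to a CSV file without the index column
df.to_csv('Largest_economies.csv', index=False)

# Downloading the file to the local machine (available only inside Google Colab)
from google.colab import files
files.download('Largest_economies.csv')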
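Outside Colab, one way to confirm that the file was written as intended is to read it back. The sketch below does this in a temporary directory so it leaves no files behind; the two rows are copied from the output table above, and the names `sample` and `restored` are chosen here for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the output table above
sample = pd.DataFrame({
    "Country": ["United States", "China"],
    "GDP (Billion USD)": [26854.6, 19373.59],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Largest_economies.csv"
    # index=False drops the row labels, so only the two data columns are written
    sample.to_csv(path, index=False)
    # Read the file back while the temporary directory still exists
    restored = pd.read_csv(path)

print(restored)
```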


# Web scraping and HTML basics



In [None]:
import requests
from bs4 import BeautifulSoup
# The URL of the webpage we want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'
# Sending an HTTP GET request to the webpage
response = requests.get(url)
# Storing the HTML content
html_content = response.text
# Creating a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Displaying a snippet of the HTML content
print(html_content[:500])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-
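The raw HTML printed above is hard to read directly, which is the reason for building the `BeautifulSoup` object. As a small offline sketch of what the parsed tree offers, the example below works on a hypothetical inline page rather than the Wikipedia response; `html_doc`, `soup_demo`, and the page content are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, written inline so the example runs offline
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="top">Heading</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup_demo = BeautifulSoup(html_doc, "html.parser")

# Tags can be reached by name, and attributes are read like dictionary keys
page_title = soup_demo.title.string
heading_id = soup_demo.h1["id"]

# find() returns the first match; class_ filters on the CSS class
intro_text = soup_demo.find("p", class_="intro").get_text()

# find_all() returns every match as a list
paragraph_count = len(soup_demo.find_all("p"))

print(page_title, heading_id, intro_text, paragraph_count)
```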


In [19]:
# Find all <a> tags (anchor tags) in the HTML
links = soup.find_all('a')
# Iterate through the list of links and print their text
for link in links[:10]:
    print(link.text)

Jump to content
Main page
Contents
Current events
Random article
About Wikipedia
Contact us
Help
Learn to edit
Community portal
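The loop above prints only the link text. In practice the `href` attribute is often what is needed, and on many pages it is a relative path. The following sketch, which assumes a hypothetical inline snippet instead of the live page, reads `href` with `link.get()` and resolves it against the page URL with `urljoin` from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/IBM"

# Hypothetical anchor tags covering relative, in-page, missing, and absolute targets
snippet = """
<a href="/wiki/Main_Page">Main page</a>
<a href="#content">Jump to content</a>
<a>No target</a>
<a href="https://example.org/about">About</a>
"""

soup_links = BeautifulSoup(snippet, "html.parser")

resolved = []
for link in soup_links.find_all("a"):
    href = link.get("href")  # None when the attribute is missing
    if href is None or href.startswith("#"):
        continue  # skip anchors without a target and in-page jumps
    # urljoin turns relative paths into absolute URLs and leaves absolute ones as-is
    resolved.append((link.get_text(strip=True), urljoin(base_url, href)))

for text, target in resolved:
    print(text, "->", target)
```

Using `link.get("href")` rather than `link["href"]` avoids a `KeyError` on anchors that have no `href` attribute.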
