
## WEB SCRAPING IN PYTHON: WIKIPEDIA TABLE

### Introduction

In this guide, we will scrape the table **"The 30 largest countries by net national wealth (in billions USD)"** from the Wikipedia page [List of countries by total wealth](https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth).

This process involves:
1. Fetching the webpage.
2. Parsing the HTML content.
3. Extracting data into a structured format using Python libraries.

### Required Libraries

To perform web scraping and handle data, we will use the following Python libraries:
- `requests` for fetching web pages.
- `beautifulsoup4` for parsing HTML content.
- `pandas` for data manipulation and saving data to CSV.

Before running the code, make sure to install the necessary libraries. In a Jupyter Notebook cell, you can use:

In [17]:
# !pip install beautifulsoup4
# !pip install pandas 
# !pip install requests

In [18]:
# Load Libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [19]:
# Fetch the Webpage
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth'
page = requests.get(url)
page

<Response [200]>
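
A `200` status means the request succeeded. For a script that runs unattended, it can help to fail loudly on any other status. The sketch below shows one possible wrapper; the function name `fetch_html`, the 10-second timeout, and the `User-Agent` string are illustrative assumptions and not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, raising an exception on HTTP errors."""
    # Hypothetical User-Agent string; replace it with one that describes your script.
    headers = {"User-Agent": "wiki-table-tutorial/0.1 (example script)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Usage (requires network access):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth')
```

The returned text could then be passed to `BeautifulSoup` in place of `page.text`.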

In [20]:
# Parse the HTML Content
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())  # Pretty-print the HTML content

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of countries by total wealth - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-featur

### Inspect the Wikipedia Page

To locate the table we need, right-click on the webpage and select "Inspect" to open the developer tools. The [List of countries by total wealth](https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth) page contains several tables, and we need the second one: **"The 30 largest countries by net national wealth (in billions USD)"**.

### Identify the Target Table

In the developer tools:
+ The 1st, 3rd, and 4th tables have the class **wikitable sortable jquery-tablesorter**.
+ The 2nd table has the class **wikitable static-row-numbers**.

🔗 [Inspect page](https://github.com/DataVizStory/Web-Scraping/blob/main/Images/Wiki_table.gif)

There are two ways to target the 2nd table:
+ via tag
+ via class

### Via tag

In [None]:
soup.find_all('table')  # Find all tables on the page

In [22]:

# This returns a list of all tables. We can access the second table using index 1:
soup.find_all('table')[1]

<table class="wikitable static-row-numbers" style="text-align:right;">
<caption>The 30 largest countries by net national wealth (in billions USD)
</caption>
<tbody><tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
<th>Country</th>
<th>2000</th>
<th>Country</th>
<th>2005</th>
<th>Country</th>
<th>2010</th>
<th>Country</th>
<th>2015</th>
<th>Country</th>
<th>2020</th>
<th>Country</th>
<th>2022</th>
<th>Country</th>
<th>Peak value</th>
<th>Peak year
</th></tr>
<tr class="static-row-header" style="font-weight:bold">
<td align="left"><span class="flagicon" style="padding-left:25px;"> </span>World</td>
<td>117,825</td>
<td align="left"><span class="flagicon" style="padding-left:25px;"> </span>World</td>
<td>182,218</td>
<td align="left"><span class="flagicon" style="padding-left:25px;"> </span>World</td>
<td>252,084</td>
<td align="left"><span class="flagicon" style="padding-left:25px;"> </span>World</td>
<td>297,743</td>
<td align="left"><span class="flagic
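
Selecting by index depends on the table's position on the page, which could change if the article is edited. As a hedged alternative, the sketch below picks a table by the text of its `<caption>`. The helper `find_table_by_caption` and the small inline HTML are made up for illustration; only the caption text is taken from the page.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page: two tables, only the second has the caption we want.
html = """
<table class="wikitable sortable"><caption>Other table</caption>
<tr><th>A</th></tr><tr><td>1</td></tr></table>
<table class="wikitable static-row-numbers">
<caption>The 30 largest countries by net national wealth (in billions USD)
</caption>
<tr><th>Country</th><th>2000</th></tr>
<tr><td>World</td><td>117,825</td></tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")


def find_table_by_caption(soup, text):
    """Return the first <table> whose <caption> contains `text`, or None."""
    for candidate in soup.find_all("table"):
        caption = candidate.find("caption")
        if caption is not None and text in caption.get_text():
            return candidate
    return None


wealth_table = find_table_by_caption(soup_demo, "net national wealth")
print(wealth_table["class"])  # ['wikitable', 'static-row-numbers']
```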

### Via class 

Alternatively, use the unique class name (**'wikitable static-row-numbers'**) for the second table: 

In [None]:
soup.find('table', class_="wikitable static-row-numbers")
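
Passing a string with two class names to `class_` matches the class attribute as an exact string, so the order of the names matters. The sketch below contrasts this with a CSS selector via `select_one`, which matches both classes in any order. The inline HTML is a made-up example with the class names deliberately reversed.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="static-row-numbers wikitable"><tr><td>second</td></tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: the order of the class names matters,
# so this lookup misses the table whose classes are listed in the other order.
by_string = soup_demo.find("table", class_="wikitable static-row-numbers")

# CSS selector: matches any table carrying both classes, in any order.
by_selector = soup_demo.select_one("table.wikitable.static-row-numbers")

print(by_string)                  # None
print(by_selector.td.get_text())  # second
```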

In [24]:
# For demonstration, we'll use the first approach and select the table by tag:
table = soup.find_all('table')[1]

### Extract Column Titles

In [25]:
# Get the table header cells:
all_table_titles = table.find_all('th')
all_table_titles


[<th>Country</th>,
 <th>2000</th>,
 <th>Country</th>,
 <th>2005</th>,
 <th>Country</th>,
 <th>2010</th>,
 <th>Country</th>,
 <th>2015</th>,
 <th>Country</th>,
 <th>2020</th>,
 <th>Country</th>,
 <th>2022</th>,
 <th>Country</th>,
 <th>Peak value</th>,
 <th>Peak year
 </th>]

In [26]:
table_titles = [title.text.strip() for title in all_table_titles]
print(table_titles)  # Output the cleaned table titles

['Country', '2000', 'Country', '2005', 'Country', '2010', 'Country', '2015', 'Country', '2020', 'Country', '2022', 'Country', 'Peak value', 'Peak year']
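
The header list repeats `'Country'` several times, which can make later column selection ambiguous. One possible way to get unique labels is to name each `'Country'` column after the column that follows it. The helper `make_unique` and the shortened header list below are illustrative assumptions, not part of the original notebook.

```python
# Shortened stand-in for the scraped header list.
table_titles_demo = ['Country', '2000', 'Country', '2005', 'Country', 'Peak value', 'Peak year']


def make_unique(titles):
    """Rename each repeated 'Country' header after the column that follows it."""
    unique = []
    for position, title in enumerate(titles):
        if title == 'Country' and position + 1 < len(titles):
            unique.append(f"Country ({titles[position + 1]})")
        else:
            unique.append(title)
    return unique


print(make_unique(table_titles_demo))
# ['Country (2000)', '2000', 'Country (2005)', '2005', 'Country (Peak value)', 'Peak value', 'Peak year']
```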


### Create DataFrame

In [27]:
# Create an empty DataFrame with the header names as columns:
df = pd.DataFrame(columns=table_titles)
df

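|  | Country | 2000 | Country.1 | 2005 | Country.2 | 2010 | Country.3 | 2015 | Country.4 | 2020 | Country.5 | 2022 | Country.6 | Peak value | Peak year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |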


### Extract and Populate Data

In [28]:
# Extract the table rows and append each one to the DataFrame:
rows = table.find_all('tr')

for row in rows[1:]:  # Skip the header row
    cells = row.find_all('td')
    row_values = [cell.text.strip() for cell in cells]
    df.loc[len(df)] = row_values  # Append at the next integer index
df

|  | Country | 2000 | Country.1 | 2005 | Country.2 | 2010 | Country.3 | 2015 | Country.4 | 2020 | Country.5 | 2022 | Country.6 | Peak value | Peak year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | World | 117825 | World | 182218 | World | 252084 | World | 297743 | World | 422718 | World | 454385 | World | 465666 | 2021 |
| 1 | United States | 43423 | United States | 62634 | United States | 64661 | United States | 87959 | United States | 126300 | United States | 139866 | United States | 145793 | 2021 |
| 2 | Japan | 19404 | Japan | 19476 | Japan | 28640 | China | 46535 | China | 73866 | China | 84485 | China | 85947 | 2021 |
| 3 | United Kingdom | 6565 | United Kingdom | 10949 | China | 25493 | Japan | 21519 | Japan | 26744 | Japan | 22582 | Japan | 29718 | 2011 |
| 4 | Germany | 6160 | France | 9679 | France | 13526 | United Kingdom | 13978 | Germany | 18053 | Germany | 17426 | Germany | 18412 | 2021 |
| 5 | Italy | 5522 | Italy | 9457 | Germany | 11934 | Germany | 12009 | France | 16326 | United Kingdom | 15972 | United Kingdom | 16741 | 2021 |
| 6 | France | 4704 | Germany | 9073 | Italy | 11545 | France | 11594 | United Kingdom | 15454 | France | 15727 | France | 16326 | 2020 |
| 7 | China | 3704 | China | 8522 | United Kingdom | 11199 | Italy | 10506 | India | 12688 | India | 15365 | India | 15365 | 2022 |
| 8 | Canada | 2613 | Spain | 6905 | Spain | 8701 | India | 8948 | Italy | 12176 | Canada | 11263 | Italy | 12820 | 2007 |
| 9 | Spain | 2497 | Canada | 4363 | Canada | 6832 | Canada | 6930 | Canada | 10586 | Italy | 11020 | Canada | 12501 | 2021 |
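
Text taken from `<td>` cells arrives as strings, and the cells in the page's HTML use thousands separators (for example `117,825`). If the values are needed as numbers, a conversion step along these lines may help. This is a minimal sketch on a made-up two-row DataFrame; `df_demo` and the chosen columns are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the scraped DataFrame: every cell is a string at this point.
df_demo = pd.DataFrame(
    [["World", "117,825", "2021"], ["United States", "43,423", "2021"]],
    columns=["Country", "2000", "Peak year"],
)

# Strip the thousands separators, then convert the text to integers.
for column in ["2000", "Peak year"]:
    df_demo[column] = df_demo[column].str.replace(",", "", regex=False).astype(int)

print(df_demo.dtypes)
```

After the conversion, the numeric columns can be sorted and summed as numbers rather than as text.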


### Alternative Method to Fetch Table Directly

In [29]:
import pandas as pd

# Fetch the page and read the table directly into a DataFrame
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth'
dfs = pd.read_html(url)

# The desired table is the second one (index 1)
df2 = dfs[1]

# Display the DataFrame
df2

|  | Country | 2000 | Country.1 | 2005 | Country.2 | 2010 | Country.3 | 2015 | Country.4 | 2020 | Country.5 | 2022 | Country.6 | Peak value | Peak year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | World | 117825 | World | 182218 | World | 252084 | World | 297743 | World | 422718 | World | 454385 | World | 465666 | 2021 |
| 1 | United States | 43423 | United States | 62634 | United States | 64661 | United States | 87959 | United States | 126300 | United States | 139866 | United States | 145793 | 2021 |
| 2 | Japan | 19404 | Japan | 19476 | Japan | 28640 | China | 46535 | China | 73866 | China | 84485 | China | 85947 | 2021 |
| 3 | United Kingdom | 6565 | United Kingdom | 10949 | China | 25493 | Japan | 21519 | Japan | 26744 | Japan | 22582 | Japan | 29718 | 2011 |
| 4 | Germany | 6160 | France | 9679 | France | 13526 | United Kingdom | 13978 | Germany | 18053 | Germany | 17426 | Germany | 18412 | 2021 |
| 5 | Italy | 5522 | Italy | 9457 | Germany | 11934 | Germany | 12009 | France | 16326 | United Kingdom | 15972 | United Kingdom | 16741 | 2021 |
| 6 | France | 4704 | Germany | 9073 | Italy | 11545 | France | 11594 | United Kingdom | 15454 | France | 15727 | France | 16326 | 2020 |
| 7 | China | 3704 | China | 8522 | United Kingdom | 11199 | Italy | 10506 | India | 12688 | India | 15365 | India | 15365 | 2022 |
| 8 | Canada | 2613 | Spain | 6905 | Spain | 8701 | India | 8948 | Italy | 12176 | Canada | 11263 | Italy | 12820 | 2007 |
| 9 | Spain | 2497 | Canada | 4363 | Canada | 6832 | Canada | 6930 | Canada | 10586 | Italy | 11020 | Canada | 12501 | 2021 |
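
`pd.read_html` also accepts a `match` argument that keeps only tables containing the given text, which avoids relying on the table's position in the returned list. The sketch below runs on a small inline HTML string rather than the live page, so the HTML and the name `df_demo` are assumptions for illustration. Note that `pd.read_html` needs an HTML parser backend (for example `lxml`) to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the page: two tables, only the second mentions the caption text.
html = """
<table><caption>Some other table</caption>
<tr><th>A</th></tr><tr><td>1</td></tr></table>
<table><caption>The 30 largest countries by net national wealth (in billions USD)</caption>
<tr><th>Country</th><th>2000</th></tr>
<tr><td>World</td><td>117,825</td></tr>
<tr><td>United States</td><td>43,423</td></tr></table>
"""

# `match` keeps only tables containing text that matches the given string or regex.
matching = pd.read_html(StringIO(html), match="net national wealth")
df_demo = matching[0]
print(len(matching))
print(df_demo)
```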


In [30]:
# Save the scraped DataFrame to a CSV file
df.to_csv('countries_by_total_wealth.csv', index=False)
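
To confirm that the file was written as expected, one option is to read it back and compare it with the original. The sketch below does this in a temporary folder with a made-up two-row DataFrame; the names `df_demo` and `wealth_demo.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"Country": ["World", "Japan"], "2000": [117825, 19404]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "wealth_demo.csv")  # hypothetical file name
    # index=False leaves the row index out of the file, as in the cell above.
    df_demo.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```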