
## WEB SCRAPING IN PYTHON: WIKIPEDIA TABLE

### Introduction

In this guide, we will scrape the table titled **"The 30 largest countries by net national wealth (in billions USD)"** from the Wikipedia page *List of countries by total wealth*. You can find the page [here](https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth).

This process involves:
1. Fetching the webpage.
2. Parsing the HTML content.
3. Extracting data into a structured format using Python libraries.

### Required Libraries

To perform web scraping and handle data, we will use the following Python libraries:
- `requests` for fetching web pages.
- `beautifulsoup4` for parsing HTML content.
- `pandas` for data manipulation and saving data to CSV.

Before running the code, make sure the necessary libraries are installed. In a Jupyter Notebook cell, you can uncomment and run:

In [None]:
# !pip install beautifulsoup4
# !pip install pandas 
# !pip install requests

In [None]:
# Load Libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
# Fetch the Webpage
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth'
page = requests.get(url)
page  # <Response [200]> indicates the request succeeded
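
Displaying `page` only shows the status code, so a failed request can slip through unnoticed. The sketch below is one possible way to fail early. The helper name `fetch_html` and the User-Agent string are hypothetical choices for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML, raising an error on a 4xx/5xx response."""
    # The User-Agent value is an illustrative placeholder, not a required string.
    headers = {"User-Agent": "table-scraping-tutorial/0.1 (educational example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.text
```

With this helper, `BeautifulSoup(fetch_html(url), 'html.parser')` would stand in for the separate `requests.get` and `page.text` steps.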

In [None]:
# Parse the HTML Content
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())  # Pretty-print the HTML content

### Inspect the Wikipedia Page
To locate the table we need, right-click on the webpage and select "Inspect" to open the developer tools.
The page contains multiple tables, and we need the second one: **"The 30 largest countries by net national wealth (in billions USD)"** on [List_of_countries_by_total_wealth](https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth).

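### Identify the Target Table
In the browser's developer tools, the 1st, 3rd, and 4th tables have the class **wikitable sortable jquery-tablesorter**.
The 2nd table has the class **wikitable static-row-numbers**.
Note that `jquery-tablesorter` is added by JavaScript in the browser, so the HTML returned by `requests` will typically show only `wikitable sortable` for those tables.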

🔗[Inspect page](https://github.com/DataVizStory/Web-Scraping/blob/main/Images/Wiki_table.gif)

There are two ways to target the 2nd table:
+ via tag
+ via class

### Via tag

In [None]:
soup.find_all('table')  # Find all tables on the page

In [None]:

# find_all returns a list of all tables, so index 1 selects the second table:
soup.find_all('table')[1]

### Via class 

Alternatively, use the unique class name (**'wikitable static-row-numbers'**) for the second table: 

In [None]:
soup.find('table', class_="wikitable static-row-numbers")
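
One caveat: when `class_` is given a string containing a space, Beautiful Soup compares it against the exact value of the `class` attribute, so a different class order will not match. The sketch below uses a small made-up HTML snippet (not the real page) to contrast that with a CSS selector, which matches any table carrying both classes.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the second table lists its classes in reverse order.
html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="static-row-numbers wikitable"><tr><td>second</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Exact-string match: finds nothing here because the class order differs.
exact = demo_soup.find("table", class_="wikitable static-row-numbers")

# CSS selector: matches a table that has both classes, in any order.
by_css = demo_soup.select_one("table.wikitable.static-row-numbers")
print(exact, by_css.td.text)
```

On the live page, either approach should work as long as the class attribute is written exactly as shown in the developer tools.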

In [None]:
# For demonstration, we'll use the first approach and select the table by tag:
table = soup.find_all('table')[1]

### Extract Column Titles

In [None]:
# Get the table header cells:
all_table_titles = table.find_all('th')
all_table_titles


In [None]:
table_titles = [title.text.strip() for title in all_table_titles]
print(table_titles)  # Output the cleaned table titles
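
Header cells sometimes contain nested markup such as line breaks or footnote markers, and `find_all('th')` on the whole table also picks up any `<th>` cells in the body rows. The sketch below, on a made-up table, restricts the search to the first row and uses `get_text` with a separator. Whether the real page needs this is an assumption to check against its HTML.

```python
from bs4 import BeautifulSoup

# Made-up table: a header row with nested tags, plus a <th> used as a row label.
html = """
<table>
  <tr><th>Country<br/>(or area)</th><th> Net wealth <sup>[1]</sup></th></tr>
  <tr><th>Alpha</th><td>1,234</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

# Only look at <th> cells in the first row, so row labels are not mixed in.
header_row = demo_table.find("tr")
titles = [th.get_text(" ", strip=True) for th in header_row.find_all("th")]
print(titles)
```

Here `get_text(" ", strip=True)` joins the text pieces inside each cell with a single space and trims surrounding whitespace.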

### Create DataFrame

In [None]:
# Create an empty DataFrame with the table titles as columns:
df = pd.DataFrame(columns=table_titles)
df

### Extract and Populate Data

In [None]:
# Extract table rows and populate the DataFrame:
table_rows = table.find_all('tr')  # each <tr> element is one table row

for row in table_rows[1:]:  # Skip the header row
    row_data = row.find_all('td')  # data cells of the current row
    individual_row_data = [data.text.strip() for data in row_data]
    df.loc[len(df)] = individual_row_data  # append as the next row
df
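
Appending with `df.loc[len(df)]` works, but it raises an error if a row has a different number of cells than the header, and every value stays a string. A possible variation, shown on a made-up table, collects the rows in a list, builds the DataFrame once, and converts the numeric column. The column names and values here are illustrative only.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table for illustration; the last row has a missing cell.
html = """
<table>
  <tr><th>Country</th><th>Wealth</th></tr>
  <tr><td>Alpha</td><td>1,234</td></tr>
  <tr><td>Beta</td><td>567</td></tr>
  <tr><td>Incomplete</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")
rows = demo_table.find_all("tr")
columns = [th.text.strip() for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    cells = [td.text.strip() for td in row.find_all("td")]
    if len(cells) == len(columns):  # keep only rows that match the header
        records.append(cells)

demo_df = pd.DataFrame(records, columns=columns)
# Remove thousands separators, then convert the strings to numbers.
demo_df["Wealth"] = pd.to_numeric(demo_df["Wealth"].str.replace(",", ""))
print(demo_df)
```

Skipping mismatched rows silently is a design choice; logging or inspecting them may be preferable when the table layout is unfamiliar.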

## Alternative Method to Fetch Table Directly

In [None]:
import pandas as pd

# Fetch the page and read all of its tables into a list of DataFrames
# (pd.read_html needs an HTML parser such as lxml to be installed)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth'
dfs = pd.read_html(url)

# The desired table is the second one (index 1)
df2 = dfs[1]

# Display the DataFrame
df2
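
Selecting by position (`dfs[1]`) breaks if the page layout changes. As a hedged alternative, `pd.read_html` accepts a `match` argument that keeps only tables containing the given text. The sketch below runs on made-up HTML wrapped in `StringIO` rather than the live page, and assumes a parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; only the second contains the text "Wealth".
html = """
<table><tr><th>Region</th><th>Share</th></tr><tr><td>X</td><td>10</td></tr></table>
<table><tr><th>Country</th><th>Wealth</th></tr><tr><td>Alpha</td><td>1,234</td></tr></table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(html), match="Wealth")
demo_df2 = tables[0]
print(demo_df2)
```

For the real page, a distinctive header or caption from the target table could be passed as `match`; which text is distinctive enough has to be verified on the page itself.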

In [None]:
# Save the scraped DataFrame to a CSV file (without the index column)
df.to_csv('Countries_by_total_wealth.csv', index=False)
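
To confirm that the file was written as expected, it can be read back with `pd.read_csv`. The sketch below does this round trip with a small made-up DataFrame and a temporary directory, so it does not touch the file saved above. The name `demo_table.csv` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the scraped table.
demo_df = pd.DataFrame({"Country": ["Alpha", "Beta"], "Wealth": [1234, 567]})

# Write to a temporary location, then read the file back.
csv_path = os.path.join(tempfile.mkdtemp(), "demo_table.csv")
demo_df.to_csv(csv_path, index=False)
restored = pd.read_csv(csv_path)
print(restored)
```

Because `index=False` was used when saving, the restored DataFrame has the same columns as the original, with no extra index column.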