# Web Scraping the List of Largest Companies by Revenue

This notebook demonstrates how to scrape data from the Wikipedia page listing the largest companies by revenue and save the extracted data into a CSV file using Python libraries such as `requests`, `BeautifulSoup`, and `pandas`.

Import `requests` to download the Wikipedia page, `BeautifulSoup` to parse its HTML, and `os` to build the output path later.

In [None]:
import os

import requests
from bs4 import BeautifulSoup

Fetch the page with `requests.get`, then parse the returned HTML using BeautifulSoup.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'

page = requests.get(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
print(soup)
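
The fetch-and-parse step above does not check whether the request succeeded. A minimal sketch of a more defensive version follows. The helper names `fetch_html` and `parse_html` and the `User-Agent` value are illustrative assumptions, not part of the original notebook. Some sites reject requests that lack a descriptive user agent, which is why one is set here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    # Illustrative User-Agent value; replace it with one that identifies your script
    headers = {"User-Agent": "companies-scraper-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


def parse_html(html_text):
    """Parse HTML with Python's built-in parser, so no extra parser package is needed."""
    return BeautifulSoup(html_text, "html.parser")


# Small inline document so the parsing step can be tried without network access
sample_soup = parse_html("<html><head><title>Demo</title></head><body><p>Hi</p></body></html>")
print(sample_soup.title.string)
```

In the notebook itself, this would presumably be used as `soup = parse_html(fetch_html(url))`.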

Find the specific table containing the list of largest companies by revenue using its class attribute.

In [None]:
table = soup.find('table', {'class': 'wikitable'})
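
`find` returns only the first matching element, or `None` when nothing matches. The sketch below shows that behaviour on a small invented HTML snippet (the markup and the `sample_` names are assumptions for illustration), with an early check so a missing table fails with a clear message.

```python
from bs4 import BeautifulSoup

# Invented sample markup: one layout table followed by one "wikitable"
sample_html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Example Corp</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any one of the element's CSS classes, so "wikitable sortable" qualifies
sample_table = sample_soup.find("table", class_="wikitable")

# find() returns None when nothing matches; fail early with a clear message
if sample_table is None:
    raise ValueError("No table with class 'wikitable' found")

print(sample_table.find("th").get_text(strip=True))
```

If a page contains several tables with the same class, `find_all` followed by an index is one way to pick a specific one.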

Extract the headers of the table and clean them by removing any notes.

In [None]:
headers = table.find_all('th')

In [None]:
header_texts = [header.get_text(strip=True).replace('[note 1]', '') for header in headers]

In [None]:
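# Keep the first 10 entries: find_all('th') returns every <th> in the table, not only the first header row
header = header_texts[0:10]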
header
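
Replacing the literal string `[note 1]` handles one specific marker. As a sketch of a more general clean-up, a regular expression can remove any bracketed marker. The helper name `clean_header` and the sample strings are hypothetical.

```python
import re

# Matches bracketed footnote markers such as "[note 1]" or "[5]"
FOOTNOTE_PATTERN = re.compile(r"\[[^\]]*\]")


def clean_header(text):
    """Remove bracketed markers and surrounding whitespace (hypothetical helper)."""
    return FOOTNOTE_PATTERN.sub("", text).strip()


# Invented header strings for illustration
raw_headers = ["Rank", "Revenue[note 1]", "Employees [5]"]
cleaned = [clean_header(h) for h in raw_headers]
print(cleaned)
```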

Create an empty DataFrame with the extracted headers to check the column names.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(columns = header)

df

In [None]:
rows = table.find_all('tr')
data = []

Iterate through the table rows, skipping the first two (the header rows), extract the text of each cell, and append each row to the `data` list.

In [None]:
for row in rows[2:]:
    cols = row.find_all(['th', 'td'])
    cols = [col.text.strip() for col in cols]
    data.append(cols)
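
A row whose cell count differs from the number of headers would make the later `pd.DataFrame(data, columns=header)` call fail. The sketch below runs the same loop on an invented table and keeps only rows that match the header length. The markup and the `sample_` names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Invented table: one header row, two data rows, one footer row that spans all columns
sample_html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Industry</th></tr>
  <tr><th>1</th><td>Example Corp</td><td>Retail</td></tr>
  <tr><th>2</th><td>Sample Ltd</td><td>Energy</td></tr>
  <tr><td colspan="3">Footer note</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_rows = sample_table.find_all("tr")

# The first row holds the column names
sample_header = [cell.get_text(strip=True) for cell in sample_rows[0].find_all("th")]

sample_data = []
for row in sample_rows[1:]:
    # Body rows may mix <th> (here, the rank) and <td> cells, so collect both
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    # Keep only rows whose cell count matches the header
    if len(cells) == len(sample_header):
        sample_data.append(cells)

print(sample_data)
```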

Build the DataFrame from the collected rows, then drop the columns that are not needed.

In [None]:
df = pd.DataFrame(data, columns=header)
# Drop both unneeded columns in a single call
df = df.drop(columns=['Ref.', 'State-owned'])
print(df)
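
Column names on a live page can change, and every scraped value arrives as a string. The sketch below, on invented data, shows one way to drop columns without failing when a label is absent and to convert a numeric column explicitly. The `sample_` names and values are assumptions.

```python
import pandas as pd

# Invented rows that mimic the shape of the scraped data
sample_header = ["Rank", "Name", "State-owned", "Ref."]
sample_data = [
    ["1", "Example Corp", "", "[1]"],
    ["2", "Sample Ltd", "Yes", "[2]"],
]

sample_df = pd.DataFrame(sample_data, columns=sample_header)

# errors="ignore" skips labels that are absent instead of raising KeyError
sample_df = sample_df.drop(columns=["Ref.", "State-owned", "Not a column"], errors="ignore")

# Scraped cells are strings; convert numeric columns explicitly
sample_df["Rank"] = pd.to_numeric(sample_df["Rank"])

print(sample_df)
```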

Save the cleaned DataFrame to a CSV file.

In [None]:
path = os.getcwd()

# Write the CSV to the current working directory rather than a machine-specific absolute path
df.to_csv(os.path.join(path, 'Companies.csv'), index=False)
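
When the output location differs between machines, the path can be built with `pathlib` instead of being typed out in full. The sketch below is one possible approach: it uses a temporary directory and an invented DataFrame as stand-ins, and the `output` folder name is an assumption.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented data standing in for the scraped table
sample_df = pd.DataFrame({"Rank": [1, 2], "Name": ["Example Corp", "Sample Ltd"]})

# A temporary directory stands in for the real output folder in this sketch
with tempfile.TemporaryDirectory() as tmp_dir:
    output_dir = Path(tmp_dir) / "output"
    # Create the folder if needed; to_csv does not create missing directories
    output_dir.mkdir(parents=True, exist_ok=True)

    csv_path = output_dir / "Companies.csv"
    sample_df.to_csv(csv_path, index=False)

    # Read the file back to confirm the round trip
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```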

# Alternative Method: Using pandas read_html
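Use `pd.read_html` to read the table directly from the webpage. It returns a list with one DataFrame per table found on the page, so the result is indexed to select the table of interest.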

In [None]:
import pandas as pd

In [None]:
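df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue', header=0, skiprows=[1])[1]
# Remove the literal "[note 1]" marker; regex=False keeps the brackets from being read as a character class
df.columns = df.columns.str.replace('[note 1]', '', regex=False)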
df
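
The column clean-up after `read_html` targets one specific marker. As a sketch of a more general variant, the `.str` accessor on the column index can remove any bracketed marker with a regular expression. The column names below are invented for illustration.

```python
import pandas as pd

# Invented columns standing in for the scraped table's header
sample_df = pd.DataFrame(columns=["Rank", "Revenue[note 1]", "Profit [note 2]"])

# Raw string plus regex=True: remove any "[...]" marker, then trim whitespace
sample_df.columns = (
    sample_df.columns
    .str.replace(r"\[[^\]]*\]", "", regex=True)
    .str.strip()
)

print(list(sample_df.columns))
```

Passing `regex=True` explicitly avoids relying on the default, which has differed between pandas versions.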

In [None]:
# A relative path writes the file to the current working directory
df.to_csv('Companies.csv', index=False)