## Web Scraping Task – CodeVeda Internship

For my first task during the CodeVeda internship, I was assigned to scrape data from the web using **BeautifulSoup**. I chose to scrape a table of the **world's highest-revenue companies**.

### Challenges Encountered

While working on this task, I ran into a few challenges:

1. **Malformed Table Rows**  
   After successfully scraping the table headers, I noticed that the `<tbody>` section of the table was not structured correctly. It contained only `<td>` elements without any enclosing `<tr>` rows. This made it technically invalid and difficult to work with.  
   To solve this, I grouped the `<td>` elements based on the number of headers so that each group formed a complete row.

2. **Advertisements Embedded in Table**  
   The table was interrupted at intervals by advertisement blocks that were not part of the actual data. These blocks sat inside the table structure in their own `<td>` tags, so they looked like data cells.  
   I resolved this by keeping only the non-empty `<td>` elements that sit directly under `<tbody>`, which drops the ad cells before the rows are grouped.

3. **Pagination Handling**  
   The table data was paginated, with a "Next" button linking to subsequent pages. I handled this by identifying the pagination link and programmatically fetching and parsing all pages to gather the full dataset.
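
The first two fixes can be sketched on a tiny offline example. This is only an illustration under assumptions: `SAMPLE_HTML` and `group_cells` are hypothetical names, the markup is a made-up miniature of the real page, and the ad placeholder is assumed to be an empty `<td>`. It uses `html.parser`, which keeps the loose `<td>` elements as direct children of `<tbody>`; a parser that repairs markup (such as `html5lib`) may restructure them.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the malformed markup: loose <td> cells with no <tr>,
# plus one empty cell standing in for an ad placeholder (an assumption).
SAMPLE_HTML = """
<table>
  <thead><tr><th>Rank</th><th>Name</th><th>Revenue</th></tr></thead>
  <tbody>
    <td>1</td><td>Company A</td><td>$100 B</td>
    <td></td>
    <td>2</td><td>Company B</td><td>$90 B</td>
  </tbody>
</table>
"""


def group_cells(html):
    """Return (headers, rows) from a table whose <tbody> holds bare <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.get_text(strip=True) for th in soup.find_all("th")]

    # Only direct children of <tbody>, so nested cells are not picked up.
    tbody = soup.find("tbody")
    cells = [td.get_text(strip=True) for td in tbody.find_all("td", recursive=False)]

    # Drop empty cells (the assumed ad placeholders) before grouping.
    cells = [text for text in cells if text]

    # Slice the flat list into rows that are as wide as the header.
    width = len(headers)
    rows = [cells[i:i + width] for i in range(0, len(cells), width)]
    return headers, rows


headers, rows = group_cells(SAMPLE_HTML)
```

Filtering before grouping matters here: a stray cell left in the flat list would shift every later row by one column.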


In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd


In [None]:
url = "https://companiesmarketcap.com/largest-companies-by-revenue/"
page = requests.get(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, "html.parser")

In [None]:
table = soup.find('table', class_='default-table')
# print(table.prettify())

In [None]:
raw_table_titles = table.find_all("th")
# Read the text from the <th> elements found above
table_titles = [title.text for title in raw_table_titles]
table_titles

In [None]:
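# Look for <tbody> inside the target table rather than anywhere on the page
tbody = table.find('tbody')
# Direct children only: the cells are not wrapped in <tr> rows
tds = tbody.find_all('td', recursive=False)
# Keep cells that have content; empty cells are not data
clean_tds = [td for td in tds if td.name == "td" and td.contents]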

In [None]:
columns_per_row = len(table_titles)

In [None]:
# Group tds into rows
rows = [
    clean_tds[i:i + columns_per_row]
    for i in range(0, len(clean_tds), columns_per_row)
]

# rows

In [None]:
# Convert rows to plain text
data = []
for row in rows:
    data.append([cell.get_text(strip=True) for cell in row])

In [None]:
df = pd.DataFrame(data, columns=table_titles)
df

### Handle paginated data

In [None]:
base_url = "https://companiesmarketcap.com"
start_url = "/largest-companies-by-revenue/"

all_data = []
next_page = start_url

while next_page:
    print(f"Fetching: {next_page}")
    response = requests.get(base_url + next_page)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")

    # Re-query <tbody> from the page just fetched; reusing the earlier
    # `tbody` would re-read the first page on every iteration.
    tbody = soup.find('tbody')
    tds = tbody.find_all('td', recursive=False)
    clean_tds = [td for td in tds if td.name == "td" and td.contents]
    all_data.extend([td.get_text(strip=True) for td in clean_tds])

    # Find the "Next" button
    next_link = soup.find("a", class_="page-link", string=lambda text: text and "Next" in text)

    if next_link and next_link.get("href"):
        next_page = next_link["href"]
    else:
        next_page = None  # no more pages

# At the end, `all_data` contains all scraped text data from all <td>s
print(f"Scraped {len(all_data)} td items.")
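
The same pagination idea can be tried without touching the network. The sketch below is an assumption-laden illustration, not the notebook's actual method: `crawl_pages`, `FAKE_SITE`, and the `/list/` paths are hypothetical, and the page fetcher is passed in as a function so a dictionary of canned HTML can stand in for `requests`. It also resolves each "Next" link with `urljoin` and caps the number of pages, so a link that loops back cannot run forever.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://companiesmarketcap.com"


def crawl_pages(fetch, start_path, max_pages=50):
    """Follow "Next" links and return one list of cell texts per page.

    `fetch` is any callable that takes a URL and returns HTML text.
    """
    pages = []
    seen = set()
    url = urljoin(BASE_URL, start_path)
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")

        # Query <tbody> from the page that was just parsed.
        tbody = soup.find("tbody")
        tds = tbody.find_all("td", recursive=False) if tbody else []
        pages.append([td.get_text(strip=True) for td in tds if td.contents])

        # Resolve the "Next" href against the current URL; stop when absent.
        next_link = soup.find(
            "a", class_="page-link", string=lambda text: text and "Next" in text
        )
        href = next_link.get("href") if next_link else None
        url = urljoin(url, href) if href else None
    return pages


# Hypothetical two-page site used in place of real HTTP requests.
FAKE_SITE = {
    BASE_URL + "/list/": (
        "<table><tbody><td>1</td><td>A</td></tbody></table>"
        '<a class="page-link" href="/list/?page=2">Next</a>'
    ),
    BASE_URL + "/list/?page=2": (
        "<table><tbody><td>2</td><td>B</td><td></td></tbody></table>"
    ),
}

pages = crawl_pages(FAKE_SITE.__getitem__, "/list/")
```

Against the live site, the fetcher could be something like `lambda u: requests.get(u, timeout=10).text`, assuming the page structure matches what the notebook saw.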

In [None]:
rows = [
    all_data[i:i + columns_per_row]
    for i in range(0, len(all_data), columns_per_row)
]
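
Because the grouping relies purely on position, one leftover ad cell would misalign every following row. A small guard can catch that before the DataFrame is built. This is a sketch only: `to_rows` is a hypothetical helper, not part of the original notebook.

```python
def to_rows(cells, width):
    """Split a flat list of cells into rows of `width`, refusing uneven input."""
    if width <= 0:
        raise ValueError("width must be positive")
    remainder = len(cells) % width
    if remainder:
        raise ValueError(
            f"{len(cells)} cells do not divide into rows of {width} "
            f"({remainder} left over)"
        )
    return [cells[i:i + width] for i in range(0, len(cells), width)]


checked_rows = to_rows(["1", "A", "$100 B", "2", "B", "$90 B"], 3)
```

A passing check does not prove the rows are aligned (stray cells could happen to add up to a full row), but a failing one is a clear sign that the filtering missed something.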


In [None]:
df = pd.DataFrame(data=rows, columns=table_titles)
# Sample up to 100 rows; sample() raises if asked for more rows than exist
df.sample(min(100, len(df)))

In [None]:
# Save to CSV
df.to_csv('companies.csv', index=False)