# Web Scraping "List of Largest Companies in the United States based on Revenue" data from Wikipedia

This is a portfolio project on web scraping. The variable names used in the notebook are:
* url = URL of the scraped webpage
* page = HTTP response returned by requests.get
* soup = Parsed HTML of the webpage (a BeautifulSoup object)
* table = The table tag with class = 'wikitable sortable'
* us_headers = Column header cells (th tags) of the table in "table" above
* us_comp_headers = Stripped text of the header names in "us_headers" above
* us_comp_rev_data = The entire dataframe
* us_comp_column_data = All rows (tr tags) of the table in "table" above
* row_data = Data cells (td tags) of a single row in "us_comp_column_data" above
* individual_row_data = Stripped text of each cell in "row_data"
* length = Current number of rows in the dataframe

Link to webpage: [List of Largest Companies in the United States based on Revenue](https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue)

In [16]:
# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# Specify webpage url
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'

In [3]:
# Send a request to the webpage
page = requests.get(url)

# Parse the response body into a BeautifulSoup object, naming the parser explicitly
soup = BeautifulSoup(page.text, 'html.parser')
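
The parsing step can be tried offline before hitting the network. The snippet below is a minimal sketch, assuming a made-up HTML string (`sample_html`) in place of `page.text`; the names `sample_html` and `sample_soup` are illustrative and not part of the notebook. It names the built-in `html.parser` explicitly, since a generic parser name may resolve to whichever HTML parser happens to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.text, so the example runs without a network call.
sample_html = """
<html><head><title>Sample page</title></head>
<body><table class="wikitable sortable"><tr><th>Rank</th></tr></table></body></html>
"""

# Naming the parser explicitly keeps the parse tree consistent across machines.
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.text
table_count = len(sample_soup.find_all("table"))
print(page_title, table_count)
```

The same two lookups (`title` and `find_all("table")`) can be run on the real `soup` as a quick sanity check that the page was fetched and parsed.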

In [None]:
print(soup)

In [None]:
# Pulling the table using the find_all function and its index in the list.
soup.find_all('table')[1]

In [13]:
# Pull the table using find and the table's class (an alternative to the index lookup above; this cell assigns "table")
table = soup.find('table', class_ = 'wikitable sortable')
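
Passing a multi-word string to `class_` matches the exact text of the class attribute, so it can miss a table whose classes appear in another order or alongside extra classes. The sketch below illustrates this on a made-up HTML fragment (`sample_html` is an assumption, not the Wikipedia markup) and shows a CSS selector via `select` as one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two tables carrying the same classes in different orders.
sample_html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="sortable wikitable extra"><tr><td>second</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the class attribute: only the first table qualifies.
exact_matches = sample_soup.find_all("table", class_="wikitable sortable")

# CSS selector: any table carrying both classes, in any order.
css_matches = sample_soup.select("table.wikitable.sortable")

print(len(exact_matches), len(css_matches))
```

If the live page ever adds or reorders classes on the table, `soup.select_one("table.wikitable.sortable")` may be the more tolerant lookup.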

In [None]:
# Extracts the column headers from the table
us_headers = table.find_all('th')

us_comp_headers = [titles.text.strip() for titles in us_headers]
us_comp_headers
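
Calling `find_all('th')` on the whole table returns every `th` cell, including any that appear in body rows. Whether the Wikipedia table contains such row headers is not stated here, so the following is only a sketch on a made-up table (`sample_html` and the other names are assumptions) showing how restricting the search to the first `tr` keeps the header list aligned with the columns.

```python
from bs4 import BeautifulSoup

# Hypothetical table whose body row starts with a th cell (a row header).
sample_html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><th>1</th><td>Example Corp</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# Every th in the table, including the row header in the body.
all_th = [cell.text.strip() for cell in sample_table.find_all("th")]

# Only the th cells of the first row, i.e. the column headers.
first_row_th = [cell.text.strip() for cell in sample_table.find("tr").find_all("th")]

print(all_th)
print(first_row_th)
```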

In [None]:
# Insert column names into a pandas dataframe
us_comp_rev_data = pd.DataFrame(columns = us_comp_headers)
us_comp_rev_data

In [22]:
# Extract all the rows of data
us_comp_column_data = table.find_all('tr')

# Loop through the data rows (skipping the header row), strip each cell's text, and append the row to the dataframe
for row in us_comp_column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(us_comp_rev_data)
    us_comp_rev_data.loc[length] = individual_row_data
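
As an alternative to appending with `.loc` one row at a time, the rows can be collected in a plain list and turned into a dataframe in one step. This is a minimal sketch on a made-up table; `sample_html`, `records`, and `sample_df` are illustrative names, and skipping rows whose cell count differs from the header count is an added assumption rather than something the notebook does.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table standing in for the scraped Wikipedia table.
sample_html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Example Corp</td></tr>
  <tr><td>2</td><td>Sample Inc</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = sample_table.find_all("tr")

# Column names come from the first row's th cells.
headers = [cell.text.strip() for cell in rows[0].find_all("th")]

# Collect the stripped text of each data row, skipping rows that do not fit the columns.
records = []
for row in rows[1:]:
    cells = [cell.text.strip() for cell in row.find_all("td")]
    if len(cells) == len(headers):
        records.append(cells)

# Build the dataframe once from the collected rows.
sample_df = pd.DataFrame(records, columns=headers)
print(sample_df)
```

Building the dataframe once avoids growing it inside the loop, which tends to matter more as the table gets larger.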

In [None]:
# Display dataframe
us_comp_rev_data