# Scraping for revenue

![Web Scraping](https://t.ly/9KRqy)

This notebook showcases some simple web scraping using the [Beautiful Soup](https://pypi.org/project/beautifulsoup4/) and [Requests](https://pypi.org/project/requests/) modules for Python.

The data covers the revenue of the largest companies in the US and is extracted from [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue).


<img src="https://t.ly/SDrMp" width="30%">

First of all, we load our libraries and request the site to get the page as a whole. We pass the URL to the `requests.get()` function, which does what the name implies: it gets us the page itself.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
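# Get the Wikipedia page containing the list
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'
page = requests.get(url)
page.raise_for_status()  # stop early if the request failed

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, 'html.parser')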
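If the request step is reused outside the notebook, it may help to wrap it in a small function. The sketch below is one possible shape, not part of the original notebook: `fetch_html`, its `http_get` parameter and the User-Agent string are all hypothetical names chosen for illustration, while `timeout`, `headers` and `raise_for_status()` are regular Requests features. The `http_get` parameter exists only so that the function can be tried with a stand-in instead of a live connection.

```python
import requests


def fetch_html(url, http_get=requests.get, timeout=10):
    """Return the HTML text of a page, raising if the server reports an error."""
    # Hypothetical User-Agent string: replace it with one that identifies your script.
    headers = {"User-Agent": "revenue-notebook-example/0.1"}
    response = http_get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx responses
    return response.text
```

With a network connection available, `fetch_html(url)` would return the same text as `page.text` in the cell above.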

# Extracting the Revenue table

We will extract the first table and make a *soup* object out of it for further editing.
If we inspect the page using the browser's developer tools, we will see that the tables on the page are contained within the `<table>` HTML tag and share the CSS class `wikitable sortable`, so a simple `soup.find(<tag_in_question>)` command will fetch what we need.
This command searches for and returns the first, and only the first, `<tag>` that was given as its argument. Fortunately for us, our table is the very first one on the page, so that will do the trick.

In [3]:
# Fetch the table containing the revenue ranking
table = soup.find('table', class_='wikitable sortable')
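To see the "first match only" behaviour in isolation, here is a minimal sketch that runs on a tiny hand-written HTML snippet instead of the live page (the snippet and its contents are made up for illustration). It also shows `select_one()` with a CSS selector as an alternative: when `class_` is given a string containing several classes, Beautiful Soup compares it with the exact attribute value, whereas the selector matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page: two tables sharing the same CSS classes.
html = """
<table class="wikitable sortable"><tr><th>First</th></tr></table>
<table class="wikitable sortable"><tr><th>Second</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, find_all() collects every match
first = soup.find("table", class_="wikitable sortable")
all_tables = soup.find_all("table", class_="wikitable sortable")

print(first.th.text)    # header text of the first table
print(len(all_tables))  # number of matching tables

# CSS selector alternative: matches both classes regardless of their order
same_first = soup.select_one("table.wikitable.sortable")
print(same_first.th.text)
```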

# Getting the column titles of the table

With our table isolated as a *soup* object, we extract its column headers in the same fashion, but this time using the `.find_all()` command because the header row contains multiple values.
Afterwards, since raw HTML is not the most human-friendly format, we strip the unnecessary tags and whitespace to turn it into a list of strings for inspection and future use.

In [4]:
# Extract the table headers, drop html tags and special symbols
table_headers = table.find_all('th')
titles = [title.text.strip() for title in table_headers]
titles

['Rank',
 'Name',
 'Industry',
 'Revenue (USD millions)',
 'Revenue growth',
 'Employees',
 'Headquarters']
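The effect of the cleaning step is easier to see side by side. The following sketch uses a hypothetical header row, written by hand with trailing newlines inside each cell, and compares the raw text with the stripped text.

```python
from bs4 import BeautifulSoup

# Hypothetical header row; each cell carries a trailing newline before its closing tag.
html = (
    "<table><tr>"
    "<th>Rank\n</th><th>Name\n</th><th>Revenue (USD millions)\n</th>"
    "</tr></table>"
)
table = BeautifulSoup(html, "html.parser").find("table")

# .text removes the tags, .strip() removes the surrounding whitespace
raw = [th.text for th in table.find_all("th")]
clean = [th.text.strip() for th in table.find_all("th")]

print(raw)
print(clean)
```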

# Into a Dataframe

Our next move is to turn this list of headers into a DataFrame, which will enable further manipulation.

In [5]:
df = pd.DataFrame(columns=titles)

# Extracting the row data of the table

With the headers set, we follow a similar approach to extract the actual data from the rows that follow. Again we create a suitable *soup* object holding all the rows, then iterate over it, pulling the cells of each row into their own *soup* object so that we can easily access their text and append it to the DataFrame.
Finally we set the 'Rank' column as our index, preserving the original visual format of the Wikipedia page.

In [6]:
# Extract the table rows, then the data in them.
table_rows = table.find_all('tr')

# Skip the first row, which holds the headers and has no <td> cells
for row in table_rows[1:]:
    row_data = row.find_all('td')
    indiv_row_data = [data.text.strip() for data in row_data]

    # Append the cleaned row at the end of the DataFrame
    df.loc[len(df)] = indiv_row_data

df.set_index('Rank', inplace=True)
df

| Rank | Name | Industry | Revenue (USD millions) | Revenue growth | Employees | Headquarters |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Walmart | Retail | 611289 | 6.7% | 2100000 | Bentonville, Arkansas |
| 2 | Amazon | Retail and Cloud Computing | 513983 | 9.4% | 1540000 | Seattle, Washington |
| 3 | Exxon Mobil | Petroleum industry | 413680 | 44.8% | 62000 | Spring, Texas |
| 4 | Apple | Electronics industry | 394328 | 7.8% | 164000 | Cupertino, California |
| 5 | UnitedHealth Group | Healthcare | 324162 | 12.7% | 400000 | Minnetonka, Minnesota |
| ... | ... | ... | ... | ... | ... | ... |
| 96 | Best Buy | Retail | 46298 | 10.6% | 71100 | Richfield, Minnesota |
| 97 | Bristol-Myers Squibb | Pharmaceutical industry | 46159 | 0.5% | 34300 | New York City, New York |
| 98 | United Airlines | Airline | 44955 | 82.5% | 92795 | Chicago, Illinois |
| 99 | Thermo Fisher Scientific | Laboratory instruments | 44915 | 14.5% | 130000 | Waltham, Massachusetts |
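As a variation on the loop above, and not the approach this notebook takes, the rows can first be collected in a plain list and the DataFrame built once at the end, which avoids growing it one row at a time. The sketch below assumes a small hand-written table with three of the real columns so that it runs without fetching the page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written miniature of the revenue table, for illustration only.
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Industry</th></tr>
  <tr><td>1</td><td>Walmart</td><td>Retail</td></tr>
  <tr><td>2</td><td>Amazon</td><td>Retail and Cloud Computing</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

titles = [th.text.strip() for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped here
        rows.append(cells)

# Build the DataFrame in one step, then use 'Rank' as the index
df = pd.DataFrame(rows, columns=titles).set_index("Rank")
print(df)
```

Note that every scraped value, including the rank, is still a string at this point.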


# Saving the data

Lastly, we save the data in CSV format, which is suitable for processing outside of Python in other applications such as MS Excel.

In [7]:
# Save dataframe into a CSV, overwriting existing ones
df.to_csv(r'companies.csv')
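A quick way to check the saved file is to read it back. The sketch below is a self-contained round trip on a two-row sample taken from the table above; it writes to a temporary directory (an assumption made here so that the example leaves no file behind) and reloads the CSV with 'Rank' as the index.

```python
import os
import tempfile

import pandas as pd

# Two-row sample; the headquarters values contain commas on purpose.
df = pd.DataFrame(
    {
        "Name": ["Walmart", "Amazon"],
        "Headquarters": ["Bentonville, Arkansas", "Seattle, Washington"],
    },
    index=pd.Index(["1", "2"], name="Rank"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "companies.csv")
    df.to_csv(path)  # the index is written as the first column
    restored = pd.read_csv(path, index_col="Rank")

print(restored)
```

Values containing commas are quoted in the file, so they survive the round trip as single fields.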

# Done!

This concludes our scraping. Though simple at its core, it lays the groundwork for automation: the same code can be written as a `.py` script and deployed to a web server, where a *cronjob* launches it periodically and updates the revenue data. As long as the HTML format of the table does not change, our script will run smoothly. Even if it does change, we can always wrap the scraping in `try ... except` statements and have the script notify us via email or another method. Until that happens, if Walmart ever falls from its revenue "crown", we will probably be the first to notice without visiting Wikipedia.
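The `try ... except` idea could look roughly like the sketch below. It is only an outline under stated assumptions: `run_job`, `notify` and `broken_scrape` are hypothetical names, and `notify` merely logs the message where a real deployment would send an email or similar. The simulated failure mirrors what happens when the table can no longer be found, since `soup.find()` returns `None` in that case.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("revenue_scraper")


def notify(message):
    """Hypothetical alert hook: swap the body for email, chat or any other channel."""
    log.error("ALERT: %s", message)


def run_job(scrape, alert=notify):
    """Run one scrape and report a failure instead of crashing. Returns True on success."""
    try:
        scrape()
    except Exception as exc:  # broad on purpose: any failure should trigger an alert
        alert(f"Scrape failed: {exc!r}")
        return False
    return True


def broken_scrape():
    # Simulates the table disappearing from the page: find() would return None,
    # and calling a method on None raises AttributeError.
    table = None
    table.find_all("th")


ok = run_job(broken_scrape)
print(ok)
```

In a scheduled script, the function passed to `run_job` would be the scraping steps from this notebook gathered into a single function.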