Section II. HTML-based scraping (1 hour)

- GET and POST calls to retrieve response objects - using urllib2, requests, JSON, etc.
- Using bs4 (and lxml) to parse the structure and access different elements within an HTML or XML document
- Manipulate it into a tabular structure - explore the schema
- Store it in the appropriate format - CSV, TSV and export the results

## 1. GET and POST calls to retrieve response objects - using urllib2, requests, JSON, etc.


In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

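web_url = 'https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018'
# Fetch the page; the timeout avoids hanging forever on a slow server
res = requests.get(web_url, timeout=30)
# Stop here on a 4xx/5xx response instead of parsing an error page
res.raise_for_status()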
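
The cell above only issues a GET. As a rough sketch of the POST side, the block below builds a JSON POST request with the standard library's `urllib.request` (the Python 3 home of what used to be `urllib2`). The endpoint `https://example.com/api/search` and the payload are placeholders invented for illustration, not part of this tutorial's data source, and the request is only built here, not sent.

```python
import json
import urllib.request

# Placeholder endpoint and payload, for illustration only
api_url = 'https://example.com/api/search'
payload = {'query': 'fortune 500', 'page': 1}


def build_post_request(url, data):
    """Build (but do not send) a POST request carrying a JSON body."""
    body = json.dumps(data).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send_json(request, timeout=10):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


post_request = build_post_request(api_url, payload)
print(post_request.get_method(), post_request.full_url)
```

With `requests`, the same call is usually written as `requests.post(api_url, json=payload)`, and `.json()` on the response decodes a JSON body.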

## 2. Using bs4 (and lxml) to parse the structure and access different elements within an HTML or XML document

In [1]:
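# Name the parser explicitly; omitting it triggers a warning and the result
# can vary by machine. 'lxml' also works here if that package is installed.
soup_object = BeautifulSoup(res.content, 'html.parser')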
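
To make "access different elements" concrete, here is a small offline sketch that parses a hand-written snippet shaped like the table on the page (the header row and the Walmart row are copied from the output shown later in this section). It uses `'html.parser'` only so that it runs without extra dependencies; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page, so the example runs without a network call
sample_html = """
<table class="data-table">
  <tr><th>Rank</th><th>Company</th><th>Website</th></tr>
  <tr><td>1</td><td>Walmart</td>
      <td><a href="http://www.stock.walmart.com">http://www.stock.walmart.com</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching tag; class_ filters on the CSS class
table = sample_soup.find('table', class_='data-table')

# find_all() returns every matching descendant tag
headers = [th.get_text(strip=True) for th in table.find_all('th')]

# Attributes are read like dictionary keys, text via get_text()
first_link = table.find('a')

print(headers)
print(first_link['href'])
print(first_link.get_text())
```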

## 3. Manipulate it into a tabular structure - explore the schema

In [4]:
# Take the first <table> whose CSS class is 'data-table'
data_table = soup_object.find_all('table', 'data-table')[0]

In [5]:
all_values = data_table.find_all('tr')
all_values

[<tr><th>Rank</th>
 <th>Company</th>
 <th>Website</th>
 </tr>, <tr><td>1</td>
 <td>Walmart</td>
 <td><a href="http://www.stock.walmart.com">http://www.stock.walmart.com</a></td>
 </tr>, <tr><td>2</td>
 <td>Exxon Mobil</td>
 <td><a href="http://www.exxonmobil.com">http://www.exxonmobil.com</a></td>
 </tr>, <tr><td>3</td>
 <td>Berkshire Hathaway</td>
 <td><a href="http://www.berkshirehathaway.com">http://www.berkshirehathaway.com</a></td>
 </tr>, <tr><td>4</td>
 <td>Apple</td>
 <td><a href="http://www.apple.com">http://www.apple.com</a></td>
 </tr>, <tr><td>5</td>
 <td>UnitedHealth Group</td>
 <td><a href="http://www.unitedhealthgroup.com">http://www.unitedhealthgroup.com</a></td>
 </tr>, <tr><td>6</td>
 <td>McKesson</td>
 <td><a href="http://www.mckesson.com">http://www.mckesson.com</a></td>
 </tr>, <tr><td>7</td>
 <td>CVS Health</td>
 <td><a href="http://www.cvshealth.com">http://www.cvshealth.com</a></td>
 </tr>, <tr><td>8</td>
 <td>Amazon.com</td>
 <td><a href="http://www.amazon.com"

In [6]:
# Collect one [rank, company, website] list per data row (the first <tr> is
# the header), then build the DataFrame once instead of growing it with .loc
rows = []

for row in all_values[1:]:
    values = row.find_all('td')
    rank = values[0].text
    company = values[1].text
    website = values[2].text

    rows.append([rank, company, website])

fortune_500_df = pd.DataFrame(rows, columns=['rank', 'company_name', 'company_website'])
fortune_500_df.head()

| | rank | company_name | company_website |
|---|---|---|---|
| 0 | 1 | Walmart | http://www.stock.walmart.com |
| 1 | 2 | Exxon Mobil | http://www.exxonmobil.com |
| 2 | 3 | Berkshire Hathaway | http://www.berkshirehathaway.com |
| 3 | 4 | Apple | http://www.apple.com |
| 4 | 5 | UnitedHealth Group | http://www.unitedhealthgroup.com |
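
One possible way to "explore the schema" is sketched below on a three-row sample copied from the output above. Values read from `.text` are strings, so the `rank` column starts out as text; converting it with `pd.to_numeric` is a suggested extra step, not something the original notebook does.

```python
import pandas as pd

# Three rows copied from the scraped output; every value is a string,
# exactly as it comes out of tag.text
sample_df = pd.DataFrame(
    [['1', 'Walmart', 'http://www.stock.walmart.com'],
     ['2', 'Exxon Mobil', 'http://www.exxonmobil.com'],
     ['3', 'Berkshire Hathaway', 'http://www.berkshirehathaway.com']],
    columns=['rank', 'company_name', 'company_website'],
)

# Shape and column types give a first view of the schema
print(sample_df.shape)
print(sample_df.dtypes)

# Convert rank to a numeric column so it sorts and filters as a number
sample_df['rank'] = pd.to_numeric(sample_df['rank'])
print(sample_df.dtypes)
```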


## 4. Store it in the appropriate format - CSV, TSV and export the results

In [7]:
fortune_500_df.to_csv('./fortune_500_companies.csv', index=False)
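
The heading also mentions TSV. A minimal sketch of that variant follows: `to_csv` takes a `sep` argument, so a tab-separated export is the same call with `sep='\t'`. The example writes to an in-memory buffer and uses a two-row sample frame (values taken from the output above) so that it runs on its own; with the real data you would pass a file path such as `'./fortune_500_companies.tsv'`, which is an assumed name.

```python
import io

import pandas as pd

# Two rows copied from the scraped output, used as a stand-in for fortune_500_df
sample_df = pd.DataFrame({
    'rank': [1, 2],
    'company_name': ['Walmart', 'Exxon Mobil'],
    'company_website': ['http://www.stock.walmart.com', 'http://www.exxonmobil.com'],
})

# Write tab-separated text to an in-memory buffer instead of a file
tsv_buffer = io.StringIO()
sample_df.to_csv(tsv_buffer, sep='\t', index=False)
tsv_text = tsv_buffer.getvalue()
print(tsv_text)

# Read it back with the same separator to check that the export round-trips
round_trip = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(round_trip)
```

Passing `index=False` keeps the row index out of the file; without it, reading the file back produces an extra `Unnamed: 0` column.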