# Scraping tables with Pandas using read_html()

- `read_html()` is one of the easiest ways to read HTML tables from a webpage directly into a Pandas DataFrame, without writing any scraping code yourself. It is useful for quickly combining tables from several websites.
- Documentation of pandas.read_html: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

## Example 1: Using an HTML string

We pass `html_string` to `pd.read_html()`, which extracts every HTML table it finds and returns them as a list of DataFrames.

In [1]:
import pandas as pd

html_string = '''
<table>
<tr>
    <th>Company</th>
    <th>Country</th>
</tr>
<tr>
    <td>Apple Inc.</td>
    <td>United States</td>
</tr>
<tr>
    <td>Tencent Holdings Limited</td>
    <td>China</td>
</tr>
</table>
'''
df_1 = pd.read_html(html_string)  # only <table>...</table> elements are read
df_1


[                    Company        Country
 0                Apple Inc.  United States
 1  Tencent Holdings Limited          China]
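
Newer pandas releases (2.1 and later) warn when a literal HTML string is passed straight to `read_html()`. Below is a minimal sketch of the same example with the string wrapped in `io.StringIO`. It assumes an HTML parser backend such as `lxml` is installed, and the names `tables` and `companies` are only illustrative.

```python
from io import StringIO

import pandas as pd

html_string = '''
<table>
<tr>
    <th>Company</th>
    <th>Country</th>
</tr>
<tr>
    <td>Apple Inc.</td>
    <td>United States</td>
</tr>
<tr>
    <td>Tencent Holdings Limited</td>
    <td>China</td>
</tr>
</table>
'''

# Wrap the literal string in a file-like object before handing it to pandas.
tables = pd.read_html(StringIO(html_string))

# read_html() always returns a list, even when only one table is found.
companies = tables[0]
print(companies)
```

Indexing the list with `[0]` gives an ordinary DataFrame that can be filtered or merged like any other.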

## Example 2: Reading HTML Data From URL

We want to extract the list of S&P 500 component stocks from Wikipedia: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies.

If you cannot access Wikipedia, try this alternative website: https://www.thelists.org/sp-500-list.html. It does not provide the complete list, but it is enough to try out `read_html()`.

In [2]:
import pandas as pd
  
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
print(len(dfs)) 
dfs

2


[    Symbol              Security             GICS Sector  \
 0      MMM                    3M             Industrials   
 1      AOS           A. O. Smith             Industrials   
 2      ABT                Abbott             Health Care   
 3     ABBV                AbbVie             Health Care   
 4      ACN             Accenture  Information Technology   
 ..     ...                   ...                     ...   
 498    YUM           Yum! Brands  Consumer Discretionary   
 499   ZBRA    Zebra Technologies  Information Technology   
 500    ZBH         Zimmer Biomet             Health Care   
 501   ZION  Zions Bancorporation              Financials   
 502    ZTS                Zoetis             Health Care   
 
                       GICS Sub-Industry    Headquarters Location  Date added  \
 0              Industrial Conglomerates    Saint Paul, Minnesota  1957-03-04   
 1                     Building Products     Milwaukee, Wisconsin  2017-07-26   
 2                 Heal

In [3]:
# If you check out the Wikipedia List of S&P500 Companies, you’ll notice there is a table containing the current S&P500 components and a table listing the historical changes. Let’s grab the first table.

dfs[0]

|  | Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| 3 | ABBV | AbbVie | Health Care | Pharmaceuticals | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 498 | YUM | Yum! Brands | Consumer Discretionary | Restaurants | Louisville, Kentucky | 1997-10-06 | 1041061 | 1997 |
| 499 | ZBRA | Zebra Technologies | Information Technology | Electronic Equipment & Instruments | Lincolnshire, Illinois | 2019-12-23 | 877212 | 1969 |
| 500 | ZBH | Zimmer Biomet | Health Care | Health Care Equipment | Warsaw, Indiana | 2001-08-07 | 1136869 | 1927 |
| 501 | ZION | Zions Bancorporation | Financials | Regional Banks | Salt Lake City, Utah | 2001-06-22 | 109380 | 1873 |


In [4]:
dfs[1]

|  | Date | Added | Added | Removed | Removed | Reason |
| --- | --- | --- | --- | --- | --- | --- |
|  | Date | Ticker | Security | Ticker | Security | Reason |
| 0 | June 20, 2023 | PANW | Palo Alto Networks | DISH | Dish Network | Market capitalization change.[4] |
| 1 | May 4, 2023 | AXON | Axon Enterprise | FRC | First Republic Bank | The Federal Deposit Insurance Corporation (FDI... |
| 2 | March 20, 2023 | FICO | Fair Isaac | LUMN | Lumen Technologies | Market capitalization change.[6] |
| 3 | March 15, 2023 | BG | Bunge Limited | SBNY | Signature Bank | The FDIC placed Signature Bank into FDIC Recei... |
| 4 | March 15, 2023 | PODD | Insulet | SIVB | SVB Financial Group | The FDIC placed SVB's main subsidiary, Silicon... |
| ... | ... | ... | ... | ... | ... | ... |
| 320 | June 9, 1999 | WLP | Wellpoint | HPH | Harnischfeger Industries | Harnischfeger filed for bankruptcy.[246] |
| 321 | December 11, 1998 | FSR | Firstar | LDW | Amoco | British Petroleum purchased Amoco.[247] |
| 322 | December 11, 1998 | CCL | Carnival Corp. | GRN | General Re | Berkshire Hathaway purchased General Re.[247] |
| 323 | December 11, 1998 | CPWR | Compuware | SUN | SunAmerica | AIG purchased SunAmerica.[247] |
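
The second table comes back with a two-level column header (for example `Added` over `Ticker`), which can be awkward to index. As a rough sketch, the levels can be joined into single names. The small frame below is rebuilt by hand from the first two rows shown above so that the snippet runs without network access, and `flatten` is a hypothetical helper rather than a pandas function.

```python
import pandas as pd

# Rebuild a tiny version of the "changes" table with the same two-level header.
columns = pd.MultiIndex.from_tuples([
    ('Date', 'Date'),
    ('Added', 'Ticker'),
    ('Added', 'Security'),
    ('Removed', 'Ticker'),
    ('Removed', 'Security'),
    ('Reason', 'Reason'),
])
changes = pd.DataFrame(
    [
        ['June 20, 2023', 'PANW', 'Palo Alto Networks', 'DISH', 'Dish Network',
         'Market capitalization change.[4]'],
        ['March 20, 2023', 'FICO', 'Fair Isaac', 'LUMN', 'Lumen Technologies',
         'Market capitalization change.[6]'],
    ],
    columns=columns,
)


def flatten(column):
    """Join the two header levels, keeping a single name when both are equal."""
    top, bottom = column
    return top if top == bottom else f'{top}_{bottom}'


changes.columns = [flatten(column) for column in changes.columns]
print(list(changes.columns))
```

After flattening, a column such as `changes['Added_Ticker']` can be selected with a plain string.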


In [5]:
#Alternative website
import pandas as pd
  
dfs1 = pd.read_html('https://www.thelists.org/sp-500-list.html')
print(len(dfs1))
dfs1

1


[                S&P 500 List
 0                 3M Company
 1        Abbott Laboratories
 2    Abercrombie & Fitch Co.
 3              Adobe Systems
 4     Advanced Micro Devices
 ..                       ...
 453          XTO Energy Inc.
 454               Yahoo Inc.
 455          Yum! Brands Inc
 456          Zimmer Holdings
 457            Zions Bancorp
 
 [458 rows x 1 columns]]

# Parsing HTML with BeautifulSoup

In [6]:
html = '''
<html>
  <div class="introduction">
    <h1>Shenzhen Stock Exchange</h1>
    <p>Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).<a href="http://www.szse.cn/English/index.html">Home page</a></p>
  </div>

  <div class="marketdata">
    <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>
    <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">Funds</a>
  </div>
</html>
'''

In [7]:
from IPython.display import HTML
HTML(html)

In [8]:
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, 'html.parser')

In [9]:
bs


<html>
<div class="introduction">
<h1>Shenzhen Stock Exchange</h1>
<p>Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).<a href="http://www.szse.cn/English/index.html">Home page</a></p>
</div>
<div class="marketdata">
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">Funds</a>
</div>
</html>

In [10]:
print(bs.prettify())

<html>
 <div class="introduction">
  <h1>
   Shenzhen Stock Exchange
  </h1>
  <p>
   Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).
   <a href="http://www.szse.cn/English/index.html">
    Home page
   </a>
  </p>
 </div>
 <div class="marketdata">
  <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">
   Stocks
  </a>
  <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">
   Funds
  </a>
 </div>
</html>



## Getting Elements by Hierarchy

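BeautifulSoup dot syntax: accessing a tag name as an attribute returns the first matching tag beneath the current element.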

In [11]:
bs.html.div.h1

<h1>Shenzhen Stock Exchange</h1>

In [12]:
bs.div.h1 

<h1>Shenzhen Stock Exchange</h1>

In [13]:
bs.div.p 

<p>Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).<a href="http://www.szse.cn/English/index.html">Home page</a></p>

In [14]:
bs.div.p.a

<a href="http://www.szse.cn/English/index.html">Home page</a>

In [15]:
bs.div.p.a.string 

'Home page'

In [16]:
bs.div.p.a.contents

['Home page']

In [17]:
bs.a 

<a href="http://www.szse.cn/English/index.html">Home page</a>

In [18]:
bs.div.next_sibling.next_sibling.a 

<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>

In [19]:
bs.h1.next_sibling

'\n'

In [20]:
bs.h1.next_sibling.next_sibling 

<p>Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).<a href="http://www.szse.cn/English/index.html">Home page</a></p>

In [21]:
bs.h1.string 

'Shenzhen Stock Exchange'

In [22]:
bs.div.next_sibling.next_sibling.a.string 

'Stocks'

In [23]:
bs.div.p.string  # returns None: <p> has more than one child, so .string is not defined

In [24]:
list(bs.div.p.strings)

['Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).',
 'Home page']

In [25]:
list(bs.div.p.contents)

['Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).',
 <a href="http://www.szse.cn/English/index.html">Home page</a>]

In [26]:
list(bs.div.next_sibling.next_sibling.strings) 

['\n', 'Stocks', '\n', 'Funds', '\n']
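
Chaining `.next_sibling.next_sibling` works here only because exactly one whitespace node sits between the two `div` tags. A hedged alternative is `find_next_sibling()`, which skips text nodes, together with `.stripped_strings`, which drops whitespace-only strings. The sketch below uses a shortened copy of the `html` string from above.

```python
from bs4 import BeautifulSoup

html = '''
<html>
  <div class="introduction">
    <h1>Shenzhen Stock Exchange</h1>
  </div>

  <div class="marketdata">
    <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>
    <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">Funds</a>
  </div>
</html>
'''

bs = BeautifulSoup(html, 'html.parser')

# Jump to the next <div> sibling directly, ignoring the '\n' text node in between.
second_div = bs.div.find_next_sibling('div')
print(second_div['class'])

# stripped_strings yields only the non-empty text pieces, already stripped.
link_texts = list(second_div.stripped_strings)
print(link_texts)
```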

## Searching for Elements
`bs.find()` returns the first element that matches the given tag name, attributes, or string.

In [27]:
bs.div

<div class="introduction">
<h1>Shenzhen Stock Exchange</h1>
<p>Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).<a href="http://www.szse.cn/English/index.html">Home page</a></p>
</div>

In [28]:
bs.div.next_sibling  # the '\n' between tags is a sibling node too, so next_sibling hits it first whenever tags are separated by a line break

'\n'

In [29]:
bs.div.next_sibling.next_sibling

<div class="marketdata">
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">Funds</a>
</div>

In [30]:
bs.find(name='div', class_='marketdata')  # filter by tag name and by attributes (e.g. id, class_)

<div class="marketdata">
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">Funds</a>
</div>

In [31]:
bs.find(class_='introduction') 

<div class="introduction">
<h1>Shenzhen Stock Exchange</h1>
<p>Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).<a href="http://www.szse.cn/English/index.html">Home page</a></p>
</div>

In [32]:
bs.find(class_='marketdata').a.string 

'Stocks'

In [33]:
bs.find(href='http://www.szse.cn/English/index.html')

<a href="http://www.szse.cn/English/index.html">Home page</a>

In [34]:
# What's the text of the link http://www.szse.cn/English/index.html?
bs.find(href='http://www.szse.cn/English/index.html').string

'Home page'

In [35]:
bs.find(href='http://www.szse.cn/English/index.html').parent

<p>Shenzhen Stock Exchange (SZSE), established on 1st December, 1990, is a self-regulated legal entity under the supervision of China Securities Regulatory Commission (CSRC).<a href="http://www.szse.cn/English/index.html">Home page</a></p>

In [36]:
bs.find(string='Home page')

'Home page'

In [37]:
bs.find(string='Home page').parent 

<a href="http://www.szse.cn/English/index.html">Home page</a>

In [38]:
bs.find(class_='marketdata')

<div class="marketdata">
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>
<a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">Funds</a>
</div>

In [39]:
bs.find(class_='marketdata').a['href']  # read an attribute value

'http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html'

In [40]:
# What link does the text "Home page" point to?
bs.find(string='Home page').parent['href'] 

'http://www.szse.cn/English/index.html'
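
`find()` stops at the first match. When every match is needed, `find_all()` returns them all as a list. A small sketch on a shortened copy of the same `html` string follows; the names `link_map` and `market_links` are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''
<html>
  <div class="introduction">
    <h1>Shenzhen Stock Exchange</h1>
    <p>Shenzhen Stock Exchange (SZSE)<a href="http://www.szse.cn/English/index.html">Home page</a></p>
  </div>

  <div class="marketdata">
    <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/stocks/index.html">Stocks</a>
    <a href="http://www.szse.cn/English/siteMarketData/siteMarketDatas/funds/index.html">Funds</a>
  </div>
</html>
'''

bs = BeautifulSoup(html, 'html.parser')

# Map each link text to its href, across the whole document.
link_map = {a.get_text(strip=True): a['href'] for a in bs.find_all('a')}
print(link_map['Home page'])

# Restrict the search to one container by calling find_all() on that element.
market_div = bs.find('div', class_='marketdata')
market_links = [a['href'] for a in market_div.find_all('a')]
print(len(market_links))
```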

## Example 1

Scrape the countries of the world and the related metrics from the following site: https://scrapethissite.com/pages/simple/

Store the result in a DataFrame that looks like the following:

| name | capital | population | area |
| ---- | ------- | ---------- | ---- |
| Andorra | Andorra la Vella | 84000 | 468.0 |
| ... | ... | ... | ... |

Then save your DataFrame as "countries.csv".

### Solution

This website is very scraping-friendly, but we still have to string together many of the concepts we've been practicing in more contained problems:
- Fetching HTML with `requests`
- Parsing it with the BeautifulSoup class
- Locating elements of interest
- Looping over multiple elements
- Creating a DataFrame from scraped elements

As for finding the elements, the simplest "container" to loop over is the "col-md-4 country" `div` element -- there is one of these for each country, so we can `find_all()` and then extract the information within each.

In [41]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://scrapethissite.com/pages/simple/'

# Get the site HTML and parse it.
response = requests.get(URL)
bs = BeautifulSoup(response.content, 'html.parser')
# Find the divs that contain countries.
divs = bs.find_all(name='div', class_='col-md-4 country')
divs

[<div class="col-md-4 country">
 <h3 class="country-name">
 <i class="flag-icon flag-icon-ad"></i>
                             Andorra
                         </h3>
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
 <strong>Population:</strong> <span class="country-population">84000</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
 </div>
 </div>,
 <div class="col-md-4 country">
 <h3 class="country-name">
 <i class="flag-icon flag-icon-ae"></i>
                             United Arab Emirates
                         </h3>
 <div class="country-info">
 <strong>Capital:</strong> <span class="country-capital">Abu Dhabi</span><br/>
 <strong>Population:</strong> <span class="country-population">4975593</span><br/>
 <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">82880.0</span><br/>
 </div>
 </div>,
 <div class="col-md-4 country">
 <h3 class="country-name">
 

#### Step by Step

In [42]:
# see the first div
print(divs[0].prettify())

<div class="col-md-4 country">
 <h3 class="country-name">
  <i class="flag-icon flag-icon-ad">
  </i>
  Andorra
 </h3>
 <div class="country-info">
  <strong>
   Capital:
  </strong>
  <span class="country-capital">
   Andorra la Vella
  </span>
  <br/>
  <strong>
   Population:
  </strong>
  <span class="country-population">
   84000
  </span>
  <br/>
  <strong>
   Area (km
   <sup>
    2
   </sup>
   ):
  </strong>
  <span class="country-area">
   468.0
  </span>
  <br/>
 </div>
</div>



In [43]:
list(divs[0].h3.strings) 

['\n', '\n                            Andorra\n                        ']

In [44]:
name = ''.join(divs[0].h3.strings).strip() 
name 

'Andorra'

In [45]:
name = list(divs[0].h3.strings)[1].strip() 
name

'Andorra'

In [46]:
capital = divs[0].find(name='span', class_='country-capital').string
capital

'Andorra la Vella'

In [47]:
# Population
population = divs[0].find(name='span', class_='country-population').string
population

'84000'

In [48]:
# Area
area = divs[0].find(name='span', class_='country-area').string
area

'468.0'
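
Two details from the steps above are worth a second look: the country name needs whitespace handling, and the population and area come back as strings. One possible shortcut, sketched on a hand-copied version of the first `div` so it runs offline, is `get_text(strip=True)` plus an explicit numeric conversion.

```python
from bs4 import BeautifulSoup

# Hand-copied version of the first country block shown above.
snippet = '''
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ad"></i>
    Andorra
  </h3>
  <div class="country-info">
    <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
    <strong>Population:</strong> <span class="country-population">84000</span><br/>
    <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
  </div>
</div>
'''

div = BeautifulSoup(snippet, 'html.parser').div

# get_text(strip=True) strips each text piece and drops the empty ones.
name = div.h3.get_text(strip=True)

# .string returns text, so convert it before doing arithmetic.
population = int(div.find('span', class_='country-population').string)
area = float(div.find('span', class_='country-area').string)

print(name, population, area)
```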

#### For loop:

In [49]:
# For each one, extract name, capital, population, and area --
# store that info in a dictionary and add it to our list of rows.
rows = []
for div in divs:
    #######
    # Name
    #######
    # We can't just use div.h3.string because the h3 also contains an <i> flag icon tag (not just text).
    name = ''.join(div.h3.strings)  # Alternative: name = list(div.h3.strings)[1]
    # (Optional) Get rid of whitespace around the country name
    name = name.strip()
    
    # Everything else is simpler; use the span classes and .string.
    
    # Capital
    capital = div.find(name='span', class_='country-capital').string
    # Population
    population = div.find(name='span', class_='country-population').string
    # Area
    area = div.find(name='span', class_='country-area').string
    
    # Create a dictionary of this info
    country_dict = {'name': name, 'capital': capital, 'population': population, 'area': area}
    # Add it to our list of rows
    rows.append(country_dict)

# Now just transform our rows into a DataFrame
country_df = pd.DataFrame(rows)
country_df

|  | name | capital | population | area |
| --- | --- | --- | --- | --- |
| 0 | Andorra | Andorra la Vella | 84000 | 468.0 |
| 1 | United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 |
| 2 | Afghanistan | Kabul | 29121286 | 647500.0 |
| 3 | Antigua and Barbuda | St. John's | 86754 | 443.0 |
| 4 | Anguilla | The Valley | 13254 | 102.0 |
| ... | ... | ... | ... | ... |
| 245 | Yemen | Sanaa | 23495361 | 527970.0 |
| 246 | Mayotte | Mamoudzou | 159042 | 374.0 |
| 247 | South Africa | Pretoria | 49000000 | 1219912.0 |
| 248 | Zambia | Lusaka | 13460305 | 752614.0 |
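
The task also asks for the result to be saved as "countries.csv". A minimal sketch of that last step is below, assuming the scraped values are still strings as in the loop above. Only two rows are hard-coded so the snippet runs on its own; in the notebook, `country_df` would already hold all of them.

```python
import pandas as pd

# Two rows copied from the output above, with values as scraped (strings).
rows = [
    {'name': 'Andorra', 'capital': 'Andorra la Vella',
     'population': '84000', 'area': '468.0'},
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi',
     'population': '4975593', 'area': '82880.0'},
]
country_df = pd.DataFrame(rows)

# Convert the numeric columns so they can be sorted and summed as numbers.
country_df['population'] = pd.to_numeric(country_df['population'])
country_df['area'] = pd.to_numeric(country_df['area'])

# index=False keeps the row index out of the file.
country_df.to_csv('countries.csv', index=False)

# Read the file back to confirm what was written.
saved = pd.read_csv('countries.csv')
print(saved)
```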


## Example 2: Scraping posts from Eastmoney Guba

In [50]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd

i=1
url=f'https://guba.eastmoney.com/list,zssh000001_{i}.html'

# Get the site HTML without headers.
response = requests.get(url)
response 

<Response [403]>

In [51]:
# Add request headers so the request looks like it comes from a browser
import requests 
from bs4 import BeautifulSoup
import pandas as pd


i=1
url=f'https://guba.eastmoney.com/list,zssh000001_{i}.html'


# Get the site HTML with header.

# The most common purpose to add headers is pretending to be a browser. Here are sample headers.
my_headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.8",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
}

response = requests.get(url, headers=my_headers)
response

<Response [200]>
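
When several pages are fetched, it can be convenient to attach the headers once to a `requests.Session` instead of passing them on every call. The sketch below only prepares a request and inspects it, so nothing is sent over the network. `build_session` is a hypothetical helper, and the `timeout` value in the commented-out call is an assumption.

```python
import requests


def build_session(user_agent):
    """Return a Session whose requests all carry browser-like headers."""
    session = requests.Session()
    session.headers.update({
        'user-agent': user_agent,
        'accept-language': 'en-US,en;q=0.8',
    })
    return session


session = build_session(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
)

# Prepare (but do not send) a request to see which headers would go out.
request = requests.Request('GET', 'https://guba.eastmoney.com/list,zssh000001_1.html')
prepared = session.prepare_request(request)
print(prepared.headers['user-agent'])

# To actually fetch the page (assumed timeout of 10 seconds):
# response = session.send(prepared, timeout=10)
# response.raise_for_status()  # raises an exception on 4xx/5xx, e.g. 403
```

Calling `raise_for_status()` makes a 403 fail loudly instead of passing error HTML on to the parser.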

In [52]:
# Parse HTML with BeautifulSoup.
bs = BeautifulSoup(response.content, 'html.parser')

divs = bs.find_all(name='tr', class_='listitem')  # Inspecting the page shows that each post is a <tr class="listitem"> row.

rows = []

for div in divs:

    # read
    read = div.find(name='div', class_='read').text
    # reply
    reply = div.find(name='div', class_='reply').string
    # title
    title = div.find(name='div', class_='title').string
    # author
    author = div.find(name='div', class_='author').string
    # update
    update = div.find(name='div', class_='update').string
    
    # Create a dictionary of this info
    post_dict = {'read': read, 'reply': reply, 'title': title, 'author': author, 'update':update}
    # Add it to our list of rows
    rows.append(post_dict)
# Now just transform our rows into a DataFrame
post_df = pd.DataFrame(rows)
post_df

|  | read | reply | title | author | update |
| --- | --- | --- | --- | --- | --- |
| 0 | 1645 | 27 | 研究适当给A股加钟，你怎么看？ | 前线股指 | 08-20 12:47 |
| 1 | 31 | 1 | 周末这么多利好，明天大盘会不会涨停板买不到了？ | 鸟用 | 08-20 10:42 |
| 2 | 40 | 1 | 我在2800等着你的到来 | 财源积玉de尔泰 | 08-20 10:39 |
| 3 | 69 | 0 | 私募基金，机构，主力，散户，谁才是王者？ | 空仓连道非 | 08-20 09:48 |
| 4 | 143 | 2 | #减免交易印花税呼声渐高##证监会：研究延长A股市场交易时间##A股有望迎来“活 | 牛股数据宝 | 08-20 09:58 |
| ... | ... | ... | ... | ... | ... |
| 75 | 25 | 0 | 明天1%开盘，1.5%涨幅收盘 | 股友r80550887o | 08-20 10:24 |
| 76 | 469 | 1 | 周末证监会重磅发声，“一揽子”举措亮相，利好来袭！ | 封先生论市 | 08-20 09:17 |
| 77 | 62 | 1 | 星期一要破3100吗？ | 股海泛舟8888 | 08-20 10:17 |
| 78 | 323 | 4 | 活跃市场再出利好，还有没有机会把握？ | 蜗牛大哥也是牛 | 08-20 09:03 |
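
The loop above relies on `.string`, which returns `None` when a cell wraps its text in more than one child, and on `find()`, which returns `None` when a cell is missing. A more defensive variant is sketched below. The `sample_html` is invented markup that only borrows the class names used above, not the site's real HTML, and `cell_text` and `parse_posts` are hypothetical helpers.

```python
from bs4 import BeautifulSoup

# Invented sample rows; only the class names mirror the cell above.
sample_html = '''
<table>
  <tr class="listitem">
    <td><div class="read">1645</div></td>
    <td><div class="reply">27</div></td>
    <td><div class="title"><a href="#">Sample post title</a></div></td>
    <td><div class="author"><a href="#">sample_author</a></div></td>
    <td><div class="update">08-20 12:47</div></td>
  </tr>
  <tr class="listitem">
    <td><div class="read">31</div></td>
    <td><div class="title"><a href="#">Post without a reply cell</a></div></td>
  </tr>
</table>
'''

FIELDS = ['read', 'reply', 'title', 'author', 'update']


def cell_text(row, class_name):
    """Return the stripped text of the matching div, or None if it is absent."""
    cell = row.find('div', class_=class_name)
    return cell.get_text(strip=True) if cell is not None else None


def parse_posts(html):
    """Turn every <tr class="listitem"> row into a dictionary of fields."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find_all('tr', class_='listitem')
    return [{field: cell_text(row, field) for field in FIELDS} for row in rows]


posts = parse_posts(sample_html)
print(posts[1])

# The page number in the URL can be varied to cover several list pages.
page_urls = [f'https://guba.eastmoney.com/list,zssh000001_{i}.html' for i in range(1, 4)]
print(page_urls)
```

Each URL in `page_urls` could then be fetched with the headers shown earlier and passed to `parse_posts`, with the resulting lists concatenated before building the DataFrame.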
