# Lab 9: BeautifulSoup

In [None]:
# import libraries

import requests
from bs4 import BeautifulSoup


---

### From lecture notes: `.find()` & `.find_all()`

Using a tag name as an attribute gives us only the first tag by that name.

If we need to get all tags with a certain name, we need to use `find_all()`.

Both `find()` and `find_all()` accept a variety of filters. `find()` returns a single tag (the first match), while `find_all()` returns a list of all matching tags:

In [3]:
html_sample_code = ('<!DOCTYPE html><html lang="en"><head><title>Sample HTML Page</title></head>'
                    '<body><h1>This is a heading.</h1>'
                    '<p>This is a typical paragraph.</p>'
                    '<p class="class-one">This is a paragraph of class "class-one".</p>'
                    '<ol><li class="class-one"><a href="sample.html">The 1st item</a></li>'
                    '<li class="class-one">The 2nd item</li></ol>'
                    '<p id="unique-one">This is a paragraph with an ID of "unique-one".</p>'
                    '<div class="col m-3 border class-one">This is a division.'
                    '<a href="sample.html">link 2</a></div></body></html>')

sample_soup = BeautifulSoup(html_sample_code, 'html.parser')

In [7]:
sample_soup.find('p')                           # perform a match against that exact string; return the first tag encountered

<p>This is a typical paragraph.</p>

In [5]:
sample_soup.find_all('p')                      # perform a match against that exact string; return a list of tags

[<p>This is a typical paragraph.</p>,
 <p class="class-one">This is a paragraph of class "class-one".</p>,
 <p id="unique-one">This is a paragraph with an ID of "unique-one".</p>]

In [6]:
sample_soup.find_all(["p", "a"])               # perform a string match against any item in that list

[<p>This is a typical paragraph.</p>,
 <p class="class-one">This is a paragraph of class "class-one".</p>,
 <a href="sample.html">The 1st item</a>,
 <p id="unique-one">This is a paragraph with an ID of "unique-one".</p>,
 <a href="sample.html">link 2</a>]

In [8]:
sample_soup.find_all('p', {'class': 'class-one'})               # perform a match with a given attribute

[<p class="class-one">This is a paragraph of class "class-one".</p>]
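The attribute filter above can also be written with keyword arguments. The following is a minimal sketch on a shortened, made-up version of the sample page (the name `demo_soup` and the trimmed HTML are assumptions for illustration). It also shows what the two methods return when nothing matches.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical variant of the sample page used above
html = ('<body><p>This is a typical paragraph.</p>'
        '<p class="class-one">Paragraph of class "class-one".</p>'
        '<li class="class-one">The 2nd item</li>'
        '<p id="unique-one">Paragraph with an ID.</p></body>')
demo_soup = BeautifulSoup(html, 'html.parser')

# class_ (trailing underscore) avoids clashing with Python's `class` keyword
by_class = demo_soup.find_all('p', class_='class-one')

# leaving out the tag name matches any tag carrying that class
any_tag = demo_soup.find_all(class_='class-one')

# other attributes, such as id, are passed as plain keyword arguments
by_id = demo_soup.find('p', id='unique-one')

# find() gives None when nothing matches; find_all() gives an empty list
missing_one = demo_soup.find('table')
missing_all = demo_soup.find_all('table')

print(len(by_class), len(any_tag), by_id.get_text(), missing_one, len(missing_all))
```

Because `find()` can return `None`, it is worth checking the result before calling `.text` on it when scraping a live page.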

---

### Lab Tasks 1-2
Using Requests and BeautifulSoup, scrape the following etnet page:

http://www.etnet.com.hk/www/eng/stocks/indexes_detail.php?subtype=HSI

##### 1: Hang Seng Index

Use the `.find` method with a tag name and a class name.

*Hint: Use "inspect" to find the class name needed.*

In [None]:
index_url = "http://www.etnet.com.hk/www/eng/stocks/indexes_detail.php?subtype=HSI"

response = requests.get(index_url, timeout=3)  
soup = BeautifulSoup(response.content, 'html.parser')

# complete the code below
hs_index = soup.find(____, ___________________________)
hs_index.text

##### 2: Sub Menu Bar

2.1: At the top of the page there is a menu bar (HTML: a `div` element with `id:'SubMenuBar'`) showing text items such as Home, RT Quote, Indices, etc. Use the methods `.find`, `.find_all` and `.get_text` to scrape the text items in the menu bar.

In [None]:
# complete the code below
menu_bar = soup.find( )
items = menu_bar.find_all( )

bar_belt_items =  ________________________
print(*bar_belt_items, sep='\n')

Home
RT Quote
Indices
Industry
Top 20
Record High
Short Sell
Hot Sector
AH
IPO
Company Info


2.2: Now, use `.get` to scrape the hyperlink (`href` attribute) of each text item in the menu bar.


In [None]:
# complete the code below
bar_belt_items_links =  ______________________________
print(*bar_belt_items_links, sep='\n')

/www/eng/stocks/realtime/index.php
/www/eng/stocks/realtime/quote.php?code=1
/www/eng/stocks/indexes_main.php
/www/eng/stocks/industry.php
/www/eng/stocks/realtime/top20.php
/www/eng/stocks/breakrecord.php
/www/eng/stocks/ci_act_sell.php
/www/eng/stocks/sector_hot.php
/www/eng/stocks/ah.php
/www/eng/stocks/ci_ipo.php
/www/eng/stocks/ci_database.php
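As a reminder of how `.get_text` and `.get` behave, here is a small sketch on made-up markup. The `id` value `DemoMenu` and the links are hypothetical and do not reflect the real page's structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
menu_html = ('<div id="DemoMenu">'
             '<a href="/home.php"> Home </a>'
             '<a href="/quote.php?code=1">RT Quote</a>'
             '<a>No link</a></div>')
menu_soup = BeautifulSoup(menu_html, 'html.parser')

menu = menu_soup.find('div', {'id': 'DemoMenu'})
links = menu.find_all('a')

# get_text(strip=True) removes surrounding whitespace from the tag's text
texts = [a.get_text(strip=True) for a in links]

# .get() returns the attribute value, or None if the attribute is absent
hrefs = [a.get('href') for a in links]

print(texts)
print(hrefs)
```

Using `.get('href')` rather than `a['href']` avoids a `KeyError` for anchors without a link.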


---
### Pandas Dataframe: library for data analysis

To create a dataframe:

```Python
var = pd.DataFrame({'header1': list1, 'header2': list2})
```

Each header-list pair corresponds to one column; the pairs are passed as a dictionary that maps each header string to its list of values.

We can use a dataframe to display the data scraped!

---

In [11]:
import pandas as pd

In [None]:
sid = [20789215, 20791348, 20795589, 20834892, 20861624, 20954221]
last_name = ["Chan", "Lam", "Chau", "Lau", "Au-Yeung", "Chan"]
first_name= ["Thomas", "Vivian", "Angus", "Charlotte", "Jason", "Annie"]
asm1 = [100, 100, 80, 86, 69, 100]
asm2 = [85, 79, 84, 93, 88, 85]
final_exam = [85, 79, 90, 65, 77, 80]

gradebook = list(zip(sid, last_name, first_name, asm1, asm2, final_exam))
gradebook

[(20789215, 'Chan', 'Thomas', 100, 85, 85),
 (20791348, 'Lam', 'Vivian', 100, 79, 79),
 (20795589, 'Chau', 'Angus', 80, 84, 90),
 (20834892, 'Lau', 'Charlotte', 86, 93, 65),
 (20861624, 'Au-Yeung', 'Jason', 69, 88, 77),
 (20954221, 'Chan', 'Annie', 100, 85, 80)]

In [None]:
# DataFrame version of the table

pd_gradebook = pd.DataFrame({'Student ID': sid, 'Last Name': last_name, 'First Name': first_name, 'Assignment 1': asm1, 'Assignment 2': asm2, 'Final Exam': final_exam})
pd_gradebook

,Student ID,Last Name,First Name,Assignment 1,Assignment 2,Final Exam
0,20789215,Chan,Thomas,100,85,85
1,20791348,Lam,Vivian,100,79,79
2,20795589,Chau,Angus,80,84,90
3,20834892,Lau,Charlotte,86,93,65
4,20861624,Au-Yeung,Jason,69,88,77
5,20954221,Chan,Annie,100,85,80
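A DataFrame can also be built directly from a list of row tuples, such as the zipped `gradebook` above, by supplying the column names separately. A minimal sketch, using the first three rows and four columns of the gradebook (the names `rows` and `df` are chosen here for illustration):

```python
import pandas as pd

# Each tuple is one row; the values are taken from the gradebook above
rows = [(20789215, 'Chan', 'Thomas', 100),
        (20791348, 'Lam', 'Vivian', 100),
        (20795589, 'Chau', 'Angus', 80)]

# columns= names the positions of each tuple, left to right
df = pd.DataFrame(rows, columns=['Student ID', 'Last Name', 'First Name', 'Assignment 1'])

print(df)
```

The dictionary form is column-oriented, while this form is row-oriented; which one is more convenient depends on how the scraped data was collected.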


### Lab Task 3: Stocks and Prices

3.1: From the HSI stocks table, extract the trending HSI stocks data including stock `names`, `codes` & `prices`. 

Use the methods `.find` & `.find_all` to extract the required data, then organize the extracted data into a pandas *DataFrame*.


In [None]:
# complete the code below
index_url = "http://www.etnet.com.hk/www/eng/stocks/indexes_detail.php?subtype=HSI"

response = requests.get(index_url, timeout=3)  
soup = BeautifulSoup(response.content, 'html.parser')

# prepare empty lists to store data
name_ls = []
code_ls = []
price_ls = []

#locate the table rows
stocks =  _______________________________


for stock in stocks:
    # unpack the located row and get the required data
    _______________________________ = stock.______________
    
    # append the data to the pre-defined lists



    
# create a pandas dataframe
stock_tb = pd.DataFrame({'Stock':______, 'Stock No.':________, 'Price':______})
stock_tb


,Stock,Stock No.,Price
0,CKH HOLDINGS,00001,38.200
1,CLP HOLDINGS,00002,62.650
2,HK & CHINA GAS,00003,6.050
3,HSBC HOLDINGS,00005,64.500
4,POWER ASSETS,00006,46.350
...,...,...,...
77,NONGFU SPRING,09633,44.550
78,BIDU-SW,09888,101.800
79,TRIP.COM-S,09961,382.800
80,BABA-SW,09988,73.000
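The row-unpacking pattern used in this task can be tried offline first. The sketch below runs on a tiny made-up table; the class name `demo-table`, the column order and the values are all hypothetical, so the real page will need different selectors.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table for illustration; not the structure of the real page
table_html = ('<table class="demo-table">'
              '<tr><th>Code</th><th>Name</th><th>Price</th><th>Note</th></tr>'
              '<tr><td>00001</td><td>ALPHA</td><td>10.500</td><td>x</td></tr>'
              '<tr><td>00002</td><td>BETA</td><td>20.250</td><td>y</td></tr>'
              '</table>')
table_soup = BeautifulSoup(table_html, 'html.parser')

codes, names, prices = [], [], []
table = table_soup.find('table', {'class': 'demo-table'})
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if not cells:                    # the header row has only <th> cells
        continue
    # take the first three cells by position; *rest absorbs any extra cells
    code, name, price, *rest = cells
    codes.append(code.get_text(strip=True))
    names.append(name.get_text(strip=True))
    prices.append(price.get_text(strip=True))

demo_tb = pd.DataFrame({'Stock': names, 'Stock No.': codes, 'Price': prices})
print(demo_tb)
```

The scraped values are kept as strings here, which preserves the leading zeros in the stock codes.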


3.2: Also scrape the `change`, `%change` & `turnover` information.

You can revise the code above or write it again from scratch for practice. Organize the extracted data into a pandas *DataFrame*.

In [None]:
# complete the code below
index_url = "http://www.etnet.com.hk/www/eng/stocks/indexes_detail.php?subtype=HSI"

response = requests.get(index_url, timeout=3)  
soup = BeautifulSoup(response.content, 'html.parser')

# prepare empty lists to store data






#locate the table rows



for stock in stocks:
    # unpack the located row and get the required data
    
    
    # append the data to the pre-defined lists
    




    
    
# create a pandas dataframe
stock_tb = pd.DataFrame(_____)
stock_tb

82


,Stock,Stock No.,Price,Change,%Change,Turnover
0,CKH HOLDINGS,00001,38.200,-0.150,-0.391%,89.923M
1,CLP HOLDINGS,00002,62.650,+0.600,+0.967%,45.234M
2,HK & CHINA GAS,00003,6.050,+0.040,+0.666%,26.985M
3,HSBC HOLDINGS,00005,64.500,+0.600,+0.939%,468.663M
4,POWER ASSETS,00006,46.350,+0.500,+1.091%,12.443M
...,...,...,...,...,...,...
77,NONGFU SPRING,09633,44.550,+0.250,+0.564%,25.227M
78,BIDU-SW,09888,101.800,+1.700,+1.698%,146.618M
79,TRIP.COM-S,09961,382.800,+1.000,+0.262%,108.115M
80,BABA-SW,09988,73.000,+2.500,+3.546%,1.758B


### Take-home practice: Section Menu
Use the methods `.find`, `.find_all` and `.get` to scrape the text and hyperlink of each item in the `div` element with `id: 'SectionMenu'`.

Try to format the output as shown below:

```
1: Local Indices
Link: /www/eng/stocks/indexes_main.php

2: China Indices
Link: /www/eng/stocks/indexes_china.php

3: Global Indices
Link: /www/eng/stocks/indexes_global.php

...
```

*(Hint 1: you may refer to the submenu task)*

*(Hint 2: you may want to use enumerate)*

In [None]:
# write your code below














### Self-practice: CSS selectors

This time, use `.select` or `.select_one` with CSS selectors to complete the exercises above.
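For reference, a short sketch of the selector syntax on made-up markup (the `id` value `demo-menu` and the variable names are hypothetical): `#` selects by id, `.` selects by class, a space means "descendant of", and `>` means "direct child of".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = ('<div id="demo-menu"><a href="/a.php">A</a><a href="/b.php">B</a></div>'
        '<p class="class-one">para</p><li class="class-one">item</li>')
css_soup = BeautifulSoup(html, 'html.parser')

# select_one() returns the first match (or None), similar to find()
first_link = css_soup.select_one('#demo-menu a')

# select() returns a list of all matches, similar to find_all()
all_links = css_soup.select('div#demo-menu > a')

# tag name and class combined: only <p> tags of class "class-one"
class_paras = css_soup.select('p.class-one')

print(first_link.get('href'), len(all_links), len(class_paras))
```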

1. Scrape the section menu (take-home practice)

In [None]:
# complete the code below
ele_container = soup.___________
ele_list = ele_container.__________

for idx, ele in enumerate(ele_list, start=1):
    print(f"{idx}: {ele.text}\nLink: {ele.get('href')}\n")

2. Scrape the Table Data (Task 3)

In [3]:
# complete the code below
index_url = "http://www.etnet.com.hk/www/eng/stocks/indexes_detail.php?subtype=HSI"

response = requests.get(index_url, timeout=3)  
soup = BeautifulSoup(response.content, 'html.parser')

# prepare empty lists to store data
name_ls = []
code_ls = []
price_ls = []
change_ls = []
per_change_ls = []
turnover_ls = []

#locate the table rows
stocks = soup.______________________________________
#print(len(stocks))

for stock in stocks:
    # unpack the located row and get the required data
    code, name, arrow, price, change, per_change, turnover, *rest = stock._______________
    
    # append the data to the pre-defined lists
    code_ls.append(code.text)
    name_ls.append(name.text)
    price_ls.append(price.text)
    change_ls.append(change.text)
    per_change_ls.append(per_change.text)
    turnover_ls.append(turnover.text)
    
# create a pandas dataframe
stock_tb = pd.DataFrame({'Stock':name_ls, 'Stock No.':code_ls, 'Price':price_ls, 'Change':change_ls, '%Change':per_change_ls, 'Turnover':turnover_ls})
stock_tb

,Stock,Stock No.,Price,Change,%Change,Turnover
0,CKH HOLDINGS,00001,38.350,0.000,0.000%,99.767M
1,CLP HOLDINGS,00002,62.650,+0.600,+0.967%,57.351M
2,HK & CHINA GAS,00003,6.050,+0.040,+0.666%,33.489M
3,HSBC HOLDINGS,00005,64.450,+0.550,+0.861%,521.385M
4,POWER ASSETS,00006,46.300,+0.450,+0.981%,14.976M
...,...,...,...,...,...,...
77,NONGFU SPRING,09633,44.650,+0.350,+0.790%,30.256M
78,BIDU-SW,09888,101.900,+1.800,+1.798%,157.189M
79,TRIP.COM-S,09961,383.000,+1.200,+0.314%,118.332M
80,BABA-SW,09988,73.000,+2.500,+3.546%,1.928B


### Self practice: challenge

Insert a `News` column into the above dataframe and show the information for the top 10 stocks. The news for each stock is posted on another page. Extract the links from the HSI page, follow each link to its news page and fill in the `News` column.
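Links scraped from a page are often relative, so they need to be combined with the page's URL before they can be requested. One way to do this is `urljoin` from the standard library, sketched below. The file name `demo_news.php` is a made-up placeholder, not a real path on the site.

```python
from urllib.parse import urljoin

page_url = 'http://www.etnet.com.hk/www/eng/stocks/indexes_detail.php?subtype=HSI'

# a link relative to the current directory (hypothetical file name)
relative_link = 'demo_news.php?code=1'
# a root-relative link, like those scraped from the menu bar
root_relative_link = '/www/eng/stocks/ci_ipo.php'

# urljoin resolves both kinds of link against the page they were found on
full_1 = urljoin(page_url, relative_link)
full_2 = urljoin(page_url, root_relative_link)

print(full_1)
print(full_2)
```

Plain string concatenation, as in `index_url+hsi_url` in the starter code, also works when every link has the same form; `urljoin` handles the mixed case.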

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

index_url = 'http://www.etnet.com.hk/www/eng/stocks/'
hsi_url = 'indexes_detail.php?subtype=HSI'

response = requests.get(index_url+hsi_url, timeout=3)  
soup = BeautifulSoup(response.content, 'html.parser')

name = []
code = []
price = []
turnover = []
news_url = []
news = []

# complete the code below




stock_tb = pd.DataFrame( )
stock_tb



,Stock,Stock No.,Price,Turnover,News
0,CKH HOLDINGS,1,38.2,89.675M,"[ET Net News Agency, 5 April 2024] 3 listed c..."
1,CLP HOLDINGS,2,62.65,45.234M,Summary of listed companies announcements (1)=...
2,HK & CHINA GAS,3,6.05,26.781M,"[ET Net News Agency, 2 April 2024] 4 listed c..."
3,HSBC HOLDINGS,5,64.45,465.002M,Summary of listed companies announcements (1)=...
4,POWER ASSETS,6,46.35,12.373M,Summary of listed companies announcements (1)=...
5,HANG SENG BANK,11,98.8,209.231M,"[ET Net News Agency, 9 April 2024] HANG SENG ..."
6,HENDERSON LAND,12,23.7,14.072M,"[ET Net News Agency, 9 April 2024] 3 listed c..."
7,SHK PPT,16,74.9,171.516M,"[ET Net News Agency, 25 March 2024] A direct ..."
8,NEW WORLD DEV,17,8.75,21.344M,"[ET Net News Agency, 9 April 2024] NEW WORLD ..."
9,GALAXY ENT,27,40.35,95.609M,"[ET Net News Agency, 9 April 2024] 4 companie..."
