# Assignment 1: Web Scraping

Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Thus, web scraping is an essential skill that every data scientist should master. In this assignment, you will learn the following:


* How to download HTML pages from a website?
* How to extract relevant content from an HTML page? 

Furthermore, you will gain a deeper understanding of the data science lifecycle.

**Requirements:**

1. Please use [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) rather than spark.DataFrame to manipulate data.

2. Please use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) rather than [lxml](http://lxml.de/) to parse an HTML page and extract data from the page.

3. Please follow the Python code style guide (https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.
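
As a rough illustration of how requirements 1 and 2 fit together, the sketch below parses a small HTML string with BeautifulSoup and loads the result into a `pandas.DataFrame`. The HTML snippet, class names, and column names are made up for the example and do not reflect the structure of any real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML snippet; a real page will have a different structure.
html = """
<table>
  <tr><td class="name">Alice</td><td class="rank">Professor</td></tr>
  <tr><td class="name">Bob</td><td class="rank">Assistant Professor</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # get_text(strip=True) returns the cell text without surrounding whitespace
    name = tr.find("td", class_="name").get_text(strip=True)
    rank = tr.find("td", class_="rank").get_text(strip=True)
    rows.append({"name": name, "rank": rank})

# One dictionary per row becomes one row of the DataFrame
df = pd.DataFrame(rows)
print(df)
```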

## Objective

## Preliminary

If this is your first time writing a web scraper, you will need some background on the topic. A good starting point is [Tutorial: Web Scraping and BeautifulSoup](https://realpython.com/beautiful-soup-web-scraper-python/). 

Please let me know if you find a better resource. I'll share it with the other students.

## Overview

Imagine you are a data scientist working at HKUST(GZ). Your job is to extract insights from HKUST(GZ) data to answer questions. 

In this assignment, you will complete two tasks. Recall the high-level data science lifecycle from Lecture 1. While working on the assignment, keep in mind what data you collected and what questions you were trying to answer.

## Task 1: HKUST(GZ) Information Hub Faculty Members

Sometimes you don't know what questions to ask. No worries. Start collecting data first. 

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [https://facultyprofiles.hkust-gz.edu.cn/](https://facultyprofiles.hkust-gz.edu.cn/).




### (a) Crawl Web Page

A web page is essentially a file stored on a remote machine (called a web server). Please write code to download the HTML page and save it as a text file ("infhfaculty.html").
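
To make the "file on a remote machine" idea concrete, here is a minimal sketch that fetches a URL with the standard library and writes the response to disk. The helper name `download_html` is hypothetical. The sketch also assumes the server returns the full HTML directly; pages that build their content with JavaScript may need a browser driver such as Selenium instead.

```python
from pathlib import Path
from urllib.request import urlopen


def download_html(url, path):
    """Fetch `url` and save the body to `path` as UTF-8 text (hypothetical helper)."""
    with urlopen(url) as response:
        # Assumes the page is UTF-8 encoded; undecodable bytes are replaced
        html = response.read().decode("utf-8", errors="replace")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path.write_text(html, encoding="utf-8")
    return html


# Example call (not executed here because it needs network access):
# download_html("https://facultyprofiles.hkust-gz.edu.cn/", "infhfaculty.html")
```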

In [1]:
# write your code
import os

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--headless")

URL = "https://facultyprofiles.hkust-gz.edu.cn"

driver = webdriver.Chrome(options=chrome_options)  # start a Chrome browser instance
driver.get(URL)  # open the faculty list

driver.implicitly_wait(3)  # wait up to 3 seconds when looking up elements
info_button = '//*[@id="app"]/section/section/div/ul[1]/li[3]'  # filter for Information Hub faculty
driver.find_element(By.XPATH, info_button).click()

# Inspecting the page shows 77 faculty members in total, so finding the button
# in the last row (row 77) is used as the condition that the list has loaded.
last_faculty = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[77]/td[6]/div/button'
driver.find_element(By.XPATH, last_faculty)

infhfaculty = driver.page_source

os.makedirs("infhfaculty", exist_ok=True)  # make sure the output folder exists
file_name = "infhfaculty/allinfhfaculty.html"  # save the list page
with open(file_name, "w", encoding="utf-8") as file:
    file.write(infhfaculty)

print("HTML page saved as allinfhfaculty.html")

soup = BeautifulSoup(infhfaculty, 'html.parser')
faculty_list = soup.find_all('div', class_='cell')  # find the cells holding faculty information
faculty_list = [faculty.text for faculty in faculty_list]  # keep only the text of each cell

num_rows = len(faculty_list) // 6  # looking at faculty_list, each faculty member has 6 fields
num_faculty = num_rows - 1  # len // 6 - 1 is the number of faculty (the first row is the header)
faculty_table = pd.DataFrame(pd.Series(faculty_list).values.reshape(num_rows, -1))  # one row per 6 cells, in order
faculty_table = faculty_table.drop([1, 3, 5], axis=1)  # drop the columns that are not needed
faculty_table.columns = faculty_table.iloc[0, :]  # use the first row as the column names
faculty_table = faculty_table.drop(0, axis=0)  # remove the header row from the data


HTML page saved as allinfhfaculty.html
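
The reshaping step in the cell above (a flat list of cell texts turned into a table with a header row) can be illustrated on toy data. This is only a sketch: the cell values, the column count, and the helper name `cells_to_table` are invented for the example.

```python
import pandas as pd


def cells_to_table(cells, n_cols):
    """Turn a flat list of cell texts into a DataFrame; the first row is the header."""
    if len(cells) % n_cols != 0:
        # A partial row usually means the page was not fully loaded
        raise ValueError("number of cells is not a multiple of n_cols")
    rows = [cells[k:k + n_cols] for k in range(0, len(cells), n_cols)]
    return pd.DataFrame(rows[1:], columns=rows[0])


# Hypothetical flattened cells: one header row followed by two data rows
cells = [
    "Name", "Rank", "Email",
    "Alice", "Professor", "alice@example.org",
    "Bob", "Assistant Professor", "bob@example.org",
]
table = cells_to_table(cells, n_cols=3)
print(table)
```

Checking that the cell count divides evenly by the number of columns is a cheap way to notice a page that was saved before all rows had loaded.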


In [2]:

for i in range(1, num_faculty + 1):
    # click the "more" button of each faculty member
    more = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[' + str(i) + ']/td[6]/div/button'
    driver.find_element(By.XPATH, more).click()

    driver.switch_to.window(driver.window_handles[1])  # switch to the newly opened profile tab
    faculty_table.loc[i, 'Profile'] = driver.current_url

    # The research areas only appear after the "research interest" tab is clicked,
    # so it has to be clicked before the page is downloaded. The id of that tab
    # differs between profiles, and selecting it by class returned nothing.
    # Workaround: try to click every tab with a similar XPath. The research
    # interest tab is always the last one, so the last successful click lands on it.
    for j in range(1, 4):
        rs = '//*[@id="tab-' + str(j) + '"]'
        try:
            driver.find_element(By.XPATH, rs).click()
        except NoSuchElementException:
            # raised after the implicit wait times out; move on to the next tab
            pass

    infhfaculty = driver.page_source

    file_name = "infhfaculty/" + str(faculty_table.iloc[i - 1, 0]) + ".html"  # save each faculty profile
    with open(file_name, "w", encoding="utf-8") as file:
        file.write(infhfaculty)

    print(str(i) + " HTML page saved as " + str(faculty_table.loc[i, 'Name']) + ".html")

    # the personal website link has two possible XPaths; try both
    pw1 = '//*[@id="app"]/section/div/div[2]/div/div/div[3]/div/div/div[3]/a'
    pw2 = '//*[@id="app"]/section/div/div[2]/div/div/div[3]/div/div[1]/div[3]/a'
    try:
        driver.find_element(By.XPATH, pw1).click()
    except NoSuchElementException:
        driver.find_element(By.XPATH, pw2).click()

    driver.switch_to.window(driver.window_handles[2])  # switch to the personal website tab
    faculty_table.loc[i, 'Personal Website'] = driver.current_url

    driver.close()
    driver.switch_to.window(driver.window_handles[1])  # switch back to the faculty profile

    driver.close()
    driver.switch_to.window(driver.window_handles[0])  # switch back to the faculty list

driver.quit()  # close the Chrome browser

faculty_table.to_csv('faculty_table.csv', index=False)  # export to CSV


1 HTML page saved as Lei CHEN.html
2 HTML page saved as Pan HUI.html
3 HTML page saved as Vincent Kin Nang LAU.html
4 HTML page saved as Irene Man Chi Lo.html
5 HTML page saved as Lionel Ming-Shuan NI.html
6 HTML page saved as Huamin QU.html
7 HTML page saved as Fu-Gee TSUNG.html
8 HTML page saved as Hui XIONG.html
9 HTML page saved as Liuqing YANG.html
10 HTML page saved as Qiang YANG.html
11 HTML page saved as Qian ZHANG.html
12 HTML page saved as Xiaowen CHU.html
13 HTML page saved as Qiong LUO.html
14 HTML page saved as Danny Hin Kwok TSANG.html
15 HTML page saved as Wei WANG.html
16 HTML page saved as Kaishun WU.html
17 HTML page saved as Yang YANG.html
18 HTML page saved as Kang ZHANG.html
19 HTML page saved as Ying CUI.html
20 HTML page saved as Xinyi HUANG.html
21 HTML page saved as Sung Hun KIM.html
22 HTML page saved as DIRK KUTSCHER.html
23 HTML page saved as Nan TANG.html
24 HTML page saved as Felix Xin WANG.html
25 HTML page saved as Sean Sihong XIE.html
26 HTML page saved
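
The cell above tries two alternative XPaths for the personal website link. The same "try several candidates in order" idea can be sketched offline with BeautifulSoup and CSS selectors, which is a swapped-in technique and not the Selenium code used above. The HTML layouts, selectors, and the helper name `first_match` are all hypothetical.

```python
from bs4 import BeautifulSoup


def first_match(soup, selectors):
    """Return the first element matched by any of the CSS selectors, or None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None


# Two hypothetical page layouts that place the link at different depths
layout_a = '<div class="links"><a href="https://example.org/a">Personal Web</a></div>'
layout_b = '<div class="links"><div><a href="https://example.org/b">Personal Web</a></div></div>'
selectors = ["div.links > a", "div.links > div > a"]

for html in (layout_a, layout_b):
    link = first_match(BeautifulSoup(html, "html.parser"), selectors)
    print(link["href"] if link is not None else None)
```

Returning `None` when nothing matches lets the caller record a missing value instead of stopping the whole crawl on one unusual page.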

### (b) Extract Structured Data

Please write code to extract relevant content (name, rank, area, profile, homepage, ...) from "infhfaculty.html" and save it as a CSV file named "faculty_table.csv". 

In [3]:
# write your code
from bs4 import BeautifulSoup
import pandas as pd

faculty_table = pd.read_csv('faculty_table.csv')  # load the CSV file saved in (a)

name_list = faculty_table.loc[:, 'Name']

for i in range(0, len(name_list)):  # loop over the downloaded faculty profiles
    file_name = 'infhfaculty/' + str(name_list.loc[i]) + '.html'

    with open(file_name, 'r', encoding='utf-8') as file:  # open the HTML file
        html_content = file.read()

    soup = BeautifulSoup(html_content, 'html.parser')

    office_tag = soup.find('p', class_='icon-text')  # find the office information
    # find() returns None when the tag is missing, so check before reading the text
    office = office_tag.get_text(strip=True) if office_tag is not None else ''
    faculty_table.loc[i, 'Office'] = office if office else 'NaN'

    area_tags = soup.find_all('p', class_='content')  # find the research areas
    # a faculty member usually has several research interests; join them into one string
    areas = [tag.get_text(strip=True) for tag in area_tags]
    faculty_table.loc[i, 'Area'] = ', '.join(areas) if areas else 'NaN'

faculty_table.to_csv('faculty_table.csv', index=False)  # write the result back to the CSV file

### (c) Interesting Finding

Note that you don't need to do anything for Task 1(c). The purpose of this part is to give you a sense of how to use Exploratory Data Analysis (EDA) to come up with interesting questions about the data. EDA is an important topic in data science; you will learn it soon in this course. 


First, please install [dataprep](http://dataprep.ai).
Then, run the cell below. 
It shows a bar chart for every column. What interesting findings can you get from these visualizations? 

Below are some examples:

**Finding 1:** The number of Assistant Professors (~76) is more than 5x the number of Associate Professors (10). 

**Questions:** Why did this happen? Is it common across CS schools worldwide? Will the gap grow or shrink in five years? What actions could be taken to widen or narrow the gap?


**Finding 2:** The Homepage has 22% missing values. 

**Questions:** Why are there so many missing values? Is it because many faculty members do not have their own homepages, or because they have not added their homepages to the school page? What actions can be taken to prevent this from happening in the future? 
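
Findings of this kind (category counts and missing-value rates) can also be checked directly with pandas. The sketch below uses a tiny made-up table; the column names `rank` and `homepage` and all values are assumptions for illustration, not the real scraped data.

```python
import pandas as pd

# Hypothetical faculty table with one missing-prone column
df = pd.DataFrame({
    "rank": [
        "Assistant Professor",
        "Assistant Professor",
        "Associate Professor",
        "Professor",
    ],
    "homepage": ["https://example.org/a", None, "https://example.org/c", None],
})

# How many faculty members hold each rank
rank_counts = df["rank"].value_counts()

# Share of rows whose homepage is missing (the mean of a boolean mask)
missing_share = df["homepage"].isna().mean()

print(rank_counts)
print(f"homepage missing: {missing_share:.0%}")
```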

## Task 2: Age Follows Normal Distribution?

In this task, you start with a question and then figure out what data to collect.

The question that you are interested in is `Does HKUST(GZ) Info Hub faculty age follow a normal distribution?`

To estimate the age of a faculty member, you can collect the year in which s/he graduated from university (`gradyear`) and then estimate `age` using the following equation:

$$age \approx 2023+23 - gradyear$$

For example, if one graduates from a university in 1990, then the age is estimated as 2023+23-1990 = 56. 
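
The estimate can be written as a small function. This is a sketch: the function and constant names are hypothetical, and reading the constant 23 as an assumed age at graduation is an interpretation of the formula rather than something the assignment states.

```python
REFERENCE_YEAR = 2023        # the year used in the formula
ASSUMED_GRADUATION_AGE = 23  # interpretation of the "+23" term (assumption)


def estimate_age(gradyear):
    """Estimate age as 2023 + 23 - gradyear."""
    return REFERENCE_YEAR + ASSUMED_GRADUATION_AGE - gradyear


print(estimate_age(1990))  # 56
```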



### (a) Crawl Web Page

You notice that faculty profile pages contain graduation information. For example, you can see that Dr. Yuyu LUO graduated from Tsinghua University in 2023 at [https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo](https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo). 


Please write code to download the profile pages of the Info Hub faculty members and save each page as a text file. 

In [2]:
# Write your code
import os

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = Options()
chrome_options.add_argument("--headless")

URL = "https://facultyprofiles.hkust-gz.edu.cn"

driver = webdriver.Chrome(options=chrome_options)  # start a Chrome browser instance
driver.get(URL)  # open the faculty list

wait = WebDriverWait(driver, 10)  # explicit waits time out after 10 seconds
info_button = '//*[@id="app"]/section/section/div/ul[1]/li[3]'  # filter for Information Hub faculty
driver.find_element(By.XPATH, info_button).click()

# Inspecting the page shows 77 faculty members in total, so the button in the
# last row (row 77) becoming clickable is used as the wait condition.
last_faculty = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[77]/td[6]/div/button'
wait.until(EC.element_to_be_clickable((By.XPATH, last_faculty)))

infhfaculty = driver.page_source

os.makedirs("infhfaculty", exist_ok=True)  # make sure the output folder exists
file_name = "infhfaculty/allinfhfaculty.html"  # save the list page
with open(file_name, "w", encoding="utf-8") as file:
    file.write(infhfaculty)

print("HTML page saved as allinfhfaculty.html")

soup = BeautifulSoup(infhfaculty, 'html.parser')
faculty_list = soup.find_all('div', class_='cell')  # find the cells holding faculty information
faculty_list = [faculty.text for faculty in faculty_list]  # keep only the text of each cell

num_rows = len(faculty_list) // 6  # looking at faculty_list, each faculty member has 6 fields
num_faculty = num_rows - 1  # len // 6 - 1 is the number of faculty (the first row is the header)
faculty_table = pd.DataFrame(pd.Series(faculty_list).values.reshape(num_rows, -1))  # one row per 6 cells, in order
faculty_table = faculty_table.drop([1, 2, 3, 4, 5], axis=1)  # drop the columns that are not needed
faculty_table.columns = faculty_table.iloc[0, :]  # use the first row as the column names
faculty_table = faculty_table.drop(0, axis=0)  # remove the header row from the data


HTML page saved as allinfhfaculty.html


In [3]:
# write your code here
for i in range(1, num_faculty + 1):
    # click the "more" button of each faculty member
    more = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[' + str(i) + ']/td[6]/div/button'
    driver.find_element(By.XPATH, more).click()

    driver.switch_to.window(driver.window_handles[1])  # switch to the newly opened profile tab
    element = '//*[@id="tab-0"]'
    # wait until the first tab is clickable; otherwise an empty HTML page is saved
    wait.until(EC.element_to_be_clickable((By.XPATH, element))).click()
    infhfaculty = driver.page_source

    file_name = "infhfaculty/" + str(faculty_table.iloc[i - 1, 0]) + ".html"  # save each faculty profile
    with open(file_name, "w", encoding="utf-8") as file:
        file.write(infhfaculty)

    print(str(i) + " HTML page saved as " + str(faculty_table.loc[i, 'Name']) + ".html")

    driver.close()
    driver.switch_to.window(driver.window_handles[0])  # switch back to the faculty list

driver.quit()  # close the Chrome browser

faculty_table.to_csv('faculty_table.csv', index=False)  # export to CSV


1 HTML page saved as Lei CHEN.html
2 HTML page saved as Pan HUI.html
3 HTML page saved as Vincent Kin Nang LAU.html
4 HTML page saved as Irene Man Chi Lo.html
5 HTML page saved as Lionel Ming-Shuan NI.html
6 HTML page saved as Huamin QU.html
7 HTML page saved as Fu-Gee TSUNG.html
8 HTML page saved as Hui XIONG.html
9 HTML page saved as Liuqing YANG.html
10 HTML page saved as Qiang YANG.html
11 HTML page saved as Qian ZHANG.html
12 HTML page saved as Xiaowen CHU.html
13 HTML page saved as Qiong LUO.html
14 HTML page saved as Danny Hin Kwok TSANG.html
15 HTML page saved as Wei WANG.html
16 HTML page saved as Kaishun WU.html
17 HTML page saved as Yang YANG.html
18 HTML page saved as Kang ZHANG.html
19 HTML page saved as Ying CUI.html
20 HTML page saved as Xinyi HUANG.html
21 HTML page saved as Sung Hun KIM.html
22 HTML page saved as DIRK KUTSCHER.html
23 HTML page saved as Nan TANG.html
24 HTML page saved as Felix Xin WANG.html
25 HTML page saved as Sean Sihong XIE.html
26 HTML page saved

### (b) Extract Structured Data

Please write code to extract the earliest graduation year (e.g., 2023 for Dr. Yuyu LUO) from each profile page, and create a csv file like [faculty_grad_year.csv](./faculty_grad_year.csv). 
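
One possible way to pick the earliest year out of free text is a regular expression over four-digit years. This is a sketch under assumptions: the sample text is invented, the helper name `earliest_year` is hypothetical, and years are assumed to fall between 1900 and 2099.

```python
import re


def earliest_year(text):
    """Return the smallest four-digit year (19xx or 20xx) in text, or None."""
    years = [int(year) for year in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    return min(years) if years else None


# Hypothetical education entry; real profile pages may be formatted differently
sample = "PhD in Computer Science, Example University, 2015; BSc, Example University, 2009"
print(earliest_year(sample))  # 2009
```

Returning `None` for text without any year keeps missing values visible in the final CSV instead of hiding them behind a default.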

In [64]:
# write your code
from bs4 import BeautifulSoup
import pandas as pd

faculty_table = pd.read_csv('faculty_table.csv')  # load the CSV file saved in (a)

faculty_list = faculty_table.loc[:, 'Name']
# the year column is named 'gradyear' because the cell in (c) reads df["gradyear"]
grad_year = pd.DataFrame(columns=['Name', 'gradyear'])

for i in range(0, len(faculty_list)):
    file_name = 'infhfaculty/' + str(faculty_list.loc[i]) + '.html'

    with open(file_name, 'r', encoding='utf-8') as file:  # open the HTML file
        html_content = file.read()

    soup = BeautifulSoup(html_content, 'html.parser')

    faculty_detail = soup.find('div', class_='degree-detail')  # find the degree information
    # find() returns None when the block is missing; such profiles are skipped
    details = list(faculty_detail) if faculty_detail is not None else []

    if len(details) > 1:
        raw = str(details[1])
        # the year is taken as the text after the last comma, up to the next tag
        year = raw.split(',')[-1].split('<')[0].strip()
        grad_year.loc[i, :] = [faculty_list.loc[i], year]

grad_year.to_csv('faculty_grad_year.csv', index=False)  # export to CSV


### (c) Interesting Finding

Similar to Task 1(c), you don't need to do anything here. Just look at different visualizations w.r.t. age and give yourself an answer to the question: `Does HKUST(GZ) Info Hub faculty age follow a normal distribution?`
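
Besides looking at plots, a quick numeric check is to compute the sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below is an informal heuristic, not a formal normality test; the ages are made up and the helper name is hypothetical.

```python
from statistics import mean, pstdev


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population moments."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("all values are identical")
    n = len(values)
    skew = sum((v - m) ** 3 for v in values) / (n * s ** 3)
    kurt = sum((v - m) ** 4 for v in values) / (n * s ** 4) - 3
    return skew, kurt


# Hypothetical ages, symmetric around 40
ages = [30, 35, 40, 40, 45, 50]
skew, kurt = skewness_and_excess_kurtosis(ages)
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

With only a handful of observations these statistics are noisy, so they are best read together with the histogram and other visualizations.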

In [None]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_grad_year.csv")
df["age"] = 2023+23-df["gradyear"]

plot(df, "age")

## Submission

Complete the code in this notebook, and submit it to the Canvas assignment `Assignment 1`.