# Assignment 1: Web Scraping

Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Web scraping is therefore an essential skill for every data scientist. In this assignment, you will learn the following:


* How to download HTML pages from a website?
* How to extract relevant content from an HTML page? 

Furthermore, you will gain a deeper understanding of the data science lifecycle.

**Requirements:**

1. Please use [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) rather than spark.DataFrame to manipulate data.

2. Please use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) rather than [lxml](http://lxml.de/) to parse an HTML page and extract data from the page.

3. Please follow the Python code style guide, PEP 8 (https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.

## Objective

## Preliminary

If this is your first time writing a web scraper, you need to learn the basics of the topic first. I found this to be a good resource: [Tutorial: Web Scraping and BeautifulSoup](https://realpython.com/beautiful-soup-web-scraper-python/).

Please let me know if you find a better resource. I'll share it with the other students.

## Overview

Imagine you are a data scientist working at HKUST(GZ). Your job is to extract insights from HKUST(GZ) data to answer questions. 

In this assignment, you will do two tasks. Please recall the high-level data science lifecycle from Lecture 1. While doing this assignment, I suggest that you keep reminding yourself what data you collected and what questions you tried to answer.

## Task 1: HKUST(GZ) Information Hub Faculty Members

Sometimes you don't know what questions to ask. No worries. Start collecting data first. 

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [https://facultyprofiles.hkust-gz.edu.cn/](https://facultyprofiles.hkust-gz.edu.cn/).




### (a) Crawl Web Page

A web page is essentially a file stored on a remote machine (called a web server). Please write code to download the HTML page and save it as a text file ("infhfaculty.html").
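
For a page whose HTML is served directly by the web server, the download step can be sketched with the standard library alone. The helper name `download_page` and its parameters are illustrative and not part of the assignment. Note that a page whose content is filled in by JavaScript after loading will look mostly empty when fetched this way; in that case a browser-automation tool such as Selenium is one common option.

```python
import urllib.request


def download_page(url, file_name, timeout=10):
    """Download `url` and save the response body to `file_name` as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset announced in the response headers, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    with open(file_name, "w", encoding="utf-8") as file:
        file.write(html)
    return html


# Example call (commented out so that the sketch does not touch the network):
# download_page("https://facultyprofiles.hkust-gz.edu.cn/", "infhfaculty.html")
```

The function returns the HTML text as well as saving it, so the caller can parse it right away without reading the file back.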

In [6]:
# write your code
import os

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# chrome_options.add_argument("--headless")

URL = "https://facultyprofiles.hkust-gz.edu.cn"

driver = webdriver.Chrome(options=chrome_options)  # start a Chrome browser instance
driver.get(URL)  # open the faculty list page

wait = WebDriverWait(driver, 10, 0.5)  # wait up to 10 s, polling every 0.5 s
# Filter the list down to Information Hub faculty
info_button = '//*[@id="app"]/section/section/div/ul[1]/li[3]'
wait.until(EC.element_to_be_clickable((By.XPATH, info_button))).click()

# Wait until the "more" button in table row 77 is clickable before reading the page
last_faculty = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[77]/td[6]/div/button'
wait.until(EC.element_to_be_clickable((By.XPATH, last_faculty)))

infhfaculty = driver.page_source

os.makedirs("infhfaculty", exist_ok=True)  # make sure the output folder exists
file_name = "infhfaculty/allinfhfaculty.html"
with open(file_name, "w", encoding="utf-8") as file:
    file.write(infhfaculty)

print("HTML page saved as allinfhfaculty.html")

soup = BeautifulSoup(infhfaculty, 'html.parser')
faculty_list = soup.find_all('div', class_='cell')  # find the cells holding faculty information
faculty_list = [faculty.text for faculty in faculty_list]  # keep only the text of each cell

# Each faculty member has 6 fields and the header row is counted as well,
# so the number of faculty members is len // 6 - 1
num_faculty = len(faculty_list) // 6 - 1

for i in range(1, num_faculty + 1):
    # Click the "more" button in row i; the profile opens in a new tab
    more = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[' + str(i) + ']/td[6]/div/button'
    wait.until(EC.element_to_be_clickable((By.XPATH, more))).click()

    # Switch to the newly opened profile tab
    driver.switch_to.window(driver.window_handles[-1])

    # Wait for the active tab of the profile page and click it
    research = "el-tabs__item.is-top.is-active"
    wait.until(EC.element_to_be_clickable((By.CLASS_NAME, research))).click()

    infhfaculty = driver.page_source

    file_name = "infhfaculty/infhfaculty" + str(i) + ".html"
    with open(file_name, "w", encoding="utf-8") as file:
        file.write(infhfaculty)

    print("HTML page saved as infhfaculty" + str(i) + ".html")

    # Close the profile tab and return to the faculty list
    driver.close()
    driver.switch_to.window(driver.window_handles[0])

driver.quit()

HTML page saved as allinfhfaculty.html
HTML page saved as infhfaculty1.html
HTML page saved as infhfaculty2.html
HTML page saved as infhfaculty3.html
HTML page saved as infhfaculty4.html
HTML page saved as infhfaculty5.html
HTML page saved as infhfaculty6.html
HTML page saved as infhfaculty7.html
HTML page saved as infhfaculty8.html
HTML page saved as infhfaculty9.html
HTML page saved as infhfaculty10.html
HTML page saved as infhfaculty11.html
HTML page saved as infhfaculty12.html
HTML page saved as infhfaculty13.html
HTML page saved as infhfaculty14.html
HTML page saved as infhfaculty15.html
HTML page saved as infhfaculty16.html
HTML page saved as infhfaculty17.html
HTML page saved as infhfaculty18.html
HTML page saved as infhfaculty19.html
HTML page saved as infhfaculty20.html
HTML page saved as infhfaculty21.html
HTML page saved as infhfaculty22.html
HTML page saved as infhfaculty23.html
HTML page saved as infhfaculty24.html
HTML page saved as infhfaculty25.html
HTML page saved as i

TimeoutException: Message: 
Stacktrace:
0   chromedriver                        0x000000010489ed3c chromedriver + 4336956
1   chromedriver                        0x0000000104896db8 chromedriver + 4304312
2   chromedriver                        0x00000001044c3a5c chromedriver + 293468
3   chromedriver                        0x0000000104508d50 chromedriver + 576848
4   chromedriver                        0x0000000104543908 chromedriver + 817416
5   chromedriver                        0x00000001044fca5c chromedriver + 526940
6   chromedriver                        0x00000001044fd908 chromedriver + 530696
7   chromedriver                        0x0000000104864d88 chromedriver + 4099464
8   chromedriver                        0x0000000104869244 chromedriver + 4117060
9   chromedriver                        0x000000010486f4d0 chromedriver + 4142288
10  chromedriver                        0x0000000104869d44 chromedriver + 4119876
11  chromedriver                        0x0000000104841a18 chromedriver + 3955224
12  chromedriver                        0x00000001048869ec chromedriver + 4237804
13  chromedriver                        0x0000000104886b68 chromedriver + 4238184
14  chromedriver                        0x0000000104896a30 chromedriver + 4303408
15  libsystem_pthread.dylib             0x00000001a6c27fa8 _pthread_start + 148
16  libsystem_pthread.dylib             0x00000001a6c22da0 thread_start + 8
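
The run above stopped with a `TimeoutException` part-way through the profile pages, and the traceback alone does not show which wait failed. One possible way to keep a single failing row from aborting the whole crawl is to retry each row a few times and record the rows that never succeed. The sketch below is library-agnostic: `crawl_rows` and `fake_save_row` are hypothetical names, and `fake_save_row` merely stands in for the Selenium steps. In the notebook, the `except` clause would more naturally catch Selenium's `TimeoutException` instead of the broad `Exception`.

```python
def crawl_rows(row_numbers, save_row, max_attempts=2):
    """Call save_row(i) for every row, retrying failures.

    Returns the list of rows for which every attempt failed.
    """
    failed = []
    for i in row_numbers:
        for attempt in range(1, max_attempts + 1):
            try:
                save_row(i)
                break  # success: move on to the next row
            except Exception as error:  # in the notebook: TimeoutException
                print(f"row {i}, attempt {attempt} failed: {error!r}")
        else:
            # The inner loop ended without `break`, so every attempt failed
            failed.append(i)
    return failed


# Stand-in for the Selenium steps: row 3 always fails, row 2 fails once
attempts = {}


def fake_save_row(i):
    attempts[i] = attempts.get(i, 0) + 1
    if i == 3 or (i == 2 and attempts[i] == 1):
        raise TimeoutError(f"row {i} timed out")


failed_rows = crawl_rows(range(1, 5), fake_save_row)
print("rows that could not be saved:", failed_rows)
```

Collecting the failed rows instead of stopping makes it possible to inspect only those pages by hand afterwards.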


### (b) Extract Structured Data

Please write code to extract the relevant content (name, rank, area, profile, homepage, ...) from "infhfaculty.html" and save it as a CSV file named "faculty_table.csv".
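
The extraction step can be illustrated on a tiny hand-written HTML snippet. This is only a sketch under assumptions: it supposes that every table cell is wrapped in a `<div class="cell">`, it uses three columns instead of six for brevity, and the names and values in the snippet are invented.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of a faculty table: one header row, two data rows
html = """
<table>
  <tr>
    <th><div class="cell">Name</div></th>
    <th><div class="cell">Rank</div></th>
    <th><div class="cell">Area</div></th>
  </tr>
  <tr>
    <td><div class="cell">A. Example</div></td>
    <td><div class="cell">Assistant Professor</div></td>
    <td><div class="cell">Databases</div></td>
  </tr>
  <tr>
    <td><div class="cell">B. Sample</div></td>
    <td><div class="cell">Professor</div></td>
    <td><div class="cell">Robotics</div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = [div.get_text(strip=True) for div in soup.find_all("div", class_="cell")]

n_columns = 3  # assumption for this toy table
rows = [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]

# The first row holds the column names, the remaining rows hold the data
faculty_table = pd.DataFrame(rows[1:], columns=rows[0])

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = faculty_table.to_csv(index=False)
print(csv_text)
```

Passing a file name such as `"faculty_table.csv"` to `to_csv` writes the same content to disk.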

In [None]:
# write your code
from bs4 import BeautifulSoup
import pandas as pd

file_name = 'infhfaculty.html'

with open(file_name, 'r', encoding='utf-8') as file:  # open the saved HTML file
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
faculty_list = soup.find_all('div', class_='cell')  # find the cells holding faculty information
faculty_list = [faculty.text for faculty in faculty_list]  # keep only the text of each cell

# Each row has 6 fields and the first row is the header,
# so the number of faculty members is the number of rows minus 1
num_rows = len(faculty_list) // 6
num_faculty = num_rows - 1
# Arrange the flat list of cells into rows of 6 columns
data_reshaped = pd.DataFrame(pd.Series(faculty_list).values.reshape(num_rows, -1))
faculty_table = data_reshaped.drop(5, axis=1)  # drop the redundant "more" column
faculty_table.columns = list(faculty_table.iloc[0, :])  # use the first row as column names
faculty_table = faculty_table.drop(0, axis=0)  # drop the header row from the data


In [None]:
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
file_name = 'infhfaculty.html'
# driver.get() expects a URL, so convert the local path to a file:// URI
driver.get(Path(file_name).resolve().as_uri())
wait = WebDriverWait(driver, 10)
# XPath of the "more" button in the first table row
more = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[1]/td[6]/div/button'
# The condition takes a (By, value) tuple and has to be evaluated through wait.until()
wait.until(EC.element_to_be_clickable((By.XPATH, more))).click()
# The profile opens in a new tab; switch to the most recently opened one
driver.switch_to.window(driver.window_handles[-1])
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "el-row")))


In [None]:

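wait = WebDriverWait(driver, 10)
num_faculty = faculty_table.shape[0]

# Start from the faculty list tab
driver.switch_to.window(driver.window_handles[0])

for i in range(1, num_faculty + 1):
    more = '//*[@id="app"]/section/section/div/div[2]/div[3]/table/tbody/tr[' + str(i) + ']/td[6]/div/button'
    wait.until(EC.element_to_be_clickable((By.XPATH, more))).click()
    # The profile opens in a new tab; switch to it and wait until it has loaded
    driver.switch_to.window(driver.window_handles[-1])
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "el-row")))
    # Close the profile tab and return to the faculty list
    driver.close()
    driver.switch_to.window(driver.window_handles[0])

faculty_table.to_csv('faculty_table.csv', encoding='utf-8-sig')  # export the table to CSV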

In [None]:
faculty_table.shape[0]

### (c) Interesting Finding

Note that you don't need to do anything for Task 1(c). The purpose of this part is to give you a sense of how to leverage Exploratory Data Analysis (EDA) to come up with interesting questions about the data. EDA is an important topic in data science; you will learn it soon in this course.


First, please install [dataprep](http://dataprep.ai).
Then, run the cell below. 
It shows a bar chart for every column. What interesting findings can you get from these visualizations? 

In [None]:
pip install dataprep

In [None]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_table.csv")
plot(df)

Below are some examples:

**Finding 1:** The number of Assistant Professors (~76) is more than 5x the number of Associate Professors (10).

**Questions:** Why did it happen? Is it common in all CS schools in the world? Will the gap grow larger or smaller in five years? What actions can be taken to enlarge/shrink the gap?


**Finding 2:** The Homepage has 22% missing values. 

**Questions:** Why are there so many missing values? Is it because many faculty members do not have their own homepages, or because they have not added their homepages to the school page? What actions can be taken to prevent this from happening in the future?
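
Numbers like the ones behind these findings can also be computed directly with pandas instead of being read off a chart. The sketch below uses a toy table; the column names `Rank` and `Homepage` and all values are assumptions for illustration, not the real faculty data.

```python
import pandas as pd

# Toy data with assumed column names; not the real faculty table
df = pd.DataFrame({
    "Rank": [
        "Assistant Professor",
        "Assistant Professor",
        "Associate Professor",
        "Professor",
    ],
    "Homepage": ["https://example.org/a", None, "https://example.org/c", None],
})

rank_counts = df["Rank"].value_counts()  # number of faculty members per rank
missing_pct = df.isna().mean() * 100     # share of missing values per column, in %

print(rank_counts)
print(missing_pct)
```

`value_counts` corresponds to the heights of the bars in a per-column bar chart, and the mean of `isna()` gives the fraction of missing entries per column.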

## Task 2: Age Follows Normal Distribution?

In this task, you start with a question and then figure out what data to collect.

The question that you are interested in is `Does HKUST(GZ) Info Hub faculty age follow a normal distribution?`

To estimate the age of a faculty member, you can collect the year in which s/he graduated from a university (`gradyear`) and then estimate `age` using the following equation:

$$age \approx 2023+23 - gradyear$$

For example, if one graduates from a university in 1990, then the age is estimated as 2023+23-1990 = 56. 
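
The same estimate can be written as a small function. This is a sketch: the function and parameter names are illustrative, and reading the constant 23 as an assumed age at graduation is an interpretation of the formula rather than something the assignment states.

```python
def estimate_age(gradyear, current_year=2023, age_at_graduation=23):
    """Estimate age as current_year + age_at_graduation - gradyear."""
    return current_year + age_at_graduation - gradyear


print(estimate_age(1990))  # 2023 + 23 - 1990 = 56
```

Keeping the two constants as parameters makes it easy to rerun the analysis for a different year or a different assumed graduation age.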



### (a) Crawl Web Page

You notice that faculty profile pages contain graduation information. For example, you can see that Dr. Yuyu LUO graduated from Tsinghua University in 2023 at [https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo](https://facultyprofiles.hkust-gz.edu.cn/faculty-personal-page/LUO-Yuyu/yuyuluo). 


Please write code to download the profile pages of the Info Hub faculty members and save each page as a text file.

In [None]:
# Write your code

### (b) Extract Structured Data

Please write code to extract the earliest graduation year (e.g., 2023 for Dr. Yuyu LUO) from each profile page, and create a CSV file like [faculty_grad_year.csv](./faculty_grad_year.csv).
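
One possible way to find the earliest year in a piece of profile text is a regular expression that collects all four-digit years and takes the minimum. The sketch below assumes that the education text has already been isolated from the page; the sample string and the helper name `earliest_year` are made up for illustration.

```python
import re

# Four-digit years from 1900 to 2099 that stand alone as a word
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20\d{2})\b")


def earliest_year(text):
    """Return the smallest four-digit year found in `text`, or None if there is none."""
    years = [int(match) for match in YEAR_PATTERN.findall(text)]
    return min(years) if years else None


education = "PhD, Example University, 2012; BEng, Example University, 2007"
print(earliest_year(education))
print(earliest_year("no years here"))
```

Applying the pattern to a whole page would also pick up unrelated years, such as award or publication dates, so restricting the input to the education section first is the safer choice.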

In [None]:
# write your code here

### (c) Interesting Finding

Similar to Task 1(c), you don't need to do anything here. Just look at different visualizations w.r.t. age and give yourself an answer to the question: `Does HKUST(GZ) Info Hub faculty age follow a normal distribution?`
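
Besides looking at the plots, a rough numeric check is to compute the skewness and the excess kurtosis of the ages, since both are close to 0 for a normal distribution. The sketch below uses only the standard library and made-up ages; it is an informal indication, not a formal test, and with a small sample the values should be read with caution. A formal alternative would be a test such as Shapiro-Wilk (`scipy.stats.shapiro`).

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Return the population skewness and excess kurtosis of `values`."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    n = len(values)
    skewness = sum(((x - mean) / std) ** 3 for x in values) / n
    excess_kurtosis = sum(((x - mean) / std) ** 4 for x in values) / n - 3
    return skewness, excess_kurtosis


ages = [31, 33, 34, 36, 36, 38, 40, 41, 45, 52, 58]  # made-up ages, not real data
skewness, excess_kurtosis = skewness_and_excess_kurtosis(ages)
print(f"skewness = {skewness:.2f}, excess kurtosis = {excess_kurtosis:.2f}")
```

A clearly positive skewness, as in this toy sample, points to a longer tail towards older ages than a normal distribution would have.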

In [None]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_grad_year.csv")
df["age"] = 2023+23-df["gradyear"]

plot(df, "age")

## Submission

Complete the code in this notebook, and submit it to the Canvas assignment `Assignment 1`.