# Assignment 1: Web Scraping

## Objective

Data scientists often need to crawl websites and turn the downloaded HTML pages into structured data (tables). Web scraping is therefore an essential skill for every data scientist. In this assignment, you will learn the following:


* How to use [requests](http://www.python-requests.org/en/master/) to download HTML pages from a website
* How to select content on a webpage with [lxml](http://lxml.de/)

You can use either a Spark DataFrame or a [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) to do the assignment. By comparison, pandas.DataFrame has a richer API, but it is not designed for distributed computing.

## Preliminary

If this is your first time writing a web scraper, you will need some basic knowledge of HTML, the DOM, and XPath. I found this to be a good resource: [https://data-lessons.github.io/library-webscraping/](https://data-lessons.github.io/library-webscraping/). Please take a look at:

* [Selecting content on a web page with XPath](https://data-lessons.github.io/library-webscraping/xpath/)
* [Web scraping using Python: requests and lxml](https://data-lessons.github.io/library-webscraping/04-lxml/)

Please let me know if you find a better resource. I'll share it with the other students.
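
As a quick orientation before the reading, here is a minimal sketch of XPath selection with lxml. The HTML snippet, the class name `people`, and the URLs in it are made up for illustration and are not taken from any real page.

```python
import lxml.html

# Hypothetical HTML fragment, invented for this example.
html = """
<html><body>
  <ul class="people">
    <li><a href="/staff/alice.html">Alice</a></li>
    <li><a href="/staff/bob.html">Bob</a></li>
  </ul>
  <p>Unrelated paragraph</p>
</body></html>
"""

# Parse the string into a DOM-like tree.
tree = lxml.html.document_fromstring(html)

# '//' searches the whole tree; '[@class=...]' filters on an attribute value.
links = tree.xpath("//ul[@class='people']/li/a")

# Each match is an element: read its text and its attributes.
names = [a.text_content() for a in links]
hrefs = [a.get("href") for a in links]

# XPath can also return attribute values directly.
hrefs_direct = tree.xpath("//ul[@class='people']/li/a/@href")

print(names)
print(hrefs)
```

The same pattern (parse once, then query with XPath) is what the tasks below rely on; only the expressions change.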

## Overview

Imagine you are a data scientist working at SFU. One day, you want to analyze CS faculty data and answer two interesting questions:

1. Who are the CS faculty members?
2. What are their research interests?

To do so, the first step is to figure out what data to collect.

## Task 1: SFU CS Faculty Members

You find that there is a web page on the CS school website that lists all the faculty members along with their basic information.

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [https://www.sfu.ca/computing/people/faculty.html](https://www.sfu.ca/computing/people/faculty.html).




### (a) Crawling Web Page

A web page is essentially a file stored on a remote machine (called a web server). You can use [requests](http://www.python-requests.org/en/master/) to open such a file and read data from it. Please complete the following code to download the HTML page and save it as a text file (like [this](./faculty.txt)).

In [33]:
import requests


# 1. Download the webpage
# 2. Save it as a text file (named faculty.txt)

webpage_adrs = "https://www.sfu.ca/computing/people/faculty.html"
fac_page = 'faculty_page.txt'

In [34]:
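def grab_and_save(webpage_adrs, file_tosave):
    """Download webpage_adrs and save its HTML as UTF-8 text in file_tosave."""
    response_obj = requests.get(webpage_adrs)
    # Fail loudly on HTTP errors instead of silently saving an error page
    response_obj.raise_for_status()
    html_text = response_obj.text
    with open(file_tosave, 'w', encoding='utf-8') as f:
        f.write(html_text)

grab_and_save(webpage_adrs, fac_page)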
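
The `encoding='utf-8'` argument in `grab_and_save` matters because pages can contain non-ASCII characters, such as accented names. The sketch below uses only the standard library to show a lossless round trip through a file, and what a lossy ASCII conversion would do instead. The page content, the name in it, and the file name `sample_page.txt` are made up for illustration.

```python
import os
import tempfile

# Made-up page content containing non-ASCII characters.
html_text = "<h4>Zoë Müller, Professor</h4>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_page.txt")

    # Write and read with the same explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

# A lossy conversion replaces every unencodable character with '?'.
lossy = html_text.encode("ascii", errors="replace").decode("ascii")

print(round_trip == html_text)
print(lossy)
```

Passing the encoding explicitly on both the write and the read avoids depending on the platform default, which differs between systems.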

### (b) Extracting Structured Data

An HTML page is represented by the Document Object Model (DOM), which models the page as a tree in which each node is an object representing a part of the page. The nodes can be searched and extracted programmatically using XPath. Please complete the following code to transform the above HTML page into a CSV file (like [this](./faculty_table.csv)).

In [35]:
import lxml.html
import pandas as pd

# 1. Open faculty.txt
with open(fac_page, 'r', encoding='utf-8') as f:
    fac_content = f.read()

# 2. Parse the HTML page as a tree structure
tree = lxml.html.document_fromstring(fac_content)
fac_texts = tree.xpath("//div[contains(@class, 'faculty-list')]//div[@class='text']")

# 3. Extract related content from the tree using XPath
def cleanup_faculty_profile(elem_obj):
    # Example:
    # name     - Greg Baker
    # rank     - SENIOR LECTURER
    # area     - Instruction
    # profile  - http://www.sfu.ca/computing/people/faculty/gregbaker.html
    # homepage - http://www.cs.sfu.ca/~ggbaker/
    
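    acad = dict()
    # The <h4> heading holds "name, rank"
    h4_objs = elem_obj.xpath("h4")
    name, rank = h4_objs[0].text_content().split(',')[:2]
    acad['name'] = name.strip()
    # Note: this keeps only the first word of the rank ("SENIOR" for
    # "SENIOR LECTURER"); ' '.join(rank.split()) would keep the full title
    acad['rank'] = rank.split()[0].strip()
    
    # The research area follows the "Area:" label
    area_obj = elem_obj.xpath(".//*[contains(text(), 'Area')]/..")
    acad['area'] = area_obj[0].text_content().split(':')[1].strip()
    
    # The first link is the SFU profile page; an optional second link is the homepage
    profile_obj = elem_obj.xpath(".//*[contains(text(), 'Profile')]/..")
    sublinks = profile_obj[0].xpath("a")
    acad['profile'] = requests.compat.urljoin('http://www.sfu.ca/', sublinks[0].attrib['href'])
    if len(sublinks) == 1:
        acad['homepage'] = ""
    else:
        acad['homepage'] = sublinks[1].attrib['href']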
    return acad

# Build the table in one step from a list of per-faculty dicts
df = pd.DataFrame([cleanup_faculty_profile(fac_obj) for fac_obj in fac_texts])

print(df)

fac_table_filename = "faculty_table_mv.csv"

# 4. Save the extracted content as a CSV file (named faculty_table.csv)
df.to_csv(fac_table_filename, index=False, columns=["name","rank","area","profile","homepage"])

                                                 area  \
0                                         Instruction   
1   Probabilistic methods; Randomized algorithms; ...   
2   Constraint Satisfaction, Complexity of Computa...   
3                                         Instruction   
4                               Computational Biology   
5                                         Instruction   
6          Formal Aspects of Knowledge Representation   
7                                         Instruction   
8                                         Instruction   
9                           Databases and Data Mining   
10                                        Instruction   
11  Computer Vision, Deep Learning, Computer Graph...   
12  Software Technology, Distributed Communication...   
13       Computer Networks, Multimedia Communications   
14                                        Instruction   
15                                        Instruction   
16         Computational Biolog
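
The extraction code above evaluates XPath expressions relative to each faculty entry (`h4`, `.//*[...]`). The following sketch illustrates why that matters: inside a loop over elements, an expression starting with `//` still searches the whole document. The markup and the names in it are invented and only loosely modelled on a list of profile entries; the real page may be structured differently.

```python
import io

import lxml.html
import pandas as pd

# Hypothetical markup, invented for this example.
html = """
<html><body>
<div class="faculty-list">
  <div class="text"><h4>Ada Example, Professor</h4><p><b>Area:</b> Databases</p></div>
  <div class="text"><h4>Bo Sample, Senior Lecturer</h4><p><b>Area:</b> Instruction</p></div>
</div>
</body></html>
"""

tree = lxml.html.document_fromstring(html)
cards = tree.xpath("//div[@class='text']")

# Pitfall: '//h4' called on one card still matches every <h4> in the document.
all_headings = cards[0].xpath("//h4")

records = []
for card in cards:
    # './/' restricts the search to descendants of this card.
    heading = card.xpath(".//h4")[0].text_content()
    name, rank = [part.strip() for part in heading.split(",", 1)]
    area_par = card.xpath(".//p[b[contains(text(), 'Area')]]")[0]
    area = area_par.text_content().split(":", 1)[1].strip()
    records.append({"name": name, "rank": rank, "area": area})

# One dict per row converts directly into a table.
df = pd.DataFrame(records, columns=["name", "rank", "area"])

# Write the CSV to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Collecting plain dicts and building the DataFrame once at the end keeps the loop simple and avoids repeated concatenation.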

## Task 2: Research Interests

Suppose you want to know the research interests of each faculty member. However, the web page crawled above does not contain this information.

### (a) Crawling Web Page

You notice that such information can be found on the profile page of each faculty. For example, you can find the research interests of Dr. Jiannan Wang from [http://www.sfu.ca/computing/people/faculty/jiannanwang.html](http://www.sfu.ca/computing/people/faculty/jiannanwang.html). 


Please complete the following code to download the profile pages and save them as text files. There are 56 faculty members, so you need to download 56 web pages in total.

In [41]:
import requests
import os
import shutil

fac_profiles_dir = "faculty_profiles"

if os.path.exists(fac_profiles_dir):
    shutil.rmtree(fac_profiles_dir)
    
os.makedirs(fac_profiles_dir)

df = pd.read_csv(fac_table_filename)

def fac_name_resolver(fac_url):
    # Derive a lowercase name from the profile URL, e.g. ".../gregbaker.html" -> "gregbaker"
    fac_url_split = fac_url.split('.html')[0]
    name_resolved = fac_url_split.split('/')[-1]
    return name_resolved.lower()

# 1. Download the profile pages of the 56 faculty members
# 2. Save each page as a text file
for index, row in df.iterrows():
    fac_filename = fac_name_resolver(row['profile'])+".txt"
    grab_and_save(row['profile'], os.path.join(fac_profiles_dir, fac_filename))
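
The two URL manipulations used so far (resolving a link against the site root, and turning a profile URL into a local file name) can also be sketched with the standard library alone. The helper name `profile_filename` is hypothetical; the URLs are the examples already quoted in this notebook.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def profile_filename(profile_url):
    """Return a local file name such as 'gregbaker.txt' for a profile URL."""
    # Use only the path component, so a query string or fragment is ignored.
    path = urlparse(profile_url).path
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    return stem.lower() + ".txt"


base = "http://www.sfu.ca/"

# A site-relative link is resolved against the base URL.
absolute = urljoin(base, "/computing/people/faculty/gregbaker.html")

# A link that is already absolute is returned unchanged.
unchanged = urljoin(base, "http://www.cs.sfu.ca/~ggbaker/")

print(absolute)
print(unchanged)
print(profile_filename("http://www.sfu.ca/computing/people/faculty/jiannanwang.html"))
```

Parsing the URL first is a little more robust than splitting on `'.html'`, although for the profile URLs in this assignment both approaches should give the same names.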


### (b) Extracting Structured Data

Please complete the following code to extract the research interests of each faculty member, and generate a file like [this](./faculty_more_table.csv).

In [56]:
import lxml.html 
import unicodedata

# 1. Open each text file and parse it as a tree structure 
# 2. Extract the research interests from each tree using XPath
# 3. Add the extracted content to faculty_table.csv
# 4. Generate a new CSV file, named faculty_more_table.csv

def get_research_interests(fac_profile):
    # Return the interests listed under the "Research ..." heading as "[a, b, ...]"
    with open(fac_profile, 'r', encoding='utf-8') as f:
        fac_content = f.read()
    tree = lxml.html.document_fromstring(fac_content)
    research_interests = tree.xpath("//h2[contains(text(), 'Research ')]/../ul/li")
    ri_list = []
    for interest in research_interests:
        preproc = interest.text_content().strip()
        preproc = unicodedata.normalize('NFKD', preproc)
        rm_extra_spaces = ' '.join(preproc.split())
        ri_list.append(rm_extra_spaces)
    ri = ', '.join(ri_list)
    return '[' + ri + ']'


df = pd.read_csv(fac_table_filename)
df["research_interests"] = ""

for index, row in df.iterrows():
    fac_filename = fac_name_resolver(row['profile'])+".txt"
    fac_profile = os.path.join(fac_profiles_dir, fac_filename)
    df.loc[index,'research_interests'] = get_research_interests(fac_profile)

# Assign the result back; calling fillna(inplace=True) on a selected column may not update df
df["homepage"] = df["homepage"].fillna("")

print(df)

fac_more_table_filename = "faculty_more_table_mv.csv"

df.to_csv(fac_more_table_filename, index=False,
          columns=["name", "rank", "area", "profile", "homepage", "research_interests"])

                       name       rank  \
0                Greg Baker     SENIOR   
1          Petra Berenbrink  PROFESSOR   
2            Andrei Bulatov  Professor   
3                BOBBY CHAN    LIMITED   
4      LEONID CHINDELEVITCH  ASSISTANT   
5           Diana Cukierman     Senior   
6        James P. Delgrande  Professor   
7            Toby Donaldson     Senior   
8                John Edgar     Senior   
9              Martin Ester  Professor   
10             Brian Fraser     Senior   
11        Yasutaka Furukawa  Assistant   
12              Uwe Glässer  Professor   
13          Mohamed Hefeeda  Professor   
14  Harinder Singh Khangura     Senior   
15            Anne Lavergne     Senior   
16        Maxwell Libbrecht  Assistant   
17      Jiangchuan (JC) Liu  Professor   
18           David Mitchell  Associate   
19                 Jian Pei  Professor   
20            Fred Popowich  Professor   
21       Arrvindh Shriraman  Associate   
22    William (Nick) Sumner  Assis
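
The text clean-up step in `get_research_interests` (Unicode normalization followed by whitespace collapsing) can be tried in isolation. This is a minimal sketch using a made-up input string; `clean_text` is a hypothetical helper name. It also shows one side effect worth knowing about: NFKD splits accented letters into a base letter plus a combining mark, whereas NFKC recomposes them.

```python
import unicodedata


def clean_text(raw):
    """Normalize compatibility characters, then collapse runs of whitespace."""
    normalized = unicodedata.normalize("NFKD", raw)
    # split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(normalized.split())


# Made-up input with a non-breaking space, a newline, and repeated spaces.
sample = "Data\u00a0cleaning\n   and  integration "
print(clean_text(sample))

# NFKD decomposes an accented letter into two code points; NFKC keeps it as one.
decomposed = unicodedata.normalize("NFKD", "\u00e4")
composed = unicodedata.normalize("NFKC", "\u00e4")
print(len(decomposed), len(composed))
```

If names or interests with accented letters need to compare equal to their usual precomposed form, NFKC may be the better choice; whether that matters here depends on how the CSV is used afterwards.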

## Submission

Complete the code in this notebook, and submit it to the CourSys activity [Assignment 1](https://courses.cs.sfu.ca/2018sp-cmpt-733-g1/+a1/).