# Assignment 1-1: Web Scraping

## Objective

Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Thus, web scraping is an essential skill that every data scientist should master. In this assignment, you will learn the following:


* How to download HTML pages from a website
* How to extract relevant content from an HTML page

Furthermore, you will gain a deeper understanding of the data science lifecycle.

**Requirements:**

1. Please use [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) rather than spark.DataFrame to manipulate data.

2. Please use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) rather than [lxml](http://lxml.de/) to parse an HTML page and extract data from the page.

3. Please follow the Python code style guide (https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.

## Preliminary

If this is your first time writing a web scraper, you will need to learn the basics of the topic first. I found this to be a good resource: [Tutorial: Web Scraping and BeautifulSoup](https://www.dataquest.io/blog/web-scraping-beautifulsoup/).

Please let me know if you find a better resource. I'll share it with the other students.

## Overview

Imagine you are a data scientist working at SFU. Your job is to extract insights from SFU data to answer questions. 

In this assignment, you will complete two tasks. Please recall the high-level data science lifecycle from Lecture 1. While working on the assignment, keep in mind what data you collected and what questions you tried to answer.

## Task 1: SFU CS Faculty Members

Sometimes you don't know what questions to ask. No worries. Start collecting data first. 

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [http://www.sfu.ca/computing/people/faculty.html](http://www.sfu.ca/computing/people/faculty.html).




### (a) Crawl Web Page

A web page is essentially a file stored on a remote machine (called a web server). Please write code to download the HTML page and save it as a text file ("csfaculty.html").

In [None]:
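# write your code
import requests

webpage = "http://www.sfu.ca/computing/people/faculty.html"

# download the page and stop early if the server returns an error status
response = requests.get(webpage)
response.raise_for_status()

# save the raw bytes so the file keeps the page's original encoding
with open("csfaculty.html", "wb") as f:
    f.write(response.content)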

### (b) Extract Structured Data

Please write code to extract relevant content (name, rank, area, profile, homepage) from "csfaculty.html" and save them as a CSV file (like [faculty_table.csv](./faculty_table.csv)). 

In [None]:
# write your code
# !pip3 install bs4
from bs4 import BeautifulSoup
import pandas as pd
import re
names = []
ranks = []
areas = [] 
profiles = []
homepages = []
# parse the saved HTML file (read as bytes so BeautifulSoup detects the
# encoding); each faculty member sits in a div with this class
with open('csfaculty.html', 'rb') as f:
    csfaculty_soup = BeautifulSoup(f, 'html.parser')
csfaculty_containers = csfaculty_soup.find_all('div', class_='textimage section')

#extract all features we need into their respective lists
for faculty in csfaculty_containers:
    
    #name
    name = faculty.h4.text.split(',')[0]
    names.append(name)

    #rank
    if len(faculty.h4.text.split(',')) != 1:
        rank = faculty.h4.text.split('\n')[0]
        rank = rank.split(',')[1].lower().replace(u'\xa0',u'').strip()
        ranks.append(rank)
    else:
        ranks.append('N/A')

    #area: fall back to 'N/A' when the paragraph is missing or has no 'Area:' label
    area = 'N/A'
    if faculty.p is not None and 'Area:' in faculty.p.text:
        area = faculty.p.text.split('Area:')[1].strip()
    areas.append(area)

    #since we have two anchor tags in the same div tag we need to extract them one at a time matching them to
    #profile and homepage
    all_hrefs = {i.text:i.get('href') for i in faculty.find_all('a')}
    
    #profile
    profile = 'N/A'
    for key, value in all_hrefs.items():
        if key.startswith('Profile & Contact Information'):
            value = value.split('#')[0]
            if value.startswith('http://www.sfu.ca/'):
                profile = value
            else:
                profile = 'http://www.sfu.ca' + value
    profiles.append(profile)

    #home page
    homepage = 'N/A'
    for key, value in all_hrefs.items():
        if key.find('Home Page') != -1:
            homepage = value
    homepages.append(homepage)
    
#create dataframe of all the lists prepared
csfaculty_info = pd.DataFrame({'name':names,
                              'rank':ranks,
                              'area':areas,
                              'profile':profiles,
                              'homepage':homepages
                              })
csfaculty_info = csfaculty_info[csfaculty_info['name'] != 'Ryan Shea']
csfaculty_info.to_csv('faculty_table.csv', index=False)
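
The profile links in the cell above are made absolute by prepending the host by hand. As an alternative, here is a minimal sketch using the standard library's `urllib.parse`. The helper name `normalize_link` and the sample `href` values are illustrative assumptions, not part of the assignment.

```python
from urllib.parse import urldefrag, urljoin

# Page the links were scraped from (taken from Task 1).
BASE_URL = "http://www.sfu.ca/computing/people/faculty.html"


def normalize_link(href, base_url=BASE_URL):
    """Resolve a possibly relative href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


# Hypothetical href values, for illustration only.
print(normalize_link("/computing/people/faculty/jiannanwang.html#contact"))
print(normalize_link("http://www.sfu.ca/computing/people/faculty/jiannanwang.html"))
```

Both calls should print the same absolute URL, because `urljoin` leaves an already absolute link unchanged and `urldefrag` removes the part after `#`.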

### (c) Interesting Finding

Note that you don't need to do anything for Task 1(c). The purpose of this part is to give you a sense of how to use exploratory data analysis (EDA) to come up with interesting questions about the data. EDA is an important topic in data science; you will learn about it soon in this course.


First, please install [dataprep](http://dataprep.ai).
Then, run the cell below. 
It shows a bar chart for every column. What interesting findings can you get from these visualizations? 

In [None]:
# !pip install dataprep
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_table.csv")
plot(df)

Below are some examples:

**Finding 1:** The number of Professors (26) is more than twice the number of Associate Professors (10).

**Questions:** Why did this happen? Is it common in all CS schools in Canada? Will the gap grow or shrink in five years? What actions can be taken to widen or narrow the gap?


**Finding 2:** The `homepage` column has 22% missing values.

**Questions:** Why are there so many missing values? Is it because many faculty do not have their own homepages or do not add their homepages to the school page? What actions can be taken to prevent this in the future?

**Finding 3**: In the `area` column, 'instructors' accounts for 13.24% of all faculty.

**Question**: Is 'instructors' used as a generic area for teaching-focused faculty? Could it be replaced with more specific teaching areas so that the website is more informative for its users?

**Finding 4**: The rank is missing for 3 professors.

**Question**: Could the website be updated with the missing ranks of these professors? More broadly, could this type of analysis be used to identify missing values in other fields for faculty in other departments?

## Task 2: Age Follows Normal Distribution?

In this task, you start with a question and then figure out what data to collect.

The question that you are interested in is `Does SFU CS faculty age follow a normal distribution?`

To estimate the age of a faculty member, you can collect the year in which they graduated from a university (`gradyear`) and then estimate `age` using the following equation, which assumes that a person graduates at about age 23:

$$age \approx 2021+23 - gradyear$$

For example, if one graduates from a university in 1990, then the age is estimated as 2021+23-1990 = 54. 
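
The estimate can be written as a small helper. This is a minimal sketch: the function name `estimate_age` and the two constant names are assumptions introduced for illustration, while the values 2021 and 23 come from the equation above.

```python
CURRENT_YEAR = 2021      # year used in the assignment's formula
GRADUATION_AGE = 23      # assumed age at graduation, also from the formula


def estimate_age(gradyear):
    """Estimate age as CURRENT_YEAR + GRADUATION_AGE - gradyear."""
    return CURRENT_YEAR + GRADUATION_AGE - gradyear


print(estimate_age(1990))  # 54, matching the worked example above
```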



### (a) Crawl Web Page

You notice that faculty profile pages contain graduation information. For example, you can see that Dr. Jiannan Wang graduated from Harbin Institute of Technology in 2008 at [http://www.sfu.ca/computing/people/faculty/jiannanwang.html](http://www.sfu.ca/computing/people/faculty/jiannanwang.html). 


Please write code to download the 68 profile pages and save each page as a text file. 
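
One way to choose a file name for each saved page is to reuse the last segment of its URL. The sketch below shows this with the standard library only; `profile_filename` is a hypothetical helper, not something the assignment requires.

```python
import posixpath
from urllib.parse import urlsplit


def profile_filename(url):
    """Return the last path segment of url, ignoring any query string or fragment."""
    return posixpath.basename(urlsplit(url).path)


print(profile_filename("http://www.sfu.ca/computing/people/faculty/jiannanwang.html"))
```

Parsing the URL first, rather than splitting the raw string on `/`, keeps a trailing `?query` or `#fragment` out of the file name.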

In [None]:
# Write your code
import os

import pandas as pd
import requests

csfaculty_info = pd.read_csv('faculty_table.csv')
# pandas reads the 'N/A' placeholders as NaN, so this drops rows without a profile link
csfaculty_info = csfaculty_info[csfaculty_info['profile'].notna()]
profiles = csfaculty_info['profile']
path = 'all_profiles/'
# make a new directory if it does not exist
os.makedirs(path, exist_ok=True)

for webpage in profiles:
    response = requests.get(webpage)
    # name each file after the last URL segment and save the raw bytes
    with open(path + webpage.split('/')[-1], 'wb') as f:
        f.write(response.content)

### (b) Extract Structured Data

Please write code to extract the earliest graduation year (e.g., 2008 for Dr. Jiannan Wang) from each profile page, and create a CSV file like [faculty_grad_year.csv](./faculty_grad_year.csv).
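
The core of this step is picking the smallest year out of a piece of text. A minimal sketch with a regular expression follows. The helper name `earliest_year`, the 1900 to 2099 range, and the sample string are assumptions made for illustration; real profile pages may need different handling.

```python
import re

# Four-digit years from 1900 to 2099 that are not part of a longer number.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def earliest_year(text):
    """Return the smallest year mentioned in text, or None if there is none."""
    years = [int(match) for match in YEAR_PATTERN.findall(text)]
    return min(years) if years else None


# Hypothetical education text, for illustration only.
sample = "Ph.D., Example University, 2013. B.Sc., Another University, 2008."
print(earliest_year(sample))  # 2008
```

Restricting the pattern to plausible years is a design choice that avoids picking up other four-digit numbers, such as room or phone numbers, that might appear in the same section.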

In [None]:
# write your code here
import os
import re

from bs4 import BeautifulSoup
import pandas as pd


def get_all_text_string(container, required_tag):
    """Concatenate the text of every `required_tag` element in `container`."""
    return ''.join(tag.text for tag in container.find_all(required_tag))


def is_education_section(section):
    """Return True if an h2, the first p, or the first h3 mentions 'Education'."""
    if 'Education' in get_all_text_string(section, 'h2'):
        return True
    for tag in (section.p, section.h3):
        if tag is not None and tag.text.startswith('Education'):
            return True
    return False


path = 'all_profiles/'
names = []
all_edu_years = []
for html in sorted(os.listdir(path)):
    # read as bytes so BeautifulSoup can detect the page's encoding
    with open(path + html, 'rb') as f:
        csfaculty_soup = BeautifulSoup(f, 'html.parser')

    # name of the faculty member: first h1 inside a 'title section' div
    name = 'N/A'
    for container in csfaculty_soup.find_all('div', class_='title section'):
        if container.h1 is not None:
            name = container.h1.text.strip()
            break

    # collect every four-digit number found in the Education section(s)
    education_years = []
    for section in csfaculty_soup.find_all('div', class_='text parbase section'):
        if not is_education_section(section):
            continue
        # list items take precedence over paragraphs, as in the original logic
        tag = 'li' if section.li is not None else 'p'
        text = get_all_text_string(section, tag)
        education_years += [int(year) for year in re.findall(r'\d{4}', text)]

    # append exactly one name and one year per page so both lists stay aligned
    names.append(name)
    all_edu_years.append(min(education_years) if education_years else 'N/A')

# create a dataframe from the two lists and save it without the index column
csfaculty_info = pd.DataFrame({'name': names,
                               'gradyear': all_edu_years})
csfaculty_info.to_csv('faculty_grad_year.csv', index=False)

### (c) Interesting Finding

Similar to Task 1(c), you don't need to do anything here. Just look at different visualizations w.r.t. age and give yourself an answer to the question: `Does SFU CS faculty age follow a normal distribution?`

In [None]:
# !pip install dataprep

from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_grad_year.csv")
df["age"] = 2021+23-df["gradyear"]

plot(df, "age")

Yes, the ages of faculty members approximately follow a normal distribution. There is a slight left skew: the values peak between the 30s and the mid-50s, and the median is around 45. This can be seen in the normal Q-Q plot and the box plot. The difference between the lower quartile and the median is larger than the difference between the upper quartile and the median.
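
The skew mentioned above can also be checked numerically. The following is a minimal sketch using only the standard library. The function name `sample_skewness` and the list of ages are assumptions for illustration, and the ages are made up rather than taken from the scraped data.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical ages, for illustration only (not the scraped data).
ages = [31, 38, 42, 45, 47, 50, 52, 55]
print(statistics.fmean(ages), statistics.median(ages), sample_skewness(ages))
```

A negative value suggests a longer tail on the left, a positive value a longer tail on the right, and a value near zero is consistent with a symmetric distribution such as the normal.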

## Submission

Complete the code in this notebook, and submit it to the CourSys activity `Assignment 1`.