# Web and Cloud Computing (DATA 534): Lab 1
## General Lab Instructions

- This assignment is to be completed in Python. Submit both a `.ipynb` file (you can add your answers directly to this one) and a rendered `.md`.
- The demo file will help you with the basics for this lab. If you are already comfortable with scraping data using Python, you can safely skip it.
- Hint: Your browser's developer tools are your ally in navigating a website's structure.

# Exercise 1: Creating a prerequisite diagram

Below is a prerequisite chart for the 25 MDS courses:

![](MDS-prereq2.png)

In this assignment, you will reproduce this graph, or something very similar, by scraping the prerequisite info from https://courses.students.ubc.ca/cs/courseschedule?tname=subj-department&campuscd=UBCO&dept=DATA&pname=subjarea. Note that you might need to append `&campuscd=UBCO` to the end of the URL string for the course pages to display properly
(e.g., https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=101&campuscd=UBCO).
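
One way to make sure the campus parameter is never forgotten is to build every course URL from its parts. The sketch below is only an illustration: the helper name `course_url` and the constant `BASE_URL` are assumptions introduced here, and the parameter names are copied from the example URL above.

```python
from urllib.parse import urlencode

# Path copied from the example URL above.
BASE_URL = "https://courses.students.ubc.ca/cs/courseschedule"


def course_url(dept: str, course: str, campus: str = "UBCO") -> str:
    """Build a course-page URL that always carries the campuscd parameter.

    Hypothetical helper: only the query parameter names come from the lab text.
    """
    params = {
        "pname": "subjarea",
        "tname": "subj-course",
        "dept": dept,
        "course": course,
        "campuscd": campus,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(course_url("DATA", "101"))
```

For `DATA` and `101`, the printed URL should be identical to the example course URL given above.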

Try loading webpages in an incognito browser window to see the page that the Python code will receive.

In this assignment, you will implement a simple crawler that scrapes the UBC SSC course schedule pages. Here are the steps:

1. Request the url https://courses.students.ubc.ca/cs/courseschedule?tname=subj-department&campuscd=UBCO&dept=DATA&pname=subjarea;
2. Create a `BeautifulSoup` object from the retrieved webpage;
3. Now, for each one of the courses in the page:
    1. obtain the respective link;
    2. retrieve the pre-reqs.

**NOTE:** To avoid being flagged/blocked by UBC security, connect to the UBC VPN before completing the lab.
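
Once the pre-req text of each course has been scraped, it still has to be turned into graph edges. The following is a minimal sketch of that last step, under two assumptions: the pre-req sentence looks like `Pre-reqs: All of DATA 901, DATA 902.` (real pages may be worded differently), and the diagram is drawn with Graphviz from DOT text. The function names and the course numbers are made up for illustration.

```python
import re
from typing import Dict, List

# A course code is assumed to be 3-4 capital letters followed by a 3-digit number.
COURSE_CODE = re.compile(r"\b([A-Z]{3,4})\s+(\d{3})\b")


def parse_prereqs(text: str) -> List[str]:
    """Return the course codes mentioned in a 'Pre-reqs:' sentence."""
    return [f"{dept} {num}" for dept, num in COURSE_CODE.findall(text)]


def to_dot(prereqs: Dict[str, List[str]]) -> str:
    """Render a {course: [prerequisites]} mapping as Graphviz DOT text."""
    lines = ["digraph prereqs {"]
    for course, required in prereqs.items():
        lines.append(f'    "{course}";')  # declare the node even if it has no edges
        for req in required:
            lines.append(f'    "{req}" -> "{course}";')  # arrow points to the later course
    lines.append("}")
    return "\n".join(lines)


# Fictitious course numbers, used only to show the shape of the data.
sample = {
    "DATA 903": parse_prereqs("Pre-reqs: All of DATA 901, DATA 902."),
    "DATA 901": parse_prereqs(""),
}
print(to_dot(sample))
```

The resulting text can be saved to a `.dot` file and rendered with the Graphviz `dot` command, or passed to any library that accepts DOT input.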

In [1]:
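# TODO: Make request
import requests

try:
    # A timeout keeps the cell from hanging if the server never answers
    ubc_request = requests.get("https://courses.students.ubc.ca/cs/courseschedule?tname=subj-department&campuscd=UBCO&dept=DATA&pname=subjarea", timeout=30)
    ubc_request.raise_for_status()  # treat HTTP error status codes as failures too
except requests.RequestException as e:
    print(e)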

In [2]:
# TODO: Create BeautifulSoup object
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re

# Reuse the response fetched in the previous cell instead of requesting the page again.
# Naming the parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning.
text = BeautifulSoup(ubc_request.text, "html.parser")

# TODO: Get courses

tables = text.find_all("table")
# print(tables[0].find_all("td"))
# The cells of the first table alternate between course code and title,
# so every second cell completes one (Course, Title) row.
row = []
df = []
for i,entry in enumerate(tables[0].find_all("td")):
    row.append(entry.text)
    if i%2 != 0:
        df.append(row)
        row = []
df_courses = pd.DataFrame(df, columns = ["Course", "Title"])
#print(df_courses)

# TODO: Get link for each course

# The slice assumes every href ends with "&dept=DATA&course=NNN" (21 characters);
# that tail is appended to a base URL that already carries campuscd=UBCO.
url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&campuscd=UBCO"
links = [url + str(entry.get('href'))[-21:] for entry in tables[0].find_all("a")]
df_course_links = pd.DataFrame(links, columns = ["Course Page"])
pd.set_option('display.max_colwidth', None)
#print(df_course_links)

# TODO: Retrieve the page using the link to get pre-reqs
df = []
for k in df_course_links["Course Page"]:
    prereqs = "N/A"  # default when the request fails or no DATA pre-reqs are listed
    try:
        ubc_request = requests.get(k, timeout=30)
        text = BeautifulSoup(ubc_request.text, "html.parser")
        for p in text.find_all("p"):
            if "Pre-reqs" in p.text and "DATA" in p.text:
                temp = p.text.replace("Pre-reqs:", '').replace(".", '').replace("All of ", '')
                # Join the codes into one string so each course fills exactly
                # one cell of the single "Pre-reqs" column.
                prereqs = ", ".join(item.strip() for item in temp.split(','))
                break
    except Exception as e:
        print(e)
    # Append exactly one entry per course, even after an error, so the rows
    # stay aligned with df_courses and df_course_links in pd.concat below.
    df.append(prereqs)
df_prereq = pd.DataFrame(df, columns = ["Pre-reqs"])
#print(df_prereq)

final_df = pd.concat([df_courses, df_course_links, df_prereq], axis=1)
print(final_df)

HTTPSConnectionPool(host='courses.students.ubc.ca', port=443): Max retries exceeded with url: /cs/courseschedule?pname=subjarea&tname=subj-course&campuscd=UBCO&dept=DATA&course=541 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe9128241f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
       Course                                                Title  \
0    DATA 101                         Making Predictions with Data   
1    DATA 301                       Introduction to Data Analytics   
2    DATA 310                          Applied Regression Analysis   
3    DATA 311                                     Machine Learning   
4    DATA 315                  Applied Time Series and Forecasting   
5    DATA 405                  Stochastic Modelling and Simulation   
6    DATA 407                                  Sampling and Design   
7    DATA 410             Regression and Generalized Linear Models   
8   DA

# (Optional) Exercise 2 

In this exercise, you will use the [`Scrapy`](https://docs.scrapy.org/en/latest/intro/tutorial.html) package to do the scraping you did in Exercise 1. This cannot be done in a Jupyter notebook, so check the file `lab1_question2.md` in the lab1 folder for instructions.

# (Optional) Exercise 3

Taking this to the next level, you could point your scraper one level up, to the page that lists all subjects: https://courses.students.ubc.ca/cs/courseschedule?tname=subj-all-departments&pname=subjarea&campuscd=UBCO

Crawl through _all_ subjects and _all_ courses and report the course with the largest number of students enrolled.
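
The crawl has two nested levels (subjects, then courses), followed by a running maximum. The sketch below shows only that control flow and is not tied to the real page layout: `list_courses` and `count_enrolled` are hypothetical callables that you would implement with `requests` and `BeautifulSoup`, and the sample data is invented so the example runs offline.

```python
from typing import Callable, Iterable, Optional, Tuple


def busiest_course(
    subjects: Iterable[str],
    list_courses: Callable[[str], Iterable[str]],
    count_enrolled: Callable[[str], int],
) -> Optional[Tuple[str, int]]:
    """Return (course, enrolment) for the course with the most students, or None."""
    best: Optional[Tuple[str, int]] = None
    for subject in subjects:
        for course in list_courses(subject):
            enrolled = count_enrolled(course)
            # Keep only the largest enrolment seen so far
            if best is None or enrolled > best[1]:
                best = (course, enrolled)
    return best


# Stand-in data; a real run would scrape these values page by page.
fake_catalogue = {"AAA": ["AAA 100", "AAA 200"], "BBB": ["BBB 100"]}
fake_enrolment = {"AAA 100": 40, "AAA 200": 15, "BBB 100": 120}

# prints ('BBB 100', 120)
print(busiest_course(fake_catalogue, fake_catalogue.__getitem__, fake_enrolment.__getitem__))
```

Because this crawl touches far more pages than Exercise 1, adding a short pause between requests inside the real `list_courses` and `count_enrolled` functions is a sensible precaution.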

# Exercise 4

All the Game of Thrones episodes are listed, by season, at the following URL: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes

Unfortunately, the running time of each episode is not available on that page. However, the page linked from each episode (e.g., https://en.wikipedia.org/wiki/Dragonstone_(Game_of_Thrones)) gives the running time of that episode. Collect each episode's title, season, number of U.S. viewers, and running time from Wikipedia, and create a pandas dataframe with the information collected.
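
Scraped table cells often carry citation markers and units, so the values need cleaning before they are useful as numbers. The helpers below are a small sketch of that cleaning step. They assume running times appear as text like `61 minutes[1]` and viewer counts like `2.22[20]`; the function names and the citation numbers in the sample strings are invented for illustration.

```python
import re
from typing import Optional

FOOTNOTE = re.compile(r"\[.*?\]")  # citation markers such as [12]


def strip_footnotes(cell_text: str) -> str:
    """Remove bracketed citation markers and surrounding whitespace."""
    return FOOTNOTE.sub("", cell_text).strip()


def runtime_minutes(cell_text: str) -> Optional[int]:
    """Extract the running time in minutes, or None if no 'N minutes' is found."""
    match = re.search(r"(\d+)\s*minutes", strip_footnotes(cell_text))
    return int(match.group(1)) if match else None


def viewers_millions(cell_text: str) -> Optional[float]:
    """Convert a viewers cell to a float, or None if it is not numeric."""
    try:
        return float(strip_footnotes(cell_text))
    except ValueError:
        return None


print(runtime_minutes("61 minutes[1]"), viewers_millions("2.22[20]"))
```

Storing the running time and viewer count as numbers rather than strings makes it possible to sort or aggregate the resulting dataframe directly.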

In [4]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import time

got_request = requests.get("https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes")
print('connected')

text = BeautifulSoup(got_request.text, "html.parser")  # name the parser explicitly

url = "https://en.wikipedia.org"
temp = []
df = []
tables = text.find_all("table")
footnote = re.compile(r'\[.*?\]')  # citation markers such as [12]
for i in range(1,9):  # tables[1]..tables[8] are taken to be seasons 1-8
    for j,entry in enumerate(tables[i].find_all("td")):
        if '"' in entry.text:  # title cells wrap the episode name in quotes
            temp.append(i)
            result_string = entry.text.replace('"','')
            temp.append(result_string)
            # Follow the episode link and read the running time from its page
            temp_url = url + str(entry.find("a").get('href'))
            temp_request = requests.get(temp_url)
            text2 = BeautifulSoup(temp_request.text, "html.parser")
            for cell in text2.find_all("td"):  # "cell" avoids shadowing the built-in min()
                if 'minutes' in cell.text:
                    temp.append(footnote.sub('', cell.text))
                    break
        if '[' in entry.text:  # the viewers cell carries a citation and closes the row
            temp.append(footnote.sub('', entry.text))
            df.append(temp)
            temp = []
        time.sleep(.005)
    time.sleep(.005)
df_got = pd.DataFrame(df, columns = ["Season", "Episode Title", "Episode Runtime", "U.S Viewers (M)"])
print(df_got)

connected
    Season                          Episode Title Episode Runtime  \
0        1                       Winter Is Coming      61 minutes   
1        1                          The Kingsroad      55 minutes   
2        1                              Lord Snow      57 minutes   
3        1  Cripples, Bastards, and Broken Things      55 minutes   
4        1                  The Wolf and the Lion      54 minutes   
..     ...                                    ...             ...   
68       8         A Knight of the Seven Kingdoms      57 minutes   
69       8                         The Long Night      81 minutes   
70       8                 The Last of the Starks      77 minutes   
71       8                              The Bells      77 minutes   
72       8                        The Iron Throne      78 minutes   

   U.S Viewers (M)  
0             2.22  
1             2.20  
2             2.44  
3             2.45  
4             2.58  
..             ...  
68           1