In [33]:
import pandas as pd
import time
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
from graphviz import Digraph
import re

# Web and Cloud Computing (DATA 534): Lab 1
## General Lab Instructions

- This assignment is to be completed in Python, submitting both a `.ipynb` file (you can add your answers directly to this one) and a rendered `.md`.
- I added an Intro section to help you with the basics for this lab. If you are already comfortable with scraping data using Python, you can safely skip it.

## Intro 

This intro section introduces the main Python functions for web scraping and crawling.

### Web requests

When you navigate to a website, a lot happens under the hood. Many layers of protocols allow you to communicate with a web server (take a look at the [OSI model](https://en.wikipedia.org/wiki/OSI_model) if you are curious). We will mostly be dealing with the top layer (Layer 7 - [application layer](https://en.wikipedia.org/wiki/Application_layer)) and [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) (HyperText Transfer Protocol). We'll leave it to the Python libraries to handle the details of the other layers for us. 

HTTP is text-based. For example, to send a GET request to a web server:
```
GET / HTTP/1.1
Host: www.google.com
User-Agent: Python-urllib/3.6
```
and the server sends back a response, also with a header and the requested content.
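
To make the "text-based" point concrete, here is a minimal sketch that builds the request text above by hand and splits a response status line into its parts. The helper names `build_get_request` and `parse_status_line` are made up for this illustration; in practice the Python libraries do this for us.

```python
def build_get_request(host, path="/", user_agent="Python-urllib/3.6"):
    """Return the text of a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",      # request line: method, path, protocol version
        f"Host: {host}",             # header naming the server we want to talk to
        f"User-Agent: {user_agent}", # header identifying the client
        "",                          # an empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)        # HTTP separates lines with CRLF


def parse_status_line(line):
    """Split a response status line into (version, status code, reason)."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


request_text = build_get_request("www.google.com")
print(request_text)
print(parse_status_line("HTTP/1.1 404 Not Found"))
```

Nothing is sent over the network here; the sketch only shows what the text exchanged between client and server looks like.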

Here we will use Python (instead of a web browser like Chrome) to collect information from the web. For illustration purposes, let's scrape historical data on the World Cups (soccer) available in [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners).

#### The `urllib` package

[`urllib`](https://docs.python.org/3/library/urllib.html) is a built-in Python package for working with URLs. To open and read URLs, we use the function [`urllib.request.urlopen`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request). Let's start by importing this function.

In [3]:
from urllib.request import urlopen

Ok, now all we have to do is call the function and pass the URL we want. Note that although we don't usually need to type the scheme (`http://` or `https://`) in a web browser, here we do (try removing this part if you are curious).

In [4]:
soccer_urllib = urlopen("https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners")

Now `soccer_urllib` contains some information about our request. For example:

In [5]:
print("The url of our request: ", end="")
print(soccer_urllib.geturl())

The url of our request: https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners


We can also get the headers from the server's response (you don't need to spend a lot of time trying to understand them; this is just to show you how to access this info):

In [6]:
print(soccer_urllib.info())

Date: Tue, 19 Jan 2021 15:04:58 GMT
Server: mw1271.eqiad.wmnet
X-Content-Type-Options: nosniff
P3p: CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Content-Language: en
Vary: Accept-Encoding,Cookie,Authorization
X-Request-Id: YAb1GgpAAEIAADO9rJoAAABC
Last-Modified: Thu, 14 Jan 2021 18:32:40 GMT
Content-Type: text/html; charset=UTF-8
Age: 46087
X-Cache: cp4032 miss, cp4032 hit/1
X-Cache-Status: hit-front
Server-Timing: cache;desc="hit-front"
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
Report-To: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
NEL: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
Set-Cookie: WMF-Last-Access=20-Jan-2021;Path=/;HttpOnly;secure;Expires=Sun, 21 Feb 2021 00:00:00 GMT
Set-Cookie: WMF-Last-Access-Globa

Ok, but what about the content of the page? Let's check it out!

In [10]:
# soccer_html = soccer_urllib.read()
# print(soccer_html)

Yes, it is a complete mess, I know. How do we make sense of all this? Well, there are different approaches. We could use regular expressions (not a good idea, though), or we could use Python packages such as [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [lxml](https://lxml.de/tutorial.html) to help us make sense of this thing. Here we will focus on `BeautifulSoup`, but you have seen `lxml` in 513, remember?

We have to be careful when using `urllib.request.urlopen`: if the request fails (for example, because the URL is malformed or the page does not exist), it raises an exception and stops your program. Try running the commented code in the following cell:

In [9]:
#my_pg = urlopen("www.lourenzu.com")

The exceptions raised by `urllib.request` are defined in [`urllib.error`](https://docs.python.org/3/library/urllib.error.html#module-urllib.error). So we can import this package and make a more robust piece of code that handles these exceptions:

In [12]:
from urllib.error import HTTPError, URLError
try:
    my_pg = urlopen("http://www.lourenzu.com")
except URLError as error:
    print(error.reason)

[Errno 11001] getaddrinfo failed


In [13]:
try:
    my_pg = urlopen("http://www.google.com/rodolfo_lourenzutti")
except HTTPError as error:
    print("The famous error code: ", error.code)
    print("The reason for the exception:", error.reason)

The famous error code:  404
The reason for the exception: Not Found
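
The two cells above can be combined into one small helper. This is only a sketch: `try_open` is a name invented here, and returning `None` on failure is just one possible design. Note that `HTTPError` is a subclass of `URLError`, so it has to be caught first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def try_open(url, timeout=10):
    """Return the open response, or None if the request fails (hypothetical helper)."""
    try:
        return urlopen(url, timeout=timeout)
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so this clause must come first.
        print("HTTP error:", error.code, error.reason)
    except URLError as error:
        # The server could not be reached at all (e.g. the host name does not resolve)
        print("Could not reach the server:", error.reason)
    except ValueError as error:
        # Raised for malformed URLs, e.g. when the scheme (http://) is missing
        print("Invalid URL:", error)
    return None


my_pg = try_open("www.lourenzu.com")  # no scheme, so this prints "Invalid URL: ..."
print(my_pg)
```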


Handling exceptions is an important part of web scraping. Let's see an alternative package next.

#### The `requests` package

The [`requests`](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers) package is an alternative (not built-in) HTTP library that is becoming more and more popular. Let's load the package and request our Wikipedia page: 

In [14]:
import requests

soccer_requests = requests.get("https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners")

Unlike `urllib`, `requests` does not raise an exception for HTTP error status codes such as 404:

In [15]:
bad_request = requests.get("http://www.google.com/rodolfo_lourenzutti")
bad_request

<Response [404]>

But it does raise an exception when the request itself fails, for example when the host name cannot be resolved (uncomment the line below to see):

In [17]:
# requests.get("http://www.lourenzu.com")
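
If you would rather have HTTP error codes treated as failures too, `requests` offers `Response.raise_for_status()`. The sketch below wraps both kinds of failure; `check_response` and `fetch` are hypothetical helper names, and returning `None` on failure is an assumption made for the illustration.

```python
import requests


def check_response(response):
    """Return the response if its status is OK, otherwise None (hypothetical helper)."""
    try:
        # raise_for_status() raises requests.exceptions.HTTPError for 4xx and 5xx codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as error:
        print("HTTP error:", error)
        return None
    return response


def fetch(url, timeout=10):
    """Request url; return the response, or None if anything went wrong (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as error:
        # Base class for connection problems, timeouts, invalid URLs, and so on
        print("Request failed:", error)
        return None
    return check_response(response)
```

With these helpers, `fetch("http://www.google.com/rodolfo_lourenzutti")` would print the 404 error and return `None` instead of handing back a response you then have to inspect.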

Now, let's check the status code of our Wikipedia request:

In [18]:
soccer_requests.status_code

200

In [19]:
soccer_requests.reason

'OK'

Great! Code 200 means everything is ok! We can also check the headers of the request and of the response: 

In [20]:
# Headers of the request
print(soccer_requests.request.headers)

{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}


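The response headers are available in the `headers` attribute. The content of the page itself is in the `text` attribute, shown (commented out) in the cell after the headers: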

In [21]:
# Headers of the response
soccer_requests.headers

{'Date': 'Tue, 19 Jan 2021 15:04:58 GMT', 'Server': 'mw1271.eqiad.wmnet', 'X-Content-Type-Options': 'nosniff', 'P3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-Language': 'en', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'X-Request-Id': 'YAb1GgpAAEIAADO9rJoAAABC', 'Last-Modified': 'Thu, 14 Jan 2021 18:32:40 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Encoding': 'gzip', 'Age': '46515', 'X-Cache': 'cp4032 miss, cp4032 hit/2', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Report-To': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'NEL': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'Set-Cookie': 'WMF-Last-A

In [31]:
#soccer_requests.text

### Beautiful Soup

[`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a very useful package for handling data in HTML format. Its two main methods are:
1. [`BeautifulSoup.find()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)
2. [`BeautifulSoup.find_all()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)

There are other methods that might be useful as well (e.g., `find_next_sibling()`, `find_parent()`).
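
Before turning to the real page, here is a minimal sketch of `find()` and `find_all()` on a tiny hand-written HTML snippet. The snippet is made up for illustration (it only imitates the structure of the Wikipedia table), and `"html.parser"` is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only; not copied from the real page
html = """
<table class="wikitable">
  <tr><th scope="col">Player</th><th scope="col">Team</th></tr>
  <tr><th data-sort-value="Pele" scope="row"><a href="/wiki/Pel%C3%A9">Pelé</a></th><td>Brazil</td></tr>
  <tr><th data-sort-value="Zoff, Dino" scope="row"><a href="/wiki/Dino_Zoff">Dino Zoff</a></th><td>Italy</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")       # find() returns the first matching tag (or None)
all_cells = soup.find_all("td")   # find_all() returns a list of every matching tag
missing = soup.find("video")      # no <video> tag in the snippet, so this is None

# Keep only the <th> tags that carry a data-sort-value attribute
player_names = [th.get_text(strip=True)
                for th in soup.find_all("th")
                if th.has_attr("data-sort-value")]

print(first_link["href"])
print([cell.text for cell in all_cells])
print(player_names)
```

Because `find()` returns `None` when nothing matches, code that chains calls on its result usually needs a check or a `try`/`except`.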

Let's start by importing the library and creating a `BeautifulSoup` object.

In [22]:
from bs4 import BeautifulSoup

In [23]:
# Creating a BeautifulSoup object.
soccer = BeautifulSoup(soccer_requests.text, "html.parser")  # the text attribute holds the content of our response; naming the parser avoids a bs4 warning

Great! Now we have a `BeautifulSoup` object stored in the `soccer` variable. We can now use the method `find_all()` to retrieve information. For example, let's find the tables contained in the page.

In [24]:
tables = soccer.find_all("table")
tables

[<table class="wikitable plainrowheaders sortable">
 <caption>Players who have won the World Cup
 </caption>
 <tbody><tr>
 <th rowspan="2" scope="col">Player
 </th>
 <th rowspan="2" scope="col">Team
 </th>
 <th rowspan="2" scope="col">Titles won
 </th>
 <th rowspan="2" scope="col">Year(s)
 </th>
 <th colspan="2" scope="col">Other appearances
 </th>
 <th class="unsortable" rowspan="2" scope="col">Profile
 </th></tr>
 <tr>
 <th scope="col">As player
 </th>
 <th scope="col">As manager
 </th></tr>
 <tr>
 <th data-sort-value="Pele" scope="row"><a href="/wiki/Pel%C3%A9" title="Pelé">Pelé</a>
 </th>
 <td><span style="white-space:nowrap"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="504" data-file-width="720" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Flag_of_Brazil_%281968%E2%80%931992%29.svg/22px-Flag_of_Brazil_%281968%E2%80%931992%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Flag_of_Brazil_%2819

Now we have a list of the tables in the web page. Let's scrape the first table, which, as the caption in the output above shows, lists the players who have won the World Cup.

In [25]:
# Collect the text of every <td> cell in the first table, three cells per row
row = []
rows = []
for i, entry in enumerate(tables[0].find_all("td")):
    row.append(entry.text)
    if (i + 1) % 3 == 0:  # every third cell closes a row
        rows.append(row)
        row = []

df = pd.DataFrame(rows, columns=["team", "n_titles", "years"])
df

Unnamed: 0,team,n_titles,years
0,Brazil\n,3\n,"1958, 1962, 1970\n"
1,1966\n,\n,[fp 1]\n
2,Brazil\n,2\n,"1958, 1962\n"
3,1966\n,\n,[fp 2]\n
4,Brazil\n,2\n,"1994, 2002\n"
...,...,...,...
885,\n,\n,[fp 443]\n
886,Brazil\n,1\n,1994\n
887,\n,\n,[fp 444]\n
888,Italy\n,1\n,1982\n


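How cool is that? Of course, we need some data cleaning. Notice also that every other row looks odd: judging by the table header printed earlier, each player row has six `<td>` cells, so grouping the cells in threes splits each player across two DataFrame rows.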

In [26]:
df = df.replace("\n", "", regex=True)
df = df.replace(r"\[[A-Za-z0-9 ]*\]", "", regex=True)  # raw string, so the backslashes reach the regex engine unchanged
df

Unnamed: 0,team,n_titles,years
0,Brazil,3,"1958, 1962, 1970"
1,1966,,
2,Brazil,2,"1958, 1962"
3,1966,,
4,Brazil,2,"1994, 2002"
...,...,...,...
885,,,
886,Brazil,1,1994
887,,,
888,Italy,1,1982
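
For comparison, here is a sketch of the same grouping and cleaning steps on a short hard-coded list of cell strings copied from the raw output shown earlier. It groups six cells per row, on the assumption (read off the table header) that each player row has six `<td>` cells. `clean_cell` and `group_cells` are names invented for this illustration.

```python
import re


def clean_cell(text):
    """Strip newlines and bracketed markers such as [fp 1] from a cell's text."""
    text = text.replace("\n", "")
    return re.sub(r"\[[A-Za-z0-9 ]*\]", "", text)


def group_cells(cells, row_width):
    """Split a flat list of cells into rows of row_width cells.

    An incomplete trailing row is dropped, as in the loop used in this notebook.
    """
    return [cells[i:i + row_width]
            for i in range(0, len(cells) - row_width + 1, row_width)]


# Cell texts copied from the first rows of the raw output (two players)
raw_cells = ["Brazil\n", "3\n", "1958, 1962, 1970\n", "1966\n", "\n", "[fp 1]\n",
             "Brazil\n", "2\n", "1958, 1962\n", "1966\n", "\n", "[fp 2]\n"]

rows = group_cells([clean_cell(cell) for cell in raw_cells], row_width=6)
for row in rows:
    print(row)
```

Each printed row now holds one player's team, titles, winning years, other appearances as player and as manager, and the (emptied) profile marker.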


As we can see, Brazil has the highest number of World Cup titles in the world. YEAH, yeah! Don't get upset, Canada would crush us in hockey (I don't know if we even have a team!). We could keep manipulating this dataframe, but let's keep our focus on scraping data. Let's get the names of the players that won the World Cup, which are in the row headers (the first column) of the first table on the wiki page.

In [32]:
#tables[0]

In [28]:
players = []
teams = []
for entry in tables[0].find_all("th"):
    if entry.has_attr("data-sort-value"): # Note the function I'm using here: has_attr(). Inspect the page on your browser to see why
        players.append(entry.text)
        
players = pd.Series(players, name='Player', dtype='object').str.replace("\n","")
players

0                   Pelé
1                Bellini
2                   Cafu
3               Castilho
4                   Didi
             ...        
440                Zetti
441      Zinedine Zidane
442    Ron-Robert Zieler
443                Zinho
444            Dino Zoff
Name: Player, Length: 445, dtype: object

But why are we doing this? Isn't it easier to just copy the table directly from the page? Well, if you only have one table, it would be. But say that you want the date of birth of all the players that won the World Cup. This information is not in the table, so we need to gather it from somewhere else. However, the table does provide the link to the wiki page of each one of those players. We just need to go there and gather that information. Doing it manually for all 445 players would be annoying, right? 

Well, this could take a while with 445 players, so let's do it for the first 25 players just to get a taste.

In [29]:
url = "https://en.wikipedia.org"
bday = []
for name in players[0:25]:
    # Build the player's URL once from the link in the table
    player_url = url + tables[0].find("a", text=name)["href"]
    print(player_url)
    player_pg = BeautifulSoup(requests.get(player_url).text)
    try:
        # Most player pages store the birth date in <span class="bday">
        birthday = player_pg.find("span", {"class": "bday"}).text
        print(birthday)
    except AttributeError:
        # find() returned None: fall back to the infobox row labelled "Date of birth"
        birthday = player_pg.find("th", text="Date of birth").next_sibling.text
    bday.append(birthday)
    time.sleep(.5)  # pause between requests

print("Finished!")

https://en.wikipedia.org/wiki/Pel%C3%A9
1940-10-23
https://en.wikipedia.org/wiki/Hilderaldo_Bellini
1930-06-07
https://en.wikipedia.org/wiki/Cafu
1970-06-07
https://en.wikipedia.org/wiki/Carlos_Jos%C3%A9_Castilho
1927-11-27
https://en.wikipedia.org/wiki/Didi_(footballer,_born_1928)
1928-10-08
https://en.wikipedia.org/wiki/Djalma_Santos
1929-02-27
https://en.wikipedia.org/wiki/Giovanni_Ferrari
1907-12-06
https://en.wikipedia.org/wiki/Garrincha
1933-10-28
https://en.wikipedia.org/wiki/Gylmar_dos_Santos_Neves
1930-08-22
https://en.wikipedia.org/wiki/Guido_Masetti
1907-11-22
https://en.wikipedia.org/wiki/Mauro_Ramos
1930-08-30
https://en.wikipedia.org/wiki/Giuseppe_Meazza
1910-08-23
https://en.wikipedia.org/wiki/Eraldo_Monzeglio
1906-06-05
https://en.wikipedia.org/wiki/N%C3%ADlton_Santos
1925-05-16
https://en.wikipedia.org/wiki/Daniel_Passarella
1953-05-25
https://en.wikipedia.org/wiki/Pepe_(footballer,_born_1935)
1935-02-25
https://en.wikipedia.org/wiki/Ronaldo_(Brazilian_footballer)
1976

Why did I add a `time.sleep` in the code? It is to give some time to the server. We want to be polite and not overload the server with too many requests in a short time. You need to be aware of that. If you are dealing with a small server, you could cause real problems. Besides, you could get blocked. Wikipedia is a very big server, and we aren't requesting that many pages, so we should be fine here. But always keep this in mind. 
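
The same politeness rule can be packaged as a small reusable loop. This is a sketch only: `fetch_politely` is a hypothetical helper, and it takes the fetching function as an argument so that it can be tried out without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay=0.5):
    """Call fetch(url) for each URL, sleeping `delay` seconds between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # give the server a break before the next request
        results.append(fetch(url))
    return results


# Stand-in for a real request function, so the example runs offline
def fake_fetch(url):
    return "content of " + url


pages = fetch_politely(["/wiki/Cafu", "/wiki/Garrincha"], fake_fetch, delay=0.1)
print(pages)
```

In the notebook you would pass a real function such as `requests.get` instead of `fake_fetch`.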

Now, let's check the date of birth of the first 25 players. Please note that the previous block of code must successfully finish running first for the birthday table to display properly. 

In [30]:
players_won_world_cup = pd.concat([players[0:25],pd.Series(bday, name="bday", dtype = 'object')], axis=1)
players_won_world_cup

Unnamed: 0,Player,bday
0,Pelé,1940-10-23
1,Bellini,1930-06-07
2,Cafu,1970-06-07
3,Castilho,1927-11-27
4,Didi,1928-10-08
5,Djalma Santos,1929-02-27
6,Giovanni Ferrari,1907-12-06
7,Garrincha,1933-10-28
8,Gilmar,1930-08-22
9,Guido Masetti,1907-11-22


Cool, right? 
One last thing before you start: your browser's developer tools are your ally in navigating a website's structure. 

Finally, let's begin your assignment.

# Exercise 1: Creating a prerequisite diagram

Below is a prerequisite chart for the 25 MDS courses:

![](MDS-prereq2.png)

In this assignment, you will reproduce this graph, or something very similar, by scraping the prerequisite info from https://courses.students.ubc.ca/cs/courseschedule?tname=subj-department&campuscd=UBCO&dept=DATA&pname=subjarea. Note that you might need to append `&campuscd=UBCO` to the end of the URL string for the course pages to display properly 
(e.g., https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=101&campuscd=UBCO).

Try loading the webpages in an incognito browser window to see the page that the Python code will receive. 

In this assignment, you will implement a simple crawler to crawl and scrape UBC SSC web pages to grab your course schedule. Here are the steps of what you have to do:

1. Request the url https://courses.students.ubc.ca/cs/courseschedule?tname=subj-department&campuscd=UBCO&dept=DATA&pname=subjarea;
2. Create a `BeautifulSoup` object from the retrieved webpage;
3. Now, for each one of the courses in the page:
    1. obtain the respective link;
    2. retrieve the pre-reqs;

In [134]:
# Make request
page = requests.get("https://courses.students.ubc.ca/cs/courseschedule?tname=subj-department&campuscd=UBCO&dept=DATA&pname=subjarea")

# Create BeautifulSoup object
soup = BeautifulSoup(page.text)

# TODO: Get courses
table = soup.find("table")
courses = []
links = []
for entry in table.find_all("a"):
    courses.append(entry.text)
    links.append(entry['href'])

df = pd.DataFrame({"course": courses,
                 "URL": links})
df
# TODO: Retrieve the page using the link to get pre-reqs



Unnamed: 0,course,URL
0,DATA 101,/cs/courseschedule?pname=subjarea&tname=subj-c...
1,DATA 301,/cs/courseschedule?pname=subjarea&tname=subj-c...
2,DATA 311,/cs/courseschedule?pname=subjarea&tname=subj-c...
3,DATA 407,/cs/courseschedule?pname=subjarea&tname=subj-c...
4,DATA 410,/cs/courseschedule?pname=subjarea&tname=subj-c...
5,DATA 419A,/cs/courseschedule?pname=subjarea&tname=subj-c...
6,DATA 448A,/cs/courseschedule?pname=subjarea&tname=subj-c...
7,DATA 449,/cs/courseschedule?pname=subjarea&tname=subj-c...
8,DATA 500,/cs/courseschedule?pname=subjarea&tname=subj-c...
9,DATA 501,/cs/courseschedule?pname=subjarea&tname=subj-c...


In [137]:
url = "https://courses.students.ubc.ca"
prereqs = []
titles = []
for course in courses:
    # Build the course URL once; the campus code is needed for the page to display properly
    course_url = url + table.find("a", text=course)["href"] + "&campuscd=UBCO"
    print("Scraping: ")
    print(course_url)
    course_page = requests.get(course_url)
    course_soup = BeautifulSoup(course_page.text)
    titles.append(course_soup.find("h4").text[9:])  # skip the first 9 characters of the heading
    third_paragraph = course_soup.find_all("p")[2].text
    if "Pre-reqs" in third_paragraph:
        prereqs.append(third_paragraph[14:])  # slicing removes the "Pre-reqs:" label at the start of the string
    else:
        prereqs.append("None")
    time.sleep(0.5)  # pause between requests

prereqDF = pd.DataFrame({"course": courses,
                         "Name": titles,
                         "Prerequisites": prereqs})
prereqDF

Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=101&campuscd=UBCO
Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=301&campuscd=UBCO
Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=311&campuscd=UBCO
Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=407&campuscd=UBCO
Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=410&campuscd=UBCO
Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=419A&campuscd=UBCO
Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DATA&course=448A&campuscd=UBCO
Scraping: 
https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&d

Unnamed: 0,course,Name,Prerequisites
0,DATA 101,Making Predictions with Data,
1,DATA 301,Introduction to Data Analytics,"(Either (a) third-year standing, or (b) one o..."
2,DATA 311,Machine Learning,Either (a) STAT 230 or (b) a score more than 7...
3,DATA 407,Sampling and Design,"One of STAT 230, PSYO 372, BIOL 202, ECON 327."
4,DATA 410,Regression and Generalized Linear Models,DATA 311.
5,DATA 419A,Topics in Data Science - TOPICS DATA SCIE,(Fourth-year standing.)
6,DATA 448A,Directed Studies in Data Science - DIR ST DAT...,(Third-year standing in the Data Science majo...
7,DATA 449,Honours Thesis,(Fourth-year standing and permission of the d...
8,DATA 500,Communication and Consulting in Data Science,
9,DATA 501,Data Analytics,
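
To get from the prerequisite strings to a diagram, the course codes have to be pulled out of the free text. Below is a rough sketch under two loud assumptions: course codes look like three or four capital letters, a space, three digits, and an optional letter; and every code mentioned counts as a prerequisite (the "either/or" logic is ignored). `prereq_edges` and `to_dot` are hypothetical helpers; the sample strings are copied from the table above.

```python
import re

# Assumed pattern for a course code such as "STAT 230" or "DATA 419A"
COURSE_CODE = re.compile(r"\b[A-Z]{3,4} \d{3}[A-Z]?\b")


def prereq_edges(course, prereq_text):
    """Return (prerequisite, course) pairs for every course code found in the text."""
    return [(code, course)
            for code in COURSE_CODE.findall(prereq_text)
            if code != course]


def to_dot(edges):
    """Write the edges as a graph description in Graphviz's DOT language."""
    lines = ["digraph prereqs {"]
    lines += [f'    "{source}" -> "{target}";' for source, target in edges]
    lines.append("}")
    return "\n".join(lines)


sample = {
    "DATA 407": "One of STAT 230, PSYO 372, BIOL 202, ECON 327.",
    "DATA 410": "DATA 311.",
    "DATA 419A": "(Fourth-year standing.)",
}

edges = []
for course, text in sample.items():
    edges.extend(prereq_edges(course, text))

dot_text = to_dot(edges)
print(dot_text)
```

The resulting DOT text can be handed to Graphviz for rendering, for example through the `graphviz` package imported at the top of this notebook.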


# (Optional) Exercise 2 

In this exercise you will use the [`Scrapy`](https://docs.scrapy.org/en/latest/intro/tutorial.html) package to do the scraping you did in Exercise 1. This cannot be done in a Jupyter notebook, so check the file `lab1_question2.md` in the lab1 folder for instructions.

# (Optional) Exercise 3

Taking this to the next level, you could point your scraping up one level to this page: https://courses.students.ubc.ca/cs/courseschedule?tname=subj-all-departments&pname=subjarea&campuscd=UBCO
Crawl through _all_ subjects and _all_ courses and report the course with the largest number of students enrolled.

# Exercise 4

All the Game of Thrones episodes are listed, by season, in the following URL: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes

Unfortunately, the running time of each episode is not available on that page. However, the page linked for each episode (e.g., https://en.wikipedia.org/wiki/Dragonstone_(Game_of_Thrones)) gives the running time of the respective episode. Collect the episodes' titles, season, number of U.S. viewers, and running time from Wikipedia and create a pandas dataframe with the information collected.

In [250]:
url = "https://en.wikipedia.org"
eps = []
runtimes = []
seasons = []
viewers = []
GoT = requests.get("https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes")
GotSoup = BeautifulSoup(GoT.text)
for ep in GotSoup.find_all("td", class_="summary"):
    try:
        # The non-episode specials have no link, so this lookup fails and they are skipped
        ep_url = url + ep.find("a")["href"]
        # Get the episode name and viewer numbers from the list page
        episode = ep.text[1:-1]  # removes the surrounding quotation marks
        viewer = ep.parent.find_all("td")[5].text
        print(episode, "    Scraping:")
        print(ep_url)
        # Request the episode page and parse it
        ep_pg = requests.get(ep_url)
        soup = BeautifulSoup(ep_pg.text)
        info = soup.find("table", class_="infobox")
        # Get the runtime and season number from the episode page's infobox
        runtime = info.find_all(text=re.compile('minutes'))
        season = info.find_all(text=re.compile('Season'))
        # Append the episode info to our lists
        eps.append(episode)
        runtimes.append(runtime[0])
        seasons.append(season[0])
        viewers.append(viewer[:-4])  # drops the last 4 characters, i.e. a reference marker such as "[22]"
    except (TypeError, KeyError, AttributeError, IndexError):
        # Raised when a cell has no hyperlink or an expected element is missing
        pass


Winter Is Coming     Scraping:
https://en.wikipedia.org/wiki/Winter_Is_Coming
The Kingsroad     Scraping:
https://en.wikipedia.org/wiki/The_Kingsroad
Lord Snow     Scraping:
https://en.wikipedia.org/wiki/Lord_Snow
Cripples, Bastards, and Broken Things     Scraping:
https://en.wikipedia.org/wiki/Cripples,_Bastards,_and_Broken_Things
The Wolf and the Lion     Scraping:
https://en.wikipedia.org/wiki/The_Wolf_and_the_Lion
A Golden Crown     Scraping:
https://en.wikipedia.org/wiki/A_Golden_Crown
You Win or You Die     Scraping:
https://en.wikipedia.org/wiki/You_Win_or_You_Die
The Pointy End     Scraping:
https://en.wikipedia.org/wiki/The_Pointy_End
Baelor     Scraping:
https://en.wikipedia.org/wiki/Baelor
Fire and Blood     Scraping:
https://en.wikipedia.org/wiki/Fire_and_Blood_(Game_of_Thrones)
The North Remembers     Scraping:
https://en.wikipedia.org/wiki/The_North_Remembers
The Night Lands     Scraping:
https://en.wikipedia.org/wiki/The_Night_Lands
What Is Dead May Never Die     Scrapin

In [215]:
url = "https://en.wikipedia.org/wiki/The_Wolf_and_the_Lion"
ex = requests.get(url)
soup = BeautifulSoup(ex.text)
info = soup.find("table", class_ = "infobox")
info.find_all(text = re.compile('Season'))

['Season\xa01']

In [234]:
url = "https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes"
ex = requests.get(url)
soup = BeautifulSoup(ex.text)
soup.findAll("tr", class_ = "vevent")[1].findAll("td")[5].text

'2.20[22]'

In [243]:
ep.parent.findAll("td")[5].text

[<td class="summary" style="text-align:left"><i>Game of Thrones: The Last Watch</i></td>,
 <td style="text-align:center">May 26, 2019<span style="display:none"> (<span class="bday dtstart published updated">2019-05-26</span>)</span></td>,
 <td style="text-align:center">1.63<sup class="reference" id="cite_ref-96"><a href="#cite_note-96">[96]</a></sup></td>]

In [253]:
df = pd.DataFrame({"Episode": eps,
                  "Runtime": runtimes,
                  "Season": seasons,
                  "U.S. Viewers (Millions)": viewers})
df


Unnamed: 0,Episode,Runtime,Season,U.S. Viewers (Millions)
0,Winter Is Coming,62 minutes,Season 1,2.22
1,The Kingsroad,56 minutes,Season 1,2.20
2,Lord Snow,58 minutes,Season 1,2.44
3,"Cripples, Bastards, and Broken Things",56 minutes,Season 1,2.45
4,The Wolf and the Lion,55 minutes,Season 1,2.58
...,...,...,...,...
68,A Knight of the Seven Kingdoms,58 minutes,Season 8,10.29
69,The Long Night,82 minutes,Season 8,12.02
70,The Last of the Starks,78 minutes,Season 8,11.80
71,The Bells,78 minutes,Season 8,12.48
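
The scraped values are still strings. A possible next step, sketched below, converts them to numbers so they can be sorted and averaged. The three parser names are hypothetical, and the sample inputs are strings that appear in the outputs above (`'2.20[22]'`, `'62 minutes'`, `'Season\xa01'`).

```python
import re


def parse_viewers(text):
    """Convert a viewers string such as '2.20[22]' to a float (millions)."""
    return float(re.sub(r"\[\d+\]", "", text).strip())  # drop reference markers like [22]


def parse_runtime(text):
    """Convert a runtime string such as '62 minutes' to an integer number of minutes."""
    return int(re.search(r"(\d+)\s*minutes", text).group(1))


def parse_season(text):
    """Convert a season string such as 'Season\\xa01' to an integer.

    \\s also matches the non-breaking space (\\xa0) that Wikipedia uses here.
    """
    return int(re.search(r"Season\s*(\d+)", text).group(1))


print(parse_viewers("2.20[22]"))
print(parse_runtime("62 minutes"))
print(parse_season("Season\xa01"))
```

Unlike slicing off the last four characters, the regular expression in `parse_viewers` does not depend on how many digits the reference number has.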
