## Web Scraping

Wikipedia:

In web development, "tag soup" is a pejorative term that refers to syntactically or structurally incorrect HTML written for a web page. Because web browsers have historically treated HTML syntax or structural errors leniently, there has been little pressure for web developers to follow published standards, and therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.

[BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
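To make the "tag soup" idea concrete, here is a minimal sketch that feeds deliberately malformed markup to BeautifulSoup using Python's built-in `html.parser`. The snippet in `tag_soup` is made up for illustration, and the exact shape of the repaired tree can differ between parsers, so no particular output is claimed.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed snippet: neither <p> nor <b> is ever closed.
tag_soup = "<p>First paragraph <b>bold text<p>Second paragraph"

soup = BeautifulSoup(tag_soup, "html.parser")

# The parser still builds a navigable tree instead of raising an error.
paragraphs = soup.find_all("p")
print(len(paragraphs))
print(soup.get_text())
print(soup.prettify())
```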

### Warmup

In [13]:
# Taken from the BeautifulSoup website

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [68]:
# prettify

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [15]:
# Get tags or document attributes:
# title
soup.title

<title>The Dormouse's story</title>

In [70]:
# What is the data type?
type(soup.title)

bs4.element.Tag

In [69]:
# dump name of the tag or attribute
soup.title.name

'title'

In [17]:
soup.title.string

"The Dormouse's story"

#### Inspect the HTML string above

```
...
<head>
  <title>
   The Dormouse's story
  </title>
 </head>
 ...
```

Note that the parent of `<title>...</title>` is `<head>...</head>`.

In [18]:
soup.title.parent.name

'head'

#### Get the first instance of a tag

```
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
```

**```<p class="title">```**  

```
   <b>
    The Dormouse's story
   </b>
  </p>
```

In [71]:
# example: first paragraph (p)
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [20]:
soup.p['class']

['title']
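The result above is a list because HTML allows several classes on one tag, so BeautifulSoup treats `class` as a multi-valued attribute. The sketch below contrasts that with a single-valued attribute; the `class="story intro"` value is an assumption added for illustration and is not part of the original document.

```python
from bs4 import BeautifulSoup

# Small snippet adapted from the warmup; "story intro" is a made-up two-class value.
html = ('<p class="story intro">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>')
soup = BeautifulSoup(html, "html.parser")

classes = soup.p["class"]       # multi-valued attribute: a list of strings
link_id = soup.a["id"]          # single-valued attribute: a plain string
missing = soup.a.get("title")   # .get() returns None instead of raising KeyError
all_attrs = soup.a.attrs        # every attribute of the tag as a dict

print(classes, link_id, missing)
print(all_attrs)
```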

In [21]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [72]:
# Get/find all
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [73]:
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
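In practice the tags themselves are rarely the goal; usually we want a value inside them. As a small sketch using a trimmed copy of the warmup document, the example below collects each link's text and `href`, and shows that `find()` returns `None` when nothing matches (the id `link4` is an assumption chosen because it does not exist).

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because `class` is a reserved word in Python.
sisters = soup.find_all("a", class_="sister")
links = {a.get_text(): a["href"] for a in sisters}
print(links)

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id="link4")
print(missing is None)
```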

### Get data from an actual website

**Review: What is HTTP?**

HTTP (or Hypertext Transfer Protocol) is a protocol which allows the fetching of resources, such as HTML or JSON documents. It is the foundation of any data exchange on the Web and it is a client-server protocol, which means requests are initiated by the recipient, usually the Web browser. A complete document is reconstructed from the different sub-documents fetched, for instance text, layout description, images, videos, scripts, and more.

Source: [An Overview of HTTP (mozilla.org)](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview)
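The client-server exchange can be sketched without touching the internet. The example below starts a throwaway server on the local machine using only the standard library, then acts as the client and sends one GET request. The handler name `HelloHandler` and the page it serves are hypothetical, chosen only for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><p>Hello</p></body></html>"
        self.send_response(200)  # status line: 200 OK
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the client initiates the request, the server only responds.
connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/")
response = connection.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
text = response.read().decode("utf-8")
connection.close()

server.shutdown()
server.server_close()

print(status, content_type)
print(text)
```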

**What is an Application Programming Interface (API)?**

An application program interface (API) is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. Additionally, APIs are used when programming graphical user interface (GUI) components.

Source: [Webopedia](https://www.webopedia.com/TERM/A/API.html)

#### Python Requests Module

Requests is a simple HTTP library for Python. The full documentation is found [here](https://requests.readthedocs.io/en/master/), but for this exercise we don't need to go through the whole thing.

To use it, simply import it.

In [90]:
import requests

url = "http://www.ateneo.edu/ls/jgsom/qmit/faculty"
result = requests.get(url)

The call to `requests.get(url)` returns a **requests.models.Response** object, which represents the HTTP response. If the request succeeded, its status code should be **200**.

In [91]:
# be sure to check the status (result.status_code)
result.status_code

200
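Checking the status code by hand is easy to forget. One possible way to bundle the check into a helper is sketched below. The function name `fetch_html` and its `session` parameter are hypothetical; `timeout=` and `raise_for_status()` are part of Requests itself.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the response body as text, or raise if the request failed.

    `session` is an optional object with a requests-style .get() method;
    when omitted, the module-level requests.get() is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                # raises an HTTPError for 4xx/5xx
    return response.text


# Example call (needs network access, so it is left commented out):
# html = fetch_html("http://www.ateneo.edu/ls/jgsom/qmit/faculty")
```

Passing a session object also makes the helper easy to exercise without a network connection.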

HTTP normally returns text which Python can then process as a string. The text body may be found in the Response's attribute **text**.

In [92]:
# get text
result.text

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n\n<!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ -->\n<!--[if lt IE 7]> <html class="no-js ie6 ie" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" version="XHTML+RDFa 1.0" dir="ltr" \n  xmlns:content="http://purl.org/rss/1.0/modules/content/"\n  xmlns:dc="http://purl.org/dc/terms/"\n  xmlns:foaf="http://xmlns.com/foaf/0.1/"\n  xmlns:og="http://ogp.me/ns#"\n  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"\n  xmlns:sioc="http://rdfs.org/sioc/ns#"\n  xmlns:sioct="http://rdfs.org/sioc/types#"\n  xmlns:skos="http://www.w3.org/2004/02/skos/core#"\n  xmlns:xsd="http://www.w3.org/2001/XMLSchema#"\n  xmlns:fb="http://ogp.me/ns/fb#"\n  xmlns:article="http://ogp.me/ns/article#"\n  xmlns:book="http://ogp.me/ns/book#"\n  xmlns:profile="http://ogp.me/ns/profile#"\n  xmlns:video="http://ogp.me/ns/video#"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js ie7 

We already imported BeautifulSoup earlier, so there is no need to import it again. There's no harm in doing so, however.

In [93]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(result.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

In [94]:
soup.find_all("div")

[<div id="skip-link">
 <a href="#main-content-area">Skip to main content area</a>
 </div>, <div class="page" id="page">
 <div class="page-inner" id="page-inner">
 <!-- header-group region: width = grid_width -->
 <div class="header-group-wrapper full-width clearfix" id="header-group-wrapper">
 <div class="header-group region grid12-12" id="header-group">
 <div class="header-group-inner inner clearfix" id="header-group-inner">
 <div class="header-site-info clearfix" id="header-site-info">
 <div class="header-site-info-inner gutter" id="header-site-info-inner">
 <div id="logo">
 <a href="/" title="Ateneo de Manila University Home"><img alt="Ateneo de Manila University Home" src="http://www.ateneo.edu/sites/default/files/masthead1_0.png"/></a>
 </div>
 </div><!-- /header-site-info-inner -->
 </div><!-- /header-site-info -->
 <div class="block block-superfish first odd" id="block-superfish-17">
 <div class="gutter inner clearfix">
 <div class="content clearfix">
 <ul class="sf-menu menu-lo

In [95]:
soup.find_all(id='profile-wrapper')

[<div id="profile-wrapper">
 <div id="profile-img"><img alt="" src="http://www.ateneo.edu/sites/default/files/styles/thumbnail/public/default_images/default_profile.jpg?itok=BGRUpJ4B" typeof="foaf:Image"/></div>
 <div id="profile-body">
 <div id="profile-body-title"><a href="/ls/jgsom/qmit/faculty/agbayani-victor-e">Agbayani, Victor E.</a></div>
 <div id="profile-body-contact">
 	Ateneo de Manila University
 
 	
 		Part-Time Faculty, Quantitative Methods &amp; Information Technology
 
 <br/><a href="mailto:vagbayani@ateneo.edu">vagbayani@ateneo.edu</a><br/></div>
 </div>
 </div>, <div id="profile-wrapper">
 <div id="profile-img"><img alt="" src="http://www.ateneo.edu/sites/default/files/styles/thumbnail/public/default_images/default_profile.jpg?itok=BGRUpJ4B" typeof="foaf:Image"/></div>
 <div id="profile-body">
 <div id="profile-body-title"><a href="/ls/jgsom/qmit/faculty/amurao-marianne-kayle-h">Amurao, Marianne Kayle H.</a></div>
 <div id="profile-body-contact">
 	Ateneo de Manila Un

In [96]:
soup.find_all(id='profile-body-title')

[<div id="profile-body-title"><a href="/ls/jgsom/qmit/faculty/agbayani-victor-e">Agbayani, Victor E.</a></div>,
 <div id="profile-body-title"><a href="/ls/jgsom/qmit/faculty/amurao-marianne-kayle-h">Amurao, Marianne Kayle H.</a></div>,
 <div id="profile-body-title"><a href="/ls/jgsom/qmit/faculty/ang-rodolfo-p">Ang, Rodolfo P.</a></div>,
 <div id="profile-body-title"><a href="/ls/jgsom/qmit/faculty/aquino-rafael-alfonso-h">Aquino, Rafael Alfonso H.</a></div>,
 <div id="profile-body-title"><a href="/ls/jgsom/qmit/faculty/asinas-maria-carmen-vicenta-n">Asinas, Maria Carmen Vicenta N.</a></div>]

In [97]:
# what is the type of the returned value of find_all?
type(soup.find_all(id='profile-body-title'))

bs4.element.ResultSet

In [98]:
[d.string for d in soup.find_all(id='profile-body-title')]

['Agbayani, Victor E.',
 'Amurao, Marianne Kayle H.',
 'Ang, Rodolfo P.',
 'Aquino, Rafael Alfonso H.',
 'Asinas, Maria Carmen Vicenta N.']
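`.string` works here because each title `div` wraps exactly one link. As a hedged sketch of what happens otherwise, the example below compares `.string` with `get_text()` on a tag that has more than one child. Both snippets and the "(hypothetical note)" text are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="simple"><a href="#">Gan, Wilson Q.</a></div>
<div class="mixed"><a href="#">Gan, Wilson Q.</a> <span>(hypothetical note)</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
simple = soup.find("div", class_="simple")
mixed = soup.find("div", class_="mixed")

# One child tag containing one string: .string passes it through.
print(simple.string)

# Several children: .string is ambiguous, so BeautifulSoup returns None.
print(mixed.string)

# get_text() joins every nested string; strip=True trims each piece.
print(mixed.get_text(" ", strip=True))
```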

In [99]:
next=soup.find(title="Go to next page")
next

<a href="/ls/jgsom/qmit/faculty?title_op=contains&amp;title=&amp;page=1" title="Go to next page">next ›</a>

In [100]:
next_url = next["href"]
next_url

'/ls/jgsom/qmit/faculty?title_op=contains&title=&page=1'
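The `href` is relative to the site root, so it has to be combined with the site address before it can be fetched. Plain string concatenation works for this site; the standard library's `urljoin` is a slightly more general alternative, sketched here with the values shown above.

```python
from urllib.parse import urljoin, urlparse, parse_qs

current_page = "http://www.ateneo.edu/ls/jgsom/qmit/faculty"
next_href = "/ls/jgsom/qmit/faculty?title_op=contains&title=&page=1"

# urljoin resolves the href against the page it was found on.
next_page = urljoin(current_page, next_href)
print(next_page)

# The query string can be inspected to see which page comes next.
query = parse_qs(urlparse(next_page).query)
print(query["page"])
```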

In [101]:
ateneo_website = "http://www.ateneo.edu"
result = requests.get(ateneo_website+next_url)
result

<Response [200]>

In [102]:
soup = BeautifulSoup(result.text, 'html.parser')

In [103]:
[d.string for d in soup.find_all(id='profile-body-title')]

['Bornilla, Lisha Mae F.',
 'Cabasag Jr., Francisco A.',
 'Cadeliña, Ian Christian Ver P.',
 'Caluag, Jose Antonio I.',
 'Cham, Lance Samuel S.']

In [104]:
next=soup.find(title="Go to next page")
next

<a href="/ls/jgsom/qmit/faculty?title_op=contains&amp;title=&amp;page=2" title="Go to next page">next ›</a>

In [105]:
next_url = next["href"]
next_url

'/ls/jgsom/qmit/faculty?title_op=contains&title=&page=2'

In [106]:
result = requests.get(ateneo_website+next_url)
result

<Response [200]>

In [107]:
soup = BeautifulSoup(result.text, 'html.parser')

In [108]:
[d.string for d in soup.find_all(id='profile-body-title')]

['Chua Jr., Robin A.',
 'Chua, Magnolia Ann N.',
 'Denzon, Eduardo Ezekiel S.',
 'Divinagracia, Gerald G.',
 'Duque, Marc Wendolf N. ']

In [109]:
next=soup.find(title="Go to next page")
next

<a href="/ls/jgsom/qmit/faculty?title_op=contains&amp;title=&amp;page=3" title="Go to next page">next ›</a>

In [110]:
next_url = next["href"]
next_url

'/ls/jgsom/qmit/faculty?title_op=contains&title=&page=3'

In [111]:
result = requests.get(ateneo_website+next_url)
result

<Response [200]>

In [112]:
soup = BeautifulSoup(result.text, 'html.parser')

In [113]:
[d.string for d in soup.find_all(id='profile-body-title')]

['Encarnacion, Rige Bernadette R.',
 'Filart, Jan Paul G.',
 'Gan, Wilson Q.',
 'Gonzales, Joaquin Emmanuel J.',
 'Gotera, Kristine Mae C.']

In [114]:
next=soup.find(title="Go to next page")
next

<a href="/ls/jgsom/qmit/faculty?title_op=contains&amp;title=&amp;page=4" title="Go to next page">next ›</a>

In [115]:
next_url = next["href"]
next_url

'/ls/jgsom/qmit/faculty?title_op=contains&title=&page=4'

In [116]:
result = requests.get(ateneo_website+next_url)
result

<Response [200]>

In [117]:
soup = BeautifulSoup(result.text, 'html.parser')

In [118]:
[d.string for d in soup.find_all(id='profile-body-title')]

['Hernandez, Gio Mari S.',
 'Ilagan, Joseph Benjamin R.',
 'Ingco III, Julio S.',
 'Iyog, Jonathan Marel M.',
 'Jimenez, Edward G.']

In [119]:
# this page still has a "next" link; find() returns None only once the last page is reached
next=soup.find(title="Go to next page")
print(next)

<a href="/ls/jgsom/qmit/faculty?title_op=contains&amp;title=&amp;page=5" title="Go to next page">next ›</a>


### Exercise

Convert the steps above into a loop that keeps fetching pages until no "Go to next page" link can be found.

In [120]:
from bs4 import BeautifulSoup
import requests

ateneo_website = "http://www.ateneo.edu"
result = requests.get(ateneo_website+"/ls/jgsom/qmit/faculty")
result.status_code

200

In [122]:
soup = BeautifulSoup(result.text, 'html.parser')
[s.text for s in soup.find_all(id='profile-body-contact')]

['\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\n\r\nvagbayani@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\xa0\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods and Information Technology...mamurao@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\xa0\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods and Information Technology...rang@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\n\r\nrhaquino@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\n\r\nmasinas@atenoe.edu']

In [123]:
faculty_name = []
faculty_contact = []
result = requests.get(ateneo_website + "/ls/jgsom/qmit/faculty")
if result.status_code == 200:
    soup = BeautifulSoup(result.text, 'html.parser')
    faculty_name += [d.string for d in soup.find_all(id='profile-body-title')]
    faculty_contact += [s.text for s in soup.find_all(id='profile-body-contact')]
    # named next_link rather than next to avoid shadowing the built-in next()
    next_link = soup.find(title="Go to next page")
    # find() returns None once a page has no "Go to next page" link
    while next_link is not None:
        print("Fetching...")
        result = requests.get(ateneo_website + next_link["href"])
        if result.status_code != 200:
            break  # stop rather than parse an error page
        soup = BeautifulSoup(result.text, 'html.parser')
        faculty_name += [d.string for d in soup.find_all(id='profile-body-title')]
        faculty_contact += [s.text for s in soup.find_all(id='profile-body-contact')]
        next_link = soup.find(title="Go to next page")

faculty_name

Fetching...
Fetching...
Fetching...
Fetching...
Fetching...
Fetching...
Fetching...
Fetching...
Fetching...


['Agbayani, Victor E.',
 'Amurao, Marianne Kayle H.',
 'Ang, Rodolfo P.',
 'Aquino, Rafael Alfonso H.',
 'Asinas, Maria Carmen Vicenta N.',
 'Bornilla, Lisha Mae F.',
 'Cabasag Jr., Francisco A.',
 'Cadeliña, Ian Christian Ver P.',
 'Caluag, Jose Antonio I.',
 'Cham, Lance Samuel S.',
 'Chua Jr., Robin A.',
 'Chua, Magnolia Ann N.',
 'Denzon, Eduardo Ezekiel S.',
 'Divinagracia, Gerald G.',
 'Duque, Marc Wendolf N. ',
 'Encarnacion, Rige Bernadette R.',
 'Filart, Jan Paul G.',
 'Gan, Wilson Q.',
 'Gonzales, Joaquin Emmanuel J.',
 'Gotera, Kristine Mae C.',
 'Hernandez, Gio Mari S.',
 'Ilagan, Joseph Benjamin R.',
 'Ingco III, Julio S.',
 'Iyog, Jonathan Marel M.',
 'Jimenez, Edward G.',
 'Le Baut, Thierry',
 'Mariano, Annalisa Margarita V.',
 'Olpoc, Joselito C.',
 'Ong, Erwin Donavan S.',
 'Ongcangco, Kristine Claire E.',
 'Plaza, Eddie Francis Cesar B.',
 'Rasco, Joseph P.',
 'Reventar III, Vicente P.',
 'Rodriguez, Raul P.',
 'Rueda IV, Jose R.',
 'Ruiz, Mari-Jo P.',
 'Sin, Immanuel
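The same pagination idea can also be written as a generator, which separates "walk the pages" from "extract the data". The sketch below is one possible arrangement, not the only one. The names `iter_pages`, `fetch`, `max_pages`, and the two-page `FAKE_SITE` dictionary are all hypothetical; the fake site stands in for the network so the example runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_pages(start_url, fetch, max_pages=50):
    """Yield one parsed page at a time, following "Go to next page" links.

    `fetch` is any callable that takes a URL and returns the page's HTML.
    """
    url = start_url
    for _ in range(max_pages):  # hard cap guards against endless loops
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield soup
        next_link = soup.find(title="Go to next page")
        if next_link is None:   # last page reached
            return
        url = urljoin(url, next_link["href"])


# Stand-in for the network: two tiny made-up pages keyed by URL.
FAKE_SITE = {
    "http://example.com/faculty":
        '<div id="profile-body-title"><a>Name A</a></div>'
        '<a href="/faculty?page=1" title="Go to next page">next</a>',
    "http://example.com/faculty?page=1":
        '<div id="profile-body-title"><a>Name B</a></div>',
}

names = []
for page in iter_pages("http://example.com/faculty", FAKE_SITE.__getitem__):
    names += [d.get_text(strip=True) for d in page.find_all(id="profile-body-title")]
print(names)
```

Against the real site, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, ideally with a short pause between requests to avoid overloading the server.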

In [124]:
faculty_contact

['\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\n\r\nvagbayani@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\xa0\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods and Information Technology...mamurao@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\xa0\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods and Information Technology...rang@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\n\r\nrhaquino@ateneo.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\n\r\nmasinas@atenoe.edu',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\n\r\n',
 '\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\tPart-Time Faculty, Quantitative Methods & Information Technology\r\

In [125]:
import pandas as pd

df = pd.DataFrame({"name":faculty_name,"contact":faculty_contact})

In [126]:
df

| | name | contact |
|---|---|---|
| 0 | Agbayani, Victor E. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |
| 1 | Amurao, Marianne Kayle H. | `\r\n\tAteneo de Manila University\r\n\r\n\t \r...` |
| 2 | Ang, Rodolfo P. | `\r\n\tAteneo de Manila University\r\n\r\n\t \r...` |
| 3 | Aquino, Rafael Alfonso H. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |
| 4 | Asinas, Maria Carmen Vicenta N. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |
| 5 | Bornilla, Lisha Mae F. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |
| 6 | Cabasag Jr., Francisco A. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |
| 7 | Cadeliña, Ian Christian Ver P. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |
| 8 | Caluag, Jose Antonio I. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |
| 9 | Cham, Lance Samuel S. | `\r\n\tAteneo de Manila University\r\n\r\n\t\r\...` |


In [127]:
df.loc[:,"contact"]=df.loc[:,"contact"].replace(r'\r\n','',regex=True).replace(r'\t',' ',regex=True)

In [128]:
df

| | name | contact |
|---|---|---|
| 0 | Agbayani, Victor E. | Ateneo de Manila University Part-Time Facul... |
| 1 | Amurao, Marianne Kayle H. | Ateneo de Manila University Part-Time Fac... |
| 2 | Ang, Rodolfo P. | Ateneo de Manila University Part-Time Fac... |
| 3 | Aquino, Rafael Alfonso H. | Ateneo de Manila University Part-Time Facul... |
| 4 | Asinas, Maria Carmen Vicenta N. | Ateneo de Manila University Part-Time Facul... |
| 5 | Bornilla, Lisha Mae F. | Ateneo de Manila University Part-Time Facul... |
| 6 | Cabasag Jr., Francisco A. | Ateneo de Manila University Part-Time Facul... |
| 7 | Cadeliña, Ian Christian Ver P. | Ateneo de Manila University Part-Time Facul... |
| 8 | Caluag, Jose Antonio I. | Ateneo de Manila University Part-Time Facul... |
| 9 | Cham, Lance Samuel S. | Ateneo de Manila University Part-Time Facul... |
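Removing `\r\n` outright can glue neighbouring words together, since the line breaks were the only separators between some fields. A plain-Python alternative, sketched below, collapses every run of whitespace into a single space instead. The helper name `normalize_whitespace` is hypothetical; the sample string is the first contact entry shown earlier.

```python
raw_contact = (
    "\r\n\tAteneo de Manila University\r\n\r\n\t\r\n\t\t"
    "Part-Time Faculty, Quantitative Methods & Information Technology"
    "\r\n\r\nvagbayani@ateneo.edu"
)


def normalize_whitespace(text):
    """Collapse runs of whitespace (spaces, tabs, newlines, \\xa0) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


clean_contact = normalize_whitespace(raw_contact)
print(clean_contact)
```

With pandas, the same helper could be applied to a whole column with `df["contact"].map(normalize_whitespace)`.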


In [129]:
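# Rough first pass, still to be cleaned up: this pattern only matches addresses
# that are preceded by "..", so most rows come back empty and the matches
# keep a leading "."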

df.loc[:,"contact"].str.extract(r'\.\.([a-zA-Z\._]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')

| | 0 |
|---|---|
| 0 | |
| 1 | .mamurao@ateneo.edu |
| 2 | .rang@ateneo.edu |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
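Pulling addresses out of flattened text with a regular expression is fragile here, because the email runs straight into the preceding words. The page markup shown earlier wraps each address in a `mailto:` link, so one alternative is to read the address from the link itself. The sketch below assumes that structure; the function name `extract_email` is hypothetical, and the second `div` is a made-up entry with no email, mirroring the contact entries above that lack one.

```python
from bs4 import BeautifulSoup

html = """
<div id="profile-body-contact">
    Ateneo de Manila University
    Part-Time Faculty, Quantitative Methods &amp; Information Technology
<br/><a href="mailto:vagbayani@ateneo.edu">vagbayani@ateneo.edu</a><br/></div>
<div id="profile-body-contact">
    Ateneo de Manila University
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_email(contact_div):
    """Return the address from the first mailto: link, or None if there is none."""
    link = contact_div.find("a", href=lambda h: h and h.startswith("mailto:"))
    if link is None:
        return None
    return link["href"][len("mailto:"):]


emails = [extract_email(div) for div in soup.find_all(id="profile-body-contact")]
print(emails)
```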
