# Web Scraping


## Behind every website

* Go to [http://shakespeare.mit.edu/Poetry/sonnets.html](http://shakespeare.mit.edu/Poetry/sonnets.html)
* In Chrome, open the menu => More Tools => Developer Tools

![behind-every-web.png](./images/behind-every-web.png)

# HTML
![word-to-html.png](./images/word-to-html.png)

In [None]:
# TODO: Open this page: https://wordtohtml.net/
## Try to write a paragraph
## (1) What happens when we start a new paragraph?
## (2) Bold? Italic?
## (3) Bullet points?
## (4) Hyperlinks?

## HTML Tags

![html-tags.png](./images/html-tags.png)

## HTML attributes
Attributes are included in the opening tag and provide further details about the element.

* `href` —> the URL for a link
* `class` —> CSS class(es) for an element (see below)
* `id` —> a unique id for an element
* `src` —> the URL for an image
* `style` —> CSS properties for an element (see below)
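
To see these attributes from the Python side, here is a minimal sketch using the standard-library `html.parser` module. The HTML snippet, its URLs, and the `AttributeCollector` class name are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical snippet: one link and one image, each carrying attributes.
SAMPLE_HTML = (
    '<a href="https://example.com" class="external" id="home-link">Home</a>'
    '<img src="logo.png" style="width: 40px">'
)


class AttributeCollector(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # `attrs` arrives as a list of (name, value) tuples.
        self.tags.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.feed(SAMPLE_HTML)
for tag, attributes in collector.tags:
    print(tag, attributes)
```

Each opening tag is reported with a dictionary of its attributes, which is the same key-value view that scraping libraries expose.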


## CSS
* HTML defines the structure and content
* CSS provides the styles

![style-css.png](./images/style-css.png)

## Scraping with Python

## `Requests` the HTML source code

In [6]:
import requests
# Send a GET request to a regular URL (not an API this time)
response = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
response.status_code ## HTTP status code (200 means OK)

200
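
A status of 200 means the request succeeded; codes such as 404 or 500 mean the body is not the page you asked for. The sketch below wraps the request in a small helper that fails loudly in that case. The `fetch_html` name and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at `url`, raising if the request fails.

    `fetch_html` and the 10-second default are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

It could be called as `html = fetch_html("https://en.wikipedia.org/wiki/Python_(programming_language)")`.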

In [8]:
response.text[:500]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Python (programming language) - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"c0d9a525-c939-4094-9b8f-b6'

In [9]:
## Write the content to an HTML file (the output/ folder must already exist)
with open('output/wikipedia.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

In [10]:
!ls output 

paris_iss.json wikipedia.html


In [None]:
#TODO: Open the wikipedia.html file. Install an HTML preview extension in VSCode to preview it

## Parse HTML with `BeautifulSoup`
* Without BeautifulSoup, the raw HTML looks like a jungle.
* How can we extract what we need when we know the tag and its attributes (key-value pairs)?
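
As a small illustration of what parsing buys you, here is a sketch on a made-up snippet that is short enough to read by eye. The HTML and variable names are invented for the example; `find`, `find_all`, and `get_text` are standard BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Made-up HTML, small enough to check by hand.
html = """
<html>
  <body>
    <h1 id="title">Sonnets</h1>
    <p class="intro">A short <b>example</b> page.</p>
    <a href="sonnet.I.html">Sonnet I</a>
    <a href="sonnet.II.html">Sonnet II</a>
  </body>
</html>
"""

# "html.parser" is Python's built-in parser, so nothing extra to install.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")    # first matching tag
print(heading.get_text())    # text inside the tag
print(heading["id"])         # attribute value, dictionary-style

hrefs = [a["href"] for a in soup.find_all("a")]  # every <a> tag
print(hrefs)
```

Instead of searching a long string by hand, you ask for a tag by name and read its text or attributes directly.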

In [11]:
from bs4 import BeautifulSoup

In [27]:
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning

## `Tag` methods

In [None]:
#TODO: Use Ctrl+F to find the <h1> tag in the wikipedia.html file

In [1]:
title = soup.find('h1')  # needs the `soup` object created above, so run that cell first
str(title)

In [20]:
type(title)

bs4.element.Tag

In [21]:
title.contents 

['Python (programming language)']

In [23]:
# Find all hyperlinks (tag: a)
links = soup.find_all('a')
links[:5]

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Good_articles" title="This is a good article. Click here for more information."><img alt="This is a good article. Click here for more information." data-file-height="185" data-file-width="180" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/19px-Symbol_support_vote.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/29px-Symbol_support_vote.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/39px-Symbol_support_vote.svg.png 2x" width="19"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-redirect mw-disambig" href="/wiki/Python_(disambiguation)" title="Python (disambiguation)">Python (disambiguation)</a>]

In [24]:
# You can also pass attributes to your filter
mw_jump_links = soup.find_all('a', {'class': 'mw-jump-link'})
print(mw_jump_links)

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>, <a class="mw-jump-link" href="#searchInput">Jump to search</a>]


In [25]:
# You can even filter with a list of tags
h2_and_h3 = soup.find_all(['h2', 'h3'])
print(h2_and_h3[:5])

[<h2 id="mw-toc-heading">Contents</h2>, <h2><span class="mw-headline" id="History">History</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Python_(programming_language)&amp;action=edit&amp;section=1" title="Edit section: History">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><span class="mw-headline" id="Design_philosophy_and_features">Design philosophy and features</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Python_(programming_language)&amp;action=edit&amp;section=2" title="Edit section: Design philosophy and features">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><span class="mw-headline" id="Syntax_and_semantics">Syntax and semantics</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Python_(programming_language)&amp;action=edit&amp;section=3" title="Edit

In [26]:
# To find all the tags that have an `id`:
has_id = soup.find_all(id=True)
print(has_id[4])

<div id="siteNotice"><!-- CentralNotice --></div>


## `Tag` Attributes

In [30]:
sample_link = mw_jump_links[0]
print(sample_link)
print(sample_link['href'])  # Access a tag attribute with the dictionary interface
print(sample_link.contents)  # Access the list of contents in a tag.

<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
#mw-head
['Jump to navigation']


In [33]:
links[:5]

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Good_articles" title="This is a good article. Click here for more information."><img alt="This is a good article. Click here for more information." data-file-height="185" data-file-width="180" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/19px-Symbol_support_vote.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/29px-Symbol_support_vote.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/94/Symbol_support_vote.svg/39px-Symbol_support_vote.svg.png 2x" width="19"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-redirect mw-disambig" href="/wiki/Python_(disambiguation)" title="Python (disambiguation)">Python (disambiguation)</a>]

In [31]:
targets = [link.get('href') for link in links]
print(targets[:10])

[None, '/wiki/Wikipedia:Good_articles', '#mw-head', '#searchInput', '/wiki/Python_(disambiguation)', '/wiki/File:Python_logo_and_wordmark.svg', '/wiki/Programming_paradigm', '/wiki/Multi-paradigm_programming_language', '/wiki/Object-oriented_programming', '#cite_note-1']
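
The first entry is `None` because `<a id="top">` has no `href`, and `.get('href')` returns `None` instead of raising an error. The other entries are either fragments (`#...`, a jump within the same page) or paths to other pages. A minimal sketch for separating them follows; the `classify_href` helper is hypothetical and the sample values are copied from the output above.

```python
# A few values copied from the output above.
targets = [
    None,
    "/wiki/Wikipedia:Good_articles",
    "#mw-head",
    "#searchInput",
    "/wiki/Python_(disambiguation)",
    "#cite_note-1",
]


def classify_href(href):
    """Label an href as 'missing', 'fragment', or 'page' (hypothetical helper)."""
    if href is None:
        return "missing"      # the <a> tag had no href attribute
    if href.startswith("#"):
        return "fragment"     # jumps to a spot on the same page
    return "page"             # points to another page


page_links = [href for href in targets if classify_href(href) == "page"]
print(page_links)
```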


## Exercise: The Sonnets

In [6]:
# TODO: Go to the poetry website and inspect its HTML structure with Developer Tools
from bs4 import BeautifulSoup
import requests
shakespear_path = "http://shakespeare.mit.edu/Poetry/"
response = requests.get("http://shakespeare.mit.edu/Poetry/sonnets.html") # Use requests to pull the HTML data
response.status_code ## Status

200

In [4]:
response.text[:500] ## Quick check of the response content

'<HTML>\n<HEAD>\n<TITLE>\nThe Sonnets\n</TITLE>\n</HEAD>\n<BODY>\n<H1>The Sonnets</H1>\n\n<p>You can buy the Arden text of these sonnets from the Amazon.com online bookstore: <a href="http://www.amazon.com/gp/product/1903436575?ie=UTF8&tag=theinteclasar-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1903436575">Shakespeare\'s Sonnets (Arden Shakespeare: Third Series)</a><img src="http://www.assoc-amazon.com/e/ir?t=theinteclasar-20&l=as2&o=1&a=1903436575" width="1" height="1" border="0" alt="" style="'

In [8]:
poem_soup = BeautifulSoup(response.text, 'html.parser') ## Parse the response into a BeautifulSoup object

In [10]:
poem_links = poem_soup.find_all('a') ## Find all hyperlinks
poem_links[:5] ## Check the first 5

[<a href="http://www.amazon.com/gp/product/1903436575?ie=UTF8&amp;tag=theinteclasar-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1903436575">Shakespeare's Sonnets (Arden Shakespeare: Third Series)</a>,
 <a href="sonnet.I.html">I. FROM fairest creatures we desire increase,</a>,
 <a href="sonnet.II.html">II. When forty winters shall beseige thy brow,</a>,
 <a href="sonnet.III.html">III. Look in thy glass, and tell the face thou viewest</a>,
 <a href="sonnet.IV.html">IV. Unthrifty loveliness, why dost thou spend</a>]

In [11]:
[l.contents for l in poem_links][:5] ## Extract the contents of each link

[["Shakespeare's Sonnets (Arden Shakespeare: Third Series)"],
 ['I. FROM fairest creatures we desire increase,'],
 ['II. When forty winters shall beseige thy brow,'],
 ['III. Look in thy glass, and tell the face thou viewest'],
 ['IV. Unthrifty loveliness, why dost thou spend']]

In [None]:
#TODO: Check the code below
## 1. Take the href part (with BeautifulSoup)
## 2. Append the href to shakespear_path, fetch the page with requests.get, and print it out
## NOW, modify that code to
## 1. Take the last 5 poems in the list
## 2. Print out their contents

In [12]:
# Build the URL of the first sonnet from the base path
response = requests.get(shakespear_path + 'sonnet.I.html')
response.status_code ## Status

200
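
String concatenation works here because the base path ends with `/` and the `href` is a bare file name. For links that are already absolute (such as the Amazon link at the top of the index page), concatenation would build a broken URL. A minimal sketch with the standard-library `urllib.parse.urljoin`, which handles both cases; the `index_url` variable name is an assumption for the example.

```python
from urllib.parse import urljoin

# The index page the links were scraped from.
index_url = "http://shakespeare.mit.edu/Poetry/sonnets.html"

# A relative href, as found in the sonnet <a> tags, is resolved against the page.
print(urljoin(index_url, "sonnet.I.html"))

# An absolute href is returned unchanged.
print(urljoin(index_url, "http://www.amazon.com/gp/product/1903436575"))
```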

In [13]:
print(response.text)

<HTML><HEAD><TITLE>Sonnet I</TITLE></HEAD>
<BODY><H1>Sonnet I</H1>

<BLOCKQUOTE>FROM fairest creatures we desire increase,<BR>
That thereby beauty's rose might never die,<BR>
But as the riper should by time decease,<BR>
His tender heir might bear his memory:<BR>
But thou, contracted to thine own bright eyes,<BR>
Feed'st thy light'st flame with self-substantial fuel,<BR>
Making a famine where abundance lies,<BR>
Thyself thy foe, to thy sweet self too cruel.<BR>
Thou that art now the world's fresh ornament<BR>
And only herald to the gaudy spring,<BR>
Within thine own bud buriest thy content<BR>
And, tender churl, makest waste in niggarding.<BR>
  Pity the world, or else this glutton be,<BR>
  To eat the world's due, by the grave and thee.<BR>
</BLOCKQUOTE>

</BODY></HTML>
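
Printing `response.text` shows the raw markup. To keep only the poem, the page can be parsed and the `<BLOCKQUOTE>` text extracted. The sketch below works on a shortened copy of the page above so it runs without a network request; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The first two lines of the page printed above, shortened for the example.
sonnet_html = (
    "<HTML><HEAD><TITLE>Sonnet I</TITLE></HEAD>\n"
    "<BODY><H1>Sonnet I</H1>\n\n"
    "<BLOCKQUOTE>FROM fairest creatures we desire increase,<BR>\n"
    "That thereby beauty's rose might never die,<BR>\n"
    "</BLOCKQUOTE>\n\n"
    "</BODY></HTML>"
)

sonnet_soup = BeautifulSoup(sonnet_html, "html.parser")

title = sonnet_soup.find("h1").get_text()
# get_text() drops the <BR> tags; the newlines in the source separate the lines.
raw_text = sonnet_soup.find("blockquote").get_text()
lines = [line.strip() for line in raw_text.splitlines() if line.strip()]

print(title)
for line in lines:
    print(line)
```

With a live page, `sonnet_html` would be replaced by `response.text`.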



In [22]:
poem_links = poem_soup.find_all('a')
data = poem_links[150:]  # from index 150 to the end of the list
for link in data:
    url = link["href"]  # relative URL of the sonnet page
    response = requests.get(shakespear_path + url)
    print(response.text)


<HTML><HEAD><TITLE>Sonnet CL</TITLE></HEAD>
<BODY><H1>Sonnet CL</H1>

<BLOCKQUOTE>O, from what power hast thou this powerful might<BR>
With insufficiency my heart to sway?<BR>
To make me give the lie to my true sight,<BR>
And swear that brightness doth not grace the day?<BR>
Whence hast thou this becoming of things ill,<BR>
That in the very refuse of thy deeds<BR>
There is such strength and warrantize of skill<BR>
That, in my mind, thy worst all best exceeds?<BR>
Who taught thee how to make me love thee more<BR>
The more I hear and see just cause of hate?<BR>
O, though I love what others do abhor,<BR>
With others thou shouldst not abhor my state:<BR>
  If thy unworthiness raised love in me,<BR>
  More worthy I to be beloved of thee.<BR>
</BLOCKQUOTE>

</BODY></HTML>

<HTML><HEAD><TITLE>Sonnet CLI</TITLE></HEAD>
<BODY><H1>Sonnet CLI</H1>

<BLOCKQUOTE>Love is too young to know what conscience is;<BR>
Yet who knows not conscience is born of love?<BR>
Then, gentle cheater, urge not my amis