## Web Scraping
bs4 → the import name of the Beautiful Soup 4 package (installed from PyPI as beautifulsoup4), a popular library for parsing HTML and XML documents.

BeautifulSoup → the main class provided by the library.

Writing from bs4 import BeautifulSoup imports that class, so you can create a BeautifulSoup object and work with HTML/XML easily.

In [1]:
from bs4 import BeautifulSoup 

1. import requests → loads the requests library, although this cell never uses it.
2. open(...).read() → opens your local file (scrap-example.html) and reads its content into a variable called htmlPage.
3. htmlPage is just a string that contains all the HTML code from that file.
4. The requests import is only needed if you want to fetch a webpage over the network.

In [24]:
import requests
htmlPage=open("C:/Users/user/Desktop/scrap-example.html").read()
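
The file-reading step above can also be written with a context manager, so the file is closed as soon as it has been read. This is a minimal sketch: the helper name read_html, the temporary file sample.html and the UTF-8 encoding are assumptions made for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def read_html(path):
    """Return the contents of a local HTML file as a string."""
    # The context manager closes the file even if reading fails.
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demo with a temporary file so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"  # hypothetical file name
    sample.write_text("<html><body><p>hello</p></body></html>", encoding="utf-8")
    html_page = read_html(sample)

print(type(html_page).__name__)
print(html_page)
```

For a live page, the string would typically come from requests.get(url).text instead of a local file. The rest of the workflow stays the same.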

1. BeautifulSoup(htmlPage, "html.parser") → takes your HTML text and parses it into a BeautifulSoup object.
2. That object (soup) represents the entire HTML document as a structured parse tree.
3. You can now search, navigate, and modify elements in the HTML using methods like .find(), .find_all(), .title, .p, etc.

In [7]:
soup=BeautifulSoup(htmlPage, "html.parser")
type(soup)

bs4.BeautifulSoup
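
To try the parsing step without the local file, an HTML string can be parsed directly. The markup below is a hypothetical stand-in, loosely modelled on the outputs shown in this notebook. It is not the actual scrap-example.html.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for scrap-example.html (not the real file).
html_text = """
<html>
  <head><title>Mad Monk Ji Gong</title></head>
  <body>
    <h1>Li Visits Buddha</h1>
    <p>This happened during the Song Dynasty.</p>
  </body>
</html>
"""

# "html.parser" is the parser that ships with the Python standard library.
soup = BeautifulSoup(html_text, "html.parser")

print(type(soup).__name__)  # class name of the parsed document object
print(soup.title.string)    # text inside <title>
print(soup.h1.string)       # text inside the first <h1>
```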

1. soup.html → Finds the "html" tag in your parsed HTML file and stores it in HTML.

2. HTML.head → From inside the "html" tag, it extracts the "head" section and stores it in HEAD.

3. print(HEAD) → Prints the entire "head" section of your HTML document (including "title", "meta", "link", "style", etc., depending on what’s inside your HTML file).

In [8]:
HTML = soup.html
HEAD = HTML.head
print(HEAD)

<head>
<title>Mad Monk Ji Gong</title>
</head>


In [10]:
type(HEAD)

bs4.element.Tag

1. soup.html.head gives you the entire "head" section of your HTML document.

In [11]:
head = soup.html.head
print(head)

<head>
<title>Mad Monk Ji Gong</title>
</head>


In [12]:
print(soup.head)

<head>
<title>Mad Monk Ji Gong</title>
</head>


In [13]:
print(soup.title)

<title>Mad Monk Ji Gong</title>


1. If there is more than one element with this tag, this approach returns only the first one.

In [14]:
print(soup.p)

<p>This happened
	  during <a href="https://en.wikipedia.org/wiki/Song_dynasty">Song Dynasty</a>.
	  The patchwork robe made for <strong>Guang Liang</strong>...</p>


1. soup.find(tag, ...) → finds the first occurrence of that HTML tag.

2. class_="quote" → filters by class name "quote".
 (We use class_ instead of class because class is a reserved keyword in Python.)

In [15]:
print(soup.find("p", class_="quote"))  # more specific: matches only a <p> with class "quote"

<p class="quote">When I was strolling in the street,
	  <footnote>They lived in Linan</footnote> almost
	  everyone was calling me
	  <span class="nickname">Virtuous Li</span> ...</p>


In [16]:
print(soup.find(class_="nickname"))  # more general: matches any tag with class "nickname"

<span class="nickname">Virtuous Li</span>


In [17]:
Ps = soup.find_all("p")  # find_all returns every element that matches the query, as a list
print(Ps)

[<p>This happened
	  during <a href="https://en.wikipedia.org/wiki/Song_dynasty">Song Dynasty</a>.
	  The patchwork robe made for <strong>Guang Liang</strong>...</p>, <p class="quote">When I was strolling in the street,
	  <footnote>They lived in Linan</footnote> almost
	  everyone was calling me
	  <span class="nickname">Virtuous Li</span> ...</p>]


In [18]:
type(Ps[0])    

bs4.element.Tag
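
The difference between find and find_all matters most when nothing matches: find returns None, while find_all returns an empty list. The sketch below uses a small made-up HTML string (an assumption, not the notebook's file) to show both cases.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html_text = """
<body>
  <p>First paragraph</p>
  <p class="quote">Second paragraph with a <span class="nickname">nickname</span></p>
</body>
"""
soup = BeautifulSoup(html_text, "html.parser")

first_p = soup.find("p")                      # first match only, same as soup.p
quotes = soup.find_all("p", class_="quote")   # every <p> with class "quote"
missing = soup.find("table")                  # no <table> here, so this is None
no_tables = soup.find_all("table")            # no match gives an empty list

# Guard against None before using a find() result.
table_text = missing.get_text() if missing is not None else ""

print(first_p.get_text())
print(len(quotes))
print(missing, len(no_tables), repr(table_text))
```

Checking for None before chaining attribute access avoids an AttributeError on pages where an expected element is absent.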

## Extracting content
The extracted elements, although they print like lines of text from the file, are not strings but Beautiful Soup objects (such as bs4.element.Tag). To get the text inside an element as a plain string, use the .text attribute.

In [19]:
print(Ps[1].text)

When I was strolling in the street,
	  They lived in Linan almost
	  everyone was calling me
	  Virtuous Li ...


In [20]:
soup.a["href"]   # HTML attributes (such as "class" or "href") can be read with square brackets, as with a dict

'https://en.wikipedia.org/wiki/Song_dynasty'

In [21]:
Ps[1]["class"]    # the class of the second paragraph

['quote']

We can also get the tag name explicitly with the name attribute. Inverting the previous example, let’s find the first element of class “quote” and print its tag name.

In [23]:
soup.find(class_='quote').name

'p'
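
The extraction techniques above can be combined in one short sketch. The HTML string is made up for illustration. get_text with a separator and strip=True is one way to tidy whitespace, and tag.get(...) is a safer alternative to square brackets when an attribute may be missing.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <a> deliberately has no href attribute.
html_text = (
    '<p class="quote big">See '
    '<a href="/wiki/Song_dynasty">Song Dynasty</a> '
    "<a>no link</a></p>"
)
soup = BeautifulSoup(html_text, "html.parser")
p = soup.p

# get_text() is the method behind .text; the separator joins the text pieces
# and strip=True trims each piece and drops the empty ones.
print(p.get_text(" ", strip=True))

# Multi-valued attributes such as class come back as a list.
print(p["class"])

# tag["href"] raises KeyError if the attribute is missing,
# whereas tag.get("href") returns None instead.
hrefs = [a.get("href") for a in p.find_all("a")]
print(hrefs)
```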

## Moving up and down the tree
Beautiful Soup uses the concepts of children, siblings, descendants, and parents to navigate the HTML tree. Children are elements directly inside a parent element, not nested inside other elements within that parent (those are grandchildren). A parent is the element that directly contains its children.

One can get an iterable of all first-level children of a tag with the children attribute. Text between tags, including line breaks, also counts as a child. Such text nodes have no tag name, so their name is None.

In [25]:
children = soup.body.children
for child in children:
    print(child.name)

None
h1
None
p
None
h2
None
p
None
h2
None
div
None
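
The None entries in this output come from whitespace between the tags. If only element children are wanted, two common options are filtering by type or calling find_all with recursive=False. The sketch below uses a tiny made-up document to compare them.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup with line breaks between the tags.
html_text = "<body>\n<h1>Title</h1>\n<p>Text</p>\n</body>"
soup = BeautifulSoup(html_text, "html.parser")

# Every child, including the whitespace text nodes (name is None).
all_names = [child.name for child in soup.body.children]

# Option 1: keep only Tag objects.
tag_names = [child.name for child in soup.body.children if isinstance(child, Tag)]

# Option 2: find_all(recursive=False) searches direct child tags only.
direct_names = [tag.name for tag in soup.body.find_all(recursive=False)]

print(all_names)
print(tag_names)
print(direct_names)
```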


In [27]:
children = list(soup.body.children)
print(children[0])  # line break after body, before header
print(children[1])



<h1>Li Visits Buddha</h1>


In [28]:
print(children[2])





If you want to list all descendants of an element, i.e. to include grandchildren and deeper levels, you can use descendants instead of children.

In [29]:
children = soup.body.descendants
for child in children:
    print(child.name)

None
h1
None
None
p
None
a
None
None
strong
None
None
None
h2
None
None
p
None
footnote
None
None
span
None
None
None
h2
None
None
div
None
table
None
thead
None
tr
None
th
None
None
th
None
None
None
None
tbody
None
tr
None
td
None
None
td
None
None
None
tr
None
td
None
None
td
None
None
None
None
None
None


Moving horizontally among elements on the same level is done with find_next_sibling, find_next_siblings, find_previous_sibling and find_previous_siblings. Moving back and forth regardless of the nesting level can be done with find_next, find_all_next, find_previous and find_all_previous. The plural “siblings” versions and the “all” versions return every occurrence, while the others return only the first. All these methods accept the same kind of arguments as find, so one can specify tag names and attributes.

In [30]:
# Let us extract the second "p" inside "body", and then move back to the first preceding "h2"
ps = soup.body.find_all("p")  # list of both p-s
p = ps[1]  # 2nd p
h2 = p.find_previous_sibling("h2")  # h2 before the 2nd p
h2.text  # the text inside h2

'Li Begs for a Son'
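
The contrast between the sibling methods and the nesting-agnostic methods is easier to see side by side. The sketch below uses a small made-up document in which a "b" element sits inside a "p", so it is reachable with find_next but not with find_next_sibling.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html_text = (
    "<body>"
    "<h1>Main</h1>"
    "<h2>First</h2><p>one <b>bold</b></p>"
    "<h2>Second</h2><p>two</p>"
    "</body>"
)
soup = BeautifulSoup(html_text, "html.parser")
first_h2 = soup.find("h2")

# Sibling searches stay on the same nesting level.
next_p = first_h2.find_next_sibling("p")        # first following <p> sibling
later_h2s = first_h2.find_next_siblings("h2")   # all following <h2> siblings
prev_h1 = first_h2.find_previous_sibling("h1")  # first preceding <h1> sibling

# find_next ignores nesting, so it can reach the <b> inside the <p>.
next_b = first_h2.find_next("b")
b_as_sibling = first_h2.find_next_sibling("b")  # None: <b> is not a sibling

print(next_p.get_text())
print([h.get_text() for h in later_h2s])
print(prev_h1.get_text())
print(next_b.get_text(), b_as_sibling)
```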

### Removing elements from the tree

In [31]:
p = soup.body.find("p", class_ = "quote").text
print(p)

When I was strolling in the street,
	  They lived in Linan almost
	  everyone was calling me
	  Virtuous Li ...


“They lived in Linan” on the second line belongs to the footnote, while the following “almost” belongs to the main text, but there is no way to tell this from the printout. Sometimes this is what we want, e.g. when we strip text of formatting tags such as “i” and “b” (italic and bold respectively). But here it is clearly undesirable. As a solution, we can remove the “footnote” element with the extract method. However, extract removes the element from the source tree itself, so it can no longer be found through soup afterwards (extract does return the removed element). We may therefore want to work on a copy of the element:

In [32]:
import copy
p = soup.body.find("p", class_ = "quote")  # find the quote paragraph
p = copy.copy(p)  # work on a copy of p; the original stays untouched in soup
footnote = p.footnote.extract()  # find the first footnote there and remove it
print(p.text)  # print the final text

When I was strolling in the street,
	   almost
	  everyone was calling me
	  Virtuous Li ...


In [33]:
p = soup.body.find("p", class_ = "quote")
print(p.text)

When I was strolling in the street,
	  They lived in Linan almost
	  everyone was calling me
	  Virtuous Li ...
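
The copy-then-extract pattern above can be wrapped in a small helper that removes every matching element, not just the first. This is only a sketch: the function name text_without and the sample markup are hypothetical and not part of Beautiful Soup or of the original file.

```python
import copy

from bs4 import BeautifulSoup


def text_without(tag, name):
    """Return tag's text with every descendant element called `name` removed.

    The removal happens on a copy, so the original tree is left untouched.
    """
    clone = copy.copy(tag)
    for unwanted in clone.find_all(name):
        unwanted.extract()  # detach the element from the copied tree
    return clone.get_text()


# Made-up markup for illustration only.
html_text = '<p class="quote">Strolling <footnote>a note</footnote>along the street</p>'
soup = BeautifulSoup(html_text, "html.parser")
quote = soup.find("p", class_="quote")

print(text_without(quote, "footnote"))
print(quote.get_text())  # the original still contains the footnote text
```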


1. urljoin combines a base URL and a relative URL into a full absolute URL.
2. It is defined in urllib.parse, the standard-library module for parsing and building URLs. Importing it from urllib.request also works, but only because that module imports urljoin for its own use.
3. By importing it this way, you can call urljoin(...) directly.
4. base is a string that holds the address of the main website, in this case the root of Wikipedia. Later, any relative link will be joined with this base.
5. link is a relative URL: it contains only the path after the domain, not the full website address.
6. Here /wiki/Taiwan.html refers to a page inside Wikipedia.
7. urljoin(base, link) joins the two parts into the absolute address.

In [45]:
from urllib.parse import urljoin
base = "https://en.wikipedia.org"
link = "/wiki/Taiwan.html"
address = urljoin(base, link)
print(address)

https://en.wikipedia.org/wiki/Taiwan.html
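
Links scraped from a page come in several forms, and urljoin treats each one differently. The sketch below imports urljoin from urllib.parse, where it is defined, and resolves a few made-up links against a page address. The link values are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

# Address of the page on which the links were found.
base = "https://en.wikipedia.org/wiki/Song_dynasty"

examples = [
    "/wiki/Taiwan",               # root-relative: replaces the whole path
    "Hangzhou",                   # relative: replaces the last path segment
    "#History",                   # fragment only: stays on the same page
    "https://example.org/other",  # already absolute: returned unchanged
]

resolved = [urljoin(base, link) for link in examples]
for link, url in zip(examples, resolved):
    print(f"{link!r:32} -> {url}")
```

Using the address of the page itself as the base, rather than only the site root, is what makes plain relative links such as "Hangzhou" resolve correctly.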
