# Basic Web Scraper

In [None]:
project_url = "https://raw.githubusercontent.com/PedroFerreiraBento/Python-Projects/main/2-web-scraping-projects/2.1-beautiful-soap"

## 1 - Simple request
Request data from a URL and print the HTML received in the response.

Note: The HTML file requested here belongs to this project, but the request is still made over the Web, because the file is served as raw data from GitHub.

In [None]:
from urllib.request import urlopen

html = urlopen(f"{project_url}/static/example-1.html")
print(html.read())

Note: HTML files can contain line breaks between tags. These line breaks are kept as text and can make the navigation tree harder to read.

## 2 - BeautifulSoup

### 2.1 - Installation

In [None]:
%pip install --upgrade pip
%pip install beautifulsoup4

### 2.2 - Parse HTML file

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-1.html")
bs = BeautifulSoup(html.read(), "html.parser")

print(f"First instance of 'p' tag found: {bs.p.string}")
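
The `.string` attribute used above only returns text when a tag has a single child. The sketch below illustrates this on a small inline HTML snippet (a made-up example, not one of the project files) and shows `get_text()` as an alternative for tags with mixed content.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used here instead of the downloaded file.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

bold_text = soup.b.string      # <b> has a single text child
mixed_text = soup.p.string     # <p> has two children (text and <b>), so this is None
full_text = soup.p.get_text()  # concatenates all the text inside <p>

print(f"b.string: {bold_text}")
print(f"p.string: {mixed_text}")
print(f"p.get_text(): {full_text}")
```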

### 2.3 - Trying other parsers

You can check the other parsers and their differences here: [**Difference between parsers**](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)

Some parsers need to be installed separately.

In [None]:
%pip install lxml
%pip install html5lib

In [None]:
from bs4 import BeautifulSoup

html_parser = BeautifulSoup("<a></b></a>", "html.parser")
lxml = BeautifulSoup("<a></b></a>", "lxml")
html_lib = BeautifulSoup("<a></b></a>", "html5lib")

print(f"'html.parser' parser: {html_parser}")
print(f"'lxml' parser: {lxml}")
print(f"'html5lib' parser: {html_lib}")

### 2.4 - Handling exceptions

Two main things can go wrong in the request:
- The page is not found on the server (or there was an error in retrieving it).
- The server is not found.

#### 2.4.1 - Page not found

`urlopen` raises `HTTPError` when the server responds with an error status, such as 404.

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)

#### 2.4.2 - Server not found

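`urlopen` raises `URLError` when the server cannot be reached. Because `HTTPError` is a subclass of `URLError`, catch `HTTPError` first.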

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')
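
Both error cases can be wrapped in a single helper. This is a minimal sketch: the name `fetch_html` and the `timeout` default are assumptions made for illustration, and the `file://` URL is used only so the example fails without needing network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch_html` is a hypothetical helper, not part of urllib.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as error:
        # The server answered, but with an error status (404, 500, ...).
        print(f"HTTP error {error.code} for {url}")
    except URLError as error:
        # The server (or resource) could not be reached at all.
        print(f"Could not reach {url}: {error.reason}")
    return None


# A path that does not exist, so urlopen raises URLError.
result = fetch_html("file:///this/path/does/not/exist.html")
print(f"Result: {result}")
```

Returning `None` lets the caller decide whether to retry, skip the page, or stop.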


#### 2.4.3 - Tag not found

If you try to access a tag that does not exist in the file, BeautifulSoup returns `None`. If you then try to access an element inside that missing tag, Python raises an `AttributeError`, because `None` has no such attribute.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-1.html")
bs = BeautifulSoup(html.read(), "html.parser")

# Return None
print(f"Non-existing tag: { bs.missingtag }")

# Raise AttributeError
try: 
    print(bs.missingtag.a)
except AttributeError:
    print("Tag not found!")
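
An alternative to catching `AttributeError` is to check for `None` at each step. The sketch below assumes a hypothetical helper named `nested_text` and a small inline document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")


def nested_text(document, outer, inner):
    """Return the text of `inner` inside `outer`, or None if either tag is missing.

    `nested_text` is a hypothetical helper written for this example.
    """
    outer_tag = document.find(outer)
    if outer_tag is None:
        return None
    inner_tag = outer_tag.find(inner)
    if inner_tag is None:
        return None
    return inner_tag.get_text()


print(nested_text(soup, "div", "p"))
print(nested_text(soup, "table", "p"))
print(nested_text(soup, "div", "a"))
```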

### 2.5 - Filter page

BeautifulSoup has two methods to filter HTML pages:

- **find()** - To get the first match
- **find_all()** - To get all matches
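
The difference between the two methods also shows in what they return when nothing matches. A small sketch on inline HTML (the snippet itself is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")

first_item = soup.find("li")               # a single Tag
all_items = soup.find_all("li")            # a list of Tags
first_two = soup.find_all("li", limit=2)   # stop after two matches

missing_one = soup.find("table")           # None when nothing matches
missing_all = soup.find_all("table")       # empty list when nothing matches

print(f"First: {first_item}")
print(f"All: {all_items}")
print(f"Limited: {first_two}")
print(f"Missing: {missing_one}, {missing_all}")
```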

#### 2.5.1 - Count elements

In [None]:
from bs4 import BeautifulSoup

bs = BeautifulSoup("<div><p>1</p><p>2</p><p>3</p></div>", "html.parser")

print(f"Count paragraphs: {len(bs.find_all('p'))}")

#### 2.5.2 - Tag and attributes filter

In [None]:
from bs4 import BeautifulSoup

bs = BeautifulSoup("<div><p>1</p><p class='selected'>2</p><a class='selected'>3</a></div>", "html.parser")

print(bs.find_all(["p", "a"], {"class": "selected"}))

#### 2.5.3 - Keyword filter 

In [None]:
from bs4 import BeautifulSoup

bs = BeautifulSoup("<div><p>1</p><p class='selected'>2</p><a class='selected'>3</a></div>", "html.parser")

print(bs.find_all(class_=True))
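
Keyword filters also accept specific values. The sketch below, on a made-up snippet, filters by `id`, by a single class of a multi-class tag, and by an attribute whose name is not a valid Python keyword (passed through `attrs`).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p id='intro' class='lead big'>A</p>"
    "<p class='big'>B</p>"
    "<p data-kind='note'>C</p>",
    "html.parser",
)

by_id = soup.find_all(id="intro")
by_class = soup.find_all(class_="big")               # matches any one of a tag's classes
by_data = soup.find_all(attrs={"data-kind": "note"})  # hyphenated names go in attrs

print(f"By id: {by_id}")
print(f"By class: {by_class}")
print(f"By data attribute: {by_data}")
```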

#### 2.5.4 - Content filter

In [None]:
from bs4 import BeautifulSoup

bs = BeautifulSoup("<div><p>John</p><p>Richard</p><p>John</p></div>", "html.parser")

count = len(bs.find_all(string="John"))
print(f"Match count: {count}")

### 2.6 - Navigation trees

Find tags based on their location in the document.

#### 2.6.1 - Children and Descendants

Children are always one tag below their parents.
Descendants can be at any level below the parents.

All children are descendants, but not all descendants are children. 
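
The distinction can be checked on a small table without downloading anything. This sketch uses a made-up inline table with two rows and three cells:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table id='t'><tr><td>1</td><td>2</td></tr><tr><td>3</td></tr></table>",
    "html.parser",
)
table = soup.find(id="t")

child_tags = table.find_all(recursive=False)  # only the two <tr> tags
descendant_tags = table.find_all()            # the <tr> tags and their <td> tags

print(f"Children: {[tag.name for tag in child_tags]}")
print(f"Descendants: {[tag.name for tag in descendant_tags]}")
```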

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen(f"{project_url}/static/example-2.html")
bs = BeautifulSoup(html.read(), "html.parser")

table = bs.find(id="ageTable")

# find_all is the current name for the legacy findChildren alias
children = table.find_all(recursive=False)
descendants = table.find_all(recursive=True)

print(f"Children count: {len(children)}")
print(f"Descendants count: {len(descendants)}")

#### 2.6.2 - Siblings

Siblings are tags at the same level of the tree, that is, tags that share the same parent.

In [None]:
first_row = bs.find("table", {"id": "ageTable"}).tr

for index, row in enumerate(first_row.find_next_siblings()):
    print(f"Row {index}: \n{row}\n")
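
Siblings can be searched in both directions. A self-contained sketch on a made-up table, starting from the middle row:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table>"
    "<tr id='head'><th>Name</th></tr>"
    "<tr id='r1'><td>John</td></tr>"
    "<tr id='r2'><td>Mary</td></tr>"
    "</table>",
    "html.parser",
)
middle_row = soup.find(id="r1")

next_ids = [row["id"] for row in middle_row.find_next_siblings("tr")]
previous_ids = [row["id"] for row in middle_row.find_previous_siblings("tr")]

print(f"After r1: {next_ids}")
print(f"Before r1: {previous_ids}")
```

Note that a tag is never included in its own list of siblings.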

#### 2.6.3 - Parents

Parents are always one layer above their children.

In [None]:
child = bs.find(id="John")

print(f"Child: {child}")
print(f"Parent: {child.find_parent()}")
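
Besides the direct parent, a tag can also be asked for a named ancestor further up the tree. A sketch on a made-up inline table:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table id='t'><tr><td id='cell'>1</td></tr></table>",
    "html.parser",
)
cell = soup.find(id="cell")

direct_parent = cell.parent                    # one level up: the <tr>
table_ancestor = cell.find_parent("table")     # first ancestor named "table"
ancestor_names = [tag.name for tag in cell.parents]

print(f"Direct parent: {direct_parent.name}")
print(f"Table ancestor id: {table_ancestor['id']}")
print(f"All ancestors: {ancestor_names}")
```

The last ancestor is the BeautifulSoup object itself, whose name is `[document]`.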

### 2.7 - Regular expressions (regex)

Regular expressions describe patterns that are used to match strings.

| Syntax | Description | Example | Example matches |
|--------|-------------|---------|-----------------|
| ^ | Matches the start of the string. | ^d | dog, door, death |
| $ | Matches the end of the string. | t$ | cat, Fat, left |
| * | Matches the preceding character or expression 0 or more times. | a\*b\* | aabbbb, aaaab, bbb |
| + | Matches the preceding character or expression 1 or more times. | c+d+ | cddd, ccccdd, ccccd |
| ? | Matches the preceding character or expression 0 or 1 times. | cd? | cd, c |
| \ | Escapes a special character so that it is matched literally. | \? | ? |
| {m} | Matches the preceding character or expression exactly m times. | e{5} | eeeee |
| {m,n} | Matches the preceding character or expression between m and n times (**inclusive**), with no space inside the braces. | a{2,5} | aa, aaaa, aaaaa |
| {m,n}? | Same range as {m,n}, but with as **few** repetitions as possible (lazy). | a{2,5}? | aa (the first two characters of aaaaa) |
| {m,n}+ | Same range as {m,n}, with **as many** repetitions as possible and no backtracking (possessive; in Python, requires 3.11 or later). | a{2,5}+ | aaaaa, aaaa |
| \| | Matches one of the specified values, like an **or** condition. | a\|b | a, b |
| [...] | Matches one character from a set of characters. | [amk], [A-Z], [a-z], [A-Za-z], [0-9], [A-Zk\.\$] | m, L, q, G, 2, . |
| [^...] | Matches one character that is **not** in a set of characters. | [^A-Z] | q, 2, &, ? |
| (...) | Matches a grouped subexpression. | (a\*b)\* | aaaababaab |
| (?!...) | Negative lookahead: matches only if ... doesn’t match next. | Isaac (?!Newton) | "Isaac " in Isaac Asimov, but not in Isaac Newton |
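
A few rows of the table can be checked with Python's `re` module. This is a small sketch; the input strings are made up for illustration.

```python
import re

# Greedy vs. lazy bounded repetition on the same input.
greedy = re.match(r"a{2,5}", "aaaaaaa").group()
lazy = re.match(r"a{2,5}?", "aaaaaaa").group()

# In Python, a space inside the braces turns the quantifier into literal text,
# so this pattern does not match "aaa".
with_space = re.fullmatch(r"a{2, 5}", "aaa")

# Negative lookahead: "Isaac " must not be followed by "Newton".
asimov = re.search(r"Isaac (?!Newton)", "Isaac Asimov")
newton = re.search(r"Isaac (?!Newton)", "Isaac Newton")

# A repeated group.
grouped = re.fullmatch(r"(a*b)*", "aaaababaab")

print(f"Greedy: {greedy}")
print(f"Lazy: {lazy}")
print(f"With space: {with_space}")
print(f"Asimov: {asimov is not None}, Newton: {newton is not None}")
print(f"Grouped: {grouped is not None}")
```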

#### 2.7.1 - Special Characters

| Symbol | Description |
|--------|-------------|
| . | Matches any character except a newline. |
| \w | Matches any word character (like [a-zA-Z0-9_]). |
| \W | Matches any non-word character (like [^a-zA-Z0-9_]). |
| \s | Matches any whitespace character (like [ \t\n\r\f\v]). |
| \S | Matches any non-whitespace character (like [^ \t\n\r\f\v]). |
| \d | Matches any digit character (like [0-9]). |
| \D | Matches any non-digit character (like [^0-9]). |
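
The sketch below applies some of these symbols with the `re` module on made-up input strings.

```python
import re

numbers = re.findall(r"\d+", "Room 12, floor 3")     # runs of digits
words = re.findall(r"\w+", "hello, world_1!")        # runs of word characters
parts = re.split(r"\s+", "a  b\tc\nd")               # split on any whitespace
digits_only = re.sub(r"\D", "", "(555) 010-9999")    # drop every non-digit

print(f"Numbers: {numbers}")
print(f"Words: {words}")
print(f"Parts: {parts}")
print(f"Digits only: {digits_only}")
```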




#### 2.7.2 - Regular expression on BeautifulSoup

A pattern compiled with the `re` library can be passed as a filter.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen(f"{project_url}/static/example-1.html")
bs = BeautifulSoup(html.read(), "html.parser")

found = bs.find_all("p", string=re.compile(".*[Ee]xample.*"))
print(f"Paragraphs with 'example' word: {found}")
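
A compiled pattern can also filter attribute values, not only text. This sketch keeps the links whose `href` starts with `https://`; the links are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="https://example.com/a">A</a>'
    '<a href="/local">B</a>'
    '<a href="https://example.org/c">C</a>',
    "html.parser",
)

# Keep only links whose href starts with "https://".
external_links = soup.find_all("a", href=re.compile(r"^https://"))
external_text = [link.get_text() for link in external_links]

print(f"External links: {external_text}")
```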

### 2.8 - Accessing attributes

Tag objects have an `attrs` attribute that returns a dictionary with all of the tag's attributes.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-2.html")
bs = BeautifulSoup(html.read(), "html.parser")

found = bs.find_all("tr")

for index, tr in enumerate(found):
    if "id" in tr.attrs:
        print(f"Row {index + 1} id: {tr.attrs['id']}")
    else:
        print(f"Row {index + 1} without id")
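
Attributes can also be read directly from the tag, as if it were a dictionary. A short sketch on a made-up link:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="home" class="nav main" href="/">Home</a>', "html.parser")
link = soup.a

all_attributes = link.attrs    # multi-valued attributes such as class become lists
href = link["href"]            # raises KeyError if the attribute is missing
title = link.get("title")      # returns None if the attribute is missing

print(f"All attributes: {all_attributes}")
print(f"href: {href}")
print(f"title: {title}")
```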

### 2.9 - BeautifulSoup objects
The library defines four main kinds of objects.

#### 2.9.1 - BeautifulSoup *objects*
The BeautifulSoup object represents the parsed document as a whole.

In [28]:
from bs4 import BeautifulSoup

doc = BeautifulSoup("<html><body><div>Example</div></body></html>", "lxml")

print(f"Type: {type(doc)}")
print(f"Example: {doc}")

Type: <class 'bs4.BeautifulSoup'>
Example: <html><body><div>Example</div></body></html>



#### 2.9.2 - Tag *objects*
A Tag object corresponds to an XML or HTML tag in the original document.


In [29]:
from bs4 import BeautifulSoup

doc = BeautifulSoup("<html><body><div>Example</div></body></html>", "lxml")

print(f"Type: {type(doc.div)}")
print(f"Example: {doc.div}")

Type: <class 'bs4.element.Tag'>
Example: <div>Example</div>



#### 2.9.3 - NavigableString *objects*
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text.

In [30]:
from bs4 import BeautifulSoup

doc = BeautifulSoup("<html><body><div>Example</div></body></html>", "lxml")

print(f"Type: {type(doc.div.string)}")
print(f"Example: {doc.div.string}")

Type: <class 'bs4.element.NavigableString'>
Example: Example


#### 2.9.4 - Comment *objects*
The Comment object is just a special type of NavigableString that holds the text of an HTML comment.

In [31]:
from bs4 import BeautifulSoup

doc = BeautifulSoup("<html><body><div><!-- Comment example --></div></body></html>", "lxml")

print(f"Type: {type(doc.div.string)}")
print(f"Example: {doc.div.string}")

Type: <class 'bs4.element.Comment'>
Example:  Comment example 
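
Because `Comment` is a subclass of `NavigableString`, comments can be collected with a string filter. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "<div><!-- note --><p>Text</p><!-- other --></div>",
    "html.parser",
)

# Keep only the strings that are comments, skipping regular text.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
comment_texts = [comment.strip() for comment in comments]

print(f"Comments: {comment_texts}")
```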


### 2.10 - Lambda expressions

BeautifulSoup allows you to pass a lambda function as a filter. The only restriction is that the function must take a tag object as its argument and return a boolean.

In [36]:
from bs4 import BeautifulSoup

bs = BeautifulSoup("<div>Not selected</div> <div id='selected'>Selected tag</div>", "lxml")

selected = bs.find(lambda tag: ("id" in tag.attrs and tag.attrs["id"] == "selected"))
print(f"Example: {selected}")

Example: <div id="selected">Selected tag</div>
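
A named function works the same way as a lambda and is easier to read when the condition grows. The function name `has_class_but_no_id` and the snippet below are chosen for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="a" id="x">1</p><p class="b">2</p><p>3</p>',
    "html.parser",
)


def has_class_but_no_id(tag):
    """Return True for tags that define a class but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
matched_text = [tag.get_text() for tag in matches]

print(f"Matches: {matches}")
```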
