# Basic Web Scraper

In [None]:
project_url = "https://raw.githubusercontent.com/PedroFerreiraBento/Python-Projects/main/2-web-scraping-projects/2.1-beautiful-soap"

## 1 - Simple request
Request data from a URL and print the HTML returned in the response.

Note: The file requested here is an HTML file that belongs to this project. GitHub serves it as raw content, so the request still goes over the web.

In [None]:
from urllib.request import urlopen

html = urlopen(f"{project_url}/static/example-1.html")
print(html.read())
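
`html.read()` returns bytes, so printing it shows a bytes literal (`b'...'`) rather than readable text. A minimal sketch of one way to get a string instead, assuming UTF-8 when the server does not declare a charset. `fetch_html` is a hypothetical helper name, not part of urllib.

```python
from urllib.request import urlopen


def fetch_html(url: str, default_charset: str = "utf-8") -> str:
    """Download a URL and return the body decoded to text."""
    # The context manager closes the connection when the block exits.
    with urlopen(url) as response:
        # Use the charset from the Content-Type header if one was sent.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset, errors="replace")


# Usage with the project_url defined at the top of this notebook:
# print(fetch_html(f"{project_url}/static/example-1.html"))
```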

## 2 - BeautifulSoup

### 2.1 - Installation

In [None]:
%pip install --upgrade pip
%pip install beautifulsoup4

### 2.2 - Parse HTML file

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-1.html")
bs = BeautifulSoup(html.read(), "html.parser")

print(f"First instance of 'p' tag found: {bs.p.string}")
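
`bs.p` is a shortcut for `bs.find("p")`, and `.string` only returns text when the tag has a single child. The sketch below shows the difference on a small inline document. The markup is a made-up stand-in, not the content of example-1.html.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used so the example runs without a request.
sample_html = "<html><body><p>Plain text</p><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

first_p = soup.p                     # same tag as soup.find("p")
second_p = soup.find_all("p")[1]

print(first_p.string)                # text of the single child
print(second_p.string)               # None, because the tag has two children
print(second_p.get_text())           # joins the text of all nested nodes
```

When a tag may contain nested markup, `get_text()` is usually the safer choice.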

### 2.3 - Trying other parsers

You can compare the available parsers here: [**Differences between parsers**](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)

Some parsers need to be installed separately.

In [None]:
%pip install lxml
%pip install html5lib

In [None]:
from bs4 import BeautifulSoup

html_parser = BeautifulSoup("<a></b></a>", "html.parser")
lxml = BeautifulSoup("<a></b></a>", "lxml")
html_lib = BeautifulSoup("<a></b></a>", "html5lib")

print(f"'html.parser' parser: {html_parser}")
print(f"'lxml' parser: {lxml}")
print(f"'html5lib' parser: {html_lib}")
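
If a parser library is missing, BeautifulSoup raises `FeatureNotFound`. A small sketch of how to check which parsers are usable before choosing one. `available_parsers` is a hypothetical helper and the candidate order is only an assumption about preference.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("lxml", "html5lib", "html.parser")):
    """Return the parser names from `candidates` that can be used here."""
    usable = []
    for name in candidates:
        try:
            BeautifulSoup("<p></p>", name)
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            continue
        usable.append(name)
    return usable


parsers = available_parsers()
print(f"Usable parsers: {parsers}")

# html.parser ships with Python, so the list is never empty.
soup = BeautifulSoup("<a></b></a>", parsers[0])
print(soup)
```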

### 2.4 - Handling exceptions

Two main things can go wrong in the request:
- The page is not found on the server (or there was an error in retrieving it).
- The server is not found.

#### 2.4.1 - Page not found

urlopen raises an HTTPError when the server responds with an error status, such as 404 Not Found.

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)

#### 2.4.2 - Server not found

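urlopen raises a URLError when the server cannot be reached. HTTPError is a subclass of URLError, so it has to be caught first.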

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')
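
Both cases can be wrapped in one function so that the calling code only has to check for None. This is a minimal sketch; `safe_fetch` is a hypothetical helper name and printing the error is just one possible way to report it.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_fetch(url: str):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as error:
        # The server answered, but with an error status (e.g. 404 or 500).
        print(f"HTTP error {error.code} for {url}")
    except URLError as error:
        # No usable answer: unknown host, refused connection, and so on.
        print(f"Could not reach {url}: {error.reason}")
    return None


# Usage with the project_url defined at the top of this notebook:
# body = safe_fetch(f"{project_url}/static/example-1.html")
# if body is not None:
#     print(body[:100])
```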


#### 2.4.3 - Tag not found

If you access a tag that does not exist in the document, BeautifulSoup returns None. If you then try to access a tag nested inside that missing tag, you are accessing an attribute of None, which raises an AttributeError.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-1.html")
bs = BeautifulSoup(html.read(), "html.parser")

# Return None
print(f"Non-existing tag: { bs.missingtag }")

# Raise AttributeError
try: 
    print(bs.missingtag.a)
except AttributeError:
    print("Tag not found!")
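
An alternative to catching the AttributeError is to check for None at every step. The sketch below does that for a chain of tag names. `nested_tag` is a hypothetical helper and the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
soup = BeautifulSoup("<div><h1>Title</h1></div>", "html.parser")


def nested_tag(root, *names):
    """Follow a chain of tag names, returning None as soon as one is missing."""
    current = root
    for name in names:
        if current is None:
            return None
        current = current.find(name)
    return current


print(nested_tag(soup, "div", "h1"))
print(nested_tag(soup, "missingtag", "a"))
```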

### 2.5 - Filter page

BeautifulSoup has two methods to filter HTML pages:

- **find()** - To get the first match
- **find_all()** - To get all matches

#### 2.5.1 - Count elements

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-2.html")
bs = BeautifulSoup(html.read(), "html.parser")

print(f"Count rows and cells: {len(bs.find_all(['tr', 'td']))}")
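
Passing a list to `find_all()` matches any of the listed tag names, so the count above includes both `tr` and `td` tags. A short sketch that counts them separately, using made-up markup that only assumes a table similar to the one in example-2.html:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real example-2.html may differ.
sample_html = """
<table id="ageTable">
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>35</td></tr>
<tr><td>Jeffrey</td><td>40</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

rows = soup.find_all("tr")                    # only <tr> tags
cells = soup.find_all("td")                   # only <td> tags
rows_and_cells = soup.find_all(["tr", "td"])  # either tag name matches

print(f"Rows: {len(rows)}")
print(f"Cells: {len(cells)}")
print(f"Rows and cells: {len(rows_and_cells)}")
```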

#### 2.5.2 - Tag and attributes filter

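The tag name and the attributes must both match, so this behaves like an **and** filter. Passing a list of tag names or a list of attribute values instead behaves like an **or** filter.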

In [None]:
print(bs.find("table", {"id": "ageTable"}))
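
A small sketch of how the tag name and the attribute dictionary combine, and how a list of values widens the match. The markup and the ids other than `ageTable` are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
sample_html = """
<table id="ageTable"></table>
<table id="cityTable"></table>
<div id="notes"></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Tag name AND attribute must both match.
both = soup.find_all("table", {"id": "ageTable"})

# A list of attribute values matches any one of them.
either_id = soup.find_all("table", {"id": ["ageTable", "cityTable"]})

# A list of tag names matches any one of them.
either_tag = soup.find_all(["table", "div"])

print(len(both), len(either_id), len(either_tag))
```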

#### 2.5.3 - Keyword filter 

Keyword arguments filter on attributes directly. When several are given, all of them must match, like an **and** filter.

In [None]:
print(bs.find(id="ageTable"))
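
The sketch below shows several keyword arguments combined, on made-up markup. The class names and row ids are assumptions for this example only. Because `class` is a reserved word in Python, BeautifulSoup uses `class_` for that attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
sample_html = """
<table id="ageTable">
<tr class="header"><th>Name</th></tr>
<tr class="person" id="row-1"><td>John</td></tr>
<tr class="person" id="row-2"><td>Jeffrey</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

table = soup.find(id="ageTable")                  # one keyword
people = soup.find_all(class_="person")           # class needs the underscore
second = soup.find_all(class_="person", id="row-2")  # both must match

print(table.name)
print(len(people))
print(second[0].td.get_text())
```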

#### 2.5.4 - Content filter

In [47]:
count = len(bs.find_all(string="35"))
print(f"Count 35 age: {count}")

Count 35 age: 1
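
`string="35"` only matches text that is exactly "35". A sketch of how a regular expression can match partial text instead, and how to get from the matched text back to its tag. The markup is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
sample_html = (
    "<table>"
    "<tr><td>John</td><td>35</td></tr>"
    "<tr><td>Ann</td><td>135</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

exact = soup.find_all(string="35")                # whole text equals "35"
partial = soup.find_all(string=re.compile("35"))  # text contains "35"

print(f"Exact matches: {len(exact)}")
print(f"Partial matches: {len(partial)}")

# Matches are text nodes; .parent gives the enclosing tag.
age_cell = exact[0].parent
name_cell = age_cell.find_previous_sibling("td")
print(f"{name_cell.get_text()} is {age_cell.get_text()}")
```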


### 2.6 - Navigation trees

Find tags based on their location in the document tree.

#### 2.6.1 - Children and Descendants

Children are always exactly one level below their parent.
Descendants can be at any level below the parent.

All children are descendants, but not all descendants are children. Both include text nodes, such as the whitespace between tags, not only tags.

In [56]:
children = bs.find(id="ageTable").children
descendants = bs.find(id="ageTable").descendants

print(f"Children count: {len(list(children))}")
print(f"Descendants count: {len(list(descendants))}")

Children count: 9
Descendants count: 37
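
The counts above include text nodes as well as tags. A sketch of how to count only the tags by filtering on `Tag`, using made-up markup that assumes a table similar to `ageTable`:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup; the real example-2.html may differ.
sample_html = """<table id="ageTable">
<tr><td>John</td><td>35</td></tr>
<tr><td>Jeffrey</td><td>40</td></tr>
</table>"""
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find(id="ageTable")

children = list(table.children)
descendants = list(table.descendants)

# Keep only real tags, dropping newline and text nodes.
tag_children = [node for node in children if isinstance(node, Tag)]
tag_descendants = [node for node in descendants if isinstance(node, Tag)]

print(f"Children: {len(children)} nodes, {len(tag_children)} tags")
print(f"Descendants: {len(descendants)} nodes, {len(tag_descendants)} tags")
```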


#### 2.6.2 - Siblings

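Siblings are nodes at the same level of the tree that share a parent. The whitespace between tags is a text node, so it shows up as a sibling too.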

In [68]:
first_row = bs.find("table", {"id": "ageTable"}).tr

print(first_row.next_siblings)  # next_siblings is a generator, so this prints the generator object

for index, row in enumerate(first_row.next_siblings):
    print(f"Row {index}: \n{row}")

<generator object PageElement.next_siblings at 0x000001E8192FFE80>
Row 0: 


Row 1: 
<tr>
<td>John</td>
<td>35</td>
</tr>
Row 2: 


Row 3: 
<tr>
<td>Jeffrey</td>
<td>40</td>
</tr>
Row 4: 


Row 5: 
<tr>
<td>Jorge</td>
<td>21</td>
</tr>
Row 6: 
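
The empty rows in the output above are the newline text nodes between the `tr` tags. A sketch of how `find_next_siblings()` can skip them by asking for a tag name, on made-up markup that assumes a table similar to `ageTable`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real example-2.html may differ.
sample_html = """<table id="ageTable">
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>35</td></tr>
<tr><td>Jeffrey</td><td>40</td></tr>
</table>"""
soup = BeautifulSoup(sample_html, "html.parser")
first_row = soup.find("table", {"id": "ageTable"}).tr

# The immediate sibling is the newline after the header row.
print(repr(first_row.next_sibling))

# Asking for "tr" returns only the following rows, without text nodes.
data_rows = first_row.find_next_siblings("tr")
names = [row.td.get_text() for row in data_rows]
print(names)
```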


