# Basic Web Scraper

In [3]:
project_url = "https://raw.githubusercontent.com/PedroFerreiraBento/Python-Projects/main/2-web-scraping-projects/2.1-beautiful-soap"

## 1 - Simple request
Request data from a URL and print the HTML returned in the response.

Note: The request targets the raw version of an HTML file that belongs to this project. The file is hosted on GitHub, so the request still goes over the Web.

In [2]:
from urllib.request import urlopen

html = urlopen(f"{project_url}/static/example-1.html")
print(html.read())

NameError: name 'project_url' is not defined
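`html.read()` returns raw bytes, so the printed result shows up as a `b'...'` literal. The following is a minimal sketch of decoding the response into text. The helper name `read_html_text` and the `data:` URL (used so the cell runs without network access) are assumptions for illustration and are not part of the project.

```python
from urllib.request import urlopen


def read_html_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body decoded as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header when present.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


# Assumption: an inline data URL stands in for a real page.
sample_url = "data:text/html;charset=utf-8,<h1>Example</h1><p>Hello</p>"
print(read_html_text(sample_url))
```

The same helper should work with `f"{project_url}/static/example-1.html"` once `project_url` has been defined in an earlier cell.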

## 2 - BeautifulSoup

### 2.1 - Installation

In [13]:
%pip install --upgrade pip
%pip install beautifulsoup4


Note: you may need to restart the kernel to use updated packages.


### 2.2 - Parse HTML file

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-1.html")
bs = BeautifulSoup(html.read(), "html.parser")

print(f"First instance of 'p' tag found: {bs.p.string}")

HTTPError: HTTP Error 404: Not Found

### 2.3 - Trying other parsers

You can read about the other parsers and their differences here: [**Difference between parsers**](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)

Some parsers need to be installed separately.

In [15]:
%pip install lxml
%pip install html5lib

Note: you may need to restart the kernel to use updated packages.


In [39]:
from bs4 import BeautifulSoup

html_parser = BeautifulSoup("<a></b></a>", "html.parser")
lxml = BeautifulSoup("<a></b></a>", "lxml")
html_lib = BeautifulSoup("<a></b></a>", "html5lib")

print(f"'html.parser' parser: {html_parser}")
print(f"'lxml' parser: {lxml}")
print(f"'html5lib' parser: {html_lib}")

'html.parser' parser: <a></a>
'lxml' parser: <html><body><a></a></body></html>
'html5lib' parser: <html><head></head><body><a></a></body></html>
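Because `lxml` and `html5lib` are optional installs, a script may need to fall back to the built-in parser. Below is a small sketch of that idea, relying on the `FeatureNotFound` exception that BeautifulSoup raises when a requested parser is missing. The function name `make_soup` and the preference order are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first installed parser (hypothetical helper)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed, so try the next one.
            continue
    raise RuntimeError("None of the requested parsers is installed")


soup, used = make_soup("<a></b></a>")
print(f"Parsed with {used!r}: {soup}")
```

Since each parser repairs invalid markup differently, the printed document depends on which parser was picked.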


### 2.4 - Handling exceptions

Two main things can go wrong in the request:
- The page is not found on the server (or there was an error in retrieving it).
- The server is not found.

#### 2.4.1 - Page not found

urlopen raises an HTTPError when the server responds with an error status, such as 404.

In [19]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)

#### 2.4.2 - Server not found

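urlopen raises a URLError when the server cannot be reached. HTTPError is a subclass of URLError, so it must be caught first.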

In [20]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')


The server could not be found!
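Both failure cases can be wrapped in one function that returns `None` when the request fails. This is only a sketch: the name `fetch_html` and the `timeout` default are assumptions, not part of the original project.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status (e.g. 404).
        print(f"The server returned an error: {e}")
    except URLError as e:
        # The server could not be reached at all.
        print(f"The server could not be reached: {e.reason}")
    return None
```

A caller can then check the result before parsing, for example `if (page := fetch_html(url)) is not None: ...`.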


#### 2.4.3 - Tag not found

If you try to access a tag that does not exist in the document, BeautifulSoup returns None. If you then try to access an element inside that missing tag, Python raises an AttributeError, because None has no attributes.

In [35]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(f"{project_url}/static/example-1.html")
bs = BeautifulSoup(html.read(), "html.parser")

# Accessing a missing tag returns None
print(f"Non-existing tag: {bs.missingtag}")

# Accessing a child of a missing tag raises AttributeError
try:
    print(bs.missingtag.a)
except AttributeError:
    print("Tag not found!")

Non-existing tag: None
Tag not found!
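An alternative to catching the AttributeError is to check for None at each step. The sketch below uses a hypothetical helper, `get_nested_tag`, and an inline HTML string as stand-in data, so it does not depend on the project file.

```python
from bs4 import BeautifulSoup


def get_nested_tag(soup, *names):
    """Follow a chain of tag names, returning None as soon as one is missing."""
    node = soup
    for name in names:
        node = node.find(name)
        if node is None:
            return None
    return node


# Assumption: sample markup used only for this illustration.
sample = "<html><body><div><a href='/x'>link</a></div></body></html>"
bs_sample = BeautifulSoup(sample, "html.parser")

print(get_nested_tag(bs_sample, "div", "a"))
print(get_nested_tag(bs_sample, "missingtag", "a"))
```

The first call prints the `a` tag and the second prints None, without raising an exception.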
