# Chapter 1: Your First Web Scraper

---

This chapter introduces web scraping by teaching you how to build a basic scraper using Python. You’ll learn about web page structures (HTML, CSS, and DOM), use urlopen from the standard urllib library to fetch pages, and use Beautiful Soup to extract data from them. Along the way, you’ll handle common connection and content problems. By the end, you’ll have created your first functional scraper, gaining the skills to collect and organize data from websites.

---

## Connecting

Web scraping requires stripping away some of the shroud of the interface, not just at the browser level (how it interprets all of this HTML, CSS, and JavaScript), but occasionally at the level of the network connection.

A web browser does two jobs for you:

- Creating packets of information and handing them to the operating system to send over the network
- Interpreting the data that comes back and displaying it in a readable format

**However, a web browser is just code, and code can be taken apart, rewritten, or even reused.**

Python can do the same job in just two or three lines of code:

In [4]:
# urllib is part of the Python standard library, so nothing extra needs to be installed
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())  # The same code can be saved to a file and run from a terminal: python <code_file_name>.py

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


This command outputs the complete HTML code for page1 located at the URL *http://pythonscraping.com/pages/page1.html*.

**Why are we talking in terms of files and not pages?**

Most modern web pages are associated with many other files (e.g. image files, JavaScript files, or CSS files).

When the HTML contains a tag such as (img src="cuteKitten.jpg"), the browser knows that it has to make another request to the server to get that file.
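
To make the "files, not pages" idea concrete, here is a minimal sketch that lists the extra files a page asks the browser to fetch. It uses only the standard library's html.parser module. The class name ResourceCollector and the file names style.css and app.js are made up for illustration; only cuteKitten.jpg comes from the text above.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect (tag, url) pairs for files referenced by a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts point at their file with src
        if tag in ('img', 'script') and attrs.get('src'):
            self.resources.append((tag, attrs['src']))
        # Stylesheets are referenced with link href
        elif tag == 'link' and attrs.get('href'):
            self.resources.append((tag, attrs['href']))


# A made-up page that references three other files
page = '''
<html>
<head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head>
<body><img src="cuteKitten.jpg"></body>
</html>
'''

collector = ResourceCollector()
collector.feed(page)
print(collector.resources)
# [('link', 'style.css'), ('script', 'app.js'), ('img', 'cuteKitten.jpg')]
```

Each entry in the list is one more request a browser would send after downloading the HTML file itself. A scraper that only calls urlopen on the page gets the HTML file and nothing else.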

---

## An Introduction to BeautifulSoup

BeautifulSoup helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

### Installing BeautifulSoup

We can use pip, the Python package manager, to install BeautifulSoup 4 (the package is named beautifulsoup4 and is imported as bs4).

In [None]:
!pip install beautifulsoup4         # The leading ! is an IPython notebook trick for running terminal commands from a cell

Once the installation finishes, we can import it:

In [2]:
from bs4 import BeautifulSoup

### Keeping Libraries Straight with Virtual Environments

Virtual environments in Python isolate each project's dependencies, which prevents version conflicts between projects and makes libraries easier to manage. Each project gets its own set of installed libraries, so installing or upgrading a package for one scraper cannot break another. The book offers specific instructions on how to set up and use a virtual environment, which you can reference when needed.
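
As a rough sketch (not the book's own instructions), a virtual environment can also be created from Python itself with the standard venv module. This is the programmatic counterpart of running python -m venv in a terminal. The helper name create_env and the directory name scrapingEnv are assumptions chosen for illustration.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return the path of its config file.

    Hypothetical helper: with_pip=True asks venv to bootstrap pip inside the
    new environment so packages such as beautifulsoup4 can be installed there.
    """
    venv.create(env_dir, with_pip=with_pip)
    # Every environment created by venv contains a pyvenv.cfg file at its root
    return Path(env_dir) / 'pyvenv.cfg'


# Example usage (commented out so that nothing is created on import):
# config = create_env('scrapingEnv')
# print(config.exists())
```

After the environment exists, it still has to be activated from the terminal (or its own python executable used directly) before installing libraries into it.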

### Running BeautifulSoup

In [5]:
html = urlopen('http://pythonscraping.com/pages/page1.html')

bs = BeautifulSoup(html.read(), 'html.parser')
# Passing the response object directly, without calling read(), also works:
# bs = BeautifulSoup(html, 'html.parser')

print(bs)
print(bs.h1)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

<h1>An Interesting Title</h1>


Notice that the bs object mirrors the nested structure of the HTML document, yet it can also reach any tag, such as h1, directly. The same tag can be reached through any of these paths: bs.html.body.h1, bs.html.h1, or bs.body.h1.

*Note that bs.h1 returns only the first h1 tag in the document. Ideally a page has a single h1, but web pages do not necessarily follow that rule, so when a page has several h1 tags you still get only the first one.*
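
Here is a small sketch of that behavior, using a made-up page with two h1 tags instead of a live URL. The variable names and the HTML are assumptions for illustration; find_all is the BeautifulSoup method that returns every matching tag instead of only the first.

```python
from bs4 import BeautifulSoup

# A made-up page that breaks the "one h1 per page" convention
sample_html = """
<html><body>
<h1>First heading</h1>
<div><h1>Second heading</h1></div>
</body></html>
"""

sample = BeautifulSoup(sample_html, 'html.parser')

first_h1 = sample.h1            # attribute access stops at the first match
all_h1 = sample.find_all('h1')  # find_all collects every match

print(first_h1.get_text())
# First heading
print([h.get_text() for h in all_h1])
# ['First heading', 'Second heading']
```

When a page might contain several copies of a tag, find_all is the safer starting point, because the result makes it obvious how many matches there were.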

The BeautifulSoup constructor expects the name of the parser to use as its second argument. In most cases the choice of parser makes no difference to the result.

### Parsers

In [27]:
# !pip install lxml  -> Installs lxml, another parser that BeautifulSoup can use

In [30]:
import lxml

html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'lxml')

bs

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

The lxml parser has some advantages over html.parser: it is more forgiving of messy HTML (e.g. it fixes unclosed and misnested tags) and it is somewhat faster. Its annoying disadvantage is that it has to be installed separately and depends on third-party C libraries to function. This can cause problems for portability and ease of use, compared to html.parser.

In [32]:
# !pip install html5lib     -> A third parser we can use

At this point you may find that BeautifulSoup does not recognize the parser. When that happened to me, I checked that every installation had completed properly and then restarted the notebook kernel, which fixed it. Try the same if you run into this problem.

In [6]:
import html5lib

html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html5lib')

bs

<html><head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>


</body></html>
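
The three parsers agree on clean pages like the one above, and the differences only show up on broken markup. The sketch below feeds a tiny invalid snippet to each parser and prints what comes back. It also catches FeatureNotFound, the exception bs4 raises when the requested parser is not installed, which is the "does not recognize the parser" problem mentioned earlier. The helper name parse_with and the snippet are assumptions for illustration, and the exact repaired output can vary between parser versions, so none is shown here.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the repaired markup as a string, or None if the parser is unavailable."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # bs4 could not find a parser with that name (usually: not installed)
        return None


# An opening <a> tag followed by a stray closing </p> tag
broken = '<a></p>'

for parser in ('html.parser', 'lxml', 'html5lib'):
    result = parse_with(broken, parser)
    print(parser, '->', result if result is not None else 'not installed')
```

Running this in your own environment is a quick way to see both which parsers are installed and how differently each one repairs the same broken snippet.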

---

## Connecting Reliably and Handling Exceptions

Scraping is a process that may run all day. You may leave your scraper running overnight, and that is exactly when a problem can occur: the scraper hits an unexpected data format and stops running. Handling exceptions from the start is therefore as important as the scraping itself.

Going back to the earlier code:

In [7]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')

### Connection Problems

Two main things can go wrong in this line:

- The server is reached but returns an HTTP error (for example "404 Page Not Found" or "500 Internal Server Error"). In all of these cases, the urlopen function will throw the generic exception HTTPError.
- The server is not found at all. In this case urlopen will throw a URLError. Because the remote server is responsible for returning HTTP status codes, an HTTPError cannot be thrown when no server could be reached, and the more serious URLError must be caught.

Cure

In [10]:
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as HTTPerr:
    print(HTTPerr)
except URLError as URLerr:
    print(URLerr)
else:
    pass                        # Runs only if no exception was raised
finally:
    print("URL caught!")        # Always executes

URL caught!
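
One detail about the order of the except clauses above is worth spelling out: HTTPError is a subclass of URLError, so the more specific HTTPError has to be caught first, otherwise the URLError branch would swallow it. The sketch below shows this without needing a network connection by requesting a local file that does not exist, which urlopen also reports as a URLError. The function name classify_fetch, the 10-second timeout, and the file path are assumptions for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def classify_fetch(url):
    """Return 'ok', 'http_error', or 'url_error' for a single request."""
    try:
        with urlopen(url, timeout=10) as response:
            response.read()
    except HTTPError:
        # The server answered, but with an error status code
        return 'http_error'
    except URLError:
        # No usable response at all (server or resource could not be reached)
        return 'url_error'
    return 'ok'


print(issubclass(HTTPError, URLError))
# True
print(classify_fetch('file:///no/such/file.html'))
# url_error
```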


### Content Problems

Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is, attempting to access a tag on a None object itself will result in an AttributeError being thrown.

In [11]:
# print(bs.<a_name_of_a_tag_that_does_not_exist>)       -> Prints None, because the tag does not exist

The real problem comes when you try to access an attribute or another tag inside that missing tag

In [12]:
# print(bs.<a_tag_that_doesn't_exist>.<another_tag>)        -> Raises an AttributeError, because None has no attribute called <another_tag>

Cure

In [13]:
try:
    unknown_content = bs.does_not_exist_tag.another_tag
except AttributeError:
    print("Tag was not found")
else:
    if unknown_content is None:
        print("Tag was not found")
    else:
        print(unknown_content)

Tag was not found
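
The same check can be wrapped in a small helper so that chained lookups never touch None. This is only a sketch on a made-up document. The helper name first_nested is hypothetical, and it relies on the fact that accessing a tag as an attribute (bs.body) is shorthand for find('body').

```python
from bs4 import BeautifulSoup

# A made-up document with a body and an h1, but no nav tag
sample = BeautifulSoup('<html><body><h1>Title</h1></body></html>', 'html.parser')

print(sample.nav)
# None

try:
    sample.nav.a  # attribute lookup on None
except AttributeError:
    print('AttributeError raised')


def first_nested(soup, *tag_names):
    """Follow tag_names one level at a time; return None as soon as one is missing."""
    node = soup
    for name in tag_names:
        node = node.find(name)
        if node is None:
            return None
    return node


print(first_nested(sample, 'body', 'h1'))
# <h1>Title</h1>
print(first_nested(sample, 'nav', 'a'))
# None
```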


Now, the same checks organized into a reusable function:

In [14]:
def getTitle(url):
    try:
        html = urlopen(url)
    except URLError:
        # HTTPError is a subclass of URLError, so this covers both connection problems
        return None

    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError:
        # bs.body was None, so the page has no body tag
        return None

    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')

if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>


---

### Note That

**When writing scrapers, it’s important to think about the overall pattern of your code in order to handle exceptions and make it readable at the same time. You’ll also likely want to heavily reuse code. Having generic functions such as getSiteHTML and getTitle (complete with thorough exception handling) makes it easy to scrape the web quickly and reliably.**
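
As one possible shape for such generic functions, here is a sketch that separates fetching from parsing. The name getSiteHTML comes from the note above, but its body is my own assumption, as are the helper getHeading and the 10-second timeout. Each function returns None on failure so the caller needs only a single check.

```python
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def getSiteHTML(url):
    """Return the raw bytes of the page at url, or None if it cannot be fetched."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except URLError:
        # Also covers HTTPError, which is a subclass of URLError
        return None


def getHeading(url):
    """Return the text of the first h1 on the page, or None if there is none."""
    html = getSiteHTML(url)
    if html is None:
        return None
    bs = BeautifulSoup(html, 'html.parser')
    heading = bs.find('h1')
    if heading is None:
        return None
    return heading.get_text(strip=True)


# Example usage (needs a network connection, so it is left commented out):
# print(getHeading('http://www.pythonscraping.com/pages/page1.html'))
```

Keeping the network code in one function means every other scraping function can reuse it, and a change such as a different timeout only has to be made in one place.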

---

## End

This chapter covered the foundational aspects of web scraping with Python: using urlopen from the standard urllib library to retrieve web page content, using Beautiful Soup to parse and extract data, and handling the connection and content errors that would otherwise stop a scraper. It also highlighted the importance of virtual environments for managing project dependencies, ensuring an organized and conflict-free development process. The chapter provided the essential tools and knowledge needed to begin scraping data and set the stage for exploring more advanced techniques in subsequent chapters.