# **1. Everything About Web Scraping:**

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.

In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security.

## **Why Web Scraping?**
Web scrapers are excellent at gathering and processing large amounts of data quickly. Rather than viewing one page at a time through the narrow window of a monitor, you can view databases spanning thousands or even millions of pages at once.

In addition, web scrapers can go places that traditional search engines cannot. A Google search for “cheapest flights to Boston” will result in a slew of advertisements and popular flight search sites. Google knows only what these websites say on their content pages, not the exact results of various queries entered into a flight search application. However, a well-developed web scraper can chart the cost of a flight to Boston over time, across a variety of websites, and tell you the best time to buy your ticket.

You might be asking: “Isn’t data gathering what APIs are for?” Well, APIs can be fantastic, if you find one that suits your purposes. They are designed to provide a convenient stream of well-formatted data from one computer program to another. You can find an API for many types of data you might want to use, such as Twitter posts or Wikipedia pages. In general, it is preferable to use an API (if one exists), rather than build a bot to get the same data. However, an API might not exist or be useful for your purposes, for several reasons:
- You are gathering relatively small, finite sets of data across a large collection of websites without a cohesive API.
- The data you want is fairly small or uncommon, and the creator did not think it warranted an API.
- The source does not have the infrastructure or technical ability to create an API.
- The data is valuable and/or protected and not intended to be spread widely.

Even when an API does exist, the request volume and rate limits, the types of data, or the format of data that it provides might be insufficient for your purposes.

This is where web scraping steps in. With few exceptions, if you can view data in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.

There are obviously many extremely practical applications of having access to nearly unlimited data: market forecasting, machine-language translation, and even medical diagnostics have benefited tremendously from the ability to retrieve and analyze data from news sites, translated texts, and health forums, respectively.

Regardless of your field, web scraping almost always provides a way to guide business practices more effectively, improve productivity, or even branch off into a brand-new field entirely.

# **Building Scrapers**
At its most basic, a web scraper carries out the following steps:
- Retrieving HTML data from a domain name
- Parsing that data for target information
- Storing the target information
- Optionally, moving to another page to repeat the process

The simplest scraper does this by sending a GET request (a request to fetch, or “get,” the content of a web page) to a web server for a specific page, reading the HTML output from that page, and doing some simple data extraction to isolate the content that you are looking for.
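
The steps above can be sketched in a few lines. This is only a minimal sketch, not a production scraper: the `PAGES` dictionary and the `retrieve` and `scrape` names are hypothetical stand-ins for a real web server and a real HTTP request, so that the example runs without a network connection. It assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical in-memory "site" standing in for real HTTP responses
PAGES = {
    '/page1.html': '<html><body><h1>First</h1>'
                   '<a href="/page2.html">next</a></body></html>',
    '/page2.html': '<html><body><h1>Second</h1></body></html>',
}

def retrieve(path):
    """Step 1: retrieve the HTML (here, from the dictionary above)."""
    return PAGES[path]

def scrape(start):
    stored = []                                   # step 3 target: a plain list
    path = start
    while path is not None:
        bs = BeautifulSoup(retrieve(path), 'html.parser')  # step 2: parse
        stored.append(bs.h1.get_text())                    # step 3: store
        link = bs.find('a')                                # step 4: next page
        path = link['href'] if link is not None else None
    return stored

print(scrape('/page1.html'))
```

The loop stops when a page has no link. A real crawler would also need to remember which pages it has already visited, because this sketch would loop forever on pages that link back to each other.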

# **BeautifulSoup**
- Because the BeautifulSoup library is not a default Python library, it must be installed.
- `pip install beautifulsoup4`

- `lxml` has some advantages over `html.parser` in that it is generally better at parsing “messy” or malformed HTML code. It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags. It is also somewhat faster than `html.parser`, although speed is not necessarily an advantage in web scraping, given that the speed of the network itself will almost always be your largest bottleneck.
- One of the disadvantages of `lxml` is that it has to be installed separately and depends on third-party C libraries to function. This can cause problems for portability and ease of use, compared to html.parser.

Another popular HTML parser is `html5lib`. Like `lxml`, `html5lib` is an extremely forgiving parser that takes even more initiative in correcting broken HTML. It also relies on an external dependency and is slower than both `lxml` and `html.parser`. Despite this, it may be a good choice if you are working with messy or handwritten HTML sites.
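
One way to balance these trade-offs is to prefer a more forgiving parser when it is installed and fall back to the built-in `html.parser` otherwise. The sketch below assumes that this fallback order is what you want; `make_soup` is a hypothetical helper, not part of BeautifulSoup. It relies on `FeatureNotFound`, the exception BeautifulSoup raises when a requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError('No HTML parser is available')

# Deliberately malformed input: neither tag is closed
soup, parser_used = make_soup('<p>Unclosed paragraph<b>bold text')
print(parser_used)        # which parser was actually picked on this machine
print(soup.b.get_text())  # the unclosed <b> tag is still readable
```

Which parser gets picked depends on what is installed, so the first printed line will differ between machines.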

## **Connecting Reliably and Handling Exceptions**
The web is messy. Data is poorly formatted, websites go down, and closing tags go missing. One of the most frustrating experiences in web scraping is to go to sleep with a scraper running, dreaming of all the data you’ll have in your database the next day, only to find that the scraper hit an error on some unexpected data format and stopped execution shortly after you stopped looking at the screen. In situations like these, you might be tempted to curse the name of the developer who created the website (and the oddly formatted data), but the person you should really be kicking is yourself, for not anticipating the exception in the first place!

In [4]:
from bs4 import BeautifulSoup # to scrape data from website
from urllib.request import urlopen # to connect to website
from urllib.error import HTTPError # for handling error
from urllib.error import URLError # for handling url errors like mistyped url or non existing url...

In [3]:
html = urlopen('http://pythonscraping.com/pages/page1.html')
# bs = BeautifulSoup(html.read(), 'html.parser')
# bs = BeautifulSoup(html.read(), 'lxml')
bs = BeautifulSoup(html.read(), 'html5lib')
print(bs.h1)
print(bs.html.body.h1)

<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>


Let’s take a look at the first line of our scraper, after the import statements, and figure out how to handle any exceptions this might throw:

`html = urlopen('http://www.pythonscraping.com/pages/page1.html')`

Two main things can go wrong in this line:
  - The page is not found on the server (or there was an error in retrieving it).
  - The server is not found.
  
- In the first situation, an HTTP error will be returned. This HTTP error may be “404 Page Not Found,” “500 Internal Server Error,” and so forth. In all of these cases, the urlopen function will throw the generic exception HTTPError. You can handle this exception as follows:

In [8]:
# there is no 100th page, so this raises an HTTPError
try:
    html = urlopen('http://pythonscraping.com/pages/page100.html')
    bs = BeautifulSoup(html.read(), 'html.parser')
except HTTPError as e:
    print(e)
else:
    print("It worked")

HTTP Error 404: Not Found


In [9]:
try:
    html = urlopen('http://pythonscraping.com/pages/page1.html')
    bs = BeautifulSoup(html.read(), 'html.parser')
except HTTPError as e:
    print(e)
else:
    print("It worked")

It worked


- If the server is not found at all (if, say, http://www.pythonscraping.com is down, or the URL is mistyped), urlopen will throw a URLError. This indicates that no server could be reached at all, and, because the remote server is responsible for returning HTTP status codes, an HTTPError cannot be thrown, and the more serious URLError must be caught:

In [15]:
try:
    html = urlopen('http://pythonscraping.com/pages/page100.html')
    bs = BeautifulSoup(html.read(), 'html.parser')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print("It worked")

HTTP Error 404: Not Found


In [14]:
try:
    html = urlopen('http://pysraping.com/pages/page1.html')
    bs = BeautifulSoup(html.read(), 'html.parser')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print("It worked")

The server could not be found!


In [16]:
try:
    html = urlopen('http://pythonscraping.com/pages/page1.html')
    bs = BeautifulSoup(html.read(), 'html.parser')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print("It worked")

It worked
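
One detail worth noting about the cells above is the order of the `except` clauses. `HTTPError` is a subclass of `URLError`, so it has to be caught first; otherwise the `URLError` clause would handle both cases. The sketch below shows this without touching the network by constructing the exceptions by hand. `classify` is a hypothetical helper and the URL is a placeholder.

```python
from urllib.error import HTTPError, URLError

def classify(exc):
    """Raise exc and report which except clause handles it."""
    try:
        raise exc
    except HTTPError as e:   # the more specific class must come first
        return f'HTTP error {e.code}'
    except URLError as e:    # reached only for non-HTTP failures
        return f'Server not reached: {e.reason}'

# Hand-built exceptions (placeholder URL), so no request is sent
http_err = HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
url_err = URLError('name resolution failed')

print(issubclass(HTTPError, URLError))
print(classify(http_err))
print(classify(url_err))
```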


- Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being what you expected. Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is, attempting to access a tag on a None object itself will result in an AttributeError being thrown.

In [17]:
try:
    html = urlopen('http://pythonscraping.com/pages/page1.html')
    bs = BeautifulSoup(html.read(), 'html.parser')
    print(bs.nonExistentTag) # made up tag - returns none
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print("It worked")

None
It worked




So how can you guard against these two situations? The easiest way is to explicitly check for both situations:


In [21]:
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is not None:
        print(badContent)

Tag was not found




In [22]:
try:
    badContent = bs.h1
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is not None:
        print(badContent)

<h1>An Interesting Title</h1>
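
An alternative to wrapping every chained access in `try`/`except` is to walk down the tags one at a time and stop as soon as one is missing. The following is a minimal sketch of that idea on an inline HTML string; `find_path` is a hypothetical helper, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

html = '<html><body><h1>An Interesting Title</h1></body></html>'
bs = BeautifulSoup(html, 'html.parser')

def find_path(root, *names):
    """Follow a chain of tag names, returning None as soon as one is missing."""
    node = root
    for name in names:
        if node is None:
            return None
        node = node.find(name)   # same lookup that bs.tagName performs
    return node

print(find_path(bs, 'body', 'h1'))
print(find_path(bs, 'nonExistingTag', 'anotherTag'))
```

The caller then needs only a single `is not None` check, however long the chain is.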


This checking and handling of every error does seem laborious at first, but it’s easy to add a little reorganization to this code to make it less difficult to write (and, more important, much less difficult to read). The same checks can be wrapped in a function:

In [23]:
from bs4 import BeautifulSoup # to scrape data from website
from urllib.request import urlopen # to connect to website
from urllib.error import HTTPError # for handling error
from urllib.error import URLError # for handling url errors like mistyped url or non existing url...

In [26]:
# original working version
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return f'{e}'
    except URLError as e:
        return f'Server not found please make sure you have entered correct url or not.'
    
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.h1
    except AttributeError as e:
        return f'There is no h1 tag in {url}'  # title is unbound if the lookup raised
    
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is not None:
    print(title)
else:
    print('Title could not be found')

<h1>An Interesting Title</h1>


In [27]:
# httperror
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return f'{e}'
    except URLError as e:
        return f'Server not found please make sure you have entered correct url or not.'
    
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.h1
    except AttributeError as e:
        return f'There is no h1 tag in {url}'  # title is unbound if the lookup raised
    
    return title

title = getTitle('http://www.pythonscraping.com/pages/page100.html')
if title is not None:
    print(title)
else:
    print('Title could not be found')

HTTP Error 404: Not Found


In [32]:
# urlerror
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return f'{e}'
    except URLError as e:
        return f'Server not found please make sure you have entered correct url or not.'
    
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.h1
    except AttributeError as e:
        return f'There is no h1 tag in {url}'  # title is unbound if the lookup raised
    
    return title

title = getTitle('http://www.pyscraping.com/pages/page1.html')
if title is not None:
    print(title)
else:
    print('Title could not be found')

Server not found please make sure you have entered correct url or not.


In [36]:
# attribute error
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return f'{e}'
    except URLError as e:
        return f'Server not found please make sure you have entered correct url or not.'
    
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.non7
    except AttributeError as e:
        return None
    
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is not None:
    print(title)
else:
    print('Title could not be found')

Title could not be found


# **2. Advanced HTML Parsing**

## **Another Serving of BeautifulSoup:**
In this section, we’ll discuss searching for tags by attributes, working with lists of tags, and navigating parse trees.

Nearly every website you encounter contains stylesheets. Although you might think that a layer of styling on websites that is designed specifically for browser and human interpretation might be a bad thing, the advent of CSS is a boon for web scrapers. CSS relies on the differentiation of HTML elements that might otherwise have the exact same markup in order to style them differently. Some tags might look like this:

`<span class="green"></span>`

Others look like this:

`<span class="red"></span>`

Web scrapers can easily separate these two tags based on their class; for example, they might use BeautifulSoup to grab all the red text but none of the green text. Because CSS relies on these identifying attributes to style sites appropriately, you are almost guaranteed that these class and ID attributes will be plentiful on most modern websites.

Let’s create an example web scraper that scrapes the page located at `http://www.pythonscraping.com/pages/warandpeace.html`.

On this page, the lines spoken by characters in the story are in red, whereas the names of characters are in green. You can see the span tags, which reference the appropriate CSS classes, in the following sample of the page’s source code:

`<span class="red">Heavens! what a virulent attack!</span> replied`

`<span class="green">the prince</span>, not in the least disconcerted by this reception.`

In [37]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [38]:
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

Using this BeautifulSoup object, you can use the find_all function to extract a Python list of proper nouns found by selecting only the text within `<span class="green"></span>` tags (find_all is an extremely flexible function you’ll be using a lot later in this book).

In [40]:
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name)

<span class="green">Anna
Pavlovna Scherer</span>
<span class="green">Empress Marya
Fedorovna</span>
<span class="green">Prince Vasili Kuragin</span>
<span class="green">Anna Pavlovna</span>
<span class="green">St. Petersburg</span>
<span class="green">the prince</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
<span class="green">the prince</span>
<span class="green">the prince</span>
<span class="green">Prince Vasili</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
<span class="green">Wintzingerode</span>
<span class="green">King of Prussia</span>
<span class="green">le Vicomte de Mortemart</span>
<span class="green">Montmorencys</span>
<span class="green">Rohans</span>
<span class="green">Abbe Morio</span>
<span class="green">the Emperor</span>
<span class="green">the prince</span>
<span class="green">Prince Vasili</span>
<span cl

- To get the names without the surrounding tags, you can use either:
    - the `.text` attribute
    - or the `.get_text()` method
- `.get_text()` strips all tags from the document you are working with and returns a Unicode string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away, and you’ll be left with a tagless block of text.
- Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling `.get_text()` should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.

In [41]:
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.text)

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


In [43]:
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
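
Some names in the output above are split across two lines because the source HTML contains a line break inside the tag. One possible way to clean this up, shown here as a sketch on an inline HTML fragment rather than the live page, is to collapse runs of whitespace after calling `get_text()`:

```python
from bs4 import BeautifulSoup

# Small stand-in for the page, with a line break inside the first name
html = ('<p><span class="green">Anna\nPavlovna Scherer</span> spoke to '
        '<span class="green">the prince</span>.</p>')
bs = BeautifulSoup(html, 'html.parser')

# str.split() with no arguments splits on any whitespace, newlines included
names = [' '.join(span.get_text().split())
         for span in bs.find_all('span', {'class': 'green'})]
print(names)
```

This follows the advice above: the tags are used for selection first, and the text is extracted and cleaned only at the very end.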


## **find() and find_all() with BeautifulSoup:**
BeautifulSoup’s find() and find_all() are the two functions you will likely use the most. With them, you can easily filter HTML pages to find lists of desired tags, or a single tag, based on their various attributes.

The two functions are extremely similar, as evidenced by their definitions in the BeautifulSoup documentation:
- `find_all(tag, attributes, recursive, text, limit, keywords)`
- `find(tag, attributes, recursive, text, keywords)`

In all likelihood, 95% of the time you will need to use only the first two arguments:
- tag and attributes

The tag argument is one that you’ve seen before; you can pass a string name of a tag or even a Python list of string tag names. For example, the following returns a list of all the header tags in a document:

`.find_all(['h1','h2','h3','h4','h5','h6'])`

The attributes argument takes a Python dictionary of attributes and matches tags that contain any one of those attributes. For example, the following function would return both the green and red span tags in the HTML document:

`.find_all('span', {'class':{'green', 'red'}})`

The recursive argument is a boolean. How deeply into the document do you want to go? If recursive is set to True, the find_all function looks into children, and children's children, for tags that match your parameters. If it is False, it will look only at the direct children of the tag you call it on (the top-level tags, when called on the whole document). By default, find_all works recursively (recursive is set to True); it’s generally a good idea to leave this as is, unless you really know what you need to do and performance is an issue.

The text argument is unusual in that it matches based on the text content of the tags, rather than properties of the tags themselves. For instance, if you want to find the number of times “the prince” is surrounded by tags on the example page, you could replace your .find_all() function in the previous example with the following lines:

`nameList = bs.find_all(text='the prince')`

`print(len(nameList))`

The limit argument, of course, is used only in the find_all method; find is equivalent to the same find_all call, with a limit of 1. You might set this if you’re interested only in retrieving the first x items from the page. Be aware, however, that this gives you the first items on the page in the order that they occur, not necessarily the first ones that you want.

The keyword argument allows you to select tags that contain a particular attribute or set of attributes. For example:

`title = bs.find_all(id='title', class_='text')`

This returns a list of every tag with “text” in its class attribute and “title” in its id attribute (the keyword is spelled `class_` because `class` is a reserved word in Python). Note that, by convention, each value for an id should be used only once on the page. Therefore, in practice, a line like this may not be particularly useful, and the single tag it matches can be retrieved more simply with the following:

`title = bs.find(id='title')`
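
The arguments described in this section can be tried out on a small inline document. The HTML below is a made-up fragment, not the real example page. One assumption to flag: recent BeautifulSoup releases call the `text` argument `string` (the old name is kept as an alias), so the sketch uses `string`.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only
html = """
<h1 id="title" class="text">War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="red">Heavens!</span> replied <span class="green">the prince</span>.
<span class="green">the prince</span> said nothing.</p>
"""
bs = BeautifulSoup(html, 'html.parser')

headers = bs.find_all(['h1', 'h2'])                       # tag: list of names
spans = bs.find_all('span', {'class': ['green', 'red']})  # attributes: either class
princes = bs.find_all(string='the prince')                # exact text match
first_span = bs.find_all('span', limit=1)                 # limit: stop after one
titles = bs.find_all(id='title', class_='text')           # keyword arguments

print(len(headers), len(spans), len(princes), len(first_span), len(titles))
print(bs.find(id='title') is titles[0])  # find() returns the first match itself
```

Note that `find_all` always returns a list, even when `limit=1` or when only one tag matches, whereas `find` returns the tag itself (or `None`).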


## **Other BeautifulSoup Objects:**
Besides the BeautifulSoup and Tag objects used so far, there are two more objects in the library that, although less commonly used, are still important to know about:
- `NavigableString` objects
    Used to represent text within tags, rather than the tags themselves (some functions operate on and produce NavigableStrings, rather than tag objects).
- `Comment` objects
    Used to find HTML comments in comment tags, `<!--like this one-->`.
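
A short sketch of how these two object types show up in practice, using a one-line inline document. Filtering strings with `isinstance(..., Comment)` is a common way to collect comments:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

bs = BeautifulSoup('<p>Hello<!--like this one--></p>', 'html.parser')

# The paragraph has two children: a plain string and a comment
kinds = [type(node).__name__ for node in bs.p.children]
print(kinds)

# Comment is a subclass of NavigableString, so check for Comment explicitly
comments = bs.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
print(issubclass(Comment, NavigableString))
```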


## **Navigating Trees:**
The find_all function is responsible for finding tags based on their name and attributes. But what if you need to find a tag based on its location in a document? That’s where tree navigation comes in handy. The simplest form moves in a single direction, from a tag down into its sub-tags:

`bs.tag.subTag.anotherSubTag`

- **Dealing with children and other descendants:**
In the BeautifulSoup library, as well as many other libraries, there is a distinction drawn between children and descendants: much like in a human family tree, children are always exactly one tag below a parent, whereas descendants can be at any level in the tree below a parent. For example, the tr tags are children of the table tag, whereas tr, th, td, img, and span are all descendants of the table tag (at least in our example page). All children are descendants, but not all descendants are children.
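
The difference is easy to see on a tiny table. The HTML below is a simplified stand-in for the example page, written without whitespace between tags so that only tag objects appear:

```python
from bs4 import BeautifulSoup, Tag

# Simplified stand-in for the gift table on the example page
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr><td>Basket</td><td><span>$15.00</span></td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# .children and .descendants also yield text nodes, so keep only the tags
children = [c.name for c in table.children if isinstance(c, Tag)]
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

print(children)      # direct children only
print(descendants)   # every tag below the table, in document order
```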


In [45]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


In [47]:
for descendant in bs.find('table',{'id':'giftList'}).descendants:
    print(descendant)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<th>
Item Title
</th>

Item Title

<th>
Description
</th>

Description

<th>
Cost
</th>

Cost

<th>
Image
</th>

Image



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<td>
Vegetable Basket
</td>

Vegetable Basket

<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

<span class="excitingNote">Now with super-colorful bell peppers!</span>
Now with super-colorful bell peppers!


<td>
$15.00
</td>

$15.00

<td>
<img src="../img/gifts/img1.jpg"

- **Dealing with siblings:**
The BeautifulSoup next_siblings attribute makes it trivial to collect data from tables, especially ones with title rows:

In [48]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

As a complement to next_siblings, previous_siblings can often be helpful if there is an easily selectable tag at the end of a list of sibling tags that you would like to get.

And, of course, there are next_sibling and previous_sibling, which do nearly the same thing as next_siblings and previous_siblings, except that they return a single node rather than a sequence of them.
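
One practical caveat: whitespace between tags counts as a sibling too, which is why blank lines appear in the output above. The sketch below, on a made-up three-row table, shows the difference between the `next_sibling` attribute and the `find_next_sibling` / `find_next_siblings` methods, which can filter by tag name:

```python
from bs4 import BeautifulSoup

# Made-up table with a newline between the rows
html = ('<table id="giftList">'
        '<tr id="header"><th>Item Title</th></tr>\n'
        '<tr id="gift1"><td>Vegetable Basket</td></tr>\n'
        '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
header = bs.find('tr', {'id': 'header'})

print(repr(header.next_sibling))            # the newline string, not a row
print(len(list(header.next_siblings)))      # two newlines plus two rows
print(header.find_next_sibling('tr')['id']) # first sibling that is a <tr>
print([row['id'] for row in header.find_next_siblings('tr')])
```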

- **Dealing with parents:**
When scraping pages, you will likely discover that you need to find parents of tags less frequently than you need to find their children or siblings. Typically, when you look at HTML pages with the goal of crawling them, you start by looking at the top layer of tags, and then figure out how to drill your way down into the exact piece of data that you want. Occasionally, however, you can find yourself in odd situations that require BeautifulSoup’s parent-finding functions, .parent and .parents.

In [49]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

print(bs.find('img',{'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())


$15.00



This code will print the price of the object represented by the image at the location ../img/gifts/img1.jpg (in this case, the price is $15.00).
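
The chained call is easier to follow when broken into steps. The sketch below does this on a single made-up table row, written without whitespace between the cells so that `previous_sibling` lands directly on the price cell:

```python
from bs4 import BeautifulSoup

# One made-up row, no whitespace between the cells
html = ('<table><tr>'
        '<td>Vegetable Basket</td>'
        '<td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"/></td>'
        '</tr></table>')
bs = BeautifulSoup(html, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/img1.jpg'})  # 1. select the image
image_cell = img.parent                                 # 2. its parent <td>
price_cell = image_cell.previous_sibling                # 3. the <td> before it
print(price_cell.get_text())                            # 4. the text inside

# .parents walks all the way up the tree, one level at a time
print([p.name for p in img.parents][:3])
```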

## **Regular Expressions**
Regular expressions are so called because they are used to identify regular strings; they can definitively say, “Yes, this string you’ve given me follows the rules, and I’ll return it,” or “This string does not follow the rules, and I’ll discard it.” This can be exceptionally handy for quickly scanning large documents to look for strings that look like phone numbers or email addresses.

Notice that I used the phrase regular string. What is a regular string? It’s any string that can be generated by a series of linear rules, such as these:
1. Write the letter a at least once.
2. Append to this the letter b exactly five times.
3. Append to this the letter c any even number of times.
4. Write either the letter d or e at the end.
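
The four rules above can be written as one regular expression, one piece per rule: `aa*` for rule 1, `bbbbb` for rule 2, `(cc)*` for rule 3, and `(d|e)` for rule 4. This is one possible expression among several equivalent ones. The test strings are made up for illustration.

```python
import re

# One piece of the pattern per rule in the list above
pattern = re.compile(r'aa*bbbbb(cc)*(d|e)')

candidates = ['aaaabbbbbccccd', 'abbbbbe', 'bbbbbd', 'abbbbbcd']
for s in candidates:
    # fullmatch requires the whole string to follow the rules
    print(s, bool(pattern.fullmatch(s)))
```

The last two strings are rejected: `bbbbbd` has no leading `a`, and `abbbbbcd` has an odd number of `c` characters.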

## **Regular Expressions and BeautifulSoup**
BeautifulSoup and regular expressions go hand in hand when it comes to scraping the web. In fact, most functions that take in a string argument (e.g., find(id="aTagIdHere")) will also take in a regular expression just as well.

In [51]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')

bs.find_all('img')

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <img src="../img/gifts/img1.jpg"/>,
 <img src="../img/gifts/img2.jpg"/>,
 <img src="../img/gifts/img3.jpg"/>,
 <img src="../img/gifts/img4.jpg"/>,
 <img src="../img/gifts/img6.jpg"/>]

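The code above returns every image on the page, including the logo. On a real site it would also pick up hidden images and images used only for spacing and aligning elements, in addition to the product images you actually want.

The solution is to look for something identifying about the tag itself. In this case, you can match the file path of the product images with a regular expression: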

In [54]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('https://www.pythonscraping.com/pages/page3.html')
# Parse the response only once: a second read of the same response is empty
bs = BeautifulSoup(html.read(), 'html.parser')

# A raw string keeps the backslashes intact for the regex engine
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])



## **Accessing Attributes**
So far, you’ve looked at how to access and filter tags and access content within them. However, often in web scraping you’re not looking for the content of a tag; you’re looking for its attributes. This becomes especially useful for tags such as a, where the URL it is pointing to is contained within the href attribute; or the img tag, where the target image is contained within the src attribute.

With tag objects, all of a tag's attributes can be accessed at once by calling this:
- `myTag.attrs`

Keep in mind that this literally returns a Python dictionary object, which makes retrieval and manipulation of these attributes trivial. The source location for an image, for example, can be found using the following line:
- `myImgTag.attrs['src']`
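
A brief sketch of attribute access on a made-up fragment. It also shows the shorthand `tag['src']` and the `tag.get('href')` form, which returns `None` instead of raising an error when the attribute is missing:

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment: two images and two anchors, one without an href
html = ('<img src="../img/gifts/logo.jpg" style="float:left;"/>'
        '<img src="../img/gifts/img1.jpg"/>'
        '<a href="/pages/page1.html">first page</a>'
        '<a name="anchor">no link</a>')
bs = BeautifulSoup(html, 'html.parser')

logo = bs.find('img')
print(logo.attrs)            # dictionary of every attribute on the tag
print(logo.attrs['src'])     # a single attribute

# Product images only, selected by the pattern in their src attribute
products = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
print([img['src'] for img in products])

# .get() avoids a KeyError for tags that lack the attribute
hrefs = [a.get('href') for a in bs.find_all('a')]
print(hrefs)
```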

## **Lambda Expressions**
Essentially, a lambda expression is a function that is passed into another function as a variable; instead of defining a function as f(x, y), you may define a function as f(g(x), y) or even f(g(x), h(x)).

BeautifulSoup allows you to pass certain types of functions as parameters into the find_all function.

The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to True are returned, while the rest are discarded.

For example, the following retrieves all tags that have exactly two attributes:
- `bs.find_all(lambda tag: len(tag.attrs) == 2)`

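Lambda functions can also stand in for existing BeautifulSoup arguments, for example to match tags by their text content:
- `bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')`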

This can also be accomplished without a lambda function:
- `bs.find_all('', text='Or maybe he\'s only resting?')`
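
The calls above can be tried on a small made-up fragment. The sketch uses `string=` for the non-lambda version, on the assumption that a recent BeautifulSoup release is installed (older releases call the same argument `text`). One difference to be aware of: the lambda version returns tags, while the string version returns the matching text nodes.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only
html = ('<div id="a" class="x">two attributes</div>'
        '<div id="b">one attribute</div>'
        "<p>Or maybe he's only resting?</p>")
bs = BeautifulSoup(html, 'html.parser')

# Tags with exactly two attributes
two_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)
print(two_attrs)

# Tags whose complete text equals the given sentence
resting_tags = bs.find_all(lambda tag: tag.get_text() == "Or maybe he's only resting?")
print(resting_tags)

# The same search without a lambda returns strings; .parent gives the tag
resting_strings = bs.find_all(string="Or maybe he's only resting?")
print(resting_strings[0].parent.name)
```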

# **3. Writing Web Crawlers:**
Web crawlers are called such because they crawl across the web. At their core is an element of recursion. They must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.

Beware, however: just because you can crawl the web doesn’t mean that you always should. The scrapers used in previous examples work great in situations where all the data you need is on a single page. With web crawlers, you must be extremely conscientious of how much bandwidth you are using and make every effort to determine whether there’s a way to make the target server’s load easier.

## **Traversing a Single Domain:**

In [57]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')

for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Other_ventures
#Six_Degrees_of_Kevin_Bacon
#Personal_life
#Accolades
#Awards_and_nominations
#Other_honors
#S

If you examine the links that point to article pages (as opposed to other internal pages), you’ll see that they all have three things in common:
* They reside within the div with the id set to bodyContent.
* The URLs do not contain colons.
* The URLs begin with /wiki/.

You can use these rules to revise the code slightly to retrieve only the desired article links by using the regular expression:
- `^(/wiki/)((?!:).)*$`
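
Before running it against a live page, the expression can be checked on its own. The sketch below applies it to a handful of sample hrefs of the kind printed earlier (the list itself is assembled here for illustration). It covers the last two rules only; the first rule, residing inside `bodyContent`, is handled by where you search, not by the pattern.

```python
import re

article_link = re.compile('^(/wiki/)((?!:).)*$')

# Sample hrefs of the kind printed in the earlier output
hrefs = [
    '/wiki/Kevin_Bacon_filmography',                  # article link
    '/wiki/Special:Random',                           # contains a colon
    '//en.wikipedia.org/wiki/Wikipedia:Contact_us',   # does not begin with /wiki/
    '#Early_life_and_education',                      # in-page anchor
    '/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon',
]

matches = [href for href in hrefs if article_link.match(href)]
print(matches)
```

The `((?!:).)*` part consumes one character at a time, but only if that character is not a colon, so any colon between `/wiki/` and the end of the string causes the match to fail.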

In [58]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')

for link in bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Leading_man
/wiki/Character_actor
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/National_Lampoon%27s_Animal_House
/wiki/Diner_(1982_film)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Frost/Nixon_(film)
/wiki/Friday_the_13th_(1980_film)
/wiki/Tremors_(1990_film)
/wiki/The_River_Wild
/wiki/The_Woodsman_(2004_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Patriots_Day_(film)
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Michael_Strobl
/wiki/HBO
/wiki/Taking_Chance
/wiki/Fox_Broadcasting_Company
/wik

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now().timestamp()) 

def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')

while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
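
The loop above keeps requesting pages until it lands on an article with no article links, which may take a very long time. As a hedged sketch of the same random-walk idea with a step limit, the following runs against a small in-memory link graph so that no requests are made. `LINK_GRAPH`, `get_links`, and `random_walk` are hypothetical names introduced for illustration.

```python
import random

# Hypothetical link graph standing in for real pages: page -> article links
LINK_GRAPH = {
    '/wiki/Kevin_Bacon': ['/wiki/Footloose_(1984_film)', '/wiki/Kyra_Sedgwick'],
    '/wiki/Footloose_(1984_film)': ['/wiki/Kevin_Bacon'],
    '/wiki/Kyra_Sedgwick': [],  # dead end: no outgoing article links
}

def get_links(article_url):
    """Stand-in for getLinks(): returns the links found on a page."""
    return LINK_GRAPH.get(article_url, [])

def random_walk(start, max_steps=10, rng=random):
    """Follow random links from start; stop at a dead end or after max_steps."""
    path = [start]
    links = get_links(start)
    while links and len(path) <= max_steps:
        # random.choice picks one element, like indexing with randint
        next_article = rng.choice(links)
        path.append(next_article)
        links = get_links(next_article)
    return path

print(random_walk('/wiki/Kevin_Bacon', max_steps=5))
```

Swapping `get_links` for a function that actually fetches pages would turn this back into a live crawler, with the step limit acting as a simple safeguard on how many requests are sent.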

## **Crawling an Entire Site:**
In the previous section, you took a random walk through a website, going from link to link. But what if you need to systematically catalog or search every page on a site? Crawling an entire site, especially a large one, is a memory-intensive process that is best suited to applications for which a database to store crawling results is readily available. However, you can explore the behavior of these types of applications without running them full-scale.

The general approach to an exhaustive site crawl is to start with a top-level page (such as the home page), and search for a list of all internal links on that page. Every one of those links is then crawled, and additional lists of links are found on each one of them, triggering another round of crawling. Clearly, this is a situation that can blow up quickly. If every page has 10 internal links, and a website is 5 pages deep (a fairly typical depth for a medium-size website), then the number of pages you need to crawl is 10⁵, or 100,000 pages, before you can be sure that you’ve exhaustively covered the website.

Strangely enough, although “5 pages deep and 10 internal links per page” are fairly typical dimensions for a website, very few websites have 100,000 or more pages. The reason, of course, is that the vast majority of internal links are duplicates.

To avoid crawling the same page twice, it is extremely important that all internal links discovered are formatted consistently, and kept in a running set for easy lookups, while the program is running. A set is similar to a list, but elements do not have a specific order, and only unique elements will be stored, which is ideal for our needs. Only links that are “new” should be crawled and searched for additional links:

In [62]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

In [None]:
pages = set()

def getLinks(pageUrl):
    global pages 
 
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks('')
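
The recursive `getLinks` above calls itself once per newly discovered page, so on a large site it can run into Python's default recursion limit of 1,000 nested calls. One possible alternative, sketched here rather than taken from the code above, is to keep the same set-based duplicate check but replace recursion with an explicit queue. This visits pages breadth-first instead of depth-first. The `SITE` dictionary and the `crawl` function are hypothetical stand-ins so the example runs without any network access.

```python
from collections import deque

# Hypothetical site: page -> internal links found on it (with duplicates)
SITE = {
    '': ['/wiki/A', '/wiki/B'],
    '/wiki/A': ['/wiki/B', '/wiki/C', '/wiki/A'],
    '/wiki/B': ['/wiki/A'],
    '/wiki/C': [],
}

def crawl(start_url, fetch_links):
    """Visit every reachable page once; return links in discovery order."""
    seen = set()                 # fast membership checks, no duplicates
    order = []
    queue = deque([start_url])   # pages waiting to be searched for links
    while queue:
        page = queue.popleft()
        for link in fetch_links(page):
            if link not in seen:
                # We have encountered a new page
                seen.add(link)
                order.append(link)
                queue.append(link)
    return order

print(crawl('', lambda page: SITE.get(page, [])))
```

Passing the link-fetching function in as a parameter keeps the traversal logic separate from the scraping logic, so the same loop could be reused with a function that downloads and parses real pages.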

## **Collecting Data Across an Entire Site:**
Web crawlers would be fairly boring if all they did was hop from one page to the other. To make them useful, you need to be able to do something on the page while you’re there. Let’s look at how to build a scraper that collects the title, the first paragraph of content, and the link to edit the page (if available). 

As always, the first step to determine how best to do this is to look at a few pages from the site and determine a pattern.

By modifying our basic crawling code, you can create a combination crawler/data-gathering (or, at least, data-printing) program:

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages

    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks('')
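
In the code above, a single `try` block covers all three lookups, so one message is printed whenever any of them fails. This works because `find` returns `None` when nothing matches, and accessing an attribute on `None` raises `AttributeError`. If you would rather keep whatever pieces are present, one option is to check each lookup separately. The following is only a sketch: the function name `extract` and the sample markup are assumptions, and the edit-link lookup is simplified compared with the chain used above.

```python
from bs4 import BeautifulSoup

# Hypothetical page that has a title and a paragraph but no edit link
html = ('<html><body><h1>Sample title</h1>'
        '<div id="mw-content-text"><p>First paragraph.</p></div>'
        '</body></html>')
bs = BeautifulSoup(html, 'html.parser')

def extract(bs):
    """Return (title, first paragraph, edit link); missing pieces become None."""
    title = bs.h1.get_text() if bs.h1 else None

    content = bs.find(id='mw-content-text')
    first_p = content.find('p') if content else None
    first_text = first_p.get_text() if first_p else None

    edit_item = bs.find(id='ca-edit')
    edit_link = edit_item.find('a') if edit_item else None
    edit_href = edit_link.attrs.get('href') if edit_link else None

    return title, first_text, edit_href

print(extract(bs))
```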

## **Crawling Across the Internet**
Web crawlers are at the heart of what drives many modern web technologies, and you don’t necessarily need a large data warehouse to use them. To do any cross-domain data analysis, you do need to build crawlers that can interpret and store data across the myriad of pages on the internet. Just as in the previous example, the web crawlers you are going to build will follow links from page to page, building out a map of the web. But this time, they will not ignore external links; they will follow them.

Before you start writing a crawler that follows all outbound links willy-nilly, you should ask yourself a few questions:

- What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?
- When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?
- Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?
- How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across?

In [2]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

In [67]:
pages = set()
random.seed(datetime.datetime.now().timestamp())

# Retrieves a list of all internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme,
                                  urlparse(includeUrl).netloc)
    internalLinks = []
    # Finds all links that begin with "/" or that contain the site's URL
    for link in bs.find_all('a',
                            href=re.compile('^(/|.*' + includeUrl + ')')):
        href = link.attrs['href']
        if href is not None:
            # Convert site-relative links to absolute URLs before
            # checking for duplicates, so each page is stored only once
            if href.startswith('/'):
                href = includeUrl + href
            if href not in internalLinks:
                internalLinks.append(href)
    return internalLinks

#Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" that do
    #not contain the current URL
    for link in bs.find_all('a',
        href=re.compile('^(http|www)((?!'+excludeUrl+').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks


def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs,
    urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme,urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(internalLinks[random.randint(0,len(internalLinks)-1)])
    else:   
        return externalLinks[random.randint(0, len(externalLinks)-1)]
    
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)

followExternalOnly('http://oreilly.com')


In [None]:
# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    domain = '{}://{}'.format(urlparse(siteUrl).scheme, urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html, 'html.parser')
    internalLinks = getInternalLinks(bs, domain)
    externalLinks = getExternalLinks(bs, domain)
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)

allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')
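
The functions above decide whether a link is internal or external by matching regular expressions against the raw `href`. As an alternative sketch (not the approach used above), the standard library's `urljoin` and `urlparse` can resolve each link to an absolute URL first and then compare domains. The function name `classify_links` and the sample hrefs are assumptions for illustration.

```python
from urllib.parse import urljoin, urlparse

def classify_links(page_url, hrefs):
    """Split hrefs into (internal, external) lists of absolute URLs."""
    site = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)   # resolves '/path' and 'path' forms
        parts = urlparse(absolute)
        if parts.scheme not in ('http', 'https'):
            continue                         # skip mailto:, javascript:, etc.
        target = internal if parts.netloc == site else external
        if absolute not in target:
            target.append(absolute)
    return internal, external

internal, external = classify_links(
    'http://oreilly.com/about/',
    ['/contact', 'http://oreilly.com/contact',
     'https://www.example.com/', 'mailto:info@example.com'],
)
print(internal)
print(external)
```

This comparison is strict about the domain: `www.oreilly.com` and `oreilly.com` would count as different sites here, so a real crawler may want to normalize the host name before comparing.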

# **4. Web Crawling Models:**

## **Dealing with Different Website Layouts**

In [None]:
import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find("h1").text
    lines = bs.find_all("p", {"class":"story-content"})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)

def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find("h1").text
    body = bs.find("div", {"class": "post-body"}).text
    return Content(url, title, body)

url = ('https://www.brookings.edu/blog/future-development'
       '/2018/01/26/delivering-inclusive-urban-access-3-unc'
       'omfortable-truths/')
content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)
url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)
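
With one scraping function per site, the calling code has to know which function goes with which URL. One possible way to handle this, sketched here under stated assumptions, is a small registry keyed by domain. The names `SCRAPERS`, `register`, and `scrape` are hypothetical, and the two scraper functions return placeholder content instead of fetching pages, so the example runs offline.

```python
from urllib.parse import urlparse

class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

# Hypothetical registry: domain -> function that scrapes pages on that domain
SCRAPERS = {}

def register(domain):
    """Decorator that records a scraper function for a domain."""
    def decorator(func):
        SCRAPERS[domain] = func
        return func
    return decorator

@register('www.nytimes.com')
def scrape_nytimes(url):
    # Placeholder; a real version would fetch and parse the page
    return Content(url, 'NYTimes placeholder title', 'NYTimes placeholder body')

@register('www.brookings.edu')
def scrape_brookings(url):
    # Placeholder; a real version would fetch and parse the page
    return Content(url, 'Brookings placeholder title', 'Brookings placeholder body')

def scrape(url):
    """Look up the scraper registered for the URL's domain and call it."""
    domain = urlparse(url).netloc
    if domain not in SCRAPERS:
        raise ValueError('No scraper registered for {}'.format(domain))
    return SCRAPERS[domain](url)

content = scrape('https://www.brookings.edu/blog/')
print(content.title)
```

Adding support for another site then means writing one more decorated function, while the code that loops over URLs stays unchanged.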
