# Web Scraping with Python - Collecting More Data from the Modern Web
https://edu.anarcho-copy.org/Programming%20Languages/Python/Web%20Scraping%20with%20Python,%202nd%20Edition.pdf

# Chapter 1: Your First Web Scraper

urllib = a standard Python library (meaning you don’t have to install anything extra
to run this example). It contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent.

urlopen = used to open a remote object across a network and read it. It is a fairly generic function (it can read HTML files, image files, or any other file stream with ease).

In [1]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


## install & run beautiful soup

In [2]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
# BeautifulSoup takes 2 arguments: the HTML text and the parser used to build the object
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)  # print the first h1 tag found on the page

<h1>An Interesting Title</h1>


Equivalent calls that all return the same h1 tag:

In [4]:
bs.html.body.h1

<h1>An Interesting Title</h1>

In [5]:
bs.body.h1

<h1>An Interesting Title</h1>

In [6]:
bs.html.h1

<h1>An Interesting Title</h1>

In [7]:
bs.h1

<h1>An Interesting Title</h1>

#### another popular parser which has the benefit of parsing messy or malformed HTML code

In [8]:
pip install lxml 



lxml has some advantages over html.parser in that it is generally better at parsing
“messy” or malformed HTML code. It is forgiving and fixes problems like unclosed
tags, tags that are improperly nested, and missing head or body tags. 

In [9]:
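html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # re-fetch: the earlier response was already consumed by read()
bs = BeautifulSoup(html.read(), 'lxml')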

Another popular HTML parser is html5lib. Like lxml, html5lib is an extremely forgiving parser that takes even more initiative in correcting broken HTML.

In [10]:
pip install html5lib



In [11]:
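html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # re-fetch: the earlier response was already consumed by read()
bs = BeautifulSoup(html.read(), 'html5lib')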


In [12]:
bs.html.h1

### Connecting Reliably and Handling Exceptions

The web is messy. Data is poorly formatted, websites go down, and closing tags go
missing. One of the most frustrating experiences in web scraping is to go to sleep
with a scraper running, dreaming of all the data you’ll have in your database the next
day—only to find that the scraper hit an error on some unexpected data format and
stopped execution shortly after you stopped looking at the screen.

In [13]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')

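Two main things can go wrong on this line: the page is not found on the server (or there is an error in retrieving it), or the server is not found at all. In the first case urlopen throws an HTTPError. If the call is wrapped in a try statement with an except HTTPError branch that prints the error, then when an HTTP error code is returned the program prints it and does not execute the rest of the program under the else statement.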

If the server is not found at all (if, say, http://www.pythonscraping.com is down, or the
URL is mistyped), urlopen will throw a URLError.

You can add a check to see whether this is the case:

In [14]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e: #some other plan
    print('The server could not be found!')
else:
    print('It Worked!')

The server could not be found!
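The same pattern can be wrapped in a small helper so both failure modes are handled in one place. The sketch below is only an illustration: `fetch`, its `opener` parameter, and the two stand-in openers are hypothetical names introduced here so the error paths can be exercised without a network connection. Because HTTPError is a subclass of URLError, the HTTPError branch has to come first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request failed.

    `opener` is a hypothetical injection point (not part of urllib) that
    defaults to urlopen and lets the error paths run offline.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP error:', e.code)
        return None
    except URLError as e:
        # No usable answer at all: bad host name, refused connection, etc.
        print('The server could not be found:', e.reason)
        return None
    else:
        return response.read()


def page_missing(url):
    # Stand-in opener that simulates a 404 response.
    raise HTTPError(url, 404, 'Not Found', None, None)


def server_missing(url):
    # Stand-in opener that simulates a failed host lookup.
    raise URLError('name resolution failed')


print(fetch('http://example.invalid/page', opener=page_missing))
print(fetch('http://example.invalid/page', opener=server_missing))
```

Calling `fetch(url)` without the `opener` argument uses the real urlopen.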


Of course, if the page is retrieved successfully from the server, there is still the issue of
the content on the page not quite being what you expected. Every time you access a
tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually
exists.

In [15]:
# nonExistentTag is a made-up tag, not the name of a real BeautifulSoup function; accessing it returns None
print(bs.nonExistentTag) 

None




In [16]:
print(bs.nonExistentTag.someTag)



AttributeError: 'NoneType' object has no attribute 'someTag'

So how can you guard against these two situations? The easiest way is to explicitly
check for both situations:

In [17]:
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)

Tag was not found




## Connecting Reliably and handling Exceptions

This checking and handling of every error does seem laborious at first, but it’s easy to
add a little reorganization to this code to make it less difficult to write. This code, for example, is our same scraper
written in a slightly different way:

In [18]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>


In this example, you’re creating a function getTitle, which returns either the title of
the page, or a None object if there was a problem retrieving it. Inside getTitle, you
check for an HTTPError, as in the previous example, and encapsulate two of the
BeautifulSoup lines inside one try statement.

# Chapter 2: Advanced HTML Parsing

Let’s say you have some target content. Maybe it’s a name, statistic, or block of text.
Maybe it’s buried 20 tags deep in an HTML mush with no helpful tags or HTML
attributes to be found. Before writing a long, brittle chain of tag lookups to reach it, consider these alternatives:

1. Look for a “Print This Page” link, or perhaps a mobile version of the site that has
better-formatted HTML.

2. Look for the information hidden in a JavaScript file. Remember, you might need
to examine the imported JavaScript files in order to do this.

3. This is more common for page titles, but the information might be available in
the URL of the page itself.

4. If the information you are looking for is unique to this website for some reason,
you’re out of luck. If not, try to think of other sources you could get this information from. Is there another website with the same data? Is this website displaying
data that it scraped or aggregated from another website?


CSS relies on the differentiation of HTML elements that might otherwise have the exact
same markup in order to style them differently. Some tags might look like this:

In [19]:
"""
<span class="green"></span>
<span class="red"></span>
"""

'\n<span class="green"></span>\n<span class="red"></span>\n'

Web scrapers can easily separate these two tags based on their class; for example, they
might use BeautifulSoup to grab all the red text but none of the green text. Because
CSS relies on these identifying attributes to style sites appropriately, you are almost
guaranteed that these class and ID attributes will be plentiful on most modern websites.

In [20]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

In [21]:
# retrieve the title
print(bs.html.h1)

<h1>War and Peace</h1>


## Find() and find_all() with Beautiful Soup

Previously, you’ve called bs.tagName to get the first occurrence of that tag on the page. Now, you’re calling bs.find_all(tagName, tagAttributes) to get a list of all of the tags on the page, rather than just the first.

In [22]:
# find every span tag with the class "green"
nameList = bs.find_all('span', {'class':'green'})

for name in nameList:
    print(name.get_text())  # prints the green text in the order it appears in War and Peace
                            # (this can be checked by opening the URL in a browser)

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


BeautifulSoup’s find() and find_all() are the two functions you will likely use the
most. With them, you can easily filter HTML pages to find lists of desired tags, or a
single tag, based on their various attributes.

find_all(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)

In all likelihood, 95% of the time you will need to use only the first two arguments:
tag and attributes. However, let’s take a look at all the arguments in greater detail.
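As a rough illustration of the less common arguments, the sketch below runs find_all on a small inline document instead of a live page. The HTML snippet and variable names are made up for this example; `recursive` and `limit` are the arguments named in the signatures above.

```python
from bs4 import BeautifulSoup

# Made-up document, used only so the example runs without a network request.
html = """
<html><body>
<div id="outer">
  <p class="note">first</p>
  <div><p class="note">nested</p></div>
  <p>third</p>
</div>
</body></html>
"""
bs = BeautifulSoup(html, 'html.parser')
outer = bs.find('div', {'id': 'outer'})

all_p = outer.find_all('p')                      # every p tag at any depth
direct_p = outer.find_all('p', recursive=False)  # only direct children of outer
first_two = outer.find_all('p', limit=2)         # stop after two matches
notes = outer.find_all('p', {'class': 'note'})   # filter on an attribute

for label, tags in [('all', all_p), ('direct', direct_p),
                    ('limit=2', first_two), ('class=note', notes)]:
    print(label, [tag.get_text() for tag in tags])
```

find behaves like find_all with a limit of 1, except that it returns the single tag (or None) rather than a list.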


In [23]:
# lists all headers in the doc
bs.find_all(['h1','h2','h3','h4','h5','h6']) 

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

In [24]:
# returns green and red span tags in html doc
bs.find_all('span', {'class':{'green', 'red'}}) 

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span clas

The text argument is unusual in that it matches based on the text content of the tags, rather than properties of the tags themselves (BeautifulSoup 4.4.0 and later call this argument string; text is kept as an older alias). For instance, if you want to find the number of times “the prince” is surrounded by tags on the example page:

In [25]:
# count the strings on the page that are exactly 'the prince'
nameList = bs.find_all(text='the prince')
print(len(nameList))

7




In [26]:
# The keyword argument allows you to select tags that contain a particular attribute or set of attributes. For example:
title = bs.find_all(id='title', class_='text')

#### recap

BeautifulSoup objects = Instances seen in previous code examples as the variable bs.

Tag objects = Retrieved in lists, or retrieved individually, by calling find_all and find on a BeautifulSoup object, or by drilling down, as follows: bs.div.h1

However, there are two more objects in the library that, although less commonly used, are still important to know about:

NavigableString objects = Used to represent text within tags, rather than the tags themselves (some functions operate on and produce NavigableStrings, rather than tag objects).

Comment object = Used to find HTML comments in comment tags, <!--like this one-->.

These four objects are the only objects you will ever encounter in the BeautifulSoup library (as of the time of writing).
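To see the four object types side by side, a short sketch on a made-up snippet of HTML (the markup and variable names are illustrative, not from the example site):

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up markup containing a tag, some text, and an HTML comment.
html = '<div><h1>Title</h1><!--a hidden note--><p>Body text</p></div>'
bs = BeautifulSoup(html, 'html.parser')

print(type(bs).__name__)             # the BeautifulSoup object itself
print(type(bs.div.h1).__name__)      # a Tag reached by drilling down
print(type(bs.div.h1.string).__name__)  # the text inside the h1 tag

# Walk the direct children of the div and report what kind of object each is.
for node in bs.div.children:
    print(type(node).__name__, repr(str(node)))
```

Comment is a subclass of NavigableString, so code that loops over strings may want an explicit isinstance check when comments should be skipped.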


## Navigating Trees

Example page (a shopping site): http://www.pythonscraping.com/pages/page3.html

It has a tree-like HTML structure:

* HTML
  * body
    * div.wrapper
      * h1
      * div.content
      * table#giftList
        * tr
          * th
          * th
          * th
          * th
        * tr.gift#gift1
          * td
          * td
            * span.excitingNote
          * td
          * td
            * img
        * ...table rows continue...
      * div.footer

In the BeautifulSoup library, as well as many other libraries, there is a distinction
drawn between children and descendants: much like in a human family tree, children
are always exactly one tag below a parent, whereas descendants can be at any level in
the tree below a parent. For example, the tr tags are children of the table tag,
whereas tr, th, td, img, and span are all descendants of the table tag (at least in our
example page). All children are descendants, but not all descendants are children.
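The difference is easy to see on a cut-down version of the gift table. This is a minimal sketch on made-up inline markup (not the real page), using the .children and .descendants attributes of a tag:

```python
from bs4 import BeautifulSoup, Tag

# Cut-down, made-up version of the gift table so the example runs offline.
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr class="gift"><td>Basket <span>New!</span></td><td>$15.00</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# Both attributes also yield text nodes, so keep only Tag objects here.
children = [node.name for node in table.children if isinstance(node, Tag)]
descendants = [node.name for node in table.descendants if isinstance(node, Tag)]

print('children:   ', children)
print('descendants:', descendants)
```

Only the two tr tags are children of the table; the th, td, and span tags show up only among the descendants.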

In [27]:
#If you want to find only descendants that are children, you can use the .children tag:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

# This code prints the list of product rows in the giftList table, including the initial
# row of column labels. If you were to iterate over .descendants instead of .children,
# about two dozen tags would be found within the table and printed, including img tags,
# span tags, and individual td tags.



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


### dealing with siblings


The code below prints all rows of products from the product table, except
for the first title row. Why does the title row get skipped? Objects cannot be siblings
with themselves. Anytime you get siblings of an object, the object itself will not be
included in the list. As the name implies, next_siblings returns only the siblings that come after the object.


In [28]:
# BeautifulSoup's next_siblings attribute makes it trivial to collect data from tables, especially ones with title rows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

### dealing with parents

When scraping pages, you will likely discover that you need to find parents of tags less frequently than you need to find their children or siblings. 

Typically, when you look at HTML pages with the goal of crawling them, you start by looking at the top layer of tags, and then figure out how to drill your way down into the exact piece of data that you want. 

Occasionally, however, you can find yourself in odd situations that require BeautifulSoup’s parent-finding functions, .parent and .parents. For example:

In [29]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.find('img',{'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())

# the img tag's parent is a td tag; that td's previous sibling is the td tag that contains the dollar value of the product


$15.00



## Regular expressions and BeautifulSoup

https://www.pythonscraping.com/pages/page3.html

Notice that the site has many product images, which take the following form: <img src="../img/gifts/img3.jpg">

If you wanted to grab URLs to all of the product images, it might seem fairly straightforward at first: just grab all the image tags by using .find_all("img"), right?

But here’s a problem. In addition to the obvious “extra” images (e.g., logos), modern websites often have hidden images, blank images used for spacing and aligning elements, and other random image tags you might not be aware of. Certainly, you can’t count on the only images on the page being product images.

Let’s also assume that the layout of the page might change, or that, for whatever reason, you don’t want to depend on the position of the image in the page in order to find the correct tag.

The solution is to look for something identifying about the tag itself. In this case, you
can look at the file path of the product images:

In [30]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])

# prints only the relative image paths that start with ../img/gifts/img and end in .jpg;
# .* matches any run of characters, here the 1, 2, 3, etc. in the file names

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


### accessing attributes

With tag objects, all of a tag's attributes can be accessed as a Python dictionary by calling this:

myTag.attrs

The source location for an image, for example, can be found using the following line:

myImgTag.attrs['src']

### Lambda Expressions

A lambda expression is a function that is passed into another function as a variable; instead of defining a function as f(x, y), you may define a function as f(g(x), y) or even f(g(x), h(x)).

BeautifulSoup allows you to pass certain types of functions as parameters into the find_all function.


The only restriction is that these functions must take a tag object as an argument and return a boolean. 

Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to True are returned, while the rest are discarded.

For example, the following retrieves all tags that have exactly two attributes:


In [31]:
bs.find_all(lambda tag: len(tag.attrs) == 2)

#Here, the function that you are passing as the argument is len(tag.attrs) == 2.
#Where this is True, the find_all function will return the tag. That is, it will find tags
# with two attributes, such as the following:
# <div class="body" id="content"></div>
# <span style="color:red" class="title"></span>

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,005.00
 </td><td>
 <img src="../img/gifts/img3.jpg"/>
 </td>

In [32]:
# Lambda functions are so useful you can even use them to replace existing
# BeautifulSoup functions:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [33]:
#This can also be accomplished without a lambda function:
bs.find_all('', text='Or maybe he\'s only resting?')



["Or maybe he's only resting?"]
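The two outputs above differ in kind: the lambda version returns the span tag, while the text search returns the string inside it. A small offline sketch of the same idea, using made-up markup and a named function in place of one of the lambdas (`has_two_attributes` is a hypothetical helper; `string` is the newer name for the text argument):

```python
from bs4 import BeautifulSoup

# Made-up markup: the div and img tags have two attributes each.
html = ('<div class="body" id="content">'
        '<span class="excitingNote">Or maybe he\'s only resting?</span>'
        '<img src="logo.jpg" style="float:left;"/>'
        '<p id="solo">plain</p>'
        '</div>')
bs = BeautifulSoup(html, 'html.parser')


def has_two_attributes(tag):
    # Filter functions receive a tag and return True to keep it.
    return len(tag.attrs) == 2


two_attr_names = [tag.name for tag in bs.find_all(has_two_attributes)]
by_function = bs.find_all(lambda tag: tag.get_text() == "Or maybe he's only resting?")
by_string = bs.find_all(string="Or maybe he's only resting?")

print(two_attr_names)
print(by_function)  # a list of Tag objects
print(by_string)    # a list of NavigableString objects
```

When the surrounding tag is needed after a string search, it is available as the `.parent` of each returned string.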

### Regular Expressions

Problem:
1. Write the letter a at least once.
2. Append to this the letter b exactly five times.
3. Append to this the letter c any even number of times.
4. Write either the letter d or e at the end.

In [34]:
"""
solution = aa*bbbbb(cc)*(d|e)

aa* = the letter a, followed by a* (any number of a's, including 0), so a is written at least once
bbbbb = 5 b's in a row
(cc)* = any number of pairs of c's (including 0), so c appears an even number of times
(d|e) = d or e
"""

### for testing regular expressions

https://www.regexpal.com/
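The same kind of check can be run locally with Python's re module. A minimal sketch that tests the solution aa*bbbbb(cc)*(d|e) against a few hand-picked strings (the candidate strings are made up for illustration):

```python
import re

# The solution to the four-step problem above.
pattern = re.compile(r'aa*bbbbb(cc)*(d|e)')

candidates = [
    'abbbbbd',         # one a, five b's, no c's, ends in d
    'aaaabbbbbcccce',  # four a's, five b's, four c's, ends in e
    'bbbbbd',          # no leading a
    'abbbbbcccd',      # odd number of c's
    'abbbbd',          # only four b's
]

# fullmatch requires the whole string to fit the pattern, not just a prefix.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}
for s, ok in results.items():
    print(s, ok)
```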

In [35]:
r"""
*     = matches the preceding character 0 or more times, e.g. a*b* = aaaaaaaaa, aaabbbbbbbb, bbbbbb
+     = matches the preceding character 1 or more times, e.g. a+b+ = aaaaab, aaabbbbb, abbbbb
[]    = matches any character within the brackets, e.g. [A-Z]* = APPLE, CAPITALS, QUERY
()    = a grouped subexpression, e.g. (a*b)* = aaabaaab, abaaab, ababaaab
{m,n} = matches the preceding character between m and n times, inclusive, e.g. a{2,3}b{2,3} = aabbb, aaabbb, aabb
[^]   = matches any single character that is not in the brackets, e.g. [^A-Z]* = apple, lowercase, qwerty
|     = matches any character or string separated by |, e.g. b(a|i|e)d = bad, bid, bed
.     = matches any single character (including symbols, numbers, and spaces), e.g. b.d = bad, bzd, b$d, b d
^     = indicates that a character occurs at the beginning of the string, e.g. ^a = apple, asdf, a
\     = escape character; lets special characters be used with their literal meanings, e.g. \.\|\\ = .|\
$     = often used at the end of a regular expression; means "match this up to the end of the string", e.g. [A-Z]*[a-z]*$ = ABCabc, zzzyx, Bob
?!    = "does not contain", e.g. ^((?![A-Z]).)*$ = no-caps-here, $ymb0ls a4e f!ne
"""
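A few of these symbols are easier to trust after trying them. The sketch below checks a handful of patterns from the table with re.fullmatch; the pairing of patterns with test strings is chosen here for illustration.

```python
import re

# (pattern, text, whether the whole text is expected to match)
checks = [
    (r'a{2,3}b{2,3}', 'aaabbb', True),
    (r'a{2,3}b{2,3}', 'abbb', False),         # only one a, at least two required
    (r'[^A-Z]*', 'lowercase', True),
    (r'[^A-Z]*', 'Lowercase', False),         # contains an uppercase letter
    (r'b(a|i|e)d', 'bid', True),
    (r'b.d', 'b$d', True),                    # . matches any single character
    (r'\.\|\\', '.|\\', True),                # escaped ., | and \ are literals
    (r'^((?![A-Z]).)*$', 'no-caps-here', True),
    (r'^((?![A-Z]).)*$', 'Caps here', False),  # negative lookahead rejects the C
]

outcomes = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]
for (pattern, text, expected), got in zip(checks, outcomes):
    print(f'{pattern!r:22} {text!r:16} expected={expected} got={got}')
```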

In [36]:
r"""
EMAIL ADDRESS EXAMPLE

[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)

the first part contains at least 1 of: upper/lowercase letters, numbers 0-9, periods (.), plus signs (+), or underscores (_)
next comes the @ symbol
then at least 1 upper/lowercase letter
followed by a period (.)
followed by com, org, edu, or net
"""
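A quick way to get a feel for what this expression accepts is to run it over a few sample addresses. The addresses below are invented for this sketch.

```python
import re

# The email expression from the notes above, compiled once.
EMAIL = re.compile(r'[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)')

samples = [
    'ryan+scraping@example.com',  # plus sign in the first part is allowed
    'first.last@school.edu',
    'no-at-sign.example.com',     # no @ symbol
    'user@example.io',            # io is not in the list of endings
    'user@mail.example.com',      # a dot in the domain part is not allowed
]

verdicts = {s: bool(EMAIL.fullmatch(s)) for s in samples}
for address, ok in verdicts.items():
    print(address, ok)
```

The last two samples show that the expression is a deliberate simplification: it rejects some addresses that are perfectly valid in practice.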

# Chapter 3: Writing Web Crawlers

Web crawlers are called such because they crawl across the web. At their core is an element of recursion. They must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.

With web crawlers, you must be extremely conscientious of how much bandwidth you are using and make every effort to determine whether there’s a way to make the target server’s load easier.

### Traversing a single Domain

In this section, you’ll begin a project that will become a Six Degrees of Wikipedia solution finder: You’ll be able to take the Eric Idle page and find the fewest number of link clicks that will take you to the Kevin Bacon page.

https://en.wikipedia.org/wiki/Kevin_Bacon


In [37]:
# You should already know how to write a Python script that retrieves an arbitrary
# Wikipedia page and produces a list of links on that page:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])


#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Other_ventures
#Six_Degrees_of_Kevin_Bacon
#Personal_life
#Accolades
#Awards_and_nominations
#Other_honors
#S

Recently a friend of mine, while working on a similar Wikipedia-scraping project, mentioned he had written a large filtering function, with more than 100 lines of code, in order to determine whether an internal Wikipedia link was an article page.

Unfortunately, he had not spent much time upfront trying to find patterns between “article links” and “other links,” or he might have discovered the trick.

If you examine the links that point to article pages (as opposed to other internal pages), you’ll see that they all have three things in common:

* They reside within the div with the id set to bodyContent.

* The URLs do not contain colons.
* The URLs begin with /wiki/.
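The last two rules can be checked on the href text alone, before any page is fetched. A minimal sketch, using the expression ^(/wiki/)((?!:).)*$ on a hand-picked list of hrefs taken from the output above:

```python
import re

# Starts with /wiki/ and contains no colon anywhere after it.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

hrefs = [
    '/wiki/Kevin_Bacon_filmography',
    '/wiki/Special:Random',
    '/wiki/Help:Contents',
    '#Acting_career',
    '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
    '/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon',
    '/wiki/Footloose_(1984_film)',
]

article_links = [href for href in hrefs if ARTICLE_LINK.match(href)]
print(article_links)
```

The first rule (the link sits inside the bodyContent div) still needs the parsed page, which is why the scraper searches within that div first.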


In [38]:
# You can use these rules to revise the code slightly to retrieve only the desired article
# links by using the regular expression ^(/wiki/)((?!:).)*$

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
# here we specify the 3 conditions for the patterns note ?!: = don't contain colons
for link in bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Leading_man
/wiki/Character_actor
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/National_Lampoon%27s_Animal_House
/wiki/Footloose_(1984_film)
/wiki/Diner_(1982_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Frost/Nixon_(film)
/wiki/Friday_the_13th_(1980_film)
/wiki/Tremors_(1990_film)
/wiki/The_River_Wild
/wiki/The_Woodsman_(2004_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Patriots_Day_(film)
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Michael_Strobl
/wiki/HBO
/wiki/Taking_Chance
/wiki/Fox_Broadcasting_Company
/wik

The above is a list of all article URLs that the Wikipedia article on Kevin Bacon links to.

Of course, having a script that finds all article links in one, hardcoded Wikipedia
article, while interesting, is fairly useless in practice. You need to be able to take this code
and transform it into something more like the following:
* A single function, getLinks, that takes in a Wikipedia article URL of the
form /wiki/<Article_Name> and returns a list of all linked article URLs in the
same form.

* A main function that calls getLinks with a starting article, chooses a random
article link from the returned list, and calls getLinks again, until you stop the
program or until no article links are found on the new page.

In [40]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# random.seed() accepts only None, int, float, str, bytes, or bytearray,
# so seed with the current timestamp (a float) rather than a datetime object
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    # fetch the article and return its links to other article pages
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
# random walk: keep following a randomly chosen article link until a page has none
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
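The random walk itself can be tried without hitting Wikipedia. In the sketch below, `FAKE_WIKI`, `get_links`, and `random_walk` are hypothetical stand-ins: a small dictionary plays the role of the site, and a step limit (an addition not present in the cell above) keeps the walk from running indefinitely.

```python
import random

# Hypothetical stand-in for Wikipedia: article path -> linked article paths.
FAKE_WIKI = {
    '/wiki/Kevin_Bacon': ['/wiki/Footloose_(1984_film)', '/wiki/Kyra_Sedgwick'],
    '/wiki/Footloose_(1984_film)': ['/wiki/Kevin_Bacon'],
    '/wiki/Kyra_Sedgwick': ['/wiki/Dead_End'],
    '/wiki/Dead_End': [],
}


def get_links(article_url):
    # Plays the role of getLinks, but reads from the dictionary above.
    return FAKE_WIKI.get(article_url, [])


def random_walk(start, max_steps=10, rng=random):
    """Follow randomly chosen links until a dead end or max_steps hops."""
    path = [start]
    links = get_links(start)
    while links and len(path) <= max_steps:
        next_article = rng.choice(links)
        path.append(next_article)
        links = get_links(next_article)
    return path


random.seed()  # no argument: seed from an OS-provided source or the clock
path = random_walk('/wiki/Kevin_Bacon')
print(' -> '.join(path))
```

random.choice(links) picks one element directly, which is equivalent to indexing with random.randint(0, len(links)-1).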

### Crawling an Entire Site

In the previous section, you took a random walk through a website, going from link
to link. But what if you need to systematically catalog or search every page on a site?
Crawling an entire site, especially a large one, is a memory-intensive process that is
best suited to applications for which a database to store crawling results is readily
available. However, you can explore the behavior of these types of applications
without running them full-scale

The general approach to an exhaustive site crawl is to start with a top-level page (such
as the home page), and search for a list of all internal links on that page. Every one of
those links is then crawled, and additional lists of links are found on each one of
them, triggering another round of crawling.

If every page has 10 internal links, and a website is 5 pages deep (a fairly typical
depth for a medium-size website), then the number of pages you need to crawl is 10^5,
or 100,000 pages, before you can be sure that you’ve exhaustively covered the website.

To avoid crawling the same page twice, it is extremely important that all internal links
discovered are formatted consistently, and kept in a running set for easy lookups,
while the program is running. A set is similar to a list, but elements do not have a
specific order, and only unique elements will be stored, which is ideal for our needs.
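A small sketch of what "formatted consistently" can mean in practice, using only the standard library. The base URL and the sample links are assumptions for illustration, and normalizeLink is a hypothetical helper, not something from the book.

```python
from urllib.parse import urldefrag, urljoin

# Assumed base URL, matching the Wikipedia examples in these notes
BASE_URL = 'http://en.wikipedia.org'

def normalizeLink(href):
    """Return an absolute URL with any #fragment removed."""
    absolute = urljoin(BASE_URL, href)
    return urldefrag(absolute).url

pages = set()

# Three of these four links point at the same page in different spellings
foundLinks = [
    '/wiki/Kevin_Bacon',
    '/wiki/Kevin_Bacon#Early_life',
    'http://en.wikipedia.org/wiki/Kevin_Bacon',
    '/wiki/Footloose_(1984_film)',
]
for href in foundLinks:
    pages.add(normalizeLink(href))

print(len(pages))
print(sorted(pages))
```

Because the set keeps only unique elements, the three spellings of the Kevin Bacon link collapse into a single entry once they are normalized.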

In [41]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

# Initially, getLinks is called with an empty URL
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

To show you the full effect of how this web crawling business works, I’ve relaxed the
standards of what constitutes an internal link (from previous examples). Rather than
limit the scraper to article pages, it looks for all links that begin with /wiki/, regardless
of where they are on the page, and regardless of whether they contain colons.

Initially getLinks is called with an empty URL. This is translated as “the front page
of Wikipedia” as soon as the empty URL is prepended with http://en.wikipedia.org inside the function. Then, each link on the first page is iterated through and
a check is made to see whether it is in the global set of pages (a set of pages that the
script has encountered already). If not, it is added to the set, printed, and getLinks is called recursively on it.

If left running long enough, the preceding program will almost certainly crash.
Python has a default recursion limit (the number of times a program can
recursively call itself) of 1,000. Because Wikipedia’s network of links is
extremely large, this program will eventually hit that recursion limit and stop,
unless you put in a recursion counter or something to prevent that from happening.
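One way around the recursion limit is to keep the pages still to be visited in an explicit queue instead of calling the function recursively. The sketch below is not the book's code: it swaps recursion for a breadth-first loop, and it walks a small made-up link graph (linkGraph) rather than fetching real pages, so crawl and its maxPages cap are assumptions for illustration.

```python
import sys
from collections import deque

# Hypothetical in-memory link graph standing in for a real site:
# each key is a page, each value is the list of pages it links to
linkGraph = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A', '/wiki/D'],
    '/wiki/C': ['/wiki/D'],
    '/wiki/D': [],
}

def crawl(startPage, getLinks, maxPages=1000):
    """Breadth-first crawl that uses a queue, so call depth never grows."""
    seen = {startPage}
    queue = deque([startPage])
    order = []
    while queue and len(order) < maxPages:
        page = queue.popleft()
        order.append(page)
        for link in getLinks(page):
            if link not in seen:
                # New page: remember it and schedule it for a later visit
                seen.add(link)
                queue.append(link)
    return order

print('Recursion limit on this interpreter:', sys.getrecursionlimit())
visited = crawl('/wiki/A', lambda page: linkGraph.get(page, []))
print(visited)
```

In a real crawler the lambda would be replaced by a function that downloads the page and returns its links; the maxPages cap plays the role of the recursion counter mentioned above.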

### Collecting Data Across an Entire Site

Web crawlers would be fairly boring if all they did was hop from one page to the
other. To make them useful, you need to be able to do something on the page while
you’re there. Let’s look at how to build a scraper that collects the title, the first
paragraph of content, and the link to edit the page (if available).

For every link found on http://en.wikipedia.org, the crawler follows the link, finds the h1 title, and then prints the first paragraph of text and the edit link. For example, when the first link followed is https://en.wikipedia.org/wiki/Daytona_USA, the title should be "Daytona USA" and the text should begin "Daytona USA[a] is an arcade racing game developed by Sega AM2 and published by Sega in March 1994."


In [42]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages #global variable
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text()) #find titles defined under h1
        print(bs.find(id ='mw-content-text').find_all('p')[0]) #access 1st paragraph of text i.e. p[0]
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href']) #find edit links
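    # find_all('p') returns an empty list on pages without paragraphs, so indexing [0]
    # raises the IndexError recorded below; catch it alongside AttributeError
    except (AttributeError, IndexError):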
        print('This page is missing something! Continuing.')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

Main Page
<p><i><b><a href="/wiki/Fallout_(video_game)" title="Fallout (video game)">Fallout: A Post Nuclear Role Playing Game</a></b></i> is a 1997 <a href="/wiki/Role-playing_video_game" title="Role-playing video game">role-playing video game</a> developed and published by <a href="/wiki/Interplay_Entertainment" title="Interplay Entertainment">Interplay Productions</a>. Set in a post-apocalyptic world in the mid–22nd century, it revolves around the player character seeking a replacement computer chip for their underground nuclear shelter's water supply system. The gameplay involves interacting with other survivors and engaging in <a href="/wiki/Timekeeping_in_games#Turn-based" title="Timekeeping in games">turn-based</a> combat. <i>Fallout</i> started development in 1994 as a <a href="/wiki/Game_engine" title="Game engine">game engine</a> designed by <a href="/wiki/Tim_Cain" title="Tim Cain">Tim Cain</a> <i>(pictured)</i>. It was originally based on <i><a href="/wiki/GURPS" title="GUR

IndexError: list index out of range

The for loop in this program is essentially the same as it was in the original crawling
program (with the addition of printed dashes for clarity, separating the printed
content).

Because you can never be entirely sure that all the data is on each page, each print
statement is arranged in the order that it is likeliest to appear on the site. That is, the
h1 title tag appears on every page (as far as I can tell, at any rate) so you attempt to get
that data first. The text content appears on most pages (except for file pages), so that
is the second piece of data retrieved. The Edit button appears only on pages in which
both titles and text content already exist, but it does not appear on all of those pages.

#### Handling Redirects

Redirects allow a web server to point one domain name or URL to a piece of content
at a different location. There are two types of redirects:

* Server-side redirects, where the URL is changed before the page is loaded
* Client-side redirects, sometimes seen with a “You will be redirected in 10 seconds”
type of message, where the page loads before redirecting to the new one

With server-side redirects, you usually don’t have to worry. If you’re using the urllib
library with Python 3.x, it handles redirects automatically! If you’re using the requests
library, GET requests follow redirects by default; the allow_redirects flag makes that explicit:

    r = requests.get('http://github.com', allow_redirects=True)
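Client-side redirects are different: the page loads normally and the redirect instruction sits inside the HTML, so urllib will not follow it for you. A minimal sketch of spotting one, assuming the page uses an HTML meta refresh tag (one common mechanism; JavaScript redirects are not covered here). MetaRefreshParser and findClientRedirect are hypothetical names, and the sample page is made up.

```python
from html.parser import HTMLParser

class MetaRefreshParser(HTMLParser):
    """Records the target of the first meta refresh tag on a page."""
    def __init__(self):
        super().__init__()
        self.redirectUrl = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.redirectUrl is not None:
            return
        attrs = dict(attrs)
        if (attrs.get('http-equiv') or '').lower() != 'refresh':
            return
        # The content attribute looks like "10; url=http://example.com/new"
        content = attrs.get('content') or ''
        delay, _, target = content.partition(';')
        target = target.strip()
        if target.lower().startswith('url='):
            self.redirectUrl = target[4:].strip()

def findClientRedirect(htmlText):
    """Return the redirect target found in htmlText, or None."""
    parser = MetaRefreshParser()
    parser.feed(htmlText)
    return parser.redirectUrl

samplePage = (
    '<html><head>'
    '<meta http-equiv="refresh" content="10; url=http://example.com/new">'
    '</head><body>You will be redirected in 10 seconds</body></html>'
)
print(findClientRedirect(samplePage))
```

If a target is found, the scraper can request that URL itself instead of waiting for a browser to do it.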

### Crawling Across the Internet

Before you start writing a crawler that follows all outbound links willy-nilly, you
should ask yourself a few questions:
* What data am I trying to gather? Can this be accomplished by scraping just a few
predefined websites (almost always the easier option), or does my crawler need to
be able to discover new websites I might not know about?
* When my crawler reaches a particular website, will it immediately follow the next
outbound link to a new website, or will it stick around for a while and drill down
into the current website?
* Are there any conditions under which I would not want to scrape a particular
site? Am I interested in non-English content?
* How am I protecting myself against legal action if my web crawler catches the
attention of a webmaster on one of the sites it runs across? (Check out Chapter 18
for more information on this subject.)

In [48]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
#random.seed(datetime.datetime.now())

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme,urlparse(includeUrl).netloc)
    internalLinks = []
    #Finds all links that begin with a "/" or contain the site's own URL
    for link in bs.find_all('a',href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if(link.attrs['href'].startswith('/')):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

In [44]:
#Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" that do
    #not contain the current URL
    for link in bs.find_all('a',href=re.compile('^(http|www)((?!'+excludeUrl+').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

In [45]:
def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs,urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme,urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(internalLinks[random.randint(0,len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]


In [46]:
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)
followExternalOnly('http://oreilly.com')

External links are not always guaranteed to be found on the first page of a website. To
find external links in this case, a method similar to the one used in the previous
crawling example is employed to recursively drill down into a website until it finds an
external link.

Any external links > yes > return random external link 
Any external links > no > go to an internal link on page > get all external links on that page > any external links > yes > return random external link

Note that if an external link is not found anywhere on a site that this crawler
encounters (unlikely, but it’s bound to happen at some point if you run it for
long enough), this program will keep running until it hits Python’s recursion limit.
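The decision flow above can be written as a loop with a hop limit, so a site with no external links ends the search instead of exhausting the recursion limit. This is a sketch, not the book's code: it works on a made-up dictionary (siteMap) that maps each page to its external and internal links, and findExternalLink and maxHops are hypothetical names.

```python
import random

# Hypothetical site map: page -> (external links, internal links)
siteMap = {
    '/': ([], ['/blog']),
    '/blog': (['http://example.org'], ['/']),
}

def findExternalLink(page, site, maxHops=50):
    """Follow internal links until an external one turns up or maxHops runs out."""
    for _ in range(maxHops):
        externalLinks, internalLinks = site[page]
        if externalLinks:
            # Any external links > yes > return a random one
            return random.choice(externalLinks)
        if not internalLinks:
            # Dead end: nothing left to follow on this site
            return None
        # Any external links > no > go to an internal link and try again
        page = random.choice(internalLinks)
    return None

print(findExternalLink('/', siteMap))
```

Returning None gives the caller a chance to pick a different starting site rather than crashing.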

One easy way to increase the robustness of this crawler would be to
combine it with the connection exception-handling code in Chapter 1.
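As a reminder of what that exception handling can look like, here is a minimal sketch using urllib's own exception classes. The helper name safeOpen and the 10-second timeout are assumptions for illustration; the crawler functions above would call it in place of urlopen and skip any page for which it returns None.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def safeOpen(url):
    """Return the response for url, or None if the page cannot be fetched."""
    try:
        return urlopen(url, timeout=10)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print('HTTP error for {}: {}'.format(url, e.code))
    except URLError as e:
        # The server could not be reached at all
        print('Could not reach {}: {}'.format(url, e.reason))
    except ValueError as e:
        # The string is not a usable URL (for example, it has no scheme)
        print('Bad URL {}: {}'.format(url, e))
    return None

# A malformed URL is rejected before any network traffic happens
print(safeOpen('not-a-valid-url'))
```

HTTPError is listed before URLError because it is a subclass of it; with the order reversed, the more specific branch would never run.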

The nice thing about breaking up tasks into simple functions such as “find all external
links on this page” is that the code can later be easily refactored to perform a different
crawling task. For example, if your goal is to crawl an entire site for external links,
and make a note of each one, you can add the following function:

In [49]:
# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    domain = '{}://{}'.format(urlparse(siteUrl).scheme, urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html, 'html.parser')
    internalLinks = getInternalLinks(bs, domain)
    externalLinks = getExternalLinks(bs, domain)
    
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)

allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')

https://www.oreilly.com


KeyboardInterrupt: 

This code can be thought of as two loops (one gathering internal links, one gathering
external links) working in conjunction with each other.

# 4. Web Crawling Models

You may be asked to collect news articles or blog posts from a variety of websites,
each with different templates and layouts. One website’s h1 tag contains the title of the
article, another’s h1 tag contains the title of the website itself, and the article title is in 
    
    <span id="title">

You may need flexible control over which websites are scraped and how they’re scraped,
and a way to quickly add new websites or modify existing ones, as fast as possible,
without writing multiple lines of code.

You may be asked to scrape product prices from different websites, with the ultimate
aim of comparing prices for the same product. Perhaps these prices are in different
currencies, and perhaps you’ll also need to combine this with external data from
some other nonweb source.

## Planning and Defining Objects

If you want to collect
product data, you may first look at a clothing store and decide that each product you
scrape needs to have the following fields:

* product name
* price
* description
* sizes
* colours
* fabric type
* customer rating
* item SKU - from another website

Although clothing may be a great start, you also want to make sure you can extend
this crawler to other types of products. You start perusing product sections of other
websites and decide you also need to collect this information:

* hardcover/paperback
* matt/glossy paint
* number of customer reviews
* link to manufacturer

Clearly, this is an unsustainable approach. Simply adding attributes to your product
type every time you see a new piece of information on a website will lead to far too
many fields to keep track of. 

Not only that, but every time you scrape a new website,
you’ll be forced to perform a detailed analysis of the fields the website has and the
fields you’ve accumulated so far, and potentially add new fields (modifying your
Python object type and your database structure).

You need to limit the amount of information that you need to track to make it achievable.

Perhaps what you really want to do is compare product prices among multiple stores
and track those product prices over time. In this case, you need enough information
to uniquely identify the product, and that’s it:

* product title
* manufacturer
* product ID number
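Those three fields map naturally onto a small Python object. A minimal sketch using a dataclass; the class name Product, the attribute names, and the sample values are assumptions for illustration, not the book's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    """Just enough information to uniquely identify a product."""
    title: str
    manufacturer: str
    productId: str

# The same product scraped from two different stores (made-up values)
fromStoreA = Product('Rain Jacket', 'Acme Outdoors', 'AO-1234')
fromStoreB = Product('Rain Jacket', 'Acme Outdoors', 'AO-1234')

print(fromStoreA == fromStoreB)
print(len({fromStoreA, fromStoreB}))
```

Making the dataclass frozen lets instances be used in sets and as dictionary keys, which is convenient when matching the same product across stores and attaching prices to it over time.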

## Dealing with Different Website Layouts

One of the most impressive feats of a search engine such as Google is that it manages
to extract relevant and useful data from a variety of websites, having no upfront
knowledge about the website structure itself.

Fortunately, in most cases of web crawling, you’re not looking to collect data from
sites you’ve never seen before, but from a few, or a few dozen, websites that are preselected by a human. 

The most obvious approach is to write a separate web crawler or page parser for each
website. Each might take in a URL, string, or BeautifulSoup object, and return a
Python object for the thing that was scraped.

The following is an example of a Content class (representing a piece of content on a
website, such as a news article) and two scraper functions that take in a URL, fetch
the page as a BeautifulSoup object, and return an instance of Content:

In [50]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
        
def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find("h1").text
    lines = bs.find_all("p", {"class":"story-content"})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)

def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find("h1").text
    # attrs must be a dict ({"class": ...}); the original {"class","post-body"} was a set
    body = bs.find("div", {"class": "post-body"}).text
    return Content(url, title, body)

# url links do not exist anymore on this site
url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'
 
content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

# url links do not exist anymore on this site
url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)



AttributeError: 'NoneType' object has no attribute 'text'

As you start to add scraper functions for additional news sites, you might notice a
pattern forming. Every site’s parsing function does essentially the same thing:

* Selects the title element and extracts the text for the title
* Selects the main content of the article
* Selects other content items as needed
* Returns a Content object instantiated with the strings found previously

To make things even more convenient, rather than dealing with all of these tag argu‐
ments and key/value pairs, you can use the BeautifulSoup select function with a sin‐
gle string CSS selector for each piece of information you want to collect and put all of
these selectors in a dictionary object:

        class Content:
            """
            Common base class for all articles/pages
            """

            def __init__(self, url, title, body):
                self.url = url
                self.title = title
                self.body = body

            def print(self):
                """
                Flexible printing function controls output
                """
                print("URL: {}".format(self.url))
                print("TITLE: {}".format(self.title))
                print("BODY:\n{}".format(self.body))

        class Website:
            """
            Contains information about website structure
            """

            def __init__(self, name, url, titleTag, bodyTag):
                self.name = name
                self.url = url
                self.titleTag = titleTag
                self.bodyTag = bodyTag

Note that the Website class does not store information collected from the individual
pages themselves, but stores instructions about how to collect that data.  It simply stores the string tag h1 that indicates where
the titles can be found. This is why the class is called Website (the information here
pertains to the entire website) and not Content (which contains information from
just a single page).

Using these Content and Website classes you can then write a Crawler to scrape the
title and content of any URL that is provided for a given web page from a given
website:

In [51]:
import requests
from bs4 import BeautifulSoup

class Crawler:
    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')
        
    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a
        Beautiful Soup object and a selector. Returns an empty
        string if no object is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''
    
    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

In [52]:
# And here’s the code that defines the website objects and kicks off the process:

crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com','h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu', 'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com','h1', 'p.story-content']
]

websites = []

for row in siteData:
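    # Instantiate the Website class; calling the websites list itself raised the TypeError recorded below
    websites.append(Website(row[0], row[1], row[2], row[3]))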
    
crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')

TypeError: 'list' object is not callable

While this new method might not seem remarkably simpler than writing a new
Python function for each new website at first glance, imagine what happens when you
go from a system with 4 website sources to a system with 20 or 200 sources.

Each list of strings is relatively easy to write. It doesn’t take up much space. It can be
loaded from a database or a CSV file.
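A minimal sketch of the CSV option, using only the standard library. The column names, the loadWebsites helper, and the inline CSV text are assumptions for illustration; the Website class is repeated here only so the snippet runs on its own.

```python
import csv
import io

class Website:
    """Contains information about website structure"""
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

# Stand-in for a real file on disk; one row per site definition
csvText = """name,url,titleTag,bodyTag
O'Reilly Media,http://oreilly.com,h1,section#product-description
Brookings,http://www.brookings.edu,h1,div.post-body
"""

def loadWebsites(csvFile):
    """Build one Website object per CSV row."""
    reader = csv.DictReader(csvFile)
    return [
        Website(row['name'], row['url'], row['titleTag'], row['bodyTag'])
        for row in reader
    ]

websites = loadWebsites(io.StringIO(csvText))
for site in websites:
    print(site.name, site.titleTag, site.bodyTag)
```

With a real file, io.StringIO(csvText) would be replaced by an open file object; adding a twentieth or two-hundredth site then means adding a row, not writing a function.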

## Structuring Crawlers

Creating flexible and modifiable website layout types doesn’t do much good if you
still have to locate each link you want to scrape by hand. The previous chapter
showed various methods of crawling through websites and finding new pages in an
automated way.


This section shows how to incorporate these methods into a well-structured and
expandable website crawler that can gather links and discover data in an automated
way.

### Crawling Sites Through Search

One of the easiest ways to crawl a website is via the same method that humans do:
using the search bar. A few observations make this approach workable:

* Most sites retrieve a list of search results for a particular topic by passing that
topic as a string through a parameter in the URL. For example:
http://example.com?search=myTopic. The first part of this URL can be saved as a property
of the Website object, and the topic can simply be appended to it.

* After searching, most sites present the resulting pages as an easily identifiable list
of links, usually with a convenient surrounding tag such as <span class="result">,
the exact format of which can also be stored as a property of the Website object.

* Each result link is either a relative URL (e.g., /articles/page.html) or an absolute
URL (e.g., http://example.com/articles/page.html). Whether or not you are expecting
an absolute or relative URL can be stored as a property of the Website object.

* After you’ve located and normalized the URLs on the search page, you’ve successfully
reduced the problem to the example in the previous section—extracting
data from a page, given a website format.
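The first and third points can be sketched with urllib.parse alone. The helper names buildSearchUrl and resolveResultUrl are hypothetical, and URL-encoding the topic with quote_plus is an addition of this sketch (the crawler below appends the topic as-is).

```python
from urllib.parse import quote_plus, urljoin

def buildSearchUrl(searchUrl, topic):
    """Append the topic to the site's search URL, encoding spaces and symbols."""
    return searchUrl + quote_plus(topic)

def resolveResultUrl(siteUrl, href, absoluteUrl):
    """Turn a search-result link into an absolute URL."""
    if absoluteUrl:
        # The site already gives full URLs in its result list
        return href
    # Relative link such as /articles/page.html: attach it to the site URL
    return urljoin(siteUrl, href)

print(buildSearchUrl('http://example.com?search=', 'data science'))
print(resolveResultUrl('http://example.com', '/articles/page.html', False))
print(resolveResultUrl('http://example.com', 'http://example.com/articles/page.html', True))
```

Both result links end up in the same absolute form, which is the normalization step the last point refers to.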

In [53]:
class Content:
    """Common base class for all articles/pages"""
    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url

    def print(self):
        """
        Flexible printing function controls output
        """
        print("New article found for topic: {}".format(self.topic))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))
        print("URL: {}".format(self.url))

In [54]:
class Website:
    """Contains information about website structure"""
    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl # defines where you should go to get search results if you append the topic you are looking for.
        self.resultListing = resultListing # holds information about each result
        self.resultUrl = resultUrl #defines the tag inside this box that will give you the exact URL for the result
        self.absoluteUrl=absoluteUrl #  tells you whether these search results are absolute or relative URLs.
        self.titleTag = titleTag
        self.bodyTag = bodyTag

In [None]:
import requests
from bs4 import BeautifulSoup

class Crawler:
    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')
    
    def safeGet(self, pageObj, selector):
        childObj = pageObj.select(selector)
        if childObj is not None and len(childObj) > 0:
            return childObj[0].get_text()
        return ""

    def search(self, topic, site):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = self.getPage(site.searchUrl + topic)
        if bs is None:
            print("Could not load the search results page. Skipping!")
            return
        searchResults = bs.select(site.resultListing)
        for result in searchResults:
            url = result.select(site.resultUrl)[0].attrs["href"]
            # Check to see whether it's a relative or an absolute URL
            if(site.absoluteUrl):
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print("Something was wrong with that page or URL. Skipping!")
                return
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                # Argument order must match Content.__init__(topic, url, title, body)
                content = Content(topic, url, title, body)
                content.print()


crawler = Crawler()

siteData = [
# Each row needs 8 fields: the search URL and the result-listing selector are separate items
['O\'Reilly Media', 'http://oreilly.com', 'https://ssearch.oreilly.com/?q=', 'article.product-result', 'p.title a', True, 'h1', 'section#product-description'],
['Reuters', 'http://reuters.com', 'http://www.reuters.com/search/news?blob=', 'div.search-result-content', 'h3.search-result-title a', False, 'h1', 'div.StandardArticleBody_body_1gnLA'],
['Brookings', 'http://www.brookings.edu', 'https://www.brookings.edu/?s=', 'div.list-content article', 'h4.title a', True, 'h1', 'div.post-body']
]

sites = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2],row[3], row[4], row[5], row[6], row[7]))
    
topics = ['python', 'data science']
for topic in topics:
    print("GETTING INFO ABOUT: " + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)

# 5. Scrapy

# 10. Crawling Through Forms and Logins

# 13. Image Processing and Text Recognition

# 14. Avoiding Scraping Traps

# 15. Testing Your Website with Scrapers

# 16. Web Crawling in Parallel

# 17. Scraping Remotely