# Web Scraping with Python - Collecting More Data from the Modern Web
https://edu.anarcho-copy.org/Programming%20Languages/Python/Web%20Scraping%20with%20Python,%202nd%20Edition.pdf

# Chapter 1: Your First Web Scraper

In [2]:
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


## Install and run BeautifulSoup

In [3]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')  # parse the page's HTML into a BeautifulSoup object, which is nested: html > body > h1
print(bs.h1)  # print the first h1 tag found on the page

<h1>An Interesting Title</h1>


          The BeautifulSoup object has the following structure:
        html → <html><head>...</head><body>...</body></html>
             head → <head><title>A Useful Page</title></head>
                 title → <title>A Useful Page</title>
             body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
                 h1 → <h1>An Interesting Title</h1>
                 div → <div>Lorem Ipsum dolor...</div>

In [5]:
bs.html.body.h1

<h1>An Interesting Title</h1>

In [6]:
bs.body.h1

<h1>An Interesting Title</h1>

In [7]:
bs.html.h1

<h1>An Interesting Title</h1>

In [8]:
bs.h1

<h1>An Interesting Title</h1>

#### Another popular parser is lxml, which has the benefit of parsing messy or malformed HTML code

In [9]:
pip install lxml 

Note: you may need to restart the kernel to use updated packages.


In [10]:
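html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # reopen: the earlier response was already consumed by html.read()
bs = BeautifulSoup(html.read(), 'lxml')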

#### Another popular HTML parser is html5lib. Like lxml, html5lib is an extremely forgiving parser that takes even more initiative correcting broken HTML

In [11]:
pip install html5lib

Note: you may need to restart the kernel to use updated packages.


In [12]:
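html = urlopen('http://www.pythonscraping.com/pages/page1.html')  # reopen: the earlier response was already consumed by html.read()
bs = BeautifulSoup(html.read(), 'html5lib')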

In [13]:
bs.html.h1

### Connecting Reliably and Handling Exceptions

The web is messy. Data is poorly formatted, websites go down, and closing tags go
missing. One of the most frustrating experiences in web scraping is to go to sleep
with a scraper running, dreaming of all the data you’ll have in your database the next
day—only to find that the scraper hit an error on some unexpected data format and
stopped execution shortly after you stopped looking at the screen.

In [14]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')

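Two things can go wrong on this line: the page is not found on the server (or there was an error retrieving it), or the server is not found at all. In the first case, an HTTP error is returned, for example 404 Page Not Found or 500 Internal Server Error, and urlopen throws an HTTPError: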

In [16]:
from urllib.request import urlopen
from urllib.error import HTTPError
try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass  # an else block needs at least one statement; comments alone raise an IndentationError

If an HTTP error code is returned, the program now prints the error, and does not
execute the rest of the program under the else statement.

If the server is not found at all (if, say, http://www.pythonscraping.com is down, or the
URL is mistyped), urlopen will throw a URLError.

You can add a check to see whether this is the case:

In [17]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')

The server could not be found!
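
The order of the two except clauses above matters, because HTTPError is a subclass of URLError. The following is a minimal sketch of the same check wrapped in a reusable helper. The function name `open_page` and its `opener` parameter are hypothetical, added here only so the logic can be tried without a network connection.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def open_page(url, opener=urlopen):
    """Return (response, None) on success, or (None, reason) on failure.

    `opener` is a hypothetical hook so the function can be exercised
    offline; by default it is urlopen itself.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return None, 'HTTP error {}'.format(e.code)
    except URLError:
        # The server could not be reached at all (down, or URL mistyped).
        return None, 'server not found'
    return response, None


# HTTPError inherits from URLError, so it has to be caught first;
# otherwise the URLError clause would swallow HTTP errors as well.
print(issubclass(HTTPError, URLError))  # → True
```

Returning a reason string instead of printing is just one possible design; it lets the caller decide what "Plan B" should be.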


Of course, if the page is retrieved successfully from the server, there is still the issue of
the content on the page not quite being what you expected. Every time you access a
tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually
exists.

In [18]:
print(bs.nonExistentTag)  # nonExistentTag is a made-up tag, not the name of a real BeautifulSoup function; accessing it returns None


None


In [20]:
print(bs.nonExistentTag.someTag)

AttributeError: 'NoneType' object has no attribute 'someTag'

In [21]:
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)


Tag was not found


This checking and handling of every error does seem laborious at first, but it’s easy to
add a little reorganization to this code to make it less difficult to write. This code, for example, is our same scraper
written in a slightly different way:

In [22]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):  # returns the title of the page, or None if it can't be retrieved
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>


In this example, you’re creating a function getTitle, which returns either the title of
the page, or a None object if there was a problem retrieving it. Inside getTitle, you
check for an HTTPError, as in the previous example, and encapsulate two of the
BeautifulSoup lines inside one try statement.

# Chapter 2: Advanced HTML Parsing

Let’s say you have some target content. Maybe it’s a name, statistic, or block of text.
Maybe it’s buried 20 tags deep in an HTML mush with no helpful tags or HTML
attributes to be found. Before writing a long, brittle chain of tag lookups to reach it, consider these alternatives:

1. Look for a “Print This Page” link, or perhaps a mobile version of the site that has
better-formatted HTML
2. Look for the information hidden in a JavaScript file. Remember, you might need
to examine the imported JavaScript files in order to do this.
3. This is more common for page titles, but the information might be available in
the URL of the page itself.
4. If the information you are looking for is unique to this website for some reason,
you’re out of luck. If not, try to think of other sources you could get this information
from. Is there another website with the same data? Is this website displaying
data that it scraped or aggregated from another website?


CSS relies on the differentiation of HTML elements that might otherwise have the exact
same markup in order to style them differently. Some tags might look like this:

    <span class="green"></span>
    <span class="red"></span>

Web scrapers can easily separate these two tags based on their class; for example, they
might use BeautifulSoup to grab all the red text but none of the green text. Because
CSS relies on these identifying attributes to style sites appropriately, you are almost
guaranteed that these class and ID attributes will be plentiful on most modern
websites.

In [24]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')


In [25]:
print(bs.html.h1)

<h1>War and Peace</h1>


In [26]:
nameList = bs.find_all('span', {'class':'green'})
for name in nameList:
    print(name.get_text())  # prints all the green text in the order it appears on the War and Peace page
                            # this can be checked by opening the URL in a browser

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


BeautifulSoup’s find() and find_all() are the two functions you will likely use the
most. With them, you can easily filter HTML pages to find lists of desired tags, or a
single tag, based on their various attributes

find_all(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)

In all likelihood, 95% of the time you will need to use only the first two arguments:
tag and attributes. However, let’s take a look at all the arguments in greater detail.


In [27]:
bs.find_all(['h1','h2','h3','h4','h5','h6']) #lists all headers in the doc

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

In [28]:
bs.find_all('span', {'class':{'green', 'red'}}) #returns green and red span tags in html doc

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span clas

The recursive argument is a boolean that controls how deeply into the document the search goes. If recursive is True, the find_all function looks into children, and children’s children, for tags that match your parameters. If it is False, it will look only at
the top-level tags in your document. By default, find_all works recursively (recursive is set to True); it’s generally a good idea to leave this as is, unless you really know
what you need to do and performance is an issue.
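
As a rough illustration of the recursive argument, the sketch below parses a small inline HTML string (hypothetical, not one of the example pages) and compares the two settings. Here find_all is called on the body tag, so "top level" means the direct children of body.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <p> directly under <body>, one nested in a <div>.
html = '<html><body><p>top</p><div><p>nested</p></div></body></html>'
bs = BeautifulSoup(html, 'html.parser')

# Default (recursive=True): search children, grandchildren, and so on.
all_paragraphs = bs.body.find_all('p')

# recursive=False: only look at the direct children of <body>.
direct_paragraphs = bs.body.find_all('p', recursive=False)

print(len(all_paragraphs), len(direct_paragraphs))  # → 2 1
```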

In [29]:
# if you want to find the number of times “the prince” is surrounded by tags on the example page
nameList = bs.find_all(text='the prince')
print(len(nameList))


7


In [32]:
# The keyword argument allows you to select tags that contain a particular attribute or set of attributes. For example:
title = bs.find_all(id='title', class_='text')


#### recap

BeautifulSoup objects = Instances seen in previous code examples as the variable bs.

Tag objects = Retrieved in lists, or retrieved individually, by calling find and find_all on a BeautifulSoup object, or by drilling down, as follows: bs.div.h1

However, there are two more objects in the library that, although less commonly used, are still important to know about:

NavigableString objects = Used to represent text within tags, rather than the tags themselves (some functions operate on and produce NavigableStrings, rather than tag objects).

Comment objects = Used to find HTML comments in comment tags, <!--like this one-->.

These four objects are the only objects you will ever encounter in the BeautifulSoup library.
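
To make the recap concrete, here is a small sketch that produces one object of each kind from an inline HTML string. The snippet is hypothetical and only meant to show which type comes back from which access pattern.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical snippet containing a tag, some text, and a comment.
html = '<div><h1>An Interesting Title</h1><!--like this one--></div>'
bs = BeautifulSoup(html, 'html.parser')

h1 = bs.div.h1                 # Tag object, reached by drilling down
text = h1.string               # NavigableString: the text inside the tag
comment = bs.div.contents[-1]  # Comment: the last node inside the div

print(type(bs).__name__, type(h1).__name__,
      type(text).__name__, type(comment).__name__)
```

Note that Comment is itself a kind of NavigableString, so code that handles page text may want to filter comments out explicitly.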


### dealing with children and other descendants

Example page (a shopping site): http://www.pythonscraping.com/pages/page3.html

It has a tree-like HTML structure:

- body
    - div.wrapper
        - h1
        - div.content
        - table#giftList
            - tr
                - th
                - th
                - th
                - th
            - tr.gift#gift1
                - td
                - td
                    - span.excitingNote
                - td
                - td
                    - img
            - ...table rows continue...
        - div.footer

In the BeautifulSoup library, as well as many other libraries, there is a distinction
drawn between children and descendants: much like in a human family tree, children
are always exactly one tag below a parent, whereas descendants can be at any level in
the tree below a parent. For example, the tr tags are children of the table tag,
whereas tr, th, td, img, and span are all descendants of the table tag (at least in our
example page). All children are descendants, but not all descendants are children
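
The distinction can be checked on a cut-down, hypothetical version of the gift table. This sketch filters out text nodes so that only tag names are listed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened version of the gift table described above.
html = ('<table id="giftList">'
        '<tr><th>Item Title</th><th>Cost</th></tr>'
        '<tr class="gift"><td>Vegetable Basket</td>'
        '<td><span class="excitingNote">Now with peppers!</span></td></tr>'
        '</table>')
table = BeautifulSoup(html, 'html.parser').find('table', {'id': 'giftList'})

# .children yields only the nodes exactly one level below the table.
children = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks the whole subtree, in document order.
descendants = [node.name for node in table.descendants if isinstance(node, Tag)]

print(children)     # → ['tr', 'tr']
print(descendants)  # → ['tr', 'th', 'th', 'tr', 'td', 'td', 'span']
```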

In [36]:
# If you want to find only descendants that are children, you can use the .children attribute:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

# This code prints the list of product rows in the giftList table, including the initial
# row of column labels. If you were to write it using the .descendants attribute
# instead of .children, about two dozen tags would be found within the
# table and printed, including img tags, span tags, and individual td tags.



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


### dealing with siblings

In [38]:
# The BeautifulSoup next_siblings attribute makes it trivial to collect data from tables, especially ones with title rows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)

#The output of this code is to print all rows of products from the product table, except
#for the first title row. Why does the title row get skipped? Objects cannot be siblings
#with themselves. Anytime you get siblings of an object, the object itself will not be
#included in the list. As the name implies, next_siblings returns only the siblings that come after the object.



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

### dealing with parents

When scraping pages, you will likely discover that you need to find parents of tags less frequently than you need to find their children or siblings. 

Typically, when you look at HTML pages with the goal of crawling them, you start by looking at the top layer of tags, and then figure out how to drill your way down into the exact piece of data that you want. 

Occasionally, however, you can find yourself in odd situations that require BeautifulSoup’s parent-finding functions, .parent and .parents. For example:

In [40]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.find('img',{'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())

# .parent of the img tag is the td that wraps it; the previous sibling of that td is the td tag that contains the dollar value of the product


$15.00



### Regular expressions and BeautifulSoup

https://www.pythonscraping.com/pages/page3.html

Notice that the site has many product images, which take the following form:

    <img src="../img/gifts/img3.jpg">

If you wanted to grab URLs to all of the product images, it might seem fairly straightforward at first: just grab all the image tags by using .find_all("img"), right?

But here’s a problem. In addition to the obvious “extra” images (e.g., logos), modern websites often have hidden images, blank images used for spacing and aligning elements, and other random image tags you might not be aware of. Certainly, you can’t count on the only images on the page being product images.

Let’s also assume that the layout of the page might change, or that, for whatever reason, you don’t want to depend on the position of the image in the page in order to find the correct tag.

The solution is to look for something identifying about the tag itself. In this case, you
can look at the file path of the product images:

In [44]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})  # raw string avoids invalid-escape warnings
for image in images:
    print(image['src'])

# prints only the relative image paths that start with ../img/gifts/img and end in .jpg

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


### accessing attributes

With tag objects, a Python dictionary of attributes can be accessed by calling this:

myTag.attrs

The source location for an image, for example, can be found using the following line:

myImgTag.attrs['src']
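
A short sketch of attribute access on an inline, hypothetical img tag (modeled on the logo tag from the example page). The `.get()` call at the end is one way to avoid a KeyError when an attribute may be missing.

```python
from bs4 import BeautifulSoup

# Hypothetical tag, parsed from a string instead of a live page.
html = '<img src="../img/gifts/logo.jpg" style="float:left;"/>'
myImgTag = BeautifulSoup(html, 'html.parser').img

print(myImgTag.attrs)         # the full attribute dictionary
print(myImgTag.attrs['src'])  # → ../img/gifts/logo.jpg
print(myImgTag.get('alt'))    # → None (no alt attribute on this tag)
```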

### Lambda Expressions

A lambda expression is a function that is passed into another function as a variable; instead of defining a function as f(x, y), you may define a function as f(g(x), y) or even f(g(x), h(x)).

BeautifulSoup allows you to pass certain types of functions as parameters into the find_all function.


The only restriction is that these functions must take a tag object as an argument and return a boolean. 

Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to True are returned, while the rest are discarded.

For example, the following retrieves all tags that have exactly two attributes:


In [46]:
bs.find_all(lambda tag: len(tag.attrs) == 2)

#Here, the function that you are passing as the argument is len(tag.attrs) == 2.
#Where this is True, the find_all function will return the tag. That is, it will find tags
# with two attributes, such as the following:
# <div class="body" id="content"></div>
# <span style="color:red" class="title"></span>

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,005.00
 </td><td>
 <img src="../img/gifts/img3.jpg"/>
 </td>

In [47]:
# Lambda functions are so useful you can even use them to replace existing BeautifulSoup functions:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [48]:
#This can also be accomplished without a lambda function:
bs.find_all('', text='Or maybe he\'s only resting?')

["Or maybe he's only resting?"]

### Regular Expressions

Problem:
1. Write the letter a at least once.
2. Append to this the letter b exactly five times.
3. Append to this the letter c any even number of times.
4. Write either the letter d or e at the end.

In [55]:
# solution = aa*bbbbb(cc)*(d|e)

# aa*   = the letter a, followed by a* (any number of a's, including zero), so a is written at least once
# bbbbb = five b's in a row
# (cc)* = any number of pairs of c's (including zero), i.e. an even number of c's
# (d|e) = either d or e
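
The solution can be tried out with Python's built-in re module. This is a small sketch; the helper name `follows_rules` and the test strings are made up for illustration.

```python
import re

# The pattern built from the four steps above.
PATTERN = re.compile(r'aa*bbbbb(cc)*(d|e)')


def follows_rules(candidate):
    """True if the whole string (not just a prefix) fits the pattern."""
    return PATTERN.fullmatch(candidate) is not None


print(follows_rules('abbbbbd'))         # → True  (one a, five b's, no c's, d)
print(follows_rules('aaaabbbbbcccce'))  # → True  (four c's is an even number)
print(follows_rules('bbbbbd'))          # → False (no leading a)
print(follows_rules('abbbbbcd'))        # → False (odd number of c's)
print(follows_rules('abbbbd'))          # → False (only four b's)
```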

### for testing regular expressions

https://www.regexpal.com/

In [None]:
# *     = matches the preceding character, subexpression, or bracketed character 0 or more times, e.g. a*b* = aaaaaaaaa, aaabbbbbbbb, bbbbbb
# +     = matches the preceding character 1 or more times, e.g. a+b+ = aaaaab, aaabbbbb, abbbbb
# []    = matches any character within the brackets, e.g. [A-Z]* = APPLE, CAPITALS, QUERY
# ()    = a grouped subexpression, e.g. (a*b)* = aaabaaab, abaaab, ababaaab
# {m,n} = matches the preceding character between m and n times, inclusive, e.g. a{2,3}b{2,3} = aabbb, aaabbb, aabb
# [^]   = matches any single character that is not in the brackets, e.g. [^A-Z]* = apple, lowercase, qwerty
# |     = matches any character or string separated by |, e.g. b(a|i|e)d = bad, bid, bed
# .     = matches any single character (including symbols, numbers, a space, etc.), e.g. b.d = bad, bzd, b$d, b d
# ^     = indicates that a character or subexpression occurs at the beginning of a string, e.g. ^a = apple, asdf, a
# \     = escape character that lets special characters be used with their literal meanings, e.g. \.\|\\ = .|\
# $     = often used at the end of a regular expression; means "match this up to the end of the string", e.g. [A-Z]*[a-z]*$ = ABCabc, zzzyx, Bob
# ?!    = "does not contain", e.g. ^((?![A-Z]).)*$ = no-caps-here, $ymb0ls a4e f!ne
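
As an alternative to an online tester, the symbols above can be checked directly in Python. The sketch below pairs each pattern with one string that should match in full and one that should not; the non-matching strings are made up for illustration, and the curly-brace form {m,n} is used for bounded repetition.

```python
import re

# (pattern, string that should match in full, string that should not)
examples = [
    (r'a*b*', 'aaabbbbb', 'aba'),
    (r'a+b+', 'abbbbb', 'bbbb'),
    (r'[A-Z]*', 'CAPITALS', 'Capitals'),
    (r'(a*b)*', 'abaaab', 'aba'),
    (r'a{2,3}b{2,3}', 'aabbb', 'abbb'),
    (r'[^A-Z]*', 'lowercase', 'Lowercase'),
    (r'b(a|i|e)d', 'bid', 'bud'),
    (r'b.d', 'b$d', 'bd'),
    (r'^((?![A-Z]).)*$', 'no-caps-here', 'Some caps'),
]

for pattern, good, bad in examples:
    # fullmatch succeeds only if the entire string fits the pattern.
    assert re.fullmatch(pattern, good) is not None
    assert re.fullmatch(pattern, bad) is None
    print(pattern, 'matches', repr(good), 'but not', repr(bad))
```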


In [None]:
# EMAIL ADDRESS EXAMPLE

 # [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)
    
 # the first part contains at least one of: uppercase/lowercase letters, the numbers 0-9, periods (.), plus signs (+), or underscores (_)
 # next comes the @ symbol
 # then at least one uppercase or lowercase letter (the domain name)
 # followed by a period (.)
 # and finally com, org, edu, or net
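
A quick way to see what this simplified pattern accepts and rejects is to run it with re. The helper name `looks_like_email` and the sample addresses are hypothetical.

```python
import re

# The simplified pattern from the notes above, compiled once for reuse.
EMAIL = re.compile(r'[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)')


def looks_like_email(address):
    """Return True if the whole string fits the simplified pattern."""
    return EMAIL.fullmatch(address) is not None


print(looks_like_email('first.last+tag@example.com'))  # → True
print(looks_like_email('user@example.io'))             # → False (.io is not listed)
print(looks_like_email('user@sub.example.com'))        # → False (no subdomains)
```

The last two results show the limits of the pattern: it is a teaching example, not a complete email validator.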

# Chapter 3: Writing Web Crawlers

Web crawlers are called such because they crawl across the web. At their core is an element of recursion. They must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.

With web crawlers, you must be extremely conscientious of how much bandwidth you are using and make every effort to determine whether there’s a way to make the target server’s load easier.

### Traversing a single Domain

In this section, you’ll begin a project that will become a Six Degrees of Wikipedia solution finder: You’ll be able to take the Eric Idle page and find the fewest number of link clicks that will take you to the Kevin Bacon page.

https://en.wikipedia.org/wiki/Kevin_Bacon
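
Finding the fewest link clicks between two pages is a shortest-path problem, which a breadth-first search can solve. The sketch below runs on a tiny hand-written link graph with placeholder page names (not real Wikipedia data); the function name `fewest_clicks` is also hypothetical. A real solution would fill the graph by crawling.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
# These are placeholder names, not real Wikipedia data.
links = {
    '/wiki/Page_A': ['/wiki/Page_B', '/wiki/Page_C'],
    '/wiki/Page_B': ['/wiki/Page_D'],
    '/wiki/Page_C': ['/wiki/Page_D', '/wiki/Page_E'],
    '/wiki/Page_D': ['/wiki/Page_E'],
    '/wiki/Page_E': [],
}


def fewest_clicks(graph, start, target):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        for next_page in graph.get(page, []):
            if next_page not in seen:
                seen.add(next_page)
                queue.append(path + [next_page])
    return None


print(fewest_clicks(links, '/wiki/Page_A', '/wiki/Page_E'))
# → ['/wiki/Page_A', '/wiki/Page_C', '/wiki/Page_E']
```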


In [50]:
# You should already know how to write a Python script that retrieves an arbitrary
# Wikipedia page and produces a list of links on that page:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])


/wiki/Wikipedia:Protection_policy#semi
#mw-head
#searchInput
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia,_Pennsylvania
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Wikipedia:Citation_needed
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(1990_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/X-Men:_First_Class
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chan

Recently a friend of mine, while working on a similar Wikipedia-scraping project, mentioned he had written a large filtering function, with more than 100 lines of code, in order to determine whether an internal Wikipedia link was an article page.

Unfortunately, he had not spent much time up front trying to find patterns between “article links” and “other links.”

If you examine the links that point to article pages (as opposed to other internal pages), you’ll see that they all have three things in common:

• They reside within the div with the id set to bodyContent.
• The URLs do not contain colons.
• The URLs begin with /wiki/.
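
The last two rules can be captured in one regular expression, ^(/wiki/)((?!:).)*$, which is the expression used in this chapter's filtering code. As a sketch, it can be tried on a few hrefs copied from the unfiltered output above, without fetching the page; the first rule (the bodyContent div) still needs the parsed page.

```python
import re

# Starts with /wiki/ and contains no colon anywhere after it.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

# Sample hrefs taken from the unfiltered link listing earlier in this chapter.
hrefs = [
    '/wiki/Wikipedia:Protection_policy#semi',
    '#mw-head',
    '/wiki/Kevin_Bacon_(disambiguation)',
    '/wiki/File:Kevin_Bacon_SDCC_2014.jpg',
    '/wiki/Kyra_Sedgwick',
    'http://baconbros.com/',
]

article_links = [href for href in hrefs if ARTICLE_LINK.match(href)]
print(article_links)
# → ['/wiki/Kevin_Bacon_(disambiguation)', '/wiki/Kyra_Sedgwick']
```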


In [51]:
# You can use these rules to revise the code slightly to retrieve only the desired article
# links by using the regular expression ^(/wiki/)((?!:).)*$

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia,_Pennsylvania
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(1990_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/Streaming_television
/wiki/I_Love_Dick_(TV_series)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or