In [9]:
'''
To avoid crawling the same page twice, all internal links that are discovered must be
formatted consistently and kept in a running set for fast lookups while the program
runs. A set is similar to a list, but its elements have no specific order and it stores
only unique elements, which is ideal for our needs. Only links that are “new” should be
crawled and searched for additional links:
'''
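from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import ssl

# Every internal link seen so far; membership tests on a set are fast
pages = set()

def getlinks(pageUrl):
    global pages
    # PROTOCOL_TLSv1 is deprecated and pins the connection to TLS 1.0;
    # the default context negotiates a current TLS version and verifies certificates
    ssl_context = ssl.create_default_context()
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl), context=ssl_context)
    bs = BeautifulSoup(html, 'html.parser')
    # Only follow internal article links, whose href starts with /wiki/
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getlinks(newPage)

getlinks('')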
'''
Initially, getlinks is called with an empty URL. This is translated as “the front page
of Wikipedia” as soon as http://en.wikipedia.org is prepended to the empty URL
inside the function. Then, each link on the first page is iterated through and
a check is made to see whether it is in the global set of pages (a set of pages that the
script has encountered already). If not, it is added to the set, printed to the screen,
and the getlinks function is called recursively on it.
'''
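
The membership check described above compares href strings exactly, so two links that differ only in their #fragment count as different pages (the output of this cell lists /wiki/Wikipedia:Protection_policy three times, twice with a fragment). Here is a minimal sketch of one way to format links consistently before the lookup, assuming that dropping the fragment is acceptable. The names normalize_link and seen are hypothetical and not part of the original script:

```python
from urllib.parse import urldefrag

def normalize_link(href):
    """Return href without its #fragment, so anchors into one page compare equal."""
    url, _fragment = urldefrag(href)
    return url

# Hypothetical stand-in for the global set of pages
seen = set()

# Three hrefs taken from the cell output; all point at the same article
for href in ['/wiki/Wikipedia:Protection_policy#semi',
             '/wiki/Wikipedia:Protection_policy#template',
             '/wiki/Wikipedia:Protection_policy']:
    link = normalize_link(href)
    if link not in seen:
        # Only the first of the three is treated as a new page
        seen.add(link)
        print(link)
```

With this normalization the loop prints /wiki/Wikipedia:Protection_policy once, and the set ends up holding a single entry for the page.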

#Note:
'''
Python has a default recursion limit (the maximum depth of nested
function calls) of 1,000. Because Wikipedia’s network of links is
extremely large, this program will eventually hit that limit and stop
with a RecursionError, unless you add a recursion counter or some
other guard to prevent that from happening.
'''
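
The recursion counter mentioned in the note can be sketched as an extra depth argument that is checked before recursing. This is an illustration under stated assumptions, not the original script: crawl, fetch_links, visited, and max_depth are hypothetical names, and fetch_links stands in for the urlopen and BeautifulSoup step so that the example runs without network access:

```python
import sys

def crawl(page, fetch_links, visited, depth=0, max_depth=100):
    """Recursively follow new links, but never nest deeper than max_depth calls."""
    if depth >= max_depth:
        # Stop descending well before Python's recursion limit is reached
        return
    for link in fetch_links(page):
        if link not in visited:
            visited.add(link)
            crawl(link, fetch_links, visited, depth + 1, max_depth)

# Stand-in link graph: a chain of 5,000 pages, each linking only to the next.
# Without the depth check, following it recursively would exceed the default limit.
chain = {'/wiki/Page_{}'.format(i): ['/wiki/Page_{}'.format(i + 1)]
         for i in range(5000)}

visited = {'/wiki/Page_0'}
crawl('/wiki/Page_0', lambda page: chain.get(page, []), visited)

print('recursion limit:', sys.getrecursionlimit())
print('pages recorded:', len(visited))
```

The trade-off is that pages deeper than max_depth are recorded but never fetched. To reach them, the recursion would have to be replaced by an explicit queue of pages to visit.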

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_permissions
/wiki/Wikipedia:Protection_policy#template
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Identifying_reliable_sources/Perennial_sources
/wiki/Wikipedia:Identifying_reliable_sources
/wiki/Wikipedia:RS_(disambiguation)
/wiki/Wikipedia:Reliable_sources/Noticeboard
/wiki/Wikipedia:Reliable_sources
/wiki/Category:Wikipedia_content_guidelines
/wiki/Wikipedia:Maintenance
/wiki/Wikipedia:Directory
/wiki/Wikipedia:DIRECTORY
/wiki/Wikipedia:Notability
/wiki/Wikipedia:NPOV
/wiki/Wikipedia:Describing_points_of_view
/wiki/Wikipedia:NOTOPINION
/wiki/Wikipedia:Glossary
/wiki/Wikipedia:Wikipedia_abbreviations
/wiki/Wikipedia:Manual_of_Style/Abbreviations
/wiki/Wikipedia:Manual_of_Style
/wiki/Wikipedia:Policies_and_guidelines#guide
/wiki/Category:Wikipedia_procedural_policies
/wiki/Wi

KeyboardInterrupt: 