In [1]:
"""
As always, the first step to determine how best to do this is to look at a few pages
from the site and determine a pattern. By looking at a handful of Wikipedia pages
(both articles and nonarticle pages such as the privacy policy page), the following
things should be clear:
    • Every page title (whether the page is an article page, an edit history page,
    or any other page) sits under h1 → span tags, and these are the only
    h1 tags on the page.
    
    • As mentioned before, all body text lives under the div#bodyContent tag. However,
    if you want to get more specific and access just the first paragraph of text,
    you might be better off using div#mw-content-text → p (selecting the first paragraph
    tag only). This is true for all content pages except file pages (for example,
    https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg), which do not
    have sections of content text.
    
    • Edit links occur only on article pages. If they occur, they will be found in the
    li#ca-edit tag, under li#ca-edit → span → a.
"""
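
The three patterns above map directly onto CSS selectors. A minimal sketch, assuming a heavily simplified stand-in for a Wikipedia page: the `SAMPLE_HTML` string, its `href` value, and the `extract_fields` helper are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for a Wikipedia article page.
SAMPLE_HTML = """
<html><body>
  <h1><span>Example title</span></h1>
  <ul><li id="ca-edit"><span><a href="/wiki/Example?action=edit">Edit</a></span></li></ul>
  <div id="bodyContent">
    <div id="mw-content-text">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </div>
</body></html>
"""

def extract_fields(html):
    """Return (title, first_paragraph, edit_href); any item may be None."""
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.select_one('h1')                        # h1 -> span
    first_p = bs.select_one('div#mw-content-text p')   # first paragraph only
    edit = bs.select_one('li#ca-edit span a')          # li#ca-edit -> span -> a
    return (
        title.get_text() if title else None,
        first_p.get_text() if first_p else None,
        edit['href'] if edit else None,
    )

title, first_p, edit_href = extract_fields(SAMPLE_HTML)
print(title, first_p, edit_href, sep='\n')
```

Because `select_one` returns `None` when nothing matches, each field can be checked on its own instead of relying on one shared exception handler.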

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import ssl

pages=set()
def getlinks(pageUrl):
    global pages
    # Default context: verifies certificates and negotiates a current TLS version.
    ssl_context = ssl.create_default_context()
    html = urlopen('https://en.wikipedia.org{}'.format(pageUrl), context=ssl_context)
    bs = BeautifulSoup(html,'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except (AttributeError, IndexError):
        # A lookup returned None (AttributeError) or no <p> tag was found (IndexError).
        print('This page is missing something! continuing...')
    
    for link in bs.find_all('a',href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getlinks(newPage)
getlinks('')



Main Page
<p><a href="/wiki/Other_Worlds,_Universe_Science_Fiction,_and_Science_Stories" title="Other Worlds, Universe Science Fiction, and Science Stories"><i><b>Other Worlds</b></i>, <i><b>Universe Science Fiction</b></i>, <b>and</b> <i><b>Science Stories</b></i></a> were three related American magazines edited by <a href="/wiki/Raymond_A._Palmer" title="Raymond A. Palmer">Raymond A. Palmer</a>. As both publisher and editor of <i>Other Worlds</i> (1949–1953, 1955–1957), he presented a wide array of science fiction, including "Enchanted Village" by <a href="/wiki/A._E._van_Vogt" title="A. E. van Vogt">A. E. van Vogt</a> and "Way in the Middle of the Air", later included in <a href="/wiki/Ray_Bradbury" title="Ray Bradbury">Ray Bradbury</a>'s <i><a href="/wiki/The_Martian_Chronicles" title="The Martian Chronicles">The Martian Chronicles</a></i>. <i>Science Stories</i> (1953–1955) was visually attractive but contained no memorable fiction. <i>Universe Science Fiction</i> (also 1953–1955)

Wikipedia:Notability
<p class="mw-empty-elt">
</p>
This page is missing something! continuing...
--------------------
/wiki/Wikipedia:NPOV
Wikipedia:Neutral point of view
<p>All encyclopedic content on Wikipedia must be written from a <b>neutral point of view</b> (<b>NPOV</b>), which means representing fairly, proportionately, and, as far as possible, without editorial bias, all of the significant <a href="/wiki/Point_of_view_(philosophy)" title="Point of view (philosophy)">views</a> that have been <a href="/wiki/Wikipedia:Verifiability" title="Wikipedia:Verifiability">published by reliable sources</a> on a topic.
</p>
This page is missing something! continuing...
--------------------
/wiki/Wikipedia:Describing_points_of_view
Wikipedia:Describing points of view
<p>At Wikipedia, <b>points of view</b> (<b>POVs</b>) – <a class="mw-redirect" href="/wiki/Perspective_(cognitive)" title="Perspective (cognitive)">cognitive perspectives</a> – are often essential to articles which treat controve

Aristotle
<p class="mw-empty-elt">
</p>
This page is missing something! continuing...
--------------------
/wiki/Wikipedia:Good_articles
Wikipedia:Good articles
<p class="mw-empty-elt">
</p>
This page is missing something! continuing...
--------------------
/wiki/Wikipedia_talk:Good_article_nominations
Wikipedia talk:Good article nominations
<p>This is the <b>discussion</b> page of the <a href="/wiki/Wikipedia:Good_article_nominations" title="Wikipedia:Good article nominations">good article nominations</a> (GAN). To ask a question or start a discussion about the good article nomination process, click the New section link above. Questions may also be asked at the <a href="/wiki/Wikipedia:Good_article_help" title="Wikipedia:Good article help">GA Help desk</a>. To check and see if your question may already be answered, click to show the frequently asked questions below or search the archives below.
</p>
/w/index.php?title=Wikipedia_talk:Good_article_nominations&action=edit
---------------

Template:Noticeboard links
<p><code>{{Noticeboard links |state=collapsed |nostb=yes}}</code>
</p>
This page is missing something! continuing...
--------------------
/wiki/Template_talk:Noticeboard_links
Template talk:Noticeboard links
<p>The result of the move request was <b>moved to <a href="/wiki/Template:Noticeboard_links" title="Template:Noticeboard links">Template:Noticeboard links</a>.</b> –<b><a href="/wiki/User:Juliancolton" title="User:Juliancolton"><span style="font-family:Script MT;color:#36648B">Juliancolton</span></a></b> | <a href="/wiki/User_talk:Juliancolton" title="User talk:Juliancolton"><sup><span style="font-family:Verdana;color:gray"><i>Talk</i></span></sup></a> 00:21, 6 September 2009 (UTC)
</p>
/w/index.php?title=Template_talk:Noticeboard_links&action=edit
--------------------
/wiki/Wikipedia:Talk_page_guidelines
Wikipedia:Talk page guidelines
<p class="mw-empty-elt">
</p>
This page is missing something! continuing...
--------------------
/wiki/Help:Talk_pages
He

Wikipedia:IRC
<p>The <a href="/wiki/Freenode" title="Freenode">freenode</a> network (chat.freenode.net) has "<a class="mw-redirect" href="/wiki/Chat_rooms" title="Chat rooms">chat rooms</a>" dedicated to Wikipedia 24 hours a day, in which Wikipedians can engage in real-time discussions with each other. Many Wikipedians have chatting open in one window and hop back and forth between it and other windows in which they are working on Wikipedia. The chat rooms most relevant to English Wikipedia are <a href="#List_of_useful_channels">listed hereafter</a>; a more complete list of channels (for other language Wikipedias, other languages, and recent changes feeds) exists at <a class="extiw" href="https://meta.wikimedia.org/wiki/IRC/Channels" title="m:IRC/Channels">m:IRC/Channels</a>.
</p>
/w/index.php?title=Wikipedia:IRC&action=edit
--------------------
/wiki/Wikipedia:FORUM
Wikipedia:What Wikipedia is not
<p>Wikipedia is an online <a href="/wiki/Encyclopedia" title="Encyclopedia">encyclopedia

KeyboardInterrupt: 
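
The run above was stopped by hand. `getlinks` calls itself once for every new link, so the crawl ends only when it runs out of unseen links, and a long enough chain of calls can hit Python's recursion limit (1000 frames by default). One alternative, offered as a hedged sketch rather than the notebook's method, is an explicit queue with a page cap. The names `crawl`, `get_links`, `max_pages`, and `FAKE_SITE` are hypothetical; `get_links` stands in for whatever function fetches a page and returns its `/wiki/` hrefs.

```python
from collections import deque

def crawl(start, get_links, max_pages=50):
    """Breadth-first crawl that visits at most max_pages pages.

    get_links is a hypothetical callable: path -> iterable of link paths.
    Returns the visited paths in the order they were visited.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:
                # Record the link before queueing it so it is queued only once.
                seen.add(link)
                queue.append(link)
    return visited

# Stand-in for the network: a tiny fake site that contains cycles.
FAKE_SITE = {
    '': ['/wiki/A', '/wiki/B'],
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A'],
    '/wiki/C': [''],
}

order = crawl('', lambda page: FAKE_SITE.get(page, []), max_pages=3)
print(order)
```

Passing the link-fetching function in as an argument keeps the traversal logic testable without any network access.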

In [None]:
'''
The for loop in this program is essentially the same as in the original crawling
program (with printed dashes added to separate the output for each page).
Because you can never be sure that every piece of data is on every page, the print
statements are ordered from the data most likely to be present to the least likely. The
h1 title tag appears on every page (as far as I can tell, at any rate), so you attempt to get
that data first. The text content appears on most pages (except for file pages), so that
is the second piece of data retrieved. The Edit button appears only on pages on which
both a title and text content already exist, but it does not appear on all of those pages.
'''
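
To see why the order matters, the same three lookups can be run against small hand-written pages. A minimal sketch, assuming simplified HTML: the `report` helper and both sample strings are hypothetical, and `find('p')` is used here in place of `find_all('p')[0]`. Lines are collected until the first lookup fails, which mirrors the single try block in `getlinks`.

```python
from bs4 import BeautifulSoup

def report(html):
    """Collect what the try block would print, stopping at the first missing piece."""
    bs = BeautifulSoup(html, 'html.parser')
    lines = []
    try:
        lines.append(bs.h1.get_text())
        lines.append(bs.find(id='mw-content-text').find('p').get_text())
        lines.append(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        # A lookup returned None, so everything after it is skipped.
        lines.append('This page is missing something! continuing...')
    return lines

# Hypothetical pages: one without an edit link, one without body text.
NO_EDIT_LINK = '<h1>Title only</h1><div id="mw-content-text"><p>Body text.</p></div>'
NO_TEXT = '<h1>File page</h1>'

print(report(NO_EDIT_LINK))
print(report(NO_TEXT))
```

If the edit link were looked up first, both sample pages would report nothing but the "missing" message, even though each has a title.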