# Traversing a Single Domain

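This section builds a project that implements a "Six Degrees of Wikipedia" search: starting from the Eric Idle article (http://en.wikipedia.org/wiki/Eric_Idle), reach the Kevin Bacon article (https://en.wikipedia.org/wiki/Kevin_Bacon) in the fewest possible link clicks. As a first step, the code below fetches a single Wikipedia page (here the Kevin Bacon article) and prints every link on it.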

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, 'lxml')
for link in bsObj.find_all("a"):
    if "href" in link.attrs:
        print(link.attrs['href'])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
http://baconbros.com/
#cite_note-1
#cite_note-actor-2
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
#cite_note-3
/wiki/Hollywood_Walk_of_Fame
#cite_note-4
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
#cite_note-walk-5
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Advertising_work
#Personal_life
#Six_Degrees_of_Kevin_Ba

The output above includes sidebar, header, and footer links, as well as links to category pages, talk pages, and other pages that are not articles.

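A close look at the links that point to article pages (as opposed to other internal pages) shows that they share three traits:
- They sit inside the div whose id is bodyContent
- Their URLs do not contain a colon
- Their URLs begin with /wiki/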
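
The second and third rules can be checked in isolation before touching the network. The following is a minimal sketch that runs the regular expression used in this section against a few hrefs copied from the output above. The name `ARTICLE_LINK` is introduced only for illustration. The first rule (being inside bodyContent) is not covered here, since it is enforced by the BeautifulSoup `find` call rather than by the pattern.

```python
import re

# Same pattern as used in this section: starts with /wiki/ and contains no colon.
# ARTICLE_LINK is an illustrative name, not part of the original code.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Sample hrefs copied from the output printed above.
samples = [
    "/wiki/Wikipedia:Protection_policy#semi",  # contains a colon -> rejected
    "#mw-head",                                # in-page anchor -> rejected
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",    # contains a colon -> rejected
    "http://baconbros.com/",                   # external link -> rejected
    "/wiki/Kyra_Sedgwick",                     # article link -> kept
    "/wiki/Footloose_(1984_film)",             # article link -> kept
]

article_links = [href for href in samples if ARTICLE_LINK.match(href)]
print(article_links)
# ['/wiki/Kyra_Sedgwick', '/wiki/Footloose_(1984_film)']
```
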

Applying these rules, the code can be adjusted slightly to retrieve only article links:

In [9]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, 'lxml')
for link in bsObj.find("div", {"id":"bodyContent"}).find_all("a", href=re.compile(r"^(/wiki/)((?!:).)*$")):
    if "href" in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
/wiki/Philadelphia
/wiki/Edmund_Bacon_(architect)
/wiki/Pennsylvania_Governor%27s_School_for_the_Arts
/wiki/Bucknell_University
/wiki/Glory_Van_Scott
/wiki/Circle_in_the_Square
/wiki/Nancy_Mills
/wiki/Cosmopolitan_(magazine)
/wiki/Fraternities_and_sororities
/wiki/Animal_House
/wiki/Search_for_Tomorrow
/wiki/Guiding_Light
/wiki/F

Wrapping this logic in a getLinks function and using a random number to choose which link to follow gives the complete code:

In [12]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re


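def getLinks(articleUrl):
    # Fetch the article and return the <a> tags for article links inside bodyContent
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    soup = BeautifulSoup(html, 'lxml')
    return soup.find("div", {"id":"bodyContent"}).find_all("a", href=re.compile(r"^(/wiki/)((?!:).)*$"))

# Seed with a float timestamp; passing a datetime object raises TypeError on Python 3.11+
random.seed(datetime.datetime.now().timestamp())
links = getLinks("/wiki/Kevin_Bacon")
# Keep following a randomly chosen article link until a page has none (or the run is interrupted)
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)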

/wiki/Fox_Broadcasting_Company
/wiki/MHz_Worldview
/wiki/KCSM-TV
/wiki/Digital_subchannel
/wiki/Beaumont,_Texas
/wiki/Gonzales,_Texas
/wiki/1880_United_States_Census
/wiki/Midwestern_United_States


KeyboardInterrupt: 
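
The run above stopped only when it was interrupted by hand, because almost every article has at least one outgoing link. One possible way to make the walk end on its own is to cap the number of hops and skip pages that were already visited. The sketch below is an assumption-laden variant, not the original code: `random_walk`, `max_steps`, and the tiny `FAKE_LINKS` graph are all hypothetical names and made-up data, used so that the example runs without network access.

```python
import random


def random_walk(start, get_links, max_steps=10, rng=None):
    """Follow randomly chosen, not-yet-visited links for at most max_steps hops.

    get_links is any callable that maps an article path to a list of paths.
    """
    rng = rng or random.Random()
    visited = {start}
    path = [start]
    current = start
    for _ in range(max_steps):
        # Only consider pages that have not been seen yet, to avoid cycles.
        candidates = [link for link in get_links(current) if link not in visited]
        if not candidates:
            break  # dead end: every outgoing link was already visited
        current = rng.choice(candidates)
        visited.add(current)
        path.append(current)
    return path


# Hypothetical link data standing in for live Wikipedia pages.
FAKE_LINKS = {
    "/wiki/Kevin_Bacon": ["/wiki/Footloose_(1984_film)", "/wiki/Kyra_Sedgwick"],
    "/wiki/Footloose_(1984_film)": ["/wiki/Kevin_Bacon"],
    "/wiki/Kyra_Sedgwick": ["/wiki/Kevin_Bacon"],
}

path = random_walk("/wiki/Kevin_Bacon", lambda url: FAKE_LINKS.get(url, []), max_steps=5)
print(path)
```

To run this against live pages, the callable could wrap the getLinks function from this section, for example `lambda url: [a.attrs["href"] for a in getLinks(url)]`, since getLinks returns tag objects rather than plain strings.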

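This only builds a simple crawler that moves from one page to another. Solving the "Six Degrees of Wikipedia" problem takes a bit more work: the URL link data also has to be stored and analyzed. The follow-up solution is covered in Chapter 5.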
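
To give a rough idea of what "storing and analyzing" link data could look like, here is a minimal sketch that uses breadth-first search, a standard technique for finding a path with the fewest hops in an unweighted graph. It is not necessarily the approach taken in Chapter 5. The function `shortest_path` and the `stored_links` dictionary are hypothetical, and the intermediate pages are placeholders rather than real Wikipedia links.

```python
from collections import deque


def shortest_path(graph, start, target):
    """Return the shortest list of pages from start to target, or None."""
    parents = {start: None}  # page -> page it was first reached from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == target:
            # Walk back through the parents to rebuild the route.
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for nxt in graph.get(page, []):
            if nxt not in parents:
                parents[nxt] = page
                queue.append(nxt)
    return None


# Hypothetical stored link data; Page_A, Page_B, Page_C are placeholders.
stored_links = {
    "/wiki/Eric_Idle": ["/wiki/Page_A", "/wiki/Page_B"],
    "/wiki/Page_A": ["/wiki/Page_C"],
    "/wiki/Page_B": ["/wiki/Kevin_Bacon"],
    "/wiki/Page_C": ["/wiki/Kevin_Bacon"],
}

route = shortest_path(stored_links, "/wiki/Eric_Idle", "/wiki/Kevin_Bacon")
print(route)
# ['/wiki/Eric_Idle', '/wiki/Page_B', '/wiki/Kevin_Bacon']
```

Because breadth-first search explores pages in order of distance from the start, the first time it reaches the target it has found a route with the fewest clicks, which is why the two-click route through Page_B wins over the three-click route through Page_A.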