# Crawling an Entire Website

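If you need to systematically catalog an entire site by directory, or search every page on it, you have to crawl the whole site. That consumes a lot of memory, especially on large sites, so the most suitable approach is to store the crawled resources in a database.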

Still, we can learn how this kind of tool behaves without running it at large scale.

## The Deep Web, the Surface Web, and the Dark Web (or Hidden Web)

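The deep web is the part of the web that stands in contrast to the surface web. The surface web is the portion of the internet that search engines can index, that is, the data Google's crawler can reach. By commonly cited estimates, about 90% of the internet is deep web.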

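The dark web, also called the Darknet or dark Internet, is another beast entirely. It runs over the existing network infrastructure but is reached through a Tor client, whose additional protocol layer provides a secure tunnel for exchanging information.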

## A Common but Time-Consuming Way to Crawl a Site

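Start from a top-level page (such as the home page), find every link on it, and put those links in a list. Then crawl each page in the list, build a new list from the links found on those pages, and repeat for the next round. This is a recursive crawler.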

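Optimization: to avoid crawling the same page twice, deduplicate the links. Keep every link discovered so far in one collection that is fast to query (Python's set type).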

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


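pages = set()  # every /wiki/ link discovered so far
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'lxml')
    # Only follow internal links whose href starts with /wiki/
    for link in bsObj.find_all("a", href=re.compile(r"^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A link we have not seen before: record it, then crawl it
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
                
getLinks("")  # start from the home page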

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_permissions
/wiki/Wikipedia:Requesting_copyright_permission
/wiki/Wikipedia:User_access_levels
/wiki/Wikipedia:Requests_for_adminship
/wiki/Wikipedia:Protection_policy#extended


KeyboardInterrupt: 
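
The sample output lists the protection-policy page twice, once with `#semi` and once with `#extended`, because the set compares raw `href` strings. One possible refinement, which is not part of the crawler above, is to strip the fragment before the membership test. The sketch below uses `urllib.parse.urldefrag` from the standard library; the helper name `normalize` and the hard-coded `hrefs` list are assumptions made for illustration.

```python
from urllib.parse import urldefrag

def normalize(href):
    """Drop the #fragment so anchors within one page compare equal."""
    return urldefrag(href).url

# Sample hrefs copied from the crawler output above
hrefs = [
    "/wiki/Wikipedia:Protection_policy#semi",
    "/wiki/Wikipedia:Requests_for_page_protection",
    "/wiki/Wikipedia:Protection_policy#extended",
]

pages = set()
for href in hrefs:
    page = normalize(href)
    if page not in pages:
        # First time this page is seen, whatever its fragment was
        pages.add(page)
        print(page)
```

With this change the crawler would request the protection-policy page once instead of once per anchor.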

## A Warning About Recursion

If the recursion runs too many levels deep, the recursive program above is likely to crash.

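Python's default recursion limit (the maximum depth to which a program may call itself recursively) is 1000. Because Wikipedia's web of links is so vast, this program will hit that limit and stop with a RecursionError unless you set a larger limit or use some other means to keep it running.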

For ordinary sites whose link depth is less than 1000, this approach usually works, apart from the occasional odd exception.
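
One way to sidestep the limit, other than raising it with `sys.setrecursionlimit`, is to replace the recursive calls with an explicit queue. The following is a minimal sketch of that idea, not the method used in these notes: `crawl_iteratively`, `get_links`, and the in-memory `fake_site` are hypothetical names, and the fake site stands in for real HTTP requests so the example runs offline.

```python
import sys
from collections import deque

def crawl_iteratively(start, get_links):
    """Visit every page reachable from `start` without recursion."""
    seen = {start}            # pages already discovered
    queue = deque([start])    # pages discovered but not yet visited
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical site: a chain of pages twice as deep as the recursion limit
depth = sys.getrecursionlimit() * 2
fake_site = {f"/page/{i}": [f"/page/{i + 1}"] for i in range(depth)}

visited = crawl_iteratively("/page/0", lambda page: fake_site.get(page, []))
print(len(visited), "pages visited without hitting the recursion limit")
```

Because the pending pages live in a `deque` rather than on the call stack, link depth is bounded only by available memory.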

## Collecting Data Across an Entire Site

Build a crawler that collects each page's title, the first paragraph of its body text, and the link to its edit page.

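Look at a few pages on the site and work out a scraping pattern. Observation leads to the following rules:
- Every title (on every page, whether it is an article page, an edit-history page, or anything else) sits in an h1 -> span tag, and each page has only one h1 tag.
- All body text is inside the div#bodyContent tag. To get just the first paragraph, though, div#mw-content-text -> p is probably the better choice, selecting only the first paragraph tag.
- Edit links appear only on article pages. When a page has one, it is found under the li#ca-edit tag, at li#ca-edit -> span -> a.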

By adapting the earlier code, we can build a program that combines crawling with data collection.

In [5]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


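pages = set()  # every /wiki/ link discovered so far
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'lxml')
    try:
        # Title, first body paragraph, and edit link, per the rules above
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").find_all("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs["href"])
    except (AttributeError, IndexError):
        # A lookup returned None, or the page has no <p> tag at all
        print("页面缺少一些属性！")  # "This page is missing some attributes!"
    for link in bsObj.find_all("a", href=re.compile(r"^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A link we have not seen before: record it, then crawl it
                newPage = link.attrs["href"]
                print("-------------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)
                
getLinks("")  # start from the home page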

Main Page
<p><i><b><a href="/wiki/Freedom_Planet" title="Freedom Planet">Freedom Planet</a></b></i> is a <a href="/wiki/2D_computer_graphics" title="2D computer graphics">two-dimensional</a> <a href="/wiki/Platform_game" title="Platform game">platform</a> video game developed and published by <a href="/wiki/Independent_video_game_development" title="Independent video game development">independent developer</a> GalaxyTrail, a studio set up for the project by designer Stephen DiDuro. The player controls one of three <a href="/wiki/Anthropomorphism" title="Anthropomorphism">anthropomorphic</a> animal protagonists: the dragon Lilac, the <a href="/wiki/Wildcat" title="Wildcat">wildcat</a> Carol, or the <a class="mw-redirect" href="/wiki/Basset_hound" title="Basset hound">basset hound</a> Milla. Aided by the duck-like Torque <i>(concept art shown)</i>, the player attempts to defeat Lord Brevon, who plans to conquer the galaxy. While the game focuses on fast-paced platforming, its levels are 

Wikipedia:Requests for permissions
<p><span class="sysop-show" id="coordinates"><a href="/wiki/Wikipedia:Requests_for_permissions/Administrator_instructions" title="Wikipedia:Requests for permissions/Administrator instructions">Administrator instructions</a></span></p>
页面缺少一些属性！
-------------------
/wiki/Wikipedia:Requesting_copyright_permission
Wikipedia:Requesting copyright permission
<p>To use copyrighted material on Wikipedia, it is <i>not enough</i> that we have permission to use it on Wikipedia alone. That's because Wikipedia itself states all its material may be used by anyone, for any purpose. So we have to be sure all material is in fact licensed for that purpose, whoever provided it.</p>
页面缺少一些属性！
-------------------
/wiki/Wikipedia:User_access_levels
Wikipedia:User access levels
<p>The <b>user access level</b> of an editor affects their ability to perform certain actions on Wikipedia; it depends on which <i>rights</i> (also called <i>permissions</i>, <i><a href="/wiki/Intern

Wikipedia:Shortcut
<p>A <b>shortcut</b> is a specialized type of <a href="/wiki/Wikipedia:Redirect" title="Wikipedia:Redirect">redirect page</a> that provides an abbreviated <a class="mw-redirect" href="/wiki/Wikilink" title="Wikilink">wikilink</a> to a project page or one of its sections, usually from the <b><a href="/wiki/Wikipedia:Project_namespace" title="Wikipedia:Project namespace">Wikipedia namespace</a></b> and <b><a href="/wiki/Wikipedia:Help_namespace" title="Wikipedia:Help namespace">Help namespace</a></b>. They are commonly used on community pages and talk pages, but should not be used in articles themselves. If there is a shortcut for a page or section, it is usually displayed in an information box labelled <i>Shortcuts:</i>, as can be seen at the top of this page.</p>
页面缺少一些属性！
-------------------
/wiki/Wikipedia:Keyboard_shortcuts
Wikipedia:Keyboard shortcuts
<p>The <a href="/wiki/MediaWiki" title="MediaWiki">MediaWiki</a> software contains many <a href="/wiki/Keyboard_s

Help:Talk pages
<p><b>Talk pages</b> (also known as <b>discussion pages</b>) are <a href="/wiki/Wikipedia:Administration#Data_structure_and_development" title="Wikipedia:Administration">administration pages</a> where editors can discuss improvements to articles or other Wikipedia pages. When viewing an article (or any other non-talk page), a link to the corresponding talk page appears on the "Talk" tab at the top of the page. Click this tab to switch to the talk page.</p>
页面缺少一些属性！
-------------------
/wiki/Wikipedia:User_pages
Wikipedia:User pages
<p>User pages are for communication and collaboration. While considerable leeway is allowed in personalizing and managing your user pages, they are community project pages, <a href="/wiki/Wikipedia:What_Wikipedia_is_not#WEBHOST" title="Wikipedia:What Wikipedia is not">not a personal website, blog, or social networking medium</a>. They should be used to better participate in the community, and not used to excess for unrelated purposes nor to 

KeyboardInterrupt: 

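None of the examples so far store the data they print, and data that only appears on the command line is hard to process further.
Chapter 5 goes on to cover storing information and creating databases.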
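
As a stopgap before a real database, the printed fields could be gathered into records and written out as CSV. This is only a sketch under assumptions: `save_records` and the field names `url`, `title`, and `edit_link` are hypothetical, the single record is copied from the output above (where the edit link was missing), and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

def save_records(records, out):
    """Write one CSV row per crawled page to the open text stream `out`."""
    writer = csv.DictWriter(out, fieldnames=["url", "title", "edit_link"])
    writer.writeheader()
    writer.writerows(records)

# One record shaped like the crawler's printed output; the edit link is
# empty because that page reported missing attributes
records = [
    {
        "url": "/wiki/Wikipedia:Requesting_copyright_permission",
        "title": "Wikipedia:Requesting copyright permission",
        "edit_link": "",
    },
]

buffer = io.StringIO()  # stand-in for open("pages.csv", "w", newline="")
save_records(records, buffer)
print(buffer.getvalue())
```

In the crawler, each `print` call in the `try` block would instead fill in one such dictionary and append it to `records`.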