Collecting Textual Data from Web: Web Scraping
=====
Computational Text Analysis

Bao Yang, ACEM, SJTU

Announcements
---
* Lab 1 assignment is due today!
* Some of you may not be able to install jupyter correctly because of a bug in the latest version of miniconda3: https://github.com/ContinuumIO/anaconda-issues/issues/1424. Two solutions:
    - install a previous version (4.2.12) of miniconda3: https://repo.continuum.io/miniconda/
    - install anaconda3 which includes jupyter by default: https://www.continuum.io/downloads


Contents
===
* Introduction
    - **What is web scraping? and Why?**
    - Robots.txt: the crawler/robots protocol
* Crawling Webpage
    - urllib in standard library
    - Handling encoding and exception issues
    - requests: HTTP for Humans
* Parsing Webpage
    - HTML Basics
    - BeautifulSoup

What is web scraping?
===
* Obtaining data from webpages
* Two methods:
    - web scraping: crawl and parse HTML webpages
    - using an API (application programming interface)
        + convenient and well-structured, but not provided by all websites
        + e.g., TuShare, a Python financial data API package: http://tushare.org/ 
        + installation: pip install tushare, pip install pandas

In [2]:
## Demo: tushare
import tushare as ts # import first

In [3]:
df = ts.get_hist_data('000002') # get all daily candlestick (K-line) data from the past three years
df.head(10) # a pandas DataFrame; show the first 10 records

date,open,high,close,low,volume,price_change,p_change,ma5,ma10,ma20,v_ma5,v_ma10,v_ma20,turnover
2017-10-31,28.8,29.28,28.96,28.28,463277.75,-0.02,-0.07,28.08,27.197,26.898,491185.39,384513.55,375216.62,0.48
2017-10-30,27.5,29.0,28.98,27.41,676711.06,1.48,5.38,27.688,26.922,26.788,500367.43,356666.88,381704.96,0.7
2017-10-27,27.7,28.28,27.5,27.26,447443.84,-0.03,-0.11,27.078,26.655,26.645,404530.66,320496.58,384004.53,0.46
2017-10-26,27.3,27.88,27.53,26.91,422766.84,0.1,0.36,26.81,26.598,26.66,344224.97,308745.0,382786.99,0.44
2017-10-25,27.01,28.0,27.43,26.86,445727.44,0.43,1.59,26.514,26.555,26.704,300176.71,301366.28,388464.85,0.46
2017-10-24,25.81,27.2,27.0,25.81,509187.97,1.07,4.13,26.314,26.499,26.769,277841.7,299393.28,396833.27,0.52
2017-10-23,26.17,26.2,25.93,25.77,197527.2,-0.23,-0.88,26.156,26.458,26.857,212966.32,285184.61,431189.29,0.2
2017-10-20,26.0,26.23,26.16,25.73,145915.38,0.11,0.42,26.232,26.512,26.963,236462.5,336605.64,474465.08,0.15
2017-10-19,26.44,26.57,26.05,26.01,202525.56,-0.38,-1.44,26.386,26.521,27.12,273265.03,356589.37,544694.1,0.21
2017-10-18,26.21,26.56,26.43,25.61,334052.41,0.22,0.84,26.596,26.557,27.195,302555.86,362571.51,592102.62,0.34


In [3]:
df = ts.get_today_all() # get quotes for all stocks traded today
df.head(10)

[Getting data:]#########

timeout: timed out

In [4]:
df = ts.get_tick_data('000002',date='2017-03-01') # get historical tick-by-tick trade details
df.head(10)

,time,price,change,volume,amount,type
0,15:00:03,20.5,--,1620,3321000,卖盘
1,14:57:00,20.5,-0.01,7,14350,卖盘
2,14:56:57,20.51,--,43,88193,买盘
3,14:56:54,20.51,--,86,176386,买盘
4,14:56:51,20.51,--,57,116907,买盘
5,14:56:48,20.51,--,38,77938,买盘
6,14:56:45,20.51,--,23,47173,买盘
7,14:56:42,20.51,--,35,71785,买盘
8,14:56:39,20.51,--,4,8204,买盘
9,14:56:36,20.51,--,25,51275,买盘


In [None]:
ts.get_realtime_quotes(['000002','600115','600221']) # get real-time tick quotes

In [None]:
df = ts.get_index() # get market index quotes
df.head(10)

In [None]:
df = ts.get_sina_dd('000002', date='2017-03-01', vol=500)  # get large-order trade data (orders of at least 500 lots)
df.head(10)

In [None]:
df = ts.forecast_data(2017,1) # get earnings forecasts for Q1 2017
df.head(10)

In [None]:
df = ts.new_stocks() # get new-stock (IPO) data
df.head(10)

In [None]:
df = ts.get_stock_basics() # get fundamentals data
df.head(10)

In [None]:
ts.get_latest_news(top=5,show_content=True) # show the 5 latest news items and print their content

In [None]:
df = ts.realtime_boxoffice() # get real-time movie box office data, updated every 30 minutes
df.head(10)

In [None]:
df = ts.month_boxoffice('2017-01') # get box office data for January 2017
df.head(10)

**More data in TuShare to explore: http://tushare.org/**

What is web scraping?
===
* How to scrape web data?
    - crawl webpages: urllib, requests, etc.
    - parse webpages: BeautifulSoup, re (regular expression), etc.
    - more sophisticated frameworks for large-scale crawling: Scrapy, search engines

Why web scraping?
===
* Tremendous amount of useful information
    - online reviews (e-commerce, healthcare, etc.)
    - financial news in traditional media and social media
    - corporate disclosures (annual reports, IPO prospectus, etc.)
    - product information (air ticket price, e-commerce, etc.)
    - any more examples in your research field?
        + start to brainstorm for your course project

Why web scraping?
===
* Infeasible to collect by hand, need to automate the collection
* It is fun, and allows you to do something really useful and cool!
    - Any thoughts/example?

Web scraping is about obtaining data from webpages. Low-level scraping means parsing the data out of a webpage's HTML code. Alternatively, some websites offer APIs that return the data in structured form, which makes your life a bit easier.

Read and Tweet!
=================

![ReadTweet](images/readtweet.jpg)

* by Justin Blinder
* http://projects.justinblinder.com/We-Read-We-Tweet

“We Read, We Tweet” geographically visualizes the dissemination of New York Times articles through Twitter. Each line connects the location of a tweet to the contextual location of the New York Times article it referenced. The lines are generated in a sequence based on the time in which a tweet occurs. The project explores digital news distribution in a temporal and spatial context through the social space of Twitter.

Twitter Sentiments
===

![TwitterSentiments](images/tweet-viz-ex.png "Twitter Sentiments")

* by Healey and Ramaswamy
* http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/

Type a keyword into the input field, then click the Query button. Recent tweets that contain your keyword are pulled from Twitter and visualized in the Sentiment tab as circles. Hover your mouse over a tweet or click on it to see its text.

Contents
===
* Introduction
    - What is web scraping? and Why?
    - **Robots.txt: the crawler/robots protocol**
* Crawling Webpage
    - urllib in standard library
    - Handling encoding and exception issues
    - requests: HTTP for Humans
* Parsing Webpage
    - HTML Basics
    - BeautifulSoup

Robots.txt
===
* robots exclusion protocol
    - https://en.wikipedia.org/wiki/Robots_exclusion_standard
* gives instructions to web robots (e.g., search engines, your own crawler)
    - what information can/cannot be scraped
* specified by the website owner
* located in the top-level directory of the web server

![Robots.txt](images/robots_txt.jpg "Robots.txt")

Robots.txt: an example
---

*** What does this one do? ***

* https://www.taobao.com/robots.txt
    - User-agent:  Baiduspider
    - Allow:  /article
    - Allow:  /oshtml
    - Allow:  /wenzhang
    - Disallow:  /product/
    - Disallow:  /
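One way to check rules like these in code is Python's standard-library `urllib.robotparser`. The sketch below parses the rules shown above from a string rather than downloading them, so it reflects this slide and not necessarily Taobao's current file. The crawler name `MyCrawler` and the example paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, pasted in as text (not fetched live).
rules = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /product/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "https://www.taobao.com"
checks = [("Baiduspider", "/article"),
          ("Baiduspider", "/product/123"),
          ("Baiduspider", "/"),
          ("MyCrawler", "/product/123")]   # hypothetical crawler name

for agent, path in checks:
    allowed = parser.can_fetch(agent, base + path)
    print(agent, path, "allowed" if allowed else "disallowed")
```

In this excerpt the rules are addressed only to Baiduspider. A crawler that is not mentioned, and for which no `User-agent: *` block exists, is reported as allowed by this parser.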

Robots.txt in the real world
---

Please explore the robots.txt files of the following websites:
* https://www.baidu.com
* https://www.jd.com
* https://www.amazon.cn
* https://www.dianping.com

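Complying with robots.txt:
---

* The robots protocol is only advisory; it is not technically enforced
* Complying is recommended when crawling large amounts of data; otherwise you may face legal risk
* What are the potential security risks of robots.txt?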

Contents
===
* Introduction
    - What is web scraping? and Why?
    - Robots.txt: the crawler/robots protocol
* Crawling Webpage
    - **urllib in standard library**
    - Handling encoding and exception issues
    - requests: HTTP for Humans
* Parsing Webpage
    - HTML Basics
    - BeautifulSoup

urllib package
===

* Python3 provides the **urllib** package in its standard library for opening and reading URLs
* Documentation: https://docs.python.org/3/library/urllib.html
* Differences between Python2 and Python3: 
    - Python2 provides the urllib2 package
    - In Python3, urllib2 is split into two parts
        + urllib.request, and urllib.error
* We mainly use urllib.request
    - https://docs.python.org/3/library/urllib.request.html#module-urllib.request


In [None]:
## Demo: use urllib.request to open and read url links
import urllib.request 

url = "https://www.crummy.com/software/BeautifulSoup/"
response = urllib.request.urlopen(url)
print(response)

In [None]:
print(type(response))

In [None]:
html = response.read()
print(html)

In [None]:
## Demo: encoding issue
import urllib.request

url = "https://www.douban.com"
response = urllib.request.urlopen(url)
html = response.read()
print(html)

In [None]:
html = html.decode('utf-8') # use decode() method to convert bytes to str
print(type(html))
print(html)

In [None]:
## Demo: encoding issue
import urllib.request 

url = "https://bbs.sjtu.edu.cn/php/bbsindex.html"
response = urllib.request.urlopen(url)
html = response.read()
html = html.decode('utf-8') # how to find the encoding of the webpage?
print(html)

In [None]:
## Demo: encoding issue
import urllib.request 

url = "https://bbs.sjtu.edu.cn/php/bbsindex.html"
response = urllib.request.urlopen(url)
html = response.read()
html = html.decode('gb2312') # how to find the encoding of the webpage?
print(html)

Exercise:
---
* Use urllib to open and read this URL: https://www.crummy.com/software/BeautifulSoup/
* Does the word 'Alice' appear in the page?
* How many times does the word 'Soup' appear in the page? (Hint: use .count() method)

Contents
===
* Introduction
    - What is web scraping? and Why?
    - Robots.txt: the crawler/robots protocol
* Crawling Webpage
    - urllib in standard library
    - **Handling encoding and exception issues**
    - requests: HTTP for Humans
* Parsing Webpage
    - HTML Basics
    - BeautifulSoup

![Encoding](images/encoding.png)
* Character encoding is a frequent nuisance in Python programming; compared with Python2, Python3 handles character encoding much better

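Character encoding
---
* unicode
    - in python3, all strings are stored as unicode **in memory**
    - unifies all languages into a single character set, which avoids garbled text
    - problem: a fixed-width unicode representation wastes storage space (even English letters take multiple bytes)

* variable-length encodings
    - when storing characters in a file, converting unicode to a variable-length encoding saves storage space
    - utf-8
    - gb2312, gbk, gb18030 (each is a subset of the next, so the later ones are compatible with the earlier ones)

Converting between character encodings in Python
---
* encode() method: converts a unicode string (str) to bytes in the specified encoding
    - .encode('utf-8'), .encode('gb2312'), .encode('gbk'), .encode('gb18030')
* decode() method: converts bytes in the specified encoding to a unicode string (str)
    - .decode('utf-8'), .decode('gb2312'), .decode('gbk'), .decode('gb18030')
* In Python, unicode is the intermediate representation
    - converting encoding A to encoding B: data.decode("A").encode("B"), where data is a bytes object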
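The conversions listed above can be tried on a short string. This is a minimal sketch with a made-up sample string; it shows that the same text takes a different number of bytes in each encoding, and that converting between two encodings goes through `str`.

```python
text = "中文abc"                       # a str: unicode in memory

utf8_bytes = text.encode("utf-8")      # str -> bytes
gbk_bytes = text.encode("gbk")

print(type(text), type(utf8_bytes))
print(len(text), len(utf8_bytes), len(gbk_bytes))   # characters vs. bytes

# Converting GBK bytes to UTF-8 bytes goes through str (unicode).
converted = gbk_bytes.decode("gbk").encode("utf-8")
print(converted == utf8_bytes)

# Decoding with the wrong codec raises an error (or, with other codecs,
# silently yields garbled text).
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("wrong codec:", e.reason)
```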

In [5]:
## Demo: Bytes (utf-8, gb2312, gbk, gb18030) --> Str (Unicode)
import urllib.request

url = "http://bbs.sjtu.edu.cn/php/bbsindex.html"
response = urllib.request.urlopen(url)
html = response.read()
print(type(html))
html = html.decode('gb2312') # can we use gbk and gb18030 here?
print(type(html))
print(html)

<class 'bytes'>
<class 'str'>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<link rel=stylesheet type=text/css href='/file/bbs/bbs.css'>
<link rel=stylesheet type=text/css href='/file/bbs/blue.css'>
<style type="text/css">
<!--
TABLE,TH,TD,P,INPUT,SELECT {FONT-FAMILY: Tahoma, sans-serif, 宋体;}
BODY { FONT-FAMILY: Tahoma, sans-serif, 宋体;}
HR {color: #85BCF5;}
.plain {
	border: 1px solid #606060;
	background-color: #EEEEEE;
}
a:link    { color: #222222; text-decoration: none}
a:visited { text-decoration:none; color: #666666;}
a:hover   { text-decoration: underline overline; color: #AAAAAA;}
a.top:link    { color: #FFFFFF; text-decoration: none}
a.top:visited { text-decoration:none; color: #FFFFFF;}
a.top:hover   { text-decoration: underline overline}
a.bd:link    { color: #000000; text-decoration: none}
a.bd:visited { text-decoration:none; color: #000000;}
a.bd:hover   { text-decoration: underline overline}
-->
</style>

<!-- <script language='JavaScri

Note: the character encoding declared in an HTML page's own header is sometimes wrong!
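To see what a page declares about itself, one option is to search the raw bytes for a `charset=` declaration before decoding. The helper below is a hypothetical sketch (the name `declared_charset` and the simple regular expression are assumptions for illustration, not part of urllib), and it runs on sample byte strings instead of a live page.

```python
import re

def declared_charset(raw, default=None):
    """Return the charset named in a <meta> tag of raw HTML bytes, if any.

    Hypothetical helper for illustration; the regex is deliberately simple
    and only inspects the first 2048 bytes.
    """
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default

old_style = b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
new_style = b'<meta charset="UTF-8">'

print(declared_charset(old_style))
print(declared_charset(new_style))
print(declared_charset(b"<html></html>", default="utf-8"))
```

Because the declared value can be wrong, it is safer to treat it as a hint and to compare it with what a detector reports for the actual bytes.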

How to identify encoding automatically?
---

In [8]:
import chardet # automatic encoding detection; install with: pip install chardet
import urllib.request

for url in ["http://www.crummy.com/software/BeautifulSoup",
            "https://www.douban.com",
            "https://bbs.sjtu.edu.cn/php/bbsindex.html"]:
    html = urllib.request.urlopen(url).read()
    mychar = chardet.detect(html)
    print(url,mychar,mychar['encoding'])
    # encoding = mychar['encoding']
    # print(html.decode(encoding)[:1000])

http://www.crummy.com/software/BeautifulSoup {'encoding': 'ascii', 'confidence': 1.0} ascii
https://www.douban.com {'encoding': 'utf-8', 'confidence': 0.99} utf-8
https://bbs.sjtu.edu.cn/php/bbsindex.html {'encoding': 'GB2312', 'confidence': 0.99} GB2312


In [9]:
## Demo: automatically detect encoding
import chardet
import urllib.request

url = "https://www.douban.com"
html = urllib.request.urlopen(url).read()
mychar = chardet.detect(html)
print(mychar)
print(html.decode(mychar['encoding'])) # decode using the detected encoding

{'encoding': 'utf-8', 'confidence': 0.99}
<!DOCTYPE HTML>
<html lang="zh-cms-Hans" class="">
<head>
<meta charset="UTF-8">
<meta name="description" content="提供图书、电影、音乐唱片的推荐、评论和价格比较，以及城市独特的文化生活。">
<meta name="keywords" content="豆瓣,广播,登陆豆瓣">
<meta property="qc:admins" content="2554215131764752166375" />
<meta property="wb:webmaster" content="375d4a17a4fa24c2" />
<meta name="mobile-agent" content="format=html5; url=https://m.douban.com">
<title>豆瓣</title>
<script>
function set_cookie(t,e,o,n){var i,a,r=new Date;r.setTime(r.getTime()+24*(e||30)*60*60*1e3),i="; expires="+r.toGMTString();for(a in t)document.cookie=a+"="+t[a]+i+"; domain="+(o||"douban.com")+"; path="+(n||"/")}function get_cookie(t){var e,o,n=t+"=",i=document.cookie.split(";");for(e=0;e<i.length;e++){for(o=i[e];" "==o.charAt(0);)o=o.substring(1,o.length);if(0===o.indexOf(n))return o.substring(n.length,o.length).replace(/\"/g,"")}return null}window.Douban=window.Douban||{};var Do=function(){Do.actions.push([].slice.call(argumen

Handling Connection Exceptions
===
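* A crawler depends on the network, and network connections can fail
* A crawler must handle connection exceptions, otherwise the program will crash

* URLError:
    - possible causes: no network connection, the server cannot be reached, the server does not exist ...
* HTTPError (a subclass of URLError) carries an HTTP status code, for example:
    - 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found ...
    - more status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
* We can use a try-except statement to catch and handle these exceptions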
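Because HTTPError is a subclass of URLError, the more specific exception has to be checked first. The sketch below illustrates this offline: the exceptions are constructed by hand instead of being raised by a real request, and `describe` is a hypothetical helper that mirrors the order of the except clauses.

```python
import urllib.error

def describe(exc):
    """Classify an exception the way a try-except chain would (illustrative helper)."""
    # HTTPError must be tested first: it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP error {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: {}".format(exc.reason)
    return "other error: {}".format(exc)

# Exceptions built by hand so the sketch runs without a network connection.
http_exc = urllib.error.HTTPError("http://example.com/missing", 404, "Not Found", {}, None)
url_exc = urllib.error.URLError("no route to host")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe(http_exc))
print(describe(url_exc))
```

If the URLError check came first, it would also match every HTTPError, and the status code would never be reported separately.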

In [10]:
## Demo: URLError
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("https://www.youtube.com")
except urllib.error.URLError as e:
    print(e)

<urlopen error [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。>


In [11]:
## Demo: HTTPError
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.csdn.net/cqcre")
#     urllib.request.urlopen("http://www.sjtu.edu.cn/1234")
#     urllib.request.urlopen("http://www.baidu.com")
except urllib.error.HTTPError as e:
    print(e)
except urllib.error.URLError as e:
    print(e)
else:
    print("OK")

HTTP Error 403: Forbidden


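Disguising the crawler as a browser
---

* Sometimes an HTTP error (e.g., 403) occurs because the website blocks crawlers
* We can add header information to make the crawler look like a browser
* How do you find your browser's header information?
    - Chrome Developer Tools (F12 or Ctrl+Shift+I)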

In [12]:
## Demo: set user-agent in header
import urllib.request
import urllib.error

try:
    # urllib.request.urlopen("http://blog.csdn.net/cqcre") # without a user-agent in the header, this throws a 403 error
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    request = urllib.request.Request(url='http://blog.csdn.net/cqcre', headers=headers)
    response = urllib.request.urlopen(request)
    # html = response.read()
    # print(html.decode('utf-8'))
except urllib.error.HTTPError as e:
    print(e)
except urllib.error.URLError as e:
    print(e)
else:
    print("OK")

OK


Contents
===
* Introduction
    - What is web scraping? and Why?
    - Robots.txt: the crawler/robots protocol
* Crawling Webpage
    - urllib in standard library
    - Handling encoding and exception issues
    - **requests: HTTP for Humans**
* Parsing Webpage
    - HTML Basics
    - BeautifulSoup

Requests: HTTP for Humans
===

* Requests: an elegant and simple HTTP library for Python
    - helps you handle tricky issues, e.g., string encoding, etc.
    - **We recommend the Requests library over the urllib package in the Python standard library**
* Quick tutorial: http://docs.python-requests.org/en/master/user/quickstart/
* Installation: pip install requests (if you get a permission error, try running anaconda prompt as administrator)

In [13]:
## test requests to see if it is installed correctly
import requests

r = requests.get("http://www.douban.com")
print(r.status_code)
print(r.text[:1000])

200
<!DOCTYPE HTML>
<html lang="zh-cms-Hans" class="">
<head>
<meta charset="UTF-8">
<meta name="description" content="提供图书、电影、音乐唱片的推荐、评论和价格比较，以及城市独特的文化生活。">
<meta name="keywords" content="豆瓣,广播,登陆豆瓣">
<meta property="qc:admins" content="2554215131764752166375" />
<meta property="wb:webmaster" content="375d4a17a4fa24c2" />
<meta name="mobile-agent" content="format=html5; url=https://m.douban.com">
<title>豆瓣</title>
<script>
function set_cookie(t,e,o,n){var i,a,r=new Date;r.setTime(r.getTime()+24*(e||30)*60*60*1e3),i="; expires="+r.toGMTString();for(a in t)document.cookie=a+"="+t[a]+i+"; domain="+(o||"douban.com")+"; path="+(n||"/")}function get_cookie(t){var e,o,n=t+"=",i=document.cookie.split(";");for(e=0;e<i.length;e++){for(o=i[e];" "==o.charAt(0);)o=o.substring(1,o.length);if(0===o.indexOf(n))return o.substring(n.length,o.length).replace(/\"/g,"")}return null}window.Douban=window.Douban||{};var Do=function(){Do.actions.push([].slice.call(arguments))};Do.ready=function(){Do.actions.p

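Requests: the 7 main methods
---
* requests.request(): constructs a request; the base method underlying all the methods below
* **requests.get()**: the main method for fetching an HTML page; corresponds to HTTP GET
* requests.head(): fetches the header information of an HTML page; corresponds to HTTP HEAD
* requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
* requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
* requests.patch(): submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
* requests.delete(): submits a delete request to an HTML page; corresponds to HTTP DELETE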

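The get() method of Requests
---

* r = requests.get(url)
    - r -> a Response object containing the server's resource, i.e., the content the crawler gets back
    - the get() method constructs a Request object that asks the server for the resource

* requests.get(url, params=None, **kwargs)
    - url: the URL of the page to fetch
    - params: extra parameters to append to the URL
    - **kwargs: parameters that control the request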

In [21]:
import requests
r = requests.get("http://www.douban.com",timeout=0.1575) # timeout parameter (in seconds)
print(r.status_code)
print(type(r))

200
<class 'requests.models.Response'>


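Common attributes of the Response object
---

* r.status_code: the HTTP status of the response (200 means success; 4xx and 5xx codes indicate errors)
* r.text: the content of the page at the URL, as a string
* r.encoding: the character encoding guessed from the HTTP header
* r.apparent_encoding: an alternative character encoding inferred from the content itself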
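Why the two encoding attributes matter can be shown without a network connection. As far as the Requests documentation describes, when a server sends a text content type without a charset, `r.encoding` falls back to ISO-8859-1, and `r.text` is then decoded with the wrong codec. The sketch below reproduces that effect on plain bytes; the sample string is made up for illustration.

```python
original = "上海交通大学"                  # sample text, chosen for illustration
body = original.encode("utf-8")            # the bytes a UTF-8 server would send

# Decoding with the wrong codec "succeeds" but produces garbled text (mojibake).
garbled = body.decode("iso-8859-1")
print(garbled == original)
print(len(original), len(garbled))         # one garbled character per byte

# Decoding the same bytes with the right codec recovers the text.
fixed = body.decode("utf-8")
print(fixed == original)
```

Setting `r.encoding = r.apparent_encoding` before reading `r.text` plays the role of the second decode here.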

In [23]:
## Demo: Response attributes and handling character encoding
import requests
r = requests.get("http://www.sjtu.edu.cn")
print(r.status_code)
print(r.encoding)
print(r.apparent_encoding)
#print(r.text)
r.encoding = r.apparent_encoding
print(r.text)

200
ISO-8859-1
UTF-8-SIG
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!doctype html><HTML class=" js no-flexbox no-canvas no-canvastext no-webgl no-touch no-geolocation postmessage no-websqldatabase no-indexeddb no-hashchange no-history draganddrop no-websockets no-rgba no-hsla no-multiplebgs no-backgroundsize no-borderimage no-borderradius no-boxshadow no-textshadow no-opacity no-cssanimations no-csscolumns no-cssgradients no-cssreflections no-csstransforms no-csstransforms3d no-csstransitions fontface no-video no-audio localstorage sessionstorage no-webworkers no-applicationcache no-svg no-inlinesvg no-smil no-svgclippaths js no-flexbox no-canvas no-canvastext no-webgl no-touch no-geolocation postmessage no-websqldatabase no-indexeddb no-hashchange no-history draganddrop no-websockets no-rgba no-hsla no-multiplebgs no-backgroundsize no-borderimage no-borderradius no-boxshadow no-textshadow no-opacity no-cs

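Exception handling in Requests
---
* Exceptions in the Requests library:
    - requests.ConnectionError: network connection error
    - requests.HTTPError: HTTP error
    - requests.URLRequired: missing URL
    - requests.TooManyRedirects: too many redirects
    - requests.ConnectTimeout: timed out while connecting to the remote server
    - requests.Timeout: the request timed out
* Requests does not raise an exception for an HTTP error status by itself; to raise one, use:
    - r.raise_for_status(): raises requests.HTTPError if the HTTP status code indicates an error (4xx or 5xx)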

In [24]:
## Demo: requests does not raise on a 404 unless raise_for_status() is called
import requests
r = requests.get("http://www.sjtu.edu.cn/1234")
print(r.status_code)
# r.raise_for_status()

404


Example: scraping a JD.com product page
---
* Use the Requests library to fetch the product page at this link: https://item.jd.com/497227.html

In [25]:
import requests
url = "https://item.jd.com/497227.html" # air purifier
r = requests.get(url, timeout=30)
r.raise_for_status() # raise HTTPError if the status code is 4xx or 5xx
r.encoding = r.apparent_encoding # handling encoding issue
print(r.text[:1000])

<!DOCTYPE HTML>
<html lang="zh-CN">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
    <title>【松下F-VXG70C-N】松下（Panasonic）空气净化器 F-VXG70C-N【行情 报价 价格 评测】-京东</title>
    <meta name="keywords" content="PanasonicF-VXG70C-N,松下F-VXG70C-N,松下F-VXG70C-N报价,PanasonicF-VXG70C-N报价"/>
    <meta name="description" content="【松下F-VXG70C-N】京东JD.COM提供松下F-VXG70C-N正品行货，全国价格最低，并包括PanasonicF-VXG70C-N网购指南，以及松下F-VXG70C-N图片、F-VXG70C-N参数、F-VXG70C-N评论、F-VXG70C-N心得、F-VXG70C-N技巧等信息，网购松下F-VXG70C-N上京东,放心又轻松" />
    <meta name="format-detection" content="telephone=no">
    <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/497227.html">
    <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/497227.html">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <link rel="canonical" href="//item.jd.com/497227.html"/>
        <link rel="dns-prefetch" href="//misc.360buyimg.com"/>
    <link rel="dns-prefetch" href="//static

Example: scraping an Amazon China product page
---
* Use the Requests library to fetch the product page at this link: https://www.amazon.cn/dp/B005GNM3SS/

In [26]:
import requests
#url = "https://www.amazon.cn/dp/B005GNM3SS/" # air purifier
url = "https://www.amazon.cn/gp/product/B01ARKEV1G" # the "watermelon book" on machine learning
r = requests.get(url, timeout=30)
print(r.status_code)
r.encoding = r.apparent_encoding # handling encoding issue
print(r.text)
print(r.request.headers)

200
    <!doctype html><html class="a-no-js" data-19ax5a9jf="dingo">
    <head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8">
    <link rel="dns-prefetch" href="//images-cn.ssl-images-amazon.com">
<script type="text/javascript">
var iUrl = "https://images-cn.ssl-images-amazon.com/images/I/410pXZz4kFL._SX258_BO1,204,203,200_QL70_.jpg";
(function(){var i=new Image; i.src = iUrl;})();
</script>
<!--  -->
<link rel="stylesheet" href="https://images-cn.ssl-images-amazon.com/images/I/71AqNWvdkML._RC|01A7OPUhDKL.css,31+C8rQtOEL.css,21j5Or3LbtL.css,31NNwvudtYL.css,01EnpDxOA4L.css,312fqnjyyJL.css_.css#AUIClients/NavDesktopMetaAsset" />
<link rel="stylesheet" href="https:/

In [27]:
## Demo: set user-agent using "headers"
import requests
url = "https://www.amazon.cn/dp/B005GNM3SS/" # air purifier
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
# headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get(url, timeout=30, headers=headers) # add header information for user-agent
print(r.status_code)
r.encoding = r.apparent_encoding # handling encoding issue
print(r.text)

200
    <!doctype html><html class="a-no-js" data-19ax5a9jf="dingo">
    <head>
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<script type="text/javascript">
var ue_hob=+new Date();
var ue_id='FFTZCF04M387HG3T9E0T',
ue_csm = window,
ue_err_chan = 'jserr-rw',
ue = {};
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){if(1==window.ueinit)try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);

ue.stub(ue,"log");ue.stub(ue,"onunload");ue.stub(ue,"onflush");

(function(c,d){function e(f,b){if(!(a.ec>a.mxe)&&f){a.ec++;a.ter.push(f);b=b||{};var c=f.log

Example: submitting a search keyword to Baidu
---
* Baidu's keyword interface: http://www.baidu.com/s?wd=keyword
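The `params` argument of requests.get() builds this kind of URL from a dictionary. What it produces can be approximated with the standard library alone; the sketch below uses `urllib.parse.urlencode` as a stand-in and is not how Requests is implemented internally.

```python
from urllib.parse import urlencode

base_url = "http://www.baidu.com/s"
keywords = {"wd": "python scrape"}

# urlencode() turns the dict into a query string, escaping the space as '+'.
query = urlencode(keywords)
full_url = base_url + "?" + query
print(full_url)
```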

In [28]:
## Demo: set keywords to submit using "params"
import requests

keywords={'wd':'python scrape'}
r = requests.get("http://www.baidu.com/s",params=keywords)
print(r.status_code)
print(r.request.url)
print(r.text)

200
http://www.baidu.com/s?wd=python+scrape
<!DOCTYPE html>
<!--STATUS OK-->
<html>
	<head>
		
		<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
		<meta http-equiv="content-type" content="text/html;charset=utf-8">
		<meta content="always" name="referrer">
        <meta name="theme-color" content="#2932e1">
        <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
        <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg">
        <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" /> 
		
		
<title>python scrape_百度搜索</title>

		

		
<style data-for="result" type="text/css" id="css_newi_result">body{color:#333;background:#fff;padding:6px 0 0;margin:0;position:relative;min-width:900px}
body,th,td,.p1,.p2{font-family:arial}
p,form,ol,ul,li,dl,dt,

Exercise:
---
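* Use the requests library to fetch the Douban page of your favorite movie, e.g., https://movie.douban.com/subject/1298250/
* After fetching the page, print the following:
    - the HTTP status code of the response
    - the encoding of the page
    - the first 2000 characters of the page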

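General-purpose code for fetching a webpage with Requests
---

In [29]:
## General-purpose code for fetching a webpage with Requests: exception handling, timeout, encoding, and browser disguise
import requests

# define get_html() function
def get_html(url):
    try:
        r = requests.get(url, headers={'User-Agent':'Mozilla/5.0'}, timeout=30)
        r.raise_for_status() # raise HTTPError if the status code is 4xx or 5xx
        r.encoding = r.apparent_encoding # handling encoding issue
        return r.text
    except requests.RequestException: # catch only request-related errors, not every exception
        return "Error: something is Wrong!"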

# call get_html() function
url_bad = "www.baidu.com"
print(get_html(url_bad))
print()
url = "http://www.baidu.com"
print(get_html(url))

Error: something is Wrong!

<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"><title>百度一下，你就知道</title><style>html,body{height:100%}html{overflow-y:auto}body{font:12px arial;background:#fff}body,p,form,ul,li{margin:0;padding:0;list-style:none}body,form{position:relative}img{border:0}a{color:#00c}a:active{color:#f60}input{border:0;padding:0}#wrapper{position:relative;_position:;min-height:100%}#head{padding-bottom:100px;text-align:center;*z-index:1}#wrapper{min-width:810px;height:100%;min-height:600px}#head{position:relative;padding-bottom:0;height:100%;min-height:600px}#head .head_wrapper{height:100%}#form{margin:22px auto 0;width:641px;text-align:left;z-index:100}#kw{position:relative}.s_btn{width:95px;height:32px;padding-top:2px\9;font-size:14px;background-color:#ddd;background-position:0 -48px;cursor:pointer}.s_btn{width:100

Contents
===
* Introduction
    - What is web scraping? and Why?
    - Robots.txt: the crawler/robots protocol
* Crawling Webpage
    - urllib in standard library
    - Handling encoding and exception issues
    - requests: HTTP for Humans
* Parsing Webpage
    - **HTML Basics**
    - BeautifulSoup

HTML Basics
===

* HyperText Markup Language (HTML)

* a language for writing webpages

* HTML tags
    - enclosed in angle brackets <>
    - usually come in pairs

This is an example of a minimal webpage defined with HTML tags. The root tag is `<html>`, followed by the `<head>` tag. This part of the page typically includes the title of the page and may also hold other meta information, such as the author or keywords, that is important for search engines. The `<body>` tag marks the actual content of the page. You can play around with the `<h2>` tag by trying different heading levels, which range from 1 to 6. 

Some common tags
---

* heading
    - `<h1></h1> ... <h6></h6>`
* paragraph
    - `<p></p>` 
* line break
    - `<br>` 
* link with attribute
    - `<a href="http://www.example.com/">An example link</a>`
* More details about tags can be found here:
    - https://www.w3schools.com/tags/


In [None]:
from IPython.display import HTML

htmlString = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <h2> Test </h2>
    <p>Hello world!</p>
    <p><a href="http://yangbao.org" target="_blank">My Website</a></p>
    
  </body>
</html>"""

htmlOutput = HTML(htmlString)
htmlOutput

Contents
===
* Introduction
    - What is web scraping? and Why?
    - Robots.txt: the crawler/robots protocol
* Crawling Webpage
    - urllib in standard library
    - Handling encoding and exception issues
    - requests: HTTP for Humans
* Parsing Webpage
    - HTML Basics
    - **BeautifulSoup**

Parsing HTML pages
===
* Most common browsers parse an HTML document into a DOM (Document Object Model) structure: https://www.w3.org/DOM/

![Html Dom Tree](images/HTMLDOMTree.png)

Parsing HTML pages
=================
* The text we want to scrape is usually only a part of the HTML elements in the DOM

![HTML Tree](images/treeStructure.png)
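The tree structure can also be seen in code. The sketch below uses the standard library's `html.parser` module, not BeautifulSoup, to record an indented outline of a small page. The class name `TagTreePrinter` is made up for illustration, and the sketch assumes every tag is explicitly closed (void tags such as `<br>` would throw off the depth counter).

```python
from html.parser import HTMLParser

class TagTreePrinter(HTMLParser):
    """Collect an indented outline of the tags and text in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then go one level deeper.
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input: every start tag has a matching end tag.
        self.depth -= 1

    def handle_data(self, data):
        # Record non-blank text nodes under their parent tag.
        if data.strip():
            self.lines.append("  " * self.depth + repr(data.strip()))

html_string = ("<html><head><title>This is a title</title></head>"
               "<body><h2>Test</h2><p>Hello world!</p></body></html>")

printer = TagTreePrinter()
printer.feed(html_string)
printer.close()
print("\n".join(printer.lines))
```

Only the text nodes at the leaves are usually of interest when scraping; the tags around them are what a parser uses to locate that text.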

BeautifulSoup: Parsing Webpages
===

* BeautifulSoup: a powerful library for parsing webpages
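    - there are other libraries as well, such as lxml (in fact, BeautifulSoup can use lxml as its parser)
* In addition, we can often use the browser to help us understand the structure of an HTML page
    - **Ctrl+Shift+I in Chrome, or right-click -> View page source**
* Quick tutorial: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
* Installation:
    - pip install beautifulsoup4 (if you get a permission error, try running anaconda prompt as administrator)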
    - pip install lxml

In [30]:
from IPython.display import HTML

htmlString = "<!DOCTYPE html><html><head><title>This is a title</title></head><body><h2>Test</h2><p>Hello world!</p></body></html>"

htmlOutput = HTML(htmlString)
htmlOutput

In [33]:
## test BeautifulSoup

htmlString = "<!DOCTYPE html><html><head><title>This is a title</title></head><body><h2>Test</h2><p>Hello world!</p></body></html>"

from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlString,"html.parser")
#print(htmlString)
#print(soup)
print(soup.prettify()) # human-friendly output: the prettify() method

<!DOCTYPE html>
<html>
 <head>
  <title>
   This is a title
  </title>
 </head>
 <body>
  <h2>
   Test
  </h2>
  <p>
   Hello world!
  </p>
 </body>
</html>


BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree"
---
![BeautifulSoup](images/bs4.png)

Basic elements of the BeautifulSoup class
---
![BeautifulSoup_elements](images/bs4_elements.png)


Parsers supported by BeautifulSoup
---
![BeautifulSoup_parser](images/bs4_parser.png)
* More details at: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id9

Use BeautifulSoup for parsing HTML
---

In [34]:
from bs4 import BeautifulSoup
import requests

url = "http://www.crummy.com/software/BeautifulSoup"
r = requests.get(url)

## get BeautifulSoup object
soup = BeautifulSoup(r.text,"lxml")
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [38]:
## compare these print statements
# print(soup)
# print(soup.prettify())
# print(soup.get_text()) # print text by removing tags

## show how to find all a tags
soup.find_all('a')

## ***Why does this not work? ***
# soup.find_all('Soup')

[<a href="bs4/download/"><h1>Beautiful Soup</h1></a>,
 <a href="#Download">Download</a>,
 <a href="bs4/doc/">Documentation</a>,
 <a href="#HallOfFame">Hall of Fame</a>,
 <a href="https://code.launchpad.net/beautifulsoup">Source</a>,
 <a href="https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup">Discussion group</a>,
 <a href="http://www.candlemarkandgleam.com/shop/constellation-games/"><i>Constellation
 Games</i>, my sci-fi novel about alien video games</a>,
 <a href="http://constellation.crummy.com/Constellation%20Games%20excerpt.html">read
 the first two chapters for free</a>,
 <a href="https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup">the discussion
 group</a>,
 <a href="https://bugs.launchpad.net/beautifulsoup/">file it</a>,
 <a href="http://lxml.de/">lxml</a>,
 <a href="http://code.google.com/p/html5lib/">html5lib</a>,
 <a href="bs4/doc/">Read more.</a>,
 <a name="Download"><h2>Download Beautiful Soup</h2></a>,
 <a href="bs4/download/">Beautiful Soup
 

The commented-out command `soup.find_all('Soup')` returns an empty list because `Soup` is not an HTML tag. It is just a string that occurs in the webpage, and `find_all` with a plain string argument matches tag names, not text.
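
To make the difference between tag names and page text concrete, here is a minimal sketch. It runs on a small made-up HTML snippet (an assumption, not the real crummy.com page) and uses the built-in `html.parser` so it does not depend on `lxml`. The `string=` argument requires a reasonably recent version of Beautiful Soup 4; older releases called it `text=`.

```python
import re
from bs4 import BeautifulSoup

# A small stand-in document (hypothetical, not the real page)
html = """
<html><body>
<h1>Beautiful Soup</h1>
<p>Soup is a parsing library.</p>
<a href="bs4/doc/">Documentation</a>
</body></html>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# find_all('Soup') looks for <Soup> tags, and there are none
tags_named_soup = soup_demo.find_all('Soup')

# string=... matches text nodes instead of tag names
text_matches = soup_demo.find_all(string=re.compile('Soup'))

print(tags_named_soup)
print(text_matches)
```

The first call finds nothing, while the second returns the text nodes that contain the word `Soup`.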

Examples of using BeautifulSoup
---
More examples can be found at: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

In [None]:
## get attribute values and text from an element:
## find() returns only the first matching tag, not all of them
first_tag = soup.find('a')
print(first_tag)

## get attribute `href`
print(first_tag.get('href'))
print(first_tag['href'])

## get text
print(first_tag.string)
print(first_tag.text)
print(first_tag.get_text())
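
The three ways of getting text in the cell above do not always agree. The sketch below, on two made-up tags (assumptions for illustration), suggests when they differ: `.string` only gives a value when the tag has a single text descendant, while `.text` and `.get_text()` concatenate all the text inside the tag.

```python
from bs4 import BeautifulSoup

# Two hypothetical snippets: one with a single nested string, one with mixed children
simple = BeautifulSoup('<a href="bs4/download/"><h1>Beautiful Soup</h1></a>',
                       "html.parser").find('a')
mixed = BeautifulSoup('<p>Hello <b>world</b>!</p>', "html.parser").find('p')

# .string follows a chain of single children down to the text
simple_string = simple.string

# with several children, .string cannot pick one and returns None
mixed_string = mixed.string

# .text and .get_text() join all text inside the tag
mixed_text = mixed.text
mixed_get_text = mixed.get_text()

print(simple_string, mixed_string, mixed_text, mixed_get_text)
```

For scraping tasks where a tag may contain nested markup, `.get_text()` is usually the safer choice.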

In [39]:
## get all links in the page
link_list = [link.get('href') for link in soup.find_all('a')]
print(link_list)

## filter all external links
## create an empty list to collect the valid links
external_links = []

## write a loop to filter the links
## if it starts with 'http' we are happy
# for l in link_list:
#     if l[:4] == 'http':
#         external_links.append(l)
        
## This throws an error! It says something about 'NoneType'
## Let's investigate. Have a close look at the link_list:
# link_list

## Seems that there are None elements! Let's verify
# print([link for link in link_list if link is None])

## So there are two elements in the list that are None!

['bs4/download/', '#Download', 'bs4/doc/', '#HallOfFame', 'https://code.launchpad.net/beautifulsoup', 'https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup', 'http://www.candlemarkandgleam.com/shop/constellation-games/', 'http://constellation.crummy.com/Constellation%20Games%20excerpt.html', 'https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup', 'https://bugs.launchpad.net/beautifulsoup/', 'http://lxml.de/', 'http://code.google.com/p/html5lib/', 'bs4/doc/', None, 'bs4/download/', 'http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html', 'download/3.x/BeautifulSoup-3.2.1.tar.gz', None, 'http://www.nytimes.com/2007/10/25/arts/design/25vide.html', 'https://github.com/reddit/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py', 'http://www.harrowell.org.uk/viktormap.html', 'http://svn.python.org/view/tracker/importer/', 'http://www2.ljworld.com/', 'http://www.b-list.org/weblog/2010/nov/02/news-done-broke/', 'http://esrl.noaa.gov/gsd
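
Where do the `None` entries come from? The earlier `find_all('a')` output includes `<a name="Download">`, an anchor with no `href` attribute. The sketch below reproduces that case on a single tag (the snippet is copied from that output; the variable names are made up) and compares `.get('href')` with `['href']`.

```python
from bs4 import BeautifulSoup

# An <a> tag that has a name attribute but no href
html = '<a name="Download"><h2>Download Beautiful Soup</h2></a>'
anchor = BeautifulSoup(html, "html.parser").find('a')

# .get() returns None for a missing attribute, like dict.get()
href_via_get = anchor.get('href')

# indexing raises KeyError for a missing attribute, like dict[key]
raised_key_error = False
try:
    anchor['href']
except KeyError:
    raised_key_error = True

# has_attr() tests for the attribute without raising
has_href = anchor.has_attr('href')

print(href_via_get, raised_key_error, has_href)
```

So `link.get('href')` is what silently puts `None` into `link_list`; using `link['href']` instead would have stopped the list comprehension with a `KeyError`.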

In [40]:
# Let's filter those objects out in the for loop
external_links = []

## get all links in the page
link_list = [link.get('href') for link in soup.find_all('a')]

# write a loop to filter the links
# if it is not None and starts with 'http' we are happy
for link in link_list:
    if link is not None and link[:4] == 'http': # Note: short-circuit evaluation
        external_links.append(link)
        
external_links

['https://code.launchpad.net/beautifulsoup',
 'https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup',
 'http://www.candlemarkandgleam.com/shop/constellation-games/',
 'http://constellation.crummy.com/Constellation%20Games%20excerpt.html',
 'https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup',
 'https://bugs.launchpad.net/beautifulsoup/',
 'http://lxml.de/',
 'http://code.google.com/p/html5lib/',
 'http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html',
 'http://www.nytimes.com/2007/10/25/arts/design/25vide.html',
 'https://github.com/reddit/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py',
 'http://www.harrowell.org.uk/viktormap.html',
 'http://svn.python.org/view/tracker/importer/',
 'http://www2.ljworld.com/',
 'http://www.b-list.org/weblog/2010/nov/02/news-done-broke/',
 'http://esrl.noaa.gov/gsd/fab/',
 'http://laps.noaa.gov/topograbber/',
 'http://groups.google.com/group/beautifulsoup/',
 'https://launchpad.net/beauti

Note: The above `if` condition works because of short-circuit (lazy) evaluation in Python. An `and` expression is false as soon as its first operand is false, so Python never evaluates the second operand in that case. A `None` entry in the list is therefore never asked for its first four characters.
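
A minimal sketch of this behaviour, using a short hand-made list (an assumption standing in for `link_list`):

```python
links = ['http://lxml.de/', None, 'bs4/doc/']

# Without the None check, slicing None raises TypeError
error_message = None
try:
    [link for link in links if link[:4] == 'http']
except TypeError as err:
    error_message = str(err)

# With the check first, `and` stops before link[:4] is evaluated for None
safe = [link for link in links if link is not None and link[:4] == 'http']

print(error_message)
print(safe)
```

The order of the two conditions matters: writing `link[:4] == 'http' and link is not None` would still fail on the `None` entry.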

In [None]:
# a more Pythonic solution: use a list comprehension
link_list = [link.get('href') for link in soup.find_all('a')]
[link for link in link_list if link is not None and link.startswith('http')]
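
Another option is to let `find_all` skip the tags without an `href` in the first place. The sketch below uses a small made-up snippet (an assumption, not the real page); passing `href=True` asks for tags that have the attribute, whatever its value.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two external links, one relative link, one anchor without href
html = """
<a href="http://lxml.de/">lxml</a>
<a name="Download">Download</a>
<a href="bs4/doc/">Documentation</a>
<a href="https://bugs.launchpad.net/beautifulsoup/">file it</a>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute,
# so a['href'] is safe and no None check is needed
external = [a['href'] for a in soup_demo.find_all('a', href=True)
            if a['href'].startswith('http')]

print(external)
```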

In-Class Practice:
===

* Scrape the list of all topic posts (titles and links) on the joke board of the SJTU Yinshui Siyuan BBS (饮水思源): https://bbs.sjtu.edu.cn/bbsdoc,board,joke.html
* Hint 1: find the pattern in the URLs
    - URL of page 1 of the joke board: https://bbs.sjtu.edu.cn/bbsdoc,board,joke,page,0.html
    - URL of page 2 of the joke board: https://bbs.sjtu.edu.cn/bbsdoc,board,joke,page,1.html
    - URL of the last page?
    - URL format of a joke post on each page: https://bbs.sjtu.edu.cn/bbscon,board,joke,file,M.990192870.A.html
* Hint 2
    - find_all('a') finds all the links
    - link['href'] or link.get('href'), link.text
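
One possible starting point is sketched below. It is not a complete solution: the helper names `page_url` and `extract_posts` are made up, the sample HTML is hypothetical, and the assumption that post links appear as relative `href` values starting with `bbscon,board,joke,file,` needs to be checked against the real page source. Fetching each page (for example with `requests.get`) and finding the last page index are left as the exercise.

```python
from bs4 import BeautifulSoup

BASE = "https://bbs.sjtu.edu.cn/"

def page_url(page_index):
    """Build the URL of one listing page, following the pattern in Hint 1."""
    return f"{BASE}bbsdoc,board,joke,page,{page_index}.html"

def extract_posts(html):
    """Return (title, url) pairs for links that look like joke posts."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for link in soup.find_all('a'):
        href = link.get('href')
        # Assumption: post links are relative and start with this prefix
        if href is not None and href.startswith('bbscon,board,joke,file,'):
            posts.append((link.text.strip(), BASE + href))
    return posts

# Hypothetical listing fragment, only to exercise the function
sample_html = """
<table>
<tr><td><a href="bbsqry?userid=someone">someone</a></td>
<td><a href="bbscon,board,joke,file,M.990192870.A.html">A sample title</a></td></tr>
</table>
"""

print(page_url(0))
print(extract_posts(sample_html))
```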

Recap
===
* Introduction
    - What is web scraping? and Why?
    - Robots.txt: the crawler/robots exclusion protocol
* Crawling Webpage
    - urllib in standard library
    - Handling encoding and exception issues
    - requests: HTTP for Humans
* Parsing Webpage
    - HTML Basics
    - BeautifulSoup

Further Readings (Optional)
===

* Python strings and encodings: http://t.cn/R2yTUMm
* Documentation:
    - Requests: http://docs.python-requests.org/en/master/user/quickstart/
    - BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
    
* Book:
    - Ryan Mitchell, 2015. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media, 1st Edition.

Acknowledgement
===
* These slides are a Jupyter notebook: http://jupyter.org/
* Slides presented with 'live reveal': https://github.com/damianavila/RISE