# Crawler Basics

**Dr. Pengfei Zhao**

Finance Mathematics Program, 

BNU-HKBU United International College

**Why Learn to Write a Crawler**

* Data is king. How do we acquire data?


**Basic Workflow**

1. Capture the web page.

    * When you click a link on a web page, you send a **request** to the web server, and the server returns the page to you over the internet according to the HTTP protocol.
    * A crawler captures a web page in the same way, by sending a request to the web server. Unlike clicking a link manually, a crawler sends the HTTP request programmatically, which is much faster.

2. Parse the web page.

    * This step extracts the useful **information** from the HTML page. For example, after step 1 you have a web page full of HTML tags, but you may only be interested in a few numbers and strings.

3. Save data.

    * After extracting the desired data, you may want to keep it for later use, e.g. for analysis and prediction. You can save it to a file (txt, csv, ...) or to a database.
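
Step 3 can be as simple as writing one row per extracted record. The sketch below is a minimal illustration using Python's built-in `csv` module. The record fields and values are made-up placeholders, and the rows go to an in-memory buffer so the example runs without touching the disk; to write a real file, pass `open('news.csv', 'w', newline='', encoding='utf-8')` instead.

```python
import csv
import io

# Hypothetical records, standing in for data extracted in step 2.
records = [
    {'title': 'First headline', 'date': '2018-02-10', 'comments': 4},
    {'title': 'Second headline', 'date': '2018-02-11', 'comments': 0},
]

def save_records(records, file_obj):
    """Write a list of dicts to file_obj as CSV, one row per record."""
    writer = csv.DictWriter(file_obj, fieldnames=['title', 'date', 'comments'])
    writer.writeheader()
    writer.writerows(records)

buffer = io.StringIO()
save_records(records, buffer)
print(buffer.getvalue())
```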


##  Now let us learn by practice

### Step 1: Capture the Web Page.

In [2]:
import requests

newsurl = 'http://uic.edu.hk/cn'
res = requests.get(newsurl)
print (res.text)

﻿<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="cn" lang="cn" dir="ltr">
<head>
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7"/>
		<base href="http://uic.edu.hk/cn" />
	<meta http-equiv="content-type" content="text/html; charset=utf-8" />
	<meta name="generator" content="Joomla! - Open Source Content Management" />
	<title>首页</title>
	<link href="/cn/?format=feed&amp;type=rss" rel="alternate" type="application/rss+xml" title="RSS 2.0" />
	<link href="/cn/?format=feed&amp;type=atom" rel="alternate" type="application/atom+xml" title="Atom 1.0" />
	<link href="http://uic.edu.hk/cn/" rel="alternate" hreflang="cn" />
	<link href="http://uic.edu.hk/en/" rel="alternate" hreflang="en" />
	<link href="/templates/uic_web/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon" />
	<link href="http://uic.edu.hk

In [6]:
print (type(res))

<class 'requests.models.Response'>


* We can use the `requests` module to easily send an HTTP request and receive the response. `res.text` returns the body of the response as text.

* You can view the official documentation of the `requests` module [here](http://docs.python-requests.org/en/master/).

* Thanks to the great `requests` module, we can easily capture the web page.
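
To see what a request consists of without contacting any server, `requests` can build a request object and stop before sending it. The sketch below is only an illustration; the `page` query parameter is a made-up example and not something the UIC site is known to accept.

```python
import requests

# Build (but do not send) a GET request. The 'page' parameter is hypothetical.
request = requests.Request('GET', 'http://uic.edu.hk/cn', params={'page': 2})
prepared = request.prepare()

print(prepared.method)  # the HTTP method
print(prepared.url)     # the final URL, with the query string appended
```

When a request is actually sent, `requests.get` also accepts a `timeout` argument in seconds (e.g. `requests.get(newsurl, timeout=10)`), and `res.raise_for_status()` raises an exception if the server answered with a 4xx or 5xx status code.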

### Step 2: Parse the Web Page

* The text returned above is clearly HTML. We now need to extract information from among its many HTML tags.
* Before that, we introduce another useful module, `BeautifulSoup`.
* Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree for each parsed page, from which data can be extracted, which makes it useful for **web scraping**.
* You can find the official documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). There are also many tutorials, for example [this one](http://www.pythonforbeginners.com/beautifulsoup/).

**Example: Extract all links in a webpage**

In [3]:
from bs4 import BeautifulSoup

webpage = requests.get('http://en.wikipedia.org/wiki/Main_Page')
soup = BeautifulSoup(webpage.text,'html.parser')
for anchor in soup.find_all('a'):
    print(anchor.get('href', '/'))

/
#mw-head
#p-search
/wiki/Wikipedia
/wiki/Free_content
/wiki/Encyclopedia
/wiki/Wikipedia:Introduction
/wiki/Special:Statistics
/wiki/English_language
/wiki/Portal:Arts
/wiki/Portal:Biography
/wiki/Portal:Geography
/wiki/Portal:History
/wiki/Portal:Mathematics
/wiki/Portal:Science
/wiki/Portal:Society
/wiki/Portal:Technology
/wiki/Portal:Contents/Portals
/wiki/File:Bundesarchiv_Bild_101I-185-0116-22A,_Bucht_von_Kotor_(-),_jugoslawische_Schiffe.jpg
/wiki/Yugoslav_destroyer_Dubrovnik
/wiki/Flotilla_leader
/wiki/Royal_Yugoslav_Navy
/wiki/Yarrow_Shipbuilders
/wiki/Glasgow
/wiki/Destroyer
/wiki/Czechoslovakia
/wiki/%C5%A0koda_Works
/wiki/Nazi_Germany
/wiki/Axis_powers
/wiki/Invasion_of_Yugoslavia
/wiki/Regia_Marina
/wiki/Allies_of_World_War_II
/wiki/Operation_Harpoon_(1942)
/wiki/Malta
/wiki/Prize_(law)
/wiki/World_War_II
/wiki/Armistice_of_Cassibile
/wiki/Kriegsmarine
/wiki/Battle_of_the_Ligurian_Sea
/wiki/Royal_Navy
/wiki/Scuttling
/wiki/Yugoslav_destroyer_Dubrovnik
/wiki/Lancashire_Fusi

* The above is just a toy example that extracts all the links of a web page. In many cases we want to extract only the elements under specific tags.
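
Many of the `href` values printed above are relative (`/wiki/Wikipedia`) or fragment-only (`#mw-head`), so they cannot be requested directly. A minimal sketch of turning them into absolute URLs with the standard library's `urllib.parse.urljoin`, using a few values copied from the output above:

```python
from urllib.parse import urljoin

base_url = 'http://en.wikipedia.org/wiki/Main_Page'
hrefs = ['/wiki/Wikipedia', '#mw-head', '/wiki/Portal:Arts']

# Resolve every href against the URL of the page it was found on.
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```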

**Example**

* Suppose we have the HTML page below:

In [4]:
html_sample = '\
<html>\
    <body> \
        <h1 id="title"> Hello World </h1> \
        <a href="#" class="link"> This is Link 1 </a> \
        <a href="# link2" class="link"> This is Link 2 </a> \
    </body> \
</html>'

In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_sample, 'html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [6]:
print (soup.text)

   Hello World   This is Link 1   This is Link 2   


* **Use `select` to find all elements under `h1` tag.**

In [17]:
header = soup.select('h1')
print (header)
print (header[0])
print (header[0].text)

[<h1 id="title"> Hello World </h1>]
<h1 id="title"> Hello World </h1>
 Hello World 


In [22]:
alink = soup.select('a')

for link in alink:
    print (link.text)

 This is Link 1 
 This is Link 2 


* **Find elements whose `id` attribute is `title` (selector `#title`)**

In [23]:
id_title = soup.select('#title')
print (id_title)
print (id_title[0].text)

[<h1 id="title"> Hello World </h1>]
 Hello World 


* **Find elements whose `class` attribute is `link` (selector `.link`)**

In [27]:
class_links = soup.select('.link')
print (class_links)
for class_link in class_links:
    print (class_link.text)

[<a class="link" href="#"> This is Link 1 </a>, <a class="link" href="# link2"> This is Link 2 </a>]
 This is Link 1 
 This is Link 2 


* **Find all `href` links under `a` tag**

In [28]:
class_links = soup.select('a')
for link in class_links:
    print(link['href'])

#
# link2
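
`select` accepts CSS selectors, so the tag, id, and class forms used above can be combined. The following sketch repeats a shortened copy of the sample page so that it runs on its own; the variable names are chosen only for this illustration. `select_one` returns the first match, or `None` when nothing matches, instead of a list.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><body>'
    '<h1 id="title"> Hello World </h1>'
    '<a href="#" class="link"> This is Link 1 </a>'
    '<a href="# link2" class="link"> This is Link 2 </a>'
    '</body></html>'
)
demo_soup = BeautifulSoup(html_doc, 'html.parser')

# Tag and class together: only <a> elements that have class "link".
links = demo_soup.select('a.link')
# Descendant selector: <a> elements anywhere inside <body>.
body_links = demo_soup.select('body a')
# select_one returns a single Tag (or None when nothing matches).
title = demo_soup.select_one('h1#title')
missing = demo_soup.select_one('.no-such-class')

print(len(links), len(body_links))
print(title.text.strip())
print(missing)
```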


In [31]:
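# Note: HTML attributes are separated by whitespace, not commas. The stray commas
# below are parsed as an attribute named ',' and as part of the attr1 value.
html_str = '<a href="#", attr1=123, attr2=456> i am a link </a>'  # avoid shadowing the built-in str
soup2 = BeautifulSoup(html_str, 'html.parser')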
res = soup2.select('a')[0]
print (type(res))

<class 'bs4.element.Tag'>


In [36]:
print (res)
print (res.text)
print (res['href'])
print (res['attr1'])
print (res['attr2'])

<a ,="" attr1="123," attr2="456" href="#"> i am a link </a>
 i am a link 
#
123,
456


* Above we learned the basic usage of `BeautifulSoup`. Now we apply it to crawl the news from the website http://news.sina.com.cn/china/.

## Extract news from `www.sina.com.cn`

To extract news, we need to do the following steps:

**(1)** Extract links from http://news.sina.com.cn/china/ and save them into a list L.

**(2)** For each link in L, extract the page content (e.g. title, date, article body, editor, comment).

**(3)** Combine (1) and (2) to extract the page content of all the links in L.

The sections below show how to complete each of these tasks.

### (1) Extract links from http://news.sina.com.cn/china/

* We first need to capture the web page with the `requests` module and create the `BeautifulSoup` object.

In [4]:
import requests
from bs4 import BeautifulSoup

page = requests.get('http://news.sina.com.cn/china/')
page.encoding = 'utf-8'
# Create BeautifulSoup object, to be used in parsing html
soup_page = BeautifulSoup(page.text, 'html.parser')

* Next, we need to find which HTML element contains the news links. In many cases this has to be done manually.
* With the help of the Google Chrome web page inspector (in Chrome, right-click a web page and click "Inspect"), we can see that each news link sits inside an element with class `news-item`.

1              |  2 |  3 | 
:-------------------------:|:-------------------------:|:-------------------------:|
<img src="../Figures/Crawler/inspector1.jpeg" width = "300" height = "550"/>  |  <img src="../Figures/Crawler/inspector2.jpeg" width = "300" height = "550"/>  |   <img src="../Figures/Crawler/inspector3.jpeg" width = "300" height = "550"/>  | 

In [5]:
# After inspecting the HTML, we find each news link inside an element with class 'news-item'.
# Select all '.news-item' elements and loop over them.
for news in soup_page.select('.news-item'):
    # Keep only the .news-item elements that contain an h2 title
    if len(news.select('h2'))>0: 
        news_title = news.select('h2')[0].text
        time = news.select('.time')[0].text
        link = news.select('a')[0]['href']
        print ('++++++++++++++++++++++++++++++++++++')
        print (news_title, time, link)

++++++++++++++++++++++++++++++++++++
央视春晚彩排揭秘：这个体育加舞蹈节目创了个第1 2月10日 16:23 http://news.sina.com.cn/c/2018-02-10/doc-ifyrkrva6788092.shtml
++++++++++++++++++++++++++++++++++++
俄外交部女发言人用中文拜年 网友：请你吃饺子 2月10日 16:23 http://news.sina.com.cn/c/2018-02-10/doc-ifyrkrva6788016.shtml
++++++++++++++++++++++++++++++++++++
台发现失踪一家五口第三具遗体 12岁男孩确认罹难 2月10日 16:13 http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrkrva6781625.shtml
++++++++++++++++++++++++++++++++++++
新疆塔什库尔干县发生3.5级地震 震源深度15千米 2月10日 15:57 http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrmfmc1166618.shtml
++++++++++++++++++++++++++++++++++++
沿江高铁江苏段走南线经泰兴靖江？官方:尚无定论 2月10日 15:47 http://news.sina.com.cn/o/2018-02-10/doc-ifyrkrva6764022.shtml
++++++++++++++++++++++++++++++++++++
青海：庆华矿业尾矿浆直排未改变生态系统结构 2月10日 14:51 http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1330270.shtml
++++++++++++++++++++++++++++++++++++
山西警方破获一起虚开发票案 涉案金额达10亿余元 2月10日 14:42 http://news.sina.com.cn/c/2018-02-10/doc-ifyrkrva6717715.shtml
++++++++++++++++++++++++++++++++++++
今年广

### Extract Latest News

* It seems that we can now extract links from the news page. However, if you look at the links above carefully, they do not exactly match the titles in the "最新消息" (latest news) list, and some links are repeated.

* In fact, the news in the "最新消息" (latest news) list is loaded with the `AJAX` technique. You can read an introduction to AJAX [here](https://www.w3schools.com/xml/ajax_intro.asp). In short, it is a technique that updates a section of a web page without reloading the whole page. For example, if you keep scrolling down the "最新消息" news list, the list is refreshed with more news while the rest of the page is unaffected.

* When an AJAX event happens (e.g. clicking a button, or in our case scrolling to a certain position), the client browser sends a request to the server. The server reads the parameters in the request URL and returns the corresponding information, normally as `json` text (see the introduction [here](https://www.w3schools.com/js/js_json_intro.asp)). The browser then parses the JSON response and inserts the returned information into the page. The HTML of the main page is thus updated dynamically, without a reload.

* Since we want to extract all the news titles and links in the "最新消息" news list, we have to find a way to simulate triggering the AJAX event. Every time the news list is reloaded by scrolling down, a request to "http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page=3&callback=newsloadercallback&_=1518253611599" appears under "Chrome inspector --> Network --> JS --> Name". On the right side of the figure below we can see that the response contains exactly the newly loaded news of the "最新消息" list. If you double-click this link, you can see that the response is JSON text wrapped in a call to `newsloadercallback` (see the figure below).

AJAX Request URL              |  JSON Response |
:-------------------------:|:-------------------------:|
<img src="../Figures/Crawler/inspector4.jpeg" width = "400" height = "550"/>  |  <img src="../Figures/Crawler/inspector5.jpeg" width = "400" height = "550"/>  
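
Because the request URL carries `callback=newsloadercallback`, the JSON arrives wrapped in a function call (a format usually called JSONP) and has to be unwrapped before `json.loads` can read it. A minimal sketch of the unwrapping step follows; `unwrap_jsonp` is a hypothetical helper and the response string is a made-up miniature, not real server output.

```python
import json
import re

# Made-up miniature of a JSONP response; real responses are much longer.
jsonp_text = '  newsloadercallback({"result": {"data": [{"url": "http://example.com/a"}]}});'

def unwrap_jsonp(text, callback='newsloadercallback'):
    """Return the JSON text inside 'callback(...);', or text unchanged if it is not wrapped."""
    pattern = r'\s*' + re.escape(callback) + r'\((.*)\);?\s*'
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else text

data = json.loads(unwrap_jsonp(jsonp_text))
print(data['result']['data'][0]['url'])
```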
 
* If you press a page button of the news list (see the figure below), which is also an AJAX event, a request to "http://api.roll.news.sina.com.cn/zt_list..." is generated again, with a different value of the **page** parameter. The **page** parameter controls which part of the "最新消息" news list is loaded, so changing its value in the AJAX URL simulates reloading the list. Thus, we can loop over increasing **page** values to extract all the latest news.

<img src="../Figures/Crawler/inspector6.jpeg" width = "400" height = "550"/>
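
To inspect the parameters of such a URL in code rather than by eye, the standard library's `urllib.parse` can split the query string into a dictionary. The sketch below uses a shortened copy of the AJAX URL (several parameters are omitted for brevity) and then builds one URL per page; the page range 1 to 3 is an arbitrary choice for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the AJAX URL above (some parameters omitted for brevity).
ajax_url = ('http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw'
            '&show_num=22&format=json&page=3&callback=newsloadercallback')

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlsplit(ajax_url).query)
print(params['page'])      # which page of the list is requested
print(params['callback'])  # name of the function wrapped around the JSON

# Turn the URL into a template and fill in different page numbers.
url_template = ajax_url.replace('page=3', 'page={}')
page_urls = [url_template.format(page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```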

* Based on the above analysis, we can write the code below to extract the latest news:

In [6]:
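import requests
import json
import re

page_body = requests.get('http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page=2&callback=newsloadercallback&_=1517968461367')
# The response is JSONP: JSON wrapped in 'newsloadercallback(...);'.
# Remove the wrapper with a regex (str.lstrip/rstrip strip character sets, not prefixes).
jd = json.loads(re.sub(r'^\s*newsloadercallback\(|\);?\s*$', '', page_body.text))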
jd

{'result': {'count': '22',
  'data': [{'column': 'gdxw1',
    'comment_channel': 'gn',
    'createtime': '1518241333',
    'ext1': '',
    'ext2': 'gn:comos-fyrkzqr1283462:0',
    'ext3': '',
    'ext4': 'gn:comos-fyrkzqr1283462:0',
    'ext5': '原标题：到这个地方的中国游客人数已居世界第二，有些特殊规矩不能忘\n春节假期快到了，很多人选择回家，也有不少人外出旅游，这其中南极成了很多中国游客的旅游目的地。统计显示，近年来我国赴南极旅游人数快速增长，2017年接近5300人次，成为仅次于美国的南极旅游第二大客源国。业内专家认为，…',
    'id': '1-1-35314584',
    'img': '',
    'keywords': '国家海洋局,长城站,南极科考',
    'level': '0',
    'media_name': '央视新闻',
    'media_type': 'zwsp',
    'old_level': '2',
    'title': '到这里的中国游客数已居世界第二 有些规矩不能忘',
    'url': 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkzqr1283462.shtml'},
   {'column': 'gatxw',
    'comment_channel': 'gn',
    'createtime': '1518241107',
    'ext1': '',
    'ext2': 'gn:comos-fyrkrva6674824:0',
    'ext3': '',
    'ext4': 'gn:comos-fyrkrva6674824:0',
    'ext5': '新华社花莲2月10日电（记者李慧颖 李凯 喻菲）10日上午，花莲地震云门翠堤大楼现场受困的5名大陆游客中的两名，被搜救人员发现，现场确认遇难，其中一名为成年男子，一名为小孩。\n资料显示，最后失联5名大陆游客为一家四大

In [7]:
for news_dic in jd['result']['data']:
    print (news_dic['url'])

http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkzqr1283462.shtml
http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrkrva6674824.shtml
http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkzqr1267990.shtml
http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrmfmc1087231.shtml
http://news.sina.com.cn/c/zs/2018-02-10/doc-ifyrkzqr1258623.shtml
http://news.sina.com.cn/c/zs/2018-02-10/doc-ifyrkrva6652116.shtml
http://news.sina.com.cn/c/2018-02-10/doc-ifyrkzqr1252333.shtml
http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrmfmc1077641.shtml
http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkrva6636220.shtml
http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrkuxt0323199.shtml
http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrmfmc1070831.shtml
http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkrva6625455.shtml
http://news.sina.com.cn/w/2018-02-10/doc-ifyrkzqr1228781.shtml
http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1227765.shtml
http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrkuxt0211314.shtml
http://news.s

**We can write a function to extract all the links given a specific AJAX request URL**

In [33]:
import re

def parsePageLinks(url):
    latest_news_url_list = []
    res = requests.get(url)
    # Remove the JSONP wrapper 'newsloadercallback(...);' if present, then parse the JSON
    json_text = re.sub(r'^\s*newsloadercallback\(|\);?\s*$', '', res.text)
    jd = json.loads(json_text)
    for news_dic in jd['result']['data']:
        latest_news_url_list.append(news_dic['url'])
    return latest_news_url_list

In [34]:
url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page=2&callback=newsloadercallback&_=1517968461367'
latest_news_url_list = parsePageLinks(url)
latest_news_url_list

['http://news.sina.com.cn/zx/2018-02-10/doc-ifyrkuxt1962857.shtml',
 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkzqr1491972.shtml',
 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkzqr1491146.shtml',
 'http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrmfmc1273286.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkrva6881377.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkrva6881181.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1479795.shtml',
 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkuxt1882731.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1471563.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkrva6860738.shtml',
 'http://news.sina.com.cn/zx/2018-02-10/doc-ifyrmfmc1252984.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1460625.shtml',
 'http://news.sina.com.cn/c/2018-02-10/doc-ifyrkzqr1448761.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1448285.shtml',
 'http://news.sina.com.cn/w/2018-02-10/doc-ifyr

**We can then loop over the page number to extract all the links**

In [35]:
url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}'
total_news_links = []
for page in range(2,30):
    total_news_links.extend(parsePageLinks(url.format(page)))
total_news_links

['http://news.sina.com.cn/zx/2018-02-10/doc-ifyrkuxt1962857.shtml',
 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkzqr1491972.shtml',
 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkzqr1491146.shtml',
 'http://news.sina.com.cn/c/gat/2018-02-10/doc-ifyrmfmc1273286.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkrva6881377.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkrva6881181.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1479795.shtml',
 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkuxt1882731.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1471563.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkrva6860738.shtml',
 'http://news.sina.com.cn/zx/2018-02-10/doc-ifyrmfmc1252984.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1460625.shtml',
 'http://news.sina.com.cn/c/2018-02-10/doc-ifyrkzqr1448761.shtml',
 'http://news.sina.com.cn/o/2018-02-10/doc-ifyrkzqr1448285.shtml',
 'http://news.sina.com.cn/w/2018-02-10/doc-ifyr

### (2) Extract News Body Given a URL

* Given a URL, we want to extract the `title`, `date`, `source`, `article body`, `editor` and `number of comments` of the news page.
* We first download the news page with the `requests` module, then use the `BeautifulSoup` module to extract the information from the HTML tags.
* We use the page http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkrva6677197.shtml as the demo.

In [36]:
import requests

page = requests.get('http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkrva6677197.shtml')
page.encoding = 'utf-8'
page_body_text = page.text

In [37]:
page_body_text

'<!DOCTYPE html>\n<!-- [ published at 2018-02-10 13:49:51 ] -->\n<!-- LLTJ_MT:name ="央视新闻" -->\n\n<html>\n<head>\n<meta charset="utf-8"/>\n<meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n<meta name="sudameta" content="urlpath:c/; allCIDs:56044,257,51895,200856,51922,56261,258,38790">\n<title>政务APP能办事？有政府人员居然说：谁告诉你的|App|电子政务|软件开发_新浪新闻</title>\n<meta name="keywords" content="App,电子政务,软件开发" />\n<meta name="tags" content="App,电子政务,软件开发" />\n<meta name="description" content="原标题：央视调查|政务APP能办事？有政府人员居然说：谁告诉你的？！随着移动互联网的发展，不少地方也跟上了“流行”趋势，纷纷推出手机政务软件，提出“让群众少跑腿，让数据多跑路”。原本是树立政府形象的好事，但一些政务软件却问题百出，备受诟病。" />\n<link rel="mask-icon" sizes="any" href="//www.sina.com.cn/favicon.svg" color="red">\n<meta property="og:type" content="news" />\n<meta property="og:title" content="政务APP能办事？有政府人员居然说：谁告诉你的" />\n<meta property="og:description" content="政务APP能办事？有政府人员居然说：谁告诉你的" />\n<meta property="og:url" content="http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkrva6677197.shtml" />\n<meta pro

**Title**

In [38]:
soup_body = BeautifulSoup(page_body_text, 'html.parser')
main_title = soup_body.select('.main-title')[0].text
main_title

'政务APP能办事？有政府人员居然说：谁告诉你的'

**Date**

In [44]:
news_date = soup_body.select('.date-source .date')[0].text
news_date

'2018年02月10日 13:42'

In [45]:
# Change the datetime format
from datetime import datetime

dt = datetime.strptime(news_date, '%Y年%m月%d日 %H:%M')
dt.strftime('%Y-%m-%d')

'2018-02-10'

**Source**

In [46]:
source = soup_body.select('.date-source a')[0].text
source

'央视新闻'

**Text Body**

In [None]:
text_body = soup_body.select('.article p')
article = []
for paragraph in text_body[:-2]:
    article.append(paragraph.text.strip())
''.join(article)

We can write the above code in a shorter form:

In [None]:
text_body = soup_body.select('.article p')
''.join([paragraph.text.strip() for paragraph in text_body[:-2]])

**Editor**

In [48]:
editor = soup_body.select('.show_author')[0].text
editor

'责任编辑：张迪 '

**Comment**

In [50]:
comment_num =  soup_body.select('.num')[0].text
print (comment_num)

0


The above method does not work: the comment count is also loaded with AJAX, so the value found in the static HTML is not the real count.

In [57]:
comments = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-fyrkrva6677197&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
print (comments.text)

{"result": {"status": {"msg": "", "code": 0}, "count": {"qreply": 0, "total": 4, "show": 1}, "language": "ch", "encoding": "utf-8", "cmntlist": [{"comment_imgs": "", "parent_mid": "0", "news_mid_source": "0", "rank": "0", "mid": "5A7E9FF8-71C8CC87-4FCBD447-7F5-8C5", "vote": "0", "uid": "1338758215", "area": "\u9655\u897f", "channel_source": "", "content": "\u9a6c\u4e91\u8001\u5e08\u5f88\u6709\u53ef\u80fd\u4f1a\u4ee5\u98a0\u8986XX\u800c\u8d70\u5411\u6d88\u5931\uff0c\u53ef\u60dc\u554a\uff01", "nick": "\u897f\u90e8\u4e4b\u58f0westsound77", "status_uid": "0", "status_cmnt_mid": "", "parent_nick": "", "config": "wb_verified=0&wb_screen_name=\u897f\u90e8\u4e4b\u58f0westsound77&area=\u9655\u897f&wb_user_id=1338758215&followers_count=62&client_port=0&wb_profile_img=http%3A%2F%2Ftva4.sinaimg.cn%2Fcrop.0.0.180.180.50%2F4fcbd447jw1e8qgp5bmzyj2050050aa8.jpg&wb_time=1518247928", "channel": "gn", "comment_mid": "0", "status": "M_PASWAIT", "newsid_source": "", "parent": "", "parent_profile_img": "", 

In [58]:
import json

jd = json.loads(comments.text)
jd

{'result': {'cmntlist': [{'against': '0',
    'agree': '0',
    'area': '陕西',
    'channel': 'gn',
    'channel_source': '',
    'comment_imgs': '',
    'comment_mid': '0',
    'config': 'wb_verified=0&wb_screen_name=西部之声westsound77&area=陕西&wb_user_id=1338758215&followers_count=62&client_port=0&wb_profile_img=http%3A%2F%2Ftva4.sinaimg.cn%2Fcrop.0.0.180.180.50%2F4fcbd447jw1e8qgp5bmzyj2050050aa8.jpg&wb_time=1518247928',
    'content': '马云老师很有可能会以颠覆XX而走向消失，可惜啊！',
    'ip': '113.200.204.135',
    'length': '24',
    'level': '0',
    'mid': '5A7E9FF8-71C8CC87-4FCBD447-7F5-8C5',
    'news_mid': '0',
    'news_mid_source': '0',
    'newsid': 'comos-fyrkrva6677197',
    'newsid_source': '',
    'nick': '西部之声westsound77',
    'parent': '',
    'parent_mid': '0',
    'parent_nick': '',
    'parent_profile_img': '',
    'parent_uid': '0',
    'profile_img': 'http://tva4.sinaimg.cn/crop.0.0.180.180.50/4fcbd447jw1e8qgp5bmzyj2050050aa8.jpg',
    'rank': '0',
    'status': 'M_PASWAIT',
    'status_c

In [59]:
jd['result']['count']['total']

4

In [60]:
import re

newsurl = 'http://news.sina.com.cn/o/2018-02-01/doc-ifyreuzn1313095.shtml'
# Capture the news id between 'doc-i' and '.shtml' in the URL
news_id = re.search(r'doc-i(.+)\.shtml', newsurl).group(1)
news_id

'fyreuzn1313095'
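
Note that `str.lstrip` and `str.rstrip` remove any run of the listed *characters*, not a fixed prefix or suffix, so stripping `'doc-i'` can eat into an id that happens to begin with one of those letters. The sketch below shows a pattern-based alternative; `extract_news_id` is a hypothetical helper, and the second URL is invented to show the difference.

```python
import re

def extract_news_id(newsurl):
    """Return the part of the file name between 'doc-i' and '.shtml', or None."""
    match = re.search(r'doc-i(.+)\.shtml$', newsurl)
    return match.group(1) if match else None

real_url = 'http://news.sina.com.cn/o/2018-02-01/doc-ifyreuzn1313095.shtml'
# Hypothetical URL whose id starts with 'c', a character that lstrip('doc-i') also removes.
tricky_url = 'http://news.sina.com.cn/o/2018-02-01/doc-icabc123.shtml'

print(extract_news_id(real_url))
print(extract_news_id(tricky_url))
# The character-stripping approach silently loses the leading 'c' of the id:
stripped_id = tricky_url.split('/')[-1].lstrip('doc-i').rstrip('.shtml')
print(stripped_id)
```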

In [62]:
import json, re

commentURL_template = 'http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=5&page_size=3&t_size=3&h_size=3&thread=1'

def getCommentCount(newsurl):
    # Capture the news id between 'doc-i' and '.shtml' in the URL
    news_id = re.search(r'doc-i(.+)\.shtml', newsurl).group(1)
    commentURL = commentURL_template.format(news_id)
    comments_res = requests.get(commentURL)
    jd = json.loads(comments_res.text)
    return jd['result']['count']['total']

In [64]:
newsurl = 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkrva6677197.shtml'
print (getCommentCount(newsurl))

4


* Let us summarize the above procedure in a function.

In [4]:
import requests 
from bs4 import BeautifulSoup
from datetime import datetime
import json
import re

commentURL_template = 'http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=5&page_size=3&t_size=3&h_size=3&thread=1'

def getCommentCount(newsurl):
    # Capture the news id between 'doc-i' and '.shtml' in the URL
    news_id = re.search(r'doc-i(.+)\.shtml', newsurl).group(1)
    commentURL = commentURL_template.format(news_id)
    comments_res = requests.get(commentURL)
    jd = json.loads(comments_res.text)
    return jd['result']['count']['total']

def getNewsBody(news_url):
    result = {}
    news = requests.get(news_url)
    news.encoding = 'utf-8'
    news_soup = BeautifulSoup(news.text, 'html.parser')
    result['title'] = news_soup.select('.main-title')[0].text
    result['date'] = datetime.strptime(news_soup.select('.date-source .date')[0].contents[0], '%Y年%m月%d日 %H:%M').strftime('%Y-%m-%d')
    result['source'] = news_soup.select('.date-source a')[0].text
    result['article'] = ''.join([paragraph.text.strip() for paragraph in news_soup.select('.article p')[:-2]])
    result['editor'] = news_soup.select('.show_author')[0].text
    result['comments_num'] = getCommentCount(news_url)
    return result

In [6]:
news_url = 'http://news.sina.com.cn/c/nd/2018-02-10/doc-ifyrkrva6677197.shtml'
res = getNewsBody(news_url)
res

{'article': '原标题：央视调查 | 政务APP能办事？有政府人员居然说：谁告诉你的？！随着移动互联网的发展，不少地方也跟上了“流行”趋势，纷纷推出手机政务软件，提出“让群众少跑腿，让数据多跑路”。原本是树立政府形象的好事，但一些政务软件却问题百出，备受诟病。指尖上的便民工程，变成了形象工程，这究竟是怎么回事？闪退、数据异常 政务软件问题不少记者在手机应用商店里，输入“政务服务”进行搜索，立刻出现了上百个政府部门的手机软件。从省级政府到各区县政府，很多都有自己手机软件。但是这些软件普遍评分偏低，下载用户很少，评论也耐人寻味：“有本事不要通过单位必须下载”；“用政令要求强制下载隶属私人公司的APP”。央视记者尝试用不同的手机，下载了近40款政务软件，进行体验。其中，山东政务服务软件在安装完成之后，一点开就出现了闪退的情况，记者尝试了在多款手机上安装，都不同程度发生了闪退问题，完全无法正常使用。而福建省福州市晋安区的政务软件，发现其浏览量并不多，大约每条资讯的浏览量都在100次上下，然而当记者再次点开后，发现这条资讯的浏览量增加了10多次。央视记者 朱慧容：我们再尝试点开一次这条资讯：浏览量又增加了10多次。点击其它资讯，也出现了一样的情况：每点击一次，浏览量就会增加10多次。这款政务软件，首页中，就亮出了自己的成绩单，上线不到一年，访问量2.6亿，下载量超过千万。看起来，各种服务都挺齐全。记者随机点击首页“预约诊疗”，结果出现的却是一篇文章，让用户下载当地卫计委开发的另一款软件；再来试试“生活缴费”，记者尝试点开“缴电费”，却出现了这样的提示：“软件想要打开支付宝”，点击同意后，直接跳转到支付宝的缴费页面。这款软件的满分5分的评分系统里，只得了2.4分。用户点评是这样的：“各种加载、各种跳转、没做好为什么要放上去”。大部分政务软件用户评分不足3分这些并不是偶然情况，记者一共下载40多款政务软件，其中近二分之一都无法正常使用，可以使用的APP当中，大部分用户评分都不足3分。中国软件测评中心，在前不久，针对70多家国家部委和32家省级政府、网站政务软件的建设情况进行测试。测评报告显示：在建设移动手机软件的省部级单位中，39%的单位，仅提供基于安卓版的应用，与公众移动终端应用现状不匹配。对政务软件缺乏定向管理，山寨版本比比皆是，有损政府形象。更让人惊讶的一个数据是

### (3) Extract entire news body from latest news list

* Now let us combine everything we have done and extract the full news body of every link in the "latest news" list.

In [None]:
url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}'
total_news_links = []
for page in range(2,30):
    total_news_links.extend(parsePageLinks(url.format(page)))
total_news_links

In [46]:
ajax_request_news_list = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}'
commentURL_template = 'http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=5&page_size=3&t_size=3&h_size=3&thread=1'

import re

def parsePageLinks(url):
    latest_news_url_list = []
    res = requests.get(url)
    # Remove the JSONP wrapper 'newsloadercallback(...);' if present, then parse the JSON
    json_text = re.sub(r'^\s*newsloadercallback\(|\);?\s*$', '', res.text)
    jd = json.loads(json_text)
    for news_dic in jd['result']['data']:
        latest_news_url_list.append(news_dic['url'])
    return latest_news_url_list

def getCommentCount(newsurl):
    # Capture the news id between 'doc-i' and '.shtml' in the URL
    news_id = re.search(r'doc-i(.+)\.shtml', newsurl).group(1)
    commentURL = commentURL_template.format(news_id)
    comments_res = requests.get(commentURL)
    jd = json.loads(comments_res.text)
    return jd['result']['count']['total']

def getNewsBody(news_url):
    result = {}
    news = requests.get(news_url)
    news.encoding = 'utf-8'
    news_soup = BeautifulSoup(news.text, 'html.parser')
    result['title'] = news_soup.select('.main-title')[0].text
    result['date'] = datetime.strptime(news_soup.select('.date-source .date')[0].contents[0], '%Y年%m月%d日 %H:%M').strftime('%Y-%m-%d')
    result['source'] = news_soup.select('.date-source a')[0].text
    result['article'] = ''.join([paragraph.text.strip() for paragraph in news_soup.select('.article p')[:-2]])
    result['editor'] = news_soup.select('.show_author')[0].text
    result['comments_num'] = getCommentCount(news_url)
    return result

def all_news_body(num_pages):
    total_news_links = []
    result = []
    for page in range(2,num_pages):
        total_news_links.extend(parsePageLinks(ajax_request_news_list.format(page)))
    for news_url in total_news_links:
        result.append(getNewsBody(news_url))
    return result

In [49]:
num_pages = 4
news_body_list = all_news_body(num_pages)

In [50]:
news_body_list

[{'article': '资料图：事故现场。海外网2月10日电\xa0据香港东网报道，下午6时许，香港大埔公路近松仔园有巴士翻侧，目前已致11人死亡，40余人受伤，6人被困。据悉，肇事双层巴士为九巴872路线，行走沙田马场至大埔中心，巴士向左翻侧，不少乘客被困。轻伤乘客亦进行自救，打烂车尾窗，爬出车外协助救人，有伤者头破血流，获救后坐在路边，九巴车长亦受伤，初步指11死逾40人伤。现场消息称，一名乘客伤势颇为严重，有生命危险。另外，已有3名乘客陷入昏迷，获送院抢救。',
  'comments_num': 417,
  'date': '2018-02-10',
  'editor': '责任编辑：张岩 ',
  'source': '人民日报海外版-海外网',
  'title': '香港巴士翻车事故10名乘客获救 15人仍被困'},
 {'article': '原标题：[2018春运]火车票是如何生产的？记者带你走进“神秘工厂”一探究竟每逢年节，一张小小的火车票，牵动着千万人的心。从彻夜排队到动动手指、从拥挤的绿皮车到宽敞的动车组、从长途颠簸到“高铁能否再快点”，它承载着游子对家的渴望，也见证着中国春运的变迁。可你知道你手中的火车票是怎么印刷的吗？大河报记者带你走近郑州铁路局的印票人，了解火车票印制、分装的过程。“印票车间”1949年创立，属涉密单位“方寸”之间的火车票是在哪儿印制的呢？记者对中原铁道文化传媒公司印务分公司进行了探访。由于该单位属涉密单位，所以具体位置无法公布。该印务分公司地处一个还保留着上世纪50年代建筑风格的院落里。印务分公司经理王敏介绍，这里是原来的郑州铁路局印刷厂，1949年郑州就创办成立，现归属于中原铁道文化传媒公司，承揽着郑州、武汉、西安三个局集团公司的“红色”电子客票印刷业务，并负责配送。据了解，“红票”主要供各车站人工售票窗口和各铁路代售点。而旅客通过互联网购票后取的磁卡票（俗称：蓝票）则是在自助取票机上打印的，目前在其它铁路局票务分公司印制。走进印务分公司票据库房，墙壁上依次悬挂着保密、互控、交接等各种管理制度，6盏防爆灯下摆放着几十捆特制的票据防伪纸。三名身穿蓝色制服、戴着手套的职工正从成品库往外搬运成箱的车票，就是新鲜出炉的“红票”。一捆纸印10万张车票，不允许私自外流火车票在打印前究竟是怎样生产出来的？对于颇显神秘的车票

In [51]:
import pandas as pd

pd.DataFrame(news_body_list)

| | article | comments_num | date | editor | source | title |
|---|---|---|---|---|---|---|
| 0 | 资料图：事故现场。海外网2月10日电 据香港东网报道，下午6时许，香港大埔公路近松仔园有巴士... | 417 | 2018-02-10 | 责任编辑：张岩 | 人民日报海外版-海外网 | 香港巴士翻车事故10名乘客获救 15人仍被困 |
| 1 | 原标题：[2018春运]火车票是如何生产的？记者带你走进“神秘工厂”一探究竟每逢年节，一张小... | 12 | 2018-02-10 | 责任编辑：张岩 | 澎湃新闻 | 探访火车票印刷单位：废票不外流 送票和运钞相近 |
| 2 | 原标题：省纪委通报6起违反中央八项规定典型问题：随州一政协副主席被处分据湖北省纪委监委网站1... | 1 | 2018-02-10 | 责任编辑：张岩 | 湖北日报网 | 两干部受处分：主任认为酒档次低 下属买10瓶茅台 |
| 3 | 原标题：香港发生大巴翻侧事故 已致7死40余伤据香港东网报道，下午6时许，香港大埔公路近松仔... | 237 | 2018-02-10 | 责任编辑：张岩 | 人民日报 | 香港发生大巴翻侧事故致数人死伤 多名乘客被困 |
| 4 | 原标题：外交部提醒中国游客重视水上活动安全与交通安全针对春节出境游，外交部领事保护中心常务副... | 3 | 2018-02-10 | 责任编辑：张岩 | 央视 | 外交部：去年中国公民海外意外死亡多涉水上活动 |
| 5 | 原标题：零容忍、速整改、严监管——河北省对环保部曝光问题严肃查处2月9日，环保部微信公众号对... | 0 | 2018-02-10 | 责任编辑：张岩 | 澎湃新闻 | 河北省对环保部曝光问题严肃查处：零容忍速整改 |
| 6 | 原标题：中国放弃了8844.43米珠峰“身高”？国家测绘地理信息局：没有的事！新华社北京2月... | 275 | 2018-02-10 | 责任编辑：张岩 | 新华网 | 中国放弃2005年测定的珠峰海拔？国家测绘局:没有 |
| 7 | #交通事故通报# [公安部派出工作组赶赴湖北黄石指导重大交通事故调查处理工作]2月10日13... | 2 | 2018-02-10 | 责任编辑：张岩 | 政府网站 | 湖北阳新车祸致10死 公安部派工作组赴现场 |
| 8 | 原标题：广东被告人数最多涉黑案宣判 两“村霸”获刑20年中新网广州2月10日电（蔡敏婕 荔明... | 6 | 2018-02-10 | 责任编辑：张岩 | 中国新闻网 | 广东被告人数最多涉黑案宣判 两“村霸”获刑20年 |
| 9 | 原标题：江西一面包车在湖北发生车祸 致10死1伤（图） | 4 | 2018-02-10 | 责任编辑：张岩 | 央视 | 湖北阳新车祸致10死 面包车逆向超车处置不慎 |
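
When many pages are crawled in one loop, a single failed request or a page with an unexpected layout (where `select(...)[0]` raises `IndexError`) aborts the whole run. One possible arrangement, sketched below, is to catch errors per URL and pause between requests. `crawl_all`, `fake_fetch`, and the example URLs are hypothetical names used only for illustration; in this notebook, `getNewsBody` would play the role of `fetch`.

```python
import time

def crawl_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL; collect results and record failures instead of stopping."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:  # e.g. network errors, IndexError from a missing element
            failures.append((url, repr(exc)))
        time.sleep(delay)  # pause so the server is not flooded with requests
    return results, failures

def fake_fetch(url):
    """Stand-in for a real page parser; fails on one URL to show the error path."""
    if 'broken' in url:
        raise IndexError('list index out of range')
    return {'url': url}

demo_urls = ['http://example.com/a', 'http://example.com/broken']
results, failures = crawl_all(demo_urls, fake_fetch, delay=0)
print(results)
print(failures)
```

The failed URLs are kept in `failures` so they can be retried or inspected later, rather than being silently dropped.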
