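## I. Data Acquisition
Data acquisition also involves planning ahead for how the data will be stored and doing the necessary cleaning and wrangling, so that the data is ready for further analysis and processing. Typical data sources are:
- Local files: read local files in various formats. The Python standard library and third-party packages offer many APIs for this; the main formats are text files, Excel files, and binary files.
- Databases: every programming language can read from databases. The Python standard library and third-party packages offer many APIs for the mainstream databases, including relational databases such as sqlite and mysql and non-relational databases such as mongodb.
- Network: fetching data from the network, commonly called crawling, means writing a program that imitates a browser to retrieve data from the internet. In principle a crawler can retrieve any data on the internet that is open to access, including text (html, xml, json, etc.), binary files, audio, and video.

Getting better at acquiring data with Python depends not only on better Python programming skills but also on knowledge of the relevant domains. For example, writing a crawler requires understanding the rules of the http protocol, and accessing a relational database requires knowing sql. The sections below work through each of these data sources in practice to build a foundation.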



## II. Reading and Writing Local Files

### 1. Using the Python Standard Library

#### Reading and Writing Text Files

In [1]:
import sys
sys.getdefaultencoding()

'utf-8'

In [2]:
f = open('data/ch04/hello.txt', 'r', encoding='utf-8')
print(f.read())

Hello world!
A hello you may never have expected but that may just be the one you need.
一个你从来没有预期的‘你好’却往往可能是你真真需要的。



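The open function in the Python standard library creates a file object, through which the file can be manipulated. The second argument is the mode in which the file is opened:
- r: read-only
- rb: read-only, binary
- r+: read and write
- rb+: read and write, binary
- w: write; overwrites the file if it exists, creates it otherwise
- wb: write, binary; overwrites the file if it exists, creates it otherwise
- w+: read and write; overwrites the file if it exists, creates it otherwise
- wb+: read and write, binary; overwrites the file if it exists, creates it otherwise
- a: append
- ab: append, binary
- a+: read and append
- ab+: read and append, binary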
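
As a minimal sketch of how a few of these modes behave (the temporary directory and the file name `modes_demo.txt` are assumptions made purely for illustration):

```python
import os
import tempfile

# Work in a throwaway directory so the example does not touch real data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

with open(demo_path, 'w', encoding='utf-8') as f:   # 'w' creates or truncates
    f.write('first\n')

with open(demo_path, 'a', encoding='utf-8') as f:   # 'a' appends at the end
    f.write('second\n')

with open(demo_path, 'r+', encoding='utf-8') as f:  # 'r+' reads and writes in place
    f.write('FIRST')                                # overwrites the first five characters

with open(demo_path, 'rb') as f:                    # 'rb' returns bytes, not str
    raw = f.read()

with open(demo_path, 'r', encoding='utf-8') as f:
    text = f.read()

print(repr(text))
print(type(raw))
```

Note that `r+` does not truncate the file: it starts writing at the beginning and leaves the rest of the content untouched.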

A file object is iterable: it can be read one line at a time with readline, or looped over directly.

In [3]:
f.seek(0)  # rewind to the start of the file
print(f.readline())
print(f.readline())

Hello world!

A hello you may never have expected but that may just be the one you need.



In [4]:
f.seek(0)
for line in f:
    print(line)

Hello world!

A hello you may never have expected but that may just be the one you need.

一个你从来没有预期的‘你好’却往往可能是你真真需要的。



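When you have finished with a file, call the close method to close the file object and release system resources. This matters especially for large files, which can hold on to a lot of memory. In a complex program, make sure the file object still exists before closing it. Python provides the with statement, which calls close automatically.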

In [5]:
f.close()

In [6]:
f = open('data/ch04/hello.txt', 'r', encoding='utf-8')  # open before try: if open fails there is nothing to close
try:
    print(f.read())
finally:
    f.close()

Hello world!
A hello you may never have expected but that may just be the one you need.
一个你从来没有预期的‘你好’却往往可能是你真真需要的。



In [7]:
with open('data/ch04/hello.txt', 'r', encoding='utf-8') as f:
    print(f.read())

Hello world!
A hello you may never have expected but that may just be the one you need.
一个你从来没有预期的‘你好’却往往可能是你真真需要的。
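
To check that the with statement really closes the file, the `closed` attribute of the file object can be inspected. This is only a sketch; the scratch file `with_demo.txt` is a made-up name so that the code runs anywhere:

```python
import os
import tempfile

# Hypothetical scratch file so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('Hello world!\n')

with open(path, 'r', encoding='utf-8') as f:
    open_inside = not f.closed      # the file is open inside the block
    first_line = f.readline()

closed_after = f.closed             # the with statement has called close()

try:
    f.read()                        # reading a closed file raises ValueError
    read_failed = False
except ValueError:
    read_failed = True

print(open_inside, closed_after, read_failed)
```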



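Files can also be written. Mode 'a' appends, so every rerun of the cell below adds another copy to world.txt, which is why the output repeats.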

In [8]:
with open('data/ch04/hello.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()    
    with open('data/ch04/world.txt', 'a', encoding='utf-8') as f2:
        f2.writelines(lines)
        f2.write('That is right.\n')

with open('data/ch04/world.txt', 'r', encoding='utf-8') as f3:
    print(f3.read())

Hello world!
A hello you may never have expected but that may just be the one you need.
一个你从来没有预期的‘你好’却往往可能是你真真需要的。
That is right.
﻿Hello world!
A hello you may never have expected but that may just be the one you need.
一个你从来没有预期的‘你好’却往往可能是你真真需要的。
That is right.
﻿Hello world!
A hello you may never have expected but that may just be the one you need.
一个你从来没有预期的‘你好’却往往可能是你真真需要的。
That is right.
Hello world!
A hello you may never have expected but that may just be the one you need.
一个你从来没有预期的‘你好’却往往可能是你真真需要的。
That is right.



#### Exercise: write "python" to a text file 10 times, then read it back

In [9]:
with open('data/ch04/temp.txt', 'w', encoding='utf-8') as f4:        
    f4.write('python\n' * 10)

with open('data/ch04/temp.txt', 'r', encoding='utf-8') as f5:
    print(f5.read())

python
python
python
python
python
python
python
python
python
python



#### Working with JSON Data

In [10]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""

In [11]:
import json
result = json.loads(obj)
result

{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
  {'age': 33, 'name': 'Katie', 'pet': 'Cisco'}]}

In [12]:
asjson = json.dumps(result)
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"}, {"name": "Katie", "age": 33, "pet": "Cisco"}]}'

To work with files rather than strings, use json.dump() and json.load() to encode and decode JSON data.

In [13]:
with open('data/ch04/data.json', 'w', encoding='utf-8') as f:
    json.dump(result, f)  # dump the parsed object, not the raw JSON string obj

with open('data/ch04/data.json', 'r', encoding='utf-8') as f:
    result = json.load(f)  # one load is enough because an object was dumped

result

{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
  {'age': 33, 'name': 'Katie', 'pet': 'Cisco'}]}
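
When the data contains non-ASCII text, json.dumps escapes it by default. The sketch below, which uses made-up sample data and a hypothetical file name `record.json`, shows the `ensure_ascii` and `indent` options together with a file round trip:

```python
import json
import os
import tempfile

record = {'name': '阿甘正传', 'year': 1994, 'tags': ['剧情', '爱情']}  # illustrative data

escaped = json.dumps(record)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the original characters

path = os.path.join(tempfile.mkdtemp(), 'record.json')  # hypothetical file name
with open(path, 'w', encoding='utf-8') as f:
    json.dump(record, f, ensure_ascii=False, indent=2)  # indent pretty-prints

with open(path, 'r', encoding='utf-8') as f:
    loaded = json.load(f)

print(escaped)
print(readable)
```

Either form decodes back to the same Python object; `ensure_ascii=False` only makes the file easier for people to read.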

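#### Saving Python Objects
shelve can also be used to simulate a key-value database. It saves Python objects directly to a file and returns them as Python objects when they are read back, so there is no need to fetch raw data first and then rebuild the objects, as with a traditional database.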

In [14]:
import sys, shelve

def storeperson(db):
    id = input('Id:')
    person = {}
    person['name'] = input('Name:')
    person['age'] = input('Age:')
    person['mobile'] = input('Mobile:')
    db[id] = person

def findperson(db):
    id = input('Find by id:')
    field = input('Field:')
    field = field.strip().lower()
    print(field.capitalize() + ':' + db[id][field])

def entercommand():
    cmd = input('Enter command:')
    cmd = cmd.strip().lower()
    return cmd

def main():
    db = shelve.open('data/ch04/store.dat')  # forward slashes avoid invalid escapes such as \c and \s
    try:
        while True:
            cmd = entercommand()
            if cmd == 'store':
                storeperson(db)
            elif cmd == 'find':
                findperson(db)
            elif cmd == "quit":
                return
    finally:
        db.close()

main()

Enter command:quit
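
The program above needs keyboard input, which makes it awkward to rerun. A minimal non-interactive sketch of the same store-and-find idea follows; the path `store` in a temporary directory and the sample person are assumptions for illustration:

```python
import os
import shelve
import tempfile

# Hypothetical location; shelve may add its own file extension(s).
db_path = os.path.join(tempfile.mkdtemp(), 'store')

with shelve.open(db_path) as db:       # keys must be strings
    db['001'] = {'name': 'Alice', 'age': '30', 'mobile': '13800000000'}

with shelve.open(db_path) as db:       # reopen: the dict comes back as a dict
    person = db['001']
    known_ids = list(db.keys())

print(person['name'], known_ids)
```

One thing to keep in mind: changing a stored object in place, for example `db['001']['age'] = '31'`, is not written back unless the shelf is opened with `writeback=True` or the modified object is assigned to the key again.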


#### Working with the File System
The Python standard library also provides file system operations, mainly in the os module.

In [15]:
import os

path = os.path.abspath('.')
#path = os.getcwd()
print(path)
path2 = os.path.join(path, 'files')
print(path2)

os.mkdir(path2)
os.rmdir(path2)

path3 = os.path.join(path, 'file.txt')
print(os.path.split(path3))
print(os.path.splitext(path3))

print([x for x in os.listdir('.') if os.path.isfile(x) and os.path.splitext(x)[1]=='.ipynb'])

e:\Project\Python\course-begining-python
e:\Project\Python\course-begining-python\files
('e:\\Project\\Python\\course-begining-python', 'file.txt')
('e:\\Project\\Python\\course-begining-python\\file', '.txt')
['1. python课程简介.ipynb', '2. python开发环境和jupyter notebook.ipynb', '3. python编程基础.ipynb', '4. python数据获取.ipynb', 'python matplotlib可视化.ipynb', 'python numpy包.ipynb', 'python pandas数据处理.ipynb', 'python数据可视化.ipynb', 'python爬虫.ipynb', 'python编程进阶.ipynb']


#### Exercise: write a function that copies a file

In [16]:
import os

def copyfile(from_filename, to_filename):
    if os.path.exists(to_filename):
        print('The file exists.')
        return
    print('Copy from %s to %s' % (from_filename, to_filename))
    # with closes each file even if an error occurs
    with open(from_filename, encoding='utf-8') as from_file:
        data = from_file.read()
    print('The file size is %d' % len(data))
    with open(to_filename, 'w', encoding='utf-8') as to_file:
        to_file.write(data)

copyfile('data/ch04/hello.txt', 'data/ch04/copy.txt')

The file exists.
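
For comparison, the standard library already has a copy function in the shutil module. The sketch below uses scratch files with made-up names and keeps the same "do not overwrite" guard as the exercise:

```python
import os
import shutil
import tempfile

work_dir = tempfile.mkdtemp()                      # scratch directory for the demo
src = os.path.join(work_dir, 'hello.txt')          # hypothetical file names
dst = os.path.join(work_dir, 'copy.txt')

with open(src, 'w', encoding='utf-8') as f:
    f.write('Hello world!\n')

if not os.path.exists(dst):                        # same guard as copyfile above
    shutil.copyfile(src, dst)                      # copies raw bytes, any file type

with open(dst, 'r', encoding='utf-8') as f:
    copied = f.read()

print(repr(copied))
```

Because shutil.copyfile works on bytes, it also handles files that are not utf-8 text, which the hand-written version would fail to decode.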


### 2. Using numpy and pandas
numpy can read and write text and binary data on disk; pandas can read and write text and excel data.
#### Reading and Writing numpy Arrays as Binary Files

In [17]:
import numpy as np
arr = np.arange(10)
np.save('data/ch04/array', arr)  # saves the array uncompressed as a binary file with the .npy extension

In [18]:
np.load('data/ch04/array.npy')

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [19]:
np.savez('data/ch04/array_archive.npz', a=arr, b=arr)  # saves several arrays in one .npz archive; np.savez itself does not compress, use np.savez_compressed for that

In [20]:
arch = np.load('data/ch04/array_archive.npz')
arch['b']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

#### Reading and Writing numpy Arrays as Text Files

In [21]:
arr = np.random.randn(3, 4)
np.savetxt('data/ch04/array_numpy.txt', arr, delimiter=',')

In [22]:
arr = np.loadtxt('data/ch04/array_numpy.txt', delimiter=',')
arr

array([[ 1.36918849, -1.56213207,  0.00207497,  0.6651571 ],
       [-0.02509497,  1.52177529, -0.69761807,  0.07524579],
       [-0.19498267, -0.04730856,  0.01545124, -0.96100121]])

#### Reading and Writing Text Files with pandas

In [23]:
import pandas as pd

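When reading a text file, pandas does some processing on the data, including building an index, inferring and converting data types, parsing dates, iterating over large files, and skipping rows, footers, and comments.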

In [24]:
pd.read_csv('data/ch04/data.csv')

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


In [25]:
pd.read_table('data/ch04/data.csv', sep=',')  # specify the delimiter

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


In [26]:
df = pd.read_csv('data/ch04/data.csv')
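df.to_csv('data/ch04/data.csv')  # write to file; the index is written too, which is why an extra "Unnamed: 0" column appears on the next read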

In [27]:
df = pd.DataFrame(np.random.randn(4000).reshape(1000, 4), columns=list('abcd'))
df.loc[df.a > 0] = 1   # .ix was removed from pandas; .loc does the same boolean row selection
df.loc[df.a <= 0] = 0
df.to_csv('data/ch04/data_random.csv',index=False)

#### Reading and Writing excel Files with pandas

In [28]:
xls = pd.ExcelFile('data/ch04/data.xlsx')
table = xls.parse('Sheet1')
table

Unnamed: 0,a,b,c,d,message,1
0,1.0,2.0,3.0,4.0,hello,
1,50.0,6.0,7.0,8.0,world,
2,9.0,10.0,11.0,12.0,foo,
a,,,,,,50.0


In [29]:
table.loc['a', 1] = 50
with pd.ExcelWriter('data/ch04/data.xlsx') as out:  # the context manager saves and closes the file
    table.to_excel(out)

#### Options for Reading Data with pandas

In [30]:
pd.read_csv('data/ch04/data_without_header.csv', header=None)  # the file has no header row

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [31]:
pd.read_csv('data/ch04/data_without_header.csv', names=['a', 'b', 'c', 'd', 'message'])  # specify the column names

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [32]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('data/ch04/data_without_header.csv', names=names, index_col='message')  # specify the index column

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [33]:
list(open('data/ch04/data_abnormal.csv', encoding='utf-8'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491']

In [34]:
pd.read_table('data/ch04/data_abnormal.csv', sep=r'\s+')  # sep accepts a regular expression; the header has one field fewer than the data rows, so the first column is inferred as the index

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


In [35]:
pd.read_table('data/ch04/data_abnormal.csv', sep=r'\s+', skiprows=[1, 3])  # skip some data rows

Unnamed: 0,A,B,C
bbb,0.927272,0.302904,-0.032399
ddd,-0.871858,-0.348382,1.100491


In [36]:
list(open('data/ch04/data_lost.csv'))

['something,a,b,c,d,message\n',
 'one,1,2,3,4,NA\n',
 'two,5,6,,8,world\n',
 'three,9,10,11,12,foo']

In [37]:
pd.read_csv('data/ch04/data_lost.csv')  # NA, null, and empty fields are treated as missing values

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [38]:
pd.read_csv('data/ch04/data_lost.csv', na_values=['foo'])  # specify additional strings that mean "missing"

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,


In [39]:
pd.read_csv('data/ch04/data_lost.csv', na_values={'message': ['foo'], 'something': ['two']})  # specify the "missing" strings per column

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


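#### Reading Text Files in Chunks with pandas
Use this when you only want to read a small part of a file, or want to iterate over it piece by piece.

In [40]:
pd.read_csv('data/ch04/data_large.csv', nrows=5)  # read only the first five rows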

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


In [41]:
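chunker = pd.read_csv('data/ch04/data_large.csv', chunksize=1000)  # set the chunk size; returns an iterable TextFileReader
tot = pd.Series([], dtype=float)  # explicit dtype avoids the empty-Series dtype warning
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)  # count the values in the key column, chunk by chunk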

tot = tot.sort_values(ascending=False)
tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

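#### Exercise: use pandas to process data/ch04/data*.csv

## III. Reading and Writing Databases
Small applications can keep their data in files, but medium and large applications need a mature database system. Mainstream database systems fall into relational databases, such as sqlite and mysql, and non-relational databases, such as mongodb and redis. The sections below look at how Python accesses these databases to retrieve the data they hold.

### 1. SQL
SQL is the standard language for accessing relational data and can be used with relational databases such as sqlite, mysql, Oracle, and SQL Server. All relational databases organize data into databases and tables.
Data is inserted, deleted, updated, and queried with the basic INSERT, DELETE, UPDATE, and SELECT statements.

The structure of databases and tables is created, dropped, and altered with the basic CREATE, DROP, and ALTER statements (the details may differ slightly between relational database products).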
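
As a rough sketch of what the basic statements look like, the snippet below runs them against a throwaway in-memory sqlite database. The table `city`, its columns, and the numeric values are placeholders invented for this example, and other database products may need slightly different syntax:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database

# Structure: create a table, then add a column (names are made up for the demo).
conn.execute("create table city (name varchar(20), province varchar(20))")
conn.execute("alter table city add column score real")

# Data: insert, update, delete, select (score values are placeholders).
conn.execute("insert into city (name, province, score) values ('Xiamen', 'Fujian', 1.0)")
conn.execute("insert into city (name, province, score) values ('Hangzhou', 'Zhejiang', 2.0)")
conn.execute("update city set score = 1.5 where name = 'Xiamen'")
conn.execute("delete from city where name = 'Hangzhou'")
rows = conn.execute("select name, province, score from city").fetchall()
print(rows)

conn.execute("drop table city")     # remove the table again
conn.close()
```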

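### 2. Accessing sqlite
sqlite is a very lightweight relational database that is widely used in mobile app development. It has no server process: sql statements are executed directly through an api, and the data is stored in a single local file or even in memory. The Python standard library ships with an api for sqlite that follows the Python DB-API specification.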
![](images/sqlite.png)

In [42]:
import sqlite3
#conn = sqlite3.connect(':memory:')  # keep the data in memory
conn = sqlite3.connect('data/ch04/sqlite.db')  # keep the data in a file
sql="""
create table test
(
a varchar(20),
b varchar(20),
c real,
d integer
)
"""
conn.execute("drop table if exists test")
conn.execute(sql)
conn.commit()

In [43]:
sql = "insert into test (a, b, c, d) values ('Fujian', 'Xiamen', 1.25, 6)"
conn.execute(sql)
conn.commit()
data = [('Zhejiang', 'Hangzhou', 2.6, 3), 
        ('Guangdong', 'Shenzhen', 1.7, 5)]
sql = "insert into test values (? ,?, ?, ?)"
conn.executemany(sql, data)
conn.commit()

In [44]:
cursor = conn.execute("select * from test")
rows = cursor.fetchall()
rows

[('Fujian', 'Xiamen', 1.25, 6),
 ('Zhejiang', 'Hangzhou', 2.6, 3),
 ('Guangdong', 'Shenzhen', 1.7, 5)]

In [45]:
cursor.description  # the cursor's description attribute holds the column names

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [46]:
pd.DataFrame(rows, columns=list(zip(*cursor.description))[0])  # pass the rows straight to the DataFrame constructor; in Python 3 zip returns an iterator, hence list()

Unnamed: 0,a,b,c,d
0,Fujian,Xiamen,1.25,6
1,Zhejiang,Hangzhou,2.6,3
2,Guangdong,Shenzhen,1.7,5


In [47]:
cursor.close()
conn.close()
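
The column names in cursor.description can also be used without pandas. The following sketch, run against an in-memory database so that it does not touch data/ch04/sqlite.db, turns rows into dictionaries and then shows sqlite3.Row as an alternative:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("create table test (a varchar(20), b varchar(20), c real, d integer)")
conn.executemany("insert into test values (?, ?, ?, ?)",
                 [('Fujian', 'Xiamen', 1.25, 6), ('Zhejiang', 'Hangzhou', 2.6, 3)])

cursor = conn.execute("select * from test where d > ?", (4,))  # ? is a bound parameter
columns = [col[0] for col in cursor.description]               # first item of each entry is the name
records = [dict(zip(columns, row)) for row in cursor.fetchall()]

conn.row_factory = sqlite3.Row          # alternative: rows addressable by column name
row = conn.execute("select * from test where a = ?", ('Zhejiang',)).fetchone()
city = row['b']
conn.close()

print(records, city)
```

Passing values through `?` placeholders, rather than formatting them into the sql string, lets the driver handle quoting.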

#### Exercise: connect to a sqlite database and work with it

### 3. Accessing mysql
![](images/mysql.png)

In [48]:
import pymysql

In [49]:
conn = pymysql.connect(host='172.16.8.181', port=3306, user='root', password='wisesoe')
cursor = conn.cursor()

In [50]:
cursor.execute('drop database if exists test')

1

In [51]:
cursor.execute('create database test')
conn.select_db('test')
sql="""
create table test
(
a varchar(20),
b varchar(20),
c float,
d int
)
"""
cursor.execute(sql)

0

In [52]:
sql = "insert into test (a, b, c, d) values ('Fujian', 'Xiamen', 1.25, 6)"
cursor.execute(sql)
conn.commit()
data = [('Zhejiang', 'Hangzhou', 2.6, 3), 
        ('Guangdong', 'Shenzhen', 1.7, 5)]
sql = "insert into test values (%s ,%s, %s, %s)"
cursor.executemany(sql, data)
conn.commit()

In [53]:
count = cursor.execute("select * from test")
count

3

In [54]:
rows = cursor.fetchall()
rows  # the result is a tuple of tuples

(('Fujian', 'Xiamen', 1.25, 6),
 ('Zhejiang', 'Hangzhou', 2.6, 3),
 ('Guangdong', 'Shenzhen', 1.7, 5))

In [55]:
pd.DataFrame(list(rows), columns=list(zip(*cursor.description))[0])

Unnamed: 0,a,b,c,d
0,Fujian,Xiamen,1.25,6
1,Zhejiang,Hangzhou,2.6,3
2,Guangdong,Shenzhen,1.7,5


In [56]:
cursor.close()
conn.close()

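#### Exercise: connect to a mysql database and work with it

### 4. Accessing mongodb
![](images/mongodb.png?20170619)
Unlike the relational databases above, mongodb belongs to the so-called NoSQL family. It is essentially a document database: instead of databases and tables it has databases and collections, and the documents in a collection need not share the same structure.
![](images/mongodb-collection.png)

The basic mongodb syntax for inserting, removing, updating, and finding documents is: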
![](images/mongodb-insert.png)
![](images/mongodb-remove.png)
![](images/mongodb-update.png)
![](images/mongodb-find.png)

In [57]:
import pymongo

client = pymongo.MongoClient('172.16.8.181', 27017)
client.admin.authenticate('root', 'wisesoe', source='admin')  # Database.authenticate was removed in PyMongo 4.0

The usual approach is a database connection string:

In [58]:
client = pymongo.MongoClient('mongodb://root:wisesoe@172.16.8.181/admin')
db = client.test  # select the database
db.data.drop()
db.data.insert_one({'a': 'Fujian', 'b': 'Xiamen', 'c': 1.25, 'd': 6})
data = [{'a': 'Zhejiang', 'b': 'Hangzhou', 'c': 2.6, 'd': 3}, 
        {'a': 'Guangdong', 'b': 'Shenzhen', 'c': 1.7, 'd': 5}]
db.data.insert_many(data)

<pymongo.results.InsertManyResult at 0x1ef905c9480>

In [59]:
db.list_collection_names()  # collection_names() was removed in PyMongo 4.0

['blog', 'system.indexes', 'data']

In [60]:
results = db.data.find()
list(results)

[{'_id': ObjectId('595c528feaa295c58ad656ad'),
  'a': 'Fujian',
  'b': 'Xiamen',
  'c': 1.25,
  'd': 6},
 {'_id': ObjectId('595c528feaa295c58ad656ae'),
  'a': 'Zhejiang',
  'b': 'Hangzhou',
  'c': 2.6,
  'd': 3},
 {'_id': ObjectId('595c528feaa295c58ad656af'),
  'a': 'Guangdong',
  'b': 'Shenzhen',
  'c': 1.7,
  'd': 5}]

In [61]:
result = db.data.find_one({'a': 'Fujian'})
result

{'_id': ObjectId('595c528feaa295c58ad656ad'),
 'a': 'Fujian',
 'b': 'Xiamen',
 'c': 1.25,
 'd': 6}

In [62]:
db.data.update_one({'a': 'Fujian'}, {'$set': {'b': 'Fuzhou'} })
result = db.data.find_one({'a': 'Fujian'})
result

{'_id': ObjectId('595c528feaa295c58ad656ad'),
 'a': 'Fujian',
 'b': 'Fuzhou',
 'c': 1.25,
 'd': 6}

In [63]:
db.data.update_one({'a': 'Shandong'}, {'$setOnInsert' :{'a': 'Shandong', 'b': 'Jinan', 'c': 2.25, 'd': 5}}, upsert = True )
result = db.data.find_one({'a': 'Shandong'})
result

{'_id': ObjectId('595c529e6b5c4ebf0f2fc646'),
 'a': 'Shandong',
 'b': 'Jinan',
 'c': 2.25,
 'd': 5}

In [64]:
db.data.delete_one({'a': 'Shandong'})
db.data.update_one({'a': 'Shandong'}, {'$setOnInsert' :{'a': 'Shandong', 'b': 'Jinan', 'c': 2.25}}, upsert = True )
results = db.data.find()  # documents in a mongodb collection need not share the same structure
list(results)

[{'_id': ObjectId('595c528feaa295c58ad656ad'),
  'a': 'Fujian',
  'b': 'Fuzhou',
  'c': 1.25,
  'd': 6},
 {'_id': ObjectId('595c528feaa295c58ad656ae'),
  'a': 'Zhejiang',
  'b': 'Hangzhou',
  'c': 2.6,
  'd': 3},
 {'_id': ObjectId('595c528feaa295c58ad656af'),
  'a': 'Guangdong',
  'b': 'Shenzhen',
  'c': 1.7,
  'd': 5},
 {'_id': ObjectId('595c529e6b5c4ebf0f2fc647'),
  'a': 'Shandong',
  'b': 'Jinan',
  'c': 2.25}]

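#### Exercise: connect to a mongodb database and work with it

## IV. Accessing the Network

### 1. http Protocol Basics
http (Hypertext Transfer Protocol) is an application-layer protocol built on tcp/ip. It is not concerned with how packets are transported; it mainly specifies the format of the communication between client and server, and uses port 80 by default.
https (Hypertext Transfer Protocol Secure) communicates over http but encrypts the traffic with ssl/tls.
#### url
A url (Uniform Resource Locator) uniquely identifies a resource on the network, for example: https://www.baidu.com/s?wd=python
#### http Requests and Responses
- An http request has three parts: the request line, the message headers, and the request body. Common request methods:
    - get
    - post
- An http response has three parts: the status line, the message headers, and the response body. Common status codes:
    - 200 OK
    - 400 Bad Request
    - 401 Unauthorized
    - 403 Forbidden
    - 404 Not Found
    - 500 Internal Server Error
    - 503 Service Unavailable
- Use the browser's developer tools to inspect requests and responses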
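
The parts of a url and the standard reason phrases for status codes can be explored offline with the standard library. This is only an illustrative sketch that needs no network access:

```python
from http import HTTPStatus
from urllib.parse import parse_qs, urlsplit

parts = urlsplit('https://www.baidu.com/s?wd=python')
scheme, host, path = parts.scheme, parts.netloc, parts.path
query = parse_qs(parts.query)            # query string as a dict of lists

# Standard reason phrases for the status codes listed above.
phrases = {code: HTTPStatus(code).phrase
           for code in (200, 400, 401, 403, 404, 500, 503)}

print(scheme, host, path, query)
print(phrases)
```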

### 2. Using urllib

In [65]:
from urllib import request  # in Python 2 this was urllib2
resp = request.urlopen('https://www.baidu.com')
print(resp.geturl(), resp.status, resp.reason, sep=' ')
print(resp.read().decode('utf-8'))

https://www.baidu.com 200 OK
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>


In [66]:
resp = request.urlopen('https://api.douban.com/v2/movie/1292720')
print('Status:', resp.status, resp.reason)

Status: 200 OK


In [67]:
{k: v for k, v in resp.getheaders()}

{'Cache-Control': 'must-revalidate, no-cache, private',
 'Connection': 'close',
 'Content-Length': '2861',
 'Content-Type': 'application/json; charset=utf-8',
 'Date': 'Wed, 05 Jul 2017 02:44:48 GMT',
 'Expires': 'Sun, 1 Jan 2006 01:00:00 GMT',
 'Pragma': 'no-cache',
 'Server': 'dae',
 'Set-Cookie': 'bid=4-1j0zfK14s; Expires=Thu, 05-Jul-18 02:44:48 GMT; Domain=.douban.com; Path=/',
 'Vary': 'Accept-Encoding',
 'X-DAE-App': 'movie',
 'X-DAE-Node': 'daisy6d',
 'X-DOUBAN-NEWBID': '4-1j0zfK14s',
 'X-Ratelimit-Limit2': '100',
 'X-Ratelimit-Remaining2': '99'}

In [68]:
import json
json.loads(resp.read().decode('utf-8'))

{'alt': 'https://movie.douban.com/movie/1292720',
 'alt_title': '阿甘正传 / 福雷斯特·冈普',
 'attrs': {'cast': ['Tom Hanks',
   'Robin Wright Penn',
   'Gary Sinise',
   'Mykelti Williamson',
   'Sally Field',
   'Michael Conner Humphreys',
   'Haley Joel Osment'],
  'country': ['美国'],
  'director': ['Robert Zemeckis'],
  'language': ['英语'],
  'movie_duration': ['142 分钟'],
  'movie_type': ['剧情', '爱情'],
  'pubdate': ['1994-06-23(洛杉矶首映)', '1994-07-06(美国)'],
  'title': ['Forrest Gump'],
  'writer': ['Eric Roth', 'Winston Groom'],
  'year': ['1994']},
 'author': [{'name': 'Robert Zemeckis'}],
 'id': 'https://api.douban.com/movie/1292720',
 'image': 'https://img1.doubanio.com/view/movie_poster_cover/ipst/public/p510876377.jpg',
 'mobile_link': 'https://m.douban.com/movie/subject/1292720/',
 'rating': {'average': '9.4', 'max': 10, 'min': 0, 'numRaters': 689390},
 'summary': '阿甘（汤姆·汉克斯 饰）于二战结束后不久出生在美国南方阿拉巴马州一个闭塞的小镇，他先天弱智，智商只有75，然而他的妈妈是一个性格坚强的女性，她常常鼓励阿甘“傻人有傻福”，要他自强不息。\n阿甘像普通孩子一样上学，并且认识了一生的朋友和至爱珍妮（罗宾·莱特·

### 3. Using requests

In [69]:
import requests
resp = requests.get('https://www.baidu.com')
print(resp.url, resp.status_code, resp.reason, sep=' ')

https://www.baidu.com/ 200 OK


In [70]:
import json
resp = requests.get('https://api.douban.com/v2/movie/1292720')
print('Status:', resp.status_code, resp.reason)
json.loads(resp.text)

Status: 200 OK


{'alt': 'https://movie.douban.com/movie/1292720',
 'alt_title': '阿甘正传 / 福雷斯特·冈普',
 'attrs': {'cast': ['Tom Hanks',
   'Robin Wright Penn',
   'Gary Sinise',
   'Mykelti Williamson',
   'Sally Field',
   'Michael Conner Humphreys',
   'Haley Joel Osment'],
  'country': ['美国'],
  'director': ['Robert Zemeckis'],
  'language': ['英语'],
  'movie_duration': ['142 分钟'],
  'movie_type': ['剧情', '爱情'],
  'pubdate': ['1994-06-23(洛杉矶首映)', '1994-07-06(美国)'],
  'title': ['Forrest Gump'],
  'writer': ['Eric Roth', 'Winston Groom'],
  'year': ['1994']},
 'author': [{'name': 'Robert Zemeckis'}],
 'id': 'https://api.douban.com/movie/1292720',
 'image': 'https://img1.doubanio.com/view/movie_poster_cover/ipst/public/p510876377.jpg',
 'mobile_link': 'https://m.douban.com/movie/subject/1292720/',
 'rating': {'average': '9.4', 'max': 10, 'min': 0, 'numRaters': 689390},
 'summary': '阿甘（汤姆·汉克斯 饰）于二战结束后不久出生在美国南方阿拉巴马州一个闭塞的小镇，他先天弱智，智商只有75，然而他的妈妈是一个性格坚强的女性，她常常鼓励阿甘“傻人有傻福”，要他自强不息。\n阿甘像普通孩子一样上学，并且认识了一生的朋友和至爱珍妮（罗宾·莱特·

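### 4. Using pyspider
pyspider is a simple crawler framework that supports multithreading and dynamic js rendering, comes with a web UI, and supports several databases. On windows it runs reliably only under python2; on ubuntu it supports python3.
Hands-on demo of crawler development with pyspider:
http://docs.pyspider.org/en/latest/apis/self.crawl/

### 5. Exercise: fetch data from the douban api

## V. Exercises

### 1. Downloading a Video via Baidu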

In [6]:
import requests
import re
from pyquery import PyQuery as pq

key = '如何掌控你的自由时间'
baiduUrl = 'https://www.baidu.com/s?wd=%s site:open.163.com'
flvUrl = 'http://www.flvcd.com/parse.php?format=&kw=%s'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
resp = requests.get(baiduUrl % key, headers=headers)
page = pq(resp.text)
items = [(item('a').attr('href'), item('div.c-span-last').text()) 
         for item in  page('div.c-container h3 a.anchor-link').items()]

videoPageUrl = ''
if(len(items) > 0):
    firstResp = requests.get(items[0][0], allow_redirects=False, headers=headers)
    if(firstResp.status_code == 302):
        videoPageUrl = firstResp.headers.get('location')
    elif(firstResp.status_code == 200):
        matchGroup = re.search(r'URL=\'(.*?)\'', firstResp.text, re.S)
        videoPageUrl = matchGroup.group(1)
        print(videoPageUrl)
    if videoPageUrl:
        flvResp = requests.get(flvUrl % videoPageUrl, headers=headers)
        flvPage = pq(flvResp.text)
        flvItems = [item.attr('href') for item in flvPage('td.mn a.link').items()]
        if(len(flvItems) > 0 and flvItems[0]):
            videoUrl = flvItems[0]
            print(videoUrl)
            if videoUrl:
                downResp = requests.get(videoUrl, stream=True)                
                fileName = 'video/%s.flv' % key
                with open(fileName, "wb") as file:
                    file.write(downResp.content)
                    print('finish')

'http://www.baidu.com/link?url=QNlVhrRe-9pFVLlYnC1lUvQSYoDLCxN0DD406vIUpqPQpA_nGhnh-hPhP_XuaFF-9wc5T_KF37389iglBemivlPZiyLcsCopDWxtQUfm-Qq'
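
The search url above is built by plain string formatting, so the space and the Chinese characters go into the url unescaped. One possible alternative, shown here only as a sketch that makes no network request, is to let urllib.parse.urlencode do the escaping:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

key = '如何掌控你的自由时间'
query = urlencode({'wd': '%s site:open.163.com' % key})  # percent-encodes the text as UTF-8
search_url = 'https://www.baidu.com/s?' + query

# Decoding the query recovers the original search phrase.
decoded = parse_qs(urlsplit(search_url).query)['wd'][0]

print(search_url)
print(decoded)
```

With requests the same effect can be had by passing the query as `params`, for example `requests.get('https://www.baidu.com/s', params={'wd': ...}, headers=headers)`.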

### 2. Refactor the code above into a class

In [None]:
import requests
import re
from pyquery import PyQuery as pq

class OnlineVideoCrawler(object):
    def __init__(self, keys):
        self.keys = keys
        self.baiduUrl = 'https://www.baidu.com/s?wd=%s site:open.163.com'
        self.flvUrl = 'http://www.flvcd.com/parse.php?format=&kw=%s'
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}

    def getVideoPageUrlFromBaidu(self, key):
        resp = requests.get(self.baiduUrl % key, headers=self.headers)
        page = pq(resp.text)
        items = [(item('a').attr('href'), item('div.c-span-last').text()) for item in  page('div.c-container div.c-gap-top-small').items()]
        videoPageUrl = ''
        if(len(items) > 0):
            firstResp = requests.get(items[0][0], allow_redirects=False, headers=self.headers)
            if(firstResp.status_code == 302):
                videoPageUrl = firstResp.headers.get('location')
            elif(firstResp.status_code == 200):
                matchGroup = re.search(r'URL=\'(.*?)\'', firstResp.text, re.S)
                videoPageUrl = matchGroup.group(1)
        print(videoPageUrl)
        return videoPageUrl
        
    
    def getVideoUrlFromFlv(self, key):
        videoPageUrl = self.getVideoPageUrlFromBaidu(key)
        videoUrl = ''
        if videoPageUrl:
            flvResp = requests.get(self.flvUrl % videoPageUrl, headers=self.headers)
            flvPage = pq(flvResp.text)
            flvItems = [item.attr('href') for item in flvPage('td.mn a.link').items()]
            if(len(flvItems) > 0 and flvItems[0]):
                videoUrl = flvItems[0]
        print(videoUrl)
        return videoUrl
    
    def downloadVideo(self, key):
        videoUrl = self.getVideoUrlFromFlv(key)
        if not videoUrl:  # nothing to download
            print('No video URL found for %s' % key)
            return
        downResp = requests.get(videoUrl, stream=True)
        # content-length may be missing; fall back to 0 and skip the progress display
        total = int(downResp.headers.get('content-length', 0))
        print(total)
        fileName = '%s.flv' % key
        i = 0
        chunk_size = 1024
        with open(fileName, "wb") as file:
            for chunk in downResp.iter_content(chunk_size=chunk_size):
                if chunk:
                    i += 1
                    file.write(chunk)
                    file.flush()
                    if total:
                        print('%.2f' % (chunk_size * i / total), end='\r')

    def downloadVideos(self):
        if len(self.keys) > 0:
            for key in self.keys:
                try:
                    self.downloadVideo(key)
                except Exception as e:
                    print(e)

#if __name__ == '__main__':
keys = ['如何掌控你的自由时间','阅读全世界','如何做得更好']
crawler = OnlineVideoCrawler(keys)
crawler.downloadVideos()
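
In the 200 branch of getVideoPageUrlFromBaidu, re.search returns None when the page contains no `URL='...'` fragment, and calling group(1) on None raises AttributeError. A small guard could look like the sketch below; the helper name `extract_refresh_url` and the page snippets are hypothetical:

```python
import re

def extract_refresh_url(html):
    """Return the URL='...' target found in html, or '' when there is none."""
    match = re.search(r"URL='(.*?)'", html, re.S)
    return match.group(1) if match else ''

# Made-up page snippets for illustration.
with_redirect = '<meta http-equiv="refresh" content="0;URL=\'http://open.163.com/movie/example.html\'">'
without_redirect = '<html><body>no redirect here</body></html>'

found = extract_refresh_url(with_redirect)
missing = extract_refresh_url(without_redirect)

print(found)
print(repr(missing))
```

Returning an empty string keeps the existing `if videoPageUrl:` checks working unchanged.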

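## VI. Final Exercise
Develop a crawler for a site of your choice (Baidu or Douban) that retrieves a list of items from the site. At least 50 text records are required; there is no requirement for audio or video.
How to submit: put the code for all the short exercises from this and earlier chapters, together with this final exercise, in a single notebook file named student ID + name.ipynb, and copy it to the shared path with the following command.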