Task 2: complete Chapter 5 and upload the code implementation to GitHub.

Deadline: 2019-08-10, 8:00 am

In [1]:
# Import pandas and NumPy
import pandas as pd
import numpy as np

In [3]:
# Optionally import the Series and DataFrame classes directly from pandas
# from pandas import Series, DataFrame

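## 5.1 Introduction to pandas Data Structures

pandas has two main data structures:
* Series
* DataFrame: a two-dimensional, array-like object; each DataFrame consists of an index, columns, and values (see 5.1.2)

### 5.1.1 Series

A Series is a one-dimensional, array-like object made up of an **index** and **values**.

A Series can be created from a **list** or a **dict**.

The `name` attributes:
* `Series.name`: the name of the Series itself (its values)
* `Series.index.name`: the name of the Series' index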

In [4]:
# Create a Series from a list
obj = pd.Series([4, 7, -5, 3])
obj
# A dtype can be specified:
# obj = pd.Series([4, 7, -5, 3], dtype='object')
# An index can also be specified (see the cells below)

0    4
1    7
2   -5
3    3
dtype: int64

In [5]:
# Get all values of the Series
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [6]:
# Get the index of the Series
obj.index

RangeIndex(start=0, stop=4, step=1)

In [7]:
# Specify the index when creating a Series
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2                  

d    4
b    7
a   -5
c    3
dtype: int64

In [8]:
# Index labels need not be integers; labels (like values) can be of many data types
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [28]:
# Modify the index
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [29]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan'] # assigning to .index replaces the index in place
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

Look up values by index label

In [9]:
# Select a single value by its index label
obj2['a']

-5

In [10]:
# Select several values with a list of labels; the result follows the order of the requested labels
obj2[['c', 'd', 'a']]

c    3
d    4
a   -5
dtype: int64

In [11]:
# Select a value by label and assign a new value to it
obj2['d'] = 6

NumPy functions and NumPy-like operations preserve the link between index labels and values.

In [12]:
obj2[obj2 > 0] # Boolean filtering

d    6
b    7
c    3
dtype: int64

In [13]:
obj2 * 2 # scalar multiplication

d    12
b    14
a   -10
c     6
dtype: int64

In [14]:
np.exp(obj2) # apply np.exp element-wise

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can be thought of as a **fixed-length, ordered dict**: a **mapping** from index labels to data values.

In [15]:
'b' in obj2

True

In [16]:
'e' in obj2

False
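The dict analogy can be sketched as follows. The `prices` data is made up for illustration and does not come from the notes above; the point is only that membership tests look at the index labels, and that a Series converts back to an ordinary dict.

```python
import pandas as pd

# Hypothetical data: a Series built from a dict keeps the dict's keys as its index.
prices = pd.Series({'apple': 3, 'pear': 5})

# `in` checks the index labels (the "keys"), not the values.
has_apple = 'apple' in prices
has_three = 3 in prices  # 3 is a value, not a label

# Converting back yields a plain Python dict.
as_dict = prices.to_dict()
```

Because `in` never inspects the values, a test such as `3 in prices.values` is needed to search the data itself.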

In [17]:
# Create a Series from a dict
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [18]:
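# Pass a list of labels as the index when building a Series from a dict:
# values follow the order of the list, and labels missing from the dict get NaN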
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Detecting missing data
* pd.isnull(Series)
* pd.notnull(Series)
* Series.isnull()
* Series.notnull()

In [19]:
pd.isnull(obj4) # detect missing data: True where a value is missing
# Equivalent to
# obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4) # detect missing data: False where a value is missing
# Equivalent to
# obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series arithmetic aligns data by index label automatically, similar to a join operation.

In [22]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [23]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [24]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

The **name** attribute of a Series

In [25]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [26]:
obj4.name = 'population'
obj4.index.name = 'state'

In [27]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

### 5.1.2 DataFrame

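A DataFrame is made up of an **index** (row labels), **columns** (column labels), and **values**.

A DataFrame can be thought of as a dict of Series that share the same index.

A DataFrame can be created from a **dict** of **equal-length lists** or **NumPy arrays**.

The `name` attributes:
* `DataFrame.columns.name`: the name of the column index
* `DataFrame.index.name`: the name of the row index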
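To make the "dict of Series" view concrete, here is a minimal sketch. The names `pop_s`, `year_s`, and `df` are hypothetical and the numbers are illustrative; the Series deliberately have different indexes to show that the rows are aligned by label.

```python
import pandas as pd

# Two Series with overlapping but unequal indexes (illustrative data).
pop_s = pd.Series([1.5, 1.7], index=['one', 'two'])
year_s = pd.Series([2000, 2001, 2002], index=['one', 'two', 'three'])

# Each dict key becomes a column; the row index is the union of the Series indexes.
df = pd.DataFrame({'pop': pop_s, 'year': year_s})

# Selecting a column gives back a Series.
col = df['pop']
```

The label `'three'` exists only in `year_s`, so `df.loc['three', 'pop']` is NaN.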

In [2]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [3]:
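# Specify the column order
# Column names that are not keys of the data produce columns filled with missing values
pd.DataFrame(data, columns=['pop', 'state', 'year'])  # the result is not assigned, so `frame` below is unchanged
frame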

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [4]:
frame2 = pd.DataFrame(data, columns=['pop', 'state', 'year', 'debt'], index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,pop,state,year,debt
one,1.5,Ohio,2000,
two,1.7,Ohio,2001,
three,3.6,Ohio,2002,
four,2.4,Nevada,2001,
five,2.9,Nevada,2002,
six,3.2,Nevada,2003,


In [5]:
frame2.head() # selects the first 5 rows by default
# Select the first n rows:
# frame2.head(n)

Unnamed: 0,pop,state,year,debt
one,1.5,Ohio,2000,
two,1.7,Ohio,2001,
three,3.6,Ohio,2002,
four,2.4,Nevada,2001,
five,2.9,Nevada,2002,


In [6]:
# View the column names
frame2.columns

Index(['pop', 'state', 'year', 'debt'], dtype='object')

In [7]:
# Retrieve a single column; the result is a Series
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [8]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [9]:
# Retrieve a single row
frame2.loc['three']

pop       3.6
state    Ohio
year     2002
debt      NaN
Name: three, dtype: object

In [10]:
# Modify a column by assignment
frame2['debt'] = 16
frame2

Unnamed: 0,pop,state,year,debt
one,1.5,Ohio,2000,16
two,1.7,Ohio,2001,16
three,3.6,Ohio,2002,16
four,2.4,Nevada,2001,16
five,2.9,Nevada,2002,16
six,3.2,Nevada,2003,16


In [11]:
frame2['debt'] = np.arange(6.)
frame2

Unnamed: 0,pop,state,year,debt
one,1.5,Ohio,2000,0.0
two,1.7,Ohio,2001,1.0
three,3.6,Ohio,2002,2.0
four,2.4,Nevada,2001,3.0
five,2.9,Nevada,2002,4.0
six,3.2,Nevada,2003,5.0


In [13]:
val = pd.Series([1.2, -1.5, -2], index=['two', 'four', 'five'])

In [16]:
frame2['debt'] = val
frame2

Unnamed: 0,pop,state,year,debt
one,1.5,Ohio,2000,
two,1.7,Ohio,2001,1.2
three,3.6,Ohio,2002,
four,2.4,Nevada,2001,-1.5
five,2.9,Nevada,2002,-2.0
six,3.2,Nevada,2003,


In [33]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,pop,state,year,debt,eastern
one,1.5,Ohio,2000,,True
two,1.7,Ohio,2001,1.2,True
three,3.6,Ohio,2002,,True
four,2.4,Nevada,2001,-1.5,False
five,2.9,Nevada,2002,-2.0,False
six,3.2,Nevada,2003,,False


In [34]:
# del removes a single column
del frame2['eastern']
print(frame2)
print(frame2.columns)

       pop   state  year  debt
one    1.5    Ohio  2000   NaN
two    1.7    Ohio  2001   1.2
three  3.6    Ohio  2002   NaN
four   2.4  Nevada  2001  -1.5
five   2.9  Nevada  2002  -2.0
six    3.2  Nevada  2003   NaN
Index(['pop', 'state', 'year', 'debt'], dtype='object')


In [37]:
pop ={'Nevada': {2001: 2.4, 2002: 2.9},
      'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop) # outer dict keys become the columns, inner keys become the row index
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [38]:
# Transpose the DataFrame (swap rows and columns)
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [40]:
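# frame3['Ohio'][:-1] takes every element of frame3's Ohio column (a Series) except the last;
# frame3['Nevada'][:2] takes the first two elements of the Nevada column
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}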
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


In [41]:
# Set the name attributes
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [48]:
# Read the name attributes
print("the index name of frame3 is " + str(frame3.index.name))
print("the column name of frame3 is " + str(frame3.columns.name))

the index name of frame3 is year
the column name of frame3 is state


In [49]:
# View the values of frame3
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [50]:
frame2.values # the columns have different dtypes, so the array's dtype is object, which can hold all of them

array([[1.5, 'Ohio', 2000, nan],
       [1.7, 'Ohio', 2001, 1.2],
       [3.6, 'Ohio', 2002, nan],
       [2.4, 'Nevada', 2001, -1.5],
       [2.9, 'Nevada', 2002, -2.0],
       [3.2, 'Nevada', 2003, nan]], dtype=object)

### 5.1.3 Index Objects

In [54]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)
print(index[1:])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')


In [55]:
# Index objects are immutable: individual labels cannot be modified
index[1] = 'd' # TypeError

TypeError: Index does not support mutable operations
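Since an Index cannot be edited element by element, changing a label means building a new index. A minimal sketch, assuming `Series.rename` with a label mapping is an acceptable way to do that (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(range(3), index=['a', 'b', 'c'])

# Item assignment on an Index raises TypeError because Index objects are immutable.
try:
    s.index[1] = 'd'
    raised = False
except TypeError:
    raised = True

# rename returns a new Series with a new Index; the original is left untouched.
renamed = s.rename(index={'b': 'd'})
```

Immutability is what makes it safe to share one Index object among several data structures.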

In [60]:
# Create an Index object
labels = pd.Index(np.arange(3))
print(labels)
# Pass the Index object as the index of a Series
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)

Int64Index([0, 1, 2], dtype='int64')
0    1.5
1   -2.5
2    0.0
dtype: float64


In [62]:
obj2.index is labels

True

In [63]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [66]:
print(frame3.columns)
print('Ohio' in frame3.columns)
print(2003 in frame3.columns)

Index(['Nevada', 'Ohio'], dtype='object', name='state')
True
False


In [67]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Index methods and properties

In [69]:
dup_labels.unique() # return the unique values of the Index
# is_unique is a property (no parentheses); it is True when the Index has no duplicate values
# dup_labels.is_unique

Index(['foo', 'bar'], dtype='object')

In [77]:
dup_labels.isin(['foo']) # return a Boolean array marking which values of dup_labels are contained in ['foo']

array([ True,  True, False, False])

In [89]:
labels2 = pd.Index(['ani', 'foo', 'app'])
labels3 = dup_labels.append(labels2) # concatenate with another Index object, producing a new Index
labels3

Index(['foo', 'foo', 'bar', 'bar', 'ani', 'foo', 'app'], dtype='object')

In [80]:
dup_labels.difference(labels2) # compute the set difference

Index(['bar'], dtype='object')

In [86]:
dup_labels.intersection(labels2) # compute the intersection

Index(['foo', 'foo'], dtype='object')

In [94]:
dup_labels.unique().union(labels2) # compute the union; in this environment union raised an error on an Index with duplicate values, so duplicates are removed first

Index(['ani', 'app', 'bar', 'foo'], dtype='object')

## 5.2 Essential Functionality

### 5.2.1 Reindexing (the reindex method)

In [95]:
# Create a Series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [97]:
# reindex
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

Filling values while reindexing: reindex(method=...)

In [98]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [115]:
# reindex without method (the default): gaps are not filled
obj3.reindex(range(5))

0      blue
1       NaN
2    purple
3       NaN
4    yellow
dtype: object

In [123]:
# reindex(method='ffill'): forward fill
obj3.reindex(range(10), method='ffill') # pad / ffill: propagate the last valid observation forward

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
6    yellow
7    yellow
8    yellow
9    yellow
dtype: object

In [113]:
# reindex(method='bfill'): backward fill
obj3.reindex(range(5), method='bfill') # backfill / bfill: use the next valid observation to fill the gap

0      blue
1    purple
2    purple
3    yellow
4    yellow
dtype: object

In [124]:
?obj3.reindex

reindex can change the row index, the columns, or both

In [116]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [117]:
# Reindex the rows
frame2 = frame.reindex(['a', 'b', 'c', 'd']) # reindex rearranges the rows and adds new labels (filled with NaN)
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [118]:
# Reindex the columns with the columns keyword
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


Arguments of the reindex function

<img src='./reindex函数的参数.png' width=600>
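Because the argument table is only available as an image, here is a minimal sketch of three reindex arguments that pandas provides (`index`, `columns`, and `fill_value`). The data mirrors the frame used above; the name `out` is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex rows and columns in one call; new labels are filled with 0 instead of NaN.
out = frame.reindex(index=['a', 'b', 'c', 'd'],
                    columns=['Texas', 'Utah', 'California'],
                    fill_value=0)
```

Here `fill_value` applies only to labels introduced by the reindexing (row `'b'` and column `'Utah'`); existing values are carried over unchanged.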

In [127]:
# reindex(tolerance=): when filling inexact matches, the maximum allowed distance between an original label and a new label
print(obj3)
print("\n")
print(obj3.reindex(range(10), method='ffill'))
print("\n")
tol_case = obj3.reindex(range(10), method='ffill', tolerance=3)
print(tol_case)

0      blue
2    purple
4    yellow
dtype: object


0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
6    yellow
7    yellow
8    yellow
9    yellow
dtype: object


0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
6    yellow
7    yellow
8       NaN
9       NaN
dtype: object


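### 5.2.2 Dropping Entries from an Axis (the drop method)

By default, drop does not modify the object in place; it returns a new object with the specified labels removed.

drop(inplace=True) modifies the object in place.

Series.drop()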

In [142]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [143]:
# Drop one index label and its value
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [144]:
# Drop several index labels and their values
obj.drop(['d', 'c']) # returns a new Series; obj itself is unchanged, so assign the result to keep it

a    0.0
b    1.0
e    4.0
dtype: float64

In [145]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
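The cells above can be condensed into a short sketch showing that drop returns an independent new object. The `errors='ignore'` argument, which skips labels that do not exist, is not covered in the notes above and is included as an extra; the names `dropped` and `safe` are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=list('abcde'))

# drop returns a new Series; changing the result does not affect obj.
dropped = obj.drop(['c', 'd'])
dropped['a'] = 99.0

# By default a missing label raises KeyError; errors='ignore' skips it instead.
safe = obj.drop(['c', 'zzz'], errors='ignore')
```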

DataFrame.drop()  
Index labels can be dropped from either axis

In [133]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [134]:
# drop uses axis=0 by default (drops rows)
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [135]:
# drop with axis=1 or axis='columns' (drops columns)
data.drop(['one', 'three'], axis=1)

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


In [136]:
data.drop('two', axis='columns')

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [146]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [147]:
# drop(inplace=True) removes the entry in place
obj.drop('c', inplace=True) # similar to del
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [149]:
# del
del obj['b']

In [150]:
obj

a    0.0
d    3.0
e    4.0
dtype: float64

### 5.2.3 Indexing, Selection, and Filtering

Series indexing, selection, and filtering

In [151]:
obj = pd.Series(np.arange(4.), index=['a','b','c','d'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [152]:
# Look up a value by index label
obj['b']

1.0

In [153]:
# Look up a value by integer position (obj.iloc[1] is the explicit form)
obj[1]

1.0

In [157]:
# Slicing by position
obj[2:4] # the endpoint is excluded

c    2.0
d    3.0
dtype: float64

In [155]:
# Selection with a list of labels
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [156]:
# Selection with a list of integer positions (obj.iloc[[1, 3]] is the explicit form)
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

In [162]:
# Boolean filtering
obj[obj < 3] # keeps the elements whose value is less than 3

a    0.0
b    1.0
c    2.0
dtype: float64

In [164]:
# Assign through a label slice; unlike positional slicing, label slicing includes the endpoint
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

DataFrame indexing, selection, and filtering

In [165]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio','Colorado','Utah','New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [166]:
data['two']  # a single label selects a column

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [167]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [168]:
data[:2] # a slice selects rows

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [170]:
# Filtering
data[data['three'] > 5] # a Boolean Series selects rows

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [169]:
data['three'] > 5 # the result has Boolean dtype

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [171]:
type(data['three'] > 5)

pandas.core.series.Series

In [172]:
data < 5 # returns a Boolean DataFrame

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [174]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### 5.2.4 Selection with loc and iloc

loc and iloc are special indexing operators:
* loc: selects by axis labels
* iloc: selects by integer position
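One practical difference between the two operators is how slices treat their endpoint. The following sketch rebuilds the `data` frame from 5.2.3 (before any values are overwritten); `by_label` and `by_position` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc slices by label and includes both endpoints.
by_label = data.loc['Ohio':'Utah', 'one':'two']

# iloc slices by integer position and excludes the stop position, like Python lists.
by_position = data.iloc[0:2, 0:2]
```

`by_label` therefore has three rows (Ohio, Colorado, Utah), while `by_position` has only two.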

In [175]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [176]:
# loc
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

In [177]:
# iloc
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [178]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [179]:
data.iloc[[1,2], [3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [181]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [182]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [185]:
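# Row selection with the ix operator, which mixes label-based and position-based lookup
# Not recommended: ix is deprecated (and was removed in pandas 1.0); use loc or iloc instead
data.ix['Utah']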

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

Indexing options with DataFrame

<img src = './DF的索引选项.png' width=600>
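The option table is only available as an image. Besides loc and iloc, pandas also provides the scalar accessors `at` (by label) and `iat` (by position); whether the image lists them is an assumption. A minimal sketch on the same kind of frame as above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# at: select a single scalar by row label and column label.
by_label = data.at['Utah', 'two']

# iat: select a single scalar by row position and column position.
by_position = data.iat[2, 1]
```

Both expressions address the same cell, so they return the same value.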

### 5.2.5 Integer Indexes

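Plain `[]` indexing is ambiguous when the index labels are integers, so it is not recommended in that case.

Prefer loc (label-based) and iloc (integer position-based) indexing, which are unambiguous.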

In [192]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [193]:
ser[-1] # illegal

KeyError: -1

In [194]:
ser.iloc[-1] # legal

2.0
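The ambiguity is easiest to see with slices. In this sketch the same expression `:1` selects a different number of elements depending on the operator, because loc treats 1 as a label and iloc treats it as a position:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))  # integer labels 0, 1, 2

# Label-based: includes the label 1, so two elements are selected.
label_slice = ser.loc[:1]

# Position-based: stops before position 1, so one element is selected.
position_slice = ser.iloc[:1]

# Negative positions are always valid with iloc.
last = ser.iloc[-1]
```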

### 5.2.6 Arithmetic and Data Alignment

In [5]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, -2.3, 2.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
print(s1)
print("\n")
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64


a   -2.1
c   -2.3
e    2.5
f    4.0
g    3.1
dtype: float64


In [6]:
s1 + s2 # the result is aligned on index labels automatically; labels found in only one Series give NaN

a    5.2
c   -4.8
d    NaN
e    4.0
f    NaN
g    NaN
dtype: float64

In [8]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print("\n")
print(df2)

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0


          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [9]:
df1 + df2 # the result is aligned on both rows and columns

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [11]:
df1 = pd.DataFrame({'A': [1,2]})
df2 = pd.DataFrame({'B': [3,4]})
print(df1)
print("\n")
print(df2)

   A
0  1
1  2


   B
0  3
1  4


In [13]:
df1 - df2  # no column labels in common, so every value in the result is NaN

Unnamed: 0,A,B
0,,
1,,


### 5.2.7 Arithmetic Methods with Fill Values

fill_value

In [14]:
df1 = pd.DataFrame(np.arange(12.).reshape((3,4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4,5)),
                   columns=list('abcde'))
print(df1)
print("\n")
print(df2)

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0


      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [15]:
df2.loc[1,'b'] = np.nan
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [16]:
df1 + df2 # results in NA values

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [17]:
df1.add(df2, fill_value=0) # fill NA values with 0

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Flexible Arithmetic Methods

<img src = './算术方法.png' width=400>
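The method table is only available as an image. As a sketch of the idea, each arithmetic method (`add`, `sub`, `div`, and so on) has a counterpart prefixed with `r` that swaps the operands. The small frame below is illustrative data, not taken from the cells above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)), columns=list('ab'))

# 1 / df and df.rdiv(1) both divide 1 by each element.
with_operator = 1 / df
with_method = df.rdiv(1)

# df.div(1) is the non-reversed form: each element divided by 1.
unchanged = df.div(1)
```

The method forms matter mainly because they accept extra arguments such as `fill_value` and `axis`, which the operators cannot take.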

In [23]:
df1.reindex(columns=df2.columns, fill_value=0)  # fill value while reindexing

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


### 5.2.8 Operations between DataFrame and Series

In [24]:
arr = np.arange(12.).reshape((3,4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [25]:
arr[0]

array([0., 1., 2., 3.])

In [26]:
arr - arr[0] # broadcast: the subtraction is performed once for each row

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [50]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

print(frame)
print("\n")
print(series)

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64


In [51]:
frame - series # broadcast

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [52]:
series2 = pd.Series(range(3), index=['b','e','f'])
series2

b    0
e    1
f    2
dtype: int64

In [53]:
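frame + series2 # labels found in only one of the frame's columns and the Series' index are kept: the result is reindexed to the union, and the non-overlapping labels get NaN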

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [54]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [56]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [55]:
frame.sub(series3, axis='index')
# Broadcast over the columns by matching on the row index (axis='index' or axis=0)
# frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
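A common use of broadcasting along the row index is centering each row on its own mean. This is a sketch on the same frame as above; `row_means` and `centered` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels.
row_means = frame.mean(axis='columns')

# Match the Series on the row index and subtract it from every column.
centered = frame.sub(row_means, axis='index')
```

Without `axis='index'`, pandas would try to match the Series' labels (Utah, Ohio, ...) against the column names and the result would be all NaN.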


### 5.2.9 Function Application and Mapping

In [57]:
frame = pd.DataFrame(np.random.randn(4,3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-0.528012,-2.892127,0.378417
Ohio,0.165509,-1.138978,0.220733
Texas,0.603012,-1.177392,0.981033
Oregon,-1.735516,-0.41542,1.139146


In [58]:
np.abs(frame) # take the absolute value element-wise

Unnamed: 0,b,d,e
Utah,0.528012,2.892127,0.378417
Ohio,0.165509,1.138978,0.220733
Texas,0.603012,1.177392,0.981033
Oregon,1.735516,0.41542,1.139146


In [59]:
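# Apply a function to the one-dimensional arrays formed by each column or row
# By default (axis=0 or axis='index') the function is called once per column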
f = lambda x: x.max() - x.min()
frame.apply(f) 

b    2.338528
d    2.476706
e    0.918413
dtype: float64

In [64]:
# With axis=1 or axis='columns' the function is called once per row
frame.apply(f, axis='columns') 

Utah      3.270543
Ohio      1.359710
Texas     2.158425
Oregon    2.874662
dtype: float64

In [65]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f, axis=1)

Unnamed: 0,min,max
Utah,-2.892127,0.378417
Ohio,-1.138978,0.220733
Texas,-1.177392,0.981033
Oregon,-1.735516,1.139146


In [66]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.735516,-2.892127,0.220733
max,0.603012,-0.41542,1.139146


In [67]:
# Format every floating-point value in frame as a string: the element-wise applymap method
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.53,-2.89,0.38
Ohio,0.17,-1.14,0.22
Texas,0.6,-1.18,0.98
Oregon,-1.74,-0.42,1.14


In [72]:
frame.applymap(format).dtypes

b    object
d    object
e    object
dtype: object

In [73]:
frame.dtypes

b    float64
d    float64
e    float64
dtype: object
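For a single column, the element-wise counterpart of applymap is `Series.map`. A minimal sketch with fixed, illustrative numbers (the frame above uses random values, so it is not reused here); `fmt` is a hypothetical name chosen to avoid shadowing the built-in `format`.

```python
import pandas as pd

frame = pd.DataFrame([[1.234, -5.678],
                      [0.5, 2.0]], columns=['b', 'd'])

# Format each value with two decimal places.
fmt = lambda x: '%.2f' % x

# Series.map applies the function to every element of one column.
formatted = frame['d'].map(fmt)
```

As with applymap, the result holds strings rather than floats, so numeric operations no longer apply to it.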

In [74]:
?frame.applymap