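### 4.5.1 The Series Object
* A Series represents a one-dimensional array structure made up of two associated arrays.   
 The main array holds the data (values of any NumPy type).   
 Each element of the main array has an associated label, stored in a second array called the Index.   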

- Declaring a Series   
 Call the Series() constructor and pass the data to store as an array; this creates the Series object.

In [1]:
import numpy as np
import pandas as pd
s = pd.Series([12,-4,7,9])
s

0    12
1    -4
2     7
3     9
dtype: int64

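* The Index on the left is a column of labels; the values on the right are the elements those labels refer to.   
 When no index is given, the labels default to integers counting up from 0.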

In [2]:
s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
s

a    12
b    -4
c     7
d     9
dtype: int64

In [3]:
s.values

array([12, -4,  7,  9])

In [4]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
s.iloc[2] == s['c']

True

In [6]:
s.iloc[2] = 1
s

a    12
b    -4
c     1
d     9
dtype: int64

In [7]:
s[s>=1]

a    12
c     1
d     9
dtype: int64

In [8]:
s/3

a    4.000000
b   -1.333333
c    0.333333
d    3.000000
dtype: float64

In [9]:
np.log(s)

a    2.484907
b         NaN
c    0.000000
d    2.197225
dtype: float64

* Evaluating the elements of a Series

In [10]:
serd = pd.Series([1,0,2,1,2,3], index=['w','w','b','g','g','y'])
serd

w    1
w    0
b    2
g    1
g    2
y    3
dtype: int64

* unique( )   
 Returns an array containing the distinct elements of the Series (duplicates removed)

In [11]:
serd.unique()

array([1, 0, 2, 3])

* value_counts( )   
 Returns each distinct element together with the number of times it occurs

In [12]:
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64

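* isin( )   
 Tests membership: for each element, reports whether it appears in the given list of values. The resulting boolean Series can be used as a filter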

In [13]:
serd.isin([0,3])

w    False
w     True
b    False
g    False
g    False
y     True
dtype: bool
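
The boolean result of isin( ) is mostly useful as a mask. A minimal sketch of the filtering step, reusing the serd data from above (the name `selected` is only illustrative):

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3], index=['w', 'w', 'b', 'g', 'g', 'y'])

# isin() yields a boolean Series; indexing with it keeps only the
# elements whose mask entry is True.
mask = serd.isin([0, 3])
selected = serd[mask]
print(selected)
```

Only the elements equal to 0 or 3 survive, and they keep their original labels.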

* NaN (Not a Number)   
 Marks a missing element; it can be entered explicitly when creating the data   
 >np.nan

In [14]:
s2 = pd.Series([5,-3,np.nan,14])
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

* isnull( )   
 notnull( )   
 Identify which index labels have a missing (NaN) element and which do not

In [15]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [16]:
s2.notnull()

0     True
1     True
2    False
3     True
dtype: bool

In [17]:
s2[s2.notnull()]

0     5.0
1    -3.0
3    14.0
dtype: float64

* Series as a dictionary   
 A Series can be treated like a dictionary (dict) object, and it can be built from one: the keys become the index labels

In [18]:
mydict = {'red':2000,
          'blue':1000,
          'yellow':500,
          'orange':1000}
myseries = pd.Series(mydict)
myseries

blue      1000
orange    1000
red       2000
yellow     500
dtype: int64

In [19]:
colors = ['red','yellow','orange','blue','green']
myseries = pd.Series(mydict,index=colors)
myseries

red       2000.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

* Operations between Series objects

In [20]:
mydict2 = {'red':400,
           'yellow':1000, 
           'black':700}
myseries2 = pd.Series(mydict2)
myseries + myseries2

black        NaN
blue         NaN
green        NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64
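
When NaN for unmatched labels is not wanted, the add( ) method accepts a fill_value. The sketch below is an illustration, not part of the original notes; it uses the same two dictionaries (without the extra 'green' label) and treats a label missing from one side as 0.

```python
import pandas as pd

myseries = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000})
myseries2 = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# Operator form: labels found in only one Series give NaN.
plain = myseries + myseries2

# Method form: a label missing from one operand is replaced by 0
# before the addition, so no NaN is produced here.
filled = myseries.add(myseries2, fill_value=0)
print(filled)
```

Note that a label whose value is missing on both sides would still come out as NaN.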

### 4.5.2 DataFrame Object
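* A tabular data structure, similar to a spreadsheet;   
 it extends the Series from one dimension to several.   
 
* A DataFrame consists of an ordered collection of columns, and each column may hold a different data type.   
 A DataFrame has two index arrays: the first is associated with the rows, the second with the columns.   
 It is usually built from a dict in which each key is a column name and each value is the array holding that column's data.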

In [21]:
data = {'color': ['b','g','y','r','w'],
       'object': ['ball','pen','pencil','paper','mug'],
       'price' : [1.2,1.0,0.6,0.9,1.7]}
frame = pd.DataFrame(data)
frame

   color  object  price
0      b    ball    1.2
1      g     pen    1.0
2      y  pencil    0.6
3      r   paper    0.9
4      w     mug    1.7


In [22]:
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2

   object  price
0    ball    1.2
1     pen    1.0
2  pencil    0.6
3   paper    0.9
4     mug    1.7


In [23]:
frame2 = pd.DataFrame(data, index=['one','two','three','four','five'])
frame2

       color  object  price
one        b    ball    1.2
two        g     pen    1.0
three      y  pencil    0.6
four       r   paper    0.9
five       w     mug    1.7


In [24]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['r','b','y','w'],
                     columns=['ball','pen','pencil','paper'])
frame3

   ball  pen  pencil  paper
r     0    1       2      3
b     4    5       6      7
y     8    9      10     11
w    12   13      14     15


In [25]:
print(frame.columns)
print(frame.index)
print(frame.values)

Index(['color', 'object', 'price'], dtype='object')
RangeIndex(start=0, stop=5, step=1)
[['b' 'ball' 1.2]
 ['g' 'pen' 1.0]
 ['y' 'pencil' 0.6]
 ['r' 'paper' 0.9]
 ['w' 'mug' 1.7]]


* Assigning values

In [26]:
frame.index.name = 'id'
frame.columns.name = 'item'
frame

item  color  object  price
id
0         b    ball    1.2
1         g     pen    1.0
2         y  pencil    0.6
3         r   paper    0.9
4         w     mug    1.7


In [27]:
frame['new'] = 12
frame

item  color  object  price  new
id
0         b    ball    1.2   12
1         g     pen    1.0   12
2         y  pencil    0.6   12
3         r   paper    0.9   12
4         w     mug    1.7   12


In [28]:
ser = pd.Series(np.arange(5))
frame['new'] = ser
frame

item  color  object  price  new
id
0         b    ball    1.2    0
1         g     pen    1.0    1
2         y  pencil    0.6    2
3         r   paper    0.9    3
4         w     mug    1.7    4


* Membership

In [29]:
frame.isin([1.0,'pen'])

item  color  object  price    new
id
0     False   False  False  False
1     False    True   True   True
2     False   False  False  False
3     False   False  False  False
4     False   False  False  False


In [30]:
frame[frame.isin([1.0,'pen'])]

item  color  object  price  new
id
0       NaN     NaN    NaN  NaN
1       NaN     pen    1.0  1.0
2       NaN     NaN    NaN  NaN
3       NaN     NaN    NaN  NaN
4       NaN     NaN    NaN  NaN


In [31]:
del frame['new']
frame

item  color  object  price
id
0         b    ball    1.2
1         g     pen    1.0
2         y  pencil    0.6
3         r   paper    0.9
4         w     mug    1.7


* Transposing a DataFrame   
 rows <--> columns

In [32]:
frame.T

id         0    1       2      3    4
item
color      b    g       y      r    w
object  ball  pen  pencil  paper  mug
price    1.2  1.0     0.6    0.9  1.7


### 4.5.3 Methods on Index Objects

In [33]:
ser = pd.Series([5,0,3,8,4], index=['r','b','y','w','g'])
print(ser.index)

Index(['r', 'b', 'y', 'w', 'g'], dtype='object')


In [34]:
print(ser.idxmin())
print(ser.idxmax())

b
w


## 4.6 Other Functionality on Indexes

### 4.6.1 Reindexing

In [35]:
ser = pd.Series([2,5,7,4], index=['one','two','three','four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

In [36]:
ser.reindex(['three','four','five','one'])

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64

In [37]:
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

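* The index of the Series just defined is missing several values (1, 2, and 4).   

* A common need is to fill such gaps: call reindex() with the method option set to ffill, which copies the value of the nearest preceding label.   
 The target range of index values, 0 to 5, is passed as range(6).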

In [38]:
ser3.reindex(range(6),method='ffill')

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

* With bfill, each newly inserted label takes the value of the nearest label that follows it.

In [39]:
ser3.reindex(range(6),method='bfill')

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64
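
To see what the method option actually changes, it can help to compare it with a plain reindex and with a constant fill. The following sketch reuses ser3 from above; the variable names are illustrative only.

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# No fill option: labels absent from the original index become NaN.
no_fill = ser3.reindex(range(6))

# Constant fill: every new label gets the same value.
zero_fill = ser3.reindex(range(6), fill_value=0)

# Forward fill copies from the preceding label, backward fill from the next one.
forward = ser3.reindex(range(6), method='ffill')
backward = ser3.reindex(range(6), method='bfill')

print(pd.DataFrame({'none': no_fill, 'zero': zero_fill,
                    'ffill': forward, 'bfill': backward}))
```

Label 6 of the original Series is dropped in every case because it lies outside range(6).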

In [40]:
frame.reindex(range(5),method='ffill',columns=['colors','price','new','object'])

item  colors  price  new  object
id
0          b    1.2    b    ball
1          g    1.0    g     pen
2          y    0.6    y  pencil
3          r    0.9    r   paper
4          w    1.7    w     mug


### 4.6.2 Dropping
* drop( )   
 A function dedicated to deletion: it returns a new object without the given index labels and their elements

In [41]:
frame.drop([4])
frame.drop(['price'],axis=1)

item  color  object
id
0         b    ball
1         g     pen
2         y  pencil
3         r   paper
4         w     mug


In [42]:
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser

red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [43]:
ser.drop('yellow')

red      0.0
blue     1.0
white    3.0
dtype: float64
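
Because drop( ) returns a new object, the original is left untouched unless the result is assigned back. A small sketch with the same ser data, dropping two labels at once by passing a list:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(4.), index=['red', 'blue', 'yellow', 'white'])

# A list of labels removes several elements in one call.
dropped = ser.drop(['blue', 'white'])

print(dropped)   # only 'red' and 'yellow' remain
print(ser)       # the original still has all four labels
```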

### 4.6.3 Arithmetic and Data Alignment

In [44]:
s1 = pd.Series([3,2,5,1],['w','y','g','b'])
s2 = pd.Series([1,4,7,2,1],['w','y','k','b','brown'])
s1+s2

b        3.0
brown    NaN
g        NaN
k        NaN
w        4.0
y        6.0
dtype: float64

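* Some labels appear in both objects, others in only one of them.   
 When a label is present in both Series, the two elements are added;   
 otherwise the label still appears in the new Series, with NaN as its element

* Operations between DataFrame objects follow the same alignment rule, except that both rows and columns are aligned.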

In [45]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['r','b','y','w'],
                     columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index=['b','g','w','y'],
                     columns=['mug','pen','ball'])
print(frame1)
print(frame2)

   ball  pen  pencil  paper
r     0    1       2      3
b     4    5       6      7
y     8    9      10     11
w    12   13      14     15
   mug  pen  ball
b    0    1     2
g    3    4     5
w    6    7     8
y    9   10    11


In [46]:
print(frame1+frame2)

   ball  mug  paper   pen  pencil
b   6.0  NaN    NaN   6.0     NaN
g   NaN  NaN    NaN   NaN     NaN
r   NaN  NaN    NaN   NaN     NaN
w  20.0  NaN    NaN  20.0     NaN
y  19.0  NaN    NaN  19.0     NaN


### 4.7.1 Flexible Arithmetic Methods
* add( )   
 sub( )   
 div( )   
 mul( )   

In [47]:
frame1.add(frame2)

   ball  mug  paper   pen  pencil
b   6.0  NaN    NaN   6.0     NaN
g   NaN  NaN    NaN   NaN     NaN
r   NaN  NaN    NaN   NaN     NaN
w  20.0  NaN    NaN  20.0     NaN
y  19.0  NaN    NaN  19.0     NaN
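
Called without extra arguments, add( ) gives the same result as the + operator. One reason to prefer the method form is its fill_value option. The sketch below, an illustration rather than part of the original notes, rebuilds frame1 and frame2 and fills cells that exist in only one of the frames with 0.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['r', 'b', 'y', 'w'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['b', 'g', 'w', 'y'],
                      columns=['mug', 'pen', 'ball'])

# Operator form: a cell missing from either frame becomes NaN.
plain = frame1 + frame2

# Method form with fill_value: a cell missing from only one frame is
# treated as 0; a cell missing from both frames is still NaN.
filled = frame1.add(frame2, fill_value=0)
print(filled)
```

For example, row 'g' and column 'paper' never occur together in either frame, so that cell stays NaN even with fill_value.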


### 4.7.2 Operations Between DataFrame and Series Objects

In [48]:
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['r','b','y','w'],
                     columns=['ball','pen','pencil','paper'])
frame - ser

   ball  pen  pencil  paper
r     0    0       0      0
b     4    4       4      4
y     8    8       8      8
w    12   12      12     12


In [49]:
ser['mug']=9
frame-ser

   ball  mug  paper  pen  pencil
r     0  NaN      0    0       0
b     4  NaN      4    4       4
y     8  NaN      8    8       8
w    12  NaN     12   12      12
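
In the examples above the Series labels are matched against the column names, so the Series is subtracted from every row. To match against the row labels instead, the sub( ) method takes axis=0. A minimal sketch, assuming the same 4x4 frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['r', 'b', 'y', 'w'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Series labeled by column names: subtracted from each row.
by_columns = frame - frame.loc['r']

# Series labeled by row names: axis=0 subtracts it from each column.
by_rows = frame.sub(frame['ball'], axis=0)

print(by_columns)
print(by_rows)
```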


## 4.8 Function Application and Mapping

### 4.8.1 Functions by Element
* ufunc (universal function): operates element by element on the data structure

In [50]:
np.sqrt(frame)

       ball       pen    pencil     paper
r  0.000000  1.000000  1.414214  1.732051
b  2.000000  2.236068  2.449490  2.645751
y  2.828427  3.000000  3.162278  3.316625
w  3.464102  3.605551  3.741657  3.872983


### 4.8.2 Functions by Row or Column

In [51]:
f = lambda x: x.max() - x.min()

In [52]:
def f(x):
    return x.max() - x.min()

In [53]:
frame.apply(f)

ball      12
pen       12
pencil    12
paper     12
dtype: int64

* To apply the function to each row instead, pass axis=1

In [54]:
frame.apply(f,axis=1)

r    3
b    3
y    3
w    3
dtype: int64

* The function passed to apply( ) can also return a Series, in which case the overall result is a DataFrame

In [55]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])
frame.apply(f)

     ball  pen  pencil  paper
min     0    1       2      3
max    12   13      14     15


### 4.8.3 Statistics Functions

In [56]:
frame.sum()

ball      24
pen       28
pencil    32
paper     36
dtype: int64

In [57]:
frame.mean()

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

In [58]:
frame.describe()

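            ball        pen     pencil      paper
count   4.000000   4.000000   4.000000   4.000000
mean    6.000000   7.000000   8.000000   9.000000
std     5.163978   5.163978   5.163978   5.163978
min     0.000000   1.000000   2.000000   3.000000
25%     3.000000   4.000000   5.000000   6.000000
50%     6.000000   7.000000   8.000000   9.000000
75%     9.000000  10.000000  11.000000  12.000000
max    12.000000  13.000000  14.000000  15.000000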


## 4.9 Sorting and Ranking

* sort_index( ) returns a new object with **the same elements in a different order**, sorted by index label.

In [59]:
ser = pd.Series([5,0,3,8,4],index=['r','b','y','w','g'])
ser

r    5
b    0
y    3
w    8
g    4
dtype: int64

In [60]:
ser.sort_index()

b    0
g    4
r    5
w    8
y    3
dtype: int64

* By default the labels are sorted in ascending order (A to Z);   
 pass ascending=False for descending order.   
 The same applies to a DataFrame

In [61]:
ser.sort_index(ascending=False)

y    3
w    8
r    5
g    4
b    0
dtype: int64

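* Sorting by the elements of the data structure rather than by label   
* **Both use sort_values( ); a DataFrame also needs the column name(s) in by** (the older ser.order( ) and frame.sort_index(by=...) have been removed from pandas)   
* **Series:**    
> ser.sort_values( )
* **DataFrame:**    
>frame.sort_values(by='')    
>frame.sort_values(by=['',''])   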
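
In current pandas releases, sorting by element value is done with sort_values( ). The sketch below shows it on a Series and on a small DataFrame; the DataFrame and its values are invented purely for illustration.

```python
import pandas as pd

ser = pd.Series([5, 0, 3, 8, 4], index=['r', 'b', 'y', 'w', 'g'])
# Hypothetical frame with ties in 'pen' so the second sort key matters.
frame = pd.DataFrame({'pen': [1, 1, 0, 0], 'ball': [3, 2, 5, 4]},
                     index=['r', 'b', 'y', 'w'])

by_value = ser.sort_values()                    # Series: no 'by' needed
by_ball = frame.sort_values(by='ball')          # one column
by_both = frame.sort_values(by=['pen', 'ball']) # 'pen' first, then 'ball'

print(by_value)
print(by_ball)
print(by_both)
```

When a list is passed to by, later columns only break ties left by the earlier ones.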

* Ranking assigns each element of the sequence a rank, starting at 1 and increasing by 1   
 **The smaller the value, the earlier (lower) its rank**

In [62]:
ser.rank()

r    4.0
b    1.0
y    2.0
w    5.0
g    3.0
dtype: float64

In [63]:
ser.rank(method='first')

r    4.0
b    1.0
y    2.0
w    5.0
g    3.0
dtype: float64
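
The two calls above print the same ranks because ser has no repeated values; the method option only matters when there are ties. A sketch with a made-up Series containing duplicates:

```python
import pandas as pd

# Hypothetical data: 5 and 3 each appear twice.
tied = pd.Series([5, 0, 3, 5, 3], index=['r', 'b', 'y', 'w', 'g'])

average = tied.rank()               # default: tied elements share the mean rank
first = tied.rank(method='first')   # ties broken by order of appearance
lowest = tied.rank(method='min')    # tied elements all get the lowest rank

print(pd.DataFrame({'average': average, 'first': first, 'min': lowest}))
```

With the default, the two 3s occupy positions 2 and 3 and therefore both receive 2.5.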

* Ranks are assigned in ascending order by default;
 pass ascending=False for descending order

In [64]:
ser.rank(ascending= False)

r    2.0
b    5.0
y    4.0
w    1.0
g    3.0
dtype: float64

## 4.10 Correlation and Covariance
* corr( )   
 cov( )

In [65]:
seq2 = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq  = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2

2006    3
2007    4
2008    3
2009    4
2010    5
2011    4
2012    3
2013    2
dtype: int64

In [66]:
seq

2006    1
2007    2
2008    3
2009    4
2010    4
2011    3
2012    2
2013    1
dtype: int64

In [67]:
seq.corr(seq2)

0.77459666924148352

In [68]:
seq.cov(seq2)

0.8571428571428571
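
The two numbers are related: the (Pearson) correlation is the covariance divided by the product of the two standard deviations. The following check, added here as an illustration, recomputes corr( ) from cov( ) and std( ) for the same two sequences.

```python
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], years)
seq2 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], years)

cov = seq.cov(seq2)
# std() and cov() both use the sample (n - 1) normalization by default,
# so the normalization factors cancel in the ratio.
manual_corr = cov / (seq.std() * seq2.std())

print(manual_corr, seq.corr(seq2))
```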

* corr( ) and cov( ) can also be called on a single DataFrame;   
 each returns its matrix in the form of a new DataFrame.

In [69]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                  index=['r','b','y','w'],
                  columns=['ball','pen','pencil','paper'])
frame2

   ball  pen  pencil  paper
r     1    4       3      6
b     4    5       6      1
y     3    3       1      5
w     4    1       6      4


In [70]:
frame2.corr()

            ball       pen    pencil     paper
ball    1.000000 -0.276026  0.577350 -0.763763
pen    -0.276026  1.000000 -0.079682 -0.361403
pencil  0.577350 -0.079682  1.000000 -0.692935
paper  -0.763763 -0.361403 -0.692935  1.000000


In [71]:
frame2.cov()

            ball       pen    pencil     paper
ball    2.000000 -0.666667  2.000000 -2.333333
pen    -0.666667  2.916667 -0.333333 -1.333333
pencil  2.000000 -0.333333  6.000000 -3.666667
paper  -2.333333 -1.333333 -3.666667  4.666667


* corrwith()   
 Computes the pairwise correlation between the columns (or rows) of a DataFrame and a Series or another DataFrame

In [72]:
frame2.corrwith(ser)

ball     -0.140028
pen      -0.869657
pencil    0.080845
paper     0.595854
dtype: float64

In [73]:
frame2.corrwith(frame)

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

## 4.11 NaN Data

* np.nan

In [74]:
ser = pd.Series([0,1,2,np.nan,9],
               index=['r','y','w','b','g'])
ser

r    0.0
y    1.0
w    2.0
b    NaN
g    9.0
dtype: float64

* Filtering out NaN   
 dropna( ) or notnull( )

In [75]:
ser.dropna()

r    0.0
y    1.0
w    2.0
g    9.0
dtype: float64

In [76]:
ser[ser.notnull()]

r    0.0
y    1.0
w    2.0
g    9.0
dtype: float64

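* On a DataFrame, dropna( ) removes an entire row (or column) as soon as it contains a single NaN.   
 To remove only the rows or columns whose elements are all NaN, pass how='all':   
>frame3.dropna(how='all')   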
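
The difference between the default and how='all' is easiest to see on a frame that actually contains NaN. The data below is made up for this sketch (the name frame3 is reused only to match the snippet above).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: every row has at least one NaN, row 'green'
# and column 'mug' are entirely NaN.
frame3 = pd.DataFrame([[6, np.nan, 6],
                       [np.nan, np.nan, np.nan],
                       [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

any_nan = frame3.dropna()                     # drops every row here
all_nan_rows = frame3.dropna(how='all')       # drops only 'green'
all_nan_cols = frame3.dropna(how='all', axis=1)  # drops only 'mug'

print(all_nan_rows)
print(all_nan_cols)
```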

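### 4.11.3 Filling NaN Elements with Other Values
* Pass a single value to replace every NaN, or a dict to choose a replacement per column
>frame.fillna(0)   
>frame.fillna({'ball':1,'mug':0,'pen':99})   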
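
A short sketch of both fillna( ) forms on a small frame invented for the purpose (columns named as in the dict above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with NaN in every column.
frame = pd.DataFrame([[6, np.nan, 6],
                      [np.nan, np.nan, np.nan],
                      [2, np.nan, 5]],
                     index=['blue', 'green', 'red'],
                     columns=['ball', 'mug', 'pen'])

zeros = frame.fillna(0)                                      # one value everywhere
per_column = frame.fillna({'ball': 1, 'mug': 0, 'pen': 99})  # one value per column

print(zeros)
print(per_column)
```

Like drop( ), fillna( ) returns a new object here and leaves frame itself unchanged.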

## 4.12 Hierarchical Indexing and Leveling
* Hierarchical indexing   
 A single axis can carry several levels of index

In [77]:
mser = pd.Series(np.random.rand(8),
                index=[['w','w','w','b','b','r','r','r'],
                      ['up','down','right','up','down','up','down','left']])
mser

w  up       0.626307
   down     0.256277
   right    0.227527
b  up       0.464895
   down     0.436307
r  up       0.158141
   down     0.736704
   left     0.864644
dtype: float64

In [78]:
mser.index

MultiIndex(levels=[['b', 'r', 'w'], ['down', 'left', 'right', 'up']],
           labels=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])

In [79]:
mser['w']

up       0.626307
down     0.256277
right    0.227527
dtype: float64

In [80]:
mser[:,'up']

w    0.626307
b    0.464895
r    0.158141
dtype: float64

In [81]:
mser['w','up']

0.62630662315248575

* unstack() converts a Series with a hierarchical index into a DataFrame,   
 turning the second index level into the columns

In [82]:
mser.unstack()

       down      left     right        up
b  0.436307       NaN       NaN  0.464895
r  0.736704  0.864644       NaN  0.158141
w  0.256277       NaN  0.227527  0.626307


In [83]:
frame

   ball  pen  pencil  paper
r     0    1       2      3
b     4    5       6      7
y     8    9      10     11
w    12   13      14     15


In [84]:
frame.stack()

r  ball       0
   pen        1
   pencil     2
   paper      3
b  ball       4
   pen        5
   pencil     6
   paper      7
y  ball       8
   pen        9
   pencil    10
   paper     11
w  ball      12
   pen       13
   pencil    14
   paper     15
dtype: int64

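### 4.12.1 Reordering and Sorting Levels
* swaplevel( )    
 Takes the names of the two levels to interchange and returns a new object with them swapped; the order of the elements is unchanged   
 > mframe.swaplevel('colors','status')   

* sort_index( )   
 With the level option, sorts the data by a single level (older pandas releases offered sortlevel( ) for this; it has since been removed)   
 > mframe.sort_index(level='colors')   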
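
The notes never define mframe, so the sketch below builds a stand-in: a DataFrame whose row levels are named colors and status and whose column levels are named objects and id. All of these values are assumptions made for illustration. Sorting by level uses sort_index(level=...), the call available in current pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level frame standing in for mframe.
rows = pd.MultiIndex.from_product([['white', 'black'], ['up', 'down']],
                                  names=['colors', 'status'])
cols = pd.MultiIndex.from_product([['pen', 'paper'], [1, 2]],
                                  names=['objects', 'id'])
mframe = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Swap the two row levels; the rows keep their original order.
swapped = mframe.swaplevel('colors', 'status')

# Sort the rows by the 'colors' level.
by_color = mframe.sort_index(level='colors')

print(swapped)
print(by_color)
```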

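### 4.12.2 Summary Statistics by Level
* Group by the level name, then aggregate (older releases wrote mframe.sum(level='colors'); the level option of sum( ) was removed in pandas 2.0)   
> mframe.groupby(level='colors').sum()   

* To aggregate over a column level instead, group the transposed frame (formerly mframe.sum(level='id',axis=1))   

>mframe.T.groupby(level='id').sum().T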
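
A sketch of per-level sums written with groupby(level=...), which works in current pandas releases. As before, mframe is not defined in these notes, so a stand-in with row levels colors/status and column levels objects/id is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level frame standing in for mframe.
rows = pd.MultiIndex.from_product([['white', 'black'], ['up', 'down']],
                                  names=['colors', 'status'])
cols = pd.MultiIndex.from_product([['pen', 'paper'], [1, 2]],
                                  names=['objects', 'id'])
mframe = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Sum the rows that share the same 'colors' label.
by_color = mframe.groupby(level='colors').sum()

# Sum the columns that share the same 'id' label: transpose, group, transpose back.
by_id = mframe.T.groupby(level='id').sum().T

print(by_color)
print(by_id)
```

Transposing before and after grouping is one way to aggregate over a column level without passing axis=1 to groupby.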