In [1]:
import pandas as pd
pd.__version__

'0.24.2'

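Pandas provides two main data types:

Series: an index plus one-dimensional data

DataFrame: a row index (index) and a column index (columns) plus a two-dimensional table; higher-dimensional data can be represented through hierarchical indexing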

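# The Series type

## Index

Pandas focuses on the relationship between data and its index: operating on the index means operating on the data.

An index is either automatic or user-defined.

Automatic index: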

In [2]:
a = pd.Series([9,8,7,6])
a

0    9
1    8
2    7
3    6
dtype: int64

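User-defined index:

The number of index labels must equal the number of list elements; otherwise a ValueError is raised.

Index labels may repeat.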

In [3]:
a = pd.Series([9,8,7,6],index=['a','c','c','d'])
a

a    9
c    8
c    7
d    6
dtype: int64
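The two rules above can be checked directly. The following is a minimal sketch, not part of the original notebook; the variable names `error_message`, `dup` and `is_unique` are illustrative only.

```python
import pandas as pd

# A mismatched index length raises ValueError.
error_message = None
try:
    pd.Series([9, 8, 7, 6], index=['a', 'b', 'c'])
except ValueError as exc:
    error_message = str(exc)

# Duplicate labels are allowed; selecting one returns every match.
s = pd.Series([9, 8, 7, 6], index=['a', 'c', 'c', 'd'])
dup = s['c']                    # a Series with two rows, not a scalar
is_unique = s.index.is_unique   # False, because 'c' appears twice

print(dup.tolist())             # [8, 7]
```

Because a repeated label returns a Series instead of a scalar, code that expects a single value should check `index.is_unique` first.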

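## Creating a Series

### From a Python list

See section 1.1 (Index) above.

### From a scalar

Without an index, the Series has a single element; with an index, the Series has as many elements as the index has labels.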

In [4]:
a = pd.Series(3)
a

0    3
dtype: int64

In [5]:
a = pd.Series(4, [0,1,2,3,4])
a

0    4
1    4
2    4
3    4
4    4
dtype: int64

### From a dict

In [6]:
a = pd.Series({'a':9,'b':8,'c':7})
a

a    9
b    8
c    7
dtype: int64

Passing an index selects entries from the dict; labels that are not keys of the dict get the missing value NaN.

In [7]:
b = pd.Series({'a':9,'b':8, 'c':7},index=['b','c','d'])
b

b    8.0
c    7.0
d    NaN
dtype: float64
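The output above switches from int64 to float64 because NaN is a floating-point value. A small sketch of how to detect and remove the missing entry, assuming the same dict as in the cell above (the name `clean` is illustrative):

```python
import pandas as pd

data = {'a': 9, 'b': 8, 'c': 7}
b = pd.Series(data, index=['b', 'c', 'd'])

# NaN is a float, so the integer values are upcast to float64.
print(b.dtype)            # float64
print(b.isna().tolist())  # [False, False, True]

# Drop the missing entry and cast back to integers if needed.
clean = b.dropna().astype('int64')
print(clean.tolist())     # [8, 7]
```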

### From a NumPy ndarray

In [8]:
import numpy as np
a = pd.Series(np.arange(5))
a

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [9]:
b = pd.Series(np.arange(5), np.arange(9,4,-1))
b

9    0
8    1
7    2
6    3
5    4
dtype: int64

### Other iterables, such as range, also work

In [10]:
pd.Series(range(5))

0    0
1    1
2    2
3    3
4    4
dtype: int64

## Basic operations

A Series has two parts:

the index, which is a pandas type (Index);

the values, which are a NumPy ndarray.

In [11]:
a = pd.Series([9,8,7,6], ['a','b','c','d'])

In [12]:
a.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [13]:
a.values     # not a.value

array([9, 8, 7, 6])

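### Indexing and slicing

Even with a user-defined index, positional (automatic) indexing still works in the pandas version used here (0.24.2).

Slices work as in NumPy and Python, and custom labels can also serve as slice bounds.

However, labels and positions cannot be mixed in one key.

A single key returns a scalar, while a list of keys or a slice returns a Series.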

In [14]:
a['a']

9

In [15]:
a[0]

9

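Note that indexing with a list of positions works only in NumPy and pandas; a native Python list does not support it.
Remember that a[[0,1,2]] works, but a[0,1,2] does not.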

In [16]:
a[[0,1,2]]

a    9
b    8
c    7
dtype: int64

In [17]:
a[0:3]

a    9
b    8
c    7
dtype: int64

In [18]:
a['a':'c']

a    9
b    8
c    7
dtype: int64

In [19]:
# a[['a', 2, 1]]   labels and positions cannot be mixed in one key
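Plain [] has to guess whether an integer is a label or a position. The sketch below uses the explicit accessors instead, `loc` for labels and `iloc` for positions, which removes the ambiguity. It is offered as a suggestion rather than as part of the original notebook; newer pandas releases reportedly deprecate positional integers in [] on a labelled Series, so the explicit form is also the safer one.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

by_label = a.loc['a':'c']     # label slice: end label included
by_position = a.iloc[0:3]     # positional slice: end position excluded

print(by_label.tolist())      # [9, 8, 7]
print(by_position.tolist())   # [9, 8, 7]
print(a.iloc[0], a.loc['a'])  # 9 9
```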

### Similar to NumPy operations
The difference is that most results are still Series.

In [20]:
a[a>a.median()]

a    9
b    8
dtype: int64

In [21]:
np.exp(a)

a    8103.083928
b    2980.957987
c    1096.633158
d     403.428793
dtype: float64

### Similar to Python dict operations

In [22]:
'c' in a

True

In [23]:
0 in a   # positional indices are not keys

False

In [24]:
a.get('f',100)

100

### Operations align on the index

In [25]:
a = pd.Series([0,4,3,2],['a','b','c','d'])

In [26]:
b = pd.Series([1,5,3,7],['b','c','d','e'])

In [27]:
a + b

a    NaN
b    5.0
c    8.0
d    5.0
e    NaN
dtype: float64
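Labels present in only one operand produce NaN, as the output above shows. If a missing label should count as zero instead, `Series.add` accepts a `fill_value`. A brief sketch using the same two Series (the name `total` is illustrative):

```python
import pandas as pd

a = pd.Series([0, 4, 3, 2], ['a', 'b', 'c', 'd'])
b = pd.Series([1, 5, 3, 7], ['b', 'c', 'd', 'e'])

# Treat a label missing from one side as 0 instead of producing NaN.
total = a.add(b, fill_value=0)
print(total.tolist())  # [0.0, 5.0, 8.0, 5.0, 7.0]
```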

### In-place modification takes effect immediately

In [28]:
a

a    0
b    4
c    3
d    2
dtype: int64

In [29]:
a['c'] = 15
a[0] = 7
a

a     7
b     4
c    15
d     2
dtype: int64

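# The DataFrame type

A DataFrame is a two-dimensional data type with an index on both rows and columns: the row index is index and the column index is columns.
As in NumPy, rows are axis 0 and columns are axis 1.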

In [30]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,1.05411,0.353464,0.2652,-0.727784
2013-01-02,-0.75545,1.340865,-0.644747,0.188549
2013-01-03,-0.678723,-1.459584,-0.242464,-0.942459
2013-01-04,-0.550449,-0.241706,-0.656447,0.345561
2013-01-05,0.35186,-1.083367,0.695469,0.043402
2013-01-06,-0.785006,-1.74546,0.955968,-0.203192


## Slicing and selection

Using [ ]

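|Data type|Selection|Return type|
|:-------|--------|-------|
|Series|s[label] or s[slice]|A scalar for a label, a Series for a slice; positional slices behave as in NumPy|
|DataFrame|df[column_name] or df[slice]|The Series for column_name; note that a slice selects rows only|
|Panel (deprecated; removed in pandas 0.25)|p[item_name] or p[slice]|The DataFrame for item_name|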

In [31]:
# select a column
df['A']

2013-01-01    1.054110
2013-01-02   -0.755450
2013-01-03   -0.678723
2013-01-04   -0.550449
2013-01-05    0.351860
2013-01-06   -0.785006
Freq: D, Name: A, dtype: float64

In [32]:
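# equivalent to df['A']
# a Series index label, a DataFrame column, or a Panel item can be accessed directly as an attribute
# when assigning through an attribute, make sure the label or column already exists; otherwise pandas emits a warning and creates a plain attribute instead of a new index label or column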
df.A

2013-01-01    1.054110
2013-01-02   -0.755450
2013-01-03   -0.678723
2013-01-04   -0.550449
2013-01-05    0.351860
2013-01-06   -0.785006
Freq: D, Name: A, dtype: float64
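The caveat about attribute assignment is easy to miss, so here is a small sketch of it on a toy DataFrame (the column name `E` and the variable `created_by_attribute` are made up for illustration). The warning is silenced only to keep the output clean.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

df.A = [10, 20, 30]                  # existing column: values are updated
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about the next line
    df.E = [7, 8, 9]                 # no such column: only an attribute is set

created_by_attribute = 'E' in df.columns
print(df['A'].tolist())              # [10, 20, 30]
print(created_by_attribute)          # False

df['E'] = [7, 8, 9]                  # bracket assignment creates the column
print('E' in df.columns)             # True
```

Bracket assignment is therefore the dependable way to create a new column.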

### Label-based selection with loc/at | Position-based selection with iloc/iat

In [33]:
# df.loc[rows_label, columns_label]
# note: unlike Python and NumPy slices, a pandas label slice m:n includes the end label
df.loc['2013-01-01':'2013-01-03',['A','B']]

Unnamed: 0,A,B
2013-01-01,1.05411,0.353464
2013-01-02,-0.75545,1.340865
2013-01-03,-0.678723,-1.459584


In [34]:
# positional slices, however, exclude the end position
df.iloc[0:3,0:3]

Unnamed: 0,A,B,C
2013-01-01,1.05411,0.353464,0.2652
2013-01-02,-0.75545,1.340865,-0.644747
2013-01-03,-0.678723,-1.459584,-0.242464


In [35]:
# to select a scalar, use at (by label) or iat (by position)
df.at[dates[0], 'A']

1.0541096403157948

In [36]:
# list indexers behave differently from NumPy here: they select every combination of the listed rows and columns
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2013-01-02,-0.75545,-0.644747
2013-01-03,-0.678723,-0.242464
2013-01-05,0.35186,0.695469


In [37]:
a = np.arange(12).reshape(3,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [38]:
a[[0,2],[1,3]]

array([ 1, 11])
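NumPy pairs the two lists element by element, while `iloc` takes every combination of the listed rows and columns. To get the `iloc` behaviour in NumPy, `np.ix_` can be used. A short sketch for comparison (the names `pairwise`, `block` and `df_block` are illustrative):

```python
import numpy as np
import pandas as pd

a = np.arange(12).reshape(3, 4)
df = pd.DataFrame(a)

pairwise = a[[0, 2], [1, 3]]        # elements (0, 1) and (2, 3)
block = a[np.ix_([0, 2], [1, 3])]   # rows 0, 2 crossed with columns 1, 3
df_block = df.iloc[[0, 2], [1, 3]].to_numpy()

print(pairwise.tolist())            # [1, 11]
print(block.tolist())               # [[1, 3], [9, 11]]
print(df_block.tolist())            # [[1, 3], [9, 11]]
```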

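## Modification
All pandas objects are value-mutable, but some are not size-mutable: the length of a Series, for example, cannot be changed.
Most methods, however, return a modified copy instead of changing the original data in place.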
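As a quick illustration of methods returning copies, the sketch below calls `sort_values` and `drop` on a toy DataFrame and shows that the original is left untouched. The data and variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'x': [3, 1, 2], 'y': [6, 4, 5]})

sorted_df = df.sort_values('x')   # returns a new, sorted DataFrame
dropped = df.drop(columns='y')    # returns a new DataFrame without 'y'

print(df['x'].tolist())           # [3, 1, 2]  (original unchanged)
print(sorted_df['x'].tolist())    # [1, 2, 3]
print(list(dropped.columns))      # ['x']
print(list(df.columns))           # ['x', 'y']
```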

### Adding a column
This works like a dict assignment.

In [39]:
# dict-like assignment, but rows are aligned on the index
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1
df

Unnamed: 0,A,B,C,D,F
2013-01-01,1.05411,0.353464,0.2652,-0.727784,
2013-01-02,-0.75545,1.340865,-0.644747,0.188549,1.0
2013-01-03,-0.678723,-1.459584,-0.242464,-0.942459,2.0
2013-01-04,-0.550449,-0.241706,-0.656447,0.345561,3.0
2013-01-05,0.35186,-1.083367,0.695469,0.043402,4.0
2013-01-06,-0.785006,-1.74546,0.955968,-0.203192,5.0
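The first row of F is empty above because `s1` starts one day later than `df` and assignment aligns on the index. If positional assignment is wanted instead, passing a plain array skips the alignment. A minimal sketch on a smaller frame (the column name `G` is hypothetical):

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
df = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=dates)

s1 = pd.Series([10, 20, 30], index=pd.date_range('20130102', periods=3))

df['F'] = s1              # aligned on the index: 2013-01-01 has no match
df['G'] = s1.to_numpy()   # plain array: assigned by position, no alignment

print(df['F'].isna().tolist())  # [True, False, False]
print(df['G'].tolist())         # [10, 20, 30]
```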


### Modifying a single element
This is done directly in place.

In [40]:
df.at[dates[0], 'A'] = 0

In [41]:
df.iat[0, 1] = 0

### Modifying a row or column in place
Note that len(df) returns the number of rows; the shape attribute of df gives the full shape.

In [42]:
df.loc[:, 'D'] = np.array([5] * len(df))

In [43]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.2652,5,
2013-01-02,-0.75545,1.340865,-0.644747,5,1.0
2013-01-03,-0.678723,-1.459584,-0.242464,5,2.0
2013-01-04,-0.550449,-0.241706,-0.656447,5,3.0
2013-01-05,0.35186,-1.083367,0.695469,5,4.0
2013-01-06,-0.785006,-1.74546,0.955968,5,5.0


In [44]:
# modify a row with a dict
x = pd.DataFrame({'x':[1,2,3],'y':[4,5,6]})
x.iloc[1] = {'x':9,'y':99}
x

Unnamed: 0,x,y
0,1,4
1,9,99
2,3,6


### Assignment through a boolean mask

In [45]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.2652,-5,
2013-01-02,-0.75545,-1.340865,-0.644747,-5,-1.0
2013-01-03,-0.678723,-1.459584,-0.242464,-5,-2.0
2013-01-04,-0.550449,-0.241706,-0.656447,-5,-3.0
2013-01-05,-0.35186,-1.083367,-0.695469,-5,-4.0
2013-01-06,-0.785006,-1.74546,-0.955968,-5,-5.0


In [46]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.2652,5,
2013-01-02,-0.75545,1.340865,-0.644747,5,1.0
2013-01-03,-0.678723,-1.459584,-0.242464,5,2.0
2013-01-04,-0.550449,-0.241706,-0.656447,5,3.0
2013-01-05,0.35186,-1.083367,0.695469,5,4.0
2013-01-06,-0.785006,-1.74546,0.955968,5,5.0
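The masked assignment above mutates `df2`, which is why a copy was taken first. As an alternative sketch, `DataFrame.mask` produces the same result as a new object and leaves its input alone. The toy data and the names `masked` and `via_mask` are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1.0, -2.0], 'B': [-3.0, 4.0]})

masked = df2.copy()
masked[masked > 0] = -masked          # the pattern used above, in place

via_mask = df2.mask(df2 > 0, -df2)    # same result, df2 is not modified

print(via_mask.values.tolist())       # [[-1.0, -3.0], [-2.0, -4.0]]
print(masked.equals(via_mask))        # True
```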


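Note the difference between df[['A','B']] = df[['B','A']] and df.loc[:,['A','B']] = df[['B','A']].

The first swaps the two columns of df in place, while the second, which uses loc, leaves df unchanged.
This is because loc aligns the right-hand side on the axes, here the column labels, before assigning.

Assigning from columns other than 'A' and 'B', however, does change df: the labels do not match, so the target columns become NaN.
The correct way to swap two columns with loc is
df.loc[:, ['A','B']] = df[['B','A']].to_numpy()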
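The difference can be verified on a two-row toy frame. This is a sketch under the alignment behaviour described above; the names `aligned` and `swapped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

aligned = df.copy()
# The right-hand side is aligned by column label first, so nothing moves.
aligned.loc[:, ['A', 'B']] = aligned[['B', 'A']]

swapped = df.copy()
# A raw array carries no labels, so values are assigned by position.
swapped.loc[:, ['A', 'B']] = swapped[['B', 'A']].to_numpy()

print(aligned['A'].tolist())   # [1, 2]
print(swapped['A'].tolist())   # [3, 4]
print(swapped['B'].tolist())   # [1, 2]
```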

In [47]:
df[['A','B']] = df[['B','A']]
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.2652,5,
2013-01-02,1.340865,-0.75545,-0.644747,5,1.0
2013-01-03,-1.459584,-0.678723,-0.242464,5,2.0
2013-01-04,-0.241706,-0.550449,-0.656447,5,3.0
2013-01-05,-1.083367,0.35186,0.695469,5,4.0
2013-01-06,-1.74546,-0.785006,0.955968,5,5.0


In [48]:
# df.loc[:, ['A','B']] = df[['C','D']]
# columns A and B of df become NaN

In [49]:
t = df.loc[:,['A','B']]
t

Unnamed: 0,A,B
2013-01-01,0.0,0.0
2013-01-02,1.340865,-0.75545
2013-01-03,-1.459584,-0.678723
2013-01-04,-0.241706,-0.550449
2013-01-05,-1.083367,0.35186
2013-01-06,-1.74546,-0.785006


In [50]:
df.loc['2013-01-02':'2013-01-05']

Unnamed: 0,A,B,C,D,F
2013-01-02,1.340865,-0.75545,-0.644747,5,1.0
2013-01-03,-1.459584,-0.678723,-0.242464,5,2.0
2013-01-04,-0.241706,-0.550449,-0.656447,5,3.0
2013-01-05,-1.083367,0.35186,0.695469,5,4.0


In [51]:
s = pd.Series([-1,-2,-3,2,4,5])

In [52]:
s>0

0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool

In [54]:
df>0

Unnamed: 0,A,B,C,D,F
2013-01-01,False,False,True,True,False
2013-01-02,True,False,False,True,True
2013-01-03,False,False,False,True,True
2013-01-04,False,False,False,True,True
2013-01-05,False,True,True,True,True
2013-01-06,False,False,True,True,True


In [55]:
df['A']>0

2013-01-01    False
2013-01-02     True
2013-01-03    False
2013-01-04    False
2013-01-05    False
2013-01-06    False
Freq: D, Name: A, dtype: bool
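A boolean Series such as the one above is most often used to filter rows. The sketch below shows that use on a small hypothetical frame, including how to combine two conditions with `&` and how to pick a column at the same time with `loc`.

```python
import pandas as pd

df = pd.DataFrame({'A': [0.0, 1.3, -1.5], 'B': [0.0, -0.8, -0.7]})

positive_a = df[df['A'] > 0]               # rows where A > 0
both = df[(df['A'] > 0) & (df['B'] < 0)]   # each condition needs parentheses
b_where_a = df.loc[df['A'] > 0, 'B']       # filter rows and select a column

print(positive_a.index.tolist())           # [1]
print(both.index.tolist())                 # [1]
print(b_where_a.tolist())                  # [-0.8]
```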