<b>Pandas: Index Operations</b>

In [1]:
import pandas as pd

In [2]:
data = pd.read_excel('excel-comp-data.xlsx')
data

Unnamed: 0,id,name,street,city,state,postal-code,Jan,Feb,Mar,year
0,211829,Kerluke,34456 Sean Highway,New Jaycob,Texas,28752.0,10000.0,62000,35000,2006
1,320563,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,38365.0,95000.0,45000,35000,2006
2,648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
3,109996,Bode,155 Fadel Crescent Apt. 144,Hyattburgh,Texas,,45000.0,120000,10000,2007
4,121213,Bauch,7274 Marissa Common,Shanahanchester,Iowa,49681.0,162000.0,120000,35000,2008


<b>1. Creating an index</b>

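The index_col parameter sets the index of the resulting DataFrame. For example, the id column can be used directly as index_col, or the state and name columns can be combined to form the index. Either way, the goal is a unique index: if the index is not unique, looking up a value by its label returns multiple records.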

In [3]:
data2 = pd.read_excel('excel-comp-data.xlsx',index_col='id')
data2.head()

id,name,street,city,state,postal-code,Jan,Feb,Mar,year
211829,Kerluke,34456 Sean Highway,New Jaycob,Texas,28752.0,10000.0,62000,35000,2006
320563,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,38365.0,95000.0,45000,35000,2006
648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
109996,Bode,155 Fadel Crescent Apt. 144,Hyattburgh,Texas,,45000.0,120000,10000,2007
121213,Bauch,7274 Marissa Common,Shanahanchester,Iowa,49681.0,162000.0,120000,35000,2008


In [4]:
new_index_col = data2.state.map(lambda x: x + '_') + data2.name
data_new = data2.copy()
data_new.index = new_index_col
data_new.head()

Unnamed: 0,name,street,city,state,postal-code,Jan,Feb,Mar,year
Texas_Kerluke,Kerluke,34456 Sean Highway,New Jaycob,Texas,28752.0,10000.0,62000,35000,2006
Texas_Walter,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,38365.0,95000.0,45000,35000,2006
Iowa_Bashirian,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
Texas_Bode,Bode,155 Fadel Crescent Apt. 144,Hyattburgh,Texas,,45000.0,120000,10000,2007
Iowa_Bauch,Bauch,7274 Marissa Common,Shanahanchester,Iowa,49681.0,162000.0,120000,35000,2008
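
As an alternative to concatenating strings, two columns can also be kept as separate index levels. The sketch below assumes a small hand-built DataFrame standing in for the Excel data (the values are copied from the table above, and the frame itself is hypothetical) and uses set_index with a list of column names, which produces a MultiIndex.

```python
import pandas as pd

# Hypothetical miniature of the table above, built inline instead of read from Excel.
df = pd.DataFrame({
    "state": ["Texas", "Texas", "Iowa"],
    "name": ["Kerluke", "Walter", "Bashirian"],
    "Jan": [10000.0, 95000.0, None],
})

# Passing a list of columns creates a two-level index (state, name).
multi = df.set_index(["state", "name"])

# A full key is a tuple with one entry per level.
row = multi.loc[("Texas", "Walter")]
print(multi.index.nlevels)
print(row["Jan"])
```

Whether a combined string label or a MultiIndex is more convenient depends on how the rows will be looked up later; both can yield a unique index.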


The is_unique attribute of the index reports whether the index is unique.

In [5]:
data_new.index.is_unique

True
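
To illustrate why uniqueness matters, here is a minimal sketch on a hand-built frame (hypothetical data, reusing names from the table above). With a non-unique index, a label lookup returns every matching row; with a unique one, it returns a single row as a Series.

```python
import pandas as pd

# Hypothetical miniature of the table above; only 'state' and 'name' are kept.
df = pd.DataFrame({
    "state": ["Texas", "Texas", "Iowa"],
    "name": ["Kerluke", "Walter", "Bashirian"],
})

# Index on state alone: 'Texas' appears twice, so the index is not unique.
by_state = df.set_index("state")

# Index on state + '_' + name, as in the cell above: every label is distinct.
by_both = df.copy()
by_both.index = df["state"] + "_" + df["name"]

print(by_state.index.is_unique)
print(by_both.index.is_unique)

texas_rows = by_state.loc["Texas"]       # DataFrame holding both Texas rows
one_row = by_both.loc["Texas_Walter"]    # Series holding exactly one row
```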

<b>2. Working with the index</b>

Sometimes the index needs to be rebuilt. A common tool for this is the reindex method, which can change the order of both rows and columns.

In [6]:
data.reindex(index=data.index[::-1]) # reverse the row order

Unnamed: 0,id,name,street,city,state,postal-code,Jan,Feb,Mar,year
4,121213,Bauch,7274 Marissa Common,Shanahanchester,Iowa,49681.0,162000.0,120000,35000,2008
3,109996,Bode,155 Fadel Crescent Apt. 144,Hyattburgh,Texas,,45000.0,120000,10000,2007
2,648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
1,320563,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,38365.0,95000.0,45000,35000,2006
0,211829,Kerluke,34456 Sean Highway,New Jaycob,Texas,28752.0,10000.0,62000,35000,2006


In [7]:
data.reindex(index=data.index[::-1],columns=['street','Jan']) # reverse the row order and keep only the ['street','Jan'] columns

Unnamed: 0,street,Jan
4,7274 Marissa Common,162000.0
3,155 Fadel Crescent Apt. 144,45000.0
2,62184 Schamberger Underpass Apt. 231,
1,1311 Alvis Tunnel,95000.0
0,34456 Sean Highway,10000.0


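The method argument fills values according to a given rule (for example 'ffill'), and fill_value fills them with a fixed value. Both apply only to labels that reindex newly introduces. NaN values already present in the data, such as Jan in row 2, are left unchanged, as the outputs below show.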

In [8]:
data.reindex(index=data.index[::-1],method='ffill',columns=['Feb','Jan'])

Unnamed: 0,Feb,Jan
4,120000,162000.0
3,120000,45000.0
2,120000,
1,45000,95000.0
0,62000,10000.0


In [9]:
data.reindex(index=data.index[::-1],fill_value='1',columns=['street','Jan'])

Unnamed: 0,street,Jan
4,7274 Marissa Common,162000.0
3,155 Fadel Crescent Apt. 144,45000.0
2,62184 Schamberger Underpass Apt. 231,
1,1311 Alvis Tunnel,95000.0
0,34456 Sean Highway,10000.0
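
The outputs above still contain NaN because reversing the index introduces no new labels. The following sketch assumes a small hypothetical frame whose index has gaps (labels 1 and 3 are missing), so that reindex actually has something to fill.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: labels 1 and 3 are absent, and label 2 already holds NaN.
df = pd.DataFrame({"Jan": [10000.0, np.nan, 45000.0]}, index=[0, 2, 4])

full = [0, 1, 2, 3, 4]

plain = df.reindex(full)                     # new labels 1 and 3 become NaN
ffilled = df.reindex(full, method="ffill")   # new labels copy the previous existing label
zeroed = df.reindex(full, fill_value=0)      # new labels get the fixed value 0

# The NaN that was already stored at label 2 survives in all three results.
print(plain)
print(ffilled)
print(zeroed)
```

To fill NaN values that are already in the data, fillna or ffill can be called on the result instead.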


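Label-based indexing with loc can also reorder rows and columns, but it offers no way to fill missing values. The older ix indexer was removed in pandas 1.0, so loc is used here.

In [10]:
new_index = data.index[::-1]
col = ['street','Jan']
data.loc[new_index,col]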

Unnamed: 0,street,Jan
4,7274 Marissa Common,162000.0
3,155 Fadel Crescent Apt. 144,45000.0
2,62184 Schamberger Underpass Apt. 231,
1,1311 Alvis Tunnel,95000.0
0,34456 Sean Highway,10000.0


<b>The drop operation</b>

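drop removes specific rows or columns. The axis argument selects whether rows or columns are dropped, and inplace controls whether the original data is modified in place.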

In [11]:
data.drop([1,3]) # drop the rows labeled 1 and 3

Unnamed: 0,id,name,street,city,state,postal-code,Jan,Feb,Mar,year
0,211829,Kerluke,34456 Sean Highway,New Jaycob,Texas,28752.0,10000.0,62000,35000,2006
2,648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
4,121213,Bauch,7274 Marissa Common,Shanahanchester,Iowa,49681.0,162000.0,120000,35000,2008


In [12]:
data.drop(['postal-code','Jan','Feb','Mar'],axis=1) # drop the listed columns

Unnamed: 0,id,name,street,city,state,year
0,211829,Kerluke,34456 Sean Highway,New Jaycob,Texas,2006
1,320563,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,2006
2,648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,2007
3,109996,Bode,155 Fadel Crescent Apt. 144,Hyattburgh,Texas,2007
4,121213,Bauch,7274 Marissa Common,Shanahanchester,Iowa,2008
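
The effect of axis and inplace can be sketched on a small hypothetical frame (values taken from the table above). By default drop returns a new DataFrame and leaves the original untouched; with inplace=True it modifies the frame and returns None.

```python
import pandas as pd

# Hypothetical miniature of the table above.
df = pd.DataFrame({
    "id": [211829, 320563, 648336],
    "Jan": [10000.0, 95000.0, None],
    "year": [2006, 2006, 2007],
})

without_rows = df.drop([1])               # drops the row labeled 1; df is unchanged
without_cols = df.drop(["Jan"], axis=1)   # axis=1 targets columns
same_cols = df.drop(columns=["Jan"])      # keyword form equivalent to the line above

trimmed = df.copy()
result = trimmed.drop(["year"], axis=1, inplace=True)  # modifies trimmed, returns None

print(without_rows)
print(without_cols)
print(trimmed)
```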


<b>Indexing, selection, and filtering</b>

Use attribute access (".") to select a single column.

In [13]:
data.city

0         New Jaycob
1         Hyattburgh
2     New Lilianland
3         Hyattburgh
4    Shanahanchester
Name: city, dtype: object

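Use a slice to select rows. With an integer slice such as data[0:5], rows are selected by position and, as with ordinary Python slicing, the end point is excluded. Only label-based slicing (for example with loc) includes the end point.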

In [14]:
data[0:5]

Unnamed: 0,id,name,street,city,state,postal-code,Jan,Feb,Mar,year
0,211829,Kerluke,34456 Sean Highway,New Jaycob,Texas,28752.0,10000.0,62000,35000,2006
1,320563,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,38365.0,95000.0,45000,35000,2006
2,648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
3,109996,Bode,155 Fadel Crescent Apt. 144,Hyattburgh,Texas,,45000.0,120000,10000,2007
4,121213,Bauch,7274 Marissa Common,Shanahanchester,Iowa,49681.0,162000.0,120000,35000,2008
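
The difference between positional and label-based slicing is easy to check on a small frame. This is a minimal sketch on hypothetical data with the default integer index, where positions and labels happen to coincide.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4.
df = pd.DataFrame({"name": ["Kerluke", "Walter", "Bashirian", "Bode", "Bauch"]})

by_brackets = df[1:3]     # positional slice, end excluded: rows 1 and 2
by_iloc = df.iloc[1:3]    # explicit positional slice, same result
by_loc = df.loc[1:3]      # label-based slice, end included: rows 1, 2 and 3

print(len(by_brackets))
print(len(by_loc))
```

Using iloc or loc explicitly makes the intent clear and avoids relying on how plain brackets interpret a slice.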


Pass a list of column names to select several columns.

In [15]:
data[['name','city']]

Unnamed: 0,name,city
0,Kerluke,New Jaycob
1,Walter,Hyattburgh
2,Bashirian,New Lilianland
3,Bode,Hyattburgh
4,Bauch,Shanahanchester


Use a comparison to filter the rows that satisfy a condition.

In [16]:
data[data.id >=  320563]

Unnamed: 0,id,name,street,city,state,postal-code,Jan,Feb,Mar,year
1,320563,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,38365.0,95000.0,45000,35000,2006
2,648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
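
The comparison produces a boolean Series, and indexing with it keeps the rows where the value is True. The sketch below, on a hypothetical miniature of the table, also shows how two conditions might be combined with `&`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical miniature of the table above.
df = pd.DataFrame({
    "id": [211829, 320563, 648336, 109996, 121213],
    "state": ["Texas", "Texas", "Iowa", "Texas", "Iowa"],
})

mask = df["id"] >= 320563          # boolean Series, one entry per row
big_ids = df[mask]                 # rows where the mask is True

# Combine conditions element-wise with & (and) or | (or).
big_texas = df[(df["id"] >= 320563) & (df["state"] == "Texas")]

print(mask.tolist())
print(big_ids)
print(big_texas)
```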


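loc can also select specific rows, specific columns, or both at once. The older ix indexer was removed in pandas 1.0, so loc is used here.

In [17]:
data.loc[1:3,['id','name','street']] # select the rows labeled 1 through 3 (end included) and the id, name, and street columns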

Unnamed: 0,id,name,street
1,320563,Walter,1311 Alvis Tunnel
2,648336,Bashirian,62184 Schamberger Underpass Apt. 231
3,109996,Bode,155 Fadel Crescent Apt. 144


The xs method takes a key and returns a single row or a single column as a Series.

In [18]:
data.xs(2) # get the row labeled 2

id                                           648336
name                                      Bashirian
street         62184 Schamberger Underpass Apt. 231
city                                 New Lilianland
state                                          Iowa
postal-code                                   76517
Jan                                             NaN
Feb                                          120000
Mar                                           35000
year                                           2007
Name: 2, dtype: object

In [19]:
data.xs('id',axis=1) # get the id column

0    211829
1    320563
2    648336
3    109996
4    121213
Name: id, dtype: int64
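
For a frame with a single-level index, the two xs calls above give the same result as loc and plain column selection. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical miniature of the table above.
df = pd.DataFrame({
    "id": [211829, 320563, 648336],
    "name": ["Kerluke", "Walter", "Bashirian"],
})

row = df.xs(2)              # row labeled 2, returned as a Series
col = df.xs("id", axis=1)   # the id column, returned as a Series

print(row.equals(df.loc[2]))   # same row via loc
print(col.equals(df["id"]))    # same column via brackets
```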