<b>Pandas: Data Object Methods</b>

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_excel('excel-comp-data.xlsx',index_col='id')
data

id,name,street,city,state,postal-code,Jan,Feb,Mar,year
211829,Kerluke,34456 Sean Highway,New Jaycob,Texas,28752.0,10000.0,62000,35000,2006
320563,Walter,1311 Alvis Tunnel,Hyattburgh,Texas,38365.0,95000.0,45000,35000,2006
648336,Bashirian,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,120000,35000,2007
109996,Bode,155 Fadel Crescent Apt. 144,Hyattburgh,Texas,,45000.0,120000,10000,2007
121213,Bauch,7274 Marissa Common,Shanahanchester,Iowa,49681.0,162000.0,120000,35000,2008


<b>1. add/sub and similar methods</b>

In [3]:
df1 = data[['Jan','Feb','Mar']]
df1

id,Jan,Feb,Mar
211829,10000.0,62000,35000
320563,95000.0,45000,35000
648336,,120000,35000
109996,45000.0,120000,10000
121213,162000.0,120000,35000


In [4]:
df2 = df1.loc[211829:648336]  # label-based slice, both ends inclusive (.ix is deprecated)
df2

id,Jan,Feb,Mar
211829,10000.0,62000,35000
320563,95000.0,45000,35000
648336,,120000,35000


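To add the values of df1 and df2 that share the same id index label and the same column name, the "+" operator is enough. Index labels that are present in only one of the two frames, however, are filled with NaN.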

In [5]:
df1 + df2

id,Jan,Feb,Mar
109996,,,
121213,,,
211829,20000.0,124000.0,70000.0
320563,190000.0,90000.0,70000.0
648336,,240000.0,70000.0


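DataFrame's add method takes a fill_value argument: where a label is missing from one of the two frames, the missing side is replaced by the given value before adding, instead of producing NaN. A cell that is missing in both frames, such as Jan for id 648336, still comes out as NaN.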

In [6]:
df1.add(df2,fill_value=0)

id,Jan,Feb,Mar
109996,45000.0,120000.0,10000.0
121213,162000.0,120000.0,35000.0
211829,20000.0,124000.0,70000.0
320563,190000.0,90000.0,70000.0
648336,,240000.0,70000.0
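
The same fill_value argument is available on the other arithmetic methods such as sub, mul and div. The following is a minimal sketch on a small made-up frame; the names left and right and all values are illustrative assumptions, not data from the Excel file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'right' covers only part of the index of 'left'.
left = pd.DataFrame({'Jan': [10.0, np.nan, 45.0],
                     'Feb': [62.0, 120.0, 120.0]},
                    index=[1, 2, 3])
right = left.loc[1:2]

plain = left - right                     # label 3 has no partner, so its row is NaN
filled = left.sub(right, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```

In filled, the row for label 3 keeps the values from left, while Jan for label 2 stays NaN because it is missing on both sides.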


<b>2. The apply/map/applymap methods</b>

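apply behaves differently on Series and DataFrame. On a Series, apply calls func on each value. On a DataFrame, the axis parameter controls whether func is applied to every column (axis=0) or every row (axis=1), so the object passed to func is one column or one row of data, that is, a Series. map is an element-wise Series method: it applies a func, dict, or Series to every value in the Series. applymap is an element-wise DataFrame method that applies func to every value in the DataFrame; in pandas 2.1 and later it is deprecated in favor of DataFrame.map.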

In [7]:
df = data[['city','year','Feb']].copy()
df

id,city,year,Feb
211829,New Jaycob,2006,62000
320563,Hyattburgh,2006,45000
648336,New Lilianland,2007,120000
109996,Hyattburgh,2007,120000
121213,Shanahanchester,2008,120000


In [57]:
df.applymap(lambda x: str(x) + '_') # call func on every element of df

id,city,year,Feb
211829,New Jaycob_,2006_,62000_
320563,Hyattburgh_,2006_,45000_
648336,New Lilianland_,2007_,120000_
109996,Hyattburgh_,2007_,120000_
121213,Shanahanchester_,2008_,120000_


In [8]:
# Check the element type of each column (a Series): if it is int64, add 1; otherwise append "!"
df.apply(lambda series: series + 1 if isinstance(series.values[0],np.int64) else series + '!',axis=0) 

id,city,year,Feb
211829,New Jaycob!,2007,62001
320563,Hyattburgh!,2007,45001
648336,New Lilianland!,2008,120001
109996,Hyattburgh!,2008,120001
121213,Shanahanchester!,2009,120001
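
The cell above uses axis=0, so func sees one column at a time. As a complement, here is a small sketch of the row-wise case with axis=1. The frame toy and its values are assumptions chosen to resemble the columns above.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as df above.
toy = pd.DataFrame({'city': ['New Jaycob', 'Hyattburgh'],
                    'year': [2006, 2007],
                    'Feb': [62000, 45000]})

# axis=1: func receives one row at a time, as a Series indexed by column name.
label = toy.apply(lambda row: f"{row['city']} ({row['year']})", axis=1)

# axis=0 (the default): func receives one column at a time.
col_max = toy[['year', 'Feb']].apply(lambda col: col.max(), axis=0)

print(label)
print(col_max)
```

Row-wise apply is convenient when the result depends on several columns at once, though for simple arithmetic a vectorized expression on the columns is usually faster.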


In [9]:
df.year.apply(lambda x : x + 1) # add 1 to every value in the Series

id
211829    2007
320563    2007
648336    2008
109996    2008
121213    2009
Name: year, dtype: int64

In [10]:
m = {2006:'a',2007:'b',2008:'c'}
df.year.map(m,na_action='ignore') # look up every value in the dict and return the mapped result

id
211829    a
320563    a
648336    b
109996    b
121213    c
Name: year, dtype: object
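
Two details of map are easy to miss: values that have no key in the dict become NaN, and na_action='ignore' matters mostly when a function is passed. The sketch below uses made-up Series (years, cities) to show both; without na_action='ignore', len would be called on the NaN float and raise a TypeError.

```python
import numpy as np
import pandas as pd

years = pd.Series([2006, 2007, 2009], name='year')   # toy data
labels = {2006: 'a', 2007: 'b'}                       # 2009 is deliberately absent

# Values without a matching key in the dict are mapped to NaN.
mapped = years.map(labels)

# With a function, na_action='ignore' propagates NaN instead of passing it to func.
cities = pd.Series(['Hyattburgh', np.nan], dtype=object)
lengths = cities.map(len, na_action='ignore')

print(mapped)
print(lengths)
```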

<b>3. Sorting and ranking</b>

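Older versions of pandas offered several sorting methods for Series and DataFrame, such as order and sort_index with a by argument, but these are deprecated. Use sort_values to sort a Series or a DataFrame instead. For a DataFrame with axis=0, that is, when rows are reordered according to the values in a column, by specifies which column to sort on.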

In [11]:
df

id,city,year,Feb
211829,New Jaycob,2006,62000
320563,Hyattburgh,2006,45000
648336,New Lilianland,2007,120000
109996,Hyattburgh,2007,120000
121213,Shanahanchester,2008,120000


In [12]:
df.year.order(na_last=True,ascending=False) # deprecated API, shown for comparison only; it is absent from current pandas


id
121213    2008
109996    2007
648336    2007
320563    2006
211829    2006
Name: year, dtype: int64

In [13]:
df.sort_index(axis=0,by='year') # deprecated: current pandas has no by argument here; use sort_values(by='year')


id,city,year,Feb
211829,New Jaycob,2006,62000
320563,Hyattburgh,2006,45000
648336,New Lilianland,2007,120000
109996,Hyattburgh,2007,120000
121213,Shanahanchester,2008,120000


In [14]:
df.year.sort_values(axis=0,ascending=False,na_position='last') # sort the values of the column

id
121213    2008
109996    2007
648336    2007
320563    2006
211829    2006
Name: year, dtype: int64

In [15]:
df.sort_values(by='Feb',axis=0,ascending=False) # sort by Feb in descending order

id,city,year,Feb
648336,New Lilianland,2007,120000
109996,Hyattburgh,2007,120000
121213,Shanahanchester,2008,120000
211829,New Jaycob,2006,62000
320563,Hyattburgh,2006,45000
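
sort_values also accepts a list of columns, with one ascending flag per column. The sketch below rebuilds the year and Feb columns by hand (the frame name toy is an assumption) and sorts by year descending, breaking ties by Feb ascending.

```python
import pandas as pd

# year and Feb values copied from the table above; 'toy' is an illustrative name.
toy = pd.DataFrame({'year': [2006, 2006, 2007, 2007, 2008],
                    'Feb': [62000, 45000, 120000, 120000, 120000]},
                   index=[211829, 320563, 648336, 109996, 121213])

# Sort by year (descending), then by Feb (ascending) within each year.
ordered = toy.sort_values(by=['year', 'Feb'], ascending=[False, True])

print(ordered)
```

Within year 2006 the row with Feb 45000 now comes before the row with Feb 62000.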


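rank differs from sorting in that it replaces the object's values with their ranks (from 1 to n). The only open question is how to handle ties, which is what the method parameter is for. The four options shown here are average (the default), min, max, and first; pandas also accepts dense.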

In [16]:
df.year

id
211829    2006
320563    2006
648336    2007
109996    2007
121213    2008
Name: year, dtype: int64

In [17]:
df.year.rank()

id
211829    1.5
320563    1.5
648336    3.5
109996    3.5
121213    5.0
Name: year, dtype: float64

In [18]:
df.year.rank(method='min')

id
211829    1.0
320563    1.0
648336    3.0
109996    3.0
121213    5.0
Name: year, dtype: float64

In [19]:
df.year.rank(method='max')

id
211829    2.0
320563    2.0
648336    4.0
109996    4.0
121213    5.0
Name: year, dtype: float64

In [20]:
df.year.rank(method='first')

id
211829    1.0
320563    2.0
648336    3.0
109996    4.0
121213    5.0
Name: year, dtype: float64
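
Two further variations that the cells above do not cover are method='dense', which leaves no gaps between ranks after a tie, and ascending=False, which gives rank 1 to the largest value. A minimal sketch on a hand-built copy of the year column:

```python
import pandas as pd

year = pd.Series([2006, 2006, 2007, 2007, 2008])

# dense: tied values share a rank and the next rank follows without a gap.
dense = year.rank(method='dense')

# Descending ranks: the largest value gets rank 1; ties take the lowest rank of the group.
desc = year.rank(method='min', ascending=False)

print(dense.tolist())   # [1.0, 1.0, 2.0, 2.0, 3.0]
print(desc.tolist())    # [4.0, 4.0, 2.0, 2.0, 1.0]
```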

<b>4. astype</b>

In [21]:
df.year

id
211829    2006
320563    2006
648336    2007
109996    2007
121213    2008
Name: year, dtype: int64

In [22]:
df.year.astype(str)

id
211829    2006
320563    2006
648336    2007
109996    2007
121213    2008
Name: year, dtype: object
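
The printed values look the same before and after astype(str); only the dtype line changes. The sketch below, on a hand-built Series, checks the element type directly and converts back. The variable names are illustrative.

```python
import pandas as pd

year = pd.Series([2006, 2007, 2008])

as_text = year.astype(str)           # every element becomes a Python str
print(type(as_text.iloc[0]))

back = as_text.astype('int64')       # digit-only strings convert back to integers
print(back.dtype)
```

Converting back with astype('int64') only works when every string is a valid integer literal; otherwise pandas raises a ValueError.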

<b>5. unique</b>

In [23]:
data.year

id
211829    2006
320563    2006
648336    2007
109996    2007
121213    2008
Name: year, dtype: int64

In [24]:
data.year.unique()

array([2006, 2007, 2008], dtype=int64)

<b>6. str</b>

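When normalizing string elements, one drawback of .map() is that NA values must be carefully worked around. To address this, Series directly provides a set of string methods that skip NA values, all accessed as ser.str.xxx(). Many of these methods also support regular expressions.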

In [25]:
data.city.str.split(' ')

id
211829        [New, Jaycob]
320563         [Hyattburgh]
648336    [New, Lilianland]
109996         [Hyattburgh]
121213    [Shanahanchester]
Name: city, dtype: object
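
To illustrate the NA-skipping behaviour, here is a short sketch on a made-up object-dtype Series that contains a missing value. The Series city is an assumption for the example; the data file above has no missing city.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry, kept as object dtype.
city = pd.Series(['New Jaycob', np.nan, 'Hyattburgh'], dtype=object)

upper = city.str.upper()                     # the missing entry stays NaN
starts_new = city.str.contains(r'^New\s')    # pattern is treated as a regex by default
first_word = city.str.split(' ').str[0]      # .str[0] picks the first item of each list

print(upper)
print(starts_new)
print(first_word)
```

None of the three calls needs an explicit check for the missing entry; it is passed through as NaN.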