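## Pandas Study Notes
pandas is a very powerful library, and precisely because it does so much, it is not especially quick to pick up. I am writing this post to record how I learned it, along with some tips I found useful. As usual, the import conventions come first: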

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import numpy as np

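### 1. Creating the basic data structures
The two most commonly used data types in pandas are Series and DataFrame. A Series consists of an index and a matching set of values. It resembles Python's built-in dict in that both pair a set of keys with a set of data, but a Series is more convenient and faster to work with. A DataFrame can also be obtained directly from a file with read_csv or read_excel.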

In [2]:
s1 = Series([2,2,4,5,6],index=['hh','as','ff','da','zx'])
s1

hh    2
as    2
ff    4
da    5
zx    6
dtype: int64

In [3]:
s1.name = 'zhou'
s1.index.name = 'wow'

In [4]:
s1

wow
hh    2
as    2
ff    4
da    5
zx    6
Name: zhou, dtype: int64
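
The dict analogy can be made concrete. The following is a minimal sketch that is not part of the original notebook; the names `prices` and `s_from_dict` and their values are made up for illustration. It builds a Series straight from a dict and shows which dict-style operations carry over.

```python
import pandas as pd

# Hypothetical data: any dict of label -> value works.
prices = {'apple': 3, 'pear': 5, 'plum': 4}

# The dict keys become the index, the dict values become the values.
s_from_dict = pd.Series(prices)

# Dict-style membership tests and lookups work on the index labels.
has_pear = 'pear' in s_from_dict
pear_price = s_from_dict['pear']

# Unlike a dict, arithmetic is vectorized over all values at once.
doubled = s_from_dict * 2
print(has_pear, pear_price)
print(doubled)
```

Membership tests look at the labels rather than the values, just as `in` looks at the keys of a dict.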

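A DataFrame is tabular data. It can be viewed as a collection of Series that share a common index. A DataFrame has both a row index and a column index, and the two directions are treated almost symmetrically, since a DataFrame can be transposed. Let's create one first.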

In [5]:
data = {'state':['anhui','hebei','hainan','jiangsu','riben'],
        'year':[2001,2002,2003,2004,2004],
        'pop':[1.1,1.2,1.1,1.4,1.8]}
frame = DataFrame(data,columns = ['state','pop','year','hh'],index=['a1','a2','a3','a4','a5'])
frame

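      state  pop  year   hh
a1    anhui  1.1  2001  NaN
a2    hebei  1.2  2002  NaN
a3   hainan  1.1  2003  NaN
a4  jiangsu  1.4  2004  NaN
a5    riben  1.8  2004  NaN

As you can see, a DataFrame is simply a table. Because 'hh' is not a key in data, that column is filled with NaN. Besides a dict, a DataFrame can also be built from a two-dimensional ndarray, from several Series, and so on. In every case the index can be specified explicitly, for both rows and columns.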

In [6]:
import numpy as np
DataFrame(np.arange(0,10).reshape(5,2),columns = ['hh','ww'])

   hh  ww
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9


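### 2. Indexing and selecting values
A Series has only one column, so indexing it looks values up by index label. (Note that these examples use label-based indexing rather than positional indexing: the former requires knowing the label, while the latter uses the ordinal position.) Labels can be sliced as well, but unlike array slicing, a label slice is a closed interval that includes both endpoints.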

In [7]:
s1

wow
hh    2
as    2
ff    4
da    5
zx    6
Name: zhou, dtype: int64

In [8]:
s1['hh']

2

In [9]:
s1[['ff','as']]

wow
ff    4
as    2
Name: zhou, dtype: int64

In [10]:
s1['ff':'zx']

wow
ff    4
da    5
zx    6
Name: zhou, dtype: int64
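
To make the closed-interval behaviour explicit, here is a small sketch comparing a label slice with a positional slice on the same data as `s1`. It rebuilds the Series under the hypothetical name `s` so that the block runs on its own.

```python
import pandas as pd

s = pd.Series([2, 2, 4, 5, 6], index=['hh', 'as', 'ff', 'da', 'zx'])

by_label = s.loc['ff':'zx']    # label slice: both 'ff' and 'zx' are included
by_position = s.iloc[2:4]      # positional slice: position 4 is excluded

print(len(by_label), len(by_position))
```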

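Compared with a Series, indexing a DataFrame is somewhat more complicated because of the extra dimension. Indexing a DataFrame directly with a column name or a list of names returns one or more columns, much like simple Series indexing.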

In [11]:
frame['state']

a1      anhui
a2      hebei
a3     hainan
a4    jiangsu
a5      riben
Name: state, dtype: object

In [12]:
frame[['pop','hh']]

    pop   hh
a1  1.1  NaN
a2  1.2  NaN
a3  1.1  NaN
a4  1.4  NaN
a5  1.8  NaN

To keep the syntax close to ndarray, slicing a DataFrame by position selects one or more rows. Boolean indexing likewise selects rows.

In [13]:
frame[:2]

    state  pop  year   hh
a1  anhui  1.1  2001  NaN
a2  hebei  1.2  2002  NaN


In [14]:
frame[frame['pop']>1.1]

      state  pop  year   hh
a2    hebei  1.2  2002  NaN
a4  jiangsu  1.4  2004  NaN
a5    riben  1.8  2004  NaN

Next come two very important ways of indexing: by label and by position. The former uses the loc indexer.

In [15]:
frame.loc['a2':'a3',['pop','hh']]

    pop   hh
a2  1.2  NaN
a3  1.1  NaN


In [16]:
frame.loc[:,'pop']

a1    1.1
a2    1.2
a3    1.1
a4    1.4
a5    1.8
Name: pop, dtype: float64

In [17]:
frame.loc['a2','pop']

1.2

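As you can see, loc covers all of the label-based indexing shown above, which makes it very powerful. When labels are unwanted or inconvenient, the analogous iloc indexer selects by integer position instead.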

In [18]:
frame.iloc[:,:2]

      state  pop
a1    anhui  1.1
a2    hebei  1.2
a3   hainan  1.1
a4  jiangsu  1.4
a5    riben  1.8


In [19]:
frame.iloc[[1,2,3],1:3]

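    pop  year
a2  1.2  2002
a3  1.1  2003
a4  1.4  2004

Summary: indexing a Series is much like indexing a dict. DataFrame offers many indexing styles, and they can get complicated, but most of them only need a passing familiarity. Master loc and iloc and you can index a DataFrame comfortably.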
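
As a recap, the sketch below shows a few more things loc can do. It is an illustration on a cut-down, three-row version of the frame, not code from the original notebook.

```python
import pandas as pd

frame = pd.DataFrame(
    {'state': ['anhui', 'hebei', 'hainan'],
     'pop': [1.1, 1.2, 1.1],
     'year': [2001, 2002, 2003]},
    index=['a1', 'a2', 'a3'],
)

# loc also accepts a boolean mask for rows together with column labels.
recent_pop = frame.loc[frame['year'] > 2001, ['state', 'pop']]

# The same cell reached by label and by position.
same_cell = frame.loc['a2', 'pop'] == frame.iloc[1, 1]

# Assigning through loc modifies the frame in place.
frame.loc['a1', 'pop'] = 9.9
print(recent_pop)
print(same_cell, frame.loc['a1', 'pop'])
```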

### 3. Some operations
Series comes first, of course. A Series has two important attributes, index and values, which really does make it resemble a dict.

In [20]:
s1.index,s1.values

(Index(['hh', 'as', 'ff', 'da', 'zx'], dtype='object', name='wow'),
 array([2, 2, 4, 5, 6], dtype=int64))

Besides these two attributes, here are some commonly used functions.

In [21]:
s1['hh'] = np.nan
pd.isnull(s1)  # check which values are missing

wow
hh     True
as    False
ff    False
da    False
zx    False
Name: zhou, dtype: bool

In [22]:
s2 = s1.copy()
s2 = s2.drop(['as'])  # drop a row by its label
s2

wow
hh    NaN
ff    4.0
da    5.0
zx    6.0
Name: zhou, dtype: float64

In [23]:
s1['hh'] = 5
s1.sort_index()  # sort by index

wow
as    2.0
da    5.0
ff    4.0
hh    5.0
zx    6.0
Name: zhou, dtype: float64

In [24]:
s1.sort_values(ascending=False) # sort by value, in descending order

wow
zx    6.0
da    5.0
hh    5.0
ff    4.0
as    2.0
Name: zhou, dtype: float64

In [25]:
s1.unique() # array of the unique values

array([5., 2., 4., 6.])

In [26]:
s1.value_counts() # number of occurrences of each value

5.0    2
6.0    1
4.0    1
2.0    1
Name: zhou, dtype: int64

A DataFrame has values, columns, and index attributes. Its index and columns are both similar to the index of a Series, one for each of the two dimensions.

In [27]:
frame.index,frame.columns,frame.values

(Index(['a1', 'a2', 'a3', 'a4', 'a5'], dtype='object'),
 Index(['state', 'pop', 'year', 'hh'], dtype='object'),
 array([['anhui', 1.1, 2001, nan],
        ['hebei', 1.2, 2002, nan],
        ['hainan', 1.1, 2003, nan],
        ['jiangsu', 1.4, 2004, nan],
        ['riben', 1.8, 2004, nan]], dtype=object))

In [28]:
frame.index.name = 'ax'
frame.columns.name = 'des'
frame

des    state  pop  year   hh
ax
a1     anhui  1.1  2001  NaN
a2     hebei  1.2  2002  NaN
a3    hainan  1.1  2003  NaN
a4   jiangsu  1.4  2004  NaN
a5     riben  1.8  2004  NaN

First, here are a few functions that give an overview of a DataFrame. With them we can get a rough feel for newly obtained data, including its dtypes and its distribution.

In [29]:
frame.describe()

des         pop     year
count       5.0      5.0
mean       1.32   2002.8
std    0.294958  1.30384
min         1.1   2001.0
25%         1.1   2002.0
50%         1.2   2003.0
75%         1.4   2004.0
max         1.8   2004.0


In [30]:
frame.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a1 to a5
Data columns (total 4 columns):
state    5 non-null object
pop      5 non-null float64
year     5 non-null int64
hh       0 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 360.0+ bytes


In [31]:
frame.head(2)

des  state  pop  year   hh
ax
a1   anhui  1.1  2001  NaN
a2   hebei  1.2  2002  NaN


In [32]:
frame.tail(3)

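des    state  pop  year   hh
ax
a3    hainan  1.1  2003  NaN
a4   jiangsu  1.4  2004  NaN
a5     riben  1.8  2004  NaN

Index objects: the labels of the arrays and sequences used in a DataFrame are all converted to Index objects, which are immutable. reindex returns a new object whose data is conformed to a new index. It leaves the original unchanged, works on either rows or columns, and fills labels that did not exist before with missing values.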

In [33]:
frame.reindex(index = ['a1','aa2','a3','aa4','a5'])

des   state  pop    year   hh
ax
a1    anhui  1.1  2001.0  NaN
aa2     NaN  NaN     NaN  NaN
a3   hainan  1.1  2003.0  NaN
aa4     NaN  NaN     NaN  NaN
a5    riben  1.8  2004.0  NaN


In [34]:
frame.reindex(columns= ['state','hi'],fill_value=0)

des    state  hi
ax
a1     anhui   0
a2     hebei   0
a3    hainan   0
a4   jiangsu   0
a5     riben   0
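
It is easy to forget that reindex does not modify the object it is called on. The following sketch, on a small made-up Series `s`, is meant to show that behaviour together with fill_value.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# reindex returns a new Series; labels missing from s become NaN.
reordered = s.reindex(['c', 'a', 'x'])

# fill_value replaces the NaN that a new label would otherwise get.
filled = s.reindex(['c', 'a', 'x'], fill_value=0)

print(reordered)
print(filled)
print(list(s.index))  # the original index is untouched
```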


In [35]:
frame.drop(['state'],axis = 1) # drop a column

des  pop  year   hh
ax
a1   1.1  2001  NaN
a2   1.2  2002  NaN
a3   1.1  2003  NaN
a4   1.4  2004  NaN
a5   1.8  2004  NaN

Functions (user-defined, statistical, or aggregating, such as mean, sum, min, and max) can be applied along rows or columns.

In [36]:
f = lambda x:x.min()-x.max()
frame.drop(['state','hh'],axis = 1).apply(f)

des
pop    -0.7
year   -3.0
dtype: float64

In [37]:
frame.sum()

des
state    anhuihebeihainanjiangsuriben
pop                               6.6
year                            10014
hh                                  0
dtype: object

In [38]:
frame.sort_values(by = 'pop',axis = 0)  # by also accepts a list of column names

des    state  pop  year   hh
ax
a1     anhui  1.1  2001  NaN
a3    hainan  1.1  2003  NaN
a2     hebei  1.2  2002  NaN
a4   jiangsu  1.4  2004  NaN
a5     riben  1.8  2004  NaN


In [39]:
frame.rank(axis = 0)    # rank the values within each column

des  state  pop  year   hh
ax
a1     1.0  1.5   1.0  NaN
a2     3.0  3.0   2.0  NaN
a3     2.0  1.5   3.0  NaN
a4     4.0  4.0   4.5  NaN
a5     5.0  5.0   4.5  NaN

Another important topic here is missing data, which can be handled with dropna and fillna.
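
Since the notebook gives no example for these two methods, here is a minimal sketch. The frame `df_na` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
df_na = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, np.nan, 6.0]})

rows_complete = df_na.dropna()            # keep only rows with no NaN
rows_not_empty = df_na.dropna(how='all')  # drop rows where every value is NaN
zero_filled = df_na.fillna(0)             # replace each NaN with a constant
mean_filled = df_na.fillna(df_na.mean())  # replace NaN with the column mean
print(mean_filled)
```

Both methods return new objects, so the original frame still contains its NaN values afterwards.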

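### 4. Grouping and aggregation
This part covers some very useful operations, namely group operations and data aggregation. In data processing, group operations are often described by the term split-apply-combine.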

In [40]:
df = DataFrame({'key1':list('aabba'),'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
df

      data1     data2 key1 key2
0 -1.357780  0.499631    a  one
1 -0.215346 -0.883277    a  two
2 -0.173451  0.592070    b  one
3 -0.555463  1.154787    b  two
4 -0.335108 -1.019389    a  one


In [41]:
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

key1
a   -0.636078
b   -0.364457
Name: data1, dtype: float64

In [42]:
means = df['data1'].groupby([df['key1'],df['key2']]).mean()
means

key1  key2
a     one    -0.846444
      two    -0.215346
b     one    -0.173451
      two    -0.555463
Name: data1, dtype: float64

In [43]:
df.groupby(['key1','key2']).size() # similar to value_counts

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

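Here we created groupings. The grouping key can be one of the columns, or a suitable set of external values of the right length. The object returned by groupby holds the grouping information, but nothing is computed until a function is applied to each group. Grouping by a single key yields an index made of that key's unique values. Grouping by two keys yields a hierarchical index made up of the unique key pairs. Grouping runs along axis = 0 by default, but we can also group along the columns.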
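
The split-apply-combine idea can be spelled out by hand. This is a rough sketch on a small frame with fixed, made-up numbers (so the result is reproducible), not a description of how pandas implements groupby internally.

```python
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Split: one sub-frame per distinct key. Apply: take the mean of each.
pieces = {}
for key, group in df.groupby('key1'):
    pieces[key] = group['data1'].mean()

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces)

# groupby(...).mean() performs the same three steps in one call.
builtin = df.groupby('key1')['data1'].mean()
print(manual)
print(builtin)
```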

In [44]:
for (k1,k2), group_data in df.groupby(['key1','key2']):
    print(k1,k2)
    print(group_data)

a one
      data1     data2 key1 key2
0 -1.357780  0.499631    a  one
4 -0.335108 -1.019389    a  one
a two
      data1     data2 key1 key2
1 -0.215346 -0.883277    a  two
b one
      data1    data2 key1 key2
2 -0.173451  0.59207    b  one
b two
      data1     data2 key1 key2
3 -0.555463  1.154787    b  two


In [45]:
dict(list(df.groupby(['key1','key2'])))

{('a', 'one'):       data1     data2 key1 key2
 0 -1.357780  0.499631    a  one
 4 -0.335108 -1.019389    a  one, ('a', 'two'):       data1     data2 key1 key2
 1 -0.215346 -0.883277    a  two, ('b', 'one'):       data1    data2 key1 key2
 2 -0.173451  0.59207    b  one, ('b', 'two'):       data1     data2 key1 key2
 3 -0.555463  1.154787    b  two}

In [46]:
df.groupby(['key1','key2'])['data1'].mean() # index the groupby object directly

key1  key2
a     one    -0.846444
      two    -0.215346
b     one    -0.173451
      two    -0.555463
Name: data1, dtype: float64
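
Several aggregations can be computed in one pass with agg. The example below is a sketch that uses fixed, made-up numbers in place of the random data above.

```python
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# agg accepts a list of function names and returns one column per function.
stats = df.groupby('key1')['data1'].agg(['mean', 'min', 'max', 'count'])
print(stats)
```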

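Groups can also be defined by passing a dict, a function, or a Series. This looks complicated, but every one of these forms ultimately boils down to an array of group keys, with the entries that share a value placed in the same group.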

In [47]:
people = DataFrame(np.random.randn(5,5),columns=list('abcde'),index = ['Joe','Steve','Wes','Jim','Travis'])
people

               a         b         c         d         e
Joe     0.637212 -0.819694  0.327676 -1.707011 -0.526686
Steve   0.169409  1.566446 -0.444387  0.566042  0.180229
Wes     0.448888  1.073056  1.464489  0.118306 -1.214961
Jim     0.527272  1.181250 -0.049758 -1.006947  0.917712
Travis  0.638777 -1.139441  2.173421 -1.088418  0.222622


In [48]:
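mapp = {'a':'red','b':'red','c':'blue','d':'blue','e':'red'} # the mapping effectively supplies an array of group keys
# Note: groupby(..., axis=1) is deprecated in newer pandas; people.T.groupby(mapp).mean().T gives the same result there.
people.groupby(mapp,axis = 1).mean()

            blue       red
Joe    -0.689667 -0.236389
Steve   0.060828  0.638695
Wes     0.791397  0.102328
Jim    -0.528353  0.875411
Travis  0.542501 -0.092681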


In [49]:
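people.groupby(len).sum() # len is applied to each index label (the names), which again yields an array of group keys

          a         b         c         d         e
3  1.613372  1.434612  1.742407 -2.595652 -0.823935
5  0.169409  1.566446 -0.444387  0.566042  0.180229
6  0.638777 -1.139441  2.173421 -1.088418  0.222622

So far we have grouped and aggregated the data. To attach the aggregated values back to the original table, we could use merge. Introduced here instead is the transform method, which returns the results already aligned with the rows of the original data.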

In [50]:
people.groupby(len).transform(np.mean)

               a         b         c         d         e
Joe     0.537791  0.478204  0.580802 -0.865217 -0.274645
Steve   0.169409  1.566446 -0.444387  0.566042  0.180229
Wes     0.537791  0.478204  0.580802 -0.865217 -0.274645
Jim     0.537791  0.478204  0.580802 -0.865217 -0.274645
Travis  0.638777 -1.139441  2.173421 -1.088418  0.222622
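
Because the result of transform has the same shape as its input, it can be assigned straight back as a new column without any merge. The sketch below uses a small frame with made-up numbers to illustrate this.

```python
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a new column.
df['group_mean'] = df.groupby('key1')['data1'].transform('mean')

# A common use: subtract each group's mean from its members.
df['centered'] = df['data1'] - df['group_mean']
print(df)
```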


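Beyond the operations above, another important DataFrame technique is combining data, which comes in two forms: joining along columns and joining along rows. For column-wise joins we use merge. Pay attention to its parameters: how (the join type), left_on and right_on (the join keys), left_index and right_index (join on the index), and suffixes (to tell overlapping column names apart). When joining on a hierarchical index, the join keys must be given as a list.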

In [51]:
people_mean = people.groupby(len).transform(np.mean)

In [52]:
pd.merge(left=people,right=people_mean,left_index=True,right_index=True,suffixes=('','_mean'))

               a         b         c         d         e    a_mean    b_mean    c_mean    d_mean    e_mean
Joe     0.637212 -0.819694  0.327676 -1.707011 -0.526686  0.537791  0.478204  0.580802 -0.865217 -0.274645
Steve   0.169409  1.566446 -0.444387  0.566042  0.180229  0.169409  1.566446 -0.444387  0.566042  0.180229
Wes     0.448888  1.073056  1.464489  0.118306 -1.214961  0.537791  0.478204  0.580802 -0.865217 -0.274645
Jim     0.527272  1.181250 -0.049758 -1.006947  0.917712  0.537791  0.478204  0.580802 -0.865217 -0.274645
Travis  0.638777 -1.139441  2.173421 -1.088418  0.222622  0.638777 -1.139441  2.173421 -1.088418  0.222622
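
The merge above joins on the index. For completeness, here is a sketch of joining on key columns with different how settings. The tables `left` and `right` and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing a key under different column names.
left = pd.DataFrame({'lkey': ['a', 'b', 'c'], 'value': [1, 2, 3]})
right = pd.DataFrame({'rkey': ['b', 'c', 'd'], 'value': [20, 30, 40]})

# how='inner' keeps only the keys present in both tables.
inner = pd.merge(left, right, left_on='lkey', right_on='rkey',
                 how='inner', suffixes=('_left', '_right'))

# how='outer' keeps every key and fills the gaps with NaN.
outer = pd.merge(left, right, left_on='lkey', right_on='rkey',
                 how='outer', suffixes=('_left', '_right'))
print(inner)
print(outer)
```

The suffixes are only applied to column names that occur in both tables, here `value`.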


For row-wise joins we use concat. Again, the thing to watch is how its parameters are set.

In [53]:
pd.concat([people,people_mean],keys = ['origin','mean'])

                      a         b         c         d         e
origin Joe     0.637212 -0.819694  0.327676 -1.707011 -0.526686
       Steve   0.169409  1.566446 -0.444387  0.566042  0.180229
       Wes     0.448888  1.073056  1.464489  0.118306 -1.214961
       Jim     0.527272  1.181250 -0.049758 -1.006947  0.917712
       Travis  0.638777 -1.139441  2.173421 -1.088418  0.222622
mean   Joe     0.537791  0.478204  0.580802 -0.865217 -0.274645
       Steve   0.169409  1.566446 -0.444387  0.566042  0.180229
       Wes     0.537791  0.478204  0.580802 -0.865217 -0.274645
       Jim     0.537791  0.478204  0.580802 -0.865217 -0.274645
       Travis  0.638777 -1.139441  2.173421 -1.088418  0.222622
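
A few of the concat parameters worth knowing are sketched below on two tiny made-up frames, `top` and `bottom`. This is an illustration only, not an exhaustive list of options.

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
bottom = pd.DataFrame({'a': [3, 4]}, index=['x', 'y'])

# Default axis=0 stacks rows; the index labels are kept, duplicates included.
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([top, bottom], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index;
# keys adds an outer column level naming each source frame.
side_by_side = pd.concat([top, bottom], axis=1, keys=['top', 'bottom'])
print(stacked)
print(renumbered)
print(side_by_side)
```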


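Summary: grouping, aggregating, and merging are important steps in data processing, and pandas has a dedicated function for each. The key things to master here are groupby, a handful of aggregation functions, and the methods for combining data.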