# pandas study notes

## pandas data structures
### Series  
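1. A Series is a one-dimensional array-like object made up of a sequence of values (of any NumPy data type) and an associated array of data labels, called its index.  
   In short: array + label index.  
   Both the Series object and its index have a name attribute, which can be assigned at any time.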

In [1]:
import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj)
print("---------")
print(obj.values)  # obj.values and obj.index return the array representation and the index object, respectively
print("---------")
print(obj.index)
print("---------")
obj.name = 'my_data'
obj.index.name = 'my_index'  # assign obj.name and obj.index.name to name the Series and its index; the rename method returns a renamed copy
print(obj)  # names, if set, are shown in the output

d    4
b    7
a   -5
c    3
dtype: int64
---------
[ 4  7 -5  3]
---------
Index(['d', 'b', 'a', 'c'], dtype='object')
---------
my_index
d    4
b    7
a   -5
c    3
Name: my_data, dtype: int64
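
The rename method mentioned in the comment above is not exercised in that cell. A minimal sketch, assuming the same data as In [1]; the new name renamed_data and the variable names are illustrative only:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'], name='my_data')

# Passing a scalar to rename() returns a new Series with a different name;
# the original Series keeps its own name.
renamed = obj.rename('renamed_data')

# rename_axis() returns a new Series whose index carries the given name.
named_index = obj.rename_axis('my_index')

print(renamed.name)
print(obj.name)
print(named_index.index.name)
```

Both calls return new objects, so this is a convenient form inside method chains, while plain assignment to obj.name changes the object in place.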


2. Creating a Series
   * From a Python list of any element type
   * From a NumPy array
   * From a dict

In [2]:
import pandas as pd
import numpy as np
# 1. Create a Series from a list
obj1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # the index can be given as a list or array
print(obj1)
print("---------")

# 2. Create a Series from a NumPy array
a = np.arange(1,10,2)
obj2 = pd.Series(a)  # the default index is 0, 1, 2, 3, 4
print(obj2)
print("---------")

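# 3. Create a Series from a dict
data_dict = {'a':1,'b':2,'c':3,'d':4}  # named data_dict so the built-in dict is not shadowed
obj3 = pd.Series(data_dict)  # the dict keys become the index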
print(obj3)
print("---------")

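# 4. Create a Series from a dict with an explicit index
data_dict = {'a':1,'b':2,'c':3,'d':4}
index = ['b','c','d','e']           # only the labels listed in index appear in the resulting Series
obj4 = pd.Series(data_dict,index=index)  # labels missing from the dict get NaN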
print(obj4)

d    4
b    7
a   -5
c    3
dtype: int64
---------
0    1
1    3
2    5
3    7
4    9
dtype: int64
---------
a    1
b    2
c    3
d    4
dtype: int64
---------
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64


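3. Accessing Series elements
   * Access values by label, e.g. obj['a'], or by position, e.g. obj.iloc[1] (the positional shorthand obj[1] on a label-indexed Series is deprecated in recent pandas versions)
   * Values can also be modified through the index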

In [3]:
import pandas as pd
import numpy as np
obj1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj1['a'])
print("---------")
print(obj1[['a','b','c']])
print("---------")
obj1['a'] = 10  # modify a value through its label
print(obj1['a'])

-5
---------
a   -5
b    7
c    3
dtype: int64
---------
10
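
The cell above only uses labels. As a sketch of the explicit accessors (same data as In [3]; the variable names by_label and by_position are illustrative), loc always means label and iloc always means integer position:

```python
import pandas as pd

obj1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = obj1.loc['b']      # label-based lookup
by_position = obj1.iloc[1]    # position-based lookup (second element)

obj1.iloc[0] = 100            # modify by position
obj1.loc['c'] = -3            # modify by label

print(by_label, by_position)
print(obj1)
```

Using loc and iloc avoids any ambiguity about whether an integer key is a label or a position.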


* Assigning to obj.index replaces the index labels

In [4]:
import pandas as pd
import numpy as np
obj1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj1.index = ['A','B','C','D']  # replace the index by assignment
print(obj1)

A    4
B    7
C   -5
D    3
dtype: int64


4. Series operations
   * Series supports NumPy-style operations
   * Operations between Series automatically align the data by index label

In [5]:
import pandas as pd
import numpy as np
obj1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

a = np.arange(1,10,2)
obj2 = pd.Series(a)

data_dict = {'a':1,'b':2,'c':3,'d':4}
obj3 = pd.Series(data_dict)

print(obj1[obj1>0])  # filter with a boolean array
print("---------")

print(obj2*2)  # Series supports element-wise arithmetic
print("---------")

print(np.exp(obj3))  # NumPy functions work on a Series
print("---------")

print('b' in obj3)  # like a dict, test whether a label is in the index
print('e' in obj3)



d    4
b    7
c    3
dtype: int64
---------
0     2
1     6
2    10
3    14
4    18
dtype: int64
---------
a     2.718282
b     7.389056
c    20.085537
d    54.598150
dtype: float64
---------
True
False


In [6]:
import pandas as pd
import numpy as np

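data_dict = {'a':1,'b':2,'c':3,'d':4}
index = ['b','c','d','e']
obj4 = pd.Series(data_dict,index=index)
print(obj4)
print("---------")

print(pd.isnull(obj4))  # detect missing values (pd.isnull is an alias of pd.isna)
print("---------")

print(pd.isna(obj4))  # detect missing values such as NaN
print(obj4.isna())  # or call the Series method isna(); many other top-level functions have method equivalents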

b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64
---------
b    False
c    False
d    False
e     True
dtype: bool
---------
b    False
c    False
d    False
e     True
dtype: bool
b    False
c    False
d    False
e     True
dtype: bool
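
A common follow-up to detecting missing values is filtering or counting them. A small sketch, reusing the data from In [6]; the variable names mask, present and n_missing are illustrative:

```python
import pandas as pd

obj4 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4}, index=['b', 'c', 'd', 'e'])

mask = obj4.notna()             # True where a value is present
present = obj4[mask]            # keep only the non-missing entries
n_missing = obj4.isna().sum()   # count the missing entries

print(present)
print(n_missing)
```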


In [7]:
import pandas as pd
import numpy as np
obj1 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
obj2 = pd.Series([0, -3, 7, 4], index=['b', 'a', 'c', 'e'])
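print(obj1+obj2)  # adding Series with different indexes: values with matching labels are added, labels present in only one Series give NaN; alignment is automatic regardless of label order, similar to a database outer join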

a    1.0
b    7.0
c    2.0
d    NaN
e    NaN
dtype: float64
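
To make the outer-join analogy concrete, the alignment step can be inspected on its own. This is a sketch using the same two Series as In [7]; the variable names left, right and total are illustrative:

```python
import pandas as pd

obj1 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
obj2 = pd.Series([0, -3, 7, 4], index=['b', 'a', 'c', 'e'])

# align() returns both Series reindexed to the union of their labels
left, right = obj1.align(obj2, join='outer')
print(left)
print(right)

# add() with fill_value treats a label missing from one side as 0
total = obj1.add(obj2, fill_value=0)
print(total)
```

With fill_value=0 the labels d and e keep the value from the one Series that has them instead of becoming NaN.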


### DataFrame
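A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, etc.).  
A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.

1. Creating a DataFrame
   * Pass a dict of equal-length lists or NumPy arrays
   * Pass a nested dict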

In [8]:
# Method 1: pass a dict of equal-length lists or NumPy arrays
import pandas as pd
import numpy as np
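# keys become the column labels; values are lists or arrays (Series also work), which may hold different data types, but lists and arrays must all have the same length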
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame)

# In the result, each column name is a dict key; the row index defaults to integers starting at 0 and can be set with the index parameter

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [9]:
# Method 2: pass a nested dict
import pandas as pd
import numpy as np

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame2 = pd.DataFrame(pop)  # with a nested dict, pandas uses the outer keys as columns and the inner keys as the row index
print(frame2)

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


* For a very large DataFrame, the head method shows only the first five rows by default

In [10]:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame.head())

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


* If a sequence of columns is specified, the DataFrame's columns are arranged in that order:

In [11]:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data,columns=['year', 'state', 'pop'])
print(frame)

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


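* The default index starts at 0, but an index can be specified explicitly
* If a requested column is not found in the data, it appears in the result filled with missing values: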

In [12]:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data,columns=['year', 'state', 'pop','debt'],index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame)
print("---------")
frame.index = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']  # replace the index by assignment
print(frame)


       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
---------
        year   state  pop debt
first   2000    Ohio  1.5  NaN
second  2001    Ohio  1.7  NaN
third   2002    Ohio  3.6  NaN
fourth  2001  Nevada  2.4  NaN
fifth   2002  Nevada  2.9  NaN
sixth   2003  Nevada  3.2  NaN


2. Retrieving data from a DataFrame  
   1. A column can be retrieved as a Series using dict-like notation or attribute access
   2. Rows can be retrieved by position or by label

In [13]:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data,columns=['year', 'state', 'pop','debt'],index=['one', 'two', 'three', 'four', 'five', 'six'])

print(frame['state'])  # select a column by name
print("---------")
print(frame.year)  # select a column by attribute access

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
---------
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64


In [14]:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data,columns=['year', 'state', 'pop','debt'],index=['one', 'two', 'three', 'four', 'five', 'six'])

print(frame.loc['three'])  # select a row by label; 'three' must be a label in the row index
print("---------")
print(frame.iloc[2])  # select a row by integer position; positions start at 0

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
---------
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object


3. Modifying values in a DataFrame

In [15]:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data,columns=['year', 'state', 'pop','debt'],index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2 = frame.copy()

print(frame)
print("---------")
frame.debt = 10  # assign to a column: a scalar is broadcast to every row, while an array sets the values element by element (its length must match)
print(frame)
print("---------")
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])  # a Series is aligned on the index; labels it lacks become NaN
frame2['debt'] = val
print(frame2)

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
---------
       year   state  pop  debt
one    2000    Ohio  1.5    10
two    2001    Ohio  1.7    10
three  2002    Ohio  3.6    10
four   2001  Nevada  2.4    10
five   2002  Nevada  2.9    10
six    2003  Nevada  3.2    10
---------
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
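
The array case mentioned in the comment is not shown above. A minimal sketch on a smaller, made-up frame (the three-row data and the name error_message are illustrative): an array of matching length is assigned by position, and a length mismatch raises ValueError.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'year': [2000, 2001, 2001]},
                     index=['one', 'two', 'three'])

frame['debt'] = np.arange(3.)   # array of matching length: assigned by position
print(frame)

error_message = None
try:
    frame['debt'] = [1.0, 2.0]  # wrong length: two values for three rows
except ValueError as exc:
    error_message = str(exc)
    print('ValueError:', error_message)
```

Unlike a Series, a plain list or array carries no labels, so pandas cannot align it and insists on an exact length match.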


In [16]:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data,columns=['year', 'state', 'pop'],index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2 = frame.copy()

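frame['eastern'] = frame2.state == 'Ohio'  # assigning to a column that does not exist creates it; here a boolean column is added
print(frame)                                # note: a new column cannot be created with the frame.eastern attribute syntax
print("---------")
del frame['eastern']  # the del keyword removes a column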
print(frame)

       year   state  pop  eastern
one    2000    Ohio  1.5     True
two    2001    Ohio  1.7     True
three  2002    Ohio  3.6     True
four   2001  Nevada  2.4    False
five   2002  Nevada  2.9    False
six    2003  Nevada  3.2    False
---------
       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9
six    2003  Nevada  3.2
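
The note about frame.eastern can be checked directly. In this sketch (a made-up two-row frame; the names created_by_attribute and created_by_brackets are illustrative), attribute assignment only sets a plain Python attribute, and pandas emits a warning about it, which is silenced here:

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'year': [2000, 2001]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')           # pandas warns about this pattern
    frame.eastern = frame['state'] == 'Ohio'  # sets an attribute, not a column

created_by_attribute = 'eastern' in frame.columns
print(created_by_attribute)

frame['eastern'] = frame['state'] == 'Ohio'   # bracket syntax creates the column
created_by_brackets = 'eastern' in frame.columns
print(created_by_brackets)
```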


4. Operations on a DataFrame

In [17]:
# Transpose a DataFrame (swap rows and columns):
import pandas as pd
import numpy as np

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame2 = pd.DataFrame(pop)

print(frame2.T)  # transposing does not modify the original; it returns a new DataFrame
print("---------")

# Set the name attributes of the index and the columns
frame2.index.name = 'year'; frame2.columns.name = 'state'
print(frame2)
print("---------")
# The values attribute returns the data as a two-dimensional ndarray
print(frame2.values)

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5
---------
state  Nevada  Ohio
year               
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5
---------
[[2.4 1.7]
 [2.9 3.6]
 [nan 1.5]]


## pandas essential functionality

### Reindexing
* An important method on pandas objects is reindex, which creates a **new object** with the data conformed to a new index

In [18]:
import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj)
print("---------")
obj2 = obj.reindex(['a', 'b', 'c', 'd','e'])  # reindex rearranges the data to the new index; labels not already present get missing values
print(obj2)

d    4
b    7
a   -5
c    3
dtype: int64
---------
a   -5.0
b    7.0
c    3.0
d    4.0
e    NaN
dtype: float64


* For ordered data such as time series, it may be desirable to fill in values when reindexing.
* Passing method='ffill' forward-fills the values

In [19]:
import pandas as pd
import numpy as np

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3)
print("---------")
obj4 = obj3.reindex(range(6), method='ffill')  # method selects the fill strategy; ffill forward-fills, i.e. each new label takes the value of the nearest preceding label
print(obj4)

0      blue
2    purple
4    yellow
dtype: object
---------
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object


* For a DataFrame, reindex can change the row index, the columns, or both

In [20]:
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index=['a', 'c', 'd'],
                    columns=['Ohio', 'Texas', 'California'])
print(frame)
print("---------")
frame2 = frame.reindex(['a', 'b', 'c', 'd'])  # reindex can act on rows or columns; when only a sequence is passed, the rows are reindexed
print(frame2)
print("---------")
states = ['Texas', 'Utah', 'California']
frame3 = frame.reindex(columns=states)  # the columns keyword reindexes the columns
print(frame3)

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
---------
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
---------
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


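* Common reindex keyword arguments
  * index: the new sequence to use as the index
  * method: the fill method (e.g. ffill or bfill)
  * fill_value: the value to use for missing data introduced by reindexing
  * limit: the maximum number of consecutive elements to fill when forward- or backward-filling
  * copy: defaults to True in pandas 1.x and 2.x (always copy the data); with False, the data is not copied when the new index is equivalent to the old one. Either way, reindex returns a result and never modifies the original object in place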
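
The arguments listed above can be sketched as follows, reusing the colour Series from In [19]. The second Series, sparse, and all result names are made up for illustration:

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# fill_value: a constant for every label introduced by reindexing
filled = obj3.reindex(range(6), fill_value='missing')
print(filled)

# method='bfill': each new label takes the value of the next existing label
backfilled = obj3.reindex(range(6), method='bfill')
print(backfilled)

# limit: fill at most one consecutive new label after each existing one
sparse = pd.Series(['blue', 'yellow'], index=[0, 4])
limited = sparse.reindex(range(6), method='ffill', limit=1)
print(limited)
```

In the last call, labels 2 and 3 stay missing because they are more than one step past label 0.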

### Dropping entries from an axis
* The drop method returns a new object with the specified labels removed from the given axis

In [21]:
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
print("---------")
new_obj = obj.drop('c')  # drop returns a new object with the given label removed
print(new_obj)
print("---------")
new_obj2 = obj.drop(['d', 'c'])  # pass a list to drop several labels at once
print(new_obj2)


a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
---------
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
---------
a    0.0
b    1.0
e    4.0
dtype: float64


In [22]:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print("---------")
data2 = data.drop(['Colorado', 'Ohio'])  # pass row labels to drop rows
print(data2)
print("---------")
data3 = data.drop('two', axis=1)  # axis=1 drops from the columns, so the label must be a column name
print(data3)
print("---------")
data4 = data.drop(['two', 'four'], axis='columns')  # the axis keyword chooses rows or columns; axis='columns' is equivalent to axis=1
print(data4)

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
---------
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
---------
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
---------
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


* To modify the object in place, use the inplace keyword

In [23]:
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)
print("---------")
obj.drop('c',inplace=True)  # inplace=True modifies the original object directly
print(obj)

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
---------
a    0.0
b    1.0
d    3.0
dtype: float64


### Indexing
1. Indexing a Series

In [24]:
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj)
print("---------")
print(obj['b'])     # obj[...] accepts a label or a Python slice; use iloc for integer positions
print("---------")
print(obj.iloc[1])
print("---------")
print(obj[2:4])

a    0
b    1
c    2
d    3
dtype: int64
---------
1
---------
1
---------
c    2
d    3
dtype: int64


* Labels can be sliced too, but unlike normal Python slicing, a label slice includes the endpoint

In [25]:
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj['b':'c'])  # a label slice includes the endpoint

b    1
c    2
dtype: int64


2. Indexing a DataFrame
   * Indexing a DataFrame with a label (or a list of labels) selects columns; a single label returns a Series

In [26]:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print("---------")
print(data['two'])  # select a column
print("---------")
print(data[['three', 'one']])
print("---------")
print(isinstance(data['two'],pd.Series))  # a single selected column is a Series

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
---------
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
---------
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
---------
True


* Selecting rows from a DataFrame

In [27]:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print("---------")
print(data[:2])  # slicing selects rows

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
---------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


* Boolean indexing

In [28]:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print("---------")
print(data[data['three'] > 5])  # boolean indexing

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
---------
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
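
Boolean conditions can also be combined. A short sketch on the same data as In [28] (the names both and either are illustrative); the element-wise operators are & for and, | for or, and ~ for not, and each comparison needs its own parentheses:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# rows where both conditions hold
both = data[(data['three'] > 5) & (data['four'] < 15)]
print(both)

# rows where at least one condition holds
either = data[(data['one'] == 0) | (data['two'] > 9)]
print(either)
```

The Python keywords and and or do not work here because they try to reduce a whole Series to a single truth value.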


### Selection with loc and iloc
* loc selects by label
* iloc selects by integer position

In [29]:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print("---------")

print(data.loc['Colorado', ['two', 'three']])
print("---------")
print(data.iloc[[1,2]])  # select multiple rows
print("---------")
print(data.iloc[[1,2], [1,2]])  # select multiple rows and columns (or one row and several columns)
print("---------")
print(f"Single element: {data.iloc[1,2]}")  # a single element by position
print(f"Single element: {data.at['Colorado','three']}")  # at: a single element by label
print(f"Single element: {data.iat[1,2]}")  # iat: a single element by position

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
---------
two      5
three    6
Name: Colorado, dtype: int64
---------
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
---------
          two  three
Colorado    5      6
Utah        9     10
---------
Single element: 6
Single element: 6
Single element: 6


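### Arithmetic and data alignment
* When Series are added, their labels are aligned automatically
* Automatic alignment introduces NA values at labels that do not overlap, and missing values then propagate through further arithmetic.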

In [30]:
import pandas as pd
import numpy as np
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
print(s1+s2)

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64


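* For a DataFrame, alignment happens on both the rows and the columns
* Adding two DataFrames returns a new DataFrame whose index and columns are the unions of those of the two operands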

In [31]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                    index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(df1+df2)  # adding DataFrames: positions that do not overlap become NA


            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


* With the add method, fill_value supplies a substitute for a value that is missing from one of the operands

In [32]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                    index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

df1.add(df2, fill_value=0)  # with fill_value=0, a value missing from one frame is treated as 0; positions missing from both stay NaN

            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0


* Other methods (a method prefixed with r has its arguments reversed, e.g. r1.sub(r2) is r1 - r2, while r1.rsub(r2) is r2 - r1):
  * add,radd
  * sub,rsub
  * div,rdiv
  * floordiv,rfloordiv
  * mul,rmul
  * pow,rpow
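
A brief sketch of the reversed variants, using a small made-up frame (df and the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('abc'))

direct = 1 / df             # operator form
flipped = df.rdiv(1)        # same as 1 / df
regular = df.div(1)         # same as df / 1

print(direct)
print(df.rsub(10))          # same as 10 - df
```

The method forms are mainly useful when an extra argument such as fill_value or axis is needed, which the plain operators cannot take.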

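### Operations between DataFrame and Series
* Arithmetic between a DataFrame and a Series broadcasts: by default the Series index is matched against the DataFrame's columns and the operation is applied to every row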

In [33]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
print(frame - series)  # subtracting a Series from a DataFrame broadcasts down every row

print("--------------------")
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
print(frame + series2)  # if a label is missing from either the DataFrame's columns or the Series index, both are reindexed to the union of the labels

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
--------------------
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


* To match on the row index and broadcast across the columns instead, use one of the arithmetic methods with axis=0

In [34]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
s3 = frame['d']
print(frame.sub(s3, axis=0))  # sub with axis=0 (or axis='index') matches on the row index and broadcasts across the columns

        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1


### Function application
* NumPy ufuncs (element-wise array methods) also work on pandas objects:

In [35]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print("---------")
print(np.abs(frame))  # NumPy ufuncs can be applied to a DataFrame


               b         d         e
Utah    0.951819  0.834572 -0.858503
Ohio   -2.307620 -0.529004 -0.635606
Texas   0.420061  0.264920 -1.010867
Oregon  1.317498 -1.647144 -0.014250
---------
               b         d         e
Utah    0.951819  0.834572  0.858503
Ohio    2.307620  0.529004  0.635606
Texas   0.420061  0.264920  1.010867
Oregon  1.317498  1.647144  0.014250


### Sorting
* Use sort_index to sort by row or column labels
* Use sort_values to sort by value
* When sorting, missing values are placed at the end by default.

In [36]:
import pandas as pd
import numpy as np
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

obj2 = obj.sort_index()  # sort_index sorts by index label
print(obj2)
print("---------")
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                    columns=['d', 'a', 'b', 'c'])
frame2 = frame.sort_index()  # for a DataFrame, sort_index sorts by the row index by default
print(frame2)
print("---------")
frame3 = frame.sort_index(axis=1)  # axis=1 sorts by the column labels
print(frame3)

a    1
b    2
c    3
d    0
dtype: int64
---------
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
---------
       a  b  c  d
three  1  2  3  0
one    5  6  7  4


* Sorting a Series by its values

In [37]:
import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -3, 2])
print(obj.sort_values())  # sort_values sorts by value

2   -3
3    2
0    4
1    7
dtype: int64
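
The statement that missing values go to the end by default can be checked with a Series that contains NaN. A sketch with made-up data (the result names are illustrative); na_position is the sort_values parameter that controls where they go:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

ascending = obj.sort_values()                      # NaN placed last
descending = obj.sort_values(ascending=False)      # NaN still placed last
nan_first = obj.sort_values(na_position='first')   # NaN placed first

print(ascending)
print(descending)
print(nan_first)
```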


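* Sorting a DataFrame by its values
* When sorting a DataFrame, you may want to sort by the values in one or more columns. Pass one or more column names to the by option of sort_values to do so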

In [38]:
import pandas as pd
import numpy as np
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame.sort_values(by=['a', 'b']))  # sort by several columns: the first column takes priority, and ties are broken by the next column

   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
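
Each sort column can also have its own direction. A sketch on the same frame as In [38] (the name mixed is illustrative), passing a list to the ascending parameter of sort_values:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# column a ascending, then column b descending within each value of a
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```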


## Summarizing and computing descriptive statistics

### Computation methods
* Calling the DataFrame's sum method returns a Series containing the column sums:

In [39]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
                index=['a', 'b', 'c', 'd'],
                columns=['one', 'two'])
print(df)
print("---------")
print(df.sum())

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
---------
one    9.25
two   -5.80
dtype: float64


* Passing axis='columns' or axis=1 sums across the columns instead, giving one total per row:

In [40]:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
                index=['a', 'b', 'c', 'd'],
                columns=['one', 'two'])
print(df)
print("---------")
df.sum(axis=1)

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
---------


a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
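
Row c sums to 0.00 above because missing values are skipped by default. A sketch of how the skipna parameter changes this, on the same data as In [40] (the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

default_sum = df.sum(axis=1)                # NaN skipped; an all-NaN row sums to 0.0
strict_sum = df.sum(axis=1, skipna=False)   # any NaN in a row makes that row's result NaN
row_mean = df.mean(axis=1)                  # mean of the non-missing values; NaN for an all-NaN row

print(default_sum)
print(strict_sum)
print(row_mean)
```

With skipna=False, only rows b and d, which have no missing values, produce a number.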