## Getting Started with pandas

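    pandas is built on NumPy arrays and follows NumPy's idioms, especially array-based functions and data processing without for loops; it is designed for data cleaning and analysis. It is typically used together with the numerical computing tools NumPy and SciPy, the analysis libraries statsmodels and scikit-learn, and the visualization library matplotlib.
    The main difference from NumPy is that pandas is designed for heterogeneous and tabular data, whereas NumPy is best suited to homogeneous numerical array data.
    Series and DataFrame are used most often, so they are commonly imported into the local namespace as follows: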

In [4]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

### Data structures: Series, DataFrame & Index (concepts)

- Series
- DataFrame
- Index

#### Series

     A Series consists of an array of data (a NumPy array) and an associated array of data labels (its index).
     
1. Creating a Series and basic usage

In [16]:
# Create directly from a list of values
obj1 = pd.Series([3,2,1,5.3])
print('obj1=\n',obj1)

# Specify the index
obj2 = pd.Series([1,2,4,6],index= ['a','b','c','d'])
print('obj2=\n',obj2)

# The index attribute returns the index (the values attribute returns the data); pass a list of labels to select several values
print('obj2 index: ',obj2.index)
print('obj2[c a d]=\n', obj2[['c','a','d']])

# Applying NumPy functions preserves the index-value links
print('exp(obj1)= \n', np.exp(obj1))

# Assigning to index changes the labels in place
obj1.index =  ['Bob', 'Steve', 'Jeff', 'Ryan']
print('After changes obj1=\n',obj1)

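# A Series can be treated as a fixed-length, ordered dict, so dict-style membership tests work
print('b in obj2?:', 'b' in obj2)
print('e in obj2?:', 'e' in obj2)

# Pass a dict: its keys become the index; pass index= with the keys in the desired order to change the order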
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print('obj3 = \n', obj3)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index =states)
obj4

obj1=
 0    3.0
1    2.0
2    1.0
3    5.3
dtype: float64
obj2=
 a    1
b    2
c    4
d    6
dtype: int64
obj2 index:  Index(['a', 'b', 'c', 'd'], dtype='object')
obj2[c a d]=
 c    4
a    1
d    6
dtype: int64
exp(obj1)= 
 0     20.085537
1      7.389056
2      2.718282
3    200.336810
dtype: float64
After changes obj1=
 Bob      3.0
Steve    2.0
Jeff     1.0
Ryan     5.3
dtype: float64
b in obj2?: True
e in obj2?: False
obj3 = 
 Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

2. Useful functions on Series
    
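    "Missing" or "NA" refers to missing data; the isnull and notnull functions detect missing values. See Chapter 7 for more.
    An important feature: data is aligned automatically by index label in arithmetic operations such as +
    Both a Series and its index have a name attribute, which ties in closely with other pandas functionality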


In [15]:
# isnull or notnull
print('isnull?:\n',pd.isnull(obj4))
print('notnull?\n',pd.notnull(obj4))

isnull?:
 California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
notnull?
 California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool


In [17]:
# Automatic alignment, similar to a database join
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print('obj3 = \n', obj3)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index =states)
print('obj4 = \n', obj4)
obj3 + obj4

obj3 = 
 Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
obj4 = 
 California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [19]:
# The name attribute of a Series and of its index
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

#### DataFrame

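    A tabular data structure holding an ordered collection of columns, each of which can have a different data type; it can be viewed as a dict of Series objects that share the same row index. Creation and basic operations follow.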
    
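 >Note: although a DataFrame stores data in two dimensions, it can still readily represent higher-dimensional data through hierarchical indexing, a key ingredient of many advanced data-handling features in pandas (discussed in Chapter 8).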
 
Way 1: a dict of equal-length lists or NumPy arrays; an index is assigned automatically

In [2]:
import pandas as pd
from pandas import Series, DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print('frame= \n', frame)

# Specify the column order; a column that is not in the data is created and filled with missing values
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                             index=['one', 'two', 'three', 'four',
                                    'five', 'six'])
print('frame2=\n', frame2)

# A DataFrame column can be retrieved as a Series
# Assigning a Series as a column aligns its labels with the DataFrame's index
frame2['debt'] = pd.Series([1.,3.4, -0.5], index = ['one', 'four', 'three'])
print('frame2 with debt value\n', frame2)

# Add a boolean column, then remove it with the del keyword
frame2['easter']  = frame2['state']=='Ohio'
print('frame2 columns: ', frame2.columns)
del frame2['easter']
print('frame2 columns after deletion: ', frame2.columns)

frame= 
     state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
frame2=
        year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
frame2 with debt value
        year   state  pop  debt
one    2000    Ohio  1.5   1.0
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6  -0.5
four   2001  Nevada  2.4   3.4
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   NaN
frame2 columns:  Index(['year', 'state', 'pop', 'debt', 'easter'], dtype='object')
frame2 columns after deletion:  Index(['year', 'state', 'pop', 'debt'], dtype='object')


>note:
>
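>    The returned Series has the same index as the original DataFrame, and its name attribute is set accordingly.
>    
>    In the pandas version used for these notes, a column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of that Series is reflected in the source DataFrame. To copy the column explicitly, use the Series copy method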
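
A minimal sketch of the explicit copy mentioned in the note. Whether plain column selection behaves as a view depends on the pandas version and on whether Copy-on-Write is enabled, so the example only relies on `copy`, which is independent of the source in either case. The small `frame_demo` data here is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration.
frame_demo = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# An explicit copy is independent of the source DataFrame.
pop_copy = frame_demo['pop'].copy()
pop_copy.iloc[0] = 99.0

print(frame_demo['pop'].tolist())   # the original column is untouched
print(pop_copy.tolist())            # only the copy was modified
```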
    
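Way 2: a nested dict; the outer keys become the column labels (columns) and the inner keys become the row labels (index);
               the keys of the inner dicts are unioned to form the index, unless an index is specified explicitly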

In [6]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
print('frame3=\n', frame3)

# Like a NumPy array, it can be transposed
print('frame3 transpose=\n', frame3.T)

# Specify the index explicitly
frame4 = pd.DataFrame(pop, index = [2001, 2002, 2003])
print('explicit index:\n', frame4)

frame3=
       Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
frame3 transpose=
         2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5
explicit index:
       Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


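Tips:

    For the kinds of data the DataFrame constructor accepts, see Table 5-1;
    The name attributes of a DataFrame's index and columns can be set;
    The values attribute returns the data as a two-dimensional ndarray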
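
The last two tips can be sketched as follows. The variable name `frame_demo` is chosen for this illustration; the data is the nested dict from Way 2.

```python
import numpy as np
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame_demo = pd.DataFrame(pop)

# Name the row index and the column index; the names appear in the display.
frame_demo.index.name = 'year'
frame_demo.columns.name = 'state'
print(frame_demo)

# values returns the data as a two-dimensional ndarray.
arr = frame_demo.values
print(type(arr), arr.shape)
```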


#### Index objects

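    pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index;
    
    Index objects are immutable, so users cannot modify them;
    
    Immutability makes it safe to share Index objects among data structures;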
    

In [10]:
import numpy as np
labels =  pd.Index(np.arange(3))
print('labels:\n', labels)

# Share the Index with a Series
obj = pd.Series([1.5, 2.0, -3], index = labels)
print('obj series with index label:\n', obj)
print('obj.index is labels?: ', obj.index is labels)

labels:
 Int64Index([0, 1, 2], dtype='int64')
obj series with index label:
 0    1.5
1    2.0
2   -3.0
dtype: float64
obj.index is labels?:  True
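
Immutability can be checked directly. The sketch below assumes nothing beyond pandas itself: it attempts an item assignment on an Index and catches the resulting `TypeError`.

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

error = None
try:
    labels[1] = 'd'          # item assignment is not supported on an Index
except TypeError as exc:
    error = exc
    print('TypeError:', exc)

print(labels)                # the labels are unchanged
```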



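    In addition to being array-like, an Index also behaves like a fixed-size set;
    
    Unlike Python sets, a pandas Index can contain duplicate labels;
    
    Each Index has a number of methods and properties for set logic, which answer common questions about the data it contains. Table 5-2 lists them.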

In [14]:
frame3.columns.name = 'state'
print('frame3 columns: ', frame3.columns)
print('Ohio in columns?: ', 'Ohio' in frame3.columns)
print('California in columns?: ', 'California' in frame3.columns)

frame3 columns:  Index(['Nevada', 'Ohio'], dtype='object', name='state')
Ohio in columns?:  True
California in columns?:  False
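
A short sketch of the set-like behaviour described above: an Index with duplicate labels, followed by a few of the set-logic methods. The label values are made up for the illustration.

```python
import pandas as pd

# Unlike a Python set, an Index may hold duplicate labels.
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
print(dup_labels.is_unique)

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))          # labels in either Index
print(left.intersection(right))   # labels in both
print(left.difference(right))     # labels in left but not in right
print(left.isin(['a', 'c']))      # boolean array: membership per label
```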


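### Essential functionality
- reindex
- drop
- Indexing, selection, and filtering
- loc & iloc
- Integer indexes
- Arithmetic and data alignment
- Arithmetic methods with fill values
- Operations between DataFrame and Series
- Function application and mapping
- Sorting & ranking
- Axis indexes with duplicate labels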


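#### Reindexing: reindex

  An important method on pandas objects is reindex, which creates a **new object** with the data conformed to a new index.

What it can do:
- Rearrange the data according to the new index, introducing missing values for labels that were not present; this applies to both Series and DataFrame
- Filling: for ordered data such as time series, you may want to fill or interpolate values when reindexing. The method option does this; for example, ffill forward-fills values
- When reindexing a DataFrame, passing just a sequence reindexes the rows by default; use the columns keyword to reindex the columns
- For more, see Table 5-3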

In [15]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print('the original obj=\n', obj)

# Reindex a Series
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print('the reindexed obj=\n', obj2)

the original obj=
 d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
the reindexed obj=
 a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64


In [22]:
# Filling while reindexing
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3)
# ffill: forward-fill from the previous existing label
obj4 = obj3.reindex(np.arange(6), method='ffill')
print('after forward fill:\n', obj4)

0      blue
2    purple
4    yellow
dtype: object
after forward fill:
 0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
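
The cells above reindex a Series only. A minimal sketch of the DataFrame case, with made-up data: a bare sequence reindexes the rows, and the `columns` keyword reindexes the columns.

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(9).reshape((3, 3)),
                          index=['a', 'c', 'd'],
                          columns=['Ohio', 'Texas', 'California'])

# A bare sequence reindexes the rows; the new label 'b' gets NaN.
rows = frame_demo.reindex(['a', 'b', 'c', 'd'])
print(rows)

# The columns keyword reindexes the columns; the new column 'Utah' gets NaN.
cols = frame_demo.reindex(columns=['Texas', 'Utah', 'California'])
print(cols)
```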


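#### Dropping entries from an axis: drop
Dropping one or more entries from an axis is easy if you have an index array or list. Because it involves some data munging and set logic, the drop method returns a **new object** with the indicated values removed from the given axis

For a DataFrame, pass axis='columns' to drop columns
>Note: the inplace parameter controls whether the object is modified in place; it defaults to False. Use it with care, since the dropped data is destroyed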

In [23]:
# for Series
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print('the original obj=\n', obj)
new_obj = obj.drop(['d','c'])
print('drop the d and c\n', new_obj)

the original obj=
 a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
drop the d and c
 a    0.0
b    1.0
e    4.0
dtype: float64


In [25]:
# for DataFrame
frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                           index=['Ohio', 'Colorado', 'Utah', 'New York'],
                           columns=['one', 'two', 'three', 'four'])
print('the original frame=\n', frame)
new_frame = frame.drop(['one', 'three'], axis = 'columns')
print('drop 1 and 3_frame\n', new_frame)

the original frame=
           one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
drop 1 and 3_frame
           two  four
Ohio        1     3
Colorado    5     7
Utah        9    11
New York   13    15
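
The difference between the default behaviour and `inplace=True` can be sketched like this, using a small Series made up for the purpose:

```python
import numpy as np
import pandas as pd

obj_demo = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: a new object is returned and obj_demo is left unchanged.
new_obj = obj_demo.drop('c')
print(list(obj_demo.index))

# inplace=True: obj_demo itself is modified and the call returns None.
result = obj_demo.drop('d', inplace=True)
print(result, list(obj_demo.index))
```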


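#### Indexing, selection & filtering
Similar to NumPy:
- Series:
labels can be used for indexing in addition to integers; unlike ordinary Python slicing, slicing with labels includes the endpoint;
- DataFrame:
passing a single value or a list selects one or several columns, frame['col'] or frame[['col1', 'col2']]; passing a slice such as frame[:2] selects the first two rows

A boolean DataFrame can also be used for indexing, as with a two-dimensional NumPy array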

In [28]:
# for series
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd','e'])
print('obj[b:d]\n',obj['b':'d'])
# Set the values greater than 3 to 0
obj[obj>3] = 0
obj

obj[b:d]
 b    1.0
c    2.0
d    3.0
dtype: float64


a    0.0
b    1.0
c    2.0
d    3.0
e    0.0
dtype: float64

In [35]:
# for DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])
print('data:\n', data)
print('1&3 column:\n', data[['one','three']])
print('first two rows:\n', data[:2])
print('col3 >5: \n', data[data['three']>5])

data:
           one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
1&3 column:
           one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
first two rows:
           one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
col3 >5: 
           one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
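
The cell above filters rows with a boolean Series. Indexing with a boolean DataFrame, mentioned in the text, can be sketched as follows; the same data is rebuilt under the name `data_demo` so the block is self-contained.

```python
import numpy as np
import pandas as pd

data_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                         index=['Ohio', 'Colorado', 'Utah', 'New York'],
                         columns=['one', 'two', 'three', 'four'])

# A scalar comparison yields a boolean DataFrame of the same shape.
mask = data_demo < 5
print(mask)

# Assigning through the mask changes only the cells where it is True.
data_demo[mask] = 0
print(data_demo)
```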


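#### Selection with loc & iloc
- loc: index by *label*, loc[....]
- iloc: index by *integer position*, iloc[....]
> Note: the labels and integers must be within range, i.e. present on the axis being indexed
- Both also work with slices, in addition to a single label or a list of labels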

In [45]:
print('data:\n',data)
# loc with labels
print('loc & labels:\n', data.loc[['Utah', 'Colorado'],['one','three']])
# iloc with integers
print('iloc & integers:\n', data.iloc[3, [3,0,1]])
# Slicing
print('data.loc[:\'Utah\', \'two\']\n',data.loc[:'Utah', 'two'])
print('data.iloc[:, :3][data.three > 5]:\n',data.iloc[:, :3][data.three > 5])

data:
           one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
loc & labels:
           one  three
Utah        8     10
Colorado    4      6
iloc & integers:
 four    15
one     12
two     13
Name: New York, dtype: int64
data.loc[:'Utah', 'two']
 Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]:
           one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14


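pandas offers many ways to select and rearrange data. Table 5-4 summarizes them for DataFrame. As you will see later, there are further options for working with hierarchical indexes.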
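
One selection pitfall worth a sketch is an index whose labels are themselves integers, where label-based and position-based lookups can disagree. The example below is an illustration under that assumption, with labels deliberately stored in reverse order; being explicit with `loc` and `iloc` removes the ambiguity.

```python
import numpy as np
import pandas as pd

# Integer labels stored in reverse order, so label != position.
ser = pd.Series(np.arange(3.), index=[2, 1, 0])

by_label = ser.loc[0]        # the value whose label is 0
by_position = ser.iloc[0]    # the value at position 0

print(by_label, by_position)
```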

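#### Arithmetic and data alignment

Arithmetic can be performed between objects with different indexes; automatic data alignment introduces NA values at the labels that do not overlap, and missing values then propagate through further arithmetic. The result is a **new object**

For Series the index is aligned; for DataFrame both the rows and the columns are aligned, and a new DataFrame is produced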
    

In [2]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                     index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print('df1:\n', df1)
print('df2:\n', df2)
print('df1+df2:\n', df1 + df2)


df1:
             b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
df2:
           b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
df1+df2:
             b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


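#### Arithmetic methods with fill values

The NaN values produced where a label exists in only one of the two objects can be filled through the fill_value argument of the arithmetic methods

> Note:
compare how the arithmetic methods and the + operator treat non-overlapping labels, and the effect of fill_value;
common arithmetic methods: each has an r-prefixed counterpart with the arguments flipped, e.g. 1/df is equivalent to df.rdiv(1)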
    
   ![1.png](attachment:1.png)

In [4]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                    columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                     columns=list('abcde'))
print('df1:\n', df1)
df2.loc[1,'c'] = np.nan
print('df2:\n', df2)
print('df1+df2:\n', df1 + df2)

# Fill the NaN values
print('df1+df2 with fill_value = 1:\n', df1.add(df2, fill_value = 1))

df1:
      a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
df2:
       a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   NaN   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1+df2:
       a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0   NaN  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
df1+df2 with fill_value = 1:
       a     b     c     d     e
0   0.0   2.0   4.0   6.0   5.0
1   9.0  11.0   7.0  15.0  10.0
2  18.0  20.0  22.0  24.0  15.0
3  16.0  17.0  18.0  19.0  20.0
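
The r-prefixed methods mentioned in the note can be checked with a small sketch; the frame `df_demo` is made up for the illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('abc'))

direct = 1 / df_demo         # operator form
flipped = df_demo.rdiv(1)    # method form with the arguments flipped

print(flipped)
print(direct.equals(flipped))
```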


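#### Operations between DataFrame and Series
As with operations between NumPy arrays of different dimensions, arithmetic between a DataFrame and a Series uses broadcasting; by default the index of the Series is matched against the columns of the DataFrame and the operation is broadcast down the rows;

If a label is found in neither the DataFrame's columns nor the Series' index, the two objects are reindexed to form the union;

To match on the row index instead, use an arithmetic method and pass axis='index' (or axis=0)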

In [5]:
# NumPy example
arr = np.arange(20.).reshape(4,5)
print('arr\n', arr)
print('arr - arr[0]\n', arr - arr[0])

arr
 [[ 0.  1.  2.  3.  4.]
 [ 5.  6.  7.  8.  9.]
 [10. 11. 12. 13. 14.]
 [15. 16. 17. 18. 19.]]
arr - arr[0]
 [[ 0.  0.  0.  0.  0.]
 [ 5.  5.  5.  5.  5.]
 [10. 10. 10. 10. 10.]
 [15. 15. 15. 15. 15.]]


In [28]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                      columns=list('bde'),
                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print('frame:\n', frame)
series1 = frame.iloc[0]
print('series1\n',series1)
print('frame - series1\n', frame - series1)

# With labels that do not overlap
series2 =  pd.Series(range(3), index = list('ace'))
print('series2\n', series2)
# frame + series2 = frame.add(series2)
print('frame + series2\n', frame + series2)

# Match the Series index against the row index
series3 = frame['b']
print('series3\n', series3)
print('matched on rows\n', frame.sub(series3, axis = 0))

frame:
           b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series1
 b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
frame - series1
           b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
series2
 a    0
c    1
e    2
dtype: int64
frame + series2
          a   b   c   d     e
Utah   NaN NaN NaN NaN   4.0
Ohio   NaN NaN NaN NaN   7.0
Texas  NaN NaN NaN NaN  10.0
Oregon NaN NaN NaN NaN  13.0
series3
 Utah      0.0
Ohio      3.0
Texas     6.0
Oregon    9.0
Name: b, dtype: float64
matched on rows
           b    d    e
Utah    0.0  1.0  2.0
Ohio    0.0  1.0  2.0
Texas   0.0  1.0  2.0
Oregon  0.0  1.0  2.0


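#### Function application and mapping
1. NumPy ufuncs work on pandas objects
2. The **DataFrame** apply method applies a function to each column (the default); pass axis='columns' to apply it to each row
3. Element-wise Python functions are applied with applymap

> Open question: why can a Series not use apply? Check the source code. (A Series does provide apply, as well as map for element-wise transformations.)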

In [18]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                        index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print('frame\n', frame)
# NumPy ufunc
np.abs(frame)


frame
                b         d         e
Utah   -1.192305  0.195479  0.047908
Ohio    0.544413  1.952725  0.102777
Texas   0.189833  1.067861  0.931190
Oregon -0.786287  1.998192 -0.820045


               b         d         e
Utah    1.192305  0.195479  0.047908
Ohio    0.544413  1.952725  0.102777
Texas   0.189833  1.067861  0.931190
Oregon  0.786287  1.998192  0.820045


In [29]:
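# The DataFrame apply method
# (this cell ran after In [28], so frame here is the np.arange(12.) frame from that cell, not the random one above)
f = lambda x: x.max() - x.min()
print('applied to each column\n', frame.apply(f))
print('applied to each row\n', frame.apply(f, axis = 1))


applied to each column
 b    9.0
d    9.0
e    9.0
dtype: float64
applied to each row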
 Utah      2.0
Ohio      2.0
Texas     2.0
Oregon    2.0
dtype: float64


In [22]:
# Element-wise Python function
format = lambda x: '%.3f'%x
frame.applymap(format)

             b      d       e
Utah    -1.192  0.195   0.048
Ohio     0.544  1.953   0.103
Texas    0.190  1.068   0.931
Oregon  -0.786  1.998  -0.820
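
For a single column, the same element-wise formatting can be sketched with the Series methods `map` and `apply`. The helper name `fmt` and the data are made up for this illustration.

```python
import pandas as pd

frame_demo = pd.DataFrame([[1.23456, -0.5], [2.0, 3.14159]],
                          columns=['b', 'e'])

def fmt(x):
    """Format one number with two decimal places."""
    return '%.2f' % x

# Both calls run fmt on every element of the column.
mapped = frame_demo['e'].map(fmt)
applied = frame_demo['e'].apply(fmt)

print(mapped)
```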


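#### Sorting and ranking
- Sorting:
    sort_index: sorts by index and returns a sorted **new object**
    sort_values: sorts by value; NaN values in a Series are placed at the end, and ascending or descending order can be specified
- Ranking: rank
  Ranks start at 1 and are computed along the index (axis=0) by default. The default method is average: if tied values occupy positions 2, 3 and 4, each receives rank 3. With method max or min they receive 4 or 2 respectively; with first, ties are ranked in the order in which they appear in the data; with dense, ranks increase by 1 from one group to the next and tied values share a rank. See https://blog.csdn.net/maymay_/article/details/80209709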

In [24]:
# sort by index
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                       index=['three', 'one'],
                       columns=['d', 'a', 'b', 'c'])
print('frame\n', frame)
frame.sort_index()

frame
        d  a  b  c
three  0  1  2  3
one    4  5  6  7


       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [26]:
# sort by value
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
print('obj= \n', obj)
obj.sort_values()

obj= 
 0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64


4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [27]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print('frame:\n', frame)
frame.sort_values(by=['a', 'b'])

frame:
    b  a
0  4  0
1  7  1
2 -3  0
3  2  1


   b  a
2 -3  0
0  4  0
3  2  1
1  7  1


In [31]:
# Ranking
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [5, 5, 8, -2.5]})
print('frame.T\n', frame.T)
frame.T.rank(axis = 1)

frame.T
      0    1    2    3
b  4.3  7.0 -3.0  2.0
a  0.0  1.0  0.0  1.0
c  5.0  5.0  8.0 -2.5


     0    1    2    3
b  3.0  4.0  1.0  2.0
a  1.5  3.5  1.5  3.5
c  2.5  2.5  4.0  1.0
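
The tie-breaking methods of rank are easiest to compare side by side. A minimal sketch on a small Series with two pairs of tied values (the data is chosen for the illustration):

```python
import pandas as pd

obj_demo = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({method: obj_demo.rank(method=method)
                      for method in ['average', 'min', 'max', 'first', 'dense']})
print(ranks)
```

The two 7s occupy positions 6 and 7 in sorted order, so they receive 6.5 under average, 6 under min, 7 under max, 6 and 7 under first, and 5 under dense.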


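#### Axis indexes with duplicate labels

The is_unique property of an index tells you whether its labels are unique;

Duplicate labels make code more complicated, because the type returned by indexing depends on whether the label is repeated.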

In [32]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj.index.is_unique

False
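
How the output type changes with repeated labels can be sketched on the same Series:

```python
import pandas as pd

obj_demo = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj_demo['a']   # label occurs twice: a Series comes back
single = obj_demo['c']     # label occurs once: a scalar comes back

print(type(repeated).__name__, repeated.tolist())
print(type(single).__name__, single)
```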

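### Summarizing and computing descriptive statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most are reductions or summary statistics, which extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the corresponding NumPy array methods, they have built-in handling for missing data.
- Reductions, summaries, and other descriptive statistics

    1. Options for reduction methods: axis, the axis to reduce over; skipna, whether to exclude missing values; level, reduce grouped by level when the axis is hierarchically indexed
    
    2. Some methods return indirect statistics, such as idxmax, which returns the index label of the maximum value; some are accumulations, such as cumsum; others are neither and produce several summary statistics in one call. See Table 5-8
- Correlation and covariance
the pandas-datareader package supplies the example data; Series and DataFrame have corr and cov methods, and the DataFrame corrwith method computes correlations between its columns and another Series or DataFrame
- Unique values, value counts, and membership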
![image.png](attachment:image.png)
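
A minimal sketch of the reduction options and method kinds listed above, on a small frame with missing values. The data and the variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                        [np.nan, np.nan], [0.75, -1.3]],
                       index=['a', 'b', 'c', 'd'],
                       columns=['one', 'two'])

col_sums = df_demo.sum()                                 # reduce down the rows; NaN is skipped
row_sums = df_demo.sum(axis='columns')                   # reduce across the columns
row_means = df_demo.mean(axis='columns', skipna=False)   # NaN wherever a value is missing
max_labels = df_demo.idxmax()                            # index label of each column's maximum
running = df_demo.cumsum()                               # accumulation, same shape as the input
summary = df_demo.describe()                             # several summary statistics at once

print(col_sums)
print(row_means)
print(max_labels)
```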

In [34]:
# data collection
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close']
                     for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})
returns = price.pct_change()
returns.tail()
                      



                AAPL       IBM      MSFT      GOOG
Date
2019-11-27  0.013432 -0.009771  0.001908 -0.000426
2019-11-29 -0.002203  0.005083 -0.006171 -0.006116
2019-12-02 -0.011562 -0.011454 -0.012089 -0.011525
2019-12-03 -0.017830 -0.005944 -0.001605  0.004155
2019-12-04  0.008826 -0.000984  0.003617  0.019502


In [37]:
# corr and cov on Series
print('corr between AAPL & IBM: ', returns['AAPL'].corr(returns['IBM']))
print('cov between AAPL & IBM: ', returns['AAPL'].cov(returns['IBM']))

corr between AAPL & IBM:  0.40666237059947635
cov between AAPL & IBM:  8.344416437347757e-05


In [38]:
# On a DataFrame, corr and cov return a DataFrame; corrwith returns a Series
print('corr= \n', returns.corr())
print('cov = \n', returns.cov())
print('correlation with IBM:\n', returns.corrwith(returns['IBM']) )

corr= 
           AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.406662  0.576210  0.523947
IBM   0.406662  1.000000  0.488937  0.414642
MSFT  0.576210  0.488937  1.000000  0.660128
GOOG  0.523947  0.414642  0.660128  1.000000
cov = 
           AAPL       IBM      MSFT      GOOG
AAPL  0.000247  0.000083  0.000134  0.000125
IBM   0.000083  0.000171  0.000094  0.000082
MSFT  0.000134  0.000094  0.000218  0.000148
GOOG  0.000125  0.000082  0.000148  0.000231
correlation with IBM:
 AAPL    0.406662
IBM     1.000000
MSFT    0.488937
GOOG    0.414642
dtype: float64
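
The cells above need a live download. As an offline alternative, the same calls can be sketched on synthetic data; the column names A, B and C and the random "returns" below are invented for the illustration and carry no financial meaning.

```python
import numpy as np
import pandas as pd

# Synthetic daily "returns": A and B share a common component, C is independent.
rng = np.random.default_rng(0)
common = rng.normal(scale=0.01, size=250)
returns_demo = pd.DataFrame({
    'A': common + rng.normal(scale=0.005, size=250),
    'B': common + rng.normal(scale=0.005, size=250),
    'C': rng.normal(scale=0.01, size=250),
})

pair_corr = returns_demo['A'].corr(returns_demo['B'])    # scalar
pair_cov = returns_demo['A'].cov(returns_demo['B'])      # scalar
corr_matrix = returns_demo.corr()                        # DataFrame
cov_matrix = returns_demo.cov()                          # DataFrame
corr_with_a = returns_demo.corrwith(returns_demo['A'])   # Series, one entry per column

print(corr_matrix.round(3))
print(corr_with_a.round(3))
```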


In [41]:
# Unique values, value counts, and membership
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print('obj:\n', obj)
# Unique values
print('unique values:', pd.unique(obj.values))
# Value counts
print('value_counts:\n', pd.value_counts(obj.values, sort = False))
# Membership
print('obj in [\'c\',\'d\']\n', obj.isin(['c','d']))

obj:
 0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
unique values: ['c' 'a' 'd' 'b']
value_counts:
 b    2
d    1
c    3
a    3
dtype: int64
obj in ['c','d']
 0     True
1    False
2     True
3    False
4    False
5    False
6    False
7     True
8     True
dtype: bool
