In [1]:
import numpy as np
import pandas as pd

## Series

A Series is a one-dimensional, array-like object made up of a sequence of values (of any NumPy dtype) and an associated array of data labels, called its index.

In [2]:
obj = pd.Series([1, 2, 3, 4])

In [3]:
obj

0    1
1    2
2    3
3    4
dtype: int64

In [4]:
obj.values

array([1, 2, 3, 4], dtype=int64)

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

In [7]:
obj1

a    1
b    2
c    3
d    4
dtype: int64

In [8]:
obj1.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [9]:
obj1[obj1 > 1]

b    2
c    3
d    4
dtype: int64

In [11]:
np.exp(obj1)

a     2.718282
b     7.389056
c    20.085537
d    54.598150
dtype: float64

In [12]:
obj1 * 2

a    2
b    4
c    6
d    8
dtype: int64

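A Series can also be thought of as a fixed-length, ordered dict, since it maps index values to data values. It can be passed to many functions that expect a dict.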

In [13]:
'b' in obj1

True

In [14]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
pd.Series(sdata)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [15]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
pd.Series(sdata, index=states)

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [24]:
obj2 = pd.Series(sdata, index=states)
obj3 = pd.Series(sdata)

In [21]:
obj2.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Select the entries whose value is missing:

In [23]:
obj2[obj2.isnull()]

California   NaN
dtype: float64

Arithmetic operations automatically align data by index label.

In [27]:
obj4 = obj2 + obj3
obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
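The alignment rule above leaves NaN wherever a label exists on only one side. As a minimal sketch (the variable names `with_missing`, `complete`, `plain`, and `filled` are illustrative and not part of the original notebook), `Series.add` with `fill_value` treats a value that is missing on one side as a substitute value instead:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# 'California' is not a key of sdata, so it becomes NaN; 'Utah' is dropped.
with_missing = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])
complete = pd.Series(sdata)

# The + operator aligns on labels; a label missing on either side gives NaN.
plain = with_missing + complete

# add(..., fill_value=0) substitutes 0 when only ONE side is missing.
# 'California' is missing on both sides, so it stays NaN.
filled = with_missing.add(complete, fill_value=0)
print(filled)
```

Here `Utah` survives with its single available value, while `California` remains missing because neither operand has data for it.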

In [29]:
obj4.index.name = 'State'
obj4.name = 'Population'

In [30]:
obj4

State
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
Name: Population, dtype: float64

## DataFrame

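A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.  
Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or some other collection of one-dimensional arrays.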
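The "dict of Series sharing one index" view can be made concrete with a small sketch. The Series `pop` and `debt` and their values are made up for illustration:

```python
import pandas as pd

# Two Series with overlapping but different indexes (illustrative values).
pop = pd.Series([1.5, 1.7, 3.6], index=[2000, 2001, 2002])
debt = pd.Series([0.2, 0.4], index=[2001, 2002])

# Each dict key becomes a column; the row index is the union of the
# Series indexes, and missing combinations are filled with NaN.
df = pd.DataFrame({'pop': pop, 'debt': debt})
print(df)

# Pulling a column back out gives a Series on the shared index.
col = df['debt']
print(col)
```

Every column retrieved from `df` carries the same row index, which is what makes column-wise arithmetic line up without extra work.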


In [31]:
data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year' : [2000, 2001, 2002, 2001, 2002],
        'pop' : [1.5, 1.7, 3.6, 2.4, 2.9]}
df1 = pd.DataFrame(data)
df1

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


If you pass a sequence of column names, the DataFrame's columns are arranged in that order.

In [32]:
df2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
df2

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [33]:
df3 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
            index=['one', 'two', 'three', 'four', 'five'])
df3

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [38]:
df3.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [40]:
df3.debt = 16.5
df3

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


When assigning a list or array to a column, its length must match the length of the DataFrame.

In [42]:
df3.debt = np.arange(5)
df3

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


If you assign a Series, its labels are aligned exactly to the DataFrame's index, and any holes are filled with missing values.

In [44]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
df3['debt'] = val
df3

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


In [46]:
df3['eastern'] = df3.state == 'Ohio'
df3

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [47]:
del df3['eastern']
df3

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


If a nested dict is passed to DataFrame, the outer dict keys become the columns and the inner keys become the row index.

In [48]:
pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
        'Ohio' : {2000 : 1.5, 2001 : 1.7, 2002 : 3.6}}
df4 = pd.DataFrame(pop)
df4

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [50]:
df4.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


The values attribute returns the data in the DataFrame as a two-dimensional ndarray.

In [53]:
df4.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])
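The dtype of the returned array depends on the columns involved. The sketch below uses `DataFrame.to_numpy()`, which returns the same kind of array as the `values` attribute; the small frame is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001],
                   'state': ['Ohio', 'Nevada'],
                   'pop': [1.5, 2.4]})

# Mixed column types force a common dtype that can hold everything: object.
arr = df.to_numpy()
print(arr.dtype)

# Restricting to numeric columns gives a numeric array; the integer
# column is upcast to float to share a dtype with 'pop'.
numeric = df[['year', 'pop']].to_numpy()
print(numeric.dtype)
```

Selecting only the numeric columns first is usually worth it when the array is going to be passed on to NumPy routines.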

In [67]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [73]:
obj1 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
obj1

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

The ffill method forward-fills values when reindexing.

**method**  

| Argument | Description |
| :------ | :------ |
| ffill or pad | Fill (or carry) values forward |
| bfill or backfill | Fill (or carry) values backward |

In [76]:
obj2 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj2.reindex(range(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [87]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                columns=[ 'Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [91]:
frame1 = frame.reindex(['a', 'b', 'c', 'd'])
frame1

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [89]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


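**reindex function**

| Argument | Description |
| :---- | :---- |
| index | New sequence to use as the index. Can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without any copying |
| method | Interpolation (fill) method |
| fill_value | Substitute value to use when reindexing introduces missing data |
| limit | Maximum number of consecutive elements to fill when forward- or backward-filling |
| level | Match a simple Index on the given level of a MultiIndex; otherwise select a subset of it |
| copy | Defaults to True, which always copies; if False, data is not copied when the new index equals the old one |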
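A short sketch of how some of these arguments combine. The names `both`, `sparse`, `capped`, and `back` are illustrative, and the data is made up:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex rows and columns in one call; new cells get fill_value, not NaN.
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'],
                     fill_value=0)
print(both)

sparse = pd.Series([1.0, 2.0], index=[0, 5])

# method='ffill' carries the last known value forward; limit=2 caps the
# number of consecutive new labels filled from each known value.
capped = sparse.reindex(range(8), method='ffill', limit=2)
print(capped)

# method='bfill' pulls the next known value backward instead.
back = sparse.reindex(range(8), method='bfill')
print(back)
```

Note that `method` requires the original index to be sorted (monotonic), which is why the Series here uses increasing integer labels.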

In [92]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [100]:
data.drop(['two', 'four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [102]:
data.drop(['Colorado'])

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Utah,8,9,10,11
New York,12,13,14,15


In [103]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [104]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [122]:
data = [[1, 2, 3], [4, 5, 6]]
index = ['a','b']
columns = ['c','d','e']
df = pd.DataFrame(data, index=index, columns=columns)
df

Unnamed: 0,c,d,e
a,1,2,3
b,4,5,6


loc -- select rows by row label

In [123]:
df.loc['a']

c    1
d    2
e    3
Name: a, dtype: int64

iloc -- select rows by integer position

In [126]:
df.iloc[0]

c    1
d    2
e    3
Name: a, dtype: int64

ix -- select rows by row label or integer position (a hybrid of loc and iloc)  
**Deprecated** (and removed in pandas 1.0; use loc or iloc instead)

In [127]:
df.ix[0]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


c    1
d    2
e    3
Name: a, dtype: int64
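Since `ix` is no longer available in current pandas, the usual replacement is to pick `loc` or `iloc` explicitly. The sketch below shows one way to handle the mixed case (row by position, column by label) that `ix` used to cover; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['a', 'b'], columns=['c', 'd', 'e'])

by_label = df.loc['a', 'd']      # row label, column label
by_position = df.iloc[0, 1]      # row position, column position

# Mixed lookup: translate the position to a label, then use loc ...
mixed_via_loc = df.loc[df.index[0], 'd']

# ... or translate the label to a position, then use iloc.
mixed_via_iloc = df.iloc[0, df.columns.get_loc('d')]

print(by_label, by_position, mixed_via_loc, mixed_via_iloc)
```

Being explicit about label versus position avoids the ambiguity that `ix` had with integer indexes.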

In [131]:
df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [132]:
df1

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [133]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [134]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [136]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


The analogous methods df1.sub(df2), df1.div(df2), and df1.mul(df2) work the same way.
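As a small sketch of these method forms (the frames `left` and `right` are made up for illustration), each accepts `fill_value` like `add`, and each has a reversed counterpart prefixed with `r` that swaps the operands:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list('ab'))
right = pd.DataFrame(np.arange(6.).reshape((2, 3)), columns=list('abc'))

# Column 'c' exists only in `right`; fill_value=0 treats it as 0 in `left`.
diff = left.sub(right, fill_value=0)
print(diff)

# rdiv swaps the operands: this computes 1 / left, not left / 1.
reciprocal = left.rdiv(1)
print(reciprocal)
```

Division of a nonzero float by zero yields `inf` rather than raising, as the top-left cell of `reciprocal` shows.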

In [140]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
            index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [141]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and broadcasts down the rows.

In [143]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0
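To broadcast across the columns instead, matching on the row index, the arithmetic methods take an `axis` argument. A minimal sketch reusing the same frame (the names `column` and `by_rows` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A column is a Series indexed by the ROW labels.
column = frame['d']

# axis='index' matches the Series against the row index and
# broadcasts the subtraction across the columns.
by_rows = frame.sub(column, axis='index')
print(by_rows)
```

Writing `frame - column` here would instead try to match the row labels against the column labels and produce a frame full of NaN.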


In [2]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
            index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-1.07488,-0.222851,-0.558925
Ohio,1.099382,-0.523096,2.314788
Texas,-0.567669,-0.845356,0.707259
Oregon,0.828206,-0.432289,-1.254031


In [3]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.07488,0.222851,0.558925
Ohio,1.099382,0.523096,2.314788
Texas,0.567669,0.845356,0.707259
Oregon,0.828206,0.432289,1.254031


In [6]:
f = lambda x : '%.2f' % x

In [8]:
frame.applymap(f)

Unnamed: 0,b,d,e
Utah,-1.07,-0.22,-0.56
Ohio,1.10,-0.52,2.31
Texas,-0.57,-0.85,0.71
Oregon,0.83,-0.43,-1.25


In [11]:
frame['e'].map(f)

Utah      -0.56
Ohio       2.31
Texas      0.71
Oregon    -1.25
Name: e, dtype: object
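Newer pandas releases (2.1 and later) offer `DataFrame.map` for element-wise functions and deprecate `applymap`. The sketch below picks whichever is available, so it should run on both older and newer versions; the data and the names `fmt`, `elementwise`, and `col_range` are illustrative:

```python
import pandas as pd

frame = pd.DataFrame([[1.234, -5.678], [9.0, 0.5]], columns=['b', 'd'])

fmt = lambda x: '%.2f' % x

# Element-wise: prefer DataFrame.map when it exists, else fall back.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(fmt)
print(formatted)

# apply, by contrast, passes a whole column (a Series) to the function.
col_range = frame.apply(lambda col: col.max() - col.min())
print(col_range)
```

The distinction to keep in mind is that `apply` works on whole rows or columns, while `map`/`applymap` and `Series.map` work on individual values.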

In [5]:
obj = pd.Series(range(4), index=['a', 'd', 'c', 'b'])
obj

a    0
d    1
c    2
b    3
dtype: int64

### Sorting by row index

In [6]:
obj.sort_index()

a    0
b    3
c    2
d    1
dtype: int64

In [8]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), 
    index=['three', 'one'], columns=['d','a','b','c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


Sort by column labels in descending order:

In [13]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


### Sorting by value

In [19]:
obj = pd.Series([4, 7, -3, 2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [21]:
obj.sort_values(ascending=False)

1    7
0    4
3    2
2   -3
dtype: int64

In [23]:
obj.rank(method="first")

0    3.0
1    4.0
2    1.0
3    2.0
dtype: float64

In [24]:
obj.rank(method="max", ascending=False)

0    2.0
1    1.0
2    4.0
3    3.0
dtype: float64

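| method | Description |
| ------ | ---- |
| average | Default: assign the average rank to each value in a group of ties |
| min | Use the minimum rank for the whole group |
| max | Use the maximum rank for the whole group |
| first | Assign ranks in the order the values appear in the data |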
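The Series ranked earlier has no ties, so the methods cannot be told apart there. A sketch with tied values (the data is made up) shows how the four tie-breaking methods in the table differ:

```python
import pandas as pd

# 7 and 4 each appear twice, so they form two groups of ties.
s = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method, side by side.
ranks = pd.DataFrame({m: s.rank(method=m)
                      for m in ['average', 'min', 'max', 'first']})
print(ranks)
```

The two 7s occupy ranks 6 and 7: `average` gives both 6.5, `min` gives both 6, `max` gives both 7, and `first` gives 6 to the earlier one and 7 to the later one.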


In [4]:
df = pd.DataFrame([[1.4, np.nan], [np.nan, np.nan], [4.5, 6], [np.nan, 3.3]], columns=['1', '2'], index=list("abcd"))
df

Unnamed: 0,1,2
a,1.4,
b,,
c,4.5,6.0
d,,3.3


Compute statistics across each row:

In [9]:
df.sum(axis=1)

a     1.4
b     0.0
c    10.5
d     3.3
dtype: float64

In [12]:
df.mean(axis=1, skipna=False)

a     NaN
b     NaN
c    5.25
d     NaN
dtype: float64
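In the outputs above, row `b` sums to 0.0 even though it holds no data, because NaN values are skipped by default. A hedged sketch of one way to get NaN for such rows, using the `min_count` argument of `sum` (the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [np.nan, np.nan], [4.5, 6], [np.nan, 3.3]],
                  columns=['1', '2'], index=list("abcd"))

# Default: NaN is skipped, so an all-NaN row sums to 0.0.
row_sum = df.sum(axis=1)

# min_count=1 requires at least one real value, otherwise the result is NaN.
row_sum_strict = df.sum(axis=1, min_count=1)

# Column means also skip NaN unless skipna=False is passed.
col_mean = df.mean()

print(row_sum, row_sum_strict, col_mean, sep='\n\n')
```

This keeps "no data" distinguishable from "data that adds up to zero".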

In [17]:
df.describe()

Unnamed: 0,1,2
count,2.0,2.0
mean,2.95,4.65
std,2.192031,1.909188
min,1.4,3.3
25%,2.175,3.975
50%,2.95,4.65
75%,3.725,5.325
max,4.5,6.0


In [107]:
import pandas_datareader as web

Stock prices and volumes from Yahoo! Finance:

In [26]:
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')

In [28]:
price = pd.DataFrame({tic: data['Adj Close']
        for tic, data in all_data.items()})

In [30]:
volume = pd.DataFrame({tic: data['Volume']
        for tic, data in all_data.items()})

Compute the percent change in price:

In [31]:
returns = price.pct_change()

In [33]:
returns.head()

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1999-12-31,,,,
2000-01-03,0.088754,0.075319,-0.001606,
2000-01-04,-0.084311,-0.033944,-0.033781,
2000-01-05,0.014634,0.035137,0.010545,
2000-01-06,-0.086538,-0.017242,-0.033498,


Correlation:

In [34]:
returns.MSFT.corr(returns.IBM)

0.49253706494724375

Covariance:

In [35]:
returns.MSFT.cov(returns.IBM)

0.00021557771540279465

In [39]:
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.412392,0.422852,0.470676
IBM,0.412392,1.0,0.492537,0.390688
MSFT,0.422852,0.492537,1.0,0.438313
GOOG,0.470676,0.390688,0.438313,1.0


In [40]:
returns.cov()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.00103,0.000254,0.000309,0.000303
IBM,0.000254,0.000369,0.000216,0.000142
MSFT,0.000309,0.000216,0.000519,0.000204
GOOG,0.000303,0.000142,0.000204,0.00058


In [42]:
returns.corrwith(returns.IBM)

AAPL    0.412392
IBM     1.000000
MSFT    0.492537
GOOG    0.390688
dtype: float64
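The cells above depend on a network download that may not be available. As an offline sketch of the same workflow, the following builds synthetic prices instead; the tickers `AAA`, `BBB`, `CCC`, the seed, and the random-walk model are all assumptions made for illustration, not real market data:

```python
import numpy as np
import pandas as pd

# Synthetic prices: a geometric random walk per (hypothetical) ticker.
rng = np.random.default_rng(0)
dates = pd.date_range('2000-01-03', periods=250, freq='B')
steps = rng.normal(0, 0.01, size=(250, 3))
price = pd.DataFrame(100 * np.exp(np.cumsum(steps, axis=0)),
                     index=dates, columns=['AAA', 'BBB', 'CCC'])

# Percent change from the previous row; the first row has no predecessor.
returns = price.pct_change()

pair_corr = returns['AAA'].corr(returns['BBB'])   # single pair
corr_matrix = returns.corr()                      # all pairs
with_aaa = returns.corrwith(returns['AAA'])       # each column vs. one Series

print(returns.head())
print(corr_matrix)
```

The pairwise value from `Series.corr` is the same number that appears in the corresponding cell of the full correlation matrix.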

In [62]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [57]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [59]:
uniques.sort()
uniques

array(['a', 'b', 'c', 'd'], dtype=object)

In [60]:
pd.value_counts(obj.values, sort=False)

a    3
c    3
b    2
d    1
dtype: int64
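The top-level `pd.value_counts` function is deprecated in recent pandas releases; the `Series.value_counts` method covers the same ground. A minimal sketch (the result names are illustrative):

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Counts per distinct value, most frequent first by default.
counts = obj.value_counts()

# Sort by label instead of by frequency.
by_label = obj.value_counts().sort_index()

# Relative frequencies instead of raw counts.
share = obj.value_counts(normalize=True)

print(counts, by_label, share, sep='\n\n')
```

Sorting by label afterwards gives a stable, predictable order when several values share the same count.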

In [64]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [65]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [66]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [67]:
result = data.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


In [71]:
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.],[1., NA,NA],
                  [NA, NA, NA], [NA, 6.5, 3.1]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.1


In [72]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [75]:
df = pd.DataFrame(np.random.randn(6, 3))
df

Unnamed: 0,0,1,2
0,-2.44864,-0.916469,-0.90781
1,0.551093,1.875987,-1.171547
2,0.071833,0.711929,0.209857
3,-0.169246,-0.263123,0.166853
4,-0.941175,-1.002605,-0.000868
5,-1.540807,-0.036314,-0.370128


In [76]:
df.loc[2:, 1] = NA
df.loc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-2.44864,-0.916469,-0.90781
1,0.551093,1.875987,-1.171547
2,0.071833,,0.209857
3,-0.169246,,0.166853
4,-0.941175,,
5,-1.540807,,


In [79]:
df1 = df.fillna(method="ffill")
df1

Unnamed: 0,0,1,2
0,-2.44864,-0.916469,-0.90781
1,0.551093,1.875987,-1.171547
2,0.071833,1.875987,0.209857
3,-0.169246,1.875987,0.166853
4,-0.941175,1.875987,0.166853
5,-1.540807,1.875987,0.166853


In [80]:
df2 = df.fillna(method="ffill", limit=2)
df2

Unnamed: 0,0,1,2
0,-2.44864,-0.916469,-0.90781
1,0.551093,1.875987,-1.171547
2,0.071833,1.875987,0.209857
3,-0.169246,1.875987,0.166853
4,-0.941175,,0.166853
5,-1.540807,,0.166853
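Recent pandas releases deprecate the `method` argument of `fillna` in favor of the dedicated `ffill` and `bfill` methods. The following sketch shows the equivalent calls on a small made-up frame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan],
                   'y': [np.nan, 2.0, np.nan, 4.0]})

# Forward fill: carry the last valid value down each column.
filled = df.ffill()

# limit=2 fills at most two consecutive missing values per gap.
limited = df.ffill(limit=2)

# fillna with a dict uses a different constant for each column.
const = df.fillna({'x': 0.0, 'y': -1.0})

print(filled, limited, const, sep='\n\n')
```

A leading NaN, such as the first value of `y`, is left alone by forward filling because there is no earlier value to carry.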


Hierarchical indexing

In [83]:
data = pd.Series(np.random.randn(10), 
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], 
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1   -0.000106
   2    0.969241
   3    0.914162
b  1    0.515243
   2   -0.108548
   3   -1.183225
c  1    0.883057
   2    0.896628
d  2    0.543947
   3   -2.408275
dtype: float64

In [84]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [88]:
data[:, 2]

a    0.969241
b   -0.108548
c    0.896628
d    0.543947
dtype: float64
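A brief sketch of the other common ways to select from a hierarchically indexed Series, using small deterministic data in place of the random values above (the result names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(6.),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

# Partial indexing on the outer level returns the inner-level Series.
outer = data.loc['b']

# Selecting on the inner level across all outer labels.
inner = data.loc[:, 2]

# unstack pivots the inner level into columns, giving a DataFrame.
wide = data.unstack()

print(outer, inner, wide, sep='\n\n')
```

`unstack` is often the quickest way to see a two-level Series as an ordinary table.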

In [89]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], 
                     columns=[['Ohio', 'Ohio', 'Colorado'], 
                              ['Green', 'Red', 'Green']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [91]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [92]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


In [90]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                           ['Green','Red', 'Green']],
                          names=['state', 'color'])

MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=['state', 'color'])

In [93]:
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [96]:
frame.sort_index(level=1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [97]:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [98]:
frame.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
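The `level` argument of `sum` used in the two cells above was removed in pandas 2.0. A sketch of the `groupby` equivalent, which should reproduce the same totals (the names `by_key2` and `by_color` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same color: transpose so the column level
# becomes a row level, group, then transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

The transpose round trip is one way to aggregate over a column level without relying on the `axis=1` form of `groupby`.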


In [99]:
frame = pd.DataFrame({'a': range(7),'b': range(7, 0, -1), 
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [100]:
frame1 = frame.set_index(['c', 'd'])
frame1

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [54]:
ser = pd.Series(np.arange(3.))

In [52]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [48]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [43]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [50]:
ser2.iloc[:1]

a    0.0
dtype: float64

In [102]:
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])
frame

Unnamed: 0,0,1
2,0,1
0,2,3
1,4,5


In [105]:
frame.iloc[0]

0    0
1    1
Name: 2, dtype: int32
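The cells above hinge on the difference between label-based and position-based access when the index itself holds integers. A compact sketch on a Series with shuffled integer labels (the names are illustrative):

```python
import numpy as np
import pandas as pd

# Integer labels in a different order than the positions.
ser = pd.Series(np.arange(3.), index=[2, 0, 1])

label_two = ser.loc[2]        # the value labeled 2 (first element)
position_two = ser.iloc[2]    # the value at position 2 (last element)

# loc slices by label and INCLUDES the end label.
label_slice = ser.loc[:0]

# iloc slices by position and EXCLUDES the end position.
position_slice = ser.iloc[:1]

print(label_two, position_two)
print(label_slice)
print(position_slice)
```

Using `loc` and `iloc` explicitly removes any doubt about which of the two meanings an integer has.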

In [112]:
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))
                      for stk in ['AAPL', 'GOOG', 'MSFT']))

Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.

  exec(code_obj, self.user_global_ns, self.user_ns)


In [113]:
pdata

<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 862 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2008-12-31 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: High to Adj Close

In [114]:
pdata = pdata.swapaxes('items', 'minor')


In [116]:
pdata['Adj Close'].head()

Unnamed: 0_level_0,AAPL,GOOG,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2008-12-31,8.165102,152.830978,15.171647
2009-01-02,8.6817,159.621811,15.866238
2009-01-05,9.0481,162.965073,16.014515
2009-01-06,8.898861,165.950653,16.201828
2009-01-07,8.706573,159.964584,15.226283


In [117]:
pdata.loc[:, '6/1/2012', :]

Unnamed: 0,High,Low,Open,Close,Volume,Adj Close
AAPL,81.807144,80.074287,81.308571,80.141426,130246900.0,53.667721
GOOG,284.474762,282.338654,284.047546,283.645172,6155600.0,283.645172
MSFT,28.959999,28.440001,28.76,28.450001,56634300.0,24.127708


In [118]:
pdata.loc['Adj Close', '5/22/2012' :, :]

Unnamed: 0_level_0,AAPL,GOOG,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-05-22,53.283173,298.458801,25.238676
2012-05-23,54.583275,302.760834,24.687433
2012-05-24,54.081974,299.879578,24.653509
2012-05-25,53.792103,293.85376,24.645023
2012-05-29,54.746853,295.249695,25.069065
2012-05-30,55.406948,292.214417,24.882488
2012-05-31,55.269192,288.553253,24.755281
2012-06-01,53.667721,283.645172,24.127708


In [119]:
stacked = pdata.loc[ :, '5/30/2012' :, :].to_frame()
stacked


Unnamed: 0_level_0,Unnamed: 1_level_0,High,Low,Open,Close,Volume,Adj Close
Date,minor,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2012-05-30,AAPL,82.855713,80.937141,81.314285,82.738571,132357400.0,55.406948
2012-05-30,GOOG,294.037567,289.879608,292.179657,292.214417,3838100.0,292.214417
2012-05-30,MSFT,29.48,29.120001,29.35,29.34,41585500.0,24.882488
2012-05-31,AAPL,83.071426,81.637146,82.96286,82.53286,122918600.0,55.269192
2012-05-31,GOOG,293.093719,287.629242,292.457855,288.553253,5975200.0,288.553253
2012-05-31,MSFT,29.42,28.940001,29.299999,29.190001,39134000.0,24.755281
2012-06-01,AAPL,81.807144,80.074287,81.308571,80.141426,130246900.0,53.667721
2012-06-01,GOOG,284.474762,282.338654,284.047546,283.645172,6155600.0,283.645172
2012-06-01,MSFT,28.959999,28.440001,28.76,28.450001,56634300.0,24.127708


In [120]:
stacked.to_panel()


<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: High to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
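As the deprecation warning says, the recommended replacement for Panel is a DataFrame with a MultiIndex. The sketch below builds one from a dict of per-ticker frames with `pd.concat`; the tickers `AAA` and `BBB` and all numbers are made up for illustration, standing in for the downloaded data:

```python
import pandas as pd

dates = pd.date_range('2012-05-30', periods=3, freq='D')

# One small frame per (hypothetical) ticker, sharing the same dates.
frames = {
    'AAA': pd.DataFrame({'Close': [10.0, 11.0, 12.0],
                         'Volume': [100, 110, 120]}, index=dates),
    'BBB': pd.DataFrame({'Close': [20.0, 19.0, 18.0],
                         'Volume': [200, 190, 180]}, index=dates),
}

# The dict keys become the outer index level.
stacked = pd.concat(frames, names=['ticker', 'Date'])

# Like pdata['Adj Close']: one field, dates by ticker.
close = stacked['Close'].unstack('ticker')

# Like pdata.loc[:, date, :]: all tickers and fields for one date.
one_day = stacked.xs(dates[0], level='Date')

print(stacked)
print(close)
print(one_day)
```

With this layout the three Panel axes map onto the two index levels plus the columns, and the usual `unstack` and `xs` tools replace `swapaxes` and three-way `loc`.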