## Introduction to pandas Data Structures

In [22]:
from pandas import Series, DataFrame
import pandas as pd

### Series

In [23]:
obj = Series([1,34,552,234,4])
obj

0      1
1     34
2    552
3    234
4      4
dtype: int64

Use the `values` and `index` attributes to get the underlying array and the index.

In [24]:
obj.values

array([  1,  34, 552, 234,   4])

In [25]:
obj.index

RangeIndex(start=0, stop=5, step=1)

You can also define the index yourself.

In [26]:
obj2 = Series([3,51,5,2,6],index=['a','b','c','d','e'])
obj2

a     3
b    51
c     5
d     2
e     6
dtype: int64

In [27]:
obj2.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Unlike a plain array, a Series lets you select elements by index label.

In [28]:
obj2['c']

5

In [29]:
obj2[['a','d','c']]

a    3
d    2
c    5
dtype: int64
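
Label-based and position-based selection can be made explicit with `.loc` and `.iloc`. The following is a minimal sketch that rebuilds `obj2` from the cell above; the variable names `by_label`, `by_position`, and `subset` are only illustrative.

```python
import pandas as pd

obj2 = pd.Series([3, 51, 5, 2, 6], index=['a', 'b', 'c', 'd', 'e'])

# .loc selects by index label, .iloc selects by integer position
by_label = obj2.loc['c']       # value stored under label 'c'
by_position = obj2.iloc[2]     # third element, counting from 0

# A list of labels returns a new Series in the requested order
subset = obj2.loc[['a', 'd', 'c']]

print(by_label, by_position)
print(subset)
```

Spelling out `.loc` or `.iloc` avoids ambiguity when the index itself contains integers.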

Operations preserve the link between each index label and its value.

In [30]:
print(obj2)
print(obj2*2)

a     3
b    51
c     5
d     2
e     6
dtype: int64
a      6
b    102
c     10
d      4
e     12
dtype: int64
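
The same holds for boolean filtering and NumPy functions. This sketch assumes the `obj2` defined earlier; `large` and `roots` are illustrative names.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([3, 51, 5, 2, 6], index=['a', 'b', 'c', 'd', 'e'])

# A boolean mask keeps only the matching rows, together with their labels
large = obj2[obj2 > 4]

# A NumPy ufunc applied to a Series returns a Series with the same index
roots = np.sqrt(obj2)

print(large)
print(roots)
```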


A Series can also be treated like a dict that maps index labels to values.

In [31]:
'c' in obj2

True

In [41]:
sdata = {'Ohio':35000, 'Texas':7000, 'Orange':10902}
obj3 = Series(sdata)
obj3

Ohio      35000
Orange    10902
Texas      7000
dtype: int64
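
The dict analogy can be checked directly. This is a small sketch using the same `sdata`; the names `has_key`, `has_value`, and `back` are illustrative. Depending on the pandas version, the keys may appear in insertion order rather than sorted as in the output above.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 7000, 'Orange': 10902}
obj3 = pd.Series(sdata)

# Like a dict, `in` tests the index labels, not the values
has_key = 'Texas' in obj3
has_value = 7000 in obj3          # 7000 is a value, not a label

# to_dict() converts the Series back into a plain dict
back = obj3.to_dict()

print(has_key, has_value)
print(back)
```

To test for a value instead of a label, check against `obj3.values`.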

In [42]:
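states = ['Texas', 'califor', 'Ohio']  # use a list: a set is unordered, and recent pandas versions reject it as an index
obj4 = Series(sdata, index=states)
obj4  # labels not found in sdata get NaN, which marks a missing value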

Texas       7000.0
califor        NaN
Ohio       35000.0
dtype: float64

In [43]:
pd.isnull(obj4)

Texas      False
califor     True
Ohio       False
dtype: bool

In [44]:
pd.notnull(obj4)

Texas       True
califor    False
Ohio        True
dtype: bool

In [45]:
obj4.isnull()  # call the method; without parentheses you only get the bound method object

Texas      False
califor     True
Ohio       False
dtype: bool
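
The boolean masks produced by `isnull` and `notnull` are typically used to count, filter, or fill missing values. The sketch below rebuilds `obj4` with a list index; `n_missing`, `present`, `dropped`, and `filled` are illustrative names.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 7000, 'Orange': 10902}
obj4 = pd.Series(sdata, index=['Texas', 'califor', 'Ohio'])

# True counts as 1, so summing the mask counts the missing entries
n_missing = obj4.isnull().sum()

# Keep only the entries that are present
present = obj4[obj4.notnull()]
dropped = obj4.dropna()           # same result as the mask above

# Replace missing entries with a default value
filled = obj4.fillna(0)

print(n_missing)
print(present)
print(filled)
```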

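One of the most important features of Series is that arithmetic operations automatically align data by index label. A label that appears in only one operand yields NaN in the result.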

In [46]:
obj3

Ohio      35000
Orange    10902
Texas      7000
dtype: int64

In [47]:
obj4

Texas       7000.0
califor        NaN
Ohio       35000.0
dtype: float64

In [48]:
obj3+obj4

Ohio       70000.0
Orange         NaN
Texas      14000.0
califor        NaN
dtype: float64
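
If NaN is not the desired result for labels present on only one side, the `add` method accepts a `fill_value`. This is a sketch under the same data as above; `total` and `filled_total` are illustrative names.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 7000, 'Orange': 10902}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['Texas', 'califor', 'Ohio'])

# The + operator aligns on labels; unmatched labels become NaN
total = obj3 + obj4

# add() with fill_value substitutes 0 where only one side is missing.
# 'califor' is missing on both sides, so it stays NaN.
filled_total = obj3.add(obj4, fill_value=0)

print(total)
print(filled_total)
```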

Both a Series and its index have a `name` attribute. The index can also be replaced in place by assignment.

In [50]:
obj4.name = 'alvin'
obj4.index.name = 'state'
obj4

state
Texas       7000.0
califor        NaN
Ohio       35000.0
Name: alvin, dtype: float64

In [52]:
obj4.index = ['a','b','f']
obj4

a     7000.0
b        NaN
f    35000.0
Name: alvin, dtype: float64
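
Assigning to `index` changes the Series in place and requires exactly one label per element. As a hedged alternative, `rename` returns a new Series and leaves the original untouched. The names `renamed` and `error`, and the label `'California'`, are illustrative.

```python
import pandas as pd

obj4 = pd.Series([7000.0, float('nan'), 35000.0],
                 index=['Texas', 'califor', 'Ohio'], name='alvin')

# rename() with a dict maps old labels to new ones and returns a new Series
renamed = obj4.rename({'califor': 'California'})

# Assigning an index of the wrong length raises ValueError
error = None
try:
    obj4.index = ['a', 'b']
except ValueError as exc:
    error = str(exc)

print(renamed)
print(error)
```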

### DataFrame

A DataFrame is a tabular data structure with labeled rows and columns. Each column is a Series.

In [54]:
# Build a DataFrame from a dict of equal-length lists
data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
       'year':[2000,2001,2002,2003,2018],
       'pop':[1.5,1.7,3.6,2.4,2.9]}
frame = DataFrame(data)

In [55]:
frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2003
4  2.9  Nevada  2018


In [81]:
frame2 = DataFrame(data,columns=['year','pop','state','null'],
         index=['one','two','three','four','five'])
frame2

       year  pop   state null
one    2000  1.5    Ohio  NaN
two    2001  1.7    Ohio  NaN
three  2002  3.6    Ohio  NaN
four   2003  2.4  Nevada  NaN
five   2018  2.9  Nevada  NaN


In [82]:
frame2.columns

Index(['year', 'pop', 'state', 'null'], dtype='object')
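
The `columns` argument fixes the column order, and a requested column that is absent from the data is filled with NaN. A minimal sketch with the same `data` follows; `all_missing` is an illustrative name.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2003, 2018],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame2 = pd.DataFrame(data,
                      columns=['year', 'pop', 'state', 'null'],
                      index=['one', 'two', 'three', 'four', 'five'])

# 'null' is not a key of data, so every entry in that column is missing
all_missing = frame2['null'].isnull().all()

print(frame2.shape)      # (rows, columns)
print(frame2.dtypes)     # one dtype per column
print(all_missing)
```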

Get a Series from a DataFrame: a column by name, or a row by index label.

In [83]:
frame['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [84]:
frame2.loc['three']  # .ix was deprecated and later removed; use .loc for labels and .iloc for positions

year     2002
pop       3.6
state    Ohio
null      NaN
Name: three, dtype: object
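
The two indexers can be compared side by side. This sketch uses a reduced version of `frame2` without the empty `null` column; `row_by_label`, `row_by_position`, `cell`, and `block` are illustrative names.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2003, 2018],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'pop', 'state'],
                      index=['one', 'two', 'three', 'four', 'five'])

row_by_label = frame2.loc['three']     # row selected by index label
row_by_position = frame2.iloc[2]       # same row selected by position

# .loc also accepts a row label and a column label together
cell = frame2.loc['three', 'pop']

# Lists of labels select a sub-table
block = frame2.loc[['one', 'three'], ['year', 'state']]

print(row_by_label)
print(cell)
print(block)
```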

In [85]:
frame2

       year  pop   state null
one    2000  1.5    Ohio  NaN
two    2001  1.7    Ohio  NaN
three  2002  3.6    Ohio  NaN
four   2003  2.4  Nevada  NaN
five   2018  2.9  Nevada  NaN


In [86]:
frame2['debt'] = 18
frame2

       year  pop   state null  debt
one    2000  1.5    Ohio  NaN    18
two    2001  1.7    Ohio  NaN    18
three  2002  3.6    Ohio  NaN    18
four   2003  2.4  Nevada  NaN    18
five   2018  2.9  Nevada  NaN    18


In [87]:
import numpy as np
frame2['debt'] = np.arange(5.)
frame2

       year  pop   state null  debt
one    2000  1.5    Ohio  NaN   0.0
two    2001  1.7    Ohio  NaN   1.0
three  2002  3.6    Ohio  NaN   2.0
four   2003  2.4  Nevada  NaN   3.0
five   2018  2.9  Nevada  NaN   4.0


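A list or array assigned to a column must have the same length as the DataFrame. When a Series is assigned instead, its values are aligned to the DataFrame's index, and rows without a matching label get NaN.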

In [88]:
val = Series([-1.2,-1.5,-1.7],index=['two','four','five'])
val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [89]:
frame2['null'] = val
frame2

       year  pop   state  null  debt
one    2000  1.5    Ohio   NaN   0.0
two    2001  1.7    Ohio  -1.2   1.0
three  2002  3.6    Ohio   NaN   2.0
four   2003  2.4  Nevada  -1.5   3.0
five   2018  2.9  Nevada  -1.7   4.0
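
The two assignment rules can be seen in a smaller example. This is a sketch with a three-row frame; the label `'four'`, the column `'bad'`, and the flag `length_error` are illustrative.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002]},
                      index=['one', 'two', 'three'])

# Series assignment aligns on the index: 'two' matches, 'four' is ignored,
# and the unmatched rows 'one' and 'three' receive NaN
frame2['debt'] = pd.Series([-1.2, -1.7], index=['two', 'four'])

# A plain list is not aligned, so its length must equal the number of rows
length_error = False
try:
    frame2['bad'] = [1, 2]
except ValueError:
    length_error = True

print(frame2)
print(length_error)
```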


In [90]:
# Add a boolean column that is True where state is 'Ohio'
frame2['eastern'] = frame2.state == 'Ohio'
frame2

       year  pop   state  null  debt  eastern
one    2000  1.5    Ohio   NaN   0.0     True
two    2001  1.7    Ohio  -1.2   1.0     True
three  2002  3.6    Ohio   NaN   2.0     True
four   2003  2.4  Nevada  -1.5   3.0    False
five   2018  2.9  Nevada  -1.7   4.0    False


In [92]:
# del removes the column in place
del frame2['eastern']
frame2.columns


Index(['year', 'pop', 'state', 'null', 'debt'], dtype='object')

In [94]:
# Transpose: rows become columns. .T returns a new DataFrame; frame2 itself is not modified
frame2.T

        one   two three    four    five
year   2000  2001  2002    2003    2018
pop     1.5   1.7   3.6     2.4     2.9
state  Ohio  Ohio  Ohio  Nevada  Nevada
null    NaN  -1.2   NaN    -1.5    -1.7
debt      0     1     2       3       4
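
Transposing and dropping can both be done without touching the original frame. The sketch below uses a two-row frame; `drop` is shown as a non-mutating alternative to `del`, and the names `transposed` and `trimmed` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]},
                     index=['one', 'two'])

# .T swaps the row and column labels and returns a new DataFrame
transposed = frame.T

# drop() returns a new DataFrame without the column; frame keeps it
trimmed = frame.drop(columns=['pop'])

print(transposed)
print(trimmed.columns.tolist())
print(frame.columns.tolist())
```

Use `del frame['pop']` when the column should be removed from the original, and `drop` when the original should stay as it is.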
