## Introduction to pandas Data Structures

In [3]:
import pandas as pd
from pandas import Series, DataFrame

### Series

In [2]:
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [4]:
obj2 = pd.Series([4, 7, -5, 3], 
                 index=['d', 'b', 'a', 'c'])


In [7]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [8]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [9]:
obj2['a']

-5

In [11]:
obj2['d']=6

In [12]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [13]:
obj2[['c','a','d']]

c    3
a   -5
d    6
dtype: int64

In [14]:
obj2[obj2>0]

d    6
b    7
c    3
dtype: int64

In [15]:
obj2*2

d    12
b    14
a   -10
c     6
dtype: int64

In [16]:
import numpy as np

In [17]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

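Another way to think about a Series is as a fixed-length, ordered dict, since it maps index values to data values. It can be used in many contexts where you might use a dict.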

In [5]:
'b' in obj2

True

In [6]:
'e' in obj2

False
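The dict analogy can be pushed a little further. The following is a minimal sketch, assuming the same labels and values as obj2 above (the variable names are only illustrative), of a few other dict-style operations that a Series supports.

```python
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, not at the values
has_label = 'b' in s      # 'b' is a label
has_value = 7 in s        # 7 is a value, not a label

# dict-style lookup with a default for a label that is absent
fallback = s.get('e', 0)

# Iterate over (label, value) pairs, or convert back to a plain dict
pairs = list(s.items())
as_dict = s.to_dict()
```

Note that `in` checks labels only, so a value such as 7 is not "in" the Series in this sense.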

If you have data contained in a Python dict, you can create a Series from it by passing the dict.

In [7]:
sdata = {'ohio': 35000, 'texas': 7000}

In [9]:
obj3 = pd.Series(sdata)

In [10]:
obj3

ohio     35000
texas     7000
dtype: int64

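When you pass only a dict, the index of the resulting Series is made up of the dict's keys. You can control which labels appear, and in what order, by passing an index containing the dict keys in the order you want.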

In [11]:
states = ['cali', 'texas','ohio']

In [12]:
obj4 = pd.Series(sdata, index=states)

In [13]:
obj4

cali         NaN
texas     7000.0
ohio     35000.0
dtype: float64

Here, the two values found in sdata were placed in the appropriate locations, but since no value for 'cali' was found, it appears as NaN (Not a Number), which pandas uses to mark missing or NA values. Any key in sdata that is not listed in states would be excluded from the resulting object.
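A small sketch of both effects of the index argument: a requested label with no data becomes NaN, and a dict key that is not requested is left out. The extra 'utah' entry is made up purely for illustration.

```python
import pandas as pd

# 'utah' is a hypothetical extra key, not part of the data above
sdata = {'ohio': 35000, 'texas': 7000, 'utah': 5000}
states = ['cali', 'texas', 'ohio']

obj = pd.Series(sdata, index=states)

# 'cali' has no entry in sdata, so its value is NaN and the dtype becomes float64
cali_is_missing = pd.isnull(obj['cali'])

# 'utah' was not requested in states, so it does not appear in the result
utah_present = 'utah' in obj
```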

I will use the terms 'missing' and 'NA' interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data.

In [14]:
pd.isnull(obj4)

cali      True
texas    False
ohio     False
dtype: bool

In [15]:
pd.notnull(obj4)

cali     False
texas     True
ohio      True
dtype: bool

Series also has these as instance methods:

In [17]:
obj4.isnull()

cali      True
texas    False
ohio     False
dtype: bool
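Because isnull and notnull return boolean Series, they combine naturally with counting and filtering. A minimal sketch, assuming a Series shaped like obj4 above; dropna and fillna are standard Series methods that are not otherwise used in these notes.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 7000.0, 35000.0], index=['cali', 'texas', 'ohio'])

# True counts as 1, so summing the mask counts the missing entries
n_missing = int(s.isnull().sum())

# Keep only the entries that are present, using a boolean mask
present = s[s.notnull()]

# dropna gives the same result; fillna replaces the gaps instead
dropped = s.dropna()
filled = s.fillna(0)
```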

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations.

In [18]:
obj3

ohio     35000
texas     7000
dtype: int64

In [19]:
obj4

cali         NaN
texas     7000.0
ohio     35000.0
dtype: float64

In [20]:
obj3+obj4

cali         NaN
ohio     70000.0
texas    14000.0
dtype: float64

If you have experience with databases, you can think of this as being similar to a join operation.

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality.
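To make the alignment behaviour concrete, here is a hedged sketch with made-up values. It compares the plain + operator with Series.add and its fill_value parameter, which treats a label missing from one side as a default instead of producing NaN.

```python
import pandas as pd

a = pd.Series({'ohio': 35000, 'texas': 7000})
b = pd.Series({'cali': 100.0, 'texas': 7000.0, 'ohio': 35000.0})  # illustrative values

# Labels are matched by name, not by position; 'cali' exists only in b,
# so the plain sum has no partner value for it and yields NaN
plain = a + b

# With fill_value=0, the label missing from a is treated as 0
filled = a.add(b, fill_value=0)
```

The result carries the union of both indexes, much like an outer join.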

In [21]:
obj4.name = 'population'

In [22]:
obj4.index.name = 'state'

In [23]:
obj4

state
cali         NaN
texas     7000.0
ohio     35000.0
Name: population, dtype: float64
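One place where these names show up is when a Series is turned into a DataFrame. A minimal sketch, assuming a Series named like obj4 above: the Series name becomes the column label, and the index name becomes the label of the index (or of a regular column after reset_index).

```python
import pandas as pd

s = pd.Series([7000.0, 35000.0], index=['texas', 'ohio'], name='population')
s.index.name = 'state'

# The Series name is used as the column name
frame = s.to_frame()

# reset_index moves the index into a regular column named after index.name
flat = s.reset_index()
```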

A Series index can be altered in place by assignment.

In [24]:
obj = pd.Series([3,7,-5,1])

In [25]:
obj

0    3
1    7
2   -5
3    1
dtype: int64

In [26]:
obj.index = ['bob', 'steve', 'jef', 'rya']

In [27]:
obj

bob      3
steve    7
jef     -5
rya      1
dtype: int64
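Assigning to index replaces every label at once, so the new list has to have the same length as the Series. The sketch below shows the error raised on a mismatch and uses rename, with a hypothetical partial mapping, as one way to change only some labels.

```python
import pandas as pd

s = pd.Series([3, 7, -5, 1])

# Three labels for four values: pandas raises ValueError and leaves s unchanged
try:
    s.index = ['bob', 'steve', 'jef']
    error = None
except ValueError as exc:
    error = str(exc)

# rename returns a new Series; labels not in the mapping are kept as they are
renamed = s.rename({0: 'bob', 1: 'steve'})
```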

### DataFrame

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.

In [28]:
data = {'state':['a', 'a', 'a', 'b','b'],
        'year':[2000,2002,2002,2003,2004],
        'pop':[1.5,1.6,1.7,1.8,1.9]
       }

frame = pd.DataFrame(data)

In [29]:
frame

  state  year  pop
0     a  2000  1.5
1     a  2002  1.6
2     a  2002  1.7
3     b  2003  1.8
4     b  2004  1.9
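The same construction works with NumPy arrays in place of lists, and the "equal-length" requirement is enforced. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Each dict key becomes a column; each array supplies that column's values
data = {'year': np.array([2000, 2002, 2003]),
        'pop': np.array([1.5, 1.6, 1.8])}
frame = pd.DataFrame(data)

# Columns of different lengths cannot be combined this way
try:
    pd.DataFrame({'year': [2000, 2002], 'pop': [1.5, 1.6, 1.8]})
    error = None
except ValueError as exc:
    error = str(exc)
```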


For large DataFrames, the head method selects only the first five rows:

In [30]:
frame.head()

  state  year  pop
0     a  2000  1.5
1     a  2002  1.6
2     a  2002  1.7
3     b  2003  1.8
4     b  2004  1.9
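Since the frame above has exactly five rows, head returns all of it. A sketch on a slightly longer, made-up frame shows the default more clearly, along with the optional row count and the companion tail method.

```python
import pandas as pd

frame = pd.DataFrame({'x': range(10)})

first_five = frame.head()     # default: first 5 rows
first_three = frame.head(3)   # an explicit row count
last_two = frame.tail(2)      # tail works from the end
```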


If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

In [31]:
pd.DataFrame(data, columns=['year',
                           'state',
                           'pop'])

   year state  pop
0  2000     a  1.5
1  2002     a  1.6
2  2002     a  1.7
3  2003     b  1.8
4  2004     b  1.9


If you pass a column that isn't contained in the dict, it will appear with missing values in the result.

In [33]:
frame2 = pd.DataFrame(data, 
                      columns=['year',
                                'state',
                                'pop',
                                'debt'],
                     index=['one',
                           'two',
                           'three',
                           'four',
                           'five'])

In [34]:
frame2

       year state  pop debt
one    2000     a  1.5  NaN
two    2002     a  1.6  NaN
three  2002     a  1.7  NaN
four   2003     b  1.8  NaN
five   2004     b  1.9  NaN


In [35]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute.

In [36]:
frame2['state']

one      a
two      a
three    a
four     b
five     b
Name: state, dtype: object

In [37]:
frame2.year

one      2000
two      2002
three    2002
four     2003
five     2004
Name: year, dtype: int64
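Attribute access only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The 'pop' column used in these notes is such a collision, because DataFrame already has a pop method. A minimal sketch:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2002], 'pop': [1.5, 1.6]})

# 'year' does not clash with anything, so attribute access returns the column
year_attr = frame.year

# frame.pop is the DataFrame.pop method, not the 'pop' column
pop_attr = frame.pop

# Bracket notation always refers to the column
pop_col = frame['pop']
```

For this reason, bracket notation is the safer default in reusable code.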

Rows can also be retrieved by label with the special loc attribute (or by integer position with iloc).

In [38]:
frame2.loc['three']

year     2002
state       a
pop       1.7
debt      NaN
Name: three, dtype: object
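The difference between label-based and position-based row selection can be sketched as follows, on a cut-down, made-up version of the frame above: loc takes index labels, while iloc takes integer positions.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2002, 2002],
                      'pop': [1.5, 1.6, 1.7]},
                     index=['one', 'two', 'three'])

by_label = frame.loc['three']    # row whose index label is 'three'
by_position = frame.iloc[2]      # third row, counting from 0

# loc also accepts a list of row labels together with a column label
subset = frame.loc[['one', 'three'], 'pop']
```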

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values.

In [39]:
frame2['debt'] = 10

In [40]:
frame2

       year state  pop  debt
one    2000     a  1.5    10
two    2002     a  1.6    10
three  2002     a  1.7    10
four   2003     b  1.8    10
five   2004     b  1.9    10


In [42]:
import numpy as np
frame2['debt'] = np.arange(5.)

In [43]:
frame2

       year state  pop  debt
one    2000     a  1.5   0.0
two    2002     a  1.6   1.0
three  2002     a  1.7   2.0
four   2003     b  1.8   3.0
five   2004     b  1.9   4.0


When you assign a list or array to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes.

In [44]:
val = pd.Series([-1.2, -1.3, -1.7], index=['two', 'four', 'five'])

In [45]:
val

two    -1.2
four   -1.3
five   -1.7
dtype: float64

In [46]:
frame2['debt'] = val

In [47]:
frame2

       year state  pop  debt
one    2000     a  1.5   NaN
two    2002     a  1.6  -1.2
three  2002     a  1.7   NaN
four   2003     b  1.8  -1.3
five   2004     b  1.9  -1.7
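The two assignment rules can be contrasted in a short sketch with made-up data: a Series is aligned on labels (labels that are not in the frame's index are dropped, and rows without a match become NaN), whereas a plain list must match the frame's length exactly.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.6, 1.7]}, index=['one', 'two', 'three'])

# 'two' matches a row; 'six' is not in the frame's index and is ignored
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'six'])

# A list of the wrong length is rejected and no column is created
try:
    frame['bad'] = [1, 2]
    error = None
except ValueError as exc:
    error = str(exc)
```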


Assigning a column that doesn't exist will create a new column. The del keyword will delete columns, as with a dict.

As an example of del, I first add a new column of boolean values indicating whether the state column equals 'a':

In [48]:
frame2['eastern'] = frame2.state == 'a'

In [49]:
frame2

       year state  pop  debt  eastern
one    2000     a  1.5   NaN     True
two    2002     a  1.6  -1.2     True
three  2002     a  1.7   NaN     True
four   2003     b  1.8  -1.3    False
five   2004     b  1.9  -1.7    False


The del keyword can then be used to remove this column:

In [50]:
del frame2['eastern']

In [51]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
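del modifies the DataFrame in place. When the original should be kept, the drop method with its columns parameter returns a new DataFrame instead. A minimal sketch on a small, made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['a', 'b'], 'pop': [1.5, 1.8]})
frame['eastern'] = frame['state'] == 'a'

# drop returns a new frame without the column; frame itself is untouched
trimmed = frame.drop(columns='eastern')
still_there = 'eastern' in frame.columns

# del removes the column from frame itself
del frame['eastern']
```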

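Another common form of data is a nested dict of dicts. If it is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row index:

In [52]: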
pop = {'n':{2001:2.4,2002:2.9},
      'o':{2000:1.5, 2001:3.9}}

In [53]:
frame3 = pd.DataFrame(pop)

In [54]:
frame3

        n    o
2000  NaN  1.5
2001  2.4  3.9
2002  2.9  NaN
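With a nested dict, the row labels are normally collected from the inner keys. As a hedged sketch, passing an explicit index overrides that: only the requested labels are kept, and labels with no data are filled with NaN. The variable name nested and the year 2003 are illustrative.

```python
import pandas as pd

nested = {'n': {2001: 2.4, 2002: 2.9},
          'o': {2000: 1.5, 2001: 3.9}}

# Outer keys become columns; the explicit index selects the row labels
frame = pd.DataFrame(nested, index=[2001, 2002, 2003])

# 2000 was not requested, so it is absent; 2003 has no data, so it is all NaN
has_2000 = 2000 in frame.index
```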


You can transpose the DataFrame (swap rows and columns) with syntax similar to that of a NumPy array:

In [55]:
frame3.T

   2000  2001  2002
n   NaN   2.4   2.9
o   1.5   3.9   NaN
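A quick way to check what the transpose does is to compare shapes and labels before and after. This sketch rebuilds a frame with the same values as frame3 and assumes all columns share one dtype, in which case transposing twice gives back the original.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'n': [np.nan, 2.4, 2.9],
                      'o': [1.5, 3.9, np.nan]},
                     index=[2000, 2001, 2002])

# Rows and columns swap roles: a 3x2 frame becomes 2x3
transposed = frame.T

# For a frame with a single dtype, transposing twice restores the original
round_trip_ok = transposed.T.equals(frame)
```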
