In [6]:
import pandas as pd

In [7]:
from pandas import Series, DataFrame

## 5.1 Introduction to pandas Data Structures

* Series

* DataFrame

#### Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its **index**.

In [8]:
obj = pd.Series([4, 7, -5, 3])

In [9]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

You can get the array representation and the index object of the Series via its values and index attributes, respectively.

In [10]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [11]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label:

In [12]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [13]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [14]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Unlike with NumPy arrays, you can use labels in the index to select single values or a set of values:

In [15]:
obj2['a']

-5

In [16]:
obj2['b'] = 6

In [17]:
obj2

d    4
b    6
a   -5
c    3
dtype: int64

In [18]:
obj2[['c', 'd', 'a']] # the labels can be listed in any order you want

c    3
d    4
a   -5
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [19]:
obj2[obj2 > 0]

d    4
b    6
c    3
dtype: int64

In [20]:
obj2 * 2

d     8
b    12
a   -10
c     6
dtype: int64

In [21]:
import numpy as np
np.exp(obj2)

d     54.598150
b    403.428793
a      0.006738
c     20.085537
dtype: float64
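As a small sketch of what "preserving the index-value link" means, the example below chains a boolean filter and a NumPy function. The Series `s` mirrors `obj2` above; the names `positive` and `roots` are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 6, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]          # filtering keeps the labels of the surviving values
roots = np.sqrt(positive)    # a NumPy ufunc returns a Series, not a bare ndarray
```

Each result still carries the labels of the values it was computed from, so later operations can align on them.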

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.

In [22]:
'b' in obj2

True

In [23]:
'e' in obj2

False
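A few other dict-style idioms also carry over to a Series. The sketch below uses a Series `s` with the same contents as `obj2`; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([4, 6, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, not at the values
has_b = 'b' in s
has_6 = 6 in s            # 6 is a value, not a label

# get returns a default instead of raising KeyError for a missing label
missing = s.get('e', default=0)

# Convert back to a plain dict keyed by the index labels
as_dict = s.to_dict()
```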

A Series can be created from a Python dict: the dict's keys become the index and the dict's values become the data.

In [24]:
sdata = {'Ohio' : 35000, 'Texas' : 71000, 'Oregon' : 16000, 'Utah' : 5000}

In [25]:
sdata?

In [26]:
obj3 = pd.Series(sdata)

In [27]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

You can also pass an index explicitly, and the Series will pick the corresponding values from the dict in that order. If a label has no matching key in the dict, its value is set to NaN (not a number), and dict keys that are not in the index are excluded.

In [28]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [29]:
obj4 = pd.Series(sdata, index=states)

In [30]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
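One side effect visible in the output above is the change of dtype: NaN is a floating-point value, so an integer Series that gains a missing entry is stored as float64. A minimal sketch, reusing the `sdata` and `states` values from above:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

from_dict = pd.Series(sdata)                 # every key present: int64
selected = pd.Series(sdata, index=states)    # 'California' is missing: float64

# 'Utah' is left out because it does not appear in `states`
utah_dropped = 'Utah' not in selected
```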

The isnull and notnull functions in pandas should be used to detect missing data:

In [31]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:

In [33]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [34]:
obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
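The boolean results are usually consumed right away, for example to count or filter missing entries. The sketch below assumes a Series with the same contents as `obj4` and uses `isna`/`notna`, which are aliases of `isnull`/`notnull`.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = int(s.isna().sum())     # True counts as 1 when summed
present = s[s.notna()]              # the boolean mask keeps non-missing entries
same = s.dropna().equals(present)   # dropna produces the same result
```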

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [35]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [36]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [37]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
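If NaN for unmatched labels is not what you want, the `add` method accepts a `fill_value` that stands in for the missing side. A minimal sketch with two small Series (the names `a` and `b` are illustrative):

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
b = pd.Series({'Ohio': 35000.0, 'Oregon': 16000.0, 'Texas': 71000.0})

plain = a + b                      # 'Utah' has no partner in b, so it becomes NaN
filled = a.add(b, fill_value=0)    # treat the missing partner as 0 instead
```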

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [38]:
obj4.name = 'population'

In [39]:
obj4.index.name = 'state'

In [40]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
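One place where these names show up is when a Series is converted to a table. In the sketch below (a shortened stand-in for `obj4`), `reset_index` uses the index name and the Series name as column labels.

```python
import pandas as pd

s = pd.Series([35000.0, 16000.0], index=['Ohio', 'Oregon'])
s.name = 'population'
s.index.name = 'state'

# The index becomes a regular column; both names are reused as column labels
table = s.reset_index()
cols = list(table.columns)
```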

A Series’s index can be altered in-place by assignment:

In [41]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [42]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [43]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
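Two details are worth a quick sketch: the new index must have one label per value, and `rename` offers a non-mutating alternative. The Series `s` below mirrors `obj`; the other names are illustrative.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# Assigning too few labels raises ValueError and leaves the index untouched
try:
    s.index = ['Bob', 'Steve', 'Jeff']
    error = None
except ValueError as exc:
    error = exc

# rename returns a relabeled copy instead of modifying `s`
renamed = s.rename({0: 'Bob', 1: 'Steve'})
```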

#### DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.

The most common way to construct a DataFrame is from a dict of equal-length lists or NumPy arrays:

In [44]:
data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year' : [2000, 2001, 2002, 2001, 2002, 2003],
        'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [45]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [46]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


In [47]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

In [48]:
frame2

       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9
six    2003  Nevada  3.2


In [49]:
frame2.columns

Index(['year', 'state', 'pop'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [50]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [51]:
type(frame2['year'])

pandas.core.series.Series

In [52]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

In [53]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [54]:
frame2['debt'] = 16.5

In [55]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:

In [56]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [57]:
frame2['debt'] = val

In [58]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
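The difference between assigning a list and assigning a Series can be sketched on a smaller, hypothetical frame: the list is matched by position and must have the right length, while the Series is matched by label.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list must supply exactly one value per row
try:
    frame['debt'] = [1.0, 2.0]
    list_error = None
except ValueError as exc:
    list_error = exc

# A Series is matched by label: 'one' and 'three' have no match and get NaN,
# and the extra label 'nine' is ignored
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'nine'])
debt = frame['debt']
```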


In [59]:
frame2['eastern'] = frame2.state == 'Ohio'

In [60]:
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False


The del keyword can then be used to remove this column:

In [61]:
del frame2['eastern']

In [62]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


**Caution:**
The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series’s copy method.
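A minimal sketch of the explicit copy, on a small hypothetical frame. Whether an uncopied column writes through to the DataFrame may depend on the pandas version and its copy-on-write setting, so only the copied case is shown here.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# An explicit copy is independent of the DataFrame it came from
pop_copy = frame['pop'].copy()
pop_copy['one'] = 99.0

unchanged = frame.loc['one', 'pop']   # the original value is still in place
```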

In [63]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [64]:
frame3 = pd.DataFrame(pop)

In [65]:
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [66]:
# Transpose
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


In [67]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


In [68]:
pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [69]:
pd.DataFrame(pdata)

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9


In [70]:
frame3.index.name = 'year'
frame3.columns.name = 'state'

In [71]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [72]:
frame3.values # returns two-dimensional ndarray

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [73]:
type(frame2.values)

numpy.ndarray

In [74]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

#### Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [75]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [76]:
index = obj.index

In [77]:
index

Index(['a', 'b', 'c'], dtype='object')

In [78]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus cannot be modified by the user:

In [79]:
index[1] = 'd'

TypeError: Index does not support mutable operations
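To get a changed set of labels you build a new Index rather than editing the existing one. A small sketch, using `delete` and `insert` from the method table in this section (the variable names are illustrative):

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment is rejected with TypeError
try:
    index[1] = 'd'
    error = None
except TypeError as exc:
    error = exc

# delete and insert each return a new Index; `index` itself is unchanged
replaced = index.delete(1).insert(1, 'd')
```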

Immutability makes it safer to share Index objects among data structures:

In [80]:
labels = pd.Index(np.arange(3))

In [81]:
labels

Int64Index([0, 1, 2], dtype='int64')

In [82]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [83]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [84]:
obj2.index is labels

True

In [85]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [86]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [87]:
'Ohio' in frame3.columns

True

In [88]:
2003 in frame3.index

False

Unlike Python sets, a pandas Index can contain duplicate labels. Selection with duplicate labels will select all occurrences of that label.

In [89]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [90]:
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
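A short sketch of what selection with a duplicated label looks like, using a hypothetical Series built on the same labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['foo', 'foo', 'bar', 'bar'])

both = s['foo']               # two matches, so a Series comes back, not a scalar
unique = s.index.is_unique    # False because labels repeat
```

Code that expects a scalar from `s[label]` can check `is_unique` first.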

Index methods and properties | Description
:---:|:---
append | Concatenate with additional Index objects, producing a new Index
difference | Compute set difference as an Index
intersection | Compute set intersection
union | Compute set union
isin | Compute boolean array indicating whether each value is contained in the passed collection
delete | Compute new Index with element at index i deleted
drop | Compute new Index by deleting passed values
insert | Compute new Index by inserting element at index i
is_monotonic | Return True if each element is greater than or equal to the previous element
is_unique | Return True if the Index has no duplicate values
unique | Compute the array of unique values in the Index
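A few of the set-like methods from the table, sketched on two small hypothetical Index objects:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

union = left.union(right)               # labels found in either Index
common = left.intersection(right)       # labels found in both
only_left = left.difference(right)      # labels found only in `left`
mask = left.isin(['a', 'c'])            # boolean array, one entry per label
```

All four calls return new objects and leave `left` and `right` as they were.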

## 5.2 Essential Functionality

### Reindexing

Reindexing means creating a new object with the data conformed to a new index:

In [91]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [92]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [93]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [94]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [95]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])

In [96]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [97]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
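Forward fill is one of several ways to fill the gaps that reindexing opens up. The sketch below compares it with backward fill and a constant `fill_value`, on a Series equivalent to `obj3`; the `'none'` placeholder is an arbitrary choice.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = colors.reindex(range(6), method='ffill')     # carry the last value forward
backward = colors.reindex(range(6), method='bfill')    # pull the next value back
constant = colors.reindex(range(6), fill_value='none') # fixed placeholder for gaps
```

With `bfill`, label 5 stays missing because no later value exists to pull back.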

In [98]:
frame = pd.DataFrame(np.arange(9).reshape(3, 3), index = ['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])

In [99]:
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [100]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [101]:
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [102]:
states = ['Texas', 'Utah', 'California']

In [104]:
frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


In [105]:
frame.loc[['a', 'b', 'c', 'd'], states] # newer pandas versions do not accept labels in loc that are missing from the index

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
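Two ways around this error are sketched below, under the assumption that `frame` and `states` are defined as above: call `reindex` on both axes, or restrict the request to labels that exist before using `loc`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# reindex accepts labels that do not exist yet and fills them with NaN
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

# To stay with loc, keep only the requested labels that are actually present
existing = frame.loc[frame.index.intersection(['a', 'b', 'c', 'd']),
                     frame.columns.intersection(states)]
```

The first form adds the missing row and column as NaN; the second silently drops them, so the choice depends on which behavior you need.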

### Dropping Entries from an Axis

In [106]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [107]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [108]:
new_obj = obj.drop('c')

In [109]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [110]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [115]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])


In [116]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [117]:
data.drop(['Colorado', 'Ohio']) # default: drop rows (axis=0)

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [118]:
data.drop('two', axis=1) # drop columns (axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [119]:
data.drop(['two', 'four'], axis='columns')

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


Many functions that modify the size or shape of a Series or DataFrame, such as drop, can manipulate an object in-place without returning a new object:

In [120]:
obj.drop('c', inplace=True)

In [121]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
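The two calling styles differ in what they return, which is easy to trip over. A small sketch with a Series like `obj` (the names `smaller` and `returned` are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

smaller = obj.drop('c')                  # new object; obj still has five entries
returned = obj.drop('d', inplace=True)   # modifies obj itself and returns None
```

Because the in-place call returns None, writing `obj = obj.drop('d', inplace=True)` would discard the data.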

### Indexing, Selection, and Filtering

In [122]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [123]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [124]:
obj['b']

1.0

In [125]:
obj[1]

1.0

In [126]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [128]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [129]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [130]:
obj[obj<2]

a    0.0
b    1.0
dtype: float64

In [131]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [132]:
obj['b':'c'] = 5

In [133]:
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
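The label slice above includes its endpoint `'c'`, unlike ordinary Python slicing. The sketch below makes the contrast explicit with `loc` and `iloc` on a fresh copy of the same Series:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']      # label slice: the endpoint 'c' is included
by_position = obj.iloc[1:2]      # positional slice: position 2 is excluded
```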

In [134]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [135]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [136]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [137]:
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [138]:
data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [139]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [140]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [141]:
data[data < 5] = 0

In [142]:
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


#### Selection with loc and iloc
The special indexing operators loc and iloc let you select a subset of the rows and columns from a DataFrame with NumPy-like notation, using either axis **labels (loc)** or **integers (iloc)**.

In [143]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

In [145]:
data.iloc[2, [3, 0, 1]] # [rows, columns]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [146]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
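To make the label/position distinction concrete, the sketch below selects the same cells both ways on a fresh copy of the 4×4 frame (before any values were zeroed out). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Utah', ['one', 'two']]    # row and columns given by label
by_position = data.iloc[2, [0, 1]]             # the same cells given by position
upto_utah = data.loc[:'Utah', 'two']           # label slice includes 'Utah'
```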


### Integer Index