# Getting Started with Pandas 
Pandas contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. pandas is often used in tandem with numerical
computing tools like NumPy and SciPy, analytical libraries like statsmodels and
scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant
parts of NumPy’s idiomatic style of array-based computing, especially array-based
functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast,
is best suited for working with homogeneous numerical array data.
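
A minimal sketch of that difference, using made-up values (the names `arr` and `df` are only for illustration): a NumPy array built from mixed data falls back to one common dtype, while a DataFrame keeps a dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype; mixing numbers and strings
# coerces every element to a string.
arr = np.array([[1, 'a'], [2, 'b']])
print(arr.dtype.kind)   # 'U' means unicode strings

# A DataFrame stores each column with its own dtype.
df = pd.DataFrame({'num': [1, 2], 'label': ['a', 'b']})
print(df.dtypes)
```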

In [1]:
from pandas import Series, DataFrame  # used so often that it is convenient to import them into the local namespace
import pandas as pd
import numpy as np

## Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: `Series` and `DataFrame`. While they are not a universal solution for
every problem, they provide a solid, easy-to-use basis for most applications.

### Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

In [2]:
obj=Series([4,7,-5,3])

In [3]:
obj # a default index consisting of the integers 0 through N-1 is created

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the
left and the values on the right. <font color=red>Since we did not specify an index for the data, a
default one consisting of the integers 0 through $N - 1$ (where N is the length of the
data) is created.</font>

##### Get the array representation and index object of the Series via its values and index attributes.

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Create a Series with an index identifying each data point

In [6]:
obj2=Series([4,7,-5,3], index=['d','b','a','c'])

In [7]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

##### Use values in the index when selecting single values or a set of values

In [8]:
obj2['a']

-5

In [9]:
obj2['d']

4

In [10]:
obj2['d']=6

In [11]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [12]:
obj2[['c','a','d']]

c    3
a   -5
d    6
dtype: int64

In [13]:
obj2[obj2>0] # filtering with a boolean array

d    6
b    7
c    3
dtype: int64

In [14]:
#scalar multiplication
obj2*2

d    12
b    14
a   -10
c     6
dtype: int64

In [15]:
#Applying math functions
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

You can think of a Series as a fixed-length, ordered dictionary, since it is a mapping
of index values to data values.

In [16]:
'b' in obj2

True

In [17]:
'e' in obj2

False
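
Beyond the `in` operator, a few other dict-style operations carry over. The following is a small sketch rather than an exhaustive list; the variable names are introduced here for illustration.

```python
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership is tested against the index labels, not the values.
has_b = 'b' in s

# get() returns a default instead of raising when the label is absent.
missing = s.get('e', 0)

# items() yields (label, value) pairs in index order.
pairs = list(s.items())

# to_dict() converts back to a plain Python dict.
as_dict = s.to_dict()
```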

##### Creating a Series from a dictionary

In [18]:
sdata={'Ohio':35000, 'Texas':71000,'Oregon':16000, 'Utah':5000}

In [19]:
# Create a Series by passing the dict to the Series constructor
obj3=Series(sdata)

In [20]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series contains the dict’s
keys in insertion order, as the output above shows (older pandas releases sorted the
keys instead). You can override this by passing the dict keys in the order you
want them to appear in the resulting Series:

In [21]:
states=['California','Ohio','Oregon','Texas']

In [22]:
# The three values found in sdata are placed in the appropriate locations, but since
# no value for 'California' was found, it appears as NaN
obj4=Series(sdata, index=states) # values are matched to the labels in index order

In [23]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

##### Using isnull and notnull functions in pandas to detect missing data

In [24]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

<font color=red>A useful way to count the missing values in a `Series` or `DataFrame` is to call `.isna().sum()` on it.</font>

In [25]:
obj4.isna().sum()

1

In [26]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [27]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

##### A critical Series feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations

In [29]:
obj3+obj4 # note California and Utah

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a `name` attribute, which integrates with other key areas of pandas functionality:

In [30]:
obj4.name='population'

In [31]:
obj4.index.name='state'

In [32]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

##### A Series’s index can be altered in-place by assignment:

In [33]:
obj.index=['Bob','Steve','Jeff','Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure. <font color=red>DataFrame has both a row and column index.</font> It can be thought of as a dict of Series.

##### Lists in dictionary

In [34]:
data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
     'year':[2000,2001,2002,2001,2002],
     'pop':[1.5,1.7,3.6,2.4,2.9]}

All the keys in the dictionary become column labels.

In [35]:
frame=DataFrame(data)

In [36]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


<font color=red>If you specify a sequence of columns, the DataFrame's columns will be arranged in exactly the order you pass.</font>

In [37]:
DataFrame(data, columns=['year','state','pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


As with Series, if you pass a column that isn't contained in data, it will appear with NA values.

In [38]:
frame2=DataFrame(data, columns=['year','state','pop','debt'],
                index=['one','two','three','four','five'])

In [39]:
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [40]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or
by attribute:

In [41]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [42]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

The returned Series has the same index as the DataFrame. <font color=red>Rows can also be retrieved by name with `loc` or by position with `iloc`.</font>

In [43]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

<font color=red>Columns can be modified by assignment.</font>

In [44]:
frame2['debt']=16.5

In [45]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5


In [46]:
frame2['debt']=np.arange(5)

In [47]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4


When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, it is instead conformed exactly to the DataFrame's index, with missing values inserted for any labels the Series lacks.

In [48]:
val=Series([-1.2,-1.5,-1.7], index=['two','four','five'])

In [49]:
frame2['debt']=val

In [50]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
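
The two assignment rules can be contrasted in a short sketch. The small frame `df` and the flag `list_ok` are made up for illustration; only the behavior of column assignment is taken from the text.

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A list must supply exactly one value per row; a shorter list is rejected.
try:
    df['debt'] = [1.0, 2.0]
    list_ok = True
except ValueError:
    list_ok = False

# A Series is aligned on its labels instead; rows without a match get NaN.
df['debt'] = pd.Series([-1.2], index=['two'])
print(df)
```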


<font color=red>Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dictionary.</font>

In [51]:
frame2['eastern']=frame2.state=='Ohio' #boolean

In [52]:
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False


In [53]:
del frame2['eastern']

In [54]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7


##### Another common form of data is a nested dict of dicts format.

In [55]:
pop={'Nevada':{2001:2.4,2002:2.9},
    'Ohio':{2000:1.5,2001:1.7,2002:3.6}}

In [56]:
frame3=DataFrame(pop)

In [57]:
frame3
#Outer dict keys as the columns and the inner keys as the row indices

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


Transpose the DataFrame (swap rows and columns):

In [58]:
frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


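If you pass an explicit index, only those labels are used as rows; inner-dict keys outside it are dropped and labels with no data become NaN:

In [59]:
DataFrame(pop, index=pd.Index(np.arange(2001,2004,1)))

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN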


In [60]:
pdata={'Ohio': frame3['Ohio'][:-1],
      'Nevada': frame3['Nevada'][:2]}

In [61]:
DataFrame(pdata)

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


<font color=red>If a DataFrame's index and columns have their name attributes set, these will also be displayed.</font>

In [62]:
frame3.index.name='year'; frame3.columns.name='state'

In [63]:
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


As with Series, the `values` attribute returns the data contained in the DataFrame as a 2D ndarray:

In [64]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

If the DataFrame’s columns are different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns:

In [65]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

### Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index:

In [66]:
obj=Series(range(3), index=['a','b','c'])

In [67]:
obj

a    0
b    1
c    2
dtype: int64

In [68]:
index=obj.index

In [69]:
index

Index(['a', 'b', 'c'], dtype='object')

In [70]:
index[1:]

Index(['b', 'c'], dtype='object')

In [71]:
index[1]='d' # Index objects are immutable

TypeError: Index does not support mutable operations
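
Because an Index cannot be modified in place, operations on it return new Index objects. A brief sketch, with names such as `both` and `either` introduced only for this example:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Item assignment is refused because Index objects are immutable.
try:
    idx[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# Set-like methods build new Index objects and leave idx untouched.
both = idx.intersection(pd.Index(['b', 'c', 'd']))
either = idx.union(pd.Index(['d']))
```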

In [72]:
index=pd.Index(np.arange(3))

In [73]:
obj2=Series([1.5,-2.5,0], index=index)

In [74]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [75]:
obj2.index is index

True

In [76]:
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [77]:
'Ohio' in frame3.columns

True

In [78]:
2003 in frame3.index

False

## Essential Functionality

### Reindexing

<font color=red>Reindexing means creating a new object with the data conformed to a new index.</font>

In [79]:
obj=Series([4.5,7.2,-5.3,3.6], index=['d','b','a','c'])

In [80]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [81]:
obj.reindex(['a','b','c','d','e'], fill_value=0) # fill the missing values with 0

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present.

In [82]:
obj2=obj.reindex(['a','b','c','d','e'])

In [83]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [84]:
obj3=Series(['blue','purple','yellow'], index=[0,2,4])

In [85]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [86]:
obj3.reindex(range(6), method='ffill') # forward fill: each gap takes the previous value

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
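
For comparison, `method='bfill'` fills in the opposite direction. This is a sketch on the same kind of data; the names `colors` and `back` are illustrative.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: each new label takes the next available value.
# Label 5 has nothing after it, so it stays missing.
back = colors.reindex(range(6), method='bfill')
print(back)
```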

In [87]:
# reshape gives the array a new shape without changing its data
frame=DataFrame(np.arange(9).reshape((3,3)), index=['a','c','d'],
               columns=['Ohio','Texas','California'])

In [88]:
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [89]:
frame2=frame.reindex(['a','b','c','d'])

In [90]:
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


<font color=red>The columns can be reindexed using the columns keyword</font>

In [91]:
states=['Texas','Utah','California']

In [92]:
frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


Both can be reindexed in one shot, though interpolation will only apply row-wise.

In [93]:
frame.reindex(index=['a','b','c','d'],
             columns=states).ffill()

   Texas  Utah  California
a    1.0   NaN         2.0
b    1.0   NaN         2.0
c    4.0   NaN         5.0
d    7.0   NaN         8.0


In [94]:
# frame.loc[['a','b','c','d'], states] was deprecated when this notebook was run because
# 'b' and 'Utah' are missing labels; pandas 1.0 and later raise KeyError for it.
# .reindex() is the alternative. See the documentation here:
# https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
frame.reindex(index=['a','b','c','d'], columns=states)

   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0

### Dropping entries from an axis

Dropping one or more entries from an axis is easy if you already have an index array
or list without those entries. As that can require a bit of munging and set logic, the
`drop` method will return a new object with the indicated value or values deleted from
an axis:

In [95]:
obj=Series(np.arange(5), index=['a','b','c','d','e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

##### `Series.drop(labels)` or `DataFrame.drop(labels)`

In [96]:
new_obj=obj.drop('c') #drop row 'c'

In [97]:
new_obj

a    0
b    1
d    3
e    4
dtype: int64

In [98]:
obj.drop(['d','c'])

a    0
b    1
e    4
dtype: int64

With DataFrame, index values can be deleted from either axis:

In [99]:
data=DataFrame(np.arange(16).reshape((4,4)),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['One','Two','Three','Four'])

In [100]:
data

          One  Two  Three  Four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [101]:
data.drop(['Colorado','Ohio'])

          One  Two  Three  Four
Utah        8    9     10    11
New York   12   13     14    15


##### `DataFrame.drop(column_labels, axis=1)`

In [102]:
data.drop('Two', axis=1) # the default is axis=0 (rows)

          One  Three  Four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [103]:
data.drop(['Two', 'Four'], axis=1)

          One  Three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


Many functions, like drop, which modify the size or shape of a Series or DataFrame,
can manipulate an object in-place without returning a new object:

In [104]:
obj.drop('c', inplace=True)
obj

a    0
b    1
d    3
e    4
dtype: int64
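
Without `inplace=True`, `drop` leaves the original object untouched, which is easy to verify. The sketch also uses the `columns=` keyword, an equivalent spelling of `axis=1`; the variable `dropped` is introduced here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['One', 'Two', 'Three', 'Four'])

# drop returns a new DataFrame; data itself still has the column.
dropped = data.drop(columns=['Two'])   # same result as data.drop('Two', axis=1)

print('Two' in data.columns, 'Two' in dropped.columns)
```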

### Indexing, selection and filtering
Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers. Here are some examples of
this:

In [105]:
obj=Series(np.arange(4), index=['a','b','c','d'])

In [106]:
obj

a    0
b    1
c    2
d    3
dtype: int64

In [107]:
obj['b']

1

In [108]:
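obj[1] # second item of obj (positional access; recent pandas versions prefer obj.iloc[1])

1

In [109]:
obj[2:4] # third and fourth items

c    2
d    3
dtype: int64

In [110]:
obj[['b','a','d']] # select by label, in the order given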

b    1
a    0
d    3
dtype: int64

In [111]:
obj[[1,3]]

b    1
d    3
dtype: int64

In [112]:
obj[obj<2]

a    0
b    1
dtype: int64

<font color=red>Slicing with labels behaves differently from normal Python slicing: the endpoint is inclusive.</font>

In [113]:
obj['b':'c']

b    1
c    2
dtype: int64

In [114]:
obj['b':'c']=5

In [115]:
obj

a    0
b    5
c    5
d    3
dtype: int64
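
The difference between the two slicing rules is easiest to see side by side. A minimal sketch using the explicit `loc` and `iloc` accessors, with illustrative variable names:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints are included.
by_label = obj.loc['b':'c']

# Position slice: the stop position is excluded, as with Python lists.
by_position = obj.iloc[1:2]
```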

Indexing into a DataFrame retrieves one or more columns, given either a single label or a list of labels:

In [116]:
data=DataFrame(np.arange(16).reshape((4,4)),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['One','Two','Three','Four'])

In [117]:
data

          One  Two  Three  Four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [118]:
data['Two'] #retrieving one column or say, Series

Ohio         1
Colorado     5
Utah         9
New York    13
Name: Two, dtype: int64

In [119]:
data[['Four','Three']]

          Four  Three
Ohio         3      2
Colorado     7      6
Utah        11     10
New York    15     14


In [120]:
data[:2] # select the first two rows

          One  Two  Three  Four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [121]:
data[data['Three']>5]

          One  Two  Three  Four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [122]:
data<5

            One    Two  Three   Four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [123]:
data[data<5]=0

In [124]:
data

          One  Two  Three  Four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


#### Selection with loc and iloc
For DataFrame label-indexing on the rows, I introduce the special indexing operators
`loc` and `iloc`. They enable you to select a subset of the rows and columns from a
DataFrame with NumPy-like notation using either axis labels (`loc`) or integers
(`iloc`).

##### `DataFrame.loc[row label]` versus `DataFrame.iloc[row position]`

In [125]:
data.loc['Colorado',['Two','Three']]

Two      5
Three    6
Name: Colorado, dtype: int64

In [126]:
data.iloc[[2,3],[3,0,1]]

          Four  One  Two
Utah        11    8    9
New York    15   12   13


In [127]:
data.iloc[2]

One       8
Two       9
Three    10
Four     11
Name: Utah, dtype: int64

<font color=red>With `loc`, the endpoint of a label slice is inclusive.</font>

In [128]:
data.loc[:'Utah','Two']

Ohio        0
Colorado    5
Utah        9
Name: Two, dtype: int64

In [129]:
data.iloc[:,:3]

          One  Two  Three
Ohio        0    0      0
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


In [130]:
data.iloc[:,:3][data.Three>5]

          One  Two  Three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

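
A row filter and a column selection can also be combined in a single `loc` call rather than chained. This sketch rebuilds the original, unmodified `data` frame; `subset` is a name introduced for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['One', 'Two', 'Three', 'Four'])

# Boolean mask for the rows, list of labels for the columns.
subset = data.loc[data['Three'] > 5, ['One', 'Two']]
print(subset)
```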

#### Integer Indexes
Working with pandas objects indexed by integers is something that often trips up
new users due to some differences with indexing semantics on built-in Python data
structures like lists and tuples. For example, you might not expect the following code
to generate an error:

In [131]:
ser=Series(np.arange(3.))
ser
#ser[-1] gets an error

0    0.0
1    1.0
2    2.0
dtype: float64

In [132]:
ser2=Series(np.arange(3.), index=['a','b','c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [133]:
ser2[-1]

2.0

In [134]:
ser[:1]

0    0.0
dtype: float64

In [135]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [136]:
ser.iloc[:1]

0    0.0
dtype: float64
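
To avoid the ambiguity, it helps to state the intent explicitly: `loc` always means labels and `iloc` always means positions. A small sketch, where `found` and `last` are illustrative names:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # integer index 0, 1, 2

# loc looks for the label -1, which does not exist in the index.
try:
    ser.loc[-1]
    found = True
except KeyError:
    found = False

# iloc counts from the end, like a Python list.
last = ser.iloc[-1]
```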

## Arithmetic and data alignment
An important pandas feature for some applications is the behavior of arithmetic
between objects with different indexes. When you are adding together objects, if any
index pairs are not the same, the respective index in the result will be the union of the
index pairs.

In [137]:
s1=Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])

In [138]:
s2=Series([-2.1,3.6,-1.5,4,3.1], index=['a','c','e','f','g'])

In [139]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [140]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [141]:
s1+s2 # values are added where labels match; labels found in only one Series give NaN

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

##### Alignment is performed on both the rows and the columns for DataFrames

In [142]:
df1=DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'),
             index=['Ohio','Texas','Colorado'])

In [143]:
df2=DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'),
             index=['Utah','Ohio','Texas','Oregon'])

In [144]:
df1

          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8


In [145]:
df2

        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11


Adding these together returns a DataFrame whose index and columns are united.

In [146]:
df1+df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear
as all missing in the result. The same holds for the rows whose labels are not common
to both objects.

If you add DataFrame objects with no column or row labels in common, the result
will contain all nulls:

In [147]:
df1=DataFrame({'A':[1,2]})
df1

   A
0  1
1  2


In [148]:
df2=DataFrame({'B':[3,4]})
df2

   B
0  3
1  4


In [149]:
df1+df2

    A   B
0 NaN NaN
1 NaN NaN


In [150]:
df1-df2

    A   B
0 NaN NaN
1 NaN NaN


### Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other:

In [151]:
df1=DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))

In [152]:
df2=DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))

In [153]:
df1

   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11


In [154]:
df2

    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19


In [155]:
df2.loc[1,'b']=np.nan
df2

    a     b   c   d   e
0   0   1.0   2   3   4
1   5   NaN   7   8   9
2  10  11.0  12  13  14
3  15  16.0  17  18  19


Adding these together results in NA values in the locations that don’t overlap:

In [156]:
df1+df2

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


##### Using the `add` method on df1, I pass `df2` and an argument to `fill_value`:

In [157]:
# To avoid the NaN values of df1+df2, use df1.add(): a value missing from one
# frame is replaced by fill_value (0 here) before the addition
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [158]:
1/df1

       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909


In [159]:
df1.rdiv(1)

       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909


Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:

In [160]:
df1.reindex(columns=df2.columns, fill_value=0)

   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0


### Operations between DataFrame and Series

In [161]:
arr=np.arange(12.).reshape((3,4))

In [162]:
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [163]:
arr[0]

array([0., 1., 2., 3.])

In [164]:
arr-arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

When we subtract `arr[0]` from arr, the subtraction is performed once for each row.
This is referred to as <font color=red>broadcasting</font> and is explained in more detail as it relates to general
NumPy arrays in Appendix A.

In [165]:
frame=DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
               index=['Utah','Ohio','Texas','Oregon'])
frame

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [166]:
series=frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame’s columns, broadcasting down the rows:

In [167]:
frame-series

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0


In [168]:
series2=Series(range(3), index=list('bef'))
series2

b    0
e    1
f    2
dtype: int64

In [169]:
frame+series2

          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods. For example:

In [170]:
series3=frame['d']

In [171]:
frame

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [172]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [173]:
frame.sub(series3, axis=0) #Subtraction of dataframe and other, element-wise

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
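
A common use of matching on the rows is centering each row on its own mean. The following sketch assumes the same 4x3 frame as above; `row_means` and `centered` are names introduced for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels.
row_means = frame.mean(axis=1)

# axis=0 matches row_means against the row index and broadcasts across columns.
centered = frame.sub(row_means, axis=0)
print(centered)
```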


## Function application and mapping
NumPy ufuncs (element-wise array functions) also work with pandas objects:

In [174]:
#np.random.randn Return a sample (or samples) from the “standard normal” distribution.
frame=DataFrame(np.random.randn(4,3), columns=list('bde'),
               index=['Utah','Ohio','Texas','Oregon'])
frame

               b         d         e
Utah   -0.580960  0.174880 -2.525324
Ohio    1.355102 -1.759478 -0.905877
Texas   0.262419 -1.565827  0.137794
Oregon  0.687498  1.806119 -0.722967


In [175]:
np.abs(frame) #taking absolute value

               b         d         e
Utah    0.580960  0.174880  2.525324
Ohio    1.355102  1.759478  0.905877
Texas   0.262419  1.565827  0.137794
Oregon  0.687498  1.806119  0.722967


In [176]:
f=lambda x: x.max()-x.min()

In [177]:
frame.apply(f) # by default (axis=0), f is called once per column

b    1.936062
d    3.565597
e    2.663119
dtype: float64

In [178]:
frame.apply(f, axis=1)

Utah      2.700205
Ohio      3.114581
Texas     1.828246
Oregon    2.529085
dtype: float64
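
Many common reductions are already DataFrame methods, so `apply` is often unnecessary. As a sketch on a small deterministic frame (not the random one above), the range computed with `apply` matches the one built from `max` and `min`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'))

# Column-wise range through apply with a lambda...
via_apply = frame.apply(lambda x: x.max() - x.min())

# ...and the same result from built-in reductions.
vectorized = frame.max() - frame.min()
```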

In [179]:
def f(x):
    return Series([x.min(), x.max()], index=['min','max'])

In [180]:
frame.apply(f) # minimum and maximum of each column

            b         d         e
min -0.580960 -1.759478 -2.525324
max  1.355102  1.806119  0.137794


In [181]:
frame.apply(f, axis=1) # minimum and maximum of each row

             min       max
Utah   -2.525324  0.174880
Ohio   -1.759478  1.355102
Texas  -1.565827  0.262419
Oregon -0.722967  1.806119


Element-wise Python functions can be used, too. Suppose you wanted to compute a
formatted string from each floating-point value in frame. You can do this with
`applymap` (pandas 2.1 and later deprecate it in favor of `DataFrame.map`):

In [182]:
format= lambda x: '%.2f' % x

In [183]:
frame.applymap(format)

            b      d      e
Utah    -0.58   0.17  -2.53
Ohio     1.36  -1.76  -0.91
Texas    0.26  -1.57   0.14
Oregon   0.69   1.81  -0.72


The reason for the name applymap is that Series has a map method for applying an
element-wise function:

In [184]:
frame['e'].map(format)

Utah      -2.53
Ohio      -0.91
Texas      0.14
Oregon    -0.72
Name: e, dtype: object
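
In pandas 2.1 and later, `DataFrame.applymap` emits a deprecation warning in favor of `DataFrame.map`. A version-tolerant sketch with made-up values follows; `fmt`, `elementwise`, and `formatted` are names introduced here, not part of pandas.

```python
import pandas as pd

frame = pd.DataFrame({'b': [0.5812, 1.3551], 'd': [0.1749, -1.7595]})


def fmt(x):
    """Format one float with two decimal places."""
    return '%.2f' % x


# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(fmt)
print(formatted)
```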

## Sorting and ranking
Sorting a dataset by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the `sort_index` method, which returns
a new, sorted object:

In [185]:
obj=Series(range(4), index=list('dabc'))
obj

d    0
a    1
b    2
c    3
dtype: int64

In [186]:
obj.sort_index() # sort by index label

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

In [187]:
frame=DataFrame(np.arange(8.).reshape((2,4)), index=['Three','One'],
               columns=list('dabc'))
frame

         d    a    b    c
Three  0.0  1.0  2.0  3.0
One    4.0  5.0  6.0  7.0


In [188]:
frame.sort_index() # sort by row index

         d    a    b    c
One    4.0  5.0  6.0  7.0
Three  0.0  1.0  2.0  3.0


In [189]:
frame.sort_index(axis=1) # sort by column index

         a    b    c    d
Three  1.0  2.0  3.0  0.0
One    5.0  6.0  7.0  4.0


The data is sorted in ascending order by default. To sort in descending order, pass `ascending=False`:

In [190]:
frame.sort_index(axis=1, ascending=False)

         d    c    b    a
Three  0.0  3.0  2.0  1.0
One    4.0  7.0  6.0  5.0


To sort a Series by its values, use its `sort_values` method:

In [191]:
obj=Series([4,7,-3,2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [192]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [193]:
obj=Series([4,np.nan, 7, np.nan, -3,2])
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [194]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort
keys. To do so, pass one or more column names to the `by` option of sort_values:

In [195]:
frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1


In [196]:
frame.sort_values(by='b')

   b  a
2 -3  0
3  2  1
0  4  0
1  7  1


In [197]:
frame.sort_values(by=['a','b'])

   b  a
2 -3  0
0  4  0
3  2  1
1  7  1

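
The sort direction can also be set per key by passing a list to `ascending`. A short sketch on the same frame, with `mixed` as an illustrative name:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```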

#### Ranking
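Ranking assigns ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by `numpy.argsort`, except that by default `rank` breaks ties by giving each tied value the mean rank of its group.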

In [198]:
obj=Series([7,-5,7,4,2,0,4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [199]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which they’re observed in the
data:

In [200]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [201]:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
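
The tie-breaking methods are easiest to compare side by side. This sketch collects several `method` values into one table; the `methods` list and the `ranks` frame are constructed only for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, all computed on the same data.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

In the `dense` column, ranks increase by exactly one between groups, so the largest rank equals the number of distinct values.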

In [202]:
frame=DataFrame({'b':[4.3,7,-3,2], 'a':[0,1,0,1],
                'c':[-2,5,8,-2.5]})
frame

     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5


In [203]:
frame.rank(axis=1)

     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0


### Axis indexes with duplicate values

In [204]:
obj=Series(np.arange(5), index=list('aabbc'))
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index's `is_unique` property tells you whether its labels are unique:

In [205]:
obj.index.is_unique # check whether the row index is unique

False

In [206]:
obj['a']

a    0
a    1
dtype: int64

In [207]:
df=DataFrame(np.random.randn(4,3), index=list('aabb'))
df

          0         1         2
a -0.013661  1.815655 -0.251980
a  0.574857  0.300252 -0.190661
b  1.441923 -0.774977  1.075733
b  0.312097  1.142985 -1.083875


In [208]:
df.loc['b']

          0         1         2
b  1.441923 -0.774977  1.075733
b  0.312097  1.142985 -1.083875
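
With duplicate labels, the return type of a selection depends on how many entries match, which can complicate code. A sketch of that behavior, plus one possible way to keep only the first entry per label with `Index.duplicated` (the names `dup_mask` and `first_only` are made up here):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=list('aabbc'))

# A repeated label returns a Series; a unique label returns a scalar.
many = obj['a']
single = obj['c']

# duplicated() marks every repeat after the first occurrence of a label.
dup_mask = obj.index.duplicated()
first_only = obj[~dup_mask]
```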


## Summarizing and Computing Descriptive Statistics
<font color=red>pandas reduction and summary-statistics methods are built from the ground up to exclude missing data.</font>

In [209]:
df=DataFrame([[1.4,np.nan],[7.1,-4.5],
            [np.nan,np.nan],[0.75,-1.3]],
            index=list('abcd'),
            columns=['One','Two'])
df

    One  Two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [210]:
df.sum() # returns a Series of column sums (the default is to sum down each column)

One    9.25
Two   -5.80
dtype: float64

In [211]:
df.sum(axis=1) # sum across each row

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [212]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
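
The effect of `skipna` is clearest when both settings are computed on the same data. A minimal sketch using the frame from this section, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['One', 'Two'])

# Default (skipna=True): NaN is ignored unless the whole row is NaN.
default_mean = df.mean(axis=1)

# skipna=False: a single NaN makes the row's result NaN.
strict_mean = df.mean(axis=1, skipna=False)
```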

Some methods, like `idxmin` and `idxmax`, return indirect statistics like the index value
where the minimum or maximum values are attained:

In [213]:
df.idxmax()

One    b
Two    d
dtype: object

In [214]:
df.idxmin()

One    d
Two    b
dtype: object

In [215]:
df.cumsum() # cumulative sum down each column

    One  Two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


In [216]:
df.describe()

            One       Two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


### Correlation and Covariance

In [217]:
import pandas_datareader.data as web

In [218]:
all_data={}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.get_data_yahoo(ticker, '1/1/2000','1/1/2010')

In [219]:
all_data

{'AAPL':                  High        Low       Open      Close       Volume  Adj Close
 Date                                                                          
 2000-01-03   4.017857   3.631696   3.745536   3.997768  133949200.0   2.665724
 2000-01-04   3.950893   3.613839   3.866071   3.660714  128094400.0   2.440975
 2000-01-05   3.948661   3.678571   3.705357   3.714286  194580400.0   2.476697
 2000-01-06   3.821429   3.392857   3.790179   3.392857  191993200.0   2.262367
 2000-01-07   3.607143   3.410714   3.446429   3.553571  115183600.0   2.369532
 2000-01-10   3.651786   3.383929   3.642857   3.491071  126266000.0   2.327857
 2000-01-11   3.549107   3.232143   3.426339   3.312500  110387200.0   2.208785
 2000-01-12   3.410714   3.089286   3.392857   3.113839  244017200.0   2.076318
 2000-01-13   3.526786   3.303571   3.374439   3.455357  258171200.0   2.304043
 2000-01-14   3.651786   3.549107   3.571429   3.587054   97594000.0   2.391858
 2000-01-18   3.785714   3.58705

In [220]:
price=DataFrame({tic:data['Adj Close']
                for tic, data in all_data.items()})

In [221]:
volume=DataFrame({tic: data['Volume']
                 for tic, data in all_data.items()})

In [222]:
price

                AAPL        IBM       MSFT  GOOG
Date
2000-01-03  2.665724  79.153915  42.295185   NaN
2000-01-04  2.440975  76.467079  40.866425   NaN
2000-01-05  2.476697  79.153915  41.297348   NaN
2000-01-06  2.262367  77.789200  39.913952   NaN
2000-01-07  2.369532  77.448021  40.435547   NaN
2000-01-10  2.327857  80.518608  40.730385   NaN
2000-01-11  2.208785  81.200981  39.687164   NaN
2000-01-12  2.076318  81.542168  38.394501   NaN
2000-01-13  2.304043  80.689217  39.120216   NaN
2000-01-14  2.391858  81.627472  40.730385   NaN


In [223]:
returns=price.pct_change() #Percentage change between the current and a prior element.

In [224]:
returns.tail()

                AAPL       IBM      MSFT      GOOG
Date
2009-12-24  0.034339  0.004385  0.002587  0.011117
2009-12-28  0.012294  0.013326  0.005484  0.007098
2009-12-29 -0.011861 -0.003477  0.007058 -0.005571
2009-12-30  0.012147  0.005461 -0.013698  0.005376
2009-12-31 -0.004300 -0.012597 -0.015504 -0.004416


In [225]:
returns.MSFT.corr(returns.IBM)

0.49414967814095245

In [226]:
returns.MSFT.cov(returns.IBM)

0.0002157125382430134

In [227]:
returns.corr()

          AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.410011  0.423556  0.470676
IBM   0.410011  1.000000  0.494150  0.390689
MSFT  0.423556  0.494150  1.000000  0.438313
GOOG  0.470676  0.390689  0.438313  1.000000


In [228]:
returns.cov()

          AAPL       IBM      MSFT      GOOG
AAPL  0.001027  0.000252  0.000309  0.000303
IBM   0.000252  0.000367  0.000216  0.000142
MSFT  0.000309  0.000216  0.000519  0.000204
GOOG  0.000303  0.000142  0.000204  0.000580


Using DataFrame’s `corrwith` method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:

In [229]:
returns.corrwith(returns.IBM)

AAPL    0.410011
IBM     1.000000
MSFT    0.494150
GOOG    0.390689
dtype: float64
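
The relationship between `corr`, `cov`, and `corrwith` can be checked without downloading any prices. The sketch below uses synthetic random data in place of the stock returns; the column names `X`, `Y`, `Z` and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for daily returns (no network access needed).
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.standard_normal((250, 3)) * 0.01,
                       columns=['X', 'Y', 'Z'])

# Correlation is covariance scaled by both standard deviations.
pairwise = returns['X'].corr(returns['Y'])
manual = returns['X'].cov(returns['Y']) / (returns['X'].std() * returns['Y'].std())

# corrwith repeats the pairwise computation for every column.
against_x = returns.corrwith(returns['X'])
```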

## Unique Values, Value Counts, and Membership

In [230]:
obj=Series(list('cadaabbcc'))
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [231]:
uniques=obj.unique()

In [232]:
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [233]:
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

In [234]:
pd.value_counts(obj.values, sort=False) # sort=False: do not sort the result by count

d    1
c    3
a    3
b    2
dtype: int64

In [235]:
mask=obj.isin(['b','c'])

In [236]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [237]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [238]:
data=DataFrame({'Qu1':[1,3,4,3,4],
               'Qu2':[2,3,1,2,3],
               'Qu3':[1,5,2,4,4]})
data

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


Passing `pandas.value_counts` to this DataFrame’s `apply` function gives:

In [239]:
result=data.apply(pd.value_counts).fillna(0)

In [240]:
result

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
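
Recent pandas releases (2.1 and later) deprecate the top-level `pd.value_counts` function. An equivalent sketch that calls the `Series.value_counts` method on each column through a lambda:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count each distinct value per column; values absent from a column become 0.
result = data.apply(lambda col: col.value_counts()).fillna(0)
print(result)
```

The row labels of `result` are the distinct values found across all columns, and each cell is the number of times that value occurs in the column.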
