# Getting Started with Pandas 
Pandas contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. pandas is often used in tandem with numerical
computing tools like NumPy and SciPy, analytical libraries like statsmodels and
scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant
parts of NumPy’s idiomatic style of array-based computing, especially array-based
functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast,
is best suited for working with homogeneous numerical array data.
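
As a rough illustration of that difference, the sketch below builds a mixed NumPy array and a small DataFrame. The column names and values are made up for the example; only the dtype behavior is the point.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixing numbers and text
# coerces every element to a common (string) type.
arr = np.array([1, 2.5, "three"])

# A DataFrame keeps one dtype per column, so numeric columns stay numeric.
df = pd.DataFrame({"count": [1, 2, 3],
                   "ratio": [0.5, 1.5, 2.5],
                   "label": ["a", "b", "c"]})

print(arr.dtype)
print(df.dtypes)
```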

In [1]:
from pandas import Series, DataFrame  # used so often that it is convenient to import them into the local namespace
import pandas as pd
import numpy as np

## Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: `Series` and `DataFrame`. While they are not a universal solution for
every problem, they provide a solid, easy-to-use basis for most applications.

### Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

In [2]:
obj=Series([4,7,-5,3])

In [3]:
obj  # a default index of integers 0 through N-1 is created

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the
left and the values on the right. <font color=red>Since we did not specify an index for the data, a
default one consisting of the integers 0 through $N - 1$ (where N is the length of the
data) is created.</font>

##### Get the array representation and index object of the Series via its values and index attributes.

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Create a Series with an index identifying each data point

In [6]:
obj2=Series([4,7,-5,3], index=['d','b','a','c'])

In [7]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

##### Use values in the index when selecting single values or a set of values

In [8]:
obj2['a']

-5

In [9]:
obj2['d']

4

In [10]:
obj2['d']=6

In [11]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [12]:
obj2[['c','a','d']]

c    3
a   -5
d    6
dtype: int64

In [13]:
obj2[obj2>0] # filtering with a boolean array

d    6
b    7
c    3
dtype: int64

In [14]:
#scalar multiplication
obj2*2

d    12
b    14
a   -10
c     6
dtype: int64

In [15]:
#Applying math functions
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

You can think of a Series as a fixed-length, ordered dictionary, since it is a mapping
of index values to data values.

In [16]:
'b' in obj2

True

In [17]:
'e' in obj2

False
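
The dictionary analogy goes a little further than `in`. The following is a minimal sketch with illustrative data (the name `prices` is hypothetical) showing dict-style lookups on a Series.

```python
import pandas as pd

prices = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

has_label = "b" in prices       # membership tests the index labels...
has_value = 7 in prices         # ...not the stored values
fallback = prices.get("e", 0)   # like dict.get: a default when the label is absent
as_dict = prices.to_dict()      # convert back to a plain dict
```

Note that `7 in prices` is `False` even though 7 is one of the values, because membership is checked against the index.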

##### Creating a Series from a dictionary

In [18]:
sdata={'Ohio':35000, 'Texas':71000,'Oregon':16000, 'Utah':5000}

In [19]:
#Creating Series by function Series
obj3=Series(sdata)

In [20]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series holds the dict's keys in
insertion order, as in the output above (older pandas versions sorted the keys instead).
You can override this by passing the dict keys in the order you want them to appear in
the resulting Series:

In [21]:
states=['California','Ohio','Oregon','Texas']

In [22]:
# The three values found in sdata are placed at the matching labels; since no value
# for 'California' was found, it appears as NaN. 'Utah' is dropped because it is not in states.
obj4=Series(sdata, index=states)  # values are matched to the labels in index order

In [23]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

##### Using isnull and notnull functions in pandas to detect missing data

In [24]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

<font color=red>A convenient way to count the missing values in the data is to call `isna().sum()` on a `Series` or `DataFrame`.</font>

In [25]:
obj4.isna().sum()

1

In [26]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [27]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

##### A critical Series feature for many applications is that it automatically aligns differently indexed data in arithmetic operations

In [29]:
obj3+obj4  # note California and Utah

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a `name` attribute, which integrates with other key areas of pandas functionality.

In [30]:
obj4.name='population'

In [31]:
obj4.index.name='state'

In [32]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

##### A Series’s index can be altered in-place by assignment:

In [33]:
obj.index=['Bob','Steve','Jeff','Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure. <font color=red>DataFrame has both a row and column index.</font> It can be thought of as a dict of Series.

##### Lists in dictionary

In [34]:
data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
     'year':[2000,2001,2002,2001,2002],
     'pop':[1.5,1.7,3.6,2.4,2.9]}

All the keys in the dictionary become column labels.

In [35]:
frame=DataFrame(data)

In [36]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


<font color=red>If you specify a sequence of columns, the DataFrame's columns will be arranged in exactly the order you pass.</font>

In [36]:
DataFrame(data, columns=['year','state','pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


As with Series, if you pass a column that isn't contained in data, it will appear with NA values.

In [37]:
frame2=DataFrame(data, columns=['year','state','pop','debt'],
                index=['one','two','three','four','five'])

In [38]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [39]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or
by attribute:

In [40]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [41]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

The returned Series has the same index as the DataFrame. <font color=red>Rows can also be retrieved by label with `loc` or by integer position with `iloc`.</font>

In [43]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

<font color=red>Columns can be modified by assignment.</font>

In [42]:
frame2['debt']=16.5

In [43]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [44]:
frame2['debt']=np.arange(5)

In [45]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, it is instead conformed exactly to the DataFrame's index, with missing values inserted for any labels it lacks.

In [46]:
val=Series([-1.2,-1.5,-1.7], index=['two','four','five'])

In [47]:
frame2['debt']=val

In [48]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
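
The two assignment rules can be checked side by side. This is a minimal sketch on a small made-up frame: the wrong-length list is expected to raise `ValueError`, while the Series is aligned by label.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]},
                     index=["one", "two", "three"])

# A list must match the number of rows exactly.
try:
    frame["debt"] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on index labels instead: unmatched rows become NaN,
# and labels that are absent from the frame ("ten") are ignored.
frame["debt"] = pd.Series([-1.2, 9.9], index=["two", "ten"])
print(frame)
```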


<font color=red>Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dictionary.</font>

In [49]:
frame2['eastern']=frame2.state=='Ohio' #boolean

In [50]:
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [51]:
del frame2['eastern']

In [52]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


##### Another common form of data is a nested dict of dicts format.

In [53]:
pop={'Nevada':{2001:2.4,2002:2.9},
    'Ohio':{2000:1.5,2001:1.7,2002:3.6}}

In [54]:
frame3=DataFrame(pop)

In [55]:
frame3
#Outer dict keys as the columns and the inner keys as the row indices

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


Transpose the dataframe

In [56]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


If an index is specified explicitly, only those labels are used as rows:

In [57]:
DataFrame(pop, index=pd.Index(np.arange(2001,2004,1)))

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [58]:
pdata={'Ohio': frame3['Ohio'][:-1],
      'Nevada': frame3['Nevada'][:2]}

In [59]:
DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


<font color=red>If a DataFrame's index and columns have their name attributes set, these will also be displayed.</font>

In [60]:
frame3.index.name='year'; frame3.columns.name='state'

In [61]:
frame3

state,Nevada,Ohio
year,,
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


Like Series, the `values` attribute returns the data contained in the DataFrame as a 2D ndarray:

In [62]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

If the DataFrame’s columns are different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns:

In [63]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)
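
A small sketch of how the common dtype is chosen, using two made-up frames. It uses `to_numpy()`, which returns the same kind of array as the `values` attribute.

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1, 2], "label": ["x", "y"]})

numeric_arr = numeric.to_numpy()  # int and float columns are upcast to a float array
mixed_arr = mixed.to_numpy()      # numbers plus strings fall back to an object array

print(numeric_arr.dtype, mixed_arr.dtype)
```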

### Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index:

In [64]:
obj=Series(range(3), index=['a','b','c'])

In [65]:
obj

a    0
b    1
c    2
dtype: int64

In [66]:
index=obj.index

In [67]:
index

Index(['a', 'b', 'c'], dtype='object')

In [68]:
index[1:]

Index(['b', 'c'], dtype='object')

In [69]:
index[1]='d'  # Index objects are immutable

TypeError: Index does not support mutable operations
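
Immutability does not prevent deriving new Index objects. The sketch below, with illustrative labels, catches the `TypeError` and then uses the set-like `union` and `intersection` methods, which return new objects and leave the original untouched.

```python
import pandas as pd

labels = pd.Index(["a", "b", "c"])

try:
    labels[1] = "d"        # in-place mutation is rejected
    mutated = True
except TypeError:
    mutated = False

extended = labels.union(pd.Index(["c", "d"]))            # new Index with all labels
common = labels.intersection(pd.Index(["b", "c", "z"]))  # new Index with shared labels
```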

In [70]:
index=pd.Index(np.arange(3))

In [71]:
obj2=Series([1.5,-2.5,0], index=index)

In [72]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [73]:
obj2.index is index

True

In [74]:
frame3

state,Nevada,Ohio
year,,
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [75]:
'Ohio' in frame3.columns

True

In [76]:
2003 in frame3.index

False

## Essential Functionality

### Reindexing

<font color=red>Reindexing creates a new object with the data conformed to a new index.</font>

In [77]:
obj=Series([4.5,7.2,-5.3,3.6], index=['d','b','a','c'])

In [78]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [79]:
obj.reindex(['a','b','c','d','e'], fill_value=0)  # fill missing values with 0

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present.

In [81]:
obj2=obj.reindex(['a','b','c','d','e'])

In [82]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [83]:
obj3=Series(['blue','purple','yellow'], index=[0,2,4])

In [84]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [85]:
obj3.reindex(range(6), method='ffill')  # forward fill: carry the previous value forward

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
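
For comparison, here is a minimal sketch of forward filling next to backward filling on the same kind of data. The variable names are illustrative; `method="bfill"` is the backward counterpart of `method="ffill"`.

```python
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

forward = colors.reindex(range(6), method="ffill")   # carry the last seen value forward
backward = colors.reindex(range(6), method="bfill")  # pull the next value backward

print(forward)
print(backward)
```

With backward filling, position 5 has no later value to borrow, so it stays missing.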

In [86]:
frame=DataFrame(np.arange(9).reshape((3,3)), index=['a','c','d'],
               columns=['Ohio','Texas','California']) #Gives a new shape to an array without changing its data.

In [87]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [88]:
frame2=frame.reindex(['a','b','c','d'])

In [89]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


<font color=red>The columns can be reindexed using the columns keyword</font>

In [92]:
states=['Texas','Utah','California']

In [93]:
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


Both can be reindexed in one shot, though interpolation will only apply row-wise.

In [94]:
frame.reindex(index=['a','b','c','d'],
             columns=states).ffill()

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,1.0,,2.0
c,4.0,,5.0
d,7.0,,8.0


In [99]:
frame.loc[['a','b','c','d'], states]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike


Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


### Dropping entries from an axis

Dropping one or more entries from an axis is easy if you already have an index array
or list without those entries. As that can require a bit of munging and set logic, the
`drop` method will return a new object with the indicated value or values deleted from
an axis:

In [100]:
obj=Series(np.arange(5), index=['a','b','c','d','e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

##### Series.drop(index) or DataFrame.drop(index)

In [101]:
new_obj=obj.drop('c') #drop row 'c'

In [102]:
new_obj

a    0
b    1
d    3
e    4
dtype: int64

In [103]:
obj.drop(['d','c'])

a    0
b    1
e    4
dtype: int64

With DataFrame, index values can be deleted from either axis:

In [104]:
data=DataFrame(np.arange(16).reshape((4,4)),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['One','Two','Three','Four'])

In [105]:
data

Unnamed: 0,One,Two,Three,Four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [106]:
data.drop(['Colorado','Ohio'])

Unnamed: 0,One,Two,Three,Four
Utah,8,9,10,11
New York,12,13,14,15


##### DataFrame.drop(column index, axis=1)

In [107]:
data.drop('Two', axis=1) #default axis is 0

Unnamed: 0,One,Three,Four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [108]:
data.drop(['Two', 'Four'], axis=1)

Unnamed: 0,One,Three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


Many functions, like drop, which modify the size or shape of a Series or DataFrame,
can manipulate an object in-place without returning a new object:

In [109]:
obj.drop('c', inplace=True)
obj

a    0
b    1
d    3
e    4
dtype: int64
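
The difference between the two calling styles is easy to miss, so here is a short sketch on the same kind of Series. By default `drop` returns a new object, whereas with `inplace=True` it modifies the original and returns `None`.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=list("abcde"))

trimmed = obj.drop("c")               # new object; obj still has all five entries
result = obj.drop("e", inplace=True)  # mutates obj itself and returns None

print(trimmed)
print(obj)
```

Because the in-place form destroys the dropped data, it is worth using with care.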

### Indexing, selection and filtering
Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers. Here are some examples of
this:

In [110]:
obj=Series(np.arange(4), index=['a','b','c','d'])

In [111]:
obj

a    0
b    1
c    2
d    3
dtype: int64

In [112]:
obj['b']

1

In [113]:
obj[1]  # second element (by position)

1

In [115]:
obj[2:4]  # third and fourth elements

c    2
d    3
dtype: int64

In [116]:
obj[['b','a','d']]  # select by label, in the order given

b    1
a    0
d    3
dtype: int64

In [117]:
obj[[1,3]]

b    1
d    3
dtype: int64

In [118]:
obj[obj<2]

a    0
b    1
dtype: int64

<font color=red>Slicing with labels behaves differently than normal Python slicing: the endpoint is inclusive.</font>

In [119]:
obj['b':'c']

b    1
c    2
dtype: int64

In [120]:
obj['b':'c']=5

In [121]:
obj

a    0
b    5
c    5
d    3
dtype: int64
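
To make the endpoint rule concrete, this sketch slices the same kind of Series once by label and once by position, using the explicit `loc` and `iloc` accessors.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]   # label slice: both endpoints are included
by_position = obj.iloc[1:2]   # positional slice: the stop position is excluded

print(by_label)
print(by_position)
```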

Indexing into a DataFrame retrieves one or more columns, either with a single label or with a list of labels:

In [122]:
data=DataFrame(np.arange(16).reshape((4,4)),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['One','Two','Three','Four'])

In [123]:
data

Unnamed: 0,One,Two,Three,Four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [124]:
data['Two'] #retrieving one column or say, Series

Ohio         1
Colorado     5
Utah         9
New York    13
Name: Two, dtype: int64

In [125]:
data[['Four','Three']]

Unnamed: 0,Four,Three
Ohio,3,2
Colorado,7,6
Utah,11,10
New York,15,14


In [126]:
data[:2]  # select the first two rows

Unnamed: 0,One,Two,Three,Four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [127]:
data[data['Three']>5]

Unnamed: 0,One,Two,Three,Four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [128]:
data<5

Unnamed: 0,One,Two,Three,Four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [129]:
data[data<5]=0

In [130]:
data

Unnamed: 0,One,Two,Three,Four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


#### Selection with loc and iloc
For DataFrame label-indexing on the rows, I introduce the special indexing operators
`loc` and `iloc`. They enable you to select a subset of the rows and columns from a
DataFrame with NumPy-like notation using either axis labels (`loc`) or integers
(`iloc`).

##### DataFrame.loc[row label] selects by label; DataFrame.iloc[row position] selects by integer position

In [131]:
data.loc['Colorado',['Two','Three']]

Two      5
Three    6
Name: Colorado, dtype: int64

In [134]:
data.iloc[[2,3],[3,0,1]]

Unnamed: 0,Four,One,Two
Utah,11,8,9
New York,15,12,13


In [135]:
data.iloc[2]

One       8
Two       9
Three    10
Four     11
Name: Utah, dtype: int64

<font color=red>With `loc`, the endpoint of a label slice is inclusive.</font>

In [136]:
data.loc[:'Utah','Two']

Ohio        0
Colorado    5
Utah        9
Name: Two, dtype: int64

In [138]:
data.iloc[:,:3]

Unnamed: 0,One,Two,Three
Ohio,0,0,0
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [137]:
data.iloc[:,:3][data.Three>5]

Unnamed: 0,One,Two,Three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14
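
As a recap of the two accessors, the sketch below selects the same cells by label and by position, then combines a boolean row filter with a label slice on the columns in a single `loc` call. It rebuilds the frame from scratch, so the values are the original 0 to 15.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["One", "Two", "Three", "Four"])

by_label = data.loc["Colorado":"Utah", ["Two", "Three"]]  # label slice, inclusive
by_position = data.iloc[1:3, [1, 2]]                      # the same cells by position
filtered = data.loc[data["Three"] > 5, :"Three"]          # boolean rows, sliced columns

print(filtered)
```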


#### Integer Indexes
Working with pandas objects indexed by integers is something that often trips up
new users due to some differences with indexing semantics on built-in Python data
structures like lists and tuples. For example, you might not expect the following code
to generate an error:

In [139]:
ser=Series(np.arange(3.))
ser
# ser[-1] raises a KeyError, because -1 is interpreted as a label rather than a position

0    0.0
1    1.0
2    2.0
dtype: float64

In [141]:
ser2=Series(np.arange(3.), index=['a','b','c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [142]:
ser2[-1]

2.0

In [143]:
ser[:1]

0    0.0
dtype: float64

In [144]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [145]:
ser.iloc[:1]

0    0.0
dtype: float64
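
The ambiguity can be avoided by always stating which kind of indexing is meant. A minimal sketch, assuming a Series with the default integer index:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))   # default integer index 0, 1, 2

try:
    ser[-1]                       # -1 is looked up as a label, which does not exist
    raised = False
except KeyError:
    raised = True

last = ser.iloc[-1]               # explicit positional access
up_to_label_1 = ser.loc[:1]       # label slice: includes label 1
first_position = ser.iloc[:1]     # positional slice: excludes position 1
```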

## Arithmetic and data alignment
An important pandas feature for some applications is the behavior of arithmetic
between objects with different indexes. When you are adding together objects, if any
index pairs are not the same, the respective index in the result will be the union of the
index pairs.

In [146]:
s1=Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])

In [147]:
s2=Series([-2.1,3.6,-1.5,4,3.1], index=['a','c','e','f','g'])

In [148]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [149]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [150]:
s1+s2  # values are combined where the index labels match

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

##### Alignment is performed on both the rows and the columns for DataFrames

In [151]:
df1=DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'),
             index=['Ohio','Texas','Colorado'])

In [152]:
df2=DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'),
             index=['Utah','Ohio','Texas','Oregon'])

In [153]:
df1

Unnamed: 0,b,c,d
Ohio,0,1,2
Texas,3,4,5
Colorado,6,7,8


In [154]:
df2

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


Adding these together returns a DataFrame whose index and columns are the unions of those in each DataFrame.

In [155]:
df1+df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear
as all missing in the result. The same holds for the rows whose labels are not common
to both objects.

If you add DataFrame objects with no column or row labels in common, the result
will contain all nulls:

In [156]:
df1=DataFrame({'A':[1,2]})
df1

Unnamed: 0,A
0,1
1,2


In [157]:
df2=DataFrame({'B':[3,4]})
df2

Unnamed: 0,B
0,3
1,4


In [158]:
df1+df2

Unnamed: 0,A,B
0,,
1,,


In [159]:
df1-df2

Unnamed: 0,A,B
0,,
1,,


### Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other:

In [160]:
df1=DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))

In [161]:
df2=DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))

In [162]:
df1

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [163]:
df2

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19


In [164]:
df2.loc[1,'b']=np.nan
df2

Unnamed: 0,a,b,c,d,e
0,0,1.0,2,3,4
1,5,,7,8,9
2,10,11.0,12,13,14
3,15,16.0,17,18,19


Adding these together results in NA values in the locations that don’t overlap:

In [165]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


##### Using the `add` method on df1, I pass `df2` and an argument to `fill_value`:

In [166]:
# To avoid the NaN values in df1+df2, use df1.add(); a value missing from one frame is replaced by fill_value
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [167]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [168]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:

In [169]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,0
1,4,5,6,7,0
2,8,9,10,11,0
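
The same ideas apply to Series. The sketch below uses two small made-up Series to contrast the `+` operator with `add(..., fill_value=0)` and to show a reversed method, `rdiv`.

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

plain = a + b                    # labels present on only one side give NaN
filled = a.add(b, fill_value=0)  # a missing operand is replaced by 0 first
reciprocal = a.rdiv(1)           # r-prefixed methods flip the operands: 1 / a

print(plain)
print(filled)
```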


### Operations between DataFrame and Series

In [170]:
arr=np.arange(12.).reshape((3,4))

In [171]:
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [172]:
arr[0]

array([0., 1., 2., 3.])

In [173]:
arr-arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

When we subtract `arr[0]` from arr, the subtraction is performed once for each row.
This is referred to as <font color=red>broadcasting</font> and is explained in more detail as it relates to general
NumPy arrays in Appendix A.

In [174]:
frame=DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
               index=['Utah','Ohio','Texas','Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [175]:
series=frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame’s columns, broadcasting down the rows:

In [176]:
frame-series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [177]:
series2=Series(range(3), index=list('bef'))
series2

b    0
e    1
f    2
dtype: int64

In [178]:
frame+series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [179]:
series3=frame['d']

In [180]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [181]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [182]:
frame.sub(series3, axis=0)  # axis=0 matches on the row index and broadcasts across the columns

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
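
A common use of row-wise broadcasting is centering each row on its own mean. This is a sketch of that pattern on the same kind of frame; `row_means` and `centered` are names chosen for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

row_means = frame.mean(axis=1)           # one value per row, indexed by row label
centered = frame.sub(row_means, axis=0)  # align on rows, broadcast across columns

print(centered)
```

Writing `frame - row_means` instead would try to match the row labels against the column labels and produce only missing values.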


## Function application and mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [None]:
#np.random.randn Return a sample (or samples) from the “standard normal” distribution.
frame=DataFrame(np.random.randn(4,3), columns=list('bde'),
               index=['Utah','Ohio','Texas','Oregon'])

In [None]:
frame

In [None]:
np.abs(frame) #taking absolute value

In [None]:
f=lambda x: x.max()-x.min()

In [None]:
frame.apply(f)  # by default (axis=0), f is applied once per column

In [None]:
frame.apply(f, axis=1)

In [None]:
def f(x):
    return Series([x.min(), x.max()], index=['min','max'])

In [None]:
frame.apply(f)  # minimum and maximum of each column

In [None]:
frame.apply(f, axis=1)  # minimum and maximum of each row
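
Because the cells above use random data, here is a deterministic sketch of the same pattern with made-up values. The function name `spread` is hypothetical. It also shows `Series.map` for element-wise work on a single column.

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.0, 4.0, 2.0], "d": [10.0, 30.0, 20.0]},
                     index=["Utah", "Ohio", "Texas"])

def spread(x):
    """Range (max minus min) of a one-dimensional Series."""
    return x.max() - x.min()

per_column = frame.apply(spread)               # axis=0 (default): one result per column
per_row = frame.apply(spread, axis=1)          # axis=1: one result per row
labels = frame["b"].map(lambda v: f"{v:.2f}")  # element-wise formatting of one column
```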

## Sorting and ranking

In [None]:
obj=Series(range(4), index=list('dabc'))

In [None]:
obj.sort_index() #sort by row or column index

In [None]:
frame=DataFrame(np.arange(8.).reshape((2,4)), index=['Three','One'],
               columns=list('dabc'))

In [None]:
frame.sort_index()  # sort by row index

In [None]:
frame.sort_index(axis=1)  # sort by column index

The data is sorted in ascending order by default. Pass ascending=False to sort in descending order:

In [None]:
frame.sort_index(axis=1, ascending=False)

In [None]:
obj=Series([4,7,-3,2])

In [None]:
obj.sort_values()

In [None]:
obj=Series([4,np.nan, 7, np.nan, -3,2])

In [None]:
obj.sort_values()

In [None]:
frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})

In [None]:
frame

In [None]:
frame.sort_values(by='b')

In [None]:
frame.sort_values(by=['a','b'])
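
One detail worth checking is where missing values end up. A minimal sketch: by default `sort_values` places NaN at the end, and the `na_position` parameter moves it to the front.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

ascending = obj.sort_values()                     # NaN values go to the end
nan_first = obj.sort_values(na_position="first")  # or ask for them up front

print(ascending)
```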

### Ranking
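Ranking assigns ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule; by default, `rank` assigns each group of tied values the mean rank.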

In [None]:
obj=Series([7,-5,7,4,2,0,4])

In [None]:
obj.rank()

In [None]:
obj.rank(method='first')

In [None]:
obj.rank(ascending=False, method='max')

In [None]:
frame=DataFrame({'b':[4.3,7,-3,2], 'a':[0,1,0,1],
                'c':[-2,5,8,-2.5]})

In [None]:
frame

In [None]:
frame.rank(axis=1)
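
The tie-breaking rules are easiest to compare on one input. This sketch ranks the Series from above with three of the available `method` values.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()              # ties share the mean of their ranks
first = obj.rank(method="first")  # ties broken by order of appearance
lowest = obj.rank(method="min")   # ties all take the smallest rank in the group

print(pd.DataFrame({"average": average, "first": first, "min": lowest}))
```

The two 7s occupy ranks 6 and 7, so they receive 6.5 under the default, 6 and 7 under `"first"`, and 6 each under `"min"`.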

### Axis indexes with duplicate values

In [None]:
obj=Series(np.arange(5), index=list('aabbc'))

In [None]:
obj

The index's `is_unique` property tells you whether its labels are unique:

In [None]:
obj.index.is_unique  # check whether the index labels are unique

In [None]:
obj['a']

In [None]:
df=DataFrame(np.random.randn(4,3), index=list('aabb'))

In [None]:
df

In [None]:
df.loc['b']
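
Duplicate labels change what a selection returns, which can complicate downstream code. A short sketch on the same kind of Series:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=list("aabbc"))

repeated = obj["a"]  # the label occurs twice, so the result is a Series
single = obj["c"]    # the label occurs once, so the result is a scalar

print(type(repeated).__name__, type(single).__name__)
```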

## Summarizing and Computing Descriptive Statistics
pandas reduction and summary-statistic methods are built from the ground up to exclude missing data.

In [None]:
df=DataFrame([[1.4,np.nan],[7.1,-4.5],
            [np.nan,np.nan],[0.75,-1.3]],
            index=list('abcd'),
            columns=['One','Two'])

In [None]:
df

In [None]:
df.sum()  # returns a Series of column sums (summing down each column is the default)

In [None]:
df.sum(axis=1)  # sum across each row

In [None]:
df.mean(axis=1, skipna=False)

In [None]:
df.idxmax()

In [None]:
df.idxmin()

In [None]:
df.cumsum()  # cumulative sum down each column

In [None]:
df.describe()
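
Since the cells above are shown without output, the following sketch repeats a few of the reductions on the same frame and keeps the results in named variables. It illustrates how `skipna` changes the handling of missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list("abcd"), columns=["One", "Two"])

row_sum = df.sum(axis=1)                     # NaN is skipped; an all-NaN row sums to 0
strict_mean = df.mean(axis=1, skipna=False)  # any NaN in a row makes the result NaN
top_label = df.idxmax()                      # index label of each column's maximum

print(row_sum)
print(strict_mean)
```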

### Correlation and Covariance

In [None]:
import pandas_datareader.data as web

In [None]:
all_data={}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.get_data_yahoo(ticker, '1/1/2000','1/1/2010')

In [None]:
all_data

In [None]:
price=DataFrame({tic:data['Adj Close']
                for tic, data in all_data.items()})

In [None]:
volume=DataFrame({tic: data['Volume']
                 for tic, data in all_data.items()})

In [None]:
price

In [None]:
returns=price.pct_change() #Percentage change between the current and a prior element.

In [None]:
returns.tail()

In [None]:
returns.MSFT.corr(returns.IBM)

In [None]:
returns.MSFT.cov(returns.IBM)

In [None]:
returns.corr()

In [None]:
returns.cov()

In [None]:
returns.corrwith(returns.IBM)
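
The cells above need a network connection and a working data source. As an offline alternative, this sketch applies the same methods to synthetic data; the tickers `AAA`, `BBB`, and `CCC` and the random "returns" are invented for illustration and are not market data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                      # fixed seed, synthetic data
base = rng.normal(size=250)
returns = pd.DataFrame({
    "AAA": base,
    "BBB": base + rng.normal(scale=0.5, size=250),  # built to correlate with AAA
    "CCC": rng.normal(size=250),                    # independent noise
})

pair_corr = returns["AAA"].corr(returns["BBB"])     # correlation of two Series
corr_matrix = returns.corr()                        # full pairwise matrix
against_aaa = returns.corrwith(returns["AAA"])      # each column against one Series

print(corr_matrix.round(2))
```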

## Unique Values, Value Counts, and Membership

In [None]:
obj=Series(list('cadaabbcc'))

In [None]:
obj

In [None]:
uniques=obj.unique()

In [None]:
uniques

In [None]:
obj.value_counts()

In [None]:
pd.value_counts(obj.values, sort=False)  # sort=False: do not sort the result by count

In [None]:
mask=obj.isin(['b','c'])

In [None]:
mask

In [None]:
obj[mask]

In [None]:
data=DataFrame({'Qu1':[1,3,4,3,4],
               'Qu2':[2,3,1,2,3],
               'Qu3':[1,5,2,4,4]})

In [None]:
data

In [None]:
result=data.apply(pd.value_counts).fillna(0)

In [None]:
result
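
The sketch below gathers the three operations of this section on the same inputs. For the per-column histogram it calls the `value_counts` method inside a lambda, which is an alternative spelling of the `pd.value_counts` call used above.

```python
import pandas as pd

obj = pd.Series(list("cadaabbcc"))

uniques = obj.unique()              # distinct values, in order of first appearance
counts = obj.value_counts()         # frequencies, most common first
subset = obj[obj.isin(["b", "c"])]  # keep only the entries whose value is listed

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Rows are the distinct values found anywhere in the frame; each cell counts
# how often that value occurs in the column (0 where it never occurs).
histogram = data.apply(lambda col: col.value_counts()).fillna(0)
print(histogram)
```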

## Handling Missing Data

In [None]:
string_data=Series(['aardvark','artichoke', np.nan, 'avocado'])

In [None]:
string_data

In [None]:
# check for missing values
string_data.isnull()

The built-in Python None value is also treated as NA in object arrays.

In [None]:
string_data[0]=None

In [None]:
string_data

In [None]:
string_data.isnull()

### Filtering Out Missing Data

In [None]:
from numpy import nan as NA

In [None]:
data=Series([1,NA, 3.5, NA, 7])

In [None]:
data

In [None]:
data.dropna()

In [None]:
data[data.notnull()]

On a DataFrame, `dropna` by default drops any row containing a missing value.

In [None]:
data=DataFrame([[1,6.5,3],[1.,NA,NA],
              [NA,NA,NA],[NA,6.5,3.]])

In [None]:
cleaned=data.dropna()

In [None]:
data

In [None]:
cleaned

Passing how='all' will only drop rows that are all NA.

In [None]:
data.dropna(how='all')

In [None]:
data[4]=NA

In [None]:
data

In [None]:
data.dropna(axis=1, how='all')

In [None]:
df=DataFrame(np.random.randn(7,3))

In [None]:
df.iloc[:4,1]=NA

In [None]:
df.iloc[:2,2]=NA

In [None]:
df

In [None]:
df.dropna(thresh=3)
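
Because the frame above is random, the effect of each option is easier to see on fixed data. A minimal sketch with made-up values comparing the default, `how="all"`, and `thresh`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0],
                   "y": [1.0, np.nan, np.nan, 4.0],
                   "z": [1.0, 2.0, np.nan, np.nan]})

no_missing = df.dropna()             # keep rows with no NaN at all
not_all_missing = df.dropna(how="all")  # drop rows only when every value is NaN
at_least_two = df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values

print(at_least_two)
```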

### Filling in Missing Data

In [None]:
df.fillna(0)

In [None]:
df.fillna({1:0.5, 2:-1})

In [None]:
data=Series([1,NA,3.5,NA,7])

In [None]:
data.fillna(data.mean())
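
A deterministic sketch of three filling strategies on made-up data: a different constant per column, a forward fill limited to one consecutive gap, and a fill with the column mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                   "b": [np.nan, 2.0, np.nan, np.nan]})

per_column = df.fillna({"a": 0.0, "b": -1.0})  # a different constant for each column
carried = df.ffill(limit=1)                    # forward-fill at most one consecutive gap
with_mean = df["a"].fillna(df["a"].mean())     # fill with a statistic of the data

print(carried)
```

Each of these calls returns a new object and leaves `df` unchanged.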

## Hierarchical Indexing

In [None]:
data=Series(np.random.randn(10),
           index=[list('aaabbbccdd'),[1,2,3,1,2,3,1,2,2,3]])

In [None]:
data

The index of this Series is hierarchical (a `MultiIndex`):

In [None]:
data.index

In [None]:
data['b']

In [None]:
data.loc[['b','d']]

Selection is possible from an 'inner' level

In [None]:
data[:,2]

In [None]:
data.unstack()
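
Here is a small deterministic sketch of the same operations, using a six-element Series with made-up values so the results can be checked by eye. It also shows that `stack` is the inverse of `unstack`.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(6.0),
                 index=[list("aabbcc"), [1, 2, 1, 2, 1, 2]])

inner = data.loc[:, 2]      # every outer key, inner level equal to 2
wide = data.unstack()       # the inner level becomes the columns
round_trip = wide.stack()   # stack() moves the columns back into the index

print(wide)
```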

### Hierarchical Indexing for DataFrame

In [None]:
frame=DataFrame(np.arange(12.).reshape((4,3)),
               index=[list('aabb'), [1,2,1,2]],
               columns=[['Ohio','Ohio','Colorado'],
                       ['Green','Red','Green']])

In [None]:
frame

In [None]:
frame.index.names=['key1','key2']  # name the row index levels

In [None]:
frame.columns.names=['state','color']  # name the column index levels

In [None]:
frame

In [None]:
frame['Ohio']

In [None]:
frame['Colorado']

### Reordering and Sorting Levels