In [1]:
from pandas import Series, DataFrame

In [14]:
import pandas as pd
import numpy as np

## Series
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index.

In [3]:
obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Since we did not specify an index for the data, a default
one consisting of the integers 0 through N - 1 (where N is the length of the data) is
created. You can get the array representation and index object of the Series via its `values`
and `index` attributes, respectively:

In [4]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [7]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [8]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [9]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

In [10]:
obj2['d'] = 6

In [11]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

NumPy array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, preserve the index-value link:

In [12]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [13]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [15]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [16]:
'b' in obj2

True

Should you have data contained in a Python dict, you can create a Series from it by
passing the dict:

In [17]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [18]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In this case, the three values found in `sdata` were placed in the appropriate locations, but since
no value for `'California'` was found, it appears as `NaN` (not a number), which pandas uses to mark missing or NA values.
The `isnull` and `notnull` functions in pandas should be used to
detect missing data:

In [19]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
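
As a small sketch of how these checks are typically put to use, the snippet below rebuilds a Series like `obj4`, counts its missing entries, and keeps only the labels that have data. The names `missing_mask`, `n_missing`, and `present` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# Boolean mask that is True wherever a value is missing
missing_mask = pd.isnull(obj4)

# Summing a boolean mask counts its True entries
n_missing = int(missing_mask.sum())

# Keep only the labels that actually have data
present = obj4[pd.notnull(obj4)]

print(n_missing)
print(list(present.index))
```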

In [21]:
obj3+obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a `name` attribute:

In [23]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series’s index can be altered in place by assignment:

In [24]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [25]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

In [26]:
obj.index

Index(['Bob', 'Steve', 'Jeff', 'Ryan'], dtype='object')

## DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series, all sharing the same index.
There are numerous ways to construct a DataFrame, though one of the most common
is from a dict of equal-length lists or NumPy arrays:

In [27]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

In [28]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


If you specify a sequence of columns, the DataFrame’s columns will be exactly what
you pass:

In [29]:
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


As with Series, if you pass a column that isn’t contained in data, it will appear with NA
values in the result:

In [30]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [31]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [32]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

Rows can also be retrieved by name or by position, using the `loc` indexer for labels
and the `iloc` indexer for integer positions:

In [36]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [37]:
frame2['debt'] = 16.5

In [38]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [39]:
frame2['debt'] = np.arange(5)

In [40]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


When assigning lists or arrays to a column, the value’s length must match the length
of the DataFrame. If you assign a Series, it will instead be conformed exactly to the
DataFrame’s index, inserting missing values in any holes:

In [41]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


Assigning a column that doesn’t exist will create a new column. The `del` keyword will
delete columns as with a dict:

In [42]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [43]:
del frame2['eastern']

In [44]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

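The column returned when indexing a DataFrame may be a view on the underlying data rather than a copy, so in-place modifications to the Series
can be reflected in the DataFrame. Whether this happens depends on the pandas version and its copy-on-write setting, so when an independent object is needed the column should be explicitly copied
using the Series’s `copy` method.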
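
A minimal sketch of the explicit-copy route, assuming a small two-row frame invented for illustration (the name `pop_copy` is hypothetical). Modifying the copy leaves the DataFrame untouched regardless of how a given pandas version handles views.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Take an explicit, independent copy of the column
pop_copy = frame['pop'].copy()

# Modify the copy in place by position
pop_copy.iloc[0] = 99.0

# The DataFrame still holds the original values
print(frame['pop'].tolist())
print(pop_copy.tolist())
```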

In [45]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [46]:
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


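When a DataFrame is built from a nested dict of dicts, as with `frame3` above, the outer keys become the columns and the keys in the inner dicts are unioned to form the index in the result.
Older pandas versions also sorted these keys; the output above keeps them in the order they were first seen. This
isn’t true if an explicit index is specified: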

In [47]:
DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [48]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


If a DataFrame’s index and columns have their name attributes set, these will also be
displayed:

In [50]:
frame3.index.name = 'year'; frame3.columns.name = 'state'

In [51]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


Like Series, the `values` attribute returns the data contained in the DataFrame as a 2D
ndarray:

In [52]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [53]:
frame3.values.dtype

dtype('float64')

If the DataFrame’s columns are different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns:

In [54]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

![Possible data inputs to DataFrame constructor](./images/1.jpg)

## Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [58]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index

In [59]:
index

Index(['a', 'b', 'c'], dtype='object')

In [60]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user. Immutability matters because it lets Index objects be shared safely among data structures:

In [61]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

True
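
The immutability itself can be checked with a short sketch. The flag `mutated` and the name `new_index` are illustrative; the point is that item assignment on an Index raises a `TypeError`, while methods such as `delete` and `insert` return a new Index and leave the original alone.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'  # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

# Build a modified Index instead of changing the existing one
new_index = index.delete(1).insert(1, 'd')

print(mutated)
print(list(index), list(new_index))
```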

![2](./images/2.jpg)

In addition to being array-like, an Index also functions as a fixed-size set:

In [62]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [63]:
'Ohio' in frame3.columns

True

In [64]:
2003 in frame3.index

False
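
Beyond membership tests, the set-like behavior extends to set operations between two Index objects. The sketch below uses two made-up indexes (`left` and `right` are illustrative names) with the `intersection`, `union`, and `difference` methods.

```python
import pandas as pd

left = pd.Index(['Ohio', 'Texas', 'Utah'])
right = pd.Index(['Texas', 'Oregon'])

# Labels present in both indexes
common = left.intersection(right)

# Labels present in either index
combined = left.union(right)

# Labels in `left` that are absent from `right`
only_left = left.difference(right)

print(list(common), list(combined), list(only_left))
```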

![3](./images/3.jpg)

In [69]:
frame3.index.is_unique

True

## Essential Functionality

### Reindexing

A critical method on pandas objects is `reindex`, which means to create a new object
with the data conformed to a new index.

In [70]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [71]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [72]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [73]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The `method` option allows us to do this, using a method such
as `ffill` which forward fills the values:

In [74]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

![4](./images/4.jpg)

With DataFrame, `reindex` can alter either the (row) index, columns, or both. When
passed just a sequence, the rows are reindexed in the result:

In [76]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [77]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed using the `columns` keyword:

In [78]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [89]:
frame.reindex(['a', 'b', 'c', 'd'], 
              columns=states).ffill()

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,1.0,,2.0
c,4.0,,5.0
d,7.0,,8.0


In [95]:
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


![5](./images/5.jpg)

## Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the `drop`
method instead returns a new object with the indicated value or values deleted from an axis:

In [97]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [98]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [99]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis:

In [101]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [102]:
data.drop(['Colorado','Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [104]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [105]:
data.drop(['two', 'four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
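
As a sketch of the same idea with keyword arguments, `drop` also accepts `index=` and `columns=`, which avoids having to remember the `axis` number. The name `trimmed` is illustrative. Because `drop` returns a new object, the original frame keeps its shape.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Remove one row label and one column label in a single call
trimmed = data.drop(index=['Ohio'], columns=['two'])

print(trimmed.shape)  # shape of the new object
print(data.shape)     # the original is unchanged
```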


## Indexing, selection, and filtering

Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can
use the Series’s index values instead of only integers:

In [106]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']

1.0

In [107]:
obj[1]

1.0

In [108]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [109]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [110]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:

In [111]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64
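
To make the contrast explicit, the sketch below slices the same Series once by label with `loc` and once by position with `iloc`. The variable names `by_label` and `by_position` are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints are included
by_label = obj.loc['b':'c']

# Position-based slice: the stop position is excluded, as in plain Python
by_position = obj.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
```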

Indexing into a DataFrame retrieves one or more columns,
either with a single value or a sequence:

In [112]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [113]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [116]:
data.loc['Ohio']

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

In [117]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


Indexing like this has a few special cases. The first is selecting rows by slicing or with a boolean
array:

In [122]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [123]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [124]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [129]:
data.loc['Colorado'][['two','three']]

two      5
three    6
Name: Colorado, dtype: int32

In [130]:
data.loc[data.three>5][:3]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
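
The chained selections above can also be written as a single indexer call that takes a row selector and a column selector together. This is a sketch on a freshly built copy of `data` (so the values are the original 0 through 15, not the zeroed ones); `row_subset`, `block`, and `filtered` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# One row label and a list of column labels
row_subset = data.loc['Colorado', ['two', 'three']]

# The same idea by integer position: rows 1 and 2, columns 3 and 0
block = data.iloc[[1, 2], [3, 0]]

# A boolean row filter combined with a column selection
filtered = data.loc[data['three'] > 5, ['one', 'four']]

print(row_subset.tolist())
print(block.values.tolist())
print(filtered['one'].tolist())
```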


## Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not
the same, the respective index in the result will be the union of the index pairs.

In [131]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [132]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [133]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces NA values in the indices that don’t overlap.
Missing values propagate in arithmetic computations.
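
When the NA values introduced by alignment are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one operand as that value before adding. The sketch below reuses `s1` and `s2`; the name `total` is illustrative.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Labels found in only one Series are treated as 0 in the other
total = s1.add(s2, fill_value=0)

print(total)
```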

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [134]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [135]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [136]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In arithmetic operations between differently-indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other:

In [138]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [139]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [140]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [141]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


## Operations between DataFrame and Series

In [145]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [146]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame's columns, broadcasting down the rows:

In [147]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union:

In [148]:
series2 = Series(range(3), index=['b', 'e', 'f'])
frame+series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods. For example:

In [149]:
series3 = frame['d']
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [150]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [151]:
frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


In [152]:
frame.add(series3, axis=0)

Unnamed: 0,b,d,e
Utah,1.0,2.0,3.0
Ohio,7.0,8.0,9.0
Texas,13.0,14.0,15.0
Oregon,19.0,20.0,21.0
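
A common use of matching on the rows is centering each row on its own mean. The sketch below is one way to do that with `sub` and `axis=0`; the names `row_means` and `demeaned` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels
row_means = frame.mean(axis=1)

# axis=0 matches the Series index against the row labels
demeaned = frame.sub(row_means, axis=0)

print(demeaned)
```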


## Function application and mapping

In [153]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,1.788886,0.654169,0.068098
Ohio,1.496934,-0.401634,0.0295
Texas,0.983181,1.253863,-0.840517
Oregon,0.172844,-0.639259,0.710486


In [154]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.788886,0.654169,0.068098
Ohio,1.496934,0.401634,0.0295
Texas,0.983181,1.253863,0.840517
Oregon,0.172844,0.639259,0.710486


Another frequent operation is applying a function on 1D arrays to each column or row.
DataFrame’s `apply` method does exactly this:

In [155]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.616042
d    1.893122
e    1.551003
dtype: float64

In [156]:
frame.apply(f, axis=1)

Utah      1.720788
Ohio      1.898568
Texas     2.094380
Oregon    1.349745
dtype: float64

Many of the most common array statistics (like `sum` and `mean`) are DataFrame methods,
so using `apply` is not necessary for them.
The function passed to `apply` need not return a scalar value; it can also return a Series
with multiple values:

In [157]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,0.172844,-0.639259,-0.840517
max,1.788886,1.253863,0.710486


Element-wise Python functions can be used, too. Suppose you wanted to compute a
formatted string from each floating point value in `frame`. You can do this with `applymap` (deprecated since pandas 2.1 in favor of the equivalent `DataFrame.map`):

In [158]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,1.79,0.65,0.07
Ohio,1.5,-0.4,0.03
Texas,0.98,1.25,-0.84
Oregon,0.17,-0.64,0.71


In [159]:
frame['e'].map(format)

Utah       0.07
Ohio       0.03
Texas     -0.84
Oregon     0.71
Name: e, dtype: object
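
Because the name of the DataFrame-level element-wise method differs between pandas versions, one version-neutral sketch is to combine `apply` with `Series.map`, applying the formatter column by column. The small frame and the names `fmt` and `formatted` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame([[1.788886, -0.401634], [0.983181, 0.029]],
                     columns=['b', 'd'], index=['Utah', 'Ohio'])

def fmt(x):
    # Format a single float with two decimal places
    return '%.2f' % x

# apply hands each column (a Series) to the lambda; map works element-wise
formatted = frame.apply(lambda col: col.map(fmt))

print(formatted)
```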

## Sorting and ranking

To sort lexicographically by row or column index, use the `sort_index` method, which returns
a new, sorted object:

In [160]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [161]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [162]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [163]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its `sort_values` method. Any missing values are sorted to the end of the Series by default:

In [166]:
obj = Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [167]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

On DataFrame, you may want to sort by the values in one or more columns. To do so,
pass one or more column names to the `by` option:

In [168]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [169]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [171]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [172]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
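
When sorting by several columns, the `ascending` option also accepts one flag per column. The sketch below sorts `a` in ascending order and breaks ties with `b` in descending order; `mixed` is an illustrative name.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# 'a' ascending first, then 'b' descending within each value of 'a'
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(mixed)
```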


Ranking is closely related to sorting, assigning ranks from one through the number of
valid data points in an array. It is similar to the indirect sort indices produced by
`numpy.argsort`, except that ties are broken according to a rule. The `rank` methods for
Series and DataFrame are the place to look; by default `rank` breaks ties by assigning
each group the mean rank:

In [173]:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order they’re observed in the data:

In [174]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [175]:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
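
Two other tie-breaking rules are worth a quick sketch: `'min'` gives every member of a tied group the lowest rank in that group, and `'dense'` does the same but without leaving gaps between groups. The variable names are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Tied values share the lowest rank of their group
min_ranks = obj.rank(method='min')

# Like 'min', but ranks increase by exactly 1 between groups
dense_ranks = obj.rank(method='dense')

print(min_ranks.tolist())
print(dense_ranks.tolist())
```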

DataFrame can compute ranks over the rows or the columns:

In [176]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})

In [177]:
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [178]:
frame.rank(axis=1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


![6](./images/6.jpg)

## Axis indexes with duplicate values

In [180]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [181]:
obj.index.is_unique

False

In [182]:
obj['a']

a    0
a    1
dtype: int64

In [183]:
obj['c']

4

In [184]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,1.599726,-0.289813,1.158162
a,0.432212,-1.407708,-0.731988
b,0.853975,0.379761,0.671064
b,0.770217,-0.848009,1.012745


In [186]:
df.loc['b']

Unnamed: 0,0,1,2
b,0.853975,0.379761,0.671064
b,0.770217,-0.848009,1.012745


## Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data:

In [187]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
               [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [188]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [189]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

NA values are excluded by default, which is why the all-NA row `c` sums to 0 in the output above. This
can be disabled using the `skipna` option:

In [190]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
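
If an all-NA row should produce NA rather than 0 when summing, one option is the `min_count` argument of `sum`, which sets how many valid values are required for a result. The sketch below reuses `df`; `default_totals` and `strict_totals` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default: NA values are skipped, so the all-NA row sums to 0
default_totals = df.sum(axis=1)

# Require at least one valid value, otherwise the result is NA
strict_totals = df.sum(axis=1, min_count=1)

print(default_totals)
print(strict_totals)
```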

![7](./images/7.jpg)

Some methods, like `idxmin` and `idxmax`, return indirect statistics like the index value
where the minimum or maximum values are attained:

In [191]:
df.idxmax()

one    b
two    d
dtype: object

In [192]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [193]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [194]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [196]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

![8](./images/8.jpg)

## Unique Values, Value Counts, and Membership

In [198]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [199]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [200]:
mask = obj.isin(['b', 'c'])

In [201]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [202]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

## Handling Missing Data

pandas uses the floating point value `NaN` (Not a Number) to represent missing data in
both floating point and non-floating point arrays:

In [203]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [204]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python `None` value is also treated as NA in object arrays
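
A short sketch of that behavior, using a made-up Series that mixes `None` and `np.nan` (the name `mask` is illustrative). Both are reported as missing by `isnull`.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', None, np.nan, 'avocado'])

# None and NaN are both detected as missing values
mask = string_data.isnull()

print(mask.tolist())
```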

![9](./images/9.jpg)

## Filtering Out Missing Data

In [205]:
data = Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [206]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, these are a bit more complex. You may want to drop rows
or columns which are all NA or just those containing any NAs. `dropna` by default drops
any row containing a missing value:

In [208]:
data = DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                  [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [209]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing `how='all'` will only drop rows that are all NA:

In [210]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [212]:
data[3] = np.nan
data

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [213]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Suppose you want to keep only rows containing a certain number of observations. You can
indicate this with the `thresh` argument:

In [217]:
df = DataFrame(np.random.randn(7, 3))
df.iloc[:5, 1] = np.nan; df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,-0.102184,,
1,0.088611,,
2,-0.402805,,-0.194768
3,-0.641586,,-0.626022
4,2.11817,,1.350163
5,0.713504,2.870145,-0.406597
6,0.255229,0.14424,1.233549


In [218]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2
5,0.713504,2.870145,-0.406597
6,0.255229,0.14424,1.233549


## Filling in Missing Data


In [219]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.102184,0.0,0.0
1,0.088611,0.0,0.0
2,-0.402805,0.0,-0.194768
3,-0.641586,0.0,-0.626022
4,2.11817,0.0,1.350163
5,0.713504,2.870145,-0.406597
6,0.255229,0.14424,1.233549


By calling `fillna` with a dict, you can use a different fill value for each column:

In [220]:
df.fillna({1: 0.5, 3: -1})

Unnamed: 0,0,1,2
0,-0.102184,0.5,
1,0.088611,0.5,
2,-0.402805,0.5,-0.194768
3,-0.641586,0.5,-0.626022
4,2.11817,0.5,1.350163
5,0.713504,2.870145,-0.406597
6,0.255229,0.14424,1.233549


`fillna` returns a new object, but you can modify the existing object in place:

In [221]:
df.fillna(0, inplace=True)

In [222]:
df

Unnamed: 0,0,1,2
0,-0.102184,0.0,0.0
1,0.088611,0.0,0.0
2,-0.402805,0.0,-0.194768
3,-0.641586,0.0,-0.626022
4,2.11817,0.0,1.350163
5,0.713504,2.870145,-0.406597
6,0.255229,0.14424,1.233549


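The same interpolation methods available for reindexing can be used with `fillna` (recent pandas versions deprecate the `method` argument in favor of calling `ffill` or `bfill` directly):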

In [224]:
df = DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = np.nan; df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,-1.379822,0.238129,0.325428
1,0.678036,-0.492668,-0.202141
2,0.36283,,1.342329
3,0.496226,,0.458841
4,0.342856,,
5,1.796486,,


In [225]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-1.379822,0.238129,0.325428
1,0.678036,-0.492668,-0.202141
2,0.36283,-0.492668,1.342329
3,0.496226,-0.492668,0.458841
4,0.342856,-0.492668,0.458841
5,1.796486,-0.492668,0.458841


In [226]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-1.379822,0.238129,0.325428
1,0.678036,-0.492668,-0.202141
2,0.36283,-0.492668,1.342329
3,0.496226,-0.492668,0.458841
4,0.342856,,0.458841
5,1.796486,,0.458841
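
Two further filling strategies, sketched on a small deterministic Series: a limited forward fill through the `ffill` method, and filling with a summary statistic such as the mean. The names `forward` and `mean_filled` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., np.nan, 3.5, np.nan, 7])

# Carry the last valid value forward, at most one step per gap
forward = data.ffill(limit=1)

# Replace missing values with the mean of the valid ones
mean_filled = data.fillna(data.mean())

print(forward.tolist())
print(mean_filled.tolist())
```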


![10](./images/10.jpg)

## Hierarchical Indexing
Hierarchical indexing is an important feature of pandas enabling you to have multiple
(two or more) index levels on an axis:

In [227]:
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1   -0.075454
   2    0.484907
   3   -0.870147
b  1   -2.057312
   2   -0.476106
   3    0.480471
c  1   -1.533881
   2   -1.416894
d  2   -0.009811
   3   -0.860112
dtype: float64

In [228]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [229]:
data['b']

1   -2.057312
2   -0.476106
3    0.480471
dtype: float64

In [230]:
data['b':'c']

b  1   -2.057312
   2   -0.476106
   3    0.480471
c  1   -1.533881
   2   -1.416894
dtype: float64

In [232]:
data.loc[['b','d']]

b  1   -2.057312
   2   -0.476106
   3    0.480471
d  2   -0.009811
   3   -0.860112
dtype: float64

In [233]:
data[:, 2]

a    0.484907
b   -0.476106
c   -1.416894
d   -0.009811
dtype: float64

This data could be rearranged into a DataFrame
using its `unstack` method:

In [234]:
data.unstack()

Unnamed: 0,1,2,3
a,-0.075454,0.484907,-0.870147
b,-2.057312,-0.476106,0.480471
c,-1.533881,-1.416894,
d,,-0.009811,-0.860112


The inverse operation of `unstack` is `stack`:

In [235]:
data.unstack().stack()

a  1   -0.075454
   2    0.484907
   3   -0.870147
b  1   -2.057312
   2   -0.476106
   3    0.480471
c  1   -1.533881
   2   -1.416894
d  2   -0.009811
   3   -0.860112
dtype: float64
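
The round trip is easiest to verify on a small deterministic Series in which every combination of levels is present, so no missing values are involved. This is a sketch with made-up values; `wide` and `long` are illustrative names.

```python
import pandas as pd

data = pd.Series([1., 2., 3., 4.],
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# Inner index level becomes the columns
wide = data.unstack()

# Columns are pivoted back into an inner index level
long = wide.stack()

print(wide)
print(long)
```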

In [236]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


The hierarchical levels can have names (as strings or any Python objects). If so, these
will show up in the console output:

In [237]:
frame.index.names = ['key1', 'key2']

In [238]:
frame.columns.names = ['state', 'color']

In [239]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [240]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
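
Partial selection like `frame['Ohio']` extends to full keys given as tuples, and the level names can be used for aggregation. The sketch below rebuilds the same frame; `ohio_red`, `row_a2`, and `key1_totals` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# A full column key is a tuple with one entry per column level
ohio_red = frame[('Ohio', 'Red')]

# A full row key works the same way through loc
row_a2 = frame.loc[('a', 2)]

# Aggregate over the inner row level by grouping on the outer one
key1_totals = frame.groupby(level='key1').sum()

print(ohio_red.tolist())
print(row_a2.tolist())
print(key1_totals)
```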


A MultiIndex can be created by itself and then reused; the columns in the above DataFrame with level names could be created like this:

In [242]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                          names=['state', 'color'])

MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])

## Integer Indexing

In [243]:
ser = Series(np.arange(3.))
ser[-1]

KeyError: -1

In [244]:
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]

2.0

In cases where you need reliable position-based indexing regardless of the index type,
you can use the `iloc` indexer, which is available on both Series and DataFrame:

In [248]:
ser3 = Series(range(3), index=[-5, 1, 3])
ser3.iloc[2]

2
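
The ambiguity is clearest when the index holds integers that do not match the positions. In the sketch below (values and the name `s` are made up), `loc` always interprets the key as a label and `iloc` always interprets it as a position.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 0, 1])

# Label-based: the element whose index label is 0
by_label = s.loc[0]

# Position-based: the first element, whatever its label
by_position = s.iloc[0]

# Negative positions count from the end, as with Python lists
last = s.iloc[-1]

print(by_label, by_position, last)
```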

In [257]:
frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])
frame

Unnamed: 0,0,1
2,0,1
0,2,3
1,4,5


In [253]:
frame.iloc[0]

0    0
1    1
Name: 2, dtype: int32

In [256]:
frame.iloc[:, 0]

2    0
0    2
1    4
Name: 0, dtype: int32

In [258]:
frame[0]

2    0
0    2
1    4
Name: 0, dtype: int32