## 1 Introduction to pandas Data Structures

NumPy is best suited for working with homogeneous numerical array data;
pandas is designed for working with tabular or heterogeneous data.

In [1]:
import numpy as np

In [2]:
import pandas as pd

In [3]:
from pandas import Series, DataFrame

### 1.1 Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its ___index___.

In [4]:
obj = pd.Series([4, 6, -1, 2])

In [5]:
obj

0    4
1    6
2   -1
3    2
dtype: int64

In [8]:
obj.values  # array representation of the Series

array([ 4,  6, -1,  2])

In [9]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Create a Series with index labels

In [22]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [23]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [24]:
obj2.index

Index([u'd', u'b', u'a', u'c'], dtype='object')

In [25]:
obj2['a']

-5

In [26]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

In [27]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [28]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

Think of a Series as a fixed-length, ordered dict.

In [30]:
'b' in obj2

True

Create a Series from a Python dict

In [32]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [33]:
obj3 = pd.Series(sdata)

In [34]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

When only a dict is passed, the index of the resulting Series holds the dict's keys, in sorted order in the pandas version used here (more recent versions keep the dict's insertion order). To override this, pass the dict keys as the index in the order you want them to appear:

In [35]:
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

In [36]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

NaN marks missing or NA values. Use ___isnull___ and ___notnull___ to detect missing data:

In [37]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [38]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
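As a small sketch of how the missing-data checks are typically used, the mask returned by `notnull` can filter a Series down to the labels that actually have data. The variable names `present`, `known`, and `missing_states` are illustrative and not part of the original notes.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
# 'California' is not a key of sdata, so its value becomes NaN
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Boolean mask that is True wherever a value is present
present = obj4.notnull()

# Keep only the entries that have data
known = obj4[present]

# Labels whose values are missing
missing_states = obj4[obj4.isnull()].index.tolist()
```

Here `known` keeps Ohio, Oregon, and Texas, while `missing_states` lists only California.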

A Series automatically aligns by index label in arithmetic operations, which is similar to a join operation:

In [40]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [41]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [42]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a name attribute.

In [43]:
obj4.name = 'population'

In [48]:
obj4.index.name = 'state'

In [49]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in-place by assignment

In [51]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [52]:
obj

Bob      4
Steve    6
Jeff    -1
Ryan     2
dtype: int64

### 1.2 DataFrame

Construct a DataFrame from a dict of equal-length lists or NumPy arrays

In [54]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002, 2003],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [55]:
frame = pd.DataFrame(data)

In [56]:
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002
5,3.2,Nevada,2003


In [58]:
frame.head(3)

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002


Arrange the columns by specifying a sequence of columns

In [59]:
pd.DataFrame(data, columns=['year','state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


If you pass a column that isn't contained in the dict, it appears with missing values in the result:

In [61]:
frame2 = pd.DataFrame(data, columns=['year','state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four','five', 'six'])

In [62]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [63]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

Retrieve a column

In [64]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [65]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Retrieve a row

In [67]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Assign columns

In [72]:
frame2['debt'] = np.arange(6.)

In [73]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:

In [74]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [90]:
frame2['debt'] = val

In [91]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,
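The two assignment rules can be checked side by side. This is a minimal sketch on a hypothetical three-row frame; the label `'seven'` and the variable `length_error` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list is matched by position, so its length must equal the number of rows
length_error = None
try:
    frame['debt'] = [1.0, 2.0]  # two values for three rows
except ValueError as exc:
    length_error = exc

# A Series is matched by label: 'two' lines up with the frame's index,
# 'seven' has no matching row and is dropped, unmatched rows get NaN
val = pd.Series([-1.2, 9.9], index=['two', 'seven'])
frame['debt'] = val
```

After the second assignment only row `'two'` holds a value in `debt`.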


Assigning to a column that doesn't exist creates a new column:

In [92]:
frame2['eastern'] = frame2.state == 'Ohio'  # New columns cannot be created with the frame2.eastern syntax

In [93]:
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


Remove a column with the ___del___ keyword:

In [96]:
del frame2['eastern']

In [97]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

If a nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices

In [100]:
pop = {'Nevada': {2001:2.4, 2002: 2.9},
      'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [101]:
frame3 = pd.DataFrame(pop)

In [102]:
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


Transpose the DataFrame

In [103]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


Dicts of Series are treated in much the same way:

In [105]:
pdata = {'Ohio': frame3['Ohio'][:-1],
        'Nevada': frame3['Nevada'][:2]}

In [106]:
pd.DataFrame(pdata)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7


<img src="img/5_1_1.png">

The ___name___ attribute of DataFrame's index and columns

In [107]:
frame3.index.name = 'year'; frame3.columns.name = 'state'

In [108]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


The ___values___ attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

In [109]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

If the columns have different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:

In [111]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

In [110]:
frame2.values[0].dtype

dtype('O')

### 1.3 Index Objects

pandas's Index objects are responsible for holding the axis labels and other metadata. Any array or other sequence of labels used to construct a Series or DataFrame is internally converted to an Index.

In [112]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [113]:
index = obj.index

In [114]:
index

Index([u'a', u'b', u'c'], dtype='object')

In [115]:
index[1:]

Index([u'b', u'c'], dtype='object')

Index objects are immutable

In [116]:
index[1] = 'd'

TypeError: Index does not support mutable operations

An Index behaves like a fixed-size set:

In [123]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [125]:
frame3.columns

Index([u'Nevada', u'Ohio'], dtype='object', name=u'state')

In [126]:
'Ohio' in frame3.columns

True
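Beyond membership tests, an Index also offers set-style methods. The following sketch uses two small made-up indexes with unique labels to show `intersection`, `union`, and `difference`.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

# Labels present in both indexes
common = left.intersection(right)

# Labels present in either index
combined = left.union(right)

# Labels present in left but not in right
only_left = left.difference(right)
```

Each call returns a new Index and leaves `left` and `right` untouched, which is consistent with Index objects being immutable.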

But a pandas Index can contain duplicate labels. Selections with duplicate labels will select all occurrences of that label.

In [168]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [169]:
dup_labels

Index([u'foo', u'foo', u'bar', u'bar'], dtype='object')

In [170]:
dup_labels2 = pd.Index(['bar', 'a'])

In [171]:
dup_labels.intersection(dup_labels2)

Index([u'bar', u'bar'], dtype='object')

<img src="img/5_1_2.png">

## 2 Essential Functionality

### 2.1 Reindexing

In [3]:
obj = pd.Series([4.5, 7.2, -5.4, 3.6], index=['d', 'b', 'a', 'c'])

In [5]:
obj

d    4.5
b    7.2
a   -5.4
c    3.6
dtype: float64

In [6]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [7]:
obj2

a   -5.4
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

The ___method___ option of reindex fills in missing values while reindexing; ___ffill___ forward-fills the values:

In [8]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [9]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [10]:
obj4 = obj3.reindex(range(6), method='ffill')

In [11]:
obj4

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
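For comparison, here is a short sketch that reindexes the same Series in two ways: forward-filling with `method='ffill'`, and substituting a constant with the `fill_value` argument. The placeholder string `'missing'` is an arbitrary choice for illustration.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Carry the last seen value forward into the new labels 1, 3 and 5
filled = obj3.reindex(range(6), method='ffill')

# Use a fixed placeholder for the new labels instead
defaulted = obj3.reindex(range(6), fill_value='missing')
```

Forward filling suits ordered data such as time series, while a constant makes the gaps explicit.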

With DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [14]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index = ['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])

In [15]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [16]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [17]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


Reindex columns with the columns keyword:

In [18]:
frame.reindex(columns=['Texas', 'Utah', 'California'])

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [19]:
frame.reindex(['a', 'b', 'c', 'd'], ['Texas', 'Utah', 'California'])

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


In [23]:
frame.loc[['a', 'b', 'c', 'd'], ['Texas', 'Utah', 'California']]  # older pandas only; recent versions raise KeyError for missing labels, so prefer reindex

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


### 2.2 Dropping Entries from an Axis

___drop___ method

In [24]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [25]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [26]:
new_obj = obj.drop('c')

In [27]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [28]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis

In [29]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
                    index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])

In [30]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [31]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


You can drop values from the columns by passing axis=1 or axis='columns':

In [32]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place

In [33]:
obj.drop('c', inplace=True)

In [34]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
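The difference between the two calling styles is easy to miss, so here is a minimal sketch. The names `still_has_c` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: drop returns a new object and leaves the original alone
new_obj = obj.drop('c')
still_has_c = 'c' in obj.index

# inplace=True: the original is modified and the call returns None
result = obj.drop('c', inplace=True)
```

Because the in-place form destroys the dropped data, the default form is usually the safer choice.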

### 2.3 Indexing, Selection, and Filtering

Series indexing

In [35]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [36]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [37]:
obj['a']

0.0

In [38]:
obj[0]

0.0

In [39]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [40]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [41]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels will include the end-point

In [42]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [43]:
obj['b':'c'] = 5

In [44]:
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing into a DataFrame is for retrieving one or more columns

In [45]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index=['Ohio', 'Colorado', 'Utah', 'New York'],
                   columns=['one', 'two', 'three', 'four'])

In [46]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [47]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [49]:
data[['two', 'three']]

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


Slicing or selecting data with a boolean array will retrieve rows

In [51]:
data[1:3]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11


In [52]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Indexing with a boolean DataFrame

In [53]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [54]:
data[data < 5] = 0

In [55]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Selection with loc and iloc

___loc___ and ___iloc___ enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).

In [57]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [58]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [59]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


Both work with slices

In [60]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [63]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


Summary of DataFrame Indexing

<img src='img/5_2_1.png'>

### 2.4 Integer Indexes

If an axis index contains integers, data selection will always be label-oriented

In [67]:
ser = pd.Series(np.arange(3.))

In [68]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [69]:
ser[-1]

KeyError: -1

There is no ambiguity with a non-integer index

In [70]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [71]:
ser2[-1]

2.0

In [75]:
ser[:1]

0    0.0
dtype: float64

Slicing with labels always contains the end-point

In [76]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [77]:
ser.iloc[:1]

0    0.0
dtype: float64

### 2.5 Arithmetic and Data Alignment

When you add objects together, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. This is similar to an outer join on the index labels.

In [79]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [80]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [81]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
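To make the outer-join behaviour concrete, the sketch below repeats the sum and then contrasts it with the `add` method and a `fill_value`, which treats a label missing from one side as 0. The names `plain` and `filled` are illustrative.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# The result index is the union of both indexes;
# labels found on only one side produce NaN
plain = s1 + s2

# With fill_value=0 the side that lacks a label contributes 0 instead
filled = s1.add(s2, fill_value=0)
```

In `plain` the labels d, f, and g are NaN, while `filled` keeps their single available values.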

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [83]:
df1 = pd.DataFrame(np.arange(9.0).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])

In [84]:
df2 = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [85]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [86]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [87]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


Arithmetic methods with fill values

In [95]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [96]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [97]:
df2.loc[1, 'b'] = np.nan

In [99]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [100]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


<img src='img/5_2_2.png'>

Each of these arithmetic methods has a counterpart starting with the letter r, with the arguments flipped, so these two statements are equivalent:

In [101]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [102]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


__Operations between DataFrame and Series__

Consider first how broadcasting works with a NumPy array:

In [103]:
arr = np.arange(12.).reshape((3, 4))

In [104]:
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [105]:
arr[0]

array([ 0.,  1.,  2.,  3.])

The subtraction is performed once for each row.

In [106]:
arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

Operations between a DataFrame and a Series are similar

In [107]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [108]:
series = frame.iloc[0]

In [109]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [110]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [111]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union:

In [112]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [113]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


If you want to broadcast over the columns instead, matching on the rows, you have to use one of the arithmetic methods.

In [115]:
series3 = frame['d']

In [116]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [117]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [119]:
frame.sub(series3, axis='index') # or axis=0

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


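The axis that you pass is the axis to match on. In this case that means matching on the DataFrame's row index (axis='index' or axis=0) and broadcasting across the columns.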
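A common use of matching on the rows is centering each row around its own mean. The sketch below is one way to do that; `row_means`, `naive`, and `centered` are illustrative names. It also shows what the plain `-` operator does with the same operands.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels
row_means = frame.mean(axis='columns')

# The operator matches the Series index against the columns,
# so no labels line up and every value is NaN
naive = frame - row_means

# Matching on the row index subtracts each row's own mean
centered = frame.sub(row_means, axis='index')
```

Every row of `centered` is -1, 0, 1, whereas `naive` has seven columns that contain only NaN.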

### 2.6 Function Application and Mapping

NumPy ufuncs also work with pandas objects:

In [121]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [122]:
frame

Unnamed: 0,b,d,e
Utah,-0.405125,0.082628,-1.164402
Ohio,-2.210062,-0.486038,1.24732
Texas,-1.341049,0.391687,-0.172739
Oregon,-0.941353,1.008947,-0.636273


In [123]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.405125,0.082628,1.164402
Ohio,2.210062,0.486038,1.24732
Texas,1.341049,0.391687,0.172739
Oregon,0.941353,1.008947,0.636273


DataFrame's ___apply___ method

In [124]:
f = lambda x: x.max() - x.min()

In [125]:
frame.apply(f)

b    1.804937
d    1.494985
e    2.411722
dtype: float64

In [126]:
frame.apply(f, axis='columns')

Utah      1.247030
Ohio      3.457383
Texas     1.732735
Oregon    1.950300
dtype: float64

The function passed to apply can also return a Series with multiple values:

In [130]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [131]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-2.210062,-0.486038,-1.164402
max,-0.405125,1.008947,1.24732


Element-wise Python functions can be used too, with the ___applymap___ method.

Compute a formatted string from each floating-point value in frame.

In [132]:
format = lambda x: '%.2f' % x

In [134]:
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.41,0.08,-1.16
Ohio,-2.21,-0.49,1.25
Texas,-1.34,0.39,-0.17
Oregon,-0.94,1.01,-0.64


Series has a map method for applying an element-wise function

In [137]:
frame['e'].map(format)

Utah      -1.16
Ohio       1.25
Texas     -0.17
Oregon    -0.64
Name: e, dtype: object
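Recent pandas releases deprecate `applymap` in favor of `DataFrame.map`. A version-independent alternative, sketched here on a small made-up frame, is to apply `Series.map` to each column through `apply`. The helper name `fmt` is illustrative and avoids shadowing the built-in `format`.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.234, -0.5], 'e': [2.0, 3.14159]})


def fmt(x):
    """Format a single float with two decimal places."""
    return '%.2f' % x


# apply passes each column as a Series; map then formats element by element
formatted = frame.apply(lambda col: col.map(fmt))

# The same function used on a single column
formatted_e = frame['e'].map(fmt)
```

The result holds strings rather than floats, so it is meant for display, not further arithmetic.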

### 2.7 Sorting and Ranking

Use the ***sort_index*** method to sort lexicographically by row or column index

In [138]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [139]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [140]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=list('dabc'))

In [141]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [143]:
frame.sort_index(axis='columns')

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [144]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its sort_values method:

In [3]:
obj = pd.Series([4, 7, 3, -2])

In [4]:
obj.sort_values()

3   -2
2    3
0    4
1    7
dtype: int64

Missing values are sorted to the end by default

In [5]:
pd.Series([4, np.nan, 7, np.nan, -3, 2]).sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

Use one or more columns to sort a DataFrame

In [6]:
frame = pd.DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})

In [7]:
frame

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [8]:
frame.sort_values(by='b')

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7


In [9]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,a,b
2,0,-3
0,0,4
3,1,2
1,1,7


In [10]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

Ranking assigns ranks from one through the number of valid data points in an array. By default ___rank___ breaks ties by assigning each group the mean rank:

In [11]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which the values are observed in the data:

In [12]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [13]:
# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
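To compare tie-breaking rules directly, this sketch ranks the same Series with three values of the `method` argument: the default `'average'`, `'min'`, and `'dense'`.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Tied values share the mean of the ranks they occupy
average = obj.rank()

# Tied values all get the lowest rank in their group
minimum = obj.rank(method='min')

# Like 'min', but ranks increase by 1 between groups, leaving no gaps
dense = obj.rank(method='dense')
```

The two 7s receive 6.5 under `'average'`, 6.0 under `'min'`, and 5.0 under `'dense'`.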

DataFrame can compute ranks over the rows or the columns:

In [14]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})

In [15]:
frame

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


In [16]:
frame.rank(axis='columns')

Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0


<img src='img/5_2_3.png'>

### 2.8 Axis Indexes with Duplicate Labels

In [17]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [18]:
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index's ***is_unique*** property tells you whether its labels are unique:

In [19]:
obj.index.is_unique

False

Indexing a label with multiple entries returns a Series, while a label with a single entry returns a scalar value:

In [20]:
obj['c']

4

In [21]:
obj['a']

a    0
a    1
dtype: int64
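Since the return type depends on whether a label is repeated, code that indexes such a Series may need to handle both cases. The sketch below checks the types and shows one possible way to collapse the duplicates, by grouping on the index and summing. The grouping step is an addition for illustration, not something the notes above prescribe.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

single = obj['c']  # unique label: a scalar
multi = obj['a']   # repeated label: a Series with two entries

# One option for removing the ambiguity: aggregate per label
collapsed = obj.groupby(level=0).sum()
```

After the aggregation the index is unique, so every label lookup on `collapsed` returns a scalar.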

The same logic extends to indexing rows or columns in a DataFrame:

In [25]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [26]:
df

Unnamed: 0,0,1,2
a,2.22054,-0.04441,1.616807
a,-0.231418,-0.388724,-1.053837
b,0.952508,-1.24549,-0.317826
b,-0.100918,0.502864,1.567973


In [27]:
df.loc['b']

Unnamed: 0,0,1,2
b,0.952508,-1.24549,-0.317826
b,-0.100918,0.502864,1.567973


In [28]:
df = pd.DataFrame(np.random.randn(4, 3), columns=['d', 'd', 'e'])

In [29]:
df['d']

Unnamed: 0,d,d
0,0.087507,1.014865
1,1.593993,0.656362
2,0.243245,-0.179076
3,-0.590873,-0.741181


## 3 Summarizing and Computing Descriptive Statistics

pandas objects have built-in summary methods that can handle missing data:

In [30]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                 index=['a', 'b', 'c', 'd'],
                 columns=['one', 'two'])

In [31]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


The sum method returns a Series containing column sums by default:

In [34]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [35]:
df.sum(axis=1)

a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64

In [36]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
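The effect of `skipna` is easiest to see next to a count of the valid values per row. This is a small sketch on the same frame; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Number of non-NA values in each row
valid = df.count(axis='columns')

# Default: NA values are skipped, so row 'a' averages its one valid value
with_skip = df.mean(axis='columns')

# skipna=False: any NA in a row makes that row's result NA
without_skip = df.mean(axis='columns', skipna=False)
```

Row `'a'` yields 1.4 with the default and NaN with `skipna=False`, while rows without missing values give the same result either way.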

___idxmin___ and ___idxmax___ return the index value where the minimum or maximum values are attained

In [38]:
df.idxmax()

one    b
two    d
dtype: object

Other methods, like cumsum, are accumulations:

In [39]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


___describe___ produces multiple summary statistics in one shot:

In [40]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [42]:
pd.Series(['a', 'a', 'b', 'c'] * 4).describe()

count     16
unique     3
top        a
freq       8
dtype: object

<img src='img/5_3_1.png'>

### 3.1 Correlation and Covariance

Use the pandas_datareader module to download some data for a few stock tickers:

In [43]:
import pandas_datareader.data as web

In [44]:
all_data = {ticker: web.get_data_yahoo(ticker)
           for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [45]:
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})

In [47]:
volumn = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})

In [54]:
returns = price.pct_change()

In [55]:
returns.tail()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017-10-25,-0.004392,0.002875,-0.015268,-0.002917
2017-10-26,0.006393,-0.000791,0.000652,0.001653
2017-10-27,0.03583,0.048028,0.000521,0.064119
2017-10-30,0.022508,-0.002119,0.004425,0.000955
2017-10-31,0.013915,-0.000462,-0.001944,-0.008463


The ___corr___ method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, ___cov___ computes the covariance.

In [56]:
returns['MSFT'].corr(returns['IBM'])

0.47132700197558081

In [59]:
returns.MSFT.cov(returns.IBM)

7.8933345751300323e-05

DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame

In [64]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.415596,0.358733,0.396574
GOOG,0.415596,1.0,0.382766,0.47963
IBM,0.358733,0.382766,1.0,0.471327
MSFT,0.396574,0.47963,0.471327,1.0


Using DataFrame's ___corrwith___ method, you can compute pairwise correlations between DataFrame's columns or rows with another Series or DataFrame

In [66]:
returns.corrwith(returns.IBM)

AAPL    0.358733
GOOG    0.382766
IBM     1.000000
MSFT    0.471327
dtype: float64

Passing a DataFrame computes the correlations of matching column names.

In [67]:
returns.corrwith(volumn)

AAPL   -0.071114
GOOG   -0.013276
IBM    -0.159097
MSFT   -0.087365
dtype: float64

Passing axis='columns' does things row-by-row.
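The stock data above requires a network download, so here is an offline sketch of the same calls on a tiny synthetic frame. The column names A, B, and C and their values are made up: B is a scaled copy of A and C is its negation, which makes the expected correlations easy to reason about.

```python
import pandas as pd

returns = pd.DataFrame({'A': [0.01, -0.02, 0.03, 0.00, 0.02]})
returns['B'] = 2 * returns['A']  # moves exactly with A
returns['C'] = -returns['A']     # moves exactly against A

# Correlation between two Series
pair = returns['A'].corr(returns['B'])

# Full correlation and covariance matrices
corr_matrix = returns.corr()
cov_matrix = returns.cov()

# Correlation of every column with a single Series
against_a = returns.corrwith(returns['A'])
```

Up to floating-point rounding, A and B correlate at 1 and A and C at -1.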

### 3.2 Unique Values, Value Counts, and Membership

In [68]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

___unique___ method gives you an array of the unique values in a Series

In [69]:
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

***value_counts*** computes a Series containing value frequencies:

In [70]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

***value_counts*** is also available as a top-level pandas method that can be used with any array or sequence:

In [71]:
pd.value_counts(['a', 'a', 'b'], sort=False)

a    2
b    1
dtype: int64

***isin*** performs a vectorized set membership check.

In [72]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [74]:
mask = obj.isin(['b', 'c'])

In [75]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [76]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

***Index.get_indexer*** method gives you an index array from an array of possibly non-distinct values into another array of distinct values:

In [77]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

In [78]:
unique_vals = pd.Series(['c', 'b', 'a'])

In [79]:
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2])
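One detail worth knowing is how `get_indexer` reports values that are absent from the Index. In this sketch the value `'z'` is deliberately not among the distinct values.

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])

# Position of each value within unique_vals; -1 marks a value that is not found
positions = unique_vals.get_indexer(['c', 'a', 'z'])
```

The -1 entries should be filtered out before `positions` is used for positional indexing, because -1 would otherwise select the last element.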

Compute a histogram on multiple related columns in a DataFrame.

In [80]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                    'Qu2': [2, 3, 1, 2, 3],
                    'Qu3': [1, 5, 2, 4, 4]})

In [81]:
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [82]:
result = data.apply(pd.value_counts)

In [83]:
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,,2.0,1.0
3,2.0,2.0,
4,2.0,,2.0
5,,,1.0


In [84]:
result.fillna(0)

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
