## Getting Started with pandas

pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. It is built on top of NumPy, which makes it easy to use in NumPy-centric applications.

### Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: <b>Series</b> and <b>DataFrame</b>. 
While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

#### Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

In [29]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = Series([10, 20 , 30, 40])
obj

0    10
1    20
2    30
3    40
dtype: int64

In [2]:
# you can get only the array representation using:
obj.values

array([10, 20, 30, 40])

In [3]:
# you can get only the index representation using:
obj.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [4]:
# getting value using index
obj[0]

10

In [5]:
# defining the index values
obj1 = Series([10, 20, 30, 40, 50], index=['b','a','c','d','e'])
obj1

b    10
a    20
c    30
d    40
e    50
dtype: int64

In [6]:
obj1.index

Index([u'b', u'a', u'c', u'd', u'e'], dtype='object')

In [7]:
obj1['a']

20

In [8]:
obj1['e']

50

In [9]:
obj1[['c', 'b', 'a']]

c    30
b    10
a    20
dtype: int64

NumPy array operations will preserve the index value link.

In [10]:
obj1[obj1 > 20]

c    30
d    40
e    50
dtype: int64

In [13]:
obj1 * 2

b     20
a     40
c     60
d     80
e    100
dtype: int64

In [12]:
np.exp(obj1)

b    2.202647e+04
a    4.851652e+08
c    1.068647e+13
d    2.353853e+17
e    5.184706e+21
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.

In [14]:
'b' in obj1

True

In [15]:
'f' in obj1

False
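
The dict analogy can be pushed a little further. The following is a minimal sketch, using a small hypothetical Series named `prices` (not one of the objects defined above), of a few dict-style operations that a Series supports: membership tests on labels, `get` with a default, and conversion back to a plain dict.

```python
import pandas as pd

# hypothetical example data, not taken from the cells above
prices = pd.Series([10, 20, 30], index=['b', 'a', 'c'])

# membership tests look at the index labels, not the values
has_a = 'a' in prices
has_20 = 20 in prices      # 20 is a value, not a label

# get() returns a default instead of raising KeyError
missing = prices.get('z', -1)

# convert back to a plain dict; keys follow the index order
as_dict = prices.to_dict()
print(has_a, has_20, missing, as_dict)
```

Note that `in` checks labels only, which is the same behaviour a dict has for its keys.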

In [16]:
# you can create a Series from a Python dict
dic ={'Ohio':123, 'Texas':456, 'Oregon':879, 'Utah':667}
obj2 = Series(dic)
obj2

Ohio      123
Oregon    879
Texas     456
Utah      667
dtype: int64

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = Series(obj2, index = states)
obj3

California    NaN
Ohio          123
Oregon        879
Texas         456
dtype: float64

In [18]:
# using isnull function on Series
pd.isnull(obj3)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [19]:
pd.notnull(obj3)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [20]:
# Series also has this as a method
obj3.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [22]:
obj2

Ohio      123
Oregon    879
Texas     456
Utah      667
dtype: int64

In [23]:
obj3

California    NaN
Ohio          123
Oregon        879
Texas         456
dtype: float64

In [24]:
# adding two Series. It adds the values based on common key. You can perform other operations too.
obj2 + obj3

California     NaN
Ohio           246
Oregon        1758
Texas          912
Utah           NaN
dtype: float64

In [25]:
obj2

Ohio      123
Oregon    879
Texas     456
Utah      667
dtype: int64

In [26]:
# both Series object itself and its index have a name
obj2.name = 'population'
obj2.index.name = 'state'
obj2

state
Ohio      123
Oregon    879
Texas     456
Utah      667
Name: population, dtype: int64

In [30]:
obj

0    10
1    20
2    30
3    40
dtype: int64

In [31]:
# you can alter Series index in place
obj.index = ['u', 'v', 'x', 'y']
obj

u    10
v    20
x    30
y    40
dtype: int64

#### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series that share the same index.
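
To make the "dict of Series" picture concrete, here is a small sketch built from hypothetical columns (the names and values are made up for illustration). Each column keeps its own dtype, and selecting a column hands back a Series carrying the shared row index.

```python
import pandas as pd

# hypothetical columns; each Series carries its own dtype
labels = ['one', 'two']
columns = {
    'year': pd.Series([2000, 2001], index=labels),
    'state': pd.Series(['Ohio', 'Nevada'], index=labels),
    'pop': pd.Series([1.5, 2.4], index=labels),
}
frame = pd.DataFrame(columns)

# selecting a column returns a Series that shares the row index
year = frame['year']

# one-letter dtype kind per column ('i' integer, 'f' float, ...)
kinds = {name: frame[name].dtype.kind for name in frame.columns}
print(kinds)
```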

In [32]:
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year':[2000, 2001, 2002, 2001, 2002],       
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9] 
       }
frame = DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [33]:
# you can specify the order of the columns
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [34]:
# As with Series, if you pass a column that isn't contained in data, it will appear filled with NA
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index =['one', 'two', 'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [35]:
frame2.index

Index([u'one', u'two', u'three', u'four', u'five'], dtype='object')

In [36]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

In [37]:
# returns a 2d array
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan]], dtype=object)

In [38]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [40]:
type(frame2['state'])

pandas.core.series.Series

In [39]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [41]:
type(frame2.year)

pandas.core.series.Series

In [42]:
# retrieving rows using ix indexing field
frame2.ix['two']

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object

In [44]:
type(frame2.ix['two'])

pandas.core.series.Series
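
A note for readers on a recent pandas: to my knowledge the `ix` indexer was deprecated and later removed (pandas 1.0), so the cells above only run on older versions. A minimal sketch of the equivalent row lookups with `loc` (by label) and `iloc` (by integer position), using a small stand-in frame rather than the full `frame2`:

```python
import pandas as pd

# small stand-in for frame2; values are illustrative
frame2 = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'state': ['Ohio', 'Ohio', 'Nevada']},
    index=['one', 'two', 'three'],
)

by_label = frame2.loc['two']     # row selected by index label
by_position = frame2.iloc[1]     # same row selected by integer position
print(by_label)
```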

In [45]:
# assigning values to a column 
frame2['debt'] = 16.6
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.6
two,2001,Ohio,1.7,16.6
three,2002,Ohio,3.6,16.6
four,2001,Nevada,2.4,16.6
five,2002,Nevada,2.9,16.6


In [46]:
frame2['debt'] = [10, 20, 15, 22, 32]
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,10
two,2001,Ohio,1.7,20
three,2002,Ohio,3.6,15
four,2001,Nevada,2.4,22
five,2002,Nevada,2.9,32


In [47]:
frame2['debt'] = np.arange(5)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


In [48]:
# you can assign a Series to a DataFrame column. Its labels are aligned with the DataFrame index; rows with no matching label are filled with NA
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [49]:
frame2['debit'] = val
frame2

Unnamed: 0,year,state,pop,debt,debit
one,2000,Ohio,1.5,0,
two,2001,Ohio,1.7,1,-1.2
three,2002,Ohio,3.6,2,
four,2001,Nevada,2.4,3,-1.5
five,2002,Nevada,2.9,4,-1.7


In [50]:
# assigning a column that does not exist will create a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,debit,eastern
one,2000,Ohio,1.5,0,,True
two,2001,Ohio,1.7,1,-1.2,True
three,2002,Ohio,3.6,2,,True
four,2001,Nevada,2.4,3,-1.5,False
five,2002,Nevada,2.9,4,-1.7,False


In [51]:
# deleting a column
del frame2['eastern']
frame2

Unnamed: 0,year,state,pop,debt,debit
one,2000,Ohio,1.5,0,
two,2001,Ohio,1.7,1,-1.2
three,2002,Ohio,3.6,2,
four,2001,Nevada,2.4,3,-1.5
five,2002,Nevada,2.9,4,-1.7


In [52]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt', u'debit'], dtype='object')

In [54]:
# Another common form of data is a nested dict of dicts format
pop = {'Nevada':{2001: 2.4, 2002: 2.9}, 'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop

{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [55]:
# you can put this in DataFrame and it makes outer keys columns and inner keys indices
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [56]:
# you can transpose the DataFrame
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [57]:
# you can explicitly declare the indices
DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


#### Index Objects

In [58]:
obj = Series(range(3), index =['a', 'b', 'c'])
index = obj.index
index

Index([u'a', u'b', u'c'], dtype='object')

In [59]:
index[1:]

Index([u'b', u'c'], dtype='object')

In [60]:
index[1]

'b'

In [61]:
# index objects are immutable, so you cannot do:
index[1] = 'd'

TypeError: Indexes does not support mutable operations

In [62]:
# Immutability is important so that Index objects can be safely shared among data structures
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
obj2.index is index

True

In [63]:
# In addition to being array-like, an index also functions as a fixed-size set:
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [64]:
'Ohio' in frame3.columns

True

In [65]:
2003 in frame3.index

False
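
The two properties shown above, immutability and set-like behaviour, can be sketched together. The index values below are hypothetical; `intersection`, `union` and `difference` are Index methods that return new Index objects.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# item assignment is rejected because an Index is immutable
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

other = pd.Index(['b', 'c', 'd'])
common = index.intersection(other)    # labels present in both
combined = index.union(other)         # labels present in either
only_left = index.difference(other)   # labels only in `index`
print(mutated, list(common), list(combined), list(only_left))
```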

### Essential Functionality

#### Reindexing

In [66]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [67]:
# NA for index which does not exist
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [68]:
# you can fill the NA with any value 
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [69]:
# for ordered data like time series, it may be desirable to interpolate or fill values when reindexing
# use the method option: ffill means forward filling, bfill means backward filling
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [70]:
obj3.reindex(range(6), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

In [71]:
# with DataFrame reindex can alter either the (row) index, columns or both.
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [72]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [74]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [73]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [75]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [77]:
states

['Texas', 'Utah', 'California']

In [79]:
# doing both index and columns reindexing in one shot
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
b,1,,2
c,4,,5
d,7,,8


In [80]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [81]:
states

['Texas', 'Utah', 'California']

In [82]:
# you can do a similar reindexing (without the fill) by label indexing using ix
frame.ix[['a','b','c','d'], states]

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0
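
On a recent pandas the `ix` cell above will not run, and as far as I know `loc` raises a `KeyError` when the requested list contains labels that do not exist, so `reindex` is the safer route. The sketch below rebuilds `frame` and `states` from the cells above. Chaining two `reindex` calls is my own choice here: it keeps the forward fill on the rows only, since `method` needs a sorted axis and the column labels are not sorted.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# reindex rows and columns together; missing labels become NaN
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

# forward-fill along the rows first, then pick the columns
filled = (frame.reindex(['a', 'b', 'c', 'd'], method='ffill')
               .reindex(columns=states))
print(both)
print(filled)
```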


#### Dropping entries from an axis

In [83]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [84]:
new_obj = obj.drop('c')
new_obj

a    0
b    1
d    3
e    4
dtype: int64

In [89]:
# in case I forgot to mention it: Series and DataFrames have a copy method
copyObj = new_obj.copy()
copyObj

a    0
b    1
d    3
e    4
dtype: int64

In [92]:
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio','Colorado','Utah','New York'],
                                                columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [93]:
# drop returns a new object; data itself is unchanged unless you assign the result back to it
data.drop(['Colorado','Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [94]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [95]:
data.drop(['two','four'],axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
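
A short sketch confirming that `drop` leaves the original untouched, using the same `data` frame as above. The `index=` and `columns=` keywords are an alternative spelling available in reasonably recent pandas versions; they avoid having to remember axis numbers.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# drop returns a new object; `data` keeps all four rows
trimmed = data.drop(['Colorado', 'Ohio'])

# keyword form instead of axis=0 / axis=1
no_two = data.drop(columns=['two'])
no_utah = data.drop(index=['Utah'])
print(data.shape, trimmed.shape, no_two.shape, no_utah.shape)
```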


#### Indexing, selection and filtering

In [96]:
obj = Series(np.arange(4), index=['a','b','c','d'])
obj

a    0
b    1
c    2
d    3
dtype: int64

In [97]:
obj['b']

1

In [98]:
obj[1]

1

In [99]:
obj[2:4]

c    2
d    3
dtype: int64

In [100]:
obj[['b','a','d']]

b    1
a    0
d    3
dtype: int64

In [101]:
obj[[1,3]]

b    1
d    3
dtype: int64

In [102]:
obj[obj < 2]

a    0
b    1
dtype: int64

In [103]:
# slicing with labels differs from normal Python slicing: as you can see, the endpoint is inclusive
obj['b':'c']

b    1
c    2
dtype: int64

In [104]:
obj['b':'c'] = 5
obj

a    0
b    5
c    5
d    3
dtype: int64
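
To see the two slicing conventions side by side, this sketch uses the explicit indexers `iloc` (positions, end excluded) and `loc` (labels, end included) on a freshly built copy of `obj`:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_position = obj.iloc[1:2]    # half-open: only position 1
by_label = obj.loc['b':'c']    # closed: both 'b' and 'c'
print(list(by_position.index), list(by_label.index))
```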

In [105]:
data = DataFrame(np.arange(16).reshape((4,4)),
                 index=['Ohio','Colorado','Utah','New York'],
                 columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [106]:
# DataFrame indexing may look a bit inconsistent with the Series examples: a label or list of labels selects columns
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [107]:
data[['two','three']]

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


In [108]:
# selecting rows by slicing
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [109]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [110]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [111]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [112]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
# you can use ix property of data frame for mentioned operations too.
# please refer to DataFrame pandas reference
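
Since `ix` is, to my knowledge, no longer available in current pandas, here is a hedged sketch of the same kinds of selection with `loc` and `iloc`. It rebuilds the original `data` frame (before the zeroing above) so that it runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# rows and columns by label
subset = data.loc[['Colorado', 'Utah'], ['two', 'three']]

# rows by boolean mask, columns by label slice (end label included)
masked = data.loc[data['three'] > 5, 'one':'two']

# rows and columns by integer position
corner = data.iloc[:2, :2]
print(subset, masked, corner, sep='\n\n')
```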

#### Arithmetic and data alignment

In [113]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [114]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [115]:
# where the indices do not match, NaN is placed in the result
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [116]:
df1 = DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'), index=['Ohio','Texas','Colorado'])
df1

Unnamed: 0,b,c,d
Ohio,0,1,2
Texas,3,4,5
Colorado,6,7,8


In [117]:
df2 = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
df2

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [118]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


#### Arithmetic methods with fill values

In [119]:
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))
df1

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [120]:
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))
df2

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19


In [121]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [122]:
# treats a value missing from one DataFrame as zero before adding
# fill_value works for add, sub, div and mul
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0,2,4,6,4
1,9,11,13,15,9
2,18,20,22,24,14
3,15,16,17,18,19


In [123]:
df1

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [124]:
df2

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19


In [125]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,0
1,4,5,6,7,0
2,8,9,10,11,0
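
One subtlety of `fill_value` is easier to see on small Series. In this sketch (hypothetical data), a label missing from one side is treated as the fill value, but a position that is missing on both sides still comes out as NaN.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

plain = s1 + s2                      # 'a' and 'c' become NaN
filled = s1.add(s2, fill_value=0)    # the missing side is treated as 0

# fill_value does not rescue a value that is missing on both sides
s3 = pd.Series([np.nan, 5.0], index=['a', 'b'])
s4 = pd.Series([np.nan, 1.0], index=['a', 'b'])
still_nan = s3.add(s4, fill_value=0)
print(plain, filled, still_nan, sep='\n\n')
```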


#### Operation between DataFrame and Series

In [126]:
arr = np.arange(12).reshape((4,3))
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [127]:
arr[0]

array([0, 1, 2])

In [128]:
# this is called broadcasting, it subtracts row by row
arr - arr[0]

array([[0, 0, 0],
       [3, 3, 3],
       [6, 6, 6],
       [9, 9, 9]])

In [129]:
frame = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [130]:
series

b    0
d    1
e    2
Name: Utah, dtype: int64

In [131]:
# broadcasting down the rows
frame - series

Unnamed: 0,b,d,e
Utah,0,0,0
Ohio,3,3,3
Texas,6,6,6
Oregon,9,9,9


In [132]:
series2 = Series(range(3), index=['b','e','f'])
series2

b    0
e    1
f    2
dtype: int64

In [133]:
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [134]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0,,3,
Ohio,3,,6,
Texas,6,,9,
Oregon,9,,12,


In [136]:
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [135]:
# you can broadcast over the columns instead, using the arithmetic methods as follows
series3 = frame['d']
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [137]:
series3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: int64

In [138]:
# subtraction
frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1
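
A common use of the two broadcasting directions is centering data. The sketch below (float data, otherwise the same shape and labels as `frame`) subtracts column means with the default alignment and row means with `axis=0`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# default: the Series index is matched against the columns,
# so the column means are subtracted from every row
col_centered = frame - frame.mean()

# axis=0: the Series index is matched against the row labels,
# so each row has its own mean subtracted
row_centered = frame.sub(frame.mean(axis=1), axis=0)
print(col_centered, row_centered, sep='\n\n')
```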


#### Function application and mapping

NumPy ufuncs work fine with pandas objects:

In [139]:
frame = DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,d,b,e
Utah,-1.448031,1.583063,-0.759537
Ohio,0.037969,-1.903567,-0.605953
Texas,0.85744,1.296539,-0.94368
Oregon,-0.868156,0.517078,0.497081


In [140]:
np.abs(frame)

Unnamed: 0,d,b,e
Utah,1.448031,1.583063,0.759537
Ohio,0.037969,1.903567,0.605953
Texas,0.85744,1.296539,0.94368
Oregon,0.868156,0.517078,0.497081


Another frequent operation is applying a function that works on 1D arrays to each column or row:

In [141]:
f = lambda x: x.max() - x.min()
# by default axis is zero
frame.apply(f)

d    2.305471
b    3.486630
e    1.440762
dtype: float64

In [142]:
frame.apply(f, axis = 1)

Utah      3.031094
Ohio      1.941537
Texas     2.240220
Oregon    1.385235
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
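
For instance, the max-minus-min range computed with `apply` above can be written directly with the built-in `max` and `min` methods. A small sketch on deterministic data (not the random `frame` above) showing that both routes agree:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list('bde'))

via_apply = frame.apply(lambda x: x.max() - x.min())
via_methods = frame.max() - frame.min()
print(via_methods)
```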

apply need not return a scalar value; it can also return a Series with multiple values:

In [143]:
def f(x): return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,d,b,e
min,-1.448031,-1.903567,-0.94368
max,0.85744,1.583063,0.497081


Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string from each floating point value in frame.

In [144]:
# old formatting way
pi = 3.14159
print(" pi = %1.2f " % pi)

 pi = 3.14 


In [145]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,d,b,e
Utah,-1.45,1.58,-0.76
Ohio,0.04,-1.9,-0.61
Texas,0.86,1.3,-0.94
Oregon,-0.87,0.52,0.5


The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [146]:
frame['e'].map(format)

Utah      -0.76
Ohio      -0.61
Texas     -0.94
Oregon     0.50
Name: e, dtype: object
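
A version note, stated as an assumption rather than a guarantee: in newer pandas releases (2.1 or later, as far as I know) `applymap` is deprecated in favour of a `DataFrame.map` method with the same element-wise behaviour. The sketch below picks whichever is available; the helper name `two_decimals` and the data are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'d': [1.234, -0.5], 'e': [0.126, 2.0]})

def two_decimals(x):
    """Format a float with two digits after the decimal point."""
    return '%.2f' % x

# assumption: DataFrame.map exists on newer pandas;
# fall back to applymap on older versions
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(two_decimals)

# Series.map does the same for a single column
column = frame['e'].map(two_decimals)
print(formatted)
```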

#### Sorting

In [147]:
obj = Series(range(4), index=['d','a','b','c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [148]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [152]:
frame = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame                                                                        

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [155]:
# default axis is 0
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [154]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [156]:
# sorting a Series by its values
obj = Series([4,7,-3,2])
obj.order()

2   -3
3    2
0    4
1    7
dtype: int64

In [157]:
# any missing values are sorted to the end of the Series by default
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.order()

4    -3
5     2
0     4
2     7
1   NaN
3   NaN
dtype: float64

In [158]:
frame = DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [159]:
frame.sort_index(by='b')

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7


In [160]:
frame.sort_index(by=['a','b'])

Unnamed: 0,a,b
2,0,-3
0,0,4
3,1,2
1,1,7
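
On current pandas, `Series.order()` and `sort_index(by=...)` are, to my knowledge, no longer available; `sort_values` covers both cases. A sketch reproducing the sorts above with that method:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
by_value = obj.sort_values()                    # NaN goes last by default
descending = obj.sort_values(ascending=False)

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_b = frame.sort_values(by='b')
by_a_then_b = frame.sort_values(by=['a', 'b'])
print(by_value, by_b, by_a_then_b, sep='\n\n')
```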


#### Summarizing and Computing Descriptive Statistics

In [161]:
df = DataFrame([[1.4,np.nan], [7.1,-4.5],
               [np.nan, np.nan], [0.75, -1.3]],
               index=['a','b','c','d'],
               columns=['one','two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [162]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [163]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [164]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

In [165]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [167]:
# on non-numeric, it produces alternative summary statistics:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [168]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

Some summary statistics and related functions:

count<br />
describe<br />
min, max<br />
quantile<br />
sum<br />
mean<br />
median<br />
mad<br />
var<br />
std<br />
diff<br />
pct_change<br />
cumsum<br />
cumprod<br />

#### Correlation and Covariance

Covariance measures how two variables move together: whether they move in the same direction (a positive covariance) or in opposite directions (a negative covariance). It ranges over (-inf, +inf).

Finding that two variables have a high or low covariance might not be a useful metric on its own. Covariance can tell how the two variables move together, but to determine the strength of the relationship, we need to look at the correlation.

The correlation between two variables X and Y is simply their covariance divided by the product of the standard deviations of X and Y. It ranges over [-1, 1].
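
The relationship between the two quantities can be checked numerically. This is a sketch on synthetic data (random numbers generated for illustration, not real stock returns): it divides the covariance by the product of the standard deviations and compares the result with `corr`, then shows that rescaling a variable changes the covariance but not the correlation.

```python
import numpy as np
import pandas as pd

# synthetic data for illustration only; not real stock returns
rng = np.random.RandomState(0)
x = pd.Series(rng.randn(200))
y = 0.5 * x + pd.Series(rng.randn(200))

cov_xy = x.cov(y)
corr_manual = cov_xy / (x.std() * y.std())
corr_pandas = x.corr(y)

# rescaling a variable changes the covariance but not the correlation
cov_scaled = (100 * x).cov(y)
corr_scaled = (100 * x).corr(y)
print(corr_manual, corr_pandas, cov_xy, cov_scaled)
```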

In [169]:
# getting stock data from Yahoo
import pandas.io.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})

volume = DataFrame({tic: data['Volume']
                   for tic, data in all_data.iteritems()})

returns = price.pct_change()

returns.tail()

Date,AAPL,GOOG,IBM,MSFT
2015-10-12,-0.004638,0.004754,-0.008203,-0.002335
2015-10-13,0.001703,0.008706,-0.010057,-0.00234
2015-10-14,-0.014134,-0.001748,0.002607,-0.004479
2015-10-15,0.014971,0.016248,0.000533,0.007069
2015-10-16,-0.007331,0.000695,0.001999,0.010636


In [170]:
# correlation between MSFT and IBM stocks
returns.MSFT.corr(returns.IBM)

0.51590834370171734

In [171]:
# covariance between MSFT and IBM stocks
returns.MSFT.cov(returns.IBM)

8.8741671317885853e-05

In [172]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.345789,0.398595,0.373097
GOOG,0.345789,1.0,0.39662,0.454525
IBM,0.398595,0.39662,1.0,0.515908
MSFT,0.373097,0.454525,0.515908,1.0


In [173]:
returns.cov()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,0.000282,8.7e-05,8e-05,9.1e-05
GOOG,8.7e-05,0.000273,7.8e-05,0.000111
IBM,8e-05,7.8e-05,0.000141,8.9e-05
MSFT,9.1e-05,0.000111,8.9e-05,0.00021


In [174]:
returns.corrwith(returns.IBM)

AAPL    0.398595
GOOG    0.396620
IBM     1.000000
MSFT    0.515908
dtype: float64

In [175]:
# finding the correlation between price change and volume
returns.corrwith(volume)

AAPL   -0.089878
GOOG    0.174451
IBM    -0.184904
MSFT   -0.101639
dtype: float64

#### Unique Values, Value Counts, and Membership

In [176]:
obj = Series(['c','a','d','a','a','b','b','c','c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [177]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [178]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [179]:
# pandas has a top-level function for this too, which can be used with any array or sequence
pd.value_counts(obj.values)

c    3
a    3
b    2
d    1
dtype: int64

In [180]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [181]:
# isin performs a vectorized set membership test and can be very useful for filtering a data set
mask = obj.isin(['b','c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [182]:
data = DataFrame({'Qu1': [1,3,4,5,4],
                  'Qu2': [2,3,1,2,3],
                  'Qu3': [1,5,2,4,4]})

data      

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,5,2,4
4,4,3,4


#### Handling Missing Data

In [184]:
# pandas uses the floating point value NaN to represent missing data
string_data = Series(['aardvark', 'artichoke', np.nan, 'avacado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avacado
dtype: object

In [185]:
# built-in Python None value is also treated as NaN
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

#### Filtering Out Missing Data

In [186]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [187]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [188]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [189]:
data = DataFrame([[1, 6.5, 3], [1, NA, NA],
                 [NA, NA, NA], [NA, 6.5, 3]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [190]:
cleaned = data.dropna()

In [191]:
# any row with NA will be dropped
cleaned

Unnamed: 0,0,1,2
0,1,6.5,3


In [192]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [193]:
# drop only the rows in which every value is NA
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [194]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [195]:
data[2] = NA
data

Unnamed: 0,0,1,2
0,1.0,6.5,
1,1.0,,
2,,,
3,,6.5,


In [197]:
# drop the column if all NA
data.dropna(how='all', axis = 1)

Unnamed: 0,0,1
0,1.0,6.5
1,1.0,
2,,
3,,6.5


In [198]:
df = DataFrame(np.random.randn(7, 3))
df

Unnamed: 0,0,1,2
0,0.191247,-1.873126,1.785697
1,0.725072,0.467903,0.085839
2,0.822686,1.143011,0.881387
3,0.681399,-0.178191,-0.097181
4,-0.829799,0.889181,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [199]:
# remember! ix slicing includes the upper bound
df.ix[:4, 1] = NA
df.ix[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,0.191247,,
1,0.725072,,
2,0.822686,,
3,0.681399,,-0.097181
4,-0.829799,,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [200]:
# keep only rows having at least thresh non-NA observations
df.dropna(thresh=3)

Unnamed: 0,0,1,2
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818
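
The meaning of `thresh` is easiest to see on a tiny frame. In this sketch (hypothetical values) the rows have one, two and three non-NA values respectively, and `thresh` is the minimum number of non-NA values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan],
                   [2.0, 3.0, np.nan],
                   [4.0, 5.0, 6.0]])

at_least_two = df.dropna(thresh=2)      # keeps rows 1 and 2
at_least_three = df.dropna(thresh=3)    # keeps only row 2
print(at_least_two, at_least_three, sep='\n\n')
```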


#### Filling in Missing Data

In [201]:
df

Unnamed: 0,0,1,2
0,0.191247,,
1,0.725072,,
2,0.822686,,
3,0.681399,,-0.097181
4,-0.829799,,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [202]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.191247,0.0,0.0
1,0.725072,0.0,0.0
2,0.822686,0.0,0.0
3,0.681399,0.0,-0.097181
4,-0.829799,0.0,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [204]:
df

Unnamed: 0,0,1,2
0,0.191247,,
1,0.725072,,
2,0.822686,,
3,0.681399,,-0.097181
4,-0.829799,,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [205]:
# use a dict to specify a different fill value for each column
df.fillna({1:0.5, 2: -1})

Unnamed: 0,0,1,2
0,0.191247,0.5,-1.0
1,0.725072,0.5,-1.0
2,0.822686,0.5,-1.0
3,0.681399,0.5,-0.097181
4,-0.829799,0.5,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [203]:
df

Unnamed: 0,0,1,2
0,0.191247,,
1,0.725072,,
2,0.822686,,
3,0.681399,,-0.097181
4,-0.829799,,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [206]:
# fillna returns a new object, but you can modify the existing object in place
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.191247,0.0,0.0
1,0.725072,0.0,0.0
2,0.822686,0.0,0.0
3,0.681399,0.0,-0.097181
4,-0.829799,0.0,-1.015267
5,0.509141,-0.761826,-0.982326
6,-0.340249,0.975043,0.894818


In [207]:
df = DataFrame(np.random.randn(6, 3))
df

Unnamed: 0,0,1,2
0,0.857998,0.963697,-0.892012
1,-0.66855,-0.270534,-1.527414
2,-0.044437,1.229962,-0.31168
3,-0.109165,1.825827,-0.334014
4,-1.7106,0.752877,-0.624511
5,0.693313,1.164449,-1.675549


In [208]:
df.ix[2:,1] = NA
df.ix[4:,2] = NA
df

Unnamed: 0,0,1,2
0,0.857998,0.963697,-0.892012
1,-0.66855,-0.270534,-1.527414
2,-0.044437,,-0.31168
3,-0.109165,,-0.334014
4,-1.7106,,
5,0.693313,,


In [209]:
# forward filling
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,0.857998,0.963697,-0.892012
1,-0.66855,-0.270534,-1.527414
2,-0.044437,-0.270534,-0.31168
3,-0.109165,-0.270534,-0.334014
4,-1.7106,-0.270534,-0.334014
5,0.693313,-0.270534,-0.334014


In [210]:
df

Unnamed: 0,0,1,2
0,0.857998,0.963697,-0.892012
1,-0.66855,-0.270534,-1.527414
2,-0.044437,,-0.31168
3,-0.109165,,-0.334014
4,-1.7106,,
5,0.693313,,


In [211]:
# you can put a limit of how many to fill
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,0.857998,0.963697,-0.892012
1,-0.66855,-0.270534,-1.527414
2,-0.044437,-0.270534,-0.31168
3,-0.109165,-0.270534,-0.334014
4,-1.7106,,-0.334014
5,0.693313,,-0.334014


In [212]:
# With fillna you can do lots of other things with a little creativity
data = Series([1, NA, 3.5, NA, 7])
# putting mean of values for NAs
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
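
Two follow-ups, sketched on a small hypothetical frame. First, recent pandas versions (to my knowledge) discourage `fillna(method='ffill')` in favour of the dedicated `ffill` method, which also accepts `limit`. Second, the mean-filling trick extends to DataFrames: passing `df.mean()` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                   'b': [2.0, 4.0, np.nan, 6.0]})

forward = df.ffill()              # propagate the last valid value down
limited = df.ffill(limit=2)       # fill at most two consecutive gaps
by_mean = df.fillna(df.mean())    # each column filled with its own mean
print(forward, limited, by_mean, sep='\n\n')
```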