In [2]:
from pandas import Series, DataFrame

In [22]:
import pandas as pd 
import numpy as np 

# Introduction to pandas Data Structures

The two workhorse data structures of the pandas library are **Series** and **DataFrame**.

### Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its *index*. The simplest Series is formed from only an array of data:

In [4]:
obj = Series([4, 7, -5, 3])
obj 

0    4
1    7
2   -5
3    3
dtype: int64

In [5]:
obj.values 

array([ 4,  7, -5,  3], dtype=int64)

In [6]:
obj.index 

RangeIndex(start=0, stop=4, step=1)

Often it is desirable to create a Series with an index identifying each data point:

In [13]:
obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])
obj2 

d    4
b    7
a   -5
c    3
dtype: int64

In [14]:
obj2.index 

Index(['d', 'b', 'a', 'c'], dtype='object')

Unlike with a regular NumPy array, we can use values in the index when selecting single values or a set of values:

In [15]:
obj2['a'] 

-5

In [17]:
obj2['d'] = 6
obj2[['c','a','d']]

c    3
a   -5
d    6
dtype: int64

NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [18]:
obj2 

d    6
b    7
a   -5
c    3
dtype: int64

In [19]:
obj2[obj2 > 0] 

d    6
b    7
c    3
dtype: int64

In [20]:
obj2 * 2 

d    12
b    14
a   -10
c     6
dtype: int64

In [23]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict :

In [24]:
'b' in obj2 

True

In [25]:
'e' in obj2 

False
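
Because a Series maps index labels to values, several dict idioms carry over. The sketch below rebuilds ``obj2`` locally so it runs on its own, and uses the standard Series methods ``get``, ``items`` and ``to_dict``; the variable names are only illustrative.

```python
import pandas as pd

# Rebuild the Series used above so the snippet is self-contained.
obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, not at the values.
has_b = 'b' in obj2
has_seven = 7 in obj2          # 7 is a value, not a label

# get() returns a default instead of raising KeyError for a missing label.
missing = obj2.get('e', 0)

# items() yields (label, value) pairs in index order, like dict.items().
pairs = [(label, int(value)) for label, value in obj2.items()]

# to_dict() converts back to a plain Python dict.
as_dict = obj2.to_dict()
```

Note that ``in`` checks labels only, which is the same behavior as testing a key against a dict.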

If we have data contained in a Python dict, we can create a Series from it by passing the dict:

In [26]:
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
sdata 

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [27]:
obj3 = Series(sdata) 

In [28]:
obj3 

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

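When only a dict is passed, the index of the resulting Series holds the dict's keys in insertion order, as the output above shows (older pandas versions sorted the keys). Passing an explicit index as well selects and orders the entries by that index instead: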

In [29]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, states) 
obj4 

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In this case, the values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as **NaN** (not a number), which pandas uses to mark missing or **NA** values. The terms **"missing"** and **"NA"** are used interchangeably to refer to missing data. The **``isnull``** and **``notnull``** functions in pandas are used to detect missing data:

In [31]:
pd.isnull(obj4) 

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]:
pd.notnull(obj4) 

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:

In [33]:
obj4.isnull() 

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
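
The boolean masks returned by ``isnull`` and ``notnull`` are ordinary Series, so they can be summed or used as filters. A minimal sketch, rebuilding ``obj4`` locally; ``isna`` is the alias pandas provides for ``isnull``.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(obj4.isnull().sum())

# Filtering with the notnull mask keeps only the entries that are present.
present = obj4[obj4.notnull()]

# isna is an alias of isnull and produces the same mask.
same_mask = obj4.isna().equals(obj4.isnull())
```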

A critical Series feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations:

In [35]:
obj3 

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [36]:
obj4 

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [37]:
obj3 + obj4 

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a ``name`` attribute, which integrates with other key areas of pandas functionality:

In [38]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4 

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment :

In [39]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj 

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index.

There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [40]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and the columns are placed in the order of the dict's keys (older pandas versions sorted them):

In [41]:
frame 

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


If we specify a sequence of columns, the DataFrame's columns will be arranged in exactly the order we pass:

In [46]:
frame = pd.DataFrame(data,columns = ['year', 'state', 'pop'] )
frame 

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


As with Series, if we pass a column that isn't contained in data, it will appear with NA values in the result :

In [48]:
frame2 = pd.DataFrame(data,columns = ['year', 'state', 'pop', 'debt'], index = ['one', 'two', 'three', 'four', 'five'] )
frame2 

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [49]:
frame2.columns 

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [50]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [51]:
frame2.state 

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

Note that the returned Series have the same index as the DataFrame, and their ``name`` attribute has been appropriately set. 

Rows can also be retrieved by label or by position with the ``loc`` and ``iloc`` indexers, respectively. (The older ``ix`` indexer is deprecated and was removed in pandas 1.0.)

In [52]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
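
Label-based and position-based row selection can be compared side by side with ``loc`` and ``iloc``. The sketch below rebuilds ``frame2`` as constructed earlier; the result names are only illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2.loc['three']      # row whose index label is 'three'
by_position = frame2.iloc[2]        # third row, counting from 0

# A row label and a column label can be given together to get one cell.
pop_three = frame2.loc['three', 'pop']
```

Both selections return the same row here, because the label 'three' sits at position 2.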

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

In [54]:
frame2['debt'] = 16.5 

In [55]:
frame2 

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [59]:
frame2['debt'] = np.arange(5.)

In [60]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0


When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If we assign a Series, it will instead be conformed exactly to the DataFrame's index, inserting missing values in any holes:

In [62]:
val = Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val 
frame2 

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
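
The difference between the two assignment rules can be sketched on a small, hypothetical frame: a list of the wrong length is rejected, while a Series is aligned on its labels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A list with the wrong number of elements raises ValueError.
try:
    frame['debt'] = [1.0, 2.0]
    error = None
except ValueError as exc:
    error = exc

# A Series is aligned on labels instead: 'two' matches, 'ten' is dropped,
# and rows without a matching label become NaN.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'ten'])
```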


Assigning a column that doesn't exist will create a new column. The ``del`` keyword will delete columns as with a dict.

In [63]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2 

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [64]:
del frame2['eastern']
frame2 

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


In [65]:
frame2.columns 

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dict of dicts:

In [66]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If passed to a DataFrame, it will interpret the outer dict keys as the columns and the inner keys as row indices: 

In [67]:
frame3 = DataFrame(pop) 

In [68]:
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


We can also transpose the result : 

In [69]:
frame3.T 

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


The keys in the inner dicts are unioned to form the index in the result. In the output above they appear in order of first occurrence; older pandas versions also sorted them. This isn't true if an explicit index is specified:

In [70]:
DataFrame(pop, index = [2001, 2002, 2003]) 

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


Dicts of Series are treated in much the same way:

In [71]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


The following types can be passed to the DataFrame constructor:

- 2D ndarray
- dict of arrays, lists, or tuples
- NumPy structured/record array
- dict of Series
- dict of dicts
- list of dicts or Series
- list of lists or tuples
- another DataFrame
- NumPy MaskedArray
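
A few of the input types listed above can be sketched as follows. The data and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2D ndarray: row and column labels are optional and default to 0..n-1.
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# List of dicts: each dict becomes a row; the union of keys gives the columns.
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

# List of lists: each inner list becomes a row.
from_lists = pd.DataFrame([[1, 'x'], [2, 'y']], columns=['num', 'letter'])
```

Keys missing from one of the dicts show up as NaN in that row, in the same way a missing column does.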

### Index Objects 

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [73]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can't be modified by the user: 

In [74]:
index[1] = 'd' 

TypeError: Index does not support mutable operations

Immutability is important so that Index objects can be safely shared among data structures:

In [75]:
index = pd.Index(np.arange(3))

In [76]:
obj2 = Series([1.5, -2.5, 0], index = index) 

In [78]:
obj2.index is index 

True

Main Index objects in pandas
----
- Index: The most general Index object, representing axis labels in a NumPy array of Python objects
- Int64Index: Specialized Index for integer values (removed in pandas 2.0, where a plain Index holds integer dtypes)
- MultiIndex: "Hierarchical" index object representing multiple levels of indexing on a single axis
- DatetimeIndex: Stores nanosecond timestamps (represented using NumPy's datetime64 dtype)
- PeriodIndex: Specialized Index for Period data (timespans)

In addition to being array-like, an Index also functions as a fixed-size set:

In [79]:
frame3 

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [80]:
'Ohio' in frame3.columns 

True

In [81]:
2003 in frame3.index 

False

Each Index has a number of methods and properties for set logic and answering other common questions about the data it contains. These are listed below:

### Index methods and properties
- append
- diff
- intersection
- union
- isin 
- delete 
- drop 
- insert 
- is_monotonic 
- is_unique 
- unique 
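
The set-logic methods can be sketched on two small, hypothetical Index objects. In recent pandas versions the set difference is spelled ``difference``, which is the name used here.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.intersection(right)      # labels present in both
either = left.union(right)           # labels present in at least one
only_left = left.difference(right)   # labels in left but not in right

# isin returns a boolean NumPy array, one entry per label.
mask = left.isin(['a', 'c'])

# Methods that "modify" an Index return a new one; left itself is unchanged.
longer = left.insert(1, 'z')
shorter = left.drop('a')
```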

### Essential Functionality 

### Reindexing 

A critical method on pandas objects is ``reindex``, which creates a new object with the data *conformed* to a new index. Consider a simple example:

In [82]:
obj = Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])

In [83]:
obj 

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling ``reindex`` on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [84]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [85]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [86]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The ``method`` option allows us to do this, using a method such as ``ffill`` which forward fills the values :

In [87]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

### List of available ``method`` options
- ffill or pad : Fill or carry values forward.
- bfill or backfill : Fill (or carry) values backward
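
The two fill directions can be compared on the same Series. A minimal sketch reusing the ``obj3`` values from above:

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill carries the last seen value forward into the new labels.
forward = obj3.reindex(range(6), method='ffill')

# bfill pulls the next value backward; label 5 has no later value to use.
backward = obj3.reindex(range(6), method='bfill')
```

With ``bfill`` the last label stays missing, because there is nothing after it to fill from.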

With DataFrame, ``reindex`` can alter either the (row) index, the columns, or both. When passed just a sequence, the rows are reindexed in the result:

In [88]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [89]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed using the ``columns`` keyword:

In [90]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


Both can be reindexed in one shot, though interpolation will only apply row-wise (axis 0):

In [109]:
frame.reindex(index = ['a', 'b', 'c', 'd'], columns=states) 

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


### Reindex function arguments

- index : New sequence to use as index
- method 
- fill_value
- limit 
- level 
- copy 
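
As an illustration of the ``limit`` argument, the sketch below forward fills across a gap but caps the fill at two consecutive labels. The Series is made up for the example.

```python
import pandas as pd

obj = pd.Series([1.0, 2.0], index=[0, 5])

# Forward fill, but carry each value across at most 2 new labels.
filled = obj.reindex(range(7), method='ffill', limit=2)

# Labels 1 and 2 are filled from label 0; labels 3 and 4 are beyond the limit.
gaps = filled.isna().tolist()
```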

### Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list without those entries. As that can require a bit of munging and set logic, the ``drop`` method will return a new object with the indicated value or values deleted from an axis:

In [111]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [112]:
new_obj = obj.drop('c')
new_obj 

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [116]:
obj.drop(['d', 'c'], inplace=True)

In [117]:
obj

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis :

In [118]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [119]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [120]:
data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


### Indexing, selection and filtering 

Series indexing (``obj[...]``) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples:

In [122]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj 

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [123]:
obj['b'] 

1.0

In [124]:
obj[1]

1.0

In [125]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [126]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [127]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [128]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

In [129]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64
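
The inclusive and exclusive endpoints can be seen side by side by using ``loc`` for labels and ``iloc`` for positions. These indexers also make the intent explicit, which matters because recent pandas releases warn when an integer key in ``obj[...]`` is treated as a position on a non-integer index.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: includes the endpoint 'c'
by_position = obj.iloc[1:2]     # positional slice: excludes position 2

second = obj.iloc[1]            # explicit positional lookup
```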

*Setting* using these methods works just as one would expect:

In [130]:
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence :

In [131]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [132]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [133]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


Indexing like this has a few special cases. The first is selecting rows by slicing or with a boolean array:

In [134]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [135]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


This might seem inconsistent to some, but this syntax arose out of practicality and nothing more. Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [136]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [137]:
data[data < 5] = 0

In [138]:
data 

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


This is intended to make DataFrame syntactically more like an ndarray in this case.

### Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let's look at an example:

In [139]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1 

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [140]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
s2 

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

Adding these two yields:

In [141]:
s1 + s2 

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces NA values in the indices that don't overlap. *Missing* values propagate in arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and columns :

In [144]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [145]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2 

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [146]:
df1 + df2 

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


### Arithmetic methods with fill values

In arithmetic operations between differently-indexed objects, we might want to fill with a special value, like 0, when an axis label is found in one object but not the other : 

In [147]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df1 

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [148]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2 

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Adding these together results in NA values in the locations that don't overlap 

In [149]:
df1 + df2 

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


Using the ``add`` method on df1, we pass df2 and an argument to ``fill_value``:

In [150]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


### Flexible arithmetic methods
- add
- sub 
- div 
- mul 
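
The other flexible methods accept ``fill_value`` in the same way as ``add``. A minimal sketch reusing the ``df1`` and ``df2`` shapes from above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Each method mirrors an operator but lets a missing operand be filled in.
diff = df1.sub(df2, fill_value=0)     # like df1 - df2, missing treated as 0
prod = df1.mul(df2, fill_value=1)     # like df1 * df2, missing treated as 1

# Without fill_value the method and the operator give the same result.
same = df1.add(df2).equals(df1 + df2)
```

The fill value should be chosen per operation: 0 is neutral for addition and subtraction, 1 for multiplication.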

### Operations between DataFrame and Series 

As with NumPy arrays, arithmetic between DataFrame and Series is well-defined. First, as a motivating example, let us consider the difference between a 2D array and one of its rows:

In [151]:
arr = np.arange(12.).reshape((3, 4))

In [152]:
arr 

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [153]:
arr[0] 

array([0., 1., 2., 3.])

In [154]:
arr - arr[0] 

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

*This is referred to as broadcasting*

Operations between a DataFrame and a Series are similar : 

In [155]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

In [156]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [157]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns, broadcasting the rows: 

In [158]:
frame - series 

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union:

In [159]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [160]:
series2 

b    0
e    1
f    2
dtype: int64

In [161]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:

In [162]:
series3 = frame['d'] 

In [163]:
frame 

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [164]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [165]:
frame.sub(series3, axis = 0) 

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


The axis number you pass is the *axis to match on*. In this case we mean to match on the DataFrame's row index and broadcast across.
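
One common use of matching on the row index is subtracting a per-row statistic. The sketch below, on the same kind of frame as above, contrasts the arithmetic method with the plain operator; ``axis='index'`` is equivalent to ``axis=0``.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = frame.mean(axis=1)          # one value per row label

# Match row_means on the row index and broadcast across the columns.
demeaned = frame.sub(row_means, axis='index')

# The plain operator matches on the columns instead, so no labels line up
# and the result is all NaN over the union of both label sets.
misaligned = frame - row_means
```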

### Function application and mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects:

In [175]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-0.715716,1.822352,0.828372
Ohio,-0.65991,-0.49462,0.658842
Texas,-0.494519,0.841856,-0.743946
Oregon,0.241871,1.059879,-0.359894


In [176]:
np.abs(frame) 

Unnamed: 0,b,d,e
Utah,0.715716,1.822352,0.828372
Ohio,0.65991,0.49462,0.658842
Texas,0.494519,0.841856,0.743946
Oregon,0.241871,1.059879,0.359894


Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame's ``apply`` method does exactly this :

In [177]:
f = lambda x: x.max() - x.min()
frame.apply(f) 

b    0.957587
d    2.316972
e    1.572319
dtype: float64

In [178]:
frame.apply(f, axis = 1)  

Utah      2.538068
Ohio      1.318752
Texas     1.585803
Oregon    1.419774
dtype: float64

Many of the most common array statistics (like ``sum`` and ``mean``) are DataFrame methods, so using ``apply`` is not necessary.
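
For instance, the max-minus-min spread computed with ``apply`` can be written directly with the built-in reductions. A minimal sketch on random data, using a seeded generator (an assumption made here for reproducibility):

```python
import numpy as np
import pandas as pd

# Seeded generator so the numbers are reproducible.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Column-wise spread via apply ...
spread_apply = frame.apply(lambda x: x.max() - x.min())

# ... and the same result from the built-in reductions.
spread_direct = frame.max() - frame.min()
```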

The function passed to ``apply`` need not return a scalar value; it can also return a Series with multiple values:

In [179]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.715716,-0.49462,-0.743946
max,0.241871,1.822352,0.828372


Element-wise Python functions can be used, too. Suppose you want to compute a formatted string from each floating point value in ``frame``. You can do this using ``applymap`` (renamed ``DataFrame.map`` in pandas 2.1, where ``applymap`` is deprecated):

In [180]:
format = lambda x: '%.2f' %x 
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.72,1.82,0.83
Ohio,-0.66,-0.49,0.66
Texas,-0.49,0.84,-0.74
Oregon,0.24,1.06,-0.36


The reason for the name applymap is that Series has a ``map`` method for applying an element-wise function : 

In [181]:
frame['e'].map(format)

Utah       0.83
Ohio       0.66
Texas     -0.74
Oregon    -0.36
Name: e, dtype: object

### Sorting and ranking 

Sorting a data set by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the ``sort_index`` method, which returns a new, sorted object:

In [182]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [184]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis : 

In [185]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

In [186]:
frame 

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [187]:
frame.sort_index() 

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [188]:
frame.sort_index(axis = 1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


The data is sorted in ascending order by default, but can be sorted in descending order, too : 

In [189]:
frame.sort_index(axis = 1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its ``sort_values`` method:

In [190]:
obj = pd.Series([4, 7, -3, 2])
obj 

0    4
1    7
2   -3
3    2
dtype: int64

In [191]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default : 

In [192]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj 

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [193]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

On DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more names to the ``by`` option:

In [194]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [195]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [196]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
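
Sorting by several columns can also mix directions. A minimal sketch using ``sort_values``, whose ``ascending`` argument accepts one flag per column:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each 'a' group.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
```

The original ``frame`` is left untouched; ``sort_values`` returns a new object.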


*Ranking* is closely related to sorting, assigning ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by ``numpy.argsort``, except that ties are broken according to a rule. The ``rank`` methods for Series and DataFrame are the place to look; by default ``rank`` breaks ties by assigning each group the mean rank:

In [197]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [198]:
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [199]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can be assigned according to the order they're observed in the data :

In [200]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Naturally, you can rank in descending order too :

In [201]:
obj.rank(ascending=False, method = 'max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

### Tie-breaking methods with rank

- 'average'
- 'min'
- 'max'
- 'first'
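
The four methods listed above can be compared side by side by collecting their results in one DataFrame. A minimal sketch on the same values as ``obj``:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, for side-by-side comparison.
ranks = pd.DataFrame({method: obj.rank(method=method)
                      for method in ['average', 'min', 'max', 'first']})
```

Only the tied values (the two 7s and the two 4s) differ between the columns; untied values get the same rank under every method.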

DataFrame can compute ranks over the rows or the columns : 

In [202]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})

In [203]:
frame 

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [205]:
frame.rank(axis = 1) 

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


### Axis indexes with duplicate values 

While many pandas functions (like ``reindex``) require that the labels be unique, it's not mandatory. Let's consider a small Series with duplicate indices.

In [206]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index's ``is_unique`` property will tell you whether its values are unique or not: 

In [208]:
obj.index.is_unique 

False

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

In [209]:
obj['a'] 

a    0
a    1
dtype: int64

In [210]:
obj['c'] 

4

The same logic extends to indexing rows in a DataFrame:

In [211]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,0.733332,1.115521,0.440759
a,1.435622,-1.758288,-0.9281
b,-0.264451,0.315724,-1.064022
b,-0.56766,0.631902,1.622736


In [212]:
df.loc['b']

Unnamed: 0,0,1,2
b,-0.264451,0.315724,-1.064022
b,-0.56766,0.631902,1.622736


### Summarizing and Computing Descriptive Statistics 

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of *reductions* or *summary statistics*, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame.

In [213]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame's ``sum`` method returns a Series containing column sums:

In [214]:
df.sum() 

one    9.25
two   -5.80
dtype: float64

Passing ``axis=1`` sums over the rows instead:

In [215]:
df.sum(axis = 1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

NA values are excluded by default. Note that an all-NA slice (row ``c`` here) sums to 0 rather than NaN, while ``mean`` returns NaN for such a slice. Exclusion of NA values can be disabled using the ``skipna`` option:

In [217]:
df.sum(axis = 1, skipna=True)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [218]:
df.sum(axis = 1, skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

In [219]:
df.mean(axis = 1, skipna=False) 

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

### Common options for reduction methods
- axis
- skipna 
- level 

Some methods, like ``idxmax`` and ``idxmin``, return indirect statistics like the index value where the minimum or maximum values are attained:

In [224]:
df.idxmax()

one    b
two    d
dtype: object

In [225]:
df.idxmin() 

one    d
two    b
dtype: object

Other methods are *accumulations* :

In [226]:
df.cumsum() 

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


Another type of method is neither a reduction nor an accumulation. ``describe`` is one such example, producing multiple summary statistics in one shot:

In [227]:
df.describe() 

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


On non-numeric data, ``describe`` produces alternate summary statistics :

In [228]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [229]:
obj 

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [230]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

### List of summary statistics and related methods:

- count
- describe
- min, max
- argmin, argmax
- idxmin, idxmax
- quantile
- sum
- mean 
- median
- mad
- var 
- std
- skew 
- kurt
- cumsum
- cummin, cummax
- cumprod
- diff
- pct_change
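
A few of the methods listed above can be tried on a tiny, made-up Series where the results are easy to check by hand:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

median = s.median()            # midpoint of the two middle values
q25 = s.quantile(0.25)         # linear interpolation between data points
steps = s.diff()               # first difference; the first entry is NaN
growth = s.pct_change()        # fractional change between consecutive values
running = s.cumprod()          # cumulative product
```

Here ``diff`` and ``pct_change`` return Series of the same length as the input, with a leading NaN because the first value has no predecessor.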


### Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance:

In [239]:
import pandas_datareader.data as web

In [241]:
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010','1/1/2020')

price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})

Now we compute the percentage changes of the prices:

In [243]:
returns = price.pct_change()

In [245]:
returns.tail() 

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019-12-26,0.01984,-0.000519,0.008197,0.012534
2019-12-27,-0.000379,0.002668,0.001828,-0.006256
2019-12-30,0.005935,-0.018186,-0.008619,-0.01165
2019-12-31,0.007307,0.009261,0.000698,0.000659
2020-01-02,0.022816,0.010295,0.018516,0.0227


The ``corr`` method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, ``cov`` computes the covariance:

In [246]:
returns.MSFT.corr(returns.IBM)

0.4904284660096838

In [247]:
returns.MSFT.cov(returns.IBM) 

8.657033537766205e-05
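
Because the download above depends on a network service, here is a self-contained sketch on synthetic data (the Series ``x`` and ``y`` are made up for illustration). It checks that the Pearson correlation equals the covariance divided by the product of the standard deviations, computed on the overlapping non-NA pairs:

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data; not the stock returns used above.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 0.5 * x + pd.Series(rng.normal(size=200))
y.iloc[:5] = np.nan  # introduce a few missing values

# corr and cov both use only the pairs where x and y are non-NA.
valid = x.notna() & y.notna()
pearson = x.corr(y)
manual = x.cov(y) / (x[valid].std() * y[valid].std())

print(pearson, manual)
```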

DataFrame's ``corr`` and ``cov`` methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively.

In [248]:
returns.corr() 

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.384224,0.457604,0.464621
IBM,0.384224,1.0,0.490428,0.406632
MSFT,0.457604,0.490428,1.0,0.541196
GOOG,0.464621,0.406632,0.541196,1.0


In [249]:
returns.cov()  

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.000263,7.7e-05,0.000106,0.000116
IBM,7.7e-05,0.000152,8.7e-05,7.7e-05
MSFT,0.000106,8.7e-05,0.000205,0.000119
GOOG,0.000116,7.7e-05,0.000119,0.000237


Using DataFrame's ``corrwith`` method, you can compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [250]:
returns.corrwith(returns.IBM)

AAPL    0.384224
IBM     1.000000
MSFT    0.490428
GOOG    0.406632
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here we compute correlations of percentage changes with volume:

In [251]:
returns.corrwith(volume)

AAPL   -0.067294
IBM    -0.157094
MSFT   -0.090448
GOOG   -0.020118
dtype: float64

Passing ``axis=1`` does things row-wise instead. In all cases, the data points are aligned by label before computing the correlation.
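
As a small sketch of the row-wise case, the two frames below are invented for illustration (``left`` and ``right`` are hypothetical names). Each row of ``right`` is an increasing linear function of the matching row of ``left``, and its columns are deliberately shuffled to show that alignment happens by label:

```python
import pandas as pd

# Hypothetical data for illustration only.
left = pd.DataFrame(
    [[1.0, 2.0, 4.0], [3.0, 1.0, 2.0]],
    index=["r1", "r2"],
    columns=["x", "y", "z"],
)

# Same rows scaled and shifted, with the columns in a different order.
right = (2 * left + 1)[["z", "x", "y"]]

# axis=1 correlates matching rows; columns are aligned by label first.
row_corr = left.corrwith(right, axis=1)
print(row_corr)
```

Since every row pair is perfectly linearly related, each row-wise correlation comes out at (approximately) 1.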

### Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [252]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [253]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

The first is ``unique``, which gives you an array of the unique values in a Series:

In [254]:
obj.unique() 

array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are not necessarily returned in sorted order, but they can be sorted after the fact if needed (for example, by assigning ``uniques = obj.unique()`` and then calling ``uniques.sort()``). Relatedly, ``value_counts`` computes a Series containing value frequencies:

In [255]:
obj.value_counts() 

c    3
a    3
b    2
d    1
dtype: int64

The Series is sorted by value in descending order as a convenience. ``value_counts`` is also available as a top-level pandas method that can be used with any array or sequence:

In [256]:
pd.value_counts(obj.values, sort = False) 

a    3
c    3
d    1
b    2
dtype: int64
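
Newer pandas releases may emit a deprecation warning for the top-level ``pd.value_counts``. A minimal sketch of the same idea using only Series methods (``values`` is a plain list introduced here for illustration):

```python
import pandas as pd

# Any array or sequence can be wrapped in a Series first.
values = ['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']

# Frequencies, sorted by count in descending order by default.
counts = pd.Series(values).value_counts()

# Unique values, sorted after the fact with the built-in sorted().
sorted_uniques = sorted(pd.Series(values).unique())

print(counts)
print(sorted_uniques)
```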

Lastly, ``isin`` is responsible for vectorized set membership and can be very useful in filtering a data set down to a subset of values in a Series or column in a DataFrame:

In [257]:
mask = obj.isin(['b','c'])

In [258]:
mask 

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [259]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
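
The same pattern applies to a DataFrame column. In this sketch the ``orders`` frame and its column names are hypothetical; the mask from ``isin`` keeps matching rows, and ``~`` inverts it:

```python
import pandas as pd

# Hypothetical frame for illustration.
orders = pd.DataFrame({
    "item": ['c', 'a', 'd', 'a', 'b'],
    "qty": [1, 2, 3, 4, 5],
})

wanted = ['b', 'c']
subset = orders[orders["item"].isin(wanted)]     # rows whose item is 'b' or 'c'
excluded = orders[~orders["item"].isin(wanted)]  # all remaining rows

print(subset)
print(excluded)
```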

### Unique, value_counts, and membership methods

- isin
- unique
- value_counts 

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. For example:

In [260]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


Passing ``pandas.value_counts`` to this DataFrame's ``apply`` function gives:

In [261]:
result = data.apply(pd.value_counts).fillna(0)

In [262]:
result 

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
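
An equivalent sketch that avoids the top-level function by calling the Series method inside a lambda (``data_demo`` and ``result_demo`` are names introduced here so the example stands alone):

```python
import pandas as pd

data_demo = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                          'Qu2': [2, 3, 1, 2, 3],
                          'Qu3': [1, 5, 2, 4, 4]})

# Count values per column; values absent from a column become NaN, then 0.
result_demo = (
    data_demo.apply(lambda col: col.value_counts())
    .fillna(0)
    .astype(int)
)
print(result_demo)
```

The row labels of the result are the distinct values found across all columns, and each cell holds how often that value occurs in the column.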


### Handling Missing Data

Missing data is common in most data analysis applications. One of the goals in designing pandas was to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data, as we have seen earlier in this chapter.

pandas uses the floating point value **NaN** (Not a Number) to represent missing data in both floating point and non-floating point arrays. It is just used as a *sentinel* that can be easily detected:

In [263]:
string_data = Series(['aardvark','artichoke',np.nan,'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [264]:
string_data.isnull() 

0    False
1    False
2     True
3    False
dtype: bool
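
Python's built-in ``None`` is treated as NA as well. A short sketch, using a copy of the Series above under the hypothetical name ``string_demo``:

```python
import numpy as np
import pandas as pd

string_demo = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# Assign None by position; pandas also reports it as missing.
string_demo.iloc[0] = None

na_mask = string_demo.isnull()
n_present = string_demo.notnull().sum()
print(na_mask)
```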

### NA handling methods

- dropna
- fillna
- isnull
- notnull

### Filtering out missing data 

We have a number of options for filtering out missing data. While doing it by hand is always an option, ``dropna`` can be very helpful. On a Series, it returns the Series with only the non-null data and index values:

In [265]:
from numpy import nan as NA 

In [273]:
data = Series([1, NA, 3.5, NA, 7])

In [274]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [275]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

We could have computed this ourselves using boolean indexing:

In [276]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, these are a bit more complex. We may want to drop rows or columns which are all NA or just those containing any NAs. ``dropna`` by default drops any row containing a missing value:

In [291]:
data = DataFrame([[1., 6.5, 3.],[1., NA, NA],[NA, NA, NA],[NA, 6.5, 3.]]) 

In [288]:
data 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [289]:
cleaned = data.dropna()

In [290]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing ``how='all'`` will only drop rows that are all NA:

In [294]:
data = DataFrame([[1., 6.5, 3.],[1., NA, NA],[NA, NA, NA],[NA, 6.5, 3.]]) 
cleaned = data.dropna(how = 'all')

In [295]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


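Columns can be dropped in the same way by passing ``axis=1``. Here no column is entirely NA, so ``how='all'`` leaves the frame unchanged, while ``how='any'`` drops every column: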

In [296]:
data = DataFrame([[1., 6.5, 3.],[1., NA, NA],[NA, NA, NA],[NA, 6.5, 3.]]) 
cleaned = data.dropna(axis = 1, how = 'all')

In [297]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [298]:
data = DataFrame([[1., 6.5, 3.],[1., NA, NA],[NA, NA, NA],[NA, 6.5, 3.]]) 
cleaned = data.dropna(axis = 1, how = 'any')

In [299]:
cleaned

0
1
2
3


A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You may indicate this with the ``thresh`` argument:

In [311]:
df = DataFrame(np.random.randn(7,3))

In [313]:
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df

Unnamed: 0,0,1,2
0,-0.223069,,
1,-0.51527,,
2,1.341823,,0.033618
3,1.321211,,0.306673
4,-0.149716,-0.346206,-0.835967
5,0.318544,-0.198479,-0.382104
6,-0.55202,2.1181,0.884973


In [314]:
cleaned = df.dropna(thresh = 3)
cleaned

Unnamed: 0,0,1,2
4,-0.149716,-0.346206,-0.835967
5,0.318544,-0.198479,-0.382104
6,-0.55202,2.1181,0.884973
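
Since the frame above is random, here is a deterministic sketch of ``thresh`` (the ``obs`` frame is invented for illustration). ``thresh=n`` keeps the rows that have at least ``n`` non-NA values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has one observation, row 1 has two, row 2 has three.
obs = pd.DataFrame(
    [[1.0, np.nan, np.nan],
     [2.0, np.nan, 0.5],
     [3.0, 4.0, 0.7]]
)

at_least_two = obs.dropna(thresh=2)  # keeps rows with >= 2 non-NA values
all_three = obs.dropna(thresh=3)     # keeps rows with >= 3 non-NA values

print(at_least_two)
print(all_three)
```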


### Filling in Missing Data 

Rather than filtering out missing data (and potentially discarding other data along with it), we may want to fill in the "holes" in any number of ways. For most purposes, the ``fillna`` method is the workhorse function to use. Calling ``fillna`` with a constant replaces missing values with that value:

In [315]:
df = DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df 

Unnamed: 0,0,1,2
0,-0.333229,,
1,2.324243,,
2,0.596543,,0.091931
3,-0.750003,,2.079477
4,-1.206976,0.582248,-2.118402
5,0.720056,0.284408,-0.397013
6,0.207148,0.954819,-2.149309


In [316]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.333229,0.0,0.0
1,2.324243,0.0,0.0
2,0.596543,0.0,0.091931
3,-0.750003,0.0,2.079477
4,-1.206976,0.582248,-2.118402
5,0.720056,0.284408,-0.397013
6,0.207148,0.954819,-2.149309


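By calling ``fillna`` with a dict, we can use a different fill value for each column. Keys that do not match a column label (such as 3 here) are ignored, which is why column 2 is left unfilled: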

In [318]:
df.fillna({1:0.5, 3:-1})

Unnamed: 0,0,1,2
0,-0.333229,0.5,
1,2.324243,0.5,
2,0.596543,0.5,0.091931
3,-0.750003,0.5,2.079477
4,-1.206976,0.582248,-2.118402
5,0.720056,0.284408,-0.397013
6,0.207148,0.954819,-2.149309


``fillna`` returns a new object, but we can modify the existing object in place:

In [319]:
df.fillna(0, inplace=True)

In [320]:
df 

Unnamed: 0,0,1,2
0,-0.333229,0.0,0.0
1,2.324243,0.0,0.0
2,0.596543,0.0,0.091931
3,-0.750003,0.0,2.079477
4,-1.206976,0.582248,-2.118402
5,0.720056,0.284408,-0.397013
6,0.207148,0.954819,-2.149309


The same interpolation methods available for reindexing can be used with ``fillna``:

In [322]:
df = DataFrame(np.random.randn(6,3))
df

Unnamed: 0,0,1,2
0,-0.874784,-0.472673,0.634364
1,1.872826,-0.926159,-1.63019
2,1.599765,0.204119,2.181159
3,1.578553,0.753978,1.303471
4,0.851861,0.189777,0.130257
5,0.234069,0.830682,-1.116323


In [323]:
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.874784,-0.472673,0.634364
1,1.872826,-0.926159,-1.63019
2,1.599765,,2.181159
3,1.578553,,1.303471
4,0.851861,,
5,0.234069,,


In [324]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-0.874784,-0.472673,0.634364
1,1.872826,-0.926159,-1.63019
2,1.599765,-0.926159,2.181159
3,1.578553,-0.926159,1.303471
4,0.851861,-0.926159,1.303471
5,0.234069,-0.926159,1.303471


In [325]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-0.874784,-0.472673,0.634364
1,1.872826,-0.926159,-1.63019
2,1.599765,-0.926159,2.181159
3,1.578553,-0.926159,1.303471
4,0.851861,,1.303471
5,0.234069,,1.303471
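
Newer pandas releases steer users toward the dedicated ``ffill`` method instead of ``fillna(method='ffill')``. A deterministic sketch, with a made-up frame named ``gaps``:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with trailing gaps in both columns.
gaps = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan],
    "y": [5.0, 6.0, np.nan, np.nan],
})

filled = gaps.ffill()                 # carry the last valid value forward
filled_limited = gaps.ffill(limit=2)  # fill at most 2 consecutive NaNs

print(filled)
print(filled_limited)
```

With ``limit=2``, the third consecutive gap in column ``x`` stays NaN.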


With ``fillna`` we can do lots of other things with a little creativity. For example, we can pass the mean or median value of a Series:

In [326]:
data = Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [327]:
df.fillna(data.mean())

Unnamed: 0,0,1,2
0,-0.874784,-0.472673,0.634364
1,1.872826,-0.926159,-1.63019
2,1.599765,3.833333,2.181159
3,1.578553,3.833333,1.303471
4,0.851861,3.833333,3.833333
5,0.234069,3.833333,3.833333
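
The cell above fills every gap with one scalar. A related sketch, assuming you would rather fill each column with its own mean: pass the Series returned by ``mean()``, whose index labels are matched against the column labels (``gaps`` is a hypothetical frame):

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 4.0, 8.0],
})

# gaps.mean() is a Series indexed by column label: x -> 2.0, y -> 6.0.
filled_by_col = gaps.fillna(gaps.mean())
print(filled_by_col)
```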


### fillna function arguments

- value
- method
- axis
- inplace 
- limit

### Hierarchical Indexing

*Hierarchical indexing* is an important feature of pandas that enables us to have multiple (two or more) index *levels* on an axis. Somewhat abstractly, it provides a way for us to work with higher dimensional data in a lower dimensional form. Let's start with a simple example: create a Series with a list of lists or arrays as the index:

In [328]:
data = Series(np.random.rand(10),index = [['a', 'a', 'a', 'b', 'b', 'b', 'c','c','d','d'],
                      [1,2,3,1,2,3,1,2,2,3]]
             )
data

a  1    0.750725
   2    0.188640
   3    0.622940
b  1    0.985170
   2    0.976815
   3    0.042923
c  1    0.051562
   2    0.612588
d  2    0.028290
   3    0.081693
dtype: float64

We are seeing a prettified view of a Series with a ``MultiIndex`` as its index. The "gaps" in the index display mean "use the label directly above":

In [329]:
data.index 

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

With a hierarchically-indexed object, so-called *partial* indexing is possible, enabling us to concisely select subsets of the data:

In [330]:
data['b']

1    0.985170
2    0.976815
3    0.042923
dtype: float64

In [331]:
data['b':'c']

b  1    0.985170
   2    0.976815
   3    0.042923
c  1    0.051562
   2    0.612588
dtype: float64

In [333]:
data.loc[['b','d']]

b  1    0.985170
   2    0.976815
   3    0.042923
d  2    0.028290
   3    0.081693
dtype: float64

Selection is even possible in some cases from an "inner" level:

In [334]:
data[:,2]

a    0.188640
b    0.976815
c    0.612588
d    0.028290
dtype: float64
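
For label-based selection on a hierarchical index, ``loc`` and ``xs`` are explicit alternatives. A deterministic sketch, where the Series ``s`` is invented for illustration:

```python
import pandas as pd

# Hypothetical two-level Series.
s = pd.Series(
    [10, 20, 30, 40, 50],
    index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 2, 2]],
)

outer = s.loc[['a', 'c']]     # select several outer-level labels
inner = s.loc[:, 2]           # select label 2 on the inner level
inner_xs = s.xs(2, level=1)   # same selection via a cross-section

print(outer)
print(inner)
```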

Hierarchical indexing plays a critical role in reshaping data and in group-based operations like forming a pivot table. For example, this data could be arranged into a DataFrame using its ``unstack`` method:

In [335]:
data.unstack() 

Unnamed: 0,1,2,3
a,0.750725,0.18864,0.62294
b,0.98517,0.976815,0.042923
c,0.051562,0.612588,
d,,0.02829,0.081693


The inverse operation of ``unstack`` is ``stack``:

In [336]:
data.unstack().stack() 

a  1    0.750725
   2    0.188640
   3    0.622940
b  1    0.985170
   2    0.976815
   3    0.042923
c  1    0.051562
   2    0.612588
d  2    0.028290
   3    0.081693
dtype: float64

With a DataFrame, either axis can have a hierarchical index:

In [338]:
frame = DataFrame(np.arange(12).reshape((4,3)),
                  index = [['a', 'a','b','b'], [1,2,1,2]],
                  columns=[['Ohio','Ohio','Colorado'],
                          ['Green','Red','Green']])

In [339]:
frame 

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output (don't confuse index names with axis labels!):

In [340]:
frame.index.names = ['key1','key2']

In [341]:
frame.columns.names = ['state','color']

In [342]:
frame 

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


With partial column indexing you can similarly select groups of columns:

In [343]:
frame['Ohio']

Unnamed: 0_level_0,Unnamed: 1_level_0,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


A ``MultiIndex`` can be created by itself and then reused; the columns in the above DataFrame with level names could be created like this:

pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],
                          names = ['state','color'])
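
A runnable sketch of that reuse: both axes are built with ``pd.MultiIndex.from_arrays`` and then passed to the DataFrame constructor (``frame_demo`` is a name introduced here so the example does not depend on earlier cells):

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'],
)
index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['key1', 'key2'],
)

frame_demo = pd.DataFrame(np.arange(12).reshape((4, 3)),
                          index=index, columns=columns)

# Partial column indexing on the outer level returns the inner columns.
ohio = frame_demo['Ohio']
print(frame_demo)
```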

### Reordering and Sorting Levels

At times we need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The ``swaplevel`` method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

In [348]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [349]:
frame.swaplevel('key1','key2')

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11
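
``swaplevel`` does not reorder the rows. To sort by a level afterwards, ``sort_index`` with its ``level`` argument can be used. A small sketch with a simplified, hypothetical frame (single-level columns ``x`` and ``y``):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['key1', 'key2'],
)
small = pd.DataFrame(np.arange(8).reshape((4, 2)),
                     index=index, columns=['x', 'y'])

swapped = small.swaplevel('key1', 'key2')     # levels interchanged, rows untouched
sorted_swapped = swapped.sort_index(level=0)  # sort by the new outer level (key2)

print(sorted_swapped)
```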


### Summary Statistics by Level

Many descriptive and summary statistics on DataFrame and Series have a ``level`` option in which you can specify the level you want to sum by on a particular axis. Consider the above DataFrame; we can sum by level on either the rows or columns like so:

In [352]:
frame 

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [353]:
frame.sum(level = 'key2')

Unnamed: 0_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


Under the hood, this utilizes pandas' ``groupby`` machinery.
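
Newer pandas releases no longer accept the ``level`` argument on reductions, so calling ``groupby`` directly is the more portable spelling. A sketch on a rebuilt copy of the frame (``frame_demo`` is a hypothetical name; grouping the columns is done here by transposing, which is one possible approach):

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']),
)

# Sum rows that share the same key2 label.
by_key2 = frame_demo.groupby(level='key2').sum()

# Sum columns that share the same color: transpose, group the rows, transpose back.
by_color = frame_demo.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```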

### Using a DataFrame's Columns

It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns. Here's an example DataFrame:

In [355]:
frame = DataFrame({'a':range(7), 
                   'b':range(7,0,-1),
                   'c': ['one','one','one','two','two','two','two'],
                   'd':[0,1,2,0,1,2,3,]})

In [356]:
frame 

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


DataFrame's ``set_index`` function will create a new DataFrame using one or more of its columns as the index:

In [358]:
frame2 = frame.set_index(['c','d'])

In [359]:
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


By default, the columns are removed from the DataFrame, though you can leave them in:

In [360]:
frame.set_index(['c','d'], drop = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


``reset_index``, on the other hand, does the opposite of ``set_index``; the hierarchical index levels are moved into the columns:

In [361]:
frame2.reset_index() 

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


## Other pandas Topics 

Here are some additional topics that may be of use in your data travels.

### Integer Indexing

Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you would not expect the following code to generate an error:

In [362]:
ser = Series(np.arange(3.))

In [363]:
ser[-1]

KeyError: -1

In this case, pandas could "fall back" on integer indexing, but there's not a safe and general way to do this without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:

In [364]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

On the other hand, with a non-integer index, there is no potential for ambiguity:

In [368]:
ser2 = Series(np.arange(3.), index = ['a','b','c'])

In [369]:
ser2[-2]

1.0

To keep things consistent, if you have an axis index containing integers, data selection with integers will always be label-oriented. When you need position-based selection regardless of the index type, use ``iloc``:

In [371]:
ser2.iloc[:1]

a    0.0
dtype: float64
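
To make the distinction concrete, here is a short sketch contrasting ``loc`` (labels) and ``iloc`` (positions) on an integer-indexed Series; ``ser_demo`` is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

ser_demo = pd.Series(np.arange(3.))  # integer labels 0, 1, 2

last = ser_demo.iloc[-1]          # position-based: the last element
by_label = ser_demo.loc[0:1]      # label-based slice: both endpoints included
by_position = ser_demo.iloc[0:1]  # position-based slice: end excluded

print(last)
print(by_label)
print(by_position)
```

Note the slicing difference: label slices with ``loc`` include the end label, while positional slices with ``iloc`` follow the usual Python half-open convention.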