## Getting Started with pandas

pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and is easy to use in NumPy-centric applications.

### Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: <b>Series</b> and <b>DataFrame</b>.
While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

#### Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = Series([10, 20 , 30, 40])
obj

0    10
1    20
2    30
3    40
dtype: int64

In [2]:
# you can get only the array representation using:
obj.values

array([10, 20, 30, 40])

In [5]:
# you can get only the index representation using obj.index
# list(obj.index) converts it to a list; obj.index.values gives a NumPy array
obj.index

RangeIndex(start=0, stop=4, step=1)

In [4]:
# getting a value using the index
obj[0]
# prefer obj.iloc[0] for positional access: integer labels
# that do not start at zero can cause confusion
# for non-integer labels, use the label itself, e.g. obj['a']
#obj.iloc[0]

10

In [6]:
# defining the index values
obj1 = Series([10, 20, 30, 40, 50], index=['b','a','c','d','e'])
obj1

b    10
a    20
c    30
d    40
e    50
dtype: int64

In [7]:
obj1.index

Index([u'b', u'a', u'c', u'd', u'e'], dtype='object')

In [8]:
obj1['a']

20

In [9]:
obj1['e']

50

In [10]:
obj1[['c', 'b', 'a']]

c    30
b    10
a    20
dtype: int64

NumPy array operations will preserve the index value link.

In [11]:
obj1[obj1 > 20]

c    30
d    40
e    50
dtype: int64

In [12]:
obj1 * 2

b     20
a     40
c     60
d     80
e    100
dtype: int64

In [13]:
np.exp(obj1)

b    2.202647e+04
a    4.851652e+08
c    1.068647e+13
d    2.353853e+17
e    5.184706e+21
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.

In [14]:
'b' in obj1

True

In [15]:
'f' in obj1

False
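The dict-like behaviour described above can be sketched as follows. This is a minimal illustration with made-up data (the `prices` Series and its labels are assumptions, not part of the original notebook); it shows that membership tests look at the index and that the usual dict helpers are available.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
prices = pd.Series({'apple': 1.2, 'banana': 0.5, 'cherry': 3.0})

# Membership tests look at the index (the "keys"), not at the values
has_apple = 'apple' in prices
has_three = 3.0 in prices

# dict-style helpers: get() with a default, keys(), items()
kiwi_price = prices.get('kiwi', 0.0)
labels = list(prices.keys())
as_dict = dict(prices.items())

print(has_apple, has_three, kiwi_price, labels)
print(as_dict)
```

Note that `3.0 in prices` is False even though 3.0 is one of the values, which is the same behaviour a plain dict would show.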

In [16]:
# you can create a Series from a Python dict
dic = {'Ohio': 123, 'Texas': 456, 'Oregon': 879, 'Utah': 667}
obj2 = Series(dic)
obj2

Ohio      123
Oregon    879
Texas     456
Utah      667
dtype: int64

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = Series(obj2, index = states)
obj3

California      NaN
Ohio          123.0
Oregon        879.0
Texas         456.0
dtype: float64

In [18]:
# using isnull function on Series
pd.isnull(obj3)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [None]:
pd.notnull(obj3)

In [None]:
# or Series has this function
obj3.isnull()

In [19]:
obj2

Ohio      123
Oregon    879
Texas     456
Utah      667
dtype: int64

In [20]:
obj3

California      NaN
Ohio          123.0
Oregon        879.0
Texas         456.0
dtype: float64

In [21]:
# adding two Series aligns the values on their index labels; a label missing from either Series gives NaN. Other arithmetic operations behave the same way.
obj2 + obj3

California       NaN
Ohio           246.0
Oregon        1758.0
Texas          912.0
Utah             NaN
dtype: float64

In [22]:
obj2

Ohio      123
Oregon    879
Texas     456
Utah      667
dtype: int64

In [23]:
# both Series object itself and its index have a name
obj2.name = 'population'
obj2.index.name = 'state'
obj2

state
Ohio      123
Oregon    879
Texas     456
Utah      667
Name: population, dtype: int64

In [24]:
obj

0    10
1    20
2    30
3    40
dtype: int64

In [25]:
# you can alter Series index in place
obj.index = ['u', 'v', 'x', 'y']
obj

u    10
v    20
x    30
y    40
dtype: int64

#### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series.
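To make the "dict of Series" picture concrete, here is a small sketch. The two Series and their values are invented for illustration; the point is only that each Series becomes a column and that the row index is the union of their labels.

```python
import pandas as pd

# Two Series with partly different labels (illustrative values)
population = pd.Series({'Ohio': 1.5, 'Nevada': 2.4})
area = pd.Series({'Ohio': 44.8, 'Utah': 84.9})

# Each Series becomes a column; labels missing from a Series give NaN
frame = pd.DataFrame({'population': population, 'area': area})

print(frame)
print(type(frame['area']))   # pulling a column out gives a Series again
print(frame.dtypes)          # every column carries its own dtype
```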

In [26]:
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year':[2000, 2001, 2002, 2001, 2002],       
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9] 
       }
frame = DataFrame(data)
frame

# renaming columns
#frame.rename(columns={'pop':'a', 'state':'b', 'year':'c'})

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [29]:
# you can specify the sequence of the columns
d1 = DataFrame(data, columns=['year', 'state', 'pop'])
d1

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


In [30]:
# as with Series, if you pass a column that isn't contained in data, it will appear with NA values
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index =['one', 'two', 'three', 'four', 'five'])
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [31]:
frame2.index

Index([u'one', u'two', u'three', u'four', u'five'], dtype='object')

In [32]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

In [33]:
# returns a 2d array
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan]], dtype=object)

In [34]:
frame2['state']
# select a row by its label:
#frame2.loc['one']
# select a row by its zero-based position:
#frame2.iloc[0]

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [35]:
type(frame2['state'])

pandas.core.series.Series

In [36]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [37]:
type(frame2.year)

pandas.core.series.Series

In [38]:
# retrieve a row by label with .loc (.ix has been removed from pandas)
#frame2.ix['two']
frame2.loc['two']
# frame2.iloc[0] selects a row by position instead
#frame2.iloc[0]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object

In [39]:
#type(frame2.ix['two'])
type(frame2.loc['two'])

pandas.core.series.Series

In [40]:
# assigning values to a column 
frame2['debt'] = 16.6
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.6
two    2001    Ohio  1.7  16.6
three  2002    Ohio  3.6  16.6
four   2001  Nevada  2.4  16.6
five   2002  Nevada  2.9  16.6


In [41]:
frame2['debt'] = [10, 20, 15, 22, 32]
frame2

       year   state  pop  debt
one    2000    Ohio  1.5    10
two    2001    Ohio  1.7    20
three  2002    Ohio  3.6    15
four   2001  Nevada  2.4    22
five   2002  Nevada  2.9    32


In [42]:
frame2['debt'] = np.arange(5)
frame2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4


In [43]:
# you can assign a Series to a DataFrame column. Its labels are aligned to the DataFrame's index; rows without a matching label are filled with NA
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [44]:
frame2['debit'] = val
frame2

       year   state  pop  debt  debit
one    2000    Ohio  1.5     0    NaN
two    2001    Ohio  1.7     1   -1.2
three  2002    Ohio  3.6     2    NaN
four   2001  Nevada  2.4     3   -1.5
five   2002  Nevada  2.9     4   -1.7


In [45]:
# assigning a column that does not exist will create a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2

       year   state  pop  debt  debit  eastern
one    2000    Ohio  1.5     0    NaN     True
two    2001    Ohio  1.7     1   -1.2     True
three  2002    Ohio  3.6     2    NaN     True
four   2001  Nevada  2.4     3   -1.5    False
five   2002  Nevada  2.9     4   -1.7    False


In [46]:
# deleting a column
del frame2['eastern']
frame2

       year   state  pop  debt  debit
one    2000    Ohio  1.5     0    NaN
two    2001    Ohio  1.7     1   -1.2
three  2002    Ohio  3.6     2    NaN
four   2001  Nevada  2.4     3   -1.5
five   2002  Nevada  2.9     4   -1.7


In [47]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt', u'debit'], dtype='object')

In [48]:
# Another common form of data is a nested dict of dicts format
pop = {'Nevada':{2001: 2.4, 2002: 2.9}, 'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop

{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [49]:
# if you pass this to DataFrame, the outer keys become the columns and the inner keys become the row indices
frame3 = DataFrame(pop)
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


In [50]:
# you can transpose the DataFrame
frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


In [51]:
# you can explicitly declare the indices
DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


#### Index Objects

In [52]:
obj = Series(range(3), index =['a', 'b', 'c'])
index = obj.index
index

Index([u'a', u'b', u'c'], dtype='object')

In [None]:
index[1:]

In [None]:
index[1]

In [None]:
# index objects are immutable, so you cannot do:
index[1] = 'd'

In [None]:
# Immutability is important so that Index objects can be safely shared among data structures
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
obj2.index is index
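A short sketch of what immutability means in practice. It assumes a small string index invented for the example: assigning to one element raises a `TypeError`, and the way to change labels is to assign a whole new Index, which leaves other objects built on the old one untouched.

```python
import numpy as np
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Assigning to a single element raises TypeError because Index is immutable
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# Two Series built on the same labels
s1 = pd.Series([1.5, -2.5, 0.0], index=index)
s2 = pd.Series(np.arange(3), index=index)

# To "change" labels, build a new Index and assign it as a whole
s1.index = index.str.upper()

print(mutated)          # False
print(list(s1.index))   # ['A', 'B', 'C']
print(list(s2.index))   # ['a', 'b', 'c'], unaffected by the change to s1
```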

In [None]:
# In addition to being array-like, an Index also functions as a fixed-size set:
frame3

In [None]:
'Ohio' in frame3.columns

In [None]:
2003 in frame3.index

### Essential Functionality

#### Reindexing

In [53]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [54]:
# NA for index which does not exist
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [55]:
# you can fill the NA with any value 
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [56]:
# for ordered data like time series, it may be desirable to interpolate or fill values when reindexing
# use the method option: ffill means forward filling, bfill means backward filling
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [57]:
obj3.reindex(range(6), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

In [58]:
# with DataFrame reindex can alter either the (row) index, columns or both.
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [59]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [60]:
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [61]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


In [62]:
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [63]:
states

['Texas', 'Utah', 'California']

In [None]:
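# reindexing both the index and the columns
# in recent pandas versions, passing method='ffill' together with unsorted column labels can fail,
# so fill along the index first and then select the columns
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(columns=states)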

In [None]:
frame

In [None]:
states

In [None]:
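# .ix has been removed from pandas; use .loc for label indexing
# (.loc raises a KeyError for labels that do not exist, such as 'b' or 'Utah'; use reindex for those)
frame.loc[['a', 'c', 'd'], ['Texas', 'California']]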

#### Dropping entries from an axis

In [64]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [65]:
new_obj = obj.drop('c')
new_obj

a    0
b    1
d    3
e    4
dtype: int64

In [None]:
# in case it was not mentioned before: Series and DataFrames have a copy method
copyObj = new_obj.copy()
print(copyObj)

In [66]:
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio','Colorado','Utah','New York'],
                                                columns=['one','two','three','four'])
data
#data['one']

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
# drop returns a new object; data itself is unchanged unless you assign the result to a variable
data.drop(['Colorado','Ohio'])

In [None]:
data.drop('two', axis=1)

In [None]:
data.drop(['two','four'],axis=1)

#### Indexing, selection and filtering

In [67]:
obj = Series(np.arange(4), index=['a','b','c','d'])
obj

a    0
b    1
c    2
d    3
dtype: int64

In [68]:
obj['b']

1

In [69]:
obj.iloc[1]  # positional access; plain obj[1] on a string-labeled Series is deprecated in recent pandas

1

In [None]:
obj[2:4]

In [None]:
obj[['b','a','d']]

In [None]:
obj.iloc[[1, 3]]  # positional access by a list of positions

In [70]:
obj[obj < 2]

a    0
b    1
dtype: int64

In [None]:
# slicing with labels is different from normal Python slicing: as you can see, the endpoint is inclusive
obj['b':'c']

In [None]:
obj['b':'c'] = 5
obj
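The difference between label slicing and positional slicing is easy to miss, so here is a minimal side-by-side sketch using the same kind of Series as above. Using `.loc` and `.iloc` explicitly is a choice made for this example to keep the two kinds of slice apart.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: both endpoints are included
by_position = obj.iloc[1:2]     # positional slice: the stop position is excluded

print(list(by_label.index))     # ['b', 'c']
print(list(by_position.index))  # ['b']
```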

In [71]:
data = DataFrame(np.arange(16).reshape((4,4)),
                 index=['Ohio','Colorado','Utah','New York'],
                 columns=['one','two','three','four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [72]:
# these might be a bit inconsistent with the previous examples
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [73]:
data[['two','three']]

          two  three
Ohio        1      2
Colorado    5      6
Utah        9     10
New York   13     14


In [74]:
# selecting rows by slicing
data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [75]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data

In [None]:
data < 5

In [76]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
# you can use the loc and iloc properties of a DataFrame for these operations too (the older ix has been removed).
# please refer to the pandas DataFrame reference

#### Arithmetic and data alignment

In [77]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [78]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [79]:
# where the indices do not match, NaN is placed in the result
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [80]:
df1 = DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'), index=['Ohio','Texas','Colorado'])
df1

          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8


In [81]:
df2 = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
df2

        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11


In [82]:
df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


#### Arithmetic methods with fill values

In [83]:
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))
df1

   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11


In [84]:
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))
df2

    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19


In [85]:
df1 + df2

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


In [86]:
# a value missing from one DataFrame is treated as zero before the operation
# fill_value works the same way for add, sub, div and mul
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [87]:
df1

   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11


In [88]:
df2

    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19


In [89]:
df1.reindex(columns=df2.columns, fill_value=0)

   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0


#### Operation between DataFrame and Series

In [90]:
arr = np.arange(12).reshape((4,3))
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [91]:
arr[0]

array([0, 1, 2])

In [92]:
# this is called broadcasting, it subtracts row by row
arr - arr[0]

array([[0, 0, 0],
       [3, 3, 3],
       [6, 6, 6],
       [9, 9, 9]])

In [None]:
# .ix has been removed from pandas; use .iloc to select a row by position
frame = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame

In [93]:
series

b    0
d    1
e    2
Name: Utah, dtype: int64

In [None]:
# broadcasting down the rows
frame - series

In [None]:
series2 = Series(range(3), index=['b','e','f'])
series2

In [None]:
frame

In [None]:
series2 + frame

In [None]:
frame

In [None]:
# you can broadcast over the columns instead by using the arithmetic methods with axis=0, as follows
series3 = frame['d']
frame

In [None]:
series3

In [None]:
# multiplication
frame.multiply(series3, axis = 0)
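Since the cells in this part have no recorded output, here is a self-contained sketch of both broadcasting directions. It rebuilds the same 4x3 frame used in this section; the variable names `row_wise` and `col_wise` are just labels chosen for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

first_row = frame.iloc[0]                # Series indexed by the column labels
row_wise = frame - first_row             # matched on columns, broadcast down the rows

column_d = frame['d']                    # Series indexed by the row labels
col_wise = frame.sub(column_d, axis=0)   # matched on the row index instead

print(row_wise)
print(col_wise)
```

In `row_wise` every row has the first row subtracted from it; in `col_wise` every column has column `d` subtracted from it, so the `d` column becomes all zeros.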

#### Function application and mapping

NumPy ufuncs work fine with pandas objects:

In [94]:
frame = DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

               d         b         e
Utah    0.741553 -1.036138  1.186799
Ohio    0.190351  0.828230 -0.005094
Texas  -0.742835  0.962473 -0.508709
Oregon  0.821545  1.089326  0.384054


In [95]:
np.abs(frame)

               d         b         e
Utah    0.741553  1.036138  1.186799
Ohio    0.190351  0.828230  0.005094
Texas   0.742835  0.962473  0.508709
Oregon  0.821545  1.089326  0.384054


Another frequent operation is applying a function on 1D arrays to each column or row.

In [96]:
f = lambda x: x.max() - x.min()
# by default axis is zero
frame.apply(f)

d    1.564380
b    2.125464
e    1.695508
dtype: float64

In [97]:
frame.apply(f, axis = 1)

Utah      2.222937
Ohio      0.833324
Texas     1.705308
Oregon    0.705272
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
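As a small sketch of that point, the following compares built-in reductions with the equivalent `apply` calls. The frame holds fixed values (an assumption made so the numbers are easy to check by hand) instead of the random data used above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list('dbe'))

# Built-in reductions
col_range = frame.max() - frame.min()
row_mean = frame.mean(axis=1)

# The same results through apply, which is more general but usually not needed here
col_range_apply = frame.apply(lambda x: x.max() - x.min())
row_mean_apply = frame.apply(lambda x: x.mean(), axis=1)

print(col_range.tolist())   # [9.0, 9.0, 9.0]
print(row_mean.tolist())    # [1.0, 4.0, 7.0, 10.0]
print(np.allclose(col_range, col_range_apply), np.allclose(row_mean, row_mean_apply))
```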

apply need not return a scalar value; it can also return a Series with multiple values:

In [None]:
def f(x): return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string from each floating point value in frame.

In [None]:
# old formatting way
pi = 3.14159
print(" pi = %1.2f " % pi)

In [None]:
format = lambda x: '%.2f' % x
frame.applymap(format)

The reason for the name applymap is that Series has a map method for applying an element-wise function. (In pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.)

In [None]:
frame['e'].map(format)

#### Sorting

In [None]:
obj = Series(range(4), index=['d','a','b','c'])
obj

In [None]:
obj.sort_index()

In [None]:
frame = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame                                                                        

In [None]:
# default axis is 0
frame.sort_index()

In [None]:
frame.sort_index(axis=1)

In [None]:
# sorting a Series by its values
# obj.order() has been removed; use sort_values instead
obj = Series([4,7,-3,2])
obj.sort_values()

In [None]:
# any missing values are sorted to the end of the Series by default
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

In [None]:
frame = DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame

In [None]:
# sort_index(by='b') is deprecated; use sort_values instead
frame.sort_values(by='b')

In [None]:
#frame.sort_index(by=['a','b'])
frame.sort_values(by=['a','b'])

#### Summarizing and Computing Descriptive Statistics

In [None]:
df = DataFrame([[1.4,np.nan], [7.1,-4.5],
               [np.nan, np.nan], [0.75, -1.3]],
               index=['a','b','c','d'],
               columns=['one','two'])
df

In [None]:
df.sum()

In [None]:
df.sum(axis=1)

In [None]:
df.mean(axis=1, skipna=True)

In [None]:
df.describe()

In [None]:
# on non-numeric, it produces alternative summary statistics:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj

In [None]:
obj.describe()

Some summary statistics and related functions:

count<br />
describe<br />
min, max<br />
quantile<br />
sum<br />
mean<br />
median<br />
mad (removed in pandas 2.0)<br />
var<br />
std<br />
diff<br />
pct_change<br />
cumsum<br />
cumprod<br />

#### Correlation and Covariance

Covariance measures how two variables move together: whether they move in the same direction (a positive covariance) or in opposite directions (a negative covariance). It ranges over (-inf, +inf).

Finding that two variables have a high or low covariance might not be a useful metric on its own. Covariance can tell how the two variables move together, but to determine the strength of the relationship, we need to look at the correlation.

The correlation between two variables X and Y is the covariance between them divided by the product of their standard deviations. It ranges over [-1, 1].
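The relationship between the two quantities can be checked numerically. The sketch below uses synthetic data (two random Series with a fixed seed, which is an assumption made for illustration, not data from this notebook) and shows that rescaling a variable changes the covariance but not the correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                    # fixed seed so the run is repeatable
x = pd.Series(rng.normal(size=200))
y = 2.0 * x + pd.Series(rng.normal(size=200))     # y moves with x, plus noise

cov_xy = x.cov(y)
corr_manual = cov_xy / (x.std() * y.std())        # covariance / product of std devs
corr_builtin = x.corr(y)

# Rescaling y multiplies the covariance but leaves the correlation unchanged
cov_scaled = x.cov(100 * y)
corr_scaled = x.corr(100 * y)

print(cov_xy, cov_scaled)
print(corr_manual, corr_builtin, corr_scaled)
```

This is why covariance alone is hard to interpret: its size depends on the units of the variables, while the correlation is unit-free.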

In [None]:
# getting stock data from Yahoo
# the old pandas.io.data module has moved to a separate package:
#import pandas.io.data as web

# you have to install pandas-datareader and then
import pandas_datareader.data as web

# The Yahoo API is deprecated and has not been replaced,
# so this example will not run anymore

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

#print(type(all_data))
#print(all_data['AAPL'])

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})

volume = DataFrame({tic: data['Volume']
                   for tic, data in all_data.items()})

#print(volume)

returns = price.pct_change()

print(returns)

#returns.tail()

In [None]:
# correlation between MSFT and IBM stocks
returns.MSFT.corr(returns.IBM)

In [None]:
# covariance between MSFT and IBM stocks
returns.MSFT.cov(returns.IBM)

In [None]:
returns.corr()

In [None]:
returns.cov()

In [None]:
returns.corrwith(returns.IBM)

In [None]:
# finding the correlation between price change and volume
returns.corrwith(volume)
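Because the download above depends on an external service, the same calls can be tried on synthetic data. The following is only a sketch: the tickers `AAA`, `BBB` and `CCC` and the random-walk prices are made up, so the numbers carry no financial meaning.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)                       # fixed seed for repeatability
dates = pd.date_range('2020-01-01', periods=250, freq='D')
tickers = ['AAA', 'BBB', 'CCC']                      # hypothetical tickers

# Random-walk prices built from small random daily log steps
steps = rng.normal(0.0005, 0.01, size=(250, 3))
price = pd.DataFrame(100 * np.exp(steps.cumsum(axis=0)), index=dates, columns=tickers)

returns = price.pct_change()                         # first row is NaN by construction

pair = returns['AAA'].corr(returns['BBB'])           # correlation of one pair
matrix = returns.corr()                              # full correlation matrix
against = returns.corrwith(returns['AAA'])           # every column against AAA

print(pair)
print(matrix)
print(against)
```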

#### Unique Values, Value Counts, and Membership

In [None]:
obj = Series(['c','a','d','a','a','b','b','c','c'])
obj

In [None]:
uniques = obj.unique()
uniques

In [None]:
obj.value_counts()

In [None]:
# pandas has a top-level function for this too that can be used for any array or sequence
pd.value_counts(obj.values)

In [None]:
obj

In [None]:
# isin is responsible for vectorized set membership and can be very useful in filtering a data set
mask = obj.isin(['b','c'])
mask

In [None]:
data = DataFrame({'Qu1': [1,3,4,5,4],
                  'Qu2': [2,3,1,2,3],
                  'Qu3': [1,5,2,4,4]})

data      
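One possible use of a frame like this, sketched under the assumption that the columns hold answers to three survey questions: count how often each answer occurs per column, and use `isin` to filter rows. The names `counts` and `rows_3_or_4` are chosen for this example only.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 5, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count each distinct value per column; values absent from a column become 0
counts = data.apply(lambda col: col.value_counts()).fillna(0)

# isin gives a boolean mask that can be used to filter rows
rows_3_or_4 = data[data['Qu1'].isin([3, 4])]

print(counts)
print(rows_3_or_4)
```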

#### Handling Missing Data

In [None]:
# pandas uses the floating point value NaN to represent missing data
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

In [None]:
# built-in Python None value is also treated as NaN
string_data[0] = None
string_data.isnull()

#### Filtering Out Missing Data

In [None]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

In [None]:
data

In [None]:
data[data.notnull()]

In [None]:
data = DataFrame([[1, 6.5, 3], [1, NA, NA],
                 [NA, NA, NA], [NA, 6.5, 3]])

data

In [None]:
cleaned = data.dropna()

In [None]:
# any row with NA will be dropped
cleaned

In [None]:
data

In [None]:
# drop only the rows in which every value is NA
data.dropna(how='all')

In [None]:
data

In [None]:
data[2] = NA
data

In [None]:
# drop the column if all NA
data.dropna(how='all', axis = 1)

In [None]:
df = DataFrame(np.random.randn(7, 3))
df

In [None]:
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

In [None]:
# keep only the columns having at least 4 non-NA observations
df.dropna(thresh=4, axis='columns')  # use axis='rows' (or 0) to apply the threshold to rows instead
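The effect of `thresh` is easier to see with fixed values. In this sketch the frame is filled with ones instead of random numbers (an assumption made so the outcome is predictable), and the same NA pattern as above is applied.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((7, 3)))
df.iloc[:4, 1] = np.nan      # column 1 keeps 3 observed values
df.iloc[:2, 2] = np.nan      # column 2 keeps 5 observed values

rows_kept = df.dropna(thresh=2)                   # rows with at least 2 non-NA values
cols_kept = df.dropna(thresh=4, axis='columns')   # columns with at least 4 non-NA values

print(rows_kept.index.tolist())    # [2, 3, 4, 5, 6]
print(cols_kept.columns.tolist())  # [0, 2]
```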

#### Filling in Missing Data

In [None]:
df

In [None]:
df.fillna(0)

In [None]:
df

In [None]:
# use a dict that indicates what value to fill NA with in each column
df.fillna({1:0.5, 2: -1})

In [None]:
df

In [None]:
# fillna returns a new object, but you can modify the existing object in place
df.fillna(0, inplace=True)
df

In [None]:
df = DataFrame(np.random.randn(6, 3))
df

In [None]:
df.iloc[2:,1] = np.nan
df.iloc[4:,2] = np.nan
df

In [None]:
# forward filling (recent pandas versions deprecate fillna(method='ffill') in favor of ffill())
df.ffill()

In [None]:
df

In [None]:
# you can put a limit on how many consecutive values to fill
df.ffill(limit=2)

In [None]:
# With fillna you can do lots of other things with a little creativity
data = Series([1, np.nan, 3.5, np.nan, 7])
# putting mean of values for NAs
data.fillna(data.mean())
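The same idea extends to a DataFrame. As a sketch with made-up values: `df.mean()` returns a Series of column means, and `fillna` matches that Series to the columns by label, so each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 4.0, 8.0]})

# Column means: a -> 2.0, b -> 6.0 (NaN values are skipped by default)
means = df.mean()

# Each NaN is replaced by the mean of its own column
filled = df.fillna(means)

print(filled)
```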