## Getting Started with pandas

pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy, which makes it easy to use in NumPy-centric applications.

### Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: <b>Series</b> and <b>DataFrame</b>.
While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

#### Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

In [None]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = Series([10, 20 , 30, 40])
obj

In [None]:
# you can get only the array representation using:
obj.values

In [None]:
# you can get only the index representation using:
obj.index

In [None]:
# getting value using index
obj[0]

In [None]:
# defining the index values
obj1 = Series([10, 20, 30, 40, 50], index=['b','a','c','d','e'])
obj1

In [None]:
obj1.index

In [None]:
obj1['a']

In [None]:
obj1['e']

In [None]:
obj1[['c', 'b', 'a']]

NumPy array operations will preserve the index value link.

In [None]:
obj1[obj1 > 20]

In [None]:
obj1 * 2

In [None]:
np.exp(obj1)

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.

In [None]:
'b' in obj1

In [None]:
'f' in obj1
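To make the dict analogy a little more concrete, here is a minimal sketch using a small Series with hypothetical values (not the notebook's `obj1`). It shows three dict-like operations that Series supports: iterating over label/value pairs, converting to a plain dict, and looking up a label with a default.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
s = pd.Series([10, 20, 30], index=['b', 'a', 'c'])

# Iterate over (label, value) pairs, much like dict.items()
pairs = list(s.items())

# Convert to a plain dict when a function strictly requires one
as_dict = s.to_dict()

# Look up a label with a fallback, like dict.get()
missing = s.get('z', -1)
```

Unlike a plain dict, the Series keeps its labels in the order given and still supports vectorized arithmetic.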

In [None]:
# you can create a Series from a Python dict
dic = {'Ohio': 123, 'Texas': 456, 'Oregon': 879, 'Utah': 667}
obj2 = Series(dic)
obj2

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = Series(obj2, index = states)
obj3

In [None]:
# using isnull function on Series
pd.isnull(obj3)

In [None]:
pd.notnull(obj3)

In [None]:
# Series also has this as an instance method
obj3.isnull()

In [None]:
obj2

In [None]:
obj3

In [None]:
# adding two Series aligns the values on their index labels; a label present in only one Series gives NaN.
# Other arithmetic operations align in the same way.
obj2 + obj3

In [None]:
obj2

In [None]:
# both the Series object itself and its index have a name attribute
obj2.name = 'population'
obj2.index.name = 'state'
obj2

In [None]:
obj

In [None]:
# you can alter Series index in place
obj.index = ['u', 'v', 'x', 'y']
obj

#### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series.
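The "dict of Series" view can be sketched directly. The example below is illustrative only: the two Series and their labels are hypothetical, and it simply shows that a dict of Series sharing one index becomes a DataFrame whose columns are the dict keys.

```python
import pandas as pd

# Two hypothetical columns with different value types, sharing one row index
year = pd.Series([2000, 2001, 2002], index=['one', 'two', 'three'])
state = pd.Series(['Ohio', 'Ohio', 'Nevada'], index=['one', 'two', 'three'])

# Each dict key becomes a column label; the Series index becomes the row index
df = pd.DataFrame({'year': year, 'state': state})

# Selecting a column gives back a Series
col = df['state']
```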

In [None]:
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year':[2000, 2001, 2002, 2001, 2002],       
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9] 
       }
frame = DataFrame(data)
frame

In [None]:
# you can specify the order of the columns
DataFrame(data, columns=['year', 'state', 'pop'])

In [None]:
# as with Series, if you pass a column that isn't contained in data, it will appear filled with NA
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index =['one', 'two', 'three', 'four', 'five'])
frame2

In [None]:
frame2.index

In [None]:
frame2.columns

In [None]:
# returns a 2d array
frame2.values

In [None]:
frame2['state']

In [None]:
type(frame2['state'])

In [None]:
frame2.year

In [None]:
type(frame2.year)

In [None]:
# retrieving a row by label with the loc indexer (the older ix indexer was removed in pandas 1.0)
frame2.loc['two']

In [None]:
type(frame2.loc['two'])

In [None]:
# assigning values to a column 
frame2['debt'] = 16.6
frame2

In [None]:
frame2['debt'] = [10, 20, 15, 22, 32]
frame2

In [None]:
frame2['debt'] = np.arange(5)
frame2

In [None]:
# you can assign a Series to a DataFrame column; it is aligned on the index, and rows without a matching label get NA
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

In [None]:
frame2['debt'] = val
frame2

In [None]:
# assigning a column that does not exist will create a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2

In [None]:
# deleting a column
del frame2['eastern']
frame2

In [None]:
frame2.columns

In [None]:
# Another common form of data is a nested dict of dicts format
pop = {'Nevada':{2001: 2.4, 2002: 2.9}, 'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop

In [None]:
# you can put this in DataFrame and it makes outer keys columns and inner keys indices
frame3 = DataFrame(pop)
frame3

In [None]:
# you can transpose the DataFrame
frame3.T

In [None]:
# you can explicitly declare the indices
DataFrame(pop, index=[2001, 2002, 2003])

#### Index Objects

In [None]:
obj = Series(range(3), index =['a', 'b', 'c'])
index = obj.index
index

In [None]:
index[1:]

In [None]:
index[1]

In [None]:
# Index objects are immutable, so this assignment raises a TypeError:
index[1] = 'd'
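Since the assignment above fails, a notebook run stops at that cell. As a hedged sketch (the variable names `modified` and `new_index` are made up for this example), the failure can be caught, and a changed index can be built as a new object instead:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'        # item assignment is not supported on an Index
    modified = True
except TypeError:
    modified = False

# Build a new Index rather than modifying the existing one
new_index = index[:1].append(pd.Index(['d'])).append(index[2:])
```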

In [None]:
# Immutability is important so that Index objects can be safely shared among data structures
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
obj2.index is index

In [None]:
# In addition to being array-like, an Index also functions as a fixed-size set:
frame3

In [None]:
'Ohio' in frame3.columns

In [None]:
2003 in frame3.index

### Essential Functionality

#### Reindexing

In [None]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

In [None]:
# labels that do not exist in the original index get NA
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

In [None]:
# you can fill the NA with any value 
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)

In [None]:
# for ordered data like time series, it may be desirable to interpolate or fill values when reindexing
# use the method option: ffill fills forward, bfill fills backward
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

In [None]:
obj3.reindex(range(6), method='bfill')

In [None]:
# with DataFrame reindex can alter either the (row) index, columns or both.
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

In [None]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

In [None]:
frame

In [None]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

In [None]:
frame

In [None]:
states

In [None]:
# reindexing both the index and the columns in one call, then forward-filling the newly added rows
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states).ffill()

In [None]:
frame

In [None]:
states

In [None]:
# label indexing with loc gives a similar selection, but only for labels that already exist
frame.loc[['a', 'c', 'd'], ['Texas', 'California']]

#### Dropping entries from an axis

In [None]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj

In [None]:
new_obj = obj.drop('c')
new_obj

In [None]:
# in case I forgot to mention it: Series and DataFrame both have a copy method
copyObj = new_obj.copy()
copyObj

In [None]:
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio','Colorado','Utah','New York'],
                                                columns=['one','two','three','four'])
data

In [None]:
# drop returns a new object; data itself is unchanged unless you assign the result back
data.drop(['Colorado','Ohio'])

In [None]:
data.drop('two', axis=1)

In [None]:
data.drop(['two','four'],axis=1)
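A quick way to convince yourself that `drop` leaves the original object alone is to compare shapes before and after. The sketch below rebuilds the same example frame so that it runs on its own; the names `dropped`, `original_shape` and `dropped_shape` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

dropped = data.drop(['Colorado', 'Ohio'])   # new object with two rows removed

original_shape = data.shape                 # the source frame is untouched
dropped_shape = dropped.shape

# Assign the result back to make the change stick
data = data.drop('two', axis=1)
```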

#### Indexing, selection and filtering

In [None]:
obj = Series(np.arange(4), index=['a','b','c','d'])
obj

In [None]:
obj['b']

In [None]:
# positional access; plain obj[1] on a labeled index is deprecated in recent pandas
obj.iloc[1]

In [None]:
obj[2:4]

In [None]:
obj[['b','a','d']]

In [None]:
obj.iloc[[1, 3]]

In [None]:
obj[obj < 2]

In [None]:
# slicing with labels is different from normal Python slicing: the endpoint is inclusive
obj['b':'c']

In [None]:
obj['b':'c'] = 5
obj
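The difference between label slices and positional slices is easy to miss, so here is a small side-by-side sketch. It rebuilds the Series from scratch and uses the explicit `loc` and `iloc` indexers; the names `by_label` and `by_position` are just for this example.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: 'c' is included
by_position = obj.iloc[1:2]     # positional slice: position 2 is excluded
```

Both slices start at `'b'`, but only the label slice reaches `'c'`.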

In [None]:
data = DataFrame(np.arange(16).reshape((4,4)),
                 index=['Ohio','Colorado','Utah','New York'],
                 columns=['one','two','three','four'])
data

In [None]:
# this may look a bit inconsistent with the previous examples: plain [] indexing on a DataFrame selects columns
data['two']

In [None]:
data[['two','three']]

In [None]:
# selecting rows by slicing
data[:2]

In [None]:
data[data['three'] > 5]

In [None]:
data

In [None]:
data < 5

In [None]:
data[data < 5] = 0
data

In [None]:
# the loc (label-based) and iloc (position-based) indexers of a DataFrame support the operations above too
# (the older ix indexer was removed in pandas 1.0); see the pandas DataFrame reference for details
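As a rough sketch of indexer-based selection on current pandas, where `loc` (labels) and `iloc` (positions) are the supported indexers, the block below repeats a few of the selections from this section. The frame is rebuilt so the example is self-contained, and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: one row, a subset of columns
row = data.loc['Colorado', ['two', 'three']]

# Position-based: first two rows, last column
block = data.iloc[:2, [3]]

# Boolean row filter combined with a label slice on columns (end label included)
filtered = data.loc[data['three'] > 5, 'one':'three']
```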

#### Arithmetic and data alignment

In [None]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1

In [None]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2

In [None]:
# where the indices do not match, the result is NaN
s1 + s2

In [None]:
df1 = DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'), index=['Ohio','Texas','Colorado'])
df1

In [None]:
df2 = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
df2

In [None]:
df1 + df2

#### Arithmetic methods with fill values

In [None]:
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))
df1

In [None]:
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))
df2

In [None]:
df1 + df2

In [None]:
# a value missing from one of the DataFrames is treated as zero before adding
# the same fill_value option works for add, sub, div and mul
df1.add(df2, fill_value=0)

In [None]:
df1

In [None]:
df2

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

#### Operation between DataFrame and Series

In [None]:
arr = np.arange(12).reshape((4,3))
arr

In [None]:
arr[0]

In [None]:
# this is called broadcasting: the first row is subtracted from every row
arr - arr[0]

In [None]:
frame = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame

In [None]:
series

In [None]:
# broadcasting down the rows
frame - series

In [None]:
series2 = Series(range(3), index=['b','e','f'])
series2

In [None]:
frame

In [None]:
frame + series2

In [None]:
frame

In [None]:
# you can broadcast over the columns (matching on the rows) using the arithmetic methods, as follows
series3 = frame['d']
frame

In [None]:
series3

In [None]:
# subtraction
frame.sub(series3, axis=0)

#### Function application and mapping

NumPy ufuncs work fine with pandas objects:

In [None]:
frame = DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

In [None]:
np.abs(frame)

Another frequent operation is applying a function on 1D arrays to each column or row:

In [None]:
f = lambda x: x.max() - x.min()
# by default axis is zero
frame.apply(f)

In [None]:
frame.apply(f, axis = 1)

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
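For instance, the range computed with `apply` earlier can be written with the built-in `max` and `min` methods. This is a minimal sketch on hypothetical random data (seeded so it is reproducible); the names `via_apply`, `via_methods` and `row_means` are introduced only here.

```python
import numpy as np
import pandas as pd

# Seeded generator so the illustrative data is reproducible
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('dbe'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

via_apply = frame.apply(lambda x: x.max() - x.min())   # column-wise range via apply
via_methods = frame.max() - frame.min()                # same result with built-in methods
row_means = frame.mean(axis=1)                         # axis=1 aggregates across each row
```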

apply need not return a scalar value; it can also return a Series with multiple values:

In [None]:
def f(x): return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string from each floating point value in frame.

In [None]:
# old-style (%) string formatting
pi = 3.14159
print(" pi = %1.2f " % pi)

In [None]:
format = lambda x: '%.2f' % x
frame.applymap(format)

The reason for the name applymap is that Series has a map method for applying an element-wise function (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):

In [None]:
frame['e'].map(format)

#### Sorting

In [None]:
obj = Series(range(4), index=['d','a','b','c'])
obj

In [None]:
obj.sort_index()

In [None]:
frame = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame                                                                        

In [None]:
# default axis is 0
frame.sort_index()

In [None]:
frame.sort_index(axis=1)

In [None]:
# sorting a Series by its values
obj = Series([4,7,-3,2])
obj.sort_values()

In [None]:
# any missing values are sorted to the end of the Series by default
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

In [None]:
frame = DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame

In [None]:
frame.sort_values(by='b')

In [None]:
frame.sort_values(by=['a', 'b'])

#### Summarizing and Computing Descriptive Statistics

In [None]:
df = DataFrame([[1.4,np.nan], [7.1,-4.5],
               [np.nan, np.nan], [0.75, -1.3]],
               index=['a','b','c','d'],
               columns=['one','two'])
df

In [None]:
df.sum()

In [None]:
df.sum(axis=1)

In [None]:
df.mean(axis=1, skipna=False)

In [None]:
df.describe()

In [None]:
# on non-numeric data, describe produces alternative summary statistics:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj

In [None]:
obj.describe()

Some summary statistics and related functions:

count<br />
describe<br />
min, max<br />
quantile<br />
sum<br />
mean<br />
median<br />
mad (removed in pandas 2.0)<br />
var<br />
std<br />
diff<br />
pct_change<br />
cumsum<br />
cumprod<br />
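A few of the less obvious entries in this list are sketched below on a tiny Series with made-up values, so the results can be checked by hand.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

running_total = s.cumsum()      # 1, 3, 7, 15
step = s.diff()                 # NaN, 1, 2, 4
growth = s.pct_change()         # NaN, 1.0, 1.0, 1.0 (each value doubles)
middle = s.quantile(0.5)        # 3.0, the same as s.median()
```

`diff` and `pct_change` have no previous value for the first element, so their first entry is NaN.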

#### Correlation and Covariance

Covariance measures how two variables move together: whether they move in the same direction (a positive covariance) or in opposite directions (a negative covariance). It ranges over (-inf, +inf).

Finding that two variables have a high or low covariance might not be a useful metric on its own. Covariance can tell how the two variables move together, but to determine the strength of the relationship, we need to look at the correlation.

Correlation between two variables X and Y is simply the covariance between the two variables divided by the product of their standard deviations. It ranges over [-1, 1].
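The relationship between the two measures can be checked numerically without downloading anything. The sketch below uses two short Series with made-up values and compares the manual ratio against the built-in `corr` method; both `cov` and `std` use the sample (n - 1) normalization by default, so the ratio is consistent.

```python
import numpy as np
import pandas as pd

# Hypothetical data, small enough to verify by hand
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = x.cov(y)                            # sample covariance
corr_manual = cov_xy / (x.std() * y.std())   # covariance scaled by both standard deviations
corr_builtin = x.corr(y)                     # Pearson correlation
```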

In [None]:
# getting stock data from Yahoo (requires network access)
# pandas.io.data was split out of pandas; the reader now lives in the separate pandas-datareader package
import pandas_datareader.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})

volume = DataFrame({tic: data['Volume']
                   for tic, data in all_data.items()})

returns = price.pct_change()

returns.tail()

In [None]:
# correlation between MSFT and IBM stocks
returns.MSFT.corr(returns.IBM)

In [None]:
# covariance between MSFT and IBM stocks
returns.MSFT.cov(returns.IBM)

In [None]:
returns.corr()

In [None]:
returns.cov()

In [None]:
returns.corrwith(returns.IBM)

In [None]:
# finding the correlation between price change and volume
returns.corrwith(volume)

#### Unique Values, Value Counts, and Membership

In [None]:
obj = Series(['c','a','d','a','a','b','b','c','c'])
obj

In [None]:
uniques = obj.unique()
uniques

In [None]:
obj.value_counts()

In [None]:
# pandas has a top-level function for this too, which can be used with any array or sequence
pd.value_counts(obj.values)

In [None]:
obj

In [None]:
# isin performs a vectorized set membership check and can be very useful for filtering a data set
mask = obj.isin(['b','c'])
mask

In [None]:
data = DataFrame({'Qu1': [1,3,4,5,4],
                  'Qu2': [2,3,1,2,3],
                  'Qu3': [1,5,2,4,4]})

data      
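One possible use of a frame like this is to count how often each value occurs in every column. The following is a sketch under that assumption, not necessarily what the notebook intended; the name `counts` is introduced here.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 5, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count each value per column; values absent from a column come out as NaN, then become 0
counts = data.apply(lambda col: col.value_counts()).fillna(0)
```

The row labels of `counts` are the distinct values found across all columns, and each cell holds how many times that value appears in the given column.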

#### Handling Missing Data

In [None]:
# pandas uses the floating point value NaN to represent missing data
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

In [None]:
# built-in Python None value is also treated as NaN
string_data[0] = None
string_data.isnull()

#### Filtering Out Missing Data

In [None]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

In [None]:
data

In [None]:
data[data.notnull()]

In [None]:
data = DataFrame([[1, 6.5, 3], [1, NA, NA],
                 [NA, NA, NA], [NA, 6.5, 3]])

data

In [None]:
cleaned = data.dropna()

In [None]:
# any row with NA will be dropped
cleaned

In [None]:
data

In [None]:
# drop only the rows in which every value is NA
data.dropna(how='all')

In [None]:
data

In [None]:
data[2] = NA
data

In [None]:
# drop only the columns in which every value is NA
data.dropna(how='all', axis = 1)

In [None]:
df = DataFrame(np.random.randn(7, 3))
df

In [None]:
# remember: loc slicing by label includes the upper bound
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA
df

In [None]:
# keep only rows having at least a certain number of non-NA observations
df.dropna(thresh=3)
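Because the frame above is random, the effect of `thresh` is easier to see on fixed data. The sketch below uses a small hand-made frame (hypothetical values) in which each row has a different number of non-NA entries.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan],   # 1 non-NA value
                   [2.0, 5.0, np.nan],      # 2 non-NA values
                   [3.0, 6.0, 9.0]])        # 3 non-NA values

at_least_two = df.dropna(thresh=2)   # keeps rows 1 and 2
complete_only = df.dropna()          # keeps row 2 only
```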

#### Filling in Missing Data

In [None]:
df

In [None]:
df.fillna(0)

In [None]:
df

In [None]:
# use a dict to indicate which value fills NA in each column
df.fillna({1:0.5, 2: -1})

In [None]:
df

In [None]:
# fillna returns a new object, but you can modify the existing object in place
df.fillna(0, inplace=True)
df

In [None]:
df = DataFrame(np.random.randn(6, 3))
df

In [None]:
df.loc[2:, 1] = NA
df.loc[4:, 2] = NA
df

In [None]:
# forward filling
df.ffill()

In [None]:
df

In [None]:
# you can put a limit on how many consecutive NAs to fill
df.ffill(limit=2)

In [None]:
# With fillna you can do lots of other things with a little creativity
data = Series([1, NA, 3.5, NA, 7])
# putting mean of values for NAs
data.fillna(data.mean())
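The same idea extends to a DataFrame: passing the Series returned by `mean` fills each column with its own mean. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column label, so each column gets its own fill value
filled = df.fillna(df.mean())
```

Here the gap in column `a` is filled with 2.0 and the gap in column `b` with 6.0.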