# Pandas
**This note is based on Python for Data Analysis by Wes McKinney and uses Python 2.7**

Pandas is a data analysis tool for Python. It has two main data structures: **Series and DataFrame**.

In [1]:
from pandas import Series, DataFrame
import pandas as pd

### Series
A Series is a one-dimensional array-like object containing data and an associated index. It can be created with the default integer index or with an index of your choice:

In [2]:
#Default
#mySeries = Series([4, 3, 2, 1])

mySeries = Series([4, 3, 2, 1], ['a', 'b', 'c', 'd'])
mySeries

a    4
b    3
c    2
d    1
dtype: int64

The values and index of a Series are available through its **values** and **index** attributes, respectively:

In [3]:
mySeries.values

array([4, 3, 2, 1])

In [4]:
mySeries.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

NumPy array operations, such as boolean filtering, preserve the index-value link:

In [5]:
mySeries[(mySeries < 4) & (mySeries > 2)]

#mySeries*2, np.exp(mySeries)  (np.exp requires: import numpy as np)

b    3
dtype: int64
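The other operations hinted at in the comment above behave the same way. A minimal sketch, assuming NumPy is installed alongside pandas; the names `s`, `doubled`, `roots` and `mask` are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 3, 2, 1], index=['a', 'b', 'c', 'd'])

doubled = s * 2            # element-wise arithmetic, labels unchanged
roots = np.sqrt(s)         # a NumPy ufunc returns a Series with the same index
mask = (s < 4) & (s > 2)   # boolean Series aligned on the same labels

print(doubled)
print(roots)
print(s[mask])
```

In each case the result is still a Series, so every value stays attached to its original label.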

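A Series can also be thought of as a fixed-length, ordered Python dictionary. It can be created from a dictionary; in the pandas version used here, the index is then placed in sorted key order (newer pandas releases running on Python 3.6+ keep the dictionary's insertion order instead).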

In [6]:
myData = {'d':40, 'c':30, 'a':20, 'b':10}
mySeries = Series(myData)
mySeries

a    20
b    10
c    30
d    40
dtype: int64

The **isnull** and **notnull** functions in pandas should be used to detect missing data.
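As a rough illustration, passing an explicit index that contains a key absent from the dictionary produces a missing value, which the two functions can then detect. The label `'e'` and the variable names are made up for this sketch:

```python
import pandas as pd

data = {'d': 40, 'c': 30, 'a': 20, 'b': 10}

# 'e' is not a key in data, so its value becomes NaN
s = pd.Series(data, index=['a', 'b', 'c', 'e'])

missing = pd.isnull(s)    # True where a value is missing
present = pd.notnull(s)   # the inverse mask

print(missing)
print(s[present])         # keep only the rows that have data
```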

### DataFrame
A DataFrame is a tabular (spreadsheet-like) data structure with labeled axes (rows and columns); each column can hold a different value type. One of the most common ways to construct a DataFrame is from a dictionary of equal-length lists (or NumPy arrays):

In [7]:
myData = {'State': ['NY', 'NY', 'NY', 'MA', 'MA'],
          'Year': [2000, 2001, 2002, 2001, 2002],
          'Population': [1.5, 1.7, 3.6, 2.4, 2.9]}
#myFrame = DataFrame(myData) - columns are placed in sorted order
#A specific sequence of columns can be passed (if a column not in the data is passed, NA values will appear)
myFrame = DataFrame(myData, columns=['State', 'Year', 'Population', 'Average Income'], 
                    index=['one', 'two', 'three', 'four', 'five'])
myFrame

      State  Year  Population Average Income
one      NY  2000         1.5            NaN
two      NY  2001         1.7            NaN
three    NY  2002         3.6            NaN
four     MA  2001         2.4            NaN
five     MA  2002         2.9            NaN


A column can be retrieved as a Series either by dictionary-like notation or by attribute:

In [8]:
myFrame['State']

one      NY
two      NY
three    NY
four     MA
five     MA
Name: State, dtype: object

In [9]:
myFrame.State

one      NY
two      NY
three    NY
four     MA
five     MA
Name: State, dtype: object
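The two notations return the same Series, but attribute access only works when the column name is a valid Python identifier. A small sketch with a hypothetical two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({'State': ['NY', 'MA'],
                      'Average Income': [5.5, 7.0]})

state_attr = frame.State             # attribute notation
state_key = frame['State']           # dictionary-like notation, same result

# A name containing a space cannot be written as frame.Average Income,
# so bracket notation is the only option here.
income = frame['Average Income']

print(state_attr.equals(state_key))
```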

Rows can be retrieved with the **ix** indexing field (deprecated in pandas 0.20 and removed in 1.0, where **loc** performs the same label-based lookup):

In [10]:
myFrame.ix['four']

State               MA
Year              2001
Population         2.4
Average Income     NaN
Name: four, dtype: object
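For readers on a pandas release without **ix**, the same row lookup can be sketched with **loc** (by label) and **iloc** (by integer position). The frame below is a cut-down stand-in for `myFrame`, not the original object:

```python
import pandas as pd

frame = pd.DataFrame(
    {'State': ['NY', 'NY', 'NY', 'MA', 'MA'],
     'Year': [2000, 2001, 2002, 2001, 2002]},
    index=['one', 'two', 'three', 'four', 'five'])

by_label = frame.loc['four']     # the row labelled 'four'
by_position = frame.iloc[3]      # the fourth row, counted from zero

print(by_label)
```

Both calls return the row as a Series whose index is the frame's column names.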

Columns can be modified by assignment:

In [11]:
#myFrame['Average Income'] = 16.5
#myFrame['Average Income'] = np.arange(5.)

myValue = Series([5.5, 7.0, 6.0], index=['two', 'four', 'five'])
myFrame['Average Income'] = myValue
myFrame

      State  Year  Population  Average Income
one      NY  2000         1.5             NaN
two      NY  2001         1.7             5.5
three    NY  2002         3.6             NaN
four     MA  2001         2.4             7.0
five     MA  2002         2.9             6.0
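The three kinds of right-hand side used in this cell (scalar, array, Series) behave differently. A minimal sketch on a hypothetical three-row frame; the column names `Flag`, `Rank` and `Income` are invented for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

frame['Flag'] = 1                  # a scalar is broadcast to every row
frame['Rank'] = np.arange(3.)      # an array must match the frame's length
# A Series is aligned on the index; labels it lacks become NaN
frame['Income'] = pd.Series([5.5], index=['two'])

print(frame)
```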


The **del** keyword deletes columns:

In [12]:
del myFrame['Population']

In [13]:
myFrame.columns

Index([u'State', u'Year', u'Average Income'], dtype='object')

The transpose is available as myFrame.T, and the **values** attribute returns the data contained in the DataFrame as a 2D ndarray:

In [14]:
myFrame.values

array([['NY', 2000, nan],
       ['NY', 2001, 5.5],
       ['NY', 2002, nan],
       ['MA', 2001, 7.0],
       ['MA', 2002, 6.0]], dtype=object)
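The `dtype=object` in the output above comes from mixing strings and numbers in one array. A short sketch, using a hypothetical frame, of how the dtype of **values** depends on the columns, together with the transpose:

```python
import pandas as pd

frame = pd.DataFrame({'Year': [2000, 2001, 2002],
                      'Income': [5.5, 7.0, 6.0]},
                     index=['one', 'two', 'three'])

transposed = frame.T               # rows become columns and vice versa
numeric = frame.values             # int and float columns share one float dtype

mixed = frame.copy()
mixed['State'] = ['NY', 'NY', 'MA']
mixed_values = mixed.values        # strings force the generic object dtype

print(transposed.shape)
print(numeric.dtype, mixed_values.dtype)
```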

The **drop** method returns **a new object** with the indicated value or values deleted from an axis:

In [15]:
#myFrame.drop('two', axis=0)
myFrame.drop('two')

      State  Year  Average Income
one      NY  2000             NaN
three    NY  2002             NaN
four     MA  2001             7.0
five     MA  2002             6.0


In [16]:
myNewFrame = myFrame.drop('State', axis=1)
myNewFrame

       Year  Average Income
one    2000             NaN
two    2001             5.5
three  2002             NaN
four   2001             7.0
five   2002             6.0
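Because **drop** returns a new object, the original frame is left untouched unless the result is assigned back. A small sketch with a hypothetical two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({'State': ['NY', 'MA'], 'Year': [2000, 2001]},
                     index=['one', 'two'])

fewer_rows = frame.drop('two')              # axis=0 (rows) is the default
fewer_cols = frame.drop('State', axis=1)    # axis=1 drops a column
several = frame.drop(['one', 'two'])        # a list drops several labels at once

print(frame.shape)                          # the original still has both rows and columns
```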


Missing data can be filled in many ways, for example with the column means:

In [17]:
myNewFrame.fillna(myNewFrame.mean())

       Year  Average Income
one    2000        6.166667
two    2001        5.500000
three  2002        6.166667
four   2001        7.000000
five   2002        6.000000
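A few of the other options, sketched on a hypothetical single-column frame. The fill value `6.0` is arbitrary and chosen only for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Income': [np.nan, 5.5, np.nan, 7.0]},
                     index=['one', 'two', 'three', 'four'])

zero_filled = frame.fillna(0)                 # one constant everywhere
per_column = frame.fillna({'Income': 6.0})    # a constant chosen per column
forward = frame.ffill()                       # carry the last valid value forward
dropped = frame.dropna()                      # or discard incomplete rows instead

print(forward)
```

Note that a forward fill leaves a leading NaN in place, since there is no earlier value to carry.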


A function can be applied to each column or row with the **apply** method (element-wise functions can be applied with the **applymap** method):

In [18]:
myFunction = lambda x: x.max() - x.min()
myNewFrame.apply(myFunction)

Year              2.0
Average Income    1.5
dtype: float64
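The call above works column by column; passing `axis=1` applies the function to each row instead. The sketch below uses a hypothetical frame, and for the element-wise case it uses `Series.map` on one column rather than **applymap**, since that call is available across pandas versions:

```python
import pandas as pd

frame = pd.DataFrame({'Low': [1.0, 2.0], 'High': [4.0, 7.0]},
                     index=['one', 'two'])

def spread(x):
    """Range of a column or row: largest value minus smallest."""
    return x.max() - x.min()

per_column = frame.apply(spread)            # one result per column (axis=0)
per_row = frame.apply(spread, axis=1)       # one result per row

# Element-wise formatting of a single column
labels = frame['High'].map(lambda v: '%.1f' % v)

print(per_column)
print(per_row)
```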

To sort lexicographically by row or column index, use the **sort_index()** method (data is sorted in ascending order by default):

In [19]:
myFrame.sort_index(axis=0, ascending=False)

      State  Year  Average Income
two      NY  2001             5.5
three    NY  2002             NaN
one      NY  2000             NaN
four     MA  2001             7.0
five     MA  2002             6.0
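Sorting the column labels, or sorting by the data rather than the labels, follows the same pattern. A minimal sketch on a hypothetical frame, assuming a pandas version that provides `sort_values` (0.17 or later):

```python
import pandas as pd

frame = pd.DataFrame({'Year': [2001, 2000, 2002],
                      'State': ['NY', 'MA', 'NY']},
                     index=['two', 'one', 'three'])

by_row_label = frame.sort_index()           # rows ordered by label
by_col_label = frame.sort_index(axis=1)     # columns ordered by label
by_value = frame.sort_values(by='Year')     # rows ordered by the Year column

print(by_row_label)
print(by_value)
```

Row labels are compared as strings, so `'three'` sorts before `'two'` even though that is not the numeric order.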
