## Introduction
#### pandas contains data structures and data manipulation tools to make data cleaning and analysis fast and easy.
#### Can be used in tandem with numerical computing tools (NumPy and SciPy), analytical libraries (statsmodels, scikit-learn) and data visualisation libraries (matplotlib).
#### It adopts NumPy's idiomatic style of array-based computing. Biggest difference between the two is that Pandas is designed to work with heterogenous data, whereas NumPy is designed for homogenous numerical data.

## pandas Data Structures
#### There are 2 workhorse Data Structures in pandas - Series and DataFrame.
#### Not a universal solution to every problem, but provide a solid, easy-to-use basis.

### Series
#### It's a 1-D array like object containing sequence of values and associated array of labels.
#### The output representation for a Series shows index on the left and values on the right.
#### If we do not specify an index, a default one from 0 to (n-1) is created.
#### You can access the index and values of a Series seperately through 'index' and 'values' attributes respectively.

In [1]:
import pandas as pd

obj = pd.Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [2]:
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [3]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [4]:
obj2.values

array([ 4,  7, -5,  3], dtype=int64)

#### You can use labels in index to select single values or set of values. To select set of values you need to use list of indices.
#### We can use NumPy like operations (filtering with boolean indexing, multiplication or math functions). The index-value link will not be affected by this.

In [5]:
obj2['a']

-5

In [6]:
obj2['d'] = 6
obj2[['c','a','d']]

c    3
a   -5
d    6
dtype: int64

In [7]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [8]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [9]:
import numpy as np

np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

#### Another way of looking at Series is like a fixed-length, ordered Dictionary.
#### This is because mapping of index to values resembles that of a dict.
#### If you have a dict, you can create a Series from it. If you are only passing the dict, then its keys become the index in order.
#### You can override this by passing dict keys in the order you want them to appear.
#### In case the key list is not present in the dict, its respective value will be NaN (Python Version of missing or NA value). Missing values in Series can be found with the 'notnull' or 'isnull' operator.

In [10]:
'b' in obj2

True

In [11]:
'e' in obj2

False

In [12]:
sdata = {'Ohio': 3500, 'Texas':71000, 'Oregon':16000, 'Utah':5000}

obj3 = pd.Series(sdata)
obj3

Ohio       3500
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [13]:
states = ['California','Ohio','Oregon','Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [14]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [15]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

#### A useful feature of Series is that it automatically aligns by the index label in arithematic operations.
#### This alignment of indexes can be seen as similar to Joins in Databases.
#### Both Series object and its Index have a name attribute and they integrate with other pandas functinality.
#### A Series index can be replaced in-place by assignment.

In [16]:
obj3

Ohio       3500
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [17]:
obj4

California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [18]:
obj3 + obj4

California         NaN
Ohio            7000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [19]:
obj4.name = 'population'
obj4.index.name = 'state'

obj4

state
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [20]:
obj.index = ['Bob','Steve','Jeff','Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame
#### Represents rectangular table  of data containing ordered collection of columns, each of different value type (numeric, string, boolean, etc.).
#### It has a row and column index and can be though of as a dict of Series, all sharing same index.
#### Under the hood, data in DataFrame is stored as one or more 2-D blocks.
#### NOTE - Even though DataFrame is 2-D, it can be used to represent higher dimensional data in tabular form using hierarchical indexing.

#### Many ways to create a DataFrame, the most common is from a dict of equal-length lists or NumPy arrays.
#### It will have an index assigned automatically and columns placed in sorted order.
#### The head() method will show you the first 5 rows of the DataFrame by default.

In [21]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       'year' : [2000, 2001, 2002, 2001, 2002, 2003],
       'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)

In [22]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [23]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


#### To arrange sequence of columns, specify the sequence.
#### If you pass a column that isn't present in the dict, it will appear with missing values (NaN).
#### A column can be retrieved as a Series using a dict-like notation or by attribute.
#### The returned Series will have the same index as the DataFrame.
#### NOTE - Attribute-like access and tab completion of column names is provided as convineance in IPython. The attribute use of column name only works if the column name is valid Python variable name.

In [24]:
pd.DataFrame(data, columns=['year', 'state','pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [25]:
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [26]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [27]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [28]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

#### Rows can be retrieved by position or name using the 'loc' method.

In [29]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

#### Columns can be modified by assignment. We can pass a scalar values or a list of values to be assigned.
#### When assigning, the length of the value list must match the length of the DataFrame.
#### If we use a Series to assign to a column, its index will realign to the index of the DataFrame. Missing values will be inserted with NaNs.
#### assigning a non-existant column will create a new column. New columns cannot be created with the 'attribute' method.

In [30]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [31]:
frame2['debt'] = np.arange(6.0)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


In [32]:
val = pd.Series([-1.2, -1.5, -1.7], index = ['two','four','five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


In [33]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


#### The 'del' keyword will delete columns as with a dict.
#### NOTE - Any column returned from any index operation is a view for the DataFrame, and not a copy. any change made to it will be reflected in the DataFrame.

In [34]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

#### Another method of input is nested dict of dicts.
#### If it is passed as input to a DataFrame, pandas will interpret outer dict keys as columns and inner keys as row indexes.
#### The keys of inner dict are sorted to form the index. But this can be changed if the index values are explicitly specified.
#### The same rules apply to a dict of Series.

In [35]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
      'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [36]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

AttributeError: 'list' object has no attribute 'astype'

In [38]:
pdata = {'Ohio': frame3['Ohio'][:-1],
        'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


#### You can transpose a dataframe (swap rows with columns) in similar syntax as NumPy array.
#### If index and columns in a DataFrame have their name attributes set, they will also be displayed.
#### The values attribute returns the data in a DataFrame as a 2-D ndarray.
#### If the dtypes of the columns are different, then the output ndarray's dtype will be chosen in such a way as to accomodate all the columns.

In [39]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [40]:
frame3.index.name = 'year'
frame3.columns.name = 'state'

frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [41]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [42]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

### Index Objects
#### They are responsible for holding axis labels and other metadata.
#### Any array or sequence of labels used in a Series or DataFrame is internally converted to an Index.
#### Indexes are immutable (can't be modififed by any user). Hence it is safer to share among data structures.
#### They behave like a fixed-size set. But unlike a set, they can contain duplicate labels.
#### selection of duplicate labels will select all occurences of that label.

In [43]:
obj = pd.Series(range(3), index=['a','b','c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [44]:
index[1] = 'd'

TypeError: Index does not support mutable operations

In [46]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [47]:
obj2 = pd.Series([1.5, -2.5, 0], index = labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [48]:
obj2.index is labels

True

In [49]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [50]:
'Ohio' in frame3.columns

True

In [51]:
2003 in frame3.columns

False

In [52]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

### Dropping Entries from an Axis
#### Dropping entries from an axis is easy if you already have index array or list without those entries.
#### The 'drop' method will return a new object with the indicate dvalues deleted from the axis.
#### In DataFrames, index values can be deleted from either axis using the drop function.
#### By default drop will delete values from row axis. To delet from columns, you need to specify 'axis=1' or "axis='columns'".

In [53]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [54]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [55]:
obj.drop(['d','c'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [56]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                   index = ['Ohio','Colorado','Utah','New York'],
                   columns = ['one','two','three','four'])

data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [57]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [58]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [59]:
data.drop(['two','four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


#### drop, like many other functions which modify size or shape of Series or DataFrame, can manipulate objects in-place without returning a new object.
#### Careful with inplace, as it can destroy any data that it drops.

In [60]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

### Indexing, Selection and Filtering
#### series indexing works similarly to NumPy array indexing. Difference is that for Series you can use index values insted of only integers.
#### Slicing Series is different from Python slicing because here the endpoint is inclusive of the slice.
#### Using slices for modifications will modify the Series as well.

In [61]:
obj = pd.Series(np.arange(4.), index = ['a','b','c','d'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [62]:
obj['b']

1.0

In [63]:
obj[1]

1.0

In [64]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [65]:
obj[['b','a','d']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [66]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

In [67]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

In [68]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [69]:
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

#### DataFrame indexing is for retrieving 1 or more columns through single value or sequence. 

In [70]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                   index = ['Ohio','Colorado','Utah','New York'],
                   columns = ['one','two','three','four'])

data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [71]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [72]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


#### There are some special cases in this indexing - 
#### 1. Selecting data with Boolean array - This filters rows based on the value of a column and selects all the columns.

In [73]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


#### 2. Indexing with a Boolean dataframe, usually by scalar comparison - This DataFrame looks more like a 2-D NumPy array.

In [74]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [75]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Selection with loc and iloc
#### These are special indexing operators that select subset of rows and columns with NumPy like notation.
#### loc is used for axis labels and iloc for integers.

In [76]:
data.loc['Colorado', ['two','three']]

two      5
three    6
Name: Colorado, dtype: int32

In [77]:
data.iloc[2, [3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

#### Both the operators work with slices as well as single labels and list of labels.

In [78]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [79]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


### Integer Indexes
#### Working with pandas objects indexed with integers is slightly different from in-built Python Data Structures oin indexing semantics.
#### This trips-up new users. Like the below example, which will give an error.

In [80]:
ser = pd.Series(np.arange(3.))
ser[-1]

KeyError: -1

#### Doing a 'fall back' on integer indexing would be difficult without introducing subtle bugs.
#### Inferring what the user wants in this case would be difficult. But with non-integer index, there is no potential for ambiguity.
#### Hence to keep things consistent, axis with integer labels will always be label oriented.
#### We can always use loc and iloc for precise handling.

In [82]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [83]:
ser2 = pd.Series(np.arange(3.), index=['a','b','c'])
ser2[-1]

2.0

In [84]:
ser[:1]

0    0.0
dtype: float64

In [85]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [86]:
ser.iloc[:1]

0    0.0
dtype: float64

### Arithematic and Data Alignment
#### The behaviour of arithematic operators between objects is important.
#### If index pairs of the the objects you are adding are not same, the result's index will be a union of both.
#### This is similar to automatic Outer Join in Databases.
#### This internal data alignment introduces NaN values to labels that don't overlap. The missing values will propagate in further arithematic computations.

In [87]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a','c','e','f','g'])

s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [88]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [89]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

#### For DataFrames, the alignment is performed in both rows and columns. 
#### So the result of adding 2 DataFrames is a DataFrame whose index and columns are unions of each of the DataFrames.
#### Missing values will apear both in the row and column indices whose labels are not common to the 2 objects.
#### If the row or column has no values, the result will als o contain NaNs.

In [90]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), columns=list('bcd'), 
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'), 
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [91]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [92]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [93]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [94]:
df1 = pd.DataFrame({'A': [1,2]})
df2 = pd.DataFrame({'B': [3,4]})

In [95]:
df1

Unnamed: 0,A
0,1
1,2


In [96]:
df2

Unnamed: 0,B
0,3
1,4


In [97]:
df1 - df2

Unnamed: 0,A,B
0,,
1,,


### Arithematic Methods with Fill Values
#### A good practice is to fill a value in arithematic operations between differently indexed objects when 1 axis label is found in one object but not in another.
#### This involves using the 'add' method with a 'fill_value'.

In [98]:
df1 = pd.DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))

In [99]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [100]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


#### For every arithematic operation for Series and DataFrame, we have a counterpart starting with letter r, where arguments are flipped.
#### We can specify fill value also when reindexing a Series or DataFrame. 

In [101]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [102]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [103]:
df1.reindex(columns = df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


### Operations between DataFrame and Series
#### In NumPy, when you run an operation between a 2-D and a 1-D array, the 1-D array is operated on all the rows of the 2-D array.
#### This is referred to as Broadcasting.
#### In DataFrames, when we operate between a DataFrame and a Series, the same broadcasting happens.
#### By default, arithematic between DataFrame and Series matches insex of the Series on the Columns, broadcasting down rows.
#### If index values are not found, objects will be reindexed to form the union. 

In [104]:
# For NumPy

arr = np.arange(12.).reshape((3,4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [105]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [106]:
# For DataFrame and Series
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                    columns=list('bde'),
                    index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

In [107]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [108]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [109]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [110]:
series2 = pd.Series(range(3), index=['b','e','f'])
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


#### If you want to match on rows and broadcast over Columns, use one of the arithematic methods.
#### The axis you pass is the axis to match on. To match on row index we pass axis='index' or axis=0.

In [111]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [112]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [113]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


### Function Application and Mapping
#### NumPy ufuncs can also work with pandas objects.
#### We can also apply user defined functions for 1-D arrays to each row or column using DataFrame's 'apply' method.
#### The function will be invoked once on each column. Result will be Series having Columns of frame as its index.
#### By passing axis='columns', the function will be invoked once per row.

In [114]:
frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.519304,0.299195,0.035282
Ohio,0.354135,0.370665,0.453685
Texas,0.109418,-0.020133,-0.967618
Oregon,0.724363,-1.698724,0.012189


In [115]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.519304,0.299195,0.035282
Ohio,0.354135,0.370665,0.453685
Texas,0.109418,0.020133,0.967618
Oregon,0.724363,1.698724,0.012189


In [116]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    0.614945
d    2.069389
e    1.421303
dtype: float64

In [117]:
frame.apply(f, axis='columns')

Utah      0.484022
Ohio      0.099549
Texas     1.077036
Oregon    2.423087
dtype: float64

#### Most of the common array statistics are DataFrame methods, so using apply is not necessary.
#### The functions may not return a scalar value, but may also return a Series.
#### Element-wise Python functions can be applied to pandas objects using 'applymap'.
#### The name is because series has a method 'map' for applying functions for each element.

In [118]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])

frame.apply(f)

Unnamed: 0,b,d,e
min,0.109418,-1.698724,-0.967618
max,0.724363,0.370665,0.453685


In [119]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,0.52,0.3,0.04
Ohio,0.35,0.37,0.45
Texas,0.11,-0.02,-0.97
Oregon,0.72,-1.7,0.01


In [120]:
frame['e'].map(format)

Utah       0.04
Ohio       0.45
Texas     -0.97
Oregon     0.01
Name: e, dtype: object

### Sorting and Ranking
#### Sorting dataset  lexicographically is possible using sort_index method, which returns a new object.
#### In DataFrames, you can sort by index on either axis.
#### Data is sorted in ascending order by default, but can be sorted in descending orer too.

In [121]:
obj = pd.Series(range(4), index=['d','a','b','c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [122]:
frame = pd.DataFrame(np.arange(8).reshape((2,4)),
                    index=['three','one'],
                    columns = ['d','a','b','c'])

frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [123]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [124]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


#### To sort Series by its values, use 'sort_values'.
#### Missing values will be sorted to the end of the Series by default.
#### While sorting DataFrames, you can use values of more than 1 columns as sort keys. Do this by passing 1 or more column names in the sort_values option.

In [125]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [126]:
frame = pd.DataFrame({'b': [4,7,-3,2], 'a':[0,1,0,1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [127]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [128]:
frame.sort_values(by=['a','b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


#### Ranking assigns ranks 1 through the number of valid data points in an array.
#### In pandas objects by default, 'rank' breaks ties by assigning each group mean rank.
#### Ranks can also be assigned in the order in which they are observed using method='first'. This breaks ties giving upper rank to the value observed first.
#### Rank can be assigned in descending order too.
#### For DataFrame, ranks can be computed for rows as well.

In [129]:
obj = pd.Series([7,-5,7,4,2,0,4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [130]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [131]:
# Assign tie values the max rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [132]:
frame = pd.DataFrame({'b': [4.3,7,-3,2], 'a':[0,1,0,1],
                     'c': [-2,5,8,-2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [133]:
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


### Axis Indexes with duplicate Labels
#### All the previous examples have required unique axis labels.
#### Many pandas functions (eg - reindex) require unique labels, but that's not necessary.
#### The property 'is_unique' can tell if the labels of a Series are unique or not.
#### Data selection is affected by duplicate labels. Getting an index with duplicate label will get a Series, with single entries will return a scalar value.
#### This multiple values for a label will make code more complicated because indexing of output will keep on varying.
#### The same logic applies to indexing in DataFrames.

In [134]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [135]:
obj.index.is_unique

False

In [136]:
obj['a']

a    0
a    1
dtype: int64

In [137]:
obj['c']

4

In [138]:
df = pd.DataFrame(np.random.randn(4,3), index=['a','a','b','b'])
df

Unnamed: 0,0,1,2
a,0.623306,-0.412164,0.074452
a,0.692392,-0.067321,0.550947
b,0.548775,0.748544,-1.071833
b,-1.187395,-0.82733,-0.134716


In [139]:
df.loc['b']

Unnamed: 0,0,1,2
b,0.548775,0.748544,-1.071833
b,-1.187395,-0.82733,-0.134716


## Summarizing and Computing Descriptive Statistics
#### pandas has a set of common mathematical and statistical methods library.
#### These methods can be classified into 'reductions' or 'summary statistics', i.e. extract single value from a Series or a Series of values from rows or columns of DataFrame.
#### Like NumPy, these methods have built-in handling for missing data.

#### sum method returns a Series containing column sums.
#### Passing axis='columns' or axis=1 sums across columns instead.
#### NA values are excluded unless entire entire slice in NA. It can be disabled with skipna option.

In [142]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                  [np.nan, np.nan], [0.75,-1.3]],
                 index=['a','b','c','d'],
                 columns=['one','two'])

df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [143]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [144]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [145]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

#### Some methods (eg - idxmin, idxmax) return indirect statistic like index value where the min or max values were attained. 

In [147]:
df.idxmax()

one    b
two    d
dtype: object

#### Accumulation is another class of functions.
#### Rather than checking individual values, they go for aggregated values.

In [149]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


#### Some functions are neither reductions nor accumulations.
#### 'describe' is one such method. It provides multiple summary statistics in one shot.
#### For non-numeric

In [151]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [152]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [153]:
obj = pd.Series(['a','a','b','c'] * 4)

In [154]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

### Correlation and Covariance
#### Some summary statistics (correlation, covariance) are computed from pairs of arguments.
#### We will be using pandas-datareader package with Quandl information for this example

In [190]:
import pandas_datareader.data as web
import datetime

start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
all_data = {ticker: web.DataReader(symbol, 'quandl', start, end)
        for ticker in ['WIKI/FB','WIKI/AAPL']}

In [198]:
all_data

{'WIKI/FB':               Open     High     Low    Close       Volume  ExDividend  \
 Date                                                                    
 2013-01-25  31.410  31.9300  31.130  31.5400   54363600.0         0.0   
 2013-01-24  31.270  31.4900  30.810  31.0800   43845100.0         0.0   
 2013-01-23  31.100  31.5000  30.800  30.8200   48899800.0         0.0   
 2013-01-22  29.750  30.8900  29.740  30.7290   55243300.0         0.0   
 2013-01-18  30.310  30.4400  29.270  29.6600   49631500.0         0.0   
 2013-01-17  30.080  30.4200  30.030  30.1400   40256700.0         0.0   
 2013-01-16  30.210  30.3500  29.530  29.8500   75332700.0         0.0   
 2013-01-15  30.640  31.7100  29.880  30.1000  173242600.0         0.0   
 2013-01-14  32.080  32.2100  30.620  30.9475   98892800.0         0.0   
 2013-01-11  31.280  31.9600  31.100  31.7200   89598000.0         0.0   
 2013-01-10  30.600  31.4500  30.280  31.3000   95316400.0         0.0   
 2013-01-09  29.670  30.600

In [201]:
price = pd.DataFrame({ticker:data['AdjClose']
                     for ticker, data in all_data.items()})
price.columns = ['FB', 'AAPL']
price.head()

Unnamed: 0_level_0,FB,AAPL
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2013-01-25,31.54,31.54
2013-01-24,31.08,31.08
2013-01-23,30.82,30.82
2013-01-22,30.729,30.729
2013-01-18,29.66,29.66


In [202]:
volume = pd.DataFrame({ticker:data['Volume']
                     for ticker, data in all_data.items()})
volume.columns = ['FB', 'AAPL']
volume.head()

Unnamed: 0_level_0,FB,AAPL
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2013-01-25,54363600.0,54363600.0
2013-01-24,43845100.0,43845100.0
2013-01-23,48899800.0,48899800.0
2013-01-22,55243300.0,55243300.0
2013-01-18,49631500.0,49631500.0


#### We can get percent changes as a time-series operation using pct_change.
#### The 'corr' method computes correlation of overlapping, non-NA, aligned-by-index values of 2 series.
#### Similarly, 'cov' computes covariance between 2 Series.
#### Using Column name as an attribute allows corr and cov between 2 Series without index labels. corr and cov on a DataFrame returns a full matrix.
#### Using 'corrwith' we can compute pairwise correlations between DataFrame's columns or rows with another Series or DataFrame. 
#### Passing a Series in 'corrwith' will return a Series with correlation value for each column. Passing a DataFrame gets correlations of matching column names.
#### Passing axis='columns' does things row-by-row instead.

In [203]:
returns = price.pct_change()
returns.tail()

Unnamed: 0_level_0,FB,AAPL
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2012-05-24,0.035099,0.035099
2012-05-23,-0.031184,-0.031184
2012-05-22,-0.03125,-0.03125
2012-05-21,0.097742,0.097742
2012-05-18,0.123473,0.123473


In [205]:
returns['FB'].corr(returns['AAPL'])

0.9999999999999998

In [206]:
returns.FB.corr(returns.AAPL)

0.9999999999999998

In [207]:
returns.corr()

Unnamed: 0,FB,AAPL
FB,1.0,1.0
AAPL,1.0,1.0


In [208]:
returns.cov()

Unnamed: 0,FB,AAPL
FB,0.001545,0.001545
AAPL,0.001545,0.001545


In [209]:
returns.corrwith(returns.AAPL)

FB      1.0
AAPL    1.0
dtype: float64

In [210]:
returns.corrwith(volume)

FB      0.222491
AAPL    0.222491
dtype: float64

### Unique Values, Value Counts, and Membership
#### There is a class of methods that extracts info about values in 1-D Series.
#### 'unique' gives you an array of unique values in a Series. The values are not sorted but could be after the fact.
#### 'value_counts' returns a Series with value frequencies. By default it is sorted in descending order for convinience and can be turned off with sort=False.
#### It is also a top-lebvel pandas method that can be used with any array or sequence.

In [212]:
obj = pd.Series(['c','d','a','a','e','b','c','c'])

In [213]:
uniques = obj.unique()

uniques

array(['c', 'd', 'a', 'e', 'b'], dtype=object)

In [215]:
uniques.sort()
uniques

array(['a', 'b', 'c', 'd', 'e'], dtype=object)

In [216]:
obj.value_counts()

c    3
a    2
e    1
b    1
d    1
dtype: int64

In [217]:
pd.value_counts(obj.values, sort=False)

d    1
b    1
a    2
e    1
c    3
dtype: int64

#### 'isin' does a vectorized set membership check. Useful in filtering a dataset to a subset of values in a Series or column in a DataFrame.
#### Index.get_indexer gives an index array from an array of non-distinct values into an array of distinct values.

In [219]:
mask = obj.isin(['b','c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
dtype: bool

In [220]:
to_match = pd.Series(['c','a','b','b','c','a'])
unique_vals = pd.Series(['c','b','a'])

pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)

#### We can use the value_counts as a method as told earlier to apply to an entire DataFrame to get a Histogram on multiple related columns.
#### The output's row labels are the distinct values in the DataFrame and the values are respective counts of these distinct values in each column.

In [222]:
data = pd.DataFrame({'Qu1': [1,3,4,3,4],
                    'Qu2': [2,3,1,2,3],
                    'Qu3': [1,5,2,4,4]})

data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [223]:
result = data.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
