#Pandas Cheat Sheet

References:  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html>    
<http://synesthesiam.com/posts/an-introduction-to-pandas.html>  
<http://pbpython.com/excel-pandas-comp.html>  
<http://pbpython.com/excel-pandas-comp-2.html>  

<https://iqbalnaved.wordpress.com/2013/08/26/python-pandas-hacks/>   

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime

<https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf>   

|Create pandas data structures| |
|--|--|
|s = Series(data, index) |Create a Series.|
|df = DataFrame (data, index, columns) |Create a Dataframe.|
|p = Panel(data, items, major_axis, minor_axis)|Create a Panel (deprecated in pandas 0.20 and removed in 0.25).|


|	DataFrame Commands	|		|
|--|--|
|	df[col]	|	Select column.	|
|	df.loc[label]	|	Select row by label.	|
|	df.index	|	Return DataFrame index.	|
|	df.drop()	|	Delete given row or column. Pass axis=1 for columns.	|
|	df1 = df1.reindex_like(df2)	|	Reindex df1 to match the index and columns of df2.	|
|	df.reset_index()	|	Reset index, putting old index in column named index.	|
|	df.reindex()	|	Change DataFrame index; rows for new index labels are filled with NaN.	|
|	df.head(n)	|	Show first n rows.	|
|	df.tail(n)	|	Show last n rows.	|
|	df.sort_index()	|	Sort by index.	|
|	df.sort_index(axis=1)	|	Sort by column labels.	|
|	df.pivot(index,column,values)	|	Pivot DataFrame, using new conditions.	|
|	df.T	|	Transpose DataFrame.	|
|	df.stack()	|	Change lowest level of column labels into innermost row index.	|
|	df.unstack()	|	Change innermost row index into lowest level of column labels.	|
|	df.applymap()	|	Apply function to every element in DataFrame.	|
|	df.apply()	|	Apply function along a given axis	|
|	df.dropna()	|	Drops rows where any data is missing.	|
|	df.count()	|	Return Series with the number of non-null values in every column.	|
|	df.min()	|	Return minimum of every column.	|
|	df.max()	|	Return maximum of every column.	|
|	df.describe()	|	Generate various summary statistics for every column.	|
|	pd.concat()	|	Concatenate DataFrame or Series objects.	|

|	Groupby	|		|
|--|--|
|	groupby()	|	Split DataFrame by columns. Creates a GroupBy object (gb).	|
|	gb.agg()	|	Apply function (single or list) to a GroupBy object.	|
|	gb.transform()	|	Applies function and returns object with same index as one being grouped.	|
|	gb.filter()	|	Filter GroupBy object by a given function.	|
|	gb.groups	|	Return dict whose keys are the unique groups, and values are axis labels belonging to each group.	|

|	I/O	|		|
|--|--|
|	df.to_csv('foo.csv')	|	Save to CSV.	|
|	pd.read_csv('foo.csv')	|	Read CSV into DataFrame.	|
|	df.to_excel('foo.xlsx', sheet_name='Sheet1')	|	Save to Excel.	|
|	pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])	|	Read Excel into DataFrame.	|

	


###Set the display width for print output

In [2]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 150)

In [3]:
def makeDateRand():
    dates = pd.date_range('20130101',periods=6)
    df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))    
    return df

In [4]:
def makefoobar():
    return  pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                          'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                          'C' : np.random.randn(8),
                          'D' : np.random.randn(8)})

In [5]:
def makegridDF():
    return  pd.DataFrame({'A' : [1,2,3],
                          'B' : [4,5,6],
                          'C' : [7,8,9],
                          'D' : [10,11,12]})



##Creating/Loading Data

###NaN in Pandas / Numpy

In [6]:
a = np.nan
print(a)
print(pd.isnull(a))

nan
True
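
The check above uses `pd.isnull` rather than `==`. A minimal sketch of why, assuming a small illustrative Series `s` that is not part of the original notebook: NaN never compares equal to anything, including itself, so an equality test silently finds nothing.

```python
import numpy as np
import pandas as pd

a = np.nan
# NaN compares unequal to everything, including itself
nan_equals_itself = (a == a)

# Illustrative Series (an assumption, not from the notebook)
s = pd.Series([1.0, np.nan, 3.0])
# Element-wise equality against NaN never matches
eq_mask = (s == np.nan)
# isnull() is the reliable test for missing values
null_mask = s.isnull()

print(nan_equals_itself)
print(eq_mask.tolist())
print(null_mask.tolist())
```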


In [7]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print(pd.isnull(a))
#change all NaN to some other value
df.B = df.B.fillna('**')
print(df)

   A   B
0  1 NaN
1  3   4
True
   A   B
0  1  **
1  3   4


Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns.

In [8]:
df = makeDateRand()
print(df)

                   A         B         C         D
2013-01-01 -0.080959 -0.295942  0.969176 -0.198913
2013-01-02  0.074509 -0.419840  0.452743  1.055139
2013-01-03 -0.356354 -0.521118 -0.631585 -0.835083
2013-01-04  1.379267 -1.009065 -0.124177  0.352295
2013-01-05 -0.516978 -0.369779  2.040361 -1.361787
2013-01-06 -0.730203 -1.338022  0.847936  0.212879


Creating a DataFrame by passing a dict of objects that can be converted to series-like.

In [9]:
df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                      'E' : pd.Categorical(["test","train","test","train"]),
                      'F' : 'foo' })
print(df2.dtypes)
df2


A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


Unnamed: 0,A,B,C,D,E,F
0,1,2013-01-02,1,3,test,foo
1,1,2013-01-02,1,3,train,foo
2,1,2013-01-02,1,3,test,foo
3,1,2013-01-02,1,3,train,foo


In [10]:
#create dataframe from a string
from io import StringIO  # Python 3 location of StringIO
data = """UsrId JobNos
1 4
1 5"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
df

Unnamed: 0,UsrId,JobNos
0,1,4
1,1,5


If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled:  
    `df2.<Tab>`

In [11]:
df.to_csv('foo.csv')
pd.read_csv('foo.csv')

Unnamed: 0.1,Unnamed: 0,UsrId,JobNos
0,0,1,4
1,1,1,5


In [12]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Unnamed: 0,UsrId,JobNos
0,1,4
1,1,5


###Get the Number of Rows in a DataFrame

In [13]:
df = pd.DataFrame([[1, np.nan], [3, 4]], columns=list('AB'))
print(df)
print('\nlen(df)={}'.format(len(df)))
print('\nshape[0]={}'.format(df.shape[0]))
print('\ndf.count()[0]={}'.format(df.count()[0]))
print('\ndf.count()[1]={}'.format(df.count()[1]))
print('\nNans = \n{}'.format(df.apply(lambda col: pd.isnull(col))))

   A   B
0  1 NaN
1  3   4

len(df)=2

shape[0]=2

df.count()[0]=2

df.count()[1]=1

Nans = 
       A      B
0  False   True
1  False  False


##Manipulating DataFrames

###Dropping and adding rows in a DataFrame

When using drop(), note the axis direction.   
- to drop a column: axis=1  
- to drop a row: axis=0 (default)

In [14]:
#drop a column (axis=1), then drop a row by index label (default axis=0)
df = makegridDF()
print(df.drop('A',axis=1))
print(df.drop(0))

   B  C   D
0  4  7  10
1  5  8  11
2  6  9  12
   A  B  C   D
1  2  5  8  11
2  3  6  9  12


In [15]:
#drop a row
df = makegridDF()
print(df.drop(1))
print(df.drop(0,axis=0))

   A  B  C   D
0  1  4  7  10
2  3  6  9  12
   A  B  C   D
1  2  5  8  11
2  3  6  9  12


In [16]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
# print(df)
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
print(df)

   A  B
0  1  2
1  3  4
0  5  6
1  7  8


In [17]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df.index)
df.loc[df.shape[0]] = ['a','b']
df.loc[df.shape[0]] = ['d','e']

print(df)


Int64Index([0, 1], dtype='int64')
   A  B
0  1  2
1  3  4
2  a  b
3  d  e


In [18]:
#set / change the index name
df.index.name = 'DateIndex'
df

|DateIndex|A|B|
|--|--|--|
|0|1|2|
|1|3|4|
|2|a|b|
|3|d|e|


In [19]:
#set index to one of the columns
df.index = df.A
print(df)

   A  B
A      
1  1  2
3  3  4
a  a  b
d  d  e


In [20]:
df2 = pd.concat([df,df[2:4]])
df2

|A (index)|A|B|
|--|--|--|
|1|1|2|
|3|3|4|
|a|a|b|
|d|d|e|
|a|a|b|
|d|d|e|


In [21]:
#Find duplicate rows
df2['isdup'] = df2.duplicated(subset=['A','B'])
print(df2)
# df = df2[df.duplicated(subset=1)] # creates a new dataframe with the repeated rows
# len(df) # number of duplicates

   A  B  isdup
A             
1  1  2  False
3  3  4  False
a  a  b  False
d  d  e  False
a  a  b   True
d  d  e   True


In [22]:
# creates a new dataframe with the repeated rows
dx = df2[df2.duplicated(subset=['A'])] 
dx

|A (index)|A|B|isdup|
|--|--|--|--|
|a|a|b|True|
|d|d|e|True|


In [23]:
# Drop rows with duplicate values in column 'A', keeping the last occurrence
df2.drop_duplicates(subset=['A'], keep='last', inplace=True)
df2

|A (index)|A|B|isdup|
|--|--|--|--|
|1|1|2|False|
|3|3|4|False|
|a|a|b|True|
|d|d|e|True|


In [24]:
df.head(2)

|A (index)|A|B|
|--|--|--|
|1|1|2|
|3|3|4|


In [25]:
df.tail(2)

|A (index)|A|B|
|--|--|--|
|a|a|b|
|d|d|e|


In [26]:
df.index

Index([1, 3, u'a', u'd'], dtype='object')

In [27]:
df.columns

Index([u'A', u'B'], dtype='object')

In [28]:
df.values

array([[1L, 2L],
       [3L, 4L],
       ['a', 'b'],
       ['d', 'e']], dtype=object)

In [29]:
df.T

|A|1|3|a|d|
|--|--|--|--|--|
|A|1|3|a|d|
|B|2|4|b|e|


###Adding a column to a dataframe

In [30]:
print(df)
print(df.A)
print(df['A'])
print(df[['A','B']])

   A  B
A      
1  1  2
3  3  4
a  a  b
d  d  e
A
1    1
3    3
a    a
d    d
Name: A, dtype: object
A
1    1
3    3
a    a
d    d
Name: A, dtype: object
   A  B
A      
1  1  2
3  3  4
a  a  b
d  d  e


In [31]:
df = makeDateRand()
df['Total'] = df['A'] + df['B'] + df['C']
print(df)

                   A         B         C         D     Total
2013-01-01  2.618385  1.477583 -0.268805  0.758620  3.827162
2013-01-02  0.225074  0.477158 -1.707232 -1.462022 -1.005000
2013-01-03  0.250879  0.458501 -1.329991 -0.968092 -0.620611
2013-01-04 -0.747945 -0.533420  0.812261 -0.713105 -0.469105
2013-01-05 -1.725701  1.311558 -1.262824 -0.717738 -1.676967
2013-01-06  1.655497  0.903953  1.662080  0.315426  4.221531


In [32]:
#Sum of each column
df['A'].sum(), df['B'].sum(), df['C'].sum()

(2.2761887060649197, 4.0953334365554817, -2.0945121508549356)

In [33]:
sum_row=df[["A","B",'Total']].sum()
sum_row

A        2.276189
B        4.095333
Total    4.277010
dtype: float64

We need to convert the Series to a DataFrame and transpose it so that it is easier to concatenate onto our existing data. The T attribute transposes the frame, turning the index labels A, B and Total into columns.

In [34]:
df_sum=pd.DataFrame(data=sum_row).T
df_sum

Unnamed: 0,A,B,Total
0,2.276189,4.095333,4.27701


The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing.

In [35]:
df_sum=df_sum.reindex(columns=df.columns)
df_sum

Unnamed: 0,A,B,C,D,Total
0,2.276189,4.095333,,,4.27701


Now append the totals to the end of the dataframe

In [36]:
df=pd.concat([df, df_sum], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df.tail()

Unnamed: 0,A,B,C,D,Total
2,0.250879,0.458501,-1.329991,-0.968092,-0.620611
3,-0.747945,-0.53342,0.812261,-0.713105,-0.469105
4,-1.725701,1.311558,-1.262824,-0.717738,-1.676967
5,1.655497,0.903953,1.66208,0.315426,4.221531
6,2.276189,4.095333,,,4.27701


###Sorting

In [37]:
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.417729,1.604336,1.729113,-0.455307
2013-01-02,-1.869739,0.695898,0.496634,2.010457
2013-01-03,1.443181,-0.361062,-1.185517,-0.180359
2013-01-04,-1.05046,0.943304,-0.03672,-0.710934
2013-01-05,-0.150511,-2.379234,-0.45555,1.519411
2013-01-06,-2.321827,0.361758,-0.651096,-0.354631


Sort by column index (axis=1)

In [38]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,-0.455307,1.729113,1.604336,-0.417729
2013-01-02,2.010457,0.496634,0.695898,-1.869739
2013-01-03,-0.180359,-1.185517,-0.361062,1.443181
2013-01-04,-0.710934,-0.03672,0.943304,-1.05046
2013-01-05,1.519411,-0.45555,-2.379234,-0.150511
2013-01-06,-0.354631,-0.651096,0.361758,-2.321827


Sort all rows by value in column B

In [39]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-05,-0.150511,-2.379234,-0.45555,1.519411
2013-01-03,1.443181,-0.361062,-1.185517,-0.180359
2013-01-06,-2.321827,0.361758,-0.651096,-0.354631
2013-01-02,-1.869739,0.695898,0.496634,2.010457
2013-01-04,-1.05046,0.943304,-0.03672,-0.710934
2013-01-01,-0.417729,1.604336,1.729113,-0.455307


###Selecting portions

While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc. The older .ix accessor was removed in pandas 1.0.

[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing)  
[MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced)
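
As a rough sketch of how the four accessors differ, the example below reads the same cell four ways and contrasts label slicing with positional slicing. The small frame built with `np.arange` is an assumption chosen so the values are predictable; it is not one of the notebook's frames.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
# Illustrative frame with values 0..11 (not from the notebook)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))

by_label = df.loc[dates[1], 'B']       # label-based: row label, column label
by_position = df.iloc[1, 1]            # position-based: second row, second column
scalar_label = df.at[dates[1], 'B']    # scalar access by label
scalar_position = df.iat[1, 1]         # scalar access by position

# Label slices include both endpoints; positional slices exclude the stop
label_slice = df.loc[dates[0]:dates[1]]
position_slice = df.iloc[0:1]

print(by_label, by_position, scalar_label, scalar_position)
print(len(label_slice), len(position_slice))
```

`.at` and `.iat` only accept a single row and column, which is why they are suited to scalar reads and writes.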

Selecting a single column, which yields a Series, equivalent to df.A

In [40]:
df['A']

2013-01-01   -0.417729
2013-01-02   -1.869739
2013-01-03    1.443181
2013-01-04   -1.050460
2013-01-05   -0.150511
2013-01-06   -2.321827
Freq: D, Name: A, dtype: float64

In [41]:
np.asarray(df['A'])

array([-0.41772912, -1.86973945,  1.4431808 , -1.05046033, -0.15051053,
       -2.32182723])

Selecting via [], which slices the rows.

In [42]:
df[2:4]

Unnamed: 0,A,B,C,D
2013-01-03,1.443181,-0.361062,-1.185517,-0.180359
2013-01-04,-1.05046,0.943304,-0.03672,-0.710934


In [43]:
df['2013-01-01':'2013-01-02']

Unnamed: 0,A,B,C,D
2013-01-01,-0.417729,1.604336,1.729113,-0.455307
2013-01-02,-1.869739,0.695898,0.496634,2.010457


###[Selection by Label](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-label)

In [44]:
df = makeDateRand()
df.loc[dates[0]]

A    0.728004
B    1.883773
C   -0.887738
D   -0.922813
Name: 2013-01-01 00:00:00, dtype: float64

In [45]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-01-01,0.728004,1.883773
2013-01-02,-0.831081,1.408119
2013-01-03,0.054015,2.065349
2013-01-04,-0.045419,-0.709139
2013-01-05,-0.334698,0.366812
2013-01-06,-0.354265,1.055953


In [46]:
#endpoints included
df.loc['20130102':'20130104',['A','B']]

Unnamed: 0,A,B
2013-01-02,-0.831081,1.408119
2013-01-03,0.054015,2.065349
2013-01-04,-0.045419,-0.709139


In [47]:
df.loc['20130102',['A','B']]

A   -0.831081
B    1.408119
Name: 2013-01-02 00:00:00, dtype: float64

In [48]:
#locate a row by index label
df.loc[datetime(2013, 1, 2)]

A   -0.831081
B    1.408119
C   -0.383991
D   -0.575649
Name: 2013-01-02 00:00:00, dtype: float64

In [49]:
#locate a row by integer position
df.iloc[1]

A   -0.831081
B    1.408119
C   -0.383991
D   -0.575649
Name: 2013-01-02 00:00:00, dtype: float64

In [50]:
#drop some of the rows
df = df.drop(df.index[[0,1,2,3]])
for idx, row in df.iterrows():
    print(idx,row)

(Timestamp('2013-01-05 00:00:00', offset='D'), A   -0.334698
B    0.366812
C    1.501057
D   -0.205027
Name: 2013-01-05 00:00:00, dtype: float64)
(Timestamp('2013-01-06 00:00:00', offset='D'), A   -0.354265
B    1.055953
C    1.507894
D   -0.666595
Name: 2013-01-06 00:00:00, dtype: float64)


In [51]:
df = makeDateRand()
for idx, row in df.iterrows():
    # iterrows() may yield copies, so write back through df.loc rather than through row
    df.loc[idx, 'A'] = 2
    df.loc[idx, 'B'] = 3
    df.loc[idx, 'C'] = 4
print(df.head())

            A  B  C         D
2013-01-01  2  3  4 -1.025116
2013-01-02  2  3  4  0.599774
2013-01-03  2  3  4  1.535912
2013-01-04  2  3  4 -0.490835
2013-01-05  2  3  4  0.947673


###[Selection by Position](http://pandas.pydata.org/pandas-docs/version/0.15.2/indexing.html#indexing-integer)

In [52]:
df.iloc[3]

A    2.000000
B    3.000000
C    4.000000
D   -0.490835
Name: 2013-01-04 00:00:00, dtype: float64

In [53]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,2,3
2013-01-05,2,3


In [54]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,2,4
2013-01-03,2,4
2013-01-05,2,4


In [55]:
#slicing rows
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,2,3,4,0.599774
2013-01-03,2,3,4,1.535912


In [56]:
#slicing columns
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,3,4
2013-01-02,3,4
2013-01-03,3,4
2013-01-04,3,4
2013-01-05,3,4
2013-01-06,3,4


In [57]:
df.iloc[1,1]

3.0

###Boolean indexing and filtering

In [58]:
#filter rows by a condition on a single column
df = makeDateRand()
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-03,1.016892,1.299084,0.375407,0.856304
2013-01-04,0.299379,0.19384,-0.30905,1.034682


In [59]:
#filter rows by conditions on multiple columns
df2 = df[(df.A>0) & (df.B>0)]
df2

Unnamed: 0,A,B,C,D
2013-01-03,1.016892,1.299084,0.375407,0.856304
2013-01-04,0.299379,0.19384,-0.30905,1.034682


In [60]:
#the filter is a boolean pandas Series, which can be stored, combined and inspected
filt = (df.A>0) & (df.B>0)
print(type(filt), filt)
print('filt.any() = {}'.format(filt.any()))
print('filt.all() = {}'.format(filt.all()))

(<class 'pandas.core.series.Series'>, 2013-01-01    False
2013-01-02    False
2013-01-03     True
2013-01-04     True
2013-01-05    False
2013-01-06    False
Freq: D, dtype: bool)
filt.any() = True
filt.all() = False


In [61]:
#filter by element
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,0.03039,0.363971,0.357064
2013-01-02,,,0.53335,
2013-01-03,1.016892,1.299084,0.375407,0.856304
2013-01-04,0.299379,0.19384,,1.034682
2013-01-05,,0.044256,,
2013-01-06,,,,


In [62]:
#isin filtering
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print(df2)
df2[df2['E'].isin(['two','four'])]

                   A         B         C         D      E
2013-01-01 -1.301610  0.030390  0.363971  0.357064    one
2013-01-02 -0.939748 -0.449733  0.533350 -0.055531    one
2013-01-03  1.016892  1.299084  0.375407  0.856304    two
2013-01-04  0.299379  0.193840 -0.309050  1.034682  three
2013-01-05 -0.668822  0.044256 -1.285183 -1.015692   four
2013-01-06 -0.045665 -0.487452 -0.767566 -0.724484  three


Unnamed: 0,A,B,C,D,E
2013-01-03,1.016892,1.299084,0.375407,0.856304,two
2013-01-05,-0.668822,0.044256,-1.285183,-1.015692,four


In [63]:
#get unique values in a column
df = makefoobar()
print(df)
df.B.unique()

     A      B         C         D
0  foo    one  0.733808 -1.405718
1  bar    one -1.279761  2.360999
2  foo    two  0.627262  1.257645
3  bar  three  0.428243  0.256072
4  foo    two -0.152718 -1.007474
5  bar    two -0.986388  0.432497
6  foo    one -0.071638 -1.284469
7  foo  three -0.175485  0.748959


array(['one', 'two', 'three'], dtype=object)

###Setting data

Setting a new column automatically aligns the data by the indexes

In [64]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
df2['F'] = s1
df2

Unnamed: 0,A,B,C,D,E,F
2013-01-01,-1.30161,0.03039,0.363971,0.357064,one,
2013-01-02,-0.939748,-0.449733,0.53335,-0.055531,one,1.0
2013-01-03,1.016892,1.299084,0.375407,0.856304,two,2.0
2013-01-04,0.299379,0.19384,-0.30905,1.034682,three,3.0
2013-01-05,-0.668822,0.044256,-1.285183,-1.015692,four,4.0
2013-01-06,-0.045665,-0.487452,-0.767566,-0.724484,three,5.0


In [65]:
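# Setting values by label. Here df is the integer-indexed frame from makefoobar(),
# so the label dates[0] does not exist yet and a new row is added (see the output)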
df.at[dates[0],'A'] = 0
df

Unnamed: 0,A,B,C,D
0,foo,one,0.733808,-1.405718
1,bar,one,-1.279761,2.360999
2,foo,two,0.627262,1.257645
3,bar,three,0.428243,0.256072
4,foo,two,-0.152718,-1.007474
5,bar,two,-0.986388,0.432497
6,foo,one,-0.071638,-1.284469
7,foo,three,-0.175485,0.748959
2013-01-01 00:00:00,0,,,


In [66]:
# Setting values by position
df.iat[0,1] = 7
df

Unnamed: 0,A,B,C,D
0,foo,7,0.733808,-1.405718
1,bar,one,-1.279761,2.360999
2,foo,two,0.627262,1.257645
3,bar,three,0.428243,0.256072
4,foo,two,-0.152718,-1.007474
5,bar,two,-0.986388,0.432497
6,foo,one,-0.071638,-1.284469
7,foo,three,-0.175485,0.748959
2013-01-01 00:00:00,0,,,


In [67]:
# Setting by assigning with a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D
0,foo,7,0.733808,5
1,bar,one,-1.279761,5
2,foo,two,0.627262,5
3,bar,three,0.428243,5
4,foo,two,-0.152718,5
5,bar,two,-0.986388,5
6,foo,one,-0.071638,5
7,foo,three,-0.175485,5
2013-01-01 00:00:00,0,,,5


In [69]:
# A where operation with setting.
df = makeDateRand()
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D
2013-01-01,-0.897777,-0.805169,-1.041925,-0.673602
2013-01-02,-0.149888,-1.637216,-0.005448,-0.33704
2013-01-03,-0.16127,-0.866962,-0.164275,-1.657327
2013-01-04,-0.78688,-0.800631,-0.698673,-0.084176
2013-01-05,-1.877959,-0.293622,-1.013078,-0.862554
2013-01-06,-0.864802,-1.254154,-0.277437,-1.526444


##Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the [Missing Data section](http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html#missing-data)
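
A small sketch of what "not included in computations" means in practice, using an illustrative three-element Series (an assumption, not data from the notebook): reductions skip NaN unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

skipped = s.mean()                  # NaN is ignored: (1 + 3) / 2
propagated = s.mean(skipna=False)   # NaN propagates into the result
total = s.sum()                     # NaN is ignored in the sum as well
non_null = s.count()                # counts only non-missing values

print(skipped, propagated, total, non_null)
```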

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [70]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.897777,-0.805169,-1.041925,-0.673602,1.0
2013-01-02,0.149888,-1.637216,0.005448,-0.33704,1.0
2013-01-03,0.16127,0.866962,0.164275,-1.657327,
2013-01-04,-0.78688,0.800631,0.698673,-0.084176,


To drop any rows that have missing data.

In [71]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.897777,-0.805169,-1.041925,-0.673602,1
2013-01-02,0.149888,-1.637216,0.005448,-0.33704,1


Filling missing data

In [72]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.897777,-0.805169,-1.041925,-0.673602,1
2013-01-02,0.149888,-1.637216,0.005448,-0.33704,1
2013-01-03,0.16127,0.866962,0.164275,-1.657327,5
2013-01-04,-0.78688,0.800631,0.698673,-0.084176,5


To get the boolean mask where values are nan

In [73]:
pd.isnull(df1)

Unnamed: 0,A,B,C,D,E
2013-01-01,False,False,False,False,False
2013-01-02,False,False,False,False,False
2013-01-03,False,False,False,False,True
2013-01-04,False,False,False,False,True


##Operations
###Binary operations

See the Basic section on [Binary Ops](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-binop)  

Operations in general exclude missing data.

In [74]:
df.mean()

A    0.228210
B   -0.289221
C    0.186164
D   -0.569339
dtype: float64

In [75]:
#along the other axis
df.mean(1)

2013-01-01   -0.854618
2013-01-02   -0.454730
2013-01-03   -0.116205
2013-01-04    0.157062
2013-01-05    1.011803
2013-01-06   -0.409590
Freq: D, dtype: float64

###Applying functions to the data

When using apply(), note the axis direction.   
- axis=0 (default): the function is applied to each column, working down the rows.
- axis=1: the function is applied to each row, working across the columns.

In [76]:
df = makegridDF()
print(df)
print(df.apply(np.cumsum))
print(df.apply(np.cumsum, axis=0))
print(df.apply(np.cumsum, axis=1))

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A   B   C   D
0  1   4   7  10
1  3   9  15  21
2  6  15  24  33
   A  B   C   D
0  1  5  12  22
1  2  7  15  26
2  3  9  18  30


In [77]:
df = makeDateRand()
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
2013-01-01,-1.145105,0.331087,0.155704,-2.416678
2013-01-02,-0.498978,0.367483,1.672925,-2.691722
2013-01-03,-1.732635,0.569565,0.701772,-3.723195
2013-01-04,-1.679595,0.465376,-0.541017,-2.787303
2013-01-05,-2.318057,1.99308,-0.015948,-0.881321
2013-01-06,-2.915722,1.068827,0.550851,-0.153618


In [78]:
df.apply(lambda x: x.max() - x.min())

A    1.879783
B    2.451957
C    2.760010
D    4.322659
dtype: float64

In [79]:
from datetime import datetime
df = makeDateRand()
df.index.name = 'Date'
df.reset_index(level=0,inplace=True)
print(df)
#convert to string format
df.Date = df.Date.apply(lambda d: ' '.join(d.isoformat().split('T')))
print(df)
#convert back to datetime format
df.Date = df.Date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d %H:%M:%S"))
print(df)

df.index = df.Date
print(df)

        Date         A         B         C         D
0 2013-01-01  1.355881 -1.111720  0.159182  2.556427
1 2013-01-02  0.825970  0.262434  0.076754 -0.997311
2 2013-01-03  0.329492  0.649276 -1.042162  0.810665
3 2013-01-04  0.787017 -0.619211 -1.693874 -0.431582
4 2013-01-05 -0.069305 -0.470965  0.172827  1.317490
5 2013-01-06 -1.279747  1.037582 -1.237433 -0.687833
                  Date         A         B         C         D
0  2013-01-01 00:00:00  1.355881 -1.111720  0.159182  2.556427
1  2013-01-02 00:00:00  0.825970  0.262434  0.076754 -0.997311
2  2013-01-03 00:00:00  0.329492  0.649276 -1.042162  0.810665
3  2013-01-04 00:00:00  0.787017 -0.619211 -1.693874 -0.431582
4  2013-01-05 00:00:00 -0.069305 -0.470965  0.172827  1.317490
5  2013-01-06 00:00:00 -1.279747  1.037582 -1.237433 -0.687833
        Date         A         B         C         D
0 2013-01-01  1.355881 -1.111720  0.159182  2.556427
1 2013-01-02  0.825970  0.262434  0.076754 -0.997311
2 2013-01-03  0.329492  0.649

###Histograms

[Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-discretization)

In [80]:
s = pd.Series(np.random.randint(0,7,size=10))
s.value_counts()

2    4
1    3
6    1
3    1
0    1
dtype: int64
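
`value_counts` covers the histogram half of the linked section. For the discretization half, a minimal sketch using `pd.cut` follows; the ages, bin edges and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data and bin edges, chosen only for illustration
ages = pd.Series([3, 17, 25, 42, 68])
bins = [0, 18, 65, 100]
labels = ['child', 'adult', 'senior']

# Each value is assigned to the half-open interval (left, right] it falls in
groups = pd.cut(ages, bins=bins, labels=labels)
counts = groups.value_counts()

print(groups.tolist())
print(counts)
```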

### Strings
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses [regular expressions](https://docs.python.org/2/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/version/0.15.2/text.html#text-string-methods).

In [81]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
s.str.upper()
s.str.len()

0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64

In [82]:
# Methods like split return a Series of lists:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

In [83]:
s2.str.split('_').str[1]

0      b
1      d
2    NaN
3      g
dtype: object

In [84]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[1]

0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object

You can use [] notation to directly index by position locations. If you index past the end of the string, the result will be a NaN.

In [85]:
# Easy to expand this to return a DataFrame
s2.str.split('_').apply(pd.Series)

Unnamed: 0,0,1,2
0,a,b,c
1,c,d,e
2,,,
3,f,g,h


Methods like replace and findall take regular expressions, too:

In [86]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object
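
The cell above shows `replace`. As a companion sketch for `findall`, plus `contains` for building a boolean mask, the example below uses a shortened illustrative Series rather than `s3`.

```python
import numpy as np
import pandas as pd

# Shortened illustrative Series (not s3 from the notebook)
s = pd.Series(['Aaba', 'Baca', np.nan, 'dog'])

# contains() treats the pattern as a regex by default; na=False maps NaN to False
second_char_is_a = s.str.contains('^.a', na=False)

# findall() returns, per element, a list of all regex matches
lowercase_a = s.str.findall('a')

print(second_char_is_a.tolist())
print(lowercase_a.tolist())
```

Passing `na=False` keeps the mask strictly boolean, so it can be used directly for row filtering.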

###Merge and concat

In [87]:
# Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10, 4))
print(df)
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)

          0         1         2         3
0  0.664435  0.519529  0.517068  0.112110
1  1.129895 -0.825383  0.023337  0.245131
2 -0.069950  0.190675  1.921309  0.911999
3 -2.027998  1.359977  0.398001 -0.847109
4  1.324686  0.182335 -0.237303 -0.080617
5 -0.562008  1.113831  1.022310  0.009409
6 -1.248402 -0.823899  0.873292  1.389289
7  1.508891 -0.258467  1.206670  0.786257
8 -0.223412 -0.216831 -0.153479  0.381530
9 -0.134774  0.993819  0.772913  0.360929


Unnamed: 0,0,1,2,3
0,0.664435,0.519529,0.517068,0.11211
1,1.129895,-0.825383,0.023337,0.245131
2,-0.06995,0.190675,1.921309,0.911999
3,-2.027998,1.359977,0.398001,-0.847109
4,1.324686,0.182335,-0.237303,-0.080617
5,-0.562008,1.113831,1.02231,0.009409
6,-1.248402,-0.823899,0.873292,1.389289
7,1.508891,-0.258467,1.20667,0.786257
8,-0.223412,-0.216831,-0.153479,0.38153
9,-0.134774,0.993819,0.772913,0.360929


In [88]:
# Repeat some rows in a data frame
df2 = pd.concat([df,df[2:4]])
df2

Unnamed: 0,0,1,2,3
0,0.664435,0.519529,0.517068,0.11211
1,1.129895,-0.825383,0.023337,0.245131
2,-0.06995,0.190675,1.921309,0.911999
3,-2.027998,1.359977,0.398001,-0.847109
4,1.324686,0.182335,-0.237303,-0.080617
5,-0.562008,1.113831,1.02231,0.009409
6,-1.248402,-0.823899,0.873292,1.389289
7,1.508891,-0.258467,1.20667,0.786257
8,-0.223412,-0.216831,-0.153479,0.38153
9,-0.134774,0.993819,0.772913,0.360929


SQL style merges. See the [Database style joining](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-join)

In [89]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print(left)
print(right)
pd.merge(left, right, on='key')

   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5


Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5
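
The merge above is an inner join, which is the default. A brief sketch of the other common join types, using two small frames whose keys only partly overlap (the frames are assumptions for illustration):

```python
import pandas as pd

# Illustrative frames: only 'foo' appears in both
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')                  # keys present in both
left_join = pd.merge(left, right, on='key', how='left')  # all keys from left
outer = pd.merge(left, right, on='key', how='outer')     # keys from either side

print(inner)
print(left_join)
print(outer)
```

Rows without a match on the other side get NaN in the columns that come from that side.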


Append rows to a dataframe. See the [Appending](http://pandas.pydata.org/pandas-docs/version/0.15.2/merging.html#merging-concatenation)

In [90]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
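# DataFrame.append was removed in pandas 2.0; turn the row into a one-row frame and concat
pd.concat([df, s.to_frame().T], ignore_index=True)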

Unnamed: 0,A,B,C,D
0,-2.805576,-1.582596,-0.754479,-1.64059
1,-1.061685,-1.644093,-1.559948,0.512891
2,-0.030025,0.334269,-0.525006,0.27334
3,-0.945314,-1.485782,-0.053811,-0.47654
4,0.689826,0.514408,-0.74308,-1.548345
5,0.504614,-0.781237,-1.071299,0.265962
6,-0.067631,0.763293,0.645783,0.764308
7,1.207576,-0.273824,-2.017735,0.484992
8,-0.945314,-1.485782,-0.053811,-0.47654


###Grouping
By “group by” we are referring to a process involving one or more of the following steps:  

- Splitting the data into groups based on some criteria  
- Applying a function to each group independently    
- Combining the results into a data structure   

[grouping](http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html#groupby)
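
The split, apply and combine steps above can be sketched with the GroupBy methods listed in the table at the top of this sheet. The four-row frame is an assumption chosen so the results are easy to verify by hand.

```python
import pandas as pd

# Illustrative frame: two groups, 'foo' and 'bar'
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

gb = df.groupby('A')                           # split

# agg: one row per group, one column per function
stats = gb['C'].agg(['sum', 'mean'])

# transform: result keeps the original index (here: subtract each group's mean)
centered = gb['C'].transform(lambda x: x - x.mean())

# filter: keep only the rows of groups that satisfy a condition
big_groups = gb.filter(lambda g: g['C'].sum() > 5)

print(stats)
print(centered.tolist())
print(big_groups)
print(sorted(gb.groups))
```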

In [91]:
df = makefoobar()
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.486154,0.14193
1,bar,one,1.067696,-1.123935
2,foo,two,1.771727,0.507768
3,bar,three,-0.221276,-0.208124
4,foo,two,-0.962044,-1.02512
5,bar,two,-0.69092,1.225117
6,foo,one,1.492595,-0.989528
7,foo,three,0.051717,-1.966667


Grouping and then applying the function sum to the resulting groups.

In [92]:
df.groupby('A')[['C', 'D']].sum()  # select the numeric columns so string column B is not summed

|A|C|D|
|--|--|--|
|bar|0.1555|-0.106942|
|foo|1.86784|-3.331615|


In [93]:
df.groupby(['A','B']).sum()

|A|B|C|D|
|--|--|--|--|
|bar|one|1.067696|-1.123935|
|bar|three|-0.221276|-0.208124|
|bar|two|-0.69092|1.225117|
|foo|one|1.00644|-0.847597|
|foo|three|0.051717|-1.966667|
|foo|two|0.809683|-0.517351|


In [94]:
df = makefoobar()
print(df)
cnts = {}
#first value is the group key, second value is the DataFrame of rows in that group
for grp, grp_data in df.groupby("B"):
    cnts[grp] = grp_data.C.mean()  
cnts

     A      B         C         D
0  foo    one -0.470987  0.475548
1  bar    one  1.349505 -1.590926
2  foo    two  0.032610  2.077090
3  bar  three -0.675237  0.163221
4  foo    two  0.711420 -0.091923
5  bar    two  0.835245 -0.560896
6  foo    one -0.605763 -1.114480
7  foo  three  0.426656 -0.797827


{'one': 0.090918058249253475,
 'three': -0.12429021559913889,
 'two': 0.52642482895100096}

###Reshaping

[Hierarchical Indexing](http://pandas.pydata.org/pandas-docs/version/0.15.2/advanced.html#advanced-hierarchical) and [Reshaping](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-stacking).

In [95]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                    'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                    'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

|first|second|A|B|
|--|--|--|--|
|bar|one|1.627064|2.362803|
|bar|two|2.99719|0.531766|
|baz|one|2.449233|0.381258|
|baz|two|-0.07392|-1.049333|


The stack function “compresses” a level in the DataFrame’s columns.

In [96]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    1.627064
               B    2.362803
       two     A    2.997190
               B    0.531766
baz    one     A    2.449233
               B    0.381258
       two     A   -0.073920
               B   -1.049333
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [97]:
stacked.unstack()

first,second,A,B
bar,one,1.627064,2.362803
bar,two,2.99719,0.531766
baz,one,2.449233,0.381258
baz,two,-0.07392,-1.049333


In [98]:
stacked.unstack(1)

,second,one,two
first,,,
bar,A,1.627064,2.99719
bar,B,2.362803,0.531766
baz,A,2.449233,-0.07392
baz,B,0.381258,-1.049333


In [99]:
stacked.unstack(0)

,first,bar,baz
second,,,
one,A,1.627064,2.449233
one,B,2.362803,0.381258
two,A,2.99719,-0.07392
two,B,0.531766,-1.049333
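
As a quick check of the stack/unstack relationship, the sketch below rebuilds a frame shaped like the one above, with fixed values instead of random ones (`df_demo` is a hypothetical name). It shows that unstack undoes stack, and that a level can be given by name as well as by position.

```python
import numpy as np
import pandas as pd

# Two-level row index named like the example above.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df_demo = pd.DataFrame(
    np.arange(8.0).reshape(4, 2), index=index, columns=["A", "B"]
)

# stack moves the column labels A/B into a third, innermost index level.
stacked = df_demo.stack()

# unstack with no argument pulls that innermost level back out into columns.
round_trip = stacked.unstack()

# A level can be selected by name or by position; both give the same frame.
by_name = stacked.unstack("second")
by_position = stacked.unstack(1)

print(stacked.index.nlevels)
print(round_trip.equals(df_demo))
print(by_name)
```

Referring to levels by name tends to be more readable than positions once an index has more than two levels.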


###Pivot tables
See the section on [Pivot Tables](http://pandas.pydata.org/pandas-docs/version/0.15.2/reshaping.html#reshaping-pivot).

In [100]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

,A,B,C,D,E
0,one,A,foo,1.359003,-1.365539
1,one,B,foo,0.376525,-0.434494
2,two,C,foo,0.458552,-1.133055
3,three,A,bar,0.367929,-1.528746
4,one,B,bar,-1.295456,0.069575
5,one,C,bar,1.487755,-1.079916
6,two,A,foo,-0.862349,1.043035
7,three,B,foo,-1.109495,-0.013139
8,one,C,foo,0.705293,-1.531746
9,one,A,bar,1.241997,1.028495


In [101]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

,C,bar,foo
A,B,,
one,A,1.241997,1.359003
one,B,-1.295456,0.376525
one,C,1.487755,0.705293
three,A,0.367929,
three,B,,-1.109495
three,C,-1.138588,
two,A,,-0.862349
two,B,-1.605521,
two,C,,0.458552
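
The empty cells in the table above are combinations of A, B and C that never occur in the data. When several rows share a combination, `pivot_table` aggregates them, using the mean by default. A minimal sketch on a hypothetical frame (`demo`, with made-up values) shows both the default and an explicit `aggfunc` with `fill_value`:

```python
import pandas as pd

# Hypothetical data: ("one", "foo") occurs twice, ("two", "bar") never.
demo = pd.DataFrame({
    "A": ["one", "one", "one", "two"],
    "C": ["foo", "foo", "bar", "foo"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation is the mean; missing combinations come out as NaN.
mean_table = pd.pivot_table(demo, values="D", index="A", columns="C")

# Sum instead of mean, and put 0 where a combination has no rows.
sum_table = pd.pivot_table(
    demo, values="D", index="A", columns="C", aggfunc="sum", fill_value=0
)

print(mean_table)
print(sum_table)
```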


##Categoricals
See the [categorical introduction](http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#categorical) and the [API documentation](http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html#api-categorical).

In [102]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a < b < e]

Rename the categories to more meaningful names (assigning to Series.cat.categories is in place!). Then reorder the categories and simultaneously add the missing ones (methods under Series.cat return a new Series by default).

In [103]:
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]

In [104]:
# Sorting is per order in the categories, not lexical order.
df.sort("grade")

,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good


In [105]:
# Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()

grade
very bad      1
bad         NaN
medium      NaN
good          2
very good     3
dtype: float64
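
The cells above were run on pandas 0.15.2. In later releases, as far as I know, `df.sort` was replaced by `sort_values` and assigning to `Series.cat.categories` is no longer supported. The sketch below repeats the same steps under the assumption of a recent pandas version; `df_demo` is a hypothetical name, and `observed=False` is passed explicitly so that empty categories stay in the result.

```python
import pandas as pd

df_demo = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                        "raw_grade": ["a", "b", "b", "a", "a", "e"]})
grade = df_demo["raw_grade"].astype("category")

# rename_categories returns a new Series; names map to a, b, e in order.
grade = grade.cat.rename_categories(["very good", "good", "very bad"])

# Reorder and add the missing categories in one step.
grade = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
df_demo["grade"] = grade

# Sorting follows the category order, not alphabetical order.
ordered = df_demo.sort_values("grade")

# observed=False keeps categories that have no rows (assumed to need a
# reasonably recent pandas); their size is reported as 0.
sizes = df_demo.groupby("grade", observed=False).size()

print(ordered)
print(sizes)
```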

##Comparing and Gotchas
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#basics-compare>  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/basics.html#boolean-reductions>   

pandas follows the numpy convention of raising an error when you try to convert a Series or DataFrame to a single bool. This happens in an if statement or when using the boolean operators and, or, and not.  
<http://pandas.pydata.org/pandas-docs/version/0.15.2/gotchas.html#gotchas>
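
A minimal sketch of the gotcha and the usual ways around it (the Series `s` and `other` are made-up examples): reduce explicitly with `any()`, `all()` or `empty`, and combine element-wise with `&`, `|` and `~` instead of and, or and not.

```python
import pandas as pd

s = pd.Series([True, False, True])
other = pd.Series([False, False, True])

# Using a Series directly in an if statement raises ValueError,
# because pandas cannot decide which single truth value is meant.
try:
    if s:
        pass
except ValueError as err:
    message = str(err)

# State the intended reduction explicitly.
has_any = s.any()    # at least one element is True
all_true = s.all()   # every element is True
is_empty = s.empty   # the Series has no elements

# Element-wise logic uses &, | and ~ rather than and, or, not.
mask = s & ~other

print(message)
print(has_any, all_true, is_empty)
print(mask.tolist())
```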


##Copies and no copies

## Python and [module versions, and dates](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb)

In [106]:
%load_ext version_information
%version_information pandas, numpy, scipy, matplotlib, pyradi

Software,Version
Python,2.7.8 32bit [MSC v.1500 32 bit (Intel)]
IPython,3.0.0
OS,Windows 7 6.1.7601 SP1
pandas,0.15.2
numpy,1.9.2
scipy,0.15.1
matplotlib,1.4.3
pyradi,0.1.55
Thu Apr 09 16:34:56 2015 South Africa Standard Time
