# DataFrames

DataFrames are the workhorse data structure of pandas. A DataFrame is a collection of Series objects put together to share the same index.

In [1]:
import pandas as pd
import numpy as np

In [2]:
from numpy.random import randn
np.random.seed(20)

In [6]:
'A B C D E'.split()

['A', 'B', 'C', 'D', 'E']

In [3]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
# str.split() breaks a string on whitespace and returns a list of the pieces.

In [4]:
df

Unnamed: 0,W,X,Y,Z
A,0.883893,0.195865,0.357537,-2.343262
B,-1.084833,0.559696,0.939469,-0.978481
C,0.503097,0.406414,0.323461,-0.493411
D,-0.792017,-0.842368,-1.279503,0.245715
E,-0.044195,1.567633,1.051109,0.406368
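To make the "Series sharing an index" idea concrete, here is a minimal sketch that builds a small DataFrame from two Series. The names `w`, `x`, and `frame` and the rounded values are illustrative only; they are not part of the notebook above.

```python
import pandas as pd

# Two Series with the same labels; values are illustrative only.
w = pd.Series([0.88, -1.08, 0.50], index=["A", "B", "C"], name="W")
x = pd.Series([0.20, 0.56, 0.41], index=["A", "B", "C"], name="X")

# Each Series becomes a column; rows are aligned on the shared index.
frame = pd.DataFrame({"W": w, "X": x})

print(frame)
print(frame.index.equals(w.index))  # the DataFrame carries the same index labels
```

Because alignment is done on labels, Series whose indexes only partly overlap would still combine, with missing values filled in as NaN.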


## Selection and Indexing

How do we isolate data from within a DataFrame?

In [7]:
df['W']

A    0.883893
B   -1.084833
C    0.503097
D   -0.792017
E   -0.044195
Name: W, dtype: float64

In [8]:
# Pass a list of column names
df[['W','Z']]

Unnamed: 0,W,Z
A,0.883893,-2.343262
B,-1.084833,-0.978481
C,0.503097,-0.493411
D,-0.792017,0.245715
E,-0.044195,0.406368


DataFrame columns are just Series:

In [9]:
type(df['W'])

pandas.core.series.Series
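One consequence worth seeing side by side: a single label in the brackets returns a Series, while a list of labels returns a DataFrame, even when the list holds only one name. The small frame below is a stand-in with made-up values, not the notebook's `df`.

```python
import pandas as pd

# Stand-in frame with illustrative values.
df = pd.DataFrame({"W": [0.88, -1.08], "X": [0.20, 0.56]}, index=["A", "B"])

col_as_series = df["W"]    # single label -> Series
col_as_frame = df[["W"]]   # list with one label -> one-column DataFrame

print(type(col_as_series).__name__)
print(type(col_as_frame).__name__)
```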

**Creating a new column:**

In [23]:
df['new'] = df['W'] + df['Y']

In [24]:
df

Unnamed: 0,W,X,Y,Z,new
A,0.883893,0.195865,0.357537,-2.343262,1.24143
B,-1.084833,0.559696,0.939469,-0.978481,-0.145363
C,0.503097,0.406414,0.323461,-0.493411,0.826558
D,-0.792017,-0.842368,-1.279503,0.245715,-2.071519
E,-0.044195,1.567633,1.051109,0.406368,1.006914


**Removing columns:**

In [25]:
df.drop('new', axis=1)

Unnamed: 0,W,X,Y,Z
A,0.883893,0.195865,0.357537,-2.343262
B,-1.084833,0.559696,0.939469,-0.978481
C,0.503097,0.406414,0.323461,-0.493411
D,-0.792017,-0.842368,-1.279503,0.245715
E,-0.044195,1.567633,1.051109,0.406368


In [26]:
# Not inplace unless specified!
df

Unnamed: 0,W,X,Y,Z,new
A,0.883893,0.195865,0.357537,-2.343262,1.24143
B,-1.084833,0.559696,0.939469,-0.978481,-0.145363
C,0.503097,0.406414,0.323461,-0.493411,0.826558
D,-0.792017,-0.842368,-1.279503,0.245715,-2.071519
E,-0.044195,1.567633,1.051109,0.406368,1.006914


In [14]:
df.drop('new',axis=1,inplace=True)

In [15]:
df

Unnamed: 0,W,X,Y,Z
A,0.883893,0.195865,0.357537,-2.343262
B,-1.084833,0.559696,0.939469,-0.978481
C,0.503097,0.406414,0.323461,-0.493411
D,-0.792017,-0.842368,-1.279503,0.245715
E,-0.044195,1.567633,1.051109,0.406368
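As an alternative to `inplace=True`, the result of `drop` can simply be assigned back to the same name. The sketch below uses a small stand-in frame with made-up values; `columns="new"` is the keyword form of passing `'new'` with `axis=1`.

```python
import pandas as pd

# Stand-in frame with illustrative values.
df = pd.DataFrame({"W": [0.88, -1.08], "Y": [0.36, 0.94]}, index=["A", "B"])
df["new"] = df["W"] + df["Y"]

# drop() returns a new DataFrame; rebinding the name keeps the result.
df = df.drop(columns="new")

print(df.columns.tolist())
```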


You can also drop rows this way:

In [16]:
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,0.883893,0.195865,0.357537,-2.343262
B,-1.084833,0.559696,0.939469,-0.978481
C,0.503097,0.406414,0.323461,-0.493411
D,-0.792017,-0.842368,-1.279503,0.245715


**Selecting rows:**

In [17]:
df.loc['A']

W    0.883893
X    0.195865
Y    0.357537
Z   -2.343262
Name: A, dtype: float64

Or select by position instead of label:

In [18]:
df.iloc[2]

W    0.503097
X    0.406414
Y    0.323461
Z   -0.493411
Name: C, dtype: float64

**Selecting a subset of rows and columns:**

In [19]:
df.loc['B','Y']

0.9394693498560777

In [27]:
df.loc[['A','B'],['W','Y']]

Unnamed: 0,W,Y
A,0.883893,0.357537
B,-1.084833,0.939469
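Both `loc` and `iloc` also accept slices, with one difference that is easy to trip over: label slices include the end label, while positional slices exclude the stop position. The sketch below rebuilds the frame from the rounded values printed above, since the notebook itself generates them with `randn`. The names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

# Values copied from the table printed above.
df = pd.DataFrame(
    [[0.883893, 0.195865, 0.357537, -2.343262],
     [-1.084833, 0.559696, 0.939469, -0.978481],
     [0.503097, 0.406414, 0.323461, -0.493411],
     [-0.792017, -0.842368, -1.279503, 0.245715],
     [-0.044195, 1.567633, 1.051109, 0.406368]],
    index="A B C D E".split(),
    columns="W X Y Z".split(),
)

by_label = df.loc["A":"C", "W":"Y"]   # label slices include both endpoints
by_position = df.iloc[0:3, 0:3]       # positional slices exclude the stop

print(by_label.equals(by_position))     # same 3x3 block either way
print(df.iloc[1, 2], df.loc["B", "Y"])  # same cell by position and by label
```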


In [None]:
# Exercise: generate a DataFrame containing only the rows where W > 0 and Y > 1


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, much like NumPy:

In [28]:
df

Unnamed: 0,W,X,Y,Z,new
A,0.883893,0.195865,0.357537,-2.343262,1.24143
B,-1.084833,0.559696,0.939469,-0.978481,-0.145363
C,0.503097,0.406414,0.323461,-0.493411,0.826558
D,-0.792017,-0.842368,-1.279503,0.245715,-2.071519
E,-0.044195,1.567633,1.051109,0.406368,1.006914


In [29]:
df>0

Unnamed: 0,W,X,Y,Z,new
A,True,True,True,False,True
B,False,True,True,False,False
C,True,True,True,False,True
D,False,False,False,True,False
E,False,True,True,True,True


In [30]:
df[df>0]

Unnamed: 0,W,X,Y,Z,new
A,0.883893,0.195865,0.357537,,1.24143
B,,0.559696,0.939469,,
C,0.503097,0.406414,0.323461,,0.826558
D,,,,0.245715,
E,,1.567633,1.051109,0.406368,1.006914
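Masking the whole frame leaves NaN wherever the condition fails. A condition built from a single column instead gives a boolean Series, and indexing with it keeps or drops entire rows. The sketch below rebuilds the W to Z columns from the rounded values printed earlier. The result names are illustrative.

```python
import pandas as pd

# Values copied from the table printed earlier in this notebook.
df = pd.DataFrame(
    [[0.883893, 0.195865, 0.357537, -2.343262],
     [-1.084833, 0.559696, 0.939469, -0.978481],
     [0.503097, 0.406414, 0.323461, -0.493411],
     [-0.792017, -0.842368, -1.279503, 0.245715],
     [-0.044195, 1.567633, 1.051109, 0.406368]],
    index="A B C D E".split(),
    columns="W X Y Z".split(),
)

# Boolean Series from one column -> whole rows are kept or dropped, no NaN.
positive_w = df[df["W"] > 0]

# The same row filter combined with a column selection in one loc call.
y_where_w_positive = df.loc[df["W"] > 0, "Y"]

print(positive_w)
print(y_where_w_positive)
```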


For two conditions, combine them with | (or) and & (and), wrapping each condition in parentheses.
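A minimal sketch of combining two conditions, again using the rounded values printed earlier. The parentheses matter because `&` and `|` bind more tightly than the comparison operators, and Python's plain `and` / `or` do not work element-wise on a Series. The thresholds and result names here are chosen only for illustration.

```python
import pandas as pd

# Values copied from the table printed earlier in this notebook.
df = pd.DataFrame(
    [[0.883893, 0.195865, 0.357537, -2.343262],
     [-1.084833, 0.559696, 0.939469, -0.978481],
     [0.503097, 0.406414, 0.323461, -0.493411],
     [-0.792017, -0.842368, -1.279503, 0.245715],
     [-0.044195, 1.567633, 1.051109, 0.406368]],
    index="A B C D E".split(),
    columns="W X Y Z".split(),
)

# Rows where both conditions hold (element-wise AND).
both = df[(df["W"] > 0) & (df["X"] > 0.3)]

# Rows where at least one condition holds (element-wise OR).
either = df[(df["W"] > 0) | (df["Y"] > 1)]

print(both.index.tolist())    # ['C']
print(either.index.tolist())  # ['A', 'C', 'E']
```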

## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [54]:
df

Ranking,W,X,Y,Z,new,Ranking
AA,0.883893,0.195865,0.357537,-2.343262,1.24143,AA
BB,-1.084833,0.559696,0.939469,-0.978481,-0.145363,BB
CC,0.503097,0.406414,0.323461,-0.493411,0.826558,CC
DD,-0.792017,-0.842368,-1.279503,0.245715,-2.071519,DD
EE,-0.044195,1.567633,1.051109,0.406368,1.006914,EE


In [55]:
# Reset to default 0,1...n index
df.reset_index()

ValueError: cannot insert Ranking, already exists
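The error above arises because the frame at this point has an index named Ranking and also a column named Ranking, so `reset_index()` cannot insert the index as a new column under that name. The sketch below reproduces the situation on a small stand-in frame with made-up values and shows one way around it, `drop=True`, which discards the old index rather than inserting it.

```python
import pandas as pd

labels = ["AA", "BB", "CC"]

# Stand-in frame: the index is named "Ranking" and a "Ranking" column also exists.
df = pd.DataFrame(
    {"W": [0.88, -1.08, 0.50], "Ranking": labels},
    index=pd.Index(labels, name="Ranking"),
)

# reset_index() tries to insert the index as a column called "Ranking",
# which already exists, so it raises ValueError.
try:
    df.reset_index()
    raised = False
except ValueError:
    raised = True

# drop=True throws the old index away instead of inserting it as a column.
flat = df.reset_index(drop=True)

print(raised)
print(flat)
```

Dropping the index is safe here only because the same labels are still available in the Ranking column.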

In [56]:
newind = 'AA BB CC DD EE'.split()

In [57]:
df['Ranking'] = newind

In [58]:
df

Ranking,W,X,Y,Z,new,Ranking
AA,0.883893,0.195865,0.357537,-2.343262,1.24143,AA
BB,-1.084833,0.559696,0.939469,-0.978481,-0.145363,BB
CC,0.503097,0.406414,0.323461,-0.493411,0.826558,CC
DD,-0.792017,-0.842368,-1.279503,0.245715,-2.071519,DD
EE,-0.044195,1.567633,1.051109,0.406368,1.006914,EE


In [59]:
df.set_index('Ranking')

Ranking,W,X,Y,Z,new
AA,0.883893,0.195865,0.357537,-2.343262,1.24143
BB,-1.084833,0.559696,0.939469,-0.978481,-0.145363
CC,0.503097,0.406414,0.323461,-0.493411,0.826558
DD,-0.792017,-0.842368,-1.279503,0.245715,-2.071519
EE,-0.044195,1.567633,1.051109,0.406368,1.006914


In [37]:
df

Unnamed: 0,W,X,Y,Z,new,Ranking
A,0.883893,0.195865,0.357537,-2.343262,1.24143,AA
B,-1.084833,0.559696,0.939469,-0.978481,-0.145363,BB
C,0.503097,0.406414,0.323461,-0.493411,0.826558,CC
D,-0.792017,-0.842368,-1.279503,0.245715,-2.071519,DD
E,-0.044195,1.567633,1.051109,0.406368,1.006914,EE


In [60]:
df.set_index('Ranking',inplace=True)

In [61]:
df

Ranking,W,X,Y,Z,new
AA,0.883893,0.195865,0.357537,-2.343262,1.24143
BB,-1.084833,0.559696,0.939469,-0.978481,-0.145363
CC,0.503097,0.406414,0.323461,-0.493411,0.826558
DD,-0.792017,-0.842368,-1.279503,0.245715,-2.071519
EE,-0.044195,1.567633,1.051109,0.406368,1.006914
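Like `drop`, `set_index` returns a new DataFrame unless `inplace=True` is passed, so assigning the result is an equivalent pattern. The sketch below uses a small stand-in frame with made-up values. It also shows that `reset_index()` moves the index back into a column, while the original A, B, C labels are not recovered.

```python
import pandas as pd

# Stand-in frame with illustrative values.
df = pd.DataFrame({"W": [0.88, -1.08, 0.50]}, index=["A", "B", "C"])
df["Ranking"] = "AA BB CC".split()

# set_index() returns a new DataFrame; the "Ranking" column becomes the index
# and the previous index labels (A, B, C) are replaced.
ranked = df.set_index("Ranking")

# reset_index() moves the index back into an ordinary column
# and installs a default 0..n-1 index.
restored = ranked.reset_index()

print(ranked.index.name)
print(restored.columns.tolist())
print(restored.index.tolist())
```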
