# DataFrames



In [1]:
import pandas as pd
import numpy as np
from numpy.random import randn

In [2]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [3]:
df

Unnamed: 0,W,X,Y,Z
A,0.331567,-1.136897,-1.995363,0.736534
B,0.211942,1.154459,0.678225,-0.21055
C,-0.065875,-0.362088,-1.367112,-0.650288
D,-0.872518,1.467863,-1.041451,1.567746
E,0.133526,0.77153,-0.919978,-1.487045
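Because `randn` draws fresh random numbers on every run, rerunning the cell above will not reproduce the values shown. A minimal sketch of one way to get repeatable output, assuming NumPy's `default_rng` generator is acceptable in place of `randn`; the seed `101` and the name `df_seeded` are arbitrary choices for illustration, and the numbers it produces will differ from the table above:

```python
import numpy as np
import pandas as pd

# Hypothetical seed; any fixed integer makes the draw repeatable.
rng = np.random.default_rng(101)

# Same shape and labels as df above, but drawn from a seeded generator.
df_seeded = pd.DataFrame(
    rng.standard_normal((5, 4)),
    index='A B C D E'.split(),
    columns='W X Y Z'.split(),
)
print(df_seeded.shape)
```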


## Selection and Indexing



In [4]:
df['W']

A    0.331567
B    0.211942
C   -0.065875
D   -0.872518
E    0.133526
Name: W, dtype: float64

In [5]:
# Pass a list of column names
df[['W','Z']]

Unnamed: 0,W,Z
A,0.331567,0.736534
B,0.211942,-0.21055
C,-0.065875,-0.650288
D,-0.872518,1.567746
E,0.133526,-1.487045


In [6]:
# Attribute (dot) access (NOT RECOMMENDED: a column name can clash with a DataFrame method or attribute)
df.W

A    0.331567
B    0.211942
C   -0.065875
D   -0.872518
E    0.133526
Name: W, dtype: float64
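One reason to prefer brackets over dot access is that a column whose name matches an existing DataFrame method is shadowed by that method. A small sketch of the pitfall, using a hypothetical frame `demo` with a column deliberately named `count`:

```python
import pandas as pd

# Hypothetical frame: 'count' is also the name of a DataFrame method.
demo = pd.DataFrame({'count': [1, 2, 3], 'W': [0.1, 0.2, 0.3]})

bracket_col = demo['count']  # bracket access always returns the column
attr_obj = demo.count        # dot access returns the DataFrame.count method

print(type(bracket_col).__name__)
print(callable(attr_obj))
```

Dot access also fails for column names that are not valid Python identifiers, such as names containing spaces.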

DataFrame columns are just Series:

In [7]:
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [8]:
df['new'] = df['W'] + df['Y']

In [9]:
df

Unnamed: 0,W,X,Y,Z,new
A,0.331567,-1.136897,-1.995363,0.736534,-1.663796
B,0.211942,1.154459,0.678225,-0.21055,0.890166
C,-0.065875,-0.362088,-1.367112,-0.650288,-1.432987
D,-0.872518,1.467863,-1.041451,1.567746,-1.913969
E,0.133526,0.77153,-0.919978,-1.487045,-0.786452


**Removing columns:**

In [10]:
df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,0.331567,-1.136897,-1.995363,0.736534
B,0.211942,1.154459,0.678225,-0.21055
C,-0.065875,-0.362088,-1.367112,-0.650288
D,-0.872518,1.467863,-1.041451,1.567746
E,0.133526,0.77153,-0.919978,-1.487045


In [11]:
# drop() returns a new DataFrame; df itself is unchanged unless inplace=True
df

Unnamed: 0,W,X,Y,Z,new
A,0.331567,-1.136897,-1.995363,0.736534,-1.663796
B,0.211942,1.154459,0.678225,-0.21055,0.890166
C,-0.065875,-0.362088,-1.367112,-0.650288,-1.432987
D,-0.872518,1.467863,-1.041451,1.567746,-1.913969
E,0.133526,0.77153,-0.919978,-1.487045,-0.786452


In [12]:
df.drop('new',axis=1,inplace=True)

In [13]:
df

Unnamed: 0,W,X,Y,Z
A,0.331567,-1.136897,-1.995363,0.736534
B,0.211942,1.154459,0.678225,-0.21055
C,-0.065875,-0.362088,-1.367112,-0.650288
D,-0.872518,1.467863,-1.041451,1.567746
E,0.133526,0.77153,-0.919978,-1.487045


Rows can be dropped the same way, using `axis=0`:

In [14]:
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,0.331567,-1.136897,-1.995363,0.736534
B,0.211942,1.154459,0.678225,-0.21055
C,-0.065875,-0.362088,-1.367112,-0.650288
D,-0.872518,1.467863,-1.041451,1.567746


**Selecting rows:**

In [15]:
df.loc['A']

W    0.331567
X   -1.136897
Y   -1.995363
Z    0.736534
Name: A, dtype: float64

Or select by integer position instead of label:

In [16]:
df.iloc[2]

W   -0.065875
X   -0.362088
Y   -1.367112
Z   -0.650288
Name: C, dtype: float64

**Selecting a subset of rows and columns:**

In [17]:
df.loc['B','Y']

0.6782248402300752

In [18]:
df.iloc[1:3,:]

Unnamed: 0,W,X,Y,Z
B,0.211942,1.154459,0.678225,-0.21055
C,-0.065875,-0.362088,-1.367112,-0.650288
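Both indexers also accept lists and slices, with one difference worth keeping in mind: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position. A short sketch on a hypothetical integer-valued frame `demo` with the same labels as `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same row and column labels as df.
demo = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index='A B C D E'.split(),
    columns='W X Y Z'.split(),
)

by_label = demo.loc[['A', 'B'], ['W', 'Y']]  # lists of labels
by_position = demo.iloc[0:2, [0, 2]]         # the same cells by position

label_slice = demo.loc['B':'D']    # rows B, C and D (end label included)
position_slice = demo.iloc[1:3]    # rows B and C (end position excluded)
print(by_label.equals(by_position))
```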


### Conditional Selection

An important feature of pandas is conditional selection with bracket notation, which works much like boolean masking in NumPy:

In [19]:
df

Unnamed: 0,W,X,Y,Z
A,0.331567,-1.136897,-1.995363,0.736534
B,0.211942,1.154459,0.678225,-0.21055
C,-0.065875,-0.362088,-1.367112,-0.650288
D,-0.872518,1.467863,-1.041451,1.567746
E,0.133526,0.77153,-0.919978,-1.487045


In [20]:
df>0

Unnamed: 0,W,X,Y,Z
A,True,False,False,True
B,True,True,True,False
C,False,False,False,False
D,False,True,False,True
E,True,True,False,False


In [21]:
df[df>0]

Unnamed: 0,W,X,Y,Z
A,0.331567,,,0.736534
B,0.211942,1.154459,0.678225,
C,,,,
D,,1.467863,,1.567746
E,0.133526,0.77153,,


In [22]:
df['W']>0

A     True
B     True
C    False
D    False
E     True
Name: W, dtype: bool

In [32]:
df[['W','Y']] >80

Unnamed: 0,W,Y
A,False,False
B,False,False
C,False,False
D,False,False
E,False,False
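A boolean Series such as `df['W']>0` can be passed back into the brackets to keep only the rows where it is `True`, and several conditions can be combined with `&` (and) or `|` (or), each wrapped in parentheses. A minimal sketch on a hypothetical frame `demo` whose values are loosely rounded from the `W` and `Y` columns above:

```python
import pandas as pd

# Hypothetical values, roughly rounded from the W and Y columns of df.
demo = pd.DataFrame(
    {'W': [0.3, 0.2, -0.1, -0.9, 0.1], 'Y': [-2.0, 0.7, -1.4, -1.0, -0.9]},
    index='A B C D E'.split(),
)

# Keep only the rows where W is positive.
positive_w = demo[demo['W'] > 0]

# Combine conditions with & and |; Python's `and`/`or` do not work on Series.
both = demo[(demo['W'] > 0) & (demo['Y'] > 0)]
either = demo[(demo['W'] > 0) | (demo['Y'] > 0)]
print(list(positive_w.index), list(both.index))
```

The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python.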
