---
title: "DataFrames"
---

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a collection of Series objects that share the same index. Let's explore this topic.
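
To make the "Series sharing one index" idea concrete, here is a minimal sketch. The names `height`, `weight`, and `people` and their values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical example data)
height = pd.Series([1.6, 1.8, 1.7], index=['a', 'b', 'c'])
weight = pd.Series([70.0, 82.0], index=['b', 'c'])

# Each Series becomes a column; rows are aligned on the combined index
people = pd.DataFrame({'height': height, 'weight': weight})

print(people)
# Label 'a' has no weight, so pandas fills that cell with NaN
print(pd.isna(people.loc['a', 'weight']))  # True
```

Because alignment happens on labels rather than positions, the shorter Series does not shift values into the wrong rows.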

In [1]:
# Importing libraries
import pandas as pd
import numpy as np

from numpy.random import randn
np.random.seed(101)

In [6]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

Unnamed: 0,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


## I. Selection and Indexing

Let's look at the various ways to grab data from a DataFrame:

In [8]:
df['W']

# OR 
# 
# df.W

A   -0.993263
B    1.025984
C    2.154846
D    0.147027
E   -0.925874
Name: W, dtype: float64
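
The attribute form (`df.W`) is a convenience and does not work for every column name. The sketch below uses a hypothetical `sales` frame to show one case where bracket notation is the safer choice: a column whose name collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame with one column whose name collides with a DataFrame method
sales = pd.DataFrame({'price': [2.5, 4.0], 'count': [10, 3]})

by_attr = sales.price     # works: 'price' is not an existing attribute
by_key = sales['count']   # bracket notation always returns the column
method = sales.count      # attribute access returns the DataFrame.count method

print(type(by_key).__name__)  # Series
print(callable(method))       # True
```

Bracket notation is also the only option for column names that are not valid Python identifiers, such as names containing spaces.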

In [9]:
# Pass a list of column names to get multiple
df[['W','Z']]

Unnamed: 0,W,Z
A,-0.993263,0.000366
B,1.025984,0.649826
C,2.154846,-0.346419
D,0.147027,1.02481
E,-0.925874,0.610478


NOTE: DataFrame Columns are just Series

In [10]:
type(df['W'])

pandas.core.series.Series

#### i. Creating a new column

In [12]:
df['new'] = df['W'] * df['Y']
df

Unnamed: 0,W,X,Y,Z,new
A,-0.993263,0.1968,-1.136645,0.000366,1.128988
B,1.025984,-0.156598,-0.031579,0.649826,-0.0324
C,2.154846,-0.610259,-0.755325,-0.346419,-1.62761
D,0.147027,-0.479448,0.558769,1.02481,0.082154
E,-0.925874,1.862864,-1.133817,0.610478,1.049772
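
Assigning to `df['new']` modifies the DataFrame in place. If you would rather keep the original untouched, `assign()` is one alternative. This is a small sketch with a hypothetical `base` frame, not the notebook's `df`.

```python
import pandas as pd

base = pd.DataFrame({'W': [1.0, 2.0], 'Y': [3.0, 4.0]})  # hypothetical data

# assign() returns a new DataFrame and leaves `base` untouched
with_new = base.assign(new=base['W'] * base['Y'])

print(list(base.columns))      # ['W', 'Y']
print(list(with_new.columns))  # ['W', 'Y', 'new']
```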


#### ii. Removing Columns & Rows

In [15]:
df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [16]:
# NOTE:
# drop() returns a new DataFrame; df itself is unchanged unless inplace=True
df

Unnamed: 0,W,X,Y,Z,new
A,-0.993263,0.1968,-1.136645,0.000366,1.128988
B,1.025984,-0.156598,-0.031579,0.649826,-0.0324
C,2.154846,-0.610259,-0.755325,-0.346419,-1.62761
D,0.147027,-0.479448,0.558769,1.02481,0.082154
E,-0.925874,1.862864,-1.133817,0.610478,1.049772


In [17]:
df.drop('new',axis=1,inplace=True)
df

Unnamed: 0,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


You can also **drop rows** this way:

In [18]:
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
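
As an alternative to remembering which `axis` number means what, `drop()` also accepts `columns=` and `index=` keywords. The sketch below uses a hypothetical `frame` and reassigns the result instead of using `inplace=True`.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3], 'new': [4, 5, 6]}, index=['A', 'B', 'C'])

no_col = frame.drop(columns='new')   # same as frame.drop('new', axis=1)
no_row = frame.drop(index='C')       # same as frame.drop('C', axis=0)

# Both keywords can be combined in a single call
trimmed = frame.drop(columns=['new'], index=['A'])

print(frame.shape)    # (3, 2), the original is unchanged
print(trimmed.shape)  # (2, 1)
```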


#### iii. Selecting Rows

In [20]:
df.loc['A']

# OR
# 
# df.iloc[0]

W   -0.993263
X    0.196800
Y   -1.136645
Z    0.000366
Name: A, dtype: float64
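
`loc` selects by label and `iloc` by integer position. With string labels the two are hard to confuse, but with integer labels they can return different rows. A minimal sketch, assuming a hypothetical `scores` frame whose integer labels are deliberately out of order:

```python
import pandas as pd

# Hypothetical frame whose index labels are integers in non-sorted order
scores = pd.DataFrame({'score': [10, 20, 30]}, index=[2, 0, 1])

by_label = scores.loc[0]      # row whose label is 0 -> score 20
by_position = scores.iloc[0]  # first row in the frame -> score 10

print(by_label['score'], by_position['score'])
```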

#### iv.  Selecting subset of rows and columns

In [21]:
# Selecting a specific element in df
df.loc['B','Y']

-0.031579143908112575

In [22]:
# Selecting subset
df.loc[['A','B'],['W','Y']]

Unnamed: 0,W,Y
A,-0.993263,-1.136645
B,1.025984,-0.031579
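
Besides lists of labels, `loc` and `iloc` accept slices. One detail worth knowing: label slices with `loc` include the end label, while positional slices with `iloc` exclude the end position. The `grid` frame below is a hypothetical example.

```python
import pandas as pd

grid = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['A', 'B', 'C'],
    columns=['W', 'X', 'Y'],
)

label_slice = grid.loc['A':'B', 'W':'X']  # end labels 'B' and 'X' are included
position_slice = grid.iloc[0:2, 0:2]      # end position 2 is excluded

print(label_slice.equals(position_slice))  # True
```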


### 1. Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [23]:
df

Unnamed: 0,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [26]:
# Using comparison operators to get a boolean df
df>0

Unnamed: 0,W,X,Y,Z
A,False,True,False,True
B,True,False,False,True
C,True,False,False,False
D,True,False,True,True
E,False,True,False,True


In [25]:
# Keeping only the values where the condition is True; all others become NaN
df[df>0]

Unnamed: 0,W,X,Y,Z
A,,0.1968,,0.000366
B,1.025984,,,0.649826
C,2.154846,,,
D,0.147027,,0.558769,1.02481
E,,1.862864,,0.610478


In [27]:
# Selecting only the rows where the condition on column 'W' is True
df[df['W']>0]

Unnamed: 0,W,X,Y,Z
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481


In [30]:
# Selecting a column after operation/condition is applied
df[df['W']>0]['Y']

B   -0.031579
C   -0.755325
D    0.558769
Name: Y, dtype: float64

In [31]:
# Selecting multiple columns
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
B,-0.031579,-0.156598
C,-0.755325,-0.610259
D,0.558769,-0.479448
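
The chained form `df[mask][cols]` works for reading, but the same selection can be written in one step with `.loc[mask, cols]`, which is also the form to prefer when assigning values. A minimal sketch with a hypothetical `data` frame:

```python
import pandas as pd

data = pd.DataFrame(
    {'W': [-1.0, 1.0, 2.0], 'X': [0.2, -0.1, -0.6], 'Y': [-1.1, 0.5, -0.7]},
    index=['A', 'B', 'C'],
)

mask = data['W'] > 0

chained = data[mask][['Y', 'X']]     # two separate indexing steps
single = data.loc[mask, ['Y', 'X']]  # rows and columns in one step
print(single.equals(chained))        # True

# Assigning through .loc modifies `data` itself
data.loc[mask, 'Y'] = 0.0
print(data['Y'].tolist())            # [-1.1, 0.0, 0.0]
```

Assigning through a chained selection may operate on a temporary copy, so changes made that way are not guaranteed to reach the original frame.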


To combine _two conditions_, use `|` (or) or `&` (and), and wrap each condition in parentheses:

In [34]:
df

Unnamed: 0,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [35]:
df[(df['W']>0) & (df['Y'] < 0)]

Unnamed: 0,W,X,Y,Z
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
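
The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`. Python's `and` and `or` keywords cannot be used here: they try to reduce a whole Series to a single truth value, which pandas rejects. The sketch below uses a hypothetical `data` frame and also shows `~` for negation.

```python
import pandas as pd

data = pd.DataFrame(
    {'W': [-1.0, 1.0, 2.0, -0.5], 'Y': [-1.0, -0.5, 0.3, 0.8]},
    index=['A', 'B', 'C', 'D'],
)

both = data[(data['W'] > 0) & (data['Y'] < 0)]    # AND: row B
either = data[(data['W'] > 0) | (data['Y'] < 0)]  # OR: rows A, B, C
negated = data[~(data['W'] > 0)]                  # NOT: rows A, D

# Using the `and` keyword raises a ValueError instead of combining the masks
error_message = None
try:
    data[(data['W'] > 0) and (data['Y'] < 0)]
except ValueError as err:
    error_message = str(err)

print(error_message)
```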


## II. More Index Details

There are a few more index features worth covering: resetting the index, setting it to something else, and _index hierarchy_.

In [36]:
df

Unnamed: 0,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [40]:
# Reset to the default 0, 1, ..., n-1 index (the old index becomes a column; not inplace)
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,-0.993263,0.1968,-1.136645,0.000366
1,B,1.025984,-0.156598,-0.031579,0.649826
2,C,2.154846,-0.610259,-0.755325,-0.346419
3,D,0.147027,-0.479448,0.558769,1.02481
4,E,-0.925874,1.862864,-1.133817,0.610478
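
By default `reset_index()` keeps the old labels as a new column and returns a new DataFrame. If the old labels are not needed, `drop=True` discards them. A small sketch with a hypothetical `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1.0, 2.0]}, index=['A', 'B'])

kept = frame.reset_index()              # old labels move into a column named 'index'
dropped = frame.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))     # ['index', 'W']
print(list(dropped.columns))  # ['W']
print(list(frame.index))      # ['A', 'B'], the original is unchanged
```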


In [43]:
# Adding a new column....
df['States'] = 'CA NY WY OR CO'.split()

# ...and setting it as the index
df.set_index('States', inplace=True) # NOTE the `inplace=True`
df

States,W,X,Y,Z
CA,-0.993263,0.1968,-1.136645,0.000366
NY,1.025984,-0.156598,-0.031579,0.649826
WY,2.154846,-0.610259,-0.755325,-0.346419
OR,0.147027,-0.479448,0.558769,1.02481
CO,-0.925874,1.862864,-1.133817,0.610478
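
Note that `set_index()` replaces the existing index, so the previous labels are lost unless they were saved first. Without `inplace=True` it returns a new DataFrame, and `reset_index()` can move the index back into a column. A minimal sketch, assuming a hypothetical `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1.0, 2.0, 3.0], 'States': ['CA', 'NY', 'WY']})

indexed = frame.set_index('States')  # new frame; 'States' leaves the columns
restored = indexed.reset_index()     # 'States' comes back as the first column

print(indexed.index.name)      # States
print(list(indexed.columns))   # ['W']
print(list(restored.columns))  # ['States', 'W']
```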


## III. Multi-Index and Index Hierarchy

Let's go over how to work with a Multi-Index, starting with what a Multi-Indexed DataFrame looks like:

In [46]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside)) # <-- creates a list of tuples
hier_index = pd.MultiIndex.from_tuples(hier_index)

hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
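
`from_tuples` is one of several constructors. When the index is every combination of two lists, as it is here, `from_product` builds the same index more compactly, and `from_arrays` takes the two parallel lists directly. The variable names below are only illustrative.

```python
import pandas as pd

from_tuples = pd.MultiIndex.from_tuples(
    [('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]
)
# Every combination of the outer and inner labels
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
# One list per level, in parallel
from_arrays = pd.MultiIndex.from_arrays(
    [['G1', 'G1', 'G1', 'G2', 'G2', 'G2'], [1, 2, 3, 1, 2, 3]]
)

print(from_tuples.equals(from_product))  # True
print(from_tuples.equals(from_arrays))   # True
```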

In [47]:
# Creating the Multi-Index df
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,2.70685,0.628133
G1,2,0.907969,0.503826
G1,3,0.651118,-0.319318
G2,1,-0.848077,0.605965
G2,2,-2.018168,0.740122
G2,3,0.528813,-0.589001


For an index hierarchy we use `df.loc[]` (Note: if the hierarchy were on the columns axis, you would just use normal bracket notation `df[]`). Selecting one label of the outer level returns the sub-DataFrame:

In [48]:
df.loc['G1']

Unnamed: 0,A,B
1,2.70685,0.628133
2,0.907969,0.503826
3,0.651118,-0.319318


In [49]:
# Extracting a row from the sub-DataFrame
df.loc['G1'].loc[1]

A    2.706850
B    0.628133
Name: 1, dtype: float64
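
The two chained `.loc` calls can also be written as a single lookup by passing a tuple with one label per index level. This is a sketch on a hypothetical `frame` with made-up integer values, not the notebook's random data.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
frame = pd.DataFrame(
    {'A': [10, 20, 30, 40, 50, 60], 'B': [1, 2, 3, 4, 5, 6]},
    index=index,
)

chained = frame.loc['G1'].loc[1]  # two lookups, one per level
direct = frame.loc[('G1', 1)]     # one lookup with a tuple of labels
cell = frame.loc[('G2', 3), 'A']  # tuple for the row, label for the column

print(direct.tolist() == chained.tolist())  # True
print(cell)                                 # 60
```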

NOTE: You can name the index levels by assigning to `df.index.names`:

In [51]:
df.index.names = ['Group','Num']
df

Group,Num,A,B
G1,1,2.70685,0.628133
G1,2,0.907969,0.503826
G1,3,0.651118,-0.319318
G2,1,-0.848077,0.605965
G2,2,-2.018168,0.740122
G2,3,0.528813,-0.589001


Another useful function is the cross-section, `.xs()`. It is especially helpful with multi-level indexes:

In [53]:
# Selecting a sub-DataFrame
df.xs('G1')

Num,A,B
1,2.70685,0.628133
2,0.907969,0.503826
3,0.651118,-0.319318


In [55]:
# Selecting a single row within the sub-DataFrame
df.xs(('G1',1))

A    2.706850
B    0.628133
Name: (G1, 1), dtype: float64

In [57]:
# Selecting every row whose 'Num' level equals 1
df.xs(1,level='Num')

Group,A,B
G1,2.70685,0.628133
G2,-0.848077,0.605965
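
As the output shows, `.xs()` removes the level it selected on. Passing `drop_level=False` keeps it, and a boolean mask built from `get_level_values()` is another way to pick the same rows. A minimal sketch on a hypothetical `frame` with made-up values:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['G1', 'G2'], [1, 2, 3]], names=['Group', 'Num']
)
frame = pd.DataFrame({'A': [10, 20, 30, 40, 50, 60]}, index=index)

dropped = frame.xs(1, level='Num')                 # 'Num' level removed from the result
kept = frame.xs(1, level='Num', drop_level=False)  # both index levels kept

# Boolean-mask alternative: compare the values of one level directly
masked = frame[frame.index.get_level_values('Num') == 1]

print(list(dropped.index))  # ['G1', 'G2']
print(kept.index.nlevels)   # 2
print(masked['A'].tolist()) # [10, 40]
```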
