# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

In [2]:
import pandas as pd
import numpy as np

In [3]:
from numpy.random import randn
np.random.seed(101) # Ensures the same random numbers are produced every time a random generating function is run from here on

In [4]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) # creates a data frame with a numpy matrix, row indices A - E and column indices W - Z

In [5]:
df  # A DataFrame is a collection of Series sharing an index. Below, W, X, Y and Z are the Series (columns), indexed by the row labels A, B, C, D and E.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [7]:
type(df)    # DataFrame data type

pandas.core.frame.DataFrame
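
To make the "Series sharing an index" idea concrete, here is a minimal sketch that builds a small DataFrame from two Series. The names `s1`, `s2` and `demo`, the column names, and the values are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical values)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# Each dictionary key becomes a column name; the shared labels become the row index
demo = pd.DataFrame({'first': s1, 'second': s2})

print(demo)
print(type(demo['first']))  # pulling a column back out gives a Series again
```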

## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

In [10]:
df['W'] # This returns the W column as a Series (NOTE THE SINGLE BRACKETS DENOTING A SERIES)

A   -0.156598
B   -0.610259
C   -0.479448
D    1.862864
E    2.084019
Name: W, dtype: float64

In [7]:
# Pass a list of column names (NOTE THE DOUBLE BRACKETS DENOTING THAT A DATAFRAME WILL RESULT)
df[['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [11]:
# Attribute (dot) access (NOT RECOMMENDED!): a column name can clash with one of the DataFrame's own methods or attributes, in which case the method wins.
df.W

A   -0.156598
B   -0.610259
C   -0.479448
D    1.862864
E    2.084019
Name: W, dtype: float64
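
The name clash behind that warning can be seen in a small sketch. The frame `demo` and its column named `count` are hypothetical, chosen because `count` is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame with a column named like an existing DataFrame method
demo = pd.DataFrame({'count': [1, 2, 3], 'W': [0.1, 0.2, 0.3]})

print(type(demo['count']))   # bracket access returns the column as a Series
print(callable(demo.count))  # attribute access returns the DataFrame.count method instead
print(type(demo.W))          # works only because 'W' does not clash with anything
```

Bracket notation behaves the same way for every column name, which is why it is the safer habit.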

DataFrame Columns are just Series

In [9]:
type(df['W'])

pandas.core.series.Series

**ADDING NEW COLUMNS**

In [21]:
df['new'] = df['W'] + df['Y']   # Sums up values in col W & Y and sets the results in a new col 'new'
df

Unnamed: 0,W,X,Y,Z,new
A,-0.156598,-0.031579,0.649826,2.154846,0.493228
B,-0.610259,-0.755325,-0.346419,0.147027,-0.956677
C,-0.479448,0.558769,1.02481,-0.925874,0.545362
D,1.862864,-1.133817,0.610478,0.38603,2.473342
E,2.084019,-0.376519,0.230336,0.681209,2.314355


**REMOVING COLUMNS**

In [22]:
df.drop('new',axis=1) # To drop a column, set axis=1. The default is axis=0, which refers to rows, so df.drop('new') would try to drop a row labelled 'new'. The axis numbers follow the order of a numpy array's shape:

Unnamed: 0,W,X,Y,Z
A,-0.156598,-0.031579,0.649826,2.154846
B,-0.610259,-0.755325,-0.346419,0.147027
C,-0.479448,0.558769,1.02481,-0.925874
D,1.862864,-1.133817,0.610478,0.38603
E,2.084019,-0.376519,0.230336,0.681209


In [23]:
df.shape    # Yields a tuple showing that df has 5 rows and 5 columns. The first element (position 0) is the number of rows and the second (position 1) the number of columns, hence the axis numbers.

(5, 5)
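
As a sketch of the same idea, `drop` also accepts the `index=` and `columns=` keywords, which some readers find easier to remember than axis numbers. The frame `demo` below is a made-up stand-in, not the `df` used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame
demo = pd.DataFrame(np.arange(6).reshape(3, 2), index=['A', 'B', 'C'], columns=['W', 'X'])

print(demo.shape)  # (number of rows, number of columns)

# Two equivalent ways to drop a column
no_x_axis = demo.drop('X', axis=1)
no_x_kw = demo.drop(columns='X')

# Two equivalent ways to drop a row
no_c_axis = demo.drop('C', axis=0)
no_c_kw = demo.drop(index='C')

print(no_x_kw)
print(no_c_kw)
```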

In [24]:
# Not inplace unless specified, i.e. the drop above did not affect the original DataFrame
df

Unnamed: 0,W,X,Y,Z,new
A,-0.156598,-0.031579,0.649826,2.154846,0.493228
B,-0.610259,-0.755325,-0.346419,0.147027,-0.956677
C,-0.479448,0.558769,1.02481,-0.925874,0.545362
D,1.862864,-1.133817,0.610478,0.38603,2.473342
E,2.084019,-0.376519,0.230336,0.681209,2.314355


In [25]:
df.drop('new',axis=1,inplace=True)  # This will now affect the original DataFrame.

In [26]:
df

Unnamed: 0,W,X,Y,Z
A,-0.156598,-0.031579,0.649826,2.154846
B,-0.610259,-0.755325,-0.346419,0.147027
C,-0.479448,0.558769,1.02481,-0.925874
D,1.862864,-1.133817,0.610478,0.38603
E,2.084019,-0.376519,0.230336,0.681209


With axis set to zero or argument left out, you can delete a row that exists.

In [27]:
df.drop('E',axis=0) # Note that with the inplace argument left out (default=False), the original DataFrame is not affected, just as with columns.

Unnamed: 0,W,X,Y,Z
A,-0.156598,-0.031579,0.649826,2.154846
B,-0.610259,-0.755325,-0.346419,0.147027
C,-0.479448,0.558769,1.02481,-0.925874
D,1.862864,-1.133817,0.610478,0.38603


In [28]:
df

Unnamed: 0,W,X,Y,Z
A,-0.156598,-0.031579,0.649826,2.154846
B,-0.610259,-0.755325,-0.346419,0.147027
C,-0.479448,0.558769,1.02481,-0.925874
D,1.862864,-1.133817,0.610478,0.38603
E,2.084019,-0.376519,0.230336,0.681209


**Selecting Rows**

In [33]:
df.loc['A'] # Select a row by passing its label to .loc; the result is also a Series.

W   -0.156598
X   -0.031579
Y    0.649826
Z    2.154846
Name: A, dtype: float64

Or select based on position instead of label:

In [34]:
df.iloc[2]  # Selects a row based on its row position, regardless of the row labels.

W   -0.479448
X    0.558769
Y    1.024810
Z   -0.925874
Name: C, dtype: float64
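
One practical difference between the two accessors shows up when slicing: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position, like ordinary Python slicing. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical single-column frame with labels A-D
demo = pd.DataFrame({'W': [1, 2, 3, 4]}, index=['A', 'B', 'C', 'D'])

by_label = demo.loc['A':'C']    # label slice: the end label 'C' is included
by_position = demo.iloc[0:2]    # position slice: the end position 2 is excluded

print(by_label)
print(by_position)
```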

**Selecting subset of rows and columns**

In [36]:
df.loc['B','Y'] # Simple intersection, yielding a single value at that intersection.

-0.34641850351854453

In [38]:
df.iloc[2,3] # Simple intersection, yielding a single value at that intersection.

-0.925874258809907

In [45]:
df.loc[['A','B','C'],['W','Y','Z']] # fetching a DataFrame subset

Unnamed: 0,W,Y,Z
A,-0.156598,0.649826,2.154846
B,-0.610259,-0.346419,0.147027
C,-0.479448,1.02481,-0.925874


In [47]:
df.iloc[[0,3,4],[1,2,3]] # fetching a DataFrame subset

Unnamed: 0,X,Y,Z
A,-0.031579,0.649826,2.154846
D,-1.133817,0.610478,0.38603
E,-0.376519,0.230336,0.681209


In [8]:
df.iloc[[0,2,4]]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
C,-2.018168,0.740122,0.528813,-0.589001
E,0.190794,1.978757,2.605967,0.683509


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [6]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [49]:
df>0    # Returns a DataFrame of the same shape as df holding booleans: True where the value satisfies the condition, False otherwise

Unnamed: 0,W,X,Y,Z
A,False,False,True,True
B,False,False,False,True
C,False,True,True,False
D,True,False,True,True
E,True,False,True,True


In [40]:
df[df>0]    # Applies the boolean DataFrame above to df: values > 0 are kept and the rest become NaN. Filtering with a whole boolean DataFrame like this is not common.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509
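
The masking shown above can also be written with `DataFrame.where`, which keeps values where the condition holds and fills the rest with NaN. This is a small sketch on a hypothetical frame `demo`, not the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame with mixed signs
demo = pd.DataFrame({'W': [1.0, -2.0], 'X': [-3.0, 4.0]}, index=['A', 'B'])

masked = demo[demo > 0]            # values failing the condition become NaN
same_thing = demo.where(demo > 0)  # DataFrame.where gives the same result

print(masked)
print(masked.equals(same_thing))
```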


In [50]:
df['W']>0   # Applies the condition to column 'W' only, returning a boolean Series with one entry per row

A    False
B    False
C    False
D     True
E     True
Name: W, dtype: bool

In [59]:
df[df['W']>0]   # Applying the Series above (df['W']>0) to the whole DataFrame returns only the rows where it is True, e.g. row C is False in the conditional Series, so it is left out of the result. This is the common form of conditional selection.

Unnamed: 0,W,X,Y,Z
D,1.862864,-1.133817,0.610478,0.38603
E,2.084019,-0.376519,0.230336,0.681209


In [55]:
df[df['W']<0] # Returns only the rows where the value in column W is negative.

Unnamed: 0,W,X,Y,Z
A,-0.156598,-0.031579,0.649826,2.154846
B,-0.610259,-0.755325,-0.346419,0.147027
C,-0.479448,0.558769,1.02481,-0.925874


In [60]:
condition_res = df['W']>0 # Conditional series
df_filtered = df[condition_res] # DataFrame filtered based on the conditional series
df_filtered

Unnamed: 0,W,X,Y,Z
D,1.862864,-1.133817,0.610478,0.38603
E,2.084019,-0.376519,0.230336,0.681209


In [63]:
df[df['W']>0][['W','X']]    # Returns a subset of df consisting of columns W & X, keeping only the rows where the value in column W is greater than 0.

Unnamed: 0,W,X
D,1.862864,-1.133817
E,2.084019,-0.376519
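
The row filter and the column selection can also be combined in a single `.loc` call, passing the boolean Series as the row selector and the column list as the column selector. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with three columns
demo = pd.DataFrame(
    {'W': [-1.0, 2.0, 3.0], 'X': [0.5, -0.5, 1.5], 'Y': [9.0, 8.0, 7.0]},
    index=['A', 'B', 'C'],
)

chained = demo[demo['W'] > 0][['W', 'X']]     # two steps: filter rows, then pick columns
single = demo.loc[demo['W'] > 0, ['W', 'X']]  # one step: row mask and column list together

print(single)
print(chained.equals(single))
```

Both forms select the same data here; the single `.loc` call is usually preferred when the selection will later be assigned to.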


## Multiple Conditions

For two conditions you can use | (OR) and & (AND), with each condition wrapped in parentheses. You can't use the Python operators 'and' or 'or': they expect a single boolean value on each side, not a Series of booleans.

In [7]:
df[(df['W']>0) & (df['Y']>1)] # Returns the DataFrame with rows where the conditions on col 'W' and 'Y' are satisfied.

Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
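
A short sketch of the operators side by side, including `~` for negation and the error raised by the plain Python `and`. The frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with two numeric columns
demo = pd.DataFrame({'W': [1, -1, 2], 'Y': [2, 3, -4]}, index=['A', 'B', 'C'])

both = demo[(demo['W'] > 0) & (demo['Y'] > 1)]    # AND: both conditions hold
either = demo[(demo['W'] > 0) | (demo['Y'] > 1)]  # OR: at least one condition holds
negated = demo[~(demo['W'] > 0)]                  # NOT: invert a boolean Series

# Python's `and` asks each whole Series for a single True/False value, which is ambiguous
try:
    demo[(demo['W'] > 0) and (demo['Y'] > 1)]
except ValueError as err:
    print('ValueError:', err)
```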


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [8]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [9]:
# Reset to default 0,1...n index
df.reset_index()    # Moves the current index into a new column named 'index' and adds the default numerical index. This does not happen inplace unless inplace=True is passed to reset_index.

Unnamed: 0,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509
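
If the old labels are not needed as a column, `reset_index` takes a `drop=True` argument that discards them. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with custom row labels
demo = pd.DataFrame({'W': [1, 2]}, index=['A', 'B'])

kept = demo.reset_index()              # old labels move into a column named 'index'
dropped = demo.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
```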


In [10]:
df  # The original DataFrame still has its A-E index

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [12]:
newindex = 'CA NY WY OR CO'.split()

In [16]:
df['States'] = newindex   # Adding the above list as a column works because the number of elements matches the number of rows

In [17]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [18]:
df.set_index('States') # Replaces the original index col in df. Does not persist unless inplace=True is specified in the set_index() args.

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


In [19]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [20]:
df.set_index('States',inplace=True)

In [21]:
df

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509
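
An alternative to `inplace=True` is to reassign the returned frame to the same name. The sketch below uses a hypothetical frame `demo`; note that `set_index` discards the previous index labels, so call `reset_index()` first if they need to be kept as a column.

```python
import pandas as pd

# Hypothetical frame with a 'States' column
demo = pd.DataFrame({'W': [1, 2], 'States': ['CA', 'NY']}, index=['A', 'B'])

# Reassigning the result has the same effect as passing inplace=True
demo = demo.set_index('States')

print(demo)
print(demo.index.name)
```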


## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [136]:
# Index Levels
outside = ['G1','G1','G1','G1','G2','G2','G2','G3','G3','G3']
inside = [1,2,3,4,1,2,3,4,1,2]
index_tup = list(zip(outside,inside))  # zip pairs up the elements of the two lists; list() collects the pairs into a list of tuples
index_tup

[('G1', 1),
 ('G1', 2),
 ('G1', 3),
 ('G1', 4),
 ('G2', 1),
 ('G2', 2),
 ('G2', 3),
 ('G3', 4),
 ('G3', 1),
 ('G3', 2)]

In [137]:
hierachy_index = pd.MultiIndex.from_tuples(index_tup)
hierachy_index  # multi-index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G1', 4),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3),
            ('G3', 4),
            ('G3', 1),
            ('G3', 2)],
           )
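
Besides `from_tuples`, pandas offers `MultiIndex.from_arrays` (one list per level, no `zip` needed) and `MultiIndex.from_product` (every combination of the given lists). The sketch below uses a smaller, hypothetical pair of lists, `outer` and `inner`. `from_product` only fits when every group has the same inner labels, which is not the case for the index built above.

```python
import pandas as pd

# Hypothetical level values, shorter than the lists used above
outer = ['G1', 'G1', 'G2', 'G2']
inner = [1, 2, 1, 2]

from_tuples = pd.MultiIndex.from_tuples(list(zip(outer, inner)))
from_arrays = pd.MultiIndex.from_arrays([outer, inner])            # one list per level
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2]])  # every combination of the two lists

print(from_tuples.equals(from_arrays))
print(from_tuples.equals(from_product))
```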

In [138]:
dfm = pd.DataFrame(np.random.randn(10,4),index=hierachy_index, columns=['A','B','C','D'])
dfm  # multi-level index DataFrame

Unnamed: 0,Unnamed: 1,A,B,C,D
G1,1,0.308671,0.750127,-0.087113,0.555042
G1,2,0.174446,0.583755,-0.244361,1.041056
G1,3,0.629237,-0.843552,-0.162464,-0.747989
G1,4,-1.790809,-0.21222,1.330096,0.683211
G2,1,-0.008983,0.429693,0.662408,1.223457
G2,2,0.842224,2.133137,1.159285,-1.738897
G2,3,0.13402,1.040047,-0.200309,-1.560216
G3,4,0.050336,-0.611267,-2.231037,-0.125453
G3,1,1.083252,0.715451,0.707512,0.172454
G3,2,-1.062307,1.236113,-0.223678,-0.604987


## Working with a multi-index level DataFrame

Now let's show how to index this! 

For an index hierarchy on the rows we use df.loc[]. If the hierarchy were on the columns axis, you would just use normal bracket notation, df[].

Calling one level of the index returns the sub-dataframe.

In [139]:
dfm.loc['G1']    # Returns the sub-frame G1

Unnamed: 0,A,B,C,D
1,0.308671,0.750127,-0.087113,0.555042
2,0.174446,0.583755,-0.244361,1.041056
3,0.629237,-0.843552,-0.162464,-0.747989
4,-1.790809,-0.21222,1.330096,0.683211


In [140]:
dfm.loc['G1']['B']   # Returns col 'B' from the subframe G1 as a series

1    0.750127
2    0.583755
3   -0.843552
4   -0.212220
Name: B, dtype: float64

In [141]:
dfm['B'].loc['G1']  # You can also reference the column first then the index

1    0.750127
2    0.583755
3   -0.843552
4   -0.212220
Name: B, dtype: float64

In [142]:
dfm.loc['G1'].loc[1] # Returns row 1 from the sub-frame G1 as a series

A    0.308671
B    0.750127
C   -0.087113
D    0.555042
Name: 1, dtype: float64
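
Instead of chaining two `.loc` calls, a tuple of labels (one per index level) can be passed to a single `.loc`. This is a minimal sketch on a small hypothetical frame `demo`, not on `dfm` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index and frame
idx = pd.MultiIndex.from_tuples([('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)])
demo = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['A', 'B'])

row = demo.loc[('G1', 2), :]      # one row as a Series, selected with a tuple of labels
value = demo.loc[('G2', 1), 'B']  # a single value: row tuple first, then column label

print(row)
print(value)
```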

## Adding Index labels

In [143]:
dfm.index.names  # The two index levels of the DataFrame above have no names yet

FrozenList([None, None])

In [144]:
dfm.index.names = ['Group', 'Num']    # Assign one name per index level; the list length must equal the number of levels

In [146]:
# Fetching 1.040047 from dfm above, i.e. from group G2, Num 3, col B
dfm.loc['G2'].loc[3]['B']

1.040046901956782

In [147]:
dfm

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
G1,1,0.308671,0.750127,-0.087113,0.555042
G1,2,0.174446,0.583755,-0.244361,1.041056
G1,3,0.629237,-0.843552,-0.162464,-0.747989
G1,4,-1.790809,-0.21222,1.330096,0.683211
G2,1,-0.008983,0.429693,0.662408,1.223457
G2,2,0.842224,2.133137,1.159285,-1.738897
G2,3,0.13402,1.040047,-0.200309,-1.560216
G3,4,0.050336,-0.611267,-2.231037,-0.125453
G3,1,1.083252,0.715451,0.707512,0.172454
G3,2,-1.062307,1.236113,-0.223678,-0.604987


In [148]:
dfm.xs('G3')     # The xs method returns a cross-section from a multi-level DataFrame, which is handy where using .loc[] is tricky. This returns the G3 sub-frame, just like dfm.loc['G3'] would

Unnamed: 0_level_0,A,B,C,D
Num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
4,0.050336,-0.611267,-2.231037,-0.125453
1,1.083252,0.715451,0.707512,0.172454
2,-1.062307,1.236113,-0.223678,-0.604987


In [150]:
dfm.xs(('G3',2)) # From G3, returns the row with Num 2 as a Series. The key is passed as a tuple; recent pandas versions no longer accept a list here

A   -1.062307
B    1.236113
C   -0.223678
D   -0.604987
Name: (G3, 2), dtype: float64

In [152]:
dfm.xs(1,level='Num')    # From every group, return the rows where the 'Num' level equals 1

Unnamed: 0_level_0,A,B,C,D
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
G1,0.308671,0.750127,-0.087113,0.555042
G2,-0.008983,0.429693,0.662408,1.223457
G3,1.083252,0.715451,0.707512,0.172454
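
By default `xs` removes the level it selected on. As a sketch, passing `drop_level=False` keeps both levels, and the same rows can also be picked with a boolean mask built from `index.get_level_values`. The frame `demo` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with named index levels
idx = pd.MultiIndex.from_tuples(
    [('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)], names=['Group', 'Num']
)
demo = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['A', 'B'])

dropped = demo.xs(1, level='Num')                        # 'Num' level removed from the result
kept = demo.xs(1, level='Num', drop_level=False)         # both index levels kept
masked = demo[demo.index.get_level_values('Num') == 1]   # same rows via a boolean mask

print(dropped)
print(kept)
```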


# Great Job!