___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
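
To make the "Series sharing an index" idea concrete, here is a minimal sketch with made-up values. The names `heights`, `weights`, and `people` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels
heights = pd.Series([1.70, 1.82, 1.65], index=['ann', 'bob', 'cy'])
weights = pd.Series([60.0, 85.5, 58.2], index=['ann', 'bob', 'cy'])

# A dict of Series becomes a DataFrame:
# the keys become column names, the shared index becomes the row labels
people = pd.DataFrame({'height': heights, 'weight': weights})

print(people)
print(type(people['height']))  # each column is still a Series
```

Pulling a column back out with `people['height']` returns one of the original Series, aligned on the shared index.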

In [1]:
import pandas as pd
import numpy as np

In [2]:
from numpy.random import randn
np.random.seed(101) # Seeds the generator so the random functions below produce the same numbers on every run

In [3]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) # Builds a DataFrame from a 5x4 NumPy array, with row labels A-E and column labels W-Z

In [4]:
df  # A DataFrame is a collection of Series sharing an index. Here W, X, Y and Z are the Series (columns), indexed by the row labels A, B, C, D and E.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [5]:
type(df)    # DataFrame data type

pandas.core.frame.DataFrame

## Selection and Indexing

Let's learn the various ways to grab data from a DataFrame.

In [6]:
df['W'] # Returns the W column as a Series (NOTE THE SINGLE BRACKETS DENOTING A SERIES)

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [7]:
# Pass a list of column names (NOTE THE DOUBLE BRACKETS DENOTING THAT A DATAFRAME WILL RESULT)
df[['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [8]:
# Attribute (dot) syntax (NOT RECOMMENDED!): a column name can collide with an existing DataFrame attribute or method, in which case the dot returns that attribute instead of the column.
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
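
The risk with dot access can be shown with a small sketch. The frame `sales` and its column names are hypothetical, chosen only because they clash with real DataFrame attributes.

```python
import pandas as pd

# Hypothetical frame whose column names clash with DataFrame attributes
sales = pd.DataFrame({'count': [3, 5], 'shape': ['round', 'square']})

by_brackets = sales['count']   # bracket access always returns the column
by_attribute = sales.count     # dot access finds the DataFrame.count method instead

print(type(by_brackets))       # a Series
print(callable(by_attribute))  # a bound method, not the column
print(sales.shape)             # the (rows, columns) tuple, not the 'shape' column
```

Bracket notation has no such ambiguity, which is why the rest of this notebook sticks to it.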

DataFrame columns are just Series.

In [9]:
type(df['W'])

pandas.core.series.Series

**ADDING NEW COLUMNS**

In [10]:
df['new'] = df['W'] + df['Y']   # Sums up values in col W & Y and sets the results in a new col 'new'

In [11]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


**REMOVING COLUMNS**

In [12]:
df.drop('new',axis=1) # To drop a column, set axis=1. The default is axis=0, which refers to rows, so df.drop('new') would look for a row labeled 'new' and raise a KeyError. The axis numbering follows the order of a NumPy array's shape tuple (rows, columns):

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [13]:
df.shape    # Yields a tuple showing that df is 5x5: position 0 holds the number of rows and position 1 the number of columns, hence the axis numbering.

(5, 5)
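
As a minimal sketch of how `axis` changes what `drop` looks for, consider a small made-up frame (`demo` is an illustrative name, not the notebook's `df`):

```python
import pandas as pd

# Hypothetical two-row frame with a column called 'new'
demo = pd.DataFrame({'W': [1, 2], 'new': [3, 4]}, index=['A', 'B'])

without_col = demo.drop('new', axis=1)  # axis=1: look for 'new' among the columns
without_row = demo.drop('A', axis=0)    # axis=0 (default): look for 'A' among the row labels
same_as_axis1 = demo.drop(columns='new')  # keyword form that avoids the axis number

try:
    demo.drop('new')                    # default axis=0: there is no row labeled 'new'
except KeyError as err:
    print('KeyError:', err)

print(without_col.shape, without_row.shape, demo.shape)
```

None of these calls modify `demo` itself, since `inplace` defaults to `False`.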

In [14]:
# Not inplace unless specified! i.e does not affect the original DataFrame
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [15]:
df.drop('new',axis=1,inplace=True)  # This will now affect the original DataFrame.

In [16]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


With axis set to zero, or the argument left out, you can delete a row by its label.

In [17]:
df.drop('E',axis=0) # With the inplace argument left out (default False), the original DataFrame is not affected, just as with columns.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [18]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


**Selecting Rows**

In [19]:
df.loc['A'] # Selecting a row by label uses the loc indexer; the result is also a Series.

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based on position instead of label:

In [20]:
df.iloc[2]  # Selects a row based on its row position, regardless of the row labels.

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
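
The difference between `loc` and `iloc` is easiest to see when the row labels are themselves integers. The sketch below uses a hypothetical frame `scores` whose labels deliberately do not match the row positions.

```python
import pandas as pd

# Hypothetical frame: integer row labels that do not match row positions
scores = pd.DataFrame({'score': [90, 75, 60]}, index=[10, 20, 0])

by_label = scores.loc[0]      # the row whose label is 0 (the last row here)
by_position = scores.iloc[0]  # the row at position 0 (the first row here)

print(by_label['score'], by_position['score'])
```

With string labels such as A-E the two cannot be confused, but with integer labels the choice of indexer decides which row comes back.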

**Selecting a subset of rows and columns**

In [21]:
df.loc['B','Y'] # Simple intersection, yielding the single value at row 'B', column 'Y'.

-0.8480769834036315

In [22]:
df.loc[['A','B'],['W','Y']] # fetching a DataFrame subset

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [29]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [38]:
df>0    # Returns a DataFrame of the same shape as df, holding True or False for each data point tested against the condition

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [40]:
df[df>0]    # Applies the boolean DataFrame above to df: keeps each value that is > 0 and puts NaN where the condition is False. Masking with a whole DataFrame like this is not very common.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509
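
Masking with a boolean DataFrame behaves like the `where` method. The following sketch, on a small made-up frame `vals`, shows the equivalence and how a replacement value can be supplied instead of NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame with one negative value per row
vals = pd.DataFrame({'W': [2.5, -1.0], 'X': [-0.5, 4.0]}, index=['A', 'B'])

masked = vals[vals > 0]              # same shape, NaN where the condition is False
same_thing = vals.where(vals > 0)    # explicit spelling of the same operation
filled = vals.where(vals > 0, 0.0)   # supply a replacement instead of NaN

print(masked)
print(filled)
```

The shape never changes with this kind of selection; only row filtering with a boolean Series removes rows.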


In [42]:
df['W']>0   # Evaluates the condition on column 'W' only, returning a boolean Series (True where the value in 'W' is > 0)

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [45]:
df[df['W']>0]   # Applies the boolean Series df['W']>0 to the entire DataFrame, returning only the rows where it is True. Row C is False in the conditional Series, so it is left out of the result. THIS IS THE COMMON FORM OF CONDITIONAL SELECTION.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [46]:
df[df['W']<0] # Returns row C only.

Unnamed: 0,W,X,Y,Z
C,-2.018168,0.740122,0.528813,-0.589001


In [59]:
condition_res = df['W']>0 # Conditional series
df_filtered = df[condition_res] # DataFrame filtered based on the conditional series
req_filtered_col = df_filtered['X'] # A single column of the row-filtered DataFrame (a Series)
multi_filtered_cols = df_filtered[['W', 'X']] # Several columns are also possible; note the result is a DataFrame.
multi_filtered_cols

Unnamed: 0,W,X
A,2.70685,0.628133
B,0.651118,-0.319318
D,0.188695,-0.758872
E,0.190794,1.978757


In [61]:
df[df['W']>0][['W','X']]    # The above steps combined into one statement, which is more concise.

Unnamed: 0,W,X
A,2.70685,0.628133
B,0.651118,-0.319318
D,0.188695,-0.758872
E,0.190794,1.978757
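
The same row-and-column selection can also be written as a single `.loc` call. This is a sketch on a hypothetical frame `demo`, not a claim about which form the course prefers; `.loc` with a boolean mask and a column list is standard pandas.

```python
import pandas as pd

# Hypothetical frame with one negative value in 'W'
demo = pd.DataFrame(
    {'W': [2.7, -2.0, 0.2], 'X': [0.6, 0.7, -0.8], 'Y': [0.9, 0.5, -0.9]},
    index=['A', 'C', 'D'],
)

chained = demo[demo['W'] > 0][['W', 'X']]     # two indexing steps, one after the other
single = demo.loc[demo['W'] > 0, ['W', 'X']]  # rows and columns in one .loc call

print(single)
```

Both forms return the same data when reading. When assigning values to a filtered selection, the single `.loc` form is generally the safer choice, because the chained form may operate on a temporary copy.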


## MULTIPLE CONDITIONS

For two conditions you can use | (OR) and & (AND), with each condition wrapped in parentheses. You can't use the Python operators 'and' and 'or'; they won't work because they compare single boolean values, not whole Series.

In [62]:
df[(df['W']>0) and (df['Y'] > 1)]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [64]:
df[(df['W']>0) & (df['Y'] > 1)] # Returns the DataFrame with rows where the conditions on col 'W' and 'Y' are satisfied.

Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
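
Here is a small sketch of the three element-wise operators side by side, using a made-up frame `demo` in which each row exercises a different combination. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>`.

```python
import pandas as pd

# Hypothetical frame: rows chosen to cover (T,F), (T,T), (F,T), (F,F)
demo = pd.DataFrame(
    {'W': [2.7, 0.2, -2.0, -1.0], 'Y': [0.9, 2.6, 1.5, 0.5]},
    index=['A', 'B', 'C', 'D'],
)

w_pos = demo['W'] > 0
y_big = demo['Y'] > 1

both = demo[w_pos & y_big]        # AND: rows where both conditions hold
either = demo[w_pos | y_big]      # OR: rows where at least one holds
neither = demo[~(w_pos | y_big)]  # NOT: ~ negates a boolean Series

print(list(both.index), list(either.index), list(neither.index))
```

Naming the masks first, as done here, is optional but keeps long conditions readable.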


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [72]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [82]:
# Reset to default 0,1...n index
df.reset_index()    # Not in place unless specified in the reset_index args. The old index becomes a regular column named 'index', and a default integer index (0 to n-1) takes its place.

Unnamed: 0,index,W,X,Y,Z,States
0,A,2.70685,0.628133,0.907969,0.503826,CA
1,B,0.651118,-0.319318,-0.848077,0.605965,NY
2,C,-2.018168,0.740122,0.528813,-0.589001,WY
3,D,0.188695,-0.758872,-0.933237,0.955057,OR
4,E,0.190794,1.978757,2.605967,0.683509,CO


In [83]:
df  # The original DataFrame is unchanged because reset_index() was not applied in place

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [76]:
newind = 'CA NY WY OR CO'.split()

In [77]:
df['States'] = newind   # Adding the list as a column works because its number of elements matches the number of rows

In [78]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [84]:
df.set_index('States') # Replaces the original index col in df. Does not persist unless inplace=True is specified in the set_index() args.

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


In [86]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [87]:
df.set_index('States',inplace=True)

In [88]:
df

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509
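
The round trip between `set_index` and `reset_index` can be sketched on a small hypothetical frame (`demo`, with made-up values). Reassigning the result is an alternative to `inplace=True`.

```python
import pandas as pd

# Hypothetical frame with a 'States' column and letter row labels
demo = pd.DataFrame({'W': [1.0, 2.0], 'States': ['CA', 'NY']}, index=['A', 'B'])

by_state = demo.set_index('States')        # new frame; the old labels A, B are discarded
restored = by_state.reset_index()          # 'States' comes back as a column, index is 0..n-1
dropped = by_state.reset_index(drop=True)  # drop=True throws the old index away instead

print(restored)
print(dropped)
```

Note that `set_index` does not keep the previous index anywhere, so the labels A and B cannot be recovered from `by_state`.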


## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [98]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))  # zip pairs up the elements of the two lists; list() turns the pairs into a list of tuples
hier_index

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]

In [102]:
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index  # multi-index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

In [103]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df  # multi-level index DataFrame

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646
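
When the index is every combination of two sets of labels, as it is here, `pd.MultiIndex.from_product` is one possible shortcut. This sketch is an alternative to the zip-based construction, not the approach the notebook uses; the level names passed to `names` are illustrative.

```python
import pandas as pd

# Every combination of the two levels, with level names set at construction time
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Group', 'Num'])

# The same tuples built by zipping two lists by hand
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
by_hand = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

print(list(idx) == list(by_hand))
print(idx.names)
```

`from_tuples` remains the more general tool, since it also handles indexes that are not a full product of the levels.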


## Working with a multi-index level DataFrame

Now let's show how to index this! 

For index hierarchy we use df.loc[]. If this were on the columns axis, you would just use normal bracket notation df[].

Calling one level of the index returns the sub-dataframe.

In [114]:
df.loc['G1']    # Returns the sub-frame G1

Unnamed: 0,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [116]:
df.loc['G1']['B']   # Returns col 'B' from the subframe G1

1    1.693723
2   -1.159119
3    0.390528
Name: B, dtype: float64

In [120]:
df.loc['G1'].loc[1] # Returns row 1 from the sub-frame G1

A    0.302665
B    1.693723
Name: 1, dtype: float64

## Adding Index labels

In [109]:
df.loc[1]

KeyError: 1

In [122]:
df.index.names  # The two index levels have no names yet

FrozenList([None, None])

In [132]:
df.index.names = ['Group', 'Num']    # Assign a list of names to the index levels; its length must equal the number of levels

In [140]:
# Fetching 0.072960 from df above
df.loc['G2'].loc[2]['B']

0.07295967531703869
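
A chained lookup like this can also be written as a single `.loc` call by passing the row key as a tuple. The sketch below uses a hypothetical four-row frame `demo` with made-up values rather than the random data above.

```python
import pandas as pd

# Hypothetical two-level index and frame
idx = pd.MultiIndex.from_tuples(
    [('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)], names=['Group', 'Num']
)
demo = pd.DataFrame({'A': [0.1, 0.2, 0.3, 0.4], 'B': [1.0, 2.0, 3.0, 4.0]}, index=idx)

chained = demo.loc['G2'].loc[2]['B']  # three separate lookups
direct = demo.loc[('G2', 2), 'B']     # one lookup: (outer, inner) row key, then the column

print(chained, direct)
```

The tuple lists one label per index level, outermost first.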

In [134]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [142]:
df.xs('G1')     # The xs method returns a cross-section of a multi-level DataFrame, which helps where .loc[] is tricky. This returns the G1 sub-frame, just like df.loc['G1'].

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [143]:
df.xs(('G1',1)) # From G1, returns row 1. The key is passed as a tuple; recent pandas versions reject a list here.

A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64

In [145]:
df.xs(1,level='Num')    # From every group, returns the rows whose 'Num' level equals 1

Unnamed: 0_level_0,A,B
Group,Unnamed: 1_level_1,Unnamed: 2_level_1
G1,0.302665,1.693723
G2,0.166905,0.184502
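
By default `xs` removes the level it selected on. The sketch below, on a hypothetical frame `demo`, shows the `drop_level=False` option for keeping that level in the result.

```python
import pandas as pd

# Hypothetical two-level frame with made-up values
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2]], names=['Group', 'Num'])
demo = pd.DataFrame({'A': [10, 20, 30, 40]}, index=idx)

num_one = demo.xs(1, level='Num')                         # 'Num' is dropped from the result
num_one_kept = demo.xs(1, level='Num', drop_level=False)  # both levels stay in the index

print(num_one)
print(num_one_kept)
```

Keeping the level can be convenient when the result will later be combined with other slices of the same frame.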


# Great Job!