![alt text](pandas.png "Title")

In [2]:
import pandas as pd
import numpy as np

# DataFrame indexes

A DataFrame has a row index and a column index. Indexes hold the axis labels and other metadata, such as the axis name. As you learn pandas, you'll find that mastering indexes is important.

Advanced note: Index objects are immutable, and they can contain duplicate labels!
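
A minimal sketch of both properties mentioned in the note above. The labels and the `dup` frame are made up for illustration only.

```python
import pandas as pd

# An Index is immutable: assigning to one of its positions raises a TypeError.
idx = pd.Index(['a', 'b', 'c'])
try:
    idx[0] = 'z'
    mutation_failed = False
except TypeError:
    mutation_failed = True

# Duplicate labels are allowed; is_unique tells us whether all labels are distinct.
dup = pd.DataFrame({'age': [20, 25, 23]}, index=['a', 'a', 'b'])
rows_for_a = dup.loc['a']   # label 'a' matches two rows, so this is a DataFrame

print(mutation_failed)        # True
print(dup.index.is_unique)    # False
print(len(rows_for_a))        # 2
```

Because a label lookup can return several rows when labels repeat, it is usually worth checking `is_unique` before relying on `.loc` to return a single row.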

## Setting an index

In [3]:
# We can set an index at df creation:
data = {
    'gender': ['M', 'F', 'F','M'],
    'subjid': [10011, 10010, 10014, 10013],
    'age':    [20, 25, 23, 26] 
}

df = pd.DataFrame(
    data,
    index   = ['Study123-10011', 'Study123-10010', 'Study123-10014', 'Study123-10013'],
    columns = ['subjid', 'age', 'gender']
)

# The index is no longer a default range of integers but the list of values we passed
df

Unnamed: 0,subjid,age,gender
Study123-10011,10011,20,M
Study123-10010,10010,25,F
Study123-10014,10014,23,F
Study123-10013,10013,26,M


In [4]:
# set_index(): set an index using a Series (from the df or not)

df = pd.DataFrame(data, columns=['subjid', 'gender', 'age'])

# At this point, the index is the default one: a range of integers
print("Default index:", list(df.index) )
df

Default index: [0, 1, 2, 3]


Unnamed: 0,subjid,gender,age
0,10011,M,20
1,10010,F,25
2,10014,F,23
3,10013,M,26


In [5]:
# Use subjid values as index
df.set_index(df.subjid, inplace=True)

# Similar, but passing the column name instead of the Series.
# Note: with a column name, the column is also removed from the columns (drop=True by default).
# df.set_index(['subjid'], inplace=True)

# Let's look at it
print("New index: ", list(df.index) )

df

New index:  [10011, 10010, 10014, 10013]


Unnamed: 0_level_0,subjid,gender,age
subjid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10011,10011,M,20
10010,10010,F,25
10014,10014,F,23
10013,10013,M,26
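
In the output above, `subjid` appears both as the index and as a regular column, because a Series was passed. As a small sketch of the alternative, passing the column name moves the column into the index, and `drop=False` keeps it in both places. The variable names `by_name` and `kept` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'subjid': [10011, 10010, 10014, 10013],
    'gender': ['M', 'F', 'F', 'M'],
    'age':    [20, 25, 23, 26],
})

# Passing a column name moves the column into the index (drop=True is the default)
by_name = df.set_index('subjid')

# drop=False keeps the column as well, like passing the Series df.subjid
kept = df.set_index('subjid', drop=False)

print(list(by_name.columns))   # ['gender', 'age']
print(list(kept.columns))      # ['subjid', 'gender', 'age']
```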


## Reindexing

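You can create a new df with the data __conformed__ to a new index: rows are reordered to match the new labels, and labels absent from the original index get missing values.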

In [6]:
data = {'gender': ['M', 'F', 'F','M'],
        'age':    [20, 25, 23, 26] }

df = pd.DataFrame(
    data,
    index   = [10011, 10010, 10014, 10013],
    columns = ['gender', 'age',]
)

df

Unnamed: 0,gender,age
10011,M,20
10010,F,25
10014,F,23
10013,M,26


In [7]:
new_index = [10010, 10011, 10012, 10013, 10014, 10015]
new = df.reindex(new_index)

new

Unnamed: 0,gender,age
10010,F,25.0
10011,M,20.0
10012,,
10013,M,26.0
10014,F,23.0
10015,,


In [8]:
# We can fill the missing values with whatever we want:
new = df.reindex(new_index, fill_value = 'missing')
new

Unnamed: 0,gender,age
10010,F,25
10011,M,20
10012,missing,missing
10013,M,26
10014,F,23
10015,missing,missing


In [9]:
# Replace the missing values with a value carried forward (ffill) or backward (bfill).
# Note that the index must be sorted first.
new = df.sort_index().reindex(new_index, method = 'ffill')
new

Unnamed: 0,gender,age
10010,F,25
10011,M,20
10012,M,20
10013,M,26
10014,F,23
10015,F,23


In [10]:
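# The difference between set_index() and reindex():
# set_index() relabels the existing rows, so it needs exactly one label per row.
# Passing 6 labels for a 4-row df therefore raises an error.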

data = {
    'gender': ['M', 'F', 'F','M'],
    'subjid': [10011, 10010, 10014, 10013],
    'age':    [20, 25, 23, 26] 
}

df = pd.DataFrame(
    data,
    columns = ['subjid', 'age', 'gender']
)

df = df.set_index(pd.Series( [10010, 10011, 10012, 10013, 10014, 10015]))
df

ValueError: Length mismatch: Expected 4 rows, received array of length 6
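
For contrast with the error above, here is a sketch of how the same six labels can be applied with `reindex()`, which aligns the existing rows to the requested labels rather than relabelling them. The names `new_labels` and `conformed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'subjid': [10011, 10010, 10014, 10013],
    'age':    [20, 25, 23, 26],
    'gender': ['M', 'F', 'F', 'M'],
})

new_labels = [10010, 10011, 10012, 10013, 10014, 10015]

# First use subjid as the index (4 labels for 4 rows), then conform to the 6 labels.
# Labels without a matching row (10012 and 10015) get missing values.
conformed = df.set_index('subjid').reindex(new_labels)

print(conformed.shape)                 # (6, 2)
print(conformed['age'].isna().sum())   # 2
```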

## Index sorting

Change the row order by sorting on the index. This is different from sorting on column values.

In [11]:
data = {'gender': ['M', 'F', 'F','M'],
        'age':    [20, 25, 23, 26] }

df = pd.DataFrame(
    data,
    index=  [10011, 10010, 10014, 10013],
    columns=['gender', 'age',]
)
df

Unnamed: 0,gender,age
10011,M,20
10010,F,25
10014,F,23
10013,M,26


In [12]:
df.index

Index([10011, 10010, 10014, 10013], dtype='int64')

In [13]:
# Sort by the row index. This is NOT done in place: a new df is returned
df.sort_index() 

Unnamed: 0,gender,age
10010,F,25
10011,M,20
10013,M,26
10014,F,23


In [14]:
# If we need to save the sorting: 

# option 1
df = df.sort_index() 

# option 2
df.sort_index(inplace=True) 

In [15]:
# Descending order:
df = df.sort_index(ascending=False)
df

Unnamed: 0,gender,age
10014,F,23
10013,M,26
10011,M,20
10010,F,25


In [16]:
df.index

Index([10014, 10013, 10011, 10010], dtype='int64')

In [17]:
# We can sort the column index too:
df = df.sort_index(axis=1)

# same:
# df.sort_index(axis='columns')
df

Unnamed: 0,age,gender
10014,23,F
10013,26,M
10011,20,M
10010,25,F
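
The section opened by noting that sorting on the index differs from sorting on column values. A short sketch of that contrast, using `sort_values()` on the same data (the names `by_label` and `by_age` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'gender': ['M', 'F', 'F', 'M'], 'age': [20, 25, 23, 26]},
    index=[10011, 10010, 10014, 10013],
)

by_label = df.sort_index()        # orders rows by index label
by_age = df.sort_values('age')    # orders rows by the values in the 'age' column

print(list(by_label.index))   # [10010, 10011, 10013, 10014]
print(list(by_age.index))     # [10011, 10014, 10010, 10013]
```

In both cases the index labels travel with their rows; only the row order changes.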


## Removing an index

In [18]:
patients = [10010, 10011, 10012]
data = {'gender': ['M', 'F', 'F'],
        'age':    [20, 25, 23],
       }

df = pd.DataFrame(data, index= patients, columns=['age', 'gender', 'race'])
df

Unnamed: 0,age,gender,race
10010,20,M,
10011,25,F,
10012,23,F,


In [19]:
# Let's get rid of this index
df.reset_index(inplace=True)

# or alternatively (the default is inplace=False)
# df = df.reset_index()

# The old index is now a regular column, and the index is back to the default: a range of integers
df

Unnamed: 0,index,age,gender,race
0,10010,20,M,
1,10011,25,F,
2,10012,23,F,
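
In the output above, the restored column is called `index` because the index had no name. Two variations that may be useful, sketched under the assumption that `subjid` is the name we want for that column:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)

# Name the index first so the restored column is called 'subjid', not 'index'
named = df.rename_axis('subjid').reset_index()

# drop=True discards the old index instead of turning it into a column
dropped = df.reset_index(drop=True)

print(list(named.columns))     # ['subjid', 'age', 'gender']
print(list(dropped.columns))   # ['age', 'gender']
print(list(dropped.index))     # [0, 1, 2]
```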


## Advanced: hierarchical indexes

Each axis of a dataframe can have a hierarchical index, i.e. multiple index levels.

In [20]:
df = pd.DataFrame(
    data    = np.arange(24).reshape(6, 4),
    index   = [[101, 101, 102, 102, 103, 103], ['Visit1', 'Visit2', 'Visit1', 'Visit2', 'Visit1', 'Visit2']],
    columns = [['Study_A', 'Study_A', 'Study_B', 'Study_B'],
               ['param1', 'param2', 'param1', 'param2']]
)
df.index.names= ['subjid','visit']
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Study_A,Study_A,Study_B,Study_B
Unnamed: 0_level_1,Unnamed: 1_level_1,param1,param2,param1,param2
subjid,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
101,Visit1,0,1,2,3
101,Visit2,4,5,6,7
102,Visit1,8,9,10,11
102,Visit2,12,13,14,15
103,Visit1,16,17,18,19
103,Visit2,20,21,22,23


In [21]:
# By the way, this is what np.arange(24) produces:
np.arange(24)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

In [22]:
np.arange(24).reshape(6, 4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [23]:
df.index

MultiIndex([(101, 'Visit1'),
            (101, 'Visit2'),
            (102, 'Visit1'),
            (102, 'Visit2'),
            (103, 'Visit1'),
            (103, 'Visit2')],
           names=['subjid', 'visit'])

In [24]:
# Selecting on the outer column level still works
df['Study_A']

Unnamed: 0_level_0,Unnamed: 1_level_0,param1,param2
subjid,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
101,Visit1,0,1
101,Visit2,4,5
102,Visit1,8,9
102,Visit2,12,13
103,Visit1,16,17
103,Visit2,20,21


In [25]:
# We can filter on values from the index levels
df[df.index.get_level_values(0) == 101] # 0 here means first level of index (i.e. subjid)

Unnamed: 0_level_0,Unnamed: 1_level_0,Study_A,Study_A,Study_B,Study_B
Unnamed: 0_level_1,Unnamed: 1_level_1,param1,param2,param1,param2
subjid,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
101,Visit1,0,1,2,3
101,Visit2,4,5,6,7
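
Filtering on an inner level works the same way with `get_level_values('visit')`. As an alternative sketch, `xs()` takes a cross-section on a named level; the variable name `visit1` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(24).reshape(6, 4),
    index=[[101, 101, 102, 102, 103, 103],
           ['Visit1', 'Visit2', 'Visit1', 'Visit2', 'Visit1', 'Visit2']],
    columns=[['Study_A', 'Study_A', 'Study_B', 'Study_B'],
             ['param1', 'param2', 'param1', 'param2']],
)
df.index.names = ['subjid', 'visit']

# Select every first visit; by default xs() drops the selected level from the result
visit1 = df.xs('Visit1', level='visit')

print(list(visit1.index))                       # [101, 102, 103]
print(visit1[('Study_A', 'param1')].tolist())   # [0, 8, 16]
```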


In [26]:
# .loc is probably easier to read
df.loc[102:103]

Unnamed: 0_level_0,Unnamed: 1_level_0,Study_A,Study_A,Study_B,Study_B
Unnamed: 0_level_1,Unnamed: 1_level_1,param1,param2,param1,param2
subjid,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
102,Visit1,8,9,10,11
102,Visit2,12,13,14,15
103,Visit1,16,17,18,19
103,Visit2,20,21,22,23


In [27]:
# You can swap the index levels
df.swaplevel('visit','subjid')

Unnamed: 0_level_0,Unnamed: 1_level_0,Study_A,Study_A,Study_B,Study_B
Unnamed: 0_level_1,Unnamed: 1_level_1,param1,param2,param1,param2
visit,subjid,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Visit1,101,0,1,2,3
Visit2,101,4,5,6,7
Visit1,102,8,9,10,11
Visit2,102,12,13,14,15
Visit1,103,16,17,18,19
Visit2,103,20,21,22,23
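
As the output above shows, `swaplevel()` only swaps the levels and leaves the row order unchanged. If the goal is to group rows by the new outer level, one option is to follow it with `sort_index()`. A sketch, with `swapped` and `grouped` as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(24).reshape(6, 4),
    index=[[101, 101, 102, 102, 103, 103],
           ['Visit1', 'Visit2', 'Visit1', 'Visit2', 'Visit1', 'Visit2']],
    columns=[['Study_A', 'Study_A', 'Study_B', 'Study_B'],
             ['param1', 'param2', 'param1', 'param2']],
)
df.index.names = ['subjid', 'visit']

swapped = df.swaplevel('visit', 'subjid')

# Sorting the swapped index groups the rows by visit, then by subjid
grouped = swapped.sort_index()

print(grouped.index.get_level_values('visit').tolist())
# ['Visit1', 'Visit1', 'Visit1', 'Visit2', 'Visit2', 'Visit2']
print(grouped.index.get_level_values('subjid').tolist())
# [101, 102, 103, 101, 102, 103]
```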


__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+