## Applications of reindexing

Reindexing can be useful if you have, or anticipate, missing data. For example, here's some data from three different rats that were timed running through a maze. This was an 8-day study, but not every rat was tested every day, so one animal may have data from days 1-3 and 5-8 of the study, but not day 4. When we import and combine the data, we want the index to reflect the days on which each rat was actually tested (e.g., if a rat missed day 4, we want its day 5 data to carry the label for day 5, even though it is only the fourth data point we have for that rat).

When we load in the data, we find that it doesn't have a column that reflects the days that each animal was tested on, so each data set is indexed sequentially from 0:

In [28]:
maze_files = ['maze_data_1.csv', 'maze_data_2.csv', 'maze_data_3.csv']
maze_list = []
for filename in maze_files:
    maze_list.append(pd.read_csv(filename))
    
maze_list

[   maze_time
 0       6.00
 1       7.56
 2       2.17
 3       2.39
 4       5.60
 5       8.94
 6       2.95
 7       3.30,
    maze_time
 0       7.32
 1       4.12
 2       6.28
 3       4.20
 4       2.11
 5       4.98
 6       7.44,
    maze_time
 0       2.55
 1       4.00
 2       6.00
 3       8.38
 4       6.53
 5       3.01]

Although the files don't record the testing days, we have a record (written in a lab notebook) of which days each rat was tested on:

In [29]:
r1_days = ['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8']
r2_days = ['day1', 'day2', 'day3', 'day5', 'day6', 'day7', 'day8']
r3_days = ['day1', 'day2', 'day4', 'day5', 'day6', 'day7']

So we can use this information to label each animal's rows with the days on which it was tested, and then use the `.reindex()` method to add rows for the days that are missing. First we add the list of days on which testing actually occurred to each rat's DataFrame as a new column, so that we can then set that column as the index.

(Note: there are ways to do this more efficiently using looping, but for now we'll focus on reindexing and not looping.)

In [30]:
maze_list[0]['days'] = r1_days
maze_list[1]['days'] = r2_days
maze_list[2]['days'] = r3_days

maze_list

[   maze_time  days
 0       6.00  day1
 1       7.56  day2
 2       2.17  day3
 3       2.39  day4
 4       5.60  day5
 5       8.94  day6
 6       2.95  day7
 7       3.30  day8,
    maze_time  days
 0       7.32  day1
 1       4.12  day2
 2       6.28  day3
 3       4.20  day5
 4       2.11  day6
 5       4.98  day7
 6       7.44  day8,
    maze_time  days
 0       2.55  day1
 1       4.00  day2
 2       6.00  day4
 3       8.38  day5
 4       6.53  day6
 5       3.01  day7]

Then we specify the `days` column as the index in each DataFrame:

In [31]:
maze_list[0] = maze_list[0].set_index('days')
maze_list[1] = maze_list[1].set_index('days')
maze_list[2] = maze_list[2].set_index('days')

maze_list

[      maze_time
 days           
 day1       6.00
 day2       7.56
 day3       2.17
 day4       2.39
 day5       5.60
 day6       8.94
 day7       2.95
 day8       3.30,
       maze_time
 days           
 day1       7.32
 day2       4.12
 day3       6.28
 day5       4.20
 day6       2.11
 day7       4.98
 day8       7.44,
       maze_time
 days           
 day1       2.55
 day2       4.00
 day4       6.00
 day5       8.38
 day6       6.53
 day7       3.01]
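As the note above mentions, the last two steps can also be written as a loop. The following is only a sketch: so that it runs on its own, it rebuilds the three DataFrames from the maze times shown in the output above instead of reading the CSV files, and the names `maze_times` and `days_list` are introduced here purely for illustration.

```python
import pandas as pd

# Maze times copied from the output above, standing in for the three CSV files
maze_times = [
    [6.00, 7.56, 2.17, 2.39, 5.60, 8.94, 2.95, 3.30],
    [7.32, 4.12, 6.28, 4.20, 2.11, 4.98, 7.44],
    [2.55, 4.00, 6.00, 8.38, 6.53, 3.01],
]

r1_days = ['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8']
r2_days = ['day1', 'day2', 'day3', 'day5', 'day6', 'day7', 'day8']
r3_days = ['day1', 'day2', 'day4', 'day5', 'day6', 'day7']
days_list = [r1_days, r2_days, r3_days]  # hypothetical name, same order as the rats

maze_list = []
for times, days in zip(maze_times, days_list):
    df = pd.DataFrame({'maze_time': times})
    df['days'] = days                        # add the testing days as a column
    maze_list.append(df.set_index('days'))   # then use that column as the index

print(maze_list[1])
```

Pairing each DataFrame with its list of days through `zip()` keeps the two in step, so adding a fourth rat would only mean adding one more entry to each list.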

So now each rat's DataFrame is indexed by day labels, and days with no data are simply absent from that rat's DataFrame. If we want every DataFrame to contain a row (and index label) for every day of the study, we can use `.reindex()`, which conforms a DataFrame to a new index and fills any rows that have no data with `NaN`. As it happens, the first rat's data is complete: it has data from all of the study days. So we can use its index as the index for the other datasets. (If we didn't have one complete data set, we could make a list containing all the index labels we wanted, and use that to reindex.)

Pay attention below to the fact that the input to `.reindex()` is `maze_list[0].index` — the *index* of `maze_list[0]` — and not just `maze_list[0]`.

In [32]:
maze_list[1] = maze_list[1].reindex(maze_list[0].index)
maze_list[2] = maze_list[2].reindex(maze_list[0].index)

maze_list

[      maze_time
 days           
 day1       6.00
 day2       7.56
 day3       2.17
 day4       2.39
 day5       5.60
 day6       8.94
 day7       2.95
 day8       3.30,
       maze_time
 days           
 day1       7.32
 day2       4.12
 day3       6.28
 day4        NaN
 day5       4.20
 day6       2.11
 day7       4.98
 day8       7.44,
       maze_time
 days           
 day1       2.55
 day2       4.00
 day3        NaN
 day4       6.00
 day5       8.38
 day6       6.53
 day7       3.01
 day8        NaN]
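If no rat had a complete record, the same result could be obtained by building the full list of day labels by hand and passing that to `.reindex()`. Here is a minimal sketch using the third rat's data from above; the names `rat3`, `all_days`, and `rat3_full` are assumptions made for this example, and the DataFrame is constructed directly so the code runs without the CSV file.

```python
import pandas as pd

# Rat 3's data, indexed by the days it was actually tested (values from above)
rat3 = pd.DataFrame(
    {'maze_time': [2.55, 4.00, 6.00, 8.38, 6.53, 3.01]},
    index=pd.Index(['day1', 'day2', 'day4', 'day5', 'day6', 'day7'], name='days'),
)

# Build the complete set of labels ourselves: 'day1' through 'day8'
all_days = ['day' + str(n) for n in range(1, 9)]

# Labels in all_days that rat3 lacks (day3 and day8) become rows filled with NaN
rat3_full = rat3.reindex(all_days)
print(rat3_full)
```

Because `.reindex()` returns a new DataFrame, `rat3` itself is left unchanged; the result has to be assigned to a name, as was done with `maze_list[1]` and `maze_list[2]` above.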

Reindexing is useful in some cases, but it isn't always necessary. For example, if we wanted to merge the data from all three rats into one DataFrame, we could do so (similar to the example with RT data earlier) without reindexing. When a column is assigned to a DataFrame, pandas aligns it on the index labels and automatically fills in any missing data with `NaN`s:

In [33]:
maze_files = ['maze_data_1.csv', 'maze_data_2.csv', 'maze_data_3.csv']

r1_days = ['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8']
r2_days = ['day1', 'day2', 'day3', 'day5', 'day6', 'day7', 'day8']
r3_days = ['day1', 'day2', 'day4', 'day5', 'day6', 'day7']

maze_list = []
for filename in maze_files:
    maze_list.append(pd.read_csv(filename))


maze_list[0]['days'] = r1_days
maze_list[1]['days'] = r2_days
maze_list[2]['days'] = r3_days

maze_list[0] = maze_list[0].set_index('days')
maze_list[1] = maze_list[1].set_index('days')
maze_list[2] = maze_list[2].set_index('days')

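# Start from rat 1, whose index contains every study day
rat_df = maze_list[0]
rat_df = rat_df.rename(columns={'maze_time':'r1'})

# Select the maze_time column explicitly; pandas aligns it on the 'days' index
rat_df['r2'] = maze_list[1]['maze_time']
rat_df['r3'] = maze_list[2]['maze_time']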
rat_df

        r1    r2    r3
days
day1  6.00  7.32  2.55
day2  7.56  4.12  4.00
day3  2.17  6.28   NaN
day4  2.39   NaN  6.00
day5  5.60  4.20  8.38
day6  8.94  2.11  6.53
day7  2.95  4.98  3.01
day8  3.30  7.44   NaN
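Another way to get the same combined table, shown here only as a sketch of an alternative and not as the approach this lesson uses, is `pd.concat()` with `axis=1`, which also aligns its inputs on their index labels. The helper `make_rat()` is hypothetical and exists only so the example runs without the CSV files; the values are those shown above.

```python
import pandas as pd

def make_rat(times, days):
    """Hypothetical helper: one rat's DataFrame, indexed by the days it was tested."""
    return pd.DataFrame({'maze_time': times}, index=pd.Index(days, name='days'))

r1 = make_rat([6.00, 7.56, 2.17, 2.39, 5.60, 8.94, 2.95, 3.30],
              ['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8'])
r2 = make_rat([7.32, 4.12, 6.28, 4.20, 2.11, 4.98, 7.44],
              ['day1', 'day2', 'day3', 'day5', 'day6', 'day7', 'day8'])
r3 = make_rat([2.55, 4.00, 6.00, 8.38, 6.53, 3.01],
              ['day1', 'day2', 'day4', 'day5', 'day6', 'day7'])

# Pass one Series per rat; keys supplies the column names of the result.
# Days missing for a given rat are filled with NaN during alignment.
rat_df = pd.concat(
    [r1['maze_time'], r2['maze_time'], r3['maze_time']],
    axis=1,
    keys=['r1', 'r2', 'r3'],
)
print(rat_df)
```

Unlike column assignment, which keeps only the rows already present in `rat_df`, `pd.concat()` by default keeps the union of all the index labels, so it does not depend on the first rat having complete data.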
