# Indexing by label and position

## Indexing into Series

From the [What is a Series section](what-is-a-series), remember our maxim:

> A *Series* is the association of:
>
> * An array of values (`.values`)
> * A sequence of labels for each value (`.index`)
> * A name (which can be `None`).

On this page we take a brief look at Series `.name`s, and show that they can be
useful when creating Data Frames from Series.

Then we look at the Index for Series and Data Frames, and the default index
that Pandas creates if you do not specify one.

The default Index that Pandas makes reminds us of the differences between
label indexing (using `.loc`) and position (integer) indexing (using `.iloc`);
we show that current Pandas allows some ambiguity about what type of indexing
you are doing when using *direct indexing*.

Because of this potential ambiguity, it is often sensible to replace Pandas' default index with
a custom index, to avoid accidental errors when indexing.

## Getting started

In [None]:
# import libraries
import numpy as np
import pandas as pd

We'll use the [fertility and Human Development Index data once
more](data/data_notes).

In [None]:
# Three letter codes for each country
country_codes_array = np.array(['AUS', 'BRA', 'CAN',
                                'CHN', 'DEU', 'ESP',
                                'FRA', 'GBR', 'IND',
                                'ITA', 'JPN', 'KOR',
                                'MEX', 'RUS', 'USA'])

In [None]:
# Human Development Index Scores for each country
hdis_array = np.array([0.896, 0.668, 0.89,
                       0.586, 0.89,  0.828,
                       0.844, 0.863, 0.49,
                       0.842, 0.883, 0.824,
                       0.709, 0.733, 0.894])

## Slicing Series with `.iloc` and `.loc`

In [None]:
hdi_series = pd.Series(hdis_array, index=country_codes_array)
hdi_series

There is a fundamental difference between the behaviors of `.iloc` and `.loc`
when slicing.

Standard slicing in Python uses integers to specify positions, and gives the
elements *starting at* the start position, *up to but not including* the stop
position.

In [None]:
my_name = 'Peter Rush'
# From character at position 2, up to (not including) position 7.
my_name[2:7]

The same rule applies to indexing Python lists, or Numpy arrays:

In [None]:
# From element at position 2, up to (not including) position 7.
country_codes_array[2:7]

`.iloc` is indexing by *position*, so it may not be surprising that it slices using the same rules as by-position indexing in Numpy:

In [None]:
# From element at position 2, up to (not including) position 7.
hdi_series.iloc[2:7]

Now consider slicing by *label*.  The *start* and *stop* values are no longer
positions, but labels.   The label at position 2 is `'CAN'`.  The label at
position 7 is the until-recently-European country `'GBR'`.

Here's what we get from slicing using `.loc`:

In [None]:
# From element labeled 'CAN', up to (including) element labeled 'GBR'
hdi_series.loc['CAN':'GBR']

First, notice that label indexing uses values from the Index as start and stop.  Unlike Numpy or `.iloc` indexing, which by definition have integers as start and stop (because these are positions), `.loc` slicing needs start and stop values that match values in the Index.  In this case, the Index has `str` values, so the start and stop values are also `str`.

Second, notice that we got one more value from `.loc` indexing into the Series, because `.loc` slicing, unlike `.iloc` or Numpy slicing, *includes* the stop value.

In the last cell, `'GBR'` was the stop value, and we got the row corresponding
to `'GBR'`.

This is a major difference from Numpy and `.iloc` behavior.
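
As a compact check of the two slicing rules, here is a minimal sketch on a small made-up Series.  The name `letters_series` and its values are our own assumptions for illustration; they are not part of the HDI data.  Position 1 holds the label `'b'` and position 3 holds the label `'d'`, so the two slices below use matching start and stop points.

```python
import pandas as pd

# A small, made-up Series with string labels (illustration only).
letters_series = pd.Series([10, 20, 30, 40, 50],
                           index=['a', 'b', 'c', 'd', 'e'])

# Position slicing: start at position 1, stop *before* position 3.
by_position = letters_series.iloc[1:3]

# Label slicing: start at label 'b', stop *at* (including) label 'd'.
by_label = letters_series.loc['b':'d']

# Show the labels each slice selected.
print(list(by_position.index))
print(list(by_label.index))
```

The `.loc` slice has one more element than the `.iloc` slice, because it includes the row for the stop label.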

::: note

**Stop and `.loc`**

Why does `.loc` slicing return the label corresponding to the stop value, instead of going *up to but not including* the stop value, like Numpy or `.iloc`?

We should say that this is absolutely the right choice.  But why?

Please consider reflecting before reading on.

[Elevator Muzak while you reflect](https://www.youtube.com/watch?v=XlDdrrFY4Ug)

Please click the link above to get you into a reflective mood.

Back to slicing; let's consider the problem of selecting some rows that you
want.  You can see the Index.  In your case you want all the rows from `CAN`
through `GBR`.  When the result includes the stop label, then it's obvious what
to do; you do what you did above:

In [None]:
# From element labeled 'CAN', up to (including) element labeled 'GBR'
hdi_series.loc['CAN':'GBR']

Now consider the alternative — where slicing gives you the rows *up to but not
including* the stop value.  Your problem now becomes annoying and error-prone.
You have to look at the index, identify the last label for the row you want
(`'GBR'`) and then go one row further, and get the label for the row *after*
the one you want (in this case `'IND'`).  Indexing to get rows `'CAN'` through
`'GBR'` would be `hdi_series.loc['CAN':'IND']`.  Now imagine that for some
reason I had deleted the `'IND'` row, so the following row label is `'ITA'`. In
that case, despite the fact nothing had changed in the rows I'm interested in,
I now have to write `hdi_series.loc['CAN':'ITA']` to get the exact same rows.

So, yes, it's important to remember this difference, but a little reflection
should reveal that this was still the right choice.
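
To make the argument concrete, here is a small sketch using a subset of the rows from `hdi_series`.  The names `small_series`, `without_ind` and the two `wanted` variables are hypothetical, chosen for this illustration.  We delete the row after `'GBR'` and check that the same label slice still selects the same rows.

```python
import pandas as pd

# A subset of the HDI rows, for illustration.
codes = ['CAN', 'CHN', 'DEU', 'GBR', 'IND', 'ITA']
small_series = pd.Series([0.89, 0.586, 0.89, 0.863, 0.49, 0.842],
                         index=codes)

# The rows we want, using the inclusive stop label.
wanted = small_series.loc['CAN':'GBR']

# Delete the row that follows the rows we want.
without_ind = small_series.drop('IND')

# The same slice expression still selects the same rows.
wanted_again = without_ind.loc['CAN':'GBR']
print(wanted.equals(wanted_again))
```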

:::

## Index labels need not be unique

We haven't specified so far, but there is no general requirement for Pandas Index values to be unique.  Consider the following Series:

In [None]:
not_unique_labels = pd.Series(['France', 'Italy', 'United Kingdom', 'Great Britain'],
                              index=['FRA', 'ITA', 'GBR', 'GBR'])
not_unique_labels

`.loc` matching a label with only one row returns the corresponding value:

In [None]:
not_unique_labels.loc['FRA']

`.loc` matching a label with more than one row returns a subset of the Series:

In [None]:
not_unique_labels.loc['GBR']

This can lead to confusing outputs if you don't keep track of whether the Index values uniquely identify the row.
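
One way to guard against this, sketched below, is to check the Index's `is_unique` attribute before relying on `.loc` to return a single value.  The Series `dup_series` is a made-up example along the lines of the one above.

```python
import pandas as pd

# A made-up Series with a repeated label.
dup_series = pd.Series(['France', 'United Kingdom', 'Great Britain'],
                       index=['FRA', 'GBR', 'GBR'])

# `is_unique` is True only when no label is repeated.
print(dup_series.index.is_unique)

# A label matching one row gives back the value itself.
one_match = dup_series.loc['FRA']
# A label matching two rows gives back a Series.
two_matches = dup_series.loc['GBR']

print(type(one_match))
print(type(two_matches))
```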

## Series, Data Frames, and the default index

Thus far, we have specified the Index in building Series:

In [None]:
hdi_series = pd.Series(hdis_array, index=country_codes_array)
hdi_series

Pandas allows us to build a Series without specifying an Index:

In [None]:
# Make a Series from `hdis_array`, without specifying `index` or `name`.
hdi_series_def_index = pd.Series(hdis_array)
hdi_series_def_index

Where we did not specify an Index, Pandas has automatically generated one.  As
you can see, Pandas displays this default index as a sequence of integers,
starting at 0, and going up to the number of rows minus 1.

Let's take a closer look at the default Index:

In [None]:
# The default Pandas index
hdi_series_def_index.index

`RangeIndex` is similar to Python's `range`; it is a space-saving container
that represents a sequence of integers from a start value up to, but not
including, a stop value, with an optional step size.  Here `RangeIndex`
represents the numbers 0 through 14, just as `range` can represent the numbers
0 through 14:

In [None]:
zero_through_14 = range(0, 15)
zero_through_14

As for `range`, we can ask the `RangeIndex` container to give up these numbers
(by iteration) into another container, such as an array or list:

In [None]:
# Iterating through `RangeIndex` to give the represented numbers.
np.array(hdi_series_def_index.index)

In [None]:
# Iterating through a `range` to give the represented numbers.
np.array(zero_through_14)

As for `range`, one can ask for the implied elements by indexing:

In [None]:
# View the fifth element of the RangeIndex.
fifth_element = hdi_series_def_index.index[4]
fifth_element

Notice that the elements from `RangeIndex` are `int`s:

In [None]:
type(fifth_element)

For all practical purposes, you can treat this `RangeIndex` as being equivalent
to the corresponding sequential Numpy integer array.
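
As a brief sketch of that equivalence, the cell below builds a short made-up Series (the name `default_series` is our own) and compares its default Index with the matching `range`.  It uses the `start`, `stop` and `step` attributes of `RangeIndex`, which mirror the attributes of the same names on `range`.

```python
import numpy as np
import pandas as pd

# A short, made-up Series with no index specified.
default_series = pd.Series([0.9, 0.7, 0.5, 0.3])
default_index = default_series.index

# A RangeIndex stores only start, stop and step, like `range`.
print(default_index.start, default_index.stop, default_index.step)

# Expanding it gives the same integers as the matching `range`.
as_array = np.array(default_index)
print(as_array)
print(list(default_index) == list(range(0, 4)))
```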

**Start of exercise**

Let's make another Series where we do not specify the index:

In [None]:
a_series = pd.Series([1000, 999, 101, 199, 99])
a_series

As you have seen, you will have got the default `.index`, a `RangeIndex`:

In [None]:
a_series.index

What do you expect to see for `list(a_series)`?  Reflect, then uncomment below
and try it:

In [None]:
# list(a_series)

What do you expect to see for `list(a_series.index)`?  Reflect, then try it:

In [None]:
# list(a_series.index)

The Series method `.sort_values` returns a new Series sorted by the values.

In [None]:
sorted_series = a_series.sort_values()

Now what do you expect to see for `list(sorted_series)`?  Reflect, then
uncomment below and try it:

In [None]:
# list(sorted_series)

How about `list(sorted_series.index)`?  Reflect, try:

In [None]:
# list(sorted_series.index)

What kind of thing do you think the `.index` is now?  Reflect and then:

In [None]:
# type(sorted_series.index)

Can you explain the result of the last cell?

**End of exercise**

**See the [corresponding page](/pandas_from_numpy/0_1_to_loc_or_iloc.html) for solution**

## Why an Index of integers can be confusing

To recap: for our first few Series, we've used three-letter country codes as
the elements of an `index`. We've just seen what happens if we construct a
Series without telling Pandas what to use as an `index`: it will create
a default `RangeIndex`. `RangeIndex` represents a sequence of integers.

If you did the exercise, you will have found that Pandas can use `RangeIndex`
when the index is a regular sequence of integers, but must otherwise switch to
an index that stores the integer labels in an array.

What is the advantage of using an index with values that aren't integers — such
as strings? Below are some potential pitfalls to be aware of when using the
default index, and any other index made up of integers.

Let's say we want to access the fifth element of the Series. This is at integer
location 4, because we count from 0. At the moment the numerical labels implied
by the `RangeIndex` "line up" with the integer-based locations:

In [None]:
# Show the whole Series
hdi_series_def_index

If you ask for element `4`, there is no ambiguity about which element you mean,
because the value with label `4` is also the element at integer position `4`.
Therefore, if we use integer indexing (`.iloc`) we get the same value as if we
use label based indexing (`.loc`):

In [None]:
# Indexing using integer location
hdi_series_def_index.iloc[4]

In [None]:
# Indexing using labels (from the default index)
hdi_series_def_index.loc[4]

What if we don't tell Pandas what type of indexing we want to do?  Meaning, we
do not use `.iloc` or `.loc`, we just use the sort of [direct
indexing](direct-indirect) we would use on a Python list or array?

In [None]:
# Direct indexing
hdi_series_def_index[4]

There is no ambiguity as to what `4` refers to, so it may not be surprising
that `.iloc`, `.loc` and direct indexing all give the same result.

**But this will not always be the case.** Certain functions and methods that
we use to sort and organise our data can cause misalignment between the
*integer labels* and the *integer position* of a given element of the Series.

For instance let's sort the data in our `hdi_series_def_index` Series in
ascending order.  To do this we will use the `.sort_values()` method. We will
cover Pandas methods in detail on [later
pages](0_2_pandas_dataframes_attributes_methods.Rmd). For now the
`.sort_values()` method sorts the values of the Series in ascending order,
taking the matching labels in the index with it.

In [None]:
# Sorting the *values* in ascending order
hdi_series_def_index_sorted = hdi_series_def_index.sort_values()
hdi_series_def_index_sorted

Look at the left hand side of the display from the cell above — in particular,
look at the Index.  The numbers within the Index no longer run sequentially
from 0 to 14. This means that the integer location of each element in the
Series no longer matches up with the index label. This can be a potential
source of errors.

::: note

**The index type can change if you rearrange elements**

If you haven't done the exercise above, please consider doing it.

If you have, you will have found already that the sorted Series has a new
Index, that is no longer a `RangeIndex` (because the integer labels now cannot
be represented as a regular sequence of integers):

In [None]:
type(hdi_series_def_index_sorted.index)

:::

Let's see what happens if we try to access the fifth element of the series
using integer based indexing (`.iloc[4]`), label based indexing (`.loc[4]`)
and direct indexing (`[4]`) as we did above.

As you remember, when we did this on the data before sorting, all these methods
returned the same value.  Now, however:

In [None]:
# Integer indexing on the sorted data
# This is the fifth element in the Series.
hdi_series_def_index_sorted.iloc[4]

In [None]:
# Label indexing on the sorted data
# This is the element with the label `4`.
hdi_series_def_index_sorted.loc[4]

In [None]:
# Direct indexing on the sorted data
# Which is this?  Position or label?
hdi_series_def_index_sorted[4]

We have used the number 4 with each indexing method, yet have gotten back
different values for `.iloc` compared to `.loc` and direct indexing.

This is a pitfall of using sequential numbers as the index — as generated, for
example, by `RangeIndex` — it can lead to confusing results when the position
in the sequence and the `int` label of an element of the Series do not match
up.
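
The same effect is easy to reproduce on a tiny example.  The sketch below uses a made-up three-element Series (`scores` is a hypothetical name) and sticks to `.iloc` and `.loc`, so the result does not depend on how direct indexing behaves in your version of Pandas.

```python
import pandas as pd

# A made-up Series with the default index: labels 0, 1, 2.
scores = pd.Series([30, 10, 20])

# Before sorting, label 0 is also at position 0.
print(scores.iloc[0], scores.loc[0])

# After sorting, the values are 10, 20, 30 and the labels are 1, 2, 0.
sorted_scores = scores.sort_values()
print(list(sorted_scores.index))

# Position 0 and label 0 now refer to different elements.
print(sorted_scores.iloc[0])
print(sorted_scores.loc[0])
```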

Compare this to our `hdi_series` which uses the
three-letter country codes as its index:

In [None]:
# Show the `hdi_series`
hdi_series

Let's get the fifth element using integer based (`.iloc`) indexing:

In [None]:
# Integer indexing
hdi_series.iloc[4]

...and let's try to use `.loc[4]` on this Series (this will generate an error):

In [None]:
# Label indexing raises a KeyError ...
hdi_series.loc[4]

This `KeyError` tells us that there is no index label `4` (which makes sense
as the index labels in this Series are three-letter country codes). To use
`.loc` with this Series, we must use the three-letter country code strings:

In [None]:
# Label based indexing
hdi_series.loc['DEU']

It is much harder to get confused when using integer indices with *indirect
indexing* (`.loc` and `.iloc`).  You've specified what you mean (by label or by
position) using the name of the method.  However, things can get dangerously
confusing if you use an integer index and *direct indexing*.

Just to remind you, `hdi_series` has the country codes
(strings like `'DEU'`) as the index.

Now, consider, what would happen if we used an integer for *direct indexing*?
As in something like `hdi_series[4]`?  Because we haven't
specified that we want to index with labels (`.loc`) or positions (`.iloc`),
Pandas has to make some decision as to how to proceed.

**Start of exercise**

We assume you've just read the text above the exercise, where we consider what
you would expect to happen if:

* Your Series has an index of strings.
* You use direct indexing on this Series with an integer.

As in `hdi_series[4]`. (Don't try it yet).

Pause and reflect on what decision you would make in this situation, if you were
a Pandas developer, deciding what Pandas should do.  What are the options? Why
would you choose one option over another?

**End of exercise**

**See the [corresponding page](/pandas_from_numpy/0_1_to_loc_or_iloc.html) for solution**

You are about to see that direct indexing on a Series, for now, does something
frightening, which is to *guess* whether we mean `.loc` or `.iloc` indexing,
depending on whether the index values are integers.

So, as you have already seen above, if the index consists of integers, and you
specify integers in your direct indexing, then Pandas will assume you mean the
values to be labels (like `.loc`).

If the index does not consist of integers, and you specify integers in your
direct indexing, then Pandas will currently assume you mean the values to be
positions (like `.iloc`), but (at time of writing) give you a warning that
this will soon change.

In [None]:
# Direct indexing
hdi_series[4]

Using a custom non-integer index (e.g. the three-letter country codes) rather
than the default `RangeIndex`, or some other integer index, has the advantage
of avoiding potential confusion between the integer location of an element,
and the index label of that element.

To demonstrate this, let's sort our `hdi_series` in ascending order:

In [None]:
# Sorting the Series in ascending order
hdi_series_sorted = hdi_series.sort_values()
hdi_series_sorted

The use of custom string-based labels in the index (e.g. `FRA`, `AUS` etc)
avoids confusing misalignment between the default numerical labels and integer
location.

It's good and safe practice to explicitly specify `.loc` or `.iloc` when
indexing a Series, in order not to confuse Pandas as to whether you mean to
index by label or position.   In this case `.loc` means we have to use
a string, preventing errors where we use a number and return data we do not
expect.

In [None]:
# Label-based indexing
hdi_series_sorted.loc['DEU']
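
Here is a short sketch of both points, using the first three rows of the HDI data in a Series we have called `hdi_small` (a hypothetical name for this illustration).  A string label finds the same value before and after sorting, and `.loc` rejects an integer because no such label exists.

```python
import pandas as pd

# The first three rows of the HDI data, for illustration.
hdi_small = pd.Series([0.896, 0.668, 0.89], index=['AUS', 'BRA', 'CAN'])
hdi_small_sorted = hdi_small.sort_values()

# The label finds the same value before and after sorting.
print(hdi_small.loc['BRA'], hdi_small_sorted.loc['BRA'])

# An integer is not a label in this Index, so `.loc` raises a KeyError.
try:
    hdi_small_sorted.loc[0]
    got_key_error = False
except KeyError:
    got_key_error = True
print(got_key_error)
```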

::: warning

**Direct indexing is not currently consistent**

As an extra warning, as Pandas shifts towards more explicit choice of labels over positions in direct indexing, there are still inconsistencies.  These will be resolved over time, so if you want to avoid confusion, skip the rest of this note, and remember *be explicit about labels or positions with `.loc` or `.iloc` unless you have good reason not to*.

If you got this far, we admire your courage.  This warning is only to say that Pandas currently treats *slices* in direct indexing differently from individual positions or labels.  Specifically, at the moment, it will always assume integers in slices are positions and not labels.  Try some experiments with `hdi_series[:5]` (string label Series) and `hdi_series_def_index[:5]` (integer label Series).

You may be confused after doing that.  And this behavior will surely change at some point.  Summary — use `.iloc` and `.loc` to avoid ambiguity.

:::

We repeat, a very good additional maxim is: **use `.loc` and `.iloc` unless
there is a good reason not to!**
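
The maxim applies to slices as well as to single elements.  As a sketch, the cell below sorts a made-up Series with the default index (`int_series` is a hypothetical name), then takes an explicit position slice and an explicit label slice using the same stop value.

```python
import pandas as pd

# A made-up Series with the default index, sorted by value.
# After sorting, the values are 10, 20, 30 and the labels are 0, 2, 1.
int_series = pd.Series([10, 30, 20]).sort_values()

# Position slice: everything before position 1.
first_by_position = int_series.iloc[:1]

# Label slice: everything up to and including the label 1.
up_to_label_1 = int_series.loc[:1]

print(list(first_by_position))
print(list(up_to_label_1))
```

The stop value is `1` in both cases, but the position slice selects one element, whereas the label slice runs all the way to the row labeled `1`, which sorting has moved to the end.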

## Default names for Data Frame columns

Remember, a Data Frame is a dictionary-like collection of Series.

We often create Data Frames with an actual dictionary of Series, or with single Series.

When we build Data Frames from Series, it becomes eminently sensible to specify
a `name` attribute for each Series.

Pandas will not *force* us to do this, but it leads to some error-prone
consequences if we do not. In fact, the default `RangeIndex` crops up again
here, and can create confusion in similar ways to the ones we have seen in the
last section.

Let us reconstruct a Series with a specified Index, but no `.name`:

In [None]:
hdi_series = pd.Series(hdis_array, index=country_codes_array)
hdi_series

Our `hdi_series` got the default value for its `.name` attribute: `None`.

In [None]:
hdi_series.name is None

Let's pass this Series to the `pd.DataFrame()` constructor, to see the
consequence of a `.name`less Series in this context.  We will call the resulting
Data Frame `no_name_df`:

In [None]:
# A Data Frame made of a Series with `name` attribute of None.
no_name_df = pd.DataFrame(hdi_series)
no_name_df

OK, so in the absence of a `name` attribute, Pandas has labelled the column with a `0`. If we inspect more deeply, we find that Pandas, not having been instructed otherwise, has created a `RangeIndex`, but this time for the `columns` (i.e. the column names) of the Data Frame. If you look at the Data Frame above, you can see that the `index` attribute is the three-letter country codes. However, the `.columns` are a `RangeIndex`:

In [None]:
# Look at the column names via the `.columns` attribute
no_name_df.columns

Sure enough, if we check the `type` of the first element in this `RangeIndex`,
it is an `int`; we can think of it not as a column *name* but a number
standing in for a column name:

In [None]:
# Check the type of the first element in the `.columns` attribute
type(no_name_df.columns[0])

It may be obvious why this naming convention for Data Frame columns can lead
to errors for reasons of low interpretability. We typically want our column
names to be descriptive of the data in the column, to ourselves and to other
people reading our code. If we do not specify a `name` attribute for our
Series when creating Data Frames, the default numerical column names supplied
by Pandas are hard to interpret, and it is easy to misinterpret, or to forget
what data is in that column, leading to human errors.

They can also lead to indexing errors, similar to those we saw in the previous
section. To demonstrate this, let's compare the `name`less Data Frame above to
a Data Frame created from a Series with a `name`.

Now let us create a Series with index and not-default name:

In [None]:
hdi_series_named = pd.Series(
    hdis_array,
    index=country_codes_array,
    name='Human Development Index')
hdi_series_named

Our aptly named `hdi_series_named` has a `name` attribute (look at the
last line of the output from the cell above, the `Name: Human Development
Index`).

Let's call the `pd.DataFrame()` constructor on this Series - we'll call the
resultant Data Frame `named_df`:

In [None]:
# Create a new Data Frame, with a sensible name for the column
named_df = pd.DataFrame(hdi_series_named)
named_df

We see that Pandas has automatically used the `name` attribute as the column
name, hugely increasing the interpretability of the resulting Data Frame.

We can see the `name` in the `.columns` attribute of the new  `named_df` Data
Frame:

In [None]:
# Show the column names from the `named_df` Data Frame
named_df.columns
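
To put the two cases side by side, here is a minimal sketch with a two-row Series.  It assumes you already have an unnamed Series and want to give it a name afterwards; one way to do that is the Series `.rename` method, which returns a new Series with the given name when passed a single string.  The names `unnamed` and `named` are our own.

```python
import pandas as pd

# A made-up, unnamed Series using two rows of the HDI data.
unnamed = pd.Series([0.896, 0.668], index=['AUS', 'BRA'])
print(unnamed.name is None)

# `.rename` with a string returns a new Series with that name.
named = unnamed.rename('Human Development Index')

# The column label comes from the Series name, when there is one.
print(list(pd.DataFrame(unnamed).columns))
print(list(pd.DataFrame(named).columns))
```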

Now, let's try using direct indexing with each Data Frame, using the column names.

This is very straightforward for the `named_df`:

In [None]:
# Direct indexing to retrieve a column by name
named_df['Human Development Index']

What about for `no_name_df`? Well, the column name there is `0`, of `int`
type.

What happens if we use direct indexing? We are hoping we see something like the
output of the cell above, albeit with a `0` for the `name`, rather than `Human
Development Index`:

In [None]:
# Direct indexing with no `name` attribute
no_name_df[0]

Sure enough that is what we see. But the operation we have used looks very
much like *integer* indexing on a Series, which will return a single value:

In [None]:
# Integer indexing a Series (without using `.iloc`)
hdi_series[0]

This can be confusing.

The numerical label can also be confusing if we introduce more columns,
especially if we mix in columns which do have `name` attributes that we
specify.

For instance, if we add in the full name of each country:

In [None]:
# Country names array
country_names_array = np.array(['Australia', 'Brazil', 'Canada',
                                'China', 'Germany', 'Spain',
                                'France', 'United Kingdom', 'India',
                                'Italy', 'Japan', 'South Korea',
                                'Mexico', 'Russia', 'United States'])
country_names_array

Let's add this into the `no_name_df`, using the `name` `'Country Names'`:

In [None]:
no_name_df['Country Names'] = country_names_array
no_name_df

We now have one column that has an `int` as its column name, and another with
a string.

In the same way as for a numerical `index`, this situation can become
confusing if the numerical labels become misaligned with their integer
location in the `.columns` attribute.

For instance, we could reverse the order of the columns, by using *direct indexing* in the Data Frame to request the columns in reverse order by column label:

In [None]:
# Re-arrange the columns
# Specify the order of columns we want.
cols = ['Country Names', 0]
# Use direct indexing to select columns in given order.
reversed_col_df = no_name_df[cols]
reversed_col_df

We now have a column with the `name` `0` (an int) that is not at the 0-th location of the `.columns` attribute:

In [None]:
# Show the 0-th element of the `.columns` attribute
reversed_col_df.columns[0]

We are now in the precarious situation of using an `int` `0` both as a *label* and as a *location*. In the cell above, it is a location; in the cell below, it serves as a label:

In [None]:
# Direct index the `0` column
reversed_col_df[0]

Ideally, we would like a clear separation between integer indexes (like `0`)
and *column names*.
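
If you do find yourself with an integer column label, being explicit removes the ambiguity.  The sketch below builds a small made-up Data Frame (`mixed_df` is a hypothetical name) with the same column arrangement as above, and selects a column by label with `.loc` and by position with `.iloc`.

```python
import pandas as pd

# A made-up Data Frame: a string column label first, then the int label 0.
mixed_df = pd.DataFrame({'Country Names': ['Australia', 'Brazil'],
                         0: [0.896, 0.668]},
                        index=['AUS', 'BRA'])

# The column at position 0 has a string label.
print(mixed_df.columns[0])

# All rows, column with the *label* 0.
by_label = mixed_df.loc[:, 0]
# All rows, column at *position* 0.
by_position = mixed_df.iloc[:, 0]

print(by_label.name, by_position.name)
```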

This confusing situation can be completely avoided in the case of `named_df`. Let's add the country names to this Data Frame, that thus far only has a single column, containing the "Human Development Index" values.

In [None]:
# Show the Data Frame thus far
named_df

In [None]:
# Add the country names
named_df['Country Names'] = country_names_array
named_df

We don't introduce any confusion or add propensity to error by re-arranging
these columns which have sensible string names:

In [None]:
# Re-arrange the columns
reversed_named_df = named_df[['Country Names', 'Human Development Index']]
reversed_named_df

In fact, now if we want to use integer indexing, we are forced to use `.iloc`,
as we will otherwise get an error:

In [None]:
# This will not work if each column has a `name` string
reversed_named_df[0]

This compels us to stick to the good practice maxim we introduced above: **use
`.loc` and `.iloc` unless there is a good reason not to!**

Giving every column a sensible string as a `name` is one situation where direct indexing (e.g. in the present context using `named_df['Human Development Index']` etc.) is safe and not error-prone. As we have seen, issues can arise if we do not specify `name` attributes, and let Pandas automatically generate numeric labels for our Data Frame columns.
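
As a final sketch of this point, the cell below uses a small made-up Data Frame (`tidy_df` is a hypothetical name) in which every column has a string name.  Direct indexing with an integer fails, so any by-position access has to go through `.iloc`, and label access can name both the row and the column.

```python
import pandas as pd

# A made-up Data Frame where every column has a string name.
tidy_df = pd.DataFrame({'Country Names': ['Australia', 'Brazil'],
                        'Human Development Index': [0.896, 0.668]},
                       index=['AUS', 'BRA'])

# There is no column labeled 0, so direct indexing raises a KeyError.
try:
    tidy_df[0]
    raised = False
except KeyError:
    raised = True
print(raised)

# By position: all rows, first column.
first_col = tidy_df.iloc[:, 0]
# By label: row 'BRA', column 'Human Development Index'.
bra_hdi = tidy_df.loc['BRA', 'Human Development Index']
print(first_col.name, bra_hdi)
```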

## `.loc` and `.iloc` with Data Frames

So far we have spent much time with `.loc` and `.iloc` on Series.

There is a new concern

## Summary

On this page we have looked at two attributes of the Pandas Series: the
`name` is optional and defaults to `None`, whereas Pandas supplies a default
`RangeIndex` if no custom `index` is specified.

We have seen that the `RangeIndex` and other indices with integer labels can
lead to errors if the numeric row labels become misaligned with the integer
location of a given row. Similarly, if we do not specify a `name` attribute
for Data Frame columns, Pandas will generate numeric labels which can create
confusion between integer-based and direct indexing.

For best results, we should specify both an interpretable `index` and
interpretable `name` attributes for our Series, especially when they are part
of Data Frames.