# To `.loc` or `.iloc` - more on Pandas indexes

From the [previous page](0_0_pandas_intro.Rmd), remember our maxims that:

- *"a Pandas Series is a numpy array, plus a `name` attribute and an array-like `index`"* 

...and...

- *"a Pandas Data Frame is just a dictionary-like collection of Series"*

On this page we will examine differences between the attributes of a Series. We will demonstrate that the `name` attribute is optional - the default `name` is `None` and the `name` attribute can be left unspecified in any given Series with minimal consequences. We will show, however, that it is essential to specify the `name` attribute as a column name when a Series is part of a Pandas Data Frame. 

In contrast, an `index` is essential - if no `index` is specified then Pandas will create one. However, we will show that it is sensible to replace Pandas' default index with a custom index, to avoid accidental errors when indexing. 

In [None]:
# import libraries
import numpy as np
import pandas as pd

## What's in a `name`?

To consider the difference in importance/default behaviour between the `name` attribute and the `index` attribute, let's again build a Pandas Series "by hand", using numpy arrays. We'll use the [fertility and Human Development Index data once more](https://ourworldindata.org/grapher/children-per-woman-vs-human-development-index).

First, an array containing the three-letter codes for each country:

In [None]:
# three letter codes for each country
country_codes_array = np.array(['AUS', 'BRA', 'CAN',
                          'CHN', 'DEU', 'ESP',
                          'FRA', 'GBR', 'IND',
                          'ITA', 'JPN', 'KOR',
                          'MEX', 'RUS', 'USA'])
country_codes_array

Second, an array containing the Human Development Index (HDI) scores for each country:

In [None]:
# Human Development Index Scores for each country
hdis_array = np.array([0.896, 0.668, 0.89 , 0.586, 
                       0.844, 0.89 , 0.49 , 0.842, 
                       0.883, 0.709, 0.733, 0.824,
                       0.828, 0.863, 0.894])

hdis_array

We can make a Series using the `pd.Series()` constructor. Our `.values` array is the HDI scores, and the `index` is the three-letter country codes. As we have seen previously, we specify the `index` using the `index=` argument.

In [None]:
# make a series from the `hdis` array
hdi_series =  pd.Series(hdis_array, 
                        index=country_codes_array)

hdi_series

Currently, our Series has data (`values`) and an `index`, but no `name` attribute:

In [None]:
# show the `values`
hdi_series.values

In [None]:
# show the `index`
hdi_series.index

In [None]:
# show that `name` is None (i.e. it has not been set)
hdi_series.name is None

The `name` attribute is optional when considering Series which are not part of Data Frames. The default `name` attribute is `None`, i.e. if we do not specify a `name`, Pandas will set `name` to be `None`.

However, the name attribute can be useful for tracking what data is inside the `values` array. Let's make a new Series - called `hdi_series_with_name_and_index` - where we **do** specify a `name` attribute when calling the `pd.Series()` constructor. 

To specify the `name` attribute we use the `name = ` argument inside the `pd.Series()` constructor:

In [None]:
# make a series from the `hdis` array
hdi_series_with_name_and_index =  pd.Series(hdis_array, 
                                  index=country_codes_array,
                                  name = 'Human Development Index') # specifying the `name` attribute

hdi_series_with_name_and_index

Now, `name` is no longer equal to `None`:

In [None]:
# show the `name` attribute
hdi_series_with_name_and_index.name

In [None]:
# `name` is not None
hdi_series_with_name_and_index.name is None
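
As a minimal sketch of why the `name` matters once a Series becomes part of a Data Frame, the cell below converts a named and an unnamed Series into one-column Data Frames with the `.to_frame()` method. The variable names (`named`, `unnamed` and so on) are only for illustration, and the two HDI scores are taken from the arrays above.

```python
import pandas as pd

# two small Series with the same data: one with a `name`, one without
named = pd.Series([0.896, 0.668], index=['AUS', 'BRA'],
                  name='Human Development Index')
unnamed = pd.Series([0.896, 0.668], index=['AUS', 'BRA'])

# convert each Series to a one-column Data Frame
named_frame = named.to_frame()
unnamed_frame = unnamed.to_frame()

# the `name` becomes the column label...
print(list(named_frame.columns))    # ['Human Development Index']
# ...and a Series with no `name` gets the uninformative label 0
print(list(unnamed_frame.columns))  # [0]
```

The unnamed Series still works, but its column label `0` tells us nothing about what the data is.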

So far so good, we have seen what happens if we make a Series with no `name` attribute. But what if we make a Series without specifying an `index`? 

To do this, we simply call the `pd.Series()` constructor without using the `index = ` argument:

In [None]:
# make a series from the `hdis` array, with no `name` and no `index` specified
hdi_series_no_name_no_index =  pd.Series(hdis_array)

hdi_series_no_name_no_index

Here, you can see that while we did not specify an `index`, Pandas has automatically generated one:

In [None]:
# view the new Series
hdi_series_no_name_no_index

Let's take a closer look at this `index` by directly accessing the `.index` attribute:

In [None]:
# the default Pandas index
hdi_series_no_name_no_index.index

Hmmm, it is quite strange-looking relative to the index of country codes (which visually resembles a numpy array). Let's use the `type()` function to investigate further:

In [None]:
# show the type of the default index
type(hdi_series_no_name_no_index.index)

Ok, so for this Series - where we did not specify an `index` - the index that Pandas has created is a `RangeIndex` object. This is, in fact, the default Pandas index that will be attached to a Series where no other index is specified.
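
The following sketch, using a shortened three-element Series with illustrative variable names, suggests one way to check this for yourself. It relies on the `start`, `stop` and `step` attributes of `RangeIndex`, and compares the default index with a `RangeIndex` built by hand.

```python
import pandas as pd

# a short Series built without an `index` argument
values = [0.896, 0.668, 0.89]
series = pd.Series(values)
default_index = series.index

# the default index is described by three numbers, not by a stored array
print(type(default_index).__name__)  # RangeIndex
print(default_index.start, default_index.stop, default_index.step)  # 0 3 1

# it matches a RangeIndex we build by hand, running from 0 to len(values)
same = default_index.equals(pd.RangeIndex(start=0, stop=len(values), step=1))
print(same)  # True
```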

The `RangeIndex` looks like a sequence of numbers (and in a sense it is) - but it behaves differently to Python lists and numpy arrays when we index into it.

Let's take a look at the shape of the `RangeIndex` and what is in it. To view the shape, we can use the `np.shape()` function:

In [None]:
# take a closer look at the "default" index
np.shape(hdi_series_no_name_no_index.index)

Ok, so the index has 15 elements. Let's view the first one, using direct indexing:

In [None]:
# view the first element of the RangeIndex
hdi_series_no_name_no_index.index[0]

Again, so far, so good. Now let's view the second element:

In [None]:
# view the second element of the RangeIndex
hdi_series_no_name_no_index.index[1]

Previously, we used three-letter country codes as the index elements (these were strings).

Let's have a look at the `type()` of the numbers in the `RangeIndex`:

In [None]:
# what type are the elements of the range index?
type(hdi_series_no_name_no_index.index[0])

You might be fooled here into thinking that the `RangeIndex` behaves like a python list. If we try to use slicing, we will see this is not the case:

In [None]:
# slice the RangeIndex
hdi_series_no_name_no_index.index[0:-1]

In [None]:
# slice the RangeIndex
hdi_series_no_name_no_index.index[0:4]

This might seem like pretty weird behaviour. It happens because the `RangeIndex` behaves similarly to the built-in Python `range()` function, rather than a list or numpy array: slicing it with the familiar `start:stop` syntax works, but the result is another `RangeIndex`, displayed as its `start`, `stop` and `step`, rather than as the individual values.
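
To make the comparison with `range()` concrete, here is a small sketch (the variable names are illustrative) that slices a 15-element `RangeIndex` and a 15-element Python `range` in the same way:

```python
import pandas as pd

# a RangeIndex with the same 15 labels as our default index
default_index = pd.RangeIndex(15)

# slicing a RangeIndex gives back another RangeIndex
sliced = default_index[0:4]
print(sliced)          # RangeIndex(start=0, stop=4, step=1)

# slicing a Python range behaves in the same way
py_range_slice = range(15)[0:4]
print(py_range_slice)  # range(0, 4)

# in both cases the individual values are there if we ask for them
print(list(sliced))          # [0, 1, 2, 3]
print(list(py_range_slice))  # [0, 1, 2, 3]
```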

To view all the values inside the `RangeIndex` we can loop over it:

In [None]:
# loop over the index
for el in hdi_series_no_name_no_index.index:
    print(el)

Or show the values together by converting the index to a python list, using the `list()` function:

In [None]:
# show the elements in the RangeIndex using list()
list(hdi_series_no_name_no_index.index)
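
As an aside, a Pandas index also has its own conversion methods. The sketch below, on a shortened three-element Series with illustrative names, uses `.to_list()` and `.to_numpy()` as alternatives to the `list()` function:

```python
import pandas as pd

# the default index of a short Series
default_index = pd.Series([0.896, 0.668, 0.89]).index

# convert the index to a Python list, or to a numpy array
as_list = default_index.to_list()
as_array = default_index.to_numpy()

print(as_list)   # [0, 1, 2]
print(as_array)  # [0 1 2]
```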

## Why the "default" index can be confusing

To recap: we've used three-letter country codes as the elements of an `index`, and we've just seen what happens if we construct a Series without telling Pandas what to use as an `index` - it will create a default `RangeIndex`.

What is the advantage of using a non-default index? Below are some potential pitfalls to be aware of when using the default index.

Let's say we want to access the fifth element of the Series. This is at integer location 4, because we count from 0. At the moment the numerical labels from `RangeIndex` "line up" with the integer-based locations:

In [None]:
# show the whole Series
hdi_series_no_name_no_index

If we use integer indexing (`.iloc`) we get the same value as if we use label based indexing (`.loc`):

In [None]:
# indexing using integer location
hdi_series_no_name_no_index.iloc[4]

In [None]:
# indexing using labels (from the default index)
hdi_series_no_name_no_index.loc[4]

What if we don't tell pandas what type of indexing we want to do? E.g. we do not use `.iloc` or `.loc`, we just use the sort of direct indexing we would use on a python list?

In [None]:
# direct indexing
hdi_series_no_name_no_index[4]

All of these methods, presently, have returned the same value.

**But this will not always be the case.** Certain functions and methods that we may want to use to sort and organise our data will cause misalignment between the `RangeIndex` and the integer location of a given element of the Series.

For instance, let's sort the data in our `hdi_series_no_name_no_index` Series in ascending order. To do this we will use the `.sort_values()` method. We will cover Pandas methods in detail on [later pages](0_2_pandas_dataframes_attributes_methods.Rmd), but for now the `.sort_values()` method does just what it says on the tin: it sorts the elements of the Series in ascending order, based on the elements in the `.values` attribute of the Series:

In [None]:
# sorting the values in ascending order
hdi_series_no_name_no_index_sorted = hdi_series_no_name_no_index.sort_values()

hdi_series_no_name_no_index_sorted

Look at the left hand side of the print out from the cell above, i.e. look at the index. The labels that came from the `RangeIndex` no longer run sequentially from 0 to 14. This means that the integer location of each element in the Series no longer matches up with its index label. This can potentially be a source of errors.

Let's see what happens if we try to access the fifth element of the Series using integer-based indexing (`.iloc[4]`), label-based indexing (`.loc[4]`) and direct indexing (`[4]`), as we did above.

When we did this on the unsorted data all these methods returned the same value:

In [None]:
# integer indexing on the sorted data
hdi_series_no_name_no_index_sorted.iloc[4]

In [None]:
# label indexing on the sorted data
hdi_series_no_name_no_index_sorted.loc[4]

In [None]:
# direct indexing on the sorted data
hdi_series_no_name_no_index_sorted[4]

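Oh dear, we have used the number 4 each time, yet have gotten back different values: `.iloc[4]` returns the element at integer location 4, while `.loc[4]` and direct indexing (`[4]`) both return the element with the label `4`.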

This is a pitfall of using the default `RangeIndex` - it can lead to confusing results when the integer-based location and the `int` label of an element of the Series do not match up. 
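
The same effect is easier to follow on a smaller Series. The sketch below uses only the first five HDI scores and illustrative variable names. After sorting, position 0 holds the lowest score, but the label `0` still belongs to the score that was first in the original order:

```python
import pandas as pd

# the first five HDI scores, with the default RangeIndex
scores = pd.Series([0.896, 0.668, 0.89, 0.586, 0.844])

# sorting moves the values, and each label travels with its value
sorted_scores = scores.sort_values()
print(list(sorted_scores.index))  # [3, 1, 4, 2, 0]

# integer location 0 is now the lowest score...
by_position = sorted_scores.iloc[0]  # 0.586
# ...but the label 0 still points to the original first score
by_label = sorted_scores.loc[0]      # 0.896

print(by_position == by_label)  # False
```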

Compare this to our `hdi_series_with_name_and_index` which uses the three-letter country codes as its index:

In [None]:
# show the `hdi_series_with_name_and_index` Series
hdi_series_with_name_and_index

Let's get the Fifth Element using integer based (`.iloc`) indexing:

In [None]:
# integer indexing
hdi_series_with_name_and_index.iloc[4]

...and let's try to use `.loc[4]` on this Series (this will generate an error):

In [None]:
# label indexing raises a KeyError ...
hdi_series_with_name_and_index.loc[4]

This `KeyError` tells us that there is no index label `4` (which makes sense as the index labels in this Series are three-letter country codes). To use `.loc` with this Series, we must use the three-letter country code strings:

In [None]:
# label based indexing
hdi_series_with_name_and_index.loc['DEU']

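Direct indexing with an integer here is treated as `.iloc` integer-based indexing, because none of the index labels are integers. Pandas currently warns about this fallback, and in the future it will be an error, so `.iloc` is the safer choice.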

In [None]:
# direct indexing
hdi_series_with_name_and_index[4]

Using a custom index (e.g. the three-letter country codes) rather than the default `RangeIndex` has the advantage of avoiding potential confusion between the integer location of a datapoint, and the index label of that datapoint.

To demonstrate this, let's sort our `hdi_series_with_name_and_index` in ascending order:

In [None]:
# sorting the Series in ascending order
hdi_series_with_name_and_index_sorted = hdi_series_with_name_and_index.sort_values()

hdi_series_with_name_and_index_sorted

The use of custom string-based labels in the index (e.g. `FRA`, `AUS` etc) avoids the confusing misalignment between `RangeIndex` numerical labels and integer location.

Now, if we index using a number, this means *integer indexing*:

In [None]:
# index using `.iloc`
hdi_series_with_name_and_index_sorted.iloc[4]

In [None]:
# index using direct indexing.
# Currently this generates a warning.  In the future it will be an error.
hdi_series_with_name_and_index_sorted[4]

Using `.loc` means we have to use a string, preventing errors where we use a number and return data we do not expect.

In [None]:
# label-based indexing
hdi_series_with_name_and_index_sorted.loc['DEU']
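
Here is a compact sketch of the same idea, using three of the countries and illustrative variable names. A string label keeps pointing at the same value after sorting, the integer location does not, and an integer passed to `.loc` fails loudly rather than silently returning the wrong value:

```python
import pandas as pd

# three HDI scores, labelled with country codes
hdi = pd.Series([0.896, 0.668, 0.89], index=['AUS', 'BRA', 'CAN'])
hdi_sorted = hdi.sort_values()

# the label 'AUS' finds the same value before and after sorting
print(hdi.loc['AUS'] == hdi_sorted.loc['AUS'])  # True

# the integer location 0 does not: sorting changed which value is first
print(hdi.iloc[0] == hdi_sorted.iloc[0])        # False

# an integer is not a valid label here, so `.loc` raises a KeyError
try:
    hdi_sorted.loc[0]
    integer_label_found = True
except KeyError:
    integer_label_found = False

print(integer_label_found)  # False
```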

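So a good additional maxim is: **use `.loc` and `.iloc` unless there is a good reason not to!**

If you use direct indexing, e.g. `[4]` without using `.iloc` or `.loc`, then what Pandas does depends on the index. When the index labels are integers (as with a `RangeIndex`), `[4]` is treated as a *label*, as we saw with the sorted Series above. When the labels are not integers (e.g. our country codes), Pandas treats `[4]` as an integer location, but warns that this will become an error. Either way, it is better to be explicit: use `.iloc` for integer locations and `.loc` for labels, *especially if the Series has numeric labels*.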
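
As a final sketch, with illustrative variable names, the cell below shows how direct indexing behaves on a Series that has integer labels. After sorting, `[0]` gives the same answer as `.loc[0]` and not the same answer as `.iloc[0]`, which is easy to miss when reading code:

```python
import pandas as pd

# three HDI scores with the default integer labels, then sorted
scores = pd.Series([0.896, 0.668, 0.89]).sort_values()
print(list(scores.index))  # [1, 2, 0]

# with integer labels, direct indexing looks up the label
direct = scores[0]

print(direct == scores.loc[0])   # True  - same as label-based indexing
print(direct == scores.iloc[0])  # False - not the first element
```

Writing `.loc[0]` or `.iloc[0]` instead of `[0]` makes the intention clear to anyone reading the code.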

## Summary

On this page we have looked at the differences between attributes of the Pandas Series - the `name` is optional, but a default `RangeIndex` will be supplied if no custom `index` is specified.

We have seen that the `RangeIndex` can lead to errors if the numeric row labels become misaligned with the integer location of a given row.