## Contents
* [The Basics](#the-basics)
  * [Indexing Columns](#indexing-columns)
  * [Slicing Rows](#slicing-rows)
* [Indexing and Slicing](#indexing-and-slicing-dataframes)
---

Pandas offers a wide variety of ways to select, index, or slice `Series` and `DataFrame` objects. That flexibility is helpful, but the sheer number of options can also be confusing. This article
focuses on selecting, indexing, and slicing `DataFrame` objects using the indexing operator `[]` and the `.loc` and `.iloc` attributes.

For more details and other options, please review the [pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html){: .post__link}.

## The Basics
The most basic way to select data from a `DataFrame` is the indexing operator `[]`; conceptually, it helps to think of this operator as "get an item". The indexing operator is a hybrid: depending on what is passed in, it uses either index labels or location offsets to select
certain columns or rows of a `DataFrame`. Using `.loc` and `.iloc` is often preferable because they make indexing and slicing more explicit; see the sections below for details and examples.

### Indexing Columns

With a `DataFrame`, the indexing operator `[]` selects columns, or more precisely, selects from the column index. This means that when using `DataFrame[]`, column names should be passed in. For more information on the `DataFrame` object attributes, see [Pandas: Introduction to the Library](/quick%20start/pandas-introduction.html).

In [97]:
import pandas as pd
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', 'C2', 'C3'],
                    'D': [5, 10, 15]})

# Example DataFrame
df

    A   B   C   D
0  A1  B1  C1   5
1  A2  B2  C2  10
2  A3  B3  C3  15


In [98]:
# The column index - what [] selects from
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [99]:
# Selecting 'B' from the column index
df['B']

0    B1
1    B2
2    B3
Name: B, dtype: object

Multiple columns can be selected by passing a list of column names, either stored in a variable or written inline, which produces two sets of square brackets `[[ ]]`. Each set of brackets performs a specific task during the selection.
* The outer bracket is the indexing operator that subsets the `DataFrame`
* The inner bracket creates the list of column names to subset

In [100]:
column_names = ['A', 'C']

print(f'list of column names:\n{df[column_names]}')
print(f'two square brackets:\n{df[["A", "C"]]}')

list of column names:
    A   C
0  A1  C1
1  A2  C2
2  A3  C3
two square brackets:
    A   C
0  A1  C1
1  A2  C2
2  A3  C3
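
One consequence of the two forms is worth seeing directly: a single label returns a `Series`, while a list of labels returns a `DataFrame`, even when the list holds only one name. A minimal sketch, rebuilding the example frame so it runs on its own (the names `single` and `as_frame` are only illustrative):

```python
import pandas as pd

# Rebuild the example DataFrame so the snippet is self-contained
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3'],
                   'D': [5, 10, 15]})

single = df['B']       # one label -> Series
as_frame = df[['B']]   # list holding one label -> one-column DataFrame

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(as_frame.shape)           # (3, 1)
```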


### Slicing Rows

#### Python: Slice & Slicing
A slice is an object typically made up of a portion of a sequence and is created using subscript notation `[]`, with colons separating the start, stop, and step numbers (e.g. `[1:10:2]`). Slicing is the selection of a range of items contained within a sequence object.

As mentioned above, a slice has three components: start, stop, and step. When specifying the start and stop values, it is important to note that the start value is inclusive, while the stop value is exclusive.
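
The start, stop, and step rules can be tried on a plain Python list before moving to pandas. This is only an illustrative sketch; the list `letters` and the variable `first_to_fourth` are made up for the example.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

print(letters[1:4])    # ['b', 'c', 'd'] - index 1 included, index 4 excluded
print(letters[:2])     # ['a', 'b'] - start defaults to the beginning
print(letters[::2])    # ['a', 'c', 'e'] - every second item
print(letters[::-1])   # ['f', 'e', 'd', 'c', 'b', 'a'] - reversed

# The same selection written with an explicit slice object
first_to_fourth = slice(1, 4)
print(letters[first_to_fourth] == letters[1:4])  # True
```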

#### Pandas: Slicing
With a `DataFrame` object, slicing with `[]` slices the rows of the `DataFrame`. This is easily confused with indexing columns using `[]`: when the indexing operator is passed a single label or a list of labels, it selects columns; when it is passed a slice (i.e. the argument includes a `:`), it slices the rows. As with indexing, slicing with `.loc` and `.iloc` might be preferred; see below for additional details.

In [101]:
# Slice the first two rows
# Remember start (0) is inclusive, stop (2) is exclusive
df[0:2]

    A   B   C   D
0  A1  B1  C1   5
1  A2  B2  C2  10


In [102]:
# Not specifying a start will start at the first row
df[:2]

    A   B   C   D
0  A1  B1  C1   5
1  A2  B2  C2  10


In [103]:
# Not specifying a stop will select all rows from start onward
df[1:]

    A   B   C   D
1  A2  B2  C2  10
2  A3  B3  C3  15


In [104]:
# Specify a step - how many rows to advance between start and stop
# Covers the entire DataFrame, taking every second row
df[0:4:2]

    A   B   C   D
0  A1  B1  C1   5
2  A3  B3  C3  15


In [105]:
# The above is equivalent to
df[::2]

    A   B   C   D
0  A1  B1  C1   5
2  A3  B3  C3  15


In [106]:
# Passing -1 as the step walks backwards from the last row, reversing the order
df[::-1]

    A   B   C   D
2  A3  B3  C3  15
1  A2  B2  C2  10
0  A1  B1  C1   5
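
To make the column-versus-row behavior of `[]` concrete, the sketch below passes a label, a slice, and a bare integer to the same frame. It assumes the string column names used throughout this article; the flag `missing_label` is only there for illustration.

```python
import pandas as pd

# Rebuild the example DataFrame so the snippet is self-contained
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3'],
                   'D': [5, 10, 15]})

col = df['A']     # a label is looked up in the column index
rows = df[0:1]    # a slice is applied to the rows

missing_label = False
try:
    df[0]         # a bare integer is treated as a column label, not a row
except KeyError as err:
    missing_label = True
    print(f'KeyError: {err}')
```

Because no column is named `0`, the last lookup fails rather than returning the first row, which is one reason `.iloc` is the clearer choice for positional access.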


A common method for selecting certain rows is to filter on a logical condition, known as boolean indexing. The logical condition returns a `True` or `False` value for each row of the `DataFrame`. Enclosing the logical condition in square brackets `[ ]` subsets the rows of the `DataFrame`, returning the rows where the logical condition evaluates to `True`.

In [107]:
df['B'] == 'B2'

0    False
1     True
2    False
Name: B, dtype: bool

In [108]:
df[df['B'] == 'B2']

    A   B   C   D
1  A2  B2  C2  10


Rows can also be subset on multiple conditions by combining them with the element-wise logical operators `&` (and) and `|` (or).

In [109]:
condition_1 = df['B'] == 'B2'
condition_2 = df['C'] == 'C3'
# condition 1 OR condition 2
df[condition_1 | condition_2]

    A   B   C   D
1  A2  B2  C2  10
2  A3  B3  C3  15
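
When the conditions are written inline rather than stored in variables first, each comparison needs its own parentheses, because `&` and `|` bind more tightly than comparison operators in Python. A small sketch of an "and" filter on the example frame (the name `filtered` is illustrative):

```python
import pandas as pd

# Rebuild the example DataFrame so the snippet is self-contained
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3'],
                   'D': [5, 10, 15]})

# Parentheses are required around each comparison
# condition 1 AND condition 2
filtered = df[(df['D'] >= 10) & (df['B'] != 'B3')]
print(filtered)
```

Only the row with index `1` satisfies both conditions here.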


Rows can also be filtered on multiple values of a categorical variable using the `.isin()` method.

In [110]:
df[df['B'].isin(['B1', 'B3'])]

    A   B   C   D
0  A1  B1  C1   5
2  A3  B3  C3  15
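
The same mask can be inverted with the `~` operator to keep the rows whose value is not in the list. A brief sketch, again rebuilding the example frame (the name `not_b1_b3` is illustrative):

```python
import pandas as pd

# Rebuild the example DataFrame so the snippet is self-contained
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3'],
                   'D': [5, 10, 15]})

# ~ flips True/False, so this keeps rows where B is neither 'B1' nor 'B3'
not_b1_b3 = df[~df['B'].isin(['B1', 'B3'])]
print(not_b1_b3)
```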


## Indexing and Slicing DataFrames
A preferable method for indexing and slicing a `DataFrame` is to use the `.loc` and `.iloc` properties of the `DataFrame` object. Subsetting with these properties is typically preferable because they are explicit about whether the selection is by label (`.loc`) or by integer location (`.iloc`).
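
The difference is easiest to see when the index labels are integers that do not match the row positions. The frame below is a hypothetical example (not the article's `df`), with labels deliberately in reverse order:

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions
scores = pd.DataFrame({'score': [90, 80, 70]}, index=[2, 1, 0])

by_label = scores.loc[0]      # the row LABELLED 0 (the last row)
by_position = scores.iloc[0]  # the row at POSITION 0 (labelled 2)

print(by_label['score'])      # 70
print(by_position['score'])   # 90
```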

In [None]:
# Modify previous df for examples

# Add two rows
df.loc[3] = ['A4', 'B4', 'C4', 20]
df.loc[4] = ['A5', 'B5', 'C5', 25]
# Create and set new column as index
df['idx'] = ['a', 'b', 'c', 'd', 'e']
df.set_index('idx', inplace=True)

In [120]:
# New df for Examples
df

      A   B   C   D
idx
a    A1  B1  C1   5
b    A2  B2  C2  10
c    A3  B3  C3  15
d    A4  B4  C4  20
e    A5  B5  C5  25


Passing a single argument to `.loc` or `.iloc` selects the corresponding rows.

In [121]:
# Can be a list of row index labels
df.loc[['a', 'b', 'c']]

      A   B   C   D
idx
a    A1  B1  C1   5
b    A2  B2  C2  10
c    A3  B3  C3  15


In [122]:
# Can also be a slice
# When the index is sorted all values between start and stop will be selected
df.loc['a':'c']

      A   B   C   D
idx
a    A1  B1  C1   5
b    A2  B2  C2  10
c    A3  B3  C3  15


>Notice that, UNLIKE the prior slicing, providing a range of index labels to `.loc` will include the stop value

In [123]:
# The above operations can also be carried out by position with iloc
df.iloc[[0, 1, 2]]

      A   B   C   D
idx
a    A1  B1  C1   5
b    A2  B2  C2  10
c    A3  B3  C3  15


In [124]:
df.iloc[0:3]

      A   B   C   D
idx
a    A1  B1  C1   5
b    A2  B2  C2  10
c    A3  B3  C3  15
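
The inclusive and exclusive stop rules can be checked side by side. The sketch below rebuilds the five-row frame directly with the `idx` index (rather than modifying the earlier one) and compares a label slice with a position slice; `label_slice` and `position_slice` are illustrative names.

```python
import pandas as pd

# Rebuild the five-row example frame with the 'idx' index
df = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4', 'A5'],
                   'B': ['B1', 'B2', 'B3', 'B4', 'B5'],
                   'C': ['C1', 'C2', 'C3', 'C4', 'C5'],
                   'D': [5, 10, 15, 20, 25]},
                  index=pd.Index(['a', 'b', 'c', 'd', 'e'], name='idx'))

label_slice = df.loc['a':'c']    # stop label 'c' is INCLUDED
position_slice = df.iloc[0:2]    # stop position 2 is EXCLUDED

print(len(label_slice))      # 3
print(len(position_slice))   # 2
```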


Passing two arguments to `.loc` or `.iloc` allows for indexing and slicing by both rows and columns: `[rows, columns]`.

In [127]:
df.loc['a':'c', ['B', 'D']]

      B   D
idx
a    B1   5
b    B2  10
c    B3  15


In [129]:
df.iloc[:3, [1, 3]]

      B   D
idx
a    B1   5
b    B2  10
c    B3  15
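
The `[rows, columns]` form also accepts single values and a bare `:` for "everything along this axis". A short sketch under the same five-row frame, rebuilt here so it runs on its own (variable names are illustrative):

```python
import pandas as pd

# Rebuild the five-row example frame with the 'idx' index
df = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4', 'A5'],
                   'B': ['B1', 'B2', 'B3', 'B4', 'B5'],
                   'C': ['C1', 'C2', 'C3', 'C4', 'C5'],
                   'D': [5, 10, 15, 20, 25]},
                  index=pd.Index(['a', 'b', 'c', 'd', 'e'], name='idx'))

cell_by_label = df.loc['b', 'D']     # single row label, single column label
cell_by_position = df.iloc[1, 3]     # second row, fourth column

# ':' selects every row; the column label slice includes both endpoints
all_rows_b_to_c = df.loc[:, 'B':'C']

print(cell_by_label, cell_by_position)   # 10 10
print(all_rows_b_to_c.shape)             # (5, 2)
```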


Both `.loc` and `.iloc` also allow for boolean indexing, returning the subset of the `DataFrame` where the boolean expression evaluates to `True`.

In [133]:
df.loc[df['D'] >= 15]

      A   B   C   D
idx
c    A3  B3  C3  15
d    A4  B4  C4  20
e    A5  B5  C5  25


In [136]:
# Displaying the boolean mask shows that it is a Series with an index
bool_mask = df['D'] >= 15
bool_mask

idx
a    False
b    False
c     True
d     True
e     True
Name: D, dtype: bool

Using the mask directly will cause an error if used with `.iloc`. Remember, `.iloc` is intended only for integer-location selection, and the mask is a `Series` that carries an index of labels. To use the mask with `.iloc`, pass in the underlying boolean array instead of the `Series`.

In [137]:
# Boolean Vector
bool_mask.values

array([False, False,  True,  True,  True])

In [138]:
df.iloc[bool_mask.values]

      A   B   C   D
idx
c    A3  B3  C3  15
d    A4  B4  C4  20
e    A5  B5  C5  25
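
The failure and the workaround can be seen together in one sketch. It assumes a reasonably recent pandas; the exception type is caught broadly as `ValueError` or `NotImplementedError` because, as far as I know, which one is raised depends on the type of the mask's index. `Series.to_numpy()` is used here as an alternative to `.values` for obtaining the plain array.

```python
import pandas as pd

# Rebuild the five-row example frame with the 'idx' index
df = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4', 'A5'],
                   'B': ['B1', 'B2', 'B3', 'B4', 'B5'],
                   'C': ['C1', 'C2', 'C3', 'C4', 'C5'],
                   'D': [5, 10, 15, 20, 25]},
                  index=pd.Index(['a', 'b', 'c', 'd', 'e'], name='idx'))

bool_mask = df['D'] >= 15

raised = False
try:
    df.iloc[bool_mask]            # a labelled Series is rejected by .iloc
except (ValueError, NotImplementedError) as err:
    raised = True
    print(f'{type(err).__name__}: {err}')

# Passing the plain boolean array works
subset = df.iloc[bool_mask.to_numpy()]
print(subset)
```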


Multiple boolean expressions can be combined.

In [139]:
condition_1 = df['D'] >= 15
condition_2 = df['A'] == 'A1'
# Condition 1 OR Condition 2
df.loc[condition_1 | condition_2]

      A   B   C   D
idx
a    A1  B1  C1   5
c    A3  B3  C3  15
d    A4  B4  C4  20
e    A5  B5  C5  25


In [147]:
df.loc[condition_1 | condition_2, 'A':'C']

      A   B   C
idx
a    A1  B1  C1
c    A3  B3  C3
d    A4  B4  C4
e    A5  B5  C5


## Summary

Pandas offers a multitude of options for selecting, indexing, and slicing `DataFrame` objects. This article provided an introduction to the `.loc` attribute for label-based selection, the `.iloc` attribute for integer-location-based selection, and the indexing operator `[]`, whose behavior depends on the index and the arguments provided. Because that behavior can be confusing, using `.loc` and `.iloc` might be preferable.

If you enjoyed what you read and found it helpful, please check back often, follow me on [Medium](https://medium.com/@emguyant), and give the article a clap! Also, don't forget to subscribe to the Inquisitive Nature publication.