# Data Indexing and Selection

Previously, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays.
These included indexing (e.g., ``arr[2, 1]``), slicing (e.g., ``arr[:, 1:5]``), masking (e.g., ``arr[arr > 0]``), fancy indexing (e.g., ``arr[0, [1, 5]]``), and combinations thereof (e.g., ``arr[:, [1, 5]]``).
Here we'll look at similar means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects.
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.

We'll start with the simple case of the one-dimensional ``Series`` object, and then move on to the more complicated two-dimensional ``DataFrame`` object.

In [None]:
import pandas as pd
import numpy as np

### Indexers: loc, iloc

The slicing and indexing conventions of a ``Series`` can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
# explicit index when indexing
data[1]

In [None]:
# implicit index when slicing
data[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code. Especially in the case of integer indexes, I recommend using them, both to make code easier to read and understand and to prevent subtle bugs due to the mixed indexing/slicing convention.
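As a minimal sketch of the difference, the snippet below rebuilds the same small ``Series`` (named ``s`` here only for illustration) and slices it both ways. Note that ``loc`` slices by label and includes the end label, while ``iloc`` slices by position and excludes the end position, as in plain Python.

```python
import pandas as pd

# Same data as above: integer labels that do not match the positions
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Label-based: labels 1 through 3, end label included
by_label = s.loc[1:3]

# Position-based: positions 1 and 2, end position excluded
by_position = s.iloc[1:3]

print(by_label.tolist())     # ['a', 'b']
print(by_position.tolist())  # ['b', 'c']
```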

## Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

### DataFrame as a dictionary

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [None]:
data['area']

Equivalently, we can use attribute-style access with column names that are strings. However, this method is not recommended, as it can conflict with methods of the ``DataFrame`` (but you may see it in other code):

In [None]:
data.area  # possible, but not recommended

In [None]:
data.pop  # this is the method, not the "pop" column!

In [None]:
data
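The pitfalls of attribute-style access can be sketched on a small stand-in frame (``df`` and the ``ratio`` name are illustrative, not part of the example above). The snippet assumes current Pandas behavior, where assigning a list to a new attribute name sets a plain Python attribute (with a warning) rather than creating a column.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'area': [1.0, 2.0], 'pop': [10, 20]})

# 'pop' collides with the DataFrame.pop method, so attribute access
# returns the bound method rather than the column.
attr_is_method = callable(df.pop)

# Dictionary-style access always returns the column.
col = df['pop']

# Assigning to a new attribute name does not create a column;
# Pandas emits a warning here, which is silenced for the sketch.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.ratio = [10.0, 10.0]

print(attr_is_method, list(df.columns))
```

Using ``df['ratio'] = ...`` instead creates the column as expected.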

This dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [None]:
data['density'] = data['pop'] / data['area']
data

This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects.
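One detail worth keeping in mind is that this arithmetic pairs values by index label, not by position. The sketch below uses two tiny made-up ``Series`` whose labels are stored in different orders to illustrate this.

```python
import pandas as pd

# Illustrative values; the labels appear in a different order in each Series
pop = pd.Series({'X': 100, 'Y': 300})
area = pd.Series({'Y': 30, 'X': 50})

# Division aligns on the labels 'X' and 'Y', regardless of storage order
density = pop / area

print(density['X'], density['Y'])  # 2.0 10.0
```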

Note that we can also get the underlying NumPy array representation of the ``DataFrame`` via the ``values`` attribute if we ever need to:

In [None]:
data.values

In [None]:
data["myarray"] = np.array([1, 2, 3, 4, 5])  # a NumPy array can also be assigned as a new column
data

### Math operations

We can also apply NumPy ufuncs to the ``DataFrame``. Unlike evaluating the ufunc on a plain NumPy array, the result keeps the index and column labels.

In [None]:
np.cos(data)
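To make the label preservation concrete, here is a small sketch on a stand-in frame (``df`` with made-up values) that compares a ufunc applied to the ``DataFrame`` with the same ufunc applied to its raw array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 4.0], 'b': [9.0, 16.0]}, index=['x', 'y'])

# Applied to the DataFrame: the result is a DataFrame with the same labels
roots = np.sqrt(df)

# Applied to the raw array: the labels are gone
raw = np.sqrt(df.to_numpy())

print(type(roots).__name__, list(roots.index), list(roots.columns))
print(type(raw).__name__)
```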

To get a single column, we can pass a single "index" (the column name) to the ``DataFrame``:

In [None]:
data['area']  # pandas Series

Using a string index will return a column as a ``Series`` object.

In [None]:
type(data.loc[:, 'pop'])

Using a list of strings will return a DataFrame:

In [None]:
type(data.loc[:, ['pop']])

In [None]:
data.loc[:, ['pop', 'density']]

We can create the boolean mask directly from the DataFrame:

In [None]:
data["density"] > 100

In [None]:
data.loc[data["density"] > 100, ['pop', 'density']]

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [None]:
data.iloc[0, 2] = 90
data
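Label-based and mask-based assignment work the same way through ``loc``. The following sketch uses a small stand-in frame (``df``, with made-up numbers) to cap values selected by a boolean mask and to overwrite a single cell by label.

```python
import pandas as pd

df = pd.DataFrame({'pop': [100, 200, 300],
                   'density': [50.0, 150.0, 250.0]},
                  index=['A', 'B', 'C'])

# Row mask plus column label in a single loc call: cap density at 100
df.loc[df['density'] > 100, 'density'] = 100.0

# Single cell addressed by row and column labels
df.loc['A', 'pop'] = 110

print(df)
```

Doing the selection and the assignment in one ``loc`` call avoids chained indexing such as ``df[mask]['density'] = ...``, which may modify a temporary copy instead of ``df``.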

## Eval and Query: Compound Expressions

We can achieve the same results as above using the ``eval`` and ``query`` methods.

The difference is that we write the expression as a string, which Pandas can evaluate in an optimized way. For small DataFrames this is not necessary, but for large ones (millions of rows) and complex expressions, it can be faster.

In [None]:
data.eval('pop * 2 / area')

In [None]:
data['density2'] = data['pop'] * 2 / data['area']

In [None]:
data_new = data.eval('density2 = pop * 2 / area')
data_new

In [None]:
data

In [None]:
# or inplace
data_new.eval('density3 = pop * 3 / area', inplace=True)
data_new

In [None]:
df_sel = data_new.query('density2 > 100')
df_sel

In [None]:
# or inplace
data_new.query('density3 < 400 & area < 400_000', inplace=True)
data_new
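As a sanity check, a ``query`` string should select the same rows as the equivalent boolean mask. The sketch below shows this on a small stand-in frame; ``df`` and ``threshold`` are illustrative names, and the ``@`` prefix is how a ``query`` string refers to a local Python variable.

```python
import pandas as pd

df = pd.DataFrame({'area': [10.0, 20.0, 40.0],
                   'pop': [1000.0, 4000.0, 2000.0]},
                  index=['A', 'B', 'C'])

threshold = 75  # local Python variable, referenced as @threshold below

# Standard boolean-mask selection
by_mask = df[df['pop'] / df['area'] > threshold]

# Same selection written as a query string
by_query = df.query('pop / area > @threshold')

print(by_query.equals(by_mask))  # True
```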

## Behind the scenes

To understand what is going on and what's the technical difference between the two methods, we can dive into the details.

For fun, we can compare the time it takes to compute the sum of two arrays using the standard approach and using the ``eval`` method. The difference, however, is **absolutely negligible** for most real-world use cases.

**DO NOT USE ONE OR THE OTHER FOR "PERFORMANCE" REASONS** (_except for very large DataFrames and long expressions, ONCE you hit a bottleneck_). Use for "convenience" reasons.

In [None]:
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y

Remember, NumPy is fast, because it pushes the loop into the compiled layer. But this abstraction can become less efficient when computing compound expressions.


For example, consider the following expression:

In [None]:
mask = (x > 0.5) & (y < 0.5)

Because NumPy evaluates each subexpression, this is roughly equivalent to the following:

In [None]:
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

In other words, *every intermediate step is explicitly allocated in memory*. If the ``x`` and ``y`` arrays are very large, this can lead to significant memory and computational overhead.
The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.
The [Numexpr documentation](https://github.com/pydata/numexpr) has more details, but for the time being it is sufficient to say that the library accepts a *string* giving the NumPy-style expression you'd like to compute:

In [None]:
import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays.
The Pandas ``eval()`` and ``query()`` tools shown above are conceptually similar, and can use the Numexpr package when it is installed.

In [None]:
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

To compute the sum of all four ``DataFrame``s using the typical Pandas approach, we can just write the sum:

In [None]:
%timeit df1 + df2 + df3 + df4

The same result can be computed via ``pd.eval`` by constructing the expression as a string:

In [None]:
%timeit pd.eval('df1 + df2 + df3 + df4')

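The ``eval()`` version of this expression can be noticeably faster (on the order of 50% in some benchmarks, depending on hardware, library versions, and whether Numexpr is installed) and uses less memory, while giving the same result: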

In [None]:
np.allclose(df1 + df2 + df3 + df4,
            pd.eval('df1 + df2 + df3 + df4'))
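The ``%timeit`` magic only exists inside IPython and Jupyter. Outside a notebook, a rough comparison can be sketched with the standard-library ``timeit`` module, as below. The frame sizes, the repeat count, and the helper names ``plain_sum`` and ``eval_sum`` are arbitrary choices for illustration, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Four small frames keyed by the names used inside the expression string
frames = {f'df{i}': pd.DataFrame(rng.rand(1000, 10)) for i in range(1, 5)}

def plain_sum():
    # Ordinary Pandas arithmetic
    return frames['df1'] + frames['df2'] + frames['df3'] + frames['df4']

def eval_sum():
    # local_dict passes the frames explicitly instead of relying on scope lookup
    return pd.eval('df1 + df2 + df3 + df4', local_dict=frames)

# Both approaches should agree on the result
same = np.allclose(plain_sum(), eval_sum())

# Total time in seconds for 20 calls of each function
t_plain = timeit.timeit(plain_sum, number=20)
t_eval = timeit.timeit(eval_sum, number=20)

print(f'same result: {same}')
print(f'plain: {t_plain:.4f}s  eval: {t_eval:.4f}s')
```

For frames this small, the string-parsing overhead of ``pd.eval`` may well outweigh any gain, which is consistent with the advice above.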

## Useful methods

Pandas provides many useful methods to manipulate and analyze data.

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
data.head(3)

In [None]:
data.tail()
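Since ``describe()`` itself returns a ``DataFrame``, individual statistics can be picked out with ``loc``. A short sketch on a stand-in frame (``df``, with made-up values):

```python
import pandas as pd

df = pd.DataFrame({'area': [10.0, 20.0, 30.0, 40.0],
                   'pop': [1, 2, 3, 4]})

summary = df.describe()                  # rows: count, mean, std, min, ...
mean_area = summary.loc['mean', 'area']  # one statistic for one column

first_two = df.head(2)  # first two rows
last_one = df.tail(1)   # last row only

print(mean_area)  # 25.0
```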

### Preprocessing data

Pandas provides many methods to preprocess data, such as filling missing values, removing duplicates, and more.

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6],
                   [np.nan, 4,      6],])

In [None]:
~df.duplicated()  # boolean mask of rows that are NOT duplicates (~ negates the mask)

In [None]:
df.fillna(0)  # fill missing values

In [None]:
df.drop_duplicates()  # remove duplicates

In [None]:
df.dropna()  # drop missing values
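Each of these methods returns a new object and leaves ``df`` unchanged, so they can be combined freely. The sketch below rebuilds the same frame and shows, as one possible variation, restricting ``dropna`` to a single column via ``subset`` and filling the remaining gaps with column means; the choice of the mean as fill value is only an illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6],
                   [np.nan, 4,      6]])

deduped = df.drop_duplicates()    # the last row repeats the third one
complete = df.dropna()            # keeps only rows without any NaN
has_col1 = df.dropna(subset=[1])  # only requires column 1 to be present

# Fill the gaps of the de-duplicated frame with each column's mean
filled = deduped.fillna(deduped.mean())

print(len(deduped), len(complete), len(has_col1))  # 3 1 3
print(filled)
```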