# Data Indexing and Selection

We looked in detail at methods and tools to access, set, and modify values in NumPy arrays.
These included:

- **indexing** 
```python
arr[2, 1]
```
- **slicing** 
```python
arr[:, 1:5]
```
- **masking**
```python
arr[arr > 0]
```
- **fancy indexing** 
```python
arr[[1, 5]]
```
- and **combinations** 
```python
arr[:, [1, 5]]
```

**Here we'll look at similar means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects.**
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.

We'll start with the simple case of the one-dimensional ``Series`` object, and then move on to the more complicated two-dimensional ``DataFrame`` object.

## Data Selection in Series

As we saw in the previous section, a **``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.**
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

In [1]:
# first things first
import pandas as pd

### Series as dictionary

Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                index = ["a","b","c","d"])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
print("- by label: ", data["a"])  # explicit access by label
print("- by position: ", data[0])  # implicit access by position (deprecated in recent pandas; prefer data.iloc[0])

- by label:  0.25
- by position:  0.25


We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [4]:
"a" in data

True

In [5]:
# just as with a dictionary
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [6]:
# access the values of the Series
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [7]:
# values is an attribute (a NumPy array), not a method, so we cannot call data.values()


In [8]:
# we can access the index and the values together, as with a dictionary
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [9]:
# the same result can be obtained by zipping the index and the values
list(zip(data.index,data.values))

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

[Need a refresher on `zip`?](https://www.programiz.com/python-programming/methods/built-in/zip)

``Series`` objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [10]:
# creates a new entry because the label does not exist yet
data["e"] = 0.3

In [11]:
# if the label already exists, we access its value (and can modify it)
data["e"]

0.3

In [12]:
data["e"] = 9
data["e"]

9.0

In [13]:
data

a    0.25
b    0.50
c    0.75
d    1.00
e    9.00
dtype: float64

### Series as one-dimensional array

A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

In [14]:
data_dup = pd.Series([0.25, 0.5, 5, 0.25], 
                index = ["a","b","c","c"])

In [15]:
# slicing by explicit index
data["a" : "c"]

a    0.25
b    0.50
c    0.75
dtype: float64

<table align="left">
 <tr><td width="80"><img src="./img/error.png" style="width:auto;height:auto"></td>
     <td style="text-align:left">
         <h3>ERRORS: duplicated index</h3>
         
 </td></tr>
</table>

In [16]:
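# slicing up to a duplicated label works here because the index is sorted;
# on an unsorted index pandas can raise
# KeyError: "Cannot get right slice bound for non-unique label: 'c'"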
data_dup['a':'c']

a    0.25
b    0.50
c    5.00
c    0.25
dtype: float64
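
To make the duplicated-index pitfall concrete, here is a minimal sketch contrasting a sorted and an unsorted index. The names `sorted_dup` and `unsorted_dup` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Sorted index with a duplicated label: label slicing works
sorted_dup = pd.Series([0.25, 0.5, 5, 0.25], index=["a", "b", "c", "c"])
sorted_slice = sorted_dup["a":"c"]  # includes both 'c' rows

# Unsorted index where the duplicated label is not contiguous
unsorted_dup = pd.Series([0.25, 5, 0.5, 0.25], index=["a", "c", "b", "c"])
try:
    unsorted_dup["a":"c"]
    raised = False
except KeyError as err:
    # pandas cannot decide where the slice should end
    raised = True
    print("KeyError:", err)
```

Sorting the index first (for example with `sort_index()`) is one way to make label slices well defined again.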

In [17]:
# slicing by implicit integer index
data_dup[0:2]

a    0.25
b    0.50
dtype: float64

In [18]:
# repeated index labels can still be accessed by position
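
As a small sketch of that idea, the following rebuilds `data_dup` and reads each of the repeated `'c'` entries by position using `iloc`, then compares with a label lookup. The variable names `first_c`, `second_c`, and `both_c` are illustrative only.

```python
import pandas as pd

data_dup = pd.Series([0.25, 0.5, 5, 0.25], index=["a", "b", "c", "c"])

# Positional access distinguishes the two rows that share the label 'c'
first_c = data_dup.iloc[2]
second_c = data_dup.iloc[3]

# A label lookup returns every row with that label, as a Series
both_c = data_dup.loc["c"]
```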


In [31]:
# masking
data[(data>0.3) & (data <0.8)]

b    0.50
c    0.75
dtype: float64

In [33]:
(data>0.3).any() & (data <0.8).any()

True

In [34]:
# fancy indexing
lista_num = ["a", "c"]

data[lista_num]

a    0.25
c    0.75
dtype: float64

In [35]:
data[["a", "b"]]

a    0.25
b    0.50
dtype: float64

Among these, slicing may be the source of the most confusion.
**Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is *excluded* from the slice.**

### Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as **``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.**

In [36]:
data_num = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data_num

1    a
3    b
5    c
dtype: object

**Explicit index when indexing** 

In [37]:
# by label
data_num[5]

'c'

**Implicit index when slicing**

In [42]:
# by position
# slicing always returns a slice of the original object
data_num[2:3]

5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the **``loc`` attribute allows indexing and slicing that always references the explicit index:**

In [44]:
# by label
data_num.loc[1]

'a'

In [45]:
# by labels
data_num.loc[1:5]

1    a
3    b
5    c
dtype: object

The **``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index**:

In [49]:
# by position
data_num.iloc[0:]

1    a
3    b
5    c
dtype: object

In [50]:
# by position
data_num.iloc[1]

'b'

One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, **I recommend using them both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.**

## Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

### DataFrame as a dictionary

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:

In [54]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135, "Arizona":123124})
df = pd.DataFrame({"area" : area, "pop" : pop})
df

| | area | pop |
|---|---|---|
| Arizona | NaN | 123124 |
| California | 423967.0 | 38332521 |
| Florida | 170312.0 | 19552860 |
| Illinois | 149995.0 | 12882135 |
| New York | 141297.0 | 19651127 |
| Texas | 695662.0 | 26448193 |


The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [55]:
df["area"]

Arizona            NaN
California    423967.0
Florida       170312.0
Illinois      149995.0
New York      141297.0
Texas         695662.0
Name: area, dtype: float64

Equivalently, we can use attribute-style access with column names that are strings:

In [56]:
df.area

Arizona            NaN
California    423967.0
Florida       170312.0
Illinois      149995.0
New York      141297.0
Texas         695662.0
Name: area, dtype: float64

This attribute-style column access selects the same column as the dictionary-style access:

In [29]:
# whether both expressions return the identical Python object
# can depend on the pandas version
df.area is df['area']

Though this is a useful shorthand, keep in mind that it does not work for all cases!
For example, **if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.**
For example, the ``DataFrame`` has a [``pop()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html?highlight=pop#pandas.DataFrame.pop) method, so ``df.pop`` will point to this rather than the ``"pop"`` column:

In [None]:
df.pop

In [None]:
df.pop is df['pop']
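
A minimal sketch of this name clash, using a reduced two-state version of the table above (the trimmed data and the names `attr_is_method` and `pop_column` are assumptions made for brevity):

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662], "pop": [38332521, 26448193]},
    index=["California", "Texas"],
)

# The attribute resolves to the DataFrame.pop method, not to the column
attr_is_method = callable(df.pop)

# Dictionary-style access always returns the column
pop_column = df["pop"]
```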

In particular, you should avoid the temptation to try column assignment via attribute (i.e., **use ``df['pop'] = z`` rather than ``df.pop = z``**).

Like with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
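
One plausible form of this step, assuming the new column is the population ``density`` that is filtered on later in this section, is sketched here with the same state data:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135, "Arizona": 123124})
df = pd.DataFrame({"area": area, "pop": pop})

# Dictionary-style assignment adds a new column (assumed name: 'density');
# Arizona has no area, so its density is NaN
df["density"] = df["pop"] / df["area"]
```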

This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects; we'll dig into this further in **Operating on Data in Pandas**.

### DataFrame as two-dimensional array

As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the ``values`` attribute:
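
As an illustrative sketch, with a reduced three-state table standing in for the full ``df`` (an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# values exposes the data as a plain NumPy array, without labels
raw = df.values
```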

With this picture in mind, many familiar array-like operations can be performed on the ``DataFrame`` itself.
**For example, we can transpose the full ``DataFrame`` to swap rows and columns:**
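
A short sketch of the transpose, again on a reduced three-state table (the trimmed data is an assumption for brevity):

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# T swaps rows and columns: states become columns, fields become rows
transposed = df.T
```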

When it comes to indexing of ``DataFrame`` objects, however, it is clear that the dictionary-style indexing of columns precludes our **ability to simply treat it as a NumPy array.**
In particular, passing a single index to an array accesses a row:

and passing a single "index" to a ``DataFrame`` accesses a column:

In [None]:
df['area']
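
To make the contrast concrete, this sketch applies a single index to the underlying array and a single key to the ``DataFrame``, using a reduced three-state table (an assumption for brevity):

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# NumPy array: a single index selects a row
first_row = df.values[0]

# DataFrame: a single key selects a column
area_column = df["area"]
```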

Thus for array-style indexing, we need another convention.
Here Pandas again uses the ``loc`` and ``iloc`` indexers mentioned earlier, together with a third one, ``at``.


Let's start by looking at the third indexing attribute, **``at``, which accesses a single value for a row/column label pair.**

It is similar to ``loc`` in that both provide label-based lookups. Use ``at`` if you only need to get or set a single value in a ``DataFrame`` or ``Series``.
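
A minimal sketch of getting and setting one cell with ``at``, on a reduced three-state table; the replacement value 700000 is made up for the illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# Read a single value by row label and column label
texas_area = df.at["Texas", "area"]

# Write a single value the same way (hypothetical new value)
df.at["Texas", "area"] = 700000
```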

Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:
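
For instance, a positional slice might look like the following sketch on a reduced three-state table (the trimmed data is an assumption for brevity):

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# First two rows, first column, selected by position (end excluded)
subset = df.iloc[:2, :1]
```

The result is still a ``DataFrame``, so the row and column labels travel with the selected values.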

Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:
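
A comparable sketch with labels, on the same reduced three-state table (an assumption for brevity); note that label slices include their end point:

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# Rows up to and including 'Texas', columns up to and including 'pop'
subset = df.loc[:"Texas", :"pop"]
```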

Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine **masking** and **fancy indexing** as in the following:

In [None]:
df.loc[df.density > 100, ['pop', 'density']] # rows first, columns second

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:
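
As a sketch of assignment through the indexers, on a reduced three-state table; the new values 90 and 0 are arbitrary placeholders:

```python
import pandas as pd

df = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# Set one cell by position: first row, second column
df.iloc[0, 1] = 90

# Set one cell by label
df.loc["Texas", "area"] = 0
```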

To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.