# <center> <div style="width: 370px;"> ![numpy title](pictures/numpy_tytle.jpg)

# <center> Data Science Operations: Indexing, Filter

## Indexing

Indexing in NumPy refers to the process of accessing elements or subsets of elements from a NumPy array. NumPy arrays are similar to Python lists, but they can have multiple dimensions, which makes indexing more versatile.

ndarrays can be indexed using the standard Python `x[obj]` syntax, where `x` is the array and `obj` the selection. There are different kinds of indexing available depending on `obj`:

- Basic indexing
- Advanced indexing
- Field access.

The key difference from nested Python lists is that NumPy arrays use commas between axes, so you can index multiple axes in one set of square brackets.

> Note that in Python, `x[(exp1, exp2, ..., expN)]` is equivalent to `x[exp1, exp2, ..., expN]`; the latter is just syntactic sugar for the former. This will come in handy later in Advanced Indexing.
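
The equivalence noted above can be checked directly. The following is a minimal sketch; the array `x` and the variable names are made up for illustration and are not part of the original notebook.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# Indexing two axes at once with a comma ...
elem = x[1, 2]
# ... is the same as passing an explicit tuple.
elem_tuple = x[(1, 2)]

# A nested Python list needs one pair of brackets per level instead.
nested = x.tolist()
elem_list = nested[1][2]

print(elem, elem_tuple, elem_list)
```

All three expressions pick the element in row 1, column 2.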

## Slicing and Striding

Basic slicing in NumPy extends Python's slicing concept to N dimensions. It is used when the indexing object (obj) is a `slice` object defined using the `start:stop:step` notation inside square brackets. The obj can also be an integer or a tuple containing slice objects and integers. Additionally, you can use `ellipsis` (`...`) and `newaxis` objects in combination with basic slicing.

**Note That:** When you perform basic slicing on a NumPy array, it creates a ***view*** of the original array, not a copy. This means that the sliced array still shares the underlying data with the original array.

> It's important to be cautious when extracting a small portion from a large array, especially if you no longer need the original array after the extraction. This is because the extracted portion still references the larger original array, and the memory of the original array won't be released until all derived arrays are garbage-collected. In such cases, it's advisable to create an explicit copy of the sliced portion using the `copy()` method.
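
A small sketch of the view-versus-copy behaviour described above. The names `big`, `small_view`, and `small_copy` are illustrative assumptions; `np.shares_memory` is used here only to make the sharing visible.

```python
import numpy as np

big = np.arange(1_000_000)

small_view = big[:3]          # basic slicing: a view onto big's data
small_copy = big[:3].copy()   # explicit copy: independent data

shares_view = np.shares_memory(big, small_view)
shares_copy = np.shares_memory(big, small_copy)

# Writing through the view changes the original array,
# while the copy keeps the old value.
small_view[0] = -1
print(shares_view, shares_copy, big[0], small_copy[0])
```

Keeping only `small_copy` lets the large array be freed once `big` and `small_view` are no longer referenced.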

Basic slicing follows the standard slicing rules, which apply to each dimension separately. Some useful concepts to remember include:


- The basic slice syntax is `i:j:k` where `i` is the starting index, `j` is the stopping index, and `k` is the step.

In [1]:
import numpy as np

In [2]:
a = np.array(range(10))
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]:
a[1:9:3]

array([1, 4, 7])

- An integer, `i`, returns the same values as `i:i+1` except the dimensionality of the returned object is reduced by 1. In particular, a selection tuple with the p-th element an integer (and all other entries `:`) returns the corresponding sub-array with dimension `N - 1`. If `N = 1` then the returned object is an array scalar.


In other words: indexing with a single integer `i` selects the same data as the slice `i:i+1`, but it drops that axis, so the result has one dimension fewer than the original array. In a selection tuple such as `(i, :, :)` or `(:, i, :)`, the integer selects a sub-array along its axis (the p-th element is the entry at position `p` in the tuple), and the result has `N - 1` dimensions. If the original array has only one dimension (`N = 1`), an integer index returns an array scalar, that is, a single value rather than a 1-D array.

In [4]:
a = np.random.randint(10, size=(3, 4))


In [5]:
a

array([[5, 3, 5, 8],
       [5, 9, 3, 7],
       [4, 9, 0, 6]])

In [6]:
a[0].shape

(4,)

In [7]:
a[0]

array([5, 3, 5, 8])

In [8]:
a[0:1].shape

(1, 4)

In [9]:
a[0:2]

array([[5, 3, 5, 8],
       [5, 9, 3, 7]])
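
The `N = 1` case mentioned above can be sketched the same way. This is an illustrative example with made-up names (`v`, `item`, `piece`), not part of the original notebook.

```python
import numpy as np

v = np.arange(5)   # one-dimensional, so N = 1

item = v[2]        # integer index: an array scalar
piece = v[2:3]     # slice i:i+1: still a 1-D array of length 1

is_array_scalar = isinstance(item, np.generic)
print(type(item), np.ndim(item), piece.shape)
```

Both expressions hold the same value, but only the slice keeps the axis.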

- You may use slicing to set values in the array, but (unlike lists) you can never grow the array. The value to be set in `x[obj] = value` must be broadcastable to the same shape as `x[obj]`.

In [10]:
l = [0, 1, 2, 3, 4, 5, 6, 7]

In [11]:
l[:5] = [8, 9, 10]

In [12]:
l

[8, 9, 10, 5, 6, 7]

In [13]:
a = np.array(range(8))
a

array([0, 1, 2, 3, 4, 5, 6, 7])

In [14]:
a[:5] = np.array([8, 9, 10])

ValueError: could not broadcast input array from shape (3,) into shape (5,)

In [15]:
np.array([8, 9, 10]).shape, a[:5].shape

((3,), (5,))

In [16]:
a[:5] = 8

In [17]:
a

array([8, 8, 8, 8, 8, 5, 6, 7])

In [18]:
a[:5] = np.array([9, 8, 7, 6, 5])

In [19]:
a

array([9, 8, 7, 6, 5, 5, 6, 7])

- A slicing tuple can always be constructed as `obj` and used in the `x[obj]` notation. Slice objects can be used in the construction in place of the `[start:stop:step]` notation. For example, `x[1:10:5, ::-1]` can also be implemented as `obj = (slice(1, 10, 5), slice(None, None, -1)); x[obj]`. This can be useful for constructing generic code that works on arrays of arbitrary dimensions. See [Dealing with variable numbers of indices within programs](https://numpy.org/doc/stable/user/basics.indexing.html#dealing-with-variable-indices) for more information.
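
As a rough sketch of such generic code, the helper below builds a selection tuple from `slice` objects so that it works for any number of dimensions. The function name `slice_along_axis` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np


def slice_along_axis(x, axis, start, stop):
    """Hypothetical helper: apply start:stop on one axis, ':' on all others."""
    index = [slice(None)] * x.ndim      # one full slice per dimension
    index[axis] = slice(start, stop)    # restrict only the requested axis
    return x[tuple(index)]


x = np.arange(24).reshape(2, 3, 4)

# The tuple of slice objects is interchangeable with the bracket notation.
obj = (slice(0, 2), slice(None, None, -1))
same = np.array_equal(x[obj], x[0:2, ::-1])

middle = slice_along_axis(x, axis=1, start=1, stop=2)
print(same, middle.shape)
```

Because the selection is an ordinary tuple, it can be assembled at run time without knowing `x.ndim` in advance.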

## Dimensional Indexing Tools

NumPy provides useful tools for conveniently matching array shapes in expressions and assignments.

One such tool is the `Ellipsis` (`...`) notation, which automatically expands to the number of colon (`:`) objects required in the selection tuple to index all dimensions. In most cases, this means that the length of the expanded selection tuple matches the number of dimensions in the array, denoted as `x.ndim`. It's important to note that there can only be a single ellipsis present in the selection tuple.

For instance:

In [34]:
a = np.random.randint(10, size=(2, 3, 4, 5))

In [35]:
a.ndim

4

In [36]:
len(a.shape)

4

In [37]:
a

array([[[[8, 3, 7, 6, 3],
         [5, 1, 7, 1, 9],
         [3, 1, 4, 7, 1],
         [1, 8, 8, 6, 5]],

        [[0, 6, 4, 6, 3],
         [6, 5, 4, 7, 7],
         [8, 4, 5, 5, 5],
         [9, 7, 7, 6, 5]],

        [[3, 9, 1, 3, 0],
         [3, 5, 4, 2, 0],
         [7, 5, 1, 4, 6],
         [1, 2, 9, 5, 3]]],


       [[[7, 7, 2, 6, 0],
         [5, 3, 3, 3, 5],
         [8, 1, 9, 2, 8],
         [9, 7, 2, 1, 2]],

        [[0, 4, 2, 3, 6],
         [1, 6, 4, 2, 5],
         [3, 3, 4, 7, 1],
         [3, 5, 4, 9, 9]],

        [[9, 6, 2, 8, 1],
         [7, 2, 4, 9, 0],
         [6, 9, 7, 3, 0],
         [5, 4, 7, 2, 5]]]])

In [38]:
b = a[:, :, :, 3]

In [39]:
b.shape

(2, 3, 4)

In [40]:
c = a[..., 3]

In [41]:
c

array([[[6, 1, 7, 6],
        [6, 7, 5, 6],
        [3, 2, 4, 5]],

       [[6, 3, 2, 1],
        [3, 2, 7, 9],
        [8, 9, 3, 2]]])

In [47]:
(c == b).all()

True

In [43]:
b.shape

(2, 3, 4)

In [44]:
d = a[:, 1, :, :]

In [30]:
d

array([[[0, 6, 4, 6, 3],
        [6, 5, 4, 7, 7],
        [8, 4, 5, 5, 5],
        [9, 7, 7, 6, 5]],

       [[0, 4, 2, 3, 6],
        [1, 6, 4, 2, 5],
        [3, 3, 4, 7, 1],
        [3, 5, 4, 9, 9]]])

In [49]:
d.shape

(2, 4, 5)
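
The ellipsis rules from this section can be condensed into a short, reproducible sketch. It uses `np.arange` instead of random data so the results are deterministic; the variable names are illustrative.

```python
import numpy as np

a = np.arange(120).reshape(2, 3, 4, 5)

# The ellipsis stands for as many ':' as are needed.
same_trailing = np.array_equal(a[..., 3], a[:, :, :, 3])
same_middle = np.array_equal(a[0, ..., 0], a[0, :, :, 0])
same_leading = np.array_equal(a[1, ...], a[1])

# Only one ellipsis is allowed per selection tuple.
try:
    a[..., 0, ...]
    two_ellipses_ok = True
except IndexError as exc:
    two_ellipses_ok = False
    print("IndexError:", exc)

print(same_trailing, same_middle, same_leading)
```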

### Ellipsis and newaxis

Python itself attaches no fixed meaning to Ellipsis (`...`); its interpretation depends on the `__getitem__` implementation of the object being indexed. Its most prominent use is in the third-party NumPy library, which provides a multidimensional array type. With multidimensional arrays, slicing is more intricate than specifying start and stop indices, because it often involves several dimensions at once.

Expanding on this concept, Ellipsis serves as a placeholder for the unspecified dimensions of an array. You can think of it as a full slice (`:`) for every dimension at the position where it appears. For example, in a 3D array, `a[...,0]` is the same as `a[:,:,0]`, and in a 4D array, `a[0,...,0]` is equivalent to `a[0,:,:,0]`: the ellipsis expands to as many full slices as are needed to cover the remaining dimensions.


In [48]:
...

Ellipsis

> It's worth noting that Ellipsis is not limited to NumPy; it also finds use in the standard-library typing module. For instance, it's used in type hinting as seen in `Callable[..., int]`, which signifies a callable that returns an integer without specifying the exact function signature. Similarly, `Tuple[str, ...]` indicates a variable-length homogeneous tuple of strings, where the ellipsis implies that the tuple can have an arbitrary number of elements.
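
To make both points concrete, here is a minimal sketch in plain Python. The class `ShowKey` and the function `apply_all` are hypothetical names invented for this illustration; only `Callable` and `Tuple` come from the standard `typing` module.

```python
from typing import Callable, Tuple


class ShowKey:
    """Hypothetical helper: returns whatever key it is indexed with."""

    def __getitem__(self, key):
        return key


probe = ShowKey()
# __getitem__ receives a plain tuple containing the Ellipsis object.
key = probe[0, ..., 1:3]
print(key)


def apply_all(funcs: Tuple[Callable[..., int], ...]) -> Tuple[int, ...]:
    """Call every function in a variable-length tuple of int-returning callables."""
    return tuple(f() for f in funcs)


result = apply_all((lambda: 1, lambda: 2))
print(result)
```

What happens to the `Ellipsis` inside `key` is entirely up to the class; NumPy chooses to expand it into full slices.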

Each `newaxis` object in the selection tuple serves to expand the dimensions of the resulting selection by one unit-length dimension. The added dimension is the position of the `newaxis` object in the selection tuple. `newaxis` is an alias for `None`, and `None` can be used in place of this with the same result.

In [54]:
a.shape

(2, 3, 4, 5)

In [55]:
a[None, ..., np.newaxis].shape

(1, 2, 3, 4, 5, 1)

This can be handy to combine two arrays in a way that otherwise would require explicit reshaping operations. For example:

In [66]:
a = np.arange(10)

In [67]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [68]:
a[:, np.newaxis]

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [69]:
a[np.newaxis, :]

array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

In [70]:
a[:, np.newaxis] + a[:, np.newaxis]

array([[ 0],
       [ 2],
       [ 4],
       [ 6],
       [ 8],
       [10],
       [12],
       [14],
       [16],
       [18]])

In [71]:
a[:, np.newaxis] + a[np.newaxis, :]

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12],
       [ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13],
       [ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14],
       [ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15],
       [ 7,  8,  9, 10, 11, 12, 13, 14, 15, 16],
       [ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17],
       [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]])
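
For comparison, the same table can be built with explicit reshaping. The sketch below is illustrative; `np.add.outer` is shown only as another way to reach the same result.

```python
import numpy as np

a = np.arange(10)

via_newaxis = a[:, np.newaxis] + a[np.newaxis, :]   # (10, 1) + (1, 10)
via_reshape = a.reshape(10, 1) + a.reshape(1, 10)   # explicit reshaping
via_outer = np.add.outer(a, a)                      # outer sum of a with itself

print(via_newaxis.shape)
print(np.array_equal(via_newaxis, via_reshape), np.array_equal(via_newaxis, via_outer))
```

Broadcasting stretches the column against the row, which is why the result has shape `(10, 10)`.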

An example is the easiest way to show off indexing and slicing. It’s time to confirm [Dürer’s magic square](https://en.wikipedia.org/wiki/Magic_square#Albrecht_D%C3%BCrer%27s_magic_square)!

The number square below has some amazing properties. If you add up any of the rows, columns, or diagonals, then you’ll get the same number, 34. That’s also what you’ll get if you add up each of the four quadrants, the center four squares, the four corner squares, or the four corner squares of any of the contained 3 × 3 grids. You’re going to prove it!

In [88]:
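# Build a magic square of odd order N with the Siamese method
import numpy as np

N = 5
magic_square = np.zeros((N, N), dtype=int)

n = 1
i, j = 0, N // 2  # start in the middle of the top row

while n <= N**2:
    magic_square[i, j] = n
    n += 1
    # Candidate next cell: one up and one to the right, wrapping around
    newi, newj = (i - 1) % N, (j + 1) % N
    if magic_square[newi, newj]:
        i += 1  # candidate already filled: move one row down instead
    else:
        i, j = newi, newj

print(magic_square)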

[[17 24  1  8 15]
 [23  5  7 14 16]
 [ 4  6 13 20 22]
 [10 12 19 21  3]
 [11 18 25  2  9]]


#### Dürer’s magic square:

In [90]:
import numpy as np

In [91]:
square = np.array([
    [16,  3,  2, 13],
    [ 5, 10, 11,  8],
    [ 9,  6,  7, 12],
    [ 4, 15, 14,  1],
])

In [92]:
mylist = [
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
]

In [93]:
square.shape

(4, 4)

In [94]:
square[:, 0]

array([16,  5,  9,  4])

In [95]:
for i in range(4):
    assert square[:, i].sum() == 34
    assert square[i, :].sum() == 34

In [96]:
assert square[2:, 2:].sum() == 34

In [97]:
assert square[:2, 2:].sum() == 34

In [98]:
assert square[2:, :2].sum() == 34

In [99]:
assert square[:2, :2].sum() == 34

Inside the for loop, you verify that all the rows and all the columns add up to 34. After that, using selective indexing, you verify that each of the quadrants also adds up to 34.
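
The remaining properties listed earlier (diagonals, center squares, and corner squares) can be checked with the same slicing tools. This is a sketch that repeats the `square` definition so it runs on its own; the other names are illustrative.

```python
import numpy as np

square = np.array([
    [16,  3,  2, 13],
    [ 5, 10, 11,  8],
    [ 9,  6,  7, 12],
    [ 4, 15, 14,  1],
])

main_diagonal = np.trace(square)               # top-left to bottom-right
anti_diagonal = np.trace(np.fliplr(square))    # top-right to bottom-left
center = square[1:3, 1:3].sum()                # the center four squares
corners = square[::3, ::3].sum()               # step 3 picks rows/columns 0 and 3

# Corner squares of each contained 3 x 3 grid: step 2 inside a 3-wide window.
grid_corners = [
    square[r:r + 3:2, c:c + 3:2].sum()
    for r in (0, 1)
    for c in (0, 1)
]

print(main_diagonal, anti_diagonal, center, corners, grid_corners)
```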

One last thing to note is that you’re able to take the sum of any array to add up all of its elements globally with `square.sum()`. This method can also take an axis argument to do an axis-wise summing instead.

In [100]:
square.sum()

136
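
A short sketch of the `axis` argument mentioned above, again repeating the `square` definition so the example is self-contained:

```python
import numpy as np

square = np.array([
    [16,  3,  2, 13],
    [ 5, 10, 11,  8],
    [ 9,  6,  7, 12],
    [ 4, 15, 14,  1],
])

column_sums = square.sum(axis=0)   # collapse the rows: one total per column
row_sums = square.sum(axis=1)      # collapse the columns: one total per row

print(column_sums, row_sums)
```

With `axis=0` and `axis=1`, the loop over `range(4)` in the earlier cell becomes two vectorized calls.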

## Advanced indexing

Advanced indexing is triggered when the selection object `obj` is a non-tuple sequence object, an ndarray (of integer or bool data type), or a tuple containing at least one sequence object or ndarray (of integer or bool data type). There are two types of advanced indexing:

1. **Integer indexing**
2. **Boolean indexing**

> **Note:** Advanced indexing always returns a copy of the data, in contrast to basic slicing, which returns a view of the original data.

> **Note:** The definition of advanced indexing means that `x[(1, 2, 3),]` is fundamentally different from `x[(1, 2, 3)]`. The latter is equivalent to `x[1, 2, 3]`, which triggers basic selection. The former is a tuple containing one sequence object, so it triggers advanced indexing.
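
The difference between the two spellings is easiest to see on a small 3-D array. The array `x` below is an assumption made for illustration.

```python
import numpy as np

x = np.arange(64).reshape(4, 4, 4)

# A tuple of integers: basic indexing, one index per axis.
basic = x[(1, 2, 3)]

# A tuple containing one sequence: advanced indexing along axis 0 only.
advanced = x[(1, 2, 3),]

print(np.ndim(basic), advanced.shape)
```

`basic` is the single element `x[1, 2, 3]`, while `advanced` stacks the sub-arrays `x[1]`, `x[2]`, and `x[3]`.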

### Integer Array Indexing

Integer array indexing enables the selection of specific elements within an array by specifying their N-dimensional indices. In this approach, each integer array serves as a collection of indices within a particular dimension.

It's worth noting that negative values are valid within these index arrays, and they behave in the same way as with single indices or slices:

In [104]:
a = np.arange(10, 0, -1)

In [107]:
a.shape

(10,)

In [115]:
b = a[np.array([3, 3, 1, 8])]

In [110]:
# Advanced Indexing results in a copy, not a view
np.may_share_memory(a, b)

False

In [111]:
a[[3, 3, 1, 8]]

array([7, 7, 9, 2])

In [112]:
# ((3, 3, 1, 8), ) is a tuple with one sequence object --> Advanced Indexing
a[(3, 3, 1, 8), ] 

array([7, 7, 9, 2])

In [113]:
# Negative values work as before
a[np.array([3, 3, -3, 8])]

array([7, 7, 3, 2])

In [114]:
# Note that this will raise error as a is 1 dimensional
a[3, 3, 1, 8]  # (3, 3, 1, 8) is a tuple -> Basic indexing

IndexError: too many indices for array: array is 1-dimensional, but 4 were indexed
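
With more than one dimension, one index array is supplied per axis and the arrays are paired up element by element. A minimal sketch with an illustrative 2-D array:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

rows = [0, 1, 2]
cols = [0, 1, 0]

# Picks the elements at positions (0, 0), (1, 1) and (2, 0).
picked = x[rows, cols]

# A single index array selects whole rows along axis 0.
picked_rows = x[[0, 2]]

# Negative indices count from the end of each axis.
last = x[[-1], [-1]]

print(picked, picked_rows.shape, last)
```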

### Boolean Array Indexing

This advanced indexing occurs when `obj` is an array object of Boolean type, such as may be returned from comparison operators.

If `obj.ndim == x.ndim`, `x[obj]` returns a 1-dimensional array filled with the elements of `x` corresponding to the `True` values of `obj`. The search order is row-major, C-style. The shape of `obj` must match the shape of `x` along the indexed dimensions; in current NumPy releases, a Boolean index of a different size raises an `IndexError` rather than being padded with `False`.

In [120]:
a = np.random.randint(10, size=(3, 4))

In [121]:
a

array([[9, 2, 7, 9],
       [6, 1, 1, 7],
       [4, 3, 0, 6]])

In [122]:
a[:, [True, False, True, False]]

array([[9, 7],
       [6, 1],
       [4, 0]])

In [123]:
a[[True, False, True], :]

array([[9, 2, 7, 9],
       [4, 3, 0, 6]])

A common use case for this is filtering for desired element values. For example, one may wish to select all entries from an array which are not NaN:

In [136]:
x = np.array([[1., 2.], [np.nan, 3.], [np.nan, np.nan]])

In [137]:
np.isnan(x)

array([[False, False],
       [ True, False],
       [ True,  True]])

In [138]:
x[~np.isnan(x)]

array([1., 2., 3.])

In [139]:
x[(~np.isnan(x)).nonzero()]

array([1., 2., 3.])

In [140]:
(~np.isnan(x)).nonzero()

(array([0, 0, 1]), array([0, 1, 1]))

Or wish to add a constant to all negative elements:



In [132]:
x = np.array([1., -1., -2., 3])
x[x < 0] += 20

In [133]:
x

array([ 1., 19., 18.,  3.])

In general if an index includes a Boolean array, the result will be identical to inserting `obj.nonzero()` into the same position and using the integer array indexing mechanism described above. `x[ind_1, boolean_array, ind_2]` is equivalent to `x[(ind_1,) + boolean_array.nonzero() + (ind_2,)]`.
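
The equivalence stated above can be verified on a small example. The array `x` and the mask `b` are assumptions made for this sketch.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)
b = np.array([True, False, True])   # Boolean index for axis 1

with_mask = x[0, b, 1]

# b.nonzero() is a tuple of integer index arrays, here (array([0, 2]),).
with_nonzero = x[(0,) + b.nonzero() + (1,)]

print(with_mask, with_nonzero)
```

Both forms select `x[0, 0, 1]` and `x[0, 2, 1]`.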

This is where the concept of a ***mask*** comes into play.

A mask is an array that has the exact same shape as your data, but instead of your values, it holds Boolean values: either `True` or `False`. You can use this mask array to index into your data array in nonlinear and complex ways. It will return all of the elements where the Boolean array has a True value.

Here’s an example showing the process, first in slow motion and then how it’s typically done, all in one line:

In [None]:
numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)

numbers

In [None]:
mask = numbers % 4 == 0

mask

In [None]:
numbers[mask]

In [None]:
np.may_share_memory(numbers[mask], numbers)

In [None]:
# how it's typically done
by_four = numbers[numbers % 4 == 0]

by_four

You’ll see an explanation of the new array creation tricks used in the first input in a moment, but for now, focus on the meat of the example. These are the important parts:

- ***`mask = numbers % 4 == 0`*** creates the mask by performing a ***vectorized Boolean computation***, taking each element and checking to see if it divides evenly by four. This returns a mask array of the same shape with the element-wise results of the computation.
- ***`numbers[mask]`*** uses this mask to index into the original numbers array. This causes the array to lose its original shape, reducing it to one dimension, but you still get the data you’re looking for.
- ***`by_four = numbers[numbers % 4 == 0]`*** provides a more traditional, idiomatic masked selection that you might see in the wild, with an anonymous filtering array created inline, inside the selection brackets. This syntax is similar to usage in the R programming language.

Coming back to `numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)`, you encounter three new concepts:



- Using `np.linspace()` to generate an evenly spaced array
- Setting the `dtype` of an output
- Reshaping an array with `-1`

`np.linspace()` generates `n` numbers evenly distributed between a minimum and a maximum, which is useful for evenly distributed sampling in scientific plotting.

Because of the particular calculation in this example, it makes life easier to have integers in the `numbers` array. But because the range from 5 to 50 doesn’t split evenly into 24 points, the resulting numbers would be floating-point numbers. You specify a `dtype` of `int` to make the function drop the fractional part and give you whole integers. You’ll see a more detailed discussion of data types later on.

Finally, `array.reshape()` can take `-1` as one of its dimension sizes. That signifies that NumPy should just figure out how big that particular axis needs to be based on the size of the other axes. In this case, with 24 values and a size of `4` in axis 0, axis 1 ends up with a size of `6`.
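
The three concepts can be tried in isolation on smaller inputs. This is an illustrative sketch; the variable names are not part of the original notebook.

```python
import numpy as np

# Five evenly spaced values from 0 to 1, both endpoints included.
floats = np.linspace(0, 1, 5)

# The same call with dtype=int drops the fractional parts.
ints = np.linspace(0, 1, 5, dtype=int)

# With 24 values and 4 rows, NumPy infers 6 columns for the -1 axis.
grid = np.arange(24).reshape(4, -1)

print(floats, ints, grid.shape)
```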

Here’s one more example to show off the power of masked filtering. The normal distribution is a probability distribution in which roughly 95.45% of values occur within two standard deviations of the mean.

You can verify that with a little help from NumPy’s random module for generating random values:

In [156]:
import numpy as np

from numpy.random import default_rng

rng = default_rng()

values = rng.standard_normal(10000)

values[:5]

array([ 0.05200714,  1.34950551, -0.96828012, -0.63604235,  0.86354089])

In [157]:
std = values.std()

std

0.9947914643194757

In [158]:
filtered = values[(values > -2 * std) & (values < 2 * std)]

filtered.size

9541

In [159]:
values.size

10000

In [160]:
filtered.size / values.size

0.9541

Here you use a potentially strange-looking syntax to combine filter conditions: a ***binary & operator***. Why would that be the case? It’s because NumPy designates `&` and `|` as the vectorized, element-wise operators to combine Booleans. If you try `A and B` instead, you’ll get a `ValueError` saying that the truth value of an array with more than one element is ambiguous, because `and` operates on the truth value of the whole array, not element by element.
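
A small sketch of the difference, using a hand-written array instead of random values so the result is predictable. The variable names are illustrative.

```python
import numpy as np

values = np.array([-3.0, -1.0, 0.5, 2.5])

# The parentheses matter: & binds more tightly than the comparisons.
inside = (values > -2) & (values < 2)

# np.logical_and is the function form of the same element-wise operation.
same = np.logical_and(values > -2, values < 2)

# The Python keyword 'and' asks for the truth value of a whole array.
try:
    (values > -2) and (values < 2)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

print(inside, values[inside])
```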