![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The NumPy Library - Advanced Indexing and Slicing

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


## 1. Advanced Indexing

By default, a NumPy array allows access to its elements through a **positional integer index** that starts at zero for the first element and runs up to the array's size minus one for the last element.

However, beyond Python's standard sequence indexing, NumPy also allows elements to be accessed through arrays of indexes and even through boolean masks built from logical conditions.

Before going forward let's initialize some working data:

In [2]:
x_1d = np.array([ 1,   2,   3,   4,   5,   6,   7,   8,   9,  10])

x_2d = np.array(
      [[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [ 11,  12,  13,  14,  15,  16,  17,  18,  19,  20],
       [ 21,  22,  23,  24,  25,  26,  27,  28,  29,  30],
       [ 31,  32,  33,  34,  35,  36,  37,  38,  39,  40],
       [ 41,  42,  43,  44,  45,  46,  47,  48,  49,  50],
       [ 51,  52,  53,  54,  55,  56,  57,  58,  59,  60],
       [ 61,  62,  63,  64,  65,  66,  67,  68,  69,  70],
       [ 71,  72,  73,  74,  75,  76,  77,  78,  79,  80],
       [ 81,  82,  83,  84,  85,  86,  87,  88,  89,  90],
       [ 91,  92,  93,  94,  95,  96,  97,  98,  99, 100]])

NumPy uses two types of advanced indexing: integer indexing and boolean indexing.

### 1.1 Advanced integer indexing

This kind of indexing uses integer arrays to select multiple elements of an array. For example, an array of indexes can be used to obtain a subset of the elements of the one-dimensional array:

In [3]:
# select the odd values of the array; they sit at the even positions
odd_elements_indexes = [0,2,4,6,8]

print("The odd elements in the array can be selected via an array of indexes {} and they form the array {}".format(
    odd_elements_indexes,
    x_1d[odd_elements_indexes]
))

The odd elements in the array can be selected via an array of indexes [0, 2, 4, 6, 8] and they form the array [1 3 5 7 9]


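We can also use multiple arrays of indexes to access multiple elements from multi-dimensional arrays. The index arrays are paired position by position, so each (row, column) pair selects exactly one element.  
In our case we can apply this to the two-dimensional array created above:

In [4]:
# select the diagonal elements of the top-left 3x3 sub-matrix of the original one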
index_row = [0, 1, 2]
index_column = [0, 1, 2]
print("The elements at index row {} and index column {} are {}".format(
    index_row,
    index_column,
    x_2d[index_row, index_column]
))

The elements at index row [0, 1, 2] and index column [0, 1, 2] are [ 1 12 23]
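
Because paired index arrays return one element per (row, column) pair, the cell above yields the diagonal rather than a full block. The sketch below shows one possible way to get the complete 3x3 sub-matrix with `np.ix_`; the names `diagonal` and `sub_matrix` are illustrative, and the matrix is rebuilt with `np.arange` only so that the example is self-contained.

```python
import numpy as np

# Same 10x10 matrix as x_2d above, rebuilt here so the sketch runs on its own.
x_2d = np.arange(1, 101).reshape(10, 10)

index_row = [0, 1, 2]
index_column = [0, 1, 2]

# Paired index arrays: one element per (row, column) pair, i.e. the diagonal.
diagonal = x_2d[index_row, index_column]

# np.ix_ builds an open mesh, so every row index is combined with every column index.
sub_matrix = x_2d[np.ix_(index_row, index_column)]

print(diagonal)
print(sub_matrix)
```

When the rows and columns are contiguous, plain slicing such as `x_2d[0:3, 0:3]` gives the same block.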


It is important to note that NumPy allows the use of negative indexes. In that case, the position is counted from the **end of the array**, not from its beginning, and the value -1 represents the position of the last element.

In [5]:
# select the last three elements via the negative indexes
last_elements_index = [-3 , -2, -1]

print("The last elements of the array are: {}".format(
    x_1d[last_elements_index]
))

The last elements of the array are: [ 8  9 10]
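
As a small illustration of how negative indexes map onto positions, the sketch below converts each negative index `-k` into its positive equivalent `n - k`, where `n` is the array length. The helper names `n` and `positive_index` are introduced here for illustration only.

```python
import numpy as np

x_1d = np.arange(1, 11)   # same values as x_1d above
n = len(x_1d)

negative_index = [-3, -2, -1]

# A negative index -k addresses the same position as n - k.
positive_index = [n + i for i in negative_index]

print(positive_index)
print(x_1d[negative_index])
print(x_1d[positive_index])
```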


### 1.2 Advanced boolean indexing

This kind of advanced indexing occurs when **using arrays of booleans for indexing**: each **`True` value in the index array** selects the element at the same position in the target array. 
The boolean array is usually produced by applying comparison operators to the array object itself.


We can use this technique to select the first and the last element from the array:

In [6]:
# selecting the first and the last element from the array
# using boolean indexing
boolean_index_array = [True, False,  False,  False,  False,  False,  False,  False,  False, True]

print("The elements selected by the boolean filter {} are {}".format(
    boolean_index_array,
    x_1d[boolean_index_array]
))

The elements selected by the boolean filter [True, False, False, False, False, False, False, False, False, True] are [ 1 10]


We can rewrite the selection of the odd elements from the array by using logical expressions as well (so no explicit list of indexes is needed):

In [7]:
# use the arithmetic and logical expressions from the array
# in order to select the odd elements
print("The odd elements in the array {} are {}".format(
    x_1d,
    x_1d[x_1d % 2 == 1]
))

The odd elements in the array [ 1  2  3  4  5  6  7  8  9 10] are [1 3 5 7 9]
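
To make the mechanism more visible, the sketch below prints the intermediate boolean mask and then combines several conditions. The variable names (`odd_mask`, `odd_and_large`, `not_odd`) are illustrative and not part of the original notebook.

```python
import numpy as np

x_1d = np.arange(1, 11)   # same values as x_1d above

# The comparison itself produces a boolean array with the same shape as x_1d.
odd_mask = x_1d % 2 == 1
print(odd_mask)

# Conditions are combined element-wise with & (and), | (or) and ~ (not).
# Each comparison needs its own parentheses because of operator precedence.
odd_and_large = x_1d[(x_1d % 2 == 1) & (x_1d > 4)]
not_odd = x_1d[~odd_mask]

print(odd_and_large)
print(not_odd)
```

Python's `and`, `or` and `not` keywords do not work element-wise on arrays, which is why the bitwise operators are used here.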


This logical indexing also applies to multi-dimensional arrays. However, we must keep in mind that it flattens the resulting array (the result is one-dimensional):

In [8]:
# selecting the powers of 2 using boolean indexing
print("The powers of 2 in the array \n {} \n are: \n {}".format(
    x_2d,
    x_2d[np.log2(x_2d).astype(int)  == np.log2(x_2d)]
))

The powers of 2 in the array 
 [[  1   2   3   4   5   6   7   8   9  10]
 [ 11  12  13  14  15  16  17  18  19  20]
 [ 21  22  23  24  25  26  27  28  29  30]
 [ 31  32  33  34  35  36  37  38  39  40]
 [ 41  42  43  44  45  46  47  48  49  50]
 [ 51  52  53  54  55  56  57  58  59  60]
 [ 61  62  63  64  65  66  67  68  69  70]
 [ 71  72  73  74  75  76  77  78  79  80]
 [ 81  82  83  84  85  86  87  88  89  90]
 [ 91  92  93  94  95  96  97  98  99 100]] 
 are: 
 [ 1  2  4  8 16 32 64]
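
If the original layout needs to be preserved, one option is `np.where`, which keeps the shape and substitutes a placeholder for the elements that do not match. The sketch below is a variation, not the notebook's method: it uses an integer-only bit test for powers of two instead of `np.log2` (this assumes strictly positive integers), and the placeholder value 0 is an arbitrary choice.

```python
import numpy as np

x_2d = np.arange(1, 101).reshape(10, 10)   # same values as x_2d above

# A power of two has a single bit set, so x & (x - 1) is zero.
# Assumption: all values are strictly positive integers.
mask = (x_2d & (x_2d - 1)) == 0

flattened = x_2d[mask]                  # 1-D result, the 2-D layout is lost
same_shape = np.where(mask, x_2d, 0)    # 10x10 result, non-matches become 0

print(flattened)
print(same_shape.shape)
print(same_shape[0])
```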


From a practical data processing perspective, logical indexing is used extensively to remove invalid data. A first example is the removal of empty values from an array:

In [9]:
# we will consider an input data with empty values for some
# of its elements and create a processed array with all 
# these empty values removed

x_1d_with_empty_data = np.array([ None,   2,   None,   4,   5,   None,   7,   None,   None,  10])

print("Non-empty elements in the array \n {} \n are: \n {}".format(
    x_1d_with_empty_data,
    x_1d_with_empty_data[x_1d_with_empty_data != None]
))

Non-empty elements in the array 
 [None 2 None 4 5 None 7 None None 10] 
 are: 
 [2 4 5 7 10]
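
In numeric arrays, missing values are often encoded as `NaN` rather than `None`. The sketch below assumes such a float array (the name `x_with_nan` is hypothetical) and shows that the mask then has to be built with `np.isnan`, because `NaN` does not compare equal to anything, including itself.

```python
import numpy as np

# Hypothetical float array where missing values are encoded as NaN.
x_with_nan = np.array([np.nan, 2, np.nan, 4, 5, np.nan, 7, np.nan, np.nan, 10])

# np.isnan marks the missing entries; ~ inverts the mask to keep the valid ones.
valid = x_with_nan[~np.isnan(x_with_nan)]

print(valid)
```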


In a more advanced usage scenario, this method can be used to remove data which has an unexpected format (e.g. non-numeric format):

In [10]:
# we will consider an input data with invalid values for some
# of its elements and create a processed array with all 
# these invalid values removed

x_1d_with_invalid_data = np.array([ "1",   "2",   "three",   "4",   "Invalid",   "6",   "7",   "Eight",   "maybe 9",  10])

print("Numeric elements in the array \n {} \n are: \n {}".format(
    x_1d_with_invalid_data,
    x_1d_with_invalid_data[np.char.isnumeric(x_1d_with_invalid_data)]
))

Numeric elements in the array 
 ['1' '2' 'three' '4' 'Invalid' '6' '7' 'Eight' 'maybe 9' '10'] 
 are: 
 ['1' '2' '4' '6' '7' '10']
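
The filtered result is still an array of strings. A possible next step, sketched below under the assumption that every remaining value is a plain decimal integer, is to convert it to a numeric type with `astype`. The names `raw` and `numbers` are illustrative.

```python
import numpy as np

# Same values as above, written as strings throughout.
raw = np.array(["1", "2", "three", "4", "Invalid", "6", "7", "Eight", "maybe 9", "10"])

numeric_mask = np.char.isnumeric(raw)

# Keep only the numeric strings, then convert them to integers.
numbers = raw[numeric_mask].astype(int)

print(numbers)
print(numbers.sum())
```

Note that `np.char.isnumeric` also accepts some Unicode numeric characters (fractions, for example) that cannot be converted with `astype(int)`, so real input may need a stricter check.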


One of the most important scenarios in data systems is the detection of anomalous data. A good example of anomalous data is a data point that differs significantly from the other data points (in statistics this is called an outlier): 

In [11]:
# we will consider an array where the last two elements are statistical outliers
# and we will use standard Python functions to highlight them (for demonstration
# purposes we will consider an outlier to be more than 3 times the average
# value of the array; in production systems the algorithms are more complex)
x_1d_outliers = np.array([ 1,   2,   3,   4,   5,   6,   7,   8,   950,  1000])
x_1d_outliers_average = sum (x_1d_outliers) / len(x_1d_outliers)

print("The outliers in the array \n {} \n are: \n {}".format(
    x_1d_outliers,
    x_1d_outliers[x_1d_outliers > 3 * x_1d_outliers_average]
))

The outliers in the array 
 [   1    2    3    4    5    6    7    8  950 1000] 
 are: 
 [ 950 1000]
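
As one example of a slightly more robust rule, the sketch below applies the common interquartile range (IQR) criterion with boolean indexing. This is an illustration rather than the method used above; the factor 1.5 is the conventional choice, and the names `lower`, `upper` and `outliers` are introduced here.

```python
import numpy as np

x_1d_outliers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 950, 1000])

# First and third quartiles, and the spread between them.
q1, q3 = np.percentile(x_1d_outliers, [25, 75])
iqr = q3 - q1

# Conventional IQR fences: anything outside them is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = x_1d_outliers[(x_1d_outliers < lower) | (x_1d_outliers > upper)]

print(outliers)
```

Unlike a threshold based on the average, the quartiles are barely affected by the extreme values themselves.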


## 2. Slicing

NumPy allows the selection of sub-arrays via the **slice notation**. The process of selecting sub-arrays via the slice notation is called **slicing**.

The standard notation for slicing is `[start:stop:step]` where:

* **start** is the start index for the slice. By default the start is 0, which means the beginning of the array;
* **stop** is the stop index for the slice (the stop value is not included). By default the stop is the length of the array, so the slice runs to the end of the array;
* **step** is the increment from one selected index to the next. By default the step is 1. With a negative step the start and stop defaults swap: the slice starts at the last element and runs back to the first.

This allows selecting a sub-array from the start index up to (but not including) the stop index using the specified step. The slice notation can be applied to multiple dimensions as well.
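
The short sketch below illustrates how omitted slice values behave. In particular, omitting `stop` runs through the last element, whereas an explicit stop of -1 leaves the last element out. The variable names are illustrative.

```python
import numpy as np

x_1d = np.arange(1, 11)   # same values as x_1d above

everything = x_1d[:]                # all three values omitted: the whole array
explicit = x_1d[0:len(x_1d):1]      # the same slice written out in full
all_but_last = x_1d[0:-1]           # stop = -1 excludes the last element
every_third = x_1d[1:9:3]           # positions 1, 4 and 7

print(everything)
print(all_but_last)
print(every_third)
```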

In [12]:
# selecting the first elements in the array via slice notation
elements_length = 3
print("The first\n {} elements from array \n {} \n are: \n{}".format(
    elements_length,
    x_1d,
    x_1d[0:elements_length]
))

The first
 3 elements from array 
 [ 1  2  3  4  5  6  7  8  9 10] 
 are: 
[1 2 3]


We can use the slicing notation to select the odd elements in the array:

In [13]:
# select the odd elements in the array by using slice notation
print("The odd elements in the array are: {}".format(
    x_1d[0::2]
))

The odd elements in the array are: [1 3 5 7 9]


Using a negative step will reverse the order of element selection (useful for reversing arrays):

In [14]:
# we will use negative steps for reversing the item selection order 
print("The reverse of \n {} \n is: \n {}".format(
    x_1d,
    x_1d[::-1]
))

The reverse of 
 [ 1  2  3  4  5  6  7  8  9 10] 
 is: 
 [10  9  8  7  6  5  4  3  2  1]


Advanced slicing works on multidimensional arrays as well, allowing for selecting rows from a matrix:

In [15]:
# using advanced slicing to select the first row from a matrix
print("The first row from array \n {} \n is: \n{}".format(
    x_2d,
    x_2d[0,:]    
))

The first row from array 
 [[  1   2   3   4   5   6   7   8   9  10]
 [ 11  12  13  14  15  16  17  18  19  20]
 [ 21  22  23  24  25  26  27  28  29  30]
 [ 31  32  33  34  35  36  37  38  39  40]
 [ 41  42  43  44  45  46  47  48  49  50]
 [ 51  52  53  54  55  56  57  58  59  60]
 [ 61  62  63  64  65  66  67  68  69  70]
 [ 71  72  73  74  75  76  77  78  79  80]
 [ 81  82  83  84  85  86  87  88  89  90]
 [ 91  92  93  94  95  96  97  98  99 100]] 
 is: 
[ 1  2  3  4  5  6  7  8  9 10]


In the same manner we can select the first column as well:

In [16]:
# using advanced slicing to select the first column from a matrix
print("The first column from array \n {} \n is: \n{}".format(
    x_2d,
    x_2d[:,0]   
))

The first column from array 
 [[  1   2   3   4   5   6   7   8   9  10]
 [ 11  12  13  14  15  16  17  18  19  20]
 [ 21  22  23  24  25  26  27  28  29  30]
 [ 31  32  33  34  35  36  37  38  39  40]
 [ 41  42  43  44  45  46  47  48  49  50]
 [ 51  52  53  54  55  56  57  58  59  60]
 [ 61  62  63  64  65  66  67  68  69  70]
 [ 71  72  73  74  75  76  77  78  79  80]
 [ 81  82  83  84  85  86  87  88  89  90]
 [ 91  92  93  94  95  96  97  98  99 100]] 
 is: 
[ 1 11 21 31 41 51 61 71 81 91]
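
Two small variations on column selection are sketched below: a negative index for the last column, and a one-element slice that keeps the result two-dimensional. The variable names are illustrative.

```python
import numpy as np

x_2d = np.arange(1, 101).reshape(10, 10)   # same values as x_2d above

# An integer index drops the indexed dimension: the result is one-dimensional.
last_column = x_2d[:, -1]

# A slice of length one keeps the dimension: the result is a 10x1 column.
first_column_2d = x_2d[:, 0:1]

print(last_column, last_column.shape)
print(first_column_2d.shape)
```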


We can even use advanced slicing notation to select a sub-matrix from an existing matrix:

In [17]:
# select a 4x4 sub-matrix from an existing matrix 
center_length = 2

column_center = len(x_2d[0])//2  
row_center = len(x_2d)//2
print(
  "The center submatrix with dimensions {} x {} \n from array \n {} \n is: \n{}"
    .format(
      center_length * 2,
      center_length * 2,
      x_2d,
      x_2d[
          row_center - center_length : row_center + center_length,
          column_center - center_length : column_center + center_length
          ] 
    )
)

The center submatrix with dimensions 4 x 4 
 from array 
 [[  1   2   3   4   5   6   7   8   9  10]
 [ 11  12  13  14  15  16  17  18  19  20]
 [ 21  22  23  24  25  26  27  28  29  30]
 [ 31  32  33  34  35  36  37  38  39  40]
 [ 41  42  43  44  45  46  47  48  49  50]
 [ 51  52  53  54  55  56  57  58  59  60]
 [ 61  62  63  64  65  66  67  68  69  70]
 [ 71  72  73  74  75  76  77  78  79  80]
 [ 81  82  83  84  85  86  87  88  89  90]
 [ 91  92  93  94  95  96  97  98  99 100]] 
 is: 
[[34 35 36 37]
 [44 45 46 47]
 [54 55 56 57]
 [64 65 66 67]]
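
One practical difference between the two techniques in this notebook is worth a quick check: in NumPy, basic slicing returns a view onto the original data, while advanced (integer or boolean) indexing returns a copy. The sketch below demonstrates this with `np.shares_memory`; the names `sliced` and `picked` are illustrative.

```python
import numpy as np

x_1d = np.arange(1, 11)   # same values as x_1d above

sliced = x_1d[0:3]         # basic slicing: a view on x_1d
picked = x_1d[[0, 1, 2]]   # integer array indexing: an independent copy

sliced[0] = 100            # also changes x_1d, because the memory is shared
picked[1] = 200            # leaves x_1d untouched

print(x_1d[:3])
print(np.shares_memory(x_1d, sliced))
print(np.shares_memory(x_1d, picked))
```

Calling `.copy()` on a slice is a simple way to get an independent array when the original must not be modified.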
