![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The NumPy Library - Useful Functions

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


## 1. Array Manipulation

NumPy provides a wide range of array manipulation functions for operations such as concatenation, stacking, deletion, and so forth.

The full documentation of these functions can be accessed at [Array manipulation routines](https://numpy.org/doc/stable/reference/routines.array-manipulation.html).



Let's first create the basic data needed to understand these functions:

In [2]:
# create several one-dimensional arrays
# used to demonstrate the functionality
array_size_1d = 10

arrays_1d_data = [np.arange(array_size_1d) + 
                    array_size_1d*i 
                  for i in range(4) ]
for i in range(1,5):
  globals()["array_1d_{}".format(i)] = arrays_1d_data[i - 1] + 1
  print(
      "Created variable {} with values \n{}\n".format(
        "array_1d_{}".format(i),
        globals()["array_1d_{}".format(i)]
      )
  )


Created variable array_1d_1 with values 
[ 1  2  3  4  5  6  7  8  9 10]

Created variable array_1d_2 with values 
[11 12 13 14 15 16 17 18 19 20]

Created variable array_1d_3 with values 
[21 22 23 24 25 26 27 28 29 30]

Created variable array_1d_4 with values 
[31 32 33 34 35 36 37 38 39 40]



In [3]:
# create several two-dimensional arrays
# used to demonstrate the functionality
array_row_size_2d = 2
array_column_size_2d = 5

arrays_2d_data = [np.arange(array_row_size_2d*array_column_size_2d) + 
                    array_row_size_2d*array_column_size_2d*i 
                  for i in range(4) ]
for i in range(1,5):
  globals()["array_2d_{}".format(i)] = arrays_2d_data[i - 1].reshape(
                    (array_row_size_2d, array_column_size_2d)
                  ) + 1
  print(
      "Created variable {} with values \n{}\n".format(
        "array_2d_{}".format(i),
        globals()["array_2d_{}".format(i)]
      )
  )


Created variable array_2d_1 with values 
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]

Created variable array_2d_2 with values 
[[11 12 13 14 15]
 [16 17 18 19 20]]

Created variable array_2d_3 with values 
[[21 22 23 24 25]
 [26 27 28 29 30]]

Created variable array_2d_4 with values 
[[31 32 33 34 35]
 [36 37 38 39 40]]



Array transposition is a basic array manipulation routine: it swaps the rows and columns of a target array.

In NumPy this is done by using the **[transpose](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html)** function:

In [4]:
print(
    "The transpose of \n{}\n is \n{}\n ".format(
        array_2d_1,
        np.transpose(array_2d_1)
    )
  )

The transpose of 
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
 is 
[[ 1  6]
 [ 2  7]
 [ 3  8]
 [ 4  9]
 [ 5 10]]
 


NumPy's transpose functionality can also be accessed via the **T** attribute of a NumPy array:

In [5]:
print(
    "The transpose of \n{}\n is \n{}\n ".format(
        array_2d_1,
        array_2d_1.T
    )
  )

The transpose of 
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
 is 
[[ 1  6]
 [ 2  7]
 [ 3  8]
 [ 4  9]
 [ 5 10]]
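
Transposition is not limited to matrices. The following minimal sketch uses made-up arrays (`vector` and `cube` are illustrative names, not part of the notebook data) to show that transposing a 1-D array changes nothing, and that the optional `axes` argument of `np.transpose` selects the new axis order for arrays with more dimensions:

```python
import numpy as np

# a 1-D array has a single axis, so its transpose has the same shape
vector = np.arange(1, 6)
print(vector.T.shape)

# without axes, transpose reverses the order of all axes
cube = np.arange(24).reshape(2, 3, 4)
print(np.transpose(cube).shape)

# with axes, the listed order becomes the new axis order
swapped = np.transpose(cube, axes=(1, 0, 2))
print(swapped.shape)
```

The shapes printed are `(5,)`, `(4, 3, 2)` and `(3, 2, 4)`.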
 


One of the most useful operations is array concatenation, which **joins a sequence of arrays along an existing axis (dimension)**.

The NumPy function that provides this functionality is **[concatenate](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html)**, which has the following relevant parameters:

*   (a1, a2, ...) - sequence of arrays: the arrays to concatenate; they must have the same shape except along the concatenation axis;
*   axis - int, optional: the axis along which the arrays will be joined (0 by default).

The function returns the concatenated array.


In [6]:
# we will use a simple concatenation operation
print(
    "Arrays for concatenations are \n{}\n and \n{}\n".format(
      array_2d_1,
      array_2d_2      
    )
  )

for axis_value in range (0,2) :
  concatenated_array = np.concatenate(
    (array_2d_1, array_2d_2),
    axis = axis_value
  )
  print("Arrays concatenated on axis {} generate \n{}\n with shape {} \n".format(
        axis_value, 
        concatenated_array,
        concatenated_array.shape
      )
  )


Arrays for concatenations are 
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
 and 
[[11 12 13 14 15]
 [16 17 18 19 20]]

Arrays concatenated on axis 0 generate 
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]
 with shape (4, 5) 

Arrays concatenated on axis 1 generate 
[[ 1  2  3  4  5 11 12 13 14 15]
 [ 6  7  8  9 10 16 17 18 19 20]]
 with shape (2, 10) 
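
The same function also works on one-dimensional data. As a small sketch (the arrays are recreated here so the block runs on its own, and the names `first_1d`, `second_1d`, `matrix` are illustrative), 1-D arrays are joined end to end, and passing `axis=None` flattens every input before joining:

```python
import numpy as np

first_1d = np.arange(1, 11)    # 1 .. 10
second_1d = np.arange(11, 21)  # 11 .. 20

# 1-D arrays have only axis 0, so they are joined end to end
joined = np.concatenate((first_1d, second_1d))
print(joined.shape)

# axis=None flattens the inputs first, so shapes may differ
matrix = first_1d.reshape(2, 5)
flattened = np.concatenate((matrix, second_1d), axis=None)
print(flattened.shape)
```

Both results have shape `(20,)`.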



A similar function is **[stack](https://numpy.org/doc/stable/reference/generated/numpy.stack.html)**, which also joins arrays. The difference from concatenation is that the stack function **creates a new dimension** in the result, so all input arrays must have the same shape.

In [7]:
# we will use the stack function to join the arrays along
# a newly created dimension
for axis_value in range (0,2) :
  stacked_array = np.stack(
    (array_2d_1, array_2d_2),
    axis = axis_value
  )
  print("Arrays stacked on axis {} generate \n{}\n with shape {} \n".format(
        axis_value, 
        stacked_array,
        stacked_array.shape
      )
  )

Arrays stacked on axis 0 generate 
[[[ 1  2  3  4  5]
  [ 6  7  8  9 10]]

 [[11 12 13 14 15]
  [16 17 18 19 20]]]
 with shape (2, 2, 5) 

Arrays stacked on axis 1 generate 
[[[ 1  2  3  4  5]
  [11 12 13 14 15]]

 [[ 6  7  8  9 10]
  [16 17 18 19 20]]]
 with shape (2, 2, 5) 
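
Because stacking two 2-D arrays yields a 3-D result, the new axis can also be placed last. The sketch below (with arrays rebuilt locally under the illustrative names `left` and `right`) shows that case, and also that `np.stack` rejects inputs whose shapes differ:

```python
import numpy as np

left = np.arange(1, 11).reshape(2, 5)
right = np.arange(11, 21).reshape(2, 5)

# axis=-1 inserts the new dimension at the end
stacked_last = np.stack((left, right), axis=-1)
print(stacked_last.shape)

# inputs with different shapes cannot be stacked
try:
    np.stack((left, np.arange(10)))
    stack_failed = False
except ValueError as error:
    stack_failed = True
    print("Stacking failed:", error)
```

Here `stacked_last` has shape `(2, 5, 2)`, and `stacked_last[i, j]` holds the pair of values found at position `(i, j)` in the two inputs.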



There are two shortcut functions, **[vstack](https://numpy.org/doc/stable/reference/generated/numpy.vstack.html)** and **[hstack](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html)**, that for two-dimensional arrays concatenate along the first and the second axis, respectively.

In [8]:
# the v-stacked result
v_stacked_array = np.vstack(
  (array_2d_1, array_2d_2)
)

print("Arrays v-stacked generate \n{}\n with shape {} \n".format(
        v_stacked_array,
        v_stacked_array.shape
      )
)

# the h-stacked result
h_stacked_array = np.hstack(
  (array_2d_1, array_2d_2)
)

print("Arrays h-stacked generate \n{}\n with shape {} \n".format(
        h_stacked_array,
        h_stacked_array.shape
      )
)

Arrays v-stacked generate 
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]
 with shape (4, 5) 

Arrays h-stacked generate 
[[ 1  2  3  4  5 11 12 13 14 15]
 [ 6  7  8  9 10 16 17 18 19 20]]
 with shape (2, 10) 
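
For one-dimensional inputs the two shortcuts behave a little differently. In this short sketch (illustrative names, arrays recreated locally), `vstack` treats each 1-D array as a row of a new matrix, while `hstack` joins them end to end:

```python
import numpy as np

row_a = np.arange(1, 11)
row_b = np.arange(11, 21)

# each 1-D array becomes one row of a 2-D result
v_result = np.vstack((row_a, row_b))
print(v_result.shape)

# 1-D arrays are joined end to end into a longer 1-D array
h_result = np.hstack((row_a, row_b))
print(h_result.shape)
```

The shapes are `(2, 10)` for `vstack` and `(20,)` for `hstack`.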



Note that record arrays with different field types cannot be joined with the concatenation functions above, because their types are incompatible.

However, NumPy has a dedicated **[recfunctions](https://numpy.org/devdocs/user/basics.rec.html)** module (numpy.lib.recfunctions) that supports such operations by merging the fields of both arrays.

In [9]:
# import the numpy package supporting record functions
from numpy.lib import recfunctions as rfn

# create a record array containing the int and float values of several numbers
# (8-byte floats are used because 16-byte floats are not available on all platforms)
rec_array_numeric = np.array(
    [(1, 1.0), (2, 2.0), (3,3.0)], 
    dtype=[("int_value", "<i8"), ("float_value", "<f8")]
  )

# create a record array containing the literal representation of these numbers
# in EN and FR
rec_array_literal = np.array(
    [("one", "un"), ("two", "deux"), ("three", "trois")],
    dtype=[("EN", "<U250"), ("FR", "<U250")]
  )

# concatenate the arrays based on fields
merged_array = rfn.merge_arrays (
  (rec_array_numeric, rec_array_literal),
  asrecarray=True, 
  flatten=True
)

# display the merged array 
print(
    "The merged array containing all fields is \n{} ".format(
       merged_array 
    )
  )

The merged array containing all fields is 
[(1, 1., 'one', 'un') (2, 2., 'two', 'deux') (3, 3., 'three', 'trois')] 
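
Once merged, the fields can be read back by name. The following sketch repeats the merge on a shorter pair of arrays (the names `numbers`, `words` and `merged` are illustrative, and smaller field types are assumed for brevity) and accesses the fields both by key and, because `asrecarray=True` was requested, as attributes:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

numbers = np.array(
    [(1, 1.0), (2, 2.0)],
    dtype=[("int_value", "<i8"), ("float_value", "<f8")]
)
words = np.array(
    [("one", "un"), ("two", "deux")],
    dtype=[("EN", "<U10"), ("FR", "<U10")]
)

merged = rfn.merge_arrays((numbers, words), asrecarray=True, flatten=True)

# the merged array exposes the fields of both inputs
print(merged.dtype.names)

# fields are available by key and as attributes of the record array
print(merged["FR"])
print(merged.int_value)
```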


## 2. Searching and sorting

An essential feature of NumPy is the **support for sorting and searching** data in arrays.

Let's generate some support data first:

In [10]:
# generate pseudo-random data
import random as rnd

rnd.seed(0)
random_data_1d = np.array(
                  rnd.sample(range(1000), 100),
                  dtype = np.int64
                )

print("The generated random data is: \n{}\n".format(random_data_1d))

The generated random data is: 
[864 394 776 911 430  41 265 988 523 497 414 940 802 849 310 488 366 597
 913 929 223 516 142 288 143 773  97 633 818 256 931 545 722 829 616 923
 150 317 101 747  75 920 870 700 338 483 573 103 362 444 323 625 655 934
 209 565 984 453 886 533 266  63 824 561  14  95 736 860 408 727 844 803
 684 640   1 626 505 847 888 341 249 960 333 720 891  64 195 581 227 244
 822 145 909 556 458  93  82 327 896 520]



In addition to the functionality provided by boolean indexing, NumPy allows searching for values based on various conditions via the **[where](https://numpy.org/doc/stable/reference/generated/numpy.where.html)** function. This function has the following parameters:

*   condition - an array of boolean values marking the positions where the condition is satisfied;
*   x: an array (or scalar) from which values are taken where condition is True;
*   y: an array (or scalar) from which values are taken where condition is False.

If x and y are not specified, **only the index positions where condition is True** are returned, as a tuple containing one index array per dimension.
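
The two calling forms can be compared on a tiny array. This is only a sketch with made-up data (`values`, `positions` and `labels` are illustrative names):

```python
import numpy as np

values = np.array([3, 12, 7, 20, 1])

# one-argument form: a tuple with the indices where the condition holds
positions = np.where(values > 5)
print(positions)
print(values[positions])  # the matching values themselves

# three-argument form: pick from x where True, from y where False
labels = np.where(values > 5, "big", "small")
print(labels)
```

The index tuple can be passed straight back to the array to retrieve the matching values, which gives the same result as the boolean index `values[values > 5]`.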



In [11]:
# identifies the indexes where the value is less than a target value
limit_value = 50
print(
    "The index positions where values are smaller than {} are \n{}\n".format(
      limit_value,
      np.where(random_data_1d < limit_value)        
    )
)

The index positions where values are smaller than 50 are 
(array([ 5, 64, 74]),)



In [12]:
# replaces the array values at or above a certain limit with the limit value
max_limit_value = 100
print(
    "The array with values limited to {} is \n {}".format(
      max_limit_value,
      np.where(
          random_data_1d < max_limit_value,
          random_data_1d,
          np.full(random_data_1d.shape[0], max_limit_value)
        )        
    )
)

The array with values limited to 100 is 
 [100 100 100 100 100  41 100 100 100 100 100 100 100 100 100 100 100 100
 100 100 100 100 100 100 100 100  97 100 100 100 100 100 100 100 100 100
 100 100 100 100  75 100 100 100 100 100 100 100 100 100 100 100 100 100
 100 100 100 100 100 100 100  63 100 100  14  95 100 100 100 100 100 100
 100 100   1 100 100 100 100 100 100 100 100 100 100  64 100 100 100 100
 100 100 100 100 100  93  82 100 100 100]
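
The replacement array does not have to be built explicitly. As a hedged sketch on a few sample values (the `sample` array is made up for illustration), a scalar can be passed to `np.where` and is broadcast to the array's shape, and `np.minimum` gives the same capped result:

```python
import numpy as np

sample = np.array([864, 41, 97, 100, 14])
cap = 100

# a scalar y is broadcast, so np.full is not required
capped_where = np.where(sample < cap, sample, cap)

# element-wise minimum against the cap gives the same values
capped_minimum = np.minimum(sample, cap)

print(capped_where)
print(capped_minimum)
```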


Another important NumPy feature is data sorting. An array can be sorted using the **[sort](https://numpy.org/doc/stable/reference/generated/numpy.sort.html)** and **[argsort](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)** functions. The difference between these two functions is that **sort** returns a sorted copy of the array, while **argsort** returns the array of indices that would sort it.

Both functions accept the following relevant parameters:

*   a - array: the array to be sorted;
*   axis - int, optional: the axis along which to sort a multidimensional array (the last axis by default). A matrix can therefore be sorted along its rows or along its columns.



In [13]:
# sorts the values in the array
print(
    "The sorted array is \n {}".format(
      np.sort(random_data_1d)
    )
)

The sorted array is 
 [  1  14  41  63  64  75  82  93  95  97 101 103 142 143 145 150 195 209
 223 227 244 249 256 265 266 288 310 317 323 327 333 338 341 362 366 394
 408 414 430 444 453 458 483 488 497 505 516 520 523 533 545 556 561 565
 573 581 597 616 625 626 633 640 655 684 700 720 722 727 736 747 773 776
 802 803 818 822 824 829 844 847 849 860 864 870 886 888 891 896 909 911
 913 920 923 929 931 934 940 960 984 988]


In [14]:
# returns the index positions that would sort the values in the array
print(
    "The sorted array's indexes are \n {}".format(
      np.argsort(random_data_1d)
    )
)

The sorted array's indexes are 
 [74 64  5 61 85 40 96 95 65 26 38 47 22 24 91 36 86 54 20 88 89 80 29  6
 60 23 14 37 50 97 82 44 79 48 16  1 68 10  4 49 57 94 45 15  9 76 21 99
  8 59 31 93 63 55 46 87 17 34 51 75 27 73 52 72 43 83 32 69 66 39 25  2
 12 71 28 90 62 33 70 77 13 67  0 42 58 78 84 98 92  3 18 41 35 19 30 53
 11 81 56  7]
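
To make the relation between the two functions concrete, here is a small sketch on made-up data (`values` and `matrix` are illustrative names). Indexing the array with the result of `argsort` reproduces the output of `sort`, reversing the indices gives a descending order, and the `axis` parameter controls the sorting direction of a matrix:

```python
import numpy as np

values = np.array([40, 10, 30, 20])

# indices that put the values in ascending order
order = np.argsort(values)
print(order)
print(values[order])        # same as np.sort(values)
print(values[order[::-1]])  # descending order

matrix = np.array([[9, 1, 8],
                   [3, 7, 2]])

# axis=0 sorts each column, axis=1 sorts each row
sorted_columns = np.sort(matrix, axis=0)
sorted_rows = np.sort(matrix, axis=1)
print(sorted_columns)
print(sorted_rows)
```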


## 3. Other useful functions



NumPy provides a series of mathematical, statistical and linear algebra functions that support data processing and scientific calculations. These functions are described in the [Routines](https://numpy.org/doc/stable/reference/routines.html) section of the NumPy documentation.

We will focus on the subset that is most commonly used in practice.

### 3.1 Statistical functions

NumPy offers a rich set of statistical functions, documented in the [Statistics](https://numpy.org/doc/stable/reference/routines.statistics.html) section.
The most important ones are:

*   min, argmin - return the minimum value and the positional index of its first occurrence in the array;
*   max, argmax - return the maximum value and the positional index of its first occurrence in the array;
*   average - returns the (optionally weighted) average of the values in the array;
*   median - returns the median of the values in the array.




In [15]:
# returns the maximal value and its position in the array
print(
    "The maximal value from the array is {} at the positional index {}".format(
      np.max(random_data_1d),
      np.argmax(random_data_1d)   
    )
)

The maximal value from the array is 988 at the positional index 7


In [16]:
# returns the average and the median of values from the array
print(
    "The average value from the array is {} and the median value is {}".format(
      np.average(random_data_1d),
      np.median(random_data_1d)   
    )
)

The average value from the array is 531.07 and the median value is 539.0
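
These functions also accept a few options that the calls above do not use. The sketch below, on made-up data with illustrative names, shows `np.average` with its `weights` parameter, `np.mean` with the `axis` parameter, and the median of an even number of values:

```python
import numpy as np

grades = np.array([1, 2, 3])

# weighted average: (1*3 + 2*1 + 3*1) / (3 + 1 + 1)
weighted = np.average(grades, weights=[3, 1, 1])
print(weighted)

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0 aggregates down the rows, giving one mean per column
column_means = np.mean(table, axis=0)
print(column_means)

# with an even count, the median is the mean of the two middle values
middle = np.median(np.array([1, 3, 2, 100]))
print(middle)
```

Without weights, `np.average` and `np.mean` return the same value.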


### 3.2 Mathematical functions

NumPy also offers a series of mathematical functions described in the [Mathematical functions](https://numpy.org/doc/stable/reference/routines.math.html) section of the NumPy documentation.

We will provide examples for several functions often used in practice.

In [17]:
# returns the sum of the values of the elements in the array
print(
    "The sum of the values of elements from the array is {}\n".format(
      np.sum(random_data_1d)  
    )
)

# returns the cumulative sum of the values of the elements in the array
print(
    "The cumulative sum of the values of elements from the array is \n{}\n".format(
      np.cumsum(random_data_1d)  
    )
)

# returns the natural logarithm of the values of the elements in the array
print(
    "The logarithm of the values of elements from the array is \n{}\n".format(
      np.log(random_data_1d)  
    )
)

# returns the sine of the values of the elements in the array (taken as radians)
print(
    "The sine of the values of elements from the array is \n{}\n".format(
      np.sin(random_data_1d)  
    )
)

The sum of the values of elements from the array is 53107

The cumulative sum of the values of elements from the array is 
[  864  1258  2034  2945  3375  3416  3681  4669  5192  5689  6103  7043
  7845  8694  9004  9492  9858 10455 11368 12297 12520 13036 13178 13466
 13609 14382 14479 15112 15930 16186 17117 17662 18384 19213 19829 20752
 20902 21219 21320 22067 22142 23062 23932 24632 24970 25453 26026 26129
 26491 26935 27258 27883 28538 29472 29681 30246 31230 31683 32569 33102
 33368 33431 34255 34816 34830 34925 35661 36521 36929 37656 38500 39303
 39987 40627 40628 41254 41759 42606 43494 43835 44084 45044 45377 46097
 46988 47052 47247 47828 48055 48299 49121 49266 50175 50731 51189 51282
 51364 51691 52587 53107]

The logarithm of the values of elements from the array is 
[6.76157277 5.97635091 6.65415252 6.8145429  6.06378521 3.71357207
 5.57972983 6.8956827  6.25958146 6.20859003 6.02586597 6.84587988
 6.68710861 6.74405919 5.7365723  6.19031541 5.90263333 6.39191711
 6.81673588 6.8341
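
On a small array the results of these functions are easy to check by hand. The following sketch uses made-up inputs chosen so that the expected values are well known (`np.log` is the natural logarithm and `np.sin` expects radians):

```python
import numpy as np

small = np.array([1, 2, 3, 4])

# running total: each element is the sum of all elements up to it
running_total = np.cumsum(small)
print(running_total)

# natural logarithm of 1, e and e squared
logs = np.log(np.array([1.0, np.e, np.e ** 2]))
print(logs)

# sine of 0, pi/2 and pi radians
sines = np.sin(np.array([0.0, np.pi / 2, np.pi]))
print(sines)
```

The last element of the cumulative sum always equals `np.sum` of the same array, and the sine of pi comes out as a tiny non-zero number because of floating-point rounding.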

### 3.3 Matrix operations

NumPy also supports matrix operations such as matrix multiplication and inversion.

Matrix multiplication is performed by the **[matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html)** function, which takes as arguments the matrices to be multiplied.

Matrix multiplication is used extensively in practice, for example to calculate the value of stock portfolios from the number of shares held and the unit value of each share:

In [24]:
# simulate data from two portfolios, each holding a specific number of shares
matrix_shares_portfolio = np.array([
  [10, 2, 5],
  [ 7, 9, 6]
])

# the unit values of the shares
matrix_shares_values = [10, 8, 12]

# calculate the value of each portfolio
print(
    "The value of the portfolios with share distribution \n{}\n and share values \n{}\n is \n{}\n".format(
      matrix_shares_portfolio,
      matrix_shares_values,
      np.matmul(
          matrix_shares_portfolio, 
          matrix_shares_values
      )    
    )    
  )


The value of the portfolios with share distribution 
[[10  2  5]
 [ 7  9  6]]
 and share values 
[10, 8, 12]
 is 
[176 214]
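
The same product can be written with Python's `@` operator, which NumPy arrays implement as matrix multiplication. As a sketch, the example below repeats the portfolio calculation and then extends it with a hypothetical second day of prices (the `prices_by_day` values are invented for illustration), so that the result has one row per portfolio and one column per day:

```python
import numpy as np

shares = np.array([[10, 2, 5],
                   [7, 9, 6]])
prices = np.array([10, 8, 12])

# operator form of np.matmul(shares, prices)
values = shares @ prices
print(values)

# hypothetical prices for two days: one row per share, one column per day
prices_by_day = np.array([[10, 11],
                          [8, 9],
                          [12, 12]])

# (2, 3) @ (3, 2) gives a (2, 2) matrix of portfolio values per day
values_by_day = shares @ prices_by_day
print(values_by_day)
```

The inner dimensions must match: the number of columns of the left operand has to equal the number of rows of the right one.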



NumPy can also compute the inverse of a matrix via the **[inv](https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html)** function from numpy.linalg:

In [40]:
#initialize a data matrix and calculate its inverse
matrix_2d = random_data_1d[:9].reshape(3, 3)
matrix_2d_inverse = np.linalg.inv(matrix_2d)

# display the matrix inverse
print(
    "The inverse of \n{}\n is \n{}\n".format(
        matrix_2d,
        matrix_2d_inverse
    )
  )

# explore if their product is close to identity matrix
print(
    "Their product is \n{}\n which is close to the identity matrix (considering rounding errors)".format(
      np.matmul(
        matrix_2d, 
        matrix_2d_inverse
      )    
    )
  )

The inverse of 
[[864 394 776]
 [911 430  41]
 [265 988 523]]
 is 
[[ 3.14703081e-04  9.56876103e-04 -5.41953176e-04]
 [-7.94665304e-04  4.20268622e-04  1.14613626e-03]
 [ 1.34174571e-03 -1.27877164e-03  2.14817769e-05]]

Their product is 
[[ 1.00000000e+00 -1.73472348e-16  5.99563801e-17]
 [ 1.08420217e-18  1.00000000e+00 -3.70661618e-17]
 [ 1.36609474e-17  3.90312782e-17  1.00000000e+00]]
 which is close to the identity matrix (considering rounding errors)
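
Rather than inspecting the product visually, the comparison with the identity matrix can be automated. This sketch rebuilds the same 3x3 matrix and checks it with `np.allclose`, which compares within a small tolerance. It also shows, on a made-up singular matrix, that `np.linalg.inv` raises `np.linalg.LinAlgError` when no inverse exists:

```python
import numpy as np

matrix = np.array([[864, 394, 776],
                   [911, 430, 41],
                   [265, 988, 523]])
inverse = np.linalg.inv(matrix)

# compare the product with the identity matrix, allowing for rounding errors
is_identity = np.allclose(matrix @ inverse, np.eye(3))
print(is_identity)

# the second row is twice the first, so this matrix has no inverse
singular = np.array([[1, 2],
                     [2, 4]])
try:
    np.linalg.inv(singular)
    inversion_failed = False
except np.linalg.LinAlgError as error:
    inversion_failed = True
    print("Cannot invert:", error)
```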
