<div style="text-align:center;display:block">

<img src="../images/NumPy_logo_2020.png" style="margin:0 auto;width:400px">
<div style="text-align:center">
    Bertrand Néron
    <br />
    <a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
    <br />
    © Institut Pasteur, 2021
</div>
</div>

# Installation

```bash
pip install numpy
```

# Convention

In [1]:
import numpy as np

In [2]:
x = np.array([1,2,3])
x

array([1, 2, 3])

In [3]:
type(x)

numpy.ndarray

*x* is an instance of the class **numpy.ndarray**. The constructor takes a sequence as its argument. Here we provided a list, hence the ([ ]) syntax.

NB: Because the constructor expects the values as a single sequence, they must be passed as one list and not as separate arguments:

```python
a = np.array(1, 2, 3, 4)    # WRONG
a = np.array([1, 2, 3, 4])  # RIGHT
```


## NumPy provides fast and memory efficient data structures

In [4]:
l = range(1000000)
%time sum(l)

CPU times: user 21.4 ms, sys: 0 ns, total: 21.4 ms
Wall time: 21.3 ms


499999500000

In [5]:
x = np.array(l)
%time x.sum()

CPU times: user 0 ns, sys: 1.47 ms, total: 1.47 ms
Wall time: 769 μs


np.int64(499999500000)

In [6]:
print(f"numpy is ~ {21.4/1.47:.0f} times faster than pure python")

numpy is ~ 15 times faster than pure python


### Example 2
We want to compute $\sum_i X_i^2$ given $X$.

In [7]:
l = range(1000000)
%time sum([x**2 for x in l])

CPU times: user 73 ms, sys: 23.9 ms, total: 96.9 ms
Wall time: 96.1 ms


333332833333500000

In [8]:
x = np.array(l)
%time (x**2).sum()

CPU times: user 1.74 ms, sys: 1.07 ms, total: 2.81 ms
Wall time: 1.57 ms


np.int64(333332833333500000)

In [9]:
print(f"numpy is ~ {96.9/2.81:.0f} times faster than pure python")

numpy is ~ 34 times faster than pure python
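
The timings above only cover speed. For the memory side, here is a minimal sketch, assuming CPython and 64-bit integers (the variable names are illustrative). It compares the bytes held by a list of Python integers with the bytes held by the equivalent array.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores one pointer per item, and each item is a full Python int object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)
# An array stores the raw values side by side: 8 bytes per int64.
array_bytes = arr.nbytes

print(f"list:  ~{list_bytes / 1e6:.0f} MB")
print(f"array: ~{array_bytes / 1e6:.0f} MB")
```

The exact list size depends on the Python build, so only the comparison matters here.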


## Creating N-D arrays

### 1-D array

In [10]:
one_d = np.array([1, 2, 10, 2, 1 ])

In [11]:
len(one_d)

5

Indexing and slicing work as for Python sequences:

In [12]:
one_d[2]

np.int64(10)

In [13]:
one_d[1:]

array([ 2, 10,  2,  1])

In [14]:
one_d[2:4]

array([10,  2])

### 2-D arrays

Here is a naive way to build a 2D matrix with values going from 1 to 12. Later, we will use more powerful
methods to do this (arange, reshape, ...)

In [16]:
n1 = [1, 2, 3]
n2 = [4, 5, 6]
n3 = [7, 8, 9]
n4 = [10, 11, 12]
two_d = np.array([n1, n2, n3, n4])

In [17]:
two_d.ndim

2

In [18]:
one_d.ndim

1

In [19]:
print(two_d.shape, one_d.shape)

(4, 3) (5,)


### Indexing: LC convention (Line / Column)

For a 5x5 matrix, the indexing works as follows

<img src="img/matrix.png">

In [20]:
two_d

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [21]:
# To get 11, last row, second column:
two_d[3, 1]

np.int64(11)

In [22]:
# equivalent but a bit slower:
two_d[3][1]

np.int64(11)

### 3-D arrays?

You manipulate 3D arrays almost every day.

A black and white image is a 2D matrix with one value per pixel, between 0 (black) and 255 (white) for an 8-bit encoded image.

<img src="img/image_BW_numpy.png" style="margin:0 auto;width:200px">

A color image is a 3D matrix: it is the superposition of three 2D matrices, one for the red, one for the green and one for the blue.

<img src="img/colored_image_numpy.png" style="margin:0 auto;width:400px">
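
As a minimal sketch of the two kinds of image, the arrays below are built by hand. The tiny 4x6 size and the (height, width, channel) layout are assumptions chosen for illustration; some tools store the channels in a different position.

```python
import numpy as np

height, width = 4, 6
# 8-bit black and white image: one value per pixel
gray = np.zeros((height, width), dtype=np.uint8)
# 8-bit color image: three values (red, green, blue) per pixel
color = np.zeros((height, width, 3), dtype=np.uint8)
color[:, :, 0] = 255             # fill the red channel: the image is pure red

print(gray.shape, color.shape)   # (4, 6) (4, 6, 3)
red_channel = color[:, :, 0]     # each channel is again a 2D matrix
print(red_channel.shape)         # (4, 6)
```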

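#### axis
When NumPy performs an operation on a matrix, you often have to specify the direction along which it is applied.

For instance, you have a 2D matrix and you want to use the sum operator.
You need to tell NumPy whether to sum along the rows or along the columns.
For that, NumPy has an `axis` parameter.
In a 2D matrix, axis=0 is the row axis and axis=1 is the column axis: an operation along axis=0 combines the rows and returns one value per column, while an operation along axis=1 combines the columns and returns one value per row.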

<img src="img/axis.png" style="margin:0 auto;width:600px">
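
A minimal sketch of the `axis` parameter on a small matrix (the values are arbitrary):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3): 2 rows, 3 columns

col_sums = m.sum(axis=0)     # collapse axis 0 (the rows): one sum per column
row_sums = m.sum(axis=1)     # collapse axis 1 (the columns): one sum per row

print(col_sums)   # [5 7 9]
print(row_sums)   # [ 6 15]
```

The axis given to the function is the one that disappears from the shape of the result.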

### Can you imagine a 4D array?

> Yes: a film can be viewed as a sequence of color images, so it is a 4D array.

<img src="img/4D_array.png">

Another example: take a volume of air in which we measure the pressure and temperature at each position.
To simplify, we split the volume into 2x2x2 smaller cubes, each holding 3 values.

In [23]:
c1 = [1,2,3]; c2 = [1,2,3]; c3 = [1,2,3]; c4 = [1,2,3]; 
c5 = [1,2,3]; c6 = [1,2,3]; c7 = [1,2,3]; c8 = [1,2,3]; 
x = np.array(
    [                   # first dimension (2 slices)
        [               # second dimension (2 rows)
            [           # third dimension (2 columns)
                c1, c2
            ],  
            [
                c3, c4
            ]
        ],
        [
            [
                c5, c6
            ],
            [
                c7, c8
            ]
        ]
    ])        

In [24]:
x.shape

(2, 2, 2, 3)

## Functions to create arrays

In [25]:
# 2D array 
np.ones((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [26]:
# 3D array
np.ones((3,4,5))

array([[[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]],

       [[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]]])

### The **arange** function

> Evenly spaced values within a given interval based on a **step**

In [27]:
a = np.arange(1, 10) # note that the end is exclusive and the step is 1 by default

In [28]:
np.arange(0, 10, step=2)

array([0, 2, 4, 6, 8])

### The **reshape** method

> Gives a new shape to an array without changing its data.

In [29]:
a2 = a.reshape(3,3)
a2

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

**NB** the product of the dimensions must equal the number of values:
```python
len(a)   # 9
3 * 3    # 9
```

In [30]:
# equivalent to

a2 = np.reshape(a, (3,3))
a2

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
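
The rule on the product of the dimensions can be checked directly. This minimal sketch also uses `-1`, which asks NumPy to infer one dimension from the others.

```python
import numpy as np

a = np.arange(1, 10)          # 9 values

b = a.reshape(3, -1)          # -1 lets NumPy infer the missing dimension
print(b.shape)                # (3, 3)

try:
    a.reshape(2, 4)           # 2 * 4 = 8, but a holds 9 values
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```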

### The **linspace** function

> Evenly spaced values within a given interval based on a **number of points**

In [31]:
np.linspace(0, 1, 10)

array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])
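
Unlike arange, linspace includes the end of the interval by default. A minimal sketch of the `endpoint` and `retstep` options (the interval and the number of points are arbitrary):

```python
import numpy as np

pts, step = np.linspace(0, 1, 5, retstep=True)    # 5 points, end included
print(pts)     # [0.   0.25 0.5  0.75 1.  ]
print(step)    # 0.25

half_open = np.linspace(0, 1, 5, endpoint=False)  # end excluded, as with arange
print(half_open)   # [0.  0.2 0.4 0.6 0.8]
```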

### ones, zeros, diag, eye, empty

In [32]:
np.diag((5,5,1,1))

array([[5, 0, 0, 0],
       [0, 5, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])

In [33]:
np.ones((2,2))

array([[1., 1.],
       [1., 1.]])

In [34]:
np.zeros((2,2))

array([[0., 0.],
       [0., 0.]])

In [35]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
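
The heading also lists `empty`, which is not shown above. A minimal sketch: `np.empty` allocates the array without initialising it, so its content is arbitrary until you fill it.

```python
import numpy as np

buf = np.empty((2, 3))        # allocated but NOT initialised: contents are arbitrary
print(buf.shape, buf.dtype)   # (2, 3) float64

buf[:] = 7                    # fill it before reading any value
print(buf)
```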

# Random values

> Python has its own random module, but NumPy offers more functionality.

First you have to create a Generator object, usually with the default_rng function:

In [36]:
rng = np.random.default_rng() 

Then call its methods:

In [37]:
# Generate one random float uniformly distributed over the range [0, 1)
rng.random() # result below may vary

0.07493730781019758

In [38]:
# array of normally distributed values
# with mean=0 (loc) and std=1.0 (scale) (defaults)
rng.normal(loc=0.0, scale=1.0, size=10)


array([ 0.97665436,  0.10830955, -0.48716618,  0.47222301, -1.95684758,
       -1.10186154, -0.85026678,  0.73181136,  0.28125178, -0.19334269])

In [39]:
# a 2D array of normally distributed values
rng.normal(loc=1.0, scale=2.0, size=(5,2))

array([[ 3.48679819,  3.04781574],
       [ 3.70247583,  1.76402619],
       [ 0.44962379,  2.8026444 ],
       [ 3.81113791,  0.74001679],
       [-0.34301959,  2.8487461 ]])

In [40]:
# array of uniformly distributed values
rng.uniform(low=0, high=2, size=10)

array([1.53764539, 0.10407965, 0.14943287, 1.33850036, 1.29547856,
       1.92918736, 0.0350858 , 1.34860414, 1.72703453, 0.32842712])

In [41]:
# you can specify a seed when you build a random generator
s_rng = np.random.default_rng(seed=3)

s_rng.random()

0.08564916714362436
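
A minimal sketch of what the seed buys you: two generators built with the same seed produce the same sequence, which makes a computation repeatable.

```python
import numpy as np

rng_a = np.random.default_rng(seed=3)
rng_b = np.random.default_rng(seed=3)

draws_a = rng_a.random(5)
draws_b = rng_b.random(5)
print(np.array_equal(draws_a, draws_b))       # True: same seed, same sequence

unseeded = np.random.default_rng().random(5)  # differs from run to run
```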

* https://numpy.org/doc/stable/reference/random/index.html
* https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.default_rng
* https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator

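## Old way to generate random values

These functions belong to NumPy's legacy random API. They still work, but the NumPy documentation recommends the Generator interface shown above for new code.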

In [42]:
# uniform random values between 0 and 1
np.random.rand()

0.38646383961446507

In [43]:
# normal distribution
np.random.randn()

1.035199530576272

In [44]:
# array of normally distributed values
# with mean=0 (loc) and std=1.0 (scale) (defaults)
np.random.normal(loc=0.0, scale=1.0, size=10)

array([ 1.11832894,  2.0021714 ,  0.58719543, -0.56612414, -0.12557549,
       -0.27136275,  1.51259438, -0.73849808,  0.77444285,  0.54480857])

In [45]:
# a 2D array of normally distributed values
np.random.normal(loc=1.0, scale=2.0, size=(5,2))

array([[ 0.83273837, -0.05496048],
       [ 2.75752077,  0.61157395],
       [-0.05946127, -0.55505366],
       [-3.14976828,  2.48233988],
       [ 3.67011564,  1.6567874 ]])

In [46]:
# array of uniformly distributed values
np.random.uniform(low=0, high=2, size=10)

array([0.2491329 , 1.23861051, 0.06156969, 1.99958024, 1.01311919,
       0.61799875, 0.95377215, 0.72350694, 0.98002253, 1.65238567])

# Data types

In [47]:
x = np.array([1,2,3])
x.dtype

dtype('int64')

In [48]:
x = np.array([1., 2, 3.5])
x.dtype

dtype('float64')

In [49]:
x = np.array([1,2,3], dtype=float)
x.dtype

dtype('float64')

You can cast data from one data type to another:

In [50]:
x = np.array([1, 2, 3])
print(x.dtype)
x = x.astype(float)
print(x.dtype)
x

int64
float64


array([1., 2., 3.])

<div class="alert alert-warning">
If you mix types, every value is converted to the most general type among them
</div>

In [51]:
np.array([1.0, 1, "oups"])

array(['1.0', '1', 'oups'], dtype='<U32')

Of course, casting only works if the conversion makes sense:

In [52]:
x = np.array([1.0, 1, "oups"])
x = x.astype(float)

ValueError: could not convert string to float: np.str_('oups')

<div class="practical">
<h1>Basic indexing and slicing </h1>
</div>

# Example in 2D

Syntax: the first axis is for rows and the second for columns:

<img src="../images/exo_table1.png">

Create the array shown above. Then, with slicing and indexing, 
> - extract first row, 
> - extract first column (orange cells)
> - extract even values only, odd values only
> - extract the 4 blue cells
> - extract the 2 green cells
> - extract the 2x2 sub-matrix in bottom right corner

In [53]:
r = np.arange(5)
m = np.array([r, r+10, r+20, r+30, r+40])
m

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [54]:
# first row
m[0, :]

array([0, 1, 2, 3, 4])

In [55]:
# blue cells
m[1,1], m[1,3], m[3,1], m[3,3]

(np.int64(11), np.int64(13), np.int64(31), np.int64(33))

In [56]:
# blue cells again, this time with slicing (every second row and column, starting at index 1)
m[1::2, 1::2]

array([[11, 13],
       [31, 33]])

In [57]:
# orange column
m[:,0]

array([ 0, 10, 20, 30, 40])

In [58]:
# green cells
m[2, -2:]

array([23, 24])

In [59]:
# 2x2 sub-matrix in the bottom right corner
m[-2:, -2:]

array([[33, 34],
       [43, 44]])
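
The exercise also asks for the even values only and the odd values only. One possible answer with plain slicing is sketched below. It relies on a property of this particular array: each row adds a multiple of 10, so the parity of a value depends only on its column.

```python
import numpy as np

r = np.arange(5)
m = np.array([r, r + 10, r + 20, r + 30, r + 40])

even_values = m[:, ::2]    # columns 0, 2, 4
odd_values = m[:, 1::2]    # columns 1, 3
print(even_values)
print(odd_values)
```

For an arbitrary array, a boolean mask such as `m[m % 2 == 0]` would be the general tool.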

# Copies and views

We are manipulating objects, so be careful with references: two names can point to the same underlying data.

## Views

In [61]:
a = np.array([1,2,3,4,5])
b = a.view()

In [62]:
a, b

(array([1, 2, 3, 4, 5]), array([1, 2, 3, 4, 5]))

In [63]:
a[2] = 30

In [64]:
a, b

(array([ 1,  2, 30,  4,  5]), array([ 1,  2, 30,  4,  5]))

In [65]:
b[-2:] = 0

In [66]:
a, b

(array([ 1,  2, 30,  0,  0]), array([ 1,  2, 30,  0,  0]))

<div class="alert alert-warning">
    Unlike for a Python list, [:] does <span style="font-weight:bold">NOT</span> make a shallow copy: it returns a view
</div>

In [67]:
a = np.array([1,2,3,4,5])
b = a[:]
a[2] = 300
a, b

(array([  1,   2, 300,   4,   5]), array([  1,   2, 300,   4,   5]))

In [68]:
c = a.copy()
c

array([  1,   2, 300,   4,   5])

In [69]:
c[2] = 150
c, a

(array([  1,   2, 150,   4,   5]), array([  1,   2, 300,   4,   5]))
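
When it is unclear whether two arrays share data, NumPy can tell you. A minimal sketch using `np.shares_memory` and the `base` attribute (the names `view` and `dup` are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
view = a[1:4]          # slicing returns a view
dup = a[1:4].copy()    # an independent copy

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, dup))    # False
print(view.base is a)              # True: the view is built on a
print(dup.base is None)            # True: the copy owns its data
```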

# Fancy indexing 

> As we have seen before, standard Python slicing and indexing work on NumPy arrays.
> NumPy also provides richer indexing, performed with boolean arrays (also called **masks**) or integer arrays.

## Indexing with boolean masks

> Boolean masking is a very powerful feature in NumPy.
> It can be used to index an array, and to assign new values to a sub-array.
> Note also that selecting with a mask creates a copy, not a view.

In [70]:
data = np.arange(16)
data

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

Find all multiples of 7:

In [71]:
data % 7 == 0

array([ True, False, False, False, False, False, False,  True, False,
       False, False, False, False, False,  True, False])

In [72]:
mask = (data % 7 == 0)
data[mask]

array([ 0,  7, 14])

In [73]:
# Replace values:
data[mask] = -100
data

array([-100,    1,    2,    3,    4,    5,    6, -100,    8,    9,   10,
         11,   12,   13, -100,   15])
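
The difference between selecting with a mask (a copy) and assigning through a mask (in place) can be sketched as follows. The names `selected` and `after_copy_edit` are illustrative.

```python
import numpy as np

data = np.arange(16)
mask = (data % 7 == 0)

selected = data[mask]       # a copy of the selected values
selected[:] = -1            # modifies the copy only
after_copy_edit = data[mask].copy()
print(after_copy_edit)      # [ 0  7 14]: data is unchanged

data[mask] = -1             # assigning through the mask modifies data itself
print(data[mask])           # [-1 -1 -1]
```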

## Indexing with an array of integers

In [75]:
data = np.array([-1, 2, -3, -4, -5, 10, 20])
indices = [0, 1, 2, 3]
data[indices]

array([-1,  2, -3, -4])

In [76]:
data[data<0]

array([-1, -3, -4, -5])

<img src="img/exo_table2.png">

Create the array above and extract the following data sets:
> - the 9 blue cells
> - the 5 orange cells 
> - the green cells

In [77]:
m = np.array([[i+j for i in range(6)] for j in range(0, 60, 10)])

In [78]:
m

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [79]:
# orange
m[(0,1,2,3,4), (1,2,3,4,5)]

array([ 1, 12, 23, 34, 45])

In [80]:
# blue:
m[3:, [0,2,5]]

array([[30, 32, 35],
       [40, 42, 45],
       [50, 52, 55]])

In [81]:
# green
m[np.array([True, False,True,False,False,True]),4]

array([ 4, 24, 54])

In [82]:
# green
m[np.array([1, 0, 1, 0, 0, 1], dtype=bool), 4]

array([ 4, 24, 54])
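
In the orange solution, the two index sequences are paired element by element. To take every combination of some rows and some columns instead, one option is `np.ix_`. A minimal sketch, where the choice of rows and columns is arbitrary:

```python
import numpy as np

m = np.array([[i + j for i in range(6)] for j in range(0, 60, 10)])

rows = [0, 2, 5]
cols = [1, 3, 4]

paired = m[rows, cols]          # elements (0,1), (2,3), (5,4)
block = m[np.ix_(rows, cols)]   # every row in rows crossed with every column in cols

print(paired)        # [ 1 23 54]
print(block.shape)   # (3, 3)
```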

# Numerical operations

In [83]:
a = np.array([[4, 7], 
              [2, 6]])

b = np.array([[0.6, -0.7],
              [-0.2, 0.4]])

In [84]:
a + b

array([[4.6, 6.3],
       [1.8, 6.4]])

<div class="alert alert-warning">
Again be careful with copies and views
</div>

In [85]:
c = b.copy()

In [86]:
c *= 2

In [87]:
c, b

(array([[ 1.2, -1.4],
        [-0.4,  0.8]]),
 array([[ 0.6, -0.7],
        [-0.2,  0.4]]))

In [88]:
# elementwise product
a * b

array([[ 2.4, -4.9],
       [-0.4,  2.4]])

<div class="alert alert-warning">
This is not a matrix product
</div>

In [89]:
# matrix product
a.dot(b)

array([[ 1.00000000e+00,  3.33066907e-16],
       [-1.11022302e-16,  1.00000000e+00]])

In [90]:
a.dot(b).round()

array([[ 1.,  0.],
       [-0.,  1.]])
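
The matrix product can also be written with the `@` operator. The sketch below checks the result with `np.allclose`, which compares floating point values with a tolerance and so avoids the rounding step.

```python
import numpy as np

a = np.array([[4, 7],
              [2, 6]])
b = np.array([[0.6, -0.7],
              [-0.2, 0.4]])

product = a @ b                            # same matrix product as a.dot(b)
print(np.allclose(product, a.dot(b)))      # True
print(np.allclose(product, np.eye(2)))     # True: b is the inverse of a, up to rounding
```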

# Reductions

In [91]:
x = np.array([1,2,3,4,-1,-2,-3,-4])

In [92]:
print("sum =", x.sum())
print("min =", x.min())
print("max =", x.max())
print("position of the max =", x.argmax())

sum = 0
min = -4
max = 4
position of the max = 3


In [93]:
a = np.array([
    [1,10,1],
    [2,8,3]
])

In [94]:
a.sum()

np.int64(25)

In [95]:
a.sum(axis=0) # sum along axis 0 (over the rows): one value per column

array([ 3, 18,  4])

In [96]:
a.sum(axis=1) # sum along axis 1 (over the columns): one value per row

array([12, 13])

# Iterations

In [97]:
x = np.random.normal(size=12).reshape(6,2)

In [98]:
x

array([[-0.26020648,  0.26368625],
       [ 0.42129755, -0.56914431],
       [-0.40631094, -2.08134291],
       [-0.1991786 ,  1.04618945],
       [-0.50518875, -1.01945961],
       [ 0.40947048,  0.40108879]])

In [99]:
for row in x:
    if row.sum() > 0 :
        print(row)

[-0.26020648  0.26368625]
[-0.1991786   1.04618945]
[0.40947048 0.40108879]


In [100]:
for item in x.flat:
    if item >1:
        print(item)

1.046189447818475
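
Explicit loops work, but the same selections can usually be written without a loop. A minimal sketch using reductions and boolean masks; the seed is an assumption added only to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded only to make the sketch repeatable
x = rng.normal(size=(6, 2))

positive_rows = x[x.sum(axis=1) > 0]   # rows whose sum is positive
large_items = x[x > 1]                 # all items greater than 1, as a 1D array

# The explicit loops select the same rows and items
loop_rows = [row for row in x if row.sum() > 0]
loop_items = [item for item in x.flat if item > 1]
```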


In [101]:
res = np.sqrt(-1)

RuntimeWarning: invalid value encountered in sqrt
  res = np.sqrt(-1)


In [102]:
res

np.float64(nan)

In [103]:
np.isnan(res)

np.True_

In [104]:
import sys
sys.float_info.max

1.7976931348623157e+308

In [105]:
res = np.array([1e308,1e300]) * 10

RuntimeWarning: overflow encountered in multiply
  res = np.array([1e308,1e300]) * 10


In [106]:
res

array([    inf, 1.e+301])
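
The cells above show that invalid or overflowing operations produce `nan` or `inf` rather than raising an error. A minimal sketch of how such values can be detected, with `np.errstate` used to silence the warnings locally (the input values are arbitrary):

```python
import numpy as np

values = np.array([4.0, -1.0, 1e308])

with np.errstate(invalid="ignore", over="ignore"):   # silence the warnings locally
    roots = np.sqrt(values)      # sqrt(-1) gives nan
    scaled = values * 10         # 1e308 * 10 overflows to inf

print(np.isnan(roots))       # [False  True False]
print(np.isinf(scaled))      # [False False  True]
print(np.isfinite(scaled))   # [ True  True False]
```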

# Resizing

In [107]:
a = np.diag([1,2,3,4])

In [108]:
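# resize works in place on the flattened data and pads with zeros,
# so it can append rows (changing the number of columns would shift the values)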
a.resize((5,4))
a

array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4],
       [0, 0, 0, 0]])
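
Appending a row with resize keeps the layout because the data is stored row by row. Adding a column this way does not. The sketch below shows the effect and one possible alternative, `np.pad`, which is a swapped-in technique and not part of the example above.

```python
import numpy as np

a = np.diag([1, 2, 3, 4])

reflowed = a.copy()
reflowed.resize((4, 5), refcheck=False)   # values are re-read row by row from the flat data
print(reflowed[:, 0])                     # [1 2 3 4]: the diagonal ended up in column 0

padded = np.pad(a, ((0, 0), (0, 1)))      # add one zero column on the right
print(padded.shape)                       # (4, 5)
print(np.array_equal(padded[:, :4], a))   # True: the original layout is kept
```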

# Transpose

In [109]:
a = np.arange(12).reshape((3,4))
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [110]:
b = a.T
b

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

# swapaxes

Interchange two axes of an array.
(https://numpy.org/doc/stable/reference/generated/numpy.swapaxes.html)

In [111]:
a = np.arange(24).reshape((3,4,2))
a

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15]],

       [[16, 17],
        [18, 19],
        [20, 21],
        [22, 23]]])

In [112]:
b = a.swapaxes(0,2)
b

array([[[ 0,  8, 16],
        [ 2, 10, 18],
        [ 4, 12, 20],
        [ 6, 14, 22]],

       [[ 1,  9, 17],
        [ 3, 11, 19],
        [ 5, 13, 21],
        [ 7, 15, 23]]])

In [113]:
a.shape, b.shape

((3, 4, 2), (2, 4, 3))
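
For comparison, the same result can be obtained with `transpose`, which takes the new order of all the axes. This minimal sketch also checks that no data is copied.

```python
import numpy as np

a = np.arange(24).reshape((3, 4, 2))

swapped = a.swapaxes(0, 2)
transposed = a.transpose(2, 1, 0)   # list the new order of all axes explicitly

print(np.array_equal(swapped, transposed))   # True
print(np.array_equal(a.T, transposed))       # True: .T reverses the axis order
print(np.shares_memory(a, swapped))          # True: no data is copied
```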

# Array concatenation

In [114]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.vstack([a, b])
c

array([[1, 2, 3],
       [4, 5, 6]])

In [115]:
# equivalent of the + operator with lists
np.hstack([a,b])

array([1, 2, 3, 4, 5, 6])

In [116]:
a = np.arange(6).reshape(2,3)
b = np.arange(10,16).reshape(2,3)
np.hstack([a,b])

array([[ 0,  1,  2, 10, 11, 12],
       [ 3,  4,  5, 13, 14, 15]])
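
Both helpers are special cases of `np.concatenate`, where the axis is given explicitly. A minimal sketch for 2D arrays:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(10, 16).reshape(2, 3)

stacked_rows = np.concatenate([a, b], axis=0)   # same as np.vstack([a, b]) for 2D arrays
stacked_cols = np.concatenate([a, b], axis=1)   # same as np.hstack([a, b]) for 2D arrays

print(stacked_rows.shape, stacked_cols.shape)   # (4, 3) (2, 6)
```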

# Sorting

In [117]:
a = np.array([5,1,10,2,7,8])
a.sort() # inplace
a

array([ 1,  2,  5,  7,  8, 10])

In [118]:
a = np.array([5,1,10,2,7,8])
sorted_a = np.sort(a) # new array
sorted_a

array([ 1,  2,  5,  7,  8, 10])

In [119]:
a

array([ 5,  1, 10,  2,  7,  8])

In [120]:
a = np.array([5,1,10,2,7,8])
a.argsort()

array([1, 3, 0, 4, 5, 2])
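
The indices returned by argsort are mostly useful for reordering: they can sort the array itself, sort another array in the same order, or give a descending order. A minimal sketch, where `labels` is a hypothetical companion array:

```python
import numpy as np

a = np.array([5, 1, 10, 2, 7, 8])
labels = np.array(["c", "a", "f", "b", "d", "e"])   # hypothetical labels, one per value

order = a.argsort()
print(a[order])         # [ 1  2  5  7  8 10]
print(labels[order])    # ['a' 'b' 'c' 'd' 'e' 'f']
print(a[order[::-1]])   # [10  8  7  5  2  1]
```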

# Loading data

NumPy has its own readers for tabulated data sets:

np.genfromtxt, np.loadtxt, ...

However, we will use the pandas read_csv function, which is better suited to this kind of data.
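
For completeness, here is a minimal sketch of the NumPy reader on a purely numeric table. The CSV content is hypothetical, and `io.StringIO` stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical CSV content; io.StringIO stands in for a file on disk
text = "x,y\n1,2.5\n3,4.5\n"
table = np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=1)

print(table.shape)   # (2, 2)
print(table[:, 1])   # [2.5 4.5]
```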