
#Metabolomics data analysis and visualization using Python
##Course overview


Lecture 1: Basics of Python\
Lecture 2: Data analysis with Pandas library\
Lecture 3: Data visualization with Matplotlib library I\
Lecture 4: Data visualization with Matplotlib library II\
Lecture 5: Data analysis project

Recommended material: Jake VanderPlas "Python Data Science Handbook" (open source)

How to use this Jupyter notebook:\
This notebook includes code cells where you can run Python code interactively. To do this please click on the cell and press SHIFT + ENTER.



#Basics of Python
The **print()** command displays output from your program. It is also very useful for inspecting the current values of variables.

At some point it will become necessary to add comments to your code in order to improve its readability. Comments are preceded by a '#'.

In [None]:
print('Hello world')
#this is a comment

Hello world


#NumPy

##Introduction to NumPy

NumPy is a Python library that supports array manipulation and high-level mathematical functions. We need to import the NumPy package with the **import** command. The numpy.\_\_version\_\_ attribute (two underscores on each side) shows the installed version.

In [None]:
import numpy
numpy.__version__

'1.23.5'

It is a convention that NumPy will be imported as np. This is a shorthand to refer to the library. Remember to import NumPy at the beginning of your code if you want to use its features.

In [None]:
import numpy as np

##Understanding Datatypes
In Python the data types (integer, float, string, ...) are dynamically inferred. This means that we can assign any kind of data to any variable with the **=** operator, without specifying the data type upon initialization.

In [None]:
x = 4
print(x)
x = "four"
print(x)

4
four
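
As a small illustrative sketch (not part of the original cell), the built-in **type()** function can be used to check which data type Python inferred. The type belongs to the value currently assigned, not to the variable name:

```python
# The type follows the value that is currently assigned to x
x = 4
print(type(x))   # <class 'int'>

x = "four"
print(type(x))   # <class 'str'>

x = 4.0
print(type(x))   # <class 'float'>
```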


###List
Lists are used to store multiple items in a single variable. Use **list()** to create such an object.

If you want a list of consecutive numbers, then you can use the **range()** function. The **range()** function returns a sequence of numbers that starts at 0 by default, increments by 1 by default, and stops before a specified number. The general syntax is:\
*range(start, stop, step)*\
where the defaults are start = 0 and step = 1.

In [None]:
L = list(range(10))
print(L)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
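
The following sketch shows the three arguments of **range()** in use. The variable names are chosen only for illustration. Note that the stop value itself is never part of the result:

```python
# range(start, stop): counts from start up to, but not including, stop
from_two = list(range(2, 10))
print(from_two)        # [2, 3, 4, 5, 6, 7, 8, 9]

# range(start, stop, step): every third number
every_third = list(range(0, 10, 3))
print(every_third)     # [0, 3, 6, 9]

# a negative step counts downwards
countdown = list(range(10, 0, -2))
print(countdown)       # [10, 8, 6, 4, 2]
```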


We can use **np.array()** to create arrays from Python lists. An array is a container that holds multiple values. Unlike lists, NumPy arrays have a fixed size at creation: changing the size of an array creates a new array and deletes the original. All the elements in an array are of the same type.
For numerical work on large data sets, NumPy arrays are faster and more memory-efficient than standard Python sequences, and many operations need less code.

In [None]:
import numpy as np
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

This is an array of integers. This data type was chosen automatically. If we include at least one value that is a float, then the whole array is stored as floating-point numbers (you can see that by the '.' at the end of a number).

In [None]:
np.array([3.14, 4, 2, 3])

array([3.14, 4.  , 2.  , 3.  ])
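
To check which data type NumPy chose, you can look at the **dtype** attribute. The sketch below also passes the **dtype** parameter to **np.array()** to set the type explicitly. The variable names are illustrative only, and the exact integer type can depend on your platform:

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])
forced = np.array([1, 2, 3], dtype=float)  # request floats explicitly

print(ints.dtype)    # an integer type, int64 on most platforms
print(mixed.dtype)   # float64: one float converts the whole array
print(forced)        # [1. 2. 3.]
```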

NumPy arrays can be multidimensional. Here's one way of initializing a multidimensional array, using a list comprehension in which each **range()** becomes one row. Here i is the first value of each row.

In [None]:
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

##Special arrays
Sometimes it is useful to create arrays that are already filled for a specific purpose. Here are some examples. The **dtype** parameter determines the data type, i.e. integer (int) or float.

In [None]:
# Create a length-10 integer array filled with 0s
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
# Create a 3x5 floating-point array filled with 1s
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [None]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [None]:
# Create a 3x3 array of uniformly distributed
# pseudorandom values between 0 and 1
np.random.random((3, 3))

array([[0.8425064 , 0.40374667, 0.35533033],
       [0.62116045, 0.00574509, 0.52186825],
       [0.3233079 , 0.00558418, 0.89659266]])

In [None]:
# Create a 3x3 array of normally distributed pseudorandom
# values with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 0.05707103,  0.25668667,  0.69687423],
       [-0.92008523,  0.40157598,  0.97332854],
       [-1.58076553,  0.55588106,  0.37786334]])

In [None]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

##NumPy Array Attributes
We define random arrays of one, two, and three dimensions. We'll use NumPy's random number generator, which we will seed with a set value in order to ensure that the same random arrays are generated each time this code is run. The general notation is\
*numpy.random.default_rng(seed=None)*


In [None]:
import numpy as np
rng = np.random.default_rng(seed=1701)  # seed for reproducibility

x1 = rng.integers(10, size=6)  # one-dimensional array
x2 = rng.integers(10, size=(3, 4))  # two-dimensional array
x3 = rng.integers(10, size=(3, 4, 5))  # three-dimensional array

Each array has attributes including **ndim** (the number of dimensions), **shape** (the size of each dimension), **size** (the total size of the array), and **dtype** (the type of each element):

In [None]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:   ", x3.dtype)

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60
dtype:    int64


##Array Indexing: Accessing Single Elements

Sometimes we want to access a single element of an array. You can do that by referring to the index number.

In [None]:
print(x1)

[9 4 0 3 8 6]


In [None]:
print(x1[0])

9


In [None]:
print(x1[4])

8


To index from the end of the array, you can use negative indices:

In [None]:
print(x1[-1])

6


In [None]:
print(x1[-2])

8


In a multidimensional array, items can be accessed using a comma-separated (row, column) tuple:

In [None]:
print(x2)

[[3 1 3 7]
 [4 0 2 3]
 [0 0 6 9]]


In [None]:
print(x2[0, 0])

3


In [None]:
print(x2[2, 0])

0


In [None]:
print(x2[2, -1])

9


Values can also be modified using any of the preceding index notations:

In [None]:
x2[0, 0] = 12
print(x2)

[[12  1  3  7]
 [ 4  0  2  3]
 [ 0  0  6  9]]


##Array slicing: accessing subarrays
Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (:) character. The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:

*x[start:stop:step]*


If any of these are unspecified, they default to the values start = 0, stop = (size of dimension), step = 1. Let's look at some examples of accessing subarrays in one dimension and in multiple dimensions.

In [None]:
print(x1)

[9 4 0 3 8 6]


In [None]:
print(x1[:3])  # first three elements

[9 4 0]


In [None]:
print(x1[3:])  # elements from index 3 onward

[3 8 6]


In [None]:
print(x1[1:4])  # middle subarray

[4 0 3]


In [None]:
print(x1[::2])  # every second element

[9 0 8]


In [None]:
print(x1[1::2])  # every second element, starting at index 1

[4 3 6]


A potentially confusing case is when the step value is negative. In this case, the defaults for start and stop are swapped. This becomes a convenient way to reverse an array:

In [None]:
print(x1[::-1])  # all elements, reversed

[6 8 3 0 4 9]
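
A negative step can also be combined with explicit start or stop values. The following sketch uses a small array with the same values as x1 above, defined again here under the illustrative name a so that the cell runs on its own:

```python
import numpy as np

a = np.array([9, 4, 0, 3, 8, 6])

# start at index 4 and walk backwards in steps of 2 (indices 4, 2, 0)
backwards_from_4 = a[4::-2]
print(backwards_from_4)   # [8 0 9]

# start defaults to the last element, stop before index 2 (indices 5, 4, 3)
down_to_3 = a[:2:-1]
print(down_to_3)          # [6 8 3]
```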


###Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas. For example:

In [None]:
print(x2)

[[12  1  3  7]
 [ 4  0  2  3]
 [ 0  0  6  9]]


In [None]:
print(x2[:2, :3])  # first two rows & three columns

[[12  1  3]
 [ 4  0  2]]


In [None]:
print(x2[:3, ::2])  # three rows, every second column

[[12  3]
 [ 4  2]
 [ 0  6]]


###Accessing array rows and columns
One commonly needed routine is accessing single rows or columns of an array. This can be done by combining indexing and slicing, using an empty slice marked by a single colon (:):

In [None]:
print(x2[:, 0])  # first column of x2

[12  4  0]


In [None]:
print(x2[0, :])  # first row of x2

[12  1  3  7]


###Reshaping of arrays
Another useful type of operation is reshaping of arrays, which can be done with the **reshape()** method. By reshaping we can add or remove dimensions or change the number of elements in each dimension. For example, if you want to put the numbers 1 through 9 in a  3×3  grid, you can do the following:

In [None]:
grid = np.arange(1, 10).reshape(3, 3)
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


Note that for this to work, the size of the initial array must match the size of the reshaped array, and in most cases the reshape method will return a no-copy view of the initial array.
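
Both points can be checked with a short sketch. The names flat and grid_view are illustrative. **np.shares_memory()** reports whether two arrays use the same underlying data, and a size mismatch raises a ValueError:

```python
import numpy as np

flat = np.arange(1, 10)
grid_view = flat.reshape(3, 3)

# The reshaped array is a view: it shares its data with the original
print(np.shares_memory(flat, grid_view))   # True

# Changing the view therefore also changes the original array
grid_view[0, 0] = 100
print(flat[0])                              # 100

# Sizes must match: 9 elements cannot fill a 2x5 grid
try:
    flat.reshape(2, 5)
except ValueError as err:
    print("reshape failed:", err)
```

If you need an independent array, calling the **copy()** method on the result is one option.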

A common reshaping operation is converting a one-dimensional array into a two-dimensional row or column matrix:

In [None]:
x = np.array([1, 2, 3])
x.reshape((1, 3))  # row vector via reshape

array([[1, 2, 3]])

In [None]:
x.reshape((3, 1))  # column vector via reshape

array([[1],
       [2],
       [3]])

A convenient shorthand for this is to use np.newaxis in the slicing syntax:

In [None]:
x[np.newaxis, :]  # row vector via newaxis

array([[1, 2, 3]])

In [None]:
x[:, np.newaxis]  # column vector via newaxis

array([[1],
       [2],
       [3]])

##Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines **np.concatenate()**, **np.vstack()**, and **np.hstack()**. np.concatenate takes a tuple or list of arrays as its first argument, as you can see here:

In [None]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once:

In [None]:
z = np.array([99, 99, 99])
print(np.concatenate([x, y, z]))

[ 1  2  3  3  2  1 99 99 99]


And it can be used for two-dimensional arrays:

In [None]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

In [None]:
# concatenate along the first axis
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [None]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

For working with arrays of mixed dimensions, it can be clearer to use the **np.vstack** (vertical stack) and **np.hstack** (horizontal stack) functions:

In [None]:
# vertically stack the arrays
np.vstack([x, grid])

array([[1, 2, 3],
       [1, 2, 3],
       [4, 5, 6]])

In [None]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

array([[ 1,  2,  3, 99],
       [ 4,  5,  6, 99]])

Similarly, for higher-dimensional arrays, **np.dstack()** will stack arrays along the third axis.
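
A minimal sketch of **np.dstack()** with two small 2x3 arrays (the names a and b are chosen only for this example). The result has a new third axis of length 2:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = a * 10

# stack along the third axis: two (2, 3) arrays give one (2, 3, 2) array
stacked = np.dstack([a, b])
print(stacked.shape)    # (2, 3, 2)
print(stacked[0, 0])    # [ 1 10]
```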

##Functions
###Array Arithmetic
NumPy's functions feel very natural to use because they make use of Python's native arithmetic operators. The standard addition, subtraction, multiplication, and division can all be used. In the example we use the **arange()** function, which returns evenly spaced values within a given interval. The general form is\
 numpy.arange([start, ]stop, [step, ]dtype=None, like=None)


In [None]:
x = np.arange(4)
print("x      =", x)
print("x + 5  =", x + 5)
print("x - 5  =", x - 5)
print("x * 2  =", x * 2)
print("x / 2  =", x / 2)
print("x // 2 =", x // 2)  # floor division

x      = [0 1 2 3]
x + 5  = [5 6 7 8]
x - 5  = [-5 -4 -3 -2]
x * 2  = [0 2 4 6]
x / 2  = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]


There is also a unary - operator for negation, a ** operator for exponentiation, and a % operator for modulus:

In [None]:
print("-x     = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2  = ", x % 2)

-x     =  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]


 These can be strung together with the standard order of operations:

In [None]:
-(0.5*x + 1) ** 2

array([-1.  , -2.25, -4.  , -6.25])

In [None]:
#absolute value
x = np.array([-2, -1, 0, 1, 2])
np.abs(x)

array([2, 1, 0, 1, 2])

In [None]:
#Trigonometric Functions
theta = np.linspace(0, np.pi, 3)

The **numpy.linspace()** function returns evenly spaced numbers over a specified interval. It is similar to numpy.arange(), but instead of a step size it takes the number of samples. The general syntax is\
numpy.linspace(start,
               stop,
               num = 50,
               endpoint = True,
               retstep = False,
               dtype = None)\
  Here *num* is the number of samples to generate.

In [None]:
print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))

theta      =  [0.         1.57079633 3.14159265]
sin(theta) =  [0.0000000e+00 1.0000000e+00 1.2246468e-16]
cos(theta) =  [ 1.000000e+00  6.123234e-17 -1.000000e+00]
tan(theta) =  [ 0.00000000e+00  1.63312394e+16 -1.22464680e-16]


In [None]:
#Exponents
x = [1, 2, 3]
print("x   =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3., x))

x   = [1, 2, 3]
e^x = [ 2.71828183  7.3890561  20.08553692]
2^x = [2. 4. 8.]
3^x = [ 3.  9. 27.]


In [None]:
#Logarithms
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))

x        = [1, 2, 4, 10]
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
log2(x)  = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]


In [None]:
#minimum and maximum
big_array = rng.random(1000000)
print(np.min(big_array), np.max(big_array))

1.738670563078415e-06 0.9999980069594879


In [None]:
#summation
L = rng.random(100)
np.sum(L)

50.47145448941101

In [None]:
M = np.array([[0, 3 ,1, 2] ,[1 ,9, 7, 0] ,[4, 8, 3, 7]])
print(M)

[[0 3 1 2]
 [1 9 7 0]
 [4 8 3 7]]


By default, NumPy aggregations are applied across all elements of a multidimensional array:

In [None]:
M.sum()

45

Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. For example, we can find the minimum value within each column by specifying *axis=0*:

In [None]:
M.min(axis=0)

array([0, 3, 1, 0])

The function returns four values, corresponding to the four columns of numbers.

Similarly, we can find the maximum value within each row:

In [None]:
M.max(axis=1)

array([3, 9, 8])
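
The axis argument names the dimension that is collapsed, which can be confusing at first. The sketch below repeats the matrix M from above so that it runs on its own, and compares the shapes of the results:

```python
import numpy as np

M = np.array([[0, 3, 1, 2],
              [1, 9, 7, 0],
              [4, 8, 3, 7]])

# axis=0 collapses the rows: one result per column
col_sums = M.sum(axis=0)
print(col_sums.shape, col_sums)   # 4 values: 5, 20, 11, 9

# axis=1 collapses the columns: one result per row
row_sums = M.sum(axis=1)
print(row_sums.shape, row_sums)   # 3 values: 6, 17, 22
```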

##Write your own function

During data analysis it will be necessary for you to write your own functions. A function is a block of code which only runs when it is called. You can pass data into a function as arguments, and a function can return data as a result.
The parameters that receive the arguments are specified after the function name, inside the parentheses. You can add as many parameters as you want, just separate them with commas.
The body of the function must be indented, so pay attention to the indentation of the code.

In [None]:
#define a function
def square_func(x):
  q = x*x
  print('number: ', x, 'squared: ', q)

#main program
#call the function
square_func(4.5)
a = 3
square_func(a)
square_func(2*a)

number:  4.5 squared:  20.25
number:  3 squared:  9
number:  6 squared:  36


In [None]:
#function with several parameters
#define a function
def calculation(x,y,z):
  result = (x + y) * z
  print('result: ', result)

#main program
calculation(2,3,5)
calculation(5,2,3)


result:  25
result:  21
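
The two functions above print their result. As a sketch of the alternative, the following variant of calculation() uses the **return** keyword to hand the result back to the caller, so that it can be stored in a variable and used in further calculations:

```python
#define a function that returns its result instead of printing it
def calculation(x, y, z):
    return (x + y) * z

#main program
first = calculation(2, 3, 5)
second = calculation(5, 2, 3)
print('first: ', first, 'second: ', second)
print('sum of both results: ', first + second)   # 46
```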


##Boolean Arrays
A condition evaluates to either True (1) or False (0). Such Boolean values are needed whenever you use conditions to make decisions. For example, if you want to delete all of the negative values in a data set, you first need to answer the question: is this value negative? The answer is a Boolean True or False. Applied to a NumPy array, a comparison is evaluated element by element and returns a Boolean array.

In [None]:
x = np.array([1, 2, 3, 4, 5])

In [None]:
x < 3  # less than

array([ True,  True, False, False, False])

In [None]:
x > 3  # greater than

array([False, False, False,  True,  True])

In [None]:
x <= 3  # less than or equal

array([ True,  True,  True, False, False])

In [None]:
x != 3  # not equal

array([ True,  True, False,  True,  True])

In [None]:
x == 3  # equal

array([False, False,  True, False, False])
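
A Boolean array can also be placed inside the square brackets to select only the elements where the condition is True. The sketch below uses this to remove negative values, as in the example mentioned at the start of this section. The data values are made up for illustration:

```python
import numpy as np

# hypothetical measurements, some of them negative
data = np.array([3.2, -1.0, 0.0, 7.5, -4.1])

mask = data >= 0         # Boolean array: True where the value is kept
cleaned = data[mask]     # keeps only the elements where mask is True

print(mask)
print(cleaned)           # 3.2, 0.0 and 7.5 remain
```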

###Counting entries
To count the number of True entries in a Boolean array, **np.count_nonzero()** is useful. It counts the values in an array that are nonzero, and True counts as nonzero. More generally, any number is considered true if it is nonzero, whereas any string is considered true if it is not the empty string.

In [5]:
x = np.array([[9, 4, 0, 3,],[8, 6, 3, 1,],[3, 7, 4, 0]])

In [6]:
# how many values less than 6?
np.count_nonzero(x < 6)

8

Another way to get this count is **np.sum()**, because True is interpreted as 1 and False as 0. The benefit of np.sum is that, like with other NumPy aggregation functions, this summation can be done along rows or columns as well:

In [None]:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)

array([3, 2, 3])

This counts the number of values less than 6 in each row of the matrix.
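
As a short check, this sketch repeats the array from above and compares the two ways of counting. It also counts per column with axis=0:

```python
import numpy as np

x = np.array([[9, 4, 0, 3],
              [8, 6, 3, 1],
              [3, 7, 4, 0]])
mask = x < 6

total_nonzero = np.count_nonzero(mask)
total_sum = np.sum(mask)              # True counts as 1, False as 0
per_column = np.sum(mask, axis=0)     # one count per column

print(total_nonzero, total_sum)       # 8 8
print(per_column)                     # counts 1, 1, 3, 3
```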

If we're interested in quickly checking whether any or all the values are True, we can use np.any or np.all:

In [None]:
# are there any values greater than 8?
np.any(x > 8)

True

In [None]:
# are there any values less than zero?
np.any(x < 0)

False

In [None]:
# are all values less than 10?
np.all(x < 10)

True

In [None]:
# are all values equal to 6?
np.all(x == 6)

False

Note that **np.all()** and **np.any()** can be used along particular axes as well. For example:

In [None]:
# are all values in each row less than 8?
np.all(x < 8, axis=1)

array([False, False,  True])

##Exploring fancy indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once. For example, consider the following array:

In [None]:
import numpy as np
rng = np.random.default_rng(seed=1701)

x = rng.integers(100, size=10)
print(x)

[90 40  9 30 80 67 39 15 33 79]


Suppose we want to access three different elements. We could do it like this:

In [None]:
[x[3], x[7], x[2]]

[30, 15, 9]

Alternatively, we can pass a single list or array of indices to obtain the same result:

In [None]:
ind = [3, 7, 2]
x[ind]

array([30, 15,  9])

Fancy indexing also works in multiple dimensions. Consider the following array:

In [None]:
X = np.arange(12).reshape((3, 4))
X

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Like with standard indexing, the first index refers to the row, and the second to the column:

In [None]:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]

array([ 2,  5, 11])
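
Fancy indexing can also be combined with the other indexing schemes. The sketch below repeats X, row, and col so that it runs on its own. It mixes a simple index with a list of indices, and it uses np.newaxis to pair every row with every listed column:

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# simple index for the row, fancy index for the columns
from_last_row = X[2, [2, 0, 1]]
print(from_last_row)          # [10  8  9]

# a column of row indices combined with a row of column indices
# gives a two-dimensional result
block = X[row[:, np.newaxis], col]
print(block.shape)            # (3, 3)
print(block)
```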

##Simple if else branch

if…elif…else are conditional statements that provide you with the decision making that is required when you want to execute code based on a particular condition. The *else* keyword catches anything which isn't caught by the preceding conditions. Please note the importance of indentation.

In [None]:
x = 12
print('x: ', x)

if x > 0:
  print('This number is positive.')
else:
  print('This number is 0 or negative.')

x:  12
This number is positive.


## Multiple branching

In [None]:
x = -5
print('x: ', x)
if x > 0:
  print('x is positive.')
elif x < 0:
  print('x is negative.')
else:
  print('x is zero')

x:  -5
x is negative.


In [None]:
#multiple logical operators
x = 12
y = 15
z = 20
print('x: ', x)
print('y: ', y)
print('z: ', z)

#condition
if x < y < z:
  print('y is larger than x and smaller than z.')

x:  12
y:  15
z:  20
y is larger than x and smaller than z.
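
The chained comparison used in this cell is shorthand for two comparisons joined with the **and** operator. A short sketch with illustrative variable names:

```python
x = 12
y = 15
z = 20

chained = x < y < z                 # chained comparison
explicit = (x < y) and (y < z)      # the same test written out

print(chained, explicit)            # True True

# 'or' is True if at least one side is True, 'not' inverts a condition
print((x > y) or (y < z))           # True
print(not (x < y))                  # False
```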


##Loops

A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).

In [None]:
#for loop
for i in 2, 7.5, -22:
  print('number: ', i, ', squared: ', i*i)

number:  2 , squared:  4
number:  7.5 , squared:  56.25
number:  -22 , squared:  484


In [None]:
#nested control structure
for x in -2, -1, 0, 1, 2:
  if x > 0:
    print(x, 'positive')
  else:
    if x < 0:
      print(x, 'negative')
    else:
      print(x, 'equal to zero')

-2 negative
-1 negative
0 equal to zero
1 positive
2 positive


To loop through a set of code a specified number of times, we can use the **range()** function.
The **range()** function returns a sequence of numbers that starts at 0 by default, increments by 1 by default, and stops before a specified number.

In [None]:
#use the for loop with the range() function
for i in range(5,9):
  print('number: ', i)

number:  5
number:  6
number:  7
number:  8


In [None]:
for i in range(3,11,2):
  print('number:' , i, ', square: ', i*i)

number: 3 , square:  9
number: 5 , square:  25
number: 7 , square:  49
number: 9 , square:  81


With the while loop we can execute a set of statements as long as a condition is true.

The **randint()** function returns a randomly selected integer from the specified range, with both endpoints included.
The syntax is:\
*random.randint(start,stop)*

In [None]:
#while loop
#random number generator
import random
random.seed()

#initialisation
total = 0  #named total so that the built-in sum() is not overwritten
#while loop
while total < 30:
  number = random.randint(1,8)
  total = total + number
  print('number: ', number, 'intermediate result: ', total)

print('end')

number:  1 intermediate result:  1
number:  5 intermediate result:  6
number:  1 intermediate result:  7
number:  2 intermediate result:  9
number:  2 intermediate result:  11
number:  7 intermediate result:  18
number:  1 intermediate result:  19
number:  4 intermediate result:  23
number:  2 intermediate result:  25
number:  3 intermediate result:  28
number:  6 intermediate result:  34
end
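
Because the numbers are random, this cell prints something different on every run. If you want a repeatable result, one option is to pass a fixed value to **random.seed()**. The sketch below does this and also counts the loop iterations. The seed value 1701 and the variable names are arbitrary choices for illustration:

```python
import random

random.seed(1701)  # fixed seed: the same numbers are drawn on every run

total = 0
draws = 0
while total < 30:
    total = total + random.randint(1, 8)
    draws = draws + 1

print('draws needed: ', draws, 'final total: ', total)
```

The loop ends as soon as the total reaches 30 or more, so the final total always lies between 30 and 37.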
