# NUMPY
## 1. What is NumPy?

= numerical Python

NumPy is a Python library for working with arrays

It is used together with many other packages:
- Pandas
- Matplotlib
- scikit-learn
- etc.

## 2. NumPy arrays
The core of NumPy is the **ndarray** object:
- = n-dimensional array of homogeneous data types
- = a grid of values that all share the same data type (e.g. float64, int32)

Number of dimensions:
- 1-D: vectors
- 2-D: matrices
- 3-D or higher: tensors

Dimensions are called **axes** in NumPy

In [4]:
import numpy as np

# 1-D
x = np.array([1, 2, 3]) # ndarray created from a Python list
print(x)
x.ndim # number of dimensions/axes

[1 2 3]


1

In [5]:
# 2-D
y = np.array([[1, 2, 3],
             [4, 5, 6]])
print(y)
y.size # number of elements in the array

[[1 2 3]
 [4 5 6]]


6

In [6]:
# 3-D
z = np.array([
    [[1, 2, 3],
     [4, 5, 6]],

    [[7, 8, 9],
     [10, 11, 12]]])
print(z)
z.shape # number of elements along each dimension

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]


(2, 2, 3)
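The "homogeneous data type" property can be checked through the `dtype` attribute. The following is a minimal sketch with made-up values; the exact integer type printed for `ints` depends on the platform, so no output is shown for it.

```python
import numpy as np

ints = np.array([1, 2, 3])                      # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])                   # int and float mixed -> everything becomes float64
forced = np.array([1, 2, 3], dtype=np.float32)  # the dtype can also be set explicitly

print(ints.dtype)
print(mixed.dtype, mixed)
print(forced.dtype)
```

Because every element shares one dtype, mixing integers and floats in the input list silently converts the integers to floats.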

## 3. Why NumPy?
NumPy simplifies advanced mathematical and other operations, especially on large amounts of data

NumPy is more efficient and (up to 50x) faster than, for example, Python lists

Two key features make NumPy and NumPy arrays fast and efficient:
- **vectorization**
- **broadcasting**

In [7]:
# in plain Python: multiply the elements of list a by the elements of list b
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
c = []
for i in range(len(a)):
    c.append(a[i]*b[i])

print(c)
# the larger the lists, the slower and less efficient the loop becomes

[5, 12, 21, 32]


In [8]:
# in NumPy: element-by-element operations are the "default mode"
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
c = a * b
print(c)

[ 5 12 21 32]


= vectorization
- no explicit loops, indexing, etc. in the Python code
- these happen behind the scenes ("pre-compiled C code")
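The speed difference can be measured with the standard-library `timeit` module. This is only a rough sketch: the problem size `n` and the number of repetitions are arbitrary choices, and the measured ratio depends on the machine and the NumPy version, so no timings are shown.

```python
import timeit

import numpy as np

n = 100_000  # arbitrary problem size, chosen only for illustration
list_a = list(range(n))
list_b = list(range(n))
arr_a = np.arange(n, dtype=np.int64)
arr_b = np.arange(n, dtype=np.int64)

def multiply_with_loop():
    # explicit Python-level loop over the indices
    return [list_a[i] * list_b[i] for i in range(n)]

def multiply_vectorized():
    # the loop runs inside NumPy's compiled code
    return arr_a * arr_b

loop_seconds = timeit.timeit(multiply_with_loop, number=10)
vectorized_seconds = timeit.timeit(multiply_vectorized, number=10)
print(f"loop: {loop_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")
```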

This default mode applies not only to arithmetic operations but in general (e.g. logical, bit-wise, functional, etc. operations). It also works for operands of different but compatible shapes, e.g. an array and a scalar

= broadcasting
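A small sketch of broadcasting with operands of different shapes; the array values are made up for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

scaled = matrix * 2              # the scalar is applied to every element
shifted = matrix + row           # the row is applied to each row of the matrix
is_large = matrix > 3            # comparisons are element-wise as well

print(scaled)
print(shifted)
print(is_large)
```

In each case NumPy behaves as if the smaller operand had been stretched to the shape of the larger one, without the stretching having to be written out by hand.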

## 4. NumPy basics

### 4.1 array creation

In [9]:
import numpy as np

In [10]:
# array from a Python list
a = np.array([1, 2, 3, 4]) # 1-d array
print(a)
b = np.array([[1, 2, 3, 4], # 2-d array
             [5, 6, 7, 8]])
print(b)

[1 2 3 4]
[[1 2 3 4]
 [5 6 7 8]]


In [11]:
# arrays with "initial placeholder content" (when the elements are not yet known)
# zeros
a = np.zeros((3, 4)) # 2-d array
print(a)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [12]:
# ones
b = np.ones((2, 3, 4)) # 3-d array
print(b)

[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]


In [13]:
# empty: uninitialized content (values are arbitrary, not guaranteed to be zeros, and not random numbers)
c = np.empty((2,3)) # 2-d array
print(c)

[[0. 0. 0.]
 [0. 0. 0.]]
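`np.empty` only reserves memory, so the zeros printed above are incidental. If actual random numbers are wanted, one option is NumPy's random generator. The following is a minimal sketch; the seed and the shapes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=42)         # seed chosen arbitrarily for reproducibility

uniform = rng.random((2, 3))                 # floats in the interval [0, 1)
integers = rng.integers(0, 10, size=(2, 3))  # integers from 0 to 9

print(uniform)
print(integers)
```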


In [14]:
# range
d = np.arange(10, 30, 4) # 1-d array
print(d)

[10 14 18 22 26]


### 4.2 basic operations

In [15]:
# arithmetic operations: element-by-element
a = np.array([10, 20, 30, 40, 50])
b = np.array([2, 3, 4, 5, 6])

In [16]:
c = a - b
print(c)

[ 8 17 26 35 44]


In [17]:
c = a * b
print(c)

d = a @ b # matrix product; for two 1-d arrays this is the dot product
print(d)

[ 20  60 120 200 300]
700
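For 2-d arrays the difference between `*` and `@` becomes visible. A short sketch with two made-up 2x2 matrices:

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])
n = np.array([[5, 6],
              [7, 8]])

elementwise = m * n      # multiplies matching positions
matrix_product = m @ n   # rows of m combined with columns of n

print(elementwise)       # [[ 5 12] [21 32]]
print(matrix_product)    # [[19 22] [43 50]]
```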


In [18]:
c = a/b
print(c)

[5.         6.66666667 7.5        8.         8.33333333]


In [19]:
# arithmetic operations: array-wise
print(a)
x = a.sum()
print(x)

[10 20 30 40 50]
150


In [20]:
# dimension-wise
b = np.array([[1, 2, 3, 4], # 2-d array
             [5, 6, 7, 8]])

y = b.sum(axis = 0) # sum along axis 0 (down the rows): one sum per column
print(y)
z = b.sum(axis = 1) # sum along axis 1 (across the columns): one sum per row
print(z)

[ 6  8 10 12]
[10 26]
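The `axis` argument works the same way for other reductions. A brief sketch using the same 2-d array:

```python
import numpy as np

b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

col_max = b.max(axis=0)    # largest value in each column
row_mean = b.mean(axis=1)  # mean of each row
total_min = b.min()        # no axis given: reduce over the whole array

print(col_max)    # [5 6 7 8]
print(row_mean)   # [2.5 6.5]
print(total_min)  # 1
```

A useful way to read `axis=k` is "the axis that disappears": reducing a (2, 4) array along axis 0 leaves shape (4,), along axis 1 leaves shape (2,).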


### 4.3 indexing, slicing, iterating

#### 1-dimensional arrays

In [21]:
# indexing
a = np.arange(0, 100, 20)
print(a)
a[3]

[ 0 20 40 60 80]


60

In [22]:
# slicing
a[2:5]

array([40, 60, 80])

In [23]:
# iterating
for i in a:
    print(i)

0
20
40
60
80


#### multi-dimensional arrays

In [24]:
# indexing: 1 index per axis
b = np.array([[1, 2, 3, 4], # 2-d array
             [5, 6, 7, 8]])

print(b[1, 3]) # row 2, column 4 (indices start at 0)

c = np.array([[[1, 2, 3, 4], # 3-d array
             [5, 6, 7, 8]],
              
             [[9, 10, 11, 12],
             [13, 14, 15, 16]]])

print(c[1, 0, 2:4])

8
[11 12]
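Slices can be combined across axes, and a basic slice is a view onto the original data rather than a copy. A minimal sketch on the 2-d array from above:

```python
import numpy as np

b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

last_element = b[-1, -1]  # negative indices count from the end
second_col = b[:, 1]      # all rows of column 1
block = b[:, 1:3]         # columns 1 and 2 of every row

block[0, 0] = 99          # block is a view, so this also changes b
print(b)
```

If an independent array is needed, calling `.copy()` on the slice avoids changing the original.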


In [25]:
# iterating: with respect to the first axis

for row in c:
    print(row)

# per element
for element in c.flat:
    print(element)

[[1 2 3 4]
 [5 6 7 8]]
[[ 9 10 11 12]
 [13 14 15 16]]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
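When the position of each element is needed while iterating, `np.ndenumerate` yields the index tuple together with the value. A small sketch with a made-up 2x2 array:

```python
import numpy as np

c = np.array([[1, 2],
              [3, 4]])

visited = []
for index, value in np.ndenumerate(c):
    # index is a tuple with one entry per axis
    visited.append((index, int(value)))
    print(index, value)
```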


### 4.4 shape manipulation

In [26]:
a = np.arange(24)
print(a)

b = a.reshape(4,6)
print(b)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]


In [27]:
# flatten
print(b.ravel())

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
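Both `ravel()` and `flatten()` produce a 1-d array, but they differ in whether the result shares memory with the original. A minimal sketch:

```python
import numpy as np

b = np.arange(6).reshape(2, 3)

raveled = b.ravel()      # returns a view when possible
flattened = b.flatten()  # always returns a copy

print(np.shares_memory(b, raveled))    # True
print(np.shares_memory(b, flattened))  # False
```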


In [28]:
# reshape
print(b.reshape(8,3))

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]
 [18 19 20]
 [21 22 23]]
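One dimension passed to `reshape` may be given as `-1`, in which case NumPy infers it from the number of elements. A short sketch, including the error raised when the element count does not match:

```python
import numpy as np

a = np.arange(24)

four_rows = a.reshape(4, -1)  # -1 is inferred as 6
cube = a.reshape(2, 3, -1)    # -1 is inferred as 4

print(four_rows.shape)  # (4, 6)
print(cube.shape)       # (2, 3, 4)

try:
    a.reshape(5, 5)     # 25 elements requested, but a has 24
except ValueError as error:
    print("reshape failed:", error)
```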


In [29]:
# transpose
print(b.T)

[[ 0  6 12 18]
 [ 1  7 13 19]
 [ 2  8 14 20]
 [ 3  9 15 21]
 [ 4 10 16 22]
 [ 5 11 17 23]]


In [30]:
# stack vertically (row-wise)
c = np.zeros(6)
print(c)

np.vstack((b, c))

[0. 0. 0. 0. 0. 0.]


array([[ 0.,  1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10., 11.],
       [12., 13., 14., 15., 16., 17.],
       [18., 19., 20., 21., 22., 23.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

In [33]:
# stack horizontally (column-wise)
c = np.zeros((4, 6))

np.hstack((b,c))

array([[ 0.,  1.,  2.,  3.,  4.,  5.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 6.,  7.,  8.,  9., 10., 11.,  0.,  0.,  0.,  0.,  0.,  0.],
       [12., 13., 14., 15., 16., 17.,  0.,  0.,  0.,  0.,  0.,  0.],
       [18., 19., 20., 21., 22., 23.,  0.,  0.,  0.,  0.,  0.,  0.]])
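For 2-d arrays, `vstack` and `hstack` correspond to `np.concatenate` along axis 0 and axis 1. A minimal sketch with the arrays from above:

```python
import numpy as np

b = np.arange(24).reshape(4, 6)
c = np.zeros((4, 6))

stacked_rows = np.concatenate((b, c), axis=0)  # same result as np.vstack((b, c))
stacked_cols = np.concatenate((b, c), axis=1)  # same result as np.hstack((b, c))

print(stacked_rows.shape)  # (8, 6)
print(stacked_cols.shape)  # (4, 12)
print(stacked_cols.dtype)  # float64: integer and float inputs give a float result
```

This also explains the decimal points in the outputs above: `b` holds integers, `c` holds floats, and the combined array has to use a single dtype.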

In [34]:
# split along horizontal axis (column-wise)
np.hsplit(b, 2)

[array([[ 0,  1,  2],
        [ 6,  7,  8],
        [12, 13, 14],
        [18, 19, 20]]),
 array([[ 3,  4,  5],
        [ 9, 10, 11],
        [15, 16, 17],
        [21, 22, 23]])]

In [36]:
# split along vertical axis (row-wise)
np.vsplit(b,2)

[array([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]]),
 array([[12, 13, 14, 15, 16, 17],
        [18, 19, 20, 21, 22, 23]])]
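Instead of a number of equal parts, the split functions also accept a list of indices at which to cut. For uneven splits by count, `np.array_split` can be used. A brief sketch; the cut positions are arbitrary:

```python
import numpy as np

b = np.arange(24).reshape(4, 6)

left, middle, right = np.hsplit(b, [1, 4])  # cut before column 1 and before column 4
print(left.shape, middle.shape, right.shape)  # (4, 1) (4, 3) (4, 2)

parts = np.array_split(b, 3, axis=0)        # 4 rows into 3 parts of 2, 1 and 1 rows
print([p.shape for p in parts])             # [(2, 6), (1, 6), (1, 6)]
```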

### 4.5 mathematical formulas

mean squared error (MSE) formula (important for supervised machine learning models based on regression)

![MSE.png](attachment:c107be3d-d953-4609-bcf5-f22520565ce5.png)

![MSE_code.png](attachment:9202264f-2b33-45c5-8659-b5aef95ce4db.png)

![MSE_visualization.png](attachment:74eed533-cb61-446d-8da5-43ef8f207706.png)

(taken from <a href="https://numpy.org/doc/stable/user/absolute_beginners.html" target="_blank">this link</a>)
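Assuming the images show the usual definition, MSE = (1/n) * sum((prediction_i - label_i)^2), the formula can be written in NumPy as a single vectorized expression. The variable names and the values below are made up for illustration.

```python
import numpy as np

# made-up example values, not taken from the source
predictions = np.array([1.0, 2.0, 3.0])
labels = np.array([1.0, 1.0, 5.0])

n = predictions.size
mse = (1 / n) * np.sum(np.square(predictions - labels))
print(mse)  # (0 + 1 + 4) / 3
```

The subtraction, the squaring and the sum each operate on whole arrays, so the expression works unchanged for one value or for millions.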

### 4.6 saving & loading NumPy objects

#### as textfiles

In [126]:
my_data = np.array([[1, 2, 3, 4], # 2-d array
                     [5, 6, 7, 8]])

np.savetxt("my_data.txt", my_data) # only for 1-d and 2-d arrays

np.loadtxt("my_data.txt")

array([[1., 2., 3., 4.],
       [5., 6., 7., 8.]])

<span style='color:Magenta'> savetxt() </span> and <span style='color:Magenta'> loadtxt() </span> accept further arguments:
- header (savetxt only)
- footer (savetxt only)
- delimiter
- comments

txt files are easier to share, but they are larger and slower to read --> .npy and .npz
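A minimal sketch of these arguments in use. The file name and header text are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np

my_data = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_data.csv")  # hypothetical file name

    # fmt="%d" writes plain integers instead of the default float notation
    np.savetxt(path, my_data, fmt="%d", delimiter=",", header="a,b,c,d")

    # the header line is written with a leading "# " and skipped as a comment on loading
    loaded = np.loadtxt(path, delimiter=",", dtype=int)

print(loaded)
```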

#### as binary files

In [None]:
my_2darray = np.array([[1, 1, 2, 2], # 2-d array
                    [3, 3, 4, 4]])

my_3darray = np.array([[[1, 2, 3, 4], # 3-d array
             [5, 6, 7, 8]],
              
             [[9, 10, 11, 12],
             [13, 14, 15, 16]]])

np.save("one_array.npy", my_2darray) # .npy file --> 1 ndarray object
np.savez("multiple_arrays.npz", my_2darray, my_3darray) # .npz file --> 2+ ndarray objects

# np.load("one_array.npy")
# np.load("multiple_arrays.npz")

.npy and .npz files store the information needed to reconstruct the ndarray:
- data
- shape
- dtype
- etc. 
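A sketch of the full round trip, again in a temporary directory. The keyword names `flat` and `cube` are made up for illustration; arrays passed positionally to `np.savez` are stored under the names `arr_0`, `arr_1`, and so on.

```python
import os
import tempfile

import numpy as np

my_2darray = np.array([[1, 1, 2, 2],
                       [3, 3, 4, 4]])
my_3darray = np.arange(1, 17).reshape(2, 2, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "one_array.npy")
    npz_path = os.path.join(tmp_dir, "multiple_arrays.npz")

    np.save(npy_path, my_2darray)
    # keyword arguments give the arrays names inside the archive
    np.savez(npz_path, flat=my_2darray, cube=my_3darray)

    restored = np.load(npy_path)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        restored_cube = archive["cube"]

print(restored.dtype == my_2darray.dtype, restored.shape)
print(names)
```

Shape and dtype come back exactly as they were saved, with no parsing options needed.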

#### loading files with missing data

Unlike <span style='color:Magenta'> loadtxt() </span>, <span style='color:Magenta'> genfromtxt() </span> can handle missing data

In [134]:
data = np.genfromtxt("data_missing.txt", 
                     dtype = int,                # dtype = None --> determined individually per column
                     #skip_header = 0,           # number of rows to skip at the start of the file
                     # missing_values = "",      # strings that mark missing data
                     filling_values = 9999,      # values to substitute for missing data
                     delimiter = ",")
data

array([[   1,    2,    3,    4],
       [   5, 9999,    7,    8],
       [   9,   10,   11,   12]])
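The file `data_missing.txt` is not included here, so the following sketch recreates plausible content in memory with `io.StringIO`. The text is an assumption chosen to match the output above.

```python
import io

import numpy as np

# assumed file content: the second row has an empty field
text = "1,2,3,4\n5,,7,8\n9,10,11,12"

data = np.genfromtxt(io.StringIO(text),
                     dtype=int,
                     filling_values=9999,
                     delimiter=",")
print(data)

# with the default float dtype and no filling_values, the gap becomes nan
as_float = np.genfromtxt(io.StringIO(text), delimiter=",")
print(np.isnan(as_float).sum())  # 1
```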