# `DSML_WS_02` - Introduction to Python Library Management & `NumPy`

In this tutorial we introduce the concept of Python libraries and cover a first such library: NumPy.

We will go through the following:

- Introduction to the concept of `Python Libraries`
- Introduction to `NumPy`

## `Python Libraries`

**Introduction to Libraries**

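A library (or module, or package) is, loosely speaking, reusable Python code that you can load into your own program. Strictly, a module is a Python object with arbitrarily named attributes that you can bind and reference; in its simplest form it is a file consisting of Python code. A module can define functions, classes and variables, and it can also include runnable code. A package is a collection of modules.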

You can use any Python source file as a module by executing an import statement in some other Python source file. The import has the following syntax:


```
import <module name>
```

By convention, it is common to import modules under an abbreviated name. This imports the module in the same way as `import <module name>`, with the only difference being that it is then available as `<module name abbreviation>`. In the case of `numpy`, for example, the abbreviation `np` is used.

```
import <module name> as <module name abbreviation>
```
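As a small illustration of the two import forms, the sketch below uses the standard-library module `math`. It is chosen here only because it ships with every Python installation; the abbreviation `m` is an arbitrary choice for this example, not an established convention.

```python
# plain import: the module is available under its full name
import math

# aliased import: the same module, available under a shorter name
import math as m

# both names refer to the very same module object
print(math.sqrt(16))   # 4.0
print(m.sqrt(16))      # 4.0
print(math is m)       # True
```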


**Exercise**: Import the NumPy (`numpy`) module and abbreviate it with `np` for easier access

In [1]:
# your code here
import numpy as np

**Adding/Installing Libraries**

To add Python libraries to your installation, you can use the `conda` package manager that we installed in the last workshop. Alternatively, you can use the `pip` package manager. The quickest way to do so is via a terminal:

* If you are on a **Windows** computer, use the "Anaconda Prompt" from the Start menu.
* On a **Mac**, start up the "Terminal". 
* In **Linux**, use any of the terminals available.


The general command syntax is the following:

```
conda install <package name>
```

If you are looking for a specific package but are unsure of its exact name, do a quick web search and/or check the [Anaconda Cloud](https://anaconda.org).

Tip: It is recommended to retrieve all packages from the same conda channel, such as `conda-forge`, so that all dependencies work together smoothly. Simply run the following:

```
conda install -c conda-forge <package name>
```

**Relevant Libraries for this course**

There is a large variety of open source libraries available in Python. Below is a list of some of the most relevant ones for data science, which will be covered in this course.

* Selected data science libraries

    * Data Analysis and Processing
    >* Pandas (pd)
    >* Numpy (np)
    * Visualization        
    >* matplotlib and pyplot (plt)
    >* seaborn (sns)
    * Models and methods
    >* Scikit Learn (sklearn)
    >* statsmodels
    
    
Tip: It is usually advisable to stick with one channel for all your libraries. For me, `conda-forge` has proven to be very stable. You can specify the channel from which to retrieve a package by adding `-c <channel name>` to the install command.

### <font color='green'>**Exercise**: Install scikit-learn on your machine using the command `conda install -c conda-forge scikit-learn`. Afterwards, import the library here using `import sklearn` to test whether the installation was successful.</font>

In [6]:
# Your code here
import sklearn
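If the import fails, it can help to check whether the package is visible to the Python installation you are running. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example. Note that the package is installed as `scikit-learn` but imported as `sklearn`.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name, import_name=None):
    """Return the installed version string, or None if the package is missing."""
    import_name = import_name or dist_name
    # find_spec returns None when the module cannot be found on the import path
    if importlib.util.find_spec(import_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # importable, but without distribution metadata (e.g. standard-library modules)
        return "unknown"


# the distribution is called "scikit-learn", but it is imported as "sklearn"
print(installed_version("scikit-learn", "sklearn"))
print(installed_version("numpy"))
```

A result of `None` means the package was not found in the current environment, in which case you probably need to install it or activate the right conda environment.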

---

## `NumPy`

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number capabilities allowing for efficient matrix operations.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes. The number of axes is called the rank and is available as the `ndim` attribute.

In today's short overview tutorial we will cover the following:

1. **Creating NumPy Arrays**
1. **Manipulating NumPy Arrays**
1. **NumPy Array Operations**

Let's get started...

In [8]:
# import numpy as np if not already done above
import numpy as np

In [10]:
# define an array using np.array([..,..,..])
i = np.array([1,2,3])

In [14]:
# return the type
type(i)

numpy.ndarray

### Creating NumPy Arrays

First, we can use `np.array` to create arrays from python lists. Unlike the Python lists, **NumPy is constrained to arrays that all contain the same type**. If types don't match, NumPy will upcast if possible. 

In [15]:
# assign an array of integers to variable A
A = np.array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [17]:
# return the type of A
type(A)

numpy.ndarray

In [19]:
# let's try to create an Array with integers and floats and assign it to variable B
B = np.array([3, 5.1, 4.6, 6])

In [20]:
# return B; note that NumPy upcasts everything to floats!
B

array([3. , 5.1, 4.6, 6. ])
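Upcasting is not limited to integers and floats. The short sketch below (the variable names are chosen only for this example) shows what happens when numbers and strings are mixed, and uses `np.result_type` to ask NumPy for the common type of two data types without building an array.

```python
import numpy as np

# integers and floats: everything becomes a float
mixed_numbers = np.array([1, 2.5])
print(mixed_numbers.dtype)        # float64

# numbers and strings: everything becomes a (unicode) string
mixed_text = np.array([1, "two", 3.0])
print(mixed_text.dtype.kind)      # U

# np.result_type reports the common type of the given types
print(np.result_type(np.int64, np.float64))   # float64
```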

If we want to explicitly set the data type of the resulting array, we can use the `dtype` keyword.

In [23]:
# assign an Array with integers to C; however, specify "dtype = float"
C = np.array([1,2,3,8],dtype = float)
C

array([1., 2., 3., 8.])
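The data type of an existing array can also be converted afterwards with the `.astype` method. The following sketch (with example values chosen for illustration) shows that converting floats to integers cuts off the decimals rather than rounding them.

```python
import numpy as np

C_float = np.array([1.7, 2.2, -3.9])

# astype returns a new array; the original array is left unchanged
C_int = C_float.astype(int)

print(C_int)     # [ 1  2 -3]  (decimals are cut off, not rounded)
print(C_float)   # still contains the original floats
```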

Other examples of creating arrays using np functions:

In [31]:
# create a 1x5 array (a row vector) filled with zeros
D = np.zeros(shape=(1,5),dtype = float)
print(D)

[[0. 0. 0. 0. 0.]]


In [32]:
# create a 2x4 matrix of ones (float)
E = np.ones((2,4), dtype= float)
print(E)

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]


In [33]:
# create a vector from 0 to 12 (exclusive) in steps of 2
F = np.arange(0,12,2)
print(F)

[ 0  2  4  6  8 10]


In [34]:
# create a vector from 0 to 1 with five equally (linearly) spaced elements 
G = np.linspace(0,1,5)
print(G)

[0.   0.25 0.5  0.75 1.  ]


In [35]:
# create a 2x2 matrix with random floats in the half-open interval [0.0, 1.0)
H = np.random.random((2,2))
print(H)

[[0.09171585 0.45367774]
 [0.78691305 0.88794468]]


In [36]:
#Return random integers from 0 (inclusive) to 10 (exclusive) of size (4,3,2)
I = np.random.randint(0,10,(4,3,2))
print(I)

[[[1 3]
  [4 1]
  [9 0]]

 [[3 1]
  [0 9]
  [3 0]]

 [[4 8]
  [5 6]
  [3 3]]

 [[3 6]
  [7 9]
  [9 4]]]
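Because these numbers are random, your output will differ from the one printed here. If you need reproducible results, one option is NumPy's `np.random.default_rng` with a fixed seed. The sketch below assumes an arbitrary seed of 42; two generators created with the same seed produce the same numbers.

```python
import numpy as np

# 42 is an arbitrary seed chosen for this example
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# random floats in the half-open interval [0.0, 1.0)
H_a = rng_a.random((2, 2))
H_b = rng_b.random((2, 2))
print(np.array_equal(H_a, H_b))   # True: same seed, same numbers

# random integers from 0 (inclusive) to 10 (exclusive)
I_seeded = rng_a.integers(0, 10, size=(4, 3, 2))
print(I_seeded.shape)             # (4, 3, 2)
```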


In [37]:
print("D =", D,
      "\n\nE =", E, 
      "\n\nF =", F, 
      "\n\nG =", G, 
      "\n\nH =", H, 
      "\n\nI =", I)

D = [[0. 0. 0. 0. 0.]] 

E = [[1. 1. 1. 1.]
 [1. 1. 1. 1.]] 

F = [ 0  2  4  6  8 10] 

G = [0.   0.25 0.5  0.75 1.  ] 

H = [[0.09171585 0.45367774]
 [0.78691305 0.88794468]] 

I = [[[1 3]
  [4 1]
  [9 0]]

 [[3 1]
  [0 9]
  [3 0]]

 [[4 8]
  [5 6]
  [3 3]]

 [[3 6]
  [7 9]
  [9 4]]]


**Exercise**: Define a few additional NumPy Arrays using some or all of the commands introduced above

In [38]:
# Your code here


### Manipulating NumPy Arrays

Data manipulation in Python is nearly synonymous with NumPy array manipulation (although a lot of it may happen in higher-level frameworks like pandas). We will cover a few categories of basic array manipulations here:
- **Attributes of arrays**: Determining the size, shape, memory consumption and data type of arrays.
- **Indexing of arrays**: Getting and setting the value of individual array elements.
- **Slicing of arrays**: Getting and setting smaller subarrays within a larger array.
- **Reshaping of arrays**: Changing the shape of a given array.
- **Joining and splitting of arrays**: Combining multiple arrays into one, and splitting one array into many.

#### NumPy Array Attributes:
You can retrieve an attribute by appending it to the respective array.

In [50]:
# determine the shape of array D using .shape
D.shape

(1, 5)

In [51]:
# determine the memory consumption of array D using .nbytes
D.nbytes

40

In the following, some example attributes of array H are presented.

In [52]:
# remember our array H
H

array([[0.09171585, 0.45367774],
       [0.78691305, 0.88794468]])

In [53]:
# returns the number of dimensions (axes)
print("H ndim: ", H.ndim)

# returns shape in form (#row,#col)
print("H shape: " , H.shape) ##### Most important!

# returns size (i.e. no of elements)
print("H size: ", H.size)

# returns data type
print("H dtype: ", H.dtype)

# returns length of one array element in bytes
print("itemsize: ", H.itemsize," bytes")

# returns total bytes consumed by the elements of the array
print("nbytes:  ", H.nbytes, "bytes")

H ndim:  2
H shape:  (2, 2)
H size:  4
H dtype:  float64
itemsize:  8  bytes
nbytes:   32 bytes
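These attributes are related to each other: `size` is the product of the entries in `shape`, and `nbytes` is `size` times `itemsize`. A small sketch with a made-up example array `Z`:

```python
import numpy as np

# Z is an example array introduced only for this illustration
Z = np.zeros((3, 4), dtype=np.float64)

print(Z.size == Z.shape[0] * Z.shape[1])   # True: 3 * 4 = 12 elements
print(Z.nbytes == Z.size * Z.itemsize)     # True: 12 * 8 = 96 bytes

# the same array stored as 32-bit floats needs half the memory
Z32 = Z.astype(np.float32)
print(Z32.nbytes)                          # 48
```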


#### NumPy Array Indexing:
In the following, some examples on indexing are presented. Note that, for a 1-dimensional array, this is very similar to indexing and slicing lists!

Accessing single elements:

In [54]:
# remember our basic array A
A

array([1, 2, 3, 4, 5])

In [59]:
# index the array from the front and the back
print(A[0])
print(A[-1])

1
5


In [60]:
# fill in the correct indices
print("The 4th element of A is {}".format(A[3]))
print("The last element of A is {}".format(A[-1]))

The 4th element of A is 4
The last element of A is 5


In a multidimensional array (e.g. a matrix), you access items using a comma-separated tuple of indices.

In [77]:
# remember H
print(H)

[[0.09171585 0.45367774]
 [0.78691305 0.88794468]]


In [63]:
# remember the shape of H
H.shape

(2, 2)

In [66]:
# access the element in the bottom left
H[1,0]

0.7869130468384273

In [67]:
# fill in the correct indices
print ("The first element of H is {}".format(H[0,0]))  #array[row,column]
print ("The last element of H is {}".format(H[1,1]))   #array[row,column]

The first element of H is 0.09171585420881956
The last element of H is 0.8879446801225097
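The same index notation can be used to set values. The sketch below works on fresh example arrays (`A2` and `H2` are names chosen for this illustration) so that the arrays defined above stay unchanged. Keep in mind that an array has a fixed data type: a float written into an integer array is cut off.

```python
import numpy as np

A2 = np.array([1, 2, 3, 4, 5])

# overwrite the first element
A2[0] = 10

# A2 holds integers, so 3.99 is stored as 3
A2[-1] = 3.99
print(A2)          # [10  2  3  4  3]

# in a 2-dimensional array, use array[row, column]
H2 = np.zeros((2, 2))
H2[1, 0] = 7.5
print(H2[1, 0])    # 7.5
```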


#### NumPy Array Slicing:

**One-dimensional arrays**

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon `:` character. The syntax is as follows:
` X[start (incl.):stop (excl.):step]`

In [78]:
# remember array F
print(F)

[ 0  2  4  6  8 10]


In [85]:
# slice F to retrieve all elements except the first and the last
F[1:5]

array([2, 4, 6, 8])

In [88]:
# we can also reverse the order by setting steps to -1
F[::-1]

array([10,  8,  6,  4,  2,  0])

In [90]:
# items 3 and 4
print ("middle subarray:", F[2:4])

# first three elements (indices 0 to 2; the stop index 3 is excluded)
print("First 3 elements:", F[:3])

# last 2 elements
print("Last 2 elements:", F[-2:])

# first element and every second element from there
print("Every other element:", F[::2])

middle subarray: [4 6]
First 3 elements: [0 2 4]
Last 2 elements: [ 8 10]
Every other element: [0 4 8]


**Multi-dimensional arrays**

Multi-dimensional slices work in the same way, with multiple slices separated by commas. The command is `X[slice row, slice column]`.

In [92]:
# let's create a new multi-dimensional array
J = np.random.randint(low=0,high=20, size=(3,4))
# note that you can readily drop the input names "low", "high", "size" as long as you retain the correct order!
J

array([[ 8,  0, 13, 17],
       [ 4, 10,  4, 10],
       [ 3, 14, 10, 19]])

In [93]:
# add the correct indices
print("The first two rows and the first three columns: \n", J[:2,:3])

The first two rows and the first three columns: 
 [[ 8  0 13]
 [ 4 10  4]]


In [94]:
# add the correct indices
print("All rows and every other column:\n", J[:,::2])

All rows and every other column:
 [[ 8 13]
 [ 4  4]
 [ 3 10]]


In [95]:
print("Rows and columns reversed:\n",J[::-1,::-1])

Rows and columns reversed:
 [[19 10 14  3]
 [10  4 10  4]
 [17 13  0  8]]
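One detail worth knowing: unlike list slices, NumPy slices are views on the original data, not copies. Changing a slice therefore changes the original array. The sketch below uses a fresh example array `J2` (a name chosen for this illustration) and shows how `.copy()` avoids this.

```python
import numpy as np

J2 = np.arange(12).reshape(3, 4)

# a slice is a view: it shares its data with J2
sub_view = J2[:2, :2]
sub_view[0, 0] = 99
print(J2[0, 0])                          # 99: the original array changed too
print(np.shares_memory(J2, sub_view))    # True

# an explicit copy is independent of the original
sub_copy = J2[:2, :2].copy()
sub_copy[0, 0] = -1
print(J2[0, 0])                          # still 99
```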


**Exercise**: Familiarize yourself with indexing by attempting some indexing techniques on the arrays defined above

In [96]:
# Your code here


#### NumPy Array Reshaping:

Another useful type of operation is reshaping of arrays. The most flexible way of doing this is with the reshape method. Note that for this to work, the size of the initial array must match the size of the reshaped array.

In [106]:
# return evenly spaced values within a given interval
K = np.arange(1,25)
K

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])

In [107]:
# determine number of dimensions of K
K.ndim

1

In [108]:
# determine number of elements of K
len(K)
K.size

24

In [109]:
# we can re-shape this array into any shape with 24 elements using the .reshape method
K.reshape(4,6) # 6*4=24 items

array([[ 1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12],
       [13, 14, 15, 16, 17, 18],
       [19, 20, 21, 22, 23, 24]])
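Two practical details of `reshape`, sketched below on an example array `K2` (a name used only here): you can pass `-1` for one dimension and let NumPy work out its length, and a shape whose size does not match raises a `ValueError`.

```python
import numpy as np

K2 = np.arange(1, 25)   # 24 elements

# -1 lets NumPy infer the missing dimension: 24 / 4 = 6
print(K2.reshape(4, -1).shape)      # (4, 6)

# this also works for more than two dimensions: 24 / (2 * 3) = 4
print(K2.reshape(2, 3, -1).shape)   # (2, 3, 4)

# 5 * 5 = 25 does not match the 24 elements of K2
try:
    K2.reshape(5, 5)
except ValueError as err:
    print("ValueError:", err)
```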

#### NumPy Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.

**Concatenation of arrays**

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routine `np.concatenate`. Additionally, `np.vstack` and `np.hstack` may be used.

In [111]:
# let's define two arrays we want to concatenate
L = np.array([1,2,3])
M = np.array([4,5,6])

In [112]:
# concatenate L and M using np.concatenate
np.concatenate((L,M))

array([1, 2, 3, 4, 5, 6])

In [113]:
# let's define two multi-dimensional arrays
N = np.array([[3,5,7],[1,3,5]])
O = np.array([[2,4,2],[0,9,8]])

In [116]:
# print both arrays
print("N:\n",N)
print("O:\n",O)

N:
 [[3 5 7]
 [1 3 5]]
O:
 [[2 4 2]
 [0 9 8]]


In [120]:
# There are different ways of concatenating these arrays. We can specify the axis using the keyword 'axis'
print("Row-wise:\n",np.concatenate((N,O), axis = 0))
print("Column-wise:\n",np.concatenate((N,O), axis = 1))

Row-wise:
 [[3 5 7]
 [1 3 5]
 [2 4 2]
 [0 9 8]]
Column-wise:
 [[3 5 7 2 4 2]
 [1 3 5 0 9 8]]


For working with arrays of mixed dimensions, it can be more practical to use the `np.vstack` (vertical stack, i.e. stacking on top of each other) and `np.hstack` (horizontal stack, i.e. stacking next to each other) functions:

In [122]:
# stack row-wise using vstack
print(np.vstack((N,O)))

[[3 5 7]
 [1 3 5]
 [2 4 2]
 [0 9 8]]


In [123]:
# stack column-wise using hstack
print(np.hstack((N,O)))

[[3 5 7 2 4 2]
 [1 3 5 0 9 8]]
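To illustrate the mixed-dimension case, the following sketch adds a 1-dimensional row and a single column to a 2-dimensional array. The arrays `N2`, `new_row` and `new_col` are example values introduced only for this illustration.

```python
import numpy as np

N2 = np.array([[3, 5, 7], [1, 3, 5]])
new_row = np.array([9, 9, 9])     # 1-dimensional, shape (3,)
new_col = np.array([[0], [0]])    # 2-dimensional, shape (2, 1)

# vstack treats the 1-dimensional array as a single row
stacked_rows = np.vstack((new_row, N2))
print(stacked_rows.shape)   # (3, 3)

# hstack appends the column to the right
stacked_cols = np.hstack((N2, new_col))
print(stacked_cols.shape)   # (2, 4)
```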


**Splitting of arrays**

The opposite of concatenation is splitting, which is implemented by the functions `np.split`, `np.hsplit`, and `np.vsplit`. For each of these, we can pass either the number of equally sized sections or a list of indices giving the split points:

In [124]:
# remember our one-dimensional array K
print(K)


[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]


In [128]:
# we can split K into 3 arrays using np.split
K1, K2, K3 = np.split(K,3)
print(K1, K2, K3)

[1 2 3 4 5 6 7 8] [ 9 10 11 12 13 14 15 16] [17 18 19 20 21 22 23 24]


Note that this only works if the number of elements of the original array can be split equally among the sub-arrays.
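If the array cannot be divided equally, there are two ways around this, sketched below on an example array `K3` (a name chosen for this illustration): pass a list of split indices to `np.split`, or use `np.array_split`, which allows sections of unequal length.

```python
import numpy as np

K3 = np.arange(1, 11)   # 10 elements: 1 to 10

# split before index 3 and before index 7
first, middle, last = np.split(K3, [3, 7])
print(first, middle, last)   # [1 2 3] [4 5 6 7] [ 8  9 10]

# array_split accepts a number of sections that does not divide evenly
parts = np.array_split(K3, 3)
print([len(p) for p in parts])   # [4, 3, 3]
```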

In [131]:
# let's create a new multi-dimensional array
P = np.arange(16).reshape((4, 4))
print(P)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


In [139]:
# using vsplit, we can split an array into multiple sub-arrays vertically (row-wise)
row1, row2, row3, row4 = np.vsplit(P, 4)

print("Row 1:",row1)
print("Row 2:",row2)
print("Row 3:",row3)
print("Row 4:",row4)

Row 1: [[0 1 2 3]]
Row 2: [[4 5 6 7]]
Row 3: [[ 8  9 10 11]]
Row 4: [[12 13 14 15]]


In [140]:
# using hsplit, we can split an array into multiple sub-arrays horizontally (column-wise)
left, middle1, middle2, right = np.hsplit(P, 4)
print("Left: \n",left)
print("Middle1: \n",middle1)
print("Middle2: \n",middle2)
print("Right: \n",right)

Left: 
 [[ 0]
 [ 4]
 [ 8]
 [12]]
Middle1: 
 [[ 1]
 [ 5]
 [ 9]
 [13]]
Middle2: 
 [[ 2]
 [ 6]
 [10]
 [14]]
Right: 
 [[ 3]
 [ 7]
 [11]
 [15]]


**Exercise**: Create a few arrays and concatenate them, then split the result to recover the original arrays

In [141]:
# Your code here


### NumPy Array Operations

NumPy allows for **element-wise** as well as linear algebra **matrix-type** operations, which are a key component of scientific computing tasks. Operating on whole arrays at once makes computing fast and easy, and it is the core functionality of `numpy`.

In [151]:
# let's create two multi-dimensional arrays
a = np.arange(6).reshape((2, 3))
b = np.arange(6,12).reshape((2, 3))

In [153]:
# print a and b
print("a:\n", a)
print("b:\n", b)

a:
 [[0 1 2]
 [3 4 5]]
b:
 [[ 6  7  8]
 [ 9 10 11]]


**Element-wise operations**

In [154]:
# we can do an element-wise addition using the + operator
c = a + b
print(c)

[[ 6  8 10]
 [12 14 16]]


In [155]:
# we can do an element-wise subtraction using the - operator
d = a - b
print(d)

[[-6 -6 -6]
 [-6 -6 -6]]


In [156]:
# we can do an element-wise multiplication using the * operator
e = a * b
print(e)

[[ 0  7 16]
 [27 40 55]]


In [157]:
# we can do an element-wise division using the / operator
f = a / b
print(f)

[[0.         0.14285714 0.25      ]
 [0.33333333 0.4        0.45454545]]
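Element-wise operations also work when the two operands do not have the same shape, as long as the shapes are compatible. This is the broadcasting mentioned in the introduction. A minimal sketch with example arrays (`a2`, `row` and `col` are names chosen for this illustration):

```python
import numpy as np

a2 = np.arange(6).reshape((2, 3))   # [[0 1 2], [3 4 5]]

# a single number is applied to every element
print(a2 + 10)    # [[10 11 12], [13 14 15]]

# a vector of length 3 is applied to every row
row = np.array([100, 200, 300])
print(a2 + row)   # [[100 201 302], [103 204 305]]

# a 2x1 column is applied to every column
col = np.array([[1], [2]])
print(a2 * col)   # [[0 1 2], [6 8 10]]
```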


**Matrix operations**

In [158]:
# remember our original arrays a and b
print("a:\n", a)
print("b:\n", b)

a:
 [[0 1 2]
 [3 4 5]]
b:
 [[ 6  7  8]
 [ 9 10 11]]


We can perform a matrix multiplication using the '@' operator.

In [159]:
a@b

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)

We get an error message when attempting to perform a matrix multiplication on a and b. Why? Look at the dimensions of a and b.

In [161]:
print(a.shape)
print(b.shape)

(2, 3)
(2, 3)


Performing a matrix multiplication on two 2x3 matrices is not possible. However, we can transpose b (making it a 3x2 matrix). Multiplying a 2x3 with a 3x2 matrix should work.

In [163]:
# transpose b using .T
b_t = b.T

print(b_t)

[[ 6  9]
 [ 7 10]
 [ 8 11]]


In [164]:
# perform matrix multiplication
a@b_t

array([[ 23,  32],
       [ 86, 122]])
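The rule behind this is that the inner dimensions must match: an (n, k) matrix times a (k, m) matrix gives an (n, m) matrix. The sketch below rebuilds the two arrays under new names (`a3` and `b3`, chosen for this illustration) and also shows that `np.matmul` gives the same result as the `@` operator.

```python
import numpy as np

a3 = np.arange(6).reshape((2, 3))
b3 = np.arange(6, 12).reshape((2, 3))

# (2, 3) @ (3, 2) -> (2, 2)
prod = a3 @ b3.T
print(prod.shape)    # (2, 2)

# transposing the other operand works too: (3, 2) @ (2, 3) -> (3, 3)
other = a3.T @ b3
print(other.shape)   # (3, 3)

# the @ operator and np.matmul are equivalent here
print(np.array_equal(prod, np.matmul(a3, b3.T)))   # True
```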

---