# Data Mining and Probabilistic Reasoning, WS18/19


Dr. Gjergji Kasneci, The University of Tübingen

-----
## NumPy 
-----

###### Date 29/10/2018

Teaching assistants:

 - Vadim Borisov (vadim.borisov@uni-tuebingen.de)

 - Johannes Haug (johannes-christian.haug@uni-tuebingen.de)

In [1]:
import numpy as np

In [2]:
py_list = [2, 3, 4, 6]
np_array = np.array(py_list)

In [3]:
print(type(py_list), py_list)
print(type(np_array), np_array)

<class 'list'> [2, 3, 4, 6]
<class 'numpy.ndarray'> [2 3 4 6]


In [4]:
py_list[1:3]

[3, 4]

In [5]:
np_array[1:3]

array([3, 4])

In [8]:
py_list

[2, 3, 4, 6]

In [6]:
py_list[[0, 2]]

TypeError: list indices must be integers or slices, not list

In [9]:
np_array

array([2, 3, 4, 6])

In [7]:
np_array[[0, 2]]

array([2, 4])

In [10]:
np_array[np_array>3]

array([4, 6])
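
The expression `np_array > 3` is itself evaluated element-wise and yields a boolean array, which is then used as a mask. A minimal sketch of the two steps, reusing the values from above (the names `mask` and `selected` are introduced here only for illustration):

```python
import numpy as np

np_array = np.array([2, 3, 4, 6])

# Step 1: the comparison is applied element-wise and returns a boolean array.
mask = np_array > 3
print(mask)        # [False False  True  True]

# Step 2: indexing with the boolean array keeps the positions where it is True.
selected = np_array[mask]
print(selected)    # [4 6]
```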

In [11]:
py_list * 5

[2, 3, 4, 6, 2, 3, 4, 6, 2, 3, 4, 6, 2, 3, 4, 6, 2, 3, 4, 6]

In [12]:
np_array * 5

array([10, 15, 20, 30])

In [13]:
py_list ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

In [14]:
np_array ** 2

array([ 4,  9, 16, 36])

In [15]:
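# Performance test: element-wise addition of two vectors.
def pure_python_version(size_of_vec=1000):
    # Pure Python: add the elements one by one in a list comprehension.
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]

def numpy_version(size_of_vec=1000):
    # NumPy: a single vectorized addition over whole arrays.
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y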


In [16]:
%%timeit -n 10000
pure_python_version()

344 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [17]:
%%timeit -n 10000
numpy_version()

8.4 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
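
The `%%timeit` cell magic is only available in IPython/Jupyter. Outside a notebook, a comparable measurement could be sketched with the standard-library `timeit` module. The repeat count of 1000 (`n_runs`) is an assumption chosen for illustration, the functions here return their result so it can be checked, and the absolute timings depend on the machine.

```python
import timeit
import numpy as np

def pure_python_version(size_of_vec=1000):
    # Element-wise addition in a Python-level loop.
    X = range(size_of_vec)
    Y = range(size_of_vec)
    return [X[i] + Y[i] for i in range(len(X))]

def numpy_version(size_of_vec=1000):
    # Element-wise addition as one vectorized operation.
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    return X + Y

n_runs = 1000  # assumption: number of calls to average over
t_python = timeit.timeit(pure_python_version, number=n_runs) / n_runs
t_numpy = timeit.timeit(numpy_version, number=n_runs) / n_runs

print(f"pure Python: {t_python * 1e6:.1f} µs per call")
print(f"NumPy:       {t_numpy * 1e6:.1f} µs per call")
```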


In [18]:
py_list

[2, 3, 4, 6]

In [19]:
np_array

array([2, 3, 4, 6])

In [20]:
np_array ** 2

array([ 4,  9, 16, 36])

In [22]:
matrix = [[1, 2, 4], 
          [3, 1, 0]]
np_matrix = np.array(matrix)

In [23]:
np_matrix.shape

(2, 3)

In [24]:
matrix[1][2]

0

In [25]:
np_matrix[1][2]

0

In [26]:
np_matrix[:,0]

array([1, 3])
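
NumPy arrays also accept several indices inside a single pair of brackets, which nested Python lists do not. A minimal sketch using the same values as `np_matrix` above:

```python
import numpy as np

np_matrix = np.array([[1, 2, 4],
                      [3, 1, 0]])

# Row and column index in one pair of brackets.
print(np_matrix[1, 2])     # 0, the same element as np_matrix[1][2]

# A full row, and a slice of columns.
print(np_matrix[0, :])     # [1 2 4]
print(np_matrix[:, 1:])
# [[2 4]
#  [1 0]]
```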

In [27]:
np.random.rand()

0.8696944214771095

In [28]:
np.random.randn()

0.7365801079475242

In [29]:
np.random.randn(4)

array([ 0.44434736,  0.42679038, -0.22449277, -1.37952874])

In [30]:
np.random.randn(4, 5)

array([[-0.68355184, -0.39050538,  0.01928746,  1.35399709,  0.09920509],
       [ 0.72152271,  0.46843235,  0.31117646, -0.28754739, -0.14153854],
       [ 0.20807337, -0.64402091, -1.31461281,  0.94134584, -0.5824354 ],
       [ 2.19101404, -0.17036404, -1.09639244,  0.7766801 , -1.34094406]])

In [31]:
np.arange(0, 2, 0.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
       1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])

In [32]:
range(0, 8, 0.1)

TypeError: 'float' object cannot be interpreted as an integer

In [35]:
range(0, 8, 1)

range(0, 8)

# Broadcasting

In [36]:
a = np.array([[1, 2, 3], [1, 2, 3]])
print(a)

b = 3
print(b)

c = a + b
print(c)

[[1 2 3]
 [1 2 3]]
3
[[4 5 6]
 [4 5 6]]
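
In the cell above, the scalar `b` is conceptually stretched to the shape of `a` before the addition. The same rule applies to arrays of different but compatible shapes. A minimal sketch, where the names `row` and `col` are introduced only for illustration:

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 3]])   # shape (2, 3)

# A 1-D array of shape (3,) is broadcast across both rows of `a`.
row = np.array([10, 20, 30])
print(a + row)
# [[11 22 33]
#  [11 22 33]]

# A column of shape (2, 1) is broadcast across the three columns of `a`.
col = np.array([[100], [200]])
print(a + col)
# [[101 102 103]
#  [201 202 203]]

# Shapes that cannot be aligned raise a ValueError.
try:
    a + np.array([1, 2])   # shape (2,) does not match the last axis of length 3
except ValueError as err:
    print("ValueError:", err)
```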


In [37]:
a.shape

(2, 3)

In [38]:
a.reshape(-1,2)

array([[1, 2],
       [3, 1],
       [2, 3]])

In [39]:
a.reshape(1,6)

array([[1, 2, 3, 1, 2, 3]])

In [40]:
a.reshape(3,2)

array([[1, 2],
       [3, 1],
       [2, 3]])
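
Reshaping only changes how the same six elements are arranged, so the total number of elements must stay the same. The `-1` in `a.reshape(-1, 2)` asks NumPy to infer that dimension from the array size, which is why it gives the same result as `a.reshape(3, 2)`. A short sketch (the names `inferred` and `explicit` are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 3]])

inferred = a.reshape(-1, 2)   # NumPy works out that the first dimension must be 3
explicit = a.reshape(3, 2)
print(inferred.shape)                       # (3, 2)
print(np.array_equal(inferred, explicit))   # True

# Six elements cannot be arranged into rows of four, so this raises a ValueError.
try:
    a.reshape(-1, 4)
except ValueError as err:
    print("ValueError:", err)
```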

----
### Summary

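The benefits of using NumPy are:
- Size - NumPy arrays store their elements in one contiguous, homogeneously typed buffer, so they generally take up less space than Python lists holding the same numbers.
- Performance - Vectorized NumPy operations are usually much faster than equivalent Python loops (roughly 40 times faster in the timing above: 8.4 µs versus 344 µs).
- Functionality - NumPy and SciPy provide optimized built-in functions, such as linear algebra operations.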
---
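
The size and functionality points can be illustrated with a small sketch. The element count `n = 1000`, the explicit `dtype=np.int64`, and the 2x2 system are assumptions chosen for illustration; the exact size of the Python list depends on the interpreter and platform.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)   # assumption: dtype fixed for a predictable size

# Size: the list holds references to separate int objects,
# while the array holds raw 8-byte integers in one buffer.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(list_bytes)        # platform dependent
print(np_array.nbytes)   # 8000

# Functionality: solve the linear system A x = b with a built-in routine.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)
print(x)                 # [1. 2.]
```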

# Exercises: 
(1) Create a NumPy array of zeros with the shape (3, 4). Hint: see the `np.zeros` function https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.zeros.html

(2) Replace each element in the first column of the array from the first exercise with `1`.  
Task: 
```python
[[0. 0. 0. 0.]    [[1. 0. 0. 0.] 
 [0. 0. 0. 0.] ->  [1. 0. 0. 0.] 
 [0. 0. 0. 0.]]    [1. 0. 0. 0.]]   
```
(3) Using `np.random.randn`, create an array of size `30000` and compute its mean.

In [46]:
aaa = np.zeros(shape=(3,4))
print(aaa)
aaa[:,0] = 1
aaa

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])
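
The cell above covers exercises (1) and (2). For exercise (3), one possible solution is sketched below. The seed and the name `samples` are assumptions added for reproducibility and illustration. Since `np.random.randn` draws from a standard normal distribution, the mean should come out close to 0.

```python
import numpy as np

np.random.seed(0)                  # assumption: fixed seed so the run is repeatable
samples = np.random.randn(30000)   # 30000 draws from a standard normal distribution
print(samples.shape)               # (30000,)
print(samples.mean())              # close to 0; the exact value depends on the seed
```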