# Data Mining and Probabilistic Reasoning, WS18/19


Dr. Gjergji Kasneci, The University of Tübingen

-----
## NumPy 
-----

###### Date 29/10/2018

Teaching assistants:

 - Vadim Borisov (vadim.borisov@uni-tuebingen.de)

 - Johannes Haug (johannes-christian.haug@uni-tuebingen.de)

In [1]:
import numpy as np

In [2]:
py_list = [2, 3, 4, 6]
np_array = np.array(py_list)

In [3]:
print(type(py_list), py_list)
print(type(np_array), np_array)

<class 'list'> [2, 3, 4, 6]
<class 'numpy.ndarray'> [2 3 4 6]


In [4]:
py_list[1:3]

[3, 4]

In [5]:
np_array[1:3]

array([3, 4])

In [6]:
py_list[[0, 2]]

TypeError: list indices must be integers or slices, not list

In [7]:
np_array[[0, 2]]

array([2, 4])

In [8]:
np_array[np_array>3]

array([4, 6])
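
The index-list and boolean-mask selections above have no direct equivalent for plain lists. The sketch below shows what the mask itself looks like and how the same selections could be written for `py_list`. The names `mask`, `picked`, and `filtered` are illustrative only.

```python
import numpy as np

py_list = [2, 3, 4, 6]
np_array = np.array(py_list)

# The comparison itself produces a boolean array, one entry per element.
mask = np_array > 3
print(mask.tolist())            # [False, False, True, True]
print(np_array[mask].tolist())  # [4, 6]

# Plain lists need a comprehension for both kinds of selection.
picked = [py_list[i] for i in [0, 2]]      # like np_array[[0, 2]]
filtered = [x for x in py_list if x > 3]   # like np_array[np_array > 3]
print(picked, filtered)         # [2, 4] [4, 6]
```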

In [9]:
py_list * 5

[2, 3, 4, 6, 2, 3, 4, 6, 2, 3, 4, 6, 2, 3, 4, 6, 2, 3, 4, 6]

In [10]:
np_array * 5

array([10, 15, 20, 30])

In [11]:
py_list ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

In [12]:
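# Performance test: element-wise addition of two vectors of length size_of_vec.
def pure_python_version(size_of_vec=1000):
    # Explicit Python loop over the indices.
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]

def numpy_version(size_of_vec=1000):
    # Single vectorized NumPy addition.
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y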


In [13]:
%%timeit -n 10000
pure_python_version()

255 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [14]:
%%timeit -n 10000
numpy_version()

5.84 µs ± 176 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
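
`%%timeit` is an IPython cell magic, so it only works inside a notebook or an IPython shell. As a rough sketch, a similar comparison can be run in a plain script with the standard-library `timeit` module. The function names, the returned values, and the repeat count `n_runs` are assumptions made for this example, and the measured times depend on the machine.

```python
import timeit

import numpy as np


def pure_python_sum(size_of_vec=1000):
    # Element-wise addition with an explicit Python loop.
    x = range(size_of_vec)
    y = range(size_of_vec)
    return [x[i] + y[i] for i in range(len(x))]


def numpy_sum(size_of_vec=1000):
    # The same addition as one vectorized NumPy operation.
    x = np.arange(size_of_vec)
    y = np.arange(size_of_vec)
    return x + y


# Both versions should produce the same values.
assert pure_python_sum() == numpy_sum().tolist()

n_runs = 1000
t_python = timeit.timeit(pure_python_sum, number=n_runs) / n_runs
t_numpy = timeit.timeit(numpy_sum, number=n_runs) / n_runs
print(f"pure Python: {t_python * 1e6:.1f} µs per call")
print(f"NumPy:       {t_numpy * 1e6:.1f} µs per call")
```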


In [15]:
py_list

[2, 3, 4, 6]

In [16]:
np_array

array([2, 3, 4, 6])

In [17]:
np_array ** 2

array([ 4,  9, 16, 36])

In [18]:
matrix = [[1, 2, 4], 
          [3, 1, 0]]
np_matrix = np.array(matrix)

In [19]:
np_matrix.shape

(2, 3)

In [20]:
matrix[1][2]

0

In [21]:
np_matrix[1][2]

0

In [22]:
np_matrix[:,0]

array([1, 3])
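
NumPy also accepts the row and column index in a single pair of brackets, which is the more common spelling for arrays. The sketch below shows that form and how the same column could be collected from the nested list. The name `first_column` is illustrative.

```python
import numpy as np

matrix = [[1, 2, 4],
          [3, 1, 0]]
np_matrix = np.array(matrix)

# One pair of brackets with a (row, column) index.
print(np_matrix[1, 2])             # 0, same element as np_matrix[1][2]

# A nested list has no column slice, so the column is collected row by row.
first_column = [row[0] for row in matrix]
print(first_column)                # [1, 3]
print(np_matrix[:, 0].tolist())    # [1, 3]
```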

In [23]:
np.random.rand()

0.6115294675054577

In [24]:
np.random.randn()

0.03403258284116708

In [25]:
np.random.randn(4)

array([-2.70469941,  0.22480063,  0.07868406, -0.07628918])

In [26]:
np.random.randn(4, 5)

array([[ 1.33199356, -0.28480219, -0.90835507, -0.90362274, -1.62532017],
       [-1.00364917,  0.51846765, -1.1748713 ,  0.40189462, -1.8970166 ],
       [ 1.27664094,  0.0041127 , -0.57289291,  1.19920148,  0.05869416],
       [-0.49468529,  1.21065587,  1.42377265,  1.11946271,  0.74338025]])
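
The random values above change on every run. If repeatable numbers are needed, the global generator can be seeded first. A minimal sketch follows; the seed value `42` is an arbitrary choice.

```python
import numpy as np

np.random.seed(42)          # fix the state of the global generator
first = np.random.randn(4)

np.random.seed(42)          # same seed, same sequence
second = np.random.randn(4)

print(np.array_equal(first, second))   # True
print(np.random.randn(4, 5).shape)     # (4, 5)
```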

In [27]:
np.arange(0, 2, 0.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
       1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])

In [30]:
range(0, 8, 0.1)

TypeError: 'float' object cannot be interpreted as an integer

# Broadcasting

In [32]:
a = np.array([[1, 2, 3], [1, 2, 3]])
print(a)

b = 3
print(b)

c = a + b
print(c)

[[1 2 3]
 [1 2 3]]
3
[[4 5 6]
 [4 5 6]]
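
Broadcasting is not limited to scalars. Shapes are compared from the trailing dimension, and a dimension that is missing or has size 1 is stretched to match the other operand. As a sketch, the example below adds a 1-D row and a 2-D column to `a`. The names `row` and `col` are introduced only for this illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 3]])   # shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,), repeated for each row of a
col = np.array([[100], [200]])         # shape (2, 1), repeated for each column of a

row_sum = a + row
col_sum = a + col

print(row_sum.tolist())   # [[11, 22, 33], [11, 22, 33]]
print(col_sum.tolist())   # [[101, 102, 103], [201, 202, 203]]
```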


In [33]:
a.shape

(2, 3)

In [34]:
a.reshape(-1,2)

array([[1, 2],
       [3, 1],
       [2, 3]])

In [35]:
a.reshape(1,6)

array([[1, 2, 3, 1, 2, 3]])

In [36]:
a.reshape(3,2)

array([[1, 2],
       [3, 1],
       [2, 3]])
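
In `reshape`, a `-1` asks NumPy to infer that dimension from the total number of elements, which is why `a.reshape(-1,2)` and `a.reshape(3,2)` give the same result. The short sketch below also shows that a shape whose size does not match the 6 elements is rejected. The names `inferred` and `flat` are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 3]])   # 6 elements in total

inferred = a.reshape(-1, 2)            # -1 is resolved to 6 / 2 = 3 rows
print(inferred.shape)                  # (3, 2)
print(np.array_equal(inferred, a.reshape(3, 2)))   # True

flat = a.reshape(-1)                   # a single -1 flattens the array
print(flat.tolist())                   # [1, 2, 3, 1, 2, 3]

try:
    a.reshape(4, 2)                    # 8 slots for 6 elements
except ValueError as err:
    print("reshape failed:", err)
```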

----
### Summary

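The benefits of using NumPy are:
- Size - NumPy arrays store elements of a single type in one contiguous block, so they usually take up less memory than a Python list holding the same values.
- Performance - vectorized NumPy operations are typically much faster than loops over Python lists (roughly 40 times faster in the vector addition timed above).
- Functionality - SciPy and NumPy have optimized functions such as linear algebra operations built in.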
---
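
The size claim in the summary can be checked roughly with `sys.getsizeof` and the array's `nbytes` attribute. This is a sketch under simplifying assumptions: it counts the list object plus one integer object per element, it ignores the small fixed overhead of the array object itself, and the exact byte counts depend on the Python build and platform.

```python
import sys

import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n)

# A list stores pointers; each element is a separate Python int object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array stores the raw values in one contiguous buffer.
array_bytes = np_array.nbytes

print("list: ", list_bytes, "bytes")
print("array:", array_bytes, "bytes", f"({np_array.itemsize} bytes per element)")
```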

# Exercises: 
(1) Create a NumPy array of zeros with the shape (3,4). Hint: see the `np.zeros` function, https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.zeros.html

(2) Replace each element in the first column of the array from the first exercise with `1`.  
Task: 
```python
[[0. 0. 0. 0.]    [[1. 0. 0. 0.] 
 [0. 0. 0. 0.] ->  [1. 0. 0. 0.] 
 [0. 0. 0. 0.]]    [1. 0. 0. 0.]]   
```
(3) Using `np.random.randn`, create an array of size `30000` and compute its mean.

In [40]:
z_arr = np.zeros((3,4))
z_arr

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [41]:
z_arr[:,0] = 1
z_arr

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])
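
Exercise (3) has no solution cell above. One possible solution is sketched below. The seed is an assumption added only to make the run repeatable, and the name `samples` is illustrative. Because `np.random.randn` draws from a standard normal distribution, the mean should be close to 0.

```python
import numpy as np

np.random.seed(0)                  # optional, for a repeatable result
samples = np.random.randn(30000)   # 30000 draws from a standard normal

print(samples.shape)               # (30000,)
print(samples.mean())              # a value close to 0
```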