<a href="https://colab.research.google.com/github/TurkuNLP/Text_Mining_Course/blob/master/numpy_primer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Small numpy primer

* This notebook collects some basic numpy tips, limited to what is needed in this course

# ndarray

* `ndarray` is numpy's n-dimensional array type
* you will meet it in many places throughout this course
* when we switch over to deep learning with `torch` and `tensorflow`, the corresponding concept there is the "tensor"


In [2]:
import numpy as np

In [6]:
# this is a small 2D array as a normal python list, a 3x3 matrix if you will
M=[[0,1,2],[3,4,5],[6,7,8]]
D=np.asarray(M) #now we turn it into numpy's array
print("Type of D:", type(D))
print("D itself:\n",D)

Type of D: <class 'numpy.ndarray'>
D itself:
 [[0 1 2]
 [3 4 5]
 [6 7 8]]


In [16]:
#Here is a 3-dimensional array
D3=np.asarray([[[1],[2]],[[3],[4]]])


## shape

* Shape is an important property: it tells you how many dimensions / axes the array has, and how long each of them is
* Our array D is 3x3, meaning it has two axes and each is 3 elements long
* Array D3 is 2x2x1, meaning it has three axes: axes 0 and 1 are 2 elements long and axis 2 is 1 element long
* You can think of these as n-dimensional tables
* A 2D matrix is simply a special case of an ndarray where the number of axes is 2
* A vector is simply a special case of an ndarray where the number of axes is 1

In [17]:
#You can ask for the shape through the attribute .shape, which gives you back a tuple
print("D.shape",D.shape)
print("D3.shape",D3.shape)

D.shape (3, 3)
D3.shape (2, 2, 1)
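
Two related attributes are sometimes handy next to `.shape`. The following is a minimal sketch, re-creating `D` and `D3` from above so that it runs on its own:

```python
import numpy as np

D = np.asarray([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
D3 = np.asarray([[[1], [2]], [[3], [4]]])

# ndim is the number of axes, i.e. the length of the shape tuple
print("D.ndim", D.ndim, "D3.ndim", D3.ndim)
# size is the total number of elements, i.e. the product of the shape
print("D.size", D.size, "D3.size", D3.size)
```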


## Beware of singleton axes

* It is easy to get caught by the difference between the shapes (1,N) and (N,): both look like a vector, but the first has two axes and the second only one
* Think of it this way: `[[1]]` is not the same as `[1]`, which is not the same as `1`

In [19]:
print("A (1,4) ndarray:", np.array([[1,2,3,4]]))
print("A (4,) ndarray:", np.array([1,2,3,4]))

A (1,4) ndarray: [[1 2 3 4]]
A (4,) ndarray: [1 2 3 4]
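
The printed values look almost the same, so it may help to see where the two arrays actually behave differently. A small sketch (the names `row` and `vec` are only for illustration):

```python
import numpy as np

row = np.array([[1, 2, 3, 4]])   # shape (1, 4): two axes
vec = np.array([1, 2, 3, 4])     # shape (4,): one axis

print(row.shape, vec.shape)
# Transposing swaps the two axes of the (1, 4) array...
print(row.T.shape)
# ...but changes nothing for a 1-D array, which has only one axis
print(vec.T.shape)
# Indexing with a single integer gives a whole row in one case, a single number in the other
print(row[0], vec[0])
```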


## Elementary slicing

* Slicing in numpy arrays is a science in its own right, but the basics are not complicated
* You can slice several axes at once, listing them left to right starting from axis 0

In [25]:
D

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [23]:
D[0]

array([0, 1, 2])

In [24]:
D[0,0]

0

In [28]:
#You can do ranges as usual, in all axes
D[0:2,0:2] #first two elements on the first axis, first two elements on the second axis

array([[0, 1],
       [3, 4]])

In [43]:
D[0:2,:] #First two elements on the first axis, everything on the second axis

array([[0, 1, 2],
       [3, 4, 5]])

In [30]:
#Note the difference between the slice and index in the following:

D[0:1,:2] #Because 0:1 is a slice, even though it is 1-long, it keeps the axis, which becomes 1-long

array([[0, 1]])

In [31]:
D[0,:2]  #Because 0 is an index, it selects the row and drops that axis, so there is no 1-long axis in the result

array([0, 1])
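
The rule of thumb behind the last two cells is that every integer index removes an axis, while every slice keeps it. A minimal sketch that checks this through the shapes, re-creating `D` so that it runs on its own:

```python
import numpy as np

D = np.asarray([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

print(D[0:1, :2].shape)   # slice, slice: both axes kept -> (1, 2)
print(D[0, :2].shape)     # index, slice: first axis dropped -> (2,)
print(D[:, 1].shape)      # slice, index: second axis dropped -> (3,)
print(D[:, 1:2].shape)    # slice, slice: the same column, kept as a 2D array -> (3, 1)
```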

## Singleton axes

* Sometimes you need to add a singleton axis to match the expected dimensionality
* Sometimes you need to drop it for the same reason
* `squeeze` drops singleton axes, `np.expand_dims` adds one
* You can add/remove these singleton axes easily, because they are only a matter of interpretation of the data
* Add: `1` into `[1]` into `[[1]]` into `[[[1]]]`...
* Remove: ...the opposite

In [32]:
D[:1,:] #shorthand for [0:1,0:3]: you can omit the start if it is 0 and the end if it is the end of the axis

array([[0, 1, 2]])

In [33]:
D_sliced=D[:1,:]
D_sliced.shape #here you see the singleton first axis

(1, 3)

In [35]:
print(D_sliced.squeeze()) #drops all singleton axes
print(D_sliced.squeeze().shape)

[0 1 2]
(3,)


In [41]:
D_sliced_more_dims=np.expand_dims(D_sliced,0) #adds a singleton axis at dimension index 0, i.e. as the first axis
D_sliced_more_dims.shape


(1, 1, 3)

In [42]:
D_sliced_more_dims.squeeze() #get rid of the extra singleton axes

array([0, 1, 2])
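
Besides `np.expand_dims`, the same singleton axes can be added with `np.newaxis` indexing or with `reshape`, and `squeeze` accepts an `axis` argument when you only want to drop one of them. A small sketch (the arrays `v` and `x` are made up for illustration):

```python
import numpy as np

v = np.array([0, 1, 2])            # shape (3,)

as_row = np.expand_dims(v, 0)      # shape (1, 3)
as_col = np.expand_dims(v, 1)      # shape (3, 1)
also_row = v[np.newaxis, :]        # indexing shorthand, same result as as_row
also_col = v.reshape(3, 1)         # reshape works too, same result as as_col

# squeeze can be told which singleton axis to drop
x = np.zeros((1, 3, 1))
print(x.squeeze(axis=0).shape)     # (3, 1)
print(x.squeeze().shape)           # (3,)
```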

## More on axes
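* Axes matter in many numpy functions
* Many functions work along one or several axes and have different defaults: `np.max` uses all of the data unless told otherwise, while `np.sort` uses the last axis
* This can be confusing at first: the axis you name is the one that gets collapsed, so `axis=0` on a 2D array gives one result per column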

In [44]:
D

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [45]:
np.max(D) #no axis specified: maximum is calculated across all of the data

8

In [46]:
np.max(D,axis=0) #maximum is calculated along the first axis: one maximum per column

array([6, 7, 8])

In [47]:
np.max(D,axis=1) #maximum is calculated along the second axis: one maximum per row

array([2, 5, 8])
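
One way to keep track of what `axis` does is to watch the shapes: the named axis disappears from the result, unless you ask to keep it as a singleton with `keepdims=True`. A minimal sketch, re-creating `D` so that it runs on its own:

```python
import numpy as np

D = np.asarray([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

col_max = np.max(D, axis=0)                     # axis 0 is collapsed: one value per column
row_max = np.max(D, axis=1)                     # axis 1 is collapsed: one value per row
row_max_2d = np.max(D, axis=1, keepdims=True)   # the collapsed axis is kept as a singleton

print(col_max.shape, row_max.shape, row_max_2d.shape)
print(np.sum(D, axis=0))  # the same rule applies to sum, mean, min, ...
```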

## Sorting

In [60]:
R=np.random.random((3,4)) #Gives a random 3x4 array
R

array([[0.54407135, 0.85856531, 0.75063223, 0.88297569],
       [0.00304345, 0.71736475, 0.61511061, 0.62537479],
       [0.78482089, 0.4472638 , 0.85887243, 0.46054136]])

In [61]:
np.sort(R,axis=0) #sort along the first axis (each column is sorted independently)

array([[0.00304345, 0.4472638 , 0.61511061, 0.46054136],
       [0.54407135, 0.71736475, 0.75063223, 0.62537479],
       [0.78482089, 0.85856531, 0.85887243, 0.88297569]])

In [63]:
np.sort(R,axis=1) #sort along the second axis (each row is sorted independently)

array([[0.54407135, 0.75063223, 0.85856531, 0.88297569],
       [0.00304345, 0.61511061, 0.62537479, 0.71736475],
       [0.4472638 , 0.46054136, 0.78482089, 0.85887243]])
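
`np.sort` always sorts in ascending order and returns a new array. If you need descending order, one common approach is to reverse the sorted result with a slice. A small sketch, using a hand-picked `R` instead of the random one above so that the result is reproducible:

```python
import numpy as np

# a small fixed array (an assumption for illustration, not the random R above)
R = np.array([[0.5, 0.9, 0.1],
              [0.3, 0.2, 0.8]])

ascending = np.sort(R, axis=1)     # returns a sorted copy, R itself is unchanged
descending = ascending[:, ::-1]    # reverse each row with a step of -1

print(descending)
print(R)  # still in the original order
```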

## Arg-sorting
* Exactly like sorting, but gives the indices that would sort the array, not the sorted values


In [64]:
np.argsort(R,axis=0)

array([[1, 2, 1, 2],
       [0, 1, 0, 1],
       [2, 0, 2, 0]])

In [65]:
np.argsort(R,axis=1)

array([[0, 2, 1, 3],
       [0, 2, 3, 1],
       [1, 3, 0, 2]])
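
The indices from `argsort` become useful when you apply them back to an array. A minimal sketch, assuming a small hand-picked `R` (not the random one above) and using `np.take_along_axis` to apply the per-row indices:

```python
import numpy as np

# a small fixed array (an assumption for illustration, not the random R above)
R = np.array([[0.5, 0.9, 0.1],
              [0.3, 0.2, 0.8]])

order = np.argsort(R, axis=1)                        # per-row indices in ascending order of value
sorted_rows = np.take_along_axis(R, order, axis=1)   # apply those indices row by row

print(order)
print(sorted_rows)

# the last column of the argsort is the index of the largest element in each row
best = order[:, -1]
print(best)
```

The same indices could just as well be applied to another array of the same shape, for example to reorder labels by their scores.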

# Looping over arrays
* There are many ways to do it; explicit index loops are the easiest to follow, but they are slow on large arrays

In [66]:
D

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [67]:
#Perhaps the most explicit way: loop over the indices of each axis
for i in range(D.shape[0]):
    for j in range(D.shape[1]):
        item=D[i,j]
        print(item)

0
1
2
3
4
5
6
7
8
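
A few of the other ways, as a minimal sketch (re-creating `D` so that it runs on its own). Which one is cleanest depends on whether you need the rows, the indices, or only the values:

```python
import numpy as np

D = np.asarray([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

# Iterating over an array walks along its first axis, so this yields the rows
for row in D:
    print(row)

# np.ndenumerate yields (index tuple, value) pairs over all elements
for (i, j), item in np.ndenumerate(D):
    print(i, j, item)

# .flat iterates over the single elements, row by row
items = [item for item in D.flat]
print(items)
```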
