## Part 5. Numpy

Lists can hold arbitrary collections of data, but that generality limits the specialized functionality they can offer.

Specifically, lists are not very convenient for vector/matrix operations and linear algebra. For vectors and for rectangular or multi-dimensional numeric matrices, numpy arrays are more convenient and offer more useful functionality than generic lists.

Numpy's main object (an instance of the class ndarray) is the one- or multidimensional array: a 1d, 2d or higher-dimensional rectangular block of numbers. The number of dimensions (or axes) is called the rank.

Unlike lists, numpy arrays are homogeneous across each dimension, and all their elements have the same datatype.
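A minimal sketch of what this homogeneity means in practice (the variable names `mixed` and `ints` are made up for the illustration): when a list mixes integers and floats, numpy converts everything to one common type.

```python
import numpy as np

# Hypothetical example data: integers mixed with one float
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)      # float64: the integers were upcast to floats

# An all-integer list stays an integer array
ints = np.array([1, 2, 3])
print(ints.dtype.kind)  # 'i' (integer; the exact width depends on the platform)
```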

Numpy arrays support vector and matrix operations (like elementwise addition and multiplication) and provide advanced statistical functionality.

Numpy is a module, so it needs to be imported in the notebook before it can be used.

In [1]:
import numpy as np #imports numpy and uses a short abbreviation to refer to it further

It is easy to convert data from a list to a numpy array and back, as long as the list is homogeneous in element type and dimensionality.

In [2]:
#recall data from the notebook on lists 
T=[65.1,66.5,59.1,50.5,65.0,66.2,62.2,73.1,70.0,72.2,69.9,83.5]

In [3]:
X=np.array(T,dtype=float); X #convert to numpy, specifying that data is of a float type; can be also integer

array([ 65.1,  66.5,  59.1,  50.5,  65. ,  66.2,  62.2,  73.1,  70. ,
        72.2,  69.9,  83.5])

In [4]:
list(X) #convert back

[65.099999999999994,
 66.5,
 59.100000000000001,
 50.5,
 65.0,
 66.200000000000003,
 62.200000000000003,
 73.099999999999994,
 70.0,
 72.200000000000003,
 69.900000000000006,
 83.5]
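As a side note, `list(X)` keeps the elements as numpy scalars, while the array method `tolist()` returns built-in Python floats. The sketch below uses a shortened copy of the data so that it is self-contained; the names `as_list` and `as_plain` are chosen only for this example.

```python
import numpy as np

T = [65.1, 66.5, 59.1]  # shortened copy of the data above
X = np.array(T, dtype=float)

as_list = list(X)       # elements are numpy scalars (numpy.float64)
as_plain = X.tolist()   # elements are built-in Python floats

print(type(as_list[0]), type(as_plain[0]))
print(as_plain == T)    # True: the values round-trip unchanged
```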

In [5]:
#get size (shape) of the array as a tuple with the same number of elements as the array rank
X.shape

(12,)

In [6]:
X.ndim #get rank

1

In [7]:
X.size #number of elements

12

In [8]:
X.dtype #base type of the array's elements

dtype('float64')
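The same attributes are more informative for a two-dimensional array. A small sketch, assuming a made-up 2 x 3 matrix `M` that is not part of the original data:

```python
import numpy as np

# Hypothetical 2 x 3 matrix, used only to illustrate the attributes
M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(M.shape)  # (2, 3): two rows, three columns
print(M.ndim)   # 2
print(M.size)   # 6
print(M.dtype)  # float64
```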

Some useful methods, including mean and std, are shown below. With numpy there is no need to write loops to compute these, as we did before with lists.

In [9]:
X.sum()

803.30000000000007

In [10]:
X.mean()

66.941666666666677

In [11]:
X.std()

7.74633874090779
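Note that `std()` by default divides by the number of elements n (the population standard deviation); passing `ddof=1` divides by n - 1 instead. The sketch below repeats the data so that it runs on its own, and the names `pop_std`, `sample_std` and `manual` are introduced only for this comparison.

```python
import numpy as np

T = [65.1, 66.5, 59.1, 50.5, 65.0, 66.2, 62.2, 73.1, 70.0, 72.2, 69.9, 83.5]
X = np.array(T, dtype=float)

pop_std = X.std()           # divides by n (ddof=0, the default)
sample_std = X.std(ddof=1)  # divides by n - 1

# The default written out with elementwise operations
manual = np.sqrt(((X - X.mean()) ** 2).sum() / X.size)
print(pop_std, sample_std, np.isclose(pop_std, manual))
```

Which variant is appropriate depends on whether the data are treated as a whole population or as a sample; the z-scores in this notebook use the default.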

In [12]:
#the z-scores can also be computed in a single line thanks to the way arithmetic operations are performed with numpy arrays:
#elementwise subtraction and division by constants
z=abs((X-X.mean())/X.std()); z

array([ 0.23774673,  0.05701618,  1.01230619,  2.12250809,  0.25065605,
        0.09574416,  0.61211713,  0.79499923,  0.39481017,  0.67881531,
        0.38190085,  2.13756897])

In [13]:
#moreover indexing is now also simpler - we can select all the elements with z-score higher than 2 in a single operation
X[z>2]

array([ 50.5,  83.5])

In [14]:
#Let's consider what exactly happens here:
#first we get a boolean array of elementwise conditions
z>2

array([False, False, False,  True, False, False, False, False, False,
       False, False,  True], dtype=bool)

In [15]:
#and then index X using this array, taking only those elements of X for which z>2 is True. There is no equally straightforward way of doing so with lists
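For comparison, here is a sketch of the same selection done both ways. The list version needs a comprehension that pairs each value with its z-score; the names `outliers_np` and `outliers_list` are made up for this example.

```python
import numpy as np

T = [65.1, 66.5, 59.1, 50.5, 65.0, 66.2, 62.2, 73.1, 70.0, 72.2, 69.9, 83.5]
X = np.array(T, dtype=float)
z = np.abs((X - X.mean()) / X.std())

outliers_np = X[z > 2]  # boolean-mask indexing on the array

# Equivalent selection on the plain list: pair each value with its z-score
outliers_list = [t for t, zi in zip(T, z) if zi > 2]

print(outliers_np, outliers_list)
```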

## Creating basic arrays

Vector and matrix operations often use zero, unit or identity matrices. Those are easily created with numpy

In [16]:
np.zeros((3,3))

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [17]:
np.ones((3,4))

array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

In [18]:
np.identity(4)

array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

In [19]:
#identity matrix can be also created as a diagonal matrix using a unit diagonal
np.diag([1]*4)

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])
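The two outputs above differ in element type: `np.identity` produces floats, while `np.diag` inherits integers from the list it is given. A short sketch of the difference, which also uses `np.eye` as another way to build an identity matrix (the variable names are illustrative only):

```python
import numpy as np

I_float = np.identity(4)  # float64 entries
I_int = np.diag([1] * 4)  # integer entries, inherited from the list
I_eye = np.eye(4)         # another way to get a float identity matrix

print(I_float.dtype, I_int.dtype.kind)
print(np.array_equal(I_float, I_int))  # True: same values despite different dtypes
```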

In [20]:
#create a random matrix with elements uniformly distributed between -1 and 1
np.random.uniform(-1,1,(3,3))

array([[-0.48671437,  0.89961787,  0.0339919 ],
       [-0.24268613,  0.34368704,  0.69153098],
       [-0.1585112 , -0.90092649,  0.71776126]])
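Random output like the above changes on every run. If reproducible numbers are wanted, one option (not used in the original notebook) is a seeded generator created with `np.random.default_rng`, available in recent numpy versions. The seed value 0 below is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed 0 is an arbitrary choice
R = rng.uniform(-1, 1, (3, 3))  # values drawn from the interval [-1, 1)

# A second generator with the same seed reproduces the same matrix
rng_again = np.random.default_rng(0)
R_again = rng_again.uniform(-1, 1, (3, 3))

print(np.array_equal(R, R_again))  # True: same seed, same numbers
```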

In [21]:
#correlation between two random vectors
np.corrcoef(np.random.rand(10),np.random.rand(10))

array([[ 1.        , -0.25800629],
       [-0.25800629,  1.        ]])

## Elementwise and matrix operations

By default all arithmetic operations with numpy arrays are elementwise

In [22]:
A=np.ones((3,3))
B=A+A+np.identity(3)
B

array([[ 3.,  2.,  2.],
       [ 2.,  3.,  2.],
       [ 2.,  2.,  3.]])

In [23]:
B*B

array([[ 9.,  4.,  4.],
       [ 4.,  9.,  4.],
       [ 4.,  4.,  9.]])

In [24]:
B**2

array([[ 9.,  4.,  4.],
       [ 4.,  9.,  4.],
       [ 4.,  4.,  9.]])

In [25]:
#if matrix multiplication is needed:
np.matmul(B,B)

array([[ 17.,  16.,  16.],
       [ 16.,  17.,  16.],
       [ 16.,  16.,  17.]])
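In Python 3.5 and later, the `@` operator can be used as shorthand for `np.matmul`. A minimal sketch contrasting it with the elementwise `*`, rebuilding `B` so that the block is self-contained:

```python
import numpy as np

A = np.ones((3, 3))
B = A + A + np.identity(3)

elementwise = B * B  # each entry squared
matrix_prod = B @ B  # row-by-column product, same as np.matmul(B, B)

print(np.array_equal(matrix_prod, np.matmul(B, B)))
print(matrix_prod[0, 0], elementwise[0, 0])  # 17.0 9.0
```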

In [26]:
#determinant of a matrix
np.linalg.det(B)

7.0000000000000009

In [27]:
#eigenvalues and eigenvectors of a matrix
np.linalg.eig(B)

(array([ 1.,  7.,  1.]), array([[-0.81649658,  0.57735027, -0.27546855],
        [ 0.40824829,  0.57735027, -0.52791413],
        [ 0.40824829,  0.57735027,  0.80338269]]))
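`np.linalg.eig` returns a tuple: the array of eigenvalues and a matrix whose columns are the corresponding eigenvectors. The sketch below checks this by verifying that B v = λ v for each pair; the names `vals` and `vecs` are chosen for the example.

```python
import numpy as np

B = 2 * np.ones((3, 3)) + np.identity(3)  # same matrix B as above
vals, vecs = np.linalg.eig(B)

# Column vecs[:, k] is the eigenvector paired with eigenvalue vals[k]
for k in range(len(vals)):
    v = vecs[:, k]
    print(np.allclose(B @ v, vals[k] * v))

# Sanity checks: product of eigenvalues = determinant, sum = trace
print(np.isclose(np.prod(vals), np.linalg.det(B)))
print(np.isclose(vals.sum(), np.trace(B)))
```

The order of the eigenvalues and the sign of each eigenvector are not guaranteed, so it is safer to compare through such identities than against hard-coded numbers.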

### Exercise 1. 

Implement a custom function that computes the Pearson correlation coefficient between two numpy arrays of the same size. If their sizes differ, return NaN. Verify your results by comparing against the built-in numpy.corrcoef().

The Pearson correlation coefficient between vectors $X=(X_i, i=1,2,...,n)$ and $Y=(Y_i, i=1,2,...,n)$ is defined as 
$$
C=\frac{E[(X-E[X])(Y-E[Y])]}{\sigma(X)\sigma(Y)},
$$
where $E[\cdot]$ stands for the mean and $\sigma(\cdot)$ for the standard deviation.
