### CDS NYU
### DS-GA 1007 | Programming for Data Science
### Lecture 05
### October 7, 2024

---

# NumPy: Array Manipulation for Scientific Computing

## Introduction

https://numpy.org/

NumPy is a fundamental Python package for scientific computing. The NumPy library contains multidimensional array and matrix data objects with methods to efficiently operate on them, including mathematical, logical, shape manipulation, sorting, selecting, I/O, basic linear algebra, basic statistical operations, random simulations, discrete Fourier transforms, and much more.

At the core of the NumPy package is the ndarray object. It encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences:

- NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original

- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements. Note in this case the efficiency of NumPy decreases accordingly

- Because they are fixed-type containers, in contrast to dynamic-type containers such as lists, NumPy arrays store and manipulate data more efficiently: all items share a single data type, so no per-element type metadata is needed. This homogeneous layout is what makes vectorization possible, that is, applying an operation to a whole array in compiled code rather than in a Python loop. The advantage is speed, the disadvantage is less flexibility

- NumPy arrays provide many mathematical and other types of operations. Typically, such operations are executed more efficiently and with less code than is possible using Python dynamic-type containers such as lists

- NumPy is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and is used extensively in the backend of Pandas, SciPy, Matplotlib, Scikit-Learn, Scikit-Image and most other data science and scientific Python packages
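
The differences listed above can be checked in a few lines. The following is a minimal sketch; the variable names (`py_list`, `arr`, `mixed`, `bigger`) are made up for illustration.

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array(py_list)

# A list can mix types; an array coerces everything to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)            # float64: the integers were upcast

# "Resizing" returns a new array; arr itself still has 3 elements.
bigger = np.append(arr, 4)
print(arr.size, bigger.size)  # 3 4

# The * operator repeats a list but multiplies an array element-wise.
print(py_list * 2)            # [1, 2, 3, 1, 2, 3]
print(arr * 2)                # [2 4 6]
```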


#### Import NumPy as a library:

In [2]:
import numpy as np


# NumPy Arrays, Vectors and Matrices

NumPy arrays can have one or several dimensions. 
In the context of linear algebra, NumPy arrays can thus be used as "vectors" or "matrices". We will think about linear algebra later in this notebook. For now, let us start simple and discuss how to create arrays.


## Creating NumPy Arrays

### From a Python List

We can create an array by directly converting a list or a nested list (i.e., a list of lists):

In [3]:
# 1D array (vector)
l = [1,2,3,4,5]
vec = np.array(l)
vec


array([1, 2, 3, 4, 5])

In [4]:
# 2D array (matrix)
l = [[1,2,3], [4,5,6], [7,8,9]]
mat = np.array(l)
mat


array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

### From a file
Directly load a numerical dataset (array) from a file

In [6]:
np.loadtxt(fname='MedicalData.csv', delimiter=',')


array([[0., 0., 1., ..., 3., 0., 0.],
       [0., 1., 2., ..., 1., 0., 1.],
       [0., 1., 1., ..., 2., 1., 1.],
       ...,
       [0., 1., 1., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 2., 0.],
       [0., 0., 1., ..., 1., 1., 0.]])
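
To try `np.loadtxt` without a file on disk, it can also read from a file-like object. The sketch below uses a small, made-up CSV string (not the content of `MedicalData.csv`).

```python
import io
import numpy as np

# Hypothetical three-row CSV held in memory instead of on disk.
csv_text = "0,0,1\n0,1,2\n0,1,1\n"
small = np.loadtxt(io.StringIO(csv_text), delimiter=',')

print(small.shape)  # (3, 3)
print(small.dtype)  # float64, the default dtype of loadtxt
```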

### Using NumPy built-in functions

#### Function "arange": 
Create an array with values evenly spaced by a specific increment over a given interval

In [7]:
np.arange(0,10,2)


array([0, 2, 4, 6, 8])

In [8]:
np.arange(0,25).reshape(5,5)


array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

#### Function "linspace"
Create an array with a specific number of evenly spaced values over a given interval

In [9]:
np.linspace(0,10,5)


array([ 0. ,  2.5,  5. ,  7.5, 10. ])
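
The two functions differ in what the third argument means and in whether the end of the interval is included. A short side-by-side sketch (variable names are illustrative):

```python
import numpy as np

by_step = np.arange(0, 10, 2)     # step of 2, stop value 10 excluded
by_count = np.linspace(0, 10, 5)  # 5 points, stop value 10 included

print(by_step)   # [0 2 4 6 8]
print(by_count)  # [ 0.   2.5  5.   7.5 10. ]

# linspace can also report the spacing it used.
points, step = np.linspace(0, 10, 5, retstep=True)
print(step)      # 2.5
```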

#### Function "rand"
Create an array of a given shape containing random samples; each function below draws from a different statistical distribution

In [None]:
np.random.rand(5,5) # Uniform distribution


In [None]:
np.random.randn(5,5) # Normal (Gaussian) distribution


In [None]:
np.random.randint(0,10,25).reshape(5,5) # Return random integers from min (inclusive) to max (exclusive)
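
The same kinds of arrays can also be drawn through NumPy's `Generator` interface, which makes it easy to fix a seed for reproducible results. This is an alternative sketch rather than what the cells above use; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

u = rng.random((5, 5))                 # uniform on [0, 1)
n = rng.standard_normal((5, 5))        # standard normal (Gaussian)
k = rng.integers(0, 10, size=(5, 5))   # integers from 0 (inclusive) to 10 (exclusive)

# Re-creating the generator with the same seed reproduces the same draws.
rng2 = np.random.default_rng(seed=42)
print(np.array_equal(u, rng2.random((5, 5))))  # True
```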


#### Function "zeros" and "ones"

Create an array of zeros or ones

In [None]:
np.zeros(5)


In [None]:
np.zeros((5,5))


In [None]:
np.ones(5)


In [None]:
np.ones((5,5))


#### Function "eye" to create an identity matrix
Quickly create an identity matrix

In [None]:
np.eye(5)


## NumPy Array Attributes 

In [None]:
l = [[1,2,3],[4,5,6],[7,8,9]]
m = np.array(l)
m


In [None]:
print(type(m)) # Python type (class) of object


In [None]:
print(m.dtype) # NumPy data type (dtype) of the elements in the array


In [None]:
print(m.ndim) # Number of dimensions 


In [None]:
print(m.shape) # Number of elements in each dimension 


In [None]:
print(m.size) # Total number of elements in the array 
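
The attributes are related to each other: `size` is the product of the entries of `shape`, and the memory used follows from `size` and the element type. A small sketch with made-up arrays `m_int` and `m_float`:

```python
import numpy as np

m_int = np.array([[1, 2, 3], [4, 5, 6]])
m_float = m_int.astype(float)   # new array with converted elements

print(m_int.ndim, m_int.shape, m_int.size)   # 2 (2, 3) 6
print(m_float.dtype)                         # float64
print(m_float.itemsize, m_float.nbytes)      # 8 48 (bytes per element, total bytes)
```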


## Indexing and Selection

### Vector

In [None]:
a = np.arange(0,9)
print(a)


In [None]:
a[5]   # Simple indexing


In [None]:
a[2:5] # Slicing (view to access subarray, data not copied) 


**Semantic difference compared to slicing lists**: Slicing a NumPy array does not copy the data; it returns a view. If a slice is assigned to a new variable name, mutating the elements of one also mutates the other (both variables refer to the same location in memory). To clone an array, we need to explicitly invoke the method ``copy`` 

In [None]:
# Any change made to a slice will affect the original array because a slice is just 
# a "view" to access a sub-array, it points to the same location in memory

s = a[2:4]
s[:] = 10

print(s)
print(a) 


In [None]:
# To create a separate copy, invoke the method "copy": 
s = a[2:4].copy()
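
The contrast between list slices, array views, and explicit copies can be verified directly. A minimal sketch, with illustrative variable names, using `np.shares_memory` to check whether two arrays overlap in memory:

```python
import numpy as np

a = np.arange(9)
lst = list(range(9))

view = a[2:4]          # NumPy slice: a view on the same data
clone = a[2:4].copy()  # explicit, independent copy
lst_slice = lst[2:4]   # list slice: always a new list

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, clone))  # False

view[:] = 10       # writes through to a
lst_slice[0] = 10  # does not touch lst
print(a)           # [ 0  1 10 10  4  5  6  7  8]
print(lst)         # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(clone)       # [2 3]
```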


**Broadcasting Rule**: 

Shapes are compared dimension by dimension, starting from the last one (a missing leading dimension counts as size 1). When the sizes of the two arrays do not match in a given dimension, the size of one of the arrays in that dimension needs to be 1:
* If it is not the case, an error is raised
* If it is the case, the array with size equal to 1 in that dimension is broadcast (virtually repeated) to match the size of the other array in that dimension

In [None]:
# Scalar broadcasting
a[0:5]=100 
print(a)


In [None]:
# Array broadcasting
np.ones((3,3)) + np.arange(3) 
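
Both branches of the broadcasting rule can be seen on small arrays. The sketch below uses made-up shapes: one pair that is compatible and one that is not.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)   # shape (1, 4)

# Each size-1 dimension is stretched to match the other array.
total = col + row
print(total.shape)   # (3, 4)

# Shapes (3, 3) and (2,) disagree in the last dimension and neither size is 1.
try:
    np.ones((3, 3)) + np.arange(2)
    failed = False
except ValueError as err:
    failed = True
    print("Broadcast error:", err)
```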


### Matrix

In [None]:
m = a.reshape(3,3)
print(m)


In [None]:
m[1,0] # Element at row 1, column 0. The alternative m[1][0] (chained indexing, as with nested lists) also works but is not recommended


In [None]:
m[1:,:] # Slicing in 2D


"**Fancy indexing**": The concept is to put an array of indices inside the indexing brackets of the array being indexed (hence the double brackets notation), giving flexibility to select specific elements from the array in any order, any number of times

In [None]:
m[[2,0,2]] # Fancy indexing: Select entire row in any order, any number of times


In [None]:
m[:,[2,0,2]] # Fancy indexing: Select entire columns in any order, any number of times


In [None]:
m > 10 # The expression m > 10 evaluates to a Boolean array with the same shape as m (called a Boolean mask)


In [None]:
m[m>10] # Fancy indexing (the expression m > 10 is itself a Boolean array)
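
Two further behaviours of fancy indexing are worth checking on a small example: index arrays given for several dimensions are paired element by element, and the result of fancy indexing is a copy, whereas assignment through a mask changes the array in place. A sketch on a made-up 3x3 array:

```python
import numpy as np

m = np.arange(9).reshape(3, 3)

# Two index arrays are paired up: this picks m[0, 2] and m[2, 0].
print(m[[0, 2], [2, 0]])   # [2 6]

# Fancy indexing returns a copy, so the original is unaffected...
picked = m[[2, 0]]
picked[:] = -1
print(m[0, 0], m[2, 2])    # 0 8

# ...but assigning through a Boolean mask modifies the array in place.
m[m > 5] = 0
print(m)
```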


## Mathematical Operations on Data  
For arithmetic operations, operators apply on an element-wise basis, so the arrays must have the same shape or shapes that are compatible under the broadcasting rule.

In [None]:
a = np.arange(0,9)


In [None]:
a + a


In [None]:
a - a


In [None]:
a * a


In [None]:
# Division by zero produces a warning, not an error.
# The result is replaced by INF ("INFinity")
1/a


In [None]:
# Division of zero by zero also produces a warning, not an error.
# The result is replaced by NAN ("Not A Number")
a/a
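
When `inf` or `nan` results are expected, the warnings can be silenced locally, or the division can be restricted to the safe entries. A hedged sketch of both options; replacing the undefined entries by 0 is an arbitrary choice made for this example.

```python
import numpy as np

a = np.arange(0, 9)

# Silence the warnings locally when inf / nan are expected results.
with np.errstate(divide='ignore', invalid='ignore'):
    inv = 1 / a      # inf where a == 0
    ratio = a / a    # nan where a == 0

print(np.isinf(inv[0]), np.isnan(ratio[0]))   # True True

# Alternatively, divide only where it is safe and keep 0 elsewhere.
safe = np.divide(1.0, a, out=np.zeros(a.shape), where=(a != 0))
print(safe[:3])   # [0.  1.  0.5]
```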


In [None]:
a**3


In [None]:
np.sqrt(a) 


In [None]:
np.exp(a)


In [None]:
np.sin(a)


In [None]:
np.log(a)


In [None]:
# When operating on NumPy arrays, don't forget about broadcasting concepts:
a[0:5] = 100 # Scalar broadcasting (example shown above)
a + 10       # Scalar broadcasting
10 * a       # Scalar broadcasting
np.ones((2,9)) + np.arange(9) # Array broadcasting (example shown above)
a + np.random.rand(2,9) # Array broadcasting: Duplicate row of 'a' and add random number to each entry


## Linear Algebra in NumPy
With the arithmetic operations shown above, operators apply on an element-wise basis, so the arrays must have the same shape (or broadcastable shapes). NumPy also offers linear algebra operators to manipulate arrays as mathematical vectors or matrices. 

Vectors are 1D arrays and matrices are 2D arrays. This distinction is a convenience: a vector can also be stored as a 2D array with a single row or a single column, since a matrix generalizes the concept of a vector.
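
The difference between a 1D vector and a single-row or single-column matrix shows up in the shapes that the matrix product returns. A small sketch with an illustrative vector `v`:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])   # 1D array, shape (3,)
col = v.reshape(3, 1)           # column matrix, shape (3, 1)
row = v.reshape(1, 3)           # row matrix, shape (1, 3)

print(v @ v)               # 14.0, inner product of two 1D arrays
print((row @ col).shape)   # (1, 1), same value held in a 1x1 matrix
print((col @ row).shape)   # (3, 3), outer product
print(v.T.shape)           # (3,), transposing a 1D array changes nothing
```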

In [3]:
a = np.array([[1,2], [3,4]], float)
b = np.array([[2,0], [1,3]], float)
print(a)
print(b)


[[1. 2.]
 [3. 4.]]
[[2. 0.]
 [1. 3.]]


In [4]:
# Element-wise multiplication operator 
a * b # Dimensions of 'a' and 'b' must be the same


array([[ 2.,  0.],
       [ 3., 12.]])

In [5]:
# Matrix multiplication operator
np.matmul(a, b) # Number of columns of 'a' must be the same as number of rows of 'b'


array([[ 4.,  6.],
       [10., 12.]])

In [6]:
a @ b # Shortcut: Same as np.matmul(), but shorter to type...


array([[ 4.,  6.],
       [10., 12.]])

In [None]:
np.linalg.norm(a) - np.linalg.norm(b)


In [None]:
np.linalg.det(a) - np.linalg.det(b)


### Find least square solution to a linear regression problem $Ax = y$

**In a linear regression problem, we seek $x$ such that $y$ is equal to a linear combination of the column vectors of $A$**, with the entries of $x$ as coefficients (in Linear Algebra, the vector space spanned by the columns of $A$ is called the *column space of $A$*).

Assume dimension of $A$, $x$ and $y$ are respectively: $(m \times n)$, $(n \times 1)$, and $(m \times 1)$.

**If $Ax = y$ has an exact solution, then it can be solved as a simple, linear system of $m$ equations with $n$ unknowns**. 

**Else if $Ax = y$ has no solution (i.e., if it is not possible to find $x$ such that $Ax = y$), the best we can do is find $\hat{x}$ that makes $A\hat{x}$ as close as possible to $y$**. It is called a Least-Square problem because we seek $\hat{x}$ such that $||A\hat{x} - y||$ is as small as possible.

$||A\hat{x} - y||$ is called the least-square error. $\: \hat{x}$ is called a least-square solution.

The *Orthogonal Decomposition Theorem* says that any vector in $\mathbb{R}^m$ can be decomposed into the sum of a vector from a given subspace of $\mathbb{R}^m$ (here, the column space of $A$) and a vector orthogonal to that subspace. Thus $y = A\hat{x} + (y - A\hat{x})$, where the component vector $(y - A\hat{x})$ is the residual, whose norm is the least-square error, and is orthogonal to all columns of $A$ (i.e., orthogonal to the *column space of $A$*). The same is true of its negative $(A\hat{x} - y)$, which implies that the dot product of each column vector of $A$ with $(A\hat{x} - y)$ is $0$. 

Given $(A\hat{x} - y)$ is of dimension $(m \times 1)$, all these dot products are equivalent to a matrix multiplication between the transpose of $A$ and $(A\hat{x} - y)$. There are $n$ columns in $A$ so $A^T$ is of dimension $(n \times m)$ and the resulting null vector is of dimension $(n \times m)(m \times 1) =(n \times 1)$.

**Thus:** $$A^T(A\hat{x} - y) = 0$$

$$A^TA\hat{x} = A^Ty$$       

$$\hat{x} = (A^TA)^{-1} A^Ty$$

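The equation $A^TA\hat{x} = A^Ty$ is called the "normal equation". When the columns of $A$ are linearly independent, $A^TA$ is invertible and the last expression gives the unique solution for $\hat{x}$.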
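
The orthogonality condition behind the normal equation can be verified numerically. The sketch below uses a small made-up system (3 equations, 2 unknowns) and `np.linalg.solve` on $A^TA\hat{x} = A^Ty$ instead of forming the inverse explicitly; this is a substitution for illustration, not the method used in the cells of this notebook.

```python
import numpy as np

# Small hypothetical system with m = 3 equations and n = 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([[1.0], [2.0], [2.0]])

# Solve A^T A x = A^T y without forming the inverse explicitly.
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# The residual should be orthogonal to every column of A.
residual = A @ x_hat - y
print(np.allclose(A.T @ residual, 0))   # True

# The result should agree with NumPy's least-squares solver.
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.allclose(x_hat, x_lstsq))      # True
```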

<img src="./OrthogonalDec.png" width="500">


In [None]:
# The easiest is to think about it in 3D (picture above) so let me give an explanation in 3D:
# If m = 3 and n = 2, the column space of A is a 2D plane in a 3D space, formed by the 
# two column vectors of A. The vector y in 3D may or may not be in this plane. 
# Asking if Ax = y is asking if y is in this plane. 
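# The least-squares solution gives A x_hat = y exactly if y is in this plane; otherwise 
# A x_hat is the point of the column space of A that is closest to y. 
# In higher dimensions the concept is the same, but the column space of A is a subspace 
# of dimension at most n inside an m-dimensional space.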

# Below is an example with m = 4 observations. Often in practice m is much larger (m is the number of "observations" in practice).

A = np.matrix([[1, 1], 
               [1, 2],
               [2, 3],
               [2, 4]])

y = np.matrix([[1], 
               [3],
               [7],
               [9]])


**Projection of $y$ into column space of $A$ $=>$ Predict $y$ as a linear combination of columns of $A$:**

In [None]:
# Using an explicit linear algebra development:
AT = A.T # Transpose of A
ATA = AT @ A
ATAi = np.linalg.inv(ATA)
ATy = AT @ y
x = ATAi @ ATy
print('From matrix calculation: x = ({:.2f}, {:.2f})'.format(x[0,0],x[1,0]))


In [None]:
# Using the NumPy Least Square solver:
results = np.linalg.lstsq(A, y, rcond = None)
x = results[0] # Least-squares solution
r = results[1] # Sum of squared residuals (results[2] is the rank of A, results[3] its singular values)
print('From NumPy lstsq method: x = ({:.2f}, {:.2f})'.format(x[0,0],x[1,0]))


In [None]:
# The NumPy Least Square solver computes the residuals too:
print('Sum of squared residuals: {:.2f}'.format(np.asarray(r).item()))


In [None]:
# For each observation in A, here is what our linear regression model would predict:
y_hat = A @ x
print('Predicted values of y: ({:.2f},{:.2f},{:.2f},{:.2f})'
      .format(y_hat[0,0],y_hat[1,0],y_hat[2,0],y_hat[3,0]))


...which is a reasonable approximation of the original values of the vector $y$.

##  Statistical analysis of data
Analyze Data with Aggregation and Statistical Operations 

### Case Study

#### Statistical data analysis of clinical trial results stored on file 

**Source**: https://swcarpentry.github.io/python-novice-inflammation/02-numpy.html

A new drug, claimed to cure arthritis inflammation flare-ups within 3 weeks of first taking the medication, was tested in clinical trials, with key results stored in a CSV file.

The CSV file contains the number of inflammation *flare-ups* per day for the 60 patients enrolled in the clinical trial, which lasted 40 days. Each row corresponds to a patient, and each column corresponds to a day in the trial. Once patients have their first inflammation flare-up, they take the medication and wait a few weeks for it to take effect and reduce flare-ups.

To see how effective the treatment is, we would like to calculate the average inflammation per day across all patients. To assess risks, we would also like to know the maximum inflammation reached by each patient across all days.

The link above contains many more details than we will look at here.

#### In short:  
Given a dataset of clinical trial inflammation for 60 patients measured daily during 40 days:
- **Assess risks**: What is the maximum inflammation for each patient over all days?
- **Assess effectiveness**: What is the average inflammation over all patients for each day?

<img src="./aggregation.png">


In [None]:
data = np.loadtxt(fname='./MedicalData.csv', delimiter=',')
print(data.shape) # 60 patients (rows) tracked over 40 days (columns)


### Operation "over rows" vs. "over columns"

In [None]:
print("=== Maximum inflammation per patient ===")
np.max(data, axis=1) # Maximum "over the columns" returns 1 result "per row" = array of 60 values


In [None]:
np.max(data, axis=1).shape


In [None]:
print("=== Average inflammation per day ===")
np.mean(data, axis=0) # Mean "over the rows" returns 1 result "per column" = array of 40 values


In [None]:
np.mean(data, axis=0).shape


Below is another example (not asked in the problem above)

In [None]:
print("=== Average inflammation per patient ===")
np.mean(data, axis=1) # Mean "over the columns" returns 1 result "per row" = array of 60 values


In [None]:
np.mean(data, axis=1).shape
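
The meaning of `axis` is easier to check on an array small enough to compute by hand. The sketch below uses a made-up miniature dataset, not the clinical trial data.

```python
import numpy as np

# Hypothetical miniature dataset: 2 patients (rows) x 3 days (columns).
toy = np.array([[1.0, 3.0, 2.0],
                [0.0, 4.0, 5.0]])

print(np.max(toy, axis=1))    # [3. 5.]        one value per patient
print(np.mean(toy, axis=0))   # [0.5 3.5 3.5]  one value per day
print(np.mean(toy, axis=1))   # [2. 3.]        one value per patient
```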


### More examples of basic statistical data analysis

In [None]:
print("Average:",np.mean(data, axis=0))
print("Standard Deviation:", np.std(data, axis=0))
print("Min:", np.min(data, axis=0))
print("Max:", np.max(data, axis=0))
print("Sum:", np.sum(data, axis=0))
print("Argmax:", np.argmax(data, axis=1)) # argmax over days (=columns) returns, for each patient,
                                          # the index of the day (starting at 0) with maximum inflammation

In [None]:
print("Average over entire dataset: {:.1f}".format(np.mean(data)))
print("Standard Deviation over entire dataset: {:.1f}".format(np.std(data)))
print("Minimum over entire dataset: {:.1f}".format(np.min(data)))
print("Maximum over entire dataset: {:.1f}".format(np.max(data)))
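
As a final check, the difference between a maximum and the position of a maximum, and between per-axis and whole-array aggregation, can be seen on a made-up toy array:

```python
import numpy as np

toy = np.array([[1.0, 3.0, 2.0],
                [0.0, 4.0, 5.0]])

print(np.max(toy, axis=1))      # [3. 5.]  largest value in each row
print(np.argmax(toy, axis=1))   # [1 2]    column index of that value
print(np.max(toy))              # 5.0      no axis: whole array
print(np.argmax(toy))           # 5        index into the flattened array
```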
