DS 256 Data Science Programming, Fall 2024

Prof Eatai Roth

Class 5.1



# Numpy

Read more at [PDSH Ch 2](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)

Numpy introduces a new data type, the *array*. In many ways, the numpy array is like a list, and we'll see many similarities when it comes to indexing and slicing arrays. But there are some key differences that make arrays particularly useful for data analysis. First is a restriction to ensure homogeneity.

 - Numpy arrays typically hold numerical, boolean, or text data (or nested lists of such data, which become multi-dimensional arrays), and every element in an array shares a single data type, the array's ```dtype```.
 - Mixed numerical data (ints and floats) are upcast to the most general type present (ints mixed with floats all become floats) unless the data type is explicitly specified.
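
To make the upcasting rule concrete, here is a small sketch. The variable names (`ints`, `mixed`, `forced`, `text`) are just illustrative and not part of the lecture code.

```python
import numpy as np

ints = np.array([1, 2, 3])                  # all ints -> an integer dtype
mixed = np.array([1, 2, 4.1])               # one float -> the whole array is float64
forced = np.array([1, 2, 4.1], dtype=int)   # explicit dtype: 4.1 is truncated to 4
text = np.array([1, "two", 3.0])            # mixing in a string -> everything becomes text

# Inspect the dtype numpy chose for each array
print(ints.dtype, mixed.dtype, forced.dtype, text.dtype)
```

Checking ```.dtype``` like this is a quick way to see whether numpy silently converted your data.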

#### Creating a Numpy array

Let's create a generic Numpy array and some special arrays.

 - generic array
 - empty array
 - array of ones or zeros
 - array of all one value
 - array of regularly spaced values
 - array of random numbers

In [None]:
import numpy as np

In [None]:
A = np.array([1, 2, 10, 4.1, 11])
B = np.array([])
C = np.zeros(5)
D = np.ones(8)
E = np.full(20, 3)  # size of array, then fill value
F = np.arange(0, 100, 5)
G = np.random.randint(0, 10, 20)

display(A)
display(B)
display(C)
display(D)
display(E)
display(F)
display(G)

## Multi-dimensional arrays 

Similar to how lists can contain lists, arrays can have multiple dimensions.

Consider the matrix $X$:

$$ X =
\begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3}\\
x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3}\\
x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3}\\
\end{bmatrix}
$$

The dimensions of a matrix are $num\_rows \times num\_columns$, for the matrix above $3 \times 4$. The location of an element in the matrix is $(row\_index, column\_index)$.

For a 2-dimensional numpy array, we can treat the matrix as a list containing an individual list for each row.

$$ X =
\begin{bmatrix}
\begin{bmatrix}x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3}\end{bmatrix}\\
\begin{bmatrix}x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3}\end{bmatrix}\\
\begin{bmatrix}x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3}\end{bmatrix}\\
\end{bmatrix}
$$

 - To get the dimensions of an array, query the array attribute ```.shape```.
 - To index an element of an array, use ```X[row_idx, col_idx]``` (for higher-dimensional arrays, add one index per additional dimension).
 - All the same slicing that is performed on lists can be performed on arrays, but now along any axis!



In [None]:
'''create a 4 x 5 array of random integers'''
np.random.seed(42)
X = np.random.randint(1, 101, [4,5])

X

In [None]:
'''get the shape of the array'''


In [None]:
'''extract entries of an array'''


In [None]:
'''extract individual rows and columns'''


In [None]:
'''all the slicing like lists'''
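
The empty cells above are left for in-class work. One possible way to fill them in is sketched below, assuming the same seeded 4 x 5 array `X`. The result names (`entry`, `row`, `col`, `block`, `last_col`) are made up for illustration.

```python
import numpy as np

np.random.seed(42)
X = np.random.randint(1, 101, [4, 5])   # same 4 x 5 array as above

shape = X.shape        # (4, 5): 4 rows, 5 columns
entry = X[1, 3]        # a single element: row 1, column 3
row = X[2]             # an entire row (same as X[2, :])
col = X[:, 0]          # an entire column
block = X[1:3, ::2]    # rows 1 and 2, every other column
last_col = X[:, -1]    # negative indices work just like with lists

print(shape, entry, row, col, block, last_col, sep="\n")
```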


## Masking

A mask is a matrix of boolean values. You can either 1) use a mask as an index to an array or 2) multiply an array by a mask. 

 - As an index, the result will be a 1-D array of the values wherever the mask was True.
 - Multiplying by the mask, the result is an array of the same shape with 0 everywhere the mask is False and the original value where the mask is True.
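
Here is a tiny, hand-checkable sketch of both uses of a mask. The array `M` and the names `mask`, `picked`, and `zeroed` are illustrative only.

```python
import numpy as np

M = np.array([[1, 8, 3],
              [7, 2, 9]])
mask = M >= 5        # [[False, True, False], [True, False, True]]

picked = M[mask]     # mask as an index: 1-D array [8, 7, 9]
zeroed = M * mask    # mask as a multiplier: [[0, 8, 0], [7, 0, 9]]

print(picked)
print(zeroed)
```

Note that indexing loses the 2-D layout (values come out in row-by-row order), while multiplying keeps the original shape.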

In [None]:
Y = np.random.randint(0, 10, (10,10))
display(Y)

In [None]:
Ymask = Y>=5
Ymask*1

In [None]:
Y_maskindex = Y[Ymask]
display(Y_maskindex)

Y * Ymask

#### Challenge question

Create a 7 x 10 array of random integers. Extract an array of the first 3 elements of every other row.
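
One possible approach, sketched under the assumption that "every other row" starts from row 0. The names `W` and `subset` and the 0-99 value range are arbitrary choices.

```python
import numpy as np

W = np.random.randint(0, 100, (7, 10))   # 7 x 10 array of random integers

# step of 2 along the rows, first three columns
subset = W[::2, :3]                       # rows 0, 2, 4, 6 -> shape (4, 3)

print(W)
print(subset)
```

To start from row 1 instead, the row slice would be `1::2`.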

## Views vs Copies

When you slice an array, the resulting sub-array is a *view* into the main array. This is true even if you save the sub-array as a new variable. What does this mean?

You are not allocating new memory to save this view, so any change made to the sub-array is also made to the original array (and vice versa).

If we want to slice a sub-array and have it exist as an array independent of the original array, we must call ```.copy()``` on the slice.

While views might be confusing, they are incredibly useful for breaking up large data arrays to work with manageable chunks.
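
A minimal sketch of the difference, using a small array so the effect is easy to see. The names `base`, `view`, and `independent` are illustrative. `np.shares_memory` is a numpy helper that reports whether two arrays overlap in memory.

```python
import numpy as np

base = np.arange(12).reshape(3, 4)     # [[0..3], [4..7], [8..11]]

view = base[:2, :2]                    # plain slice -> a view into base
independent = base[:2, 2:].copy()      # explicit copy -> its own memory

view[0, 0] = 99                        # also changes base[0, 0]
independent[0, 0] = -1                 # base[0, 2] is untouched

print(base[0, 0], base[0, 2])                  # 99 2
print(np.shares_memory(base, view))            # True
print(np.shares_memory(base, independent))     # False
```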

### Demo Exercise

 - Let's create a 10 x 10 matrix of random integers from 1 to 5, call it Z.
 - Then let's extract the upper-right quadrant as a view and the lower-right quadrant as a copy, Z_tr and Z_br respectively.
 - Now, let's fill Z_tr with ones and Z_br with twos.

How do these changes affect the original array?

In [None]:
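'''creating a big matrix (10x10)'''
Z = np.random.randint(1, 6, (10, 10))   # integers 1-5 (upper bound is exclusive)
display(Z)

In [None]:
'''slice the top-right quadrant as a view'''
Z_tr = Z[:5, 5:]           # rows 0-4, columns 5-9: a view into Z
'''slice the bottom-right quadrant as a copy'''
Z_br = Z[5:, 5:].copy()    # rows 5-9, columns 5-9: independent of Z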


In [None]:
Z_tr.fill(1)
Z_br.fill(2)

display(Z_tr)
display(Z_br)

In [None]:
display(Z)

In [None]:
Z[:,7] = 0
Z

In [None]:
display(Z_tr) # remember this is a view
display(Z_br) # and this is a copy

In [None]:
Exam1 = Z[:,3]

## Math on arrays

The nicest thing about numpy arrays is that they have been optimized for performing vectorized math operations. What does that mean? A math operation can be applied to every element of an array without a loop, and these vectorized operations are MUCH MUCH MUCH faster.
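
As a small illustration, the sketch below computes the same result with an explicit loop and with a single vectorized expression. The names `values`, `squares_loop`, and `squares_vec` are made up for this example, and it only shows that the two agree; it does not measure speed.

```python
import numpy as np

values = np.arange(6)                  # [0, 1, 2, 3, 4, 5]

# Loop version: visit each element one at a time
squares_loop = np.empty_like(values)
for i in range(len(values)):
    squares_loop[i] = values[i] ** 2

# Vectorized version: one expression applied to the whole array
squares_vec = values ** 2

print(squares_vec)                                   # [ 0  1  4  9 16 25]
print(np.array_equal(squares_loop, squares_vec))     # True
```

In a notebook, wrapping each version in `%timeit` on a much larger array is one way to see the speed difference for yourself.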

In [None]:
A = np.random.randint(0,10, [10, 5])
display(A)

In [None]:
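display(np.sin(A*np.pi/3))  # element-wise sine; a notebook only auto-displays the last line, so display explicitly
display(A**2)               # element-wise square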

And we can perform operations that aggregate results over a column or row (e.g. sum, mean, min, max).

In [None]:
display(A.mean())   # mean of all the values in A
display(A.mean(0)) # mean of every column
display(A.mean(1)) # mean of every row
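
To make the axis convention concrete, here is a hand-checkable sketch on a 2 x 3 array. The array `T` and the result names are illustrative. Passing `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

T = np.array([[1, 2, 3],
              [4, 5, 6]])

overall_mean = T.mean()        # 3.5, over all six values
col_means = T.mean(axis=0)     # [2.5, 3.5, 4.5], one per column
row_sums = T.sum(axis=1)       # [6, 15], one per row
col_max = T.max(axis=0)        # [4, 5, 6]
row_min = T.min(axis=1)        # [1, 4]

print(overall_mean, col_means, row_sums, col_max, row_min)
```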