DS 256 Data Science Programming, Fall 2024

Prof Eatai Roth

Class 5.1

Today, we'll introduce the math package Numpy (PDSH Ch. 2).

## The Numpy array

Numpy introduces a new data type, the *array*. In many ways, the numpy array is like a list, and we'll see many similarities when it comes to indexing and slicing arrays. But there are some key differences that make arrays particularly useful for data analysis. First is a restriction to ensure homogeneity.

 - Numpy arrays may only contain numerical or text data or nested arrays (lists) of numerical or text data, and all data must be of the same type.
 - Mixed numerical data (ints and floats) are up-typed to the most permissive type (an int stored alongside a float becomes a float) unless the data type is explicitly specified.
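As a quick illustration of the homogeneity rule, the sketch below builds three small arrays and inspects their `dtype`. The variable names (`ints`, `mixed`, `forced`) are made up for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])                  # all ints -> an integer array
mixed = np.array([1, 2.5, 3])               # an int next to a float -> everything becomes float
forced = np.array([1, 2.5, 3], dtype=int)   # explicit dtype: 2.5 is truncated to 2

print(ints.dtype)    # an integer dtype (exact width depends on the platform)
print(mixed.dtype)   # float64
print(forced)        # [1 2 3]
```

Note that forcing `dtype=int` silently drops the fractional part, so it is worth checking `dtype` whenever the values look off.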

#### Creating a Numpy array

Let's create a generic Numpy array and some special arrays.

 - generic array
 - empty array
 - array of ones or zeros
 - array of all one value
 - array of regularly spaced values
 - array of random numbers

In [1]:
import numpy as np

In [16]:
A = np.array([1, 5, 8, 1, 20])
B = np.array([])
C = np.ones(6)
D = np.zeros(10)
E = np.full(9, 'hi')
F = np.arange(0, 100, 5)   # start, stop (not including), step size
G = np.random.randint(0, 10, 20)   # range lower bound, range upper bound (not including), shape of array


display(A)
display(B)
display(C)
display(D)
display(E)
display(F)
display(G)





array([ 1,  5,  8,  1, 20])

array([], dtype=float64)

array([1., 1., 1., 1., 1., 1.])

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

array(['hi', 'hi', 'hi', 'hi', 'hi', 'hi', 'hi', 'hi', 'hi'], dtype='<U2')

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
       85, 90, 95])

array([8, 4, 5, 3, 7, 8, 5, 6, 8, 6, 5, 7, 9, 0, 5, 4, 6, 9, 8, 2])

## Multi-dimensional arrays 

Similar to how lists can contain lists, arrays can have multiple dimensions.

Consider the matrix $X$:

$$ X =
\begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3}\\
x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3}\\
x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3}\\
\end{bmatrix}
$$

The dimensions of a matrix are $num\_rows \times num\_columns$, for the matrix above $3 \times 4$. The location of an element in the matrix is $(row\_index, column\_index)$.

For a 2-dimensional numpy array, we can treat the matrix as a list containing an individual list for each row.

$$ X =
\begin{bmatrix}
\begin{bmatrix}x_{0,0} & x_{0,1} & x_{0,2} & x_{0,3}\end{bmatrix}\\
\begin{bmatrix}x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3}\end{bmatrix}\\
\begin{bmatrix}x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3}\end{bmatrix}\\
\end{bmatrix}
$$

 - To get the dimensions of an array, we can query the array property ```.shape```.
 - To index an element in an array, use X[row_idx, col_idx] (and for higher dimensional arrays, just keep adding indices).
 - All the same slicing that is performed on lists can be performed on arrays, but now in any direction!
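To make "slicing in any direction" concrete, here is a small sketch comparing a nested list with the equivalent array. The names `rows` and `arr` are just for illustration.

```python
import numpy as np

rows = [[1, 2, 3], [4, 5, 6]]   # a list containing one list per row
arr = np.array(rows)            # the same data as a 2 x 3 array

print(arr.shape)    # (2, 3)
print(arr[1, 2])    # 6, the element at row index 1, column index 2

# Pulling a column out of a nested list takes a loop (or comprehension)...
col_from_list = [r[1] for r in rows]
# ...while an array can be sliced along the column direction directly.
col_from_arr = arr[:, 1]

print(col_from_list)   # [2, 5]
print(col_from_arr)    # [2 5]
```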



In [17]:
'''create a generic multi-dimensional array, say 2x3'''

Xmulti = np.array([[1,2,3], [4,5,6]])
display(Xmulti)

array([[1, 2, 3],
       [4, 5, 6]])

In [18]:
'''create a 4 x 5 array of random integers'''

X = np.random.randint(1, 101, [4,5])
X

array([[72, 29, 78, 75, 33],
       [95, 83, 58,  5, 95],
       [80, 40, 43, 88, 25],
       [22, 31,  1, 17, 61]])

In [19]:
'''get the shape of the array'''
X.shape

(4, 5)

In [21]:
'''extract entries of an array'''
X[-1,-3]

1

In [26]:
'''extract individual rows and columns'''
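X[2,:]   # the row at index 2 (the third row)
X[:,3]  # the column at index 3 (the fourth column)
X[:,1::2]   # every other column, starting at index 1 (only this last result is displayed)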


array([[29, 75],
       [83,  5],
       [40, 88],
       [31, 17]])

In [None]:
'''all the slicing like lists'''
X[:,-3:]  # last 3 columns


## Masking

A mask is a matrix of boolean values. You can either 1) use a mask as an index to an array or 2) multiply an array by a mask. 

 - As an index, the result will be a 1-D array of the values wherever the mask was True.
 - Multiplying by the mask, the result is an array of the same shape with 0 everywhere the mask is False and the original value where the mask is True.
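The two uses of a mask can be sketched on a small, fixed array so the results are easy to check by eye. The names `small` and `even_mask` are made up for this example.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])
even_mask = small % 2 == 0    # boolean array, True where the value is even

selected = small[even_mask]   # mask as an index -> 1-D array of the True entries
zeroed = small * even_mask    # mask as a multiplier -> same shape, 0 where False

print(selected)   # [2 4 6]
print(zeroed)
# [[0 2 0]
#  [4 0 6]]
```

Indexing with a mask reads the array row by row, which is why the selected values come out in the order 2, 4, 6.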

In [31]:
Y = np.random.randint(1, 101, [5,8])
Y_mask = Y%2==0

display(Y)
display(Y_mask*1)

array([[ 13,  74,  45,  69,  24,  12,  81,  56],
       [ 37,   2,  26,  34,  93,  79, 100,  42],
       [  2,  53,  49,  36,  85,  41,  54, 100],
       [ 62,  64,  21,  86,  97, 100,  73,  68],
       [ 34,  74,  86,  25,  34,  86,  91,  71]])

array([[0, 1, 0, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 0, 0, 1, 1],
       [1, 0, 0, 1, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 1, 0, 1],
       [1, 1, 1, 0, 1, 1, 0, 0]])

In [33]:
Y[Y_mask]  # using a mask as an index
Y*Y_mask   # multiplying by a mask

array([[  0,  74,   0,   0,  24,  12,   0,  56],
       [  0,   2,  26,  34,   0,   0, 100,  42],
       [  2,   0,   0,  36,   0,   0,  54, 100],
       [ 62,  64,   0,  86,   0, 100,   0,  68],
       [ 34,  74,  86,   0,  34,  86,   0,   0]])

#### Challenge question

Create a 7 x 10 array of random integers. Extract an array of the first 3 elements of every other row.

In [37]:
C = np.random.randint(1,101, [7,10])
display(C)
Csub = C[::2, 0:3]   # every other row (0, 2, 4, 6), first 3 columns

Exam1 = C[:, 3]   # the column at index 3 (the fourth column)
display(Exam1)

array([[43, 44, 33, 71, 42, 15, 51, 61, 34, 54],
       [52, 62, 43, 77, 29, 99, 41, 29, 40, 25],
       [69, 95, 61, 46, 43, 54, 61, 80,  9,  4],
       [53, 78, 16, 89, 72,  6, 68,  8, 85, 19],
       [76, 44, 35, 60, 86, 41, 62, 74, 88, 87],
       [82, 82,  5, 69, 99, 43, 84, 40, 10,  5],
       [91, 71, 69, 93, 54,  9, 54, 65, 85, 24]])

array([71, 77, 46, 89, 60, 69, 93])

## Views vs Copies

When you slice an array, the resulting sub-array is a *view* into the main array. This is true even if you save the sub-array as a new variable. What does this mean?

You are not allocating new memory to save this view, so any change made to the sub-array is made to the original array.

If we want to slice a sub-array and have it exist as an array independent of the original array, we must call ```.copy()``` on the slice.

While views might be confusing, they are incredibly useful for breaking up large data arrays to work with manageable chunks.
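A minimal sketch of the difference, using a tiny 1-D array so the effect is easy to see. The names `base`, `view`, and `copy` are illustrative; `np.shares_memory` is one way to check whether two arrays overlap in memory.

```python
import numpy as np

base = np.arange(6)        # [0 1 2 3 4 5]
view = base[:3]            # a slice: a view into base
copy = base[:3].copy()     # an independent copy of the same slice

view[0] = 99               # writes through to base
copy[1] = -1               # changes only the copy

print(base)   # [99  1  2  3  4  5]
print(view)   # [99  1  2]
print(copy)   # [ 0 -1  2]

print(np.shares_memory(base, view))   # True
print(np.shares_memory(base, copy))   # False
```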

### Demo Exercise

 - Let's create a 10 x 10 matrix of random integers from 1 to 100, call it Z.
 - Then let's extract the upper-right quadrant as a view and the lower-right quadrant as a copy, Z_tr and Z_br respectively.
 - Now, let's fill Z_tr with ones and Z_br with twos.

How do these changes affect the original array?

In [38]:
'''creating a big matrix'''
Z = np.random.randint(1, 101, [10, 10])
display(Z)

array([[ 95,   5,  72,  75,  76,  12,  25,  29,  10,  40],
       [ 61,  56,  81, 100,  28,   2,  43,  50,  96,  60],
       [  4,  55,  66,  82,  90,  56,  83,  27,   8,  98],
       [ 62,  30,  80,  62,  42,  79,  54,  69,  34,  78],
       [ 74,  44,  55,  79,  94,  45,  94,  52,  82,  90],
       [ 20,  42,  96,  15,  67,  91,  86,  29,  17,  47],
       [  5,  58,  65,  37,  70,  77,  35,   5,  84,  82],
       [  5,  67,  41,  58,   6,  51,  51,  25,  99,  78],
       [  1,  20,  49,  41,  98,  72,  49,  28,  15,  57],
       [ 41,  18,  92,  92,  74, 100,  45,  24,  73,  24]])

In [40]:
Z_tr = Z[:5,-5:]   # slice Z as a view
Z_br = Z[-5:, -5:].copy()   # slice Z as a copy

display(Z_tr)
display(Z_br)

array([[12, 25, 29, 10, 40],
       [ 2, 43, 50, 96, 60],
       [56, 83, 27,  8, 98],
       [79, 54, 69, 34, 78],
       [45, 94, 52, 82, 90]])

array([[ 91,  86,  29,  17,  47],
       [ 77,  35,   5,  84,  82],
       [ 51,  51,  25,  99,  78],
       [ 72,  49,  28,  15,  57],
       [100,  45,  24,  73,  24]])

In [41]:
Z_tr.fill(1)
Z_br.fill(2)

display(Z_tr)
display(Z_br)

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

array([[2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2]])

In [42]:
Z

array([[ 95,   5,  72,  75,  76,   1,   1,   1,   1,   1],
       [ 61,  56,  81, 100,  28,   1,   1,   1,   1,   1],
       [  4,  55,  66,  82,  90,   1,   1,   1,   1,   1],
       [ 62,  30,  80,  62,  42,   1,   1,   1,   1,   1],
       [ 74,  44,  55,  79,  94,   1,   1,   1,   1,   1],
       [ 20,  42,  96,  15,  67,  91,  86,  29,  17,  47],
       [  5,  58,  65,  37,  70,  77,  35,   5,  84,  82],
       [  5,  67,  41,  58,   6,  51,  51,  25,  99,  78],
       [  1,  20,  49,  41,  98,  72,  49,  28,  15,  57],
       [ 41,  18,  92,  92,  74, 100,  45,  24,  73,  24]])

In [43]:
Z[:,7] = 0
Z

array([[ 95,   5,  72,  75,  76,   1,   1,   0,   1,   1],
       [ 61,  56,  81, 100,  28,   1,   1,   0,   1,   1],
       [  4,  55,  66,  82,  90,   1,   1,   0,   1,   1],
       [ 62,  30,  80,  62,  42,   1,   1,   0,   1,   1],
       [ 74,  44,  55,  79,  94,   1,   1,   0,   1,   1],
       [ 20,  42,  96,  15,  67,  91,  86,   0,  17,  47],
       [  5,  58,  65,  37,  70,  77,  35,   0,  84,  82],
       [  5,  67,  41,  58,   6,  51,  51,   0,  99,  78],
       [  1,  20,  49,  41,  98,  72,  49,   0,  15,  57],
       [ 41,  18,  92,  92,  74, 100,  45,   0,  73,  24]])

In [44]:
display(Z_tr)
display(Z_br)

array([[1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1]])

array([[2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2]])

## Math on arrays

The nicest thing about numpy arrays is that they have been optimized for performing vectorized math operations. What does that mean? A math operation can be applied to every element of an array without writing a loop, and these vectorized operations are MUCH MUCH MUCH faster than the equivalent Python loop.
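One rough way to see the speed difference is to time the same computation both ways. This is only a sketch: the array size and the names `values`, `looped`, and `vectorized` are arbitrary choices, and the actual timings will vary by machine, so none are quoted here.

```python
import time
import numpy as np

values = np.arange(200_000, dtype=np.float64)

# Square every element with an explicit Python loop (list comprehension).
start = time.perf_counter()
looped = [v**2 for v in values]
loop_seconds = time.perf_counter() - start

# Square every element with one vectorized operation.
start = time.perf_counter()
vectorized = values**2
vector_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
print(np.allclose(looped, vectorized))   # both approaches give the same numbers
```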

In [47]:
Z**2
np.sin(Z*np.pi/8)

array([[-3.82683432e-01,  9.23879533e-01,  1.10218212e-15,
        -9.23879533e-01, -1.00000000e+00,  3.82683432e-01,
         3.82683432e-01,  0.00000000e+00,  3.82683432e-01,
         3.82683432e-01],
       [-9.23879533e-01,  8.57252759e-16,  3.82683432e-01,
         1.00000000e+00, -1.00000000e+00,  3.82683432e-01,
         3.82683432e-01,  0.00000000e+00,  3.82683432e-01,
         3.82683432e-01],
       [ 1.00000000e+00,  3.82683432e-01,  7.07106781e-01,
         7.07106781e-01, -7.07106781e-01,  3.82683432e-01,
         3.82683432e-01,  0.00000000e+00,  3.82683432e-01,
         3.82683432e-01],
       [-7.07106781e-01, -7.07106781e-01, -1.22464680e-15,
        -7.07106781e-01, -7.07106781e-01,  3.82683432e-01,
         3.82683432e-01,  0.00000000e+00,  3.82683432e-01,
         3.82683432e-01],
       [-7.07106781e-01, -1.00000000e+00,  3.82683432e-01,
        -3.82683432e-01, -7.07106781e-01,  3.82683432e-01,
         3.82683432e-01,  0.00000000e+00,  3.82683432e-01,
         3.

And we can perform operations that aggregate results over a column or row (e.g. sum, mean, min, max).
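The `axis` argument decides which direction gets collapsed. A small sketch on a fixed 2 x 3 array (the name `M` is made up for this example):

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

print(M.sum())          # 21, over all values
print(M.sum(axis=0))    # [5 7 9], down each column (one result per column)
print(M.sum(axis=1))    # [ 6 15], across each row (one result per row)
print(M.mean(axis=0))   # [2.5 3.5 4.5]
print(M.max(axis=1))    # [3 6]
```

A useful way to remember it: the axis you name is the one that disappears from the shape, so `axis=0` on a 2 x 3 array leaves 3 values.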

In [51]:
Z = np.random.randint(1, 101, [5,8])
display(Z)

array([[45, 22, 26, 99,  3, 60, 58, 88],
       [58, 80, 59, 84, 23, 81, 73,  3],
       [66, 14, 68, 87, 50, 49, 16, 38],
       [51, 35, 40, 50, 26,  8, 22,  5],
       [46, 36, 88,  3, 79, 40, 30, 72]])

In [56]:
Z.mean()   # mean of all values
Z.mean(0)  # average down (average per column)
Z.min(1)  # minimum across (minimum per row)

array([ 3,  3, 14,  5,  3])

### Importing data

We can import data into a numpy array using ```np.loadtxt()```. Let's look at that documentation.
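To try the main `np.loadtxt()` arguments without a file on disk, the sketch below reads from an in-memory string. The column names and values here are made up for illustration and are not taken from the class dataset.

```python
import io
import numpy as np

# Hypothetical CSV content: one header row, then one row per record.
csv_text = "date,lat,lon\n2020-01-01,34.0,-118.2\n2020-02-15,37.8,-122.4\n"

# Text column: read as strings, skip the header, keep only column 0.
dates = np.loadtxt(io.StringIO(csv_text), dtype=str, delimiter=',',
                   skiprows=1, usecols=[0])

# Numeric columns: the default dtype is float, keep columns 1 and 2.
coords = np.loadtxt(io.StringIO(csv_text), delimiter=',',
                    skiprows=1, usecols=[1, 2])

print(dates)          # ['2020-01-01' '2020-02-15']
print(coords.shape)   # (2, 2)
```

Reading the text and numeric columns in separate calls keeps each array homogeneous, which matches the restriction on array types described earlier.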

In [59]:
eq_dates = np.loadtxt('Data/CAearthquakes.csv', dtype = str, skiprows = 1, delimiter = ',', usecols = [0])
eq = np.loadtxt('Data/CAearthquakes.csv', skiprows = 1, delimiter = ',', usecols = [1,2,4])

### Plotting

We won't go into detail plotting today.

In [None]:
import matplotlib.pyplot as plt


fig, ax = plt.subplots(1,1,figsize=(6,6))

xdata = 0
ydata = 0
ax.plot(xdata, ydata, '.')

plt.show()

### Recreating the figure from the Forum

Dates make things tricky.

In [None]:
import matplotlib.dates as mdates

In [None]:
eq_datenums = mdates.datestr2num(eq_dates)
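mag = eq[:, 2]   # assumption: magnitude is the last loaded column (file column 4); adjust the index if your file differs
marker_size = 2**np.floor(mag)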

fig, ax = plt.subplots(figsize = (10, 5))

ax.scatter(eq_datenums, mag, s = marker_size, alpha = 0.5)

ax.xaxis.set_major_locator(mdates.YearLocator(10))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

ax.set_xlabel('year')
ax.set_ylabel('magnitude')
plt.show()

