# Lecture 1: Data Processing with NumPy

Numerical Python, or "NumPy" for short, is a foundational package on which many of the most common data science packages are built.  NumPy provides high-performance multi-dimensional arrays which we can use as vectors or matrices.

The key features of numpy are:

- ndarrays: n-dimensional arrays of the same data type which are fast and space-efficient.  There are a number of built-in methods for ndarrays which allow for rapid processing of data without using loops (e.g., compute the mean).
- Broadcasting: a set of rules which defines how arithmetic behaves between multi-dimensional arrays of different shapes.
- Vectorization: applies numeric operations to entire ndarrays at once, without explicit Python loops.
- Input/Output: simplifies reading and writing of data from/to file.
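
To make the idea of vectorization concrete, here is a minimal sketch comparing an explicit loop with the equivalent whole-array expression. The array `values` and the variable names are illustrative only.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: square each element one at a time in Python
squares_loop = np.array([v * v for v in values])

# Vectorized version: one expression applied to the whole array
squares_vec = values * values

print(squares_loop)
print(squares_vec)
print(values.mean())   # built-in method, no loop needed
```

Both versions produce the same numbers; the vectorized form is shorter and pushes the loop into NumPy's compiled code.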

<b>Additional Recommended Resources:</b><br>
[Numpy Documentation](https://numpy.org/doc/stable/reference/)

[Python for Data Analysis](https://wesmckinney.com/book/) by Wes McKinney

[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas



## Getting started with ndarray

**ndarrays** are time and space-efficient multidimensional arrays at the core of numpy.  Let's get started by creating ndarrays using the numpy package.

**How to create Rank 1 numpy arrays:**

In [10]:
import numpy as np

an_array = np.array([3, 33, 333])  # Create a rank 1 array

print(type(an_array))              # The type of an ndarray is: "<class 'numpy.ndarray'>"

<class 'numpy.ndarray'>


In [None]:
# test the shape of the array we just created, it should have just one dimension (Rank 1)
an_array.shape

(3,)

In [None]:
# because this is a rank 1 array, we need only one index to access each element
print(an_array[0], an_array[1], an_array[2])

3 33 333


In [None]:
# ndarrays are mutable, here we change an element of the array
an_array[0] = 11
an_array

array([ 11,  33, 333])

**How to create a Rank 2 numpy array:**

A rank 2 **ndarray** is one with two dimensions.  Notice the format below of [ [row] , [row] ].  Two-dimensional arrays are a natural way to represent matrices, which are often useful in data science.

In [None]:
another = np.array([[11,12,13],[21,22,23]])   # Create a rank 2 array

print(another)  # print the array

print("The shape is 2 rows, 3 columns: ", another.shape)  # rows x columns

print("Accessing elements [0,0], [0,1], and [1,0] of the ndarray: ", another[0, 0], ", ",another[0, 1],", ", another[1, 0])

[[11 12 13]
 [21 22 23]]
The shape is 2 rows, 3 columns:  (2, 3)
Accessing elements [0,0], [0,1], and [1,0] of the ndarray:  11 ,  12 ,  21


**There are many ways to create numpy arrays:**

Here we create a number of arrays with different shapes and different pre-filled values.  numpy has a number of built-in functions which help us quickly and easily create multidimensional arrays.

In [None]:
import numpy as np

# create a 2x2 array of zeros
ex1=np.zeros((2,2))
print(ex1)

[[0. 0.]
 [0. 0.]]


In [None]:
# create a 2x2 array filled with 9.0
ex2 = np.full((2,2), 9.0)
ex2

array([[9., 9.],
       [9., 9.]])

In [None]:
# create a 2x2 matrix with the diagonal 1s and the others 0 (identity matrix)
ex3 = np.eye(2,2)
ex3

array([[1., 0.],
       [0., 1.]])

In [None]:
# create an array of ones
ex4 = np.ones((1,2))
ex4

array([[1., 1.]])

In [None]:
# notice that the above ndarray (ex4) is actually rank 2, it is a 1x2 array
print(ex4.shape)

# which means we need to use two indexes to access an element
ex4[0,1]

(1, 2)


1.0

In [None]:
# create an array of random floats between 0 and 1
ex5 = np.random.random((2,2))
ex5

array([[0.65689757, 0.61732313],
       [0.34633524, 0.74876935]])
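
A few other constructors are often handy alongside the ones above. The sketch below is illustrative only (the variable names are made up for this example) and uses `np.arange`, `np.linspace`, and `np.zeros_like`.

```python
import numpy as np

evens = np.arange(0, 10, 2)          # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)      # 5 evenly spaced points, endpoints included
template = np.array([[1, 2, 3], [4, 5, 6]])
blank = np.zeros_like(template)      # zeros with the same shape and dtype as template

print(evens)
print(grid)
print(blank.shape, blank.dtype == template.dtype)
```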

## Array Indexing

**Slice indexing:**

Similar to the use of slice indexing with lists and strings, we can use slice indexing to pull out sub-regions of ndarrays.

In [None]:
import numpy as np

# Rank 2 array of shape (3, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
print(an_array)
print(an_array.shape)

[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]
(3, 4)


Use array slicing to get a subarray consisting of the first 2 rows x 2 columns.

In [None]:
a_slice = an_array[0:2,0:2]
a_slice

array([[11, 12],
       [21, 22]])

When you modify a slice, you actually modify the underlying array, because the slice is a view of the same data rather than a copy.

In [None]:
print("Before:", an_array[0, 0])   # inspect the element at 0, 0
a_slice[0, 0] = 1000    # a_slice[0, 0] is the same piece of data as an_array[0, 0]
print("After:", an_array[0, 0])

Before: 11
After: 1000
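
If you need a sub-array that can be changed without touching the original, one option is to take an explicit copy. The following is a small sketch with illustrative names (`base`, `view`, `independent`); `np.shares_memory` is used only to show which arrays share data.

```python
import numpy as np

base = np.array([[11, 12, 13], [21, 22, 23]])

view = base[0:2, 0:2]                  # a slice shares memory with base
independent = base[0:2, 0:2].copy()    # an explicit copy does not

independent[0, 0] = -1                 # base is untouched
view[0, 1] = 999                       # base[0, 1] changes too

print(base)
print(np.shares_memory(base, view), np.shares_memory(base, independent))
```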


**Use both integer indexing & slice indexing**

We can use combinations of integer indexing and slice indexing to create different shaped matrices.

In [None]:
# Create a Rank 2 array of shape (3, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
an_array

array([[11, 12, 13, 14],
       [21, 22, 23, 24],
       [31, 32, 33, 34]])

In [None]:
# Using both integer indexing & slicing generates an array of lower rank
row_rank1 = an_array[1, :]    # Rank 1 view
print(row_rank1, row_rank1.shape)  # notice only a single []

[21 22 23 24] (4,)


In [None]:
# Slicing alone generates an array of the same rank as an_array
row_rank2 = an_array[1:2, :]      # Rank 2 view
print(row_rank2, row_rank2.shape)   # Notice the [[ ]]

[[21 22 23 24]] (1, 4)


In [None]:
#We can do the same thing for columns of an array:

col_rank1 = an_array[:, 1]
col_rank2 = an_array[:, 1:2]

print(col_rank1, col_rank1.shape)  # Rank 1
print()
print(col_rank2, col_rank2.shape)  # Rank 2

[12 22 32] (3,)

[[12]
 [22]
 [32]] (3, 1)
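
When an integer index has already dropped a dimension, `np.newaxis` can add it back. This is a minimal sketch with an illustrative array `m`; it shows that the result matches what a pure slice would have returned.

```python
import numpy as np

m = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])

row = m[1, :]                        # integer index drops a dimension
row_as_matrix = row[np.newaxis, :]   # add it back as a leading axis

print(row.ndim, row.shape)
print(row_as_matrix.ndim, row_as_matrix.shape)
print(np.array_equal(row_as_matrix, m[1:2, :]))
```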


**Array Indexing for changing elements**

Sometimes it's useful to use an array of indexes to access or change elements.

In [None]:
# Create a new array
an_array = np.array([[11,12,13], [21,22,23], [31,32,33], [41,42,43]])

print('Original Array:')
print(an_array)

Original Array:
[[11 12 13]
 [21 22 23]
 [31 32 33]
 [41 42 43]]


In [None]:
# Create an array of indices
col_indices = np.array([0, 1, 2, 0])
print('\nCol indices picked : ', col_indices)

row_indices = np.arange(4)
print('\nRows indices picked : ', row_indices)


Col indices picked :  [0 1 2 0]

Rows indices picked :  [0 1 2 3]


In [None]:
# Examine the pairings of row_indices and col_indices.  These are the elements we'll change next.
for row,col in zip(row_indices,col_indices):
    print(row, ", ",col)

0 ,  0
1 ,  1
2 ,  2
3 ,  0


In [None]:
# Select one element from each row
print('Values in the array at those indices: ',an_array[row_indices, col_indices])

Values in the array at those indices:  [11 22 33 41]


In [None]:
# Change one element from each row using the indices selected
an_array[row_indices, col_indices] += 100

print('\nChanged Array:')
print(an_array)


Changed Array:
[[111  12  13]
 [ 21 122  23]
 [ 31  32 133]
 [141  42  43]]
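
One detail worth keeping in mind: unlike slicing, reading with an array of indices returns a new array, while assigning through the same index still writes into the original. The sketch below uses illustrative names (`grid`, `rows`, `cols`, `picked`) to show both behaviours.

```python
import numpy as np

grid = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])
rows = np.array([0, 1, 2])
cols = np.array([2, 1, 0])

picked = grid[rows, cols]      # integer-array indexing returns a new array
picked[0] = -1                 # does not touch grid
print(picked)
print(grid[0, 2])              # still 13

grid[rows, cols] = 0           # assigning through the index does modify grid
print(grid)
```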


## Boolean Indexing

**Boolean indexing for selecting and changing elements:**

In [None]:
# create a 3x2 array
an_array = np.array([[11,12], [21, 22], [31, 32]])
an_array

array([[11, 12],
       [21, 22],
       [31, 32]])

In [None]:
# create a filter which will be boolean values for whether each element meets this condition
filter = an_array > 15
filter

array([[False, False],
       [ True,  True],
       [ True,  True]])

Notice that the filter is an ndarray of the same shape as an_array, filled with True where the corresponding element of an_array is greater than 15 and False where it is not.

In [None]:
# we can now select just those elements which meet that criterion
# Since a boolean mask can have a different number of True values in each row or column,
# the indexing can't preserve shape, so it has to return a 1-D result.
an_array[filter]

array([21, 22, 31, 32])

In [None]:
# For short, we could have just used the approach below without the need for the separate filter array.
an_array[an_array > 15]

array([21, 22, 31, 32])

What is particularly useful is that we can actually change elements in the array applying a similar logical filter.  Let's add 100 to all the even values.

In [None]:
an_array[an_array % 2 == 0]

array([12, 22, 32])

In [None]:
an_array[an_array % 2 == 0] +=100
an_array

array([[ 11, 112],
       [ 21, 122],
       [ 31, 132]])
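
Conditions can also be combined. The sketch below, using an illustrative array `data`, relies on the element-wise operators `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import numpy as np

data = np.array([[11, 12], [21, 22], [31, 32]])

# Parentheses are required around each comparison
mid_range = (data > 15) & (data < 30)
outside = (data < 15) | (data > 30)

print(data[mid_range])
print(data[outside])
print(data[~mid_range])   # ~ negates a boolean mask
```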

## Datatypes and Array Operations

**Datatypes:**

In [None]:
ex1 = np.array([11, 12]) # NumPy infers the data type (the default integer size depends on the platform)
ex1.dtype

dtype('int32')

In [None]:
ex2 = np.array([11.0, 12.0]) # NumPy infers the data type
ex2.dtype

dtype('float64')

In [None]:
ex3 = np.array([11, 21], dtype=np.int64) # You can also tell NumPy the data type
ex3.dtype

dtype('int64')

In [None]:
# you can use this to force floats into integers (the fractional part is truncated)
ex4 = np.array([11.1,12.7], dtype=np.int64)
print(ex4.dtype)
print()
ex4

int64



array([11, 12], dtype=int64)

In [None]:
# you can use this to force integers into floats if you anticipate
# the values may change to floats later
ex5 = np.array([11, 21], dtype=np.float64)
print(ex5.dtype)
print()
ex5

float64



array([11., 21.])
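
Converting an existing array is usually done with `astype`. The sketch below (illustrative values only) also shows that the conversion truncates toward zero, which differs from `np.floor` for negative numbers.

```python
import numpy as np

values = np.array([-1.7, -0.2, 1.7, 2.9])

truncated = values.astype(np.int64)          # drops the fractional part
floored = np.floor(values).astype(np.int64)  # rounds down first, then converts

print(truncated)
print(floored)
print(values.dtype, truncated.dtype)         # astype returns a new array
```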

**Arithmetic Array Operations:**

In [None]:
x = np.array([[111,112],[121,122]], dtype=np.int64)
y = np.array([[211.1,212.1],[221.1,222.1]], dtype=np.float64)

print(x)
print()
print(y)

[[111 112]
 [121 122]]

[[211.1 212.1]
 [221.1 222.1]]


In [None]:
# add
print(x + y)         # The plus sign works
print()
print(np.add(x, y))  # so does the numpy function "add"

[[322.1 324.1]
 [342.1 344.1]]

[[322.1 324.1]
 [342.1 344.1]]


In [None]:
# subtract
print(x - y)
print()
print(np.subtract(x, y))

[[-100.1 -100.1]
 [-100.1 -100.1]]

[[-100.1 -100.1]
 [-100.1 -100.1]]


In [None]:
# multiply
print(x * y)
print()
print(np.multiply(x, y))

[[23432.1 23755.2]
 [26753.1 27096.2]]

[[23432.1 23755.2]
 [26753.1 27096.2]]


In [None]:
# divide
print(x / y)
print()
print(np.divide(x, y))

[[0.52581715 0.52805281]
 [0.54726368 0.54930212]]

[[0.52581715 0.52805281]
 [0.54726368 0.54930212]]


In [None]:
# square root
print(np.sqrt(x))

[[10.53565375 10.58300524]
 [11.         11.04536102]]


In [None]:
# exponent (e ** x)
print(np.exp(x))

[[1.60948707e+48 4.37503945e+48]
 [3.54513118e+52 9.63666567e+52]]
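
The results above are floats even though `x` holds integers, because NumPy promotes mixed operands to a type that can represent both. A small sketch with illustrative arrays `ints` and `floats`:

```python
import numpy as np

ints = np.array([[1, 2], [3, 4]], dtype=np.int64)
floats = np.array([[0.5, 0.5], [0.5, 0.5]], dtype=np.float64)

print((ints + floats).dtype)   # int64 combined with float64
print((ints / ints).dtype)     # true division yields floats
print((ints // 2).dtype)       # floor division of integers stays integer
print(ints ** 2)               # element-wise power
```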


## Statistical Methods, Sorting, and Set Operations

**Basic Statistical Operations:**

In [None]:
# setup a random 2 x 5 matrix
arr = 10 * np.random.randn(2,5)
arr

array([[ 11.18943257,  15.80114757,   7.64331472,  15.70062494,
          2.68413052],
       [-11.78362941,   1.21687436,   7.75913373,   3.60047039,
          2.206046  ]])

In [None]:
# compute the mean for all elements
arr.mean()

5.6017545386102885

In [None]:
# compute the means by row
arr.mean(axis=1)

array([10.60373006,  0.59977901])

In [None]:
# compute the means by column
arr.mean(axis=0)

array([-0.29709842,  8.50901096,  7.70122423,  9.65054767,  2.44508826])

In [None]:
# sum all the elements
arr.sum()


56.01754538610288

In [None]:
# compute the medians by rows
np.median(arr, axis=1)

array([11.18943257,  2.206046  ])
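
Because the array above is random, it can be easier to check the meaning of `axis` on a small fixed example. In this sketch the array `scores` is made up for illustration: `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(scores.sum(axis=0))              # one value per column
print(scores.sum(axis=1))              # one value per row
print(scores.max(), scores.argmax())   # largest value and its index in the flattened array
print(scores.std())                    # standard deviation of all elements
```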

**Sorting:**

In [None]:
# create a 10 element array of randoms
np.random.seed(10)
unsorted = np.random.randn(10)

unsorted

array([ 1.3315865 ,  0.71527897, -1.54540029, -0.00838385,  0.62133597,
       -0.72008556,  0.26551159,  0.10854853,  0.00429143, -0.17460021])

In [None]:
# create a copy and sort it (named sorted_arr to avoid shadowing Python's built-in sorted)
sorted_arr = np.array(unsorted)
# in-place sorting
sorted_arr.sort()

print(sorted_arr)
print()
print(unsorted)

[-1.54540029 -0.72008556 -0.17460021 -0.00838385  0.00429143  0.10854853
  0.26551159  0.62133597  0.71527897  1.3315865 ]

[ 1.3315865   0.71527897 -1.54540029 -0.00838385  0.62133597 -0.72008556
  0.26551159  0.10854853  0.00429143 -0.17460021]
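
As an alternative to copying and then sorting in place, `np.sort` returns a sorted copy directly, and `np.argsort` returns the indices that would sort the array. A short sketch with illustrative values:

```python
import numpy as np

values = np.array([3.0, -1.0, 2.0])

ordered = np.sort(values)      # sorted copy; values itself is unchanged
order = np.argsort(values)     # indices that would sort the array

print(ordered)
print(order)
print(values[order])           # same result as np.sort(values)
print(ordered[::-1])           # descending order via a reversed slice
```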


**Finding Unique elements:**

In [None]:
array = np.array([1,2,1,4,2,1,4,2])

print(np.unique(array))

[1 2 4]
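
`np.unique` can also report how often each value occurs through its `return_counts` argument. A minimal sketch, with `readings` as an illustrative name for the same data:

```python
import numpy as np

readings = np.array([1, 2, 1, 4, 2, 1, 4, 2])
values, counts = np.unique(readings, return_counts=True)

for value, count in zip(values, counts):
    print(value, "appears", count, "times")
```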


**Set Operations with np.array data type:**

In [None]:
s1 = np.array(['desk','chair','bulb'])
s2 = np.array(['lamp','bulb','chair'])
print(s1, s2)

['desk' 'chair' 'bulb'] ['lamp' 'bulb' 'chair']


In [None]:
print( np.intersect1d(s1, s2) )

['bulb' 'chair']


In [None]:
print( np.union1d(s1, s2) )

['bulb' 'chair' 'desk' 'lamp']


In [None]:
print( np.setdiff1d(s1, s2) ) # elements in s1 that are not in s2

['desk']


In [None]:
print( np.isin(s1, s2) ) # which elements of s1 are also in s2 (np.isin supersedes the older np.in1d)

[False  True  True]


## Broadcasting

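Broadcasting lets NumPy combine arrays of different shapes in arithmetic operations by virtually stretching the smaller array along missing dimensions, or dimensions of size 1, to match the larger one.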

For more details, please see:

https://numpy.org/devdocs/user/basics.broadcasting.html

In [None]:
import numpy as np

start = np.zeros((4,3))
start

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [None]:
# create a rank 1 ndarray with 3 values
add_rows = np.array([1, 0, 2])
print(add_rows.shape)
add_rows

(3,)


array([1, 0, 2])

In [None]:
 # add to each row of 'start' using broadcasting
y = start + add_rows
y

array([[1., 0., 2.],
       [1., 0., 2.],
       [1., 0., 2.],
       [1., 0., 2.]])

In [None]:
# create an ndarray which is 4 x 1 to broadcast across columns
add_cols = np.array([[0,1,2,3]])
add_cols = add_cols.T

add_cols

array([[0],
       [1],
       [2],
       [3]])

In [None]:
# add to each column of 'start' using broadcasting
y = start + add_cols
print(y)

[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]
 [3. 3. 3.]]


In [None]:
# this will just broadcast in both dimensions
add_scalar = np.array([1])
start + add_scalar

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

Example from the slides:

In [None]:
# create our 3x4 matrix
arrA = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
arrA

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [None]:
# create our rank 1 array with 4 elements (broadcasting treats it as a 1x4 row)
arrB = np.array([0,1,0,2])
arrB

array([0, 1, 0, 2])

In [None]:
# add the two together using broadcasting

arrA + arrB

array([[ 1,  3,  3,  6],
       [ 5,  7,  7, 10],
       [ 9, 11, 11, 14]])
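
To check ahead of time whether two shapes are compatible, one option is `np.broadcast_shapes`, which is available in NumPy 1.20 and later. The sketch below reuses the shapes from this section and then shows a pair that cannot be broadcast; the `compatible` flag is just an illustrative helper variable.

```python
import numpy as np

print(np.broadcast_shapes((4, 3), (3,)))     # rank 1 array added to each row
print(np.broadcast_shapes((4, 3), (4, 1)))   # column added to each column
print(np.broadcast_shapes((4, 1), (3,)))     # both operands are stretched

try:
    np.zeros((4, 3)) + np.zeros(4)           # trailing dimensions 3 and 4 do not match
    compatible = True
except ValueError as err:
    compatible = False
    print("incompatible:", err)
```

Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.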

## Speedtest: ndarrays vs lists

First set up parameters for the speed test. We'll be testing the time to sum elements in an ndarray versus a list.

In [1]:
from numpy import arange
from timeit import Timer

size    = 1000000
timeits = 1000

In [2]:
# create the ndarray with values 0,1,2...,size-1
nd_array = arange(size)
type(nd_array)

numpy.ndarray

In [3]:
# timer expects the operation as a parameter,
# here we pass nd_array.sum()
timer_numpy = Timer("nd_array.sum()", "from __main__ import nd_array")

print("Time taken by numpy ndarray: %f seconds" %
      (timer_numpy.timeit(timeits)/timeits))

Time taken by numpy ndarray: 0.002396 seconds


In [4]:
# create the list with values 0,1,2...,size-1
a_list = list(range(size))
type(a_list)

list

In [9]:
# timer expects the operation as a parameter, here we pass sum(a_list)
timer_list = Timer("sum(a_list)", "from __main__ import a_list")

print("Time taken by list:  %f seconds" %
      (timer_list.timeit(timeits)/timeits))

Time taken by list:  0.006760 seconds
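
The same comparison can be written by passing callables to `timeit.timeit` instead of code strings, which avoids the `from __main__ import ...` setup. This is only a sketch: the size and the number of runs are arbitrary choices, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

n = 100_000
runs = 20
as_array = np.arange(n, dtype=np.int64)   # explicit dtype so the sum cannot overflow
as_list = list(range(n))

array_time = timeit.timeit(lambda: as_array.sum(), number=runs)
list_time = timeit.timeit(lambda: sum(as_list), number=runs)

print("ndarray: %f seconds per call" % (array_time / runs))
print("list:    %f seconds per call" % (list_time / runs))
print(int(as_array.sum()) == sum(as_list))   # both approaches give the same total
```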


## Read or Write to Disk

Binary Format:

In [11]:
x = np.array([ 23.23, 24.24] )

In [12]:
np.save('an_array', x)

In [13]:
np.load('an_array.npy')

array([23.23, 24.24])

Text Format:

In [14]:
np.savetxt('array.txt', X=x, delimiter=',')

In [17]:
# use !type on Windows, or !cat on Unix-like machines such as Colab

!cat array.txt

2.323000000000000043e+01
2.423999999999999844e+01


In [18]:
np.loadtxt('array.txt', delimiter=',')

array([23.23, 24.24])
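
When several arrays belong together, `np.savez` can store them in a single `.npz` file under names you choose. The sketch below writes to a temporary directory so that it leaves no files behind; the file name `dataset.npz` and the keys `features` and `labels` are illustrative.

```python
import os
import tempfile
import numpy as np

features = np.array([[1.0, 2.0], [3.0, 4.0]])
labels = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.npz")
    np.savez(path, features=features, labels=labels)   # keyword names become the keys
    with np.load(path) as archive:
        restored_features = archive["features"]
        restored_labels = archive["labels"]

print(restored_features)
print(restored_labels)
```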

## Additional Common ndarray Operations

**Dot Product on Matrices and Inner Product on Vectors:**

In [19]:
# determine the dot product of two matrices
# for 2-D arrays, dot performs matrix multiplication
x2d = np.array([[1,1],[1,1]])
y2d = np.array([[2,2],[2,2]])

print(x2d.dot(y2d))
print()
print(np.dot(x2d, y2d))

[[4 4]
 [4 4]]

[[4 4]
 [4 4]]


In [20]:
# determine the inner product of two vectors
# for 1-D arrays, dot computes the inner product
a1d = np.array([9 , 9 ])
b1d = np.array([10, 10])
# 9 * 10 + 9 * 10
print(a1d.dot(b1d))
print()
print(np.dot(a1d, b1d))

180

180


In [21]:
# dot product of a matrix and a vector
print(x2d.dot(a1d))
print()
print(np.dot(x2d, a1d))

[18 18]

[18 18]
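
The `@` operator is a shorthand for matrix multiplication and is easy to confuse with `*`, which multiplies element by element. A short sketch with illustrative arrays `a`, `b`, and `v`:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
v = np.array([1, 1])

print(a @ b)        # matrix product, same as np.dot(a, b) for 2-D arrays
print(a * b)        # element-wise product
print(a @ v)        # matrix times vector
```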


**Element-wise Functions:**

For example, let's compare two arrays element by element and keep the larger value from each pair.

In [22]:
# random array
x = np.random.randn(8)
x

array([ 0.54947782, -0.85924703, -0.00788669, -1.72708216,  1.29509401,
       -1.22260874,  1.21076465, -0.66285915])

In [23]:
# another random array
y = np.random.randn(8)
y

array([-0.03456183, -0.33490643, -0.33147102, -0.02943573, -0.35330063,
        1.20785148, -0.72718484,  0.79186507])

In [24]:
# returns element wise maximum between two arrays

np.maximum(x, y)

array([ 0.54947782, -0.33490643, -0.00788669, -0.02943573,  1.29509401,
        1.20785148,  1.21076465,  0.79186507])

**Reshaping array:**

In [25]:
# grab values from 0 through 19 in an array
arr = np.arange(20)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [26]:
# reshape to be a 4 x 5 matrix
arr.reshape(4,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
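
Two details of `reshape` are worth a quick illustration: one dimension can be given as `-1` so NumPy infers it, and the new shape must hold exactly the same number of elements. In the sketch below (illustrative names `flat` and `table`), the reshaped array is a view of the original, which is what NumPy returns whenever the memory layout allows it.

```python
import numpy as np

flat = np.arange(20)
table = flat.reshape(4, -1)      # -1 lets NumPy infer the 5
print(table.shape)

table[0, 0] = 100                # table is a view here, so flat changes too
print(flat[0])

try:
    flat.reshape(3, 7)           # 21 slots for 20 values
    reshaped_ok = True
except ValueError as err:
    reshaped_ok = False
    print("cannot reshape:", err)
```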

**Transpose:**

In [27]:
# transpose
ex1 = np.array([[11,12],[21,22]])

ex1.T

array([[11, 21],
       [12, 22]])

**Indexing using where():**

In [28]:
x_1 = np.array([1,2,3,4,5])

y_1 = np.array([11,22,33,44,55])

filter = np.array([True, False, True, False, True])

In [29]:
# When True yields x, otherwise yields y
out = np.where(filter, x_1, y_1)
print(out)

[ 1 22  3 44  5]


In [30]:
mat = np.random.rand(5,5)
mat

array([[0.42901773, 0.39312753, 0.7487465 , 0.23193072, 0.98142797],
       [0.29031622, 0.97525814, 0.79666517, 0.89914461, 0.19613932],
       [0.81110101, 0.61561545, 0.77689598, 0.12626827, 0.15224751],
       [0.17893263, 0.28203079, 0.08859996, 0.84809231, 0.88346245],
       [0.64390405, 0.99345315, 0.24821267, 0.83517547, 0.50291443]])

In [31]:
np.where( mat > 0.5, 1000, -1)

array([[  -1,   -1, 1000,   -1, 1000],
       [  -1, 1000, 1000, 1000,   -1],
       [1000, 1000, 1000,   -1,   -1],
       [  -1,   -1,   -1, 1000, 1000],
       [1000, 1000,   -1, 1000, 1000]])
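
Called with only a condition, `np.where` returns the positions where the condition holds, as a tuple with one index array per dimension. A minimal sketch with made-up data (`temps` and `grid` are illustrative names):

```python
import numpy as np

temps = np.array([12, 25, 31, 18, 40])
(hot_positions,) = np.where(temps > 20)   # 1-D input gives a tuple with one array
print(hot_positions)
print(temps[hot_positions])

grid = np.array([[1, 9], [8, 2]])
rows, cols = np.where(grid > 5)           # 2-D input gives row and column indices
print(list(zip(rows.tolist(), cols.tolist())))
```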

**"any" or "all" conditionals:**

In [32]:
arr_bools = np.array([ True, False, True, True, False ])

In [33]:
arr_bools.any()

True

In [34]:
arr_bools.all()

False

**Random Number Generation:**

In [35]:
Y = np.random.normal(size = (1,5))[0]
print(Y)

[-0.79464025 -0.06877612  1.64579011  0.13124325 -0.72245391]


In [36]:
Z = np.random.randint(low=2,high=50,size=4)
print(Z)

[39 39  6 15]


In [37]:
np.random.permutation(Z) #return a new ordering of elements in Z

array([39,  6, 15, 39])

In [38]:
np.random.uniform(size=4) #uniform distribution

array([0.40223009, 0.38330194, 0.44557906, 0.12316568])

In [39]:
np.random.normal(size=4) #normal distribution

array([-0.56345058, -1.28707501, -0.40496887, -0.69966759])
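
The functions above use NumPy's global random state. NumPy also offers a generator object, created with `np.random.default_rng`, which keeps its own state and makes results reproducible when given a seed. The sketch below mirrors the calls above; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
ints = rng.integers(low=2, high=50, size=4)   # high is exclusive
normals = rng.normal(size=4)                  # standard normal samples
shuffled = rng.permutation(ints)              # new ordering, ints is unchanged

print(ints)
print(normals)
print(shuffled)

# A second generator with the same seed reproduces the same first draw
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(ints, rng_again.integers(low=2, high=50, size=4)))
```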

**Merging data sets:**

In [40]:
K = np.random.randint(low=2,high=50,size=(2,2))
print(K)

print()
M = np.random.randint(low=2,high=50,size=(2,2))
print(M)

[[31 24]
 [23 39]]

[[23 33]
 [26  9]]


In [41]:
#vertical stack
np.vstack((K,M))

array([[31, 24],
       [23, 39],
       [23, 33],
       [26,  9]])

In [43]:
#horizontal stack
np.hstack((K,M))

array([[31, 24, 23, 33],
       [23, 39, 26,  9]])

In [44]:
np.concatenate([K, M], axis = 0)

array([[31, 24],
       [23, 39],
       [23, 33],
       [26,  9]])

In [45]:
np.concatenate([K, M.T], axis = 1)

array([[31, 24, 23, 26],
       [23, 39, 33,  9]])
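
Stacking also works on rank 1 arrays, where the choice of function decides whether each input becomes a row or a column. A small sketch with made-up measurements (`heights` and `weights` are illustrative names):

```python
import numpy as np

heights = np.array([1.6, 1.7, 1.8])
weights = np.array([60.0, 70.0, 80.0])

as_rows = np.vstack((heights, weights))            # each input becomes a row
as_columns = np.column_stack((heights, weights))   # each input becomes a column

print(as_rows.shape, as_columns.shape)
print(np.array_equal(as_rows.T, as_columns))
```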