# NumPy

**NumPy**--which stands for Numerical Python--is the foundational package for scientific computing in Python. In addition, it provides the primary data structure--*the n-dimensional array*--on which the **pandas** package is built. NumPy includes extensive functionality, but we will use it primarily for:

* Fast (vectorized) array operations for data processing
* Efficient descriptive statistics
* Manipulations for merging multiple data sets

Friendly reminders:

* NumPy module on DataCamp is due by 11:59 p.m. tonight
* Homework #1 is due by 11:59 p.m. tonight
* Homework #2 is released

In [1]:
# Import statement
import numpy as np

## ndarrays

The ndarray is an n-dimensional array object, similar to a list but designed to facilitate fast computation. To make that speed possible, all elements of an array must share a single data type (dtype). We will mostly focus on numerical (int, float) and boolean arrays.
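
As a small sketch of the single-type rule (the variable names here are only illustrative), NumPy picks one dtype for the whole array when it is created, upcasting mixed numeric input if needed:

```python
import numpy as np

# Mixing ints and floats: every element is upcast to float64
mixed_numeric = np.array([1, 2, 3.5])
print(mixed_numeric.dtype)   # float64

# A boolean array has dtype bool
flags = np.array([True, False, True])
print(flags.dtype)           # bool

# A dtype can also be requested explicitly at creation time
as_float = np.array([1, 2, 3], dtype=float)
print(as_float)              # [1. 2. 3.]
```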

Arrays will most likely be loaded from external data sources (later), but for now, we can create them via casting (using the np.**array** function) or using one of the following generating functions (or class of functions):

* np.**arange**(*start*, *stop*, *step*) (similar to the built-in range function)
* np.**zeros**(*shape*), np.**ones**(*shape*) (where *shape* is a sequence of dimension sizes)
* np.random.**rand**(*d0*,*d1*,...,*dn*) (where *d0*,*d1*,...,*dn* are dimension sizes): generates uniform random numbers in [0, 1)
* np.random.**randn**(*d0*,*d1*,...,*dn*) (where *d0*,*d1*,...,*dn* are dimension sizes): generates random numbers from the standard normal distribution
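
A quick sketch of these generating functions with explicit arguments (the names `evens`, `zeros`, `uniform`, and `normal` are just for illustration):

```python
import numpy as np

evens = np.arange(2, 12, 2)       # start=2, stop=12 (exclusive), step=2
print(evens)                      # [ 2  4  6  8 10]

zeros = np.zeros((2, 3))          # 2 rows, 3 columns of 0.0
print(zeros.shape)                # (2, 3)

uniform = np.random.rand(1000)    # uniform draws in [0, 1)
normal = np.random.randn(1000)    # standard normal draws
print(uniform.min() >= 0, uniform.max() < 1)   # True True
```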

In [2]:
# Casting from list
np.array([1,5,-1,2,4])

array([ 1,  5, -1,  2,  4])

In [3]:
# np.arange
arr1d = np.arange(10)
arr1d

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]:
# np.ones, np.zeros
np.ones((5,10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [10]:
# np.random.rand, np.random.randn
arr2d = np.random.rand(3,3)   ## rand: uniform random numbers in [0, 1)    randn: standard normal random numbers
arr2d

array([[0.23483023, 0.47199961, 0.13463609],
       [0.72457856, 0.86141043, 0.75618108],
       [0.62588673, 0.29679832, 0.55415876]])

In [16]:
# Common attributes
print(arr2d.ndim) # number of dimensions
print(arr2d.shape) # shape of array
print(arr2d.dtype) # data type

2
(3, 3)
float64


In [15]:
# Casting to other dtypes
np.arange(10).astype(float)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [14]:
# Tab completion: type "arr1d." and press Tab to list the available attributes and methods

See https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html for additional details on ndarray objects.

## Array Operations

As previously stated, arrays are designed to support fast computation and comparison. The most common types of operations are:

* Between arrays and scalars
* Universal functions (np.func)
    - Unary (performed on a single array): abs, sqrt, exp, log, ceil, floor, logical_not, and more
    - Binary (performed between two arrays): +, -, /, *, **, minimum, maximum, mod, >, >=, <, <=, ==, !=, logical_and, logical_or, logical_xor
* Mathematical and statistical functions - Available as NumPy functions (np.func) and array methods (arr.func)
    - Aggregation: mean, sum, std, var, min/max, argmin/argmax
    - Non-aggregation: cumsum, cumprod
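
To make the categories above concrete, here is a minimal sketch on two small hand-made arrays (`x` and `y` are illustrative names, not part of the lecture data):

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 10.0])

print(np.sqrt(x))                     # unary ufunc: [1. 2. 3.]
print(np.maximum(x, y))               # binary ufunc, element-wise: [ 2.  4. 10.]
print(np.logical_and(x > 1, y > 2))   # [False  True  True]

print(x.sum(), np.sum(x))             # aggregation, method and function: 14.0 14.0
print(x.argmax())                     # position of the largest value: 2
print(x.cumprod())                    # non-aggregation: [ 1.  4. 36.]
```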

In [17]:
# Broadcasting with a scalar
arr2d + 1   ## adds 1 to every element of arr2d

array([[1.23483023, 1.47199961, 1.13463609],
       [1.72457856, 1.86141043, 1.75618108],
       [1.62588673, 1.29679832, 1.55415876]])

In [20]:
# Broadcasting with 1-d array
arr2d + [1,2,3]   ## adds 1 to the first column, 2 to the second column, and 3 to the third column

array([[1.23483023, 2.47199961, 3.13463609],
       [1.72457856, 2.86141043, 3.75618108],
       [1.62588673, 2.29679832, 3.55415876]])
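
A 1-d array of length 3 lines up with the columns, as in the cell above. To add a different value to each row instead, one option is a column-shaped array of shape (3, 1). The sketch below uses a zero-filled `grid` (an illustrative name) so the broadcast pattern is easy to see:

```python
import numpy as np

grid = np.zeros((3, 3))
row = np.array([1, 2, 3])            # shape (3,): one value per column
col = np.array([[10], [20], [30]])   # shape (3, 1): one value per row

print(grid + row)
# [[1. 2. 3.]
#  [1. 2. 3.]
#  [1. 2. 3.]]

print(grid + col)
# [[10. 10. 10.]
#  [20. 20. 20.]
#  [30. 30. 30.]]
```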

In [21]:
# Comparison with a scalar
arr2d > 0.5    ## element-by-element comparison

array([[False, False, False],
       [ True,  True,  True],
       [ True, False,  True]])

In [22]:
# Unary functions
np.sqrt(arr2d)

array([[0.48459285, 0.68702228, 0.36692791],
       [0.85122181, 0.92812199, 0.86958673],
       [0.79113003, 0.544792  , 0.7444184 ]])

In [23]:
# Binary functions
arr2d * arr2d   ## same as np.square(arr2d)

array([[0.05514524, 0.22278363, 0.01812688],
       [0.5250141 , 0.74202792, 0.57180983],
       [0.39173419, 0.08808925, 0.30709193]])

In [29]:
# Aggregation function
print(np.mean(arr2d))     ## mean of the entire array
print(np.mean(arr2d, axis=0))  ## column mean
np.mean(arr2d, axis=1)   ## row mean (mean across columns)

0.517831090123908
[0.52843184 0.54340279 0.48165864]


array([0.28048864, 0.78072336, 0.49228127])

In [30]:
# Non-aggregation function
arr2d.cumsum()  ## cumulative sum

array([0.23483023, 0.70682984, 0.84146593, 1.56604449, 2.42745492,
       3.183636  , 3.80952273, 4.10632105, 4.66047981])

## Arrays vs. Lists

Arrays may seem similar to lists (e.g., both are mutable and iterable sequences), but they are distinct data structures: for example, + and * concatenate and repeat lists but perform element-wise arithmetic on arrays. Be sure to use an array whenever you are performing any large-scale computations or comparisons.

In [31]:
# List operations
L = list(range(10))
print(L + L)
print(L * 3)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [32]:
# Array operations
arr = np.arange(10)
print(arr + arr)
print(arr * 3)

[ 0  2  4  6  8 10 12 14 16 18]
[ 0  3  6  9 12 15 18 21 24 27]


In [33]:
%%timeit -r5 -n10000
# List computation - for loop
squares = []
for x in L:
    squares.append(x * x)

715 ns ± 25.7 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)


In [34]:
# List computation - map (map is lazy in Python 3, so this times only creating the map object, not computing the squares)
lf = lambda x: x * x
%timeit -r5 -n10000 map(lf, L)

172 ns ± 5.42 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)


In [35]:
# List computation - list comprehension
%timeit -r5 -n10000 [x * x  for x in L]

614 ns ± 42 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)


In [36]:
# List computation - generator expression (lazy: this times only creating the generator, not computing the squares)
%timeit -r5 -n10000 (x * x  for x in L)

292 ns ± 19.9 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)


In [37]:
# Array computations
%timeit -r5 -n10000 arr ** 2
%timeit -r5 -n10000 np.square(arr)
%timeit -r5 -n10000 arr * arr

799 ns ± 331 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)
593 ns ± 14.8 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)
441 ns ± 8.3 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)
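
With only 10 elements, the fixed overhead of each call dominates, so the list and array timings above look similar. A rough way to compare at a larger scale outside of IPython is the standard `timeit` module. This is only a sketch: the size `n` and the repeat count are arbitrary choices, and the actual timings depend on your machine.

```python
import timeit
import numpy as np

n = 100_000
big_list = list(range(n))
big_arr = np.arange(n, dtype=np.int64)   # int64 so the squares cannot overflow

# Total time for 20 repetitions of each approach
list_time = timeit.timeit(lambda: [x * x for x in big_list], number=20)
arr_time = timeit.timeit(lambda: big_arr * big_arr, number=20)

print(f"list comprehension: {list_time:.4f} s for 20 runs")
print(f"array multiply:     {arr_time:.4f} s for 20 runs")

# Both approaches compute the same values
list_squares = [x * x for x in big_list]
arr_squares = big_arr * big_arr
print(list_squares == arr_squares.tolist())   # True
```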


In addition, nested lists may be ragged (the inner lists can have different lengths), whereas a multi-dimensional array must be rectangular: every sub-sequence along a given dimension must have the same length.

In [39]:
[[1,2,3],[4,5,6,7],[8,9]]

[[1, 2, 3], [4, 5, 6, 7], [8, 9]]

In [38]:
np.array([[1,2,3],[4,5,6],[7,8,9]])

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
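
To see what happens with ragged input, the sketch below passes `dtype=object` explicitly. Under that assumption NumPy does not build a 2-d grid; it stores the three Python lists as elements of a 1-d array, which loses most of the benefits of an array:

```python
import numpy as np

ragged = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]

# With dtype=object, the result is a 1-d array holding three list objects
obj_arr = np.array(ragged, dtype=object)
print(obj_arr.shape)   # (3,)
print(obj_arr.dtype)   # object

# Rectangular input gives a true 2-d numeric array
rect = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(rect.shape)      # (3, 3)
```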

## Indexing and Slicing

Indexing and slicing arrays is similar to lists...

In [40]:
print(arr[5])
print(arr[-1])
print(arr[5:8])
print(arr[::2])

5
9
[5 6 7]
[0 2 4 6 8]


...but unlike lists, array slices are views on the original array, so any updates to the array slice will be reflected in the original array. Consider the following example, in which we combine indexing with assignment (which also works as you would expect).

In [41]:
list_slice = L[5:8]
list_slice[0] = -10
print(list_slice, L)

[-10, 6, 7] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [42]:
arr_slice = arr[5:8]
arr_slice[0] = -10
print(arr_slice, arr)

[-10   6   7] [  0   1   2   3   4 -10   6   7   8   9]
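
If you want a slice that can be changed without touching the original array, one option is to call `.copy()` on it. The sketch below uses `np.shares_memory` to check whether two arrays share the same data (the names `original`, `view`, and `independent` are illustrative):

```python
import numpy as np

original = np.arange(10)

view = original[5:8]                 # a view on the original data
independent = original[5:8].copy()   # a separate copy of the data

print(np.shares_memory(original, view))         # True
print(np.shares_memory(original, independent))  # False

independent[0] = -10                 # changes only the copy
print(original)                      # [0 1 2 3 4 5 6 7 8 9]
```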


Indexing via index arrays and boolean arrays is a convenient way of filtering an array. In this case, these operations return a copy of the data, as opposed to a view on the original array as in the case of slicing.

In [43]:
# Index arrays - Each element in the index array is replaced by the corresponding value in the array
arr[np.array([1,0,1,3,9,-1])]

array([1, 0, 1, 3, 9, 9])

In [44]:
# Boolean arrays - Each element in the array is returned if the corresponding boolean scalar is True
arr[np.array([True, False, False, False, False, True, False, False, False, False])]  ## keeps only the elements at positions 0 and 5

array([  0, -10])

In [45]:
# Filtering via conditional
arr[arr % 5 == 0]

array([  0, -10])
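
A small sketch of the copy behavior on a fresh array (`data` and `picked` are illustrative names). Changing the result of index-array selection leaves the source alone, while assigning through a boolean mask on the left-hand side updates the source in place:

```python
import numpy as np

data = np.arange(10)

# Selection with an index array returns a copy
picked = data[np.array([1, 3, 5])]
picked[0] = 99
print(data)       # [0 1 2 3 4 5 6 7 8 9]

# Assignment through a boolean mask modifies the array itself
data[data % 2 == 0] = -1
print(data)       # [-1  1 -1  3 -1  5 -1  7 -1  9]
```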

You can also use boolean arrays to learn about your data.

In [46]:
# How much of my data satisfies a given condition?
arr = np.random.randn(100)
(arr > 0).sum()   # counts the True values, i.e., how many elements are positive

55

In [47]:
# Do any of my data satisfy a given condition?
(arr > 3).any()  ## True if at least one element satisfies the condition

False

In [48]:
# Do all of my data satisfy a given condition?
(np.abs(arr) < 3).all()   ## True only if every element satisfies the condition

True

Indexing and slicing multi-dimensional arrays is fairly intuitive. Whereas a 1-dimensional array contains 0-dimensional values (scalars), a 2-dimensional array is an array of 1-d arrays, where the first dimension represents the position of each 1-d array, and the second dimension refers to a specific position within each 1-d array (where all 1-d arrays have the same length). Similarly, a 3-dimensional array has 3 dimensions corresponding to the position of each 2-d array, 1-d array, and scalar value, respectively. And so on for higher dimensions.

![](https://i.stack.imgur.com/R2IDC.png "Multi-dimensional arrays")

When indexing and slicing into an n-d array, each dimension is accessed in order, either via successive indexing or slicing operations or a sequence of dimensional indices or slices. To retain all of the elements for a particular dimension, use the ':' operator.

In [49]:
# 2-d array
arr2d = np.random.rand(3,3)
arr2d

array([[0.52068396, 0.63403489, 0.45824731],
       [0.53835773, 0.07610067, 0.66980914],
       [0.0519256 , 0.43324149, 0.58350728]])

In [50]:
# Index specific element
print(arr2d[0][0])  # select row and column
print(arr2d[0,0])   # same as above

0.5206839638237436
0.5206839638237436


In [51]:
# Slice rows
print(arr2d[0])
print(arr2d[0,:])

[0.52068396 0.63403489 0.45824731]
[0.52068396 0.63403489 0.45824731]


In [52]:
# Slice columns
arr2d[:,0]

array([0.52068396, 0.53835773, 0.0519256 ])

In [53]:
# Slice rows and columns together
arr2d[1:,:2]

array([[0.53835773, 0.07610067],
       [0.0519256 , 0.43324149]])

In [54]:
# 3-d array
arr3d = np.random.rand(3,3,3).astype(float)
arr3d

array([[[0.68572388, 0.34399131, 0.20556249],
        [0.65885308, 0.06291251, 0.61209549],
        [0.11964807, 0.97046737, 0.68329246]],

       [[0.64938707, 0.7397151 , 0.49798026],
        [0.75373745, 0.77099798, 0.1404894 ],
        [0.22500644, 0.66407978, 0.77264225]],

       [[0.54900963, 0.89507718, 0.08796598],
        [0.68733107, 0.23682532, 0.00635225],
        [0.73896634, 0.74279378, 0.89796438]]])

In [55]:
# Index specific 2-d array
print(arr3d[1])
print(arr3d[1,:,:])

[[0.64938707 0.7397151  0.49798026]
 [0.75373745 0.77099798 0.1404894 ]
 [0.22500644 0.66407978 0.77264225]]
[[0.64938707 0.7397151  0.49798026]
 [0.75373745 0.77099798 0.1404894 ]
 [0.22500644 0.66407978 0.77264225]]


In [56]:
# Slicing 3-d array
arr3d[:,1,:]

array([[0.65885308, 0.06291251, 0.61209549],
       [0.75373745, 0.77099798, 0.1404894 ],
       [0.68733107, 0.23682532, 0.00635225]])

## Other Important Array Methods

### Conditional Logic

We saw that ternary expressions were a convenient way for us to generate conditional values:

*expr1* if *cond* else *expr2*

There are several ways to perform this task for a list (which we could then cast to an array):

1. Use a **for** loop
2. Use **map** with a lambda function
3. Use a list comprehension

For arrays, we use the np.**where** function!
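
As a minimal sketch of the correspondence, the list comprehension below applies a ternary expression element by element, and np.where expresses the same logic on a whole array at once (the example values are made up for illustration):

```python
import numpy as np

values = [3, -1, 4, -1, 5]

# List version: ternary expression inside a list comprehension
clipped_list = [v if v > 0 else 0 for v in values]

# Array version: np.where(cond, value_if_true, value_if_false)
arr_values = np.array(values)
clipped_arr = np.where(arr_values > 0, arr_values, 0)

print(clipped_list)   # [3, 0, 4, 0, 5]
print(clipped_arr)    # [3 0 4 0 5]
```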

In [57]:
np.where?

In [58]:
# Flip a coin N times
N = 10
np.where(np.random.rand(N) > 0.5, 'H', 'T')

array(['H', 'H', 'H', 'T', 'T', 'H', 'H', 'H', 'H', 'T'], dtype='<U1')

In [59]:
# Select a value at random
N = 10
a = np.arange(N)
b = np.zeros(N)
np.where(np.random.rand(N) > 0.5, a, b)

array([0., 0., 2., 0., 4., 5., 0., 7., 8., 9.])

In [60]:
# Nested conditions
N = 10
a = np.ones(N)
b = np.zeros(N)
c = -np.ones(N)
np.where(np.random.rand(N) > 2/3, a, np.where(np.random.rand(N) > 1/2, b, c))

array([ 1.,  1.,  1., -1., -1.,  1., -1.,  0., -1., -1.])
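
Nested np.where calls become hard to read as the number of cases grows. One alternative worth knowing is np.**select**, which takes a list of conditions and a matching list of values and uses the first condition that holds. The sketch below is a variation on the cell above, not a line-for-line equivalent: it uses a single random draw `u` with thresholds chosen so that each outcome is equally likely.

```python
import numpy as np

N = 10
u = np.random.rand(N)

# Conditions are checked in order; the first True one determines the value
conditions = [u > 2/3, u > 1/3]
choices = [1.0, 0.0]
outcome = np.select(conditions, choices, default=-1.0)
print(outcome)

# Same result written with nested np.where calls
nested = np.where(u > 2/3, 1.0, np.where(u > 1/3, 0.0, -1.0))
print(np.array_equal(outcome, nested))   # True
```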

### Sorting

In [61]:
arr = np.random.rand(10)
arr

array([0.43670632, 0.2970905 , 0.92112231, 0.02198297, 0.99808101,
       0.5459343 , 0.78314005, 0.43883047, 0.08058792, 0.70166145])

In [62]:
# Return a copy of sorted array
np.sort(arr)

array([0.02198297, 0.08058792, 0.2970905 , 0.43670632, 0.43883047,
       0.5459343 , 0.70166145, 0.78314005, 0.92112231, 0.99808101])

In [63]:
# Return sorting indices
arr.argsort()

array([3, 8, 1, 0, 7, 5, 9, 6, 2, 4])

In [64]:
# Sort in place
arr.sort()
arr

array([0.02198297, 0.08058792, 0.2970905 , 0.43670632, 0.43883047,
       0.5459343 , 0.70166145, 0.78314005, 0.92112231, 0.99808101])

In [65]:
# Search sorted
u = np.random.rand()
print(u, arr.searchsorted(u))

0.4269404711437208 3
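
Two common uses of these methods, sketched on small hand-made arrays (`scores` and `names` are illustrative): the indices from argsort can reorder a second, related array, and searchsorted reports where a new value would be inserted to keep a sorted array in order.

```python
import numpy as np

scores = np.array([0.7, 0.2, 0.9, 0.4])
names = np.array(['a', 'b', 'c', 'd'])

# Indices that would sort scores in ascending order
order = scores.argsort()
print(order)           # [1 3 0 2]
print(names[order])    # ['b' 'd' 'a' 'c']

# Insertion position that keeps the sorted array sorted
sorted_scores = scores[order]          # same values as np.sort(scores)
pos = sorted_scores.searchsorted(0.5)
print(pos)             # 2
```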


### Set Logic

In [66]:
arr1 = np.arange(10)
arr2 = np.arange(20,0,-2)
print(arr1, arr2)

[0 1 2 3 4 5 6 7 8 9] [20 18 16 14 12 10  8  6  4  2]


In [67]:
# Membership
7 in arr2

False

In [68]:
# Unique elements
print(arr1 * arr2, np.unique(arr1 * arr2))

[ 0 18 32 42 48 50 48 42 32 18] [ 0 18 32 42 48 50]


In [69]:
# Comparisons - np.intersect1d, np.union1d, np.setdiff1d, np.setxor1d
np.setxor1d(arr1, arr2)

array([ 0,  1,  3,  5,  7,  9, 10, 12, 14, 16, 18, 20])
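
For completeness, here is a sketch of the other set functions on the same two arrays (redefined so the block runs on its own). It also shows np.**isin**, which is an element-wise membership test, unlike the single-value `in` check above:

```python
import numpy as np

arr1 = np.arange(10)
arr2 = np.arange(20, 0, -2)

print(np.intersect1d(arr1, arr2))   # values in both: [2 4 6 8]
print(np.union1d(arr1, arr2))       # sorted unique values from either array
print(np.setdiff1d(arr1, arr2))     # in arr1 but not arr2: [0 1 3 5 7 9]
print(np.isin(arr1, arr2))          # boolean array: is each arr1 element in arr2?
```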

## Manipulating and Combining Arrays

Sometimes, you will need to manipulate or combine multiple arrays of data prior to performing any analysis. There are a lot of built-in functions for these purposes. Very rarely will you need to develop your own code.

### Manipulating Arrays

In [70]:
arr = np.arange(8)
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [71]:
# Reshaping arrays
print(arr.reshape((2,4)))
print(arr.reshape((2,-1))) # automatically determines the other dimension size

[[0 1 2 3]
 [4 5 6 7]]
[[0 1 2 3]
 [4 5 6 7]]


In [72]:
# Transpose
arr.reshape((2,4)).T # Our first example of chaining methods together

array([[0, 4],
       [1, 5],
       [2, 6],
       [3, 7]])

In [73]:
# Flatten
print(arr.reshape((2,4)).flatten('C')) # row-major
print(arr.reshape((2,4)).flatten('F')) # column-major

[0 1 2 3 4 5 6 7]
[0 4 1 5 2 6 3 7]
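
One detail worth keeping in mind: for a contiguous array like this one, reshape returns a view on the same data, whereas flatten always returns a copy. A minimal sketch, using the illustrative name `base` so it does not clash with the arrays above:

```python
import numpy as np

base = np.arange(8)
reshaped = base.reshape((2, 4))   # view on base (no data copied here)
flat_copy = reshaped.flatten()    # flatten always returns a copy

print(np.shares_memory(base, reshaped))    # True
print(np.shares_memory(base, flat_copy))   # False

reshaped[0, 0] = 100
print(base[0])        # 100 (changed through the view)
print(flat_copy[0])   # 0 (the copy is unaffected)
```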


### Combining and Splitting Arrays

In [74]:
arr1 = np.random.rand(4,2)
arr2 = np.random.rand(4,2)
print(arr1)
print(arr2)

[[0.57734861 0.07260167]
 [0.48240802 0.54893625]
 [0.27453721 0.15565641]
 [0.6822756  0.83893043]]
[[0.61930725 0.95363611]
 [0.57956645 0.17446299]
 [0.83142207 0.11146881]
 [0.40890042 0.73002812]]


In [75]:
# Concatenation by row
np.concatenate([arr1, arr2], axis=0)

array([[0.57734861, 0.07260167],
       [0.48240802, 0.54893625],
       [0.27453721, 0.15565641],
       [0.6822756 , 0.83893043],
       [0.61930725, 0.95363611],
       [0.57956645, 0.17446299],
       [0.83142207, 0.11146881],
       [0.40890042, 0.73002812]])

In [76]:
# Stacking rows
np.vstack([arr1, arr2]) # also, np.row_stack (an alias that is deprecated as of NumPy 2.0)

array([[0.57734861, 0.07260167],
       [0.48240802, 0.54893625],
       [0.27453721, 0.15565641],
       [0.6822756 , 0.83893043],
       [0.61930725, 0.95363611],
       [0.57956645, 0.17446299],
       [0.83142207, 0.11146881],
       [0.40890042, 0.73002812]])

In [77]:
# Concatenate by column
np.concatenate([arr1, arr2], axis=1)

array([[0.57734861, 0.07260167, 0.61930725, 0.95363611],
       [0.48240802, 0.54893625, 0.57956645, 0.17446299],
       [0.27453721, 0.15565641, 0.83142207, 0.11146881],
       [0.6822756 , 0.83893043, 0.40890042, 0.73002812]])

In [78]:
# Stacking columns
np.hstack([arr1, arr2]) # also, np.column_stack

array([[0.57734861, 0.07260167, 0.61930725, 0.95363611],
       [0.48240802, 0.54893625, 0.57956645, 0.17446299],
       [0.27453721, 0.15565641, 0.83142207, 0.11146881],
       [0.6822756 , 0.83893043, 0.40890042, 0.73002812]])

In [79]:
# Splitting arrays
np.split(arr1, [1,2], axis=0) # also, np.hsplit, vsplit

[array([[0.57734861, 0.07260167]]),
 array([[0.48240802, 0.54893625]]),
 array([[0.27453721, 0.15565641],
        [0.6822756 , 0.83893043]])]
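
The second argument to np.split can be either a list of split positions (as above) or a single integer giving the number of equal parts. The sketch below shows both forms on a small illustrative array named `table`, along with np.hsplit and np.vsplit:

```python
import numpy as np

table = np.arange(16).reshape((4, 4))

# An integer splits into that many equal parts
top, bottom = np.split(table, 2, axis=0)
print(top.shape, bottom.shape)     # (2, 4) (2, 4)

# hsplit splits along columns: column 0 | columns 1-3
left, right = np.hsplit(table, [1])
print(left.shape, right.shape)     # (4, 1) (4, 3)

# vsplit splits along rows: rows 0-2 | row 3
upper, lower = np.vsplit(table, [3])
print(upper.shape, lower.shape)    # (3, 4) (1, 4)
```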

## File Input/Output

There are two primary ways to save/load NumPy arrays to/from a file:

* Binary format (.npy) - np.**save** and np.**load**
* Delimited text file (.txt) - np.**savetxt** and np.**loadtxt** (also, np.**genfromtxt** for files with missing data)

In [80]:
# Print working directory
%pwd

'/Users/charmain/Desktop/Master Degree/Term 2/Data Processing & Analysis in Python/Lecture'

In [81]:
# Change working directory to data folder
%cd ~/Dropbox/Teaching/Courses/BUDT758X/data/

[Errno 2] No such file or directory: '/Users/charmain/Dropbox/Teaching/Courses/BUDT758X/data/'
/Users/charmain/Desktop/Master Degree/Term 2/Data Processing & Analysis in Python/Lecture


In [82]:
# Save arr2d to binary file
np.save('arr2d', arr2d) # function will add the .npy extension

In [83]:
# Delete arr2d and re-load from file
del arr2d
arr2d = np.load('arr2d.npy') # must include .npy extension
arr2d

array([[0.52068396, 0.63403489, 0.45824731],
       [0.53835773, 0.07610067, 0.66980914],
       [0.0519256 , 0.43324149, 0.58350728]])

In [85]:
# Save arr2d to text file
np.savetxt('arr2d.txt', arr2d, fmt='%.4f', delimiter=',')

In [86]:
# Delete arr2d and re-load from file
del arr2d
arr2d = np.loadtxt('arr2d.txt', delimiter=',')
arr2d

array([[0.5207, 0.634 , 0.4582],
       [0.5384, 0.0761, 0.6698],
       [0.0519, 0.4332, 0.5835]])
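
For delimited text with missing entries, np.genfromtxt fills the gaps instead of raising an error. The sketch below reads from an in-memory string rather than a file on disk (an assumption made so the example is self-contained). With the default float dtype, the empty field comes back as nan:

```python
import io
import numpy as np

# Second row has an empty middle field
text = io.StringIO("1.0,2.0,3.0\n4.0,,6.0\n")

data = np.genfromtxt(text, delimiter=",")
print(data)
print(np.isnan(data[1, 1]))   # True
```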

## Next Time: NumPy Lab