# Basic `numpy`

`numpy` is a python package that is a real workhorse of machine learning and data science.

If you're new to python, this will be the first true package you'll import. First, we should check that you have the package installed, so try to run the following code chunks. (Note: if you installed the Anaconda platform, <a href="https://www.anaconda.com/">https://www.anaconda.com/</a>, `numpy` should already be installed.)

In [1]:
## it is standard to import numpy as np
import numpy as np

In [2]:
## let's check what version of numpy you have
## when I wrote this I had version 1.20.1
## yours may be different
print(np.__version__)

1.20.1


If you had a version of `numpy` installed, both of those code chunks should have run fine for you. If not, you will need to install it onto your machine because we will be using it heavily in the boot camp. For installation instructions check out the `numpy` documentation here, <a href="https://numpy.org/install/">https://numpy.org/install/</a>. 

##### Be sure you can run both of the above code chunks before continuing with this notebook. Again, it should be fine if your package version is slightly different from mine.

## `numpy`'s ndarray

While base python likes to store data in objects like `list`s and `tuple`s, `numpy` stores data in an `ndarray`. It is similar to a list, but has a number of features that make it more useful for numeric data manipulation in data science applications.

In [3]:
## You can make an array with np.array
## You just put np.array() around a python list or tuple
array1 = np.array([1,2,3,4])
print(array1)
print()
print(type(array1))

[1 2 3 4]

<class 'numpy.ndarray'>


`numpy` `ndarray`s can have any finite number of dimensions. A multidimensional array can be constructed by wrapping `np.array` around a `list` of `list`s.

In [4]:
## this produces a 2-dimensional array
## it is a 2x2 array
array2 = np.array([[1,2],[2,1]])
print(array2)
print()

## we can check the array's dimensions with np.shape()
## np.shape() returns a tuple with the size of each dimension
## array2 should be a 2 by 2 array
print("array2 is a",np.shape(array2),"ndarray")

[[1 2]
 [2 1]]

array2 is a (2, 2) ndarray


In [6]:
array2.shape

(2, 2)

In [8]:
## You code
## Try making a 2x2x2 array
np.array([[[2,2],[2,2]],[[1,1],[1,1]]]).shape


(2, 2, 2)

In [10]:
## You code 
## Try making a 2x2x2x2 array
np.array([[[[2,2], [2,2]], [[3,3], [3,3]]], [[[4,4],[4,4]] , [[5,5], [5,5]]]]).shape


(2, 2, 2, 2)
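
As a small sketch of how the nesting of the lists maps onto the dimensions, the attributes `ndim`, `shape` and `size` can be inspected on any array. The array `example` below is a made-up illustration, not one used elsewhere in this notebook.

```python
import numpy as np

# A hypothetical 2x3 array: a list holding two lists of length 3
example = np.array([[1, 2, 3], [4, 5, 6]])

print(example.ndim)   # number of dimensions (levels of nesting)
print(example.shape)  # size along each dimension
print(example.size)   # total number of entries
```

Each extra level of list nesting adds one entry to the shape tuple.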

### `ndarray` Functions

#### Vectorized Operations

`ndarray`s are nice because, for the most part, they work the way you'd expect a vector or matrix to work. Let's compare and contrast with python's `list`s.

In [11]:
## You code
## see what happens when you code up 2*list1
list1 = [1,2,3,4]

2*list1

[1, 2, 3, 4, 1, 2, 3, 4]

In [12]:
## You code
## Now compare it to 2*array1
2*array1

array([2, 4, 6, 8])

In [13]:
## You code
## what happens here
[1,2] + [3,4]

[1, 2, 3, 4]

In [14]:
## You code
## code up the comparable ndarray expression
## and see what happens
np.array([1,2]) + np.array([3,4])

array([4, 6])

In [15]:
## You code
## Finally what happens here?
y = 3*[1,2,3] + 2

TypeError: can only concatenate list (not "int") to list

In [16]:
## You code
## try the comparable ndarray expression
y = 3*np.array([1,2,3]) + 2

In [17]:
y

array([ 5,  8, 11])
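
To make the contrast concrete, here is a minimal sketch of what it takes to get the same result from a plain `list`. The variable names are chosen for illustration only.

```python
import numpy as np

values = [1, 2, 3]

# With a list, elementwise arithmetic needs an explicit loop or comprehension
list_result = [3 * v + 2 for v in values]

# With an ndarray, the arithmetic is applied to every entry at once
array_result = 3 * np.array(values) + 2

print(list_result)   # [5, 8, 11]
print(array_result)  # [ 5  8 11]
```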

### Built-In `numpy` Functions

`numpy` also has a number of built-in functions that provide useful mathematical operations on arrays. Let's look at some examples.

In [18]:
y = 2*np.array([1,2,3]) - 4
print(y)

[-2  0  2]


In [19]:
## absolute value
np.abs(y)

array([2, 0, 2])

In [20]:
## raising each entry to a power
np.power(y,3)

array([-8,  0,  8])

In [21]:
## the square root
np.sqrt(np.abs(y))

array([1.41421356, 0.        , 1.41421356])

In [23]:
## You code
## using np.exp define y to be 
## e^(x+3) + log(|x|+1)
## https://numpy.org/doc/stable/reference/generated/numpy.exp.html
## https://numpy.org/doc/stable/reference/generated/numpy.log.html
x = np.array([0,1,2,3,4])

y = np.exp(x+3) + np.log(np.abs(x) + 1)

y

array([  20.08553692,   55.29129721,  149.51177139,  404.81508785,
       1098.24259634])

In [24]:
## You can sum all of the entries of an array with
## np.sum
np.sum(y)

1727.946289722884

### Preset `numpy` Arrays

There are a number of standard arrays that you'll want to use, and `numpy` can generate them quickly.

In [25]:
## np.ones(shape) makes an array of all ones of the desired shape
print(np.ones(1))

print()

print(np.ones((4,10)))

print()

print(np.ones((2,2,2)))

[1.]

[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]


In [26]:
## You code
## np.zeros(shape) is similar to np.ones, but instead of 1s
## it makes an array of 0s
## print 3 arrays of zeros, 
## one that is a single dimension of size 4
print(np.zeros(4))
print()


## one that is 2x17
print(np.zeros((2,17)))
print()

## one that is 3x3x2x2
print(np.zeros((3,3,2,2)))

[0. 0. 0. 0.]

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[[[0. 0.]
   [0. 0.]]

  [[0. 0.]
   [0. 0.]]

  [[0. 0.]
   [0. 0.]]]


 [[[0. 0.]
   [0. 0.]]

  [[0. 0.]
   [0. 0.]]

  [[0. 0.]
   [0. 0.]]]


 [[[0. 0.]
   [0. 0.]]

  [[0. 0.]
   [0. 0.]]

  [[0. 0.]
   [0. 0.]]]]
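
The trailing dots in the printed output above show that `np.ones` and `np.zeros` produce floating point entries by default. As a short sketch, both functions accept a `dtype` argument if you would rather have integers. The array names here are hypothetical.

```python
import numpy as np

float_ones = np.ones((2, 3))             # default dtype is float64
int_zeros = np.zeros((2, 3), dtype=int)  # ask for integers instead

print(float_ones.dtype)
print(int_zeros.dtype)
print(int_zeros)
```

The exact integer type that `dtype=int` maps to can depend on your platform and `numpy` version.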


## `numpy` for Pseudorandomness

`numpy` is useful for generating pseudorandom numbers as well. We can look at common statistics of arrays too.

The pseudorandom functionality is stored in the `random` subpackage of `numpy`.

In [28]:
## random generators are stored in np.random
## np.random.random() gives a number selected uniformly
## at random from [0,1), i.e. 0 is possible but 1 is not
print(np.random.random())


print()

## You can get a random array of any shape as well
print("A (10,2) uniform random array:\n", np.random.random((10,2)))

0.5499297506671484

A (10,2) uniform random array:
 [[0.01067491 0.53354777]
 [0.90252621 0.40273817]
 [0.84062912 0.17308125]
 [0.1314052  0.3236986 ]
 [0.48533815 0.22934813]
 [0.33913055 0.28131486]
 [0.02762517 0.86202552]
 [0.42172206 0.02414466]
 [0.31588949 0.92778467]
 [0.93770192 0.87570943]]


In [34]:
## Another Example
## np.random.randn() is a normal(0,1) number
## a single draw
print(np.random.randn())
print()

## an array of draws
## note the slight difference here, we don't have to put
## the 10 and 2 in a tuple to get a 10 by 2 array
## numpy is slightly inconsistent in this area so always
## check the docs to get it right
np.random.randn(10,2)

1.453250436464136



array([[ 1.63456346,  1.09572331],
       [-0.9457799 ,  0.65247736],
       [ 1.66958501, -1.05074294],
       [ 0.79432778, -0.86461686],
       [-0.85120551,  0.03929545],
       [ 0.44185012, -0.59176059],
       [ 0.93422495, -1.77544061],
       [ 0.95318173, -0.55321241],
       [-0.24695777, -0.16656458],
       [-0.24696306, -2.21431291]])

In [35]:
## A third example
## np.random.binomial()
## an array of binomial(n,p) outcomes
np.random.binomial(n=4, p=.3, size=(10,10))

array([[1, 2, 0, 0, 1, 2, 2, 1, 1, 1],
       [0, 2, 3, 1, 2, 0, 2, 2, 1, 0],
       [1, 2, 1, 0, 2, 0, 2, 0, 2, 1],
       [1, 1, 2, 1, 0, 1, 1, 2, 0, 1],
       [0, 2, 2, 1, 3, 1, 0, 2, 1, 1],
       [1, 0, 3, 1, 1, 2, 3, 2, 2, 1],
       [0, 1, 1, 2, 2, 1, 0, 0, 1, 0],
       [1, 0, 2, 0, 1, 3, 0, 0, 0, 2],
       [0, 2, 1, 2, 0, 3, 0, 1, 0, 1],
       [2, 1, 1, 1, 2, 1, 1, 1, 0, 1]])
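
The draws above will differ every time the cells are run. If you want repeatable results, one option is a seeded generator created with `np.random.default_rng`, which is available in `numpy` 1.17 and later. This is a sketch of an alternative interface, not the one used in the rest of this notebook, and the seed `440` is an arbitrary choice.

```python
import numpy as np

# Two generators started from the same (arbitrary) seed
rng_a = np.random.default_rng(440)
rng_b = np.random.default_rng(440)

# Uniform draws on [0,1); the shape is passed as a tuple
draws_a = rng_a.random((2, 3))
draws_b = rng_b.random((2, 3))

# Same seed, same sequence of draws
print(np.array_equal(draws_a, draws_b))  # True

# Generator counterparts of randn and binomial
normal_draws = rng_a.standard_normal((10, 2))
binomial_draws = rng_a.binomial(n=4, p=0.3, size=(10, 10))
```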

In [36]:
## You code
## make a 20 by 3 array of random normal draws
## call it X
X = np.random.randn(20, 3)

In [37]:
X

array([[ 1.91689484, -1.39115083, -0.72579715],
       [ 1.08306514, -1.13944013, -0.13582487],
       [ 2.15741529,  0.02944624, -0.05572511],
       [-1.09636646,  0.84559134, -1.27105425],
       [ 0.30588328, -0.04391596,  0.62040803],
       [ 0.93042885,  0.4307268 ,  0.57669937],
       [-0.04064477,  0.89346606,  0.01282924],
       [-1.11980401,  0.94205175,  0.13892395],
       [-0.16325147, -0.68745897,  0.30545711],
       [-1.70344919,  1.75193739, -0.2871085 ],
       [-0.41905155,  0.01349732,  0.89064638],
       [-1.26419918,  0.27804207, -0.95909097],
       [ 0.86593183, -0.57811223, -0.88223366],
       [ 0.45018136, -0.07566053, -1.17863914],
       [-0.36492433,  1.13261636,  1.68142154],
       [ 1.05602769,  1.19335561, -0.60624916],
       [-1.47731755,  0.55081521,  0.01462613],
       [-0.49250544, -1.44140709,  1.10599684],
       [ 0.13218436,  0.75654061,  0.78062486],
       [ 0.27389533, -0.95449702, -0.60303777]])

Now that you have a data matrix, `X`, let's compute some summary statistics about `X`.

In [40]:
## You can get the mean of all the entries of X with np.mean
print("The overall mean of X is", np.mean(X))
print()

## Adding in the argument "axis = " allows you to get
## the mean of each column
print("The column means of X are", np.mean(X, axis=0))
print()

## and the mean of each row
print("The row means of X are", np.mean(X, axis=1))
print()

## the axis argument tells numpy the axis or axes along 
## which the means are computed.
## so axis = 0 moves down the rows: for each column it adds up
## the values in every row and divides by the number of rows,
## which gives one mean per column

## If you find this confusing, don't worry, I do too

The overall mean of X is 0.04932851453942241

The column means of X are [ 0.0515197   0.1253222  -0.02885636]

The row means of X are [-0.06668438 -0.06406662  0.7103788  -0.50727646  0.29412511  0.64595167
  0.28855018 -0.01294277 -0.18175111 -0.0795401   0.16169738 -0.64841603
 -0.19813802 -0.26803944  0.81637119  0.54771138 -0.30395874 -0.27597189
  0.55644994 -0.42787982]
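
Because `X` is random, it can be hard to check the `axis` argument by eye. Here is a small sketch with a made-up array `M` whose means are easy to verify by hand.

```python
import numpy as np

# A hypothetical array with 2 rows and 3 columns
M = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.mean(M))          # 3.5, the mean of all six entries
print(np.mean(M, axis=0))  # [2.5 3.5 4.5], one mean per column
print(np.mean(M, axis=1))  # [2. 5.], one mean per row
```

One way to remember it: the axis you name is the one that disappears. `M` has shape `(2, 3)`, so `axis=0` leaves 3 values and `axis=1` leaves 2.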



In [41]:
## You code
## np.sum also has an axis argument
## calculate the row sums and column sums of X
print("The row sums of X are", np.sum(X, axis=1))
print()

print("The column sums of X are", np.sum(X, axis=0))




The row sums of X are [-0.20005314 -0.19219986  2.13113641 -1.52182937  0.88237534  1.93785502
  0.86565054 -0.03882831 -0.54525333 -0.2386203   0.48509215 -1.94524808
 -0.59441407 -0.80411831  2.44911357  1.64313414 -0.91187621 -0.82791568
  1.66934982 -1.28363946]

The column sums of X are [ 1.03039402  2.506444   -0.57712715]


Other common functions are `np.std()` for standard deviation, `np.var()` for variance, `np.min()` for min, `np.max()` for max, `np.argmin()` for where the min occurs, `np.argmax()` for where the max occurs.
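
One detail worth knowing about `np.var()` and `np.std()`: by default they divide by the number of entries (the population version). Passing `ddof=1` divides by one less than the number of entries, which gives the usual sample version. A quick sketch with made-up data:

```python
import numpy as np

data = np.array([1, 2, 3, 4])

population_var = np.var(data)       # divides by n = 4
sample_var = np.var(data, ddof=1)   # divides by n - 1 = 3
population_std = np.std(data)       # square root of population_var

print(population_var)  # 1.25
print(sample_var)
print(population_std)
```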

In [43]:
## You code
## where does the max value occur in each column of X?
print("The location of the max in each column of X is",np.argmax(X, axis=0))


## what is the max value in each column of X?
print("The max in each column of X is", np.max(X, axis=0))

The location of the max in each column of X is [ 2  9 14]
The max in each column of X is [2.15741529 1.75193739 1.68142154]
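
Here is the same pair of calls on a small made-up array, so the answers can be checked by hand. It also shows what `np.argmax` returns when no `axis` is given: the position of the max in the flattened array.

```python
import numpy as np

# A hypothetical 2x3 array
M = np.array([[1, 9, 3],
              [7, 2, 8]])

print(np.argmax(M, axis=0))  # [1 0 1], row index of the max in each column
print(np.max(M, axis=0))     # [7 9 8]
print(np.argmax(M))          # 1, since 9 is entry 1 of [1, 9, 3, 7, 2, 8]
```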


In [44]:
## Another useful function is np.cumsum()

## randint(low, high, size) generates random integers from low
## (inclusive) up to high (exclusive), so 1 through 9 here.
## The third argument tells numpy how many random draws to perform
x = np.random.randint(1,10,10)

print(x)

np.cumsum(x)

## What do you think it does?

[8 5 8 9 1 1 1 8 9 2]


array([ 8, 13, 21, 30, 31, 32, 33, 41, 50, 52])
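
If you want to confirm your guess, here is a sketch with a fixed (made-up) array instead of random draws. Each entry of the output is the running total of the input up to that position.

```python
import numpy as np

steps = np.array([8, 5, 8, 9])
running_total = np.cumsum(steps)

print(running_total)  # [ 8 13 21 30]

# The last running total equals the sum of all the entries
print(running_total[-1] == np.sum(steps))  # True
```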

## Linear Algebra with `numpy`

A final important use of `numpy` for us is performing linear algebra calculations.

The bulk of data science algorithms use linear algebra. Since we'll dive into the math behind the scenes of these algorithms, we'll use `numpy`'s linear algebra capabilities.

##### Note: If you're not a math heavy person, that's okay! I have written the boot camp's notebooks so that you don't need to understand the math to learn how to perform the algorithms we cover. I just like to cover the mathematical aspects of these data science algorithms to explain what is going on to those boot campers (like myself) who are interested in the mathematical/statistical underpinnings of the algorithms.

In [45]:
## We can think of a 2D array as a matrix
A = np.random.binomial(n=10,p=.4,size=(2,2))

A

array([[1, 4],
       [4, 1]])

In [46]:
## A 1d array can be a row vector
x = np.array([1,2])
x

array([1, 2])

In [49]:
## or a column vector
## reshape() will attempt to reshape your array into the given
## shape

## When one of the shape dimensions is -1, the value is inferred from 
## the length of the array and remaining dimensions.
## so -1,1 tells numpy that you want a 2-D array with 1 column
## and it should infer the number of rows from the original shape
## of the array
## Here this reshapes x as a 2x1 column vector
x.reshape(-1,1)

array([[1],
       [2]])
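
As a small sketch of how the `-1` placeholder behaves, here it is used in either position on a made-up array with 6 entries.

```python
import numpy as np

v = np.arange(6)              # array([0, 1, 2, 3, 4, 5])

col = v.reshape(-1, 1)        # 1 column, so numpy infers 6 rows
two_rows = v.reshape(2, -1)   # 2 rows, so numpy infers 3 columns

print(col.shape)       # (6, 1)
print(two_rows.shape)  # (2, 3)
```

`reshape` returns a new array object and leaves the shape of `v` itself unchanged.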

In [50]:
## We can now calculate A*x
## matrix.dot() is used for matrix mult
A.dot(x.reshape(-1,1))

array([[9],
       [6]])
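
Besides `.dot()`, `numpy` also supports the `@` operator and `np.matmul` for matrix multiplication. The sketch below uses a fixed copy of the matrix printed above (named `A_example` here, since your random `A` may differ) and also shows that `*` is elementwise, not matrix multiplication.

```python
import numpy as np

# Fixed stand-ins for the A and x shown above
A_example = np.array([[1, 4],
                      [4, 1]])
x_col = np.array([1, 2]).reshape(-1, 1)

# Three equivalent ways to compute the matrix product
print(A_example.dot(x_col))
print(A_example @ x_col)
print(np.matmul(A_example, x_col))

# Elementwise product versus matrix product
print(A_example * A_example)  # squares each entry
print(A_example @ A_example)  # matrix product of A with itself
```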

In [51]:
## You code
## make a column vector of ones with 3 entries (shape (3,1)), call it x
x = np.ones((3,1))


## Take that vector and find B*x
B = np.random.binomial(n=5, p=.6, size=(3,3))


B.dot(x)

array([[9.],
       [8.],
       [8.]])

In [52]:
print(B)

print()

print(x)

[[4 3 2]
 [3 3 2]
 [3 3 2]]

[[1.]
 [1.]
 [1.]]


In [53]:
## numpy.linalg contains a number of useful
## matrix operations, let's import a few
from numpy.linalg import inv, eig, det

In [54]:
## the inverse of A
## Note you may get an error here if A is not
## invertible
inv(A)

array([[-0.06666667,  0.26666667],
       [ 0.26666667, -0.06666667]])

In [55]:
## the determinant of A
det(A)

-15.0

In [56]:
## the eigenvalues and eigenvectors of A
eig(A)

## this returns a tuple of arrays
## the first entry holds the eigenvalues
## the second entry holds the corresponding eigenvectors,
## stored as the columns of the array

(array([ 5., -3.]), array([[ 0.70710678, -0.70710678],
        [ 0.70710678,  0.70710678]]))
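
As a sanity check on how the output is laid out, the sketch below unpacks the result and verifies that each column `v` of the second array satisfies `A v = lambda v` for the matching eigenvalue. It uses a fixed copy of the matrix printed above, named `A_example` for illustration.

```python
import numpy as np
from numpy.linalg import eig

A_example = np.array([[1, 4],
                      [4, 1]])

eigenvalues, eigenvectors = eig(A_example)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]   # the i-th eigenvector is the i-th COLUMN
    # check A v against lambda v, up to floating point error
    print(np.allclose(A_example @ v, eigenvalues[i] * v))  # True
```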

In [57]:
## matrix.transpose() computes the transpose of the matrix
A.transpose()

array([[1, 4],
       [4, 1]])

In [58]:
## You code
b = np.array([2,5]).reshape(-1,1)

## Attempt to solve Ax = b for x
## Hint: remember that if A is invertible, 
## x = A^{-1} b, where A^{-1} is the inverse of A
x = inv(A).dot(b)



In [59]:
x

array([[1.2],
       [0.2]])
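
Computing the inverse works here, but `numpy` also provides `np.linalg.solve`, which solves `Ax = b` directly without forming the inverse. The sketch below assumes the same `A` and `b` values shown above, stored under illustrative names.

```python
import numpy as np

A_example = np.array([[1, 4],
                      [4, 1]])
b_example = np.array([2, 5]).reshape(-1, 1)

# Solve A x = b without explicitly inverting A
x_solve = np.linalg.solve(A_example, b_example)
print(x_solve)

# Check the solution by plugging it back in
print(np.allclose(A_example @ x_solve, b_example))  # True
```

Like `inv`, `np.linalg.solve` raises an error if the matrix is singular.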

## That's it!

That's it for this notebook. You have now been introduced to `numpy` and are ready to take on the practice problems. Be sure to get a fair level of comfort with `numpy`'s functionality because we'll be using it a lot.

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph. D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).