# Numpy

Numpy is a library for representing and working with large and multi-dimensional arrays. Most other libraries in the data-science ecosystem depend on numpy, making it one of the fundamental data science libraries.

Numpy provides a number of useful tools for scientific programming, and in this lesson, we'll take a look at some of the most common.

By convention, the `numpy` module is imported as `np`.

In [1]:
import numpy as np

## Indexing

Numpy provides an array type that goes above and beyond what Python's built-in lists can do.

We can create a numpy array by passing a list to the `np.array` function:

In [2]:
a = np.array([1, 2, 3])

In [3]:
a

array([1, 2, 3])

In [4]:
print(a)

[1 2 3]


We can create a multi-dimensional array by passing a list of lists to the `np.array` function:

In [5]:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [6]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [7]:
print(matrix)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
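
As a quick check on what `np.array` built, the sketch below inspects the array's `shape` and `ndim` attributes. It reuses the `matrix` definition from above and assumes nothing else.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# shape is a tuple of (rows, columns); ndim is the number of dimensions
print(matrix.shape)  # (3, 3)
print(matrix.ndim)   # 2
```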


At its most basic, referencing elements in numpy arrays is the same as referencing elements in Python lists.

In [8]:
a[0]

1

In [9]:
print(a)

[1 2 3]


In [10]:
print('a    == {}'.format(a))
print('a[0] == {}'.format(a[0]))
print('a[1] == {}'.format(a[1]))
print('a[2] == {}'.format(a[2]))

a    == [1 2 3]
a[0] == 1
a[1] == 2
a[2] == 3


In [11]:
print('matrix')
print(matrix)
print('\n')
print('matrix[0] == {}'.format(matrix[0]))
print('matrix[1] == {}'.format(matrix[1]))
print('matrix[2] == {}'.format(matrix[2]))

matrix
[[1 2 3]
 [4 5 6]
 [7 8 9]]


matrix[0] == [1 2 3]
matrix[1] == [4 5 6]
matrix[2] == [7 8 9]


Multidimensional numpy arrays are easier to index into than nested lists. To obtain the element at the second column in the second row, we would write:

In [12]:
matrix[1,1]

5

To get the first 2 elements of the last 2 rows:

In [13]:
matrix[1:, :2] 
# 1: Give me the rows starting from index position 1 onward
# :2 Give me the elements of the rows up to but not including index position 2

array([[4, 5],
       [7, 8]])
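
For comparison, here is a sketch of the same two lookups on a plain nested list. The names `nested`, `from_list`, and `from_array` are introduced only for this illustration.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
matrix = np.array(nested)

# Single element: a nested list needs one pair of brackets per level
assert nested[1][1] == matrix[1, 1] == 5

# First 2 elements of the last 2 rows: the list needs a comprehension
from_list = [row[:2] for row in nested[1:]]
from_array = matrix[1:, :2]

print(from_list)            # [[4, 5], [7, 8]]
print(from_array.tolist())  # [[4, 5], [7, 8]]
```

The array version expresses the row and column selection in a single pair of brackets, while the list version has to loop over the rows.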

Arrays can also be indexed with a boolean sequence used to indicate which values should be included in the resulting array.

In [14]:
a

array([1, 2, 3])

In [15]:
a[[True, False, True]]

array([1, 3])

In [16]:
should_include_elements_a = [True, False, True]
a[should_include_elements_a]

array([1, 3])
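
The same idea extends to the two-dimensional array from earlier. In this sketch, a one-dimensional mask with one entry per row selects whole rows; `keep_rows` and `selected` are names made up for the example.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One boolean per row: keep the first and last rows, drop the middle one
keep_rows = np.array([True, False, True])
selected = matrix[keep_rows]
print(selected.tolist())  # [[1, 2, 3], [7, 8, 9]]
```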

## Vectorized Operations

Another useful feature of numpy arrays is vectorized operations.

Suppose we want to add 1 to every element in a list. Without numpy, we can't simply add 1 to the list, as that results in a `TypeError`.

In [17]:
original_array = [1, 2, 3, 4, 5]
try:
    original_array + 1
except TypeError as e:
    print('An Error Occurred!')
    print(f'TypeError: {e}')

An Error Occurred!
TypeError: can only concatenate list (not "int") to list


In [18]:
# What does "can only concatenate list (not "int") to list" mean?
[1, 2, 3, 4, 5] + [1]

[1, 2, 3, 4, 5, 1]

We could write a `for` loop or a list comprehension:

In [19]:
original_array = [1, 2, 3, 4, 5]
array_with_one_added = []
for n in original_array:
    array_with_one_added.append(n+1)
print(array_with_one_added)

[2, 3, 4, 5, 6]


In [20]:
original_array = [1, 2, 3, 4, 5]
array_with_one_added = [n + 1 for n in original_array]
print(array_with_one_added)

[2, 3, 4, 5, 6]


Vectorizing operations means that operations are automatically applied to every element in a vector, which in our case will be a numpy array. So if we are working with a numpy array, we can simply add 1:

In [21]:
numpy_array = np.array([1, 2, 3, 4, 5])
numpy_array + 1

array([2, 3, 4, 5, 6])

In [22]:
print(numpy_array + 2)

[3 4 5 6 7]


This works the same way for the other basic arithmetic operators as well:

In [23]:
my_array = np.array([-3, 0, 3, 16])

print(f'my_array      == {my_array}')
print(f'my_array - 5  == {my_array - 5}')
print(f'my_array * 4  == {my_array * 4}')
print(f'my_array / 2  == {my_array / 2}')
print(f'my_array ** 2 == {my_array ** 2}')
print(f'my_array % 2  == {my_array % 2}')

my_array      == [-3  0  3 16]
my_array - 5  == [-8 -5 -2 11]
my_array * 4  == [-12   0  12  64]
my_array / 2  == [-1.5  0.   1.5  8. ]
my_array ** 2 == [  9   0   9 256]
my_array % 2  == [1 0 1 0]
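
The examples above combine an array with a single number. As a small sketch, the same operators also work element by element between two arrays of the same length; `prices`, `quantities`, and `totals` are invented names.

```python
import numpy as np

prices = np.array([2, 5, 10])
quantities = np.array([3, 1, 4])

# Each element is paired with the element at the same position
totals = prices * quantities
print(totals.tolist())                 # [6, 5, 40]
print((prices + quantities).tolist())  # [5, 6, 14]
```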


Not only are the arithmetic operators vectorized, but the same applies to the comparison operators.

In [24]:
my_array = np.array([-3, 0, 3, 16])

print(f'my_array       == {my_array}')
print(f'my_array == -3 == {my_array == -3}')
print(f'my_array >= 0  == {my_array >= 0}')
print(f'my_array < 10  == {my_array < 10}')

my_array       == [-3  0  3 16]
my_array == -3 == [ True False False False]
my_array >= 0  == [False  True  True  True]
my_array < 10  == [ True  True  True False]


If vectorizing a comparison operator produces a boolean array...

In [25]:
my_array

array([-3,  0,  3, 16])

In [26]:
print(my_array > 0)

[False False  True  True]


And if we can give an array some booleans to select the values we want to return...

In [27]:
my_array[[False, False, True, True]]

array([ 3, 16])

Then we can "give" an array a condition to select our values for us!

In [28]:
# Give me all the positive numbers in my_array:
my_array[my_array > 0]

array([ 3, 16])

## In-Depth Example

As another example, we could obtain all the even numbers like this:

In [29]:
my_array[my_array % 2 == 0]

array([ 0, 16])

To better understand how this all works, let's go through this last example in a little more detail.

The first expression that gets evaluated is this:

In [31]:
my_array

array([-3,  0,  3, 16])

In [30]:
my_array % 2

array([1, 0, 1, 0])

This results in an array of `1`s and `0`s. That array is then compared to `0` with the `==` operator, producing an array of `True` and `False` values.

In [32]:
# Each element of the array % 2
result = my_array % 2
print(f"My Array:              {my_array}")
print(f"My Array % 2 (Result): {result}")

My Array:              [-3  0  3 16]
My Array % 2 (Result): [1 0 1 0]


In [33]:
# Does each element % 2 == 0?
print(f"My Array:                   {my_array}")
print(f"My Array % 2 (Result):      {result}")
print(f"result == 0:                {result == 0}")

My Array:                   [-3  0  3 16]
My Array % 2 (Result):      [1 0 1 0]
result == 0:                [False  True False  True]


Lastly, we use this array of boolean values to index into the original array, which gives us only the values that are evenly divisible by 2.

In [34]:
step_1 = my_array % 2
step_2 = (step_1 == 0)
step_3 = my_array[step_2]

step_3

array([ 0, 16])

Put another way, here is how the expression is evaluated:

In [35]:
print('1. my_array[my_array % 2 == 0]')
print('    - the original expression')
print('2. my_array[{} % 2 == 0]'.format(my_array))
print('    - variable substitution')
print('3. my_array[{} == 0]'.format(my_array % 2))
print('    - result of performing the vectorized modulus 2')
print('4. my_array[{}]'.format(my_array % 2 == 0))
print('    - result of comparing to 0')
print('5. {}[{}]'.format(my_array, my_array % 2 == 0))
print('    - variable substitution')
print('6. {}'.format(my_array[my_array % 2 == 0]))
print('    - our final result')

1. my_array[my_array % 2 == 0]
    - the original expression
2. my_array[[-3  0  3 16] % 2 == 0]
    - variable substitution
3. my_array[[1 0 1 0] == 0]
    - result of performing the vectorized modulus 2
4. my_array[[False  True False  True]]
    - result of comparing to 0
5. [-3  0  3 16][[False  True False  True]]
    - variable substitution
6. [ 0 16]
    - our final result


In [36]:
my_array

array([-3,  0,  3, 16])

In [51]:
# I want to find element >= 0 AND element < 10
my_array[(my_array >= 0) & (my_array < 10)]

array([0, 3])

In [47]:
my_array[(my_array < 0) | (my_array > 3)]

array([-3, 16])
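
The two cells above combine conditions with `&` (and) and `|` (or). The sketch below adds `~` (not) and builds the masks in separate steps; the mask names are made up for the example. Python's `and`, `or`, and `not` keywords do not work element by element on arrays, which is why the symbol operators are used, and each comparison needs its own parentheses because `&` and `|` bind more tightly than `>=` or `<`.

```python
import numpy as np

my_array = np.array([-3, 0, 3, 16])

non_negative = my_array >= 0   # [False  True  True  True]
below_ten = my_array < 10      # [ True  True  True False]

both = non_negative & below_ten       # True where both masks are True
either = ~non_negative | ~below_ten   # True where at least one mask is False

print(my_array[both].tolist())    # [0, 3]
print(my_array[either].tolist())  # [-3, 16]
print(my_array[~both].tolist())   # [-3, 16]
```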

In [40]:
len(my_array)

4

In [41]:
my_array[[True, False]] # Error occurs because boolean list is not same length as original array

IndexError: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 2

## Array Creation

Numpy provides several functions for creating arrays. We'll take a look at a few of them.

### `np.random.randn`

`np.random.randn` can be used to create an array of specified length of random numbers drawn from the standard normal distribution.

> **Standard Normal Distribution**: Values range from -∞ to ∞ and are distributed symmetrically around the mean. The mean is 0 and the standard deviation is 1. The values represent z-scores: a z-score is the number of standard deviations a point lies from the mean.

In [52]:
np.random.randn(10)

array([ 0.64945023, -0.99424848,  0.85014126, -1.90811932,  0.80047233,
        0.30993757,  0.27937178, -0.37901226, -0.93886571,  1.0578365 ])

We can also pass a second argument to this function to define the shape of a two-dimensional array.

In [53]:
np.random.randn(3, 4) # rows, columns

array([[ 0.61315154, -0.34741033, -0.82469832,  0.56249778],
       [ 0.36552594, -1.18008419, -1.58001299,  1.1806903 ],
       [-0.7994737 ,  2.61932773,  0.54761305,  0.90549717]])
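
Because the numbers are random, the cells above produce different values on every run. One way to make results repeatable is to seed the generator first. This is a minimal sketch; the seed value `42` is arbitrary, and `first` and `second` are names chosen for the example.

```python
import numpy as np

np.random.seed(42)           # fix the starting state of the generator
first = np.random.randn(3)

np.random.seed(42)           # reset to the same state
second = np.random.randn(3)

# Same seed, same sequence of "random" numbers
print(np.array_equal(first, second))  # True
```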

If we wish to obtain original values from the z-scores of a normal distribution, we'll need to apply some arithmetic. 

To convert, we'll need to multiply by the standard deviation (*σ*) and add the mean (*μ*).

In [55]:
z_scores_iq = np.random.randn(20)
z_scores_iq

array([-0.05589698, -0.07562079, -0.13799158,  0.78125065, -0.54090876,
       -0.39154814,  0.15466032, -0.91951771, -0.30778483,  0.88913938,
       -0.25800198, -0.52405229, -0.73226258, -0.73735596,  0.14742082,
       -1.4014246 ,  1.71341362,  1.23406387,  1.0869313 ,  0.66683639])

In [57]:
μ = 100 # If we have a mean of 100...
σ = 15 # ...and a standard deviation of 15...

raw_iq_scores = σ * z_scores_iq + μ # We can derive the unscaled values
raw_iq_scores

array([ 99.16154525,  98.86568822,  97.93012632, 111.71875972,
        91.88636855,  94.12677787, 102.31990486,  86.20723432,
        95.38322756, 113.33709068,  96.12997026,  92.13921565,
        89.0160613 ,  88.93966055, 102.21131232,  78.97863097,
       125.70120434, 118.51095805, 116.30396951, 110.00254585])
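
Going the other way is the same arithmetic in reverse: subtract the mean, then divide by the standard deviation. The sketch below checks that the round trip recovers the original z-scores. The seed is arbitrary, and `mu` and `sigma` stand in for *μ* and *σ*.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the example is repeatable
z_scores = np.random.randn(20)

mu, sigma = 100, 15
raw_scores = sigma * z_scores + mu

# Undo the scaling: z = (x - mu) / sigma
recovered = (raw_scores - mu) / sigma
print(np.allclose(recovered, z_scores))  # True
```

`np.allclose` is used instead of `==` because floating-point arithmetic can introduce tiny rounding differences.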

### `np.zeros`, `np.ones`, `np.full`

The `zeros` function provides the ability to create an array of a specified size full of `0`s: 

In [58]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The `ones` function does the same thing, but full of `1`s instead:

In [59]:
np.ones(5)

array([1., 1., 1., 1., 1.])

The `full` function allows you to specify a value:

In [60]:
np.full(3, 17)

array([17, 17, 17])

We can also use these functions to create multi-dimensional arrays by passing a tuple of the desired dimensions instead of a single integer.

In [64]:
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [65]:
np.ones((5,5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [67]:
np.full((3, 4), 'hello')

array([['hello', 'hello', 'hello', 'hello'],
       ['hello', 'hello', 'hello', 'hello'],
       ['hello', 'hello', 'hello', 'hello']], dtype='<U5')
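
Notice that `zeros` and `ones` produced floats (`0.`, `1.`), while `full` took its type from the fill value. As a sketch, the `dtype` argument requests a specific type; the variable names are chosen for the example.

```python
import numpy as np

default_zeros = np.zeros(3)               # float64 by default
int_zeros = np.zeros(3, dtype=int)        # ask for integers instead
float_full = np.full(3, 17, dtype=float)  # override the fill value's type

print(default_zeros.dtype)  # float64
print(int_zeros)            # [0 0 0]
print(float_full)           # [17. 17. 17.]
```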

### `np.arange`

Numpy's `arange` function is very similar to Python's built-in `range` function. It can take a single argument and generate a range from zero up to, but not including, the passed number.

In [68]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We can also specify a starting point for the range:

In [69]:
np.arange(1, 10)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

As well as a step:

In [70]:
np.arange(1, 10, 2)

array([1, 3, 5, 7, 9])

Unlike Python's built-in `range`, numpy's `arange` can handle decimal numbers:

In [71]:
np.arange(3, 5, 0.5)

array([3. , 3.5, 4. , 4.5])

### `np.linspace`

The `linspace` function creates an array of evenly spaced numbers between a minimum and a maximum, with a set number of elements.

In [72]:
# min: 1, max: 10, length: 4
np.linspace(1, 10, 4)

array([ 1.,  4.,  7., 10.])

In [75]:
# min: 1, max: 10, length: 7
np.linspace(1, 10, 7)

array([ 1. ,  2.5,  4. ,  5.5,  7. ,  8.5, 10. ])

> Note: Here the maximum is **inclusive**
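
If the maximum should be left out, `linspace` accepts `endpoint=False`. A short sketch, reusing the bounds from above; `inclusive` and `exclusive` are names chosen for the example.

```python
import numpy as np

inclusive = np.linspace(1, 10, 4)                  # 10 is the last value
exclusive = np.linspace(1, 10, 4, endpoint=False)  # stops short of 10

print(inclusive.tolist())  # [1.0, 4.0, 7.0, 10.0]
print(exclusive.tolist())  # [1.0, 3.25, 5.5, 7.75]
```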

## Array Methods

Numpy arrays also come with built-in methods to make many mathematical operations easier.

Some of the most common are:

### `.min`

In [76]:
a = np.array([1, 2, 3, 4, 5])

In [77]:
a.min()

1

### `.max`

In [78]:
a.max()

5

### `.mean`

In [79]:
a.mean()

3.0

### `.sum`

In [80]:
a.sum()

15
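
On a multi-dimensional array, these methods reduce over every element by default. As a sketch, the `axis` argument reduces along one dimension instead; `matrix` is the 3×3 array from the Indexing section.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(matrix.sum())                 # 45, every element
print(matrix.sum(axis=0).tolist())  # [12, 15, 18], one total per column
print(matrix.sum(axis=1).tolist())  # [6, 15, 24], one total per row
```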

### `.std`

In [81]:
a.std()

1.4142135623730951
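
By default, `.std()` computes the population standard deviation, dividing by the number of elements. The sketch below passes `ddof=1` to divide by one fewer, which gives the sample standard deviation; the variable names are chosen for the example.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

population_std = a.std()    # divides by n
sample_std = a.std(ddof=1)  # divides by n - 1

print(population_std)
print(sample_std)
```

The sample version is always a little larger, and the gap shrinks as the array grows.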

# Exercises!

In your numpy-pandas-visualization-exercises repo, create a file named `numpy_exercises.py` for this exercise.

Use the following code for the questions below:

In [None]:
a = np.array([4, 10, 12, 23, -2, -1, 0, 0, 0, -6, 3, -7])

1. How many negative numbers are there?

2. How many positive numbers are there?

3. How many even positive numbers are there?

4. If you were to add 3 to each data point, how many positive numbers would there be?

5. If you squared each number, what would the new mean and standard deviation be?

6. A common statistical operation on a dataset is centering. This means to adjust the data such that the mean of the data is 0. This is done by subtracting the mean from each data point. Center the data set. See [this link](https://www.theanalysisfactor.com/centering-and-standardizing-predictors/) for more on centering.

7. Calculate the z-score for each data point. 

8. Copy the setup and exercise directions from [More Numpy Practice](https://gist.github.com/ryanorsinger/c4cf5a64ec33c014ff2e56951bc8a42d) into your `numpy_exercises.py` and add your solutions.

**Awesome Bonus** For much more practice with numpy, go to https://github.com/rougier/numpy-100 and clone the repo down to your laptop. To clone a repository:

- Copy the SSH address of the repository.
- `cd ~/codeup-data-science`
- Then type `git clone git@github.com:rougier/numpy-100.git`.
- Now do `cd numpy-100` in your terminal.
- Type `git remote remove origin`, so you won't accidentally try to push your work to Rougier's repo.

Congratulations! You have cloned Rougier's 100 numpy exercises to your computer. Now you need to make a new, blank, repository on GitHub.

Go to https://github.com/new to make a new repo. Name it numpy-100.
DO NOT check any check boxes. We need a blank, empty repo.
Finally, follow the directions to "push an existing repository from the command line" so that you can push up your changes to your own account.

Now do work, add it, commit it, and push it!