# Numpy

- https://numpy.org/
- https://numpy.org/devdocs/user/quickstart.html

## Intro

### Overview of Data Science Libraries

Read: https://ds.codeup.com/python/ds-libraries-overview/

Two libraries that are not covered here, but that you will use throughout the rest of the course, are Scikit-Learn for machine learning and scipy.stats for statistical testing.

### About Numpy

- Numpy is one of the main reasons why Python is so powerful and popular for scientific computing

- It is super fast. Numpy arrays are implemented in C, so operations on them run as compiled code rather than in the Python interpreter.

- It is the most popular linear algebra library for Python. 

- It provides loop-like behavior without the overhead of loops or list comprehensions (vectorized operations).

- It provides list + loop + conditional behavior for filtering arrays. 
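
The last two points can be seen side by side. This is a minimal sketch comparing plain Python with a numpy equivalent; the sample values and variable names are arbitrary choices for illustration.

```python
import numpy as np

values = [3, 8, 15, 22]      # arbitrary sample data
arr = np.array(values)

# loop-like behavior: add 10 to every item
print([v + 10 for v in values])   # [13, 18, 25, 32]
print(arr + 10)                   # [13 18 25 32]

# list + loop + conditional behavior: keep items greater than 10
print([v for v in values if v > 10])   # [15, 22]
print(arr[arr > 10])                   # [15 22]
```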

### In this Lesson you will learn about...

- Numpy Arrays
- Vectorization and Vectorized Operations
- Array Indexing or Slicing
- Boolean Masking
- Data types of array values

### By the end of this lesson, you should be able to...

- Create an n-dimensional array
- Access elements of an array using slicing notation
- Use built-in functions for common statistical and mathematical operations
- Perform vectorized arithmetic operations on arrays
- Filter Arrays using boolean masks


### Vocabulary

- Vectorized Operations: The concept of relational or arithmetic operators extended to vectors of any arbitrary length, where the comparison or calculation is performed on each item in the array. 

- Boolean Masking: Filtering the values of a numpy array by placing a condition inside the indexing brackets, [], or by building a boolean array (one that contains only True or False values) and passing that into the indexing brackets, [].

- Array: An array is a collection of items of the same data type stored at contiguous memory locations, starting at index 0. The array object in NumPy is called ndarray, for n-dimensional array.

- Scalar: A single value, also known as a 0-D array. Each element of an array is a scalar.

- 1-Dimensional Array: An array that has 0-D arrays/scalars as its elements. 

- 2-D Array: An array that has 1-D arrays as its elements. A matrix.

- 3-D Array: An array that has 2-D arrays (matrices) as its elements. 

- Slicing: An operation that extracts a subset of elements from an array and packages them as another array. 
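
The dimension vocabulary above can be checked with the `ndim` and `shape` attributes of an array. This is a minimal sketch, and the variable names are illustrative only.

```python
import numpy as np

scalar = np.array(7)                    # 0-D array: a single value
vector = np.array([1, 2, 3])            # 1-D array: scalars as elements
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])          # 2-D array: 1-D arrays as elements
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])     # 3-D array: 2-D arrays as elements

# print the number of dimensions and the shape of each array
for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("cube", cube)]:
    print(name, arr.ndim, arr.shape)
```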


### Agenda

1. Import Numpy
2. Numpy Speed
3. Create arrays
4. Access items in arrays using indexing, slicing
5. Built-in functions: a.sum(), a.mean(), a.min(), a.max(), a.std(), np.sqrt(a)
6. Vectorized Operations
7. Filtering Arrays using boolean masks
8. Array data types 

## Lesson

### Import Numpy

It is common practice to import numpy with the alias `np`.

In [2]:
import numpy as np

### Numpy Speed 

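As we said, numpy is super fast. This is because numpy arrays are implemented in C, and C is "closer to the metal" than Python. Assembly is closer to the metal than C, and processor instruction sets are the metal! In the comparison below, numpy squares about a million numbers inside compiled code, while the list comprehension runs each multiplication through the Python interpreter.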

In [3]:
%%timeit
# using base python
[x ** 2 for x in range(1, 1_000_000)]

456 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [4]:
%%timeit
# using numpy
np.arange(1, 1_000_000) ** 2

2.37 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
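
Outside a notebook the `%%timeit` magic is not available. A rough equivalent using the standard library's `timeit` module might look like the sketch below. The array size, the repeat count, and the function names are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than the cells above so the script finishes quickly

def squares_python():
    # base Python: one interpreted multiplication per item
    return [x ** 2 for x in range(1, n)]

def squares_numpy():
    # int64 is requested explicitly so the squares cannot overflow
    return np.arange(1, n, dtype=np.int64) ** 2

# both approaches produce the same values
assert squares_python() == squares_numpy().tolist()

python_seconds = timeit.timeit(squares_python, number=10)
numpy_seconds = timeit.timeit(squares_numpy, number=10)
print(f"list comprehension: {python_seconds:.4f} s for 10 runs")
print(f"numpy:              {numpy_seconds:.4f} s for 10 runs")
```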


### Create Arrays

#### Create an array from a list

Create a one-dimensional array from a single list. 

In [28]:
my_list = [1, 2, 3, 4]
a = np.array(my_list)
a

array([1, 2, 3, 4])

In [24]:
type(a)

numpy.ndarray

We can create a 2-D array by passing a list of lists to the array function. 

In [26]:
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

#### Create an array of random numbers drawn from the standard normal distribution

`np.random.randn(length_of_array)`

In [5]:
np.random.randn(10)

array([ 1.8707253 , -1.74472357,  0.42263513, -0.67222333,  2.2373063 ,
       -0.8381048 , -1.99730438, -1.74869295, -0.51244415,  1.87834654])

We can pass a second argument to this function to define the shape of a two-dimensional array. The first number is the number of rows in the matrix, and the second is the number of columns.

In [6]:
np.random.randn(10, 2)

array([[ 0.43826859, -1.64532416],
       [-0.874865  ,  0.43740969],
       [-2.10080282,  2.02945323],
       [-1.3101894 ,  0.17687527],
       [ 2.41827976, -0.46883036],
       [ 0.18304942,  1.68403767],
       [-0.47272747, -0.01230506],
       [ 0.21566883,  1.06585749],
       [ 0.34029544, -0.58652012],
       [ 0.79078578,  0.17249678]])
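
Because the values are random, the output differs on every run. For repeatable results the draws can be seeded. The sketch below uses `np.random.default_rng`, NumPy's newer random-number interface, instead of the `np.random.randn` call shown above; the seed value 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded generator; 42 is arbitrary
samples = rng.standard_normal((10, 2))   # 10 rows, 2 columns

print(samples.shape)   # (10, 2)

# re-creating the generator with the same seed reproduces the same draws
again = np.random.default_rng(42).standard_normal((10, 2))
print(np.array_equal(samples, again))   # True
```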

#### Create an array of a single value

The `zeros` and `ones` functions create arrays of a specified size filled with either 0s or 1s, and the `full` function creates an array of the specified size filled with a value of our choosing.

In [7]:
np.zeros(3)

array([0., 0., 0.])

We can also create multi-dimensional arrays by passing a tuple of the dimensions of the desired array, instead of a single integer value.

In [12]:
np.zeros((3, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [13]:
np.ones(3)

array([1., 1., 1.])

How can I make a 5 x 4 matrix of 1's?

Create an array with 3 items, all of the value 21. 

In [9]:
np.full(3, 21)

array([21, 21, 21])

Create a 3x2 matrix of -10. 
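
One possible answer to the two prompts above, assuming the shape is passed as a `(rows, columns)` tuple in the same way as for `np.zeros`. The variable names are illustrative.

```python
import numpy as np

ones_5x4 = np.ones((5, 4))          # 5 rows, 4 columns, every value 1.0
neg_tens = np.full((3, 2), -10)     # 3 rows, 2 columns, every value -10

print(ones_5x4.shape)   # (5, 4)
print(neg_tens.shape)   # (3, 2)
```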

#### Create an array of numbers in a designated range

Numpy's `arange` function is very similar to Python's built-in `range` function. It can take a single argument and generate a range from zero up to, but not including, the passed number.

In [16]:
np.arange(4)

array([0, 1, 2, 3])

Specify a starting point:

In [17]:
np.arange(1,4)

array([1, 2, 3])

Specify a step:

In [18]:
np.arange(1,10,2)

array([1, 3, 5, 7, 9])

The `linspace` function creates evenly spaced numbers between a minimum and a maximum (both included), with a set number of elements.

In [21]:
# min, max, length
np.linspace(1, 4, 4)

array([1., 2., 3., 4.])
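
Unlike `arange`, `linspace` includes the stop value by default and takes a count instead of a step. A small sketch of the difference, with illustrative variable names:

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)   # step of 0.25, stop value excluded
spaced = np.linspace(0, 1, 5)     # 5 evenly spaced values, stop value included

print(stepped)   # 0, 0.25, 0.5, 0.75
print(spaced)    # 0, 0.25, 0.5, 0.75, 1
```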

#### Create an array of random integers

`np.random.randint(start, stop)` returns a random integer between start (inclusive) and stop (exclusive). Passing a third argument, the size, returns an array of that many random integers.

In [25]:
# So the following line is like rolling a 6 sided die
x = np.random.randint(1, 7)
x

6
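
Called with only a start and a stop, `randint` returns a single integer, as in the cell above. To get an array, a `size` argument can be added. A sketch simulating ten rolls; the variable name is illustrative.

```python
import numpy as np

rolls = np.random.randint(1, 7, size=10)   # ten rolls of a six-sided die
print(rolls)

# every roll is between 1 and 6 inclusive
print(((rolls >= 1) & (rolls <= 6)).all())   # True
```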

### Access items in arrays using indexing, slicing

Referencing elements in numpy arrays at its most basic is the same as referencing elements in Python lists. To obtain the 2nd item in the array, we would write `a[1]`.

In [48]:
a = np.array([2,3,5,8,13])


To obtain the 2nd, 3rd, and 4th items, we would write `a[1:4]`. The starting index is inclusive and the ending index is exclusive. 

To obtain the 3rd item through the end of the array, we would write `a[2:]`. 

For a 2-D numpy array, or matrix, called `m`, to obtain the element in the second row (index = 1) and third column (index = 2), we would write `m[1,2]`.

In [42]:
m = np.array([[2,3,5],
              [8,13,21],
              [34,55,89]]
            )

To access the 2nd and 3rd rows of the 1st column of the matrix, we write `m[1:, 0:1]`. Because both indices are slices, the result is still a 2-D array, with two rows and one column.
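
The indexing expressions described in this section can be run against the `a` and `m` arrays defined above. The arrays are repeated here so the sketch runs on its own.

```python
import numpy as np

a = np.array([2, 3, 5, 8, 13])
m = np.array([[2, 3, 5],
              [8, 13, 21],
              [34, 55, 89]])

print(a[1])        # 3
print(a[1:4])      # [3 5 8]
print(a[2:])       # [ 5  8 13]
print(m[1, 2])     # 21
print(m[1:, 0:1])  # rows 2-3 of column 1, kept as a 2-D column: 8 and 34
print(m[1:, 0])    # same values as a 1-D array: [ 8 34]
```

Using an integer instead of a slice for the column (`0` rather than `0:1`) drops that dimension, which is why the last line returns a 1-D array.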

### Built-in methods and functions

Methods are called *on* the numpy object, so they begin with the numpy array variable name, which is `a` in this case. 
These are methods: a.sum(), a.mean(), a.min(), a.max(), a.std()

Functions begin with `np.` and the array is one of the function arguments, such as `np.sqrt(a)`. 

Some operations have both a method and a function, such as summing all items in an array. `np.sum(a)` and `a.sum()` do the same thing. 

In [49]:
a.sum(), a.mean(), a.min(), a.max(), a.std()

(31, 6.2, 2, 13, 3.9698866482558417)

In [51]:
np.sum(a)

31

In [52]:
np.median(a)

5.0

In [53]:
np.sqrt(a)

array([1.41421356, 1.73205081, 2.23606798, 2.82842712, 3.60555128])

`.all` -- every single element is `True`

In [107]:
# Are all the elements in a less than 10?
(a < 10).all()

False

`.any` -- at least one element is `True`

In [108]:
# Are there any negative numbers?
(a < 0).any()

False

### Vectorized Operations

If we wanted to add 1 to every element in a list without numpy, we couldn't simply add 1 to the list, as that results in a TypeError: 

`[1, 2, 3, 4, 5] + 1`

We would have to use a loop or a list comprehension:

In [57]:
my_list = [1, 2, 3, 4, 5]
new_list = [n + 1 for n in my_list]

Vectorizing operations means that operations are automatically applied to every element in a vector, which in our case will be a numpy array. So if we are working with a numpy array, we can simply add 1:

In [62]:
a = np.array(my_list)
a + 1

array([2, 3, 4, 5, 6])

Multiply by 2 with `a * 2` and reassign the result to `a`: 

In [63]:
# reassign a to hold the result of a * 2
a = a * 2
a

array([ 2,  4,  6,  8, 10])

Or...

In [64]:
a *= 2
a

array([ 4,  8, 12, 16, 20])

In [65]:
a ** 2

array([ 16,  64, 144, 256, 400])

Write an operation that divides each element by 2 and then adds 3. 

What happens if we subtract `a` from itself? 
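
One way to answer the two prompts above, using the array `a` as it stands after the previous cells. It is repeated here so the sketch is self-contained, and the result names are illustrative.

```python
import numpy as np

a = np.array([4, 8, 12, 16, 20])

halved_plus_3 = a / 2 + 3   # division is element-wise and returns floats
difference = a - a          # element-wise subtraction of matching positions

print(halved_plus_3)   # [ 5.  7.  9. 11. 13.]
print(difference)      # [0 0 0 0 0]
```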

Find even numbers

In [68]:
a = np.array([2, 3, 5, 8, 13, 21])
a % 2

array([0, 1, 1, 0, 1, 1])

The items with a value of 0 are even because they have no remainder. Those with a value of 1 are odd, because they have a remainder of 1. We can use a boolean mask to filter to just the even numbers. 

### Filtering Arrays using boolean masks

In [72]:
is_even = a % 2 == 0
is_even

array([ True, False, False,  True, False, False])

In [71]:
a[is_even]

array([2, 8])

**or**

In [73]:
a[a % 2 == 0]

array([2, 8])

Any expression that evaluates to an array of True and False values can be used as our filter. 

In [76]:
a = np.array([-3, 0, 3, 6, 9])
a > 0

array([False, False,  True,  True,  True])

It might help to read this as SQL in your head: "select a where a > 0"

In [78]:
a[a > 0]

array([3, 6, 9])

In [79]:
# select a where a == 3
a[a == 3]

array([3])

In [80]:
# select a where a != 3
a[a != 3]

array([-3,  0,  6,  9])

In the Python admissions test, there was a question called "remove_evens" where you write a function that removes evens. Using numpy, your function could look like: 

In [81]:
def remove_evens(x):
    x = np.array(x)
    return x[x % 2 == 1]

odds = remove_evens([2, 3, 5, 8, 13])
odds

array([ 3,  5, 13])

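Combine boolean arrays with `&` for "and" and `|` for "or". Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.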

For example, create an array of all positive, even numbers from the original array below. 

In [84]:
a = np.array([-3, 0, 1, 1, 2, 3, 5, 8, 13, 21])
new_a = a[(a > 0) & (a % 2 == 0)]
new_a

array([2, 8])

Create an array of all positive OR even numbers from the original array below. 

In [85]:
new_a = a[(a > 0) | (a % 2 == 0)]
new_a

array([ 0,  1,  1,  2,  3,  5,  8, 13, 21])
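
A sketch of why `&` is used here instead of Python's `and` keyword. Naming each mask first is one way to sidestep operator-precedence surprises; the mask names are illustrative.

```python
import numpy as np

a = np.array([-3, 0, 1, 1, 2, 3, 5, 8, 13, 21])

is_positive = a > 0
is_even = a % 2 == 0

# & combines the two masks element by element
print(a[is_positive & is_even])   # [2 8]

# Python's `and` does not work element-wise; it raises a ValueError
try:
    a[is_positive and is_even]
except ValueError as error:
    print("ValueError:", error)
```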

Negate a mask with `~`

In [100]:
a = np.array([-3, 0, 1, 1, 2, 3, 5, 8, 13, 21])
odds = a % 2 == 1
odds

array([ True, False,  True,  True, False,  True,  True, False,  True,
        True])

In [102]:
evens = ~ odds
evens

array([False,  True, False, False,  True, False, False,  True, False,
       False])

In [104]:
# these will all return the same thing, the even numbers 

a[~(a % 2 == 1)]

# a[~odds]

# a[evens]

# a[a % 2 == 0]


array([0, 2, 8])

### Array data types 

Every value in an array shares one data type. When the input mixes types, numpy picks the most general type that can hold every value. Think of it as the LCD...lowest common datatype. 

- Numbers can be converted to strings, so if any value is a string, the array's datatype will be a string type. 

- Integers can be converted to decimals (floats), so a mix of integers and decimals gives a float datatype. 

- The datatype is an integer type only when all values are integers. 

In [96]:
a = np.array([1, 2, '3', 4])
a, a.dtype

# <U21 = Unicode string of up to 21 characters

(array(['1', '2', '3', '4'], dtype='<U21'), dtype('<U21'))

In [98]:
a = np.array([1, 2.01, 3, 4])
a, a.dtype

(array([1.  , 2.01, 3.  , 4.  ]), dtype('float64'))

In [99]:
a = np.array([1, 2, 3, 4])
a, a.dtype

(array([1, 2, 3, 4]), dtype('int64'))
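
If an array ends up with a broader type than intended, it can be converted explicitly with the `astype` method. This is a minimal sketch: `astype` is not covered in the lesson above, the variable names are illustrative, and the conversion raises a `ValueError` if a string does not look like a number.

```python
import numpy as np

strings = np.array([1, 2, '3', 4])      # mixed input becomes a string array
print(strings.dtype.kind)               # U (unicode string)

numbers = strings.astype(int)           # convert each string to an integer
print(numbers, numbers.dtype.kind)      # [1 2 3 4] i

floats = np.array([1, 2.01, 3, 4])
truncated = floats.astype(int)          # the decimal part is dropped, not rounded
print(truncated)                        # [1 2 3 4]
```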

## More Examples

We can use numpy to answer some questions about a larger array. Here `a` holds 1,000 values; the cell that created it is not shown.

In [57]:
# 1. How many data points are there?
a.shape

(1000,)

In [61]:
# 2. How many data points are greater than 70? (.shape + .sum)
a[a > 70].shape

(311,)

In [66]:
# 2. How many data points are greater than 70?
(a > 70).sum()

311

In [68]:
# 3. What is the sum of the odd numbers?
a[a % 2 == 1].sum()

23401

In [78]:
# How many data points are less than 10?
a[a < 10].shape

(91,)

In [72]:
# 4. Take all the numbers between 30 and 80 (inclusive), square them, what is the highest resulting number?
(a[(a >= 30) & (a <= 80)] ** 2).max()

6400

In [79]:
# 4. Take all the numbers between 30 and 80 (inclusive), square them, what is the highest resulting number?
at_least_30 = a >= 30
at_most_80 = a <= 80

in_our_desired_range = at_least_30 & at_most_80

desired_numbers = a[in_our_desired_range]
desired_numbers_squared = desired_numbers ** 2

desired_numbers_squared.max()

6400

`np.where(condition, x, y)` produces values conditionally: it takes from `x` where the boolean array `condition` is True and from `y` everywhere else.

In [80]:
# 5. Square the odd numbers in the array. What is the average of the resulting data set? (np.where)
odd_numbers_squared = np.where(a % 2 == 1, a ** 2, a)

odd_numbers_squared.mean()

1595.057
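
The same `np.where` pattern is easier to follow on a handful of values. The array below is a small illustrative sample, not the lesson's data.

```python
import numpy as np

small = np.array([1, 2, 3, 4, 5])

# square the odd values, leave the even values unchanged
result = np.where(small % 2 == 1, small ** 2, small)

print(result)          # [ 1  2  9  4 25]
print(result.mean())   # 8.2
```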

In [83]:
# 6. Square the even numbers in the array. Remove any odd number less than 40.
#    Double odd numbers greater than 80. What is the sum of the resulting dataset?
evens_squared = np.where(a % 2 == 0, a ** 2, a)
# ~ negates the mask, so odd values below 40 are dropped rather than kept
x = evens_squared[~((evens_squared % 2 == 1) & (evens_squared < 40))]
# double only the odd values above 80; everything else is left unchanged
x = np.where((x % 2 == 1) & (x > 80), x * 2, x)
x.sum()
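
The three steps in question 6 can be traced by hand on a small array. The values below are hypothetical and chosen so that each rule applies at least once; they are not the lesson's data. Note that the removal step negates the mask with `~` so that matching values are dropped, not kept.

```python
import numpy as np

data = np.array([4, 7, 39, 41, 81, 83, 10])   # hypothetical sample values

# step 1: square the even numbers -> 16, 7, 39, 41, 81, 83, 100
step1 = np.where(data % 2 == 0, data ** 2, data)

# step 2: remove odd numbers below 40 -> 16, 41, 81, 83, 100
step2 = step1[~((step1 % 2 == 1) & (step1 < 40))]

# step 3: double odd numbers above 80 -> 16, 41, 162, 166, 100
step3 = np.where((step2 % 2 == 1) & (step2 > 80), step2 * 2, step2)

print(step3.sum())   # 485
```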