# NumPy

 - NumPy, which stands for Numerical Python, is a Python library used for scientific computing.
 - It provides an efficient multi-dimensional container for data (the array) and a collection of routines for processing those arrays.
 - It is one of the most fundamental and powerful packages for working with data in Python.

## Importing and Documentation

To use NumPy we first need to import it, because it is not part of core Python. Importing a library loads it into memory so that it is available for us to work with. To import NumPy, all you have to do is run the following line:

In [None]:
import numpy as np # usually we add the second part 'as np' so we can access it with 'np.command' instead of 'numpy.command'

IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature), as well as the documentation of various functions.

For example, to display all the contents of the numpy namespace, you can type np. and then press the TAB key:

In [None]:
# Try it here: (Might take a few seconds for the result to show up)
np.

And to display NumPy's built-in documentation, you can use this:

In [None]:
np?

More detailed documentation, along with tutorials and other resources, can be found at http://www.numpy.org.

## Creating NumPy Arrays

### From List

First, we can use np.array to create arrays from Python lists:

In [None]:
# integer array:
np.array([1, 4, 2, 5, 3])

NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will upcast if possible (here, integers are up-cast to floating point):

In [None]:
np.array([3.14, 4, 2, 3])
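
To check which type NumPy settled on, you can inspect the array's dtype attribute. The following is a small sketch; the variable names are chosen only for illustration, and the exact integer type reported depends on your platform:

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])

print(ints.dtype)   # an integer type (platform dependent, e.g. int64)
print(mixed.dtype)  # float64: the integers were upcast to floats

# The type can also be requested explicitly with the dtype argument
as_float = np.array([1, 2, 3], dtype=float)
print(as_float.dtype)  # float64
```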

NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:

In [None]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])

### From Scratch

Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy. Here are several examples:

In [None]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [None]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

In [None]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

In [None]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

In [None]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

In [None]:
# Create a 3x3 identity matrix
np.eye(3)

In [None]:
# Create an uninitialized array of three floats (the default dtype)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

#### Quick Exercise:

<font color=blue>Create a numpy array in a range from 1 to 101 (including 101) with a step of 10:</font>

In [None]:
# Your code goes here:


## Array Attributes

Let's start exploring some of the key aspects of NumPy arrays by defining three random arrays, a one-dimensional, two-dimensional, and three-dimensional array. We'll use NumPy's random number generator, which we will seed with a set value in order to ensure that the same random arrays are generated each time this code is run:

In [None]:
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the total size of the array):

In [None]:
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)

#### Quick Exercise

<font color=blue>Inspect the attributes of arrays x2 and x3:</font>

## Understanding NumPy Axes

NumPy axes are one of the harder things to understand in NumPy, particularly if you’re just getting started.

Don’t worry, it’s not you: many Python data science beginners struggle with this.

#### NUMPY AXES ARE LIKE AXES IN A COORDINATE SYSTEM

<tr>
<td> <img src="attachment:numpy-axes_cartesian-coordinate-example.png" width="100%"/> </td>
<td> <img src="attachment:numpy-axes_point-in-cartesian-coordinates-example.png" width="100%"/> </td>
</tr>

#### NUMPY AXES ARE THE DIRECTIONS ALONG THE ROWS AND COLUMNS

![numpy-arrays-have-axes.png](attachment:numpy-arrays-have-axes.png)

Understanding NumPy axes is important for both indexing and more advanced operations such as aggregation and concatenation, which we will explore later.
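
As a rough illustration of the idea, the sketch below builds a small two-dimensional array (the name table is just a placeholder) and steps along each axis in turn:

```python
import numpy as np

table = np.array([[0, 1, 2],
                  [3, 4, 5]])

print(table.shape)   # (2, 3): axis 0 has length 2, axis 1 has length 3

# Stepping along axis 0 moves down the rows
print(table[0, 0], table[1, 0])   # 0 3

# Stepping along axis 1 moves across the columns
print(table[0, 0], table[0, 1], table[0, 2])   # 0 1 2
```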

## Array Indexing and Slicing

### Accessing single element

If you are familiar with Python's standard list indexing, indexing in NumPy will feel quite familiar. In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:

In [None]:
print(x1)
print(x1[0])
print(x1[4])

To index from the end of the array, you can use negative indices:

In [None]:
print(x1[-1]) # Extract last element
print(x1[-2]) # Extract second last element

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

In [None]:
print(x2)
print(x2[0,0])
print(x2[2,0])
print(x2[2,-1])

Note: In NumPy arrays the axes are reversed relative to the standard (x, y) coordinate convention. The first index selects the row (the position along the y-axis) and the second index selects the column (the position along the x-axis).

### Accessing subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (:) character. The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:

x[start:stop:step]

If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1. We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

In [None]:
print(x1[:2]) # First two elements
print(x1[2:]) # Elements from index 2 onward
print(x1[2:4]) # Middle subarray
print(x1[::2]) # every other element
print(x1[1::2])  # every other element, starting at index 1
print(x1[::-1])  # all elements, reversed
print(x1[5::-2])  # reversed every other from index 5
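
The defaults for omitted slice values can be checked on a simple array. This sketch uses a throwaway array x; note that with a negative step the defaults for start and stop are effectively swapped, so the slice begins at the last element:

```python
import numpy as np

x = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]

# Omitted values fall back to the defaults
print(x[::])     # same as x[0:10:1]
print(x[3:])     # stop defaults to the end of the array
print(x[:3])     # start defaults to 0

# With a negative step, the slice starts at the last element and walks backwards
print(x[::-1])   # [9 8 7 6 5 4 3 2 1 0]
print(x[5::-2])  # [5 3 1]
```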

### Multidimensional Subarrays

In [None]:
print(x2[:2, :3])  # two rows, three columns
print(x2[:3, ::2])  # all rows, every other column

Finally, subarray dimensions can even be reversed together:

In [None]:
x2[::-1, ::-1]

#### Quick Exercise:

<font color=blue>Try accessing different elements and subarrays in the 3D array x3. It's harder to visualise, but once you get some experience it becomes as easy as with 2D arrays. </font>

In [None]:
print(x3)
x3[0, 0, 0] # For example, this accesses the first element in the first row of the first 2D array.

In [None]:
# Your code goes here:


Hint: To get your head around which axis each position in the index is responsible for, try x3[:, :, 0] and similar expressions with the colons in different positions. We will explore this notation further in the following section.

#### Accessing array rows and columns

One commonly needed routine is accessing of single rows or columns of an array. This can be done by combining indexing and slicing, using an empty slice marked by a single colon (:):

In [None]:
print(x2[:, 0])  # first column of x2
print(x2[0, :])  # first row of x2
print(x2[0])  # equivalent to x2[0, :]

## Array Reshaping

Another useful type of operation is reshaping of arrays. The most flexible way of doing this is with the reshape method. For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [None]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

Note that for this to work, the size of the initial array must match the size of the reshaped array. 
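
To see what happens when the sizes do not match, here is a small sketch. It also shows the -1 placeholder, which asks NumPy to infer one dimension from the total size:

```python
import numpy as np

values = np.arange(1, 10)   # 9 elements

# 9 elements fit a 3x3 grid
print(values.reshape((3, 3)))

# 9 elements do not fit a 2x4 grid, so NumPy raises a ValueError
try:
    values.reshape((2, 4))
except ValueError as err:
    print("reshape failed:", err)

# Passing -1 lets NumPy work out that dimension itself
inferred = values.reshape((3, -1))
print(inferred.shape)   # (3, 3)
```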

Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix. This can be done with the reshape method, or more easily by using np.newaxis within a slice operation:

In [None]:
x = np.array([1, 2, 3])

# row vector via reshape
print(x.reshape((1, 3)))

# row vector via newaxis
print(x[np.newaxis, :])

In [None]:
# column vector via reshape
print(x.reshape((3, 1)))

# column vector via newaxis
print(x[:, np.newaxis])

## Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We'll take a look at those operations here.

### Concatenation

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument, as we can see here:

In [None]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

You can also concatenate more than two arrays at once:

In [None]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

It can also be used for two-dimensional arrays:

In [None]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# concatenate along the first axis
print(np.concatenate([grid, grid]))

In [None]:
# concatenate along the second axis (zero-indexed)
print(np.concatenate([grid, grid], axis=1))

<tr>
<td> <img src="attachment:explanation_numpy-concatenate-axis-0.png" width="100%"/> </td>
<td> <img src="attachment:explanation_numpy-concatenate-axis-1.png" width="100%"/> </td>
</tr>

For working with arrays of mixed dimensions, it can be clearer to use the np.vstack (vertical stack) and np.hstack (horizontal stack) functions:

In [None]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])

In [None]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

### Splitting

The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of indices giving the split points:

In [None]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

Notice that N split points lead to N + 1 subarrays. The related functions np.hsplit and np.vsplit are similar:

In [None]:
grid = np.arange(16).reshape((4, 4))
grid

In [None]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

In [None]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

## NumPy Arrays vs. Python Lists

In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.  

NumPy arrays use **less memory**, are **faster**, and are **more convenient** than lists. Also, the arithmetic operators (add, subtract, multiply, divide, and exponentiate) do not perform element-wise calculations on Python lists, but they do on NumPy arrays.
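
The difference in how operators behave is easy to see side by side. In this sketch, py_list and np_array are illustrative names holding the same values:

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# On lists, + concatenates and * repeats
print(py_list + py_list)    # [1, 2, 3, 1, 2, 3]
print(py_list * 2)          # [1, 2, 3, 1, 2, 3]

# On arrays, the same operators act element by element
print(np_array + np_array)  # [2 4 6]
print(np_array * 2)         # [2 4 6]

# Element-wise arithmetic on a list needs an explicit loop or comprehension
doubled = [2 * v for v in py_list]
print(doubled)              # [2, 4, 6]
```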

For example, imagine we have an array of values and we'd like to compute the reciprocal of each. A straightforward approach might look like this:

In [None]:
import numpy as np
np.random.seed(0) # This ensures that the output of random number generator is always consistent.

def compute_reciprocals(values):
    output = np.empty(len(values)) # Requires a separate array to hold the result (more memory)
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

This implementation probably feels fairly natural to someone from, say, a C or Java background. But if we measure the execution time of this code for a large input, we see that this operation is very slow, perhaps surprisingly so! We'll benchmark this with IPython's **%timeit** magic:

In [None]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

The bottleneck is not the division itself but the type checking and function dispatch that Python performs on every pass through the loop. For many types of operations, NumPy provides a convenient interface to statically typed, compiled routines that avoid this overhead. This is known as a vectorized operation: you simply perform the operation on the array, and it is applied to each element.

In [None]:
print(compute_reciprocals(values)) # Standard approach adopted above
print(1.0 / values) # A vectorized operation

In [None]:
%timeit (1.0 / big_array)
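
The %timeit magic only works inside IPython or Jupyter. As a rough alternative for a plain Python script, the comparison can be sketched with the standard library's timeit module. Here loop_reciprocals is a copy of compute_reciprocals under a different (hypothetical) name, and the array is kept smaller so the loop finishes quickly; the timings themselves will differ from machine to machine:

```python
import timeit
import numpy as np

def loop_reciprocals(values):
    """Element-by-element version, mirroring compute_reciprocals above."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

sample = np.random.randint(1, 100, size=100_000)

# Both approaches produce the same numbers
print(np.allclose(loop_reciprocals(sample), 1.0 / sample))

# Run each approach three times and report the total time taken
loop_time = timeit.timeit(lambda: loop_reciprocals(sample), number=3)
vec_time = timeit.timeit(lambda: 1.0 / sample, number=3)
print(f"loop:       {loop_time:.4f} s")
print(f"vectorized: {vec_time:.4f} s")
```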

### Array Arithmetic

The following table lists the arithmetic operators implemented in NumPy:

| Operator | Equivalent ufunc | Description                         |
| ---      | ---              | ---                                 |
| +        | np.add           | Addition (e.g., 1 + 1 = 2)          |
| -        | np.subtract      | Subtraction (e.g., 3 - 2 = 1)       |
| -        | np.negative      | Unary negation (e.g., -2)           |
| *        | np.multiply      | Multiplication (e.g., 2 * 3 = 6)    |
| /        | np.divide        | Division (e.g., 3 / 2 = 1.5)        |
| //       | np.floor_divide  | Floor division (e.g., 3 // 2 = 1)   |
| **       | np.power         | Exponentiation (e.g., 2 ** 3 = 8)   |
| %        | np.mod           | Modulus/remainder (e.g., 9 % 4 = 1) |

The table covers only the basic arithmetic operations. NumPy provides many more optimized mathematical functions, such as trigonometric, exponential, and logarithmic ones. They all follow a very similar syntax and are easy to use. Explore them in your own time or when required.
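
As a brief sketch of how the operators and ufuncs relate, the example below applies a few of them to a small array. The names x and theta are illustrative only:

```python
import numpy as np

x = np.arange(4)  # [0 1 2 3]

# Each operator is shorthand for the corresponding ufunc call
print(x + 2, np.add(x, 2))
print(x ** 2, np.power(x, 2))
print(x // 2, np.floor_divide(x, 2))

# Other ufuncs follow the same element-wise pattern
theta = np.linspace(0, np.pi, 3)
print(np.sin(theta))
print(np.exp(x))
print(np.log(x + 1))  # x + 1 avoids taking log(0)
```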

## Broadcasting

We saw in the previous section how NumPy's functions can be used to vectorize operations and thereby remove slow Python loops. Another means of vectorizing operations is to use NumPy's broadcasting functionality. Broadcasting is simply a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes.

For arrays of the same size, binary operations are performed on an element-by-element basis:

In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b

Broadcasting allows these types of binary operations to be performed on arrays of different sizes. For example, we can just as easily add a scalar (think of it as a zero-dimensional array) to an array:

In [None]:
a + 5

We can think of this as an operation that stretches or duplicates the value 5 into the array [5, 5, 5], and adds the results. The advantage of NumPy's broadcasting is that this duplication of values does not actually take place, but it is a useful mental model as we think about broadcasting.
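
One way to make this mental model concrete is np.broadcast_to, which returns a read-only view with the stretched shape. The sketch below is an illustration rather than a description of what NumPy does internally during a + 5; the zero stride shows that the view does not store three separate copies of the value:

```python
import numpy as np

a = np.array([0, 1, 2])

# A read-only view of the scalar 5, stretched to the shape of a
stretched = np.broadcast_to(5, a.shape)
print(stretched)          # [5 5 5]
print(stretched.strides)  # (0,): every position points at the same value

print(a + stretched)      # [5 6 7], the same result as a + 5
```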

We can similarly extend this to arrays of higher dimension. Observe the result when we add a one-dimensional array to a two-dimensional array:

In [None]:
M = np.ones((3, 3))
M

In [None]:
M + a

Here the one-dimensional array a is stretched, or broadcast, across the first dimension (it is repeated for each row) in order to match the shape of M.

While these examples are relatively easy to understand, more complicated cases can involve broadcasting of both arrays. Consider the following example:

In [None]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

In [None]:
a + b

Just as before we stretched or broadcast one value to match the shape of the other, here we've stretched both a and b to match a common shape, and the result is a two-dimensional array! The geometry of these examples is visualized in the following figure:

![02.05-broadcasting.png](attachment:02.05-broadcasting.png)

The light boxes represent the broadcast values: again, this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.

### Rules of Broadcasting

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:

1. If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
2. If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
3. If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
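
The three rules can be sketched as a small function that works on shape tuples. The helper broadcast_shape is hypothetical and written only for illustration; it is not part of NumPy, which applies these rules internally:

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the three broadcasting rules to two shape tuples (illustrative helper)."""
    # Rule 1: pad the shorter shape with ones on its left side
    ndim = max(len(shape_a), len(shape_b))
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for size_a, size_b in zip(padded_a, padded_b):
        if size_a == size_b or size_b == 1:
            result.append(size_a)   # sizes agree, or b is stretched (rule 2)
        elif size_a == 1:
            result.append(size_b)   # a is stretched (rule 2)
        else:
            # Rule 3: the sizes disagree and neither is 1
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)

print(broadcast_shape((2, 3), (3,)))   # (2, 3)
print(broadcast_shape((3, 1), (3,)))   # (3, 3)

# NumPy reports the same shape for real arrays
print(np.broadcast(np.ones((3, 1)), np.arange(3)).shape)   # (3, 3)

# Incompatible shapes raise an error, as rule 3 describes
try:
    np.ones((3, 2)) + np.arange(3)
except ValueError as err:
    print("broadcast failed:", err)
```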

### Broadcasting example

Let's look at adding a two-dimensional array to a one-dimensional array:

In [None]:
M = np.ones((2, 3))
a = np.arange(3)
print(M)
print(a)

Let's consider an operation on these two arrays. <font color=blue> The shapes of the arrays are </font> :

In [None]:
# Determine the shape of the arrays using attributes we explored above:


We see by rule 1 that the array a has fewer dimensions, so we pad it on the left with ones:

- M.shape -> (2, 3)
- a.shape -> (1, 3)

By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:

- M.shape -> (2, 3)
- a.shape -> (2, 3)

The shapes match, and we see that the final shape will be (2, 3):

In [None]:
M + a

We will not go into much more detail about broadcasting in this workshop. However, make sure you explore it in your own time, as it is an extremely useful feature of NumPy and it is important to understand it.

## Aggregations: Min, Max, and Everything In Between

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.

### Summing the Values in an Array

As a quick example, consider computing the sum of all values in an array. Python itself can do this using the built-in sum function:

In [None]:
L = np.random.random(100)
sum(L)

The syntax is quite similar to that of NumPy's sum function, and the result is the same in the simplest case:

In [None]:
np.sum(L)

However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:

In [None]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to confusion! In particular, their optional arguments have different meanings, and np.sum is aware of multiple array dimensions, as we will see in the following section.
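
A short sketch of where the two functions diverge, using a small illustrative array called grid. The second positional argument means something different in each, and only np.sum treats a two-dimensional array as a whole:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Built-in sum: the second argument is a starting value
print(sum([1, 2, 3], 10))   # 16

# np.sum: the second argument is the axis to sum over
print(np.sum(grid, 1))      # [ 6 15]

# On a 2D array the built-in sum only adds the rows together
print(sum(grid))            # [5 7 9]
print(np.sum(grid))         # 21
```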

### Minimum and Maximum

Similarly, Python has built-in min and max functions, used to find the minimum value and maximum value of any given array:

In [None]:
min(big_array), max(big_array)

NumPy's corresponding functions have similar syntax, and again operate much more quickly:

In [None]:
np.min(big_array), np.max(big_array)

In [None]:
%timeit min(big_array)
%timeit np.min(big_array)

For min, max, sum, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:

In [None]:
print(big_array.min(), big_array.max(), big_array.sum())

Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!

### Multidimensional Aggregates

One common type of aggregation operation is an aggregate along a row or column. Say you have some data stored in a two-dimensional array:

In [None]:
M = np.random.random((3, 4))
print(M)

By default, each NumPy aggregation function will return the aggregate over the entire array:

In [None]:
M.sum()

Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. For example, we can find the minimum value within each column by specifying axis=0:

In [None]:
M.min(axis=0)

The function returns four values, corresponding to the four columns of numbers.

Similarly, we can find the maximum value within each row:

In [None]:
M.max(axis=1)

The way the axis is specified here can be confusing to users coming from other languages. The axis keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned. So specifying axis=0 means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.

<tr>
<td> <img src="attachment:numpy-axes-np-sum-axis-0.png" width="100%"/> </td>
<td> <img src="attachment:numpy-axes-np-sum-axis-1.png" width="100%"/> </td>
</tr>
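
Comparing shapes before and after an aggregation is a quick way to confirm which axis was collapsed. This is a minimal sketch with an illustrative array named data:

```python
import numpy as np

data = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns

col_sums = data.sum(axis=0)   # collapses axis 0 (the 3 rows)
row_sums = data.sum(axis=1)   # collapses axis 1 (the 4 columns)

print(data.shape)       # (3, 4)
print(col_sums.shape)   # (4,): one value per column
print(row_sums.shape)   # (3,): one value per row

# keepdims=True keeps the collapsed axis with length 1
print(data.sum(axis=0, keepdims=True).shape)   # (1, 4)
```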

NumPy provides many other aggregation functions, but we won't discuss them in detail here. Additionally, most aggregates have a NaN-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point NaN value.

| Function Name | NaN-safe Version | Description |
|       ---     |        ---       |     ---     |
| np.sum |np.nansum |Compute sum of elements |
| np.prod | np.nanprod | Compute product of elements |
| np.mean | np.nanmean | Compute mean of elements |
| np.std | np.nanstd | Compute standard deviation |
| np.var | np.nanvar | Compute variance |
| np.min | np.nanmin | Find minimum value |
| np.max | np.nanmax | Find maximum value |
| np.argmin | np.nanargmin | Find index of minimum value |
| np.argmax | np.nanargmax | Find index of maximum value |
| np.median | np.nanmedian | Compute median of elements |
| np.percentile | np.nanpercentile | Compute rank-based statistics of elements |
| np.any | N/A | Evaluate whether any elements are true |
| np.all | N/A | Evaluate whether all elements are true |
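
To illustrate the difference between an ordinary aggregate and its NaN-safe counterpart, here is a small sketch with one missing value. The array readings is made up for the example:

```python
import numpy as np

readings = np.array([1.0, 2.0, np.nan, 4.0])

# A single NaN makes the ordinary aggregate NaN as well
print(np.mean(readings))      # nan
print(np.max(readings))       # nan

# The NaN-safe versions skip the missing value
print(np.nanmean(readings))   # (1 + 2 + 4) / 3
print(np.nanmax(readings))    # 4.0
```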

#### Exercise

<font color=blue> Compute some basic statistics on the following array of heights: </font>

In [None]:
heights = np.array([189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178, 183, 193, 178, 173,
 174, 183, 183, 168, 170, 178, 182, 180, 183, 178, 182, 188, 175, 179, 183, 193, 182, 183,
 177, 185, 188, 188, 182, 185])

In [None]:
mean = None  # Your code here: replace each None with a calculation
std = None
mi = None
ma = None

print("Mean height:       ", mean)
print("Standard deviation:", std)
print("Minimum height:    ", mi)
print("Maximum height:    ", ma)

Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values. We may also wish to compute quantiles:

In [None]:
print("25th percentile:   ", None)  # Replace None with your calculation
print("Median:            ", None)
print("75th percentile:   ", None)

We see that the median height is 182 cm.

Of course, sometimes it's more useful to see a visual representation of this data, which we can accomplish using tools in Matplotlib:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot style

In [None]:
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');

These aggregates are some of the fundamental pieces of exploratory data analysis and should be considered before looking into any further data analysis.