# UFCFVQ-15-M Programming for Data Science
# Week 6 Jupyter Notebook 
# Introduction to NumPy


## Goals
This notebook has been created to familiarise you with the NumPy package. Most of the code needed to progress through this notebook has been provided for you. However, there are several coding tasks that you will need to complete yourself.

The topics in this notebook include:
* Creating NumPy Arrays
* Array Indexing and Slicing
* Array Reshaping
* Array Concatenation and Splitting
* Aggregations
* Comparisons and Masks
* Sorting

# Introduction to NumPy

Efficient storage and manipulation of numerical arrays is fundamental to the process of doing data science. In this week's notebook we will look at the specialized tools that Python has for handling such numerical arrays: the NumPy package. Next week, we will focus on the Pandas package, which provides even more techniques for effectively loading, storing, and manipulating in-memory data in Python.

NumPy (short for *Numerical Python*) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python's built-in ``list`` type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.

The NumPy package is available in the Anaconda installation and so you simply need to import the package to access its tools and capabilities. By convention, you'll find that most people in the data science world will import NumPy using ``np`` as an alias.

In [1]:
import numpy as np

## What is an array?
The array is the central data structure of the NumPy library. It is a grid of values, together with information about the raw data, how to locate an element, and how to interpret an element. The elements are all of the same type, referred to as the array `dtype`. An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers. The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.

## Fixed-Type Arrays in Python

Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in ``array`` module can be used to create dense arrays of a uniform type:

In [2]:
import array
L = list(range(10))
A = array.array('i', L)
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here ``'i'`` is a type code indicating the contents are integers. More information about the ``array`` module can be found at https://docs.python.org/3/library/array.html.

Much more useful, however, is the ``ndarray`` object of the NumPy package. While Python's ``array`` object provides efficient storage of array-based data, NumPy adds to this efficient *operations* on that data.
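As a rough illustration of that difference (a minimal sketch; the variable names are chosen here for illustration only): multiplying a built-in ``array`` by an integer repeats its contents, whereas the same expression on an ``ndarray`` multiplies each element.

```python
import array
import numpy as np

values = [0, 1, 2, 3]

# The built-in array type stores values compactly, but `*` means repetition
builtin_arr = array.array('i', values)
repeated = builtin_arr * 2

# A NumPy ndarray applies arithmetic to every element
numpy_arr = np.array(values)
doubled = numpy_arr * 2

print(repeated)
print(doubled)
```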

## Creating Arrays from Python Lists

First, we can use ``np.array`` to create arrays from Python lists:

In [3]:
# integer array:
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

Remember that, unlike Python lists, a NumPy array is constrained to contain elements of a single type.
If types do not match, NumPy will upcast if possible (here, integers are up-cast to floating point):

In [4]:
np.array([3.14, 4, 2, 3])

array([3.14, 4.  , 2.  , 3.  ])

If we want to explicitly set the data type of the resulting array, we can use the ``dtype`` keyword:

In [5]:
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

Finally, unlike Python lists, NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:

In [6]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

The inner lists are treated as rows of the resulting two-dimensional array.

## Creating Arrays from Scratch

Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy.
Here are several examples:

In [7]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [8]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [9]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [10]:
# Create an array filled with a linear sequence
# Starting at 0, stopping before 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [11]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
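The two routines above differ in how they treat the end of the interval. The following sketch (with illustrative variable names) compares them: ``np.arange`` takes a step and excludes the stop value, while ``np.linspace`` takes a number of points and includes the stop value unless ``endpoint=False`` is passed.

```python
import numpy as np

# arange takes a step and excludes the stop value
by_step = np.arange(0, 1, 0.25)

# linspace takes a count and includes the stop value by default
by_count = np.linspace(0, 1, 5)

# endpoint=False makes linspace exclude the stop value as well
open_ended = np.linspace(0, 1, 4, endpoint=False)

print(by_step)
print(by_count)
print(open_ended)
```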

In [12]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[0.84649383, 0.09953096, 0.69149733],
       [0.84077703, 0.83377495, 0.89798749],
       [0.46798941, 0.66298315, 0.92326664]])

In [12]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 1.04161222, -0.08822731, -0.78186883],
       [-2.30934717, -0.25847882,  1.20535706],
       [ 1.07504882, -0.21927239,  0.86879273]])

In [13]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[4, 8, 1],
       [8, 2, 0],
       [7, 0, 4]])
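NumPy also offers a generator-based interface to the same kinds of random routines, created with ``np.random.default_rng``. The sketch below mirrors the three random examples above using that interface; the seed value is arbitrary and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily for this sketch

uniform = rng.random((3, 3))                 # uniform values in [0, 1)
normal = rng.normal(0, 1, size=(3, 3))       # mean 0, standard deviation 1
integers = rng.integers(0, 10, size=(3, 3))  # integers in [0, 10)

print(uniform.shape, normal.shape, integers.shape)
```

Passing a seed makes the generated values repeatable from run to run, which is useful when results need to be reproduced.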

In [14]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [15]:
# Create an uninitialized array of three floats (the default dtype)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

array([1., 1., 1.])

## NumPy Standard Data Types

NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. The standard NumPy data types are listed in the following table. Note that when constructing an array, they can be specified using a string (e.g. `np.zeros(10, dtype='int16')`) or using the associated NumPy object (e.g. `np.zeros(10, dtype=np.int16)`):

| Data type     | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64`` (alias removed in NumPy 2.0; use ``float64``)| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128`` (alias removed in NumPy 2.0; use ``complex128``)| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 
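The ranges in the table can be checked with ``np.iinfo``, and they matter in practice: integer arithmetic on arrays wraps around silently when a result falls outside the range of the dtype. A small sketch, using ``int8`` as the example type:

```python
import numpy as np

# Query the limits of a fixed-width integer type
info = np.iinfo(np.int8)
print(info.min, info.max)

# Adding 1 to the largest int8 value overflows and wraps around
a = np.array([127], dtype=np.int8)
wrapped = a + np.array([1], dtype=np.int8)
print(wrapped)
```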

### <font color='red'><u>Worksheet Exercises</u></font>
1. Create a one-dimensional NumPy array with 100 32-bit unsigned integers all initialised to zero
2. Create a one-dimensional NumPy array with the even numbers between 1 and 20 in reverse order, e.g. 20, 18, 16, 14,...
3. Create a one-dimensional NumPy array with 12 evenly spaced values between 1 and 100
4. Create a two-dimensional NumPy array with 5 rows and 2 columns of double precision float values all initialised to 1
5. Create a one-dimensional NumPy array with 10 random numbers generated from a normal distribution with a mean 100 and standard deviation 15
6. Create a three-dimensional NumPy array with 3x3x3 elements each initialised with the value 10
7. Create a two-dimensional NumPy array with 3x2 8-bit integer values using the following given values (hint: use a List or Tuple):
```
1 6
2 5
4 3
```

In [23]:
# add your exercise solutions here
#1 Create a one-dimensional NumPy array with 100 32-bit unsigned integers all initialised to zero
np.zeros(100, dtype='uint32')


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)

In [26]:
#2 Create a one-dimensional NumPy array with the even numbers between 1 and 20 in reverse order, e.g. 20, 18, 16, 14,...
np.arange(20,0,-2)



array([20, 18, 16, 14, 12, 10,  8,  6,  4,  2])

In [25]:
#3 Create a one-dimensional NumPy array with 12 evenly spaced between 1 and 100
np.linspace(1,100,12)


array([  1.,  10.,  19.,  28.,  37.,  46.,  55.,  64.,  73.,  82.,  91.,
       100.])

In [61]:
#4  Create a two-dimensional NumPy array with 5 rows and 2 columns of double precision float values all initialised to 1
np.ones((5,2),dtype='float64')

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [27]:
#5 Create a one-dimensional NumPy array with 10 random numbers generated from
# a normal distribution with a mean 100 and standard deviation 15
np.random.normal(100, 15, 10)

array([ 92.2074109 , 106.16124599, 102.83467557,  89.91778203,
       103.61800594, 113.60014579,  94.39836809,  79.71791892,
       104.69721263, 109.87637852])

In [39]:
#6 Create a three-dimensional NumPy array with 3x3x3 elements each initialised with the value 10
np.full((3,3, 3), 10)

array([[[10, 10, 10],
        [10, 10, 10],
        [10, 10, 10]],

       [[10, 10, 10],
        [10, 10, 10],
        [10, 10, 10]],

       [[10, 10, 10],
        [10, 10, 10],
        [10, 10, 10]]])

In [4]:
#7 Create a two-dimensional NumPy array of 8-bit integer values from the given values (hint: use a List or Tuple)
np.array([[1, 6], [2, 5], [4, 3]], dtype='int8')

array([[1, 6],
       [2, 5],
       [4, 3]], dtype=int8)

# Basic NumPy Operations

Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array. This section will present several examples of using NumPy array manipulation to access data and subarrays, and to split, reshape, and join the arrays. We'll cover a few categories of basic array manipulations here:

- *Attributes of arrays*: Determining the size, shape, memory consumption, and data types of arrays
- *Indexing of arrays*: Getting and setting the value of individual array elements
- *Slicing of arrays*: Getting and setting smaller subarrays within a larger array
- *Reshaping of arrays*: Changing the shape of a given array
- *Joining and splitting of arrays*: Combining multiple arrays into one, and splitting one array into many

## NumPy Array Attributes

First let's discuss some useful array attributes.
We'll start by defining three random arrays, a one-dimensional, two-dimensional, and three-dimensional array.
We'll use NumPy's random number generator, which we will *seed* with a set value in order to ensure that the same random arrays are generated each time this code is run:

In [48]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

Each array has attributes ``ndim`` (the number of dimensions), ``shape`` (the size of each dimension), ``size`` (the total number of elements in the array), ``dtype`` (the data type of the array), ``itemsize`` (the size in bytes of each array element), and ``nbytes`` (the total size in bytes of the array):

In [49]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60
dtype: int32
itemsize: 4 bytes
nbytes: 240 bytes
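These attributes are related: ``nbytes`` is ``size`` multiplied by ``itemsize``. The sketch below checks this on an array of the same shape, with the dtype fixed explicitly because the default integer type (``int32`` in the output above) can differ between platforms and NumPy versions.

```python
import numpy as np

# dtype is fixed so the sizes are the same on every platform
arr = np.zeros((3, 4, 5), dtype=np.int64)

print(arr.ndim, arr.shape, arr.size)
print(arr.itemsize, arr.nbytes)

# total bytes = number of elements * bytes per element
print(arr.nbytes == arr.size * arr.itemsize)
```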


## Array Indexing: Accessing Single Elements

In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:

In [50]:
print(x1) # all elements
print(x1[0]) # 1st value
print(x1[4]) # 5th value

[5 0 3 3 7 9]
5
7


To index from the end of the array, you can use negative indices:

In [51]:
x1[-1] # the last element

9

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

In [52]:
print(x2)
print(x2[0,0]) # top left corner
print(x2[-1, -1]) # bottom right corner

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]
3
7


Values can also be modified using any of the above index notation:

In [74]:
x2[0, 0] = 12
print(x2)

[[12  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]


Keep in mind that, unlike Python lists, NumPy arrays have a fixed type.
This means, for example, that if you attempt to insert a floating-point value to an integer array, the value will be silently truncated. Don't be caught unaware by this behavior!

In [53]:
x1[0] = 3.14159  # this will be truncated, i.e. it loses the .14159 decimal part!
print(x1)

[3 0 3 3 7 9]
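If the fractional part needs to be kept, one option is to convert the array to a floating-point type before assigning. A minimal sketch, using ``astype`` and illustrative variable names:

```python
import numpy as np

ints = np.array([5, 0, 3])
ints[0] = 3.14159            # stored as 3 in an integer array

floats = ints.astype(float)  # astype returns a new array with the requested dtype
floats[0] = 3.14159          # the fractional part is now preserved

print(ints)
print(floats)
```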


## Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character. The NumPy slicing syntax follows the same syntax used for Python lists and strings. To access a slice of an array ``x``, use `x[start:stop:step]`. We'll take a look at accessing subarrays in one dimension and in multiple dimensions.

### One-dimensional subarrays

In [75]:
x = np.arange(10) # arange() is NumPy's version of range()
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [27]:
x[:5]  # first five elements

array([0, 1, 2, 3, 4])

In [28]:
x[5:]  # elements from index 5 onwards

array([5, 6, 7, 8, 9])
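The ``start``, ``stop`` and ``step`` parts of the slice can be combined freely. A few further cases, sketched on the same kind of array (variable names are illustrative):

```python
import numpy as np

x = np.arange(10)

middle = x[4:7]        # indices 4, 5 and 6
every_other = x[::2]   # every second element, starting at index 0
from_one = x[1::2]     # every second element, starting at index 1
reversed_x = x[::-1]   # a negative step walks the array backwards

print(middle, every_other, from_one, reversed_x)
```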

### Multi-dimensional subarrays

Multi-dimensional slices work in the same way, with multiple slices separated by commas.
For example:

In [54]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [55]:
x2[:2, :3]  # two rows, three columns

array([[3, 5, 2],
       [7, 6, 8]])

#### Accessing array rows and columns

One commonly needed routine is accessing single rows or columns of an array.
This can be done by combining indexing and slicing, using an empty slice marked by a single colon (``:``):

In [56]:
print(x2[:, 0])  # first column of x2

[3 7 1]


In [57]:
print(x2[0, :])  # first row of x2

[3 5 2 4]


### Subarrays as no-copy views

One important, and extremely useful, thing to know about array slices is that they return *views* rather than *copies* of the array data.
This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies.
Consider our two-dimensional array from before:

In [58]:
print(x2)

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]


Let's extract a $2 \times 2$ subarray from this:

In [59]:
x2_sub = x2[:2, :2]
print(x2_sub)

[[3 5]
 [7 6]]


Now if we modify this subarray, we'll see that the original array is changed! Observe:

In [60]:
x2_sub[0, 0] = 99
print(x2_sub)

[[99  5]
 [ 7  6]]


In [61]:
print(x2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]


### Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the ``copy()`` method:

In [62]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)

[[99  5]
 [ 7  6]]


If we now modify this subarray, the original array is not touched:

In [63]:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)

[[42  5]
 [ 7  6]]


In [64]:
print(x2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]
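When it is unclear whether two arrays share data, ``np.shares_memory`` can be used to check. The sketch below applies it to a slice and to an explicit copy of a small example grid:

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))

view = grid[:2, :2]          # slicing returns a view
copy = grid[:2, :2].copy()   # .copy() allocates new memory

print(np.shares_memory(grid, view))
print(np.shares_memory(grid, copy))
```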


## Advanced Indexing
We can also pass arrays of indices in place of single scalars to access multiple array elements at once. This allows us to very quickly access and modify complicated subsets of an array's values. For example, consider the following array:

In [3]:
rand = np.random.RandomState(42)

x = rand.randint(100, size=10)
print(x)

[51 92 14 71 60 20 82 86 74 74]


### Accessing Values
Suppose we want to access three different elements. We could do it like this:

In [4]:
[x[3], x[7], x[2]]

[71, 86, 14]

Alternatively, we can pass a single list or array of indices to obtain the same result:

In [5]:
ind = [3, 7, 2]
x[ind]

array([71, 86, 14])

We can also control the shape of the result. Note that the shape of the result reflects the shape of the *index arrays* rather than the shape of the *array being indexed*:

In [6]:
ind = np.array([[3, 7],
                [4, 5]])
x[ind]

array([[71, 86],
       [60, 20]])

### Modifying Values
Just as we can use an array of indices to access parts of an array, we can also use the same approach to modify parts of an array. For example, imagine we have an array of indices and we'd like to set the corresponding items in an array to some value:

In [22]:
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)

[ 0 99 99  3 99  5  6  7 99  9]


We can use any assignment-type operator for this. For example:

In [9]:
x[i] -= 10
print(x)

[ 0 89 89  3 89  5  6  7 89  9]
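One caveat applies when the index array contains repeated indices: an in-place operation such as ``+=`` is applied only once per distinct index. If every occurrence should count, the ``at`` method of the corresponding ufunc (here ``np.add.at``) can be used instead. A small sketch with illustrative names:

```python
import numpy as np

idx = [2, 3, 3, 4, 4, 4]

buffered = np.zeros(6, dtype=int)
buffered[idx] += 1               # each repeated index is incremented only once

accumulated = np.zeros(6, dtype=int)
np.add.at(accumulated, idx, 1)   # unbuffered: every occurrence counts

print(buffered)
print(accumulated)
```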


### <font color='red'><u>Worksheet Exercises</u></font>
1. Output the total size (in bytes) of the following NumPy array: `np.full((3, 5), 3.14)`
2. Output the type of data used for values stored in the NumPy array created in 1. above
3. Create a one-dimensional array with the numbers 1 to 10. Now, print the last element using a negative index.
4. Using the array created in 3., change the 4th element to -1
5. Use advanced indexing on the array created in 3. to also change the 2nd, 7th and 8th elements to -1
6. Create a two-dimensional array of 5x5 32-bit unsigned integers and initialise it to zero. Using array slicing, update the array to match the following:
```
0 0 0 0 0
0 1 1 1 0
0 1 1 1 0
0 1 1 1 0
0 0 0 0 0
```

In [2]:
# add your exercise solutions here
#1 the total size (in bytes)
a1 = np.full((3, 5), 3.14)
print("nbytes:", a1.nbytes, "bytes")
#2 Output the type of data
print("dtype:", a1.dtype)
#3 Create a one-dimensional array with the numbers 1 to 10
a2 = np.arange(1, 11)
print(a2)
# print the last element using a negative index
print(a2[-1])
#4 change the 4th element (index 3) to -1
a2[3] = -1
print(a2)
#5 use advanced indexing to also change the 2nd, 7th and 8th elements (indices 1, 6, 7) to -1
i = np.array([1, 6, 7])
a2[i] = -1
print(a2)
#6 Create a two-dimensional array of 5x5 32-bit unsigned integers and initialise it to zero.
a3 = np.zeros((5, 5), dtype='uint32')
# Using array slicing, set the central 3x3 block to 1
a3[1:4, 1:4] = 1
print(a3)

nbytes: 120 bytes
dtype: float64
[ 1  2  3  4  5  6  7  8  9 10]
10
[ 1  2  3 -1  5  6  7  8  9 10]
[ 1 -1  3 -1  5  6 -1 -1  9 10]
[[0 0 0 0 0]
 [0 1 1 1 0]
 [0 1 1 1 0]
 [0 1 1 1 0]
 [0 0 0 0 0]]


## Reshaping of Arrays

Another useful type of operation is reshaping of arrays.
The most flexible way of doing this is with the ``reshape`` method.
For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [35]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


Note that for this to work, the size of the initial array must match the size of the reshaped array. Where possible, the ``reshape`` method will use a no-copy view of the initial array.
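Both points can be seen in a short sketch. Passing ``-1`` for one dimension asks NumPy to infer it from the array size, a size mismatch raises a ``ValueError``, and ``np.shares_memory`` shows that the reshaped array here is a view of the original. Variable names are illustrative.

```python
import numpy as np

x = np.arange(1, 7)

# -1 lets NumPy infer the missing dimension (3 here)
two_rows = x.reshape((2, -1))
print(two_rows.shape)

# The reshaped array shares its data with the original
is_view = np.shares_memory(x, two_rows)
print(is_view)

# 6 elements cannot fill a 4 x 2 grid, so reshape raises an error
try:
    x.reshape((4, 2))
    failed = False
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```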

## Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We'll take a look at those operations here.

### Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.

``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [36]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once:

In [37]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[ 1  2  3  3  2  1 99 99 99]


It can also be used for two-dimensional arrays:

In [39]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

In [40]:
# concatenate along the first axis
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [53]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

For working with arrays of mixed dimensions, it can be clearer to use the ``np.vstack`` (vertical stack) and ``np.hstack`` (horizontal stack) functions:

In [54]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])

array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])

In [55]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])

### Splitting of arrays

The opposite of concatenation is splitting, which is implemented by the functions ``np.split``, ``np.hsplit``, and ``np.vsplit``.  For each of these, we can pass a list of indices giving the split points:

In [56]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

[1 2 3] [99 99] [3 2 1]


Notice that *N* split points lead to *N + 1* subarrays.
The related functions ``np.hsplit`` and ``np.vsplit`` are similar:

In [10]:
grid = np.arange(16).reshape((4, 4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [13]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]


In [99]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
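Instead of a list of split points, ``np.split`` also accepts an integer, in which case the array is divided into that many equal parts. When the length does not divide evenly, ``np.array_split`` can be used. A brief sketch with illustrative names:

```python
import numpy as np

x = np.arange(12)

# An integer argument splits into that many equal-sized parts
thirds = np.split(x, 3)

# array_split tolerates lengths that do not divide evenly
uneven = np.array_split(x, 5)

print([len(part) for part in thirds])
print([len(part) for part in uneven])
```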


### <font color='red'><u>Worksheet Exercises</u></font>
1. Create a one-dimensional NumPy array of 64 numbers ranging from 1 to 64. Now reshape this array into a grid of 8 x 8 elements.
2. Create two one-dimensional NumPy arrays of 8 alternating 0s and 1s. The first array should begin with a 1; the second array with a 0, i.e. 10101010 and 01010101. Now, use the `concatenate()` and `reshape()` functions to build a checkerboard of 8 x 8 elements
3. Given the following grid of 25 values, extract the central 3 x 3 sub-grid of `9`s from the larger grid using the `split()` function:
```
1 2 3 4 5
1 9 9 9 5
1 9 9 9 5
1 9 9 9 5
1 2 3 4 5
```


In [17]:
# add your exercise solutions here
#1 Create a one-dimensional NumPy array of 64 numbers ranging from 1 to 64, then reshape it into a grid of 8 x 8 elements.
c1 = np.arange(1, 65).reshape(8, 8)
c1

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16],
       [17, 18, 19, 20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29, 30, 31, 32],
       [33, 34, 35, 36, 37, 38, 39, 40],
       [41, 42, 43, 44, 45, 46, 47, 48],
       [49, 50, 51, 52, 53, 54, 55, 56],
       [57, 58, 59, 60, 61, 62, 63, 64]])

In [20]:
#2 Create two one-dimensional NumPy arrays of 8 alternating 0s and 1s.
x = np.ones(8,dtype=int)
i = np.array([1, 3, 5, 7])
x[i] = 0 #first array 10101010
print(x)
x1=np.zeros(8,dtype=int)
x1[i]=1 #second array 01010101
print(x1)
y=np.concatenate([x, x1]).reshape(2,8)
z=np.concatenate([y, y])
h=np.concatenate([z, z])
h # a checkerboard of 8 x 8 elements

[1 0 1 0 1 0 1 0]
[0 1 0 1 0 1 0 1]


array([[1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1]])

In [21]:
#3 extract the central 3 x 3 sub-grid of 9s using split()
grid=np.array([1, 2, 3, 4, 5,
1, 9, 9, 9, 5,
1, 9, 9, 9, 5,
1, 9, 9, 9, 5,
1, 2, 3, 4, 5])
# split the flat array so that pieces 2, 4 and 6 are the runs [9 9 9]
x1,x2,x3,x4,x5,x6,x7=np.split(grid, [6,9,11,14,16,19])
print(x2,x4,x6)
array_tuple=(x2,x4,x6)
sub_grid=np.vstack(array_tuple) # stack the 3 arrays into one 3 x 3 grid
sub_grid


[9 9 9] [9 9 9] [9 9 9]


array([[9, 9, 9],
       [9, 9, 9],
       [9, 9, 9]])

# Aggregations

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.
Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.). NumPy has fast built-in aggregation functions for working on arrays.

## Summing the Values in an Array

As a quick example, consider computing the sum of all values in an array.
Python itself can do this using the built-in ``sum`` function:

In [22]:
import numpy as np

In [23]:
L = np.random.random(100)
sum(L)

52.81568133061098

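The syntax is quite similar to that of NumPy's ``sum`` function, and the result is the same in the simplest case (the two outputs shown here differ, most likely because the random array ``L`` was regenerated between running the two cells):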

In [25]:
np.sum(L)

49.86649672967345

However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:

In [26]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

168 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.84 ms ± 84.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Be careful, though: the ``sum`` function and the ``np.sum`` function are not identical, which can sometimes lead to confusion!
In particular, their optional arguments have different meanings, and ``np.sum`` is aware of multiple array dimensions, as we will see in the following section.
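A small sketch of this difference on a two-dimensional array: the second positional argument of the built-in ``sum`` is a start value, whereas for ``np.sum`` it is the axis.

```python
import numpy as np

M = np.array([[1, 2],
              [3, 4]])

builtin_total = sum(M)      # iterates over rows, giving column sums
builtin_start = sum(M, 1)   # second argument is a start value, not an axis
numpy_axis = np.sum(M, 1)   # second argument is the axis
numpy_total = np.sum(M)     # sums every element

print(builtin_total, builtin_start, numpy_axis, numpy_total)
```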

## Minimum and Maximum

Similarly, Python has built-in ``min`` and ``max`` functions, used to find the minimum value and maximum value of any given array:

In [107]:
min(big_array), max(big_array)

(1.4057692298008462e-06, 0.9999994392723005)

NumPy's corresponding functions have similar syntax, and again operate much more quickly:

In [106]:
np.min(big_array), np.max(big_array)

(1.4057692298008462e-06, 0.9999994392723005)

In [108]:
%timeit min(big_array)
%timeit np.min(big_array)

80.7 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
616 µs ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


For ``min``, ``max``, ``sum``, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:

In [67]:
print(big_array.min(), big_array.max(), big_array.sum())

1.4057692298008462e-06 0.9999994392723005 500202.5348847683


Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!

### Multi dimensional aggregates

One common type of aggregation operation is an aggregate along a row or column.
Say you have some data stored in a two-dimensional array:

In [109]:
M = np.random.random((3, 4))
print(M)

[[0.35747479 0.16525562 0.22801649 0.01116173]
 [0.96484772 0.93042983 0.28571123 0.18633432]
 [0.84657553 0.11751108 0.43646768 0.48335078]]


By default, each NumPy aggregation function will return the aggregate over the entire array:

In [110]:
M.sum()

5.013136800754203

Aggregation functions take an additional argument specifying the *axis* along which the aggregate is computed. For example, we can find the minimum value within each column by specifying ``axis=0``:

In [111]:
M.min(axis=0)

array([0.35747479, 0.11751108, 0.22801649, 0.01116173])

The function returns four values, corresponding to the four columns of numbers.

Similarly, we can find the maximum value within each row:

In [71]:
M.max(axis=1)

array([0.35747479, 0.96484772, 0.84657553])

The way the axis is specified here can be confusing to users coming from other languages.
The ``axis`` keyword specifies the *dimension of the array that will be collapsed*, rather than the dimension that will be returned.
So specifying ``axis=0`` means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
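The effect on the result's shape can be checked directly. In this sketch a 3 x 4 array is aggregated along each axis; the optional ``keepdims=True`` argument retains the collapsed axis with length 1.

```python
import numpy as np

M = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns

col_sums = M.sum(axis=0)               # collapses the rows: one value per column
row_sums = M.sum(axis=1)               # collapses the columns: one value per row
kept = M.sum(axis=1, keepdims=True)    # keeps the collapsed axis with length 1

print(col_sums.shape, row_sums.shape, kept.shape)
```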

### Other aggregation functions

NumPy provides many other aggregation functions. Additionally, most aggregates have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point ``NaN`` value. The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
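As a brief illustration of the NaN-safe versions, the sketch below compares ``np.mean`` and ``np.nanmean`` on a small array containing one missing value:

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

plain_mean = np.mean(data)     # NaN propagates through the ordinary aggregate
safe_mean = np.nanmean(data)   # the NaN-safe version ignores the missing value

print(plain_mean, safe_mean)
```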

### <font color='red'><u>Worksheet Exercises</u></font>
1. Generate a one-dimensional NumPy array of 100000 random numbers using a normal distribution with mean 0 and standard deviation 1 (use the `random.normal()` function). Use the aggregate functions to show the mean and standard deviation of the array values and confirm that the normal random number generator approximates the given input parameters.
2. Using the array created in 1. above, find the following values using aggregate functions:
    * min and max values
    * the 25th percentile
    * the 50th percentile
    * the 75th percentile
3. Given the 5 x 5 magic square below, confirm that all rows, columns and diagonals sum to 65. NOTE: you may find the `sum()`, `diagonal()` and `fliplr()` functions useful for this exercise.
```
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
11 18 25 2 9
```

In [43]:
# add your exercise solutions here
#1 
m=np.random.normal(0, 1, 100000)
print("mean:",np.mean(m),"std:",np.std(m))
#2
print("min:",m.min(),"max:",m.max())
print("the 25th percentile:",np.percentile(m,25))
print("the 50th percentile:",np.percentile(m,50))
print("the 75th percentile:",np.percentile(m,75))
#3
arr=np.array([17,24,1,8,15,
     23,5,7,14,16,
     4,6,13,20,22,
     10,12,19,21,3,    
    11,18,25,2,9]).reshape(5,5)
# row sums, column sums, main diagonal, anti-diagonal
print(arr.sum(axis=1), arr.sum(axis=0), np.diagonal(arr).sum(), np.diagonal(np.fliplr(arr)).sum())

mean: 0.004528703490133205 std: 0.9980660091317033
min: -3.9319607107785357 max: 4.29534344780037
the 25th percentile: -0.6704027104725593
the 50th percentile: 0.0008861399433546627
the 75th percentile: 0.6754086985477342
[65 65 65 65 65] [65 65 65 65 65] 65 65


# Comparisons, Masks, and Boolean Logic

This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays.
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.

## Comparison Operators
NumPy has comparison operators (such as ``<`` (less than) and ``>`` (greater than)) that act in an element-wise way, i.e. per element of the NumPy array. The result of these comparison operators is always an array with a Boolean data type. All six of the standard comparison operations are available:

In [2]:
x = np.array([1, 2, 3, 4, 5])

In [3]:
x < 3  # less than

array([ True,  True, False, False, False])

In [4]:
x > 3  # greater than

array([False, False, False,  True,  True])

In [75]:
x <= 3  # less than or equal

array([ True,  True,  True, False, False])

In [5]:
x >= 3  # greater than or equal

array([False, False,  True,  True,  True])

In [6]:
x != 3  # not equal

array([ True,  True, False,  True,  True])

In [7]:
x == 3  # equal

array([False, False,  True, False, False])

These comparison operators will work on arrays of any size and shape. Here is a two-dimensional example:

In [38]:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x

array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

In [117]:
x < 6

array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]])

In each case, the result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.
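
As a small sketch of the point above, the example below uses two made-up arrays (`a` and `b` are hypothetical) to show that a comparison returns an array of the same shape with a Boolean `dtype`, whether the right-hand side is a single number or another array of matching shape.

```python
import numpy as np

# Two small arrays invented for this illustration
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 0, 3], [9, 5, 0]])

mask_scalar = a > 2    # compare every element with a single number
mask_array = a == b    # compare two arrays element by element

print(mask_scalar.dtype, mask_scalar.shape)
print(mask_array)
```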

## Working with Boolean Arrays

Given a Boolean array, there are a host of useful operations you can do.
We'll work with ``x``, the two-dimensional array we created earlier.

In [9]:
print(x)

[[5 0 3 3]
 [7 9 3 5]
 [2 4 7 6]]


### Counting entries

To count the number of ``True`` entries in a Boolean array, ``np.count_nonzero`` is useful:

In [10]:
# how many values less than 6?
np.count_nonzero(x < 6)

8

We see that there are eight array entries that are less than 6.
Another way to get at this information is to use ``np.sum``; in this case, ``False`` is interpreted as ``0``, and ``True`` is interpreted as ``1``:

In [11]:
np.sum(x < 6)

8

One benefit of ``np.sum`` is that, as with other NumPy aggregation functions, the summation can also be done along rows or columns:

In [123]:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)

array([4, 2, 2])

This counts the number of values less than 6 in each row of the matrix.
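
The same idea works down the columns with `axis=0`, and the `np.any` and `np.all` functions from the aggregation table give quick yes/no answers. The sketch below re-creates the same values as `x` by hand so that it runs on its own.

```python
import numpy as np

# Same values as the array x used in this section
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

col_counts = np.sum(x < 6, axis=0)   # values less than 6 in each column
any_above_8 = np.any(x > 8)          # is at least one value greater than 8?
all_below_10 = np.all(x < 10)        # are all values less than 10?
rows_below_8 = np.all(x < 8, axis=1) # is every value in a row less than 8?

print(col_counts, any_above_8, all_below_10, rows_below_8)
```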

### Boolean operators

As with the comparison operators, Python's *bitwise logic operators*, ``&``, ``|``, ``^``, and ``~``, act in an element-wise way on (usually Boolean) arrays. Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations. For example, we can count the values that fall within a given range as follows:

In [12]:
print(x)
np.sum((x >= 3) & (x <= 5)) # find the number of values between 3 and 5 inclusive

[[5 0 3 3]
 [7 9 3 5]
 [2 4 7 6]]


6

So we see that there are 6 numbers between 3 and 5 inclusive.
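
The other operators follow the same pattern. The sketch below, again using the values of `x` written out by hand, uses `|` (or) and `~` (not) to count the values that fall outside a range.

```python
import numpy as np

# Same values as the array x used in this section
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# values less than 3 OR greater than 6
n_outside = np.sum((x < 3) | (x > 6))

# NOT (between 3 and 5 inclusive)
n_not_between = np.sum(~((x >= 3) & (x <= 5)))

print(n_outside, n_not_between)
```

Note that the parentheses around each comparison are needed, because `&` and `|` bind more tightly than the comparison operators. Python's `and` and `or` keywords cannot be used here: they try to evaluate the whole array as a single truth value, which raises an error for arrays with more than one element.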

## Boolean Arrays as Masks

In the preceding section we looked at aggregates computed directly on Boolean arrays.
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our ``x`` array from before, suppose we want an array of all values in the array that are less than, say, 5:

In [13]:
x

array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

We can obtain a Boolean array for this condition easily, as we've already seen:

In [14]:
x < 5

array([[False,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True, False, False]])

Now to *select* these values from the array, we can simply index on this Boolean array; this is known as a *masking* operation:

In [15]:
x[x < 5]

array([0, 3, 3, 3, 2, 4])

What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is ``True``. We are then free to operate on these values as we wish.

By combining Boolean operations, masking operations, and aggregates, we can very quickly answer questions like these about the values in an array.
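
As a sketch of how these pieces fit together, the example below first aggregates over a masked selection and then uses a mask on the left-hand side of an assignment to modify values. The threshold of 6 and the name `clipped` are arbitrary choices made for this illustration.

```python
import numpy as np

# Same values as the array x used in this section
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# Aggregate over a masked selection: mean of the values less than 5
mean_small = x[x < 5].mean()

# Modify through a mask: work on a copy so that x itself is unchanged
clipped = x.copy()
clipped[clipped > 6] = 6   # replace every value above 6 with 6

print(mean_small)
print(clipped)
```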

### <font color='red'><u>Worksheet Exercises</u></font>
1. Imagine you want to simulate dice rolls. Use `randint()` to generate 100 dice rolls. Now, count the number of dice rolls greater than 3.
2. Using the `loadtxt()` function, read in the data stored in the file `1969 Birth Weights.csv` into a NumPy array. Now, count the number of babies who weigh between 4000g and 4500g.
3. Using the same data as in 2. above, create a NumPy array containing all of the weights between 4000g and 4500g.

In [49]:
# add your exercise solutions here
#1 count the number of dice rolls greater than 3
import random
randomlist=[]
for i in range(0,100):
    y=random.randint(1,6)
    randomlist.append(y)
x=np.array(randomlist)
print(x)
print(np.sum(x>3))  # print the count; a bare expression in the middle of a cell is not displayed
#2 read in the data stored in the file 1969 Birth Weights.csv,count the number of babies who weigh between 4000g and 4500g
arr=np.loadtxt('1969 Birth Weights.csv')
print(arr)
count=np.sum((arr >= 4000) & (arr <= 4500))
print(count)
#3 create a NumPy array containing all of the weights between 4000g and 4500g.
arr[((arr >= 4000) & (arr <= 4500))]

[4 2 1 4 3 2 1 6 5 2 5 1 3 1 6 1 5 1 5 3 4 2 2 6 5 5 6 4 5 4 1 2 2 4 5 2 5
 3 5 1 3 5 3 6 4 1 6 4 1 2 4 2 4 6 6 3 6 1 4 2 4 3 5 2 2 5 4 5 6 1 6 5 1 6
 6 2 2 6 6 5 3 2 6 1 4 5 4 3 6 4 2 5 4 3 5 2 5 5 5 5]
57
[4046. 4440. 4454. 4548. 4548. 4994. 4440. 4520. 4192. 4198. 4710. 4850.
 4646. 5092. 4800. 4934. 4592. 4842. 4852. 5190. 4580. 4598. 4126. 4324.
 4758. 5076. 5070. 5296. 4798. 5096. 4790. 4872. 4944. 5030. 4670. 4642.
 4170. 4452. 4884. 4924. 5042. 5432. 4796. 5088. 4794. 4660. 4752. 5046.
 4348. 4674. 4230. 4338. 4864. 5046. 4860. 5172. 4500. 4880. 4668. 5006.
 4780. 4912.]
14


array([4046., 4440., 4454., 4440., 4192., 4198., 4126., 4324., 4170.,
       4452., 4348., 4230., 4338., 4500.])
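
The dice rolls in exercise 1 can also be generated without a Python loop. The following is one possible sketch using NumPy's own `randint()`; the seed value `0` and the variable names are arbitrary choices for this illustration. Unlike Python's `random.randint(1, 6)`, NumPy's `randint` excludes the upper bound, so we pass `7` to include the value 6.

```python
import numpy as np

rng = np.random.RandomState(0)        # seed chosen arbitrarily, for reproducibility
rolls = rng.randint(1, 7, size=100)   # 100 rolls of a six-sided die (upper bound is exclusive)

n_high = np.count_nonzero(rolls > 3)  # number of rolls greater than 3
high_rolls = rolls[rolls > 3]         # the rolls themselves, selected with a mask

print(n_high)
```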

# Sorting Arrays

Up to this point we have been concerned mainly with tools to access and operate on array data with NumPy. This section covers algorithms related to sorting values in NumPy arrays. Although Python has the built-in ``sorted`` function and the ``sort`` method for working with lists, here we will focus on NumPy's ``np.sort`` and `np.argsort` functions, which are much more efficient for arrays.

An array's own `sort()` method sorts the array in-place, i.e. the values of the array itself are reordered.

In [50]:
x = np.array([2, 1, 4, 3, 5])
x.sort()
print(x)

[1 2 3 4 5]


To return a sorted version of the array without modifying the array, you can use `np.sort()`:

In [51]:
x = np.array([2, 1, 4, 3, 5])
y = np.sort(x)
print(x,y) # original array unchanged

[2 1 4 3 5] [1 2 3 4 5]
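
A common slip is to assign the result of the in-place `sort()` method to a variable. As the short sketch below shows, the method returns `None`, so use `np.sort()` whenever you need the sorted values as a new array. Reversing the sorted array with a `[::-1]` slice is one simple way to obtain descending order.

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])

result = x.sort()              # sorts x in place and returns None
descending = np.sort(x)[::-1]  # sorted copy, reversed to give descending order

print(result, x, descending)
```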


A related function is ``argsort``, which instead returns the *indices* of the sorted elements:

In [53]:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)

[1 0 3 2 4]


The first element of this result gives the index of the smallest element, the second value gives the index of the second smallest, and so on. These indices can then be used to construct the sorted array if desired:

In [88]:
x[i]

array([1, 2, 3, 4, 5])
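
The indices returned by `argsort` are most useful when one array is used to order another. The sketch below uses made-up names and scores (both arrays are hypothetical) to list the names from lowest to highest score, and to pick out the two highest scorers.

```python
import numpy as np

# Hypothetical data: names[i] achieved scores[i]
names = np.array(["Ana", "Ben", "Cara", "Dev"])
scores = np.array([72, 95, 60, 88])

order = np.argsort(scores)        # indices that would sort the scores
names_by_score = names[order]     # names from lowest to highest score
top_two = names[order[::-1][:2]]  # reverse the order, then take the first two

print(order, names_by_score, top_two)
```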

### Sorting along rows or columns

A useful feature of NumPy's sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the ``axis`` argument. For example:

In [57]:
rand = np.random.RandomState(42)
X = rand.randint(0, 20, (4, 6))
print(X)

[[ 6 19 14 10  7  6]
 [18 10 10  3  7  2]
 [ 1 11  5  1  0 11]
 [11 16  9 15 14 14]]


In [56]:
# sort each column of X
np.sort(X, axis=0)

array([[ 1, 10,  5,  1,  0,  2],
       [ 6, 11,  9,  3,  7,  6],
       [11, 16, 10, 10,  7, 11],
       [18, 19, 14, 15, 14, 14]])

In [55]:
# sort each row of X
np.sort(X, axis=1)

array([[ 6,  6,  7, 10, 14, 19],
       [ 2,  3,  7, 10, 10, 18],
       [ 0,  1,  1,  5, 11, 11],
       [ 9, 11, 14, 14, 15, 16]])
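
Keep in mind that sorting along an axis treats each row or column as an independent array, so any relationship between the values in a row (or column) is lost. If you instead want to reorder whole rows, for example by the values in the first column, one option is to combine `argsort` with indexing. The sketch below writes out the values of `X` by hand so that it runs on its own.

```python
import numpy as np

# Same values as the array X printed above
X = np.array([[ 6, 19, 14, 10,  7,  6],
              [18, 10, 10,  3,  7,  2],
              [ 1, 11,  5,  1,  0, 11],
              [11, 16,  9, 15, 14, 14]])

order = np.argsort(X[:, 0])   # row order that sorts the first column
X_by_first_col = X[order]     # whole rows are moved together

print(X_by_first_col)
```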