# Data Science Principles and Practices Lab Week 2

Follow the instructions to complete each of these tasks. This set of exercises looks at Numpy and performing operations on numeric data. Do not worry if you do not complete them all in the timetabled lab session.

This is not assessed but will help you gain practical experience for the module exam and coursework.

For extra information about the functions provided by numpy, you can search the API documentation here:

https://numpy.org/doc/stable/

# Numpy

Numpy is a Python module for working with multidimensional arrays, performing efficient vectorised operations on them, and doing linear algebra. It also provides random number generation facilities.

 - tensorflow functions accept numpy arrays as inputs as well as tensor objects.
 - pandas can convert data frames or series into numpy arrays.
 - scikit-learn accepts data passed as numpy arrays.

The main feature of numpy is that it provides efficient N-dimensional arrays and operations on them.

Numpy is typically imported as *np* to reduce typing.

```python
import numpy as np
```

Do this below.

In [1]:
import numpy as np

## Creating arrays

numpy arrays are objects of the numpy.ndarray class, which can represent arrays of any number of dimensions. They are usually created with the np.array function:

```python
a = np.array([[1,2,3],[4,5,6]])
```

Arrays can be created from tuples or lists. A single list will construct a one-dimensional array, and a list of lists will create a two-dimensional array.
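As a small illustration of the point above, the sketch below builds arrays from a list, a list of lists and a tuple. The variable names are only for illustration, and the shape attribute it prints is explained later in this lab.

```python
import numpy as np

one_d = np.array([1, 2, 3])                # a single list gives a 1-D array
two_d = np.array([[1, 2, 3], [4, 5, 6]])   # a list of lists gives a 2-D array
from_tuple = np.array((1, 2, 3))           # a tuple works the same way as a list

print(one_d.shape)                         # (3,)
print(two_d.shape)                         # (2, 3)
print(np.array_equal(one_d, from_tuple))   # True
```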

### Task 1.1

Create a two dimensional numpy array, with three rows and two columns, containing the numbers 1 to 6. The ordering of the numbers does not matter, but look at the output to see how the values are arranged.

*You can assign your array to a variable, e.g. a = np.array(....). To print out your array, you can then write a as the last line of your Jupyter code cell. Jupyter will print out the value in the last line of the code cell. For example:*

```python
a = np.array([1,2,3])
a
```

In [2]:
a = np.array([[1,2],[3,4],[5,6]])
a

array([[1, 2],
       [3, 4],
       [5, 6]])

Arrays can also be built using built-in constructors:

```python
b = np.ones((4,3)) # A 4 by 3 array of the value 1.0
c = np.zeros((3,4)) # A 3 by 4 array of the value 0.0
```
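The short sketch below shows what these constructors return. np.full is not used elsewhere in this lab and is included only as an assumption about what you might find useful; it fills an array with a value of your choice.

```python
import numpy as np

ones = np.ones((2, 3))        # 2 by 3 array of the value 1.0
zeros = np.zeros((3, 2))      # 3 by 2 array of the value 0.0
sevens = np.full((2, 2), 7)   # 2 by 2 array filled with the value 7

print(ones.dtype)    # float64: ones and zeros produce floating point values by default
print(zeros.shape)   # (3, 2)
print(sevens)
```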

### Task 1.2

Try creating an array containing all ones, with 6 rows and 4 columns.

In [3]:
b = np.ones((6,4))
b

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

## Working with Numpy arrays

The shape attribute gives the size of the array in each dimension, and the ndim attribute gives the total number of dimensions.

```python
a = np.array([[1,2,3],[4,5,6]])
a.shape # (2,3)
a.ndim # 2
a.dtype # dtype('int64') on most platforms
```

The dtype attribute tells us the data type of the values stored in the array. This can be specified when creating the array using an additional argument:

```python
a = np.array([[1,2,3],[4,5,6]],dtype=np.float64)
b = np.ones((4,3),dtype=np.int64)
```
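To see the effect of the dtype argument, the sketch below prints the resulting data types. It also uses the astype method, which is not covered elsewhere in this lab, to convert an existing array to another type.

```python
import numpy as np

as_float = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
as_int = np.ones((4, 3), dtype=np.int64)

print(as_float.dtype)  # float64
print(as_int.dtype)    # int64

# astype returns a converted copy; the original array is unchanged.
truncated = np.array([1.7, 2.2, 3.9]).astype(np.int64)
print(truncated)       # [1 2 3] (the fractional parts are discarded)
```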

### Task 1.3

Check the shape, number of dimensions (ndim) and data type of one of the arrays you created above.

In [4]:
a.shape

(3, 2)

In [5]:
a.ndim

2

In [6]:
a.dtype

dtype('int64')

## Indexing Numpy arrays

Numpy arrays are indexed in a similar way to Python lists, with zero-based indexing:

```python
a = np.array([1,2,3,4,5,6,7,8])
a[2:6] # slice from index 2 to 5: array([3, 4, 5, 6])
```

*Remember that the slice will go from the first index provided to the index before the last one (e.g. from index 2 to 5 when the range specified is 2:6).*
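A few more slices on the same one-dimensional array may help make the rule concrete. This is only a sketch; omitted start or end values, negative indices and a step work as they do for Python lists.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])

print(a[2:6])   # [3 4 5 6]  indices 2, 3, 4 and 5
print(a[:3])    # [1 2 3]    from the start up to (not including) index 3
print(a[-2:])   # [7 8]      the last two values
print(a[::2])   # [1 3 5 7]  every second value
```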

For multidimensional arrays, an index is provided for each dimension. A single : can be used in place of an index to return all values in that dimension:

```python
a = np.arange(1,28).reshape((3,3,3))
a[1,:,2] # array([12, 15, 18])
```

In [7]:
a = np.arange(1,28).reshape((3,3,3))
a

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]],

       [[19, 20, 21],
        [22, 23, 24],
        [25, 26, 27]]])
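Using the 3 by 3 by 3 array printed above, the sketch below tries a few different combinations of indices and : so you can check each result against the output.

```python
import numpy as np

a = np.arange(1, 28).reshape((3, 3, 3))

print(a[1, :, 2])   # [12 15 18]  block 1, all rows, column 2
print(a[0, 1, 2])   # 6           a single value: block 0, row 1, column 2
print(a[:, 0, 0])   # [ 1 10 19]  the top-left value of every block
print(a[2, 1, :])   # [22 23 24]  block 2, row 1, all columns
print(a[0].shape)   # (3, 3)      indexing only the first dimension gives a 2-D array
```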

### Task 1.4

An array b is provided below. Use indexing to return all values in the columns in the range 2 to 4 (inclusive). *You can use the columns at indices 2, 3 and 4, but remember that these will be the third, fourth and fifth columns in the array due to zero-based indexing*.

In [8]:
b = np.arange(1,33).reshape((4,8))
b

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16],
       [17, 18, 19, 20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29, 30, 31, 32]])

In [9]:
# put your answer here:
b[:,2:5]

array([[ 3,  4,  5],
       [11, 12, 13],
       [19, 20, 21],
       [27, 28, 29]])

## Numpy ranges

As you saw in the previous task, Numpy has an arange function that creates an array filled with a range of values. It takes a start value, an end value and a step size. As with Python's range, the end value itself is not included.

```python
a = np.arange(2,10,2) # array([2, 4, 6, 8])
```

These can be reshaped to generate more complex N-dimensional arrays using the reshape method:

```python
a = np.arange(1,28).reshape(3,3,3)
```
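The sketch below illustrates two details of arange and reshape: the end value is excluded, and the requested shape must hold exactly as many values as the original array. The variable names are illustrative only.

```python
import numpy as np

evens = np.arange(2, 10, 2)
print(evens)              # [2 4 6 8]  the end value 10 is not included

first_five = np.arange(5)
print(first_five)         # [0 1 2 3 4]  with one argument, counting starts at 0

grid = np.arange(1, 13).reshape(3, 4)   # 12 values fit exactly into 3 rows of 4
print(grid.shape)         # (3, 4)

# 12 values cannot fill a 5 by 3 array, so numpy raises a ValueError.
try:
    np.arange(1, 13).reshape(5, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)     # True
```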

# Broadcast operations

## Elementwise operators

The standard arithmetic operators like +, -, * and / will act **elementwise** on arrays, and can be broadcast across a whole array.

For example, two arrays can be multiplied together *element by element*:

```python
a = np.arange(1,10).reshape(3,3)
b = np.arange(9,0,-1).reshape(3,3)
c = a*b
# c= array([[ 9, 16, 21],
#           [24, 25, 24],
#           [21, 16,  9]])
```

Notice that this is **not** matrix multiplication.
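To make the difference concrete, the sketch below compares the elementwise product with matrix multiplication, written with the @ operator, on the same two arrays. It also shows a scalar and a single row being broadcast across the whole array. The @ operator and the row example are additions for illustration and are not needed for the tasks.

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)
b = np.arange(9, 0, -1).reshape(3, 3)

elementwise = a * b    # multiplies matching positions
matrix_product = a @ b # rows of a combined with columns of b

print(elementwise)
# [[ 9 16 21]
#  [24 25 24]
#  [21 16  9]]
print(matrix_product)
# [[ 30  24  18]
#  [ 84  69  54]
#  [138 114  90]]

# Broadcasting: the scalar 10 is applied to every element.
scaled = a * 10
# Broadcasting: the 3-element row is added to every row of a.
shifted = a + np.array([100, 200, 300])
print(scaled[0])    # [10 20 30]
print(shifted[0])   # [101 202 303]
```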



Operations like logarithms and exponentials can also be mapped across numpy arrays elementwise, simply by applying them to the array:

```python
a = np.arange(1,10).reshape(3,3)
np.power(a,3)
# array([[  1,   8,  27],
#        [ 64, 125, 216],
#        [343, 512, 729]])
```

This can be used with many functions including np.log, np.exp, np.power, np.sin, and np.cos.
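As a brief sketch of these functions in use, the example below applies a few of them to a small array. np.sqrt and np.allclose are not mentioned above and are included only for illustration.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(x)         # square root of each element: 1, 2, 3
squares = np.power(x, 2)   # each element squared: 1, 16, 81
print(roots)
print(squares)

# exp followed by log should give back the original values,
# up to floating point rounding, so we compare with np.allclose.
roundtrip = np.log(np.exp(x))
print(np.allclose(roundtrip, x))   # True
```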

### Task 1.5

Try adding together the corresponding values in the arrays a and b provided below.

In [10]:
a = np.arange(1,10).reshape(3,3)
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [11]:
b = np.arange(9,0,-1).reshape(3,3)
b

array([[9, 8, 7],
       [6, 5, 4],
       [3, 2, 1]])

In [12]:
c = a+b
c

array([[10, 10, 10],
       [10, 10, 10],
       [10, 10, 10]])

# Random numbers in Numpy

## Initialising the random number generator

We first need to initialise the random number generator, by creating an object that we can use to generate random numbers. *Note that this generator interface was introduced in Numpy 1.17, so older code you find online may use functions such as np.random.seed instead*.

```python
rng = np.random.default_rng()
```

Do this below.

In [17]:
rng = np.random.default_rng()

## Generating random values

Numpy can generate random numbers from a variety of different distributions, but it can also make random selections or choices from a list of values.

```python
rng.choice(["Heads","Tails"])
```

Use rng.choice below to simulate a single roll of a six-sided dice. Try running the cell multiple times and see how the output changes.

In [18]:
rng.choice([1,2,3,4,5,6])

1
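The output changes on each run because the generator is initialised differently every time. If you need repeatable results, one option is to pass a seed to default_rng. The sketch below assumes an arbitrary seed of 42 and uses the size argument of rng.choice to draw several values at once; two generators created with the same seed produce the same sequence.

```python
import numpy as np

seed = 42  # arbitrary value chosen for illustration

rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

rolls_a = rng_a.choice([1, 2, 3, 4, 5, 6], size=5)
rolls_b = rng_b.choice([1, 2, 3, 4, 5, 6], size=5)

print(rolls_a)
print(rolls_b)
print(np.array_equal(rolls_a, rolls_b))   # True: same seed, same rolls
```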

## Looping and if statements in Python

We can write simple Python loops to perform an operation multiple times using the *for* syntax. To specify the number of times to execute the code in the loop, we can provide a Python *range* object, specifying the number of loops we would like to make.

```python
for i in range(10): # Loop 10 times
    print(i)
```

Unlike in C or Java where { } are used, **in Python the code to loop over is specified by indenting the code we would like to be contained within the for loop**.

The Python code *range(10)* will produce a range of values from 0 to 9.
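The sketch below shows the values that a range produces by converting it to a list. Like np.arange, range also accepts a start value and a step size, and the end value is never included.

```python
print(list(range(10)))         # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(range(2, 10, 2)))   # [2, 4, 6, 8]

total = 0
for i in range(1, 5):   # i takes the values 1, 2, 3 and 4
    total += i          # indented, so it runs on every iteration
print(total)            # 10 (not indented, so it runs once after the loop)
```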

We can also use *if* statements to perform operations when certain conditions are met:

```python
if a=="Heads":
    print(a) # Only run if variable a contains the value "Heads"
else:
    print("Not heads") # Only run if variable a does not contain the value "Heads"
```

**Just as with for loops, we specify the code we would like to execute if the condition is met using indentation.**

### Task 2.1

Write Python code using a *for* loop and *if* statement to generate random choices between the two values "Heads" and "Tails", and to count how many times each occurs. Start by doing this for 10 loop iterations, and look at the count of Heads occurring.

In [19]:
heads_count = 0
tails_count = 0

for i in range(10):
    coin_toss = rng.choice(["Heads","Tails"])
    if coin_toss=="Heads":
        heads_count += 1
    else:
        tails_count += 1

print("Heads: "+str(heads_count))
print("Tails: "+str(tails_count))

Heads: 7
Tails: 3


Now run your code again but using 10000 iterations. You should notice that the final count of heads is closer to what we would expect in theory - roughly half of the 10000 simulated coin tosses.

In [20]:
heads_count = 0
tails_count = 0

for i in range(10000):
    coin_toss = rng.choice(["Heads","Tails"])
    if coin_toss=="Heads":
        heads_count += 1
    else:
        tails_count += 1

print("Heads: "+str(heads_count))
print("Tails: "+str(tails_count))

Heads: 4978
Tails: 5022


### Task 2.2

Now write a Python for loop to roll a dice 4 times, and count the number of times the value 6 occurs.

In [27]:
six_count = 0
for i in range(4):
    roll = rng.choice([1,2,3,4,5,6])
    if roll==6:
        six_count += 1
six_count

1

We can also generate a numpy array of random values rather than a single output using rng.choice:

```python
dice_rolls = rng.choice([1,2,3,4,5,6],size=4) # Generate 4 random choices from [1,2,3,4,5,6]
```
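As a sketch of how this replaces the loop from Task 2.2, the example below generates four rolls at once, compares every roll with 6 in a single broadcast operation, and counts the True values with np.sum. Passing a tuple as size, shown at the end, gives a multidimensional array of choices.

```python
import numpy as np

rng = np.random.default_rng()

dice_rolls = rng.choice([1, 2, 3, 4, 5, 6], size=4)   # 4 rolls in one call
is_six = dice_rolls == 6                              # array of True/False values
six_count = np.sum(is_six)                            # True counts as 1, False as 0

print(dice_rolls)
print(is_six)
print(six_count)

grid_of_rolls = rng.choice([1, 2, 3, 4, 5, 6], size=(2, 3))   # 2 by 3 array of rolls
print(grid_of_rolls.shape)   # (2, 3)
```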

### Task 2.3

Try to use rng.choice to generate 1000000 coin tosses, and use numpy broadcasting to count the number of heads that occur. *Remember that you can use np.sum on an array of True or False values to count the number of times True occurs, and you can broadcast comparisons such as == over numpy arrays.*

In [22]:
# A more efficient version of this could be to encode Heads as 1 and Tails as 0.
# This will mean numpy can work with integer variables for the coin tosses rather than strings.
coin_tosses = rng.choice(["Heads","Tails"],size=1000000)
np.sum(coin_tosses=="Heads")

500558
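The integer encoding suggested in the comment above could look something like the sketch below. It assumes Heads is encoded as 1 and Tails as 0, and uses rng.integers, whose upper bound is exclusive, instead of rng.choice. The count will differ on every run.

```python
import numpy as np

rng = np.random.default_rng()
n_tosses = 1_000_000

# rng.integers(0, 2, ...) draws from {0, 1}: the upper bound 2 is excluded.
tosses = rng.integers(0, 2, size=n_tosses)

heads_count = int(tosses.sum())        # summing the 1s counts the heads
tails_count = n_tosses - heads_count

print("Heads: " + str(heads_count))
print("Tails: " + str(tails_count))
```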

Go back to your coin tossing code using a for loop above, and try running this for 1000000 coin tosses. Is using broadcasting in numpy faster or slower?

In [23]:
heads_count = 0
tails_count = 0

for i in range(1000000):
    coin_toss = rng.choice(["Heads","Tails"])
    if coin_toss=="Heads":
        heads_count += 1
    else:
        tails_count += 1

print("Heads: "+str(heads_count))
print("Tails: "+str(tails_count))

Heads: 500330
Tails: 499670
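One way to answer the question is to time both approaches. The sketch below uses time.perf_counter from the Python standard library and a smaller number of tosses (an arbitrary 20000) so that the loop finishes quickly. The exact timings depend on your machine, so no expected values are shown.

```python
import time
import numpy as np

rng = np.random.default_rng()
n_tosses = 20_000  # kept small so the loop version finishes quickly

# Version 1: one rng.choice call per toss inside a Python for loop.
start = time.perf_counter()
loop_heads = 0
for _ in range(n_tosses):
    if rng.choice(["Heads", "Tails"]) == "Heads":
        loop_heads += 1
loop_seconds = time.perf_counter() - start

# Version 2: a single rng.choice call and a broadcast comparison.
start = time.perf_counter()
tosses = rng.choice(["Heads", "Tails"], size=n_tosses)
vector_heads = int(np.sum(tosses == "Heads"))
vector_seconds = time.perf_counter() - start

print("Loop:      " + str(loop_heads) + " heads in " + str(loop_seconds) + " seconds")
print("Broadcast: " + str(vector_heads) + " heads in " + str(vector_seconds) + " seconds")
```

Try increasing n_tosses and see how the two timings grow.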
