## NumPy

A lot of data analysis is carried out on tables of data. In Python these are commonly stored as 'arrays', which are provided by NumPy, one of the most widely used Python packages. If you'd like to learn more about it, this is a good place to start: https://numpy.org/doc/stable/user/whatisnumpy.html

The code below creates two simple arrays.

In [2]:
import numpy as np
a = np.array([2, 3, 4])
b = np.array([1.2, 3.5, 5.1])

We can look at the types of data stored in the arrays, which should be familiar from the first Jupyter notebook.

In [4]:
a.dtype

dtype('int64')

In [5]:
b.dtype

dtype('float64')
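As a small sketch of where these types come from: NumPy infers the dtype from the values it is given, and mixing integers with floats produces a float array. The names 'mixed' and 'as_float' are illustrative only, and the default integer type depends on the platform (it is commonly int64, but may be int32 on some systems).

```python
import numpy as np

a = np.array([2, 3, 4])
b = np.array([1.2, 3.5, 5.1])

# Mixing integers and floats: NumPy upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)        # float64

# astype returns a converted copy; the original array keeps its dtype
as_float = a.astype(np.float64)
print(as_float.dtype)     # float64
print(a.dtype.kind)       # i  (still an integer array)
```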

Arrays can have any number of dimensions. For example, deep learning models of videos may use data that have five dimensions. Generally though, we'll be working with data that are two dimensional. The code below shows how to create two-dimensional arrays by passing the shape as a (rows, columns) tuple.

In [6]:
np.zeros((3, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [7]:
np.ones((2, 3), dtype=np.int16)

array([[1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [8]:
np.empty((2, 5))

array([[6.94806878e-310, 6.94806878e-310, 0.00000000e+000,
        0.00000000e+000, 1.08221785e-312],
       [5.69977454e-038, 5.15060814e-062, 1.79665303e-052,
        3.11111319e-032, 2.62712967e+179]])
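To check the dimensionality of an array after creating it, the 'ndim', 'shape', and 'size' attributes can be inspected. The sketch below uses illustrative names ('grid', 'cube', 'scratch') that are not part of the notebook.

```python
import numpy as np

grid = np.zeros((3, 4))        # 2 dimensions: 3 rows, 4 columns
cube = np.ones((2, 3, 4))      # 3 dimensions

print(grid.ndim, grid.shape, grid.size)   # 2 (3, 4) 12
print(cube.ndim, cube.shape, cube.size)   # 3 (2, 3, 4) 24

# np.empty only reserves memory without filling it, so the values it
# shows are arbitrary and will differ from run to run
scratch = np.empty((2, 5))
print(scratch.shape)                      # (2, 5)
```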

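The values in an array made with 'np.empty' are whatever happened to be in memory, so they will differ from run to run. We can print these arrays in a (slightly) more user-friendly format using 'print':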

In [30]:
a = np.empty((2, 5))
print(a)

[[6.94806878e-310 6.94806878e-310 0.00000000e+000 0.00000000e+000
  1.08221785e-312]
 [5.69977454e-038 5.15060814e-062 1.79665303e-052 3.11111319e-032
  2.62712967e+179]]


In [10]:
testmatrix = np.zeros((3, 4))

In [11]:
print(testmatrix)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


We can't do much with a matrix of zeroes, so we need to change some of the values. Using what you've seen in the previous notebook about indexing, try to change the top-right value in 'testmatrix' to the value 4.

In [16]:
testmatrix = np.zeros((3, 4))
# Row 0 is the top row and column 3 is the last of the four columns
testmatrix[0, 3] = 4
print(testmatrix)

[[0. 0. 0. 4.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
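A short sketch of a few related indexing forms, assuming the same 3 x 4 matrix of zeroes. A single [row, column] index is the usual NumPy form, and 'testmatrix[0][3]' gives the same result.

```python
import numpy as np

testmatrix = np.zeros((3, 4))

# Row 0, column 3 is the top-right entry
testmatrix[0, 3] = 4

# Negative indices count from the end, so column -1 is the last column
print(testmatrix[0, -1])            # 4.0

# A colon selects everything along that axis: here, the whole last column
last_column = testmatrix[:, 3]
print(last_column.tolist())         # [4.0, 0.0, 0.0]
```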


We can also add a value to, subtract a value from, multiply, or divide every entry in the array at once using some simple code.

In [14]:
testmatrix += 2
testmatrix

array([[2., 2., 2., 6.],
       [2., 2., 2., 2.],
       [2., 2., 2., 2.]])

In [15]:
testmatrix *= 5
testmatrix

array([[10., 10., 10., 30.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.]])

In [16]:
testmatrix /= 20
testmatrix

array([[0.5, 0.5, 0.5, 1.5],
       [0.5, 0.5, 0.5, 0.5],
       [0.5, 0.5, 0.5, 0.5]])
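The operators above modify the array in place. As a minimal sketch of the alternative, the plain operators ('+', '*', '/') return a new array and leave the original unchanged. The names 'shifted', 'squared', 'counts', and 'halved' are illustrative.

```python
import numpy as np

testmatrix = np.zeros((3, 4))
testmatrix[0, 3] = 4

# Plain operators return a new array; testmatrix itself is untouched
shifted = testmatrix + 2
print(testmatrix[0, 3], shifted[0, 3])    # 4.0 6.0

# Two arrays of the same shape are combined element by element
squared = shifted * shifted
print(squared[0].tolist())                # [4.0, 4.0, 4.0, 36.0]

# Dividing an integer array gives a new float array
# (the in-place form, counts /= 2, raises an error for integer arrays)
counts = np.array([2, 4, 6])
halved = counts / 2
print(halved.tolist())                    # [1.0, 2.0, 3.0]
```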

We can also test each element in the array against a condition, which returns an array of True and False values with the same shape.

In [17]:
testmatrix > 1.

array([[False, False, False,  True],
       [False, False, False, False],
       [False, False, False, False]])
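A Boolean array like this can be used as a mask to count, select, or change the matching entries. The sketch below rebuilds the same values with 'np.full' so that it runs on its own; 'mask' and 'capped' are illustrative names.

```python
import numpy as np

testmatrix = np.full((3, 4), 0.5)
testmatrix[0, 3] = 1.5

mask = testmatrix > 1.0           # Boolean array, same shape as testmatrix

print(mask.sum())                 # 1  (each True counts as 1)
print(testmatrix[mask])           # [1.5]  the values where the mask is True

# Assigning through the mask changes only the selected entries
capped = testmatrix.copy()
capped[mask] = 1.0
print(capped.max())               # 1.0
```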

There are a lot more details online. A good place to start is: https://numpy.org/doc/stable/user/absolute_beginners.html.

## Data preprocessing challenges

You have various data preprocessing tasks to carry out on the arrays below. You should follow the instructions carefully and you may have to use your own judgment on what is a sensible decision.

Challenge 1: For the dataset below, change all the negative values to zeroes.

In [25]:
# A Boolean mask selects every negative entry at once, so no loop is needed
row1 = np.array([-99, 0, 5, 3, 1])
row1[row1 < 0] = 0

row2 = np.array([-1, -5, -7, -9, -1.5])
row2[row2 < 0] = 0

row3 = np.array([0, 5, 17.2, 3.5, 4.1])
row3[row3 < 0] = 0

row4 = np.array([-1, -1, 0.5, -1, 0.5])
row4[row4 < 0] = 0

challenge1 = np.stack((row1, row2, row3, row4))
print(challenge1)

[[ 0.   0.   5.   3.   1. ]
 [ 0.   0.   0.   0.   0. ]
 [ 0.   5.  17.2  3.5  4.1]
 [ 0.   0.   0.5  0.   0.5]]
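The same cleaning can be applied to the whole two-dimensional array in one step. The sketch below compares three equivalent options, assuming the four rows are first combined into one array; 'masked', 'replaced', and 'clipped' are illustrative names.

```python
import numpy as np

challenge1 = np.array([
    [-99, 0, 5, 3, 1],
    [-1, -5, -7, -9, -1.5],
    [0, 5, 17.2, 3.5, 4.1],
    [-1, -1, 0.5, -1, 0.5],
])

# Option 1: Boolean-mask assignment on a copy
masked = challenge1.copy()
masked[masked < 0] = 0

# Option 2: np.where builds a new array, 0 where negative, original elsewhere
replaced = np.where(challenge1 < 0, 0, challenge1)

# Option 3: np.clip with a lower bound of 0 and no upper bound
clipped = np.clip(challenge1, 0, None)

print(np.array_equal(masked, replaced))   # True
print(np.array_equal(masked, clipped))    # True
```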


Challenge 2: Remove all the 'x' characters from the dataset below.

In [28]:
# Mixing numbers and strings makes NumPy store every element as a string,
# so np.char.replace can strip the 'x' characters element by element.
example1 = np.array([123, 'xx456', '1264', 'x4x5x6'])
example1 = np.char.replace(example1, 'x', '')

example2 = np.array(['1xxx3xxx5xxx6', 5.75, 1.23, 'x3.573'])
example2 = np.char.replace(example2, 'x', '')

# Note: 0x1F is a hexadecimal integer literal (31), not a string containing 'x'
example3 = np.array(['xx123xxxx', 953, 0x1F, -99])
example3 = np.char.replace(example3, 'x', '')

example4 = np.array(['xx14xx57xx41', 3.57, 1245, 'xabcxx'])
example4 = np.char.replace(example4, 'x', '')

example5 = np.array([0, 'xx1xxx4x', 99, '1234xx5678'])
example5 = np.char.replace(example5, 'x', '')

challenge2 = np.stack((example1, example2, example3, example4, example5))
print(challenge2)

[['123' '456' '1264' '456']
 ['1356' '5.75' '1.23' '3.573']
 ['123' '953' '31' '-99']
 ['145741' '3.57' '1245' 'abc']
 ['0' '14' '99' '12345678']]
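Note that the cleaned values are still strings. If numbers are needed afterwards, one possible approach is sketched below, assuming that entries which cannot be parsed (such as 'abc') should become NaN. The helper 'to_number' is hypothetical and not part of NumPy.

```python
import numpy as np

def to_number(text):
    """Hypothetical helper: parse text as a float, or return NaN if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return np.nan

cleaned = np.array(['145741', '3.57', '1245', 'abc'])
numeric = np.array([to_number(item) for item in cleaned])

print(numeric.dtype)                 # float64
print(np.isnan(numeric).tolist())    # [False, False, False, True]
```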


Challenge 3: For the dataset below, you can see that we have counts of the number of animals in a sample with numbers of legs from 0 to 8. Clean the data by removing implausible values.

In [41]:
number_of_legs = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
# Judgment call: an odd number of legs is treated as implausible and set to 0
number_of_legs = np.where(number_of_legs % 2 != 0, 0, number_of_legs)

count_of_animals = np.array([54, 5, 172, 1, -99, 1, 42, -99, 12])
# Negative counts are impossible (-99 looks like a missing-value code), so set them to 0
count_of_animals[count_of_animals < 0] = 0
    
challenge3 = np.stack((number_of_legs, count_of_animals))
print(challenge3)

[[  0   0   2   0   4   0   6   0   8]
 [ 54   5 172   1   0   1  42   0  12]]
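A different judgment would be to drop implausible columns entirely rather than setting them to zero. The sketch below assumes that odd leg counts and negative animal counts are both implausible; the names 'plausible' and 'filtered' are illustrative.

```python
import numpy as np

number_of_legs = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
count_of_animals = np.array([54, 5, 172, 1, -99, 1, 42, -99, 12])

# Assumption: keep a column only if the leg count is even AND the count is not negative
plausible = (number_of_legs % 2 == 0) & (count_of_animals >= 0)

# Index both arrays with the same mask so the columns stay aligned
filtered = np.stack((number_of_legs[plausible], count_of_animals[plausible]))
print(filtered.tolist())   # [[0, 2, 6, 8], [54, 172, 42, 12]]
```

Filtering keeps a genuine zero distinguishable from a removed value, at the cost of changing the shape of the array.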
