## Data in Python

We will look at more sophisticated ways to import data in future practicals; for now, we will stick with NumPy arrays.

In [4]:
import numpy as np

Let's look at a function to import a CSV (comma-separated values) file.

First, let's specify where to find the data.

In [29]:
csv_file = 'Data/fib.csv'

Now we need to convert this CSV file into a NumPy array:

In [30]:
values = np.genfromtxt(csv_file, delimiter = ',', dtype = '|U')
print(values)

['1' ' 1' ' 2' ' 3' ' 5' ' 8' ' 13']


Let's dissect this call in order to understand it.

The work is done by the `np.genfromtxt` function, which takes a text file and turns it into a NumPy `ndarray`, the same type of object that we have been working with previously.

`csv_file` This is the previously defined string that contains the path to your data.

`delimiter` This tells the function what is separating values in your file. In our case, we have commas between all of the values that are important. We set `delimiter = ','`.

`dtype` The datatype. We won't worry about this just yet.
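
To see how these arguments interact, here is a minimal sketch that reads the same kind of data from an in-memory buffer rather than from `Data/fib.csv`. The string `text` is a hypothetical stand-in for the file contents, and `autostrip` is an extra `np.genfromtxt` argument that the practical itself does not use; it removes the spaces that otherwise stay attached to each value.

```python
from io import StringIO

import numpy as np

# Hypothetical stand-in for the contents of a CSV file like the one above
text = "1, 1, 2, 3, 5, 8, 13"

# Same call as in the practical, but reading from a buffer instead of a path
raw = np.genfromtxt(StringIO(text), delimiter=',', dtype='|U')

# autostrip=True removes the whitespace around each value
stripped = np.genfromtxt(StringIO(text), delimiter=',', dtype='|U', autostrip=True)

# A numeric dtype converts the values while they are being loaded
numbers = np.genfromtxt(StringIO(text), delimiter=',', dtype=float)

print(raw)
print(stripped)
print(numbers)
```

Comparing the first two printed arrays shows why some of the values above carry a leading space: the delimiter is the comma only, so anything else between commas is kept as part of the value.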

In [22]:
print(type(values))

<class 'numpy.ndarray'>


In [31]:
print(values.shape)
print(values[0])
print(values[4])

(7,)
1
 5


We will be calling this quite often, so let's turn it into a function:

In [85]:
def load_data(csv_file, datatype='|U'):
    # datatype lets us quickly load the data with a different dtype
    values = np.genfromtxt(csv_file, delimiter=',', dtype=datatype)
    print(values)
    return values

### Two-dimensional data

Now we will look at an example where the data is two-dimensional. There are four coders, A, B, C and D, who have each written some code. We will evaluate the number of lines and the number of errors in their code.

In [86]:
csv2 = 'Data/coders.csv'
values = load_data(csv2)

[['' 'A' 'B' 'C' 'D']
 ['Lines of code' '147' '45' '334' '112']
 ['Number of errors' '15' '6' '28' '22']]


In [68]:
print(type(values))

<class 'numpy.ndarray'>


The first row and column are unimportant to us, so let's remove them using slicing, as we saw in the previous notebook:

In [69]:
values = values[1:,1:]
print(values)

[['147' '45' '334' '112']
 ['15' '6' '28' '22']]
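
If the labels might be needed later, the same slicing can pull them out before they are discarded. The sketch below rebuilds the table by hand so that it runs on its own; the names `table`, `col_labels`, `row_labels` and `data` are illustrative and do not appear elsewhere in this practical.

```python
import numpy as np

# The table as loaded above, rebuilt by hand so the example is self-contained
table = np.array([
    ['', 'A', 'B', 'C', 'D'],
    ['Lines of code', '147', '45', '334', '112'],
    ['Number of errors', '15', '6', '28', '22'],
])

col_labels = table[0, 1:]   # first row, skipping the empty corner cell
row_labels = table[1:, 0]   # first column, skipping the empty corner cell
data = table[1:, 1:]        # everything except the first row and column

print(col_labels)
print(row_labels)
print(data.shape)
```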


Now let's try to get some useful statistics about these rows:

In [70]:
np.mean(values[0])

TypeError: cannot perform reduce with flexible type

We get a `TypeError`. This is because when we loaded the data we used `dtype = '|U'`, which produced an array full of strings. That originally allowed us to keep the column and row labels in the table. However, we have since removed them, so let's convert the data into a suitable numeric type.

In [71]:
values = np.array(values, dtype='float32')
print(values)

[[147.  45. 334. 112.]
 [ 15.   6.  28.  22.]]
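
An alternative way to do the same conversion, shown here only as a sketch, is the array's own `astype` method. The names `strings` and `as_float` are illustrative, and the data is typed in by hand so the example runs without the CSV file.

```python
import numpy as np

# String data like the array above, typed in by hand
strings = np.array([['147', '45', '334', '112'],
                    ['15', '6', '28', '22']])

# astype returns a new array; the original string array is left unchanged
as_float = strings.astype('float32')

print(strings.dtype.kind)   # 'U' marks a unicode string array
print(as_float.dtype)
print(as_float)
```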


Now let's try again:

In [72]:
np.mean(values[0]) # Mean of the first row

159.5

In [73]:
np.std(values[0]) # Standard deviation of the first row

107.20657

In [74]:
np.mean(values[1]) # Mean of the second row

17.75

In [75]:
np.std(values[1]) # Standard deviation of the second row

8.196798
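
Rather than calling each function once per row, both rows can be handled in one call with the `axis` argument. This is a sketch on a hand-typed copy of the data; `row_means`, `row_stds` and `row_stds_sample` are illustrative names. Note that `np.std` computes the population standard deviation by default (`ddof=0`); passing `ddof=1` gives the sample standard deviation instead.

```python
import numpy as np

# Hand-typed copy of the numeric array used above
values = np.array([[147, 45, 334, 112],
                   [15, 6, 28, 22]], dtype='float32')

# axis=1 reduces along the columns, giving one result per row
row_means = np.mean(values, axis=1)
row_stds = np.std(values, axis=1)

# ddof=1 divides by (n - 1) instead of n
row_stds_sample = np.std(values, axis=1, ddof=1)

print(row_means)
print(row_stds)
print(row_stds_sample)
```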

Finally, let's try to find how many errors per line each coder makes:

In [76]:
errors_per_line = values[1]/values[0]
print(errors_per_line)

[0.10204082 0.13333334 0.08383234 0.19642857]


In [77]:
values = np.vstack((values, errors_per_line))

In [78]:
print(values)

[[1.4700000e+02 4.5000000e+01 3.3400000e+02 1.1200000e+02]
 [1.5000000e+01 6.0000000e+00 2.8000000e+01 2.2000000e+01]
 [1.0204082e-01 1.3333334e-01 8.3832338e-02 1.9642857e-01]]
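
NumPy switches to scientific notation here because the array now mixes large and small values. If that is hard to read, one option is to change the print settings temporarily; the sketch below uses `np.printoptions` on a hand-typed copy of the data, and `formatted` is an illustrative name.

```python
import numpy as np

# Hand-typed copy of the lines and errors, with the ratio stacked underneath
values = np.array([[147, 45, 334, 112],
                   [15, 6, 28, 22]], dtype='float32')
values = np.vstack((values, values[1] / values[0]))

# suppress=True asks for fixed-point notation; precision limits the decimals shown
with np.printoptions(suppress=True, precision=4):
    formatted = str(values)

print(formatted)
```

Only the printed representation changes; the numbers stored in `values` are unaffected.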


## Finding instances in your data

In the final exercise in this practical, you will need `np.where`.

This allows us to find certain values within an array. For example, if you have an array called `arr` and you wish to find all of the elements with value `1` in `arr`, you would use:
> `np.where(arr==1)`

Let's look at this in practice:

In [81]:
arr = load_data('Data/where.csv')

[['1' '2' '-1' '4' '3']
 ['3' '2' '1' '1' '4']
 ['1' '2' '3' '4' '1']]


Note that we have strings again, not numeric values (ints or floats). Instead of just converting the data like we did before, let's see how we can load the data properly:

In [88]:
arr = load_data('Data/where.csv', datatype = 'int')

[[ 1  2 -1  4  3]
 [ 3  2  1  1  4]
 [ 1  2  3  4  1]]


In [90]:
print(np.where(arr == 1))

(array([0, 1, 1, 2, 2], dtype=int64), array([0, 2, 3, 0, 4], dtype=int64))


This is a little confusing at first glance. `np.where` returns one array of row indices and one array of column indices; if you pair them up element by element, you get the coordinates of the five places where `1` appears in `arr`:

> $\left[\begin{matrix}
\text{Row} & \text{Column}\\
0 & 0 \\
1 & 2 \\
1 & 3 \\
2 & 0 \\
2 & 4 
\end{matrix}\right]$
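
The pairing shown in the matrix above can also be built in code. The following sketch stacks the two index arrays side by side with `np.column_stack`, and compares the result with `np.argwhere`, which returns the coordinates in this row-by-row form directly. The array is typed in by hand, and `rows`, `cols` and `pairs` are illustrative names.

```python
import numpy as np

# Hand-typed copy of the array loaded from 'Data/where.csv'
arr = np.array([[1, 2, -1, 4, 3],
                [3, 2, 1, 1, 4],
                [1, 2, 3, 4, 1]])

rows, cols = np.where(arr == 1)         # two index arrays, one per dimension
pairs = np.column_stack((rows, cols))   # one (row, column) pair per match

print(pairs)
print(np.array_equal(pairs, np.argwhere(arr == 1)))
```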

To change all instances of `1` in `arr` to `0`, we can very quickly use:

In [93]:
arr[np.where(arr == 1)] = 0
print(arr)

[[ 0  2 -1  4  3]
 [ 3  2  0  0  4]
 [ 0  2  3  4  0]]
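
Two closely related ways of writing the same replacement are sketched below on a hand-typed copy of the array. The first indexes with the boolean mask directly, and the second uses the three-argument form of `np.where`, which builds a new array and leaves the original untouched. The names `masked` and `replaced` are illustrative.

```python
import numpy as np

# Hand-typed copy of the array before the replacement
arr = np.array([[1, 2, -1, 4, 3],
                [3, 2, 1, 1, 4],
                [1, 2, 3, 4, 1]])

# Boolean-mask indexing: modifies the copy in place
masked = arr.copy()
masked[masked == 1] = 0

# np.where(condition, x, y): take x where the condition holds, otherwise y
replaced = np.where(arr == 1, 0, arr)

print(masked)
print(np.array_equal(masked, replaced))
```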
