APPF1 | Course Day 1 | 11.05.2018

# Storing and Operating on Data with NumPy

## NumPy: Numerical Python
* NumPy: Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
* NumPy documentation: https://docs.scipy.org/doc/ 
  * Use your NumPy version number to access the corresponding documentation

In [1]:
import numpy as np
np.__version__

'1.21.4'

* _Note_: We are going to use the `np` alias for the `numpy` module in all the code samples on the following slides

## NumPy Arrays
* Python's vanilla lists are heterogeneous: Each item in the list can be of a different data type
 * Comes at a cost: Each item in the list must contain its own type info and other information 
 * It is much more efficient to store data in a fixed-type array (all elements are of the same type)
* NumPy arrays are homogeneous: Each item in the array is of the same type
 * They are much more efficient for storing and manipulating data
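
A minimal sketch of the difference, using made-up variable names that are not part of the course material: a list keeps each item's own type, while `np.array()` converts all items to one common type.

```python
import numpy as np

# A plain Python list may mix types freely
mixed_list = [1, 2.5, "three"]
list_types = [type(item).__name__ for item in mixed_list]
print(list_types)    # ['int', 'float', 'str']

# NumPy stores one common type for all elements; the integers 1 and 2
# are upcast to float because 3.5 is a float
upcast = np.array([1, 2, 3.5])
print(upcast.dtype)  # float64
```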

## Creating NumPy Arrays
* Use the `np.array()` function to create a NumPy array from a Python list:

In [2]:
example = np.array([0,1,2,5])
example

array([0, 1, 2, 5])

## Multidimensional NumPy Arrays
* _One-dimensional_ array: we only need one coordinate to address a single item, namely an integer index
* _Multidimensional_ array: we now need multiple indices to address a single item
 * For an $n$-dimensional array we need up to $n$ indices to address a single item
 * We're going to mainly work with two-dimensional arrays in this course, i.e. $n=2$ 

In [3]:
twodim = np.array([[1,2,3],
                   [4,5,6],
                   [7,8,9]])
twodim

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

## Array Indexing
* Array indexing for one-dimensional arrays works as usual: `onedim[0]`
* Accessing items in a two-dimensional array requires you to specify two indices: `twodim[0,1]`
* First index is the row number (here `0`), second index is the column number (here `1`)
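
The indexing rules above can be sketched as follows. The name `onedim` is assumed here to hold a small one-dimensional array; `twodim` repeats the array defined earlier.

```python
import numpy as np

onedim = np.array([0, 1, 2, 5])
twodim = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

first_item = onedim[0]      # 0
row0_col1 = twodim[0, 1]    # 2 (row 0, column 1)
last_item = twodim[-1, -1]  # 9 (negative indices count from the end)
```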

## Objects in Python
* Almost everything in Python is an object, with its properties and methods
 * For example, a dictionary is an object that provides an `items()` method, which is always called on a specific dictionary object (that is, a value of the dictionary type)
* An object can also provide attributes next to methods, which may describe properties of the specific object
 * For example, for an array object it might be interesting to see how many elements it contains at the moment, so we might want to provide a size attribute storing information about this specific property
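
A short sketch of the difference between a method and an attribute. The dictionary content and variable names are invented for illustration.

```python
import numpy as np

ages = {"Ada": 36, "Alan": 41}
pairs = list(ages.items())   # Method: called with parentheses
print(pairs)                 # [('Ada', 36), ('Alan', 41)]

arr = np.array([0, 1, 2, 5])
n_elements = arr.size        # Attribute: accessed without parentheses
print(n_elements)            # 4
```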
 
### NumPy Array Attributes
* The type of a NumPy array is `numpy.ndarray` ($n$-dimensional array):

In [4]:
example = np.array([0,1,2,3])
type(example)

numpy.ndarray

* Useful array attributes
 * `ndim`: The number of dimensions, e.g. 2 for a two-dimensional array
 * `shape`: Tuple containing the size of each dimension
 * `size`: The total size of the array (total number of elements)
 * `dtype`: The data type of the array elements

In [5]:
rng = np.random.RandomState(41) # Ensure that the same random numbers are generated each time we run this code
x1 = rng.randint(10, size=6) # One-dimensional array
x2 = rng.randint(10, size=(3, 4)) # Two-dimensional array
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print("x2 dtype: ", x2.dtype)

x2 ndim:  2
x2 shape: (3, 4)
x2 size:  12
x2 dtype:  int32


## Creating Arrays from Scratch
* NumPy provides a wide range of functions for the creation of arrays:<br>
  https://docs.scipy.org/doc/numpy-1.15.4/reference/routines.array-creation.html#routines-array-creation 
 * For example: `np.arange`, `np.zeros`, `np.ones`, `np.linspace`, etc.
* NumPy also provides functions to create arrays filled with random data:<br>
  https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html
 * For example: `np.random.random`, `np.random.randint`, etc.

In [6]:
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [7]:
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [8]:
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [9]:
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [10]:
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

In [11]:
np.random.random((3, 3))

array([[0.70532995, 0.37307786, 0.36444508],
       [0.80001819, 0.13348613, 0.42542769],
       [0.51254239, 0.59744161, 0.47806316]])

In [12]:
np.random.randint(0, 10, (3, 3))

array([[3, 5, 8],
       [3, 9, 5],
       [8, 6, 9]])

## NumPy Data Types
* Use the keyword `dtype` to specify the data type of the array elements:

In [13]:
floats = np.array([0,1,2,3], dtype="float32")
floats

array([0., 1., 2., 3.], dtype=float32)

 * Overview of available data types: https://docs.scipy.org/doc/numpy-1.15.4/user/basics.types.html 
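
Because the data type is fixed, values assigned later are converted to it. The following sketch (variable names are illustrative) shows that a float written into an integer array is silently truncated, and that `astype()` returns a converted copy.

```python
import numpy as np

ints = np.array([1, 2, 3])       # Integer array (exact integer width depends on the platform)
ints[0] = 3.9                    # The fractional part is silently dropped
print(ints)                      # [3 2 3]

as_float = ints.astype("float32")  # New array with a different data type
print(as_float.dtype)              # float32
```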

## Array Slicing: One-Dimensional Subarrays
* The NumPy slicing syntax follows that of the standard Python list: `x[start:stop:step]`

In [14]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
x[:5]

array([0, 1, 2, 3, 4])

In [16]:
x[5:]

array([5, 6, 7, 8, 9])

In [17]:
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
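
A few more combinations of `start`, `stop`, and `step`, as a sketch on the same kind of array (the result names are illustrative):

```python
import numpy as np

x = np.arange(10)

middle = x[4:7]             # [4 5 6]: from index 4 up to, but excluding, index 7
evens = x[::2]              # [0 2 4 6 8]: every other element
odds = x[1::2]              # [1 3 5 7 9]: every other element, starting at index 1
reversed_from_5 = x[5::-2]  # [5 3 1]: backwards from index 5 in steps of 2
```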

## Array Slicing: Multidimensional Subarrays
* Let `x2` be a two-dimensional NumPy array. Multiple slices are now separated by commas: `x2[start:stop:step, start:stop:step]`

In [18]:
x2

array([[9, 7, 5, 8],
       [3, 3, 2, 6],
       [0, 4, 6, 9]])

In [19]:
x2[:2, :3]

array([[9, 7, 5],
       [3, 3, 2]])

In [20]:
x2[:3, ::2] # All rows, every other column

array([[9, 5],
       [3, 2],
       [0, 6]])

In [21]:
x2[:, 0] # Select the first column of x2

array([9, 3, 0])

In [22]:
x2[1, :] # Select the second row of x2

array([3, 3, 2, 6])

In [23]:
x2[1] # Select the second row of x2

array([3, 3, 2, 6])

## Array Views and Copies
* With Python lists, slices are _copies_: If we modify the slice, only the copy gets changed
* With NumPy arrays, the slices will be _direct views_: If we modify the subarray, the original array gets changed, too
 * Very useful: When working with large datasets, we don't need to copy any data (costly operation)
* Creating copies: We can use the `copy()` method of a slice to create a copy of the specific subarray
 * Note: The type of a slice is again `numpy.ndarray`

In [24]:
x2_sub_copy = x2[:2, :2].copy()
x2_sub_copy

array([[9, 7],
       [3, 3]])

In [25]:
x2_sub_copy[0, 0] = 42

In [26]:
x2

array([[9, 7, 5, 8],
       [3, 3, 2, 6],
       [0, 4, 6, 9]])

In [27]:
x2_sub_copy

array([[42,  7],
       [ 3,  3]])
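
The cells above cover the copy case. As a complementary sketch with illustrative names, the following compares a list slice (a copy) with an array slice (a view):

```python
import numpy as np

numbers = [1, 2, 3, 4]
list_slice = numbers[:2]     # List slices are copies
list_slice[0] = 42
print(numbers)               # [1, 2, 3, 4] (unchanged)

arr = np.array([1, 2, 3, 4])
arr_slice = arr[:2]          # Array slices are views
arr_slice[0] = 42
print(arr)                   # [42  2  3  4] (changed through the view)
```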

## Reshaping
* We can use the `reshape()` method of a NumPy array to get an array with the same data in a different shape:

In [28]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


* For this to work, the size of the initial array must match the size of the reshaped array
* _Important_: `reshape()` will return a new view if possible; otherwise, it will be a copy
 * In case of a view, if you change an entry of the reshaped array, it will also change the initial array
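
A small sketch of both points, with illustrative names: the reshaped array here is a view on the original data, and a size mismatch raises an error. `np.shares_memory()` is used only to check whether the two arrays use the same data.

```python
import numpy as np

flat = np.arange(6)
grid = flat.reshape((2, 3))          # 6 elements -> 2 x 3, sizes match
grid[0, 0] = 99                      # Modify the reshaped array ...
print(flat)                          # [99  1  2  3  4  5] ... and the original changes too
print(np.shares_memory(flat, grid))  # True

# 6 elements cannot be arranged as 4 x 2 (8 elements)
try:
    flat.reshape((4, 2))
except ValueError as err:
    print("reshape failed:", err)
```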

## Array Concatenation and Splitting
* Concatenation, or joining of two or more arrays in NumPy, can be accomplished through the functions `np.concatenate`, `np.vstack`, and `np.hstack`
 * Join multiple two-dimensional arrays: `np.concatenate([twodim1, twodim2,…], axis=0)`
   * A two-dimensional array has two axes: The first running vertically downwards across rows (axis `0`), and the second running horizontally across columns (axis `1`)
* The opposite of concatenation is splitting, which is provided by the functions `np.split`, `np.hsplit` (split horizontally), and `np.vsplit` (split vertically)
 * For each of these we can pass a list of indices giving the split points

In [29]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

In [30]:
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [31]:
np.concatenate([grid, grid], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

In [32]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

np.vstack([x, grid])

array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])

In [33]:
y = np.array([[99],
              [99]])

np.hstack([grid, y])

array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])

In [34]:
grid = np.arange(16).reshape((4, 4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [35]:
upper, lower = np.vsplit(grid, [2])

In [36]:
upper

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [37]:
lower

array([[ 8,  9, 10, 11],
       [12, 13, 14, 15]])
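
The other two splitting functions work the same way. A sketch with illustrative names: `np.split` with two split points yields three subarrays, and `np.hsplit` splits a grid into a left and a right part.

```python
import numpy as np

values = np.array([1, 2, 3, 99, 99, 3, 2, 1])
first, middle, last = np.split(values, [3, 5])  # Split before index 3 and before index 5
print(first, middle, last)                      # [1 2 3] [99 99] [3 2 1]

grid = np.arange(16).reshape((4, 4))
left, right = np.hsplit(grid, [2])              # Columns 0-1 and columns 2-3
print(left.shape, right.shape)                  # (4, 2) (4, 2)
```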

## Faster Operations Instead of Slow `for` Loops
* Looping over arrays to operate on each element can be quite slow in Python
* One reason the `for` loop approach is so slow is the type-checking and function dispatch that must be done in each iteration of the loop
 * Python needs to examine the object's type and do a dynamic lookup of the correct function to use for that type

In [38]:
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 10, size=5)

compute_reciprocals(values)

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [39]:
big_array = np.random.randint(1, 100, size=10000)
%timeit compute_reciprocals(big_array)

13 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## NumPy's Universal Functions
* NumPy provides very fast, vectorized operations which are implemented via _universal functions_ (ufuncs), whose main purpose is to quickly execute repeated operations on values in NumPy arrays
 * A _vectorized operation_ is written as a single operation on the whole array; NumPy then applies it to each element
* Instead of computing the reciprocal using a `for` loop, let's do it by using a universal function:

In [40]:
%timeit (1.0 / big_array)

10.7 µs ± 50.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


 * We can use ufuncs to apply an operation between a scalar and an array, but we can also operate between two arrays

In [41]:
np.array([4,5,6]) / np.array([1,2,3])

array([4. , 2.5, 2. ])
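
The arithmetic operators are shorthand for ufuncs such as `np.add`, `np.multiply`, and `np.divide`. A brief sketch (variable names are illustrative) showing that both spellings give the same result:

```python
import numpy as np

a = np.array([4, 5, 6])
b = np.array([1, 2, 3])

quotient_op = a / b               # Operator form, array with array
quotient_ufunc = np.divide(a, b)  # Equivalent ufunc call
scaled = np.multiply(a, 10)       # Scalar with array, same as a * 10

print(quotient_ufunc)             # [4.  2.5 2. ]
print(scaled)                     # [40 50 60]
```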

## Advanced Ufunc Features: Specifying Output and Aggregates
* ufuncs provide a few specialized features
* We can specify where to store a result via the `out` argument (useful for large calculations)
* If no `out` argument is provided, a newly-allocated array is returned (can be costly memory-wise)

In [42]:
x = np.random.random(10)
y = np.zeros(10)
np.multiply(x, 3, out=y)  # Write the result directly into y

array([2.06093744, 0.69318977, 0.38218168, 0.92366056, 0.05800874,
       1.89006749, 0.99872665, 2.9811336 , 0.29351804, 2.68465283])
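
The `out` argument also accepts a view, so results can be written into part of an existing array. A sketch with illustrative values: the powers of two are stored in every other element of `y`.

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# Compute 2 ** x and write the five results into positions 0, 2, 4, 6, 8 of y
np.power(2, x, out=y[::2])
# y now holds 1, 0, 2, 0, 4, 0, 8, 0, 16, 0 (as floats)
```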

* _Reduce_: Repeatedly apply a given operation to the elements of an array until only one single result remains
 * For example, `np.add.reduce(x)` applies addition to the elements until the one result remains, namely the sum of all elements
* _Accumulate_: Almost the same as reduce, but also stores the intermediate results of the computation

In [43]:
x = np.array([1,2,3,4,5])
np.add.reduce(x)

15

In [44]:
x = np.array([1,2,3,4,5])
np.add.accumulate(x)

array([ 1,  3,  6, 10, 15], dtype=int32)
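
`reduce` and `accumulate` are available on other binary ufuncs as well. As a sketch, the same array with `np.multiply` gives the product and the running products:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

product = np.multiply.reduce(x)              # 1 * 2 * 3 * 4 * 5 = 120
running_product = np.multiply.accumulate(x)  # [1, 2, 6, 24, 120]

# np.add.reduce gives the same value as the np.sum aggregation function
same_as_sum = np.add.reduce(x) == np.sum(x)  # True
```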

## Aggregations
* If we want to compute summary statistics for the data in question, aggregates are very useful
  * Common summary statistics: mean, standard deviation, median, minimum, maximum, quantiles, etc.
* NumPy provides fast built-in aggregation functions for working with arrays:

In [45]:
x = np.random.random(10000)
%timeit np.max(x) # NumPy aggregation function
%timeit max(x)    # Python built-in function

5.29 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
337 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


* Summing values in an array:

In [46]:
%timeit np.sum(x) # NumPy aggregation function
%timeit sum(x)    # Python built-in function

5.63 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
527 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Multidimensional Aggregates
* By default, each NumPy aggregation function will return the aggregate over the entire array
* Aggregation functions take an additional argument specifying the axis along which the aggregate is computed
 * For example, we can find the minimum value within each column by specifying `axis=0`:

In [47]:
twodim = np.array([[1,2,3],[0.12, -1, 0.41],[10,9,8]])
twodim.min(axis=0)

array([ 0.12, -1.  ,  0.41])
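
For comparison, a sketch of the other cases on the same array (result names are illustrative): no `axis` argument aggregates over the whole array, and `axis=1` aggregates within each row.

```python
import numpy as np

twodim = np.array([[1, 2, 3], [0.12, -1, 0.41], [10, 9, 8]])

overall_min = twodim.min()    # -1.0, minimum of the entire array
row_max = twodim.max(axis=1)  # [3, 0.41, 10], one maximum per row
col_sum = twodim.sum(axis=0)  # One sum per column, approximately [11.12, 10, 11.41]
```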

## Comparison Operators as ufuncs
* NumPy also implements comparison operators as element-wise ufuncs
* The result of these comparison operators is always an array with a Boolean data type:

In [48]:
np.array([1,2,3]) < 2

array([ True, False, False])

* It is also possible to do an element-by-element comparison of two arrays:

In [49]:
np.array([1,2,3]) < np.array([0,4,2])

array([False,  True, False])

## Working with Boolean Arrays: Counting Entries
* The `np.count_nonzero()` function will count the number of `True` entries in a Boolean array

In [58]:
nums = np.array([1,2,3,4,5])
np.count_nonzero(nums < 4)

3

* We can also use the `np.sum()` function to accomplish the same. In this case, `True` is interpreted as `1` and `False` as `0`:

In [59]:
np.sum(nums < 4)

3

* NumPy also implements bitwise logic operators as element-wise ufuncs
* We can use these bitwise logic operators to construct compound conditions (consisting of multiple conditions)

In [60]:
(nums < 2) | (nums > 3)

array([ True, False, False,  True,  True])
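
Besides `|` (or), the operators `&` (and) and `~` (not) work the same way. A short sketch on the same array; the parentheses are needed because the bitwise operators bind more tightly than the comparisons.

```python
import numpy as np

nums = np.array([1, 2, 3, 4, 5])

between = (nums > 1) & (nums < 5)         # [False, True, True, True, False]
not_small = ~(nums < 3)                   # [False, False, True, True, True]
n_between = np.count_nonzero(between)     # 3
```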

## Boolean Arrays as Masks
* In the previous slides we looked at aggregates computed directly on Boolean arrays
* Once we have a Boolean array from, say, a comparison, we can select the entries that meet the condition by using the Boolean array as a _mask_

In [53]:
x = np.array([[3,1,5],[10,32,100],[-1,3,4]])
x[x<5]

array([ 3,  1, -1,  3,  4])
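
A mask can also feed an aggregate or be used on the left-hand side of an assignment. A sketch on the same array; the threshold of 30 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([[3, 1, 5], [10, 32, 100], [-1, 3, 4]])

mask = x < 5
mean_small = x[mask].mean()   # Mean of 3, 1, -1, 3, 4 -> 2.0

clipped = x.copy()            # Work on a copy so that x stays unchanged
clipped[clipped > 30] = 30    # Replace every entry above 30 by 30
print(clipped[1])             # [10 30 30]
```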

## Reading and Writing Data with NumPy

* We can use the `np.savetxt()` function to save NumPy data to a file
* We can use the `np.loadtxt()` function to load data from a file
  * *Remember*: We can only store elements of a single type in a NumPy array
* Use the shell commands `!ls` and `!pwd` and the `%cd` magic within our notebook to navigate the file system if necessary (`!cd` only affects a temporary subshell)
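
A minimal round-trip sketch of `np.savetxt()` and `np.loadtxt()`. The temporary directory, the file name `example.txt`, and the column names are assumptions made for this illustration only.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.25], [3.0, 4.125]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # Hypothetical file name
    # The header is written as a comment line starting with "# "
    np.savetxt(path, data, fmt="%.4f", delimiter=",", header="a, b")
    # Lines starting with "#" are treated as comments and skipped on loading
    loaded = np.loadtxt(path, delimiter=",")

print(loaded.shape)   # (2, 2)
```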

### Split-Up Data Example
1. First, we generate some data and store it in multiple files
2. Next, we read the same split-up data back into a NumPy array
3. **Note**: Please create a `smarthome` folder within the `datasets` folder; we are going to store the files there


In [54]:
def generate_split_data():
    seconds_in_a_day = 24 * 60 * 60 - 1  # 86399 samples per day
    data_size = (seconds_in_a_day,1)

    days = np.arange(1,31)

    rng = np.random.RandomState(42)

    for day in np.nditer(days):
        fridge_temperature = rng.normal(loc=5, scale=2.0, size=data_size)
        room_temperature = rng.normal(loc=20, scale=3.0, size=data_size)
        outside_temperature = rng.normal(loc=10, scale=2.0, size=data_size)
        data = np.concatenate((outside_temperature, room_temperature, fridge_temperature), axis=1) # Concatenate column-wise
        # Important: use :02d since it allows us to sort the filenames
        np.savetxt("./datasets/smarthome/day_{:02d}.txt".format(day), data, fmt="%.4f", delimiter=",", header="outside_temperature_celsius, room_temperature_celsius, fridge_temperature_celsius")
        print("day #{:02d} done".format(day))
    
    print("All files have been successfully created!")
    return data  # Data of the last generated day (day 30) only
        
data = generate_split_data()

day #01 done
day #02 done
day #03 done
day #04 done
day #05 done
day #06 done
day #07 done
day #08 done
day #09 done
day #10 done
day #11 done
day #12 done
day #13 done
day #14 done
day #15 done
day #16 done
day #17 done
day #18 done
day #19 done
day #20 done
day #21 done
day #22 done
day #23 done
day #24 done
day #25 done
day #26 done
day #27 done
day #28 done
day #29 done
day #30 done
All files have been successfully created!


In [55]:
from os import listdir
from os.path import isfile, join

def read_split_data():
    files_dir = "./datasets/smarthome"
    # Only pick up the generated day files and skip anything else (e.g. .gitignore)
    files = sorted(f for f in listdir(files_dir) if f.endswith(".txt"))
    
    all_data = np.empty((0,3), dtype=np.float64)
    
    for f in files:
        # skiprows=1 skips the header line written by np.savetxt()
        new_data = np.loadtxt(join(files_dir, f), skiprows=1, delimiter=",")
        all_data = np.vstack((all_data, new_data))
        print("Done with loading {}".format(f))
        
    print("Data shape: {}".format(all_data.shape))
    # Note: only the data of the last file is returned, not all_data
    return new_data

new_data = read_split_data()

Done with loading day_01.txt
Done with loading day_02.txt
Done with loading day_03.txt
Done with loading day_04.txt
Done with loading day_05.txt
Done with loading day_06.txt
Done with loading day_07.txt
Done with loading day_08.txt
Done with loading day_09.txt
Done with loading day_10.txt
Done with loading day_11.txt
Done with loading day_12.txt
Done with loading day_13.txt
Done with loading day_14.txt
Done with loading day_15.txt
Done with loading day_16.txt
Done with loading day_17.txt
Done with loading day_18.txt
Done with loading day_19.txt
Done with loading day_20.txt
Done with loading day_21.txt
Done with loading day_22.txt
Done with loading day_23.txt
Done with loading day_24.txt
Done with loading day_25.txt
Done with loading day_26.txt
Done with loading day_27.txt
Done with loading day_28.txt
Done with loading day_29.txt
Done with loading day_30.txt
Data shape: (2591970, 3)


In [65]:
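# The files store only four decimal places (fmt="%.4f"), so the loaded
# values match the generated ones only up to that rounding error
print(np.allclose(data, new_data, atol=1e-4))
print(np.allclose(data, new_data, atol=1e-5))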

True
False


## Reading CSV Data with NumPy
* Some CSV data contains a mix of numbers and strings, or might have missing values
* We can use the `np.genfromtxt()` function to load mixed data from such a file into a NumPy array

In [57]:
# Let's play around with the FIFA 2019 player statistics data set and see how we can work with mixed data
fifa_data = np.genfromtxt(open("./datasets/fifa/data.csv", "r", encoding="utf8"), delimiter=",", skip_header=1,
                     missing_values=-1, usecols=(1,2,3,7,8), 
                     dtype=[("ID",int),("Name","U50"),("Age",int),("Overall rating",int),("Overall potential",int)])
fifa_data

array([(158023, 'L. Messi', 31, 94, 94),
       ( 20801, 'Cristiano Ronaldo', 33, 94, 94),
       (190871, 'Neymar Jr', 26, 92, 93), ...,
       (241638, 'B. Worman', 16, 47, 67),
       (246268, 'D. Walker-Rice', 17, 47, 66),
       (246269, 'G. Nugent', 16, 46, 66)],
      dtype=[('ID', '<i4'), ('Name', '<U50'), ('Age', '<i4'), ('Overall_rating', '<i4'), ('Overall_potential', '<i4')])
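
The result is a structured array whose columns can be addressed by field name. The following sketch uses a tiny made-up CSV text with invented players instead of the FIFA file, so that it runs without the data set:

```python
import io

import numpy as np

# Hypothetical stand-in for a CSV file with mixed column types
csv_text = io.StringIO(
    "ID,Name,Age\n"
    "1,Ann Example,31\n"
    "2,Bob Sample,26\n"
)

players = np.genfromtxt(csv_text, delimiter=",", skip_header=1,
                        dtype=[("ID", int), ("Name", "U50"), ("Age", int)],
                        encoding="utf8")

names = players["Name"]                       # Select one column by its field name
young = players[players["Age"] < 30]["Name"]  # Combine a Boolean mask with field access
print(young)                                  # ['Bob Sample']
```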