# Introduction to `numpy` - Part 2

This Notebook provides an overview of the capabilities of the `numpy` module and explains why it is good to understand how to manipulate arrays. It covers Sect. II of [Modules_in__python.ipynb](Modules_in__python.ipynb). 

## Table of Contents

- [II. Numpy](#II)
    * [II.1 Array Definition and construction](Modules_in__python_numpy_Part1.ipynb/#II.1)
    * [II.2 Array copies and views](#II.2)
    * [II.3 Shape manipulation](#II.3)
    * [II.4 What makes numpy Arrays useful structures ?](#II.4)
        - [II.4.1 ufunc](#II.4.1)
        - [II.4.2 Aggregation](#II.4.2)
        - [II.4.3 Broadcasting](#II.4.3)
        - [II.4.4 Slicing, masking, fancy indexing](#II.4.4)
    * [II.5 Reading arrays from a file and string formatting](#II.5)
    * [II.6 Useful Numpy functions](#II.6)
    * [II.7 Summary](#II.7)
    * [II.8 References](#VI)

In [1]:
import numpy as np
import astropy

#### Intermezzo: 

Some commands that we did not discuss in Part 1 but which are useful. 

In [13]:
# illustration of the command np.arange not discussed in the previous lecture 
np.arange(-3, 3.1, 0.2)  # start, end, step

array([-3.00000000e+00, -2.80000000e+00, -2.60000000e+00, -2.40000000e+00,
       -2.20000000e+00, -2.00000000e+00, -1.80000000e+00, -1.60000000e+00,
       -1.40000000e+00, -1.20000000e+00, -1.00000000e+00, -8.00000000e-01,
       -6.00000000e-01, -4.00000000e-01, -2.00000000e-01,  2.66453526e-15,
        2.00000000e-01,  4.00000000e-01,  6.00000000e-01,  8.00000000e-01,
        1.00000000e+00,  1.20000000e+00,  1.40000000e+00,  1.60000000e+00,
        1.80000000e+00,  2.00000000e+00,  2.20000000e+00,  2.40000000e+00,
        2.60000000e+00,  2.80000000e+00,  3.00000000e+00])

In [14]:
# Another way to generate an array w. regularly spaced values 
np.linspace(-3, 3, 10)

array([-3.        , -2.33333333, -1.66666667, -1.        , -0.33333333,
        0.33333333,  1.        ,  1.66666667,  2.33333333,  3.        ])

In [15]:
# And now for regularly spaced values in log space (e.g. between 10^2 and 10^3)
np.logspace?
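
The help call above does not show a result, so here is a minimal sketch of what `np.logspace` returns for that range. The choice of 5 points is an assumption made for illustration. Note that `start` and `stop` are exponents of the base (10 by default), whereas `np.geomspace` takes the end values themselves.

```python
import numpy as np

# 5 values regularly spaced in log10 between 10**2 and 10**3
x = np.logspace(2, 3, 5)          # start and stop are exponents of 10
y = np.geomspace(100, 1000, 5)    # same grid, specified by its end values

print(x)
print(np.allclose(x, y))          # both calls describe the same grid
```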

In [16]:
# How to do a keyword search based on docstrings in numpy
# (np.lookfor was removed in NumPy 2.0; this cell requires an older version)
np.lookfor('product')

Search results for 'product'
----------------------------
numpy.product
    Return the product of array elements over a given axis.
numpy.dot
    Dot product of two arrays. Specifically,
numpy.cumproduct
    Return the cumulative product over the given axis.
numpy.kron
    Kronecker product of two arrays.
numpy.prod
    Return the product of array elements over a given axis.
numpy.vdot
    Return the dot product of two vectors.
numpy.cross
    Return the cross product of two (arrays of) vectors.
numpy.inner
    Inner product of two arrays.
numpy.outer
    Compute the outer product of two vectors.
numpy.matmul
    Matrix product of two arrays.
numpy.cumprod
    Return the cumulative product of elements along a given axis.
numpy.nanprod
    Return the product of array elements over a given axis treating Not a
numpy.polymul
    Find the product of two polynomials.
numpy.corrcoef
    Return Pearson product-moment correlation coefficients.
numpy.tensordot
    Compute tensor dot product alon

## II. `numpy`:  <a class="anchor" id="II"></a>

`numpy` can be seen as the implementation of mathematical functions and operations for the Python language. It also introduces one key object: the `array`. 

What are typical examples of astronomical data that could be cast into 1D, 2D or 3D arrays?

### II.1 `array` definition and construction:  <a class="anchor" id="II.1"></a>

See [Modules_in__python_numpy_Part1.ipynb](Modules_in__python_numpy_Part1.ipynb)

### II.2 `array` copies and views:   <a class="anchor" id="II.2"></a>

A `copy` of an array `a` can be made e.g. with `b = a.copy()`. Modifying `b` will then NOT affect `a`. The copy allocates additional space in memory.

When an array `view` is created, the original array is not copied in memory. A view of `a` can be created with `b = a.view()`. Modifying `b` then also modifies `a`.

Cases where a view can be useful:
- A *slicing* operation creates a **view** on the original array. This can be useful to repair a corrupted part of an image (cast into a 2D numpy array) by working on the slice only. The changes then also apply to the original (larger) image.
- If you are only interested in a specific part of the data (i.e. part of your numpy array) and do not plan to use the original data any more, you can slice it (which creates a view, i.e. does not allocate more memory) and just continue manipulating the view.
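
The difference between a copy and a view can be checked directly. The sketch below uses a small made-up array; `np.shares_memory` is used here only as a convenient way to test whether two arrays point to the same data.

```python
import numpy as np

a = np.arange(6)          # [0 1 2 3 4 5]

b = a.copy()              # independent copy
b[0] = 100                # does not touch a

v = a[2:5]                # slicing returns a view on a
v[:] = -1                 # modifies a in place

print(a)                              # elements 2, 3, 4 are now -1
print(np.shares_memory(a, v))         # the view shares its data with a
print(np.shares_memory(a, b))         # the copy does not
```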

### II.3 Array shape manipulation <a class="anchor" id="II.3"></a>

There are various ways to modify arrays (e.g. adding a row/column, shuffling columns, flattening, resizing, ...).  Let's focus here on array reshaping (which covers several of the operations outlined above), the addition of new dimensions, and array flattening.

- **II.3.1 Reshaping**:   
The method `reshape(newshape)` reorganises the elements of an array to create a "new" array (see below) that has a different shape. The total number of elements must stay the same! This method can also be used to add an axis or to flatten an array. 

In [17]:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a)
print(a.shape) 
b = a.reshape((3, 2))
b

[[1 2 3]
 [4 5 6]]
(2, 3)


array([[1, 2],
       [3, 4],
       [5, 6]])

In [18]:
# Alternatively, use "-1" to implicitly derive the size of one of the dimensions 
a.reshape((3, -1))    # unspecified (-1) value is inferred

array([[1, 2],
       [3, 4],
       [5, 6]])

In [19]:
# Make a test array with values ranging from 0 to 99 with shape (4x5x5)
a = np.arange(100)
b = a.reshape((4,5,5))
b

array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24]],

       [[25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39],
        [40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49]],

       [[50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59],
        [60, 61, 62, 63, 64],
        [65, 66, 67, 68, 69],
        [70, 71, 72, 73, 74]],

       [[75, 76, 77, 78, 79],
        [80, 81, 82, 83, 84],
        [85, 86, 87, 88, 89],
        [90, 91, 92, 93, 94],
        [95, 96, 97, 98, 99]]])
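
The quotation marks around "new" in the description of `reshape` matter: for a contiguous array such as the ones used above, `reshape` typically returns a view rather than a copy. A minimal sketch, on a small made-up array, to check this:

```python
import numpy as np

a = np.arange(6)
b = a.reshape((2, 3))

print(np.shares_memory(a, b))   # True here: b is a view on the data of a
b[0, 0] = 99                    # modifying b ...
print(a[0])                     # ... also modifies a

c = a.reshape((2, 3)).copy()    # force an independent array
c[0, 1] = -5
print(a[1])                     # a is unchanged by the modification of c
```

If an independent array is needed, chaining `.copy()` as in the last lines is one simple option.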

- **II.3.2 Resizing**:   
The `numpy.newaxis` [constant](https://numpy.org/doc/stable/reference/constants.html#numpy.newaxis) allows one to create additional dimensions in an array as follows:

In [20]:
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.shape) 
b = a[np.newaxis]
print(b.shape)

(2, 3)
(1, 2, 3)


Note that this is equivalent to adding a pair of square brackets in the definition:

In [21]:
c = np.array([[[1, 2, 3], [4, 5, 6]]])
print(c.shape)

(1, 2, 3)


The new axis can also be inserted elsewhere, e.g. to add a new third dimension:

In [22]:
d = a[:,:,np.newaxis]
print(d.shape)

(2, 3, 1)


In [23]:
d

array([[[1],
        [2],
        [3]],

       [[4],
        [5],
        [6]]])
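
A common use of `np.newaxis` is to turn a 1D array into a row or a column. The sketch below (with a made-up 1D array) also shows two equivalent spellings, `reshape` and `np.expand_dims`.

```python
import numpy as np

a = np.array([1, 2, 3])

col = a[:, np.newaxis]    # column vector, shape (3, 1)
row = a[np.newaxis, :]    # row vector, shape (1, 3)
print(col.shape, row.shape)

# Equivalent ways to obtain the column vector
same1 = a.reshape(-1, 1)
same2 = np.expand_dims(a, axis=1)
print(np.array_equal(col, same1), np.array_equal(col, same2))
```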

- **II.3.3 Flattening**:   
The method `flatten()` allows one to flatten all dimensions of your array, making it 1D:

In [24]:
a = np.array([[[1, 2, 3], [4, 5, 6]],
              [[4, 5, 6], [-1,-2,-3]]])
print(a.shape) 
b = a.flatten()
print(b)
print(b.shape)

(2, 2, 3)
[ 1  2  3  4  5  6  4  5  6 -1 -2 -3]
(12,)
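
Note that `flatten()` always returns a copy, while the related method `ravel()` returns a view whenever possible. A minimal sketch on a small made-up array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

f = a.flatten()     # always a copy
f[0] = 99
print(a[0, 0])      # a is unchanged

r = a.ravel()       # a view here, because a is contiguous in memory
r[0] = 99
print(a[0, 0])      # a is modified through the view
```

`ravel()` avoids allocating memory for large arrays, at the price of the side effect shown in the last lines.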


### II.4 What makes `numpy` arrays useful structures?  <a class="anchor" id="II.4"></a>

Python is fast *for coding and developing* but slow when it comes to *execution*, especially the execution of `for` loops.    
One reason for this low speed is that Python is dynamically typed: in a loop such as `for a in range(10): a + b`, the interpreter has to check the `type` of `a` and of `b` for *each value* before executing the operation. 

`numpy` helps speed up code through 4 strategies:
1. `ufunc`
2. aggregation
3. broadcasting
4. slicing, masking, fancy indexing

#### II.4.1 `ufunc`: operates elementwise on objects. <a class="anchor" id="II.4.1"></a>

The `ufunc`s (universal functions) are compiled functions included in `numpy` that perform fast **elementwise** operations. They include: 

- Arithmetic operators: `+`, `-`, `/`, `*`, `**` 
- Mathematical functions: sin, exp, cos, log10, ... 
- Comparison operators: `<`, `>`, `==`, ...
- etc ... 

**Example:**
``` python
import numpy as np
# Basic python
a = [1,2,3,4,5]
b = [ val + 5 for val in a]   # add 5 to each element of the list  
# In numpy
a = np.array(a)
b = a + 5                     # add 5 to each element of the array
```

In [25]:
# How to add 5 to a list with list comprehension
a = [1,2,3,4,5]
b = [ val + 5 for val in a]   # add 5 to each element of the list
b

[6, 7, 8, 9, 10]

In [26]:
# with numpy: easier and cleaner. Deals directly with elementwise operations
a = np.array(a)
b = a + 5
b

array([ 6,  7,  8,  9, 10])

In [27]:
# implement the above example for a list of 1000 elements 
# use %timeit before calculating b to see improvement in speed
a = range(1000)
%timeit b = [val + 5 for val in a]

51.6 µs ± 4.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [28]:
a = np.arange(1000)
%timeit b = a + 5

1.04 µs ± 60.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


#### Exercise: 

Given a NumPy array of angles in degrees, use universal functions to:

* Convert the angles to radians. (`np.radians()`)
* Compute the sine, cosine, and tangent of each angle.



In [32]:
a = np.array([0, 30, 45, 60, 90])
b = np.radians(a)
b2 = np.deg2rad(a)
a2 = np.rad2deg(b2)
print('b = ', b, '\nb2 = ', b2)
print('sin(b)', np.sin(b))
print('cos(b)', np.cos(b))
print('tan(b)', np.tan(b))

b =  [0.         0.52359878 0.78539816 1.04719755 1.57079633] 
b2 =  [0.         0.52359878 0.78539816 1.04719755 1.57079633]
sin(b) [0.         0.5        0.70710678 0.8660254  1.        ]
cos(b) [1.00000000e+00 8.66025404e-01 7.07106781e-01 5.00000000e-01
 6.12323400e-17]
tan(b) [0.00000000e+00 5.77350269e-01 1.00000000e+00 1.73205081e+00
 1.63312394e+16]


#### II.4.2 *aggregation*:   <a class="anchor" id="II.4.2"></a>

Aggregations are functions that summarize the values of an array, such as `min`, `max`, `sum`, `mean`, ... 

**Example:**

``` python
# python version of an aggregation
from numpy import random 
c_list = [random.random_sample() for i in range(10000)]
%timeit min(c_list)
#same in numpy:
c = np.array(c_list)
%timeit c.min()  
```
This also works on multidimensional arrays: 

``` python 
M = np.random.randint(0, 10, (10,4))
M.sum(axis=0)
M.sum(axis=1)
```

Aggregations available: 
`c.min()`, `c.max()`, `c.prod()`, `c.mean()`, `c.std()`, `c.any()`, `c.all()`, `c.argmin()`, `c.argmax()`,  ... NaN-aware versions of these aggregations exist as functions rather than methods, e.g. `np.nanmin(c)`.
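
As a minimal sketch of the NaN-aware aggregations and of the `axis` argument (the array below is made up for illustration):

```python
import numpy as np

c = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0,    6.0]])

print(c.min())                  # nan: a NaN propagates through the aggregation
print(np.nanmin(c))             # minimum ignoring NaN values
print(np.nanmean(c, axis=0))    # column-wise mean ignoring NaN values
print(np.nanargmax(c))          # index (in the flattened array) of the maximum
```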


In [33]:
from numpy import random 
c_list = [random.random_sample() for i in range(10000)]
%timeit min(c_list)

144 µs ± 7.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [34]:
c = np.array(c_list)
%timeit c.min()

4.55 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


#### Exercise: 

Given a 2D NumPy array `arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])`

- Calculate the mean of the array.
- Find the maximum and minimum values.
- Compute the cumulative sum and cumulative product along the rows and columns.

Bonus (more advanced): Find the index of the maximum value in the 2D array. 
- Why is it a single value? (Check the help of the function)
- What is returned if you have several occurrences of the maximum?
- How can you get the "2 coordinates" index of the maximum? (see `np.unravel_index()`) 

In [35]:
arr = np.array([[1, 2, 3], 
                [4, 5, 6],
                [7, 8, 9]])

mea = arr.mean()
mea2 = np.mean(arr)
mea, mea2

(5.0, 5.0)

In [36]:
arr.min(), arr.max()

(1, 9)

In [37]:
# np.cumprod replaces np.cumproduct, which was removed in NumPy 2.0
np.cumsum(arr, axis=1), np.cumprod(arr, axis=0) 

(array([[ 1,  3,  6],
        [ 4,  9, 15],
        [ 7, 15, 24]]),
 array([[  1,   2,   3],
        [  4,  10,  18],
        [ 28,  80, 162]]))
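
For the bonus questions, one possible approach is sketched below. A different, made-up array with a repeated maximum is used, so that the behaviour with several occurrences is visible. `np.argwhere` is an addition not mentioned in the exercise.

```python
import numpy as np

arr2 = np.array([[1, 9, 3],
                 [4, 5, 9]])

flat_idx = arr2.argmax()                       # index in the flattened array, first occurrence only
idx = np.unravel_index(flat_idx, arr2.shape)   # converted to (row, column)
print(flat_idx, idx)

# All positions of the maximum, one (row, column) pair per line
all_max = np.argwhere(arr2 == arr2.max())
print(all_max)
```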

#### II.4.3 *Broadcasting*:   <a class="anchor" id="II.4.3"></a>

Broadcasting is the set of rules by which `ufuncs` operate on arrays of different sizes and/or dimensions. 

The term [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) describes how `numpy` treats arrays with different shapes during arithmetic operations.
Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

**General rule**

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimension and works its way left. Two dimensions are compatible when
- they are equal, or
- one of them is 1.

If these conditions are not met, a `ValueError: operands could not be broadcast together` exception is raised.

**Application to 3 cases:** 

![From astroML book](../Figures/fig_broadcast_visual_1.png)


In [38]:
c = np.arange(3).reshape(3, 1) + np.arange(3)
c

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

In [40]:
# broadcasting will not work in the situation below 
d = np.arange(3) + np.arange(5)
d

ValueError: operands could not be broadcast together with shapes (3,) (5,) 
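
The failing case above can be made to work by adding an axis, so that the shapes become compatible under the general rule. The sketch below also uses `np.broadcast_shapes` (available in NumPy 1.20 and later, as far as I know) to check shapes without building the arrays.

```python
import numpy as np

a = np.arange(3)
b = np.arange(5)

# (3, 1) and (5,) are compatible: trailing dimensions are 1 and 5
d = a[:, np.newaxis] + b
print(d.shape)

# Checking compatibility from the shapes alone
print(np.broadcast_shapes((3, 1), (5,)))
try:
    np.broadcast_shapes((3,), (5,))
except ValueError as err:
    print("not compatible:", err)
```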

#### II.4.4 Slicing, masking and fancy indexing:    <a class="anchor" id="II.4.4"></a>
	 
- **Mask**: a mask is a boolean array that can be used to "mask" some indices of an array: 

``` python
mask = np.array([False, False, True, False, True, False])
c = np.array([1, 3, 6, 9, 10, 2])
c[mask]
    Out: array([6, 10])
    
mask = (c < 4) | (c > 8)
c[mask]
    Out: array([1, 3, 9, 10, 2])
```
 

In [41]:
mask = np.array([False, False, True, False, True, False])
c = np.array([1, 3, 6, 9, 10, 2])
c[mask]

array([ 6, 10])

In [42]:
mask = (c < 4) | (c > 8)
mask

array([ True,  True, False,  True,  True,  True])

In [43]:
# BUT beware that the use of "or" will not work  
(c < 4) or (c > 8)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

**The reason** for this difference in behavior is that `|` is a bitwise operator, which makes the comparison ELEMENT BY ELEMENT, while `or` is a boolean operator, which requires two single booleans to compare. In other words, "element-wise" logical operations require the operators `&`, `|`, `~`, while the "boolean" operators `and`, `or`, `not` compare SINGLE booleans. 

In the above example, `(c < 4) or (c > 8)`, both `(c < 4)` and `(c > 8)` return an array of booleans, while the logical operator `or` expects a single boolean value on each side, not an array. This is the origin of that error message.  
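
A short sketch of the element-wise alternatives, using the same array `c` as above. The functions `np.logical_or` and friends are the named equivalents of the operators.

```python
import numpy as np

c = np.array([1, 3, 6, 9, 10, 2])

# The parentheses are required: | binds more tightly than < and >
m1 = (c < 4) | (c > 8)
m2 = np.logical_or(c < 4, c > 8)    # same result with a named function
print(np.array_equal(m1, m2))

inside = ~m1                        # element-wise "not"
print(c[inside])

# To reduce a boolean array to a SINGLE boolean, use any() or all()
print(m1.any(), m1.all())
```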

In [48]:
print(c)

[ 1  3  6  9 10  2]


In [49]:
mask = (c > 4) & (c < 8)
c[mask]

array([6])

In [52]:
# Now mask w. 2D array
arr = np.array([[0, -1, -2], [1, 1, 3], [0, -2, 5]])
arr

array([[ 0, -1, -2],
       [ 1,  1,  3],
       [ 0, -2,  5]])

In [53]:
arr > 0 

array([[False, False, False],
       [ True,  True,  True],
       [False, False,  True]])

In [54]:
arr[arr > 0]

array([1, 1, 3, 5])

- **Fancy indexing**: passing a list/array of indices to get elements of a numpy array (this only works for arrays, not for lists!). This avoids looping over the indices. 

``` python
ind = [1, 3, 4]
c[ind]  
   Out: array([3, 9, 10])
```

In [55]:
ind = [1, 3, 4]
c[ind]

array([ 3,  9, 10])

- **Multi-dimensional** array: 

Masks and fancy indexing can also be applied to multidimensional arrays.   
Remember that the first index is the row, and the second is the column.   
Remember how slicing works: `a[start:end:step]`: 
- Omitting `start` or `end` extends the slice to the beginning or the end of the sequence. 
- Omitting the second colon implies `step=1`.  
- With a negative step you count backward.
- `start`/`end` can be either positive or negative indices (negative indices count from the end). 

In [56]:
a = np.arange(10)
a[3:]

array([3, 4, 5, 6, 7, 8, 9])
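
The slicing rules listed above can be tried out on the same 1D array. A minimal sketch:

```python
import numpy as np

a = np.arange(10)     # [0 1 2 3 4 5 6 7 8 9]

s1 = a[2:7]           # start included, end excluded
s2 = a[:4]            # start omitted: from the beginning
s3 = a[::3]           # every third element
s4 = a[-3:]           # negative start: the last three elements
s5 = a[::-1]          # negative step: reversed array
s6 = a[7:2:-2]        # backward from index 7 down to (but excluding) index 2

for s in (s1, s2, s3, s4, s5, s6):
    print(s)
```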

``` python
M = np.arange(12).reshape((3,4))
    Out: 
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])

M[0, 1]   # gives the value at row 0 and column 1
M[:, 1]   # combines slices and indices -> all rows of column 1
M[M-3 < 2]   # masking also works on an n-dimensional array
M[[1,0], :2]   # fancy indexing and slicing -> first 2 elements of rows 1 and 0 (in that order)
M[M.sum(axis=1) > 2, 2:]   # mixing masking (on rows) and slicing (on columns)
```

In [57]:
# Test slicing examples above
M = np.arange(12).reshape((3,4))
M

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [58]:
M[0,1]

1

In [59]:
M[:,1]

array([1, 5, 9])

In [60]:
mask = M-3 < 2
mask
M[M-3 < 2]

array([0, 1, 2, 3, 4])

In [61]:
M[[1,0], :2]

array([[4, 5],
       [0, 1]])

In [62]:
#M.sum(axis=1) > 2
M[M.sum(axis=1) > 2, 2:]

array([[ 2,  3],
       [ 6,  7],
       [10, 11]])

An illustration of indexing in numpy arrays:
![Illustration of `np` indexing](../Figures/numpy_indexing.png)

**Exercises**:
- Try the different flavours of slicing, using start, end and step: starting from a linspace, try to obtain odd numbers counting backwards, and even numbers counting forwards. Expected Output: `[9 7 5 3 1]` and `[ 0  2  4  6  8 10]` 

- Given a NumPy array `arr = np.array([1, 10, 15, 25, 50, 60, 100])` and a scalar value `threshold = 30`, do the following:
    * Compare each element of the array with the threshold to create a boolean mask.
    * Count how many values in the array are greater than the threshold.
    * Use the boolean mask to create a new array that contains only the values greater than the threshold.



In [63]:
aa = np.linspace(0, 10, 11, dtype=int)
# odd numbers counting backwards
print(aa[-2::-2])
# Even numbers counting forward
print(aa[::2])

[9 7 5 3 1]
[ 0  2  4  6  8 10]


In [64]:
# exercise with threshold 
arr = np.array([1, 10, 15, 25, 50, 60, 100])
threshold = 30
mask = arr>threshold
mask

array([False, False, False, False,  True,  True,  True])

In [65]:
np.sum(mask)

3

In [66]:
marr = arr[mask]
marr

array([ 50,  60, 100])

### II.5 Reading arrays from a file and string formatting:    <a class="anchor" id="II.5"></a>

Many modules now exist to manipulate data saved in files (text files and many other formats). Often, we simply want/have to read a table and do operations on it. This can be done easily within `numpy`, which provides a simple pair of commands to read/write a 2D array from/to a text file: a table saved in a formatted text file is read with `numpy.loadtxt('myfile.txt')`, while an array `myarray` is saved with `numpy.savetxt('myfile.txt', myarray)`.   

**Example**: Open data.txt. It contains a list of sources, their name (`Name`), their identifier (`ID`), their `RA` and `DEC` coordinates, their estimated redshift (`z`), uncertainty on the redshift (`z_err`), and a data quality flag (`zQF`). The latter indicates whether the estimate is reliable (assume 0 means no flagging, hence a reliable estimate). You are only interested in objects with a non-zero `z`, for which the data quality flag is 0. Save the trimmed data in a new txt file.

In [68]:
data = np.loadtxt('data.txt')
data.shape

(19, 7)

In [69]:
data[0:5, :]

array([[ 1.10000000e+01,  1.10000000e+01,  6.96039800e+01,
        -1.23348300e+01,  1.45026997e-01,  5.39539924e-05,
         0.00000000e+00],
       [ 3.80000000e+01,  3.80000000e+01,  6.95969000e+01,
        -1.23204700e+01,  5.02938603e-01,  2.22303791e-04,
         1.00000000e+00],
       [ 4.20000000e+01,  4.20000000e+01,  6.95713600e+01,
        -1.23179500e+01,  0.00000000e+00,  1.00000000e-03,
         0.00000000e+00],
       [ 5.50000000e+01,  5.50000000e+01,  6.95983200e+01,
        -1.23115700e+01,  0.00000000e+00,  1.00000000e-03,
         0.00000000e+00],
       [ 5.70000000e+01,  5.70000000e+01,  6.95978429e+01,
        -1.23114420e+01,  0.00000000e+00,  1.00000000e-03,
         0.00000000e+00]])

In [70]:
mask1 = data[:, 4] != 0     # column 4 is z: keep non-zero redshifts
mask2 = data[:, -1] == 0    # last column is zQF: keep unflagged estimates
print(mask1, mask2)
data_trim = data[mask1 & mask2, :]
data_trim

[ True  True False False False False False False False False False False
 False False  True  True False  True False] [ True False  True  True  True  True  True  True  True  True  True  True
  True  True  True False  True  True  True]


array([[ 1.10000000e+01,  1.10000000e+01,  6.96039800e+01,
        -1.23348300e+01,  1.45026997e-01,  5.39539924e-05,
         0.00000000e+00],
       [ 1.28000000e+02,  1.28000000e+02,  6.96064600e+01,
        -1.22866800e+01,  3.69864208e-01,  8.03355767e-05,
         0.00000000e+00],
       [ 1.85000000e+02,  1.85000000e+02,  6.95245900e+01,
        -1.22494700e+01,  5.85115787e-01,  1.04162060e-04,
         0.00000000e+00]])

Let's now save the trimmed data:

In [71]:
np.savetxt('data_trim.txt', data_trim)
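
`np.savetxt` also accepts a number format and a header line. The sketch below is self-contained: it writes a small made-up table (values shortened from the example above) to a temporary file, whose name is hypothetical, and reads it back.

```python
import os
import tempfile
import numpy as np

# Small made-up table with columns ID, z, zQF
table = np.array([[11.0, 0.145027, 0.0],
                  [128.0, 0.369864, 0.0]])

path = os.path.join(tempfile.mkdtemp(), 'demo_trim.txt')   # hypothetical file name

# fmt sets the number format; header is written as a comment line starting with '# '
np.savetxt(path, table, fmt='%.6f', header='ID z zQF')

with open(path, 'r') as f:
    first_line = f.readline()
print(repr(first_line))

back = np.loadtxt(path)      # comment lines are skipped when reading
print(back)
```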

#### II.5.1 What if my input/output file is not just a simple array ? 

More advanced functions exist in `numpy` to read text/csv files, accounting for missing values, excluding columns, guessing data types, ... These will be covered in the advanced course on data processing. Below is a simple example showing how to parse the data.txt file using `Astropy` and place the data into a `Table`. An alternative is to use `pandas` and load the data as a `DataFrame` object: 

- `Table` and `QTable` (a table that manipulates quantities) objects in `astropy.table`: https://docs.astropy.org/en/stable/table/ . `Table` objects may be sufficient for most of your needs and handle many different formats (including csv, latex, rdb, hdf5, ...). Conversion to/from numpy arrays and to/from `pandas.DataFrame` is often possible. Within Jupyter Notebooks, Tables are also "pretty printed", which eases analysis. 

- `DataFrame` and `Series` objects in `pandas`: https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html. Those objects/structures are commonly used in data science and machine learning. They handle an even larger variety of input files than `astropy.table` (e.g. excel and sql tables, pickle objects, ...) but `DataFrame` objects, being more versatile than astropy Tables, can also be trickier to manipulate. 

In [72]:
from astropy.table import Table 
data_tab = Table.read('data.txt', format='ascii')

In [73]:
# Note: one needs to specify the format of the text file ... plain ascii is a good choice to avoid a bug
#data_tab = Table.read('data.txt')

In [74]:
data_tab

Name,ID,RA,DEC,z,z_err,zQF
int64,float64,float64,float64,float64,float64,float64
11,11.0,69.60398,-12.33483,0.14502699731,5.39539923983e-05,0.0
38,38.0,69.5969,-12.32047,0.502938602945,0.000222303791245,1.0
42,42.0,69.57136,-12.31795,0.0,0.001,0.0
55,55.0,69.59832,-12.31157,0.0,0.001,0.0
57,57.0,69.5978429,-12.311442,0.0,0.001,0.0
72,72.0,69.61111,-12.3037,0.0,0.001,0.0
80,80.0,69.55023,-12.30339,0.0,0.001,0.0
83,83.0,69.58752,-12.30232,0.0,0.001,0.0
85,85.0,69.56567,-12.30147,0.0,0.001,0.0
111,111.0,69.59927,-12.29414,0.0,0.001,0.0


In [75]:
data_tab['z'][0:3]

0
0.14502699731
0.502938602945
0.0


In [76]:
# Casting of an astropy.Table into an array 
z_array = np.array(data_tab['z'][0:3])
z_array

array([0.145027 , 0.5029386, 0.       ])

The same exercise as above can be done much more conveniently by accessing specific columns with header keywords (e.g. 'z').
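
Access by column name is also possible within `numpy` itself, using `np.genfromtxt` with `names=True`. The sketch below is a numpy-only analogue of the `Table` approach, not the method used in this notebook. To stay self-contained it reads from an in-memory string holding three rows shortened from `data.txt` instead of the file itself.

```python
import io
import numpy as np

# Three rows in the same layout as data.txt (values shortened for illustration)
text = """# Name ID RA DEC z z_err zQF
0011 11.0 69.60398 -12.33483 0.145 5.4e-05 0.0
0038 38.0 69.5969 -12.32047 0.503 2.2e-04 1.0
0042 42.0 69.57136 -12.31795 0.0 0.001 0.0
"""

# names=True takes the column names from the first (commented) line
cat = np.genfromtxt(io.StringIO(text), names=True)
print(cat.dtype.names)

# Same selection as before, but with column names instead of column numbers
good = (cat['z'] != 0) & (cat['zQF'] == 0)
print(cat['ID'][good])
```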

#### II.5.2. What if my input file mixes columns and normal rows ?

*This section is given only for completeness. I.e. it can be ignored at a beginner's level*

The lower-level manipulation of a text file is done through a file object, as returned by the built-in function `open()`. 
For this, three operations are generally needed (open, read, close): 

``` python
with open('myfile.txt', 'r') as f:  # 'r' for read mode, 'w' for write mode, 'a' for append mode
    read_data = f.read()  # reads the whole file as a single string; other methods allow more flexible reading

# One can also do the following (see below), at the risk of the file not being properly closed. 
f = open('myfile.txt', 'r')  
f.read()  
f.close() 
```

If you call `f.read()` twice, the second call returns an empty string: the file object then "points" to the end of the file, and there is nothing left to read. The methods that access the file object go sequentially through its content. With `read()` you take the content as a single string (which could be a problem memory-wise if the file is large!). 
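
This sequential behaviour can be checked with a small sketch. It writes a two-line temporary file (the file name is hypothetical) and uses `f.seek(0)` to go back to the beginning of the file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')   # hypothetical file name
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

with open(path, 'r') as f:
    first = f.read()     # whole content
    second = f.read()    # empty string: we are already at the end of the file
    f.seek(0)            # move back to the beginning
    again = f.read()     # whole content again
    f.seek(0)
    lines = [line for line in f]   # iterating reads one line at a time

print(repr(first), repr(second))
print(lines)
```

Iterating over the file object, as in the last step, avoids loading a large file into memory in one go.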


In [77]:
with open('data.txt', 'r') as f: 
    read_data = f.read() 

The function `readlines()` reads in the whole file and splits it into a **list** of lines. 
``` python
with open('myfile.txt', 'r') as f:   # the file is closed automatically at the end of the block
    for line in f.readlines():
        print(repr(line))
```

In [78]:
with open('data.txt', 'r') as f:   # the file is closed automatically at the end of the block
    a = f.readlines()
a

['# Name ID RA DEC z z_err zQF \n',
 '0011 11.0 69.60398 -12.33483 0.14502699731 5.39539923983e-05 0.0 \n',
 '0038 38.0 69.5969 -12.32047 0.502938602945 0.000222303791245 1.0 \n',
 '0042 42.0 69.57136 -12.31795 0.0 0.001 0.0 \n',
 '0055 55.0 69.59832 -12.31157 0.0 0.001 0.0 \n',
 '0057 57.0 69.5978429 -12.311442 0.0 0.001 0.0 \n',
 '0072 72.0 69.61111 -12.3037 0.0 0.001 0.0\n',
 '0080 80.0 69.55023 -12.30339 0.0 0.001 0.0\n',
 '0083 83.0 69.58752 -12.30232 0.0 0.001 0.0\n',
 '0085 85.0 69.56567 -12.30147 0.0 0.001 0.0\n',
 '0111 111.0 69.59927 -12.29414 0.0 0.001 0.0\n',
 '0114 114.0 69.52129 -12.2893 0.0 0.001 0.0\n',
 '0119 119.0 69.5651 -12.28924 0.0 0.001 0.0\n',
 '0125 125.0 69.53808 -12.28883 0.0 0.001 0.0\n',
 '0126 126.0 69.54177 -12.28782 0.0 0.001 0.0\n',
 '0128 128.0 69.60646 -12.28668 0.369864207858 8.03355766875e-05 0.0\n',
 '0164 164.0 69.52007 -12.24671 0.581369862213 0.000216496440347 2.0\n',
 '0182 182.0 69.53533 -12.25025 0.0 0.001 0.0\n',
 '0185 185.0 69.52459 -12.24

In [79]:
a[10].replace('.', ',')

'0111 111,0 69,59927 -12,29414 0,0 0,001 0,0\n'

In [80]:
a

['# Name ID RA DEC z z_err zQF \n',
 '0011 11.0 69.60398 -12.33483 0.14502699731 5.39539923983e-05 0.0 \n',
 '0038 38.0 69.5969 -12.32047 0.502938602945 0.000222303791245 1.0 \n',
 '0042 42.0 69.57136 -12.31795 0.0 0.001 0.0 \n',
 '0055 55.0 69.59832 -12.31157 0.0 0.001 0.0 \n',
 '0057 57.0 69.5978429 -12.311442 0.0 0.001 0.0 \n',
 '0072 72.0 69.61111 -12.3037 0.0 0.001 0.0\n',
 '0080 80.0 69.55023 -12.30339 0.0 0.001 0.0\n',
 '0083 83.0 69.58752 -12.30232 0.0 0.001 0.0\n',
 '0085 85.0 69.56567 -12.30147 0.0 0.001 0.0\n',
 '0111 111.0 69.59927 -12.29414 0.0 0.001 0.0\n',
 '0114 114.0 69.52129 -12.2893 0.0 0.001 0.0\n',
 '0119 119.0 69.5651 -12.28924 0.0 0.001 0.0\n',
 '0125 125.0 69.53808 -12.28883 0.0 0.001 0.0\n',
 '0126 126.0 69.54177 -12.28782 0.0 0.001 0.0\n',
 '0128 128.0 69.60646 -12.28668 0.369864207858 8.03355766875e-05 0.0\n',
 '0164 164.0 69.52007 -12.24671 0.581369862213 0.000216496440347 2.0\n',
 '0182 182.0 69.53533 -12.25025 0.0 0.001 0.0\n',
 '0185 185.0 69.52459 -12.24

In [81]:
with open('data.txt', 'r') as f:   # the file is closed automatically at the end of the block
    for line in f.readlines():
        print(repr(line))

'# Name ID RA DEC z z_err zQF \n'
'0011 11.0 69.60398 -12.33483 0.14502699731 5.39539923983e-05 0.0 \n'
'0038 38.0 69.5969 -12.32047 0.502938602945 0.000222303791245 1.0 \n'
'0042 42.0 69.57136 -12.31795 0.0 0.001 0.0 \n'
'0055 55.0 69.59832 -12.31157 0.0 0.001 0.0 \n'
'0057 57.0 69.5978429 -12.311442 0.0 0.001 0.0 \n'
'0072 72.0 69.61111 -12.3037 0.0 0.001 0.0\n'
'0080 80.0 69.55023 -12.30339 0.0 0.001 0.0\n'
'0083 83.0 69.58752 -12.30232 0.0 0.001 0.0\n'
'0085 85.0 69.56567 -12.30147 0.0 0.001 0.0\n'
'0111 111.0 69.59927 -12.29414 0.0 0.001 0.0\n'
'0114 114.0 69.52129 -12.2893 0.0 0.001 0.0\n'
'0119 119.0 69.5651 -12.28924 0.0 0.001 0.0\n'
'0125 125.0 69.53808 -12.28883 0.0 0.001 0.0\n'
'0126 126.0 69.54177 -12.28782 0.0 0.001 0.0\n'
'0128 128.0 69.60646 -12.28668 0.369864207858 8.03355766875e-05 0.0\n'
'0164 164.0 69.52007 -12.24671 0.581369862213 0.000216496440347 2.0\n'
'0182 182.0 69.53533 -12.25025 0.0 0.001 0.0\n'
'0185 185.0 69.52459 -12.24947 0.585115787002 0.000104162060008 


Once a line is read, string methods can be applied to it, as to any string:    
- Remove the trailing `\n` (and any leading/trailing whitespace): `line.strip()`
- Split the string into a list of strings: `line.split()`
- Replace a specific character by another: `line.replace(',', '.')`  replaces each comma by a dot.
- Access a specific element of a split list and convert it to float: `float(line.split()[2])`
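
The string methods above can be combined to parse one line of the file. A minimal sketch, using a line copied from the output of `readlines()` shown earlier:

```python
line = '0111 111.0 69.59927 -12.29414 0.0 0.001 0.0\n'

fields = line.strip().split()             # list of strings, without the trailing '\n'
name = fields[0]                          # kept as a string to preserve the leading zero
values = [float(x) for x in fields[1:]]   # remaining columns converted to floats
ra = float(line.split()[2])               # third column only

print(fields)
print(name, values)
print(ra)
```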

#### II.5.3: Other options to save lists or arrays into a file:  

*This section is given only for completeness. I.e. it can be ignored at a beginner's level*

To write a file, you basically follow the same procedure: 
``` python
with open('myfile.txt', 'w') as f:
    # mylist_of_lines contains the lines you want to write; ensure that each of them ends with '\n'
    f.writelines(mylist_of_lines)

# Equivalently, join the lines into a single string and write it at once:
with open('myfile.txt', 'w') as f:
    f.write(''.join(mylist_of_lines))
```

**Exercise to practice low-level file manipulation:**

*This exercise is related to II.5.2 and II.5.3 and may be skipped at a beginner's level*

Read the file `data.txt` and display some columns you care about:
- Do it using the file object.
- Try to do the same using `numpy.loadtxt()`.
- Try to build a numpy array with the data in data.txt as read using `f = open('data.txt')`. 
- Modify one column of the file (replace it with 0) and write the result to `data_new.txt`.


**Note**: There is another very useful way in python to save "full objects" and access and use them later with all their characteristics. This can be done by importing the `pickle` [module](https://docs.python.org/3.8/library/pickle.html). To write an object into a pickle file, open the file in binary write mode (`pkl_file = open(filename, 'wb')`), use `pickle.dump(obj, pkl_file, protocol=-1)`, and close the file (`pkl_file.close()`). To read an object saved in a pickle file, follow the same procedure but open the file in binary read mode (`'rb'`) and use `obj = pickle.load(pkl_file)` instead of `pickle.dump()`. The `pandas` module also allows you to read/write pickle objects: see `pandas.read_pickle()` and `pandas.to_pickle()`
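
A minimal round-trip sketch of the procedure described in the note, using `with` blocks so that the file is closed automatically. The object and the file name are made up for illustration. Keep in mind that pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile
import numpy as np

# Any python object can be pickled; here a dictionary holding an array
obj = {'name': 'demo', 'z': np.array([0.145, 0.370, 0.585])}

path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')   # hypothetical file name

with open(path, 'wb') as pkl_file:        # 'wb': write in binary mode
    pickle.dump(obj, pkl_file, protocol=-1)

with open(path, 'rb') as pkl_file:        # 'rb': read in binary mode
    restored = pickle.load(pkl_file)

print(restored['name'], restored['z'])
```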

#### II.5.4 Formatting Strings

It often happens that you do not need to save all the decimals of a number, or would like to see it in scientific notation. There are [multiple ways to do it](https://docs.python.org/3/tutorial/inputoutput.html). One could spend (boring) hours describing all possible ways to format strings. The two main options are described below: they explain the basics and point you to the relevant documentation. You may look at https://pyformat.info/ to skim through various examples of formatting. A more expanded version of this section can be consulted in [Modules_in_python_numpy_adv.ipynb](Modules_in_python_numpy_adv.ipynb).

- **Option 1**: `printf-style` (simple, old style, but not universal)

You can use the `%` operator to specify the formatting of a variable you want to show on screen or save in a file. The variable does not appear explicitly in the string but after it, preceded by `%`. Within the string, a conversion specifier such as `%f` (float) or `%e` (scientific notation) marks where the value goes and how it is formatted. For example, `'%.2f' % variable` formats `variable` as a float with 2 digits after the decimal point. To format several variables, put them in a tuple after the `%` operator and give one specifier per variable in the string: specifiers and tuple elements are matched in order.

Example:
``` python
print('%i is the square of %i' %(4.000, 2))
# Out: 4 is the square of 2
```
Here are some commonly used formatting characters:
- `%s`: String (or any object with a string representation, like numbers)
- `%d` or `%i`: Integers
- `%.<number_of_digits>f`: Floating point number with a fixed number of digits to the right of the dot.
- `%.<number_of_digits>e`: Scientific notation with a fixed number of digits to the right of the dot.

You may find more about string formatting in the [python 3.8 documentation](https://docs.python.org/3.8/library/stdtypes.html#str).


In [85]:
# Experiment with the above examples
print('%i is the square of %i' %(4.6, 2))

4 is the square of 2


In [86]:
print('%s is the square of %s' %(4.6, 2))

4.6 is the square of 2


In [87]:
print('%.2f is the square of %i' %(4.6091, 2))

4.61 is the square of 2


In [88]:
print('%.4e is the square of %i' %(4.6091, 2))

4.6091e+00 is the square of 2


In [89]:
print('%.4e is the square of %i' %(46091, 2))

4.6091e+04 is the square of 2


- **Option 2**: `str.format()` method

This is a much more flexible and general method, described in detail at https://docs.python.org/3/library/string.html#formatstrings. Format strings contain `replacement fields` surrounded by curly braces `{}`. This looks like: 

``` python
'val1 = {0:format_spec} and val2 = {1:format_spec}'.format(val1, val2)
```

Anything that is *not* contained in braces is considered literal text, which is *copied unchanged to the output*. See [here](https://docs.python.org/3/library/string.html#format-specification-mini-language) and [here](https://pyformat.info/) for more details and EXAMPLES. 

Note that the positional indices (0 and 1 above) are optional, but can be useful in some cases.

Example:
``` python
print('{:.0f} is the square of {:n}'.format(4.000, 2))
# Out: 4 is the square of 2
```
If you wish a float representation with 2 decimals, use `{0:.2f}`.
You can also use the positional indices to change the order in which the arguments appear in the output:
``` python
print('{1:.2f} is the square of {0:n}'.format(2, 4.000))
# Out: 4.00 is the square of 2
```

In [90]:
# Experiment with the above examples 
print('{:.1f} is the square of {:n}'.format(4.000, 2))

4.0 is the square of 2


In [91]:
print('{1:.2f} is not the square of {1:n} and {1} is different to {0}'.format(4.000, 2))

2.00 is not the square of 2 and 2 is different to 4.0


In [92]:
# Create three variables a, b, c and give them some value (e.g. a=2.3, b=3, c=-5). 
# Print the sentence: `a=2.30, b=3 and c=-5.00e+00` using the formatting syntax described above.
a=2.3; b=3; c=-5

print("a={:.2f}, b={:n} and c={:.2e}".format(a, b, c))

a=2.30, b=3 and c=-5.00e+00


In [93]:
# Create a 1-D array of 5 floats and print their values as floats with 2 decimals. TIP: use list comprehension
arr = np.arange(5.)
print(arr)
print(["{:.2f}".format(arr[i]) for i in range(arr.shape[0])])
print(["{:.2f}".format(x) for x in arr])

[0. 1. 2. 3. 4.]
['0.00', '1.00', '2.00', '3.00', '4.00']
['0.00', '1.00', '2.00', '3.00', '4.00']
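
Formatting and file writing combine naturally. The sketch below, under assumed data (the arrays `x` and `y` and the file name are invented for illustration), builds one formatted line per row with `str.format()` and writes the lines with `writelines()`.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: two columns of floats (values invented for illustration)
x = np.arange(3.)
y = x ** 2 / 3.

# One formatted line per row; each line must end with '\n'
lines = ['{:.1f} {:.3e}\n'.format(xi, yi) for xi, yi in zip(x, y)]

# Written to the system temporary directory to keep the working directory clean
path = os.path.join(tempfile.gettempdir(), 'formatted_example.txt')
with open(path, 'w') as f:
    f.writelines(lines)

# Read the file back to check its content
with open(path) as f:
    print(f.read())
```

For purely numerical arrays, `np.savetxt()` achieves the same result more directly: its `fmt` argument accepts printf-style specifiers such as `'%.3e'`.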


### II.6 Other useful numpy functions:  <a class="anchor" id="II.6"></a>

There are many predefined functions for manipulating arrays, finding elements, comparing arrays, etc. Do not hesitate to have a look at the `numpy` help. I list below a few "must-know". You may consult [Modules_in_python_numpy_adv.ipynb](Modules_in_python_numpy_adv.ipynb) for a compilation of "may-know" (i.e. 75\% chance that one of them will save your day in a project).

- `np.sort(a)`: Returns sorted copy of an array along a specific axis (default = last axis)
- `np.searchsorted(a, v)`: Find indices where elements should be inserted to maintain order.
- `np.concatenate((a1, a2), axis=0, ...)`: Join a sequence of arrays along an existing axis.
- `np.hstack(tup)` / `np.vstack(tup)`: Stack arrays in sequence horizontally/vertically (column-/row-wise). `np.stack(arrays, axis)` stacks arrays along a NEW axis. 
- `np.where(condition, x, y)`: Return elements chosen from `x` or `y` depending on whether `condition` is met.
- `np.isfinite()` and `np.isnan()`: Test element-wise for finiteness (neither infinity nor Not a Number) or for NaN, respectively. The result is a boolean array.
- `np.trapz()`: Numerical integration using the trapezoidal rule/method (named `np.trapezoid()` from NumPy 2.0 onwards).
- `np.dot()`, `np.matmul()`: matrix multiplication.
- `np.cross()`: cross product of 2 vectors. 
- `np.linalg` gives access to several linear algebra functions, such as computing the eigenvalues and eigenvectors of a matrix (`np.linalg.eig()`, `np.linalg.eigvals()`), solving linear systems (`np.linalg.solve()`), ...
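
A minimal sketch of most of the functions listed above, applied to small arrays whose values are invented for illustration. The comments give the result of each call.

```python
import numpy as np

a = np.array([3., 1., 2.])
b = np.sort(a)                        # sorted copy: [1., 2., 3.]; `a` is unchanged
idx = np.searchsorted(b, 2.5)         # index where 2.5 would keep `b` sorted: 2

c = np.concatenate((a, b))            # join along the existing axis 0: shape (6,)
h = np.hstack((a, b))                 # same as concatenate for 1-D arrays: shape (6,)
v = np.vstack((a, b))                 # stack as rows: shape (2, 3)
s = np.stack((a, b), axis=1)          # stack along a NEW axis: shape (3, 2)

w = np.where(a > 1.5, a, 0.)          # keep `a` where a > 1.5, else 0: [3., 0., 2.]

d = np.array([1., np.nan, np.inf])
finite = np.isfinite(d)               # [True, False, False]
nan = np.isnan(d)                     # [False, True, False]

m = np.array([[2., 0.], [0., 4.]])
rhs = np.array([2., 8.])
sol = np.linalg.solve(m, rhs)         # solves m @ sol = rhs: [1., 2.]
prod = np.matmul(m, sol)              # matrix-vector product, gives back `rhs`
cr = np.cross([1., 0., 0.], [0., 1., 0.])   # cross product of x and y unit vectors: [0., 0., 1.]
```

A boolean array such as `finite` can be used directly as a mask, e.g. `d[finite]` keeps only the finite values of `d`.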
 

### II.7 Summary:   <a class="anchor" id="II.7"></a>

What do you need to know to get started?

- Know how to create arrays : `np.array`, `np.arange`, `np.ones`, `np.zeros`, `np.linspace()`.

- Know the shape of the array with `array.shape`, then use *slicing* to obtain different views of the array: `array[start:end:step]` (and variations around that syntax). Adjust the shape of the array using `reshape` or flatten it with `ravel`.

- Obtain a subset of the elements of an array and/or modify their values with masks (`a[a < 0] = 0`).

- Know miscellaneous operations on arrays, such as finding the mean or max (aggregations: `array.max()`, `array.mean()`). Have the reflex to search the documentation (online docs, `help()`, or `np.lookfor()` in NumPy versions before 2.0) when you do not remember the exact syntax of a function!

- Master the *indexing* with arrays of integers, as well as *broadcasting*. Know more NumPy functions to handle various array operations.

- Be able to read/write data from/into a file, and format numbers on screen (or when writing them into files): `np.savetxt()/np.loadtxt()`, `astropy.table.Table` objects; use of the `%` operator and of the `.format()` string method. 
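
The array-related items of this summary can be recapped in a few lines. This is only a sketch on an arbitrary 3x4 array; the variable names are chosen for illustration.

```python
import numpy as np

a = np.arange(12.).reshape(3, 4)   # create and reshape: 3 rows, 4 columns
print(a.shape)                     # (3, 4)

sub = a[::2, 1:3]                  # slicing returns a view: rows 0 and 2, columns 1 and 2
flat = a.ravel()                   # flattened (1-D) version of `a`

b = a - a.mean(axis=0)             # aggregation + broadcasting: subtract the mean of each column
b[b < 0] = 0                       # masking: set negative values to 0

picked = a[[0, 2], [1, 3]]         # indexing with lists of integers: elements (0, 1) and (2, 3)
print(picked)                      # [ 1. 11.]
```

Here `a.mean(axis=0)` has shape `(4,)` and is broadcast against the `(3, 4)` array, so each column has its own mean subtracted.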


### II.8 References and supplementary material: <a class="anchor" id="VI"></a>

- Good video introducing numpy (which inspired part of the numpy section of this notebook) by J. VanderPlas: https://www.youtube.com/watch?v=EEUXKG97YRw

- Numpy quick-start:  [https://numpy.org/doc/stable/user/quickstart.html](https://numpy.org/doc/stable/user/quickstart.html)

- About string formatting: https://docs.python.org/3/tutorial/inputoutput.html