# Introduction to `numpy`

This Notebook provides an overview of the capabilities of the `numpy` module. It covers Sect. II of [Modules_in__python.ipynb](Modules_in__python.ipynb). 

## Table of Contents

- [II. Numpy](#II)
    * [II.1 Array Definition and construction](#II.1)
    * [II.2 Array copies and views](#II.2)
    * [II.3 Shape manipulation](#II.3)
    * [II.4 What makes numpy Arrays useful structures ?](#II.4)
        - [II.4.1 ufunc](#II.4.1)
        - [II.4.2 Aggregation](#II.4.2)
        - [II.4.3 Broadcasting](#II.4.3)
        - [II.4.4 Slicing, masking, fancy indexing](#II.4.4)
    * [II.5 Reading arrays from a file and string formatting](#II.5)
    * [II.6 Useful Numpy functions](#II.6)
    * [II.7 Summary](#II.7)
    * [II.8 References](#VI)

## II. `numpy`:  <a class="anchor" id="II"></a>

`numpy` provides fast implementations of mathematical functions and operations for the python language. It also introduces one key object: the `array`. 

### II.1 `array` definition and construction:  <a class="anchor" id="II.1"></a>

- A `numpy` array is an object of the type `np.ndarray` (although this type specifier is rarely used directly). Instead one can create arrays in several ways: 

``` python
import numpy as np
np.array([1,2,3,4])   # creates an array from a python list
np.array([[0, 1, 2], [3, 4, 5]])   # Creates a 2D array from a python list
np.empty(shape=(2,3)) # Creates an "empty" (entry not initialised) array with 2 rows and 3 columns 
np.arange(5) # similar to the built-in range() function.
np.linspace(1, 10, 10) # creates an array of 10 elements from 1 to 10
np.zeros(10)  # creates an array of 10 elements filled with 0
np.ones(10)   # creates an array of 10 elements filled with 1
np.zeros((2, 5))  # multidimensional array of 2 rows and 5 columns

```
- 2-D arrays of `shape=(r, c)` are arrays with `r` *rows* and `c` *columns*. 

In [4]:
# Let's try the above commands and visualise the output. 
import numpy as np
a = np.array([[1,2,3], [3,5,5]])
a

array([[1, 2, 3],
       [3, 5, 5]])

In [7]:
np.shape(a)

(2, 3)

In [8]:
empty_array = np.empty(shape=(2,3))
empty_array

array([[5.04e-322, 0.00e+000, 0.00e+000],
       [0.00e+000, 0.00e+000, 0.00e+000]])

In [9]:
zero_array = np.zeros(shape=(2,3))
zero_array

array([[0., 0., 0.],
       [0., 0., 0.]])

In [27]:
ones_array = np.ones(shape=(2,2,3))
ones_array  # [1,1,0]

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [11]:
type(zero_array)

numpy.ndarray

In [13]:
zero_array.dtype

dtype('float64')

In [16]:
array_of_string = np.array(['qqqq', 'a', 'f'], dtype=str)
array_of_string

array(['qqqq', 'a', 'f'], dtype='<U4')

In [18]:
for i in range(5):
    print(i)

0
1
2
3
4


In [21]:
np.arange(0., 5., 0.5)

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

In [23]:
np.linspace(0, 5, 9)

array([0.   , 0.625, 1.25 , 1.875, 2.5  , 3.125, 3.75 , 4.375, 5.   ])

- `numpy` also has tools to create arrays filled with random elements:

``` python
np.random.random(size=4)  # uniform between 0 and 1
np.random.normal(size=4)  # elements are std-normal distributed

```

In [None]:
# Try out the above commands

- You can explicitly specify which **data-type** you want:

``` python 
c = np.array([1, 2, 3], dtype=float)
c.dtype
    Out: dtype('float64')
```

In [None]:
# Try out the above commands 

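Array-creation functions such as `np.zeros()`, `np.ones()` and `np.empty()` use floating point (`float64`) as their default data type, while `np.array()` infers the data type from its input (e.g. a list of python integers gives an integer array). Other possible data types are: 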

* **COMPLEX** numbers: 
``` python
d = np.array([1+2j, 3+4j, 5+6*1j])
d.dtype
    Out: dtype('complex128')
```

In [None]:
# Try out the above commands 

* **BOOL**:
``` python
e = np.array([True, False, False, True])
e.dtype
    Out: dtype('bool')
```

In [None]:
# Try out the above commands 

* **String**:
``` python
f = np.array(['abc', 'eddafg', 'hjk'])
f.dtype
    Out: dtype('<U6')   # <--- Unicode string of 6 characters (by default, the length of the longest element of the array)
```

In [None]:
# Try out the above commands 

* **Other data types**:  `int32`, `int64`, `uint32`, `uint64`  (uint = unsigned integer => only non-negative integers)

Note that `type(f)` tells you that `f` is a numpy array, while `f.dtype` gives you the *type of the elements* contained in `f`. `dtype` is an attribute of `np.ndarray` objects. If you try to access the `dtype` attribute of a list, you will get an error (`AttributeError`). 

In [None]:
# Difference between type/dtype; application to List/arrays.
f = np.array(['abc', 'eddafg', 'hjk'])
print(type(f))
print(f.dtype)
print('----------')
L = ['abc', 'eddafg', 'hjk']
print(type(L))
print(L.dtype)

- Last but not least, `numpy` is also the package that allows you to calculate many common mathematical functions (see also [`ufunc`](#II.4.1)): `np.log10()` (base 10 log), `np.log()` (natural log), `np.exp()`, `np.sin()`, `np.cos()`, etc. See the list of `numpy` mathematical functions [here](https://docs.scipy.org/doc/numpy/reference/routines.math.html).

In [None]:
# create an array of floats and calculate its log / sin / ... 
#x = np.linspace(-2*np.pi, 2*np.pi, 20)
np.log(2.3)

**Exercise:**   
For the array:
``` python
a = np.array([[1,2,3,4], [4,5,6,7], [2,3,4,5] ])
```
- What is the output of `a.ndim`, `a.shape`, `len(a)` ?     
- How do the above commands relate to the rows, columns, dimensions ?       
- How do you access the 2nd item of the first row ?   

*Note:* 
Try to do the same with the following array:
``` python
b = np.array([[1,2,3], []])
```

**Exercise:** Elementwise operations

In the code cell below, try simple arithmetic elementwise operations: 
- add even elements with odd elements using 2 different techniques (slicing and list comprehension)
- Time the two solutions using `%timeit`.
- Generate an array from a list made of strings and floats. What is the final array type ?
- Generate 2 arrays such that their elements are as follows:    
   `[2^0, 2^1, 2^2, 2^3, 2^4]`    
   `a_i = 2^(3*i) - i `    
   
Expected output: 
``` python
[1 2 4 8 16]    
[  1   7  62 509]    
```

### II.2 `array` copies and views:   <a class="anchor" id="II.2"></a>

A slicing operation creates a **view** on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. More information about copies and views can be found in [Modules_in_python_numpy_adv.ipynb](Modules_in_python_numpy_adv.ipynb), but can be ignored if this is your first contact with `numpy`. 
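
As a minimal sketch of what a "view" means in practice (the names `original`, `view` and `copy` are only illustrative): modifying a slice also modifies the original array, whereas an explicit `.copy()` owns its own data.

```python
import numpy as np

original = np.arange(6)          # array([0, 1, 2, 3, 4, 5])
view = original[1:4]             # slicing returns a view: no data is copied
view[0] = 100                    # modifying the view also modifies `original`
print(original)                  # [  0 100   2   3   4   5]

copy = original[1:4].copy()      # an explicit copy owns its own data
copy[0] = -1                     # `original` is left untouched
print(original)                  # [  0 100   2   3   4   5]

# np.shares_memory tells whether two arrays share the same memory
print(np.shares_memory(original, view), np.shares_memory(original, copy))  # True False
```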

### II.3 Array shape manipulation <a class="anchor" id="II.3"></a>

There are various ways to modify arrays (e.g. adding a row/column, shuffling columns, flattening, resizing, ...). They are discussed in detail in [Modules_in_python_numpy_adv.ipynb](Modules_in_python_numpy_adv.ipynb). Let's focus here on array reshaping (which allows one to perform several of the operations outlined above). 

- **II.3.1 Reshaping**:   
The method `reshape(newshape)` allows one to reorganise the elements of an array, to create a "new" array (see below) that has a different shape. The total number of items of the array has to be the same ! This method can also be used to add an axis or to flatten an array. 

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a, a.shape) 
b = a.reshape((3, 2))
b

In [None]:
# Alternatively 
a.reshape((3, -1))    # unspecified (-1) value is inferred
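
The other uses of `reshape` mentioned above (flattening, adding an axis) can be sketched as follows. This is only an illustration: the variable names are arbitrary, and `np.newaxis` is shown as one possible alternative for adding an axis.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])    # shape (2, 3)

flat = a.reshape(-1)                     # flatten to 1-D, shape (6,)
column = flat.reshape((-1, 1))           # add an axis: column vector, shape (6, 1)
row = flat[np.newaxis, :]                # np.newaxis also adds an axis, shape (1, 6)
print(flat.shape, column.shape, row.shape)   # (6,) (6, 1) (1, 6)

# The total number of elements must be preserved, otherwise a ValueError is raised
try:
    a.reshape((4, 2))
except ValueError as err:
    print('reshape failed:', err)
```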

### II.4 What makes `numpy` arrays useful structures ?  <a class="anchor" id="II.4"></a>

Python is fast *for coding and developing* but python is slow when it comes to *execution*, especially when it comes to the execution of `for` loops.    
One reason for this low speed is that python is dynamically typed: in a loop such as `for a in range(10): a + b`, the interpreter has to check the `type` of `a` and of `b` at *each iteration* before executing the operation. 

`numpy` helps speed up code through 4 strategies:
1. `ufunc`
2. aggregation
3. broadcasting
4. slicing, masking, fancy indexing

#### II.4.1 `ufunc`: operates elementwise on objects. <a class="anchor" id="II.4.1"></a>

Those `ufunc` (universal functions) are included (compiled) in `numpy` and consist of fast elementwise operations. They include: 

- all arithmetic operations: `+`, `-`, `/`, `*`, `**` 
- Mathematical functions: sin, exp, cos, log10, ... 
- Comparison operators: `<`, `>`, `==`, ...
- etc ... 

**Example:**
``` python
import numpy as np
# Basic python
a = [1,2,3,4,5]
b = [ val + 5 for val in a]   # add 5 to each element of the list  
# In numpy
a = np.array(a)
b = a + 5                     # add 5 to each element of the array.
```

In [None]:
a = [1,2,3,4,5]
b = [ val + 5 for val in a]   # add 5 to each element of the list
b

In [None]:
a = np.array(a)
b = a + 5   
b

In [None]:
# implement the above example for a list of 1000 elements 
# use %timeit before calculating b to see improvement in speed
a_a = np.arange(1000)
%timeit a_a+5

In [None]:
a_l = range(1000)
%timeit [val + 5 for val in a_l]

#### II.4.2 *aggregation*:   <a class="anchor" id="II.4.2"></a>

Functions which summarize values of an array such as `min`, `max`, `sum`, `mean`, ... 

**Example:**

``` python
# python version of an aggregation
from numpy import random 
c_list = [random.random_sample() for i in range(10000)]
%timeit min(c_list)
#same in numpy:
c = np.array(c_list)
%timeit c.min()  
```
This also works on multidimensional arrays: 

``` python 
M = np.random.randint(0, 10, (10,4))
M.sum(axis=0)
M.sum(axis=1)
```

Available aggregations: 
`c.min()`, `c.max()`, `c.prod()`, `c.mean()`, `c.std()`, `c.any()`, `c.all()`, `c.argmin()`, `c.argmax()`, ... NaN-aware versions of most of these aggregations exist as functions of the module (e.g. `np.nanmin(c)`, `np.nanmean(c)`).
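
A small sketch of aggregation along an axis, and of the difference between a regular aggregation and its NaN-aware counterpart. The array `M` is made up for the illustration; note that the NaN-aware versions are called as functions of the module (`np.nansum(M)`), not as array methods.

```python
import numpy as np

M = np.array([[1., 2., 3.],
              [4., np.nan, 6.]])

col_sum = M.sum(axis=0)              # sum over rows, one value per column: 5, nan, 9
col_nansum = np.nansum(M, axis=0)    # NaN-aware sum ignores the nan: 5, 2, 9
smallest = np.nanmin(M)              # 1.0
pos_max = np.nanargmax(M)            # 5, index of the maximum in the flattened array

print(col_sum, col_nansum, smallest, pos_max)
```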


#### II.4.3 *Broadcasting*:   <a class="anchor" id="II.4.3"></a>

Set of rules by which `ufuncs` operate on arrays of different sizes and/or dimensions. 

The term [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) describes how `numpy` treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
Application to three cases: 

![From astroML book](../Figures/fig_broadcast_visual_1.png)



The rules / how this works:

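* If the arrays differ in their number of dimensions, left-pad the shape of the array with fewer dimensions with 1s
* If the sizes along a dimension do not match, the array with size 1 along that dimension is stretched (broadcast) to match the other one
* If the sizes along a dimension do not match and neither of them is 1, an error is raised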

This broadcasting strategy allows one to avoid doing `for` loops for some operations. 
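
The rules above can be sketched on three typical cases (chosen here for illustration; they are assumed to be close to, but may differ from, the cases drawn in the figure):

```python
import numpy as np

a = np.arange(3)                    # shape (3,)
M = np.ones((3, 3))                 # shape (3, 3)
col = np.arange(3).reshape(3, 1)    # shape (3, 1)

r1 = a + 5        # the scalar is stretched to shape (3,)
r2 = M + a        # (3,) is padded to (1, 3), then stretched to (3, 3)
r3 = col + a      # (3, 1) and (1, 3) are both stretched to (3, 3)
print(r1.shape, r2.shape, r3.shape)   # (3,) (3, 3) (3, 3)

# Incompatible shapes: (3,) vs (4,), sizes differ and neither is 1
try:
    a + np.arange(4)
except ValueError as err:
    print('broadcast failed:', err)
```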


#### II.4.4 Slicing, masking and fancy indexing:    <a class="anchor" id="II.4.4"></a>
	 
- **Mask**: a mask is a boolean array that can be used to "mask" some indices of an array: 

``` python
mask = np.array([False, False, True, False, True, False])
c = np.array([1, 3, 6, 9, 10, 2])
c[mask]
    Out: array([6, 10])
    
mask = (c < 4) | (c > 8)
c[mask]
    Out: array([1, 3, 9, 10, 2])
```
 

- **Fancy indexing**: passing a list/array of indices to get elements of a numpy array (this only works for arrays, not for lists!). This avoids looping over the indices. 

``` python
ind = [1, 3, 4]
c[ind]  
   Out: array([3, 9, 10])
```

- **Multi-dimensional** array: 

We can apply mask and fancy indexing in multidimension.   
Remember that first index is row, and second is column.   
Remember how slicing works: `a[start:end:step]`: 
- Omitting `start` starts from the beginning of the sequence; omitting `end` goes up to the end of the sequence. 
- Omitting the second colon implies `step=1`.  
- With a negative step you count backward.
- `start`/`end` can be either positive or negative indices (negative indices count from the end). 
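
A few slices on a 1-D array, as a quick illustration of the `a[start:end:step]` syntax (the array is arbitrary):

```python
import numpy as np

a = np.arange(10)     # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(a[2:7])         # [2 3 4 5 6]   start=2, end=7 (excluded), step=1
print(a[:4])          # [0 1 2 3]     start omitted: from the beginning
print(a[6:])          # [6 7 8 9]     end omitted: up to the end
print(a[1:8:3])       # [1 4 7]       every 3rd element between 1 and 8
print(a[::-1])        # [9 8 7 6 5 4 3 2 1 0]   negative step: backward
print(a[-3:])         # [7 8 9]       negative start counts from the end
print(a[:-2])         # [0 1 2 3 4 5 6 7]   negative end counts from the end
```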

In [None]:
a = np.arange(10)
print(a)
a[a>3]

``` python
M = np.arange(12).reshape((3,4))
    Out: 
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])

M[0,1] # gives value at row 0 and column 1. 
M[:, 1]  # Combines slices and indices -> all rows of column 1
M[M-3 < 2] # masking also works on n-dimensional arrays (returns a 1D array)
M[[1,0], :2] # Use fancy indexing and slicing - first 2 elements of rows 1 and 0 (in that order)
M[M.sum(axis=1) > 2, 2:] # mixing masking and slicing - columns 2 onward of the rows whose sum is > 2
```

An illustration of indexing in numpy arrays:
![Illustration of `np` indexing](../Figures/numpy_indexing.png)

**Exercise**:
- Try the different flavours of slicing, using start, end and step: starting from a linspace, try to obtain odd numbers counting backwards, and even numbers counting forwards. Expected Output: `[9 7 5 3 1]` and `[ 0  2  4  6  8 10]` 

- Reproduce the slices in the diagram above. You may use the following expression to create the array:    
`np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]`

In [None]:
aa = np.linspace(0, 10, 11, dtype=int)
# odd numbers counting backwards

# Even numbers counting forward


In [None]:
# Implement the exercise above


### II.5 Reading arrays from a file and string formatting:    <a class="anchor" id="II.5"></a>

There are now multiple modules to manipulate data saved in files (text files and many other formats). Often, we simply want/have to read a table and do operations on it. This can be done easily within `numpy`, which provides a simple pair of commands to read/write a 2D array from/into a text file: a table saved in a formatted text file is read with `numpy.loadtxt('myfile.txt')`, while an array `my_array` is saved with `numpy.savetxt('myfile.txt', my_array)`.   

In [None]:
data = np.loadtxt('data.txt')
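
A minimal round trip with `np.savetxt()` / `np.loadtxt()` could look as follows. The file name, the header and the table content are made up for the example, and the file is written to a temporary directory so that no existing file is overwritten.

```python
import os
import tempfile
import numpy as np

table = np.arange(12, dtype=float).reshape((4, 3))   # made-up 4 x 3 table

# hypothetical file name, located in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'example_table.txt')

np.savetxt(path, table, fmt='%.3f', header='x y z')   # the header line is written with a leading '#'
loaded = np.loadtxt(path)                              # lines starting with '#' are skipped by default
second_col = np.loadtxt(path, usecols=1)               # read only the column of index 1

print(loaded.shape)                 # (4, 3)
print(np.allclose(loaded, table))   # True
```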

#### II.5.1 What if my input/output file is not just a simple array ? 

More advanced functions exist in `numpy` to read text/csv files, accounting for missing values, excluding columns, guessing data types, ... See [Modules_in_python_numpy_adv.ipynb](Modules_in_python_numpy_adv.ipynb) to learn about them. 

To deal with more advanced table formatting, I strongly encourage you to look at the following two modules that generate specific objects with a bit more capabilities than arrays: 

- `Table` and `QTable` (table that manipulates quantities) objects in `astropy.table`: https://docs.astropy.org/en/stable/table/ . `Table` objects may be sufficient for most of your needs and manage many different formats (including csv, latex, rdb, hdf5, ...). Conversion to/from numpy arrays and to/from `pandas.DataFrame` is often possible. Within Jupyter Notebooks, Tables are also "pretty printed", which eases analysis. 

- `DataFrame` and `Series` objects in `pandas`: https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html. Those objects/structures are commonly used in data science and machine learning. They manage an even larger variety of input files than `astropy.table` (e.g. excel and sql tables, pickle objects, ...) but DataFrames, being more versatile than astropy Tables, can also be trickier to manipulate. 

#### II.5.2. What if my input file mixes columns and normal rows ?

The lower-level manipulation of a text file is done through a file object, as returned by the built-in function `open()`. 
For this, three operations are generally needed (open, read/write, close): 

``` python
with open('myfile.txt', 'r') as f: # 'r' for read mode, 'w' for write mode, 'a' for append mode
    read_data = f.read() # reads the whole file as a single string; other methods allow a more flexible read

# One can also do the following (see below), but then there is a risk that the file is not properly closed (e.g. if an error occurs before f.close()).
f = open('myfile.txt', 'r')  
f.read()  
f.close() 
```

If you call `f.read()` twice, the second call returns an empty string: the file object then "points" to the end of the file, and there is nothing left to read. More generally, the methods that access a file object go sequentially through its content. With `read()` you take the content as a single string (which could be a problem memory-wise if the file is large!). 


In [None]:
with open('data.txt', 'r') as f: 
    read_data = f.read() 

The function `readlines()` reads in the whole file and splits it into a **list** of lines. 
``` python
with open('myfile.txt', 'r') as f:   # the file is closed automatically at the end of the block
    for line in f.readlines():
        print(repr(line))
```

In [None]:
f = open('data.txt', 'r')
a = f.readlines()
a

In [None]:
a[10].replace('.', ',')

In [None]:
a

In [None]:
f = open('data.txt', 'r')
for line in f.readlines():
    print(repr(line))

Once a line is read, it is possible to apply string methods, as on any normal string:    
- Remove leading/trailing whitespace, including the final `\n`: `line.strip()`
- Split the string into a list of strings: `line.split()`
- Replace a specific character by another: `line.replace(',', '.')` replaces each comma by a dot.
- Access a specific element of a split list and convert it to float: `float(line.split()[2])`
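
The string methods listed above can be chained to turn lines of text into numbers. A short sketch, assuming lines that look like the made-up ones below (the actual content of `data.txt` may be organised differently):

```python
# Made-up lines, similar to what f.readlines() returns
lines = ['1.0  2,5  3.0\n', '4.0  5,5  6.0\n']

values = []
for line in lines:
    cleaned = line.strip().replace(',', '.')    # drop the final '\n', use dots as decimal separator
    fields = cleaned.split()                    # list of strings, split on whitespace
    values.append([float(x) for x in fields])   # convert each field to float

print(values)                          # [[1.0, 2.5, 3.0], [4.0, 5.5, 6.0]]
print(float(lines[0].split()[2]))      # 3.0
```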

#### II.5.3: Saving an array into a file: 

To write a file, you basically follow the same procedure: 
``` python
mylist_of_lines = ['first line\n', 'second line\n']   # example content: the lines you want to write

with open('myfile.txt', 'w') as f:
    f.writelines(mylist_of_lines)   # ensure that each line ends with `\n`

# you can also write a single string, e.g. obtained by joining the lines
# (a list comprehension can be used to build the lines):
with open('myfile.txt', 'w') as f:
    f.write(''.join(mylist_of_lines))
```

**Exercise:**

Read the file `data.txt` and display the columns you care about for that file using:
- the file object
- Try to do the same using `numpy.loadtxt()`        
- Try to build a numpy array with the data in data.txt as read using `f = open('data.txt')`. 
- Modify 1 column of the file (replace it with 0) and write the results in `data_new.txt`


#### II.5.4 Formatting Strings

It often happens that you do not need to save all the decimals of a number, or would like to see it in scientific notation. There are [multiple ways to do it](https://docs.python.org/3/tutorial/inputoutput.html). One could spend (boring) hours describing all possible ways to format strings. The two main options are described below. You may look at https://pyformat.info/ to skim through various examples of formatting. The description below explains the basics and points you to the relevant documentation. A slightly expanded version of this section can be consulted in [Modules_in_python_numpy_adv.ipynb](Modules_in_python_numpy_adv.ipynb).

- **Option 1**: `printf-style` (simple (old style) but not universal) 

You can use the `%` operator to specify the formatting of the variable you want to show on the screen or save in a file. The variable does not appear explicitly in the string but after it, preceded by the `%` operator. Within the string, a format specifier such as `%f` (float) or `%e` (scientific notation) marks where the value is inserted. For example, `'%.2f' % variable` formats `variable` as a float with 2 digits after the decimal point. This generalizes to several variables: put them in a tuple after the `%` operator and write one format specifier per variable in the string; specifiers and variables are matched in order. 

Example:
``` python
print('%i is the square of %i' %(4.000, 2))
    Out: 4 is the square of 2
```
Here are some commonly used formatting characters:
- `%s`: String (or any object with a string representation, like numbers)
- `%d` or `%i`: Integers
- `%.<number_of_digits>f`: Floating point numbers with fixed number of digits to the right of the dot. 
- `%.<number_of_digits>e`: scientific notation with fixed number of digits to the right of the dot.

You may find more about string formatting in the [python 2 documentation](https://docs.python.org/2/library/stdtypes.html#string-formatting).  
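
The formatting characters listed above can be tried on a couple of made-up variables (`name` and `value` are only illustrative):

```python
value = 1234.56789
name = 'flux'

s1 = '%s = %.2f' % (name, value)   # fixed number of decimals
s2 = '%s = %.3e' % (name, value)   # scientific notation
s3 = '%d items' % 42               # integer

print(s1)   # flux = 1234.57
print(s2)   # flux = 1.235e+03
print(s3)   # 42 items
```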


In [None]:
# Experiment with the above examples 

- **Option 2**: `str.format()` method

This is a much more flexible and general method described in detail at https://docs.python.org/3/library/string.html#formatstrings. Format strings contain `replacement fields` surrounded by curly braces `{}`. This looks like: 

``` python
'val1 = {0:format_spec} and val2 = {1:format_spec}'.format(val1, val2)
```

Anything that is *not* contained in braces is considered literal text, which is *copied unchanged to the output*. See [here](https://docs.python.org/3/library/string.html#format-specification-mini-language) and [here](https://pyformat.info/) for more details and EXAMPLES. 

Example:
``` python
print('{0:.0f} is the square of {1:n}'.format(4.000, 2))
    Out: 4 is the square of 2
```
If you wish a float representation with 2 decimals: `{0:.2f}`.
You can also use the positional arguments to reverse the order of the values:
``` python
print('{1:.0f} is the square of {0:n}'.format(2, 4.000))
    Out: 4 is the square of 2
```

**Note**: 
- About `conversion field`: There are 3 possible conversion flags: `!s` which calls [str()](https://docs.python.org/3/library/stdtypes.html#str) on the value, `!r` which calls [repr()](https://docs.python.org/3/library/functions.html#repr) and `!a` which calls [ascii()](https://docs.python.org/3/library/functions.html#ascii).
- About `format_spec`: 
See https://docs.python.org/3/library/string.html#format-specification-mini-language
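
A few illustrative uses of `str.format()`, combining a format specification and conversion flags. The variables are made up; the keyword form `{x:...}` is an additional feature of `str.format()` that is not discussed above.

```python
a, b = 3.14159, 'text'

s1 = '{0:.2f} and {0:.3e}'.format(a)     # the same argument can be used twice
s2 = '{!r} vs {!s}'.format(b, b)         # repr() keeps the quotes, str() does not
s3 = '{x:.1f}, {y:d}'.format(x=a, y=7)   # fields can also be referred to by keyword

print(s1)   # 3.14 and 3.142e+00
print(s2)   # 'text' vs text
print(s3)   # 3.1, 7
```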

In [None]:
# Experiment with the above examples 

In [None]:
# Create three float variables a, b, c and give them some value (e.g. a=2.3, b=3, c=-5). 
# Print the sentence: `a=2.30, b=3 and c=-5.00e+00` using the formatting methods described above.

In [None]:
# Create a 1-D array of 5 floats and print their value with 2 digits floats. TIP: use list comprehension

**Note**: There is another very useful way in python to save "full objects" and access and use them later with all their characteristics. This can be done by importing the `pickle` [module](https://docs.python.org/2/library/pickle.html) (in python 2, the faster [cPickle](http://docs.python.org/library/pickle.html#module-cPickle) was preferred; in python 3, `pickle` uses the fast C implementation automatically when it is available). To write a pickle into a file, simply open your file in binary write mode (`pkl_file = open(filename, 'wb')`), use `pickle.dump(obj, pkl_file, protocol=-1)`, and close your file (`pkl_file.close()`). To read an object saved in a pickle file, follow the same procedure but open the file in binary read mode (`'rb'`) and use `obj = pickle.load(pkl_file)` instead of `pickle.dump()`. The `pandas` module also allows you to read/write pickle objects: see `pandas.read_pickle()` and `pandas.to_pickle()`.
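
The procedure described in the note can be sketched as follows, here with `with` blocks so that the file is closed automatically. The object and the file name are made up, and the file is written to a temporary directory. Keep in mind that you should only unpickle files coming from a source you trust.

```python
import os
import pickle
import tempfile
import numpy as np

obj = {'name': 'example', 'data': np.arange(5)}   # any picklable object

# hypothetical file name, located in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'example.pkl')

with open(path, 'wb') as pkl_file:             # binary write mode
    pickle.dump(obj, pkl_file, protocol=-1)    # -1 selects the highest available protocol

with open(path, 'rb') as pkl_file:             # binary read mode
    restored = pickle.load(pkl_file)

print(restored['name'], restored['data'])      # example [0 1 2 3 4]
```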

### II.6 Other useful numpy functions:  <a class="anchor" id="II.6"></a>

There are many useful predefined functions for manipulating arrays, finding elements, comparing arrays, ... Do not hesitate to have a look at the `numpy` help. I list below a few "must-know". You may consult [Modules_in_python_numpy_adv.ipynb](Modules_in_python_numpy_adv.ipynb) for a compilation of "may-know" (i.e. 75\% chance that one of them will save your day in a project).

- `np.sort(a)`: Returns sorted copy of an array along a specific axis (default = last axis)
- `np.searchsorted(a, v)`: Find indices where elements should be inserted to maintain order.
- `np.concatenate((a1, a2, ...))`: Join a sequence of arrays along an existing axis (the arrays are passed as a single tuple/list).
- `np.hstack(tup)` / `np.vstack(tup)`: Stack arrays in sequence horizontally/vertically (column-/row- wise).
- `np.where(condition, [x, y])`: Return elements chosen from `x` or `y` depending on `condition`.
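
A quick tour of these functions on two small, made-up arrays:

```python
import numpy as np

a = np.array([3, 1, 2])
b = np.array([6, 5, 4])

s = np.sort(a)                     # sorted copy: [1 2 3]
idx = np.searchsorted(s, 2.5)      # 2: 2.5 would be inserted just before the 3
joined = np.concatenate((a, b))    # [3 1 2 6 5 4]
h = np.hstack((a, b))              # 1-D arrays are joined end to end, shape (6,)
v = np.vstack((a, b))              # arrays are stacked as rows, shape (2, 3)
w = np.where(a > 1, a, 0)          # keep a where a > 1, else 0: [3 0 2]

print(s, idx, joined, h.shape, v.shape, w)
```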


### II.7 Summary:   <a class="anchor" id="II.7"></a>

What do you need to know to get started?

- Know how to create arrays : `np.array`, `np.arange`, `np.ones`, `np.zeros`, `np.linspace()`.

- Know the shape of the array with `array.shape`, then use *slicing* to obtain different views of the array: `array[start:end:step]` (and variations around that syntax). Adjust the shape of the array using reshape or flatten it with ravel.

- Obtain a subset of the elements of an array and/or modify their values with masks (`a[a < 0] = 0`).

- Know miscellaneous operations on arrays, such as finding the mean or max (aggregations: `array.max()`, `array.mean()`). Have the reflex to search in the documentation (online docs, `help()`, `np.lookfor()`) when you do not remember the exact syntax of a function!

- Master the *indexing* with arrays of integers, as well as *broadcasting*. Know more NumPy functions to handle various array operations.

- Be able to read/write data into a file, and format numbers on screen (or when writing them into files): `open()`, `close()`, `np.savetxt()/np.loadtxt()`, use of the `%` operator and the `.format()` string method. 


### II.8 References and supplementary material: <a class="anchor" id="VI"></a>

- Excellent video introducing numpy (and that inspired part of the numpy section of this notebook) by Jake VanderPlas: https://www.youtube.com/watch?v=EEUXKG97YRw

- Numpy quick-start: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

- About string formatting: https://docs.python.org/3/tutorial/inputoutput.html