# NumPy 

**NumPy (Numerical Python)** is a library for the Python programming language, adding support for large, multi-dimensional arrays such as vectors and matrices, along with a large collection of high-level mathematical functions to operate on these arrays (http://www.numpy.org/).

NumPy is:

* vectorized
* fast (comparable to compiled C / Fortran code)
* rich (not only linear algebra, but also random sampling, basic statistics, date and time support, and much more).

**Prerequisites**
* Python 3
* Project folder with virtual / conda environment set up
* NumPy


**How to import NumPy**

In [2]:
import numpy as np  # NumPy import convention

## NumPy Array

While a Python list can contain different data types, all elements of a NumPy array must have the **same data type** and, therefore, the same size in memory. This makes NumPy arrays more memory-efficient than Python lists.

To create an array, we can initialize a NumPy array from a Python list.

In [42]:
# list of numbers 0 ... 999_999:
l = list(range(1_000_000))
print(type(l), l[:10])

# convert list to 1d numpy array (vector):
a = np.array(l)
print(type(a), a[:10])

<class 'list'> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
<class 'numpy.ndarray'> [0 1 2 3 4 5 6 7 8 9]


NumPy arrays may look like lists, but operations on them are much faster.

You don't need to understand how we measure the time here; this will be part of the workshop:

In [44]:
import time

t_start = time.time()

l_sum = sum([i * i for i in l])

print("list comprehension mul + sum took {:.5f} seconds".format(time.time() - t_start))
print("sum result: {:d}\n".format(l_sum))


t_start = time.time()

a_sum = (a * a).sum()

print("ndarray mul + ndarray.sum took {:.5f} seconds".format(time.time() - t_start))
print("sum result: {:d}".format(a_sum))

list comprehension mul + sum took 0.11422 seconds
sum result: 333332833333500000

ndarray mul + ndarray.sum took 0.00357 seconds
sum result: 333332833333500000
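
As a complement to the manual timing above, here is a minimal sketch that uses the standard-library `timeit` module instead of `time.time()`. The names `numbers`, `array`, `list_sum_of_squares` and `array_sum_of_squares` are chosen for this illustration only, and the array is smaller than in the text to keep the run short. Actual timings depend on the machine, so none are shown.

```python
import timeit

import numpy as np

n = 100_000  # smaller than in the text, to keep the sketch quick
numbers = list(range(n))
array = np.arange(n, dtype=np.int64)  # explicit 64-bit integers to avoid overflow


def list_sum_of_squares():
    # pure Python: one interpreted multiplication per element
    return sum(i * i for i in numbers)


def array_sum_of_squares():
    # NumPy: multiplication and sum run in compiled code
    return int((array * array).sum())


repeats = 10
# timeit calls each function `repeats` times and returns the total time in seconds
t_list = timeit.timeit(list_sum_of_squares, number=repeats)
t_array = timeit.timeit(array_sum_of_squares, number=repeats)

print("list:    {:.5f} s per call".format(t_list / repeats))
print("ndarray: {:.5f} s per call".format(t_array / repeats))
print("same result:", list_sum_of_squares() == array_sum_of_squares())
```

Repeating the call several times averages out fluctuations that a single `time.time()` measurement is sensitive to.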


**Note**: NumPy methods (functions) are implemented in compiled C or Fortran code. Python itself is interpreted, not compiled, and hence slow for element-by-element loops. If you care about the speed of your computations, use NumPy operations throughout instead of Python loops.

**Note**: For most methods and operators of NumPy arrays there is a corresponding NumPy function, e.g. `np.sum()` for the `.sum()` method above. Element-wise operators such as `+` and `*` are backed by universal functions (ufuncs) such as `np.add` and `np.multiply`.

**Note**: NumPy takes advantage of the fact that all elements of an array have the same type by storing them in a **contiguous memory** layout, i.e. in a single block of memory that is big enough to hold all elements. This layout is a key reason for the speed of NumPy arrays.
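
The notes on the common element type and the contiguous layout can be checked with a small sketch. The variable names `mixed` and `a` are illustrative; the attributes used (`.dtype`, `.flags`, `.itemsize`, `.nbytes`) are standard `ndarray` attributes.

```python
import numpy as np

# mixing integers and a float: every element is stored as float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

a = np.arange(6, dtype=np.int64)
print(a.flags["C_CONTIGUOUS"])  # True: one contiguous block of memory
print(a.itemsize)  # 8 bytes per element
print(a.nbytes == a.size * a.itemsize)  # True
```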

<div class="alert alert-block alert-info">
    <p style="font-weight: bold; font-size:120%;"><i class="fa fa-question-circle"></i>&nbsp; Exercise [5 min]</p>

How fast do you think `sum(a*a)` and `np.sum(a*a)` are? Test it.

</div>

## Dimension

The `numpy.ndarray` type name stands for NumPy **N-dimensional array**. In Python terms these can be thought of as nested lists of numbers of equal lengths at each level of nesting.

`numpy.ndarray` has the following size-related properties:

* `.ndim` - number of dimensions,
* `.shape` - a tuple with the size in each dimension,
* `.size` - total number of elements in the array.

A 0D array is a scalar. A 1D array is a vector. A 2D array is a matrix. Three-dimensional arrays or more are also frequently used.

**How to check the shape and size of an array?**

In [45]:
def print_dim_info(a):
    print(a)
    print("ndim =", a.ndim)
    print("shape =", a.shape)
    print("size =", a.size)
    print()

In [46]:
num = np.array(0)
print_dim_info(num)

0
ndim = 0
shape = ()
size = 1



In [47]:
vector = np.array([0, 1, 2])
print_dim_info(vector)

[0 1 2]
ndim = 1
shape = (3,)
size = 3



In [8]:
matrix = np.array(
    [
        [0, 1, 2],
        [3, 4, 5],
    ]
)
print_dim_info(matrix)

[[0 1 2]
 [3 4 5]]
ndim = 2
shape = (2, 3)
size = 6



In [9]:
arr = np.array(
    [
        [
            [0, 1, 2],
            [3, 4, 5],
        ],
        [
            [6, 7, 8],
            [9, 10, 11],
        ],
    ]
)
print_dim_info(arr)

[[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]]
ndim = 3
shape = (2, 2, 3)
size = 12



<div class="alert alert-block alert-info">
    <p style="font-weight: bold; font-size:120%;"><i class="fa fa-question-circle"></i>&nbsp; Exercise</p>

1. Without coding, can you guess the output of the following commands, including the shape: `vector[1]`, `matrix[:,1]`, `arr[:,:,1]`? Check your answer.

2. What does `len()` return for NumPy arrays?

</div>

## Creating arrays

You have already seen that `np.array()` creates a NumPy array from explicit nested lists. You will usually want to create NumPy arrays in a more automated fashion.

Evenly spaced vectors by step size (default: 1), like `range()`:

In [10]:
np.arange(1, 11, 2)

array([1, 3, 5, 7, 9])

Evenly spaced vectors by number of points (default: 50):

In [11]:
np.linspace(0, 1, 10)

array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])
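
The difference between the two functions is easiest to see at the end point. A small sketch (the names `by_step` and `by_count` are illustrative): `np.arange` excludes the end value, while `np.linspace` includes it unless `endpoint=False` is passed.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)  # end point 1 is excluded
by_count = np.linspace(0, 1, 5)  # end point 1 is included by default
print(by_step)  # [0.   0.25 0.5  0.75]
print(by_count)  # [0.   0.25 0.5  0.75 1.  ]

# endpoint=False makes linspace exclude the end point as well
no_end = np.linspace(0, 1, 4, endpoint=False)
print(no_end)  # [0.   0.25 0.5  0.75]
```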

Single value N-dim arrays:

In [12]:
print(np.zeros(3), "\n")
print(np.ones(3), "\n")
print(np.zeros((3, 3)), "\n")
print(np.ones((3, 3)) * 3, "\n")
print(np.zeros((2, 3, 4)), "\n")

[0. 0. 0.] 

[1. 1. 1.] 

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]] 

[[3. 3. 3.]
 [3. 3. 3.]
 [3. 3. 3.]] 

[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]] 



Diagonal matrices:

In [13]:
print(np.diag([1, 2, 3]), "\n")
print(np.eye(3), "\n")  # same as: np.identity(3)

[[1 0 0]
 [0 2 0]
 [0 0 3]] 

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]] 



Random samples:

In [14]:
# fix the random seed if needed
np.random.seed(123)

size = (2, 3)
print(np.random.randint(low=0, high=3, size=size), "\n")
print(np.random.uniform(size=size), "\n")  # same as: np.random.rand(*size)
print(np.random.normal(size=size), "\n")  # same as: np.random.randn(*size)

[[2 1 2]
 [2 0 2]] 

[[0.55131477 0.71946897 0.42310646]
 [0.9807642  0.68482974 0.4809319 ]] 

[[-1.61930007 -1.11396442 -0.44744072]
 [ 1.66840161 -0.14337247 -0.6191909 ]] 
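
As an alternative to the global `np.random.seed()` state used above, NumPy (version 1.17 and later) also offers generator objects created with `np.random.default_rng()`. The following is a minimal sketch; the seed value and variable names are arbitrary, and the generated numbers differ from the ones shown above because a different algorithm is used.

```python
import numpy as np

rng = np.random.default_rng(123)  # a seeded, independent random generator

size = (2, 3)
integers = rng.integers(low=0, high=3, size=size)  # `high` is exclusive
uniform = rng.uniform(size=size)  # values in [0, 1)
normal = rng.normal(size=size)  # standard normal distribution
print(integers, "\n")
print(uniform, "\n")
print(normal, "\n")

# two generators created with the same seed produce the same numbers
rng1 = np.random.default_rng(123)
rng2 = np.random.default_rng(123)
reproducible = np.array_equal(rng1.normal(size=size), rng2.normal(size=size))
print(reproducible)  # True
```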



<div class="alert alert-block alert-info">
    <p style="font-weight: bold; font-size:120%;"><i class="fa fa-question-circle"></i>&nbsp; Exercise</p>

1. Experiment with examples above.
2. Create a lower triangular 9x9 matrix:
    
$$ 
\begin{pmatrix}
1 & 0 & \cdots & 0\\
2 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
9 & 8 & \cdots & 1
\end{pmatrix}
$$

Hint: check `?np.eye`
</div>

## Basic data types

The elements of a NumPy array share a data type, described by a `numpy.dtype` object and corresponding to a (machine-specific) C data type. The basic numerical types are:
* signed integers, e.g., `np.int32` (32 bit)
* unsigned integers, e.g., `np.uint64` (64 bit)
* single-precision floating points, e.g., `np.float32`
* double-precision floating points, e.g., `np.float64`
* boolean, e.g., `np.bool_`
* complex with double-precision floating points for each component, e.g., `np.complex128`
* (other, non-C types), e.g., `np.object_`

The data type of the elements is accessible via the `.dtype` property. Array creation functions such as `np.zeros()` use a float type (`float64`) by default:

In [16]:
v1 = np.zeros(3)
print(v1)
print(v1.dtype)
print(type(v1.dtype))

[0. 0. 0.]
float64
<class 'numpy.dtype[float64]'>


but it can be implicitly inherited from Python objects:

In [17]:
v2 = np.array([0, 0, 0])
print(v2)
print(v2.dtype)

[0 0 0]
int64


or specified explicitly via dtype keyword argument:

In [18]:
v3 = np.zeros(3, dtype=np.int64)
print(v3)
print(v3.dtype)

[0 0 0]
int64


The data type determines the size of an array in memory:

In [19]:
a10int = np.arange(10)
print("each element takes", a10int.itemsize, "bytes")  # same as `a10int.dtype.itemsize`
print("the full array takes", a10int.nbytes, "bytes")  # .nbytes = .itemsize * .size

each element takes 8 bytes
the full array takes 80 bytes
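
To illustrate how the choice of data type changes the memory footprint, here is a small sketch that creates the same ten numbers with different dtypes. The variable names are illustrative.

```python
import numpy as np

for dtype in (np.int8, np.int32, np.int64, np.float32, np.float64):
    arr = np.arange(10, dtype=dtype)
    print(arr.dtype, arr.itemsize, arr.nbytes)

# .astype() returns a converted copy with the requested dtype
small = np.arange(10, dtype=np.int64).astype(np.int8)
print(small.nbytes)  # 10
```

Smaller types save memory but also restrict the range of representable values, e.g. `np.int8` only covers -128 to 127.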


## Indexing and slicing

NumPy arrays use square brackets `[]` for indexing and slicing, compatible with Python (nested) lists:

In [48]:
matrix = np.array(
    [
        [0, 1, 2],
        [3, 4, 5],
    ]
)

print(matrix)
print()

# first row:
vector = matrix[0]
print(vector)
print()

number = vector[0]
print(number)

[[0 1 2]
 [3 4 5]]

[0 1 2]

0


In [49]:
print(vector[-1])  # negative indices for indexing from the end of the array
print(vector[:2])  # start:end (note: not including last index)
print(vector[0:1])  # pay attention that this is an array while vector[0] is an element
print(vector[::2])  # start:end:step
print(vector[::-1])  # reverse order
print(
    vector[:0:-1]
)  # start is deduced as len(vector)-1; end=0 is not included (can be confusing)
print(vector[1:][::-1])  # easier to follow

2
[0 1]
[0]
[0 2]
[2 1 0]
[2 1]
[2 1]


NumPy additionally supports multi-dimensional indexing, either using index tuples or by listing the indices directly inside the brackets:

In [50]:
print(matrix)
print(matrix[0, 1])  # first row, second column
print(matrix[::-1, 2])  # all rows reversed (start:end deduced), third column

[[0 1 2]
 [3 4 5]]
1
[5 2]
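
The two spellings of multi-dimensional indexing can be compared in a short sketch. The variable `idx` is introduced here for illustration only.

```python
import numpy as np

matrix = np.array([[0, 1, 2], [3, 4, 5]])

idx = (0, 1)  # index tuple: first row, second column
print(matrix[idx])  # 1
print(matrix[0, 1])  # 1, the same element addressed directly
print(matrix[0][1])  # 1, but via an intermediate row array

# a slice in every dimension selects a sub-matrix
sub = matrix[:, 1:]
print(sub)
```

Chained indexing such as `matrix[0][1]` gives the same element here, but it first builds the row `matrix[0]` and then indexes into it.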


## Iterating

A `for` loop iterates over the first dimension of a `np.ndarray` array:

In [51]:
for number in vector:
    print(number)
print()

for row in matrix:
    print(row)
print()

0
1
2

[0 1 2]
[3 4 5]



The value of `len()` and indexing with a single index correspond to the `for` loop behaviour, much like you would expect from nested lists:

In [52]:
for i in range(len(matrix)):
    print(matrix[i])
print()

[0 1 2]
[3 4 5]



To iterate over all elements, use `np.nditer` / `np.ndenumerate`.

**Note**: 

Memory locations (addresses) on a computer are single numbers, so if you want to store a matrix you have to think about how the row and column index of a matrix element is translated to / from such an address. There are two common schemes for this: `C-order` and `F-order`.

By default, NumPy arrays are stored in `C-order`, i.e. numbers are stored sequentially row by row. This is unlike Fortran or MATLAB, where arrays are stored column by column, which NumPy calls `F-order`.

In [53]:
print(matrix)
print()

for number in np.nditer(matrix):
    print(number, end=", ")
print()
print()

for index, number in np.ndenumerate(matrix):
    print("a[{}, {}] = {}".format(*index, number))

[[0 1 2]
 [3 4 5]]

0, 1, 2, 3, 4, 5, 

a[0, 0] = 0
a[0, 1] = 1
a[0, 2] = 2
a[1, 0] = 3
a[1, 1] = 4
a[1, 2] = 5


In [54]:
matrix_F_order = np.array(matrix, order="F")

print(matrix_F_order)
print()

for number in np.nditer(matrix_F_order):
    print(number, end=", ")

print()
print()

for index, number in np.ndenumerate(matrix_F_order):
    print("a[{}, {}] = {}".format(*index, number))

[[0 1 2]
 [3 4 5]]

0, 3, 1, 4, 2, 5, 

a[0, 0] = 0
a[0, 1] = 1
a[0, 2] = 2
a[1, 0] = 3
a[1, 1] = 4
a[1, 2] = 5
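
The translation between indices and memory addresses is visible in the `.strides` attribute, which gives the number of bytes to step in memory when an index increases by one. A minimal sketch, assuming 64-bit integers (8 bytes per element); the names `matrix_c` and `matrix_f` are illustrative.

```python
import numpy as np

matrix_c = np.arange(6, dtype=np.int64).reshape(2, 3)  # C-order (row by row)
matrix_f = np.asfortranarray(matrix_c)  # same values, F-order (column by column)

# C-order: next row is 3 elements (24 bytes) away, next column 1 element (8 bytes)
print(matrix_c.strides)  # (24, 8)
# F-order: next row is 1 element (8 bytes) away, next column 2 elements (16 bytes)
print(matrix_f.strides)  # (8, 16)

# order="K" reads the elements in the order in which they are laid out in memory
print(matrix_c.ravel(order="K"))  # [0 1 2 3 4 5]
print(matrix_f.ravel(order="K"))  # [0 3 1 4 2 5]

# indexing gives the same values regardless of the memory layout
print(np.array_equal(matrix_c, matrix_f))  # True
```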


## Views vs. copies

Assigning a Python object to another variable does not copy the data, but assigns another name to the same object.

In [55]:
a = [1, 2, 3]
b = a

print("memory address of a", id(a))
print("memory address of b", id(b))

print("a is b:", a is b)

memory address of a 4670180032
memory address of b 4670180032
a is b: True


The consequence is that modifying `a` also modifies `b`:

In [56]:
a[1] = 42
print("a = ", a)
print("b = ", b)

a =  [1, 42, 3]
b =  [1, 42, 3]


To create a full copy, Python offers multiple options, e.g.:

In [57]:
from copy import deepcopy

a = [1, 2, 3]
b = deepcopy(a)
print("memory address of a", id(a))
print("memory address of b", id(b))

print("a is b:", a is b)

a[1] = 42
print("a =", a)
print("b =", b)

memory address of a 4620391680
memory address of b 4620445504
a is b: False
a = [1, 42, 3]
b = [1, 2, 3]


To copy an array, use the `np.copy` function.

**Warning**: NumPy arrays also have a `.copy()` method, but its default behaviour is slightly different: it produces a `C-order` copy even if the original array was in `F-order`. The method defaults to `order="C"`, whereas `np.copy` defaults to `order="K"`, which keeps the layout of the original.

**Note**: In case the data type is `object_` you have to use `copy.deepcopy` to ensure all elements are copied.

In [58]:
matrix_F_order_copy1 = np.copy(matrix_F_order)
print(np.isfortran(matrix_F_order_copy1))

print()

matrix_F_order_copy2 = matrix_F_order.copy()
print(np.isfortran(matrix_F_order_copy2))

print()

import copy

matrix_F_order_copy3 = copy.deepcopy(matrix_F_order)
print(np.isfortran(matrix_F_order_copy3))

True

False

True


**Beware**: slicing doesn't copy the original array in memory, but only creates a "view" on an existing memory block. Unlike in MATLAB, there is no implicit copy-on-write behaviour here.

In [59]:
vec1 = np.arange(10)
print("vec1", vec1)
vec2 = vec1[:3]
print("vec2", vec2)

vec2[:] = vec2[::-1]
print("vec2 after change", vec2)
print("vec1", vec1)  # the original array has also changed!

vec1 [0 1 2 3 4 5 6 7 8 9]
vec2 [0 1 2]
vec2 after change [2 1 0]
vec1 [2 1 0 3 4 5 6 7 8 9]


Use `np.shares_memory()` to check if two arrays share memory:

In [60]:
np.shares_memory(vec1, vec2)

True

You can make explicit copies using the `np.copy()` function:

In [61]:
vec1 = np.arange(10)
vec2 = np.copy(vec1[:3])  # create a copy from the view
print(np.shares_memory(vec1, vec2))
vec2[:] = vec2[::-1]
print(vec2)
print(vec1)  # the original array is intact

False
[2 1 0]
[0 1 2 3 4 5 6 7 8 9]
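
Besides `np.shares_memory()`, the `.base` attribute and the `OWNDATA` flag of an array also indicate whether it is a view. A minimal sketch; the names `view` and `vec_copy` are illustrative.

```python
import numpy as np

vec1 = np.arange(10)
view = vec1[:3]  # slicing creates a view
vec_copy = vec1[:3].copy()  # explicit copy of the view

# a view remembers the array whose memory it uses
print(view.base is vec1)  # True
# a copy owns its memory and has no base
print(vec_copy.base is None)  # True
print(vec1.flags["OWNDATA"], view.flags["OWNDATA"])  # True False
```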


## Shape manipulation

In most cases, changing the dimensions of an array creates a view of the original array, not a copy.

The shape of an array can be changed using the `.reshape()` method:


In [62]:
vector = np.arange(8)
print(vector, "\n")

matrix = vector.reshape(2, 4)  # product of `.shape` entries must stay the same
print(matrix, "\n")

arr = matrix.reshape(2, 2, 2)
print(arr, "\n")

# .reshape() returns a view here, so the change propagates to `vector`
arr[1, 1, 1] = 0
print(vector)

[0 1 2 3 4 5 6 7] 

[[0 1 2 3]
 [4 5 6 7]] 

[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]] 

[0 1 2 3 4 5 6 0]


One can specify `-1` for one of the reshape arguments, since its value is fixed by the constraint that the product of the arguments must not change:

In [63]:
print(vector.reshape(-1, 2))
print(vector.reshape(2, -1, 2))

[[0 1]
 [2 3]
 [4 5]
 [6 0]]
[[[0 1]
  [2 3]]

 [[4 5]
  [6 0]]]


You can conveniently get a vector view of an array of any shape using `.ravel()` (a view whenever possible) or a vector copy using `.flatten()` (always a copy):

In [64]:
print(arr.ravel())  # same as: arr.reshape(np.prod(arr.shape))
print(arr.flatten())  # same as: arr.ravel().copy()

[0 1 2 3 4 5 6 0]
[0 1 2 3 4 5 6 0]
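
Whether flattening yields a view or a copy can be checked with `np.shares_memory()`. The sketch below uses an illustrative array `m`; the transposed case shows that `.ravel()` has to fall back to a copy when the elements are not laid out in the requested order.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)

# .ravel() of a C-contiguous array is a view
print(np.shares_memory(m, m.ravel()))  # True
# .flatten() always copies
print(np.shares_memory(m, m.flatten()))  # False
# the transpose is not C-contiguous, so .ravel() has to copy
print(np.shares_memory(m, m.T.ravel()))  # False
```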


To conveniently add a dimension (of size 1) to an existing array, you can use `numpy.newaxis` in combination with slicing:

In [65]:
vector = np.arange(10)
print(vector)
print(vector.shape)
print()

# single row matrix; same as: vector.reshape(1, vector.shape[0])
mat_row = vector[np.newaxis, :]
print(mat_row.shape)
print(mat_row)
print()

# single column matrix (column vector)
mat_col = vector[:, np.newaxis]
print(mat_col.shape)
print(mat_col)
print()

[0 1 2 3 4 5 6 7 8 9]
(10,)

(1, 10)
[[0 1 2 3 4 5 6 7 8 9]]

(10, 1)
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]



`None` instead of `np.newaxis` also works:

In [68]:
print(vector[None, :])
print(vector[:, None])

[[0 1 2 3 4 5 6 7 8 9]]
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]


In addition to the `.reshape()` method, there is also the `.resize()` method, which changes an array **"in-place"** and allows for size changes:

In [69]:
arr = np.arange(8)
print(arr, "\n")

arr.resize(2, 5)  # fill w/ zeros
print(arr, "\n")

arr.resize(2, 2)  # trim
print(arr, "\n")

[0 1 2 3 4 5 6 7] 

[[0 1 2 3 4]
 [5 6 7 0 0]] 

[[0 1]
 [2 3]] 



Side note: the `np.resize()` function works differently. It creates a copy and fills up the added space with repeats of the original array (not zeros):

In [70]:
arr2 = np.resize(arr, (3, 3))
print(arr2, "\n")
print(np.shares_memory(arr, arr2))

[[0 1 2]
 [3 0 1]
 [2 3 0]] 

False


Transposition allows for easy re-shuffling of dimensions of an array:

In [71]:
matrix = np.arange(6).reshape(3, 2)
print(matrix)
print(matrix.T)  # same as: matrix.transpose()

[[0 1]
 [2 3]
 [4 5]]
[[0 2 4]
 [1 3 5]]


where `.T` always reverses the order of the dimensions and `.transpose()` allows for any permutation:

In [72]:
arr = matrix.reshape(1, 2, 3)
print(arr.shape)
print(arr.T.shape)
print(arr.transpose(1, 2, 0).shape)

(1, 2, 3)
(3, 2, 1)
(2, 3, 1)


## "Fancy" indexing


"Fancy" indexing means indexing with boolean or integer masks, which is very convenient for filtering array elements:

In [42]:
np.random.seed(42)
vector = np.random.uniform(-1, 1, size=5)
print(vector)

[-0.25091976  0.90142861  0.46398788  0.19731697 -0.68796272]


In [43]:
mask = vector < 0
print(mask)

[ True False False False  True]


In [44]:
vector[vector < 0]

array([-0.25091976, -0.68796272])

Selecting elements with a mask returns a copy, not a view. Assigning through a mask, however, writes directly into the original vector:

In [45]:
vector[vector < 0] = 0
print(vector)

[0.         0.90142861 0.46398788 0.19731697 0.        ]


In [46]:
vector = vector[vector != 0]
print(vector)

[0.90142861 0.46398788 0.19731697]


Using integer masks is like slicing, but with custom lists of indices for each dimension:

In [47]:
arr = np.diag(np.arange(1, 6))

print(arr)
print()
print(arr[:2, [0, 1]])

[[1 0 0 0 0]
 [0 2 0 0 0]
 [0 0 3 0 0]
 [0 0 0 4 0]
 [0 0 0 0 5]]

[[1 0]
 [0 2]]


When using NumPy arrays for indexing, the shape of the index array determines the shape of the result:

In [48]:
vector = 1 + np.arange(10)
print(vector)
print()
idx = np.array(
    [
        [1, 3],
        [4, 6],
    ]
)
print(vector[idx])

[ 1  2  3  4  5  6  7  8  9 10]

[[2 4]
 [5 7]]
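
How selection and assignment with a mask relate to the memory of the original array can be checked with a minimal sketch. The values and the name `negative` are chosen for illustration.

```python
import numpy as np

vector = np.array([-1.0, 2.0, -3.0, 4.0])

# selecting with a mask creates a new array with its own memory
negative = vector[vector < 0]
print(np.shares_memory(vector, negative))  # False
negative[:] = 0  # only changes the selected copy
print(vector)  # [-1.  2. -3.  4.]

# assigning through a mask modifies the original array in place
vector[vector < 0] = 0
print(vector)  # [0. 2. 0. 4.]
```

This differs from slicing, where the result shares memory with the original array.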


<div class="alert alert-block alert-info">
    <p style="font-weight: bold; font-size:120%;"><i class="fa fa-question-circle"></i>&nbsp; Exercise</p>

1. Experiment with the examples above. (What if the mask has a different size? Can you combine slicing/lists and NumPy arrays in multi-dim indexing?)
2. Write a `matrix_elements_at` function that takes a matrix and a list of tuples of (i,j) positions in the matrix and returns a vector of elements at these positions. Test it: `matrix_elements_at(np.diag(range(4)), [(3,3),(2,2),(0,1)])` should return the vector 3, 2, 0.

</div>

## Numerical Operations and universal functions (ufuncs)

All algebraic operations such as `+` or `*` are element-wise. 

In [49]:
vector = np.arange(8).reshape(2, 4)
print(vector + vector)
print(vector ** 2)
print(vector * vector)  # element-wise multiplication!

[[ 0  2  4  6]
 [ 8 10 12 14]]
[[ 0  1  4  9]
 [16 25 36 49]]
[[ 0  1  4  9]
 [16 25 36 49]]


For dot-product and matrix multiplication use `@` or `.dot`:

In [50]:
vector = np.arange(3)
print(vector @ vector)  # same as: vector.dot(vector)
print()

matrix = np.ones((3, 3))
mat_id = np.eye(3)
print(matrix * mat_id)  # element-wise multiplication
print(matrix @ mat_id)  # same as: matrix.dot(mat_id)

5

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


The benefit of `@` is that more complicated matrix products are more readable. E.g.  $A^T B A$:

In [78]:
A = np.random.random(size=(3, 3))
B = np.random.random(size=(3, 3))

# which one is more readable?
print(A.T @ B @ A == np.dot(np.dot(A.T, B), A))

[[ True  True  True]
 [ True  True  True]
 [ True  True  True]]
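
When comparing floating point results it is usually safer to use `np.allclose()` than `==`, because different evaluation orders can differ by rounding errors. The sketch below compares three spellings of the same product; `np.linalg.multi_dot` is a further NumPy function for chained products. The seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random(size=(3, 3))
B = rng.random(size=(3, 3))

with_operator = A.T @ B @ A
with_dot = np.dot(np.dot(A.T, B), A)
with_multi_dot = np.linalg.multi_dot([A.T, B, A])

# allclose compares element-wise within a small tolerance
print(np.allclose(with_operator, with_dot))  # True
print(np.allclose(with_operator, with_multi_dot))  # True
```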


The so-called `ufuncs` in `numpy` implement functions such as `sin`, `log`, ... and operate element-wise on arrays:

In [79]:
a = np.arange(6).reshape(3, -1)
print(a)
print(np.sin(a))

[[0 1]
 [2 3]
 [4 5]]
[[ 0.          0.84147098]
 [ 0.90929743  0.14112001]
 [-0.7568025  -0.95892427]]
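
Ufuncs are objects of type `np.ufunc` and, in addition to being callable, provide methods such as `.reduce()` and `.accumulate()`. A minimal sketch:

```python
import numpy as np

print(isinstance(np.sin, np.ufunc))  # True
print(isinstance(np.add, np.ufunc))  # True
print(isinstance(np.sum, np.ufunc))  # False: a regular function

a = np.arange(5)
print(np.add(a, a))  # [0 2 4 6 8], same as a + a
print(np.add.reduce(a))  # 10, same as a.sum()
print(np.add.accumulate(a))  # [ 0  1  3  6 10], same as a.cumsum()
```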


## Broadcasting

Broadcasting performs numerical operations between arrays with different shapes, e.g., adding a scalar (zero-dimensional array) to an array.

In [81]:
a = np.arange(12).reshape(3, -1)
print(a)
a + 17

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


array([[17, 18, 19, 20],
       [21, 22, 23, 24],
       [25, 26, 27, 28]])

Broadcasting also allows operations of arrays with mixed dimensions > 0:

In [82]:
a = np.arange(12).reshape(3, -1)
b = np.array([1, 10, 100, 1000]).reshape(-1, 4)
print(a)
print(b)
print(a.shape)
print(b.shape)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[   1   10  100 1000]]
(3, 4)
(1, 4)


In [83]:
print(a + b)

[[   1   11  102 1003]
 [   5   15  106 1007]
 [   9   19  110 1011]]


Here the array `b` was 
1. first conceptually "extended" to match the shape of `a` by repeating its single row (NumPy does not actually copy the data),
2. and then added element-wise.
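
The "extension" step can be made explicit with `np.broadcast_to()`, which returns a read-only view with the target shape. In this sketch the name `b_extended` is illustrative; a stride of 0 along the first axis shows that all rows refer to the same memory.

```python
import numpy as np

a = np.arange(12).reshape(3, -1)
b = np.array([1, 10, 100, 1000]).reshape(-1, 4)

b_extended = np.broadcast_to(b, a.shape)
print(b_extended)

# stride 0 along the rows: every row points at the same memory
print(b_extended.strides[0])  # 0
print(np.shares_memory(b, b_extended))  # True
print(np.array_equal(a + b, a + b_extended))  # True
```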

Another example of broadcasting, where we operate row-wise:

In [84]:
c = np.array([1, 10, 100]).reshape(3, -1)
print(c)
print(c.shape)

[[  1]
 [ 10]
 [100]]
(3, 1)


In [85]:
print(a * c)

[[   0    1    2    3]
 [  40   50   60   70]
 [ 800  900 1000 1100]]


In order to understand broadcasting please read the introduction and the rules of broadcasting from: https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html

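However, broadcasting cannot be applied to arbitrary pairs of shapes: comparing the shapes from the last dimension backwards, the sizes must either be equal or one of them must be 1.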

In [86]:
c = np.array([1, 10]).reshape(1, -1)
print(c)
a + c

[[ 1 10]]


ValueError: operands could not be broadcast together with shapes (3,4) (1,2) 
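
Shape compatibility can be checked without creating any arrays using `np.broadcast_shapes()` (available in NumPy 1.20 and later). The helper `can_broadcast` below is a hypothetical convenience function written for this sketch, not part of NumPy.

```python
import numpy as np


def can_broadcast(*shapes):
    """Return True if the given shapes are compatible for broadcasting."""
    try:
        np.broadcast_shapes(*shapes)
        return True
    except ValueError:
        return False


print(np.broadcast_shapes((3, 4), (1, 4)))  # (3, 4)
print(np.broadcast_shapes((3, 4), (3, 1)))  # (3, 4)
print(np.broadcast_shapes((3, 4), (4,)))  # (3, 4)
print(can_broadcast((3, 4), (1, 2)))  # False: 4 and 2 do not match
```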

## Further reading

* https://numpy.org/devdocs/user/absolute_beginners.html
* https://numpy.org/devdocs/user/basics.html
* https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html
* https://numpy.org/
* Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). https://doi.org/10.1038/s41586-020-2649-2

# Pandas

`pandas` is a flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R `data.frame` objects, statistical functions, and much more (https://pandas.pydata.org).

Pandas data frames are:

* a non-homogeneous tabular data structure: contrary to 2D `numpy` arrays, each column of a `DataFrame` can have a different type,
* built using NumPy arrays internally, so also vectorized and (C) fast,
* convenient to transform, summarize, and plot.


In [87]:
import pandas as pd  # Pandas import convention

# Inline, sharp plots
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Read file

In [88]:
# Brain size and weight and IQ data (Willerman et al. 1991)
# Downloaded from Scipy Lecture Notes on 30. Jun 2019:
# http://scipy-lectures.org/_downloads/brain_size.csv
#

# no need to understand the following line, it will show the first 6 lines
# of the file data/brain_size.csv
!head -n 6 data/brain_size.csv

"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545


Pandas provides the `read_csv()` function to read data from a CSV file into a pandas DataFrame. In addition to CSV, Pandas also supports several other file formats, e.g., Excel, SQL, JSON, Parquet, etc. Each reading function has the prefix `read_`.

In [89]:
df = pd.read_csv("data/brain_size.csv", sep=";", na_values=".", index_col=0)
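
To try the same call without the data file, the first lines of the file shown above can be parsed from a string using `io.StringIO`. This is a minimal sketch; the names `csv_text` and `sample` are illustrative and only three data rows are included.

```python
import io

import pandas as pd

# header and first three data rows, copied from the file listing above
csv_text = '''"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
'''

# sep=";" sets the column separator, na_values="." marks missing values,
# index_col=0 uses the first column as row labels
sample = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=".", index_col=0)

print(sample.shape)
print(sample.dtypes)
print("missing weights:", sample["Weight"].isna().sum())
```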

To see the first 5 rows of a pandas DataFrame:

In [90]:
df.head(5)

   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
5  Female   137  132  134   147.0    65.0     951545


and for the last 5 rows:

In [91]:
df.tail(5)

    Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
36  Female   133  129  128   153.0    66.5     948066
37    Male   140  150  124   144.0    70.5     949395
38  Female    88   86   94   139.0    64.5     893983
39    Male    81   90   74   148.0    74.0     930016
40    Male    89   91   89   179.0    75.5     935863


A data frame allows fast indexing along both the columns and rows denoted by the `columns` and `index` attributes, respectively. The columns and rows are labeled such that one can use the names. Each column in a data frame is a `pd.Series` and is very similar to a NumPy 1D array, with the difference that the positional index is also labeled.

In [92]:
print("shape:", df.shape)
print()
print("column labels:")
print(df.columns)
print()
print("column types:")
print(df.dtypes)
print()
print("row labels:")
print(df.index)

shape: (40, 7)

column labels:
Index(['Gender', 'FSIQ', 'VIQ', 'PIQ', 'Weight', 'Height', 'MRI_Count'], dtype='object')

column types:
Gender        object
FSIQ           int64
VIQ            int64
PIQ            int64
Weight       float64
Height       float64
MRI_Count      int64
dtype: object

row labels:
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35, 36, 37, 38, 39, 40],
           dtype='int64')


**Side note**: Pandas has a great date-time support, including indexing using date-time ranges or groups. See, e.g., https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html#Pandas-Time-Series:-Indexing-by-Time

## Indexing and Slicing

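Pandas DataFrames support 3 types of multi-axis indexing:
1. `.loc[]` for label-based indexing
2. `.iloc[]` for integer-position-based indexing (similar to Python)
3. `[]` for plain indexing, whose meaning depends on the argument (e.g. column labels, row slices, or boolean masks)


**Warning**: `[]` can introduce confusion, so we will skip it.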

In [93]:
print("first three rows, column with index 0:")
print()
print(df.iloc[:3, 0].head())
print()

print("rows 1..3, column with name 'Gender':")
print()
# note: with .loc the end index/name IS INCLUSIVE

print(df.loc[1:3, "Gender"].head())
print()

print(type(df.loc[:, "Gender"]))  # a single column is a series
print(type(df.loc[:, ["Gender"]]))  # a list of columns is a data frame

first three rows, column with index 0:

1    Female
2      Male
3      Male
Name: Gender, dtype: object

rows 1..3, column with name 'Gender':

1    Female
2      Male
3      Male
Name: Gender, dtype: object

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [94]:
# access a column as an attribute, as long as it has a valid name
print(df.Gender.iloc[:3])
print()
# access multiple columns
print(df.loc[:, ["Gender", "Weight"]].iloc[:3, :])
print()

1    Female
2      Male
3      Male
Name: Gender, dtype: object

   Gender  Weight
1  Female   118.0
2    Male     NaN
3    Male   143.0



To avoid confusion when slicing and indexing, it is best to use the `.loc` property with row (and column) names and the `.iloc` property with integer positions:

In [95]:
print(df.loc[:3, "Gender":"Weight"])  # use "labels", so :3 == [1, 2, 3]
print()
print(df.iloc[:3, :-2])  # use Python indexing, so :3 == [0, 1, 2]
print()
print(all(df.loc[1, :] == df.iloc[0, :]))

   Gender  FSIQ  VIQ  PIQ  Weight
1  Female   133  132  124   118.0
2    Male   140  150  124     NaN
3    Male   139  123  150   143.0

   Gender  FSIQ  VIQ  PIQ  Weight
1  Female   133  132  124   118.0
2    Male   140  150  124     NaN
3    Male   139  123  150   143.0

True
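
The difference between labels and positions becomes clearer when the row labels do not start at 1. The following sketch uses a small hand-made data frame; the name `frame` and the row labels 10, 20, 30 are hypothetical and not taken from the brain size data.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Gender": ["Female", "Male", "Male"], "FSIQ": [133, 140, 139]},
    index=[10, 20, 30],  # hypothetical row labels
)

# label-based: the end label 20 is included
by_label = frame.loc[10:20, "Gender"]
# position-based: the end position 2 is excluded, as in Python slicing
by_position = frame.iloc[0:2, 0]

print(by_label.equals(by_position))  # True
print(len(frame.loc[10:20]), len(frame.iloc[0:1]))  # 2 1
```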


## Fancy indexing

Boolean masks work as in NumPy:

In [96]:
print(df.loc[df.Weight > 150, "Gender"])

4       Male
8     Female
10      Male
12      Male
13      Male
14    Female
18      Male
20      Male
22      Male
26      Male
28      Male
30    Female
32      Male
33      Male
34      Male
36    Female
40      Male
Name: Gender, dtype: object


Boolean masks select rows, and several conditions can be combined with `&` (and), `|` (or), and `~` (not):

In [97]:
len(df.loc[(df.Weight > 150) & (df.Gender == "Male"), :])

13
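
The combination of masks can be tried on a small hand-made data frame. In this sketch the name `frame` and its values are illustrative. Each condition needs its own parentheses when written inline, because `&` and `|` bind more tightly than comparisons; `DataFrame.query()` is an alternative spelling of the same filter.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Male", "Female"],
        "Weight": [118.0, None, 172.0, 153.0],  # None becomes NaN
    }
)

heavy = frame.Weight > 150  # comparisons with NaN evaluate to False
male = frame.Gender == "Male"
print(heavy.tolist())  # [False, False, True, True]

print(len(frame.loc[heavy & male, :]))  # 1
print(len(frame.loc[heavy | male, :]))  # 3
print(len(frame.loc[~heavy, :]))  # 2

# the same filter as `heavy & male`, written as a query string
print(len(frame.query("Weight > 150 and Gender == 'Male'")))  # 1
```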

## More about Pandas 

Pandas offers data structures and operations for manipulating numerical tables and time series:

* A fast and efficient DataFrame object for data manipulation with integrated indexing;
* Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
* Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
* Flexible reshaping and pivoting of data sets;
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
* Columns can be inserted and deleted from data structures for size mutability;
* Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
* High performance merging and joining of data sets;
* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
* Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
* Highly optimized for performance, with critical code paths written in Cython or C.
* Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

## Further reading

* https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
* https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
* https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html
* https://pandas.pydata.org/
* Wes McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython 2nd Edition (2018); ISBN13: 9781491957660, ISBN10: 1491957662