## Lesson 10 - Numpy, Pandas, and Matplotlib Crash Course

Here we introduce Numpy, Pandas, and Matplotlib. Numpy is the core numerical computing package in Python, and its core type is `ndarray`. Pandas provides DataFrames (tables, much like R data frames) and Series (columns of a DataFrame) with powerful SQL-like queries. Matplotlib is a plotting package with a MATLAB-style syntax.

### Readings

* Pratik: [Introduction to Numpy and Pandas](https://cloudxlab.com/blog/numpy-pandas-introduction/)

### Table of Contents

* [Numpy](#numpy)
* [Pandas](#pandas)
* [Matplotlib](#matplotlib)

<a id="numpy"></a>

### Numpy

Today we will work with several new packages and the data types they provide: the Numpy `ndarray` and the Pandas `Series` and `DataFrame`. These data types have many of the properties of `list` but are much more powerful.

The `numpy` package (module) is used in almost all numerical computation using Python. It provides high-performance vector, matrix, and higher-dimensional data structures for Python. It is implemented in C and Fortran, so when calculations are vectorized (formulated with vectors and matrices), performance is very good.

To use `numpy`, you need to import the module:

In [80]:
import numpy as np

#### Creating Numpy arrays

There are a number of ways to initialize new Numpy arrays, for example:

* converting Python lists or tuples
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, etc.
* reading data from files

##### From lists

For example, to create new vector and matrix arrays from Python lists we can use the `numpy.array` function:

In [81]:
# a vector: the argument to the array function is a Python list
v = np.array([1,2,3,4])
v

array([1, 2, 3, 4])

In [82]:
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M

array([[1, 2],
       [3, 4]])

The `v` and `M` objects are both of the type `ndarray` that the `numpy` module provides.

In [83]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The difference between the `v` and `M` arrays is only their shapes. We can get information about the shape of an array by using the `ndarray.shape` property.

In [84]:
v.shape

(4,)

In [85]:
M.shape

(2, 2)

The number of elements in the array is available through the `ndarray.size` property:

In [86]:
M.size

4

Equivalently, we could use the functions `numpy.shape` and `numpy.size`:

In [87]:
np.shape(M)

(2, 2)

In [88]:
np.size(M)

4

So far the `numpy.ndarray` looks a lot like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? 

There are several reasons:

* Python lists are very general. They can contain any kind of object and are dynamically typed. They do not support mathematical operations such as matrix and dot products. Implementing such operations for Python lists would not be very efficient because of the dynamic typing.
* Numpy arrays are **statically typed** and **homogeneous**. The type of the elements is determined when the array is created.
* Numpy arrays are memory efficient.
* Because of the static typing, mathematical functions such as multiplication and addition of `numpy` arrays can be implemented efficiently in a compiled language (C and Fortran are used).
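
The points above can be seen in a minimal sketch. The names `mixed` and `a` are made up for illustration, and the `dtype` is set explicitly so the byte counts do not depend on the platform default.

```python
import numpy as np

# A Python list can hold objects of any type.
mixed = [1, "two", 3.0]

# A Numpy array has a single element type, fixed at creation.
a = np.array([1, 2, 3], dtype=np.int64)

# Assigning a float does not change the dtype;
# the value is truncated to fit the existing integer type.
a[0] = 2.9
print(a.tolist(), a.dtype)    # [2, 2, 3] int64

# Every element occupies the same fixed number of bytes.
print(a.itemsize, a.nbytes)   # 8 24
```

The silent truncation is worth remembering: a homogeneous array never widens its type on assignment.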

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

In [89]:
M.dtype

dtype('int64')

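We get an error (a `ValueError`) if we try to assign a value that cannot be converted to the array's type, such as a string to an integer array. The line below is commented out so that the notebook keeps running: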

In [90]:
#M[0,0] = 'hello'
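
To see the error without stopping the notebook, the assignment can be wrapped in `try`/`except`. This is a small sketch; the `raised` flag is a hypothetical helper added only to record what happened.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

raised = False
try:
    M[0, 0] = 'hello'   # a string cannot be converted to an integer
except ValueError as err:
    raised = True
    print("assignment failed:", err)

# The array is unchanged after the failed assignment.
print(M.tolist())       # [[1, 2], [3, 4]]
```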

In [91]:
M[0,0] = 5

In [92]:
M

array([[5, 2],
       [3, 4]])

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument: 

In [93]:
N = np.array([[1, 2], [3, 4]], dtype=complex)
N

array([[1.+0.j, 2.+0.j],
       [3.+0.j, 4.+0.j]])

Common types that can be used with `dtype` are: `int`, `float`, `complex`, `bool`, and `object` (arbitrary Python objects, such as strings).

We can also explicitly define the bit size of the data types, for example: `int64`, `int16`, `float128`, `complex128`. Note that `float128` is not available on every platform.
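
As a brief illustration of what the bit size means in practice, the sketch below inspects element sizes and converts between types with `astype`. The array names are made up for this example.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int16)
print(small.itemsize)                       # 2 (bytes per element)

# astype returns a new array converted to the requested type.
as_float = small.astype(np.float32)
print(as_float.dtype, as_float.itemsize)    # float32 4

# The representable range depends on the bit size.
info = np.iinfo(np.int16)
print(info.min, info.max)                   # -32768 32767
```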

#### Using array-generating functions

For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead we can use one of the many functions in `numpy` that generate arrays of different forms. Some of the more common are:

In [94]:
# create a range (the end value is not included)
x = np.arange(-1, 1, 0.1) # arguments: start, stop, step
x

array([-1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
       -6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
       -2.00000000e-01, -1.00000000e-01, -2.22044605e-16,  1.00000000e-01,
        2.00000000e-01,  3.00000000e-01,  4.00000000e-01,  5.00000000e-01,
        6.00000000e-01,  7.00000000e-01,  8.00000000e-01,  9.00000000e-01])

In [95]:
# dtype is determined automatically unless specified
x.dtype

dtype('float64')

In [96]:
# range of integers
np.arange(0, 10, 1) # arguments: start, stop, step

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [97]:
# specifying dtype as float
np.arange(0, 10, 1, dtype=float) # arguments: start, stop, step

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [98]:
# using linspace, both end points ARE included
np.linspace(0, 10, 25) # arguments: start, stop, N

array([ 0.        ,  0.41666667,  0.83333333,  1.25      ,  1.66666667,
        2.08333333,  2.5       ,  2.91666667,  3.33333333,  3.75      ,
        4.16666667,  4.58333333,  5.        ,  5.41666667,  5.83333333,
        6.25      ,  6.66666667,  7.08333333,  7.5       ,  7.91666667,
        8.33333333,  8.75      ,  9.16666667,  9.58333333, 10.        ])
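
The difference between the two functions is easiest to see side by side on a short interval. In this sketch, `arange` takes a step and excludes the stop value, while `linspace` takes a number of points and includes it unless `endpoint=False` is passed.

```python
import numpy as np

a = np.arange(0, 1, 0.25)                    # step given, stop excluded
b = np.linspace(0, 1, 5)                     # point count given, stop included
c = np.linspace(0, 1, 4, endpoint=False)     # stop excluded on request

print(a.tolist())   # [0.0, 0.25, 0.5, 0.75]
print(b.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(c.tolist())   # [0.0, 0.25, 0.5, 0.75]
```

When the step is not an integer, `linspace` is usually the safer choice because the number of elements is explicit.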

In [99]:
# similar to meshgrid in MATLAB
x, y = np.mgrid[0:5, 0:5] 

In [100]:
x

array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])

In [101]:
y

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

In [102]:
# uniform random numbers in the half-open interval [0, 1)
np.random.rand(5,5)

array([[0.35062758, 0.09761739, 0.78282326, 0.71815315, 0.82462767],
       [0.16488478, 0.53955174, 0.20238654, 0.93305392, 0.44528734],
       [0.96857042, 0.45117028, 0.84637065, 0.0161692 , 0.48486676],
       [0.5279843 , 0.5301653 , 0.63213972, 0.91407492, 0.23037769],
       [0.01539584, 0.5583002 , 0.39442388, 0.93410172, 0.65490083]])

In [103]:
# standard normal distributed random numbers
np.random.randn(5,5)

array([[-1.95608368e+00, -2.69630594e-02, -8.83485841e-02,
         6.04057527e-01,  1.00087493e+00],
       [-4.86748082e-01, -5.18614394e-01, -5.31393505e-01,
         2.02550939e+00,  1.07701981e+00],
       [ 1.37299514e+00,  2.92767566e-01, -1.55911649e+00,
         6.19649723e-01,  2.65518155e-01],
       [-4.17239294e-01,  5.97883866e-01, -1.79994238e-01,
        -2.84079583e+00,  2.37613248e-02],
       [ 7.69525400e-01, -5.15507619e-01,  3.63028810e-04,
         1.77540499e+00,  1.02863307e+00]])
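
The random values above differ on every run. When reproducible numbers are wanted, one option is Numpy's `Generator` interface, created with `np.random.default_rng`. This is a sketch of an alternative, not what the cells above use, and the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
u = rng.random((5, 5))             # uniform on [0, 1)
z = rng.standard_normal((5, 5))    # standard normal

# A new generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(u, rng_again.random((5, 5))))   # True
```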

In [104]:
# diagonal matrix
np.diag([1,2,3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

In [105]:
# zeros
np.zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [106]:
# ones
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [107]:
# ones as int
np.ones((3,3), dtype=int)

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

#### Indexing

We can index elements in an array using square brackets and indices:

In [108]:
# v is a vector, and has only one dimension, taking one index
v[0]

1

In [109]:
# M is a matrix, or a 2 dimensional array, taking two indices 
M[1,1]

4

In [110]:
# If we omit an index of a multidimensional array, we get the whole row (or, in general, an (N-1)-dimensional array)
M[1]

array([3, 4])

The same thing can be achieved by using `:` in place of an index:

In [111]:
M[1,:] # row 1

array([3, 4])

In [112]:
M[:,1] # column 1

array([2, 4])

We can assign new values to elements in an array using indexing:

In [113]:
M[0,0] = -1
M

array([[-1,  2],
       [ 3,  4]])

In [114]:
# also works for rows and columns
M[0,:] = 0
M[:,1] = -2

In [115]:
M

array([[ 0, -2],
       [ 3, -2]])

#### Index slicing

Index slicing is the technical name for the syntax `M[lower:upper:step]` to extract part of an array:

In [116]:
A = np.array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [117]:
A[1:3]

array([2, 3])

Array slices are *mutable*: if a slice is assigned a new value, the original array from which it was extracted is modified:

In [118]:
A[1:3] = [-2,-3]
A

array([ 1, -2, -3,  4,  5])
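
This happens because a basic slice is a *view* onto the original data rather than a copy. A short sketch follows; the names `view` and `independent` are made up for illustration.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])

view = A[1:3]                  # a slice shares memory with A
view[0] = 100
print(A.tolist())              # [1, 100, 3, 4, 5]

independent = A[1:3].copy()    # an explicit copy owns its own data
independent[0] = -1
print(A.tolist())              # [1, 100, 3, 4, 5] (unchanged)
```

Call `.copy()` whenever a slice will be modified but the original must stay intact.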

We can omit any of the three parameters in `M[lower:upper:step]`:

In [119]:
A[::] # lower, upper, step all take the default values

array([ 1, -2, -3,  4,  5])

In [120]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array

array([ 1, -3,  5])

In [121]:
A[:3] # first three elements

array([ 1, -2, -3])

In [122]:
A[3:] # elements from index 3

array([4, 5])

Negative indices count from the end of the array (positive indices count from the beginning):

In [123]:
A = np.array([1,2,3,4,5])

In [124]:
A[-1] # the last element in the array

5

In [125]:
A[-3:] # the last three elements

array([3, 4, 5])

Index slicing works exactly the same way for multidimensional arrays:

In [126]:
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [127]:
# a block from the original array
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

In [128]:
# strides
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

#### Fancy indexing

Fancy indexing is the name for when an array or list is used in place of an index:

In [129]:
row_indices = [1, 2, 3]
A[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [130]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]

array([11, 22, 34])
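
Note that passing two index lists pairs them up element by element, which is why the result above has three values rather than a 3x3 block. If the full block is wanted, one option is `np.ix_`, as in this sketch that rebuilds the same `A`:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
rows = [1, 2, 3]
cols = [1, 2, -1]

# Paired indexing: picks A[1, 1], A[2, 2], A[3, -1]
print(A[rows, cols].tolist())    # [11, 22, 34]

# np.ix_ builds an open mesh, selecting every row/column combination
block = A[np.ix_(rows, cols)]
print(block.tolist())            # [[11, 12, 14], [21, 22, 24], [31, 32, 34]]
```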

We can also use index *masks*: if the index mask is a Numpy array of data type `bool`, then an element is selected (`True`) or not (`False`) depending on the value of the index mask at the position of each element:

In [131]:
B = np.array([n for n in range(5)])
B

array([0, 1, 2, 3, 4])

In [132]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]

array([0, 2])

In [133]:
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]

array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:

In [134]:
x = np.arange(0, 10, 0.5)
x

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
       6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])

In [135]:
# want values of x that are at least 5 and have no decimal component
mask = (x >= 5) & (x % 1 == 0)
mask

array([False, False, False, False, False, False, False, False, False,
       False,  True, False,  True, False,  True, False,  True, False,
        True, False])

In [136]:
x[mask]

array([5., 6., 7., 8., 9.])
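
Masks can also be used on the left-hand side of an assignment, and `np.where` chooses between two alternatives element by element. A small sketch on the same `x`; the threshold values and the labels are arbitrary choices for illustration.

```python
import numpy as np

x = np.arange(0, 10, 0.5)

# Assign through a mask: cap every value above 8 at 8 (on a copy)
capped = x.copy()
capped[capped > 8] = 8
print(capped.max())                        # 8.0

# np.where picks element-wise between two alternatives
labels = np.where(x >= 5, "high", "low")
print(int((labels == "high").sum()))       # 10
```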

#### Linear algebra

Vectorizing code is the key to writing efficient numerical calculations with Python/Numpy. That means that as much of a program as possible should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.
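
As a minimal sketch of what vectorizing means, the example below computes a sum of squares twice: once with an explicit Python loop and once with a single `np.dot` call. The array `values` is made up for illustration. Both give the same result, but the second runs inside compiled code.

```python
import numpy as np

values = np.arange(1000)

# Explicit Python loop: one interpreter step per element
total_loop = 0
for value in values:
    total_loop += value * value

# Vectorized equivalent: a single dot product
total_vec = np.dot(values, values)

print(int(total_loop) == int(total_vec))   # True
```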

In [137]:
v1 = np.arange(0, 5)
v1

array([0, 1, 2, 3, 4])

In [138]:
v1 * 2

array([0, 2, 4, 6, 8])

In [139]:
v1 + 2

array([2, 3, 4, 5, 6])

In [140]:
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [141]:
np.dot(A, A)

array([[ 300,  310,  320,  330,  340],
       [1300, 1360, 1420, 1480, 1540],
       [2300, 2410, 2520, 2630, 2740],
       [3300, 3460, 3620, 3780, 3940],
       [4300, 4510, 4720, 4930, 5140]])

In [142]:
v1.reshape(1,5)

array([[0, 1, 2, 3, 4]])

In [143]:
np.dot(A, v1)

array([ 30, 130, 230, 330, 430])

In [144]:
np.dot(v1, v1)

30

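Alternatively, we can cast the array objects to the type `matrix`. This makes the standard arithmetic operators follow matrix algebra; in particular, `*` becomes matrix multiplication rather than element-wise multiplication. Note that the Numpy documentation no longer recommends the `matrix` class and suggests using regular arrays instead. There is a lot of matrix math that we won't cover here.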

In [145]:
M = np.matrix(A)
v = np.matrix(v1).T # make it a column vector

In [146]:
M

matrix([[ 0,  1,  2,  3,  4],
        [10, 11, 12, 13, 14],
        [20, 21, 22, 23, 24],
        [30, 31, 32, 33, 34],
        [40, 41, 42, 43, 44]])

In [147]:
v

matrix([[0],
        [1],
        [2],
        [3],
        [4]])

In [148]:
M*M

matrix([[ 300,  310,  320,  330,  340],
        [1300, 1360, 1420, 1480, 1540],
        [2300, 2410, 2520, 2630, 2740],
        [3300, 3460, 3620, 3780, 3940],
        [4300, 4510, 4720, 4930, 5140]])

In [149]:
M*v

matrix([[ 30],
        [130],
        [230],
        [330],
        [430]])
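
With regular arrays, the same products can be written with the `@` operator, which performs matrix multiplication without converting to `matrix`. A short sketch that rebuilds `A` and `v1` from the cells above:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

print((A @ v1).tolist())                     # [30, 130, 230, 330, 430]
print(int(v1 @ v1))                          # 30
print(np.array_equal(A @ A, np.dot(A, A)))   # True

# With plain arrays, * stays element-wise
print((v1 * v1).tolist())                    # [0, 1, 4, 9, 16]
```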

#### Data computations

In [150]:
np.mean(v1)

2.0

In [151]:
np.std(v1), np.var(v1)

(1.4142135623730951, 2.0)

In [152]:
v1.min()

0

In [153]:
v1.max()

4

In [154]:
v1.sum() # the array method; the built-in sum() also works on 1-D arrays

10
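
For two-dimensional arrays, these reductions accept an `axis` argument that selects the direction to reduce along. A sketch using the same 5x5 `A` as earlier:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

print(int(A.sum()))              # 550 (all elements)
print(A.sum(axis=0).tolist())    # column sums: [100, 105, 110, 115, 120]
print(A.mean(axis=1).tolist())   # row means: [2.0, 12.0, 22.0, 32.0, 42.0]
```

Here `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).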

#### Iterating over array elements

In [155]:
for element in v1:
    print(element)

0
1
2
3
4


In [156]:
M = np.array([[1,2], [3,4]])
M

array([[1, 2],
       [3, 4]])

In [157]:
for row in M:
    print("row", row)    
    for element in row:
        print(element)

row [1 2]
1
2
row [3 4]
3
4


In [158]:
for row_idx, row in enumerate(M):
    print("row_idx", row_idx, "row", row)    
    for col_idx, element in enumerate(row):
        print("col_idx", col_idx, "element", element) 
        # modify the matrix M: square each element
        M[row_idx, col_idx] = element ** 2

row_idx 0 row [1 2]
col_idx 0 element 1
col_idx 1 element 2
row_idx 1 row [3 4]
col_idx 0 element 3
col_idx 1 element 4


In [159]:
# each element in M is now squared
M

array([[ 1,  4],
       [ 9, 16]])
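
The nested `enumerate` loops can be written more compactly with `np.ndenumerate`, and for a simple element-wise operation like squaring no loop is needed at all. A sketch of both, where `squared` is a made-up name:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

# np.ndenumerate yields ((row, col), value) pairs for every element
for (row_idx, col_idx), element in np.ndenumerate(M):
    print("row_idx", row_idx, "col_idx", col_idx, "element", element)

# The element-wise square as a single vectorized expression
squared = M ** 2
print(squared.tolist())   # [[1, 4], [9, 16]]
```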

<a id="pandas"></a>

### Pandas

#### What is Pandas?

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

#### Library features

* DataFrame object for data manipulation with integrated indexing
* Tools for reading and writing data between in-memory data structures and different file formats
* Data alignment and integrated handling of missing data
* Reshaping and pivoting of data sets
* Label-based slicing, fancy indexing, and subsetting of large data sets
* Data structure column insertion and deletion
* Group-by engine allowing split-apply-combine operations on data sets
* Data set merging and joining
* Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure
* Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging

The library is highly optimized for performance, with critical code paths written in Cython or C.

#### Download data

I copied data from http://www.sccoos.org/data/autoshorestations/autoshorestations.php?study=Scripps%20Pier and pasted it into Excel, then saved it as a CSV file. Download [scripps_pier_20151110.csv](https://raw.githubusercontent.com/cuttlefishh/python-for-data-analysis/master/data/scripps_pier_20151110.csv) from GitHub and save it to a directory called `data` at the same level as `lessons`.

#### Install packages

Install pandas and matplotlib using conda if you haven't already. If you're not sure whether they are installed, type `conda list` at a terminal prompt.

```
conda install pandas
conda install matplotlib
```

#### Import modules

In [160]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Read data from CSV

In [161]:
data1 = pd.read_csv('../data/scripps_pier_20151110.csv', index_col=None, header=0)

In [None]:
data1.head()

In [None]:
data2 = pd.read_csv('../data/scripps_pier_20151110.csv', index_col=0, header=0)

In [None]:
data2.head()

In [None]:
data2.describe()
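
The effect of `index_col` can be tried without the downloaded file by reading a small CSV from a string. In this sketch the rows are made-up values and the date format and column order are assumptions; only the column names follow the ones used later in this lesson.

```python
import io
import pandas as pd

# Made-up rows for illustration only
csv_text = """Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
11/10/15 1:42,22.3,3.7,33.1,19.5
11/10/15 1:35,22.1,3.6,33.1,19.5
11/10/15 1:29,21.9,3.6,33.2,19.4
"""

# index_col=None: pandas adds a default integer index
with_default_index = pd.read_csv(io.StringIO(csv_text), index_col=None, header=0)
# index_col=0: the first column (Date) becomes the index
with_date_index = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0)

print(with_default_index.index.tolist())   # [0, 1, 2]
print(with_date_index.index.name)          # Date
print(with_date_index.shape)               # (3, 4)
```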

#### Indexing in pandas

The two main ways to index a Pandas DataFrame are:

* `loc` works on labels in the index.
* `iloc` works on the positions in the index (so it takes integer positions rather than labels).
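
The distinction is clearest on a DataFrame whose index labels are not integers. A minimal sketch with a made-up table; the string labels `a`, `b`, `c` and the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"temp (C)": [19.5, 19.4, 19.2]},
    index=["a", "b", "c"],        # hypothetical string labels
)

print(df.loc["b", "temp (C)"])    # 19.4, selected by label
print(df.iloc[1, 0])              # 19.4, selected by position

# loc slices include the end label; iloc slices exclude the end position
print(len(df.loc["a":"b"]))       # 2
print(len(df.iloc[0:1]))          # 1
```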

#### With Date as the index column (data2)

In [None]:
data2.iloc[0]

In [None]:
data2['temp (C)'].head(10)

#### With no index column (data1)

In [None]:
data1.iloc[0]

In [None]:
data1.loc[0]

In [None]:
data1['Date'].head()

In [None]:
data1.Date.head()

In [None]:
data1.iloc[:,0].head()

#### Convert date/time to timestamp object

In [None]:
time = pd.to_datetime(data1.iloc[:,0])
time.head()

In [None]:
type(time)
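
Once a column has been converted, its date and time components are available through the `.dt` accessor. A small sketch with made-up timestamp strings, since the exact format in the CSV file is not shown here:

```python
import pandas as pd

raw = pd.Series(["2015-11-10 01:42", "2015-11-10 01:35"])   # made-up timestamps
stamps = pd.to_datetime(raw)

print(stamps.dt.hour.tolist())          # [1, 1]
print(stamps.dt.minute.tolist())        # [42, 35]
print(type(stamps.iloc[0]).__name__)    # Timestamp
```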

<a id="matplotlib"></a>

### Matplotlib

#### Plot a single variable vs. time

In [None]:
fig = plt.figure()
plt.plot(time, data1['chl (ug/L)'])
plt.xlabel('Time')
plt.ylabel('Chlorophyll')
fig.savefig('scripps_pier_Chlorophyll.pdf')

#### Plot each response variable in a loop

In [None]:
data1.rename(columns={'chl (ug/L)': 'Chlorophyll', 'pres (dbar)': 'Pressure', 
                      'sal (PSU)': 'Salinity', 'temp (C)': 'Temperature'}, inplace=True)

In [None]:
data1.head()

In [None]:
data1.columns

In [None]:
for col in data1.columns:
    if col != 'Date':
        fig = plt.figure()
        plt.plot(time, data1[col])
        plt.xlabel('Time')
        plt.ylabel(col)
        fig.savefig('scripps_pier_%s.pdf' % col)
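
An alternative to one figure per variable is to stack the variables as subplots that share a time axis. The sketch below uses made-up values and only two of the renamed columns. The `matplotlib.use("Agg")` line is an assumption for running as a script without a display and should be omitted in a notebook.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
time = pd.to_datetime(["2015-11-10 01:00", "2015-11-10 02:00", "2015-11-10 03:00"])
data = pd.DataFrame({"Chlorophyll": [22.3, 22.1, 21.9],
                     "Temperature": [19.5, 19.4, 19.2]})

# One row of axes per variable, all sharing the same x axis
fig, axes = plt.subplots(nrows=len(data.columns), ncols=1, sharex=True)
for ax, col in zip(axes, data.columns):
    ax.plot(time, data[col])
    ax.set_ylabel(col)
axes[-1].set_xlabel("Time")
fig.tight_layout()
```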

#### Plot all response variables together

In [None]:
data2.head()

In [None]:
data2.index = pd.to_datetime(data2.index)

In [None]:
data2.head()

In [None]:
# DataFrame.plot creates its own figure and returns the Axes
ax = data2.plot()
ax.legend(loc='best')

### P.S. About that name...

The name "Pandas" actually has nothing to do with the animal. It is derived from the term "panel data", an econometrics term for multidimensional structured data sets.

![pandas](http://wdy.h-cdn.co/assets/16/05/980x490/landscape-1454612525-baby-pandas.jpg)