<div style="text-align: center">
    <div style="font-size: xxx-large ; font-weight: 900 ; color: rgba(0 , 0 , 0 , 0.8) ; line-height: 100%">
        NumPy &amp; Pandas
    </div>
    <div style="font-size: x-large ; padding-top: 20px ; color: rgba(0 , 0 , 0 , 0.5)">
        Data Processing
    </div>
</div>

**NumPy** is a library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.

- For many tasks you could use Python's lists and numerical data types (`int`, `float`). In practice, however, pure Python can be quite slow when you need to compute many values.
- NumPy provides syntax and commands that are very close to how you would write mathematical equations, which makes it easier to translate equations into code. It is also often much faster than pure Python for larger datasets.
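
To illustrate the two points above, here is a minimal sketch that computes a sum of squares once with a plain Python list and once with a NumPy array. The input values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration.
values = [1.0, 2.0, 3.0, 4.0]

# Pure Python: loop over the list element by element.
squares_list = [v ** 2 for v in values]
sum_list = sum(squares_list)

# NumPy: the same computation written as a single array expression.
array = np.array(values)
sum_numpy = (array ** 2).sum()

print(sum_list, sum_numpy)
```

Both variants give the same result; the NumPy version reads closer to the mathematical expression and avoids the explicit Python loop.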

**Pandas** is a library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It builds on NumPy.

- Pandas is especially useful when you are handling tabular data (with rows and columns, such as a `.csv` file).
- It also works nicely with the plotting software `seaborn` (next lecture).

## Importing NumPy and Pandas

You will find many examples for the usage of NumPy and Pandas online. Often you will encounter examples such as:
```python
import numpy as np
np.max()
```
or
```python
import pandas as pd
pd.read_csv()
```

It is therefore good practice to always import NumPy with `import numpy as np` and Pandas with `import pandas as pd`.

In [1]:
import numpy as np
import pandas as pd

## Loading text files with NumPy

Documentation: [numpy.loadtxt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html)

Load data from a text file.

Each row in the text file must have the same number of values.

The following lists all arguments that `loadtxt` accepts:
```python
numpy.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None)
```

We will use `fname`, `unpack`, and `usecols`:

`fname` : file, str, or pathlib.Path

    File or filename to read.  If the filename extension is ``.gz`` or ``.bz2``, the file is first decompressed.

`unpack` : bool

    If True, the returned array is transposed, so that arguments may be unpacked using x, y, z = loadtxt(...). When used with a structured data-type, arrays are returned for each field. Default is False.


`usecols` : int or sequence

    Which columns to read, with 0 being the first. For example, usecols = (1,4,5) will extract the 2nd, 5th and 6th columns. The default, None, results in all columns being read.

    When a single column has to be read it is possible to use an integer instead of a tuple. E.g usecols = 3 reads the fourth column the same way as usecols = (3,) would.
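
Two further arguments from the signature above, `delimiter` and `skiprows`, are not used in this lecture. The following sketch shows how they might be combined with `usecols` and `unpack`. It reads from an in-memory `io.StringIO` object instead of a file, and the comma-separated data is hypothetical.

```python
import io
import numpy as np

# Hypothetical comma-separated data with an uncommented header line.
text = io.StringIO(
    "time,ns,ew,ud\n"
    "0.0,1.124,148.1,104.1\n"
    "1.0,1.224,149.1,105.1\n"
)

# skiprows=1 skips the header line; usecols picks the 2nd and 4th columns.
ns, ud = np.loadtxt(text, delimiter=",", skiprows=1, usecols=(1, 3), unpack=True)
print(ns)
print(ud)
```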


**What does the raw text file look like?**

This is a simple Python way of reading all lines from a file.

In [2]:
with open('lecture_12/station_data.txt') as f:
    for line in f.readlines():
        print(line)

# time ns ew ud

1574331571.7485132 1.124 148.1 104.1

1574331572.7485132 1.224 149.1 105.1

1574331573.7485132 1.324 150.1 106.1

1574331574.7485132 1.524 151.1 107.1


**Loading all columns**

In [3]:
station_data = np.loadtxt('lecture_12/station_data.txt')
station_data

array([[1.57433157e+09, 1.12400000e+00, 1.48100000e+02, 1.04100000e+02],
       [1.57433157e+09, 1.22400000e+00, 1.49100000e+02, 1.05100000e+02],
       [1.57433157e+09, 1.32400000e+00, 1.50100000e+02, 1.06100000e+02],
       [1.57433157e+09, 1.52400000e+00, 1.51100000e+02, 1.07100000e+02]])

**Note**: "e+09" is scientific notation and means "× 10^9".

**Loading all columns (unpack)**

In [4]:
time, ns, ew, ud = np.loadtxt('lecture_12/station_data.txt', unpack=True)
print(f"Time: {time}")
print(f"ns: {ns}")
print(f"ew: {ew}")
print(f"ud: {ud}")

Time: [1.57433157e+09 1.57433157e+09 1.57433157e+09 1.57433157e+09]
ns: [1.124 1.224 1.324 1.524]
ew: [148.1 149.1 150.1 151.1]
ud: [104.1 105.1 106.1 107.1]


**Loading 2 columns**

In [5]:
ns, ud = np.loadtxt('lecture_12/station_data.txt', unpack=True, usecols=(1, 3))
print(f"ns: {ns}")
print(f"ud: {ud}")

ns: [1.124 1.224 1.324 1.524]
ud: [104.1 105.1 106.1 107.1]


**Loading 1 column**

In [6]:
time = np.loadtxt('lecture_12/station_data.txt', unpack=True, usecols=0)
print(f"Time: {time}")

Time: [1.57433157e+09 1.57433157e+09 1.57433157e+09 1.57433157e+09]


## NumPy Data Types
Because NumPy stores data differently from plain Python, it has its own data types (they are similar to the Python ones).

NumPy commands that create arrays therefore support the `dtype` argument, which specifies the type of the array's elements.

There are more types than the ones shown below, but these are the most useful.

In [7]:
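# Use the Python built-in types as dtypes: older aliases such as
# np.int and np.float were removed in NumPy 1.24.
ints = np.array([1,2,3], dtype=int)
floats = np.array([1,2,3], dtype=float)
strings = np.array(['a', 'b', 'c'], dtype=str)
bools = np.array([True, False], dtype=bool)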
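
As a small sketch of how the `dtype` of an existing array can be inspected and changed, the example below uses the `dtype` attribute and the `astype` method. The input values are hypothetical.

```python
import numpy as np

floats = np.array([1.7, 2.2, -0.5], dtype=np.float64)
print(floats.dtype)  # float64

# astype returns a new array; converting to integers drops the fractional part.
ints = floats.astype(np.int64)
print(ints)        # [1 2 0]
print(ints.dtype)  # int64
```

Note that the conversion truncates instead of rounding, so `1.7` becomes `1`.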

## Working with NumPy arrays

**Creating an array (from a list)**

In [8]:
list_data_1d = [1, 2, 5, 4]
np.array(list_data_1d)

array([1, 2, 5, 4])

**Creating multi-dimensional arrays (i.e. a matrix) (from a list of lists)**

In [9]:
list_data_2d = [
    [1,2,3,4],
    [4,5,6,7]
]
np.array(list_data_2d)

array([[1, 2, 3, 4],
       [4, 5, 6, 7]])

**Getting the shape of an array**

Useful to check input shapes, or when iterating through the data manually

In [10]:
print(np.array(list_data_1d).shape)
print(np.array(list_data_2d).shape)

(4,)
(2, 4)


**Creating a range of values**

In [11]:
# Create a range of values (similar to python's range())
print(np.arange(10))
print(np.arange(5,10))
print(np.arange(0, 5, 0.5)) # But allows decimal steps

[0 1 2 3 4 5 6 7 8 9]
[5 6 7 8 9]
[0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5]


**Creating arrays of ones or zeros**

In [12]:
np.ones([2,5])

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [13]:
np.zeros([5,2])

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

**Creating a random array of a specific size**

Among others, `np.random` offers `normal` and `uniform` distributions.

In [14]:
random = np.random.uniform(size=[2,3,5])
random

array([[[0.79431066, 0.35815881, 0.53235965, 0.39973807, 0.73490048],
        [0.25634318, 0.31949962, 0.46617447, 0.70935159, 0.13732785],
        [0.25856235, 0.59593606, 0.7027836 , 0.9836858 , 0.51018931]],

       [[0.97593766, 0.56500033, 0.27418398, 0.49473663, 0.50331682],
        [0.4634583 , 0.83537112, 0.31873157, 0.86070633, 0.54592973],
        [0.02528392, 0.44929906, 0.27123426, 0.43028796, 0.40986868]]])

In [15]:
random.shape

(2, 3, 5)

**Iterating through the data**

In [16]:
for row in random:
    print(row)

[[0.79431066 0.35815881 0.53235965 0.39973807 0.73490048]
 [0.25634318 0.31949962 0.46617447 0.70935159 0.13732785]
 [0.25856235 0.59593606 0.7027836  0.9836858  0.51018931]]
[[0.97593766 0.56500033 0.27418398 0.49473663 0.50331682]
 [0.4634583  0.83537112 0.31873157 0.86070633 0.54592973]
 [0.02528392 0.44929906 0.27123426 0.43028796 0.40986868]]


In [17]:
for row in random:
    for value in row:
        print(value)

[0.79431066 0.35815881 0.53235965 0.39973807 0.73490048]
[0.25634318 0.31949962 0.46617447 0.70935159 0.13732785]
[0.25856235 0.59593606 0.7027836  0.9836858  0.51018931]
[0.97593766 0.56500033 0.27418398 0.49473663 0.50331682]
[0.4634583  0.83537112 0.31873157 0.86070633 0.54592973]
[0.02528392 0.44929906 0.27123426 0.43028796 0.40986868]


**Array slicing**

Similar to Python lists, but more flexible.

In [18]:
array = np.array([
    [1,2,3,4],
    [5,6,7,8],
    [9,10,11,12]
])
array

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [19]:
# You can slice each dimension (also called axis) individually,
# simply separate each axis by a ","
# array[first_dimension, second_dimension, third, ...]
# ":" selects everything along an axis (here: all rows),
# "1:3" selects the elements at positions 1 and 2 (here: columns)
array[:,1:3]

array([[ 2,  3],
       [ 6,  7],
       [10, 11]])
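
A few more slicing patterns, as a minimal sketch on the same 3×4 example array. The variable names are only illustrative.

```python
import numpy as np

array = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])

first_row = array[0]           # a single index selects one row
last_column = array[:, -1]     # negative indices count from the end
every_other = array[::2, ::2]  # start:stop:step works on every axis

print(first_row)    # [1 2 3 4]
print(last_column)  # [ 4  8 12]
print(every_other)  # [[ 1  3]
                    #  [ 9 11]]
```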

**Overriding values in an array**

In [20]:
# Individual values
array[0,0] = 100
array

array([[100,   2,   3,   4],
       [  5,   6,   7,   8],
       [  9,  10,  11,  12]])

In [21]:
# Or larger parts by specifying an array of matching size
array[2,:] = np.array([-1,-1,-1,-1])
array

array([[100,   2,   3,   4],
       [  5,   6,   7,   8],
       [ -1,  -1,  -1,  -1]])
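
Besides single values and arrays of matching size, a scalar can be assigned to a whole slice, and a boolean condition can select the elements to overwrite. A minimal sketch, starting again from the unmodified example array:

```python
import numpy as np

array = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])

# Assign one scalar to an entire column.
array[:, 0] = 0

# Overwrite every element that fulfils a condition.
array[array > 10] = -1

print(array)
# [[ 0  2  3  4]
#  [ 0  6  7  8]
#  [ 0 10 -1 -1]]
```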

### Math with arrays (*, /, -, +, ...).

**Multiply all values in an array with a scalar value**

In [22]:
np.arange(5) * 2

array([0, 2, 4, 6, 8])

Full list here: https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html

**Element-wise multiplication**

In [23]:
np.arange(5) * np.arange(5,10)

array([ 0,  6, 14, 24, 36])

**Matrix multiplication**

In [24]:
np.matmul(np.ones([4,2]), np.ones([2,3]))

array([[2., 2., 2.],
       [2., 2., 2.],
       [2., 2., 2.],
       [2., 2., 2.]])
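
The difference between element-wise and matrix multiplication is easier to see with small, non-trivial matrices. The sketch below also uses the `@` operator, which for 2-D arrays gives the same result as `np.matmul`. The matrices are hypothetical example values.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b  # multiplies matching entries
matrix = a @ b       # matrix product, same as np.matmul(a, b)

print(elementwise)  # [[ 5 12]
                    #  [21 32]]
print(matrix)       # [[19 22]
                    #  [43 50]]
```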

In [25]:
# min, max, mean, standard deviation
array = np.arange(10)
array.min(), array.max(), array.mean(), array.std()
# -> Alternatively: np.mean(array)

# If you have a matrix you can compute these along individual axes
array = np.array([
    [1,2,3,4],
    [5,6,7,8],
    [9,10,11,12]
])
print('Axis 0:', array.min(axis=0)) # axis=0: minimum of each column
print('Axis 1:', array.min(axis=1)) # axis=1: minimum of each row

Axis 0: [1 2 3 4]
Axis 1: [1 5 9]
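
The `axis` argument works the same way for other reductions. As a sketch on the same matrix, here are `sum` and `mean` along each axis:

```python
import numpy as np

array = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])

column_sums = array.sum(axis=0)  # collapses the rows: one value per column
row_sums = array.sum(axis=1)     # collapses the columns: one value per row
row_means = array.mean(axis=1)

print(column_sums)  # [15 18 21 24]
print(row_sums)     # [10 26 42]
print(row_means)    # [ 2.5  6.5 10.5]
```

A useful way to remember this: the axis you pass is the one that disappears from the shape of the result.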


## Saving and loading NumPy arrays

In [26]:
# Setup
times = np.array([1574331571.7485132, 1574331572.7485132, 1574331573.7485132, 1574331574.7485132])
ns = np.array([1.124, 1.224, 1.324, 1.524])
ew = np.array([148.1, 149.1, 150.1, 151.1])
ud = np.array([104.1, 105.1, 106.1, 107.1])

**Save a single array**

`.npy` is the NumPy file extension for single-array data. If you omit `.npy` from the path name, it is appended automatically.

In [27]:
np.save('lecture_12/station_data-times.npy', times)

**Save multiple arrays in a single file**

Sometimes you want to save multiple arrays. These are saved in a file ending with `.npz`, which is basically a ZIP file containing multiple `.npy` files.

Each `key` you specify while saving will be a `.npy` file in the archive.

In [28]:
np.savez_compressed(
    'lecture_12/station_data.npz',
    times=times,
    ns=ns,
    ew=ew,
    ud=ud
)

**Load NumPy arrays from disk**

In [29]:
loaded_array = np.load('lecture_12/station_data-times.npy')
print("Loaded:", loaded_array)

Loaded: [1.57433157e+09 1.57433157e+09 1.57433157e+09 1.57433157e+09]


When loading an archive the returned object can be used like a dictionary.

In [30]:
loaded_archive = np.load('lecture_12/station_data.npz')
print("Loaded:", loaded_archive)
print("Archive entries:", list(loaded_archive.keys()))
print("Times:", loaded_archive['times'])

Loaded: <numpy.lib.npyio.NpzFile object at 0x0000020EF3881278>
Archive entries: ['times', 'ns', 'ew', 'ud']
Times: [1.57433157e+09 1.57433157e+09 1.57433157e+09 1.57433157e+09]
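
A self-contained round trip can be sketched as follows. It writes to a temporary directory (an assumption made here so the example does not depend on the `lecture_12` folder) and opens the archive in a `with` block so that the file is closed again afterwards.

```python
import os
import tempfile
import numpy as np

ns = np.array([1.124, 1.224, 1.324, 1.524])
ud = np.array([104.1, 105.1, 106.1, 107.1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "station_data.npz")
    np.savez_compressed(path, ns=ns, ud=ud)

    # The context manager closes the archive when the block ends.
    with np.load(path) as archive:
        keys = sorted(archive.files)
        loaded_ns = archive["ns"]

print(keys)
print(np.array_equal(ns, loaded_ns))
```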


## Loading text files with Pandas

Documentation: [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

Read a comma-separated values (CSV) file into a DataFrame (a DataFrame is Pandas' way of storing tabular data).

Pandas is typically much faster than `np.loadtxt` for larger files.

The following lists some of the arguments that `read_csv` accepts:
```python
pandas.read_csv(filepath_or_buffer, sep=",", header="infer", names=None, usecols=None, skiprows=None, comment=None)
```

sep : str, default ‘,’

    Delimiter to use.

header : int, list of int, default ‘infer’

    Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, optional

    List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.

usecols : list-like, optional

    Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0].
    
skiprows : list-like, int, optional

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
    
comment : str, optional

    Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. 
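
The interplay of `comment`, the inferred header, and `usecols` with column names can be sketched with in-memory data. The text below is hypothetical and is read through `io.StringIO` instead of a file.

```python
import io
import pandas as pd

# Hypothetical CSV text: one comment line, one header line, two data rows.
text = io.StringIO(
    "# station export\n"
    "time,ns,ew,ud\n"
    "0.0,1.124,148.1,104.1\n"
    "1.0,1.224,149.1,105.1\n"
)

# The comment line is ignored, so the header is inferred from the second line.
# usecols then selects columns by the names found in that header.
df = pd.read_csv(text, comment="#", usecols=["ns", "ud"])
print(df)
```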

**Loading all columns from .txt**

In [31]:
pd.read_csv(
    'lecture_12/station_data.txt',
    sep=' ',
    header=None,
    comment='#',
    names=['time', 'ns', 'ew', 'ud']
)

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 0 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| 1 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


**Loading 2 columns from .txt**

In [32]:
pd.read_csv(
    'lecture_12/station_data.txt',
    sep=' ',
    usecols=(1, 3),
    comment='#',
    names=['ns', 'ud']
)

|   | ns | ud |
|---|----|----|
| 0 | 1.124 | 104.1 |
| 1 | 1.224 | 105.1 |
| 2 | 1.324 | 106.1 |
| 3 | 1.524 | 107.1 |


**Loading 1 column from .txt**

In [33]:
pd.read_csv(
    'lecture_12/station_data.txt',
    sep=' ',
    usecols=(0,),
    comment='#',
    names=['time']
)

|   | time |
|---|------|
| 0 | 1574332000.0 |
| 1 | 1574332000.0 |
| 2 | 1574332000.0 |
| 3 | 1574332000.0 |


**Loading all columns from .csv**

In [34]:
station_data = pd.read_csv('lecture_12/station_data.csv')
station_data

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 0 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| 1 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


## Working with Pandas DataFrame

**Creating a DataFrame manually (from a dict)**

In [35]:
data = {
    "time": [1574331571.7485132,1574331572.7485132,1574331573.7485132,1574331574.7485132],
    "ns":   [1.124,1.224,1.324,1.524],
    "ew":   [148.1,149.1,150.1,151.1],
    "ud":   [104.1,105.1,106.1,107.1]
}
pd.DataFrame(data) 

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 0 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| 1 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


**Creating a DataFrame manually (from a list of lists)**

In [36]:
data = [[1574331571.7485132,1.124,148.1,104.1],
        [1574331572.7485132,1.224,149.1,105.1],
        [1574331573.7485132,1.324,150.1,106.1],
        [1574331574.7485132,1.524,151.1,107.1]]
columns = ["times", "ns", "ew", "ud"]
pd.DataFrame(data, columns=columns) 

|   | times | ns | ew | ud |
|---|-------|----|----|----|
| 0 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| 1 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


**Creating a DataFrame manually (from a numpy array)**

In [37]:
data = np.array(
        [[1574331571.7485132,1.124,148.1,104.1],
        [1574331572.7485132,1.224,149.1,105.1],
        [1574331573.7485132,1.324,150.1,106.1],
        [1574331574.7485132,1.524,151.1,107.1]])
columns = ["times", "ns", "ew", "ud"]
pd.DataFrame(data, columns=columns) 

|   | times | ns | ew | ud |
|---|-------|----|----|----|
| 0 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| 1 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


**Specifying an INDEX column**

The index is basically a row label.

In [38]:
data = np.array(
        [[1574331571.7485132,1.124,148.1,104.1],
        [1574331572.7485132,1.224,149.1,105.1],
        [1574331573.7485132,1.324,150.1,106.1],
        [1574331574.7485132,1.524,151.1,107.1]])
columns = ["times", "ns", "ew", "ud"]
index = ["Station 1", "Station 2", "Station 3", "Station 4"]
station_data_with_index = pd.DataFrame(data, columns=columns, index=index) 
station_data_with_index

|   | times | ns | ew | ud |
|---|-------|----|----|----|
| Station 1 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| Station 2 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| Station 3 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| Station 4 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


**Selecting a single column**

In [39]:
station_data['time']

0    1.574332e+09
1    1.574332e+09
2    1.574332e+09
3    1.574332e+09
Name: time, dtype: float64

**Selecting multiple columns**

In [40]:
station_data[['time', 'ns']]

|   | time | ns |
|---|------|----|
| 0 | 1574332000.0 | 1.124 |
| 1 | 1574332000.0 | 1.224 |
| 2 | 1574332000.0 | 1.324 |
| 3 | 1574332000.0 | 1.524 |


**Selecting a single row**

The first, unlabeled column in the output is called the index.

In [41]:
# iloc = integer location (selects by position, not by label)
station_data.iloc[2]

time    1.574332e+09
ns      1.324000e+00
ew      1.501000e+02
ud      1.061000e+02
Name: 2, dtype: float64

**Selecting multiple rows**

In [42]:
station_data.iloc[1:4] # Note, this follows the same syntax as list slicing

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 1 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


**Selecting a row given an index label**

In [43]:
station_data_with_index.loc['Station 1']

times    1.574332e+09
ns       1.124000e+00
ew       1.481000e+02
ud       1.041000e+02
Name: Station 1, dtype: float64

**Selecting multiple rows given index labels**

In [44]:
station_data_with_index.loc[['Station 1', 'Station 2']]

|   | times | ns | ew | ud |
|---|-------|----|----|----|
| Station 1 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| Station 2 | 1574332000.0 | 1.224 | 149.1 | 105.1 |


**Selecting rows based on a condition**

You can use any comparison operator here (`>`, `<`, `>=`, `<=`, `==`, `!=`).

In [45]:
station_data.loc[station_data['ns'] > 1.224]

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |
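
Several conditions can be combined with `&` (and) and `|` (or). The sketch below rebuilds a small DataFrame with only the `ns` and `ud` columns so that it runs on its own. The thresholds are arbitrary example values.

```python
import pandas as pd

station_data = pd.DataFrame({
    "ns": [1.124, 1.224, 1.324, 1.524],
    "ud": [104.1, 105.1, 106.1, 107.1],
})

# Each condition needs its own parentheses; & means "and", | means "or".
both = station_data.loc[(station_data["ns"] > 1.2) & (station_data["ud"] < 107)]
either = station_data.loc[(station_data["ns"] < 1.2) | (station_data["ud"] > 107)]

print(both)    # rows 1 and 2
print(either)  # rows 0 and 3
```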


**Assigning data**

You can not only access data stored in `DataFrame`s, you can also manipulate it.

In [46]:
station_data

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 0 | 1574332000.0 | 1.124 | 148.1 | 104.1 |
| 1 | 1574332000.0 | 1.224 | 149.1 | 105.1 |
| 2 | 1574332000.0 | 1.324 | 150.1 | 106.1 |
| 3 | 1574332000.0 | 1.524 | 151.1 | 107.1 |


In [47]:
station_data['time'] = 0
station_data

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 0 | 0 | 1.124 | 148.1 | 104.1 |
| 1 | 0 | 1.224 | 149.1 | 105.1 |
| 2 | 0 | 1.324 | 150.1 | 106.1 |
| 3 | 0 | 1.524 | 151.1 | 107.1 |


In [48]:
station_data.loc[station_data['ns'] > 1.224] += 2
station_data

|   | time | ns | ew | ud |
|---|------|----|----|----|
| 0 | 0 | 1.124 | 148.1 | 104.1 |
| 1 | 0 | 1.224 | 149.1 | 105.1 |
| 2 | 2 | 3.324 | 152.1 | 108.1 |
| 3 | 2 | 3.524 | 153.1 | 109.1 |
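
The conditional assignment above changes every column of the selected rows. To change only one column, `.loc` also accepts a column label after the row condition. A minimal sketch on a freshly built DataFrame, with the replacement value `0.0` chosen arbitrarily:

```python
import pandas as pd

station_data = pd.DataFrame({
    "ns": [1.124, 1.224, 1.324, 1.524],
    "ud": [104.1, 105.1, 106.1, 107.1],
})

# .loc[row_condition, column_label] restricts the assignment to one column.
station_data.loc[station_data["ns"] > 1.224, "ud"] = 0.0

print(station_data)
```

Only the `ud` values of the last two rows are overwritten; the `ns` column stays unchanged.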


**Converting Pandas back to NumPy**

In [49]:
station_data.to_numpy()

array([[  0.   ,   1.124, 148.1  , 104.1  ],
       [  0.   ,   1.224, 149.1  , 105.1  ],
       [  2.   ,   3.324, 152.1  , 108.1  ],
       [  2.   ,   3.524, 153.1  , 109.1  ]])

There are many more useful operations you can perform on a DataFrame:
- Compute min, max, mean, std
- Get unique values
- Group columns and perform above operations on the groups 
- Pivot tables
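
Some of the operations listed above can be sketched as follows. The DataFrame, including its `station` column, is hypothetical and exists only to give the grouping something to work on. Pivot tables are not shown here.

```python
import pandas as pd

# Hypothetical data: two stations with two measurements each.
df = pd.DataFrame({
    "station": ["A", "B", "A", "B"],
    "ns": [1.0, 2.0, 3.0, 4.0],
})

# Basic statistics of a single column.
print(df["ns"].min(), df["ns"].max(), df["ns"].mean(), df["ns"].std())

# Unique values of a column.
stations = df["station"].unique()
print(stations)

# Group the rows by station and compute the mean of "ns" per group.
group_means = df.groupby("station")["ns"].mean()
print(group_means)
```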

# Summary

* You know when to use NumPy
* You know when to use Pandas
* You know two ways to load data from files
* You know the basics of NumPy
  - Different ways to create arrays
  - Ways to iterate through an array
  - Ways to select parts of an array
  - Ways to change values of an array
  - Mathematical operators on arrays and matrices
* You know the basics of Pandas
  - Creating a DataFrame
  - Selecting rows and columns
  - Conditional selects
  - Conversion between NumPy and Pandas
  - Data assignment

### Next exercise: [Exercise 12](exercise_12_numpy_pandas.ipynb)
### Next lecture: [Python - Matplotlib & Seaborn](lecture_13_matplotlib_seaborn.ipynb)

In [None]:
TODO: examples with open, read + write

---
##### Authors:
* [Julian Niedermeier](https://github.com/sleighsoft)