# Data Science Ex 02 - Introduction to NumPy and Pandas

02.03.2022, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some Fun with Numbers and Tables!

In this exercise, we are going to have a look at the basic modules and types you need to know to work with large sets of data.
We are going to have a look at **NumPy** and **Pandas**, as well as how you can **load data** to start *number crunching*.
Since we will constantly stumble across NumPy and Pandas, invest enough time to understand these modules and their features.
Beginning next week, we will have some fun with visualizing data and begin with pre-processing.

## Tab Completion

Depending on the module, an import introduces a large number of types and functions.
And we are certainly not able to remember everything.
IPython offers a simple solution for this issue - *tab completion*.
Just enter the module or type, add the *.* and then press *Tab*.

```python
math.<Tab>
```

A list of all available functions and fields is then shown.

In [None]:
import math

In [None]:
math
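Tab completion only works in an interactive session. As a rough equivalent for scripts (a sketch that is not part of the original exercise), the built-in `dir()` function lists the same names. The variable name `publicNames` is only chosen for illustration.

```python
import math

# dir() returns all attribute names of the module as strings.
# Names starting with an underscore are internal, so we filter them out.
publicNames = [name for name in dir(math) if not name.startswith("_")]
print(publicNames[:5])

# The docstring of a single function tells us how to use it.
print(math.sqrt.__doc__)
```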

## NumPy

Reference: https://numpy.org

*NumPy* is the foundation of nearly all Data Science modules and approaches in Python.
It offers special data types and functions that enable fast and easy manipulation of large datasets.
But before we are going to dive deep into NumPy, we have to import the module.
Using Anaconda, NumPy is already installed, thus we just have to import NumPy into our notebook.
```python
import numpy as np
```
Using `np` as alias is a common approach in the data science community, so we will stick with it, too.
But you are free to use your own alias or even omit it.

In [None]:
import numpy as np

### Why NumPy?

The thing is, as we have seen in the last exercise, Python offers much flexibility regarding data types.
What takes away complexity from the developer, adds complexity to the internal structures.
But complexity has a negative impact on performance.
And the last thing we want when manipulating huge datasets is to wait for the results.

Thus, a new solution had to be developed.
And it's called **NumPy**.
NumPy introduces new basic data types (e.g. integers, floats, etc.) that are mapped upon `C` types.
So, rather than introducing complexity, they are more or less simple wrappers.
In this case, simplicity leads to faster code execution, and with large datasets this can save you a significant amount of time.

In [None]:
import random as rnd
%timeit slow = [rnd.random() for _ in range(100000)]

In [None]:
%timeit fast = np.random.random(100000)

With the simple example above, we see that creating 100'000 random floating point numbers between 0 and 1 is more than an order of magnitude faster when using NumPy.

### Data Types

Reference: https://numpy.org/doc/stable/user/basics.types.html

We won't go much into detail regarding the basic data types offered by NumPy.
For an exhaustive overview, please check the link provided above.

The most common types we will work with are the following:
- np.bool_ (`True` or `False`)
- np.int8 (-128 to 127)
- np.int16 (-32'768 to 32'767)
- np.int32 (-2'147'483'648 to 2'147'483'647) $\leftarrow$ up to 2 Billion
- np.int64 (-9'223'372'036'854'775'808 to 9'223'372'036'854'775'807) $\leftarrow$ up to 9 Quintillion
- np.float32 (floating point numbers)
- np.float64 (double precision floating point numbers)

Sometimes, it could also be interesting to work with unsigned integers.
As you saw, integers have a range from - to + with 0 in the middle.
But when we know that negative values are not possible, we could use unsigned values.
They just shift the range to positive numbers, starting at 0.
- np.uint8 (0 to 255)
- np.uint16 (0 to 65'535)
- np.uint32 (0 to 4'294'967'295)
- np.uint64 (0 to 18'446'744'073'709'551'615)

Most of the time, we are fine without using unsigned types, though.
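The ranges listed above do not have to be memorized. As a small sketch (the variable names `small`, `one` and `wrapped` are only illustrative), `np.iinfo()` reports the limits of an integer type, and the second part shows what happens when a result does not fit into the chosen type.

```python
import numpy as np

# np.iinfo reports the limits of an integer type.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
print(np.iinfo(np.uint16).max)

# Arrays keep their type, so a result that does not fit wraps around silently.
small = np.array([127], dtype=np.int8)
one = np.array([1], dtype=np.int8)
wrapped = small + one
print(wrapped, wrapped.dtype)
```

This silent wrap-around is the price for the compact types, so it is worth choosing a type that is large enough for the expected values.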

### Random

Before we dig deeper into NumPy, a word on random numbers.
NumPy includes a random number generator, or RNG for short.
The numbers we get back are pseudo-random.
This means they look random, but are actually calculated by an algorithm.

If we do not take any action, the numbers will be different every time we generate them.
But, since an algorithm is generating them, we can enforce reproducibility.
This can be done in two ways.

1. `np.random.seed(seed)`

`seed` is an initial value taken for the random number generator.
Using this method, we will set a global seed for the static RNG within NumPy.
This means, if we call, for example, `np.random.random()`, the order of random numbers we get is always the same, if the order of calls stays the same.

In [None]:
np.random.seed(42)
print(np.random.random(3))

np.random.seed(42)
print(np.random.random(3))
print()

np.random.seed(42)
print(np.random.random())
print("Now the order of calls has changed, so the values are shifted by one")
print(np.random.random(3))

2. `rng = np.random.RandomState(42)`

The other approach, besides setting a global seed (or, for that matter, a global `RandomState()`), is more object-oriented: we create instances of random number generators.
This way, we can work with multiple random number generators, which generate the same sequence of numbers if they share a seed and different sequences if they don't.
Each instance is unaffected by calls made on other instances in between.

In [None]:
rng1 = np.random.RandomState(42)
rng2 = np.random.RandomState(42)
rngOther = np.random.RandomState(1337)

print(rng1.rand(3))
print(rngOther.rand(3))
print(rng2.rand(3))

We will use both approaches during the exercises.
Neither is better than the other.
It depends on the case you want to show, and on your preference.
For simply showing examples, we suggest the first one, as it is easier to apply.
For number crunching, the second one is more suitable, as it lets us control reproducibility per generator.
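NumPy 1.17 and later also offer `np.random.default_rng()`, which returns a `Generator` instance and is what the NumPy documentation recommends for new code. It is not used in the rest of this exercise. The sketch below only shows that it follows the same instance-based idea as `RandomState`; the names `genA`, `genB` and `genOther` are illustrative.

```python
import numpy as np

genA = np.random.default_rng(42)
genB = np.random.default_rng(42)
genOther = np.random.default_rng(1337)

first = genA.random(3)
other = genOther.random(3)   # a call on another generator in between
second = genB.random(3)

print(first)
print(second)
print(np.array_equal(first, second))
```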

### Arrays

Besides the basic data types, NumPy introduces its own implementation of an array.
And basically everything in NumPy and Pandas is built upon this implementation.
Thus, we will have a look at what you can do with arrays, and later on, what functions exist that take arrays as input.

#### Creating Arrays

We can create an array from a list, with a given default value, evenly spaced or filled with random numbers (and many more possibilities).

In [None]:
fromList = np.array([1,2,3,4,5])
print(f"fromList = {fromList}")

withZeros = np.zeros(5)
print(f"withZeros =  {withZeros}")

withOnes = np.ones(5)
print(f"withOnes = {withOnes}")

withDefaults = np.full(5, 42)
print(f"withDefaults = {withDefaults}")

withRange = np.arange(0, 10, 2) # np.array(range(0, 10, 2))
print(f"withRange = {withRange}")

withSpace = np.linspace(0, 1, 5)
print(f"withSpace = {withSpace}")

withRandom = np.random.random(5)
print(f"withRandom = {withRandom}")

What we've done so far, is creating one-dimensional arrays.
Most of the functions used above (`np.zeros()`, `np.ones()`, `np.full()` and `np.random.random()`) can also create multidimensional arrays if we provide a tuple representing the shape instead of a single number.
The shape is defined as `(outermost dimension, ..., innermost dimension)`.
You could understand this as `(further dimensions, ..., rows, columns)`.
So, for an example, a 4x2 array is defined as `(4, 2)`.

In [None]:
emptyMatrix = np.zeros((3,3))
print(emptyMatrix)
print()

allAnswers = np.full((4,2), 42)
print(allAnswers)
print()

nDimRandoms = np.random.random(size=(2,3,4,5))
print(nDimRandoms)

It is also possible to get information about the structure of an array.

In [None]:
print(f"Dimensions: {nDimRandoms.ndim}")
print(f"Shape: {nDimRandoms.shape}")
print(f"Size: {nDimRandoms.size}")

#### Accessing

Accessing values within an array is done the same way as with lists.
We simply have to provide the index.

In [None]:
arr = np.arange(0,10)
print(arr)

print(arr[1])
print(arr[-1])

Besides index based access, we can also access ranges within an array by providing a colon `:` separated range.
A range is defined as `start:stop:step`.
The default values are
- `start = 0`
- `stop = size of dimension`
- `step = 1`

In [None]:
print(arr)
print("#" * 42)
print()

print("Selecting second to second last items:")
subArr = arr[1:9]
print(subArr)
print()

print("Selecting the first two items:")
firstTwo = arr[:2]
print(firstTwo)
print()

print("Selecting the last three items:")
lastThree = arr[-3:]
print(lastThree)
print()

print("Selecting every other item:")
everyOther = arr[::2]
print(everyOther)
print()

print("Reverse the array - this trick is quite handy!")
reverse = arr[::-1]
print(reverse)
print()

# This is totally legit
print("Using all the default values when selecting a range:")
theSame = arr[::]
print(theSame)

**Note:** Sub-arrays, also called slices, are a view on the array, but not a copy.
This means, when we change a value within the sub-array, the value gets changed in the original array, too.

In [None]:
arrCopy = arr.copy() # This is how we make a copy of an array. We do this to not destroy our arr array used as an example.
subArrCopy = arrCopy[2:5]

print(f"Original array: {arr}")
print(f"Copy: {arrCopy}")
print(f"Sub-array: {subArrCopy}")

subArrCopy[0] = 42

print(f"Changed to 42: {subArrCopy}")
print(f"Changed copy: {arrCopy}")
print(f"Original: {arr}")
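If we are unsure whether two arrays share their data, `np.shares_memory()` can tell us. The following is a minimal sketch with illustrative names (`base`, `view`, `copied`).

```python
import numpy as np

base = np.arange(10)
view = base[2:5]           # slicing returns a view
copied = base[2:5].copy()  # .copy() returns independent data

print(np.shares_memory(base, view))
print(np.shares_memory(base, copied))

copied[0] = 42             # does not touch base
view[0] = -1               # changes base[2] as well
print(base)
```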

#### Reshaping

In the previous examples, we always had the dimensions of our arrays the way we needed them.
Now, depending on the size we need, we have to change the shape of an array.
This can be done by using the `reshape()` function that takes the new shape.

In [None]:
grid = np.arange(1, 10).reshape((3,3))
print(grid)

This is a really handy feature, when we have to swap between row-vectors and column-vectors.

In [None]:
arr = np.arange(0,10)
row = arr.reshape((1,10))
print(row)
column = arr.reshape((10,1))
print(column)

`np.newaxis` offers an alternative way to create row- and column-vectors. 

In [None]:
otherRow = arr[np.newaxis,:]
print(otherRow)
otherColumn = arr[:,np.newaxis]
print(otherColumn)
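When reshaping, we do not have to calculate every dimension ourselves. As a small sketch (names are illustrative), passing `-1` for one dimension lets NumPy derive it from the total number of elements.

```python
import numpy as np

values = np.arange(12)

# -1 tells NumPy to compute this dimension from the total size.
threeRows = values.reshape((3, -1))
asColumn = values.reshape((-1, 1))
print(threeRows.shape)
print(asColumn.shape)

# The total number of elements must stay the same.
try:
    values.reshape((5, 3))
except ValueError as error:
    print(f"Not possible: {error}")
```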

### UFunc (Universal Functions)

What we have seen so far are functions that we can use to create and form arrays.
Since arrays are such a central part of NumPy, the module also offers functions - so called **universal functions** - to manipulate arrays very efficiently.
Every basic operator that you usually use, like `+, -, *, /, //, **, %`, can also be applied on arrays.

In [None]:
grid = np.arange(9).reshape((3,3))

gridAdd = grid + 2
print(gridAdd)
print()

gridSub = grid - 3
print(gridSub)
print()

gridMul = grid * 2
print(gridMul)
print()

gridDiv = grid / 3
print(gridDiv)
print()

gridDivR = grid // 3
print(gridDivR)
print()

gridSquare = grid ** 2
print(gridSquare)
print()

gridMod = grid % 3
print(gridMod)

And there is more.
We can also use an array on the right hand side of an operator.

In [None]:
gridReciproc = 1 / grid
print(gridReciproc) # Division by 0 yields inf and a RuntimeWarning instead of an error

Or we can use arrays on both sides.

In [None]:
gridMulEx = grid * np.arange(1,10).reshape((3,3))
print(gridMulEx)

And obviously, we can use all these operators together.

In [None]:
res = (-grid * 2) + 5
print(res)

In the background, all these operators have a corresponding function.
Using `+` is the same as using `np.add()`.

In [None]:
resAdd1 = grid + 5
resAdd2 = np.add(grid, 5)
print(resAdd1)
print()
print(resAdd2)

We leave it to you to find the other functions for the basic operators for when you need them.
But here are some other ufuncs that are interesting:

- `np.abs()`

In [None]:
arr = np.arange(-2,3)
print(f"{arr} -> {np.abs(arr)}")

- `np.exp()`

In [None]:
arr = np.arange(0,3)
print(f"x   = {arr}")
print(f"e^x = {np.exp(arr)}")
print(f"2^x = {np.exp2(arr)}")
print(f"4^x = {np.power(4, arr)}")

- `np.log()`

In [None]:
arr = np.array([1,2,4,10])
print(f"x        = {arr}")
print(f"ln(x)    = {np.log(arr)}")
print(f"log2(x)  = {np.log2(arr)}")
print(f"log10(x) = {np.log10(arr)}")

- `np.sin()`
- `np.cos()`
- `np.tan()`

In [None]:
alpha = np.linspace(0, np.pi, 3)
print(f"alpha      = {alpha}")
print(f"sin(alpha) = {np.sin(alpha)}")
print(f"cos(alpha) = {np.cos(alpha)}")
print(f"tan(alpha) = {np.tan(alpha)}")

Because the values are calculated and depend on the machine's precision, results that should be zero might not be exactly zero.
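For this reason, floating point results should not be compared with `==`. A minimal sketch, assuming the default tolerances of `np.isclose()` are good enough for the case at hand:

```python
import numpy as np

value = np.sin(np.pi)      # mathematically 0
print(value)
print(value == 0)
print(np.isclose(value, 0))

# np.isclose also works element-wise on arrays.
angles = np.linspace(0, np.pi, 3)
print(np.isclose(np.cos(angles), [1, 0, -1]))
```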

#### Aggregations

With the ufuncs, we can manipulate all values of an array at once.
Another feature we need is the ability to aggregate the values of an array into one value (e.g. sum, max, min).
NumPy also offers functions for these types of questions.

- `np.sum()`
- `np.cumsum()`

In [None]:
arr = np.arange(1,6)
print(f"sum({arr}) = {np.sum(arr)}")
print(f"cumsum({arr}) = {np.cumsum(arr)}")

- `np.prod()`
- `np.cumprod()`

In [None]:
arr = np.arange(1,6)
print(f"prod({arr}) = {np.prod(arr)}")
print(f"cumprod({arr}) = {np.cumprod(arr)}")

- `np.min()`
- `np.max()`

In [None]:
arr = np.random.randint(100, size=10)
print(f"min({arr}) = {np.min(arr)}")
print(f"max({arr}) = {np.max(arr)}")

- `np.all()`
- `np.any()`

In [None]:
arr = np.array([False, True, False])
print(f"all({arr}) = {np.all(arr)}")
print(f"any({arr}) = {np.any(arr)}")

- `np.mean()`
- `np.median()`
- `np.var()`
- `np.std()`

In [None]:
arr = np.random.normal(size=1000000)
print(f"mean = {np.mean(arr)}")
print(f"median = {np.median(arr)}")
print(f"var = {np.var(arr)}")
print(f"std = {np.std(arr)}")
print()
print("Actual values should be:")
print("- mean = median = 0")
print("- var = std = 1")

#### Comparison

Being able to manipulate arrays at once is one thing. What we want to do now is check conditions against arrays and compare arrays with each other.
Again, NumPy provides ufuncs and operators for that.

**Please note:** This section is quite important as it shows and teaches you the basics on how to filter *DataFrames* in *Pandas* later in this notebook and course.
So try to get and understand as much as possible here.

In [None]:
arr = np.arange(1,10)
arr

In [None]:
arr == 2

In [None]:
arr != 8

In [None]:
arr < 5

In [None]:
arr <= 4

In [None]:
arr > 3

In [None]:
arr >= 7

These operators also work when you have arrays on both sides.

In [None]:
arr = np.arange(3)
rra = np.arange(3)[::-1]
print(arr)
print(rra)
print("#" * 21)
print(arr == rra)

Previously, we have seen that the functions `np.all()` and `np.any()` exist.
Knowing how to use comparisons enables new possibilities in combination with these functions.

In [None]:
arr = np.arange(0,10)
print(np.all(arr < 10))
print(np.any(arr > 20))

If we want some more detail, for example the number of values that are `True` or `False`, we can use `np.count_nonzero()` or even `np.sum()`.
Because `True` counts as 1 and `False` as 0, `np.sum()` will return the number of `True` values.

In [None]:
arr = np.arange(5,15)
print(np.count_nonzero(arr < 10))
print(np.sum(arr < 10))

Now that we have seen how to get arrays of booleans, we can go a step further and combine expressions.
NumPy supports the following bitwise operators:
- `&` and (both must be true)
- `|` or (one must be true)
- `^` xor (only one can be true)
- `~` not (negate all values)

In [None]:
boolRow = np.array([True, False])
boolCol = np.array([[True], [False]])
print(boolRow)
print(boolCol)

In [None]:
boolRow & boolCol

In [None]:
boolRow | boolCol

In [None]:
boolRow ^ boolCol

In [None]:
print(~boolRow)
print(~boolCol)

Now, we can use this knowledge on bitwise comparison to define more complex expressions.

In [None]:
arr = np.arange(10)
print((arr > 5) & (arr <= 7) | (arr == 0))
print(np.sum((arr > 5) & (arr <= 7) | (arr == 0)))

**Please note:** It is important that you always encapsulate each comparison in brackets `(...)`.
Otherwise, Python evaluates `&` and `|` before the comparison operators, and the code fails with an error.
And I usually test this exact knowledge in the final exam.
So consider yourself warned =).
Anyway, let's move on. Shall we?
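One short detour before we do: the sketch below (with the illustrative names `mask`, `broken` and `failed`) shows what happens when the brackets are missing.

```python
import numpy as np

arr = np.arange(10)

# Correct: each comparison is wrapped in brackets.
mask = (arr > 5) & (arr <= 7)
print(mask)

# Without brackets, Python evaluates `5 & arr` first and ends up with
# a chained comparison that cannot be reduced to a single True/False.
try:
    broken = arr > 5 & arr <= 7
except ValueError as error:
    failed = True
    print(f"ValueError: {error}")
else:
    failed = False
```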

We can go even further, and use these arrays of booleans to get the values at their locations.
Taking the example from above, we know that 3 values fulfill our requirement, but we want the values, and not just a sum.
Easier done than said.

In [None]:
arr[(arr > 5) & (arr <= 7) | (arr == 0)]

Using an array of booleans as index will return the values at every position where the array's value is `True`.
And this is also how filtering in Pandas works.

In [None]:
arr = np.arange(3)
print(arr) 
print(arr[[False, True, False]]) # Explicit array
print(arr[arr == 1]) # Using the array of the comparison
print(arr[~(arr == 1)]) # Using the negated comparison
print(arr[arr != 1]) # The same as the line above
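A boolean array can be used not only to read values, but also to write them. The following sketch uses made-up temperature values and illustrative names to replace all negative entries in one step.

```python
import numpy as np

temps = np.array([12, -3, 7, -8, 0, 15])

negatives = temps < 0
print(temps[negatives])      # read the selected values

clipped = temps.copy()       # keep the original untouched
clipped[negatives] = 0       # write to all selected positions at once
print(clipped)
```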

#### Sorting

When working with arrays, we might need to sort them.
NumPy offers two simple methods that we will use to get numbers in order.

1. `np.sort(array)`

This function simply sorts the array and returns the sorted values.
The original array is not affected, but a new one is created.

In [None]:
arr = np.random.randint(0,10, 10)
sortedArr = np.sort(arr) # named sortedArr to avoid shadowing the built-in sorted()
print(f"{arr} --> {sortedArr}")

2. `np.argsort(array)`

This function returns the indices in order rather than the values.

In [None]:
arr = np.random.randint(0, 10, 10)
idx = np.argsort(arr)
print(f"{arr} --> {idx}")

Knowing fancy indexing (see below in the *self-study* section), we can use this array again, to sort the array.

In [None]:
arr[idx]
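The real strength of `np.argsort()` is that the indices can be applied to a different array of the same length. As a sketch with illustrative names, we order city names by their area (the values are the ones used in the Pandas part of this exercise).

```python
import numpy as np

names = np.array(["Bern", "Zurich", "Geneva", "Rapperswil"])
areas = np.array([51.62, 87.88, 15.92, 1.74])

order = np.argsort(areas)        # indices from smallest to largest area
print(names[order])
print(areas[order])

descending = order[::-1]         # reverse the indices for largest first
print(names[descending])
```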

### Self-Study

#### Broadcasting

*Broadcasting* is NumPy's functionality to combine arrays of different shapes.
We don't have to think much, but just use functions on arrays we expect to be valid.
A simple example of broadcasting is adding a scalar.
We could imagine that

In [None]:
np.array([0, 1, 2]) + 5

is the same as

In [None]:
np.array([0, 1, 2]) + np.array([5, 5, 5])

The same is also true for the following example.

In [None]:
row = np.arange(3)
column = np.arange(3)[:,np.newaxis]
print(row)
print(column)
print()
print(row + column)

As you can see, NumPy behaves as we expect it when adding two arrays with different dimensions.
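The rule behind this behavior: NumPy compares the shapes from the last dimension backwards, and two dimensions are compatible if they are equal or if one of them is 1. The sketch below makes the implicit stretching visible with `np.broadcast_to()` and shows a combination that does not work.

```python
import numpy as np

row = np.arange(3)                    # shape (3,)
column = np.arange(3)[:, np.newaxis]  # shape (3, 1)

# np.broadcast_to makes the implicit stretching visible.
print(np.broadcast_to(row, (3, 3)))
print(np.broadcast_to(column, (3, 3)))
print((row + column).shape)

# Shapes (3,) and (4,) are not compatible: the lengths differ and neither is 1.
try:
    np.arange(3) + np.arange(4)
except ValueError as error:
    print(f"Cannot broadcast: {error}")
```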

#### Concatenation & Splitting

Having worked with one array, it is also possible to combine (concatenate) multiple arrays or to split an array.
Concatenation is done by using the `np.concatenate()` function.

In [None]:
arr1 = np.arange(0,5)
arr2 = np.arange(10,15)
arrC = np.concatenate([arr1, arr2])
print(arrC)

We can also use `np.concatenate()` for multi-dimensional arrays.

In [None]:
arrGrid = np.arange(1,10).reshape((3,3))
arrGridC = np.concatenate([arrGrid, arrGrid])
print(arrGridC)

Depending on the number of dimensions we have, we can specify which `axis` should be used.

In [None]:
arrGridC = np.concatenate([arrGrid, arrGrid], axis=1)
print(arrGridC)

And if we are working with arrays of mixed dimensions, `np.vstack()` and `np.hstack()` come in handy.

In [None]:
arr1d = np.arange(3)
arr2d = np.arange(3,9).reshape((2,3))
arrV = np.vstack([arr1d, arr2d])
print(arrV)

In [None]:
arr2dn = np.array([[10], [11]])
arrH = np.hstack([arr2d, arr2dn])
print(arrH)

With `np.split()`, `np.vsplit()` and `np.hsplit()` we can take arrays apart.
These functions take a list of split points as argument.

In [None]:
arr = np.arange(16)
left, middle, right = np.split(arr, [4,12]) # Middle starts at index 4, right starts at index 12
print(left, middle, right)

In [None]:
grid = np.arange(16).reshape((4,4))
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

In [None]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)
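Besides a list of split points, `np.split()` also accepts a single integer. A small sketch with illustrative names:

```python
import numpy as np

arr = np.arange(12)

# An integer splits the array into that many equally sized parts.
parts = np.split(arr, 3)
print(parts)

# np.split raises an error if the parts cannot be equal.
# np.array_split allows parts of different length instead.
uneven = np.array_split(np.arange(10), 3)
print(uneven)
```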

#### Fancy Indexing

*Fancy Indexing* is just a fancy term in NumPy for accessing the values of an array in the way you want.
Up to now, we have seen the following ways to access values of an array:
- By Index (e.g. `arr[4]`)
- By Range (e.g. `arr[4:10:2]`)
- By Comparison (e.g. `arr[arr > 5]`)
- By Bool-array (e.g. `arr[[True, False, False]]`)


But there is one more way - by index-array.

In [None]:
arr = np.arange(1,16)
arr

In [None]:
idx = np.array([2,3,4])
arr[idx]

And the order of indices matters.

In [None]:
idx = np.array([12,5,8,0])
arr[idx]

And even the dimensions of the index-array matter.

In [None]:
idx = np.array([[5,7], [3, 10]])
arr[idx]

We can even combine it with slicing.

In [None]:
arr = np.arange(1, 17).reshape((4,4))
arr

In [None]:
cols = np.array([1,3])
arr[2:, cols]

In [None]:
rows = np.array([0,2])
arr[rows, :2]
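If we pass index-arrays for both dimensions, NumPy pairs them element by element. The sketch below (names are illustrative) also shows that fancy indexing works for assignments.

```python
import numpy as np

grid = np.arange(1, 17).reshape((4, 4))

rows = np.array([0, 1, 3])
cols = np.array([2, 0, 3])

# Index arrays in both positions are paired element by element:
# (0, 2), (1, 0) and (3, 3).
picked = grid[rows, cols]
print(picked)

# The same index arrays can be used to assign values.
changed = grid.copy()
changed[rows, cols] = 0
print(changed)
```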

#### Multi-Dimension Arrays

##### Accessing

Multi dimensional arrays take comma separated indices.

In [None]:
arrN = np.random.randint(100, size=(2,3,4))
print(arrN)
print()

# Second array, first row, last column
print(arrN[1,0,-1])

And we can slice multi dimensional arrays as well.
A range is defined as `start:stop:step`.
The default values are
- `start = 0`
- `stop = size of dimension`
- `step = 1`

In [None]:
print(arrN)
print("#" * 42)
print()

# Take both arrays, last two rows and columns in reverse but only every second value.
subPart = arrN[::, 1:,::-2]
print(subPart)

##### Comparison

Multi-dimensional arrays can also be compared to each other.

In [None]:
grid = np.arange(16).reshape((4,4))
inner = np.full((4,4), -1)
inner[1:3, 1:3] = np.array([[5,6], [9, 10]])

print(grid)
print()
print(inner)
print()
print(grid == inner)

Working with multi dimensional arrays, `np.sum()` can be used for a specific dimension.

In [None]:
arr = np.arange(16).reshape((4,4))
print(arr)
print()
print(f"Values >= 10 per column: {np.sum(arr >= 10, axis=0)}")
print(f"Values >= 10 per row: {np.sum(arr >= 10, axis=1)}")

## Pandas

Reference: https://pandas.pydata.org

First of all, Pandas has nothing to do with the animal.
Period.
Although - it feels fluffy.
Originally, it's derived from the term [panel data](https://en.wikipedia.org/wiki/Panel_data).

*Pandas* is a module that builds upon NumPy, but provides functions to actually work with data.
NumPy offers the basic structure - the array.
And Pandas introduces new objects that use the array and offer the functions we need to analyse and mutate data.

But let's begin by importing the module.

In [None]:
import pandas as pd

### Pandas Objects

Before we check out the awesome stuff Pandas allows us to do, let's have a look at the three fundamental structures Pandas provides.

- `Series` $\leftarrow$ "Columns"
- `Index` $\leftarrow$ "First Column"
- `DataFrame` $\leftarrow$ "Tables"

#### Series

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/series.html

A Pandas `Series` is a one-dimensional array of indexed data.
We can simply create one by throwing an array into the constructor.

In [None]:
data = pd.Series(np.linspace(0,1,5))
data

In the left column, you see the indices.
In the right one, the values of the series are stored.
We can even access this information through attributes.

In [None]:
data.values

In [None]:
data.index

And of course, slicing is also possible.

In [None]:
data[1:4]

Now, if you think that a `Series` is nothing more but an array, let me show you something.

In [None]:
data = pd.Series(np.linspace(0,1,5), index=["a","b","c","d","e"])
data

Or...

In [None]:
data = pd.Series(np.linspace(0,1,5), index=range(100,110,2))
data

Or...

In [None]:
data = pd.Series(np.linspace(0,1,5), index=[7,42,1,364,0])
data

Now, working with slicing here is a bit trickier, but we will get to that later.

Having indices that are not numbers, we can still use slicing.

In [None]:
data = pd.Series(np.linspace(0,1,5), index=["a","b","c","d","e"])
data["b":"d"]

And it doesn't matter if the order of indices makes sense (is ordered somehow).
Slicing is selecting top to bottom.

In [None]:
data = pd.Series(np.linspace(0,1,5), index=["t", "y", "e", "x", "d"])
data["y":"d"]

We can also create a `Series` with a default value.

In [None]:
data = pd.Series(42, index=range(0,3))
data

Or based on a dictionary.

In [None]:
data = pd.Series({"a": 42, "b": 1337, "c": 4.2})
data

Or just take some values from the dictionary.

In [None]:
data = pd.Series({"a": 42, "b": 1337, "c": 4.2}, index=["b", "a"])
data
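The `index` argument may also contain keys that do not exist in the dictionary. A small sketch (the key `"z"` and the variable names are made up for illustration):

```python
import numpy as np
import pandas as pd

source = {"a": 42, "b": 1337, "c": 4.2}

# "z" is not a key of the dictionary, so its value is missing (NaN).
picked = pd.Series(source, index=["b", "z"])
print(picked)
print(picked.isna())
```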

#### Index

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html

As seen above, `Series` contain an index.
And this index has its own type.

In [None]:
series = pd.Series(np.random.randint(0,10,4))
print(series.index)
frame = pd.DataFrame([1,2,3], index=["a", "b", "c"])
print(frame.index)

Simply said, an index is an immutable array.
*Immutable* means that you cannot change an instance's content.
An `Index` works like an array (that you know from NumPy), but you cannot add, remove or change values.

In [None]:
idx = pd.Index([4,3,6,21,5])
print(idx[1])
print(idx[2:4])
print(idx[::2])

Usually, you don't have to build your own `Index` since it will be created within the `DataFrame`.
But we will use their behavior as sets during the course.
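A minimal sketch of both properties, immutability and set behavior, with two made-up indices `idxA` and `idxB`:

```python
import pandas as pd

idxA = pd.Index([1, 3, 5, 7, 9])
idxB = pd.Index([2, 3, 5, 7, 11])

print(idxA.intersection(idxB))  # values in both
print(idxA.union(idxB))         # values in at least one
print(idxA.difference(idxB))    # values only in idxA

# Changing a value is not allowed.
try:
    idxA[0] = 42
except TypeError as error:
    print(f"TypeError: {error}")
```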

#### DataFrame

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

Having had some fun with `Series`, let's go a step further.
The multi-dimensional counterpart to the one-dimensional `Series` is called a `DataFrame`.
When creating a `DataFrame`, the indices simply have to match, so the constructor knows which entries belong together.
We won't cover every possibility on how to construct a `DataFrame`.
For more detail, please have a look at the link provided above.

In [None]:
area = pd.Series({"Zurich": 87.88, "St.Gallen": 39.41,"Geneva": 15.92,"Rapperswil": 1.74,"Bern": 51.62})
population = pd.Series({"Zurich": 415215, "St.Gallen": 75806,"Geneva": 201741,"Rapperswil": 7601,"Bern": 133791})

df = pd.DataFrame({"Population": population, "Area": area})
df

And if the indices don't match, it doesn't matter.
Missing values are included as `NaN` (Not a Number).

In [None]:
area = pd.Series({"Zurich": 87.88, "St.Gallen": 39.41,"Geneva": 15.92,"Rapperswil": 1.74,"Bern": 51.62})
population = pd.Series({"Zurich": 415215, "St.Gallen": 75806,"Rapperswil": 7601,"Bern": 133791, "Rüti (ZH)": 12170})

df = pd.DataFrame({"Population": population, "Area": area})
df
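Missing values have a side effect worth knowing: `NaN` is a floating point value, so an integer column that receives a `NaN` is converted to floats. The following sketch uses a shortened version of the data above and the illustrative name `combined`.

```python
import pandas as pd

area = pd.Series({"Zurich": 87.88, "Geneva": 15.92})
population = pd.Series({"Zurich": 415215, "Rüti (ZH)": 12170})
print(population.dtype)              # integers

combined = pd.DataFrame({"Population": population, "Area": area})
print(combined)
print(combined.dtypes)               # Population is now a float column

# isna() marks the missing entries, sum() counts them per column.
print(combined.isna().sum())
```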

Accessing structural information can be done by `index` and `columns`.

In [None]:
print(f"Index:   {df.index}")
print(f"Columns: {df.columns}")

Accessing `Series` is possible by using the column name.

In [None]:
df["Area"]

### Indexing & Selection

#### Series

Now, let's have some fun with `Series`.

In [None]:
data = pd.Series(np.linspace(0,1,5), index=list("abcde"))
data

We can access values like

In [None]:
data["c"]

In [None]:
data["b":"d"]

In [None]:
data[0:2]

**Note:** When slicing with explicit indices (e.g. "b":"d"), the end (e.g. "d") is included.
When using implicit indices (e.g. 0:2), the value at 2 is not included.
Most of the time, we will work with explicit indices, as indices are often timestamps, names or some other identifiers.

Since accessing with implicit indices is possible, we might run into problems with integer indices.
To solve these issues, `loc[range]` (using the explicit index) and `iloc[range]` (using the implicit index) are provided.

In [None]:
intData = pd.Series(range(1,6), range(0,10,2))
intData

In [None]:
intData.loc[4:6]

In [None]:
intData.iloc[4:6]
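To make the difference easier to see, here is a short sketch that puts label-based and position-based access side by side on the same `Series`.

```python
import pandas as pd

intData = pd.Series(range(1, 6), index=range(0, 10, 2))

# loc uses the labels of the index: label 2 is the second entry.
print(intData.loc[2])

# iloc uses the position: position 2 is the third entry.
print(intData.iloc[2])

# Label slices include the end, position slices do not.
print(intData.loc[2:6].tolist())
print(intData.iloc[2:6].tolist())
```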

And there is more...

In [None]:
"a" in data

In [None]:
data.keys()

In [None]:
list(data.items())

We can even use *masking* and *fancy indexing*.

In [None]:
data[(data < .5) | (data > .75)]

In [None]:
data[["a", "e", "c"]]

Further, we can add new values the same way as it works with dictionaries.

In [None]:
data["f"] = 1.25
data

#### DataFrame

`DataFrames` offer similar possibilities.

In [None]:
area = pd.Series({"Zurich": 87.88, "St.Gallen": 39.41,"Geneva": 15.92,"Rapperswil": 1.74,"Bern": 51.62})
population = pd.Series({"Zurich": 415215, "St.Gallen": 75806,"Geneva": 201741,"Rapperswil": 7601,"Bern": 133791})

cities = pd.DataFrame({"Population": population, "Area": area})
cities

We can select `Series` in two different ways.

In [None]:
cities["Area"]

In [None]:
cities.Area

**Note:** Although the latter seems simpler, it has some limitations.
Column names with special characters (e.g. spaces) can cause problems or even make this approach impossible.
Further, if a column has the same name as an existing method or attribute, it won't work either.
The same is true if the column names are not strings but some other data type.
Thus, for the remainder of this course, we will stick with the first approach by using the name of the column as index argument.
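A small sketch of these limitations, using a made-up `DataFrame` called `table` whose column names were chosen to cause trouble:

```python
import pandas as pd

table = pd.DataFrame({"count": [1, 2], "city name": ["Bern", "Uri"]})

# `count` is also a DataFrame method, so the attribute is the method.
print(callable(table.count))

# Index access always returns the column.
print(table["count"].tolist())
print(table["city name"].tolist())   # table.city name would be a syntax error
```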

Having a `DataFrame`, we can easily extend it with new values.
New values can be introduced by either adding a completely new `Series` as column, or we can calculate them.

In [None]:
cities["Density"] = cities["Population"] / cities["Area"]
cities

Working with ranges can either be done directly using `[]`, or again with `loc[]`, `iloc[]`.

In [None]:
cities[:2]

In [None]:
cities.iloc[:2]

In [None]:
cities["St.Gallen":]

In [None]:
cities.loc["St.Gallen":]

In [None]:
cities["Area"]

In [None]:
cities.loc[:"Rapperswil", :"Area"]

In [None]:
cities.iloc[:4, :2]

In [None]:
cities[cities["Density"] > 4000]

In [None]:
cities.loc[cities["Density"] > 4000]

In [None]:
cities.loc[cities["Density"] < 4000, ["Population"]]
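Conditions can be combined the same way as with NumPy arrays. The sketch below rebuilds the `cities` table with the values from above; the thresholds and the names `large` and `compact` are only chosen for illustration.

```python
import pandas as pd

cities = pd.DataFrame(
    {
        "Population": [415215, 75806, 201741, 7601, 133791],
        "Area": [87.88, 39.41, 15.92, 1.74, 51.62],
    },
    index=["Zurich", "St.Gallen", "Geneva", "Rapperswil", "Bern"],
)

# Each comparison needs its own brackets, just like with NumPy arrays.
large = cities["Population"] > 100000
compact = cities["Area"] < 60
print(cities[large & compact])
print(cities.loc[(cities["Population"] > 100000) & (cities["Area"] < 60), "Area"])
```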

We can even transpose a `DataFrame` with one simple call.

In [None]:
cities.T

There are many more possible ways to select values in `DataFrames`.
In the next sections and during the course we will continuously introduce new ways of interacting with data within `DataFrames`.
Feel free to dive deep into `DataFrames` as they will be the main object we will work with during this course.

### Operating on Data

The basic operations we used to work with NumPy arrays are also available for `DataFrames`.
So this is absolutely possible.

In [None]:
df = pd.DataFrame(range(0,11), columns=["Value"])
df["Add2"] = df["Value"] + 2
df["Sub1"] = df["Value"] - 1
df["Mul3"] = df["Value"] * 3
df["Div2"] = df["Value"] / 2
df["FloorDiv2"] = df["Value"] // 2
df["Pow2"] = df["Value"] ** 2
df["Mod3"] = df["Value"] % 3
df

Some other functions are also available (please note that the following examples are not exhaustive).

In [None]:
df.sum()

In [None]:
df.sum().sum()

In [None]:
df.mean()

In [None]:
df["Value"].mean()

In [None]:
df.cumsum()

In [None]:
df.prod() # that's kinda pointless since every column but one contains a 0, d'oh
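By default, these functions aggregate per column. As a sketch on a small made-up table, the `axis` parameter switches to an aggregation per row, and `describe()` collects several aggregations at once.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

perColumn = df.sum()                 # default: one value per column
perRow = df.sum(axis="columns")      # one value per row
print(perColumn)
print(perRow)

# describe() collects several aggregations in one table.
print(df.describe())
```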

### Combining DataSets

There are several ways to combine `DataFrames` and `Series`.
For starters, there is the `pd.concat()` function and, in Pandas versions before 2.0, the `DataFrame.append()` method.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In [None]:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

In [None]:
pd.concat([df1, df2])

As we see, the indices are taken from both data frames by default.
If we want to ignore this behavior, we can use the `ignore_index` parameter.

In [None]:
pd.concat([df1, df2], ignore_index=True)

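In Pandas versions before 2.0, `DataFrame.append()` could be used likewise.
The method was deprecated in Pandas 1.4 and removed in Pandas 2.0, so the following two cells only run on older installations.

In [None]:
df1.append(df2) # Pandas < 2.0 only, same result as pd.concat([df1, df2])

In [None]:
df1.append(df2, ignore_index=True) # Pandas < 2.0 only, same result as pd.concat([df1, df2], ignore_index=True)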

With the `axis` parameter, we can define in what direction the `DataFrames` are put together.
By default, `axis` is set to `"index"`, thus the parts are stacked on top of each other.

In [None]:
pd.concat([df1, df2], axis="columns")

`pd.concat()` also works with `Series`.
We don't have to build all `DataFrames` first.

In [None]:
ser1 = pd.Series([5,6], index=[0,1], name="C")
pd.concat([df1, ser1], axis="columns")

And even with multiple `Series` at once.

In [None]:
ser2 = pd.Series([7, 8], index=[0, 1], name="D")
pd.concat([df1, ser1, ser2], axis="columns")

Please note that these functions won't change instances, but return new instances of `DataFrames`.

In [None]:
df3 = pd.concat([df1, ser1], axis="columns")
print(df1)
print()
print(df3)

The examples above simply put `DataFrames` and `Series` together.
While working with multiple `DataFrames`, it's more likely that we need to combine them based on specific values in a specific column.
Thus, for more complex combinations of `DataFrames`, there are the `pd.merge()` and `DataFrame.join()` functions.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

In [None]:
cantons = pd.DataFrame({ "Canton" : ["Bern", "Zurich", "Luzern", "Uri"], "Abbr" : ["BE", "ZH", "LU", "UR"]})
population = pd.DataFrame({"Population" : [1031126, 1504346, 406506, 36299], "Canton" : ["Bern", "Zurich", "Luzern", "Uri"]})

print(cantons)
print()
print(population)

In [None]:
pd.merge(cantons, population)

As you see, `pd.merge()` combines the two `DataFrames` as expected without duplicating the "Canton" column.
If there is no common column, `pd.merge()` will fail with an error.
In this case, we have to specify how to combine the `DataFrames` by using the `left_on` and `right_on` parameters.

In [None]:
links = pd.DataFrame({"Cant" : ["Bern", "Zurich", "Luzern", "Uri"], "Url" : ["https://www.be.ch/", "https://www.zh.ch/", "https://www.lu.ch/", "http://www.uri.ch/"]})
pd.merge(cantons, links, left_on="Canton", right_on="Cant").drop("Cant", axis=1)
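
The failure mentioned above can be made visible with a minimal sketch. The frames `a` and `b` are illustrative stand-ins for `cantons` and `links`, and the sketch assumes that the error raised is `pd.errors.MergeError`, which is what current pandas versions use for this case.

```python
import pandas as pd

# Illustrative frames without any shared column name
a = pd.DataFrame({"Canton": ["Bern", "Uri"], "Abbr": ["BE", "UR"]})
b = pd.DataFrame({"Cant": ["Bern", "Uri"],
                  "Url": ["https://www.be.ch/", "http://www.uri.ch/"]})

try:
    pd.merge(a, b)  # no common column and no key given
    failed = False
except pd.errors.MergeError as err:
    failed = True
    print(f"Merge failed: {err}")

# Naming the key columns explicitly resolves the problem
fixed = pd.merge(a, b, left_on="Canton", right_on="Cant")
print(fixed)
```

Both key columns survive in the result, which is why the cell above drops "Cant" afterwards.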

In case there are multiple columns that could be used to combine, or we simply want to state the column explicitly, we can use the `on` parameter.

In [None]:
pd.merge(cantons, population, on="Canton")
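
In the example above, every canton appears in both `DataFrames`. When keys are missing on one side, the `how` parameter of `pd.merge()` decides which rows survive. The following sketch uses deliberately incomplete, illustrative frames (`left` and `right` are not part of the exercise data) to compare three settings.

```python
import pandas as pd

# Illustrative data: "Zurich" is missing on the right, "Luzern" on the left
left = pd.DataFrame({"Canton": ["Bern", "Zurich", "Uri"],
                     "Abbr": ["BE", "ZH", "UR"]})
right = pd.DataFrame({"Canton": ["Bern", "Uri", "Luzern"],
                      "Population": [1031126, 36299, 406506]})

inner = pd.merge(left, right, on="Canton")                  # default how="inner": keys found in both
outer = pd.merge(left, right, on="Canton", how="outer")     # all keys, gaps filled with NaN
left_join = pd.merge(left, right, on="Canton", how="left")  # all keys of the left frame

print(inner)
print(outer)
print(left_join)
```

With the default inner merge, rows without a partner are dropped silently, so it can be worth comparing the row counts before and after a merge.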

Sometimes, we would rather merge `DataFrames` based on their indices.
The `pd.merge()` function also allows this via the `left_index` and `right_index` parameters.

In [None]:
cantonsI = cantons.set_index("Canton") # use the "Canton" column as index
populationI = population.set_index("Canton")
linksI = links.set_index("Cant")

df1 = pd.merge(cantonsI, populationI, left_index=True, right_index=True)
pd.merge(df1, linksI, left_index=True, right_index=True)

Or we could mix the two approaches.

In [None]:
linksI = links.set_index("Cant")

pd.merge(df1, linksI, left_on="Canton", right_index=True)

Since indices are usually defined, `DataFrame.join()` performs an index-based merge out of the box.
And since the function returns a new `DataFrame`, we can also use a technique called *method chaining* to join multiple `DataFrames` in one line.

In [None]:
cantonsI = cantons.set_index("Canton")
populationI = population.set_index("Canton")
linksI = links.set_index("Cant")

cantonsI.join(populationI).join(linksI)
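
To relate `join()` to `pd.merge()`, here is a small sketch with two illustrative, index-based frames (`cantons_i` and `population_i` are assumed names). One difference to keep in mind: `join()` defaults to `how="left"`, while `pd.merge()` defaults to `how="inner"`. As long as both indices contain the same labels, the two calls should produce the same table.

```python
import pandas as pd

idx = pd.Index(["Bern", "Zurich"], name="Canton")
cantons_i = pd.DataFrame({"Abbr": ["BE", "ZH"]}, index=idx)
population_i = pd.DataFrame({"Population": [1031126, 1504346]}, index=idx)

# Index-based combination, written in two ways
joined = cantons_i.join(population_i)
merged = pd.merge(cantons_i, population_i, left_index=True, right_index=True)

print(joined)
print(joined.equals(merged))  # compares values, dtypes and index
```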

### Aggregation and Grouping

#### Aggregating

Working with larger datasets, which is what we will do during this course, requires functions that aggregate and group data into more manageable chunks.
We've already seen that `sum()` and `prod()` work with `DataFrames` and `Series`, respectively.
And as you might expect, the other functions that we have used on NumPy arrays also exist.

- `mean()`
- `median()`
- `min()`
- `max()`
- `std()`
- `var()`

And there are new ones as well.

- `count()` number of items in a `Series` (`NaN` is excluded)
- `head(n)` first `n` rows
- `first(offset)` first rows of a time series within the given date offset (requires a date index)
- `tail(n)` last `n` rows
- `last(offset)` last rows of a time series within the given date offset (requires a date index)
- `info()` shows a summary of the `DataFrame`'s structure
- `describe()` shows stats of the `DataFrame`'s content
- `quantile(q)` returns the value at the requested quantile (`q` between 0 and 1)

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats
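
Before applying these functions to real data, a tiny sketch may help to see how they treat missing values. The `Series` below is made up for illustration and contains one `NaN`.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 4.0])  # illustrative data with one missing value

print(values.count())        # counts non-NaN items only
print(len(values))           # counts all items, including NaN
print(values.mean())         # NaN is skipped by default (skipna=True)
print(values.quantile(0.5))  # the 0.5 quantile is the median
```

The difference between `count()` and `len()` is a quick way to find out how many values are missing.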

In [None]:
chPopulation = pd.read_csv("Demo_CH_2018.csv", sep=";") # you'll understand this line in a moment
chPopulation.info()

In [None]:
chPopulation.head(5)

In [None]:
chPopulation.tail(3)

In [None]:
chPopulation.describe()

Some of the functions are quite simple to apply on a `DataFrame` - like getting the median of each column.

In [None]:
chPopulation.median(numeric_only=True)

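To prevent a warning, we set `numeric_only=True`.
Otherwise, the warning would mention that some columns (in our case `Canton` and `Lang`) were ignored although we did not explicitly exclude them.
Since pandas 2.0, omitting the parameter raises a `TypeError` for such non-numeric columns instead of a warning.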

Please note that these functions return one value per column, and these values do not necessarily come from the same row.
If we check `max()` or `min()`, the result combines values from different rows and does not match any single row.

In [None]:
chPopulation.max()

In [None]:
chPopulation[chPopulation["Dec 2018"] == chPopulation["Dec 2018"].max()]

In [None]:
chPopulation.min()

In [None]:
chPopulation[chPopulation["Dec 2018"] == chPopulation["Dec 2018"].min()]

So, be careful and crosscheck your results.
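
One way to do such a crosscheck is `idxmin()` (or `idxmax()`), which returns the index label of the extreme value so that the complete row can be fetched. The following sketch uses a small illustrative frame instead of the demo file.

```python
import pandas as pd

df = pd.DataFrame({"Canton": ["Zurich", "Bern", "Uri"],
                   "Abbr": ["ZH", "BE", "UR"],
                   "Population": [1504346, 1031126, 36299]})

# Column-wise minima: each value is picked independently of the other columns
col_min = df.min()
print(col_min)

# The actual row with the smallest population
row = df.loc[df["Population"].idxmin()]
print(row)
```

Here, `col_min` mixes the alphabetically smallest canton with the smallest population, whereas `row` describes one real canton.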

Many of these functions can also be applied on rows instead of columns.

In [None]:
chPopulation.mean(axis=1, numeric_only=True)

And while we are at it, it is also possible to create more complex filters.
For example, it's quite easy to get the cantons which are in the lowest quartile (the quarter with the smallest values).

In [None]:
chPopulation[chPopulation["Dec 2018"] <= chPopulation["Dec 2018"].quantile(.25)].sort_values("Dec 2018")

### Self-Study

#### Index

We can check indices for
- intersections (`.intersection()`)
- unions (`.union()`)
- symmetric differences (`.symmetric_difference()`)

In [None]:
idx1 = pd.Index([1,2,3,4,5])
idx2 = pd.Index([1,3,5,7])

print(f"Intersection: {idx1.intersection(idx2)}")
print(f"Union:        {idx1.union(idx2)}")
print(f"Difference:   {idx1.symmetric_difference(idx2)}")
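
Besides the symmetric difference, there is also a one-sided `.difference()`, where the order of the operands matters. A short sketch with the same two indices:

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4, 5])
idx2 = pd.Index([1, 3, 5, 7])

only_in_1 = idx1.difference(idx2)  # labels in idx1 that are not in idx2
only_in_2 = idx2.difference(idx1)  # labels in idx2 that are not in idx1

print(f"Only in idx1: {only_in_1}")
print(f"Only in idx2: {only_in_2}")
```

The symmetric difference is the union of these two one-sided results.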

##### Hierarchical Indexing

Just as a side note for now, it is possible to work with multi-dimensional indices.
If we actually have to work with them, we will cover the topic.
But for now, we just show you an example of how they look.

In [None]:
ix = pd.Index(["Zurich", "Bern"])
d2018 = pd.Series([1504346, 1031126], index=ix)
d2017 = pd.Series([1487969, 1026513], index=ix)
d2016 = pd.Series([1466424, 1017483], index=ix)
history = pd.DataFrame({2018: d2018, 2017: d2017, 2016: d2016})
history

In [None]:
mHistory = history.stack()
mHistory

In [None]:
mHistory.index

In [None]:
flat = mHistory.unstack()
flat
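
To give an impression of how values are addressed in such a hierarchical index, here is a minimal sketch that rebuilds a reduced version of the data above. The name `m_history` and the call to `sort_index()` are additions for this illustration; sorting is not strictly required, but it keeps lookups on a multi-level index efficient.

```python
import pandas as pd

ix = pd.Index(["Zurich", "Bern"])
history = pd.DataFrame({
    2018: pd.Series([1504346, 1031126], index=ix),
    2017: pd.Series([1487969, 1026513], index=ix),
})

# Series with a two-level index (canton, year), sorted by both levels
m_history = history.stack().sort_index()

zurich = m_history.loc["Zurich"]             # all years of one canton
bern_2017 = m_history.loc[("Bern", 2017)]    # one value, addressed by both levels

print(zurich)
print(bern_2017)
print(m_history.index.nlevels)
```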

## Working with Files

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/io.html

Now that we have seen some magic with Pandas, let's go a step further.
The examples above primarily showed simple `DataFrames` that we used to demonstrate certain features of Pandas.
But as you may have noticed, in the last example, we got a bit lazy and loaded our demo data from a file.
All the cool stuff Pandas allows us to do is only cool when we can use it on real data.
Thus, we need ways to access external sources.

Pandas offers many different functions to load data into a `DataFrame` without filling it by hand.
In the following sections, we demonstrate the most relevant of them.
For a complete list of `read_*()` functions, please check the reference provided above.

### csv

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
csv = pd.read_csv("./Demo.csv", sep=";")
csv.head(5)

In [None]:
csv.tail(5)
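
The `sep=";"` argument tells `read_csv()` which character separates the columns; the default is a comma. The following sketch shows the effect on made-up content, with `io.StringIO` standing in for a file on disk.

```python
import io
import pandas as pd

# Made-up file content with semicolons as separators
raw = "Canton;Abbr;Population\nBern;BE;1031126\nUri;UR;36299\n"

wrong = pd.read_csv(io.StringIO(raw))           # default sep=",": everything ends up in one column
right = pd.read_csv(io.StringIO(raw), sep=";")  # three proper columns

print(wrong.shape)
print(right.shape)
print(right.dtypes)
```

If a freshly loaded `DataFrame` has only a single column, a wrong separator is a likely cause.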

### Excel

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In [None]:
excel = pd.read_excel("./Demo.xlsx")
excel.head(5)

### JSON

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html

In [None]:
json = pd.read_json("./Demo.json", orient="index")
json.head(5)
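
The `orient` parameter describes how the JSON document is laid out. As a sketch with made-up content (again using `io.StringIO` in place of a file): with `orient="index"` the outer keys become row labels, with `orient="columns"` they become column names.

```python
import io
import pandas as pd

# Made-up JSON content: outer keys are cantons, inner keys are attributes
raw = '{"Bern": {"Abbr": "BE", "Population": 1031126}, "Uri": {"Abbr": "UR", "Population": 36299}}'

by_index = pd.read_json(io.StringIO(raw), orient="index")      # outer keys -> rows
by_columns = pd.read_json(io.StringIO(raw), orient="columns")  # outer keys -> columns

print(by_index)
print(by_columns)
```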

### HTML

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

In [None]:
html = pd.read_html("./Demo.html", index_col=0)
html[0].head(5)

### Storing DataFrames

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion

It is also possible to save `DataFrames` as files.
We won't cover this topic for now, but you are free to explore this feature by yourself.
If it becomes relevant in this course, we will cover the topic to some level.
But it's pretty simple.
Just call the corresponding `to_*()` function and give it a filename.
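
As a minimal sketch of such a round trip, the following example writes a small, made-up `DataFrame` as CSV and reads it back. An in-memory buffer is used here instead of a filename so that no file is created; in practice you would pass a path such as `"out.csv"` (a hypothetical name).

```python
import io
import pandas as pd

df = pd.DataFrame({"Canton": ["Bern", "Uri"], "Population": [1031126, 36299]})

buffer = io.StringIO()                   # stands in for a filename
df.to_csv(buffer, sep=";", index=False)  # index=False: do not write the row index
print(buffer.getvalue())

buffer.seek(0)                           # rewind before reading
restored = pd.read_csv(buffer, sep=";")
print(restored)
```

Without `index=False`, the row index is written as an additional unnamed column, which then shows up again when the file is loaded.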

## Exercises

### Ex01 - NumPy

Create an array with 10 elements that are all 0.

Create an array with 4 elements having a value of 2.

Create a 2-dimensional (3 by 4) array with the values of 1 to 12.

Create 17 evenly spaced values from 2 to 438.

Calculate the squares for the values from 1 to 25 by using UFuncs.

Create a random number generator with seed 453.
Then generate 12 values and show only the values that are below .3.

Create a random number generator with seed 2351, generate 10'000'000 values and calculate the mean.

Create a random number generator with seed 57963, generate 100'000'000 values, and check if at least one value is above .999 and if so, how many?

#### Solution

In [None]:
# %load ./Ex02_01_Sol.py

### Ex02 - Pandas

Load the file **Ex02_02_Data.csv**.

Display the structure of the `DataFrame`.

Show the stats of the data.

What you see is the data of the World Happiness Report of 2019.
All the values besides the actual score just denote how much they contribute to the score.

Now show the first 10 and last 3 entries of the report.

Get the entry of Switzerland and store it in a new variable.

How far behind are we compared to the happiest nation?

How far above are we compared to the least happy nation?

In which country does the GDP per capita contribute the most to the happiness score?

Add the score as integer to each entry.

#### Solution

In [None]:
# %load ./Ex02_02_Sol.py