# Exercise 1: Introduction to the Scientific Libraries for Python

This Jupyter Notebook contains several basic exercises intended to introduce you to the most widely used scientific libraries for Python:
- [numpy](https://numpy.org/): Fundamental package for large, multi-dimensional arrays and matrices, high-level mathematical functions
- [pandas](https://pandas.pydata.org/): Offers data structures and operations for manipulating numerical tables and time series
- [matplotlib](https://matplotlib.org/index.html): Provides an object-oriented API for embedding plots into applications
- [seaborn](https://seaborn.pydata.org/): Provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas

The following tutorial will guide you through these libraries using short and relatively simple exercises.


First run the following cells to install and import the most important libraries. You can run a cell either by clicking `Run` on the toolbar or by pressing `Shift+RETURN`.

In [1]:
# Install libraries
!pip install pandas==1.5.3
!pip install numpy==1.22.4
!pip install matplotlib
!pip install seaborn

Collecting pandas==1.5.3
  Using cached pandas-1.5.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Installing collected packages: pandas
Successfully installed pandas-1.5.3
Collecting numpy==1.22.4
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.3
    Uninstalling numpy-1.22.3:
      Successfully uninstalled numpy-1.22.3
Successfully installed numpy-1.22.4
Collecting seaborn
  Using cached seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2


In [2]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Enable high resolution plots
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')

# NumPy Exercises
Check out the [NumPy Documentation](https://numpy.org/doc/).

## Arrays and Matrices

Create an array of 6 sevens of type integer using [np.ones](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ones.html). You can display variables using `print()` in Python or by simply writing the variable name as a command. However, with the latter approach Jupyter displays only the last variable in a cell.
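As a minimal sketch of the idea, using different values than the task asks for (the name `fives` is purely illustrative): `np.ones` returns floats by default, so the `dtype` argument is set explicitly and the result is scaled by a constant.

```python
import numpy as np

# np.ones returns floats by default; dtype=int requests integers instead.
fives = 5 * np.ones(4, dtype=int)

print(fives)        # explicit display, works anywhere in a cell
print(fives.dtype)  # an integer dtype (the exact width depends on the platform)
```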

Create an array of integers in the range `[10, 20]` in steps of two using [np.arange](https://numpy.org/doc/stable/reference/generated/numpy.arange.html?highlight=arange#numpy.arange).
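One detail worth keeping in mind is that `np.arange` excludes the stop value. A small sketch with illustrative numbers (not the ones from the task) shows how a closed interval can still be covered:

```python
import numpy as np

# The stop value is exclusive, so stop=11 is needed to include 10.
steps = np.arange(0, 11, 5)
print(steps)  # [ 0  5 10]
```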

Create a 4x4 matrix containing evenly spaced values in the range `[-4, 4]` using [np.linspace](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html?highlight=linspace#numpy.linspace) and [reshape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.reshape.html).
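The combination of the two calls might look like the following sketch, which uses a smaller, hypothetical 2x3 `grid` rather than the matrix from the task. Unlike `np.arange`, `np.linspace` includes both endpoints by default.

```python
import numpy as np

# Six evenly spaced values from 0 to 1 (both endpoints included),
# rearranged into 2 rows and 3 columns.
grid = np.linspace(0.0, 1.0, 6).reshape(2, 3)
print(grid)
```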

Create an array of 18 random integers from the "discrete uniform" distribution in `[10, 30)` using [np.random.randint](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html?highlight=random%20randint#numpy.random.randint). Afterwards, reshape the array to a 6x3 matrix.

Given the matrix, solve the following tasks:
- Display the shape of the matrix and the data type of its entries using [shape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html) and [dtype](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.dtype.html).
- Find the minimum and maximum values of the matrix together with their corresponding indices using [min](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.min.html), [max](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.max.html), [argmin](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.argmin.html) and [argmax](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.argmax.html).
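The tasks above can be sketched on a smaller, hypothetical matrix called `small`. The fixed seed and the use of `np.unravel_index` are assumptions added for illustration: `argmin` and `argmax` return a position in the flattened array, and `np.unravel_index` is one way to turn that into a (row, column) pair.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the sketch is reproducible
small = np.random.randint(0, 100, size=8).reshape(2, 4)

print(small.shape, small.dtype)

# argmin/argmax index into the flattened array;
# np.unravel_index converts that index to a (row, column) pair.
min_pos = np.unravel_index(small.argmin(), small.shape)
max_pos = np.unravel_index(small.argmax(), small.shape)

print(small.min(), min_pos)
print(small.max(), max_pos)
```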


## Indexing and Broadcasting
Reference: [indexing](https://numpy.org/doc/stable/user/basics.indexing.html#basics-indexing), [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).

Using what you learned from the references, create an array called `arr` of 10 random integers in `[0, 20)`.

Given the array `arr`, solve the following tasks:
- Display the last three elements of `arr`
- Display the first 2 elements of `arr`

Now create a copy of the last 4 elements of `arr` and call it `arr_copy` using [copy](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.copy.html).
Afterwards replace the last element of `arr_copy` with 200 and display `arr_copy`.

Display `arr`.
Notice that the values in the original array were not affected by the changes made to a copy of its slice. What would happen if you removed `.copy()` from the command?
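The difference between a slice and a copy of a slice can be explored with a sketch like the one below. The names `base`, `view` and `independent` are hypothetical and the values differ from the task.

```python
import numpy as np

base = np.arange(6)

view = base[-3:]                # basic slicing returns a view on the same data
independent = base[-3:].copy()  # .copy() allocates separate memory

independent[-1] = 100  # leaves base untouched
view[0] = -1           # writes through to base

print(base)         # [ 0  1  2 -1  4  5]
print(independent)  # [  3   4 100]
```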

Create a matrix called `mat` containing 24 randomly distributed integers in `[1, 50)` with shape (6, 4) and display the last element of the first column.


Now display the third row and the last column of `mat`.


Display the elements of `mat` that are greater than 20.

Now display only the elements of `mat` in `[10,30)`.
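The indexing patterns used in the tasks above can be tried out on a small, deterministic matrix first. This is only a sketch: the matrix `demo` and its thresholds are illustrative and not part of the exercise.

```python
import numpy as np

demo = np.arange(12).reshape(3, 4)

print(demo[-1, 0])  # last element of the first column -> 8
print(demo[1])      # second row -> [4 5 6 7]
print(demo[:, -1])  # last column -> [ 3  7 11]

# A comparison yields a boolean mask; indexing with the mask returns
# a flat array of the selected elements.
print(demo[demo > 8])                  # [ 9 10 11]

# Conditions are combined element-wise with & (parentheses are required).
print(demo[(demo >= 2) & (demo < 5)])  # [2 3 4]
```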


## Operations

Create an array `arr1` containing integers in `[1, 10]` and an array `arr2` containing even integers in `[-4, 16)`.

Check if `arr1` and `arr2` have the same length.

Add, subtract, multiply and divide the corresponding elements of `arr1` and `arr2` in two different ways, as described in [add](https://docs.scipy.org/doc/numpy/reference/generated/numpy.add.html), [subtract](https://docs.scipy.org/doc/numpy/reference/generated/numpy.subtract.html), [multiply](https://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html) and [divide](https://docs.scipy.org/doc/numpy/reference/generated/numpy.divide.html).
Keep in mind that dividing a nonzero number by zero does not raise an error: NumPy emits a `RuntimeWarning` and returns `inf` (available as the constant `np.inf`).
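The two ways of writing element-wise arithmetic, and the behaviour of division by zero, are sketched below on two short hypothetical arrays `a` and `b`. The use of `np.errstate` to silence the warning is an addition for illustration and is not required by the exercise.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 0, -3])

# Function form and operator form give identical results.
print(np.array_equal(np.add(a, b), a + b))
print(np.array_equal(np.subtract(a, b), a - b))
print(np.array_equal(np.multiply(a, b), a * b))

# Dividing by zero does not raise; NumPy emits a RuntimeWarning instead.
# np.errstate suppresses that warning inside this block only.
with np.errstate(divide="ignore"):
    quotient = np.divide(a, b)

print(quotient)
print(np.isinf(quotient))
```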

Calculate the following scalar value: $\log(\sqrt {arr_1} \cdot arr_2^3)$. Note that `arr1` and `arr2` have the form of vectors.

In [3]:
# 1. Compute the square root of `arr1`.
# Reference: [sqrt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html).
# 2. Compute `arr2` to the third power.
# Reference: [power](https://docs.scipy.org/doc/numpy/reference/generated/numpy.power.html).
# 3. Compute the dot product of the vectors produced in steps 1 and 2.
# Reference: [dot](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html).
# 4. Compute the natural logarithm of the dot product.
# Reference: [log](https://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html).
# 5. Display the result.
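The five steps can be traced by hand on two short stand-in vectors before applying them to `arr1` and `arr2`. The vectors `u` and `v` below are hypothetical and were chosen so that every intermediate value is easy to check.

```python
import numpy as np

u = np.array([1.0, 4.0, 9.0])  # stand-in for the first vector
v = np.array([1.0, 2.0, 1.0])  # stand-in for the second vector

root = np.sqrt(u)             # [1. 2. 3.]
cubed = np.power(v, 3)        # [1. 8. 1.]
scalar = np.dot(root, cubed)  # 1*1 + 2*8 + 3*1 = 20
result = np.log(scalar)       # natural logarithm of 20

print(result)
```

The dot product collapses the two vectors into a single number, which is why the final result is a scalar rather than an array.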



Create an array called `gauss` with 10000 samples with an expected value of 0 and a standard deviation of 1 using [np.random.normal](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html#numpy.random.normal).
Compute the mean and standard deviation of `gauss` and compare them with your expectation.
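A comparison of this kind could look like the following sketch, which deliberately uses other parameters than the task (the name `samples`, the seed, and the values 5 and 2 are assumptions for illustration). With a finite sample, the estimates will be close to, but not exactly equal to, the parameters passed in.

```python
import numpy as np

np.random.seed(1)  # fixed seed only so the sketch is reproducible
samples = np.random.normal(loc=5.0, scale=2.0, size=10000)

# Sample estimates fluctuate around the true parameters.
print(samples.mean())
print(samples.std())
```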

Run the next cell to generate an array called `gauss`.

# pandas Exercises
A pandas Series is a one-dimensional labeled array that can be indexed by label instead of only by integer position. Furthermore, a Series can store arbitrary Python objects.


Create a Series called `grocery` containing prices `2, 1, 3, 5, 4` as data and labels `bread, apple, cheese, beer, milk` as index.

Create another Series called `amount` containing the values `10, 20, 15, 30, 15` with the same labels and calculate the total cost for beer and the total cost of all goods using [multiply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.multiply.html) and [dot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dot.html).
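To illustrate how labeled Series combine, here is a sketch with hypothetical data (`weight`, `count` and their labels are not part of the exercise). Both `multiply` and `dot` align the two Series on their index labels.

```python
import pandas as pd

labels = ["pen", "book", "lamp"]
weight = pd.Series([0.5, 1.5, 2.0], index=labels)
count = pd.Series([4, 2, 1], index=labels)

# Element-wise product, aligned on the index labels.
per_item = weight.multiply(count)
print(per_item["book"])  # 1.5 * 2 = 3.0

# Sum of the element-wise products: 2.0 + 3.0 + 2.0 = 7.0
print(weight.dot(count))
```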


Now create a DataFrame called `shopping_list` with `price` and `amount` as columns using [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

Add a column called `total_cost` containing the total cost per good and check how many beers are in stock using [at](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html).

Another way to access the elements within a DataFrame is via the indexers [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html), which can also access several elements at the same time. The main difference between them is that [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) selects rows and columns by label, while [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) selects rows and columns by integer position.
First, check how many beers are in stock using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and afterwards check it again by using [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html).
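The three accessors can be compared side by side on a small hypothetical table. The DataFrame `stock` and its column names are assumptions made for this sketch only.

```python
import pandas as pd

stock = pd.DataFrame(
    {"price": [0.5, 1.5, 2.0], "units": [4, 2, 1]},
    index=["pen", "book", "lamp"],
)
# A new column is created by assigning to a new column label.
stock["value"] = stock["price"] * stock["units"]

print(stock.at["book", "units"])   # single cell, by label
print(stock.loc["book", "units"])  # same cell, by label
print(stock.iloc[1, 1])            # same cell, by integer position

# loc and iloc also accept lists or slices to select several cells at once.
print(stock.loc[["pen", "lamp"], ["price", "value"]])
print(stock.iloc[0:2, 0:2])
```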

To arrange and group the content of DataFrames according to their keys, use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function. To demonstrate this, we look at another example. First, create a DataFrame with the columns `Persons` and `Ice_cream`. The columns shall contain the data `Max, Julia, Thomas, Annika, Marc, Lina, Maria` and `Chocolate, Vanilla, Chocolate, Strawberry, Strawberry, Chocolate, Strawberry` in this order. Display the DataFrame. After that, group the DataFrame by the different ice cream types. Display the groups of the resulting [GroupBy object](https://pandas.pydata.org/docs/reference/groupby.html).
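The grouping mechanism can be sketched on unrelated, hypothetical data (the DataFrame `pets` below is not part of the exercise). The `groups` attribute of a GroupBy object maps each key to the row labels that belong to it.

```python
import pandas as pd

pets = pd.DataFrame({
    "Owner": ["Ana", "Ben", "Cleo", "Dan"],
    "Pet": ["cat", "dog", "cat", "fish"],
})

grouped = pets.groupby("Pet")

print(grouped.groups)  # group key -> row labels of its members
print(grouped.size())  # number of rows per group

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
for pet, members in grouped:
    print(pet, members["Owner"].tolist())
```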

# Matplotlib

Reference: [Matplotlib - Quick start guide](https://matplotlib.org/stable/tutorials/introductory/quick_start.html)

Create an array called `data1` of 10000 random numbers following a normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 2$. Display the distribution of `data1` using [plt.hist](https://matplotlib.org/stable/plot_types/stats/hist_plot.html#sphx-glr-plot-types-stats-hist-plot-py) and think about a suitable binning of the x-axis.

Display the normalized distribution of the array `data1` using the `density` parameter of [plt.hist](https://matplotlib.org/stable/plot_types/stats/hist_plot.html#sphx-glr-plot-types-stats-hist-plot-py).
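What "normalized" means here can be checked numerically. The sketch below uses hypothetical data (`values`, with other parameters than the task) and an arbitrary choice of 40 bins: with `density=True` the bar heights are rescaled so that the total bar area equals 1.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)  # fixed seed only so the sketch is reproducible
values = np.random.normal(loc=1.0, scale=0.5, size=5000)

fig, ax = plt.subplots()
# hist returns the bar heights, the bin edges and the drawn patches.
heights, edges, _ = ax.hist(values, bins=40, density=True)
ax.set_xlabel("value")
ax.set_ylabel("probability density")

# Area under the histogram: sum of height * bin width.
area = np.sum(heights * np.diff(edges))
print(area)
```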

Now plot a perfect normal distribution curve with mean $\mu = 0$ and standard deviation $\sigma = 2$ in the same figure and compare it with the distribution. Use the following function for the normal distribution:

In [4]:
def normal_distribution(x, mu, sig):
    return 1/np.sqrt(2*np.pi*sig**2) * np.exp(-(x-mu)**2/(2*sig**2))
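One possible way to overlay such a curve on a density histogram is sketched below. The function is repeated so that the block runs on its own; the parameters `mu = 1.0`, `sig = 0.5`, the bin count and the plotting range of four standard deviations are illustrative assumptions, not the values from the task.

```python
import numpy as np
import matplotlib.pyplot as plt

def normal_distribution(x, mu, sig):
    return 1/np.sqrt(2*np.pi*sig**2) * np.exp(-(x-mu)**2/(2*sig**2))

mu, sig = 1.0, 0.5  # illustrative parameters
np.random.seed(3)   # fixed seed only so the sketch is reproducible
values = np.random.normal(mu, sig, size=5000)

# Evaluate the curve on a fine grid covering four standard deviations.
x = np.linspace(mu - 4 * sig, mu + 4 * sig, 200)

fig, ax = plt.subplots()
# density=True puts the histogram on the same scale as the curve.
ax.hist(values, bins=40, density=True, alpha=0.5, label="samples")
ax.plot(x, normal_distribution(x, mu, sig), label="analytic curve")
ax.legend()
```

Without `density=True` the histogram would show raw counts, and the curve would appear almost flat next to it.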

Create another array called `data2` of 10000 random numbers following a normal distribution with mean $\mu=2$ and standard deviation $\sigma=3$.
Display both datasets in one scatter plot using [plt.scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html), where the x value is given by `data1` and the y value is given by `data2`.


To better evaluate the distribution of the entire data set, two-dimensional histograms are a suitable tool.
Display both data sets (`data1` on the x-axis, `data2` on the y-axis) using [plt.hist2d](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist2d.html). Think about suitable binning parameters and don't forget the colorbar.
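A two-dimensional histogram with a colorbar might be set up as in the following sketch. The arrays `xs` and `ys`, their parameters and the 30x30 binning are assumptions for illustration; `hist2d` returns the drawn image as its fourth value, which is what the colorbar is attached to.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(4)  # fixed seed only so the sketch is reproducible
xs = np.random.normal(0.0, 1.0, size=5000)
ys = np.random.normal(1.0, 0.5, size=5000)

fig, ax = plt.subplots()
# counts has shape (x bins, y bins); image is the object the colorbar describes.
counts, xedges, yedges, image = ax.hist2d(xs, ys, bins=(30, 30))
fig.colorbar(image, ax=ax, label="counts per bin")
ax.set_xlabel("xs")
ax.set_ylabel("ys")
```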

Optional: To hide the zero entries in the histogram and thus see the distribution itself more clearly,
you can first calculate the two-dimensional distribution of the datasets using [np.histogram2d](https://numpy.org/doc/stable/reference/generated/numpy.histogram2d.html),
replace the zero entries with `np.nan` and display the distribution using [plt.pcolormesh](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.pcolormesh.html).
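The optional approach could be sketched as follows, again with hypothetical arrays `xs` and `ys` and an assumed 30x30 binning. One detail to watch: `np.histogram2d` stores x along the first axis, whereas `pcolormesh` expects rows to correspond to y, so the counts are transposed before plotting.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(4)  # fixed seed only so the sketch is reproducible
xs = np.random.normal(0.0, 1.0, size=5000)
ys = np.random.normal(1.0, 0.5, size=5000)

counts, xedges, yedges = np.histogram2d(xs, ys, bins=30)

# Empty bins become NaN, which pcolormesh leaves undrawn.
masked = np.where(counts == 0, np.nan, counts)

fig, ax = plt.subplots()
# Transpose so that rows correspond to y and columns to x.
mesh = ax.pcolormesh(xedges, yedges, masked.T)
fig.colorbar(mesh, ax=ax, label="counts per bin")
```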

# Seaborn
An efficient alternative that can create many plots with less effort is the Python package [seaborn](https://seaborn.pydata.org/index.html). Now create all plots again using only [seaborn](https://seaborn.pydata.org/index.html).
First, create a pandas DataFrame with `data1` and `data2` as columns and use this DataFrame as input data for the seaborn functions.
Examine the options for the `stat` parameter of [sns.histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html) and be aware of the differences!