## Exercises for Numpy, Matplotlib and Pandas

### Notebook examples
* Check in detail the examples presented
* Test variations (data creation, plot options, ...)

### Data-analysis-1
The file [rohr1.dat](http://www.etp.physik.uni-muenchen.de/kurs/comp14/uebungen/source/rohr1.dat) contains a list of measurements of wire positions of drift tubes used in the ATLAS Muon Chamber System.

Read the numbers with:
 ```Ipython
data = numpy.loadtxt('rohr1.dat')
```
1. Determine the mean and standard deviation *(Hint: NumPy functions)*

1. Fill the values into a histogram and plot it.
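
The two steps above can be sketched as follows. This is only a minimal illustration on synthetic numbers: the array `data` here is a random stand-in (an assumption) for what `numpy.loadtxt('rohr1.dat')` returns, and the choice of 50 bins is arbitrary.

```python
import numpy as np

# Synthetic stand-in for the contents of rohr1.dat
# (assumption: the file yields a 1-D array of floats)
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=0.0, scale=10.0, size=1000)

mean = np.mean(data)
std = np.std(data, ddof=1)  # ddof=1: sample standard deviation (divide by N-1)

# Bin the values; counts[i] is the number of entries between edges[i] and edges[i+1]
counts, edges = np.histogram(data, bins=50)

print(f"mean = {mean:.3f}, std = {std:.3f}")
print(f"{counts.sum()} entries in {len(counts)} bins")
```

For the plot itself, `matplotlib.pyplot.hist(data, bins=50)` does the binning and drawing in one call.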

In a similar way, read the (x,y) coordinates from the file
[rohr2.dat](http://www.etp.physik.uni-muenchen.de/kurs/comp14/uebungen/source/rohr2.dat) using
```Ipython
x,y = numpy.loadtxt('rohr2.dat',unpack=True)
```

Determine the mean and standard deviation of both x and y, as well as their correlation.

Visualize the data:
1. 1D histogram of both x and y
1. (x,y) point plot
1. 2D histogram
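
A possible starting point for the correlation and the 2D histogram, again on synthetic data: the arrays `x` and `y` below are generated with a known correlation and merely stand in (an assumption) for the two columns of rohr2.dat.

```python
import numpy as np

# Synthetic correlated (x, y) pairs as a stand-in for rohr2.dat
rng = np.random.default_rng(seed=2)
x = rng.normal(0.0, 1.0, size=2000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=2000)

# Pearson correlation coefficient: off-diagonal element of the 2x2 correlation matrix
rho = np.corrcoef(x, y)[0, 1]

# 2D histogram: H[i, j] counts the points with x in bin i and y in bin j
H, xedges, yedges = np.histogram2d(x, y, bins=20)

print(f"correlation = {rho:.3f}")
print(f"2D histogram shape: {H.shape}")
```

In matplotlib, `plt.scatter(x, y)` gives the point plot and `plt.hist2d(x, y, bins=20)` draws the 2D histogram directly.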
 
 
 
### Data-analysis-2
The file [faithful.csv](https://people.sc.fsu.edu/~jburkardt/data/csv/faithful.csv) contains measurement data of the **Old Faithful Geyser**, i.e. the duration of each eruption and the time since the last eruption.

Download the file (`wget http://...`) and read the data with numpy:
```Ipython
data=numpy.loadtxt('faithful.csv',delimiter=',',skiprows=1)
```
(Why are these options needed in loadtxt(...)?)

1. Determine again the mean and standard deviation for duration and wait time
1. Fill histograms for both values and plot them. Are they compatible with a Gaussian distribution?
1. Are there correlations between wait time and duration, or between the durations of subsequent eruptions?
(Make (x,y) plots for both)
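
For the last item, one way to compare subsequent eruptions is to pair each value with the one that follows it by slicing the array. The sketch below uses a tiny in-memory CSV so that it runs without the download; the header names and the five rows are illustrative assumptions and do not reproduce the layout of faithful.csv.

```python
import io
import numpy as np

# Tiny in-memory CSV with one header line; names and values are illustrative only
csv_text = """duration,wait
3.6,79
1.8,54
3.3,74
2.3,62
4.5,85
"""

# delimiter=',' splits the columns at commas, skiprows=1 skips the non-numeric header
data = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1)
duration = data[:, 0]
wait = data[:, 1]

# Correlation between the two measured quantities
rho = np.corrcoef(duration, wait)[0, 1]

# Pair each eruption with the following one: (d[0], d[1]), (d[1], d[2]), ...
current = duration[:-1]
following = duration[1:]
rho_next = np.corrcoef(current, following)[0, 1]

print(f"duration vs. wait: {rho:.3f}")
print(f"duration vs. next duration: {rho_next:.3f}")
```

The pairs `(current, following)` can be passed straight to a scatter plot to look for structure between consecutive eruptions.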




### Breast Cancer Dataset
A frequently used data set for ML (which we will use later on) is a data set for *breast cancer diagnosis*.

The code below reads it and converts it into a pandas DataFrame.

Extract basic statistical information and plot some features.


In [None]:
# load dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
import pandas as pd
df = pd.DataFrame( cancer.data, columns=cancer.feature_names)
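
As a hint for the statistics part, the following sketch shows two pandas methods on a toy frame. The frame `df_toy` and its column names are hypothetical placeholders (an assumption), chosen so that the example runs without scikit-learn; the same calls apply to the `df` created above.

```python
import pandas as pd

# Toy frame standing in for the breast-cancer DataFrame;
# column names and values are hypothetical
df_toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [10.0, 20.0, 20.0, 50.0],
})

# Per-column count, mean, std, min, quartiles and max
summary = df_toy.describe()

# Pairwise Pearson correlation between all columns
corr = df_toy.corr()

print(summary)
print(corr)
```

For plots, `df_toy["feature_a"].hist()` or `df_toy.plot.scatter(x="feature_a", y="feature_b")` are possible starting points (both need matplotlib).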


### Weather Data
* Investigate other variables or aggregations, e.g. the variance of temperature (TMK), daily sunshine (SDK), snow height (SHK_TAG), number of days with snow, ...
* 2018 had a rather cold winter, and the night of Feb 26, 2018 was reportedly the coldest night of that winter season at -27 °C. Check the Zugspitze data for the yearly minimum temperature. How often has it been colder?
* Take data from other weather stations ([DWD Archiv](https://www.dwd.de/DE/leistungen/klimadatendeutschland/klarchivtagmonat.html), e.g. Hohenpeißenberg, Helgoland, ...)
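
The yearly-minimum question can be approached by grouping the daily values by calendar year. The sketch below runs on synthetic temperatures; the frame `weather` and the column name `t_min` are placeholders (an assumption) and have to be replaced by the actual frame and the minimum-temperature column of the station data.

```python
import numpy as np
import pandas as pd

# Synthetic daily minimum temperatures with a seasonal cycle plus noise;
# the column name 't_min' is a placeholder, not a DWD column name
dates = pd.date_range("2000-01-01", "2004-12-31", freq="D")
rng = np.random.default_rng(seed=5)
day_of_year = dates.dayofyear.to_numpy()
seasonal = -5.0 - 10.0 * np.cos(2.0 * np.pi * day_of_year / 365.25)
weather = pd.DataFrame(
    {"t_min": seasonal + rng.normal(0.0, 4.0, len(dates))},
    index=dates,
)

# Group by calendar year and take the lowest value of each year
yearly_min = weather["t_min"].groupby(weather.index.year).min()

# Count the years whose minimum lies below the quoted -27 degrees
threshold = -27.0
n_colder = int((yearly_min < threshold).sum())

print(yearly_min)
print(f"years with a minimum below {threshold}: {n_colder}")
```

Replacing `.min()` by `.var()` or by a count of days fulfilling a condition covers the other aggregations mentioned above.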


### Energy Charts

How to manage the transition to renewable energy production is a highly disputed and controversial subject.
The energy-charts at
https://www.energy-charts.info/index.html?l=de&c=DE
provide interesting input to the discussion: they show a timeline of electricity usage together with production from different sources. One can also export the data in CSV format and use pandas for more detailed investigations:



In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import numpy as np

# read the exported CSV, using the timestamp column as a parsed datetime index
url = 'http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/sw/source/energy-charts_Oeffentliche_Nettostromerzeugung_in_Deutschland_2021.csv'
df = pd.read_csv(url, index_col='Datum (GMT+1)', parse_dates=['Datum (GMT+1)'], engine='python')
print(df.size)
print(df.columns)
# combine wind
df['Wind'] = df['Wind Onshore'] + df['Wind Offshore']

In [None]:
# plot short date range
day = '2021-08-10'
day2 ='2021-08-16'
df['Last'][day:day2].plot()
df.Solar[day:day2].plot()
df.Wind[day:day2].plot()

In [None]:
# plot weekly sums (use .mean() instead of .sum() for weekly averages)
dfm = df.resample('W').sum()
dfm.Last.plot()
dfm.Wind.plot()
dfm.Solar.plot()


* How much would you need to scale up solar or wind production to match consumption ('Last')?
* How big is the gap (= sum over the periods when production is below consumption) for solar-only (and wind-only)?
* Minimize that gap by combining solar and wind.
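
One possible way to set up these three questions is sketched below on synthetic hourly data. The frame `energy`, the helper functions `scale_to_match` and `gap`, and the simple scan over mixing fractions are all assumptions made for illustration; only the column names `Last`, `Solar` and `Wind` are taken from the cells above.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for one week (arbitrary units)
idx = pd.date_range("2021-08-10", periods=7 * 24, freq=pd.Timedelta(hours=1))
hours = idx.hour.to_numpy()
rng = np.random.default_rng(seed=6)
energy = pd.DataFrame(
    {
        "Last": 50.0 + 10.0 * np.sin(2.0 * np.pi * (hours - 6) / 24.0),
        "Solar": np.clip(30.0 * np.sin(np.pi * (hours - 6) / 12.0), 0.0, None),
        "Wind": rng.uniform(5.0, 25.0, len(idx)),
    },
    index=idx,
)


def scale_to_match(production, load):
    """Factor that makes total production equal to total consumption."""
    return load.sum() / production.sum()


def gap(production, load):
    """Sum of the shortfall over all periods where production is below consumption."""
    return (load - production).clip(lower=0.0).sum()


f_solar = scale_to_match(energy["Solar"], energy["Last"])
f_wind = scale_to_match(energy["Wind"], energy["Last"])
gap_solar = gap(f_solar * energy["Solar"], energy["Last"])
gap_wind = gap(f_wind * energy["Wind"], energy["Last"])

# Scan the solar share a of a mix whose total always equals total consumption
fractions = np.linspace(0.0, 1.0, 21)
gaps = [
    gap(a * f_solar * energy["Solar"] + (1.0 - a) * f_wind * energy["Wind"], energy["Last"])
    for a in fractions
]
best_fraction = fractions[int(np.argmin(gaps))]

print(f"scale factors: solar {f_solar:.2f}, wind {f_wind:.2f}")
print(f"gaps: solar-only {gap_solar:.1f}, wind-only {gap_wind:.1f}")
print(f"smallest gap {min(gaps):.1f} at solar share {best_fraction:.2f}")
```

The scan keeps total production fixed and only varies the share, which is one of several reasonable definitions of "combining"; a finer grid or `scipy.optimize` could replace the simple loop.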