# Final assignment: Data processing

The final exercise involves converting data from one or more providers.

Since this exercise is designed to prepare you for real project work, the information you need to solve it might be slightly incomplete or not provided in context. Use your best judgment, and your client and colleagues will most likely be happy ;-)

Parts of this assignment can be solved in several ways. Use descriptive variable names, and add comments or explanatory text where needed.
The final solution should be clear to your colleagues and will be shared with some of your fellow students for review.

The data will be used for MIKE modelling and must be converted to Dfs with appropriate EUM types/units in order to be used by the MIKE software.

The data is provided as a [zip file](https://github.com/DHI/getting-started-with-mikeio/raw/main/mini_book/data/stream_data.zip) and two binary files ([](gridded-data)).

Inside the zip file, there are many timeseries (ASCII format) of discharge data from streams located across several regions (`*.dat`).

Static data for each region is found in a separate file (`region_info.csv`).

![](../images/region_info.png)

Pandas `read_csv` is very powerful, but here are a few things to keep in mind:

* Column separator e.g. comma (,)
* Blank lines
* Comments
* Missing values
* Date format
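The points above map to `read_csv` arguments roughly as in the sketch below. The sample text is invented for illustration: the separator (`;`), the comment character (`#`), the missing-value marker (`-999`), the column names and the day-first date format are all assumptions, so inspect the real `.dat` files before reusing any of these settings.

```python
import io
import pandas as pd

# Hypothetical file content; the real files may differ in every one of these details
sample = """# station: example stream
time;discharge

01/02/2005 00:00;1.5
02/02/2005 00:00;-999
03/02/2005 00:00;2.5
"""

df = pd.read_csv(
    io.StringIO(sample),
    sep=";",                # column separator
    comment="#",            # skip comment lines
    skip_blank_lines=True,  # the default, shown for clarity
    na_values=["-999"],     # marker to be treated as missing
    index_col="time",
)

# Parse the dates with an explicit format instead of relying on guessing
df.index = pd.to_datetime(df.index, format="%d/%m/%Y %H:%M")
```

Giving the date format explicitly avoids silently swapping day and month, which is easy to miss when all days are 12 or lower.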

The MIKE engine cannot handle missing values (delete values), so fill in missing values with interpolated values.
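One possible way to fill gaps is pandas' own interpolation. The sketch below uses a small synthetic series (not the assignment data) and `method="time"`, which is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Synthetic, irregularly sampled series with one gap (not the assignment data)
ts = pd.Series(
    [1.0, np.nan, 4.0],
    index=pd.to_datetime(["2005-02-01", "2005-02-02", "2005-02-04"]),
    name="discharge",
)

# method="time" weights by the actual time distance between samples
filled = ts.interpolate(method="time")
```

Note that, with default settings, `interpolate` does not fill missing values at the very start of a series, so it is worth checking for remaining NaNs afterwards.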

To save disk space, crop the timeseries to the simulation period, Feb 1 - June 30.
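Cropping can be done with label-based slicing on a `DatetimeIndex`. The sketch below uses a synthetic daily series, and the year 2005 is an assumption inferred from the time axis of the gridded data.

```python
import pandas as pd

# Synthetic daily series covering a full year; the year 2005 is an assumption
index = pd.date_range("2005-01-01", "2005-12-31", freq="D")
ts = pd.Series(range(len(index)), index=index, name="discharge")

# Label-based slicing on a DatetimeIndex includes both end points
cropped = ts.loc["2005-02-01":"2005-06-30"]
```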

## FA.1 Convert all timeseries to Dfs0

In [1]:
import os
import numpy as np
import pandas as pd
import mikeio

In [2]:
# This is one way to find and filter filenames in a directory
# [x for x in os.listdir("datafolder") if "some_str" in x]
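An alternative to `os.listdir` is `pathlib`, sketched below on a throwaway folder. The second `.dat` file name is made up for illustration; only `s15_east_novayork_river.dat` and `region_info.csv` are names mentioned in this assignment.

```python
import tempfile
from pathlib import Path

# Create a temporary folder with example file names for illustration
with tempfile.TemporaryDirectory() as tmpdir:
    folder = Path(tmpdir)
    names = ["s15_east_novayork_river.dat", "s16_west_example_creek.dat", "region_info.csv"]
    for name in names:
        (folder / name).touch()

    # glob returns paths in arbitrary order, so sort for reproducibility
    dat_files = sorted(p.name for p in folder.glob("*.dat"))
```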

In [3]:
# This is useful!
# help(pd.read_csv)

In [4]:
# example of reading csv
# df = pd.read_csv("../data/oceandata.csv", comment='#', index_col=0, sep=',', parse_dates=True)


## FA.2 Add region specific info to normalize timeseries with surface area

Each timeseries belongs to a region identified in the filename, e.g. `s15_east_novayork_river.dat` is located in the `novayork` region.

For each timeseries in the dataset:
1. Find out which region it belongs to
2. Divide the timeseries values by the surface area of the region (taking units into account)
3. Create two dfs0 files, one with discharge and another one with specific discharge *(discharge / area)*
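The first two steps above can be sketched on synthetic data as follows. Several things here are assumptions: that the region is always the third underscore-separated token of the filename (only one example name is given), the column name `area_km2`, the area value, and that discharge is in m^3/s and area in km^2. Check `region_info.csv` for the actual column names and units.

```python
import pandas as pd

def region_from_filename(filename: str) -> str:
    """Return the region token, assuming names like <id>_<direction>_<region>_<name>.dat."""
    return filename.split("_")[2]

# Hypothetical region table; real column names, values and units are in region_info.csv
region_info = pd.DataFrame(
    {"area_km2": [250.0]},
    index=pd.Index(["novayork"], name="region"),
)

# Synthetic discharge, assumed to be in m^3/s
discharge = pd.Series(
    [5.0, 10.0],
    index=pd.to_datetime(["2005-02-01", "2005-02-02"]),
    name="discharge",
)

region = region_from_filename("s15_east_novayork_river.dat")
area_m2 = region_info.loc[region, "area_km2"] * 1e6  # km^2 -> m^2
specific_discharge = discharge / area_m2             # m^3/s per m^2 = m/s
```

Converting the area to m^2 before dividing keeps the result in a plain SI unit (m/s), which makes it easier to pick a matching EUM unit when writing the dfs0 file.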

(gridded-data)=
## FA.3 Gridded weather forcing data

The dataset is provided in NumPy binary format and consists of

* Temperature 2m (degree Kelvin)
* Relative humidity 2m (%)

The spatial grid covers 40E-50E and 10N-15N, with a grid spacing of 1 degree in each direction.

The time axis consists of two timesteps, '2005-01-31' and '2005-07-31', which is sufficient to cover the simulation period.
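Before building the dataset, it can help to write down the axes and compare them with the shape of the loaded arrays. The sketch below assumes that the stated extent refers to grid nodes with both end points included, and that the arrays are ordered (time, y, x). If the extent instead describes cell edges, the grid would have one point fewer in each direction.

```python
import numpy as np
import pandas as pd

# Assumes the stated extent refers to grid nodes, end points included
lon = np.arange(40.0, 50.0 + 1.0, 1.0)  # 40E ... 50E
lat = np.arange(10.0, 15.0 + 1.0, 1.0)  # 10N ... 15N
time = pd.to_datetime(["2005-01-31", "2005-07-31"])

# Expected array shape if the axis order is (time, y, x); verify against the loaded data
expected_shape = (len(time), len(lat), len(lon))
```

Comparing `expected_shape` with the `.shape` of the loaded temperature and humidity arrays is a quick way to confirm or reject these assumptions.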

In [5]:
tmp = np.load("../data/temperature_2m.npy")
rh = np.load("../data/rel_hum_2m.npy")

In [6]:
dy = 1.0
dx = 1.0
# '6M' is a month-end frequency, so this yields 2005-01-31 and 2005-07-31
time = pd.date_range("2005-01-01", freq='6M', periods=2)

In [7]:
from mikeio import Dfs2, Dataset
from mikeio.eum import ItemInfo, EUMType, EUMUnit

In [8]:
data = [tmp, rh]
# ds = Dataset(data,...

The expected outcome

![](../images/weather_dfs2.png)

## Submission of solution

Your solution to the above tasks must be delivered as a single Jupyter notebook file.

The solution will be reviewed by a couple of your fellow students, who will provide feedback on both the correctness and clarity of your solution.

The submission and review process is handled by Campus + Eduflow.

