In [None]:
import pandas as pd
from pathlib import Path

dirpath = Path("../../datasets/kontali")

# Currency

### Read the file `currency.csv` into a pandas DataFrame

Hints:
1. the function to use here is `pd.read_csv`
1. we can specify the path to `currency.csv` with `os.path.join` or `pathlib.Path` from the standard library
1. we must use the keyword argument `encoding="latin-1"`
1. we should also specify two other keywords: one for the field separator and one for the decimal mark

In [None]:
df_currency = pd.read_csv(dirpath / "currency.csv", encoding="latin-1", delimiter=";", decimal=",")
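To see what `delimiter` and `decimal` do, here is a minimal sketch on an in-memory sample. The sample rows and values are made up for illustration; only the assumed layout (semicolon-separated fields, comma as the decimal mark) is taken from the call above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the assumed layout of currency.csv:
# fields separated by semicolons, comma used as the decimal mark.
sample = (
    "Dato;1 EUR;100 DKK\n"
    "2020-01-02;9,84;131,70\n"
    "2020-01-03;9,85;131,85\n"
)

# Without decimal=",", values such as "9,84" would be read as strings.
df_sample = pd.read_csv(io.StringIO(sample), delimiter=";", decimal=",")
print(df_sample.dtypes)
```

With both keywords set, the rate columns should come out as floats rather than strings.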

### What is the most recent date for which we have foreign exchange rates?

In [None]:
df_currency.tail()

### Change `Dato` dtype to Datetime, and move `Dato` to the index.

In [None]:
df_currency["Dato"] = pd.to_datetime(df_currency["Dato"])
df_currency.set_index("Dato", inplace=True)
df_currency.head()
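Once `Dato` is a `DatetimeIndex`, the most recent date can also be read directly from the index, which does not depend on the rows being sorted. A small sketch with made-up data follows; the date format used here is an assumption and may differ from the real file.

```python
import pandas as pd

# Hypothetical sample; the real file may use another date format.
df_sample = pd.DataFrame(
    {"Dato": ["2020-01-03", "2020-01-02"], "1 EUR": [9.85, 9.84]}
)

# Convert the strings to datetimes and move the column to the index.
df_sample["Dato"] = pd.to_datetime(df_sample["Dato"], format="%Y-%m-%d")
df_sample = df_sample.set_index("Dato")

# The latest date, regardless of row order.
latest = df_sample.index.max()
print(latest)
```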

### Extract the unit (1 or 100) and the currency name from the column labels

Hint: You can use the `.str` accessor on `df_currency.columns`. Here is a useful [link](https://pandas.pydata.org/docs/user_guide/text.html#extracting-substrings) on extracting substrings from a string.

The following regular expression pattern can be used to get the unit and currency:

`pattern = r"(\d+) (\w+)"`


#### Explanation of the regular expression pattern (can be skipped)

The `r` prefix before the string indicates that it is a [raw string literal](https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-prefixes-do-and-what-are-raw-string-literals), which in our case means that the backslash is just treated as a backslash.

`\d+` matches one or more Unicode decimal digits (the plus means "at least one").

`\w+` matches one or more Unicode word characters (letters, digits, and the underscore).

The parentheses indicate a "capture group". Each capture group will become a column in the returned dataframe.

To sum up, the regular expression first matches one or more digits, then a single space, then one or more word characters.
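The pattern can be tried on a single string with Python's `re` module before applying it to all column labels. The label `"100 DKK"` below is an assumed example of what a column label looks like.

```python
import re

pattern = r"(\d+) (\w+)"

# Assumed example label: a unit, a space, then a currency code.
match = re.search(pattern, "100 DKK")
unit, currency = match.groups()
print(unit, currency)

# A string without digits does not match, so re.search returns None.
no_match = re.search(pattern, "Dato")
print(no_match)
```

Both captured values are strings, even the unit.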

In [None]:
pattern = r"(\d+) (\w+)"

df_ccy = df_currency.columns.str.extract(pattern)
df_ccy

Did it work? Great!

Let's give names to the two capture groups. This will give us **correctly named columns** in the returned dataframe.

Now the pattern is looking really complex, but we don't really have to concern ourselves with that. Having `?P<unit>` inside the capture group just means: *the captured group should be named `unit`*.

`pattern = r"(?P<unit>\d+) (?P<Ccy>\w+)"`
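The effect of naming the groups is easiest to see side by side. This sketch uses a small hand-made `pd.Index` with two assumed labels rather than the real column labels.

```python
import pandas as pd

# Hypothetical labels in the same "unit currency" form.
labels = pd.Index(["1 EUR", "100 DKK"])

# Unnamed groups: the columns are numbered 0 and 1.
unnamed = labels.str.extract(r"(\d+) (\w+)")
print(unnamed.columns.tolist())

# Named groups: the columns take the group names.
named = labels.str.extract(r"(?P<unit>\d+) (?P<Ccy>\w+)")
print(named.columns.tolist())
```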

In [None]:
pattern = r"(?P<unit>\d+) (?P<Ccy>\w+)"
df_ccy = df_currency.columns.str.extract(pattern, expand=True)
df_ccy

Great! We have been able to split the column header into its two parts "unit" and "Ccy" (currency).

The "unit" column has string values (`object` type). That's because we extracted the unit from the column labels, which are strings.

Let's change the dtype to integer type.

In [None]:
df_ccy = df_ccy.astype({"unit": "int64"})
df_ccy
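Passing a dictionary to `astype` converts only the listed columns and leaves the others untouched. A minimal sketch on a hand-made frame with assumed values:

```python
import pandas as pd

# Hypothetical frame shaped like the extraction result: both columns hold strings.
df_units = pd.DataFrame({"unit": ["1", "100"], "Ccy": ["EUR", "DKK"]})

# Convert only "unit"; "Ccy" keeps its string values.
df_units = df_units.astype({"unit": "int64"})
print(df_units.dtypes)
```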

Now let's copy the currency dataframe and rename the columns.

In [None]:
df_cur = df_currency.copy()

df_cur.columns = df_ccy.Ccy
df_cur

The next step is to divide the values in `df_cur` by the units from `df_ccy`. To do that, the column labels of `df_cur` must match the index labels of the units we divide by.

Since we want to match on the currency names (`Ccy`), we set `df_ccy` index to `Ccy`.

In [None]:
df_ccy.set_index('Ccy', inplace=True)

df_ccy

Now the column labels in `df_cur` match up with the index labels in `df_ccy`!

Therefore we can perform the division. EUR is divided by 1, DKK is divided by 100, etc.

In [None]:
# note that we are dividing a DataFrame by a Series
df_cur / df_ccy["unit"]
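When a DataFrame is divided by a Series, pandas matches the Series index against the DataFrame's column labels, so the order of the labels does not matter. The sketch below uses made-up rates and units to illustrate this.

```python
import pandas as pd

# Hypothetical rates, quoted per 1 EUR and per 100 DKK.
rates = pd.DataFrame(
    {"EUR": [9.84, 9.85], "DKK": [131.70, 131.85]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]),
)

# Units keyed by currency name; the order differs from the columns on purpose.
units = pd.Series({"DKK": 100, "EUR": 1}, name="unit")

# Each column is divided by the unit with the same label.
per_unit = rates / units
print(per_unit)
```

The result may list the columns in a different order than `rates`, but every value is divided by the unit belonging to its own currency.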

We can also use `df.div`:

In [None]:
# axis=1 matches the Series index against the column labels of df_cur
df_cur.div(df_ccy['unit'], axis=1)