In [56]:
import pandas as pd
from pathlib import Path

dirpath = Path("../../datasets/kontali")

# Currency

### Read the file `currency.csv` into a pandas DataFrame

Hints:
1. the function to use here is `pd.read_csv`
1. we can specify the path to `currency.csv` with `os.path.join` or `pathlib.Path` from the standard library
1. we must use the keyword argument `encoding="latin-1"`
1. we should also specify two other keywords

In [57]:
df_currency = pd.read_csv(dirpath / "currency.csv", encoding="latin-1", delimiter=";", decimal=",")
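To see what the two extra keywords do, here is a minimal sketch on an in-memory sample. The sample text only mimics the assumed layout of `currency.csv` (semicolon-separated fields, decimal commas); it is not the real file.

```python
import io

import pandas as pd

# Assumed layout: semicolon-separated fields with decimal commas.
sample = (
    "Dato;1 USD;100 SEK\n"
    "2018-01-02;8,1018;99,46\n"
    "2018-01-03;8,1045;99,18\n"
)

# Without decimal=",", the rates are read as text, not as floats.
df_text = pd.read_csv(io.StringIO(sample), delimiter=";")

# With both keywords, the rate columns are parsed as floats.
df_num = pd.read_csv(io.StringIO(sample), delimiter=";", decimal=",")

print(df_text.dtypes)
print(df_num.dtypes)
```

Comparing the two `dtypes` printouts shows why `decimal=","` matters for any later arithmetic on the rates.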

### What is the most recent date for which we have foreign exchange rates?

In [58]:
df_currency.tail()

,Dato,1 USD,1 EUR,100 DKK,1 GBP,100 SEK
1242,2022-12-14,9.7305,10.362,139.29,12.0323,95.38
1243,2022-12-15,9.7931,10.4013,139.83,12.0673,95.44
1244,2022-12-16,9.8722,10.4833,140.94,12.0176,95.17
1245,2022-12-19,9.9099,10.5025,141.2,12.0555,95.42
1246,2022-12-20,9.9158,10.5098,141.28,12.0071,95.01
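`tail()` shows the latest date only if the rows are sorted chronologically. As a hedged alternative, the sketch below takes the maximum of the date column, which does not depend on row order. The small frame is illustrative sample data (a few rows from the output above, deliberately shuffled), not the real file.

```python
import pandas as pd

# Illustrative sample, deliberately out of chronological order.
df_sample = pd.DataFrame(
    {
        "Dato": ["2022-12-19", "2022-12-20", "2022-12-16"],
        "1 USD": [9.9099, 9.9158, 9.8722],
    }
)

# Parse the strings to timestamps and take the latest one.
most_recent = pd.to_datetime(df_sample["Dato"]).max()
print(most_recent.date())
```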


### Change `Dato` dtype to Datetime, and move `Dato` to the index.

In [59]:
df_currency["Dato"] = pd.to_datetime(df_currency["Dato"])
df_currency.set_index("Dato", inplace=True)
df_currency.head()

Dato,1 USD,1 EUR,100 DKK,1 GBP,100 SEK
2018-01-02,8.1018,9.7748,131.32,10.9887,99.46
2018-01-03,8.1045,9.744,130.89,10.9928,99.18
2018-01-04,8.0941,9.7655,131.17,10.9598,99.42
2018-01-05,8.0878,9.7418,130.83,10.9603,99.08
2018-01-08,8.0834,9.6783,129.97,10.9467,98.59
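A datetime index can also be requested while reading the file. This is a sketch on an in-memory sample with an assumed file layout, using the `index_col` and `parse_dates` keywords of `pd.read_csv`.

```python
import io

import pandas as pd

# Assumed layout of the file; values copied from the output above.
sample = (
    "Dato;1 USD;1 EUR\n"
    "2018-01-02;8,1018;9,7748\n"
    "2018-01-03;8,1045;9,744\n"
)

# index_col moves "Dato" to the index; parse_dates=True parses that index as dates.
df_sample = pd.read_csv(
    io.StringIO(sample),
    delimiter=";",
    decimal=",",
    index_col="Dato",
    parse_dates=True,
)

print(df_sample.index)
```

Doing the conversion in two explicit steps, as in the cell above, makes each transformation easier to inspect; the keyword version is simply more compact.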


### Extract the unit (1 or 100) and the currency name from the column labels

Hint: You can use the `.str` accessor on `df_currency.columns`. Here is a useful [link](https://pandas.pydata.org/docs/user_guide/text.html#extracting-substrings) on extracting substrings from a string.

The following regular expression pattern can be used to extract the unit and the currency:

`pattern = r"(\d+) (\w+)"`


#### Explanation of the regular expression pattern (can be skipped)

The `r` prefix before the string indicates that it is a [raw string literal](https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-prefixes-do-and-what-are-raw-string-literals), which in our case means that the backslash is just treated as a backslash.

`\d+` matches one or more Unicode decimal digits (the plus means "at least one").

`\w+` matches one or more Unicode word characters (letters, digits, and the underscore).

The parentheses indicate a "capture group". Each capture group will become a column in the returned dataframe.

So to sum it all up, the regular expression first matches one or more digits, then a single space, then one or more word characters.
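To check the pattern outside pandas, here is a small sketch with Python's built-in `re` module, applied to two of the column labels. `re.match` anchors the match at the start of the string.

```python
import re

pattern = r"(\d+) (\w+)"
labels = ["1 USD", "100 DKK"]

# groups() returns one string per capture group.
parsed = [re.match(pattern, label).groups() for label in labels]
print(parsed)  # [('1', 'USD'), ('100', 'DKK')]
```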

In [60]:
pattern = r"(\d+) (\w+)"

df_ccy = df_currency.columns.str.extract(pattern)
df_ccy

,0,1
0,1,USD
1,1,EUR
2,100,DKK
3,1,GBP
4,100,SEK


Did it work? Great!

Let's add names to the two capture groups. This gives the returned dataframe **correctly named columns**.

Now the pattern is looking really complex, but we don't really have to concern ourselves with that. Having `?P<unit>` inside the capture group just means: *the captured group should be named `unit`*.

`pattern = r"(?P<unit>\d+) (?P<Ccy>\w+)"`
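As a quick check of what the names do, this sketch uses the built-in `re` module on a single label; `groupdict()` returns the captured text keyed by group name.

```python
import re

pattern = r"(?P<unit>\d+) (?P<Ccy>\w+)"

# Each named group becomes a key in the resulting dict.
named = re.match(pattern, "100 SEK").groupdict()
print(named)  # {'unit': '100', 'Ccy': 'SEK'}
```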

In [61]:
pattern = r"(?P<unit>\d+) (?P<Ccy>\w+)"
df_ccy = df_currency.columns.str.extract(pattern, expand=True)
df_ccy

,unit,Ccy
0,1,USD
1,1,EUR
2,100,DKK
3,1,GBP
4,100,SEK


Great! We have been able to split the column header into its two parts "unit" and "Ccy" (currency).

The "unit" column has string values (`object` type). That's because we extracted the unit from the column labels, which are strings.

Let's change the dtype to integer type.

In [62]:
df_ccy = df_ccy.astype({"unit": "int64"})
df_ccy

,unit,Ccy
0,1,USD
1,1,EUR
2,100,DKK
3,1,GBP
4,100,SEK


Now let's copy the currency dataframe and rename the columns.

In [63]:
df_cur = df_currency.copy()

df_cur.columns = df_ccy.Ccy
df_cur

Ccy,USD,EUR,DKK,GBP,SEK
Dato,,,,,
2018-01-02,8.1018,9.7748,131.32,10.9887,99.46
2018-01-03,8.1045,9.7440,130.89,10.9928,99.18
2018-01-04,8.0941,9.7655,131.17,10.9598,99.42
2018-01-05,8.0878,9.7418,130.83,10.9603,99.08
2018-01-08,8.0834,9.6783,129.97,10.9467,98.59
...,...,...,...,...,...
2022-12-14,9.7305,10.3620,139.29,12.0323,95.38
2022-12-15,9.7931,10.4013,139.83,12.0673,95.44
2022-12-16,9.8722,10.4833,140.94,12.0176,95.17
2022-12-19,9.9099,10.5025,141.20,12.0555,95.42


The next step is to divide the values in `df_cur` by the units from `df_ccy`. To do that, the column labels of `df_cur` must match the index labels of `df_ccy`.

Since we want to match on the currency names (`Ccy`), we set `df_ccy` index to `Ccy`.

In [64]:
df_ccy.set_index('Ccy', inplace=True)

df_ccy

Ccy,unit
USD,1
EUR,1
DKK,100
GBP,1
SEK,100


Now the column labels in `df_cur` match up with the index labels in `df_ccy`!

Therefore we can perform the division. EUR is divided by 1, DKK is divided by 100, etc.

In [65]:
# note that we are dividing a DataFrame by a Series
df_cur / df_ccy["unit"]

Ccy,USD,EUR,DKK,GBP,SEK
Dato,,,,,
2018-01-02,8.1018,9.7748,1.3132,10.9887,0.9946
2018-01-03,8.1045,9.7440,1.3089,10.9928,0.9918
2018-01-04,8.0941,9.7655,1.3117,10.9598,0.9942
2018-01-05,8.0878,9.7418,1.3083,10.9603,0.9908
2018-01-08,8.0834,9.6783,1.2997,10.9467,0.9859
...,...,...,...,...,...
2022-12-14,9.7305,10.3620,1.3929,12.0323,0.9538
2022-12-15,9.7931,10.4013,1.3983,12.0673,0.9544
2022-12-16,9.8722,10.4833,1.4094,12.0176,0.9517
2022-12-19,9.9099,10.5025,1.4120,12.0555,0.9542
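The division works because pandas aligns the Series index with the DataFrame's column labels instead of relying on position. A minimal sketch with illustrative sample data, where the units are deliberately listed in a different order than the columns:

```python
import pandas as pd

# Illustrative rates: one row, two currencies.
rates = pd.DataFrame({"USD": [8.0], "DKK": [130.0]})

# Units listed in a different order than the columns of `rates`.
units = pd.Series({"DKK": 100, "USD": 1})

# Alignment is by label, so DKK is still divided by 100 and USD by 1.
per_unit = rates / units
print(per_unit)
```

A label that appears on only one side would produce a column of `NaN`, which is a useful signal that the renaming step went wrong.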


We can also use `df.div`:

In [66]:
# axis=1 matches the Series index against the column labels of df_cur
df_cur.div(df_ccy["unit"], axis=1)

Ccy,USD,EUR,DKK,GBP,SEK
Dato,,,,,
2018-01-02,8.1018,9.7748,1.3132,10.9887,0.9946
2018-01-03,8.1045,9.7440,1.3089,10.9928,0.9918
2018-01-04,8.0941,9.7655,1.3117,10.9598,0.9942
2018-01-05,8.0878,9.7418,1.3083,10.9603,0.9908
2018-01-08,8.0834,9.6783,1.2997,10.9467,0.9859
...,...,...,...,...,...
2022-12-14,9.7305,10.3620,1.3929,12.0323,0.9538
2022-12-15,9.7931,10.4013,1.3983,12.0673,0.9544
2022-12-16,9.8722,10.4833,1.4094,12.0176,0.9517
2022-12-19,9.9099,10.5025,1.4120,12.0555,0.9542
