<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/005_exchange-rate-data.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Preprocessing Exchange Rate Data
___
In this notebook, we will study how to read data from external sources into Python. The case we look at is rather tricky and more demanding than working with data that is already clean. However, it gives us the opportunity to learn some functions that help us bring the information into the format we need.

___
## Getting and reading SNB data on exchange rates
We are going to use a pre-constructed CSV file in order to save time at the start of the course. However, it is important that you learn how to find and download data yourself. Therefore, you will also find a description of how to download the data from the source: the data portal of the Swiss National Bank (SNB). We highly recommend that you go over this description and test out the steps for yourself.

### Instructions to obtain the data
1. Go to the data portal of the Swiss National Bank: https://data.snb.ch/en.
2. Make sure to switch to ENGLISH (in the upper right corner), if necessary.
3. Choose table selection in the upper horizontal menu bar and then **Interest rates, yields and foreign exchange market**. 
4. Choose **Foreign exchange market > Foreign exchange rates > Month** in the left menu bar.
5. Choose data (**Monthly average**) from January 2001 to most recent, in CSV format (choose **CSV (selection)**, <span style="color:red">**not CSV (all)!**</span>).

Accessing the data on Jun 30, 2022, we obtain a file named `snb-data-devkum-en-selection-20220630_1430.csv`. You can find it in the data folder on Nuvolos.

___
⚠️ <span style="color:red">**BE CAREFUL WHEN OPENING  CSV FILES in Excel!**</span> ⚠️

Depending on the regional settings of your device, Excel may change formats automatically, which will make it difficult to import the data into Python. The easiest way to inspect the data is to go to the data folder in Jupyter Lab's file browser (left pane) and double-click on the CSV file. Alternatively, you can download the file and open it with a text editor.
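
If you prefer to stay inside Python, the first few raw lines of a file can also be printed without any parsing. The following is a minimal sketch: the file `toy.csv` and its contents are made up for illustration and are not the SNB data.

```python
import tempfile
from itertools import islice
from pathlib import Path

# Hypothetical toy file, written to a temporary folder just for this demo
toy_path = Path(tempfile.mkdtemp()) / "toy.csv"
toy_path.write_text("a;b\n1;2\n3;4\n5;6\n", encoding="utf-8")

# Read only the first 3 lines as plain text (no parsing, no reformatting)
with open(toy_path, encoding="utf-8") as f:
    first_lines = [line.rstrip("\n") for line in islice(f, 3)]

for line in first_lines:
    print(line)
```

Because the file is read as plain text, what you see is exactly what is stored on disk, including the separator.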

___
## Path, directories, and data import

Do you remember how we imported the **iris** dataset last class? We used

```python
iris = pd.read_csv("../data/iris.csv")
```
While it truly is simple, we omitted some important points. We are opening a dataset, called `iris.csv`, which is located in the folder `data`, one directory layer above the current one (`notebooks`). Oftentimes, you might want to refer to a file that is somewhere on your computer, not necessarily always in the same folder (or a sub-folder in this case). Perhaps, if you are working on Windows, you have the two following folders (I hope you don't! This is a *bad* example):
+ `C:/notebooks/mynotebook.ipynb`
+ `C:/data/iris.csv`

How do you import `iris.csv` from `mynotebook.ipynb`? You can't simply do `pd.read_csv("data/iris.csv")` because **from the point of view of `mynotebook.ipynb`** (i.e., from the folder it is in, `C:/notebooks`) there is no `data` folder. So, what can we do?

#### Absolute paths
The first method is to simply pass the absolute path, i.e., the path *starting from the base of the directories on your operating system*, e.g.

```python
pd.read_csv("C:/data/iris.csv")
```
This works fine, but if your folder structure is complicated this might be cumbersome.

#### Relative paths
Another way to do it is to use relative paths, i.e., paths relative to *where we are currently*. This is implicitly what we did when loading the iris dataset: we used a path relative to the notebook we were working in. In a case like the one portrayed above, we need to know how to *go back one folder* using relative path notation. This is done by adding `..`, e.g.,

```python
pd.read_csv("../data/iris.csv")
```

For a more complicated example, consider that you are working with a notebook and a dataset situated as follows:  
+ `C:/HSG/HS2022/DSF/my-project/notebooks/stock_market_prediction.ipynb`
+ `C:/HSG/HS2022/DSF/data/datasets/stock_market_data.csv`

Can you figure out what you need to type to read the CSV using relative paths?

The answer is
```python
pd.read_csv("../../data/datasets/stock_market_data.csv")
```
Do you see why? The first `..` takes us one folder back, i.e. into `my-project`, but the `data` folder is not in here! So we need another `..` to go back one more folder.
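
If you want to check such a path without opening any file, you can let Python collapse the `..` parts for you. The sketch below uses the standard library module `posixpath`, chosen here only because it treats the paths as plain strings with forward slashes on every operating system. It does not look at your disk, so the folders do not need to exist.

```python
import posixpath

# Folder that contains the notebook (taken from the example above)
notebook_dir = "C:/HSG/HS2022/DSF/my-project/notebooks"
relative_path = "../../data/datasets/stock_market_data.csv"

# Join both parts, then collapse every ".." with the folder before it
resolved = posixpath.normpath(posixpath.join(notebook_dir, relative_path))
print(resolved)
```

The printed result is the absolute path of the dataset, which confirms that two `..` are needed to get from `notebooks` up to `DSF`.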

This might all seem a bit abstract and complicated. In my opinion, it's best to have your files all together in your project folder. A bit like how the folder for this class is structured! For instance, in my research, I try to have a main folder, e.g., `my-project`, and inside it I will have a `data` folder, a `code` folder, and something like a `latex` folder where I keep my writings on the subject.

#### ⚠️ Backward and forward slashes

Sometimes, you might see a path written as  
+ `..\\data\\iris.csv`

This is some Windows-specific notation. **This is bad practice**, for two main reasons:  
+ It only works on Windows. If somebody working on a Unix system (Mac, Linux) tries to run your code, it will fail.
+ The backslash is a special character in strings, called an *escape character*. This is a bit advanced, but for instance `\n` is a newline, `\t` is a tab, etc. This is why the backslashes have to be doubled in the path above.
+ (Optional) It's one more character than you need.

So, to summarize, just use forward slashes. They work on all systems and use fewer characters.

**⛔ There is never a good reason to use `\\` instead of `/` when indicating a path! ⛔**
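
To see the escape-character problem in action, here is a small sketch. The path `C:/temp/new.csv` is hypothetical and only chosen because its folder and file names start with `t` and `n`.

```python
# A path typed with single backslashes: "\t" and "\n" are read as escape sequences
bad_path = "C:\temp\new.csv"
# The same path typed with forward slashes
good_path = "C:/temp/new.csv"

print(repr(bad_path))  # the tab and newline characters are now part of the string
print(len(bad_path), len(good_path))
```

The first string is shorter than the second one, because each backslash was merged with the following letter into a single special character. No error is raised, which makes this kind of mistake hard to spot.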

In [None]:
import pandas as pd # Import the dataframes package

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# pd.read_csv() is the default function in pandas to load a csv as a dataframe
pd.read_csv(f"{DATA_PATH}/snb-data-devkum-en-selection-20220601_1430.csv")

As we can see from the output of the code above, reading the CSV into dataframe format with `pandas` produces a strange result. Have we done something wrong?

Well, not exactly, but as mentioned above, the data you download does not always come in a clean format and sometimes you will need to inspect the data yourself to understand what needs to be done before you can import it properly.

Looking either at the output above, or at the file directly by double-clicking on it in the explorer, we observe a few things:
+ The first 2 rows are separated from the rest by a blank row. Also, these rows are *different*: they have only 2 elements separated by `;`, whereas the later rows always have 4.
+ The values are separated by a semi-colon (`;`) instead of a comma (`,`). As it turns out, the elements in a CSV file are not always separated by a comma. The separator is often a semi-colon, but it can also be something else, and this is something you will have to find out for each file. In particular, the Swiss standard for CSV differs from the more widespread American one. The semi-colon is the typical separator for CSV files with Swiss format!

Taking those two observations into consideration, we can try to import our CSV from row 4 onwards, using a semi-colon separator. Can you figure out how to do it yourself?

#### ➡️ ✏️ Task 1

In the cell below, read in the SNB data by passing options to deal with the issues noted above.

*Hint:* Having a look at the [`pandas` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is always helpful to understand what options we can pass to a specific function. In particular, you want to look at the arguments `skiprows` and `sep`.
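
To get a feeling for what these two arguments do, here is a toy illustration on made-up, in-memory text. It is not the SNB file: the column names and values are hypothetical and only mimic the layout described above.

```python
import io

import pandas as pd

# Made-up text that mimics the layout described above (NOT the SNB file):
# two metadata lines, a blank line, then a semicolon-separated table
toy_text = (
    "Title;Toy example\n"
    "Source;Made up\n"
    "\n"
    "Date;Code;Value\n"
    "2001-01;A;1.5\n"
    "2001-02;A;1.6\n"
)

# skiprows=3 skips the first three lines of the file, sep=";" sets the separator
toy = pd.read_csv(io.StringIO(toy_text), sep=";", skiprows=3)
print(toy)
```

`io.StringIO` simply lets `pandas` read from a string as if it were a file. For the task, you pass the path of the SNB file instead.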

In [None]:
# Enter your code here


The clean dataframe consists of 4 columns: `Date`, `D0`, `D1`, and `Value`. The names `D0` and `D1` are not very expressive, so it would be better to rename those columns. In any case, let's start with some data pre-processing, **as always**.

In [None]:
# Display the unique 'Date' values
snb["Date"].unique()

The dates are actually months, in the format `yyyy-mm`. However, they are currently stored as strings in our dataframe. This is not very practical; it would be nicer to have them as dates. Fortunately, `pandas` provides some useful utilities for this purpose.

In general, when using the function `pd.to_datetime()` (or other date functions in Python), we want to pass the format in which the date is currently being represented. Python works with [well-defined date format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). These can seem a bit cryptic at first, but once you get the hang of it, it becomes intuitive.
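
As a small illustration of these format codes, the sketch below parses a single made-up month string with Python's built-in `datetime` module. `pd.to_datetime()` understands the same codes.

```python
from datetime import datetime

# "%Y" is a four-digit year, "%m" a two-digit month
parsed = datetime.strptime("2022-06", "%Y-%m")
print(parsed)  # the missing day defaults to the first of the month

# The same codes work in the other direction, from date to string
print(parsed.strftime("%m/%Y"))
```

Note that a month without a day is stored as the first day of that month.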

In [None]:
# Transform the date column to date format
snb["Date"] = pd.to_datetime(snb["Date"], format="%Y-%m")
snb["Date"]

In [None]:
# Display the unique 'D0' values
snb["D0"].unique()

The `D0` column in our dataframe only consists of the value `M0`. This is not very useful information, thus we might as well drop this column.

#### ➡️ ✏️ Task 2
In the cell below, remove the column `D0` from the `snb` dataframe.

In [None]:
# Enter your code here


In [None]:
# Display the unique values in D1
snb["D1"].unique()

As we can see, the column `D1` contains the different currencies (28 in total). Because the name `D1` is not very helpful, we rename the column.

In [None]:
# Rename the column 'D1' to 'Currency'
snb.rename(columns={"D1": "Currency"}, inplace=True)

In [None]:
snb # Display the cleaned data

___
## Selecting data subsets and plotting

Now that we have pre-processed our data, we can turn to visualizations. Once again, we will need to load the plotting library `matplotlib`.

In [None]:
import matplotlib.pyplot as plt

In [None]:
color_dict = {"EUR1": "blue", "USD1": "green"} # Create a color mapping

fig, ax = plt.subplots(figsize=(12, 8)) # Instantiate the figure and axis
# Iterate over the keys of the color dictionary (EUR1, USD1)
for currency in color_dict.keys():
    subset = snb.loc[snb["Currency"] == currency, :] # Subset the data
    ax.plot(subset["Date"], subset["Value"], label=currency, color=color_dict[currency])
ax.legend() # Add legend
# Set x- and y-labels
ax.set_xlabel("Date")
ax.set_ylabel("Exchange rate (in CHF)")
ax.grid(True) # Add a grid in the background
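
The subsetting step inside the loop relies on a so-called boolean mask. Here is a minimal sketch of how it works, on a made-up toy table with hypothetical values (not the SNB data):

```python
import pandas as pd

# Made-up toy table with hypothetical values (NOT the SNB data)
toy = pd.DataFrame({
    "Currency": ["EUR1", "USD1", "EUR1", "USD1"],
    "Value": [1.1, 0.9, 1.2, 1.0],
})

# Comparing a column to a value gives one True/False entry per row
mask = toy["Currency"] == "EUR1"
print(mask.tolist())

# .loc keeps the rows where the mask is True (":" means "all columns")
subset = toy.loc[mask, :]
print(subset)
```

In the plotting loop, the same thing happens once per currency, so each call to `ax.plot` only receives the rows of that currency.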

#### ➡️ ✏️ Task 3
In the cell below, add the exchange rate of the pound (`GBP1`) in purple, and the exchange rate of the Canadian dollar (`CAD1`) in orange.

*Hint:* Look at the `color_dict` and try to understand what is happening in the loop over `color_dict.keys()`.

In [None]:
# Enter your code here


___
## Saving our clean data

While it is necessary to pre-process every dataset you receive, you don't want to re-run the code every single time you are working with this dataset. Instead, you want to save a clean version of the data that you will re-use. In general, I personally separate the data pre-processing and the data analysis steps, i.e., I will have a notebook or script for the data cleaning and another one for the analysis.

`pandas` provides the method `.to_csv` which allows us to save our data back to CSV format.

In [None]:
# Save the clean data as a CSV and ignore the index
snb.to_csv("snb_clean.csv", index=False)

Check the folder this notebook runs in: there is now a file called `snb_clean.csv` with the contents of our pre-processing step. Next time, we can simply use `pd.read_csv("snb_clean.csv")` and skip the whole pre-processing!
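
One detail to keep in mind when reading a saved file back: CSV is plain text, so dates are stored as strings. The sketch below shows this on a made-up toy table with hypothetical values (not the SNB data), using the `parse_dates` argument of `pd.read_csv()` to get proper dates again.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up toy table with hypothetical values (NOT the SNB data)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01", "2022-02"], format="%Y-%m"),
    "Value": [1.0, 2.0],
})

# Save to a temporary folder, outside of the project folder
toy_path = Path(tempfile.mkdtemp()) / "toy_clean.csv"
toy.to_csv(toy_path, index=False)

# Dates come back as strings unless we ask pandas to parse them
as_text = pd.read_csv(toy_path)
as_dates = pd.read_csv(toy_path, parse_dates=["Date"])
print(as_text["Date"].dtype, as_dates["Date"].dtype)
```

The two printed data types differ: only the second version can be used directly as dates, e.g., on the x-axis of a plot.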