# Standardizing and Clustering Currency Data

Almost every country has its own currency and its own interest rate. In global markets, there is not one interest rate at which to borrow and invest, but dozens of interest rates and currencies in which you can save, invest, and trade.

This naturally raises the question: would it be possible to borrow in one currency, where interest rates are low, and invest in another, where interest rates are high? Doing so could yield a **spread**, a profit equal to the difference between the two interest rates. In fact, such a strategy, called a **carry trade**, is common in international finance. [While not without risk](https://en.wikipedia.org/wiki/Carry_(investment)), carry trades can be a profitable way to further diversify an investment portfolio.
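As a rough illustration of the arithmetic behind a carry trade, the sketch below estimates a one-month return from borrowing at a low rate and investing at a high one. The function name `carry_return`, the example rates, and the simplification that the return is the monthly interest differential plus the currency return are all assumptions made for illustration; they are not taken from the activity's dataset.

```python
def carry_return(invest_rate_annual, borrow_rate_annual, currency_return_monthly):
    """Approximate one-month carry trade return (hypothetical helper).

    Assumes simple, non-compounded monthly interest and ignores
    transaction costs and cross terms.
    """
    # Monthly spread earned from the gap between the two annual rates
    interest_differential = (invest_rate_annual - borrow_rate_annual) / 12
    # Total return adds the gain or loss from the exchange rate move
    return interest_differential + currency_return_monthly


# Hypothetical numbers: borrow at 1% per year, invest at 5% per year,
# and the investment currency depreciates 0.2% over the month
monthly_return = carry_return(0.05, 0.01, -0.002)
print(round(monthly_return, 6))
```

The example shows why the spread alone does not guarantee a profit: a large enough adverse currency move can wipe out the interest earned.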

In this activity, you’ll use the `StandardScaler` module and clustering optimization techniques to cluster global currencies and interest rates. The purpose of clustering the currencies is to identify which group of currencies offers the best currency carry.

Instructions

1. Read in the `global_carry_trades.csv` file from the `Resources` folder and create/review the DataFrame (this step has been done for you).

2. To prepare the data, use the `StandardScaler` module and the `fit_transform` function to scale all the columns containing numerical values. Review a five-row sample of the scaled data using bracket notation ([0:5]).

3. Create a new DataFrame called `rate_df_scaled` that contains the scaled data. Make sure to do the following: 

    * Use the same labels that were referenced in the `StandardScaler` for the column names. 
    
    * Use `pd.get_dummies` on the "IMF Country Code" column on the original `rate_df` DataFrame. Save these binary variables as a DataFrame called `country_dummies`.

    * Use `pd.concat` to add the `country_dummies` DataFrame to the `rate_df_scaled` DataFrame. Display the combined DataFrame.


4. Fit and Predict with `KMeans`
    * Using the concatenated DataFrame, cluster the country-level data by using the K-means algorithm and a `k` value of `3`. Save the predicted model clusters to a new DataFrame.
    * Create a copy of the `rate_df_scaled` DataFrame, saving it to a new DataFrame called `rate_scaled_predictions`. Add the predicted `country_clusters` to this new DataFrame, then preview its content.


5. Plot and Analyze the Results

    * Group the saved DataFrame by cluster using `groupby`. Plot average `next_month_currency_return` by cluster to identify which group had the highest monthly currency returns.
    * Use `hvplot` to create a scatterplot of `interest_differential` against `next_month_currency_return`, making the plot vary by `CountryCluster`.
    * Based on this plot, which cluster of countries appears to provide both the highest interest spread and currency return?


6. Optional Challenge

    * Utilize `AgglomerativeClustering`, `BIRCH`, or any number of the [other clustering methods available on Scikit-Learn](https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods) and re-estimate clusters on the above DataFrame. 
    * This time, increase the cluster count to see if there are any smaller, more granular clusters which show the most potential for profit.


References

[scikit-learn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

[scikit-learn Preprocessing Data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

[Pandas concat function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)

[scikit-learn Python Library](https://scikit-learn.org)

In [1]:
# Import the required libraries and dependencies
import pandas as pd
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
from sklearn.preprocessing import StandardScaler
import hvplot.pandas

## Step 1: Read in the `global_carry_trades.csv` file from the `Resources` folder and create/review the DataFrame (this step has been done for you).

In [2]:
# Read the CSV file into a Pandas DataFrame
rate_df = pd.read_csv(
    Path("../Resources/global_carry_trades.csv"))

# Review the DataFrame
rate_df.head()

| | interest_differential | next_month_currency_return | IMF Country Code |
|---|---|---|---|
| 0 | 0.001414 | -0.061174 | GBR |
| 1 | -0.00057 | -0.05812 | BEL |
| 2 | 0.001478 | -0.056031 | DNK |
| 3 | 0.000655 | -0.056991 | FRA |
| 4 | -0.002928 | -0.067056 | DEU |


## Step 2: To prepare the data, use the `StandardScaler` module and the `fit_transform` function to scale all the columns containing numerical values. Review a five-row sample of the scaled data using bracket notation ([0:5]).

In [3]:
# Use the StandardScaler module and fit_transform function to 
# scale all columns with numerical values

rate_df_scaled = StandardScaler().fit_transform(rate_df[['interest_differential', 'next_month_currency_return']])

# Display the first three rows of the scaled data
rate_df_scaled[0:3]

array([[-0.24270991, -1.93608838],
       [-0.8539933 , -1.84109498],
       [-0.22308154, -1.77613322]])
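To make the scaling step concrete, the following sketch reproduces what `StandardScaler` computes by hand: each column has its mean subtracted and is divided by its population standard deviation. The `sample` array is made up for illustration and is not taken from `global_carry_trades.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up sample with two numeric columns (illustrative values only)
sample = np.array([
    [0.0010, -0.06],
    [-0.0005, -0.05],
    [0.0015, 0.02],
    [0.0030, 0.01],
])

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(sample)

# Same calculation by hand: subtract the column mean and divide by the
# population standard deviation (NumPy's default, ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

# Check that both approaches agree
print(np.allclose(scaled, manual))
```

After scaling, every column has a mean of roughly 0 and a standard deviation of roughly 1, so no single feature dominates the distance calculations used by the clustering step.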

## Step 3: Create a new DataFrame called `rate_df_scaled` that contains the scaled data. Make sure to do the following: 

- Use the same labels that were referenced in the `StandardScaler` for the column names. 
    
- Use `pd.get_dummies` on the "IMF Country Code" column on the original `rate_df` DataFrame. Save these binary variables as a DataFrame called `country_dummies`.

- Use `pd.concat` to add the `country_dummies` DataFrame to the `rate_df_scaled` DataFrame. Display the combined DataFrame.


In [4]:
# Create a DataFrame with the scaled data
# The column names should match those referenced in the StandardScaler step
rate_df_scaled = pd.DataFrame(rate_df_scaled, columns=['interest_differential', 'next_month_currency_return'])
rate_df_scaled

| | interest_differential | next_month_currency_return |
|---|---|---|
| 0 | -0.242710 | -1.936088 |
| 1 | -0.853993 | -1.841095 |
| 2 | -0.223082 | -1.776133 |
| 3 | -0.476617 | -1.805994 |
| 4 | -1.580459 | -2.119073 |
| ... | ... | ... |
| 994 | 0.122649 | -0.846237 |
| 995 | -0.038476 | -0.722418 |
| 996 | -2.065714 | -0.113693 |
| 997 | -0.283230 | -1.169689 |


In [5]:
# Encode (convert to dummy variables) the "IMF Country Code" column
country_dummies = pd.get_dummies(rate_df['IMF Country Code'])

# Review the DataFrame
country_dummies.head()

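| | AUS | BEL | CAN | CHE | DEU | DNK | FRA | GBR | ITA | JPN | NLD | NOR | NZL | SGP | SWE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |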


In [6]:
# Concatenate the `IMF Country Code` encoded dummies with the scaled data DataFrame
rate_df_scaled = pd.concat([rate_df_scaled, country_dummies], axis=1)

# Display the combined DataFrame.
rate_df_scaled.head()

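| | interest_differential | next_month_currency_return | AUS | BEL | CAN | CHE | DEU | DNK | FRA | GBR | ITA | JPN | NLD | NOR | NZL | SGP | SWE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.24271 | -1.936088 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | -0.853993 | -1.841095 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | -0.223082 | -1.776133 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | -0.476617 | -1.805994 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | -1.580459 | -2.119073 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |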
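The encoding and concatenation steps can be seen more easily on a tiny example. The sketch below uses a made-up three-row DataFrame (`toy_df` and its values are assumptions, not rows from the dataset) to show that `pd.get_dummies` creates one indicator column per distinct country code and that `pd.concat` with `axis=1` joins the frames side by side, matching rows by index label.

```python
import pandas as pd

# Made-up rows that mimic the layout of rate_df (illustrative values only)
toy_df = pd.DataFrame({
    "interest_differential": [0.1, -0.2, 0.3],
    "IMF Country Code": ["GBR", "BEL", "GBR"],
})

# One indicator column per distinct country code
toy_dummies = pd.get_dummies(toy_df["IMF Country Code"])

# axis=1 places the frames side by side, aligning rows on the index
toy_combined = pd.concat(
    [toy_df[["interest_differential"]], toy_dummies], axis=1
)

print(toy_combined.shape)
print(list(toy_combined.columns))
```

Because the alignment is by index label, both frames should share the same index (here the default `0, 1, 2`); mismatched indexes would introduce missing values rather than raise an error.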


## Step 4: Fit and Predict with KMeans

* Using the concatenated DataFrame, cluster the country-level data by using the K-means algorithm and a `k` value of 3. Save the predicted model clusters to a new DataFrame.
* Create a copy of the `rate_df_scaled` DataFrame, saving it to a new DataFrame called `rate_scaled_predictions`. Add the predicted `country_clusters` to this new DataFrame, then preview its content.

In [7]:
# Initialize the K-Means model with n_clusters=3
model = KMeans(n_clusters=3)

# Fit the model for the rate_df_scaled DataFrame
model.fit(rate_df_scaled)

# Predict the cluster for each row (returns a NumPy array of labels)
country_clusters = model.predict(rate_df_scaled)

# View the country clusters
print(country_clusters)

# Create a copy of the concatenated DataFrame
rate_scaled_predictions = rate_df_scaled.copy()

# Create a new column in the copy of the concatenated DataFrame with the predicted clusters
rate_scaled_predictions['CountryCluster'] = country_clusters

# Review the DataFrame
rate_scaled_predictions.head()

[2 2 2 2 2 2 2 1 2 2 1 2 2 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 0
 1 1 0 1 0 1 1 0 0 0 2 0 0 2 0 1 1 0 0 0 1 1 0 0 2 0 0 0 2 0 1 0 0 1 0 1 1
 0 2 2 2 2 2 2 2 2 2 2 1 2 1 2 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1 0 2 0 0 0 0 2
 0 2 0 0 2 0 1 2 0 1 0 0 0 0 1 0 1 1 0 1 0 2 1 0 1 0 0 0 0 1 0 1 1 0 0 0 1
 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 0 0 0 2 0 2 0 0 0 0 2 2 0 2 2 2 2 2
 2 2 2 2 2 0 0 1 2 0 1 1 0 0 1 0 0 1 1 0 0 0 2 1 0 2 0 2 2 0 2 0 2 2 2 0 0
 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 2 1 0 2 2 2 2 2 2 2 2 2 0 2 2 2 2 0 1 0 0 0
 0 1 0 0 1 0 1 0 1 2 0 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 2 2 2 2 2 2 2 2 2 2 1
 2 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1
 1 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 2 0
 1 1 0 1 1 0 0 1 0 1 1 0 1 0 1 2 0 2 1 0 0 1 0 1 1 0 1 2 1 1 0 1 1 0 0 1 0
 1 1 0 1 0 1 1 0 1 0 2 0 2 0 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1
 1 1 0 1 0 1 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1
 0 1 1 0 1 1 1 0 1 0 1 1 

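| | interest_differential | next_month_currency_return | AUS | BEL | CAN | CHE | DEU | DNK | FRA | GBR | ITA | JPN | NLD | NOR | NZL | SGP | SWE | CountryCluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.24271 | -1.936088 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | -0.853993 | -1.841095 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | -0.223082 | -1.776133 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | -0.476617 | -1.805994 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | -1.580459 | -2.119073 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |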
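This activity fixes `k` at 3, but one common way to sanity-check that choice is an elbow curve: fit K-means for several `k` values and compare the inertia of each fit. The sketch below does this on synthetic blob data (the `points` array, the range of `k` values, and the `random_state` settings are assumptions for illustration); the same loop could be pointed at `rate_df_scaled` instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three loose blobs in two dimensions
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
points = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 2)) for center in centers]
)

# Fit K-means for several k values and record the inertia of each fit
k_values = list(range(1, 7))
inertia = []
for k in k_values:
    # random_state makes the cluster assignments repeatable between runs
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(points)
    inertia.append(model.inertia_)

# Inspect where the drop in inertia starts to level off
for k, value in zip(k_values, inertia):
    print(k, round(value, 2))
```

Setting `random_state` is also worth considering in the main notebook cell, since K-means starts from random centroids and the cluster numbering can otherwise change from run to run.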


## Step 5: Plot and Analyze the Results

* Group the saved DataFrame by cluster using `groupby`. Group average `next_month_currency_return` by cluster to identify which group had the highest monthly currency returns.
* Use `hvplot` to create a scatterplot of `interest_differential` against `next_month_currency_return`, making the plot vary by `CountryCluster`.
* Based on this plot, which cluster of countries appears to provide both the highest interest spread and currency return?

In [8]:
# Group the saved DataFrame by cluster using `groupby` to calculate average currency returns
rate_scaled_predictions.groupby(by=['CountryCluster'])['next_month_currency_return'].mean()

CountryCluster
0    0.144451
1    0.521001
2   -1.371088
Name: next_month_currency_return, dtype: float64
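Since the analysis question asks about both the interest spread and the currency return, it can help to average both columns per cluster in one step. The sketch below shows the pattern on a made-up DataFrame (`toy_predictions` and its values are assumptions standing in for `rate_scaled_predictions`), then picks the cluster with the highest average return using `idxmax`.

```python
import pandas as pd

# Made-up values that mimic rate_scaled_predictions (illustrative only)
toy_predictions = pd.DataFrame({
    "interest_differential": [0.5, 0.7, -0.3, -0.1, 1.2, 1.0],
    "next_month_currency_return": [0.2, 0.4, -1.0, -1.2, 0.6, 0.8],
    "CountryCluster": [0, 0, 1, 1, 2, 2],
})

# Average both features for each cluster
cluster_means = (
    toy_predictions
    .groupby("CountryCluster")[
        ["interest_differential", "next_month_currency_return"]
    ]
    .mean()
)

# Label of the cluster with the highest average currency return
best_cluster = cluster_means["next_month_currency_return"].idxmax()

print(cluster_means)
print(best_cluster)
```

Note that in the notebook these averages are computed on scaled values, so they indicate which cluster ranks highest rather than the actual percentage return.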

In [9]:
# Use `hvplot` to create a scatterplot:
rate_scaled_predictions.hvplot.scatter(
    x = 'interest_differential',
    y = 'next_month_currency_return',
    by = 'CountryCluster'
)

## Optional Challenge: 
Utilize `AgglomerativeClustering`, `BIRCH`, or any number of the [other clustering methods available on Scikit-Learn](https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods) and re-estimate clusters on the above DataFrame. This time, increase the cluster count to see if there are any smaller, more granular clusters which show the most potential for profit.

In [10]:
# Initialize a Birch model with n_clusters=5
birch_model = Birch(n_clusters=5)

# Fit the model for the rate_df_scaled DataFrame
birch_model.fit(rate_df_scaled)

# Predict the model segments (clusters)
country_clusters = birch_model.predict(rate_df_scaled)

# View the country clusters
print(country_clusters)

# Create a copy of the concatenated DataFrame
rate_scaled_predictions = rate_df_scaled.copy()

# Create a new column in the copy of the concatenated DataFrame with the predicted clusters
rate_scaled_predictions["CountryCluster"] = country_clusters

# Review the DataFrame
rate_scaled_predictions.head()

[3 4 3 4 4 4 4 3 4 4 3 4 3 2 1 3 1 3 1 1 2 1 3 2 4 3 1 3 2 1 2 1 3 1 1 2 1
 3 2 1 3 1 3 2 1 3 1 3 1 1 2 1 3 2 4 3 1 3 2 1 3 1 1 4 4 4 4 3 4 4 3 1 2 2
 1 3 4 3 4 4 4 4 3 4 4 3 4 3 0 1 3 1 1 4 4 4 4 3 4 4 3 1 3 2 1 3 1 1 1 1 2
 1 3 4 4 3 4 3 0 1 3 1 1 1 1 2 1 3 2 1 3 1 3 3 1 2 1 1 1 1 2 1 3 2 1 3 1 2
 3 1 2 1 1 1 1 2 1 3 2 1 3 1 2 1 1 3 1 1 4 4 4 4 3 4 4 3 4 3 3 1 3 4 3 4 4
 4 4 3 4 4 3 4 2 3 1 3 1 3 1 2 2 1 3 2 3 3 1 0 2 1 3 1 1 4 4 4 4 3 4 4 3 4
 3 3 1 3 1 1 1 1 2 1 3 2 1 3 1 3 3 1 3 4 3 4 4 4 4 3 4 4 3 4 0 3 1 3 1 1 1
 1 2 1 3 2 3 3 1 2 3 1 2 1 1 2 1 2 1 3 2 3 3 1 2 3 1 3 4 3 4 4 4 4 3 4 4 3
 4 2 3 1 2 1 1 1 1 2 1 3 2 3 3 1 2 3 1 3 1 1 1 1 2 1 3 2 3 3 4 2 3 1 3 1 1
 2 1 2 1 3 2 3 3 1 2 3 1 2 1 2 1 2 1 3 2 3 3 1 2 3 1 2 3 1 1 2 1 3 2 1 3 1
 3 1 1 3 3 1 1 2 1 3 2 3 3 1 3 3 1 3 3 1 1 2 1 3 2 3 3 4 3 3 1 3 3 1 1 2 1
 3 2 3 3 1 3 3 1 2 3 1 1 2 1 3 2 3 3 1 3 3 1 2 3 1 1 2 1 3 2 3 3 1 3 3 1 2
 1 2 1 2 1 3 2 2 3 1 2 3 1 2 1 1 1 1 1 1 4 1 1 1 2 1 1 3 3 1 1 4 1 3 2 3 3
 1 3 3 1 2 1 2 1 2 1 3 2 

Unnamed: 0,interest_differential,next_month_currency_return,AUS,BEL,CAN,CHE,DEU,DNK,FRA,GBR,ITA,JPN,NLD,NOR,NZL,SGP,SWE,CountryCluster
0,-0.24271,-1.936088,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,3
1,-0.853993,-1.841095,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,4
2,-0.223082,-1.776133,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3
3,-0.476617,-1.805994,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,4
4,-1.580459,-2.119073,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4


In [11]:
# Use hvplot to display `interest_differential` and `next_month_currency_return` by cluster
rate_scaled_predictions.hvplot.scatter(
    x = 'interest_differential',
    y = 'next_month_currency_return',
    by = 'CountryCluster'
)
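The challenge also mentions `AgglomerativeClustering`, which is imported above but not used. A minimal sketch of how it might be applied follows, using synthetic blob data in place of `rate_df_scaled` (the `points` array and the choice of five clusters are assumptions for illustration). One difference from `KMeans` and `Birch` is that `AgglomerativeClustering` has no separate `predict` method, so the labels come from `fit_predict`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in data: five well-separated blobs in two dimensions
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [4, 4], [0, 4], [4, 0], [8, 8]])
points = np.vstack(
    [center + rng.normal(scale=0.3, size=(20, 2)) for center in centers]
)

# Initialize the model with five clusters (default Ward linkage)
agglo_model = AgglomerativeClustering(n_clusters=5)

# Fit the model and get one cluster label per row in a single call
agglo_clusters = agglo_model.fit_predict(points)

# View the distinct cluster labels
print(np.unique(agglo_clusters))
```

The resulting label array can be assigned to a `CountryCluster` column in a copy of the DataFrame in the same way as the K-means and Birch predictions.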