![QuantConnect Logo](https://cdn.quantconnect.com/web/i/icon.png)
<hr>

# A Mean-Reverting Portfolio with DTW Clustering
This is the clustering analysis notebook for the WQU MScFE C20-S4 Group 3 capstone project, A Mean-Reverting Portfolio with DTW Clustering.
Member:
- Kam Chiu Szeto, hke0073@icloud.com

### Import Libraries

In [1]:
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
from tslearn.barycenters import softdtw_barycenter
from tslearn.clustering import TimeSeriesKMeans
from ResearchETFUniverse import ETFUniverse

## Soft-DTW clustering -- 2019, SPY
This section covers the soft-DTW k-means clustering analysis for the period from 2019-9-1 to 2020-8-31 (a volatile period). The SPY constituents as of 2020-8-31 are clustered with the k-means algorithm, using soft-DTW as the distance measure. Eleven clusters are formed, and the assets in the cluster with the most members are extracted.
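
To make the distance measure concrete, here is a minimal pure-Python sketch of the soft-DTW recurrence for two 1-D series. It is an illustration only, not tslearn's implementation: the function names `soft_min` and `soft_dtw`, the squared-error cost, and the toy series are assumptions made for this example.

```python
import math

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    # Shift by the true minimum for numerical stability.
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two 1-D sequences (squared-error cost)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # r[i][j] holds the soft-aligned cost of x[:i] versus y[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            prev = [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]]
            r[i][j] = cost + soft_min(prev, gamma)
    return r[n][m]

# Hypothetical toy series: the same bump, shifted by one step.
x = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
squared_euclidean = sum((a - b) ** 2 for a, b in zip(x, y))
warped = soft_dtw(x, y, gamma=0.01)
```

With a small `gamma` the soft minimum approaches the hard minimum, so the result approaches classic DTW. Because the alignment can warp along the time axis, the shifted bump scores much lower than it does under the point-by-point squared Euclidean distance.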

### Get Data
Get the historical data of the SPY constituents as of 2020-8-31.

In [2]:
qb = QuantBook()

# Set historical lookback start and end date
start_date = datetime(2019, 9, 1)
end_date = datetime(2020, 8, 31)

# Import the SPY constituents as of the end date
assets, _ = ETFUniverse("SPY", end_date).get_symbols(qb)

# Get historical data of SPY constituents
history = qb.History(assets, start_date, end_date, Resolution.Daily)
history

### Preparing Data
<p>Use the standardized log close price time series of the constituents as the input for clustering.</p>

In [3]:
# Get the daily close price.
close = history.unstack(0).close

# Take the logarithm to reduce the compounding effect
log_close = np.log(close)

# Standardize each ticker's series to zero mean and unit variance
standard_close = (log_close - log_close.mean()) / log_close.std()

# Drop any columns (tickers) with NaN values
standard_close = standard_close.dropna(axis=1)
standard_close
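
The preparation step above can be sketched without pandas on hypothetical data. The tickers `AAA` and `BBB`, their prices, and the helper `standardize_log_prices` are made up for illustration. A ticker with any missing close is dropped entirely, mirroring `dropna(axis=1)`, and the sample standard deviation is used, as `DataFrame.std()` does by default.

```python
import math
import statistics

def standardize_log_prices(prices):
    """Return z-scores of log prices, or None if the series is unusable."""
    # A missing or non-positive price makes the whole series unusable.
    if any(p is None or p <= 0 for p in prices):
        return None
    logs = [math.log(p) for p in prices]
    mean = statistics.mean(logs)
    std = statistics.stdev(logs)  # sample standard deviation (ddof=1)
    if std == 0:
        return None  # a constant series cannot be standardized
    return [(v - mean) / std for v in logs]

# Hypothetical daily closes; BBB has a missing observation.
closes = {
    "AAA": [100.0, 102.0, 101.0, 105.0],
    "BBB": [20.0, None, 21.0, 22.0],
}

standardized = {}
for ticker, prices in closes.items():
    z = standardize_log_prices(prices)
    if z is not None:
        standardized[ticker] = z
```

Standardizing each series separately puts all tickers on a common scale, so the clustering compares the shapes of the price paths and not their price levels.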

### Train Model

Use the `TimeSeriesKMeans` k-means model with the soft-DTW distance measure for clustering. In this case, 11 clusters are used to mirror the number of sectors.

In [5]:
# Set up the time series k-means model with the soft-DTW metric.
km = TimeSeriesKMeans(n_clusters=11,
                      metric="softdtw",
                      max_iter=5,
                      max_iter_barycenter=5,
                      n_jobs=-1,
                      random_state=100,
                      init="k-means++")

# Fit the model.
km.fit(standard_close.T)

### Visualization
Visualize the clusters and their corresponding underlying series.

In [6]:
# Predict the cluster label of each series.
labels = km.predict(standard_close.T)

# Create a helper function to aid plotting.
def plot_helper(ts):
    # Plot every series in the cluster.
    for i in range(ts.shape[0]):
        plt.plot(ts[i, :], "k-", alpha=.2)

    # Plot the soft-DTW barycenter of the series.
    barycenter = softdtw_barycenter(ts, gamma=1.)
    plt.plot(barycenter, "r-", linewidth=2)

# Plot the results on a grid with 3 columns.
unique_labels = sorted(set(labels))
n_rows = -(-len(unique_labels) // 3)  # ceiling division
plt.figure(figsize=(20, 40))
for j, i in enumerate(unique_labels, start=1):
    # Select the series in the i-th cluster.
    X = standard_close.iloc[:, [n for n, k in enumerate(labels) if k == i]].values

    # Plot the series and their barycenter.
    plt.subplot(n_rows, 3, j)
    plt.title(f"Cluster {i+1}")
    plot_helper(X.T)

### Extract Assets
Obtain the assets in the cluster with the highest number of assets.

In [7]:
# Get the cluster with the highest number of assets.
cluster_most_element = np.argmax(np.bincount(labels))
print(f"Cluster {cluster_most_element+1} has the highest number of elements: {len([x for x in labels if x == cluster_most_element])}")

# Get a list of assets within it
selected_assets = standard_close.columns[[n for n, k in enumerate(labels) if k == cluster_most_element]]
print([x.split(" ")[0] for x in selected_assets])
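
The extraction step can also be sketched with only the standard library. The labels, ticker names, and the helper `largest_cluster` below are hypothetical. One difference to be aware of: on ties, `Counter.most_common` returns the label seen first, whereas `np.argmax(np.bincount(labels))` returns the smallest label.

```python
from collections import Counter

def largest_cluster(labels, names):
    """Return the most frequent label and the names assigned to it."""
    label, _size = Counter(labels).most_common(1)[0]
    members = [name for name, k in zip(names, labels) if k == label]
    return label, members

# Hypothetical cluster assignments for six made-up tickers.
labels = [2, 0, 2, 1, 2, 0]
tickers = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"]

top_label, top_members = largest_cluster(labels, tickers)
```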