I previously posted a notebook proposing that k-means clustering could be used to identify market regimes. (Link [here](https://www.kaggle.com/code/tmrtj9999/regime-classification-by-k-means/notebook))

To summarize that notebook: each row is clustered with k-means directly on the given 300 features, and the resulting cluster label is treated as a regime.

When I published that code, a commenter suggested that regimes might instead be classified by aggregating the features for each time_id and clustering the time_ids on those aggregate features.

Therefore, in this notebook I take the mean of each feature for each time_id and cluster on those aggregate features.

First, the training data is read into a DataFrame.
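
As a small illustration of what this aggregation does, here is a sketch on toy data. The frame `toy` and its two feature columns are made up for the example; only the `groupby('time_id').mean()` step mirrors what is done on the real data later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data: 3 time_ids, 2 features (illustrative names).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 2],
    "f_0": rng.normal(size=6),
    "f_1": rng.normal(size=6),
})

# One row per time_id: the cross-sectional mean of every feature.
toy_agg = toy.groupby("time_id").mean()
print(toy_agg.shape)
```

Six rows collapse to one row per time_id, so clustering then assigns a label to each time_id rather than to each row.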

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')

I also want to visualize the estimated regimes on a graph.
For comparison, each plot will show the number of investment_ids per time_id.

That baseline looks like this:

In [None]:
# Count the rows (one per investment_id) observed at each time_id.
num = df.groupby('time_id')['investment_id'].count()
time_id = num.index

plt.plot(time_id, num)
plt.xlabel('time_id')
plt.ylabel('Number of investment_ids')
plt.show()

Next comes the necessary preprocessing.

Columns that are not needed for clustering are removed, and memory usage is reduced.

In [None]:
del df['row_id']
del df['investment_id']
del df['target']

In [None]:
def reduce_mem_usage(df):
    """Downcast each column to the smallest dtype that can hold its value range."""
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if pd.api.types.is_integer_dtype(col_type):
            # Signed and unsigned integers: pick the narrowest signed type that fits.
            c_min, c_max = df[col].min(), df[col].max()
            for int_type in (np.int8, np.int16, np.int32, np.int64):
                if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                    df[col] = df[col].astype(int_type)
                    break
        elif pd.api.types.is_float_dtype(col_type):
            # Note: float16 is lossy (roughly 3 significant decimal digits).
            c_min, c_max = df[col].min(), df[col].max()
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)
        elif col_type == object:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2

    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

df = reduce_mem_usage(df)
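
Because `reduce_mem_usage` casts float columns to float16 whenever their range fits, it may be worth checking how much precision that costs. The sketch below uses synthetic standard-normal values as a stand-in for one feature column (an assumption; the real feature distributions may differ) and measures the rounding error of the cast.

```python
import numpy as np

rng = np.random.default_rng(0)
x32 = rng.normal(size=10_000).astype(np.float32)  # synthetic feature values
x16 = x32.astype(np.float16)                      # what the downcast would store

# Largest rounding error introduced by the float16 cast.
max_err = float(np.max(np.abs(x32 - x16.astype(np.float32))))
print(f"float16 machine epsilon: {np.finfo(np.float16).eps:.6f}")
print(f"max absolute rounding error: {max_err:.6f}")
```

If errors of this size matter for the per-time_id means, keeping the features as float32 is the safer choice, at the cost of more memory.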

Take the mean of every feature for each time_id.

In [None]:
df_agg = df.groupby('time_id').mean()

# The row-level data is no longer needed, so release it.
del df

Determine the optimal number of clusters using the elbow method.

In [None]:
from sklearn.cluster import KMeans

# Elbow method: record the within-cluster sum of squares (WCSS) for k = 1..9.
wcss = []

for i in range(1, 10):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 30, random_state = 0)
    kmeans.fit(df_agg)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 10), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

The elbow method takes as the optimal number of clusters the point where the WCSS curve bends and its decrease starts to level off. In this case, that point is at 4 clusters.

Since the elbow suggests 4 clusters, I fit the model with n_clusters=4.
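
The bend in the curve can also be located numerically rather than by eye. The sketch below is one crude heuristic, not the method used in this notebook: it picks the k where the second difference of the WCSS curve is largest. The data `X` is synthetic (four blobs from `make_blobs`) and merely stands in for `df_agg`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for df_agg: 4 blobs in 5 dimensions.
X, _ = make_blobs(n_samples=400, centers=4, n_features=5, random_state=0)

ks = list(range(1, 10))
wcss = [
    KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Second difference of the WCSS curve: large where the slope flattens abruptly.
second_diff = np.diff(wcss, n=2)
# second_diff[i] is centered on ks[i + 1], hence the offset of one.
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print("candidate elbow:", elbow_k)
```

This heuristic can disagree with a visual reading, since the drop from k=1 to k=2 is often the steepest, so it is best treated as a cross-check on the plot.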

In [None]:
clf = KMeans(n_clusters=4, random_state=0)
clf.fit(df_agg)

y_clustering = clf.predict(df_agg)
df_clustering = pd.DataFrame({'time_id': df_agg.index, 'cluster' : y_clustering})

Split the result into a separate DataFrame for each cluster.

In [None]:
df_0 = df_clustering[df_clustering['cluster'] == 0]
df_1 = df_clustering[df_clustering['cluster'] == 1]
df_2 = df_clustering[df_clustering['cluster'] == 2]
df_3 = df_clustering[df_clustering['cluster'] == 3]
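
If the number of clusters changes, the same split can be written without one variable per cluster. This is a minimal sketch on a toy frame, `toy_clustering`, which is hypothetical but has the same columns as `df_clustering`.

```python
import pandas as pd

# Toy clustering result with the same columns as df_clustering.
toy_clustering = pd.DataFrame({
    "time_id": [0, 1, 2, 3, 4, 5],
    "cluster": [0, 2, 0, 1, 3, 1],
})

# Map each cluster label to the array of time_ids assigned to it.
time_ids_by_cluster = {
    label: group["time_id"].to_numpy()
    for label, group in toy_clustering.groupby("cluster")
}

# Cluster sizes, useful as a quick sanity check.
print(toy_clustering["cluster"].value_counts().sort_index())
```

With this layout, `time_ids_by_cluster[0]` plays the role of `df_0['time_id'].values`.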

Everything is now ready, so I will visualize the clustering results.

In each plot, the time_ids assigned to one cluster are shaded with a background color.
The blue line is the number of investment_ids per time_id.
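
The shading can also be wrapped in a small helper so the per-cluster plots do not repeat the same code. The sketch below is an alternative to the cells that follow, not what they do: `contiguous_runs` and `plot_cluster` are hypothetical names, consecutive time_ids are merged into a single span, and the toy arrays stand in for `time_id`, `num` and one cluster's time_ids.

```python
import numpy as np
import matplotlib.pyplot as plt

def contiguous_runs(ids):
    """Collapse integer ids into (start, end) runs of consecutive values."""
    ids = np.sort(np.asarray(ids))
    if ids.size == 0:
        return []
    breaks = np.where(np.diff(ids) > 1)[0]
    starts = np.r_[ids[0], ids[breaks + 1]]
    ends = np.r_[ids[breaks], ids[-1]]
    return list(zip(starts.tolist(), ends.tolist()))

def plot_cluster(ax, time_id, num, cluster_time_ids, color):
    """Plot the investment_id count and shade the time_ids of one cluster."""
    ax.plot(time_id, num)
    for start, end in contiguous_runs(cluster_time_ids):
        # Pad by half a step so a single time_id still gets a visible band.
        ax.axvspan(start - 0.5, end + 0.5, color=color, alpha=0.3)
    ax.set_xlabel("time_id")
    ax.set_ylabel("number of investment_ids")
    return ax

# Toy data standing in for time_id, num and one cluster's time_ids.
toy_time_id = np.arange(15)
toy_num = np.linspace(100, 200, 15)
runs = contiguous_runs([1, 2, 3, 7, 8, 12])
fig, ax = plt.subplots()
plot_cluster(ax, toy_time_id, toy_num, [1, 2, 3, 7, 8, 12], "red")
```

Merging runs keeps the number of drawn patches small when a cluster covers long stretches of consecutive time_ids.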

In [None]:
plt.plot(time_id, num)

# Shade every time_id assigned to cluster 0.
for i in df_0['time_id'].values:
    plt.axvspan(i - 0.01, i + 0.01, color="red")

plt.show()

Cluster 0 appears throughout the whole period, but it is particularly dense where the number of investment_ids drops sharply. From this, we can infer that it probably corresponds to a market crash.
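
A visual impression like this can be checked with a simple summary of the investment_id count per cluster. The sketch below uses toy values; `toy_num` and `toy_clustering` are hypothetical stand-ins for `num` and `df_clustering`.

```python
import pandas as pd

# Toy stand-ins: investment_id count per time_id and a cluster label per time_id.
toy_num = pd.Series(
    [300, 310, 120, 100, 320, 330],
    index=pd.Index(range(6), name="time_id"),
    name="num",
)
toy_clustering = pd.DataFrame({
    "time_id": range(6),
    "cluster": [1, 1, 0, 0, 2, 2],
})

# Attach the count to each time_id, then summarize it per cluster.
merged = toy_clustering.merge(toy_num.reset_index(), on="time_id")
summary = merged.groupby("cluster")["num"].agg(["count", "mean", "median"])
print(summary)
```

On the real data, a clearly lower mean or median for cluster 0 than for the other clusters would support the reading above.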

In [None]:
plt.plot(time_id, num)

# Shade every time_id assigned to cluster 1.
for i in df_1['time_id'].values:
    plt.axvspan(i - 0.01, i + 0.01, color="green")

plt.show()

Cluster 1 seems to be concentrated in the second half of the period and appears much less in the first half. Given the trend in the number of investment_ids, I wonder if this cluster corresponds to a rising market.

In [None]:
plt.plot(time_id, num)

# Shade every time_id assigned to cluster 2.
for i in df_2['time_id'].values:
    plt.axvspan(i - 0.01, i + 0.01, color="yellow")

plt.show()

Cluster 2 stands out in that it does not appear where the number of investment_ids drops sharply.
Given this, it plausibly corresponds to periods when the market is not very volatile.

In [None]:
plt.plot(time_id, num)

# Shade every time_id assigned to cluster 3.
for i in df_3['time_id'].values:
    plt.axvspan(i - 0.01, i + 0.01, color="aqua")

plt.show()

Cluster 3 also appears throughout the whole period. By process of elimination, it may correspond to a highly volatile market.
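
The volatility readings for clusters 2 and 3 could be tested rather than inferred by elimination. One possible check, sketched below on synthetic data, compares the cross-sectional standard deviation of `target` per time_id across clusters. Treating that dispersion as a volatility proxy is my assumption, and on the real data it would have to be computed before the `target` column is deleted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy rows: 4 time_ids x 50 investments, with a wider target spread on time_ids 2 and 3.
scale = {0: 0.5, 1: 0.5, 2: 2.0, 3: 2.0}
toy = pd.DataFrame({"time_id": np.repeat([0, 1, 2, 3], 50)})
toy["target"] = rng.normal(size=len(toy)) * toy["time_id"].map(scale)

# Hypothetical cluster labels for the four time_ids.
toy_clustering = pd.DataFrame({"time_id": [0, 1, 2, 3], "cluster": [2, 2, 3, 3]})

# Cross-sectional standard deviation of the target per time_id (rough volatility proxy).
dispersion = toy.groupby("time_id")["target"].std().rename("target_std").reset_index()

# Average that proxy within each cluster.
per_cluster = (
    toy_clustering.merge(dispersion, on="time_id")
    .groupby("cluster")["target_std"]
    .mean()
)
print(per_cluster)
```

If cluster 3 showed a clearly larger average dispersion than cluster 2 on the real data, that would be consistent with the interpretation above.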