## K-step and F-step

This part implements KMeans with $K$ choices of Bregman divergence. The dataset is first split 50%-50% into a pretraining part (`X_pre`, `y_pre`) and an aggregation part (`X_agg`, `y_agg`):

- (`X_pre`, `y_pre`): used for clustering (K-step) and for building the candidate models (F-step).
- (`X_agg`, `y_agg`): used for the aggregation step (C-step).
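
As a rough illustration of the 50%-50% split described above, the sketch below shuffles the row indices and cuts them in half. The toy data and the NumPy-based shuffling are assumptions made for this example; the way `KFCModel` performs the split internally is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the training set (hypothetical values).
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)

# Shuffle the row indices, then cut them into two halves.
perm = rng.permutation(len(X))
half = len(X) // 2
pre_idx, agg_idx = perm[:half], perm[half:]

X_pre, y_pre = X[pre_idx], y[pre_idx]   # used by the K-step and F-step
X_agg, y_agg = X[agg_idx], y[agg_idx]   # used by the C-step
```

The two index sets are disjoint, so no observation is used both to build the candidate models and to aggregate them.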

### 1. K-step

After the K-step, the `self` object should hold the clustering structure of `X_pre` as a dictionary with one label vector per Bregman divergence:

```
self.clusters_ = {
    'BD1' : [0,1,0,2,...],
    'BD2' : [1,1,0,2,...],
    ...
    'BDK' : [...]  # each label vector has the same length as `X_pre`.
}
```
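
The following is a minimal sketch of how such a dictionary could be produced, assuming a plain Lloyd-style loop with two divergences (squared Euclidean and generalized Kullback-Leibler). The function names `sq_euclidean`, `gkl`, `assign`, and `bregman_kmeans` are hypothetical and are not the `kfc_model` implementation. The update step relies on the known fact that, for any Bregman divergence, the centroid minimizing the within-cluster divergence is the arithmetic mean of the cluster.

```python
import numpy as np

def sq_euclidean(X, c):
    # Squared Euclidean distance from each row of X to centroid c.
    return ((X - c) ** 2).sum(axis=1)

def gkl(X, c):
    # Generalized Kullback-Leibler divergence; requires strictly positive entries.
    return (X * np.log(X / c) - X + c).sum(axis=1)

def assign(X, centroids, divergence):
    # Label each row with the index of its nearest centroid under the divergence.
    dists = np.stack([divergence(X, c) for c in centroids], axis=1)
    return dists.argmin(axis=1)

def bregman_kmeans(X, divergence, n_clusters, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with randomly chosen data points.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids, divergence)
        for j in range(n_clusters):
            members = X[labels == j]
            if len(members) > 0:
                # The cluster mean is the optimal centroid for Bregman divergences.
                centroids[j] = members.mean(axis=0)
    return assign(X, centroids, divergence)

# Strictly positive toy data so that the GKL divergence is well defined.
rng = np.random.default_rng(0)
X_pre = rng.uniform(0.5, 2.0, size=(60, 2))

clusters_ = {
    'BD1': bregman_kmeans(X_pre, sq_euclidean, n_clusters=3),
    'BD2': bregman_kmeans(X_pre, gkl, n_clusters=4),
}
```

Each entry of `clusters_` is a label vector with one label per row of `X_pre`, matching the structure described above.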

### 2. F-step

Here we fit $K$ **candidate models $(F_k)_{1\leq k\leq K}$**, one per Bregman divergence. Each candidate is a collection of local models, one fitted on each cluster found with that divergence. More precisely, for the $k$-th Bregman divergence:

```
F_k = {
    # Cluster labels are 0-based, so local model 'lmj' is fitted on the points labelled j.
    'lm0' : Model(X_pre[self.clusters_['BDk'] == 0], y_pre[self.clusters_['BDk'] == 0]),
    'lm1' : Model(X_pre[self.clusters_['BDk'] == 1], y_pre[self.clusters_['BDk'] == 1]),
    ...,
    'lm(K-1)' : Model(X_pre[self.clusters_['BDk'] == K-1], y_pre[self.clusters_['BDk'] == K-1])
}
```
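
A runnable sketch of this step is given below, under the assumption that each local model is an ordinary least-squares fit. The helper `fit_linear` and the hand-made labels in `clusters_` are hypothetical stand-ins, not the `kfc_model` code.

```python
import numpy as np

def fit_linear(X, y):
    # Ordinary least squares with an intercept; returns [intercept, coefficients...].
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
X_pre = rng.normal(size=(90, 2))
y_pre = X_pre @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=90)

# Hypothetical cluster labels standing in for the K-step output.
clusters_ = {
    'BD1': np.arange(90) % 3,   # three clusters
    'BD2': np.arange(90) % 4,   # four clusters
}

# One candidate per divergence; each candidate holds one local model per cluster.
candidate_models_ = {}
for bd, labels in clusters_.items():
    candidate_models_[bd] = {
        f'lm{j}': fit_linear(X_pre[labels == j], y_pre[labels == j])
        for j in np.unique(labels)
    }
```

Note that the number of local models may differ between candidates, since each divergence can use its own number of clusters.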

- For a given Bregman divergence `k`, predicting an observation $x$ takes two steps:
    - Assign $x$ to **its nearest cluster** under divergence $d_k$: $$x\text{ is in cluster }\ell \text{ if } d_k(x, C_\ell)< d_k(x, C_j) \text{ for all }j\neq \ell.$$
    - If $x$ is in cluster $m$, use the corresponding local model $f_m$ to predict $x$: $$F_k(x)=f_m(x).$$

- Note that if the Bregman divergence is `'logistic'`, the prediction is computed as:

`F(x) = F_k['logistic'][lm_m](x)`.
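
The two-step prediction rule can be sketched as follows. This example assumes that each cluster is represented by a centroid and that local models are plain callables; `predict_one`, the centroids, and the toy local models are all hypothetical and chosen only to make the routing visible.

```python
import numpy as np

def sq_euclidean(x, c):
    # Squared Euclidean divergence between a point x and a centroid c.
    return float(((x - c) ** 2).sum())

def predict_one(x, centroids, local_models, divergence):
    # Step 1: find the cluster whose centroid is nearest under the divergence.
    dists = [divergence(x, c) for c in centroids]
    m = int(np.argmin(dists))
    # Step 2: predict with the local model attached to that cluster.
    return local_models[f'lm{m}'](x)

# Two hypothetical centroids and two toy local models.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
local_models = {
    'lm0': lambda x: float(x.sum()),
    'lm1': lambda x: float(2.0 * x.sum()),
}

x_new = np.array([9.0, 11.0])
prediction = predict_one(x_new, centroids, local_models, sq_euclidean)
```

Here `x_new` is closer to the second centroid, so the prediction comes from `lm1`.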

In [1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from kfc_model import KFCModel
import numpy as np


In [2]:
# Generate a synthetic regression dataset

X, y = make_regression(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    noise=20,
    random_state=42
)
# Split into train/test: 2000 * 0.8 = 1600 training samples.
# Inside fit(), these are split again into 800 pretraining and 800 aggregation samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KFCModel(
    divergence=[
        'euclidean',
        {'name': 'gkl', 'n_init': 50, 'n_clusters': 4, 'max_iter': 300, 'tol': 1e-6, 'random_state': None}
    ],
    local_model_name='linear',   # linear regression
    combiner_name='mean'
)

model.fit(X_train, y_train)


| Parameter | Value |
|---|---|
| divergence | `['euclidean', {'max_iter': 300, 'n_clusters': 4, 'n_init': 50, 'name': 'gkl', ...}]` |
| local_model_name | `'linear'` |
| local_model_params | `{}` |
| combiner_name | `'mean'` |
| combiner_params | `{}` |


In [3]:
# Display the cluster sizes for each Bregman divergence
for bd, cluster in model.clusters_.items():
    print(f'{bd} : {np.unique(cluster, return_counts=True)}')

BD1 : (array([0, 1, 2]), array([272, 257, 271]))
BD2 : (array([0, 1, 2, 3]), array([386, 267,  61,  86]))


In [4]:
for idx, (bd, models) in enumerate(model.candidate_models_.items()):
    print(f"Divergence {idx} : {bd}")
    for key, ml in models.items():
        print(f'{key} : {ml}')


Divergence 0 : BD1
lm0 : LinearRegression()
lm1 : LinearRegression()
lm2 : LinearRegression()
Divergence 1 : BD2
lm0 : LinearRegression()
lm1 : LinearRegression()
lm2 : LinearRegression()
lm3 : LinearRegression()


In [5]:
model.kmeans_models_

{'BD1': KMeansBregman(),
 'BD2': KMeansBregman(divergence='gkl', n_clusters=4, n_init=50)}

In [6]:
model.combiner

<kfc_model.combiner.MeanCombiner at 0x117b72750>

In [8]:
model.combined_predictions_.shape

(800,)
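
For intuition about the shape above, here is a minimal sketch of what a mean combiner might compute, assuming it simply averages the candidate predictions point by point. This is an assumption about `MeanCombiner`, whose implementation is not shown here; the prediction values are made up.

```python
import numpy as np

# Hypothetical predictions of two candidate models on the same four aggregation points.
candidate_predictions = np.array([
    [1.0, 2.0, 3.0, 4.0],   # candidate built with the first divergence
    [3.0, 2.0, 1.0, 0.0],   # candidate built with the second divergence
])

# Average over candidates (axis 0), leaving one combined prediction per point.
combined = candidate_predictions.mean(axis=0)
```

The combined vector has one entry per aggregation point, which is consistent with the `(800,)` shape for the 800 aggregation samples.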

In [10]:
from sklearn.metrics import mean_squared_error


y_pred = model.predict(X_test)

# Evaluate on the held-out test set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"KFCRegressor RMSE: {rmse:.4f}")

KFCRegressor RMSE: 19.9198


AttributeError: 'LinearRegression' object has no attribute 'n_classes_'