In [1]:
%cd ..

/home/alberto/PycharmProjects/incomplete_multiview_clustering


# Tutorial: Impute incomplete modality and feature-wise multi-modal data

## Prerequisites

You'll need the following libraries installed: matplotlib, seaborn

## Step 1: Import required libraries

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, accuracy_score
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.linear_model import LogisticRegression
from datasets import LoadDataset
from imml.impute import MOFAImputer, DFMFImputer, get_observed_view_indicator
from imml.preprocessing import MultiViewTransformer
from imml.ampute import Amputer
from imml.preprocessing import ConcatenateViews
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
from tueplots import axes, bundles
plt.rcParams.update(**bundles.icml2022(), **axes.lines())
for key in ["axes.labelsize", "axes.titlesize", "font.size", "legend.fontsize", "xtick.labelsize", "ytick.labelsize"]:
    if key == "legend.fontsize":
        plt.rcParams[key] += 3
    else:
        plt.rcParams[key] += 6

## Step 2: Load the dataset

We'll use the statlog dataset, a multi-view dataset with 2310 samples and 2 modalities (9 and 10 features, respectively), to demonstrate how to handle multi-view data. Each view represents a distinct set of features for the same set of samples.

In [4]:
Xs, y = LoadDataset.load_dataset(dataset_name="statlog", return_y= True)
print("Samples:", len(Xs[0]), "\t", "Modalities:", len(Xs), "\t", "Features:", [X.shape[1] for X in Xs])
y.value_counts()

Samples: 2310 	 Modalities: 2 	 Features: [9, 10]


0
0    330
1    330
2    330
3    330
4    330
5    330
6    330
Name: count, dtype: int64

## Step 3: Apply missing data mechanism (Amputation)

Using Amputer, we randomly introduce missing data to simulate a scenario where some modalities are missing. Here, 30% of the samples will be incomplete.

In [None]:
amputed_Xs = Amputer(p= 0.3, mechanism="mcar", random_state=42).fit_transform(Xs)
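
To make the mechanism concrete, here is a minimal NumPy sketch of block-wise MCAR amputation. It is an illustration under stated assumptions, not imml's `Amputer` implementation: the helper name `ampute_mcar_blockwise` is hypothetical, views are plain arrays, and every incomplete sample is assumed to keep at least one observed view.

```python
import numpy as np

def ampute_mcar_blockwise(Xs, p, seed=0):
    """Hypothetical helper: remove whole views for a fraction p of the samples."""
    if len(Xs) < 2:
        raise ValueError("need at least two views")
    rng = np.random.default_rng(seed)
    n_samples, n_views = Xs[0].shape[0], len(Xs)
    out = [np.array(X, dtype=float) for X in Xs]  # copies, inputs stay untouched
    n_incomplete = int(round(p * n_samples))
    # Samples are picked uniformly at random, independently of the data (MCAR).
    incomplete = rng.choice(n_samples, size=n_incomplete, replace=False)
    for i in incomplete:
        # Drop between 1 and n_views - 1 views (upper bound is exclusive),
        # so each sample keeps at least one observed view.
        n_drop = rng.integers(1, n_views)
        for v in rng.choice(n_views, size=n_drop, replace=False):
            out[v][i, :] = np.nan
    return out

Xs_demo = [np.ones((100, 3)), np.ones((100, 4))]
amputed_demo = ampute_mcar_blockwise(Xs_demo, p=0.3, seed=42)
missing_per_view = np.column_stack([np.isnan(X).all(axis=1) for X in amputed_demo])
n_incomplete_samples = int(missing_per_view.any(axis=1).sum())
print(n_incomplete_samples)  # 30 of the 100 samples lack one view
```

With two views, each incomplete sample loses exactly one of them, so the count of incomplete samples equals `p` times the number of samples.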

You can visualize which modalities are missing using a binary color map.

In [None]:
xlabel,ylabel = "Modality", "Samples"
observed_view_indicator = get_observed_view_indicator(amputed_Xs).sort_values(list(range(len(amputed_Xs))))
plt.pcolor(observed_view_indicator, cmap="binary_r")
plt.xticks(np.arange(0.5, len(observed_view_indicator.columns), 1), observed_view_indicator.columns)
_ = plt.xlabel(xlabel), plt.ylabel(ylabel)
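
The matrix plotted above can be thought of as a samples-by-views table of booleans. The sketch below builds such a table with NumPy; `observed_indicator` is a hypothetical helper, and it assumes a view counts as missing for a sample when all of its features are NaN. The behaviour of imml's `get_observed_view_indicator` may differ in detail.

```python
import numpy as np

def observed_indicator(Xs):
    """Hypothetical helper: True where a sample has at least one observed value in a view."""
    return np.column_stack(
        [~np.isnan(np.asarray(X, dtype=float)).all(axis=1) for X in Xs]
    )

view_a = np.array([[1.0, 2.0], [np.nan, np.nan], [3.0, 4.0]])
view_b = np.array([[np.nan], [5.0], [6.0]])
indicator = observed_indicator([view_a, view_b])
print(indicator)
# [[ True False]
#  [False  True]
#  [ True  True]]
```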

## Step 4: Impute missing data 

We apply a pipeline with two steps: standardization of the features, followed by imputation of the missing modalities using MOFAImputer, a method designed for multi-view data imputation.

In [None]:
pipeline = make_pipeline(MultiViewTransformer(StandardScaler().set_output(transform="pandas")),
                         MOFAImputer(n_components = 8, random_state=42))
imputed_Xs = pipeline.fit_transform(amputed_Xs)
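
For intuition about factor-based imputation, the sketch below fills missing entries with an iterative truncated SVD. This is a deliberately simplified stand-in and not the MOFA algorithm: the function `low_rank_impute`, the toy rank-1 data, and the number of iterations are all assumptions made for illustration.

```python
import numpy as np

def low_rank_impute(X, n_components, n_iter=50):
    """Iterative truncated-SVD imputation (a simplified stand-in, not MOFA)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Rank-n_components reconstruction of the current completed matrix.
        approx = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        filled = np.where(mask, approx, X)  # only missing entries are overwritten
    return filled

rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 5))  # rank-1 toy data
mask = rng.random(truth.shape) < 0.1                          # 10% feature-wise missing
observed = np.where(mask, np.nan, truth)

mean_filled = np.where(mask, np.nanmean(observed, axis=0), observed)
low_rank_filled = low_rank_impute(observed, n_components=1)

mae_mean = np.abs(truth[mask] - mean_filled[mask]).mean()
mae_low_rank = np.abs(truth[mask] - low_rank_filled[mask]).mean()
print(f"mean: {mae_mean:.3f}, low-rank: {mae_low_rank:.3f}")
```

On data with shared latent structure, a low-rank model can exploit correlations between features that a column-mean fill ignores, which is the motivation for using a factor model in the pipeline above.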

You can again visualize the dataset after imputation to observe the filled modalities.

In [None]:
observed_view_indicator = get_observed_view_indicator(imputed_Xs).sort_values(list(range(len(imputed_Xs))))
plt.pcolor(observed_view_indicator, cmap="binary")
plt.xticks(np.arange(0.5, len(observed_view_indicator.columns), 1), observed_view_indicator.columns)
_ = plt.xlabel(xlabel), plt.ylabel(ylabel)

## Step 5: Evaluate the imputation performance

We will calculate the Mean Absolute Error (MAE) between the true values (before amputation) and the imputed values, restricted to the places where data were missing. The MAE helps quantify how well the imputation was performed. In addition to the MOFA-based pipeline, we introduce a baseline imputation method that uses SimpleImputer to fill in missing values with the average.
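
As a small worked example of this metric, the sketch below computes the MAE only on the masked entries after a column-mean fill. The toy arrays and variable names are made up for illustration.

```python
import numpy as np

X_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
mask = np.array([[True, False], [False, False], [False, True]])  # True = missing

X_missing = np.where(mask, np.nan, X_true)
col_means = np.nanmean(X_missing, axis=0)          # means of observed values: [2.5, 15.0]
X_imputed = np.where(mask, col_means, X_missing)   # baseline: fill with column means

# Errors are |1 - 2.5| = 1.5 and |30 - 15| = 15, evaluated only where data were missing.
mae = float(np.mean(np.abs(X_true[mask] - X_imputed[mask])))
print(mae)  # 8.25
```

Restricting the metric to the mask matters: observed entries are reproduced exactly, so including them would dilute the error.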

Define a range of missingness proportions (ps) and vary the number of components of the imputer (n_components_list). We will generate both block- and feature-wise missing data and repeat each configuration several times (n_times) for robustness.
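
Feature-wise missingness means that individual cells, rather than whole modalities, are removed. A minimal sketch, assuming each cell is dropped independently with probability `p` (the helper `ampute_featurewise` is hypothetical):

```python
import numpy as np

def ampute_featurewise(X, p, seed=0):
    """Hypothetical helper: set each cell to NaN independently with probability p."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)        # copy so the input stays intact
    mask = rng.random(X.shape) < p      # True marks a removed cell
    X[mask] = np.nan
    return X, mask

X_demo = np.arange(20000, dtype=float).reshape(2000, 10)
X_amputed, mask = ampute_featurewise(X_demo, p=0.3, seed=0)
print(round(float(mask.mean()), 3))     # observed fraction of removed cells, close to p
```

Returning the mask alongside the amputed array makes it easy to evaluate the imputation only on the cells that were removed.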

In [5]:
ps = np.arange(0.1, 1, 0.2)
n_components_list = [1, 2, 4, 8, 16]
mechanisms = ["um", "pm", "mcar", "mnar"]
n_times = 50
algorithms = ["DFMFImputer", "MeanImputer"]
all_metrics = {}

In [6]:
for algorithm in tqdm(algorithms):
    all_metrics[algorithm] = {}
    for mechanism in tqdm(mechanisms):
        all_metrics[algorithm][mechanism] = {}
        for p in ps:
            missing_percentage = int(p*100)
            all_metrics[algorithm][mechanism][missing_percentage] = {}
            for n_components in n_components_list:
                if (algorithm == "MeanImputer") and (n_components != n_components_list[0]):
                    all_metrics[algorithm][mechanism][missing_percentage][n_components] = all_metrics[algorithm][mechanism][missing_percentage][n_components_list[0]]
                    continue
                all_metrics[algorithm][mechanism][missing_percentage][n_components] = {}
                for i in range(n_times):
                    all_metrics[algorithm][mechanism][missing_percentage][n_components][i] = {}
                    if algorithm == "MeanImputer":
                        pipeline = make_pipeline(
                            MultiViewTransformer(SimpleImputer().set_output(transform="pandas")))
                    elif algorithm == "DFMFImputer":
                        normalizer = StandardScaler()
                        alg = DFMFImputer  # this branch only runs for DFMFImputer, so eval() is unnecessary
                        pipeline = make_pipeline(
                            MultiViewTransformer(normalizer.set_output(transform="pandas")),
                            alg(n_components = n_components, random_state=i))
                    amputed_Xs = Amputer(p= p, mechanism=mechanism, random_state=i).fit_transform(Xs)
                    for X in amputed_Xs:
                        X.iloc[np.random.default_rng(i).choice([True, False], p= [p,1-p], size = X.shape)] = np.nan
                    masks = [np.isnan(amputed_X) for amputed_X in amputed_Xs]
                    try:
                        imputed_Xs = pipeline.fit_transform(amputed_Xs)
                        if algorithm == "MeanImputer":
                            transformer_list = pipeline[-1].transformer_list_

                        else:
                            transformer_list = pipeline[-2].transformer_list_
                            imputed_Xs = [pd.DataFrame(transformer.inverse_transform(X), index=X.index, columns=X.columns)
                                          for X, transformer in zip(imputed_Xs, transformer_list)]
                        metric = np.mean([mean_absolute_error(transformed_X.values[mask], imputed_X.values[mask])
                                          for transformed_X,imputed_X,mask in zip(Xs, imputed_Xs, masks)])
                        all_metrics[algorithm][mechanism][missing_percentage][n_components][i]["Mean Absolute Error"] = metric
                        pipeline = make_pipeline(ConcatenateViews(), StandardScaler(), LogisticRegression(random_state=i))
                        pipeline.fit(imputed_Xs, y)
                        pred = pipeline.predict(imputed_Xs)
                        metric = accuracy_score(y_true=y, y_pred=pred)
                        all_metrics[algorithm][mechanism][missing_percentage][n_components][i]["Accuracy"] = metric
                        all_metrics[algorithm][mechanism][missing_percentage][n_components][i]["Comments"] = ""
                    except Exception as ex:
                        all_metrics[algorithm][mechanism][missing_percentage][n_components][i]["Comments"] = ex

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/4 [00:00<?, ?it/s]

  0%|          | 0/4 [00:00<?, ?it/s]

  y = column_or_1d(y, warn=True)

In [8]:
pipeline = make_pipeline(
    MultiViewTransformer(StandardScaler().set_output(transform="pandas")),
    DFMFImputer(n_components = 5, random_state=0))
pipeline.fit_transform(Xs)

DataFusionError: Unknown relation.

After collecting the results from both the MOFA and Baseline imputation methods, we flatten the data into a structured format for easy comparison using visualizations.

In [7]:
flattened_data = [
    {
        'Algorithm': algorithm,
        'Mechanism': mechanism,
        r'Missing rate (\%)': p,
        'Components': n_components,
        'Iteration': i,
        **iter_dict
    }
    for algorithm, algorithm_dict in all_metrics.items()
    for mechanism, mechanism_dict in algorithm_dict.items()
    for p, p_dict in mechanism_dict.items()
    for n_components, n_components_dict in p_dict.items()
    for i, iter_dict in n_components_dict.items()
]
df = pd.DataFrame(flattened_data)
df = df.sort_values(["Algorithm", "Mechanism", r"Missing rate (\%)", "Components", "Iteration"], ascending=[True, False, True, True, True])
# df.to_csv("tutorials/impute_results.csv", index= None)
df

| | Algorithm | Mechanism | Missing rate (\%) | Components | Iteration | Comments | Mean Absolute Error | Accuracy |
|---|---|---|---|---|---|---|---|---|
| 0 | DFMFImputer | um | 10 | 1 | 0 | "Expected 2D array, got scalar array instead:\n..." | | |
| 1 | DFMFImputer | um | 10 | 1 | 1 | "Expected 2D array, got scalar array instead:\n..." | | |
| 2 | DFMFImputer | um | 10 | 1 | 2 | "Expected 2D array, got scalar array instead:\n..." | | |
| 3 | DFMFImputer | um | 10 | 1 | 3 | "Expected 2D array, got scalar array instead:\n..." | | |
| 4 | DFMFImputer | um | 10 | 1 | 4 | "Expected 2D array, got scalar array instead:\n..." | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8745 | MeanImputer | mcar | 90 | 16 | 45 | | 17.523764 | 0.284416 |
| 8746 | MeanImputer | mcar | 90 | 16 | 46 | | 17.311331 | 0.279654 |
| 8747 | MeanImputer | mcar | 90 | 16 | 47 | | 17.287035 | 0.279654 |
| 8748 | MeanImputer | mcar | 90 | 16 | 48 | | 17.152892 | 0.267532 |


In [None]:
df = pd.read_csv("tutorials/impute_results.csv")
df

In [None]:
errors = df[df["Comments"].notnull()]
print("errors", errors.shape)
errors

In [None]:
df.groupby(["Algorithm", r"Missing rate (\%)"])["Accuracy"].mean()

## Step 6: Visualize the Results

We'll use Seaborn to create point plots that show the imputation error for different levels of missing data and varying MOFA components.

In [None]:
df = df.replace(
    {"Algorithm": {"MOFAImputer": "MOFA", "MeanImputer": "Mean"}}).rename(
    columns= {"Algorithm": "Imputation"})
mechanism_names = {"um": "Unpaired missing",
             "pm": "Partial missing",
             "mcar": "Missing completely at random",
             "mnar": "Missing not at random",
             }

In [None]:
g = sns.FacetGrid(data=df, col="Mechanism", row="Components", legend_out=False, sharey=False,
                  despine= False).map_dataframe(sns.pointplot, x=r"Missing rate (\%)",
                                                y="Mean Absolute Error", hue="Imputation",
                                                capsize= 0.05, seed= 42,
                                                palette= sns.color_palette("colorblind"),
                                                linestyles= ["-", "--"])

handles = [plt.Line2D([0], [0], color=color, lw=2, linestyle=linestyle)
           for linestyle, color in zip(["-", "--"], sns.color_palette("colorblind"))]
g.axes[0,0].legend(handles=handles, labels=df["Imputation"].unique().tolist(),
                   loc="best", title= "Imputation")

for row_axes, n_components in zip(g.axes, df["Components"].unique()):  # avoid shadowing tueplots' `axes`
    for ax, mechanism in zip(row_axes, df["Mechanism"].unique()):
        ax.set_title(f"{mechanism.upper()}, Components = {n_components}")

plt.tight_layout()
# plt.savefig("paper_figures/imputation_re_comps.pdf")
# plt.savefig("paper_figures/imputation_re_comps.svg")

In [None]:
g = sns.FacetGrid(data=df, col="Mechanism", row="Components", legend_out=False, sharey=False,
                  despine= False, ylim = (0.1, 1.05)).map_dataframe(sns.pointplot, x=r"Missing rate (\%)",
                                                                    y="Accuracy", hue="Imputation", 
                                                                    capsize= 0.05, seed= 42,
                                                palette= sns.color_palette("colorblind"),
                                                linestyles= ["-", "--"],
                                                                    )

handles = [plt.Line2D([0], [0], color=color, lw=2, linestyle=linestyle)
           for linestyle, color in zip(["-", "--"], sns.color_palette("colorblind"))]
g.axes[0,0].legend(handles=handles, labels=df["Imputation"].unique().tolist(),
                   loc="best", title= "Imputation")

for row_axes, n_components in zip(g.axes, df["Components"].unique()):  # avoid shadowing tueplots' `axes`
    for ax, mechanism in zip(row_axes, df["Mechanism"].unique()):
        ax.set_title(f"{mechanism.upper()}, Components = {n_components}")

plt.tight_layout()
# plt.savefig("paper_figures/imputation_acc_comps.pdf")
# plt.savefig("paper_figures/imputation_acc_comps.svg")

In [None]:
plt.figure(figsize= (4, 3))
ax = sns.pointplot(data=df[(df["Mechanism"] == "mcar") & (df["Components"] == 16)].replace(
    {"Algorithm": {"MOFAImputer": "MOFA", "MeanImputer": "Mean"}}).rename(
    columns= {"Algorithm": "Imputation"}),
              x=r"Missing rate (\%)", y="Mean Absolute Error", hue="Imputation",
              capsize= 0.05, seed= 42, palette= sns.color_palette("colorblind"),
              linestyles= ["-", "--"])

handles = [plt.Line2D([0], [0], color=color, lw=2, linestyle=linestyle)
           for linestyle, color in zip(["-", "--"], sns.color_palette("colorblind"))]
ax.legend(handles=handles, labels=df["Imputation"].unique().tolist(),
                   loc="best", title= "Imputation")

plt.savefig("paper_figures/imputation_re.pdf")
plt.savefig("paper_figures/imputation_re.svg")

In [None]:
plt.figure(figsize= (4, 3))
ax = sns.pointplot(data=df[(df["Mechanism"] == "mcar") & (df["Components"] == 16)].replace(
    {"Algorithm": {"MOFAImputer": "MOFA", "MeanImputer": "Mean"}}).rename(
    columns= {"Algorithm": "Imputation"}),
              x=r"Missing rate (\%)", y="Accuracy", hue="Imputation",
              capsize= 0.05, seed= 42, palette= sns.color_palette("colorblind"),
              linestyles= ["-", "--"])

ax.get_legend().remove()

plt.savefig("paper_figures/imputation_acc.pdf")
plt.savefig("paper_figures/imputation_acc.svg")

These plots show the imputation error for both methods (MOFA and baseline) across different levels of missing data. The x-axis represents the missing rate (ranging from 10% to 90%), and the y-axis represents the imputation error (MAE). In the faceted figures, each row of subplots corresponds to a different number of components (1, 2, 4, 8, and 16, from top to bottom) and each column to a missingness mechanism. The blue solid line represents the MOFA method, while the orange dashed line represents the baseline method. Error bars indicate the variability in performance across runs (95% confidence interval).
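
The 95% confidence intervals drawn by `sns.pointplot` are, by default, obtained by bootstrapping the mean. The sketch below reproduces the idea with NumPy on synthetic error values; the helper `bootstrap_ci` and the simulated data are assumptions for illustration, and the numbers will not match seaborn's exactly.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, ci=95, seed=42):
    """Hypothetical helper: percentile bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - ci) / 2
    lo, hi = np.percentile(boot_means, [tail, 100 - tail])
    return float(lo), float(hi)

rng = np.random.default_rng(0)
errors = rng.normal(loc=1.0, scale=0.2, size=50)  # stand-in for 50 repeated runs
low, high = bootstrap_ci(errors)
print(f"mean={errors.mean():.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Narrow intervals indicate that the repeated runs agree with each other; overlapping intervals between two methods suggest the difference between them may not be reliable.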

For both methods, the imputation error tends to increase as the missing rate increases. MOFA outperforms the baseline method at low and moderate missing rates. At extreme missing rates, however, MOFA is not able to achieve better values than the baseline, indicating a collapse due to the lack of information.

Next, we focus on how the imputation error changes as we increase the number of components.

As the missing rate increases, the imputation error decreases more markedly with higher values of C, especially for the largest values of C. This suggests that using more components improves the quality of imputation, particularly when dealing with datasets that have a high proportion of missing values.

## Summary of results

MOFA is generally superior to the baseline method in terms of imputation error. Increasing the number of components (C) significantly improves imputation accuracy, especially for highly incomplete datasets, making it a key factor in reducing error. While the baseline method can narrow the gap with MOFA, and can match it at extreme missing rates, MOFA otherwise yields better performance, making it the preferable choice for multi-view imputation tasks.

## Conclusion

This comparison highlights the strength of MOFAImputer in handling multi-view datasets. In summary, MOFA stands out as the more robust method, particularly for datasets with moderate to high proportions of missing data and when the number of components (C) is sufficiently large.