## Official Review of Submission933 by Reviewer qi5c
__Summary__:
The authors present a graph-based approach for temporal data imputation using cross multi-head attention. The model assumes input of time-stamped observations with metadata and value of interest for imputation and prediction. Imputation requires target timestamps and metadata (in addition to the graph context of the input data including values) to predict the output value for the targets. They demonstrate superior performance when compared to baselines, particularly those based on RNNs and assuming discrete-time observations. The main advantage of their model in the demonstrated experiments seems to be due to the application of multi-head attention to construct a graph representation of the time series.

__Soundness__: 2 fair
__Presentation__: 3 good
__Contribution__: 2 fair
__Strengths__:
1. The authors clearly outline their modeling approach, referencing the relevant literature for embedding of temporal data, metadata, and the prediction values. The transformer encoding is intuitive and powerful, naturally inducing a continuous-time graph representation of the time series.
2. The flexibility of the approach is demonstrated methodologically in Section 4 and empirically in Section 6. The advantage of an attention-based inter-observation influence mechanism is clearly outlined as compared to RNN-type methods. The authors connect to the vision literature when investigating the data augmentation in Section 6, which is insightful.

__Weaknesses__:
1. As the key advantage of this work (when compared to the current baselines) is the use of attention, I think it is problematic that the authors have not compared extensively to attention-based temporal data imputation techniques. Particularly, [1] seems to be a very similar model and work, as they also evaluate on the same datasets. See their Table 1; AGG outperforms theirs in 10% and 50% missing data for PhysioNet, but they claim superior performance for 90%. The Beijing Air Quality data seem to be scaled differently, but you should also compare to their result.

In summary, a comparison should be made empirically and the differences in the methodology should be highlighted to other attention-based works for data imputation in temporal data.

2. The ability to impute continuous-time data is emphasized early in the paper, but then is not exploited in the experiments. While it is clear that this method outperforms discrete-time methods, I think a comparison to continuous-time imputation techniques would be insightful. For example, it seems GP-VAE does not use RNNs, and may also not require time discretization. You could hypothetically decode the Gaussian process to arbitrary-time outputs, and compare more directly to your approach. Similarly, while there is not extensive literature on point process models for data imputation, they represent another class of continuous-time model which could be used to impute missing temporal data. For example, [2] introduces the PILES model for such a task. See also [3] Section 5.4.

In summary, I think an emphasis on the ability of your method to address continuous-time data imputation could address some limitations and similarities between other attention-based approaches. Additional experiments could highlight the true novelty of this approach, which I believe to be an attention-based method for __continuous-time__ data imputation.

3. It seems a strong assumption that the time and (especially) metadata of unknown targets are known - can you demonstrate the ability to predict the metadata, at least? Regardless, I think it is a bit of a limitation that the timestamp of the imputed observation must be provided.

4. A minor comment: perhaps outlining the benefits/properties of each baseline would be helpful.

[1] Yıldız, A. Yarkın, Emirhan Koç, and Aykut Koç. "Multivariate time series imputation with transformers." IEEE Signal Processing Letters 29 (2022): 2517-2521.

[2] Chen, Jiadong, et al. "An Adaptive Data-Driven Imputation Model for Incomplete Event Series." International Conference on Advanced Data Mining and Applications. Cham: Springer Nature Switzerland, 2023.

[3] Shchur, Oleksandr, Marin Biloš, and Stephan Günnemann. "Intensity-free learning of temporal point processes." arXiv preprint arXiv:1909.12127 (2019).

__Questions__:
As above:

1. Please comment on the difference between your approach and other attention-based approaches for temporal data imputation, and compare to these apparently strong baselines.
2. Consider continuous-time models, such as a small variation of the GP-VAE baseline and perhaps a point process type of imputation model.
3. Demonstrate that you can also predict metadata given only the timestamp of a new observation.

Small questions/comments:

4. Outline the purpose and details of the baselines.
5. Line above Eq (3) “AGG is a heterogeneous graph” but it is actually homogeneous?
6. Does the MLP hidden layer need to have $l\times d_{\text{encode}}$ nodes, or is this an arbitrary choice?
7. After cross multi-head attention in Figure 3, a residual connection is shown between the output of CrossMultiHead and $h_l$. However, Eq (10) shows the residual connection between the output of CrossMultiHead and $g_0$.
8. In implementation details “the input layer of dimension $5 \times 16$” where does the 5 come from in this case?

__Rating__: 3 reject, not good enough
__Confidence__: 4 You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

## Response to Reviewer qi5c

### Heterogeneous Graph and metadata example
From the reviewer's comments, we can see that we failed to explain how exactly the metadata is obtained and why it is a valid assumption that it is known at the prediction step. This is also related to why we refer to the graph as heterogeneous: by including the metadata as an embedding, we can cast the heterogeneous problem to a homogeneous one. We believe a comprehensive example will help to clarify this point. The Beijing data set, for example, consists of 12 geographically separated weather stations:
$$ station \in \{
    \text{Aotizhongxin},
    \text{Changping},
    \text{Dingling},
    \text{Dongsi},
    \text{Guanyuan},
    \text{Gucheng},\\
    \text{Huairou},
    \text{Nongzhanguan},
    \text{Shunyi},
    \text{Tiantan},
    \text{Wanliu},
    \text{Wanshouxigong}
\} $$
For each of these stations, the data set provides measurements from different types of sensors (although these are often missing):
$$ sensor \in \{
    \text{PM2.5},
    \text{PM10},
    \text{SO2},
    \text{NO2},
    \text{CO},
    \text{O3},
    \text{TEMP},
    \text{PRES},
    \text{DEWP},
    \text{RAIN},
    \text{WSPM}
\} $$
There is also additional categorical metadata for each station: the $WindDirection$, which takes one of the cardinal or intercardinal directions (or none):
$$ WindDirection \in \{
    \text{NNW},
    \text{E},
    \text{NW},
    \text{WNW},
    \text{N},
    \text{ENE},
    \text{NNE},
    \text{W},
    \text{NE},
    \text{SSW},
    \text{ESE},
    \text{SE},
    \text{S},
    \text{SSE},
    \text{SW},
    \text{WSW},
    \text{None}
\} $$

The features $y$ are the measurements of any $sensor$ at any location $station$. One way to express the interconnection between physical locations and sensors is to use a heterogeneous graph, where the nodes are the $station$ and $sensor$ combinations and the edges are the measurements $y$. The metadata is the $WindDirection$; it is not connected to the graph, but is used to extend the information of the features through a categorical embedding. The metadata is known for all stations and sensors (as it is an implicit characteristic of each node), but the measurements $y$ are not known for all sensors at all stations (as they have temporal, spatial, and type dependence). The measurements $y$ are the values we want to predict, but we need to know the $station$ and $sensor$ in order to predict them. This makes sense intuitively: at time $t$, the model could not otherwise know for which $sensor$ and at which $station$ we wish to predict a value. The value of $PM2.5$ will vary across physical locations that are separated by many kilometres, so it makes sense to _condition_ the generation step on this information.

Finally, the $timestamp$ is an independent value: it is simply the time of the measurement relative to the start of the batch. There is no restriction on its value, so we can select it arbitrarily; it is then encoded through a learnt `time2vec` embedding module (to capture frequency dynamics). Of course, during training we choose time values for which we have measurements.
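To illustrate how an arbitrary timestamp can be embedded, here is a minimal sketch assuming the standard Time2Vec form (one linear component followed by sinusoidal components). The function names, the embedding dimension of 16, and the random initialisation are illustrative assumptions; in the actual model the frequencies and phases would be learnt rather than sampled.

```python
import math
import random


def init_time2vec(dim, seed=0):
    """Sample frequencies and phases (a stand-in for learnt parameters)."""
    rng = random.Random(seed)
    omega = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    phi = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return omega, phi


def time2vec(t, omega, phi):
    """Embed a scalar time t: first component linear, the rest sinusoidal."""
    linear = [omega[0] * t + phi[0]]
    periodic = [math.sin(w * t + p) for w, p in zip(omega[1:], phi[1:])]
    return linear + periodic


omega, phi = init_time2vec(dim=16)
# Any real-valued timestamp is a valid query, not only observed grid points.
emb = time2vec(3.75, omega, phi)
```

Because the embedding is a function of a real-valued $t$, nothing in this step ties the query to a fixed sampling grid.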

We hope this example clarifies the assumptions we make and why we refer to the graph as heterogeneous. The problem with using a heterogeneous graph in practice is that we would require a different set of weights for each node type. This is why we embed the node metadata and instead treat the graph as homogeneous, where the sensor type and the geographical location are encoded in the node embedding and concatenated along with the measurement $y$. This is also the origin of “the input layer of dimension $5 \times 16$”: the 5 is the number of separate embedding types (feature, station, type, time, and categorical, which is wind direction in this case) and the 16 is the dimension of each resulting embedding. To ease the notational burden (perhaps to the detriment of clarity) we referred to station, type and categorical data collectively as "metadata".
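The $5 \times 16$ input layout can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the lookup tables are randomly initialised stand-ins for learnt embeddings, the scalar measurement and the time are embedded with a simple linear map (the time embedding would be `time2vec` in practice), and all names are hypothetical.

```python
import random

D = 16  # dimension of each component embedding
STATIONS = ["Aotizhongxin", "Changping", "Dingling", "Dongsi", "Guanyuan",
            "Gucheng", "Huairou", "Nongzhanguan", "Shunyi", "Tiantan",
            "Wanliu", "Wanshouxigong"]
SENSORS = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3",
           "TEMP", "PRES", "DEWP", "RAIN", "WSPM"]
WIND = ["NNW", "E", "NW", "WNW", "N", "ENE", "NNE", "W", "NE",
        "SSW", "ESE", "SE", "S", "SSE", "SW", "WSW", "None"]

rng = random.Random(0)


def make_table(keys):
    """One D-dimensional vector per category (stand-in for a learnt table)."""
    return {k: [rng.gauss(0.0, 1.0) for _ in range(D)] for k in keys}


station_emb = make_table(STATIONS)
sensor_emb = make_table(SENSORS)
wind_emb = make_table(WIND)
w_value = [rng.gauss(0.0, 1.0) for _ in range(D)]  # projects the scalar y
w_time = [rng.gauss(0.0, 1.0) for _ in range(D)]   # placeholder for time2vec


def embed_node(y, station, sensor, wind, t):
    """Concatenate the five component embeddings into one node vector."""
    parts = [
        [w * y for w in w_value],  # feature (measurement)
        station_emb[station],      # station
        sensor_emb[sensor],        # type
        [w * t for w in w_time],   # time
        wind_emb[wind],            # categorical
    ]
    return [x for part in parts for x in part]


node = embed_node(35.0, "Dongsi", "PM2.5", "NNW", 3.75)  # length 5 * 16 = 80
```

Every node shares the same embedding tables and the same downstream weights, which is what lets the graph be treated as homogeneous.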

The generation step simply consists of selecting which "metadata" we want to condition on in order to predict the $y$ value, and the model outputs the generated value. There is no rationale for predicting the metadata, as it simply identifies which part of the "heterogeneous graph" we want to generate.

### Weaknesses 
1. We thank the reviewer for connecting our work with [1], as we were not aware of this work. We will endeavor to add a comparison with [1] in the revised manuscript. Your suggestion of including a more focused comparison with attention-based methods is also well taken.
2. We thank the reviewer for the suggestion of comparing with continuous-time imputation techniques and agree that this is a rather unexplored area of research. We will endeavor to include a comparison with GP-VAE and PILES in the revised manuscript. We also agree that the ability to impute continuous-time data is a key advantage of our method and will endeavor to highlight this in the revised manuscript. We appreciate the reviewer's insight on this point.
3. Regarding the assumption that the time and metadata of unknown targets are known, perhaps we were not clear enough in the manuscript. In the previous section we have provided a detailed example of how the metadata is obtained and why it is a valid assumption that it is known at the prediction step. The time of the unknown targets is simply the time of the prediction, which is arbitrary and can be selected by the user. We will endeavor to clarify this in the revised manuscript.

### Questions
1. Thank you for your suggestion. We agree that a more comprehensive comparison is needed and will endeavor to include this in the revised manuscript.
2. We agree that the power of the model is in its ability to impute continuous-time data and will endeavor to highlight this in the revised manuscript.
3. Please refer to the previous section for a detailed example of how the metadata is obtained and why it is a valid assumption that it is known at the prediction step. It makes no sense to predict the metadata, as it identifies the part of the "heterogeneous graph" we want to generate.
4. We will endeavor to clarify the purpose and details of the baselines in the revised manuscript.
5. Please refer to our response detailing the heterogeneous graph assumptions and why they are valid. We will endeavor to clarify this in the revised manuscript, but we maintain that this is in fact a heterogeneous graph problem, which we cast to a homogeneous graph problem by embedding the metadata.
6. The MLP hidden layer does not need to have $l\times d_{\text{encode}}$ nodes; this is an arbitrary choice. We simply followed prior work ("Attention is all you need", among others) in expanding the hidden layer by a factor of $l$ to improve the computational capacity of the model.
7. Thank you for pointing this out. Yes, the figure is incorrect: the residual connection should be between the output of CrossMultiHead and $g_0$, as shown in Eq (10). We will endeavor to correct this in the revised manuscript.
8. Please refer to our heterogeneous graph example for a detailed clarification. Briefly, in the input layer of dimension $5 \times 16$, the 5 is the number of separate embedding types (feature, station, type, time, and categorical, which is wind direction in this case) and the 16 is the dimension of each resulting embedding.
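For answer 6, the hidden-layer expansion can be sketched as below. This is a minimal illustration, assuming a two-layer feed-forward block with a ReLU activation; the values `d_encode = 8` and `l = 4` are purely illustrative (4 is the factor used in "Attention is all you need"), and the helper names are hypothetical.

```python
import random


def linear(n_in, n_out, rng):
    """Randomly initialised weights and zero biases for a dense layer."""
    scale = n_in ** -0.5
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def apply_linear(layer, x):
    """Compute W x + b for a single input vector x."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def feed_forward(x, up, down):
    """Expand to the hidden width, apply ReLU, project back down."""
    hidden = [max(0.0, h) for h in apply_linear(up, x)]
    return apply_linear(down, hidden)


rng = random.Random(0)
d_encode, l = 8, 4                           # illustrative values only
up = linear(d_encode, l * d_encode, rng)     # d_encode -> l * d_encode
down = linear(l * d_encode, d_encode, rng)   # l * d_encode -> d_encode
x = [rng.gauss(0.0, 1.0) for _ in range(d_encode)]
out = feed_forward(x, up, down)
```

The input and output widths are fixed by $d_{\text{encode}}$, so $l$ only controls the capacity of the intermediate layer and can be changed freely.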


## Official Review of Submission933 by Reviewer mCDq
__Summary__:
The work proposes a graph neural network architecture for multi-channel time-series imputation, leveraging embeddings and attention mechanisms. The proposed method reaches satisfactory performance on various real-data examples against baselines.

__Soundness__: 2 fair
__Presentation__: 1 poor
__Contribution__: 2 fair
__Strengths__:
 - The proposed AGG architecture is novel and useful for the imputation, classification, and regression tasks.
 - The proposed method is thoroughly explained in Section 3.

__Weaknesses__:
1. Literature review: Section 2 seems to ignore more recent developments on time-series imputation in the field. See a couple mentioned in this survey paper (https://arxiv.org/abs/2307.03759) and more below. In particular, the development in the field since 2022 is barely discussed.
2. Problem setup:
    - Section 3.1 introduces the problem formulation, which however is unclear to the reader. For example, is the imputation task focusing on predicting $y$? This seems to be the case as shown in Eq (12). If so, how does this differ from a standard prediction task?
    - As the authors claim "no previous GNN-based method approaches the imputation problem from the perspective of an asynchronous graph", it is important to dedicate a separate section to the formal mathematical setup of the problem, which should contain at least (1) the imputation problem, (2) how this is asynchronous, and (3) why the problem is challenging/unique such that others have not proposed ways to solve it. The current Section 3.1 is highly insufficient.
3. Experiments (existing results):
    - I find it strange to say "a common AGG architecture was implemented without hyper-parameter tuning for all datasets". Does this mean your method can always work without any tuning, even for learning rate/batch size, etc.? If not, it would be important to say clearly the implication behind this.
    - Related to the first point, how does your method perform under various hyper-parameters, if they are actually tuned? Would it significantly improve over current results?
    - How does your model capacity compare to those of the baselines? Your model has 378k trainable parameters. How about others? What architecture of theirs is adopted in the comparison?
    - The appendix should contain a table highlighting the data specifics (e.g., number of observations, number of time-series, feature dimension, etc.), as it is hard to infer these values from looking at Appendix A.1. I would suggest that the authors list these numbers in accordance with the notation in previous sections. A similar thing can be done when explaining the AGG architecture.
4. Experiments (new ones currently lacking):
    - The most recent baseline is SSGAN (Miao et al., 2021). However, many works have followed theirs; a quick search reveals [1-5] for the imputation task, and I believe more exist. I thus find it unrealistic that SSGAN is still the "state-of-the-art" method after two years.
    - How does the method perform on other time-series datasets? The current experiments closely follow SSGAN, but it would be important to examine beyond that setting established more than 2 years ago.

(Incomplete) list of related papers
[1] Miao et al., 2021: Efficient and effective data imputation with influence functions
[2] Cini et al., 2022: Filling the G_ap_s: Multivariate Time Series Imputation by Graph Neural Networks
[3] Alcaraz et al., 2023: Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models
[4] Liu et al., 2023: PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation
[5] Wu et al., 2023: Jointly Imputing Multi-View Data with Optimal Transport
__Questions__:
Questions are summarized in the weakness section above.

__Rating__: 3: reject, not good enough
__Confidence__: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

## Response to Reviewer mCDq

### Weaknesses
1. We thank the reviewer for pointing out the lack of recent developments in the field. We will endeavor to include a more comprehensive literature review in the revised manuscript.
2. We thank the reviewer for pointing out the lack of clarity in the problem setup:
    - The imputation task does indeed focus on predicting or imputing $y$. Under the formulation of the asynchronous graph, whether the task is standard prediction or imputation is a matter of perspective: as the AGG can generate samples at arbitrary time points, we can simply select whether the generation is within or before the observed time points (imputation) or after them (prediction). We will endeavor to clarify this in the revised manuscript.
    - We agree that the current Section 3.1 is highly insufficient and will endeavor to include a more comprehensive mathematical setup of the problem in the revised manuscript.
3. We thank the reviewer for pointing out the lack of clarity in the experiments:
    - 

## Official Review of Submission933 by Reviewer TnYt
__Summary__:
This paper studies the challenge of analyzing multichannel time series data, particularly focusing on issues like irregular time intervals and complex spatial-temporal relationships. It proposes a novel approach with the Asynchronous Graph Generator (AGG), a graph neural network architecture that models time series observations as nodes on a dynamic graph, facilitating data imputation and prediction. AGG's unique feature is its ability to directly embed measurements, timestamps, and metadata into nodes, using attention mechanisms to discern intricate relationships among variables — which can hardly convince the reviewer — and as claimed in the paper, this method stands out from existing models by bypassing the limitations of recurrent neural networks and conventional time series models that often assume temporal regularity.

__Soundness__: 2 fair
__Presentation__: 2 fair
__Contribution__: 2 fair
__Strengths__:
The reviewer can hardly say that they understand the paper's method. But the idea of representing a multivariate time series as nodes in a graph is truly interesting. Even though the reviewer is not familiar with the baselines in this field, the experimental results seem relatively comprehensive and convincing.

__Weaknesses__:
The reviewer can only offer some general suggestions:

1. The reviewer believes a good research paper should be educational for a general audience; they did look at Cao's work on RNNs for time series imputation and found its problem set-up and proposed approach easy to understand. Unfortunately, the current form of this paper makes it really difficult for readers without certain background knowledge to understand the setting and the contribution.
2. Given that this paper is purely empirical, the numerical experiments are the most important part to verify the performance. Table 1 may benefit from including uncertainty quantification (at the same time, the reviewer acknowledges that the improvement is quite significant).
3. Terminology should be used more carefully: the term "causal" is used multiple times (and perhaps that is the reason why the reviewer was invited to review this paper); however, it seems that "causal" merely refers to temporal order, which is "Granger causal" and means correlation from the past to the future. The reviewer suggests using simple terms like temporal order directly.
4. Scaling might be one major issue when the dimension is high and the time horizon is long (since there must be a really huge graph to represent the multivariate time series).

__Questions__:
There are two major concerns that make the reviewer lean towards rejection of this work:

1. In Fig. 1 (c), the proposed method uses the future to predict the past. The reviewer is fine with using the expressive power of neural networks to learn a latent causal representation, but a structure such as the one shown in Fig. 1 (c) seems to imply that the latent data generation mechanism depends on the future, which is clearly wrong. Can the authors justify that?

    (PS: in granger causal literature, one famous example is that "buying Christmas tree" Granger causes "Christmas". However, this example cannot justify the proposed structure since there should be a latent variable "knowing Christmas will be on 12/25" captured by the latent data generating process).

2. Another reason why the reviewer was invited to review this paper might be the use of the PhysioNet dataset. There are a lot of lab tests in that dataset where missing values themselves mean that the clinician does not suspect the related dysfunction/disease at all. That is why there are so many missing values in that dataset (a single patient cannot be suspected of having all diseases), and the missingness carries meaning. Can the authors justify why this dataset should be imputed?

__Rating__: 5: marginally below the acceptance threshold
__Confidence__: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.