Summary:
- Looks at device-to-server (D2S) + device-to-device (D2D) communication in federated learning (FL) model training
- D2D clusters = time varying + directed communication graphs
- Algorithm -> controls trade-off between 
    - rate of convergence to global optimizer
    - number of D2S transmissions required for global aggregation
- Main point: D2D updates injected into FL averaging framework based on column-stochastic weight matrices that encapsulate the connectivity within clusters
    - Basically the connectivity matrices in clusters decide on when to send updates
- Expected optimality gap (between current global model and optimal global model) depends on 2 biggest singular values of weighted adjacency matrices of the D2D clusters

Key contributions:
- Each D2D cluster is time-varying digraph -> the expected optimality gap depends on 2 greatest singular values of the weighted adjacency matrices for local aggregations in the clusters
- Singular value bounds in terms of node degrees -> the singular values are bounded by the degree distribution of every cluster
- Connectivity-Aware Learning Algorithm -> Singular value bounds are used to design a time-varying threshold on the number of clients required to be sampled by server for global aggregation to enforce a specified convergence rate while reducing number of D2S communications
- Effect of data heterogeneity under mild gradient diversity assumptions -> the expected optimality gap is bounded by a value that captures cluster densities and data heterogeneity across the devices
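
To make the "2 greatest singular values" concrete, here is a minimal sketch that builds a column-stochastic weight matrix for one cluster and reads off its two largest singular values. The equal-weight rule in `equal_weight_matrix`, the function names, and the two 4-node example graphs are assumptions for illustration, not the paper's construction.

```python
import numpy as np

def equal_weight_matrix(adj):
    """Column-stochastic W from a 0/1 adjacency matrix without self-loops,
    where adj[i, j] = 1 iff there is an edge i -> j.

    Assumed rule (not taken from the paper): every node j adds a self-loop and
    splits its weight equally among itself and its out-neighbors.
    """
    adj = np.asarray(adj, dtype=float)
    M = (adj + np.eye(len(adj))).T          # M[i, j] > 0 iff j -> i or i == j
    return M / M.sum(axis=0, keepdims=True)  # normalize each column to sum to 1

def top_two_singular_values(W):
    """Two largest singular values (np.linalg.svd returns them in descending order)."""
    return np.linalg.svd(W, compute_uv=False)[:2]

n = 4
dense = np.ones((n, n)) - np.eye(n)   # every node sends to every other node
ring = np.zeros((n, n))               # node i sends only to node i + 1 (mod n)
for i in range(n):
    ring[i, (i + 1) % n] = 1.0

W_dense = equal_weight_matrix(dense)
W_ring = equal_weight_matrix(ring)

print("dense cluster:", top_two_singular_values(W_dense))
print("ring cluster: ", top_two_singular_values(W_ring))
```

Under this assumed weighting, the fully connected cluster gives singular values of about $(1, 0)$ and the directed ring about $(1, 0.707)$, so the second singular value is where the two connectivity patterns differ.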

Notation:
- $\mathbf{v} \in \mathbb{R}^n$ is stochastic $\iff \mathbf{v} \geq 0$ (i.e. $v_i \geq 0 \; \forall i$) and $\mathbf{v}^T\mathbf{1} = 1$ (i.e. $[v_1, v_2, \ldots, v_n][1, 1, \ldots, 1]^T = 1$), so all components are nonnegative and they sum to 1
- $A \in \mathbb{R}^{m \times n}$ is a column-stochastic matrix $\iff A \geq 0$ (entrywise) and each column of $A$ sums to 1, i.e. $A^T\mathbf{1} = \mathbf{1}$: $\left[\begin{array}{c}
\mathbf{col}_1^T\\
\mathbf{col}_2^T \\
\vdots \\
\mathbf{col}_n^T
\end{array}\right] \left[\begin{array}{c}
1\\
1 \\
\vdots \\
1
\end{array}\right] = \left[\begin{array}{c}
1\\
1 \\
\vdots \\
1
\end{array}\right]$
- A is symmetric if $A = A^T$
- Consensus matrix = influence of nodes in a network, ex: $\left[\begin{array}{ccc}
c_{1,1} & \cdots & c_{1,n} \\
\vdots & \ddots & \vdots \\
c_{n,1} & \cdots & c_{n,n}
\end{array}\right] \Rightarrow c_{i,j}$ is how node $i$ influences node $j$
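
The definitions above can be checked numerically. This is a small sketch using the usual convention of entrywise nonnegative entries; the helper names and the example matrix `A` are made up for illustration.

```python
import numpy as np

def is_stochastic_vector(v, tol=1e-9):
    """True if v is entrywise nonnegative and its entries sum to 1."""
    v = np.asarray(v, dtype=float)
    return bool(np.all(v >= -tol) and abs(v.sum() - 1.0) <= tol)

def is_column_stochastic(A, tol=1e-9):
    """True if A is entrywise nonnegative and every column sums to 1 (A^T 1 = 1)."""
    A = np.asarray(A, dtype=float)
    return bool(np.all(A >= -tol) and np.allclose(A.sum(axis=0), 1.0, atol=tol))

# Hypothetical example: columns sum to 1, rows do not, and A != A^T.
A = np.array([[0.5, 0.0, 0.3],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 0.7]])

print(is_stochastic_vector([0.2, 0.3, 0.5]))  # a valid stochastic vector
print(is_column_stochastic(A))                # columns of A sum to 1
print(is_column_stochastic(A.T))              # rows of A do not sum to 1
```

Each column of `A` is itself a stochastic vector, which is another way to read the column-stochastic condition.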

Assumptions:
- D2D network communications are NOT necessarily bidirectional (cluster graphs are directed)
- Column-stochastic consensus matrices need not be symmetric

Technical Challenges:
1. Cannot use standard eigenvalue results in the analysis (must use singular values)
2. Column-stochastic aggregation matrices do NOT ensure convergence to consensus in the absence of a central coordinator -> the analysis must account for the combined effect of global aggregation + column stochasticity
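
Both challenges show up on a small example. The sketch below assumes each node holds a single scalar and uses a hypothetical 3-node matrix `A`. It only illustrates the two phenomena, not the paper's algorithm.

```python
import numpy as np

# Hypothetical 3-node strongly connected digraph with self-loops;
# column j splits node j's value equally among itself and its out-neighbors.
A = np.array([[1/3, 0.0, 0.5],
              [1/3, 0.5, 0.0],
              [1/3, 0.5, 0.5]])
assert np.allclose(A.sum(axis=0), 1.0)   # column-stochastic
assert not np.allclose(A, A.T)           # not symmetric

# Challenge 1: eigenvalues of a non-symmetric matrix can be complex,
# while singular values are always real and nonnegative.
eigenvalues = np.linalg.eigvals(A)
singular_values = np.linalg.svd(A, compute_uv=False)

# Challenge 2: repeated local aggregation x <- A x with no server involved.
x0 = np.array([1.0, 2.0, 3.0])
x = x0.copy()
for _ in range(200):
    x = A @ x

print("eigenvalues:    ", eigenvalues)
print("singular values:", singular_values)
print("limit of A^k x0:", x)  # entries differ, so the nodes do not reach consensus
print("sum preserved:  ", np.isclose(x.sum(), x0.sum()))
```

Here the iterates approach $(2, 4/3, 8/3)$ instead of the average $2$ in every entry. The sum $\mathbf{1}^T x = 6$ is preserved because $\mathbf{1}^T A = \mathbf{1}^T$, so averaging all entries (as a server could) still recovers the true mean. That is one way to see why global aggregation and column stochasticity have to be analyzed together.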


Semi-Decentralized FL Setup:
- $n$ local clients + 1 central parameter server (PS) that aggregates all local models
- $[n] :=$ set of clients
- $D_i :=$ local dataset of each client $i \in [n]$
- $\xi = (u,y) \in D_i$ is a data sample, where $u \in \mathbb{R}^p$ is a feature vector of the sample, and $y$ is its label
- $x$ is a model
- Loss function $L$ is defined as $L: \mathbb{R}^p \times {\cup}_{i = 1}^nD_i \rightarrow \mathbb{R}$, so that $L(x, \xi)$ denotes loss incurred by $x$ on a sample $\xi \in {\cup}_{i=1}^nD_i$
    - Note: ${\cup}_{i=1}^nD_i$ is the global dataset
    - <font color='red'>Question</font> Why is $x \in \mathbb{R}^p$? And why is $L: \mathbb{R}^p \times {\cup}_{i = 1}^nD_i \rightarrow \mathbb{R}$? Since $x$ is the model, shouldn't it be in arbitrary space?
- Average loss incurred by model $x$ over local dataset of client $i$ is: $f_i(x) := \frac{1}{|D_i|}{\sum}_{\xi \in D_i}L(x, \xi)$
    - $f_i :=$ local loss function of client i
- Clients seek to minimize the global loss function $f(x) := \frac{1}{n}{\sum}_{i = 1}^nf_i(x)$
- Learning objective: find the global optimum $x^* := \underset{x \in \mathbb{R}^p}{\operatorname{argmin}}\, f(x)$
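
A minimal sketch of the objective above, assuming a linear model with squared loss $L(x, (u, y)) = \frac{1}{2}(u^Tx - y)^2$. The loss choice, the function names, and the toy datasets are illustrative assumptions; the notes do not fix a specific loss. In this sketch the model $x$ is identified with its parameter vector, which is why it lives in $\mathbb{R}^p$.

```python
import numpy as np

def sample_loss(x, u, y):
    """Assumed loss L(x, xi) for one sample xi = (u, y): squared error of a linear model."""
    return 0.5 * (u @ x - y) ** 2

def local_loss(x, D_i):
    """f_i(x): average loss of model x over client i's local dataset D_i."""
    return sum(sample_loss(x, u, y) for u, y in D_i) / len(D_i)

def global_loss(x, datasets):
    """f(x): unweighted average of the local losses f_1, ..., f_n."""
    return sum(local_loss(x, D_i) for D_i in datasets) / len(datasets)

# Two hypothetical clients with p = 2 features and different dataset sizes.
datasets = [
    [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 2.0)],  # D_1
    [(np.array([1.0, 1.0]), 3.0)],                               # D_2
]

x_star = np.array([1.0, 2.0])  # fits every sample exactly, so f(x_star) = 0
x_zero = np.zeros(2)

print("f(x_star) =", global_loss(x_star, datasets))
print("f(0)      =", global_loss(x_zero, datasets))
```

Note that $f$ averages over clients, not over samples: at $x = 0$ the local losses are $1.25$ and $4.5$, so $f(0) = 2.875$, whereas pooling all three samples would give $7/3$.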

D2D and D2S Network Models:
- D2S interactions -> devices can send to PS if prompted by it
- D2D network = time-varying directed graph $G(t) = ([n], E(t))$, where $[n] := $ vertex set and $E(t) := $ edge set of the digraph
    - if a directed edge from node $i \in [n]$ to another node $j \in [n]$ exists in $G(t)$, then there is a communication link from $i$ to $j$ in the D2D network
    - $i$ = in-neighbor of $j$
    - $j$ = out-neighbor of $i$
    - $\mathcal{N}_i^-(t) :=$ set of in-neighbors of client $i \in [n]$ at time $t$
    - $\mathcal{N}_i^+(t) :=$ set of out-neighbors of client $i \in [n]$ at time $t$
    - $d_i^-(t)$ = in-degree = number of in-neighbors of node $i$
    - $d_i^+(t)$ = out-degree = number of out-neighbors of node $i$
    - $d_{max}^-(t)$ = max in-degree
    - $d_{min}^+(t)$ = min out-degree
    - $d_{max}^+(t)$ = max out-degree
- Assumptions for D2D:
    - Not strongly/uniformly connected over time
    - So $G(t)$ has $c > 1$ strongly connected components (a strongly connected component = a maximal set of vertices in which every vertex can reach every other vertex along directed edges), denoted $\{(V_1(t), E_1(t)), (V_2(t), E_2(t)), \ldots, (V_c(t), E_c(t))\}$ = clusters of the D2D network
    - $c$ = number of clusters is time invariant
    - There is no communication link between any 2 clusters. So, $E(t) = {\cup}_{l=1}^cE_l(t)$
    - The server has full knowledge of the vertex sets $\{V_l(t)\}_{l=1}^c$ at any time $t$
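
The graph quantities above can be computed for a single snapshot $G(t)$. This is a sketch only: the edge set is a made-up example, nodes are 0-indexed (the notes use $[n]$), and the component search is a simple mutual-reachability check rather than an efficient algorithm.

```python
def neighbors(n, edges):
    """In- and out-neighbor sets for a digraph given as a set of (i, j) edges i -> j."""
    in_nbrs = {i: set() for i in range(n)}
    out_nbrs = {i: set() for i in range(n)}
    for i, j in edges:
        out_nbrs[i].add(j)
        in_nbrs[j].add(i)
    return in_nbrs, out_nbrs

def reachable(start, out_nbrs):
    """All nodes reachable from start along directed edges (including start)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in out_nbrs[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strongly_connected_components(n, edges):
    """Clusters: i and j share a component iff each can reach the other."""
    _, out_nbrs = neighbors(n, edges)
    reach = {i: reachable(i, out_nbrs) for i in range(n)}
    components, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            comp = {j for j in reach[i] if i in reach[j]}
            components.append(comp)
            assigned |= comp
    return components

# Hypothetical snapshot with n = 5 clients and no edges between clusters.
n = 5
edges = {(0, 1), (1, 2), (2, 0), (0, 2), (3, 4), (4, 3)}

in_nbrs, out_nbrs = neighbors(n, edges)
in_deg = {i: len(in_nbrs[i]) for i in range(n)}
out_deg = {i: len(out_nbrs[i]) for i in range(n)}
clusters = strongly_connected_components(n, edges)

print("clusters:", clusters)
print("max in-degree:", max(in_deg.values()))
print("min / max out-degree:", min(out_deg.values()), max(out_deg.values()))
```

In this example every edge has both endpoints inside one cluster, matching the assumption $E(t) = {\cup}_{l=1}^cE_l(t)$ with $c = 2$.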

Proposed Method: TODO + convergence analysis + all proofs!