**New distributed learning rule**
Each agent has 2 sets of belief vectors:
- Local belief vector
    - Updated in a Bayesian manner based on only private observations (no neighbor influence)
- Actual belief vector
    - Updated as a minimum of the agent's own local belief and the actual beliefs of its neighbors on that particular hypothesis
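
The two updates above can be written out as a worked formula. This is a sketch under assumptions, not a quote from the source: the symbol $\pi_{i,t}$ for the local belief, the use of $\mu_{i,t}$ for the actual belief, the choice of the in-neighbor set $N_i^{in}(t)$, and the normalization of the minimum (needed so that the result is again a probability vector) are all assumed here.

$$
\pi_{i,t+1}(\theta) = \frac{l_i(s_{i,t+1}|\theta)\,\pi_{i,t}(\theta)}{\sum_{p=1}^{m} l_i(s_{i,t+1}|\theta_p)\,\pi_{i,t}(\theta_p)}
$$

$$
\mu_{i,t+1}(\theta) = \frac{\min\left\{\{\mu_{j,t}(\theta)\}_{j \in N_i^{in}(t)},\ \pi_{i,t+1}(\theta)\right\}}{\sum_{p=1}^{m} \min\left\{\{\mu_{j,t}(\theta_p)\}_{j \in N_i^{in}(t)},\ \pi_{i,t+1}(\theta_p)\right\}}
$$

The first line is a plain Bayes step driven only by agent $i$'s own signal $s_{i,t+1}$. The second line takes, for each hypothesis separately, the smallest value among the neighbors' actual beliefs and the agent's own fresh local belief, so a hypothesis that any contributing agent has ruled out stays ruled out.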

**Strict improvement in Rate of Learning**
- Under the proposed rule, every agent rejects every false hypothesis exponentially fast
- Rate lower bounded by the best relative entropy (between true state and false hypothesis) among all agents
- Network must be static and strongly connected

**Resilience to Adversaries**
- Provably correct resilient version of proposed learning rule
- Each regular agent can infer the truth exponentially fast

**Model and Problem Formulation**
- Group of agents $V = \{1, 2, ..., n\}$
- $G(t) = (V, E(t))$ is a time-varying directed communication graph, $t \in \mathbb{N}$
- Edge $(i, j) \in E(t)$ means $i$ can directly transmit information to agent $j$ at time step $t$
    - $(i, j) \in E(t)$ means $i$ is in-neighbor of $j$, and $j$ is out-neighbor of $i$
    - $N_i^{in}(t)$ = set of in-neighbors of $i$, $N_i^{in}(t) \cup \{i\}$ = inclusive neighborhood of $i$
    - $|C|$ = cardinality of a set $C$
- $\Theta = \{{\theta}_1, {\theta}_2, ..., {\theta}_m\}$ = $m$ possible states of the world, each ${\theta}_i \in \Theta$ is a hypothesis (think of $\theta$ as the parameter vector of a model; the different states are different parameter vectors)
- At each time-step $t$, every agent $i \in V$ observes a signal $s_{i,t} \in S_i$, where $S_i$ is the signal space of agent $i$ (think of the signal as the local stochastic gradient computed by every agent)
- The joint observation profile is then $s_t = (s_{1,t}, s_{2,t}, ..., s_{n,t})$, where $s_t \in S$, and $S = S_1 \times S_2 \times ... \times S_n$ (that would be a joint observation profile of all clients' local gradients)
- $s_t$ is generated based on a conditional likelihood function $l(\cdot|{\theta}^*)$, where ${\theta}^* \in \Theta$ is the true state of the world (the optimal model)
- $l_i(\cdot|{\theta}^*), i \in V$ is the $i$-th marginal of $l(\cdot|{\theta}^*)$ (i.e. $l(\cdot|{\theta}^*)$ summed over the signals of all agents other than $i$)
- The signal structure of each agent $i$ is then characterized by the family of marginals: $\{l_i(\omega _i|\theta):\theta \in \Theta, \omega _i \in S_i\}$
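
To make the marginal $l_i(\cdot|\theta)$ concrete, here is a minimal sketch that sums a joint likelihood over the other agents' signals. The joint table and the helper name `marginal` are hypothetical and chosen only for illustration; the table is deliberately correlated across the two agents, which the model allows.

```python
def marginal(joint, agent):
    """Sum a joint likelihood over every agent's signal except `agent`'s.

    `joint` maps a signal profile (one entry per agent) to its probability
    under one fixed hypothesis theta. `agent` is a 0-based position.
    """
    out = {}
    for profile, prob in joint.items():
        out[profile[agent]] = out.get(profile[agent], 0.0) + prob
    return out

# Hypothetical joint likelihood l(. | theta) for two agents with signals H/T.
joint = {
    ("H", "H"): 0.4,
    ("H", "T"): 0.1,
    ("T", "H"): 0.2,
    ("T", "T"): 0.3,
}

l1 = marginal(joint, 0)  # agent 1's marginal: sums out agent 2's signal
l2 = marginal(joint, 1)  # agent 2's marginal: sums out agent 1's signal
print(l1, l2)
```

Each agent only needs its own marginal for the update rule, even when the joint profile is not a product of the marginals, as is the case in this table.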

Standard Assumptions:
- Signal space of each agent $i$, namely $S_i$ is finite (finite number of signals (gradients))
- Each agent $i$ has knowledge of its local likelihood functions $\{l_i(\cdot|{\theta}_p)\}_{p=1}^m$, and $l_i({\omega}_i|{\theta}) > 0, \forall {\omega}_i \in S_i, \forall \theta \in \Theta$ (the agent knows the model that generates its observations)
- The observation sequence of each agent is described by an iid random process over time, but the observations of different agents may potentially be correlated (so the data distribution stays the same over time, but the clients can be non-iid)
- There exists a fixed true state of the world ${\theta}^* \in \Theta$ that generates the observations of all agents

Definitions:
- Probability triple $(\Omega, \mathcal{F}, \mathbb{P}^{{\theta}^*})$, where :
    - $\Omega := \{\omega:\omega = (s_1, s_2, \ldots), s_t \in S, t \in \mathbb{N}^+\}$ (so $\omega$ is an infinite sequence of signal profiles (gradients) over time)
    - $\mathcal{F}$ is the $\sigma$-algebra generated by observation profiles
    - $\mathbb{P}^{{\theta}^*}$ is the probability measure induced by sample paths in $\Omega$, specifically: $\mathbb{P}^{{\theta}^*} = \prod_{t=1}^{\infty}l(\cdot|{\theta}^*)$

Remarks:
1. Standard assumption 4 is critical in the analysis
2. The rules do not apply to parameter spaces (Then is it applicable to FL?)
3. Standard assumptions 1 and 2 imply that there is a constant $L \in (0, \infty)$ such that $\underset{i \in V}{\max} \, \underset{\omega _i \in S_i}{\max} \, \underset{\theta _p, \theta _q \in \Theta}{\max} \left| \log \frac{l_i(\omega _i|\theta _p)}{l_i(\omega _i|\theta _q)} \right| \leq L$
4. Goal of each agent is to find the true state $\theta ^*$ (optimal model)
5. Since the private signal structure is only partially informative for any given agent, we define ${\Theta}_i^{\theta ^*} := \{\theta \in \Theta:l_i(\omega _i | \theta) = l_i(\omega _i | \theta ^*), \forall \omega _i \in S_i\}$, i.e. ${\Theta}_i^{\theta ^*}$ is the set of hypotheses that are observationally equivalent to the true state $\theta ^*$ from the perspective of agent $i$
6. We might have $|{\Theta}_i^{\theta ^*}| > 1$ for any agent $i$, which necessitates the collaboration among agents

More Definitions:
- Source agents:
    - Agent $i$ is source agent for a pair of distinct hypotheses $\theta _p, \theta _q \in \Theta$ if $K_i(\theta _p, \theta _q) > 0$
    - $K_i(\theta _p, \theta _q)$ is the Kullback-Leibler divergence between distributions $l_i(\cdot|\theta _p)$ and $l_i(\cdot|\theta _q)$, given by: $K_i(\theta _p, \theta _q) = \underset{\omega _i \in S_i}{\sum}l_i(\omega _i|\theta _p) \log \frac{l_i(\omega _i|\theta _p)}{l_i(\omega _i|\theta _q)}$
    - Set of all source agents for pair $\theta _p, \theta _q $ is denoted as $S(\theta _p, \theta _q )$
    - A source agent for the pair of hypotheses $\theta _p, \theta _q $ is an agent that can distinguish between $\theta _p, \theta _q $ based on its private signal structure
    - $S(\theta _p, \theta _q ) = S(\theta _q, \theta _p )$ since $K_i(\theta _p, \theta _q) > 0 \iff K_i(\theta _q, \theta _p) > 0$
    - We assume each $\theta \in \Theta$ is globally identifiable
- Global identifiability:
    - Each pair $\theta _p, \theta _q \in \Theta, \theta _p \neq \theta _q $ has a set $S(\theta _p, \theta _q )$ of agents (nodes) that can distinguish between the pair $\theta _p, \theta _q $
    - For each such pair, this set $S(\theta _p, \theta _q )$ is nonempty
- Joint Strong-Connectivity: 
    - Simply said, there is a constant $T \in \mathbb{N}_+$ such that union graph (union of edge sets for given times $t$) over every interval of the form $[rT, (r+1)T)$ is strongly connected ($r \in \mathbb{N}$)
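
The joint strong-connectivity condition can be checked mechanically on a finite schedule of edge sets. The sketch below is only an illustration under assumptions: the definition quantifies over all $r \in \mathbb{N}$, whereas this code can only inspect the complete windows of a finite list, and it assumes time starts at index 0. The function names and the example schedule are hypothetical.

```python
def strongly_connected(nodes, edges):
    """True if every node reaches every other node along directed edges."""
    def reaches_all(adj):
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == len(nodes)

    fwd, rev = {}, {}
    for u, v in edges:
        fwd.setdefault(u, set()).add(v)
        rev.setdefault(v, set()).add(u)
    # Strongly connected iff one node reaches all others in the graph
    # and in the reversed graph.
    return reaches_all(fwd) and reaches_all(rev)

def jointly_strongly_connected(nodes, edge_sets, T):
    """Check the union graph over every complete window [rT, (r+1)T)."""
    n_windows = len(edge_sets) // T
    if n_windows == 0:
        return False
    for r in range(n_windows):
        union = set().union(*edge_sets[r * T:(r + 1) * T])
        if not strongly_connected(nodes, union):
            return False
    return True

# Hypothetical schedule: E(t) alternates between two sparse edge sets.
nodes = {1, 2, 3}
schedule = [{(1, 2)}, {(2, 3), (3, 1)}] * 3

print(jointly_strongly_connected(nodes, schedule, T=2))
print(jointly_strongly_connected(nodes, schedule, T=1))
```

Neither edge set is strongly connected on its own, but their union over any window of length 2 forms the cycle 1 to 2 to 3 to 1, which is exactly the situation the definition is meant to cover.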

Example:
- Network of 2 agents: 1, 2. There are 3 models: $\theta _1, \theta _2, \theta _3$. There are 2 signals: H, T
- The likelihoods are as follows:
    - Agent 1:
        - $l_1(s_1 = H|\theta _1) = 0.50, l_1(s_1 = T|\theta _1) = 0.50$
        - $l_1(s_1 = H|\theta _2) = 0.75, l_1(s_1 = T|\theta _2) = 0.25$
        - $l_1(s_1 = H|\theta _3) = 0.50, l_1(s_1 = T|\theta _3) = 0.50$
    - Agent 2:
        - $l_2(s_2 = H|\theta _1) = 0.33, l_2(s_2 = T|\theta _1) = 0.67$
        - $l_2(s_2 = H|\theta _2) = 0.33, l_2(s_2 = T|\theta _2) = 0.67$
        - $l_2(s_2 = H|\theta _3) = 0.17, l_2(s_2 = T|\theta _3) = 0.83$
- The common signal space is: $S_1 = S_2 = \{H, T\}$
- Note that ${\Theta}_1^{\theta _1} = \{\theta _1, \theta _3\}$, since agent 1 cannot distinguish them (similarly for models 1 and 2 for agent 2)
- Therefore: $S(\theta _1, \theta _2) = \{1\}$, $S(\theta _1, \theta _3) = \{2\}$, $S(\theta _2, \theta _3) = \{1, 2\}$
- Note that global identifiability holds, since every pair of distinct hypotheses has at least one source agent that can tell them apart
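
The claims in this example can be verified numerically. The sketch below recomputes the KL divergences, the source sets, the observationally equivalent set of agent 1, and the smallest valid constant $L$ from the remarks, using the likelihoods listed above. The dictionary layout and function names are hypothetical, and `theta1` is taken as the true state only for the equivalence-set check.

```python
from itertools import combinations
from math import log

HYPOTHESES = ["theta1", "theta2", "theta3"]

# Likelihoods from the example: agent -> hypothesis -> signal -> probability.
LIKELIHOODS = {
    1: {"theta1": {"H": 0.50, "T": 0.50},
        "theta2": {"H": 0.75, "T": 0.25},
        "theta3": {"H": 0.50, "T": 0.50}},
    2: {"theta1": {"H": 0.33, "T": 0.67},
        "theta2": {"H": 0.33, "T": 0.67},
        "theta3": {"H": 0.17, "T": 0.83}},
}

def kl(agent, p, q):
    """K_i(theta_p, theta_q) for agent i's marginals."""
    lp, lq = LIKELIHOODS[agent][p], LIKELIHOODS[agent][q]
    return sum(lp[w] * log(lp[w] / lq[w]) for w in lp)

def sources(p, q):
    """S(theta_p, theta_q): agents whose KL divergence for the pair is positive."""
    return {i for i in LIKELIHOODS if kl(i, p, q) > 0}

def equivalent_set(agent, true_state):
    """Hypotheses the agent cannot distinguish from the true state."""
    return {q for q in HYPOTHESES if kl(agent, true_state, q) == 0}

# Smallest constant L bounding every absolute log-likelihood ratio.
L = max(
    abs(log(LIKELIHOODS[i][p][w] / LIKELIHOODS[i][q][w]))
    for i in LIKELIHOODS
    for p in HYPOTHESES
    for q in HYPOTHESES
    for w in ("H", "T")
)

globally_identifiable = all(sources(p, q) for p, q in combinations(HYPOTHESES, 2))

for p, q in combinations(HYPOTHESES, 2):
    print(p, q, sorted(sources(p, q)))
print(equivalent_set(1, "theta1"), globally_identifiable, L)
```

The exact comparison with 0 works here because identical likelihood rows give a ratio of exactly 1; with likelihoods estimated from data, a small tolerance would be the safer choice.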

**Proposed learning rule**
- Novel belief update rule
- Each agent $i$ at time $t$ has 2 separate sets of belief vectors


Theorem 1:
- States that every agent in the network rejects each false hypothesis exponentially fast, at a rate at least as large as the largest relative entropy between the true and the false hypothesis across the network.
- The log term is the log of the belief $\mu$ on the false hypothesis ($\mu$ is a probability measure over $\Theta$). The belief itself approaches 0
- The max picks the largest relative entropy between the true and the false hypothesis across the network (specifically, across the source agents that can distinguish the two)
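
A small simulation on the two-agent example makes the statement tangible. This is a sketch under several assumptions that are not in the notes: $\theta_1$ is the true state, the two agents are linked in both directions at every step, their signals are drawn independently, all beliefs start uniform, and the minimum is normalized so that it stays a probability vector. The names `P_HEADS`, `NEIGHBORS`, and `simulate` are hypothetical.

```python
import random
from math import log

# P(signal = H | theta) per agent, from the example; P(T) = 1 - P(H).
P_HEADS = {1: [0.50, 0.75, 0.50], 2: [0.33, 0.33, 0.17]}
NEIGHBORS = {1: [2], 2: [1]}  # assumed static, bidirectional link
M = 3                         # theta_1..theta_3 stored as indices 0..2

def lik(agent, theta, signal):
    p = P_HEADS[agent][theta]
    return p if signal == "H" else 1.0 - p

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def kl(agent, p, q):
    return sum(lik(agent, p, s) * log(lik(agent, p, s) / lik(agent, q, s))
               for s in "HT")

def simulate(steps, true_theta=0, seed=0):
    rng = random.Random(seed)
    local = {i: [1.0 / M] * M for i in P_HEADS}
    actual = {i: [1.0 / M] * M for i in P_HEADS}
    for _ in range(steps):
        prev = {i: list(b) for i, b in actual.items()}
        for i in P_HEADS:
            signal = "H" if rng.random() < P_HEADS[i][true_theta] else "T"
            # Local belief: Bayes step on the private signal only.
            local[i] = normalize(
                [lik(i, th, signal) * local[i][th] for th in range(M)])
            # Actual belief: per-hypothesis min of own local belief and
            # the neighbors' previous actual beliefs, then normalized.
            actual[i] = normalize([
                min([local[i][th]] + [prev[j][th] for j in NEIGHBORS[i]])
                for th in range(M)
            ])
    return actual

steps = 2000
beliefs = simulate(steps)
for theta in (1, 2):  # the two false hypotheses
    bound = max(kl(i, 0, theta) for i in P_HEADS)
    for i in P_HEADS:
        rate = -log(beliefs[i][theta]) / steps
        print(f"agent {i}, theta_{theta + 1}: "
              f"empirical rate {rate:.3f}, best KL across agents {bound:.3f}")
```

The printed empirical rate is a finite-time estimate, so it only approximates the asymptotic bound. For much longer runs the beliefs on false hypotheses underflow to 0 in floating point, and a log-domain implementation would be needed.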

Theorem 2:
- TODO

Questions:
- I wanted to ask about the likelihood function $l(\cdot|{\theta})$. I take it that it's a likelihood of observing some signal $\omega _i$ (i.e. in FL the gradients) given parameters $\theta$ (i.e. current model). But the gradients aren't drawn from some distribution, but rather computed using SGD, so in that case how can we calculate the likelihood function? Maybe several SGD updates on the same point, different batches, then create a distribution? But that would be very inefficient
- The framework assumes discrete (finite) signal spaces; has it been done with continuous ones? (Is that even possible?)

**Meeting Notes**

- 