# Federated Learning
Communication-Efficient Learning of Deep Networks from Decentralized Data

H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Agüera y Arcas

## Motivation

* explosion in the number of mobile devices
* these have powerful processing capability
* How can these be leveraged to train a global, shared model?


## Applications

* language models
  * speech recognition
  * text entry
* image classification
  * predicting which photos are likely to be shared or viewed

## Considerations

* Data privacy
* Communication
  * mobile network charges constrain the level of communication

## Basic Approach

* split the training of a model across clients
* aggregate the results of training on clients on a server
* no training data is transferred from client to server

## Federated Learning vs Distributed Learning #1

Federated learning involves:

* Non-IID data
* Unbalanced
* Massively Distributed
  * more clients than the average number of samples per client
* Limited communication
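
The non-IID point can be illustrated with a small partitioning sketch. The function name `partition_non_iid`, the toy labels, and the shard counts are all assumptions made for illustration. The idea is to sort samples by label, cut them into shards, and hand each client only a couple of shards, so each client sees only a few classes.

```python
import numpy as np

def partition_non_iid(labels, num_clients, shards_per_client=2, seed=0):
    """Split sample indices across clients so each client sees few classes."""
    rng = np.random.default_rng(seed)
    # Sorting by label groups samples of the same class together.
    order = np.argsort(labels, kind="stable")
    # Cut the sorted indices into contiguous shards (mostly single-class).
    shards = np.array_split(order, num_clients * shards_per_client)
    # Deal the shards out to the clients in random order.
    shard_ids = rng.permutation(len(shards))
    return [
        np.concatenate([shards[s] for s in shard_ids[k::num_clients]])
        for k in range(num_clients)
    ]

# Toy data (hypothetical): 600 samples, 10 classes, 60 samples per class.
labels = np.repeat(np.arange(10), 60)
clients = partition_non_iid(labels, num_clients=20)

# Number of distinct classes held by each client.
classes_per_client = [len(np.unique(labels[idx])) for idx in clients]
```

With these toy sizes every shard holds a single class, so no client holds more than two of the ten classes. This sketch produces equally sized clients; an unbalanced split would additionally vary the number of shards per client.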

## Federated Learning vs Distributed Learning #2

|                         | Computation Cost | Communication Cost |
| :- | -: | :-: |
| Federated Optimization   | Low | High |
| Distributed Optimization | High | Low |

Federated learning seeks to use increased computation on the clients to decrease the number of communication rounds required to achieve a well-trained model.

## Formalism

* the training set is the union of all (mutually disjoint) training samples from all clients

## Notation

* $n$: number of training samples
* $K$: number of clients
* $n_k$: number of samples on client $k$
* $\mathscr{P}_k$: set of indices of the training samples on client $k$, so that $n_k = |\mathscr{P}_k|$

## Loss

The goal is to minimize the average of the per-sample losses, $f_i(w)$, over all $n$ samples.

$$
\begin{align}
f(w) &= \frac{1}{n} \sum_{i=1}^n f_i(w)\\
&= \frac{1}{n} \sum_{k=1}^K \sum_{i \in \mathscr{P}_k} f_i(w)\\
&= \frac{1}{n} \sum_{k=1}^K \left( \frac{n_k}{n_k} \sum_{i \in \mathscr{P}_k} f_i(w) \right) \\
&= \sum_{k=1}^K \frac{n_k}{n} \left( \frac{1}{n_k} \sum_{i \in \mathscr{P}_k} f_i(w) \right) \\
&= \sum_{k=1}^K \frac{n_k}{n} F_k(w)
\end{align}
$$

Here $F_k(w) = \frac{1}{n_k} \sum_{i \in \mathscr{P}_k} f_i(w)$ is the average loss on client $k$. The global loss is therefore the average of the client losses, weighted by each client's share of the data, $n_k / n$.
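
The decomposition can be checked numerically. The sketch below uses made-up per-sample losses and a hypothetical partition into $K = 3$ clients; none of the values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample losses f_i(w) at some fixed w, with n = 10 samples.
sample_losses = rng.random(10)

# P_k: disjoint index sets, one per client (K = 3, unbalanced on purpose).
partitions = [np.array([0, 1, 2, 3, 4, 5]), np.array([6, 7, 8]), np.array([9])]

n = len(sample_losses)
global_loss = sample_losses.mean()  # f(w): average over all samples

client_losses = [sample_losses[p].mean() for p in partitions]  # F_k(w)
client_weights = [len(p) / n for p in partitions]              # n_k / n

# Weighted average of the client losses.
weighted_loss = sum(wk * Fk for wk, Fk in zip(client_weights, client_losses))

# The two quantities agree up to floating-point error.
assert np.isclose(global_loss, weighted_loss)
```

An unweighted mean of the client losses would not, in general, match the global loss when the clients hold different numbers of samples.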

## Combining two trainings of a model

The field of distributed optimization has come up with techniques for combining separately trained versions of a model. For two trainings, this can be done by taking a convex combination of the trained weights.

$$
w_{\text{avg}} = \theta w_1 + (1 - \theta) w_2, \quad \text{with } \theta \in [0, 1]
$$

For this to work, the models need to be trained in a compatible manner, by starting from the same random weight initialization. In the left panel of the figure below, the two models started from different initial weights, and the loss increases as they are combined. In the right panel, the models shared the same initialization, and the loss decreases as the weights are combined.



![title](combining_models.png)
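
The convex combination itself is simple to write down. The following is a minimal sketch, assuming each model's weights are stored as a list of NumPy arrays (one per layer). The helper names `interpolate` and `loss_along_path`, and the stand-in `toy_loss`, are hypothetical; a real experiment would evaluate the network's actual loss on held-out data.

```python
import numpy as np

def interpolate(weights_1, weights_2, theta):
    """Return theta * w1 + (1 - theta) * w2, applied layer by layer."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [theta * a + (1.0 - theta) * b for a, b in zip(weights_1, weights_2)]

def loss_along_path(weights_1, weights_2, loss_fn, num_points=11):
    """Evaluate loss_fn at evenly spaced mixing weights theta in [0, 1]."""
    thetas = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn(interpolate(weights_1, weights_2, t)) for t in thetas]
    return thetas, np.array(losses)

def toy_loss(weights):
    """Stand-in loss (sum of squared weights), for illustration only."""
    return float(sum(np.sum(w ** 2) for w in weights))

# Two toy "models", each with one weight matrix and one bias vector.
w1 = [np.ones((2, 2)), np.zeros(2)]
w2 = [np.zeros((2, 2)), np.ones(2)]
thetas, losses = loss_along_path(w1, w2, toy_loss)
```

Plotting `losses` against `thetas` gives a curve of the kind shown in the figure. The toy loss here is convex, so it cannot reproduce the loss increase seen with mismatched initializations; that behavior comes from the non-convex loss of a real network.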

## <tt>FederatedAveraging</tt> Algorithm

Training on a client can be seen as a large mini-batch SGD operation. In practice, each round of the <tt>FederatedAveraging</tt> algorithm trains on only a fraction, $C \in (0, 1]$, of the clients.

![title](FedAvg_algorithm.png)
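
The loop in the figure can be sketched in NumPy on a toy problem. This is an illustrative sketch under stated assumptions, not the authors' implementation: the model is noise-free linear regression, the function names `client_update` and `federated_averaging`, the hyperparameters, and the synthetic data are all made up, and the averaging weights are normalized over the clients selected in each round.

```python
import numpy as np

def client_update(w, X, y, epochs, batch_size, lr, rng):
    """Run local mini-batch SGD on one client's data, starting from w."""
    w = w.copy()
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch.
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def federated_averaging(clients, dim, rounds=50, C=0.5, epochs=5,
                        batch_size=10, lr=0.05, seed=0):
    """clients is a list of (X_k, y_k) pairs; returns the global weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)                              # shared initialization
    m = max(1, int(round(C * len(clients))))       # clients per round
    for _ in range(rounds):
        chosen = rng.choice(len(clients), size=m, replace=False)
        # Each selected client trains locally from the current global w.
        updates = [client_update(w, *clients[k], epochs, batch_size, lr, rng)
                   for k in chosen]
        sizes = np.array([len(clients[k][1]) for k in chosen], dtype=float)
        # Server step: average the client weights, weighted by n_k
        # (assumption: normalized over the selected clients only).
        w = np.average(updates, axis=0, weights=sizes)
    return w

# Synthetic, unbalanced clients that share one true weight vector.
data_rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for n_k in [10, 20, 40, 80]:
    X = data_rng.normal(size=(n_k, 3))
    clients.append((X, X @ w_true))

w_global = federated_averaging(clients, dim=3)
```

Only weight vectors cross the client/server boundary in this sketch; the `(X, y)` pairs are touched only inside `client_update`. Raising `epochs` or lowering `batch_size` adds computation per client per round, which is the lever the algorithm uses to reduce the number of communication rounds.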
