# Report on "Random Reshuffling with Variance Reduction: New Analysis and Better Rates"

__Authors__: Grigory Malinovsky, Alibek Sailanbayev, Peter Richtarik

__Link__: [https://arxiv.org/abs/2104.09342](https://arxiv.org/abs/2104.09342)

<br>

## 1. Problem Statement

The paper addresses large-scale empirical risk minimization (ERM), a common task in machine learning and statistics. Specifically, it studies stochastic gradient methods for minimizing a finite-sum objective function:

$$
f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)
$$

where:
- $x \in \mathbb{R}^d$ is a vector representing the parameters (model weights, features) of a model we wish to train;
- $n$ is the total number of training data points;
- $f_i(x)$ is the (smooth) loss associated with the model on the $i$-th data point.

<br>

## 2. Importance

Efficient optimization of ERM problems is crucial in applications such as training machine learning models, deep learning, and statistical inference. The variance reduction mechanism described in this paper improves on various existing algorithms (RR-SAGA, AVRG, IAG, etc.), yielding faster convergence rates and lower computational costs, especially in the Big Data regime $n > O(\kappa)$, where $\kappa$ is the condition number.

<br>

## 3. Examples of Occurrence

- __Training Deep Neural Networks__
- __Large-scale Linear Regression__
- __Support Vector Machines__

<br>

## 4. Authors' Approach

The authors' approach is based on a sequence of reformulations of the original finite-sum optimization problem. The core idea is to perturb the objective function by adding zero, expressed as the average of $n$ nonzero linear functions. The perturbation is chosen at each epoch and remains constant throughout the epoch.

Starting with the original finite-sum problem and taking vectors $a_1, \ldots, a_n \in \mathbb{R}^d$ that sum to zero ($\sum_{i=1}^{n} a_i = 0$), the authors add this zero term to the objective $f$. This leads to a reformulated version of the initial problem, which can be expressed as:

$$
f(x) = \frac{1}{n} \sum_{i=1}^{n} \left(f_i(x) + \langle a_i, x \rangle\right) = \frac{1}{n} \sum_{i=1}^{n} \widetilde{f}_i(x) \quad (7)
$$

In this reformulation:
- $\widetilde{f}_i(x) = f_i(x) + \langle a_i, x \rangle$ is the modified loss function.
- The gradient of $\widetilde{f}_i(x)$ is given by $\nabla \widetilde{f}_i(x) = \nabla f_i(x) + a_i$.

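The important property of this reformulation is that, because the $a_i$ sum to zero, it leaves the objective $f$ and its minimizers unchanged while shifting each individual gradient by $a_i$. Choosing $a_i = -\nabla f_i(y) + \nabla f(y)$ for a control vector $y$ (these vectors sum to zero by the definition of $f$) turns $\nabla \widetilde{f}_i(x)$ into the SVRG gradient estimator used by the algorithms in Section 6.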
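
The reformulation can be checked numerically. The following is a minimal sketch, assuming hypothetical least-squares losses $f_i(x) = \frac{1}{2}(\langle u_i, x \rangle - v_i)^2$ and randomly drawn vectors $a_i$ that are centred so they sum to zero; the names `U`, `v`, `A`, `f_tilde` and `grad_fi_tilde` are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3

# Hypothetical data for least-squares losses f_i(x) = 0.5 * (<u_i, x> - v_i)^2.
U = rng.normal(size=(n, d))
v = rng.normal(size=n)

def grad_fi(i, x):
    """Gradient of the i-th loss."""
    return (U[i] @ x - v[i]) * U[i]

def f(x):
    """Original finite-sum objective (1/n) * sum_i f_i(x)."""
    return np.mean(0.5 * (U @ x - v) ** 2)

# Perturbation vectors a_1..a_n, centred so that they sum to zero.
A = rng.normal(size=(n, d))
A -= A.mean(axis=0)

def f_tilde(x):
    """Reformulated objective (1/n) * sum_i [f_i(x) + <a_i, x>]."""
    return np.mean(0.5 * (U @ x - v) ** 2 + A @ x)

def grad_fi_tilde(i, x):
    """Gradient of the modified loss: grad f_i(x) + a_i."""
    return grad_fi(i, x) + A[i]

x = rng.normal(size=d)
same_objective = np.isclose(f(x), f_tilde(x))
full_grad = np.mean([grad_fi(i, x) for i in range(n)], axis=0)
full_grad_tilde = np.mean([grad_fi_tilde(i, x) for i in range(n)], axis=0)
same_full_gradient = np.allclose(full_grad, full_grad_tilde)
```

Both the objective value and the full gradient agree between the two formulations, while each individual gradient is shifted by its own $a_i$.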

<br>

## 5. Basis of the Approach

The authors introduce four algorithms, each a modification of existing algorithms that adds a variance reduction mechanism:

1) __RR-SVRG__: based on [RR-SAGA](https://arxiv.org/pdf/1803.07964.pdf) and [AVRG](https://arxiv.org/abs/1708.01383v3)
2) __SO-SVRG__: based on [RR-SAGA](https://arxiv.org/pdf/1803.07964.pdf) and [AVRG](https://arxiv.org/abs/1708.01383v3)
3) __Cyclic SVRG__: based on [Cyclic SAGA](https://arxiv.org/pdf/1810.11167.pdf), [IAG](https://arxiv.org/abs/1506.02081), and [DIAG](https://arxiv.org/pdf/1611.00347.pdf)
4) __VR-RR__: a generalized version of RR-SVRG, where at the end of each epoch the control vector is updated with some probability

<br>

## 6. Algorithms

#### a) RR-SVRG
$$
\begin{aligned}
1: & \ \textbf{Input:} \text{ Stepsize } \gamma > 0,\ y_0 = x_0 = x_0^0 \in \mathbb{R}^d, \text{ number of epochs } T. \\
2: & \ \textbf{for } t = 0, 1, \ldots, T - 1 \textbf{ do} \\
3: & \quad \text{Sample a permutation } \{\pi_0, \ldots, \pi_{n-1}\} \text{ of } \{1, \ldots, n\} \\
4: & \quad x_t^0 = x_t \\
5: & \quad \textbf{for } i = 0, \ldots, n - 1 \textbf{ do} \\
6: & \quad \quad g_t^i(x_t^i, y_t) = \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(y_t) + \nabla f(y_t) \\
7: & \quad \quad x_t^{i+1} = x_t^i - \gamma g_t^i(x_t^i, y_t) \\
8: & \quad \textbf{end for} \\
9: & \quad x_{t+1} = x_t^n \\
10: & \quad y_{t+1} = x_t^n \\
11: & \ \textbf{end for}
\end{aligned}
$$
<br><br>
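
The RR-SVRG loop above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `rr_svrg`, the gradient oracle `grad_fi(j, x)`, the toy least-squares data `U`, `v`, and the stepsize and epoch count are all hypothetical choices, and indices are 0-based rather than 1-based.

```python
import numpy as np

def rr_svrg(grad_fi, n, x0, gamma, T, rng):
    """Sketch of RR-SVRG: reshuffle every epoch, refresh the control vector at epoch end."""
    x = np.asarray(x0, dtype=float).copy()
    y = x.copy()  # control vector, initialised at the starting point
    for _ in range(T):
        # Full gradient at the control vector (one extra pass over the data).
        full_grad_y = np.mean([grad_fi(j, y) for j in range(n)], axis=0)
        # Random Reshuffling: a fresh permutation of the indices each epoch.
        for j in rng.permutation(n):
            g = grad_fi(j, x) - grad_fi(j, y) + full_grad_y
            x = x - gamma * g
        # The control vector is set to the last inner iterate.
        y = x.copy()
    return x

# Toy problem (hypothetical data): f_i(x) = 0.5 * (<u_i, x> - v_i)^2.
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
v = np.array([1.0, 2.0, 2.0, 0.0])

def grad_fi(j, x):
    return (U[j] @ x - v[j]) * U[j]

x_hat = rr_svrg(grad_fi, n=4, x0=np.zeros(2), gamma=0.05, T=300,
                rng=np.random.default_rng(0))
# Reference solution of the same least-squares problem.
x_star = np.linalg.lstsq(U, v, rcond=None)[0]
```

On this toy problem the individual gradients do not vanish at the minimizer, so the control-variate correction is what allows a constant stepsize to reach the least-squares solution `x_star` rather than a neighborhood of it.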

#### b) SO-SVRG
$$
\begin{aligned}
1: & \ \textbf{Input:} \text{ Stepsize } \gamma > 0,\ y_0 = x_0 = x_0^0 \in \mathbb{R}^d, \text{ number of epochs } T. \\
2: & \ \text{Sample a permutation } \{\pi_0, \ldots, \pi_{n-1}\} \text{ of } \{1, \ldots, n\} \\
3: & \ \textbf{for } t = 0, 1, \ldots, T - 1 \textbf{ do} \\
4: & \quad x_t^0 = x_t \\
5: & \quad \textbf{for } i = 0, \ldots, n - 1 \textbf{ do} \\
6: & \quad \quad g_t^i(x_t^i, y_t) = \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(y_t) + \nabla f(y_t) \\
7: & \quad \quad x_t^{i+1} = x_t^i - \gamma g_t^i(x_t^i, y_t) \\
8: & \quad \textbf{end for} \\
9: & \quad x_{t+1} = x_t^n \\
10: & \quad y_{t+1} = x_t^n \\
11: & \ \textbf{end for}
\end{aligned}
$$
<br><br>

#### c) Cyclic SVRG
$$
\begin{aligned}
1: & \ \textbf{Input:} \text{ Stepsize } \gamma > 0,\ y_0 = x_0 = x_0^0 \in \mathbb{R}^d, \text{ number of epochs } T. \\
2: & \ \textbf{for } t = 0, 1, \ldots, T - 1 \textbf{ do} \\
3: & \quad x_t^0 = x_t \\
4: & \quad \textbf{for } i = 0, \ldots, n - 1 \textbf{ do} \\
5: & \quad \quad g_t^i(x_t^i, y_t) = \nabla f_{i+1}(x_t^i) - \nabla f_{i+1}(y_t) + \nabla f(y_t) \\
6: & \quad \quad x_t^{i+1} = x_t^i - \gamma g_t^i(x_t^i, y_t) \\
7: & \quad \textbf{end for} \\
8: & \quad x_{t+1} = x_t^n \\
9: & \quad y_{t+1} = x_t^n \\
10: & \ \textbf{end for}
\end{aligned}
$$
<br><br>
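
The three algorithms above share the same inner update and differ only in how the data order is chosen for each epoch. The sketch below isolates that difference; the helper name `epoch_orders` and the mode strings `"rr"`, `"so"` and `"cyclic"` are hypothetical, and indices are 0-based.

```python
import numpy as np

def epoch_orders(n, T, mode, rng):
    """Yield one index order per epoch for the three data-ordering schemes."""
    if mode == "rr":
        # Random Reshuffling: a new permutation is sampled in every epoch.
        for _ in range(T):
            yield rng.permutation(n)
    elif mode == "so":
        # Shuffle-Once: one permutation is sampled up front and reused.
        perm = rng.permutation(n)
        for _ in range(T):
            yield perm
    elif mode == "cyclic":
        # Cyclic: the fixed natural order, no randomness at all.
        for _ in range(T):
            yield np.arange(n)
    else:
        raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
rr_orders = list(epoch_orders(5, 3, "rr", rng))
so_orders = list(epoch_orders(5, 3, "so", rng))
cyclic_orders = list(epoch_orders(5, 3, "cyclic", rng))
```

Every order visits each index exactly once per epoch, which is the without-replacement property all three schemes have in common.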

#### d) VR-RR
$$
\begin{aligned}
1: & \ \textbf{Input:} \text{ Stepsize } \gamma > 0, \text{ probability } p,\ x_0 = x_0^0 \in \mathbb{R}^d,\ y_0 \in \mathbb{R}^d, \text{ number of epochs } T. \\
2: & \ \textbf{for } t = 0, 1, \ldots, T - 1 \textbf{ do} \\
3: & \quad \text{Sample a permutation } \{\pi_0, \ldots, \pi_{n-1}\} \text{ of } \{1, \ldots, n\} \\
4: & \quad x_t^0 = x_t \\
5: & \quad \textbf{for } i = 0, \ldots, n - 1 \textbf{ do} \\
6: & \quad \quad g_t^i(x_t^i, y_t) = \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(y_t) + \nabla f(y_t) \\
7: & \quad \quad x_t^{i+1} = x_t^i - \gamma g_t^i(x_t^i, y_t) \\
8: & \quad \textbf{end for} \\
9: & \quad x_{t+1} = x_t^n \\
10: & \quad y_{t+1} = \begin{cases}
y_t & \text{with probability } 1 - p, \\
x_t^n & \text{with probability } p
\end{cases} \\
11: & \ \textbf{end for}
\end{aligned}
$$
<br><br>
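
The probabilistic control-vector update is the only change relative to RR-SVRG, as the following sketch illustrates. It is a minimal example under stated assumptions rather than the authors' implementation: the function name `vr_rr`, the oracle `grad_fi(j, x)`, the toy data, and the values of `gamma`, `p` and `T` are all hypothetical.

```python
import numpy as np

def vr_rr(grad_fi, n, x0, y0, gamma, p, T, rng):
    """Sketch of VR-RR: the control vector is refreshed with probability p per epoch."""
    x = np.asarray(x0, dtype=float).copy()
    y = np.asarray(y0, dtype=float).copy()
    full_grad_y = np.mean([grad_fi(j, y) for j in range(n)], axis=0)
    for _ in range(T):
        for j in rng.permutation(n):  # reshuffle the data every epoch
            g = grad_fi(j, x) - grad_fi(j, y) + full_grad_y
            x = x - gamma * g
        if rng.random() < p:
            # With probability p the control vector moves to the last inner
            # iterate; only then is the full gradient recomputed.
            y = x.copy()
            full_grad_y = np.mean([grad_fi(j, y) for j in range(n)], axis=0)
    return x

# Toy problem (hypothetical data): f_i(x) = 0.5 * (<u_i, x> - v_i)^2.
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
v = np.array([1.0, 2.0, 2.0, 0.0])

def grad_fi(j, x):
    return (U[j] @ x - v[j]) * U[j]

x_hat = vr_rr(grad_fi, n=4, x0=np.zeros(2), y0=np.zeros(2), gamma=0.05,
              p=0.5, T=1000, rng=np.random.default_rng(0))
x_star = np.linalg.lstsq(U, v, rcond=None)[0]
```

Setting `p=1` makes the control vector update after every epoch, which recovers the RR-SVRG behaviour; smaller values of `p` recompute the full gradient less often.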

## 7. Intuition for Effectiveness

The key intuition behind the effectiveness of the described methods is that they combine two complementary ideas. Sampling without replacement (RR, SO, or a cyclic order) ensures that every data point is visited exactly once per epoch, while the control vector from SVRG corrects each stochastic gradient so that the error of the estimator shrinks as the iterates approach the control vector. By combining control vectors with reordered data points, these methods achieve faster convergence rates while maintaining low computational costs.
