# Analyzing the factor $t$

In the Jane Street Market Prediction Competition, the total proceeds $\sum p_i$
are weighted by a factor $\min(\max(t,0), 6)$, where

$$
p_i = \sum_j(weight_{ij} * resp_{ij} * action_{ij}),
$$

$$
t = \frac{\sum p_i }{\sqrt{\sum p_i^2}} * \sqrt{\frac{250}{N}},
$$
and $N$ is the number of unique dates in the test set.
(In order to avoid confusion with the summation indices, I use $N$ instead of the notation $|i|$ used on the competition web site.) 
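
As a minimal sketch of the two formulas above, the snippet below computes $t$ and the weighted score from a vector of daily proceeds. The function name `utility_score` and the sample numbers are made up for illustration; they are not part of the competition code.

```python
import numpy as np

def utility_score(p):
    """Return (t, u) for an array of daily proceeds p_i."""
    p = np.asarray(p, dtype=float)
    n_days = p.size                      # N: number of unique dates
    p_sum = p.sum()
    t = p_sum / np.sqrt((p ** 2).sum()) * np.sqrt(250 / n_days)
    u = min(max(t, 0.0), 6.0) * p_sum    # proceeds weighted by the clipped t
    return t, u

# Hypothetical daily proceeds for five dates (made-up numbers).
t, u = utility_score([1.0, -0.5, 2.0, 0.5, 1.0])
print(f"t = {t:.4f}, u = {u:.4f}")
```

With these numbers $t$ exceeds 6, so the weight saturates and the score is simply six times the total proceeds.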

The factor $t$ has been likened to an *annualized Sharpe ratio*. See [Wikipedia](https://en.wikipedia.org/wiki/Sharpe_ratio) for background
and [this discussion](https://www.kaggle.com/c/jane-street-market-prediction/discussion/199107) for more on the topic.

The factor $t$ is a measure of the riskiness of the investment actions. For a given mean return, a low $t$ means high variance in the daily proceeds, which implies high risk.
So it makes sense that for strategies with the same return $\sum p_i$, the one with the highest $t$ (i.e. lowest risk) gets the better score.

I have no idea why $t$ is capped at 6.
Maybe it represents such a low level of variance that the implied risk becomes negligible compared to other risk factors that are beyond the scope of this competition.

## A simple bound on $t$

We can interpret $p_i$ as the proceeds for day $i$.
Let us denote its arithmetic mean and its root mean square values as:
$$
\bar{p} = \frac{\sum_i p_i}{N},
\\
p_{rms} = \sqrt{\frac{\sum_i p^2_i}{N}}.
$$

Then we can write:
$$ 
t = \sqrt{250} \bar{p} / p_{rms}.
$$ 

The variance of the series is given by the difference of squares $\sigma^2 = p_{rms}^2 - \bar{p}^2$.
Because the variance is never negative, we can write the inequality:

$$
p_{rms}^2 = \bar{p}^2 + \sigma^2 \geq \bar{p}^2.
$$

From this it follows that 
$$
 -\sqrt{250} \leq t \leq \sqrt{250} = 15.811388\ldots, 
$$

with equality occurring only for $\sigma=0$, i.e. when all $p_i$ are equal.
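
The bound can be checked numerically. The sketch below draws random daily proceeds (a synthetic normal distribution, chosen only for illustration) and confirms that $|t|$ never exceeds $\sqrt{250}$, while a constant positive series reaches the upper bound. The helper `t_factor` is a hypothetical name.

```python
import numpy as np

def t_factor(p):
    """t = sqrt(250) * mean(p) / rms(p) for daily proceeds p."""
    p = np.asarray(p, dtype=float)
    p_bar = p.mean()
    p_rms = np.sqrt((p ** 2).mean())
    return np.sqrt(250) * p_bar / p_rms

rng = np.random.default_rng(0)
bound = np.sqrt(250)

# 1000 synthetic series of 50 days each (assumed distribution, not real data).
samples = [t_factor(rng.normal(loc=0.1, scale=1.0, size=50)) for _ in range(1000)]
max_abs_t = max(abs(t) for t in samples)

# A series with zero variance sits exactly on the upper bound.
t_constant = t_factor(np.full(50, 0.3))

print(f"bound = {bound:.6f}, largest |t| seen = {max_abs_t:.6f}")
print(f"t for a constant positive series = {t_constant:.6f}")
```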

## Gradient of $t$

$$
\frac{d t}{d p_i} = \sqrt{250} \left( \frac{1}{N p_{rms}} - \frac{\bar{p}}{N p_{rms}^3} p_i \right) 
                  = \frac{\sqrt{250}}{N p_{rms}} \left( 1 - \frac{\bar{p} p_i}{p_{rms}^2} \right).       
$$

Assuming that $\bar{p} >0 $ (we are hoping to extract a profit, aren't we?),
this implies that $\frac{d t}{d p_i} > 0$ as long as $p_i < \pi_t$, with the limiting value $\pi_t$ given by:
$$
 \pi_t = p_{rms}^2 / \bar{p}. 
$$

This means that we can increase $t$ by reducing the actions on days for which $ p_i > \pi_t$,
and by adding actions on dates with $p_i < \pi_t$
(if there are still positive returns available to be included).
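
A quick way to gain confidence in the gradient formula is to compare it with a central finite difference. The sketch below does this on a made-up five-day series and also checks that the sign of the gradient flips at $\pi_t$. The names `t_factor` and `grad_t` are illustrative only.

```python
import numpy as np

def t_factor(p):
    p = np.asarray(p, dtype=float)
    return np.sqrt(250) * p.mean() / np.sqrt((p ** 2).mean())

def grad_t(p):
    """Analytic dt/dp_i from the formula above."""
    p = np.asarray(p, dtype=float)
    n = p.size
    p_bar = p.mean()
    p_rms = np.sqrt((p ** 2).mean())
    return np.sqrt(250) / (n * p_rms) * (1 - p_bar * p / p_rms ** 2)

# Made-up daily proceeds with a positive mean.
p = np.array([0.5, 1.0, -0.2, 3.0, 0.8])

# Central finite differences, perturbing one day at a time.
eps = 1e-6
numeric = np.array([
    (t_factor(p + eps * e) - t_factor(p - eps * e)) / (2 * eps)
    for e in np.eye(p.size)
])
analytic = grad_t(p)

pi_t = (p ** 2).mean() / p.mean()
print("analytic:", analytic)
print("numeric: ", numeric)
print("pi_t:", pi_t, " days above pi_t:", p[p > pi_t])
```

In this example only the day with $p_i = 3$ lies above $\pi_t$, and it is the only day with a negative gradient.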

## Gradient of $u$

The goal of the competition is not to optimize $t$, but to optimize the product 
$$
 u = \min(\max(t,0), 6) \sum p_i
$$

For $t \gt 6$, the strategy to follow is simple: just maximize all $p_i$.

The case $0 \lt t \lt 6$ is more interesting. Then $u =  t N \bar{p} = N \sqrt{250} \bar{p}^2 / p_{rms}$.
The derivatives with respect to $p_i$ are given by:

$$
 \frac{d u}{d p_i} = \sqrt{250} \left( \frac{2\bar{p}}{p_{rms}} - \frac{\bar{p}^2}{p_{rms}^3} p_i \right) 
                  = t \left( 2 - \frac{\bar{p} p_i}{p_{rms}^2} \right).       
$$

To improve the score, we should compare the daily proceeds $p_i$ to the limiting value: 
$$
 \pi_u = 2 p_{rms}^2 / \bar{p} = 2 \pi_t. 
$$

This means that we can increase $u$ by reducing the actions on days for which $ p_i > \pi_u$,
and by adding actions on dates with $p_i < \pi_u$
(if there are still positive returns available to be included).
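
The same finite-difference check works for $u$. The sketch below uses a made-up eight-day series chosen so that $0 < t < 6$, and compares the analytic gradient $t\,(2 - \bar{p} p_i / p_{rms}^2)$ with a numerical one. The function names are hypothetical.

```python
import numpy as np

def t_factor(p):
    p = np.asarray(p, dtype=float)
    return np.sqrt(250) * p.mean() / np.sqrt((p ** 2).mean())

def u_score(p):
    """u = clip(t, 0, 6) * sum(p)."""
    p = np.asarray(p, dtype=float)
    return min(max(t_factor(p), 0.0), 6.0) * p.sum()

def grad_u(p):
    """Analytic du/dp_i, valid only while 0 < t < 6."""
    p = np.asarray(p, dtype=float)
    p_bar = p.mean()
    p_rms2 = (p ** 2).mean()
    return t_factor(p) * (2 - p_bar * p / p_rms2)

# Made-up daily proceeds: mean 0.5, rms 1.5, so t is about 5.27.
p = np.array([2.0, -1.5, 1.0, -0.5, 0.5, 3.0, -1.0, 0.5])
t = t_factor(p)

eps = 1e-6
numeric = np.array([
    (u_score(p + eps * e) - u_score(p - eps * e)) / (2 * eps)
    for e in np.eye(p.size)
])
analytic = grad_u(p)

pi_u = 2 * (p ** 2).mean() / p.mean()
print(f"t = {t:.4f}, pi_u = {pi_u:.4f}, p_max = {p.max():.4f}")
print("max |analytic - numeric| =", np.abs(analytic - numeric).max())
```

Here every day lies below $\pi_u$, so all gradients are positive and raising the proceeds of any single day raises $u$.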



## Training data

We do not know the threshold values $\pi_t$ or $\pi_u$ for the test data, but we can get some insights from the training data. I will discuss here some extreme cases.

In [None]:
import numpy as np
import pandas as pd
np.random.seed(20201201)
DATADIR = '../input/jane-street-market-prediction/'

In [None]:
# Load the training set. 
train = pd.read_csv(DATADIR+'train.csv')

### Analyzing the training data

We will use the `analyze` function to calculate the relevant variables $\sum p_i$, $\bar{p}$, $p_{rms}$, $u$, $t$ and $\pi_u$.

In [None]:
# Analyzing the proceeds

def analyze(column, limit=None):
    """
    Calculates the risk factor $t$ and the score $u$ for the proceeds in train[column]
    """
    p = train.groupby('date')[column].sum()
    p_max = p.max()
    if limit is not None:
        p = p.clip(upper=limit)
    p_sum = p.sum()
    p_bar = p_sum/len(p)
    p_rms = np.sqrt((p**2).mean())
    t = np.sqrt(250)*p_bar/p_rms
    u = min(max(t, 0), 6)*p_sum
    pi_u = 2*(p_rms**2)/p_bar
    print(f"p_sum: {p_sum:12.6f},  p_bar:  {p_bar:12.6f},  p_rms:  {p_rms:12.6f}")
    print(f"u: {u:12.6f},  t:  {t:12.6f},  pi_u:  {pi_u:12.6f},  p_max:  {p_max:12.6f}")
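
The `limit` argument of `analyze` caps the daily proceeds from above. To illustrate why such a cap can help, here is a self-contained sketch on purely synthetic data (not the training set): one day exceeds $\pi_u$, and capping it at a hypothetical limit of 3 raises the score even though the total proceeds drop. The helper `score` is an illustrative stand-in for `analyze`.

```python
import numpy as np

def score(p):
    """Return (t, u) for daily proceeds p."""
    p = np.asarray(p, dtype=float)
    p_sum = p.sum()
    t = p_sum / np.sqrt((p ** 2).sum()) * np.sqrt(250 / p.size)
    return t, min(max(t, 0.0), 6.0) * p_sum

# Synthetic series: 90 idle days, 9 days earning 1, one day earning 5.
p_raw = np.concatenate([np.zeros(90), np.ones(9), [5.0]])
pi_u = 2 * (p_raw ** 2).mean() / p_raw.mean()

# Cap the daily proceeds at an assumed limit of 3 (same idea as p.clip(upper=limit)).
p_capped = np.minimum(p_raw, 3.0)

t_raw, u_raw = score(p_raw)
t_cap, u_cap = score(p_capped)
print(f"pi_u = {pi_u:.4f}, p_max = {p_raw.max():.4f}")
print(f"raw:    t = {t_raw:.4f}, u = {u_raw:.4f}")
print(f"capped: t = {t_cap:.4f}, u = {u_cap:.4f}")
```

The gain is small, and it only appears because the synthetic series has $0 < t < 6$ together with a day above $\pi_u$.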

#### Case 1: accepting all bets

Take $action_{ij}=1$ for all $i$, $j$. This leads to negative total proceeds, so $u=0$.

In [None]:
train['proceeds'] = train['weight']*train['resp']
analyze('proceeds')

#### Case 2: maximum profit

If I were clairvoyant, I would bet only on the profitable ticks: $action_{ij}=1$ only if $resp_{ij}>0$.
On the training set, this leads to a very high $t$ value, $t=13.7$.

In [None]:
train['max_proceeds'] = train['proceeds'].clip(lower=0)
analyze('max_proceeds')

#### Case 3: completely random

Placing bets randomly: $action_{ij}=0$ or $1$ randomly with 50% chance each.

In [None]:
train['random_proceeds'] = train['proceeds']*np.random.randint(2, size=len(train['resp']))
analyze('random_proceeds')

#### Case 4: small risks

Imagine that we place bets only for small weights ($\lt 1$), and that we are so lucky that we have selected all such winning bets and none of the losing ones. So $action_{ij}=1$ if $weight_{ij}<1$ and $resp_{ij}>0$.

In [None]:
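# Keep only rows with weight < 1. Note that this reuses 'random_proceeds' (Case 3),
# not the all-winning 'max_proceeds' column from Case 2.
train['small_proceeds'] = np.where(train['weight'] < 1, train['random_proceeds'], 0)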
analyze('small_proceeds')

Note that in this case $0 < t < 6$, and yet $p_{max} < \pi_u$.
This means that the best way to improve the score is to include all bets where we expect a positive outcome.

# Conclusion

My first impression is that there is no use in trying to optimize the $t$ factor in order to increase the score.
For starters, simply taking all bets for which one expects a positive $resp$ seems like the best strategy.

I might revisit this analysis later on when I have real results.