<a href="https://colab.research.google.com/github/Moojin-Bin/Study-Adv_in_Financial_ML/blob/master/SNIPPET_4_3_4_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Chapter 4 ##
### 4.5.1 - Sequential Bootstrap ###


Let us assign a label $y_{i}$ to an observed feature $X_{i}$, where $y_{i}$ is a function of price bars that occurred over an interval $[t_{i, 0}, t_{i, 1}]$.
Two labels $y_i$ and $y_j$ are concurrent at $t$ when both are a function of at least one common return, $r_{t-1, t} = {{p_t} \over {p_{t-1}}} -1$.
<br>
<br>
For each time point $t=1, ..., T$, we form a binary array, $\{1_{t, i}\}_{i=1, ..., I}$, where $1_{t, i} \in \{0, 1\}$. Variable $1_{t, i} = 1$ if and only if $[t_{i,0}, t_{i, 1}]$ overlaps with $[t-1, t]$, and $1_{t, i}=0$ otherwise. We then compute the number of labels concurrent at $t$, $c_t = \sum_{i=1}^{I} 1_{t, i}$.
<br>
<br>
The uniqueness of a label $i$ at time $t$ is $u_{t, i} = {{1_{t, i}} \over {c_t}}$.


**Example**
<br>
Consider a set of labels ${\{y_i\}}_{i=1,2,3}$, where label $y_1$ is a function of return $r_{0, 3}$, label $y_2$ is a function of return $r_{2,4}$ and label $y_3$ is a function of return $r_{4,6}$. Note that $y_{1}$ and $y_{2}$ overlap on the return $r_{2, 3}$, while $y_3$ overlaps with neither.

In [16]:
import pandas as pd
t1 = pd.Series([2,3,5], index=[0,2,4])    # index = first bar (t0), value = last bar (t1) of each label

In [8]:
t1

0    2
2    3
4    5
dtype: int64

Build an Indicator matrix.

In [40]:
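def getIdxMatrix(barIx, t1):
    # One row per bar, one column per label; an entry is 1 while the label is live
    indM = pd.DataFrame(0, index=barIx, columns=range(t1.shape[0]))
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, (start, end) in enumerate(t1.items()):
        indM.loc[start:end, i] = 1    # .loc slicing includes both endpoints
    return indM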

In [41]:
barIx = range(t1.max()+1)
indM = getIdxMatrix(barIx, t1)
indM

   0  1  2
0  1  0  0
1  1  0  0
2  1  1  0
3  0  1  0
4  0  0  1
5  0  0  1


Compute the number of labels concurrent at time $t$, $c_{t}$, and the uniqueness of each label, $u_{t, i}$.

In [46]:
c = indM.sum(axis=1)
print('the number of labels concurrent at time t, c')
print(c)
print()

u = indM.div(c, axis=0)
print('Indicator matrix, indM')
print(indM)
print()
print('the uniqueness of labels(= IndM / c)')
print(u)

the number of labels concurrent at time t, c
0    1
1    1
2    2
3    1
4    1
5    1
dtype: int64

Indicator matrix, indM
   0  1  2
0  1  0  0
1  1  0  0
2  1  1  0
3  0  1  0
4  0  0  1
5  0  0  1

the uniqueness of labels(= IndM / c)
     0    1    2
0  1.0  0.0  0.0
1  1.0  0.0  0.0
2  0.5  0.5  0.0
3  0.0  1.0  0.0
4  0.0  0.0  1.0
5  0.0  0.0  1.0


The average uniqueness of label $i$ is the average of $u_{t, i}$ over the label's lifespan,
$$
\bar{u}_i = {{\sum_{t=1}^T u_{t, i}} \over {\sum_{t=1}^T 1_{t, i}}}.
$$

In [50]:
def getAvgUniqueness(indM):
    c = indM.sum(axis=1)
    u = indM.div(c, axis=0)
    avgU = u[u>0].mean()
    return avgU

In [51]:
getAvgUniqueness(indM)

0    0.833333
1    0.750000
2    1.000000
dtype: float64
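
As a check on the output above, the same averages can be worked by hand from the columns of the uniqueness matrix, using only the bars where each label is live. The subscripts here follow the 0-based column labels of `indM`, so column 0 corresponds to $y_1$ in the text:
$$
\bar{u}_0 = {{1 + 1 + 0.5} \over 3} \approx 0.833, \quad
\bar{u}_1 = {{0.5 + 1} \over 2} = 0.75, \quad
\bar{u}_2 = {{1 + 1} \over 2} = 1.
$$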

Let us denote as $\varphi$ the sequence of draws so far, which may include repetitions. The procedure starts with $\varphi^{(0)} = \emptyset$.
<br>
<br>
First, an observation $X_i$ is drawn from a uniform distribution, $i \sim U[1, I]$, that is, the probability of drawing any particular value $i$ is originally $\delta_i^{(1)} = I^{-1}$.
<br><br>
Suppose that we randomly draw a number from ${\{1, 2, 3\}}$ and $2$ is selected, $\varphi^{(1)} = \{2\}$, i.e., $X_{2}$ is drawn.
<br>
<br>
For the second draw, we **wish to reduce the probability of drawing an observation $X_{j}$ with a highly overlapping outcome**. Remember, a bootstrap allows sampling with repetition, so it is still possible to draw $X_{2}$ again, but we wish to reduce its likelihood, since $X_2$ overlaps completely with itself.
<br>
<br>
The uniqueness of $j$ at time $t$ is $u_{t, j}^{(2)} = {{1_{t, j}} \over {1+ \sum_{k \in {\varphi^{(1)}}} 1_{t, k}}}$, as that is the uniqueness that results from adding alternative $j$ to the existing sequence of draws $\varphi^{(1)}$. In this case, with $\varphi^{(1)} = \{2\}$, the **'new'** uniqueness of $j$ at time point $t$ is $u_{t, j}^{(2)} = {{1_{t, j}} \over {1+1_{t, 2}}}$.

The average uniqueness of $j$ is the average $u_{t, j}^{(2)}$ over $j$'s lifespan, 
$$
\bar{u}_{j}^{(2)}= {{\sum_{t=1}^{T}u_{t, j}^{(2)}} \over {\sum_{t=1}^{T}1_{t, j}}}.
$$

We can now make a second draw based on the updated probabilities 
$
\left\{
    \delta_j^{(2)}
\right\}_{j=1,...,I}
$,<br>
$$
\delta_j^{(2)} = { {\bar{u}_{j}^{(2)}} \over {\sum_{k=1}^{I}\bar{u}_{k}^{(2)}}},
$$<br>
where
$
\left\{
    \delta_j^{(2)}
\right\}_{j=1,...,I}
$ are scaled to add up to 1, $\sum_{j=1}^{I} \delta_{j}^{(2)} = 1$.
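
The second-draw probabilities for the three-label example can be computed with a short sketch like the one below. It is an illustration under stated assumptions rather than the book's own snippet: `next_draw_probs` is a hypothetical helper name, the indicator matrix is typed in by hand so the block runs on its own, and labels are 0-based, so the draw $\varphi^{(1)} = \{2\}$ from the text becomes `phi=[1]`.

```python
import numpy as np

# Indicator matrix from the example: rows are bars 0..5, columns are labels 0..2
ind = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])

def next_draw_probs(ind, phi):
    """Hypothetical helper: draw probabilities given the labels drawn so far."""
    n_bars, n_labels = ind.shape
    # Number of already-drawn labels live at each bar (repetitions count twice)
    if phi:
        drawn = ind[:, phi].sum(axis=1)
    else:
        drawn = np.zeros(n_bars)
    avg_u = np.empty(n_labels)
    for j in range(n_labels):
        alive = ind[:, j] == 1              # bars in label j's lifespan
        u = 1.0 / (1.0 + drawn[alive])      # uniqueness if j were added to phi
        avg_u[j] = u.mean()                 # average over j's lifespan
    return avg_u / avg_u.sum()              # scale so the probabilities sum to 1

# Label 1 (X_2 in the text) was drawn first
probs = next_draw_probs(ind, phi=[1])
print(probs)
```

With `phi=[1]` the average uniquenesses are $5/6$, $1/2$ and $1$, so the probabilities come out as $5/14$, $3/14$ and $6/14$: the label already drawn is the least likely to be drawn again, and the label with no overlap is the most likely. Calling the helper with an empty `phi` gives the uniform first-draw probabilities $I^{-1}$.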